OpenAI released LifeSciBench, a benchmark of 750 expert-authored tasks spanning seven biological domains and seven research workflows, developed with input from 173 scientists. The headline result is humbling: even the most capable frontier models pass only about one in three tasks. Detailed rubrics grade the quality of scientific reasoning rather than just final answers.
The benchmark matters because it pushes back against the loudest version of the AI-for-science narrative. While models can accelerate literature review, hypothesis generation, and some analysis, LifeSciBench shows they remain far from autonomously executing real research workflows. The expert-written rubrics make the result hard to dismiss as a measurement artifact.
For builders, the takeaway is to calibrate. AI is a powerful copilot for scientists, not a replacement. That framing is both more defensible commercially and more credible with the domain buyers who will increasingly check these claims against benchmarks like LifeSciBench.