OpenAI Releases LifeSciBench -- Frontier Models Pass Just 1 in 3 Real Science Tasks

OpenAI introduced LifeSciBench, an expert-authored benchmark of 750 tasks across seven biological domains, built with 173 scientists. Even the strongest models pass only about one task in three -- a sober counterweight to claims that AI is close to autonomous scientific research.

750 Tasks
173 Contributing Scientists
~33% Top Model Pass Rate
7 Domains
Trace Cohen
Early-stage VC & angel · Founder, New York Venture Partners
June 17, 2026
KEY TAKEAWAYS FOR VCs & FOUNDERS
1. A credible ceiling on AI-for-science hype: today's frontier still fails two of three real research tasks

2. Expert-written rubrics from 173 scientists make this a hard benchmark to game -- the gap is real, not artifactual

3. For founders selling 'AI scientist' products, this is the honest baseline buyers will now measure against

The VC Read · Trace's Take
Trace Cohen

OpenAI publishing a benchmark its own models fail two-thirds of is the most useful thing it shipped this week. The 'AI scientist' pitch has gotten ahead of reality, and now there's a hard, expert-graded number to anchor the conversation. For founders selling autonomous research, this is the baseline your buyers will quote back at you -- so position as augmentation, not replacement, or get caught overclaiming. The honest framing is also the more defensible business.

OpenAI released LifeSciBench, a benchmark of 750 expert-authored tasks spanning seven biological domains and seven research workflows, developed with input from 173 scientists. The headline result is humbling: even the most capable frontier models pass only roughly one in three tasks, with detailed rubrics grading the quality of scientific reasoning rather than just final answers.

The benchmark matters because it pushes back against the loudest version of the AI-for-science narrative. While models can accelerate literature review, hypothesis generation, and some analysis, LifeSciBench shows they remain far from autonomously executing real research workflows. The expert-written rubrics make the result hard to dismiss as a measurement artifact.


For builders, the takeaway is to calibrate. AI is a powerful copilot for scientists, not a replacement, and the honest framing -- augment the researcher, don't replace them -- is both more defensible commercially and more credible with the domain buyers who will increasingly benchmark these claims against tools like LifeSciBench.


Originally reported by OpenAI. Analysis and editorial commentary by Value Add Pulse.

