OpenAI Releases LifeSciBench -- Frontier Models Pass Just 1 in 3 Real Science Tasks

OpenAI introduced LifeSciBench, an expert-authored benchmark of 750 tasks across seven biological domains, built with 173 scientists. Even the strongest models pass only about one task in three -- a sober counterweight to claims that AI is close to autonomous scientific research.

750 Tasks

173 Contributing Scientists

~33% Top Model Pass Rate

7 Domains

Trace Cohen

Early-stage VC & angel · Founder, New York Venture Partners

June 17, 2026

1 min read

THE RUNDOWN

A credible ceiling on AI-for-science hype: today's frontier models still fail about two of three real research tasks

Expert-written rubrics from 173 scientists make this a hard benchmark to game -- the gap is real, not artifactual

For founders selling 'AI scientist' products, this is the honest baseline buyers will now measure against

OpenAI released LifeSciBench, a benchmark of 750 expert-authored tasks spanning seven biological domains and seven research workflows, developed with input from 173 scientists. The headline result is humbling: even the most capable frontier models pass only roughly one in three tasks, with detailed rubrics grading the quality of scientific reasoning rather than just final answers.

The benchmark matters because it pushes back against the loudest version of the AI-for-science narrative. While models can accelerate literature review, hypothesis generation, and some analysis, LifeSciBench shows they remain far from autonomously executing real research workflows. The expert-written rubrics make the result hard to dismiss as a measurement artifact.


For builders, the takeaway is to calibrate. AI is a powerful copilot for scientists, not a replacement. Framing a product as augmenting the researcher is both more defensible commercially and more credible with domain buyers, who will increasingly check such claims against benchmarks like LifeSciBench.
