
OpenAI introduced LifeSciBench, a benchmark for life-science research work, on June 17, 2026. The emphasis is not on whether a model can recall biology facts but on whether AI systems can support realistic research tasks such as evidence handling, experiment design, risk judgment, and scientific communication.
LifeSciBench includes 750 expert-authored tasks spanning seven workflow categories and seven biological domains. OpenAI says the tasks were created by scientists with Ph.D.-level training and biotech or pharmaceutical experience, then reviewed through automated and expert review cycles. The benchmark also includes 1,062 task artifacts such as figures, PDFs, tables, sequence files, structure or chemical files, and web references.
That design matters because many existing life-science evaluations are too clean: they use structured questions, narrow domains, and reference answers that do not capture the ambiguity of live research. Real R&D work often requires judgment across incomplete evidence, conflicting results, experimental constraints, translational risk, and uncertainty.
The grading design is also important. Each task uses a detailed expert-developed rubric, and the benchmark contains 19,020 criteria in total, or about 25 per task on average. Models are therefore not judged on the final answer alone. They are evaluated on specific scientific claims, calculations, decisions, justifications, caveats, and formatting. For research work, that is closer to how expert usefulness is assessed in practice.
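To make the idea of rubric-based grading concrete, here is a minimal Python sketch. It is not OpenAI's grader: the `Criterion` type, the example criteria, and the pass rule (a response passes when its share of met criteria reaches a threshold, set here to require all of them) are all assumptions for illustration. The source does not describe how individual criteria are aggregated into a pass rate.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what the expert rubric asks for (invented examples below)
    met: bool         # a grader's yes/no judgment for one response


def fraction_met(criteria):
    """Share of rubric criteria that the response satisfied."""
    if not criteria:
        raise ValueError("a rubric needs at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def passes(criteria, threshold=1.0):
    """Hypothetical pass rule: share of met criteria must reach the threshold."""
    return fraction_met(criteria) >= threshold


# Hypothetical rubric for a single task; not taken from LifeSciBench.
rubric = [
    Criterion("States the main claim supported by the evidence", True),
    Criterion("Reports the dose calculation correctly", True),
    Criterion("Notes the key caveat about sample size", False),
    Criterion("Uses the requested output format", True),
]

print(fraction_met(rubric))            # 0.75
print(passes(rubric))                  # False: one criterion was missed
print(passes(rubric, threshold=0.75))  # True under a looser, assumed threshold
```

Under a strict rule like this, a response that gets the headline answer right can still fail on a missed caveat or a formatting requirement, which is the behavior the rubric design is meant to expose.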
OpenAI reports that GPT-Rosalind improves over GPT-5.5 on LifeSciBench, with exact pass rate rising from 25.7% to 36.1%. The strongest improvements appear in scientific communication and translation, meaning the ability to organize evidence into expert-facing explanations and connect preclinical evidence to clinical implications.
The limits are just as important. Artifact-heavy, design-heavy, and operationally constrained tasks remain difficult. OpenAI reports that GPT-Rosalind's pass rate falls from 45.1% on text-only tasks to 28.1% on tasks with artifacts or URLs. That suggests AI systems can already support parts of research reasoning, but still struggle with complex data handling and exact outputs.
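The reported pass rates can be read both as percentage-point differences and as relative changes. The short Python sketch below works through that arithmetic using only the figures quoted above. The helper name `compare` is illustrative.

```python
def compare(before, after):
    """Return (percentage-point change, relative change in percent)."""
    diff = after - before
    return round(diff, 1), round(100 * diff / before, 1)


# GPT-5.5 -> GPT-Rosalind on LifeSciBench overall
model_gain = compare(25.7, 36.1)
# GPT-Rosalind: text-only tasks -> tasks with artifacts or URLs
artifact_gap = compare(45.1, 28.1)

print(model_gain)    # (10.4, 40.5)
print(artifact_gap)  # (-17.0, -37.7)
```

On these numbers, the drop from text-only tasks to artifact-bearing tasks (17.0 points) is larger than the overall gain from one model to the next (10.4 points).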
The broader value of LifeSciBench is that it breaks the large question of AI for science into measurable work capabilities. The next generation of useful life-science AI will not simply answer expert questions. It will need to produce auditable research judgment across experimental data, papers, structures, sequences, risk, and decision constraints.



