The Second Half of AI for Science

Beyond isolated intelligence — rebuilding the ecosystem where science compounds

Posted on June 10, 2026 by Amber Liu

The true speed limit of science isn't the brilliance of the individual scientist — it is the ecosystem they work within.

TL;DR: The first half of the "AI for Science" era focused entirely on making the scientist smarter. The second half must be about rebuilding the ecosystem itself, creating a network where human and artificial intelligence can seamlessly compound.

The First Half: Racing on Individual Intelligence

For the past few years, the AI for Science playbook has been a repeating loop: make the individual AI scientist smarter at literature review, ideation, experimentation, or writing. We added more scaffolding, deeper memory, more domain-specific agent skills, and self-evolving loops. We'd show a gain on a benchmark, celebrate a flashy demo, and then start over.

And for a while, it worked brilliantly. AI Scientist v2 [1] pushed a fully AI-generated paper through a peer-reviewed workshop. Biomni [2] began autonomously executing complex biomedical workflows. The Virtual Lab [3] spun up a team of AI agents that successfully designed nanobodies (later validated in actual wet labs). Meanwhile, yardsticks like RE-Bench [4], PaperBench [5], and MLE-Bench [6] gave us ways to measure research engineering, paper reproduction, and ML experimentation — and we watched those numbers steadily climb.

[Figure: The first-half system: one node, made smarter. Bolt-on augmentations (scaffolding, memory, multi-agent, self-evolving) surround a single AI scientist, with other minds unreachable across the wall. The loop: (1) add scaffolding, (2) +5% on a benchmark, (3) ship a demo, (4) start over. Benchmarks climb, then flatten (RE-Bench, PaperBench, MLE-Bench). Pick one stage of the research pipeline: literature review, ideation, experimentation, writing.]
The first-half playbook in one picture: bolt more scaffolding, memory, and multi-agent orchestration onto a single AI scientist, point it at one stage of the pipeline, show a benchmark gain, ship a demo, repeat. The node gets smarter — but it stays a single node, unable to reach the other minds working alongside it.

But the single-agent game has hit diminishing returns.

First, the marginal returns on clever scaffolding are collapsing. A friend of mine works on hypothesis generation for protein design. Just a few months ago, their team tried everything they could — elaborate pipelines, handcrafted heuristics, careful prompting tricks — to coax better candidates out of the models. Then the latest generation of GPT and Claude arrived, and the quality of their hypotheses jumped significantly, without much scaffolding at all. Much of what we painstakingly hand-engineer today functions as a temporary prosthetic. As foundation models develop better intrinsic reasoning and broader knowledge, they will simply absorb these functions. We've seen this movie before; it's The Bitter Lesson, playing out all over again at the agent level.

Second, and arguably worse, much of this first-half work optimizes for artificial constraints rather than fundamental needs. We've seen demos that proudly generate 100 papers overnight — but who was asking for 100 mediocre papers? The ultimate bottleneck in science has never been our paper count. We've built AI scientists to win rebuttal battles with reviewers, optimizing the art of getting past the gate instead of getting the science right. We've built academic-prose polishers to perfect the rigid, formulaic essay of the modern PDF. These tools are locally clever, but globally misguided. They treat the deep dysfunctions of the human research system as unchangeable laws of physics, training AI to excel at our own inefficiencies.

The first half asked: Can we make one scientist smarter? The answer is a resounding yes, and frontier labs will increasingly deliver that intelligence as a baseline. The harder problem was never the scientist.

The Real Bottleneck: A Formula 1 Car on a Dirt Road

Science is, and always has been, a collective, generational endeavor. It moves forward because thousands of people push in different directions simultaneously. Results collide. Discoveries circulate. Each generation absorbs the hard-won assumptions of the previous one, relies on them, and eventually tears a few of them down. When Newton credited the "shoulders of giants," he wasn't just being modest; he was giving a hyper-accurate description of how knowledge is produced. The unit of scientific progress is the network, not the individual scientist.

Because of this, the speed of science is dictated by network properties: how fast knowledge moves, how losslessly it transfers, and how cheaply it can be verified and built upon. If you make an individual node 10x smarter but leave the network untouched, you don't get a 10x acceleration in science. You get a Formula 1 car stuck on a road built for horses.
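
One way to make this intuition concrete is to borrow Amdahl's law from parallel computing. This is an analogy, not a model proposed in this post: assume some fraction of each research cycle is spent inside the node (thinking, running experiments) and the rest on the network (publishing, reviewing, transferring knowledge). The function name and the 50/50 split below are purely hypothetical.

```python
def overall_speedup(node_fraction: float, node_speedup: float) -> float:
    """Amdahl's-law estimate of end-to-end speedup.

    node_fraction: share of each research cycle spent inside the node (hypothetical).
    node_speedup:  how much faster the node itself becomes.
    The remaining (1 - node_fraction) is network time and is left unchanged.
    """
    if not 0.0 <= node_fraction <= 1.0:
        raise ValueError("node_fraction must be in [0, 1]")
    if node_speedup <= 0:
        raise ValueError("node_speedup must be positive")
    network_fraction = 1.0 - node_fraction
    return 1.0 / (network_fraction + node_fraction / node_speedup)


# Hypothetical split: half of each cycle is node work, half is network overhead.
for s in (10, 100, 1000):
    print(f"node {s}x faster -> science {overall_speedup(0.5, s):.2f}x faster")
```

Under that assumed split, the overall speedup stays below 2x no matter how smart the node becomes. Only shrinking the network share raises the ceiling.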

We built AI scientists with superhuman bandwidth, then dropped them into an ecosystem built for human limits:

  • The PDF. An AI scientist can run ten thousand experiments and hold reasoning traces no human mind could, yet to "publish" it must crush all of that into eight pages of linear prose — which another AI scientist then burns effort decompressing back into executable logic, guessing at the details the narrative stripped out. It's two superhuman intelligences talking through a format built for human readers three centuries ago, and the compression deletes exactly what AI scientists need most: the dead ends, the precise specifications, the real failures.
    [Figure: What the AI scientist knows (ten thousand experiments and the full reasoning trace: branches, specs, failures, one success) is compressed into a PDF of 8 pages of linear narrative. The next AI decompresses it and can rebuild only one storyline, the winning path. Dead ends, precise specs, and real failures are lost, so the map of where not to go is gone.]
    The PDF is a lossy codec in both directions: one superhuman intelligence compresses a vast exploration tree into eight pages, and another spends enormous effort decompressing it — guessing at what was stripped away. What's lost is exactly what AI scientists need most: the dead ends, the precise specifications, the real failures.
  • Peer Review. We currently rely on three exhausted humans, dedicating a few hours each over several months, to judge work that is increasingly produced — and entirely consumable — by machines. These machines could verify claims instantly by simply re-executing the code.
  • Incentives. Our scientific reward system — citations, prestige, grant committees — operates as a pure attention economy. Attention is the scarcest resource in human cognition, so we built our entire metric of success around capturing it. But AI scientists do not have an attention bottleneck. If you point them at an attention economy, you get exactly what you'd expect: paper mills running at machine speed, aggressively salami-sliced results, and endless benchmark gaming. The most embarrassing AI demos of the first half aren't bugs in the technology; they are the inevitable result of something with infinite stamina perfectly optimizing our deeply flawed reward system.
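
The Peer Review point above, checking a claim by re-running the code rather than trusting the prose, can be sketched as follows. This is a minimal illustration, not an existing system: `Claim`, `verify`, and the toy `mean_accuracy` experiment are hypothetical names, and the tolerance is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable
import math


@dataclass(frozen=True)
class Claim:
    """A reported number plus the code meant to produce it (hypothetical schema)."""
    statement: str
    reported_value: float
    experiment: Callable[[], float]  # assumed deterministic and re-executable


def verify(claim: Claim, rel_tol: float = 1e-6) -> bool:
    """Re-execute the experiment and compare against the reported value."""
    observed = claim.experiment()
    return math.isclose(observed, claim.reported_value, rel_tol=rel_tol)


def mean_accuracy() -> float:
    # Toy stand-in for a real experiment: average of fixed per-fold scores.
    fold_scores = [0.90, 0.92, 0.94]
    return sum(fold_scores) / len(fold_scores)


honest = Claim("mean accuracy is 0.92", 0.92, mean_accuracy)
inflated = Claim("mean accuracy is 0.95", 0.95, mean_accuracy)

print(verify(honest))    # the re-run matches the reported number
print(verify(inflated))  # the re-run does not match
```

Real experiments are rarely this deterministic, so a practical checker would likely need seeds, pinned environments, and statistical tolerances rather than a single relative tolerance.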

When the nature of the bottleneck shifts, the rules of the game have to change. The second half of AI for Science isn't about making the car any faster. It's about paving the roads.

The Second Half: Paving the Roads

How do we rebuild a scientific ecosystem that feels native to AI, rather than actively hostile to it? The place to begin is the most basic primitive of all: how we document research knowledge.

[Figure: Two panels. "Making a Smarter Node": the node gets 10× brighter and stays alone, with other minds unreachable and zero edges. "It Is the Network": the edges carry the compounding, not the nodes.]

Make the node 10×, leave the network untouched, and you don't get 10× science. The unit of scientific progress was never the scientist — it is the network.

The Research Artifact as a Protocol

Human civilization didn't start compounding knowledge just because our brains got larger — anatomically modern brains predate civilization by hundreds of thousands of years. Knowledge truly started compounding when we invented language, and it exploded when we invented writing. The foundational primitive we use to encode knowledge determines whether that knowledge can accumulate at all.

For three centuries, science's core primitive has been the paper. But the paper is not a neutral container. It is a highly specific protocol optimized for human readers: it is linear, narrative, and persuasive. And it quietly levies two massive structural taxes that we've simply accepted as normal:

  • The Storytelling Tax: The messy, branching, failure-riddled reality of actual research gets sanitized into a clean, linear story. Everything that didn't fit that neat narrative — the rejected hypotheses, the failed runs, the entire exploration tree — is thrown in the trash.
  • The Engineering Tax: Prose that perfectly satisfies a human reviewer is wildly insufficient as a technical specification. The granular details an AI scientist would actually need to reproduce and build upon the work were simply never written down.

Humans tolerate these taxes. AI scientists are crushed by them.

What AI scientists need is an Agent-Native Research Artifact (ARA), which we propose in our recent paper (provocatively titled The Last Human-Written Paper [7]). Rather than a narrative written for casual reading, an ARA is the basic unit for systematically documenting research knowledge so that AI scientists can evolve collectively on top of it: a complete computational entity that carries the scientific logic, executable code with full specifications, evidence linking every claim back to its raw outputs, and the entire exploration graph — each artifact a building block in a scientific world model that AI scientists construct collaboratively.
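
To make the four components concrete, here is a minimal sketch of what such an artifact might look like as a data structure. It is not the schema from the paper: every class and field name below (`ResearchArtifact`, `ExperimentNode`, `Claim`, `evidence_nodes`, and so on) is a hypothetical choice made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExperimentNode:
    """One node of the exploration graph, including failed branches."""
    node_id: int
    parent_id: Optional[int]      # None for the root of the exploration
    code: str                     # executable code or an entry point
    spec: Dict[str, str]          # full specification: env, seeds, versions
    raw_output: Dict[str, float]  # untouched results from running the code
    succeeded: bool               # dead ends are kept, not deleted


@dataclass
class Claim:
    """A scientific claim linked to the nodes whose raw outputs support it."""
    text: str
    evidence_nodes: List[int]


@dataclass
class ResearchArtifact:
    logic: str                                        # the scientific argument
    nodes: Dict[int, ExperimentNode] = field(default_factory=dict)
    claims: List[Claim] = field(default_factory=list)

    def add_node(self, node: ExperimentNode) -> None:
        self.nodes[node.node_id] = node

    def unsupported_claims(self) -> List[Claim]:
        """Claims that cite no evidence or cite nodes missing from the graph."""
        return [c for c in self.claims
                if not c.evidence_nodes
                or any(n not in self.nodes for n in c.evidence_nodes)]

    def dead_ends(self) -> List[ExperimentNode]:
        """Failed branches, preserved as first-class knowledge."""
        return [n for n in self.nodes.values() if not n.succeeded]


ara = ResearchArtifact(logic="Method X improves metric Y under assumption Z.")
ara.add_node(ExperimentNode(1, None, "run.py --variant a", {"seed": "0"}, {"y": 0.41}, False))
ara.add_node(ExperimentNode(2, 1, "run.py --variant b", {"seed": "0"}, {"y": 0.63}, True))
ara.claims.append(Claim("Variant b beats variant a on Y.", evidence_nodes=[1, 2]))
ara.claims.append(Claim("Variant b generalizes.", evidence_nodes=[]))

print([c.text for c in ara.unsupported_claims()])
print([n.node_id for n in ara.dead_ends()])
```

The point of the sketch is that "evidence linking every claim back to its raw outputs" becomes a property a machine can check, and the failed node stays in the graph instead of being dropped from the story.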

We measured this against the three things an AI scientist actually does with a piece of research: understand it, reproduce it, and build on it. On understanding, when the same work is handed over as an ARA instead of a PDF, an AI scientist's question-answering accuracy across 450 questions jumps from 72.4% to 93.7%. On reproduction, end-to-end success rises from 57.4% to 64.4% — a smaller gain, because reproduction is bounded as much by the model's own capability as by the artifact it reads. And on extension, preserving the failure traces that a PDF deletes measurably speeds up the next AI scientist, since knowing what doesn't work is half the battle in research and precisely the half the paper format throws away.
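
A quick back-of-the-envelope calculation on the figures quoted above shows why the two gains feel so different. Only the four percentages come from the text; the helper name and the choice to also express each gain as a relative cut in the failure rate are illustrative.

```python
from typing import Tuple


def gain(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative reduction in failure rate in %)."""
    absolute = after_pct - before_pct
    relative_error_cut = 100.0 * absolute / (100.0 - before_pct)
    return round(absolute, 1), round(relative_error_cut, 1)


# Percentages quoted in the text, PDF -> ARA.
understanding = gain(72.4, 93.7)
reproduction = gain(57.4, 64.4)
print("understanding:", understanding)
print("reproduction: ", reproduction)
```

Understanding improves by 21.3 points, which removes roughly three quarters of the wrong answers. Reproduction improves by 7.0 points, which removes about a sixth of the failures.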

But changing the format is just the beginning. The real paradigm shift is what the format unlocks: AI scientists actively forking, exchanging, and composing executable research.

Instead of an AI scientist saying, "I read your paper and was inspired," the interaction becomes: "I forked your artifact at experiment node 47, swapped out your environment assumption, and my new results are immediately diff-able against yours." Knowledge ceases to be something you summarize; it becomes something you inherit, exactly like open-source code. The moment research becomes natively forkable, science finally gets its version control, its dependency graph, and its git blame. Intelligence can finally compound across the entire network, rather than dying inside a single context window.
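
The fork-and-diff interaction described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ARA implementation: `Node`, `fork`, `diff`, and the `run_experiment` stand-in are hypothetical, and the rule that the claim fails in a drifting environment is hard-coded purely for the example.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Node:
    node_id: str
    spec: Dict[str, str]      # assumptions the experiment was run under
    results: Dict[str, str]   # outcomes produced by re-execution


def run_experiment(spec: Dict[str, str]) -> Dict[str, str]:
    # Toy stand-in for real re-execution: the claim holds only when stationary.
    holds = spec["env"] == "stationary"
    return {"claim_C3": "holds" if holds else "fails"}


def fork(node: Node, **overrides: str) -> Node:
    """Copy a node, change some assumptions, and re-execute to get new results."""
    new_spec = {**node.spec, **overrides}
    return replace(node, node_id=node.node_id + "'", spec=new_spec,
                   results=run_experiment(new_spec))


def diff(a: Node, b: Node) -> List[str]:
    """Line-oriented diff over specs and results, in the spirit of a unified diff."""
    lines = []
    for section in ("spec", "results"):
        old, new = getattr(a, section), getattr(b, section)
        for key in sorted(set(old) | set(new)):
            if old.get(key) != new.get(key):
                lines.append(f"- {section}.{key}: {old.get(key)}")
                lines.append(f"+ {section}.{key}: {new.get(key)}")
    return lines


original = Node("47", {"env": "stationary"}, run_experiment({"env": "stationary"}))
forked = fork(original, env="drifting")
print("\n".join(diff(original, forked)))
```

Because the fork re-executes rather than paraphrases, the difference between the two results is a structured object that another agent can inspect, instead of a sentence in a related-work section.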

[Figure: An original artifact (nodes 45 to 49, with node 47 producing result A) is forked at node 47 with the environment assumption swapped, producing result B at node 47′ hours later. The diff shows env changing from stationary to drifting and claim C3 changing from "holds" to "fails under drift", verified by re-execution, not by trust. Labels: version control, dependency graph, git blame.]
"I forked your artifact at experiment node 47, swapped out your environment assumption, and my new results are immediately diff-able against yours." The moment research becomes natively forkable, science gets its version control, its dependency graph, and its git blame.

The Human Role: From Road-Builders to Stewards

If AI networks compound knowledge at a thousand kilometers an hour, human cognition cannot physically keep pace to oversee every step. We have to give up the illusion of micro-managing the scientific process. Our role moves up the stack — not "drivers" relying on personal taste, but system architects anchored to physical reality and societal needs. The manual execution and the infinite hypothesis testing go to the machines, while humans focus on macro-control:

  • Target definition and compute allocation: Instead of evaluating every intermediate hypothesis, humans will define the ultimate societal goals (e.g., "design a carbon-negative concrete") and allocate the compute budget. We shift from being the laborers of science to its clients and investors.
  • Epistemic anchoring: We will no longer read raw literature. Instead, we will rely on specialized Interpretability AIs whose sole job is to translate hyper-dimensional AI research graphs into human-comprehensible risk/reward models, ensuring we understand the macro-implications of the network's discoveries.
  • The physical failsafe: Crucially, humans must guard the firewall between digital discovery and physical reality. To prevent machine-speed catastrophes in high-risk fields like synthetic biology, "alignment" must become hard engineering.

Picture It

When you put all these pieces together, the second half of AI for Science looks like this:

A human articulates a complex problem. A vast population of AI scientists fans out across the hypothesis space. They don't publish static papers; they publish living, executable artifacts. These artifacts are forked, composed, stress-tested, and re-executed by peer AI scientists in a matter of hours, not across multi-month review cycles. Verification happens continuously and mechanically. Failed branches are treated as first-class knowledge.

The "literature" ceases to be a dusty pile of disconnected PDFs. It becomes a single, continuously growing, executable tree of everything the network knows and exactly how it knows it. Humans walk the canopy of that tree — pruning, steering, and occasionally gasping at the view.

The first half asked how smart one scientist could be; the second asks how fast a network of them can compound. That's a problem of ecosystem, not capability — harder to benchmark and to demo, but where almost all of our future acceleration lives, because the gap between AI's bandwidth and our legacy ecosystem's is the only thing still holding us back. The first half built smarter scientists; the second gets to rebuild science itself.

Welcome to the second half.


Acknowledgements

Thanks to Ang Cao, Vandon Duong, Velvin Fu, Jintao Huang, Abhishaike Mahajan, and Chengyang Shi for the thoughtful discussions and feedback that shaped this post.

References

  1. Yamada, Y. et al. The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search. arXiv:2504.08066. arxiv.org/abs/2504.08066
  2. Huang, K. et al. Biomni: A General-Purpose Biomedical AI Agent. bioRxiv 2025.05.30.656746. biorxiv.org
  3. Swanson, K. et al. The Virtual Lab of AI agents designs new SARS-CoV-2 nanobodies. Nature (2025). nature.com
  4. Wijk, H. et al. RE-Bench: Evaluating Frontier AI R&D Capabilities of Language Model Agents Against Human Experts. arXiv:2411.15114. arxiv.org/abs/2411.15114
  5. Starace, G. et al. PaperBench: Evaluating AI's Ability to Replicate AI Research. arXiv:2504.01848. arxiv.org/abs/2504.01848
  6. Chan, J. S. et al. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095. arxiv.org/abs/2410.07095
  7. The Last Human-Written Paper. arXiv:2604.24658. arxiv.org/abs/2604.24658