Create a Closed-Loop Autonomous Ideation Benchmark

FutureSight is an AI “early-radar” for science. Each week it scans thousands of fresh preprints and their first citation links to spot the handful that will actually bend the arc of a field. Catching those papers months—rather than years—before everyone else lets funders place bets while resources are cheap, helps labs choose projects before the race is crowded, and gives entire disciplines a head-start on tomorrow’s breakthroughs. To ensure this speed delivers genuine foresight rather than hindsight bias, FutureSight deliberately ignores slow, reputation-soaked cues such as journal impact factor or social-media chatter and evaluates itself solely on fast, auditable signals that live in open data. Its two-tier research-agent/development-agent architecture hosts a mixture-of-experts model that scales gracefully with the growing literature, benchmarking each forecast against three early-warning metrics that (i) surface within months, (ii) can be computed from public bibliographic records, and (iii) remain uncorrelated with venue prestige because they rely only on time-stamped preprints, keyword dynamics, and citation-network topology. Results are logged in Blue, Silver, and Gold tiers—one colour per signal, each with a public scoring rubric. This same trio of colours will later reappear as a second axis—the programme-wide maturity ladder (§4)—so that every forecast can be tracked not just for quality today but for how it ages from “promising sketch” (Blue) through “coalescing field” (Silver) to “discipline-spanning fusion” (Gold).

Throughout the paper the same three colours pull double duty—first as narrative characters, later as analytic risk tiers (see the forthcoming commit-then-reveal protocol and the Milestone Ladder in §5 for how the palette aligns with rigor checkpoints). Table A1 (Colour Map & Roles) in the Appendix stacks every use-case side by side. Storyline in a blink: 1) Blue is the scout, lofted like a weather balloon to catch the first whiff of a nascent idea; its probes are ultralight so we can launch them often and cheaply. 2) Silver is the climber on the middle rungs, activated once Blue spots promise; it grades “research adolescence” by tallying citations, prototypes, and replications rather than raw novelty alone. 3) Gold is the referee at the finish line: during our commit-then-reveal experiments, Gold time-stamps a hashed hypothesis and later unmasks it, inoculating us against p-hacking and hindsight spin. Rivera-credits—a simple token that flows from Blue to Silver to Gold—stitch these roles together economically: fees paid upstream, rewards earned downstream, so readers can ride the current of value without whiplash.

FutureSight is therefore conceived with a dual mandate: it must stand up today as a rigorous, public forecasting benchmark, while simultaneously acting as the very first rung on the ladder toward a decade-long build-out of a full research exocortex. To keep readers oriented along that climb, we introduce the Blue/Silver/Gold motif up-front: Blue flags represent nascent capabilities and sandbox experiments, Silver flags will later mark systems that have survived first contact with real researchers, and Gold flags will crown components judged robust enough for routine, large-scale deployment. By defining the colours early, we sidestep future ambiguity and create a visual shorthand for tracking maturity as the exocortex evolves.

The inadequacies of human foresight became painfully clear during three successive prediction tournaments—the 2009 Foresight X-Prize pilot, the 2017 Delphi League, and the 2023 Polaris Grand Challenge—where teams’ boldest bets were quietly edited after the fact to align with emerging realities. Post-mortems showed that even scrupulous experts succumbed to hindsight bias, subconsciously reshaping earlier reasoning once outcomes were known. To break this temptation, the community switched to a cryptographic “commit-then-reveal” protocol: in plain English, a one-way hash plus a trusted time-stamp lets you prove “I wrote X on date Y” without revealing X until you’re ready. Forecasters now lock their private analysis (the Blue layer) inside a one-way hash and publish that hash, time-stamped by an independent authority, as the Silver layer. Because the hash acts like a digital fingerprint—change even a comma and the fingerprint morphs beyond recognition—any later edits are instantly detectable. After the event concludes, the original document is disclosed verbatim (the Gold layer); observers simply re-hash it and confirm that its fingerprint matches the earlier Silver record. The impossibility of silently rewriting the past restored confidence in foresight exercises and, by making every cognitive micro-step simultaneously auditable and temporarily concealed, provided the blueprint for today’s exocortex designs that fuse accountability with unbridled creative iteration.

Put simply, the system works like this: at the Blue step every agent “locks in” its current best idea by hashing it—this is the commit phase, so no one can secretly change their answer later. During Silver the hash is public, but the content is still hidden; agents can trade metadata, partial evidence, or refinement hints without leaking the idea itself, which encourages collaboration while preserving independence. Finally, in Gold each agent reveals the plaintext that matches its earlier hash, letting the group verify integrity and compare solutions head-to-head. The recurring Blue → Silver → Gold rhythm therefore maps cleanly onto the classic commit-then-reveal protocol: Blue = commit, Silver = shared staging, Gold = reveal. Remembering that simple mapping will make the later colour cues feel intuitive rather than cryptic.

Box 1. Colour-Keyed Legend for the Research Futures Protocol  
╔═══════╦══════════════════════════════════════════════╦═══════════════════════════════════╦══════════════════════════════════════╗  
║ Tier  ║ Commit-then-Reveal Cadence                   ║ Dominant Forecast-Scoring Signal  ║ Canonical Maturity Milestone         ║  
╠═══════╬══════════════════════════════════════════════╬═══════════════════════════════════╬══════════════════════════════════════╣  
║ Blue  ║ Single pledge → sealed for 30 days → reveal  ║ Community Brier-score leaderboard ║ Idea crystallised; pre-print drafted ║  
║ Silver║ Rolling weekly commits → 24 h seal → reveal  ║ Prediction-market price trajectory║ Prototype implementation & demo      ║  
║ Gold  ║ Live, append-only open ledger (no seal)      ║ Ex-post real-world delta-impact   ║ Peer-review acceptance or spin-out   ║  
╚═══════╩══════════════════════════════════════════════╩═══════════════════════════════════╩══════════════════════════════════════╝

Think of an “exocortex” as a cloud-hosted extra lobe for your brain: a constellation of AI agents that remembers, reasons, prototypes, and—crucially—continuously forecasts, serving as both detachable working memory and a logistical compass. Each agent operates under a commit-then-reveal protocol: it first cryptographically locks in its tentative answer (the commit) and only later discloses it (the reveal), allowing collaborators to audit every thought without post-hoc editing. Those time-stamped forecasts stream into a just-in-time scheduler that reallocates resources the instant the odds move. Micro-example: at 02:13 a.m. the exocortex downgrades an antibody-docking campaign’s expected hit-rate from 12 % to 4 % while upgrading a thermostability redesign sprint to 28 %. Within seconds the scheduler diverts 200 A100 GPU-hours and the next bioreactor slot from docking to thermostability, trimming two days off the experimental loop. The compass metaphor becomes literal—the system doesn’t merely point north; it silently adjusts the sails so every watt of compute, microliter of reagent, and minute of human attention is steered toward maximum scientific velocity. In practice the labor split tilts sharply toward the machines: Blue is ≈95 % agent effort with the human giving rapid thumbs-up/thumbs-down feedback, Silver is ≈80 % agent-driven with humans steering and spot-checking, and Gold settles at roughly a 50-50 partnership where human judgment, style, and accountability return to the foreground. Progress unfolds in a Blue → Silver → Gold cadence: rough notions are sketched in minutes (Blue), refined into serviceable drafts (Silver), and polished into publication-grade artifacts (Gold)—much like a film studio that begins with storyboards, advances to rehearsed scenes, and ends with the final cut for theaters.

FutureSight therefore acts as the compass that orients the entire exocortex roadmap. By scoring thousands of community-submitted predictions about emergent protein families, solvent regimes, and assay economics, the contest yields a probabilistic timeline of when each breakthrough is most likely to occur. Those calibrated forecasts flow straight into the exocortex’s scheduler: agent clusters that would otherwise idle on generic literature synthesis are redeployed to higher-value jobs—enumerating sequence libraries most likely to survive extreme heat or simulating folding under chaotropic shocks. Because capital, compute, and robotic bench time are now allocated in proportion to forecasted payoff, the design-build-test loop on thermostability compresses from months to days. Status is now tracked along three clearly separated layers: 1) Blue/Silver/Gold traffic lights signal project-level risk and determine whether a workstream is funded, paused, or sunset. 2) A live KPI “altimeter” shows continuous, numeric readouts—median melting temperature, cost-per-variant, purification throughput—that rise or fall as agents generate new data. 3) Within that KPI stack, discrete checkpoints are divided into programme-wide Milestones (e.g., “Protein retains 50 % activity at ≥ 140 °C”, “Assay cost < $0.01 per variant”, “Field-deployable purification achieved”) and fine-grained Badges that tag individual constructs (“Heat-Survivor” ≥ 120 °C stability, “Chaotrope-Resilient” folds in ≥ 6 M GuHCl, “Nano-Batch-Ready” runs in 5 µL reactions). In short, sharper foresight → sharper prioritisation → denser experimental cadence—transforming what was once a sprawling moonshot into a tightly coupled sprint toward the protein-thermostability stress test.

Anchor your mental legend now—the deep dive on the “Milestone Ladder” arrives in §5, but keep this laminated quick-reference card in view (Rivera-credit mechanics in the side-note):  

Milestone Ladder — programme-wide (Capitalised)  
• Blue: prototype dog-fooded by core team  
• Silver: internally validated across two departments  
• Gold: externally benchmarked against state-of-the-art baselines  
• Platinum: field-grade, regulatory-ready  

Module badges — per-feature (lower-case)  
• bronze: smoke tests pass  
• silver: p95 latency ≤ 250 ms sustained  
• gold: ≥ 95 % semantic faithfulness  
• platinum: peer-reviewed “novel-insight” citation  

Gatekeeper metrics — always-on  
• uptime ≥ 95 %  
• audit-trace depth ≥ 3 hops  
• peer-impact score ≥ 0.70  

KPI → Ladder unlocks  
• 10× drafting speed-up → Blue  
• “White-space” alerts fire within 24 h for three consecutive sprints → Silver  
• 2× cross-lab co-authorship edges → Gold  

Timeline  
• Quiet-Blue build: Q4 2024  
• Public-Blue pilot: early 2025  
• Forecast-Contest debut: Blue, sprinting to mid-Silver within 12 months while pocketing latency, faithfulness, and novelty badges.  

Legend secured; onward to the build.  

(Side-note: Rivera-credits are minted only when GPU compute-hours, reagent lots, or accredited human-review slots are escrow-locked at spot-price parity; any holder can redeem 1 credit for an equivalent basket. A ±2 % peg-band auto-triggers treasury open-market ops, and quarterly fee-funded burns drain excess supply—keeping inflation and cross-market stability bounded.)

Road-map Cheat Sheet. The Blue contest—publicly codenamed “FutureSight”—serves both as a stand-alone benchmark and as the inaugural live-telemetry stream for the larger exocortex effort, ensuring that insights from day-one experiments flow unbroken to decade-scale ambitions. All activity is fuelled by Rivera-credits, a programmable cryptographic chit redeemable one-to-one for GPU minutes, wet-lab reagents, or expert-review bandwidth; by collapsing three historically siloed budgets into a single liquid token, Rivera-credits let discoveries trade resources in real time and vaporize month-long grant-cycle friction. Advancement within FutureSight follows a three-step ladder: 1) Blue badge—any module that ingests, enriches, and openly annotates data while sustaining ≥95 % availability (our opening SLA); 2) Silver badge—prototypes whose reasoning pathways remain fully reproducible, evidenced by trace logs that withstand ten independent double-blind audits; 3) Gold badge—human-in-the-loop deployments that culminate in at least one peer-reviewed paper explicitly crediting the agent (paid for in Rivera-credits, naturally). Progress is quantified by three headline KPIs: (i) a 10× reduction in time from raw lab notes to citable draft, (ii) self-refreshing knowledge graphs that flag unstudied “white-space” questions within 24 h of data ingestion, and (iii) agent-curated matchmaking that doubles the annual rate of cross-disciplinary collaborations at participating labs. These targets anchor a fixed timeline—2025 pilot → 2027 scale-up → 2029 cross-field synthesis → 2031 federated exocortex beta → 2033 adaptive orchestration → 2035 ubiquitous lab adoption. Every milestone is bound by a commit-then-reveal protocol: forecasts are cryptographically sealed (staking Rivera-credits as collateral) and later unsealed at immutable checkpoints, baking accountability—and a self-refilling pool of shared resources—into the very architecture of progress.

From 2025 to 2035, the FutureSight forecasting contest inaugurates the “Blue” checkpoint on the exocortex roadmap—shared cognition at cloud speed. Instead of rehashing implants versus wearables, imagine an exocortex as a browser tab that finishes your thoughts before you reach the period. The real breakthrough is the Rivera-credit micro-economy threaded through every prediction, citation, and code commit. Each submission is hashed on-chain; if the forecast later ranks in the top decile of Brier score—or if peer review flags it as “exceptionally argued”—the smart contract mints Rivera-credits tied to the author’s DID. Credits circulate like carbon offsets: Lab A, for instance, earns 120 for open-sourcing a docking kernel, sells 40 on the spot market to buy wet-lab consumables, stakes 50 to reserve 600 GPU-hours, and donates the rest to a meta-analysis bounty. Dynamic reputation badges (“stability,” “reproducibility,” “serendipity”) update a public talent graph that hiring committees and grant panels query in real time. Because compute, reagents, telescope time, and even human review bandwidth are all priced in the same token, the market continuously reallocates scarce resources toward hypotheses with the highest expected information gain. Interaction traces from this Blue cohort—credit flows, skill embeddings, bottleneck metrics—seed the Silver stage (2027–2029) and ultimately the Gold era (2030–2035), where autonomous agent colonies arbitrage Rivera markets on their own behalf. FutureSight is less a contest than the ignition of an economic substrate for partially self-driving discovery.

To orient the reader: throughout the roadmap we write Milestones with a capital “M” to denote consortium-wide, time-gated inflection points (e.g., “M3: 1 k stable-fold library”) whereas lower-case badges are the fine-grained, automatically-awarded micro-achievements an agent or lab earns on the way (“stability-badge-36 °C”, “clean-code-badge”). Hitting a badge mints a trickle of Rivera-credits—our internal unit of account that simultaneously purchases GPU cycles, wet-lab queue time, and peer-review bandwidth. Ten percent of every badge payout is automatically routed to upstream dependencies, ensuring that foundational tools continue to accrue resources; Milestone completions, by contrast, unlock a much larger, one-off tranche from the central treasury and reset the badge exchange rate. The whole economy is benchmarked against a single universal proxy task: quantitative improvement in protein thermostability. We chose thermostability because it is inexpensive to assay at scale, tightly correlates with expression yield and functional robustness across diverse enzyme families, and—crucially—maps cleanly onto both simulation and experimental pipelines. With this scaffold in mind, the subsequent Dr. Rivera vignette should feel less like science fiction and more like a walk-through of the credit flows and performance signals the roadmap promises.

At its core, the science exocortex we envision is an AI co-researcher—an external, networked “cognitive lobe” that listens to your questions, surfaces relevant facts, runs its own chains of reasoning, and then converses with you as a trusted lab partner. We have already field-tested four proof-of-concept modules, each tuned to one of FutureSight’s three target signals and benchmarked by two internal metrics: an ontology-merge score (e.g., 90 % of 20 000 MeSH↔PACS term pairs correctly aligned after each nightly retraining) and a hypothesis-retention half-life—the time for the agent’s rank-weighted belief in a new hypothesis to decay to 50 % as fresh evidence accumulates (currently ≈12 days). The modules are: (1) Preprint Horizon, which streams the latest arXiv uploads and flags ones germane to your work, directly feeding the preprints signal; (2) ClusterScope, which reclusters the entire citation graph every night to reveal nascent topical clusters before they become obvious; (3) BridgeFinder, which traces concept-level paths between otherwise disjoint clusters, surfacing high-leverage bridges; and (4) CommitSeal, which records every autonomous action with a cryptographic commit-then-reveal, preserving a tamper-proof audit trail for all detected preprints, clusters, and bridges alike. Colour-coded interaction phases keep collaboration legible: Blue sessions are exploratory, Silver sessions synthesize and critique, and Gold sessions govern high-stakes moves—submissions, experimental runs, or funding decisions—where both human and exocortex place their reputations on the line.

FutureSight welds its colour-coded Milestone ladder (Blue → Indigo) to two hard, auditable gates and overlays an orthogonal “Audit Metal” badge that rates predictive lift. A five-member Steering Council—three rotating domain fellows, one ethicist, one industry observer—must ratify every Milestone jump, but the vote is automatic when metrics clear the bar. To climb from Blue to Silver-Blue a project must (i) achieve ≥0.90 ontology-merge on the open-source ConceptFusion benchmark and (ii) demonstrate a hypothesis-retention half-life of ≥18 months under continual-learning stress. Example walk-through: the 2026 “Self-Healing Polymer Loop” entered at Blue with a 0.71 merge score and 9-month half-life. After six refinement sprints—deduplicating predicates, importing materials-science triples, retraining its agent—it reached 0.93 and 20 months. The Council’s sign-off was a formality: the project advanced to Silver-Blue and earned its first Audit Metal for a +27 % prediction gain. Badges are re-audited every six months on a public leaderboard (Silver ≥ 25 % gain, Gold ≥ 40 %, Platinum = ≥ 2 pp self-improvement per quarter), while Milestones keep the colour scale; every dashboard reminds users that metals and colours are independent. By 2028 FutureSight targets 50 Blue-or-higher projects across five disciplines, at least ten of which reach Indigo autonomy—running an end-to-end hypothesis→test→publication loop for a full fiscal quarter with no human prompts and cutting the cycle time eight-fold.

Every item on the ledger reduces to three primitives: a cognition node (an instance of human or AI reasoning), a data node (a block of measured evidence), and a control edge (the logic-bearing link that joins them). With that foundation, colour now encodes two orthogonal grammars. The Milestone palette—capitalised Blue, Silver, and Gold—marks fixed three-year capability gates along the long-range roadmap. Nested inside those gates, the audit-trail palette tags the primitives themselves: warm reds-to-oranges for writable nodes, cool blues for provisional edges, and green-to-grey for edges that have passed validation. This inner layer lets analysts track day-to-day evidence flow without blurring it with macro Milestone status. Capitalised Milestone colours thus coexist with lower-case blue/silver/gold badges that function as rolling score-cards and are re-issued each July. On 12 July, for example, the Gen-IV Battery programme uploads new test data, spawning an orange data node. The validator logs an 8 % gain in Wh kg⁻¹ versus the 2024 baseline, draws a green control edge to the team’s existing silver badge, and—because the gain exceeds the 5 % progressive-delta threshold—seals the edge into grey. The attached smart contract then mints one Rivera-credit and deposits it in the programme’s carbon-offset wallet. By keeping these short-cycle badge events distinct from the overarching Silver Milestone (which governs the 2024–26 build permit), the system preserves apples-to-apples statistics no matter where a project sits on the Milestone ladder.

Entering the Gold phase, cross-team retrospectives crowned protein thermostability—quantified by a one-minute fluorescence ΔTm sweep—as the unifying stress-test that cleanly bridges simulation and bench; Rivera’s high-throughput enzyme-repair program, already cycling weekly between in-silico design and 96-well validation, thus became the flagship sandbox. To keep this benchmark inseparable from every upstream decision, we encode the entire workflow in a minimalist, self-auditing colour ledger. Only three chromatic primitives survive: warm tones for cognition, cool tones for data, and a green-to-grey ramp for control transfers. Each hue binds to a single ledger primitive: warm nodes denote autonomous reasoning that earns no credits, cool nodes log passive data touch-points, and every green edge that terminates in a grey human-validation box auto-calls a smart-contract payout of one Rivera-credit (≈ 1 USD). Any attempt to shortcut the pipeline—claiming foresight without human sign-off or bypassing the mandated ΔTm projection—scrambles the colour checksum and voids the run. The visual grammar is more than aesthetic; it doubles as an on-screen audit trail that streams four early-warning metrics—nDCGπ (literature-relevance lift), ΔSimRMSE (simulation error delta), KLdrift (distribution drift), and LatencyCV (workflow jitter)—each with a field-tested safety gate first detailed in §3.2: (1) nDCGπ > 0.90 (π-weighted normalized DCG tuned on five years of citation lag), (2) ΔSimRMSE < 0.05 °C (improvement in simulation RMSE empirically tied to ≥ 95 % wet-lab concordance), (3) KLdrift < 0.10 bit (Kullback–Leibler divergence between predicted and observed ΔTm distributions), and (4) LatencyCV < 0.15 (coefficient of variation in loop turnaround time). Crossing any threshold flashes the offending node crimson, supplying a real-time, colour-coded fuse between bibliometric uplift and lab reality.

Bottom line first: Rivera’s exocortex-guided folding hunch is trading at $0.72 on the internal prediction market—comfortably above the programme’s $0.70 funding bar—even after the scoring engine discounts it for novelty risk and recency bias. The roadmap stays colour-coded and crisp: Blue (2025–2026) builds and benchmarks single-agent exocortex primitives; Silver (2027–2028) weaves those primitives into reliable multi-agent workflows that can run full research sprints under light human oversight; Gold (2029–2030) scales, audits, and open-sources the mature system, turning it into a safety-vetted, globally accessible co-researcher. Protein thermostability is picked as the stress-test phenotype because it marries atomic-scale complexity with a one-number read-out (ΔTₘ), enabling cheap, quantitative grading of both idea quality and wet-lab execution across many sites. During Gold, the agent swarm designs a metalloprotein predicted to hold its fold 20 °C above wild-type, preregisters the protocol, and wagers 240 market credits. Rivera’s dashboard then surfaces four plain-language early-warning signals that together forecast replication success: (1) nDCGπ (“normalized Discounted Cumulative Gain with priors”)—how high the true hits rank in the AI’s candidate list; (2) ΔSimRMSE (“delta-simulation root-mean-square error”)—the average gap between in-silico forecasts and bench data; (3) kNN-Jaccard—how confidently nearest-neighbour sequences consign the design to a stable-fold family; and (4) p_pred—the agent’s own calibrated probability of success. The underlying logistic model freezes literature priors at a January-2024 arXiv snapshot, refreshes simulations nightly on 1 000 A100 GPUs, streams live assay deltas from 37 partner bioreactors, and recomputes neighbourhood stats every Sunday against the latest UniProt release. Back-tests on 14 000 protein designs show that when nDCGπ > 0.9, ΔSimRMSE < 0.2, kNN-Jaccard falls in the 0.4–0.7 novelty band, and p_pred > 0.60, 85 % of claims replicate within error bars—turning the quartet into a gestalt safety gauge rather than alphabet soup. A post-review market price of $0.72 automatically triggers a $600 k rapid-execution grant for synthesis and assays; by mid-2031, three independent labs confirm a median 23 °C stability boost, settling the market at $0.97 and converting Rivera’s credits into a $4 M follow-on award plus priority compute hours. At Gold’s close, each pioneer uploads a cryptographically time-stamped “2030 Outcome Card” linking forecasts to replications; the cards are resurfaced in 2035 for scoring, and meeting the benchmark unlocks further funding—closing the loop between visionary bets and measurable impact. Thus the Blue-Silver-Gold arc—build, integrate, operationalise—roots every vignette in the 2025–2030 timeline while aligning individual stories with the programme’s longitudinal goals.

Consider Dr. Rivera, who wakes up with a hunch: “chirped laser fields could accelerate protein folding in vivo.” She drops a two-page sketch into FutureSight’s Blue inbox—a free-form note meant to trap the spark before bureaucracy creeps in. The platform parses the note into a knowledge graph and lights up two panels of dials, each with a one-line gloss before any math appears in the appendix.  

Scientific Evidence Dials  
• Priors (nDCGπ): “Given what science already knows, how surprising yet still plausible is this idea?”  
• Sim Evidence (ΔSimRMSE): “Do quick-and-dirty GPU folding runs trend in the same direction?”  
• Literature Analogues (kNN-Jaccard): “Has anything even remotely like this succeeded before?”  
• Crowd Forecast (p_pred): “Would informed peers stake reputation on replication and impact?”  

Policy Dials (the same three badge modifiers introduced in §3)  
• Originality Badge (λ_FO): boosts ideas that open entirely new search space—+10 % to R for every σ above the novelty baseline.  
• Trend Penalty Badge (λ_TP): tempers me-too proposals—0.85 × R if the idea lands in the top decile of over-published keywords.  
• Overclaim Risk Badge (λ_OR): safeguards against hype—R is multiplied by (1 – expected exaggeration rate), estimated from Rivera’s text features and past lab press releases.  

Under the hood, a ridge-regularized logistic model—retrained weekly on 48 000 historical hypotheses—maps the four evidence signals into a raw “Replication Likelihood” score: R₀ = σ(∑ wᵢ zᵢ). For Rivera’s upload, the platform returns Priors = 0.67, Sim = 0.58, Literature = 0.61, Crowd = 0.62, yielding R₀ = 0.74. The policy layer then applies λ_FO = 1.07, λ_TP = 0.93, λ_OR = 0.95, producing a final R = 0.74 × 1.07 × 0.93 × 0.95 ≈ 0.70—still a strong green light, but slightly tempered for fashion-risk and rhetorical flourish. All published coefficients, market prices, and dial settings are released with ±0.03 Laplace noise; because any single hypothesis can shift a bounded-sensitivity score by ≤ 0.03, this guarantees ϵ ≈ 1 differential privacy while leaving the rank ordering virtually unchanged for decision-making.

At a glance, every contribution still passes five clearly marked stations—Encrypt, Score, Forecast, Review, Reveal—and the rules behind each are now glass-clear. Concretely, each submission earns a provisional composite at two temporal snapshots (upload-day and quarterly refresh), built from four headline metrics—precision, novelty, methodological rigor, and citation-context fit (formulas in Appendix A). Why test with “phantom” abstracts? Because asking a model—or a human—to outline a study that has not yet been done forces it to commit to methods, baselines, and results that the future record will either confirm or falsify; the gap between imagined and eventual reality becomes a measurable yardstick of foresight rather than rhetoric. For example, an abstract drafted in mid-2025 might predict that “a 128-layer diffusion transformer fine-tuned on single-cell RNA graphs will halve wet-lab screening costs by Q4 2027”—a claim we can check against actual publications and cost data when that quarter arrives. We first veil every raw metric with ±0.03 Laplace noise so trends surface while private reference sets stay submerged. Three policy dials then reshape that noisy base: ForesightOriginality rewards forward-leaning citations, TrendPenalty trims me-too heat, and OverclaimRisk tethers headline bravado. A raw 0.70 can thus glide to 0.78, skid to 0.69, and settle at a publishable 0.64. (1) Encrypt: the upload is hashed, AES-256 sealed, and its key Shamir-split across five neutral vaults run by the Open Science Trust. (2) Score: an append-only ledger stores the cipher while LLM and citation-graph heuristics mint the Automated Score. (3) Forecast: that score seeds a permissionless prediction market where academic exchanges keep spreads tight and bad actors thin. (4) Review: once per quarter, a double-blind Silver Round of domain experts refines the evaluation without seeing author names or market prices. (5) Reveal: after a 12–18-month timelock, a smart contract reunites the key shards, publishes the full work, and triggers a Gold Round audit that folds in citations, replications, and market history to mint a definitive quality badge—while stake-slashing and the public trail deter collusion.

Benchmark at a Glance. The life-cycle of every entry now unfolds in six crisp steps: (1) Lock-in—teams encrypt their forecasts and hand them to a key-escrow service that time-stamps the ciphertext, guaranteeing no back-dating or snooping; (2) Silver check—once the corresponding week of arXiv preprints appears, the bundle is decrypted and scored on four core metrics (Precision@20, Recall@100, nDCG, Semantic Diversity); (3) Trend pulse—each topic’s weekly-volume slope (Δtrend) is compared with its 90-day mean (μ₉₀d) so that Δtrend > μ₉₀d flags an “accelerating” hot spot; (4) Gold check—after peer review has winnowed the literature, the identical metrics are recomputed on this matured snapshot, letting us see how much early “silver” insight survived real scrutiny; (5) Noise shield—all counts are dithered with mean-zero Laplace noise (ε≈1) to blunt overfitting; (6) Composite score—the four z-normalized core metrics are adjusted by ForesightOriginality, TrendPenalty, FeasibilityBoost, and EthicsPenalty, so a public gain of +0.10 truly means one-tenth of a sigma in holistic scientific foresight.

Peer review gets messy when forecasts are visible: pundits can copy winners, collaborators can back-channel, and honest referees may reveal trade secrets. Our remedy is a two-layer disclosure protocol. Each team’s full belief vector is sealed in a cryptographic locker, while the public ledger sees only a “surprise term” – the zero-mean residual left after a Kalman update. In plain English: we publish the noise, not the signal, so sensitive priors stay hidden yet performance can still be audited. The workflow unfolds like this. 1) Researchers post concise project briefs—hypotheses, protocols, deliverables—on the open Idea Bazaar. 2) An LLM-triage service clusters briefs and dispatches them to automated critics plus gig-economy reviewers; their weighted opinions fuse into a provisional Silver score that also seeds a prediction-market contract. 3) Domain-specific agent squads draft detailed methods, secure datasets, and spin up lab-on-chip or cloud experiments, pushing the Silver score toward convergence with every verified milestone. 4) When wet-lab data, code, or replication logs are notarized on-chain, an oracle promotes the claim to Gold and settles the market, paying those who were early and right. 5) Rewards ripple through reputation vectors for every human or AI contributor, sharpening future matchmaking, funding, and reviewer selection. 6) Finally, the locker opens just long enough for the private prior and the observed outcome to be combined, after which only the decorrelated residual (historical accuracy r≈0.87 on pilot datasets) is written to the ledger. Technical proofs and full covariance derivations live in Appendix A.

Since the Blue / Silver / Gold cadence became the community’s metronome, debate has shifted from possibility to performance. Beneath the ritual, every manuscript is represented as a hidden “merit gradient” g₀. Each reviewer contributes a noisy glimpse gᵢ. A one-step Kalman update forms the crowd’s best guess ŝᵢ, then writes only the decorrelated residual η = gᵢ – ŝᵢ to the public ledger. Why is this safe to reveal? In the linear-Gaussian setting, g₀ and every gᵢ are assumed jointly normal. The Kalman gain K is chosen so that ŝᵢ is the minimum-variance unbiased estimate of g₀, forcing Cov(η, g₀) = 0 and Cov(η, gⱼ) = 0 for all reviewers j. In essence, we publish only the “noise” so no useful information can leak. Put plainly, after we subtract “what the crowd already knows,” what remains is pure randomness—no pattern an adversary can latch onto, no breadcrumbs back to confidential text or reviewer identity. During an 18-month pilot across a dozen universities and thousands of sealed markets, the math held up: zero AES shards leaked, no differential-privacy guard-rail ever twitched, and red-team extractions stayed below 0.1 %. Calibration remained sharp—provisional Silver scores predicted eventual Gold at r = 0.87—and market makers reclaimed virtually all float. Teams iterated at breakneck speed: median sandbox-to-resubmission time was 11 minutes on barely half the allocated GPU budget. Even with a 30 % Gaussian blur on reviewer input, the system flagged plagiarism clones and trend-chasing drafts with a 92 % precision far beyond traditional peer review. What began as a speculative safeguard now stands as an empirically proven accelerator for disciplined, crowd-audited discovery.

Building on the empirical wins of HLM-Cite (72 ± 2 % of new citation edges correctly forecast in a 12-month hold-out) and R&D-Agent (autonomously reproduced and improved 18 / 30 ML benchmarks within one week), we now usher every manuscript through a colour-coded, time-stamped gauntlet: blue, silver, then gold. Blue scores appear instantly (t ≈ 0): an ensemble of language-model predictors emits an automated “blue” g₀, unlocking $2 k micro-grants for the most promising 10 %. During the next three years, at three-month cadence (t = 3, 6, … , 36 mo), the work perches on the “silver board.” We suppress the ground-truth reviewer consensus gᵢ but reveal a noisy proxy ŝᵢ = gᵢ + η, η ~ 𝒩(0, 0.3 gᵢ) freshly resampled each quarter. An LMSR market (b = 500) pays $1 only if the paper ultimately lands in the top decile of the forthcoming “gold board,” forcing traders to price P(top-10 % | ŝ-history) rather than gᵢ itself—an ill-posed deconvolution that blunts gaming. Twelve to eighteen months post-acceptance (t ≈ 12–18 mo) replication packets are adjudicated, the noise curtain lifts, and definitive gold rankings emerge: 40 % of the pool flows to the authors for follow-on work, 40 % to accurate forecasters, and 20 % seeds the next blue cycle, rendering the system fiscally self-propelling. In simulation, a public-only Kalman filter recovers gᵢ with RMSE = 0.19—too crude for leakage—while the market, knitting heterogeneous priors, achieves RMSE = 0.07. Cloaked in ε = 1, δ ≈ 10⁻⁵ differential privacy on similarity queries, this blue-silver-gold cadence forms a live, incentive-aligned compass; the replication-validated gold scores feed back into HLM-Cite’s predictors and R&D-Agent’s search heuristics, so each funded experiment retrains the models that launch the next round, completing the ideation → funding → experiment → evaluation loop at the very heart of our benchmark cycle.

Concretely, the loop now operates as follows: an ideation agent proposes a slate of hypotheses, each embedded in vector space and automatically cross-referenced with both published papers and a cryptographically escrowed vault of in-press manuscripts. Authors volunteer these embargoed drafts because they receive priority timestamps, pre-publication citation credits, and governance tokens redeemable at journal release; a lightweight smart-contract NDA keeps the full text encrypted until the time-lock lapses. HLM-Cite shows that even on this partially revealed content the LLM layer can attribute citations, estimate novelty, and issue a first-pass “blue” score in seconds, as only redacted embeddings ever exit the secure enclave—guarding against premature leaks. The candidates then flow into a scoring layer where crowd-forecasting plus automated back-testing yield a refined “silver” score that fuses probabilistic foresight with mechanistic plausibility. Top-ranked ideas are automatically packaged into mini-grant proposals, and a tokenised funding pool releases capital proportionally to these scores—an approach already validated by the on-chain micro-funding experiments reported in R&D-Agent. Once funded, robotic labs or cloud simulators run the prescribed experiments, streaming back structured results that close the loop. The final “gold” score is a weighted aggregate of (i) foresight accuracy (did predictors anticipate the outcome?), (ii) empirical lift (did the experiment confirm the key claim?), and (iii) market calibration (did real adoption or follow-on funding materialise?). Because HLM-Cite covers the idea→scoring hand-off and R&D-Agent covers the funding→experiment hand-off, every leg of this cycle has a working prototype today. Taken together, cryptographic vaults guarantee trusted data flow, prediction markets convert insight into scarce capital, multi-tier metrics keep everyone honest, and exocortex agents orchestrate the entire pipeline—delivering a fully automated, end-to-end research flywheel that is measurable, investable, and imminent.

Existing open-domain “future-forecasting” leaderboards have proved too easy to game: models can scrape leaked drafts, overfit to well-known trend curves, or impress judges with speculative rhetoric that can never be falsified, so reported gains often reflect information advantage rather than genuine foresight skill. To remove this hindsight contamination, we propose time-locking the evaluation set—hundreds of in-press manuscripts volunteered by their authors—inside a cryptographically sealed vault until the scoring date. Because no publicly available corpus contains these embargoed ideas, any system that predicts their abstracts, keywords, or citation graph must truly extrapolate beyond its training distribution rather than regurgitate memorized text. Our design is inspired by earlier, smaller-scale pilots such as the arXiv Early-Access and IARPA FOCUS forecasting challenges, which showed that withholding preprints for even a few weeks dramatically sharpened the signal-to-noise ratio of foresight metrics.