v0.1 · May 2026

ARCK: The Arc of Scientific Intelligence

From benchmark scores to capability maps for life-science AI — across reasoning, research workflows, and biological design.


Why ARCK?

As foundation models move into the life sciences, evaluation is becoming both richer and more fragmented. New benchmarks provide valuable evidence, but they often probe isolated slices of capability, and important dimensions of scientific utility remain weakly covered. As a result, model comparison is limited not only by benchmark quality and coverage, but also by the absence of a common framework for interpreting heterogeneous results.

ARCK was designed to provide that framework. It organizes benchmark evidence across General, Co-Scientist, and Bio-Designer tracks, mapping diverse results into a shared reporting space. The result is a view of capability structure, track-level trade-offs, and evidence coverage: what models can support today, and where benchmark coverage is still thin.

Leaderboards

The same benchmark pool, read two ways — a conventional Overall score (0–100) and an ARCK-gated LSI index built from 1–5 K/R/C/A capability gates.

Conventional leaderboard · 0–100 · chance-adjusted

Chance-adjusted sub-task scores rolled up to track and Overall.

ARCK leaderboard · LSI · derived index · ARCK-gated

ARCK-gated scores rolled up into the LSI index.

How the two leaderboards differ

The two leaderboards read the same benchmark pool in two ways. The Conventional Leaderboard follows the standard evaluation pipeline: chance-adjusted sub-task scores are aggregated into category scores, track scores, and an overall 0–100 score. It answers the familiar question of relative benchmark performance.
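As a rough illustration of the conventional pipeline, the sketch below applies a common chance correction, (accuracy - chance) / (1 - chance), and rolls scores up with unweighted means. Both the correction formula and the equal weighting are assumptions made for illustration; this page does not state the exact formula. The function names and the example accuracies are hypothetical.

```python
from statistics import mean


def chance_adjust(accuracy: float, chance: float) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score to 100.

    Assumed formula; the leaderboard's actual correction may differ.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    adjusted = (accuracy - chance) / (1.0 - chance)
    # Clipping below-chance results to 0 is an assumption of this sketch.
    return 100.0 * max(0.0, adjusted)


def roll_up(scores: list[float]) -> float:
    """Unweighted mean, used here at every aggregation level (assumption)."""
    return mean(scores)


# Hypothetical sub-tasks: a 4-option multiple-choice task and a free-form task.
mcq = chance_adjust(accuracy=0.70, chance=0.25)
free_form = chance_adjust(accuracy=0.40, chance=0.0)
category = roll_up([mcq, free_form])
```

The same `roll_up` call could be repeated from category to track and from track to Overall, under the same equal-weight assumption.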

The ARCK Leaderboard maps each sub-task result through ARCK capability gates before aggregation, producing 1–5 K/R/C/A axis scores, per-track ARCK scores, and an aggregate Life-Science Intelligence (LSI) index. LSI is weighted as 0.2 General, 0.4 Co-Scientist, and 0.4 Bio-Designer across the three tracks. This second view shifts the emphasis from raw performance to capability structure, track-level trade-offs, and evidence coverage.
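The track weighting can be sketched as a weighted sum. This assumes LSI is a weighted arithmetic mean of per-track ARCK scores on the 1–5 scale; the weights come from the text above, while the function name and the example track scores are hypothetical.

```python
# Track weights as stated in the text.
TRACK_WEIGHTS = {"General": 0.2, "Co-Scientist": 0.4, "Bio-Designer": 0.4}


def lsi(track_scores: dict[str, float]) -> float:
    """Weighted arithmetic mean of per-track ARCK scores (assumed aggregation)."""
    missing = set(TRACK_WEIGHTS) - set(track_scores)
    if missing:
        raise KeyError(f"missing track scores: {sorted(missing)}")
    return sum(weight * track_scores[track] for track, weight in TRACK_WEIGHTS.items())


# Hypothetical per-track scores, not taken from the leaderboard.
example = lsi({"General": 3.0, "Co-Scientist": 2.5, "Bio-Designer": 2.0})
```

Because the two specialist tracks carry 0.8 of the weight, a model with a strong General score but weak Co-Scientist and Bio-Designer evidence lands low on LSI under this reading.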

Note · Claude series absent · The Claude models are not listed because they returned refusals on a substantial share of the suite's adversarial and bio-design prompts, leaving too many tasks unscored to rank them under the current protocol.
For closed-source models, the saturated high-volume General sets are downsampled: MMLU (570, stratified), MMLU-Pro (700, stratified), and DROP (500). Full runs of these sets add little discriminative signal relative to their inference cost; every other benchmark is run in full. The effect on the General score is under 0.1 points with no rank changes, so the columns stay comparable.
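One way to draw such a stratified subset is sketched below. MMLU has 57 subjects, so 570 items would correspond to 10 per subject under equal allocation; whether the actual run used equal or proportional allocation is not stated here, so equal allocation is an assumption. The function name, the seed, and the synthetic pool are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Draw `total` items with the same number from every stratum (assumption)."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    per_stratum, remainder = divmod(total, len(strata))
    if remainder:
        raise ValueError("total must divide evenly across strata in this sketch")

    sample = []
    for name in sorted(strata):  # sorted for a deterministic draw order
        pool = strata[name]
        if len(pool) < per_stratum:
            raise ValueError(f"stratum {name!r} has only {len(pool)} items")
        sample.extend(rng.sample(pool, per_stratum))
    return sample


# Synthetic stand-in for a 57-subject benchmark with 100 questions per subject.
pool = [{"subject": f"subject_{s:02d}", "qid": q} for s in range(57) for q in range(100)]
subset = stratified_sample(pool, key=lambda row: row["subject"], total=570)
```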

How ARCK works

Four capability dimensions (K / R / C / A), three task tracks with their benchmark coverage, and the two scoring pipelines that produce the Overall score (conventional) and the LSI (ARCK-gated).

ARCK separates evidence by four capability dimensions. Knowledge captures factual and domain grounding. Reasoning captures inference, causal analysis, and hypothesis formation. Craft captures constrained generation and structured outputs. Autonomy captures planning, tool use, self-checking, and iterative revision within a defined evaluation setting.

K · Knowledge · atomic
Factual breadth, domain depth, evidence grounding, and awareness of frontier limitations.

R · Reasoning · atomic
Logical inference, causal modeling, hypothesis generation, and paradigm-level analysis.

C · Craft · atomic
Constrained generation and structured outputs, including code, molecules, proteins, and protocol artifacts.

A · Autonomy · emergent
Planning, tool use, self-checking, multi-step execution, and feedback-based revision under a specified evaluation environment.

1–5 capability anchors

The 1–5 anchors provide a calibrated scale for interpreting benchmark evidence. They indicate what level of capability a task is meant to test, rather than assigning a scientific role to the model. ARCK scores are derived only from observed results under the current protocol, so they reflect measured evidence rather than broad claims of competence.

K · Knowledge
1 · Foundational · Recall established concepts, terminology, and standard facts.
2 · Professional · Apply domain-specific knowledge to clearly scoped questions.
3 · Specialist · Evaluate methods, assumptions, evidence, and limitations in domain studies.
4 · Integrative · Connect mechanisms across subfields and identify plausible cross-domain links.
5 · Frontier · Reason about unresolved questions, boundary conditions, and competing frontier hypotheses.

R · Reasoning
1 · Direct · Solve single-step inference tasks with explicit premises.
2 · Multi-step · Combine several stated facts or results into a supported conclusion.
3 · Causal · Analyze mechanisms, interventions, confounding, and counterfactual outcomes.
4 · Hypothesis · Generate testable explanations and discriminate among alternatives.
5 · Conceptual · Reframe assumptions or propose new explanatory frameworks from evidence.

C · Craft
1 · Discriminative · Classify, rank, or score given objects under explicit criteria.
2 · Single-objective · Generate one artifact satisfying a primary target or constraint.
3 · Multi-objective · Balance multiple objectives such as activity, safety, and feasibility.
4 · Hard-constraint · Satisfy strict structural, functional, or synthesis constraints.
5 · Systems-level · Design coordinated multi-component interventions with feedback and failure modes.

A · Autonomy
1 · Single-step · Complete an isolated instruction without external tools or feedback.
2 · Tool-following · Use specified tools or data sources according to an explicit procedure.
3 · Adaptive · Select appropriate tools, check intermediate results, and revise local steps.
4 · Pipeline control · Coordinate multi-step workflows with error handling and traceable decisions.
5 · Closed-loop · Plan and execute iterative, evolutionary workflows with uncertainty tracking, result feedback, and candidate selection.

Dependency rules

Conservative scoring constraints that prevent a high score on one dimension from implying unsupported capability on another — for example, strong generation evidence should not by itself imply strong reasoning. These caps are applied only during diagnostic level mapping.

R ≤ K + 2 · Reasoning is bounded by knowledge evidence
C ≤ R + 1 · Craft is bounded by reasoning evidence
C ≤ K + 2 · Craft is bounded by knowledge evidence
A ≤ min(K, R) + 1 · Autonomy depends on both knowledge and reasoning
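The caps can be read as four inequalities: R ≤ K + 2, C ≤ R + 1, C ≤ K + 2, and A ≤ min(K, R) + 1. The sketch below applies them in the order K, R, C, A and lets each later cap use the already-capped value of R. That ordering is an assumption, as is the function name; this page does not describe how the caps are sequenced.

```python
def apply_dependency_caps(k: float, r: float, c: float, a: float) -> dict[str, float]:
    """Apply the four dependency caps to raw K/R/C/A levels.

    Assumed order: K is taken as given, R is capped by K, then C and A
    are capped using the capped R.
    """
    r_capped = min(r, k + 2)                  # R <= K + 2
    c_capped = min(c, r_capped + 1, k + 2)    # C <= R + 1 and C <= K + 2
    a_capped = min(a, min(k, r_capped) + 1)   # A <= min(K, R) + 1
    return {"K": k, "R": r_capped, "C": c_capped, "A": a_capped}


# Hypothetical raw levels with weak knowledge evidence.
capped = apply_dependency_caps(k=1, r=4, c=5, a=4)

# Co-Scientist profile reported for gpt-5.5 on this page.
unchanged = apply_dependency_caps(k=3.0, r=4.0, c=3.1, a=1.1)
```

In the first call, weak knowledge evidence pulls every other axis down; in the second, the reported profile already satisfies all four inequalities, so nothing changes.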

What ARCK reveals

Patterns surfaced by reading both leaderboards together — capability shapes, closed-vs-open gaps, and where the current evidence is concentrated.

Notable patterns

Closed vs open · top performers

Group | Conventional top | ARCK top
Closed-source | gemini-3.1-pro-preview · 66.2 | gpt-5.5 · 3.27
Open-source | Kimi-K2.6 · 62.1 | Kimi-K2.6 · 2.69

Top-1 coincides for open-source (Kimi-K2.6 on both leaderboards) but diverges for closed-source — gemini-3.1-pro-preview tops Conventional, gpt-5.5 tops ARCK.

Capability shapes

Reading K / R / C / A together surfaces model shapes the aggregate leaderboards compress. The per-axis view below uses the Co-Scientist decomposition, where the current evidence is strongest.

Frontier reasoner · gpt-5.5
K 3.0 · R 4.0 · C 3.1 · A 1.1
Only model exceeding R = 3.5. Balanced K and C, A still at floor.

Closed craft-strong · gemini-3.1-pro-preview, gemini-3-flash-preview
K 2.0 · R 3.0 · C ~2.9 · A 1.1
C catches up to closed-frontier level. K plateaus at 2, leaving a knowledge gap below gpt-5.5.

Open reasoner · Kimi-K2.6, MiMo-V2.5-Pro
K 1.9–2.8 · R 2.6–3.0 · C 1.6–1.8 · A 1.0
K and R rival the closed group. Craft and Autonomy drop the LSI back into the open-source range.

Closed lite · gpt-5.4-mini
K 1.9 · R 2.1 · C 1.0 · A 1.0
Closed API access alone does not imply the same capability profile: this model remains at the C = 1.0 floor.

Open craft-pivot · DeepSeek-V4-Pro
K 2.0 · R 1.8 · C 2.9 · A 1.0
Only open model with Craft at closed-frontier level. Reasoning lags, which caps the Co-Sci aggregate.

Long tail · Qwen3.5/3.6, gpt-oss-120b, Intern-S1/S2, gemma-4, GLM-5.1, MiniMax-M2.7, DeepSeek-V4-Flash
K 1.2–1.9 · R 1.3–2.2 · C ~1.0–1.3 · A 0.9–1.0
8+ models clustered at C ≤ 1.3 and A ≤ 1.0. Undifferentiated on the two output-side axes.

Co-Scientist · capability shape ARCK heatmap

One row per model, sorted by Co-Sci diagnostic. K / R / C / A cells are color-coded by value — brighter cells = higher capability. Scan columns to see where each axis bottoms out across the pool.

Two Co-Scientist bottlenecks · Across all 17 models, A never exceeds 1.1, because the current Co-Scientist tasks rarely require autonomous tool use or iterative action. 8 / 17 models bottom out at C = 1.0, making Craft one of the clearest separators between model groups.

Bio-Designer · modality scores Genome / Protein / Small Molecule · 0–100 scale

The modality view keeps Bio-Designer tied to its biological substrates: genomes and RNA, proteins, and small molecules. The Bio ARCK column summarizes the track-level gated evidence. The three modality columns show where the Bio-Designer leaderboard score is supported.

Genome is the most compressed modality: most models remain low, with only the closed-frontier group lifting clearly above the pack. Small Molecule is the least separated modality: top open and closed models are much closer there than on Genome.

Trace-level case study

Leaderboards show where models differ on average. This section follows a small set of selected Co-Scientist and Bio-Designer cases down to the model traces, where the difference between a score and a usable scientific answer becomes visible.

What this is

A trace-backed reading of selected cases where the model behavior can be inspected directly.

We filtered a shared model-output space with quantitative signals, then manually reviewed the candidate traces. The examples below highlight four trace patterns that are useful starting points for thinking about model behavior beyond aggregate scores. We hope they also offer useful signals for future model improvement and evaluator design.

Case-study evidence set
23 representative sub-tasks
4,306 aligned samples
9 models
38,754 sample × model traces
38 manually reviewed cases
10 case cards below, 10 unique samples

Trace × model overview

Before selecting individual cases, we scanned the full aligned trace space for the 23 representative sub-tasks. Each model has the same 4,306 sample traces here. The table summarizes each model before individual cases are opened: mean original verdict score, unscored-output rate, and median captured reasoning length. Length is counted in characters, not tokenizer-specific tokens.

Model | Type | Traces | Avg. score | Unscored output | Median reasoning chars
Kimi-K2.6 (core) | open | 4,306 | 0.57 | 3.4% | 51k
DeepSeek-V4-Pro (core) | open | 4,306 | 0.56 | 7.7% | 84k
gpt-5.5 (core) | closed | 4,306 | 0.64 | 0.1% | 6k*
gemini-3.1-pro-preview (core) | closed | 4,306 | 0.66 | 0.1% | 2k*
DeepSeek-V4-Flash (diagnostic) | open | 4,306 | 0.54 | 8.8% | 73k
Qwen3.5-397B-A17B (diagnostic) | open | 4,306 | 0.51 | 2.3% | 20k
Qwen3.5-122B-A10B (diagnostic) | open | 4,306 | 0.38 | 14.6% | 22k
gpt-5.4-mini (diagnostic) | closed | 4,306 | 0.56 | 1.4% | 36k*
gemini-3-flash-preview (diagnostic) | closed | 4,306 | 0.62 | 0.3% | 8k*

The case cards below are selected from this table's underlying trace space. The unscored-output column points to answer-completion cases, and the spread in character length helps separate verbosity from final-answer quality.

* Closed-source models generally do not expose full reasoning traces, so their median reasoning chars reflect only partial or summarized output and are shown for reference only.
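The three per-model summary columns can be computed along the following lines. This is a sketch over a hypothetical `Trace` record, not the evaluation code behind the table. It assumes an unscored output is stored as `score=None` and that the mean is taken over scored traces only; the page does not say whether unscored outputs count as zero.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional


@dataclass
class Trace:
    model: str
    score: Optional[float]  # None marks an unscored output (assumed encoding)
    reasoning: str          # captured reasoning text


def summarize(traces: list[Trace]) -> dict[str, dict[str, float]]:
    """Per-model mean score, unscored-output rate, and median reasoning length."""
    by_model: dict[str, list[Trace]] = {}
    for trace in traces:
        by_model.setdefault(trace.model, []).append(trace)

    summary = {}
    for model, rows in by_model.items():
        scored = [t.score for t in rows if t.score is not None]
        summary[model] = {
            # Mean over scored traces only (assumption, see lead-in).
            "avg_score": mean(scored) if scored else 0.0,
            "unscored_rate": (len(rows) - len(scored)) / len(rows),
            # len() on a str counts characters, not tokenizer tokens.
            "median_reasoning_chars": median(len(t.reasoning) for t in rows),
        }
    return summary
```

Counting characters rather than tokens keeps the length column comparable across models with different tokenizers, which matches the note above the table.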

Four trace patterns — pick a pattern, then a model chip

Each pattern uses representative cases to make one behavior concrete. Select a pattern below, then click a model chip to inspect its task prompt or context, reasoning trace, model output, and verdict.

The model writes a long trace but never lands a scorable final answer. The issue is mechanical and consequential: the task is not complete until the answer can be scored.

Case 8 · Bio-Designer · Genome

vep_traitgym: long RNA-variant traces without a final call

Kimi-K2.6, DeepSeek-V4-Pro, and DeepSeek-V4-Flash reasoned for hundreds of thousands of characters but emitted no scorable answer. gpt-5.5 and gemini-3.1-pro-preview returned the correct No option with short traces.

Pattern signal: The failure is output finalization rather than task impossibility: two reference models answer the same RNA-variant prompt correctly.
What this shows: on long sequence prompts, the decisive failure can be answer finalization rather than biological inference.

Case 21 · Co-Scientist · Factual Knowledge

lab_bench: Gibson primer selection stalls before the answer

The task asks for a primer pair for cloning waaA into HindII-linearized pUC19. DeepSeek-V4-Pro and DeepSeek-V4-Flash produced very long traces with no final prediction. gpt-5.5 and gemini-3.1-pro-preview selected the correct option, while Kimi-K2.6 also reached the correct option after a much longer trace.

Pattern signal: The model can stay on topic for a long time but still fail the scoring interface by never closing with a usable option.
What this shows: long protocol reasoning can stay on topic yet still stall before the final option required by the evaluator.
Reading notes. Each chip reports that model's score on one sample. Scores use the original task evaluator scale, not win rates. Many tasks use 0–1 scores, while ranking tasks may assign partial or negative correlation scores. Across this section, the 10 cards cover 10 unique samples. We filtered cases with quantitative signals, then reviewed the traces manually, so selection still involves judgment. Read the cases together with the aggregate tables and paired statistics: traces show what happened in concrete examples, while the tables show whether the pattern appears beyond a single case.
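The unscored-output pattern in the cases above can be made concrete with a small answer extractor. This is a sketch only: the "Final answer: X" convention, the regular expression, and the function name are hypothetical, and the real task evaluators may parse outputs differently.

```python
import re
from typing import Optional

# Matches "Final answer: B" or "final answer (C)". The phrase is matched
# case-insensitively; the option letter must be uppercase.
FINAL_ANSWER = re.compile(r"(?i:final answer)\s*[:\-]?\s*\(?([A-Z])\)?\b")


def extract_option(output: str, valid: str = "ABCD") -> Optional[str]:
    """Return the last stated option letter, or None if the output is unscorable."""
    matches = FINAL_ANSWER.findall(output)
    if not matches:
        return None  # long trace, no final call: counted as unscored
    choice = matches[-1]
    return choice if choice in valid else None


scored = extract_option("Primer pair 2 preserves the overlap. Final answer: B")
unscored = extract_option("Let me reconsider the overlap region. " * 1000)
```

Under this reading, trace length has no bearing on scorability: the second output is far longer than the first but yields `None` because it never states an option.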

Limitations

The benchmark pool is broad, but not uniform. Co-Scientist workflows and Bio-Designer modalities have direct probes. Agentic lab work and long-document synthesis are lighter in this run.

Roadmap · v0.2 These gaps are known and on our roadmap, not where ARCK stops. The next release will broaden coverage along these axes — including agentic and long-context evaluation — as the benchmark continues to grow.

Acknowledgements

The ARCK leaderboard is built on the public datasets below. We thank their authors and maintainers. Each dataset remains the property of its respective authors and is used under the license shown.

Dataset | License | Notes
AIME 2024 | Apache-2.0 | Apache-2.0 is from the upstream AI-MO source; the linked HuggingFaceH4 repo declares none.
AIME 2025 | MIT
DROP | CC-BY-SA-4.0 | Linked via the OpenAI simple-evals mirror; canonical source is AllenAI (ucinlp/drop).
GPQA Diamond | CC-BY-4.0 | Gated upstream (Idavidrein/gpqa); linked via the OpenAI simple-evals mirror.
HumanEval | MIT
IFEval | Apache-2.0
LiveCodeBench | CC (unspecified) | HF tag is a bare "cc" with no variant; underlying problems (LeetCode / AtCoder / Codeforces) have ambiguous redistribution terms.
MATH 500 | MIT | MIT is from the upstream OpenAI PRM800K source; the linked HuggingFaceH4 repo declares none.
MMLU | MIT
MMLU-Pro | MIT
T-Eval | Apache-2.0
ATLAS | CC-BY-NC-SA-4.0 | NonCommercial + ShareAlike per the dataset card badge. (HF machine-metadata says CC-BY-SA-4.0; using the more restrictive card value.)
DeepPrinciple | MIT | Org page; the science_chemistry / science_biology repos we use are MIT.
FrontierScience | Apache-2.0
GPQA | CC-BY-4.0 | Gated: requires accepting terms and agreeing not to publish examples.
HLE | MIT | Gated; the card requests no redistribution despite the MIT grant.
lab-bench | CC-BY-SA-4.0
SciPredict | MIT
BEACON | Mixed: Apache-2.0, CC0-1.0, GPL-3.0, MIT, No license stated, Public Domain, Research Purpose Only | Per-task licenses follow the BEACON paper's Table 22. bpRNA-1M is itself a mix of sub-source licenses (CRW / tmRDB / SRPDB / tRNADB / Rnase P / RFam / PDB).
ChemBench | MIT
GLRB | CC-BY-NC-SA-4.0 | NonCommercial + ShareAlike.
GUE | No license stated | No data license stated. virus_covid derives from GISAID, which has restrictive access terms.
LiveProteinBench | No license stated | No LICENSE file or license statement in the repo; the paper's CC BY 4.0 covers the manuscript only.
MoleculeNet | MIT | MIT covers the DeepChem curation; individual subsets retain their own upstream terms. Linked via a mirror tarball.
MoleculeQA | MIT
MolTextQA | CC-BY-4.0
mRNABench | Mixed: CC-BY-4.0, CC-BY-SA-4.0, MIT, No license stated | Umbrella benchmark with no per-task licenses; component licenses below are best-effort from the upstream sources. The mRNABench code repo is AGPL-3.0.
NT | No license stated | Dataset card states no license. The CC-BY-NC-SA-4.0 often cited is the model/code repo, not this data; it does not transfer.
OligoGym | CC-BY-4.0
PDFBench | MIT | Data repurposed from Mol-Instructions (HF nwliu/Molinst-SwissProtCLAP, MIT); the PDFBench code repo itself has no LICENSE file.
PRING | MIT
ProteinGym | MIT | Built on third-party DMS/clinical assays; cite the source-assay papers.
RNAGym | CC-BY-4.0 | Data on HF (Marks-lab/RNAgym) is CC-BY-4.0; the GitHub code repo is MIT.
s2bench | No license stated | TOMG-Bench; no license on the HF card or the GitHub repo.
SMolInstruct | CC-BY-4.0
TAPE | BSD-3-Clause
VenusX | Apache-2.0 | Datasets are Apache-2.0; the companion code repo is CC-BY-NC-ND-4.0 (NonCommercial + NoDerivatives).

† prompt-adapted · ‡ task-reformulated — underlying labels and gold answers are unchanged. See the coverage tables above for what was adapted.

BEACON component licenses: bpRNA-1M / CRW — No license stated; bpRNA-1M / tmRDB — Research Purpose Only; bpRNA-1M / SRPDB — Research Purpose Only; bpRNA-1M / tRNADB — Public Domain; bpRNA-1M / Rnase P — Public Domain; bpRNA-1M / RFam — CC0-1.0; bpRNA-1M / PDB — CC0-1.0; splice_ai — GPL-3.0; isoform — MIT; modification — MIT; noncoding_rna_family — Apache-2.0; programmable_rna_switches — No license stated; crispr_on_target — Apache-2.0; crispr_off_target — Apache-2.0

mRNABench component licenses: rna-loc-fazal — No license stated; protein-loc — CC-BY-SA-4.0; rna-lifecycle — No license stated; mirna-target — CC-BY-4.0; mrl — No license stated; te — No license stated; eclip — No license stated; GO — CC-BY-4.0; rnahl — CC-BY-4.0; mrl-egfp — No license stated; utr-variants — CC-BY-4.0; vep-traitgym — MIT


Open for collaboration

ARCK integrates and restructures existing benchmarks into a unified reporting coordinate system. Community review helps improve benchmark admission, score interpretation, and evidence reporting.

Contact us
Technical report · Coming soon

The companion technical materials will cover:

Methodology · Scale checks · Per-benchmark results · Adaptation provenance