ARCK — The Arc of Scientific Intelligence

General track

Category	Sub-tasks	ARCK tags	Sources
Knowledge	2	K1R1C1A1	MMLU, MMLU-Pro
Mathematics	3	K1-2R1-3C1A1	AIME 2024, AIME 2025, MATH 500
Logic and Reasoning	2	K1-2R2C1A1	DROP, GPQADiamond
Instruction Following	1	K1R1C1A1	IFEval
Code Generation	2	K1-2R2-3C2A1	LiveCodeBench, HumanEval
Tool Use	1	K1R2C1A1	T-Eval
Agentic Execution	—	A2+	No current tasks

Co-Scientist track

Category	Sub-tasks	ARCK tags	Sources
Factual Knowledge	3	K1-2R1C1A1	MMLU-Pro (Bio/Chem), lab-bench
Scientific Reasoning	8	K2R1-3C1-2A1	GPQA, ATLAS, HLE, DeepPrinciple^†
Evidence Critique	4	K2R2C1A1	FrontierScience (Bio/Chem Olympiad), SciPredict
Protocol Writing	2	K3R4C3A1	FrontierScience (Bio/Chem Research)
Agentic Collaboration	—	A2+	No current tasks

Bio-Designer track

Modality	Sub-tasks	ARCK tags	Sources
Genome	69	K2R1-2C1A1	OligoGym^‡, NT^‡, GUE^‡, GLRB^‡, BEACON^‡, mRNABench^‡, RNAGym^‡
Protein	59	K2R1-3C1-2A1-2	LiveProteinBench, VenusX^‡, PRING^‡, ProteinGym^‡, TAPE^‡, PDFBench
Small Molecule	54	K1-2R1-3C1-3A1-2	ChemBench, MoleculeQA, MolTextQA, MoleculeNet^‡, SMolInstruct^†, s2bench

Evaluation type	Chance baseline	Example (raw → displayed)
4-choice MCQ	0.25	MMLU 0.85 → 80.0
5-choice MCQ	0.20	GPQA 0.42 → 27.5
Generation (pass@1)	0.00	HumanEval 0.82 → 82.0
Spearman ρ	0.00	ProteinGym ρ=0.42 → 42.0
LLM judge (1-5)	0.20	ChemBench 3.8/5 → 70.0
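The examples in the table are all consistent with a linear rescaling in which the chance baseline maps to 0 and a perfect score maps to 100. The sketch below is inferred from those five rows rather than taken from the evaluation code. The function name `chance_adjusted` is hypothetical, and the choice to leave below-chance scores negative instead of clipping them is an assumption.

```python
def chance_adjusted(raw: float, chance: float) -> float:
    """Rescale a raw score in [0, 1] so that chance -> 0 and perfect -> 100.

    Inferred from the table rows; the real pipeline may clip or round differently.
    """
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must be in [0, 1)")
    return round(100.0 * (raw - chance) / (1.0 - chance), 1)


# (label, raw score on a 0-1 scale, chance baseline) taken from the table rows.
rows = [
    ("MMLU, 4-choice MCQ", 0.85, 0.25),
    ("GPQA, 5-choice MCQ", 0.42, 0.20),
    ("HumanEval, pass@1", 0.82, 0.00),
    ("ProteinGym, Spearman rho", 0.42, 0.00),
    ("ChemBench, judge 3.8/5", 3.8 / 5, 0.20),
]

for label, raw, chance in rows:
    print(f"{label}: {chance_adjusted(raw, chance)}")
```

The LLM-judge row first converts 3.8/5 to 0.76 and then applies the same rescaling with a baseline of 0.20, which reproduces the displayed 70.0.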

Trace-level case study

Aggregate leaderboard scores show where models differ on average. This section follows a small set of selected Co-Scientist and Bio-Designer cases down to the model traces, where the difference between a score and a usable scientific answer becomes visible.

What this is

A trace-backed reading of selected cases where the model behavior can be inspected directly.

We filtered a shared model-output space with quantitative signals, then manually reviewed the candidate traces. The examples below highlight four trace patterns that are useful starting points for thinking about model behavior beyond aggregate scores. We hope they also offer useful signals for future model improvement and evaluator design.

Case-study evidence set

23 representative sub-tasks

4,306 aligned samples

9 models

38,754 sample × model traces

38 manually reviewed cases

10 case cards below, 10 unique samples

Trace × model overview

Before selecting individual cases, we scanned the full aligned trace space for the 23 representative sub-tasks. Each model has the same 4,306 sample traces here. The table summarizes each model before individual cases are opened: mean original verdict score, unscored-output rate, and median captured reasoning length. Length is counted in characters, not tokenizer-specific tokens.

Model	Role	Type	Traces	Avg. score	Unscored output	Median reasoning chars
Kimi-K2.6	core	open	4,306	0.57	3.4%	51k
DeepSeek-V4-Pro	core	open	4,306	0.56	7.7%	84k
gpt-5.5	core	closed	4,306	0.64	0.1%	6k^*
gemini-3.1-pro-preview	core	closed	4,306	0.66	0.1%	2k^*
DeepSeek-V4-Flash	diagnostic	open	4,306	0.54	8.8%	73k
Qwen3.5-397B-A17B	diagnostic	open	4,306	0.51	2.3%	20k
Qwen3.5-122B-A10B	diagnostic	open	4,306	0.38	14.6%	22k
gpt-5.4-mini	diagnostic	closed	4,306	0.56	1.4%	36k^*
gemini-3-flash-preview	diagnostic	closed	4,306	0.62	0.3%	8k^*

The case cards below are selected from this table's underlying trace space. The unscored-output column points to the answer-completion cases: models with high rates often reason at length without emitting a scorable answer. The spread in reasoning length helps separate verbosity from final-answer quality.

^* Closed-source models generally do not expose full reasoning traces, so their median reasoning chars reflect only partial or summarized output and are shown for reference only.
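As a rough illustration of how the three per-model columns could be derived from raw traces, here is a minimal sketch. The record layout (`model`, `score`, `reasoning`) and the function name `summarize` are hypothetical. Treating an unscored output as a score of 0 in the mean is also an assumption, since the text does not say how unscored outputs enter the average.

```python
from statistics import mean, median
from typing import Dict, List, Optional, TypedDict


class Trace(TypedDict):
    model: str
    score: Optional[float]  # None when the output could not be scored
    reasoning: str          # captured reasoning text


def summarize(traces: List[Trace]) -> Dict[str, Dict[str, float]]:
    """Per-model mean score, unscored-output rate, and median reasoning length."""
    by_model: Dict[str, List[Trace]] = {}
    for trace in traces:
        by_model.setdefault(trace["model"], []).append(trace)

    summary: Dict[str, Dict[str, float]] = {}
    for model, rows in by_model.items():
        unscored = sum(1 for r in rows if r["score"] is None)
        summary[model] = {
            # Assumption: an unscored output contributes 0 to the mean.
            "avg_score": mean(r["score"] or 0.0 for r in rows),
            "unscored_rate": unscored / len(rows),
            # Length is counted in characters, not tokenizer-specific tokens.
            "median_reasoning_chars": median(len(r["reasoning"]) for r in rows),
        }
    return summary
```

Counting characters with `len` keeps the length column comparable across models that use different tokenizers, which matches the convention stated above.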

Four trace patterns

Each pattern uses representative cases to make one behavior concrete. For each case, the task prompt or context, reasoning trace, model output, and verdict can be inspected per model.

The model writes a long trace but never lands a scorable final answer. The issue is mechanical and consequential: the task is not complete until the answer can be scored.

Case 8 · Bio-Designer · Genome

vep_traitgym: long RNA-variant traces without a final call

Kimi-K2.6, DeepSeek-V4-Pro, and DeepSeek-V4-Flash reasoned for hundreds of thousands of characters but emitted no scorable answer. gpt-5.5 and gemini-3.1-pro-preview returned the correct "No" option with short traces.

Pattern signal: The failure is output finalization rather than task impossibility: two reference models answer the same RNA-variant prompt correctly.

What this shows: on long sequence prompts, the decisive failure can be answer finalization rather than biological inference.

Case 21 · Co-Scientist · Factual Knowledge

lab_bench: Gibson primer selection stalls before the answer

The task asks for a primer pair for cloning waaA into HindII-linearized pUC19. DeepSeek-V4-Pro and DeepSeek-V4-Flash produced very long traces with no final prediction. gpt-5.5 and gemini-3.1-pro-preview selected the correct option, while Kimi-K2.6 also reached the correct option after a much longer trace.

Pattern signal: The model can stay on topic for a long time but still fail the scoring interface by never closing with a usable option.

What this shows: long protocol reasoning can stay on topic yet still stall before the final option required by the evaluator.
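The two cases above turn on whether a scorable option can be recovered from the output at all. The sketch below shows one way a multiple-choice scorer might distinguish "wrong" from "unscored". The `Final answer: X` convention, the regular expression, and both function names are hypothetical; the parsing rules of the actual evaluators are not described here.

```python
import re
from typing import Optional

# Hypothetical convention: the model closes with "Final answer: <letter>".
# The lookahead rejects matches such as "Final answer: The ...".
FINAL_ANSWER = re.compile(
    r"final answer\s*[:\-]\s*\(?([A-Z])(?![A-Za-z])\)?", re.IGNORECASE
)


def extract_option(output: str) -> Optional[str]:
    """Return the last stated option letter, or None if no final call was made."""
    matches = FINAL_ANSWER.findall(output)
    return matches[-1].upper() if matches else None


def score(output: str, gold: str) -> Optional[float]:
    """1.0 for correct, 0.0 for wrong, None for an unscored output."""
    choice = extract_option(output)
    if choice is None:
        return None  # long trace, no usable option: counted as unscored
    return 1.0 if choice == gold.upper() else 0.0


long_trace = "The variant sits in a conserved region, so perhaps A, or maybe B... " * 50
print(score(long_trace, "B"))                     # None: reasoning without a final call
print(score("Short check. Final answer: B", "B"))  # 1.0
```

Returning `None` rather than `0.0` keeps finalization failures separable from incorrect answers when the unscored-output rate is computed.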

The model sees relevant evidence, but the final answer depends on which evidence it lets control the conclusion. In these selected cases, plausible but weaker cues pull some traces away from the scored answer.

Case 12 · Bio-Designer · Protein

TAPE Stability: many cues, wrong stability driver

The task asks for the only unstable miniprotein among four candidates. gpt-5.5, gemini-3.1-pro-preview, Kimi-K2.6, and DeepSeek-V4-Flash selected the correct candidate by focusing on Candidate 3's weak beta-strand pattern and poor core-packing signal, though Kimi-K2.6 and DeepSeek-V4-Flash reached it after long traces. DeepSeek-V4-Pro and gemini-3-flash-preview analyzed many protein-design cues but let other structural cues drive them to a different candidate.

Pattern signal: The wrong traces discuss topology, hydrophobic packing, and secondary-structure propensity, but let the wrong stability cue control the final choice.

What this shows: the useful signal is which stability cue controls the final choice, not how many protein-design features the trace lists.

Case 29 · Co-Scientist · Evidence Critique

SciPredict Biology: the observed direction matters

The task asks how running-wheel housing changed female mouse social investigation time. gpt-5.5, Kimi-K2.6, and both DeepSeek variants preserved the correct direction: wheel-housed females spent less time investigating. Gemini variants and gpt-5.4-mini answered no significant difference.

Pattern signal: The wrong traces consider the experimental setup and prior expectations, but underweight the specific reported direction of the study outcome.

What this shows: study-interpretation errors can come from losing the reported direction, even when the trace understands the experimental setup.

These cards compare stronger and lighter variants from the same model family on the same prompt. Aggregate paired comparisons still favor stronger variants more often, but individual traces show that scaling can change which evidence is used, how options are calibrated, or which final answer is selected.

Case 30 · Co-Scientist · SciPredict

SciPredict Biology: DAV result flips across families

The task asks which persistent viral infection showed reduced survival under L. plantarum supplementation. gpt-5.5 and gemini-3.1-pro-preview selected DAV, while their lighter variants moved to Nora or DAV+Nora. DeepSeek and Qwen show the opposite direction, with the lighter variants selecting DAV.

Pattern signal: Four same-family comparisons split on one study-interpretation target, and all variants produce scorable answers.

What this shows: same-family variants can flip which infection condition they treat as the significant survival result.

Case 35 · Bio-Designer · Protein Stability

TAPE stability ranking: one prompt, mixed scaling directions

The target ranking is 4,1,2,3. gpt-5.4-mini, gemini-3-flash-preview, and Qwen3.5-122B-A10B move closer to the ranking than their stronger siblings, while DeepSeek-V4-Pro is exact and DeepSeek-V4-Flash is not.

Pattern signal: The same miniprotein ranking prompt shows reverse scaling in three families and positive scaling in one family.

What this shows: ranking tasks expose calibration differences between nearby candidates, not just exact-match failures.

Case 37 · Co-Scientist · SciPredict

SciPredict Biology: WNK inhibition reverses the surface-level direction

The scored answer says RNF43 surface levels end higher than FZD5 after WNK inhibition. gpt-5.4-mini, gemini-3.1-pro-preview, and DeepSeek-V4-Flash follow that direction. gpt-5.5, gemini-3-flash-preview, and DeepSeek-V4-Pro reverse it. Both Qwen variants answer correctly.

Pattern signal: Three families split on the direction of the same HiBiT assay result, while one family remains stable.

What this shows: the same assay description can be anchored to opposite readout directions, and the trace makes that reversal visible.

Case 38 · Bio-Designer · Protein Function

LiveProteinBench GO: carrier-domain motif splits variants

The task asks for the molecular function from a long protein sequence. The scored answer is phosphopantetheine binding. gpt-5.5, both Gemini variants, DeepSeek-V4-Pro, and Qwen3.5-397B-A17B select the phosphopantetheine-binding option, while gpt-5.4-mini, DeepSeek-V4-Flash, and Qwen3.5-122B-A10B move to histone-binding or xylanase-style anchors.

Pattern signal: Three families split on whether to trust the phosphopantetheine carrier-domain motif or competing domain/function hypotheses.

What this shows: sequence-function calls can hinge on one motif family, and variants differ in whether they keep it as the deciding evidence.
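The same-family cards above are individual instances of a paired comparison: the stronger and lighter variant are scored on the same sample, and the outcomes are tallied. A minimal sketch of such a tally follows. The function name `tally_pairs` and the input layout (sample id mapped to score) are hypothetical, and the scores in the usage example are made up for illustration, not taken from the benchmark.

```python
from typing import Dict


def tally_pairs(stronger: Dict[str, float], lighter: Dict[str, float]) -> Dict[str, int]:
    """Count per-sample outcomes for two variants of one model family.

    Only samples scored for both variants are compared.
    """
    counts = {"stronger_wins": 0, "lighter_wins": 0, "ties": 0}
    for sample_id in stronger.keys() & lighter.keys():
        s, l = stronger[sample_id], lighter[sample_id]
        if s > l:
            counts["stronger_wins"] += 1
        elif l > s:
            counts["lighter_wins"] += 1
        else:
            counts["ties"] += 1
    return counts


# Illustrative scores only.
stronger_scores = {"s1": 1.0, "s2": 0.0, "s3": 1.0, "s4": 1.0}
lighter_scores = {"s1": 0.0, "s2": 1.0, "s3": 1.0, "s4": 0.0}
print(tally_pairs(stronger_scores, lighter_scores))
```

A tally like this can favor the stronger variant overall while still containing individual samples where the lighter variant wins, which is the situation the cards illustrate.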

Trace viewer

Task context, traces, outputs, and verdicts come from embedded captured case-study data. When a saved model-facing task block is available, the task tab shows it directly. Otherwise it shows task context recovered from the captured reasoning trace. The homepage starts each field in a collapsed viewport for readability. The field text itself is not shortened in JavaScript. Reasoning colors are a reading aid over escaped raw text.

Reading notes. Each chip reports that model's score on one sample. Scores use the original task evaluator scale, not win rates. Many tasks use 0–1 scores, while ranking tasks may assign partial or negative correlation scores. Across this section, the 10 cards cover 10 unique samples. We filtered cases with quantitative signals, then reviewed the traces manually, so selection still involves judgment. Read the cases together with the aggregate tables and paired statistics: traces show what happened in concrete examples, while the tables show whether the pattern appears beyond a single case.
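To make the remark about partial or negative correlation scores concrete, the sketch below scores a predicted ordering against a target ordering with Spearman's rank correlation. Using Spearman here is an assumption, since the exact metric of each ranking evaluator is not stated. The function name is hypothetical, and the non-exact orderings in the usage lines are illustrative rather than actual model outputs.

```python
from typing import List, Sequence


def spearman_from_orderings(target: Sequence[int], predicted: Sequence[int]) -> float:
    """Spearman rho between two orderings of the same candidates (no ties)."""
    if sorted(target) != sorted(predicted):
        raise ValueError("both orderings must contain the same candidates")
    n = len(target)
    if n < 2:
        raise ValueError("need at least two candidates")
    target_pos = {cand: i for i, cand in enumerate(target)}
    pred_pos = {cand: i for i, cand in enumerate(predicted)}
    d_squared = sum((target_pos[c] - pred_pos[c]) ** 2 for c in target)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


target: List[int] = [4, 1, 2, 3]  # target ranking from Case 35
print(spearman_from_orderings(target, [4, 1, 2, 3]))  # exact match
print(spearman_from_orderings(target, [4, 2, 1, 3]))  # one adjacent swap: partial credit
print(spearman_from_orderings(target, [3, 2, 1, 4]))  # fully reversed: negative score
```

Under this metric an adjacent swap keeps most of the credit while a reversed ordering scores below zero, so a chip on a ranking task can show a value other than 0 or 1.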

Dataset	License	Notes
AIME 2024	Apache-2.0	Apache-2.0 is from the upstream AI-MO source; the linked HuggingFaceH4 repo declares none.
AIME 2025	MIT	—
DROP	CC-BY-SA-4.0	Linked via the OpenAI simple-evals mirror; canonical source is AllenAI (ucinlp/drop).
GPQADiamond	CC-BY-4.0	Gated upstream (Idavidrein/gpqa); linked via the OpenAI simple-evals mirror.
HumanEval	MIT	—
IFEval	Apache-2.0	—
LiveCodeBench	CC (unspecified)	HF tag is a bare "cc" with no variant; underlying problems (LeetCode / AtCoder / Codeforces) have ambiguous redistribution terms.
MATH 500	MIT	MIT is from the upstream OpenAI PRM800K source; the linked HuggingFaceH4 repo declares none.
MMLU	MIT	—
MMLU-Pro	MIT	—
T-Eval	Apache-2.0	—
ATLAS	CC-BY-NC-SA-4.0	NonCommercial + ShareAlike per the dataset card badge. (HF machine-metadata says CC-BY-SA-4.0; using the more restrictive card value.)
DeepPrinciple^†	MIT	Org page; the science_chemistry / science_biology repos we use are MIT.
FrontierScience	Apache-2.0	—
GPQA	CC-BY-4.0	Gated — requires accepting terms and agreeing not to publish examples.
HLE	MIT	Gated; the card requests no redistribution despite the MIT grant.
lab-bench	CC-BY-SA-4.0	—
SciPredict	MIT	—
BEACON^‡	Mixed — Apache-2.0, CC0-1.0, GPL-3.0, MIT, No license stated, Public Domain, Research Purpose Only	Per-task licenses follow the BEACON paper's Table 22. bpRNA-1M is itself a mix of sub-source licenses (CRW / tmRDB / SRPDB / tRNADB / RNase P / Rfam / PDB).
ChemBench	MIT	—
GLRB^‡	CC-BY-NC-SA-4.0	NonCommercial + ShareAlike.
GUE^‡	No license stated	No data license stated. virus_covid derives from GISAID, which has restrictive access terms.
LiveProteinBench	No license stated	No LICENSE file or license statement in the repo; the paper's CC BY 4.0 covers the manuscript only.
MoleculeNet^‡	MIT	MIT covers the DeepChem curation; individual subsets retain their own upstream terms. Linked via a mirror tarball.
MoleculeQA	MIT	—
MolTextQA	CC-BY-4.0	—
mRNABench^‡	Mixed — CC-BY-4.0, CC-BY-SA-4.0, MIT, No license stated	Umbrella benchmark with no per-task licenses; component licenses below are best-effort from the upstream sources. The mRNABench code repo is AGPL-3.0.
NT^‡	No license stated	Dataset card states no license. The CC-BY-NC-SA-4.0 often cited is the model/code repo, not this data — it does not transfer.
OligoGym^‡	CC-BY-4.0	—
PDFBench	MIT	Data repurposed from Mol-Instructions (HF nwliu/Molinst-SwissProtCLAP, MIT); the PDFBench code repo itself has no LICENSE file.
PRING^‡	MIT	—
ProteinGym^‡	MIT	Built on third-party DMS/clinical assays; cite the source-assay papers.
RNAGym^‡	CC-BY-4.0	Data on HF (Marks-lab/RNAgym) is CC-BY-4.0; the GitHub code repo is MIT.
s2bench	No license stated	TOMG-Bench; no license on the HF card or the GitHub repo.
SMolInstruct^†	CC-BY-4.0	—
TAPE^‡	BSD-3-Clause	—
VenusX^‡	Apache-2.0	Datasets are Apache-2.0; the companion code repo is CC-BY-NC-ND-4.0 (NonCommercial + NoDerivatives).

Why ARCK?

Leaderboards

Conventional Overall · 0–100

ARCK LSI · derived index

Conventional leaderboard 0–100 · chance-adjusted

ARCK LSI leaderboard LSI · derived index · ARCK-gated

How the two leaderboards differ

How ARCK works

Dependency rules

General

Co-Scientist

Bio-Designer

Benchmark coverage

General track

Co-Scientist track

Bio-Designer track

Shared step · chance-adjusted normalization

Conventional pipeline · produces Overall

ARCK pipeline · produces LSI

LSI range interpretation

K1 R1 C1 A1

K2 R2 C2 A2

K3 R3 C3 A3

K4 R4 C4 A4

K5 R5 C5 A5

What ARCK reveals

Notable patterns

Closed vs open · top performers

Capability shapes

Co-Scientist · capability shape ARCK heatmap

Bio-Designer · modality scores Genome / Protein / Small Molecule · 0–100 scale

Trace-level case study

Trace × model overview

Four trace patterns — pick a pattern, then a model chip

vep_traitgym: long RNA-variant traces without a final call

lab_bench: Gibson primer selection stalls before the answer

TAPE Stability: many cues, wrong stability driver

SciPredict Biology: the observed direction matters

Biology Research: plausible plans meet a narrow rubric

toxicity_and_safety: plausible hazards, single scored target

SciPredict Biology: DAV result flips across families

TAPE stability ranking: one prompt, mixed scaling directions

SciPredict Biology: WNK inhibition reverses the surface-level direction

LiveProteinBench GO: carrier-domain motif splits variants

Limitations

Acknowledgements

Open for collaboration