From benchmark scores to capability maps for life-science AI — across reasoning, research workflows, and biological design.
Why ARCK?
As foundation models move into the life sciences, evaluation is becoming both richer and more fragmented. New benchmarks provide valuable evidence, but they often probe isolated slices of capability, and important dimensions of scientific utility remain weakly covered. As a result, model comparison is limited not only by benchmark quality and coverage, but also by the absence of a common framework for interpreting heterogeneous results.
ARCK was designed to provide that framework. It organizes benchmark evidence across General, Co-Scientist, and Bio-Designer tracks, mapping diverse results into a shared reporting space. The result is a view of capability structure, track-level trade-offs, and evidence coverage: what models can support today, and where benchmark coverage is still thin.
Two leaderboards read the same benchmark pool two ways. The Conventional Leaderboard follows the standard evaluation pipeline: chance-adjusted sub-task scores are aggregated into category scores, track scores, and an overall 0–100 score. It answers the familiar question of relative benchmark performance.
The ARCK Leaderboard maps each sub-task result through ARCK capability gates before aggregation, producing 1–5 K/R/C/A axis scores, per-track ARCK scores, and an aggregate Life-Science Intelligence (LSI) index. LSI is weighted as 0.2 General, 0.4 Co-Scientist, and 0.4 Bio-Designer across the three tracks. This second view shifts the emphasis from raw performance to capability structure, track-level trade-offs, and evidence coverage.
Note · Claude series absent · The Claude models are not listed because they returned refusals on a substantial share of the suite's adversarial and bio-design prompts, leaving too many tasks unscored to rank them under the current protocol.
For closed-source models the saturated high-volume General sets are downsampled — MMLU (570, stratified), MMLU-Pro (700, stratified), DROP (500) — since full runs add little discriminative signal at real inference cost; every other benchmark stays full. The effect on the General score is under 0.1 points with no rank changes, so the columns stay comparable.
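A minimal sketch of a stratified downsample, for illustration only. The source gives the target sizes and says the MMLU and MMLU-Pro draws are stratified; the equal per-stratum quota, the fixed seed, and the `stratified_sample` helper itself are assumptions, not the protocol's actual sampler.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, total, seed=0):
    """Draw `total` items, split as evenly as possible across strata.

    Assumption: equal allocation per stratum (not proportional allocation).
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    names = sorted(strata)
    base, extra = divmod(total, len(names))
    sample = []
    for i, name in enumerate(names):
        quota = base + (1 if i < extra else 0)
        pool = strata[name]
        # Never request more items than the stratum holds.
        sample.extend(rng.sample(pool, min(quota, len(pool))))
    return sample
```

If the strata were the 57 MMLU subjects, a target of 570 would give 10 items per subject under this allocation.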
How ARCK works
Four capability dimensions (K / R / C / A), three task tracks with their benchmark coverage, and the two scoring pipelines that produce the Overall score (conventional) and the LSI (ARCK-gated).
ARCK separates evidence by four capability dimensions. Knowledge captures factual and domain grounding. Reasoning captures inference, causal analysis, and hypothesis formation. Craft captures constrained generation and structured outputs. Autonomy captures planning, tool use, self-checking, and iterative revision within a defined evaluation setting.
K · Knowledge · atomic
Factual breadth, domain depth, evidence grounding, and awareness of frontier limitations.
R · Reasoning · atomic
Logical inference, causal modeling, hypothesis generation, and paradigm-level analysis.
C · Craft · atomic
Constrained generation and structured outputs, including code, molecules, proteins, and protocol artifacts.
A · Autonomy · emergent
Planning, tool use, self-checking, multi-step execution, and feedback-based revision under a specified evaluation environment.
1–5 capability anchors
The 1–5 anchors provide a calibrated scale for interpreting benchmark evidence. They indicate what level of capability a task is meant to test, rather than assigning a scientific role to the model. ARCK scores are derived only from observed results under the current protocol, so they reflect measured evidence rather than broad claims of competence.
K — Knowledge
1 · Foundational · Recall established concepts, terminology, and standard facts.
2 · Professional · Apply domain-specific knowledge to clearly scoped questions.
3 · Specialist · Evaluate methods, assumptions, evidence, and limitations in domain studies.
4 · Integrative · Connect mechanisms across subfields and identify plausible cross-domain links.
5 · Frontier · Reason about unresolved questions, boundary conditions, and competing frontier hypotheses.
R — Reasoning
1 · Direct · Solve single-step inference tasks with explicit premises.
2 · Multi-step · Combine several stated facts or results into a supported conclusion.
3 · Causal · Analyze mechanisms, interventions, confounding, and counterfactual outcomes.
4 · Hypothesis · Generate testable explanations and discriminate among alternatives.
5 · Conceptual · Reframe assumptions or propose new explanatory frameworks from evidence.
C — Craft
1 · Discriminative · Classify, rank, or score given objects under explicit criteria.
2 · Single-objective · Generate one artifact satisfying a primary target or constraint.
3 · Multi-objective · Balance multiple objectives such as activity, safety, and feasibility.
4 · Hard-constraint · Satisfy strict structural, functional, or synthesis constraints.
5 · Systems-level · Design coordinated multi-component interventions with feedback and failure modes.
A — Autonomy
1 · Single-step · Complete an isolated instruction without external tools or feedback.
2 · Tool-following · Use specified tools or data sources according to an explicit procedure.
3 · Adaptive · Select appropriate tools, check intermediate results, and revise local steps.
4 · Pipeline control · Coordinate multi-step workflows with error handling and traceable decisions.
5 · Closed-loop · Plan and execute iterative, evolutionary workflows with uncertainty tracking, result feedback, and candidate selection.
Dependency rules
Conservative scoring constraints that prevent a high score on one dimension from implying unsupported capability on another — for example, strong generation evidence should not by itself imply strong reasoning. These caps are applied only during diagnostic level mapping.
R ≤ K + 2 · Reasoning is bounded by knowledge evidence
C ≤ R + 1 · Craft is bounded by reasoning evidence
C ≤ K + 2 · Craft is bounded by knowledge evidence
A ≤ min(K, R) + 1 · Autonomy depends on both knowledge and reasoning
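The four caps can be sketched as a small function over diagnostic levels. This is an illustration under stated assumptions: the source lists the inequalities but not the order in which they are applied, so the K → R → C → A ordering (each cap using already-capped values) and the `apply_dependency_caps` name are hypothetical.

```python
def apply_dependency_caps(k, r, c, a):
    """Apply the four dependency caps to diagnostic levels.

    Assumption: caps are applied in K -> R -> C -> A order, and each cap
    uses the already-capped value of the axes it depends on.
    """
    r = min(r, k + 2)          # R <= K + 2
    c = min(c, r + 1, k + 2)   # C <= R + 1 and C <= K + 2
    a = min(a, min(k, r) + 1)  # A <= min(K, R) + 1
    return k, r, c, a
```

Under this ordering, diagnostic levels (K, R, C, A) = (1, 5, 5, 5) are capped to (1, 3, 3, 2), while a profile that already satisfies every inequality passes through unchanged.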
The three tracks define the task families used for the leaderboards and diagnostic profiles. General is a baseline. Co-Scientist captures scientific workflow support. Bio-Designer captures domain-specific biological design.
Adaptation markers · Some sources were built for small, task-specific models or ship only raw data, so we adapt them into chat-LLM–answerable text tasks — the underlying labels and gold answers are never changed. Sources with no mark keep their upstream LLM prompt verbatim. Presentation-only packaging (flat-chat templates, special tokens, standard MCQ assembly from upstream options) is not counted as a rewrite. Markers below are dataset-level. For partially-adapted sets they flag the most-adapted sub-task. Per-sub-task provenance is included in the technical materials.
† prompt-adapted — upstream provided labels or option lists but no natural-language question, so we authored only the question stem.
‡ task-reformulated — upstream shipped only raw data (sequences, SMILES, variants, residue arrays, or numeric values), so we constructed both the question and the task format (MCQ / ranking / Yes–No / value binning).
Every source name in the tables below links to its dataset or code repository ↗ — hover a name to see its license.
Bio-Designer is organized by modality because the first split is scientifically meaningful: genome and RNA tasks, protein tasks, and small-molecule chemistry probe different forms of biological design. Each modality carries its own K / R / C / A coverage.
Two pipelines run over the same evidence pool. The Conventional pipeline aggregates sub-task scores by track and category to produce Overall in [0, 100]. The ARCK pipeline follows the residual LSI scorer: it groups normalized sub-task scores by K/R/C/A axis and level, applies a versioned track-aware threshold table, and aggregates capability estimates into LSI. Both share the same chance-adjusted normalization step. Only the two anchor formulas are highlighted — intermediate steps live inline.
Raw scores are adjusted for random baselines before averaging, so identical raw accuracies are not treated as equivalent when the underlying chance baselines differ. Both pipelines start here.
normalized_score = φ(raw; chance, 1, 1)
φ(s; a, b, ρ) = ρ·clamp((s-a)/(b-a), 0, 1)
ARCK gates use normalized_score in [0, 1]. The Conventional leaderboard reports 100 × normalized_score.
Evaluation type | Chance baseline | Example (raw → displayed)
4-choice MCQ | 0.25 | MMLU 0.85 → 80.0
5-choice MCQ | 0.20 | GPQA 0.42 → 27.5
Generation (pass@1) | 0.00 | HumanEval 0.82 → 82.0
Spearman ρ | 0.00 | ProteinGym ρ=0.42 → 42.0
LLM judge (1-5) | 0.20 | ChemBench 3.8/5 → 70.0
For non-MCQ tasks, the chance baseline is the expected score of a null model under that scoring metric (random generation has pass@1 = 0, and random predictions have Spearman ρ = 0). See the technical report for per-metric null-distribution measurements.
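The normalization above can be sketched directly from the two formulas. The function names and the zero-width guard are assumptions. The example values are the ones listed in the chance-baseline table, with the LLM-judge score first rescaled to 3.8/5 = 0.76.

```python
def phi(s, a, b, rho):
    """phi(s; a, b, rho) = rho * clamp((s - a) / (b - a), 0, 1)."""
    if b <= a:  # assumption: a zero-width interval contributes nothing
        return 0.0
    return rho * max(0.0, min(1.0, (s - a) / (b - a)))


def normalized_score(raw, chance):
    """Chance-adjusted score in [0, 1], as used by the ARCK gates."""
    return phi(raw, chance, 1.0, 1.0)


def displayed_score(raw, chance):
    """Conventional leaderboard value: 100 x normalized_score, one decimal."""
    return round(100 * normalized_score(raw, chance), 1)


mmlu = displayed_score(0.85, 0.25)         # 4-choice MCQ
gpqa = displayed_score(0.42, 0.20)         # 5-choice MCQ
humaneval = displayed_score(0.82, 0.00)    # pass@1
chembench = displayed_score(3.8 / 5, 0.20) # LLM judge on a 1-5 scale
```

These reproduce the table's displayed values (80.0, 27.5, 82.0, 70.0). A raw score at or below chance clamps to 0 rather than going negative.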
Conventional pipeline · produces Overall
Pure arithmetic aggregation — no gating, no level mapping. Each step is a mean.
1 · Normalize sub-tasks
Apply the chance-adjusted normalization above. Each sub-task has a normalized score in [0, 1], displayed as [0, 100] in the Conventional leaderboard.
2 · Mean within category
Within each track, categories (e.g., Knowledge / Math / Code for General) average their member sub-tasks.
3 · Mean within track
Categories average to a track-level score: one for General, one for Co-Scientist, one for Bio-Designer.
4 · Mean across tracks → Overall
Average the three track scores.
Output · the Overall column in the Conventional Leaderboard above.
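The four steps above amount to a mean of means. A minimal sketch follows. The nested-dictionary layout, the `conventional_overall` name, and every number in `example` are illustrative assumptions, not leaderboard data. The category names outside General are placeholders.

```python
from statistics import mean


def conventional_overall(scores):
    """scores: {track: {category: [normalized sub-task scores in [0, 1]]}}.

    Returns (per-track scores, Overall), both on the 0-100 display scale.
    """
    track_scores = {
        track: mean(mean(subtasks) for subtasks in categories.values())
        for track, categories in scores.items()
    }
    overall = mean(track_scores.values())
    return {t: 100 * s for t, s in track_scores.items()}, 100 * overall


# Made-up normalized scores, for illustration only.
example = {
    "General": {"Knowledge": [0.8, 0.6], "Math": [0.9]},
    "Co-Scientist": {"Protocol": [0.5]},
    "Bio-Designer": {"Protein": [0.2, 0.4]},
}
tracks, overall = conventional_overall(example)
```

Because each level is averaged separately, a category with many sub-tasks does not outweigh a small one. In the example, General is the mean of the two category means (0.7 and 0.9), not the pooled mean of its three sub-tasks.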
After normalization, sub-tasks are grouped by ARCK level tag, gated per axis, and rolled up through capability-aware aggregation. The two anchor formulas:
K/R/C/A axis scores use 1–5 anchors. Per-track ARCK and aggregate LSI are derived indices, not clipped to 5.
In v0.1, aggregate LSI sits around 1.6–3.3 and per-track ARCK around 1.2–4.2 — the nominal ceiling is 11.25 at integer K=R=C=A=5.
Weights define the reporting aggregate, not a fixed operational utility per LSI point.
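As a hedged illustration of the two aggregates: the source describes the per-track score as the harmonic mean H of Knowledge, Reasoning, and Craft with Autonomy entering multiplicatively, and gives LSI weights of 0.2 / 0.4 / 0.4. The sketch below uses one form consistent with that description and with the stated 11.25 ceiling. The specific multiplier (1 + 0.25·A), expressed as `AUTONOMY_GAIN`, is an assumption and is not taken from the source.

```python
AUTONOMY_GAIN = 0.25  # ASSUMPTION: chosen so that K=R=C=A=5 gives 11.25

# LSI track weights as stated in the source.
LSI_WEIGHTS = {"General": 0.2, "Co-Scientist": 0.4, "Bio-Designer": 0.4}


def harmonic_mean(values):
    """Harmonic mean; zero if any input is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def track_arck(k, r, c, a, gain=AUTONOMY_GAIN):
    """Per-track ARCK under the assumed form H(K, R, C) * (1 + gain * A)."""
    return harmonic_mean([k, r, c]) * (1.0 + gain * a)


def lsi(track_scores):
    """Weighted sum of the three per-track ARCK scores."""
    return sum(w * track_scores[t] for t, w in LSI_WEIGHTS.items())
```

With the rounded Co-Scientist axis values reported for gpt-5.5 (K 3.0, R 4.0, C 3.1, A 1.1), this assumed form evaluates to about 4.2, in line with the upper end of the reported per-track range. That agreement does not confirm the formula.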
The five intermediate steps below show how raw evidence reaches that aggregation. Per-level gating formulas (τ threshold, interpolation) are inlined within each step.
1 · Tag & group sub-tasks by ARCK level
Each sub-task has an expert annotation per axis (e.g., K=2, R=3). Within each track, group sub-tasks by axis + level — the K=1 group, K=2 group, K=3 group, …
2 · Per-level mastery & gate
For each track, axis, and level, mastery is the weighted mean of normalized sub-task scores in that level group. A measured level passes when mastery reaches its track-aware threshold.
τ_n = τ(track, axis, n)
τ(·) is defined by the versioned TAU_TRACK_V0_1 table. Examples: General K/R = 0.70/0.80/0.90/0.95/1.00; Co-Scientist C = 0.20/0.30/0.45/0.55/0.65; Bio-Designer A = 0.30/0.90/0.95/1.00/1.00.
3 · Map gated levels → axis score
An axis score is the highest passed integer level plus a bounded margin.
D_axis = m + φ(s; a, b, ρ)
Use (s, a, b, ρ) = (s_n, τ_m, τ_n, 1) at the first failed measured level n. Use (s_m, τ_m, 1, β) when all measured levels pass and τ_m < 1. m is the highest passed measured level. v0.1 uses β = 0.25.
4 · Aggregate ARCK → per-track ARCK
Combine the four axis scores within each track. H is the harmonic mean of Knowledge, Reasoning, and Craft, so a weak base axis lowers the track score. Autonomy enters multiplicatively and cannot replace weak K/R/C evidence. If any of K/R/C is zero, the track score is zero. See ARCK score formula ↑
5 · Track scores → LSI
Weighted sum across the three tracks · 0.2·General + 0.4·Co-Sci + 0.4·Bio.
Output · the per-track ARCK columns and the LSI column in the ARCK Leaderboard above, plus the ARCK cells of the Capability Profile heatmaps below.
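Steps 1 to 3 above can be sketched as follows, using the Co-Scientist Craft thresholds quoted in step 2. Several details are assumptions, labeled in the comments: equal sub-task weights (the source says only "weighted mean"), τ_0 = 0 when no level has passed, a zero margin when the interval has zero width, and all function names. The sub-task scores are hypothetical.

```python
from collections import defaultdict

BETA = 0.25  # v0.1 margin scale when every measured level passes


def phi(s, a, b, rho):
    """phi(s; a, b, rho) = rho * clamp((s - a) / (b - a), 0, 1)."""
    if b <= a:  # assumption: a zero-width interval adds no margin
        return 0.0
    return rho * max(0.0, min(1.0, (s - a) / (b - a)))


def mastery_by_level(subtasks, axis):
    """Steps 1-2: group scores by annotated level, then average each group.

    Assumption: equal weights within a level group.
    """
    groups = defaultdict(list)
    for task in subtasks:
        if axis in task["levels"]:
            groups[task["levels"][axis]].append(task["score"])
    return {level: sum(v) / len(v) for level, v in groups.items()}


def axis_score(mastery, tau, beta=BETA):
    """Step 3: highest passed measured level plus a bounded margin."""
    m, tau_m, s_m = 0, 0.0, 0.0  # assumption: tau_0 = 0 before any pass
    for n in sorted(mastery):
        s_n, tau_n = mastery[n], tau[n]
        if s_n < tau_n:  # first failed measured level
            return m + phi(s_n, tau_m, tau_n, 1.0)
        m, tau_m, s_m = n, tau_n, s_n
    if m > 0 and tau_m < 1.0:  # every measured level passed
        return m + phi(s_m, tau_m, 1.0, beta)
    return float(m)


# Co-Scientist Craft thresholds quoted in the source (TAU_TRACK_V0_1 excerpt).
TAU_COSCI_C = dict(enumerate([0.20, 0.30, 0.45, 0.55, 0.65], start=1))

# Hypothetical sub-task results, for illustration only.
subtasks = [
    {"score": 0.50, "levels": {"C": 1}},
    {"score": 0.30, "levels": {"C": 2}},
    {"score": 0.40, "levels": {"C": 2}},
    {"score": 0.375, "levels": {"C": 3}},
]
craft = axis_score(mastery_by_level(subtasks, "C"), TAU_COSCI_C)
```

In this example levels 1 and 2 pass, and level 3 fails with mastery 0.375 against τ_3 = 0.45, so the Craft score is 2 + (0.375 − 0.30) / (0.45 − 0.30) = 2.5.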
The two leaderboards answer different questions: Conventional Overall gives the direct benchmark ranking. ARCK LSI shows how the same evidence changes when grouped by Knowledge, Reasoning, Craft, and Autonomy.
LSI range interpretation
Interpretation anchors for the derived LSI index. Current v0.1 model scores occupy only the lower part of this theoretical scale. These are reporting guides, not deployment role labels.
L1 · K1 R1 C1 A1 · Literature summarization, basic QA, simple property lookup
Theoretical upper range · Paradigm discovery, systems-level drug design, autonomous evolutionary research workflows
What ARCK reveals
Patterns surfaced by reading both leaderboards together — capability shapes, closed-vs-open gaps, and where the current evidence is concentrated.
Notable patterns
gpt-5.5 tops closed-source on the ARCK leaderboard (LSI ≈ 3.27) and leads the closed pool by a clear margin — here, gating widens a gap that Conventional scoring compresses. On Conventional it lands behind gemini-3.1-pro-preview (64.8 vs 66.2 Overall), so the two leaderboards disagree on the closed-source leader.
Kimi-K2.6 leads open-source on both leaderboards (Conventional Overall ≈ 62, ARCK LSI ≈ 2.7), boosted on Conventional by the highest General-track score in the open pool (≈ 91).
The top three closed-source models by ARCK score (gpt-5.5, gemini-3.1-pro-preview, gemini-3-flash-preview) all reach R ≥ 3.0 and C ≥ 2.88 on the Co-Scientist axis decomposition, while no open model reaches C ≥ 3.0.
Most open-source models cluster at C ≈ 1.0 and A ≈ 1.0 in the ARCK Co-Sci profile, indicating that Craft and Autonomy are currently the weakest discriminators within the open pool.
On the General track the frontier closed models now lead — gemini-3.1-pro-preview tops it at 92.6, just above the best open model (Kimi-K2.6, 90.6) — but General benchmarks are largely saturated, so these top-end gaps are small.
Closed vs open · top performers
Group | Conventional top | ARCK top
Closed-source | gemini-3.1-pro-preview · 66.2 | gpt-5.5 · 3.27
Open-source | Kimi-K2.6 · 62.1 | Kimi-K2.6 · 2.69
Top-1 coincides for open-source (Kimi-K2.6 on both leaderboards) but diverges for closed-source — gemini-3.1-pro-preview tops Conventional, gpt-5.5 tops ARCK.
Capability shapes
Reading K / R / C / A together surfaces model shapes the aggregate leaderboards compress. The per-axis view below uses the Co-Scientist decomposition, where the current evidence is strongest.
Frontier reasoner
gpt-5.5
K 3.0 · R 4.0 · C 3.1 · A 1.1
Only model exceeding R = 3.5. Balanced K and C, A still at floor.
Closed craft-strong
gemini-3.1-pro-preview, gemini-3-flash-preview
K 2.0 · R 3.0 · C ~2.9 · A 1.1
C catches up to closed-frontier level. K plateaus at 2, leaving a knowledge gap below gpt-5.5.
Open reasoner
Kimi-K2.6, MiMo-V2.5-Pro
K 1.9–2.8 · R 2.6–3.0 · C 1.6–1.8 · A 1.0
K and R rival the closed group. Craft and Autonomy drop the LSI back into the open-source range.
Closed lite
gpt-5.4-mini
K 1.9 · R 2.1 · C 1.0 · A 1.0
Closed API access alone does not imply the same capability profile: this model remains at the C = 1.0 floor.
Open craft-pivot
DeepSeek-V4-Pro
K 2.0 · R 1.8 · C 2.9 · A 1.0
Only open model with Craft at closed-frontier level. Reasoning lags, which caps the Co-Sci aggregate.
8+ models clustered at C ≤ 1.3 and A ≤ 1.0. Undifferentiated on the two output-side axes.
Co-Scientist · capability shape ARCK heatmap
One row per model, sorted by Co-Sci diagnostic. K / R / C / A cells are color-coded by value — brighter cells = higher capability. Scan columns to see where each axis bottoms out across the pool.
Two Co-Scientist bottlenecks · Across all 17 models, A never exceeds 1.1, because the current Co-Scientist tasks rarely require autonomous tool use or iterative action. 8 / 17 models bottom out at C = 1.0, making Craft one of the clearest separators between model groups.
Bio-Designer · modality scores Genome / Protein / Small Molecule · 0–100 scale
The modality view keeps Bio-Designer tied to its biological substrates: genomes and RNA, proteins, and small molecules. The Bio ARCK column summarizes the track-level gated evidence. The three modality columns show where the Bio-Designer leaderboard score is supported.
Genome is the most compressed modality: most models remain low, with only the closed-frontier group lifting clearly above the pack. Small Molecule is the least separated modality: top open and closed models are much closer there than on Genome.
Trace-level case study
Leaderboards show where models differ on average. This section follows a small set of selected Co-Scientist and Bio-Designer cases down to the model traces, where the difference between a score and a usable scientific answer becomes visible.
What this is
A trace-backed reading of selected cases where the model behavior can be inspected directly.
We filtered a shared model-output space with quantitative signals, then manually reviewed the candidate traces. The examples below highlight four trace patterns that are useful starting points for thinking about model behavior beyond aggregate scores. We hope they also offer useful signals for future model improvement and evaluator design.
Case-study evidence set
23 representative sub-tasks
4,306 aligned samples
9 models
38,754 sample × model traces
38 manually reviewed cases
10 case cards below, 10 unique samples
Trace × model overview
Before selecting individual cases, we scanned the full aligned trace space for the 23 representative sub-tasks. Each model has the same 4,306 sample traces here. The table summarizes each model before individual cases are opened: mean original verdict score, unscored-output rate, and median captured reasoning length. Length is counted in characters, not tokenizer-specific tokens.
Model | Type | Traces | Avg. score | Unscored output | Median reasoning chars
Kimi-K2.6 (core) | open | 4,306 | 0.57 | 3.4% | 51k
DeepSeek-V4-Pro (core) | open | 4,306 | 0.56 | 7.7% | 84k
gpt-5.5 (core) | closed | 4,306 | 0.64 | 0.1% | 6k*
gemini-3.1-pro-preview (core) | closed | 4,306 | 0.66 | 0.1% | 2k*
DeepSeek-V4-Flash (diagnostic) | open | 4,306 | 0.54 | 8.8% | 73k
Qwen3.5-397B-A17B (diagnostic) | open | 4,306 | 0.51 | 2.3% | 20k
Qwen3.5-122B-A10B (diagnostic) | open | 4,306 | 0.38 | 14.6% | 22k
gpt-5.4-mini (diagnostic) | closed | 4,306 | 0.56 | 1.4% | 36k*
gemini-3-flash-preview (diagnostic) | closed | 4,306 | 0.62 | 0.3% | 8k*
The case cards below are selected from this table's underlying trace space. The unscored-output column points to answer-completion cases. The character-length spread helps separate verbosity from final-answer quality.
* Closed-source models generally do not expose full reasoning traces, so their median reasoning chars reflect only partial or summarized output and are shown for reference only.
Four trace patterns — pick a pattern, then a model chip
Each pattern uses representative cases to make one behavior concrete. Select a pattern below, then click a model chip to inspect its task prompt or context, reasoning trace, model output, and verdict.
The model writes a long trace but never lands a scorable final answer. The issue is mechanical and consequential: the task is not complete until the answer can be scored.
Case 8 · Bio-Designer · Genome
vep_traitgym: long RNA-variant traces without a final call
Kimi-K2.6, DeepSeek-V4-Pro, and DeepSeek-V4-Flash reasoned for hundreds of thousands of characters but emitted no scorable answer. gpt-5.5 and gemini-3.1-pro-preview returned the correct No option with short traces.
Pattern signal: The failure is output finalization rather than task impossibility: two reference models answer the same RNA-variant prompt correctly.
What this shows: on long sequence prompts, the decisive failure can be answer finalization rather than biological inference.
Case 21 · Co-Scientist · Factual Knowledge
lab_bench: Gibson primer selection stalls before the answer
The task asks for a primer pair for cloning waaA into HindII-linearized pUC19. DeepSeek-V4-Pro and DeepSeek-V4-Flash produced very long traces with no final prediction. gpt-5.5 and gemini-3.1-pro-preview selected the correct option, while Kimi-K2.6 also reached the correct option after a much longer trace.
Pattern signal: The model can stay on topic for a long time but still fail the scoring interface by never closing with a usable option.
What this shows: long protocol reasoning can stay on topic yet still stall before the final option required by the evaluator.
The model sees relevant evidence, but the final answer depends on which evidence it lets control the conclusion. In these selected cases, plausible but weaker cues pull some traces away from the scored answer.
Case 12 · Bio-Designer · Protein
TAPE Stability: many cues, wrong stability driver
The task asks for the only unstable miniprotein among four candidates. gpt-5.5, gemini-3.1-pro-preview, Kimi-K2.6, and DeepSeek-V4-Flash selected the correct candidate by focusing on Candidate 3's weak beta-strand pattern and poor core-packing signal, though Kimi-K2.6 and DeepSeek-V4-Flash reached it after long traces. DeepSeek-V4-Pro and gemini-3-flash-preview analyzed many protein-design cues but let other structural cues drive them to a different candidate.
Pattern signal: The wrong traces discuss topology, hydrophobic packing, and secondary-structure propensity, but let the wrong stability cue control the final choice.
What this shows: the useful signal is which stability cue controls the final choice, not how many protein-design features the trace lists.
Case 29 · Co-Scientist · Evidence Critique
SciPredict Biology: the observed direction matters
The task asks how running-wheel housing changed female mouse social investigation time. gpt-5.5, Kimi-K2.6, and both DeepSeek variants preserved the correct direction: wheel-housed females spent less time investigating. The Gemini variants and gpt-5.4-mini answered "no significant difference."
Pattern signal: The wrong traces consider the experimental setup and prior expectations, but underweight the specific reported direction of the study outcome.
What this shows: study-interpretation errors can come from losing the reported direction, even when the trace understands the experimental setup.
Broad scientific answers can diverge from narrow scoring targets. The trace helps diagnose whether the gap comes from model behavior, evaluator design, or both.
Case 1 · Co-Scientist · Protocol Writing
Biology Research: plausible plans meet a narrow rubric
All four core models wrote substantial synthetic-biology plans. The rubric, however, awards points for specific commitments such as transposase-assisted integration, a CRISPRi split-activator logic design, and engineered peptide-signal stability.
Pattern signal: The score turns on whether an open-ended design answer matches a concrete rubric checklist.
What this shows: for open-ended design tasks, the trace helps separate missing rubric-specific commitments from a broadly plausible plan.
Case 10 · Bio-Designer · Small Molecule
toxicity_and_safety: plausible hazards, single scored target
Kimi-K2.6 and DeepSeek-V4-Pro listed several plausible hazards. gpt-5.5 and gemini-3.1-pro-preview selected the scored compatibility-chart target: solubilization of toxic substance.
Pattern signal: The task is scored as one main safety label. Broader hazard inventories can be reasonable but still misaligned with the evaluator target.
What this shows: single-label safety scoring rewards target selection, so a broader hazard inventory can become a scoring mismatch.
These cards compare stronger and lighter variants from the same model family on the same prompt. Aggregate paired comparisons still favor stronger variants more often, but individual traces show that scaling can change which evidence is used, how options are calibrated, or which final answer is selected.
Case 30 · Co-Scientist · SciPredict
SciPredict Biology: DAV result flips across families
The task asks which persistent viral infection showed reduced survival under L. plantarum supplementation. gpt-5.5 and gemini-3.1-pro-preview selected DAV, while their lighter variants moved to Nora or DAV+Nora. DeepSeek and Qwen show the opposite direction, with the lighter variants selecting DAV.
Pattern signal: Four same-family comparisons split on one study-interpretation target, and all variants produce scorable answers.
What this shows: same-family variants can flip which infection condition they treat as the significant survival result.
Case 35 · Bio-Designer · Protein Stability
TAPE stability ranking: one prompt, mixed scaling directions
The target ranking is 4,1,2,3. gpt-5.4-mini, gemini-3-flash-preview, and Qwen3.5-122B-A10B move closer to the ranking than their stronger siblings, while DeepSeek-V4-Pro is exact and DeepSeek-V4-Flash is not.
Pattern signal: The same miniprotein ranking prompt shows reverse scaling in three families and positive scaling in one family.
What this shows: ranking tasks expose calibration differences between nearby candidates, not just exact-match failures.
Case 37 · Co-Scientist · SciPredict
SciPredict Biology: WNK inhibition reverses the surface-level direction
The scored answer says RNF43 surface levels end higher than FZD5 after WNK inhibition. gpt-5.4-mini, gemini-3.1-pro-preview, and DeepSeek-V4-Flash follow that direction. gpt-5.5, gemini-3-flash-preview, and DeepSeek-V4-Pro reverse it. Both Qwen variants answer correctly.
Pattern signal: Three families split on the direction of the same HiBiT assay result, while one family remains stable.
What this shows: the same assay description can be anchored to opposite readout directions, and the trace makes that reversal visible.
The task asks for the molecular function from a long protein sequence. The scored answer is phosphopantetheine binding. gpt-5.5, both Gemini variants, DeepSeek-V4-Pro, and Qwen3.5-397B-A17B select the phosphopantetheine-binding option, while gpt-5.4-mini, DeepSeek-V4-Flash, and Qwen3.5-122B-A10B move to histone-binding or xylanase-style anchors.
Pattern signal: Three families split on whether to trust the phosphopantetheine carrier-domain motif or competing domain/function hypotheses.
What this shows: sequence-function calls can hinge on one motif family, and variants differ in whether they keep it as the deciding evidence.
Trace viewer
Task context, traces, outputs, and verdicts come from embedded captured case-study data. When a saved model-facing task block is available, the task tab shows it directly. Otherwise it shows task context recovered from the captured reasoning trace. The homepage starts each field in a collapsed viewport for readability. The field text itself is not shortened in JavaScript. Reasoning colors are a reading aid over escaped raw text.
Reading notes. Each chip reports that model's score on one sample. Scores use the original task evaluator scale, not win rates. Many tasks use 0–1 scores, while ranking tasks may assign partial or negative correlation scores. Across this section, the 10 cards cover 10 unique samples. We filtered cases with quantitative signals, then reviewed the traces manually, so selection still involves judgment. Read the cases together with the aggregate tables and paired statistics: traces show what happened in concrete examples, while the tables show whether the pattern appears beyond a single case.
Limitations
The benchmark pool is broad, but not uniform. Co-Scientist workflows and Bio-Designer modalities have direct probes. Agentic lab work and long-document synthesis are lighter in this run.
No real agentic testing. General and Co-Scientist do not include multi-step tool use, planning, or feedback-driven revision. This is why the Autonomy (A) axis — and the General-track Tool Use category — stay near the floor across the model pool.
No solid long-context testing. The current sub-tasks do not meaningfully probe 10k+ token retrieval, long-document comprehension, or scientific literature synthesis.
General evidence is summarized. General benchmarks are used as the baseline layer. The scientific interpretation focuses on Co-Scientist and Bio-Designer.
Roadmap · v0.2 These gaps are known and on our roadmap, not where ARCK stops. The next release will broaden coverage along these axes — including agentic and long-context evaluation — as the benchmark continues to grow.
Acknowledgements
The ARCK leaderboard is built on the public datasets below. We thank their authors and maintainers. Each dataset remains the property of its respective authors and is used under the license shown. Hover a source name in the coverage tables above to see its license inline.
Datasets are Apache-2.0; the companion code repo is CC-BY-NC-ND-4.0 (NonCommercial + NoDerivatives).
† prompt-adapted · ‡ task-reformulated — underlying labels and gold answers are unchanged. See the coverage tables above for what was adapted.
BEACON component licenses: bpRNA-1M / CRW — No license stated; bpRNA-1M / tmRDB — Research Purpose Only; bpRNA-1M / SRPDB — Research Purpose Only; bpRNA-1M / tRNADB — Public Domain; bpRNA-1M / Rnase P — Public Domain; bpRNA-1M / RFam — CC0-1.0; bpRNA-1M / PDB — CC0-1.0; splice_ai — GPL-3.0; isoform — MIT; modification — MIT; noncoding_rna_family — Apache-2.0; programmable_rna_switches — No license stated; crispr_on_target — Apache-2.0; crispr_off_target — Apache-2.0
mRNABench component licenses: rna-loc-fazal — No license stated; protein-loc — CC-BY-SA-4.0; rna-lifecycle — No license stated; mirna-target — CC-BY-4.0; mrl — No license stated; te — No license stated; eclip — No license stated; GO — CC-BY-4.0; rnahl — CC-BY-4.0; mrl-egfp — No license stated; utr-variants — CC-BY-4.0; vep-traitgym — MIT
Open for collaboration
ARCK integrates and restructures existing benchmarks into a unified reporting coordinate system. Community review helps improve benchmark admission, score interpretation, and evidence reporting.