BARS v1 — Review Checklists (3 rounds)

These checklists are for reviewing changes to benchmark datasets, judge/runners, and the Studio GUI.

  • If an item is not applicable, mark N/A with a one-line rationale.
  • Prefer “verify with a file/command” over “looks good”.

Review Round 1 — Spec/Schema correctness + Safety

Dataset & catalog integrity (repo: ai-research-skills)

  • catalog/benchmarks.json: every entry path exists and points at a JSON file under benchmarks/.
  • catalog/benchmarks.json: every case_count equals cases.length in the referenced dataset file (see the sketch after this list).
  • catalog/schema.json: definitions.benchmark matches what’s actually in catalog/benchmarks.json (especially dataset_id pattern).
  • benchmarks/**.json: required top-level fields exist and are consistent (dataset_id, kind, locale, cases).
  • benchmarks/**.json: cases[].id is unique, stable, and human-meaningful (avoid renumbering/reordering without intent).
  • benchmarks/**.json: any “HTML gold” is safe-by-construction:
    • For citation benchmarks, expected_bibliography_html is rendered via dangerouslySetInnerHTML in docs/components/CitationBenchmarkDemo.jsx — ensure it contains no scripts, event handlers, or untrusted HTML.
  • Provenance/attribution is explicit for derived fixtures (e.g., source.repo, source.license_note).
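
For the path and case_count checks above, a minimal Node sketch (hypothetical, not the repo's validator); it assumes catalog/benchmarks.json is, or wraps in a benchmarks key, an array of entries with path and case_count fields, so adjust to the real schema:

    // Hypothetical standalone check; not the repo's actual validation gate.
    import { readFileSync, existsSync } from 'node:fs';

    const catalog = JSON.parse(readFileSync('catalog/benchmarks.json', 'utf8'));
    const entries = Array.isArray(catalog) ? catalog : (catalog.benchmarks ?? []);
    let failures = 0;

    for (const entry of entries) {
      // Path must exist and point at a JSON file under benchmarks/.
      if (!entry.path?.startsWith('benchmarks/') || !entry.path.endsWith('.json') || !existsSync(entry.path)) {
        console.error(`bad path: ${entry.path}`);
        failures += 1;
        continue;
      }
      // case_count must equal cases.length in the referenced dataset file.
      const dataset = JSON.parse(readFileSync(entry.path, 'utf8'));
      if (entry.case_count !== dataset.cases.length) {
        console.error(`${entry.path}: case_count=${entry.case_count}, cases.length=${dataset.cases.length}`);
        failures += 1;
      }
    }
    process.exit(failures ? 1 : 0);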

Threat model / anti-leakage

  • No secrets or internal paths committed (watch fields like source.local_path in benchmark JSON).
  • Public datasets must not include absolute paths (ban source.local_path entirely in public data); allow it only in private/internal datasets that never ship (see the sketch after this list).
  • Inputs/fixtures contain no private data; confirm dataset sourcing and redaction.
  • LLM-only judging: judge prompts do not contain gold outputs or direct string-matching shortcuts (no “answer leakage”).
  • Anti-contamination: if cases come from public corpora, document how you reduce memorization/leakage risk (e.g., transformations, holdouts).
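
A minimal sketch of the source.local_path ban above, assuming public datasets live under benchmarks/ and Node 20+ for the recursive readdir; not the actual repo gate:

    import { readFileSync, readdirSync } from 'node:fs';
    import { join } from 'node:path';

    const offenders = [];
    for (const rel of readdirSync('benchmarks', { recursive: true })) {
      const file = join('benchmarks', String(rel));
      if (!file.endsWith('.json')) continue;
      const text = readFileSync(file, 'utf8');
      // Cheap textual checks: the banned field, plus absolute-looking paths inside JSON strings.
      if (text.includes('"local_path"') || /"(\/Users\/|\/home\/|[A-Z]:\\\\)/.test(text)) {
        offenders.push(file);
      }
    }
    if (offenders.length) {
      console.error('local/absolute paths found in public datasets:', offenders);
      process.exit(1);
    }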

Bias mitigations

  • Coverage targets are explicit (language/script/domain/format variety) and rationale is documented.
  • Any normalization rules are specified (what differences are allowed vs. counted as failures).

Review Round 2 — CLI reproducibility + Artifacts + Stats

Reproducibility smoke checks

  • Repo gates: npm run validate:strict passes.
  • Docs sync: cd docs && npm run sync correctly copies benchmarks/ into docs/public/benchmarks/.
  • Dataset loading works in the docs UI:
    • /benchmarks/<dataset>.json loads (see docs/components/CitationBenchmarkDemo.jsx).
    • Report lookup follows the flat convention first: benchmarks/reports/<dataset.replace('/', '-')>.json, with a nested fallback (see docs/components/CitationBenchmarkDemo.jsx and the sketch after this list).
  • Core runner instructions stay accurate (repo: ai-research-skills-core):
    • docs/components/LazyCitationBenchmarkDemo.jsx “Run with core CLI” snippet matches the actual CLI entrypoint (packages/cli/bin/run.js) and flags.
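
A sketch of the report lookup order described above (flat name first, nested path as a fallback); the real logic lives in docs/components/CitationBenchmarkDemo.jsx, and the nested URL here is only an assumption:

    async function loadReport(datasetId) {
      const flat = `/benchmarks/reports/${datasetId.replace('/', '-')}.json`;
      const nested = `/benchmarks/reports/${datasetId}.json`; // assumed fallback shape
      for (const url of [flat, nested]) {
        const res = await fetch(url);
        if (res.ok) return res.json();
      }
      return null; // no report published for this dataset
    }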

Artifact formats

  • Report JSON is diff-friendly (stable IDs, stable ordering where practical) and includes enough metadata to reproduce (runner version + git SHA + timestamp).
  • Output schema is documented and versioned (old reports should remain readable unless an intentional breaking change is declared).
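
One way to satisfy the metadata and stable-ordering items above; the field names are illustrative, not the actual report schema:

    import { execSync } from 'node:child_process';
    import { writeFileSync } from 'node:fs';

    function writeReport(path, runnerVersion, results) {
      const report = {
        schema_version: 1, // bump (and declare) on intentional breaking changes
        runner_version: runnerVersion,
        git_sha: execSync('git rev-parse HEAD').toString().trim(),
        generated_at: new Date().toISOString(),
        // Stable ordering by case id keeps reruns diff-friendly.
        cases: [...results].sort((a, b) => a.id.localeCompare(b.id)),
      };
      writeFileSync(path, JSON.stringify(report, null, 2) + '\n');
    }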

Stats correctness (regression / pairwise)

  • Pairwise aggregation is specified (how ties, missing judgments, and invalid outputs are handled; see the sketch after this list).
  • Any “headline metric” has a clear definition and a test case that catches obvious regressions.
  • Determinism: with the same inputs + seed/config, reruns produce the same summary stats (or the nondeterminism is explicitly bounded).
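
A sketch of one possible pairwise aggregation policy (half credit for ties; missing or invalid judgments excluded from the denominator but surfaced); the policy that matters is the one the spec documents:

    function pairwiseSummary(judgments) {
      let winsA = 0, winsB = 0, ties = 0, skipped = 0;
      for (const j of judgments) {
        if (!j || !['A', 'B', 'tie'].includes(j.verdict)) { skipped += 1; continue; } // missing/invalid
        if (j.verdict === 'A') winsA += 1;
        else if (j.verdict === 'B') winsB += 1;
        else ties += 1;
      }
      const compared = winsA + winsB + ties;
      return {
        compared, skipped, ties,
        win_rate_a: compared ? (winsA + ties / 2) / compared : null,
        win_rate_b: compared ? (winsB + ties / 2) / compared : null,
      };
    }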

Review Round 3 — Studio GUI UX + Drilldowns + Security posture

UX completeness

  • Bench list UI supports filtering at minimum by kind, locale, and skill_id, plus search by dataset_id (see the sketch after this list).
  • Drilldowns are complete:
    • per-case: input, model output, judge output/reasoning, and a clear “why it failed” view.
    • per-run: totals + slice views (by dataset/kind/model) and a stable permalink/shareable identifier.
  • Export is usable: download raw per-case results + summary (.json at minimum; .csv optional).
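
A sketch of the filter/search behavior in the first item above; the filters object and field names are illustrative:

    function filterBenchmarks(entries, { kind, locale, skillId, query }) {
      return entries.filter((b) =>
        (!kind || b.kind === kind) &&
        (!locale || b.locale === locale) &&
        (!skillId || b.skill_id === skillId) &&
        (!query || b.dataset_id.toLowerCase().includes(query.toLowerCase()))
      );
    }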

Performance

  • Heavy UI sections are lazy-loaded (pattern: docs/components/LazyCitationBenchmarkDemo.jsx).
  • Large JSON artifacts are fetched on-demand from static assets (e.g., docs/public/benchmarks/), not bundled into initial JS.
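
A sketch of the lazy-load plus on-demand-fetch pattern above, assuming the docs site is Next.js (which the NEXT_PUBLIC_* convention suggests); the real pattern is docs/components/LazyCitationBenchmarkDemo.jsx, and the props here are illustrative:

    import dynamic from 'next/dynamic';
    import { useEffect, useState } from 'react';

    // The heavy demo component stays out of the initial bundle.
    const CitationBenchmarkDemo = dynamic(() => import('./CitationBenchmarkDemo'), { ssr: false });

    export function BenchmarkSection({ datasetId }) {
      const [data, setData] = useState(null);
      useEffect(() => {
        // Large JSON comes from static assets (docs/public/benchmarks/), not the JS bundle.
        fetch(`/benchmarks/${datasetId}.json`).then((r) => r.json()).then(setData);
      }, [datasetId]);
      return data ? <CitationBenchmarkDemo dataset={data} /> : null;
    }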

Security posture

  • XSS: never render model output as HTML; sanitize any user-provided content; keep “trusted HTML” limited to repo-owned gold (e.g., citation gold).
  • Secrets hygiene: only public keys in NEXT_PUBLIC_*; no tokens committed to git; production configs documented.
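
A sketch of the XSS rule above: untrusted model output rendered as escaped text, with dangerouslySetInnerHTML reserved for repo-owned gold; component and prop names are illustrative:

    function CaseRow({ modelOutput, goldHtml }) {
      return (
        <tr>
          {/* Untrusted: rendered as a text child, so React escapes it. */}
          <td>{modelOutput}</td>
          {/* Trusted, repo-owned gold only (e.g., expected_bibliography_html). */}
          <td dangerouslySetInnerHTML={{ __html: goldHtml }} />
        </tr>
      );
    }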