BARS v1 — Review Checklists (3 rounds)
These checklists are for reviewing changes to benchmark datasets, judge/runners, and the Studio GUI.
- If an item is not applicable, mark N/A with a one-line rationale.
- Prefer “verify with a file/command” over “looks good”.
Review Round 1 — Spec/Schema correctness + Safety
Dataset & catalog integrity (repo: ai-research-skills)
- `catalog/benchmarks.json`: every entry `path` exists and points at a JSON file under `benchmarks/`.
- `catalog/benchmarks.json`: every `case_count` equals `cases.length` in the referenced dataset file (verifiable by script; see the sketch after this list).
- `catalog/schema.json`: `definitions.benchmark` matches what’s actually in `catalog/benchmarks.json` (especially the `dataset_id` pattern).
- `benchmarks/**.json`: required top-level fields exist and are consistent (`dataset_id`, `kind`, `locale`, `cases`).
- `benchmarks/**.json`: `cases[].id` is unique, stable, and human-meaningful (avoid renumbering/reordering without intent).
- `benchmarks/**.json`: any “HTML gold” is safe-by-construction:
  - For citation benchmarks, `expected_bibliography_html` is rendered via `dangerouslySetInnerHTML` in `docs/components/CitationBenchmarkDemo.jsx` — ensure it contains no scripts, event handlers, or untrusted HTML.
- Provenance/attribution is explicit for derived fixtures (e.g., `source.repo`, `source.license_note`).
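A minimal sketch of that verification, assuming the catalog is a JSON array of entries with `path`, `dataset_id`, and `case_count` fields (field names follow the checklist wording; confirm them against `catalog/schema.json`):

```js
// check-catalog.mjs — hypothetical helper script, not part of the repo.
// Verifies: entry paths exist under benchmarks/, case_count matches
// cases.length, and cases[].id is unique in each dataset file.
import { existsSync, readFileSync } from 'node:fs';

const catalog = JSON.parse(readFileSync('catalog/benchmarks.json', 'utf8'));
let failures = 0;

for (const entry of Array.isArray(catalog) ? catalog : catalog.benchmarks) {
  if (!entry.path?.startsWith('benchmarks/') || !entry.path.endsWith('.json') || !existsSync(entry.path)) {
    console.error(`bad or missing path: ${entry.path}`);
    failures += 1;
    continue;
  }
  const dataset = JSON.parse(readFileSync(entry.path, 'utf8'));
  if (entry.case_count !== dataset.cases.length) {
    console.error(`${entry.dataset_id}: case_count ${entry.case_count} != cases.length ${dataset.cases.length}`);
    failures += 1;
  }
  const ids = dataset.cases.map((c) => c.id);
  if (new Set(ids).size !== ids.length) {
    console.error(`${entry.dataset_id}: duplicate cases[].id values`);
    failures += 1;
  }
}
process.exit(failures === 0 ? 0 : 1);
```

Run from the repo root with `node check-catalog.mjs`; the nonzero exit code makes it usable as a CI gate alongside `npm run validate:strict`.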
Threat model / anti-leakage
- No secrets or internal paths committed (watch fields like `source.local_path` in benchmark JSON).
- Public datasets must not include absolute paths (ban `source.local_path` entirely in public data; see the scan sketch after this list); allow it only in private/internal datasets that never ship.
- Inputs/fixtures contain no private data; confirm dataset sourcing and redaction.
- LLM-only judging: judge prompts do not contain gold outputs or direct string-matching shortcuts (no “answer leakage”).
- Anti-contamination: if cases come from public corpora, document how you reduce memorization/leakage risk (e.g., transformations, holdouts).
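A hedged sketch of that scan, assuming public datasets live under `benchmarks/` and Node 20+ (for recursive `readdirSync`); the banned-key list is illustrative and should grow with the threat model:

```js
// check-banned-fields.mjs — hypothetical helper, not part of the repo.
// Recursively flags banned keys (e.g., local_path) anywhere in public
// benchmark JSON, so internal paths can never ship.
import { readdirSync, readFileSync } from 'node:fs';
import { join } from 'node:path';

const BANNED_KEYS = new Set(['local_path']); // illustrative; extend as needed

function findBanned(value, trail = '$') {
  if (value === null || typeof value !== 'object') return [];
  return Object.entries(value).flatMap(([key, child]) => {
    const here = `${trail}.${key}`;
    const hits = BANNED_KEYS.has(key) ? [here] : [];
    return hits.concat(findBanned(child, here));
  });
}

let failures = 0;
for (const name of readdirSync('benchmarks', { recursive: true })) {
  const file = String(name);
  if (!file.endsWith('.json')) continue;
  const doc = JSON.parse(readFileSync(join('benchmarks', file), 'utf8'));
  for (const hit of findBanned(doc)) {
    console.error(`benchmarks/${file}: banned field at ${hit}`);
    failures += 1;
  }
}
process.exit(failures === 0 ? 0 : 1);
```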
Bias mitigations
- Coverage targets are explicit (language/script/domain/format variety) and rationale is documented.
- Any normalization rules are specified (what differences are allowed vs. counted as failures).
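Pinning the rules in code keeps them reviewable and testable; a minimal sketch, with example rules that are illustrative rather than the project’s actual policy:

```js
// Illustrative normalization: the differences erased here are "allowed";
// anything that still differs after normalize() counts as a failure.
function normalize(text) {
  return text
    .normalize('NFC')      // unify Unicode composition forms
    .replace(/\s+/g, ' ')  // collapse whitespace runs (incl. non-breaking)
    .trim();
}

// These two outputs compare equal under the example policy above.
console.assert(normalize('Foo\u00A0 Bar ') === normalize('Foo Bar'));
```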
Review Round 2 — CLI reproducibility + Artifacts + Stats
Reproducibility smoke checks
- Repo gates:
npm run validate:strictpasses. - Docs sync:
cd docs && npm run synccorrectly copiesbenchmarks/→docs/public/benchmarks/. - Dataset loading works in the docs UI:
/benchmarks/<dataset>.jsonloads (seedocs/components/CitationBenchmarkDemo.jsx).- Report lookup follows the flat convention first:
benchmarks/reports/<dataset.replace('/', '-')>.json, with nested fallback (seedocs/components/CitationBenchmarkDemo.jsx).
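The lookup order can be spot-checked against a sketch like this (it mirrors the convention stated above; the exact nested fallback path is whatever `docs/components/CitationBenchmarkDemo.jsx` actually implements):

```js
// Hypothetical report fetcher: flat naming first, nested layout as fallback.
async function fetchReport(datasetId) {
  const candidates = [
    `/benchmarks/reports/${datasetId.replace('/', '-')}.json`, // flat convention
    `/benchmarks/reports/${datasetId}.json`,                   // nested fallback
  ];
  for (const url of candidates) {
    const res = await fetch(url);
    if (res.ok) return res.json();
  }
  return null; // no report published for this dataset
}
```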
- Core runner instructions stay accurate (repo: `ai-research-skills-core`): the `docs/components/LazyCitationBenchmarkDemo.jsx` “Run with core CLI” snippet matches the actual CLI entrypoint (`packages/cli/bin/run.js`) and flags.
Artifact formats
- Report JSON is diff-friendly (stable IDs, stable ordering where practical) and includes enough metadata to reproduce (runner version + git SHA + timestamp).
- Output schema is documented and versioned (old reports remain readable unless an intentional breaking change is declared).
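As a concrete target, a report envelope along these lines carries enough to reproduce a run (field names are illustrative, not a schema the repo defines):

```js
// Hypothetical metadata builder for report JSON.
import { execSync } from 'node:child_process';

function reportMeta(runnerVersion) {
  return {
    schema_version: 1, // bump on intentional breaking changes
    runner_version: runnerVersion,
    git_sha: execSync('git rev-parse HEAD', { encoding: 'utf8' }).trim(),
    created_at: new Date().toISOString(),
  };
}
```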
Stats correctness (regression / pairwise)
- Pairwise aggregation is specified (how ties, missing judgments, and invalid outputs are handled); see the sketch after this list.
- Any “headline metric” has a clear definition and a test case that catches obvious regressions.
- Determinism: with the same inputs + seed/config, reruns produce the same summary stats (or the nondeterminism is explicitly bounded).
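One explicit policy, sketched here purely as a review reference (ties counted as half-wins, missing or invalid judgments dropped from the denominator); the real policy should live next to the aggregation code and be stated in the report:

```js
// judgments: [{ winner: 'a' | 'b' | 'tie' | null }], where null marks a
// missing or unparseable judge output.
function pairwiseWinRate(judgments) {
  let aScore = 0;
  let counted = 0;
  for (const j of judgments) {
    if (j.winner == null) continue;             // invalid/missing: excluded
    counted += 1;
    if (j.winner === 'a') aScore += 1;
    else if (j.winner === 'tie') aScore += 0.5; // ties split evenly
  }
  return counted > 0 ? aScore / counted : NaN;  // NaN: nothing judgeable
}
```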
Review Round 3 — Studio GUI UX + Drilldowns + Security posture
UX completeness
- Bench list UI supports filtering at minimum by `kind`, `locale`, and `skill_id`, plus search by `dataset_id` (see the filter sketch after this list).
- Drilldowns are complete:
  - per-case: input, model output, judge output/reasoning, and a clear “why failed” view.
  - per-run: totals + slice views (by dataset/kind/model) and a stable permalink/shareable identifier.
- Export is usable: download raw per-case results + summary (`.json` at minimum; `.csv` optional).
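A minimal filter predicate along these lines is enough to review against (field names follow the dataset top-level fields from Round 1; the real UI wiring will differ):

```js
// Hypothetical bench-list filter: exact match on facets, substring search on id.
function matchesFilters(bench, { kind, locale, skillId, query }) {
  return (
    (!kind || bench.kind === kind) &&
    (!locale || bench.locale === locale) &&
    (!skillId || bench.skill_id === skillId) &&
    (!query || bench.dataset_id.includes(query))
  );
}
```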
Performance
- Heavy UI sections are lazy-loaded (pattern: `docs/components/LazyCitationBenchmarkDemo.jsx`; see the sketch after this list).
- Large JSON artifacts are fetched on-demand from static assets (e.g., `docs/public/benchmarks/`), not bundled into the initial JS.
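Assuming the docs site is Next.js (suggested by the `NEXT_PUBLIC_*` convention below), the lazy-load wrapper typically looks like this sketch; the authoritative pattern is the repo’s own `LazyCitationBenchmarkDemo.jsx`:

```jsx
// Hypothetical lazy wrapper in the style of LazyCitationBenchmarkDemo.jsx.
import dynamic from 'next/dynamic';

const CitationBenchmarkDemo = dynamic(
  () => import('./CitationBenchmarkDemo'),
  { ssr: false, loading: () => <p>Loading benchmark…</p> },
);

export default CitationBenchmarkDemo;
```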
Security posture
- XSS: never render model output as HTML; sanitize any user-provided content; keep “trusted HTML” limited to repo-owned gold (e.g., citation gold). See the sketch at the end of this checklist.
- Secrets hygiene: only public keys in `NEXT_PUBLIC_*`; no tokens committed to git; production configs documented.
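The XSS rule above reduces to one contrast in component code, sketched here with hypothetical prop names:

```jsx
// Illustrative only: model output rides React's default escaping; only
// repo-owned gold HTML may pass through dangerouslySetInnerHTML.
function CaseView({ modelOutput, goldHtml }) {
  return (
    <div>
      {/* Safe: React escapes this string, so markup in model output is inert. */}
      <pre>{modelOutput}</pre>
      {/* Trusted-only: goldHtml must come from repo-owned fixtures, never users. */}
      <div dangerouslySetInnerHTML={{ __html: goldHtml }} />
    </div>
  );
}
```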