BARS v1 — Review Checklists (3 rounds)

These checklists are for reviewing changes to benchmark datasets, judge/runners, and the Studio GUI.

  • If an item is not applicable, mark N/A with a one-line rationale.
  • Prefer “verify with a file/command” over “looks good”.

Review Round 1 — Spec/Schema correctness + Safety

Dataset & catalog integrity (repo: ai-research-skills)

  • catalog/benchmarks.json: every entry path exists and points at a JSON file under benchmarks/.
  • catalog/benchmarks.json: every case_count equals cases.length in the referenced dataset file (see the sketch after this list).
  • catalog/schema.json: definitions.benchmark matches what’s actually in catalog/benchmarks.json (especially dataset_id pattern).
  • benchmarks/**.json: required top-level fields exist and are consistent (dataset_id, kind, locale, cases).
  • benchmarks/**.json: cases[].id is unique, stable, and human-meaningful (avoid renumbering/reordering without intent).
  • benchmarks/**.json: any “HTML gold” is safe-by-construction:
    • For citation benchmarks, expected_bibliography_html is rendered via dangerouslySetInnerHTML in docs/components/CitationBenchmarkDemo.jsx — ensure it contains no scripts, event handlers, or untrusted HTML.
  • Provenance/attribution is explicit for derived fixtures (e.g., source.repo, source.license_note).
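
For the path and case_count checks above, a minimal Node sketch (hypothetical, not the repo's validator); it assumes catalog/benchmarks.json is, or wraps in a benchmarks key, an array of entries with path and case_count fields, so adjust to the real schema:

    // Hypothetical standalone check; not the repo's actual validation gate.
    import { readFileSync, existsSync } from 'node:fs';

    const catalog = JSON.parse(readFileSync('catalog/benchmarks.json', 'utf8'));
    const entries = Array.isArray(catalog) ? catalog : (catalog.benchmarks ?? []);
    let failures = 0;

    for (const entry of entries) {
      // Path must exist and point at a JSON file under benchmarks/.
      if (!entry.path?.startsWith('benchmarks/') || !entry.path.endsWith('.json') || !existsSync(entry.path)) {
        console.error(`bad path: ${entry.path}`);
        failures += 1;
        continue;
      }
      // case_count must equal cases.length in the referenced dataset file.
      const dataset = JSON.parse(readFileSync(entry.path, 'utf8'));
      if (entry.case_count !== dataset.cases.length) {
        console.error(`${entry.path}: case_count=${entry.case_count}, cases.length=${dataset.cases.length}`);
        failures += 1;
      }
    }
    process.exit(failures ? 1 : 0);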

Threat model / anti-leakage

  • No secrets or internal paths committed (watch fields like source.local_path in benchmark JSON).
  • Public datasets must not include absolute paths (ban source.local_path entirely in public data); allow it only in private/internal datasets that never ship (see the sketch after this list).
  • Inputs/fixtures contain no private data; confirm dataset sourcing and redaction.
  • LLM-only judging: judge prompts do not contain gold outputs or direct string-matching shortcuts (no “answer leakage”).
  • Anti-contamination: if cases come from public corpora, document how you reduce memorization/leakage risk (e.g., transformations, holdouts).
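
A minimal sketch of the source.local_path ban above, assuming public datasets live under benchmarks/ and Node 20+ for the recursive readdir; not the actual repo gate:

    import { readFileSync, readdirSync } from 'node:fs';
    import { join } from 'node:path';

    const offenders = [];
    for (const rel of readdirSync('benchmarks', { recursive: true })) {
      const file = join('benchmarks', String(rel));
      if (!file.endsWith('.json')) continue;
      const text = readFileSync(file, 'utf8');
      // Cheap textual checks: the banned field, plus absolute-looking paths inside JSON strings.
      if (text.includes('"local_path"') || /"(\/Users\/|\/home\/|[A-Z]:\\\\)/.test(text)) {
        offenders.push(file);
      }
    }
    if (offenders.length) {
      console.error('local/absolute paths found in public datasets:', offenders);
      process.exit(1);
    }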

Bias mitigations

  • Coverage targets are explicit (language/script/domain/format variety) and rationale is documented.
  • Any normalization rules are specified (what differences are allowed vs. counted as failures).

Review Round 2 — CLI reproducibility + Artifacts + Stats

Reproducibility smoke checks

  • Repo gates: npm run validate:strict passes.
  • Docs sync: cd docs && npm run sync correctly copies benchmarks/ into docs/public/benchmarks/.
  • Dataset loading works in the docs UI:
    • /benchmarks/<dataset>.json loads (see docs/components/CitationBenchmarkDemo.jsx).
    • Report lookup follows the flat convention first: benchmarks/reports/<dataset.replace('/', '-')>.json, with a nested fallback (see docs/components/CitationBenchmarkDemo.jsx and the sketch after this list).
  • Core runner instructions stay accurate (repo: ai-research-skills-core):
    • docs/components/LazyCitationBenchmarkDemo.jsx “Run with core CLI” snippet matches the actual CLI entrypoint (packages/cli/bin/run.js) and flags.
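
A sketch of the report lookup order described above (flat name first, nested path as a fallback); the real logic lives in docs/components/CitationBenchmarkDemo.jsx, and the nested URL here is only an assumption:

    async function loadReport(datasetId) {
      const flat = `/benchmarks/reports/${datasetId.replace('/', '-')}.json`;
      const nested = `/benchmarks/reports/${datasetId}.json`; // assumed fallback shape
      for (const url of [flat, nested]) {
        const res = await fetch(url);
        if (res.ok) return res.json();
      }
      return null; // no report published for this dataset
    }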

Artifact formats

  • Report JSON is diff-friendly (stable IDs, stable ordering where practical) and includes enough metadata to reproduce (runner version + git SHA + timestamp).
  • Output schema is documented and versioned (old reports should remain readable unless an intentional breaking change is declared).
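
One way to satisfy the metadata and stable-ordering items above; the field names are illustrative, not the actual report schema:

    import { execSync } from 'node:child_process';
    import { writeFileSync } from 'node:fs';

    function writeReport(path, runnerVersion, results) {
      const report = {
        schema_version: 1, // bump (and declare) on intentional breaking changes
        runner_version: runnerVersion,
        git_sha: execSync('git rev-parse HEAD').toString().trim(),
        generated_at: new Date().toISOString(),
        // Stable ordering by case id keeps reruns diff-friendly.
        cases: [...results].sort((a, b) => a.id.localeCompare(b.id)),
      };
      writeFileSync(path, JSON.stringify(report, null, 2) + '\n');
    }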

Stats correctness (regression / pairwise)

  • Pairwise aggregation is specified (how ties, missing judgments, and invalid outputs are handled; see the sketch after this list).
  • Any “headline metric” has a clear definition and a test case that catches obvious regressions.
  • Determinism: with the same inputs + seed/config, reruns produce the same summary stats (or the nondeterminism is explicitly bounded).
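
A sketch of one possible pairwise aggregation policy (half credit for ties; missing or invalid judgments excluded from the denominator but surfaced); the policy that matters is the one the spec documents:

    function pairwiseSummary(judgments) {
      let winsA = 0, winsB = 0, ties = 0, skipped = 0;
      for (const j of judgments) {
        if (!j || !['A', 'B', 'tie'].includes(j.verdict)) { skipped += 1; continue; } // missing/invalid
        if (j.verdict === 'A') winsA += 1;
        else if (j.verdict === 'B') winsB += 1;
        else ties += 1;
      }
      const compared = winsA + winsB + ties;
      return {
        compared, skipped, ties,
        win_rate_a: compared ? (winsA + ties / 2) / compared : null,
        win_rate_b: compared ? (winsB + ties / 2) / compared : null,
      };
    }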

Review Round 3 — Studio GUI UX + Drilldowns + Security posture

UX completeness

  • Bench list UI supports filtering at minimum by kind, locale, and skill_id, plus search by dataset_id (see the sketch after this list).
  • Drilldowns are complete:
    • per-case: input, model output, judge output/reasoning, and a clear “why it failed” view.
    • per-run: totals + slice views (by dataset/kind/model) and a stable permalink/shareable identifier.
  • Export is usable: download raw per-case results + summary (.json at minimum; .csv optional).
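
A sketch of the filter/search behavior in the first item above; the filters object and field names are illustrative:

    function filterBenchmarks(entries, { kind, locale, skillId, query }) {
      return entries.filter((b) =>
        (!kind || b.kind === kind) &&
        (!locale || b.locale === locale) &&
        (!skillId || b.skill_id === skillId) &&
        (!query || b.dataset_id.toLowerCase().includes(query.toLowerCase()))
      );
    }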

Performance

  • Heavy UI sections are lazy-loaded (pattern: docs/components/LazyCitationBenchmarkDemo.jsx).
  • Large JSON artifacts are fetched on-demand from static assets (e.g., docs/public/benchmarks/), not bundled into initial JS.
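
A sketch of the lazy-load plus on-demand-fetch pattern above, assuming the docs site is Next.js (which the NEXT_PUBLIC_* convention suggests); the real pattern is docs/components/LazyCitationBenchmarkDemo.jsx, and the props here are illustrative:

    import dynamic from 'next/dynamic';
    import { useEffect, useState } from 'react';

    // The heavy demo component stays out of the initial bundle.
    const CitationBenchmarkDemo = dynamic(() => import('./CitationBenchmarkDemo'), { ssr: false });

    export function BenchmarkSection({ datasetId }) {
      const [data, setData] = useState(null);
      useEffect(() => {
        // Large JSON comes from static assets (docs/public/benchmarks/), not the JS bundle.
        fetch(`/benchmarks/${datasetId}.json`).then((r) => r.json()).then(setData);
      }, [datasetId]);
      return data ? <CitationBenchmarkDemo dataset={data} /> : null;
    }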

Security posture

  • XSS: never render model output as HTML; sanitize any user-provided content; keep “trusted HTML” limited to repo-owned gold (e.g., citation gold).
  • Secrets hygiene: only public keys in NEXT_PUBLIC_*; no tokens committed to git; production configs documented.
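
A sketch of the XSS rule above: untrusted model output rendered as escaped text, with dangerouslySetInnerHTML reserved for repo-owned gold; component and prop names are illustrative:

    function CaseRow({ modelOutput, goldHtml }) {
      return (
        <tr>
          {/* Untrusted: rendered as a text child, so React escapes it. */}
          <td>{modelOutput}</td>
          {/* Trusted, repo-owned gold only (e.g., expected_bibliography_html). */}
          <td dangerouslySetInnerHTML={{ __html: goldHtml }} />
        </tr>
      );
    }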