BARS v1 — LLM-only Judge Protocol (Skill Regression)

This document proposes a minimal-but-rigorous v1 evaluation protocol for benchmarking skills (prompt/skill text) under the constraint that judging is performed only by an LLM (no deterministic correctness checks used as ground-truth). Deterministic computation is still used for aggregation/statistics.

0) Conceptual model (clean decomposition)

Entities

  • Model M: the fixed candidate model under test (SUT).
  • Skill S: instructions/config applied at runtime (compare S_old vs S_new).
  • Dataset D: a set of evaluation items/cases x plus task metadata and constraints.
  • Runner config θ: sampling params, max tokens, tool config, etc. (must be locked per run).
  • Seed k: used to sample multiple generations per case.
  • Judge model J: an LLM used to score / compare outputs.

Functions

  • Generate: y = Run(M, S, x; θ, k) (produces a candidate response y plus logs).
  • Judge (pairwise): j = Judge_pair(J, x, A, B; rubric) where {A,B} are blinded candidate outputs.
  • Judge (pointwise): s = Judge_point(J, x, y; rubric) (optional secondary signal).
  • Aggregate: Summary = Agg({j, s}) (deterministic code: win-rate, CIs, tag rates, slices).
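
These functions can be pinned down as type signatures. A minimal Python sketch follows; the names (Case, run_case, judge_pair) are hypothetical and not prescribed by this document:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Case:
    case_id: str
    kind: str                 # e.g. "extraction", "citation", "writing"
    prompt: str
    constraints: list[dict]   # machine-readable checks (see Section 2)

@dataclass
class Generation:
    case_id: str
    skill_id: str             # "old" or "new"; never shown to the judge
    seed: int
    text: str

@dataclass
class PairwiseJudgment:
    winner: Literal["A", "B", "tie"]
    confidence: float         # 0..1
    tags: list[str]
    needs_review: bool

def run_case(model: str, skill: str, case: Case, theta: dict, seed: int) -> Generation:
    """Generate: y = Run(M, S, x; theta, k). The actual model call is omitted in this sketch."""
    ...

def judge_pair(judge_model: str, case: Case, a: str, b: str, rubric: dict) -> PairwiseJudgment:
    """Judge (pairwise): j = Judge_pair(J, x, A, B; rubric), with blinded A/B inputs."""
    ...
```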

Regression objective

The primary regression metric is the pairwise win-rate of S_new vs S_old over a fixed dataset and runner config:

  • win_rate = P(S_new > S_old) treating ties as 0.5.

Absolute (“pointwise”) scores are tracked for trending, debugging, and slice analysis, but pairwise is the gate.
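
As a concrete illustration of the metric, a minimal sketch (the outcome labels are illustrative):

```python
def win_rate(outcomes: list[str]) -> float:
    """Fraction of pairwise comparisons won by S_new, counting ties as 0.5.

    `outcomes` holds one label per comparison: "new", "old", or "tie".
    """
    if not outcomes:
        raise ValueError("no comparisons")
    score = sum(1.0 if o == "new" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

# 6 wins, 2 ties, 2 losses over 10 comparisons -> 0.70
print(win_rate(["new"] * 6 + ["tie"] * 2 + ["old"] * 2))
```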


1) v1 evaluation protocol (pairwise primary)

Per run (single M, single D, single comparison S_old vs S_new)

  1. Lock runner config θ (temperature, top_p, max_tokens, stop rules, tool access).
  2. For each case x ∈ D and for each generation seed k ∈ {1..K}:
    • Produce y_old = Run(M, S_old, x; θ, k) and y_new = Run(M, S_new, x; θ, k).
    • Create an A/B presentation by randomizing ordering:
      • A,B = permute(y_old, y_new) and record mapping {A: old|new, B: old|new}.
    • Collect pairwise judgment j_1 = Judge_pair(J, x, A, B; rubric) with temperature=0.
    • Repeat the judgment R-1 more times only if needed (see Defaults below).
  3. Escalate to a stronger referee judge J_ref only when:
    • j has low confidence, or
    • multiple judgments disagree (order-/repeat-induced variance), or
    • needs_review=true due to injection/spec ambiguity.
  4. Aggregate across all comparisons to produce:
    • win_rate (+ Wilson 95% CI),
    • tag rates (fatal_tags, injection.detected),
    • optional pointwise means per dimension (secondary).
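
The per-run loop above, as a short sketch reusing the hypothetical run_case / judge_pair helpers from Section 0 (the ordering and mapping logic is the essential part; everything else is illustrative):

```python
import random

def evaluate_run(model, s_old, s_new, dataset, theta, judge, rubric, k_seeds=2):
    """One run: generate with both skills, blind, judge, and log the A/B mapping."""
    records = []
    for case in dataset:
        for seed in range(k_seeds):
            y_old = run_case(model, s_old, case, theta, seed)
            y_new = run_case(model, s_new, case, theta, seed)

            # Randomize A/B order; record the mapping but never show it to the judge.
            if random.random() < 0.5:
                a, b, mapping = y_new.text, y_old.text, {"A": "new", "B": "old"}
            else:
                a, b, mapping = y_old.text, y_new.text, {"A": "old", "B": "new"}

            judgment = judge_pair(judge, case, a, b, rubric)  # judge temperature = 0
            records.append({"case_id": case.case_id, "seed": seed,
                            "mapping": mapping, "judgment": judgment})
    return records
```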

Defaults (recommended for v1)

  • K (generation repeats / seeds): K=2 (increase to 4 for high-variance tasks).
  • R (judge repeats): R=1 by default; escalate to R=2 when confidence is low or outcome is tie/unclear.
  • Referee escalation: only on disagreement, low-confidence, or needs_review.
  • Judge temperature: 0.
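
One way to encode the escalation rule implied by these defaults (the confidence threshold is an assumption to be calibrated per judge, not a prescribed value):

```python
LOW_CONFIDENCE = 0.6  # illustrative threshold; calibrate per judge model

def needs_referee(judgments) -> bool:
    """Escalate to the stronger referee judge J_ref per the defaults above."""
    if any(j.needs_review for j in judgments):              # injection / spec ambiguity
        return True
    if any(j.confidence < LOW_CONFIDENCE for j in judgments):
        return True
    if len({j.winner for j in judgments}) > 1:              # repeat/order disagreement
        return True
    return False
```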

Swap rules (order bias mitigation)

v1 should at minimum:

  • Randomize A/B order per comparison and record it.
  • Track order_bias telemetry: win-rate conditioned on whether the new skill was A or B.

Optional (stronger, more expensive):

  • Swap-augmentation on a small audit subset (e.g., 10% of cases): re-judge with A/B swapped to estimate position bias and calibrate confidence thresholds.
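
A sketch of the order_bias telemetry, computed from the per-comparison records produced by the run loop in Section 1 (field names are illustrative):

```python
def order_bias(records) -> dict:
    """Win-rate of S_new conditioned on whether it was presented as A or as B."""
    by_position = {"A": [], "B": []}
    for r in records:
        new_side = "A" if r["mapping"]["A"] == "new" else "B"
        winner = r["judgment"].winner
        outcome = 0.5 if winner == "tie" else (1.0 if winner == new_side else 0.0)
        by_position[new_side].append(outcome)
    return {side: (sum(v) / len(v) if v else None) for side, v in by_position.items()}
```

A large gap between the two conditional win-rates is a signal of position bias and a reason to enable swap-augmentation on the audit subset.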

2) Rubric design (structured, scalable)

Goal: a rubric format that scales across task kinds (citation formatting, extraction, open-ended writing) without devolving into free-form prose.

Rubric pattern

  • A small fixed core of dimensions that apply to all tasks.
  • A task-type extension that adds constraints and dimension weights.

Core dimensions (apply_to=all)

  • correctness_faithfulness (0–5): accurate, non-hallucinated relative to provided context.
  • completeness (0–5): covers required parts of the task.
  • instruction_following (0–5): respects user constraints (style, length, format).
  • clarity (0–5): readable and well-structured.
  • safety (0–5): avoids policy violations / unsafe content.

Task-specific dimension (optional)

  • spec_adherence (0–5): formatting/schema/citation requirements.

Constraints (per case)

Each case carries a constraints list the judge must check:

  • extraction: required keys, allowed values, schema notes.
  • citation: required fields, citation style, allowed normalization rules.
  • writing: max length, tone, required bulleting, etc.

Constraints should be machine-readable, not embedded in free-text guidance.
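
For illustration, constraints for the three kinds above might be encoded like this (a sketch; the record shapes and field names are assumptions, not a fixed schema):

```python
# Hypothetical machine-readable constraint records, passed to the judge verbatim.
extraction_constraints = [
    {"type": "required_keys", "keys": ["title", "authors", "year"]},
    {"type": "allowed_values", "field": "year", "pattern": r"^\d{4}$"},
    {"type": "schema_note", "note": "authors is a list of strings, surname first"},
]

citation_constraints = [
    {"type": "required_fields", "fields": ["doi", "venue"]},
    {"type": "citation_style", "style": "APA 7"},
]

writing_constraints = [
    {"type": "max_length", "unit": "words", "value": 250},
    {"type": "tone", "value": "neutral, third person"},
]
```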


3) Judge output schema (JSON)

The judge should emit diff-friendly JSON, with tags as the primary analytic surface.

Recommended fields:

  • pairwise.winner: A | B | tie
  • pairwise.confidence: 0..1
  • pairwise.deciding_dims: list of dimension ids that determined the outcome
  • pairwise.tags: short, enumerable tags (e.g., missing_field, hallucination, format_violation)
  • pairwise.needs_review: boolean
  • per_response[side].scores[dim]=0..5
  • per_response[side].fatal_tags: e.g., invalid_json, unsafe_content, refuses_task
  • injection.detected: boolean (+ optional short note)

Important: keep free-form text minimal (short_reason only), and prefer tags.
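
A typed sketch of this schema (Python TypedDicts; the field names follow the list above, while the container layout is an assumption):

```python
from typing import Literal, TypedDict

class PairwiseBlock(TypedDict):
    winner: Literal["A", "B", "tie"]
    confidence: float                  # 0..1
    deciding_dims: list[str]           # e.g. ["correctness_faithfulness"]
    tags: list[str]                    # e.g. ["missing_field", "format_violation"]
    needs_review: bool

class ResponseBlock(TypedDict):
    scores: dict[str, int]             # dimension id -> 0..5
    fatal_tags: list[str]              # e.g. ["invalid_json", "unsafe_content"]

class InjectionBlock(TypedDict):
    detected: bool
    note: str                          # optional short note

class JudgeOutput(TypedDict):
    pairwise: PairwiseBlock
    per_response: dict[str, ResponseBlock]   # keyed by side: "A" / "B"
    injection: InjectionBlock
    short_reason: str                  # the only free-form field; keep it short
```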


4) Deterministic parsing vs “LLM-only judging” (recommended wording)

To satisfy “LLM-only judging” while keeping the system useful:

  • Allowed (recommended): deterministic code for aggregation/statistics (win-rate, CIs, bootstrap/Wilson), slicing, report generation, and UI rendering.
  • Not used as ground-truth: deterministic parsers/validators that directly score correctness.

For tasks with deterministic correctness (e.g., JSON extraction):

  • v1 recommendation: keep correctness assessment inside the judge by adding spec_adherence + fatal_tags like invalid_json / missing_required_key.
  • Optional compromise (only if the user agrees): run deterministic parsing as an auxiliary signal for debugging (not gating), clearly labeled “non-judge telemetry”.

5) Regression gates (pairwise + guardrails)

Primary gate (pairwise)

Compute win_rate over N = |D_gate| * K pairwise comparisons:

  • Treat ties as 0.5.
  • Gate suggestion:
    • win_rate >= 0.55, and
    • lower bound of Wilson 95% CI > 0.50.
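
A sketch of the gate computation. The Wilson lower bound follows the standard score-interval formula (z = 1.96 for 95%); allowing fractional successes is how ties contribute 0.5:

```python
import math

def wilson_lower_bound(successes: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion.

    `successes` may be fractional because ties contribute 0.5.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

def passes_gate(successes: float, n: int) -> bool:
    """Gate: win_rate >= 0.55 and the Wilson 95% lower bound > 0.50."""
    if n == 0:
        return False
    return successes / n >= 0.55 and wilson_lower_bound(successes, n) > 0.50
```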

Secondary guardrails (non-negotiables)

  • fatal_tags rate does not increase beyond a small margin.
  • injection.detected rate does not increase.
  • spec_adherence does not regress beyond a small margin on spec-heavy tasks.
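
A minimal sketch of these guardrail checks, assuming per-run tag rates and score means have already been aggregated (the margins are illustrative, not prescribed):

```python
def guardrails_ok(old: dict, new: dict,
                  fatal_margin: float = 0.01,
                  spec_margin: float = 0.1) -> bool:
    """Non-negotiable checks comparing S_new run statistics against S_old."""
    if new["fatal_tag_rate"] > old["fatal_tag_rate"] + fatal_margin:
        return False
    if new["injection_detected_rate"] > old["injection_detected_rate"]:
        return False
    if new["spec_adherence_mean"] < old["spec_adherence_mean"] - spec_margin:
        return False
    return True
```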

Variance reporting

Always report:

  • judge_disagreement_rate (when R>1 or the referee was triggered),
  • per-kind slice metrics (to catch regressions masked by averages).
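
Both can be computed from the same per-comparison records (a sketch; the record fields and the case-to-kind mapping are illustrative):

```python
from collections import defaultdict

def per_kind_win_rate(records, kinds_by_case: dict[str, str]) -> dict[str, float]:
    """Win-rate of S_new sliced by case kind (ties count as 0.5)."""
    buckets = defaultdict(list)
    for r in records:
        new_side = "A" if r["mapping"]["A"] == "new" else "B"
        winner = r["judgment"].winner
        outcome = 0.5 if winner == "tie" else (1.0 if winner == new_side else 0.0)
        buckets[kinds_by_case[r["case_id"]]].append(outcome)
    return {kind: sum(v) / len(v) for kind, v in buckets.items()}

def judge_disagreement_rate(repeat_judgments: list[list]) -> float:
    """Fraction of comparisons whose repeated judgments do not all agree."""
    if not repeat_judgments:
        return 0.0
    disagree = sum(1 for js in repeat_judgments if len({j.winner for j in js}) > 1)
    return disagree / len(repeat_judgments)
```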

6) Anti-leakage, blinding, and injection hardening

  • Blind the judge: never expose “old/new” labels; randomize A/B order; avoid exposing file paths or skill IDs in the judge prompt.
  • Treat outputs as untrusted: the judge system prompt must instruct the judge to ignore any instructions embedded in the candidate outputs.
  • Private gate set: maintain a non-public evaluation set for regression gating; public examples are for iteration and docs only.
  • Canary refresh (optional): periodically rotate a small subset to detect overfitting.

7) arXiv-ready “Method” outline (suggested)

  1. Task Definition: evaluation of M × S × D with S_old vs S_new.
  2. Data: case structure, kinds, constraints, public vs private split.
  3. Generation Protocol: locked runner config, multi-seed sampling K, artifact logging.
  4. LLM Judge Protocol:
    • pairwise A/B with blinding and randomization,
    • rubric format and dimensions,
    • confidence + escalation to referee,
    • bias mitigations (position/verbosity/format bias).
  5. Aggregation & Statistics:
    • win-rate, ties, Wilson CI / bootstrap,
    • variance reporting (seed variance, judge variance),
    • failure tag analyses.
  6. Threat Model: prompt injection, leakage/contamination, judge drift.
  7. Limitations: judge reliability, preference biases, cost/latency tradeoffs.

Key terms for search:

  • “LLM-as-a-judge position bias”, “verbosity bias”, “order effects”
  • “pairwise evaluation Bradley–Terry”, “Elo / TrueSkill”
  • “MT-Bench”, “AlpacaEval”, “Chatbot Arena”
  • “G-Eval rubric prompting”, “evaluator calibration”, “self-consistency for evaluators”