BARS v1 — LLM-only Judge Protocol (Skill Regression)
This document proposes a minimal-but-rigorous v1 evaluation protocol for benchmarking skills (prompt/skill text) under the constraint that judging is performed only by an LLM (no deterministic correctness checks used as ground-truth). Deterministic computation is still used for aggregation/statistics.
0) Conceptual model (clean decomposition)
Entities
- Model `M`: the fixed candidate model under test (SUT).
- Skill `S`: instructions/config applied at runtime (compare `S_old` vs `S_new`).
- Dataset `D`: a set of evaluation items/cases `x` plus task metadata and constraints.
- Runner config `θ`: sampling params, max tokens, tool config, etc. (must be locked per run).
- Seed `k`: used to sample multiple generations per case.
- Judge model `J`: an LLM used to score / compare outputs.
Functions
- Generate: `y = Run(M, S, x; θ, k)` (produces a candidate response `y` plus logs).
- Judge (pairwise): `j = Judge_pair(J, x, A, B; rubric)`, where `{A, B}` are blinded candidate outputs.
- Judge (pointwise): `s = Judge_point(J, x, y; rubric)` (optional secondary signal).
- Aggregate: `Summary = Agg({j, s})` (deterministic code: win-rate, CIs, tag rates, slices).
Regression objective
The primary regression metric is the pairwise win-rate of `S_new` vs `S_old` over a fixed dataset and runner config:

`win_rate = P(S_new > S_old)`, treating ties as 0.5.
Absolute (“pointwise”) scores are tracked for trending, debugging, and slice analysis, but pairwise is the gate.
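The tie handling above can be made concrete in a few lines. This is a minimal sketch; the `Outcome` labels are illustrative, not a fixed schema:

```python
from typing import Literal

Outcome = Literal["new", "old", "tie"]

def win_rate(outcomes: list[Outcome]) -> float:
    """Pairwise win-rate of S_new vs S_old, counting each tie as 0.5 wins."""
    score = sum(1.0 if o == "new" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)
```

For example, `win_rate(["new", "new", "tie", "old"])` is `(1 + 1 + 0.5 + 0) / 4 = 0.625`.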
1) v1 evaluation protocol (pairwise primary)
Per run (single `M`, single `D`, single comparison `S_old` vs `S_new`):
- Lock runner config `θ` (temperature, top_p, max_tokens, stop rules, tool access).
- For each case `x ∈ D` and for each generation seed `k ∈ {1..K}`:
  - Produce `y_old = Run(M, S_old, x; θ, k)` and `y_new = Run(M, S_new, x; θ, k)`.
  - Create an A/B presentation by randomizing the ordering: `A, B = permute(y_old, y_new)` and record the mapping `{A: old|new, B: old|new}`.
  - Collect a pairwise judgment `j_1 = Judge_pair(J, x, A, B; rubric)` with `temperature=0`.
  - Repeat the judgment `R-1` more times only if needed (see defaults).
- Escalate to a stronger referee judge `J_ref` only when:
  - `j` has low confidence, or
  - multiple judgments disagree (order-/repeat-induced variance), or
  - `needs_review=true` due to injection/spec ambiguity.
- Aggregate across all comparisons to produce:
  - `win_rate` (+ Wilson 95% CI),
  - tag rates (`fatal_tags`, `injection.detected`),
  - optional pointwise means per dimension (secondary).
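The per-case loop can be sketched as follows. `run(skill, x, seed)` and `judge_pair(x, a, b)` are assumed callables standing in for `Run(M, S, x; θ, k)` and `Judge_pair(J, x, A, B; rubric)`; the locked `M`, `θ`, `J`, and rubric are closed over by the caller:

```python
import random

def compare_case(run, judge_pair, x, s_old, s_new, K, rng=None):
    """One case: K seeded generation pairs, each judged once under a
    blinded, randomized A/B ordering with the mapping recorded."""
    rng = rng or random.Random()
    records = []
    for k in range(1, K + 1):
        y_old = run(s_old, x, k)
        y_new = run(s_new, x, k)
        # Blind the judge: randomize which output is shown as A,
        # but keep the mapping for later unblinded aggregation.
        if rng.random() < 0.5:
            a, b, mapping = y_old, y_new, {"A": "old", "B": "new"}
        else:
            a, b, mapping = y_new, y_old, {"A": "new", "B": "old"}
        records.append({"seed": k, "mapping": mapping,
                        "judgment": judge_pair(x, a, b)})
    return records
```

The judge only ever sees `a` and `b`; the old/new identity lives in the recorded mapping, which is what makes the order-bias telemetry below computable.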
Defaults (recommended for v1)
- K (generation repeats / seeds):
K=2(increase to 4 for high-variance tasks). - R (judge repeats):
R=1by default; escalate toR=2when confidence is low or outcome is tie/unclear. - Referee escalation: only on disagreement, low-confidence, or
needs_review. - Judge temperature:
0.
Swap rules (order bias mitigation)
v1 should at minimum:
- Randomize A/B order per comparison and record it.
- Track order_bias telemetry: win-rate conditioned on whether the new skill was A or B.
Optional (stronger, more expensive):
- Swap-augmentation on a small audit subset (e.g., 10% of cases): re-judge with A/B swapped to estimate position bias and calibrate confidence thresholds.
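The order-bias telemetry is a simple conditional win-rate. A minimal sketch, assuming records shaped like the loop above (a `mapping` of `{"A": "old"|"new", ...}` and a judgment whose `winner` is `"A"`, `"B"`, or `"tie"` — field names illustrative):

```python
def order_bias(records):
    """Win-rate of S_new conditioned on whether it was shown as A or as B.

    A large gap between the two rates flags position bias in the judge.
    """
    scores = {"A": [], "B": []}
    for r in records:
        new_pos = "A" if r["mapping"]["A"] == "new" else "B"
        winner = r["judgment"]["winner"]
        s = 0.5 if winner == "tie" else float(winner == new_pos)
        scores[new_pos].append(s)
    return {pos: sum(v) / len(v) if v else None
            for pos, v in scores.items()}
```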
2) Rubric design (structured, scalable)
Goal: a rubric format that scales across task kinds (citation formatting, extraction, open-ended writing) without becoming prose.
Rubric pattern
- A small fixed core of dimensions that apply to all tasks.
- A task-type extension that adds constraints and dimension weights.
Core dimensions (apply_to=all)
- `correctness_faithfulness` (0–5): accurate, non-hallucinated relative to provided context.
- `completeness` (0–5): covers required parts of the task.
- `instruction_following` (0–5): respects user constraints (style, length, format).
- `clarity` (0–5): readable and well-structured.
- `safety` (0–5): avoids policy violations / unsafe content.
Task-specific dimension (optional)
`spec_adherence` (0–5): formatting/schema/citation requirements.
Constraints (per case)
Each case carries a constraints list the judge must check:
- extraction: required keys, allowed values, schema notes.
- citation: required fields, citation style, allowed normalization rules.
- writing: max length, tone, required bulleting, etc.
Constraints should be machine-readable, not embedded in free-text guidance.
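One possible shape for a machine-readable case record follows; every field name here is an assumption for illustration, not a fixed schema. The point is that constraints are typed entries the judge can check one by one, rather than free text:

```python
# Illustrative case record with machine-readable constraints.
case = {
    "id": "cit-0042",
    "kind": "citation",
    "input": "Format the following reference in APA style: ...",
    "constraints": [
        {"type": "required_fields", "fields": ["author", "year", "title"]},
        {"type": "citation_style", "value": "APA"},
        {"type": "max_length", "unit": "words", "limit": 120},
    ],
}
```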
3) Judge output schema (JSON)
Judge should emit diff-friendly JSON with tags as the primary analytic surface.
Recommended fields:
- `pairwise.winner`: `A | B | tie`
- `pairwise.confidence`: `0..1`
- `pairwise.deciding_dims`: list of dimension ids that determined the outcome
- `pairwise.tags`: short, enumerable tags (e.g., `missing_field`, `hallucination`, `format_violation`)
- `pairwise.needs_review`: boolean
- `per_response[side].scores[dim]`: `0..5`
- `per_response[side].fatal_tags`: e.g., `invalid_json`, `unsafe_content`, `refuses_task`
- `injection.detected`: boolean (+ optional short note)
Important: keep free-form text minimal (`short_reason` only), and prefer tags.
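A worked instance of the schema, with all values invented for illustration:

```python
import json

# Illustrative judgment conforming to the fields above; values are made up.
judgment = {
    "pairwise": {
        "winner": "A",
        "confidence": 0.8,
        "deciding_dims": ["correctness_faithfulness", "spec_adherence"],
        "tags": ["missing_field"],
        "needs_review": False,
        "short_reason": "B omits a required citation field.",
    },
    "per_response": {
        "A": {"scores": {"correctness_faithfulness": 4, "completeness": 5},
              "fatal_tags": []},
        "B": {"scores": {"correctness_faithfulness": 2, "completeness": 3},
              "fatal_tags": ["missing_required_key"]},
    },
    "injection": {"detected": False},
}

print(json.dumps(judgment, indent=2))
```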
4) Deterministic parsing vs “LLM-only judging” (recommended wording)
To satisfy “LLM-only judging” while keeping the system useful:
- Allowed (recommended): deterministic code for aggregation/statistics (win-rate, CIs, bootstrap/Wilson), slicing, report generation, and UI rendering.
- Not used as ground-truth: deterministic parsers/validators that directly score correctness.
For tasks with deterministic correctness (e.g., JSON extraction):
- v1 recommendation: keep correctness assessment inside the judge by adding `spec_adherence` + `fatal_tags` like `invalid_json` / `missing_required_key`.
- Optional compromise (only if the user agrees): run deterministic parsing as an auxiliary signal for debugging (not gating), clearly labeled "non-judge telemetry".
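The auxiliary-telemetry compromise can be as small as a parse check whose result is logged but never gates. A sketch:

```python
import json

def json_telemetry(output: str) -> dict:
    """Deterministic parse check recorded as 'non-judge telemetry' only.

    Debugging signal, never ground-truth or a gate: correctness still
    belongs to the judge via spec_adherence and fatal_tags.
    """
    try:
        json.loads(output)
        return {"json_valid": True}
    except json.JSONDecodeError as exc:
        return {"json_valid": False, "note": str(exc)[:80]}
```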
5) Regression gates (pairwise + guardrails)
Primary gate (pairwise)
Compute `win_rate` over `N = |D_gate| * K` pairwise comparisons:
- Treat ties as 0.5.
- Gate suggestion: `win_rate >= 0.55`, and the lower bound of the Wilson 95% CI > `0.50`.
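The gate is straightforward to compute deterministically. A sketch using the standard Wilson score interval; note that with ties counted as 0.5 the win total is fractional, so treating it as a binomial count is an approximation (a bootstrap avoids this):

```python
import math

def wilson_lower(wins: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the proportion wins/n."""
    if n == 0:
        return 0.0
    p = wins / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / (1 + z * z / n)

def passes_gate(wins: float, n: int) -> bool:
    """v1 gate: win_rate >= 0.55 and Wilson 95% lower bound > 0.50."""
    return n > 0 and wins / n >= 0.55 and wilson_lower(wins, n) > 0.50
```

For instance, 60 wins out of 100 comparisons passes (lower bound ≈ 0.502), while 55/100 does not, even though the point estimate clears 0.55.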
Secondary guardrails (non-negotiables)
- `fatal_tags` rate does not increase beyond a small margin.
- `injection.detected` rate does not increase.
- `spec_adherence` does not regress beyond a small margin on spec-heavy tasks.
Variance reporting
Always report:
- `judge_disagreement_rate` (when `R>1` or the referee is triggered),
- per-kind slice metrics (to catch regressions masked by averages).
6) Anti-leakage, blinding, and injection hardening
- Blind the judge: never expose “old/new” labels; randomize A/B; avoid paths/skill IDs in judge prompt.
- Treat outputs as untrusted: judge system prompt must instruct to ignore any instructions embedded in model outputs.
- Private gate set: maintain a non-public evaluation set for regression gating; public examples are for iteration and docs only.
- Canary refresh (optional): periodically rotate a small subset to detect overfitting.
7) ArXiv-ready “Method” outline (suggested)
- Task Definition: evaluation of `M × S × D` with `S_old` vs `S_new`.
- Data: case structure, kinds, constraints, public vs private split.
- Generation Protocol: locked runner config, multi-seed sampling `K`, artifact logging.
- LLM Judge Protocol:
- pairwise A/B with blinding and randomization,
- rubric format and dimensions,
- confidence + escalation to referee,
- bias mitigations (position/verbosity/format bias).
- Aggregation & Statistics:
- win-rate, ties, Wilson CI / bootstrap,
- variance reporting (seed variance, judge variance),
- failure tag analyses.
- Threat Model: prompt injection, leakage/contamination, judge drift.
- Limitations: judge reliability, preference biases, cost/latency tradeoffs.
Key terms for search:
- “LLM-as-a-judge position bias”, “verbosity bias”, “order effects”
- “pairwise evaluation Bradley–Terry”, “Elo / TrueSkill”
- “MT-Bench”, “AlpacaEval”, “Chatbot Arena”
- “G-Eval rubric prompting”, “evaluator calibration”, “self-consistency for evaluators”