BARS v1 — LLM-only Judge Protocol (Skill Regression)

This document proposes a minimal-but-rigorous v1 evaluation protocol for benchmarking skills (prompt/skill text) under the constraint that judging is performed only by an LLM (no deterministic correctness checks used as ground-truth). Deterministic computation is still used for aggregation/statistics.

0) Conceptual model (clean decomposition)

Entities

  • Model M: the fixed candidate model under test (SUT).
  • Skill S: instructions/config applied at runtime (compare S_old vs S_new).
  • Dataset D: a set of evaluation items/cases x plus task metadata and constraints.
  • Runner config θ: sampling params, max tokens, tool config, etc. (must be locked per run).
  • Seed k: used to sample multiple generations per case.
  • Judge model J: an LLM used to score / compare outputs.

Functions

  • Generate: y = Run(M, S, x; θ, k) (produces a candidate response y plus logs).
  • Judge (pairwise): j = Judge_pair(J, x, A, B; rubric) where {A,B} are blinded candidate outputs.
  • Judge (pointwise): s = Judge_point(J, x, y; rubric) (optional secondary signal).
  • Aggregate: Summary = Agg({j, s}) (deterministic code: win-rate, CIs, tag rates, slices).
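
These functions can be pinned down as type signatures. A minimal Python sketch follows; the names (Case, run_case, judge_pair) are hypothetical and not prescribed by this document:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Case:
    case_id: str
    kind: str                 # e.g. "extraction", "citation", "writing"
    prompt: str
    constraints: list[dict]   # machine-readable checks (see Section 2)

@dataclass
class Generation:
    case_id: str
    skill_id: str             # "old" or "new"; never shown to the judge
    seed: int
    text: str

@dataclass
class PairwiseJudgment:
    winner: Literal["A", "B", "tie"]
    confidence: float         # 0..1
    tags: list[str]
    needs_review: bool

def run_case(model: str, skill: str, case: Case, theta: dict, seed: int) -> Generation:
    """Generate: y = Run(M, S, x; theta, k). The actual model call is omitted in this sketch."""
    ...

def judge_pair(judge_model: str, case: Case, a: str, b: str, rubric: dict) -> PairwiseJudgment:
    """Judge (pairwise): j = Judge_pair(J, x, A, B; rubric), with blinded A/B inputs."""
    ...
```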

Regression objective

The primary regression metric is the pairwise win-rate of S_new vs S_old over a fixed dataset and runner config:

  • win_rate = P(S_new > S_old) treating ties as 0.5.

Absolute (“pointwise”) scores are tracked for trending, debugging, and slice analysis, but pairwise is the gate.
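
As a concrete illustration of the metric, a minimal sketch (the outcome labels are illustrative):

```python
def win_rate(outcomes: list[str]) -> float:
    """Fraction of pairwise comparisons won by S_new, counting ties as 0.5.

    `outcomes` holds one label per comparison: "new", "old", or "tie".
    """
    if not outcomes:
        raise ValueError("no comparisons")
    score = sum(1.0 if o == "new" else 0.5 if o == "tie" else 0.0 for o in outcomes)
    return score / len(outcomes)

# 6 wins, 2 ties, 2 losses over 10 comparisons -> 0.70
print(win_rate(["new"] * 6 + ["tie"] * 2 + ["old"] * 2))
```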


1) v1 evaluation protocol (pairwise primary)

Per run (single M, single D, single comparison S_old vs S_new)

  1. Lock runner config θ (temperature, top_p, max_tokens, stop rules, tool access).
  2. For each case x ∈ D and for each generation seed k ∈ {1..K}:
    • Produce y_old = Run(M, S_old, x; θ, k) and y_new = Run(M, S_new, x; θ, k).
    • Create an A/B presentation by randomizing ordering:
      • A,B = permute(y_old, y_new) and record mapping {A: old|new, B: old|new}.
    • Collect pairwise judgment j_1 = Judge_pair(J, x, A, B; rubric) with temperature=0.
    • Repeat the judgment R-1 more times only if needed (see Defaults below).
  3. Escalate to a stronger referee judge J_ref only when:
    • j has low confidence, or
    • multiple judgments disagree (order-/repeat-induced variance), or
    • needs_review=true due to injection/spec ambiguity.
  4. Aggregate across all comparisons to produce:
    • win_rate (+ Wilson 95% CI),
    • tag rates (fatal_tags, injection.detected),
    • optional pointwise means per dimension (secondary).
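
The per-run loop above, as a short sketch reusing the hypothetical run_case / judge_pair helpers from Section 0 (the ordering and mapping logic is the essential part; everything else is illustrative):

```python
import random

def evaluate_run(model, s_old, s_new, dataset, theta, judge, rubric, k_seeds=2):
    """One run: generate with both skills, blind, judge, and log the A/B mapping."""
    records = []
    for case in dataset:
        for seed in range(k_seeds):
            y_old = run_case(model, s_old, case, theta, seed)
            y_new = run_case(model, s_new, case, theta, seed)

            # Randomize A/B order; record the mapping but never show it to the judge.
            if random.random() < 0.5:
                a, b, mapping = y_new.text, y_old.text, {"A": "new", "B": "old"}
            else:
                a, b, mapping = y_old.text, y_new.text, {"A": "old", "B": "new"}

            judgment = judge_pair(judge, case, a, b, rubric)  # judge temperature = 0
            records.append({"case_id": case.case_id, "seed": seed,
                            "mapping": mapping, "judgment": judgment})
    return records
```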

Defaults (recommended for v1)

  • K (generation repeats / seeds): K=2 (increase to 4 for high-variance tasks).
  • R (judge repeats): R=1 by default; escalate to R=2 when confidence is low or outcome is tie/unclear.
  • Referee escalation: only on disagreement, low-confidence, or needs_review.
  • Judge temperature: 0.
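
One way to encode the escalation rule implied by these defaults (the confidence threshold is an assumption to be calibrated per judge, not a prescribed value):

```python
LOW_CONFIDENCE = 0.6  # illustrative threshold; calibrate per judge model

def needs_referee(judgments) -> bool:
    """Escalate to the stronger referee judge J_ref per the defaults above."""
    if any(j.needs_review for j in judgments):              # injection / spec ambiguity
        return True
    if any(j.confidence < LOW_CONFIDENCE for j in judgments):
        return True
    if len({j.winner for j in judgments}) > 1:              # repeat/order disagreement
        return True
    return False
```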

Swap rules (order bias mitigation)

v1 should at minimum:

  • Randomize A/B order per comparison and record it.
  • Track order_bias telemetry: win-rate conditioned on whether the new skill was A or B.

Optional (stronger, more expensive):

  • Swap-augmentation on a small audit subset (e.g., 10% of cases): re-judge with A/B swapped to estimate position bias and calibrate confidence thresholds.
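
A sketch of the order_bias telemetry, computed from the per-comparison records produced by the run loop in Section 1 (field names are illustrative):

```python
def order_bias(records) -> dict:
    """Win-rate of S_new conditioned on whether it was presented as A or as B."""
    by_position = {"A": [], "B": []}
    for r in records:
        new_side = "A" if r["mapping"]["A"] == "new" else "B"
        winner = r["judgment"].winner
        outcome = 0.5 if winner == "tie" else (1.0 if winner == new_side else 0.0)
        by_position[new_side].append(outcome)
    return {side: (sum(v) / len(v) if v else None) for side, v in by_position.items()}
```

A large gap between the two conditional win-rates is a signal of position bias and a reason to enable swap-augmentation on the audit subset.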

2) Rubric design (structured, scalable)

Goal: a rubric format that scales across task kinds (citation formatting, extraction, open-ended writing) without devolving into free-form prose.

Rubric pattern

  • A small fixed core of dimensions that apply to all tasks.
  • A task-type extension that adds constraints and dimension weights.

Core dimensions (apply_to=all)

  • correctness_faithfulness (0–5): accurate, non-hallucinated relative to provided context.
  • completeness (0–5): covers required parts of the task.
  • instruction_following (0–5): respects user constraints (style, length, format).
  • clarity (0–5): readable and well-structured.
  • safety (0–5): avoids policy violations / unsafe content.

Task-specific dimension (optional)

  • spec_adherence (0–5): formatting/schema/citation requirements.

Constraints (per case)

Each case carries a constraints list the judge must check:

  • extraction: required keys, allowed values, schema notes.
  • citation: required fields, citation style, allowed normalization rules.
  • writing: max length, tone, required bulleting, etc.

Constraints should be machine-readable, not embedded in free-text guidance.
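
For illustration, constraints for the three kinds above might be encoded like this (a sketch; the record shapes and field names are assumptions, not a fixed schema):

```python
# Hypothetical machine-readable constraint records, passed to the judge verbatim.
extraction_constraints = [
    {"type": "required_keys", "keys": ["title", "authors", "year"]},
    {"type": "allowed_values", "field": "year", "pattern": r"^\d{4}$"},
    {"type": "schema_note", "note": "authors is a list of strings, surname first"},
]

citation_constraints = [
    {"type": "required_fields", "fields": ["doi", "venue"]},
    {"type": "citation_style", "style": "APA 7"},
]

writing_constraints = [
    {"type": "max_length", "unit": "words", "value": 250},
    {"type": "tone", "value": "neutral, third person"},
]
```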


3) Judge output schema (JSON)

The judge should emit diff-friendly JSON, with tags as the primary analytic surface.

Recommended fields:

  • pairwise.winner: A | B | tie
  • pairwise.confidence: 0..1
  • pairwise.deciding_dims: list of dimension ids that determined the outcome
  • pairwise.tags: short, enumerable tags (e.g., missing_field, hallucination, format_violation)
  • pairwise.needs_review: boolean
  • per_response[side].scores[dim]=0..5
  • per_response[side].fatal_tags: e.g., invalid_json, unsafe_content, refuses_task
  • injection.detected: boolean (+ optional short note)

Important: keep free-form text minimal (short_reason only), and prefer tags.
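
A typed sketch of this schema (Python TypedDicts; the field names follow the list above, while the container layout is an assumption):

```python
from typing import Literal, TypedDict

class PairwiseBlock(TypedDict):
    winner: Literal["A", "B", "tie"]
    confidence: float                  # 0..1
    deciding_dims: list[str]           # e.g. ["correctness_faithfulness"]
    tags: list[str]                    # e.g. ["missing_field", "format_violation"]
    needs_review: bool

class ResponseBlock(TypedDict):
    scores: dict[str, int]             # dimension id -> 0..5
    fatal_tags: list[str]              # e.g. ["invalid_json", "unsafe_content"]

class InjectionBlock(TypedDict):
    detected: bool
    note: str                          # optional short note

class JudgeOutput(TypedDict):
    pairwise: PairwiseBlock
    per_response: dict[str, ResponseBlock]   # keyed by side: "A" / "B"
    injection: InjectionBlock
    short_reason: str                  # the only free-form field; keep it short
```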


4) Deterministic parsing vs “LLM-only judging” (recommended wording)

To satisfy “LLM-only judging” while keeping the system useful:

  • Allowed (recommended): deterministic code for aggregation/statistics (win-rate, CIs, bootstrap/Wilson), slicing, report generation, and UI rendering.
  • Not used as ground-truth: deterministic parsers/validators that directly score correctness.

For tasks with deterministic correctness (e.g., JSON extraction):

  • v1 recommendation: keep correctness assessment inside the judge by adding spec_adherence + fatal_tags like invalid_json / missing_required_key.
  • Optional compromise (only if the user agrees): run deterministic parsing as an auxiliary signal for debugging (not gating), clearly labeled “non-judge telemetry”.

5) Regression gates (pairwise + guardrails)

Primary gate (pairwise)

Compute win_rate over N = |D_gate| * K pairwise comparisons:

  • Treat ties as 0.5.
  • Gate suggestion:
    • win_rate >= 0.55, and
    • lower bound of Wilson 95% CI > 0.50.
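
A sketch of the gate computation. The Wilson lower bound follows the standard score-interval formula (z = 1.96 for 95%); allowing fractional successes is how ties contribute 0.5:

```python
import math

def wilson_lower_bound(successes: float, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a proportion.

    `successes` may be fractional because ties contribute 0.5.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom

def passes_gate(successes: float, n: int) -> bool:
    """Gate: win_rate >= 0.55 and the Wilson 95% lower bound > 0.50."""
    if n == 0:
        return False
    return successes / n >= 0.55 and wilson_lower_bound(successes, n) > 0.50
```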

Secondary guardrails (non-negotiables)

  • fatal_tags rate does not increase beyond a small margin.
  • injection.detected rate does not increase.
  • spec_adherence does not regress beyond a small margin on spec-heavy tasks.
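
A minimal sketch of these guardrail checks, assuming per-run tag rates and score means have already been aggregated (the margins are illustrative, not prescribed):

```python
def guardrails_ok(old: dict, new: dict,
                  fatal_margin: float = 0.01,
                  spec_margin: float = 0.1) -> bool:
    """Non-negotiable checks comparing S_new run statistics against S_old."""
    if new["fatal_tag_rate"] > old["fatal_tag_rate"] + fatal_margin:
        return False
    if new["injection_detected_rate"] > old["injection_detected_rate"]:
        return False
    if new["spec_adherence_mean"] < old["spec_adherence_mean"] - spec_margin:
        return False
    return True
```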

Variance reporting

Always report:

  • judge_disagreement_rate (when R>1 or the referee was triggered),
  • per-kind slice metrics (to catch regressions masked by averages).
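
Both can be computed from the same per-comparison records (a sketch; the record fields and the case-to-kind mapping are illustrative):

```python
from collections import defaultdict

def per_kind_win_rate(records, kinds_by_case: dict[str, str]) -> dict[str, float]:
    """Win-rate of S_new sliced by case kind (ties count as 0.5)."""
    buckets = defaultdict(list)
    for r in records:
        new_side = "A" if r["mapping"]["A"] == "new" else "B"
        winner = r["judgment"].winner
        outcome = 0.5 if winner == "tie" else (1.0 if winner == new_side else 0.0)
        buckets[kinds_by_case[r["case_id"]]].append(outcome)
    return {kind: sum(v) / len(v) for kind, v in buckets.items()}

def judge_disagreement_rate(repeat_judgments: list[list]) -> float:
    """Fraction of comparisons whose repeated judgments do not all agree."""
    if not repeat_judgments:
        return 0.0
    disagree = sum(1 for js in repeat_judgments if len({j.winner for j in js}) > 1)
    return disagree / len(repeat_judgments)
```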

6) Anti-leakage, blinding, and injection hardening

  • Blind the judge: never expose “old/new” labels; randomize A/B order; avoid exposing file paths or skill IDs in the judge prompt.
  • Treat outputs as untrusted: the judge system prompt must instruct the judge to ignore any instructions embedded in the candidate outputs.
  • Private gate set: maintain a non-public evaluation set for regression gating; public examples are for iteration and docs only.
  • Canary refresh (optional): periodically rotate a small subset to detect overfitting.

7) arXiv-ready “Method” outline (suggested)

  1. Task Definition: evaluation of M × S × D with S_old vs S_new.
  2. Data: case structure, kinds, constraints, public vs private split.
  3. Generation Protocol: locked runner config, multi-seed sampling K, artifact logging.
  4. LLM Judge Protocol:
    • pairwise A/B with blinding and randomization,
    • rubric format and dimensions,
    • confidence + escalation to referee,
    • bias mitigations (position/verbosity/format bias).
  5. Aggregation & Statistics:
    • win-rate, ties, Wilson CI / bootstrap,
    • variance reporting (seed variance, judge variance),
    • failure tag analyses.
  6. Threat Model: prompt injection, leakage/contamination, judge drift.
  7. Limitations: judge reliability, preference biases, cost/latency tradeoffs.

Key terms for search:

  • “LLM-as-a-judge position bias”, “verbosity bias”, “order effects”
  • “pairwise evaluation Bradley–Terry”, “Elo / TrueSkill”
  • “MT-Bench”, “AlpacaEval”, “Chatbot Arena”
  • “G-Eval rubric prompting”, “evaluator calibration”, “self-consistency for evaluators”