Every criterion is graded either deterministically (tagged `check:`) or by an LLM judge (tagged `judge:`). This page documents how each works, what each costs, and what each can and can't catch.
## The two tags
### `check:` — deterministic
Checked programmatically against mirror state and the session trace. Instant. Free. Exact. The evaluator either finds the thing the criterion describes, or it doesn’t.
### `judge:` — probabilistic
Checked by an LLM from the trace, final state, and `## Expected Behavior` context. Bounded cost (~$0.0004 per criterion on Haiku). Used for subjective calls a deterministic check can't make cleanly.

Every criterion either carries an explicit tag (`check:` or `judge:`), or Mirra infers one. See Inference rules below.
## How `check:` works
The evaluator has access to:
- Mirror state — full SQLite read of every mirror in the session at the time evaluation runs (right after the scenario finishes).
- Trace — every request, response, webhook dispatch, state change, and error that happened during the scenario run, stored in PostgreSQL.
- Signature verification log — the engine tracks every signature verification attempt and whether it succeeded.
`check:` criteria are evaluated with a small deterministic grammar over those three sources. The evaluator understands:
- Counts — "Exactly 1 email is created" → counts rows in the mirror's `emails` table.
- Existence — "A webhook endpoint for `invoice.payment_failed` was registered" → queries the endpoint list.
- Properties — "The email's `from` is `hello@mail.acme.com`" → reads the email row, compares the field.
- Signature verification — "The webhook signature was verified" → looks at the trace's `signature_verifications` collection.
- Trace events — "No errors were logged" → counts trace entries with `level >= error`.
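The count-style checks above can be sketched as a small deterministic evaluator. Everything here (the regex grammar, table naming, the mirror being a plain SQLite handle) is a hypothetical illustration, not Mirra's actual internals:

```python
import re
import sqlite3

def check_count(db: sqlite3.Connection, criterion: str) -> bool:
    """Hedged sketch: grade an 'Exactly N <noun> ...' criterion against mirror state."""
    m = re.match(r"Exactly (\d+) (\w+)", criterion)
    if not m:
        raise ValueError("not a count criterion")
    expected, noun = int(m.group(1)), m.group(2)
    table = noun if noun.endswith("s") else noun + "s"  # 'email' -> 'emails'
    (actual,) = db.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    return actual == expected

# Usage against an in-memory stand-in for a mirror
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE emails (id INTEGER, sender TEXT)")
db.execute("INSERT INTO emails VALUES (1, 'hello@mail.acme.com')")
print(check_count(db, "Exactly 1 email is created"))  # True
```

The point of the sketch is the shape, not the grammar: every branch is a plain query over recorded state, so the answer is exact and repeatable.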
### Cost
`check:` criteria are free. They run in under 10ms each and don't call any external API.
## How `judge:` works
A `judge:` criterion is graded by a single LLM call that bundles the criterion text, trimmed trace excerpts, the final mirror state, and any `## Expected Behavior` context. Default judge models are `gpt-4o-mini` and `claude-sonnet-4-6`; configurable per scenario via `judge-model:` in `## Config`.
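A minimal sketch of how such a call might be assembled. The prompt wording, character caps, and `build_judge_prompt` helper are assumptions for illustration, not Mirra's real prompt:

```python
def build_judge_prompt(criterion: str, trace: str, final_state: str,
                       expected_behavior: str = "") -> str:
    """Assemble a bounded grading prompt; char caps approximate the token caps."""
    def trim(text: str, max_chars: int) -> str:
        return text if len(text) <= max_chars else text[:max_chars] + "\n[...trimmed]"

    return "\n\n".join([
        "Grade the following criterion as PASS or FAIL.",
        f"Criterion: {criterion}",
        f"Expected behavior: {expected_behavior or '(none provided)'}",
        "Trace excerpt:\n" + trim(trace, 16_000),     # ~4K tokens at ~4 chars/token
        "Final state:\n" + trim(final_state, 8_000),  # ~2K tokens
    ])

prompt = build_judge_prompt(
    "The email subject is appropriate for a welcome message",
    trace="POST /v1/emails -> 201",
    final_state='{"emails": [{"subject": "Welcome to Acme!"}]}',
)
```

Trimming before the call, rather than after a failure, is what keeps the cost ceiling predictable per criterion.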
### Cost
Judge cost is bounded:

- Trace excerpts trimmed to fit ~4K tokens max.
- Final state trimmed to ~2K tokens max.
- Typical judge call: ~6K input tokens + ~100 output tokens.
- At Haiku pricing (Apr 2026): ~$0.0004 per criterion per run.
3 `judge:` criteria × 5 runs = 15 judge calls = ~$0.006 per scenario execution. Fractions of a cent.
## Why probabilistic at all
Two classes of judgment don't reduce to programmatic checks:

- Tone and appropriateness — "the email subject is appropriate for a welcome message" is a judgment, not a pattern match.
- Higher-order existence — “error handling for bounces is implemented.” A deterministic check can count handlers, but it can’t tell whether they handle the case usefully.
## Inference rules
If a criterion has no explicit `check:` or `judge:` prefix, Mirra picks one based on the criterion's wording:
| Signal in criterion | Inferred tag |
|---|---|
| Contains a number: "exactly 3", "at least 1", "fewer than 5" | `check:` |
| Contains a concrete value: `hello@mail.acme.com`, `sub_abc` | `check:` |
| Contains a past-tense event: "was delivered", "was verified", "was created" | `check:` |
| Contains "no errors", "no signature failures" | `check:` |
| Contains subjective words: "appropriate", "clear", "graceful", "helpful" | `judge:` |
| Contains "error handling exists", "correctly handles" without numbers | `judge:` |
| Multi-condition: "X and the code is well-structured" | `judge:` (the soft half wins) |
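The table reduces to an ordered series of pattern checks, with soft signals taking precedence so the "soft half wins" rule holds. A hypothetical sketch (the real inference order and word lists may differ):

```python
import re

# Assumed word list, drawn from the signals in the table above
SUBJECTIVE = ("appropriate", "clear", "graceful", "helpful",
              "well-structured", "correctly handles", "error handling exists")

def infer_tag(criterion: str) -> str:
    c = criterion.lower()
    # Soft signals first, so multi-condition criteria fall to the judge.
    if any(word in c for word in SUBJECTIVE):
        return "judge:"
    if re.search(r"\b(exactly|at least|fewer than)\s+\d+\b", c):
        return "check:"
    if re.search(r"\bno (errors|signature failures)\b", c):
        return "check:"
    if re.search(r"\bwas (delivered|verified|created|registered)\b", c):
        return "check:"
    return "judge:"  # assumed default when nothing concrete matches

print(infer_tag("Exactly 3 emails are created"))                          # check:
print(infer_tag("The email subject is appropriate"))                      # judge:
print(infer_tag("Exactly 1 email is created and the code is well-structured"))  # judge:
```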
Explicit tags always win over inference (you can force `judge:` on something obviously deterministic).
## Writing criteria that grade well
### Be specific about counts and values
✓ "Exactly 1 email is created with `from` equal to `hello@mail.acme.com`"
✗ "The right email is sent"

Specific criteria grade as `check:` (free, fast, exact). Vague criteria grade as `judge:` (small cost, LLM-variable answer).
### Split compound criteria
✓ "Exactly 1 email is created" + "The webhook signature was verified"
✗ "An email is created and the webhook is verified"

One assertion per criterion makes per-criterion pass/fail reporting useful. Compound criteria give you a single binary for two things.
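In scenario-file terms the split is mechanical. The section name and list syntax below are illustrative, not a confirmed part of Mirra's format:

```markdown
## Criteria

<!-- ✗ compound: one binary result covering two assertions -->
- check: An email is created and the webhook is verified

<!-- ✓ split: independent pass/fail per assertion -->
- check: Exactly 1 email is created
- check: The webhook signature was verified
```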
### Lean on `judge:` for genuinely soft things, not as a fallback
`judge:` works best for genuine taste calls. Using `judge:` as "I couldn't write a clean check" is a signal your criterion is too vague to be useful — tighten it instead.

### Use `## Expected Behavior` to help the judge
When you use `judge:`, write a `## Expected Behavior` section that defines what "correct" means. Without it, the judge infers from the criterion text alone, which is usually too thin for stable grading.

## Statistical satisfaction
Running a scenario with `runs: N` produces N independent gradings per criterion. The satisfaction score is the mean.
Example — 5 runs, 6 criteria:
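Mirra's actual report format isn't reproduced here; as a hedged sketch (criterion names and pass/fail results below are invented), the mean-of-N computation looks like:

```python
# Hypothetical results: 6 criteria x 5 runs, True = criterion passed on that run
gradings = {
    "Exactly 1 email is created":         [True, True, True, True, True],
    "The webhook signature was verified": [True, True, True, True, True],
    "No errors were logged":              [True, True, True, True, False],
    "Subject is appropriate":             [True, True, False, True, True],
    "Bounce handling is implemented":     [True, False, True, True, False],
    "Failure messages are clear":         [True, True, True, False, True],
}

for criterion, results in gradings.items():
    satisfaction = sum(results) / len(results)  # mean of N independent gradings
    print(f"{satisfaction:.2f}  {criterion}")
```

Deterministic `check:` criteria tend to sit at 0.0 or 1.0; fractional scores usually mean either a flaky implementation or a `judge:` criterion with thin `## Expected Behavior` context.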
## Where to go next
- **Scenario format**: Every valid section and config key.
- **First scenario**: Write criteria that grade well end-to-end.