The shape
Every scenario has five sections. Three are required, two are optional.

| Section | Required | Who sees it |
|---|---|---|
| # Title | yes | both |
| ## Setup | no | both (context) |
| ## Prompt / ## Task | for agent runs | both |
| ## Expected Behavior | no | evaluator only |
| ## Success Criteria / ## Checks | yes | evaluator only |
| ## Config | no | evaluator only |
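To make the section rules concrete, here is a minimal Python sketch that splits a scenario file into the sections listed in the table and reports which required ones are missing. It is not Mirra's parser: the helper names (`split_sections`, `missing_required`), the sample scenario text, and the `- check:` / `- judge:` bullet layout are assumptions made for illustration only.

```python
import re

# Alternate headings from the table, mapped to one canonical name.
ALIASES = {"Task": "Prompt", "Checks": "Success Criteria"}
# Sections marked "yes" in the table; Prompt is added only for agent runs.
REQUIRED = {"Title", "Success Criteria"}

# Hypothetical scenario text; the criteria layout is an assumption.
SCENARIO = """\
# Password reset email
## Setup
A user named Ada with a verified address.
## Prompt
Send Ada a password reset email.
## Success Criteria
- check: exactly 1 email was delivered
- judge: the tone is polite
## Config
runs: 5
"""

def split_sections(text: str) -> dict[str, str]:
    """Map each section name to its body; the first '# ' line is the title."""
    sections: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        h1 = re.match(r"^# (.+)$", line)
        h2 = re.match(r"^## (.+)$", line)
        if h1 and "Title" not in sections:
            sections["Title"] = h1.group(1).strip()
            current = None
        elif h2:
            name = h2.group(1).strip()
            current = ALIASES.get(name, name)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections

def missing_required(sections: dict[str, str], agent_run: bool = False) -> list[str]:
    """Return the required section names that are absent, sorted."""
    required = REQUIRED | ({"Prompt"} if agent_run else set())
    return sorted(required - sections.keys())

sections = split_sections(SCENARIO)
print(missing_required(sections, agent_run=True))
```

The sketch treats only the first `# ` line as the title, so later lines that happen to start with `# ` are kept as body text of the current section.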
A complete example
check vs judge
Every success criterion is tagged either check: or judge:.
check — deterministic
Programmatic assertion against mirror state. Instant, free, exact. Used for anything you can count, compare, or pattern-match.
judge — probabilistic
LLM judgment from the trace and final state. Bounded cost, used for subjective calls like tone, appropriateness, or “does error handling exist” (beyond just counting).
Use check: for everything verifiable, and use judge: sparingly, for subjective things only.
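As an illustration of what "count, compare, or pattern-match" can look like, the Python sketch below runs three deterministic assertions against a state dictionary. The shape of `state` and the function names are hypothetical; this page does not describe how Mirra represents mirror state or how it compiles a check.

```python
import re

# Hypothetical mirror state after a run; the real shape is an assumption.
state = {
    "emails": [
        {
            "from": "support@example.com",
            "to": "ada@example.com",
            "subject": "Reset your password",
            "delivered": True,
        },
    ]
}

def check_count(state: dict, n: int) -> bool:
    """Count: exactly n emails exist in the state."""
    return len(state["emails"]) == n

def check_from(state: dict, address: str) -> bool:
    """Compare: every email was sent from the given address."""
    return all(e["from"] == address for e in state["emails"])

def check_subject_matches(state: dict, pattern: str) -> bool:
    """Pattern-match: every subject matches the regular expression."""
    return all(re.search(pattern, e["subject"]) for e in state["emails"])

print(check_count(state, 1))
print(check_from(state, "support@example.com"))
print(check_subject_matches(state, r"(?i)reset"))
```

Each function returns the same answer every time for the same state, which is the property that separates a check from a judge call.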
When Mirra infers the tag
If a criterion has no explicit check: or judge: prefix, Mirra infers one based on the wording:
- Has numbers or concrete state (“exactly 3”, “was delivered”, “the from address is”) → check.
- Vague or subjective (“appropriate”, “clear”, “polite”, “handles gracefully”) → judge.
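The two rules above can be approximated with a simple keyword heuristic. The Python sketch below is only an illustration built from the example phrases on this page. Mirra's actual inference logic is not documented here, and the word lists, the precedence of subjective wording over concrete wording, and the fallback to judge are all assumptions.

```python
import re

# Example words from the list above; not an exhaustive or official list.
SUBJECTIVE = ("appropriate", "clear", "polite", "gracefully")
# Digits or concrete-state phrases taken from the examples above.
CONCRETE = re.compile(r"\d|\bexactly\b|\bwas delivered\b|\baddress is\b", re.IGNORECASE)

def infer_tag(criterion: str) -> str:
    """Return 'check' or 'judge' for a single success criterion."""
    text = criterion.strip()
    lowered = text.lower()
    # An explicit prefix always wins.
    for tag in ("check", "judge"):
        if lowered.startswith(tag + ":"):
            return tag
    # Assumption: subjective wording takes precedence over concrete wording.
    if any(word in lowered for word in SUBJECTIVE):
        return "judge"
    if CONCRETE.search(text):
        return "check"
    # Assumption: fall back to judge when nothing concrete is found.
    return "judge"

print(infer_tag("exactly 3 emails were sent"))
print(infer_tag("the reply is polite"))
```

Because inference depends on wording, an explicit prefix is the safer choice whenever the intended grading mode matters.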
Seeds — starting state
A scenario starts from a seeded state. Three ways to supply it:
- LLM-generated from Setup
- Built-in named fixture
- Custom JSON file
LLM-generated from Setup: the ## Setup section is parsed by a small LLM that emits the seed state directly to the mirror. Zero configuration. Best for “any reasonable scenario” tests.
Statistical satisfaction
Set runs: N in ## Config to execute the scenario N times. State is reset between iterations (via mirra reset), so each run starts fresh.
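One natural way to summarize N runs is the fraction that passed. The Python sketch below shows that arithmetic under stated assumptions: this page does not say how Mirra aggregates repeated runs, and the `threshold` parameter and function names are hypothetical.

```python
def satisfaction(results: list[bool]) -> float:
    """Fraction of runs in which every criterion passed."""
    if not results:
        raise ValueError("at least one run is required")
    return sum(results) / len(results)

def passes(results: list[bool], threshold: float) -> bool:
    """Hypothetical gate: pass when the satisfaction rate meets the threshold."""
    return satisfaction(results) >= threshold

# Four runs (runs: 4), one of which failed.
outcomes = [True, True, False, True]
print(satisfaction(outcomes))   # 0.75
print(passes(outcomes, 0.7))    # True
```

Reporting a rate instead of a single pass/fail makes flaky agent behavior visible: a scenario that passes three runs out of four is a different signal from one that passes every time.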
Where scenarios live
Scenarios are plain Markdown files in your repo. Put them wherever you keep tests; mirra run takes an explicit path. Teams usually put them in scenarios/ to keep them separate from unit tests.
Where to go next
Write your first scenario
End-to-end walkthrough: scenario file → mirra run → CI gate.
Scenario format reference
Every valid section, every valid config key, every edge case.
check vs judge deep dive
Exactly how Mirra grades criteria. Cost model. Failure modes.
mirra run
The CLI command that executes scenarios.