## Structure at a glance

Section headings are case-insensitive (`## Setup` and `## setup` both work). Order doesn't matter: the parser looks up sections by heading, not position.
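The sections described below fit together like this (an illustrative skeleton; body text elided):

```markdown
# Scenario name

## Setup
Plain-English starting state.

## Prompt
Instruction for the driver.

## Expected Behavior
Ground truth for the judge.

## Success Criteria
- check: ...
- judge: ...

## Config
mirrors: ...
```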
### `# Title`
The human-readable scenario name. Used in dashboard listings and in log output.
### `## Setup`
Plain-English description of the starting state. Parsed by a small LLM that emits seed state directly to the mirror. Also included in the agent’s prompt as context.
Each mirror in `mirrors:` starts from its default state, usually an empty fixture, unless the scenario's `## Config` says otherwise.
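For instance, a setup section might read like this (the wording is illustrative; any plain-English description of the starting state works):

```markdown
## Setup
A Resend account whose transactional sending quota is nearly exhausted.
```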
### `## Prompt` / `## Task`

The instruction shown to the driver (agent, test, or custom script). Both names work: `## Prompt` is canonical, `## Task` is an alias.
### `## Expected Behavior`

Context for the evaluator only; not shown to the agent or test. Use this to document the behavior you want so the LLM judge has ground truth when grading `judge:` criteria. If omitted, `judge:` evaluation relies only on the criterion text and the trace.
### `## Success Criteria` / `## Checks`

The list of criteria graded after the run. Both names work: `## Success Criteria` is canonical, `## Checks` is an alias. Tags (`check:` and `judge:`) are optional; Mirra infers one if omitted. See the Evaluation reference for tag rules and the full inference grammar.
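A criteria list might look like this (the criterion text is illustrative; the `check:` and `judge:` tags are the ones described above):

```markdown
## Success Criteria
- check: exactly one email was sent through the resend mirror
- judge: the agent clearly explained why the send attempt failed
```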
### `## Config`

Key-value pairs controlling scenario execution. Parsed as YAML-ish `key: value` lines.
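For example, using only keys documented below (the values are illustrative):

```markdown
## Config
mirrors: stripe, resend
fixture: resend:transactional-busy
mirror-version: stripe@0.7.3
```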
## Config keys

- `mirrors:` Comma-separated list of mirrors to provision for this scenario. Required; Mirra needs to know what to spin up. Example: `mirrors: stripe, resend, twilio`.
- `fixture:` Named built-in fixture applied as the seed. Format: `<mirror>:<fixture>`, or just `<fixture>` when there's only one mirror in `mirrors:`. Example: `fixture: resend:transactional-busy`.
- `fixture-file:` Path to a custom JSON seed file, relative to the scenario file. Example: `fixture-file: ./fixtures/my-seed.json`. Takes precedence over `fixture:` if both are set.
- `mirror-version:` Pin a specific mirror version. Example: `mirror-version: stripe@0.7.3`. If omitted, uses the latest stable.
- Per-run timeout in seconds. If the run exceeds this, it's terminated and counted as a failure.
- How many times to execute the scenario. Each run starts from fresh seeded state; the final satisfaction score is the mean across runs.
- Which agent drives the scenario when using agent mode. Options: `claude-code`, `cursor`, `copilot`, `cline`, `custom`. Default: `custom`. See `mirra run`.
- The LLM used to grade `judge:` criteria. Options: `claude-haiku-4-5`, `gpt-4o-mini`, `claude-sonnet-4-6`. Tiny judges are cheap and usually right; larger judges cost more but catch more edge cases.
- When `true`, the session created for this scenario is persistent. Rarely used; scenarios typically run in ephemeral mode.
## A complete reference example
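A minimal complete scenario, assembled from the sections above. The title, prose, and criterion text are illustrative; the sections, tags, and config keys are the documented ones:

```markdown
# Resend quota exhaustion

## Setup
A Resend account whose transactional sending quota is nearly exhausted.

## Prompt
Send a welcome email to the newly registered user and report the result.

## Expected Behavior
The agent attempts the send, notices the quota problem, and reports it
instead of silently retrying.

## Success Criteria
- check: a send request reached the resend mirror
- judge: the agent reported the quota problem clearly

## Config
mirrors: resend
fixture: transactional-busy
```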
## Validation

Invalid scenarios fail fast when `mirra run` loads them. Common errors:
### Missing `# Title`

Every scenario must have exactly one H1. Fix: add `# Name of scenario` at the top.

### No criteria

A scenario without any `## Success Criteria` items can't be evaluated. Fix: add at least one criterion.

### Unknown mirror in `mirrors:`

`mirrors: github` fails because `github` isn't in the catalog. Fix: check Mirrors — overview.

### Unknown fixture

`fixture: resend:transactional-bust` fails because of the typo. Fix: reference an exact fixture name from the mirror's documentation.

### `fixture:` without `mirrors:`

`fixture: transactional-busy` without a `mirrors:` entry is ambiguous. Fix: use the qualified form `resend:transactional-busy`, or add `mirrors:`.

## Where to go next
- Evaluation: how `check:` and `judge:` criteria are graded.
- First scenario: write a real scenario end-to-end.