A scenario is a markdown file in your repository that describes what should happen during a session and how to tell if it worked. Scenarios replace the usual test-suite grammar (arrange-act-assert in code) with a grammar that both humans and coding agents can read, write, and reason about.

## The shape

Every scenario has six sections. Two are always required, one is required for agent runs, and the other three are optional.
| Section | Required | Who sees it |
| --- | --- | --- |
| `# Title` | yes | both |
| `## Setup` | no | both (context) |
| `## Prompt` / `## Task` | for agent runs | both |
| `## Expected Behavior` | no | evaluator only |
| `## Success Criteria` / `## Checks` | yes | evaluator only |
| `## Config` | no | evaluator only |

## A complete example

```markdown
# Send Welcome Email on Signup

## Setup
A Resend account with one verified domain `mail.acme.com` and no emails
sent yet.

## Prompt
When a user signs up with email `alice@example.com`, send a welcome email
from `hello@mail.acme.com` with subject "Welcome to Acme" and track
delivery status via webhook.

## Expected Behavior
The integration should create the email via the Resend API, receive an
`email.sent` webhook immediately, then `email.delivered` within a few
seconds. Failed deliveries (bounces) should update local user state.

## Success Criteria
- check: Exactly 1 email is created in the Resend mirror
- check: The email's `from` address is `hello@mail.acme.com`
- check: At least 1 `email.sent` webhook was received by the handler
- check: The webhook signature was verified (no signature errors logged)
- judge: The email subject and body are appropriate for a welcome message
- judge: Error handling for bounces is implemented

## Config
mirrors: resend
timeout: 60
runs: 3
```
Run it:

```
$ mirra run scenarios/welcome-email.md

 provisioning resend mirror…
 session ses_a7k2 ready
 running agent…
 run 1/3 complete
 run 2/3 complete
 run 3/3 complete

satisfaction score: 83% (15/18 criteria passed across 3 runs)
```

## check vs judge

Every success criterion is tagged either `check:` or `judge:`.

### check (deterministic)

A programmatic assertion against mirror state. Instant, free, exact. Used for anything you can count, compare, or pattern-match.

### judge (probabilistic)

An LLM judgment made from the trace and final state. Bounded cost; used for subjective calls like tone, appropriateness, or "does error handling exist" (beyond just counting).

A good scenario leans on `check:` for everything verifiable and uses `judge:` sparingly, for the genuinely subjective things only.

### When Mirra infers the tag

If a criterion has no explicit `check:` or `judge:` prefix, Mirra infers one from the wording:

- Numbers or concrete state ("exactly 3", "was delivered", "the from address is") → `check`.
- Vague or subjective wording ("appropriate", "clear", "polite", "handles gracefully") → `judge`.

You can always override:

```markdown
- check: The error message is clear   ← forces deterministic (may not be what you want)
- judge: Exactly 3 emails were sent   ← forces LLM (expensive — don't)
```

See the Evaluation reference for the full rules.
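Mirra's actual inference rules live in the Evaluation reference; purely as an illustration, a heuristic of the kind described above could be sketched like this (the word lists and regexes are assumptions, not Mirra's real ones):

```python
import re

# Illustrative only: a rough sketch of the wording heuristic described
# above, NOT Mirra's actual inference rules.
SUBJECTIVE_WORDS = {"appropriate", "clear", "polite", "gracefully", "reasonable"}

def infer_tag(criterion: str) -> str:
    text = criterion.strip()
    # An explicit prefix always wins over inference.
    if text.startswith("check:"):
        return "check"
    if text.startswith("judge:"):
        return "judge"
    lowered = text.lower()
    # Vague or subjective wording suggests an LLM judgment.
    if any(word in lowered for word in SUBJECTIVE_WORDS):
        return "judge"
    # Numbers or concrete-state phrasing suggest a deterministic check.
    if re.search(r"\d", lowered) or re.search(r"\b(exactly|at least|at most)\b", lowered):
        return "check"
    # Default to the cheap deterministic option when unsure.
    return "check"
```

Note the order: subjective wording is tested first, so "The error message is clear" lands on `judge` even though it contains concrete-sounding words.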

## Seeds: starting state

A scenario starts from a seeded state. There are three ways to supply it; the simplest is a plain-English `## Setup` section:

```markdown
## Setup
A Resend account with one verified domain and 50 sent emails
across the last 30 days, with a 5% bounce rate.
```

The plain-English `## Setup` section is parsed by a small LLM that emits the seed state directly to the mirror. Zero configuration. Best for "any reasonable scenario" tests.

## Statistical satisfaction

Set `runs: N` in `## Config` to execute the scenario N times. Each run resets to a fresh seeded state between iterations (via `mirra reset`).

```markdown
## Config
mirrors: resend
runs: 5
```

The output is a satisfaction score: the mean percentage of criteria passed across runs.

```
satisfaction score: 87% (26/30 criteria passed across 5 runs)

per-criterion breakdown:
  ✓ Exactly 1 email is created in the Resend mirror          5/5
  ✓ The email's from address is hello@mail.acme.com          5/5
  ✓ At least 1 email.sent webhook was received               5/5
  ✓ The webhook signature was verified                       5/5
  ✗ The email subject and body are appropriate for a welcome 4/5
  ✗ Error handling for bounces is implemented                2/5
```

This is the shape of output that handles agent-generated code gracefully: one flaky run doesn't fail your CI, but two flaky criteria across five runs are a signal worth acting on.
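The arithmetic behind the score is just a mean over criterion-run pairs. A minimal sketch, using the pass counts from a breakdown like the one above:

```python
def satisfaction_score(per_criterion_passes: list[int], runs: int) -> tuple[int, int, float]:
    """Mean fraction of criteria passed across all runs.

    per_criterion_passes[i] is how many of the `runs` runs passed
    criterion i (e.g. 4 means the criterion passed in 4 runs of 5).
    """
    passed = sum(per_criterion_passes)
    total = len(per_criterion_passes) * runs
    return passed, total, 100 * passed / total

# Four criteria at 5/5, one at 4/5, one at 2/5, over 5 runs:
passed, total, pct = satisfaction_score([5, 5, 5, 5, 4, 2], runs=5)
# → 26 passed of 30, ≈ 86.7%, reported rounded as 87%
```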

## Where scenarios live

Scenarios are plain markdown files in your repo. Put them wherever you keep tests:

```
acme-app/
├── src/
├── tests/
│   └── integration/
├── scenarios/              ← convention
│   ├── welcome-email.md
│   ├── subscription-upgrade.md
│   └── refund-flow.md
└── package.json
```

Mirra doesn't care about the path: `mirra run` takes an explicit path. Teams usually put them in `scenarios/` to keep them separate from unit tests.
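For a CI gate, a loop over `mirra run <path>` is enough. A hypothetical GitHub Actions job might look like this (the job names and checkout step are illustrative; it assumes `mirra` is already installed on the runner's PATH, and relies on the Actions default shell stopping the step when a command fails):

```yaml
# Hypothetical CI gate; names and setup are illustrative.
# Assumes `mirra` is already available on the runner's PATH.
name: scenarios
on: [pull_request]
jobs:
  scenarios:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run all scenarios
        run: |
          for f in scenarios/*.md; do
            mirra run "$f"
          done
```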

## Where to go next

- **Write your first scenario**: end-to-end walkthrough from scenario file to `mirra run` to CI gate.
- **Scenario format reference**: every valid section, every valid config key, every edge case.
- **check vs judge deep dive**: exactly how Mirra grades criteria, the cost model, and failure modes.
- **mirra run**: the CLI command that executes scenarios.