What we’re testing
Imagine your application has a signup endpoint that sends a welcome email via Resend. A passing test needs to prove that:

- An email was actually created in Resend (not just that the SDK call returned).
- The `from` address matches the configured sender.
- The webhook for delivery confirmation arrived.
- The webhook signature was verified.
- The subject line is reasonable.
- Bounces are handled.
The first four are mechanical facts that `expect(…).toBe(…)` can verify cleanly; the last two are judgment calls that it can't. That's exactly the `check:` vs `judge:` split.
1. Write the scenario
Create `scenarios/welcome-email.md`:
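A hedged sketch of what such a scenario might contain. Only `runs:`, `check:`, and `judge:` come from this page; everything else here (the frontmatter shape, the criterion phrasing, the fifth check) is invented for illustration, so consult the scenario format reference for the real syntax:

```md
---
runs: 3
---

# Welcome email

Sign up a new user and verify the welcome email end to end.

- check: an email for the new user exists in Resend
- check: the from address matches the configured sender
- check: the delivery-confirmation webhook arrived
- check: the webhook signature was verified
- check: exactly one email was sent
- judge: the subject line is reasonable
- judge: a bounced email is handled sensibly
```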
2. Run it
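The invocation is presumably something like the following; only the `mirra run` command itself is documented on this page, so check the `mirra run` reference for exact flags:

```shell
mirra run scenarios/welcome-email.md
```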
3. Read the score
86% means 18 of 21 criterion runs passed. Five `check:` criteria ran 3 times each and all passed: that's deterministic, not a coincidence. One `judge:` criterion passed all 3 runs (the subject line is fine), and one `judge:` criterion failed all 3 (bounce handling is missing).
The 3/3 vs 0/3 pattern matters: if bounces are sometimes handled and sometimes not, the evaluator would show 1/3 or 2/3, and you’d know it’s flaky code. A clean 0/3 means it’s genuinely missing.
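The arithmetic behind that score, spelled out (7 criteria at `runs: 3` each):

```typescript
// 5 check: criteria + 2 judge: criteria, each run 3 times.
const criteria = 7;
const runsPerCriterion = 3;
const totalRuns = criteria * runsPerCriterion;           // 21
const passingCriteria = 6;                               // one judge: criterion failed all runs
const passingRuns = passingCriteria * runsPerCriterion;  // 18
const score = Math.round((passingRuns / totalRuns) * 100);
console.log(`${score}%`); // 86%
```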
4. Fix the failing criterion
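One way to close the gap, as a hedged sketch. `email.bounced` is a real Resend webhook event type, but the handler shape and the in-memory suppression list below are assumptions standing in for your actual route and datastore:

```typescript
// Sketch of bounce handling for a Resend webhook route.
// The suppression list is in-memory for illustration; in practice
// you would persist it wherever your app stores send preferences.

type ResendWebhookEvent = {
  type: string;               // e.g. "email.delivered", "email.bounced"
  data: { to: string[] };
};

const suppressionList = new Set<string>();

// Returns true when the event was a bounce and the recipients
// were added to the suppression list.
export function handleBounce(event: ResendWebhookEvent): boolean {
  if (event.type !== "email.bounced") return false;
  for (const recipient of event.data.to) {
    suppressionList.add(recipient); // never email this address again
  }
  return true;
}

export function isSuppressed(address: string): boolean {
  return suppressionList.has(address);
}
```

With this in place, your send path can consult `isSuppressed` before calling Resend, which is the behavior the failing `judge:` criterion is looking for.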
Add bounce handling to your webhook route.

5. Wire it into CI
GitHub Actions example:

`--fail-below=0.9` exits the job non-zero if satisfaction drops below 90%: your PR fails CI, and the `mirra-result.json` artifact is attached for inspection.
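Such a job might look like the sketch below. Only `mirra run`, `--fail-below`, and `mirra-result.json` come from this page; the install step and package invocation are assumptions for your project:

```yaml
name: mirra
on: pull_request
jobs:
  scenarios:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      # Install/invoke mirra however your project does; npx is an assumption.
      - run: npx mirra run scenarios/ --fail-below=0.9
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: mirra-result
          path: mirra-result.json
```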
6. Iterate
As your code changes, keep the scenario tight to what you actually promise users:

- Add `check:` lines for new assertions you can mechanically verify.
- Use `judge:` sparingly and only for genuinely subjective calls.
- Increase `runs:` if you want tighter confidence; decrease it if CI is slow.
- Split one bloated scenario into several focused ones (`welcome-email.md`, `bounce-handling.md`, `quota-exceeded.md`), each gating what it gates.
Where to go next
Scenario format reference
Every valid section, every config key.
Coding agents + MCP
Let Claude Code or Cursor drive scenarios directly.
Vitest plugin
Drive scenarios from an existing Vitest suite.
mirra run reference
Every flag, every exit code, every output format.