
Golden Fixtures

The Problem: How Do You Trust AI-Generated Rules?

Invariants rule out certain bad states for the models your evaluator admits. But they don’t show your system does the right thing — only that it avoids the wrong thing you declared. You still need evidence that a specific sequence of observations produces the specific facts, intents, and effects your domain requires.

Golden fixtures are that evidence. A golden fixture is a deterministic input timeline paired with an expected world state. You define the exact observations that enter the system and the exact derived state that should result. If the evaluator output matches byte-for-byte, you have reproducible, machine-checkable evidence that this evaluator produces the expected behavior for that scenario.

A fixture is a JSONL file. Each line is either an observation (input) or an expectation (output). Together they define a complete scenario — a deterministic input timeline.
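Reading a fixture therefore means splitting its lines into the two kinds of record. A minimal sketch, assuming (as in the examples below) that observation lines carry a "kind" key and expectation lines carry a key starting with "expect_":

```python
import json

def load_fixture(path):
    """Split a JSONL fixture into observation and expectation records.

    Assumption for this sketch: observations have a "kind" key,
    expectations have a key beginning with "expect_".
    """
    observations, expectations = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if any(k.startswith("expect_") for k in record):
                expectations.append(record)
            else:
                observations.append(record)
    return observations, expectations
```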

Happy paths show the system does what it should when everything goes right:

{"kind": "booking.request", "payload": {"email": "pat@example.com", "slot_id": "slot-42", "patient_name": "Pat"}}
{"kind": "slot.status", "payload": {"slot_id": "slot-42", "is_available": true}}
{"kind": "reserve.result", "payload": {"request_id": "req-1", "slot_id": "slot-42", "succeeded": true}}

Error paths show the system handles failures and conflicts correctly. These are not optional — every flagship app ships contradiction-path fixtures:

{"kind": "booking.request", "payload": {"email": "pat@example.com", "slot_id": "slot-42"}}
{"kind": "booking.request", "payload": {"email": "sam@example.com", "slot_id": "slot-42"}}
{"kind": "slot.status", "payload": {"slot_id": "slot-42", "is_available": true}}

The expected output should show that only one booking succeeds and the no_double_booking invariant holds. Contradiction paths catch interaction bugs that happy paths miss.
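Concretely, the expectation lines for this contradiction fixture might look like the following (the request IDs and the first-request-wins resolution are illustrative assumptions, not output from a real run):

```json
{"expect_fact": "booking_confirmed", "args": ["req-1", "slot-42"]}
{"expect_no_fact": "booking_confirmed", "args": ["req-2", "slot-42"]}
{"expect_fact": "slot_reserved", "args": ["slot-42"]}
```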

A typical app fixture directory:

fixtures/
  happy-path.jsonl            # Normal booking flow
  contradiction-path.jsonl    # Conflicting observations
  cancellation-path.jsonl     # Mid-flow cancellation
  llm-extraction.jsonl        # LLM-assisted intake

Each fixture also declares what the evaluator should produce — the expected world state after all observations have been replayed and the evaluator reaches its fixed point:

{"expect_fact": "booking_confirmed", "args": ["req-1", "slot-42"]}
{"expect_fact": "slot_reserved", "args": ["slot-42"]}
{"expect_no_fact": "slot_available", "args": ["slot-42"]}
{"expect_intent": "intent.send_confirmation", "args": ["req-1", "pat@example.com"]}

Expectations can assert:

  • Facts that must exist (expect_fact)
  • Facts that must not exist (expect_no_fact)
  • Intents that must be derived (expect_intent)
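Checking those three expectation kinds against a derived world state is a small matching problem. A sketch, assuming facts and intents are represented as sets of (name, args-tuple) pairs, which is an assumption about the representation, not the evaluator's actual internals:

```python
def check_expectations(expectations, facts, intents):
    """Check fixture expectations against derived state.

    `facts` and `intents` are assumed to be sets of (name, args-tuple)
    pairs. Returns human-readable failures; empty means the fixture passes.
    """
    failures = []
    for exp in expectations:
        if "expect_fact" in exp:
            key = (exp["expect_fact"], tuple(exp["args"]))
            if key not in facts:
                failures.append(f"missing fact: {key}")
        elif "expect_no_fact" in exp:
            key = (exp["expect_no_fact"], tuple(exp["args"]))
            if key in facts:
                failures.append(f"forbidden fact present: {key}")
        elif "expect_intent" in exp:
            key = (exp["expect_intent"], tuple(exp["args"]))
            if key not in intents:
                failures.append(f"missing intent: {key}")
    return failures
```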

The expected world state is the specification. The .dh rules are the implementation. If the evaluator output matches the expected state, the implementation satisfies the specification for that scenario.

Fixtures create a tight, automated feedback loop for AI agents:

  1. Human defines fixtures — observation sequences and expected outputs
  2. AI generates .dh rules — ontology derivations, intents, helpers
  3. jacqos replay runs the fixture — evaluator processes observations
  4. Output compared to expectations — byte-identical match required
  5. AI iterates if mismatch — adjusts rules based on diff
  6. When all fixtures pass and all invariants hold — the rules satisfy the fixture corpus and declared invariants for this evaluator

The human never needs to read the generated rules. The fixtures are the specification; the rules are the implementation detail. The AI keeps iterating until the output matches exactly.
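The loop above can be sketched as a small driver. Both callables here are placeholders for whatever tooling you wire in (the AI rule generator and a shell-out to the replay command); they are assumptions, not part of jacqos itself:

```python
def iterate_until_green(generate_rules, run_verify, max_rounds=10):
    """Drive the generate -> replay -> compare loop.

    `generate_rules(diff)` is the AI step: takes the previous diff, emits rules.
    `run_verify()` replays all fixtures and returns a diff string, empty on success.
    Returns the number of rounds taken to reach a byte-identical match.
    """
    diff = ""
    for round_no in range(max_rounds):
        generate_rules(diff)   # AI adjusts rules based on the last diff
        diff = run_verify()    # e.g. shell out to the verify command
        if not diff:           # byte-identical match: all fixtures pass
            return round_no + 1
    raise RuntimeError("fixtures still failing after max_rounds")
```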

How jacqos verify Checks Fixture Conformance

Section titled “How jacqos verify Checks Fixture Conformance”

jacqos verify replays every fixture from scratch on a clean database, checks the evaluator output against expectations, and verifies all invariants at every fixed point:

$ jacqos verify
Replaying fixtures...
happy-path.jsonl PASS (3 observations, 2 facts matched)
contradiction-path.jsonl PASS (3 observations, 1 fact matched)
cancellation-path.jsonl PASS (4 observations, 3 facts matched)
llm-extraction.jsonl PASS (5 observations, 4 facts matched)
Checking invariants...
no_double_booking PASS (427 slots evaluated)
confirmed_has_email PASS (89 bookings evaluated)
no_cancelled_intents PASS (12 intents evaluated)
All checks passed. Digest: sha256:a1b2c3d4e5f6...

Each replay is deterministic. The same observations, the same evaluator, the same rules produce the same facts every time. If anything changes — a rule, a mapper, a helper — the digest changes.
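That determinism is what makes a digest meaningful: hash a canonical serialisation of the derived state and any change shows up as a different hash. A minimal sketch (the canonicalisation scheme here is an illustrative choice, not jacqos's actual one):

```python
import hashlib
import json

def state_digest(facts):
    """Hash a derived world state deterministically.

    Canonicalise by sorting the facts and serialising with a stable key
    order and separators, so the same state always hashes identically.
    """
    canonical = json.dumps(sorted(facts), sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()
```

Because the serialisation is canonical, the digest is independent of the order facts were derived in, but sensitive to any change in their content.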

When a fixture fails, the output shows exactly what diverged:

$ jacqos verify
Replaying fixtures...
happy-path.jsonl FAIL
Expected: booking_confirmed("req-1", "slot-42")
Got: (not derived)
Missing facts: 1
Unexpected facts: 0
Hint: rule rules.dh:23 did not fire.
Provenance: no atom matched booking_request(_, "slot-42", _)
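The "missing" and "unexpected" counts in that output are just set differences between expected and derived facts. A sketch, again assuming facts are sets of (name, args-tuple) pairs:

```python
def fact_diff(expected, actual):
    """Summarise divergence between expected and derived facts.

    Both arguments are sets of (name, args-tuple) pairs; this is an
    assumed representation for the sketch, not jacqos's internal one.
    """
    return {
        "missing": sorted(expected - actual),     # expected but not derived
        "unexpected": sorted(actual - expected),  # derived but not expected
    }
```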

When jacqos verify passes, it produces a verification digest — a cryptographic hash that attests to exact behavior.

The digest covers:

  • Evaluator identity — hash of ontology rules, mapper semantics, and helper digests
  • Fixture corpus — hash of every .jsonl fixture file
  • Derived state — byte-identical facts, intents, and provenance for each fixture

Verification digest: sha256:a1b2c3d4e5f6...
evaluator_digest: sha256:7890ab...
fixture_corpus: sha256:cdef01...
derived_state: sha256:234567...
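One plausible way to combine the three component digests into the top-level digest is to hash them in a fixed, labelled order. The exact composition jacqos uses is not documented here, so treat this as a sketch of the idea rather than its actual scheme:

```python
import hashlib

def verification_digest(evaluator_digest, fixture_corpus, derived_state):
    """Combine labelled component digests into one verification digest.

    Assumption: a simple hash over the components in a fixed order; any
    change to any component changes the combined digest.
    """
    payload = "\n".join([
        "evaluator_digest:" + evaluator_digest,
        "fixture_corpus:" + fixture_corpus,
        "derived_state:" + derived_state,
    ])
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()
```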

This digest is portable. It travels with your evaluation package and can be independently verified. Anyone with the same evaluator and fixture corpus can reproduce the exact same digest. If they can’t, something changed.

This is not just a test report. It is cryptographic evidence that a specific evaluator, given specific inputs, produced specific outputs. The evidence is only as strong as the fixtures and expectations you defined — but for those fixtures, it is exact.

Golden fixtures provide evidence for defined inputs, not blanket evidence for all possible inputs.

What fixtures show:

  • For the exact observation sequences in your fixture corpus, the evaluator produces the exact expected world state
  • The evidence is reproducible and cryptographically verifiable
  • Any change to rules, mappers, or helpers that affects fixture outcomes will be detected

What fixtures do not show:

  • That the system behaves correctly for observation sequences not in the corpus
  • That the fixture corpus covers all important scenarios
  • That the expected world state itself is correct (a fixture with wrong expectations will still pass)

Fixtures are scenario-level contracts. They answer: “given these specific observations, does the system produce this specific result?” They do not answer: “does the system behave correctly for all valid observations?”

For universal properties, use invariants. Invariants hold across all evaluation states produced by the fixed evaluator, not just fixture scenarios. The combination of golden fixtures (specific scenario evidence) and invariant review (universal constraints over the evaluated model) gives you both targeted evidence and broad safety boundaries.

| Property                  | Golden Fixture        | Invariant                  |
| ------------------------- | --------------------- | -------------------------- |
| Scope                     | One specific scenario | All evaluation states      |
| Shows                     | Exact expected output | Universal constraint holds |
| Catches unknown scenarios | No                    | Yes (via property testing) |
| Cryptographic digest      | Yes                   | Yes (within verify)        |
| Survives rule changes     | May need updating     | Yes                        |