
Golden Fixtures

The Problem: How Do You Trust AI-Generated Rules?

Invariants rule out certain bad states for the models your evaluator admits. But they don’t show your system does the right thing — only that it avoids the wrong thing you declared. You still need evidence that a specific sequence of observations produces the specific facts, intents, and effects your domain requires.

Golden fixtures are that evidence. A golden fixture is a deterministic input timeline paired with an expected world state. You define the exact observations that enter the system and the exact derived state that should result. If the evaluator output matches byte-for-byte, you have reproducible, machine-checkable evidence that this evaluator produces the expected behavior for that scenario.

A fixture is a JSONL file. Each line is either an observation (input) or an expectation (output). Together they define a complete scenario — a deterministic input timeline.
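Reading a fixture therefore means splitting its lines into the two kinds of record. A minimal sketch, assuming (as in the examples below) that observation lines carry a "kind" key and expectation lines carry a key starting with "expect_":

```python
import json

def load_fixture(path):
    """Split a JSONL fixture into observation and expectation records.

    Assumption for this sketch: observations have a "kind" key,
    expectations have a key beginning with "expect_".
    """
    observations, expectations = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if any(k.startswith("expect_") for k in record):
                expectations.append(record)
            else:
                observations.append(record)
    return observations, expectations
```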

Happy paths show the system does what it should when everything goes right:

{"kind": "booking.request", "payload": {"email": "pat@example.com", "slot_id": "slot-42", "patient_name": "Pat"}}
{"kind": "slot.status", "payload": {"slot_id": "slot-42", "is_available": true}}
{"kind": "reserve.result", "payload": {"request_id": "req-1", "slot_id": "slot-42", "succeeded": true}}

Error paths show the system handles failures and conflicts correctly. These are not optional — every flagship app ships contradiction-path fixtures:

{"kind": "booking.request", "payload": {"email": "pat@example.com", "slot_id": "slot-42"}}
{"kind": "booking.request", "payload": {"email": "sam@example.com", "slot_id": "slot-42"}}
{"kind": "slot.status", "payload": {"slot_id": "slot-42", "is_available": true}}

The expected output should show that only one booking succeeds and the no_double_booking invariant holds. Contradiction paths catch interaction bugs that happy paths miss.
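Concretely, the expectation lines for this contradiction fixture might look like the following (the request IDs and the first-request-wins resolution are illustrative assumptions, not output from a real run):

```json
{"expect_fact": "booking_confirmed", "args": ["req-1", "slot-42"]}
{"expect_no_fact": "booking_confirmed", "args": ["req-2", "slot-42"]}
{"expect_fact": "slot_reserved", "args": ["slot-42"]}
```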

A typical app fixture directory:

fixtures/
  happy-path.jsonl            # Normal booking flow
  contradiction-path.jsonl    # Conflicting observations
  cancellation-path.jsonl     # Mid-flow cancellation
  llm-extraction.jsonl        # LLM-assisted intake

Each fixture also declares what the evaluator should produce — the expected world state after all observations have been replayed and the evaluator reaches its fixed point:

{"expect_fact": "booking_confirmed", "args": ["req-1", "slot-42"]}
{"expect_fact": "slot_reserved", "args": ["slot-42"]}
{"expect_no_fact": "slot_available", "args": ["slot-42"]}
{"expect_intent": "intent.send_confirmation", "args": ["req-1", "pat@example.com"]}

Expectations can assert:

  • Facts that must exist (expect_fact)
  • Facts that must not exist (expect_no_fact)
  • Intents that must be derived (expect_intent)
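Checking those three expectation kinds against a derived world state is a small matching problem. A sketch, assuming facts and intents are represented as sets of (name, args-tuple) pairs, which is an assumption about the representation, not the evaluator's actual internals:

```python
def check_expectations(expectations, facts, intents):
    """Check fixture expectations against derived state.

    `facts` and `intents` are assumed to be sets of (name, args-tuple)
    pairs. Returns human-readable failures; empty means the fixture passes.
    """
    failures = []
    for exp in expectations:
        if "expect_fact" in exp:
            key = (exp["expect_fact"], tuple(exp["args"]))
            if key not in facts:
                failures.append(f"missing fact: {key}")
        elif "expect_no_fact" in exp:
            key = (exp["expect_no_fact"], tuple(exp["args"]))
            if key in facts:
                failures.append(f"forbidden fact present: {key}")
        elif "expect_intent" in exp:
            key = (exp["expect_intent"], tuple(exp["args"]))
            if key not in intents:
                failures.append(f"missing intent: {key}")
    return failures
```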

The expected world state is the specification. The .dh rules are the implementation. If the evaluator output matches the expected state, the implementation satisfies the specification for that scenario.

Fixtures create a tight, automated feedback loop for AI agents:

  1. Human defines fixtures — observation sequences and expected outputs
  2. AI generates .dh rules — ontology derivations, intents, helpers
  3. jacqos replay runs the fixture — evaluator processes observations
  4. Output compared to expectations — byte-identical match required
  5. AI iterates if mismatch — adjusts rules based on diff
  6. When all fixtures pass and all invariants hold — the rules satisfy the fixture corpus and declared invariants for this evaluator

The human never needs to read the generated rules. The fixtures are the specification; the rules are the implementation detail. The AI keeps iterating until the output matches exactly.
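The loop above can be sketched as a small driver. Both callables here are placeholders for whatever tooling you wire in (the AI rule generator and a shell-out to the replay command); they are assumptions, not part of jacqos itself:

```python
def iterate_until_green(generate_rules, run_verify, max_rounds=10):
    """Drive the generate -> replay -> compare loop.

    `generate_rules(diff)` is the AI step: takes the previous diff, emits rules.
    `run_verify()` replays all fixtures and returns a diff string, empty on success.
    Returns the number of rounds taken to reach a byte-identical match.
    """
    diff = ""
    for round_no in range(max_rounds):
        generate_rules(diff)   # AI adjusts rules based on the last diff
        diff = run_verify()    # e.g. shell out to the verify command
        if not diff:           # byte-identical match: all fixtures pass
            return round_no + 1
    raise RuntimeError("fixtures still failing after max_rounds")
```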

How jacqos verify Checks Fixture Conformance

Section titled “How jacqos verify Checks Fixture Conformance”

jacqos verify replays every fixture from scratch on a clean database, checks the evaluator output against expectations, and verifies all invariants at every fixed point:

$ jacqos verify
Replaying fixtures...
happy-path.jsonl PASS (3 observations, 2 facts matched)
contradiction-path.jsonl PASS (3 observations, 1 fact matched)
cancellation-path.jsonl PASS (4 observations, 3 facts matched)
llm-extraction.jsonl PASS (5 observations, 4 facts matched)
Checking invariants...
no_double_booking PASS (427 slots evaluated)
confirmed_has_email PASS (89 bookings evaluated)
no_cancelled_intents PASS (12 intents evaluated)
All checks passed. Digest: sha256:a1b2c3d4e5f6...

Each replay is deterministic. The same observations, the same evaluator, the same rules produce the same facts every time. If anything changes — a rule, a mapper, a helper — the digest changes.
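That determinism is what makes a digest meaningful: hash a canonical serialisation of the derived state and any change shows up as a different hash. A minimal sketch (the canonicalisation scheme here is an illustrative choice, not jacqos's actual one):

```python
import hashlib
import json

def state_digest(facts):
    """Hash a derived world state deterministically.

    Canonicalise by sorting the facts and serialising with a stable key
    order and separators, so the same state always hashes identically.
    """
    canonical = json.dumps(sorted(facts), sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode()).hexdigest()
```

Because the serialisation is canonical, the digest is independent of the order facts were derived in, but sensitive to any change in their content.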

When a fixture fails, the output shows exactly what diverged:

$ jacqos verify
Replaying fixtures...
happy-path.jsonl FAIL
Expected: booking_confirmed("req-1", "slot-42")
Got: (not derived)
Missing facts: 1
Unexpected facts: 0
Hint: rule rules.dh:23 did not fire.
Provenance: no atom matched booking_request(_, "slot-42", _)
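The "missing" and "unexpected" counts in that output are just set differences between expected and derived facts. A sketch, again assuming facts are sets of (name, args-tuple) pairs:

```python
def fact_diff(expected, actual):
    """Summarise divergence between expected and derived facts.

    Both arguments are sets of (name, args-tuple) pairs; this is an
    assumed representation for the sketch, not jacqos's internal one.
    """
    return {
        "missing": sorted(expected - actual),     # expected but not derived
        "unexpected": sorted(actual - expected),  # derived but not expected
    }
```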

When jacqos verify passes, it produces a verification digest — a cryptographic hash that attests to exact behavior.

The digest covers:

  • Evaluator identity — hash of ontology rules, mapper semantics, and helper digests
  • Fixture corpus — hash of every .jsonl fixture file
  • Derived state — byte-identical facts, intents, and provenance for each fixture

Verification digest: sha256:a1b2c3d4e5f6...
evaluator_digest: sha256:7890ab...
fixture_corpus: sha256:cdef01...
derived_state: sha256:234567...
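One plausible way to combine the three component digests into the top-level digest is to hash them in a fixed, labelled order. The exact composition jacqos uses is not documented here, so treat this as a sketch of the idea rather than its actual scheme:

```python
import hashlib

def verification_digest(evaluator_digest, fixture_corpus, derived_state):
    """Combine labelled component digests into one verification digest.

    Assumption: a simple hash over the components in a fixed order; any
    change to any component changes the combined digest.
    """
    payload = "\n".join([
        "evaluator_digest:" + evaluator_digest,
        "fixture_corpus:" + fixture_corpus,
        "derived_state:" + derived_state,
    ])
    return "sha256:" + hashlib.sha256(payload.encode()).hexdigest()
```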

This digest is portable. It travels with your evaluation package and can be independently verified. Anyone with the same evaluator and fixture corpus can reproduce the exact same digest. If they can’t, something changed.

This is not just a test report. It is cryptographic evidence that a specific evaluator, given specific inputs, produced specific outputs. The evidence is only as strong as the fixtures and expectations you defined — but for those fixtures, it is exact.

Golden fixtures provide evidence for defined inputs, not blanket evidence for all possible inputs.

What fixtures show:

  • For the exact observation sequences in your fixture corpus, the evaluator produces the exact expected world state
  • The evidence is reproducible and cryptographically verifiable
  • Any change to rules, mappers, or helpers that affects fixture outcomes will be detected

What fixtures do not show:

  • That the system behaves correctly for observation sequences not in the corpus
  • That the fixture corpus covers all important scenarios
  • That the expected world state itself is correct (a fixture with wrong expectations will still pass)

Fixtures are scenario-level contracts. They answer: “given these specific observations, does the system produce this specific result?” They do not answer: “does the system behave correctly for all valid observations?”

For universal properties, use invariants. Invariants hold across all evaluation states produced by the fixed evaluator, not just fixture scenarios. The combination of golden fixtures (specific scenario evidence) and invariant review (universal constraints over the evaluated model) gives you both targeted evidence and broad safety boundaries.

| Property                  | Golden Fixture        | Invariant                  |
| ------------------------- | --------------------- | -------------------------- |
| Scope                     | One specific scenario | All evaluation states      |
| Shows                     | Exact expected output | Universal constraint holds |
| Catches unknown scenarios | No                    | Yes (via property testing) |
| Cryptographic digest      | Yes                   | Yes (within verify)        |
| Survives rule changes     | May need updating     | Yes                        |