Crash Recovery

Why Crash Recovery Matters

JacqOS agents interact with external systems — booking APIs, payment processors, LLM providers. Any of these calls can fail mid-flight. The process can crash between sending a request and recording the response. The network can drop the reply after the remote system already committed the action.

In a workflow-first system, this ambiguity is often papered over with retry loops and hope. JacqOS takes a different approach: every state transition is durable, and ambiguous outcomes require explicit human resolution. The system never guesses.

The Intent Lifecycle

Every intent passes through a durable state machine:

Derived → Admitted → Executing → Completed
                  ↘ (crash) → Reconcile Required

Each transition appends an observation to the log. This means the full lifecycle is visible in provenance and survives any restart.
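
As a rough illustration of the state machine above, here is a minimal Python sketch. The names IntentState, ALLOWED, and IntentLog are hypothetical and invented for this example; the actual log format used by the shell is not shown on this page. The sketch only demonstrates the idea that state is never updated in place: each transition is validated and then appended.

```python
from dataclasses import dataclass, field
from enum import Enum


class IntentState(Enum):
    DERIVED = "derived"
    ADMITTED = "admitted"
    EXECUTING = "executing"
    COMPLETED = "completed"
    RECONCILE_REQUIRED = "reconcile_required"


# Allowed transitions, mirroring the diagram. Resolution of a
# reconcile_required intent is left out of this sketch.
ALLOWED = {
    IntentState.DERIVED: {IntentState.ADMITTED},
    IntentState.ADMITTED: {IntentState.EXECUTING},
    IntentState.EXECUTING: {IntentState.COMPLETED, IntentState.RECONCILE_REQUIRED},
    IntentState.COMPLETED: set(),
    IntentState.RECONCILE_REQUIRED: set(),
}


@dataclass
class IntentLog:
    """Append-only list of (intent_id, state) entries (assumed shape)."""
    entries: list = field(default_factory=list)

    def current_state(self, intent_id):
        # The most recent entry for an intent is its current state.
        for logged_id, state in reversed(self.entries):
            if logged_id == intent_id:
                return state
        return None

    def transition(self, intent_id, new_state):
        old = self.current_state(intent_id)
        if old is None:
            if new_state is not IntentState.DERIVED:
                raise ValueError("an intent must start as DERIVED")
        elif new_state not in ALLOWED[old]:
            raise ValueError(f"illegal transition {old.value} -> {new_state.value}")
        # Append only; earlier entries are never rewritten.
        self.entries.append((intent_id, new_state))
```

Because the current state is always recomputed from the entries, a process that restarts with the same log sees the same lifecycle.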

Derived

The evaluator reaches a fixed point and produces intent.* facts. These are candidate intents — what the system wants to do based on current evidence.

Admitted

The shell durably records each new intent before any external call begins. This is the commit point. Once admitted, the shell is responsible for driving the intent to completion or flagging it for reconciliation.

Executing

The shell dispatches the intent through its declared capability in a fixed order: it writes an effect_started marker, makes the external call, and records the response as a new observation.

Completed

The shell writes an effect_completed receipt. The new observation feeds back into the evaluator, potentially deriving new facts, retracting old ones, or triggering further intents.
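
The ordering of writes across the Admitted, Executing, and Completed stages is what makes recovery possible. The sketch below is an assumption-laden illustration, not the shell's implementation: run_intent, the list-based log, and the record kinds intent_admitted and observation are hypothetical, while effect_started and effect_completed are the names used on this page.

```python
def run_intent(log, intent_id, call):
    """Drive one intent, writing a record before and after the external call."""
    # Commit point: the intent is recorded before anything external happens.
    log.append({"kind": "intent_admitted", "intent": intent_id})
    # Marker first, so a restart can tell that the call may have gone out.
    log.append({"kind": "effect_started", "intent": intent_id})
    response = call()  # a crash here leaves effect_started with no receipt
    log.append({"kind": "observation", "intent": intent_id, "body": response})
    log.append({"kind": "effect_completed", "intent": intent_id})
    return response


def crashing_call():
    # Stand-in for a process crash in the middle of the request.
    raise RuntimeError("simulated crash mid-request")


ok_log, crash_log = [], []
run_intent(ok_log, "intent-1", lambda: {"status": 200})
try:
    run_intent(crash_log, "intent-2", crashing_call)
except RuntimeError:
    pass

print([r["kind"] for r in ok_log])
print([r["kind"] for r in crash_log])
```

The second log ends at effect_started, which is the ambiguous state that crash recovery has to handle.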

What Happens on Crash

On restart, the shell inspects every admitted intent and classifies it:

State found	What it means	Action
No `effect_started` marker	Intent was admitted but never executed	Safe to execute from scratch
`effect_completed` receipt exists	Effect already finished	No action needed
`effect_started` without terminal receipt	Ambiguous — the call may or may not have succeeded	Classify for retry or reconciliation

The third case is the interesting one. The shell sent the request, but crashed before recording the outcome. Did the external system process it? There is no way to know without checking.
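
The three rows of the table can be read as a small decision function. The following is a minimal sketch, assuming the log is a list of dictionaries with kind and intent fields; that shape and the returned labels are hypothetical.

```python
def classify_on_restart(records, intent_id):
    """Return the restart action for one admitted intent."""
    kinds = {r["kind"] for r in records if r["intent"] == intent_id}
    if "effect_completed" in kinds:
        return "no_action"  # effect already finished
    if "effect_started" in kinds:
        return "retry_or_reconcile"  # ambiguous: outcome unknown
    return "execute_from_scratch"  # admitted but never started


records = [
    {"kind": "effect_started", "intent": "a"},
    {"kind": "effect_completed", "intent": "a"},
    {"kind": "effect_started", "intent": "b"},
]
for intent in ("a", "b", "c"):
    print(intent, classify_on_restart(records, intent))
```

The check for effect_completed comes first because a completed effect also has an effect_started marker.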

Auto-Retry vs. Manual Reconciliation

Safe Auto-Retry

The shell automatically retries when it can prove the request is safe to repeat:

Read-only requests — GET calls that don’t mutate external state
Idempotency key present — the resource contract guarantees exactly-once semantics
Request-fingerprint contract — the external API confirms replay safety

Auto-retried effects append a new effect_started observation, preserving the full audit trail. The original attempt and the retry are both visible in provenance.

Manual Reconciliation

When replay safety cannot be proven, the effect enters the reconcile_required state. This is the default for any mutation where the shell cannot confirm the outcome. The system stops and asks a human.

Common scenarios requiring reconciliation:

POST request without an idempotency key
Payment or state-changing call where the response was lost
Any effect where partial execution could cause inconsistency
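
The split between auto-retry and manual reconciliation described above can be sketched as a single function that defaults to reconciliation. PendingEffect and its fields are hypothetical names for this example; how the shell actually represents idempotency keys or request-fingerprint contracts is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PendingEffect:
    """An effect found with effect_started but no terminal receipt (assumed shape)."""
    method: str
    idempotency_key: Optional[str] = None
    replay_safe_by_fingerprint: bool = False


def recovery_action(effect):
    # Read-only request: repeating it does not mutate external state.
    if effect.method.upper() == "GET":
        return "auto_retry"
    # The remote side deduplicates on the key, so a replay is safe.
    if effect.idempotency_key:
        return "auto_retry"
    # The external API has confirmed replay safety for this request.
    if effect.replay_safe_by_fingerprint:
        return "auto_retry"
    # Conservative default: replay safety was not proven.
    return "reconcile_required"


print(recovery_action(PendingEffect("POST")))
print(recovery_action(PendingEffect("POST", idempotency_key="k-1")))
```

The function has no branch that retries a mutation by default; every path to auto_retry requires positive evidence.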

Resolving Reconciliation

Use the CLI to inspect and resolve pending reconciliations:

# See what needs resolution
jacqos reconcile inspect --session latest

# After checking the external system, record the one outcome that applies:
jacqos reconcile resolve <attempt-id> succeeded
jacqos reconcile resolve <attempt-id> failed
jacqos reconcile resolve <attempt-id> retry

Every resolution appends a new observation with provenance. The evaluator re-runs with the new evidence. If the original intent conditions still hold, a new intent may be derived and executed cleanly.
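
To make the append-only nature of a resolution concrete, here is a small sketch. It does not show how the jacqos CLI is implemented; the record kinds reconcile_required and reconcile_resolved, the attempt field, and both function names are assumptions. The three outcomes are the ones accepted by the resolve command above.

```python
VALID_OUTCOMES = {"succeeded", "failed", "retry"}


def pending(log):
    """Attempts flagged for reconciliation that have no resolution yet."""
    resolved = {r["attempt"] for r in log if r["kind"] == "reconcile_resolved"}
    return [
        r["attempt"]
        for r in log
        if r["kind"] == "reconcile_required" and r["attempt"] not in resolved
    ]


def resolve(log, attempt_id, outcome):
    """Record an operator decision as a new entry; nothing is rewritten."""
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    if attempt_id not in pending(log):
        raise ValueError(f"no pending reconciliation for {attempt_id}")
    log.append({"kind": "reconcile_resolved", "attempt": attempt_id, "outcome": outcome})


log = [{"kind": "reconcile_required", "attempt": "attempt-1"}]
print(pending(log))
resolve(log, "attempt-1", "succeeded")
print(pending(log))
```

The original reconcile_required entry stays in the log after resolution, so both the ambiguity and the decision remain traceable.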

See the CLI Reference for full command details.

Worked Example

Consider this sequence in the appointment-booking app:

A booking_request observation arrives for slot RS-2024-03
The evaluator derives intent.reserve_slot("REQ-1", "RS-2024-03")
The shell admits the intent and starts an HTTP call to clinic_api
The process crashes mid-request

On restart:

The shell finds effect_started without a terminal receipt
http.fetch to clinic_api is a POST without an idempotency key — not safe to auto-retry
The effect enters reconcile_required
The operator runs jacqos reconcile inspect --session latest
They check the clinic API dashboard and find the slot was reserved
They resolve: jacqos reconcile resolve eff-0042 succeeded
The resolution observation feeds back into the evaluator
confirmation_pending is derived, leading to intent.send_confirmation
The confirmation email sends normally

The entire chain — crash, reconciliation, and recovery — is visible in the observation log and traceable through Studio’s drill inspector and timeline.
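
The last steps of the example, where the resolution feeds back into the evaluator, can be illustrated with a toy derivation in Python. This is not JacqOS rule syntax and not how the evaluator is implemented. The derive function, the observation fields, and the rule that a succeeded resolution yields confirmation_pending are simplifying assumptions, chosen only to show that new evidence changes what is derived.

```python
def derive(observations):
    """Recompute facts and intents from the full observation log (toy rules)."""
    facts, intents = set(), set()
    requests = {
        o["request"]: o["slot"]
        for o in observations
        if o["kind"] == "booking_request"
    }
    reserved = {
        o["request"]
        for o in observations
        if o["kind"] == "reconcile_resolved" and o["outcome"] == "succeeded"
    }
    for request, slot in requests.items():
        if request in reserved:
            # Evidence says the slot is held: move on to confirmation.
            facts.add(("confirmation_pending", request))
            intents.add(("intent.send_confirmation", request))
        else:
            intents.add(("intent.reserve_slot", request, slot))
    return facts, intents


observations = [{"kind": "booking_request", "request": "REQ-1", "slot": "RS-2024-03"}]
before = derive(observations)
observations.append(
    {"kind": "reconcile_resolved", "request": "REQ-1", "outcome": "succeeded"}
)
after = derive(observations)
print(before)
print(after)
```

Derivation here is a pure function of the log, so appending the resolution and re-running it is all that is needed to reach the next step.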

Contradictions

Contradictions are a related but distinct concept: conflicting assertions and retractions for the same fact. They arise when new observations provide evidence that contradicts existing derived truth.

# List active contradictions
jacqos contradiction list

# Preview a resolution
jacqos contradiction preview <id> --decision accept-assertion

# Commit a resolution
jacqos contradiction resolve <id> --decision accept-retraction \
  --note "Provider confirmed slot was already taken"

A contradiction can be resolved with one of three decisions: accept-assertion, accept-retraction, or defer. Each resolution is recorded as an observation with provenance.
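
As a sketch of what a contradiction looks like in data terms, the example below finds facts that have both an assertion and a retraction, then applies one of the three decisions. The claim representation and both function names are hypothetical; only the decision names come from the CLI shown above.

```python
from collections import defaultdict


def find_contradictions(claims):
    """claims: iterable of (fact, kind) pairs, kind is 'assert' or 'retract'."""
    seen = defaultdict(set)
    for fact, kind in claims:
        seen[fact].add(kind)
    # A fact is contradicted when both kinds of claim exist for it.
    return sorted(fact for fact, kinds in seen.items() if kinds == {"assert", "retract"})


def apply_decision(decision):
    """Return True if the fact holds, False if not, None if still open."""
    if decision == "accept-assertion":
        return True
    if decision == "accept-retraction":
        return False
    if decision == "defer":
        return None
    raise ValueError(f"unknown decision: {decision}")


claims = [
    (("slot_available", "RS-2024-03"), "assert"),
    (("slot_available", "RS-2024-03"), "retract"),
    (("slot_available", "RS-2024-04"), "assert"),
]
print(find_contradictions(claims))
print(apply_decision("accept-retraction"))
```

A deferred contradiction stays unresolved rather than being settled silently in either direction.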

Design Principles

No silent retry of mutations. If the shell cannot prove a retry is safe, it stops and asks. This is the conservative default — it prevents double-bookings, duplicate payments, and silent data corruption.
Every transition is durable. Admitted, started, completed, and reconciled states are all observations. Nothing is lost on crash.
Reconciliation is explicit. The operator provides evidence (“I checked the external system and the slot is held”). This evidence becomes part of the provenance chain.
Design for idempotency. If your external API supports idempotency keys, use them. This turns manual reconciliation into safe auto-retry — a much better operational experience.
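
One way to follow the last principle is to derive the idempotency key deterministically from the intent itself, so that a retry after a crash sends the same key as the original attempt. The sketch below is an assumption, not a JacqOS feature: the idempotency_key function is hypothetical, and whether a given external API accepts such a key (for example in an Idempotency-Key request header, a convention some HTTP APIs use) has to be checked against that API's documentation.

```python
import hashlib
import json


def idempotency_key(intent_name, args):
    """Stable key for an intent: same name and arguments give the same key."""
    # Compact JSON gives a canonical byte string for the hash input.
    payload = json.dumps([intent_name, list(args)], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = idempotency_key("intent.reserve_slot", ("REQ-1", "RS-2024-03"))
retry = idempotency_key("intent.reserve_slot", ("REQ-1", "RS-2024-03"))
other = idempotency_key("intent.reserve_slot", ("REQ-2", "RS-2024-03"))
print(first == retry, first == other)
```

Because the key depends only on the intent, it survives a restart without any extra state, which a randomly generated key would not.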

Next Steps

Debug, Verify, Ship — the end-to-end workflow page that integrates jacqos reconcile inspect, jacqos contradiction list/resolve, and the rest of the debugging surface into a single failure-to-green narrative
Effects and Intents — the full guide with code examples
CLI Reference — reconcile and contradiction commands
jacqos.toml Reference — declaring capabilities and resources
Observation-First Thinking — why durable observations make this possible