
Crash Recovery

JacqOS agents interact with external systems — booking APIs, payment processors, LLM providers. Any of these calls can fail mid-flight. The process can crash between sending a request and recording the response. The network can drop the reply after the remote system already committed the action.

In a workflow-first system, this ambiguity is often papered over with retry loops and hope. JacqOS takes a different approach: every state transition is durable, and ambiguous outcomes require explicit human resolution. The system never guesses.

Every intent passes through a durable state machine:

Derived → Admitted → Executing → Completed
                         ↘ (crash) → Reconcile Required

Each transition appends an observation to the log. This means the full lifecycle is visible in provenance and survives any restart.

Derived. The evaluator reaches a fixed point and produces intent.* facts. These are candidate intents: what the system wants to do based on current evidence.

Admitted. The shell durably records each new intent before any external call begins. This is the commit point. Once admitted, the shell is responsible for driving the intent to completion or flagging it for reconciliation.

Executing. The shell dispatches the intent through its declared capability. An effect_started marker is written. The external call happens. The response is recorded as a new observation.

Completed. The shell writes an effect_completed receipt. The new observation feeds back into the evaluator, potentially deriving new facts, retracting old ones, or triggering further intents.
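The lifecycle above can be sketched as a small state machine over an append-only log. This is an illustration only, not the JacqOS implementation: `IntentState`, `IntentLog`, and the `ALLOWED` table are hypothetical names, and treating Reconcile Required as terminal for a given intent is a simplifying assumption.

```python
from dataclasses import dataclass, field
from enum import Enum


class IntentState(Enum):
    DERIVED = "derived"
    ADMITTED = "admitted"
    EXECUTING = "executing"
    COMPLETED = "completed"
    RECONCILE_REQUIRED = "reconcile_required"


# Allowed transitions, mirroring the lifecycle diagram (hypothetical encoding).
ALLOWED = {
    IntentState.DERIVED: {IntentState.ADMITTED},
    IntentState.ADMITTED: {IntentState.EXECUTING},
    IntentState.EXECUTING: {IntentState.COMPLETED, IntentState.RECONCILE_REQUIRED},
    IntentState.COMPLETED: set(),
    IntentState.RECONCILE_REQUIRED: set(),
}


@dataclass
class IntentLog:
    """Append-only list of (intent_id, state) entries."""
    entries: list = field(default_factory=list)

    def state_of(self, intent_id):
        # The current state is the most recent entry for this intent.
        for logged_id, state in reversed(self.entries):
            if logged_id == intent_id:
                return state
        return None

    def transition(self, intent_id, new_state):
        current = self.state_of(intent_id)
        if current is None:
            if new_state is not IntentState.DERIVED:
                raise ValueError("an intent must start as DERIVED")
        elif new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current.name} -> {new_state.name}")
        # Transitions are recorded by appending; nothing is overwritten.
        self.entries.append((intent_id, new_state))
```

Because the state is always recomputed from the log, a process that restarts with the same log sees the same lifecycle position for every intent.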

On restart, the shell inspects every admitted intent and classifies it:

| State found | What it means | Action |
| --- | --- | --- |
| No effect_started marker | Intent was admitted but never executed | Safe to execute from scratch |
| effect_completed receipt exists | Effect already finished | No action needed |
| effect_started without terminal receipt | Ambiguous: the call may or may not have succeeded | Classify for retry or reconciliation |

The third case is the interesting one. The shell sent the request, but crashed before recording the outcome. Did the external system process it? There is no way to know without checking.
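The three restart cases can be expressed as a single function. This is a minimal sketch, assuming the markers recorded for one admitted intent are available as a set of names; `RecoveryAction` and `classify_admitted_intent` are hypothetical and not part of any documented JacqOS API.

```python
from enum import Enum


class RecoveryAction(Enum):
    EXECUTE = "execute_from_scratch"
    NONE = "no_action"
    CLASSIFY = "classify_for_retry_or_reconciliation"


def classify_admitted_intent(markers):
    """Map the markers found for one admitted intent to a recovery action.

    `markers` is a set of marker names, e.g. {"effect_started"}.
    """
    # A terminal receipt wins: the effect already finished.
    if "effect_completed" in markers:
        return RecoveryAction.NONE
    # Started but no receipt: the outcome is unknown.
    if "effect_started" in markers:
        return RecoveryAction.CLASSIFY
    # Never started: nothing reached the external system.
    return RecoveryAction.EXECUTE
```

Checking for the completion receipt first matters: a completed effect also has a start marker, and it must not be treated as ambiguous.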

The shell automatically retries when it can prove the request is safe to repeat:

  • Read-only requests — GET calls that don’t mutate external state
  • Idempotency key present: the resource contract guarantees that a repeated request with the same key takes effect only once
  • Request-fingerprint contract — the external API confirms replay safety

Auto-retried effects append a new effect_started observation, preserving the full audit trail. The original attempt and the retry are both visible in provenance.
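The retry rule amounts to "auto-retry only with proof, otherwise reconcile". The sketch below is an assumption-laden illustration: `EffectAttempt` and its fields are hypothetical, and treating every GET as read-only is a simplification of the first condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EffectAttempt:
    method: str                            # HTTP method, e.g. "GET" or "POST"
    idempotency_key: Optional[str] = None  # key sent with the request, if any
    fingerprint_replay_safe: bool = False  # the external API confirmed replay safety


def recovery_decision(attempt: EffectAttempt) -> str:
    """Return "auto_retry" only when one of the three proofs holds."""
    read_only = attempt.method.upper() == "GET"
    has_key = attempt.idempotency_key is not None
    if read_only or has_key or attempt.fingerprint_replay_safe:
        return "auto_retry"
    # Default for anything unproven: stop and ask a human.
    return "reconcile_required"
```

The function has no "probably safe" branch. Anything that does not match a proof falls through to `reconcile_required`, which keeps the conservative behavior as the default rather than as a special case.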

When replay safety cannot be proven, the effect enters reconcile_required state. This is the default for any mutation where the shell cannot confirm the outcome. The system stops and asks a human.

Common scenarios requiring reconciliation:

  • POST request without an idempotency key
  • Payment or state-changing call where the response was lost
  • Any effect where partial execution could cause inconsistency

Use the CLI to inspect and resolve pending reconciliations:

# See what needs resolution
jacqos reconcile inspect --session latest
# After checking the external system, record exactly one outcome:
jacqos reconcile resolve <attempt-id> succeeded
jacqos reconcile resolve <attempt-id> failed
jacqos reconcile resolve <attempt-id> retry

Every resolution appends a new observation with provenance. The evaluator re-runs with the new evidence. If the original intent conditions still hold, a new intent may be derived and executed cleanly.

See the CLI Reference for full command details.
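Conceptually, a resolution adds evidence to the log and never edits what is already there. The following sketch illustrates that idea under stated assumptions: `ResolutionObservation` and `record_resolution` are hypothetical names, and the field layout is invented for illustration. Only the three outcome values come from the commands above.

```python
from dataclasses import dataclass

OUTCOMES = ("succeeded", "failed", "retry")


@dataclass(frozen=True)
class ResolutionObservation:
    attempt_id: str
    outcome: str
    note: str


def record_resolution(log, attempt_id, outcome, note=""):
    """Append an operator's resolution to the log and return it."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    observation = ResolutionObservation(attempt_id, outcome, note)
    log.append(observation)  # append only; earlier entries are never rewritten
    return observation
```

Keeping the original effect_started entry alongside the resolution is what lets a later reader trace both the crash and the operator's decision.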

Consider this sequence in the appointment-booking app:

  1. A booking_request observation arrives for slot RS-2024-03
  2. The evaluator derives intent.reserve_slot("REQ-1", "RS-2024-03")
  3. The shell admits the intent and starts an HTTP call to clinic_api
  4. The process crashes mid-request

On restart:

  1. The shell finds effect_started without a terminal receipt
  2. http.fetch to clinic_api is a POST without an idempotency key — not safe to auto-retry
  3. The effect enters reconcile_required
  4. The operator runs jacqos reconcile inspect --session latest
  5. They check the clinic API dashboard and find the slot was reserved
  6. They resolve: jacqos reconcile resolve eff-0042 succeeded
  7. The resolution observation feeds back into the evaluator
  8. confirmation_pending is derived, leading to intent.send_confirmation
  9. The confirmation email sends normally

The entire chain — crash, reconciliation, and recovery — is visible in the observation log and traceable through Studio’s drill inspector and timeline.

A related but distinct concept is contradictions — conflicting assertions and retractions for the same fact. These arise when new observations provide evidence that contradicts existing derived truth.

# List active contradictions
jacqos contradiction list
# Preview a resolution
jacqos contradiction preview <id> --decision accept-assertion
# Commit a resolution
jacqos contradiction resolve <id> --decision accept-retraction \
  --note "Provider confirmed slot was already taken"

A contradiction is resolved with one of three decisions: accept-assertion, accept-retraction, or defer. Each resolution is recorded as an observation with provenance.

  • No silent retry of mutations. If the shell cannot prove a retry is safe, it stops and asks. This is the conservative default — it prevents double-bookings, duplicate payments, and silent data corruption.
  • Every transition is durable. Admitted, started, completed, and reconciled states are all observations. Nothing is lost on crash.
  • Reconciliation is explicit. The operator provides evidence (“I checked the external system and the slot is held”). This evidence becomes part of the provenance chain.
  • Design for idempotency. If your external API supports idempotency keys, use them. This turns manual reconciliation into safe auto-retry — a much better operational experience.
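One way to design for idempotency is to derive the key deterministically from the intent itself, so that a retry after a crash sends the same key as the original attempt. This is a sketch under assumptions, not a JacqOS feature: `idempotency_key` is a hypothetical helper, and how the key is transmitted (many APIs accept it in a request header) depends on the external API.

```python
import hashlib
import json


def idempotency_key(intent_name, args):
    """Derive a stable key from an intent's name and arguments.

    The same intent always yields the same key, so the remote side
    can recognize a replayed request.
    """
    # Compact JSON gives a canonical byte string for the same inputs.
    payload = json.dumps([intent_name, list(args)], separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Example: the reserve_slot intent from the booking walkthrough.
key = idempotency_key("intent.reserve_slot", ["REQ-1", "RS-2024-03"])
```

A randomly generated key would not help here unless it were durably recorded before the call, because a restarted process could not reproduce it.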