Crash Recovery
Why Crash Recovery Matters
JacqOS agents interact with external systems such as booking APIs, payment processors, and LLM providers. Any of these calls can fail mid-flight. The process can crash between sending a request and recording the response. The network can drop the reply after the remote system has already committed the action.
In a workflow-first system, this ambiguity is often papered over with retry loops and hope. JacqOS takes a different approach: every state transition is durable, and any ambiguous outcome that cannot be proven safe to retry requires explicit human resolution. The system never guesses.
The Intent Lifecycle
Every intent passes through a durable state machine:

    Derived → Admitted → Executing → Completed
                             ↘ (crash) → Reconcile Required

Each transition appends an observation to the log. This means the full lifecycle is visible in provenance and survives any restart.
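The lifecycle above can be sketched as a small transition table. This is an illustrative Python model, not JacqOS code: the `IntentState` enum, the `advance` helper, and the shape of the log entries are assumptions made for the example, and it treats Reconcile Required as terminal for simplicity.

```python
from enum import Enum


class IntentState(Enum):
    DERIVED = "derived"
    ADMITTED = "admitted"
    EXECUTING = "executing"
    COMPLETED = "completed"
    RECONCILE_REQUIRED = "reconcile_required"


# Allowed transitions, mirroring the lifecycle diagram (hypothetical encoding).
TRANSITIONS = {
    IntentState.DERIVED: {IntentState.ADMITTED},
    IntentState.ADMITTED: {IntentState.EXECUTING},
    IntentState.EXECUTING: {IntentState.COMPLETED, IntentState.RECONCILE_REQUIRED},
    IntentState.COMPLETED: set(),
    IntentState.RECONCILE_REQUIRED: set(),
}


def advance(log, intent_id, current, target):
    """Append one log entry per transition; reject transitions the diagram does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    log.append({"intent": intent_id, "from": current.value, "to": target.value})
    return target
```

Because every accepted transition appends to the log, replaying the log after a restart is enough to recover the state of each intent.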
Derived
The evaluator reaches a fixed point and produces intent.* facts. These are candidate intents: what the system wants to do based on current evidence.
Admitted
The shell durably records each new intent before any external call begins. This is the commit point. Once admitted, the shell is responsible for driving the intent to completion or flagging it for reconciliation.
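One common way to make a record durable before acting on it is to append it to a file and force it to disk. The sketch below shows that general technique in Python; it is not how the JacqOS shell is implemented, and the `admit` function, the JSON-lines layout, and the `intent_admitted` record kind are all hypothetical.

```python
import json
import os


def admit(log_path, intent):
    """Append an 'admitted' record and force it to disk before returning."""
    record = {"kind": "intent_admitted", "intent": intent}  # hypothetical record shape
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())  # only after this returns may the external call begin
    return record
```

The ordering is the point: the write is flushed and synced first, so a crash at any later moment still finds the admitted intent on restart.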
Executing
The shell dispatches the intent through its declared capability. An effect_started marker is written. The external call happens. The response is recorded as a new observation.
Completed
The shell writes an effect_completed receipt. The new observation feeds back into the evaluator, potentially deriving new facts, retracting old ones, or triggering further intents.
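The Executing and Completed steps can be pictured as a call bracketed by two markers. This is a minimal Python sketch under stated assumptions: the in-memory `log` list, the `execute` function, and the entry shapes are invented for illustration; only the marker names `effect_started` and `effect_completed` come from the text.

```python
def execute(log, intent_id, call):
    """Bracket an external call with effect_started / effect_completed markers."""
    log.append({"kind": "effect_started", "intent": intent_id})
    response = call()  # a crash here leaves effect_started with no receipt
    log.append({"kind": "observation", "intent": intent_id, "payload": response})
    log.append({"kind": "effect_completed", "intent": intent_id})
    return response
```

If `call` raises or the process dies mid-call, the log ends at `effect_started`, which is exactly the ambiguous state that recovery has to classify.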
What Happens on Crash
On restart, the shell inspects every admitted intent and classifies it:
| State found | What it means | Action |
|---|---|---|
| No effect_started marker | Intent was admitted but never executed | Safe to execute from scratch |
| effect_completed receipt exists | Effect already finished | No action needed |
| effect_started without terminal receipt | Ambiguous: the call may or may not have succeeded | Classify for retry or reconciliation |
The third case is the interesting one. The shell sent the request, but crashed before recording the outcome. Did the external system process it? There is no way to know without checking.
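The table's three cases can be sketched as a single lookup over the log. This is an illustrative Python sketch, not JacqOS code: the log shape (a list of dicts with `kind` and `intent` keys), the `classify` name, and the returned labels are assumptions. It also assumes `effect_completed` is the only terminal receipt.

```python
def classify(log, intent_id):
    """Map the markers found for one admitted intent to one of the table's three rows."""
    kinds = {e["kind"] for e in log if e["intent"] == intent_id}
    if "effect_completed" in kinds:
        return "no_action"             # effect already finished
    if "effect_started" in kinds:
        return "ambiguous"             # retry or reconcile, decided separately
    return "execute_from_scratch"      # admitted but never started
```

Checking for the completed receipt first matters: a finished effect also has an `effect_started` marker, so the order of the tests decides the row.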
Auto-Retry vs. Manual Reconciliation
Safe Auto-Retry
The shell automatically retries when it can prove the request is safe to repeat:
- Read-only requests — GET calls that don’t mutate external state
- Idempotency key present — the resource contract guarantees exactly-once semantics
- Request-fingerprint contract — the external API confirms replay safety
Auto-retried effects append a new effect_started observation, preserving the full audit trail. The original attempt and the retry are both visible in provenance.
Manual Reconciliation
When replay safety cannot be proven, the effect enters the reconcile_required state. This is the default for any mutation where the shell cannot confirm the outcome. The system stops and asks a human.
Common scenarios requiring reconciliation:
- POST request without an idempotency key
- Payment or state-changing call where the response was lost
- Any effect where partial execution could cause inconsistency
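The split between auto-retry and manual reconciliation can be sketched as one decision function whose default is the conservative branch. The `Effect` dataclass, its field names, and `recovery_action` are hypothetical; the three conditions are the ones listed under Safe Auto-Retry, and how JacqOS actually evaluates them is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Effect:
    method: str                            # HTTP method of the interrupted call
    idempotency_key: Optional[str] = None  # set when the resource contract supplies one
    fingerprint_replay_safe: bool = False  # set when a request-fingerprint contract confirms replay safety


def recovery_action(effect):
    """Return 'auto_retry' only when replay safety is provable, else 'reconcile_required'."""
    read_only = effect.method.upper() == "GET"
    if read_only or effect.idempotency_key is not None or effect.fingerprint_replay_safe:
        return "auto_retry"
    return "reconcile_required"
```

For example, `recovery_action(Effect("POST"))` falls through to `reconcile_required`, matching the first scenario in the list above.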
Resolving Reconciliation
Use the CLI to inspect and resolve pending reconciliations:

    # See what needs resolution
    jacqos reconcile inspect --session latest

    # After checking the external system:
    jacqos reconcile resolve <attempt-id> succeeded
    jacqos reconcile resolve <attempt-id> failed
    jacqos reconcile resolve <attempt-id> retry

Every resolution appends a new observation with provenance. The evaluator re-runs with the new evidence. If the original intent conditions still hold, a new intent may be derived and executed cleanly.
See the CLI Reference for full command details.
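Conceptually, a resolution is just another append to the log. The Python sketch below models that idea only; the `resolve` function, the `reconcile_resolved` kind, and the `provenance` field layout are assumptions, while the three outcomes mirror the arguments accepted by `jacqos reconcile resolve`.

```python
VALID_OUTCOMES = {"succeeded", "failed", "retry"}


def resolve(log, attempt_id, outcome, resolved_by):
    """Record an operator's decision as a new observation; never edit earlier entries."""
    if outcome not in VALID_OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome!r}")
    observation = {
        "kind": "reconcile_resolved",               # hypothetical observation kind
        "attempt": attempt_id,
        "outcome": outcome,
        "provenance": {"resolved_by": resolved_by},  # hypothetical provenance shape
    }
    log.append(observation)
    return observation
```

Because the decision is appended rather than patched in, the original ambiguous attempt stays visible next to the evidence that settled it.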
Worked Example
Consider this sequence in the appointment-booking app:
- A `booking_request` observation arrives for slot `RS-2024-03`
- The evaluator derives `intent.reserve_slot("REQ-1", "RS-2024-03")`
- The shell admits the intent and starts an HTTP call to `clinic_api`
- The process crashes mid-request
On restart:
- The shell finds `effect_started` without a terminal receipt
- `http.fetch` to `clinic_api` is a POST without an idempotency key, so it is not safe to auto-retry
- The effect enters `reconcile_required`
- The operator runs `jacqos reconcile inspect --session latest`
- They check the clinic API dashboard and find the slot was reserved
- They resolve: `jacqos reconcile resolve eff-0042 succeeded`
- The resolution observation feeds back into the evaluator
- `confirmation_pending` is derived, leading to `intent.send_confirmation`
- The confirmation email sends normally
The entire chain — crash, reconciliation, and recovery — is visible in the observation log and traceable through Studio’s drill inspector and timeline.
Contradictions
A related but distinct concept is contradictions: conflicting assertions and retractions for the same fact. These arise when new observations provide evidence that contradicts existing derived truth.

    # List active contradictions
    jacqos contradiction list

    # Preview a resolution
    jacqos contradiction preview <id> --decision accept-assertion

    # Commit a resolution
    jacqos contradiction resolve <id> --decision accept-retraction \
      --note "Provider confirmed slot was already taken"

Contradiction resolution decisions: accept-assertion, accept-retraction, or defer. Each resolution is recorded as an observation with provenance.
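To make the definition concrete, a contradiction can be modeled as a fact that is both asserted and retracted at the same time. This is a deliberately simplified Python sketch: the event shape (`fact` and `op` keys), the `find_contradictions` name, and the assumption that every event passed in is currently supported are all illustrative, and JacqOS's actual detection logic may differ.

```python
from collections import defaultdict


def find_contradictions(events):
    """Return facts that have both an assertion and a retraction among the given events."""
    seen = defaultdict(set)
    for e in events:
        seen[e["fact"]].add(e["op"])
    return sorted(f for f, ops in seen.items() if {"assert", "retract"} <= ops)
```

A fact with only assertions or only retractions is left alone; only the overlap is surfaced for an accept-assertion, accept-retraction, or defer decision.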
Design Principles
- No silent retry of mutations. If the shell cannot prove a retry is safe, it stops and asks. This is the conservative default: it prevents double-bookings, duplicate payments, and silent data corruption.
- Every transition is durable. Admitted, started, completed, and reconciled states are all observations. Nothing is lost on crash.
- Reconciliation is explicit. The operator provides evidence (“I checked the external system and the slot is held”). This evidence becomes part of the provenance chain.
- Design for idempotency. If your external API supports idempotency keys, use them. This turns manual reconciliation into safe auto-retry — a much better operational experience.
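One way to follow the idempotency principle is to derive the key deterministically from the intent itself, so every retry of the same intent presents the same key to the external API. The sketch below is a general technique in Python, not a JacqOS feature: the `idempotency_key` function and the choice of SHA-256 over a canonical JSON encoding are assumptions.

```python
import hashlib
import json


def idempotency_key(intent_name, args):
    """Derive a stable key from the intent so every retry of it sends the same key."""
    # Compact JSON of [name, args] gives a canonical byte string for hashing.
    canonical = json.dumps([intent_name, list(args)], separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A key built this way survives a crash for free, because it can be recomputed from the admitted intent instead of being stored separately. It only helps if the external API actually honors idempotency keys.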
Next Steps
- Debug, Verify, Ship: the end-to-end workflow page that integrates `jacqos reconcile inspect`, `jacqos contradiction list/resolve`, and the rest of the debugging surface into a single failure-to-green narrative
- Effects and Intents: the full guide with code examples
- CLI Reference: reconcile and contradiction commands
- jacqos.toml Reference: declaring capabilities and resources
- Observation-First Thinking: why durable observations make this possible