
Incident Response Walkthrough

A flagship example that models cloud incident response on a service dependency graph. When a primary database degrades, triage derives the blast radius through recursive Datalog over the real topology, communications and remediation agents react through the shared derived model rather than a hidden orchestration graph, and catastrophic invariants stop unsafe plans before they become effects.

This walkthrough is the cleanest demonstration of the multi-namespace coordination pattern with recursive derivation:

topology.update + telemetry.alert
-> infra.transitively_depends (recursive closure)
-> triage.blast_radius
-> proposal.remediation_action
-> remediation.plan
-> intent.notify_stakeholder, intent.remediate

It covers the full JacqOS pipeline with a focus on stigmergic coordination through invariants:

  1. Observations arrive as JSON events (topology.update, telemetry.alert, llm.remediation_decision_result, effect.receipt)
  2. Mappers extract trusted topology atoms and requires_relay semantic atoms from the model output
  3. Rules derive recursive transitive dependencies, blast radius, root-cause severity, and proposal-relayed remediation plans
  4. Invariants enforce catastrophic guards — no_kill_unsynced_primary, always_have_admin, no_isolate_healthy
  5. Intents derive stakeholder notifications and bounded remediation calls
  6. Fixtures prove the happy path, an unsafe-plan contradiction, a deep cascade, and full coverage of the safety boundary
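For orientation, a fixture timeline begins with observations shaped roughly like the following. This is a hedged sketch — the field names track the mapper code shown later in this walkthrough, but the exact lines in the bundled fixtures may differ:

```json
{"kind": "topology.update", "body": {"service_id": "auth-service", "depends_on": ["db-primary"]}}
{"kind": "topology.update", "body": {"service_id": "db-primary", "is_primary_db": true, "replica_synced": true}}
{"kind": "telemetry.alert", "body": {"service_id": "db-primary", "status": "degraded", "seq": 1}}
```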
The project layout:

jacqos-incident-response/
  jacqos.toml
  ontology/
    schema.dh                  # 4 namespaces of relations
    rules.dh                   # 3 strata + 3 catastrophic invariants
    intents.dh                 # notify + remediate intent derivation
  mappings/
    inbound.rhai               # mapper contract + observation mapping
  prompts/
    remediation-system.md      # remediation prompt bundle
  schemas/
    remediation-action.json    # structured-output schema
  fixtures/
    happy-path.jsonl
    happy-path.expected.json
    contradiction-path.jsonl
    contradiction-path.expected.json
    cascade-path.jsonl
    cascade-path.expected.json
    coverage-path.jsonl
    coverage-path.expected.json
  generated/
    ...                        # verification, graph, and export artifacts

jacqos.toml declares the app identity, the notification API binding, the remediation model binding, and recorded replay for both:

app_id = "jacqos-incident-response"
app_version = "0.1.0"
[paths]
ontology = ["ontology/*.dh"]
mappings = ["mappings/*.rhai"]
prompts = ["prompts/*.md"]
schemas = ["schemas/*.json"]
fixtures = ["fixtures/*.jsonl"]
helpers = ["helpers/*.rhai"]
[capabilities]
http_clients = ["notify_api"]
models = ["remediation_model"]
timers = false
blob_store = true
[capabilities.intents]
"intent.notify_stakeholder" = { capability = "http.fetch", resource = "notify_api" }
"intent.remediate" = { capability = "llm.complete", resource = "remediation_model", result_kind = "llm.remediation_decision_result" }
[resources.http.notify_api]
base_url = "https://incident-notify.example.invalid"
credential_ref = "NOTIFY_API_TOKEN"
replay = "record"
[resources.model.remediation_model]
provider = "openai"
model = "gpt-4o-mini"
credential_ref = "OPENAI_API_KEY"
replay = "record"

The remediation model is wired through the same provider-capture path as the HTTP notifier. The bundled fixtures are fully deterministic: jacqos verify produces the same facts on every run, no API key is needed when matching captures are present, and the same seam can be flipped between record and replay without any ontology change.

The schema partitions relations into four functional groups — infra.* for topology and telemetry, triage.* for blast-radius reasoning, proposal.* and remediation.* for the model-relayed action surface, and intent.* for declared external effects:

relation infra.service(service_id: text)
relation infra.depends_on(service_id: text, dependency_id: text)
relation infra.transitively_depends(service_id: text, dependency_id: text)
relation infra.health_signal(service_id: text, status: text, seq: int)
relation infra.degraded(service_id: text)
relation infra.healthy(service_id: text)
relation infra.is_primary_db(service_id: text)
relation infra.replica_synced(service_id: text)
relation infra.production_system(service_id: text)
relation infra.has_admin_access(service_id: text)
relation infra.admin_gap(service_id: text)
relation triage.blast_radius(service_id: text, root_service: text)
relation triage.impacted(service_id: text)
relation triage.root_cause(root_service: text)
relation triage.severity(root_service: text, severity: text)
relation triage.stakeholder_notified(root_service: text)
relation proposal.remediation_action(
    decision_id: text, root_service: text, target_service: text, action: text, seq: int
)
relation remediation.plan(root_service: text, target_service: text, action: text, seq: int)
relation remediation.unsafely_scaled_primary(service_id: text)
relation remediation.unsafely_isolated(service_id: text)
relation intent.notify_stakeholder(root_service: text, severity: text)
relation intent.remediate(root_service: text, severity: text)

The crucial separation is between proposal.remediation_action (whatever the model said) and remediation.plan (what passed the relay boundary). The ontology keeps these on opposite sides of the requires_relay gate so an absurd plan never silently becomes an executable action.

The mapper marks the LLM remediation output as requires_relay into proposal.*. Topology and telemetry stay trusted atoms; only the model’s structured action lands behind the relay namespace:

fn mapper_contract() {
    #{
        requires_relay: [
            #{
                observation_class: "llm.remediation_decision_result",
                predicate_prefixes: ["proposal."],
                relay_namespace: "proposal",
            }
        ],
    }
}

map_observation() then projects topology, telemetry, model output, and effect receipts:

if obs.kind == "topology.update" {
    let atoms = [atom("service.id", body.service_id)];
    if body.contains("depends_on") {
        for dependency in body.depends_on {
            atoms.push(atom("service.depends_on", dependency));
        }
    }
    if body.contains("is_primary_db") && body.is_primary_db == true {
        atoms.push(atom("service.primary_db", body.service_id));
    }
    if body.contains("replica_synced") && body.replica_synced == true {
        atoms.push(atom("service.replica_synced", body.service_id));
    }
    return atoms;
}
if obs.kind == "llm.remediation_decision_result" {
    return [
        atom("proposal.id", body.decision_id),
        atom("proposal.root_service", body.root_service),
        atom("proposal.target_service", body.target_service),
        atom("proposal.action", body.action),
        atom("proposal.seq", body.seq),
    ];
}

The split is the whole design: topology is trusted structure, telemetry is trusted signal, and the model’s remediation action is fallible interpretation that has to clear the proposal boundary before any rule can derive remediation.plan.
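The effect of that split can be sketched outside the runtime. The following is a hypothetical Python model — not the JacqOS API — of a mapper whose proposal.* atoms are only admitted when the producing observation class is registered for relay:

```python
# Hypothetical model of the relay boundary (not the JacqOS runtime API).
# Only observation classes registered in the mapper contract may emit atoms
# under the "proposal." prefix; everything else is a trusted projection.

RELAY_CLASSES = {"llm.remediation_decision_result"}

def map_atoms(obs):
    """Project an observation into (predicate, value) atoms."""
    if obs["kind"] == "topology.update":
        return [("service.id", obs["body"]["service_id"])]   # trusted structure
    if obs["kind"] == "llm.remediation_decision_result":
        return [("proposal.action", obs["body"]["action"])]  # fallible interpretation
    return []

def admit(obs, atoms):
    """Reject proposal.* atoms from any class not registered for relay."""
    for predicate, _ in atoms:
        if predicate.startswith("proposal.") and obs["kind"] not in RELAY_CLASSES:
            raise ValueError("unrelayed proposal atom from " + obs["kind"])
    return atoms
```

In this toy model, a topology event that somehow tried to smuggle a proposal.* atom past the boundary raises instead of landing in the derived state.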

Step 4: Derive Blast Radius, Severity, And Plans


Recursive transitive closure computes blast radius from the dependency graph rather than from hand-authored runbooks:

rule infra.transitively_depends(service, dependency) :-
    infra.depends_on(service, dependency).
rule infra.transitively_depends(service, root) :-
    infra.depends_on(service, dependency),
    infra.transitively_depends(dependency, root).
rule triage.root_cause(root) :-
    infra.degraded(root),
    not infra.healthy(root).
rule triage.blast_radius(root, root) :-
    triage.root_cause(root).
rule triage.blast_radius(service, root) :-
    infra.transitively_depends(service, root),
    triage.root_cause(root).

When a degraded primary appears, every transitively dependent service joins the blast radius automatically. The cascade fixture exercises a five-service chain to prove the closure is depth-faithful.
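The semantics of the recursive rules can be reproduced with a naive fixpoint over the edge set — an illustrative Python sketch, not the JacqOS evaluator — using the cascade fixture's chain:

```python
# Illustrative fixpoint for infra.transitively_depends — a sketch of the
# rule semantics, not the JacqOS evaluator.

def transitive_closure(depends_on):
    closure = set(depends_on)                      # base rule: every direct edge
    while True:
        derived = {(s, root)
                   for (s, d) in depends_on
                   for (d2, root) in closure
                   if d == d2} - closure           # recursive rule: one more hop
        if not derived:
            return closure                         # fixpoint reached
        closure |= derived

# Five-service chain from the cascade fixture:
edges = {("cdn-edge", "frontend-web"), ("frontend-web", "edge-api"),
         ("edge-api", "auth-service"), ("auth-service", "db-primary")}

# Blast radius of a degraded db-primary: every service whose transitive
# dependencies reach the root cause.
radius = {s for (s, root) in transitive_closure(edges) if root == "db-primary"}
# radius == {"cdn-edge", "frontend-web", "edge-api", "auth-service"}
```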

Severity is a small projection over the root cause:

rule triage.severity(root, "critical") :-
    triage.root_cause(root),
    infra.is_primary_db(root).
rule triage.severity(root, "high") :-
    triage.root_cause(root),
    not infra.is_primary_db(root).
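Because the two rule bodies are complementary, every root cause gets exactly one severity. An illustrative one-liner capturing the same projection:

```python
def severity(is_primary_db: bool) -> str:
    # Mirrors the two rules: a degraded primary DB is critical, anything else high.
    return "critical" if is_primary_db else "high"
```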

The model’s proposal lifts into proposal.remediation_action, then a single bridge rule promotes it into remediation.plan only if it cleared the relay boundary:

rule assert proposal.remediation_action(decision_id, root, target, action, seq) :-
    atom(obs, "proposal.id", decision_id),
    atom(obs, "proposal.root_service", root),
    atom(obs, "proposal.target_service", target),
    atom(obs, "proposal.action", action),
    atom(obs, "proposal.seq", seq).
rule remediation.plan(root, target, action, seq) :-
    proposal.remediation_action(_, root, target, action, seq).

The catastrophic boundary is a pair of unsafe-condition relations and the named invariants that forbid them:

rule remediation.unsafely_scaled_primary(node) :-
    remediation.scale_down(node),
    infra.is_primary_db(node),
    not infra.replica_synced(node).
rule remediation.unsafely_isolated(service) :-
    remediation.isolate(service),
    not triage.impacted(service).
invariant no_kill_unsynced_primary(node) :-
    count remediation.unsafely_scaled_primary(node) <= 0.
invariant no_isolate_healthy(service) :-
    count remediation.unsafely_isolated(service) <= 0.
invariant always_have_admin(service) :-
    count infra.admin_gap(service) <= 0.

These three invariants are the structural backstop. Even if a future rule edit weakens the planning layer, an unsafe scale-down of an unsynced primary, an isolate of a healthy service, or a production system without admin access still trips invariant review and jacqos verify halts.
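Each invariant is a count guard over a derived unsafe-condition relation. A minimal Python sketch of that check, under hypothetical fact shapes (this is not the verifier's internals):

```python
# Minimal sketch of invariant evaluation: each named invariant demands that
# the count of its unsafe-condition relation stays at zero.
# Hypothetical fact shapes, not the JacqOS verifier's internals.

def check_invariants(facts):
    """facts: dict mapping relation name -> set of fact tuples.
    Returns the names of all violated invariants."""
    violations = []
    for invariant, relation in [
        ("no_kill_unsynced_primary", "remediation.unsafely_scaled_primary"),
        ("no_isolate_healthy", "remediation.unsafely_isolated"),
        ("always_have_admin", "infra.admin_gap"),
    ]:
        if len(facts.get(relation, set())) > 0:
            violations.append(invariant)
    return violations
```

A single derived tuple in any of the three relations is enough to name the violated invariant and halt verification.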

Step 5: Derive Outbound Effects Only From Stable State


Communications and remediation are independent intent rules over the shared model. There is no shared workflow graph — each agent reads what it needs from triage.* and contributes its declared intent:

rule intent.notify_stakeholder(root, severity) :-
    triage.root_cause(root),
    triage.severity(root, severity),
    not triage.stakeholder_notified(root).
rule intent.remediate(root, severity) :-
    triage.root_cause(root),
    triage.severity(root, severity),
    not remediation.plan(root, _, _, _).

Two agents, one shared truth surface, zero orchestration code. That is stigmergic coordination — the same pattern as ant trails, but typed and inspectable.
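The two intent rules can be read as independent agents polling the same derived facts. A Python sketch under hypothetical fact shapes:

```python
# Sketch of the two intent rules as independent reads over shared state.
# Hypothetical fact shapes; the names mirror the rules above.

def derive_intents(facts):
    intents = []
    for root in facts["triage.root_cause"]:
        sev = facts["triage.severity"][root]
        # Notify agent: fire until a notification is recorded for this root.
        if root not in facts["triage.stakeholder_notified"]:
            intents.append(("intent.notify_stakeholder", root, sev))
        # Remediation agent: fire until a plan exists for this root.
        if not any(plan[0] == root for plan in facts["remediation.plan"]):
            intents.append(("intent.remediate", root, sev))
    return intents
```

Neither branch knows the other exists; each quiesces on its own negation guard once the shared model reflects its effect.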

This example ships four fixtures, each exercising a different facet of the pipeline.

happy-path — A degraded primary fans out through the dependency chain, stakeholders are notified, and a safe reroute to the in-sync replica is proposed and accepted. Final state: one root cause, one blast radius, one applied remediation, and all three invariants hold.

contradiction-path — The remediation model proposes a scale_down against the primary while no synced replica exists. remediation.unsafely_scaled_primary derives, no_kill_unsynced_primary fires, and the unsafe plan never reaches an effect. The fixture also exercises a retracted telemetry signal, so the timeline shows both invariant containment and contradiction handling in the same window.

cascade-path — A five-service chain (cdn-edge -> frontend-web -> edge-api -> auth-service -> db-primary) exercises deep transitive closure. The model proposes isolating auth-service, the rules confirm the service is in the blast radius, and no_isolate_healthy does not fire because the isolate target is genuinely impacted.

coverage-path — A timeline that walks every accepting and rejecting branch of the rule graph, so the verification bundle's coverage report reaches 100% on the rule shape. The same coverage data is consumed by jacqos verify and exported in every verification bundle under generated/verification/.

Open the demo with jacqos studio --lineage incident-response and the bundled happy-path fixture loads. Switch fixtures from the timeline picker to walk every scenario:

  • Safe reroute -> the Done tab shows db-primary -> reroute, applied. Drill in and the inspector takes you from the executed remediation back through remediation.plan, the model’s proposal.remediation_action, the blast-radius derivation, and the original telemetry alert.
  • Unsafe scale-down blocked -> the Blocked tab shows the no_kill_unsynced_primary invariant violation. The drill inspector names the missing infra.replica_synced fact and the model’s proposal that triggered the unsafe condition. No effect ever fired.
  • Five-service cascade -> the Done tab shows the isolate applied to auth-service. Drill into the blast radius and the inspector walks the recursive infra.transitively_depends chain back to db-primary.
  • Stakeholder notified -> a notification effect shows up in Done for every fixture; this is the second agent participating through the shared model with no orchestration glue.

This is the multi-agent coordination pattern in its strongest form:

  • two independent agents (notify and remediate) coordinate without a shared workflow
  • recursive Datalog computes blast radius from real topology, not hand-authored playbooks
  • catastrophic invariants are a structural backstop that survives any rule-edit regression
  • the model’s plan is visible, queryable, and forced to clear the proposal boundary before it can become an effect

That is how you stop a confidently wrong remediation plan from terminating a primary database during the worst hour of your year.

The incident-response pattern fits every domain where multiple agents read shared state and contribute to a single outcome under safety constraints:

  • Production database operations — propose backup, restore, or failover plans; gate by replica sync state and snapshot freshness
  • Kubernetes orchestration — propose pod terminations or node drains; gate by quorum, leader election, and PDB compliance
  • Financial trading kill-switches — propose order cancellations or position closes; gate by exposure limits and counterparty status
  • Industrial control loops — multiple sensor agents feed a shared model; actuator agents read the model and respect named safety invariants

To start building, scaffold a starter app:

jacqos scaffold --pattern multi-agent my-incident-app