
Incident Response Walkthrough

A flagship example that models cloud incident response on a service dependency graph. When a primary database degrades, triage derives the blast radius through recursive Datalog over the real topology, communications and remediation agents react through the shared derived model rather than a hidden orchestration graph, and catastrophic invariants stop unsafe plans before they become effects.

This walkthrough is the cleanest demonstration of the multi-namespace coordination pattern with recursive derivation:

topology.update + telemetry.alert
-> infra.transitively_depends (recursive closure)
-> triage.blast_radius
-> proposal.remediation_action
-> remediation.plan
-> intent.notify_stakeholder, intent.remediate

It covers the full JacqOS pipeline with a focus on stigmergic coordination through invariants:

  1. Observations arrive as JSON events (topology.update, telemetry.alert, llm.remediation_decision_result, effect.receipt)
  2. Mappers extract trusted topology atoms and requires_relay semantic atoms from the model output
  3. Rules derive recursive transitive dependencies, blast radius, root-cause severity, and proposal-relayed remediation plans
  4. Invariants enforce catastrophic guards — no_kill_unsynced_primary, always_have_admin, no_isolate_healthy
  5. Intents derive stakeholder notifications and bounded remediation calls
  6. Fixtures prove the happy path, an unsafe-plan contradiction, a deep cascade, and full coverage of the safety boundary
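For orientation, a fixture timeline begins with observations shaped roughly like the following. This is a hedged sketch — the field names track the mapper code shown later in this walkthrough, but the exact lines in the bundled fixtures may differ:

```json
{"kind": "topology.update", "body": {"service_id": "auth-service", "depends_on": ["db-primary"]}}
{"kind": "topology.update", "body": {"service_id": "db-primary", "is_primary_db": true, "replica_synced": true}}
{"kind": "telemetry.alert", "body": {"service_id": "db-primary", "status": "degraded", "seq": 1}}
```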
The project layout:

jacqos-incident-response/
  jacqos.toml
  ontology/
    schema.dh                  # 4 namespaces of relations
    rules.dh                   # 3 strata + 3 catastrophic invariants
    intents.dh                 # notify + remediate intent derivation
  mappings/
    inbound.rhai               # mapper contract + observation mapping
  prompts/
    remediation-system.md      # remediation prompt bundle
  schemas/
    remediation-action.json    # structured-output schema
  fixtures/
    happy-path.jsonl
    happy-path.expected.json
    contradiction-path.jsonl
    contradiction-path.expected.json
    cascade-path.jsonl
    cascade-path.expected.json
    coverage-path.jsonl
    coverage-path.expected.json
  generated/
    ...                        # verification, graph, and export artifacts

jacqos.toml declares the app identity, the notification API binding, the remediation model binding, and recorded replay for both:

app_id = "jacqos-incident-response"
app_version = "0.1.0"
[paths]
ontology = ["ontology/*.dh"]
mappings = ["mappings/*.rhai"]
prompts = ["prompts/*.md"]
schemas = ["schemas/*.json"]
fixtures = ["fixtures/*.jsonl"]
helpers = ["helpers/*.rhai"]
[capabilities]
http_clients = ["notify_api"]
models = ["remediation_model"]
timers = false
blob_store = true
[capabilities.intents]
"intent.notify_stakeholder" = { capability = "http.fetch", resource = "notify_api" }
"intent.remediate" = { capability = "llm.complete", resource = "remediation_model", result_kind = "llm.remediation_decision_result" }
[resources.http.notify_api]
base_url = "https://incident-notify.example.invalid"
credential_ref = "NOTIFY_API_TOKEN"
replay = "record"
[resources.model.remediation_model]
provider = "openai"
model = "gpt-4o-mini"
credential_ref = "OPENAI_API_KEY"
replay = "record"

The remediation model is wired through the same provider-capture path as the HTTP notifier. The bundled fixtures are fully deterministic: jacqos verify produces the same facts on every run, no API key is needed when matching captures are present, and the same seam can be flipped between record and replay without any ontology change.

The schema partitions relations into four functional groups — infra.* for topology and telemetry, triage.* for blast-radius reasoning, proposal.* and remediation.* for the model-relayed action surface, and intent.* for declared external effects:

relation infra.service(service_id: text)
relation infra.depends_on(service_id: text, dependency_id: text)
relation infra.transitively_depends(service_id: text, dependency_id: text)
relation infra.health_signal(service_id: text, status: text, seq: int)
relation infra.degraded(service_id: text)
relation infra.healthy(service_id: text)
relation infra.is_primary_db(service_id: text)
relation infra.replica_synced(service_id: text)
relation infra.production_system(service_id: text)
relation infra.has_admin_access(service_id: text)
relation infra.admin_gap(service_id: text)
relation triage.blast_radius(service_id: text, root_service: text)
relation triage.impacted(service_id: text)
relation triage.root_cause(root_service: text)
relation triage.severity(root_service: text, severity: text)
relation triage.stakeholder_notified(root_service: text)
relation proposal.remediation_action(
    decision_id: text, root_service: text, target_service: text, action: text, seq: int
)
relation remediation.plan(root_service: text, target_service: text, action: text, seq: int)
relation remediation.unsafely_scaled_primary(service_id: text)
relation remediation.unsafely_isolated(service_id: text)
relation intent.notify_stakeholder(root_service: text, severity: text)
relation intent.remediate(root_service: text, severity: text)

The crucial separation is between proposal.remediation_action (whatever the model said) and remediation.plan (what passed the relay boundary). The ontology keeps these on opposite sides of the requires_relay gate so an absurd plan never silently becomes an executable action.

The mapper marks the LLM remediation output as requires_relay into proposal.*. Topology and telemetry stay trusted atoms; only the model’s structured action lands behind the relay namespace:

fn mapper_contract() {
    #{
        requires_relay: [
            #{
                observation_class: "llm.remediation_decision_result",
                predicate_prefixes: ["proposal."],
                relay_namespace: "proposal",
            }
        ],
    }
}

map_observation() then projects topology, telemetry, model output, and effect receipts:

if obs.kind == "topology.update" {
    let atoms = [atom("service.id", body.service_id)];
    if body.contains("depends_on") {
        for dependency in body.depends_on {
            atoms.push(atom("service.depends_on", dependency));
        }
    }
    if body.contains("is_primary_db") && body.is_primary_db == true {
        atoms.push(atom("service.primary_db", body.service_id));
    }
    if body.contains("replica_synced") && body.replica_synced == true {
        atoms.push(atom("service.replica_synced", body.service_id));
    }
    return atoms;
}
if obs.kind == "llm.remediation_decision_result" {
    return [
        atom("proposal.id", body.decision_id),
        atom("proposal.root_service", body.root_service),
        atom("proposal.target_service", body.target_service),
        atom("proposal.action", body.action),
        atom("proposal.seq", body.seq),
    ];
}

The split is the whole design: topology is trusted structure, telemetry is trusted signal, and the model’s remediation action is fallible interpretation that has to clear the proposal boundary before any rule can derive remediation.plan.
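The effect of that split can be sketched outside the runtime. The following is a hypothetical Python model — not the JacqOS API — of a mapper whose proposal.* atoms are only admitted when the producing observation class is registered for relay:

```python
# Hypothetical model of the relay boundary (not the JacqOS runtime API).
# Only observation classes registered in the mapper contract may emit atoms
# under the "proposal." prefix; everything else is a trusted projection.

RELAY_CLASSES = {"llm.remediation_decision_result"}

def map_atoms(obs):
    """Project an observation into (predicate, value) atoms."""
    if obs["kind"] == "topology.update":
        return [("service.id", obs["body"]["service_id"])]   # trusted structure
    if obs["kind"] == "llm.remediation_decision_result":
        return [("proposal.action", obs["body"]["action"])]  # fallible interpretation
    return []

def admit(obs, atoms):
    """Reject proposal.* atoms from any class not registered for relay."""
    for predicate, _ in atoms:
        if predicate.startswith("proposal.") and obs["kind"] not in RELAY_CLASSES:
            raise ValueError("unrelayed proposal atom from " + obs["kind"])
    return atoms
```

In this toy model, a topology event that somehow tried to smuggle a proposal.* atom past the boundary raises instead of landing in the derived state.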

Step 4: Derive Blast Radius, Severity, And Plans


Recursive transitive closure computes blast radius from the dependency graph rather than from hand-authored runbooks:

rule infra.transitively_depends(service, dependency) :-
    infra.depends_on(service, dependency).
rule infra.transitively_depends(service, root) :-
    infra.depends_on(service, dependency),
    infra.transitively_depends(dependency, root).
rule triage.root_cause(root) :-
    infra.degraded(root),
    not infra.healthy(root).
rule triage.blast_radius(root, root) :-
    triage.root_cause(root).
rule triage.blast_radius(service, root) :-
    infra.transitively_depends(service, root),
    triage.root_cause(root).

When a degraded primary appears, every transitively dependent service joins the blast radius automatically. The cascade fixture exercises a five-service chain to prove the closure is depth-faithful.
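The semantics of the recursive rules can be reproduced with a naive fixpoint over the edge set — an illustrative Python sketch, not the JacqOS evaluator — using the cascade fixture's chain:

```python
# Illustrative fixpoint for infra.transitively_depends — a sketch of the
# rule semantics, not the JacqOS evaluator.

def transitive_closure(depends_on):
    closure = set(depends_on)                      # base rule: every direct edge
    while True:
        derived = {(s, root)
                   for (s, d) in depends_on
                   for (d2, root) in closure
                   if d == d2} - closure           # recursive rule: one more hop
        if not derived:
            return closure                         # fixpoint reached
        closure |= derived

# Five-service chain from the cascade fixture:
edges = {("cdn-edge", "frontend-web"), ("frontend-web", "edge-api"),
         ("edge-api", "auth-service"), ("auth-service", "db-primary")}

# Blast radius of a degraded db-primary: every service whose transitive
# dependencies reach the root cause.
radius = {s for (s, root) in transitive_closure(edges) if root == "db-primary"}
# radius == {"cdn-edge", "frontend-web", "edge-api", "auth-service"}
```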

Severity is a small projection over the root cause:

rule triage.severity(root, "critical") :-
    triage.root_cause(root),
    infra.is_primary_db(root).
rule triage.severity(root, "high") :-
    triage.root_cause(root),
    not infra.is_primary_db(root).
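Because the two rule bodies are complementary, every root cause gets exactly one severity. An illustrative one-liner capturing the same projection:

```python
def severity(is_primary_db: bool) -> str:
    # Mirrors the two rules: a degraded primary DB is critical, anything else high.
    return "critical" if is_primary_db else "high"
```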

The model’s proposal lifts into proposal.remediation_action, then a single bridge rule promotes it into remediation.plan only if it cleared the relay boundary:

rule assert proposal.remediation_action(decision_id, root, target, action, seq) :-
    atom(obs, "proposal.id", decision_id),
    atom(obs, "proposal.root_service", root),
    atom(obs, "proposal.target_service", target),
    atom(obs, "proposal.action", action),
    atom(obs, "proposal.seq", seq).
rule remediation.plan(root, target, action, seq) :-
    proposal.remediation_action(_, root, target, action, seq).

The catastrophic boundary is a pair of unsafe-condition relations and the named invariants that forbid them:

rule remediation.unsafely_scaled_primary(node) :-
    remediation.scale_down(node),
    infra.is_primary_db(node),
    not infra.replica_synced(node).
rule remediation.unsafely_isolated(service) :-
    remediation.isolate(service),
    not triage.impacted(service).
invariant no_kill_unsynced_primary(node) :-
    count remediation.unsafely_scaled_primary(node) <= 0.
invariant no_isolate_healthy(service) :-
    count remediation.unsafely_isolated(service) <= 0.
invariant always_have_admin(service) :-
    count infra.admin_gap(service) <= 0.

These three invariants are the structural backstop. Even if a future rule edit weakens the planning layer, an unsafe scale-down of an unsynced primary, an isolate of a healthy service, or a production system without admin access still trips invariant review and jacqos verify halts.
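Each invariant is a count guard over a derived unsafe-condition relation. A minimal Python sketch of that check, under hypothetical fact shapes (this is not the verifier's internals):

```python
# Minimal sketch of invariant evaluation: each named invariant demands that
# the count of its unsafe-condition relation stays at zero.
# Hypothetical fact shapes, not the JacqOS verifier's internals.

def check_invariants(facts):
    """facts: dict mapping relation name -> set of fact tuples.
    Returns the names of all violated invariants."""
    violations = []
    for invariant, relation in [
        ("no_kill_unsynced_primary", "remediation.unsafely_scaled_primary"),
        ("no_isolate_healthy", "remediation.unsafely_isolated"),
        ("always_have_admin", "infra.admin_gap"),
    ]:
        if len(facts.get(relation, set())) > 0:
            violations.append(invariant)
    return violations
```

A single derived tuple in any of the three relations is enough to name the violated invariant and halt verification.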

Step 5: Derive Outbound Effects Only From Stable State


Communications and remediation are independent intent rules over the shared model. There is no shared workflow graph — each agent reads what it needs from triage.* and contributes its declared intent:

rule intent.notify_stakeholder(root, severity) :-
    triage.root_cause(root),
    triage.severity(root, severity),
    not triage.stakeholder_notified(root).
rule intent.remediate(root, severity) :-
    triage.root_cause(root),
    triage.severity(root, severity),
    not remediation.plan(root, _, _, _).

Two agents, one shared truth surface, zero orchestration code. That is stigmergic coordination — the same pattern as ant trails, but typed and inspectable.
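The two intent rules can be read as independent agents polling the same derived facts. A Python sketch under hypothetical fact shapes:

```python
# Sketch of the two intent rules as independent reads over shared state.
# Hypothetical fact shapes; the names mirror the rules above.

def derive_intents(facts):
    intents = []
    for root in facts["triage.root_cause"]:
        sev = facts["triage.severity"][root]
        # Notify agent: fire until a notification is recorded for this root.
        if root not in facts["triage.stakeholder_notified"]:
            intents.append(("intent.notify_stakeholder", root, sev))
        # Remediation agent: fire until a plan exists for this root.
        if not any(plan[0] == root for plan in facts["remediation.plan"]):
            intents.append(("intent.remediate", root, sev))
    return intents
```

Neither branch knows the other exists; each quiesces on its own negation guard once the shared model reflects its effect.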

This example ships four fixtures, each exercising a different facet of the pipeline.

happy-path — A degraded primary fans out through the dependency chain, stakeholders are notified, and a safe reroute to the in-sync replica is proposed and accepted. Final state: one root cause, one blast radius, one applied remediation, and all three invariants hold.

contradiction-path — The remediation model proposes a scale_down against the primary while no synced replica exists. remediation.unsafely_scaled_primary derives, no_kill_unsynced_primary fires, and the unsafe plan never reaches an effect. The fixture also exercises a retracted telemetry signal, so the timeline shows both invariant containment and contradiction handling in the same window.

cascade-path — A five-service chain (cdn-edge -> frontend-web -> edge-api -> auth-service -> db-primary) exercises deep transitive closure. The model proposes isolating auth-service, the rules confirm the service is in the blast radius, and no_isolate_healthy does not fire because the isolate target is genuinely impacted.

coverage-path — A timeline that walks every accepting and rejecting branch of the rule graph, so the verification bundle's coverage report reaches 100% on the rule shape. The same coverage data is consumed by jacqos verify and exported in every verification bundle under generated/verification/.

Open the demo with jacqos studio --lineage incident-response and the bundled happy-path fixture loads. Switch fixtures from the timeline picker to walk every scenario:

  • Safe reroute -> the Done tab shows db-primary -> reroute, applied. Drill in and the inspector takes you from the executed remediation back through remediation.plan, the model’s proposal.remediation_action, the blast-radius derivation, and the original telemetry alert.
  • Unsafe scale-down blocked -> the Blocked tab shows the no_kill_unsynced_primary invariant violation. The drill inspector names the missing infra.replica_synced fact and the model’s proposal that triggered the unsafe condition. No effect ever fired.
  • Five-service cascade -> the Done tab shows the isolate applied to auth-service. Drill into the blast radius and the inspector walks the recursive infra.transitively_depends chain back to db-primary.
  • Stakeholder notified -> a notification effect shows up in Done for every fixture; this is the second agent participating through the shared model with no orchestration glue.

This is the multi-agent coordination pattern in its strongest form:

  • two independent agents (notify and remediate) coordinate without a shared workflow
  • recursive Datalog computes blast radius from real topology, not hand-authored playbooks
  • catastrophic invariants are a structural backstop that survives any rule-edit regression
  • the model’s plan is visible, queryable, and forced to clear the proposal boundary before it can become an effect

That is how you stop a confidently wrong remediation plan from terminating a primary database during the worst hour of your year.

The incident-response pattern fits every domain where multiple agents read shared state and contribute to a single outcome under safety constraints:

  • Production database operations — propose backup, restore, or failover plans; gate by replica sync state and snapshot freshness
  • Kubernetes orchestration — propose pod terminations or node drains; gate by quorum, leader election, and PDB compliance
  • Financial trading kill-switches — propose order cancellations or position closes; gate by exposure limits and counterparty status
  • Industrial control loops — multiple sensor agents feed a shared model; actuator agents read the model and respect named safety invariants

To start building, scaffold a starter app:

jacqos scaffold --pattern multi-agent my-incident-app