# Fabric Execution Plan 2026-05-19

Status: active execution plan.

This document merges:

- the service-over-fabric model;
- the area and peer stability model;
- the live audit findings from 2026-05-18 through 2026-05-19;
- the node survival and recovery policy;
- the current rollout and runtime rewrite findings.

The goal is to move the live fabric from a partially migrated QUIC-first fleet to a fully converged distributed runtime where:

1. inter-node transport is QUIC over UDP only;
2. services use fabric channels and do not implement their own transport;
3. nodes do not depend on one compat control/download edge;
4. node directory and service discovery are distributed through signed records, peer cache, and live peer exchange;
5. the fleet remains recoverable after losing part of the fabric.

## 1. Current live state

### 1.1 What is already true

- Inter-node runtime transport is QUIC over UDP.
- All active nodes are converging on the latest control-endpoint rewrite line.
- `home-*`, `test-*`, and `usa-los-1` already run `rap-node-agent 0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeat.
- Internal artifact plans now support mirror URLs instead of a single artifact URL.
- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on `home-1` were repaired so downloads and control traffic can flow again.

### 1.2 What is still not finished

- `ifcm-rufms-s-mo1cr` still reports the old `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite` while the rest of the reachable fleet is already on `0.2.325-updatehintwake`.
- The current blocker is now known precisely: fresh heartbeat plus a dead updater subscription plane on a node-agent that does not yet support local updater wake from heartbeat update hints.
- Signed registry runtime is still not fully `active` across the fleet.
- Cross-area direct peer diversity is still below the target for multiple nodes.
- TCP is still visible in allowed edge roles:
  - external ingress;
  - Control API;
  - release downloads;
  - temporary compatibility recovery overlap.

## 2. Target system model

### 2.1 Transport

- Inter-node runtime transport: QUIC over UDP only.
- No TCP/WebSocket fallback as the normal fabric carrier.

### 2.2 Service layer

- Services consume a fabric channel contract.
- Services do not know internal path selection, relay choice, NAT traversal, or route replacement details.
- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a fabric channel.

### 2.3 Discovery and directory

- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
- PostgreSQL remains durable source of truth for policy, rollout, release, desired state, and audit.
- Runtime node discovery must use:
  - signed registry records;
  - peer cache;
  - endpoint candidates;
  - bounded live peer exchange.

### 2.4 Small fleet rule

For the current fleet size, every node should keep the full directory of all known nodes in scoped local state, plus runtime observations and endpoint candidate health.

## 3. Execution priorities

### P0. Finish runtime control-path convergence

Goal:

- remove the last live compat control dependency without manual host access.
- ensure a live node can wake its local updater plane when Control/API sends an explicit update hint, even if the previous updater loop died.

Required work:

1. Release the noop runtime rewrite restart fix.
2. Roll it out to the fleet.
3. Verify that updated nodes restart into canonical control endpoints.
4. Add a local updater wake path driven by heartbeat update hints so `update-trigger.json` is not the only signal.
5. Confirm that `compat_control_dependency_nodes` falls to zero.
6. Confirm that `updater_subscription_alert_nodes` falls to zero.
7. Confirm that `updater_wake_unsupported_nodes` falls to zero.
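
The three confirmation steps at the end of the P0 required work amount to a gate over fleet counters. The following is a minimal sketch of such a gate, not the actual audit code. The counter names come from this plan; the dict-shaped audit summary, the function name `p0_blockers`, and the sample values are illustrative assumptions.

```python
# Counter names are taken from the plan. The audit summary shape (a plain
# dict of counter name -> node count) is an assumption for illustration.
P0_COUNTERS = (
    "compat_control_dependency_nodes",
    "updater_subscription_alert_nodes",
    "updater_wake_unsupported_nodes",
)


def p0_blockers(audit: dict) -> dict:
    """Return the P0 counters that are still non-zero.

    A missing counter is reported as -1 (unknown), so an incomplete
    audit never passes the gate by accident.
    """
    blockers = {}
    for name in P0_COUNTERS:
        value = audit.get(name, -1)
        if value != 0:
            blockers[name] = value
    return blockers


# Illustrative sample values only, not live audit data.
sample_audit = {
    "compat_control_dependency_nodes": 1,
    "updater_subscription_alert_nodes": 0,
    "updater_wake_unsupported_nodes": 1,
}
remaining = p0_blockers(sample_audit)
print("P0 converged" if not remaining else f"P0 blocked by: {remaining}")
```

Treating a missing counter as a blocker is a deliberate choice in this sketch: the gate should fail closed when the audit is incomplete.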

Done when:

- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while heartbeat is still fresh and update status is stale.

### P1. Finish distributed registry activation

Goal:

- nodes must resolve active service records without relying on one compat URL.

Required work:

1. Promote signed registry runtime from `candidate_only` / `missing` to `active`.
2. Ensure nodes resolve at least:
   - `control-api`
   - `update-store`
   - `update-cache`
3. Add live observability for:
   - active records
   - candidate records
   - resolved core services
   - last live probe

Done when:

- `fabric_registry_runtime_report.status = active` for the production fleet.

### P2. Turn node directory into a real distributed runtime input

Goal:

- nodes should learn and keep node/service information from the fabric, not by repeatedly consulting a center.

Required work:

1. Preserve full scoped node directory for the current fleet.
2. Carry signed node/service records through peer exchange.
3. Keep endpoint candidates and runtime observations in local peer cache.
4. Spread updates to node/service reachability like a bounded wave, not as independent central fetches by every node.

Rules:

- nodes may distribute signed directory/service data;
- nodes must not self-author authoritative control-plane state;
- the runtime may consume replicated signed copies of truth;
- PostgreSQL remains durable origin of truth.

Done when:

- nodes can refresh peer/service discovery from peers plus signed records even if one control edge disappears.

### P3. Replace the naive “3 peers” rule with stability by area and ingress

Goal:

- measure and enforce resilience by failure-domain diversity, not only count.
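
To illustrate why a raw peer count is not enough, here is a minimal sketch that summarizes a peer set by failure domain. The metric names and the `area` and `nat_group` labels are the ones used in this plan; the `Peer` record, the function name, and the exact counting rules (distinct areas and NAT groups other than the local node's) are assumptions, not the runtime's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    """Hypothetical peer record; field names mirror the plan's labels."""
    node_id: str
    area: str
    nat_group: str
    direct_ready: bool
    relay_ready: bool


def diversity_summary(local_area: str, local_nat_group: str, peers: list) -> dict:
    """Count ready peers and, separately, the distinct failure domains they cover."""
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    return {
        # Direct-ready and relay-ready are kept as separate counts.
        "direct_ready_count": sum(1 for p in peers if p.direct_ready),
        "relay_ready_count": sum(1 for p in peers if p.relay_ready),
        # Assumed definition: distinct areas other than the local one.
        "external_area_ready_count": len(
            {p.area for p in ready if p.area != local_area}
        ),
        # Assumed definition: distinct NAT groups other than the local one.
        "independent_ingress_ready_count": len(
            {p.nat_group for p in ready if p.nat_group != local_nat_group}
        ),
    }


# Three ready peers that all share the local area and NAT group.
same_domain = [
    Peer("node-a", "area-1", "nat-1", True, False),
    Peer("node-b", "area-1", "nat-1", True, False),
    Peer("node-c", "area-1", "nat-1", False, True),
]
summary = diversity_summary("area-1", "nat-1", same_domain)
print(summary)
```

In this example the node satisfies a "3 peers" rule, yet both diversity counts are zero, so losing the shared area or NAT edge would isolate it.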

Required metrics:

- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
- `recovery_path_count`

Required topology labels:

- `site_id`: physical or logical site
- `locality_group`: private/local reachability domain
- `nat_group`: shared public edge dependency

Required behaviors:

1. Prefer peers from different `area` values.
2. Prefer peers behind different public ingress / NAT dependencies.
3. Keep direct-ready and relay-ready separate.
4. Keep at least one recovery path outside the local area.
5. Treat a public endpoint behind the same NAT area as `external-network-required` unless cross-area observers have validated it.
6. Do not demote a public endpoint only because the same area cannot hairpin through its own public router address.
7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the candidate is confirmed to be in the same local segment or NAT group.
8. Penalize or reject private/local-looking endpoints when they belong to a different segment/NAT scope than the local node, instead of probing them as if they were reachable.

Done when:

- critical nodes satisfy cross-area direct resilience targets, not merely raw peer-count targets.

### P4. Normalize edge roles and remove accidental TCP confusion

Goal:

- if TCP is present, it must be obviously classified and justified.

Allowed TCP roles:

- external service ingress;
- Control API ingress;
- artifact delivery edge;
- temporary compatibility recovery overlap.

Required work:

1. Keep explicit inventory of edge listeners.
2. Distinguish transport TCP from service-edge TCP in audits and UI.
3. Advance the fabric-only recovery gate only after:
   - compat control dependency is zero;
   - registry is active;
   - recovery path no longer depends on `19191`.

### P5. Build the update orchestrator and distributed update intent plane

Goal:

- nodes must not depend on one updater endpoint, one old updater process, or one central polling path;
- update rollout must be controlled so the whole farm cannot update at once;
- update intent must be distributable through management and neighboring nodes as signed metadata.

Required model:

1. The durable update object is a signed `update_intent`, not a hard-coded updater URL.
2. Nodes may receive update intent from:
   - Control API;
   - update-store / update-cache;
   - subscription hints over an outbound control channel;
   - signed peer gossip from neighboring nodes;
   - local cached last-known-good update state.
3. Nodes validate intent locally before execution.
4. Neighbor nodes may relay signed intent and artifacts, but cannot forge authority or expand scope.
5. Slow polling remains as the final safety net.
6. Subscription/hints are the fast path.
7. Gossip is the partition/recovery path.
8. Orchestrator-issued rollout leases are the concurrency guard.

Orchestrator requirements:

- canary, rolling, pinned, and forced-node strategies;
- max parallel globally;
- max parallel per area / site / NAT group;
- max unavailable nodes;
- pause/resume/abort;
- failure-rate stop;
- automatic stop on heartbeat loss or rollback;
- role-aware scheduling for control-api, update-store, update-cache, relay, ingress, and egress nodes;
- separate host-agent and node-agent phases;
- emergency recovery bridge for compat nodes that predate the orchestrator.
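
The "rollout leases as concurrency guard" idea, combined with the global and per-area parallelism limits, can be sketched as a small admission check. This is an illustrative model only: the `LeaseBook` class, its field names, and the in-memory storage are assumptions, and a real orchestrator would also need lease expiry, per-site and per-NAT-group limits, and durable state.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LeaseBook:
    """Hypothetical in-memory lease table: node_id -> area of the lease holder."""
    max_parallel_global: int
    max_parallel_per_area: int
    active: dict = field(default_factory=dict)

    def try_grant(self, node_id: str, area: str) -> bool:
        """Grant a rollout lease only if both limits still have room."""
        if node_id in self.active:
            return True  # re-asking for a lease already held is idempotent
        if len(self.active) >= self.max_parallel_global:
            return False  # global parallelism limit reached
        if Counter(self.active.values())[area] >= self.max_parallel_per_area:
            return False  # this area already has its share of updating nodes
        self.active[node_id] = area
        return True

    def release(self, node_id: str) -> None:
        """Drop the lease once the node reports a terminal state."""
        self.active.pop(node_id, None)


book = LeaseBook(max_parallel_global=2, max_parallel_per_area=1)
results = [
    book.try_grant("node-a", "area-1"),  # first lease in area-1
    book.try_grant("node-b", "area-1"),  # blocked by the per-area limit
    book.try_grant("node-c", "area-2"),  # different area, still under global limit
    book.try_grant("node-d", "area-3"),  # blocked by the global limit
]
print(results)
```

A node that has received a signed intent but holds no lease simply stays in a waiting state, which is what makes a simultaneous farm-wide start impossible without an explicit override.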

Node-side requirements:

- accept `check now` subscription signals;
- periodically poll as fallback;
- accept newer signed update intents from peers;
- keep a local update journal:
  - pending intent generation;
  - lease id;
  - last accepted plan;
  - staged artifact hash;
  - previous binary / image;
  - rollback state;
  - admission failure reason;
- reconcile stale updater runtime against current node/container/task state before fetching plans;
- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`, and `aborted` states explicitly.

Done when:

- a node can learn a new update intent without directly reaching the original control edge;
- a stale updater command line can be repaired from local running runtime state;
- simultaneous farm-wide update start is impossible without explicit recovery-admin override;
- rollout can be paused and resumed without losing node intent state;
- at least one test proves a node behind NAT receives an update signal through a neighbor and still waits for an orchestrator lease before applying.

## 4. Immediate next implementation sequence

### Step A

Release and roll out the noop-rewrite restart fix so that updated runtimes do not remain on stale control sessions after a config rewrite.

### Step B

Release and roll out the relay certificate intent fix so stale-relay replacement and bootstrap relay paths do not probe a relay endpoint with a certificate fingerprint copied from a different private direct candidate.

This is tracked by:

- `rap-node-agent 0.2.332-relaycertintentfix`

Done when:

- `peer certificate fingerprint mismatch` no longer appears on healthy relay/bootstrap paths between live areas;
- `ifcm` no longer loses ready peers because relay endpoint selection and peer certificate pinning disagree.
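
The class of bug behind the relay certificate intent fix can be illustrated with a minimal sketch: the address to probe and the fingerprint to pin must come from the same candidate record. The `Candidate` and `ProbeTarget` types, the `kind` values, and the sample addresses (documentation and private ranges) are assumptions for illustration; this is not the node-agent's actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical endpoint candidate that carries its own certificate pin."""
    kind: str  # "direct" or "relay"
    address: str
    cert_fingerprint: str


@dataclass(frozen=True)
class ProbeTarget:
    address: str
    expected_fingerprint: str


def relay_probe_target(candidates: list) -> Optional[ProbeTarget]:
    """Build the probe target from a single relay candidate.

    Address and fingerprint are read from the same record, so a pin that
    belongs to a private direct candidate can never be paired with a
    relay address.
    """
    for candidate in candidates:
        if candidate.kind == "relay":
            return ProbeTarget(candidate.address, candidate.cert_fingerprint)
    return None


candidates = [
    Candidate("direct", "10.0.0.5:4433", "fp-direct"),
    Candidate("relay", "192.0.2.10:4433", "fp-relay"),
]
target = relay_probe_target(candidates)
print(target)
```

Returning `None` when no relay candidate exists keeps the caller from falling back to a direct candidate's pin by accident.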

### Step C

Re-check live heartbeat and stale-risk:

- `compat_control_dependency_nodes`
- `registry_candidate_only_nodes`
- `updater_subscription_alert_nodes`
- `updater_wake_unsupported_nodes`
- `bridge_hold_required`
- current control URL in heartbeat

### Step D

Continue registry activation work until active records are used in practice.

### Step E

Continue peer diversity work using:

- `area`
- direct-ready area coverage
- independent ingress diversity

### Step F

Run another live audit and decide whether `19191/tcp` recovery overlap can be removed.

## 5. Hard acceptance criteria

The fabric is considered converged only when all of the following are true:

1. Inter-node runtime transport is QUIC/UDP only.
2. No live node depends on the compat `19191` control contract.
3. Signed registry runtime is active.
4. Nodes carry and use distributed node/service knowledge through signed records and peer cache.
5. Cross-area direct resilience targets are satisfied for critical nodes.
6. Remaining TCP listeners are only service-edge roles, never hidden inter-node transport.

## 6. This plan starts now

The immediate active engineering task after writing this document is:

- complete the rollout of the runtime rewrite restart fix;
- remove the last live compat control dependency;
- then move directly into signed registry activation and cross-area peer resilience work.

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for heartbeat update hints: when a fresh hint generation arrives, the live node-agent persists `update-trigger.json` and wakes the local updater task/service.
- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of failure where heartbeat remains fresh but the updater subscription plane is dead.
- As of the current rollout, this release is already on `home-*`, `test-*`, and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole `updater_wake_unsupported` blocker.
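
The second-stage recovery path described in the update note can be sketched as follows. This is a minimal illustration under stated assumptions, not the node-agent implementation: the file name `update-trigger.json` comes from this plan, while the integer `generation` field, the JSON layout, the state directory, and the `wake_updater` callback are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def handle_update_hint(
    hint_generation: int,
    state_dir: Path,
    wake_updater: Callable[[], None],
) -> bool:
    """Persist the trigger and wake the updater only for a fresh generation."""
    trigger_path = state_dir / "update-trigger.json"
    last_seen = -1
    if trigger_path.exists():
        try:
            stored = json.loads(trigger_path.read_text())
            last_seen = int(stored.get("generation", -1))
        except (ValueError, TypeError, AttributeError):
            last_seen = -1  # unreadable trigger: treat the hint as fresh
    if hint_generation <= last_seen:
        return False  # already handled, do not wake the updater again
    # Write to a temporary file and rename it, so a crash never leaves a
    # half-written trigger behind.
    tmp_path = trigger_path.with_suffix(".json.tmp")
    tmp_path.write_text(json.dumps({"generation": hint_generation}))
    tmp_path.replace(trigger_path)
    wake_updater()
    return True


with tempfile.TemporaryDirectory() as d:
    wakes = []
    first = handle_update_hint(7, Path(d), lambda: wakes.append("wake"))
    repeat = handle_update_hint(7, Path(d), lambda: wakes.append("wake"))
    newer = handle_update_hint(8, Path(d), lambda: wakes.append("wake"))
    print(first, repeat, newer, len(wakes))
```

Comparing generations before waking keeps repeated heartbeats that carry the same hint from restarting the updater on every beat.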