# Fabric Execution Plan 2026-05-19

Status: active execution plan.

This document merges:

- the service-over-fabric model;
- the area and peer stability model;
- the live audit findings from 2026-05-18 through 2026-05-19;
- the node survival and recovery policy;
- the current rollout and runtime rewrite findings.

The goal is to move the live fabric from a partially migrated QUIC-first fleet
to a fully converged distributed runtime where:

1. inter-node transport is QUIC over UDP only;
2. services use fabric channels and do not implement their own transport;
3. nodes do not depend on one compat control/download edge;
4. node directory and service discovery are distributed through signed records,
   peer cache, and live peer exchange;
5. the fleet remains recoverable after losing part of the fabric.

## 1. Current live state

### 1.1 What is already true

- Inter-node runtime transport is QUIC over UDP.
- All active nodes are converging on the latest control-endpoint rewrite line.
- `home-*`, `test-*`, and `usa-los-1` already run
  `rap-node-agent 0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is no longer lost and sends fresh heartbeats.
- Internal artifact plans now support mirror URLs instead of a single artifact
  URL.
- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
  `home-1` were repaired so downloads and control traffic can flow again.

### 1.2 What is still not finished

- `ifcm-rufms-s-mo1cr` still reports the old
  `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
  while the rest of the reachable fleet is already on
  `0.2.325-updatehintwake`.
- The current blocker is now known precisely: the node's heartbeat is fresh,
  but its updater subscription plane is dead, and its node-agent does not yet
  support waking the local updater from heartbeat update hints.
- Signed registry runtime is still not fully `active` across the fleet.
- Cross-area direct peer diversity is still below the target for multiple
  nodes.
- TCP is still visible in allowed edge roles:
  - external ingress;
  - Control API;
  - release downloads;
  - temporary compatibility recovery overlap.

## 2. Target system model

### 2.1 Transport

- Inter-node runtime transport: QUIC over UDP only.
- No TCP/WebSocket fallback as the normal fabric carrier.

### 2.2 Service layer

- Services consume a fabric channel contract.
- Services do not know internal path selection, relay choice, NAT traversal, or
  route replacement details.
- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
  fabric channel.

### 2.3 Discovery and directory

- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
- PostgreSQL remains the durable source of truth for policy, rollout, release,
  desired state, and audit.
- Runtime node discovery must use:
  - signed registry records;
  - peer cache;
  - endpoint candidates;
  - bounded live peer exchange.

### 2.4 Small fleet rule

For the current fleet size, every node should keep the full directory of all
known nodes in scoped local state, plus runtime observations and endpoint
candidate health.

## 3. Execution priorities

### P0. Finish runtime control-path convergence

Goal:

- remove the last live compat control dependency without manual host access;
- ensure a live node can wake its local updater plane when the Control API
  sends an explicit update hint, even if the previous updater loop died.

Required work:

1. Release the noop runtime rewrite restart fix.
2. Roll it out to the fleet.
3. Verify that updated nodes restart into canonical control endpoints.
4. Add a local updater wake path driven by heartbeat update hints so
   `update-trigger.json` is not the only signal.
5. Confirm that `compat_control_dependency_nodes` falls to zero.
6. Confirm that `updater_subscription_alert_nodes` falls to zero.
7. Confirm that `updater_wake_unsupported_nodes` falls to zero.

Done when:

- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract;
- no live node shows `updater_subscription_gap` while heartbeat remains fresh;
- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
  heartbeat is still fresh and update status is stale.
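
The three conditions above can be checked mechanically against heartbeat data. The following is a minimal sketch, not the actual audit code: the `Heartbeat` shape, the `fresh` and `update_status_stale` flags, and the assumption that the compat contract is recognisable by port `19191` in the control URL are all illustrative.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# 0.2.325-updatehintwake is the first release with heartbeat-driven wake.
MIN_WAKE_VERSION = (0, 2, 325)


@dataclass
class Heartbeat:
    # Hypothetical heartbeat shape; field names are assumptions.
    node: str
    fabric_control_endpoint: str
    agent_version: str
    fresh: bool
    updater_subscription_gap: bool
    update_status_stale: bool


def version_tuple(version: str) -> tuple:
    # "0.2.325-updatehintwake" -> (0, 2, 325); the suffix is treated as a label.
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def uses_compat_control(endpoint: str) -> bool:
    # Assumption: the compat contract is identified by port 19191.
    return urlsplit(endpoint).port == 19191


def p0_blockers(heartbeats):
    """Return {node: [reasons]} for every live node that still blocks P0."""
    blockers = {}
    for hb in heartbeats:
        if not hb.fresh:
            continue  # the criteria only speak about live heartbeats
        reasons = []
        if uses_compat_control(hb.fabric_control_endpoint):
            reasons.append("compat_control_dependency")
        if hb.updater_subscription_gap:
            reasons.append("updater_subscription_gap")
        if version_tuple(hb.agent_version) < MIN_WAKE_VERSION and hb.update_status_stale:
            reasons.append("updater_wake_unsupported")
        if reasons:
            blockers[hb.node] = reasons
    return blockers
```

P0 is done when `p0_blockers` returns an empty mapping for the live fleet.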

### P1. Finish distributed registry activation

Goal:

- nodes must resolve active service records without relying on one compat URL.

Required work:

1. Promote signed registry runtime from `candidate_only` / `missing` to
   `active`.
2. Ensure nodes resolve at least:
   - `control-api`
   - `update-store`
   - `update-cache`
3. Add live observability for:
   - active records
   - candidate records
   - resolved core services
   - last live probe

Done when:

- `fabric_registry_runtime_report.status = active` for the production fleet.
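
As an illustration of the observability items above, the sketch below derives a per-node report from locally held registry records. The record shape (`service`, `state`) and the rule that anything short of full core-service resolution is reported as `candidate_only` are assumptions made for the example, not the defined semantics of `fabric_registry_runtime_report`.

```python
CORE_SERVICES = ("control-api", "update-store", "update-cache")


def registry_runtime_report(records, last_live_probe=None):
    """Summarise registry records held by one node.

    records: list of dicts with hypothetical keys 'service' and
    'state' ('active' or 'candidate').
    """
    active = [r for r in records if r["state"] == "active"]
    candidates = [r for r in records if r["state"] == "candidate"]
    active_services = {r["service"] for r in active}
    resolved = [s for s in CORE_SERVICES if s in active_services]

    if not records:
        status = "missing"
    elif len(resolved) == len(CORE_SERVICES):
        status = "active"
    else:
        # Assumption: partial resolution is still reported as candidate_only.
        status = "candidate_only"

    return {
        "status": status,
        "active_records": len(active),
        "candidate_records": len(candidates),
        "resolved_core_services": resolved,
        "last_live_probe": last_live_probe,
    }


def fleet_is_active(reports):
    # The P1 criterion holds only if every production node reports active.
    return bool(reports) and all(r["status"] == "active" for r in reports)
```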

### P2. Turn node directory into a real distributed runtime input

Goal:

- nodes should learn and keep node/service information from the fabric, not by
  repeatedly consulting a center.

Required work:

1. Preserve full scoped node directory for the current fleet.
2. Carry signed node/service records through peer exchange.
3. Keep endpoint candidates and runtime observations in local peer cache.
4. Propagate node/service reachability updates through peers as a bounded
   wave, not as independent central fetches by every node.

Rules:

- nodes may distribute signed directory/service data;
- nodes must not self-author authoritative control-plane state;
- the runtime may consume replicated signed copies of truth;
- PostgreSQL remains the durable origin of truth.

Done when:

- nodes can refresh peer/service discovery from peers plus signed records even
  if one control edge disappears.
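
The rules above imply that a node merging records from peer exchange should accept only records that verify and that are newer than its cached copy. The sketch below shows that merge step under loud assumptions: the record layout (`record_id`, `generation`) is hypothetical, and HMAC-SHA256 is used only as a stdlib stand-in for signature verification. A real deployment would need an asymmetric scheme so that relaying nodes can verify records but cannot author them.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload):
    # Stable serialisation so signer and verifier hash the same bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()


def sign(payload, key):
    # Stand-in for a real signature; HMAC is symmetric and NOT suitable
    # for the "nodes must not self-author" rule in production.
    return hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()


def signed_record(payload, key):
    return {"payload": payload, "signature": sign(payload, key)}


def merge_peer_records(cache, offered, key):
    """Merge records offered by a peer into the local cache.

    Returns the ids of records that were accepted.
    """
    accepted = []
    for rec in offered:
        payload = rec["payload"]
        if not hmac.compare_digest(sign(payload, key), rec.get("signature", "")):
            continue  # unsigned or forged: never enters the cache
        current = cache.get(payload["record_id"])
        if current is not None and current["payload"]["generation"] >= payload["generation"]:
            continue  # we already hold this generation or a newer one
        cache[payload["record_id"]] = rec
        accepted.append(payload["record_id"])
    return accepted
```

Because a record is accepted only once per generation, re-offering it to a node that already holds it is a no-op, which keeps the propagation wave bounded.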

### P3. Replace the naive “3 peers” rule with stability by area and ingress

Goal:

- measure and enforce resilience by failure-domain diversity, not only count.

Required metrics:

- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
- `recovery_path_count`

Required topology labels:

- `site_id` - physical or logical site
- `locality_group` - private/local reachability domain
- `nat_group` - shared public edge dependency

Required behaviors:

1. Prefer peers from different `area` values.
2. Prefer peers behind different public ingress / NAT dependencies.
3. Keep direct-ready and relay-ready separate.
4. Keep at least one recovery path outside the local area.
5. Treat a public endpoint behind the same NAT area as
   `external-network-required` unless cross-area observers have validated it.
6. Do not demote a public endpoint only because the same area cannot hairpin
   through its own public router address.
7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
   candidate is confirmed to be in the same local segment or NAT group.
8. Penalize or reject private/local-looking endpoints when they belong to a
   different segment/NAT scope than the local node, instead of probing them as
   if they were reachable.
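
Behaviors 5 through 8 amount to a small decision table over endpoint scope. The sketch below is one possible reading of it, not the node-agent's actual logic: the function name and the labels other than `external-network-required` are hypothetical, and "private/local-looking" is approximated with the standard library's `is_private` check.

```python
import ipaddress


def classify_candidate(host, candidate_nat_group, local_nat_group,
                       cross_area_validated):
    """Classify one endpoint candidate from the local node's point of view."""
    ip = ipaddress.ip_address(host)
    same_scope = candidate_nat_group == local_nat_group

    if ip.is_private:
        # Behavior 7: a private endpoint in our own NAT group is preferred.
        # Behavior 8: a private endpoint from another scope is not probed.
        return "prefer-local" if same_scope else "reject-out-of-scope"

    if same_scope and not cross_area_validated:
        # Behavior 5: we cannot judge our own public edge from the inside.
        return "external-network-required"

    # Behavior 6: once cross-area observers have validated it, a local
    # hairpin failure alone does not demote the endpoint.
    return "public-usable"
```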

Done when:

- critical nodes satisfy cross-area direct resilience targets, not merely raw
  peer-count targets.
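
To make the difference between raw peer count and failure-domain diversity concrete, the sketch below computes the five required metrics from a peer list. The exact definitions are assumptions chosen for the example (for instance, `external_area_ready_count` is taken to be the number of distinct non-local areas with at least one direct-ready peer), and the `Peer` fields and area names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node: str
    area: str
    nat_group: str
    direct_ready: bool
    relay_ready: bool


def resilience_metrics(local_area, local_nat_group, peers):
    direct = [p for p in peers if p.direct_ready]
    relay = [p for p in peers if p.relay_ready]
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    return {
        # Direct and relay readiness are counted separately (behavior 3).
        "direct_ready_count": len(direct),
        "relay_ready_count": len(relay),
        # Distinct foreign areas reachable directly.
        "external_area_ready_count": len(
            {p.area for p in direct if p.area != local_area}),
        # Distinct public-edge dependencies other than our own.
        "independent_ingress_ready_count": len(
            {p.nat_group for p in ready if p.nat_group != local_nat_group}),
        # Any ready path that leaves the local area (behavior 4).
        "recovery_path_count": sum(1 for p in ready if p.area != local_area),
    }
```

A node with three direct-ready peers that all share its own area and NAT group would score 3 on `direct_ready_count` but 0 on the three diversity metrics, which is exactly the case a count-only rule fails to detect.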

### P4. Normalize edge roles and remove accidental TCP confusion

Goal:

- if TCP is present, it must be obviously classified and justified.

Allowed TCP roles:

- external service ingress;
- Control API ingress;
- artifact delivery edge;
- temporary compatibility recovery overlap.

Required work:

1. Keep explicit inventory of edge listeners.
2. Distinguish transport TCP from service-edge TCP in audits and UI.
3. Advance the fabric-only recovery gate only after:
   - compat control dependency is zero;
   - registry is active;
   - recovery path no longer depends on `19191`.
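
The inventory and the gate above lend themselves to a simple automated check. The sketch below is illustrative only: the listener dict shape, the role strings, and both function names are assumptions, with the role set mirroring the four allowed TCP roles listed in this section.

```python
# Role identifiers are hypothetical spellings of the four allowed TCP roles.
ALLOWED_TCP_ROLES = {
    "external-service-ingress",
    "control-api-ingress",
    "artifact-delivery-edge",
    "compat-recovery-overlap",
}


def audit_listeners(listeners):
    """Return (port, role) for every TCP listener without an allowed role."""
    findings = []
    for listener in listeners:
        if listener["proto"] != "tcp":
            continue  # QUIC/UDP fabric listeners are out of scope here
        role = listener.get("role")
        if role not in ALLOWED_TCP_ROLES:
            findings.append((listener["port"], role or "unclassified"))
    return findings


def fabric_only_gate_open(compat_control_dependency_nodes, registry_status,
                          recovery_uses_19191):
    # All three conditions from the required work must hold together.
    return (
        compat_control_dependency_nodes == 0
        and registry_status == "active"
        and not recovery_uses_19191
    )
```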

### P5. Build the update orchestrator and distributed update intent plane

Goal:

- nodes must not depend on one updater endpoint, one old updater process, or one
  central polling path;
- update rollout must be controlled so the whole farm cannot update at once;
- update intent must be distributable through management and neighboring nodes
  as signed metadata.

Required model:

1. The durable update object is a signed `update_intent`, not a hard-coded
   updater URL.
2. Nodes may receive update intent from:
   - Control API;
   - update-store / update-cache;
   - subscription hints over an outbound control channel;
   - signed peer gossip from neighboring nodes;
   - local cached last-known-good update state.
3. Nodes validate intent locally before execution.
4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
   authority or expand scope.
5. Slow polling remains as the final safety net.
6. Subscription/hints are the fast path.
7. Gossip is the partition/recovery path.
8. Orchestrator-issued rollout leases are the concurrency guard.

Orchestrator requirements:

- canary, rolling, pinned, and forced-node strategies;
- max parallel globally;
- max parallel per area / site / NAT group;
- max unavailable nodes;
- pause/resume/abort;
- failure-rate stop;
- automatic stop on heartbeat loss or rollback;
- role-aware scheduling for control-api, update-store, update-cache, relay,
  ingress, and egress nodes;
- separate host-agent and node-agent phases;
- emergency recovery bridge for compat nodes that predate the orchestrator.

Node-side requirements:

- accept `check now` subscription signals;
- periodically poll as fallback;
- accept newer signed update intents from peers;
- keep a local update journal:
  - pending intent generation;
  - lease id;
  - last accepted plan;
  - staged artifact hash;
  - previous binary / image;
  - rollback state;
  - admission failure reason;
- reconcile stale updater runtime against current node/container/task state
  before fetching plans;
- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
  and `aborted` states explicitly.
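
A minimal sketch of node-side intent admission against the local update journal follows. It assumes a hypothetical intent shape (`generation`, `scope` as a list of node ids) and takes the signature check result as an input rather than implementing it; the journal keys and the reason strings are likewise assumptions.

```python
def admit_intent(journal, intent, node_id, signature_ok):
    """Decide whether a received update intent replaces the pending one.

    journal: dict acting as the local update journal (hypothetical keys).
    Returns True if the intent was admitted.
    """
    if not signature_ok:
        reason = "bad_signature"
    elif node_id not in intent["scope"]:
        # A relaying neighbor cannot expand scope to include this node.
        reason = "out_of_scope"
    elif intent["generation"] <= journal.get("pending_intent_generation", 0):
        reason = "stale_generation"
    else:
        journal["pending_intent_generation"] = intent["generation"]
        journal["admission_failure_reason"] = None
        return True

    # Rejections are journaled so the node can report why it is blocked.
    journal["admission_failure_reason"] = reason
    return False
```

The same function applies regardless of whether the intent arrived from the Control API, a subscription hint, or peer gossip, since admission depends only on the signed content.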

Done when:

- a node can learn a new update intent without directly reaching the original
  control edge;
- a stale updater command line can be repaired from local running runtime state;
- simultaneous farm-wide update start is impossible without explicit
  recovery-admin override;
- rollout can be paused and resumed without losing node intent state;
- at least one test proves a node behind NAT receives an update signal through
  a neighbor and still waits for an orchestrator lease before applying.
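
The last criterion separates learning about an update from being allowed to apply it. The sketch below models that split: a lease book enforcing global and per-area parallelism, and a node-state helper that stays `blocked` until a lease is held. Class and function names, and the decision that pausing blocks only new grants, are assumptions for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RolloutLeases:
    max_parallel_global: int
    max_parallel_per_area: int
    paused: bool = False
    active: dict = field(default_factory=dict)  # node -> area

    def try_acquire(self, node, area):
        if node in self.active:
            return True  # lease already held; acquiring again is idempotent
        if self.paused:
            return False  # assumption: pause only blocks new grants
        if len(self.active) >= self.max_parallel_global:
            return False
        if Counter(self.active.values())[area] >= self.max_parallel_per_area:
            return False
        self.active[node] = area
        return True

    def release(self, node):
        self.active.pop(node, None)


def node_update_state(intent_admitted, has_lease):
    # A node that learned the intent via gossip still waits for a lease.
    if not intent_admitted:
        return "idle"  # hypothetical label; not one of the reported states
    return "leased" if has_lease else "blocked"
```

With `max_parallel_global` below the fleet size, a simultaneous farm-wide start cannot be granted by this guard, whatever path delivered the intent.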

## 4. Immediate next implementation sequence

### Step A

Release and roll out the noop-rewrite restart fix so that updated runtimes do
not remain on stale control sessions after a config rewrite.

### Step B

Release and roll out the relay certificate intent fix so stale-relay
replacement and bootstrap relay paths do not probe a relay endpoint with a
certificate fingerprint copied from a different private direct candidate.

This is tracked by:

- `rap-node-agent 0.2.332-relaycertintentfix`

Done when:

- `peer certificate fingerprint mismatch` no longer appears on healthy
  relay/bootstrap paths between live areas;
- `ifcm` no longer loses ready peers because relay endpoint selection and peer
  certificate pinning disagree.

### Step C

Re-check live heartbeat and stale-risk indicators:

- `compat_control_dependency_nodes`
- `registry_candidate_only_nodes`
- `updater_subscription_alert_nodes`
- `updater_wake_unsupported_nodes`
- `bridge_hold_required`
- current control URL in heartbeat

### Step D

Continue registry activation work until active records are used in practice.

### Step E

Continue peer diversity work using:

- `area`
- direct-ready area coverage
- independent ingress diversity

### Step F

Run another live audit and decide whether `19191/tcp` recovery overlap can be
removed.

## 5. Hard acceptance criteria

The fabric is considered converged only when all of the following are true:

1. Inter-node runtime transport is QUIC/UDP only.
2. No live node depends on the compat `19191` control contract.
3. Signed registry runtime is active.
4. Nodes carry and use distributed node/service knowledge through signed
   records and peer cache.
5. Cross-area direct resilience targets are satisfied for critical nodes.
6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
   transport.

## 6. This plan starts now

The immediate engineering tasks after writing this document are:

- complete the rollout of the runtime rewrite restart fix;
- remove the last live compat control dependency;
- then move directly into signed registry activation and cross-area peer
  resilience work.

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
  heartbeat update hints: when a fresh hint generation arrives, the live
  node-agent persists `update-trigger.json` and wakes the local updater
  task/service.
- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
  failure where heartbeat remains fresh but the updater subscription plane is
  dead.
- As of the current rollout, this release is already on `home-*`, `test-*`,
  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
  `updater_wake_unsupported` blocker.
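
The hint-driven wake path described in this update can be sketched as follows. This is an illustration under assumptions, not the `rap-node-agent` implementation: the function name, the JSON content of `update-trigger.json`, and the `wake_updater` callback (standing in for waking the local updater task/service) are all hypothetical.

```python
import json
import os
import tempfile


def handle_update_hint(state_dir, hint_generation, last_generation, wake_updater):
    """Persist a trigger and wake the updater when a fresh hint arrives.

    Returns the generation the caller should remember as last handled.
    """
    if hint_generation <= last_generation:
        return last_generation  # not a fresh generation: do nothing

    trigger_path = os.path.join(state_dir, "update-trigger.json")
    # Write to a temp file and rename so the updater never reads a
    # half-written trigger.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"hint_generation": hint_generation}, handle)
    os.replace(tmp_path, trigger_path)

    # Persist first, wake second: a restarted updater still sees the trigger.
    wake_updater()
    return hint_generation
```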