# Fabric Execution Plan 2026-05-19

Status: active execution plan.

This document merges:

- the service-over-fabric model;
- the area and peer stability model;
- the live audit findings from 2026-05-18 through 2026-05-19;
- the node survival and recovery policy;
- the current rollout and runtime rewrite findings.

The goal is to move the live fabric from a partially migrated QUIC-first fleet
to a fully converged distributed runtime where:

1. inter-node transport is QUIC over UDP only;
2. services use fabric channels and do not implement their own transport;
3. nodes do not depend on one compat control/download edge;
4. node directory and service discovery are distributed through signed records,
   peer cache, and live peer exchange;
5. the fleet remains recoverable after losing part of the fabric.

## 1. Current live state

### 1.1 What is already true

- Inter-node runtime transport is QUIC over UDP.
- All active nodes are converging on the latest control-endpoint rewrite line.
- `home-*`, `test-*`, and `usa-los-1` already run
  `rap-node-agent 0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeats.
- Internal artifact plans now support mirror URLs instead of a single artifact
  URL.
- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
  `home-1` were repaired so downloads and control traffic can flow again.

### 1.2 What is still not finished

- `ifcm-rufms-s-mo1cr` still reports the old
  `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
  while the rest of the reachable fleet is already on
  `0.2.325-updatehintwake`.
- The current blocker is now known precisely:
  fresh heartbeat plus a dead updater subscription plane on a node-agent that
  does not yet support local updater wake from heartbeat update hints.
- Signed registry runtime is still not fully `active` across the fleet.
- Cross-area direct peer diversity is still below the target for multiple
  nodes.
- TCP is still visible in allowed edge roles:
  - external ingress;
  - Control API;
  - release downloads;
  - temporary compatibility recovery overlap.

## 2. Target system model

### 2.1 Transport

- Inter-node runtime transport: QUIC over UDP only.
- No TCP/WebSocket fallback as the normal fabric carrier.

### 2.2 Service layer

- Services consume a fabric channel contract.
- Services do not know internal path selection, relay choice, NAT traversal, or
  route replacement details.
- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
  fabric channel.

### 2.3 Discovery and directory

- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
- PostgreSQL remains the durable source of truth for policy, rollout, release,
  desired state, and audit.
- Runtime node discovery must use:
  - signed registry records;
  - peer cache;
  - endpoint candidates;
  - bounded live peer exchange.

### 2.4 Small fleet rule

For the current fleet size, every node should keep the full directory of all
known nodes in scoped local state, plus runtime observations and endpoint
candidate health.

## 3. Execution priorities

### P0. Finish runtime control-path convergence

Goal:

- remove the last live compat control dependency without manual host access.
- ensure a live node can wake its local updater plane when Control/API sends an
  explicit update hint, even if the previous updater loop died.

Required work:

1. Release the noop runtime rewrite restart fix.
2. Roll it out to the fleet.
3. Verify that updated nodes restart into canonical control endpoints.
4. Add a local updater wake path driven by heartbeat update hints so
   `update-trigger.json` is not the only signal.
5. Confirm that `compat_control_dependency_nodes` falls to zero.
6. Confirm that `updater_subscription_alert_nodes` falls to zero.
7. Confirm that `updater_wake_unsupported_nodes` falls to zero.

Done when:

- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
  heartbeat is still fresh and update status is stale.

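The three P0 counters can be derived mechanically from live heartbeats. The following is a minimal sketch, not the audit implementation: the `Heartbeat` record shape and its field names are hypothetical, and the counting rules are a simplified reading of the "Done when" conditions (for example, it ignores whether update status is stale). Only the counter names, the `19191` port, and the version strings come from this plan.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

COMPAT_PORT = 19191              # compat control contract port named in this plan
WAKE_CAPABLE_FROM = (0, 2, 325)  # first release with heartbeat-hint updater wake


@dataclass
class Heartbeat:
    # Hypothetical record shape; the real heartbeat schema is not shown here.
    node: str
    control_url: str
    agent_version: str           # e.g. "0.2.325-updatehintwake"
    heartbeat_fresh: bool
    updater_subscription_alive: bool


def version_tuple(agent_version):
    """Turn '0.2.325-updatehintwake' into (0, 2, 325)."""
    numeric = agent_version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def uses_compat_control(url):
    """True when the control URL points at the compat port."""
    return urlsplit(url).port == COMPAT_PORT


def p0_counters(heartbeats):
    """Count P0 blockers among nodes whose heartbeat is still fresh."""
    live = [h for h in heartbeats if h.heartbeat_fresh]
    return {
        "compat_control_dependency_nodes": sum(
            uses_compat_control(h.control_url) for h in live
        ),
        "updater_subscription_alert_nodes": sum(
            not h.updater_subscription_alive for h in live
        ),
        "updater_wake_unsupported_nodes": sum(
            version_tuple(h.agent_version) < WAKE_CAPABLE_FROM for h in live
        ),
    }
```

Under these assumptions, P0 is done when every value in the returned dictionary is zero.
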
### P1. Finish distributed registry activation

Goal:

- nodes must resolve active service records without relying on one compat URL.

Required work:

1. Promote signed registry runtime from `candidate_only` / `missing` to
   `active`.
2. Ensure nodes resolve at least:
   - `control-api`
   - `update-store`
   - `update-cache`
3. Add live observability for:
   - active records
   - candidate records
   - resolved core services
   - last live probe

Done when:

- `fabric_registry_runtime_report.status = active` for the production fleet.

### P2. Turn node directory into a real distributed runtime input

Goal:

- nodes should learn and keep node/service information from the fabric, not by
  repeatedly consulting a center.

Required work:

1. Preserve the full scoped node directory for the current fleet.
2. Carry signed node/service records through peer exchange.
3. Keep endpoint candidates and runtime observations in local peer cache.
4. Spread updates to node/service reachability like a bounded wave, not as
   independent central fetches by every node.

Rules:

- nodes may distribute signed directory/service data;
- nodes must not self-author authoritative control-plane state;
- the runtime may consume replicated signed copies of truth;
- PostgreSQL remains the durable origin of truth.

Done when:

- nodes can refresh peer/service discovery from peers plus signed records even
  if one control edge disappears.

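The "bounded wave" in required work item 4 can be illustrated with a small simulation. This is a sketch under stated assumptions, not the fabric's peer exchange protocol: the `max_hops` and `fanout` parameters, the adjacency map, and the node names are all hypothetical, and signature verification of the carried record is left out.

```python
from collections import deque


def spread_wave(neighbors, origin, max_hops, fanout):
    """Simulate a hop-limited, fanout-limited spread of one record.

    neighbors: dict mapping node -> ordered list of peers (hypothetical shape).
    Returns {node: hop at which the node first learned the record}.
    """
    learned = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        hop = learned[node]
        if hop >= max_hops:
            continue  # the wave is bounded: nodes at the edge do not forward
        # Each node forwards to at most `fanout` peers.
        for peer in neighbors.get(node, [])[:fanout]:
            if peer in learned:
                continue  # receiver already holds the record and drops the copy
            learned[peer] = hop + 1
            queue.append(peer)
    return learned
```

With `fanout` and `max_hops` fixed, the number of messages per update is bounded by the topology and these limits, not by how many nodes would otherwise each fetch from a central edge.
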
### P3. Replace the naive “3 peers” rule with stability by area and ingress

Goal:

- measure and enforce resilience by failure-domain diversity, not only count.

Required metrics:

- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
- `recovery_path_count`

Required topology labels:

- `site_id` - physical or logical site
- `locality_group` - private/local reachability domain
- `nat_group` - shared public edge dependency

Required behaviors:

1. Prefer peers from different `area` values.
2. Prefer peers behind different public ingress / NAT dependencies.
3. Keep direct-ready and relay-ready separate.
4. Keep at least one recovery path outside the local area.
5. Treat a public endpoint behind the same NAT area as
   `external-network-required` unless cross-area observers have validated it.
6. Do not demote a public endpoint only because the same area cannot hairpin
   through its own public router address.
7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
   candidate is confirmed to be in the same local segment or NAT group.
8. Penalize or reject private/local-looking endpoints when they belong to a
   different segment/NAT scope than the local node, instead of probing them as
   if they were reachable.

Done when:

- critical nodes satisfy cross-area direct resilience targets, not merely raw
  peer-count targets.

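One possible way to compute the required metrics from a node's peer view is sketched below. The exact metric definitions are not given in this plan, so every rule in the code is an assumption: relay-ready peers are counted only when they are not also direct-ready, ingress independence is approximated by distinct `nat_group` values, and a recovery path is any ready peer outside the local area. The `Peer` record and the area and NAT group values in the usage are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    # Hypothetical peer view; field names follow the labels used in this plan.
    node: str
    area: str
    nat_group: str
    direct_ready: bool
    relay_ready: bool


def resilience_metrics(local_area, local_nat_group, peers):
    """Compute diversity-aware readiness counters for one node."""
    # Behavior 3: keep direct-ready and relay-ready separate.
    direct = [p for p in peers if p.direct_ready]
    relay = [p for p in peers if p.relay_ready and not p.direct_ready]
    ready = direct + relay

    # Assumed definition: distinct foreign areas reachable directly.
    external_areas = {p.area for p in direct if p.area != local_area}
    # Assumed definition: distinct NAT groups other than our own.
    ingress_groups = {p.nat_group for p in ready if p.nat_group != local_nat_group}
    # Assumed definition: any ready peer outside the local area.
    recovery = [p for p in ready if p.area != local_area]

    return {
        "direct_ready_count": len(direct),
        "relay_ready_count": len(relay),
        "external_area_ready_count": len(external_areas),
        "independent_ingress_ready_count": len(ingress_groups),
        "recovery_path_count": len(recovery),
    }
```

A node with three direct-ready peers that all share its own area would score `direct_ready_count = 3` but `external_area_ready_count = 0`, which is exactly the case the raw peer-count rule fails to flag.
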
### P4. Normalize edge roles and remove accidental TCP confusion

Goal:

- if TCP is present, it must be obviously classified and justified.

Allowed TCP roles:

- external service ingress;
- Control API ingress;
- artifact delivery edge;
- temporary compatibility recovery overlap.

Required work:

1. Keep an explicit inventory of edge listeners.
2. Distinguish transport TCP from service-edge TCP in audits and UI.
3. Advance the fabric-only recovery gate only after:
   - compat control dependency is zero;
   - registry is active;
   - recovery path no longer depends on `19191`.

### P5. Build the update orchestrator and distributed update intent plane

Goal:

- nodes must not depend on one updater endpoint, one old updater process, or one
  central polling path;
- update rollout must be controlled so the whole farm cannot update at once;
- update intent must be distributable through management and neighboring nodes
  as signed metadata.

Required model:

1. The durable update object is a signed `update_intent`, not a hard-coded
   updater URL.
2. Nodes may receive update intent from:
   - Control API;
   - update-store / update-cache;
   - subscription hints over an outbound control channel;
   - signed peer gossip from neighboring nodes;
   - local cached last-known-good update state.
3. Nodes validate intent locally before execution.
4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
   authority or expand scope.
5. Slow polling remains as the final safety net.
6. Subscription/hints are the fast path.
7. Gossip is the partition/recovery path.
8. Orchestrator-issued rollout leases are the concurrency guard.

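Items 3 and 4 of the model can be sketched as a local admission check. This is an illustration only: it uses an HMAC from the Python standard library as a stand-in for whatever signature scheme the fabric actually uses (a real deployment would more likely verify an asymmetric signature), and the intent fields `generation`, `scope`, and `signature` are hypothetical names.

```python
import hashlib
import hmac
import json


def canonical_bytes(intent):
    """Serialize the intent deterministically, excluding the signature field."""
    body = {k: v for k, v in intent.items() if k != "signature"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_intent(intent, key):
    """Authority side: return a copy of the intent with a signature attached."""
    signed = dict(intent)
    signed["signature"] = hmac.new(
        key, canonical_bytes(intent), hashlib.sha256
    ).hexdigest()
    return signed


def admit_intent(intent, key, node_id, last_generation):
    """Node side: validate an intent locally, whoever relayed it.

    Returns (accepted, reason). Acceptance does not start the update;
    the node still waits for an orchestrator lease.
    """
    expected = hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, intent.get("signature", "")):
        return False, "bad_signature"   # forged or modified in transit
    if node_id not in intent["scope"]:
        return False, "out_of_scope"    # intent was not issued for this node
    if intent["generation"] <= last_generation:
        return False, "stale_generation"
    return True, "accepted_waiting_for_lease"
```

Because the scope is part of the signed body, a relaying neighbor that adds a node to `scope` invalidates the signature, which is one way to enforce "cannot forge authority or expand scope".
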
Orchestrator requirements:

- canary, rolling, pinned, and forced-node strategies;
- max parallel globally;
- max parallel per area / site / NAT group;
- max unavailable nodes;
- pause/resume/abort;
- failure-rate stop;
- automatic stop on heartbeat loss or rollback;
- role-aware scheduling for control-api, update-store, update-cache, relay,
  ingress, and egress nodes;
- separate host-agent and node-agent phases;
- emergency recovery bridge for compat nodes that predate the orchestrator.

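The lease-based concurrency guard can be sketched as follows. This covers only three of the requirements above (max parallel globally, max parallel per area, and pause/resume); the class name, method names, and reason strings are hypothetical, and a real orchestrator would also need lease expiry, persistence, and the remaining stop conditions.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LeaseOrchestrator:
    """Hypothetical in-memory lease guard, not the planned orchestrator."""
    max_parallel_global: int
    max_parallel_per_area: int
    paused: bool = False
    active: dict = field(default_factory=dict)  # node -> area of active leases

    def request_lease(self, node, area):
        """Return (granted, reason) for a node asking to start its update."""
        if self.paused:
            return False, "rollout_paused"
        if node in self.active:
            return True, "already_leased"  # idempotent for retries
        if len(self.active) >= self.max_parallel_global:
            return False, "global_limit"
        if Counter(self.active.values())[area] >= self.max_parallel_per_area:
            return False, "area_limit"
        self.active[node] = area
        return True, "leased"

    def release(self, node):
        """Called when the node reports healthy, rolled_back, or aborted."""
        self.active.pop(node, None)
```

With `max_parallel_per_area=1`, two nodes in the same area can never update at the same time, so one area always keeps a node on the previous version until the first update finishes.
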
Node-side requirements:

- accept `check now` subscription signals;
- periodically poll as fallback;
- accept newer signed update intents from peers;
- keep a local update journal:
  - pending intent generation;
  - lease id;
  - last accepted plan;
  - staged artifact hash;
  - previous binary / image;
  - rollback state;
  - admission failure reason;
- reconcile stale updater runtime against current node/container/task state
  before fetching plans;
- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
  and `aborted` states explicitly.

Done when:

- a node can learn a new update intent without directly reaching the original
  control edge;
- a stale updater command line can be repaired from local running runtime state;
- simultaneous farm-wide update start is impossible without explicit
  recovery-admin override;
- rollout can be paused and resumed without losing node intent state;
- at least one test proves a node behind NAT receives an update signal through
  a neighbor and still waits for an orchestrator lease before applying.

## 4. Immediate next implementation sequence

### Step A

Release and roll out the noop-rewrite restart fix so that updated runtimes do
not remain on stale control sessions after a config rewrite.

### Step B

Release and roll out the relay certificate intent fix so stale-relay
replacement and bootstrap relay paths do not probe a relay endpoint with a
certificate fingerprint copied from a different private direct candidate.

This is tracked by:

- `rap-node-agent 0.2.332-relaycertintentfix`

Done when:

- `peer certificate fingerprint mismatch` no longer appears on healthy
  relay/bootstrap paths between live areas;
- `ifcm` no longer loses ready peers because relay endpoint selection and peer
  certificate pinning disagree.

### Step C

Re-check live heartbeat and stale-risk:

- `compat_control_dependency_nodes`
- `registry_candidate_only_nodes`
- `updater_subscription_alert_nodes`
- `updater_wake_unsupported_nodes`
- `bridge_hold_required`
- current control URL in heartbeat

### Step D

Continue registry activation work until active records are used in practice.

### Step E

Continue peer diversity work using:

- `area`
- direct-ready area coverage
- independent ingress diversity

### Step F

Run another live audit and decide whether `19191/tcp` recovery overlap can be
removed.

## 5. Hard acceptance criteria

The fabric is considered converged only when all of the following are true:

1. Inter-node runtime transport is QUIC/UDP only.
2. No live node depends on the compat `19191` control contract.
3. Signed registry runtime is active.
4. Nodes carry and use distributed node/service knowledge through signed
   records and peer cache.
5. Cross-area direct resilience targets are satisfied for critical nodes.
6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
   transport.

## 6. This plan starts now

The immediate active engineering tasks after writing this document are to:

- complete the rollout of the runtime rewrite restart fix;
- remove the last live compat control dependency;
- then move directly into signed registry activation and cross-area peer
  resilience work.

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
  heartbeat update hints: when a fresh hint generation arrives, the live
  node-agent persists `update-trigger.json` and wakes the local updater
  task/service.
- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
  failure where heartbeat remains fresh but the updater subscription plane is
  dead.
- As of the current rollout, this release is already on `home-*`, `test-*`,
  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
  `updater_wake_unsupported` blocker.

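The hint-driven wake path described in this update can be sketched as below. This is not the `rap-node-agent` code: the contents of `update-trigger.json` (a single `generation` field), the state directory layout, and the `wake_updater` callback are assumptions made for illustration. Only the file name and the "persist, then wake on a fresh generation" behavior come from this document.

```python
import json
import os
import tempfile
from pathlib import Path


def handle_update_hint(state_dir, hint_generation, wake_updater):
    """Persist update-trigger.json and wake the updater for a fresh hint.

    Returns True when the hint was fresh and the updater was woken.
    """
    state_dir = Path(state_dir)
    trigger = state_dir / "update-trigger.json"

    # Read the last persisted generation; a missing or corrupt file counts
    # as "no hint seen yet" so a damaged trigger cannot block recovery.
    last = -1
    if trigger.exists():
        try:
            last = json.loads(trigger.read_text())["generation"]
        except (ValueError, KeyError, TypeError):
            last = -1

    if hint_generation <= last:
        return False  # repeated heartbeat hint: do not wake the updater again

    # Write to a temporary file and rename, so a crash never leaves a
    # half-written trigger file behind.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"generation": hint_generation}, handle)
    os.replace(tmp_path, trigger)

    wake_updater()  # hypothetical hook: start or signal the updater task/service
    return True
```

Comparing generations keeps the path idempotent: the same hint repeated on every heartbeat wakes the updater once, while a newer generation wakes it again even if the previous updater loop died.
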