Working version, but speed is 10 Mbit
@@ -0,0 +1,386 @@
# Fabric Execution Plan 2026-05-19

Status: active execution plan.

This document merges:

- the service-over-fabric model;
- the area and peer stability model;
- the live audit findings from 2026-05-18 through 2026-05-19;
- the node survival and recovery policy;
- the current rollout and runtime rewrite findings.

The goal is to move the live fabric from a partially migrated QUIC-first fleet
to a fully converged distributed runtime where:

1. inter-node transport is QUIC over UDP only;
2. services use fabric channels and do not implement their own transport;
3. nodes do not depend on one compat control/download edge;
4. node directory and service discovery are distributed through signed records,
   peer cache, and live peer exchange;
5. the fleet remains recoverable after losing part of the fabric.

## 1. Current live state

### 1.1 What is already true

- Inter-node runtime transport is QUIC over UDP.
- All active nodes are converging on the latest control-endpoint rewrite line.
- `home-*`, `test-*`, and `usa-los-1` already run
  `rap-node-agent 0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeat.
- Internal artifact plans now support mirror URLs instead of a single artifact
  URL.
- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
  `home-1` were repaired so downloads and control traffic can flow again.

### 1.2 What is still not finished

- `ifcm-rufms-s-mo1cr` still reports the old
  `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
  while the rest of the reachable fleet is already on
  `0.2.325-updatehintwake`.
- The current blocker is now known precisely: the node has a fresh heartbeat
  but a dead updater subscription plane, and its node-agent does not yet
  support local updater wake from heartbeat update hints.
- Signed registry runtime is still not fully `active` across the fleet.
- Cross-area direct peer diversity is still below the target for multiple
  nodes.
- TCP is still visible in allowed edge roles:
  - external ingress;
  - Control API;
  - release downloads;
  - temporary compatibility recovery overlap.

## 2. Target system model

### 2.1 Transport

- Inter-node runtime transport: QUIC over UDP only.
- No TCP/WebSocket fallback as the normal fabric carrier.

### 2.2 Service layer

- Services consume a fabric channel contract.
- Services do not know internal path selection, relay choice, NAT traversal, or
  route replacement details.
- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
  fabric channel.

### 2.3 Discovery and directory

- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
- PostgreSQL remains the durable source of truth for policy, rollout, release,
  desired state, and audit.
- Runtime node discovery must use:
  - signed registry records;
  - peer cache;
  - endpoint candidates;
  - bounded live peer exchange.

### 2.4 Small fleet rule

For the current fleet size, every node should keep the full directory of all
known nodes in scoped local state, plus runtime observations and endpoint
candidate health.

## 3. Execution priorities

### P0. Finish runtime control-path convergence

Goal:

- remove the last live compat control dependency without manual host access;
- ensure a live node can wake its local updater plane when the Control API
  sends an explicit update hint, even if the previous updater loop died.

Required work:

1. Release the noop runtime rewrite restart fix.
2. Roll it out to the fleet.
3. Verify that updated nodes restart into canonical control endpoints.
4. Add a local updater wake path driven by heartbeat update hints so
   `update-trigger.json` is not the only signal.
5. Confirm that `compat_control_dependency_nodes` falls to zero.
6. Confirm that `updater_subscription_alert_nodes` falls to zero.
7. Confirm that `updater_wake_unsupported_nodes` falls to zero.

Done when:

- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat
  contract;
- no live node shows `updater_subscription_gap` while heartbeat remains fresh;
- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
  heartbeat is still fresh and update status is stale.
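
The three "Done when" conditions above can be read as a filter over live heartbeat records. The following is a minimal sketch under stated assumptions: the `Heartbeat` field names are hypothetical and not the real heartbeat schema, the compat contract is detected only by the `:19191/` port in the control URL, and versions are assumed to look like `MAJOR.MINOR.PATCH-label`.

```python
from dataclasses import dataclass

# First release with update-hint wake, per this plan.
WAKE_MIN_VERSION = (0, 2, 325)


@dataclass
class Heartbeat:
    # Hypothetical field names; the real heartbeat schema may differ.
    node: str
    control_url: str
    agent_version: str
    heartbeat_fresh: bool
    updater_subscription_gap: bool
    update_status_stale: bool


def parse_version(text):
    """Turn '0.2.325-updatehintwake' into (0, 2, 325); the label is ignored."""
    numeric = text.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def p0_blockers(heartbeats):
    """Return {node: [reasons]} for every live node that still blocks P0."""
    blockers = {}
    for hb in heartbeats:
        if not hb.heartbeat_fresh:
            continue  # P0 conditions only concern nodes with a fresh heartbeat
        reasons = []
        if ":19191/" in hb.control_url:
            reasons.append("compat_control_dependency")
        if hb.updater_subscription_gap:
            reasons.append("updater_subscription_gap")
        too_old = parse_version(hb.agent_version) < WAKE_MIN_VERSION
        if hb.update_status_stale and too_old:
            reasons.append("updater_wake_unsupported")
        if reasons:
            blockers[hb.node] = reasons
    return blockers
```

P0 would count as done, in this reading, when `p0_blockers` returns an empty dict for the whole live fleet.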

### P1. Finish distributed registry activation

Goal:

- nodes must resolve active service records without relying on one compat URL.

Required work:

1. Promote signed registry runtime from `candidate_only` / `missing` to
   `active`.
2. Ensure nodes resolve at least:
   - `control-api`
   - `update-store`
   - `update-cache`
3. Add live observability for:
   - active records
   - candidate records
   - resolved core services
   - last live probe

Done when:

- `fabric_registry_runtime_report.status = active` for the production fleet.

### P2. Turn node directory into a real distributed runtime input

Goal:

- nodes should learn and keep node/service information from the fabric, not by
  repeatedly consulting a center.

Required work:

1. Preserve the full scoped node directory for the current fleet.
2. Carry signed node/service records through peer exchange.
3. Keep endpoint candidates and runtime observations in local peer cache.
4. Spread node/service reachability updates as a bounded wave, not as
   independent central fetches by every node.

Rules:

- nodes may distribute signed directory/service data;
- nodes must not self-author authoritative control-plane state;
- the runtime may consume replicated signed copies of truth;
- PostgreSQL remains the durable origin of truth.

Done when:

- nodes can refresh peer/service discovery from peers plus signed records even
  if one control edge disappears.
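
The "bounded wave" in required work item 4 can be illustrated with a small in-memory simulation. This is only a sketch of one possible bounding scheme (a per-node fan-out cap plus a hop limit); the function name, the `fanout` and `max_hops` parameters, and the neighbor map are assumptions, and signature checks on the records are left out.

```python
from collections import deque


def spread_bounded(neighbors, origin, fanout=2, max_hops=3):
    """Simulate a bounded wave of one record through the fabric.

    neighbors: dict mapping node -> ordered list of peer nodes.
    Returns {node: hop at which it first learned the record}.
    """
    learned = {origin: 0}
    queue = deque([(origin, 0)])
    while queue:
        node, hop = queue.popleft()
        if hop >= max_hops:
            continue  # hop budget exhausted; the wave stops here
        sent = 0
        for peer in neighbors.get(node, []):
            if sent >= fanout:
                break  # per-node fan-out cap reached
            if peer in learned:
                # Simulation shortcut: a real peer would drop the duplicate.
                continue
            learned[peer] = hop + 1
            queue.append((peer, hop + 1))
            sent += 1
    return learned
```

With these bounds a record costs each node at most `fanout` sends, in contrast to every node fetching the same record from a central endpoint.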

### P3. Replace the naive “3 peers” rule with stability by area and ingress

Goal:

- measure and enforce resilience by failure-domain diversity, not only count.

Required metrics:

- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
- `recovery_path_count`

Required topology labels:

- `site_id` - physical or logical site
- `locality_group` - private/local reachability domain
- `nat_group` - shared public edge dependency

Required behaviors:

1. Prefer peers from different `area` values.
2. Prefer peers behind different public ingress / NAT dependencies.
3. Keep direct-ready and relay-ready separate.
4. Keep at least one recovery path outside the local area.
5. Treat a public endpoint behind the same NAT area as
   `external-network-required` unless cross-area observers have validated it.
6. Do not demote a public endpoint only because the same area cannot hairpin
   through its own public router address.
7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
   candidate is confirmed to be in the same local segment or NAT group.
8. Penalize or reject private/local-looking endpoints when they belong to a
   different segment/NAT scope than the local node, instead of probing them as
   if they were reachable.
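
Behaviors 5 through 8 amount to a small decision table over an endpoint candidate. The sketch below is one possible reading, not the node-agent's actual logic: the function name, the return labels other than `external-network-required`, and the use of `nat_group` equality as the "same scope" test are assumptions. It accepts IP literals only; a hostname raises `ValueError`.

```python
import ipaddress


def classify_candidate(host, candidate_nat_group, local_nat_group,
                       cross_area_validated=False):
    """Return 'prefer', 'usable', 'external-network-required', or 'reject'."""
    is_private = ipaddress.ip_address(host).is_private
    same_scope = candidate_nat_group == local_nat_group
    if is_private:
        # Behavior 7: a private endpoint in the same NAT group is preferred.
        # Behavior 8: a private endpoint from another scope is not probed.
        return "prefer" if same_scope else "reject"
    if same_scope and not cross_area_validated:
        # Behaviors 5 and 6: a local hairpin failure is not evidence against
        # the endpoint, so it is marked as needing outside validation rather
        # than being demoted.
        return "external-network-required"
    return "usable"
```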

Done when:

- critical nodes satisfy cross-area direct resilience targets, not merely raw
  peer-count targets.

### P4. Normalize edge roles and remove accidental TCP confusion

Goal:

- if TCP is present, it must be obviously classified and justified.

Allowed TCP roles:

- external service ingress;
- Control API ingress;
- artifact delivery edge;
- temporary compatibility recovery overlap.

Required work:

1. Keep explicit inventory of edge listeners.
2. Distinguish transport TCP from service-edge TCP in audits and UI.
3. Advance the fabric-only recovery gate only after:
   - compat control dependency is zero;
   - registry is active;
   - recovery path no longer depends on `19191`.
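
An edge-listener inventory can be audited against the allowed TCP roles with a few lines of code. This is a sketch only: the role identifiers and the listener record shape (`node`, `proto`, `port`, `role`) are made up for illustration and are not taken from an existing audit tool.

```python
# Hypothetical role identifiers mirroring the "Allowed TCP roles" list.
ALLOWED_TCP_ROLES = {
    "external-service-ingress",
    "control-api-ingress",
    "artifact-delivery-edge",
    "compat-recovery-overlap",
}


def audit_listeners(listeners):
    """Split a listener inventory into classified and unjustified TCP.

    listeners: iterable of dicts with 'node', 'proto', 'port', optional 'role'.
    """
    classified, unjustified = [], []
    for item in listeners:
        if item["proto"] != "tcp":
            continue  # QUIC/UDP listeners are outside this audit
        if item.get("role") in ALLOWED_TCP_ROLES:
            classified.append(item)
        else:
            unjustified.append(item)
    return classified, unjustified
```

Any entry in `unjustified` is either hidden inter-node TCP or an undocumented edge role, and both cases need follow-up before the fabric-only recovery gate advances.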

### P5. Build the update orchestrator and distributed update intent plane

Goal:

- nodes must not depend on one updater endpoint, one old updater process, or one
  central polling path;
- update rollout must be controlled so the whole farm cannot update at once;
- update intent must be distributable through management and neighboring nodes
  as signed metadata.

Required model:

1. The durable update object is a signed `update_intent`, not a hard-coded
   updater URL.
2. Nodes may receive update intent from:
   - Control API;
   - update-store / update-cache;
   - subscription hints over an outbound control channel;
   - signed peer gossip from neighboring nodes;
   - local cached last-known-good update state.
3. Nodes validate intent locally before execution.
4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
   authority or expand scope.
5. Slow polling remains as the final safety net.
6. Subscription/hints are the fast path.
7. Gossip is the partition/recovery path.
8. Orchestrator-issued rollout leases are the concurrency guard.
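
Items 3 and 4 of the model can be sketched as a local validation step that treats every source (Control API, cache, gossip) the same way. The intent layout (`payload` with `scope` and `generation`, plus `signature`) and the `verify_signature` callable are assumptions for illustration; the plan does not specify the signature scheme, so it is injected rather than implemented here.

```python
def validate_intent(intent, verify_signature, node_id, last_generation):
    """Return (accepted, reason) for a received update intent.

    verify_signature: caller-supplied callable(payload, signature) -> bool.
    """
    payload = intent["payload"]
    if not verify_signature(payload, intent["signature"]):
        return False, "bad_signature"
    # Scope lives inside the signed payload, so a relaying neighbor cannot
    # widen it without invalidating the signature.
    if node_id not in payload["scope"]:
        return False, "out_of_scope"
    if payload["generation"] <= last_generation:
        return False, "stale_generation"
    return True, "accepted"
```

An accepted intent only tells the node what to apply; under item 8 the node would still wait for an orchestrator lease before applying it.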

Orchestrator requirements:

- canary, rolling, pinned, and forced-node strategies;
- max parallel globally;
- max parallel per area / site / NAT group;
- max unavailable nodes;
- pause/resume/abort;
- failure-rate stop;
- automatic stop on heartbeat loss or rollback;
- role-aware scheduling for control-api, update-store, update-cache, relay,
  ingress, and egress nodes;
- separate host-agent and node-agent phases;
- emergency recovery bridge for compat nodes that predate the orchestrator.
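
The parallelism limits and pause control can be pictured as a lease book that admits a node only when every cap still has room. The class below is an in-memory sketch with assumed names (`LeaseBook`, `try_acquire`, `release`); it covers only the global cap, a per-area cap, and pause, not the other orchestrator requirements.

```python
from collections import Counter


class LeaseBook:
    """In-memory rollout lease book with global and per-area caps."""

    def __init__(self, max_parallel, max_parallel_per_area):
        self.max_parallel = max_parallel
        self.max_parallel_per_area = max_parallel_per_area
        self.active = {}  # node -> area of every node currently holding a lease
        self.paused = False

    def try_acquire(self, node, area):
        """Grant a lease if the rollout is running and no cap is exceeded."""
        if self.paused or node in self.active:
            return False
        if len(self.active) >= self.max_parallel:
            return False
        per_area = Counter(self.active.values())
        if per_area[area] >= self.max_parallel_per_area:
            return False
        self.active[node] = area
        return True

    def release(self, node):
        """Return a lease after the node reports a terminal update state."""
        self.active.pop(node, None)
```

Because a node without a lease never applies an update, setting `max_parallel` below the fleet size is what makes a simultaneous farm-wide start impossible in this model.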

Node-side requirements:

- accept `check now` subscription signals;
- periodically poll as fallback;
- accept newer signed update intents from peers;
- keep a local update journal:
  - pending intent generation;
  - lease id;
  - last accepted plan;
  - staged artifact hash;
  - previous binary / image;
  - rollback state;
  - admission failure reason;
- reconcile stale updater runtime against current node/container/task state
  before fetching plans;
- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
  and `aborted` states explicitly.
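
The journal fields and the reported states map naturally onto a small record with an explicit state machine. In the sketch below the state names and journal fields come from the list above, while the transition table is an assumption: the plan names the states but does not say which moves between them are legal.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UpdateState(str, Enum):
    BLOCKED = "blocked"
    LEASED = "leased"
    STAGED = "staged"
    APPLYING = "applying"
    HEALTHY = "healthy"
    ROLLED_BACK = "rolled_back"
    ABORTED = "aborted"


S = UpdateState
# Assumed transitions; adjust to the real rollout policy.
ALLOWED = {
    S.BLOCKED: {S.LEASED, S.ABORTED},
    S.LEASED: {S.STAGED, S.BLOCKED, S.ABORTED},
    S.STAGED: {S.APPLYING, S.ABORTED},
    S.APPLYING: {S.HEALTHY, S.ROLLED_BACK},
    S.HEALTHY: set(),
    S.ROLLED_BACK: set(),
    S.ABORTED: set(),
}


@dataclass
class UpdateJournal:
    pending_intent_generation: int
    lease_id: Optional[str] = None
    last_accepted_plan: Optional[str] = None
    staged_artifact_hash: Optional[str] = None
    previous_binary: Optional[str] = None
    rollback_state: Optional[str] = None
    admission_failure_reason: Optional[str] = None
    state: UpdateState = UpdateState.BLOCKED

    def advance(self, new_state):
        """Move to new_state, refusing transitions outside the table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
```

Persisting such a record before each transition is what would let a paused rollout resume without losing node intent state.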

Done when:

- a node can learn a new update intent without directly reaching the original
  control edge;
- a stale updater command line can be repaired from local running runtime state;
- simultaneous farm-wide update start is impossible without explicit
  recovery-admin override;
- rollout can be paused and resumed without losing node intent state;
- at least one test proves a node behind NAT receives an update signal through
  a neighbor and still waits for an orchestrator lease before applying.

## 4. Immediate next implementation sequence

### Step A

Release and roll out the noop-rewrite restart fix so that updated runtimes do
not remain on stale control sessions after a config rewrite.

### Step B

Release and roll out the relay certificate intent fix so stale-relay
replacement and bootstrap relay paths do not probe a relay endpoint with a
certificate fingerprint copied from a different private direct candidate.

This is tracked by:

- `rap-node-agent 0.2.332-relaycertintentfix`

Done when:

- `peer certificate fingerprint mismatch` no longer appears on healthy
  relay/bootstrap paths between live areas;
- `ifcm` no longer loses ready peers because relay endpoint selection and peer
  certificate pinning disagree.
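
The failure described in Step B goes away when the expected fingerprint is looked up by the exact endpoint being probed, with no fallback to a pin recorded for another candidate of the same peer. The sketch below shows that lookup shape only; the pin table keyed by `(peer, endpoint)` and the SHA-256-over-DER fingerprint format are assumptions, not the node-agent's actual pinning code.

```python
import hashlib
import hmac


def fingerprint(der_bytes):
    """Hex SHA-256 over the DER-encoded certificate (assumed pin format)."""
    return hashlib.sha256(der_bytes).hexdigest()


def check_peer_cert(pins, peer, endpoint, presented_der):
    """Compare the presented certificate against the pin for this endpoint.

    pins: dict mapping (peer, endpoint) -> expected fingerprint.
    """
    pin = pins.get((peer, endpoint))
    if pin is None:
        # Deliberately no fallback to a pin from a different candidate.
        return "no_pin_for_endpoint"
    if hmac.compare_digest(pin, fingerprint(presented_der)):
        return "ok"
    return "peer certificate fingerprint mismatch"
```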

### Step C

Re-check live heartbeat and stale-risk:

- `compat_control_dependency_nodes`
- `registry_candidate_only_nodes`
- `updater_subscription_alert_nodes`
- `updater_wake_unsupported_nodes`
- `bridge_hold_required`
- current control URL in heartbeat

### Step D

Continue registry activation work until active records are used in practice.

### Step E

Continue peer diversity work using:

- `area`
- direct-ready area coverage
- independent ingress diversity

### Step F

Run another live audit and decide whether `19191/tcp` recovery overlap can be
removed.

## 5. Hard acceptance criteria

The fabric is considered converged only when all of the following are true:

1. Inter-node runtime transport is QUIC/UDP only.
2. No live node depends on the compat `19191` control contract.
3. Signed registry runtime is active.
4. Nodes carry and use distributed node/service knowledge through signed
   records and peer cache.
5. Cross-area direct resilience targets are satisfied for critical nodes.
6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
   transport.

## 6. This plan starts now

The immediate active engineering task after writing this document is:

- complete the rollout of the runtime rewrite restart fix;
- remove the last live compat control dependency;
- then move directly into signed registry activation and cross-area peer
  resilience work.

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
  heartbeat update hints: when a fresh hint generation arrives, the live
  node-agent persists `update-trigger.json` and wakes the local updater
  task/service.
- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
  failure where heartbeat remains fresh but the updater subscription plane is
  dead.
- As of the current rollout, this release is already on `home-*`, `test-*`,
  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
  `updater_wake_unsupported` blocker.
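
The hint-wake path in the first bullet of this update can be sketched as follows. Only the `update-trigger.json` file name comes from this plan; the function name, the JSON payload shape, the `state_dir` location, and the `wake_updater` callback (standing in for whatever starts the local updater task/service) are assumptions.

```python
import json
import os
import tempfile


def handle_update_hint(hint_generation, state_dir, wake_updater, last_generation):
    """Persist update-trigger.json and wake the updater on a fresh hint.

    Returns the generation the caller should remember for the next heartbeat.
    """
    if hint_generation <= last_generation:
        return last_generation  # hint already handled; do not wake again
    target = os.path.join(state_dir, "update-trigger.json")
    # Write to a temp file in the same directory, then rename, so the updater
    # never reads a partially written trigger.
    fd, tmp_path = tempfile.mkstemp(dir=state_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"generation": hint_generation}, handle)
    os.replace(tmp_path, target)
    wake_updater()
    return hint_generation
```

Tracking the last handled generation keeps repeated heartbeats that carry the same hint from restarting the updater on every beat.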