working version, but speed is 10 Mbit/s

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
@@ -0,0 +1,386 @@
+# Fabric Execution Plan 2026-05-19
+
+Status: active execution plan.
+
+This document merges:
+
+- the service-over-fabric model;
+- the area and peer stability model;
+- the live audit findings from 2026-05-18 through 2026-05-19;
+- the node survival and recovery policy;
+- the current rollout and runtime rewrite findings.
+
+The goal is to move the live fabric from a partially migrated QUIC-first fleet
+to a fully converged distributed runtime where:
+
+1. inter-node transport is QUIC over UDP only;
+2. services use fabric channels and do not implement their own transport;
+3. nodes do not depend on one compat control/download edge;
+4. node directory and service discovery are distributed through signed records,
+   peer cache, and live peer exchange;
+5. the fleet remains recoverable after losing part of the fabric.
+
+## 1. Current live state
+
+### 1.1 What is already true
+
+- Inter-node runtime transport is QUIC over UDP.
+- All active nodes are converging on the latest control-endpoint rewrite line.
+- `home-*`, `test-*`, and `usa-los-1` already run
+  `rap-node-agent 0.2.325-updatehintwake`.
+- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeat.
+- Internal artifact plans now support mirror URLs instead of a single artifact
+  URL.
+- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
+  `home-1` were repaired so downloads and control traffic can flow again.
+
+### 1.2 What is still not finished
+
+- `ifcm-rufms-s-mo1cr` still reports the old
+  `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
+- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
+  while the rest of the reachable fleet is already on
+  `0.2.325-updatehintwake`.
+- The current blocker is now known precisely:
+  fresh heartbeat plus a dead updater subscription plane on a node-agent that
+  does not yet support local updater wake from heartbeat update hints.
+- Signed registry runtime is still not fully `active` across the fleet.
+- Cross-area direct peer diversity is still below the target for multiple
+  nodes.
+- TCP is still visible in allowed edge roles:
+  - external ingress;
+  - Control API;
+  - release downloads;
+  - temporary compatibility recovery overlap.
+
+## 2. Target system model
+
+### 2.1 Transport
+
+- Inter-node runtime transport: QUIC over UDP only.
+- No TCP/WebSocket fallback as the normal fabric carrier.
+
+### 2.2 Service layer
+
+- Services consume a fabric channel contract.
+- Services do not know internal path selection, relay choice, NAT traversal, or
+  route replacement details.
+- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
+  fabric channel.
+
+### 2.3 Discovery and directory
+
+- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
+- PostgreSQL remains the durable source of truth for policy, rollout, release,
+  desired state, and audit.
+- Runtime node discovery must use:
+  - signed registry records;
+  - peer cache;
+  - endpoint candidates;
+  - bounded live peer exchange.
+
+### 2.4 Small fleet rule
+
+For the current fleet size, every node should keep the full directory of all
+known nodes in scoped local state, plus runtime observations and endpoint
+candidate health.
+
+## 3. Execution priorities
+
+### P0. Finish runtime control-path convergence
+
+Goal:
+
+- remove the last live compat control dependency without manual host access;
+- ensure a live node can wake its local updater plane when Control/API sends an
+  explicit update hint, even if the previous updater loop died.
+
+Required work:
+
+1. Release the noop runtime rewrite restart fix.
+2. Roll it out to the fleet.
+3. Verify that updated nodes restart into canonical control endpoints.
+4. Add a local updater wake path driven by heartbeat update hints so
+   `update-trigger.json` is not the only signal.
+5. Confirm that `compat_control_dependency_nodes` falls to zero.
+6. Confirm that `updater_subscription_alert_nodes` falls to zero.
+7. Confirm that `updater_wake_unsupported_nodes` falls to zero.
+
+Done when:
+
+- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
+- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
+- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
+  heartbeat is still fresh and update status is stale.
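
The three "Done when" conditions for P0 can be checked mechanically from heartbeat data. The following is a minimal sketch, not the actual audit code: the record fields (`fresh`, `control_url`, `agent_version`, `update_status_stale`) are hypothetical names, and the `home-1` control URL is a placeholder. Only the `19191` URL and the two agent versions come from this plan.

```python
WAKE_CAPABLE = (0, 2, 325)  # first release line with updater wake from heartbeat hints


def version_tuple(version):
    """Parse '0.2.325-updatehintwake' into (0, 2, 325); the '-suffix' is ignored."""
    return tuple(int(part) for part in version.split("-")[0].split("."))


def p0_blockers(heartbeats):
    """Return {node: [reasons]} for live nodes that still block P0."""
    blockers = {}
    for hb in heartbeats:
        if not hb["fresh"]:
            continue  # the P0 conditions only concern nodes with fresh heartbeat
        reasons = []
        if ":19191/" in hb["control_url"]:
            reasons.append("compat_control_dependency")
        if hb["update_status_stale"]:
            reasons.append("updater_subscription_gap")
            if version_tuple(hb["agent_version"]) < WAKE_CAPABLE:
                reasons.append("updater_wake_unsupported")
        if reasons:
            blockers[hb["node"]] = reasons
    return blockers


# Illustrative records; field names and the first control URL are assumptions.
fleet = [
    {"node": "home-1", "fresh": True,
     "control_url": "https://control.example/api/v1",
     "agent_version": "0.2.325-updatehintwake", "update_status_stale": False},
    {"node": "ifcm-rufms-s-mo1cr", "fresh": True,
     "control_url": "http://vpn.cin.su:19191/api/v1",
     "agent_version": "0.2.322-controlendpointsrewrite", "update_status_stale": True},
]
result = p0_blockers(fleet)
```

P0 is done when `p0_blockers` returns an empty mapping for the whole live fleet.
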
+
+### P1. Finish distributed registry activation
+
+Goal:
+
+- nodes must resolve active service records without relying on one compat URL.
+
+Required work:
+
+1. Promote signed registry runtime from `candidate_only` / `missing` to
+   `active`.
+2. Ensure nodes resolve at least:
+   - `control-api`
+   - `update-store`
+   - `update-cache`
+3. Add live observability for:
+   - active records
+   - candidate records
+   - resolved core services
+   - last live probe
+
+Done when:
+
+- `fabric_registry_runtime_report.status = active` for the production fleet.
+
+### P2. Turn node directory into a real distributed runtime input
+
+Goal:
+
+- nodes should learn and keep node/service information from the fabric, not by
+  repeatedly consulting a center.
+
+Required work:
+
+1. Preserve full scoped node directory for the current fleet.
+2. Carry signed node/service records through peer exchange.
+3. Keep endpoint candidates and runtime observations in local peer cache.
+4. Spread updates to node/service reachability like a bounded wave, not as
+   independent central fetches by every node.
+
+Rules:
+
+- nodes may distribute signed directory/service data;
+- nodes must not self-author authoritative control-plane state;
+- the runtime may consume replicated signed copies of truth;
+- PostgreSQL remains the durable origin of truth.
+
+Done when:
+
+- nodes can refresh peer/service discovery from peers plus signed records even
+  if one control edge disappears.
+
+### P3. Replace the naive “3 peers” rule with stability by area and ingress
+
+Goal:
+
+- measure and enforce resilience by failure-domain diversity, not only count.
+
+Required metrics:
+
+- `direct_ready_count`
+- `relay_ready_count`
+- `external_area_ready_count`
+- `independent_ingress_ready_count`
+- `recovery_path_count`
+
+Required topology labels:
+
+- `site_id` - physical or logical site
+- `locality_group` - private/local reachability domain
+- `nat_group` - shared public edge dependency
+
+Required behaviors:
+
+1. Prefer peers from different `area` values.
+2. Prefer peers behind different public ingress / NAT dependencies.
+3. Keep direct-ready and relay-ready separate.
+4. Keep at least one recovery path outside the local area.
+5. Treat a public endpoint behind the same NAT area as
+   `external-network-required` unless cross-area observers have validated it.
+6. Do not demote a public endpoint only because the same area cannot hairpin
+   through its own public router address.
+7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
+   candidate is confirmed to be in the same local segment or NAT group.
+8. Penalize or reject private/local-looking endpoints when they belong to a
+   different segment/NAT scope than the local node, instead of probing them as
+   if they were reachable.
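
Behaviors 1 through 4 could be approximated by a greedy selection that rewards new failure domains before raw count. This is only a sketch under assumptions: the `Peer` fields, the area names, and the NAT group names are hypothetical, and the real selector may weigh these factors differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str           # hypothetical area label
    nat_group: str      # shared public edge dependency
    direct_ready: bool  # False means relay-ready only


def select_peers(local_area, candidates, want):
    """Greedily pick peers that add the most failure-domain diversity."""
    chosen = []
    seen_areas = set()
    seen_nats = set()
    pool = list(candidates)
    while pool and len(chosen) < want:
        def score(peer):
            return (
                peer.area not in seen_areas,      # behavior 1: a new area first
                peer.nat_group not in seen_nats,  # behavior 2: a new NAT/ingress
                peer.area != local_area,          # behavior 4: outside the local area
                peer.direct_ready,                # direct-ready before relay-only
            )
        best = max(pool, key=score)  # ties keep candidate order
        pool.remove(best)
        chosen.append(best)
        seen_areas.add(best.area)
        seen_nats.add(best.nat_group)
    return chosen


def has_external_recovery_path(local_area, chosen):
    """Behavior 4: at least one selected peer lives outside the local area."""
    return any(peer.area != local_area for peer in chosen)


candidates = [
    Peer("home-1", "home", "nat-home", True),
    Peer("home-2", "home", "nat-home", True),
    Peer("home-3", "home", "nat-home", True),
    Peer("usa-los-1", "usa", "nat-usa", True),
    Peer("test-1", "test", "nat-test", False),
]
chosen = select_peers("home", candidates, want=3)
```

A naive "first 3 peers" rule would take three `home` peers behind one NAT group. The sketch instead spreads the selection across three areas and keeps `direct_ready` as a separate flag, so relay-only peers are never counted as direct.
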
+
+Done when:
+
+- critical nodes satisfy cross-area direct resilience targets, not merely raw
+  peer-count targets.
+
+### P4. Normalize edge roles and remove accidental TCP confusion
+
+Goal:
+
+- if TCP is present, it must be obviously classified and justified.
+
+Allowed TCP roles:
+
+- external service ingress;
+- Control API ingress;
+- artifact delivery edge;
+- temporary compatibility recovery overlap.
+
+Required work:
+
+1. Keep explicit inventory of edge listeners.
+2. Distinguish transport TCP from service-edge TCP in audits and UI.
+3. Advance the fabric-only recovery gate only after:
+   - compat control dependency is zero;
+   - registry is active;
+   - recovery path no longer depends on `19191`.
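
The gate in item 3 is a conjunction of three conditions, so it could be expressed as a function that lists whatever is still blocking. A minimal sketch, assuming a hypothetical `AuditSnapshot` structure; only the metric name `compat_control_dependency_nodes` and the `active` status value appear elsewhere in this plan.

```python
from dataclasses import dataclass


@dataclass
class AuditSnapshot:
    compat_control_dependency_nodes: int
    registry_status: str             # e.g. "active", "candidate_only", "missing"
    recovery_paths_using_19191: int  # hypothetical counter


def fabric_only_gate_blockers(snapshot):
    """Return the reasons the fabric-only recovery gate cannot advance yet."""
    blockers = []
    if snapshot.compat_control_dependency_nodes != 0:
        blockers.append("compat control dependency is not zero")
    if snapshot.registry_status != "active":
        blockers.append("registry is not active")
    if snapshot.recovery_paths_using_19191 != 0:
        blockers.append("recovery path still depends on 19191")
    return blockers


today = AuditSnapshot(compat_control_dependency_nodes=1,
                      registry_status="candidate_only",
                      recovery_paths_using_19191=1)
target = AuditSnapshot(0, "active", 0)
```

The gate may advance only when the returned list is empty. Returning reasons instead of a boolean keeps the audit output explainable.
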
+
+### P5. Build the update orchestrator and distributed update intent plane
+
+Goal:
+
+- nodes must not depend on one updater endpoint, one old updater process, or one
+  central polling path;
+- update rollout must be controlled so the whole farm cannot update at once;
+- update intent must be distributable through management and neighboring nodes
+  as signed metadata.
+
+Required model:
+
+1. The durable update object is a signed `update_intent`, not a hard-coded
+   updater URL.
+2. Nodes may receive update intent from:
+   - Control API;
+   - update-store / update-cache;
+   - subscription hints over an outbound control channel;
+   - signed peer gossip from neighboring nodes;
+   - local cached last-known-good update state.
+3. Nodes validate intent locally before execution.
+4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
+   authority or expand scope.
+5. Slow polling remains as the final safety net.
+6. Subscription/hints are the fast path.
+7. Gossip is the partition/recovery path.
+8. Orchestrator-issued rollout leases are the concurrency guard.
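
Items 3, 4, and 8 together imply a node-side admission check: verify the signature, refuse any scope the signer did not authorize, ignore stale generations, and wait for a lease. The sketch below uses HMAC-SHA256 from the standard library purely as a stand-in for whatever signature scheme the fabric actually uses. The intent layout (`body`, `signature`, `scope`, `generation`) and the returned strings are assumptions.

```python
import hashlib
import hmac
import json


def canonical(body):
    """Stable byte encoding so signer and verifier hash the same bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign(body, key):
    # Stand-in only: a real deployment would more likely use asymmetric signatures.
    return hmac.new(key, canonical(body), hashlib.sha256).hexdigest()


def admit(intent, key, node_id, last_generation, lease_id=None):
    """Decide locally what to do with an update intent from any source."""
    body = intent["body"]
    if not hmac.compare_digest(sign(body, key), intent["signature"]):
        return "rejected: bad signature"
    if node_id not in body["scope"]:
        return "rejected: node not in scope"
    if body["generation"] <= last_generation:
        return "ignored: not newer than journal"
    if lease_id is None:
        return "blocked: waiting for orchestrator lease"
    return "leased"


key = b"demo-key"  # illustrative secret
body = {"generation": 7, "scope": ["home-1", "test-1"], "artifact_sha256": "abc123"}
intent = {"body": body, "signature": sign(body, key)}

# A relaying neighbor tries to widen the scope without re-signing.
forged = {"body": dict(body, scope=body["scope"] + ["usa-los-1"]),
          "signature": intent["signature"]}
```

Because the scope is part of the signed body, a relaying neighbor can forward the intent unchanged but cannot expand it: any modification invalidates the signature.
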
+
+Orchestrator requirements:
+
+- canary, rolling, pinned, and forced-node strategies;
+- max parallel globally;
+- max parallel per area / site / NAT group;
+- max unavailable nodes;
+- pause/resume/abort;
+- failure-rate stop;
+- automatic stop on heartbeat loss or rollback;
+- role-aware scheduling for control-api, update-store, update-cache, relay,
+  ingress, and egress nodes;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for compat nodes that predate the orchestrator.
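
The parallelism limits and pause control above could be enforced by a small lease book on the orchestrator side. This is a minimal sketch covering only global and per-area limits. The `LeaseBook` class and its method names are hypothetical, and site, NAT group, role-aware scheduling, and failure-rate stops are left out.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LeaseBook:
    max_global: int
    max_per_area: int
    paused: bool = False
    active: dict = field(default_factory=dict)  # node_id -> area

    def try_grant(self, node_id, area):
        """Grant a rollout lease only if no concurrency limit would be exceeded."""
        if self.paused or node_id in self.active:
            return False
        if len(self.active) >= self.max_global:
            return False
        if Counter(self.active.values())[area] >= self.max_per_area:
            return False
        self.active[node_id] = area
        return True

    def release(self, node_id):
        """Free the slot once the node reports a terminal state."""
        self.active.pop(node_id, None)


book = LeaseBook(max_global=2, max_per_area=1)
grants = [
    book.try_grant("home-1", "home"),    # first lease in area "home"
    book.try_grant("home-2", "home"),    # refused: per-area limit reached
    book.try_grant("usa-los-1", "usa"),  # second lease overall
    book.try_grant("test-1", "test"),    # refused: global limit reached
]
```

With limits like these, a farm-wide simultaneous start cannot happen unless the limits themselves are raised, which is where an explicit recovery-admin override would fit.
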
+
+Node-side requirements:
+
+- accept `check now` subscription signals;
+- periodically poll as fallback;
+- accept newer signed update intents from peers;
+- keep a local update journal:
+  - pending intent generation;
+  - lease id;
+  - last accepted plan;
+  - staged artifact hash;
+  - previous binary / image;
+  - rollback state;
+  - admission failure reason;
+- reconcile stale updater runtime against current node/container/task state
+  before fetching plans;
+- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
+  and `aborted` states explicitly.
+
+Done when:
+
+- a node can learn a new update intent without directly reaching the original
+  control edge;
+- a stale updater command line can be repaired from local running runtime state;
+- simultaneous farm-wide update start is impossible without explicit
+  recovery-admin override;
+- rollout can be paused and resumed without losing node intent state;
+- at least one test proves a node behind NAT receives an update signal through
+  a neighbor and still waits for an orchestrator lease before applying.
+
+## 4. Immediate next implementation sequence
+
+### Step A
+
+Release and roll out the noop-rewrite restart fix so that updated runtimes do
+not remain on stale control sessions after a config rewrite.
+
+### Step B
+
+Release and roll out the relay certificate intent fix so stale-relay
+replacement and bootstrap relay paths do not probe a relay endpoint with a
+certificate fingerprint copied from a different private direct candidate.
+
+This is tracked by:
+
+- `rap-node-agent 0.2.332-relaycertintentfix`
+
+Done when:
+
+- `peer certificate fingerprint mismatch` no longer appears on healthy
+  relay/bootstrap paths between live areas;
+- `ifcm` no longer loses ready peers because relay endpoint selection and peer
+  certificate pinning disagree.
+
+### Step C
+
+Re-check live heartbeat and stale-risk:
+
+- `compat_control_dependency_nodes`
+- `registry_candidate_only_nodes`
+- `updater_subscription_alert_nodes`
+- `updater_wake_unsupported_nodes`
+- `bridge_hold_required`
+- current control URL in heartbeat
+
+### Step D
+
+Continue registry activation work until active records are used in practice.
+
+### Step E
+
+Continue peer diversity work using:
+
+- `area`
+- direct-ready area coverage
+- independent ingress diversity
+
+### Step F
+
+Run another live audit and decide whether `19191/tcp` recovery overlap can be
+removed.
+
+## 5. Hard acceptance criteria
+
+The fabric is considered converged only when all of the following are true:
+
+1. Inter-node runtime transport is QUIC/UDP only.
+2. No live node depends on the compat `19191` control contract.
+3. Signed registry runtime is active.
+4. Nodes carry and use distributed node/service knowledge through signed
+   records and peer cache.
+5. Cross-area direct resilience targets are satisfied for critical nodes.
+6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
+   transport.
+
+## 6. This plan starts now
+
+The immediate active engineering task after writing this document is:
+
+- complete the rollout of the runtime rewrite restart fix;
+- remove the last live compat control dependency;
+- then move directly into signed registry activation and cross-area peer
+  resilience work.
+
+Update 2026-05-19:
+
+- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
+  heartbeat update hints: when a fresh hint generation arrives, the live
+  node-agent persists `update-trigger.json` and wakes the local updater
+  task/service.
+- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
+  failure where heartbeat remains fresh but the updater subscription plane is
+  dead.
+- As of the current rollout, this release is already on `home-*`, `test-*`,
+  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
+  `updater_wake_unsupported` blocker.
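
The hint-driven wake path described in this update could look roughly like the sketch below. It is not the `rap-node-agent` implementation: the JSON layout of `update-trigger.json` (a single `generation` field), the function name, and the `wake` callback are assumptions made for illustration.

```python
import json
from pathlib import Path


def handle_update_hint(state_dir, hint_generation, wake):
    """Persist the trigger and wake the updater only for a fresh hint generation."""
    state_dir = Path(state_dir)
    trigger = state_dir / "update-trigger.json"
    last = -1
    if trigger.exists():
        try:
            last = json.loads(trigger.read_text()).get("generation", -1)
        except (ValueError, AttributeError):
            last = -1  # unreadable trigger: treat as if no hint was recorded
    if hint_generation <= last:
        return False  # already-seen hint; do not wake the updater again
    tmp = state_dir / "update-trigger.json.tmp"
    tmp.write_text(json.dumps({"generation": hint_generation}))
    tmp.replace(trigger)  # rename so a reader never sees a half-written file
    wake()                # e.g. start the local updater task/service
    return True
```

The trigger file is written before the wake call so that an updater which starts late still finds the pending generation. Repeated heartbeats carrying the same generation do not restart the updater.
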