2026-05-18 21:33:39 +03:00
parent 5096155d83
commit 469fa0e860
94 changed files with 8761 additions and 8003 deletions
@@ -0,0 +1,427 @@
+# Fabric Node Survival And Recovery Policy
+
+Status: active architecture policy.
+
+This document defines the non-negotiable survival, compatibility, and recovery
+rules for Secure Access Fabric nodes. It exists because losing a node is not an
+acceptable operating model once the fabric grows beyond a small manually
+maintained fleet.
+
+Reference incident:
+
+- `ifcm-rufms-s-mo1cr` is the canonical recovery case.
+- The node is behind NAT.
+- There is no direct administrative access to the Windows host.
+- The node must remain recoverable through the fabric/update/recovery plane
+  without relying on manual host login.
+
+The latest live recovery evidence for this case is documented in
+[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md).
+
+This policy applies to Linux, Windows, Android, containerized nodes, and future
+node types.
+
+## 1. Core Decision
+
+The fabric must be able to lose:
+
+- old API endpoints;
+- old artifact URLs;
+- previous public IP addresses;
+- previous NAT mappings;
+- previous relay nodes;
+- previous route-authority replicas;
+- previous update-cache replicas;
+- old service locations;
+- operator access to the host OS;
+- the current physical location of a workload;
+- part of the cluster.
+
+After any of these losses, the node must remain recoverable.
+
+Manual repair is allowed as an emergency tool. It must not be the default
+survival strategy.
+
+## 2. Non-Negotiable Invariants
+
+### 2.1 Node Identity Must Survive
+
+A recoverable node must preserve:
+
+- `node_id`;
+- node keypair or key reference;
+- pinned cluster authority / quorum descriptor;
+- last accepted signed registry records;
+- last accepted bootstrap seed set;
+- last known good update policy;
+- last known good workload desired state;
+- rollback metadata;
+- recovery audit trail.
+
+Reinstall or repair must prefer preserving local state. Identity reset is a
+high-risk operator action, not the default repair path.
+
+### 2.2 Compatibility Must Stay Until Recovery Is Complete
+
+Any change to the fabric must keep older nodes recoverable until one of these
+is true:
+
+1. every node has confirmed the new contract; or
+2. the missing nodes have been manually retired, revoked, or explicitly
+   accepted as lost.
+
+This applies to:
+
+- update plan formats;
+- signed registry schemas;
+- artifact install types;
+- authority signature envelopes;
+- bootstrap config formats;
+- recovery seed formats;
+- host-agent / updater runtime contracts;
+- control endpoints needed only for migration.
+
+The rule is strict: do not delete the old recovery format while nodes that may
+still need it remain unrecovered.
+
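The removal rule in this section can be expressed as a simple gate. The following is a minimal sketch, not the project's implementation: `NodeStatus`, `NodeRecord`, `blocking_nodes`, and `may_remove_old_format` are hypothetical names, and the only status taken from this policy's reference case is that `ifcm-rufms-s-mo1cr` is still unrecovered.

```python
from dataclasses import dataclass
from enum import Enum


class NodeStatus(Enum):
    CONFIRMED = "confirmed"          # node acknowledged the new contract
    UNRECOVERED = "unrecovered"      # node may still need the old format
    RETIRED = "retired"              # manually retired or revoked
    ACCEPTED_LOST = "accepted_lost"  # explicitly accepted as lost


@dataclass(frozen=True)
class NodeRecord:
    node_id: str
    status: NodeStatus


def blocking_nodes(fleet):
    """Return node ids that still forbid removing the old recovery format."""
    return [n.node_id for n in fleet if n.status is NodeStatus.UNRECOVERED]


def may_remove_old_format(fleet):
    """True only when no node can still depend on the old format."""
    return not blocking_nodes(fleet)


fleet = [
    NodeRecord("node-a", NodeStatus.CONFIRMED),
    NodeRecord("ifcm-rufms-s-mo1cr", NodeStatus.UNRECOVERED),
]
print(blocking_nodes(fleet), may_remove_old_format(fleet))
```

Returning the list of blocking node ids, rather than only a boolean, lets tooling show exactly which nodes keep the old format alive.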
+### 2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location
+
+Node-to-node runtime transport remains QUIC over UDP only.
+
+That does not permit:
+
+- one bootstrap address;
+- one update mirror;
+- one registry carrier;
+- one ingress node;
+- one relay;
+- one control replica.
+
+QUIC is the transport. Survivability requires multiple independent, signed ways
+to discover the currently valid QUIC endpoints.
+
+### 2.4 No Single Service May Own Recovery
+
+Recovery must not depend on one:
+
+- backend URL;
+- DNS name;
+- HTTP ingress;
+- update repository host;
+- relay node;
+- cluster admin node.
+
+Any of those may disappear while the node is still healthy enough to recover.
+
+## 3. Required Recovery Layers
+
+### 3.1 Embedded Bootstrap Seed Set
+
+Each installable node package must contain a bounded bootstrap seed set:
+
+- multiple seed nodes;
+- public and private candidates where appropriate;
+- QUIC endpoint candidates only;
+- signed bootstrap metadata;
+- expiry / epoch rules;
+- optional organization / cluster scope constraints.
+
+The bootstrap seed set is only the first point of contact. It is not the
+source of cluster truth.
+
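The seed set requirements above could be checked at package build time or at node start. This sketch is illustrative only: the field names, the `MIN_SEEDS`/`MAX_SEEDS` bounds, and the example hosts are assumptions, and signature verification is treated as an opaque step performed elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SeedEndpoint:
    host: str
    port: int
    transport: str = "quic"  # policy: QUIC endpoint candidates only
    scope: str = "public"    # "public" or "private"


@dataclass(frozen=True)
class BootstrapSeedSet:
    epoch: int
    expires_at: float                  # Unix timestamp
    seeds: List[SeedEndpoint]
    signature: Optional[bytes] = None  # opaque here; verified elsewhere


MIN_SEEDS = 2   # hypothetical floor: "multiple seed nodes"
MAX_SEEDS = 16  # hypothetical ceiling: the set must stay bounded


def seed_set_problems(seed_set, now):
    """Return a list of human-readable policy violations (empty if none)."""
    problems = []
    if not MIN_SEEDS <= len(seed_set.seeds) <= MAX_SEEDS:
        problems.append("seed count out of bounds")
    if any(s.transport != "quic" for s in seed_set.seeds):
        problems.append("non-QUIC candidate")
    if seed_set.signature is None:
        problems.append("unsigned metadata")
    if now >= seed_set.expires_at:
        problems.append("expired")
    return problems


good = BootstrapSeedSet(
    epoch=7,
    expires_at=2_000_000_000.0,
    seeds=[SeedEndpoint("seed-1.example", 4433),
           SeedEndpoint("10.0.0.5", 4433, scope="private")],
    signature=b"opaque",
)
bad = BootstrapSeedSet(7, 2_000_000_000.0,
                       [SeedEndpoint("seed-1.example", 4433, transport="tcp")])
print(seed_set_problems(good, now=1_900_000_000.0))
print(seed_set_problems(bad, now=2_100_000_000.0))
```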
+### 3.2 Signed Registry Gossip
+
+After bootstrap, a node must learn current service locations through signed
+fabric registry records that can be carried by any reachable peer.
+
+Required properties:
+
+- multiple records per service;
+- quorum or otherwise policy-approved signatures;
+- monotonic epoch/generation;
+- expiry and freshness checks;
+- live probe before promotion;
+- ability to accept newer records from a reachable neighbor even when old
+  origins are gone.
+
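One possible reading of the required properties is an acceptance function that never asks where a record came from. This is a sketch under stated assumptions: `RegistryRecord`, `should_promote`, the `quorum` parameter, and the `probe` callback are hypothetical, and `valid_signatures` stands for a count of signatures already verified against the pinned authority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    endpoint: str
    epoch: int             # monotonic epoch/generation
    expires_at: float      # Unix timestamp
    valid_signatures: int  # signatures already verified by the caller


def should_promote(candidate, current, now, quorum, probe):
    """Decide whether a gossiped record may replace the current one.

    The carrier of the record is deliberately not an input: any reachable
    peer may deliver it, even when the old origins are gone.
    """
    if candidate.valid_signatures < quorum:
        return False  # not enough policy-approved signatures
    if now >= candidate.expires_at:
        return False  # stale record
    if current is not None and candidate.epoch <= current.epoch:
        return False  # epoch must move strictly forward
    return bool(probe(candidate.endpoint))  # live probe before promotion


current = RegistryRecord("update-cache", "10.0.0.8:4433", 5, 2000.0, 3)
newer = RegistryRecord("update-cache", "10.0.0.9:4433", 6, 2000.0, 3)
print(should_promote(newer, current, now=1000.0, quorum=2,
                     probe=lambda endpoint: True))
```

Running the cheap checks before the probe avoids network traffic for records that would be rejected anyway.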
+### 3.3 Outbound-Only Recovery Attachment
+
+A node behind NAT or in passive mode must be recoverable through an outbound
+attachment.
+
+Required behaviors:
+
+- the node can maintain at least one long-lived outbound QUIC control channel;
+- that channel survives IP changes by reconnecting through any remaining seed or
+  signed registry endpoint;
+- the node may receive updated registry truth, update triggers, workload
+  changes, and recovery instructions over that channel;
+- the fabric must not require inbound TCP/UDP reachability to repair the node.
+
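The reconnect behavior can be sketched as a walk over every endpoint the node still knows. The helper names and endpoint strings below are hypothetical, and `fake_dial` is a stand-in for a real outbound QUIC dial, which this sketch does not implement.

```python
def candidate_endpoints(registry_endpoints, seed_endpoints):
    """Merge signed registry endpoints and seeds, keeping first-seen order."""
    seen = set()
    ordered = []
    for endpoint in list(registry_endpoints) + list(seed_endpoints):
        if endpoint not in seen:
            seen.add(endpoint)
            ordered.append(endpoint)
    return ordered


def reconnect(candidates, dial):
    """Return (endpoint, channel) for the first successful dial, else None."""
    for endpoint in candidates:
        try:
            return endpoint, dial(endpoint)
        except OSError:
            continue  # this location is gone; try the next one
    return None


def fake_dial(endpoint):
    """Stand-in dialer: only 'relay-2:4433' is reachable in this example."""
    if endpoint != "relay-2:4433":
        raise OSError("unreachable")
    return "control-channel"


candidates = candidate_endpoints(
    ["api-old:4433", "relay-2:4433"],  # last accepted registry endpoints
    ["seed-1:4433", "relay-2:4433"],   # embedded bootstrap seeds
)
print(reconnect(candidates, fake_dial))
```

Because the node always dials outward, nothing in this loop requires inbound reachability through the NAT.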
+### 3.4 Local Recovery Agent Boundary
+
+The node must have a minimal recovery-capable local agent boundary that is
+separate from ordinary service workloads.
+
+It must be able to:
+
+- validate signed update plans;
+- download artifacts from multiple mirrors;
+- stage replacement binaries;
+- restart node-agent or host-agent tasks;
+- rollback to previous binaries;
+- swap to new signed registry/bootstrap records;
+- emit recovery status when transport returns.
+
+If node workloads fail, this local recovery boundary must still exist.
+
+### 3.5 Multi-Source Artifact Delivery
+
+Artifacts must be retrievable from more than one source:
+
+- local cached file;
+- cluster update-cache;
+- organization-local cache if policy allows;
+- public or internet-reachable mirror;
+- neighbor-assisted relay transfer over the fabric.
+
+A node must not become unrecoverable because one artifact hostname or one
+download service disappeared.
+
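Multi-source delivery is commonly paired with a content hash so that any source can be trusted equally. The following is a minimal sketch, assuming each source is a callable that returns bytes or raises `OSError`; the source names and `fetch_artifact` are hypothetical, and SHA-256 is an assumed choice of digest.

```python
import hashlib


def fetch_artifact(sources, expected_sha256):
    """Try each (name, fetch) source in order until one yields valid bytes.

    Returns (source_name, data, failures); raises RuntimeError if every
    source fails or returns content with the wrong hash.
    """
    failures = []
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:
            failures.append((name, str(exc)))
            continue
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            failures.append((name, "hash mismatch"))
            continue
        return name, data, failures
    raise RuntimeError(f"all artifact sources failed: {failures}")


def _dead_source():
    raise OSError("host not found")


payload = b"host-agent build"
digest = hashlib.sha256(payload).hexdigest()
sources = [
    ("local-cache", _dead_source),
    ("cluster-update-cache", lambda: b"corrupted"),
    ("neighbor-relay", lambda: payload),
]
used, data, failures = fetch_artifact(sources, digest)
print(used, failures)
```

Recording the failures alongside the successful source gives the recovery audit trail a reason for each fallback.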
+### 3.6 Trigger And Subscription Plane
+
+Polling alone is not enough for very large fleets.
+
+Required model:
+
+- nodes may still perform slow fallback polling;
+- primary update notification uses subscription/signal delivery;
+- update-cache or registry service can repeatedly signal pending updates until
+  acknowledged;
+- signals are idempotent;
+- signals do not require the old control endpoint to remain alive.
+
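Idempotent signals can be modelled with a generation counter: repeated or out-of-date signals are acknowledged but applied at most once. This is an illustrative sketch; `TriggerState` and `handle_signal` are hypothetical names, and a real node would persist the applied generation in its local state store.

```python
class TriggerState:
    """Tracks the highest update-trigger generation already applied."""

    def __init__(self):
        self.applied_generation = 0

    def handle_signal(self, generation, apply_update):
        """Apply the update only for a newer generation.

        Always returns the generation to acknowledge, so the sender can
        stop repeating the signal regardless of which endpoint carried it.
        """
        if generation > self.applied_generation:
            apply_update(generation)
            self.applied_generation = generation
        return self.applied_generation


applied = []
state = TriggerState()
# The same signal may arrive repeatedly, or late, from different replicas.
acks = [state.handle_signal(g, applied.append) for g in (3, 3, 2, 4)]
print(applied, acks)
```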
+## 4. Update Safety Rules
+
+### 4.1 Upgrade Contracts
+
+Every release that changes recovery-critical contracts must explicitly declare:
+
+- minimum supported old version;
+- maximum tolerated skew;
+- whether migration is rolling-safe;
+- whether the node must first update host-agent or node-agent;
+- rollback compatibility;
+- whether old bootstrap/registry envelopes remain accepted.
+
+### 4.2 Two-Key Rule For Breaking Changes
+
+Do not simultaneously break:
+
+- discovery of where to get the update; and
+- ability to understand the update once found.
+
+At least one of those must remain compatible until fleet convergence or
+explicit retirement.
+
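The declarations from 4.1 and the two-key rule from 4.2 could be captured in one release descriptor. The sketch below is an assumption about shape only: the field names, the skew unit, and the example version strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UpgradeContract:
    # Fields mirror the declarations listed in section 4.1.
    min_supported_old_version: str
    max_tolerated_skew: int               # assumed unit: releases behind
    rolling_safe: bool
    agent_to_update_first: Optional[str]  # "host-agent", "node-agent", or None
    rollback_compatible: bool
    accepts_old_envelopes: bool
    # Section 4.2 inputs: what this release breaks for older nodes.
    breaks_update_discovery: bool
    breaks_update_format: bool


def violates_two_key_rule(contract, fleet_converged):
    """True when both keys break before convergence or explicit retirement."""
    if fleet_converged:
        return False
    return contract.breaks_update_discovery and contract.breaks_update_format


safe_release = UpgradeContract("1.4.0", 2, True, "host-agent", True, True,
                               breaks_update_discovery=True,
                               breaks_update_format=False)
unsafe_release = UpgradeContract("1.4.0", 2, True, None, False, False,
                                 breaks_update_discovery=True,
                                 breaks_update_format=True)
print(violates_two_key_rule(safe_release, fleet_converged=False))
print(violates_two_key_rule(unsafe_release, fleet_converged=False))
```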
+### 4.3 Old Artifact Retention
+
+Recovery-critical artifact versions must remain available until:
+
+- all nodes have moved past them; or
+- the remaining nodes are revoked/retired and recorded as intentionally lost.
+
+Do not garbage-collect the last working host-agent or node-agent build for an
+unrecovered population.
+
+### 4.4 Install Type Continuity
+
+If historical nodes request different install types for the same product
+(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
+planning must keep compatibility aliases until the fleet converges.
+
+The fabric must not strand nodes on an install-type naming mismatch.
+
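Alias handling might look like the lookup below. The grouping of install types is purely hypothetical: this policy names the types but does not say which ones are equivalent, so `ALIAS_GROUPS` and the artifact ids are placeholders for whatever the real registry defines.

```python
# Hypothetical equivalence groups; the real mapping is defined by the fabric.
ALIAS_GROUPS = [
    {"windows_binary", "windows_service"},
    {"linux_binary", "native"},
]


def equivalent_install_types(requested):
    """Return every install type treated as equivalent to the requested one."""
    for group in ALIAS_GROUPS:
        if requested in group:
            return set(group)
    return {requested}


def find_artifact(requested, available):
    """Look up an artifact by install type, falling back to aliases.

    `available` maps install type to artifact id. An exact match wins;
    otherwise aliases are tried in sorted order for determinism.
    """
    if requested in available:
        return available[requested]
    for install_type in sorted(equivalent_install_types(requested)):
        if install_type in available:
            return available[install_type]
    return None


available = {"windows_service": "host-agent-2.0", "linux_binary": "node-agent-2.0"}
print(find_artifact("windows_binary", available))
```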
+### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
+
+A stale node may show both of these conditions at once:
+
+- a compatible recovery artifact exists under the current registry; but
+- the last local updater/host-agent status still says `no_matching_artifact` or
+  an equivalent legacy contract failure.
+
+This means the node is not merely waiting for a heartbeat. It is running an
+older recovery planner contract and may still depend on:
+
+- historical install-type aliases;
+- older artifact matching semantics;
+- older update-plan interpretation rules;
+- overlap in signed registry / bootstrap envelopes.
+
+This condition must be classified as `legacy recovery contract drift` and must
+block compatibility removal the same way an artifact gap does.
+
+Operationally this also means:
+
+- the node requires a `recovery bridge`;
+- the cluster enters `bridge hold active` for compatibility-removal decisions;
+- `bridge hold` remains active until the node reports a recovery-compatible
+  status on the current contract or the operator explicitly retires the node;
+- when a compatible artifact and target mapping already exist, the node should
+  be classified as `bridge replay ready`, meaning the system can replay the
+  legacy-compatible update plan as soon as the node regains an outbound control
+  cycle;
+- operator tooling should expose a canonical `bridge replay plan` per node so
+  recovery replay uses the same signed update-plan logic as normal updates;
+- compatibility aliases / overlap must remain enabled for that node population;
+- dashboards and rollout guards must show this separately from ordinary
+  `waiting recovery heartbeat`.
+
+Canonical example:
+
+- `ifcm-rufms-s-mo1cr` is stale;
+- the current backend can match a Windows-compatible host-agent artifact;
+- the last host-agent report still says `no_matching_artifact`;
+- therefore the node must be treated as a legacy recovery-contract blocker, not
+  merely as a delayed heartbeat.
+
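The classification described in this section can be sketched as a small decision function. The status strings `legacy recovery contract drift`, `waiting recovery heartbeat`, and `no_matching_artifact` come from this policy; the `healthy` label, the function names, and the boolean inputs are assumptions made for illustration.

```python
from enum import Enum


class RecoveryClass(Enum):
    HEALTHY = "healthy"  # assumed label, not taken from the policy
    WAITING_RECOVERY_HEARTBEAT = "waiting recovery heartbeat"
    ARTIFACT_GAP = "artifact gap"
    LEGACY_CONTRACT_DRIFT = "legacy recovery contract drift"


# Extend with any equivalent legacy contract failures.
LEGACY_FAILURES = {"no_matching_artifact"}


def classify(stale, backend_has_artifact, last_agent_status):
    """Classify a node from heartbeat state and its last agent report."""
    if not stale:
        return RecoveryClass.HEALTHY
    if not backend_has_artifact:
        return RecoveryClass.ARTIFACT_GAP
    if last_agent_status in LEGACY_FAILURES:
        return RecoveryClass.LEGACY_CONTRACT_DRIFT
    return RecoveryClass.WAITING_RECOVERY_HEARTBEAT


def blocks_compatibility_removal(recovery_class):
    """Drift blocks removal the same way an artifact gap does."""
    return recovery_class in (RecoveryClass.ARTIFACT_GAP,
                              RecoveryClass.LEGACY_CONTRACT_DRIFT)


def bridge_replay_ready(recovery_class, has_target_mapping):
    """Replay is possible once a compatible artifact and mapping exist."""
    return (recovery_class is RecoveryClass.LEGACY_CONTRACT_DRIFT
            and has_target_mapping)


# The canonical example: stale, artifact matches, agent still reports failure.
reference_case = classify(stale=True, backend_has_artifact=True,
                          last_agent_status="no_matching_artifact")
print(reference_case.value, blocks_compatibility_removal(reference_case))
```

Keeping drift as its own class is what lets dashboards and rollout guards show it separately from an ordinary delayed heartbeat.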
+## 5. Service And Location Mobility Rules
+
+Moving a service must not strand nodes that only know the old location.
+
+Required pattern:
+
+1. publish new signed registry records;
+2. keep old records valid during overlap;
+3. allow any reachable peer to relay the new records;
+4. live-probe and promote the new endpoints;
+5. only then retire the old location;
+6. keep enough overlap for slow or partitioned nodes to catch up.
+
+This applies to:
+
+- control-api replicas;
+- update-cache/update-store replicas;
+- web/admin ingress replicas;
+- relay/rendezvous nodes;
+- service-channel endpoints.
+
+## 6. Failure Classes The Fabric Must Tolerate
+
+The design must explicitly handle all of these:
+
+- node behind NAT with only outbound connectivity;
+- several nodes behind one NAT/local segment;
+- node changes public IP;
+- node changes private IP;
+- old DNS/URL becomes dead;
+- artifact mirror disappears;
+- control ingress disappears;
+- relay disappears;
+- update install fails halfway;
+- binary staged but restart fails;
+- old task/service name changes;
+- local disk is nearly full;
+- time skew causes signature freshness risk;
+- authority rotates;
+- route authority replica disappears;
+- state directory survives but binary is broken;
+- binary survives but state directory is partly stale;
+- node reboots during update;
+- only one peer still knows the new registry truth;
+- node is partitioned for a long time and rejoins later;
+- platform removes legacy support too early;
+- operator has no shell/RDP/WinRM/SSH access to the host.
+
+## 7. Required Local State And Journaling
+
+The node local state store must retain at least:
+
+- active and previous signed registry records;
+- active and previous bootstrap seeds;
+- last successful update plan per product;
+- last applied artifact hash/version;
+- last rollback candidate;
+- last successful service endpoints used for update/control;
+- pending trigger generation;
+- recovery attempts with timestamps and reasons;
+- last known good runtime command line / task/unit identity;
+- last known workload desired states.
+
+Writes must be atomic. A power loss must not leave the node with zero valid
+state.
+
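A common way to meet the atomic-write requirement is to write a temporary file in the same directory, flush it to disk, and rename it over the old file. This is a minimal sketch, assuming JSON state and a hypothetical `node_state.json` file name; it does not fsync the containing directory, which a hardened implementation on POSIX systems may also want.

```python
import json
import os
import tempfile


def atomic_write_json(path, state):
    """Replace `path` with `state` so readers see the old or new file only."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-",
                                    suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to stable storage
        os.replace(tmp_path, path)  # rename over the previous state file
    except BaseException:
        # Clean up the temporary file; the previous state stays untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as workdir:
    state_path = os.path.join(workdir, "node_state.json")
    atomic_write_json(state_path, {"registry_epoch": 41})
    atomic_write_json(state_path, {"registry_epoch": 42})
    with open(state_path, encoding="utf-8") as handle:
        reloaded = json.load(handle)
    leftovers = [n for n in os.listdir(workdir) if n.endswith(".tmp")]
print(reloaded, leftovers)
```

If power is lost before the rename, the previous state file is still intact, so the node never starts with zero valid state.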
+## 8. Observability And Fleet Safety Rules
+
+The control plane must make otherwise invisible recovery risk explicit.
+
+It must surface:
+
+- nodes with stale heartbeat but recent updater activity;
+- nodes with no working compatible recovery artifact;
+- nodes whose pinned registry/bootstrap epoch is too old;
+- nodes whose only known artifact URL is dead;
+- nodes whose desired state requires a contract they cannot parse;
+- nodes whose local agent version is below the minimum recovery floor;
+- nodes whose last successful contact depended on a single service replica.
+
+Cluster-wide changes that would strand such nodes must be blocked or require an
+explicit recovery-admin override.
+
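A rollout guard over these signals might be sketched as follows. Everything here is illustrative: `NodeHealth`, the flag strings, the thresholds, and the example nodes with their values are hypothetical, and only a subset of the listed conditions is modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodeHealth:
    node_id: str
    has_compatible_artifact: bool
    registry_epoch: int
    agent_version: Tuple[int, int, int]
    live_artifact_urls: int  # known artifact URLs that still respond


def risk_flags(node, min_epoch, recovery_floor):
    """List the recovery risks that apply to one node."""
    flags = []
    if not node.has_compatible_artifact:
        flags.append("no compatible recovery artifact")
    if node.registry_epoch < min_epoch:
        flags.append("pinned epoch too old")
    if node.agent_version < recovery_floor:
        flags.append("agent below recovery floor")
    if node.live_artifact_urls == 0:
        flags.append("all known artifact URLs dead")
    return flags


def change_allowed(nodes, min_epoch, recovery_floor, admin_override=False):
    """Return (allowed, at_risk); at-risk nodes block unless overridden."""
    at_risk = {}
    for node in nodes:
        flags = risk_flags(node, min_epoch, recovery_floor)
        if flags:
            at_risk[node.node_id] = flags
    return (not at_risk) or admin_override, at_risk


nodes = [
    NodeHealth("node-a", True, 12, (2, 3, 0), 2),
    NodeHealth("node-b", True, 7, (1, 9, 4), 0),
]
allowed, at_risk = change_allowed(nodes, min_epoch=10, recovery_floor=(2, 0, 0))
print(allowed, at_risk)
```

Returning the per-node flags together with the decision means an override is always made with the affected nodes in view.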
+## 9. Release And Migration Checklist
+
+Before deleting old code, old formats, or old endpoints, verify all of these:
+
+1. every active node has confirmed a compatible version; or the remaining nodes
+   are explicitly marked for manual retirement/recovery;
+2. host-agent and node-agent recovery paths both have matching artifacts;
+3. bootstrap/registry overlap exists for the migration window;
+4. at least two independent artifact sources remain reachable;
+5. signed registry gossip can carry the new locations without the old API
+   hostname;
+6. rollback artifacts are still available;
+7. install type aliases remain for historical agents where needed;
+8. NAT/passive/outbound-only nodes have been explicitly tested;
+9. the stale-node risk report is empty or has been consciously accepted by a
+   recovery-admin;
+10. removal of legacy support is documented with the exact cutoff conditions.
+
+## 10. `ifcm-rufms-s-mo1cr` Rule
+
+`ifcm-rufms-s-mo1cr` is the standing reference case for future work.
+
+For this node class, the platform must assume:
+
+- the host is behind NAT;
+- the node may only keep outbound channels;
+- no direct Windows administrative access exists;
+- old discovery endpoints may disappear;
+- only the fabric/update/recovery plane can save the node.
+
+Any future transport, update, authority, bootstrap, registry, or workload
+change must be reviewed against this question:
+
+> If `ifcm-rufms-s-mo1cr` is still on the older contract and we cannot log in to
+> the host, can the fabric still recover it?
+
+If the answer is no, the change is incomplete.
+
+## 11. Immediate Follow-Through
+
+Implementation work should continue on these concrete items:
+
+- separate documented recovery-plane tests for Windows NAT nodes;
+- signed registry retention and overlap checks before endpoint migration;
+- compatibility alias coverage for historical install types;
+- artifact availability health checks across all mirrors;
+- stale-node risk dashboard/report before legacy removal;
+- node-local journaling for last good registry/update state;
+- neighbor-assisted artifact relay path;
+- explicit recovery simulation for outbound-only nodes with dead old endpoints.
+
+## 12. Decision
+
+The fabric must treat node survival as a first-class architecture contract.
+
+A node is not considered safe merely because the happy path works. It is safe
+only when it can survive protocol migration, endpoint relocation, partial
+cluster loss, artifact source loss, and lack of manual host access without
+being abandoned.