2026-05-18 21:33:39 +03:00
parent 5096155d83
commit 469fa0e860
94 changed files with 8761 additions and 8003 deletions
@@ -0,0 +1,427 @@
# Fabric Node Survival And Recovery Policy
Status: active architecture policy.
This document defines the non-negotiable survival, compatibility, and recovery
rules for Secure Access Fabric nodes. It exists because losing a node is not an
acceptable operating model once the fabric grows beyond a small manually
maintained fleet.
Reference incident:
- `ifcm-rufms-s-mo1cr` is the canonical recovery case.
- The node is behind NAT.
- There is no direct administrative access to the Windows host.
- The node must remain recoverable through the fabric/update/recovery plane
without relying on manual host login.
The latest live recovery evidence for this case is documented in
[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md).
This policy applies to Linux, Windows, Android, containerized nodes, and future
node types.
## 1. Core Decision
The fabric must be able to lose:
- old API endpoints;
- old artifact URLs;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
- previous route-authority replicas;
- previous update-cache replicas;
- old service locations;
- operator access to the host OS;
- the current physical location of a workload;
- part of the cluster.
Even after losing any of these, the fabric must still keep the node recoverable.
Manual repair is allowed as an emergency tool. It must not be the default
survival strategy.
## 2. Non-Negotiable Invariants
### 2.1 Node Identity Must Survive
A recoverable node must preserve:
- `node_id`;
- node keypair or key reference;
- pinned cluster authority / quorum descriptor;
- last accepted signed registry records;
- last accepted bootstrap seed set;
- last known good update policy;
- last known good workload desired state;
- rollback metadata;
- recovery audit trail.
Reinstall or repair must prefer preserving local state. Identity reset is a
high-risk operator action, not the default repair path.
### 2.2 Compatibility Must Stay Until Recovery Is Complete
Any change to the fabric must keep older nodes recoverable until one of these
is true:
1. every node has confirmed the new contract; or
2. the missing nodes were manually retired, revoked, or explicitly accepted as
lost.
This applies to:
- update plan formats;
- signed registry schemas;
- artifact install types;
- authority signature envelopes;
- bootstrap config formats;
- recovery seed formats;
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.
The rule is strict: do not delete the old recovery format while nodes that may
still need it remain unrecovered.
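
The removal rule above can be expressed as a small gate. This is a minimal sketch, not the fabric's actual implementation: the `NodeState` values, `NodeRecord`, and both function names are hypothetical and exist only to illustrate the two exit conditions.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    CONFIRMED = "confirmed"          # node acknowledged the new contract
    RETIRED = "retired"              # manually retired or revoked
    ACCEPTED_LOST = "accepted_lost"  # explicitly accepted as lost
    UNRECOVERED = "unrecovered"      # may still need the old format


@dataclass(frozen=True)
class NodeRecord:
    node_id: str
    state: NodeState


def blockers_for_removal(nodes):
    """Return the node IDs that still forbid deleting the old recovery format."""
    return [n.node_id for n in nodes if n.state is NodeState.UNRECOVERED]


def may_remove_old_format(nodes):
    """True only when no node may still need the old format."""
    return not blockers_for_removal(nodes)
```

Returning the list of blockers, rather than only a boolean, lets operator tooling name the exact nodes that hold the old format in place.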
### 2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location
Node-to-node runtime transport remains QUIC over UDP only.
That does not permit:
- one bootstrap address;
- one update mirror;
- one registry carrier;
- one ingress node;
- one relay;
- one control replica.
QUIC is the transport. Survivability requires many signed ways to discover the
current valid QUIC endpoints.
### 2.4 No Single Service May Own Recovery
Recovery must not depend on one:
- backend URL;
- DNS name;
- HTTP ingress;
- update repository host;
- relay node;
- cluster admin node.
Any of those may disappear while the node is still healthy enough to recover.
## 3. Required Recovery Layers
### 3.1 Embedded Bootstrap Seed Set
Each installable node package must contain a bounded bootstrap seed set:
- multiple seed nodes;
- public and private candidates where appropriate;
- QUIC endpoint candidates only;
- signed bootstrap metadata;
- expiry / epoch rules;
- optional organization / cluster scope constraints.
The bootstrap seed set is only the first door, not cluster truth.
### 3.2 Signed Registry Gossip
After bootstrap, a node must learn current service locations through signed
fabric registry records that can be carried by any reachable peer.
Required properties:
- multiple records per service;
- quorum or otherwise policy-approved signatures;
- monotonic epoch/generation;
- expiry and freshness checks;
- live probe before promotion;
- ability to accept newer records from a reachable neighbor even when old
origins are gone.
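
The required properties above can be read as an ordered acceptance check. The sketch below is illustrative only: `RegistryRecord`, its field names, the reason strings, and the `probe` callback are assumptions, and cryptographic signature verification is assumed to have happened before this function runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    endpoint: str             # QUIC endpoint candidate, e.g. "198.51.100.7:4433"
    epoch: int                # monotonic epoch/generation
    expires_at: float         # Unix timestamp
    valid_signers: frozenset  # signer IDs whose signatures already verified


def accept_record(record, current_epoch, now, authority, quorum, probe):
    """Return (accepted, reason) for a record carried by any reachable peer.

    Only the policy checks are shown; who delivered the record is deliberately
    not part of the decision.
    """
    if len(record.valid_signers & authority) < quorum:
        return False, "insufficient_signatures"
    if record.epoch <= current_epoch:
        return False, "stale_epoch"
    if record.expires_at <= now:
        return False, "expired"
    if not probe(record.endpoint):  # live probe before promotion
        return False, "probe_failed"
    return True, "promoted"
```

Because the carrier is not an input, a record relayed by a single surviving neighbor is treated the same as one fetched from its original origin.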
### 3.3 Outbound-Only Recovery Attachment
A node behind NAT or in passive mode must be recoverable through an outbound
attachment.
Required behaviors:
- the node can maintain at least one long-lived outbound QUIC control channel;
- that channel survives IP changes by reconnecting through any remaining seed or
signed registry endpoint;
- the node may receive updated registry truth, update triggers, workload
changes, and recovery instructions over that channel;
- the fabric must not require inbound TCP/UDP reachability to repair the node.
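
The reconnect behavior can be sketched as a walk over every remaining candidate. The function name and the `dial` callback are hypothetical; `dial` stands in for whatever opens an outbound QUIC control channel and is assumed to return `None` on failure. A real agent would also add backoff between rounds.

```python
def attach_outbound(seed_endpoints, registry_endpoints, dial):
    """Try signed registry endpoints first, then embedded seeds.

    Returns (endpoint, channel) for the first candidate that answers,
    or (None, None) if every candidate is currently unreachable.
    """
    seen = set()
    for endpoint in list(registry_endpoints) + list(seed_endpoints):
        if endpoint in seen:
            continue  # do not dial the same candidate twice in one round
        seen.add(endpoint)
        channel = dial(endpoint)
        if channel is not None:
            return endpoint, channel
    return None, None
```

All connections are initiated by the node, so nothing in this loop depends on inbound reachability.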
### 3.4 Local Recovery Agent Boundary
The node must have a minimal recovery-capable local agent boundary that is
separate from ordinary service workloads.
It must be able to:
- validate signed update plans;
- download artifacts from multiple mirrors;
- stage replacement binaries;
- restart node-agent or host-agent tasks;
- roll back to previous binaries;
- swap to new signed registry/bootstrap records;
- emit recovery status when transport returns.
If node workloads fail, this local recovery boundary must still exist.
### 3.5 Multi-Source Artifact Delivery
Artifacts must be retrievable from more than one source:
- local cached file;
- cluster update-cache;
- organization-local cache if policy allows;
- public or internet-reachable mirror;
- neighbor-assisted relay transfer over the fabric.
A node must not become unrecoverable because one artifact hostname or one
download service disappeared.
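
A fallback chain over several sources might look like the following sketch. The source names and the `fetch` callables are assumptions for illustration; the expected SHA-256 digest is assumed to come from an already validated signed update plan.

```python
import hashlib


def fetch_artifact(sources, expected_sha256):
    """Try each source in order until one returns bytes with the expected hash.

    sources: ordered list of (name, fetch), where fetch() returns bytes or raises.
    Returns (source_name, data, failures); name and data are None if all fail.
    """
    failures = []
    for name, fetch in sources:
        try:
            data = fetch()
        except Exception as exc:  # a dead mirror must not abort recovery
            failures.append((name, f"fetch_error: {exc}"))
            continue
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            failures.append((name, "hash_mismatch"))
            continue
        return name, data, failures
    return None, None, failures
```

The accumulated `failures` list can feed the recovery audit trail, so operators can see which sources were dead at recovery time.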
### 3.6 Trigger And Subscription Plane
Polling alone is not enough for very large fleets.
Required model:
- nodes may still perform slow fallback polling;
- primary update notification uses subscription/signal delivery;
- update-cache or registry service can repeatedly signal pending updates until
acknowledged;
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.
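
Idempotent signal handling can be sketched on the node side by keying each signal on a generation number. `TriggerInbox` and its method are hypothetical names; the sketch assumes generations are positive integers that increase per product.

```python
class TriggerInbox:
    """Node-side handling of update signals keyed by generation."""

    def __init__(self):
        self.applied_generation = {}  # product -> highest generation handled

    def handle(self, product, generation, apply):
        """Apply a signal at most once and return the generation to acknowledge."""
        if generation > self.applied_generation.get(product, 0):
            apply(product, generation)
            self.applied_generation[product] = generation
        # Repeated or older signals fall through and are simply re-acknowledged.
        return self.applied_generation[product]
```

Under this model a sender can keep repeating a signal over any available channel until the acknowledged generation catches up, without risking a double apply.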
## 4. Update Safety Rules
### 4.1 Upgrade Contracts
Every release that changes recovery-critical contracts must explicitly declare:
- minimum supported old version;
- maximum tolerated skew;
- whether migration is rolling-safe;
- whether the node must first update host-agent or node-agent;
- rollback compatibility;
- whether old bootstrap/registry envelopes remain accepted.
### 4.2 Two-Key Rule For Breaking Changes
Do not simultaneously break:
- discovery of where to get the update; and
- ability to understand the update once found.
At least one of those must remain compatible until fleet convergence or
explicit retirement.
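
As a release-review aid, the rule reduces to a single predicate. The `BreakingChange` type and its field names are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakingChange:
    breaks_update_discovery: bool  # old-contract nodes can no longer find the update
    breaks_update_parsing: bool    # old-contract nodes can no longer understand it


def violates_two_key_rule(change, fleet_converged_or_retired):
    """A change is rejected only if it breaks both keys before convergence."""
    if fleet_converged_or_retired:
        return False
    return change.breaks_update_discovery and change.breaks_update_parsing
```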
### 4.3 Old Artifact Retention
Recovery-critical artifact versions must remain available until:
- all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.
Do not garbage-collect the last working host-agent or node-agent build for an
unrecovered population.
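
A garbage-collection guard for this rule could look like the sketch below. It assumes versions are comparable tuples such as `(1, 2, 0)` and interprets "moved past" as "every remaining node runs a strictly newer version"; both the interpretation and the function name are assumptions.

```python
def collectable_versions(stored, node_versions, written_off):
    """Return stored artifact versions that are safe to garbage-collect.

    stored:        iterable of version tuples held by the artifact store
    node_versions: node_id -> version the node is last known to run
    written_off:   node IDs revoked/retired and recorded as intentionally lost
    """
    live = [v for node_id, v in node_versions.items() if node_id not in written_off]
    if not live:
        return []  # nothing to compare against; leave the decision to an operator
    oldest_live = min(live)
    # Keep the oldest live version and everything newer; only older builds may go.
    return sorted(v for v in stored if v < oldest_live)
```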
### 4.4 Install Type Continuity
If historical nodes request different install types for the same product
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
planning must keep compatibility aliases until the fleet converges.
The fabric must not strand nodes on an install-type naming mismatch.
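
One way to keep such aliases is a lookup that normalizes the requested install type before artifact matching. The alias table below is purely illustrative: the document names the install types, but which ones map to which artifact kind is an assumption, as are the function names and the artifact dictionary shape.

```python
# Hypothetical alias table: historical install type -> canonical artifact kind.
ALIASES = {
    "windows_binary": "windows_binary",
    "windows_service": "windows_binary",
    "linux_binary": "linux_binary",
}
# "native" carries no OS on its own, so it is resolved per platform.
NATIVE_BY_PLATFORM = {"windows": "windows_binary", "linux": "linux_binary"}


def canonical_install_type(requested, platform):
    """Map a historical install type to the canonical kind, or None if unknown."""
    if requested == "native":
        return NATIVE_BY_PLATFORM.get(platform)
    return ALIASES.get(requested)


def match_artifact(artifacts, product, requested, platform):
    """artifacts: list of dicts with 'product' and 'install_type' keys."""
    wanted = canonical_install_type(requested, platform)
    if wanted is None:
        return None
    for artifact in artifacts:
        if artifact["product"] == product and artifact["install_type"] == wanted:
            return artifact
    return None
```

Keeping the aliases in data rather than in branching code makes it straightforward to audit which historical names are still honored before any of them is removed.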
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
A stale node may show both of the following at once:
- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says `no_matching_artifact` or
an equivalent legacy contract failure.
This means the node is not merely late with a heartbeat. It is running an older
recovery planner contract and may still depend on:
- historical install-type aliases;
- older artifact matching semantics;
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.
This condition must be classified as `legacy recovery contract drift` and must
block compatibility removal the same way an artifact gap does.
Operationally this also means:
- the node requires a `recovery bridge`;
- the cluster enters `bridge hold active` for compatibility-removal decisions;
- `bridge hold` remains active until the node reports a recovery-compatible
status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should
be classified as `bridge replay ready`, meaning the system can replay the
legacy-compatible update plan as soon as the node regains an outbound control
cycle;
- operator tooling should expose a canonical `bridge replay plan` per node so
recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
- dashboards and rollout guards must show this separately from ordinary
`waiting recovery heartbeat`.
Canonical example:
- `ifcm-rufms-s-mo1cr` is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
merely as a delayed heartbeat.
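
The classification described in this section can be sketched as follows. The dictionary keys, the `healthy` label, and the function names are assumptions for illustration; the other labels are taken from the text above.

```python
# Legacy statuses that indicate an older recovery planner contract.
LEGACY_FAILURES = {"no_matching_artifact"}


def classify(node):
    """Classify one node from its last known report and the current backend view."""
    if not node["heartbeat_stale"]:
        return "healthy"
    if not node["backend_has_compatible_artifact"]:
        return "artifact gap"
    if node["last_agent_status"] in LEGACY_FAILURES:
        return "legacy recovery contract drift"
    return "waiting recovery heartbeat"


def bridge_replay_ready(node):
    """Drift plus an existing target mapping means the plan can be replayed."""
    return (
        classify(node) == "legacy recovery contract drift"
        and node["has_target_mapping"]
    )


def bridge_hold_active(nodes):
    """Hold compatibility removal while any non-retired node is in drift."""
    return any(
        classify(n) == "legacy recovery contract drift"
        for n in nodes
        if not n.get("retired", False)
    )
```

Applied to the canonical example, a stale node with a matching backend artifact and a last status of `no_matching_artifact` classifies as `legacy recovery contract drift` rather than `waiting recovery heartbeat`.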
## 5. Service And Location Mobility Rules
Moving a service must not strand nodes that only know the old location.
Required pattern:
1. publish new signed registry records;
2. keep old records valid during overlap;
3. allow any reachable peer to relay the new records;
4. live-probe and promote the new endpoints;
5. only then retire the old location;
6. keep enough overlap for slow or partitioned nodes to catch up.
This applies to:
- control-api replicas;
- update-cache/update-store replicas;
- web/admin ingress replicas;
- relay/rendezvous nodes;
- service-channel endpoints.
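
From the node's side, the overlap window means it holds both old and new records and prefers the newest one that is still valid and answers a probe. This is a minimal sketch; the tuple layout and the `probe` callback are assumptions.

```python
def pick_endpoint(records, now, probe):
    """Choose an endpoint during a service move.

    records: iterable of (epoch, endpoint, expires_at) tuples for one service.
    The newest unexpired record whose endpoint answers a live probe wins, so a
    node falls back to the old location while the new one is not yet reachable.
    """
    for epoch, endpoint, expires_at in sorted(records, reverse=True):
        if expires_at > now and probe(endpoint):
            return endpoint
    return None
```

If the old record expires before the new endpoint becomes reachable for some node, this function returns `None`, which is exactly the stranding case the overlap period is meant to prevent.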
## 6. Failure Classes The Fabric Must Tolerate
The design must explicitly handle all of these:
- node behind NAT with only outbound connectivity;
- several nodes behind one NAT/local segment;
- node changes public IP;
- node changes private IP;
- old DNS/URL becomes dead;
- artifact mirror disappears;
- control ingress disappears;
- relay disappears;
- update install fails halfway;
- binary staged but restart fails;
- old task/service name changes;
- local disk is nearly full;
- time skew causes signature freshness risk;
- authority rotates;
- route authority replica disappears;
- state directory survives but binary is broken;
- binary survives but state directory is partly stale;
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.
## 7. Required Local State And Journaling
The node local state store must retain at least:
- active and previous signed registry records;
- active and previous bootstrap seeds;
- last successful update plan per product;
- last applied artifact hash/version;
- last rollback candidate;
- last successful service endpoints used for update/control;
- pending trigger generation;
- recovery attempts with timestamps and reasons;
- last known good runtime command line / task/unit identity;
- last known workload desired states.
Writes must be atomic. A power loss must not leave the node with zero valid
state.
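
A common way to meet the atomic-write requirement is to write a temporary file in the same directory, flush it to disk, and then rename it over the target. The sketch below assumes JSON-serializable state and a hypothetical function name; it does not fsync the containing directory, which a stricter implementation on POSIX systems may also want.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Write JSON state so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target for the rename.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # force file contents to disk before the rename
        os.replace(tmp_path, path)     # replaces the target in a single rename call
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)        # never leave a half-written temp file behind
        raise
```

Keeping the previous state file as a separate copy before each replace would additionally cover the "active and previous" retention items listed above.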
## 8. Observability And Fleet Safety Rules
The control plane must make invisible-recovery risk explicit.
It must surface:
- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.
Cluster-wide changes that would strand such nodes must be blocked or require an
explicit recovery-admin override.
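
A rollout guard built on these signals might be sketched as below. It covers five of the listed conditions; the node dictionary keys, the risk labels, and the function names are all hypothetical, and agent versions are assumed to be comparable tuples.

```python
def recovery_risks(node, min_agent_version, min_epoch):
    """Return the list of recovery-risk flags that apply to one node."""
    risks = []
    if node["heartbeat_stale"] and node["updater_recent"]:
        risks.append("stale_heartbeat_recent_updater")
    if not node["has_compatible_artifact"]:
        risks.append("no_compatible_artifact")
    if node["pinned_epoch"] < min_epoch:
        risks.append("epoch_too_old")
    if node["agent_version"] < min_agent_version:
        risks.append("agent_below_recovery_floor")
    if len(node["recent_replicas"]) <= 1:
        risks.append("single_replica_dependency")
    return risks


def guard_cluster_change(nodes, min_agent_version, min_epoch,
                         recovery_admin_override=False):
    """Return (allowed, at_risk) where at_risk maps node_id -> risk flags."""
    at_risk = {}
    for node in nodes:
        risks = recovery_risks(node, min_agent_version, min_epoch)
        if risks:
            at_risk[node["node_id"]] = risks
    allowed = not at_risk or recovery_admin_override
    return allowed, at_risk
```

Returning the per-node flags alongside the decision means an override is always made with the affected nodes in view.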
## 9. Release And Migration Checklist
Before deleting old code, old formats, or old endpoints, verify all of these:
1. every active node has confirmed a compatible version; or the remaining nodes
are explicitly marked for manual retirement/recovery;
2. host-agent and node-agent recovery paths both have matching artifacts;
3. bootstrap/registry overlap exists for the migration window;
4. at least two independent artifact sources remain reachable;
5. signed registry gossip can carry the new locations without the old API
hostname;
6. rollback artifacts are still available;
7. install type aliases remain for historical agents where needed;
8. NAT/passive/outbound-only nodes were explicitly tested;
9. stale-node risk report is empty or consciously accepted by recovery-admin;
10. removal of legacy support is documented with the exact cutoff conditions.
## 10. `ifcm-rufms-s-mo1cr` Rule
`ifcm-rufms-s-mo1cr` is the standing reference case for future work.
For this node class, the platform must assume:
- the host is behind NAT;
- the node may only keep outbound channels;
- no direct Windows administrative access exists;
- old discovery endpoints may disappear;
- only the fabric/update/recovery plane can save the node.
Any future transport, update, authority, bootstrap, registry, or workload
change must be reviewed against this question:
> If `ifcm-rufms-s-mo1cr` is still on the older contract and we cannot log in to
> the host, can the fabric still recover it?
If the answer is no, the change is incomplete.
## 11. Immediate Follow-Through
Work should continue on these concrete items:
- separate documented recovery-plane tests for Windows NAT nodes;
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
## 12. Decision
The fabric must treat node survival as a first-class architecture contract.
A node is not considered safe merely because the happy path works. It is safe
only when it can survive protocol migration, endpoint relocation, partial
cluster loss, artifact source loss, and lack of manual host access without
being abandoned.