# Fabric Node Survival And Recovery Policy

Status: active architecture policy.

This document defines the non-negotiable survival, compatibility, and recovery
rules for Secure Access Fabric nodes. It exists because losing a node is not an
acceptable operating model once the fabric grows beyond a small manually
maintained fleet.

Reference incident:

- `ifcm-rufms-s-mo1cr` is the canonical recovery case.
- The node is behind NAT.
- There is no direct administrative access to the Windows host.
- The node must remain recoverable through the fabric/update/recovery plane
  without relying on manual host login.

The latest live recovery evidence for this case is documented in
[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md).

This policy applies to Linux, Windows, Android, containerized nodes, and future
node types.

## 1. Core Decision

The fabric must be able to lose any of the following and still keep the node
recoverable:

- old API endpoints;
- old artifact URLs;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
- previous route-authority replicas;
- previous update-cache replicas;
- old service locations;
- operator access to the host OS;
- the current physical location of a workload;
- part of the cluster.

Manual repair is allowed as an emergency tool. It must not be the default
survival strategy.

## 2. Non-Negotiable Invariants

### 2.1 Node Identity Must Survive

A recoverable node must preserve:

- `node_id`;
- node keypair or key reference;
- pinned cluster authority / quorum descriptor;
- last accepted signed registry records;
- last accepted bootstrap seed set;
- last known good update policy;
- last known good workload desired state;
- rollback metadata;
- recovery audit trail.

Reinstall or repair must prefer preserving local state. Identity reset is a
high-risk operator action, not the default repair path.

### 2.2 Compatibility Must Stay Until Recovery Is Complete

Any change to the fabric must keep older nodes recoverable until one of these
is true:

1. every node has confirmed the new contract; or
2. the missing nodes were manually retired, revoked, or explicitly accepted as
   lost.

This applies to:

- update plan formats;
- signed registry schemas;
- artifact install types;
- authority signature envelopes;
- bootstrap config formats;
- recovery seed formats;
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.

The rule is strict: do not delete the old recovery format while nodes that may
still need it remain unrecovered.

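The removal condition in this section can be expressed as a small gate. The sketch below is illustrative only: `Disposition`, `NodeContractStatus`, `removal_blockers`, the integer contract generations, and the node ids are all hypothetical names and values, not part of the fabric's actual tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    ACTIVE = "active"
    RETIRED = "retired"
    REVOKED = "revoked"
    ACCEPTED_LOST = "accepted_lost"


@dataclass(frozen=True)
class NodeContractStatus:
    node_id: str
    confirmed_contract: int  # highest contract generation this node has confirmed
    disposition: Disposition = Disposition.ACTIVE


def removal_blockers(nodes, new_contract):
    """Return the ids of nodes that may still need the pre-`new_contract` format."""
    return [
        node.node_id
        for node in nodes
        if node.disposition is Disposition.ACTIVE
        and node.confirmed_contract < new_contract
    ]


# Illustrative fleet: one node has not yet confirmed contract generation 3.
fleet = [
    NodeContractStatus("node-a", confirmed_contract=3),
    NodeContractStatus("node-b", confirmed_contract=2),
    NodeContractStatus("node-c", confirmed_contract=1, disposition=Disposition.RETIRED),
]
blockers = removal_blockers(fleet, new_contract=3)
```

Here `blockers` contains only `node-b`: `node-c` is behind as well, but it was explicitly retired, so it no longer holds the old format in place. The old format may be deleted only when the list is empty.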
### 2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location

Node-to-node runtime transport remains QUIC over UDP only.

That does not permit:

- one bootstrap address;
- one update mirror;
- one registry carrier;
- one ingress node;
- one relay;
- one control replica.

QUIC is the transport. Survivability requires many signed ways to discover the
current valid QUIC endpoints.

### 2.4 No Single Service May Own Recovery

Recovery must not depend on one:

- backend URL;
- DNS name;
- HTTP ingress;
- update repository host;
- relay node;
- cluster admin node.

Any of those may disappear while the node is still healthy enough to recover.

## 3. Required Recovery Layers

### 3.1 Embedded Bootstrap Seed Set

Each installable node package must contain a bounded bootstrap seed set:

- multiple seed nodes;
- public and private candidates where appropriate;
- QUIC endpoint candidates only;
- signed bootstrap metadata;
- expiry / epoch rules;
- optional organization / cluster scope constraints.

The bootstrap seed set is only the first door, not cluster truth.

### 3.2 Signed Registry Gossip

After bootstrap, a node must learn current service locations through signed
fabric registry records that can be carried by any reachable peer.

Required properties:

- multiple records per service;
- quorum or otherwise policy-approved signatures;
- monotonic epoch/generation;
- expiry and freshness checks;
- live probe before promotion;
- ability to accept newer records from a reachable neighbor even when old
  origins are gone.

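A minimal sketch of the acceptance check these properties imply, under stated assumptions: `RegistryRecord`, `acceptable`, the field names, and the example endpoints are hypothetical, signature verification is assumed to have happened already (the record only carries the ids of authorities whose signatures checked out), and the live probe before promotion is a separate step not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    endpoint: str        # QUIC endpoint candidate
    epoch: int
    expires_at: float    # Unix seconds
    signers: frozenset   # authority ids whose signatures were already verified


def acceptable(candidate, current_epoch, now, authorities, quorum):
    """Decide whether a gossiped record may be stored, whoever relayed it."""
    if len(candidate.signers & authorities) < quorum:
        return False     # not enough pinned authorities signed it
    if candidate.expires_at <= now:
        return False     # stale record
    if candidate.epoch < current_epoch:
        return False     # would roll the node back to older truth
    return True


AUTHORITIES = frozenset({"auth-1", "auth-2", "auth-3"})
newer = RegistryRecord("update-cache", "203.0.113.10:4433", epoch=5,
                       expires_at=3000.0, signers=frozenset({"auth-2", "auth-3"}))
older = RegistryRecord("update-cache", "198.51.100.7:4433", epoch=3,
                       expires_at=3000.0, signers=frozenset({"auth-1", "auth-2"}))
forged = RegistryRecord("update-cache", "192.0.2.66:4433", epoch=9,
                        expires_at=3000.0, signers=frozenset({"auth-1", "unknown"}))

accept_newer = acceptable(newer, current_epoch=4, now=1000.0,
                          authorities=AUTHORITIES, quorum=2)
accept_older = acceptable(older, current_epoch=4, now=1000.0,
                          authorities=AUTHORITIES, quorum=2)
accept_forged = acceptable(forged, current_epoch=4, now=1000.0,
                           authorities=AUTHORITIES, quorum=2)
```

The check never looks at which peer delivered the record. That is what lets a node accept newer truth from any reachable neighbor after the original origin has disappeared. Records at the current epoch are allowed so that several records per service can coexist.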
### 3.3 Outbound-Only Recovery Attachment

A node behind NAT or in passive mode must be recoverable through an outbound
attachment.

Required behaviors:

- the node can maintain at least one long-lived outbound QUIC control channel;
- that channel survives IP changes by reconnecting through any remaining seed or
  signed registry endpoint;
- the node may receive updated registry truth, update triggers, workload
  changes, and recovery instructions over that channel;
- the fabric must not require inbound TCP/UDP reachability to repair the node.

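The reconnect behavior above can be sketched as a walk over every known candidate. This is an assumption-laden illustration: `attach_outbound` and `fake_connect` are hypothetical, the endpoints are documentation addresses, and a real node would dial QUIC instead of calling the stand-in.

```python
def attach_outbound(candidates, connect):
    """Try each distinct candidate once; return (endpoint, channel) or None."""
    tried = set()
    for endpoint in candidates:
        if endpoint in tried:
            continue
        tried.add(endpoint)
        try:
            return endpoint, connect(endpoint)
        except OSError:
            continue  # dead endpoint: fall through to the next candidate
    return None


def fake_connect(endpoint):
    # Stand-in for a real outbound QUIC dial; only one endpoint is alive here.
    if endpoint != "198.51.100.20:4433":
        raise ConnectionError(f"{endpoint} unreachable")
    return f"channel-to-{endpoint}"


registry_endpoints = ["203.0.113.10:4433"]                     # from signed registry
seed_endpoints = ["203.0.113.10:4433", "198.51.100.20:4433"]   # embedded seed set
attachment = attach_outbound(registry_endpoints + seed_endpoints, fake_connect)
```

In this run the registry endpoint is dead, so the node attaches through the remaining seed. A production loop would repeat the walk with backoff; the point of the sketch is that every connection is initiated by the node, so no inbound reachability is needed.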
### 3.4 Local Recovery Agent Boundary

The node must have a minimal recovery-capable local agent boundary that is
separate from ordinary service workloads.

It must be able to:

- validate signed update plans;
- download artifacts from multiple mirrors;
- stage replacement binaries;
- restart node-agent or host-agent tasks;
- rollback to previous binaries;
- swap to new signed registry/bootstrap records;
- emit recovery status when transport returns.

If node workloads fail, this local recovery boundary must still exist.

### 3.5 Multi-Source Artifact Delivery

Artifacts must be retrievable from more than one source:

- local cached file;
- cluster update-cache;
- organization-local cache if policy allows;
- public or internet-reachable mirror;
- neighbor-assisted relay transfer over the fabric.

A node must not become unrecoverable because one artifact hostname or one
download service disappeared.

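One way to picture multi-source delivery is a fetch loop that treats every source as optional and trusts only the expected digest. The sketch below assumes the signed update plan supplies a SHA-256 digest; `fetch_artifact`, the source names, and the callables are hypothetical stand-ins for real transports.

```python
import hashlib


def fetch_artifact(sources, expected_sha256):
    """Return (source_name, data) from the first source that yields a valid artifact."""
    errors = {}
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:
            errors[name] = f"unreachable: {exc}"
            continue
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            errors[name] = "hash mismatch"
            continue
        return name, data
    raise LookupError(f"no source delivered a valid artifact: {errors}")


payload = b"example host-agent build"
digest = hashlib.sha256(payload).hexdigest()


def dead_mirror():
    raise ConnectionError("mirror gone")


sources = [
    ("cluster update-cache", dead_mirror),           # hostname disappeared
    ("public mirror", lambda: b"corrupted bytes"),   # reachable but wrong content
    ("neighbor relay", lambda: payload),             # still serves the artifact
]
source_name, data = fetch_artifact(sources, digest)
```

Because the digest decides validity, a source does not have to be trusted to be useful. That is what makes neighbor-assisted transfer safe to include in the list.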
### 3.6 Trigger And Subscription Plane

Polling alone is not enough for very large fleets.

Required model:

- nodes may still perform slow fallback polling;
- primary update notification uses subscription/signal delivery;
- update-cache or registry service can repeatedly signal pending updates until
  acknowledged;
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.

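Idempotent, repeatable signals are easiest to reason about when each signal carries a generation number. The following node-side sketch assumes such a number exists; `TriggerInbox` and its methods are hypothetical names used only for illustration.

```python
class TriggerInbox:
    """Node-side bookkeeping for update signals that may be delivered repeatedly."""

    def __init__(self):
        self.pending_generation = 0   # highest generation signalled so far
        self.applied_generation = 0   # highest generation fully handled

    def on_signal(self, generation):
        """Record a signal; return True only when it announces new work."""
        if generation <= self.pending_generation:
            return False              # duplicate or stale signal: safe no-op
        self.pending_generation = generation
        return True

    def mark_applied(self, generation):
        """Record that work up to `generation` is done (never beyond what was signalled)."""
        done = min(generation, self.pending_generation)
        self.applied_generation = max(self.applied_generation, done)

    def has_pending_work(self):
        return self.applied_generation < self.pending_generation


inbox = TriggerInbox()
first = inbox.on_signal(7)      # new work announced
repeat = inbox.on_signal(7)     # sender retried before the acknowledgement arrived
stale = inbox.on_signal(6)      # late delivery of an older signal
```

The sender can keep repeating generation 7 through any live endpoint until the node acknowledges `applied_generation`. Repeats cost nothing on the node side, and slow fallback polling can feed the same `on_signal` path.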
## 4. Update Safety Rules

### 4.1 Upgrade Contracts

Every release that changes recovery-critical contracts must explicitly declare:

- minimum supported old version;
- maximum tolerated skew;
- whether migration is rolling-safe;
- whether the node must first update host-agent or node-agent;
- rollback compatibility;
- whether old bootstrap/registry envelopes remain accepted.

### 4.2 Two-Key Rule For Breaking Changes

Do not simultaneously break:

- discovery of where to get the update; and
- ability to understand the update once found.

At least one of those must remain compatible until fleet convergence or
explicit retirement.

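As a release-review aid, the two-key rule can be checked across all changes shipped in one migration window. The sketch is hypothetical throughout: `BreakingChange`, its flags, and the example change names are assumptions, and deciding whether a given change really breaks discovery or parsing remains a human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakingChange:
    name: str
    breaks_update_discovery: bool   # older nodes can no longer find the update
    breaks_update_parsing: bool     # older nodes can no longer understand it


def violates_two_key_rule(changes):
    """True when the combined changes break both keys in the same window."""
    breaks_discovery = any(c.breaks_update_discovery for c in changes)
    breaks_parsing = any(c.breaks_update_parsing for c in changes)
    return breaks_discovery and breaks_parsing


move_endpoint = BreakingChange("retire old update endpoint", True, False)
new_plan_format = BreakingChange("new update plan format", False, True)

combined = violates_two_key_rule([move_endpoint, new_plan_format])
endpoint_only = violates_two_key_rule([move_endpoint])
```

Each change is acceptable alone, but shipping both in the same window leaves an older node with neither key. The check therefore runs over the whole window instead of per change.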
### 4.3 Old Artifact Retention

Recovery-critical artifact versions must remain available until:

- all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.

Do not garbage-collect the last working host-agent or node-agent build for an
unrecovered population.

### 4.4 Install Type Continuity

If historical nodes request different install types for the same product
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
planning must keep compatibility aliases until the fleet converges.

The fabric must not strand nodes on an install-type naming mismatch.

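A compatibility alias table can be as simple as a lookup keyed by platform and requested install type. The install-type names below come from this section, but the mapping itself (which alias resolves to which canonical type, and the `platform` key) is an assumption made for illustration; `resolve_install_type` is a hypothetical helper.

```python
# Hypothetical alias table: (platform, requested install type) -> canonical type.
CANONICAL_BY_ALIAS = {
    ("windows", "windows_binary"): "windows_binary",
    ("windows", "windows_service"): "windows_binary",
    ("windows", "native"): "windows_binary",
    ("linux", "linux_binary"): "linux_binary",
    ("linux", "native"): "linux_binary",
}


def resolve_install_type(platform, requested):
    """Map a historical install-type name to the canonical one, or None if unknown."""
    return CANONICAL_BY_ALIAS.get((platform, requested))


legacy_windows = resolve_install_type("windows", "windows_service")
legacy_linux = resolve_install_type("linux", "native")
mismatch = resolve_install_type("windows", "linux_binary")
```

A `None` result is the situation this section forbids for a live node population: it should surface as a blocking recovery risk, not as a silent "no update available".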
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk

A stale node may show both of these conditions at once:

- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says `no_matching_artifact` or
  an equivalent legacy contract failure.

This means the node is not merely late with a heartbeat. It is running an older
recovery planner contract and may still depend on:

- historical install-type aliases;
- older artifact matching semantics;
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.

This condition must be classified as `legacy recovery contract drift` and must
block compatibility removal the same way an artifact gap does.

Operationally this also means:

- the node requires a `recovery bridge`;
- the cluster enters `bridge hold active` for compatibility-removal decisions;
- `bridge hold` remains active until the node reports a recovery-compatible
  status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should
  be classified as `bridge replay ready`, meaning the system can replay the
  legacy-compatible update plan as soon as the node regains an outbound control
  cycle;
- operator tooling should expose a canonical `bridge replay plan` per node so
  recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
- dashboards and rollout guards must show this separately from ordinary
  `waiting recovery heartbeat`.

Canonical example:

- `ifcm-rufms-s-mo1cr` is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
  merely as a delayed heartbeat.

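The classification described in this section can be sketched as a small decision function for nodes that are already stale. The status strings come from the text above; the function names, parameters, and the set of legacy failure statuses are hypothetical.

```python
# Statuses that indicate an older recovery planner contract (assumed set).
LEGACY_CONTRACT_FAILURES = {"no_matching_artifact"}


def classify_stale_node(compatible_artifact_exists, last_agent_status):
    """Classify a node whose heartbeat is already stale."""
    if not compatible_artifact_exists:
        return "artifact gap"
    if last_agent_status in LEGACY_CONTRACT_FAILURES:
        return "legacy recovery contract drift"
    return "waiting recovery heartbeat"


def bridge_replay_ready(classification, target_mapping_exists):
    """A drifted node is replay-ready once artifact and target mapping both exist."""
    return (classification == "legacy recovery contract drift"
            and target_mapping_exists)


# Inputs taken from the canonical example: a matching artifact exists, but the
# last host-agent report still says `no_matching_artifact`.
canonical = classify_stale_node(True, "no_matching_artifact")
replay_ready = bridge_replay_ready(canonical, target_mapping_exists=True)
```

Keeping the drift class distinct from `waiting recovery heartbeat` in the return value is what allows dashboards and rollout guards to show the two separately.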
## 5. Service And Location Mobility Rules

Moving a service must not strand nodes that only know the old location.

Required pattern:

1. publish new signed registry records;
2. keep old records valid during overlap;
3. allow any reachable peer to relay the new records;
4. live-probe and promote the new endpoints;
5. only then retire the old location;
6. keep enough overlap for slow or partitioned nodes to catch up.

This applies to:

- control-api replicas;
- update-cache/update-store replicas;
- web/admin ingress replicas;
- relay/rendezvous nodes;
- service-channel endpoints.

## 6. Failure Classes The Fabric Must Tolerate

The design must explicitly handle all of these:

- node behind NAT with only outbound connectivity;
- several nodes behind one NAT/local segment;
- node changes public IP;
- node changes private IP;
- old DNS/URL becomes dead;
- artifact mirror disappears;
- control ingress disappears;
- relay disappears;
- update install fails halfway;
- binary staged but restart fails;
- old task/service name changes;
- local disk is nearly full;
- time skew causes signature freshness risk;
- authority rotates;
- route authority replica disappears;
- state directory survives but binary is broken;
- binary survives but state directory is partly stale;
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.

## 7. Required Local State And Journaling

The node local state store must retain at least:

- active and previous signed registry records;
- active and previous bootstrap seeds;
- last successful update plan per product;
- last applied artifact hash/version;
- last rollback candidate;
- last successful service endpoints used for update/control;
- pending trigger generation;
- recovery attempts with timestamps and reasons;
- last known good runtime command line / task/unit identity;
- last known workload desired states.

Writes must be atomic. A power loss must not leave the node with zero valid
state.

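A common way to get atomic writes is to write a temporary file in the same directory, flush it to disk, and then rename it over the target. The sketch below shows that pattern with the Python standard library; `atomic_write_json` and the JSON state layout are assumptions, not the node's actual storage format.

```python
import json
import os
import tempfile


def atomic_write_json(path, state):
    """Replace `path` with `state` so readers see the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())     # push file contents to stable storage
        os.replace(tmp_path, path)     # atomic swap on POSIX and Windows
    except BaseException:
        try:
            os.unlink(tmp_path)        # clean up the partial temp file
        except FileNotFoundError:
            pass
        raise
```

If power is lost before `os.replace`, the previous state file is still intact; if it is lost afterwards, the new one is complete. For stronger durability on POSIX systems the containing directory can also be fsynced after the rename, which this sketch omits. Keeping the previous generation as a separate file gives the rollback candidate required above.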
## 8. Observability And Fleet Safety Rules

The control plane must make invisible-recovery risk explicit.

It must surface:

- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.

Cluster-wide changes that would strand such nodes must be blocked or require an
explicit recovery-admin override.

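A rollout guard built on these signals might look like the following sketch. It covers only three of the listed risks, and everything in it is hypothetical: `NodeView`, the flag names, the tuple-encoded versions, and the thresholds are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    node_id: str
    agent_version: tuple          # e.g. (1, 4, 0); tuples compare element by element
    pinned_epoch: int
    has_compatible_artifact: bool


def risk_flags(node, recovery_floor, oldest_valid_epoch):
    """List the recovery risks visible for one node."""
    flags = []
    if node.agent_version < recovery_floor:
        flags.append("agent_below_recovery_floor")
    if node.pinned_epoch < oldest_valid_epoch:
        flags.append("pinned_epoch_too_old")
    if not node.has_compatible_artifact:
        flags.append("no_compatible_recovery_artifact")
    return flags


def evaluate_change(nodes, recovery_floor, oldest_valid_epoch,
                    recovery_admin_override=False):
    """Return (allowed, at_risk) for a cluster-wide change."""
    at_risk = {}
    for node in nodes:
        flags = risk_flags(node, recovery_floor, oldest_valid_epoch)
        if flags:
            at_risk[node.node_id] = flags
    allowed = not at_risk or recovery_admin_override
    return allowed, at_risk


fleet = [
    NodeView("node-a", (2, 0, 0), pinned_epoch=10, has_compatible_artifact=True),
    NodeView("node-b", (1, 2, 0), pinned_epoch=3, has_compatible_artifact=True),
]
allowed, at_risk = evaluate_change(fleet, recovery_floor=(1, 4, 0),
                                   oldest_valid_epoch=5)
```

The guard returns the at-risk map even when an override is given, so an override stays an explicit, auditable decision over a visible list of nodes.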
## 9. Release And Migration Checklist

Before deleting old code, old formats, or old endpoints, verify all of these:

1. every active node has confirmed a compatible version; or the remaining nodes
   are explicitly marked for manual retirement/recovery;
2. host-agent and node-agent recovery paths both have matching artifacts;
3. bootstrap/registry overlap exists for the migration window;
4. at least two independent artifact sources remain reachable;
5. signed registry gossip can carry the new locations without the old API
   hostname;
6. rollback artifacts are still available;
7. install type aliases remain for historical agents where needed;
8. NAT/passive/outbound-only nodes were explicitly tested;
9. stale-node risk report is empty or consciously accepted by recovery-admin;
10. removal of legacy support is documented with the exact cutoff conditions.

## 10. `ifcm-rufms-s-mo1cr` Rule

`ifcm-rufms-s-mo1cr` is the standing reference case for future work.

For this node class, the platform must assume:

- the host is behind NAT;
- the node may only keep outbound channels;
- no direct Windows administrative access exists;
- old discovery endpoints may disappear;
- only the fabric/update/recovery plane can save the node.

Any future transport, update, authority, bootstrap, registry, or workload
change must be reviewed against this question:

> If `ifcm-rufms-s-mo1cr` is still on the older contract and we cannot log in to
> the host, can the fabric still recover it?

If the answer is no, the change is incomplete.

## 11. Immediate Follow-Through

The system should keep implementing these concrete items:

- separate documented recovery-plane tests for Windows NAT nodes;
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.

## 12. Decision

The fabric must treat node survival as a first-class architecture contract.

A node is not considered safe merely because the happy path works. It is safe
only when it can survive protocol migration, endpoint relocation, partial
cluster loss, artifact source loss, and lack of manual host access without
being abandoned.