Fabric Node Survival And Recovery Policy
Status: active architecture policy.
This document defines the non-negotiable survival, compatibility, and recovery rules for Secure Access Fabric nodes. It exists because losing a node is not an acceptable operating model once the fabric grows beyond a small manually maintained fleet.
Reference incident:
- ifcm-rufms-s-mo1cr is the canonical recovery case.
- The node is behind NAT.
- There is no direct administrative access to the Windows host.
- The node must remain recoverable through the fabric/update/recovery plane without relying on manual host login.
The latest live recovery evidence for this case is documented in FABRIC_LIVE_AUDIT_2026-05-18.md.
This policy applies to Linux, Windows, Android, containerized nodes, and future node types.
1. Core Decision
The fabric must be able to lose:
- old API endpoints;
- old artifact distributors;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
- previous route-authority replicas;
- previous update-cache replicas;
- old service locations;
- operator access to the host OS;
- the current physical location of a workload;
- part of the cluster.
After losing any of these, the fabric must still keep the node recoverable.
Manual repair is allowed as an emergency tool. It must not be the default survival strategy.
2. Non-Negotiable Invariants
2.1 Node Identity Must Survive
A recoverable node must preserve:
- node_id;
- node keypair or key reference;
- pinned cluster authority / quorum descriptor;
- last accepted signed registry records;
- last accepted bootstrap seed set;
- last known good update policy;
- last known good workload desired state;
- rollback metadata;
- recovery audit trail.
Reinstall or repair must prefer preserving local state. Identity reset is a high-risk operator action, not the default repair path.
2.2 Compatibility Must Stay Until Recovery Is Complete
Any change to the fabric must keep older nodes recoverable until one of these is true:
- every node has confirmed the new contract; or
- the missing nodes were manually removed, revoked, or explicitly accepted as lost.
This applies to:
- update plan formats;
- signed registry schemas;
- artifact install types;
- authority signature envelopes;
- bootstrap config formats;
- recovery seed formats;
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.
Canonical Control API access must be distributable as an explicit endpoint set, not only as a single compat backend_url. Install/update contracts should carry:
- control_plane_endpoints;
- signed fabric registry bootstrap records;
- artifact endpoints.
The old backend_url remains a compatibility fallback only until the fleet has converged.
The rule is strict: do not delete the old recovery format while nodes that may still need it remain unrecovered.
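The removal gate described in this section can be sketched as a small check. This is a minimal illustration under stated assumptions, not the fabric's actual implementation: NodeState, NodeRecord, and both function names are hypothetical, and the sketch assumes the control plane already tracks one state per node.

```python
from dataclasses import dataclass
from enum import Enum


class NodeState(Enum):
    CONFIRMED = "confirmed"      # node acknowledged the new contract
    RETIRED = "retired"          # removed, revoked, or accepted as lost
    UNRECOVERED = "unrecovered"  # may still need the old recovery format


@dataclass(frozen=True)
class NodeRecord:
    node_id: str
    state: NodeState


def blockers_for_compat_removal(nodes):
    """Return the ids of nodes that still block removal of the old contract."""
    return sorted(n.node_id for n in nodes if n.state is NodeState.UNRECOVERED)


def may_remove_compat(nodes):
    """The old format may go only when no unrecovered node remains."""
    return not blockers_for_compat_removal(nodes)
```

Returning the blocking node ids, rather than only a boolean, keeps the reason for a refused cleanup visible to the operator.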
2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location
Node-to-node runtime transport remains QUIC over UDP only.
That does not permit:
- one bootstrap address;
- one update mirror;
- one registry carrier;
- one ingress node;
- one relay;
- one control replica.
QUIC is the transport. Survivability requires many signed ways to discover the current valid QUIC endpoints.
2.4 No Single Service May Own Recovery
Recovery must not depend on one:
- backend URL;
- DNS name;
- HTTP ingress;
- update repository host;
- relay node;
- cluster admin node.
Any of those may disappear while the node is still healthy enough to recover.
3. Required Recovery Layers
3.1 Embedded Bootstrap Seed Set
Each installable node package must contain a bounded bootstrap seed set:
- multiple seed nodes;
- public and private candidates where appropriate;
- QUIC endpoint candidates only;
- signed bootstrap metadata;
- expiry / epoch rules;
- optional organization / cluster scope constraints.
The bootstrap seed set is only the first door, not cluster truth.
3.2 Signed Registry Gossip
After bootstrap, a node must learn current service locations through signed fabric registry records that can be carried by any reachable peer.
Required properties:
- multiple records per service;
- quorum or otherwise policy-approved signatures;
- monotonic epoch/generation;
- expiry and freshness checks;
- live probe before promotion;
- ability to accept newer records from a reachable neighbor even when old origins are gone.
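The required properties above could translate into an acceptance check along these lines. This is a sketch only: RegistryRecord, accept_record, the reason strings, and the probe callback are hypothetical names, and signature verification is assumed to have happened elsewhere, so the record carries just a count of valid signatures.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    endpoint: str
    epoch: int
    expires_at: float
    valid_signatures: int  # assumption: counted after cryptographic verification


def accept_record(
    current: Optional[RegistryRecord],
    candidate: RegistryRecord,
    now: float,
    min_signatures: int,
    probe: Callable[[str], bool],
) -> Tuple[bool, str]:
    """Decide whether a relayed record may be promoted for its service."""
    if candidate.valid_signatures < min_signatures:
        return False, "insufficient_signatures"
    if candidate.expires_at <= now:
        return False, "expired"
    # Monotonic epoch: never move backwards, whoever relayed the record.
    if current is not None and candidate.epoch < current.epoch:
        return False, "stale_epoch"
    # Live probe before promotion.
    if not probe(candidate.endpoint):
        return False, "probe_failed"
    return True, "promoted"
```

The check never looks at which peer delivered the record, which is what lets any reachable neighbor carry newer truth after the old origins are gone.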
3.3 Outbound-Only Recovery Attachment
A node behind NAT or in passive mode must be recoverable through an outbound attachment.
Required behaviors:
- the node can maintain at least one long-lived outbound QUIC control channel;
- that channel survives IP changes by reconnecting through any remaining seed or signed registry endpoint;
- the node may receive updated registry truth, update triggers, workload changes, and recovery instructions over that channel;
- the fabric must not require inbound TCP/UDP reachability to repair the node.
3.4 Local Recovery Agent Boundary
The node must have a minimal recovery-capable local agent boundary that is separate from ordinary service workloads.
It must be able to:
- validate signed update plans;
- download artifacts from multiple mirrors;
- stage replacement binaries;
- restart node-agent or host-agent tasks;
- rollback to previous binaries;
- swap to new signed registry/bootstrap records;
- emit recovery status when transport returns.
If node workloads fail, this local recovery boundary must still exist.
3.5 Multi-Source Artifact Delivery
Artifacts must be retrievable from more than one source:
- local cached file;
- cluster update-cache;
- organization-local cache if policy allows;
- public or internet-reachable mirror;
- neighbor-assisted relay transfer over the fabric.
A node must not become unrecoverable because one artifact hostname or one download service disappeared.
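A minimal sketch of multi-source retrieval, assuming each source is wrapped in a zero-argument callable that returns the artifact bytes, returns None when it has nothing, or raises OSError when unreachable. The function name and the source names in the usage are hypothetical; the point is only that a dead or tampered source is skipped rather than fatal.

```python
import hashlib
from typing import Callable, Iterable, List, Optional, Tuple

Fetcher = Callable[[], Optional[bytes]]


def fetch_artifact(
    sources: Iterable[Tuple[str, Fetcher]],
    expected_sha256: str,
) -> Tuple[Optional[str], Optional[bytes], List[Tuple[str, str]]]:
    """Try sources in order; return (source name, data, failures so far)."""
    failures: List[Tuple[str, str]] = []
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:
            failures.append((name, f"error: {exc}"))
            continue
        if data is None:
            failures.append((name, "unavailable"))
            continue
        # The hash comes from the signed intent, so any source is acceptable
        # as long as the bytes match.
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            failures.append((name, "hash_mismatch"))
            continue
        return name, data, failures
    return None, None, failures
```

Because the expected hash is fixed before any download starts, a neighbor relay or public mirror does not need to be trusted, only verified.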
3.6 Trigger And Subscription Plane
Polling alone is not enough for very large fleets.
Required model:
- nodes may still perform slow fallback polling;
- primary update notification uses subscription/signal delivery;
- update-cache or registry service can repeatedly signal pending updates until acknowledged;
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.
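One way to make repeated signals idempotent is to key them on a monotonically increasing generation, which section 7 already lists as "pending trigger generation". The class below is a hypothetical node-side sketch under that assumption, not the real signal protocol.

```python
class TriggerInbox:
    """Tracks the newest update signal seen and the newest one acknowledged."""

    def __init__(self):
        self.pending_generation = 0
        self.acknowledged_generation = 0

    def receive(self, generation):
        """Record a signal; return True only if it carries new work."""
        if generation <= self.pending_generation:
            return False  # duplicate or older signal: safe to ignore
        self.pending_generation = generation
        return True

    def acknowledge(self, generation):
        """Mark work up to `generation` as handled, never past what was seen."""
        capped = min(generation, self.pending_generation)
        self.acknowledged_generation = max(self.acknowledged_generation, capped)

    def has_unacknowledged(self):
        return self.acknowledged_generation < self.pending_generation
```

A sender can repeat the same generation through any surviving channel until it sees the acknowledgement, and the node performs the work at most once.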
3.7 Update Intent Must Be Independent From One Updater Endpoint
A node must not be permanently bound to one updater service, one updater node, one systemd unit name, one scheduled task name, or one control endpoint.
The durable object is not "call this updater URL". The durable object is a signed update intent:
- product;
- target version or version constraint;
- artifact hashes and allowed mirrors;
- compatibility contract;
- rollout lease constraints;
- force / emergency flags;
- rollback permission;
- signed registry/service records that can carry the intent;
- expiry and generation.
A node may learn the same signed intent from:
- Control API;
- update-store;
- update-cache;
- long-lived outbound control subscription;
- neighboring nodes through signed fabric registry gossip;
- local cached last-known-good update state.
The receiving node must validate the intent locally before acting. A neighbor may relay signed update metadata and artifacts, but it must not become an authority that can forge or broaden an update.
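The following sketch illustrates why a relaying neighbor cannot broaden an intent: the signature covers the whole canonical intent, so any added mirror or changed version invalidates it. HMAC-SHA256 with a shared key is used here purely as a stand-in so the example stays within the standard library; the document does not specify the signature scheme, and a real fabric would presumably verify asymmetric authority or quorum signatures. All field and function names are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(intent):
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(intent, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_intent(intent, key):
    # Stand-in for the authority signature (assumption, see lead-in).
    return hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()


def validate_intent(intent, signature, key, now, last_generation):
    """Local validation a node performs before acting on a relayed intent."""
    expected = sign_intent(intent, key)
    if not hmac.compare_digest(expected, signature):
        return "bad_signature"
    if intent["expires_at"] <= now:
        return "expired"
    if intent["generation"] <= last_generation:
        return "stale_generation"
    return "ok"
```

The same validation runs regardless of whether the intent arrived from the Control API, a subscription channel, a neighbor, or the local cache.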
The local recovery boundary must reconcile stale runtime facts before fetching or applying a plan:
- current cluster id;
- node id and identity state directory;
- current container/task/unit name;
- current control endpoints;
- current signed registry records;
- available artifact mirrors.
This is mandatory because a node may move, a container may be renamed, a task may be recreated, or the old host updater may still have a stale command line.
3.8 Polling, Subscription, And Neighbor Relay Are All Required
The update plane must use three delivery paths at the same time:
- slow local fallback polling, so a node eventually recovers even after missed signals;
- subscription / push hints, so ordinary updates are fast and do not wait for a long poll interval;
- peer relay of signed update intents and signed registry records, so a node can learn current update truth through reachable neighbors when the old center or old ingress is unavailable.
No one path is allowed to be the only recovery mechanism.
Polling cadence is a safety net, not the rollout control mechanism. Rollout control belongs to the orchestrator and signed rollout leases.
4. Update Safety Rules
4.1 Upgrade Contracts
Every release that changes recovery-critical contracts must explicitly declare:
- minimum supported old version;
- maximum tolerated skew;
- whether migration is rolling-safe;
- whether the node must first update host-agent or node-agent;
- rollback compatibility;
- whether old bootstrap/registry envelopes remain accepted.
4.2 Two-Key Rule For Breaking Changes
Do not simultaneously break:
- discovery of where to get the update; and
- ability to understand the update once found.
At least one of those must remain compatible until fleet convergence or explicit retirement.
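As a release-gate sketch, the rule reduces to refusing any change that breaks both keys before convergence. ReleaseChange and its field names are hypothetical, and how a release is judged to break discovery or parsing is assumed to be declared by the release author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseChange:
    breaks_update_discovery: bool  # old nodes can no longer find the update
    breaks_update_parsing: bool    # old nodes can no longer understand it
    fleet_converged: bool = False  # or the old contract was explicitly retired


def violates_two_key_rule(change):
    """True when a release would break both keys before convergence."""
    if change.fleet_converged:
        return False
    return change.breaks_update_discovery and change.breaks_update_parsing
```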
4.3 Old Artifact Retention
Recovery-critical artifact versions must remain available until:
- all nodes have moved past them; or
- the remaining nodes are revoked/removed and recorded as intentionally lost.
Do not garbage-collect the last working host-agent or node-agent build for an unrecovered population.
4.4 Install Type Continuity
If historical nodes request different install types for the same product (windows_binary, windows_service, native, linux_binary, etc.), recovery planning must publish explicit signed install-type mappings in the fabric registry until the fleet converges.
The fabric must not strand nodes on an install-type naming mismatch.
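A resolver for such mappings might look like the sketch below. The install-type names come from this document, but which historical name maps to which current one is not stated here, so the pairs in the usage note are illustrative only; the platform key and the function name are likewise hypothetical.

```python
def resolve_install_type(requested, platform, mappings, available):
    """Map a historical install type to one that has a published artifact.

    mappings:  {(platform, historical_type): current_type}, assumed to come
               from signed registry records.
    available: install types that currently have a matching artifact.
    """
    if requested in available:
        return requested
    mapped = mappings.get((platform, requested))
    if mapped in available:
        return mapped
    return None  # caller should surface this as a naming mismatch
```

For example, with mappings {("windows", "windows_service"): "windows_binary"} and only windows_binary available, a node that still asks for windows_service resolves to windows_binary instead of failing to match.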
4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk
A stale node may show this combination:
- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says no_matching_artifact or an equivalent compat contract failure.
This means the node is not only waiting for a heartbeat. It is running an older recovery planner contract and may still depend on:
- historical install-type aliases;
- older artifact matching semantics;
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.
This condition must be classified as compat recovery contract drift and must block compatibility removal the same way an artifact gap does.
Operationally this also means:
- the node requires a recovery bridge;
- the cluster enters bridge hold active for compatibility-removal decisions;
- bridge hold remains active until the node reports a recovery-compatible status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should be classified as bridge replay ready, meaning the system can replay the compatible update plan as soon as the node regains an outbound control cycle;
- operator tooling should expose a canonical bridge replay plan per node so recovery replay uses the same signed update-plan logic as normal updates;
- signed recovery mappings must remain available for that node population;
- dashboards and rollout guards must show this separately from ordinary waiting recovery heartbeat.
Canonical example:
- ifcm-rufms-s-mo1cr is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says no_matching_artifact;
- therefore the node must be treated as a compat recovery-contract blocker, not merely as a delayed heartbeat.
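The classification in this section can be sketched as a pure function over a few observed facts. The dataclasses, label strings, and the COMPAT_FAILURE_STATUSES set are hypothetical; only no_matching_artifact is taken from the text, and equivalent failure codes would need to be added to the set.

```python
from dataclasses import dataclass

# Assumption: statuses that indicate an older recovery planner contract.
COMPAT_FAILURE_STATUSES = {"no_matching_artifact"}


@dataclass(frozen=True)
class StaleNodeFacts:
    heartbeat_stale: bool
    backend_has_compatible_artifact: bool
    target_mapping_exists: bool
    last_agent_status: str


@dataclass(frozen=True)
class RecoveryClass:
    label: str
    blocks_compat_removal: bool
    bridge_replay_ready: bool


def classify(facts):
    if not facts.heartbeat_stale:
        return RecoveryClass("healthy", False, False)
    if not facts.backend_has_compatible_artifact:
        return RecoveryClass("artifact_gap", True, False)
    if facts.last_agent_status in COMPAT_FAILURE_STATUSES:
        # Artifact exists, yet the node's own planner could not match it.
        return RecoveryClass(
            "compat_recovery_contract_drift", True, facts.target_mapping_exists
        )
    return RecoveryClass("waiting_recovery_heartbeat", False, False)
```

Under these assumptions the canonical example, a stale node with a matching backend artifact and a last report of no_matching_artifact, lands in the drift class and blocks compatibility removal.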
4.6 Rollout Orchestrator Is Mandatory
Large fleet update safety requires an orchestrator. The orchestrator decides which nodes may update now. Nodes decide whether a received signed intent is valid and locally safe to execute.
The orchestrator must support:
- canary rollout;
- rolling rollout;
- area / site / NAT-group aware rollout;
- max parallel updates globally;
- max parallel updates per area;
- max unavailable nodes;
- minimum healthy quorum before continuing;
- hold / pause / resume;
- force update for explicitly selected nodes;
- automatic stop on failure rate, heartbeat loss, rollback, or route diversity regression;
- separate host-agent and node-agent phases;
- emergency recovery bridge for pre-orchestrator compat nodes.
The orchestrator must issue short-lived rollout leases. A node may only start an update when it holds a valid lease for that product/version. If the lease expires before apply starts, the node must re-check the policy.
Rollout leases prevent the entire farm from starting the same update simultaneously when a subscription signal or gossip wave reaches all nodes.
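A lease check on the node side could be as small as the sketch below. RolloutLease, its fields, and may_start_update are hypothetical names; the sketch assumes the lease has already been verified as signed by the orchestrator and that clocks are close enough for a short time-to-live to be meaningful.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutLease:
    node_id: str
    product: str
    version: str
    issued_at: float
    ttl_seconds: float

    def is_valid(self, now: float) -> bool:
        return self.issued_at <= now < self.issued_at + self.ttl_seconds


def may_start_update(
    lease: Optional[RolloutLease],
    node_id: str,
    product: str,
    version: str,
    now: float,
) -> bool:
    """A node may begin applying only with a live lease for this exact target."""
    if lease is None:
        return False
    if (lease.node_id, lease.product, lease.version) != (node_id, product, version):
        return False
    return lease.is_valid(now)
```

When this returns False because the lease expired, the node goes back to re-checking policy rather than applying anyway.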
4.7 Node-Side Update Admission Control
Even with a lease, the node must perform local admission checks before apply:
- artifact hash and signature match the signed intent;
- rollback artifact or previous binary is available unless policy explicitly disables rollback;
- enough disk space exists for stage plus rollback;
- current active workload can tolerate restart, or orchestrator granted a maintenance lease;
- the node still has at least the required recovery connectivity after excluding itself as temporarily unavailable;
- host-agent update is applied before node-agent update when the contract says the host-agent is the recovery floor.
If admission fails, the node reports blocked with a precise reason instead of silently waiting.
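The admission checks can be sketched as a function that returns every blocking reason rather than stopping at the first. AdmissionFacts, its fields, and the reason strings are hypothetical; each field is assumed to be computed elsewhere by the local recovery agent.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AdmissionFacts:
    artifact_verified: bool            # hash and signature match the intent
    rollback_available: bool
    rollback_disabled_by_policy: bool
    free_bytes: int
    stage_bytes: int
    rollback_bytes: int
    workload_tolerates_restart: bool
    has_maintenance_lease: bool
    recovery_paths_without_self: int
    required_recovery_paths: int
    host_agent_is_recovery_floor: bool
    host_agent_current: bool
    updating_node_agent: bool


def admission_blockers(f):
    """Return precise reasons the update must not start; empty means admit."""
    reasons = []
    if not f.artifact_verified:
        reasons.append("artifact_hash_or_signature_mismatch")
    if not f.rollback_available and not f.rollback_disabled_by_policy:
        reasons.append("rollback_artifact_missing")
    if f.free_bytes < f.stage_bytes + f.rollback_bytes:
        reasons.append("insufficient_disk_space")
    if not (f.workload_tolerates_restart or f.has_maintenance_lease):
        reasons.append("workload_cannot_restart")
    if f.recovery_paths_without_self < f.required_recovery_paths:
        reasons.append("recovery_connectivity_too_low")
    if f.updating_node_agent and f.host_agent_is_recovery_floor and not f.host_agent_current:
        reasons.append("host_agent_must_update_first")
    return reasons


# Example: a node that passes every check, then the same node with a full disk.
ok = AdmissionFacts(
    artifact_verified=True, rollback_available=True, rollback_disabled_by_policy=False,
    free_bytes=500, stage_bytes=100, rollback_bytes=100,
    workload_tolerates_restart=True, has_maintenance_lease=False,
    recovery_paths_without_self=2, required_recovery_paths=1,
    host_agent_is_recovery_floor=True, host_agent_current=True, updating_node_agent=True,
)
low_disk = replace(ok, free_bytes=150)
print(admission_blockers(ok), admission_blockers(low_disk))
```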
4.8 Update Waves Must Preserve Failure-Domain Diversity
An update wave must not take down all nodes from the same recovery role or failure domain at once.
The orchestrator must account for:
- area;
- site;
- locality group;
- NAT group;
- public ingress dependency;
- control-api role;
- update-store / update-cache role;
- relay / rendezvous role;
- VPN ingress / egress roles;
- nodes that are currently the only known recovery path for another node.
For a small fleet, this means the orchestrator may update one node at a time when the remaining diversity is weak, even if the global max parallel setting is higher.
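A simplified wave selector is sketched below. It caps the number of nodes taken from any one failure domain and defers nodes flagged as the only known recovery path for another node. The function, its parameters, and the single-label domain model are assumptions; a real orchestrator would weigh several of the dimensions listed above at once.

```python
from collections import Counter


def select_wave(candidates, domain_of, max_parallel, max_per_domain=1,
                sole_recovery_paths=frozenset()):
    """Pick nodes for the next wave without draining any failure domain.

    candidates:          node ids in priority order
    domain_of:           {node_id: failure-domain label}
    sole_recovery_paths: nodes that are another node's only recovery path;
                         they are deferred in this sketch
    """
    wave = []
    per_domain = Counter()
    for node in candidates:
        if len(wave) >= max_parallel:
            break
        if node in sole_recovery_paths:
            continue
        domain = domain_of[node]
        if per_domain[domain] >= max_per_domain:
            continue
        per_domain[domain] += 1
        wave.append(node)
    return wave
```

With two candidates in the same NAT group and a global limit of five, this sketch still selects only one of them, which matches the small-fleet behavior described above.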
5. Service And Location Mobility Rules
Moving a service must not strand nodes that only know the old location.
Required pattern:
- publish new signed registry records;
- keep old records valid during overlap;
- allow any reachable peer to relay the new records;
- live-probe and promote the new endpoints;
- only then retire the old location;
- keep enough overlap for slow or partitioned nodes to catch up.
This applies to:
- control-api replicas;
- update-cache/update-store replicas;
- web/admin ingress replicas;
- relay/rendezvous nodes;
- service-channel endpoints.
6. Failure Classes The Fabric Must Tolerate
The design must explicitly handle all of these:
- node behind NAT with only outbound connectivity;
- several nodes behind one NAT/local segment;
- node changes public IP;
- node changes private IP;
- old DNS/URL becomes dead;
- artifact mirror disappears;
- control ingress disappears;
- relay disappears;
- update install fails halfway;
- binary staged but restart fails;
- old task/service name changes;
- local disk is nearly full;
- time skew causes signature freshness risk;
- authority rotates;
- route authority replica disappears;
- state directory survives but binary is broken;
- binary survives but state directory is partly stale;
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes compat support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.
7. Required Local State And Journaling
The node local state store must retain at least:
- active and previous signed registry records;
- active and previous bootstrap seeds;
- last successful update plan per product;
- last applied artifact hash/version;
- last rollback candidate;
- last successful service endpoints used for update/control;
- pending trigger generation;
- recovery attempts with timestamps and reasons;
- last known good runtime command line / task/unit identity;
- last known workload desired states.
Writes must be atomic. A power loss must not leave the node with zero valid state.
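A common way to get this property is to write a temporary file in the same directory, flush it to disk, and then rename it over the target. The sketch below assumes JSON state and a hypothetical function name; it does not fsync the containing directory, which some filesystems need for the rename itself to be durable.

```python
import json
import os
import tempfile


def atomic_write_json(path, state):
    """Replace `path` with `state` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over the previous version
    except BaseException:
        # Leave the previous state file untouched and drop the partial temp file.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If power is lost before the rename, the previous state file is still intact; if it is lost after, the new one is complete.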
8. Observability And Fleet Safety Rules
The control plane must make invisible-recovery risk explicit.
It must surface:
- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact distributor is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.
Cluster-wide changes that would strand such nodes must be blocked or require an explicit recovery-admin override.
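A per-node risk report over the signals listed above might be computed as follows. The dictionary keys, risk labels, and version-tuple representation are hypothetical, and the distributor check is slightly broader than the text: it flags a node when none of its known distributors is live, not only when it knows exactly one.

```python
def recovery_risks(node, min_agent_version, min_epoch, live_distributors):
    """Return the invisible-recovery risks that apply to one node."""
    risks = []
    if node["heartbeat_stale"] and node["recent_updater_activity"]:
        risks.append("stale_heartbeat_but_updater_active")
    if not node["has_compatible_recovery_artifact"]:
        risks.append("no_compatible_recovery_artifact")
    if node["pinned_epoch"] < min_epoch:
        risks.append("pinned_epoch_too_old")
    if not set(node["known_distributors"]) & set(live_distributors):
        risks.append("no_live_artifact_distributor")
    if not node["can_parse_desired_contract"]:
        risks.append("unparseable_desired_state_contract")
    if tuple(node["agent_version"]) < tuple(min_agent_version):
        risks.append("agent_below_recovery_floor")
    if node["replicas_used_for_last_contact"] <= 1:
        risks.append("single_replica_dependency")
    return risks


def change_is_blocked(nodes, min_agent_version, min_epoch, live_distributors,
                      recovery_admin_override=False):
    """Block a cluster-wide change while any node carries a recovery risk."""
    if recovery_admin_override:
        return False
    return any(
        recovery_risks(n, min_agent_version, min_epoch, live_distributors)
        for n in nodes
    )
```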
9. Release And Migration Checklist
Before deleting old code, old formats, or old endpoints, verify all of these:
- every active node has confirmed a compatible version; or the remaining nodes are explicitly marked for manual retirement/recovery;
- host-agent and node-agent recovery paths both have matching artifacts;
- bootstrap/registry overlap exists for the migration window;
- at least two independent artifact sources remain reachable;
- signed registry gossip can carry the new locations without the old API hostname;
- rollback artifacts are still available;
- install type aliases remain for historical agents where needed;
- NAT/passive/outbound-only nodes were explicitly tested;
- stale-node risk report is empty or consciously accepted by recovery-admin;
- removal of compat support is documented with the exact cutoff conditions.
10. ifcm-rufms-s-mo1cr Rule
ifcm-rufms-s-mo1cr is the standing reference case for future work.
For this node class, the platform must assume:
- the host is behind NAT;
- the node may only keep outbound channels;
- no direct Windows administrative access exists;
- old discovery endpoints may disappear;
- only the fabric/update/recovery plane can save the node.
Any future transport, update, authority, bootstrap, registry, or workload change must be reviewed against this question:
If ifcm-rufms-s-mo1cr is still on the older contract and we cannot log in to the host, can the fabric still recover it?
If the answer is no, the change is incomplete.
11. Immediate Follow-Through
Implementation work should continue on these concrete items:
- separate documented recovery-plane tests for Windows NAT nodes;
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before compat cleanup;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
12. Decision
The fabric must treat node survival as a first-class architecture contract.
A node is not considered safe merely because the happy path works. It is safe only when it can survive protocol migration, endpoint relocation, partial cluster loss, artifact source loss, and lack of manual host access without being abandoned.