Working version, but speed is 10 Mbit/s
2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
The fabric must be able to lose:
- old API endpoints;
- old artifact URLs;
- old artifact distributors;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
is true:
1. every node has confirmed the new contract; or
2. the missing nodes were manually retired, revoked, or explicitly accepted as
2. the missing nodes were manually removed, revoked, or explicitly accepted as
lost.
This applies to:
@@ -81,6 +81,17 @@ This applies to:
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.
Canonical `Control API` access must be distributable as an explicit endpoint
set, not only as a single compat `backend_url`. Install/update contracts should
carry:
- `control_plane_endpoints`;
- signed fabric registry bootstrap records;
- artifact endpoints.
The old `backend_url` remains a compatibility fallback only until the fleet has
converged.
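
A minimal sketch of how an agent might turn such a contract into an ordered list of endpoints to try. Only `control_plane_endpoints` and `backend_url` come from the text above; the `InstallContract` class and `resolve_control_endpoints` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstallContract:
    # Explicit endpoint set named in the policy text.
    control_plane_endpoints: List[str] = field(default_factory=list)
    # Compat fallback; expected to disappear once the fleet has converged.
    backend_url: Optional[str] = None


def resolve_control_endpoints(contract: InstallContract) -> List[str]:
    """Return endpoints in try-order: explicit set first, compat fallback last."""
    # dict.fromkeys de-duplicates while preserving the published order.
    ordered = list(dict.fromkeys(contract.control_plane_endpoints))
    if contract.backend_url and contract.backend_url not in ordered:
        ordered.append(contract.backend_url)
    return ordered
```

Placing `backend_url` last keeps old nodes recoverable without letting the compat value shadow the explicit endpoint set.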
The rule is strict: do not delete the old recovery format while nodes that may
still need it remain unrecovered.
@@ -200,6 +211,67 @@ Required model:
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.
### 3.7 Update Intent Must Be Independent From One Updater Endpoint
A node must not be permanently bound to one updater service, one updater node,
one systemd unit name, one scheduled task name, or one control endpoint.
The durable object is not "call this updater URL". The durable object is a
signed update intent:
- product;
- target version or version constraint;
- artifact hashes and allowed mirrors;
- compatibility contract;
- rollout lease constraints;
- force / emergency flags;
- rollback permission;
- signed registry/service records that can carry the intent;
- expiry and generation.
A node may learn the same signed intent from:
- Control API;
- update-store;
- update-cache;
- long-lived outbound control subscription;
- neighboring nodes through signed fabric registry gossip;
- local cached last-known-good update state.
The receiving node must validate the intent locally before acting. A neighbor
may relay signed update metadata and artifacts, but it must not become an
authority that can forge or broaden an update.
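
The local validation step could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: the field names `generation` and `expires_at`, the return strings, and both function names are hypothetical. HMAC is used only to keep the example inside the standard library; because HMAC is symmetric, a real fabric would need an asymmetric signature scheme so that a relaying neighbor cannot forge an intent.

```python
import hashlib
import hmac
import json


def sign_intent(intent: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the intent (stand-in for a real signature)."""
    payload = json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def validate_intent(intent: dict, signature: str, key: bytes,
                    now: float, last_generation: int) -> str:
    """Return "ok" or the reason the intent must be rejected."""
    # 1. The signature must cover exactly the fields we are about to act on.
    if not hmac.compare_digest(sign_intent(intent, key), signature):
        return "bad_signature"
    # 2. An expired intent must never be applied, whoever relayed it.
    if now >= intent["expires_at"]:
        return "expired"
    # 3. A generation older than the last one seen is a replay of stale state.
    if intent["generation"] < last_generation:
        return "stale_generation"
    return "ok"
```

Because the check depends only on the intent, the signature, and local state, the result is the same whether the intent arrived from the Control API, a cache, or a neighbor.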
The local recovery boundary must reconcile stale runtime facts before fetching
or applying a plan:
- current cluster id;
- node id and identity state directory;
- current container/task/unit name;
- current control endpoints;
- current signed registry records;
- available artifact mirrors.
This is mandatory because a node may move, a container may be renamed, a task
may be recreated, or the old host updater may still have a stale command line.
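
One way to picture the reconcile step is as a diff between cached facts and freshly observed facts, taken before any plan is fetched. The sketch below assumes both are flat dictionaries; the key names and the `reconcile_runtime_facts` function are hypothetical.

```python
from typing import Dict, Tuple


def reconcile_runtime_facts(cached: Dict[str, object],
                            observed: Dict[str, object]
                            ) -> Tuple[Dict[str, object], Dict[str, tuple]]:
    """Prefer observed facts over cached ones and report what drifted."""
    # drift maps each changed key to (stale cached value, current observed value).
    drift = {
        key: (cached.get(key), value)
        for key, value in observed.items()
        if cached.get(key) != value
    }
    # Observed values win; cached values survive only where nothing was observed.
    merged = {**cached, **observed}
    return merged, drift
```

For example, a renamed container would show up in `drift` as `{"unit_name": ("old-name", "new-name")}`, so the updater can correct its stale command line before applying a plan.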
### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
The update plane must use three delivery paths at the same time:
1. slow local fallback polling, so a node eventually recovers even after missed
signals;
2. subscription / push hints, so ordinary updates are fast and do not wait for
a long poll interval;
3. peer relay of signed update intents and signed registry records, so a node
can learn current update truth through reachable neighbors when the old
center or old ingress is unavailable.
No one path is allowed to be the only recovery mechanism.
Polling cadence is a safety net, not the rollout control mechanism. Rollout
control belongs to the orchestrator and signed rollout leases.
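
Since the same intent can arrive over any of the three paths, a node needs a rule for choosing between copies. A minimal sketch, assuming every candidate has already passed local validation and carries a hypothetical integer `generation` field; the source labels and the `newest_intent` function are illustrative only.

```python
from typing import Iterable, Optional, Tuple


def newest_intent(candidates: Iterable[Tuple[str, dict]]) -> Optional[Tuple[str, dict]]:
    """Pick the (source, intent) pair with the highest generation.

    The delivery path is kept only for diagnostics; it does not affect trust,
    because each intent was validated locally before reaching this function.
    """
    best = None
    for source, intent in candidates:
        # Strict comparison: on a tie, the first path that delivered it wins.
        if best is None or intent["generation"] > best[1]["generation"]:
            best = (source, intent)
    return best
```

With this rule a slow poll that returns an older generation cannot override a newer intent that was already learned from a subscription hint or a peer.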
## 4. Update Safety Rules
### 4.1 Upgrade Contracts
@@ -228,7 +300,7 @@ explicit retirement.
Recovery-critical artifact versions must remain available until:
- all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.
- the remaining nodes are revoked/removed and recorded as intentionally lost.
Do not garbage-collect the last working host-agent or node-agent build for an
unrecovered population.
@@ -237,17 +309,18 @@ unrecovered population.
If historical nodes request different install types for the same product
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
planning must keep compatibility aliases until the fleet converges.
planning must publish explicit signed install-type mappings in the fabric
registry until the fleet converges.
The fabric must not strand nodes on an install-type naming mismatch.
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk
A stale node may report:
- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says `no_matching_artifact` or
an equivalent legacy contract failure.
an equivalent compat contract failure.
This means the node is not only waiting for a heartbeat. It is running an older
recovery planner contract and may still depend on:
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.
This condition must be classified as `legacy recovery contract drift` and must
This condition must be classified as `compat recovery contract drift` and must
block compatibility removal the same way an artifact gap does.
Operationally this also means:
@@ -268,11 +341,11 @@ Operationally this also means:
status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should
be classified as `bridge replay ready`, meaning the system can replay the
legacy-compatible update plan as soon as the node regains an outbound control
compat-compatible update plan as soon as the node regains an outbound control
cycle;
- operator tooling should expose a canonical `bridge replay plan` per node so
recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
- signed recovery mappings must remain available for that node population;
- dashboards and rollout guards must show this separately from ordinary
`waiting recovery heartbeat`.
@@ -281,9 +354,78 @@ Canonical example:
- `ifcm-rufms-s-mo1cr` is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
- therefore the node must be treated as a compat recovery-contract blocker, not
merely as a delayed heartbeat.
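
The classification described in this section can be sketched as a small decision function. The label strings and `no_matching_artifact` come from the text; the function name, its parameters, and the exact way the labels combine are assumptions made for illustration.

```python
from typing import List


def classify_stale_node(stale: bool,
                        backend_has_compatible_artifact: bool,
                        target_mapping_exists: bool,
                        last_reported_status: str) -> List[str]:
    """Return the recovery labels a dashboard might show for one node."""
    if not stale:
        return []
    drifted = (backend_has_compatible_artifact
               and last_reported_status == "no_matching_artifact")
    if not drifted:
        # Nothing contradicts the current contract; the node is simply late.
        return ["waiting recovery heartbeat"]
    labels = ["compat recovery contract drift"]
    if target_mapping_exists:
        # Artifact and mapping both exist, so the plan can be replayed on contact.
        labels.append("bridge replay ready")
    return labels
```

Applied to the canonical example, `classify_stale_node(True, True, True, "no_matching_artifact")` reports contract drift rather than a plain delayed heartbeat.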
### 4.6 Rollout Orchestrator Is Mandatory
Large fleet update safety requires an orchestrator. The orchestrator decides
which nodes may update now. Nodes decide whether a received signed intent is
valid and locally safe to execute.
The orchestrator must support:
- canary rollout;
- rolling rollout;
- area / site / NAT-group aware rollout;
- max parallel updates globally;
- max parallel updates per area;
- max unavailable nodes;
- minimum healthy quorum before continuing;
- hold / pause / resume;
- force update for explicitly selected nodes;
- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
regression;
- separate host-agent and node-agent phases;
- emergency recovery bridge for pre-orchestrator compat nodes.
The orchestrator must issue short-lived rollout leases. A node may only start an
update when it holds a valid lease for that product/version. If the lease
expires before apply starts, the node must re-check the policy.
Rollout leases prevent the entire farm from starting the same update
simultaneously when a subscription signal or gossip wave reaches all nodes.
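
A node-side lease check might look like the following sketch. The `RolloutLease` fields and the `may_start_update` function are hypothetical; the text above only requires that the lease be valid for the product/version and unexpired when apply starts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutLease:
    node_id: str
    product: str
    version: str
    expires_at: float  # seconds since epoch; short-lived by design


def may_start_update(lease: Optional[RolloutLease], node_id: str,
                     product: str, version: str, now: float) -> bool:
    """True only if this node holds an unexpired lease for this exact update."""
    if lease is None:
        return False
    return (lease.node_id == node_id
            and lease.product == product
            and lease.version == version
            and now < lease.expires_at)
```

The check is meant to run immediately before apply, not when the signal arrives, so an expired lease sends the node back to re-check policy instead of starting late.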
### 4.7 Node-Side Update Admission Control
Even with a lease, the node must perform local admission checks before apply:
- artifact hash and signature match the signed intent;
- rollback artifact or previous binary is available unless policy explicitly
disables rollback;
- enough disk space exists for stage plus rollback;
- current active workload can tolerate restart, or orchestrator granted a
maintenance lease;
- the node still has at least the required recovery connectivity after
excluding itself as temporarily unavailable;
- host-agent update is applied before node-agent update when the contract says
the host-agent is the recovery floor.
If admission fails, the node reports `blocked` with a precise reason instead of
silently waiting.
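
The admission checks above can be sketched as an ordered series of tests that stops at the first failure and names it. Only the `blocked` status comes from the text; the keys of `facts`, the reason strings, and the `admitted` status are hypothetical.

```python
from typing import Any, Dict, Optional


def admission_block_reason(facts: Dict[str, Any]) -> Optional[str]:
    """Return the first failing admission check, or None if all pass."""
    if not facts["artifact_verified"]:
        return "artifact_hash_or_signature_mismatch"
    if facts["rollback_required"] and not facts["rollback_available"]:
        return "rollback_artifact_missing"
    # Disk must hold the staged artifact and the rollback copy together.
    if facts["free_bytes"] < facts["stage_bytes"] + facts["rollback_bytes"]:
        return "insufficient_disk_space"
    if not (facts["workload_tolerates_restart"] or facts["maintenance_lease"]):
        return "workload_cannot_restart"
    # Recovery paths are counted with this node treated as unavailable.
    if facts["recovery_paths_without_self"] < facts["required_recovery_paths"]:
        return "recovery_connectivity_too_low"
    if facts["host_agent_first"] and not facts["host_agent_updated"]:
        return "host_agent_must_update_first"
    return None


def admission_report(facts: Dict[str, Any]) -> Dict[str, str]:
    """Build the status a node would report instead of silently waiting."""
    reason = admission_block_reason(facts)
    if reason is not None:
        return {"status": "blocked", "reason": reason}
    return {"status": "admitted"}
```

Returning a single precise reason keeps the report actionable: the operator sees which condition to fix first.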
### 4.8 Update Waves Must Preserve Failure-Domain Diversity
An update wave must not take down all nodes from the same recovery role or
failure domain at once.
The orchestrator must account for:
- area;
- site;
- locality group;
- NAT group;
- public ingress dependency;
- control-api role;
- update-store / update-cache role;
- relay / rendezvous role;
- VPN ingress / egress roles;
- nodes that are currently the only known recovery path for another node.
For a small fleet, this means the orchestrator may update one node at a time
when the remaining diversity is weak, even if the global max parallel setting
is higher.
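
A minimal sketch of diversity-aware wave planning, assuming each node belongs to exactly one failure domain (a real orchestrator would have to consider several dimensions from the list above at once). The `plan_wave` function and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def plan_wave(candidates: List[Tuple[str, str]],
              healthy_by_domain: Dict[str, int],
              max_parallel: int,
              min_remaining: int = 1) -> List[str]:
    """Greedily pick nodes while keeping min_remaining healthy nodes per domain.

    candidates: (node_id, domain) pairs in priority order.
    healthy_by_domain: count of currently healthy nodes in each domain.
    """
    remaining = dict(healthy_by_domain)  # copy so the caller's counts are untouched
    wave: List[str] = []
    for node_id, domain in candidates:
        if len(wave) >= max_parallel:
            break
        # Taking this node down must still leave enough healthy peers in its domain.
        if remaining.get(domain, 0) - 1 >= min_remaining:
            remaining[domain] -= 1
            wave.append(node_id)
    return wave
```

With two healthy nodes in one site and a single node in another, the planner selects only one node even when `max_parallel` is 3, which matches the small-fleet behavior described above.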
## 5. Service And Location Mobility Rules
Moving a service must not strand nodes that only know the old location.
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
- platform removes compat support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.
## 7. Required Local State And Journaling
@@ -359,7 +501,7 @@ It must surface:
- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
- nodes whose only known artifact distributor is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
7. install type aliases remain for historical agents where needed;
8. NAT/passive/outbound-only nodes were explicitly tested;
9. stale-node risk report is empty or consciously accepted by recovery-admin;
10. removal of legacy support is documented with the exact cutoff conditions.
10. removal of compat support is documented with the exact cutoff conditions.
## 10. `ifcm-rufms-s-mo1cr` Rule
@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
- stale-node risk dashboard/report before compat cleanup;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.