working version, but speed is 10 Mbit/s

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
 The fabric must be able to lose:

 - old API endpoints;
-- old artifact URLs;
+- old artifact distributors;
 - previous public IP addresses;
 - previous NAT mappings;
 - previous relay nodes;
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
 is true:

 1. every node has confirmed the new contract; or
-2. the missing nodes were manually retired, revoked, or explicitly accepted as
+2. the missing nodes were manually removed, revoked, or explicitly accepted as
   lost.

 This applies to:
@@ -81,6 +81,17 @@ This applies to:
 - host-agent / updater runtime contracts;
 - control endpoints needed only for migration.

+Canonical `Control API` access must be distributable as an explicit endpoint
+set, not only as a single compat `backend_url`. Install/update contracts should
+carry:
+
+- `control_plane_endpoints`;
+- signed fabric registry bootstrap records;
+- artifact endpoints.
+
+The old `backend_url` remains a compatibility fallback only until the fleet has
+converged.
+
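A minimal sketch of how a node could order its control endpoints under this rule. The `InstallContract` class and `resolve_control_endpoints` function are hypothetical names introduced for illustration; only `control_plane_endpoints` and `backend_url` come from the policy text, and the example URLs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstallContract:
    """Hypothetical contract shape; not the fabric's actual schema."""
    control_plane_endpoints: List[str] = field(default_factory=list)
    artifact_endpoints: List[str] = field(default_factory=list)
    backend_url: Optional[str] = None  # compat fallback only


def resolve_control_endpoints(contract: InstallContract) -> List[str]:
    """Return control endpoints in preference order, without duplicates.

    The explicit endpoint set comes first; the compat `backend_url` is
    appended last so it is tried only after the canonical set.
    """
    ordered: List[str] = []
    for url in [*contract.control_plane_endpoints, contract.backend_url]:
        if url and url not in ordered:
            ordered.append(url)
    return ordered
```

With this ordering, an old node that only carries `backend_url` still resolves one endpoint, while a converged node never depends on it.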
 The rule is strict: do not delete the old recovery format while nodes that may
 still need it remain unrecovered.

@@ -200,6 +211,67 @@ Required model:
 - signals are idempotent;
 - signals do not require the old control endpoint to remain alive.

+### 3.7 Update Intent Must Be Independent From One Updater Endpoint
+
+A node must not be permanently bound to one updater service, one updater node,
+one systemd unit name, one scheduled task name, or one control endpoint.
+
+The durable object is not "call this updater URL". The durable object is a
+signed update intent:
+
+- product;
+- target version or version constraint;
+- artifact hashes and allowed mirrors;
+- compatibility contract;
+- rollout lease constraints;
+- force / emergency flags;
+- rollback permission;
+- signed registry/service records that can carry the intent;
+- expiry and generation.
+
+A node may learn the same signed intent from:
+
+- Control API;
+- update-store;
+- update-cache;
+- long-lived outbound control subscription;
+- neighboring nodes through signed fabric registry gossip;
+- local cached last-known-good update state.
+
+The receiving node must validate the intent locally before acting. A neighbor
+may relay signed update metadata and artifacts, but it must not become an
+authority that can forge or broaden an update.
+
+The local recovery boundary must reconcile stale runtime facts before fetching
+or applying a plan:
+
+- current cluster id;
+- node id and identity state directory;
+- current container/task/unit name;
+- current control endpoints;
+- current signed registry records;
+- available artifact mirrors.
+
+This is mandatory because a node may move, a container may be renamed, a task
+may be recreated, or the old host updater may still have a stale command line.
+
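The local validation step in 3.7 can be sketched as follows. This is an illustration under stated assumptions, not the fabric's implementation: the field names `generation` and `expires_at` are hypothetical, and HMAC-SHA256 is used only as a standard-library stand-in for signature verification. A real deployment would need an asymmetric scheme (for example Ed25519), because with a shared HMAC key any relaying neighbor holding the key could forge an intent.

```python
import hashlib
import hmac
import json


def canonical_bytes(intent: dict) -> bytes:
    """Serialize the intent deterministically so signatures are stable."""
    return json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()


def sign_intent(intent: dict, key: bytes) -> str:
    """Stand-in signer (HMAC); assumption, see the note above."""
    return hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()


def validate_intent(intent: dict, signature: str, key: bytes,
                    last_generation: int, now: float) -> str:
    """Return "ok" or the first reason the intent must not be acted on."""
    if not hmac.compare_digest(sign_intent(intent, key), signature):
        return "bad_signature"
    if intent["expires_at"] <= now:
        return "expired"
    if intent["generation"] <= last_generation:
        return "not_newer"
    return "ok"
```

The same check runs regardless of which source delivered the intent, so a neighbor can relay the bytes but cannot broaden or alter them without failing verification.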
+### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
+
+The update plane must use three delivery paths at the same time:
+
+1. slow local fallback polling, so a node eventually recovers even after missed
+   signals;
+2. subscription / push hints, so ordinary updates are fast and do not wait for
+   a long poll interval;
+3. peer relay of signed update intents and signed registry records, so a node
+   can learn current update truth through reachable neighbors when the old
+   center or old ingress is unavailable.
+
+No one path is allowed to be the only recovery mechanism.
+
+Polling cadence is a safety net, not the rollout control mechanism. Rollout
+control belongs to the orchestrator and signed rollout leases.
+
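One way to combine the three delivery paths is to treat each as a source of candidate intents and keep the newest one. The sketch below assumes every candidate has already passed local validation and that intents carry an integer `generation` field; the source labels and the function name are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def pick_current_intent(
    candidates: Iterable[Tuple[str, dict]]
) -> Optional[Tuple[str, dict]]:
    """Pick the highest-generation intent across all delivery paths.

    `candidates` holds (source_label, intent) pairs. On a tie the first
    candidate seen wins, since equal generations describe the same intent.
    """
    best: Optional[Tuple[str, dict]] = None
    for source, intent in candidates:
        if best is None or intent["generation"] > best[1]["generation"]:
            best = (source, intent)
    return best
```

Because the choice depends only on the signed content, no single path is privileged: a stale poll result loses to a newer intent relayed by a peer.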
 ## 4. Update Safety Rules

 ### 4.1 Upgrade Contracts
@@ -228,7 +300,7 @@ explicit retirement.
 Recovery-critical artifact versions must remain available until:

 - all nodes have moved past them; or
-- the remaining nodes are revoked/retired and recorded as intentionally lost.
+- the remaining nodes are revoked/removed and recorded as intentionally lost.

 Do not garbage-collect the last working host-agent or node-agent build for an
 unrecovered population.
@@ -237,17 +309,18 @@ unrecovered population.

 If historical nodes request different install types for the same product
 (`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
-planning must keep compatibility aliases until the fleet converges.
+planning must publish explicit signed install-type mappings in the fabric
+registry until the fleet converges.

 The fabric must not strand nodes on an install-type naming mismatch.

-### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
+### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk

 A stale node may report:

 - a compatible recovery artifact exists under the current registry; but
 - the last local updater/host-agent status still says `no_matching_artifact` or
-  an equivalent legacy contract failure.
+  an equivalent compat contract failure.

 This means the node is not only waiting for a heartbeat. It is running an older
 recovery planner contract and may still depend on:
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
 - older update-plan interpretation rules;
 - overlap in signed registry / bootstrap envelopes.

-This condition must be classified as `legacy recovery contract drift` and must
+This condition must be classified as `compat recovery contract drift` and must
 block compatibility removal the same way an artifact gap does.

 Operationally this also means:
@@ -268,11 +341,11 @@ Operationally this also means:
  status on the current contract or the operator explicitly retires the node;
 - when a compatible artifact and target mapping already exist, the node should
  be classified as `bridge replay ready`, meaning the system can replay the
-  legacy-compatible update plan as soon as the node regains an outbound control
+  compat update plan as soon as the node regains an outbound control
  cycle;
 - operator tooling should expose a canonical `bridge replay plan` per node so
  recovery replay uses the same signed update-plan logic as normal updates;
-- compatibility aliases / overlap must remain enabled for that node population;
+- signed recovery mappings must remain available for that node population;
 - dashboards and rollout guards must show this separately from ordinary
  `waiting recovery heartbeat`.
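
The classification described in this section could be sketched as a small decision function. The state labels are taken from the text; the function name, its parameters, and the order of the checks are assumptions made for illustration.

```python
def classify_stale_node(last_status: str,
                        compatible_artifact_exists: bool,
                        target_mapping_exists: bool) -> str:
    """Classify a stale node for dashboards and rollout guards."""
    if last_status != "no_matching_artifact":
        # Nothing suggests a contract mismatch; the node is just late.
        return "waiting recovery heartbeat"
    if not compatible_artifact_exists:
        # The node's complaint is still true under the current registry.
        return "artifact gap"
    if target_mapping_exists:
        # The plan can be replayed once the node makes an outbound cycle.
        return "bridge replay ready"
    return "compat recovery contract drift"
```

In this sketch `bridge replay ready` is treated as a sub-state of drift: it still blocks compatibility removal, but tells the operator that no further server-side work is needed.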

@@ -281,9 +354,78 @@ Canonical example:
 - `ifcm-rufms-s-mo1cr` is stale;
 - the current backend can match a Windows-compatible host-agent artifact;
 - the last host-agent report still says `no_matching_artifact`;
-- therefore the node must be treated as a legacy recovery-contract blocker, not
+- therefore the node must be treated as a compat recovery-contract blocker, not
  merely as a delayed heartbeat.

+### 4.6 Rollout Orchestrator Is Mandatory
+
+Large fleet update safety requires an orchestrator. The orchestrator decides
+which nodes may update now. Nodes decide whether a received signed intent is
+valid and locally safe to execute.
+
+The orchestrator must support:
+
+- canary rollout;
+- rolling rollout;
+- area / site / NAT-group aware rollout;
+- max parallel updates globally;
+- max parallel updates per area;
+- max unavailable nodes;
+- minimum healthy quorum before continuing;
+- hold / pause / resume;
+- force update for explicitly selected nodes;
+- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
+  regression;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for pre-orchestrator compat nodes.
+
+The orchestrator must issue short-lived rollout leases. A node may only start an
+update when it holds a valid lease for that product/version. If the lease
+expires before apply starts, the node must re-check the policy.
+
+Rollout leases prevent the entire fleet from starting the same update
+simultaneously when a subscription signal or gossip wave reaches all nodes.
+
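The lease rule can be sketched as a check the node performs immediately before apply. `RolloutLease` and `may_start_update` are hypothetical names, and the lease fields are assumptions; a real lease would also be signed and verified like any other update metadata.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutLease:
    """Hypothetical short-lived lease issued by the orchestrator."""
    node_id: str
    product: str
    version: str
    expires_at: float  # Unix time in seconds


def may_start_update(lease: Optional[RolloutLease], node_id: str,
                     product: str, version: str, now: float) -> bool:
    """True only if the lease matches this node, product and version and
    has not expired at the moment apply is about to start."""
    if lease is None:
        return False
    return (lease.node_id == node_id
            and lease.product == product
            and lease.version == version
            and now < lease.expires_at)
```

When the check returns `False` because the lease expired, the node goes back to the orchestrator policy instead of applying on stale permission.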
+### 4.7 Node-Side Update Admission Control
+
+Even with a lease, the node must perform local admission checks before apply:
+
+- artifact hash and signature match the signed intent;
+- rollback artifact or previous binary is available unless policy explicitly
+  disables rollback;
+- enough disk space exists for stage plus rollback;
+- the current active workload can tolerate a restart, or the orchestrator has
+  granted a maintenance lease;
+- the node still has at least the required recovery connectivity after
+  excluding itself as temporarily unavailable;
+- host-agent update is applied before node-agent update when the contract says
+  the host-agent is the recovery floor.
+
+If admission fails, the node reports `blocked` with a precise reason instead of
+silently waiting.
+
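A minimal sketch of node-side admission, assuming the local facts have already been collected into one record. The `AdmissionFacts` fields and the reason strings are hypothetical; the checks mirror the list above in the same order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdmissionFacts:
    """Hypothetical snapshot of local facts gathered before apply."""
    artifact_verified: bool            # hash and signature match the intent
    rollback_available: bool
    rollback_disabled_by_policy: bool
    free_bytes: int
    stage_bytes: int
    rollback_bytes: int                # caller passes 0 if rollback is disabled
    workload_tolerates_restart: bool
    has_maintenance_lease: bool
    recovery_paths_excluding_self: int
    required_recovery_paths: int
    host_agent_must_go_first: bool     # contract: host-agent is the recovery floor
    host_agent_updated: bool


def admission_block_reason(f: AdmissionFacts) -> Optional[str]:
    """Return None when admitted, else the reason to report as `blocked`."""
    if not f.artifact_verified:
        return "artifact_verification_failed"
    if not (f.rollback_available or f.rollback_disabled_by_policy):
        return "rollback_unavailable"
    if f.free_bytes < f.stage_bytes + f.rollback_bytes:
        return "insufficient_disk_space"
    if not (f.workload_tolerates_restart or f.has_maintenance_lease):
        return "workload_cannot_restart"
    if f.recovery_paths_excluding_self < f.required_recovery_paths:
        return "insufficient_recovery_connectivity"
    if f.host_agent_must_go_first and not f.host_agent_updated:
        return "host_agent_update_required_first"
    return None
```

Returning a single named reason keeps the `blocked` report precise, which is what lets an operator tell a disk problem from a connectivity problem without shell access.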
+### 4.8 Update Waves Must Preserve Failure-Domain Diversity
+
+An update wave must not take down all nodes from the same recovery role or
+failure domain at once.
+
+The orchestrator must account for:
+
+- area;
+- site;
+- locality group;
+- NAT group;
+- public ingress dependency;
+- control-api role;
+- update-store / update-cache role;
+- relay / rendezvous role;
+- VPN ingress / egress roles;
+- nodes that are currently the only known recovery path for another node.
+
+For a small fleet, this means the orchestrator may update one node at a time
+when the remaining diversity is weak, even if the global max parallel setting
+is higher.
+
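The diversity rule can be sketched as a greedy wave planner. It assumes each candidate node is tagged with the failure-domain keys it belongs to (for example `area:a` or `nat:1`); the tag format, the `plan_wave` name, and the `max_per_domain` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def plan_wave(candidates: Iterable[Tuple[str, Set[str]]],
              max_parallel: int,
              max_per_domain: int = 1) -> List[str]:
    """Pick nodes for one wave without exceeding `max_per_domain` updating
    nodes in any single failure domain, up to `max_parallel` nodes."""
    wave: List[str] = []
    used: Counter = Counter()
    for node_id, domains in candidates:
        if len(wave) >= max_parallel:
            break
        if any(used[d] >= max_per_domain for d in domains):
            continue  # would concentrate risk in one domain; defer the node
        wave.append(node_id)
        used.update(domains)
    return wave
```

In a small fleet where most nodes share a domain, the planner naturally degrades to one node per wave even when `max_parallel` is higher, which matches the behavior described above.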
 ## 5. Service And Location Mobility Rules

 Moving a service must not strand nodes that only know the old location.
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
 - node reboots during update;
 - only one peer still knows the new registry truth;
 - node is partitioned for a long time and rejoins later;
-- platform removes legacy support too early;
+- platform removes compat support too early;
 - operator has no shell/RDP/WinRM/SSH access to the host.

 ## 7. Required Local State And Journaling
@@ -359,7 +501,7 @@ It must surface:
 - nodes with stale heartbeat but recent updater activity;
 - nodes with no working compatible recovery artifact;
 - nodes whose pinned registry/bootstrap epoch is too old;
-- nodes whose only known artifact URL is dead;
+- nodes whose only known artifact distributor is dead;
 - nodes whose desired state requires a contract they cannot parse;
 - nodes whose local agent version is below the minimum recovery floor;
 - nodes whose last successful contact depended on a single service replica.
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
 7. install type aliases remain for historical agents where needed;
 8. NAT/passive/outbound-only nodes were explicitly tested;
 9. stale-node risk report is empty or consciously accepted by recovery-admin;
-10. removal of legacy support is documented with the exact cutoff conditions.
+10. removal of compat support is documented with the exact cutoff conditions.

 ## 10. `ifcm-rufms-s-mo1cr` Rule

@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
 - signed registry retention and overlap checks before endpoint migration;
 - compatibility alias coverage for historical install types;
 - artifact availability health over all mirrors;
-- stale-node risk dashboard/report before legacy removal;
+- stale-node risk dashboard/report before compat cleanup;
 - node-local journaling for last good registry/update state;
 - neighbor-assisted artifact relay path;
 - explicit recovery simulation for outbound-only nodes with dead old endpoints.