Working version, but speed is 10 Mbit/s
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
 The fabric must be able to lose:

 - old API endpoints;
-- old artifact URLs;
+- old artifact distributors;
 - previous public IP addresses;
 - previous NAT mappings;
 - previous relay nodes;
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
 is true:

 1. every node has confirmed the new contract; or
-2. the missing nodes were manually retired, revoked, or explicitly accepted as
+2. the missing nodes were manually removed, revoked, or explicitly accepted as
 lost.

 This applies to:
@@ -81,6 +81,17 @@ This applies to:
 - host-agent / updater runtime contracts;
 - control endpoints needed only for migration.

+Canonical `Control API` access must be distributable as an explicit endpoint
+set, not only as a single compat `backend_url`. Install/update contracts should
+carry:
+
+- `control_plane_endpoints`;
+- signed fabric registry bootstrap records;
+- artifact endpoints.
+
+The old `backend_url` remains a compatibility fallback only until the fleet has
+converged.
+
 The rule is strict: do not delete the old recovery format while nodes that may
 still need it remain unrecovered.

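As an illustration of the endpoint-set rule added in the hunk above, the following is a minimal sketch of how a node might build its list of control endpoints to try. Only the field names `control_plane_endpoints` and `backend_url` come from the policy text; the `InstallContract` class, the `resolve_control_endpoints` helper, and the ordering rule (explicit set first, compat fallback last) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InstallContract:
    # Hypothetical container; only the two field names come from the policy text.
    control_plane_endpoints: List[str] = field(default_factory=list)
    backend_url: Optional[str] = None  # compat fallback only


def resolve_control_endpoints(contract: InstallContract) -> List[str]:
    """Return endpoints to try: explicit set first, compat fallback last."""
    # dict.fromkeys drops duplicates while keeping the declared order.
    endpoints = list(dict.fromkeys(contract.control_plane_endpoints))
    if contract.backend_url and contract.backend_url not in endpoints:
        endpoints.append(contract.backend_url)
    return endpoints
```

With this shape, a pre-migration node that only knows `backend_url` still gets a non-empty list, while a converged node never depends on the fallback.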
@@ -200,6 +211,67 @@ Required model:
 - signals are idempotent;
 - signals do not require the old control endpoint to remain alive.

+### 3.7 Update Intent Must Be Independent From One Updater Endpoint
+
+A node must not be permanently bound to one updater service, one updater node,
+one systemd unit name, one scheduled task name, or one control endpoint.
+
+The durable object is not "call this updater URL". The durable object is a
+signed update intent:
+
+- product;
+- target version or version constraint;
+- artifact hashes and allowed mirrors;
+- compatibility contract;
+- rollout lease constraints;
+- force / emergency flags;
+- rollback permission;
+- signed registry/service records that can carry the intent;
+- expiry and generation.
+
+A node may learn the same signed intent from:
+
+- Control API;
+- update-store;
+- update-cache;
+- long-lived outbound control subscription;
+- neighboring nodes through signed fabric registry gossip;
+- local cached last-known-good update state.
+
+The receiving node must validate the intent locally before acting. A neighbor
+may relay signed update metadata and artifacts, but it must not become an
+authority that can forge or broaden an update.
+
+The local recovery boundary must reconcile stale runtime facts before fetching
+or applying a plan:
+
+- current cluster id;
+- node id and identity state directory;
+- current container/task/unit name;
+- current control endpoints;
+- current signed registry records;
+- available artifact mirrors.
+
+This is mandatory because a node may move, a container may be renamed, a task
+may be recreated, or the old host updater may still have a stale command line.
+
+### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
+
+The update plane must use three delivery paths at the same time:
+
+1. slow local fallback polling, so a node eventually recovers even after missed
+signals;
+2. subscription / push hints, so ordinary updates are fast and do not wait for
+a long poll interval;
+3. peer relay of signed update intents and signed registry records, so a node
+can learn current update truth through reachable neighbors when the old
+center or old ingress is unavailable.
+
+No one path is allowed to be the only recovery mechanism.
+
+Polling cadence is a safety net, not the rollout control mechanism. Rollout
+control belongs to the orchestrator and signed rollout leases.
+
 ## 4. Update Safety Rules

 ### 4.1 Upgrade Contracts
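The local validation step that section 3.7 adds in the hunk above could look roughly like the sketch below. It is not the project's implementation: the field names `generation` and `expires_at`, the result strings, and the rule that only a lower generation is rejected are assumptions. HMAC is used only to keep the example within the standard library. It is a symmetric scheme, so anyone who can verify could also forge; a real fabric would need asymmetric signatures so that a relaying neighbor cannot become an authority.

```python
import hashlib
import hmac
import json


def sign_intent(payload: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the intent (illustrative HMAC stand-in)."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def validate_intent(payload: dict, signature: str, key: bytes,
                    now: float, last_generation: int) -> str:
    """Validate an intent locally, regardless of which path delivered it."""
    # Constant-time comparison of the recomputed and received signatures.
    if not hmac.compare_digest(sign_intent(payload, key), signature):
        return "bad_signature"
    if payload["expires_at"] <= now:
        return "expired"
    # Assumption: an equal generation is accepted, because the same intent
    # may arrive again via polling, subscription, or neighbor relay.
    if payload["generation"] < last_generation:
        return "stale_generation"
    return "ok"
```

Because the check depends only on the payload, the signature, and local state, the delivery path (Control API, update-cache, gossip, or local cache) does not affect the outcome.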
@@ -228,7 +300,7 @@ explicit retirement.
 Recovery-critical artifact versions must remain available until:

 - all nodes have moved past them; or
-- the remaining nodes are revoked/retired and recorded as intentionally lost.
+- the remaining nodes are revoked/removed and recorded as intentionally lost.

 Do not garbage-collect the last working host-agent or node-agent build for an
 unrecovered population.
@@ -237,17 +309,18 @@ unrecovered population.

 If historical nodes request different install types for the same product
 (`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
-planning must keep compatibility aliases until the fleet converges.
+planning must publish explicit signed install-type mappings in the fabric
+registry until the fleet converges.

 The fabric must not strand nodes on an install-type naming mismatch.

-### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
+### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk

 A stale node may report:

 - a compatible recovery artifact exists under the current registry; but
 - the last local updater/host-agent status still says `no_matching_artifact` or
-an equivalent legacy contract failure.
+an equivalent compat contract failure.

 This means the node is not only waiting for a heartbeat. It is running an older
 recovery planner contract and may still depend on:
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
 - older update-plan interpretation rules;
 - overlap in signed registry / bootstrap envelopes.

-This condition must be classified as `legacy recovery contract drift` and must
+This condition must be classified as `compat recovery contract drift` and must
 block compatibility removal the same way an artifact gap does.

 Operationally this also means:
@@ -268,11 +341,11 @@ Operationally this also means:
 status on the current contract or the operator explicitly retires the node;
 - when a compatible artifact and target mapping already exist, the node should
 be classified as `bridge replay ready`, meaning the system can replay the
-legacy-compatible update plan as soon as the node regains an outbound control
+compat-compatible update plan as soon as the node regains an outbound control
 cycle;
 - operator tooling should expose a canonical `bridge replay plan` per node so
 recovery replay uses the same signed update-plan logic as normal updates;
-- compatibility aliases / overlap must remain enabled for that node population;
+- signed recovery mappings must remain available for that node population;
 - dashboards and rollout guards must show this separately from ordinary
 `waiting recovery heartbeat`.

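The classification described in section 4.5 and the hunks above can be sketched as a small decision function. This is one possible reading of the text, not the project's logic: the function name, its parameters, and the `artifact gap` label are hypothetical, and `no_matching_artifact` stands in for the whole family of "equivalent compat contract failure" statuses.

```python
def classify_stale_node(last_agent_status: str,
                        artifact_in_registry: bool,
                        target_mapping_exists: bool) -> str:
    """Classify a stale node for dashboards and rollout guards (illustrative)."""
    # No contract failure reported: the node is only late with its heartbeat.
    if last_agent_status != "no_matching_artifact":
        return "waiting recovery heartbeat"
    # The registry itself cannot serve the node (hypothetical label).
    if not artifact_in_registry:
        return "artifact gap"
    # Artifact and mapping exist, so the plan can be replayed on next contact.
    if target_mapping_exists:
        return "bridge replay ready"
    # Artifact exists, but the node still runs an older planner contract.
    return "compat recovery contract drift"
```

Under this reading, every result other than `waiting recovery heartbeat` would block compatibility removal for that node.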
@@ -281,9 +354,78 @@ Canonical example:
 - `ifcm-rufms-s-mo1cr` is stale;
 - the current backend can match a Windows-compatible host-agent artifact;
 - the last host-agent report still says `no_matching_artifact`;
-- therefore the node must be treated as a legacy recovery-contract blocker, not
+- therefore the node must be treated as a compat recovery-contract blocker, not
 merely as a delayed heartbeat.

+### 4.6 Rollout Orchestrator Is Mandatory
+
+Large fleet update safety requires an orchestrator. The orchestrator decides
+which nodes may update now. Nodes decide whether a received signed intent is
+valid and locally safe to execute.
+
+The orchestrator must support:
+
+- canary rollout;
+- rolling rollout;
+- area / site / NAT-group aware rollout;
+- max parallel updates globally;
+- max parallel updates per area;
+- max unavailable nodes;
+- minimum healthy quorum before continuing;
+- hold / pause / resume;
+- force update for explicitly selected nodes;
+- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
+regression;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for pre-orchestrator compat nodes.
+
+The orchestrator must issue short-lived rollout leases. A node may only start an
+update when it holds a valid lease for that product/version. If the lease
+expires before apply starts, the node must re-check the policy.
+
+Rollout leases prevent the entire farm from starting the same update
+simultaneously when a subscription signal or gossip wave reaches all nodes.
+
+### 4.7 Node-Side Update Admission Control
+
+Even with a lease, the node must perform local admission checks before apply:
+
+- artifact hash and signature match the signed intent;
+- rollback artifact or previous binary is available unless policy explicitly
+disables rollback;
+- enough disk space exists for stage plus rollback;
+- current active workload can tolerate restart, or orchestrator granted a
+maintenance lease;
+- the node still has at least the required recovery connectivity after
+excluding itself as temporarily unavailable;
+- host-agent update is applied before node-agent update when the contract says
+the host-agent is the recovery floor.
+
+If admission fails, the node reports `blocked` with a precise reason instead of
+silently waiting.
+
+### 4.8 Update Waves Must Preserve Failure-Domain Diversity
+
+An update wave must not take down all nodes from the same recovery role or
+failure domain at once.
+
+The orchestrator must account for:
+
+- area;
+- site;
+- locality group;
+- NAT group;
+- public ingress dependency;
+- control-api role;
+- update-store / update-cache role;
+- relay / rendezvous role;
+- VPN ingress / egress roles;
+- nodes that are currently the only known recovery path for another node.
+
+For a small fleet, this means the orchestrator may update one node at a time
+when the remaining diversity is weak, even if the global max parallel setting
+is higher.
+
 ## 5. Service And Location Mobility Rules

 Moving a service must not strand nodes that only know the old location.
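The lease rule from section 4.6 and a subset of the admission checks from section 4.7, both added in the hunk above, can be combined into one node-side gate. The sketch below is only an illustration: the `RolloutLease` fields, the function signature, and every reason string other than the `blocked` status are assumptions, and the workload, connectivity, and host-agent-ordering checks are left out for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RolloutLease:
    # Hypothetical lease shape: bound to one product/version, short-lived.
    product: str
    version: str
    expires_at: float


def admission_decision(lease: Optional[RolloutLease], product: str, version: str,
                       now: float, artifact_verified: bool,
                       rollback_available: bool, free_bytes: int,
                       stage_bytes: int, rollback_bytes: int,
                       rollback_required: bool = True) -> Tuple[str, str]:
    """Return ("admitted", "") or ("blocked", reason) before apply starts."""
    if lease is None or (lease.product, lease.version) != (product, version):
        return ("blocked", "no_matching_lease")
    if lease.expires_at <= now:
        # The lease expired before apply: the node must re-check the policy.
        return ("blocked", "lease_expired")
    if not artifact_verified:
        return ("blocked", "artifact_hash_or_signature_mismatch")
    if rollback_required and not rollback_available:
        return ("blocked", "rollback_artifact_missing")
    # Disk must hold the staged artifact plus the rollback copy.
    if free_bytes < stage_bytes + rollback_bytes:
        return ("blocked", "insufficient_disk_space")
    return ("admitted", "")
```

Returning an explicit reason with every `blocked` result matches the requirement that a node reports why it is blocked instead of silently waiting.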
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
 - node reboots during update;
 - only one peer still knows the new registry truth;
 - node is partitioned for a long time and rejoins later;
-- platform removes legacy support too early;
+- platform removes compat support too early;
 - operator has no shell/RDP/WinRM/SSH access to the host.

 ## 7. Required Local State And Journaling
@@ -359,7 +501,7 @@ It must surface:
 - nodes with stale heartbeat but recent updater activity;
 - nodes with no working compatible recovery artifact;
 - nodes whose pinned registry/bootstrap epoch is too old;
-- nodes whose only known artifact URL is dead;
+- nodes whose only known artifact distributor is dead;
 - nodes whose desired state requires a contract they cannot parse;
 - nodes whose local agent version is below the minimum recovery floor;
 - nodes whose last successful contact depended on a single service replica.
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
 7. install type aliases remain for historical agents where needed;
 8. NAT/passive/outbound-only nodes were explicitly tested;
 9. stale-node risk report is empty or consciously accepted by recovery-admin;
-10. removal of legacy support is documented with the exact cutoff conditions.
+10. removal of compat support is documented with the exact cutoff conditions.

 ## 10. `ifcm-rufms-s-mo1cr` Rule

@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
 - signed registry retention and overlap checks before endpoint migration;
 - compatibility alias coverage for historical install types;
 - artifact availability health over all mirrors;
-- stale-node risk dashboard/report before legacy removal;
+- stale-node risk dashboard/report before compat cleanup;
 - node-local journaling for last good registry/update state;
 - neighbor-assisted artifact relay path;
 - explicit recovery simulation for outbound-only nodes with dead old endpoints.