working version, but speed is 10 Mbit
@@ -201,8 +201,8 @@ Updates must support:
 - local update cache where approved
 - OS / architecture specific artifacts under signed release manifests
 - explicit migration bundles when data structures change
-- legacy recovery compatibility until the fleet is converged or explicitly
-  retired
+- compat recovery compatibility until the fleet is converged or explicitly
+  removed
 - multi-source artifact retrieval for stranded or NAT-only nodes

 Version Storage stores immutable release manifests, artifacts, hashes,

@@ -1035,7 +1035,7 @@ Node-agent can start, stop, and monitor service workloads based on role assignme

 C19A adds the first bounded live service-supervision runtime proof on top of
 that contract: node-agent can read node-scoped desired workloads without an
-operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
+operator actor id, report built-in `core-mesh` and `fabric-listener` as running,
 report native built-in `synthetic.echo` as running, and keep unsupported
 production workloads degraded instead of pretending that their adapters exist.
 The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.

@@ -262,7 +262,7 @@ Rules:
 - latest frame wins
 - render must not block input/control
 - binary payloads should be used on direct data plane
-- backend fallback may continue existing JSON/base64 behavior during migration
+- compat fallback may continue existing JSON/base64 behavior during migration

 ### `clipboard`

@@ -347,7 +347,7 @@ The DP-2 JSON header contains:
 - `session_id`
 - `channel`, currently `render`
 - `message_type`, currently `render.frame.full` or `render.frame.region` on
-  direct worker WSS; `session.frame` remains accepted as the legacy DP-2
+  direct worker WSS; `session.frame` remains accepted as the compat DP-2
   binary message type for compatibility.
 - `sequence`
 - `timestamp`
@@ -950,7 +950,7 @@ explicit direct render message types:

 Compatibility:

-- Windows client direct transport still accepts legacy binary `message_type=session.frame`.
+- Windows client direct transport still accepts compat binary `message_type=session.frame`.
 - Inside the Windows application pipeline, direct binary frames are normalized
   back into the existing `session.frame` envelope so UI, lifecycle, input,
   clipboard, and file transfer behavior remain unchanged.

@@ -24,7 +24,7 @@ policy allows, host limited control/storage roles when approved, and report
 mobile-specific capacity signals such as battery, network type, NAT behavior,
 foreground/background state, and metered network policy.

-Node survival and recovery across endpoint moves, NAT-only reachability, legacy
+Node survival and recovery across endpoint moves, NAT-only reachability, compat
 contract overlap, and unavailable manual host access are governed by
 `docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
 particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
@@ -179,8 +179,8 @@ Endpoint state is also distributed:

 Moving a service must not break the farm.

-`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
-old nodes. It is not cluster truth. After bootstrap, a node finds services by
+`RAP_FABRIC_REGISTRY_RECORDS_JSON` and signed registry gossip, not any fixed
+HTTP/API address, define cluster truth. After bootstrap, a node finds services by
 logical role through signed fabric registry records that can be carried by any
 reachable peer.

@@ -258,7 +258,7 @@ Service classes that must use this registry before production hardening:
 - `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
   nodes.

-Legacy endpoint compatibility is allowed only for rolling migration:
+Compat endpoint compatibility is allowed only for rolling migration:

 - Old nodes may use their baked HTTP/control URL only to fetch a new version or
   a signed registry bootstrap record.
@@ -504,7 +504,7 @@ Deliverables:

 ### Stage FNP-3: WebSocket/TCP Compatibility Transport

-Status: retired as a migration-only stage.
+Status: removed as a migration-only stage.

 This stage existed to bootstrap binary frame semantics before QUIC routing and
 carrier reuse were ready. It introduced the transport-neutral frame loop,

@@ -6,6 +6,10 @@ This document replaces the oversimplified rule "every node must keep 3
 connections" with a stability model based on failure domains ("areas"),
 multi-path reachability, and live peer memory.

+It operates at the `Fabric Transport` layer. Services above the transport must
+consume service channels and must not directly reason about peer topology. See
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
+
 ## 1. Why the old "3 connections" rule is not enough

 A raw connection count is too weak as a resilience rule.
@@ -43,6 +47,9 @@ An area can be derived from:
 The area label must be part of live node metadata and endpoint candidate
 metadata.

+For the current fleet, area assignment should be explicit operator metadata, not
+an inference hidden only inside routing code.
+
 ## 3. Stability objective

 Each node should maintain a working peer set with diversity, not just count.

@@ -0,0 +1,386 @@
+# Fabric Execution Plan 2026-05-19
+
+Status: active execution plan.
+
+This document merges:
+
+- the service-over-fabric model;
+- the area and peer stability model;
+- the live audit findings from 2026-05-18 through 2026-05-19;
+- the node survival and recovery policy;
+- the current rollout and runtime rewrite findings.
+
+The goal is to move the live fabric from a partially migrated QUIC-first fleet
+to a fully converged distributed runtime where:
+
+1. inter-node transport is QUIC over UDP only;
+2. services use fabric channels and do not implement their own transport;
+3. nodes do not depend on one compat control/download edge;
+4. node directory and service discovery are distributed through signed records,
+   peer cache, and live peer exchange;
+5. the fleet remains recoverable after losing part of the fabric.
+
+## 1. Current live state
+
+### 1.1 What is already true
+
+- Inter-node runtime transport is QUIC over UDP.
+- All active nodes are converging on the latest control-endpoint rewrite line.
+- `home-*`, `test-*`, and `usa-los-1` already run
+  `rap-node-agent 0.2.325-updatehintwake`.
+- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeats.
+- Internal artifact plans now support mirror URLs instead of a single artifact
+  URL.
+- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
+  `home-1` were repaired so downloads and control traffic can flow again.
+
+### 1.2 What is still not finished
+
+- `ifcm-rufms-s-mo1cr` still reports the old
+  `http://vpn.cin.su:19191/api/v1` control URL in its live heartbeat.
+- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
+  while the rest of the reachable fleet is already on
+  `0.2.325-updatehintwake`.
+- The current blocker is now known precisely:
+  fresh heartbeat plus a dead updater subscription plane on a node-agent that
+  does not yet support local updater wake from heartbeat update hints.
+- Signed registry runtime is still not fully `active` across the fleet.
+- Cross-area direct peer diversity is still below the target for multiple
+  nodes.
+- TCP is still visible in allowed edge roles:
+  - external ingress;
+  - Control API;
+  - release downloads;
+  - temporary compatibility recovery overlap.
+
+## 2. Target system model
+
+### 2.1 Transport
+
+- Inter-node runtime transport: QUIC over UDP only.
+- No TCP/WebSocket fallback as the normal fabric carrier.
+
+### 2.2 Service layer
+
+- Services consume a fabric channel contract.
+- Services do not know internal path selection, relay choice, NAT traversal, or
+  route replacement details.
+- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
+  fabric channel.
+
+### 2.3 Discovery and directory
+
+- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
+- PostgreSQL remains the durable source of truth for policy, rollout, release,
+  desired state, and audit.
+- Runtime node discovery must use:
+  - signed registry records;
+  - peer cache;
+  - endpoint candidates;
+  - bounded live peer exchange.
+
+### 2.4 Small fleet rule
+
+For the current fleet size, every node should keep the full directory of all
+known nodes in scoped local state, plus runtime observations and endpoint
+candidate health.
+
+## 3. Execution priorities
+
+### P0. Finish runtime control-path convergence
+
+Goal:
+
+- remove the last live compat control dependency without manual host access.
+- ensure a live node can wake its local updater plane when Control/API sends an
+  explicit update hint, even if the previous updater loop died.
+
+Required work:
+
+1. Release the noop runtime rewrite restart fix.
+2. Roll it out to the fleet.
+3. Verify that updated nodes restart into canonical control endpoints.
+4. Add a local updater wake path driven by heartbeat update hints so
+   `update-trigger.json` is not the only signal.
+5. Confirm that `compat_control_dependency_nodes` falls to zero.
+6. Confirm that `updater_subscription_alert_nodes` falls to zero.
+7. Confirm that `updater_wake_unsupported_nodes` falls to zero.
+
+Done when:
+
+- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
+- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
+- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
+  heartbeat is still fresh and update status is stale.
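
The last "Done when" condition can be checked mechanically. Below is a minimal Python sketch, assuming a hypothetical per-node snapshot (`NodeSnapshot`) and illustrative freshness thresholds; neither the field names nor the threshold values come from the live Control API, and only the `0.2.325` boundary is taken from this plan.

```python
from dataclasses import dataclass

# Hypothetical thresholds; the real freshness windows are not stated in this plan.
HEARTBEAT_FRESH_S = 120
UPDATE_STATUS_STALE_S = 3600
# First release line that supports updater wake from heartbeat update hints.
WAKE_CAPABLE = (0, 2, 325)


@dataclass
class NodeSnapshot:
    node_id: str
    agent_version: str          # e.g. "0.2.322-controlendpointsrewrite"
    heartbeat_age_s: float      # seconds since the last heartbeat
    update_status_age_s: float  # seconds since the last update status report


def parse_version(version: str) -> tuple:
    """Return the numeric part of 'MAJOR.MINOR.PATCH-suffix' as a tuple of ints."""
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def is_updater_wake_unsupported(node: NodeSnapshot) -> bool:
    """True when heartbeat is fresh, update status is stale, and the agent is too old."""
    fresh = node.heartbeat_age_s <= HEARTBEAT_FRESH_S
    stale = node.update_status_age_s >= UPDATE_STATUS_STALE_S
    too_old = parse_version(node.agent_version) < WAKE_CAPABLE
    return fresh and stale and too_old
```

With these assumed thresholds, a node on `0.2.322-controlendpointsrewrite` with a 30-second-old heartbeat and a two-hour-old update status would be flagged, while the same observations on `0.2.325-updatehintwake` would not.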
+
+### P1. Finish distributed registry activation
+
+Goal:
+
+- nodes must resolve active service records without relying on one compat URL.
+
+Required work:
+
+1. Promote signed registry runtime from `candidate_only` / `missing` to
+   `active`.
+2. Ensure nodes resolve at least:
+   - `control-api`
+   - `update-store`
+   - `update-cache`
+3. Add live observability for:
+   - active records
+   - candidate records
+   - resolved core services
+   - last live probe
+
+Done when:
+
+- `fabric_registry_runtime_report.status = active` for the production fleet.
+
+### P2. Turn node directory into a real distributed runtime input
+
+Goal:
+
+- nodes should learn and keep node/service information from the fabric, not by
+  repeatedly consulting a center.
+
+Required work:
+
+1. Preserve full scoped node directory for the current fleet.
+2. Carry signed node/service records through peer exchange.
+3. Keep endpoint candidates and runtime observations in local peer cache.
+4. Spread updates to node/service reachability like a bounded wave, not as
+   independent central fetches by every node.
+
+Rules:
+
+- nodes may distribute signed directory/service data;
+- nodes must not self-author authoritative control-plane state;
+- the runtime may consume replicated signed copies of truth;
+- PostgreSQL remains the durable origin of truth.
+
+Done when:
+
+- nodes can refresh peer/service discovery from peers plus signed records even
+  if one control edge disappears.
+
+### P3. Replace the naive “3 peers” rule with stability by area and ingress
+
+Goal:
+
+- measure and enforce resilience by failure-domain diversity, not only count.
+
+Required metrics:
+
+- `direct_ready_count`
+- `relay_ready_count`
+- `external_area_ready_count`
+- `independent_ingress_ready_count`
+- `recovery_path_count`
+
+Required topology labels:
+
+- `site_id` - physical or logical site
+- `locality_group` - private/local reachability domain
+- `nat_group` - shared public edge dependency
+
+Required behaviors:
+
+1. Prefer peers from different `area` values.
+2. Prefer peers behind different public ingress / NAT dependencies.
+3. Keep direct-ready and relay-ready separate.
+4. Keep at least one recovery path outside the local area.
+5. Treat a public endpoint behind the same NAT area as
+   `external-network-required` unless cross-area observers have validated it.
+6. Do not demote a public endpoint only because the same area cannot hairpin
+   through its own public router address.
+7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
+   candidate is confirmed to be in the same local segment or NAT group.
+8. Penalize or reject private/local-looking endpoints when they belong to a
+   different segment/NAT scope than the local node, instead of probing them as
+   if they were reachable.
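
Behaviors 1, 2, and 4 above amount to choosing peers by failure-domain diversity before raw count. The following is a minimal greedy sketch in Python, not the planner's actual algorithm: the `Peer` type, the `pick_diverse_peers` function, and the example area and NAT group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str          # failure domain label (assumed to be operator metadata)
    nat_group: str     # shared public edge dependency
    direct_ready: bool


def pick_diverse_peers(local_area, candidates, limit):
    """Greedily pick peers that add a new area first, then a new NAT group."""
    chosen = []
    seen_areas = {local_area}
    seen_nat_groups = set()
    remaining = list(candidates)
    while remaining and len(chosen) < limit:
        def score(peer):
            # Tuples compare left to right: new area beats new NAT group,
            # which beats direct readiness.
            return (
                peer.area not in seen_areas,
                peer.nat_group not in seen_nat_groups,
                peer.direct_ready,
            )

        best = max(remaining, key=score)
        remaining.remove(best)
        chosen.append(best)
        seen_areas.add(best.area)
        seen_nat_groups.add(best.nat_group)
    return chosen
```

For a node in a `home` area that may keep two peers, this sketch selects one peer from each of two other areas before it considers a second `home` peer, so at least one recovery path lies outside the local area whenever such a candidate exists.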
+
+Done when:
+
+- critical nodes satisfy cross-area direct resilience targets, not merely raw
+  peer-count targets.
+
+### P4. Normalize edge roles and remove accidental TCP confusion
+
+Goal:
+
+- if TCP is present, it must be obviously classified and justified.
+
+Allowed TCP roles:
+
+- external service ingress;
+- Control API ingress;
+- artifact delivery edge;
+- temporary compatibility recovery overlap.
+
+Required work:
+
+1. Keep explicit inventory of edge listeners.
+2. Distinguish transport TCP from service-edge TCP in audits and UI.
+3. Advance the fabric-only recovery gate only after:
+   - compat control dependency is zero;
+   - registry is active;
+   - recovery path no longer depends on `19191`.
+
+### P5. Build the update orchestrator and distributed update intent plane
+
+Goal:
+
+- nodes must not depend on one updater endpoint, one old updater process, or one
+  central polling path;
+- update rollout must be controlled so the whole farm cannot update at once;
+- update intent must be distributable through management and neighboring nodes
+  as signed metadata.
+
+Required model:
+
+1. The durable update object is a signed `update_intent`, not a hard-coded
+   updater URL.
+2. Nodes may receive update intent from:
+   - Control API;
+   - update-store / update-cache;
+   - subscription hints over an outbound control channel;
+   - signed peer gossip from neighboring nodes;
+   - local cached last-known-good update state.
+3. Nodes validate intent locally before execution.
+4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
+   authority or expand scope.
+5. Slow polling remains as the final safety net.
+6. Subscription/hints are the fast path.
+7. Gossip is the partition/recovery path.
+8. Orchestrator-issued rollout leases are the concurrency guard.
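
Items 3 and 4 of the model above (local validation, no forged authority, no expanded scope) can be illustrated with a short Python sketch. It uses an HMAC from the standard library purely as a stand-in for whatever signature scheme the fabric actually uses; a shared-key MAC is weaker than the asymmetric signing a real fleet would need. The intent fields `generation` and `scope`, and both function names, are hypothetical.

```python
import hashlib
import hmac
import json


def sign_intent(intent: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the intent (HMAC-SHA256 stand-in)."""
    payload = json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def accept_intent(current_generation: int, intent: dict, signature: str,
                  key: bytes, node_id: str) -> bool:
    """Accept only an authentic, in-scope intent that is newer than the current one."""
    expected = sign_intent(intent, key)
    if not hmac.compare_digest(expected, signature):
        return False  # forged, or modified by a relaying neighbor
    if node_id not in intent.get("scope", []):
        return False  # intent does not target this node
    return intent.get("generation", -1) > current_generation
```

Because the signature covers the whole intent, a neighbor that relays it cannot widen `scope` without invalidating the signature. Accepting an intent is a separate step from applying it, which in this plan still waits for an orchestrator lease.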
+
+Orchestrator requirements:
+
+- canary, rolling, pinned, and forced-node strategies;
+- max parallel globally;
+- max parallel per area / site / NAT group;
+- max unavailable nodes;
+- pause/resume/abort;
+- failure-rate stop;
+- automatic stop on heartbeat loss or rollback;
+- role-aware scheduling for control-api, update-store, update-cache, relay,
+  ingress, and egress nodes;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for compat nodes that predate the orchestrator.
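
The "max parallel globally" and "max parallel per area" limits can be pictured as a lease table. The Python sketch below is one possible shape under stated assumptions: the `LeaseGuard` class and its methods are hypothetical, it tracks only areas (not sites or NAT groups), and it ignores lease expiry and persistence.

```python
from collections import Counter


class LeaseGuard:
    """In-memory rollout lease table with a global cap and a per-area cap."""

    def __init__(self, max_global: int, max_per_area: int):
        self.max_global = max_global
        self.max_per_area = max_per_area
        self.active = {}  # node_id -> area of every node currently holding a lease

    def try_acquire(self, node_id: str, area: str) -> bool:
        if node_id in self.active:
            return True  # re-requesting an already held lease is idempotent
        if len(self.active) >= self.max_global:
            return False
        if Counter(self.active.values())[area] >= self.max_per_area:
            return False
        self.active[node_id] = area
        return True

    def release(self, node_id: str) -> None:
        self.active.pop(node_id, None)
```

With `LeaseGuard(max_global=2, max_per_area=1)`, two nodes in the same area can never update at the same time, and a third node anywhere waits until one lease is released.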
+
+Node-side requirements:
+
+- accept `check now` subscription signals;
+- periodically poll as fallback;
+- accept newer signed update intents from peers;
+- keep a local update journal:
+  - pending intent generation;
+  - lease id;
+  - last accepted plan;
+  - staged artifact hash;
+  - previous binary / image;
+  - rollback state;
+  - admission failure reason;
+- reconcile stale updater runtime against current node/container/task state
+  before fetching plans;
+- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
+  and `aborted` states explicitly.
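
The explicit states listed above suggest a small state machine on the node side. The state names in this Python sketch come from the list; the allowed transitions are an assumption made for illustration and are not defined anywhere in this plan.

```python
# Assumed transition table; only the state names are taken from the plan.
ALLOWED_TRANSITIONS = {
    "blocked": {"leased", "aborted"},
    "leased": {"staged", "blocked", "aborted"},
    "staged": {"applying", "aborted"},
    "applying": {"healthy", "rolled_back"},
    "healthy": set(),
    "rolled_back": {"blocked"},
    "aborted": {"blocked"},
}


def advance(current: str, target: str) -> str:
    """Return the new state, or raise ValueError if the transition is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal update transition: {current} -> {target}")
    return target
```

Under this assumed table a node cannot reach `applying` without first passing through `leased` and `staged`, which matches the requirement that a node waits for an orchestrator lease before applying.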
+
+Done when:
+
+- a node can learn a new update intent without directly reaching the original
+  control edge;
+- a stale updater command line can be repaired from local running runtime state;
+- simultaneous farm-wide update start is impossible without explicit
+  recovery-admin override;
+- rollout can be paused and resumed without losing node intent state;
+- at least one test proves a node behind NAT receives an update signal through
+  a neighbor and still waits for an orchestrator lease before applying.
+
+## 4. Immediate next implementation sequence
+
+### Step A
+
+Release and roll out the noop-rewrite restart fix so that updated runtimes do
+not remain on stale control sessions after a config rewrite.
+
+### Step B
+
+Release and roll out the relay certificate intent fix so stale-relay
+replacement and bootstrap relay paths do not probe a relay endpoint with a
+certificate fingerprint copied from a different private direct candidate.
+
+This is tracked by:
+
+- `rap-node-agent 0.2.332-relaycertintentfix`
+
+Done when:
+
+- `peer certificate fingerprint mismatch` no longer appears on healthy
+  relay/bootstrap paths between live areas;
+- `ifcm` no longer loses ready peers because relay endpoint selection and peer
+  certificate pinning disagree.
+
+### Step B
+
+Re-check live heartbeat and stale-risk:
+
+- `compat_control_dependency_nodes`
+- `registry_candidate_only_nodes`
+- `updater_subscription_alert_nodes`
+- `updater_wake_unsupported_nodes`
+- `bridge_hold_required`
+- current control URL in heartbeat
+
+### Step C
+
+Continue registry activation work until active records are used in practice.
+
+### Step D
+
+Continue peer diversity work using:
+
+- `area`
+- direct-ready area coverage
+- independent ingress diversity
+
+### Step E
+
+Run another live audit and decide whether `19191/tcp` recovery overlap can be
+removed.
+
+## 5. Hard acceptance criteria
+
+The fabric is considered converged only when all of the following are true:
+
+1. Inter-node runtime transport is QUIC/UDP only.
+2. No live node depends on the compat `19191` control contract.
+3. Signed registry runtime is active.
+4. Nodes carry and use distributed node/service knowledge through signed
+   records and peer cache.
+5. Cross-area direct resilience targets are satisfied for critical nodes.
+6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
+   transport.
+
+## 6. This plan starts now
+
+The immediate active engineering task after writing this document is:
+
+- complete the rollout of the runtime rewrite restart fix;
+- remove the last live compat control dependency;
+- then move directly into signed registry activation and cross-area peer
+  resilience work.
+
+Update 2026-05-19:
+
+- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
+  heartbeat update hints: when a fresh hint generation arrives, the live
+  node-agent persists `update-trigger.json` and wakes the local updater
+  task/service.
+- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
+  failure where heartbeat remains fresh but the updater subscription plane is
+  dead.
+- As of the current rollout, this release is already on `home-*`, `test-*`,
+  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
+  `updater_wake_unsupported` blocker.
@@ -258,7 +258,7 @@ Production fabric-core migration boundary:
   QUIC endpoint candidates for the next hop, sends the envelope over the chosen
   QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
   failure or response timeout.
-- The legacy HTTP production forward carrier has been removed from the mesh
+- The compat HTTP production forward carrier has been removed from the mesh
   runtime API. Production forwarding now exposes a single QUIC transport
   implementation; HTTP handlers remain only as node-local API surfaces and test
   harness entry points.
@@ -287,7 +287,7 @@ Production fabric-core migration boundary:
 - Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
   instead of collapsing to one address: operator/public QUIC, listener QUIC,
   LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
-  `relay_quic` fallback. Candidate metadata carries `local_segment_id`,
+  `relay_quic` fallback. Candidate metadata carries `locality_group_id`,
   `nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
   `relay_endpoint` when configured. When a relay endpoint is the first physical
   QUIC hop, its advertised certificate fingerprint must survive route planning
@@ -296,23 +296,23 @@ Production fabric-core migration boundary:
 - Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
   `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
   health observations, latency, reliability, region, policy tags, and live
-  capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
+  capacity pressure; HTTP/WebSocket labels are treated as rejected compat
   candidates rather than alternate transports.
 - `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
   selection can return only `QUICFabricTransport`; unsupported labels fail with
   a QUIC-required error.
-- Explicit transport labels are authoritative. A legacy label such as `relay`
+- Explicit transport labels are authoritative. A compat label such as `relay`
   or `outbound_reverse` is rejected even when the endpoint string uses a
   `quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
-- Node-agent config loading rejects legacy advertised transport labels and
+- Node-agent config loading rejects compat advertised transport labels and
   HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
   fabric endpoints. Bad endpoint posture fails before heartbeat publication.
-- Host-agent install/runtime validation rejects legacy mesh advertise transport
+- Host-agent install/runtime validation rejects compat mesh advertise transport
   labels and HTTP/WebSocket advertise endpoints before they can be passed into a
   node-agent Docker runtime.
 - JSON-advertised endpoint candidates and scoped synthetic config route
   recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
-  seeds, and rendezvous leases reject legacy transport labels and
+  seeds, and rendezvous leases reject compat transport labels and
   HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
   entries.
 - Rendezvous relay leases and peer-connection intents now use `relay_quic` as
@@ -325,24 +325,24 @@ Production fabric-core migration boundary:
 - Node-agent synthetic runtime no longer installs an HTTP peer transport as an
   inter-node carrier, and the shared mesh runtime package no longer exports an
   HTTP peer transport implementation. Any HTTP synthetic motion is confined to
-  explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
+  explicit compat smoke harness code while fabric acceptance uses QUIC loadtest
   gates.
 - Control-plane and debug JSON mesh config loading is validated after
   conversion into runtime structures. Peer endpoint candidates, recovery seeds,
   rendezvous leases, and selected relay endpoints in route decisions must use
   QUIC labels/endpoints before they can update node runtime state.
-- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
+- Scoped synthetic mesh configs also reject compat `peer_endpoints` directly,
   in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
   rendezvous leases.
 - The old fabric-session WebSocket endpoint is no longer exposed by
-  `FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
+  `FabricSessionEnabled` alone. It requires an explicit compat test harness flag
   and is not part of the node-agent fabric transport surface.
 - Same local segment or same NAT group is treated as a LAN route by the planner,
   so a whole cluster piece behind one NAT can prefer private addresses between
   its own nodes while still maintaining outbound/relay visibility to the rest
   of the fabric.
 - Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
-  route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
+  route-set counts, QUIC candidate totals, rejected compat/non-QUIC candidate
   totals by transport label, route pressure, QUIC listener state, goroutines,
   heap usage, and the next recommended soak gate.
 - `FabricOverlayTransport` is the generic service-neutral send facade over
@@ -375,7 +375,7 @@ Production fabric-core migration boundary:
   healthy targets are present. A `mixed-public-nat-lan-relay` or
   `nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
   `reverse_quic`, and `relay_quic`.
-- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
+- Loadtest verdicts also fail on compat route-mode labels. Seeing `relay`,
   `outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
   `wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
   violation even if payload delivery succeeds.
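
The two loadtest rules in this hunk (required QUIC modes for the mixed profiles, and a hard failure on compat labels) can be expressed as a set check. This is a minimal Python sketch, not the loadtest implementation: the label sets are copied from the text above, while `route_mode_verdict` and its return shape are hypothetical.

```python
# Labels copied from the text above.
COMPAT_ROUTE_MODES = frozenset({
    "relay", "outbound_reverse", "direct_http", "direct_https",
    "direct_tcp_tls", "ws", "wss", "websocket",
})
REQUIRED_MIXED_MODES = frozenset({"lan_quic", "ice_quic", "reverse_quic", "relay_quic"})


def route_mode_verdict(observed_modes, require_mixed_coverage=False):
    """Fail on any compat label and, optionally, on missing required QUIC modes."""
    observed = set(observed_modes)
    violations = sorted(observed & COMPAT_ROUTE_MODES)
    missing = sorted(REQUIRED_MIXED_MODES - observed) if require_mixed_coverage else []
    return {
        "passed": not violations and not missing,
        "violations": violations,
        "missing": missing,
    }
```

A run that delivers every payload but reports `relay` in its route-mode telemetry still fails this check, which is the behavior the changed line describes.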
@@ -686,7 +686,7 @@ Production fabric-core migration boundary:
   `control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
   `route_pressure.active_total=0`, and matching acquire/release counts.
 - Verified strict QUIC route-mode gate:
-  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
+  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with compat
   route-mode verdicts and ran the 4-node mixed topology profile. It produced
   400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
   `reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
@@ -816,7 +816,7 @@ Production fabric-core migration boundary:
 - Published and registered node-agent release `0.2.280-fabricsession` with
   linux binary/native and Docker image artifacts. The release is intentionally
   not assigned to live node update policies yet because current live node
-  workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
+  workload/env posture still advertises compat `direct_http` and HTTP/HTTPS
   mesh endpoints. Before rollout, node configs must be migrated to
   `quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
   such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus

@@ -4,9 +4,22 @@ Status: live operational audit of the current fabric. This document records the
 real state observed on 2026-05-18 and explicitly calls out where runtime
 behavior still differs from the target architecture.

+The target layering model referenced by this audit is documented in
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
+The current execution sequence derived from this audit is maintained in
+[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).
+
 ## Current confirmed state

 - Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
+- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
+  ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
+  `500`, not because QUIC or registry bootstrap was broken, but because
+  PostgreSQL rejected `\u0000` inside heartbeat JSON with
+  `unsupported Unicode escape sequence (SQLSTATE 22P05)`.
+- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
+- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
+  converged onto the new release line with live registry promotion.
 - The active node set
   - `home-1`
   - `home-2`
@@ -16,9 +29,40 @@ behavior still differs from the target architecture.
   - `test-3`
   - `usa-los-1`
   - `ifcm-rufms-s-mo1cr`
-  is converged on `0.2.321-directreadytarget`.
+  currently spans:
+  - `home-*`, `test-*`, and `usa-los-1` on
+    `0.2.327-registrybootstraprewrite`;
+  - `ifcm-rufms-s-mo1cr` still remaining on
+    `0.2.322-controlendpointsrewrite`.
 - `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
   no longer stale.
+- `ifcm-rufms-s-mo1cr` has already migrated off compat overlap
+  `http://vpn.cin.su:19191/api/v1` and now reports
+  `https://vpn.cin.su/api/v1`, but it still has not advanced to the new
+  registry-aware release line.
+- `home-*` and `test-*` now report:
+  - `reported_version = 0.2.327-registrybootstraprewrite`
+  - `peer_cache_peers = 7`
+  - `fabric_registry_runtime_report.status = active`
+- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
+  reports `fabric_registry_runtime_report.status = missing`, which means this
+  node remains a runtime bootstrap/config rewrite gap rather than a version-gap.
+- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
+  `home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
+  again instead of falling back to the thin `c3` payload.
+- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
+  - `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
+  - `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
+  - `direct_peer_alert_nodes = 5`
+  - `area_diversity_alert_nodes = 6`
+- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
+  rollout still shows:
+  - `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
+  - `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
+  - `peer_cache_peers = 7`
+  - latest update status still stuck at `2026-05-18 20:50 UTC`
+  - this is now classified as `updater_wake_unsupported`, not just a generic
+    stale or compat-control symptom
|
||||
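The `\u0000` ingestion fix described above can be illustrated with a minimal sketch. It assumes the heartbeat JSON has already been decoded into Python values; the backend's real language and sanitizer are not shown in this change, and `strip_nul` is a hypothetical helper name.

```python
from typing import Any

NUL = "\u0000"  # PostgreSQL text/jsonb cannot store this code point (SQLSTATE 22P05)


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string in a decoded JSON value."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings in JSON, so they are cleaned as well.
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged
```

Cleaning the decoded value rather than the raw request text avoids having to tell a real `\u0000` escape apart from an escaped backslash followed by `u0000`.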
|
||||
## Why TCP traffic is still visible
|
||||
|
||||
@@ -35,7 +79,7 @@ Observed live listeners:
|
||||
- `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
|
||||
- `usa-los-1`
|
||||
- `19131/udp` - QUIC fabric listener
|
||||
- `19191/tcp` - external compatibility bridge currently held open so legacy
|
||||
- `19191/tcp` - external compatibility bridge currently held open so compat
|
||||
recovery contracts can still reach `Control API/downloads`
|
||||
|
||||
Therefore:
|
||||
@@ -49,7 +93,8 @@ Therefore:
|
||||
|
||||
### 1. Nodes do not yet operate from a fully active signed registry gossip plane
|
||||
|
||||
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
|
||||
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
|
||||
refresh:
|
||||
|
||||
- `fabric_registry_runtime_report.status = candidate_only`
|
||||
- `resolved_service_count = 0`
|
||||
@@ -61,11 +106,11 @@ This means the current runtime still depends on compatibility control URLs more
|
||||
than the target architecture allows. The node is alive in the fabric, but not
|
||||
yet operating from a fully resolved active registry view.
|
||||
|
||||
### 2. Legacy control/download contracts are still real dependencies
|
||||
### 2. Compat control/download contracts are still real dependencies
|
||||
|
||||
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:
|
||||
|
||||
- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
|
||||
- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
|
||||
|
||||
This confirms the root recovery lesson:
|
||||
|
||||
@@ -77,15 +122,31 @@ This confirms the root recovery lesson:
|
||||
|
||||
### 3. Direct peer resilience is still below the intended threshold
|
||||
|
||||
Observed from live heartbeat metadata:
|
||||
Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:
|
||||
|
||||
- `ifcm-rufms-s-mo1cr`
|
||||
- `peer_connection_ready = 2`
|
||||
- `peer_connection_relay_ready = 3`
|
||||
- `target_ready_peers = 3`
|
||||
- `home-1`
|
||||
- `peer_connection_ready = 1`
|
||||
- `direct_ready_areas = [usa]`
|
||||
- `external_area_ready_count = 1/2`
|
||||
- `home-2`
|
||||
- `peer_connection_ready = 1`
|
||||
- `direct_ready_areas = [usa]`
|
||||
- `external_area_ready_count = 1/2`
|
||||
- `home-3`
|
||||
- `peer_connection_ready = 1`
|
||||
- `direct_ready_areas = [usa]`
|
||||
- `external_area_ready_count = 1/2`
|
||||
- `test-1/2/3`
|
||||
- `peer_connection_ready = 3`
|
||||
- but `direct_ready_areas = [usa]`
|
||||
- therefore each still triggers `external_area_deficit:1_of_2`
|
||||
- `usa-los-1`
|
||||
- `peer_connection_ready = 1`
|
||||
- `peer_connection_relay_ready = 5`
|
||||
- `direct_ready_areas = [ifcm, home, test]`
|
||||
- `target_ready_peers = 3`
|
||||
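The `external_area_deficit:1_of_2` alert seen in the snapshot above can be sketched as follows. This is an illustration only: the function name and the default of two required external areas are assumptions inferred from the `1/2` values in the snapshot, not the runtime's actual code.

```python
from typing import Iterable, Optional


def external_area_alert(
    own_area: str,
    direct_ready_areas: Iterable[str],
    required: int = 2,  # assumed from the "1/2" values reported above
) -> Optional[str]:
    """Return an alert string when too few *other* areas are direct-ready."""
    external = {area for area in direct_ready_areas if area != own_area}
    if len(external) >= required:
        return None
    return f"external_area_deficit:{len(external)}_of_{required}"
```

With the snapshot values, a `home` node that sees only `[usa]` raises the alert, while `usa-los-1` with `[ifcm, home, test]` does not.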
|
||||
This means the direct-path resilience target is not satisfied yet, even though
|
||||
@@ -99,17 +160,35 @@ The practical reason is simple:
|
||||
- relay-ready adjacency is masking direct peer deficit, but it does not replace
|
||||
the requirement for at least three direct-ready peers.
|
||||
|
||||
### 3.1 Public endpoint confirmation must be cross-area, not local hairpin
|
||||
|
||||
The live `home/test` topology also exposed a verification mistake in the
|
||||
runtime model:
|
||||
|
||||
- `home` and `test` sit behind the same public router address
|
||||
`94.141.118.222`;
|
||||
- some public QUIC candidates are valid only when tested from another area such
|
||||
as `usa` or `ifcm`;
|
||||
- a same-area probe can fail purely because the local router does not support
|
||||
hairpin NAT / NAT reflection.
|
||||
|
||||
Operational consequence:
|
||||
|
||||
- a public endpoint marked as `external-network-required` must be treated as
|
||||
non-authoritative when the failure came from `self` or `same_area`;
|
||||
- the public candidate should be confirmed or rejected by `cross_area`
|
||||
observers instead.
|
||||
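A minimal sketch of the cross-area confirmation rule above. The observer scope names `self`, `same_area`, and `cross_area` come from the text; the verdict labels and the "any cross-area success confirms" policy are assumptions made for illustration.

```python
from typing import Iterable, Tuple

# (observer_scope, probe_succeeded)
Probe = Tuple[str, bool]


def public_endpoint_verdict(probes: Iterable[Probe]) -> str:
    """Decide a public candidate using cross-area observers only."""
    cross_area_results = [ok for scope, ok in probes if scope == "cross_area"]
    if any(cross_area_results):
        return "confirmed"
    if cross_area_results:
        return "rejected"  # at least one cross-area probe ran and all failed
    # Only self / same_area evidence: a failure may just be missing hairpin NAT.
    return "unconfirmed"
```

A failed probe from `home` against another `home` or `test` public candidate therefore never rejects the endpoint by itself.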
|
||||
### 4. Observability is still heterogeneous
|
||||
|
||||
Live heartbeat coverage is inconsistent:
|
||||
Live heartbeat coverage is now richer than it was earlier in the day, but it is
|
||||
still not fully converged in behavior:
|
||||
|
||||
- `test-*`, `ifcm`, `usa-los-1` emit rich `c17z20` heartbeat metadata with
|
||||
endpoint, peer recovery, and registry sections.
|
||||
- `home-*` currently do not expose the same full sections in their latest
|
||||
heartbeat rows.
|
||||
|
||||
This means operator visibility is uneven and the documentation must not imply
|
||||
uniform live introspection across every node today.
|
||||
- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
|
||||
peer recovery, and registry sections again.
|
||||
- `ifcm` is still the only node that currently reports `compat control` and
|
||||
`registry candidate_only`, so the observability gap has narrowed into a real
|
||||
single-node convergence issue instead of a fleet-wide blind spot.
|
||||
|
||||
## What is true right now
|
||||
|
||||
@@ -117,21 +196,63 @@ uniform live introspection across every node today.
|
||||
2. QUIC/UDP is the actual node-to-node transport.
|
||||
3. Compatibility `19191/tcp` is still required for recovery overlap.
|
||||
4. Signed registry gossip is not yet the sole active discovery/control source.
|
||||
5. The "at least 3 direct-ready peers per node" resilience target is not yet
|
||||
met for all externally significant nodes.
|
||||
5. `ifcm` still depends on the compat `19191` control overlap.
|
||||
6. The plain `3 direct peers` target is insufficient on its own; the live fleet
|
||||
now clearly shows that `cross-area direct diversity` is the next real gate.
|
||||
|
||||
## Control/API migration progress
|
||||
|
||||
The codebase now carries a more explicit migration contract for control access:
|
||||
|
||||
- install profiles prefer canonical `control_plane_endpoints` over a compat
|
||||
singleton `backend_url`;
|
||||
- host runtime env generation now exports
|
||||
removed control-plane endpoint env key;
|
||||
- node heartbeat/control reporting prefers that canonical endpoint set when it
|
||||
is present.
|
||||
- stale updater status behind a fresh heartbeat is now classified separately as
|
||||
`updater_subscription_gap`;
|
||||
- heartbeat update hints now have a second-stage recovery path: after writing
|
||||
`update-trigger.json`, a live node can also wake its local updater
|
||||
task/service.
|
||||
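The separate classification of a stale updater behind a fresh heartbeat can be sketched like this. The thresholds and the `stale_heartbeat` / `ok` labels are hypothetical; only `updater_subscription_gap` is taken from the text above.

```python
from datetime import datetime, timedelta, timezone


def classify_updater(
    heartbeat_at: datetime,
    update_status_at: datetime,
    now: datetime,
    heartbeat_fresh: timedelta = timedelta(minutes=5),  # assumed threshold
    updater_stale: timedelta = timedelta(hours=1),      # assumed threshold
) -> str:
    """Separate 'node is silent' from 'node is alive but its updater is stuck'."""
    heartbeat_is_fresh = now - heartbeat_at <= heartbeat_fresh
    updater_is_stale = now - update_status_at > updater_stale
    if not heartbeat_is_fresh:
        return "stale_heartbeat"
    if updater_is_stale:
        return "updater_subscription_gap"
    return "ok"
```

Using the `ifcm-rufms-s-mo1cr` timestamps from this document (heartbeat at `23:08:44 UTC`, last update status at `20:50 UTC`), the sketch yields `updater_subscription_gap` when evaluated shortly after the heartbeat.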
|
||||
This does not instantly rewrite older runtime wrappers on already-installed
|
||||
nodes by itself. It does remove the same trap for the next install, reinstall,
|
||||
or update-service rewrite cycle.
|
||||
|
||||
## Operational rule until the next audit
|
||||
|
||||
Do not remove the compatibility `19191/tcp` recovery overlap while any of the
|
||||
following remain true:
|
||||
|
||||
- any live node still reports a `control_plane_url` on the `19191` contract;
|
||||
- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
|
||||
- any live node has `fabric_registry_runtime_report.status != active`;
|
||||
- any externally significant node has fewer than 3 direct-ready peers;
|
||||
- any node can only recover through legacy `Control API/downloads` overlap.
|
||||
- any node can only recover through compat `Control API/downloads` overlap.
|
||||
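The removal conditions above could be checked mechanically before anyone closes `19191/tcp`. The sketch below is a hedged illustration: the flat per-node dictionary and the keys `registry_status`, `externally_significant`, `direct_ready_peers`, and `compat_only_recovery` are hypothetical simplifications of the heartbeat fields named in the list.

```python
from typing import Dict, List

COMPAT_MARKER = ":19191"  # the compat contract is identified here by its port


def compat_overlap_blockers(nodes: List[Dict]) -> List[str]:
    """Return every reason the 19191 recovery overlap must stay open."""
    blockers: List[str] = []
    for node in nodes:
        name = node["name"]
        if COMPAT_MARKER in node.get("control_plane_url", ""):
            blockers.append(f"{name}: control_plane_url on 19191")
        if COMPAT_MARKER in node.get("fabric_control_endpoint", ""):
            blockers.append(f"{name}: fabric_control_endpoint on 19191")
        if node.get("registry_status") != "active":
            blockers.append(f"{name}: registry status not active")
        if node.get("externally_significant") and node.get("direct_ready_peers", 0) < 3:
            blockers.append(f"{name}: fewer than 3 direct-ready peers")
        if node.get("compat_only_recovery"):
            blockers.append(f"{name}: recovers only through compat overlap")
    return blockers
```

An empty result means none of the listed conditions holds; any entry is a reason to keep the overlap.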
|
||||
## Required next work
|
||||
|
||||
Update 2026-05-19:
|
||||
|
||||
- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
|
||||
wake path driven by heartbeat update hints.
|
||||
- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
|
||||
  sending fresh heartbeats while the updater subscription plane silently stops
|
||||
progressing.
|
||||
- This is now treated as a first-class recovery-plane problem, not as a vague
|
||||
stale-node symptom.
|
||||
- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
|
||||
`0.2.325-updatehintwake`.
|
||||
- `ifcm-rufms-s-mo1cr` is now the only remaining
|
||||
`updater_wake_unsupported` blocker.
|
||||
- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
|
||||
defect: on one stale-relay/bootstrap path the node tried a relay endpoint
|
||||
with the certificate fingerprint from a different private direct candidate,
|
||||
producing
|
||||
`CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
|
||||
- That bug is now fixed in the runtime line tracked as
|
||||
`0.2.332-relaycertintentfix`.
|
||||
|
||||
### A. Finish signed registry activation
|
||||
|
||||
Each node must be able to resolve active records for at least:
|
||||
|
||||
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
|
||||
The fabric must be able to lose:
|
||||
|
||||
- old API endpoints;
|
||||
- old artifact URLs;
|
||||
- old artifact distributors;
|
||||
- previous public IP addresses;
|
||||
- previous NAT mappings;
|
||||
- previous relay nodes;
|
||||
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
|
||||
is true:
|
||||
|
||||
1. every node has confirmed the new contract; or
|
||||
2. the missing nodes were manually retired, revoked, or explicitly accepted as
|
||||
2. the missing nodes were manually removed, revoked, or explicitly accepted as
|
||||
lost.
|
||||
|
||||
This applies to:
|
||||
@@ -81,6 +81,17 @@ This applies to:
|
||||
- host-agent / updater runtime contracts;
|
||||
- control endpoints needed only for migration.
|
||||
|
||||
Canonical `Control API` access must be distributable as an explicit endpoint
|
||||
set, not only as a single compat `backend_url`. Install/update contracts should
|
||||
carry:
|
||||
|
||||
- `control_plane_endpoints`;
|
||||
- signed fabric registry bootstrap records;
|
||||
- artifact endpoints.
|
||||
|
||||
The old `backend_url` remains a compatibility fallback only until the fleet has
|
||||
converged.
|
||||
|
||||
The rule is strict: do not delete the old recovery format while nodes that may
|
||||
still need it remain unrecovered.
|
||||
|
||||
@@ -200,6 +211,67 @@ Required model:
|
||||
- signals are idempotent;
|
||||
- signals do not require the old control endpoint to remain alive.
|
||||
|
||||
### 3.7 Update Intent Must Be Independent From One Updater Endpoint
|
||||
|
||||
A node must not be permanently bound to one updater service, one updater node,
|
||||
one systemd unit name, one scheduled task name, or one control endpoint.
|
||||
|
||||
The durable object is not "call this updater URL". The durable object is a
|
||||
signed update intent:
|
||||
|
||||
- product;
|
||||
- target version or version constraint;
|
||||
- artifact hashes and allowed mirrors;
|
||||
- compatibility contract;
|
||||
- rollout lease constraints;
|
||||
- force / emergency flags;
|
||||
- rollback permission;
|
||||
- signed registry/service records that can carry the intent;
|
||||
- expiry and generation.
|
||||
|
||||
A node may learn the same signed intent from:
|
||||
|
||||
- Control API;
|
||||
- update-store;
|
||||
- update-cache;
|
||||
- long-lived outbound control subscription;
|
||||
- neighboring nodes through signed fabric registry gossip;
|
||||
- local cached last-known-good update state.
|
||||
|
||||
The receiving node must validate the intent locally before acting. A neighbor
|
||||
may relay signed update metadata and artifacts, but it must not become an
|
||||
authority that can forge or broaden an update.
|
||||
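Local validation of a relayed intent can be sketched as below. This is a sketch under stated assumptions, not the fabric's actual scheme: HMAC-SHA256 stands in for whatever signature the fabric really uses (most likely an asymmetric one), and the field names `expires_at` and `generation` are hypothetical spellings of the "expiry and generation" item above.

```python
import hashlib
import hmac
import json


def canonical_bytes(intent: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash the same bytes."""
    return json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()


def sign(intent: dict, key: bytes) -> str:
    # Stand-in only: a shared-key MAC, not the real signature scheme.
    return hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()


def validate_intent(intent: dict, signature: str, key: bytes,
                    last_generation: int, now: float) -> str:
    """Accept an intent only if it is authentic, unexpired, and newer."""
    if not hmac.compare_digest(sign(intent, key), signature):
        return "rejected: bad signature"
    if intent["expires_at"] <= now:
        return "rejected: expired"
    if intent["generation"] <= last_generation:
        return "rejected: stale generation"
    return "accepted"
```

Because the check covers the whole serialized intent, a neighbor that relays it cannot change the target version or widen the mirror list without invalidating the signature.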
|
||||
The local recovery boundary must reconcile stale runtime facts before fetching
|
||||
or applying a plan:
|
||||
|
||||
- current cluster id;
|
||||
- node id and identity state directory;
|
||||
- current container/task/unit name;
|
||||
- current control endpoints;
|
||||
- current signed registry records;
|
||||
- available artifact mirrors.
|
||||
|
||||
This is mandatory because a node may move, a container may be renamed, a task
|
||||
may be recreated, or the old host updater may still have a stale command line.
|
||||
|
||||
### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
|
||||
|
||||
The update plane must use three delivery paths at the same time:
|
||||
|
||||
1. slow local fallback polling, so a node eventually recovers even after missed
|
||||
signals;
|
||||
2. subscription / push hints, so ordinary updates are fast and do not wait for
|
||||
a long poll interval;
|
||||
3. peer relay of signed update intents and signed registry records, so a node
|
||||
can learn current update truth through reachable neighbors when the old
|
||||
center or old ingress is unavailable.
|
||||
|
||||
No single path may be the only recovery mechanism.
|
||||
|
||||
Polling cadence is a safety net, not the rollout control mechanism. Rollout
|
||||
control belongs to the orchestrator and signed rollout leases.
|
||||
|
||||
## 4. Update Safety Rules
|
||||
|
||||
### 4.1 Upgrade Contracts
|
||||
@@ -228,7 +300,7 @@ explicit retirement.
|
||||
Recovery-critical artifact versions must remain available until:
|
||||
|
||||
- all nodes have moved past them; or
|
||||
- the remaining nodes are revoked/retired and recorded as intentionally lost.
|
||||
- the remaining nodes are revoked/removed and recorded as intentionally lost.
|
||||
|
||||
Do not garbage-collect the last working host-agent or node-agent build for an
|
||||
unrecovered population.
|
||||
@@ -237,17 +309,18 @@ unrecovered population.
|
||||
|
||||
If historical nodes request different install types for the same product
|
||||
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
|
||||
planning must keep compatibility aliases until the fleet converges.
|
||||
planning must publish explicit signed install-type mappings in the fabric
|
||||
registry until the fleet converges.
|
||||
|
||||
The fabric must not strand nodes on an install-type naming mismatch.
|
||||
|
||||
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
|
||||
### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk
|
||||
|
||||
A stale node may report:
|
||||
|
||||
- a compatible recovery artifact exists under the current registry; but
|
||||
- the last local updater/host-agent status still says `no_matching_artifact` or
|
||||
an equivalent legacy contract failure.
|
||||
an equivalent compat contract failure.
|
||||
|
||||
This means the node is not only waiting for a heartbeat. It is running an older
|
||||
recovery planner contract and may still depend on:
|
||||
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
|
||||
- older update-plan interpretation rules;
|
||||
- overlap in signed registry / bootstrap envelopes.
|
||||
|
||||
This condition must be classified as `legacy recovery contract drift` and must
|
||||
This condition must be classified as `compat recovery contract drift` and must
|
||||
block compatibility removal the same way an artifact gap does.
|
||||
|
||||
Operationally this also means:
|
||||
@@ -268,11 +341,11 @@ Operationally this also means:
|
||||
status on the current contract or the operator explicitly retires the node;
|
||||
- when a compatible artifact and target mapping already exist, the node should
|
||||
be classified as `bridge replay ready`, meaning the system can replay the
|
||||
legacy-compatible update plan as soon as the node regains an outbound control
|
||||
  compat update plan as soon as the node regains an outbound control
|
||||
cycle;
|
||||
- operator tooling should expose a canonical `bridge replay plan` per node so
|
||||
recovery replay uses the same signed update-plan logic as normal updates;
|
||||
- compatibility aliases / overlap must remain enabled for that node population;
|
||||
- signed recovery mappings must remain available for that node population;
|
||||
- dashboards and rollout guards must show this separately from ordinary
|
||||
`waiting recovery heartbeat`.
|
||||
|
||||
@@ -281,9 +354,78 @@ Canonical example:
|
||||
- `ifcm-rufms-s-mo1cr` is stale;
|
||||
- the current backend can match a Windows-compatible host-agent artifact;
|
||||
- the last host-agent report still says `no_matching_artifact`;
|
||||
- therefore the node must be treated as a legacy recovery-contract blocker, not
|
||||
- therefore the node must be treated as a compat recovery-contract blocker, not
|
||||
merely as a delayed heartbeat.
|
||||
|
||||
### 4.6 Rollout Orchestrator Is Mandatory
|
||||
|
||||
Large fleet update safety requires an orchestrator. The orchestrator decides
|
||||
which nodes may update now. Nodes decide whether a received signed intent is
|
||||
valid and locally safe to execute.
|
||||
|
||||
The orchestrator must support:
|
||||
|
||||
- canary rollout;
|
||||
- rolling rollout;
|
||||
- area / site / NAT-group aware rollout;
|
||||
- max parallel updates globally;
|
||||
- max parallel updates per area;
|
||||
- max unavailable nodes;
|
||||
- minimum healthy quorum before continuing;
|
||||
- hold / pause / resume;
|
||||
- force update for explicitly selected nodes;
|
||||
- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
|
||||
regression;
|
||||
- separate host-agent and node-agent phases;
|
||||
- emergency recovery bridge for pre-orchestrator compat nodes.
|
||||
|
||||
The orchestrator must issue short-lived rollout leases. A node may only start an
|
||||
update when it holds a valid lease for that product/version. If the lease
|
||||
expires before apply starts, the node must re-check the policy.
|
||||
|
||||
Rollout leases prevent the entire farm from starting the same update
|
||||
simultaneously when a subscription signal or gossip wave reaches all nodes.
|
||||
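The lease gate described above can be sketched as a small node-side check. The `RolloutLease` shape and its field names are hypothetical; the text only requires that a lease is short-lived and bound to a product/version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutLease:
    node_id: str
    product: str
    version: str
    expires_at: float  # seconds since epoch; an assumed representation


def may_start_update(lease: Optional[RolloutLease], node_id: str,
                     product: str, version: str, now: float) -> bool:
    """A node may begin apply only while holding a matching, unexpired lease."""
    if lease is None:
        return False
    return (
        lease.node_id == node_id
        and lease.product == product
        and lease.version == version
        and now < lease.expires_at
    )
```

When this returns `False` because the lease expired before apply started, the node goes back to the policy check rather than proceeding on the old grant.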
|
||||
### 4.7 Node-Side Update Admission Control
|
||||
|
||||
Even with a lease, the node must perform local admission checks before apply:
|
||||
|
||||
- artifact hash and signature match the signed intent;
|
||||
- rollback artifact or previous binary is available unless policy explicitly
|
||||
disables rollback;
|
||||
- enough disk space exists for stage plus rollback;
|
||||
- current active workload can tolerate restart, or orchestrator granted a
|
||||
maintenance lease;
|
||||
- the node still has at least the required recovery connectivity after
|
||||
excluding itself as temporarily unavailable;
|
||||
- host-agent update is applied before node-agent update when the contract says
|
||||
the host-agent is the recovery floor.
|
||||
|
||||
If admission fails, the node reports `blocked` with a precise reason instead of
|
||||
silently waiting.
|
||||
|
||||
### 4.8 Update Waves Must Preserve Failure-Domain Diversity
|
||||
|
||||
An update wave must not take down all nodes from the same recovery role or
|
||||
failure domain at once.
|
||||
|
||||
The orchestrator must account for:
|
||||
|
||||
- area;
|
||||
- site;
|
||||
- locality group;
|
||||
- NAT group;
|
||||
- public ingress dependency;
|
||||
- control-api role;
|
||||
- update-store / update-cache role;
|
||||
- relay / rendezvous role;
|
||||
- VPN ingress / egress roles;
|
||||
- nodes that are currently the only known recovery path for another node.
|
||||
|
||||
For a small fleet, this means the orchestrator may update one node at a time
|
||||
when the remaining diversity is weak, even if the global max parallel setting
|
||||
is higher.
|
||||
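A minimal sketch of wave selection that preserves failure-domain diversity. It considers only the area and the "sole recovery path" constraint from the list above; the function name, the `max_per_area` default, and the input shape are assumptions, and a real orchestrator would weigh the other listed domains too.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def next_wave(
    pending: Iterable[Tuple[str, str]],  # (node_id, area), in rollout order
    max_parallel: int,
    max_per_area: int = 1,
    sole_recovery_paths: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Pick nodes for the next wave without draining any single area."""
    wave: List[str] = []
    per_area: Dict[str, int] = {}
    for node_id, area in pending:
        if len(wave) >= max_parallel:
            break
        if node_id in sole_recovery_paths:
            continue  # never restart the only known recovery path of another node
        if per_area.get(area, 0) >= max_per_area:
            continue  # this area already has a node in the wave
        wave.append(node_id)
        per_area[area] = per_area.get(area, 0) + 1
    return wave
```

With `max_per_area=1`, a global `max_parallel` of 3 still updates at most one `home` node at a time, which matches the small-fleet behavior described above.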
|
||||
## 5. Service And Location Mobility Rules
|
||||
|
||||
Moving a service must not strand nodes that only know the old location.
|
||||
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
|
||||
- node reboots during update;
|
||||
- only one peer still knows the new registry truth;
|
||||
- node is partitioned for a long time and rejoins later;
|
||||
- platform removes legacy support too early;
|
||||
- platform removes compat support too early;
|
||||
- operator has no shell/RDP/WinRM/SSH access to the host.
|
||||
|
||||
## 7. Required Local State And Journaling
|
||||
@@ -359,7 +501,7 @@ It must surface:
|
||||
- nodes with stale heartbeat but recent updater activity;
|
||||
- nodes with no working compatible recovery artifact;
|
||||
- nodes whose pinned registry/bootstrap epoch is too old;
|
||||
- nodes whose only known artifact URL is dead;
|
||||
- nodes whose only known artifact distributor is dead;
|
||||
- nodes whose desired state requires a contract they cannot parse;
|
||||
- nodes whose local agent version is below the minimum recovery floor;
|
||||
- nodes whose last successful contact depended on a single service replica.
|
||||
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
|
||||
7. install type aliases remain for historical agents where needed;
|
||||
8. NAT/passive/outbound-only nodes were explicitly tested;
|
||||
9. stale-node risk report is empty or consciously accepted by recovery-admin;
|
||||
10. removal of legacy support is documented with the exact cutoff conditions.
|
||||
10. removal of compat support is documented with the exact cutoff conditions.
|
||||
|
||||
## 10. `ifcm-rufms-s-mo1cr` Rule
|
||||
|
||||
@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
|
||||
- signed registry retention and overlap checks before endpoint migration;
|
||||
- compatibility alias coverage for historical install types;
|
||||
- artifact availability health over all mirrors;
|
||||
- stale-node risk dashboard/report before legacy removal;
|
||||
- stale-node risk dashboard/report before compat cleanup;
|
||||
- node-local journaling for last good registry/update state;
|
||||
- neighbor-assisted artifact relay path;
|
||||
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
|
||||
|
||||
@@ -344,7 +344,7 @@ The first backend contract slice is implemented:
|
||||
- Fenced routes are not returned as primary or alternate route candidates in a
|
||||
service-channel lease. If every route for the selected entry/exit pair is
|
||||
fenced by service-channel feedback, the lease enters explicit degraded
|
||||
backend fallback with reason
|
||||
compat fallback with reason
|
||||
`fabric_routes_fenced_by_service_channel_feedback`.
|
||||
- A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
|
||||
`vpn_packets` route intents, injected fresh service-channel flow feedback
|
||||
@@ -507,18 +507,18 @@ The first backend contract slice is implemented:
|
||||
post-restart exit inbox depth from `0` to `88` with zero inbox drops.
|
||||
- C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
|
||||
Node-agent `0.2.183` validates the signed service-channel lease authority and
|
||||
forces backend fallback when Control Plane has signed
|
||||
forces compat fallback when Control Plane has signed
|
||||
`status=degraded_fallback` or `primary_route.status=missing_route_intent`.
|
||||
This prevents a node from ignoring the lease decision and accidentally using
|
||||
older generic route candidates for the same VPN resource. The rule applies to
|
||||
both HTTP packet ingress and WebSocket packet ingress. The live smoke
|
||||
`scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
|
||||
proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
|
||||
recovery while a lease exists, explicit backend fallback when no authorized
|
||||
recovery while a lease exists, explicit compat fallback when no authorized
|
||||
fabric route exists, and route-intent expiry. The passing artifact is
|
||||
`artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
|
||||
run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
|
||||
recovery `4/4`, and moved the degraded backend fallback queue from `0` to
|
||||
recovery `4/4`, and moved the degraded compat fallback queue from `0` to
|
||||
`8`.
|
||||
- C18Z4 adds live long-session pressure coverage without another runtime
|
||||
release. The script
|
||||
@@ -529,7 +529,7 @@ The first backend contract slice is implemented:
|
||||
alternate route. The passing artifact is
|
||||
`artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
|
||||
run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
|
||||
route failure delta `0`, flow drop delta `0`, and backend fallback queue
|
||||
route failure delta `0`, flow drop delta `0`, and compat fallback queue
|
||||
`0 -> 0`. This proves route-policy churn can be absorbed by the shared
|
||||
fabric runtime while a service WebSocket remains active.
|
||||
- C18Z5 adds live exit-node failure coverage while the same kind of service
|
||||
@@ -540,7 +540,7 @@ The first backend contract slice is implemented:
|
||||
the same signed WebSocket. The passing artifact is
|
||||
`artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
|
||||
`c18z5-20260507-213745` sent 480 packets total, observed route failure delta
|
||||
`48`, backend fallback queue `0 -> 192`, flow drop delta `0`, and recovery
|
||||
`48`, compat fallback queue `0 -> 192`, flow drop delta `0`, and recovery
|
||||
exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
|
||||
degraded/fallback telemetry and fabric delivery resumes after runtime
|
||||
recovery without requiring the service connection to be rebuilt.
|
||||
@@ -554,7 +554,7 @@ The first backend contract slice is implemented:
|
||||
`artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
|
||||
`c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
|
||||
inbox, selected the replacement route, kept route failure delta `0`, flow
|
||||
drop delta `0`, and backend fallback queue `0 -> 0`. This proves route-manager
|
||||
drop delta `0`, and compat fallback queue `0 -> 0`. This proves route-manager
|
||||
replacement can be applied under an active service session without requiring
|
||||
the service connection to be recreated.
|
||||
- C18Z7 adds concurrent service-session isolation coverage. The script
|
||||
@@ -565,7 +565,7 @@ The first backend contract slice is implemented:
|
||||
`applied_rebuild`, then continues all sessions. The passing artifact is
|
||||
`artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
|
||||
run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
|
||||
session, with total backend fallback delta `0`, route failure delta `0`, and
|
||||
session, with total compat fallback delta `0`, route failure delta `0`, and
|
||||
flow drop delta `0`. This proves concurrent service sessions keep separate
|
||||
resource queues and are not starved or poisoned by a shared route-manager
|
||||
rebuild.
|
||||
@@ -579,7 +579,7 @@ The first backend contract slice is implemented:
|
||||
run `c18z8-20260507-221347` delivered 192 packets per interactive session,
|
||||
hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
|
||||
hottest channel, dropped `282` packets on that overloaded channel, and kept
|
||||
backend fallback delta `0` and route failure delta `0`. This proves bounded
|
||||
compat fallback delta `0` and route failure delta `0`. This proves bounded
|
||||
queue pressure is service-neutral, observable, and isolated to the overloaded
|
||||
logical flow without starving other active sessions.
|
||||
- C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
|
||||
@@ -593,7 +593,7 @@ The first backend contract slice is implemented:
|
||||
node-agent `applied_rebuild`, and verifies the same service session continues
|
||||
over the fast route. The passing artifact is
|
||||
`artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
|
||||
`c18z9-20260507-224901` kept backend fallback delta `0`, route failure delta
|
||||
`c18z9-20260507-224901` kept compat fallback delta `0`, route failure delta
|
||||
`0`, and flow drop delta `0`.
|
||||
- C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
|
||||
`0.2.185` binds signed entry/exit pools into the service-channel lease
|
||||
@@ -610,7 +610,7 @@ The first backend contract slice is implemented:
|
||||
`applied_rebuild`, and verifies 288 packets land on the alternate exit. The
|
||||
passing artifact is
|
||||
`artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
|
||||
`c18z10-20260507-232645` kept backend fallback `0`, route failure delta `0`,
|
||||
`c18z10-20260507-232645` kept compat fallback `0`, route failure delta `0`,
|
||||
and flow drop delta `0`.
|
||||
- C18Z11 adds service-channel entry-pool failover contract coverage. Backend
|
||||
`rap-backend:fabric-service-channel-0.2.186` keeps
|
||||
@@ -675,7 +675,7 @@ The first backend contract slice is implemented:
|
||||
continues on the learned fast route. The passing artifact is
|
||||
`artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
|
||||
run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
|
||||
packets to the exit, kept backend fallback `0`, flow drops `0`, and expired
|
||||
packets to the exit, kept compat fallback `0`, flow drops `0`, and expired
|
||||
temporary route intents.
|
||||
- C18Z15 exposes and hardens effective route-quality preference telemetry.
|
||||
Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
|
||||
@@ -690,7 +690,7 @@ The first backend contract slice is implemented:
|
||||
passing artifact is
|
||||
`artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
|
||||
run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
|
||||
packets to the exit, kept backend fallback `0`, flow drops `0`, and exposed
|
||||
packets to the exit, kept compat fallback `0`, flow drops `0`, and exposed
|
||||
decayed effective scores in node telemetry.
|
||||
- C18Z16 adds per-channel route-quality preference telemetry and fairness
|
||||
guardrails. Node-agent `0.2.191` records the applied
|
||||
@@ -704,7 +704,7 @@ The first backend contract slice is implemented:
|
||||
`artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
|
||||
run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
|
||||
logical channels, applied quality preference telemetry to all 32 served
|
||||
channels, kept backend fallback `0`, and flow drops `0`.
|
||||
channels, kept compat fallback `0`, and flow drops `0`.
|
||||
- C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
|
||||
removes channel-level quality preference diagnostics when the preference is no
|
||||
longer present in the current effective preference set or when the preferred
|
||||
@@ -712,10 +712,10 @@ The first backend contract slice is implemented:
|
||||
`scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
|
||||
verifies that active channel markers reference visible preferences, stale
|
||||
markers are absent, expired route intents are not active, and the session
|
||||
completes without backend fallback. The passing artifact is
|
||||
completes without compat fallback. The passing artifact is
|
||||
`artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
|
||||
run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
|
||||
quality markers, found `0` stale markers, kept backend fallback `0`, and
|
||||
quality markers, found `0` stale markers, kept compat fallback `0`, and
|
||||
flow drops `0`.
|
||||
- C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
|
||||
`0.2.193` now keys runtime-sent logical channels as
|
||||
@@ -728,11 +728,11 @@ The first backend contract slice is implemented:
|
||||
`scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
|
||||
wraps the live C18Z17 route-quality/fairness path, verifies served live
|
||||
channel names are session-scoped and no unscoped served `flow-NN` channels
|
||||
remain, and keeps backend fallback and flow drops at zero. The passing
|
||||
remain, and keeps compat fallback and flow drops at zero. The passing
|
||||
artifact is
|
||||
`artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
|
||||
run `c18z14-20260508-082520` served 32 session-scoped channels, applied
|
||||
quality markers to all 32, kept backend fallback `0`, and flow drops `0`.
|
||||
quality markers to all 32, kept compat fallback `0`, and flow drops `0`.
|
||||
- C18Z19 adds the first bounded parallel send window for independent
|
||||
service-channel logical flows. Node-agent `0.2.194` can send scheduled
|
||||
logical channels concurrently with `MaxParallelFlowSends=4` in the live
|
||||
@@ -769,7 +769,7 @@ The first backend contract slice is implemented:
|
||||
run `c18z14-20260508-085635` delivered 480 packets, observed
|
||||
`max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
|
||||
`scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
|
||||
served channels, backend fallback `0`, and flow drops `0`.
|
||||
served channels, compat fallback `0`, and flow drops `0`.
|
||||
- C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
|
||||
keeps the lifetime counters for audit visibility, but adaptive send-window
|
||||
pressure now comes from the bounded recent quality window, so old drops and
|
||||
@@ -785,7 +785,7 @@ The first backend contract slice is implemented:
|
||||
run `c18z14-20260508-091952` delivered 480 packets, observed
|
||||
`scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
|
||||
drops `0`, rolling samples/success/latency on all 32 served channels,
|
||||
`recommended_parallel_flow_sends=4`, backend fallback `0`, and flow drops `0`.
|
||||
`recommended_parallel_flow_sends=4`, compat fallback `0`, and flow drops `0`.
|
||||
- C18Z22 connects the rolling window to backend durable route feedback. Backend
|
||||
`rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
|
||||
from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
|
||||
@@ -799,7 +799,7 @@ The first backend contract slice is implemented:
|
||||
fields. The passing artifact is
|
||||
`artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
|
||||
`c18z14-20260508-093100` delivered 480 packets, observed one persisted
|
||||
healthy rolling feedback item with rolling payload, backend fallback `0`, and
|
||||
healthy rolling feedback item with rolling payload, compat fallback `0`, and
|
||||
flow drops `0`.
|
||||
- C18Z23 adds route recovery hysteresis. Backend
|
||||
`rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
|
||||
@@ -812,7 +812,7 @@ The first backend contract slice is implemented:
|
||||
the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
|
||||
forwarding, and the unit hysteresis contract. The passing artifact is
|
||||
`artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
|
||||
`c18z14-20260508-094111` delivered 480 packets with backend fallback `0` and
|
||||
`c18z14-20260508-094111` delivered 480 packets with compat fallback `0` and
|
||||
flow drops `0`.
|
||||
- C18Z24 exposes that recovery state to operators and API consumers. Backend
|
||||
`rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
|
||||
@@ -925,7 +925,7 @@ The first backend contract slice is implemented:
|
||||
C18X; route-intent lifecycle cleanup and synthetic-config expired-route
|
||||
filtering landed in C18Y; bounded multi-channel load/rebuild/drop telemetry
|
||||
coverage landed in C18Z; live signed service-channel ingress through the
|
||||
running mesh listener landed in C18Z1; sustained live ingress with exit-node
|
||||
running fabric listener landed in C18Z1; sustained live ingress with exit-node
|
||||
restart/recovery coverage landed in C18Z2; signed degraded fallback
|
||||
enforcement plus entry restart/WebSocket parity landed in C18Z3; long-lived
|
||||
WebSocket pressure with mid-session route-policy churn landed in C18Z4; live
|
||||
@@ -988,7 +988,7 @@ The first backend contract slice is implemented:
|
||||
from explicit degraded compatibility requests; C18Z56 adds active-channel remediation
|
||||
diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
|
||||
`hold_degraded_route_state`) to make the next runtime action explicit, and its
|
||||
alternate-route branch is live-smoke-proven with backend fallback kept off.
|
||||
alternate-route branch is live-smoke-proven with compat fallback kept off.
|
||||
C18Z57 adds the bounded machine-readable `remediation_command` contract to
|
||||
active access telemetry rows so route-manager can consume a short-lived
|
||||
`prefer_alternate_route` command with primary/replacement route ids and TTL.
|
||||
@@ -996,7 +996,7 @@ The first backend contract slice is implemented:
|
||||
node-agent route-manager consumes them as explicit applied replacement
|
||||
decisions sourced from `service_channel_remediation_command`. C18Z59 proves
|
||||
post-remediation service-channel traffic actually selects the replacement
|
||||
route in runtime/flow telemetry without local/backend fallback. C18Z60 proves
|
||||
route in runtime/flow telemetry without local/compat fallback. C18Z60 proves
|
||||
the same remediation path for multiple independent VPN flow channels in one
|
||||
packet batch, with replacement-route flow stats, no flow drops, no route
|
||||
failures, and no degraded fallback. C18Z61 proves the remediation replacement
|
||||
@@ -1024,7 +1024,7 @@ The first backend contract slice is implemented:
|
||||
0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
|
||||
access telemetry now reports `flow_health_status` and `flow_health_reason` at
|
||||
cluster, node, and active-channel levels using traffic-class pressure, queue
|
||||
pressure, flow drops, backend fallback, route-quality failures/drops/slow
|
||||
pressure, flow drops, compat fallback, route-quality failures/drops/slow
|
||||
samples, and route send latency. C18Z69 adds node-side adaptive response:
|
||||
runtime heartbeat flow-scheduler snapshots now include per-class
|
||||
`recommended_parallel_windows` and adaptive backpressure reason, and the send
|
||||
@@ -1039,7 +1039,7 @@ The first backend contract slice is implemented:
|
||||
tune shared fabric backpressure without changing VPN/RDP-specific code.
|
||||
C18Z72 adds an audited pool/failover policy contract for entry/exit pool
|
||||
constraints, preferred entry/exit, selection strategy, failover modes,
|
||||
backend fallback allowance, and sticky session mode. Lease issuance applies
|
||||
compat fallback allowance, and sticky session mode. Lease issuance applies
|
||||
that policy before route selection and signs the effective `pool_policy`
|
||||
provenance into the service-channel lease authority payload. C18Z73 projects
|
||||
that signed pool-policy fingerprint into active access telemetry and guards
|
||||
@@ -1080,7 +1080,7 @@ The first backend contract slice is implemented:
|
||||
existing rebuild command to a replacement route, the entry node reports a
|
||||
route-manager decision for the same `rebuild_request_id`, the transition is
|
||||
`applied_rebuild`, and live service-channel packet ingress selects the
|
||||
replacement route with no local/backend fallback, route failures, or flow
|
||||
replacement route with no local/compat fallback, route failures, or flow
|
||||
drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
|
||||
service-channel packet bursts remain on the replacement route, no stale
|
||||
primary route is reselected, and fallback, route-failure, flow-drop, and
|
||||
|
||||
@@ -0,0 +1,206 @@
# Fabric Service-Over-Transport Model

Status: active target architecture.

This document defines the mandatory separation between:

1. the internal fabric transport;
2. the logical service channel contract;
3. the external service ingress edge.

It exists to prevent a recurring failure pattern where external TCP/HTTP/HTTPS
listeners are mistaken for the fabric's internal transport.

## 1. Core rule

The fabric is the internal transport substrate.

- Inside the fabric, node-to-node runtime transport is `QUIC over UDP`.
- Services do not implement their own inter-node transport.
- Services do not need to understand relay, NAT, route replacement, or peer
  selection details.

A service asks the fabric for a channel. The fabric creates, maintains,
rebuilds, and heals that channel.

## 2. Three-layer model

### 2.1 Fabric Transport

Fabric Transport is the lowest runtime layer.

Responsibilities:

- peer discovery and peer memory
- endpoint candidate verification
- direct and relay path establishment
- route maintenance
- route replacement
- cross-area recovery
- QUIC session lifecycle
- cert pin and authority trust enforcement

Transport contract:

- node-to-node runtime transport is `QUIC/UDP`
- TCP is not an alternate transport carrier inside the fabric

### 2.2 Fabric Service Channel

The service channel is the logical contract used by any upper-layer service.

Responsibilities:

- request a route to a node or pool
- expose a stable channel identifier
- carry bidirectional application traffic
- survive path rebuild when possible
- surface degraded or migrated channel state to the service without exposing
  internal route topology details

The service channel must hide:

- which relay was chosen
- which direct peer was replaced
- which ingress or NAT path was changed
- which recovery seed was used

The service should see channel semantics, not transport topology.

### 2.3 External Service Ingress

External ingress is the edge that accepts user-facing or third-party traffic.

Examples:

- HTTP/HTTPS ingress for admin UI and personal cabinet
- VPN ingress that accepts client traffic
- future RDP ingress

Ingress may speak TCP/HTTP/HTTPS or another external protocol at the edge, but
after acceptance it must map traffic into a fabric service channel.

The ingress edge is not the fabric transport.

## 3. Examples

### 3.1 Admin panel

The user opens an HTTPS page.

1. the public ingress listens on `80/443`;
2. it accepts HTTPS and performs edge policy checks;
3. it opens or reuses a `control_ui` fabric service channel;
4. the request is forwarded through the fabric to a panel service instance;
5. the response is returned through the fabric channel back to the ingress;
6. the ingress returns the HTTP response to the browser.

The browser sees HTTPS. The fabric sees an internal service channel over
`QUIC/UDP`.

### 3.2 VPN

The VPN edge accepts client-side tunnel traffic.

1. the VPN service receives IPv4 packets from the client-facing side;
2. it requests a `vpn_tunnel` channel to an egress pool;
3. the fabric chooses and maintains the route;
4. an egress node performs IPv4 exit/NAT to the external network;
5. return traffic follows the maintained channel back through the fabric.

The VPN service does not decide how the route is built. The fabric does.

## 4. Service contract requirements

Every service-over-fabric integration must use a channel contract with at least
these concepts:

- `channel_request`
- `channel_id`
- `channel_class`
- `destination_selector`
- `current_state`
- `send`
- `receive`
- `close`

Optional but recommended:

- `channel_migrated`
- `channel_degraded`
- `preferred_qos_class`
- `pool_affinity`
- `session_stickiness`
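
The required concepts above can be read as a small interface. The following is a minimal sketch only, assuming an in-memory loopback in place of the real fabric; `LoopbackChannel`, `ChannelState`, `mark_migrated`, and the `pool:ipv4-egress` selector format are hypothetical names introduced for illustration and are not taken from the codebase.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ChannelState(Enum):
    """Hypothetical values for `current_state`."""
    OPEN = "open"
    DEGRADED = "degraded"   # corresponds to the optional `channel_degraded` signal
    MIGRATED = "migrated"   # corresponds to the optional `channel_migrated` signal
    CLOSED = "closed"


@dataclass(frozen=True)
class ChannelRequest:
    """`channel_request`: what a service asks the fabric for."""
    channel_class: str          # e.g. "vpn_tunnel" or "control_ui"
    destination_selector: str   # node or pool selector; the format is an assumption


class LoopbackChannel:
    """In-memory stand-in for a fabric service channel (illustration only)."""

    _ids = itertools.count(1)

    def __init__(self, request: ChannelRequest) -> None:
        self.request = request
        self.channel_id = f"ch-{next(self._ids)}"   # stable for the channel lifetime
        self.current_state = ChannelState.OPEN
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        if self.current_state is ChannelState.CLOSED:
            raise RuntimeError("channel is closed")
        self._queue.append(payload)

    def receive(self) -> Optional[bytes]:
        # Returns the oldest queued payload, or None when nothing is pending.
        return self._queue.popleft() if self._queue else None

    def mark_migrated(self) -> None:
        # The route changed underneath; the service keeps the same channel_id.
        if self.current_state is not ChannelState.CLOSED:
            self.current_state = ChannelState.MIGRATED

    def close(self) -> None:
        self.current_state = ChannelState.CLOSED
        self._queue.clear()
```

The point of the sketch is that migration changes `current_state` but never `channel_id`, and that nothing in the service-facing surface names a relay, peer, or route.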

## 5. Channel classes

Minimum channel classes:

- `control_ui`
- `vpn_tunnel`
- `rdp_session`
- `artifact_delivery`
- `service_admin`
- `internal_control`

Each class must define:

- latency sensitivity
- loss tolerance
- bandwidth expectation
- stickiness requirement
- pool failover behavior
- health check behavior
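
One way to keep the six per-class properties explicit is a typed profile record plus a completeness check. This is a hedged sketch: the field names mirror the list above, but `ChannelClassProfile`, `missing_classes`, and every example value are assumptions for illustration, not the project's actual schema or settings.

```python
from dataclasses import dataclass
from typing import Dict, List

# The minimum channel classes named in this section.
MINIMUM_CLASSES = (
    "control_ui",
    "vpn_tunnel",
    "rdp_session",
    "artifact_delivery",
    "service_admin",
    "internal_control",
)


@dataclass(frozen=True)
class ChannelClassProfile:
    """Hypothetical record holding the six properties each class must define."""
    latency_sensitivity: str     # e.g. "low", "medium", "high" (assumed scale)
    loss_tolerance: str
    bandwidth_expectation: str
    stickiness_required: bool
    pool_failover: str           # assumed labels such as "rebuild" or "fail_closed"
    health_check: str


def missing_classes(profiles: Dict[str, ChannelClassProfile]) -> List[str]:
    """Return the minimum classes that have no profile yet, in declaration order."""
    return [name for name in MINIMUM_CLASSES if name not in profiles]


# Illustrative values only; real tuning belongs to the fabric policy.
example_profiles = {
    "vpn_tunnel": ChannelClassProfile(
        latency_sensitivity="medium",
        loss_tolerance="medium",
        bandwidth_expectation="high",
        stickiness_required=True,
        pool_failover="rebuild",
        health_check="periodic",
    ),
}
```

A check such as `missing_classes(example_profiles)` could run at startup or in CI so that a new class cannot ship without all six properties being stated.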

## 6. Pool-first delivery

Services should target pools when they need resilience.

Examples:

- VPN should target an egress pool, not a single node
- future RDP should target a pool of reachable adapters when that service mode
  applies

The service must not need to know:

- which specific node was selected
- how many nodes are in the pool
- whether the path was direct or relayed

## 7. Recovery implications

Recovery must be separated into:

1. node survival
2. transport survival
3. service channel survival
4. ingress survival

It is not enough for the node to recover if the service channel model still
depends on a hidden compat carrier.

## 8. TCP clarification

TCP is allowed only in these roles:

- external user ingress
- operator/API ingress
- temporary compatibility recovery overlap
- artifact/control delivery at the service edge

TCP is not allowed as the normal inter-node fabric transport.

If TCP is still visible in the live system, it must be classified explicitly as
one of the roles above.
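
The classification rule can be expressed as a simple audit over observed TCP listeners. This is a sketch under stated assumptions: `TcpRole`, `unclassified_tcp_listeners`, and the listener names are hypothetical, and how listeners are discovered on a live node is out of scope here.

```python
from enum import Enum
from typing import Dict, List, Optional


class TcpRole(Enum):
    """The four roles in which this document allows TCP."""
    EXTERNAL_USER_INGRESS = "external user ingress"
    OPERATOR_API_INGRESS = "operator/API ingress"
    COMPAT_RECOVERY_OVERLAP = "temporary compatibility recovery overlap"
    EDGE_ARTIFACT_CONTROL = "artifact/control delivery at the service edge"


def unclassified_tcp_listeners(listeners: Dict[str, Optional[TcpRole]]) -> List[str]:
    """Return the names of TCP listeners that carry no explicit allowed role."""
    return sorted(
        name for name, role in listeners.items() if not isinstance(role, TcpRole)
    )


# Hypothetical inventory: listener label -> declared role (None = not classified).
observed = {
    "public-https:443": TcpRole.EXTERNAL_USER_INGRESS,
    "control-api:8443": TcpRole.OPERATOR_API_INGRESS,
    "node-to-node:9000": None,
}
```

In this example `node-to-node:9000` would be reported, which is the case the section is written to catch: TCP that is visible but not assigned to one of the allowed roles.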

## 9. Relationship to area stability

The transport layer must maintain resilient peer diversity across areas, but the
service layer must not need to understand those details.

See
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
for the current peer diversity model and
[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md)
for the live operational gaps.
@@ -0,0 +1,70 @@
# Fabric Transport Scale Plan

Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.

## Invariants

- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
- Control traffic stays small. Bulk and continuous traffic never move as a single control response.
- `tunnel_id` is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
- Hot traffic is binary framed, not JSON/base64.
- Interactive/control/DNS traffic must not wait behind bulk traffic.
- Route changes preserve the service tunnel identity.

## Planes

- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
- Data plane: binary QUIC stream frames for service traffic.
- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.

## Service Tunnel Contract

Each service receives:

- `tunnel_id`
- `pool_id`
- `service_id`
- `local_service_id`
- `remote_service_id`
- `service_kind`
- `service_class`
- `service_role`
- `route_lease_id`
- `route_generation`
- `data_plane`
- `traffic_classes`
- `stream_shards`

VPN default profile:

- pool: `ipv4-egress`
- service kind: `vpn-exit`
- service class: `vpn_packets`
- role: `ipv4-egress`

Future profiles use the same contract, for example `rdp-client`, `vnc-client`, `artifact-store`, or `remote-workspace`.
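
As an illustration of the contract and of the invariant that route changes preserve tunnel identity, the fields can be modelled as an immutable record. This is a sketch only: the field names come from the list above, but the field types, the `ServiceTunnel` and `migrate_route` names, and all example values other than the VPN default profile are assumptions, not the actual `rap.fabric_service_tunnel.v1` schema.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ServiceTunnel:
    """Hypothetical typed view of the service tunnel contract fields."""
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int                 # assumed to be a monotonically increasing integer
    data_plane: str
    traffic_classes: Tuple[str, ...]
    stream_shards: int


def migrate_route(tunnel: ServiceTunnel, lease_id: str, generation: int) -> ServiceTunnel:
    """Return a copy on a new route epoch; every service-facing field is unchanged."""
    if generation <= tunnel.route_generation:
        raise ValueError("route generation must increase")
    return replace(tunnel, route_lease_id=lease_id, route_generation=generation)


# VPN default profile from this section; the remaining values are placeholders.
vpn_tunnel = ServiceTunnel(
    tunnel_id="tunnel-example",
    pool_id="ipv4-egress",
    service_id="svc-example",
    local_service_id="local-example",
    remote_service_id="remote-example",
    service_kind="vpn-exit",
    service_class="vpn_packets",
    service_role="ipv4-egress",
    route_lease_id="lease-1",
    route_generation=1,
    data_plane="quic",
    traffic_classes=("interactive", "bulk"),
    stream_shards=4,
)
```

Because only `route_lease_id` and `route_generation` change in `migrate_route`, a service holding the tunnel keeps the same `tunnel_id` across a path change.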

## Implementation Phases

1. Generalize the tunnel contract and keep VPN as the first profile. Current code exposes `rap.fabric_service_tunnel.v1`.
2. Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use `tunnel_id`; VPN ids are compatibility aliases inside the packet payload.
3. Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks `rap.fabric_service_stream_registry.v1` with per-tunnel stream state.
4. Add route epoch migration so a service keeps the same tunnel while the farm changes path. Current contract already carries `route_lease_id` and `route_generation` through profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same `tunnel_id` and update the active transport route epoch without closing service streams.
5. Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
6. Add per-class stream scheduler and backpressure at tunnel, node, route, and pool levels.
7. Add admission control and capacity accounting per node, route, pool, organization, and service.
8. Add stress tests for many tunnels, mixed traffic, route failure, node failure, and update while traffic flows.

## Scale Rules

- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by `traffic_classes` plus `stream_shards`.
- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB choke.
- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
- Bulk must be drainable and resumable.
- Interactive traffic must stay preemptive over bulk.
- Every transport fact must be observable separately from planned route and endpoint candidates.
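
The batch limits in the scale rules can be sketched as a generator that closes a batch when either limit would be exceeded. The constants come from the text above (2048 packets, 4 MiB per batch, 8 MiB frame guardrail); the function name `batch_packets` is hypothetical, and the sketch counts payload bytes only, ignoring any framing overhead the real runtime adds.

```python
from typing import Iterable, Iterator, List

MAX_BATCH_PACKETS = 2048
MAX_BATCH_BYTES = 4 * 1024 * 1024        # 4 MiB per VPN batch
FRAME_GUARDRAIL_BYTES = 8 * 1024 * 1024  # 8 MiB per fabric frame


def batch_packets(
    packets: Iterable[bytes],
    max_packets: int = MAX_BATCH_PACKETS,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[bytes]]:
    """Group packets into batches that respect both the count and the byte limit."""
    batch: List[bytes] = []
    size = 0
    for pkt in packets:
        if len(pkt) > max_bytes:
            raise ValueError("single packet exceeds the batch byte limit")
        # Close the current batch if adding this packet would break either limit.
        if batch and (len(batch) >= max_packets or size + len(pkt) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(pkt)
        size += len(pkt)
    if batch:
        yield batch
```

With 1500-byte packets the packet-count limit is reached first (2048 x 1500 bytes is about 2.9 MiB), so each batch stays well under the 8 MiB frame guardrail; throughput is then meant to come from parallel streams rather than larger frames.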
|
||||
@@ -604,14 +604,14 @@ experiment while preserving the production forwarding kill-switch. This result
|
||||
is retained only as test-history context; it is not the active transport
|
||||
direction for the fabric runtime:
|
||||
|
||||
- `HTTPPeerTransport` maps explicit peer node IDs to synthetic HTTP endpoint
|
||||
- `QUICPeerTransport` maps explicit peer node IDs to synthetic QUIC endpoint
|
||||
URLs.
|
||||
- `rap-node-agent` can start a synthetic `/mesh/v1/*` endpoint only when
|
||||
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true` and `RAP_MESH_LISTEN_ADDR` is set.
|
||||
- `rap-node-agent` can start the synthetic fabric runtime only when
|
||||
`RAP_FABRIC_RUNTIME_ENABLED=true` and `RAP_FABRIC_LISTEN_ADDR` is set.
|
||||
- peer endpoints and synthetic routes can be injected as JSON for smoke/debug
|
||||
only.
|
||||
- `mesh-live-smoke` proves direct and single-relay synthetic traffic over real
|
||||
local HTTP endpoints.
|
||||
local QUIC endpoints.
|
||||
- bounded `synthetic.echo` remains the only test-service payload.
|
||||
- `/mesh/v1/forward` remains disabled.
|
||||
- no production service traffic is authorized.
|
||||
|
||||
@@ -504,7 +504,7 @@ Implementation:
|
||||
`diff_time_ms`, `render_update_reason`, and
|
||||
`fallback_to_full_frame_reason`.
|
||||
- Windows direct transport accepts `render.frame.full`,
|
||||
`render.frame.region`, and legacy `session.frame` binary messages.
|
||||
`render.frame.region`, and compat `session.frame` binary messages.
|
||||
- Windows presenter keeps a per-session framebuffer and patches region bytes
|
||||
into it before presenting the updated WPF surface.
|
||||
- Smoke proof showed baseline `render.frame.full` at `3,686,400` bytes and
|
||||
|
||||
@@ -340,4 +340,4 @@ Deliver:
|
||||
- buildable `workers/rdp-service-csharp`
|
||||
- interfaces for protocol engine, data-plane bridge, graphics sink, input source
|
||||
- README with migration stages
|
||||
- docs update marking current C++/FreeRDP path as legacy MVP runtime
|
||||
- docs update marking current C++/FreeRDP path as compat MVP runtime
|
||||
|
||||
@@ -312,7 +312,7 @@ Responsibilities:
|
||||
- enforces user, organization, cluster, and owner visibility policy before accepting traffic
|
||||
- participates in latency-aware and load-aware exit selection
|
||||
- supports failover between nodes in the same exit pool without changing the Android client protocol
|
||||
- does not expose legacy VPN protocols as the steady-state data plane
|
||||
- does not expose compat VPN protocols as the steady-state data plane
|
||||
|
||||
### `vpn-client`
|
||||
|
||||
@@ -324,7 +324,7 @@ Responsibilities:
|
||||
- requests the list of visible IPv4 exit pools and nodes according to the current user's access level
|
||||
- creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
|
||||
- switches to another authorized exit when the selected exit fails or becomes slow
|
||||
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
|
||||
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use compat download/update paths long enough to fetch the new agent
|
||||
- exposes its local IPv4 ingress as service configuration: on Android this is the
|
||||
`VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
|
||||
listen ports that are mapped into VPN packet channels.
|
||||
|
||||
@@ -300,7 +300,7 @@ Recommended flow:
|
||||
3. dual validation period begins where required
|
||||
4. new certificates are issued/accepted
|
||||
5. old certificates expire or are revoked
|
||||
6. old trust root is retired after rollout threshold
|
||||
6. old trust root is removed after rollout threshold
|
||||
|
||||
Channels should revalidate after trust bundle changes.
|
||||
|
||||
|
||||
@@ -8,7 +8,7 @@ transport architecture. The active inter-node transport model is QUIC-only; see
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: P3.3 historical test-stand smoke complete for encrypted resource
|
||||
secrets, assignment-time resolution, and legacy RDP baseline behavior with
|
||||
secrets, assignment-time resolution, and compat RDP baseline behavior with
|
||||
smoke-only direct-worker trust.
|
||||
|
||||
This document defines the next security hardening layer around the accepted RDP
|
||||
@@ -110,7 +110,7 @@ In `APP_ENV=production`:
|
||||
|
||||
- RDP/VNC/SSH resources must have `secret_ref`.
|
||||
- Plain credential-like keys are rejected in resource `metadata`.
|
||||
- Session start rejects legacy resources that still contain plaintext
|
||||
- Session start rejects compat resources that still contain plaintext
|
||||
credential-like metadata.
|
||||
- backend startup requires secret encryption key material.
|
||||
- Development/smoke environments may continue using plaintext metadata while
|
||||
|
||||
@@ -109,7 +109,7 @@ adapter runtime.
|
||||
- Control Plane remains authoritative for session lifecycle and policy.
|
||||
- PostgreSQL remains source of truth; Redis remains live coordination only.
|
||||
- Fabric transport remains QUIC-only between nodes; any historical direct
|
||||
worker or backend fallback paths belong to paused service-specific baselines,
|
||||
worker or compat fallback paths belong to paused service-specific baselines,
|
||||
not to the active fabric transport contract.
|
||||
- Adapter runtime must not create sessions outside broker/assignment control.
|
||||
|
||||
|
||||
@@ -212,7 +212,7 @@ Signing key rotation rules:
|
||||
1. New key is introduced in a signed trust bundle.
|
||||
2. Node verifies the new key through existing trust.
|
||||
3. Snapshots may be dual-signed during transition.
|
||||
4. Old key is retired only after policy-defined rollout.
|
||||
4. Old key is removed only after policy-defined rollout.
|
||||
5. Compromised key is revoked through signed revocation metadata or emergency
|
||||
recovery flow.
|
||||
|
||||
|
||||
@@ -12,6 +12,10 @@ Core. It does not redefine node-to-node transport. Current fabric inter-node
|
||||
transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric
|
||||
routes instead of introducing a separate packet transport contract.
|
||||
|
||||
The general service-over-fabric contract is defined in
|
||||
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
|
||||
VPN is one service class over that transport model, not an exception to it.
|
||||
|
||||
## Purpose
|
||||
|
||||
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
|
||||
@@ -25,6 +29,9 @@ platform's core rules:
|
||||
- Nodes execute leased work only.
|
||||
- Organizations must not see mesh topology.
|
||||
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
|
||||
- VPN ingress may accept external client traffic, but after acceptance it must
|
||||
map that traffic into a fabric service channel rather than inventing an
|
||||
alternate inter-node carrier.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
|
||||
@@ -18,6 +18,13 @@ Terminology rule:
|
||||
The Control API may use HTTP/HTTPS, but it is not a fallback or alternate
|
||||
carrier for fabric node-to-node runtime traffic.
|
||||
|
||||
The formal three-layer separation is defined in
|
||||
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md):
|
||||
|
||||
- `Fabric Transport` - internal QUIC/UDP substrate
|
||||
- `Fabric Service Channel` - logical service channel contract
|
||||
- `External Service Ingress` - browser/API TCP/HTTP/HTTPS edge
|
||||
|
||||
## Purpose
|
||||
|
||||
The platform needs a clear distinction between:
|
||||
@@ -36,7 +43,7 @@ secrets, node identity, or routing authority.
|
||||
|
||||
Public HTTPS Ingress is an edge service. It may run on a public Internet node,
|
||||
including a small/slow node intended only to accept browser traffic and pass it
|
||||
into the fabric.
|
||||
into the fabric through a service channel.
|
||||
|
||||
Role names:
|
||||
|
||||
@@ -225,7 +232,7 @@ The recommended model is:
|
||||
```text
|
||||
Admin Web Shell
|
||||
-> UI Manifest / Page Definition endpoint
|
||||
-> Scoped Control API endpoints
|
||||
-> Scoped Fabric control endpoints
|
||||
```
|
||||
|
||||
Dynamic pages are allowed for:
|
||||
@@ -474,8 +481,8 @@ the management authority. Platform/global admin runtime remains limited to
|
||||
platform-owner trusted nodes. Cluster, organization, and user panels receive
|
||||
only their scoped projections.
|
||||
|
||||
The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
|
||||
zones` is retired for the transport-layer view. The Fabric panel must show
|
||||
The compat Fabric map with separate `inputs`, `cluster nodes`, and `egress
|
||||
zones` is removed for the transport-layer view. The Fabric panel must show
|
||||
actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
|
||||
state, relay/route-health annotations, and web-ingress runtime readiness. It
|
||||
must not render old entry/egress zone columns as if they were transport
|
||||
@@ -520,7 +527,7 @@ The platform recognizes these web/admin placement roles:
|
||||
| `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
|
||||
| `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |
|
||||
|
||||
Legacy `entry-node` remains a generic client ingress/service edge role for
|
||||
Compat `entry-node` remains a generic client ingress/service edge role for
|
||||
non-admin product services. It must not imply admin authority.
|
||||
|
||||
## Fabric Service Classes
|
||||
|
||||
@@ -202,9 +202,9 @@ Current authoritative audit:
|
||||
|
||||
- `docs/audits/PROJECT_AUDIT_2026-04-26.md`
|
||||
|
||||
Legacy warning:
|
||||
Archive warning:
|
||||
|
||||
- `docs/_legacy_v1` is historical reference only and must not be used for
|
||||
- archived `docs/_archive_v1` is historical reference only and must not be used for
|
||||
implementation decisions
|
||||
|
||||
## Correct Next Step
|
||||
|
||||
@@ -303,7 +303,7 @@ Known worker/RDP gaps:
|
||||
|
||||
At the start of this audit these files were stale or partly stale:
|
||||
|
||||
- `README.md` still points to old legacy docs and says not to start with UI,
|
||||
- `README.md` still points to old compat-era docs and says not to start with UI,
|
||||
while the Windows client already exists
|
||||
- `docs/codex/CURRENT_STATUS.md` says WebSocket takeover proof is still a gap,
|
||||
even though that proof was later closed
|
||||
@@ -434,7 +434,7 @@ Do not:
|
||||
Acceptance:
|
||||
|
||||
- a new engineer/Codex can read the docs and know the actual next step
|
||||
- no doc points to legacy v1 or already-completed stages as next work
|
||||
- no doc points to archived v1 or already-completed stages as next work
|
||||
|
||||
### P1. RDP Visual Correctness Hardening
|
||||
|
||||
@@ -505,7 +505,7 @@ Completed:
|
||||
`docs/architecture/SECURITY_SECRETS_READINESS.md`
|
||||
- production mode rejects plaintext credential-like resource metadata
|
||||
- production RDP/VNC/SSH resources require `secret_ref`
|
||||
- session start rejects legacy plaintext resources in production mode
|
||||
- session start rejects compat plaintext resources in production mode
|
||||
- data-plane allowed-channel policy test exists
|
||||
- worker direct-bind denial probes cover wrong worker/user/org/resource,
|
||||
wrong attachment, over-broad channels, and failed/terminated states

@@ -306,12 +306,12 @@ Current implementation focus:
activation manifests, stores installation authority and signed
`platform_role_grants`, and strict platform-admin checks ignore direct
PostgreSQL `users.platform_role` edits unless a valid grant exists. Web-admin
shows installation status and first-owner bootstrap; dev/legacy SQL seed
shows installation status and first-owner bootstrap; dev/compat SQL seed
compatibility remains explicit and gated by
`INSTALLATION_INSECURE_BOOTSTRAP_ENABLED`.
- Cluster Authority foundation is implemented and backend/agent/web-build plus
docker-test lifecycle-smoke verified. Clusters now have Ed25519 authority
keys, join-token scope material is signed, node approval/bootstrap material
keys, join-token scope material is signed, node approval/join material
is signed, and Control Plane synthetic mesh config snapshots include a
signed hash envelope with `authority_required=true`. Cluster authority
private keys are encrypted at rest when `SECRET_ENCRYPTION_KEY_B64`/file is
@@ -321,15 +321,15 @@ Current implementation focus:
join-token output, approval rows, and synthetic config visibility. The
docker-test run `dev-bootstrap-20260428-201430` proved fresh dev cluster
creation, signed join token, real node-agent enrollment, platform-owner
approval, automatic signed bootstrap polling, authority pin persistence,
approval, automatic signed join polling, authority pin persistence,
heartbeat, and signed synthetic-config verification. This is a control-plane
trust contract only; it does not enable RDP/VPN/service payload forwarding or
production relay packet forwarding.
- Node enrollment bootstrap polling is implemented and backend/agent-test plus
- Node enrollment join polling is implemented and backend/agent-test plus
docker-test lifecycle-smoke verified. After enrollment, `rap-node-agent`
stores `pending_join_request_id`, polls
`/node-agents/enrollments/{requestID}/bootstrap`, verifies the signed
approval/bootstrap contract, and persists the approved `node_id`,
`/node-agents/enrollments/{requestID}/join`, verifies the signed
approval/join contract, and persists the approved `node_id`,
`identity_status`, and cluster authority pin into `identity.json`. Polling is
controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`.
@@ -437,12 +437,12 @@ Not current scope:
`rap-node-agent` tests only, behind the same disabled-by-default feature
flag, and carries only bounded `synthetic.echo` test-service payloads.
- C17E adds a live node-to-node synthetic HTTP transport skeleton and smoke
harness. It remains behind `RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=false` by
harness. It remains behind `RAP_FABRIC_RUNTIME_ENABLED=false` by
default and does not authorize production mesh, RDP, VPN, file, video, or
service workload traffic.
- C17F adds a scoped synthetic mesh config file boundary, prefers it over
debug JSON, and reports synthetic route-health observations to the existing
mesh links control-plane endpoint when testing flags allow synthetic links.
mesh links fabric control endpoint when testing flags allow synthetic links.
- C17G adds backend
`/clusters/{clusterID}/nodes/{nodeID}/mesh/synthetic-config` and node-agent
consumption of that config when no local scoped config file is set.
@@ -876,7 +876,7 @@ Result:
Additional C17H deployed multi-agent synthetic config smoke verification:

```powershell
powershell -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17h-multi-agent-synthetic-smoke-ssh.ps1 -KeepRunning
removed multi-agent smoke script is not part of the active tree
go test ./...
```

@@ -1705,7 +1705,7 @@ Result:
Docker-test C17Z12 runtime smoke:

```powershell
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Result from run `c17z12-20260428-142108`:
@@ -1730,7 +1730,7 @@ Additional C17Z13 rendezvous lease telemetry verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build && popd"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -1764,7 +1764,7 @@ Additional C17Z14 rendezvous lease refresh verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -1801,7 +1801,7 @@ Additional C17Z15 rendezvous relay replacement verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -1840,7 +1840,7 @@ Additional C17Z16 route/path decision artifact verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -1878,7 +1878,7 @@ Additional C17Z17 route generation tracker verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -1921,7 +1921,7 @@ Additional C17Z18 route-health effective path verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
.\scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -2002,7 +2002,13 @@ Additional C17Z20 route-health feedback refresh verification:
```powershell
go test ./...
cmd /c "pushd \\nas\MST\codex\rdp-proxy\web-admin && npm run build"
pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\check-fabric-standard-boundary.ps1
```

Removed smoke record:

```powershell
removed docker-test smoke script is not part of the active tree
```

Run from:
@@ -2033,10 +2039,10 @@ C17Z20 report:

- `artifacts/c17z20-route-health-feedback-refresh-report.md`

Dev cluster enrollment/bootstrap lifecycle verification:
Archived dev cluster enrollment/bootstrap lifecycle verification:

```powershell
pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning
removed dev lifecycle smoke script is not part of the active tree
```

Result from docker-test run `dev-bootstrap-20260428-201430`:

@@ -62,7 +62,7 @@ Cluster Authority foundation is now also complete:
- cluster authority private keys are encrypted at rest when
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- compat/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
@@ -80,14 +80,14 @@ Cluster Authority foundation is now also complete:
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
supervision remains a stub

Node enrollment bootstrap polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed bootstrap contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
Node enrollment join polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/join`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/join material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed join contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`

@@ -157,14 +157,16 @@ Runtime report:
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Active fabric standard check:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\check-fabric-standard-boundary.ps1`
- Removed docker-test smoke record:
`removed docker-test smoke script is not part of the active tree`
- Removed dev lifecycle smoke record:
`removed dev lifecycle smoke script is not part of the active tree`
- Last proven runtime run: `c17z18-20260428-221601` (compat smoke script name,
current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
@@ -193,30 +195,30 @@ Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
All three test nodes run the C18Z92 image, healthy, and current after policy
update. Node-agent still requires signed service-channel lease authority when
cluster authority is pinned, but if legacy clients cannot send signed lease
cluster authority is pinned, but if compat clients cannot send signed lease
headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
Accepted ingress is visible as `accepted_by=signed|introspection|compat_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
compat path, with heartbeat metadata fallback and admin-panel visibility.
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
entry/exit nodes, primary route status, compat fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
active channel stays on a normal primary route with `force_compat_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
hint, with unit coverage for healthy/noop, rebuild, compat fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`. C18Z57 turns that recommendation into a bounded
`force_compat_fallback=false`. C18Z57 turns that recommendation into a bounded
machine-readable `remediation_command` on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
@@ -225,10 +227,10 @@ route-manager `applied` decision with source
`service_channel_remediation_command`. C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
route, with no local/backend fallback and no route send failures. C18Z60 proves
route, with no local/compat fallback and no route send failures. C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
observed replacement-route sends avoid local/compat fallback, flow drops, and
route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
@@ -260,7 +262,7 @@ fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
drops, compat fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
@@ -319,7 +321,7 @@ C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops. C18Z80 hardens that
local/compat fallback, route failures, or flow drops. C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
@@ -396,7 +398,7 @@ C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
C18Z92 enforces disabled compat fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
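
Editor's sketch (not part of the commit): the C18Z92 rule above reduced to a small decision function. Only `backend_relay_policy=disabled` and the 503 outcome come from the text. The function name, its parameters, and the `fabric` / `relay` / `blocked` path labels are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// ingressStatus returns the HTTP status and the path taken for one packet
// ingress attempt. With relayPolicy == "disabled", a missing fabric route or
// a failed send yields 503 instead of silently using the backend relay.
// Any other policy value is treated as "relay allowed" (an assumption).
func ingressStatus(relayPolicy string, routeAvailable, sendOK bool) (int, string) {
	if routeAvailable && sendOK {
		return http.StatusOK, "fabric"
	}
	if relayPolicy == "disabled" {
		return http.StatusServiceUnavailable, "blocked"
	}
	return http.StatusOK, "relay"
}

func main() {
	cases := []struct {
		policy        string
		route, sendOK bool
	}{
		{"disabled", true, true},
		{"disabled", true, false},
		{"disabled", false, false},
		{"allowed", false, false},
	}
	for _, c := range cases {
		status, path := ingressStatus(c.policy, c.route, c.sendOK)
		fmt.Printf("policy=%s route=%v send_ok=%v -> %d via %s\n",
			c.policy, c.route, c.sendOK, status, path)
	}
}
```

Keeping the refusal as a distinct outcome is what lets later telemetry tell a blocked fallback apart from real relay usage.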
@@ -414,13 +416,13 @@ can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
route, node-agent reports `compat_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
telemetry reports `fabric_route_send_failed_compat_fallback_blocked`, compat
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
@@ -570,7 +572,7 @@ artifacts:
`artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`, and
`artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`, and
`artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`, and
`artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`, and
`artifacts/c18z92-node-agent-disabled-compat-fallback-smoke-result.json`, and
`artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`, and
`artifacts/c18z94-data-plane-contract-incident-smoke-result.json`, and
`artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`, and

@@ -2,9 +2,8 @@

Date: 2026-05-05

This document freezes the current near-working VPN state. Treat it as the
rollback and comparison point before changing the Android VPN dataplane,
gateway assignment, mesh route intents, or packet relay behavior.
This archived document records the pre-fabric VPN state for comparison only.
It is not a rollback instruction for the current farm standard.

## Baseline components

@@ -23,7 +22,7 @@ gateway assignment, mesh route intents, or packet relay behavior.
- DNS from exit side: `192.168.200.210`
- Client tunnel: full tunnel, `0.0.0.0/0`, VPN address `10.77.0.2/24`
- Active gateway lease: home-1, generation `8`
- Active relay transport: `backend_http_packet_relay`
- Current farm standard: QUIC fabric packet transport only.

## Current working behavior

@@ -59,9 +58,8 @@ delays, and RDP sessions that connect and later drop.
- Do not reduce Android `TUN_WRITE_MAX_RETRIES` below `1000` without a
controlled regression test.
- Do not relax Android VPN source-address validation.
- Do not re-enable the home-1 `vpn_packets` fabric mesh route intent for this
connection until the Android client can intentionally use the fabric entry
path. The current working baseline relies on `backend_http_packet_relay`.
- Do not reintroduce direct backend packet relay. VPN packets must use the
fabric session or fabric mesh packet transport.
- Do not change the active entry/exit away from home-1 without saving packet
counters before and after.
- Do not change DNS away from `192.168.200.210` without checking full-tunnel
@@ -75,5 +73,5 @@ delays, and RDP sessions that connect and later drop.
2. Add clearer per-flow counters for long-lived TCP flows such as RDP.
3. Add a small repeatable smoke test: DNS, direct IP HTTP, 2ip.ru, Telegram-like
long connection, and RDP port reachability.
4. Only after this baseline is stable, move Android entry traffic from backend
relay to fabric mesh.
4. Keep Android entry traffic on the fabric path and compare behavior against
this archived baseline only for diagnostics.

@@ -16,7 +16,8 @@ Example:

```bash
rap-host-agent monitor-loop \
--backend-url http://127.0.0.1:18121/api/v1 \
--fabric-registry-records-json '<signed_registry_records_json>' \
--cluster-authority-public-key '<cluster_authority_public_key>' \
--cluster-id cfc0743d-d960-49fb-9de8-96e063d5e4aa \
--node-id 108a0d66-d65e-4dea-b9a8-135366bf7dba \
--current-version 0.2.261-vpnfarm \