Working version, but speed is 10 Mbit/s
build / backend (push) Has been cancelled
build / node-agent (push) Has been cancelled
build / worker (push) Has been cancelled

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
+2 -2
@@ -201,8 +201,8 @@ Updates must support:
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
- legacy recovery compatibility until the fleet is converged or explicitly
retired
- compat recovery support until the fleet is converged or explicitly
removed
- multi-source artifact retrieval for stranded or NAT-only nodes
Version Storage stores immutable release manifests, artifacts, hashes,
@@ -1035,7 +1035,7 @@ Node-agent can start, stop, and monitor service workloads based on role assignme
C19A adds the first bounded live service-supervision runtime proof on top of
that contract: node-agent can read node-scoped desired workloads without an
operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
operator actor id, report built-in `core-mesh` and `fabric-listener` as running,
report native built-in `synthetic.echo` as running, and keep unsupported
production workloads degraded instead of pretending that their adapters exist.
The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.
+3 -3
@@ -262,7 +262,7 @@ Rules:
- latest frame wins
- render must not block input/control
- binary payloads should be used on direct data plane
- backend fallback may continue existing JSON/base64 behavior during migration
- compat fallback may continue existing JSON/base64 behavior during migration
### `clipboard`
@@ -347,7 +347,7 @@ The DP-2 JSON header contains:
- `session_id`
- `channel`, currently `render`
- `message_type`, currently `render.frame.full` or `render.frame.region` on
direct worker WSS; `session.frame` remains accepted as the legacy DP-2
direct worker WSS; `session.frame` remains accepted as the compat DP-2
binary message type for compatibility.
- `sequence`
- `timestamp`
@@ -950,7 +950,7 @@ explicit direct render message types:
Compatibility:
- Windows client direct transport still accepts legacy binary `message_type=session.frame`.
- Windows client direct transport still accepts compat binary `message_type=session.frame`.
- Inside the Windows application pipeline, direct binary frames are normalized
back into the existing `session.frame` envelope so UI, lifecycle, input,
clipboard, and file transfer behavior remain unchanged.
@@ -24,7 +24,7 @@ policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.
Node survival and recovery across endpoint moves, NAT-only reachability, legacy
Node survival and recovery across endpoint moves, NAT-only reachability, compat
contract overlap, and unavailable manual host access are governed by
`docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
@@ -179,8 +179,8 @@ Endpoint state is also distributed:
Moving a service must not break the farm.
`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
old nodes. It is not cluster truth. After bootstrap, a node finds services by
`RAP_FABRIC_REGISTRY_RECORDS_JSON` and signed registry gossip, not any fixed
HTTP/API address, define cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.
@@ -258,7 +258,7 @@ Service classes that must use this registry before production hardening:
- `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
nodes.
Legacy endpoint compatibility is allowed only for rolling migration:
Compat endpoint support is allowed only for rolling migration:
- Old nodes may use their baked HTTP/control URL only to fetch a new version or
a signed registry bootstrap record.
@@ -504,7 +504,7 @@ Deliverables:
### Stage FNP-3: WebSocket/TCP Compatibility Transport
Status: retired as a migration-only stage.
Status: removed as a migration-only stage.
This stage existed to bootstrap binary frame semantics before QUIC routing and
carrier reuse were ready. It introduced the transport-neutral frame loop,
@@ -6,6 +6,10 @@ This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.
It operates at the `Fabric Transport` layer. Services above the transport must
consume service channels and must not directly reason about peer topology. See
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
## 1. Why the old "3 connections" rule is not enough
A raw connection count is too weak as a resilience rule.
@@ -43,6 +47,9 @@ An area can be derived from:
The area label must be part of live node metadata and endpoint candidate
metadata.
For the current fleet, area assignment should be explicit operator metadata, not
an inference hidden only inside routing code.
## 3. Stability objective
Each node should maintain a working peer set with diversity, not just count.
@@ -0,0 +1,386 @@
# Fabric Execution Plan 2026-05-19
Status: active execution plan.
This document merges:
- the service-over-fabric model;
- the area and peer stability model;
- the live audit findings from 2026-05-18 through 2026-05-19;
- the node survival and recovery policy;
- the current rollout and runtime rewrite findings.
The goal is to move the live fabric from a partially migrated QUIC-first fleet
to a fully converged distributed runtime where:
1. inter-node transport is QUIC over UDP only;
2. services use fabric channels and do not implement their own transport;
3. nodes do not depend on one compat control/download edge;
4. node directory and service discovery are distributed through signed records,
peer cache, and live peer exchange;
5. the fleet remains recoverable after losing part of the fabric.
## 1. Current live state
### 1.1 What is already true
- Inter-node runtime transport is QUIC over UDP.
- All active nodes are converging on the latest control-endpoint rewrite line.
- `home-*`, `test-*`, and `usa-los-1` already run
`rap-node-agent 0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeats.
- Internal artifact plans now support mirror URLs instead of a single artifact
URL.
- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
`home-1` were repaired so downloads and control traffic can flow again.
### 1.2 What is still not finished
- `ifcm-rufms-s-mo1cr` still reports the old
`http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
while the rest of the reachable fleet is already on
`0.2.325-updatehintwake`.
- The current blocker is now known precisely:
fresh heartbeat plus a dead updater subscription plane on a node-agent that
does not yet support local updater wake from heartbeat update hints.
- Signed registry runtime is still not fully `active` across the fleet.
- Cross-area direct peer diversity is still below the target for multiple
nodes.
- TCP is still visible in allowed edge roles:
- external ingress;
- Control API;
- release downloads;
- temporary compatibility recovery overlap.
## 2. Target system model
### 2.1 Transport
- Inter-node runtime transport: QUIC over UDP only.
- No TCP/WebSocket fallback as the normal fabric carrier.
### 2.2 Service layer
- Services consume a fabric channel contract.
- Services do not know internal path selection, relay choice, NAT traversal, or
route replacement details.
- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
fabric channel.
### 2.3 Discovery and directory
- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
- PostgreSQL remains the durable source of truth for policy, rollout, release,
desired state, and audit.
- Runtime node discovery must use:
- signed registry records;
- peer cache;
- endpoint candidates;
- bounded live peer exchange.
### 2.4 Small fleet rule
For the current fleet size, every node should keep the full directory of all
known nodes in scoped local state, plus runtime observations and endpoint
candidate health.
## 3. Execution priorities
### P0. Finish runtime control-path convergence
Goal:
- remove the last live compat control dependency without manual host access.
- ensure a live node can wake its local updater plane when Control/API sends an
explicit update hint, even if the previous updater loop died.
Required work:
1. Release the noop runtime rewrite restart fix.
2. Roll it out to the fleet.
3. Verify that updated nodes restart into canonical control endpoints.
4. Add a local updater wake path driven by heartbeat update hints so
`update-trigger.json` is not the only signal.
5. Confirm that `compat_control_dependency_nodes` falls to zero.
6. Confirm that `updater_subscription_alert_nodes` falls to zero.
7. Confirm that `updater_wake_unsupported_nodes` falls to zero.
Done when:
- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
heartbeat is still fresh and update status is stale.
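The P0 exit conditions can be read as a per-node classification over heartbeat data. The sketch below is illustrative only: the `NodeStatus` fields, the freshness windows, and the `:19191/` substring check are assumptions, not the backend's actual schema or thresholds. Only the flag names and the `0.2.325` version boundary come from this plan.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical windows; the plan does not define exact freshness thresholds.
HEARTBEAT_FRESH = timedelta(minutes=5)
UPDATE_STALE = timedelta(hours=1)
# First release line with heartbeat-driven updater wake (0.2.325-updatehintwake).
WAKE_CAPABLE_VERSION = (0, 2, 325)


@dataclass
class NodeStatus:
    node_id: str
    agent_version: str            # e.g. "0.2.322-controlendpointsrewrite"
    last_heartbeat: datetime
    last_update_status: datetime
    fabric_control_endpoint: str


def parse_version(version):
    """Return the numeric part of 'X.Y.Z-suffix' as a comparable tuple."""
    numeric = version.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))


def classify(node, now):
    """Return the P0 blocker flags that apply to one node."""
    flags = []
    heartbeat_fresh = now - node.last_heartbeat <= HEARTBEAT_FRESH
    update_stale = now - node.last_update_status >= UPDATE_STALE
    if ":19191/" in node.fabric_control_endpoint:
        flags.append("compat_control_dependency")
    if heartbeat_fresh and update_stale:
        flags.append("updater_subscription_gap")
        if parse_version(node.agent_version) < WAKE_CAPABLE_VERSION:
            flags.append("updater_wake_unsupported")
    return flags
```

Counting nodes per flag across the fleet would give values comparable to `compat_control_dependency_nodes`, `updater_subscription_alert_nodes`, and `updater_wake_unsupported_nodes`, assuming the backend derives them in a similar way.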
### P1. Finish distributed registry activation
Goal:
- nodes must resolve active service records without relying on one compat URL.
Required work:
1. Promote signed registry runtime from `candidate_only` / `missing` to
`active`.
2. Ensure nodes resolve at least:
- `control-api`
- `update-store`
- `update-cache`
3. Add live observability for:
- active records
- candidate records
- resolved core services
- last live probe
Done when:
- `fabric_registry_runtime_report.status = active` for the production fleet.
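One possible reading of the `missing` / `candidate_only` / `active` progression is sketched below. The record shape (`service`, `state`) and the rule that `active` requires all three core services to resolve are assumptions for illustration. The real promotion logic in node-agent may differ.

```python
# Core services named in P1 that a node must be able to resolve.
CORE_SERVICES = ("control-api", "update-store", "update-cache")


def registry_runtime_report(records):
    """Summarize registry records into a runtime report.

    `records` is a list of dicts with hypothetical keys:
    'service' (logical role) and 'state' ('active' or 'candidate').
    """
    active = [r for r in records if r["state"] == "active"]
    candidates = [r for r in records if r["state"] == "candidate"]
    active_services = {r["service"] for r in active}
    resolved = [s for s in CORE_SERVICES if s in active_services]

    if not records:
        status = "missing"
    elif len(resolved) == len(CORE_SERVICES):
        status = "active"
    else:
        # Assumption: partial resolution is still reported as candidate_only.
        status = "candidate_only"

    return {
        "status": status,
        "active_record_count": len(active),
        "candidate_record_count": len(candidates),
        "resolved_service_count": len(resolved),
    }
```

Under this reading, a node such as `ifcm-rufms-s-mo1cr` that holds only candidate records would report `candidate_only` with `resolved_service_count = 0`, which matches the shape of the live heartbeat described in the audit.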
### P2. Turn node directory into a real distributed runtime input
Goal:
- nodes should learn and keep node/service information from the fabric, not by
repeatedly consulting a center.
Required work:
1. Preserve full scoped node directory for the current fleet.
2. Carry signed node/service records through peer exchange.
3. Keep endpoint candidates and runtime observations in local peer cache.
4. Spread updates to node/service reachability like a bounded wave, not as
independent central fetches by every node.
Rules:
- nodes may distribute signed directory/service data;
- nodes must not self-author authoritative control-plane state;
- the runtime may consume replicated signed copies of truth;
- PostgreSQL remains the durable origin of truth.
Done when:
- nodes can refresh peer/service discovery from peers plus signed records even
if one control edge disappears.
### P3. Replace the naive “3 peers” rule with stability by area and ingress
Goal:
- measure and enforce resilience by failure-domain diversity, not only count.
Required metrics:
- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
- `recovery_path_count`
Required topology labels:
- `site_id` - physical or logical site
- `locality_group` - private/local reachability domain
- `nat_group` - shared public edge dependency
Required behaviors:
1. Prefer peers from different `area` values.
2. Prefer peers behind different public ingress / NAT dependencies.
3. Keep direct-ready and relay-ready separate.
4. Keep at least one recovery path outside the local area.
5. Treat a public endpoint behind the same NAT area as
`external-network-required` unless cross-area observers have validated it.
6. Do not demote a public endpoint only because the same area cannot hairpin
through its own public router address.
7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
candidate is confirmed to be in the same local segment or NAT group.
8. Penalize or reject private/local-looking endpoints when they belong to a
different segment/NAT scope than the local node, instead of probing them as
if they were reachable.
Done when:
- critical nodes satisfy cross-area direct resilience targets, not merely raw
peer-count targets.
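The difference between raw peer count and area diversity can be made concrete with a small metric calculation. This is a sketch under assumptions: the `Peer` fields, the NAT group names, and the default target of two external areas are illustrative. The metric names and the `external_area_deficit:N_of_M` alert format follow the ones used in this plan and in the live audit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str          # failure-domain label
    nat_group: str     # shared public edge dependency
    direct_ready: bool
    relay_ready: bool


def resilience_metrics(local_area, peers, external_area_target=2):
    """Compute diversity metrics for one node's working peer set."""
    direct = [p for p in peers if p.direct_ready]
    relay = [p for p in peers if p.relay_ready]
    # Only direct peers outside the local area count toward diversity.
    external_direct = [p for p in direct if p.area != local_area]
    direct_ready_areas = sorted({p.area for p in external_direct})

    metrics = {
        "direct_ready_count": len(direct),
        "relay_ready_count": len(relay),
        "direct_ready_areas": direct_ready_areas,
        "external_area_ready_count": len(direct_ready_areas),
        "independent_ingress_ready_count": len(
            {p.nat_group for p in external_direct}
        ),
        "alerts": [],
    }
    if len(direct_ready_areas) < external_area_target:
        metrics["alerts"].append(
            f"external_area_deficit:{len(direct_ready_areas)}_of_{external_area_target}"
        )
    return metrics
```

A node with three direct-ready peers can still raise `external_area_deficit:1_of_2` if two of those peers share its own area, which is the situation a raw "3 peers" rule cannot detect.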
### P4. Normalize edge roles and remove accidental TCP confusion
Goal:
- if TCP is present, it must be obviously classified and justified.
Allowed TCP roles:
- external service ingress;
- Control API ingress;
- artifact delivery edge;
- temporary compatibility recovery overlap.
Required work:
1. Keep explicit inventory of edge listeners.
2. Distinguish transport TCP from service-edge TCP in audits and UI.
3. Advance the fabric-only recovery gate only after:
- compat control dependency is zero;
- registry is active;
- recovery path no longer depends on `19191`.
### P5. Build the update orchestrator and distributed update intent plane
Goal:
- nodes must not depend on one updater endpoint, one old updater process, or one
central polling path;
- update rollout must be controlled so the whole farm cannot update at once;
- update intent must be distributable through management and neighboring nodes
as signed metadata.
Required model:
1. The durable update object is a signed `update_intent`, not a hard-coded
updater URL.
2. Nodes may receive update intent from:
- Control API;
- update-store / update-cache;
- subscription hints over an outbound control channel;
- signed peer gossip from neighboring nodes;
- local cached last-known-good update state.
3. Nodes validate intent locally before execution.
4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
authority or expand scope.
5. Slow polling remains as the final safety net.
6. Subscription/hints are the fast path.
7. Gossip is the partition/recovery path.
8. Orchestrator-issued rollout leases are the concurrency guard.
Orchestrator requirements:
- canary, rolling, pinned, and forced-node strategies;
- max parallel globally;
- max parallel per area / site / NAT group;
- max unavailable nodes;
- pause/resume/abort;
- failure-rate stop;
- automatic stop on heartbeat loss or rollback;
- role-aware scheduling for control-api, update-store, update-cache, relay,
ingress, and egress nodes;
- separate host-agent and node-agent phases;
- emergency recovery bridge for compat nodes that predate the orchestrator.
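The concurrency limits in the orchestrator requirements can be sketched as a lease table that admits or refuses nodes. This is a minimal in-memory illustration, not the orchestrator design: `RolloutLimits`, `LeaseTable`, and the reason strings are hypothetical names, and it covers only the global limit, the per-area limit, and pause/resume.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RolloutLimits:
    max_parallel_global: int
    max_parallel_per_area: int


@dataclass
class LeaseTable:
    limits: RolloutLimits
    paused: bool = False
    active: dict = field(default_factory=dict)  # node_id -> area

    def try_acquire(self, node_id, area):
        """Return (granted, reason) for a node asking to start its update."""
        if self.paused:
            return False, "paused"
        if node_id in self.active:
            return True, "already_leased"
        if len(self.active) >= self.limits.max_parallel_global:
            return False, "global_limit"
        per_area = Counter(self.active.values())
        if per_area[area] >= self.limits.max_parallel_per_area:
            return False, "area_limit"
        self.active[node_id] = area
        return True, "leased"

    def release(self, node_id):
        """Free the lease once the node reports a terminal state."""
        self.active.pop(node_id, None)
```

With `max_parallel_per_area=1`, two nodes in the same area can never update at the same time, so one failed update cannot take a whole area offline.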
Node-side requirements:
- accept `check now` subscription signals;
- periodically poll as fallback;
- accept newer signed update intents from peers;
- keep a local update journal:
- pending intent generation;
- lease id;
- last accepted plan;
- staged artifact hash;
- previous binary / image;
- rollback state;
- admission failure reason;
- reconcile stale updater runtime against current node/container/task state
before fetching plans;
- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
and `aborted` states explicitly.
Done when:
- a node can learn a new update intent without directly reaching the original
control edge;
- a stale updater command line can be repaired from local running runtime state;
- simultaneous farm-wide update start is impossible without explicit
recovery-admin override;
- rollout can be paused and resumed without losing node intent state;
- at least one test proves a node behind NAT receives an update signal through
a neighbor and still waits for an orchestrator lease before applying.
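The split between "learning an intent" and "being allowed to apply it" can be sketched as two separate checks. Everything concrete here is an assumption for illustration: HMAC-SHA256 with a shared key stands in for whatever signature scheme the fabric actually uses (a production design would more likely use asymmetric signatures), and the intent, journal, and lease fields are hypothetical.

```python
import hashlib
import hmac
import json


def sign_intent(key, intent):
    """Sign a canonical JSON encoding of the intent (stand-in scheme)."""
    payload = json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def accept_intent(key, intent, signature, journal, node_id):
    """Validate an intent received from any source, including a peer."""
    expected = sign_intent(key, intent)
    if not hmac.compare_digest(expected, signature):
        return "rejected:bad_signature"   # a relay altered or forged it
    if node_id not in intent["scope"]:
        return "rejected:out_of_scope"
    if intent["generation"] <= journal.get("pending_generation", 0):
        return "ignored:not_newer"
    journal["pending_generation"] = intent["generation"]
    journal["state"] = "blocked"          # known, but no lease yet
    return "accepted"


def may_apply(journal, lease, node_id):
    """A node applies only with a lease matching its pending generation."""
    pending = journal.get("pending_generation")
    return (
        pending is not None
        and lease is not None
        and lease.get("node_id") == node_id
        and lease.get("generation") == pending
    )
```

Because the signature covers the scope, a neighbor that relays the intent cannot widen it to additional nodes, and a node that accepts the intent through gossip still stays in `blocked` until a lease arrives.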
## 4. Immediate next implementation sequence
### Step A
Release and roll out the noop-rewrite restart fix so that updated runtimes do
not remain on stale control sessions after a config rewrite.
### Step B
Release and roll out the relay certificate intent fix so stale-relay
replacement and bootstrap relay paths do not probe a relay endpoint with a
certificate fingerprint copied from a different private direct candidate.
This is tracked by:
- `rap-node-agent 0.2.332-relaycertintentfix`
Done when:
- `peer certificate fingerprint mismatch` no longer appears on healthy
relay/bootstrap paths between live areas;
- `ifcm` no longer loses ready peers because relay endpoint selection and peer
certificate pinning disagree.
### Step C
Re-check live heartbeat and stale-risk:
- `compat_control_dependency_nodes`
- `registry_candidate_only_nodes`
- `updater_subscription_alert_nodes`
- `updater_wake_unsupported_nodes`
- `bridge_hold_required`
- current control URL in heartbeat
### Step D
Continue registry activation work until active records are used in practice.
### Step E
Continue peer diversity work using:
- `area`
- direct-ready area coverage
- independent ingress diversity
### Step F
Run another live audit and decide whether `19191/tcp` recovery overlap can be
removed.
## 5. Hard acceptance criteria
The fabric is considered converged only when all of the following are true:
1. Inter-node runtime transport is QUIC/UDP only.
2. No live node depends on the compat `19191` control contract.
3. Signed registry runtime is active.
4. Nodes carry and use distributed node/service knowledge through signed
records and peer cache.
5. Cross-area direct resilience targets are satisfied for critical nodes.
6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
transport.
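Several of these criteria can be checked mechanically against a fleet snapshot. The sketch below covers only criteria 1, 2, 3, and 5. The snapshot layout is an assumption, and `non_quic_internode_routes` and `registry_status_by_node` are hypothetical field names. `compat_control_dependency_nodes` and `area_diversity_alert_nodes` follow the counters used in the live audit.

```python
def convergence_blockers(snapshot):
    """Return human-readable reasons why the fabric is not yet converged."""
    blockers = []
    if snapshot["non_quic_internode_routes"] > 0:
        blockers.append("inter-node transport is not QUIC/UDP only")
    if snapshot["compat_control_dependency_nodes"] > 0:
        blockers.append("compat 19191 control dependency remains")
    not_active = sorted(
        node
        for node, status in snapshot["registry_status_by_node"].items()
        if status != "active"
    )
    if not_active:
        blockers.append("registry not active on: " + ", ".join(not_active))
    if snapshot["area_diversity_alert_nodes"] > 0:
        blockers.append("cross-area direct resilience targets unmet")
    return blockers
```

An empty list means these four gates pass. Criteria 4 and 6 need inputs such as peer-cache usage and an edge listener inventory, which this sketch does not model.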
## 6. This plan starts now
The immediate active engineering task after writing this document is:
- complete the rollout of the runtime rewrite restart fix;
- remove the last live compat control dependency;
- then move directly into signed registry activation and cross-area peer
resilience work.
Update 2026-05-19:
- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
heartbeat update hints: when a fresh hint generation arrives, the live
node-agent persists `update-trigger.json` and wakes the local updater
task/service.
- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
failure where heartbeat remains fresh but the updater subscription plane is
dead.
- As of the current rollout, this release is already on `home-*`, `test-*`,
and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
`updater_wake_unsupported` blocker.
@@ -258,7 +258,7 @@ Production fabric-core migration boundary:
QUIC endpoint candidates for the next hop, sends the envelope over the chosen
QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
failure or response timeout.
- The legacy HTTP production forward carrier has been removed from the mesh
- The compat HTTP production forward carrier has been removed from the mesh
runtime API. Production forwarding now exposes a single QUIC transport
implementation; HTTP handlers remain only as node-local API surfaces and test
harness entry points.
@@ -287,7 +287,7 @@ Production fabric-core migration boundary:
- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
instead of collapsing to one address: operator/public QUIC, listener QUIC,
LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
`relay_quic` fallback. Candidate metadata carries `local_segment_id`,
`relay_quic` fallback. Candidate metadata carries `locality_group_id`,
`nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
`relay_endpoint` when configured. When a relay endpoint is the first physical
QUIC hop, its advertised certificate fingerprint must survive route planning
@@ -296,23 +296,23 @@ Production fabric-core migration boundary:
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
health observations, latency, reliability, region, policy tags, and live
capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
capacity pressure; HTTP/WebSocket labels are treated as rejected compat
candidates rather than alternate transports.
- `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
selection can return only `QUICFabricTransport`; unsupported labels fail with
a QUIC-required error.
- Explicit transport labels are authoritative. A legacy label such as `relay`
- Explicit transport labels are authoritative. A compat label such as `relay`
or `outbound_reverse` is rejected even when the endpoint string uses a
`quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
- Node-agent config loading rejects legacy advertised transport labels and
- Node-agent config loading rejects compat advertised transport labels and
HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
fabric endpoints. Bad endpoint posture fails before heartbeat publication.
- Host-agent install/runtime validation rejects legacy mesh advertise transport
- Host-agent install/runtime validation rejects compat mesh advertise transport
labels and HTTP/WebSocket advertise endpoints before they can be passed into a
node-agent Docker runtime.
- JSON-advertised endpoint candidates and scoped synthetic config route
recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
seeds, and rendezvous leases reject legacy transport labels and
seeds, and rendezvous leases reject compat transport labels and
HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
entries.
- Rendezvous relay leases and peer-connection intents now use `relay_quic` as
@@ -325,24 +325,24 @@ Production fabric-core migration boundary:
- Node-agent synthetic runtime no longer installs an HTTP peer transport as an
inter-node carrier, and the shared mesh runtime package no longer exports an
HTTP peer transport implementation. Any HTTP synthetic motion is confined to
explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
explicit compat smoke harness code while fabric acceptance uses QUIC loadtest
gates.
- Control-plane and debug JSON mesh config loading is validated after
conversion into runtime structures. Peer endpoint candidates, recovery seeds,
rendezvous leases, and selected relay endpoints in route decisions must use
QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
- Scoped synthetic mesh configs also reject compat `peer_endpoints` directly,
in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
rendezvous leases.
- The old fabric-session WebSocket endpoint is no longer exposed by
`FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
`FabricSessionEnabled` alone. It requires an explicit compat test harness flag
and is not part of the node-agent fabric transport surface.
- Same local segment or same NAT group is treated as a LAN route by the planner,
so a whole cluster piece behind one NAT can prefer private addresses between
its own nodes while still maintaining outbound/relay visibility to the rest
of the fabric.
- Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
route-set counts, QUIC candidate totals, rejected compat/non-QUIC candidate
totals by transport label, route pressure, QUIC listener state, goroutines,
heap usage, and the next recommended soak gate.
- `FabricOverlayTransport` is the generic service-neutral send facade over
@@ -375,7 +375,7 @@ Production fabric-core migration boundary:
healthy targets are present. A `mixed-public-nat-lan-relay` or
`nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`.
- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
- Loadtest verdicts also fail on compat route-mode labels. Seeing `relay`,
`outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
`wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
violation even if payload delivery succeeds.
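The route-mode gate described above can be illustrated as a set check over observed telemetry labels. This is a sketch, not the loadtest implementation: the function name and result shape are assumptions, while the label sets are taken from the text.

```python
# Labels that the text treats as transport-layer violations.
COMPAT_ROUTE_MODES = frozenset({
    "relay", "outbound_reverse", "direct_http", "direct_https",
    "direct_tcp_tls", "ws", "wss", "websocket",
})

# Modes a mixed-public-nat-lan-relay / nat-lan-relay run must exercise.
MIXED_PROFILE_REQUIRED = ("lan_quic", "ice_quic", "reverse_quic", "relay_quic")


def route_mode_verdict(observed, required=()):
    """Fail on any compat label or on any required mode never observed."""
    seen = set(observed)
    violations = sorted(seen & COMPAT_ROUTE_MODES)
    missing = sorted(set(required) - seen)
    return {
        "passed": not violations and not missing,
        "violations": violations,
        "missing": missing,
    }
```

The verdict ignores payload success on purpose: a run that delivers every payload over `relay` or `wss` still fails, which is the behavior the gate is meant to enforce.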
@@ -686,7 +686,7 @@ Production fabric-core migration boundary:
`control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
`route_pressure.active_total=0`, and matching acquire/release counts.
- Verified strict QUIC route-mode gate:
`fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
`fabric-loadtest-20260516-182550` rebuilt the loadtest image with compat
route-mode verdicts and ran the 4-node mixed topology profile. It produced
400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
@@ -816,7 +816,7 @@ Production fabric-core migration boundary:
- Published and registered node-agent release `0.2.280-fabricsession` with
linux binary/native and Docker image artifacts. The release is intentionally
not assigned to live node update policies yet because current live node
workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
workload/env posture still advertises compat `direct_http` and HTTP/HTTPS
mesh endpoints. Before rollout, node configs must be migrated to
`quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
+140 -19
@@ -4,9 +4,22 @@ Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.
The target layering model referenced by this audit is documented in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
The current execution sequence derived from this audit is maintained in
[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).
## Current confirmed state
- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
`500`, not because QUIC or registry bootstrap was broken, but because
PostgreSQL rejected `\u0000` inside heartbeat JSON with
`unsupported Unicode escape sequence (SQLSTATE 22P05)`.
- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
converged onto the new release line with live registry promotion.
- The active node set
- `home-1`
- `home-2`
@@ -16,9 +29,40 @@ behavior still differs from the target architecture.
- `test-3`
- `usa-los-1`
- `ifcm-rufms-s-mo1cr`
is converged on `0.2.321-directreadytarget`.
currently spans:
- `home-*`, `test-*`, and `usa-los-1` on
`0.2.327-registrybootstraprewrite`;
- `ifcm-rufms-s-mo1cr` still remaining on
`0.2.322-controlendpointsrewrite`.
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
no longer stale.
- `ifcm-rufms-s-mo1cr` has already migrated off compat overlap
`http://vpn.cin.su:19191/api/v1` and now reports
`https://vpn.cin.su/api/v1`, but it still has not advanced to the new
registry-aware release line.
- `home-*` and `test-*` now report:
- `reported_version = 0.2.327-registrybootstraprewrite`
- `peer_cache_peers = 7`
- `fabric_registry_runtime_report.status = active`
- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
reports `fabric_registry_runtime_report.status = missing`, which means this
node has a runtime bootstrap/config rewrite gap rather than a version gap.
- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
`home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
again instead of falling back to the thin `c3` payload.
- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
- `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
- `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
- `direct_peer_alert_nodes = 5`
- `area_diversity_alert_nodes = 6`
- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
rollout still shows:
- `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
- `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
- `peer_cache_peers = 7`
- latest update status still stuck at `2026-05-18 20:50 UTC`
- this is now classified as `updater_wake_unsupported`, not just a generic
stale or compat-control symptom
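The `\u0000` ingestion defect noted earlier in this list comes from PostgreSQL refusing NUL characters inside JSON text destined for `jsonb`. One way to sanitize a decoded heartbeat before persistence is sketched below. It is an illustration only: the actual backend fix is not shown in this commit, and `strip_nul` is a hypothetical helper name.

```python
import json


def strip_nul(value):
    """Recursively remove NUL characters from all strings in a JSON value."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        return {strip_nul(k): strip_nul(v) for k, v in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


def sanitize_heartbeat_json(raw):
    """Decode, sanitize, and re-encode a heartbeat payload."""
    return json.dumps(strip_nul(json.loads(raw)))
```

Sanitizing the decoded structure rather than the raw text avoids accidentally rewriting an escaped backslash sequence that merely looks like `\u0000`.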
## Why TCP traffic is still visible
@@ -35,7 +79,7 @@ Observed live listeners:
- `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
- `19131/udp` - QUIC fabric listener
- `19191/tcp` - external compatibility bridge currently held open so legacy
- `19191/tcp` - external compatibility bridge currently held open so compat
recovery contracts can still reach `Control API/downloads`
Therefore:
@@ -49,7 +93,8 @@ Therefore:
### 1. Nodes do not yet operate from a fully active signed registry gossip plane
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
refresh:
- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
@@ -61,11 +106,11 @@ This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.
### 2. Legacy control/download contracts are still real dependencies
### 2. Compat control/download contracts are still real dependencies
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:
- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
This confirms the root recovery lesson:
@@ -77,15 +122,31 @@ This confirms the root recovery lesson:
### 3. Direct peer resilience is still below the intended threshold
Observed from live heartbeat metadata:
Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:
- `ifcm-rufms-s-mo1cr`
- `peer_connection_ready = 2`
- `peer_connection_relay_ready = 3`
- `target_ready_peers = 3`
- `home-1`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `home-2`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `home-3`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `test-1/2/3`
- `peer_connection_ready = 3`
- but `direct_ready_areas = [usa]`
- therefore each still triggers `external_area_deficit:1_of_2`
- `usa-los-1`
- `peer_connection_ready = 1`
- `peer_connection_relay_ready = 5`
- `direct_ready_areas = [ifcm, home, test]`
- `target_ready_peers = 3`
This means the direct-path resilience target is not satisfied yet, even though
@@ -99,17 +160,35 @@ The practical reason is simple:
- relay-ready adjacency is masking direct peer deficit, but it does not replace
the requirement for at least three direct-ready peers.
### 3.1 Public endpoint confirmation must be cross-area, not local hairpin
The live `home/test` topology also exposed a verification mistake in the
runtime model:
- `home` and `test` sit behind the same public router address
`94.141.118.222`;
- some public QUIC candidates are valid only when tested from another area such
as `usa` or `ifcm`;
- a same-area probe can fail purely because the local router does not support
hairpin NAT / NAT reflection.
Operational consequence:
- a public endpoint marked as `external-network-required` must be treated as
non-authoritative when the failure came from `self` or `same_area`;
- the public candidate should be confirmed or rejected by `cross_area`
observers instead.
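The rule above amounts to giving only cross-area observations the authority to confirm or reject a public candidate. A minimal sketch follows. The observation tuple shape, the `confirmed` and `rejected` result names, and the choice that any single cross-area success confirms the endpoint are assumptions. The `self`, `same_area`, and `cross_area` scopes and the `external-network-required` label come from the text.

```python
def public_endpoint_verdict(observations):
    """Classify a public QUIC candidate from probe observations.

    `observations` is a list of (observer_scope, success) tuples where
    observer_scope is 'self', 'same_area', or 'cross_area'.
    """
    cross_area = [ok for scope, ok in observations if scope == "cross_area"]
    if any(cross_area):
        return "confirmed"
    if cross_area:
        # Cross-area observers tried and every probe failed.
        return "rejected"
    # Only local evidence exists; a hairpin failure proves nothing.
    return "external-network-required"
```

A failed probe from `home-1` to a `test-*` public address therefore never demotes the candidate on its own, since both areas share the same public router and the failure may only reflect missing hairpin NAT support.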
### 4. Observability is still heterogeneous
Live heartbeat coverage is inconsistent:
Live heartbeat coverage is now richer than it was earlier in the day, but it is
still not fully converged in behavior:
- `test-*`, `ifcm`, `usa-los-1` emit rich `c17z20` heartbeat metadata with
endpoint, peer recovery, and registry sections.
- `home-*` currently do not expose the same full sections in their latest
heartbeat rows.
This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.
- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
peer recovery, and registry sections again.
- `ifcm` is still the only node that currently reports `compat control` and
`registry candidate_only`, so the observability gap has narrowed into a real
single-node convergence issue instead of a fleet-wide blind spot.
## What is true right now
@@ -117,21 +196,63 @@ uniform live introspection across every node today.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
met for all externally significant nodes.
5. `ifcm` still depends on the compat `19191` control overlap.
6. The plain `3 direct peers` target is insufficient on its own; the live fleet
now clearly shows that `cross-area direct diversity` is the next real gate.
## Control/API migration progress
The codebase now carries a more explicit migration contract for control access:
- install profiles prefer canonical `control_plane_endpoints` over a compat
singleton `backend_url`;
- host runtime env generation now exports
removed control-plane endpoint env key;
- node heartbeat/control reporting prefers that canonical endpoint set when it
is present;
- stale updater status behind a fresh heartbeat is now classified separately as
`updater_subscription_gap`;
- heartbeat update hints now have a second-stage recovery path: after writing
`update-trigger.json`, a live node can also wake its local updater
task/service.
This does not instantly rewrite older runtime wrappers on already-installed
nodes by itself. It does remove the same trap for the next install, reinstall,
or update-service rewrite cycle.
## Operational rule until the next audit
Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:
- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.
- any node can only recover through compat `Control API/downloads` overlap.
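The conditions above can be collected into a single pre-removal gate. The sketch below assumes node reports are dicts; `control_plane_url`, `fabric_control_endpoint`, and `fabric_registry_runtime_report.status` come from the rule, while `externally_significant`, `direct_ready_peers`, `recovers_only_via_compat`, the blocker labels, and the interpretation of "on the `19191` contract" as "the URL port is 19191" are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

COMPAT_PORT = 19191


def _uses_compat_port(url):
    """True when the URL explicitly targets the compatibility port."""
    try:
        return urlsplit(url or "").port == COMPAT_PORT
    except ValueError:
        # Malformed port in the URL; treat it as not matching.
        return False


def compat_overlap_blockers(nodes, min_direct_peers=3):
    """Return (node_name, reason) pairs that forbid removing the overlap."""
    blockers = []
    for node in nodes:
        name = node["name"]
        if _uses_compat_port(node.get("control_plane_url")):
            blockers.append((name, "control_plane_url_on_19191"))
        if _uses_compat_port(node.get("fabric_control_endpoint")):
            blockers.append((name, "fabric_control_endpoint_on_19191"))
        status = node.get("fabric_registry_runtime_report", {}).get("status")
        if status != "active":
            blockers.append((name, "registry_not_active"))
        if (node.get("externally_significant")
                and node.get("direct_ready_peers", 0) < min_direct_peers):
            blockers.append((name, "direct_peer_deficit"))
        if node.get("recovers_only_via_compat"):
            blockers.append((name, "compat_only_recovery"))
    return blockers
```

The overlap may be removed only when the returned list is empty for every live node.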
## Required next work
Update 2026-05-19:
- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
wake path driven by heartbeat update hints.
- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
  sending fresh heartbeats while the updater subscription plane silently stops
  progressing.
- This is now treated as a first-class recovery-plane problem, not as a vague
stale-node symptom.
- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
`0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is now the only remaining
`updater_wake_unsupported` blocker.
- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
defect: on one stale-relay/bootstrap path the node tried a relay endpoint
with the certificate fingerprint from a different private direct candidate,
producing
`CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
- That bug is now fixed in the runtime line tracked as
`0.2.332-relaycertintentfix`.
### A. Finish signed registry activation
Each node must be able to resolve active records for at least:
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
The fabric must be able to lose:
- old API endpoints;
- old artifact URLs;
- old artifact distributors;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
is true:
1. every node has confirmed the new contract; or
2. the missing nodes were manually retired, revoked, or explicitly accepted as
2. the missing nodes were manually removed, revoked, or explicitly accepted as
lost.
This applies to:
@@ -81,6 +81,17 @@ This applies to:
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.
Canonical `Control API` access must be distributable as an explicit endpoint
set, not only as a single compat `backend_url`. Install/update contracts should
carry:
- `control_plane_endpoints`;
- signed fabric registry bootstrap records;
- artifact endpoints.
The old `backend_url` remains a compatibility fallback only until the fleet has
converged.
The rule is strict: do not delete the old recovery format while nodes that may
still need it remain unrecovered.
@@ -200,6 +211,67 @@ Required model:
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.
### 3.7 Update Intent Must Be Independent From One Updater Endpoint
A node must not be permanently bound to one updater service, one updater node,
one systemd unit name, one scheduled task name, or one control endpoint.
The durable object is not "call this updater URL". The durable object is a
signed update intent:
- product;
- target version or version constraint;
- artifact hashes and allowed mirrors;
- compatibility contract;
- rollout lease constraints;
- force / emergency flags;
- rollback permission;
- signed registry/service records that can carry the intent;
- expiry and generation.
A node may learn the same signed intent from:
- Control API;
- update-store;
- update-cache;
- long-lived outbound control subscription;
- neighboring nodes through signed fabric registry gossip;
- local cached last-known-good update state.
The receiving node must validate the intent locally before acting. A neighbor
may relay signed update metadata and artifacts, but it must not become an
authority that can forge or broaden an update.
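A minimal sketch of local intent validation follows. It models only a subset of the fields listed above, and every name in it (`UpdateIntent`, `validate_intent`, `artifact_allowed`, the result strings) is hypothetical. Signature verification is passed in as a callable because the actual signing scheme is not described here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateIntent:
    product: str
    target_version: str
    artifact_sha256: str      # hex digest of the expected artifact
    allowed_mirrors: tuple    # mirrors the signer permits
    generation: int
    expires_at: float         # unix seconds
    signature: bytes


def validate_intent(intent, *, now, last_generation, verify_signature):
    """Validate an intent locally, regardless of which path delivered it."""
    if not verify_signature(intent):
        return "rejected:bad_signature"
    if now >= intent.expires_at:
        return "rejected:expired"
    if intent.generation < last_generation:
        return "rejected:stale_generation"
    return "accepted"


def artifact_allowed(intent, mirror, payload):
    """A relaying neighbor cannot broaden the intent: both the mirror and
    the artifact bytes must match what was signed."""
    digest = hashlib.sha256(payload).hexdigest()
    return mirror in intent.allowed_mirrors and digest == intent.artifact_sha256
```

Because the checks depend only on the signed fields, the same function applies whether the intent arrived from Control API, a cache, or a neighbor.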
The local recovery boundary must reconcile stale runtime facts before fetching
or applying a plan:
- current cluster id;
- node id and identity state directory;
- current container/task/unit name;
- current control endpoints;
- current signed registry records;
- available artifact mirrors.
This is mandatory because a node may move, a container may be renamed, a task
may be recreated, or the old host updater may still have a stale command line.
### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
The update plane must use three delivery paths at the same time:
1. slow local fallback polling, so a node eventually recovers even after missed
signals;
2. subscription / push hints, so ordinary updates are fast and do not wait for
a long poll interval;
3. peer relay of signed update intents and signed registry records, so a node
can learn current update truth through reachable neighbors when the old
center or old ingress is unavailable.
No single path may be the only recovery mechanism.
Polling cadence is a safety net, not the rollout control mechanism. Rollout
control belongs to the orchestrator and signed rollout leases.
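Running three delivery paths at once means the same intent will often arrive more than once. The sketch below shows one way to make delivery idempotent by tracking the newest generation per product. The class name, method names, and source labels are hypothetical, and it assumes the intent was already validated before being offered.

```python
class IntentInbox:
    """Collects already-validated intents from polling, push, and peer relay."""

    def __init__(self):
        self._latest = {}  # product -> (generation, source)

    def offer(self, product, generation, source):
        """Record an intent; return True only if it advances the known generation."""
        current = self._latest.get(product)
        if current is not None and generation <= current[0]:
            # Duplicate or older intent from another path: safe to ignore.
            return False
        self._latest[product] = (generation, source)
        return True

    def latest(self, product):
        """Return (generation, source) for the product, or None if unknown."""
        return self._latest.get(product)
```

A push hint and a later poll for the same generation then trigger at most one update attempt, while a newer generation learned from a neighbor still gets through.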
## 4. Update Safety Rules
### 4.1 Upgrade Contracts
@@ -228,7 +300,7 @@ explicit retirement.
Recovery-critical artifact versions must remain available until:
- all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.
- the remaining nodes are revoked/removed and recorded as intentionally lost.
Do not garbage-collect the last working host-agent or node-agent build for an
unrecovered population.
@@ -237,17 +309,18 @@ unrecovered population.
If historical nodes request different install types for the same product
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
planning must keep compatibility aliases until the fleet converges.
planning must publish explicit signed install-type mappings in the fabric
registry until the fleet converges.
The fabric must not strand nodes on an install-type naming mismatch.
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk
A stale node may report:
- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says `no_matching_artifact` or
an equivalent legacy contract failure.
an equivalent compat contract failure.
This means the node is not only waiting for a heartbeat. It is running an older
recovery planner contract and may still depend on:
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.
This condition must be classified as `legacy recovery contract drift` and must
This condition must be classified as `compat recovery contract drift` and must
block compatibility removal the same way an artifact gap does.
Operationally this also means:
@@ -268,11 +341,11 @@ Operationally this also means:
status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should
be classified as `bridge replay ready`, meaning the system can replay the
legacy-compatible update plan as soon as the node regains an outbound control
  compat update plan as soon as the node regains an outbound control
cycle;
- operator tooling should expose a canonical `bridge replay plan` per node so
recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
- signed recovery mappings must remain available for that node population;
- dashboards and rollout guards must show this separately from ordinary
`waiting recovery heartbeat`.
@@ -281,9 +354,78 @@ Canonical example:
- `ifcm-rufms-s-mo1cr` is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
- therefore the node must be treated as a compat recovery-contract blocker, not
merely as a delayed heartbeat.
### 4.6 Rollout Orchestrator Is Mandatory
Large fleet update safety requires an orchestrator. The orchestrator decides
which nodes may update now. Nodes decide whether a received signed intent is
valid and locally safe to execute.
The orchestrator must support:
- canary rollout;
- rolling rollout;
- area / site / NAT-group aware rollout;
- max parallel updates globally;
- max parallel updates per area;
- max unavailable nodes;
- minimum healthy quorum before continuing;
- hold / pause / resume;
- force update for explicitly selected nodes;
- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
regression;
- separate host-agent and node-agent phases;
- emergency recovery bridge for pre-orchestrator compat nodes.
The orchestrator must issue short-lived rollout leases. A node may only start an
update when it holds a valid lease for that product/version. If the lease
expires before apply starts, the node must re-check the policy.
Rollout leases prevent the entire farm from starting the same update
simultaneously when a subscription signal or gossip wave reaches all nodes.
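The lease check on the node side can be sketched as follows. `RolloutLease` and `may_start_update` are hypothetical names, and the lease is reduced to the fields the text mentions (product, version, expiry) plus a node id, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutLease:
    node_id: str
    product: str
    version: str
    expires_at: float  # unix seconds; leases are short-lived


def may_start_update(lease, *, node_id, product, version, now):
    """True only when the node holds an unexpired lease for exactly this update."""
    if lease is None:
        return False
    return (
        lease.node_id == node_id
        and lease.product == product
        and lease.version == version
        and now < lease.expires_at
    )
```

When this returns False because the lease expired before apply started, the node goes back to the policy check instead of applying anyway.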
### 4.7 Node-Side Update Admission Control
Even with a lease, the node must perform local admission checks before apply:
- artifact hash and signature match the signed intent;
- rollback artifact or previous binary is available unless policy explicitly
disables rollback;
- enough disk space exists for stage plus rollback;
- current active workload can tolerate restart, or the orchestrator has granted
  a maintenance lease;
- the node still has at least the required recovery connectivity after
excluding itself as temporarily unavailable;
- host-agent update is applied before node-agent update when the contract says
the host-agent is the recovery floor.
If admission fails, the node reports `blocked` with a precise reason instead of
silently waiting.
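The admission checks above map naturally onto an ordered list of named conditions, so that a failure always carries a precise reason. This is an illustrative sketch only: the `facts` keys and the reason strings are hypothetical, and gathering the facts (hash verification, disk probing, connectivity counting) is assumed to happen elsewhere.

```python
def admit_update(facts):
    """Return ("admitted", None) or ("blocked", reason) from pre-gathered facts."""
    checks = [
        ("artifact_mismatch", facts["artifact_verified"]),
        ("rollback_unavailable",
         facts["rollback_available"] or facts["rollback_disabled_by_policy"]),
        ("insufficient_disk",
         facts["free_bytes"] >= facts["stage_bytes"] + facts["rollback_bytes"]),
        ("workload_cannot_restart",
         facts["workload_restartable"] or facts["maintenance_lease"]),
        ("recovery_connectivity_too_low",
         facts["recovery_paths_without_self"] >= facts["required_recovery_paths"]),
        ("host_agent_must_update_first",
         not facts["host_agent_is_recovery_floor"] or facts["host_agent_updated"]),
    ]
    for reason, passed in checks:
        if not passed:
            # Report the first failing check instead of silently waiting.
            return ("blocked", reason)
    return ("admitted", None)
```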
### 4.8 Update Waves Must Preserve Failure-Domain Diversity
An update wave must not take down all nodes from the same recovery role or
failure domain at once.
The orchestrator must account for:
- area;
- site;
- locality group;
- NAT group;
- public ingress dependency;
- control-api role;
- update-store / update-cache role;
- relay / rendezvous role;
- VPN ingress / egress roles;
- nodes that are currently the only known recovery path for another node.
For a small fleet, this means the orchestrator may update one node at a time
when the remaining diversity is weak, even if the global max parallel setting
is higher.
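A simple greedy wave planner illustrates why a small fleet may end up updating fewer nodes than the global limit allows. The sketch assumes each node is a dict with a few failure-domain attributes; `DOMAIN_KEYS`, `sole_recovery_path_for`, and the function name are hypothetical, and a real orchestrator would consider more of the dimensions listed above.

```python
DOMAIN_KEYS = ("area", "site", "nat_group")


def plan_wave(candidates, max_parallel):
    """Pick nodes for one wave so that no two share a failure domain."""
    wave = []
    used_domains = set()
    for node in candidates:
        if len(wave) >= max_parallel:
            break
        if node.get("sole_recovery_path_for"):
            # This node is the only known recovery path for someone else;
            # leave it for a later wave.
            continue
        domains = {(key, node[key]) for key in DOMAIN_KEYS if key in node}
        if domains & used_domains:
            # Shares an area, site, or NAT group with a node already chosen.
            continue
        wave.append(node["name"])
        used_domains |= domains
    return wave


fleet = [
    {"name": "home-1", "area": "home", "nat_group": "shared-router"},
    {"name": "test-1", "area": "test", "nat_group": "shared-router"},
    {"name": "usa-los-1", "area": "usa", "nat_group": "usa"},
]
```

With `max_parallel=3`, this example still selects only `home-1` and `usa-los-1`, because `home-1` and `test-1` are modelled here as sharing one NAT group.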
## 5. Service And Location Mobility Rules
Moving a service must not strand nodes that only know the old location.
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
- platform removes compat support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.
## 7. Required Local State And Journaling
@@ -359,7 +501,7 @@ It must surface:
- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
- nodes whose only known artifact distributor is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
7. install type aliases remain for historical agents where needed;
8. NAT/passive/outbound-only nodes were explicitly tested;
9. stale-node risk report is empty or consciously accepted by recovery-admin;
10. removal of legacy support is documented with the exact cutoff conditions.
10. removal of compat support is documented with the exact cutoff conditions.
## 10. `ifcm-rufms-s-mo1cr` Rule
@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
- stale-node risk dashboard/report before compat cleanup;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
@@ -344,7 +344,7 @@ The first backend contract slice is implemented:
- Fenced routes are not returned as primary or alternate route candidates in a
service-channel lease. If every route for the selected entry/exit pair is
fenced by service-channel feedback, the lease enters explicit degraded
backend fallback with reason
compat fallback with reason
`fabric_routes_fenced_by_service_channel_feedback`.
- A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
`vpn_packets` route intents, injected fresh service-channel flow feedback
@@ -507,18 +507,18 @@ The first backend contract slice is implemented:
post-restart exit inbox depth from `0` to `88` with zero inbox drops.
- C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
Node-agent `0.2.183` validates the signed service-channel lease authority and
forces backend fallback when Control Plane has signed
forces compat fallback when Control Plane has signed
`status=degraded_fallback` or `primary_route.status=missing_route_intent`.
This prevents a node from ignoring the lease decision and accidentally using
older generic route candidates for the same VPN resource. The rule applies to
both HTTP packet ingress and WebSocket packet ingress. The live smoke
`scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
recovery while a lease exists, explicit backend fallback when no authorized
recovery while a lease exists, explicit compat fallback when no authorized
fabric route exists, and route-intent expiry. The passing artifact is
`artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
recovery `4/4`, and moved the degraded backend fallback queue from `0` to
recovery `4/4`, and moved the degraded compat fallback queue from `0` to
`8`.
- C18Z4 adds live long-session pressure coverage without another runtime
release. The script
@@ -529,7 +529,7 @@ The first backend contract slice is implemented:
alternate route. The passing artifact is
`artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
route failure delta `0`, flow drop delta `0`, and backend fallback queue
route failure delta `0`, flow drop delta `0`, and compat fallback queue
`0 -> 0`. This proves route-policy churn can be absorbed by the shared
fabric runtime while a service WebSocket remains active.
- C18Z5 adds live exit-node failure coverage while the same kind of service
@@ -540,7 +540,7 @@ The first backend contract slice is implemented:
the same signed WebSocket. The passing artifact is
`artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
`c18z5-20260507-213745` sent 480 packets total, observed route failure delta
`48`, backend fallback queue `0 -> 192`, flow drop delta `0`, and recovery
`48`, compat fallback queue `0 -> 192`, flow drop delta `0`, and recovery
exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
degraded/fallback telemetry and fabric delivery resumes after runtime
recovery without requiring the service connection to be rebuilt.
@@ -554,7 +554,7 @@ The first backend contract slice is implemented:
`artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
`c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
inbox, selected the replacement route, kept route failure delta `0`, flow
drop delta `0`, and backend fallback queue `0 -> 0`. This proves route-manager
drop delta `0`, and compat fallback queue `0 -> 0`. This proves route-manager
replacement can be applied under an active service session without requiring
the service connection to be recreated.
- C18Z7 adds concurrent service-session isolation coverage. The script
@@ -565,7 +565,7 @@ The first backend contract slice is implemented:
`applied_rebuild`, then continues all sessions. The passing artifact is
`artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
session, with total backend fallback delta `0`, route failure delta `0`, and
session, with total compat fallback delta `0`, route failure delta `0`, and
flow drop delta `0`. This proves concurrent service sessions keep separate
resource queues and are not starved or poisoned by a shared route-manager
rebuild.
@@ -579,7 +579,7 @@ The first backend contract slice is implemented:
run `c18z8-20260507-221347` delivered 192 packets per interactive session,
hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
hottest channel, dropped `282` packets on that overloaded channel, and kept
backend fallback delta `0` and route failure delta `0`. This proves bounded
compat fallback delta `0` and route failure delta `0`. This proves bounded
queue pressure is service-neutral, observable, and isolated to the overloaded
logical flow without starving other active sessions.
- C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
@@ -593,7 +593,7 @@ The first backend contract slice is implemented:
node-agent `applied_rebuild`, and verifies the same service session continues
over the fast route. The passing artifact is
`artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
`c18z9-20260507-224901` kept backend fallback delta `0`, route failure delta
`c18z9-20260507-224901` kept compat fallback delta `0`, route failure delta
`0`, and flow drop delta `0`.
- C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
`0.2.185` binds signed entry/exit pools into the service-channel lease
@@ -610,7 +610,7 @@ The first backend contract slice is implemented:
`applied_rebuild`, and verifies 288 packets land on the alternate exit. The
passing artifact is
`artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
`c18z10-20260507-232645` kept backend fallback `0`, route failure delta `0`,
`c18z10-20260507-232645` kept compat fallback `0`, route failure delta `0`,
and flow drop delta `0`.
- C18Z11 adds service-channel entry-pool failover contract coverage. Backend
`rap-backend:fabric-service-channel-0.2.186` keeps
@@ -675,7 +675,7 @@ The first backend contract slice is implemented:
continues on the learned fast route. The passing artifact is
`artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
packets to the exit, kept backend fallback `0`, flow drops `0`, and expired
packets to the exit, kept compat fallback `0`, flow drops `0`, and expired
temporary route intents.
- C18Z15 exposes and hardens effective route-quality preference telemetry.
Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
@@ -690,7 +690,7 @@ The first backend contract slice is implemented:
passing artifact is
`artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
packets to the exit, kept backend fallback `0`, flow drops `0`, and exposed
packets to the exit, kept compat fallback `0`, flow drops `0`, and exposed
decayed effective scores in node telemetry.
- C18Z16 adds per-channel route-quality preference telemetry and fairness
guardrails. Node-agent `0.2.191` records the applied
@@ -704,7 +704,7 @@ The first backend contract slice is implemented:
`artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
logical channels, applied quality preference telemetry to all 32 served
channels, kept backend fallback `0`, and flow drops `0`.
channels, kept compat fallback `0`, and flow drops `0`.
- C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
removes channel-level quality preference diagnostics when the preference is no
longer present in the current effective preference set or when the preferred
@@ -712,10 +712,10 @@ The first backend contract slice is implemented:
`scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
verifies that active channel markers reference visible preferences, stale
markers are absent, expired route intents are not active, and the session
completes without backend fallback. The passing artifact is
completes without compat fallback. The passing artifact is
`artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
quality markers, found `0` stale markers, kept backend fallback `0`, and
quality markers, found `0` stale markers, kept compat fallback `0`, and
flow drops `0`.
- C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
`0.2.193` now keys runtime-sent logical channels as
@@ -728,11 +728,11 @@ The first backend contract slice is implemented:
`scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
wraps the live C18Z17 route-quality/fairness path, verifies served live
channel names are session-scoped and no unscoped served `flow-NN` channels
remain, and keeps backend fallback and flow drops at zero. The passing
remain, and keeps compat fallback and flow drops at zero. The passing
artifact is
`artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
run `c18z14-20260508-082520` served 32 session-scoped channels, applied
quality markers to all 32, kept backend fallback `0`, and flow drops `0`.
quality markers to all 32, kept compat fallback `0`, and flow drops `0`.
- C18Z19 adds the first bounded parallel send window for independent
service-channel logical flows. Node-agent `0.2.194` can send scheduled
logical channels concurrently with `MaxParallelFlowSends=4` in the live
@@ -769,7 +769,7 @@ The first backend contract slice is implemented:
run `c18z14-20260508-085635` delivered 480 packets, observed
`max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
`scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
served channels, backend fallback `0`, and flow drops `0`.
served channels, compat fallback `0`, and flow drops `0`.
- C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
keeps the lifetime counters for audit visibility, but adaptive send-window
pressure now comes from the bounded recent quality window, so old drops and
@@ -785,7 +785,7 @@ The first backend contract slice is implemented:
run `c18z14-20260508-091952` delivered 480 packets, observed
`scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
drops `0`, rolling samples/success/latency on all 32 served channels,
`recommended_parallel_flow_sends=4`, backend fallback `0`, and flow drops `0`.
`recommended_parallel_flow_sends=4`, compat fallback `0`, and flow drops `0`.
- C18Z22 connects the rolling window to backend durable route feedback. Backend
`rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
@@ -799,7 +799,7 @@ The first backend contract slice is implemented:
fields. The passing artifact is
`artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
`c18z14-20260508-093100` delivered 480 packets, observed one persisted
healthy rolling feedback item with rolling payload, backend fallback `0`, and
healthy rolling feedback item with rolling payload, compat fallback `0`, and
flow drops `0`.
- C18Z23 adds route recovery hysteresis. Backend
`rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
@@ -812,7 +812,7 @@ The first backend contract slice is implemented:
the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
forwarding, and the unit hysteresis contract. The passing artifact is
`artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
`c18z14-20260508-094111` delivered 480 packets with backend fallback `0` and
`c18z14-20260508-094111` delivered 480 packets with compat fallback `0` and
flow drops `0`.
- C18Z24 exposes that recovery state to operators and API consumers. Backend
`rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
@@ -925,7 +925,7 @@ The first backend contract slice is implemented:
C18X; route-intent lifecycle cleanup and synthetic-config expired-route
filtering landed in C18Y; bounded multi-channel load/rebuild/drop telemetry
coverage landed in C18Z; live signed service-channel ingress through the
running mesh listener landed in C18Z1; sustained live ingress with exit-node
running fabric listener landed in C18Z1; sustained live ingress with exit-node
restart/recovery coverage landed in C18Z2; signed degraded fallback
enforcement plus entry restart/WebSocket parity landed in C18Z3; long-lived
WebSocket pressure with mid-session route-policy churn landed in C18Z4; live
@@ -988,7 +988,7 @@ The first backend contract slice is implemented:
from explicit degraded compatibility requests; C18Z56 adds active-channel remediation
diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
`hold_degraded_route_state`) to make the next runtime action explicit, and its
alternate-route branch is live-smoke-proven with backend fallback kept off.
alternate-route branch is live-smoke-proven with compat fallback kept off.
C18Z57 adds the bounded machine-readable `remediation_command` contract to
active access telemetry rows so route-manager can consume a short-lived
`prefer_alternate_route` command with primary/replacement route ids and TTL.
@@ -996,7 +996,7 @@ The first backend contract slice is implemented:
node-agent route-manager consumes them as explicit applied replacement
decisions sourced from `service_channel_remediation_command`. C18Z59 proves
post-remediation service-channel traffic actually selects the replacement
route in runtime/flow telemetry without local/backend fallback. C18Z60 proves
route in runtime/flow telemetry without local/compat fallback. C18Z60 proves
the same remediation path for multiple independent VPN flow channels in one
packet batch, with replacement-route flow stats, no flow drops, no route
failures, and no degraded fallback. C18Z61 proves the remediation replacement
@@ -1024,7 +1024,7 @@ The first backend contract slice is implemented:
0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
access telemetry now reports `flow_health_status` and `flow_health_reason` at
cluster, node, and active-channel levels using traffic-class pressure, queue
pressure, flow drops, backend fallback, route-quality failures/drops/slow
pressure, flow drops, compat fallback, route-quality failures/drops/slow
samples, and route send latency. C18Z69 adds node-side adaptive response:
runtime heartbeat flow-scheduler snapshots now include per-class
`recommended_parallel_windows` and adaptive backpressure reason, and the send
@@ -1039,7 +1039,7 @@ The first backend contract slice is implemented:
tune shared fabric backpressure without changing VPN/RDP-specific code.
C18Z72 adds an audited pool/failover policy contract for entry/exit pool
constraints, preferred entry/exit, selection strategy, failover modes,
backend fallback allowance, and sticky session mode. Lease issuance applies
compat fallback allowance, and sticky session mode. Lease issuance applies
that policy before route selection and signs the effective `pool_policy`
provenance into the service-channel lease authority payload. C18Z73 projects
that signed pool-policy fingerprint into active access telemetry and guards
@@ -1080,7 +1080,7 @@ The first backend contract slice is implemented:
existing rebuild command to a replacement route, the entry node reports a
route-manager decision for the same `rebuild_request_id`, the transition is
`applied_rebuild`, and live service-channel packet ingress selects the
replacement route with no local/backend fallback, route failures, or flow
replacement route with no local/compat fallback, route failures, or flow
drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
service-channel packet bursts remain on the replacement route, no stale
primary route is reselected, and fallback, route-failure, flow-drop, and
@@ -0,0 +1,206 @@
# Fabric Service-Over-Transport Model
Status: active target architecture.
This document defines the mandatory separation between:
1. the internal fabric transport;
2. the logical service channel contract;
3. the external service ingress edge.
It exists to prevent a recurring failure pattern where external TCP/HTTP/HTTPS
listeners are mistaken for the fabric's internal transport.
## 1. Core rule
The fabric is the internal transport substrate.
- Inside the fabric, node-to-node runtime transport is `QUIC over UDP`.
- Services do not implement their own inter-node transport.
- Services do not need to understand relay, NAT, route replacement, or peer
selection details.
A service asks the fabric for a channel. The fabric creates, maintains,
rebuilds, and heals that channel.
## 2. Three-layer model
### 2.1 Fabric Transport
Fabric Transport is the lowest runtime layer.
Responsibilities:
- peer discovery and peer memory
- endpoint candidate verification
- direct and relay path establishment
- route maintenance
- route replacement
- cross-area recovery
- QUIC session lifecycle
- cert pin and authority trust enforcement
Transport contract:
- node-to-node runtime transport is `QUIC/UDP`
- TCP is not an alternate transport carrier inside the fabric
### 2.2 Fabric Service Channel
The service channel is the logical contract used by any upper-layer service.
Responsibilities:
- request a route to a node or pool
- expose a stable channel identifier
- carry bidirectional application traffic
- survive path rebuild when possible
- surface degraded or migrated channel state to the service without exposing
internal route topology details
The service channel must hide:
- which relay was chosen
- which direct peer was replaced
- which ingress or NAT path was changed
- which recovery seed was used
The service should see channel semantics, not transport topology.
### 2.3 External Service Ingress
External ingress is the edge that accepts user-facing or third-party traffic.
Examples:
- HTTP/HTTPS ingress for admin UI and personal cabinet
- VPN ingress that accepts client traffic
- future RDP ingress
Ingress may speak TCP/HTTP/HTTPS or another external protocol at the edge, but
after acceptance it must map traffic into a fabric service channel.
The ingress edge is not the fabric transport.
## 3. Examples
### 3.1 Admin panel
The user opens an HTTPS page.
1. the public ingress listens on `80/443`;
2. it accepts HTTPS and performs edge policy checks;
3. it opens or reuses a `control_ui` fabric service channel;
4. the request is forwarded through the fabric to a panel service instance;
5. the response is returned through the fabric channel back to the ingress;
6. the ingress returns the HTTP response to the browser.
The browser sees HTTPS. The fabric sees an internal service channel over
`QUIC/UDP`.
### 3.2 VPN
The VPN edge accepts client-side tunnel traffic.
1. the VPN service receives IPv4 packets from the client-facing side;
2. it requests a `vpn_tunnel` channel to an egress pool;
3. the fabric chooses and maintains the route;
4. an egress node performs IPv4 exit/NAT to the external network;
5. return traffic follows the maintained channel back through the fabric.
The VPN service does not decide how the route is built. The fabric does.
## 4. Service contract requirements
Every service-over-fabric integration must use a channel contract with at least
these concepts:
- `channel_request`
- `channel_id`
- `channel_class`
- `destination_selector`
- `current_state`
- `send`
- `receive`
- `close`
Optional but recommended:
- `channel_migrated`
- `channel_degraded`
- `preferred_qos_class`
- `pool_affinity`
- `session_stickiness`
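
The required concepts above can be sketched as a small in-memory contract. This is only an illustration under stated assumptions: the type names, the `open_channel` helper, the `pool:ipv4-egress` selector format, and the loopback delivery are hypothetical and are not taken from the fabric code.

```python
from collections import deque
from dataclasses import dataclass, field
from enum import Enum
from typing import Deque, Optional


class ChannelState(Enum):
    # OPEN and CLOSED are assumed names; DEGRADED and MIGRATED mirror the
    # optional `channel_degraded` / `channel_migrated` concepts.
    OPEN = "open"
    DEGRADED = "degraded"
    MIGRATED = "migrated"
    CLOSED = "closed"


@dataclass(frozen=True)
class ChannelRequest:
    channel_class: str          # e.g. "vpn_tunnel"
    destination_selector: str   # node or pool selector (format is assumed)


@dataclass
class ServiceChannel:
    channel_id: str
    channel_class: str
    destination_selector: str
    current_state: ChannelState = ChannelState.OPEN
    _inbox: Deque[bytes] = field(default_factory=deque, repr=False)

    def send(self, payload: bytes) -> None:
        if self.current_state is ChannelState.CLOSED:
            raise RuntimeError("channel is closed")
        # Loopback stand-in: a real fabric would route this over QUIC/UDP.
        self._inbox.append(payload)

    def receive(self) -> Optional[bytes]:
        return self._inbox.popleft() if self._inbox else None

    def close(self) -> None:
        self.current_state = ChannelState.CLOSED


def open_channel(request: ChannelRequest, channel_id: str) -> ServiceChannel:
    # The service supplies only class and selector; no route, relay, or peer.
    return ServiceChannel(channel_id, request.channel_class,
                          request.destination_selector)
```

Note that nothing in the sketch exposes a relay, peer, or route to the caller, which is the point of the contract: the service sees channel semantics only.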
## 5. Channel classes
Minimum channel classes:
- `control_ui`
- `vpn_tunnel`
- `rdp_session`
- `artifact_delivery`
- `service_admin`
- `internal_control`
Each class must define:
- latency sensitivity
- loss tolerance
- bandwidth expectation
- stickiness requirement
- pool failover behavior
- health check behavior
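
One possible way to keep the class list and the per-class attributes in sync is a profile record plus a completeness check. The sketch below is hypothetical: the `ChannelClassProfile` type, the `missing_profiles` helper, and every attribute value shown are placeholders, not values defined by this document.

```python
from dataclasses import dataclass
from typing import Dict, List

MINIMUM_CHANNEL_CLASSES = (
    "control_ui",
    "vpn_tunnel",
    "rdp_session",
    "artifact_delivery",
    "service_admin",
    "internal_control",
)


@dataclass(frozen=True)
class ChannelClassProfile:
    # One field per attribute that each class must define.
    latency_sensitivity: str
    loss_tolerance: str
    bandwidth_expectation: str
    stickiness_requirement: str
    pool_failover_behavior: str
    health_check_behavior: str


def missing_profiles(profiles: Dict[str, ChannelClassProfile]) -> List[str]:
    """Return the minimum classes that have no profile yet, in list order."""
    return [name for name in MINIMUM_CHANNEL_CLASSES if name not in profiles]


# Placeholder values only; the real per-class settings are not specified here.
EXAMPLE_PROFILES = {
    "vpn_tunnel": ChannelClassProfile(
        latency_sensitivity="medium",
        loss_tolerance="medium",
        bandwidth_expectation="high",
        stickiness_requirement="session",
        pool_failover_behavior="same-pool",
        health_check_behavior="periodic",
    ),
}
```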
## 6. Pool-first delivery
Services should target pools when they need resilience.
Examples:
- VPN should target an egress pool, not a single node
- future RDP should target a pool of reachable adapters when that service mode
applies
The service must not need to know:
- which specific node was selected
- how many nodes are in the pool
- whether the path was direct or relayed
## 7. Recovery implications
Recovery must be separated into:
1. node survival
2. transport survival
3. service channel survival
4. ingress survival
It is not enough for the node to recover if the service channel model still
depends on a hidden compat carrier.
## 8. TCP clarification
TCP is allowed only in these roles:
- external user ingress
- operator/API ingress
- temporary compatibility recovery overlap
- artifact/control delivery at the service edge
TCP is not allowed as the normal inter-node fabric transport.
If TCP is still visible in the live system, it must be classified explicitly as
one of the roles above.
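
The classification rule can be sketched as a check that rejects any TCP listener without one of the four allowed roles. The enum values and the `classify_tcp_listener` function are assumptions made for illustration; the document does not define machine-readable role names.

```python
from enum import Enum
from typing import Optional


class TcpRole(Enum):
    # Assumed identifiers for the four roles listed above.
    EXTERNAL_USER_INGRESS = "external_user_ingress"
    OPERATOR_API_INGRESS = "operator_api_ingress"
    COMPAT_RECOVERY_OVERLAP = "compat_recovery_overlap"
    SERVICE_EDGE_DELIVERY = "service_edge_delivery"


def classify_tcp_listener(declared_role: Optional[str]) -> TcpRole:
    """Map a declared role string to an allowed TCP role or fail loudly."""
    if declared_role is None:
        raise ValueError("TCP listener has no declared role")
    try:
        return TcpRole(declared_role)
    except ValueError:
        # Anything else, including inter-node transport, is not permitted.
        raise ValueError(f"TCP role {declared_role!r} is not allowed") from None
```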
## 9. Relationship to area stability
The transport layer must maintain resilient peer diversity across areas, but the
service layer must not need to understand those details.
See
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
for the current peer diversity model and
[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md)
for the live operational gaps.
@@ -0,0 +1,70 @@
# Fabric Transport Scale Plan
Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.
## Invariants
- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
- Control traffic stays small. Bulk and continuous traffic never move as a single control response.
- `tunnel_id` is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
- Hot traffic is binary framed, not JSON/base64.
- Interactive/control/DNS traffic must not wait behind bulk traffic.
- Route changes preserve the service tunnel identity.
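
As a rough illustration of the invariant that interactive, control, and DNS traffic must not wait behind bulk, the sketch below uses a strict priority queue. The class names, the priority order, and the `ClassScheduler` type are assumptions; the actual scheduler may use weights or windows instead of strict priority.

```python
import heapq
import itertools
from typing import List, Optional, Tuple

# Lower number drains first. The ordering is an assumption for illustration.
CLASS_PRIORITY = {"control": 0, "dns": 0, "interactive": 1, "bulk": 2}


class ClassScheduler:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, bytes]] = []
        self._seq = itertools.count()  # keeps FIFO order within one priority

    def enqueue(self, traffic_class: str, frame: bytes) -> None:
        priority = CLASS_PRIORITY[traffic_class]
        heapq.heappush(self._heap,
                       (priority, next(self._seq), traffic_class, frame))

    def dequeue(self) -> Optional[Tuple[str, bytes]]:
        if not self._heap:
            return None
        _, _, traffic_class, frame = heapq.heappop(self._heap)
        return traffic_class, frame
```

Strict priority can starve bulk under sustained interactive load, so a production scheduler would likely need a fairness bound on top of this.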
## Planes
- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
- Data plane: binary QUIC stream frames for service traffic.
- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.
## Service Tunnel Contract
Each service receives:
- `tunnel_id`
- `pool_id`
- `service_id`
- `local_service_id`
- `remote_service_id`
- `service_kind`
- `service_class`
- `service_role`
- `route_lease_id`
- `route_generation`
- `data_plane`
- `traffic_classes`
- `stream_shards`
VPN default profile:
- pool: `ipv4-egress`
- service kind: `vpn-exit`
- service class: `vpn_packets`
- role: `ipv4-egress`
Future profiles use the same contract, for example `rdp-client`, `vnc-client`, `artifact-store`, or `remote-workspace`.
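
A minimal sketch of the contract as an immutable record, assuming simple field types. The `ServiceTunnel` type, the `with_route` and `vpn_default_tunnel` helpers, the id values, and the `data_plane`, `traffic_classes`, and `stream_shards` defaults are hypothetical; only the field names and the VPN pool, kind, class, and role values come from the lists above.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class ServiceTunnel:
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int            # integer type is an assumption
    data_plane: str
    traffic_classes: Tuple[str, ...]
    stream_shards: int

    def with_route(self, route_lease_id: str,
                   route_generation: int) -> "ServiceTunnel":
        # Route changes swap lease and generation but keep the tunnel_id.
        return replace(self, route_lease_id=route_lease_id,
                       route_generation=route_generation)


def vpn_default_tunnel(tunnel_id: str, route_lease_id: str) -> ServiceTunnel:
    return ServiceTunnel(
        tunnel_id=tunnel_id,
        pool_id="ipv4-egress",
        service_id="vpn",                  # placeholder id
        local_service_id="vpn-client",     # placeholder id
        remote_service_id="vpn-exit",      # placeholder id
        service_kind="vpn-exit",
        service_class="vpn_packets",
        service_role="ipv4-egress",
        route_lease_id=route_lease_id,
        route_generation=1,
        data_plane="quic",                 # assumed value
        traffic_classes=("dns", "control", "interactive", "bulk"),  # assumed
        stream_shards=4,                   # assumed value
    )
```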
## Implementation Phases
1. Generalize the tunnel contract and keep VPN as the first profile. Current code exposes `rap.fabric_service_tunnel.v1`.
2. Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use `tunnel_id`; VPN ids are compatibility aliases inside the packet payload.
3. Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks `rap.fabric_service_stream_registry.v1` with per-tunnel stream state.
4. Add route epoch migration so a service keeps the same tunnel while the farm changes path. Current contract already carries `route_lease_id` and `route_generation` through profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same `tunnel_id` and update the active transport route epoch without closing service streams.
5. Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
6. Add per-class stream scheduler and backpressure at tunnel, node, route, and pool levels.
7. Add admission control and capacity accounting per node, route, pool, organization, and service.
8. Add stress tests for many tunnels, mixed traffic, route failure, node failure, and update while traffic flows.
## Scale Rules
- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by `traffic_classes` plus `stream_shards`.
- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB choke.
- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
- Bulk must be drainable and resumable.
- Interactive traffic must stay preemptive over bulk.
- Every transport fact must be observable separately from planned route and endpoint candidates.
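
The batching limits above can be illustrated with a short sketch that splits packets into batches of at most 2048 packets or 4 MiB, which keeps each batch below the 8 MiB frame guardrail. The constant names and the `batch_packets` function are hypothetical and do not reflect the mobile runtime's actual code.

```python
from typing import Iterable, Iterator, List

MIB = 1024 * 1024
FRAME_GUARDRAIL_BYTES = 8 * MIB   # per-frame guardrail from the rules above
VPN_BATCH_MAX_PACKETS = 2048
VPN_BATCH_MAX_BYTES = 4 * MIB


def batch_packets(packets: Iterable[bytes]) -> Iterator[List[bytes]]:
    """Yield batches that respect both the packet and byte limits."""
    batch: List[bytes] = []
    size = 0
    for packet in packets:
        over_count = len(batch) >= VPN_BATCH_MAX_PACKETS
        over_bytes = size + len(packet) > VPN_BATCH_MAX_BYTES
        if batch and (over_count or over_bytes):
            yield batch
            batch, size = [], 0
        # A single packet larger than the byte limit still gets its own
        # batch; a real implementation would need to reject or split it.
        batch.append(packet)
        size += len(packet)
    if batch:
        yield batch
```

With 1500-byte packets the packet count is the binding limit: 2048 packets occupy 3,072,000 bytes, which is under 4 MiB (4,194,304 bytes), so throughput has to come from parallel streams and shards rather than from larger batches.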
@@ -604,14 +604,14 @@ experiment while preserving the production forwarding kill-switch. This result
is retained only as test-history context; it is not the active transport
direction for the fabric runtime:
- `HTTPPeerTransport` maps explicit peer node IDs to synthetic HTTP endpoint
- `QUICPeerTransport` maps explicit peer node IDs to synthetic QUIC endpoint
URLs.
- `rap-node-agent` can start a synthetic `/mesh/v1/*` endpoint only when
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true` and `RAP_MESH_LISTEN_ADDR` is set.
- `rap-node-agent` can start the synthetic fabric runtime only when
`RAP_FABRIC_RUNTIME_ENABLED=true` and `RAP_FABRIC_LISTEN_ADDR` is set.
- peer endpoints and synthetic routes can be injected as JSON for smoke/debug
only.
- `mesh-live-smoke` proves direct and single-relay synthetic traffic over real
local HTTP endpoints.
local QUIC endpoints.
- bounded `synthetic.echo` remains the only test-service payload.
- `/mesh/v1/forward` remains disabled.
- no production service traffic is authorized.
+1 -1
@@ -504,7 +504,7 @@ Implementation:
`diff_time_ms`, `render_update_reason`, and
`fallback_to_full_frame_reason`.
- Windows direct transport accepts `render.frame.full`,
`render.frame.region`, and legacy `session.frame` binary messages.
`render.frame.region`, and compat `session.frame` binary messages.
- Windows presenter keeps a per-session framebuffer and patches region bytes
into it before presenting the updated WPF surface.
- Smoke proof showed baseline `render.frame.full` at `3,686,400` bytes and
@@ -340,4 +340,4 @@ Deliver:
- buildable `workers/rdp-service-csharp`
- interfaces for protocol engine, data-plane bridge, graphics sink, input source
- README with migration stages
- docs update marking current C++/FreeRDP path as legacy MVP runtime
- docs update marking current C++/FreeRDP path as compat MVP runtime
@@ -312,7 +312,7 @@ Responsibilities:
- enforces user, organization, cluster, and owner visibility policy before accepting traffic
- participates in latency-aware and load-aware exit selection
- supports failover between nodes in the same exit pool without changing the Android client protocol
- does not expose legacy VPN protocols as the steady-state data plane
- does not expose compat VPN protocols as the steady-state data plane
### `vpn-client`
@@ -324,7 +324,7 @@ Responsibilities:
- requests the list of visible IPv4 exit pools and nodes according to the current user's access level
- creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
- switches to another authorized exit when the selected exit fails or becomes slow
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use compat download/update paths long enough to fetch the new agent
- exposes its local IPv4 ingress as service configuration: on Android this is the
`VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
listen ports that are mapped into VPN packet channels.
@@ -300,7 +300,7 @@ Recommended flow:
3. dual validation period begins where required
4. new certificates are issued/accepted
5. old certificates expire or are revoked
6. old trust root is retired after rollout threshold
6. old trust root is removed after rollout threshold
Channels should revalidate after trust bundle changes.
@@ -8,7 +8,7 @@ transport architecture. The active inter-node transport model is QUIC-only; see
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: P3.3 historical test-stand smoke complete for encrypted resource
secrets, assignment-time resolution, and legacy RDP baseline behavior with
secrets, assignment-time resolution, and compat RDP baseline behavior with
smoke-only direct-worker trust.
This document defines the next security hardening layer around the accepted RDP
@@ -110,7 +110,7 @@ In `APP_ENV=production`:
- RDP/VNC/SSH resources must have `secret_ref`.
- Plain credential-like keys are rejected in resource `metadata`.
- Session start rejects legacy resources that still contain plaintext
- Session start rejects compat resources that still contain plaintext
credential-like metadata.
- backend startup requires secret encryption key material.
- Development/smoke environments may continue using plaintext metadata while
@@ -109,7 +109,7 @@ adapter runtime.
- Control Plane remains authoritative for session lifecycle and policy.
- PostgreSQL remains source of truth; Redis remains live coordination only.
- Fabric transport remains QUIC-only between nodes; any historical direct
worker or backend fallback paths belong to paused service-specific baselines,
worker or compat fallback paths belong to paused service-specific baselines,
not to the active fabric transport contract.
- Adapter runtime must not create sessions outside broker/assignment control.
@@ -212,7 +212,7 @@ Signing key rotation rules:
1. New key is introduced in a signed trust bundle.
2. Node verifies the new key through existing trust.
3. Snapshots may be dual-signed during transition.
4. Old key is retired only after policy-defined rollout.
4. Old key is removed only after policy-defined rollout.
5. Compromised key is revoked through signed revocation metadata or emergency
recovery flow.
@@ -12,6 +12,10 @@ Core. It does not redefine node-to-node transport. Current fabric inter-node
transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric
routes instead of introducing a separate packet transport contract.
The general service-over-fabric contract is defined in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
VPN is one service class over that transport model, not an exception to it.
## Purpose
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
@@ -25,6 +29,9 @@ platform's core rules:
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
- VPN ingress may accept external client traffic, but after acceptance it must
map that traffic into a fabric service channel rather than inventing an
alternate inter-node carrier.
## Non-Goals
@@ -18,6 +18,13 @@ Terminology rule:
The Control API may use HTTP/HTTPS, but it is not a fallback or alternate
carrier for fabric node-to-node runtime traffic.
The formal three-layer separation is defined in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md):
- `Fabric Transport` - internal QUIC/UDP substrate
- `Fabric Service Channel` - logical service channel contract
- `External Service Ingress` - browser/API TCP/HTTP/HTTPS edge
## Purpose
The platform needs a clear distinction between:
@@ -36,7 +43,7 @@ secrets, node identity, or routing authority.
Public HTTPS Ingress is an edge service. It may run on a public Internet node,
including a small/slow node intended only to accept browser traffic and pass it
into the fabric.
into the fabric through a service channel.
Role names:
@@ -225,7 +232,7 @@ The recommended model is:
```text
Admin Web Shell
-> UI Manifest / Page Definition endpoint
-> Scoped Control API endpoints
-> Scoped Fabric control endpoints
```
Dynamic pages are allowed for:
@@ -474,8 +481,8 @@ the management authority. Platform/global admin runtime remains limited to
platform-owner trusted nodes. Cluster, organization, and user panels receive
only their scoped projections.
The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
zones` is retired for the transport-layer view. The Fabric panel must show
The compat Fabric map with separate `inputs`, `cluster nodes`, and `egress
zones` is removed for the transport-layer view. The Fabric panel must show
actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
state, relay/route-health annotations, and web-ingress runtime readiness. It
must not render old entry/egress zone columns as if they were transport
@@ -520,7 +527,7 @@ The platform recognizes these web/admin placement roles:
| `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
| `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |
Legacy `entry-node` remains a generic client ingress/service edge role for
Compat `entry-node` remains a generic client ingress/service edge role for
non-admin product services. It must not imply admin authority.
## Fabric Service Classes