Working version, but speed is 10 Mbit/s

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
@@ -201,8 +201,8 @@ Updates must support:
 - local update cache where approved
 - OS / architecture specific artifacts under signed release manifests
 - explicit migration bundles when data structures change
- legacy recovery compatibility until the fleet is converged or explicitly
-  retired
+- compat recovery compatibility until the fleet is converged or explicitly
+  removed
 - multi-source artifact retrieval for stranded or NAT-only nodes

 Version Storage stores immutable release manifests, artifacts, hashes,
@@ -1035,7 +1035,7 @@ Node-agent can start, stop, and monitor service workloads based on role assignme

 C19A adds the first bounded live service-supervision runtime proof on top of
 that contract: node-agent can read node-scoped desired workloads without an
-operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
+operator actor id, report built-in `core-mesh` and `fabric-listener` as running,
 report native built-in `synthetic.echo` as running, and keep unsupported
 production workloads degraded instead of pretending that their adapters exist.
 The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.
@@ -262,7 +262,7 @@ Rules:
 - latest frame wins
 - render must not block input/control
 - binary payloads should be used on direct data plane
- backend fallback may continue existing JSON/base64 behavior during migration
+- compat fallback may continue existing JSON/base64 behavior during migration

 ### `clipboard`

@@ -347,7 +347,7 @@ The DP-2 JSON header contains:
 - `session_id`
 - `channel`, currently `render`
 - `message_type`, currently `render.frame.full` or `render.frame.region` on
-  direct worker WSS; `session.frame` remains accepted as the legacy DP-2
+  direct worker WSS; `session.frame` remains accepted as the compat DP-2
  binary message type for compatibility.
 - `sequence`
 - `timestamp`
@@ -950,7 +950,7 @@ explicit direct render message types:

 Compatibility:

- Windows client direct transport still accepts legacy binary `message_type=session.frame`.
+- Windows client direct transport still accepts compat binary `message_type=session.frame`.
 - Inside the Windows application pipeline, direct binary frames are normalized
  back into the existing `session.frame` envelope so UI, lifecycle, input,
  clipboard, and file transfer behavior remain unchanged.
@@ -24,7 +24,7 @@ policy allows, host limited control/storage roles when approved, and report
 mobile-specific capacity signals such as battery, network type, NAT behavior,
 foreground/background state, and metered network policy.

-Node survival and recovery across endpoint moves, NAT-only reachability, legacy
+Node survival and recovery across endpoint moves, NAT-only reachability, compat
 contract overlap, and unavailable manual host access are governed by
 `docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
 particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
@@ -179,8 +179,8 @@ Endpoint state is also distributed:

 Moving a service must not break the farm.

-`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
-old nodes. It is not cluster truth. After bootstrap, a node finds services by
+`RAP_FABRIC_REGISTRY_RECORDS_JSON` and signed registry gossip, not any fixed
+HTTP/API address, define cluster truth. After bootstrap, a node finds services by
 logical role through signed fabric registry records that can be carried by any
 reachable peer.

@@ -258,7 +258,7 @@ Service classes that must use this registry before production hardening:
 - `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
  nodes.

-Legacy endpoint compatibility is allowed only for rolling migration:
+Compat endpoint compatibility is allowed only for rolling migration:

 - Old nodes may use their baked HTTP/control URL only to fetch a new version or
  a signed registry bootstrap record.
@@ -504,7 +504,7 @@ Deliverables:

 ### Stage FNP-3: WebSocket/TCP Compatibility Transport

-Status: retired as a migration-only stage.
+Status: removed as a migration-only stage.

 This stage existed to bootstrap binary frame semantics before QUIC routing and
 carrier reuse were ready. It introduced the transport-neutral frame loop,
@@ -6,6 +6,10 @@ This document replaces the oversimplified rule "every node must keep 3
 connections" with a stability model based on failure domains ("areas"),
 multi-path reachability, and live peer memory.

+It operates at the `Fabric Transport` layer. Services above the transport must
+consume service channels and must not directly reason about peer topology. See
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
+
 ## 1. Why the old "3 connections" rule is not enough

 A raw connection count is too weak as a resilience rule.
@@ -43,6 +47,9 @@ An area can be derived from:
 The area label must be part of live node metadata and endpoint candidate
 metadata.

+For the current fleet, area assignment should be explicit operator metadata, not
+an inference hidden only inside routing code.
+
 ## 3. Stability objective

 Each node should maintain a working peer set with diversity, not just count.
@@ -0,0 +1,386 @@
+# Fabric Execution Plan 2026-05-19
+
+Status: active execution plan.
+
+This document merges:
+
+- the service-over-fabric model;
+- the area and peer stability model;
+- the live audit findings from 2026-05-18 through 2026-05-19;
+- the node survival and recovery policy;
+- the current rollout and runtime rewrite findings.
+
+The goal is to move the live fabric from a partially migrated QUIC-first fleet
+to a fully converged distributed runtime where:
+
+1. inter-node transport is QUIC over UDP only;
+2. services use fabric channels and do not implement their own transport;
+3. nodes do not depend on one compat control/download edge;
+4. node directory and service discovery are distributed through signed records,
+   peer cache, and live peer exchange;
+5. the fleet remains recoverable after losing part of the fabric.
+
+## 1. Current live state
+
+### 1.1 What is already true
+
+- Inter-node runtime transport is QUIC over UDP.
+- All active nodes are converging on the latest control-endpoint rewrite line.
+- `home-*`, `test-*`, and `usa-los-1` already run
+  `rap-node-agent 0.2.325-updatehintwake`.
+- `ifcm-rufms-s-mo1cr` is no longer lost and still sends fresh heartbeat.
+- Internal artifact plans now support mirror URLs instead of a single artifact
+  URL.
+- The public `vpn.cin.su` ingress and the `19191` compatibility ingress on
+  `home-1` were repaired so downloads and control traffic can flow again.
+
+### 1.2 What is still not finished
+
+- `ifcm-rufms-s-mo1cr` still reports the old
+  `http://vpn.cin.su:19191/api/v1` control URL in live heartbeat.
+- `ifcm-rufms-s-mo1cr` still runs `rap-node-agent 0.2.322-controlendpointsrewrite`
+  while the rest of the reachable fleet is already on
+  `0.2.325-updatehintwake`.
+- The current blocker is now known precisely:
+  fresh heartbeat plus a dead updater subscription plane on a node-agent that
+  does not yet support local updater wake from heartbeat update hints.
+- Signed registry runtime is still not fully `active` across the fleet.
+- Cross-area direct peer diversity is still below the target for multiple
+  nodes.
+- TCP is still visible in allowed edge roles:
+  - external ingress;
+  - Control API;
+  - release downloads;
+  - temporary compatibility recovery overlap.
+
+## 2. Target system model
+
+### 2.1 Transport
+
+- Inter-node runtime transport: QUIC over UDP only.
+- No TCP/WebSocket fallback as the normal fabric carrier.
+
+### 2.2 Service layer
+
+- Services consume a fabric channel contract.
+- Services do not know internal path selection, relay choice, NAT traversal, or
+  route replacement details.
+- External TCP/HTTP/HTTPS exists only at the ingress edge and is mapped into a
+  fabric channel.
+
+### 2.3 Discovery and directory
+
+- Nodes do not query PostgreSQL as part of ordinary transport/runtime flow.
+- PostgreSQL remains the durable source of truth for policy, rollout, release,
+  desired state, and audit.
+- Runtime node discovery must use:
+  - signed registry records;
+  - peer cache;
+  - endpoint candidates;
+  - bounded live peer exchange.
+
+### 2.4 Small fleet rule
+
+For the current fleet size, every node should keep the full directory of all
+known nodes in scoped local state, plus runtime observations and endpoint
+candidate health.
+
+## 3. Execution priorities
+
+### P0. Finish runtime control-path convergence
+
+Goal:
+
+- remove the last live compat control dependency without manual host access.
+- ensure a live node can wake its local updater plane when Control/API sends an
+  explicit update hint, even if the previous updater loop died.
+
+Required work:
+
+1. Release the noop runtime rewrite restart fix.
+2. Roll it out to the fleet.
+3. Verify that updated nodes restart into canonical control endpoints.
+4. Add a local updater wake path driven by heartbeat update hints so
+   `update-trigger.json` is not the only signal.
+5. Confirm that `compat_control_dependency_nodes` falls to zero.
+6. Confirm that `updater_subscription_alert_nodes` falls to zero.
+7. Confirm that `updater_wake_unsupported_nodes` falls to zero.
+
+Done when:
+
+- no live heartbeat reports `fabric_control_endpoint` on the `19191` compat contract.
+- no live node shows `updater_subscription_gap` while heartbeat remains fresh.
+- no live node remains on a pre-`0.2.325-updatehintwake` node-agent while
+  heartbeat is still fresh and update status is stale.
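The first "Done when" condition above can be checked mechanically. The following is a minimal sketch, assuming heartbeats are available as a mapping from node name to the reported `fabric_control_endpoint` string. That mapping shape and both function names are hypothetical and are not taken from the node-agent or backend code.

```python
from urllib.parse import urlsplit

# Port of the compat control contract named in the plan.
COMPAT_CONTROL_PORT = 19191


def uses_compat_control(endpoint: str) -> bool:
    """Return True when the endpoint URL targets the compat 19191 contract."""
    return urlsplit(endpoint).port == COMPAT_CONTROL_PORT


def compat_control_dependency_nodes(heartbeats: dict[str, str]) -> list[str]:
    """List nodes whose reported control endpoint is still on the compat port.

    `heartbeats` maps node name -> fabric_control_endpoint (hypothetical shape).
    """
    return sorted(node for node, url in heartbeats.items() if uses_compat_control(url))
```

An empty result would correspond to `compat_control_dependency_nodes` falling to zero. The freshness and updater checks in the other two conditions are not covered by this sketch.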
+
+### P1. Finish distributed registry activation
+
+Goal:
+
+- nodes must resolve active service records without relying on one compat URL.
+
+Required work:
+
+1. Promote signed registry runtime from `candidate_only` / `missing` to
+   `active`.
+2. Ensure nodes resolve at least:
+   - `control-api`
+   - `update-store`
+   - `update-cache`
+3. Add live observability for:
+   - active records
+   - candidate records
+   - resolved core services
+   - last live probe
+
+Done when:
+
+- `fabric_registry_runtime_report.status = active` for the production fleet.
+
+### P2. Turn node directory into a real distributed runtime input
+
+Goal:
+
+- nodes should learn and keep node/service information from the fabric, not by
+  repeatedly consulting a center.
+
+Required work:
+
+1. Preserve full scoped node directory for the current fleet.
+2. Carry signed node/service records through peer exchange.
+3. Keep endpoint candidates and runtime observations in local peer cache.
+4. Spread updates to node/service reachability like a bounded wave, not as
+   independent central fetches by every node.
+
+Rules:
+
+- nodes may distribute signed directory/service data;
+- nodes must not self-author authoritative control-plane state;
+- the runtime may consume replicated signed copies of truth;
+- PostgreSQL remains the durable origin of truth.
+
+Done when:
+
+- nodes can refresh peer/service discovery from peers plus signed records even
+  if one control edge disappears.
+
+### P3. Replace the naive “3 peers” rule with stability by area and ingress
+
+Goal:
+
+- measure and enforce resilience by failure-domain diversity, not only count.
+
+Required metrics:
+
+- `direct_ready_count`
+- `relay_ready_count`
+- `external_area_ready_count`
+- `independent_ingress_ready_count`
+- `recovery_path_count`
+
+Required topology labels:
+
+- `site_id` - physical or logical site
+- `locality_group` - private/local reachability domain
+- `nat_group` - shared public edge dependency
+
+Required behaviors:
+
+1. Prefer peers from different `area` values.
+2. Prefer peers behind different public ingress / NAT dependencies.
+3. Keep direct-ready and relay-ready separate.
+4. Keep at least one recovery path outside the local area.
+5. Treat a public endpoint behind the same NAT area as
+   `external-network-required` unless cross-area observers have validated it.
+6. Do not demote a public endpoint only because the same area cannot hairpin
+   through its own public router address.
+7. Prefer a scoped local/private QUIC endpoint over a public endpoint when the
+   candidate is confirmed to be in the same local segment or NAT group.
+8. Penalize or reject private/local-looking endpoints when they belong to a
+   different segment/NAT scope than the local node, instead of probing them as
+   if they were reachable.
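Behaviors 5 through 8 above can be read as a small decision table over endpoint scope. The sketch below is one possible reading, not the planner's actual logic: the `Scope` type, the `endpoint_decision` function, the `prefer-local`, `reject-out-of-scope`, and `probe-public` labels, and the use of `is_private` as the test for "private/local-looking" are all assumptions made for illustration. Only `external-network-required` is a label taken from the plan.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    locality_group: str  # private/local reachability domain
    nat_group: str       # shared public edge dependency


def endpoint_decision(local: Scope, candidate: Scope, candidate_ip: str,
                      cross_area_validated: bool = False) -> str:
    """Classify one QUIC endpoint candidate relative to the local node."""
    private = ipaddress.ip_address(candidate_ip).is_private
    same_scope = (candidate.locality_group == local.locality_group
                  or candidate.nat_group == local.nat_group)
    if private:
        # Behavior 7: prefer a confirmed same-segment private endpoint.
        # Behavior 8: do not probe a private endpoint from a foreign scope.
        return "prefer-local" if same_scope else "reject-out-of-scope"
    if same_scope and not cross_area_validated:
        # Behaviors 5 and 6: a same-NAT public endpoint is not demoted,
        # but it needs confirmation from cross-area observers.
        return "external-network-required"
    return "probe-public"
```

Keeping the decision a pure function of scope labels and the address makes it easy to audit against the numbered behaviors without involving live probes.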
+
+Done when:
+
+- critical nodes satisfy cross-area direct resilience targets, not merely raw
+  peer-count targets.
+
+### P4. Normalize edge roles and remove accidental TCP confusion
+
+Goal:
+
+- if TCP is present, it must be obviously classified and justified.
+
+Allowed TCP roles:
+
+- external service ingress;
+- Control API ingress;
+- artifact delivery edge;
+- temporary compatibility recovery overlap.
+
+Required work:
+
+1. Keep explicit inventory of edge listeners.
+2. Distinguish transport TCP from service-edge TCP in audits and UI.
+3. Advance the fabric-only recovery gate only after:
+   - compat control dependency is zero;
+   - registry is active;
+   - recovery path no longer depends on `19191`.
+
+### P5. Build the update orchestrator and distributed update intent plane
+
+Goal:
+
+- nodes must not depend on one updater endpoint, one old updater process, or one
+  central polling path;
+- update rollout must be controlled so the whole farm cannot update at once;
+- update intent must be distributable through management and neighboring nodes
+  as signed metadata.
+
+Required model:
+
+1. The durable update object is a signed `update_intent`, not a hard-coded
+   updater URL.
+2. Nodes may receive update intent from:
+   - Control API;
+   - update-store / update-cache;
+   - subscription hints over an outbound control channel;
+   - signed peer gossip from neighboring nodes;
+   - local cached last-known-good update state.
+3. Nodes validate intent locally before execution.
+4. Neighbor nodes may relay signed intent and artifacts, but cannot forge
+   authority or expand scope.
+5. Slow polling remains as the final safety net.
+6. Subscription/hints are the fast path.
+7. Gossip is the partition/recovery path.
+8. Orchestrator-issued rollout leases are the concurrency guard.
+
+Orchestrator requirements:
+
+- canary, rolling, pinned, and forced-node strategies;
+- max parallel globally;
+- max parallel per area / site / NAT group;
+- max unavailable nodes;
+- pause/resume/abort;
+- failure-rate stop;
+- automatic stop on heartbeat loss or rollback;
+- role-aware scheduling for control-api, update-store, update-cache, relay,
+  ingress, and egress nodes;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for compat nodes that predate the orchestrator.
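The "max parallel globally" and "max parallel per area" limits above amount to an admission check before a rollout lease is granted. A minimal sketch follows, assuming the orchestrator tracks the area of every node that currently holds a lease. The function name and its parameters are hypothetical, and site and NAT-group limits would follow the same pattern.

```python
from collections import Counter


def may_grant_lease(active_areas: list[str], area: str,
                    max_global: int, max_per_area: int) -> bool:
    """Decide whether one more node in `area` may start updating.

    `active_areas` holds the area label of every node that currently
    holds a rollout lease (hypothetical bookkeeping).
    """
    if len(active_areas) >= max_global:
        return False  # global parallelism limit reached
    return Counter(active_areas)[area] < max_per_area
```

With `max_global=2` and `max_per_area=1`, a second `home` node is refused while a `usa` node is admitted, which is the property that keeps the whole farm from updating at once.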
+
+Node-side requirements:
+
+- accept `check now` subscription signals;
+- periodically poll as fallback;
+- accept newer signed update intents from peers;
+- keep a local update journal:
+  - pending intent generation;
+  - lease id;
+  - last accepted plan;
+  - staged artifact hash;
+  - previous binary / image;
+  - rollback state;
+  - admission failure reason;
+- reconcile stale updater runtime against current node/container/task state
+  before fetching plans;
+- report `blocked`, `leased`, `staged`, `applying`, `healthy`, `rolled_back`,
+  and `aborted` states explicitly.
+
+Done when:
+
+- a node can learn a new update intent without directly reaching the original
+  control edge;
+- a stale updater command line can be repaired from local running runtime state;
+- simultaneous farm-wide update start is impossible without explicit
+  recovery-admin override;
+- rollout can be paused and resumed without losing node intent state;
+- at least one test proves a node behind NAT receives an update signal through
+  a neighbor and still waits for an orchestrator lease before applying.
+
+## 4. Immediate next implementation sequence
+
+### Step A
+
+Release and roll out the noop-rewrite restart fix so that updated runtimes do
+not remain on stale control sessions after a config rewrite.
+
+### Step A.1
+
+Release and roll out the relay certificate intent fix so stale-relay
+replacement and bootstrap relay paths do not probe a relay endpoint with a
+certificate fingerprint copied from a different private direct candidate.
+
+This is tracked by:
+
+- `rap-node-agent 0.2.332-relaycertintentfix`
+
+Done when:
+
+- `peer certificate fingerprint mismatch` no longer appears on healthy
+  relay/bootstrap paths between live areas;
+- `ifcm` no longer loses ready peers because relay endpoint selection and peer
+  certificate pinning disagree.
+
+### Step B
+
+Re-check live heartbeat and stale-risk:
+
+- `compat_control_dependency_nodes`
+- `registry_candidate_only_nodes`
+- `updater_subscription_alert_nodes`
+- `updater_wake_unsupported_nodes`
+- `bridge_hold_required`
+- current control URL in heartbeat
+
+### Step C
+
+Continue registry activation work until active records are used in practice.
+
+### Step D
+
+Continue peer diversity work using:
+
+- `area`
+- direct-ready area coverage
+- independent ingress diversity
+
+### Step E
+
+Run another live audit and decide whether `19191/tcp` recovery overlap can be
+removed.
+
+## 5. Hard acceptance criteria
+
+The fabric is considered converged only when all of the following are true:
+
+1. Inter-node runtime transport is QUIC/UDP only.
+2. No live node depends on the compat `19191` control contract.
+3. Signed registry runtime is active.
+4. Nodes carry and use distributed node/service knowledge through signed
+   records and peer cache.
+5. Cross-area direct resilience targets are satisfied for critical nodes.
+6. Remaining TCP listeners are only service-edge roles, never hidden inter-node
+   transport.
+
+## 6. This plan starts now
+
+The immediate active engineering task after writing this document is:
+
+- complete the rollout of the runtime rewrite restart fix;
+- remove the last live compat control dependency;
+- then move directly into signed registry activation and cross-area peer
+  resilience work.
+
+Update 2026-05-19:
+
+- `rap-node-agent 0.2.325-updatehintwake` adds a second-stage recovery path for
+  heartbeat update hints: when a fresh hint generation arrives, the live
+  node-agent persists `update-trigger.json` and wakes the local updater
+  task/service.
+- This is specifically meant to prevent the `ifcm-rufms-s-mo1cr` class of
+  failure where heartbeat remains fresh but the updater subscription plane is
+  dead.
+- As of the current rollout, this release is already on `home-*`, `test-*`,
+  and `usa-los-1`; `ifcm-rufms-s-mo1cr` remains the sole
+  `updater_wake_unsupported` blocker.
@@ -258,7 +258,7 @@ Production fabric-core migration boundary:
  QUIC endpoint candidates for the next hop, sends the envelope over the chosen
  QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
  failure or response timeout.
- The legacy HTTP production forward carrier has been removed from the mesh
+- The compat HTTP production forward carrier has been removed from the mesh
  runtime API. Production forwarding now exposes a single QUIC transport
  implementation; HTTP handlers remain only as node-local API surfaces and test
  harness entry points.
@@ -287,7 +287,7 @@ Production fabric-core migration boundary:
 - Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
  instead of collapsing to one address: operator/public QUIC, listener QUIC,
  LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
-  `relay_quic` fallback. Candidate metadata carries `local_segment_id`,
+  `relay_quic` fallback. Candidate metadata carries `locality_group_id`,
  `nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
  `relay_endpoint` when configured. When a relay endpoint is the first physical
  QUIC hop, its advertised certificate fingerprint must survive route planning
@@ -296,23 +296,23 @@ Production fabric-core migration boundary:
 - Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
  health observations, latency, reliability, region, policy tags, and live
-  capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
+  capacity pressure; HTTP/WebSocket labels are treated as rejected compat
  candidates rather than alternate transports.
 - `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
  selection can return only `QUICFabricTransport`; unsupported labels fail with
  a QUIC-required error.
- Explicit transport labels are authoritative. A legacy label such as `relay`
+- Explicit transport labels are authoritative. A compat label such as `relay`
  or `outbound_reverse` is rejected even when the endpoint string uses a
  `quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
- Node-agent config loading rejects legacy advertised transport labels and
+- Node-agent config loading rejects compat advertised transport labels and
  HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
  fabric endpoints. Bad endpoint posture fails before heartbeat publication.
- Host-agent install/runtime validation rejects legacy mesh advertise transport
+- Host-agent install/runtime validation rejects compat mesh advertise transport
  labels and HTTP/WebSocket advertise endpoints before they can be passed into a
  node-agent Docker runtime.
 - JSON-advertised endpoint candidates and scoped synthetic config route
  recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
-  seeds, and rendezvous leases reject legacy transport labels and
+  seeds, and rendezvous leases reject compat transport labels and
  HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
  entries.
 - Rendezvous relay leases and peer-connection intents now use `relay_quic` as
@@ -325,24 +325,24 @@ Production fabric-core migration boundary:
 - Node-agent synthetic runtime no longer installs an HTTP peer transport as an
  inter-node carrier, and the shared mesh runtime package no longer exports an
  HTTP peer transport implementation. Any HTTP synthetic motion is confined to
-  explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
+  explicit compat smoke harness code while fabric acceptance uses QUIC loadtest
  gates.
 - Control-plane and debug JSON mesh config loading is validated after
  conversion into runtime structures. Peer endpoint candidates, recovery seeds,
  rendezvous leases, and selected relay endpoints in route decisions must use
  QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
+- Scoped synthetic mesh configs also reject compat `peer_endpoints` directly,
  in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
  rendezvous leases.
 - The old fabric-session WebSocket endpoint is no longer exposed by
-  `FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
+  `FabricSessionEnabled` alone. It requires an explicit compat test harness flag
  and is not part of the node-agent fabric transport surface.
 - Same local segment or same NAT group is treated as a LAN route by the planner,
  so a whole cluster piece behind one NAT can prefer private addresses between
  its own nodes while still maintaining outbound/relay visibility to the rest
  of the fabric.
 - Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
-  route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
+  route-set counts, QUIC candidate totals, rejected compat/non-QUIC candidate
  totals by transport label, route pressure, QUIC listener state, goroutines,
  heap usage, and the next recommended soak gate.
 - `FabricOverlayTransport` is the generic service-neutral send facade over
@@ -375,7 +375,7 @@ Production fabric-core migration boundary:
  healthy targets are present. A `mixed-public-nat-lan-relay` or
  `nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
  `reverse_quic`, and `relay_quic`.
- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
+- Loadtest verdicts also fail on compat route-mode labels. Seeing `relay`,
  `outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
  `wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
  violation even if payload delivery succeeds.
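The route-mode gate described above can be sketched as a pure function over the labels seen in telemetry. This is an illustrative sketch only: the label sets are copied from this document, while the function names and the way observed labels are passed in are assumptions and do not reflect the loadtest tool's real interface.

```python
# Compat labels that the document says must fail the verdict.
REJECTED_ROUTE_MODES = frozenset({
    "relay", "outbound_reverse", "direct_http", "direct_https",
    "direct_tcp_tls", "ws", "wss", "websocket",
})

# Modes a mixed-public-nat-lan-relay or nat-lan-relay run must exercise.
REQUIRED_MIXED_MODES = frozenset({"lan_quic", "ice_quic", "reverse_quic", "relay_quic"})


def route_mode_violations(observed: list[str]) -> list[str]:
    """Return the compat labels present in route-mode telemetry."""
    return sorted(set(observed) & REJECTED_ROUTE_MODES)


def missing_required_modes(observed: list[str]) -> list[str]:
    """Return required QUIC modes that the run never exercised."""
    return sorted(REQUIRED_MIXED_MODES - set(observed))


def mixed_profile_passes(observed: list[str]) -> bool:
    """Pass only with no compat labels and full required-mode coverage."""
    return not route_mode_violations(observed) and not missing_required_modes(observed)
```

Note that the verdict ignores payload delivery entirely, matching the rule that a compat label is a transport-layer violation even when delivery succeeds.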
@@ -686,7 +686,7 @@ Production fabric-core migration boundary:
  `control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
  `route_pressure.active_total=0`, and matching acquire/release counts.
 - Verified strict QUIC route-mode gate:
-  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
+  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with compat
  route-mode verdicts and ran the 4-node mixed topology profile. It produced
  400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
  `reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
@@ -816,7 +816,7 @@ Production fabric-core migration boundary:
 - Published and registered node-agent release `0.2.280-fabricsession` with
  linux binary/native and Docker image artifacts. The release is intentionally
  not assigned to live node update policies yet because current live node
-  workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
+  workload/env posture still advertises compat `direct_http` and HTTP/HTTPS
  mesh endpoints. Before rollout, node configs must be migrated to
  `quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
  such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
@@ -4,9 +4,22 @@ Status: live operational audit of the current fabric. This document records the
 real state observed on 2026-05-18 and explicitly calls out where runtime
 behavior still differs from the target architecture.

+The target layering model referenced by this audit is documented in
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
+The current execution sequence derived from this audit is maintained in
+[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).
+
 ## Current confirmed state

 - Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
+- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
+  ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
+  `500`, not because QUIC or registry bootstrap was broken, but because
+  PostgreSQL rejected `\u0000` inside heartbeat JSON with
+  `unsupported Unicode escape sequence (SQLSTATE 22P05)`.
+- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
+- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
+  converged onto the new release line with live registry promotion.
 - The active node set
  - `home-1`
  - `home-2`
@@ -16,9 +29,40 @@ behavior still differs from the target architecture.
  - `test-3`
  - `usa-los-1`
  - `ifcm-rufms-s-mo1cr`
-  is converged on `0.2.321-directreadytarget`.
+  currently spans:
+  - `home-*`, `test-*`, and `usa-los-1` on
+    `0.2.327-registrybootstraprewrite`;
+  - `ifcm-rufms-s-mo1cr` still remaining on
+    `0.2.322-controlendpointsrewrite`.
 - `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
  no longer stale.
+- `ifcm-rufms-s-mo1cr` has already migrated off compat overlap
+  `http://vpn.cin.su:19191/api/v1` and now reports
+  `https://vpn.cin.su/api/v1`, but it still has not advanced to the new
+  registry-aware release line.
+- `home-*` and `test-*` now report:
+  - `reported_version = 0.2.327-registrybootstraprewrite`
+  - `peer_cache_peers = 7`
+  - `fabric_registry_runtime_report.status = active`
+- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
+  reports `fabric_registry_runtime_report.status = missing`, which means this
+  node remains a runtime bootstrap/config rewrite gap rather than a version-gap.
+- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
+  `home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
+  again instead of falling back to the thin `c3` payload.
+- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
+  - `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
+  - `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
+  - `direct_peer_alert_nodes = 5`
+  - `area_diversity_alert_nodes = 6`
+- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
+  rollout still shows:
+  - `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
+  - `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
+  - `peer_cache_peers = 7`
+  - latest update status still stuck at `2026-05-18 20:50 UTC`
+  - this is now classified as `updater_wake_unsupported`, not just a generic
+    stale or compat-control symptom
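The `\u0000` ingestion defect recorded above can be illustrated with a small sanitizer. The audit does not say how the backend performs the sanitization, so the following is only a sketch of one approach, assuming the heartbeat is parsed as JSON before persistence. The `strip_nul` helper is hypothetical, and Python is used purely for illustration.

```python
import json


def strip_nul(value):
    """Recursively remove NUL characters from all strings in parsed JSON."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans, and null pass through unchanged


# A heartbeat fragment carrying the escape that PostgreSQL rejects.
raw = '{"node": "home-1", "note": "abc\\u0000def"}'
clean = json.dumps(strip_nul(json.loads(raw)))
```

Sanitizing the parsed structure rather than the raw text avoids accidentally rewriting an escaped backslash followed by the literal characters `u0000`.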

 ## Why TCP traffic is still visible

@@ -35,7 +79,7 @@ Observed live listeners:
  - `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
 - `usa-los-1`
  - `19131/udp` - QUIC fabric listener
-  - `19191/tcp` - external compatibility bridge currently held open so legacy
+  - `19191/tcp` - external compatibility bridge currently held open so compat
    recovery contracts can still reach `Control API/downloads`

 Therefore:
@@ -49,7 +93,8 @@ Therefore:

 ### 1. Nodes do not yet operate from a fully active signed registry gossip plane

-Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
+Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
+refresh:

 - `fabric_registry_runtime_report.status = candidate_only`
 - `resolved_service_count = 0`
@@ -61,11 +106,11 @@ This means the current runtime still depends on compatibility control URLs more
 than the target architecture allows. The node is alive in the fabric, but not
 yet operating from a fully resolved active registry view.

-### 2. Legacy control/download contracts are still real dependencies
+### 2. Compat control/download contracts are still real dependencies

 Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:

- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
+- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`

 This confirms the root recovery lesson:

@@ -77,15 +122,31 @@ This confirms the root recovery lesson:

 ### 3. Direct peer resilience is still below the intended threshold

-Observed from live heartbeat metadata:
+Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:

 - `ifcm-rufms-s-mo1cr`
  - `peer_connection_ready = 2`
  - `peer_connection_relay_ready = 3`
  - `target_ready_peers = 3`
+- `home-1`
+  - `peer_connection_ready = 1`
+  - `direct_ready_areas = [usa]`
+  - `external_area_ready_count = 1/2`
+- `home-2`
+  - `peer_connection_ready = 1`
+  - `direct_ready_areas = [usa]`
+  - `external_area_ready_count = 1/2`
+- `home-3`
+  - `peer_connection_ready = 1`
+  - `direct_ready_areas = [usa]`
+  - `external_area_ready_count = 1/2`
+- `test-1/2/3`
+  - `peer_connection_ready = 3`
+  - but `direct_ready_areas = [usa]`
+  - therefore each still triggers `external_area_deficit:1_of_2`
 - `usa-los-1`
  - `peer_connection_ready = 1`
-  - `peer_connection_relay_ready = 5`
+  - `direct_ready_areas = [ifcm, home, test]`
  - `target_ready_peers = 3`

 This means the direct-path resilience target is not satisfied yet, even though
@@ -99,17 +160,35 @@ The practical reason is simple:
 - relay-ready adjacency is masking direct peer deficit, but it does not replace
  the requirement for at least three direct-ready peers.
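
The readings above (`external_area_ready_count = 1/2`, `external_area_deficit:1_of_2`) suggest a simple per-node check. The sketch below is illustrative only: the function name, the `required = 2` default, and the rule that a node's own area does not count as external are assumptions inferred from the snapshot, not the runtime's actual logic.

```python
from typing import List, Optional

# Assumption: two external areas are required, as implied by the "1/2" readings.
REQUIRED_EXTERNAL_AREAS = 2


def external_area_deficit(own_area: str,
                          direct_ready_areas: List[str],
                          required: int = REQUIRED_EXTERNAL_AREAS) -> Optional[str]:
    """Return a deficit marker, or None when cross-area diversity is satisfied."""
    # Only direct-ready areas other than the node's own area count as external.
    external = {area for area in direct_ready_areas if area != own_area}
    if len(external) >= required:
        return None
    return f"external_area_deficit:{len(external)}_of_{required}"
```

Under these assumptions a `test-*` node with three ready peers but only `[usa]` as a direct-ready area still reports a deficit, while `usa-los-1` with `[ifcm, home, test]` does not.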

+### 3.1 Public endpoint confirmation must be cross-area, not local hairpin
+
+The live `home/test` topology also exposed a verification mistake in the
+runtime model:
+
+- `home` and `test` sit behind the same public router address
+  `94.141.118.222`;
+- some public QUIC candidates are valid only when tested from another area such
+  as `usa` or `ifcm`;
+- a same-area probe can fail purely because the local router does not support
+  hairpin NAT / NAT reflection.
+
+Operational consequence:
+
+- a public endpoint marked as `external-network-required` must be treated as
+  non-authoritative when the failure came from `self` or `same_area`;
+- the public candidate should be confirmed or rejected by `cross_area`
+  observers instead.
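
One way to express that rule is to ignore `self` and `same_area` probe results when deciding a public candidate's fate. This is a minimal sketch under stated assumptions: the function name, the tuple shape, the `confirmed` / `rejected` / `unconfirmed` labels, and the choice that a single successful cross-area probe is enough to confirm are all hypothetical.

```python
from typing import List, Tuple


def classify_public_candidate(probes: List[Tuple[str, bool]]) -> str:
    """Classify a public endpoint from (observer_scope, reachable) probe results.

    observer_scope is one of "self", "same_area", "cross_area".
    """
    # Only cross-area observers are authoritative; hairpin failures are ignored.
    cross_area_results = [ok for scope, ok in probes if scope == "cross_area"]
    if any(cross_area_results):
        return "confirmed"
    if cross_area_results:
        # Cross-area observers exist and every one of them failed.
        return "rejected"
    # Only self / same_area evidence: not enough to decide either way.
    return "unconfirmed"
```

With this shape a failed probe from `home` against a `test` endpoint behind the same router leaves the candidate `unconfirmed` instead of marking it dead.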
+
 ### 4. Observability is still heterogeneous

-Live heartbeat coverage is inconsistent:
+Live heartbeat coverage is now richer than it was earlier in the day, but it is
+still not fully converged in behavior:

- `test-*`, `ifcm`, `usa-los-1` emit rich `c17z20` heartbeat metadata with
-  endpoint, peer recovery, and registry sections.
- `home-*` currently do not expose the same full sections in their latest
-  heartbeat rows.
-
-This means operator visibility is uneven and the documentation must not imply
-uniform live introspection across every node today.
+- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
+  peer recovery, and registry sections again.
+- `ifcm` is still the only node that currently reports `compat control` and
+  `registry candidate_only`, so the observability gap has narrowed into a real
+  single-node convergence issue instead of a fleet-wide blind spot.

 ## What is true right now

@@ -117,21 +196,63 @@ uniform live introspection across every node today.
 2. QUIC/UDP is the actual node-to-node transport.
 3. Compatibility `19191/tcp` is still required for recovery overlap.
 4. Signed registry gossip is not yet the sole active discovery/control source.
-5. The "at least 3 direct-ready peers per node" resilience target is not yet
-   met for all externally significant nodes.
+5. `ifcm` still depends on the compat `19191` control overlap.
+6. The plain `3 direct peers` target is insufficient on its own; the live fleet
+   now clearly shows that `cross-area direct diversity` is the next real gate.
+
+## Control/API migration progress
+
+The codebase now carries a more explicit migration contract for control access:
+
+- install profiles prefer canonical `control_plane_endpoints` over a compat
+  singleton `backend_url`;
+- host runtime env generation now exports
+  removed control-plane endpoint env key;
+- node heartbeat/control reporting prefers that canonical endpoint set when it
+  is present;
+- stale updater status behind a fresh heartbeat is now classified separately as
+  `updater_subscription_gap`;
+- heartbeat update hints now have a second-stage recovery path: after writing
+  `update-trigger.json`, a live node can also wake its local updater
+  task/service.
+
+This does not by itself rewrite older runtime wrappers on already-installed
+nodes. It does remove the singleton `backend_url` trap for the next install,
+reinstall, or update-service rewrite cycle.

 ## Operational rule until the next audit

 Do not remove the compatibility `19191/tcp` recovery overlap while any of the
 following remain true:

- any live node still reports a `control_plane_url` on the `19191` contract;
+- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
 - any live node has `fabric_registry_runtime_report.status != active`;
 - any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.
+- any node can only recover through compat `Control API/downloads` overlap.
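
The four conditions above can be evaluated mechanically before anyone touches the overlap. The following is a sketch, not the project's tooling: the dictionary keys other than `fabric_control_endpoint` (`registry_status`, `externally_significant`, `direct_ready_peers`, `compat_only_recovery`) and the blocker reason strings are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple


def overlap_removal_blockers(nodes: List[Dict]) -> List[Tuple[str, str]]:
    """Return (node, reason) pairs that forbid removing the 19191/tcp overlap."""
    blockers = []
    for node in nodes:
        name = node["name"]
        # Condition 1: control endpoint still on the 19191 contract.
        if ":19191" in node.get("fabric_control_endpoint", ""):
            blockers.append((name, "control_endpoint_on_19191"))
        # Condition 2: registry runtime is anything other than active.
        if node.get("registry_status") != "active":
            blockers.append((name, "registry_not_active"))
        # Condition 3: externally significant node below 3 direct-ready peers.
        if node.get("externally_significant") and node.get("direct_ready_peers", 0) < 3:
            blockers.append((name, "direct_ready_peers_below_3"))
        # Condition 4: recovery is only possible through the compat overlap.
        if node.get("compat_only_recovery"):
            blockers.append((name, "compat_only_recovery"))
    return blockers
```

An empty result is a necessary condition for removal under this rule, not a substitute for the audit itself.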

 ## Required next work

+Update 2026-05-19:
+
+- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
+  wake path driven by heartbeat update hints.
+- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
+  sending fresh heartbeats while the updater subscription plane silently stops
+  progressing.
+- This is now treated as a first-class recovery-plane problem, not as a vague
+  stale-node symptom.
+- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
+  `0.2.325-updatehintwake`.
+- `ifcm-rufms-s-mo1cr` is now the only remaining
+  `updater_wake_unsupported` blocker.
+- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
+  defect: on one stale-relay/bootstrap path the node tried a relay endpoint
+  with the certificate fingerprint from a different private direct candidate,
+  producing
+  `CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
+- That bug is now fixed in the runtime line tracked as
+  `0.2.332-relaycertintentfix`.
+
 ### A. Finish signed registry activation

 Each node must be able to resolve active records for at least:
@@ -26,7 +26,7 @@ This policy applies to Linux, Windows, Android, containerized nodes, and future
 The fabric must be able to lose:

 - old API endpoints;
- old artifact URLs;
+- old artifact distributors;
 - previous public IP addresses;
 - previous NAT mappings;
 - previous relay nodes;
@@ -67,7 +67,7 @@ Any change to the fabric must keep older nodes recoverable until one of these
 is true:

 1. every node has confirmed the new contract; or
-2. the missing nodes were manually retired, revoked, or explicitly accepted as
+2. the missing nodes were manually removed, revoked, or explicitly accepted as
   lost.

 This applies to:
@@ -81,6 +81,17 @@ This applies to:
 - host-agent / updater runtime contracts;
 - control endpoints needed only for migration.

+Canonical `Control API` access must be distributable as an explicit endpoint
+set, not only as a single compat `backend_url`. Install/update contracts should
+carry:
+
+- `control_plane_endpoints`;
+- signed fabric registry bootstrap records;
+- artifact endpoints.
+
+The old `backend_url` remains a compatibility fallback only until the fleet has
+converged.
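
The precedence described here can be sketched as a small resolver. The field names `control_plane_endpoints` and `backend_url` come from the text; the function name and the profile being a plain dictionary are assumptions made for illustration.

```python
from typing import Dict, List


def resolve_control_endpoints(profile: Dict) -> List[str]:
    """Prefer the canonical endpoint set; fall back to the compat singleton."""
    # Drop empty entries so a blank list item does not mask the fallback.
    endpoints = [e for e in (profile.get("control_plane_endpoints") or []) if e]
    if endpoints:
        return endpoints
    backend_url = profile.get("backend_url")
    return [backend_url] if backend_url else []
```

Returning a list in both cases keeps callers from special-casing the compat path, which makes the eventual removal of `backend_url` a data change rather than a code change.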
+
 The rule is strict: do not delete the old recovery format while nodes that may
 still need it remain unrecovered.

@@ -200,6 +211,67 @@ Required model:
 - signals are idempotent;
 - signals do not require the old control endpoint to remain alive.

+### 3.7 Update Intent Must Be Independent From One Updater Endpoint
+
+A node must not be permanently bound to one updater service, one updater node,
+one systemd unit name, one scheduled task name, or one control endpoint.
+
+The durable object is not "call this updater URL". The durable object is a
+signed update intent:
+
+- product;
+- target version or version constraint;
+- artifact hashes and allowed mirrors;
+- compatibility contract;
+- rollout lease constraints;
+- force / emergency flags;
+- rollback permission;
+- signed registry/service records that can carry the intent;
+- expiry and generation.
+
+A node may learn the same signed intent from:
+
+- Control API;
+- update-store;
+- update-cache;
+- long-lived outbound control subscription;
+- neighboring nodes through signed fabric registry gossip;
+- local cached last-known-good update state.
+
+The receiving node must validate the intent locally before acting. A neighbor
+may relay signed update metadata and artifacts, but it must not become an
+authority that can forge or broaden an update.
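
Local validation of a relayed intent might look like the sketch below. It is deliberately simplified: HMAC-SHA256 over canonical JSON stands in for whatever signature scheme the fabric really uses (a production design would normally use an asymmetric signature so relays cannot sign), and the field names `generation` and `expires_at` plus the result strings are hypothetical.

```python
import hashlib
import hmac
import json
import time
from typing import Optional


def canonical_bytes(intent: dict) -> bytes:
    """Serialize the intent deterministically so signatures are reproducible."""
    return json.dumps(intent, sort_keys=True, separators=(",", ":")).encode()


def sign_intent(intent: dict, key: bytes) -> str:
    # Stand-in signer; assumption: real deployments use asymmetric keys.
    return hmac.new(key, canonical_bytes(intent), hashlib.sha256).hexdigest()


def validate_intent(intent: dict, signature: str, key: bytes,
                    last_generation: int, now: Optional[float] = None) -> str:
    """Validate a received intent regardless of which path delivered it."""
    now = time.time() if now is None else now
    # Any field a relay changed (for example a broadened version) breaks this.
    if not hmac.compare_digest(sign_intent(intent, key), signature):
        return "rejected:bad_signature"
    if intent["expires_at"] <= now:
        return "rejected:expired"
    # Refuse replays of intents the node has already seen or superseded.
    if intent["generation"] <= last_generation:
        return "rejected:stale_generation"
    return "accepted"
```

Because the check depends only on the intent, its signature, and local state, the same function applies whether the intent arrived from the Control API, a cache, or a neighbor.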
+
+The local recovery boundary must reconcile stale runtime facts before fetching
+or applying a plan:
+
+- current cluster id;
+- node id and identity state directory;
+- current container/task/unit name;
+- current control endpoints;
+- current signed registry records;
+- available artifact mirrors.
+
+This is mandatory because a node may move, a container may be renamed, a task
+may be recreated, or the old host updater may still have a stale command line.
+
+### 3.8 Polling, Subscription, And Neighbor Relay Are All Required
+
+The update plane must use three delivery paths at the same time:
+
+1. slow local fallback polling, so a node eventually recovers even after missed
+   signals;
+2. subscription / push hints, so ordinary updates are fast and do not wait for
+   a long poll interval;
+3. peer relay of signed update intents and signed registry records, so a node
+   can learn current update truth through reachable neighbors when the old
+   center or old ingress is unavailable.
+
+No one path is allowed to be the only recovery mechanism.
+
+Polling cadence is a safety net, not the rollout control mechanism. Rollout
+control belongs to the orchestrator and signed rollout leases.
+
 ## 4. Update Safety Rules

 ### 4.1 Upgrade Contracts
@@ -228,7 +300,7 @@ explicit retirement.
 Recovery-critical artifact versions must remain available until:

 - all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.
+- the remaining nodes are revoked/removed and recorded as intentionally lost.

 Do not garbage-collect the last working host-agent or node-agent build for an
 unrecovered population.
@@ -237,17 +309,18 @@ unrecovered population.

 If historical nodes request different install types for the same product
 (`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
-planning must keep compatibility aliases until the fleet converges.
+planning must publish explicit signed install-type mappings in the fabric
+registry until the fleet converges.

 The fabric must not strand nodes on an install-type naming mismatch.

-### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
+### 4.5 Compat Recovery Contract Drift Must Be Treated As A Blocking Risk

 A stale node may report:

 - a compatible recovery artifact exists under the current registry; but
 - the last local updater/host-agent status still says `no_matching_artifact` or
-  an equivalent legacy contract failure.
+  an equivalent compat contract failure.

 This means the node is not only waiting for a heartbeat. It is running an older
 recovery planner contract and may still depend on:
@@ -257,7 +330,7 @@ recovery planner contract and may still depend on:
 - older update-plan interpretation rules;
 - overlap in signed registry / bootstrap envelopes.

-This condition must be classified as `legacy recovery contract drift` and must
+This condition must be classified as `compat recovery contract drift` and must
 block compatibility removal the same way an artifact gap does.

 Operationally this also means:
@@ -268,11 +341,11 @@ Operationally this also means:
  status on the current contract or the operator explicitly retires the node;
 - when a compatible artifact and target mapping already exist, the node should
  be classified as `bridge replay ready`, meaning the system can replay the
-  legacy-compatible update plan as soon as the node regains an outbound control
+  compat update plan as soon as the node regains an outbound control
  cycle;
 - operator tooling should expose a canonical `bridge replay plan` per node so
  recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
+- signed recovery mappings must remain available for that node population;
 - dashboards and rollout guards must show this separately from ordinary
  `waiting recovery heartbeat`.

@@ -281,9 +354,78 @@ Canonical example:
 - `ifcm-rufms-s-mo1cr` is stale;
 - the current backend can match a Windows-compatible host-agent artifact;
 - the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
+- therefore the node must be treated as a compat recovery-contract blocker, not
  merely as a delayed heartbeat.

+### 4.6 Rollout Orchestrator Is Mandatory
+
+Large fleet update safety requires an orchestrator. The orchestrator decides
+which nodes may update now. Nodes decide whether a received signed intent is
+valid and locally safe to execute.
+
+The orchestrator must support:
+
+- canary rollout;
+- rolling rollout;
+- area / site / NAT-group aware rollout;
+- max parallel updates globally;
+- max parallel updates per area;
+- max unavailable nodes;
+- minimum healthy quorum before continuing;
+- hold / pause / resume;
+- force update for explicitly selected nodes;
+- automatic stop on failure rate, heartbeat loss, rollback, or route diversity
+  regression;
+- separate host-agent and node-agent phases;
+- emergency recovery bridge for pre-orchestrator compat nodes.
+
+The orchestrator must issue short-lived rollout leases. A node may only start an
+update when it holds a valid lease for that product/version. If the lease
+expires before apply starts, the node must re-check the policy.
+
+Rollout leases prevent the entire farm from starting the same update
+simultaneously when a subscription signal or gossip wave reaches all nodes.
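
A lease issuer that enforces the global and per-area limits could be sketched as follows. The class and field names, the in-memory list of active leases, and the explicit `now` argument are assumptions for illustration; signing, persistence, and quorum checks are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lease:
    node: str
    area: str
    product: str
    version: str
    expires_at: float


class LeaseIssuer:
    """Grants short-lived rollout leases under global and per-area caps."""

    def __init__(self, max_global: int, max_per_area: int, ttl_seconds: float):
        self.max_global = max_global
        self.max_per_area = max_per_area
        self.ttl_seconds = ttl_seconds
        self.active: List[Lease] = []

    def request(self, node: str, area: str, product: str,
                version: str, now: float) -> Optional[Lease]:
        # Expired leases free their slot; the holder must ask again.
        self.active = [l for l in self.active if l.expires_at > now]
        if len(self.active) >= self.max_global:
            return None
        if sum(1 for l in self.active if l.area == area) >= self.max_per_area:
            return None
        lease = Lease(node, area, product, version, now + self.ttl_seconds)
        self.active.append(lease)
        return lease
```

Even if a gossip wave reaches every node at once, only the nodes that obtain a lease may start applying, and a denied node simply retries later.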
+
+### 4.7 Node-Side Update Admission Control
+
+Even with a lease, the node must perform local admission checks before apply:
+
+- artifact hash and signature match the signed intent;
+- rollback artifact or previous binary is available unless policy explicitly
+  disables rollback;
+- enough disk space exists for stage plus rollback;
+- current active workload can tolerate restart, or the orchestrator granted a
+  maintenance lease;
+- the node still has at least the required recovery connectivity after
+  excluding itself as temporarily unavailable;
+- host-agent update is applied before node-agent update when the contract says
+  the host-agent is the recovery floor.
+
+If admission fails, the node reports `blocked` with a precise reason instead of
+silently waiting.
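
The admission checks can be modeled as an ordered list of named predicates, where the first failure becomes the reported reason. This is a sketch: every key in `state` and every reason string is hypothetical, and signature verification is assumed to have happened before this step.

```python
from typing import Dict, Optional, Tuple


def admit_update(state: Dict) -> Tuple[str, Optional[str]]:
    """Return ("admitted", None) or ("blocked", reason) for a leased update."""
    checks = [
        ("artifact_hash_mismatch",
         state["artifact_hash"] == state["intent_hash"]),
        ("rollback_unavailable",
         state["rollback_available"] or state["rollback_disabled_by_policy"]),
        ("insufficient_disk",
         state["free_bytes"] >= state["stage_bytes"] + state["rollback_bytes"]),
        ("workload_cannot_restart",
         state["workload_restart_ok"] or state["maintenance_lease"]),
        ("recovery_connectivity_too_low",
         state["recovery_paths_without_self"] >= state["required_recovery_paths"]),
        ("host_agent_update_pending",
         not state["host_agent_is_floor"] or state["host_agent_updated"]),
    ]
    for reason, passed in checks:
        if not passed:
            # Report the first failing check as a precise reason.
            return ("blocked", reason)
    return ("admitted", None)
```

Keeping the reasons as stable strings lets dashboards group blocked nodes by cause instead of showing a generic waiting state.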
+
+### 4.8 Update Waves Must Preserve Failure-Domain Diversity
+
+An update wave must not take down all nodes from the same recovery role or
+failure domain at once.
+
+The orchestrator must account for:
+
+- area;
+- site;
+- locality group;
+- NAT group;
+- public ingress dependency;
+- control-api role;
+- update-store / update-cache role;
+- relay / rendezvous role;
+- VPN ingress / egress roles;
+- nodes that are currently the only known recovery path for another node.
+
+For a small fleet, this means the orchestrator may update one node at a time
+when the remaining diversity is weak, even if the global max parallel setting
+is higher.
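
A greedy wave planner illustrates how the diversity rule can override the parallelism setting. The sketch assumes each node carries a set of failure-domain tags such as `area:home` or `role:relay` (a hypothetical encoding), and it exempts domains with a single member, since such a domain cannot keep a survivor in any case.

```python
from typing import Dict, List, Set


def plan_wave(nodes: Dict[str, Set[str]], max_parallel: int) -> List[str]:
    """Pick nodes for one wave without emptying any multi-member domain.

    nodes maps node name -> set of failure-domain tags.
    """
    members: Dict[str, Set[str]] = {}
    for name, domains in nodes.items():
        for domain in domains:
            members.setdefault(domain, set()).add(name)

    wave: List[str] = []
    for name in sorted(nodes):
        if len(wave) >= max_parallel:
            break
        trial = set(wave) | {name}
        # Every multi-member domain of this node must keep a node outside the wave.
        if all(len(members[d]) == 1 or members[d] - trial for d in nodes[name]):
            wave.append(name)
    return wave
```

In a four-node fleet with two areas of two nodes each, this planner selects at most one node per area even when `max_parallel` is 4.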
+
 ## 5. Service And Location Mobility Rules

 Moving a service must not strand nodes that only know the old location.
@@ -329,7 +471,7 @@ The design must explicitly handle all of these:
 - node reboots during update;
 - only one peer still knows the new registry truth;
 - node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
+- platform removes compat support too early;
 - operator has no shell/RDP/WinRM/SSH access to the host.

 ## 7. Required Local State And Journaling
@@ -359,7 +501,7 @@ It must surface:
 - nodes with stale heartbeat but recent updater activity;
 - nodes with no working compatible recovery artifact;
 - nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
+- nodes whose only known artifact distributor is dead;
 - nodes whose desired state requires a contract they cannot parse;
 - nodes whose local agent version is below the minimum recovery floor;
 - nodes whose last successful contact depended on a single service replica.
@@ -382,7 +524,7 @@ Before deleting old code, old formats, or old endpoints, verify all of these:
 7. install type aliases remain for historical agents where needed;
 8. NAT/passive/outbound-only nodes were explicitly tested;
 9. stale-node risk report is empty or consciously accepted by recovery-admin;
-10. removal of legacy support is documented with the exact cutoff conditions.
+10. removal of compat support is documented with the exact cutoff conditions.

 ## 10. `ifcm-rufms-s-mo1cr` Rule

@@ -412,7 +554,7 @@ The system should keep implementing these concrete items:
 - signed registry retention and overlap checks before endpoint migration;
 - compatibility alias coverage for historical install types;
 - artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
+- stale-node risk dashboard/report before compat cleanup;
 - node-local journaling for last good registry/update state;
 - neighbor-assisted artifact relay path;
 - explicit recovery simulation for outbound-only nodes with dead old endpoints.
@@ -344,7 +344,7 @@ The first backend contract slice is implemented:
 - Fenced routes are not returned as primary or alternate route candidates in a
  service-channel lease. If every route for the selected entry/exit pair is
  fenced by service-channel feedback, the lease enters explicit degraded
-  backend fallback with reason
+  compat fallback with reason
  `fabric_routes_fenced_by_service_channel_feedback`.
 - A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
  `vpn_packets` route intents, injected fresh service-channel flow feedback
@@ -507,18 +507,18 @@ The first backend contract slice is implemented:
  post-restart exit inbox depth from `0` to `88` with zero inbox drops.
 - C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
  Node-agent `0.2.183` validates the signed service-channel lease authority and
-  forces backend fallback when Control Plane has signed
+  forces compat fallback when Control Plane has signed
  `status=degraded_fallback` or `primary_route.status=missing_route_intent`.
  This prevents a node from ignoring the lease decision and accidentally using
  older generic route candidates for the same VPN resource. The rule applies to
  both HTTP packet ingress and WebSocket packet ingress. The live smoke
  `scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
  proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
-  recovery while a lease exists, explicit backend fallback when no authorized
+  recovery while a lease exists, explicit compat fallback when no authorized
  fabric route exists, and route-intent expiry. The passing artifact is
  `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
  run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
-  recovery `4/4`, and moved the degraded backend fallback queue from `0` to
+  recovery `4/4`, and moved the degraded compat fallback queue from `0` to
  `8`.
 - C18Z4 adds live long-session pressure coverage without another runtime
  release. The script
@@ -529,7 +529,7 @@ The first backend contract slice is implemented:
  alternate route. The passing artifact is
  `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
  run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
-  route failure delta `0`, flow drop delta `0`, and backend fallback queue
+  route failure delta `0`, flow drop delta `0`, and compat fallback queue
  `0 -> 0`. This proves route-policy churn can be absorbed by the shared
  fabric runtime while a service WebSocket remains active.
 - C18Z5 adds live exit-node failure coverage while the same kind of service
@@ -540,7 +540,7 @@ The first backend contract slice is implemented:
  the same signed WebSocket. The passing artifact is
  `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
  `c18z5-20260507-213745` sent 480 packets total, observed route failure delta
-  `48`, backend fallback queue `0 -> 192`, flow drop delta `0`, and recovery
+  `48`, compat fallback queue `0 -> 192`, flow drop delta `0`, and recovery
  exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
  degraded/fallback telemetry and fabric delivery resumes after runtime
  recovery without requiring the service connection to be rebuilt.
@@ -554,7 +554,7 @@ The first backend contract slice is implemented:
  `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
  `c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
  inbox, selected the replacement route, kept route failure delta `0`, flow
-  drop delta `0`, and backend fallback queue `0 -> 0`. This proves route-manager
+  drop delta `0`, and compat fallback queue `0 -> 0`. This proves route-manager
  replacement can be applied under an active service session without requiring
  the service connection to be recreated.
 - C18Z7 adds concurrent service-session isolation coverage. The script
@@ -565,7 +565,7 @@ The first backend contract slice is implemented:
  `applied_rebuild`, then continues all sessions. The passing artifact is
  `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
  run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
-  session, with total backend fallback delta `0`, route failure delta `0`, and
+  session, with total compat fallback delta `0`, route failure delta `0`, and
  flow drop delta `0`. This proves concurrent service sessions keep separate
  resource queues and are not starved or poisoned by a shared route-manager
  rebuild.
@@ -579,7 +579,7 @@ The first backend contract slice is implemented:
  run `c18z8-20260507-221347` delivered 192 packets per interactive session,
  hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
  hottest channel, dropped `282` packets on that overloaded channel, and kept
-  backend fallback delta `0` and route failure delta `0`. This proves bounded
+  compat fallback delta `0` and route failure delta `0`. This proves bounded
  queue pressure is service-neutral, observable, and isolated to the overloaded
  logical flow without starving other active sessions.
 - C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
@@ -593,7 +593,7 @@ The first backend contract slice is implemented:
  node-agent `applied_rebuild`, and verifies the same service session continues
  over the fast route. The passing artifact is
  `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
-  `c18z9-20260507-224901` kept backend fallback delta `0`, route failure delta
+  `c18z9-20260507-224901` kept compat fallback delta `0`, route failure delta
  `0`, and flow drop delta `0`.
 - C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
  `0.2.185` binds signed entry/exit pools into the service-channel lease
@@ -610,7 +610,7 @@ The first backend contract slice is implemented:
  `applied_rebuild`, and verifies 288 packets land on the alternate exit. The
  passing artifact is
  `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
-  `c18z10-20260507-232645` kept backend fallback `0`, route failure delta `0`,
+  `c18z10-20260507-232645` kept compat fallback `0`, route failure delta `0`,
  and flow drop delta `0`.
 - C18Z11 adds service-channel entry-pool failover contract coverage. Backend
  `rap-backend:fabric-service-channel-0.2.186` keeps
@@ -675,7 +675,7 @@ The first backend contract slice is implemented:
  continues on the learned fast route. The passing artifact is
  `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
  run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
-  packets to the exit, kept backend fallback `0`, flow drops `0`, and expired
+  packets to the exit, kept compat fallback `0`, flow drops `0`, and expired
  temporary route intents.
 - C18Z15 exposes and hardens effective route-quality preference telemetry.
  Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
@@ -690,7 +690,7 @@ The first backend contract slice is implemented:
  passing artifact is
  `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
  run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
-  packets to the exit, kept backend fallback `0`, flow drops `0`, and exposed
+  packets to the exit, kept compat fallback `0`, flow drops `0`, and exposed
  decayed effective scores in node telemetry.
 - C18Z16 adds per-channel route-quality preference telemetry and fairness
  guardrails. Node-agent `0.2.191` records the applied
@@ -704,7 +704,7 @@ The first backend contract slice is implemented:
  `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
  run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
  logical channels, applied quality preference telemetry to all 32 served
-  channels, kept backend fallback `0`, and flow drops `0`.
+  channels, kept compat fallback `0`, and flow drops `0`.
 - C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
  removes channel-level quality preference diagnostics when the preference is no
  longer present in the current effective preference set or when the preferred
@@ -712,10 +712,10 @@ The first backend contract slice is implemented:
  `scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
  verifies that active channel markers reference visible preferences, stale
  markers are absent, expired route intents are not active, and the session
-  completes without backend fallback. The passing artifact is
+  completes without compat fallback. The passing artifact is
  `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
  run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
-  quality markers, found `0` stale markers, kept backend fallback `0`, and
+  quality markers, found `0` stale markers, kept compat fallback `0`, and
  flow drops `0`.
 - C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
  `0.2.193` now keys runtime-sent logical channels as
@@ -728,11 +728,11 @@ The first backend contract slice is implemented:
  `scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
  wraps the live C18Z17 route-quality/fairness path, verifies served live
  channel names are session-scoped and no unscoped served `flow-NN` channels
-  remain, and keeps backend fallback and flow drops at zero. The passing
+  remain, and keeps compat fallback and flow drops at zero. The passing
  artifact is
  `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
  run `c18z14-20260508-082520` served 32 session-scoped channels, applied
-  quality markers to all 32, kept backend fallback `0`, and flow drops `0`.
+  quality markers to all 32, kept compat fallback `0`, and flow drops `0`.
 - C18Z19 adds the first bounded parallel send window for independent
  service-channel logical flows. Node-agent `0.2.194` can send scheduled
  logical channels concurrently with `MaxParallelFlowSends=4` in the live
@@ -769,7 +769,7 @@ The first backend contract slice is implemented:
  run `c18z14-20260508-085635` delivered 480 packets, observed
  `max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
  `scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
-  served channels, backend fallback `0`, and flow drops `0`.
+  served channels, compat fallback `0`, and flow drops `0`.
 - C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
  keeps the lifetime counters for audit visibility, but adaptive send-window
  pressure now comes from the bounded recent quality window, so old drops and
@@ -785,7 +785,7 @@ The first backend contract slice is implemented:
  run `c18z14-20260508-091952` delivered 480 packets, observed
  `scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
  drops `0`, rolling samples/success/latency on all 32 served channels,
-  `recommended_parallel_flow_sends=4`, backend fallback `0`, and flow drops `0`.
+  `recommended_parallel_flow_sends=4`, compat fallback `0`, and flow drops `0`.
 - C18Z22 connects the rolling window to backend durable route feedback. Backend
  `rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
  from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
@@ -799,7 +799,7 @@ The first backend contract slice is implemented:
  fields. The passing artifact is
  `artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
  `c18z14-20260508-093100` delivered 480 packets, observed one persisted
-  healthy rolling feedback item with rolling payload, backend fallback `0`, and
+  healthy rolling feedback item with rolling payload, compat fallback `0`, and
  flow drops `0`.
 - C18Z23 adds route recovery hysteresis. Backend
  `rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
@@ -812,7 +812,7 @@ The first backend contract slice is implemented:
  the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
  forwarding, and the unit hysteresis contract. The passing artifact is
  `artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
-  `c18z14-20260508-094111` delivered 480 packets with backend fallback `0` and
+  `c18z14-20260508-094111` delivered 480 packets with compat fallback `0` and
  flow drops `0`.
 - C18Z24 exposes that recovery state to operators and API consumers. Backend
  `rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
@@ -925,7 +925,7 @@ The first backend contract slice is implemented:
   C18X; route-intent lifecycle cleanup and synthetic-config expired-route
   filtering landed in C18Y; bounded multi-channel load/rebuild/drop telemetry
   coverage landed in C18Z; live signed service-channel ingress through the
-   running mesh listener landed in C18Z1; sustained live ingress with exit-node
+   running fabric listener landed in C18Z1; sustained live ingress with exit-node
   restart/recovery coverage landed in C18Z2; signed degraded fallback
   enforcement plus entry restart/WebSocket parity landed in C18Z3; long-lived
   WebSocket pressure with mid-session route-policy churn landed in C18Z4; live
@@ -988,7 +988,7 @@ The first backend contract slice is implemented:
  from explicit degraded compatibility requests; C18Z56 adds active-channel remediation
  diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
  `hold_degraded_route_state`) to make the next runtime action explicit, and its
-  alternate-route branch is live-smoke-proven with backend fallback kept off.
+  alternate-route branch is live-smoke-proven with compat fallback kept off.
  C18Z57 adds the bounded machine-readable `remediation_command` contract to
  active access telemetry rows so route-manager can consume a short-lived
  `prefer_alternate_route` command with primary/replacement route ids and TTL.
@@ -996,7 +996,7 @@ The first backend contract slice is implemented:
  node-agent route-manager consumes them as explicit applied replacement
  decisions sourced from `service_channel_remediation_command`. C18Z59 proves
  post-remediation service-channel traffic actually selects the replacement
-  route in runtime/flow telemetry without local/backend fallback. C18Z60 proves
+  route in runtime/flow telemetry without local/compat fallback. C18Z60 proves
  the same remediation path for multiple independent VPN flow channels in one
  packet batch, with replacement-route flow stats, no flow drops, no route
  failures, and no degraded fallback. C18Z61 proves the remediation replacement
@@ -1024,7 +1024,7 @@ The first backend contract slice is implemented:
  0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
  access telemetry now reports `flow_health_status` and `flow_health_reason` at
  cluster, node, and active-channel levels using traffic-class pressure, queue
-  pressure, flow drops, backend fallback, route-quality failures/drops/slow
+  pressure, flow drops, compat fallback, route-quality failures/drops/slow
  samples, and route send latency. C18Z69 adds node-side adaptive response:
  runtime heartbeat flow-scheduler snapshots now include per-class
  `recommended_parallel_windows` and adaptive backpressure reason, and the send
@@ -1039,7 +1039,7 @@ The first backend contract slice is implemented:
  tune shared fabric backpressure without changing VPN/RDP-specific code.
  C18Z72 adds an audited pool/failover policy contract for entry/exit pool
  constraints, preferred entry/exit, selection strategy, failover modes,
-  backend fallback allowance, and sticky session mode. Lease issuance applies
+  compat fallback allowance, and sticky session mode. Lease issuance applies
  that policy before route selection and signs the effective `pool_policy`
  provenance into the service-channel lease authority payload. C18Z73 projects
  that signed pool-policy fingerprint into active access telemetry and guards
@@ -1080,7 +1080,7 @@ The first backend contract slice is implemented:
  existing rebuild command to a replacement route, the entry node reports a
  route-manager decision for the same `rebuild_request_id`, the transition is
  `applied_rebuild`, and live service-channel packet ingress selects the
-  replacement route with no local/backend fallback, route failures, or flow
+  replacement route with no local/compat fallback, route failures, or flow
  drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
  service-channel packet bursts remain on the replacement route, no stale
  primary route is reselected, and fallback, route-failure, flow-drop, and
@@ -0,0 +1,206 @@
+# Fabric Service-Over-Transport Model
+
+Status: active target architecture.
+
+This document defines the mandatory separation between:
+
+1. the internal fabric transport;
+2. the logical service channel contract;
+3. the external service ingress edge.
+
+It exists to prevent a recurring failure pattern where external TCP/HTTP/HTTPS
+listeners are mistaken for the fabric's internal transport.
+
+## 1. Core rule
+
+The fabric is the internal transport substrate.
+
+- Inside the fabric, node-to-node runtime transport is `QUIC over UDP`.
+- Services do not implement their own inter-node transport.
+- Services do not need to understand relay, NAT, route replacement, or peer
+  selection details.
+
+A service asks the fabric for a channel. The fabric creates, maintains,
+rebuilds, and heals that channel.
+
+## 2. Three-layer model
+
+### 2.1 Fabric Transport
+
+Fabric Transport is the lowest runtime layer.
+
+Responsibilities:
+
+- peer discovery and peer memory
+- endpoint candidate verification
+- direct and relay path establishment
+- route maintenance
+- route replacement
+- cross-area recovery
+- QUIC session lifecycle
+- certificate pinning and authority trust enforcement
+
+Transport contract:
+
+- node-to-node runtime transport is `QUIC/UDP`
+- TCP is not an alternate transport carrier inside the fabric
+
+### 2.2 Fabric Service Channel
+
+The service channel is the logical contract used by any upper-layer service.
+
+Responsibilities:
+
+- request a route to a node or pool
+- expose a stable channel identifier
+- carry bidirectional application traffic
+- survive path rebuild when possible
+- surface degraded or migrated channel state to the service without exposing
+  internal route topology details
+
+The service channel must hide:
+
+- which relay was chosen
+- which direct peer was replaced
+- which ingress or NAT path was changed
+- which recovery seed was used
+
+The service should see channel semantics, not transport topology.
+
+### 2.3 External Service Ingress
+
+External ingress is the edge that accepts user-facing or third-party traffic.
+
+Examples:
+
+- HTTP/HTTPS ingress for the admin UI and the user account portal (personal cabinet)
+- VPN ingress that accepts client traffic
+- future RDP ingress
+
+Ingress may speak TCP/HTTP/HTTPS or another external protocol at the edge, but
+after acceptance it must map traffic into a fabric service channel.
+
+The ingress edge is not the fabric transport.
+
+## 3. Examples
+
+### 3.1 Admin panel
+
+The user opens an HTTPS page.
+
+1. the public ingress listens on ports `80` and `443`;
+2. it accepts HTTPS and performs edge policy checks;
+3. it opens or reuses a `control_ui` fabric service channel;
+4. the request is forwarded through the fabric to a panel service instance;
+5. the response is returned through the fabric channel back to the ingress;
+6. the ingress returns the HTTP response to the browser.
+
+The browser sees HTTPS. The fabric sees an internal service channel over
+`QUIC/UDP`.
+
+### 3.2 VPN
+
+The VPN edge accepts client-side tunnel traffic.
+
+1. the VPN service receives IPv4 packets from the client-facing side;
+2. it requests a `vpn_tunnel` channel to an egress pool;
+3. the fabric chooses and maintains the route;
+4. an egress node performs IPv4 exit/NAT to the external network;
+5. return traffic follows the maintained channel back through the fabric.
+
+The VPN service does not decide how the route is built. The fabric does.
+
+## 4. Service contract requirements
+
+Every service-over-fabric integration must use a channel contract with at least
+these concepts:
+
+- `channel_request`
+- `channel_id`
+- `channel_class`
+- `destination_selector`
+- `current_state`
+- `send`
+- `receive`
+- `close`
+
+Optional but recommended:
+
+- `channel_migrated`
+- `channel_degraded`
+- `preferred_qos_class`
+- `pool_affinity`
+- `session_stickiness`
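
Illustrative note (not part of the commit): the contract concepts listed above can be sketched as a small in-memory model. This is only a sketch under stated assumptions: `ChannelRequest`, `ServiceChannel`, `ChannelState`, `open_channel`, and the loopback delivery are hypothetical names invented for illustration, not the project's API. The point it tries to show is that `channel_id` stays stable while the fabric reports a migration.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
import itertools


class ChannelState(Enum):
    OPEN = "open"
    DEGRADED = "degraded"   # maps to the optional `channel_degraded` signal
    MIGRATED = "migrated"   # maps to the optional `channel_migrated` signal
    CLOSED = "closed"


@dataclass(frozen=True)
class ChannelRequest:
    channel_class: str          # e.g. "vpn_tunnel"
    destination_selector: str   # a node or pool name, never a route or relay


class ServiceChannel:
    """Hypothetical service-facing handle; it exposes no route topology."""

    def __init__(self, channel_id: str, request: ChannelRequest):
        self.channel_id = channel_id
        self.channel_class = request.channel_class
        self.destination_selector = request.destination_selector
        self.current_state = ChannelState.OPEN
        self._inbox = deque()

    def send(self, payload: bytes) -> None:
        if self.current_state is ChannelState.CLOSED:
            raise RuntimeError("channel is closed")
        # Loopback stands in for the remote peer in this sketch.
        self._inbox.append(payload)

    def receive(self):
        return self._inbox.popleft() if self._inbox else None

    def close(self) -> None:
        self.current_state = ChannelState.CLOSED

    def fabric_migrated(self) -> None:
        # Called by the fabric side only: the path changed, the identity did not.
        if self.current_state is not ChannelState.CLOSED:
            self.current_state = ChannelState.MIGRATED


_ids = itertools.count(1)


def open_channel(request: ChannelRequest) -> ServiceChannel:
    """Assumed entry point: the service asks, the fabric creates the channel."""
    return ServiceChannel(f"ch-{next(_ids)}", request)
```

A service would call `open_channel(ChannelRequest("vpn_tunnel", "ipv4-egress"))` and then use only `send`, `receive`, `close`, and `current_state`; which relay or peer carries the traffic never appears in the handle.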
+
+## 5. Channel classes
+
+Minimum channel classes:
+
+- `control_ui`
+- `vpn_tunnel`
+- `rdp_session`
+- `artifact_delivery`
+- `service_admin`
+- `internal_control`
+
+Each class must define:
+
+- latency sensitivity
+- loss tolerance
+- bandwidth expectation
+- stickiness requirement
+- pool failover behavior
+- health check behavior
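
Illustrative note (not part of the commit): one way to keep the class list and its required attributes honest is a registry check. The sketch below assumes a plain dictionary registry; the function name `validate_class_registry` and the attribute keys are hypothetical spellings of the bullets above, and the sample values in the usage note are placeholders rather than recommended settings.

```python
MINIMUM_CHANNEL_CLASSES = (
    "control_ui",
    "vpn_tunnel",
    "rdp_session",
    "artifact_delivery",
    "service_admin",
    "internal_control",
)

# Hypothetical key names for the six attributes every class must define.
REQUIRED_CLASS_FIELDS = (
    "latency_sensitivity",
    "loss_tolerance",
    "bandwidth_expectation",
    "stickiness_requirement",
    "pool_failover_behavior",
    "health_check_behavior",
)


def validate_class_registry(registry: dict) -> list:
    """Return human-readable problems; an empty list means the registry is complete."""
    problems = []
    for name in MINIMUM_CHANNEL_CLASSES:
        profile = registry.get(name)
        if profile is None:
            problems.append(f"{name}: class not defined")
            continue
        for field_name in REQUIRED_CLASS_FIELDS:
            if field_name not in profile:
                problems.append(f"{name}: missing {field_name}")
    return problems
```

For example, a registry that defines only `control_ui` and omits its `health_check_behavior` would yield one "missing" entry plus five "class not defined" entries.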
+
+## 6. Pool-first delivery
+
+Services should target pools when they need resilience.
+
+Examples:
+
+- VPN should target an egress pool, not a single node
+- future RDP should target a pool of reachable adapters when that service mode
+  applies
+
+The service must not need to know:
+
+- which specific node was selected
+- how many nodes are in the pool
+- whether the path was direct or relayed
+
+## 7. Recovery implications
+
+Recovery must be separated into:
+
+1. node survival
+2. transport survival
+3. service channel survival
+4. ingress survival
+
+It is not enough for the node to recover if the service channel model still
+depends on a hidden compat carrier.
+
+## 8. TCP clarification
+
+TCP is allowed only in these roles:
+
+- external user ingress
+- operator/API ingress
+- temporary overlap for compatibility recovery
+- artifact/control delivery at the service edge
+
+TCP is not allowed as the normal inter-node fabric transport.
+
+If TCP is still visible in the live system, it must be classified explicitly as
+one of the roles above.
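
Illustrative note (not part of the commit): the classification rule above can be checked mechanically. The sketch assumes an inventory that maps each TCP listener address to a declared role string; the `TcpRole` enum values and `unclassified_tcp_listeners` are hypothetical names, chosen to mirror the four allowed roles.

```python
from enum import Enum


class TcpRole(Enum):
    EXTERNAL_USER_INGRESS = "external_user_ingress"
    OPERATOR_API_INGRESS = "operator_api_ingress"
    COMPAT_RECOVERY_OVERLAP = "compat_recovery_overlap"
    SERVICE_EDGE_DELIVERY = "service_edge_delivery"


def unclassified_tcp_listeners(listeners: dict) -> list:
    """Return listener addresses whose declared role is missing or not allowed."""
    allowed = {role.value for role in TcpRole}
    return sorted(addr for addr, role in listeners.items() if role not in allowed)
```

Any address returned by this check is a TCP listener that still needs an explicit classification, or one that is acting as inter-node transport and should not exist.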
+
+## 9. Relationship to area stability
+
+The transport layer must maintain resilient peer diversity across areas, but the
+service layer must not need to understand those details.
+
+See
+[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
+for the current peer diversity model and
+[FABRIC_LIVE_AUDIT_2026-05-18.md](FABRIC_LIVE_AUDIT_2026-05-18.md)
+for the live operational gaps.
@@ -0,0 +1,70 @@
+# Fabric Transport Scale Plan
+
+Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.
+
+## Invariants
+
+- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
+- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
+- Control traffic stays small. Bulk and continuous traffic is never carried in a single control response.
+- `tunnel_id` is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
+- Hot traffic is binary framed, not JSON/base64.
+- Interactive/control/DNS traffic must not wait behind bulk traffic.
+- Route changes preserve the service tunnel identity.
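
Illustrative note (not part of the commit): the invariant that interactive, control, and DNS traffic must not wait behind bulk can be sketched as a strict-priority queue. This is an assumption about one possible scheduling policy, not the project's scheduler; `ClassScheduler` and the `PRIORITY` table are hypothetical.

```python
import heapq
import itertools

# Lower number is served first; only the relative order matters in this sketch.
PRIORITY = {"control": 0, "dns": 0, "interactive": 0, "bulk": 1}


class ClassScheduler:
    """Strict-priority scheduler with FIFO order inside each priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def enqueue(self, traffic_class: str, frame: bytes) -> None:
        entry = (PRIORITY[traffic_class], next(self._seq), traffic_class, frame)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        if not self._heap:
            return None
        _, _, traffic_class, frame = heapq.heappop(self._heap)
        return traffic_class, frame
```

Strict priority can starve bulk under sustained interactive load, so a real scheduler would likely pair it with per-class windows or backpressure.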
+
+## Planes
+
+- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
+- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
+- Data plane: binary QUIC stream frames for service traffic.
+- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
+- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.
+
+## Service Tunnel Contract
+
+Each service receives:
+
+- `tunnel_id`
+- `pool_id`
+- `service_id`
+- `local_service_id`
+- `remote_service_id`
+- `service_kind`
+- `service_class`
+- `service_role`
+- `route_lease_id`
+- `route_generation`
+- `data_plane`
+- `traffic_classes`
+- `stream_shards`
+
+VPN default profile:
+
+- pool: `ipv4-egress`
+- service kind: `vpn-exit`
+- service class: `vpn_packets`
+- role: `ipv4-egress`
+
+Future profiles use the same contract, for example `rdp-client`, `vnc-client`, `artifact-store`, or `remote-workspace`.
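
Illustrative note (not part of the commit): the tunnel contract fields above map naturally onto an immutable record. In the sketch below the field types, the `with_route` helper, the monotonic-generation check, and every value not named in the VPN default profile (for example `service_id="vpn"`, `data_plane="quic"`, the traffic class names, and the shard count) are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServiceTunnel:
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int
    data_plane: str
    traffic_classes: tuple
    stream_shards: int

    def with_route(self, route_lease_id: str, route_generation: int) -> "ServiceTunnel":
        """Return the same tunnel on a newer route; tunnel_id is left untouched."""
        if route_generation <= self.route_generation:
            # Assumed rule: a route generation only ever moves forward.
            raise ValueError("route_generation must increase")
        return replace(
            self,
            route_lease_id=route_lease_id,
            route_generation=route_generation,
        )


def vpn_default_tunnel(tunnel_id: str, route_lease_id: str) -> ServiceTunnel:
    """Build a tunnel using the VPN default profile from the list above."""
    return ServiceTunnel(
        tunnel_id=tunnel_id,
        pool_id="ipv4-egress",
        service_id="vpn",                  # placeholder
        local_service_id="vpn-client",     # placeholder
        remote_service_id="vpn-exit",      # placeholder
        service_kind="vpn-exit",
        service_class="vpn_packets",
        service_role="ipv4-egress",
        route_lease_id=route_lease_id,
        route_generation=1,
        data_plane="quic",                 # placeholder
        traffic_classes=("dns", "interactive", "bulk"),  # placeholder
        stream_shards=4,                   # placeholder
    )
```

Because the record is frozen, a route change produces a new value with the same `tunnel_id`, which mirrors the invariant that route changes preserve the service tunnel identity.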
+
+## Implementation Phases
+
+1. Generalize the tunnel contract and keep VPN as the first profile. Current code exposes `rap.fabric_service_tunnel.v1`.
+2. Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use `tunnel_id`; VPN ids are compatibility aliases inside the packet payload.
+3. Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks `rap.fabric_service_stream_registry.v1` with per-tunnel stream state.
+4. Add route epoch migration so a service keeps the same tunnel while the farm changes path. Current contract already carries `route_lease_id` and `route_generation` through profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same `tunnel_id` and update the active transport route epoch without closing service streams.
+5. Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
+6. Add per-class stream scheduler and backpressure at tunnel, node, route, and pool levels.
+7. Add admission control and capacity accounting per node, route, pool, organization, and service.
+8. Add stress tests for many tunnels, mixed traffic, route failure, node failure, and update while traffic flows.
+
+## Scale Rules
+
+- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
+- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by `traffic_classes` plus `stream_shards`.
+- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB bottleneck.
+- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
+- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
+- Bulk must be drainable and resumable.
+- Interactive traffic must stay preemptive over bulk.
+- Every transport fact must be observable separately from planned route and endpoint candidates.
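
Illustrative note (not part of the commit): the batch limits quoted above (2048 packets or 4 MiB per batch, under an 8 MiB frame guardrail) can be expressed as a simple packing loop. The function `batch_packets` is hypothetical, and the sketch ignores any framing overhead added around a batch.

```python
MIB = 1024 * 1024
FRAME_GUARDRAIL_BYTES = 8 * MIB   # per-frame guardrail from the rules above
MAX_BATCH_PACKETS = 2048          # VPN mobile batch packet limit
MAX_BATCH_BYTES = 4 * MIB         # VPN mobile batch byte limit


def batch_packets(packets):
    """Split packets into batches that respect both the count and byte limits."""
    batches = []
    current = []
    size = 0
    for packet in packets:
        if len(packet) > MAX_BATCH_BYTES:
            raise ValueError("single packet exceeds the batch byte limit")
        full_by_count = len(current) >= MAX_BATCH_PACKETS
        full_by_bytes = size + len(packet) > MAX_BATCH_BYTES
        if current and (full_by_count or full_by_bytes):
            batches.append(current)
            current = []
            size = 0
        current.append(packet)
        size += len(packet)
    if current:
        batches.append(current)
    return batches
```

Each batch stays at or below 4 MiB, so it fits inside one 8 MiB frame; throughput then comes from sending several batches in parallel across `traffic_classes` and `stream_shards`, not from larger frames.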
@@ -604,14 +604,14 @@ experiment while preserving the production forwarding kill-switch. This result
 is retained only as test-history context; it is not the active transport
 direction for the fabric runtime:

- `HTTPPeerTransport` maps explicit peer node IDs to synthetic HTTP endpoint
+- `QUICPeerTransport` maps explicit peer node IDs to synthetic QUIC endpoint
  URLs.
- `rap-node-agent` can start a synthetic `/mesh/v1/*` endpoint only when
-  `RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true` and `RAP_MESH_LISTEN_ADDR` is set.
+- `rap-node-agent` can start the synthetic fabric runtime only when
+  `RAP_FABRIC_RUNTIME_ENABLED=true` and `RAP_FABRIC_LISTEN_ADDR` is set.
 - peer endpoints and synthetic routes can be injected as JSON for smoke/debug
  only.
 - `mesh-live-smoke` proves direct and single-relay synthetic traffic over real
-  local HTTP endpoints.
+  local QUIC endpoints.
 - bounded `synthetic.echo` remains the only test-service payload.
 - `/mesh/v1/forward` remains disabled.
 - no production service traffic is authorized.
@@ -504,7 +504,7 @@ Implementation:
  `diff_time_ms`, `render_update_reason`, and
  `fallback_to_full_frame_reason`.
 - Windows direct transport accepts `render.frame.full`,
-  `render.frame.region`, and legacy `session.frame` binary messages.
+  `render.frame.region`, and compat `session.frame` binary messages.
 - Windows presenter keeps a per-session framebuffer and patches region bytes
  into it before presenting the updated WPF surface.
 - Smoke proof showed baseline `render.frame.full` at `3,686,400` bytes and
@@ -340,4 +340,4 @@ Deliver:
 - buildable `workers/rdp-service-csharp`
 - interfaces for protocol engine, data-plane bridge, graphics sink, input source
 - README with migration stages
- docs update marking current C++/FreeRDP path as legacy MVP runtime
+- docs update marking current C++/FreeRDP path as compat MVP runtime
@@ -312,7 +312,7 @@ Responsibilities:
 - enforces user, organization, cluster, and owner visibility policy before accepting traffic
 - participates in latency-aware and load-aware exit selection
 - supports failover between nodes in the same exit pool without changing the Android client protocol
- does not expose legacy VPN protocols as the steady-state data plane
+- does not expose compat VPN protocols as the steady-state data plane

 ### `vpn-client`

@@ -324,7 +324,7 @@ Responsibilities:
 - requests the list of visible IPv4 exit pools and nodes according to the current user's access level
 - creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
 - switches to another authorized exit when the selected exit fails or becomes slow
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
+- keeps old protocol compatibility out of the runtime data plane; old nodes may only use compat download/update paths long enough to fetch the new agent
 - exposes its local IPv4 ingress as service configuration: on Android this is the
  `VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
  listen ports that are mapped into VPN packet channels.
@@ -300,7 +300,7 @@ Recommended flow:
 3. dual validation period begins where required
 4. new certificates are issued/accepted
 5. old certificates expire or are revoked
-6. old trust root is retired after rollout threshold
+6. old trust root is removed after rollout threshold

 Channels should revalidate after trust bundle changes.

@@ -8,7 +8,7 @@ transport architecture. The active inter-node transport model is QUIC-only; see
 `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.

 Status: P3.3 historical test-stand smoke complete for encrypted resource
-secrets, assignment-time resolution, and legacy RDP baseline behavior with
+secrets, assignment-time resolution, and compat RDP baseline behavior with
 smoke-only direct-worker trust.

 This document defines the next security hardening layer around the accepted RDP
@@ -110,7 +110,7 @@ In `APP_ENV=production`:

 - RDP/VNC/SSH resources must have `secret_ref`.
 - Plain credential-like keys are rejected in resource `metadata`.
- Session start rejects legacy resources that still contain plaintext
+- Session start rejects compat resources that still contain plaintext
  credential-like metadata.
 - backend startup requires secret encryption key material.
 - Development/smoke environments may continue using plaintext metadata while
@@ -109,7 +109,7 @@ adapter runtime.
 - Control Plane remains authoritative for session lifecycle and policy.
 - PostgreSQL remains source of truth; Redis remains live coordination only.
 - Fabric transport remains QUIC-only between nodes; any historical direct
-  worker or backend fallback paths belong to paused service-specific baselines,
+  worker or compat fallback paths belong to paused service-specific baselines,
  not to the active fabric transport contract.
 - Adapter runtime must not create sessions outside broker/assignment control.

@@ -212,7 +212,7 @@ Signing key rotation rules:
 1. New key is introduced in a signed trust bundle.
 2. Node verifies the new key through existing trust.
 3. Snapshots may be dual-signed during transition.
-4. Old key is retired only after policy-defined rollout.
+4. Old key is removed only after policy-defined rollout.
 5. Compromised key is revoked through signed revocation metadata or emergency
   recovery flow.

@@ -12,6 +12,10 @@ Core. It does not redefine node-to-node transport. Current fabric inter-node
 transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric
 routes instead of introducing a separate packet transport contract.

+The general service-over-fabric contract is defined in
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
+VPN is one service class over that transport model, not an exception to it.
+
 ## Purpose

 VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
@@ -25,6 +29,9 @@ platform's core rules:
 - Nodes execute leased work only.
 - Organizations must not see mesh topology.
 - Interactive services such as RDP must not be harmed by VPN bulk traffic.
+- VPN ingress may accept external client traffic, but after acceptance it must
+  map that traffic into a fabric service channel rather than inventing an
+  alternate inter-node carrier.

 ## Non-Goals

@@ -18,6 +18,13 @@ Terminology rule:
 The Control API may use HTTP/HTTPS, but it is not a fallback or alternate
 carrier for fabric node-to-node runtime traffic.

+The formal three-layer separation is defined in
+[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md):
+
+- `Fabric Transport` - internal QUIC/UDP substrate
+- `Fabric Service Channel` - logical service channel contract
+- `External Service Ingress` - browser/API TCP/HTTP/HTTPS edge
+
 ## Purpose

 The platform needs a clear distinction between:
@@ -36,7 +43,7 @@ secrets, node identity, or routing authority.

 Public HTTPS Ingress is an edge service. It may run on a public Internet node,
 including a small/slow node intended only to accept browser traffic and pass it
-into the fabric.
+into the fabric through a service channel.

 Role names:

@@ -225,7 +232,7 @@ The recommended model is:
 ```text
 Admin Web Shell
  -> UI Manifest / Page Definition endpoint
-  -> Scoped Control API endpoints
+  -> Scoped Fabric control endpoints
 ```

 Dynamic pages are allowed for:
@@ -474,8 +481,8 @@ the management authority. Platform/global admin runtime remains limited to
 platform-owner trusted nodes. Cluster, organization, and user panels receive
 only their scoped projections.

-The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
-zones` is retired for the transport-layer view. The Fabric panel must show
+The compat Fabric map with separate `inputs`, `cluster nodes`, and `egress
+zones` is removed for the transport-layer view. The Fabric panel must show
 actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
 state, relay/route-health annotations, and web-ingress runtime readiness. It
 must not render old entry/egress zone columns as if they were transport
@@ -520,7 +527,7 @@ The platform recognizes these web/admin placement roles:
 | `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
 | `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |

-Legacy `entry-node` remains a generic client ingress/service edge role for
+Compat `entry-node` remains a generic client ingress/service edge role for
 non-admin product services. It must not imply admin authority.

 ## Fabric Service Classes