Refactor RDP proxy handling and update related tests

2026-05-17 20:38:35 +03:00
parent 8e9402580f
commit d551e57fd5
172 changed files with 22117 additions and 2509 deletions
@@ -88,6 +88,16 @@ Native host process responsible for node identity, enrollment, certificates, hea
 Service Workload:
 A workload executed on a node. It may be native or containerized. Examples: `rdp-worker`, `vnc-worker`, `entry-node`, `relay-node`, `file-storage-cache`.

+Public/Admin HTTPS Ingress:
+A service-edge role that listens on TCP `80`/`443` for browser/API HTTPS and
+forwards accepted requests into the QUIC-only fabric service channel. It is not
+an authority service and does not imply permission to manage the cluster.
+
+Admin UI Runtime:
+A scoped admin service runtime. The global admin runtime may run only on
+platform-owner trusted nodes; cluster, organization, and user portal runtimes
+receive only their scoped projections.
+
 Capability:
 What a node can technically do. Example: `can_run_rdp_worker`.

@@ -162,6 +172,13 @@ policy, approvals, and audit.
 20. Node-agent is the local supervisor for health, restart, update, and rollback
    of node services, but Control Plane owns rollout policy and durable schema
    migration orchestration.
+21. HTTP/HTTPS is an external service edge only. Fabric node-to-node transport
+    remains QUIC-only.
+22. A node that accepts `443` does not own management authority. Admin authority
+    belongs to signed roles, scoped claims, policy, and trusted runtime nodes.
+23. Global admin runtime, policy authority, and audit sink must run only on
+    platform-owner controlled nodes. Organization and cluster portals must not
+    expose unrelated tenants, clusters, or internal mesh topology.

 ## Existing Node Management Semantics

@@ -0,0 +1,96 @@
+# Distributed Authority Audit 2026-05-16
+
+Status: target architecture is distributed, but the live test cluster still has
+bootstrap central authority pieces that must be removed before production trust.
+
+## Fixed Requirements
+
+- No single management/API/storage/update service is allowed to own cluster
+  truth.
+- Control, storage, update, route authority, observer, and update-cache are node
+  roles in the fabric.
+- A service endpoint can serve signed state, but cannot create trusted state by
+  itself.
+- Node identity is cryptographic. IP addresses, DNS names, and NAT addresses are
+  endpoint candidates only.
+- Nodes must publish real signed candidates for reachable interfaces,
+  STUN/ICE-reflexive addresses, passive reverse channels, and relay fallback.
+- Nodes must verify signed control data locally before applying it.
+
+## Live Cluster Findings
+
+- The live cluster has one active `cluster_authorities` row:
+  `rap-ca-ed25519-09877466aa9b6b58b0f312b0b313ea33`.
+- Its metadata says `storage=database_signer` and
+  `production_target=external_cluster_signer_or_hsm`.
+- Release metadata for recent node-agent versions is signed, but signed by the
+  same database-backed authority.
+- Synthetic mesh configs are signed and node-agent verifies them against the
+  pinned cluster authority.
+- Node enrollment pins cluster authority into `identity.json`.
+- Before this audit, host-agent update plans were carried with signatures but
+  host-agent did not locally reject unsigned plans when a pinned authority was
+  present.
+
+## Changes Made In This Audit
+
+- The fabric docs now declare distributed authority and quorum as mandatory.
+- Node/fabric endpoints must be explicit `host:port`; DNS-only service names are
+  rejected as fabric endpoints.
+- `home-1` no longer advertises `smoke.cin.su` as a fabric endpoint. It now
+  advertises its real interface candidate `quic://192.168.200.85:18080`.
+- Host-agent now verifies `node_update_plan` authority signatures when
+  `identity.json` contains a pinned cluster authority public key.
+- Unsigned update plans are rejected in that pinned-authority mode.
+- Added `rap.cluster_authority.quorum.v1` and
+  `rap.cluster_authority.quorum_envelope.v1` contracts to both agent and
+  backend authority packages.
+- Host-agent can now verify quorum-signed update plans when `identity.json`
+  contains a pinned quorum descriptor.
+- Backend update plans now include an `authority_quorum` envelope when the
+  cluster authority metadata contains a quorum descriptor. If that configured
+  quorum cannot be satisfied, the update plan is not issued.
+- Node bootstrap now carries `cluster_authority_quorum`; the approval authority
+  payload signs the quorum descriptor hash, and node-agent persists the
+  descriptor into `identity.json` after verifying the signed hash.
+- Published `rap-node-agent` and `rap-host-agent` release
+  `0.2.284-quorumauthority`.
+- Canaried `home-1` to `rap-node-agent 0.2.284-quorumauthority` and
+  `rap-host-agent 0.2.284-quorumauthority`; both reported healthy/noop after
+  update.
+- Published `rap-node-agent` and `rap-host-agent` release
+  `0.2.285-quorumbootstrap`.
+- Canaried `home-1` to `rap-node-agent 0.2.285-quorumbootstrap` and
+  `rap-host-agent 0.2.285-quorumbootstrap`; both reported current=target/noop.
+  `ifcm-rufms-s-mo1cr` was intentionally not updated because it is behind NAT
+  and still needs fabric/update-cache artifact reachability before further
+  rollout.
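
The explicit `host:port` rule for fabric endpoints described in the list above could be checked roughly as follows. This is a minimal sketch, not the project's actual validator: `validateFabricEndpoint` is a hypothetical name, and it assumes fabric endpoints are written as `quic://host:port` URLs as in the `home-1` example. Whether a DNS hostname with an explicit port is acceptable is not stated in the audit, so the sketch only rejects endpoints that lack a port or use a non-QUIC scheme.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strconv"
)

// validateFabricEndpoint is a hypothetical helper for illustration only.
// It accepts quic://host:port endpoints with an explicit numeric port and
// rejects everything else, including bare service names without a port.
func validateFabricEndpoint(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return fmt.Errorf("parse endpoint %q: %w", raw, err)
	}
	if u.Scheme != "quic" {
		return fmt.Errorf("endpoint %q: scheme %q is not quic", raw, u.Scheme)
	}
	// SplitHostPort fails when the port is missing, which is how a
	// DNS-only name such as quic://smoke.cin.su gets rejected.
	host, port, err := net.SplitHostPort(u.Host)
	if err != nil {
		return fmt.Errorf("endpoint %q: need explicit host:port: %w", raw, err)
	}
	if host == "" {
		return fmt.Errorf("endpoint %q: empty host", raw)
	}
	n, err := strconv.Atoi(port)
	if err != nil || n < 1 || n > 65535 {
		return fmt.Errorf("endpoint %q: invalid port %q", raw, port)
	}
	return nil
}

func main() {
	endpoints := []string{
		"quic://192.168.200.85:18080",
		"quic://smoke.cin.su",
		"https://smoke.cin.su:443",
	}
	for _, ep := range endpoints {
		fmt.Printf("%s -> %v\n", ep, validateFabricEndpoint(ep))
	}
}
```

Running the check at config-load time, before any heartbeat is published, would keep a bad endpoint from ever reaching peers.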
+
+## Remaining Production Blockers
+
+- Replace `database_signer` with quorum authority:
+  M-of-N signatures from nodes or hardware/offline keys with
+  `control-authority` / `update-authority` roles.
+- Store authority descriptors and role certificates as replicated signed state,
+  not only database rows.
+- Require quorum envelopes for the remaining high-risk mutations: role
+  mutation, release creation, update policy mutation, route lease issuance,
+  relay/rendezvous lease issuance, storage placement, and authority rotation.
+  Node update plans and bootstrap quorum pinning now have the first contract
+  hooks, but production still needs real M-of-N signers.
+- Add node-side verification of release metadata in addition to update-plan
+  verification; update-plan verification is now enforced by host-agent when a
+  pinned authority or pinned quorum descriptor exists.
+- Add update-cache mirror selection through fabric endpoint candidates instead
+  of a single HTTP origin.
+- Add signed endpoint-candidate epochs so peer directory gossip can survive API
+  replica loss.
+- Add revocation/fencing epochs for compromised authority keys, nodes, and
+  update artifacts.
+
+## Acceptance Rule
+
+The cluster is not production-trust-ready while a single `database_signer` can
+create authoritative cluster mutations. It may remain as a development bootstrap
+signer only when every signed payload clearly identifies it as bootstrap and
+nodes can be configured to reject it in production mode.
@@ -62,6 +62,88 @@ route and stream semantics.
 7. Mobile nodes are first-class nodes with stricter capability scoring.
 8. HTTP forwarding remains a compatibility and emergency fallback, not the
   primary high-speed data plane.
+9. There must be no single management service that can seize the fabric. Control,
+   storage, update distribution, route authority, and certificate authority are
+   fabric roles assigned to eligible nodes and protected by quorum signatures.
+   A web/API endpoint is only an access replica for a signed state log, not the
+   owner of cluster truth.
+10. IP addresses and DNS names are never authority. Nodes announce signed
+    endpoint candidates for every usable interface, public/reflexive address,
+    local segment address, reverse channel, and relay fallback. Neighbors select
+    the usable candidate locally by policy, reachability, latency, load, and
+    trust.
+
+## Distributed Control And Trust
+
+The target fabric behaves like a distributed network, not a client/server
+management product. The cluster has a replicated signed state log and many
+service replicas. Any node with the right role can serve API, storage, update,
+or route-coordinator duties, but no single replica can mutate cluster authority
+alone.
+
+Required trust model:
+
+- Every node has a long-lived node identity key and short-lived role
+  certificates. The node identity is cryptographic; the current IP, hostname,
+  NAT address, or container name is only an endpoint candidate.
+- Cluster authority is threshold-based. Root or high-risk changes require M-of-N
+  signatures from authorized control-authority nodes or hardware/offline
+  operator keys.
+- Role certificates are scoped by action, organization/tenant, service,
+  partition, validity window, and allowed delegation depth.
+- Update releases, route leases, peer-directory epochs, storage shard placement,
+  node approvals, role changes, and authority rotations are signed records in
+  the state log.
+- A node accepts control data only when it can verify signatures, epoch/fencing,
+  expiry, target cluster, target node or role scope, and monotonic generation.
+- A compromised API replica can withhold or delay data, but cannot forge updates,
+  route authority, new certificates, node roles, or cluster ownership.
+- Bootstrap may use a temporary centralized signer for development, but
+  production mode must mark that signer as non-authoritative unless quorum
+  signatures are present.
+
+Authority levels:
+
+- `root-authority`: rotates cluster root and quorum membership. Offline or
+  hardware-backed where possible. Rarely online.
+- `control-authority`: approves node join, role changes, policy epochs, and
+  route-authority membership through quorum.
+- `route-authority`: signs short-lived route leases and relay/rendezvous
+  assignments for a shard or partition.
+- `update-authority`: signs release metadata, compatibility, artifact hashes,
+  rollback windows, and staged rollout policy.
+- `storage-authority`: signs storage shard manifests, replication factors,
+  retention policy, and recovery epochs.
+- `observer-authority`: can sign telemetry observations only; it cannot mutate
+  routing, roles, updates, or secrets.
+
+Required anti-takeover controls:
+
+- No bearer admin token may grant fabric-wide mutation without a signed authority
+  envelope.
+- No node may accept unsigned update metadata or an artifact whose hash is not
+  signed by update-authority quorum.
+- No node may accept unsigned route changes for production channels.
+- No node may promote itself into control, storage, update, relay, or route
+  authority roles without a quorum-signed role certificate.
+- Authority and role certificates must have short validity, explicit scopes, and
+  revocation/fencing epochs.
+- Nodes must pin the cluster root/quorum descriptor and reject unexpected root
+  changes unless the old quorum signs the transition or an offline recovery
+  policy is invoked.
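
To make the M-of-N rule in the controls above concrete, here is a minimal sketch of local quorum verification with Ed25519 from the Go standard library. The `quorumDescriptor` type, the key IDs, and the helper names are assumptions made for illustration; they do not reflect the actual `rap.cluster_authority.quorum.v1` contract, which would also need to cover epochs, scopes, and expiry.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// quorumDescriptor is a hypothetical pinned descriptor: the set of allowed
// signer keys and the threshold M that must sign a payload.
type quorumDescriptor struct {
	Threshold int
	Signers   map[string]ed25519.PublicKey // key ID -> public key
}

// verifyQuorum reports whether at least Threshold distinct pinned signers
// produced a valid signature over payload. Signatures from unknown key IDs
// and invalid signatures are ignored rather than counted.
func verifyQuorum(d quorumDescriptor, payload []byte, sigs map[string][]byte) bool {
	if d.Threshold < 1 {
		return false
	}
	valid := 0
	for id, sig := range sigs {
		pub, ok := d.Signers[id]
		if !ok || len(pub) != ed25519.PublicKeySize {
			continue
		}
		if ed25519.Verify(pub, payload, sig) {
			valid++
		}
	}
	return valid >= d.Threshold
}

// newDemoQuorum generates n throwaway signers for the example. A real node
// would pin keys received during enrollment instead.
func newDemoQuorum(n, threshold int) (quorumDescriptor, map[string]ed25519.PrivateKey, error) {
	d := quorumDescriptor{Threshold: threshold, Signers: map[string]ed25519.PublicKey{}}
	privs := map[string]ed25519.PrivateKey{}
	for i := 0; i < n; i++ {
		pub, priv, err := ed25519.GenerateKey(rand.Reader)
		if err != nil {
			return d, nil, err
		}
		id := fmt.Sprintf("signer-%d", i)
		d.Signers[id] = pub
		privs[id] = priv
	}
	return d, privs, nil
}

// signWith signs payload with the named demo keys.
func signWith(privs map[string]ed25519.PrivateKey, payload []byte, ids ...string) map[string][]byte {
	out := map[string][]byte{}
	for _, id := range ids {
		if priv, ok := privs[id]; ok {
			out[id] = ed25519.Sign(priv, payload)
		}
	}
	return out
}

func main() {
	d, privs, err := newDemoQuorum(3, 2)
	if err != nil {
		panic(err)
	}
	plan := []byte(`{"schema_version":"example.update_plan"}`)
	fmt.Println("two signers:", verifyQuorum(d, plan, signWith(privs, plan, "signer-0", "signer-2")))
	fmt.Println("one signer:", verifyQuorum(d, plan, signWith(privs, plan, "signer-1")))
}
```

Because signatures are keyed by signer ID, one signer cannot be counted twice toward the threshold, which is the property that stops a single compromised key from satisfying the quorum alone.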
+
+Endpoint state is also distributed:
+
+- Nodes publish signed endpoint-candidate sets containing local interfaces,
+  public/reflexive STUN/ICE candidates, NAT group/local segment identifiers,
+  relay fallback, and passive reverse-channel availability.
+- Endpoint candidates expire quickly. When a node changes IP, it reconnects
+  passively to any reachable fabric peer or API replica and publishes a new
+  signed candidate epoch.
+- Peers keep using cached valid candidates and route leases while refreshing
+  from any reachable replica or neighbor gossip path.
+- Neighbor selection is local and latency/load-aware; the state log announces
+  facts and policy, not a forced single next hop.
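
As a rough illustration of local, latency/load-aware neighbor selection over expiring candidates, the sketch below drops expired candidates and picks the one with the lowest cost. The `endpointCandidate` fields and the cost formula (RTT plus a per-channel penalty) are assumptions for this example only; the real scorer also weighs policy, reachability, and trust.

```go
package main

import "fmt"

// endpointCandidate is a hypothetical, simplified view of a signed candidate.
type endpointCandidate struct {
	Endpoint       string
	Mode           string // e.g. direct_quic, lan_quic, relay_quic
	ExpiresAtUnix  int64  // candidate is unusable at or after this time
	RTTMillis      int    // last measured round-trip time
	ActiveChannels int    // current load on this candidate
}

// pickCandidate returns the unexpired candidate with the lowest cost, where
// cost = RTTMillis + penaltyMillis*ActiveChannels. The formula is illustrative.
// The boolean is false when every candidate has expired.
func pickCandidate(cands []endpointCandidate, nowUnix int64, penaltyMillis int) (endpointCandidate, bool) {
	var best endpointCandidate
	bestCost := 0
	found := false
	for _, c := range cands {
		if c.ExpiresAtUnix <= nowUnix {
			continue // expired candidates are never selected
		}
		cost := c.RTTMillis + penaltyMillis*c.ActiveChannels
		if !found || cost < bestCost {
			best, bestCost, found = c, cost, true
		}
	}
	return best, found
}

func main() {
	cands := []endpointCandidate{
		{Endpoint: "quic://203.0.113.10:18080", Mode: "direct_quic", ExpiresAtUnix: 100, RTTMillis: 5, ActiveChannels: 10},
		{Endpoint: "quic://192.168.200.85:18080", Mode: "lan_quic", ExpiresAtUnix: 100, RTTMillis: 8, ActiveChannels: 0},
		{Endpoint: "quic://198.51.100.7:18080", Mode: "relay_quic", ExpiresAtUnix: 50, RTTMillis: 1, ActiveChannels: 0},
	}
	if c, ok := pickCandidate(cands, 60, 1); ok {
		fmt.Println("selected:", c.Mode, c.Endpoint)
	}
}
```

The load penalty is what lets an idle path with slightly higher latency win over a busy shortest path, so the shortest route is not saturated while others sit idle.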

 ## Node Roles

@@ -0,0 +1,845 @@
+# Fabric-First Transport And Stress Plan
+
+Status: fabric-first implementation baseline is active. QUIC-only transport,
+route planning, runtime reroute/failover, pressure accounting, shared-host
+stress gates, 1000-channel load, failure/degradation gates, and a 30-minute
+real-byte soak are implemented and verified. Remaining work is wider real
+topology coverage as the cluster grows.
+
+This project is now fabric-first. Work on service payloads, service adapter
+expansion, and Android VPN transport is paused until the fabric transport layer
+is complete and proven under real load.
+
+## Goal
+
+The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes
+may have public addresses, sit behind NAT, or represent a whole local segment
+behind one NAT. The fabric must expose a single logical transport layer where
+nodes can reach each other directly, through local segment paths, through
+passive outbound tunnels, or through relay hops without changing the data-plane
+protocol.
+
+QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and
+ICE-selected paths are route modes inside the same QUIC fabric, not alternative
+transports.
+
+The fabric must not depend on one management service for authority. API,
+storage, update-cache, route-coordinator, observer, and authority duties are
+roles inside the mesh. A reachable API endpoint can distribute signed state, but
+it cannot be the source of truth by itself. Nodes accept control data,
+configuration, route leases, update plans, and role changes only when the
+signatures, quorum rules, scopes, epochs, and expiry windows verify locally.
+
+## Required Fabric Behavior
+
+- Address channels by `node_id`, `pool_id`, or service target, not by raw IP.
+- Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC,
+  relay QUIC, and future ICE-derived QUIC paths.
+- Treat DNS names such as web/admin/API domains as service endpoints only, not
+  node identity or fabric authority.
+- Require node-published endpoint candidates to include explicit `host:port`,
+  reachability, connectivity mode, NAT/local-segment metadata, source, and
+  freshness.
+- Prefer local segment paths for nodes that share a NAT/local network.
+- Keep outbound passive QUIC control/data adjacencies from NATed nodes to
+  reachable public or relay nodes.
+- Build logical channels over shared QUIC adjacencies instead of opening one
+  physical QUIC connection per channel.
+- Maintain primary, warm standby, and fallback route sets per channel.
+- Rebuild a channel when an intermediate hop fails.
+- Switch to another pool member when the target is a pool and the current
+  endpoint fails.
+- Reroute slow channels when a faster path exists and the reroute will not harm
+  aggregate fabric throughput.
+- Spread channels across available routes so the shortest path is not saturated
+  while other nodes are idle.
+- Isolate channels with per-channel flow control, traffic classes, backpressure,
+  quotas, and fairness scheduling.
+- Report per-node, per-link, per-route, and per-channel load and failure causes.
+
+## Service Channel Boundary
+
+The fabric is the only component that builds and maintains transport channels.
+VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications
+above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a
+transport concern, open fallback transports, or implement route repair.
+
+Every service starts by submitting a fabric service channel request:
+
+```json
+{
+  "schema_version": "rap.fabric_service_channel_request.v1",
+  "channel_id": "vpn-session-or-service-session-id",
+  "source_role": "vpn-client | rdp-client | service-adapter",
+  "service_class": "vpn_packets | rdp | ssh | file_transfer | web",
+  "target": {
+    "kind": "pool",
+    "pool_ids": ["home-ipv4"],
+    "service_role": "ipv4-egress"
+  },
+  "traffic": {
+    "mode": "duplex",
+    "application_protocol_agnostic": true,
+    "flow_distribution": "latency_and_load_aware"
+  },
+  "resilience": {
+    "min_active_paths": 1,
+    "warm_standby_paths": 1,
+    "failover": "pool_member_or_next_authorized_pool",
+    "reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
+  }
+}
+```
+
+The fabric responds with a signed route bundle containing a short-lived
+`rap.fabric_route_lease.v1`. The lease names the target pool, primary path,
+warm standby paths, multipath hints, and rebuild policy. Physical endpoint
+candidates are visible only to the fabric runtime as lease material; service
+adapters do not rank, pin, or fail over endpoints themselves. A service adapter
+receives only a duplex channel handle and service metadata:
+
+- Android VPN: TUN packet reader/writer only.
+- `ipv4-egress`: NAT/ordinary IPv4 exit only.
+- RDP: protocol/session adapter only; server address, protocol, credentials,
+  rendering, and clipboard are RDP service metadata, not fabric routing.
+
+Temporary compatibility fields such as `exit_candidates` may exist only inside
+the fabric route bundle consumed by the fabric runtime. Service code must treat
+them as opaque and must not schedule routes from them.
+
+The VPN client runtime accepts only `fabric_service_channel_request` plus
+`fabric_route_bundle.route_lease`. The Android service may keep a deprecated
+diagnostic endpoint cache, but packet routing must come from the lease. If a
+path fails, slows down, or its target pool member dies, the fabric lease/rebuild
+policy is the authority; the VPN service continues writing packets to the
+channel and does not switch protocols.
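
For orientation, a service adapter or test harness could decode and sanity-check the request shown earlier in this section along these lines. This is a sketch under assumptions: the Go type names and the specific validation rules are hypothetical, and only the JSON field names and the schema string come from the example above. The `traffic` object is left out for brevity; `encoding/json` ignores fields the struct does not declare.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const requestSchema = "rap.fabric_service_channel_request.v1"

// The types below mirror the JSON example; their Go names are illustrative.
type channelTarget struct {
	Kind        string   `json:"kind"`
	PoolIDs     []string `json:"pool_ids"`
	ServiceRole string   `json:"service_role"`
}

type channelResilience struct {
	MinActivePaths   int      `json:"min_active_paths"`
	WarmStandbyPaths int      `json:"warm_standby_paths"`
	Failover         string   `json:"failover"`
	RerouteOn        []string `json:"reroute_on"`
}

type serviceChannelRequest struct {
	SchemaVersion string            `json:"schema_version"`
	ChannelID     string            `json:"channel_id"`
	SourceRole    string            `json:"source_role"`
	ServiceClass  string            `json:"service_class"`
	Target        channelTarget     `json:"target"`
	Resilience    channelResilience `json:"resilience"`
}

// parseServiceChannelRequest decodes a request and applies a few assumed
// sanity checks. Note that no endpoint or transport field is accepted here:
// the service names a target and a policy, never a raw QUIC address.
func parseServiceChannelRequest(raw []byte) (*serviceChannelRequest, error) {
	var req serviceChannelRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	if req.SchemaVersion != requestSchema {
		return nil, fmt.Errorf("unsupported schema_version %q", req.SchemaVersion)
	}
	if req.ChannelID == "" {
		return nil, fmt.Errorf("channel_id is required")
	}
	if req.Target.Kind == "pool" && len(req.Target.PoolIDs) == 0 {
		return nil, fmt.Errorf("pool target needs at least one pool_id")
	}
	if req.Resilience.MinActivePaths < 1 {
		return nil, fmt.Errorf("min_active_paths must be at least 1")
	}
	return &req, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "rap.fabric_service_channel_request.v1",
	  "channel_id": "vpn-session-1",
	  "source_role": "vpn-client",
	  "service_class": "vpn_packets",
	  "target": {"kind": "pool", "pool_ids": ["home-ipv4"], "service_role": "ipv4-egress"},
	  "resilience": {"min_active_paths": 1, "warm_standby_paths": 1,
	    "failover": "pool_member_or_next_authorized_pool", "reroute_on": ["route_failure"]}
	}`)
	req, err := parseServiceChannelRequest(raw)
	fmt.Println(req.Target.PoolIDs, err)
}
```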
+
+## Distributed Authority Requirements
+
+- No single control-plane/API/storage/update node can mutate the cluster alone.
+- Cluster root and high-risk role changes require threshold signatures from
+  authorized control-authority keys.
+- Update releases require signed metadata, signed artifact hashes, compatibility
+  constraints, rollout scope, and rollback windows; mirrors may serve bytes but
+  cannot change what is trusted.
+- Route leases, relay leases, rendezvous assignments, peer-directory epochs, and
+  endpoint candidate epochs are signed and short-lived.
+- Nodes cache the last valid signed state and continue routing through peers,
+  relay fallbacks, and passive reverse channels when API replicas are down.
+- A compromised replica may delay or omit data, but must not be able to forge
+  role assignment, route authority, update authority, storage placement, or node
+  ownership.
+- Development `database_signer` mode is not production authority. Production
+  acceptance requires quorum-signed envelopes for node join, role mutation,
+  mesh config, route leases, update plans, and release metadata.
+
+## Implementation Layers
+
+1. Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse
+   tunnels, relay candidates.
+2. Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity,
+   health, and pressure metrics.
+3. Routing layer: latency-aware and load-aware route sets with relay fallback
+   and pool failover.
+4. Channel layer: millions of logical channels with independent lifecycle,
+   flow control, and statistics.
+
+## Stress Requirements
+
+The fabric cannot be accepted on ping tests alone. It must pass real
+byte-transfer load:
+
+- 1000 concurrent streams from different source nodes to different destination
+  nodes.
+- Mixed long-lived and short-lived channels.
+- Aggressive create/delete churn.
+- Many-to-one, one-to-many, and many-to-many traffic.
+- Direct, LAN, relay, multi-hop, and reverse tunnel paths.
+- Endpoint pool failover under load.
+- Intermediate relay/node failure and route rebuild under load.
+- Induced latency, packet loss, bandwidth caps, and route saturation.
+- Control/interactive traffic surviving bulk traffic.
+- No sustained overload of one path when alternatives exist.
+- No goroutine, memory, stream, or file descriptor leak after churn.
+
+## Required Stress Report
+
+Every stress run must produce machine-readable JSON with:
+
+- topology and scenario profile;
+- channel setup/teardown counts and latency;
+- total and per-channel throughput;
+- per-node and per-route capacity pressure;
+- p50/p95/p99 latency where measured;
+- backpressure, rejection, and queue-depth counters;
+- route switch and failover events;
+- target pool failover events;
+- QUIC connection and logical channel counts;
+- final pass/fail verdict against SLO thresholds.
+
+The first executable harness is `agents/rap-node-agent/cmd/fabric-loadtest`.
+It supports in-process multi-node QUIC targets, short logical channel churn,
+pool failover, target failure injection, and JSON reports.
+
+Example local pool-failover run:
+
+```powershell
+go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s
+```
+
+The local harness is not a replacement for distributed host testing. It is the
+first acceptance gate for protocol limits, channel lifecycle churn, pool
+failover semantics, and reporting shape before running the same workload across
+the shared test Docker host.
+
+Distributed shared-host smoke:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms
+```
+
+The distributed smoke builds/runs separate server and client containers on the
+shared Docker host, sends real QUIC fabric frames across the Docker network,
+kills one target node during load, and expects all channels assigned to that
+target to fail over to the remaining pool.
+
+The smoke summary includes the strict loadtest verdict plus `route_pressure`
+and `transport_snapshot`; the script fails when the client verdict is not
+`pass` and carries `verdict_reasons` into the thrown error.
+
+`-TuneUdpBuffers` applies runtime host sysctls through a privileged one-shot
+container before the run and records the observed values in the summary:
+`net.core.rmem_max`, `net.core.wmem_max`, `net.core.rmem_default`, and
+`net.core.wmem_default`.
+
+Degraded-target and latency-aware admission run:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80
+```
+
+This applies `tc netem` to one target, probes every target before mass channel
+placement, excludes targets above the RTT threshold, and reports per-target
+setup/duration percentiles. This is the first executable gate for
+latency-aware placement; live channel migration after mid-stream degradation is
+the next routing-layer gate.
+
+Mid-stream migration gate:
+
+```powershell
+powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30
+```
+
+This starts channels normally, applies `tc netem` after traffic is already in
+flight, and expects slow logical streams to continue their remaining bytes on a
+different target. The report exposes `migration_events`, `max_ack_ms`,
+`ack_p95_ms`, `ack_p99_ms`, `route_attempts_total`, `reroute_causes`, and
+per-target stats.
+
+Production fabric-core migration boundary:
+
+- `FabricChannelRouter` opens channels on the best route from a `FabricRouteSet`.
+- Live `FabricChannelObservation` values update counters and trigger reroute on
+  route failure, ACK latency threshold, or capacity pressure.
+- Reroutes switch route binding and pool target where applicable, increment
+  `RerouteCount`, and emit `FabricChannelRouteEvent`.
+- `MinRerouteInterval` provides hysteresis so a noisy path does not cause route
+  flapping.
+- `FabricChannelRuntime` binds the router to live QUIC fabric sessions for
+  reliable byte payloads: it opens the logical stream, sends frames, measures
+  ACK latency, reports observations to the router, and continues remaining
+  payloads on a rerouted QUIC route after connect failure or slow ACKs.
+- QUIC logical session close cancels the stream read side before closing the
+  write side, so high-churn short sessions release reader goroutines promptly
+  instead of waiting for stream read deadlines.
+- Server-side QUIC stream handlers close their write side when the handler
+  exits. This returns QUIC stream credit promptly during high-churn short
+  sessions and prevents the last worker window from stalling on stream open.
+- Production request/response forwarding now builds a `FabricRouteSet` from all
+  QUIC endpoint candidates for the next hop, sends the envelope over the chosen
+  QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
+  failure or response timeout.
+- The legacy HTTP production forward carrier has been removed from the mesh
+  runtime API. Production forwarding now exposes a single QUIC transport
+  implementation; HTTP handlers remain only as node-local API surfaces and test
+  harness entry points.
+- Production route choice includes live per-route active-channel pressure, so
+  concurrent forwarding requests can spread across equivalent QUIC candidates
+  instead of concentrating on the first/shortest route until it is saturated.
+- Production forwarding also keeps per-route health quarantine. A QUIC route
+  that fails connect or response is marked unhealthy for a bounded retry window,
+  skipped by subsequent channel scheduling, exposed in route-health snapshots,
+  and restored automatically after the retry window or a successful send.
+- `FabricRoutePressureTracker` provides shared active-channel accounting for
+  both production request/response forwarding and bulk `FabricChannelRuntime`
+  traffic, so different traffic surfaces can make route decisions against the
+  same live load signal.
+- Route pressure is observable through `FabricRoutePressureSnapshot`, including
+  current active channels, max active channels, total acquire/release counts,
+  and last acquired/released route IDs. Bulk runtime results and production
+  QUIC forwarding snapshots expose this data for stress reports.
+- `fabric-loadtest` reports route IDs per stream attempt, global
+  `route_pressure`, and per-target `max_active_channels`, so stress runs can
+  verify channel distribution and release accounting after churn.
+- `FabricRouteSetForPeerEndpointCandidates` converts QUIC endpoint candidates
+  into production route sets for direct, LAN, ICE/STUN-derived, reverse
+  outbound, and relay fallback modes. Non-QUIC candidates are rejected instead
+  of becoming alternate transports.
+- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
+  instead of collapsing to one address: operator/public QUIC, listener QUIC,
+  LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
+  `relay_quic` fallback. Candidate metadata carries `local_segment_id`,
+  `nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
+  `relay_endpoint` when configured.
+- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
+  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
+  health observations, latency, reliability, region, policy tags, and live
+  capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
+  candidates rather than alternate transports.
+- `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
+  selection can return only `QUICFabricTransport`; unsupported labels fail with
+  a QUIC-required error.
+- Explicit transport labels are authoritative. A legacy label such as `relay`
+  or `outbound_reverse` is rejected even when the endpoint string uses a
+  `quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
+- Node-agent config loading rejects legacy advertised transport labels and
+  HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
+  fabric endpoints. Bad endpoint posture fails before heartbeat publication.
+- Host-agent install/runtime validation rejects legacy mesh advertise transport
+  labels and HTTP/WebSocket advertise endpoints before they can be passed into a
+  node-agent Docker runtime.
+- JSON-advertised endpoint candidates and scoped synthetic config route
+  recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
+  seeds, and rendezvous leases reject legacy transport labels and
+  HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
+  entries.
+- Rendezvous relay leases and peer-connection intents now use `relay_quic` as
+  the transport label. `relay_control` remains only a telemetry/control-state
+  name for rendezvous admission counters, not a data-plane transport option.
+- Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint
+  candidates are probed with QUIC session setup, pinned certificate metadata is
+  honored, and HTTP/WebSocket endpoint schemes are rejected instead of being
+  used as peer health transport.
+- Node-agent synthetic runtime no longer installs an HTTP peer transport as an
+  inter-node carrier, and the shared mesh runtime package no longer exports an
+  HTTP peer transport implementation. Any HTTP synthetic motion is confined to
+  explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
+  gates.
+- Control-plane and debug JSON mesh config loading is validated after
+  conversion into runtime structures. Peer endpoint candidates, recovery seeds,
+  rendezvous leases, and selected relay endpoints in route decisions must use
+  QUIC labels/endpoints before they can update node runtime state.
+- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
+  in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
+  rendezvous leases.
+- The old fabric-session WebSocket endpoint is no longer exposed by
+  `FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
+  and is not part of the node-agent fabric transport surface.
+- A path between nodes in the same local segment or the same NAT group is
+  treated as a LAN route by the planner, so an entire part of the cluster behind
+  one NAT can prefer private addresses between its own nodes while still
+  maintaining outbound/relay visibility to the rest of the fabric.
+- Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
+  route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
+  totals by transport label, route pressure, QUIC listener state, goroutines,
+  heap usage, and the next recommended soak gate.
+- `FabricOverlayTransport` is the generic service-neutral send facade over
+  route sets, `FabricChannelRuntime`, shared route pressure, and QUIC sessions.
+  New traffic classes should enter the fabric through this layer or an
+  equivalent runtime integration, not through HTTP/WebSocket fallbacks.
+- `FabricChannelRuntime` uses the same route health quarantine as production
+  forwarding. Connect failures, stream send failures, and missing ACKs mark a
+  route unhealthy for a bounded retry window, so later channels for any traffic
+  class avoid that route until it recovers.
+- `FabricOverlayTransport` exposes route pressure and route health snapshots,
+  and node heartbeat runtime metadata reports production route health plus the
+  current quarantined route count.
+- Scheduler resource guardrails include `HardMaxRoutePressure`: when enabled,
+  a route whose projected active-channel pressure exceeds the threshold is not
+  admitted. This makes overload prevention enforceable in route choice rather
+  than only observable after the fact.
+- The loadtest verdict fails on route-pressure leaks, acquire/release mismatch,
+  missing acquire accounting, active channels above configured concurrency, or
+  target distribution collapse/skew when multiple targets are healthy.
+- Continuous soak aggregation is bounded: `fabric-loadtest` keeps exact
+  counters, per-target totals, route-mode counts, error/reroute totals, and
+  bounded latency samples, while `stream_samples` is capped to diagnostic
+  examples. Long 30-120 minute runs should not retain one result object per
+  logical channel.
+- `fabric-loadtest` also keeps bounded `error_samples`, so high-volume churn
+  reports preserve representative failed logical channels even when the first
+  retained `stream_samples` are all successful.
+- Mixed topology verdicts require route-mode coverage when at least four
+  healthy targets are present. A `mixed-public-nat-lan-relay` or
+  `nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
+  `reverse_quic`, and `relay_quic`.
+- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
+  `outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
+  `wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
+  violation even if payload delivery succeeds.
+- Healthy multi-target verdicts check both stream distribution and byte
+  distribution. This prevents a run from passing with equal channel counts but
+  most bulk bytes concentrated on one target or route.
+- Healthy multi-target verdicts also check route-pressure distribution through
+  per-route `max_active` values. A run fails if live concurrent channel load
+  collapses onto one target/route while alternatives are healthy.
+- Successful logical channels must receive one ACK per transmitted data frame.
+  `fabric-loadtest` reports `ack_mismatched_streams`, per-target
+  `acks_received`, and fails verdict when any stream is marked successful with
+  fewer ACKs than sent frames.
+- ACK payloads carry the SHA-256 checksum of the received data-frame payload.
+  `fabric-loadtest` validates the checksum for every ACK and fails verdict with
+  `ack_integrity_errors` when the acknowledged bytes do not match the sent
+  payload.
+- Failover accounting separates `abandoned_frames` from true ACK mismatch. A
+  frame sent on a route that dies before ACK is counted as abandoned and the
+  unacknowledged byte range is retransmitted on the next pool member; verdict
+  still fails when non-abandoned frames are missing ACKs.
+- Loadtest data frames use deterministic per-frame payloads derived from stream
+  index, logical stream ID, sequence, and byte offset. This makes checksum ACKs
+  validate each frame identity instead of repeatedly validating one shared
+  buffer pattern.
+- Mixed bulk/control stress is supported with `-control-every`,
+  `-control-bytes-per-stream`, and `-max-control-ack-p95-ms`. Reports include
+  `control_streams`, `bulk_streams`, `control_ack_p95_ms`, and
+  `bulk_ack_p95_ms`; verdict fails when control ACK p95 exceeds the configured
+  SLO.
+- Verified shared-host mixed smoke:
+  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
+  The run produced 40/40 successful streams, 8 control streams,
+  `control_ack_p95_ms=1`, `bulk_ack_p95_ms=2`,
+  `route_pressure.active_total=0`, and matching acquire/release counts.
+- Verified shared-host mixed failover stress:
+  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
+  Latest run `fabric-loadtest-20260516-160751` produced 1000/1000 successful
+  streams, 250 failover events after the planned target kill, 50 control
+  streams, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=6`, `ack_p95_ms=6`,
+  `ack_p99_ms=8`, `route_attempts_total=1250`,
+  `route_pressure.active_total=0`, `max_active_total=128`, and matching
+  acquire/release counts. Full JSON artifacts are written under
+  `artifacts/fabric-loadtest`.
+- Verified shared-host mixed degradation/migration stress:
+  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
+  The run produced 200/200 successful streams, 9 migration events,
+  20 control streams, `control_ack_p95_ms=2`, `bulk_ack_p95_ms=7`,
+  `route_pressure.active_total=0`, `max_active_total=32`, and matching
+  acquire/release counts.
+- Latest shared-host degradation/migration gate:
+  `fabric-loadtest-20260516-160710` ran 160 streams at 32 concurrency with
+  4 MiB bulk streams and induced 180 ms delay plus 0.5% loss after 50 ms. It
+  produced 160/160 successful streams, 12 slow-ACK migrations,
+  degraded-target quarantine, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=180`,
+  `route_pressure.active_total=0`, `max_active_total=32`, and matching
+  acquire/release counts.
+- Short shared-host soak gate:
+  `fabric-loadtest-20260516-160943` with `-Duration 45s`, 1200 streams,
+  96 concurrency, four healthy targets, and mixed control/bulk traffic produced
+  1200/1200 successful streams, even 300/300/300/300 target distribution,
+  `channel_opens=1200`, `channel_closes=1200`, `channel_leaks=0`,
+  `control_ack_p95_ms=4`, `ack_p95_ms=5`, `ack_p99_ms=8`,
+  `route_pressure.active_total=0`, `max_active_total=96`, and matching
+  acquire/release counts.
+- Continuous soak mode is now explicit: add `-Soak -Duration 30m` or
+  `-Soak -Duration 120m` to the Docker runner. In soak mode workers keep
+  creating and closing logical channels until the duration expires, instead of
+  stopping after a fixed stream list. This is the required gate for memory,
+  goroutine, file descriptor, QUIC stream, and route-pressure stability.
+- Soak duration stops new logical channel creation but does not cancel channels
+  already in flight. In-flight channels drain under their per-channel
+  `-StreamTimeout`; the outer `-ClientTimeout` remains the hard scenario
+  guardrail. This prevents the final active window from being counted as
+  failed streams just because the soak timer expired.
+- Recommended real-topology soak command:
+  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
+- Soak reports include `resource_samples` and `resource_summary` with
+  goroutine start/end/max/delta, heap allocation start/end/max/delta, heap
+  objects, open file descriptor start/end/max/delta, GC delta, max active QUIC
+  streams, and max active route load.
+  Optional verdict gates `-MaxGoroutineDelta` and `-MaxHeapDeltaMB` fail the
+  run if resource drift exceeds the configured budget.
+- Optional file descriptor verdict gates `-MaxOpenFDDelta` and `-MaxOpenFDs`
+  are passed through the Docker runner to `fabric-loadtest` as
+  `-max-open-fd-delta` and `-max-open-fds`. On Linux containers these read
+  `/proc/self/fd` and fail the run if descriptor count drifts or peaks beyond
+  the configured budget.
+- Optional throughput SLO gate `-MinThroughputMbps` is passed through the Docker
+  runner to `fabric-loadtest` as `-min-throughput-mbps`. It fails verdict when
+  useful data-plane throughput falls below the configured Mbps floor.
+- Optional short-session churn SLO gate `-MinChannelChurnPerSec` is passed
+  through the Docker runner to `fabric-loadtest` as
+  `-min-channel-churn-per-sec`. It fails verdict when logical channel
+  open/close throughput falls below the configured channel-per-second floor.
+- Each logical channel has a per-channel timeout through `-StreamTimeout`
+  in the Docker runner and `-stream-timeout` in `fabric-loadtest`. This keeps a
+  wedged channel from holding a worker slot until the whole client run times
+  out, preserving channel isolation under churn.
+- Each data frame has an ACK timeout through `-AckTimeout` in the Docker runner
+  and `-ack-timeout` in `fabric-loadtest`. A missing ACK triggers reroute/pool
+  retry without waiting for the full channel timeout.
+- Optional overall ACK latency gates `-MaxAckP95Ms` and `-MaxAckP99Ms` are
+  passed through the Docker runner to `fabric-loadtest` as
+  `-max-ack-p95-ms` and `-max-ack-p99-ms`. They fail healthy runs when
+  aggregate data-plane ACK latency exceeds the configured SLO, independently
+  of slow-route migration thresholds.
+- Optional per-target ACK latency gate `-MaxTargetAckMs` is passed through the
+  Docker runner to `fabric-loadtest` as `-max-target-ack-ms`. It fails healthy
+  runs when any target route reports a `target_stats[*].max_ack_ms` above the
+  configured SLO.
+- Optional channel setup latency gates `-MaxSetupP95Ms` and `-MaxSetupP99Ms`
+  are passed through the Docker runner to `fabric-loadtest` as
+  `-max-setup-p95-ms` and `-max-setup-p99-ms`. They fail healthy runs when
+  logical channel open/setup latency exceeds the configured SLO before payload
+  transfer starts.
+- Optional reroute latency gates `-MaxRerouteP95Ms` and `-MaxRerouteP99Ms`
+  are passed through the Docker runner to `fabric-loadtest` as
+  `-max-reroute-p95-ms` and `-max-reroute-p99-ms`. They measure repeat channel
+  setup latency after pool failover or slow-route migration and fail the run
+  when route rebuild exceeds the configured SLO.
+- Docker shared-host summaries also include `container_stats` from
+  `docker stats --no-stream` for each fabric server/client container that is
+  still running at the end of the scenario. This records CPU percent, memory
+  usage, memory percent, network IO, block IO, and PID count per node before
+  cleanup.
+- Long soak runs can add `-ContainerStatsSampleInterval 10s` to collect
+  periodic Docker container stats while traffic is in flight. The runner writes
+  samples to `container_stats_samples_path`, includes
+  `container_stats_samples_count` and `container_stats_sample_summary`, and
+  records per-container memory/PID start, end, max, and delta values.
+- Optional container resource verdict gates `-MaxContainerMemoryMiB` and
+  `-MaxContainerPids` fail the Docker scenario when any running fabric
+  container exceeds the configured memory or PID budget at the final snapshot
+  or at any periodic sample peak.
+- Verified short continuous soak:
+  `fabric-loadtest-20260516-163206` used `-Soak -Duration 20s`,
+  mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk
+  traffic. It produced 4000/4000 successful logical channels,
+  `channel_opens=4035`, `channel_closes=4035`, `channel_leaks=0`,
+  `route_pressure.active_total=0`, `max_active_total=32`,
+  `control_ack_p95_ms=2`, `ack_p95_ms=4`, resource sample count 12,
+  goroutine delta -18, max active streams 32, max active route load 32, and
+  matching acquire/release counts.
+- Verified 60-second high-churn continuous soak with graceful drain:
+  `fabric-loadtest-20260516-174505` rebuilt the Docker image after changing
+  soak duration to stop generation and let in-flight channels drain. The
+  4-node mixed-topology run used 128 concurrency, `-Duration 60s`,
+  `-StreamTimeout 15s`, periodic resource/container sampling, mixed
+  control/bulk traffic, and throughput and churn SLOs. It produced 438740/438740
+  successful logical channels, `channel_churn_per_sec=7310`,
+  `throughput_bps=3473632858`, `ack_p95_ms=5`, `ack_p99_ms=6`,
+  `control_ack_p95_ms=3`, `channel_opens=438740`,
+  `channel_closes=438740`, `channel_leaks=0`, `open_failures=0`,
+  `goroutines_delta=-1`, `open_fds_delta=4`, all four route modes, clean
+  route-pressure accounting, and verdict `pass`.
+- Verified pool failover soak with ACK timeout and abandoned-frame accounting:
+  `fabric-loadtest-20260516-175622` rebuilt the Docker image with ACK timeout,
+  target quarantine, and abandoned-frame accounting, then killed target 0 after
+  3 seconds during a 30-second mixed-topology soak. It produced 136194/136194
+  successful logical channels, `failed_streams=0`, `failover_events=82`,
+  `abandoned_frames=75`, `ack_mismatched_streams=0`,
+  `ack_integrity_errors=0`, `channel_churn_per_sec=4543`,
+  `throughput_bps=2156155314`, `reroute_latency_p99_ms=9`,
+  `channel_leaks=0`, clean route-pressure accounting, and verdict `pass`.
+- Verified container stats gate:
+  `fabric-loadtest-20260516-163854` produced a passing 2-node mixed-topology
+  smoke with `-MaxContainerMemoryMiB 128 -MaxContainerPids 64` and included
+  `container_stats` for both fabric server containers, with memory usage around
+  4-6 MiB per server and server PID counts 7-9. A negative control run with
+  `-MaxContainerMemoryMiB 1` failed as expected with
+  `container_memory_mib=...>1` verdict reasons.
+- Verified periodic container stats sampling:
+  `fabric-loadtest-20260516-164259` used `-Soak -Duration 8s`,
+  `-ContainerStatsSampleInterval 2s`, mixed public/NAT/LAN/relay profile, and
+  `-MaxContainerMemoryMiB 128 -MaxContainerPids 64`. It produced 2000/2000
+  successful logical channels, `channel_opens=2009`, `channel_closes=2009`,
+  `channel_leaks=0`, even 1000/1000 target distribution, 400 control streams,
+  `ack_p95_ms=1`, `route_pressure.active_total=0`, matching acquire/release
+  counts, final server memory around 12-13 MiB, and periodic sample peaks for
+  the client and both servers in
+  `fabric-loadtest-20260516-164259-container-stats-samples.json`.
+- Verified high-churn goroutine drain after QUIC close cancellation:
+  `fabric-loadtest-20260516-164502` rebuilt the Docker image and repeated the
+  2-node mixed-topology continuous soak with `-MaxGoroutineDelta 64`,
+  `-MaxHeapDeltaMB 128`, `-ContainerStatsSampleInterval 2s`,
+  `-MaxContainerMemoryMiB 128`, and `-MaxContainerPids 64`. It produced
+  2000/2000 successful logical channels, `channel_opens=2009`,
+  `channel_closes=2009`, `channel_leaks=0`, even 1000/1000 target
+  distribution, `control_ack_p95_ms=1`, `ack_p95_ms=1`,
+  `route_pressure.active_total=0`, matching acquire/release counts, and
+  `goroutines_delta=-2`.
+- Verified file descriptor gate:
+  `fabric-loadtest-20260516-164725` rebuilt the Docker image and repeated the
+  2-node mixed-topology continuous soak with `-MaxOpenFDDelta 8` and
+  `-MaxOpenFDs 128` in addition to goroutine, heap, container memory, and PID
+  gates. It produced 2000/2000 successful logical channels,
+  `channel_leaks=0`, `route_pressure.active_total=0`, matching
+  acquire/release counts, `open_fds_start=15`, `open_fds_end=9`,
+  `open_fds_max=19`, and `open_fds_delta=-6`.
+- Verified bounded soak aggregation:
+  `fabric-loadtest-20260516-165051` rebuilt the Docker image after changing
+  soak result storage to an aggregate collector. The 2-node mixed-topology soak
+  produced 2000/2000 successful logical channels, even 1000/1000 target
+  distribution, `channel_leaks=0`, `route_pressure.active_total=0`, matching
+  acquire/release counts, `goroutines_delta=0`, `open_fds_delta=1`, verdict
+  `pass`, and only 25 retained `stream_samples` in the full report.
+- Verified mixed route-mode coverage gate:
+  `fabric-loadtest-20260516-165308` rebuilt the Docker image with the route
+  coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000
+  successful logical channels, even 1000/1000/1000/1000 target distribution,
+  `channel_leaks=0`, `route_pressure.active_total=0`, matching
+  acquire/release counts, and observed all required route modes:
+  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`.
+- Verified ACK integrity gate:
+  `fabric-loadtest-20260516-165544` rebuilt the Docker image with the ACK
+  mismatch verdict and repeated the 4-node mixed-topology soak. It produced
+  4000/4000 successful logical channels, `ack_mismatched_streams=0`, per-target
+  `frames_sent=6600` and `acks_received=6600`, all four route modes, clean
+  channel/route pressure accounting, and verdict `pass`.
+- Verified ACK checksum integrity gate:
+  `fabric-loadtest-20260516-165926` rebuilt the Docker image with ACK payload
+  checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000
+  successful logical channels, `ack_mismatched_streams=0`,
+  `ack_integrity_errors=0`, 26400 total data frames, 26400 ACKs, all four route
+  modes, clean channel/route pressure accounting, and verdict `pass`.
+- Verified unique per-frame payload integrity:
+  `fabric-loadtest-20260516-170150` rebuilt the Docker image after switching
+  loadtest traffic from a shared payload buffer to deterministic per-frame
+  payloads. The 4-node mixed-topology soak produced 4000/4000 successful
+  logical channels, `ack_mismatched_streams=0`, `ack_integrity_errors=0`, 26400
+  data frames, 26400 ACKs, all four route modes, clean channel/route pressure
+  accounting, and verdict `pass`.
+- Verified throughput SLO gate:
+  `fabric-loadtest-20260516-170512` rebuilt the Docker image with
+  `-MinThroughputMbps 100` and repeated the 4-node mixed-topology soak. It
+  produced 4000/4000 successful logical channels, `throughput_bps=212479668`,
+  `ack_mismatched_streams=0`, `ack_integrity_errors=0`, all four route modes,
+  clean channel/route pressure accounting, and verdict `pass`.
+- Verified short-session churn SLO gate:
+  `fabric-loadtest-20260516-173320` rebuilt the Docker image with
+  `-MinChannelChurnPerSec 200`, then ran a 4-node mixed-topology high-churn
+  short-session smoke with 1000 one-frame logical channels. It produced
+  1000/1000 successful logical channels, `channel_churn_per_sec=9478`,
+  `channel_opens=1000`, `channel_closes=1000`, `channel_leaks=0`, even target
+  stream distribution, all four route modes, clean route-pressure accounting,
+  and verdict `pass`.
+- Verified high-churn QUIC stream-credit regression gate:
+  `fabric-loadtest-20260516-174046` rebuilt the Docker image after closing the
+  server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst
+  of 5000 one-frame short logical channels at 128 concurrency with
+  `-MinChannelChurnPerSec 300` and `-StreamTimeout 15s`. It produced 5000/5000
+  successful logical channels, `channel_churn_per_sec=21124`,
+  `channel_opens=5000`, `channel_closes=5000`, `channel_leaks=0`,
+  `open_failures=0`, `ack_mismatched_streams=0`, `ack_integrity_errors=0`,
+  even 1250/1250/1250/1250 target distribution, all four route modes, clean
+  route-pressure accounting, and verdict `pass`.
+- Verified target byte distribution gate:
+  `fabric-loadtest-20260516-170731` rebuilt the Docker image with byte
+  distribution verdicts and repeated the 4-node mixed-topology soak. It
+  produced 4000/4000 successful logical channels, even 1000/1000/1000/1000
+  stream distribution, exactly 53,248,000 bytes per target,
+  `throughput_bps=212488911`, all four route modes, clean channel/route
+  pressure accounting, and verdict `pass`.
+- Verified overall ACK latency SLO gate:
+  `fabric-loadtest-20260516-171001` rebuilt the Docker image with
+  `-MaxAckP95Ms 20` and `-MaxAckP99Ms 50` and repeated the 4-node
+  mixed-topology soak. It produced 4000/4000 successful logical channels,
+  `ack_p95_ms=2`, `ack_p99_ms=3`, `ack_mismatched_streams=0`,
+  `ack_integrity_errors=0`, all four route modes, clean channel/route pressure
+  accounting, and verdict `pass`.
+- Verified route-pressure distribution gate:
+  `fabric-loadtest-20260516-171216` rebuilt the Docker image with
+  route-pressure distribution verdicts and repeated the 4-node mixed-topology
+  soak. It produced 4000/4000 successful logical channels, even target stream
+  and byte distribution, per-route `max_active` values of 13/12/13/13,
+  `route_pressure.active_total=0`, matching acquire/release counts, and
+  verdict `pass`.
+- Verified per-target ACK latency gate:
+  `fabric-loadtest-20260516-171454` rebuilt the Docker image with
+  `-MaxTargetAckMs 20` and repeated the 4-node mixed-topology soak. It produced
+  4000/4000 successful logical channels, per-target `max_ack_ms` values of
+  6/5/7/9, `ack_p95_ms=3`, `ack_p99_ms=5`, all four route modes, clean
+  channel/route pressure accounting, and verdict `pass`.
+- Verified channel setup latency SLO gate:
+  `fabric-loadtest-20260516-171937` rebuilt the Docker image with
+  `-MaxSetupP95Ms 20` and `-MaxSetupP99Ms 50`, then repeated the 4-node
+  mixed-topology soak with ACK, throughput, FD, goroutine, heap, container
+  memory, and PID gates enabled. It produced 4000/4000 successful logical
+  channels, `setup_latency_p95_ms=0`, `ack_p95_ms=3`, `ack_p99_ms=3`,
+  `throughput_bps=212572631`, even target stream/byte distribution, all four
+  route modes, clean channel/route pressure accounting, and verdict `pass`.
+- Verified reroute latency SLO gate:
+  `fabric-loadtest-20260516-172652` rebuilt the Docker image with
+  `-MaxRerouteP95Ms 100` and `-MaxRerouteP99Ms 200`, then ran a 4-node
+  mixed-topology pool-failover stress with target 0 killed during load. It
+  produced 400/400 successful logical channels, 100 pool failover events,
+  `reroute_latency_p95_ms=1`, `reroute_latency_p99_ms=2`,
+  `route_attempts_total=500`, `ack_p95_ms=6`, `ack_p99_ms=8`,
+  `throughput_bps=3863633075`, clean channel/route pressure accounting, and
+  verdict `pass`.
+- Mixed topology profile gate:
+  `fabric-loadtest-20260516-162037` used
+  `-TopologyProfile mixed-public-nat-lan-relay` with 400 streams, 64
+  concurrency, four targets, and mixed control/bulk traffic. It produced
+  400/400 successful streams, 100 streams per target, route-mode reporting for
+  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`,
+  `control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
+  `route_pressure.active_total=0`, and matching acquire/release counts.
+- Verified strict QUIC route-mode gate:
+  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
+  route-mode verdicts and ran the 4-node mixed topology profile. It produced
+  400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
+  `reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
+  `ack_integrity_errors=0`, `channel_leaks=0`, clean route-pressure accounting,
+  and verdict `pass`.
+- `fabric-loadtest` now also treats the configured target list as part of the
+  acceptance surface: every target must be `quic://...`. Empty targets, bare
+  `host:port`, HTTP(S), and WS/WSS targets produce a failing
+  `non_quic_targets=...` verdict reason. Client mode also rejects those targets
+  before dialing, so a bad stress command cannot accidentally exercise a
+  non-QUIC path and only discover it after the run.
+- The shared Docker runner `scripts/fabric/fabric-loadtest-docker-smoke.ps1`
+  now has matching guardrails: it refuses local Docker Desktop contexts such as
+  `default`/`desktop-linux` and validates generated targets before launch so the
+  real-load smoke remains tied to the shared test Docker host and QUIC-only
+  endpoints.
+- Shared Docker validation after those guardrails:
+  `fabric-loadtest-20260516-190049` rebuilt the Docker image on `test-docker`
+  and ran 4 QUIC targets with 120 streams. It produced 120/120 successful
+  logical channels, `ack_p95_ms=3`, `setup_latency_p95_ms=21`, clean
+  open/close and route-pressure accounting, QUIC-only targets, and verdict
+  `pass`.
+- Shared Docker mixed-topology failover validation:
+  `fabric-loadtest-20260516-190137` reused the image on `test-docker`, killed
+  target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay
+  profile. It produced 400/400 successful logical channels, 100 pool failover
+  events, `route_attempts_total=500`, route modes `ice_quic`,
+  `reverse_quic`, and `relay_quic` after the failed target was removed,
+  `ack_p95_ms=8`, `setup_latency_p95_ms=51`, clean channel/route-pressure
+  accounting, and verdict `pass`.
+- Shared Docker mixed-topology route coverage validation:
+  `fabric-loadtest-20260516-190207` ran the same 4-target mixed profile without
+  target failure. It produced 400/400 successful logical channels, exactly 100
+  streams per target, observed `lan_quic`, `ice_quic`, `reverse_quic`, and
+  `relay_quic`, kept `ack_integrity_errors=0`, `channel_leaks=0`,
+  `route_pressure.active_total=0`, and verdict `pass`.
+- Load balancing under pool failover is now an acceptance gate. The first
+  stricter shared-host rebuild, `fabric-loadtest-20260516-190704`, intentionally
+  failed because all failed-target retries moved to the nearest live target,
+  producing `target_byte_distribution_skew` and
+  `route_pressure_distribution_skew`. The retry selector was then changed to
+  spread failed-slot retries across the currently usable target set instead of
+  selecting the next target in ring order.
+- Verified load-aware retry routing after the fix:
+  `fabric-loadtest-20260516-191028` rebuilt on `test-docker`, killed target 0
+  after 100ms, and repeated the 4-target mixed profile. It produced 400/400
+  successful logical channels, 100 pool failover events, surviving-target stream
+  distribution of 134/133/133, surviving route-pressure max-active values of
+  30/25/27, `ack_p95_ms=4`, `reroute_latency_p95_ms=1`, clean acquire/release
+  accounting, and verdict `pass`.
+- Verified 1000-channel mixed-topology stress:
+  `fabric-loadtest-20260516-193414` ran 1000 logical channels on `test-docker`
+  with 128 concurrency, mixed control/bulk traffic, and the
+  `mixed-public-nat-lan-relay` profile. It produced 1000/1000 successful
+  logical channels, exact 250/250/250/250 target distribution, observed all four
+  QUIC route modes (`lan_quic`, `ice_quic`, `reverse_quic`, `relay_quic`),
+  `throughput_bps=3629522849`, `channel_churn_per_sec=1919`,
+  `ack_p95_ms=6`, clean channel/route-pressure accounting, and verdict `pass`.
+- Verified 1000-channel pool-failover stress:
+  `fabric-loadtest-20260516-193444` killed target 0 after 100ms and ran 1000
+  logical channels with 128 concurrency. It produced 1000/1000 successful
+  logical channels, 250 pool failover events, surviving-target distribution of
+  334/333/333, `route_attempts_total=1250`, `ack_p95_ms=7`, clean
+  acquire/release accounting, and verdict `pass`.
+- Verified latency-degradation migration:
+  `fabric-loadtest-20260516-193515` applied `tc netem delay 80ms` to target 1,
+  enabled slow-stream migration with `-MaxAckMs 20`, and ran 400 mixed-profile
+  channels. It observed the impaired target in `degraded_targets`, produced
+  64 slow-ACK migrations, moved completed streams onto healthy targets with
+  distribution 134/133/133, kept `channel_leaks=0`, `ack_integrity_errors=0`,
+  clean route-pressure accounting, and verdict `pass`.
+- Shared Docker runner resource-sample fallback was verified with
+  `fabric-loadtest-20260516-190325`: short runs now still persist
+  `container_stats_samples_path` and a minimal per-container sample summary
+  from final Docker stats when the background sampler has no time to emit
+  samples.
+- Added `scripts/fabric/fabric-acceptance-summary.ps1` to aggregate recent
+  `*-summary.json` artifacts into an acceptance report. It captures verdicts,
+  target distribution, route modes, churn, failover/migration counts, latency
+  SLOs, and resource evidence, and it keeps intentionally failed runs visible
+  as regression evidence for gates such as route-pressure skew detection.
+- The first 30-minute soak attempt (`fabric-loadtest-20260516-193558`) exposed
+  a runner defect instead of a fabric defect: server containers were still
+  started with a fixed `-timeout 10m`, so the three surviving servers exited
+  around minute 10 while the client expected a 30-minute run. The Docker runner
+  now exposes `-ServerTimeout` and defaults it to `-ClientTimeout`, so server
+  lifetimes in long soaks match the client run.
+- The next soak attempt (`fabric-loadtest-20260516-194816`) passed the 10-minute
+  server-timeout boundary but exposed another long-run behavior: a healthy
+  surviving target could stay out of placement after a transient degradation
+  mark. `fabric-loadtest` now uses a bounded `target_quarantine_ttl` for
+  placement while still preserving historical `degraded_targets` observations
+  in the report. The Docker runner exposes this as `-TargetQuarantineTTL`.
+- `fabric-loadtest-20260516-200241` then exposed a soak-loop issue: it reported
+  `pass` with 432869/432869 logical channels and clean accounting, but finished
+  after about 95 seconds despite `config.duration=30m`. The cause was worker
+  shutdown on per-stream `context deadline exceeded`; soak workers now only exit
+  on the parent run context or the configured soak stop time, not on one
+  channel's timeout.
+- `fabric-loadtest-20260516-200939` and `fabric-loadtest-20260516-201331`
+  confirmed the soak loop fix by running full 3-minute preflights, but they
+  failed the zero-failed-stream gate under target-kill injection. The issue was
+  policy: the known killed target re-entered placement too quickly via the
+  short transient quarantine TTL, causing some channels to spend their stream
+  budget on a hard-dead endpoint. `fabric-loadtest` now separates transient
+  `target_quarantine_ttl` from `failure_quarantine_ttl`, and the Docker runner
+  exposes `-FailureQuarantineTTL`.
+- Verified 30-minute long-duration soak:
+  `fabric-loadtest-20260516-202532` ran on `test-docker` for 1800.010 seconds
+  with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per
+  logical channel, 10-second resource and container samples, and the
+  `mixed-public-nat-lan-relay` profile. It produced 15,074,556/15,074,556
+  successful logical channels, 895,308,005,376 bytes, `throughput_bps=3979124146`,
+  `channel_churn_per_sec=8374`, exact 3,768,639 streams per target, all four
+  QUIC route modes, `ack_p95_ms=5`, `ack_p99_ms=6`, `channel_leaks=0`,
+  matching 15,074,556 channel opens/closes, `route_pressure.active_total=0`,
+  458 container-stat samples, bounded memory/PID use, and verdict `pass`.
+- Verified real-node host-to-host QUIC smoke:
+  `home-1` ran the standalone `fabric-loadtest` client against a temporary
+  QUIC server on `test-docker` at `quic://docker-test.cin.su:19443`. The run
+  created 1000 short logical channels at 128 concurrency, mixed control and
+  bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced
+  `throughput_bps=1177445403`, `channel_churn_per_sec=2478`,
+  `ack_p95_ms=12`, `ack_p99_ms=21`, `setup_latency_p95_ms=118`, zero failed
+  streams, zero channel leaks, and verdict `pass`. The report is saved as
+  `artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json`.
+- Published and registered node-agent release `0.2.280-fabricsession` with
+  Linux binary/native and Docker image artifacts. The release is intentionally
+  not assigned to live node update policies yet because the current live node
+  workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
+  mesh endpoints. Before rollout, node configs must be migrated to
+  `quic://...` endpoints and QUIC advertise labels, and the QUIC listener must
+  be enabled through env such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
+  `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`.
+- Loadtest degraded-target quarantine is observable through `degraded_targets`.
+  When `-impair-target` and slow-stream migration are enabled, verdict fails if
+  no degraded target is observed or if degraded targets do not produce migration
+  events. A shared-host validation run with 120 streams reported
+  `degraded_targets = { impaired_target: "slow_ack" }`, 5 migration events,
+  `control_ack_p95_ms=3`, and clean acquire/release accounting.
+- Channel lifecycle accounting is explicit in `fabric-loadtest` through
+  `channel_opens`, `channel_closes`, and `channel_leaks`. Verdict fails on
+  open/close mismatch, active stream leaks, or mismatch between route-pressure
+  acquire counts and QUIC stream opens.
+- The next validation step is broader real mixed public/NAT/LAN topology across
+  separate physical or VM hosts. The shared Docker host has verified the route
+  model, stress gates, 30-minute stability, memory, goroutine, file descriptor,
+  container resource, and route-pressure accounting. A true external NAT lab
+  should now validate the same gates with independent NAT devices, public nodes,
+  and local NAT-side cluster segments.
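The lifecycle gate described in the list above (open/close mismatch, leaks, and route-pressure versus QUIC stream accounting) can be sketched as a small verdict function. This is a minimal illustration, not the `fabric-loadtest` implementation: only `channel_opens`, `channel_closes`, and `channel_leaks` are named in the text, while `active_streams`, `route_pressure_acquires`, `quic_stream_opens`, and the failure reason strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ChannelCounters:
    channel_opens: int
    channel_closes: int
    channel_leaks: int
    active_streams: int           # hypothetical: streams still open at end of run
    route_pressure_acquires: int  # hypothetical: route-pressure acquire count
    quic_stream_opens: int        # hypothetical: QUIC streams opened


def lifecycle_failures(c: ChannelCounters) -> list[str]:
    """Return the reasons a run should fail; an empty list means the gate passes."""
    failures = []
    if c.channel_opens != c.channel_closes:
        failures.append("open_close_mismatch")
    if c.channel_leaks != 0 or c.active_streams != 0:
        failures.append("channel_or_stream_leak")
    if c.route_pressure_acquires != c.quic_stream_opens:
        failures.append("route_pressure_acquire_mismatch")
    return failures


# Illustrative inputs only; the acquire and stream-open values are made up.
clean = ChannelCounters(1000, 1000, 0, 0, 3700, 3700)
leaky = ChannelCounters(1000, 998, 2, 2, 3700, 3700)
print(lifecycle_failures(clean))
print(lifecycle_failures(leaky))
```

Returning the list of reasons rather than a bare boolean keeps the verdict explainable in the saved report.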
+
+Initial SLO examples:
+
+- `channel_setup_p95_ms < 200`
+- `reroute_p95_ms < 1000`
+- `control_latency_p99_ms < 100` under bulk load
+- `packet_loss_after_recovery < 0.1%`
+- `no_route_pressure_over_90_percent_when_alternatives_exist`
+- `no_channel_table_growth_after_churn`
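The numeric SLOs in the list above could be checked against a report with a few lines of code. The sketch below assumes a report dictionary keyed by the SLO names themselves; the real `fabric-loadtest` report uses its own field names (for example `setup_latency_p95_ms`), so a mapping step would be needed in practice. The `evaluate_slos` helper and the `pass`/`fail`/`missing` labels are hypothetical.

```python
# Thresholds copied from the SLO list; an observed value must be strictly
# below its limit to pass.
SLO_LIMITS = {
    "channel_setup_p95_ms": 200,
    "reroute_p95_ms": 1000,
    "control_latency_p99_ms": 100,
}


def evaluate_slos(report: dict, limits: dict = SLO_LIMITS) -> dict:
    """Map each SLO name to 'pass', 'fail', or 'missing' for one report."""
    results = {}
    for name, limit in limits.items():
        value = report.get(name)
        if value is None:
            results[name] = "missing"  # metric was not measured in this run
        else:
            results[name] = "pass" if value < limit else "fail"
    return results


# Illustrative report; values are examples, not measured results.
print(evaluate_slos({"channel_setup_p95_ms": 118, "reroute_p95_ms": 1200}))
```

Treating an unmeasured metric as `missing` instead of `pass` avoids a run silently satisfying an SLO it never exercised.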
@@ -204,6 +204,8 @@ Examples:
 - `vnc-worker` wraps a future VNC client/runtime.
 - `vpn-exit` handles exit routing.
 - `vpn-connector` handles private network reachability.
+- `vpn-client` runs on an end-user device, including Android, as a normal farm node.
+- `ipv4-egress` marks a node/service that can send authorized VPN packet traffic to ordinary IPv4 networks.
 - `video-relay` handles media optimized paths.

 Rules:
@@ -293,6 +295,41 @@ Responsibilities:
 - applies route, DNS, and egress restrictions
 - reports traffic and health telemetry

+### `ipv4-egress`
+
+Fabric-only IPv4 exit service. It is assigned to nodes that may forward authorized VPN packet channels from the mesh to ordinary IPv4 networks.
+
+Responsibilities:
+
+- accepts VPN packet channels only through the fabric service channel
+- advertises exit pool membership, region, route policy, and health
+- enforces user, organization, cluster, and owner visibility policy before accepting traffic
+- participates in latency-aware and load-aware exit selection
+- supports failover between nodes in the same exit pool without changing the Android client protocol
+- does not expose legacy VPN protocols as the steady-state data plane
+
+### `vpn-client`
+
+Client-side VPN node role. On Android, the installed application is a node-agent/runtime with this role; the VPN client service is then started locally and joins the farm like any other node.
+
+Responsibilities:
+
+- joins the mesh using the current QUIC fabric transport
+- requests the list of visible IPv4 exit pools and nodes according to the current user's access level
+- creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
+- switches to another authorized exit when the selected exit fails or becomes slow
+- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
+- exposes its local IPv4 ingress as service configuration: on Android this is the
+  `VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
+  listen ports that are mapped into VPN packet channels.
+
+Rules:
+
+- A VPN client does not use a dedicated entry node. It is itself a mesh node.
+- The farm builds the route from the client node to an authorized exit pool.
+- Exits are addressed as pools. A pool may contain one node, but that is a degraded redundancy posture and should be visible as a risk.
+- The control plane may issue policy and signed route authority, but it must not become the packet entry point for the VPN client.
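A latency-aware and load-aware choice within an exit pool, with failover and the single-node risk flag, might look like the sketch below. The scoring formula, the `load_weight_ms` parameter, the `ExitNode` fields, and the risk labels are all assumptions made for illustration; the document does not specify how exits are scored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExitNode:
    node_id: str
    healthy: bool
    latency_ms: float
    load: float  # 0.0 (idle) to 1.0 (saturated)


def select_exit(pool: list[ExitNode], load_weight_ms: float = 100.0) -> Optional[ExitNode]:
    """Pick the healthy node with the lowest latency-plus-load score."""
    healthy = [n for n in pool if n.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda n: n.latency_ms + load_weight_ms * n.load)


def pool_risks(pool: list[ExitNode]) -> list[str]:
    """Surface redundancy problems so they are visible as risks."""
    risks = []
    if len(pool) < 2:
        risks.append("single_node_pool")
    if not any(n.healthy for n in pool):
        risks.append("no_healthy_exit")
    return risks


pool = [
    ExitNode("exit-a", True, 20.0, 0.9),   # fast but busy: score 110
    ExitNode("exit-b", True, 40.0, 0.1),   # slower but idle: score 50
    ExitNode("exit-c", False, 5.0, 0.0),   # unhealthy, never selected
]
print(select_exit(pool).node_id)
pool[1].healthy = False                    # simulate failure of the selected exit
print(select_exit(pool).node_id)
```

Because the client addresses the pool rather than a node, failover is just a second call to the same selection function and needs no protocol change on the client.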
+
 ### `vpn-connector`

 Connector to private networks.
@@ -1,13 +1,13 @@
 # Web Ingress and Admin UI Model

-Status: target architecture clarification. Documentation only.
+Status: target architecture and implementation contract.

 This document defines how HTTP/HTTPS web entry, Admin UI, dynamic page
 composition, and cluster configuration responsibilities are separated in the
 Secure Access Fabric.

-It does not implement code, APIs, UI pages, mesh runtime, VPN runtime, or RDP
-changes.
+The fabric node-to-node transport remains QUIC-only. HTTP/HTTPS is allowed only
+as an external client-facing service edge.

 ## Purpose

@@ -16,33 +16,41 @@ The platform needs a clear distinction between:
 - Web Service as the HTTP/HTTPS entry layer
 - Control Plane as the owner of cluster configuration and policy
 - Admin UI as a safe, scoped user interface over Control Plane APIs
+- Fabric Transport as the internal QUIC-only node-to-node substrate

 The Web layer must never become the owner of cluster state, policy, topology,
 secrets, node identity, or routing authority.

 ## Layer Ownership

-### Web Service / Web Ingress
+### Public HTTPS Ingress

-Web Service is an edge service.
+Public HTTPS Ingress is an edge service. It may run on a public Internet node,
+including a small/slow node intended only to accept browser traffic and pass it
+into the fabric.

-Suggested role names:
+Role names:

- `web-ingress`
- `admin-web-entry`
- `admin-web-shell`
+- `public-ingress`
+- `admin-ingress`

 Responsibilities:

- accept HTTP/HTTPS
+- listen on TCP `80` only for ACME challenges, health checks, and HTTPS
+  redirects
+- listen on TCP `443` for browser/API HTTPS
 - terminate TLS or sit behind the approved TLS terminator
- serve Admin UI shell/static assets
- proxy browser/API traffic to Control API
+- serve only approved static UI shells and safe public metadata
+- validate SNI/Host, request size, rate limits, and edge policy
+- map the request to an allowed platform, cluster, organization, or user portal
+  scope
+- forward accepted traffic into the fabric through an authorized fabric service
+  channel
 - apply edge controls such as headers, rate limits, request size limits, and
  future WAF rules
 - expose only approved public/admin endpoints

-Web Service must not:
+Public HTTPS Ingress must not:

 - own cluster configuration
 - directly mutate PostgreSQL
@@ -51,6 +59,39 @@ Web Service must not:
 - store node identity or certificates as source of truth
 - expose internal mesh topology to browser clients
 - execute cluster decisions locally
+- hold platform/global admin authority keys
+- infer authorization from the fact that it accepted TCP `443`
+- become a general relay for arbitrary HTTP inside the fabric
+
+Accepting HTTPS on a node does not make that node the owner or executor of
+admin logic. It is only a service edge.
+
+### Fabric Transport
+
+Fabric Transport is the internal node-to-node layer.
+
+Rules:
+
+- node-to-node traffic uses QUIC only
+- no HTTP fallback between fabric nodes
+- STUN/ICE/rendezvous/relay are fabric transport mechanisms, not browser/API
+  protocols
+- any service traffic accepted on `443` is converted into a scoped fabric
+  service channel before it crosses the mesh
+- direct links, relay links, and route-health observations must remain separate
+  in diagnostics
+- a fabric route proves reachability, not administrative authority
+
+If a public ingress receives a request for an admin surface, the request flow is:
+
+```text
+Browser HTTPS
+  -> public/admin ingress on 443
+  -> tenant/cluster/platform scope selection
+  -> signed fabric service channel over QUIC
+  -> authorized admin/runtime service node
+  -> Control Plane authorization and policy
+```

 ### Control Plane

@@ -77,9 +118,23 @@ only.
 Cluster configuration is changed only through Control Plane services and APIs.
 The Web layer is a presentation and ingress layer over those APIs.

-### Admin UI
+### Admin UI Runtime

-Admin UI is a client application served through Web Ingress.
+Admin UI Runtime is the service that serves and executes the admin surface. It
+may run on any node explicitly assigned the matching runtime role.
+
+Role names:
+
+- `global-admin-runtime`
+- `cluster-admin-runtime`
+- `organization-portal-runtime`
+- `user-portal-runtime`
+- `identity-runtime`
+- `policy-authority`
+- `audit-sink`
+
+Admin UI is a client application served through Public HTTPS Ingress or Admin UI
+Runtime according to deployment policy.

 It renders safe Control Plane projections and submits user actions to Control
 Plane APIs.
@@ -95,7 +150,7 @@ Admin UI must not:
  viewer
 - contain executable cluster logic

-## Admin Endpoint Placement
+## Admin Endpoint Placement And Trust

 Admin UI endpoint placement is explicit and must not be inferred from storage.

@@ -110,6 +165,8 @@ Scopes:
 - Organization Admin Panel: tenant-safe projection for one organization. It
  must expose only allowed resources, service endpoints, sessions, policies,
  and safe status.
+- User Portal: personal/account scope. It must expose only the authenticated
+  user's resources, sessions, devices, and profile actions.

 Rules:

@@ -118,19 +175,29 @@ Rules:
 - Storage nodes distribute/cache scoped configuration and snapshots only.
 - Admin/web ingress is a separate service role and requires explicit Control
  Plane assignment.
+- Public Internet ingress is not enough to run a global panel.
+- `global-admin-runtime`, `policy-authority`, and `audit-sink` may run only on
+  platform-owner trusted nodes.
+- `cluster-admin-runtime` may run only on nodes authorized for that cluster.
+- `organization-portal-runtime` and `user-portal-runtime` may run on broader
+  infrastructure, but they receive only scoped projections.
 - Cluster-local admin endpoints require valid TLS/cert policy, signed scoped
  snapshots, current node health, and sufficient role coverage.
 - Platform Owner Console remains the owner-level view even when cluster-local
  admin endpoints exist.
 - Organization Admin Panel must never expose intermediate mesh topology,
  storage shards, peer caches, route caches, or unrelated cluster data.
+- A request entering through an organization-bound ingress must be rejected if it
+  asks for another organization, another cluster outside its contract, global
+  topology, or platform-owner data.

 ## Request Flow

 ```text
 Admin Browser
-  -> Web Ingress / Admin Web Shell
-  -> Control API
+  -> Public/Admin HTTPS Ingress
+  -> Fabric Service Channel over QUIC
+  -> Admin UI Runtime / Control API
  -> PostgreSQL source of truth
  -> signed scoped snapshots / config distribution
  -> rap-node-agent
@@ -266,6 +333,18 @@ Organization admin must not see:
 - secrets
 - unrelated cluster internals

+Ingress-bound projections:
+
+- A platform-owner ingress may expose platform navigation only after platform
+  authorization, MFA/step-up, and policy checks.
+- A cluster-bound ingress may expose only that cluster's admin surface and
+  cluster-scoped safe diagnostics.
+- An organization-bound ingress may expose only the organization projection and
+  organization-safe service endpoints.
+- A user portal ingress may expose only the user's personal/account projection.
+- Host/SNI alone is not authorization; it only selects the maximum possible
+  projection before server-side authorization narrows it further.
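The last rule, that Host/SNI only sets a ceiling which server-side authorization then narrows, can be illustrated with a two-step check. This is a sketch under assumptions: the linear platform, cluster, organization, user ordering, the `Scope` tuple, and the `grants` set standing in for server-side authorization are hypothetical simplifications, not the Control Plane's actual model.

```python
from typing import NamedTuple, Optional


class Scope(NamedTuple):
    kind: str   # "platform", "cluster", "organization", or "user"
    ident: str  # hypothetical identifier such as "org-a"; empty for platform


# Assumed ordering from widest to narrowest projection.
WIDTH = {"platform": 0, "cluster": 1, "organization": 2, "user": 3}


def within_ingress_binding(binding: Scope, requested: Scope) -> bool:
    """Step 1: the ingress binding caps the maximum possible projection."""
    if WIDTH[requested.kind] < WIDTH[binding.kind]:
        return False  # asks for something wider than the binding
    if requested.kind == binding.kind:
        return requested.ident == binding.ident  # no cross-tenant requests
    return True  # narrower kind; membership is decided in step 2


def resolve_projection(binding: Scope, requested: Scope, grants: set) -> Optional[Scope]:
    """Step 2: server-side grants narrow the ceiling; None means reject."""
    if not within_ingress_binding(binding, requested):
        return None
    return requested if requested in grants else None


org_a = Scope("organization", "org-a")
print(resolve_projection(org_a, org_a, {org_a}))
print(resolve_projection(org_a, Scope("organization", "org-b"), {org_a}))
```

Note that a request matching the ingress binding is still rejected when the caller holds no grant for it, which is the point of treating Host/SNI as selection rather than authorization.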
+
 ## Service Adapter UI Extensions

 Service adapters may need configuration UI.
@@ -361,22 +440,258 @@ High-risk actions include:

 ## Deployment Model

+### Current Test Entry
+
+The current shared Docker test stand exposes the Platform Owner Control Panel at
+`http://docker-test.cin.su:18080/` (`http://192.168.200.61:18080/`). This is a
+temporary lab HTTP edge served by `rap_web_admin` from
+`/tmp/rap-web-admin/html` on `test-docker`.
+
+This entry is not the production authority model. It is allowed only for the
+shared test stand while the HTTPS admin-ingress runtime is being completed. The
+target production entry is:
+
+```text
+Browser HTTPS on 443
+  -> node with explicit admin-ingress/public-ingress role
+  -> signed web-ingress envelope
+  -> QUIC fabric service channel
+  -> authorized admin/portal runtime node
+  -> Control API projection/authorization
+```
+
+The browser-facing ingress may be a small public node, but it must not become
+the management authority. Platform/global admin runtime remains limited to
+platform-owner trusted nodes. Cluster, organization, and user panels receive
+only their scoped projections.
+
+The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
+zones` is retired for the transport-layer view. The Fabric panel must show
+actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
+state, relay/route-health annotations, and web-ingress runtime readiness. It
+must not render old entry/egress zone columns as if they were transport
+topology.
+
 Possible deployment modes:

- Web Ingress and Control API in the same deployment for small/test installs
+- Public/Admin HTTPS Ingress and Control API in the same deployment for
+  small/test installs
 - Web Ingress separated from Control API for production
 - multiple Web Ingress nodes for regional/admin access
 - Web Ingress behind Caddy/Nginx/enterprise ingress
 - Admin UI shell served from Web Ingress while APIs remain on Control API
+- Internet ingress on a low-capacity node that forwards scoped channels to a
+  trusted admin runtime elsewhere in the fabric
+- global admin runtime only on platform-owner controlled nodes
+- cluster admin runtime on cluster-authorized nodes
+- organization/user portal runtime on tenant-safe nodes with scoped data

 Even when deployed together, ownership remains separate:

- Web Ingress is entry/presentation
+- Public/Admin HTTPS Ingress is entry/presentation
+- Fabric Transport is QUIC-only service-channel delivery
 - Control API is authorization/domain logic
 - PostgreSQL is source of truth
 - Fabric Storage/Config Storage is scoped distribution/cache
 - node-agent consumes scoped desired state

+## Required Roles
+
+The platform recognizes these web/admin placement roles:
+
+| Role | Scope | Purpose |
+| --- | --- | --- |
+| `public-ingress` | cluster or organization | Listen on 80/443, terminate/validate HTTPS, forward scoped service channels. |
+| `admin-ingress` | platform or cluster | HTTPS edge for admin surfaces. It does not own authority. |
+| `global-admin-runtime` | platform trusted nodes only | Platform-owner console/runtime. |
+| `cluster-admin-runtime` | cluster | Cluster admin console/runtime for one cluster. |
+| `organization-portal-runtime` | organization | Tenant-safe organization administration. |
+| `user-portal-runtime` | user/organization | Personal account/resource portal. |
+| `identity-runtime` | platform/cluster | Authentication, session, MFA, step-up, and token issuance. |
+| `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
+| `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |
+
+Legacy `entry-node` remains a generic client ingress/service edge role for
+non-admin product services. It must not imply admin authority.
+
+## Fabric Service Classes
+
+Admin and portal traffic uses explicit fabric service classes. This prevents
+admin traffic from being disguised as VPN/RDP/file/video traffic and gives the
+routing layer clear QoS, role, and audit semantics.
+
+| Service class | Required runtime roles | Projection |
+| --- | --- | --- |
+| `platform_admin` | `admin-ingress`, `global-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | Platform-owner console. |
+| `cluster_admin` | `admin-ingress`, `cluster-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One cluster. |
+| `organization_portal` | `public-ingress`, `organization-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One organization. |
+| `user_portal` | `public-ingress`, `user-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One authenticated user/account scope. |
+
+Default channels for these classes are `control`, `interactive`, and
+`reliable`. They are latency-sensitive control-plane/service traffic, not bulk
+data transfer.
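The service class table above lends itself to a simple role-coverage check before a class is offered on a route. The mapping below is copied from the table; the `missing_roles` helper and the idea of passing the set of roles currently available to the route are illustrative assumptions.

```python
# Required runtime roles per fabric service class, as listed in the table.
COMMON = {"identity-runtime", "policy-authority", "audit-sink"}
REQUIRED_ROLES = {
    "platform_admin": {"admin-ingress", "global-admin-runtime"} | COMMON,
    "cluster_admin": {"admin-ingress", "cluster-admin-runtime"} | COMMON,
    "organization_portal": {"public-ingress", "organization-portal-runtime"} | COMMON,
    "user_portal": {"public-ingress", "user-portal-runtime"} | COMMON,
}


def missing_roles(service_class: str, available_roles) -> list[str]:
    """Return the required roles that are not currently available, sorted."""
    return sorted(REQUIRED_ROLES[service_class] - set(available_roles))


print(missing_roles(
    "cluster_admin",
    ["admin-ingress", "cluster-admin-runtime", "identity-runtime"],
))
```

An empty result means every role the class depends on is present; anything else names exactly what blocks the class.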
+
+## Desired Workload Contract
+
+Ingress nodes are configured through normal node desired workloads. The first
+runtime stage is a contract probe: node-agent validates the policy and reports a
+workload status, but it does not open `80`/`443` until the real ingress runtime
+stage is enabled.
+
+Example platform/cluster admin ingress workload:
+
+```json
+{
+  "service_type": "admin-ingress",
+  "desired_state": "enabled",
+  "runtime_mode": "native",
+  "config": {
+    "listen_http_port": 80,
+    "listen_https_port": 443,
+    "tls_mode": "terminate",
+    "scope": "platform",
+    "service_classes": ["platform_admin", "cluster_admin"]
+  }
+}
+```
+
+Example organization/user public ingress workload:
+
+```json
+{
+  "service_type": "public-ingress",
+  "desired_state": "enabled",
+  "runtime_mode": "native",
+  "config": {
+    "listen_http_port": 80,
+    "listen_https_port": 443,
+    "tls_mode": "terminate",
+    "scope": "organization",
+    "service_classes": ["organization_portal", "user_portal"]
+  }
+}
+```
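A contract probe over workloads shaped like the two examples above could be sketched as follows. This is not the node-agent code: the `probe_workload` name, the reason strings, and the `ok` status label are hypothetical, while the fixed status fields and the `degraded` result for invalid service classes or non-80/443 ports follow the requirements listed in this section.

```python
ALLOWED_SERVICE_CLASSES = {
    "platform_admin", "cluster_admin", "organization_portal", "user_portal",
}


def probe_workload(workload: dict) -> dict:
    """Validate an ingress workload without opening any listener."""
    config = workload.get("config", {})
    reasons = []
    if config.get("listen_http_port") != 80:
        reasons.append("invalid_http_port")
    if config.get("listen_https_port") != 443:
        reasons.append("invalid_https_port")
    unknown = sorted(set(config.get("service_classes", [])) - ALLOWED_SERVICE_CLASSES)
    if unknown:
        reasons.append("invalid_service_classes:" + ",".join(unknown))
    return {
        "status": "degraded" if reasons else "ok",  # "ok" is a placeholder label
        "reasons": reasons,
        "fabric_transport": "quic_only",
        "http_between_fabric_nodes": False,
        "authority_service": False,
        "fabric_service_channel_required": True,
        "ports_opened_by_stub": False,
    }


workload = {
    "service_type": "admin-ingress",
    "desired_state": "enabled",
    "runtime_mode": "native",
    "config": {
        "listen_http_port": 80,
        "listen_https_port": 443,
        "tls_mode": "terminate",
        "scope": "platform",
        "service_classes": ["platform_admin", "cluster_admin"],
    },
}
print(probe_workload(workload)["status"])
```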
+
+Contract-probe status requirements:
+
+- `fabric_transport` is `quic_only`
+- `http_between_fabric_nodes` is `false`
+- `authority_service` is `false`
+- `fabric_service_channel_required` is `true`
+- `ports_opened_by_stub` is `false`
+- invalid service classes or non-80/443 ports report `degraded`
+- real listener startup requires both workload config
+  `real_listener_enabled=true` and node-agent process gate
+  `RAP_WEB_INGRESS_RUNTIME_ENABLED=true`
+- without the process gate, a real-listener request reports
+  `web_ingress_real_listener_gate_disabled`
+- the first handler stage returns schema
+  `rap.web_ingress.runtime_response.v1`; it redirects HTTP to HTTPS, exposes
+  health, validates service class/scope, and blocks payload forwarding with
+  `fabric_service_channel_binding_not_implemented` until the QUIC service
+  channel binding is implemented
+- node-agent owns a web-ingress listener lifecycle manager. When the real
+  listener gate is enabled, it starts the HTTP redirect listener and starts
+  HTTPS only when `tls_cert_file` and `tls_key_file` are present in workload
+  config. Without TLS files the listener status is `partial` and service
+  payload remains blocked.
+- The HTTPS handler has a `FabricBinder` boundary. Valid requests become
+  `rap.web_ingress.fabric_request.v1` records with method, path, query, host,
+  derived scope, service class, safe headers, bounded body, and observed
+  timestamp. Runtime derives fabric scope from service class
+  (`platform_admin` -> `platform`, `cluster_admin` -> `cluster`,
+  `organization_portal` -> `organization`, `user_portal` -> `user`) before
+  signing/forwarding the request.
+  Dangerous browser headers such as `Authorization`, `Cookie`, `Set-Cookie`,
+  and service-channel tokens are not forwarded as ordinary proxy headers.
+  The binder must convert the request into a signed/scoped fabric service
+  channel envelope; if no binder is present, ingress returns
+  `fabric_service_channel_binding_not_implemented`.
+- The first concrete binder emits
+  `rap.web_ingress.fabric_service_channel_envelope.v1`. The envelope contains
+  the safe request projection, base64-encoded body, scope, service class,
+  observed timestamp, and envelope timestamp. It is serialized as canonical JSON
+  for signing, then passed to an `EnvelopeSigner` and `EnvelopeSender`.
+  `EnvelopeSigner` owns node/service-channel signature policy. `EnvelopeSender`
+  owns delivery into the QUIC fabric service channel and route selection. This
+  keeps HTTP edge handling separated from mesh internals while making the
+  security boundary explicit and testable.
+- The initial signer implementation is Ed25519 over the canonical envelope
+  bytes. The signer can derive `key_id` from the public key fingerprint or use
+  an explicitly configured key id. Production deployment must bind this key to
+  the node identity/service-channel authority policy before enabling real
+  browser traffic.
+- The initial mesh sender adapter can submit the signed envelope through the
+  existing reliable fabric channel runtime using `control` traffic class and a
+  configured route set to an admin/portal runtime node or pool. At this stage it
+  returns a delivery-accepted response with route/channel metrics. Full
+  request/response admin API streaming remains a later runtime step and must
+  stay on the same QUIC fabric channel model.
+- The fabric channel runtime now also has a request/response path for web
+  ingress: it opens a QUIC stream, sends the signed envelope as `FrameData`, and
+  waits for a `FrameData` response on the same stream and sequence. Route
+  failures or response timeouts use the same latency-aware reroute path as
+  reliable delivery. Runtime HTTP responses use
+  `rap.web_ingress.fabric_runtime_response.v1` with status code, safe headers,
+  and body/body_b64. If a runtime response is not in that schema, ingress
+  reports delivery-accepted metrics instead of treating arbitrary payload as an
+  HTTP response.
+- QUIC fabric server reserves `WebIngressForwardQUICStreamID` for web ingress
+  request/response forwarding. The server invokes a web-ingress forward handler
+  with the signed envelope payload and returns a wrapper containing either
+  runtime payload or an error on the same stream/sequence.
+- Admin/portal runtime nodes have a signed-envelope receiver contract. The
+  receiver verifies `rap.web_ingress.signed_fabric_service_channel_envelope.v1`,
+  Ed25519 signature, trusted key id, scope, service class, and timestamp skew
+  before calling the local runtime handler. The local handler returns
+  `rap.web_ingress.fabric_runtime_response.v1`; unsafe response headers are
+  filtered before the payload is returned to the ingress edge.
+- Node-agent exposes explicit runtime key policy inputs while the final signed
+  config-snapshot distribution is being wired:
+  `RAP_WEB_INGRESS_SIGNING_PRIVATE_KEY`,
+  `RAP_WEB_INGRESS_SIGNING_KEY_ID`, and
+  `RAP_WEB_INGRESS_TRUSTED_KEYS_JSON`. Trusted keys JSON may be either
+  `{"key_id":"public_key_b64"}` or an array of
+  `{"key_id":"...","public_key":"..."}` objects. Without trusted keys the
+  web-ingress receiver handler is not installed. Runtime receiver placement can
+  be narrowed with `RAP_WEB_INGRESS_RUNTIME_SERVICE_CLASSES`, a comma-separated
+  allow-list of `platform_admin`, `cluster_admin`, `organization_portal`, and
+  `user_portal`; this is a temporary explicit node-local policy until signed
+  role snapshots drive receiver placement.
+- Heartbeat metadata includes `web_ingress_runtime_receiver_report` when QUIC
+  fabric or web-ingress key policy is configured. The report exposes the
+  signed-envelope schema, QUIC stream id, trusted key count, receiver
+  service-class allow-list, handler installation state, status/reason
+  (`ready`, `degraded`, or `blocked`), and QUIC endpoint readiness so the
+  fabric panel can show whether a node can currently receive admin/portal
+  runtime traffic and why it cannot.
+- QUIC listener/reverse-transport handler configuration is sensitive to the
+  web-ingress trusted key policy and runtime service-class allow-list. If either
+  policy changes, node-agent restarts or refreshes the QUIC fabric handler
+  binding so stale key trust or stale receiver placement is not kept in memory.
+- The first local admin runtime dispatcher is intentionally read-only. It
+  handles `/healthz`, `/readyz`, and `*/ui-manifest` requests after signed
+  envelope verification. It returns `rap.web_ingress.admin_runtime_response.v1`
+  with a safe `rap.web_ingress.ui_manifest.v1` projection that lists sections
+  and read-only actions for the requested service class. It rejects invalid
+  `scope`/`service_class` pairs before using either the local fallback or the
+  Control API projection client. Mutations return
+  `control_api_mutation_binding_not_implemented`; unknown read projections
+  return `control_api_projection_binding_not_implemented` until the dispatcher
+  is wired to the real Control API authorization/projection layer.
+- The dispatcher now has a `ControlAPIProjectionClient` boundary. When bound,
+  read-only GET/HEAD requests are sent to the Control API projection endpoint
+  and returned as `rap.web_ingress.control_api_projection_response.v1`.
+  The backend exposes the first read-only projection endpoint at
+  `/api/v1/clusters/{cluster_id}/nodes/{node_id}/admin-runtime/projection`.
+  It returns safe manifest/projection payloads, marks audit as required, and
+  rejects mutation methods and invalid `scope`/`service_class` combinations.
+  Requests must use schema
+  `rap.web_ingress.control_api_projection_request.v1`; the agent accepts
+  responses only with schema `rap.web_ingress.control_api_projection_response.v1`.
+  This is the first Control API binding slice; it is not yet a full
+  authorization/session/audit implementation.
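The binder steps described in the list above (derive scope from service class, drop dangerous headers, base64-encode the body, and serialize canonically for signing) can be sketched as below. Several details are assumptions: the envelope field names, the `x-service-channel-token` header name, and the choice of sorted-key compact JSON as the canonical form are not specified by the document. Signing is left out because the Python standard library has no Ed25519 implementation; the returned bytes are what an `EnvelopeSigner` would sign.

```python
import base64
import json

# Scope derivation as stated in the text.
SCOPE_BY_SERVICE_CLASS = {
    "platform_admin": "platform",
    "cluster_admin": "cluster",
    "organization_portal": "organization",
    "user_portal": "user",
}
# Headers that must not be forwarded as ordinary proxy headers.
# "x-service-channel-token" is a hypothetical name for service-channel tokens.
BLOCKED_HEADERS = {"authorization", "cookie", "set-cookie", "x-service-channel-token"}


def safe_headers(headers: dict) -> dict:
    """Lower-case header names and drop the blocked ones."""
    return {k.lower(): v for k, v in headers.items() if k.lower() not in BLOCKED_HEADERS}


def build_envelope(service_class, method, path, query, host, headers,
                   body: bytes, observed_at: str, envelope_at: str) -> bytes:
    """Return canonical envelope bytes ready to hand to a signer."""
    envelope = {
        "schema": "rap.web_ingress.fabric_service_channel_envelope.v1",
        "scope": SCOPE_BY_SERVICE_CLASS[service_class],  # KeyError on unknown class
        "service_class": service_class,
        "request": {
            "method": method, "path": path, "query": query, "host": host,
            "headers": safe_headers(headers),
            "body_b64": base64.b64encode(body).decode("ascii"),
        },
        "observed_at": observed_at,
        "envelope_at": envelope_at,
    }
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


raw = build_envelope(
    "organization_portal", "GET", "/ui-manifest", "", "portal.example.test",
    {"Cookie": "sid=1", "Accept": "application/json"}, b"",
    "2026-05-16T00:00:00Z", "2026-05-16T00:00:01Z",
)
print(json.loads(raw)["request"]["headers"])
```

Sorting keys makes the byte output independent of header insertion order, which matters because the signature covers the exact bytes.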
+
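The two accepted shapes of `RAP_WEB_INGRESS_TRUSTED_KEYS_JSON` described in this section can be normalized into one mapping before use. The helper names below are hypothetical and the sketch does not validate that the values are well-formed public keys; it only shows the shape handling and the rule that no trusted keys means no receiver handler.

```python
import json


def parse_trusted_keys(raw: str) -> dict:
    """Accept either {"key_id": "public_key_b64"} or a list of
    {"key_id": ..., "public_key": ...} objects and return key_id -> key."""
    data = json.loads(raw)
    if isinstance(data, dict):
        items = list(data.items())
    elif isinstance(data, list):
        items = [(entry["key_id"], entry["public_key"]) for entry in data]
    else:
        raise ValueError("trusted keys must be a JSON object or array")
    keys = {}
    for key_id, public_key in items:
        if not key_id or not public_key:
            raise ValueError("empty key id or public key")
        keys[key_id] = public_key
    return keys


def receiver_handler_installed(raw: str) -> bool:
    """Without trusted keys the receiver handler is not installed."""
    return bool(raw) and bool(parse_trusted_keys(raw))


# Placeholder key material for illustration only.
as_object = '{"edge-1": "cHVibGljLWtleQ=="}'
as_array = '[{"key_id": "edge-1", "public_key": "cHVibGljLWtleQ=="}]'
print(parse_trusted_keys(as_object) == parse_trusted_keys(as_array))
```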
 ## Future Stages

 Suggested staged work:
@@ -417,8 +732,9 @@ This document does not authorize:
 ## Result / Decision

 WEB is an ingress and presentation layer, not a cluster configuration owner.
-Cluster configuration belongs to the Control Plane and is persisted in
-PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
+Fabric remains QUIC-only internally; HTTP/HTTPS exists only at the external
+client edge. Cluster configuration belongs to the Control Plane and is persisted
+in PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
 schema-driven projections over Control Plane APIs. They must not embed secrets,
 internal topology, peer caches, route caches, or arbitrary executable code.