rdp-proxy/docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md

# Fabric Service Channel Runtime

Status: accepted product direction and implementation guardrail.

This document defines the common runtime layer that service products must use
for live traffic. VPN, Remote Server/Desktop Access, video meetings, file
transfer, SSH/VNC/RDP adapters, and future services must not each invent their
own route, relay, retry, and failover mechanics.

## Problem

The platform goal is a distributed high-speed access fabric:

```text
client or service ingress
  -> authorized entry node / entry pool
  -> fastest healthy fabric route
  -> authorized exit node / exit pool
  -> target network, adapter, or service runtime
```

Recent VPN work exposed an architectural risk: debugging transport behavior
inside the Android VPN client or the temporary backend packet relay can hide
the real missing layer. If the common fabric channel is incomplete, every later
service will repeat the same work and the Remote Server/Desktop Access client
will get stuck on transport issues that should already be solved below it.

The backend/control API remains the control plane. It must not become the
production realtime relay for high-rate service traffic.

## Product Rule

All live service traffic goes through the Fabric Service Channel runtime.

Control-plane and engineering traffic:

- login
- profile refresh
- policy lookup
- session creation
- route authorization
- diagnostics
- update metadata

may use Control API and admin ingress.

Working data traffic:

- VPN IP packets
- remote desktop display/input/control channels
- SSH/VNC streams
- file chunks
- video/audio
- future realtime service payloads

must use Fabric Service Channel unless an explicit compatibility fallback is
selected and reported as degraded.
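
The rule above can be read as a small classification step. The following is a minimal sketch, not the project's implementation: the traffic-kind identifiers, `TransportDecision`, and `select_transport` are hypothetical names chosen for illustration, and the only behavior taken from the text is the split between control-plane and working data traffic plus the explicit degraded fallback.

```python
from dataclasses import dataclass

# Traffic kinds mirror the two lists above; the identifiers themselves are
# hypothetical labels chosen for this sketch.
CONTROL_PLANE_KINDS = {
    "login", "profile_refresh", "policy_lookup", "session_creation",
    "route_authorization", "diagnostics", "update_metadata",
}
WORKING_DATA_KINDS = {
    "vpn_ip_packets", "remote_desktop_channels", "ssh_vnc_streams",
    "file_chunks", "video_audio",
}


@dataclass(frozen=True)
class TransportDecision:
    transport: str  # "control_api", "fabric_service_channel", or "backend_relay"
    degraded: bool  # True only when the compatibility fallback is in use


def select_transport(kind: str, fallback_selected: bool = False) -> TransportDecision:
    """Pick a transport for one traffic kind according to the product rule."""
    if kind in CONTROL_PLANE_KINDS:
        return TransportDecision("control_api", degraded=False)
    if kind in WORKING_DATA_KINDS:
        if fallback_selected:
            # The fallback is allowed only when it is explicit and reported.
            return TransportDecision("backend_relay", degraded=True)
        return TransportDecision("fabric_service_channel", degraded=False)
    # Unknown kinds are rejected rather than silently relayed.
    raise ValueError(f"unclassified traffic kind: {kind!r}")
```

In this sketch a caller can never reach the backend relay for working data without the result carrying `degraded=True`, which is the property the rule asks for.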

## Service Request Contract

A service requests a channel by logical intent, not by hard-coding a node path.

Target shape:

```json
{
  "service_class": "vpn_packets | remote_workspace | file_transfer | video",
  "organization_id": "...",
  "user_id": "...",
  "resource_id": "...",
  "entry_pool": ["node-a", "node-b"],
  "exit_pool": ["node-x", "node-y"],
  "required_roles": ["entry-node", "vpn-exit"],
  "allowed_channels": ["control", "reliable", "bulk", "droppable"],
  "qos": {
    "interactive": true,
    "bulk_limit_mbps": 0,
    "priority": "interactive | normal | bulk"
  },
  "failover": {
    "route_rebuild": "automatic",
    "exit_failover": "automatic",
    "sticky_session": true
  }
}
```
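
As an illustration of how a caller might sanity-check a request of this shape before sending it, here is a hedged sketch. The function name `validate_channel_request` and the specific rules (non-empty pools, known channel classes) are assumptions made for this example; the accepted values are copied from the target shape above and from the channel class table later in this document.

```python
# Values copied from the target shape and the channel class table; treating
# anything else as an error is an assumption of this sketch.
SERVICE_CLASSES = {"vpn_packets", "remote_workspace", "file_transfer", "video"}
CHANNEL_CLASSES = {"control", "interactive", "reliable", "bulk", "droppable"}
PRIORITIES = {"interactive", "normal", "bulk"}


def validate_channel_request(req: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if req.get("service_class") not in SERVICE_CLASSES:
        errors.append("service_class is missing or unknown")
    for pool in ("entry_pool", "exit_pool"):
        if not req.get(pool):
            errors.append(f"{pool} must list at least one node")
    unknown = set(req.get("allowed_channels", [])) - CHANNEL_CLASSES
    if unknown:
        errors.append(f"unknown channel classes: {sorted(unknown)}")
    if req.get("qos", {}).get("priority") not in PRIORITIES:
        errors.append("qos.priority is missing or unknown")
    return errors
```

The request names pools and intent only; nothing in it pins a concrete node path, which stays the control plane's decision.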

The control plane returns a short-lived, signed service-channel lease:

- channel/session id
- selected entry
- selected exit
- alternate entries/exits
- primary route path
- alternate route paths
- allowed channel classes
- route generation/fencing epoch
- token expiry and refresh policy
- fallback policy

The service sees a channel endpoint and channel capabilities. The full mesh
topology is exposed only in platform-owner diagnostic views.
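
A receiver of such a lease has to check at least the signature, expiry, fencing epoch, and allowed channel classes. The sketch below shows those checks under loud assumptions: the field names (`expires_at`, `fencing_epoch`, `allowed_channel_classes`) are hypothetical, and HMAC-SHA256 over canonical JSON is only a stand-in for whatever signature scheme the cluster authority actually uses, which this document does not specify.

```python
import hashlib
import hmac
import json
import time


def sign_lease(lease: dict, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over canonical JSON (an assumption)."""
    body = json.dumps(lease, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def check_lease(lease: dict, signature: str, key: bytes, *,
                channel_class: str, current_epoch: int, now=None) -> str:
    """Return "ok" or the first reason the lease must be rejected."""
    now = time.time() if now is None else now
    if not hmac.compare_digest(sign_lease(lease, key), signature):
        return "bad_signature"
    if now >= lease["expires_at"]:
        return "expired"
    # A lease issued under an older fencing epoch must not carry traffic.
    if lease["fencing_epoch"] < current_epoch:
        return "fenced"
    if channel_class not in lease["allowed_channel_classes"]:
        return "channel_class_not_allowed"
    return "ok"
```

Checking the signature first means none of the other fields are trusted until the lease is known to come from the issuing authority.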

## Runtime Responsibilities

### Control Plane

- authorizes the service request
- resolves organization/resource policy
- selects candidate entry and exit pools
- issues signed channel leases
- records audit
- publishes route generation and allowed service class
- receives telemetry and route health feedback
- triggers route/exit replacement when needed

### Fabric Routing Engine

- chooses shortest/fastest healthy route
- scores latency, loss, queue depth, bandwidth, node health, NAT mode,
  region/locality, role eligibility, and route generation freshness
- maintains alternate routes
- avoids full-mesh requirements
- rebuilds routes when links/nodes degrade
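
One way to picture the scoring step is a cost function with hard eligibility filters in front of it. This is a minimal sketch only: `RouteCandidate`, `score`, `rank_routes`, and the weights are hypothetical, and only a subset of the signals listed above (latency, loss, queue depth, health, role eligibility, generation freshness) is modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteCandidate:
    route_id: str
    latency_ms: float
    loss_ratio: float   # 0.0 .. 1.0
    queue_depth: int
    healthy: bool
    role_eligible: bool
    generation: int


def score(route: RouteCandidate, current_generation: int) -> Optional[float]:
    """Lower is better; None means the route is not eligible at all."""
    if not (route.healthy and route.role_eligible):
        return None
    if route.generation != current_generation:
        return None  # stale route generation is never selected
    # Hypothetical weights: 1% loss costs as much as 10 ms of latency.
    return route.latency_ms + 1000.0 * route.loss_ratio + 0.5 * route.queue_depth


def rank_routes(candidates, current_generation: int) -> list[str]:
    """First id is the primary route, the rest are alternates."""
    scored = [
        (s, r.route_id)
        for r in candidates
        if (s := score(r, current_generation)) is not None
    ]
    return [route_id for _, route_id in sorted(scored)]
```

Keeping the full ranked list, not only the winner, is what lets the engine hand out alternate routes without rescoring on failure.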

### Entry Node

- accepts client-facing live connections
- validates service-channel token
- multiplexes logical streams/channels
- applies backpressure and per-channel scheduling
- forwards payloads to the selected route
- switches to alternate route/exit when instructed or when local health checks
  show the path is bad

### Intermediate Relay Nodes

- forward authorized envelopes only
- enforce route id, channel class, TTL, generation, and next-hop rules
- report link health and queue pressure
- do not own durable session state
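
The relay checks can be expressed as a pure function over the envelope and the node's current route rules, which is what keeps the relay free of session state. The sketch below assumes hypothetical `Envelope` and `RouteRule` shapes and reason strings; only the list of enforced fields comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    route_id: str
    channel_class: str
    ttl: int
    generation: int
    next_hop: str


@dataclass(frozen=True)
class RouteRule:
    allowed_classes: frozenset
    generation: int
    next_hop: str


def forward_decision(env: Envelope, rules: dict) -> tuple[str, str]:
    """Return ("forward", next_hop) or ("drop", reason) for one envelope."""
    rule = rules.get(env.route_id)
    if rule is None:
        return ("drop", "unknown_route")
    if env.channel_class not in rule.allowed_classes:
        return ("drop", "channel_class_not_allowed")
    if env.ttl <= 0:
        return ("drop", "ttl_expired")
    if env.generation != rule.generation:
        return ("drop", "stale_generation")
    if env.next_hop != rule.next_hop:
        return ("drop", "next_hop_mismatch")
    return ("forward", rule.next_hop)
```

Because the decision depends only on the envelope and the rules table, replacing or restarting a relay loses no session state.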

### Exit Node

- terminates the fabric route for the selected service
- connects to LAN/internet/adapter/runtime target
- enforces service policy locally
- reports egress health, DNS policy, and throughput
- can be replaced by another exit from the pool when policy allows

## Channel Model

The common fabric layer is channel-oriented.

| Channel class | Reliability | Typical services | Scheduling |
| --- | --- | --- | --- |
| `control` | reliable | attach/detach, route refresh, service state | highest |
| `interactive` | reliable/low-latency | RDP input, SSH input, cursor/control | highest data |
| `reliable` | ordered bounded | clipboard, small files, terminal output | medium |
| `bulk` | reliable bounded | VPN packets, downloads, large file chunks | lower than interactive |
| `droppable` | latest-wins | video frames, remote display regions, telemetry | drop stale |
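
The difference between bounded reliable classes and the latest-wins `droppable` class can be sketched with two queue types. This is an illustrative model, not the runtime's scheduler: `ChannelQueues`, the queue bound, and the exact drain order (in particular placing `droppable` last) are assumptions of the example.

```python
from collections import OrderedDict, deque

# Assumed drain order for this sketch; the table only fixes the relative
# order of control, interactive, and bulk.
DRAIN_ORDER = ["control", "interactive", "reliable", "bulk", "droppable"]


class ChannelQueues:
    def __init__(self, bound: int = 64):
        self.bound = bound  # arbitrary per-class bound for the sketch
        self.fifo = {c: deque() for c in DRAIN_ORDER if c != "droppable"}
        self.latest = OrderedDict()  # stream key -> newest payload only
        self.dropped = 0

    def enqueue(self, channel_class: str, payload, key=None) -> bool:
        if channel_class == "droppable":
            if key in self.latest:
                self.dropped += 1  # the stale payload is replaced
            self.latest[key] = payload
            return True
        queue = self.fifo[channel_class]
        if len(queue) >= self.bound:
            return False  # full: the caller must apply backpressure
        queue.append(payload)
        return True

    def dequeue(self):
        """Return (channel_class, payload) for the next item, or None."""
        for c in DRAIN_ORDER:
            if c == "droppable":
                if self.latest:
                    _, payload = self.latest.popitem(last=False)
                    return (c, payload)
            elif self.fifo[c]:
                return (c, self.fifo[c].popleft())
        return None
```

For example, two display frames queued under the same key leave only the newer one to send, while a full `bulk` queue refuses new payloads instead of growing without limit.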

VPN packets are protocol-neutral IP packets. They must not be special-cased as
HTTP, RDP, DNS, Telegram, or browser traffic. Optimization must improve the
shared packet path.

Remote Server/Desktop Access uses the same channel runtime, but its adapter
uses service-specific channel classes such as input, display, cursor,
clipboard, file transfer, audio, and telemetry.

## Failover Rules

The fabric must support:

- entry pool selection
- exit pool selection
- alternate route set
- quick route rebuild on node/link failure
- sticky route while healthy to avoid needless TCP disruption
- graceful drain when possible
- hard failover when route is stale or fenced
- explicit degraded fallback when the backend relay is used

VPN failover may still break existing TCP sessions in the initial mode. The
fabric must minimize disruption, but lossless TCP migration is a future mode and
must not be assumed.
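
The rules above imply an ordering between staying sticky, draining, failing over, and falling back. The following sketch shows one possible ordering; the function name, inputs, and action strings are hypothetical, and a real runtime would derive these flags from route health and control-plane decisions.

```python
def failover_action(route_healthy: bool, stale_or_fenced: bool,
                    alternate_available: bool, can_drain: bool) -> tuple[str, bool]:
    """Return (action, degraded) for the current route of one channel."""
    if stale_or_fenced:
        # A stale or fenced route must be left immediately.
        if alternate_available:
            return ("hard_failover", False)
        return ("backend_relay_fallback", True)
    if route_healthy:
        # Sticky while healthy to avoid needless TCP disruption.
        return ("stay", False)
    if alternate_available:
        return ("graceful_drain" if can_drain else "hard_failover", False)
    # No fabric route left: the fallback is explicit and reported as degraded.
    return ("backend_relay_fallback", True)
```

In this ordering a healthy route is never abandoned for a merely better one, and the backend relay is reachable only together with `degraded=True`.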

## Current Gap

The project already has important pieces:

- signed node identity and scoped mesh config
- production fabric-control forwarding
- production `vpn_packet` envelope tests
- route intents and route health feedback
- entry-node VPN packet ingress prototype
- backend relay fallback for lab compatibility

The missing production layer is the service-channel runtime:

- stable client-to-entry live transport
- multiplexed logical streams/channels
- route manager with primary and alternate paths
- service-neutral QoS/backpressure
- channel-level telemetry
- automatic route and exit replacement contract
- explicit degraded fallback reporting

Until this layer is complete, VPN should be treated as a proving service for
the fabric channel, not as a one-off Android transport project.

## Implemented Foundation

The first backend contract slice is implemented:

- `POST /api/v1/clusters/{cluster_id}/fabric/service-channels/leases`
  issues a `rap.fabric_service_channel_lease.v1` contract.
- The lease contains selected entry/exit nodes, entry/exit pools, service
  class, required roles, allowed channel classes, route generation, fencing
  epoch, primary route, alternate routes, token metadata, entry HTTP/WebSocket
  endpoint templates, QoS, failover policy, and explicit fallback state.
- Each lease includes a cluster-authority-signed
  `rap.fabric_service_channel_lease_authority.v1` payload that binds the
  channel id, service class, selected entry/exit, primary route, generation,
  fencing epoch, expiry, and token hash.
- When an authorized fabric route exists, backend fallback is reported as
  available but not active.
- When no authorized fabric route exists, the lease is marked
  `degraded_fallback`; backend relay is an explicit compatibility fallback
  rather than a hidden steady state.
- VPN client profiles now embed `fabric_service_channel_lease` for each planned
  VPN route, making VPN the first consumer of the common channel contract.
- `rap-node-agent` now exposes the first entry runtime endpoint for the VPN
  proving service:
  `/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets`
  and the `/packets/ws` WebSocket variant.
- The entry endpoint requires a `rap_fsc_*` service-channel token, accepts
  packet batches in `application/vnd.rap.vpn-packet-batch.v1`, forwards through
  the existing production `vpn_packet` fabric route, and maps route failures to
  the explicit backend relay compatibility path.
- Service-channel leases now carry a signed `data_plane` contract declaring
  control-plane API use, working-data transport through Fabric Service Channel,
  steady-state fabric routes, backend relay fallback policy, and
  service-neutral multi-flow isolation.
- Node-agent validates the signed or introspected data-plane contract, applies
  the preferred fabric route from the contract, reports contract adoption in
  heartbeat access telemetry, and refuses backend relay when the contract says
  `backend_relay_policy=disabled`.
- Backend access telemetry and web-admin active-channel diagnostics now project
  the data-plane adoption count plus last data-plane mode, working transport,
  steady-state transport, backend relay policy, and logical flow mode at
  cluster, node, and active-channel levels.
- Rebuild/access incident diagnostics now include `data_plane_contract`
  incidents for accepted service-channel traffic without a reported
  data-plane contract, transport/policy mismatches, disabled backend relay
  observations, and degraded backend relay usage. These incidents keep backend
  relay visible as degraded compatibility behavior rather than hidden steady
  state.
- Node-agent access telemetry distinguishes backend relay actually used from
  backend relay blocked by signed data-plane policy. Blocked fallback reports
  include `backend_fallback_blocked` and the last violation status/reason, and
  backend projects them to access telemetry plus `data_plane_contract`
  incidents.
- Backend correlates access-report send failures with active service-channel
  leases. A normal primary route that fails while backend relay is disabled is
  persisted as fenced route feedback, allowing the existing rebuild planner to
  select an authorized alternate instead of leaving the channel stuck at a
  policy-blocked fallback.
- Access-report-derived route feedback is deduplicated while an active fenced
  or degraded observation from `fabric_service_channel_access_report` already
  exists for the same cluster, reporter node, route, and service class. This
  prevents repeated blocked-fallback send-failure heartbeats from continuously
  refreshing the same feedback and churning rebuild attempts.
- Replacement decisions and rebuild-attempt ledger rows carry the originating
  access-report feedback identity: observation id, source, observed/expiry
  timestamps, channel/resource ids, and data-plane violation status/reason.
  This makes the chain `access report -> route feedback -> planner decision ->
  rebuild attempt` visible without opening raw JSON payloads.
- Rebuild-attempt ledger queries can filter by `feedback_source`,
  `feedback_channel_id`, and `feedback_violation_status`. The admin panel
  exposes the same fields so incident drilldown can jump directly to the
  correlated attempts behind an access-report-derived failure.
- Entry token validation now supports cluster-authority signed lease
  enforcement. When the client sends
  `X-RAP-Service-Channel-Authority-Payload` and
  `X-RAP-Service-Channel-Authority-Signature`, the entry node verifies the
  signature, expiry, selected entry node, service class, channel/resource ids,
  allowed `vpn_packet` channel, and token hash before accepting traffic.
- Android VPN release `0.2.159` consumes the profile
  `fabric_service_channel_lease`, builds the entry HTTP/WebSocket URLs from
  the lease templates, and sends the service-channel token and signed authority
  headers. A live smoke against `usa-los-1` accepted a valid signed lease and
  rejected a bad token with `403`.
- Node-agent release `0.2.162` adds the first route-manager behavior inside
  the entry runtime. The VPN packet ingress keeps the same runtime object when
  synthetic mesh config refreshes, and it records live send/receive counters,
  the selected route/next hop, route attempts/failures, local-gateway fallback,
  and inbox queue depths.
- Client packet sends now try all valid `vpn_packet` route candidates, with a
  sticky preference for the last successful route. Backend relay fallback is
  reached only after all fabric candidates fail, and telemetry marks that as
  degraded compatibility behavior rather than normal steady-state transport.
- A live smoke on 2026-05-07 against the `usa-los-1` service-channel endpoint
  returned `202 Accepted` and heartbeat telemetry reported route attempts,
  route failure, and selected next hop `home-1`, proving that the report comes
  from the active ingress handler.
- Node-agent release `0.2.163` adds the first service-neutral flow scheduler.
  The scheduler does not make HTTP/RDP/DNS/application decisions. It hashes
  universal IP packets by 5-tuple, or by an opaque packet hash when no tuple
  can be read, into logical `flow-*` channels. Each channel records queue
  depth, enqueue/dequeue counts, drops, high-watermark, and backpressure state.
- Client packet batches are now fanned out by logical channel before route
  forwarding. This is the first step toward letting independent sessions share
  one VPN/fabric connection without a stalled flow hiding the health and
  pressure of other flows.
- A live smoke on 2026-05-07 sent two different packet flows through the signed
  service-channel endpoint and telemetry reported two flow batches, two flow
  channels, two enqueues/dequeues, and zero drops.
- Node-agent release `0.2.164` turns those logical channels into the first
  active scheduling behavior. Each channel remembers its last successful route
  and next hop, the last failed route, send duration, served count, stall count,
  consecutive failures, and whether route rebuild or degraded fallback is
  recommended.
- Scheduled batches are drained with a service-neutral fairness rule:
  non-stalled channels first, then less-served channels, then the least
  recently served channel. This still carries raw VPN/IP packets; it does not
  inspect HTTP, RDP, DNS, Telegram, browser traffic, or any other application
  protocol.
- Route selection is now per-channel. A channel may prefer its last successful
  route and defer its last failed route, so one bad route candidate does not
  keep punishing the same flow on the next send.
- A live smoke on 2026-05-07 posted two flows through `usa-los-1` and reported
  schema `c18l.fabric_service_channel_runtime_report.v1`,
  `send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`,
  `dropped=0`, and per-flow `last_route_id`, `last_next_hop`, `served`,
  `stall_count`, and fallback recommendation fields.
- Backend release `rap-backend:fabric-service-channel-0.2.165` consumes fresh
  entry-node service-channel heartbeat feedback when issuing a new lease. It
  reads `fabric_service_channel_runtime_report.ingress.flow_scheduler`
  `channel_stats`, boosts routes with recent successful flow sends, penalizes
  recent failed routes, and fences routes that explicitly recommend rebuild or
  degraded fallback.
- Fenced routes are not returned as primary or alternate route candidates in a
  service-channel lease. If every route for the selected entry/exit pair is
  fenced by service-channel feedback, the lease enters explicit degraded
  backend fallback with reason
  `fabric_routes_fenced_by_service_channel_feedback`.
- A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
  `vpn_packets` route intents, injected fresh service-channel flow feedback
  marking the higher-priority route as rebuild-required, and the next lease
  selected the lower-priority healthy route with score reason
  `service_channel_recent_success`.
- Backend release `rap-backend:fabric-service-channel-0.2.166` makes that
  route feedback durable. Heartbeat telemetry records service-neutral route
  observations in `fabric_service_channel_route_feedback_observations` and
  updates `fabric_service_channel_route_feedback_latest` with expiring latest
  state per reporter node, service class, and route.
- Lease generation now reads durable latest feedback before falling back to
  fresh heartbeat metadata. This keeps route fencing/boosting available across
  backend restarts and prevents a single heartbeat replacement from erasing
  recent route-health evidence.
- A live smoke on 2026-05-07 persisted a fenced observation for a forced-bad
  higher-priority `test-1 -> test-2` route and a healthy observation for the
  lower-priority route. After backend restart, the next service-channel lease
  selected the healthy route with `service_channel_recent_success`; the durable
  latest table showed the bad route as `fenced` and active.
- Backend release `rap-backend:fabric-service-channel-0.2.167` exposes durable
  feedback for diagnostics and starts feeding it back into route generation.
  Operators can list fresh observations through
  `/clusters/{clusterID}/fabric/service-channels/route-feedback`, and scoped
  node synthetic configs now include a `service_channel_route_feedback` report.
- Synthetic config generation skips routes fenced by the local node's durable
  service-channel feedback while that observation remains active. This is the
  first closed loop from entry-runtime traffic health to the next route config:
  a known-bad route is withheld from that node instead of being re-issued until
  the feedback expires or a new healthy observation replaces it.
- Backend release `rap-backend:fabric-service-channel-0.2.168` adds proactive
  replacement decisions for fenced service-channel routes. When a fenced route
  is withheld, route path decisions now record either
  `service_channel_feedback_replacement` with `replacement_route_id` and
  effective replacement hops, or `service_channel_feedback_no_alternate` when no
  unfenced alternate route exists.
- A live smoke on 2026-05-07 fenced a higher-priority `test-1 -> test-2` route
  and kept a lower-priority healthy route. The scoped `test-1` synthetic config
  excluded the bad route, kept the healthy route, and reported a replacement
  decision from the bad route to the healthy route with score reason
  `selected_unfenced_alternate_route`.
- Backend/node-agent release `0.2.169` adds the first replacement dampening
  behavior. When choosing an alternate for a fenced service-channel route, the
  control plane gives active healthy durable feedback a large stable preference
  and records `active_healthy_feedback_dampening_window` in score reasons. This
  keeps a recently successful replacement selected over a higher-priority but
  unproven route until the feedback expires or a newer observation changes the
  state.
- Route path decision reports now include `degraded_decision_count` for
  `service_channel_feedback_no_alternate`; upgraded node-agents echo
  `replacement_route_id` and degraded counts in heartbeat diagnostics. A live
  smoke on 2026-05-07 confirmed a low-priority healthy replacement beat a
  higher-priority unproven alternate while the healthy feedback was active.
- Node-agent/host-agent hotfix `0.2.171` keeps the signed synthetic config
  contract in sync with the backend feedback report. Agents now preserve
  `service_channel_route_feedback` while recalculating the authority payload
  hash, preventing `0.2.169`-style hash mismatches after C18O/C18Q feedback
  fields are present in control-plane configs. The release is published with
  Docker, Linux service, Windows service, and binary artifacts.
- Backend/web-admin release `0.2.172` adds cluster-level route feedback
  operations: operators can filter current feedback by reporter, route, service
  class, or status, can optionally include expired observations, and can expire
  stale route feedback after verification. Expiring feedback removes it from
  active route selection by moving `expires_at` to now while retaining history
  for audit and diagnostics.
- C18S adds operator-expire churn guardrails. A manual expire now creates an
  audit event, sets `operator_retry_cooldown_until`, and lets the route retry
  with explicit decision reason
  `service_channel_route_retry_after_operator_expire`. If the same reporter
  immediately sends another non-healthy observation for the same route/service
  inside the cooldown, Control Plane records it as
  `operator_retry_cooldown` with zero score adjustment instead of immediately
  re-fencing the route.
- C18T starts automatic service-neutral rebuild orchestration. Route path
  decisions now include rebuild request metadata. Fenced runtime feedback that
  keeps failing outside manual retry cooldown creates a bounded rebuild
  request. If an unfenced alternate is available, Control Plane marks the
  rebuild `applied` and selects that route generation; if no alternate exists,
  it records `pending_degraded_fallback` and keeps backend relay as the
  explicit degraded path until a new route appears. The compatibility release
  `0.2.175` keeps node/host-agent signed-config models aligned with these new
  fields.
- C18U moves rebuild metadata into node-agent runtime behavior. Node-agent
  `0.2.176` builds a local service-channel route-manager snapshot from
  `route_path_decisions`, tracks rebuild request/apply/pending-degraded counts,
  marks rebuilt-away routes as withdrawn, clears a withdrawn cached selected
  route, and filters withdrawn routes from new service-channel candidates. This
  keeps service traffic on the Control Plane replacement instead of repeatedly
  choosing a route that was already fenced. Backend `0.2.176` also makes node
  list version state prefer a node's actual reported target version over stale
  failed update-status rows.
- C18V adds route-manager transition telemetry and churn coverage. Node-agent
  `0.2.177` reports `route_manager_transition` alongside the current manager
  snapshot, including previous/current generation, status, decision count,
  withdrawn route count, restored route count, pending-degraded fallback count,
  rebuild applied count, and any cached selected route cleared because Control
  Plane withdrew it. Coverage verifies three service-neutral lifecycle cases:
  applied rebuild replacement, pending degraded fallback when no alternate is
  available, and rollback/restoration when a fresh config removes the rebuild
  decision.
- C18W adds a live docker-test verification loop for that telemetry. The smoke
  script `scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` creates
  short-lived service-channel route intents, injects durable fenced/healthy
  feedback through the heartbeat contract, observes Control Plane
  `rebuild_status=applied`, waits for node-agent `applied_rebuild`, expires the
  feedback through the operator endpoint, verifies the config has no rebuild
  decision, and waits for `restored_by_new_config`. The passing artifact is
  `artifacts/c18w-service-channel-route-manager-smoke-result.json`. The live
  run also hardened feedback expiration in backend `0.2.179` by avoiding pgx
  mixed timestamp/text parameter inference and array-parameter fragility.
- C18X adds service-neutral logical-channel isolation coverage and fixes a
  route-memory bug found by that coverage. Node-agent `0.2.180` keeps global
  last-route stickiness only for channels with no local route state; if a
  channel has a failed route to avoid, candidates are ordered without falling
  back to the global last selected route. This prevents one failed flow from
  poisoning unrelated flows that are still healthy on the primary route. The
  same slice verifies bounded same-channel backpressure/drop telemetry and
  preserves the existing packet-flow hashing split. The passing smoke artifact
  is `artifacts/c18x-service-channel-logical-channel-smoke-result.json`.
- C18Y adds route-intent lifecycle cleanup for operator/test routes. Backend
  `0.2.181` enriches route-intent list responses with lifecycle state, exposes
  platform-admin `expire` and `disable` actions, and prevents expired route
  policies from being emitted in node-scoped synthetic config. This keeps stale
  smoke route intents visible for audit while stopping agents from probing them
  as live routes. Web-admin Fabric Links now shows route-intent lifecycle
  counts and actions. The passing smoke artifact is
  `artifacts/c18y-route-intent-lifecycle-smoke-result.json`.
- C18Z adds bounded service-channel load coverage around the shared runtime.
  Node-agent `0.2.181` verifies many independent logical packet channels can
  rebuild away from a Control Plane-withdrawn primary route without retrying
  the withdrawn candidate, while same-channel overload reports bounded drops
  and high-water marks. `FabricFlowScheduler.Snapshot` now keeps
  `backpressure_active=true` when bounded drops occurred even if the queue has
  already drained. The docker-test smoke also creates temporary route intents,
  verifies their routes are visible, then expires/disables them and proves they
  disappear from scoped synthetic config. The passing smoke artifact is
  `artifacts/c18z-service-channel-load-smoke-result.json`.
- C18Z1 proves the same runtime through the running node HTTP surface instead
  of only in-process transport tests. Node-agent `0.2.182` adds a dynamic mesh
  listener handler so synthetic-config refreshes swap the active
  `/mesh/v1/forward` and service-channel ingress handler state without
  restarting the listening port. This closes the stale-handler failure where
  route-health probes had fresh routes but production forward still rejected
  live packets with `mesh synthetic route not found`. Backend `0.2.182` keeps
  active degraded/fenced route feedback from being immediately overwritten by a
  newer healthy heartbeat until the feedback expires or is explicitly cleared.
  The live smoke posts signed generic packet batches into `test-1`, verifies
  delivery into the `test-2` fabric inbox, forces a route rebuild, waits for
  node `applied_rebuild`, and verifies the second batch uses the replacement
  route. The passing smoke artifact is
  `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`.
- C18Z2 adds a sustained live ingress and exit-restart smoke. The script
  `scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1` keeps the same
  protocol-neutral service-channel shape, sends multiple signed packet batches
  through `test-1`, restarts the `test-2` exit container, waits for the exit
  runtime to reload Control Plane synthetic config, then proves recovery
  batches are accepted and delivered to the exit inbox. The passing artifact is
  `artifacts/c18z2-live-service-channel-soak-smoke-result.json`; run
  `c18z2-20260507-205112` accepted warm/restart/recovery batches and grew the
  post-restart exit inbox depth from `0` to `88` with zero inbox drops.
- C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
  Node-agent `0.2.183` validates the signed service-channel lease authority and
  forces backend fallback when Control Plane has signed
  `status=degraded_fallback` or `primary_route.status=missing_route_intent`.
  This prevents a node from ignoring the lease decision and accidentally using
  older generic route candidates for the same VPN resource. The rule applies to
  both HTTP packet ingress and WebSocket packet ingress. The live smoke
  `scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
  proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
  recovery while a lease exists, explicit backend fallback when no authorized
  fabric route exists, and route-intent expiry. The passing artifact is
  `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
  run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
  recovery `4/4`, and moved the degraded backend fallback queue from `0` to
  `8`.
- C18Z4 adds live long-session pressure coverage without another runtime
  release. The script
  `scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1` holds
  one signed service-channel WebSocket open, sends 48 batches / 384 packets,
  expires the primary route intent mid-session, waits for the dynamic
  synthetic-config refresh, and verifies the post-switch traffic uses the
  alternate route. The passing artifact is
  `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
  run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
  route failure delta `0`, flow drop delta `0`, and backend fallback queue
  `0 -> 0`. This proves route-policy churn can be absorbed by the shared
  fabric runtime while a service WebSocket remains active.
- C18Z5 adds live exit-node failure coverage while the same kind of service
  WebSocket remains active. The script
  `scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1` sends
  pre-outage traffic, stops the `test-2` exit container while traffic continues,
  starts it again, waits for runtime readiness, and then sends recovery traffic
  over the same signed WebSocket. The passing artifact is
  `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
  `c18z5-20260507-213745` sent 480 packets total, observed route failure delta
  `48`, backend fallback queue `0 -> 192`, flow drop delta `0`, and recovery
  exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
  degraded/fallback telemetry and fabric delivery resumes after runtime
  recovery without requiring the service connection to be rebuilt.
- C18Z6 adds live Control Plane rebuild coverage while a service WebSocket is
  active. The script
  `scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1` injects
  route-health feedback for the primary route, observes Control Plane
  `rebuild_status=applied` with the alternate route as replacement, waits for
  node-agent `route_manager_transition.status=applied_rebuild`, and continues
  traffic over the same signed WebSocket. The passing artifact is
  `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
  `c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
  inbox, selected the replacement route, kept route failure delta `0`, flow
  drop delta `0`, and backend fallback queue `0 -> 0`. This proves route-manager
  replacement can be applied under an active service session without requiring
  the service connection to be recreated.
- C18Z7 adds concurrent service-session isolation coverage. The script
  `scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1`
  opens three signed service-channel WebSockets over the same entry/exit pair,
  interleaves packet batches across them, injects primary-route stale feedback,
  waits for Control Plane `rebuild_status=applied` and node-agent
  `applied_rebuild`, then continues all sessions. The passing artifact is
  `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
  run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
  session, with total backend fallback delta `0`, route failure delta `0`, and
  flow drop delta `0`. This proves concurrent service sessions keep separate
  resource queues and are not starved or poisoned by a shared route-manager
  rebuild.
- C18Z8 adds live backpressure/fairness isolation coverage. The script
  `scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1`
  opens two interactive service-channel WebSockets and one abusive WebSocket on
  the same entry/exit pair. The abusive session overloads a single stable
  5-tuple with 1300 packets while the interactive sessions continue sending
  small batches. The passing artifact is
  `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`;
  run `c18z8-20260507-221347` delivered 192 packets per interactive session,
  hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
  hottest channel, dropped `282` packets on that overloaded channel, and kept
  backend fallback delta `0` and route failure delta `0`. This proves bounded
  queue pressure is service-neutral, observable, and isolated to the overloaded
  logical flow without starving other active sessions.
- C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
  now honors Control Plane `replacement_route_id` as the preferred route when a
  service-channel rebuild decision is applied, instead of only withdrawing the
  stale route and then relying on synthetic-config ordering. The live smoke
  `scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1` creates a
  slow relay primary route (`test-1 -> test-3 -> test-2`) and a fast direct
  replacement (`test-1 -> test-2`), sends 54 batches / 432 packets over one
  signed WebSocket, injects stale-route feedback, waits for Control Plane and
  node-agent `applied_rebuild`, and verifies the same service session continues
  over the fast route. The passing artifact is
  `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
  `c18z9-20260507-224901` kept backend fallback delta `0`, route failure delta
  `0`, and flow drop delta `0`.
- C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
  `0.2.185` binds signed entry/exit pools into the service-channel lease
  authority, keeps selected exit aligned with the selected primary route, and
  allows Control Plane replacement to move to another authorized exit when
  route intents share the same exit-pool/resource metadata key. Node-agent also
  seeds the entry runtime with the signed lease primary route so initial
  traffic follows the lease before normal route-manager ordering. The live
  smoke `scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1`
  creates primary exit `test-1 -> test-2` and alternate exit
  `test-1 -> test-3`, sends 54 batches / 432 packets over one signed
  WebSocket, verifies 144 packets land on the primary exit before feedback,
  injects stale-route feedback, waits for Control Plane and node-agent
  `applied_rebuild`, and verifies 288 packets land on the alternate exit. The
  passing artifact is
  `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
  `c18z10-20260507-232645` kept backend fallback `0`, route failure delta `0`,
  and flow drop delta `0`.
- C18Z11 adds service-channel entry-pool failover contract coverage. Backend
  `rap-backend:fabric-service-channel-0.2.186` keeps
  `selected_entry_node_id` aligned with the selected primary route when the
  healthy route starts at another authorized entry node, and route replacement
  scope now understands entry-pool metadata keys. The live smoke
  `scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1` creates
  primary entry `test-1 -> test-2` and alternate entry `test-3 -> test-2`,
  sends 144 packets through the initial `test-1` lease, injects feedback for
  the primary entry route, refreshes the lease, verifies the new lease selects
  `test-3`, and sends 288 more packets through the alternate entry to the same
  exit. The passing artifact is
  `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`; run
  `c18z11-20260507-235341` delivered 432 packets to the exit, kept backend
  fallback `0`, route failure deltas `0/0`, and flow drop deltas `0/0`. This
  proves the Control Plane lease/reconnect contract for entry replacement; it
  does not claim that a broken client-to-entry socket survives entry-node loss.
- C18Z12 adds the first route quality scoring layer for lease selection.
  Backend `rap-backend:fabric-service-channel-0.2.187` consumes
  service-neutral runtime feedback from
  `fabric_service_channel_runtime_report.ingress.flow_scheduler`: fast
  `last_send_duration_ms` values boost a route, slow values penalize it, and
  recent failures/stalls apply bounded penalties. This is explicitly
  application-protocol neutral; it scores the shared fabric channel rather than
  HTTP, RDP, DNS, or any other payload type. The smoke
  `scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1` creates a
  higher-priority slow relay route and a lower-priority fast direct route. The
  initial lease selects the slow route by policy priority; after runtime
  telemetry reports fast route `8ms` and slow route `900ms`, the refreshed lease
  selects the fast route with score reason
  `service_channel_quality_latency_le_10ms`. The passing artifact is
  `artifacts/c18z12-service-channel-route-quality-smoke-result.json`; run
  `c18z12-20260508-000209` passed and expired its temporary route intents.
- C18Z13 closes the first live self-learning route-quality loop. Node-agent
  `0.2.188` records any positive sub-millisecond service-channel send duration
  as `1ms` instead of `0ms`, so very fast routes still produce actionable
  quality telemetry. The live smoke
  `scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1` does
  not inject a synthetic heartbeat. It first proves policy priority selects a
  higher-priority relay route, expires that route, sends 24 real
  service-channel batches / 192 packets through the fast direct route, waits
  for the node-agent heartbeat to persist healthy route feedback in the
  backend, then introduces a new higher-priority relay candidate. The refreshed
  lease selects the already-learned fast route with score reasons
  `service_channel_recent_success` and
  `service_channel_quality_latency_le_10ms`. The passing artifact is
  `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`; run
  `c18z13-20260508-001610` delivered all 192 packets to the exit, kept backend
  fallback and flow drops at `0`, and expired its temporary route intents.
- C18Z14 makes the learned route-quality loop active-session aware. Backend
  `rap-backend:fabric-service-channel-0.2.190` decays older healthy
  service-channel feedback before route scoring, so stale success does not keep
  full weight until expiry. Node-agent `0.2.189` consumes healthy
  service-channel route-quality observations from the signed synthetic config
  and can prefer a significantly better learned route over a sticky per-flow
  route/config-order candidate. The smoke
  `scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1`
  keeps one signed WebSocket service-channel session open across route
  generation changes: it starts on a higher-priority relay route, expires that
  route, sends real traffic over the fast route to teach backend feedback, then
  introduces a new higher-priority relay candidate. The same active WebSocket
  continues on the learned fast route. The passing artifact is
  `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
  run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
  packets to the exit, kept backend fallback and flow drops at `0`, and
  expired its temporary route intents.
- C18Z15 exposes and hardens effective route-quality preference telemetry.
  Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
  `score_adjustment` and decayed `effective_score_adjustment` in
  service-channel feedback observations. Node-agent `0.2.190` consumes the
  effective score for active route preference decisions, keeps the raw score
  for diagnostics, and exposes sorted `route_quality_preferences` in runtime
  telemetry. The smoke
  `scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1`
  wraps the active-session quality-shift scenario and verifies that route
  preferences, effective scores, and age-decayed scores are visible. The
  passing artifact is
  `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
  run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
  packets to the exit, kept backend fallback and flow drops at `0`, and
  exposed decayed effective scores in node telemetry.
- C18Z16 adds per-channel route-quality preference telemetry and fairness
  guardrails. Node-agent `0.2.191` records the applied
  `quality_preference_route_id`, effective/raw score, and reasons on each
  flow-scheduler channel that uses a quality-preferred route. Unit coverage
  proves a learned route-quality preference can move multiple logical channels
  to the fast route without merging their queues or dropping packets. The smoke
  `scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1`
  validates the live route-quality shift with per-channel diagnostics. The
  passing artifact is
  `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
  run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
  logical channels, applied quality preference telemetry to all 32 served
  channels, and kept backend fallback and flow drops at `0`.
- C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
  removes channel-level quality preference diagnostics when the preference is no
  longer present in the current effective preference set or when the preferred
  route is withdrawn by the route manager. The smoke
  `scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
  verifies that active channel markers reference visible preferences, stale
  markers are absent, expired route intents are not active, and the session
  completes without backend fallback. The passing artifact is
  `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
  run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
  quality markers, found `0` stale markers, and kept backend fallback and flow
  drops at `0`.
- C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
  `0.2.193` now keys runtime-sent logical channels as
  `vpn:{vpnConnectionID}:flow-NN`, while keeping the low-level scheduler API
  compatible with unscoped unit tests. This prevents two simultaneous
  VPN/service sessions that share the same entry/exit and same IP-flow shard
  from sharing route-failure memory or diagnostic markers. Unit coverage proves
  `vpn-a` can avoid a failed primary route while `vpn-b` keeps the healthy
  primary route for the same packet flow. The smoke
  `scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
  wraps the live C18Z17 route-quality/fairness path, verifies served live
  channel names are session-scoped and no unscoped served `flow-NN` channels
  remain, and keeps backend fallback and flow drops at zero. The passing
  artifact is
  `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
  run `c18z14-20260508-082520` served 32 session-scoped channels, applied
  quality markers to all 32, and kept backend fallback and flow drops at `0`.
- C18Z19 adds the first bounded parallel send window for independent
  service-channel logical flows. Node-agent `0.2.194` can send scheduled
  logical channels concurrently with `MaxParallelFlowSends=4` in the live
  node-agent runtime, while older/default in-process behavior remains
  sequential unless the window is explicitly set. This keeps the data path
  protocol-neutral: it does not inspect HTTP, RDP, DNS, Telegram, or browser
  traffic; it only prevents one slow logical flow/channel from blocking another
  independent channel in the same shared fabric service path. Telemetry now
  exposes `max_parallel_flow_sends` and `send_flow_parallel_batches`. Unit
  coverage blocks one logical channel and proves another channel completes
  before the slow channel is released. The smoke
  `scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1`
  wraps the live C18Z18 path and verifies the parallel window is enabled and
  observed in runtime telemetry. The passing artifact is
  `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`;
  run `c18z14-20260508-084133` delivered 480 packets, observed
  `max_parallel_flow_sends=4`, `send_flow_parallel_batches=60`, backend
  fallback `0`, and flow drops `0`.
- C18Z20 adds per-channel latency/retry/in-flight telemetry and the first
  adaptive recommended parallel window. Node-agent `0.2.195` tracks
  scheduler-level `in_flight`, `max_in_flight`, and slow/failing channel
  counts, plus per-channel `send_attempts`, `send_successes`, `send_failures`,
  `in_flight`, `max_in_flight`, and latency buckets (`<=10ms`, `<=100ms`,
  `<=1000ms`, `>1000ms`). The runtime also reports
  `recommended_parallel_flow_sends`, which currently shrinks the window when
  it sees bounded drops, degraded fallback recommendations, repeated failures,
  or slow/stalled channels. Unit coverage proves the recommended window
  shrinks under queue/route pressure and that the parallel window still lets
  an independent channel complete while another is blocked.
  The smoke
  `scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1`
  wraps the live C18Z19 path and verifies the new telemetry is visible on real
  docker-test nodes. The passing artifact is
  `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`;
  run `c18z14-20260508-085635` delivered 480 packets, observed
  `max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
  `scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
  served channels, backend fallback `0`, and flow drops `0`.
- C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
  keeps the lifetime counters for audit visibility, but adaptive send-window
  pressure now comes from the bounded recent quality window, so old drops and
  old route failures roll out after successful fresh samples. The scheduler
  exposes aggregate rolling-window sample/failure/slow/drop counters and each
  channel exposes sample, success, failure, slow, drop, average-latency, and
  last-updated telemetry. Unit coverage proves old pressure is forgotten by the
  rolling window while lifetime counters remain visible. The smoke
  `scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1`
  wraps the live C18Z20 path and verifies the new telemetry on real docker-test
  nodes. The passing artifact is
  `artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json`;
  run `c18z14-20260508-091952` delivered 480 packets, observed
  `scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
  drops `0`, rolling samples/success/latency on all 32 served channels,
  `recommended_parallel_flow_sends=4`, backend fallback `0`, and flow drops `0`.
- C18Z22 connects the rolling window to backend durable route feedback. Backend
  `rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
  from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
  counts plus rolling average latency when persisting
  `fabric_service_channel` route feedback. Lifetime fields remain available as
  fallback for older agents, but they no longer dominate scoring when a current
  rolling window is present and clean. The smoke
  `scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1` wraps the
  live C18Z21 path and verifies persisted feedback includes
  `service_channel_rolling_quality_window` and payload `quality_window_*`
  fields. The passing artifact is
  `artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
  `c18z14-20260508-093100` delivered 480 packets, observed one persisted
  healthy rolling feedback item with rolling payload, backend fallback `0`, and
  flow drops `0`.
- C18Z23 adds route recovery hysteresis. Backend
  `rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
  healthy rolling-window feedback during an operator-expire/manual retry
  cooldown, but applies a bounded score penalty (`150`) and the
  `service_channel_recovery_hysteresis` reason. The recovered route remains
  authorized and available as an alternate, while a steady healthy route can
  remain primary until the recovery window proves stable enough. The smoke
  `scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1` wraps
  the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
  forwarding, and the unit hysteresis contract. The passing artifact is
  `artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
  `c18z14-20260508-094111` delivered 480 packets with backend fallback `0` and
  flow drops `0`.
- C18Z24 exposes that recovery state to operators and API consumers. Backend
  `rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
  responses and node-scoped service-channel feedback reports with
  `recovery_state`, `recovery_hysteresis_active`, and
  `recovery_hysteresis_penalty`; route path decision reports now include
  `recovery_hysteresis_count`. Web-admin shows recovered/hysteresis chips and a
  recovery column next to route feedback status, score, reasons, retry
  cooldown, and expiry. The smoke
  `scripts/fabric/c18z24-service-channel-recovery-visibility-smoke.ps1`
  verifies backend `0.2.199`, unit recovery visibility, and live
  route-feedback API recovery shape. The passing artifact is
  `artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json`;
  live API returned 109 feedback observations with recovery state shape.
- C18Z25 adds a stability threshold before recovered routes can become steady
  again. Backend `rap-backend:fabric-service-channel-0.2.200` keeps routes
  recovered through manual retry under hysteresis until they report at least
  64 clean rolling-window samples (`success >= 64`, failures/slow/drops `0`).
  Once the threshold is met, the route is promoted back to `healthy`, gets
  `recovery_promoted=true` and `service_channel_recovery_promoted`, and no
  longer receives the hysteresis penalty. Admin/API expose promoted counts and
  flags beside recovered/hysteresis state. The smoke
  `scripts/fabric/c18z25-service-channel-recovery-promotion-smoke.ps1`
  verifies backend `0.2.200`, the promotion unit contract, and live
  route-feedback API recovery shape. The passing artifact is
  `artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json`.
- C18Z26 adds explicit demotion after recovery promotion. Backend
  `rap-backend:fabric-service-channel-0.2.201` marks a recovered/promoted route
  under retry cooldown as `recovery_demoted=true` when fresh rolling feedback
  shows failures, drops, slow samples, degraded fallback, rebuild
  recommendation, or fenced state. The demotion includes a concrete
  `recovery_reason`, adds `service_channel_recovery_demoted` plus the specific
  reason to route score reasons, and increments `recovery_demoted_count` in
  route path decision reports. Web-admin shows demoted feedback/path chips and
  reason text. The smoke
  `scripts/fabric/c18z26-service-channel-recovery-demotion-smoke.ps1` verifies
  backend `0.2.201`, demotion unit coverage, and live route-feedback API
  recovery shape. The passing artifact is
  `artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json`.
- C18Z27 adds cluster-level recovery policy tuning. Backend
  `rap-backend:fabric-service-channel-0.2.202` exposes
  `GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy`,
  backed by strict defaults plus optional cluster metadata override
  `fabric_service_channel_recovery_policy`. The policy controls hysteresis
  penalty, promotion minimum samples, demotion thresholds for failures, drops,
  and slow samples, and rebuild/fenced demotion toggles. Lease route selection,
  route feedback reports, and node-scoped synthetic config feedback consume the
  effective policy. Web-admin shows and edits the policy in the
  service-channel diagnostics card. The smoke
  `scripts/fabric/c18z27-service-channel-recovery-policy-smoke.ps1` verifies
  backend `0.2.202`, policy unit coverage, live GET/PUT policy API, and default
  restoration. The passing artifact is
  `artifacts/c18z27-service-channel-recovery-policy-smoke-result.json`.
- C18Z28 adds recovery policy provenance to service-channel diagnostics.
  Backend `rap-backend:fabric-service-channel-0.2.203` includes the effective
  recovery policy on `FabricServiceChannelRoute`,
  `FabricServiceChannelLease`, signed lease authority payloads, route feedback
  reports, and route path decision reports. This lets operators audit a
  primary route, alternate route, degraded fallback, or path decision against
  the exact policy source and thresholds that produced the score/recovery
  state. Web-admin node diagnostics show the policy source and key thresholds
  beside service-channel feedback and route decisions. The smoke
  `scripts/fabric/c18z28-service-channel-recovery-policy-provenance-smoke.ps1`
  verifies backend `0.2.203`, live synthetic config provenance, live lease
  provenance, primary route provenance, and signed authority-payload
  provenance. The passing artifact is
  `artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json`.
- C18Z29 adds feedback provenance guardrails. Backend
  `rap-backend:fabric-service-channel-0.2.204` computes a stable recovery
  policy fingerprint and recognizes optional runtime feedback provenance:
  `recovery_policy_fingerprint`, `route_generation`, `route_policy_version`,
  and `policy_version`. Route feedback observations expose observed/effective
  policy fingerprints and route generations, while reports expose missing and
  stale counters. Feedback that explicitly came from an old policy or route
  generation is still visible, but it is scored conservatively and cannot fence
  or rebuild a current route. Missing provenance remains compatible for old
  node-agents. The smoke
  `scripts/fabric/c18z29-service-channel-feedback-provenance-guard-smoke.ps1`
  verifies backend `0.2.204`, unit guardrails, live policy fingerprint, and
  live feedback provenance counter shape. The passing artifact is
  `artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json`.
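
The recovery notes in C18Z23 through C18Z27 describe a small state decision
for a route under retry cooldown. The sketch below is one possible reading of
it, not the backend implementation: the class names, function names, state
strings, and the "any bad sample demotes" thresholds are assumptions made for
illustration. Only the penalty `150` and the 64 clean-sample promotion
threshold come from the notes above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical policy shape; the real fields live in the backend."""
    hysteresis_penalty: int = 150     # bounded penalty from C18Z23
    promotion_min_samples: int = 64   # clean-sample threshold from C18Z25
    demote_failures: int = 1          # assumed default: any failure demotes
    demote_drops: int = 1             # assumed default: any drop demotes
    demote_slow: int = 1              # assumed default: any slow sample demotes


@dataclass(frozen=True)
class RollingWindow:
    """Counts taken from the fresh rolling quality window only."""
    success: int = 0
    failures: int = 0
    slow: int = 0
    drops: int = 0


def recovery_state(window: RollingWindow,
                   policy: RecoveryPolicy = RecoveryPolicy()) -> str:
    """Classify a recovered route as demoted, promoted, or under hysteresis."""
    if (window.failures >= policy.demote_failures
            or window.drops >= policy.demote_drops
            or window.slow >= policy.demote_slow):
        return "demoted"
    if window.success >= policy.promotion_min_samples:
        return "promoted"
    return "hysteresis"


def score_penalty(state: str,
                  policy: RecoveryPolicy = RecoveryPolicy()) -> int:
    """Only the hysteresis state carries the bounded penalty in this sketch.

    How a demoted route is scored is deliberately not modeled here.
    """
    return policy.hysteresis_penalty if state == "hysteresis" else 0
```

A route with 10 clean samples stays under hysteresis with the penalty applied,
a route with 64 clean samples is promoted, and a single fresh drop demotes it
regardless of how many successes it has accumulated.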

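The split between lifetime counters and a bounded rolling window (C18Z21) can
be illustrated with a small sketch. Everything here is hypothetical: the class
name, the window size, and the rule that halves the send window while a recent
failure is visible are assumptions, not node-agent behavior. The point is only
that old failures roll out of the window after fresh successful samples while
the lifetime counters stay visible for audit.

```python
from collections import deque


class ChannelQuality:
    """Lifetime counters for audit plus a bounded window of recent samples."""

    def __init__(self, window_size: int = 64) -> None:
        self.lifetime_samples = 0
        self.lifetime_failures = 0
        # deque(maxlen=...) silently discards the oldest sample when full.
        self._recent = deque(maxlen=window_size)

    def record(self, ok: bool, latency_ms: int) -> None:
        """Record one send result in both the lifetime and rolling views."""
        self.lifetime_samples += 1
        if not ok:
            self.lifetime_failures += 1
        self._recent.append((ok, latency_ms))

    def window_failures(self) -> int:
        """Failures still visible in the rolling window."""
        return sum(1 for ok, _ in self._recent if not ok)

    def window_avg_latency_ms(self) -> float:
        """Average latency over the rolling window, 0.0 when empty."""
        if not self._recent:
            return 0.0
        return sum(ms for _, ms in self._recent) / len(self._recent)

    def recommended_parallel_sends(self, max_window: int = 4) -> int:
        """Assumed rule: halve the window while a recent failure is visible."""
        if self.window_failures() > 0:
            return max(1, max_window // 2)
        return max_window
```

With a window of four samples, two early failures reduce the recommended
window to 2; after four clean sends the failures have rolled out, the
recommendation returns to 4, and `lifetime_failures` still reports 2.
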
## Implementation Order

1. Define and test the generic service-channel lease and route-generation
   contract in the backend. Done for the first VPN packet consumer.
2. Add node-agent entry runtime that accepts a client/service live connection
   and maps it to a fabric route. Done for the first VPN packet HTTP/WebSocket
   ingress with signed lease verification.
3. Add node-agent route manager with primary/alternate route selection,
   generation fencing, health feedback, and failover. The first
   alternate-route retry and live telemetry slice landed in `0.2.162`;
   generation fencing, active health feedback, and route rebuild triggers
   remain.
4. Add service-neutral channel scheduling and bounded queues. Protocol-neutral
   IP-flow hashing and queue/backpressure telemetry landed in `0.2.163`; the
   first fair drain, route memory, failed-route avoidance, and rebuild/degraded
   fallback signals landed in `0.2.164`. Async per-channel workers, load
   shedding policy, and deeper route rebuild history remain. The first Control
   Plane lease-time feedback consumer landed in backend `0.2.165`; durable
   latest route feedback landed in backend `0.2.166`; admin diagnostics and
   fenced-route avoidance in synthetic config landed in backend `0.2.167`;
   proactive replacement decisions landed in backend `0.2.168`; dampened
   healthy replacement preference and degraded/no-alternate counts landed in
   `0.2.169`; operator-expire retry cooldown guardrails landed in C18S; bounded
   rebuild request/decision metadata landed in C18T; node-agent runtime
   withdrawal/replacement consumption landed in C18U; route-manager transition
   telemetry and restore/pending fallback coverage landed in C18V; live
   Control Plane/runtime route-manager verification landed in C18W; per-logical
   channel failed-route isolation and bounded backpressure coverage landed in
   C18X; route-intent lifecycle cleanup and synthetic-config expired-route
   filtering landed in C18Y; bounded multi-channel load/rebuild/drop telemetry
   coverage landed in C18Z; live signed service-channel ingress through the
   running mesh listener landed in C18Z1; sustained live ingress with exit-node
   restart/recovery coverage landed in C18Z2; signed degraded fallback
   enforcement plus entry restart/WebSocket parity landed in C18Z3; long-lived
   WebSocket pressure with mid-session route-policy churn landed in C18Z4; live
   exit-node restart/fallback/recovery under an active WebSocket landed in
   C18Z5; live Control Plane rebuild replacement under an active WebSocket
   landed in C18Z6; concurrent active WebSocket/session isolation under rebuild
   landed in C18Z7; active backpressure/fairness isolation for overloaded
   logical flows landed in C18Z8; route-pool replacement preference landed in
   C18Z9; exit-pool failover landed in C18Z10; entry-pool failover contract
   landed in C18Z11; route quality scoring landed in C18Z12; live
   self-learning route quality from real service-channel traffic landed in
   C18Z13; active-session route-quality preference and backend feedback age
   decay landed in C18Z14; effective route-quality score telemetry and
   node-side effective score consumption landed in C18Z15; per-channel
   route-quality preference telemetry and multi-channel fairness guardrails
   landed in C18Z16; stale route-quality marker cleanup landed in C18Z17;
   service-session-scoped flow scheduler memory landed in C18Z18; bounded
   parallel logical-flow send windows landed in C18Z19; per-channel
   latency/retry/in-flight telemetry plus adaptive recommended window landed in
   C18Z20; rolling quality windows landed in C18Z21; backend rolling feedback
   consumption landed in C18Z22; recovery hysteresis landed in C18Z23; recovery
  state API/admin visibility landed in C18Z24; recovery promotion threshold
  policy landed in C18Z25; recovery demotion telemetry/policy landed in
  C18Z26; cluster-level recovery policy tuning landed in C18Z27; recovery
  policy provenance landed in C18Z28; feedback provenance guardrails landed
  in C18Z29; node-agent per-flow feedback provenance and backend heartbeat
  preservation landed in C18Z30; durable backend route-rebuild attempt ledger,
  API visibility, and admin diagnostics landed in C18Z31; generation-strict
  rebuild timeline correlation with node-agent route-manager/route-generation
  heartbeat telemetry and post-rebuild traffic counters landed in C18Z32;
  computed rebuild guard status/severity/reason fields and admin guard chips
  landed in C18Z33; cluster-level rebuild health summary endpoint/admin panel
  with affected nodes/routes and recommended operator action landed in C18Z34;
  generation-scoped operator silence for rebuild-health alerts landed in
  C18Z35; resurfacing detection for new generations after an operator silence
  landed in C18Z36; fast service-channel readiness gate landed in C18Z37;
  default-fast rebuild ledger summary with explicit deep enrichment landed in
  C18Z38; bounded deep-ledger drilldown by reporter/route/service/generation
  with offset pagination landed in C18Z39; bounded rebuild incident grouping
  with one-click deep investigation landed in C18Z40; audited incident
  investigation and incident-level silence actions landed in C18Z41; durable
  rebuild correlation/guard snapshots for fast warm readiness/health/incidents
  landed in C18Z42; service-channel schema preflight for migration-safe manual
  deploys landed in C18Z43; bounded rebuild snapshot warmup for missing
  correlation snapshots plus stale-snapshot detection landed in C18Z44;
  heartbeat-triggered auto-warmup for runtime-evidence rebuild snapshots landed
  in C18Z45; rebuild snapshot maintenance health with overdue/runtime-evidence
  visibility landed in C18Z46; node-agent signed service-channel lease
  enforcement when cluster authority is pinned landed in C18Z47; backend
  introspection fallback for unsigned compatibility clients landed in C18Z48;
  accepted-by telemetry for signed/introspection/legacy ingress landed in
  C18Z49; durable lease introspection across backend restarts landed in C18Z50;
  bounded durable lease cleanup and admin visibility landed in C18Z51; durable
  accepted-by access telemetry aggregation with heartbeat fallback and admin
  visibility landed in C18Z52; active lease/session correlation with
  entry/exit, route status, fallback, and latest route-quality feedback
  visibility landed in C18Z53; C18Z54 smoke proves the same diagnostics on a
  normal non-fallback primary route with healthy rolling route-quality feedback;
  C18Z55 smoke proves degraded/fenced normal-route feedback is shown separately
  from explicit backend fallback; C18Z56 adds active-channel remediation
  diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
  `use_backend_fallback`) to make the next runtime action explicit, and its
  alternate-route branch is live-smoke-proven with backend fallback kept off.
  C18Z57 adds the bounded machine-readable `remediation_command` contract to
  active access telemetry rows so route-manager can consume a short-lived
  `prefer_alternate_route` command with primary/replacement route ids and TTL.
  C18Z58 projects those commands into node-scoped synthetic mesh config and the
  node-agent route-manager consumes them as explicit applied replacement
  decisions sourced from `service_channel_remediation_command`. C18Z59 proves
  post-remediation service-channel traffic actually selects the replacement
  route in runtime/flow telemetry without local/backend fallback. C18Z60 proves
  the same remediation path for multiple independent VPN flow channels in one
  packet batch, with replacement-route flow stats, no flow drops, no route
  failures, and no degraded fallback. C18Z61 proves the remediation replacement
  path under a larger 128-packet pressure batch with 32 replacement-route flow
  stats, scheduler high-watermark 5, max-in-flight 4, no drops, no route
  failures, and no degraded fallback. C18Z62 adds neutral service-channel
  traffic-class QoS wiring: HTTP ingress accepts `X-RAP-Traffic-Class`, the
  scheduler keeps distinct traffic-class channel ids/stats, unit tests prove
  priority ordering, and live smoke proves bulk pressure plus interactive
  traffic both use the replacement route without fallback, drops, or route
  failures. C18Z63 proves concurrent QoS isolation in the runtime: an
  interactive traffic-class packet completes while a bulk send is deliberately
  held in-flight, with traffic-class stats, no drops, and no failures. C18Z64
  adds compact `traffic_class_counts` telemetry to flow-scheduler snapshots so
  diagnostics can see active flow-channel distribution by traffic class without
  scanning every channel stat; it is live-proven on docker-test with bulk and
  interactive counts visible in heartbeat metadata. C18Z65/C18Z66 project this
  QoS/pressure telemetry into backend access telemetry and web-admin at cluster,
  node, and active-channel levels. C18Z67 proves the live HTTP concurrent QoS
  path under pressure: six parallel bulk service-channel requests and one
  interactive request share the same entry path after remediation; the
  interactive request completes in 132 ms, 3072 post-remediation packets move
  over the replacement route, bulk/interactive replacement-route flow stats are
  visible, and fallback, route failures, flow drops, and scheduler drops remain
  0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
  access telemetry now reports `flow_health_status` and `flow_health_reason` at
  cluster, node, and active-channel levels using traffic-class pressure, queue
  pressure, flow drops, backend fallback, route-quality failures/drops/slow
  samples, and route send latency. C18Z69 adds node-side adaptive response:
  runtime heartbeat flow-scheduler snapshots now include per-class
  `recommended_parallel_windows` and an adaptive backpressure reason, and the
  send path applies the traffic-class-specific window so that bulk/droppable
  windows shrink before interactive/control windows under pressure. C18Z70
  projects those adaptive
  runtime fields into backend access telemetry and web-admin at cluster, node,
  and active-channel levels, with cluster windows aggregated by minimum non-zero
  recommended window per class. C18Z71 adds an audited cluster adaptive-policy
  contract for max window, queue/bulk thresholds, and per-class windows; the
  effective policy fingerprint is signed into node synthetic config, reported
  in runtime heartbeats, and consumed by node-agent scheduling so operators can
  tune shared fabric backpressure without changing VPN/RDP-specific code.
  C18Z72 adds an audited pool/failover policy contract for entry/exit pool
  constraints, preferred entry/exit, selection strategy, failover modes,
  backend fallback allowance, and sticky session mode. Lease issuance applies
  that policy before route selection and signs the effective `pool_policy`
  provenance into the service-channel lease authority payload. C18Z73 projects
  that signed pool-policy fingerprint into active access telemetry and guards
  remediation commands: backend rejects alternate routes outside the signed
  entry/exit lease pools and emits `rebuild_route`, while node-agent
  defensively ignores any guarded rejected `prefer_alternate_route` command
  before route-manager application. Web-admin shows pool/remediation guard
  status in access telemetry and node synthetic-config remediation rows. C18Z74
  correlates active remediation commands with the entry node route-manager
  heartbeat so access telemetry shows execution state:
  `waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
  `pending_rebuild_request`, or `expired`, with reason/generation/observed-at.
  C18Z75 records `rebuild_route` remediation as durable rebuild ledger intent
  rows when node-scoped synthetic config is fetched: allowed commands become
  `rebuild_status=requested` / `outcome=rebuild_requested`, while policy-guard
  rejects become `rebuild_status=rejected` /
  `outcome=policy_guard_rejected`. Access telemetry then reports
  `rebuild_request_recorded` or `rebuild_request_rejected` for the active
  channel. C18Z76 adds node-side acknowledgement for the allowed
  `rebuild_route` branch: node-agent consumes the command as a route-manager
  `pending_degraded_fallback` decision with source
  `service_channel_remediation_command`, while guarded commands remain ignored.
  Backend access telemetry correlates that heartbeat evidence with the durable
  ledger and reports `rebuild_request_recorded_node_pending`. C18Z77 resolves
  those durable remediation rebuild requests in the Control Plane planner:
  valid alternates inside the active signed lease pools become `applied` /
  `replacement_selected` route-manager decisions with the same command id,
  missing safe alternates become `no_alternate`, policy/lease blocks become
  `deferred_by_policy`, and stale commands become `expired`. Access telemetry
  reports these as `rebuild_request_applied`,
  `rebuild_request_no_alternate`, `rebuild_request_deferred_by_policy`, or
  `rebuild_request_expired`. C18Z78 adds operator-facing visibility for those
  planner outcomes in web-admin and live-proves the applied branch: when an
  alternate route appears after lease issuance, the existing `rebuild_route`
  command resolves to `applied` / `replacement_selected` and access telemetry
  reports `rebuild_request_applied`.
  C18Z79 closes that applied-branch proof loop: after the planner resolves the
  existing rebuild command to a replacement route, the entry node reports a
  route-manager decision for the same `rebuild_request_id`, the transition is
  `applied_rebuild`, and live service-channel packet ingress selects the
  replacement route with no local/backend fallback, route failures, or flow
  drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
  service-channel packet bursts remain on the replacement route, no stale
  primary route is reselected, and fallback, route-failure, flow-drop, and
  scheduler-drop deltas remain zero from the pre-pressure baseline. C18Z81
  proves the negative recovery branch: when the already-applied replacement
  route reports generation-valid fenced feedback, the Control Plane selects a
  new safe recovery route and live traffic moves to that recovery route without
  reselecting the degraded replacement or adding fallback/failure/drop deltas.
  C18Z82 proves the no-safe-recovery branch: if that replacement is also fenced
  and no safe recovery route exists, synthetic config reports
  `service_channel_feedback_no_alternate` / `pending_degraded_fallback` with
  `no_unfenced_alternate_route` instead of silently keeping a bad route.
  C18Z83 projects that route-manager decision into active access telemetry and
  web-admin active-channel diagnostics, including decision source, route id,
  replacement route id, rebuild status/reason/generation, and score reasons.
  C18Z84 aggregates those decisions at access-telemetry summary level so the
  operator can see replacement, applied rebuild, recovery, and no-safe counts
  without drilling into individual channel rows.
  C18Z85 projects those access-decision aggregates into rebuild health and
  incidents, adding `incident_source=access_decision` rows for active
  no-safe/recovery/applied route-decision states. C18Z86 adds
  channel-scoped silence/acknowledgement for those access-decision incidents:
  the silence API accepts `incident_source` and `channel_id`, stores no-safe
  access silences under a channel-scoped route key, and rebuild
  health/incidents apply those silences so acknowledged current-generation
  no-safe decisions are not counted as active bad incidents. Resurfacing on
  generation change is covered in unit tests; live runtime smoke proves the
  operator silence path. C18Z87 exposes active silences through the API and
  web-admin, including access-decision source/channel/display route metadata,
  and adds unsilence so an acknowledged access no-safe incident can be made
  active again without waiting for TTL expiry. C18Z88 exposes access-decision
  resurface details on incidents: the silence id, previous acknowledged
  generation, and silence expiry are returned when the current active-channel
  decision changes generation after acknowledgement. The live smoke proves the
  incident resurfaces as active bad while preserving previous-generation
  context for the operator. C18Z89 closes the generation-change operator action
  loop for resurfaced access-decision incidents: incidents now include
  `alert_resurfaced_cause`, previous route id, and previous channel id;
  web-admin shows the cause; and the live smoke proves the operator can
  re-acknowledge the resurfaced generation after validating that active-channel
  decision route/generation context matches the incident. C18Z90 introduces an
  explicit signed production data-plane contract on service-channel leases:
  `data_plane` is present in the lease, authority payload, introspection
  response, and lease-maintenance/admin list. It declares backend API as
  control-plane transport, fabric service channel/fabric route as working
  data/steady-state transport, backend relay as degraded fallback only, and
  service-neutral protocol-agnostic isolated logical flows as the runtime
  contract for VPN, Remote Workspace, files, video, and future services. C18Z91
  makes node-agent consume the signed/introspected data-plane contract, apply
  the preferred fabric route, log data-plane mode/transports/fallback policy,
  and report contract adoption in heartbeat access telemetry. C18Z92 enforces
  the fallback boundary: when `backend_relay_policy=disabled`, route failure or
  missing fabric route returns a visible service-channel error instead of
  silently proxying working data through backend relay. C18Z93-C18Z95 project
  that data-plane contract and blocked-fallback evidence into access telemetry,
  incidents, and node-agent heartbeat reports. C18Z96-C18Z98 feed
  access-report-derived blocked fallback send failures into durable route
  feedback and rebuild ledger correlation, with bounded deduplication and
  feedback identity carried into replacement decisions. C18Z99 adds rebuild
  ledger filters for `feedback_source`, `feedback_channel_id`, and
  `feedback_violation_status`. C18Z100 aggregates those same fields in
  rebuild-health `feedback_breakdowns`, including active warn/bad, silenced,
  latest observation, and affected reporter node/route counts, and web-admin
  shows the breakdown in the Rebuild health panel. C18Z101 connects that
  operator view to investigation: each breakdown row shows related incident
  context by channel/reporter/route overlap and can open the deep rebuild
  ledger with source/channel/violation filters prefilled. C18Z102 adds backend
  audit breadcrumbs for that drilldown, recording
  `fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
  events with the feedback source/channel/violation filters before the panel
  opens the filtered deep ledger. C18Z103 surfaces recent rebuild incident and
  feedback-breakdown investigation audit breadcrumbs directly in the Fabric
  diagnostics panel with time, source, feedback filters, target reporter/route,
  actor, and reason. C18Z104 adds focused audit loading: the cluster audit API
  accepts `event_type` and `target_type` filters, and the Fabric diagnostics
  panel requests just the recent fabric investigation breadcrumbs instead of
  relying on the generic latest cluster audit window. C18Z105 correlates those
  breadcrumbs back to the currently visible rebuild-health feedback breakdowns
  or rebuild incidents in web-admin, marking whether the diagnostic object is
  still active/visible and giving the operator a direct `open` action. C18Z106
  moves that correlation into the backend/API: focused audit reads with
  `correlation=fabric_diagnostics` return `correlation_hints` containing the
  current diagnostic status and matching breakdown/incident object when
  present. The rebuild-health feedback breakdown window was also raised to 100
  groups so fresh failure classes remain visible on noisy long-running test
  clusters. C18Z107 adds compact `audit_summary` aggregates for focused Fabric
  diagnostics audit reads, including counts by current diagnostic status,
  feedback source, feedback violation status, correlated/not-visible totals,
  and latest time, and web-admin shows those counts above the investigation
  rows. C18Z108 splits the operator workflow read from generic cluster audit:
  `GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
  returns a dedicated `rebuild_investigation_breadcrumbs` contract with events
  and summary, and web-admin consumes that endpoint for Recent investigations.
C18Z109 adds freshness windows to that contract: callers can pass
`current_window_seconds` and `history_window_seconds`, events are marked
`current`, `stale`, or `expired` in `correlation_hints.breadcrumb_status`,
and the summary includes counts by breadcrumb status for operator triage.
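
The freshness windows above can be pictured as a simple age classification. The following is a minimal sketch, not the backend implementation: the function names, the inclusive boundaries, and the rule that anything older than `history_window_seconds` is `expired` are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def breadcrumb_status(observed_at, now, current_window_seconds, history_window_seconds):
    """Classify one breadcrumb by age (boundary rules are assumed, not confirmed)."""
    age = (now - observed_at).total_seconds()
    if age <= current_window_seconds:
        return "current"
    if age <= history_window_seconds:
        return "stale"
    return "expired"


def summarize_breadcrumbs(observed_times, now, current_window_seconds, history_window_seconds):
    """Count breadcrumbs per status, similar in spirit to the summary counts."""
    counts = {"current": 0, "stale": 0, "expired": 0}
    for observed_at in observed_times:
        status = breadcrumb_status(
            observed_at, now, current_window_seconds, history_window_seconds
        )
        counts[status] += 1
    return counts


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
observed = [now - timedelta(seconds=s) for s in (60, 600, 7200)]
summary = summarize_breadcrumbs(observed, now, 300, 3600)
```

With a 300 second current window and a 3600 second history window, the three sample breadcrumbs fall into one bucket each.
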
C19C adds the first non-VPN service-channel lease proof: Remote Workspace uses
the same signed data-plane contract, route intent model, introspection, and
maintenance visibility, but its entry descriptor is service-specific
(`remote-workspaces/.../streams`) and uses a remote-workspace frame batch media
type rather than VPN packet paths.
C19D proves the matching entry-node ingress boundary for Remote Workspace:
node-agent validates signed lease authority or introspection, service class,
channel class, selected entry node, allowed flow isolation, and data-plane
contract on `remote-workspaces/{resource_id}/streams/{channel_class}`. Empty
probe requests return `202` with a remote-workspace ingress probe contract and
access telemetry; real RDP frame forwarding remains deliberately
`not_implemented` until the service adapter work begins.
C19E adds a narrow frame-batch probe on that boundary. The adapter contract
advertises `rap.remote_workspace_frame_batch.v1`, and entry-node accepts
non-empty payloads only when they are JSON probe batches with `probe_only=true`,
valid remote-workspace logical channels, valid directions, and bounded payload
metadata. Accepted probes return `payload_flow=validated_probe_only`; production
frame forwarding is still not enabled.
C19F connects that validated probe to a node-agent local adapter sink. The
in-memory `node_agent_rdp_worker_contract_probe` sink accepts only validated
probe batches and returns `rap.remote_workspace_frame_batch_delivery.v1`
receipts. Entry responses now report `payload_flow=delivered_probe_only` when
the local sink accepts the batch; no RDP server traffic or desktop frame
forwarding is enabled by this stage.
C19G makes that sink delivery observable outside the direct ingress response:
node-agent reports `remote_workspace_adapter_sink` in `rdp-worker` workload
status and `remote_workspace_adapter_sink_report` in node telemetry, including
delivery count, latest sequence, frame count, channel class, adapter contract,
and explicit `payload_traffic=none` proof.
C19H adds negative guardrail proof for the same frame path: `probe_only=false`,
unknown logical channels, invalid channel direction, service/channel mismatch,
and unsupported payload encoding are rejected before adapter delivery. This
keeps the current Remote Workspace path as a contract probe only, not a hidden
RDP payload tunnel.
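
The acceptance and rejection rules for probe batches can be sketched as a single validation pass. This is an illustrative sketch only: the channel names, direction names, frame field names, size bound, encoding list, and rejection reason strings are all hypothetical; only the `probe_only=true` requirement and the categories of rejection come from the text above.

```python
from typing import Optional

# Hypothetical logical channels and the directions each one may use.
ALLOWED_DIRECTIONS = {
    "display": {"exit_to_entry"},
    "input": {"entry_to_exit"},
    "control": {"entry_to_exit", "exit_to_entry"},
}
SUPPORTED_ENCODINGS = {"json"}  # hypothetical
MAX_FRAMES_PER_BATCH = 64       # hypothetical bound


def validate_probe_batch(batch: dict) -> Optional[str]:
    """Return a rejection reason, or None when the batch may reach the sink."""
    if batch.get("probe_only") is not True:
        return "probe_only_required"
    frames = batch.get("frames")
    if not isinstance(frames, list) or not frames:
        return "empty_or_malformed_frames"
    if len(frames) > MAX_FRAMES_PER_BATCH:
        return "batch_too_large"
    for frame in frames:
        channel = frame.get("channel")
        if channel not in ALLOWED_DIRECTIONS:
            return "unknown_logical_channel"
        if frame.get("direction") not in ALLOWED_DIRECTIONS[channel]:
            return "invalid_channel_direction"
        if frame.get("payload_encoding") not in SUPPORTED_ENCODINGS:
            return "unsupported_payload_encoding"
    return None


good = {
    "probe_only": True,
    "frames": [
        {"channel": "display", "direction": "exit_to_entry", "payload_encoding": "json"}
    ],
}
bad = dict(good, probe_only=False)
```

Every check runs before any delivery step, which mirrors the requirement that invalid batches are rejected before adapter delivery.
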
C19I adds bounded adapter handoff queue/ack semantics to that probe-only sink.
The sink reports queue capacity/depth and accepted, dropped, acked, backpressure,
and drop-policy fields in `rap.remote_workspace_frame_batch_delivery.v1`.
Current capacity is `8`: droppable display overflow is accepted with excess
frames dropped and accepted frames acked, while reliable input overflow returns
backpressure without `adapter_delivery`. The path remains
`payload_traffic=none`; real RDP frame forwarding is still deferred to the
service adapter runtime.
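
The two overflow behaviours can be illustrated with a small bounded queue. This is a sketch under stated assumptions, not the node-agent sink: the class and field names are hypothetical, and it assumes queue depth persists across batches until something drains it.

```python
from dataclasses import dataclass


@dataclass
class HandoffResult:
    accepted: int
    dropped: int
    acked: int
    backpressure: bool
    adapter_delivery: bool


class ProbeHandoffQueue:
    """Bounded probe-only handoff queue (capacity 8 matches the text above)."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.depth = 0

    def offer(self, frame_count: int, droppable: bool) -> HandoffResult:
        free = self.capacity - self.depth
        if frame_count <= free:
            # Everything fits: accept and ack all frames.
            self.depth += frame_count
            return HandoffResult(frame_count, 0, frame_count, False, True)
        if droppable:
            # Droppable display traffic: keep what fits, drop the excess.
            self.depth += free
            return HandoffResult(free, frame_count - free, free, False, True)
        # Reliable input traffic: refuse the whole batch and signal backpressure.
        return HandoffResult(0, 0, 0, True, False)


display = ProbeHandoffQueue().offer(10, droppable=True)
reliable = ProbeHandoffQueue().offer(10, droppable=False)
```

The design point is that reliable traffic is never partially accepted, so the sender can retry the whole batch without guessing which frames were lost.
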
C19J promotes those queue/backpressure signals into the existing observability
surfaces. Workload status and node telemetry now expose queue capacity/depth,
cumulative accepted/dropped/acked frame counters, `backpressure_count`, and the
latest rejected batch metadata/reason, so adapter pressure can be diagnosed
without relying on the individual ingress response.
C19K binds that queue model to a probe-only adapter session identity. Entry-node
derives `adapter_session_id` from the selected service-channel context and the
adapter sink reports `adapter_runtime_id=node_agent_rdp_worker_contract_probe`
with `session_state=probe_bound` in delivery receipts, workload status, and
telemetry. Rejected reliable overflow batches keep the same session identity,
which gives the future real adapter runtime a stable lifecycle boundary while
payload forwarding remains disabled.
C19L adds lifecycle accounting for those probe-only adapter sessions. Node-agent
tracks active sessions, created/bound totals, last activity timestamps,
per-session delivery/backpressure/frame counters, idle expiry counters, and
`current_session_lifecycle_state`. Successful probe delivery binds the session;
reliable overflow records pressure on the same session instead of hiding it as a
standalone request failure.
C19M adds an explicit local control endpoint for that lifecycle:
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
accepts `close`, `expire`, and `reset`. The control result and report counters
make deliberate session shutdown visible through workload status and telemetry,
which prepares the same lifecycle shape for a real adapter runtime.
C19N adds guardrails for that endpoint: unsupported actions, malformed payloads,
invalid session IDs, unknown sessions, and oversized reasons are rejected before
state mutation. Repeated `close` is idempotent for a terminal session, reporting
the prior terminal state without double-counting closed sessions.
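
The control guardrails and the idempotent `close` can be sketched as a small state holder. The sketch below is hypothetical throughout: the class, error strings, and reason length limit are invented for illustration, and it assumes that any control action on an already terminal session reports the prior terminal state, which the text above only confirms for repeated `close`.

```python
TERMINAL_STATE_FOR_ACTION = {"close": "closed", "expire": "expired", "reset": "reset"}
MAX_REASON_LENGTH = 256  # hypothetical limit


class AdapterSessions:
    def __init__(self):
        self.states = {}
        self.closed_total = 0

    def bind(self, session_id: str) -> None:
        self.states[session_id] = "probe_bound"

    def control(self, session_id: str, action: str, reason: str = "") -> dict:
        # Validate everything before touching state.
        if action not in TERMINAL_STATE_FOR_ACTION:
            return {"ok": False, "error": "unsupported_action"}
        if len(reason) > MAX_REASON_LENGTH:
            return {"ok": False, "error": "reason_too_long"}
        state = self.states.get(session_id)
        if state is None:
            return {"ok": False, "error": "unknown_session"}
        if state in TERMINAL_STATE_FOR_ACTION.values():
            # Idempotent: report the prior terminal state, count nothing twice.
            return {"ok": True, "state": state, "already_terminal": True}
        new_state = TERMINAL_STATE_FOR_ACTION[action]
        self.states[session_id] = new_state
        if new_state == "closed":
            self.closed_total += 1
        return {"ok": True, "state": new_state, "already_terminal": False}


sessions = AdapterSessions()
sessions.bind("s1")
first = sessions.control("s1", "close")
second = sessions.control("s1", "close")
```
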
C19O adds a direct snapshot endpoint for diagnostics:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`
returns active and optional terminal adapter sessions with lifecycle state,
activity/backpressure timestamps, counters, and runtime identity. This gives the
future real adapter runtime an operator-facing inspection surface before payload
forwarding is enabled.
C19P adds the runtime handoff mailbox for active adapter sessions. The mailbox is
bounded in memory and stores `frame_batch_probe_delivered` and `backpressure`
events with sequence numbers and service-channel context. A future `rdp-worker`
runtime can read or drain it via
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`,
while snapshots and telemetry expose mailbox depth and enqueue/drain/drop
counters.
C19Q hardens that mailbox handoff surface. Invalid adapter session IDs, unknown
sessions, and invalid limits are rejected without mutating mailbox state, while
`drain=true&limit=N` can remove events in bounded chunks and leave the remaining
depth visible for the next adapter-runtime poll. The mailbox is verified under
pressure as drop-oldest bounded state, and a closed adapter session is no longer
readable as an active runtime mailbox. This preserves the probe-only boundary
and still does not enable RDP frame forwarding.
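
A drop-oldest bounded mailbox with chunked drains can be sketched with a deque. This is an illustration of the behaviour described above, not the node-agent code; the class name, method names, and response keys are assumptions.

```python
from collections import deque
from itertools import islice


class ProbeMailbox:
    """In-memory mailbox that drops the oldest event when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()
        self.next_sequence = 1
        self.dropped_total = 0

    def enqueue(self, kind: str) -> int:
        if len(self.events) >= self.capacity:
            self.events.popleft()  # drop-oldest policy
            self.dropped_total += 1
        sequence = self.next_sequence
        self.next_sequence += 1
        self.events.append({"sequence": sequence, "kind": kind})
        return sequence

    def read(self, limit: int, drain: bool = False) -> dict:
        if limit <= 0:
            raise ValueError("limit must be positive")  # rejected before any mutation
        batch = list(islice(self.events, limit))
        if drain:
            for _ in batch:
                self.events.popleft()
        return {"events": batch, "depth": len(self.events)}


mailbox = ProbeMailbox(capacity=3)
for _ in range(5):
    mailbox.enqueue("frame_batch_probe_delivered")
chunk = mailbox.read(limit=2, drain=True)
```

After five events into a capacity of three, the two oldest are gone, and a drain of two leaves one event visible for the next poll.
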
C19R adds bounded mailbox polling ergonomics for that future runtime consumer.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
now accepts `wait_ms`, returns explicit `empty`, `waited`, `wait_timeout`, and
`wait_ms` fields, and wakes when a new mailbox event arrives before the timeout.
The wait remains node-local and probe-only; it does not enable desktop frame
transport, backend relay, or production RDP payload forwarding.
C19S promotes those mailbox consumer signals into node-agent diagnostics.
Workload status, heartbeat telemetry, and active session snapshots now expose
mailbox read, wait, timeout, and empty-read counters plus last mailbox read
metadata. This lets operators identify hot polling or idle adapter consumers
without opening a data-plane path or forwarding desktop frames.
C19T adds node-local mailbox consumer checkpoint/ack metadata for the future
adapter runtime handoff. The mailbox endpoint accepts `consumer_id` and
`ack_sequence`, validates both before reading state, and returns consumer read,
ack, checkpoint, ack sequence, and lag metadata. The probe sink keeps bounded
per-session consumer cursor state and exposes aggregate/current-session
consumer counters in workload status and heartbeat telemetry. This remains a
diagnostic handoff contract only: no RDP frames are forwarded, no backend relay
semantics are introduced, and the mailbox stays node-local.
C19U adds lifecycle guardrails for those node-local consumer cursors. A consumer
can request `reset_consumer=true` with a valid `consumer_id` to clear its cursor
before the current mailbox read is recorded, and mailbox responses now expose
consumer capacity/count plus created/reset/evicted lifecycle metadata. Workload
status and heartbeat telemetry also expose reset and eviction counters, keeping
cursor cleanup observable without changing mailbox delivery or enabling
payload forwarding.
C19V adds read-only cursor inspection for adapter-runtime handoff recovery.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
returns the active session's bounded consumer cursor list with checkpoint, ack,
lag, read/ack totals, and timestamps. The endpoint supports a bounded `limit`
and does not read, drain, reset, or mutate mailbox state, so inspection remains
node-local and diagnostic-only.
C19W adds cursor-aware resume reads for mailbox consumers. The mailbox endpoint
now accepts `after_sequence` for non-destructive reads and returns
`after_sequence`, `skipped_count`, and `returned_count` so adapter runtimes can
resume from a checkpoint without client-side filtering. Long-poll waits for
events newer than the requested sequence, and `after_sequence` is rejected with
`drain=true` to keep resume reads separate from destructive mailbox drains.
C19X adds consumer-aware resume convenience on top of that explicit sequence
window. `resume_from=ack|checkpoint` can be used with `consumer_id` to resolve
the read window from the stored consumer cursor before reading the mailbox, and
responses include `resume_from` and `resume_sequence`. Resume requests reject
manual `after_sequence`, `drain=true`, reset, missing consumers, and unknown
consumer cursors so adapter runtimes cannot accidentally mix cursor modes.
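
The parameter exclusions for resume reads can be expressed as one validation function. This is a hedged sketch: the parameter names come from the text above, but the function, the string-valued query representation, and the rejection reason strings are hypothetical. The check for unknown consumer cursors needs stored state and is left out.

```python
from typing import Optional


def validate_mailbox_query(params: dict) -> Optional[str]:
    """Return a rejection reason for conflicting cursor modes, else None."""
    drain = params.get("drain") == "true"
    reset = params.get("reset_consumer") == "true"
    after_sequence = params.get("after_sequence")
    resume_from = params.get("resume_from")
    consumer_id = params.get("consumer_id")

    # Explicit sequence windows are non-destructive only.
    if after_sequence is not None and drain:
        return "after_sequence_with_drain"

    if resume_from is not None:
        if resume_from not in ("ack", "checkpoint"):
            return "invalid_resume_from"
        if not consumer_id:
            return "resume_requires_consumer_id"
        if after_sequence is not None:
            return "resume_with_after_sequence"
        if drain:
            return "resume_with_drain"
        if reset:
            return "resume_with_reset"
    return None


ok = validate_mailbox_query({"consumer_id": "worker-1", "resume_from": "ack"})
mixed = validate_mailbox_query(
    {"consumer_id": "worker-1", "resume_from": "ack", "after_sequence": "12"}
)
```
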
C19Y adds resume telemetry for operator diagnostics. Workload status and
heartbeat reports expose resume/after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots mirror the per-session counters so diagnostics
can distinguish normal polling from cursor-resume reads without reading or
draining mailbox state.
C19Z adds a compact adapter-runtime readiness summary to the sink report.
`adapter_runtime_readiness` combines probe-only status, session lifecycle state,
mailbox depth, consumer cursor, resume cursor, lag, and returned/skipped counts
into one diagnostic object so operators can verify handoff readiness without
triggering mailbox reads or drains.
C19Z1 adds a read-only mailbox handoff preflight endpoint. Adapter runtimes can
call `/mailbox/preflight` with `consumer_id` and `resume_from=ack|checkpoint`
to validate the stored cursor and inspect the next expected event window without
reading, draining, acking, or mutating consumer state.
C19Z2 adds separate telemetry for those handoff checks. Workload status and
heartbeat reports expose preflight totals split by ack/checkpoint cursor and the
last preflight session, consumer, cursor, after-sequence, available/returned/
skipped counts, and expected sequence range; readiness diagnostics mirror the
latest preflight summary.
C19Z3 adds stale-cursor diagnostics to preflight. When a consumer cursor points
behind dropped bounded-mailbox events, the preflight response reports retained
sequence bounds, `diagnostic_state=stale_cursor_gap`, `stale_cursor=true`, and
`missing_dropped_count`; workload/heartbeat telemetry and readiness diagnostics
mirror that latest stale state.
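
The stale-cursor gap is plain sequence arithmetic over the retained window. The sketch below assumes the cursor holds the last sequence the consumer has processed and that the mailbox is non-empty; the function name, the `first_retained_sequence`, `last_retained_sequence`, and `available_count` keys, and the non-stale `diagnostic_state` value are hypothetical.

```python
def preflight_gap(cursor_sequence: int, first_retained: int, last_retained: int) -> dict:
    """Compare a consumer cursor with the retained mailbox window (read-only)."""
    next_expected = cursor_sequence + 1
    # Events between the cursor and the oldest retained event were dropped.
    missing = max(0, first_retained - next_expected)
    first_available = max(next_expected, first_retained)
    available = max(0, last_retained - first_available + 1)
    stale = missing > 0
    return {
        "first_retained_sequence": first_retained,
        "last_retained_sequence": last_retained,
        "stale_cursor": stale,
        "missing_dropped_count": missing,
        "available_count": available,
        # "no_gap" is a placeholder; only stale_cursor_gap is named in the text.
        "diagnostic_state": "stale_cursor_gap" if stale else "no_gap",
    }


stale = preflight_gap(cursor_sequence=3, first_retained=10, last_retained=17)
fresh = preflight_gap(cursor_sequence=12, first_retained=10, last_retained=17)
```

In the stale case the consumer expected sequence 4 but the mailbox retains only 10 through 17, so sequences 4 through 9 are reported as missing.
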
C19Z4 adds explicit action hints to those diagnostics. Preflight responses now
include `recommended_action` and `action_hints`; stale cursor gaps recommend
resetting the consumer cursor, requesting a full adapter resync, and resuming
from checkpoint after resync. Telemetry and readiness diagnostics mirror the
latest recommended action and hints.
C19Z5 adds remediation provenance for those hints. Preflight responses,
workload/heartbeat telemetry, and readiness diagnostics include
`action_reason` plus structured `action_context` with the resume cursor,
retained sequence bounds, dropped/missing counts, consumer checkpoint/ack, and
expected window counters that explain why the recommended action was chosen.
C19Z6 adds a compact operator-facing preflight summary derived from the same
read-only state. Preflight responses, telemetry, and readiness diagnostics now
include `operator_summary` and `operator_summary_fields` so dashboards can show
the diagnostic state, action, reason, resume cursor, retained bounds, and key
window counters without recomputing or mutating mailbox state.
C19Z7 adds machine-sortable operator status and severity to that summary.
Preflight responses, telemetry, readiness diagnostics, and
`operator_summary_fields` now expose `operator_status` and `operator_severity`
so dashboards can sort ready, caught-up, and resync-required handoffs without
parsing human text.
C19Z8 groups the latest preflight view for admin UI consumption. The readiness
diagnostic keeps all existing flat latest-preflight fields and adds
`last_preflight` with observed time, cursor, counts, diagnostic state, selected
action, action provenance, operator summary, status, severity, and summary
fields.
C19Z9 adds retained-window detail to that grouped readiness view. The
`last_preflight` object now includes first/last retained sequence and mailbox
dropped total so stale-cursor summaries can explain the bounded mailbox window
without requiring a separate raw preflight lookup.
C19Z10 adds a structured remediation checklist to the grouped readiness view.
The `last_preflight.remediation_checklist` entries are derived from diagnostic
state and action hints, marking required/satisfied operator steps for cursor
reset, adapter resync, and post-resync resume without executing those actions.
C19Z11 adds summary status and counts for that checklist. The grouped readiness
view now exposes `remediation_checklist_status` plus total, required,
satisfied, and pending counts so admin UI can render checklist state without
scanning the step array.
C19Z12 adds per-session preflight operator status/severity counters. Readiness
now exposes counts for statuses such as `ready_to_resume`, `caught_up`, and
`resync_required`, plus severity counts such as `ok`, `info`, and `warn`, and
the grouped latest-preflight rollup mirrors those counters for dashboard
context.
C19Z13 derives a compact preflight attention status from those counters.
Readiness and `last_preflight` expose `preflight_attention_status` values such
as `clean`, `needs_attention`, and `repeated_resync_required`, letting admin UI
sort sessions without interpreting count maps directly.
C19Z14 proves the repeated-resync branch. Unit and live smoke coverage now run
multiple stale preflights on the same active adapter session and verify
`preflight_attention_status=repeated_resync_required` with repeated
`resync_required` / `warn` counters, while the preflight path remains read-only.
C19Z15 adds `preflight_attention_reason` beside the attention status. The reason
is derived from the latest preflight counters/status and explains clean,
attention-needed, and repeated-resync states without requiring UI code to parse
the counter maps.
C19Z16 completes focused proof coverage for those reasons. Unit coverage proves
clean, single-resync, repeated-resync, and no-preflight mappings, and live smoke
proves the single stale-preflight `resync_required_preflight_observed` reason.
C19Z17 adds a diagnostics contract marker to the grouped preflight readiness
rollup. `last_preflight` now includes `diagnostics_schema_version` and a
`diagnostics_contract` list for retained-window, remediation-checklist,
attention, and operator-count fields so admin UI can gate rendering safely.
C19Z18 adds machine-readable feature flags for that contract. `last_preflight`
now includes boolean `diagnostics_features` entries for retained-window,
remediation-checklist, attention, and operator-count diagnostics, allowing UI
and automation clients to check support without scanning the contract list.
C19Z19 adds a compatibility proof for the two contract forms. Unit and live
smoke coverage now verify that workload and telemetry reports expose matching
`diagnostics_contract` entries and `diagnostics_features` booleans for each
preflight diagnostics group.
C19Z20 adds the no-preflight absence proof. Active adapter sessions that have
not observed a mailbox preflight report `preflight_attention_status=unknown`,
`preflight_attention_reason=no_preflight_observed`, zero session preflight
count, and no grouped `last_preflight` rollup, so UI can distinguish "not
observed yet" from an observed clean state.
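
The attention status and reason can be pictured as a pure function of the per-session status counters. This sketch is an assumption-heavy illustration: the threshold of two for the repeated branch, the mapping of a single resync to `needs_attention`, and the reason strings marked hypothetical are not confirmed by the text above.

```python
def preflight_attention(status_counts: dict) -> tuple:
    """Derive (preflight_attention_status, preflight_attention_reason)."""
    total = sum(status_counts.values())
    resyncs = status_counts.get("resync_required", 0)
    if total == 0:
        # No preflight observed yet: distinct from an observed clean state.
        return ("unknown", "no_preflight_observed")
    if resyncs >= 2:  # assumed threshold for "repeated"
        # Reason string below is hypothetical.
        return ("repeated_resync_required", "repeated_resync_required_preflights_observed")
    if resyncs == 1:
        return ("needs_attention", "resync_required_preflight_observed")
    # Reason string below is hypothetical.
    return ("clean", "only_clean_preflights_observed")


never_checked = preflight_attention({})
single_stale = preflight_attention({"ready_to_resume": 2, "resync_required": 1})
repeated = preflight_attention({"resync_required": 3})
```
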
C19Z21 adds the no-active-session readiness proof. After the last adapter
session is closed, readiness reports idle/not-ready with zero active sessions,
no active `adapter_session_id`, no `last_preflight` rollup, and terminal
`last_session_state=closed` from the terminal-session ledger.
C19Z22 extends terminal-state coverage to `expire` and `reset` controls. The
same no-active-session readiness shape now proves `last_session_state=expired`
and `last_session_state=reset` from the terminal-session ledger.
C19Z23 adds grouped terminal-session summary metadata for the no-active-session
case. Readiness now includes `terminal_session_summary` with adapter session id,
terminal state, reason, and control timestamp while retaining flat compatibility
fields.
C19Z24 adds a contract marker to that summary. The grouped
`terminal_session_summary` now carries a schema version and summary-contract
field list so UI can gate rendering explicitly.
C19Z25 adds boolean feature flags for the same grouped terminal summary fields,
mirroring the preflight diagnostics contract/feature pattern.
C19Z26 adds compatibility proof coverage for those two terminal summary contract
forms, verifying that `summary_contract` entries and `summary_features` booleans
stay aligned in workload and telemetry reports.
C19Z27 adds absence proof coverage for a fresh no-session runtime: before any
terminal history exists, readiness stays in `waiting_for_session` and does not
include `terminal_session_summary`.
C19Z28 adds the grouped no-session readiness summary for that empty-runtime
state. Fresh adapter readiness now includes `no_session_summary` with schema
version `rap.remote_workspace_adapter_no_session_summary.v1`, a summary
contract for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, and matching idle/waiting-for-session counts, while
the terminal-session summary remains absent until terminal history exists.
C19Z29 adds boolean `summary_features` to the same grouped no-session summary
for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, matching the terminal summary and preflight
diagnostics feature-flag convention.
C19Z30 adds compatibility proof coverage for the grouped no-session summary,
verifying that `summary_contract` entries and `summary_features` booleans stay
aligned in workload and telemetry reports.
C19Z31 adds the inverse terminal-history absence proof: after adapter sessions
reach terminal states, readiness exposes `terminal_session_summary` and omits
`no_session_summary` in workload and telemetry reports.
C19Z32 proves readiness summary exclusivity across the three runtime shapes:
fresh exposes only `no_session_summary`, active exposes neither grouped summary,
and terminal exposes only `terminal_session_summary`.
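
The exclusivity rule across the three runtime shapes fits in a few lines. The function name and the returned dictionary layout are hypothetical; the rule itself (fresh exposes only `no_session_summary`, active exposes neither, terminal exposes only `terminal_session_summary`) is taken from the text above.

```python
def readiness_summary_presence(active_sessions: int, terminal_sessions: int) -> dict:
    """Decide which grouped readiness summary, if any, should be present."""
    if active_sessions > 0:
        shape = "active"
    elif terminal_sessions > 0:
        shape = "terminal"
    else:
        shape = "fresh"
    return {
        "shape": shape,
        "no_session_summary": shape == "fresh",
        "terminal_session_summary": shape == "terminal",
    }


matrix = [
    readiness_summary_presence(0, 0),
    readiness_summary_presence(1, 2),
    readiness_summary_presence(0, 2),
]
```

At most one grouped summary is ever present, so a UI can treat the two fields as mutually exclusive.
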
C19Z33 adds a compact readiness state matrix artifact for admin/runtime handoff:
fresh, active, and terminal rows are emitted for workload and telemetry with
only the relevant readiness fields and summary-presence booleans.
C19Z34 adds an explicit probe-to-runtime gate artifact. It confirms the current
Remote Workspace runtime is still `contract_probe`, `probe_only=true`, and
`payload_traffic=none`, lists the ready contracts, and records the remaining
runtime gates before real RDP frame transport can be enabled.
C19Z35 adds the disabled-by-default real-adapter supervision scaffold. The
`rdp-worker` contract-probe status now advertises
`rap.remote_workspace_real_adapter_supervision.v1` with future config env names,
status contract fields, and guardrails, while `contract_probe` remains the only
active execution mode and payload traffic remains `none`.
C19Z36 adds compatibility proof for that scaffold, verifying the disabled state,
status contract, env names, process model, and guardrails remain aligned in unit
and live workload status coverage.
C19Z37 adds disabled real-adapter config projection. Node-agent parses the
future `RAP_REMOTE_WORKSPACE_REAL_ADAPTER_*` env values and reports only
sanitized status metadata under
`real_adapter_supervision.config_projection`: whether enable was requested,
whether command/args/workdir are present, args JSON shape, and that raw values
are redacted. This does not activate the real adapter; `enabled=false`,
`activation_allowed=false`, and `payload_traffic=none` remain required.
C19Z38 proves projection compatibility across default/empty and requested
config shapes. Unit and live smoke coverage verify that absent env and requested
env both keep activation blocked, raw values redacted, and payload traffic
disabled.
C19Z39 adds an explicit disabled activation decision contract. The real adapter
status now reports `decision=blocked`,
`reason=real_runtime_stage_not_enabled`, `activation_allowed=false`, and the
missing gates before a future stage may start an external RDP worker process.
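
The config projection and blocked activation decision from C19Z37 through C19Z39 can be sketched as follows. This is an illustration under stated assumptions, not the node-agent implementation. The env prefix, the `decision`, `reason`, `enabled`, `activation_allowed`, and `payload_traffic` values come from the text above. The env suffixes (`ENABLED`, `COMMAND`, `ARGS_JSON`, `WORKDIR`), the projection key names, and the `args_json_shape` labels are hypothetical, and the missing gates are passed in by the caller because this document does not enumerate them here.

```python
import json
from typing import List, Mapping

# Prefix taken from the text; every suffix used below is a hypothetical name.
ENV_PREFIX = "RAP_REMOTE_WORKSPACE_REAL_ADAPTER_"


def project_config(env: Mapping[str, str]) -> dict:
    """Report sanitized metadata only; raw env values never leave this function."""

    def present(suffix: str) -> bool:
        return bool(env.get(ENV_PREFIX + suffix, "").strip())

    args_raw = env.get(ENV_PREFIX + "ARGS_JSON", "").strip()
    if not args_raw:
        args_shape = "absent"
    else:
        try:
            parsed = json.loads(args_raw)
            is_string_list = isinstance(parsed, list) and all(
                isinstance(arg, str) for arg in parsed
            )
            args_shape = "string_array" if is_string_list else "invalid"
        except ValueError:  # json.JSONDecodeError is a ValueError subclass
            args_shape = "invalid"

    enable_raw = env.get(ENV_PREFIX + "ENABLED", "").strip().lower()
    return {
        "enable_requested": enable_raw in {"1", "true", "yes"},
        "command_present": present("COMMAND"),
        "args_present": present("ARGS_JSON"),
        "workdir_present": present("WORKDIR"),
        "args_json_shape": args_shape,
        "raw_values_redacted": True,
    }


def activation_decision(projection: dict, missing_gates: List[str]) -> dict:
    """At this stage the decision is blocked regardless of what was requested."""
    return {
        "config_projection": projection,
        "enabled": False,
        "decision": "blocked",
        "reason": "real_runtime_stage_not_enabled",
        "activation_allowed": False,
        "payload_traffic": "none",
        "missing_gates": list(missing_gates),
    }
```

The point of the sketch is that requesting enablement changes only the reported projection, never the decision: both the default and the requested config shapes yield `activation_allowed=false`.
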
C19Z40 adds a compact handoff report proving that the supervision scaffold,
config projection, and blocked activation decision remain aligned for both
requested and default config shapes.
C19Z41 adds real-adapter supervision feature flags for config projection,
activation decision, missing gates, and raw-value redaction so UI and
automation clients can gate rendering explicitly.
C19Z42 folds those feature flags into the compact handoff report, proving
scaffold/projection/decision/features alignment for requested and default node
config in one admin/runtime artifact.
C19Z43 proves contract-probe precedence when desired workload config includes
both `adapter_contract_probe` and `real_adapter_supervision`; the runtime stays
running in probe mode and real-adapter activation remains blocked.
C19Z44 proves the real-adapter-only desired workload path remains degraded and
blocked, with the same disabled activation contract and no payload traffic.
C19Z45 adds a compact desired-workload mode matrix for probe-only,
real-adapter-only, and combined requested modes, confirming all paths retain
disabled real-adapter activation and no payload traffic.
C19Z46 adds compatibility proof for that mode matrix row contract, including
explicit feature-flag and missing-gate visibility markers.
C19Z47 adds a disabled process-supervisor preconditions contract for the future
external RDP worker process while keeping `process_start_allowed=false` and all
payload traffic disabled.
C19Z48 proves that process-supervisor preconditions contract across requested
and default config shapes, including required/missing checks and disabled start.
C19Z49 folds process-supervisor preconditions into the compact handoff report,
proving alignment with projection, activation decision, and feature flags.
C19Z50 folds those preconditions into the desired-workload mode matrix, proving
process start remains disabled across probe-only, real-adapter-only, and
combined requested modes.
C19Z51 adds compatibility proof for that mode matrix v2 row contract.
C19Z52 adds a disabled process-health-probe contract for the future external
RDP worker process while keeping health probes disabled and payload traffic at
`none`.
C19Z53 proves that process-health-probe contract across requested/default
status forms.
C19Z54 folds process-health-probe visibility into the compact handoff report,
proving disabled health probes and payload-free alignment across all
real-adapter handoff contracts.
C19Z55 folds process-health-probe visibility into the desired-workload mode
matrix, proving disabled health probes and no payload traffic across probe-only,
real-adapter-only, and combined requested modes.
C19Z56 adds compatibility proof for that mode matrix v3 row contract.
C19Z57 ties handoff v4 and mode matrix v3 compatibility into a compact disabled
real-adapter readiness/handoff checklist.
C19Z58 adds compatibility proof for that readiness/handoff summary and
checklist contract.
C19Z59 derives a disabled real-adapter operator action map from that checklist
while keeping activation, process start, and payload forwarding blocked.
C19Z60 adds compatibility proof for that operator action map contract.
C19Z61 groups the disabled real-adapter readiness summary, checklist, and
action map into one compact admin handoff bundle.
C19Z62 adds compatibility proof for that admin handoff bundle contract.
C19Z63 derives compact admin handoff digest display rows from the bundle while
preserving disabled runtime guardrails.
C19Z64 adds compatibility proof for that admin handoff digest row contract.
C19Z65 adds a digest rollup with severity/state counts, primary action, and
guardrail summary.
C19Z66 adds compatibility proof for that digest rollup contract.
C19Z67 summarizes the proven disabled real-adapter admin handoff chain from
handoff v4 through digest rollup compatibility.
C19Z68 adds compatibility proof for that full-chain summary contract.
C19Z69 marks the disabled real-adapter admin handoff package as
contract-only-ready while keeping the real runtime stage blocked.
C19Z70 proves the release marker contract remains compatible while keeping the
real runtime stage blocked.
C19Z71 adds a final contract-only package index for the disabled real-adapter
admin handoff chain.
C19Z72 proves the final package index contract for the disabled real-adapter
admin handoff chain.
C19Z73 adds a contract-only runtime gate phase boundary for the next disabled
real-adapter preflight phase.
C19Z74 proves the runtime gate phase boundary contract.
C19Z75 adds a disabled real-adapter runtime gate preflight checklist with all
items still blocking runtime.
C19Z76 proves the disabled real-adapter runtime gate preflight checklist
contract.
C19Z77 adds a disabled real-adapter runtime gate preflight status summary.
C19Z78 proves the disabled real-adapter runtime gate preflight status summary
contract.
C19Z79 adds disabled real-adapter runtime gate preflight action hints.
C19Z80 proves the disabled real-adapter runtime gate preflight action hints
contract.
C19Z81 adds a disabled real-adapter runtime gate preflight operator handoff
bundle.
C19Z82 proves the disabled real-adapter runtime gate preflight operator handoff
bundle contract.
C19Z83 adds a disabled real-adapter runtime gate preflight release marker.
C19Z84 proves the disabled real-adapter runtime gate preflight release marker
contract.
C19Z85 adds a disabled real-adapter runtime gate preflight package index.
C19Z86 proves the disabled real-adapter runtime gate preflight package index
contract.
C19Z87 adds a disabled real-adapter runtime gate preflight closeout summary.
C19Z88 proves the disabled real-adapter runtime gate preflight closeout summary
contract.
C19Z89 starts the explicit real-adapter runtime gate enablement phase with a
contract-only request that remains blocked pending validation.
C19Z90 proves the explicit real-adapter runtime gate enablement request
contract.
C19Z91 adds contract-only operator confirmation validation while keeping the
runtime gate blocked pending remaining validations.
C19Z92 proves the operator confirmation validation contract.
C19Z93 adds contract-only binary validation while keeping the runtime gate
blocked pending remaining validations.
C19Z94 proves the binary validation contract.
C19Z95 adds contract-only permission validation while keeping the runtime gate
blocked pending remaining validations.
C19Z96 proves the permission validation contract.
C19Z97 adds contract-only supervisor validation while keeping the runtime gate
blocked pending remaining validations.
C19Z98 proves the supervisor validation contract.
C19Z99 adds contract-only health probe validation while keeping the runtime gate
blocked pending payload gate validation.
C19Z100 proves the health probe validation contract.
C19Z101 adds contract-only payload gate validation, leaving no required
validations outstanding while keeping the runtime disabled.
C19Z102 proves the payload gate validation contract.
C19Z103 adds the runtime gate validation closeout while keeping explicit
operator enablement required.
C19Z104 proves the runtime gate validation closeout contract.
C19Z105 adds an operator enablement readiness package while keeping runtime
disabled by default.
C19Z106 proves the operator enablement readiness package contract.
C19Z107 adds an operator enablement readiness release marker while keeping
runtime disabled by default.
C19Z108 proves the operator enablement readiness release marker contract.
C19Z109 adds an operator enablement readiness package index while keeping
runtime disabled by default.
C19Z110 proves the operator enablement readiness package index contract.
C19Z111 adds an operator readiness closeout summary while keeping runtime
disabled by default.
C19Z112 proves the operator readiness closeout summary contract.
C19Z113 adds an operator review decision request while keeping runtime disabled
by default.
C19Z114 proves the operator review decision request contract.
C19Z115 adds an operator decision status summary while keeping runtime disabled
by default.
C19Z116 proves the operator decision status summary contract.
C19Z117 adds an operator approval/rejection outcome contract with the outcome
recorded as not approved and the runtime disabled by default.
C19Z118 proves the operator approval/rejection outcome contract.
C19Z119 adds an operator outcome closeout/reopen boundary while keeping runtime
disabled by default.
C19Z120 proves the operator outcome closeout/reopen boundary contract.
C19Z121 adds a not-approved outcome release marker while keeping runtime
disabled by default.
C19Z122 proves the not-approved outcome release marker contract.
C19Z123 adds a not-approved outcome package index while keeping runtime disabled
by default.
C19Z124 proves the not-approved outcome package index contract.
C19Z125 adds a not-approved outcome closeout summary while keeping runtime
disabled by default.
C19Z126 proves the not-approved outcome closeout summary contract.
C19Z127 adds a final not-approved outcome release marker while keeping runtime
disabled by default.
C19Z128 proves the final not-approved outcome release marker contract.
C19Z129 adds a final not-approved outcome package index/archive marker while
keeping runtime disabled by default.
C19Z130 proves the final not-approved outcome package index/archive marker
contract.
C19Z131 adds a not-approved outcome archive closeout manifest while keeping
runtime disabled by default.
C19Z132 proves the not-approved outcome archive closeout manifest contract.
C19Z133 adds a stopped-branch sentinel for the not-approved outcome while
keeping runtime disabled by default.
C19Z134 proves the not-approved outcome stopped-branch sentinel contract.
C19Z135 adds a no-continuation guard for the stopped not-approved outcome while
keeping runtime disabled by default.
C19Z136 proves the not-approved outcome no-continuation guard contract.
C19Z137 adds continuation block enforcement for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z138 proves the not-approved outcome continuation block enforcement
contract.
C19Z139 adds a continuation block audit record for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z140 proves the not-approved outcome continuation block audit record
contract.
C19Z141 adds a continuation block audit rollup for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z142 proves the not-approved outcome continuation block audit rollup
contract.
C19Z143 adds an operator stop summary for the stopped not-approved outcome
while keeping runtime disabled by default.
C19Z144 proves the not-approved outcome operator stop summary contract.
C19Z145 adds an operator stop handoff for the stopped not-approved outcome
while keeping runtime disabled by default.
C19Z146 proves the not-approved outcome operator stop handoff contract.
C19Z147 adds an operator stop handoff digest for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z148 proves the not-approved outcome operator stop handoff digest contract.
C19Z149 adds an operator stop status snapshot for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z150 proves the not-approved outcome operator stop status snapshot contract.
C19Z151 adds an operator stop status snapshot index for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z152 proves the not-approved outcome operator stop status snapshot index
contract.
C19Z153 adds an operator stop status catalog for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z154 proves the not-approved outcome operator stop status catalog contract.
C19Z155 adds an operator stop status catalog release marker for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z156 proves the not-approved outcome operator stop status catalog release
marker contract.
C19Z157 adds an operator stop status catalog package index for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z158 proves the not-approved outcome operator stop status catalog package
index contract.
C19Z159 adds an operator stop status catalog closeout summary for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z160 proves the not-approved outcome operator stop status catalog closeout
summary contract.
C19Z161 adds an operator stop status final archive marker for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z162 proves the not-approved outcome operator stop status final archive
marker contract.
C19Z163 adds an operator stop status final archive manifest for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z164 proves the not-approved outcome operator stop status final archive
manifest contract.
C19Z165 adds a terminal-complete marker for the stopped not-approved outcome
factory while keeping runtime disabled by default.
C19Z166 proves the not-approved outcome factory terminal-complete contract.
C20Z1 opens a new explicit real-adapter enablement request while keeping
runtime disabled by default.
C20Z2 proves the new explicit real-adapter enablement request contract.
C20Z3 adds the operator validation intake for the new explicit request while
keeping runtime disabled by default.
C20Z4 completes the operator validation checklist contract while keeping
runtime disabled by default.
C20Z5 closes the operator validation chain contract while keeping runtime
disabled by default.
C20Z6 proves the C20 stage terminal-complete contract.
5. Move VPN packet flow to the service channel and keep the backend relay only
   as an explicit degraded fallback.
6. Run load tests against the fabric channel: many streams, route failure,
   exit failure, NAT/outbound-only nodes, queue pressure, DNS/LAN/Internet
   egress.
7. Build Remote Server/Desktop Access on top of this channel, not beside it.
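
The fallback rule in step 5 can be sketched as a small path-selection function. This is a hypothetical illustration, not the runtime's API: `PathChoice`, `choose_packet_path`, the path labels, and the reason strings are invented for the example. It also assumes that the backend relay is considered only when the fabric route is unavailable and that, without an explicit fallback selection, the request fails closed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathChoice:
    path: str       # "fabric_service_channel" or "backend_relay" (hypothetical labels)
    degraded: bool  # True whenever the compatibility fallback is in use
    reason: str     # surfaced to diagnostics/admin UI


def choose_packet_path(
    fabric_healthy: bool, fallback_explicitly_selected: bool
) -> Optional[PathChoice]:
    """Prefer the fabric channel; use the backend relay only as explicit degraded fallback."""
    if fabric_healthy:
        return PathChoice("fabric_service_channel", False, "fabric_route_available")
    if fallback_explicitly_selected:
        # The fallback is never silent: it is always marked degraded.
        return PathChoice("backend_relay", True, "fabric_unavailable_explicit_fallback")
    # No healthy fabric route and no explicit fallback: fail closed.
    return None
```

Returning the `degraded` flag together with the chosen path keeps the fallback visible to whatever reports session state, so it cannot become the steady-state path unnoticed.
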

## Non-Negotiable Guardrails

- Do not solve new service performance problems inside a protocol-specific
  client before checking the common fabric channel.
- Do not add a production service that depends on backend packet/frame relay as
  the steady-state path.
- Do not expose internal mesh topology to organization users.
- Do not merge VPN and Remote Server/Desktop Access into one product.
- Do not let bulk traffic starve interactive traffic.
- Do not hide degraded fallback; report it visibly in diagnostics/admin UI.
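
The guardrail against bulk traffic starving interactive traffic can be illustrated with a two-class queue. This is a minimal sketch of one possible policy, not the fabric scheduler: the class name, the `interactive_burst` parameter, and the policy itself (interactive first, with one bulk item allowed after a bounded run of interactive items so bulk still makes progress) are assumptions for illustration.

```python
from collections import deque
from typing import Any, Optional


class TwoClassScheduler:
    """Serve interactive items first, yielding one bulk slot per interactive burst."""

    def __init__(self, interactive_burst: int = 4) -> None:
        self.interactive: deque = deque()
        self.bulk: deque = deque()
        self.interactive_burst = interactive_burst
        self._interactive_in_a_row = 0

    def enqueue(self, traffic_class: str, item: Any) -> None:
        if traffic_class == "interactive":
            self.interactive.append(item)
        elif traffic_class == "bulk":
            self.bulk.append(item)
        else:
            raise ValueError(f"unknown traffic class: {traffic_class}")

    def dequeue(self) -> Optional[Any]:
        # Bulk gets a turn only after a full interactive burst, and only if it has work.
        bulk_turn = (
            self._interactive_in_a_row >= self.interactive_burst and bool(self.bulk)
        )
        if self.interactive and not bulk_turn:
            self._interactive_in_a_row += 1
            return self.interactive.popleft()
        if self.bulk:
            self._interactive_in_a_row = 0
            return self.bulk.popleft()
        return None  # both queues are empty
```

With `interactive_burst=2`, a backlog of bulk items can delay an interactive item by at most one bulk item, while bulk still receives one slot out of every three when both queues are busy.
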