rdp-proxy/docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md
working version, but throughput is only 10 Mbit/s
2026-05-22 21:46:49 +03:00


# Fabric Service Channel Runtime
Status: accepted product direction and implementation guardrail.
This document defines the common runtime layer that service products must use
for live traffic. VPN, Remote Server/Desktop Access, video meetings, file
transfer, SSH/VNC/RDP adapters, and future services must not each invent their
own route, relay, retry, and failover mechanics.
## Problem
The platform goal is a distributed high-speed access fabric:
```text
client or service ingress
  -> authorized entry node / entry pool
  -> fastest healthy fabric route
  -> authorized exit node / exit pool
  -> target network, adapter, or service runtime
```
Recent VPN work exposed an architectural risk: debugging transport behavior
inside the Android VPN client or temporary backend packet relay can hide the
real missing layer. If the common fabric channel is incomplete, every later
service will repeat the same work and the Remote Server/Desktop Access client
will get stuck on transport issues that should already be solved below it.
The backend/control API remains the control plane. It must not become the
production realtime relay for high-rate service traffic.
## Product Rule
All live service traffic goes through the Fabric Service Channel runtime.
Control-plane and engineering traffic:
- login
- profile refresh
- policy lookup
- session creation
- route authorization
- diagnostics
- update metadata
may use Control API and admin ingress.
Working data traffic:
- VPN IP packets
- remote desktop display/input/control channels
- SSH/VNC streams
- file chunks
- video/audio
- future realtime service payloads
must use Fabric Service Channel unless an explicit compatibility fallback is
selected and reported as degraded.
## Service Request Contract
A service requests a channel by logical intent, not by hard-coding a node path.
Target shape:
```json
{
  "service_class": "vpn_packets | remote_workspace | file_transfer | video",
  "organization_id": "...",
  "user_id": "...",
  "resource_id": "...",
  "entry_pool": ["node-a", "node-b"],
  "exit_pool": ["node-x", "node-y"],
  "required_roles": ["entry-node", "vpn-exit"],
  "allowed_channels": ["control", "reliable", "bulk", "droppable"],
  "qos": {
    "interactive": true,
    "bulk_limit_mbps": 0,
    "priority": "interactive | normal | bulk"
  },
  "failover": {
    "route_rebuild": "automatic",
    "exit_failover": "automatic",
    "sticky_session": true
  }
}
```
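A minimal sketch of how a service might check such a request before sending it. The field values come from the target shape above and from the channel classes named in this document; the `ServiceChannelRequest` class, its `validate` method, and the specific rules are hypothetical and are not the project's API.

```python
from dataclasses import dataclass, field

# Allowed values copied from this document; the rule set itself is an assumption.
SERVICE_CLASSES = {"vpn_packets", "remote_workspace", "file_transfer", "video"}
CHANNEL_CLASSES = {"control", "interactive", "reliable", "bulk", "droppable"}
PRIORITIES = {"interactive", "normal", "bulk"}


@dataclass
class ServiceChannelRequest:
    """Hypothetical client-side model of the request shape."""
    service_class: str
    organization_id: str
    user_id: str
    resource_id: str
    entry_pool: list[str]
    exit_pool: list[str]
    required_roles: list[str] = field(default_factory=list)
    allowed_channels: list[str] = field(default_factory=list)
    priority: str = "normal"

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the request is well-formed."""
        problems = []
        if self.service_class not in SERVICE_CLASSES:
            problems.append(f"unknown service_class: {self.service_class}")
        # The request names pools (logical intent), never a concrete node path.
        if not self.entry_pool:
            problems.append("entry_pool must not be empty")
        if not self.exit_pool:
            problems.append("exit_pool must not be empty")
        unknown = sorted(set(self.allowed_channels) - CHANNEL_CLASSES)
        if unknown:
            problems.append(f"unknown channel classes: {unknown}")
        if self.priority not in PRIORITIES:
            problems.append(f"unknown priority: {self.priority}")
        return problems
```

A request that lists pools and known classes validates cleanly; a request with an empty pool or an unknown class returns one message per problem, so the caller can report all of them at once.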
The control plane returns a short-lived, signed service-channel lease:
- channel/session id
- selected entry
- selected exit
- alternate entries/exits
- primary route path
- alternate route paths
- allowed channel classes
- route generation/fencing epoch
- token expiry and refresh policy
- fallback policy
The service sees a channel endpoint and channel capabilities. It does not see
the full mesh topology unless it is a platform-owner diagnostic view.
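The lease checks implied above (signature, expiry, fencing epoch) can be sketched as follows. This is an illustration only: HMAC-SHA256 over canonical JSON stands in for whatever signature scheme the cluster authority really uses, and the field names `expires_at` and `fencing_epoch` and both function names are assumptions.

```python
import hashlib
import hmac
import json
import time


def sign_lease(lease: dict, key: bytes) -> str:
    """Sign the canonical JSON form of a lease (HMAC-SHA256 is a stand-in scheme)."""
    canonical = json.dumps(lease, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def check_lease(lease: dict, signature: str, key: bytes, current_epoch: int, now=None):
    """Return (ok, reason) for a lease presented to an entry node."""
    now = time.time() if now is None else now
    # Any change to the signed fields invalidates the signature.
    if not hmac.compare_digest(sign_lease(lease, key), signature):
        return False, "bad_signature"
    # Leases are short-lived; an expired lease must be refreshed, not reused.
    if lease["expires_at"] <= now:
        return False, "expired"
    # A lease issued under an older fencing epoch refers to a superseded route set.
    if lease["fencing_epoch"] < current_epoch:
        return False, "fenced"
    return True, "ok"
```

Checking the signature first means expiry and epoch are only read from data the authority actually signed.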
## Runtime Responsibilities
### Control Plane
- authorizes the service request
- resolves organization/resource policy
- selects candidate entry and exit pools
- issues signed channel leases
- records audit
- publishes route generation and allowed service class
- receives telemetry and route health feedback
- triggers route/exit replacement when needed
### Fabric Routing Engine
- chooses shortest/fastest healthy route
- scores latency, loss, queue depth, bandwidth, node health, NAT mode,
region/locality, role eligibility, and route generation freshness
- maintains alternate routes
- avoids full-mesh requirements
- rebuilds routes when links/nodes degrade
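As a rough illustration of the scoring idea, the sketch below ranks candidates using only three of the listed inputs (latency, loss, queue depth) and excludes unhealthy or fenced routes. The `RouteCandidate` type, the cost formula, and the weights are assumptions chosen for the example; they are not the engine's real scoring model.

```python
from dataclasses import dataclass


@dataclass
class RouteCandidate:
    """Hypothetical per-route health snapshot."""
    route_id: str
    latency_ms: float
    loss_ratio: float  # 0.0 .. 1.0
    queue_depth: int
    healthy: bool = True
    fenced: bool = False


def route_cost(route: RouteCandidate, loss_weight: float = 1000.0, queue_weight: float = 0.5) -> float:
    """Lower is better. The weights are illustrative, not taken from the project."""
    return route.latency_ms + loss_weight * route.loss_ratio + queue_weight * route.queue_depth


def rank_routes(candidates: list[RouteCandidate]) -> list[RouteCandidate]:
    """Drop unhealthy or fenced routes, then order the rest by ascending cost."""
    usable = [r for r in candidates if r.healthy and not r.fenced]
    return sorted(usable, key=route_cost)
```

The first element of the ranked list would serve as the primary route and the remainder as the alternate set, which is how the sketch reflects "maintains alternate routes".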
### Entry Node
- accepts client-facing live connections
- validates service-channel token
- multiplexes logical streams/channels
- applies backpressure and per-channel scheduling
- forwards payloads to the selected route
- switches to alternate route/exit when instructed or when local health proves
the path bad
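One way the multiplexing and backpressure items could look is a set of bounded per-flow queues keyed by a hash of the packet 5-tuple. The sketch is hypothetical: the class name, the channel count, the queue bound, and the use of SHA-256 are assumptions, and only the idea of protocol-neutral `flow-*` channels with drop and high-watermark counters comes from this document.

```python
import hashlib
from collections import deque


class FlowChannels:
    """Bounded per-flow queues; packets are never inspected above the 5-tuple."""

    def __init__(self, channel_count: int = 8, max_depth: int = 4):
        self.channel_count = channel_count
        self.max_depth = max_depth
        self.queues = {}
        self.stats = {}

    def channel_for(self, five_tuple) -> str:
        """Map (src, sport, dst, dport, proto) to a stable logical channel name."""
        digest = hashlib.sha256(repr(five_tuple).encode()).digest()
        return f"flow-{int.from_bytes(digest[:4], 'big') % self.channel_count}"

    def enqueue(self, five_tuple, packet: bytes) -> bool:
        """Queue a packet; return False (and count a drop) when the channel is full."""
        name = self.channel_for(five_tuple)
        queue = self.queues.setdefault(name, deque())
        stat = self.stats.setdefault(name, {"enqueued": 0, "dropped": 0, "high_watermark": 0})
        if len(queue) >= self.max_depth:
            stat["dropped"] += 1  # backpressure: only this flow is affected
            return False
        queue.append(packet)
        stat["enqueued"] += 1
        stat["high_watermark"] = max(stat["high_watermark"], len(queue))
        return True
```

Because each flow has its own bound, one overloaded flow fills and drops only in its own queue, and its counters show the pressure without hiding the state of other flows.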
### Intermediate Relay Nodes
- forward authorized envelopes only
- enforce route id, channel class, TTL, generation, and next-hop rules
- report link health and queue pressure
- do not own durable session state
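The enforcement list above can be read as a stateless per-envelope check. A minimal sketch, assuming hypothetical `Envelope` and `RouteRule` types and reason strings (none of these names come from the project):

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    """Hypothetical header fields carried by a forwarded payload."""
    route_id: str
    channel_class: str
    ttl: int
    generation: int
    next_hop: str


@dataclass
class RouteRule:
    """Hypothetical per-route rule installed on a relay by the control plane."""
    allowed_channel_classes: frozenset
    generation: int
    next_hop: str


def check_forward(envelope: Envelope, rules: dict) -> tuple:
    """Return (forward, reason). The relay keeps no session state between calls."""
    rule = rules.get(envelope.route_id)
    if rule is None:
        return False, "unknown_route"
    if envelope.channel_class not in rule.allowed_channel_classes:
        return False, "channel_class_not_allowed"
    if envelope.ttl <= 0:
        return False, "ttl_expired"
    if envelope.generation != rule.generation:
        return False, "stale_generation"
    if envelope.next_hop != rule.next_hop:
        return False, "next_hop_mismatch"
    return True, "forward"
```

Every decision depends only on the envelope and the currently installed rules, which is consistent with relays not owning durable session state.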
### Exit Node
- terminates the fabric route for the selected service
- connects to LAN/internet/adapter/runtime target
- enforces service policy locally
- reports egress health, DNS policy, and throughput
- can be replaced by another exit from the pool when policy allows
## Channel Model
The common fabric layer is channel-oriented.
| Channel class | Reliability | Typical services | Scheduling |
| --- | --- | --- | --- |
| `control` | reliable | attach/detach, route refresh, service state | highest |
| `interactive` | reliable/low-latency | RDP input, SSH input, cursor/control | highest data |
| `reliable` | ordered bounded | clipboard, small files, terminal output | medium |
| `bulk` | reliable bounded | VPN packets, downloads, large file chunks | lower than interactive |
| `droppable` | latest-wins | video frames, remote display regions, telemetry | drop stale |
VPN packets are protocol-neutral IP packets. They must not be special-cased as
HTTP, RDP, DNS, Telegram, or browser traffic. Optimization must improve the
shared packet path.
Remote Server/Desktop Access uses the same channel runtime, but its adapter
uses service-specific channel classes such as input, display, cursor,
clipboard, file transfer, audio, and telemetry.
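To make the Scheduling column concrete, here is a small sketch of a scheduler that serves classes in priority order and keeps only the newest payload per key for `droppable`. Strict priority, serving `droppable` after `bulk`, and the `ChannelScheduler` name are assumptions made for the example; a production scheduler would also need a fairness rule so that lower classes are not starved.

```python
from collections import deque

# Order follows the Scheduling column; the position of "droppable" is an assumption.
CLASS_ORDER = ["control", "interactive", "reliable", "bulk", "droppable"]


class ChannelScheduler:
    """Strict-priority scheduler with latest-wins semantics for droppable payloads."""

    def __init__(self):
        self.queues = {name: deque() for name in CLASS_ORDER if name != "droppable"}
        self.latest = {}  # droppable: newest payload per key, older ones are discarded

    def submit(self, channel_class: str, payload, key=None) -> None:
        if channel_class == "droppable":
            self.latest[key] = payload  # overwrite the stale payload for this key
        else:
            self.queues[channel_class].append(payload)

    def next_payload(self):
        """Return (channel_class, payload) for the next send, or None when idle."""
        for name in CLASS_ORDER:
            if name == "droppable":
                if self.latest:
                    key = next(iter(self.latest))
                    return name, self.latest.pop(key)
            elif self.queues[name]:
                return name, self.queues[name].popleft()
        return None
```

Submitting two display frames under the same key leaves only the second one to send, while control and interactive payloads are always drained before bulk VPN packets.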
## Failover Rules
The fabric must support:
- entry pool selection
- exit pool selection
- alternate route set
- quick route rebuild on node/link failure
- sticky route while healthy to avoid needless TCP disruption
- graceful drain when possible
- hard failover when route is stale or fenced
- explicit degraded fallback when the backend relay is used
VPN failover may still break existing TCP sessions in the initial mode. The
fabric must minimize disruption, but lossless TCP migration is a future mode and
must not be assumed.
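The sticky, fenced, and degraded-fallback rules can be combined in a short candidate-ordering sketch. The function names, the `backend_relay` label, and the `send` callback are hypothetical; the sketch only illustrates the ordering logic and does not model graceful drain or route rebuild.

```python
def order_candidates(routes, fenced, sticky=None):
    """Return route ids to try, in order.

    routes: route ids in control-plane preference order (primary first).
    fenced: set of route ids that must not be used.
    sticky: last route that carried traffic successfully, if any.
    """
    usable = [r for r in routes if r not in fenced]
    # Stay on the working route while it is still usable to avoid needless disruption.
    if sticky in usable:
        usable.remove(sticky)
        usable.insert(0, sticky)
    return usable


def send_with_failover(routes, fenced, sticky, send):
    """Try each fabric candidate; report degraded fallback only when all of them fail."""
    for route in order_candidates(routes, fenced, sticky):
        if send(route):
            return {"route": route, "degraded": False}
    # The backend relay is never a silent steady state: it is always flagged.
    return {"route": "backend_relay", "degraded": True}
```

For example, `order_candidates(["r1", "r2", "r3"], {"r1"}, sticky="r3")` yields `["r3", "r2"]`: the fenced primary is skipped and the last successful route is tried first.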
## Current Gap
The project already has important pieces:
- signed node identity and scoped mesh config
- production fabric-control forwarding
- production `vpn_packet` envelope tests
- route intents and route health feedback
- entry-node VPN packet ingress prototype
- backend relay fallback for lab compatibility
The missing production layer is the service-channel runtime:
- stable client-to-entry live transport
- multiplexed logical streams/channels
- route manager with primary and alternate paths
- service-neutral QoS/backpressure
- channel-level telemetry
- automatic route and exit replacement contract
- explicit degraded fallback reporting
Until this layer is complete, VPN should be treated as a proving service for
the fabric channel, not as a one-off Android transport project.
## Implemented Foundation
The first backend contract slice is implemented:
- `POST /api/v1/clusters/{cluster_id}/fabric/service-channels/leases`
issues a `rap.fabric_service_channel_lease.v1` contract.
- The lease contains selected entry/exit nodes, entry/exit pools, service
class, required roles, allowed channel classes, route generation, fencing
epoch, primary route, alternate routes, token metadata, entry HTTP/WebSocket
endpoint templates, QoS, failover policy, and explicit fallback state.
- Each lease includes a cluster-authority-signed
`rap.fabric_service_channel_lease_authority.v1` payload that binds the
channel id, service class, selected entry/exit, primary route, generation,
fencing epoch, expiry, and token hash.
- When an authorized fabric route exists, fallback is available but not
active.
- When no authorized fabric route exists, the lease is marked
`degraded_fallback`; backend relay is explicit compatibility fallback rather
than hidden steady state.
- VPN client profiles now embed `fabric_service_channel_lease` for each planned
VPN route, making VPN the first consumer of the common channel contract.
- `rap-node-agent` now exposes the first entry runtime endpoint for the VPN
proving service:
`/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets`
and the `/packets/ws` WebSocket variant.
- The entry endpoint requires a `rap_fsc_*` service-channel token, accepts
packet batches in `application/vnd.rap.vpn-packet-batch.v1`, forwards through
the existing production `vpn_packet` fabric route, and maps route failures to
the explicit backend relay compatibility path.
- Service-channel leases now carry a signed `data_plane` contract declaring
control-plane API use, working-data transport through Fabric Service Channel,
steady-state fabric routes, backend relay fallback policy, and
service-neutral multi-flow isolation.
- Node-agent validates the signed or introspected data-plane contract, applies
the preferred fabric route from the contract, reports contract adoption in
heartbeat access telemetry, and refuses backend relay when the contract says
`backend_relay_policy=disabled`.
- Backend access telemetry and web-admin active-channel diagnostics now project
the data-plane adoption count plus last data-plane mode, working transport,
steady-state transport, backend relay policy, and logical flow mode at
cluster, node, and active-channel levels.
- Rebuild/access incident diagnostics now include `data_plane_contract`
incidents for accepted service-channel traffic without a reported
data-plane contract, transport/policy mismatches, disabled backend relay
observations, and degraded backend relay usage. These incidents keep backend
relay visible as degraded compatibility behavior rather than hidden steady
state.
- Node-agent access telemetry distinguishes degraded compatibility that was
requested from degraded compatibility that was blocked by signed data-plane
policy. Blocked compatibility reports include
`degraded_compatibility_blocked` and the last violation status/reason, and
preserve the original raw violation code in a separate field for historical
correlation. The backend projects these reports to access telemetry and
`data_plane_contract` incidents.
- Backend correlates access-report send failures with active service-channel
leases. A normal primary route that fails while backend relay is disabled is
persisted as fenced route feedback, allowing the existing rebuild planner to
select an authorized alternate instead of leaving the channel stuck at a
policy-blocked fallback.
- Access-report-derived route feedback is deduplicated while an active fenced
or degraded observation from `fabric_service_channel_access_report` already
exists for the same cluster, reporter node, route, and service class. This
prevents repeated blocked-fallback send-failure heartbeats from continuously
refreshing the same feedback and churning rebuild attempts.
- Replacement decisions and rebuild-attempt ledger rows carry the originating
access-report feedback identity: observation id, source, observed/expiry
timestamps, channel/resource ids, and data-plane violation status/reason.
This makes the chain `access report -> route feedback -> planner decision ->
rebuild attempt` visible without opening raw JSON payloads.
- Rebuild-attempt ledger queries can filter by `feedback_source`,
`feedback_channel_id`, and `feedback_violation_status`. The admin panel
exposes the same fields so incident drilldown can jump directly to the
correlated attempts behind an access-report-derived failure.
- Entry token validation now supports cluster-authority signed lease
enforcement. When the client sends
`X-RAP-Service-Channel-Authority-Payload` and
`X-RAP-Service-Channel-Authority-Signature`, the entry node verifies the
signature, expiry, selected entry node, service class, channel/resource ids,
allowed `vpn_packet` channel, and token hash before accepting traffic.
- Android VPN release `0.2.159` consumes the profile
`fabric_service_channel_lease`, builds the entry HTTP/WebSocket URLs from
the lease templates, and sends the service-channel token and signed authority
headers. A live smoke against `usa-los-1` accepted a valid signed lease and
rejected a bad token with `403`.
- Node-agent release `0.2.162` adds the first route-manager behavior inside
the entry runtime. The VPN packet ingress keeps the same runtime object when
synthetic mesh config refreshes, records live send/receive counters, selected
route/next hop, route attempts/failures, local-gateway fallback, and inbox
queue depths.
- Client packet sends now try all valid `vpn_packet` route candidates, with a
sticky preference for the last successful route. Backend relay fallback is
reached only after all fabric candidates fail, and telemetry marks that as
degraded compatibility behavior rather than normal steady-state transport.
- A live smoke on 2026-05-07 against the `usa-los-1` service-channel endpoint
returned `202 Accepted` and heartbeat telemetry reported route attempts,
route failure, and selected next hop `home-1`, proving that the report comes
from the active ingress handler.
- Node-agent release `0.2.163` adds the first service-neutral flow scheduler.
The scheduler does not make HTTP/RDP/DNS/application decisions. It hashes
universal IP packets by 5-tuple, or opaque packet hash when no tuple can be
read, into logical `flow-*` channels. Each channel records queue depth,
enqueue/dequeue counts, drops, high-watermark, and backpressure state.
- Client packet batches are now fanned out by logical channel before route
forwarding. This is the first step toward letting independent sessions share
one VPN/fabric connection without a stalled flow hiding the health and
pressure of other flows.
- A live smoke on 2026-05-07 sent two different packet flows through the signed
service-channel endpoint and telemetry reported two flow batches, two flow
channels, two enqueues/dequeues, and zero drops.
- Node-agent release `0.2.164` turns those logical channels into the first
active scheduling behavior. Each channel remembers its last successful route
and next hop, the last failed route, send duration, served count, stall count,
consecutive failures, and whether route rebuild or degraded fallback is
recommended.
- Scheduled batches are drained with a service-neutral fairness rule:
non-stalled channels first, then less-served channels, then the oldest served
channel. This still carries raw VPN/IP packets; it does not inspect HTTP,
RDP, DNS, Telegram, browser traffic, or any other application protocol.
- Route selection is now per-channel. A channel may prefer its last successful
route and defer its last failed route, so one bad route candidate does not
keep punishing the same flow on the next send.
- A live smoke on 2026-05-07 posted two flows through `usa-los-1` and reported
schema `c18l.fabric_service_channel_runtime_report.v1`,
`send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`,
`dropped=0`, and per-flow `last_route_id`, `last_next_hop`, `served`,
`stall_count`, and fallback recommendation fields.
- Backend release `rap-backend:fabric-service-channel-0.2.165` consumes fresh
entry-node service-channel heartbeat feedback when issuing a new lease. It
reads `fabric_service_channel_runtime_report.ingress.flow_scheduler`
`channel_stats`, boosts routes with recent successful flow sends, penalizes
recent failed routes, and fences routes that explicitly recommend rebuild or
degraded fallback.
- Fenced routes are not returned as primary or alternate route candidates in a
service-channel lease. If every route for the selected entry/exit pair is
fenced by service-channel feedback, the lease enters explicit degraded
compat fallback with reason
`fabric_routes_fenced_by_service_channel_feedback`.
- A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
`vpn_packets` route intents, injected fresh service-channel flow feedback
marking the higher-priority route as rebuild-required, and the next lease
selected the lower-priority healthy route with score reason
`service_channel_recent_success`.
- Backend release `rap-backend:fabric-service-channel-0.2.166` makes that
route feedback durable. Heartbeat telemetry records service-neutral route
observations in `fabric_service_channel_route_feedback_observations` and
updates `fabric_service_channel_route_feedback_latest` with expiring latest
state per reporter node, service class, and route.
- Lease generation now reads durable latest feedback before falling back to
fresh heartbeat metadata. This keeps route fencing/boosting available across
backend restarts and prevents a single heartbeat replacement from erasing
recent route-health evidence.
- A live smoke on 2026-05-07 persisted a fenced observation for a forced-bad
higher-priority `test-1 -> test-2` route and a healthy observation for the
lower-priority route. After backend restart, the next service-channel lease
selected the healthy route with `service_channel_recent_success`; the durable
latest table showed the bad route as `fenced` and active.
- Backend release `rap-backend:fabric-service-channel-0.2.167` exposes durable
feedback for diagnostics and starts feeding it back into route-generation.
Operators can list fresh observations through
`/clusters/{clusterID}/fabric/service-channels/route-feedback`, and scoped
node synthetic configs now include a `service_channel_route_feedback` report.
- Synthetic config generation skips routes fenced by the local node's durable
service-channel feedback while that observation remains active. This is the
first closed loop from entry-runtime traffic health to the next route config:
a known-bad route is withheld from that node instead of being re-issued until
the feedback expires or a new healthy observation replaces it.
- Backend release `rap-backend:fabric-service-channel-0.2.168` adds proactive
replacement decisions for fenced service-channel routes. When a fenced route
is withheld, route path decisions now record either
`service_channel_feedback_replacement` with `replacement_route_id` and
effective replacement hops, or `service_channel_feedback_no_alternate` when no
unfenced alternate route exists.
- A live smoke on 2026-05-07 fenced a higher-priority `test-1 -> test-2` route
and kept a lower-priority healthy route. The scoped `test-1` synthetic config
excluded the bad route, kept the healthy route, and reported a replacement
decision from the bad route to the healthy route with score reason
`selected_unfenced_alternate_route`.
- Backend/node-agent release `0.2.169` adds the first replacement dampening
behavior. When choosing an alternate for a fenced service-channel route, the
control plane gives active healthy durable feedback a large stable preference
and records `active_healthy_feedback_dampening_window` in score reasons. This
keeps a recently successful replacement selected over a higher-priority but
unproven route until the feedback expires or a newer observation changes the
state.
- Route path decision reports now include `degraded_decision_count` for
`service_channel_feedback_no_alternate`; upgraded node-agents echo
`replacement_route_id` and degraded counts in heartbeat diagnostics. A live
smoke on 2026-05-07 confirmed a low-priority healthy replacement beat a
higher-priority unproven alternate while the healthy feedback was active.
- Node-agent/host-agent hotfix `0.2.171` keeps the signed synthetic config
contract in sync with the backend feedback report. Agents now preserve
`service_channel_route_feedback` while recalculating the authority payload
hash, preventing `0.2.169`-style hash mismatches after C18O/C18Q feedback
fields are present in control-plane configs. The release is published with
Docker, Linux service, Windows service, and binary artifacts.
- Backend/web-admin release `0.2.172` adds cluster-level route feedback
operations: operators can filter current feedback by reporter, route, service
class, status, or include expired observations, and can expire stale route
feedback after verification. Expiring feedback removes it from active route
selection by moving `expires_at` to now while retaining history for audit and
diagnostics.
- C18S adds operator-expire churn guardrails. A manual expire now creates an
audit event, sets `operator_retry_cooldown_until`, and lets the route retry
with explicit decision reason
`service_channel_route_retry_after_operator_expire`. If the same reporter
immediately sends another non-healthy observation for the same route/service
inside the cooldown, Control Plane records it as
`operator_retry_cooldown` with zero score adjustment instead of immediately
re-fencing the route.
- C18T starts automatic service-neutral rebuild orchestration. Route path
decisions now include rebuild request metadata. Fenced runtime feedback that
keeps failing outside manual retry cooldown creates a bounded rebuild
request. If an unfenced alternate is available, Control Plane marks the
rebuild `applied` and selects that route generation; if no alternate exists,
it records `pending_degraded_route_state` and keeps the channel in explicit
degraded route state until a new route appears. The compatibility release
`0.2.175` keeps node/host-agent signed-config models aligned with these new
fields.
- C18U moves rebuild metadata into node-agent runtime behavior. Node-agent
`0.2.176` builds a local service-channel route-manager snapshot from
`route_path_decisions`, tracks rebuild request/apply/pending-degraded counts,
marks rebuilt-away routes as withdrawn, clears a withdrawn cached selected
route, and filters withdrawn routes from new service-channel candidates. This
keeps service traffic on the Control Plane replacement instead of repeatedly
choosing a route that was already fenced. Backend `0.2.176` also makes node
list version state prefer a node's actual reported target version over stale
failed update-status rows.
- C18V adds route-manager transition telemetry and churn coverage. Node-agent
`0.2.177` reports `route_manager_transition` alongside the current manager
snapshot, including previous/current generation, status, decision count,
withdrawn route count, restored route count, pending degraded route-state count,
rebuild applied count, and any cached selected route cleared because Control
Plane withdrew it. Coverage verifies three service-neutral lifecycle cases:
applied rebuild replacement, pending degraded route state when no alternate is
available, and rollback/restoration when a fresh config removes the rebuild
decision.
- C18W adds a live docker-test verification loop for that telemetry. The smoke
script `scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` creates
short-lived service-channel route intents, injects durable fenced/healthy
feedback through the heartbeat contract, observes Control Plane
`rebuild_status=applied`, waits for node-agent `applied_rebuild`, expires the
feedback through the operator endpoint, verifies the config has no rebuild
decision, and waits for `restored_by_new_config`. The passing artifact is
`artifacts/c18w-service-channel-route-manager-smoke-result.json`. The live
run also hardened feedback expiration in backend `0.2.179` by avoiding pgx
mixed timestamp/text parameter inference and array-parameter fragility.
- C18X adds service-neutral logical-channel isolation coverage and fixes a
route-memory bug found by that coverage. Node-agent `0.2.180` keeps global
last-route stickiness only for channels with no local route state; if a
channel has a failed route to avoid, candidates are ordered without falling
back to the global last selected route. This prevents one failed flow from
poisoning unrelated flows that are still healthy on the primary route. The
same slice verifies bounded same-channel backpressure/drop telemetry and
preserves the existing packet-flow hashing split. The passing smoke artifact
is `artifacts/c18x-service-channel-logical-channel-smoke-result.json`.
- C18Y adds route-intent lifecycle cleanup for operator/test routes. Backend
`0.2.181` enriches route-intent list responses with lifecycle state, exposes
platform-admin `expire` and `disable` actions, and prevents expired route
policies from being emitted in node-scoped synthetic config. This keeps stale
smoke route intents visible for audit while stopping agents from probing them
as live routes. Web-admin Fabric Links now shows route-intent lifecycle
counts and actions. The passing smoke artifact is
`artifacts/c18y-route-intent-lifecycle-smoke-result.json`.
- C18Z adds bounded service-channel load coverage around the shared runtime.
Node-agent `0.2.181` verifies many independent logical packet channels can
rebuild away from a Control Plane-withdrawn primary route without retrying
the withdrawn candidate, while same-channel overload reports bounded drops
and high-water marks. `FabricFlowScheduler.Snapshot` now keeps
`backpressure_active=true` when bounded drops occurred even if the queue has
already drained. The docker-test smoke also creates temporary route intents,
verifies their routes are visible, then expires/disables them and proves they
disappear from scoped synthetic config. The passing smoke artifact is
`artifacts/c18z-service-channel-load-smoke-result.json`.
- C18Z1 proves the same runtime through the running node HTTP surface instead
of only in-process transport tests. Node-agent `0.2.182` adds a dynamic mesh
listener handler so synthetic-config refreshes swap the active
`/mesh/v1/forward` and service-channel ingress handler state without
restarting the listening port. This closes the stale-handler failure where
route-health probes had fresh routes but production forward still rejected
live packets with `mesh synthetic route not found`. Backend `0.2.182` keeps
active degraded/fenced route feedback from being immediately overwritten by a
newer healthy heartbeat until the feedback expires or is explicitly cleared.
The live smoke posts signed generic packet batches into `test-1`, verifies
delivery into the `test-2` fabric inbox, forces a route rebuild, waits for
node `applied_rebuild`, and verifies the second batch uses the replacement
route. The passing smoke artifact is
`artifacts/c18z1-live-service-channel-ingress-smoke-result.json`.
- C18Z2 adds a sustained live ingress and exit-restart smoke. The script
`scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1` keeps the same
protocol-neutral service-channel shape, sends multiple signed packet batches
through `test-1`, restarts the `test-2` exit container, waits for the exit
runtime to reload Control Plane synthetic config, then proves recovery
batches are accepted and delivered to the exit inbox. The passing artifact is
`artifacts/c18z2-live-service-channel-soak-smoke-result.json`; run
`c18z2-20260507-205112` accepted warm/restart/recovery batches and grew the
post-restart exit inbox depth from `0` to `88` with zero inbox drops.
- C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
Node-agent `0.2.183` validates the signed service-channel lease authority and
forces compat fallback when Control Plane has signed
`status=degraded_fallback` or `primary_route.status=missing_route_intent`.
This prevents a node from ignoring the lease decision and accidentally using
older generic route candidates for the same VPN resource. The rule applies to
both HTTP packet ingress and WebSocket packet ingress. The live smoke
`scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
recovery while a lease exists, explicit compat fallback when no authorized
fabric route exists, and route-intent expiry. The passing artifact is
`artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
recovery `4/4`, and moved the degraded compat fallback queue from `0` to
`8`.
- C18Z4 adds live long-session pressure coverage without another runtime
release. The script
`scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1` holds
one signed service-channel WebSocket open, sends 48 batches / 384 packets,
expires the primary route intent mid-session, waits for the dynamic
synthetic-config refresh, and verifies the post-switch traffic uses the
alternate route. The passing artifact is
`artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
route failure delta `0`, flow drop delta `0`, and compat fallback queue
`0 -> 0`. This proves route-policy churn can be absorbed by the shared
fabric runtime while a service WebSocket remains active.
- C18Z5 adds live exit-node failure coverage while the same kind of service
WebSocket remains active. The script
`scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1` sends
pre-outage traffic, stops the `test-2` exit container while traffic continues,
starts it again, waits for runtime readiness, and then sends recovery traffic over
the same signed WebSocket. The passing artifact is
`artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
`c18z5-20260507-213745` sent 480 packets total, observed route failure delta
`48`, compat fallback queue `0 -> 192`, flow drop delta `0`, and recovery
exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
degraded/fallback telemetry and fabric delivery resumes after runtime
recovery without requiring the service connection to be rebuilt.
- C18Z6 adds live Control Plane rebuild coverage while a service WebSocket is
active. The script
`scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1` injects
route-health feedback for the primary route, observes Control Plane
`rebuild_status=applied` with the alternate route as replacement, waits for
node-agent `route_manager_transition.status=applied_rebuild`, and continues
traffic over the same signed WebSocket. The passing artifact is
`artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
`c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
inbox, selected the replacement route, kept route failure delta `0`, flow
drop delta `0`, and compat fallback queue `0 -> 0`. This proves route-manager
replacement can be applied under an active service session without requiring
the service connection to be recreated.
- C18Z7 adds concurrent service-session isolation coverage. The script
`scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1`
opens three signed service-channel WebSockets over the same entry/exit pair,
interleaves packet batches across them, injects primary-route stale feedback,
waits for Control Plane `rebuild_status=applied` and node-agent
`applied_rebuild`, then continues all sessions. The passing artifact is
`artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
session, with total compat fallback delta `0`, route failure delta `0`, and
flow drop delta `0`. This proves concurrent service sessions keep separate
resource queues and are not starved or poisoned by a shared route-manager
rebuild.
- C18Z8 adds live backpressure/fairness isolation coverage. The script
`scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1`
opens two interactive service-channel WebSockets and one abusive WebSocket on
the same entry/exit pair. The abusive session overloads a single stable
5-tuple with 1300 packets while the interactive sessions continue sending
small batches. The passing artifact is
`artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`;
run `c18z8-20260507-221347` delivered 192 packets per interactive session,
hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
hottest channel, dropped `282` packets on that overloaded channel, and kept
compat fallback delta `0` and route failure delta `0`. This proves bounded
queue pressure is service-neutral, observable, and isolated to the overloaded
logical flow without starving other active sessions.
- C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
now honors Control Plane `replacement_route_id` as the preferred route when a
service-channel rebuild decision is applied, instead of only withdrawing the
stale route and then relying on synthetic-config ordering. The live smoke
`scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1` creates a
slow relay primary route (`test-1 -> test-3 -> test-2`) and a fast direct
replacement (`test-1 -> test-2`), sends 54 batches / 432 packets over one
signed WebSocket, injects stale-route feedback, waits for Control Plane and
node-agent `applied_rebuild`, and verifies the same service session continues
over the fast route. The passing artifact is
`artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
`c18z9-20260507-224901` kept compat fallback delta `0`, route failure delta
`0`, and flow drop delta `0`.
- C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
`0.2.185` binds signed entry/exit pools into the service-channel lease
authority, keeps selected exit aligned with the selected primary route, and
allows Control Plane replacement to move to another authorized exit when
route intents share the same exit-pool/resource metadata key. Node-agent also
seeds the entry runtime with the signed lease primary route so initial
traffic follows the lease before normal route-manager ordering. The live
smoke `scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1`
creates primary exit `test-1 -> test-2` and alternate exit
`test-1 -> test-3`, sends 54 batches / 432 packets over one signed
WebSocket, verifies 144 packets land on the primary exit before feedback,
injects stale-route feedback, waits for Control Plane and node-agent
`applied_rebuild`, and verifies 288 packets land on the alternate exit. The
passing artifact is
`artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
`c18z10-20260507-232645` kept compat fallback `0`, route failure delta `0`,
and flow drop delta `0`.
- C18Z11 adds service-channel entry-pool failover contract coverage. Backend
`rap-backend:fabric-service-channel-0.2.186` keeps
`selected_entry_node_id` aligned with the selected primary route when the
healthy route starts at another authorized entry node, and route replacement
scope now understands entry-pool metadata keys. The live smoke
`scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1` creates
primary entry `test-1 -> test-2` and alternate entry `test-3 -> test-2`,
sends 144 packets through the initial `test-1` lease, injects feedback for
the primary entry route, refreshes the lease, verifies the new lease selects
`test-3`, and sends 288 more packets through the alternate entry to the same
exit. The passing artifact is
`artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`; run
`c18z11-20260507-235341` delivered 432 packets to the exit, kept backend
fallback `0`, route failure deltas `0/0`, and flow drop deltas `0/0`. This
proves the Control Plane lease/reconnect contract for entry replacement; it
does not claim that a broken client-to-entry socket survives entry-node loss.
- C18Z12 adds the first route quality scoring layer for lease selection.
Backend `rap-backend:fabric-service-channel-0.2.187` consumes
service-neutral runtime feedback from
`fabric_service_channel_runtime_report.ingress.flow_scheduler`: fast
`last_send_duration_ms` values boost a route, slow values penalize it, and
recent failures/stalls apply bounded penalties. This is explicitly
application-protocol neutral; it scores the shared fabric channel rather than
HTTP, RDP, DNS, or any other payload type. The smoke
`scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1` creates a
higher-priority slow relay route and a lower-priority fast direct route. The
initial lease selects the slow route by policy priority; after runtime
telemetry reports fast route `8ms` and slow route `900ms`, the refreshed lease
selects the fast route with score reason
`service_channel_quality_latency_le_10ms`. The passing artifact is
`artifacts/c18z12-service-channel-route-quality-smoke-result.json`; run
`c18z12-20260508-000209` passed and expired its temporary route intents.
- C18Z13 closes the first live self-learning route-quality loop. Node-agent
`0.2.188` records any positive sub-millisecond service-channel send duration
as `1ms` instead of `0ms`, so very fast routes still produce actionable
quality telemetry. The live smoke
`scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1` does
not inject a synthetic heartbeat. It first proves policy priority selects a
higher-priority relay route, expires that route, sends 24 real
service-channel batches / 192 packets through the fast direct route, waits
for the node-agent heartbeat to persist healthy route feedback in the
backend, then introduces a new higher-priority relay candidate. The refreshed
lease selects the already-learned fast route with score reasons
`service_channel_recent_success` and
`service_channel_quality_latency_le_10ms`. The passing artifact is
`artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`; run
`c18z13-20260508-001610` delivered all 192 packets to the exit, kept backend
fallback and flow drops at `0`, and expired its temporary route intents.
- C18Z14 makes the learned route-quality loop active-session aware. Backend
`rap-backend:fabric-service-channel-0.2.190` decays older healthy
service-channel feedback before route scoring, so stale success does not keep
full weight until expiry. Node-agent `0.2.189` consumes healthy
service-channel route-quality observations from the signed synthetic config
and can prefer a significantly better learned route over a sticky per-flow
route/config-order candidate. The smoke
`scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1`
keeps one signed WebSocket service-channel session open across route
generation changes: it starts on a higher-priority relay route, expires that
route, sends real traffic over the fast route to teach backend feedback, then
introduces a new higher-priority relay candidate. The same active WebSocket
continues on the learned fast route. The passing artifact is
`artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
packets to the exit, kept compat fallback and flow drops at `0`, and expired
its temporary route intents.
- C18Z15 exposes and hardens effective route-quality preference telemetry.
Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
`score_adjustment` and decayed `effective_score_adjustment` in
service-channel feedback observations. Node-agent `0.2.190` consumes the
effective score for active route preference decisions, keeps the raw score
for diagnostics, and exposes sorted `route_quality_preferences` in runtime
telemetry. The smoke
`scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1`
wraps the active-session quality-shift scenario and verifies that route
preferences, effective scores, and age-decayed scores are visible. The
passing artifact is
`artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
packets to the exit, kept compat fallback and flow drops at `0`, and exposed
decayed effective scores in node telemetry.
- C18Z16 adds per-channel route-quality preference telemetry and fairness
guardrails. Node-agent `0.2.191` records the applied
`quality_preference_route_id`, effective/raw score, and reasons on each
flow-scheduler channel that uses a quality-preferred route. Unit coverage
proves a learned route-quality preference can move multiple logical channels
to the fast route without merging their queues or dropping packets. The smoke
`scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1`
validates the live route-quality shift with per-channel diagnostics. The
passing artifact is
`artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
logical channels, applied quality preference telemetry to all 32 served
channels, and kept compat fallback and flow drops at `0`.
- C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
removes channel-level quality preference diagnostics when the preference is no
longer present in the current effective preference set or when the preferred
route is withdrawn by the route manager. The smoke
`scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
verifies that active channel markers reference visible preferences, stale
markers are absent, expired route intents are not active, and the session
completes without compat fallback. The passing artifact is
`artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
quality markers, found `0` stale markers, and kept compat fallback and flow
drops at `0`.
- C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
`0.2.193` now keys runtime-sent logical channels as
`vpn:{vpnConnectionID}:flow-NN`, while keeping the low-level scheduler API
compatible with unscoped unit tests. This prevents two simultaneous
VPN/service sessions that share the same entry/exit and same IP-flow shard
from sharing route-failure memory or diagnostic markers. Unit coverage proves
`vpn-a` can avoid a failed primary route while `vpn-b` keeps the healthy
primary route for the same packet flow. The smoke
`scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
wraps the live C18Z17 route-quality/fairness path, verifies served live
channel names are session-scoped and no unscoped served `flow-NN` channels
remain, and keeps compat fallback and flow drops at zero. The passing
artifact is
`artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
run `c18z14-20260508-082520` served 32 session-scoped channels, applied
quality markers to all 32, and kept compat fallback and flow drops at `0`.
- C18Z19 adds the first bounded parallel send window for independent
service-channel logical flows. Node-agent `0.2.194` can send scheduled
logical channels concurrently with `MaxParallelFlowSends=4` in the live
node-agent runtime, while older/default in-process behavior remains
sequential unless the window is explicitly set. This keeps the data path
protocol-neutral: it does not inspect HTTP, RDP, DNS, Telegram, or browser
traffic; it only prevents one slow logical flow/channel from blocking another
independent channel in the same shared fabric service path. Telemetry now
exposes `max_parallel_flow_sends` and `send_flow_parallel_batches`. Unit
coverage blocks one logical channel and proves another channel completes
before the slow channel is released. The smoke
`scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1`
wraps the live C18Z18 path and verifies the parallel window is enabled and
observed in runtime telemetry. The passing artifact is
`artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`;
run `c18z14-20260508-084133` delivered 480 packets, observed
`max_parallel_flow_sends=4`, `send_flow_parallel_batches=60`, backend
fallback `0`, and flow drops `0`.
- C18Z20 adds per-channel latency/retry/in-flight telemetry and the first
adaptive recommended parallel window. Node-agent `0.2.195` tracks
scheduler-level `in_flight`, `max_in_flight`, and slow/failing channel counts,
plus per-channel `send_attempts`, `send_successes`, `send_failures`,
`in_flight`, `max_in_flight`, and latency buckets (`<=10ms`, `<=100ms`,
`<=1000ms`, `>1000ms`). The runtime also reports
`recommended_parallel_flow_sends`, which currently shrinks the window under
bounded drops, degraded fallback recommendations, repeated failures, or
slow/stalled channels. Unit coverage proves the
recommended window shrinks under queue/route pressure and that the parallel
window still lets an independent channel complete while another is blocked.
The smoke
`scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1`
wraps the live C18Z19 path and verifies the new telemetry is visible on real
docker-test nodes. The passing artifact is
`artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`;
run `c18z14-20260508-085635` delivered 480 packets, observed
`max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
`scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
served channels, compat fallback `0`, and flow drops `0`.
- C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
keeps the lifetime counters for audit visibility, but adaptive send-window
pressure now comes from the bounded recent quality window, so old drops and
old route failures roll out after successful fresh samples. The scheduler
exposes aggregate rolling-window sample/failure/slow/drop counters and each
channel exposes sample, success, failure, slow, drop, average-latency, and
last-updated telemetry. Unit coverage proves old pressure is forgotten by the
rolling window while lifetime counters remain visible. The smoke
`scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1`
wraps the live C18Z20 path and verifies the new telemetry on real docker-test
nodes. The passing artifact is
`artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json`;
run `c18z14-20260508-091952` delivered 480 packets, observed
`scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
drops `0`, rolling samples/success/latency on all 32 served channels,
`recommended_parallel_flow_sends=4`, compat fallback `0`, and flow drops `0`.
- C18Z22 connects the rolling window to backend durable route feedback. Backend
`rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
counts plus rolling average latency when persisting
`fabric_service_channel` route feedback. Lifetime fields remain available as
fallback for older agents, but they no longer dominate scoring when a current
rolling window is present and clean. The smoke
`scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1` wraps the
live C18Z21 path and verifies persisted feedback includes
`service_channel_rolling_quality_window` and payload `quality_window_*`
fields. The passing artifact is
`artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
`c18z14-20260508-093100` delivered 480 packets, observed one persisted
healthy rolling feedback item with rolling payload, compat fallback `0`, and
flow drops `0`.
- C18Z23 adds route recovery hysteresis. Backend
`rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
healthy rolling-window feedback during an operator-expire/manual retry
cooldown, but applies a bounded score penalty (`150`) and the
`service_channel_recovery_hysteresis` reason. The recovered route remains
authorized and available as an alternate, while a steady healthy route can
remain primary until the recovery window proves stable enough. The smoke
`scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1` wraps
the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
forwarding, and the unit hysteresis contract. The passing artifact is
`artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
`c18z14-20260508-094111` delivered 480 packets with compat fallback `0` and
flow drops `0`.
- C18Z24 exposes that recovery state to operators and API consumers. Backend
`rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
responses and node-scoped service-channel feedback reports with
`recovery_state`, `recovery_hysteresis_active`, and
`recovery_hysteresis_penalty`; route path decision reports now include
`recovery_hysteresis_count`. Web-admin shows recovered/hysteresis chips and a
recovery column next to route feedback status, score, reasons, retry
cooldown, and expiry. The smoke
`scripts/fabric/c18z24-service-channel-recovery-visibility-smoke.ps1`
verifies backend `0.2.199`, unit recovery visibility, and live
route-feedback API recovery shape. The passing artifact is
`artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json`;
the live API returned 109 feedback observations carrying the recovery-state
shape.
- C18Z25 adds a stability threshold before recovered routes can become steady
again. Backend `rap-backend:fabric-service-channel-0.2.200` keeps routes
recovered through manual retry under hysteresis until they report at least 64
clean rolling-window samples (`success >= 64`, failures/slow/drops `0`). Once the
threshold is met, the route is promoted back to `healthy`, gets
`recovery_promoted=true` and `service_channel_recovery_promoted`, and no
longer receives the hysteresis penalty. Admin/API expose promoted counts and
flags beside recovered/hysteresis state. The smoke
`scripts/fabric/c18z25-service-channel-recovery-promotion-smoke.ps1`
verifies backend `0.2.200`, the promotion unit contract, and live
route-feedback API recovery shape. The passing artifact is
`artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json`.
- C18Z26 adds explicit demotion after recovery promotion. Backend
`rap-backend:fabric-service-channel-0.2.201` marks a recovered/promoted route
under retry cooldown as `recovery_demoted=true` when fresh rolling feedback
shows failures, drops, slow samples, degraded fallback, rebuild
recommendation, or fenced state. The demotion includes a concrete
`recovery_reason`, adds `service_channel_recovery_demoted` plus the specific
reason to route score reasons, and increments `recovery_demoted_count` in
route path decision reports. Web-admin shows demoted feedback/path chips and
reason text. The smoke
`scripts/fabric/c18z26-service-channel-recovery-demotion-smoke.ps1` verifies
backend `0.2.201`, demotion unit coverage, and live route-feedback API
recovery shape. The passing artifact is
`artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json`.
- C18Z27 adds cluster-level recovery policy tuning. Backend
`rap-backend:fabric-service-channel-0.2.202` exposes
`GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy`,
backed by strict defaults plus optional cluster metadata override
`fabric_service_channel_recovery_policy`. The policy controls hysteresis
penalty, promotion minimum samples, demotion thresholds for failures, drops,
and slow samples, and rebuild/fenced demotion toggles. Lease route selection,
route feedback reports, and node-scoped synthetic config feedback consume the
effective policy. Web-admin shows and edits the policy in the
service-channel diagnostics card. The smoke
`scripts/fabric/c18z27-service-channel-recovery-policy-smoke.ps1` verifies
backend `0.2.202`, policy unit coverage, live GET/PUT policy API, and default
restoration. The passing artifact is
`artifacts/c18z27-service-channel-recovery-policy-smoke-result.json`.
- C18Z28 adds recovery policy provenance to service-channel diagnostics.
Backend `rap-backend:fabric-service-channel-0.2.203` includes the effective
recovery policy on `FabricServiceChannelRoute`,
`FabricServiceChannelLease`, signed lease authority payloads, route feedback
reports, and route path decision reports. This lets operators audit a
primary route, alternate route, degraded fallback, or path decision against
the exact policy source and thresholds that produced the score/recovery
state. Web-admin node diagnostics show the policy source and key thresholds
beside service-channel feedback and route decisions. The smoke
`scripts/fabric/c18z28-service-channel-recovery-policy-provenance-smoke.ps1`
verifies backend `0.2.203`, live synthetic config provenance, live lease
provenance, primary route provenance, and signed authority-payload
provenance. The passing artifact is
`artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json`.
- C18Z29 adds feedback provenance guardrails. Backend
`rap-backend:fabric-service-channel-0.2.204` computes a stable recovery
policy fingerprint and recognizes optional runtime feedback provenance:
`recovery_policy_fingerprint`, `route_generation`, `route_policy_version`,
and `policy_version`. Route feedback observations expose observed/effective
policy fingerprints and route generations, while reports expose missing and
stale counters. Feedback that explicitly came from an old policy or route
generation is still visible, but it is scored conservatively and cannot fence
or rebuild a current route. Missing provenance remains compatible for old
node-agents. The smoke
`scripts/fabric/c18z29-service-channel-feedback-provenance-guard-smoke.ps1`
verifies backend `0.2.204`, unit guardrails, live policy fingerprint, and
live feedback provenance counter shape. The passing artifact is
`artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json`.
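The C18Z8 entry above describes queue pressure that stays isolated to the overloaded logical flow. A minimal sketch of that idea, assuming one bounded queue per channel and a simple round-robin drain: the class, method, and channel names here are hypothetical and are not taken from the node-agent code, and the drop count comes from this toy model with no concurrent draining, so it is not meant to reproduce the 282 drops of the live run.

```python
from collections import deque


class FlowScheduler:
    """Per-channel bounded queues with a round-robin drain (illustrative only)."""

    def __init__(self, high_watermark=1024):
        self.high_watermark = high_watermark  # per-channel cap, value from the C18Z8 run
        self.queues = {}   # channel id -> deque of queued packets
        self.dropped = {}  # channel id -> number of dropped packets

    def enqueue(self, channel, packet):
        queue = self.queues.setdefault(channel, deque())
        if len(queue) >= self.high_watermark:
            # Only the overloaded channel pays for its own pressure.
            self.dropped[channel] = self.dropped.get(channel, 0) + 1
            return False
        queue.append(packet)
        return True

    def drain_round(self):
        """Take at most one packet from every non-empty channel."""
        batch = []
        for channel, queue in self.queues.items():
            if queue:
                batch.append((channel, queue.popleft()))
        return batch


scheduler = FlowScheduler(high_watermark=1024)
for seq in range(1300):                 # abusive session, single stable flow
    scheduler.enqueue("abusive", seq)
for seq in range(8):                    # two small interactive sessions
    scheduler.enqueue("interactive-a", seq)
    scheduler.enqueue("interactive-b", seq)
first_round = scheduler.drain_round()
```

Because each drain round takes at most one packet per channel, the interactive channels are served in the first round even though the abusive channel is full.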
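The C18Z18 entry above keys logical channels as `vpn:{vpnConnectionID}:flow-NN`. The sketch below shows one way such a key could be derived. The shard count of 32, the CRC32 hash, and the two-digit zero padding are assumptions made for illustration; the source only gives the key format.

```python
import zlib

FLOW_SHARDS = 32  # assumption: the source does not state the shard count


def flow_shard(five_tuple, shards=FLOW_SHARDS):
    """Map a 5-tuple to a stable shard index without looking at the payload."""
    src_ip, src_port, dst_ip, dst_port, proto = five_tuple
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode("utf-8")
    return zlib.crc32(key) % shards


def channel_key(vpn_connection_id, five_tuple):
    """Scope the flow shard by service session, as in `vpn:{id}:flow-NN`."""
    return f"vpn:{vpn_connection_id}:flow-{flow_shard(five_tuple):02d}"


same_flow = ("10.0.0.2", 51000, "93.184.216.34", 443, "tcp")
key_a = channel_key("vpn-a", same_flow)
key_b = channel_key("vpn-b", same_flow)

# Route-failure memory keyed by the scoped channel id stays per session.
failed_routes = {key_a: {"primary"}}
```

Two sessions carrying the same packet flow land on the same shard suffix but get different channel keys, so a failure recorded for `vpn-a` does not affect `vpn-b`.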
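The bounded parallel send window from the C18Z19 entry above can be illustrated with a semaphore. This is a sketch under stated assumptions, not the node-agent implementation: `send_scheduled` and the `send` callback are hypothetical names, and only the window size of 4 comes from the source.

```python
import asyncio


async def send_scheduled(channels, send, max_parallel_flow_sends=4):
    """Send one batch per logical channel with at most N sends in flight."""
    window = asyncio.Semaphore(max_parallel_flow_sends)
    completed = []

    async def send_one(channel):
        async with window:
            await send(channel)
            completed.append(channel)

    await asyncio.gather(*(send_one(channel) for channel in channels))
    return completed


async def demo():
    release_slow = asyncio.Event()

    async def send(channel):
        if channel == "slow":
            await release_slow.wait()  # simulate a stalled route
        else:
            release_slow.set()         # the fast channel finishes and unblocks the slow one

    return await send_scheduled(["slow", "fast"], send)


order = asyncio.run(demo())
```

With a window of 1 this demo would never finish, because the stalled channel would hold the only send slot. That is the head-of-line blocking the parallel window is meant to avoid.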
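The C18Z21 entry above separates lifetime counters from a bounded recent quality window. A minimal sketch of that split follows, assuming a fixed-size window of the most recent samples. The class name, the window size, and the method names are hypothetical; the actual `quality_window_*` field semantics are not specified here.

```python
from collections import deque


class ChannelQualityWindow:
    """Lifetime counters for audit plus a bounded window for adaptive decisions."""

    def __init__(self, window_size=256):  # window size is an assumption
        self.recent = deque(maxlen=window_size)  # (ok, latency_ms) samples
        self.lifetime_samples = 0
        self.lifetime_failures = 0

    def record(self, ok, latency_ms):
        self.lifetime_samples += 1
        if not ok:
            self.lifetime_failures += 1
        self.recent.append((ok, latency_ms))  # oldest sample rolls out when full

    def window_failures(self):
        return sum(1 for ok, _ in self.recent if not ok)

    def window_average_latency_ms(self):
        if not self.recent:
            return 0.0
        return sum(latency for _, latency in self.recent) / len(self.recent)


quality = ChannelQualityWindow(window_size=4)
for _ in range(3):
    quality.record(ok=False, latency_ms=900)  # old route failures
for _ in range(4):
    quality.record(ok=True, latency_ms=8)     # fresh successful samples
```

After the four fresh samples, the window reports no failures while the lifetime counters still show the three earlier ones.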
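The recovery rules from the C18Z23, C18Z25, and C18Z26 entries above can be read as a small decision function. The sketch below uses the documented defaults (penalty `150`, promotion at 64 clean samples) and the documented reason strings. The state labels `recovered` and `demoted`, the `Decision` type, and the "any failure, slow sample, or drop demotes" threshold are assumptions, and scoring of demoted routes is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    hysteresis_penalty: int = 150     # C18Z23 bounded score penalty
    promotion_min_samples: int = 64   # C18Z25 clean-sample threshold


@dataclass(frozen=True)
class Decision:
    state: str
    reason: str
    penalty: int


def evaluate_recovery(success, failures, slow, drops, policy=RecoveryPolicy()):
    """Classify a route under retry cooldown from its rolling-window counts."""
    if failures or slow or drops:
        # Assumption: any unclean sample demotes; real thresholds are policy-tunable.
        return Decision("demoted", "service_channel_recovery_demoted", 0)
    if success >= policy.promotion_min_samples:
        return Decision("healthy", "service_channel_recovery_promoted", 0)
    return Decision(
        "recovered",
        "service_channel_recovery_hysteresis",
        policy.hysteresis_penalty,
    )


early = evaluate_recovery(success=10, failures=0, slow=0, drops=0)
stable = evaluate_recovery(success=64, failures=0, slow=0, drops=0)
regressed = evaluate_recovery(success=64, failures=1, slow=0, drops=0)
```

Keeping the thresholds in a policy object mirrors the cluster-level tuning described in the C18Z27 entry, where the same values become operator-editable.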
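The C18Z29 entry above relies on a stable policy fingerprint and on telling stale feedback apart from feedback that simply lacks provenance. The sketch below assumes the fingerprint is a SHA-256 digest over canonical JSON, which the source does not confirm. The classification labels and both function names are hypothetical; only the provenance field names come from the entry.

```python
import hashlib
import json


def policy_fingerprint(policy):
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_feedback(feedback, effective_fingerprint, current_generation):
    """Return 'missing', 'stale', or 'current' for one feedback observation."""
    observed_fp = feedback.get("recovery_policy_fingerprint")
    observed_gen = feedback.get("route_generation")
    if observed_fp is None and observed_gen is None:
        return "missing"  # old node-agents stay compatible
    if observed_fp is not None and observed_fp != effective_fingerprint:
        return "stale"
    if observed_gen is not None and observed_gen < current_generation:
        return "stale"
    return "current"


def may_fence_or_rebuild(feedback, effective_fingerprint, current_generation):
    """Stale feedback stays visible but cannot fence or rebuild a current route."""
    return classify_feedback(feedback, effective_fingerprint, current_generation) != "stale"


policy = {"hysteresis_penalty": 150, "promotion_min_samples": 64}
fingerprint = policy_fingerprint(policy)
old_feedback = {"recovery_policy_fingerprint": fingerprint, "route_generation": 3}
legacy_feedback = {}
```

Here `old_feedback` carries the right fingerprint but an older route generation, so against generation 4 it is classified as stale, while `legacy_feedback` without any provenance is still accepted.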
## Implementation Order
1. Define and test the generic service-channel lease and route-generation
contract in the backend. Done for the first VPN packet consumer.
2. Add node-agent entry runtime that accepts a client/service live connection
and maps it to a fabric route. Done for the first VPN packet HTTP/WebSocket
ingress with signed lease verification.
3. Add node-agent route manager with primary/alternate route selection,
generation fencing, health feedback, and failover. First alternate-route
retry and live telemetry slice is done in `0.2.162`; generation fencing,
active health feedback, and route rebuild triggers remain.
4. Add service-neutral channel scheduling and bounded queues. Protocol-neutral
IP-flow hashing and queue/backpressure telemetry landed in `0.2.163`; the
first fair drain, route memory, failed-route avoidance, and rebuild/degraded
fallback signals landed in `0.2.164`. Async per-channel workers, load
shedding policy, and deeper route rebuild history remain. Subsequent slices
landed as follows:
   - backend `0.2.165`: first Control Plane lease-time feedback consumer
   - backend `0.2.166`: durable latest route feedback
   - backend `0.2.167`: admin diagnostics and fenced-route avoidance in
     synthetic config
   - backend `0.2.168`: proactive replacement decisions
   - `0.2.169`: dampened healthy replacement preference and
     degraded/no-alternate counts
   - C18S: operator-expire retry cooldown guardrails
   - C18T: bounded rebuild request/decision metadata
   - C18U: node-agent runtime withdrawal/replacement consumption
   - C18V: route-manager transition telemetry and restore/pending fallback
     coverage
   - C18W: live Control Plane/runtime route-manager verification
   - C18X: per-logical channel failed-route isolation and bounded backpressure
     coverage
   - C18Y: route-intent lifecycle cleanup and synthetic-config expired-route
     filtering
   - C18Z: bounded multi-channel load/rebuild/drop telemetry coverage
   - C18Z1: live signed service-channel ingress through the running fabric
     listener
   - C18Z2: sustained live ingress with exit-node restart/recovery coverage
   - C18Z3: signed degraded fallback enforcement plus entry restart/WebSocket
     parity
   - C18Z4: long-lived WebSocket pressure with mid-session route-policy churn
   - C18Z5: live exit-node restart/fallback/recovery under an active WebSocket
   - C18Z6: live Control Plane rebuild replacement under an active WebSocket
   - C18Z7: concurrent active WebSocket/session isolation under rebuild
   - C18Z8: active backpressure/fairness isolation for overloaded logical flows
   - C18Z9: route-pool replacement preference
   - C18Z10: exit-pool failover
   - C18Z11: entry-pool failover contract
   - C18Z12: route quality scoring
   - C18Z13: live self-learning route quality from real service-channel traffic
   - C18Z14: active-session route-quality preference and backend feedback age
     decay
   - C18Z15: effective route-quality score telemetry and node-side effective
     score consumption
   - C18Z16: per-channel route-quality preference telemetry and multi-channel
     fairness guardrails
   - C18Z17: stale route-quality marker cleanup
   - C18Z18: service-session-scoped flow scheduler memory
   - C18Z19: bounded parallel logical-flow send windows
   - C18Z20: per-channel latency/retry/in-flight telemetry plus adaptive
     recommended window
   - C18Z21: rolling quality windows
   - C18Z22: backend rolling feedback consumption
   - C18Z23: recovery hysteresis
   - C18Z24: recovery state API/admin visibility
   - C18Z25: recovery promotion threshold policy
   - C18Z26: recovery demotion telemetry/policy
   - C18Z27: cluster-level recovery policy tuning
   - C18Z28: recovery policy provenance
   - C18Z29: feedback provenance guardrails
   - C18Z30: node-agent per-flow feedback provenance and backend heartbeat
     preservation
   - C18Z31: durable backend route-rebuild attempt ledger, API visibility, and
     admin diagnostics
   - C18Z32: generation-strict rebuild timeline correlation with node-agent
     route-manager/route-generation heartbeat telemetry and post-rebuild
     traffic counters
   - C18Z33: computed rebuild guard status/severity/reason fields and admin
     guard chips
   - C18Z34: cluster-level rebuild health summary endpoint/admin panel with
     affected nodes/routes and recommended operator action
   - C18Z35: generation-scoped operator silence for rebuild-health alerts
   - C18Z36: resurfacing detection for new generations after an operator
     silence
   - C18Z37: fast service-channel readiness gate
   - C18Z38: default-fast rebuild ledger summary with explicit deep enrichment
   - C18Z39: bounded deep-ledger drilldown by reporter/route/service/generation
     with offset pagination
   - C18Z40: bounded rebuild incident grouping with one-click deep
     investigation
   - C18Z41: audited incident investigation and incident-level silence actions
   - C18Z42: durable rebuild correlation/guard snapshots for fast warm
     readiness/health/incidents
   - C18Z43: service-channel schema preflight for migration-safe manual deploys
   - C18Z44: bounded rebuild snapshot warmup for missing correlation snapshots
     plus stale-snapshot detection
   - C18Z45: heartbeat-triggered auto-warmup for runtime-evidence rebuild
     snapshots
   - C18Z46: rebuild snapshot maintenance health with overdue/runtime-evidence
     visibility
   - C18Z47: node-agent signed service-channel lease enforcement when cluster
     authority is pinned
   - C18Z48: backend introspection fallback for token-authorized compatibility
     clients
   - C18Z49: accepted-by telemetry for signed/introspection/token-authorized
     ingress
   - C18Z50: durable lease introspection across backend restarts
   - C18Z51: bounded durable lease cleanup and admin visibility
   - C18Z52: durable accepted-by access telemetry aggregation with heartbeat
     fallback and admin visibility
   - C18Z53: active lease/session correlation with entry/exit, route status,
     fallback, and latest route-quality feedback visibility
   - C18Z54: smoke proves the same diagnostics on a normal non-fallback primary
     route with healthy rolling route-quality feedback
   - C18Z55: smoke proves degraded/fenced normal-route feedback is shown
     separately from explicit degraded compatibility requests

   C18Z56 adds active-channel remediation diagnostics (`none`,
   `rebuild_route`, `prefer_alternate_route`, `hold_degraded_route_state`) to
   make the next runtime action explicit, and its alternate-route branch is
   live-smoke-proven with compat fallback kept off.
C18Z57 adds the bounded machine-readable `remediation_command` contract to
active access telemetry rows so route-manager can consume a short-lived
`prefer_alternate_route` command with primary/replacement route ids and TTL.
C18Z58 projects those commands into node-scoped synthetic mesh config and the
node-agent route-manager consumes them as explicit applied replacement
decisions sourced from `service_channel_remediation_command`. C18Z59 proves
post-remediation service-channel traffic actually selects the replacement
route in runtime/flow telemetry without local/compat fallback. C18Z60 proves
the same remediation path for multiple independent VPN flow channels in one
packet batch, with replacement-route flow stats, no flow drops, no route
failures, and no degraded fallback. C18Z61 proves the remediation replacement
path under a larger 128-packet pressure batch with 32 replacement-route flow
stats, scheduler high-watermark 5, max-in-flight 4, no drops, no route
failures, and no degraded fallback. C18Z62 adds neutral service-channel
traffic-class QoS wiring: HTTP ingress accepts `X-RAP-Traffic-Class`, the
scheduler keeps distinct traffic-class channel ids/stats, unit tests prove
priority ordering, and live smoke proves bulk pressure plus interactive
traffic both use the replacement route without fallback, drops, or route
failures. C18Z63 proves concurrent QoS isolation in the runtime: an
interactive traffic-class packet completes while a bulk send is deliberately
held in-flight, with traffic-class stats, no drops, and no failures. C18Z64
adds compact `traffic_class_counts` telemetry to flow-scheduler snapshots so
diagnostics can see active flow-channel distribution by traffic class without
scanning every channel stat; it is live-proven on docker-test with bulk and
interactive counts visible in heartbeat metadata. C18Z65/C18Z66 project this
QoS/pressure telemetry into backend access telemetry and web-admin at cluster,
node, and active-channel levels. C18Z67 proves the live HTTP concurrent QoS
path under pressure: six parallel bulk service-channel requests and one
interactive request share the same entry path after remediation; the
interactive request completes in 132 ms, 3072 post-remediation packets move
over the replacement route, bulk/interactive replacement-route flow stats are
visible, and fallback, route failures, flow drops, and scheduler drops remain
0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
access telemetry now reports `flow_health_status` and `flow_health_reason` at
cluster, node, and active-channel levels using traffic-class pressure, queue
pressure, flow drops, compat fallback, route-quality failures/drops/slow
samples, and route send latency. C18Z69 adds node-side adaptive response:
runtime heartbeat flow-scheduler snapshots now include per-class
`recommended_parallel_windows` and adaptive backpressure reason, and the send
path applies the traffic-class-specific window so bulk/droppable are reduced
before interactive/control under pressure. C18Z70 projects those adaptive
runtime fields into backend access telemetry and web-admin at cluster, node,
and active-channel levels, with cluster windows aggregated by minimum non-zero
recommended window per class. C18Z71 adds an audited cluster adaptive-policy
contract for max window, queue/bulk thresholds, and per-class windows; the
effective policy fingerprint is signed into node synthetic config, reported
in runtime heartbeats, and consumed by node-agent scheduling so operators can
tune shared fabric backpressure without changing VPN/RDP-specific code.
C18Z72 adds an audited pool/failover policy contract for entry/exit pool
constraints, preferred entry/exit, selection strategy, failover modes,
compat fallback allowance, and sticky session mode. Lease issuance applies
that policy before route selection and signs the effective `pool_policy`
provenance into the service-channel lease authority payload. C18Z73 projects
that signed pool-policy fingerprint into active access telemetry and guards
remediation commands: backend rejects alternate routes outside the signed
entry/exit lease pools and emits `rebuild_route`, while node-agent
defensively ignores any guarded rejected `prefer_alternate_route` command
before route-manager application. Web-admin shows pool/remediation guard
status in access telemetry and node synthetic-config remediation rows. C18Z74
correlates active remediation commands with the entry node route-manager
heartbeat so access telemetry shows execution state:
`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
`pending_rebuild_request`, or `expired`, with reason/generation/observed-at.
C18Z75 records `rebuild_route` remediation as durable rebuild ledger intent
rows when node-scoped synthetic config is fetched: allowed commands become
`rebuild_status=requested` / `outcome=rebuild_requested`, while policy-guard
rejects become `rebuild_status=rejected` /
`outcome=policy_guard_rejected`. Access telemetry then reports
`rebuild_request_recorded` or `rebuild_request_rejected` for the active
channel. C18Z76 adds node-side acknowledgement for the allowed
`rebuild_route` branch: node-agent consumes the command as a route-manager
`pending_degraded_route_state` decision with source
`service_channel_remediation_command`, while guarded commands remain ignored.
Backend access telemetry correlates that heartbeat evidence with the durable
ledger and reports `rebuild_request_recorded_node_pending`. C18Z77 resolves
those durable remediation rebuild requests in the Control Plane planner:
valid alternates inside the active signed lease pools become `applied` /
`replacement_selected` route-manager decisions with the same command id,
missing safe alternates become `no_alternate`, policy/lease blocks become
`deferred_by_policy`, and stale commands become `expired`. Access telemetry
reports these as `rebuild_request_applied`,
`rebuild_request_no_alternate`, `rebuild_request_deferred_by_policy`, or
`rebuild_request_expired`. C18Z78 adds operator-facing visibility for those
planner outcomes in web-admin and live-proves the applied branch: when an
alternate route appears after lease issuance, the existing `rebuild_route`
command resolves to `applied` / `replacement_selected` and access telemetry
reports `rebuild_request_applied`.
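
The C18Z77 planner outcomes and their access-telemetry statuses can be summarized in a small sketch. The outcome and status strings come from the text above; the `RebuildRequest` shape and the order in which the checks are evaluated are assumptions made for illustration.

```python
from dataclasses import dataclass

# Planner outcome -> access telemetry status, as listed for C18Z77.
TELEMETRY_STATUS = {
    "applied": "rebuild_request_applied",
    "no_alternate": "rebuild_request_no_alternate",
    "deferred_by_policy": "rebuild_request_deferred_by_policy",
    "expired": "rebuild_request_expired",
}


@dataclass
class RebuildRequest:
    """Hypothetical planner input; field names are not from the source."""
    stale: bool
    blocked_by_policy: bool
    alternates_in_lease_pools: int


def resolve(request: RebuildRequest) -> str:
    # Assumed precedence: staleness first, then policy/lease blocks,
    # then availability of a safe alternate inside the signed lease pools.
    if request.stale:
        return "expired"
    if request.blocked_by_policy:
        return "deferred_by_policy"
    if request.alternates_in_lease_pools > 0:
        return "applied"
    return "no_alternate"


status = TELEMETRY_STATUS[resolve(RebuildRequest(False, False, 1))]
```
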
C18Z79 closes that applied-branch proof loop: after the planner resolves the
existing rebuild command to a replacement route, the entry node reports a
route-manager decision for the same `rebuild_request_id`, the transition is
`applied_rebuild`, and live service-channel packet ingress selects the
replacement route with no local/compat fallback, route failures, or flow
drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
service-channel packet bursts remain on the replacement route, no stale
primary route is reselected, and fallback, route-failure, flow-drop, and
scheduler-drop deltas remain zero from the pre-pressure baseline. C18Z81
proves the negative recovery branch: when the already-applied replacement
route reports generation-valid fenced feedback, the Control Plane selects a
new safe recovery route and live traffic moves to that recovery route without
reselecting the degraded replacement or adding fallback/failure/drop deltas.
C18Z82 proves the no-safe-recovery branch: if that replacement is also fenced
and no safe recovery route exists, synthetic config reports
`service_channel_feedback_no_alternate` / `pending_degraded_route_state` with
`no_unfenced_alternate_route` instead of silently keeping a bad route.
C18Z83 projects that route-manager decision into active access telemetry and
web-admin active-channel diagnostics, including decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
C18Z84 aggregates those decisions at access-telemetry summary level so the
operator can see replacement, applied rebuild, recovery, and no-safe counts
without drilling into individual channel rows.
C18Z85 projects those access-decision aggregates into rebuild health and
incidents, adding `incident_source=access_decision` rows for active
no-safe/recovery/applied route-decision states. C18Z86 adds
channel-scoped silence/acknowledgement for those access-decision incidents:
the silence API accepts `incident_source` and `channel_id`, stores no-safe
access silences under a channel-scoped route key, and rebuild
health/incidents apply those silences so acknowledged current-generation
no-safe decisions are not counted as active bad incidents. Resurfacing on
generation change is covered in unit tests; live runtime smoke proves the
operator silence path. C18Z87 exposes active silences through the API and
web-admin, including access-decision source/channel/display route metadata,
and adds unsilence so an acknowledged access no-safe incident can be made
active again without waiting for TTL expiry. C18Z88 exposes access-decision
resurface details on incidents: the silence id, previous acknowledged
generation, and silence expiry are returned when the current active-channel
decision changes generation after acknowledgement. The live smoke proves the
incident resurfaces as active bad while preserving previous-generation
context for the operator. C18Z89 closes the generation-change operator action
loop for resurfaced access-decision incidents: incidents now include
`alert_resurfaced_cause`, previous route id, and previous channel id;
web-admin shows the cause; and the live smoke proves the operator can
re-acknowledge the resurfaced generation after validating that active-channel
decision route/generation context matches the incident.

C18Z90 introduces an explicit signed production data-plane contract on
service-channel leases: `data_plane` is present in the lease, authority
payload, introspection response, and lease-maintenance/admin list. It
declares:

- backend API as the control-plane transport
- fabric service channel/fabric route as the working-data and steady-state
  transport
- degraded compatibility relay as an explicit compatibility state only
- service-neutral, protocol-agnostic, isolated logical flows as the runtime
  contract for VPN, Remote Workspace, files, video, and future services

C18Z91
makes node-agent consume the signed/introspected data-plane contract, apply
the preferred fabric route, log data-plane mode/transports/fallback policy,
and report contract adoption in heartbeat access telemetry. C18Z92 enforces
the fallback boundary: when `backend_relay_policy=disabled`, route failure or
missing fabric route returns a visible service-channel error instead of
silently proxying working data through backend relay. C18Z93-C18Z95 project
that data-plane contract and blocked-fallback evidence into access telemetry,
incidents, and node-agent heartbeat reports. C18Z96-C18Z98 feed
access-report-derived blocked fallback send failures into durable route
feedback and rebuild ledger correlation, with bounded deduplication and
feedback identity carried into replacement decisions. C18Z99 adds rebuild
ledger filters for `feedback_source`, `feedback_channel_id`, and
`feedback_violation_status`. C18Z100 aggregates those same fields in
rebuild-health `feedback_breakdowns`, including active warn/bad, silenced,
latest observation, and affected reporter node/route counts, and web-admin
shows the breakdown in the Rebuild health panel. C18Z101 connects that
operator view to investigation: each breakdown row shows related incident
context by channel/reporter/route overlap and can open the deep rebuild
ledger with source/channel/violation filters prefilled. C18Z102 adds backend
audit breadcrumbs for that drilldown, recording
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
events with the feedback source/channel/violation filters before the panel
opens the filtered deep ledger. C18Z103 surfaces recent rebuild incident and
feedback-breakdown investigation audit breadcrumbs directly in the Fabric
diagnostics panel with time, source, feedback filters, target reporter/route,
actor, and reason. C18Z104 adds focused audit loading: the cluster audit API
accepts `event_type` and `target_type` filters, and the Fabric diagnostics
panel requests just the recent fabric investigation breadcrumbs instead of
relying on the generic latest cluster audit window. C18Z105 correlates those
breadcrumbs back to the currently visible rebuild-health feedback breakdowns
or rebuild incidents in web-admin, marking whether the diagnostic object is
still active/visible and giving the operator a direct `open` action. C18Z106
moves that correlation into the backend/API: focused audit reads with
`correlation=fabric_diagnostics` return `correlation_hints` containing the
current diagnostic status and matching breakdown/incident object when
present. The rebuild-health feedback breakdown window was also raised to 100
groups so fresh failure classes remain visible on noisy long-running test
clusters. C18Z107 adds compact `audit_summary` aggregates for focused Fabric
diagnostics audit reads, including counts by current diagnostic status,
feedback source, feedback violation status, correlated/not-visible totals,
and latest time, and web-admin shows those counts above the investigation
rows. C18Z108 splits the operator workflow read from generic cluster audit:
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
returns a dedicated `rebuild_investigation_breadcrumbs` contract with events
and summary, and web-admin consumes that endpoint for Recent investigations.
C18Z109 adds freshness windows to that contract: callers can pass
`current_window_seconds` and `history_window_seconds`, events are marked
`current`, `stale`, or `expired` in `correlation_hints.breadcrumb_status`,
and the summary includes counts by breadcrumb status for operator triage.
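
A minimal sketch of the C18Z109 freshness windows follows. The parameter names `current_window_seconds` and `history_window_seconds` and the three status values come from the text; the exact boundary semantics (inclusive comparisons against event age) are an assumption.

```python
from collections import Counter


def breadcrumb_status(age_seconds, current_window_seconds, history_window_seconds):
    """Classify one breadcrumb by age (assumed inclusive window bounds)."""
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def summarize(ages, current_window_seconds, history_window_seconds):
    """Count breadcrumbs per status, similar in spirit to the summary counts."""
    return dict(
        Counter(
            breadcrumb_status(age, current_window_seconds, history_window_seconds)
            for age in ages
        )
    )


# Hypothetical event ages in seconds.
summary = summarize([30, 900, 7200, 100000],
                    current_window_seconds=600,
                    history_window_seconds=86400)
```
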
C19C adds the first non-VPN service-channel lease proof: Remote Workspace uses
the same signed data-plane contract, route intent model, introspection, and
maintenance visibility, but its entry descriptor is service-specific
(`remote-workspaces/.../streams`) and uses a remote-workspace frame batch media
type rather than VPN packet paths.
C19D proves the matching entry-node ingress boundary for Remote Workspace:
node-agent validates signed lease authority or introspection, service class,
channel class, selected entry node, allowed flow isolation, and data-plane
contract on `remote-workspaces/{resource_id}/streams/{channel_class}`. Empty
probe requests return `202` with a remote-workspace ingress probe contract and
access telemetry; the payload flow for empty probes remains deliberately
`validated_only`, and real RDP frame forwarding is deferred until the service
adapter work begins.
C19E adds a narrow frame-batch probe on that boundary. The adapter contract
advertises `rap.remote_workspace_frame_batch.v1`, and entry-node accepts
non-empty payloads only when they are JSON probe batches with `probe_only=true`,
valid remote-workspace logical channels, valid directions, and bounded payload
metadata. Accepted frame probes return `payload_flow=validated_probe_only`, while
empty/control probes return `payload_flow=validated_only`; production
frame forwarding is still not enabled.
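
The C19E acceptance rule can be sketched as a small classifier. The `payload_flow` values and the `probe_only=true` requirement come from the text; the channel names, direction names, `frames` field, and the frame bound are hypothetical placeholders, since the real `rap.remote_workspace_frame_batch.v1` schema is not reproduced here.

```python
import json

# Hypothetical logical channels and the directions each one allows.
ALLOWED = {
    "display": {"to_client"},
    "input": {"to_server"},
    "control": {"to_client", "to_server"},
}
MAX_FRAMES = 64  # hypothetical bound on payload metadata


def classify_probe(body: bytes) -> str:
    """Return the payload_flow for an ingress body or raise ValueError."""
    if not body:
        return "validated_only"  # empty/control probe
    batch = json.loads(body)  # JSONDecodeError is a ValueError subclass
    if not isinstance(batch, dict) or batch.get("probe_only") is not True:
        raise ValueError("probe_only must be true")
    frames = batch.get("frames")
    if not isinstance(frames, list) or not 0 < len(frames) <= MAX_FRAMES:
        raise ValueError("frame metadata missing or out of bounds")
    for frame in frames:
        if not isinstance(frame, dict):
            raise ValueError("frame must be an object")
        channel, direction = frame.get("channel"), frame.get("direction")
        if not isinstance(channel, str) or channel not in ALLOWED:
            raise ValueError("unknown logical channel")
        if direction not in ALLOWED[channel]:
            raise ValueError("invalid channel direction")
    return "validated_probe_only"


ok_body = json.dumps(
    {"probe_only": True, "frames": [{"channel": "display", "direction": "to_client"}]}
).encode()
flow = classify_probe(ok_body)
```
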
C19F connects that validated probe to a node-agent local adapter sink. The
in-memory `node_agent_rdp_worker_contract_probe` sink accepts only validated
probe batches and returns `rap.remote_workspace_frame_batch_delivery.v1`
receipts. Entry responses now report `payload_flow=delivered_probe_only` when
the local sink accepts the batch; no RDP server traffic or desktop frame
forwarding is enabled by this stage.
C19G makes that sink delivery observable outside the direct ingress response:
node-agent reports `remote_workspace_adapter_sink` in `rdp-worker` workload
status and `remote_workspace_adapter_sink_report` in node telemetry, including
delivery count, latest sequence, frame count, channel class, adapter contract,
and explicit `payload_traffic=none` proof.
C19H adds negative guardrail proof for the same frame path: `probe_only=false`,
unknown logical channels, invalid channel direction, service/channel mismatch,
and unsupported payload encoding are rejected before adapter delivery. This
keeps the current Remote Workspace path as a contract probe only, not a hidden
RDP payload tunnel.
C19I adds bounded adapter handoff queue/ack semantics to that probe-only sink.
The sink reports queue capacity/depth and accepted, dropped, acked, backpressure,
and drop-policy fields in `rap.remote_workspace_frame_batch_delivery.v1`.
Current capacity is `8`: droppable display overflow is accepted with excess
frames dropped and accepted frames acked, while reliable input overflow returns
backpressure without `adapter_delivery`. The path remains
`payload_traffic=none`; real RDP frame forwarding is still deferred to the
service adapter runtime.
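
The C19I overflow behaviour can be illustrated with a short sketch. The capacity of `8` and the display/input split come from the text; the `HandoffResult` type, the `offer` function, and the channel-class strings are assumptions for illustration only.

```python
from dataclasses import dataclass

QUEUE_CAPACITY = 8  # current capacity stated for the probe-only sink


@dataclass
class HandoffResult:
    accepted: int
    dropped: int
    acked: int
    backpressure: bool  # True means no adapter delivery happened


def offer(channel_class: str, frame_count: int, queue_depth: int = 0,
          capacity: int = QUEUE_CAPACITY) -> HandoffResult:
    free = max(capacity - queue_depth, 0)
    if frame_count <= free:
        return HandoffResult(frame_count, 0, frame_count, False)
    if channel_class == "display":
        # Droppable class: accept what fits, drop the excess, ack the accepted frames.
        return HandoffResult(free, frame_count - free, free, False)
    # Reliable class (for example input): reject the batch and signal backpressure.
    return HandoffResult(0, 0, 0, True)


display_overflow = offer("display", 10)
input_overflow = offer("input", 10)
```

Rejecting the whole reliable batch keeps ordering guarantees simple: the sender retries the same batch instead of reasoning about a partially accepted one.
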
C19J promotes those queue/backpressure signals into the existing observability
surfaces. Workload status and node telemetry now expose queue capacity/depth,
cumulative accepted/dropped/acked frame counters, `backpressure_count`, and the
latest rejected batch metadata/reason, so adapter pressure can be diagnosed
without relying on the individual ingress response.
C19K binds that queue model to a probe-only adapter session identity. Entry-node
derives `adapter_session_id` from the selected service-channel context and the
adapter sink reports `adapter_runtime_id=node_agent_rdp_worker_contract_probe`
with `session_state=probe_bound` in delivery receipts, workload status, and
telemetry. Rejected reliable overflow batches keep the same session identity,
which gives the future real adapter runtime a stable lifecycle boundary while
payload forwarding remains disabled.
C19L adds lifecycle accounting for those probe-only adapter sessions. Node-agent
tracks active sessions, created/bound totals, last activity timestamps,
per-session delivery/backpressure/frame counters, idle expiry counters, and
`current_session_lifecycle_state`. Successful probe delivery binds the session;
reliable overflow records pressure on the same session instead of hiding it as a
standalone request failure.
C19M adds an explicit local control endpoint for that lifecycle:
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
accepts `close`, `expire`, and `reset`. The control result and report counters
make deliberate session shutdown visible through workload status and telemetry,
which prepares the same lifecycle shape for a real adapter runtime.
C19N adds guardrails for that endpoint: unsupported actions, malformed payloads,
invalid session IDs, unknown sessions, and oversized reasons are rejected before
state mutation. Repeated `close` is idempotent for a terminal session, reporting
the prior terminal state without double-counting closed sessions.
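
A minimal sketch of the C19M/C19N control semantics follows. The actions `close`, `expire`, and `reset` and the `probe_bound` state come from the text; the class, the reason limit, and the choice to treat every action on a terminal session as idempotent (the text only states this for repeated `close`) are assumptions.

```python
TERMINAL = {"close": "closed", "expire": "expired", "reset": "reset"}
MAX_REASON_LEN = 256  # hypothetical limit for the "oversized reason" guardrail


class AdapterSessions:
    def __init__(self):
        self.state = {}  # adapter_session_id -> lifecycle state
        self.terminal_total = {"closed": 0, "expired": 0, "reset": 0}

    def bind(self, session_id: str) -> None:
        self.state[session_id] = "probe_bound"

    def control(self, session_id: str, action: str, reason: str = "") -> dict:
        # All validation happens before any state mutation.
        if action not in TERMINAL:
            raise ValueError("unsupported action")
        if len(reason) > MAX_REASON_LEN:
            raise ValueError("oversized reason")
        if session_id not in self.state:
            raise KeyError("unknown session")
        current = self.state[session_id]
        if current in self.terminal_total:
            # Already terminal: report the prior state, do not count it again.
            return {"state": current, "changed": False}
        new_state = TERMINAL[action]
        self.state[session_id] = new_state
        self.terminal_total[new_state] += 1
        return {"state": new_state, "changed": True}


sessions = AdapterSessions()
sessions.bind("s1")
first = sessions.control("s1", "close", "operator shutdown")
second = sessions.control("s1", "close")
```
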
C19O adds a direct snapshot endpoint for diagnostics:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`
returns active and optional terminal adapter sessions with lifecycle state,
activity/backpressure timestamps, counters, and runtime identity. This gives the
future real adapter runtime an operator-facing inspection surface before payload
forwarding is enabled.
C19P adds the runtime handoff mailbox for active adapter sessions. The mailbox is
bounded in memory and stores `frame_batch_probe_delivered` and `backpressure`
events with sequence numbers and service-channel context. A future `rdp-worker`
runtime can read or drain it via
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`,
while snapshots and telemetry expose mailbox depth and enqueue/drain/drop
counters.
C19Q hardens that mailbox handoff surface. Invalid adapter session IDs, unknown
sessions, and invalid limits are rejected without mutating mailbox state, while
`drain=true&limit=N` can remove events in bounded chunks and leave the remaining
depth visible for the next adapter-runtime poll. Under pressure, the mailbox is
verified to behave as bounded drop-oldest state, and the mailbox of a closed
adapter session is no longer readable as an active runtime mailbox. This
preserves the probe-only boundary
and still does not enable RDP frame forwarding.
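
The drop-oldest mailbox with bounded drains described for C19P and C19Q can be sketched as below. The event kinds and the `drain`/`limit` parameters come from the text; the class, field names, and capacity value are assumptions.

```python
from collections import deque


class Mailbox:
    """Bounded in-memory mailbox that evicts the oldest event when full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()
        self.next_sequence = 1
        self.dropped_total = 0
        self.drained_total = 0

    def enqueue(self, kind: str) -> None:
        if len(self.events) == self.capacity:
            self.events.popleft()  # drop-oldest under pressure
            self.dropped_total += 1
        self.events.append({"sequence": self.next_sequence, "kind": kind})
        self.next_sequence += 1

    def read(self, limit: int, drain: bool = False) -> dict:
        if limit < 1:
            raise ValueError("invalid limit")  # rejected before touching state
        batch = list(self.events)[:limit]
        if drain:
            for _ in batch:
                self.events.popleft()
            self.drained_total += len(batch)
        return {"events": batch, "depth": len(self.events)}


mailbox = Mailbox(capacity=3)
for _ in range(5):
    mailbox.enqueue("frame_batch_probe_delivered")
chunk = mailbox.read(limit=2, drain=True)
```
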
C19R adds bounded mailbox polling ergonomics for that future runtime consumer.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
now accepts `wait_ms`, returns explicit `empty`, `waited`, `wait_timeout`, and
`wait_ms` fields, and wakes when a new mailbox event arrives before the timeout.
The wait remains node-local and probe-only; it does not enable desktop frame
transport, backend relay, or production RDP payload forwarding.
C19S promotes those mailbox consumer signals into node-agent diagnostics.
Workload status, heartbeat telemetry, and active session snapshots now expose
mailbox read, wait, timeout, and empty-read counters plus last mailbox read
metadata. This lets operators identify hot polling or idle adapter consumers
without opening a data-plane path or forwarding desktop frames.
C19T adds node-local mailbox consumer checkpoint/ack metadata for the future
adapter runtime handoff. The mailbox endpoint accepts `consumer_id` and
`ack_sequence`, validates both before reading state, and returns consumer read,
ack, checkpoint, ack sequence, and lag metadata. The probe sink keeps bounded
per-session consumer cursor state and exposes aggregate/current-session
consumer counters in workload status and heartbeat telemetry. This remains a
diagnostic handoff contract only: no RDP frames are forwarded, no backend relay
semantics are introduced, and the mailbox stays node-local.
C19U adds lifecycle guardrails for those node-local consumer cursors. A consumer
can request `reset_consumer=true` with a valid `consumer_id` to clear its cursor
before the current mailbox read is recorded, and mailbox responses now expose
consumer capacity/count plus created/reset/evicted lifecycle metadata. Workload
status and heartbeat telemetry also expose reset and eviction counters, keeping
cursor cleanup observable without changing mailbox delivery or enabling
payload forwarding.
C19V adds read-only cursor inspection for adapter-runtime handoff recovery.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
returns the active session's bounded consumer cursor list with checkpoint, ack,
lag, read/ack totals, and timestamps. The endpoint supports a bounded `limit`
and does not read, drain, reset, or mutate mailbox state, so inspection remains
node-local and diagnostic-only.
C19W adds cursor-aware resume reads for mailbox consumers. The mailbox endpoint
now accepts `after_sequence` for non-destructive reads and returns
`after_sequence`, `skipped_count`, and `returned_count` so adapter runtimes can
resume from a checkpoint without client-side filtering. Long-poll waits for
events newer than the requested sequence, and `after_sequence` is rejected with
`drain=true` to keep resume reads separate from destructive mailbox drains.
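
A sketch of the C19W resume read follows. The response fields `after_sequence`, `skipped_count`, and `returned_count` and the `drain=true` rejection come from the text; the function itself and the reading of `skipped_count` as "retained events at or below the requested sequence" are assumptions.

```python
def read_after(events, after_sequence: int, limit: int, drain: bool = False) -> dict:
    """Non-destructive read of events newer than after_sequence."""
    if drain:
        raise ValueError("after_sequence cannot be combined with drain=true")
    newer = [event for event in events if event["sequence"] > after_sequence]
    returned = newer[:limit]
    return {
        "after_sequence": after_sequence,
        "skipped_count": len(events) - len(newer),  # assumed meaning, see lead-in
        "returned_count": len(returned),
        "events": returned,
    }


# Hypothetical retained mailbox window: sequences 3..7.
retained = [{"sequence": n} for n in range(3, 8)]
resume = read_after(retained, after_sequence=4, limit=2)
```
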
C19X adds consumer-aware resume convenience on top of that explicit sequence
window. `resume_from=ack|checkpoint` can be used with `consumer_id` to resolve
the read window from the stored consumer cursor before reading the mailbox, and
responses include `resume_from` and `resume_sequence`. Resume requests reject
manual `after_sequence`, `drain=true`, `reset_consumer=true`, a missing
`consumer_id`, and unknown consumer cursors so adapter runtimes cannot
accidentally mix cursor modes.
C19Y adds resume telemetry for operator diagnostics. Workload status and
heartbeat reports expose resume/after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots mirror the per-session counters so diagnostics
can distinguish normal polling from cursor-resume reads without reading or
draining mailbox state.
C19Z adds a compact adapter-runtime readiness summary to the sink report.
`adapter_runtime_readiness` combines probe-only status, session lifecycle state,
mailbox depth, consumer cursor, resume cursor, lag, and returned/skipped counts
into one diagnostic object so operators can verify handoff readiness without
triggering mailbox reads or drains.
C19Z1 adds a read-only mailbox handoff preflight endpoint. Adapter runtimes can
call `/mailbox/preflight` with `consumer_id` and `resume_from=ack|checkpoint`
to validate the stored cursor and inspect the next expected event window without
reading, draining, acking, or mutating consumer state.
C19Z2 adds separate telemetry for those handoff checks. Workload status and
heartbeat reports expose preflight totals split by ack/checkpoint cursor and the
last preflight session, consumer, cursor, after-sequence, available/returned/
skipped counts, and expected sequence range; readiness diagnostics mirror the
latest preflight summary.
C19Z3 adds stale-cursor diagnostics to preflight. When a consumer cursor points
behind dropped bounded-mailbox events, the preflight response reports retained
sequence bounds, `diagnostic_state=stale_cursor_gap`, `stale_cursor=true`, and
`missing_dropped_count`; workload/heartbeat telemetry and readiness diagnostics
mirror that latest stale state.
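
The stale-cursor check from C19Z3 can be sketched as a pure function over the retained window. The fields `diagnostic_state=stale_cursor_gap`, `stale_cursor`, and `missing_dropped_count` and the operator status names come from this document; the function signature, the retained-bound key names, and the use of `None` for the non-stale diagnostic state are assumptions.

```python
from typing import Optional


def preflight(cursor_sequence: int,
              first_retained: Optional[int],
              last_retained: Optional[int]) -> dict:
    """Read-only check: cursor_sequence is the consumer's last ack/checkpoint."""
    next_expected = cursor_sequence + 1
    result = {
        "first_retained_sequence": first_retained,  # hypothetical key names
        "last_retained_sequence": last_retained,
        "diagnostic_state": None,
        "stale_cursor": False,
        "missing_dropped_count": 0,
    }
    if first_retained is not None and next_expected < first_retained:
        # Events between the cursor and the retained window were dropped.
        result.update(
            diagnostic_state="stale_cursor_gap",
            stale_cursor=True,
            missing_dropped_count=first_retained - next_expected,
            operator_status="resync_required",
        )
    elif last_retained is None or cursor_sequence >= last_retained:
        result["operator_status"] = "caught_up"
    else:
        result["operator_status"] = "ready_to_resume"
    return result


stale = preflight(cursor_sequence=2, first_retained=6, last_retained=10)
```
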
C19Z4 adds explicit action hints to those diagnostics. Preflight responses now
include `recommended_action` and `action_hints`; stale cursor gaps recommend
resetting the consumer cursor, requesting a full adapter resync, and resuming
from checkpoint after resync. Telemetry and readiness diagnostics mirror the
latest recommended action and hints.
C19Z5 adds remediation provenance for those hints. Preflight responses,
workload/heartbeat telemetry, and readiness diagnostics include
`action_reason` plus structured `action_context` with the resume cursor,
retained sequence bounds, dropped/missing counts, consumer checkpoint/ack, and
expected window counters that explain why the recommended action was chosen.
C19Z6 adds a compact operator-facing preflight summary derived from the same
read-only state. Preflight responses, telemetry, and readiness diagnostics now
include `operator_summary` and `operator_summary_fields` so dashboards can show
the diagnostic state, action, reason, resume cursor, retained bounds, and key
window counters without recomputing or mutating mailbox state.
C19Z7 adds machine-sortable operator status and severity to that summary.
Preflight responses, telemetry, readiness diagnostics, and
`operator_summary_fields` now expose `operator_status` and `operator_severity`
so dashboards can sort ready, caught-up, and resync-required handoffs without
parsing human text.
C19Z8 groups the latest preflight view for admin UI consumption. The readiness
diagnostic keeps all existing flat latest-preflight fields and adds
`last_preflight` with observed time, cursor, counts, diagnostic state, selected
action, action provenance, operator summary, status, severity, and summary
fields.
C19Z9 adds retained-window detail to that grouped readiness view. The
`last_preflight` object now includes first/last retained sequence and mailbox
dropped total so stale-cursor summaries can explain the bounded mailbox window
without requiring a separate raw preflight lookup.
C19Z10 adds a structured remediation checklist to the grouped readiness view.
The `last_preflight.remediation_checklist` entries are derived from diagnostic
state and action hints, marking required/satisfied operator steps for cursor
reset, adapter resync, and post-resync resume without executing those actions.
C19Z11 adds summary status and counts for that checklist. The grouped readiness
view now exposes `remediation_checklist_status` plus total, required,
satisfied, and pending counts so admin UI can render checklist state without
scanning the step array.
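
The checklist rollup from C19Z11 can be sketched as below. The total/required/satisfied/pending counts come from the text; the step names, the `required`/`satisfied` keys, and the status strings are hypothetical.

```python
def summarize_checklist(steps) -> dict:
    """Derive summary counts so a UI does not need to scan the step array."""
    required = sum(1 for step in steps if step["required"])
    satisfied = sum(1 for step in steps if step["required"] and step["satisfied"])
    pending = required - satisfied
    if required == 0:
        status = "not_required"  # hypothetical status values
    elif pending == 0:
        status = "complete"
    else:
        status = "pending"
    return {
        "remediation_checklist_status": status,
        "total": len(steps),
        "required": required,
        "satisfied": satisfied,
        "pending": pending,
    }


# Hypothetical checklist for a stale cursor gap.
checklist = [
    {"step": "reset_consumer_cursor", "required": True, "satisfied": True},
    {"step": "request_adapter_resync", "required": True, "satisfied": False},
    {"step": "resume_from_checkpoint", "required": True, "satisfied": False},
]
rollup = summarize_checklist(checklist)
```
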
C19Z12 adds per-session preflight operator status/severity counters. Readiness
now exposes counts for statuses such as `ready_to_resume`, `caught_up`, and
`resync_required`, plus severity counts such as `ok`, `info`, and `warn`, and
the grouped latest-preflight rollup mirrors those counters for dashboard
context.
C19Z13 derives a compact preflight attention status from those counters.
Readiness and `last_preflight` expose `preflight_attention_status` values such
as `clean`, `needs_attention`, and `repeated_resync_required`, letting admin UI
sort sessions without interpreting count maps directly.
C19Z14 proves the repeated-resync branch. Unit and live smoke coverage now run
multiple stale preflights on the same active adapter session and verify
`preflight_attention_status=repeated_resync_required` with repeated
`resync_required` / `warn` counters, while the preflight path remains read-only.
C19Z15 adds `preflight_attention_reason` beside the attention status. The reason
is derived from the latest preflight counters/status and explains clean,
attention-needed, and repeated-resync states without requiring UI code to parse
the counter maps.
C19Z16 completes focused proof coverage for those reasons. Unit coverage proves
clean, single-resync, repeated-resync, and no-preflight mappings, and live smoke
proves the single stale-preflight `resync_required_preflight_observed` reason.
C19Z17 adds a diagnostics contract marker to the grouped preflight readiness
rollup. `last_preflight` now includes `diagnostics_schema_version` and a
`diagnostics_contract` list for retained-window, remediation-checklist,
attention, and operator-count fields so admin UI can gate rendering safely.
C19Z18 adds machine-readable feature flags for that contract. `last_preflight`
now includes boolean `diagnostics_features` entries for retained-window,
remediation-checklist, attention, and operator-count diagnostics, allowing UI
and automation clients to check support without scanning the contract list.
C19Z19 adds a compatibility proof for the two contract forms. Unit and live
smoke coverage now verify that workload and telemetry reports expose matching
`diagnostics_contract` entries and `diagnostics_features` booleans for each
preflight diagnostics group.
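
The alignment that C19Z19 proves between the two contract forms can be expressed as a simple set comparison. The field names `diagnostics_contract` and `diagnostics_features` come from the text; the group identifiers used below are hypothetical spellings of the retained-window, remediation-checklist, attention, and operator-count groups.

```python
def contract_matches_features(diagnostics_contract, diagnostics_features) -> bool:
    """True when the contract list and the enabled feature booleans agree."""
    enabled = {name for name, supported in diagnostics_features.items() if supported}
    return set(diagnostics_contract) == enabled


# Hypothetical group identifiers.
contract = ["retained_window", "remediation_checklist", "attention", "operator_counts"]
features = {
    "retained_window": True,
    "remediation_checklist": True,
    "attention": True,
    "operator_counts": True,
}
aligned = contract_matches_features(contract, features)
```
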
C19Z20 adds the no-preflight absence proof. Active adapter sessions that have
not observed a mailbox preflight report `preflight_attention_status=unknown`,
`preflight_attention_reason=no_preflight_observed`, zero session preflight
count, and no grouped `last_preflight` rollup, so UI can distinguish "not
observed yet" from an observed clean state.
C19Z21 adds the no-active-session readiness proof. After the last adapter
session is closed, readiness reports idle/not-ready with zero active sessions,
no active `adapter_session_id`, no `last_preflight` rollup, and terminal
`last_session_state=closed` from the terminal-session ledger.
C19Z22 extends terminal-state coverage to `expire` and `reset` controls. The
same no-active-session readiness shape now proves `last_session_state=expired`
and `last_session_state=reset` from the terminal-session ledger.
C19Z23 adds grouped terminal-session summary metadata for the no-active-session
case. Readiness now includes `terminal_session_summary` with adapter session id,
terminal state, reason, and control timestamp while retaining flat compatibility
fields.
C19Z24 adds a contract marker to that summary. The grouped
`terminal_session_summary` now carries a schema version and summary-contract
field list so UI can gate rendering explicitly.
C19Z25 adds boolean feature flags for the same grouped terminal summary fields,
mirroring the preflight diagnostics contract/feature pattern.
C19Z26 adds compatibility proof coverage for those two terminal summary contract
forms, verifying that `summary_contract` entries and `summary_features` booleans
stay aligned in workload and telemetry reports.
C19Z27 adds absence proof coverage for a fresh no-session runtime: before any
terminal history exists, readiness stays in `waiting_for_session` and does not
include `terminal_session_summary`.
C19Z28 adds the grouped no-session readiness summary for that empty-runtime
state. Fresh adapter readiness now includes `no_session_summary` with schema
version `rap.remote_workspace_adapter_no_session_summary.v1`, a summary
contract for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, and matching idle/waiting-for-session counts, while
the terminal-session summary remains absent until terminal history exists.
C19Z29 adds boolean `summary_features` to the same grouped no-session summary
for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, matching the terminal summary and preflight
diagnostics feature-flag convention.
C19Z30 adds compatibility proof coverage for the grouped no-session summary,
verifying that `summary_contract` entries and `summary_features` booleans stay
aligned in workload and telemetry reports.
C19Z31 adds the inverse terminal-history absence proof: after adapter sessions
reach terminal states, readiness exposes `terminal_session_summary` and omits
`no_session_summary` in workload and telemetry reports.
C19Z32 proves readiness summary exclusivity across the three runtime shapes:
fresh exposes only `no_session_summary`, active exposes neither grouped summary,
and terminal exposes only `terminal_session_summary`.
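
The exclusivity rule proven in C19Z32 can be sketched as a function from session counts to summary presence. The three shapes and the two summary names come from the text; the function, its inputs, and the boolean key names are assumptions.

```python
def readiness_summaries(active_session_count: int, terminal_session_count: int) -> dict:
    """Decide which grouped summary, if any, a readiness report should carry."""
    if active_session_count > 0:
        shape = "active"
    elif terminal_session_count > 0:
        shape = "terminal"
    else:
        shape = "fresh"
    return {
        "shape": shape,
        "has_no_session_summary": shape == "fresh",
        "has_terminal_session_summary": shape == "terminal",
    }


fresh = readiness_summaries(0, 0)
active = readiness_summaries(1, 2)
terminal = readiness_summaries(0, 2)
```
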
C19Z33 adds a compact readiness state matrix artifact for admin/runtime handoff:
fresh, active, and terminal rows are emitted for workload and telemetry with
only the relevant readiness fields and summary-presence booleans.
C19Z34 adds an explicit probe-to-runtime gate artifact. It confirms the current
Remote Workspace runtime is still `contract_probe`, `probe_only=true`, and
`payload_traffic=none`, lists the ready contracts, and records the remaining
runtime gates before real RDP frame transport can be enabled.
C19Z35 adds the disabled-by-default real-adapter supervision scaffold. The
`rdp-worker` contract-probe status now advertises
`rap.remote_workspace_real_adapter_supervision.v1` with future config env names,
status contract fields, and guardrails, while `contract_probe` remains the only
active execution mode and payload traffic remains `none`.
C19Z36 adds compatibility proof for that scaffold, verifying the disabled state,
status contract, env names, process model, and guardrails remain aligned in unit
and live workload status coverage.
C19Z37 adds disabled real-adapter config projection. Node-agent parses the
future `RAP_REMOTE_WORKSPACE_REAL_ADAPTER_*` env values and reports only
sanitized status metadata under
`real_adapter_supervision.config_projection`: whether enable was requested,
whether command/args/workdir are present, args JSON shape, and that raw values
are redacted. This does not activate the real adapter; `enabled=false`,
`activation_allowed=false`, and `payload_traffic=none` remain required.
C19Z38 proves projection compatibility across default/empty and requested
config shapes. Unit and live smoke coverage verify absent env and requested
env both keep activation blocked, raw values redacted, and payload traffic
disabled.
C19Z39 adds an explicit disabled activation decision contract. The real adapter
status now reports `decision=blocked`,
`reason=real_runtime_stage_not_enabled`, `activation_allowed=false`, and the
missing gates before a future stage may start an external RDP worker process.
C19Z40 adds a compact handoff report proving that the supervision scaffold,
config projection, and blocked activation decision remain aligned for both
requested and default config shapes.
C19Z41 adds real-adapter supervision feature flags for config projection,
activation decision, missing gates, and raw-value redaction so UI and
automation clients can gate rendering explicitly.
C19Z42 folds those feature flags into the compact handoff report, proving
scaffold/projection/decision/features alignment for requested and default node
config in one admin/runtime artifact.
C19Z43 proves contract-probe precedence when desired workload config includes
both `adapter_contract_probe` and `real_adapter_supervision`; the runtime stays
running in probe mode and real-adapter activation remains blocked.
C19Z44 proves the real-adapter-only desired workload path remains degraded and
blocked, with the same disabled activation contract and no payload traffic.
C19Z45 adds a compact desired-workload mode matrix for probe-only,
real-adapter-only, and combined requested modes, confirming all paths retain
disabled real-adapter activation and no payload traffic.
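
The mode matrix and the probe precedence rule from C19Z43 and C19Z44 might be
modeled as below. The workload keys `adapter_contract_probe` and
`real_adapter_supervision` come from this document; the row field names, the
`runtime_mode` labels, and the assumption that probe-only behaves like the
combined case are illustrative.

```python
def resolve_mode(desired):
    """Resolve one desired-workload config into a matrix row."""
    probe = "adapter_contract_probe" in desired
    real = "real_adapter_supervision" in desired
    if not (probe or real):
        raise ValueError("no adapter mode requested")
    if probe:
        # The contract probe takes precedence, even when the real adapter
        # is requested alongside it.
        runtime_state, runtime_mode = "running", "probe"
    else:
        # Real-adapter-only requests stay degraded and blocked.
        runtime_state, runtime_mode = "degraded", "blocked"
    return {
        "probe_requested": probe,
        "real_adapter_requested": real,
        "runtime_state": runtime_state,
        "runtime_mode": runtime_mode,
        "activation_allowed": False,
        "payload_traffic": "none",
    }


def mode_matrix():
    """Build rows for the three requested modes."""
    requests = {
        "probe_only": {"adapter_contract_probe": {}},
        "real_adapter_only": {"real_adapter_supervision": {}},
        "combined": {
            "adapter_contract_probe": {},
            "real_adapter_supervision": {},
        },
    }
    return {name: resolve_mode(desired) for name, desired in requests.items()}
```
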
C19Z46 adds compatibility proof for that mode matrix row contract, including
explicit feature-flag and missing-gate visibility markers.
C19Z47 adds a disabled process-supervisor preconditions contract for the future
external RDP worker process while keeping `process_start_allowed=false` and all
payload traffic disabled.
C19Z48 proves that process-supervisor preconditions contract across requested
and default config shapes, including required/missing checks and disabled start.
C19Z49 folds process-supervisor preconditions into the compact handoff report,
proving alignment with projection, activation decision, and feature flags.
C19Z50 folds those preconditions into the desired-workload mode matrix, proving
process start remains disabled across probe-only, real-adapter-only, and
combined requested modes.
C19Z51 adds compatibility proof for that mode matrix v2 row contract.
C19Z52 adds a disabled process-health-probe contract for the future external
RDP worker process while keeping health probes disabled and payload traffic at
`none`.
C19Z53 proves that process-health-probe contract across requested/default
status forms.
C19Z54 folds process-health-probe visibility into the compact handoff report,
proving disabled health probes and payload-free alignment across all
real-adapter handoff contracts.
C19Z55 folds process-health-probe visibility into the desired-workload mode
matrix, proving disabled health probes and no payload traffic across probe-only,
real-adapter-only, and combined requested modes.
C19Z56 adds compatibility proof for that mode matrix v3 row contract.
C19Z57 ties handoff v4 and mode matrix v3 compatibility into a compact disabled
real-adapter readiness/handoff checklist.
C19Z58 adds compatibility proof for that readiness/handoff summary and
checklist contract.
C19Z59 derives a disabled real-adapter operator action map from that checklist
while keeping activation, process start, and payload forwarding blocked.
C19Z60 adds compatibility proof for that operator action map contract.
C19Z61 groups the disabled real-adapter readiness summary, checklist, and
action map into one compact admin handoff bundle.
C19Z62 adds compatibility proof for that admin handoff bundle contract.
C19Z63 derives compact admin handoff digest display rows from the bundle while
preserving disabled runtime guardrails.
C19Z64 adds compatibility proof for that admin handoff digest row contract.
C19Z65 adds a digest rollup with severity/state counts, primary action, and
guardrail summary.
C19Z66 adds compatibility proof for that digest rollup contract.
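
As an illustration of the digest rollup, the sketch below counts rows by
severity and state and picks a primary action. The row fields (`severity`,
`state`, `action`), the severity ordering, and the rule that the primary action
comes from the first row with the highest severity are all assumptions, not
the documented contract.

```python
from collections import Counter

# Hypothetical severity levels, lowest to highest.
SEVERITY_ORDER = ("info", "warning", "blocking")


def digest_rollup(rows):
    """Summarize digest rows into counts, a primary action, and guardrails."""
    primary = max(
        rows,
        key=lambda row: SEVERITY_ORDER.index(row["severity"]),
        default=None,
    )
    return {
        "severity_counts": dict(Counter(row["severity"] for row in rows)),
        "state_counts": dict(Counter(row["state"] for row in rows)),
        "primary_action": primary["action"] if primary else "none",
        # Guardrail summary: the rollup never reports an enabled runtime.
        "guardrails": {
            "activation_allowed": False,
            "process_start_allowed": False,
            "payload_traffic": "none",
        },
    }
```
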
C19Z67 summarizes the proven disabled real-adapter admin handoff chain from
handoff v4 through digest rollup compatibility.
C19Z68 adds compatibility proof for that full-chain summary contract.
C19Z69 marks the disabled real-adapter admin handoff package as
contract-only-ready while keeping the real runtime stage blocked.
C19Z70 proves the release marker contract remains compatible while keeping the
real runtime stage blocked.
C19Z71 adds a final contract-only package index for the disabled real-adapter
admin handoff chain.
C19Z72 proves the final package index contract for the disabled real-adapter
admin handoff chain.
C19Z73 adds a contract-only runtime gate phase boundary for the next disabled
real-adapter preflight phase.
C19Z74 proves the runtime gate phase boundary contract.
C19Z75 adds a disabled real-adapter runtime gate preflight checklist with all
items still blocking runtime.
C19Z76 proves the disabled real-adapter runtime gate preflight checklist
contract.
C19Z77 adds a disabled real-adapter runtime gate preflight status summary.
C19Z78 proves the disabled real-adapter runtime gate preflight status summary
contract.
C19Z79 adds disabled real-adapter runtime gate preflight action hints.
C19Z80 proves the disabled real-adapter runtime gate preflight action hints
contract.
C19Z81 adds a disabled real-adapter runtime gate preflight operator handoff
bundle.
C19Z82 proves the disabled real-adapter runtime gate preflight operator handoff
bundle contract.
C19Z83 adds a disabled real-adapter runtime gate preflight release marker.
C19Z84 proves the disabled real-adapter runtime gate preflight release marker
contract.
C19Z85 adds a disabled real-adapter runtime gate preflight package index.
C19Z86 proves the disabled real-adapter runtime gate preflight package index
contract.
C19Z87 adds a disabled real-adapter runtime gate preflight closeout summary.
C19Z88 proves the disabled real-adapter runtime gate preflight closeout summary
contract.
C19Z89 starts the explicit real-adapter runtime gate enablement phase with a
contract-only request that remains blocked pending validation.
C19Z90 proves the explicit real-adapter runtime gate enablement request
contract.
C19Z91 adds contract-only operator confirmation validation while keeping the
runtime gate blocked pending remaining validations.
C19Z92 proves the operator confirmation validation contract.
C19Z93 adds contract-only binary validation while keeping the runtime gate
blocked pending remaining validations.
C19Z94 proves the binary validation contract.
C19Z95 adds contract-only permission validation while keeping the runtime gate
blocked pending remaining validations.
C19Z96 proves the permission validation contract.
C19Z97 adds contract-only supervisor validation while keeping the runtime gate
blocked pending remaining validations.
C19Z98 proves the supervisor validation contract.
C19Z99 adds contract-only health probe validation while keeping the runtime gate
blocked pending payload gate validation.
C19Z100 proves the health probe validation contract.
C19Z101 adds contract-only payload gate validation. No required validations
remain after it, but the runtime is still not enabled.
C19Z102 proves the payload gate validation contract.
C19Z103 adds the runtime gate validation closeout while keeping explicit
operator enablement required.
C19Z104 proves the runtime gate validation closeout contract.
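
The validation sequence from C19Z91 through C19Z103 can be sketched as an
ordered checklist. The validation order follows the stages above; the
identifier spellings and the status field names are assumptions made for this
example.

```python
# Order follows stages C19Z91..C19Z101; identifier spellings are hypothetical.
VALIDATION_ORDER = (
    "operator_confirmation",
    "binary",
    "permission",
    "supervisor",
    "health_probe",
    "payload_gate",
)


def gate_status(completed):
    """Report which validations remain for the runtime gate.

    Completing every validation closes the validation phase only. The
    runtime stays disabled until an operator enables it explicitly.
    """
    remaining = [name for name in VALIDATION_ORDER if name not in completed]
    return {
        "remaining_validations": remaining,
        "next_validation": remaining[0] if remaining else None,
        "validation_closed": not remaining,
        "operator_enablement_required": True,
        "runtime_enabled": False,
    }
```

For example, after the first five validations only the payload gate remains,
and after all six the status reports a closed validation phase with
`runtime_enabled` still false.
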
C19Z105 adds an operator enablement readiness package while keeping runtime
disabled by default.
C19Z106 proves the operator enablement readiness package contract.
C19Z107 adds an operator enablement readiness release marker while keeping
runtime disabled by default.
C19Z108 proves the operator enablement readiness release marker contract.
C19Z109 adds an operator enablement readiness package index while keeping
runtime disabled by default.
C19Z110 proves the operator enablement readiness package index contract.
C19Z111 adds an operator readiness closeout summary while keeping runtime
disabled by default.
C19Z112 proves the operator readiness closeout summary contract.
C19Z113 adds an operator review decision request while keeping runtime disabled
by default.
C19Z114 proves the operator review decision request contract.
C19Z115 adds an operator decision status summary while keeping runtime disabled
by default.
C19Z116 proves the operator decision status summary contract.
C19Z117 adds an operator approval/rejection outcome contract, with the outcome
set to not approved and the runtime disabled by default.
C19Z118 proves the operator approval/rejection outcome contract.
C19Z119 adds an operator outcome closeout/reopen boundary while keeping runtime
disabled by default.
C19Z120 proves the operator outcome closeout/reopen boundary contract.
C19Z121 adds a not-approved outcome release marker while keeping runtime
disabled by default.
C19Z122 proves the not-approved outcome release marker contract.
C19Z123 adds a not-approved outcome package index while keeping runtime disabled
by default.
C19Z124 proves the not-approved outcome package index contract.
C19Z125 adds a not-approved outcome closeout summary while keeping runtime
disabled by default.
C19Z126 proves the not-approved outcome closeout summary contract.
C19Z127 adds a final not-approved outcome release marker while keeping runtime
disabled by default.
C19Z128 proves the final not-approved outcome release marker contract.
C19Z129 adds a final not-approved outcome package index/archive marker while
keeping runtime disabled by default.
C19Z130 proves the final not-approved outcome package index/archive marker
contract.
C19Z131 adds a not-approved outcome archive closeout manifest while keeping
runtime disabled by default.
C19Z132 proves the not-approved outcome archive closeout manifest contract.
C19Z133 adds a stopped-branch sentinel for the not-approved outcome while
keeping runtime disabled by default.
C19Z134 proves the not-approved outcome stopped-branch sentinel contract.
C19Z135 adds a no-continuation guard for the stopped not-approved outcome while
keeping runtime disabled by default.
C19Z136 proves the not-approved outcome no-continuation guard contract.
C19Z137 adds continuation block enforcement for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z138 proves the not-approved outcome continuation block enforcement
contract.
C19Z139 adds a continuation block audit record for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z140 proves the not-approved outcome continuation block audit record
contract.
C19Z141 adds a continuation block audit rollup for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z142 proves the not-approved outcome continuation block audit rollup
contract.
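
One way to picture the stopped-branch guard, its enforcement, and the audit
rollup (C19Z133 through C19Z142) is the sketch below. The class, method, and
field names are hypothetical; only the behavior (continuation stays blocked,
each attempt is recorded, runtime stays disabled) reflects the stages above.

```python
class StoppedBranchGuard:
    """Blocks continuation of a stopped not-approved branch and audits attempts."""

    def __init__(self, outcome="not_approved"):
        self.outcome = outcome
        self.audit_records = []

    def request_continuation(self, requested_by):
        """Record the attempt and always refuse to continue the branch."""
        record = {
            "requested_by": requested_by,
            "outcome": self.outcome,
            "continuation_allowed": False,
            "runtime_enabled": False,
        }
        self.audit_records.append(record)
        return record

    def audit_rollup(self):
        """Summarize blocked attempts without changing any state."""
        requesters = sorted({r["requested_by"] for r in self.audit_records})
        return {
            "blocked_attempts": len(self.audit_records),
            "requesters": requesters,
            "runtime_enabled": False,
        }
```
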
C19Z143 adds an operator stop summary for the stopped not-approved outcome
while keeping runtime disabled by default.
C19Z144 proves the not-approved outcome operator stop summary contract.
C19Z145 adds an operator stop handoff for the stopped not-approved outcome
while keeping runtime disabled by default.
C19Z146 proves the not-approved outcome operator stop handoff contract.
C19Z147 adds an operator stop handoff digest for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z148 proves the not-approved outcome operator stop handoff digest contract.
C19Z149 adds an operator stop status snapshot for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z150 proves the not-approved outcome operator stop status snapshot contract.
C19Z151 adds an operator stop status snapshot index for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z152 proves the not-approved outcome operator stop status snapshot index
contract.
C19Z153 adds an operator stop status catalog for the stopped not-approved
outcome while keeping runtime disabled by default.
C19Z154 proves the not-approved outcome operator stop status catalog contract.
C19Z155 adds an operator stop status catalog release marker for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z156 proves the not-approved outcome operator stop status catalog release
marker contract.
C19Z157 adds an operator stop status catalog package index for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z158 proves the not-approved outcome operator stop status catalog package
index contract.
C19Z159 adds an operator stop status catalog closeout summary for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z160 proves the not-approved outcome operator stop status catalog closeout
summary contract.
C19Z161 adds an operator stop status final archive marker for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z162 proves the not-approved outcome operator stop status final archive
marker contract.
C19Z163 adds an operator stop status final archive manifest for the stopped
not-approved outcome while keeping runtime disabled by default.
C19Z164 proves the not-approved outcome operator stop status final archive
manifest contract.
C19Z165 adds a terminal-complete marker for the stopped not-approved outcome
factory while keeping runtime disabled by default.
C19Z166 proves the not-approved outcome factory terminal-complete contract.
C20Z1 opens a new explicit real-adapter enablement request while keeping
runtime disabled by default.
C20Z2 proves the new explicit real-adapter enablement request contract.
C20Z3 adds the operator validation intake for the new explicit request while
keeping runtime disabled by default.
C20Z4 completes the operator validation checklist contract while keeping
runtime disabled by default.
C20Z5 closes the operator validation chain contract while keeping runtime
disabled by default.
C20Z6 proves the C20 stage terminal-complete contract.
5. Move VPN packet flow to the service channel and keep the backend relay only
as an explicit degraded fallback.
6. Run load tests against the fabric channel: many streams, route failure,
exit failure, NAT/outbound-only nodes, queue pressure, DNS/LAN/Internet
egress.
7. Build Remote Server/Desktop Access on top of this channel, not beside it.

## Non-Negotiable Guardrails

- Do not solve new service performance problems inside a protocol-specific
client before checking the common fabric channel.
- Do not add a production service that depends on backend packet/frame relay as
the steady-state path.
- Do not expose internal mesh topology to organization users.
- Do not merge VPN and Remote Server/Desktop Access into one product.
- Do not let bulk traffic starve interactive traffic.
- Do not hide degraded fallback; report it visibly in diagnostics/admin UI.
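
The guardrail about bulk and interactive traffic could be illustrated with a
two-class scheduler like the one below. It is a sketch, not the runtime's
scheduler: the class name, the `interactive_burst` parameter, and the policy
(interactive first, with one bulk frame after each full interactive burst) are
assumptions chosen to show the idea.

```python
from collections import deque


class TwoClassScheduler:
    """Serves interactive frames first while still letting bulk make progress."""

    def __init__(self, interactive_burst=4):
        self.interactive = deque()
        self.bulk = deque()
        self.interactive_burst = interactive_burst
        self._sent_in_burst = 0

    def enqueue(self, traffic_class, frame):
        """Queue a frame; anything not marked interactive is treated as bulk."""
        queue = self.interactive if traffic_class == "interactive" else self.bulk
        queue.append(frame)

    def next_frame(self):
        """Return the next frame to send, or None when both queues are empty."""
        budget_left = self._sent_in_burst < self.interactive_burst
        if self.interactive and (budget_left or not self.bulk):
            self._sent_in_burst += 1
            return self.interactive.popleft()
        # Burst budget used up or no interactive frames: give bulk one turn.
        self._sent_in_burst = 0
        if self.bulk:
            return self.bulk.popleft()
        return None
```

With `interactive_burst=2`, three queued interactive frames and two queued
bulk frames are sent as two interactive, one bulk, one interactive, one bulk,
so an interactive frame never waits behind more than one bulk frame.
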