Record project continuation changes

2026-05-12 21:02:29 +03:00
parent 3059d1d7a3
commit 8f69d53193
339 changed files with 101111 additions and 1769 deletions
@@ -1016,6 +1016,240 @@ Status: implemented and verified. Report: `artifacts/c5-service-workload-supervi
 Goal:
 Node-agent can start, stop, and monitor service workloads based on role assignment.

+C19A adds the first bounded live service-supervision runtime proof on top of
+that contract: node-agent can read node-scoped desired workloads without an
+operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
+report native built-in `synthetic.echo` as running, and keep unsupported
+production workloads degraded instead of pretending that their adapters exist.
+The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.
+
+C19B adds the Remote Workspace/RDP adapter-contract bridge without enabling RDP
+payload traffic. A native `rdp-worker` desired workload with
+`adapter_contract_probe=true` reports the remote-workspace channel map,
+requires Fabric Service Channel, and marks backend relay as not steady-state.
+The live smoke is
+`scripts/fabric/c19b-remote-workspace-adapter-contract-smoke.ps1`.
+
+C19C wires Remote Workspace into service-channel lease issuance without
+starting RDP traffic: route intents now accept `remote_workspace`, the lease
+entry descriptor uses remote-workspace stream paths and frame-batch media type
+instead of VPN packet paths, and the signed data-plane contract is present in
+lease, authority payload, introspection, and lease maintenance. The live smoke
+is `scripts/fabric/c19c-remote-workspace-service-channel-lease-smoke.ps1`.
+
+C19D adds the Remote Workspace entry-node ingress skeleton. The node-agent
+accepts a signed/introspected `remote_workspace` service-channel lease on
+`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
+class, channel class, selected entry node, and data-plane flow isolation, and
+reports access telemetry. It intentionally returns a probe contract with
+`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
+the Fabric ingress contract without forwarding desktop frames yet. The live
+smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.
+
+C19E adds the first Remote Workspace frame-batch contract probe across the
+adapter/entry boundary. The `rdp-worker` adapter probe reports
+`rap.remote_workspace_frame_batch.v1`; entry-node accepts only
+`probe_only=true` frame batches, validates logical adapter channels and
+directions, and returns `payload_flow=validated_probe_only`. Real desktop frame
+delivery remains intentionally disabled until the service adapter runtime stage.
+The live smoke is
+`scripts/fabric/c19e-remote-workspace-frame-batch-contract-smoke.ps1`.
+
+C19F adds the first local adapter-sink proof for that frame-batch contract.
+Node-agent now keeps an in-memory `node_agent_rdp_worker_contract_probe` sink
+for Remote Workspace frame probes and preserves it across mesh config refresh.
+Entry-node delivers validated `probe_only=true` frame batches to that sink and
+returns a `rap.remote_workspace_frame_batch_delivery.v1` receipt with
+`payload_flow=delivered_probe_only`. This still does not enable production RDP
+frame forwarding. The live smoke is
+`scripts/fabric/c19f-remote-workspace-adapter-sink-smoke.ps1`.
+
+C19G exposes the adapter-sink delivery proof through existing node-agent
+visibility channels. The `rdp-worker` workload status payload now includes
+`remote_workspace_adapter_sink`, and node telemetry includes
+`remote_workspace_adapter_sink_report`, both carrying delivery count, latest
+delivery sequence, channel class, frame count, and the probe-only/no-payload
+boundary. The live smoke is
+`scripts/fabric/c19g-remote-workspace-adapter-sink-telemetry-smoke.ps1`.
+
+C19H locks down the Remote Workspace frame-batch guardrails before real adapter
+runtime work begins. Unit and live smoke coverage now proves that entry-node
+rejects `probe_only=false`, unknown logical channels, invalid channel
+directions, service-class mismatch, channel-class mismatch, and unsupported
+payload encoding, and that rejected batches do not produce adapter delivery.
+The live smoke is
+`scripts/fabric/c19h-remote-workspace-frame-guardrails-smoke.ps1`.
+
+C19I adds the first bounded adapter handoff queue/ack proof for the same
+probe-only path. The local `node_agent_rdp_worker_contract_probe` sink reports
+queue capacity/depth plus accepted, dropped, and acked frame counts: with
+capacity `8`, droppable display overflow accepts/acks `8` frames and drops `3`,
+while reliable input overflow is rejected with backpressure and no delivery
+receipt. The boundary still carries `payload_traffic=none`; this is queue
+semantics for the future adapter runtime, not real RDP payload forwarding. The
+live smoke is
+`scripts/fabric/c19i-remote-workspace-adapter-queue-smoke.ps1`.
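
The C19I queue semantics can be sketched as follows. This is a minimal illustration, not the node-agent implementation: the `ProbeQueue` class, its field names, and the `"reliable"`/`"droppable"` strings are assumptions made for the example. It models only frame counts, consistent with `payload_traffic=none`.

```python
from dataclasses import dataclass


@dataclass
class ProbeQueue:
    """Hypothetical bounded handoff queue; names are illustrative only."""
    capacity: int = 8
    depth: int = 0
    accepted: int = 0
    dropped: int = 0
    acked: int = 0
    backpressure_count: int = 0

    def offer(self, frame_count: int, delivery: str) -> dict:
        free = self.capacity - self.depth
        if delivery == "reliable" and frame_count > free:
            # Reliable traffic is never partially dropped: reject the whole
            # batch with backpressure and produce no delivery receipt.
            self.backpressure_count += 1
            return {"delivered": False, "reason": "backpressure"}
        # Droppable traffic fills the free slots and drops the overflow.
        taken = min(frame_count, free)
        self.accepted += taken
        self.dropped += frame_count - taken
        self.depth += taken
        return {"delivered": True, "accepted": taken, "dropped": frame_count - taken}

    def ack_all(self) -> int:
        # Acknowledge everything currently queued and free the slots.
        acked, self.depth = self.depth, 0
        self.acked += acked
        return acked
```

With capacity `8`, offering 11 droppable frames accepts 8 and drops 3, while a reliable batch that does not fit is rejected outright and only increments the backpressure counter.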
+
+C19J makes those queue/backpressure signals operationally visible. The
+`remote_workspace_adapter_sink` workload status payload and
+`remote_workspace_adapter_sink_report` telemetry now include current queue
+capacity/depth, cumulative accepted/dropped/acked frame counters,
+`backpressure_count`, and the latest rejected batch metadata/reason. The live
+smoke first produces the C19I droppable overflow plus reliable backpressure,
+then waits until both workload status and telemetry show the delivery, dropped
+total, and backpressure increment. The live smoke is
+`scripts/fabric/c19j-remote-workspace-adapter-queue-telemetry-smoke.ps1`.
+
+C19K introduces the probe-only adapter session boundary. Entry-node derives a
+stable `adapter_session_id` from the service-channel lease/resource/route
+context and passes it to the local `rdp-worker` adapter probe sink. Delivery
+receipts, workload status, and telemetry now include `adapter_session_id`,
+`adapter_runtime_id=node_agent_rdp_worker_contract_probe`, and
+`session_state=probe_bound`, and rejected/backpressured batches retain the same
+session identity. This is still not real RDP payload forwarding; it binds the
+queue/ack/backpressure model to the future per-session adapter runtime. The
+live smoke is
+`scripts/fabric/c19k-remote-workspace-adapter-session-boundary-smoke.ps1`.
+
+C19L adds the first lifecycle model to that probe-only adapter session. The
+node-agent sink now tracks active sessions in memory with created/bound totals,
+last activity timestamps, per-session delivery/backpressure/frame counters,
+`current_session_lifecycle_state`, and idle expiry counters. A successful
+droppable overflow binds the session as `probe_bound`; a reliable overflow keeps
+the same `adapter_session_id` and moves the lifecycle state to `backpressure`
+for diagnosis. Receipts expose session created/bound/last-activity timestamps
+and per-session counters while `payload_traffic=none` remains enforced. The
+live smoke is
+`scripts/fabric/c19l-remote-workspace-adapter-session-lifecycle-smoke.ps1`.
+
+C19M adds explicit probe-only adapter-session control. Node-agent exposes
+`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
+with `close`, `expire`, and `reset` actions, returning
+`rap.remote_workspace_adapter_session_control.v1`. Workload status and telemetry
+now include `session_control_total`, `session_closed_total`,
+`session_reset_total`, and the latest control action/session/state, so sessions
+can be ended deliberately instead of only by idle TTL. The live smoke creates a
+Remote Workspace adapter session, closes it through the mesh control endpoint,
+and waits until workload status and telemetry expose the close. The live smoke
+is
+`scripts/fabric/c19m-remote-workspace-adapter-session-control-smoke.ps1`.
+
+C19N locks down the adapter-session control guardrails. Control requests now
+reject unsupported actions, invalid `adapter_session_id` values, malformed JSON,
+unknown active/terminal sessions, and overlong reasons without creating hidden
+session state. Repeating `close` against an already closed terminal session is
+idempotent: it reports `previous_state=closed` and does not increment
+`session_closed_total` again, while still counting the control observation. The
+live smoke verifies the negative cases plus first/repeated close visibility in
+workload status and telemetry. The live smoke is
+`scripts/fabric/c19n-remote-workspace-adapter-session-control-guardrails-smoke.ps1`.
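
The idempotent-close rule can be sketched like this. It is a simplified model under stated assumptions: `SessionControl`, `MAX_REASON_LEN`, and the exception types are hypothetical, and only the `close` action is modelled because the notes above do not spell out the `expire`/`reset` transitions.

```python
from dataclasses import dataclass, field

ALLOWED_ACTIONS = {"close", "expire", "reset"}
MAX_REASON_LEN = 256  # assumed limit; the notes only say "overlong reasons"


@dataclass
class SessionControl:
    """Hypothetical control surface for probe-only adapter sessions."""
    states: dict = field(default_factory=dict)  # adapter_session_id -> state
    session_control_total: int = 0
    session_closed_total: int = 0

    def control(self, session_id: str, action: str, reason: str = "") -> dict:
        # All validation happens before any counter or state is touched.
        if action not in ALLOWED_ACTIONS:
            raise ValueError("unsupported action")
        if len(reason) > MAX_REASON_LEN:
            raise ValueError("reason too long")
        if session_id not in self.states:
            # Unknown sessions are rejected without creating hidden state.
            raise KeyError("unknown session")
        if action != "close":
            raise NotImplementedError("only close is sketched here")
        previous = self.states[session_id]
        self.session_control_total += 1  # every accepted control is observed
        if previous != "closed":
            self.session_closed_total += 1  # counted once per session
        self.states[session_id] = "closed"
        return {"previous_state": previous, "state": "closed"}
```

A repeated `close` returns `previous_state=closed` and raises `session_control_total` without touching `session_closed_total`.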
+
+C19O adds an immediate read-only adapter-session snapshot endpoint:
+`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`.
+It returns `rap.remote_workspace_adapter_session_snapshot.v1` with active
+sessions, terminal sessions when requested, per-session lifecycle state,
+activity/backpressure timestamps, frame counters, and runtime identity. This
+lets operators inspect adapter-session state directly from node-agent without
+waiting for heartbeat, workload status, or telemetry propagation. The live smoke
+checks active-session visibility, close transition into terminal snapshot, and
+invalid snapshot limit rejection. The live smoke is
+`scripts/fabric/c19o-remote-workspace-adapter-session-snapshot-smoke.ps1`.
+
+C19P adds the first adapter-runtime handoff mailbox contract. Each active
+probe-only adapter session now owns a bounded in-memory mailbox that receives
+`frame_batch_probe_delivered` and `backpressure` events with frame counts,
+channel/resource/route context, and sequence numbers. Node-agent exposes
+`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
+with optional `drain=true`, and session snapshots/workload reports expose
+mailbox depth/enqueued/drained/dropped counters. This is the handoff surface a
+real `rdp-worker` runtime can consume next; payload forwarding is still disabled.
+The live smoke verifies read, drain, post-drain empty state, and snapshot
+counters. The live smoke is
+`scripts/fabric/c19p-remote-workspace-adapter-runtime-mailbox-smoke.ps1`.
+
+C19Q hardens the mailbox handoff. Invalid IDs, unknown sessions, and invalid
+limits are rejected before state mutation, and bounded `drain=true&limit=N`
+reads remove only the returned event slice while preserving remaining depth for
+the next poll. The bounded mailbox drops the oldest events once capacity is reached,
+and a closed adapter session no longer exposes an active runtime mailbox. The
+live smoke verifies negative cases, drop-oldest pressure, partial drain, and
+closed-session rejection. The live smoke is
+`scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`.
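
A minimal sketch of the drop-oldest and partial-drain behavior, assuming a simple in-memory structure. The `Mailbox` class and its counter names are illustrative and are not taken from the node-agent code.

```python
from collections import deque


class Mailbox:
    """Hypothetical bounded per-session mailbox with drop-oldest pressure."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.events = deque()
        self.next_sequence = 1
        self.dropped = 0
        self.drained = 0

    def enqueue(self, kind: str) -> int:
        if len(self.events) >= self.capacity:
            # At capacity: evict the oldest event to make room.
            self.events.popleft()
            self.dropped += 1
        seq = self.next_sequence
        self.next_sequence += 1
        self.events.append({"sequence": seq, "kind": kind})
        return seq

    def read(self, limit: int, drain: bool = False) -> list:
        if limit < 1:
            # Invalid limits are rejected before any state mutation.
            raise ValueError("invalid limit")
        count = min(limit, len(self.events))
        batch = [self.events[i] for i in range(count)]
        if drain:
            # Remove only the returned slice; the rest stays for the next poll.
            for _ in range(count):
                self.events.popleft()
            self.drained += count
        return batch
```

A plain read leaves depth unchanged, and a `drain=true&limit=N` style read removes at most `N` events.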
+
+C19R adds bounded long-poll ergonomics to the same node-local mailbox endpoint.
+`wait_ms` lets an adapter runtime wait briefly for the next event without hot
+polling, and responses make empty/timeout state explicit with `empty`,
+`waited`, `wait_timeout`, and `wait_ms`. The live smoke proves empty timeout and
+wake-on-delayed-event behavior while keeping the path probe-only. The live smoke
+is `scripts/fabric/c19r-remote-workspace-mailbox-long-poll-smoke.ps1`.
+
+C19S makes mailbox consumer behavior visible in diagnostics. Workload status and
+node telemetry now expose `mailbox_read_total`, `mailbox_wait_total`,
+`mailbox_wait_timeout_total`, `mailbox_empty_read_total`, and last mailbox read
+metadata; active session snapshots carry the same per-session counters while a
+session remains active. The live smoke proves C19R traffic is reflected in both
+workload status and telemetry. The live smoke is
+`scripts/fabric/c19s-remote-workspace-mailbox-telemetry-smoke.ps1`.
+
+C19T adds the node-local consumer cursor contract for that mailbox. Consumers
+can pass `consumer_id` plus optional `ack_sequence` to receive explicit
+checkpoint, ack, lag, and read/ack counters without draining mailbox state.
+The probe sink stores bounded per-session consumer state and reports aggregate
+and current-session consumer telemetry through workload status and heartbeat
+telemetry. The live smoke is
+`scripts/fabric/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke.ps1`.
+
+C19U adds lifecycle visibility and reset guardrails to the same cursor state.
+Mailbox consumers can pass `reset_consumer=true` with a valid `consumer_id` to
+clear their checkpoint/ack state before the current read is recorded. Mailbox
+responses now expose consumer count/capacity, created/reset/evicted flags, and
+consumer timestamps, while diagnostics add reset and eviction counters. The
+live smoke is
+`scripts/fabric/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke.ps1`.
+
+C19V adds read-only inspection for active mailbox consumer cursors. The
+node-local
+`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
+endpoint returns bounded cursor snapshots with consumer ids, checkpoint and ack
+sequences, lag, totals, and timestamps. It is verified as read-only: inspection
+does not increment mailbox reads, ack totals, reset counters, or drain mailbox
+events. The live smoke is
+`scripts/fabric/c19v-remote-workspace-mailbox-consumer-snapshot-smoke.ps1`.
+
+C19W adds cursor-aware resume reads to the mailbox endpoint. Consumers can pass
+`after_sequence` to receive only mailbox events newer than their checkpoint;
+responses include `skipped_count` and `returned_count`, and long-poll waits for
+newer-than-checkpoint events. The endpoint rejects `after_sequence` with
+`drain=true`, preserving the non-destructive resume contract. The live smoke is
+`scripts/fabric/c19w-remote-workspace-mailbox-after-sequence-smoke.ps1`.
+
+C19X adds consumer-aware resume convenience. Mailbox reads with `consumer_id`
+can pass `resume_from=ack` or `resume_from=checkpoint`; the node-agent resolves
+the stored cursor to `after_sequence` before reading and returns
+`resume_from`/`resume_sequence` in the response. The guardrails reject mixing
+resume with manual `after_sequence`, drain, reset, missing consumers, or invalid
+cursor names. The live smoke is
+`scripts/fabric/c19x-remote-workspace-mailbox-consumer-resume-smoke.ps1`.
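
The resume guardrails from C19W and C19X can be sketched as a small parameter resolver. The function names, the `params` dictionary shape, and the error types are assumptions for illustration; only the parameter names (`consumer_id`, `resume_from`, `after_sequence`, `drain`, `reset_consumer`) and the response fields come from the notes above.

```python
def resolve_after_sequence(params: dict, cursors: dict):
    """Return the effective after_sequence, or None for a plain read."""
    resume_from = params.get("resume_from")
    if resume_from is None:
        # C19W: manual after_sequence is non-destructive, so drain is refused.
        if params.get("after_sequence") is not None and params.get("drain"):
            raise ValueError("after_sequence cannot be combined with drain")
        return params.get("after_sequence")
    if resume_from not in ("ack", "checkpoint"):
        raise ValueError("invalid cursor name")
    # C19X: resume must not be mixed with manual after_sequence, drain, or reset.
    for key in ("after_sequence", "drain", "reset_consumer"):
        value = params.get(key)
        if value is not None and value is not False:
            raise ValueError(f"resume_from cannot be combined with {key}")
    consumer = cursors.get(params.get("consumer_id"))
    if consumer is None:
        raise KeyError("missing consumer")
    return consumer[resume_from]


def read_after(events: list, after_sequence) -> dict:
    """Non-destructive read of events newer than after_sequence."""
    if after_sequence is None:
        returned = list(events)
    else:
        returned = [e for e in events if e["sequence"] > after_sequence]
    return {
        "events": returned,
        "returned_count": len(returned),
        "skipped_count": len(events) - len(returned),
    }
```

Resolving the stored cursor first keeps the read path identical for manual `after_sequence` and consumer-aware `resume_from` requests.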
+
+C19Y adds resume telemetry to workload status and heartbeat reports. Operators
+can now see resume read totals, after-sequence read totals, returned/skipped
+totals, and the last resume cursor, sequence, consumer, returned count, and
+skipped count. Session snapshots also expose per-session resume counters. The
+live smoke is
+`scripts/fabric/c19y-remote-workspace-mailbox-resume-telemetry-smoke.ps1`.
+
+C19Z adds adapter-runtime readiness diagnostics. Sink reports now include
+`adapter_runtime_readiness`, a compact probe-only object with ready status,
+diagnostic state, session lifecycle, mailbox depth, consumer cursor, resume
+cursor, lag, and returned/skipped counts. The live smoke is
+`scripts/fabric/c19z-remote-workspace-adapter-readiness-smoke.ps1`.
+
+C19Z1 adds read-only handoff preflight for mailbox consumers. The endpoint
+`/mailbox/preflight` accepts `consumer_id` and `resume_from=ack|checkpoint`,
+then reports the expected next event window without mailbox reads, drains, acks,
+or consumer cursor mutation. The live smoke is
+`scripts/fabric/c19z1-remote-workspace-mailbox-preflight-smoke.ps1`.
+
 Includes:

 - container/native workload contract
@@ -131,6 +131,43 @@ Data Plane

 The backend/control plane must not become a production VPN packet relay.

+## Universal Packet Dataplane Principle
+
+The VPN service carries IP packets. The product must not be treated as a web
+proxy, an RDP helper, or an HTTP-only accelerator. HTTP, DNS, RDP, SSH, VNC,
+messengers, audio calls, file transfer, application sync, and future mobile or
+desktop traffic are all just packets flowing through the same tunnel contract.
+
+Implementation rules:
+
+- packet forwarding must not branch on application protocol for correctness
+- performance work must optimize the shared packet path, not a specific site or
+  port
+- batching, backpressure, retries, and route failover are dataplane mechanics
+  and must apply to all traffic
+- diagnostics may summarize protocol/ports for operators, but diagnostics must
+  not decide whether traffic is allowed to flow
+- a transient transport error must not permanently downgrade the tunnel to a
+  per-packet request mode
+- the control plane chooses entry, exit, route, lease, and policy; packet flow
+  should use the fastest available fabric path
+
+The temporary backend HTTP packet relay is a lab compatibility path. The
+production target is:
+
+```text
+client device
+  -> selected entry node
+  -> fabric route / alternate route set
+  -> selected exit node
+  -> target private network or Internet gateway
+```
+
+When the cluster grows, route choice must consider latency, loss, queue depth,
+node health, role eligibility, lease freshness, and regional/network locality.
+If a node or link degrades, the fabric should switch to an alternate route
+without requiring the client to understand mesh topology.
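
As a rough illustration of route choice over those signals, the sketch below filters out ineligible routes and then picks the lowest-cost candidate. The `RouteCandidate` fields, the weights, and the cost formula are all assumptions made for this example; the document does not define a scoring function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteCandidate:
    """Hypothetical per-route inputs; field names are illustrative."""
    route_id: str
    latency_ms: float
    loss_ratio: float   # 0.0 .. 1.0
    queue_depth: int
    healthy: bool
    role_eligible: bool
    lease_fresh: bool
    same_region: bool


def route_cost(route: RouteCandidate) -> float:
    # Assumed weights: loss and queueing are penalised on top of latency,
    # and leaving the local region adds a fixed penalty.
    cost = route.latency_ms + 1000.0 * route.loss_ratio + 5.0 * route.queue_depth
    if not route.same_region:
        cost += 20.0
    return cost


def choose_route(candidates: list):
    # Health, role eligibility, and lease freshness are hard gates;
    # the remaining signals only rank the usable routes.
    usable = [r for r in candidates if r.healthy and r.role_eligible and r.lease_fresh]
    return min(usable, key=route_cost, default=None)
```

The choice uses only route-level signals and never inspects application protocol or port, in line with the implementation rules above. If the preferred route degrades, re-running the selection yields the alternate without any client involvement.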
+
 ## Control Plane Responsibilities

 The control plane owns:
@@ -1,123 +1,610 @@
-C17Z20 is complete.
+Current product decision:
+
+Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
+future services as separate transport implementations. The next implementation
+work should focus on the shared Fabric Service Channel runtime described in
+`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`.
+
+The immediate engineering target is:
+
+- backend service-channel lease/route-generation contract
+- node-agent entry runtime for client/service live connections
+- service-neutral channel scheduling, bounded queues, route health, and
+  failover
+- VPN packet flow as the proving service over that common channel
+- backend relay only as explicit degraded fallback
+
+Backend service-channel lease/route-generation contract is now started:
+
+- `POST /clusters/{clusterID}/fabric/service-channels/leases` issues
+  `rap.fabric_service_channel_lease.v1`
+- VPN client profiles embed `fabric_service_channel_lease`
+- tests cover ready route and degraded backend-relay fallback behavior
+- leases include entry HTTP/WebSocket endpoint templates for the selected
+  service channel
+- leases include cluster-authority-signed
+  `rap.fabric_service_channel_lease_authority.v1` payloads that bind token
+  hash, selected route, generation, fencing epoch, and expiry
+
+Node-agent entry runtime is now started:
+
+- `rap-node-agent` accepts VPN packet batches through
+  `/api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets`
+  and `/packets/ws`
+- entry runtime requires a `rap_fsc_*` service-channel token and maps packet
+  batches to the existing production `vpn_packet` fabric route
+- route failure falls back to the canonical backend relay endpoint so degraded
+  compatibility remains explicit
+
+Next narrow runtime layer:

-Installation Authority foundation is also complete:
+- persist cluster-level default window policy for Fabric diagnostics
+  investigation breadcrumbs and expose a small admin control for it
+- keep this in the shared Fabric Service Channel runtime contract and telemetry
+- do not add Android/RDP protocol work in this slice
+
+C17Z20 is complete.
+
+Installation Authority foundation is also complete:
+
+- production config requires strict authority mode with Product Root public key
+- first-owner bootstrap requires a signed activation manifest in strict mode
+- `installation_authority` and signed `platform_role_grants` are persisted
+- strict platform-admin checks ignore direct `users.platform_role` edits unless
+  a valid signed grant exists
+- web-admin shows installation status and first-owner bootstrap
+- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
+  keys and sign activation manifests; private keys must stay outside the repo
+
+Cluster Authority foundation is now also complete:
+
+- every newly created cluster gets an Ed25519 `cluster_authorities` key record
+- cluster authority private keys are encrypted at rest when
+  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
+  a secret encryption key
+- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
+- backend signs join-token scope material, node approval/bootstrap material,
+  and node-scoped synthetic mesh config snapshots
+- node-agent verifies signed Control Plane synthetic config when
+  `authority_required=true` or signature fields are present
+- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
+  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
+  trust anchor after approval
+- web-admin shows cluster key fingerprints on summaries, join-token output,
+  approval rows, and synthetic config visibility
+- docker-test lifecycle smoke is complete: fresh dev install, first-owner
+  bootstrap, cluster creation, signed join token, real node-agent enrollment,
+  owner approval, automatic signed bootstrap polling, authority pin
+  persistence, heartbeat, and signed synthetic config verification all passed
+- `rap-node-agent` desired-workload polling/status reporting is disabled by
+  default (`RAP_WORKLOAD_SUPERVISION_ENABLED=false`) while service runtime
+  supervision remains a stub
+
+Node enrollment bootstrap polling is also complete:
+
+- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
+- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
+  before receiving status/bootstrap material
+- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
+  the signed bootstrap contract, then persists `node_id`, `identity_status`,
+  and cluster authority pin into `identity.json`
+- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
+  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
+
+Current state:
+
+- C17Z12 added rendezvous/relay control-plane leases for peers that would
+  otherwise stay in `waiting_rendezvous`.
+- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
+  for renewal/stale relay recovery.
+- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
+  relay-pool scoring.
+- C17Z16 added Control Plane `route_path_decisions`.
+- C17Z17 added node-side route generation apply/withdraw tracking.
+- C17Z18 applies Control Plane `route_path_decisions` to synthetic
+  route-health route config only. The synthetic `fabric.route_health` runtime
+  now probes the selected effective path, including replacement relay paths,
+  and reports expected/observed hops plus drift state.
+- C17Z19 consumes those synthetic route-health observations in backend relay
+  scoring. Drift/unreachable/failure feedback marks the exact selected relay
+  stale and can trigger replacement; healthy low-latency route-health boosts
+  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
+  service class, and web-admin marks relay policy `rh feedback`.
+- C17Z20 closes the node-side feedback loop. After node-agent reports
+  synthetic route-health drift/unreachable/failure, it performs a bounded
+  node-scoped synthetic-config refresh, applies returned replacement route
+  decisions to route-health config immediately, and reports
+  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
+- Backend `mesh_latest_links` now keeps latest observations per observation
+  type/route, so `synthetic_route_health` is not overwritten by
+  `peer_connection_manager`.
+- Web-admin Fabric links now show observation type, selected relay, and
+  route-health effective/observed path.
+- All of this remains control-plane/synthetic route-health only. It does not
+  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
+  implement arbitrary relay packet forwarding.
+- Cluster Authority and node enrollment bootstrap are docker-test
+  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
+- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
+  `000021_cluster_authority_keys`; the migration now drops/recreates
+  `cluster_admin_summaries` in up/down paths.
+
+Runtime report:
+
+- `artifacts/c17z18-route-health-effective-path-report.md`
+- `artifacts/c17z19-route-health-feedback-report.md`
+- `artifacts/c17z19-route-health-feedback-smoke-result.json`
+- `artifacts/c17z20-route-health-feedback-refresh-report.md`
+- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
+- `artifacts/c18w-service-channel-route-manager-smoke-result.json`
+- `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
+- `artifacts/c18y-route-intent-lifecycle-smoke-result.json`
+- `artifacts/c18z-service-channel-load-smoke-result.json`
+- `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`
+- `artifacts/c18z2-live-service-channel-soak-smoke-result.json`
+- `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
+- `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
+- `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`
+- `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`
+- `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
+- `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
+- `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`
+- `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`
+- `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`
+- `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
+- `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`
+- `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
+- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
+- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
+- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
+- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
+- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
+- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
+- Docker-test smoke command:
+  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
+- Dev lifecycle smoke command:
+  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
+- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
+  current C17Z20 node-agent code)
+- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
+- Admin: `http://192.168.200.61:18080/`
+- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
+- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
+- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`
+
+Do not automatically continue into:
+
+- RDP/VNC/SSH/file/video/service workload traffic over mesh
+- VPN/IP tunnel runtime implementation
+- arbitrary relay packet forwarding
+- production payload forwarding for relay paths
+- QUIC/WebRTC or STUN/TURN/ICE
+- TUN/TAP, host route, DNS, or firewall manipulation
+- backend/session lifecycle changes
+- Windows client changes
+
+Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness
+window proof:

-- production config requires strict authority mode with Product Root public key
-- first-owner bootstrap requires a signed activation manifest in strict mode
-- `installation_authority` and signed `platform_role_grants` are persisted
-- strict platform-admin checks ignore direct `users.platform_role` edits unless
-  a valid signed grant exists
-- web-admin shows installation status and first-owner bootstrap
-- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
-  keys and sign activation manifests; private keys must stay outside the repo
+C18Z48 through C18Z109 are complete. Backend is deployed on docker-test as
+`rap-backend:fabric-service-channel-0.2.281-c18z109`; migration
+`000029_fabric_service_channel_leases` is applied on the shared test database.
+Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
+`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
+All three test nodes run the C18Z92 image and are healthy and current after policy
+update. Node-agent still requires signed service-channel lease authority when
+cluster authority is pinned, but if legacy clients cannot send signed lease
+headers it now calls backend introspection before accepting the unsigned token.
+Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
+in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
+packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
+lease payload, so backend restarts no longer break compatibility clients. Live
+lease maintenance now lists active/expired durable compatibility leases and runs
+bounded cleanup through the admin API/panel. Durable access telemetry now
+aggregates node-reported accepted ingress counters by signed/introspection/
+legacy path, with heartbeat metadata fallback and admin-panel visibility.
+Access telemetry now also correlates active durable service-channel leases with
+entry/exit nodes, primary route status, backend fallback, and latest
+route-quality feedback when a route exists. Normal-route access diagnostics are
+smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
+quality window. Degraded normal-route diagnostics are also smoke-proven: the
+active channel stays on a normal primary route with `force_backend_fallback=false`
+while route feedback becomes `fenced` and rolling failure/drop/slow counters are
+visible. Active-channel remediation diagnostics now expose
+`remediation_action`, reason, optional alternate route id/status, and operator
+hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
+authorized alternate decisions. The alternate-route remediation branch is now
+live-smoke-proven: a selected primary route is degraded after lease issuance and
+access telemetry recommends `prefer_alternate_route` while keeping
+`force_backend_fallback=false`. C18Z57 turns that recommendation into a bounded
+machine-readable `remediation_command` on the active channel row, including the
+primary route, replacement route, issued time, and command TTL capped to the
+lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
+config and node-agent consumes `prefer_alternate_route` as an explicit
+route-manager `applied` decision with source
+`service_channel_remediation_command`. C18Z59 proves active traffic follows the
+replacement route after remediation: runtime heartbeat evidence shows
+`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
+route, with no local/backend fallback and no route send failures. C18Z60 proves
+the same replacement path under multiple independent VPN flow channels: a
+twelve-packet batch is classified across multiple flow-scheduler channels, all
+observed replacement-route sends avoid local/backend fallback, flow drops, and
+route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
+packets; runtime evidence shows 32 replacement-route flow stats, scheduler
+high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
+C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
+ingress: `X-RAP-Traffic-Class` can mark `control`, `interactive`, `reliable`,
+`bulk`, or `droppable`; default traffic remains backward-compatible bulk.
+Unit tests prove scheduler priority order, and live smoke proves a bulk
+128-packet pressure batch plus an interactive packet both move over the
+replacement route with separate traffic-class flow stats and no fallback,
+drops, or route failures. C18Z63 adds a controlled concurrent runtime proof: a
+bulk traffic-class send is held in-flight while an independent interactive
+traffic-class packet is sent through the same ingress, and interactive completes
+before bulk release with `MaxInFlight >= 2`, no drops, and no failures.
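As an illustration of the C18Z62 header handling described above, here is a minimal Go sketch of how an ingress could map `X-RAP-Traffic-Class` to one of the five classes, with unmarked traffic staying `bulk`. The function name, the case-insensitive matching, and the fall-back-to-bulk behavior for unknown values are assumptions made for this example; the notes only state the header name, the class names, and the bulk default.

```go
package main

import (
	"fmt"
	"strings"
)

// TrafficClassHeader is the ingress header named in the notes above.
const TrafficClassHeader = "X-RAP-Traffic-Class"

// DefaultTrafficClass keeps unmarked traffic backward-compatible.
const DefaultTrafficClass = "bulk"

// knownTrafficClasses holds the five classes listed in the notes.
var knownTrafficClasses = map[string]bool{
	"control":     true,
	"interactive": true,
	"reliable":    true,
	"bulk":        true,
	"droppable":   true,
}

// NormalizeTrafficClass maps a raw header value to a traffic class.
// Assumption: matching is case-insensitive and unknown values fall back to
// the default instead of being rejected.
func NormalizeTrafficClass(raw string) string {
	v := strings.ToLower(strings.TrimSpace(raw))
	if knownTrafficClasses[v] {
		return v
	}
	return DefaultTrafficClass
}

func main() {
	for _, raw := range []string{"", "Interactive", "video"} {
		fmt.Printf("%s: %q -> %s\n", TrafficClassHeader, raw, NormalizeTrafficClass(raw))
	}
}
```

Keeping the normalization in one pure function makes the backward-compatible default easy to unit-test separately from the scheduler.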
+C18Z64 adds compact runtime telemetry: `rap.fabric_flow_scheduler.v1` snapshots
+include `traffic_class_counts`, so backend/admin/diagnostics can show active
+flow-channel counts per traffic class without scanning each channel stat. It is
+live-proven on `rap-node-agent:0.2.239-c18z64`; latest test-1 snapshot showed
+`bulk=32`, `interactive=12`, drops 0. C18Z65/C18Z66 project those counts and
+flow pressure fields into backend access telemetry at node, active-channel, and
+cluster aggregate levels, and web-admin shows cluster/node/channel `flow QoS`
+visibility. Live aggregate API result showed `bulk=32`, `interactive=12`,
+`flow_channel_count=44`, `flow_max_in_flight=4`. C18Z67 adds a live HTTP
+concurrent QoS proof: six parallel bulk service-channel requests ran while an
+interactive traffic-class request was injected on the same entry path after
+remediation; the interactive request completed in 132 ms, all 6 bulk requests
+were accepted, 3072 post-remediation packets moved over the replacement route,
+32 bulk and 12 interactive replacement-route flow stats were observed, and
+fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
+adds backend/admin flow-health guard diagnostics over that telemetry:
+`flow_health_status` and `flow_health_reason` are projected at cluster, node,
+and active-channel levels from traffic-class pressure, queue pressure, flow
+drops, backend fallback, route-quality failures/drops/slow samples, and route
+send latency. Web-admin now shows flow-health chips beside flow QoS.
+C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
+report per-class `recommended_parallel_windows` plus
+`adaptive_backpressure_active/reason`, and the ingress send path uses the
+traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
+reliable is reduced moderately, and control/interactive keep their full window
+unless their own class degrades. Live smoke verified `bulk=1`, `droppable=1`,
+`reliable=3`, `interactive=4`, `control=4`, no drops, and backpressure reason
+`bulk_window_reduced_to_protect_interactive`. C18Z70 projects those adaptive
+runtime fields into backend/admin access telemetry at cluster, node, and
+active-channel levels. Cluster windows are aggregated by minimum non-zero
+per-class recommendation, and web-admin shows adaptive window chips beside flow
+health/QoS. Live API artifact shows `adaptive=true`, reason
+`bulk_window_reduced_to_protect_interactive`, and windows `bulk=1`,
+`droppable=1`, `reliable=3`, `interactive=4`, `control=4`. C18Z71 adds the
+cluster-level adaptive policy contract:
+`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy`.
+The policy stores audited thresholds and class windows in cluster metadata,
+projects the effective fingerprint into signed node-scoped synthetic config,
+and node-agent heartbeat/runtime telemetry reports `adaptive_policy_fingerprint`.
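The C18Z70 rule mentioned above, "cluster windows are aggregated by minimum non-zero per-class recommendation", can be sketched as follows. This is only an illustration: the function name and the map-based representation are hypothetical, and treating a zero window as "no recommendation reported" is an assumption.

```go
package main

import "fmt"

// AggregateWindows merges per-node recommended parallel windows into one
// cluster view by keeping, per traffic class, the smallest non-zero value.
// Assumption: a window of 0 (or less) means the node reported no
// recommendation for that class, so it is ignored.
func AggregateWindows(nodes []map[string]int) map[string]int {
	cluster := map[string]int{}
	for _, node := range nodes {
		for class, window := range node {
			if window <= 0 {
				continue
			}
			if current, ok := cluster[class]; !ok || window < current {
				cluster[class] = window
			}
		}
	}
	return cluster
}

func main() {
	cluster := AggregateWindows([]map[string]int{
		{"bulk": 2, "interactive": 4, "reliable": 0},
		{"bulk": 1, "interactive": 4, "reliable": 3},
	})
	fmt.Println(cluster["bulk"], cluster["interactive"], cluster["reliable"])
}
```

Taking the minimum means the cluster-level chip reflects the most constrained node, which is the conservative reading for an operator.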
+The node scheduler consumes the policy at runtime; default policy preserves
+bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
+proved an operator policy with max window 6 and `bulk=2` changes the live
+recommended windows without breaking interactive/control. A signed-config hash
+mismatch found during the smoke was fixed by preserving all signed adaptive
+policy provenance fields in the node-agent client model. C18Z72 adds the
+cluster-level pool/failover policy contract:
+`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy`. Lease
+issuance now applies the effective entry/exit pool constraints and preferred
+entry/exit before route selection, stores the effective policy on the lease,
+and signs it into `rap.fabric_service_channel_lease_authority.v1`. Live smoke
+proved a policy-constrained lease selects only the policy entry/exit from a
+wider requested pool and carries matching signed `pool_policy` provenance.
+C18Z73 projects that signed pool-policy fingerprint into active access
+telemetry and guards remediation commands against routes outside the signed
+lease pools. C18Z74 correlates active remediation commands with entry-node
+route-manager heartbeats and reports execution states such as
+`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
+`pending_rebuild_request`, and `expired`. C18Z75 records `rebuild_route`
+remediation as durable rebuild ledger intent rows when node-scoped synthetic
+config is fetched, and access telemetry reports `rebuild_request_recorded` or
+`rebuild_request_rejected`. C18Z76 makes the allowed `rebuild_route` command
+visible from the node side: node-agent consumes it as a route-manager
+`pending_degraded_fallback` decision with source
+`service_channel_remediation_command`, and backend access telemetry correlates
+that with the durable ledger as `rebuild_request_recorded_node_pending`.
+C18Z77 resolves durable remediation rebuild requests inside the shared Control
+Plane planner: signed-pool-valid alternates become `applied` /
+`replacement_selected` and are projected as route-manager decisions with the
+same command id, missing safe alternates become `no_alternate`, lease/policy
+blocks become `deferred_by_policy`, and stale commands become `expired`.
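A rough sketch of the C18Z77 outcome mapping described above, assuming a simplified request shape. The struct, its field names, and the precedence between the checks (expiry first, then policy, then alternate availability) are assumptions for illustration; only the four outcome strings come from the notes.

```go
package main

import "fmt"

// RebuildRequest is a hypothetical, simplified view of a durable
// remediation rebuild request as seen by the planner.
type RebuildRequest struct {
	CommandExpired  bool   // the remediation command is stale
	BlockedByPolicy bool   // lease or policy state blocks the rebuild
	SafeAlternateID string // signed-pool-valid alternate route, empty if none
}

// ResolveRebuild returns the planner outcome status for a request.
// Assumption: expiry is checked before policy, and policy before the
// alternate lookup; the notes do not state the precedence.
func ResolveRebuild(r RebuildRequest) string {
	switch {
	case r.CommandExpired:
		return "expired"
	case r.BlockedByPolicy:
		return "deferred_by_policy"
	case r.SafeAlternateID == "":
		return "no_alternate"
	default:
		return "applied"
	}
}

func main() {
	fmt.Println(ResolveRebuild(RebuildRequest{SafeAlternateID: "route-b"}))
	fmt.Println(ResolveRebuild(RebuildRequest{}))
}
```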
+C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
+`applied` branch live by adding an alternate route after lease issuance and
+verifying the existing rebuild command resolves to `rebuild_request_applied`.
+C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
+resolution, the entry node reports a route-manager decision with the same
+`rebuild_request_id`, the transition is `applied_rebuild`, and live
+service-channel packet traffic selects the replacement route without
+local/backend fallback, route failures, or flow drops. C18Z80 hardens that
+same path under sustained pressure: after planner-applied rebuild, five
+post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
+batches stay on the replacement route, the stale primary is not reselected, and
+fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
+C18Z81 adds the negative/rollback proof: after the initial replacement is
+applied and used, a generation-valid fenced feedback report for that
+replacement causes the Control Plane to select a new safe recovery route; live
+traffic then moves to the recovery route, the degraded replacement is not
+reselected, and fallback/failure/drop deltas stay zero for the recovery send.
+The C18Z81 work also tightened older smoke checks to use per-run counter deltas
+instead of absolute cumulative runtime counters.
+C18Z82 closes the no-safe-recovery branch: after the replacement route reports
+generation-valid fenced feedback and no new safe recovery route is created,
+node-scoped synthetic config surfaces `service_channel_feedback_no_alternate`
+with `pending_degraded_fallback`, `no_unfenced_alternate_route`, and
+`backend_relay_degraded_fallback_until_rebuild`, proving the Control Plane
+exposes a degraded/no-alternate state instead of silently sticking to a bad
+replacement.
+C18Z83 projects those route-manager decisions into active access telemetry and
+web-admin: active channels now expose route-decision source, route id,
+replacement route id, rebuild status/reason/generation, and score reasons.
+The live smoke proves the no-safe state is visible through access telemetry as
+`service_channel_feedback_no_alternate` /
+`pending_degraded_fallback`, with operator execution state remaining compatible
+with durable ledger `rebuild_request_no_alternate`.
+C18Z84 aggregates those per-channel decisions at the access-telemetry summary
+level: route-decision channel count, replacement decision count, applied
+rebuild count, recovery decision count, and no-safe recovery count are exposed
+to the API and web-admin summary chips. The no-safe branch now prioritizes the
+aggregate status reason `active_channels_no_safe_recovery` over generic missing
+access-report noise.
+C18Z85 projects access-decision aggregates into rebuild health and incident
+diagnostics. Health summary now carries access decision counts and prioritizes
+`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe is
+active. Rebuild incidents now include `incident_source=access_decision` rows
+for active channel decisions such as `access_no_safe_recovery`, with bad
+severity and channel id, so operators see route-decision failures beside ledger
+incidents.
+C18Z86 adds silence/acknowledgement behavior for those
+`incident_source=access_decision` incidents. Silence requests now carry
+`incident_source` and `channel_id`; access-decision no-safe silences are stored
+with a channel-scoped route key, applied back into rebuild health/incidents,
+and exact current-generation incidents stop contributing to active bad count.
+Generation-changing access-decision resurfacing is unit-tested; the live smoke
+proves the operator silence path on docker-test.
+C18Z87 exposes active rebuild/access-decision silences to operators and adds
+unsilence. The API now lists active rebuild alert silences, returns
+access-decision `incident_source`, `channel_id`, and display route id, and
+allows deleting a silence by id. Web-admin shows an `Active rebuild silences`
+table with an unsilence action. The live smoke proves list -> silence ->
+unsilence and verifies the access no-safe incident becomes active again.
+C18Z88 makes access-decision resurfacing operator-visible in live runtime.
+Access-decision incidents now expose the silence id they resurfaced from, the
+previous acknowledged generation, and the silence expiry. The live smoke
+proves: access no-safe incident -> silence current generation -> wait for a new
+route-decision generation -> incident returns as `alert_resurfaced=true`, active
+bad count is restored, and previous generation metadata is preserved.
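The generation-scoped silence behavior from C18Z86 through C18Z88 can be pictured with the sketch below. All type and field names are hypothetical, times are plain Unix seconds to keep the example small, and treating an expired silence as a plain active incident (not a resurfaced one) is an assumption.

```go
package main

import "fmt"

// Silence is a hypothetical stored acknowledgement for one route key and
// one route-decision generation.
type Silence struct {
	ID            string
	RouteKey      string
	Generation    int64
	ExpiresAtUnix int64
}

// Incident is a hypothetical access-decision incident.
type Incident struct {
	RouteKey   string
	Generation int64
}

// Verdict says whether the incident still counts toward the active bad count.
type Verdict struct {
	Silenced           bool
	AlertResurfaced    bool
	ResurfacedFrom     string // silence id the incident resurfaced from
	PreviousGeneration int64  // generation that had been acknowledged
}

// Evaluate applies a silence to an incident. Only an exact
// current-generation match silences; a newer generation resurfaces.
func Evaluate(inc Incident, s *Silence, nowUnix int64) Verdict {
	if s == nil || s.RouteKey != inc.RouteKey || nowUnix >= s.ExpiresAtUnix {
		return Verdict{} // no applicable silence: incident stays active
	}
	if s.Generation == inc.Generation {
		return Verdict{Silenced: true}
	}
	return Verdict{AlertResurfaced: true, ResurfacedFrom: s.ID, PreviousGeneration: s.Generation}
}

func main() {
	s := &Silence{ID: "sil-1", RouteKey: "chan-a/route-1", Generation: 7, ExpiresAtUnix: 2000}
	fmt.Printf("%+v\n", Evaluate(Incident{RouteKey: "chan-a/route-1", Generation: 7}, s, 1000))
	fmt.Printf("%+v\n", Evaluate(Incident{RouteKey: "chan-a/route-1", Generation: 8}, s, 1000))
}
```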
+C18Z89 closes the resurfaced-incident operator action loop for generation
+changes. Resurfaced access-decision incidents now expose
+`alert_resurfaced_cause`, previous route id, and previous channel id; web-admin
+shows the cause beside resurfaced incidents. The live smoke proves the operator
+can re-acknowledge the resurfaced generation, the active-channel decision
+context matches the incident route/generation, and the current generation
+returns to a silenced state.
+C18Z90 introduces the explicit signed production data-plane contract on
+service-channel leases. `data_plane` is now part of the lease, authority
+payload, introspection response, and lease-maintenance/admin list. It declares
+that control-plane traffic uses backend API, working data uses the fabric
+service channel over fabric routes, backend relay is degraded fallback only,
+production forwarding is required, and logical flows are service-neutral,
+protocol-agnostic, and isolated. Web-admin shows this contract in the
+service-channel lease table.
+C18Z91 makes node-agent consume that signed/introspected data-plane contract.
+Service-channel packet ingress validates the contract, applies the preferred
+fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
+access logs, and reports contract adoption in heartbeat access telemetry.
+C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
+signed lease says `backend_relay_policy=disabled`, route failure or missing
+fabric route returns a visible 503 instead of silently proxying working data
+through backend relay.
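To make the C18Z92 rule concrete, here is a small Go sketch of the send decision at the entry node. The `SendDecision` type and the transport labels `fabric_route` and `backend_relay` are illustrative names, not the node-agent's real identifiers; the `disabled` policy value and the 503 response are taken from the notes.

```go
package main

import (
	"fmt"
	"net/http"
)

// SendDecision is a hypothetical result of choosing a working-data path.
type SendDecision struct {
	Transport       string // empty when the request is refused
	HTTPStatus      int
	FallbackBlocked bool
}

// DecideSend picks a transport for working data. With a usable fabric
// route the fabric path is always used. Without one, a signed
// backend_relay_policy of "disabled" yields a visible 503 instead of a
// silent backend relay; any other policy value is treated here as
// allowing degraded fallback (an assumption of this sketch).
func DecideSend(fabricRouteUsable bool, backendRelayPolicy string) SendDecision {
	switch {
	case fabricRouteUsable:
		return SendDecision{Transport: "fabric_route", HTTPStatus: http.StatusOK}
	case backendRelayPolicy == "disabled":
		return SendDecision{HTTPStatus: http.StatusServiceUnavailable, FallbackBlocked: true}
	default:
		return SendDecision{Transport: "backend_relay", HTTPStatus: http.StatusOK}
	}
}

func main() {
	fmt.Printf("%+v\n", DecideSend(false, "disabled"))
}
```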
+C18Z93 promotes that data-plane contract telemetry into backend access
+telemetry and web-admin active-channel diagnostics: cluster, node, and
+active-channel rows now show contract adoption count, last working transport,
+steady-state transport, backend relay policy, data-plane mode, and logical
+flow mode.
+C18Z94 turns those data-plane/fallback signals into operator incidents.
+`data_plane_contract` incident rows are now emitted for missing data-plane
+contract reports after accepted service-channel traffic, wrong working or
+steady-state transport, wrong logical flow mode, disabled backend relay
+observed, and degraded backend relay usage. The incident list/readiness path
+can now surface a recommended action such as restoring the fabric route instead
+of treating backend relay as normal service traffic.
+C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
+contract disables backend relay and the entry runtime cannot use a fabric
+route, node-agent reports `backend_fallback_blocked`, the last data-plane
+violation status/reason, and backend/admin project those fields to cluster,
+node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
+refusal is now separate from real backend relay usage.
+C18Z96 wires normal-route send failure with disabled backend relay into the
+existing route feedback and rebuild planner path. When heartbeat access
+telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
+correlates the entry node's active service-channel leases, records fenced
+`fabric_service_channel_route_feedback` for the selected primary route, and the
+existing planner can select an alternate/replacement route. This keeps blocked
+fallback from becoming a dead-end operator alert.
+C18Z97 adds bounded deduplication for those access-report-derived route
+feedback records. Repeated blocked-fallback send-failure heartbeats no longer
+rewrite the same active fenced feedback or churn planner rebuild attempts while
+the first access-report feedback is still active. Runtime feedback from the
+flow scheduler remains independent.
+C18Z98 carries that feedback identity into the replacement decision and
+rebuild-attempt ledger: decision and ledger rows now expose
+`feedback_observation_id`, `feedback_source`, feedback observed/expiry time,
+channel/resource ids, and data-plane violation status/reason. Web-admin shows
+that correlation in Route decisions and Rebuild ledger.
+C18Z99 adds rebuild ledger filters for those correlation fields. The backend
+`/fabric/service-channels/rebuild-attempts` API accepts `feedback_source`,
+`feedback_channel_id`, and `feedback_violation_status`, and web-admin exposes
+the same filters in the rebuild ledger form. The live smoke proves source,
+channel, violation, combined filters, and wrong-channel exclusion.
+C18Z100 adds rebuild-health feedback breakdown aggregation for the same
+correlation fields. The backend rebuild-health summary now returns
+`feedback_breakdowns` grouped by feedback source, feedback channel id, and
+feedback violation status, including total/good/warn/bad/unknown counts,
+active warn/bad counts, silenced count, latest observation time, and affected
+reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
+panel so operators can see which access-report-derived failure classes dominate
+active warn/bad rebuild state.
+C18Z101 wires that breakdown into operator workflow in web-admin. Each
+feedback-breakdown row now shows related incident context by channel/reporter/
+route overlap and has an `open ledger` action that switches to the deep rebuild
+ledger with `feedback_source`, `feedback_channel_id`, and
+`feedback_violation_status` prefilled from the breakdown row.
+C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
+investigation endpoint now accepts feedback source/channel/violation drilldown
+payloads, records
+`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
+cluster audit events, and web-admin records one before opening the filtered
+deep ledger from a rebuild-health feedback breakdown row.
+C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
+Web-admin now filters the loaded cluster audit list for rebuild incident and
+feedback-breakdown investigation events and shows recent drilldowns with time,
+source, feedback filters, target reporter/route, actor, and reason beside
+rebuild incidents and silences.
+C18Z104 adds focused audit loading for that panel. The cluster audit API now
+accepts `event_type` and `target_type` filters, including repeated or
+comma-separated values, and web-admin loads recent fabric investigation
+breadcrumbs with a dedicated filtered request instead of depending on the
+generic latest-100 cluster audit list.
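The "repeated or comma-separated" filter form from C18Z104 could be parsed roughly like this. The helper names are hypothetical, and dropping empty and duplicate values is an assumption; the sketch relies only on the standard `net/url` query parser.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// SplitFilterValues flattens repeated and comma-separated filter values
// into one ordered list. Assumption: blanks and duplicates are dropped.
func SplitFilterValues(values []string) []string {
	var out []string
	seen := map[string]bool{}
	for _, raw := range values {
		for _, part := range strings.Split(raw, ",") {
			v := strings.TrimSpace(part)
			if v == "" || seen[v] {
				continue
			}
			seen[v] = true
			out = append(out, v)
		}
	}
	return out
}

// FilterFromQuery extracts one filter (for example event_type or
// target_type) from a raw query string.
func FilterFromQuery(rawQuery, key string) ([]string, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, err
	}
	return SplitFilterValues(q[key]), nil
}

func main() {
	got, err := FilterFromQuery("event_type=a,b&event_type=c&target_type=x", "event_type")
	fmt.Println(got, err)
}
```

Accepting both forms in one helper lets a client send either `?event_type=a&event_type=b` or `?event_type=a,b` without the handler caring which was used.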
+C18Z105 correlates those focused audit breadcrumbs back to currently visible
+diagnostics in web-admin. Recent investigation rows now show whether the
+breadcrumb still matches an active rebuild-health feedback breakdown or visible
+rebuild incident, and provide an `open` action to jump back into the matching
+filtered ledger path.
+C18Z106 moves that correlation into the backend/API. `GET /audit` with
+`correlation=fabric_diagnostics` now returns `correlation_hints` for focused
+fabric investigation breadcrumbs, including current diagnostic status
+(`breakdown_active`, `incident_visible`, or `not_visible`) and the matching
+breakdown/incident object when present. Web-admin consumes those hints and keeps
+its previous local matching as fallback. During verification the noisy test
+history exposed that rebuild-health feedback breakdowns were capped too tightly;
+the backend now returns up to 100 breakdown groups so fresh failure classes are
+not pushed out by older smoke history.
+C18Z107 adds a compact backend-provided `audit_summary` beside `audit_events`.
+For focused Fabric diagnostics audit reads, the summary includes total count,
+counts by event/target type, counts by current diagnostic status, counts by
+feedback source/violation status, correlated count, not-visible count, and
+latest time. Web-admin shows these as Recent investigations chips and short
+source/violation lines without recalculating the aggregate in the browser.
+C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
+cluster audit read path. Backend now exposes
+`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
+with a dedicated `rebuild_investigation_breadcrumbs` contract containing
+events plus summary. Web-admin uses this endpoint for Recent investigations
+and keeps generic audit semantics separate from Fabric diagnostics workflow
+state.
+C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
+endpoint accepts `current_window_seconds` and `history_window_seconds`, annotates
+each breadcrumb with `correlation_hints.breadcrumb_status` (`current`, `stale`,
+or `expired`) plus age/window seconds, returns current/stale/expired totals, and
+adds `counts_by_breadcrumb_status` to the summary. Web-admin shows freshness
+chips and an age column in Recent investigations, so operators can separate live
+workflow hints from stale history without deleting audit records.
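A minimal sketch of the C18Z109 freshness classification, assuming inclusive window boundaries: a breadcrumb is `current` up to `current_window_seconds`, `stale` up to `history_window_seconds`, and `expired` beyond that. The function names and the boundary choice are assumptions; the status strings and window parameters come from the notes.

```go
package main

import "fmt"

// BreadcrumbStatus classifies one breadcrumb by its age in seconds.
// Assumption: both window boundaries are inclusive.
func BreadcrumbStatus(ageSeconds, currentWindowSeconds, historyWindowSeconds int64) string {
	switch {
	case ageSeconds <= currentWindowSeconds:
		return "current"
	case ageSeconds <= historyWindowSeconds:
		return "stale"
	default:
		return "expired"
	}
}

// CountsByBreadcrumbStatus builds the per-status totals for a summary.
func CountsByBreadcrumbStatus(ages []int64, currentWindowSeconds, historyWindowSeconds int64) map[string]int {
	counts := map[string]int{}
	for _, age := range ages {
		counts[BreadcrumbStatus(age, currentWindowSeconds, historyWindowSeconds)]++
	}
	return counts
}

func main() {
	counts := CountsByBreadcrumbStatus([]int64{30, 600, 90000}, 300, 86400)
	fmt.Println(counts["current"], counts["stale"], counts["expired"])
}
```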
+
+Live verification passed with these smoke scripts:
+
+- `scripts/fabric/c18z48-service-channel-introspection-smoke.ps1`
+- `scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1`
+- `scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1`
+- `scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1`
+- `scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1`
+- `scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1`
+- `scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1`
+- `scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1`
+- `scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1`
+- `scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1`
+- `scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1`
+- `scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1`
+- `scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1`
+- `scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1`
+- `scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1`
+- `scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1`
+- `scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1`
+- `scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1`
+
+Artifacts:
+
+- `artifacts/c18z48-service-channel-introspection-smoke-result.json`
+- `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`
+- `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`
+- `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`
+- `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`
+- `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`
+- `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`
+- `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`
+- `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`
+- `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`
+- `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`
+- `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`
+- `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`
+- `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`
+- `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`
+- `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`
+- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`
+- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`
+- `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`
+- `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`
+- `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`
+- `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`
+- `artifacts/c18z68-service-channel-flow-health-api-result.json`
+- `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`
+- `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`
+- `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`
+- `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`
+- `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`
+- `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`
+- `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`
+- `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`
+- `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`
+- `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`
+- `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`
+- `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`
+- `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`
+- `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`
+- `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`
+- `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`
+- `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`
+- `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`
+- `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`
+- `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`
+- `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`
+- `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`
+- `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`
+- `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`
+- `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`
+- `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`
+- `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`
+- `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`
+- `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`
+- `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`
+- `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`
+- `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`
+- `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`
+- `artifacts/c18z104-focused-fabric-audit-smoke-result.json`
+- `artifacts/c18z106-audit-correlation-hints-smoke-result.json`
+- `artifacts/c18z107-audit-correlation-summary-smoke-result.json`
+- `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`
+- `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`

-Cluster Authority foundation is now also complete:
+Current active continuation after C19Z1:

- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
-  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
-  a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
-  and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
-  `authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
-  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
-  trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
-  approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
-  bootstrap, cluster creation, signed join token, real node-agent enrollment,
-  owner approval, automatic signed bootstrap polling, authority pin
-  persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
-  `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
-  supervision remains a stub
+C19Z1 is implemented and runtime-smoke-proven. Remote Workspace adapter sessions
+now expose read-only mailbox handoff preflight:
+`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight?consumer_id=...&resume_from=ack|checkpoint`.
+The response validates the consumer cursor and reports the expected next event
+window (`after_sequence`, available/returned/skipped counts, first/last expected
+sequence) without reading, draining, acking, or mutating consumer state.
+Node-agent image `rap-node-agent:codex-service-supervisor-20260512z2` is
+deployed on `test-1/2/3`. Verification artifacts:
+`artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json`, C19X
+source
+`artifacts/c19z1-remote-workspace-mailbox-preflight-source-result.json`, and
+C19Z regression
+`artifacts/c19z-remote-workspace-adapter-readiness-smoke-result.json`.

-Node enrollment bootstrap polling is also complete:
-
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
-  before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
-  the signed bootstrap contract, then persists `node_id`, `identity_status`,
-  and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
-  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
-
-Current state:
-
- C17Z12 added rendezvous/relay control-plane leases for peers that would
-  otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
-  for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
-  relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
-  route-health route config only. The synthetic `fabric.route_health` runtime
-  now probes the selected effective path, including replacement relay paths,
-  and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
-  scoring. Drift/unreachable/failure feedback marks the exact selected relay
-  stale and can trigger replacement; healthy low-latency route-health boosts
-  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
-  service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
-  synthetic route-health drift/unreachable/failure, it performs a bounded
-  node-scoped synthetic-config refresh, applies returned replacement route
-  decisions to route-health config immediately, and reports
-  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
-  type/route, so `synthetic_route_health` is not overwritten by
-  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
-  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
-  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
-  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
-  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
-  `000021_cluster_authority_keys`; the migration now drops/recreates
-  `cluster_admin_summaries` in up/down paths.
-
-Runtime report:
-
- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- Docker-test smoke command:
-  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
-  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
-  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:5174/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`
-
-Do not automatically continue into:
-
- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes
-
-Next narrow layer, if approved:
-
-C17Z21 should tighten route-health feedback refresh dampening: if an immediate
-feedback refresh returns the same config version or no replacement change, keep
-a per-route/relay no-change cooldown before retrying. Keep the boundary
-synthetic/control-plane only and keep RDP/VPN/service payload forwarding
-untouched.
+The next narrow Remote Workspace layer should stay probe-only and node-local. A good
+C19Z2 candidate is handoff preflight telemetry: add counters/last-preflight
+fields for the read-only preflight endpoint in workload status/heartbeat reports,
+so operators can distinguish handoff checks from mailbox reads. Do not add
+desktop frame transport, Android work, backend relay semantics, or production
+adapter payload forwarding in this slice.
@@ -0,0 +1,79 @@
+# VPN baseline 0.2.87
+
+Date: 2026-05-05
+
+This document freezes the current near-working VPN state. Treat it as the
+rollback and comparison point before changing the Android VPN dataplane,
+gateway assignment, mesh route intents, or packet relay behavior.
+
+## Baseline components
+
+- Android client: `0.2.87` / version code `87`
+- APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-latest-release.apk`
+- Known-good APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-known-good-0.2.87.apk`
+- Versioned APK path: `web-admin/deploy/html/downloads/releases/0.2.87/rap-android-rdp-vpn-0.2.87-release.apk`
+- APK sha256: `bc44304658df7cd0ad627660c9e7b37af68910cdb13b310314ab99a049ff3b8d`
+- APK size: `1187103`
+- Backend image: `rap-backend:vpn-dataplane-contract-0.2.86`
+- Node/host agents: `0.2.86`
+- Cluster: `cfc0743d-d960-49fb-9de8-96e063d5e4aa`
+- VPN connection: `7cc94b0d-9cc2-4492-956a-cb0913b887e2` (`home-full-tunnel`)
+- Entry node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
+- Exit node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
+- DNS from exit side: `192.168.200.210`
+- Client tunnel: full tunnel, `0.0.0.0/0`, VPN address `10.77.0.2/24`
+- Active gateway lease: home-1, generation `8`
+- Active relay transport: `backend_http_packet_relay`
+
+## Current working behavior
+
+- General web traffic passes through the VPN.
+- External sites open through the configured home exit.
+- Telegram can connect, but initial connection may be delayed.
+- RDP can connect through the tunnel, but long-lived sessions can still drop.
+- Throughput is the best observed so far, but speed-test pages can be slow to
+  load their plugin/script parts.
+
+## Observed diagnostics
+
+Latest phone diagnostics for device `37574bd4-b944-440f-bbd5-87f2980d22c4`
+reported Android app version `0.2.87`.
+
+Packet relay counters showed both directions are active:
+
+- `client_to_gateway`: no queue drops observed, queue depth returned to `0`
+- `gateway_to_client`: queue depth was observed at `48-55`
+- `gateway_to_client`: `246` dropped packets were observed
+- Android side recorded downlink traffic, uplink traffic, and several uplink
+  sender errors
+- Android source validation dropped packets whose source was not the VPN
+  address; keep this guard enabled
+
+Interpretation: the active path is real and carries traffic, but downlink
+backpressure or Android TUN drain stalls can still interrupt long-lived TCP
+flows. This is consistent with the delayed Telegram startup, speed-test plugin
+loading delays, and RDP sessions that connect and later drop.
+
+## Guardrails
+
+- Do not reduce Android `TUN_WRITE_MAX_RETRIES` below `1000` without a
+  controlled regression test.
+- Do not relax Android VPN source-address validation.
+- Do not re-enable the home-1 `vpn_packets` fabric mesh route intent for this
+  connection until the Android client can intentionally use the fabric entry
+  path. The current working baseline relies on `backend_http_packet_relay`.
+- Do not change the active entry/exit away from home-1 without saving packet
+  counters before and after.
+- Do not change DNS away from `192.168.200.210` without checking full-tunnel
+  DNS and direct-IP traffic separately.
+- Keep the 0.2.87 APK available as a known-good rollback artifact.
+
+## Next safe work
+
+1. Stabilize `gateway_to_client` downlink queue draining and Android TUN write
+   backpressure.
+2. Add clearer per-flow counters for long-lived TCP flows such as RDP.
+3. Add a small repeatable smoke test: DNS, direct IP HTTP, 2ip.ru, Telegram-like
+   long connection, and RDP port reachability.
+4. Only after this baseline is stable, move Android entry traffic from backend
+   relay to fabric mesh.
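
The repeatable smoke test in item 3 of "Next safe work" could start from a sketch like the one below. It is illustrative only: Python is an assumption (the existing smoke scripts in this repo are PowerShell), and the names `check_dns`, `check_tcp`, `run_smoke`, and `baseline_checks`, along with the `example.com` target, are hypothetical. It covers only DNS resolution, direct-IP HTTP reachability, and RDP port reachability; the 2ip.ru and Telegram-like long-connection checks are not modeled here.

```python
import socket
from typing import Callable, Dict


def check_dns(name: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return len(socket.getaddrinfo(name, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def check_tcp(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def run_smoke(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run every named check and record pass/fail; a raising check counts as failed."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def baseline_checks(direct_ip: str, rdp_host: str) -> Dict[str, Callable[[], bool]]:
    """Hypothetical check set; the caller supplies targets reachable via the tunnel."""
    return {
        "dns": lambda: check_dns("example.com"),          # placeholder hostname
        "direct_ip_http": lambda: check_tcp(direct_ip, 80),
        "rdp_port": lambda: check_tcp(rdp_host, 3389),    # 3389 is the standard RDP port
    }
```

With the tunnel up, `run_smoke(baseline_checks(direct_ip, rdp_host))` returns a name-to-boolean map that can be saved next to the packet counters for before/after comparison. Keeping DNS and direct-IP checks as separate entries matches the guardrail about verifying full-tunnel DNS and direct-IP traffic separately.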