Record project continuation changes

2026-05-12 21:02:29 +03:00
parent 3059d1d7a3
commit 8f69d53193
339 changed files with 101111 additions and 1769 deletions
@@ -1016,6 +1016,240 @@ Status: implemented and verified. Report: `artifacts/c5-service-workload-supervi
Goal:
Node-agent can start, stop, and monitor service workloads based on role assignment.
C19A adds the first bounded live service-supervision runtime proof on top of
that contract: node-agent can read node-scoped desired workloads without an
operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
report native built-in `synthetic.echo` as running, and keep unsupported
production workloads degraded instead of pretending that their adapters exist.
The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.
C19B adds the Remote Workspace/RDP adapter-contract bridge without enabling RDP
payload traffic. A native `rdp-worker` desired workload with
`adapter_contract_probe=true` reports the remote-workspace channel map,
requires Fabric Service Channel, and marks backend relay as not steady-state.
The live smoke is
`scripts/fabric/c19b-remote-workspace-adapter-contract-smoke.ps1`.
C19C wires Remote Workspace into service-channel lease issuance without
starting RDP traffic: route intents now accept `remote_workspace`, the lease
entry descriptor uses remote-workspace stream paths and frame-batch media type
instead of VPN packet paths, and the signed data-plane contract is present in
lease, authority payload, introspection, and lease maintenance. The live smoke
is `scripts/fabric/c19c-remote-workspace-service-channel-lease-smoke.ps1`.
C19D adds the Remote Workspace entry-node ingress skeleton. The node-agent
accepts a signed/introspected `remote_workspace` service-channel lease on
`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
class, channel class, selected entry node, and data-plane flow isolation, and
reports access telemetry. It intentionally returns a probe contract with
`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
the Fabric ingress contract without forwarding desktop frames yet. The live
smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.
C19E adds the first Remote Workspace frame-batch contract probe across the
adapter/entry boundary. The `rdp-worker` adapter probe reports
`rap.remote_workspace_frame_batch.v1`; entry-node accepts only
`probe_only=true` frame batches, validates logical adapter channels and
directions, and returns `payload_flow=validated_probe_only`. Real desktop frame
delivery remains intentionally disabled until the service adapter runtime stage.
The live smoke is
`scripts/fabric/c19e-remote-workspace-frame-batch-contract-smoke.ps1`.
C19F adds the first local adapter-sink proof for that frame-batch contract.
Node-agent now keeps an in-memory `node_agent_rdp_worker_contract_probe` sink
for Remote Workspace frame probes and preserves it across mesh config refresh.
Entry-node delivers validated `probe_only=true` frame batches to that sink and
returns a `rap.remote_workspace_frame_batch_delivery.v1` receipt with
`payload_flow=delivered_probe_only`. This still does not enable production RDP
frame forwarding. The live smoke is
`scripts/fabric/c19f-remote-workspace-adapter-sink-smoke.ps1`.
C19G exposes the adapter-sink delivery proof through existing node-agent
visibility channels. The `rdp-worker` workload status payload now includes
`remote_workspace_adapter_sink`, and node telemetry includes
`remote_workspace_adapter_sink_report`, both carrying delivery count, latest
delivery sequence, channel class, frame count, and the probe-only/no-payload
boundary. The live smoke is
`scripts/fabric/c19g-remote-workspace-adapter-sink-telemetry-smoke.ps1`.
C19H locks down the Remote Workspace frame-batch guardrails before real adapter
runtime work begins. Unit and live smoke coverage now proves that entry-node
rejects `probe_only=false`, unknown logical channels, invalid channel
directions, service-class mismatch, channel-class mismatch, and unsupported
payload encoding, and that rejected batches do not produce adapter delivery.
The live smoke is
`scripts/fabric/c19h-remote-workspace-frame-guardrails-smoke.ps1`.
C19I adds the first bounded adapter handoff queue/ack proof for the same
probe-only path. The local `node_agent_rdp_worker_contract_probe` sink reports
queue capacity/depth plus accepted, dropped, and acked frame counts: with
capacity `8`, droppable display overflow accepts/acks `8` frames and drops `3`,
while reliable input overflow is rejected with backpressure and no delivery
receipt. The boundary still carries `payload_traffic=none`; this is queue
semantics for the future adapter runtime, not real RDP payload forwarding. The
live smoke is
`scripts/fabric/c19i-remote-workspace-adapter-queue-smoke.ps1`.
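The queue semantics reported for C19I can be illustrated with a minimal sketch. Everything here is hypothetical (the `ProbeQueue` type, its method names, and the batch sizes of 11 and 9 frames); only the capacity of `8`, the `8` accepted/acked and `3` dropped frames, and the reliable-overflow rejection come from the text above. It tracks frame counts only and carries no payload.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBackpressure is a hypothetical error returned when a reliable batch does not fit.
var ErrBackpressure = errors.New("backpressure: reliable batch exceeds free queue capacity")

// ProbeQueue is a hypothetical bounded handoff queue. It tracks frame counts
// only, mirroring the payload_traffic=none boundary.
type ProbeQueue struct {
	Capacity          int
	Depth             int
	Accepted          int
	Dropped           int
	Acked             int
	BackpressureCount int
}

// OfferDroppable accepts as many frames as fit and drops the overflow.
func (q *ProbeQueue) OfferDroppable(frames int) (accepted, dropped int) {
	free := q.Capacity - q.Depth
	accepted = frames
	if accepted > free {
		accepted = free
	}
	dropped = frames - accepted
	q.Depth += accepted
	q.Accepted += accepted
	q.Dropped += dropped
	return accepted, dropped
}

// OfferReliable is all-or-nothing: an overflowing batch is rejected and
// nothing is queued, so no delivery receipt would be produced.
func (q *ProbeQueue) OfferReliable(frames int) error {
	if frames > q.Capacity-q.Depth {
		q.BackpressureCount++
		return ErrBackpressure
	}
	q.Depth += frames
	q.Accepted += frames
	return nil
}

// Ack releases queued frames after the adapter side has consumed them.
func (q *ProbeQueue) Ack(frames int) {
	if frames > q.Depth {
		frames = q.Depth
	}
	q.Depth -= frames
	q.Acked += frames
}

func main() {
	q := &ProbeQueue{Capacity: 8}
	accepted, dropped := q.OfferDroppable(11) // 11 = 8 + 3, inferred from the reported counts
	q.Ack(accepted)
	fmt.Printf("accepted=%d dropped=%d acked=%d\n", accepted, dropped, q.Acked)
	err := q.OfferReliable(9) // batch size is an assumption
	fmt.Printf("reliable rejected=%v backpressure_count=%d\n", err != nil, q.BackpressureCount)
}
```

The sketch makes the reliable path all-or-nothing so that a rejected batch leaves the accepted counter untouched, which is one way to keep "no delivery receipt" observable.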
C19J makes those queue/backpressure signals operationally visible. The
`remote_workspace_adapter_sink` workload status payload and
`remote_workspace_adapter_sink_report` telemetry now include current queue
capacity/depth, cumulative accepted/dropped/acked frame counters,
`backpressure_count`, and the latest rejected batch metadata/reason. The live
smoke first produces the C19I droppable overflow plus reliable backpressure,
then waits until both workload status and telemetry show the delivery, dropped
total, and backpressure increment. The live smoke is
`scripts/fabric/c19j-remote-workspace-adapter-queue-telemetry-smoke.ps1`.
C19K introduces the probe-only adapter session boundary. Entry-node derives a
stable `adapter_session_id` from the service-channel lease/resource/route
context and passes it to the local `rdp-worker` adapter probe sink. Delivery
receipts, workload status, and telemetry now include `adapter_session_id`,
`adapter_runtime_id=node_agent_rdp_worker_contract_probe`, and
`session_state=probe_bound`, and rejected/backpressured batches retain the same
session identity. This is still not real RDP payload forwarding; it binds the
queue/ack/backpressure model to the future per-session adapter runtime. The
live smoke is
`scripts/fabric/c19k-remote-workspace-adapter-session-boundary-smoke.ps1`.
C19L adds the first lifecycle model to that probe-only adapter session. The
node-agent sink now tracks active sessions in memory with created/bound totals,
last activity timestamps, per-session delivery/backpressure/frame counters,
`current_session_lifecycle_state`, and idle expiry counters. A successful
droppable overflow binds the session as `probe_bound`; a reliable overflow keeps
the same `adapter_session_id` and moves the lifecycle state to `backpressure`
for diagnosis. Receipts expose session created/bound/last-activity timestamps
and per-session counters while `payload_traffic=none` remains enforced. The
live smoke is
`scripts/fabric/c19l-remote-workspace-adapter-session-lifecycle-smoke.ps1`.
C19M adds explicit probe-only adapter-session control. Node-agent exposes
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
with `close`, `expire`, and `reset` actions, returning
`rap.remote_workspace_adapter_session_control.v1`. Workload status and telemetry
now include `session_control_total`, `session_closed_total`,
`session_reset_total`, and the latest control action/session/state, so sessions
can be ended deliberately instead of only by idle TTL. The live smoke creates a
Remote Workspace adapter session, closes it through the mesh control endpoint,
and waits until workload status and telemetry expose the close. The live smoke
is
`scripts/fabric/c19m-remote-workspace-adapter-session-control-smoke.ps1`.
C19N locks down the adapter-session control guardrails. Control requests now
reject unsupported actions, invalid `adapter_session_id` values, malformed JSON,
unknown active/terminal sessions, and overlong reasons without creating hidden
session state. Repeating `close` against an already closed terminal session is
idempotent: it reports `previous_state=closed` and does not increment
`session_closed_total` again, while still counting the control observation. The
live smoke verifies the negative cases plus first/repeated close visibility in
workload status and telemetry. The live smoke is
`scripts/fabric/c19n-remote-workspace-adapter-session-control-guardrails-smoke.ps1`.
C19O adds an immediate read-only adapter-session snapshot endpoint:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`.
It returns `rap.remote_workspace_adapter_session_snapshot.v1` with active
sessions, terminal sessions when requested, per-session lifecycle state,
activity/backpressure timestamps, frame counters, and runtime identity. This
lets operators inspect adapter-session state directly from node-agent without
waiting for heartbeat, workload status, or telemetry propagation. The live smoke
checks active-session visibility, close transition into terminal snapshot, and
invalid snapshot limit rejection. The live smoke is
`scripts/fabric/c19o-remote-workspace-adapter-session-snapshot-smoke.ps1`.
C19P adds the first adapter-runtime handoff mailbox contract. Each active
probe-only adapter session now owns a bounded in-memory mailbox that receives
`frame_batch_probe_delivered` and `backpressure` events with frame counts,
channel/resource/route context, and sequence numbers. Node-agent exposes
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
with optional `drain=true`, and session snapshots/workload reports expose
mailbox depth/enqueued/drained/dropped counters. This is the handoff surface a
real `rdp-worker` runtime can consume next; payload forwarding is still disabled.
The live smoke verifies read, drain, post-drain empty state, and snapshot
counters. The live smoke is
`scripts/fabric/c19p-remote-workspace-adapter-runtime-mailbox-smoke.ps1`.
C19Q hardens the mailbox handoff. Invalid IDs, unknown sessions, and invalid
limits are rejected before state mutation, and bounded `drain=true&limit=N`
reads remove only the returned event slice while preserving remaining depth for
the next poll. The bounded mailbox drops oldest events once capacity is reached,
and a closed adapter session no longer exposes an active runtime mailbox. The
live smoke verifies negative cases, drop-oldest pressure, partial drain, and
closed-session rejection. The live smoke is
`scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`.
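A minimal sketch of the drop-oldest and partial-drain behavior, assuming a hypothetical `Mailbox` type; the real event shape and counters are richer than shown. The limit is validated before any state is touched, and a drained read removes only the slice it returns.

```go
package main

import "fmt"

// Event is a hypothetical, reduced mailbox event.
type Event struct {
	Sequence uint64
	Kind     string
}

// Mailbox is a hypothetical bounded per-session mailbox.
type Mailbox struct {
	capacity int
	events   []Event
	nextSeq  uint64
	Enqueued int
	Drained  int
	Dropped  int
}

func NewMailbox(capacity int) *Mailbox {
	if capacity < 1 {
		capacity = 1
	}
	return &Mailbox{capacity: capacity}
}

// Enqueue appends an event and drops the oldest one once capacity is reached.
func (m *Mailbox) Enqueue(kind string) {
	m.nextSeq++
	if len(m.events) == m.capacity {
		m.events = m.events[1:]
		m.Dropped++
	}
	m.events = append(m.events, Event{Sequence: m.nextSeq, Kind: kind})
	m.Enqueued++
}

// Read returns up to limit events. With drain=true only the returned slice is
// removed, so the remaining depth is preserved for the next poll.
func (m *Mailbox) Read(limit int, drain bool) ([]Event, error) {
	if limit <= 0 {
		return nil, fmt.Errorf("invalid limit %d", limit) // rejected before mutation
	}
	n := limit
	if n > len(m.events) {
		n = len(m.events)
	}
	out := make([]Event, n)
	copy(out, m.events[:n])
	if drain {
		m.events = m.events[n:]
		m.Drained += n
	}
	return out, nil
}

// Depth reports how many events are still queued.
func (m *Mailbox) Depth() int { return len(m.events) }

func main() {
	m := NewMailbox(3)
	for i := 0; i < 5; i++ {
		m.Enqueue("frame_batch_probe_delivered")
	}
	batch, _ := m.Read(2, true)
	fmt.Println(batch[0].Sequence, batch[1].Sequence, m.Depth(), m.Dropped)
}
```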
C19R adds bounded long-poll ergonomics to the same node-local mailbox endpoint.
`wait_ms` lets an adapter runtime wait briefly for the next event without hot
polling, and responses make empty/timeout state explicit with `empty`,
`waited`, `wait_timeout`, and `wait_ms`. The live smoke proves empty timeout and
wake-on-delayed-event behavior while keeping the path probe-only. The live smoke
is `scripts/fabric/c19r-remote-workspace-mailbox-long-poll-smoke.ps1`.
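One way to get bounded long-poll behavior without hot polling is a notification channel that is replaced on every enqueue. The sketch below is an assumption about how such a wait could work, not the node-agent implementation. The response field names follow the `empty`, `waited`, `wait_timeout`, and `wait_ms` flags named above, while the `Mailbox` type, its methods, and the exact meaning of `waited` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Response mirrors the explicit empty/timeout flags described above.
type Response struct {
	Events      []string
	Empty       bool
	Waited      bool
	WaitTimeout bool
	WaitMs      int
}

// Mailbox is a hypothetical non-draining mailbox with wake-on-enqueue.
type Mailbox struct {
	mu     sync.Mutex
	events []string
	notify chan struct{} // closed and replaced whenever an event arrives
}

func NewMailbox() *Mailbox {
	return &Mailbox{notify: make(chan struct{})}
}

// Enqueue stores an event and wakes every waiting reader.
func (m *Mailbox) Enqueue(e string) {
	m.mu.Lock()
	m.events = append(m.events, e)
	close(m.notify)
	m.notify = make(chan struct{})
	m.mu.Unlock()
}

// EnqueueAfter delivers an event after a delay; it exists only for the demo.
func (m *Mailbox) EnqueueAfter(delayMs int, e string) {
	time.AfterFunc(time.Duration(delayMs)*time.Millisecond, func() { m.Enqueue(e) })
}

// Read returns queued events, waiting up to waitMs for the first one.
func (m *Mailbox) Read(waitMs int) Response {
	resp := Response{WaitMs: waitMs}
	deadline := time.After(time.Duration(waitMs) * time.Millisecond)
	for {
		m.mu.Lock()
		if len(m.events) > 0 {
			resp.Events = append([]string(nil), m.events...)
			m.mu.Unlock()
			return resp
		}
		wake := m.notify
		m.mu.Unlock()
		if waitMs <= 0 {
			resp.Empty = true
			return resp
		}
		resp.Waited = true
		select {
		case <-wake: // an event arrived; loop and re-check under the lock
		case <-deadline:
			resp.Empty, resp.WaitTimeout = true, true
			return resp
		}
	}
}

func main() {
	m := NewMailbox()
	fmt.Printf("%+v\n", m.Read(20))
	m.EnqueueAfter(50, "frame_batch_probe_delivered")
	fmt.Printf("%+v\n", m.Read(5000))
}
```

Capturing the channel while holding the lock avoids a lost wakeup: an event enqueued between the emptiness check and the `select` closes the channel the reader already holds.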
C19S makes mailbox consumer behavior visible in diagnostics. Workload status and
node telemetry now expose `mailbox_read_total`, `mailbox_wait_total`,
`mailbox_wait_timeout_total`, `mailbox_empty_read_total`, and last mailbox read
metadata; active session snapshots carry the same per-session counters while a
session remains active. The live smoke proves C19R traffic is reflected in both
workload status and telemetry. The live smoke is
`scripts/fabric/c19s-remote-workspace-mailbox-telemetry-smoke.ps1`.
C19T adds the node-local consumer cursor contract for that mailbox. Consumers
can pass `consumer_id` plus optional `ack_sequence` to receive explicit
checkpoint, ack, lag, read, and ack counters without draining mailbox state.
The probe sink stores bounded per-session consumer state and reports aggregate
and current-session consumer telemetry through workload status and heartbeat
telemetry. The live smoke is
`scripts/fabric/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke.ps1`.
C19U adds lifecycle visibility and reset guardrails to the same cursor state.
Mailbox consumers can pass `reset_consumer=true` with a valid `consumer_id` to
clear their checkpoint/ack state before the current read is recorded. Mailbox
responses now expose consumer count/capacity, created/reset/evicted flags, and
consumer timestamps, while diagnostics add reset and eviction counters. The
live smoke is
`scripts/fabric/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke.ps1`.
C19V adds read-only inspection for active mailbox consumer cursors. The
node-local
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
endpoint returns bounded cursor snapshots with consumer ids, checkpoint and ack
sequences, lag, totals, and timestamps. It is verified as read-only: inspection
does not increment mailbox reads, ack totals, reset counters, or drain mailbox
events. The live smoke is
`scripts/fabric/c19v-remote-workspace-mailbox-consumer-snapshot-smoke.ps1`.
C19W adds cursor-aware resume reads to the mailbox endpoint. Consumers can pass
`after_sequence` to receive only mailbox events newer than their checkpoint;
responses include `skipped_count` and `returned_count`, and long-poll waits for
newer-than-checkpoint events. The endpoint rejects `after_sequence` with
`drain=true`, preserving the non-destructive resume contract. The live smoke is
`scripts/fabric/c19w-remote-workspace-mailbox-after-sequence-smoke.ps1`.
C19X adds consumer-aware resume convenience. Mailbox reads with `consumer_id`
can pass `resume_from=ack` or `resume_from=checkpoint`; the node-agent resolves
the stored cursor to `after_sequence` before reading and returns
`resume_from`/`resume_sequence` in the response. The guardrails reject mixing
resume with manual `after_sequence`, drain, reset, missing consumers, or invalid
cursor names. The live smoke is
`scripts/fabric/c19x-remote-workspace-mailbox-consumer-resume-smoke.ps1`.
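The resume resolution and its guardrails can be sketched like this. The request and cursor structs are hypothetical; only the parameter names `consumer_id`, `resume_from`, `after_sequence`, `drain`, and `reset_consumer` and the `skipped_count`/`returned_count` fields come from the descriptions above.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, reduced mailbox event.
type Event struct{ Sequence uint64 }

// Cursor is hypothetical stored per-consumer state.
type Cursor struct{ Checkpoint, Ack uint64 }

// ReadRequest models the query parameters of a mailbox read.
type ReadRequest struct {
	ConsumerID    string
	ResumeFrom    string  // "ack" or "checkpoint"
	AfterSequence *uint64 // manual cursor; must not be combined with ResumeFrom
	Drain         bool
	ResetConsumer bool
}

// ReadResult carries the non-destructive resume window.
type ReadResult struct {
	Events        []Event
	SkippedCount  int
	ReturnedCount int
}

// ResolveResume turns resume_from into an after_sequence value, rejecting the
// combinations the guardrails forbid.
func ResolveResume(req ReadRequest, cursors map[string]Cursor) (uint64, error) {
	if req.AfterSequence != nil || req.Drain || req.ResetConsumer {
		return 0, errors.New("resume_from cannot be mixed with after_sequence, drain, or reset")
	}
	c, ok := cursors[req.ConsumerID]
	if !ok {
		return 0, fmt.Errorf("unknown consumer %q", req.ConsumerID)
	}
	switch req.ResumeFrom {
	case "ack":
		return c.Ack, nil
	case "checkpoint":
		return c.Checkpoint, nil
	default:
		return 0, fmt.Errorf("invalid resume cursor %q", req.ResumeFrom)
	}
}

// ReadAfter returns only events newer than after, without removing anything.
func ReadAfter(events []Event, after uint64) ReadResult {
	var res ReadResult
	for _, e := range events {
		if e.Sequence > after {
			res.Events = append(res.Events, e)
		} else {
			res.SkippedCount++
		}
	}
	res.ReturnedCount = len(res.Events)
	return res
}

func main() {
	events := []Event{{1}, {2}, {3}, {4}, {5}}
	cursors := map[string]Cursor{"worker-a": {Checkpoint: 4, Ack: 2}}
	seq, err := ResolveResume(ReadRequest{ConsumerID: "worker-a", ResumeFrom: "ack"}, cursors)
	if err != nil {
		panic(err)
	}
	res := ReadAfter(events, seq)
	fmt.Println(seq, res.ReturnedCount, res.SkippedCount)
}
```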
C19Y adds resume telemetry to workload status and heartbeat reports. Operators
can now see resume read totals, after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots also expose per-session resume counters. The
live smoke is
`scripts/fabric/c19y-remote-workspace-mailbox-resume-telemetry-smoke.ps1`.
C19Z adds adapter-runtime readiness diagnostics. Sink reports now include
`adapter_runtime_readiness`, a compact probe-only object with ready status,
diagnostic state, session lifecycle, mailbox depth, consumer cursor, resume
cursor, lag, and returned/skipped counts. The live smoke is
`scripts/fabric/c19z-remote-workspace-adapter-readiness-smoke.ps1`.
C19Z1 adds read-only handoff preflight for mailbox consumers. The endpoint
`/mailbox/preflight` accepts `consumer_id` and `resume_from=ack|checkpoint`,
then reports the expected next event window without mailbox reads, drains, acks,
or consumer cursor mutation. The live smoke is
`scripts/fabric/c19z1-remote-workspace-mailbox-preflight-smoke.ps1`.
Includes:
- container/native workload contract
@@ -131,6 +131,43 @@ Data Plane
The backend/control plane must not become a production VPN packet relay.
## Universal Packet Dataplane Principle
The VPN service carries IP packets. It must not classify the product as a web
proxy, an RDP helper, or an HTTP-only accelerator. HTTP, DNS, RDP, SSH, VNC,
messengers, audio calls, file transfer, application sync, and future mobile or
desktop traffic are all just packets flowing through the same tunnel contract.
Implementation rules:
- packet forwarding must not branch on application protocol for correctness
- performance work must optimize the shared packet path, not a specific site or
port
- batching, backpressure, retries, and route failover are dataplane mechanics
and must apply to all traffic
- diagnostics may summarize protocol/ports for operators, but diagnostics must
not decide whether traffic is allowed to flow
- a transient transport error must not permanently downgrade the tunnel to a
per-packet request mode
- the control plane chooses entry, exit, route, lease, and policy; packet flow
should use the fastest available fabric path
The temporary backend HTTP packet relay is a lab compatibility path. The
production target is:
```text
client device
-> selected entry node
-> fabric route / alternate route set
-> selected exit node
-> target private network or Internet gateway
```
When the cluster grows, route choice must consider latency, loss, queue depth,
node health, role eligibility, lease freshness, and regional/network locality.
If a node or link degrades, the fabric should switch to an alternate route
without requiring the client to understand mesh topology.
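As an illustration of the route-choice inputs listed above, a selector might treat node health, role eligibility, and lease freshness as hard filters and fold latency, loss, queue depth, and locality into one cost. The `Route` struct, the weights, and the filter/cost split are all assumptions made for this sketch; the document does not specify a scoring formula.

```go
package main

import "fmt"

// Route is a hypothetical candidate route with the inputs named above.
type Route struct {
	ID           string
	LatencyMs    float64
	LossRatio    float64 // 0.0 to 1.0
	QueueDepth   int
	Healthy      bool
	RoleEligible bool
	LeaseFresh   bool
	SameRegion   bool
}

// cost combines the soft signals; lower is better. The weights are arbitrary.
func cost(r Route) float64 {
	c := r.LatencyMs + 1000*r.LossRatio + 2*float64(r.QueueDepth)
	if !r.SameRegion {
		c += 20
	}
	return c
}

// ChooseRoute skips routes that fail a hard requirement and returns the
// cheapest remaining one. The client never sees this decision.
func ChooseRoute(routes []Route) (Route, bool) {
	var best Route
	found := false
	for _, r := range routes {
		if !r.Healthy || !r.RoleEligible || !r.LeaseFresh {
			continue
		}
		if !found || cost(r) < cost(best) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	routes := []Route{
		{ID: "a", LatencyMs: 10, LossRatio: 0.05, Healthy: true, RoleEligible: true, LeaseFresh: true, SameRegion: true},
		{ID: "b", LatencyMs: 30, QueueDepth: 2, Healthy: true, RoleEligible: true, LeaseFresh: true, SameRegion: true},
		{ID: "c", LatencyMs: 5, Healthy: false, RoleEligible: true, LeaseFresh: true, SameRegion: true},
	}
	best, _ := ChooseRoute(routes)
	fmt.Println("selected:", best.ID)
	routes[1].Healthy = false // route b degrades; the fabric switches to an alternate
	best, _ = ChooseRoute(routes)
	fmt.Println("after degradation:", best.ID)
}
```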
## Control Plane Responsibilities
The control plane owns:
+605 -118
@@ -1,123 +1,610 @@
C17Z20 is complete.
Current product decision:
Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
future services as separate transport implementations. The next implementation
work should focus on the shared Fabric Service Channel runtime described in
`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`.
The immediate engineering target is:
- backend service-channel lease/route-generation contract
- node-agent entry runtime for client/service live connections
- service-neutral channel scheduling, bounded queues, route health, and
failover
- VPN packet flow as the proving service over that common channel
- backend relay only as explicit degraded fallback
Backend service-channel lease/route-generation contract is now started:
- `POST /clusters/{clusterID}/fabric/service-channels/leases` issues
`rap.fabric_service_channel_lease.v1`
- VPN client profiles embed `fabric_service_channel_lease`
- tests cover ready route and degraded backend-relay fallback behavior
- leases include entry HTTP/WebSocket endpoint templates for the selected
service channel
- leases include cluster-authority-signed
`rap.fabric_service_channel_lease_authority.v1` payloads that bind token
hash, selected route, generation, fencing epoch, and expiry
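To make the binding concrete, here is a minimal sketch of signing and verifying such a payload with Ed25519. The JSON field names, the SHA-256 token hash, and the helper functions are assumptions for illustration; the actual `rap.fabric_service_channel_lease_authority.v1` wire format is not shown in this document. Generation and fencing-epoch comparisons need node-local state and are left out.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// LeaseAuthorityPayload is a hypothetical shape for the signed binding.
type LeaseAuthorityPayload struct {
	Schema        string `json:"schema"`
	TokenHash     string `json:"token_hash"`
	RouteID       string `json:"route_id"`
	Generation    uint64 `json:"generation"`
	FencingEpoch  uint64 `json:"fencing_epoch"`
	ExpiresAtUnix int64  `json:"expires_at_unix"`
}

// HashToken assumes a hex-encoded SHA-256 token hash.
func HashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// NewAuthorityKey generates a throwaway key pair for the demo.
func NewAuthorityKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// SignLease serializes the payload and signs the exact bytes.
func SignLease(priv ed25519.PrivateKey, p LeaseAuthorityPayload) ([]byte, []byte, error) {
	payload, err := json.Marshal(p)
	if err != nil {
		return nil, nil, err
	}
	return payload, ed25519.Sign(priv, payload), nil
}

// VerifyLease checks the signature first, then the token binding and expiry.
func VerifyLease(pub ed25519.PublicKey, payload, sig []byte, token string, nowUnix int64) (LeaseAuthorityPayload, error) {
	var p LeaseAuthorityPayload
	if !ed25519.Verify(pub, payload, sig) {
		return p, errors.New("lease authority signature is invalid")
	}
	if err := json.Unmarshal(payload, &p); err != nil {
		return p, err
	}
	if p.TokenHash != HashToken(token) {
		return p, errors.New("token does not match signed token_hash")
	}
	if nowUnix >= p.ExpiresAtUnix {
		return p, errors.New("lease has expired")
	}
	return p, nil
}

func main() {
	pub, priv := NewAuthorityKey()
	payload, sig, err := SignLease(priv, LeaseAuthorityPayload{
		Schema:        "rap.fabric_service_channel_lease_authority.v1",
		TokenHash:     HashToken("rap_fsc_example"),
		RouteID:       "route-1",
		Generation:    7,
		FencingEpoch:  3,
		ExpiresAtUnix: 2000,
	})
	if err != nil {
		panic(err)
	}
	p, err := VerifyLease(pub, payload, sig, "rap_fsc_example", 1500)
	fmt.Println(p.RouteID, p.Generation, err)
}
```

Verifying the signature over the raw bytes before parsing means an unsigned or altered payload is never interpreted.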
Node-agent entry runtime is now started:
- `rap-node-agent` accepts VPN packet batches through
`/api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets`
and `/packets/ws`
- entry runtime requires a `rap_fsc_*` service-channel token and maps packet
batches to the existing production `vpn_packet` fabric route
- route failure falls back to the canonical backend relay endpoint so degraded
compatibility remains explicit
Next narrow runtime layer:
- persist cluster-level default window policy for Fabric diagnostics
investigation breadcrumbs and expose a small admin control for it
- keep this in the shared Fabric Service Channel runtime contract and telemetry
- do not add Android/RDP protocol work in this slice
C17Z20 is complete.
Installation Authority foundation is also complete:
- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- `installation_authority` and signed `platform_role_grants` are persisted
- strict platform-admin checks ignore direct `users.platform_role` edits unless
a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
keys and sign activation manifests; private keys must stay outside the repo
Cluster Authority foundation is now also complete:
- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
`authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
`RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
bootstrap, cluster creation, signed join token, real node-agent enrollment,
owner approval, automatic signed bootstrap polling, authority pin
persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
supervision remains a stub
Node enrollment bootstrap polling is also complete:
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed bootstrap contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
Current state:
- C17Z12 added rendezvous/relay control-plane leases for peers that would
otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
route-health route config only. The synthetic `fabric.route_health` runtime
now probes the selected effective path, including replacement relay paths,
and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
scoring. Drift/unreachable/failure feedback marks the exact selected relay
stale and can trigger replacement; healthy low-latency route-health boosts
alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
synthetic route-health drift/unreachable/failure, it performs a bounded
node-scoped synthetic-config refresh, applies returned replacement route
decisions to route-health config immediately, and reports
`c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
type/route, so `synthetic_route_health` is not overwritten by
`peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
forward RDP/VPN/service payloads, does not start VPN runtime, and does not
implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
`000021_cluster_authority_keys`; the migration now drops/recreates
`cluster_admin_summaries` in up/down paths.
Runtime report:
- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- `artifacts/c18w-service-channel-route-manager-smoke-result.json`
- `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
- `artifacts/c18y-route-intent-lifecycle-smoke-result.json`
- `artifacts/c18z-service-channel-load-smoke-result.json`
- `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`
- `artifacts/c18z2-live-service-channel-soak-smoke-result.json`
- `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
- `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
- `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`
- `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`
- `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
- `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
- `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`
- `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`
- `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`
- `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`
Do not automatically continue into:
- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes
Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness
window proof:
C18Z48 through C18Z109 are complete. Backend is deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.281-c18z109`; migration
`000029_fabric_service_channel_leases` is applied on the shared test database.
Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
All three test nodes run the C18Z92 image and are healthy and current after the
policy update. Node-agent still requires signed service-channel lease authority
when cluster authority is pinned, but if legacy clients cannot send signed lease
headers, it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`. C18Z57 turns that recommendation into a bounded
machine-readable `remediation_command` on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
config and node-agent consumes `prefer_alternate_route` as an explicit
route-manager `applied` decision with source
`service_channel_remediation_command`. C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
route, with no local/backend fallback and no route send failures. C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
ingress: `X-RAP-Traffic-Class` can mark `control`, `interactive`, `reliable`,
`bulk`, or `droppable`; default traffic remains backward-compatible bulk.
Unit tests prove scheduler priority order, and live smoke proves a bulk
128-packet pressure batch plus an interactive packet both move over the
replacement route with separate traffic-class flow stats and no fallback,
drops, or route failures. C18Z63 adds a controlled concurrent runtime proof: a
bulk traffic-class send is held in-flight while an independent interactive
traffic-class packet is sent through the same ingress, and interactive completes
before bulk release with `MaxInFlight >= 2`, no drops, and no failures.
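
The `X-RAP-Traffic-Class` marking from C18Z62 can be illustrated with a minimal sketch. The class names and the bulk default come from the text above. The numeric priorities (which simply follow the order in which the classes are listed), the case-insensitive matching, and the fallback to bulk for unknown values are assumptions, not the node-agent's confirmed behavior.

```python
from typing import Iterable, List, Mapping

# Class names are taken from the text; the numeric priorities are an
# assumption that follows the order in which the text lists the classes.
TRAFFIC_CLASS_PRIORITY = {
    "control": 0,
    "interactive": 1,
    "reliable": 2,
    "bulk": 3,
    "droppable": 4,
}
DEFAULT_TRAFFIC_CLASS = "bulk"


def traffic_class_from_headers(headers: Mapping[str, str]) -> str:
    """Return the traffic class for a request, defaulting to bulk."""
    for name, value in headers.items():
        if name.lower() == "x-rap-traffic-class":
            candidate = value.strip().lower()
            # Assumption: unknown values fall back to the default class.
            if candidate in TRAFFIC_CLASS_PRIORITY:
                return candidate
            return DEFAULT_TRAFFIC_CLASS
    return DEFAULT_TRAFFIC_CLASS


def order_by_priority(classes: Iterable[str]) -> List[str]:
    """Sort class names so that higher-priority classes come first."""
    return sorted(classes, key=TRAFFIC_CLASS_PRIORITY.__getitem__)
```

A request without the header is treated as bulk, which matches the backward-compatible default described above.
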
C18Z64 adds compact runtime telemetry: `rap.fabric_flow_scheduler.v1` snapshots
include `traffic_class_counts`, so backend/admin/diagnostics can show active
flow-channel counts per traffic class without scanning each channel stat. It is
live-proven on `rap-node-agent:0.2.239-c18z64`; latest test-1 snapshot showed
`bulk=32`, `interactive=12`, drops 0. C18Z65/C18Z66 project those counts and
flow pressure fields into backend access telemetry at node, active-channel, and
cluster aggregate levels, and web-admin shows cluster/node/channel `flow QoS`
visibility. Live aggregate API result showed `bulk=32`, `interactive=12`,
`flow_channel_count=44`, `flow_max_in_flight=4`. C18Z67 adds a live HTTP
concurrent QoS proof: six parallel bulk service-channel requests ran while an
interactive traffic-class request was injected on the same entry path after
remediation; the interactive request completed in 132 ms, all 6 bulk requests
were accepted, 3072 post-remediation packets moved over the replacement route,
32 bulk and 12 interactive replacement-route flow stats were observed, and
fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
`adaptive_backpressure_active/reason`, and the ingress send path uses the
traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
reliable is reduced moderately, and control/interactive keep their full window
unless their own class degrades. Live smoke verified `bulk=1`, `droppable=1`,
`reliable=3`, `interactive=4`, `control=4`, no drops, and
`bulk_window_reduced_to_protect_interactive`. C18Z70 projects those adaptive
runtime fields into backend/admin access telemetry at cluster, node, and
active-channel levels. Cluster windows are aggregated by minimum non-zero
per-class recommendation, and web-admin shows adaptive window chips beside flow
health/QoS. Live API artifact shows `adaptive=true`,
`bulk_window_reduced_to_protect_interactive`, and windows `bulk=1`,
`droppable=1`, `reliable=3`, `interactive=4`, `control=4`. C18Z71 adds the
cluster-level adaptive policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy`.
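
The C18Z70 cluster aggregation rule (minimum non-zero per-class recommendation) can be sketched as follows. The function name and the input shape, one dict of per-class windows per node, are hypothetical. Treating zero or negative values as "no recommendation" is an interpretation of "non-zero" in the text.

```python
from typing import Dict, Iterable


def aggregate_cluster_windows(
    node_windows: Iterable[Dict[str, int]],
) -> Dict[str, int]:
    """Combine per-node windows into the minimum non-zero value per class."""
    cluster: Dict[str, int] = {}
    for windows in node_windows:
        for traffic_class, window in windows.items():
            if window <= 0:
                # Zero is read as "no recommendation" and is ignored.
                continue
            current = cluster.get(traffic_class)
            if current is None or window < current:
                cluster[traffic_class] = window
    return cluster
```

With this rule, one node under pressure is enough to lower the cluster-level window for a class, while nodes that report nothing for a class do not pull it to zero.
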
The policy stores audited thresholds and class windows in cluster metadata,
projects the effective fingerprint into signed node-scoped synthetic config,
and node-agent heartbeat/runtime telemetry reports `adaptive_policy_fingerprint`.
The node scheduler consumes the policy at runtime; default policy preserves
bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
proved an operator policy with max window 6 and `bulk=2` changes the live
recommended windows without breaking interactive/control. A signed-config hash
mismatch found during the smoke was fixed by preserving all signed adaptive
policy provenance fields in the node-agent client model. C18Z72 adds the
cluster-level pool/failover policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy`. Lease
issuance now applies the effective entry/exit pool constraints and preferred
entry/exit before route selection, stores the effective policy on the lease,
and signs it into `rap.fabric_service_channel_lease_authority.v1`. Live smoke
proved a policy-constrained lease selects only the policy entry/exit from a
wider requested pool and carries matching signed `pool_policy` provenance.
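
As a rough illustration of the C18Z72 pool constraint, the sketch below narrows a requested entry or exit pool to the policy pool and moves the preferred node to the front before route selection. The function name, the list-of-node-ids shape, and the rule that an empty policy pool means "no constraint" are assumptions.

```python
from typing import List, Optional, Sequence


def constrain_pool(
    requested: Sequence[str],
    policy_pool: Sequence[str],
    preferred: Optional[str] = None,
) -> List[str]:
    """Return the requested nodes allowed by policy, preferred node first."""
    allowed = set(policy_pool)
    # Assumption: an empty policy pool leaves the requested pool unconstrained.
    candidates = [node for node in requested if not allowed or node in allowed]
    if preferred in candidates:
        candidates.remove(preferred)
        candidates.insert(0, preferred)
    return candidates
```

A wider requested pool such as `["a", "b", "c"]` with policy pool `["b"]` therefore yields only `["b"]`, which mirrors the live-smoke outcome described above.
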
C18Z73 projects that signed pool-policy fingerprint into active access
telemetry and guards remediation commands against routes outside the signed
lease pools. C18Z74 correlates active remediation commands with entry-node
route-manager heartbeats and reports execution states such as
`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
`pending_rebuild_request`, and `expired`. C18Z75 records `rebuild_route`
remediation as durable rebuild ledger intent rows when node-scoped synthetic
config is fetched, and access telemetry reports `rebuild_request_recorded` or
`rebuild_request_rejected`. C18Z76 makes the allowed `rebuild_route` command
visible from the node side: node-agent consumes it as a route-manager
`pending_degraded_fallback` decision with source
`service_channel_remediation_command`, and backend access telemetry correlates
that with the durable ledger as `rebuild_request_recorded_node_pending`.
C18Z77 resolves durable remediation rebuild requests inside the shared Control
Plane planner: signed-pool-valid alternates become `applied` /
`replacement_selected` and are projected as route-manager decisions with the
same command id, missing safe alternates become `no_alternate`, lease/policy
blocks become `deferred_by_policy`, and stale commands become `expired`.
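
The four C18Z77 planner outcomes can be sketched as a small decision function. The outcome strings come from the text. The `RebuildRequest` shape, the order of the checks, and modelling the signed pool as a set of route ids are simplifying assumptions, not the Control Plane's actual data model.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass
class RebuildRequest:
    """Hypothetical view of one durable remediation rebuild request."""
    command_expires_at: float
    blocked_by_policy: bool
    alternates: Sequence[str]


def resolve_rebuild(
    request: RebuildRequest,
    signed_pool: Set[str],
    fenced: Set[str],
    now: float,
) -> Tuple[str, Optional[str]]:
    """Return (outcome, replacement_route) for a rebuild request."""
    if now >= request.command_expires_at:
        return "expired", None
    if request.blocked_by_policy:
        return "deferred_by_policy", None
    for route in request.alternates:
        # Only signed-pool-valid, unfenced alternates are safe replacements.
        if route in signed_pool and route not in fenced:
            return "applied", route
    return "no_alternate", None
```
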
C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
`applied` branch live by adding an alternate route after lease issuance and
verifying the existing rebuild command resolves to `rebuild_request_applied`.
C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops. C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
C18Z81 adds the negative/rollback proof: after the initial replacement is
applied and used, a generation-valid fenced feedback report for that
replacement causes the Control Plane to select a new safe recovery route; live
traffic then moves to the recovery route, the degraded replacement is not
reselected, and fallback/failure/drop deltas stay zero for the recovery send.
The C18Z81 work also tightened older smoke checks to use per-run counter deltas
instead of absolute cumulative runtime counters.
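
The per-run delta check can be sketched as follows. The counter names are hypothetical placeholders for whatever the runtime snapshot exposes. The point is only that a smoke run compares a snapshot taken before the run with one taken after it, so counters left over from older runs do not cause false failures.

```python
from typing import Dict, Iterable, Mapping

# Hypothetical counter names; the real snapshot fields may differ.
GUARDED_COUNTERS = ("fallback", "route_failures", "flow_drops")


def counter_deltas(
    before: Mapping[str, int],
    after: Mapping[str, int],
    keys: Iterable[str] = GUARDED_COUNTERS,
) -> Dict[str, int]:
    """Return after-minus-before for each guarded cumulative counter."""
    return {key: after.get(key, 0) - before.get(key, 0) for key in keys}


def run_is_clean(before: Mapping[str, int], after: Mapping[str, int]) -> bool:
    """A run is clean when no guarded counter increased during it."""
    return all(delta == 0 for delta in counter_deltas(before, after).values())
```
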
C18Z82 closes the no-safe-recovery branch: after the replacement route reports
generation-valid fenced feedback and no new safe recovery route is created,
node-scoped synthetic config surfaces `service_channel_feedback_no_alternate`
with `pending_degraded_fallback`, `no_unfenced_alternate_route`, and
`backend_relay_degraded_fallback_until_rebuild`, proving the Control Plane
exposes a degraded/no-alternate state instead of silently sticking to a bad
replacement.
C18Z83 projects those route-manager decisions into active access telemetry and
web-admin: active channels now expose route-decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
The live smoke proves the no-safe state is visible through access telemetry as
`service_channel_feedback_no_alternate` /
`pending_degraded_fallback`, with operator execution state remaining compatible
with durable ledger `rebuild_request_no_alternate`.
C18Z84 aggregates those per-channel decisions at the access-telemetry summary
level: route-decision channel count, replacement decision count, applied
rebuild count, recovery decision count, and no-safe recovery count are exposed
to the API and web-admin summary chips. The no-safe branch now prioritizes the
aggregate status reason `active_channels_no_safe_recovery` over generic missing
access-report noise.
C18Z85 projects access-decision aggregates into rebuild health and incident
diagnostics. Health summary now carries access decision counts and prioritizes
`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe is
active. Rebuild incidents now include `incident_source=access_decision` rows
for active channel decisions such as `access_no_safe_recovery`, with bad
severity and channel id, so operators see route-decision failures beside ledger
incidents.
C18Z86 adds silence/acknowledgement behavior for those
`incident_source=access_decision` incidents. Silence requests now carry
`incident_source` and `channel_id`; access-decision no-safe silences are stored
with a channel-scoped route key, applied back into rebuild health/incidents,
and exact current-generation incidents stop contributing to active bad count.
Generation-changing access-decision resurfacing is unit-tested; the live smoke
proves the operator silence path on docker-test.
C18Z87 exposes active rebuild/access-decision silences to operators and adds
unsilence. The API now lists active rebuild alert silences, returns
access-decision `incident_source`, `channel_id`, and display route id, and
allows deleting a silence by id. Web-admin shows an `Active rebuild silences`
table with an unsilence action. The live smoke proves list -> silence ->
unsilence and verifies the access no-safe incident becomes active again.
C18Z88 makes access-decision resurfacing operator-visible in live runtime.
Access-decision incidents now expose the silence id they resurfaced from, the
previous acknowledged generation, and the silence expiry. The live smoke
proves: access no-safe incident -> silence current generation -> wait for a new
route-decision generation -> incident returns as `alert_resurfaced=true`, active
bad count is restored, and previous generation metadata is preserved.
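
The silence and resurfacing behavior from C18Z86 through C18Z88 can be sketched as a generation comparison. Only `alert_resurfaced`, `channel_id`, and the idea of a stored generation come from the text. The `Silence` shape and the other field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Silence:
    """Hypothetical stored acknowledgement for one channel-scoped incident."""
    silence_id: str
    channel_id: str
    generation: int
    expires_at: float


def apply_silence(incident: dict, silence: Optional[Silence], now: float) -> dict:
    """Annotate an incident with its silenced or resurfaced state."""
    result = dict(incident, silenced=False, alert_resurfaced=False)
    if (
        silence is None
        or silence.channel_id != incident["channel_id"]
        or now >= silence.expires_at
    ):
        return result
    if silence.generation == incident["generation"]:
        # Exact current-generation match: stop counting as active bad.
        result["silenced"] = True
    else:
        # A new route-decision generation resurfaces the incident.
        result["alert_resurfaced"] = True
        result["resurfaced_from_silence_id"] = silence.silence_id
        result["previous_generation"] = silence.generation
    return result
```
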
C18Z89 closes the resurfaced-incident operator action loop for generation
changes. Resurfaced access-decision incidents now expose
`alert_resurfaced_cause`, previous route id, and previous channel id; web-admin
shows the cause beside resurfaced incidents. The live smoke proves the operator
can re-acknowledge the resurfaced generation, the active-channel decision
context matches the incident route/generation, and the current generation
returns to a silenced state.
C18Z90 introduces the explicit signed production data-plane contract on
service-channel leases. `data_plane` is now part of the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
that control-plane traffic uses backend API, working data uses the fabric
service channel over fabric routes, backend relay is degraded fallback only,
production forwarding is required, and logical flows are service-neutral,
protocol-agnostic, and isolated. Web-admin shows this contract in the
service-channel lease table.
C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
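
A minimal sketch of the C18Z92 rule, under stated assumptions: the `backend_relay_policy=disabled` value and the 503 come from the text, while the function name, the 200 success status, and the transport labels in the return value are illustrative only.

```python
from typing import Tuple


def ingress_result(fabric_send_ok: bool, backend_relay_policy: str) -> Tuple[int, str]:
    """Return (http_status, transport_label) for one ingress send attempt."""
    if fabric_send_ok:
        return 200, "fabric_route"
    if backend_relay_policy == "disabled":
        # Fail visibly instead of silently proxying through backend relay.
        return 503, "backend_fallback_blocked"
    # Any other policy is treated here as degraded fallback (assumption).
    return 200, "backend_relay_degraded_fallback"
```

The design choice is that a disabled policy converts a hidden transport downgrade into an explicit failure that telemetry and operators can see.
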
C18Z93 promotes that data-plane contract telemetry into backend access
telemetry and web-admin active-channel diagnostics: cluster, node, and
active-channel rows now show contract adoption count, last working transport,
steady-state transport, backend relay policy, data-plane mode, and logical
flow mode.
C18Z94 turns those data-plane/fallback signals into operator incidents.
`data_plane_contract` incident rows are now emitted for missing data-plane
contract reports after accepted service-channel traffic, wrong working or
steady-state transport, wrong logical flow mode, disabled backend relay
observed, and degraded backend relay usage. The incident list/readiness path
can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
fallback from becoming a dead-end operator alert.
C18Z97 adds bounded deduplication for those access-report-derived route
feedback records. Repeated blocked-fallback send-failure heartbeats no longer
rewrite the same active fenced feedback or churn planner rebuild attempts while
the first access-report feedback is still active. Runtime feedback from the
flow scheduler remains independent.
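
The bounded deduplication in C18Z97 can be sketched as a keyed expiry map. The class name, the key fields, and the TTL parameter are hypothetical. The sketch only shows the idea that a repeated report is ignored while the first feedback for the same key is still active.

```python
from typing import Dict, Tuple


class FeedbackDeduper:
    """Suppress repeated access-report feedback while the first is active."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._active_until: Dict[Tuple[str, str, str], float] = {}

    def should_record(self, channel_id: str, route_id: str, reason: str, now: float) -> bool:
        """Return True only when no active feedback exists for this key."""
        key = (channel_id, route_id, reason)
        expiry = self._active_until.get(key)
        if expiry is not None and now < expiry:
            return False  # still active: do not rewrite or churn the planner
        self._active_until[key] = now + self.ttl_seconds
        return True
```
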
C18Z98 carries that feedback identity into the replacement decision and
rebuild-attempt ledger: decision and ledger rows now expose
`feedback_observation_id`, `feedback_source`, feedback observed/expiry time,
channel/resource ids, and data-plane violation status/reason. Web-admin shows
that correlation in Route decisions and Rebuild ledger.
C18Z99 adds rebuild ledger filters for those correlation fields. The backend
`/fabric/service-channels/rebuild-attempts` API accepts `feedback_source`,
`feedback_channel_id`, and `feedback_violation_status`, and web-admin exposes
the same filters in the rebuild ledger form. The live smoke proves source,
channel, violation, combined filters, and wrong-channel exclusion.
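
The filter semantics checked by the C18Z99 smoke (single filters, combined filters, and wrong-channel exclusion) can be sketched over in-memory rows. The three filter names come from the text. The row shape and the example values used with it are hypothetical.

```python
from typing import Iterable, List, Mapping, Optional


def filter_ledger(
    rows: Iterable[Mapping[str, str]],
    feedback_source: Optional[str] = None,
    feedback_channel_id: Optional[str] = None,
    feedback_violation_status: Optional[str] = None,
) -> List[Mapping[str, str]]:
    """Return rows matching every filter that was actually supplied."""
    wanted = {
        "feedback_source": feedback_source,
        "feedback_channel_id": feedback_channel_id,
        "feedback_violation_status": feedback_violation_status,
    }
    active = {key: value for key, value in wanted.items() if value is not None}
    return [
        row for row in rows
        if all(row.get(key) == value for key, value in active.items())
    ]
```
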
C18Z100 adds rebuild-health feedback breakdown aggregation for the same
correlation fields. The backend rebuild-health summary now returns
`feedback_breakdowns` grouped by feedback source, feedback channel id, and
feedback violation status, including total/good/warn/bad/unknown counts,
active warn/bad counts, silenced count, latest observation time, and affected
reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
panel so operators can see which access-report-derived failure classes dominate
active warn/bad rebuild state.
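
A rough sketch of the C18Z100 grouping, assuming each ledger row carries the three correlation fields plus a hypothetical `status` and `silenced` flag. It reproduces only the count part of `feedback_breakdowns`, not the latest-observation time or the reporter/route lists.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

KNOWN_STATUSES = ("good", "warn", "bad")


def feedback_breakdowns(rows: Iterable[Mapping[str, object]]) -> Dict[Tuple[str, str, str], Dict[str, int]]:
    """Count rows per (source, channel id, violation status) group."""
    groups: Dict[Tuple[str, str, str], Dict[str, int]] = defaultdict(
        lambda: {"total": 0, "good": 0, "warn": 0, "bad": 0, "unknown": 0, "silenced": 0}
    )
    for row in rows:
        key = (
            str(row["feedback_source"]),
            str(row["feedback_channel_id"]),
            str(row["feedback_violation_status"]),
        )
        group = groups[key]
        group["total"] += 1
        status = row.get("status")
        group[status if status in KNOWN_STATUSES else "unknown"] += 1
        if row.get("silenced"):
            group["silenced"] += 1
    return dict(groups)
```
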
C18Z101 wires that breakdown into operator workflow in web-admin. Each
feedback-breakdown row now shows related incident context by channel/reporter/
route overlap and has an `open ledger` action that switches to the deep rebuild
ledger with `feedback_source`, `feedback_channel_id`, and
`feedback_violation_status` prefilled from the breakdown row.
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
investigation endpoint now accepts feedback source/channel/violation drilldown
payloads, records
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
cluster audit events, and web-admin records one before opening the filtered
deep ledger from a rebuild-health feedback breakdown row.
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
Web-admin now filters the loaded cluster audit list for rebuild incident and
feedback-breakdown investigation events and shows recent drilldowns with time,
source, feedback filters, target reporter/route, actor, and reason beside
rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now
accepts `event_type` and `target_type` filters, including repeated or
comma-separated values, and web-admin loads recent fabric investigation
breadcrumbs with a dedicated filtered request instead of depending on the
generic latest-100 cluster audit list.
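
Accepting both repeated and comma-separated filter values, as C18Z104 describes for `event_type` and `target_type`, can be sketched with the standard library. The function name is hypothetical, and dropping duplicates and empty parts is an assumption about how the backend normalizes the values.

```python
from typing import List
from urllib.parse import parse_qs


def parse_multi_filter(query: str, name: str) -> List[str]:
    """Collect filter values given as repeated and/or comma-separated params."""
    values: List[str] = []
    for raw in parse_qs(query).get(name, []):
        for part in raw.split(","):
            part = part.strip()
            # Assumption: empty parts and duplicates are ignored.
            if part and part not in values:
                values.append(part)
    return values
```
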
C18Z105 correlates those focused audit breadcrumbs back to currently visible
diagnostics in web-admin. Recent investigation rows now show whether the
breadcrumb still matches an active rebuild-health feedback breakdown or visible
rebuild incident, and provide an `open` action to jump back into the matching
filtered ledger path.
C18Z106 moves that correlation into the backend/API. `GET /audit` with
`correlation=fabric_diagnostics` now returns `correlation_hints` for focused
fabric investigation breadcrumbs, including current diagnostic status
(`breakdown_active`, `incident_visible`, or `not_visible`) and the matching
breakdown/incident object when present. Web-admin consumes those hints and keeps
its previous local matching as fallback. During verification the noisy test
history exposed that rebuild-health feedback breakdowns were capped too tightly;
the backend now returns up to 100 breakdown groups so fresh failure classes are
not pushed out by older smoke history.
C18Z107 adds a compact backend-provided `audit_summary` beside `audit_events`.
For focused Fabric diagnostics audit reads, the summary includes total count,
counts by event/target type, counts by current diagnostic status, counts by
feedback source/violation status, correlated count, not-visible count, and
latest time. Web-admin shows these as Recent investigations chips and short
source/violation lines without recalculating the aggregate in the browser.
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
cluster audit read path. Backend now exposes
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
with a dedicated `rebuild_investigation_breadcrumbs` contract containing
events plus summary. Web-admin uses this endpoint for Recent investigations
and keeps generic audit semantics separate from Fabric diagnostics workflow
state.
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
endpoint accepts `current_window_seconds` and `history_window_seconds`, annotates
each breadcrumb with `correlation_hints.breadcrumb_status` (`current`, `stale`,
or `expired`) plus age/window seconds, returns current/stale/expired totals, and
adds `counts_by_breadcrumb_status` to the summary. Web-admin shows freshness
chips and an age column in Recent investigations, so operators can separate live
workflow hints from stale history without deleting audit records.
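
The freshness classification can be sketched as two thresholds. The status names and the two window parameters come from the text. The inclusive boundaries and the requirement that the history window is at least as long as the current window are assumptions.

```python
from typing import Dict, Iterable


def breadcrumb_status(
    age_seconds: float,
    current_window_seconds: float,
    history_window_seconds: float,
) -> str:
    """Classify a breadcrumb as current, stale, or expired by its age."""
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def counts_by_breadcrumb_status(
    ages: Iterable[float],
    current_window_seconds: float,
    history_window_seconds: float,
) -> Dict[str, int]:
    """Summarize how many breadcrumbs fall into each freshness bucket."""
    counts = {"current": 0, "stale": 0, "expired": 0}
    for age in ages:
        counts[breadcrumb_status(age, current_window_seconds, history_window_seconds)] += 1
    return counts
```

Because the status is derived from age at read time, nothing in the audit store has to be rewritten or deleted as breadcrumbs age.
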
Live verification passed:
- `scripts/fabric/c18z48-service-channel-introspection-smoke.ps1`
- `scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1`
- `scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1`
- `scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1`
- `scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1`
- `scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1`
- `scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1`
- `scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1`
- `scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1`
- `scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1`
- `scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1`
- `scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1`
- `scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1`
- `scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1`
- `scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1`
- `scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1`
- `scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1`
- `scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1`
Artifacts:
- `artifacts/c18z48-service-channel-introspection-smoke-result.json`
- `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`
- `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`
- `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`
- `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`
- `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`
- `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`
- `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`
- `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`
- `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`
- `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`
- `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`
- `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`
- `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`
- `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`
- `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`
- `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`
- `artifacts/c18z68-service-channel-flow-health-api-result.json`
- `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`
- `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`
- `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`
- `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`
- `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`
- `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`
- `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`
- `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`
- `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`
- `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`
- `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`
- `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`
- `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`
- `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`
- `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`
- `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`
- `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`
- `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`
- `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`
- `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`
- `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`
- `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`
- `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`
- `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`
- `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`
- `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`
- `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`
- `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`
- `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`
- `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`
- `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`
- `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`
- `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`
- `artifacts/c18z104-focused-fabric-audit-smoke-result.json`
- `artifacts/c18z106-audit-correlation-hints-smoke-result.json`
- `artifacts/c18z107-audit-correlation-summary-smoke-result.json`
- `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`
- `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`
Cluster Authority foundation is now also complete:
Current active continuation after C19Z1:
- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
`authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
`RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
bootstrap, cluster creation, signed join token, real node-agent enrollment,
owner approval, automatic signed bootstrap polling, authority pin
persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is disabled by
default (`RAP_WORKLOAD_SUPERVISION_ENABLED=false`) while service runtime
supervision remains a stub
C19Z1 is implemented and runtime-smoke-proven. Remote Workspace adapter sessions
now expose read-only mailbox handoff preflight:
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight?consumer_id=...&resume_from=ack|checkpoint`.
The response validates the consumer cursor and reports the expected next event
window (`after_sequence`, available/returned/skipped counts, first/last expected
sequence) without reading, draining, acking, or mutating consumer state.
Node-agent image `rap-node-agent:codex-service-supervisor-20260512z2` is
deployed on `test-1/2/3`. Verification artifacts:
`artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json`, C19X
source
`artifacts/c19z1-remote-workspace-mailbox-preflight-source-result.json`, and
C19Z regression
`artifacts/c19z-remote-workspace-adapter-readiness-smoke-result.json`.
Node enrollment bootstrap polling is also complete:
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed bootstrap contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
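
The bootstrap polling described above can be sketched as a bounded loop. The two environment variable names come from the list. The default values, the `"pending"` status string, and the `fetch_status` callable are hypothetical, and the sketch omits the signed-bootstrap verification step.

```python
import os
import time
from typing import Callable, Mapping, Tuple


def poll_settings(env: Mapping[str, str] = os.environ) -> Tuple[float, float]:
    """Read interval and timeout; the defaults here are placeholders."""
    interval = float(env.get("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", "5"))
    timeout = float(env.get("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", "300"))
    return interval, timeout


def poll_enrollment(
    fetch_status: Callable[[], str],
    interval_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the status leaves 'pending' or the timeout is reached."""
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status != "pending":
            return status
        if clock() + interval_seconds > deadline:
            return "timeout"
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.
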
Current state:
- C17Z12 added rendezvous/relay control-plane leases for peers that would
otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
route-health route config only. The synthetic `fabric.route_health` runtime
now probes the selected effective path, including replacement relay paths,
and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
scoring. Drift/unreachable/failure feedback marks the exact selected relay
stale and can trigger replacement; healthy low-latency route-health boosts
alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
synthetic route-health drift/unreachable/failure, it performs a bounded
node-scoped synthetic-config refresh, applies returned replacement route
decisions to route-health config immediately, and reports
`c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
type/route, so `synthetic_route_health` is not overwritten by
`peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
forward RDP/VPN/service payloads, does not start VPN runtime, and does not
implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
`000021_cluster_authority_keys`; the migration now drops/recreates
`cluster_admin_summaries` in up/down paths.
Runtime report:
- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- Docker-test smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:5174/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`
Do not automatically continue into:
- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes
Next narrow layer, if approved:
C17Z21 should tighten route-health feedback refresh dampening: if an immediate
feedback refresh returns the same config version or no replacement change, keep
a per-route/relay no-change cooldown before retrying. Keep the boundary
synthetic/control-plane only and keep RDP/VPN/service payload forwarding
untouched.
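
One possible shape for the proposed C17Z21 dampening is sketched below. Nothing here is implemented behavior: the class, its method names, and the cooldown parameter are all hypothetical, and the sketch stays within the synthetic/control-plane boundary by only deciding whether a refresh may be retried.

```python
from typing import Dict, Tuple


class NoChangeCooldown:
    """Per-route/relay cooldown after a refresh that changed nothing."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._blocked_until: Dict[Tuple[str, str], float] = {}

    def may_refresh(self, route_id: str, relay_id: str, now: float) -> bool:
        """Return True when no cooldown is active for this route/relay."""
        return now >= self._blocked_until.get((route_id, relay_id), 0.0)

    def record_result(
        self,
        route_id: str,
        relay_id: str,
        old_version: str,
        new_version: str,
        replacement_changed: bool,
        now: float,
    ) -> None:
        """Start a cooldown on no-change results, clear it otherwise."""
        key = (route_id, relay_id)
        if new_version == old_version or not replacement_changed:
            self._blocked_until[key] = now + self.cooldown_seconds
        else:
            self._blocked_until.pop(key, None)
```
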
Next narrow Remote Workspace layer should stay probe-only and node-local. A good
C19Z2 candidate is handoff preflight telemetry: add counters/last-preflight
fields for the read-only preflight endpoint in workload status/heartbeat reports,
so operators can distinguish handoff checks from mailbox reads. Do not add
desktop frame transport, Android work, backend relay semantics, or production
adapter payload forwarding in this slice.
@@ -0,0 +1,79 @@
# VPN baseline 0.2.87
Date: 2026-05-05
This document freezes the current near-working VPN state. Treat it as the
rollback and comparison point before changing the Android VPN dataplane,
gateway assignment, mesh route intents, or packet relay behavior.
## Baseline components
- Android client: `0.2.87` / version code `87`
- APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-latest-release.apk`
- Known-good APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-known-good-0.2.87.apk`
- Versioned APK path: `web-admin/deploy/html/downloads/releases/0.2.87/rap-android-rdp-vpn-0.2.87-release.apk`
- APK sha256: `bc44304658df7cd0ad627660c9e7b37af68910cdb13b310314ab99a049ff3b8d`
- APK size: `1187103`
- Backend image: `rap-backend:vpn-dataplane-contract-0.2.86`
- Node/host agents: `0.2.86`
- Cluster: `cfc0743d-d960-49fb-9de8-96e063d5e4aa`
- VPN connection: `7cc94b0d-9cc2-4492-956a-cb0913b887e2` (`home-full-tunnel`)
- Entry node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
- Exit node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
- DNS from exit side: `192.168.200.210`
- Client tunnel: full tunnel, `0.0.0.0/0`, VPN address `10.77.0.2/24`
- Active gateway lease: home-1, generation `8`
- Active relay transport: `backend_http_packet_relay`
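
Before treating an APK as the rollback artifact, it may help to confirm that it matches the recorded digest and size. The sketch below is one way to do that; the `verify_artifact` helper is introduced here for illustration, and it assumes the recorded size is in bytes, which the baseline does not state.

```python
import hashlib
from pathlib import Path
from typing import List

# Values copied from the baseline list above; the size is assumed to be bytes.
EXPECTED_SHA256 = "bc44304658df7cd0ad627660c9e7b37af68910cdb13b310314ab99a049ff3b8d"
EXPECTED_SIZE = 1187103


def verify_artifact(path: str, expected_sha256: str, expected_size: int) -> List[str]:
    """Return a list of mismatches; an empty list means the file matches."""
    target = Path(path)
    if not target.is_file():
        return [f"missing file: {path}"]

    problems: List[str] = []
    size = target.stat().st_size
    if size != expected_size:
        problems.append(f"size {size} != expected {expected_size}")

    digest = hashlib.sha256()
    with target.open("rb") as handle:
        # Read in 1 MiB chunks so large artifacts are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_sha256.lower():
        problems.append(f"sha256 {digest.hexdigest()} != expected {expected_sha256}")
    return problems
```

For example, `verify_artifact("web-admin/deploy/html/downloads/rap-android-rdp-vpn-known-good-0.2.87.apk", EXPECTED_SHA256, EXPECTED_SIZE)` would be expected to return an empty list when the known-good copy is intact.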
## Current working behavior
- General web traffic passes through the VPN.
- External sites open through the configured home exit.
- Telegram can connect, but initial connection may be delayed.
- RDP can connect through the tunnel, but long-lived sessions can still drop.
- Speed is the best observed so far, but speed-test pages can be slow to load
their plugin/script parts.
## Observed diagnostics
Latest phone diagnostics for device `37574bd4-b944-440f-bbd5-87f2980d22c4`
reported Android app version `0.2.87`.
Packet relay counters showed both directions are active:
- `client_to_gateway`: no queue drops observed, queue depth returned to `0`
- `gateway_to_client`: queue depth was observed at `48-55`
- `gateway_to_client`: `246` dropped packets were observed
- Android side recorded downlink traffic, uplink traffic, and several uplink
sender errors
- Android source validation dropped packets whose source was not the VPN
address; keep this guard enabled
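
The source-validation guard mentioned in the last bullet can be illustrated with a few lines that read the IPv4 source field and compare it with the VPN address. This is a language-neutral sketch in Python, not the Android client code; the function names are hypothetical, and it assumes IPv4-only uplink traffic, so anything that is not a well-formed IPv4 packet is rejected.

```python
import ipaddress
from typing import Optional

# VPN address from the baseline (10.77.0.2/24); only the host address is compared.
VPN_ADDRESS = ipaddress.IPv4Address("10.77.0.2")


def ipv4_source(packet: bytes) -> Optional[ipaddress.IPv4Address]:
    """Return the source address of an IPv4 packet, or None if it is not IPv4."""
    # An IPv4 header is at least 20 bytes and carries version 4 in the high nibble.
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    # Bytes 12..15 of the header hold the source address.
    return ipaddress.IPv4Address(packet[12:16])


def accept_uplink(packet: bytes, vpn_address: ipaddress.IPv4Address = VPN_ADDRESS) -> bool:
    """Accept a client-to-gateway packet only when its source is the VPN address."""
    return ipv4_source(packet) == vpn_address
```

A packet with any other source would be dropped before it reaches the relay, which matches the drops seen in the diagnostics.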
Interpretation: the active path is real and carries traffic, but downlink
backpressure or Android TUN drain stalls can still interrupt long-lived TCP
flows. This explains delayed Telegram startup, speed-test plugin loading
delays, and RDP sessions that connect and later drop.
## Guardrails
- Do not reduce Android `TUN_WRITE_MAX_RETRIES` below `1000` without a
controlled regression test.
- Do not relax Android VPN source-address validation.
- Do not re-enable the home-1 `vpn_packets` fabric mesh route intent for this
connection until the Android client can intentionally use the fabric entry
path. The current working baseline relies on `backend_http_packet_relay`.
- Do not change the active entry/exit away from home-1 without saving packet
counters before and after.
- Do not change DNS away from `192.168.200.210` without checking full-tunnel
DNS and direct-IP traffic separately.
- Keep the 0.2.87 APK available as a known-good rollback artifact.
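
The guardrails above could also be captured as a small pre-change check that flags any proposed setting touching them. The sketch below is illustrative only: `VpnSettings` and its field names are hypothetical and do not correspond to a known config schema in this project. A flag means "stop and follow the guardrail", not necessarily "forbidden".

```python
from dataclasses import dataclass
from typing import List

MIN_TUN_WRITE_RETRIES = 1000  # lower bound from the guardrails above


@dataclass(frozen=True)
class VpnSettings:
    """Hypothetical summary of the settings the guardrails talk about."""

    tun_write_max_retries: int
    source_validation_enabled: bool
    relay_transport: str
    entry_node: str
    exit_node: str
    dns: str


def guardrail_flags(candidate: VpnSettings) -> List[str]:
    """List every guardrail a proposed change would touch."""
    flags: List[str] = []
    if candidate.tun_write_max_retries < MIN_TUN_WRITE_RETRIES:
        flags.append("TUN_WRITE_MAX_RETRIES below 1000: needs a controlled regression test")
    if not candidate.source_validation_enabled:
        flags.append("source-address validation must stay enabled")
    if candidate.relay_transport != "backend_http_packet_relay":
        flags.append("baseline relies on backend_http_packet_relay")
    if (candidate.entry_node, candidate.exit_node) != ("home-1", "home-1"):
        flags.append("entry/exit moved from home-1: save packet counters before and after")
    if candidate.dns != "192.168.200.210":
        flags.append("DNS changed: check full-tunnel DNS and direct-IP traffic separately")
    return flags
```

An empty result means the proposed settings stay inside the 0.2.87 baseline.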
## Next safe work
1. Stabilize `gateway_to_client` downlink queue draining and Android TUN write
backpressure.
2. Add clearer per-flow counters for long-lived TCP flows such as RDP.
3. Add a small repeatable smoke test: DNS, direct IP HTTP, 2ip.ru, Telegram-like
long connection, and RDP port reachability.
4. Only after this baseline is stable, move Android entry traffic from backend
relay to fabric mesh.
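
The repeatable smoke test from step 3 could start as a tiny runner that executes named checks and records pass/fail for each. This is a sketch under assumptions: `run_smoke` and `tcp_reachable` are names introduced here, and the individual checks (DNS, direct-IP HTTP, 2ip.ru, long-lived connection, RDP port) are left for the caller to supply because their targets are not specified above.

```python
import socket
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[], bool]]


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def run_smoke(checks: List[Check]) -> Dict[str, str]:
    """Run every check in order and record pass, fail, or the error raised."""
    results: Dict[str, str] = {}
    for name, check in checks:
        try:
            results[name] = "pass" if check() else "fail"
        except Exception as exc:  # a crashing check must not hide the others
            results[name] = f"error: {exc}"
    return results
```

An RDP reachability check, for instance, could be registered as `("rdp-port", lambda: tcp_reachable(rdp_host, 3389))`, where `rdp_host` is whatever target the operator chooses. Saving the resulting dictionary alongside the packet counters would give each run a comparable record.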