rdp-proxy/docs/codex/NEXT_STEP_PROMPT.md
2026-05-14 23:30:34 +03:00


Current product decision:
Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
future services as separate transport implementations. The next implementation
work should focus on the shared Fabric Service Channel runtime described in
`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`.

The immediate engineering target is:
- backend service-channel lease/route-generation contract
- node-agent entry runtime for client/service live connections
- service-neutral channel scheduling, bounded queues, route health, and
failover
- VPN packet flow as the proving service over that common channel
- backend relay only as explicit degraded fallback

Backend service-channel lease/route-generation contract is now started:
- `POST /clusters/{clusterID}/fabric/service-channels/leases` issues
`rap.fabric_service_channel_lease.v1`
- VPN client profiles embed `fabric_service_channel_lease`
- tests cover ready route and degraded backend-relay fallback behavior
- leases include entry HTTP/WebSocket endpoint templates for the selected
service channel
- leases include cluster-authority-signed
`rap.fabric_service_channel_lease_authority.v1` payloads that bind token
hash, selected route, generation, fencing epoch, and expiry
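
As an illustration of what the authority payload binds, here is a minimal sketch. The field names, the use of SHA-256 for the token hash, and the canonical JSON encoding are assumptions made for the example; the real contract is whatever `rap.fabric_service_channel_lease_authority.v1` defines, and the Ed25519 signature step is left out.

```python
import hashlib
import json


def token_hash(token: str) -> str:
    # Assumption: the token hash is a hex SHA-256 digest of the raw token.
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def build_authority_payload(token, route_id, generation, fencing_epoch, expires_at):
    # Hypothetical field names; only the bound values come from the notes above.
    return {
        "schema": "rap.fabric_service_channel_lease_authority.v1",
        "token_hash": token_hash(token),
        "selected_route_id": route_id,
        "generation": generation,
        "fencing_epoch": fencing_epoch,
        "expires_at": expires_at,
    }


def canonical_bytes(payload):
    # Stable byte form that a signer and a verifier could both reproduce.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def payload_matches(payload, token, route_id, now):
    # A verifier would also check the signature over canonical_bytes(payload).
    return (
        payload["token_hash"] == token_hash(token)
        and payload["selected_route_id"] == route_id
        and now < payload["expires_at"]
    )
```

Because the payload carries only the token hash, a verifier can confirm that a presented token belongs to the lease without the signed material ever containing the token itself.
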

Node-agent entry runtime is now started:
- `rap-node-agent` accepts VPN packet batches through
`/api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets`
and `/packets/ws`
- entry runtime requires a `rap_fsc_*` service-channel token and maps packet
batches to the existing production `vpn_packet` fabric route
- route failure falls back to the canonical backend relay endpoint so degraded
compatibility remains explicit

Next narrow runtime layer:
- persist cluster-level default window policy for Fabric diagnostics
investigation breadcrumbs and expose a small admin control for it
- keep this in the shared Fabric Service Channel runtime contract and telemetry
- do not add Android/RDP protocol work in this slice

C17Z20 is complete.

Installation Authority foundation is also complete:
- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- `installation_authority` and signed `platform_role_grants` are persisted
- strict platform-admin checks ignore direct `users.platform_role` edits unless
a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
keys and sign activation manifests; private keys must stay outside the repo

Cluster Authority foundation is now also complete:
- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
`authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
`RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
bootstrap, cluster creation, signed join token, real node-agent enrollment,
owner approval, automatic signed bootstrap polling, authority pin
persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
supervision remains a stub

Node enrollment bootstrap polling is also complete:
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed bootstrap contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
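
The polling behaviour can be pictured with a small sketch. The two environment variable names come from the list above; the default values, the status strings, and the `fetch_status` callback are hypothetical stand-ins for the real bootstrap request and signed-contract verification.

```python
import os
import time


def poll_settings(env=os.environ):
    # Defaults here are placeholders, not the agent's real defaults.
    interval = float(env.get("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", "5"))
    timeout = float(env.get("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", "300"))
    return interval, timeout


def poll_enrollment(fetch_status, interval_s, timeout_s,
                    sleep=time.sleep, clock=time.monotonic):
    # fetch_status is a hypothetical callable wrapping the bootstrap endpoint.
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in ("approved", "rejected"):
            return status
        if clock() + interval_s > deadline:
            return "timeout"
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.
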

Current state:
- C17Z12 added rendezvous/relay control-plane leases for peers that would
otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
route-health route config only. The synthetic `fabric.route_health` runtime
now probes the selected effective path, including replacement relay paths,
and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
scoring. Drift/unreachable/failure feedback marks the exact selected relay
stale and can trigger replacement; healthy low-latency route-health boosts
alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
synthetic route-health drift/unreachable/failure, it performs a bounded
node-scoped synthetic-config refresh, applies returned replacement route
decisions to route-health config immediately, and reports
`c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
type/route, so `synthetic_route_health` is not overwritten by
`peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
forward RDP/VPN/service payloads, does not start VPN runtime, and does not
implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
`000021_cluster_authority_keys`; the migration now drops/recreates
`cluster_admin_summaries` in up/down paths.
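
The `mesh_latest_links` change noted above amounts to keying the latest observation by route and observation type rather than by route alone. A minimal in-memory sketch, with hypothetical field names (the real table is a database view or table, not a dict):

```python
def record_observation(latest, observation):
    # Keyed by (route, observation type) so that different reporters
    # do not overwrite each other's latest row.
    key = (observation["route_id"], observation["observation_type"])
    current = latest.get(key)
    if current is None or observation["observed_at"] >= current["observed_at"]:
        latest[key] = observation
    return latest
```

With this keying, a later `peer_connection_manager` row for a route leaves the `synthetic_route_health` row for the same route in place.
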

Runtime report:
- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- `artifacts/c18w-service-channel-route-manager-smoke-result.json`
- `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
- `artifacts/c18y-route-intent-lifecycle-smoke-result.json`
- `artifacts/c18z-service-channel-load-smoke-result.json`
- `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`
- `artifacts/c18z2-live-service-channel-soak-smoke-result.json`
- `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
- `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
- `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`
- `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`
- `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
- `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
- `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`
- `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`
- `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`
- `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:
- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness
window proof:
C18Z48 through C18Z109 are complete. Backend is deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.281-c18z109`; migration
`000029_fabric_service_channel_leases` is applied on the shared test database.
Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
All three test nodes run the C18Z92 image and are healthy and current after the policy
update. Node-agent still requires signed service-channel lease authority when
cluster authority is pinned, but if legacy clients cannot send signed lease
headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
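
The ingress acceptance paths described above can be sketched as a small decision function. The three `accepted_by` values come from the notes; the input shape (`signed_state`, `introspect`) and the handling of an invalid signature are assumptions for illustration.

```python
def accepted_by(authority_pinned, signed_state, introspect):
    """Return the accepted_by label, or None when the request is refused.

    signed_state is one of "valid", "invalid", "absent" (hypothetical inputs);
    introspect is a callable standing in for the backend introspection call.
    """
    if signed_state == "valid":
        return "signed"
    if signed_state == "invalid":
        # Assumption: a present but invalid signature is never downgraded.
        return None
    if authority_pinned:
        # Legacy client without signed headers: ask the backend first.
        return "introspection" if introspect() else None
    return "legacy_unsigned"
```

The returned label is what would appear in structured logs and in `X-RAP-Service-Channel-Accepted-By`.
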
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`.
C18Z57 turns that recommendation into a bounded
machine-readable `remediation_command` on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime.
C18Z58 projects those commands into node-scoped synthetic mesh
config and node-agent consumes `prefer_alternate_route` as an explicit
route-manager `applied` decision with source
`service_channel_remediation_command`.
C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
route, with no local/backend fallback and no route send failures.
C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
route failures.
C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
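
For the C18Z57 `remediation_command`, the TTL cap can be sketched as follows. Only the listed contents (primary route, replacement route, issued time, TTL capped to the lease lifetime) come from the notes; the field names and the numeric time representation are hypothetical.

```python
def build_remediation_command(primary_route_id, replacement_route_id,
                              issued_at, requested_ttl_s, lease_expires_at):
    # The command must never outlive the lease it belongs to.
    remaining = max(0, lease_expires_at - issued_at)
    ttl = min(requested_ttl_s, remaining)
    return {
        "action": "prefer_alternate_route",
        "primary_route_id": primary_route_id,
        "replacement_route_id": replacement_route_id,
        "issued_at": issued_at,
        "ttl_seconds": ttl,
        "expires_at": issued_at + ttl,
    }
```
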
C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
ingress: `X-RAP-Traffic-Class` can mark `control`, `interactive`, `reliable`,
`bulk`, or `droppable`; default traffic remains backward-compatible bulk.
Unit tests prove scheduler priority order, and live smoke proves a bulk
128-packet pressure batch plus an interactive packet both move over the
replacement route with separate traffic-class flow stats and no fallback,
drops, or route failures.
C18Z63 adds a controlled concurrent runtime proof: a
bulk traffic-class send is held in-flight while an independent interactive
traffic-class packet is sent through the same ingress, and interactive completes
before bulk release with `MaxInFlight >= 2`, no drops, and no failures.
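
A minimal sketch of the C18Z62 traffic-class handling. The header name, the five class names, and the bulk default come from the notes; treating the listed order as the scheduler priority order and mapping unknown values to bulk are assumptions.

```python
# Assumed priority order, highest first.
TRAFFIC_CLASSES = ("control", "interactive", "reliable", "bulk", "droppable")
DEFAULT_CLASS = "bulk"


def traffic_class_from_headers(headers):
    # headers is a plain dict here; a real HTTP stack would match the
    # header name case-insensitively.
    value = headers.get("X-RAP-Traffic-Class", "").strip().lower()
    return value if value in TRAFFIC_CLASSES else DEFAULT_CLASS


def schedule(batches):
    # Stable sort: batches of the same class keep their arrival order.
    return sorted(batches, key=lambda b: TRAFFIC_CLASSES.index(b["traffic_class"]))
```
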
C18Z64 adds compact runtime telemetry: `rap.fabric_flow_scheduler.v1` snapshots
include `traffic_class_counts`, so backend/admin/diagnostics can show active
flow-channel counts per traffic class without scanning each channel stat. It is
live-proven on `rap-node-agent:0.2.239-c18z64`; latest test-1 snapshot showed
`bulk=32`, `interactive=12`, drops 0.
C18Z65/C18Z66 project those counts and
flow pressure fields into backend access telemetry at node, active-channel, and
cluster aggregate levels, and web-admin shows cluster/node/channel `flow QoS`
visibility. Live aggregate API result showed `bulk=32`, `interactive=12`,
`flow_channel_count=44`, `flow_max_in_flight=4`.
C18Z67 adds a live HTTP
concurrent QoS proof: six parallel bulk service-channel requests ran while an
interactive traffic-class request was injected on the same entry path after
remediation; the interactive request completed in 132 ms, all 6 bulk requests
were accepted, 3072 post-remediation packets moved over the replacement route,
32 bulk and 12 interactive replacement-route flow stats were observed, and
fallback, route failures, flow drops, and scheduler drops stayed at 0.
C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
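
The compact `traffic_class_counts` telemetry from C18Z64 can be illustrated by counting channel stats once at snapshot time. The per-channel stat shape is hypothetical, and placing `flow_channel_count` in the same structure is an assumption (the notes report it in the backend aggregate).

```python
from collections import Counter


def scheduler_snapshot(channel_stats):
    # One pass over the per-channel stats; consumers then read the counts
    # without scanning each channel themselves.
    counts = Counter(stat["traffic_class"] for stat in channel_stats)
    return {
        "schema": "rap.fabric_flow_scheduler.v1",
        "traffic_class_counts": dict(counts),
        "flow_channel_count": sum(counts.values()),
    }
```

Feeding it 32 bulk and 12 interactive channel stats reproduces the reported `bulk=32`, `interactive=12`, `flow_channel_count=44`.
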
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
`adaptive_backpressure_active/reason`, and the ingress send path uses the
traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
reliable is reduced moderately, and control/interactive keep their full window
unless their own class degrades. Live smoke verified `bulk=1`, `droppable=1`,
`reliable=3`, `interactive=4`, `control=4`, no drops, and
`bulk_window_reduced_to_protect_interactive`.
C18Z70 projects those adaptive
runtime fields into backend/admin access telemetry at cluster, node, and
active-channel levels. Cluster windows are aggregated by minimum non-zero
per-class recommendation, and web-admin shows adaptive window chips beside flow
health/QoS. Live API artifact shows `adaptive=true`,
`bulk_window_reduced_to_protect_interactive`, and windows `bulk=1`,
`droppable=1`, `reliable=3`, `interactive=4`, `control=4`.
C18Z71 adds the
cluster-level adaptive policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy`.
The policy stores audited thresholds and class windows in cluster metadata,
projects the effective fingerprint into signed node-scoped synthetic config,
and node-agent heartbeat/runtime telemetry reports `adaptive_policy_fingerprint`.
The node scheduler consumes the policy at runtime; default policy preserves
bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
proved an operator policy with max window 6 and `bulk=2` changes the live
recommended windows without breaking interactive/control. A signed-config hash
mismatch found during the smoke was fixed by preserving all signed adaptive
policy provenance fields in the node-agent client model.
C18Z72 adds the
cluster-level pool/failover policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy`. Lease
issuance now applies the effective entry/exit pool constraints and preferred
entry/exit before route selection, stores the effective policy on the lease,
and signs it into `rap.fabric_service_channel_lease_authority.v1`. Live smoke
proved a policy-constrained lease selects only the policy entry/exit from a
wider requested pool and carries matching signed `pool_policy` provenance.
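
The C18Z70 cluster aggregation rule (minimum non-zero per-class recommendation) can be written as a short sketch; the input shape, one dict of class windows per node, is an assumption.

```python
def aggregate_cluster_windows(node_windows):
    # For each traffic class keep the smallest recommendation that is > 0;
    # zero or missing values are treated as "no recommendation".
    result = {}
    for windows in node_windows:
        for traffic_class, window in windows.items():
            if window <= 0:
                continue
            result[traffic_class] = min(result.get(traffic_class, window), window)
    return result
```
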
C18Z73 projects that signed pool-policy fingerprint into active access
telemetry and guards remediation commands against routes outside the signed
lease pools.
C18Z74 correlates active remediation commands with entry-node
route-manager heartbeats and reports execution states such as
`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
`pending_rebuild_request`, and `expired`.
C18Z75 records `rebuild_route`
remediation as durable rebuild ledger intent rows when node-scoped synthetic
config is fetched, and access telemetry reports `rebuild_request_recorded` or
`rebuild_request_rejected`.
C18Z76 makes the allowed `rebuild_route` command
visible from the node side: node-agent consumes it as a route-manager
`pending_degraded_fallback` decision with source
`service_channel_remediation_command`, and backend access telemetry correlates
that with the durable ledger as `rebuild_request_recorded_node_pending`.
C18Z77 resolves durable remediation rebuild requests inside the shared Control
Plane planner: signed-pool-valid alternates become `applied` /
`replacement_selected` and are projected as route-manager decisions with the
same command id, missing safe alternates become `no_alternate`, lease/policy
blocks become `deferred_by_policy`, and stale commands become `expired`.
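
The C18Z77 planner outcomes can be sketched as a single resolution function. The four outcome names come from the notes; the request and route shapes, and the order in which the checks run, are assumptions.

```python
def resolve_rebuild_request(request, candidate_routes, now):
    if now >= request["expires_at"]:
        return {"status": "expired"}
    if request.get("policy_blocked"):
        return {"status": "deferred_by_policy"}
    # Only unfenced routes inside the signed entry/exit pools are safe alternates.
    allowed = [
        r for r in candidate_routes
        if r["route_id"] != request["primary_route_id"]
        and not r["fenced"]
        and r["entry"] in request["entry_pool"]
        and r["exit"] in request["exit_pool"]
    ]
    if not allowed:
        return {"status": "no_alternate"}
    return {
        "status": "applied",
        "reason": "replacement_selected",
        "replacement_route_id": allowed[0]["route_id"],
        "command_id": request["command_id"],  # same command id is carried through
    }
```
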
C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
`applied` branch live by adding an alternate route after lease issuance and
verifying the existing rebuild command resolves to `rebuild_request_applied`.
C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops.
C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
C18Z81 adds the negative/rollback proof: after the initial replacement is
applied and used, a generation-valid fenced feedback report for that
replacement causes the Control Plane to select a new safe recovery route; live
traffic then moves to the recovery route, the degraded replacement is not
reselected, and fallback/failure/drop deltas stay zero for the recovery send.
The C18Z81 work also tightened older smoke checks to use per-run counter deltas
instead of absolute cumulative runtime counters.
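
The per-run delta check is simple to express. A minimal sketch, with hypothetical counter names:

```python
def counter_deltas(before, after):
    # Compare a pre-run snapshot with a post-run snapshot so that history
    # accumulated by earlier smoke runs does not fail the current one.
    return {name: after.get(name, 0) - before.get(name, 0) for name in after}
```

A smoke can then assert that fallback, failure, and drop deltas are zero even when the cumulative counters are not.
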
C18Z82 closes the no-safe-recovery branch: after the replacement route reports
generation-valid fenced feedback and no new safe recovery route is created,
node-scoped synthetic config surfaces `service_channel_feedback_no_alternate`
with `pending_degraded_fallback`, `no_unfenced_alternate_route`, and
`backend_relay_degraded_fallback_until_rebuild`, proving the Control Plane
exposes a degraded/no-alternate state instead of silently sticking to a bad
replacement.
C18Z83 projects those route-manager decisions into active access telemetry and
web-admin: active channels now expose route-decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
The live smoke proves the no-safe state is visible through access telemetry as
`service_channel_feedback_no_alternate` /
`pending_degraded_fallback`, with operator execution state remaining compatible
with durable ledger `rebuild_request_no_alternate`.
C18Z84 aggregates those per-channel decisions at the access-telemetry summary
level: route-decision channel count, replacement decision count, applied
rebuild count, recovery decision count, and no-safe recovery count are exposed
to the API and web-admin summary chips. The no-safe branch now prioritizes the
aggregate status reason `active_channels_no_safe_recovery` over generic missing
access-report noise.
C18Z85 projects access-decision aggregates into rebuild health and incident
diagnostics. Health summary now carries access decision counts and prioritizes
`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe is
active. Rebuild incidents now include `incident_source=access_decision` rows
for active channel decisions such as `access_no_safe_recovery`, with bad
severity and channel id, so operators see route-decision failures beside ledger
incidents.
C18Z86 adds silence/acknowledgement behavior for those
`incident_source=access_decision` incidents. Silence requests now carry
`incident_source` and `channel_id`; access-decision no-safe silences are stored
with a channel-scoped route key, applied back into rebuild health/incidents,
and exact current-generation incidents stop contributing to active bad count.
Generation-changing access-decision resurfacing is unit-tested; the live smoke
proves the operator silence path on docker-test.
C18Z87 exposes active rebuild/access-decision silences to operators and adds
unsilence. The API now lists active rebuild alert silences, returns
access-decision `incident_source`, `channel_id`, and display route id, and
allows deleting a silence by id. Web-admin shows an `Active rebuild silences`
table with an unsilence action. The live smoke proves list -> silence ->
unsilence and verifies the access no-safe incident becomes active again.
C18Z88 makes access-decision resurfacing operator-visible in live runtime.
Access-decision incidents now expose the silence id they resurfaced from, the
previous acknowledged generation, and the silence expiry. The live smoke
proves: access no-safe incident -> silence current generation -> wait for a new
route-decision generation -> incident returns as `alert_resurfaced=true`, active
bad count is restored, and previous generation metadata is preserved.
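
The silence and resurfacing behaviour from C18Z86 through C18Z88 can be sketched as below. The channel-scoped key, the generation comparison, and `alert_resurfaced` come from the notes; the record shapes and the remaining field names are hypothetical.

```python
def evaluate_incident(incident, silences, now):
    # Silences are stored under a channel-scoped route key.
    key = (incident["channel_id"], incident["route_id"])
    silence = silences.get(key)
    if silence is None or now >= silence["expires_at"]:
        return {"active": True, "alert_resurfaced": False}
    if silence["generation"] == incident["generation"]:
        # Exact current-generation match: stop counting it as active bad.
        return {"active": False, "alert_resurfaced": False, "silence_id": silence["id"]}
    # A new route-decision generation resurfaces the alert and keeps context.
    return {
        "active": True,
        "alert_resurfaced": True,
        "resurfaced_from_silence_id": silence["id"],
        "previous_generation": silence["generation"],
        "silence_expires_at": silence["expires_at"],
    }
```
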
C18Z89 closes the resurfaced-incident operator action loop for generation
changes. Resurfaced access-decision incidents now expose
`alert_resurfaced_cause`, previous route id, and previous channel id; web-admin
shows the cause beside resurfaced incidents. The live smoke proves the operator
can re-acknowledge the resurfaced generation, the active-channel decision
context matches the incident route/generation, and the current generation
returns to a silenced state.
C18Z90 introduces the explicit signed production data-plane contract on
service-channel leases. `data_plane` is now part of the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
that control-plane traffic uses backend API, working data uses the fabric
service channel over fabric routes, backend relay is degraded fallback only,
production forwarding is required, and logical flows are service-neutral,
protocol-agnostic, and isolated. Web-admin shows this contract in the
service-channel lease table.
C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
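
The C18Z92 rule can be sketched as a small handler. The `backend_relay_policy=disabled` value and the 503 come from the notes; the success status code, the reason strings, and the callable-based shape are assumptions.

```python
def handle_packet_batch(send_over_fabric, backend_relay_policy, send_over_backend_relay):
    # send_over_fabric / send_over_backend_relay are hypothetical callables
    # that return True when the batch was delivered.
    if send_over_fabric():
        return 200, "fabric_route"
    if backend_relay_policy == "disabled":
        # Fail visibly instead of silently proxying working data.
        return 503, "backend_fallback_blocked"
    if send_over_backend_relay():
        return 200, "backend_relay_degraded_fallback"
    return 503, "backend_relay_failed"
```
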
C18Z93 promotes that data-plane contract telemetry into backend access
telemetry and web-admin active-channel diagnostics: cluster, node, and
active-channel rows now show contract adoption count, last working transport,
steady-state transport, backend relay policy, data-plane mode, and logical
flow mode.
C18Z94 turns those data-plane/fallback signals into operator incidents.
`data_plane_contract` incident rows are now emitted for missing data-plane
contract reports after accepted service-channel traffic, wrong working or
steady-state transport, wrong logical flow mode, disabled backend relay
observed, and degraded backend relay usage. The incident list/readiness path
can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
fallback from becoming a dead-end operator alert.
C18Z97 adds bounded deduplication for those access-report-derived route
feedback records. Repeated blocked-fallback send-failure heartbeats no longer
rewrite the same active fenced feedback or churn planner rebuild attempts while
the first access-report feedback is still active. Runtime feedback from the
flow scheduler remains independent.
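
The bounded deduplication in C18Z97 can be pictured as a check against the still-active feedback record. A minimal sketch; the key shape, the TTL parameter, and the in-memory store are assumptions.

```python
def record_access_feedback(active, channel_id, route_id, now, ttl_s):
    """Return True when a new fenced feedback record was written."""
    key = (channel_id, route_id)
    existing = active.get(key)
    if existing is not None and now < existing["expires_at"]:
        # Repeated heartbeat for the same failure: leave the record untouched
        # so the planner does not churn rebuild attempts.
        return False
    active[key] = {"status": "fenced", "observed_at": now, "expires_at": now + ttl_s}
    return True
```
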
C18Z98 carries that feedback identity into the replacement decision and
rebuild-attempt ledger: decision and ledger rows now expose
`feedback_observation_id`, `feedback_source`, feedback observed/expiry time,
channel/resource ids, and data-plane violation status/reason. Web-admin shows
that correlation in Route decisions and Rebuild ledger.
C18Z99 adds rebuild ledger filters for those correlation fields. The backend
`/fabric/service-channels/rebuild-attempts` API accepts `feedback_source`,
`feedback_channel_id`, and `feedback_violation_status`, and web-admin exposes
the same filters in the rebuild ledger form. The live smoke proves source,
channel, violation, combined filters, and wrong-channel exclusion.
C18Z100 adds rebuild-health feedback breakdown aggregation for the same
correlation fields. The backend rebuild-health summary now returns
`feedback_breakdowns` grouped by feedback source, feedback channel id, and
feedback violation status, including total/good/warn/bad/unknown counts,
active warn/bad counts, silenced count, latest observation time, and affected
reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
panel so operators can see which access-report-derived failure classes dominate
active warn/bad rebuild state.
C18Z101 wires that breakdown into operator workflow in web-admin. Each
feedback-breakdown row now shows related incident context by channel/reporter/
route overlap and has an `open ledger` action that switches to the deep rebuild
ledger with `feedback_source`, `feedback_channel_id`, and
`feedback_violation_status` prefilled from the breakdown row.
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
investigation endpoint now accepts feedback source/channel/violation drilldown
payloads, records
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
cluster audit events, and web-admin records one before opening the filtered
deep ledger from a rebuild-health feedback breakdown row.
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
Web-admin now filters the loaded cluster audit list for rebuild incident and
feedback-breakdown investigation events and shows recent drilldowns with time,
source, feedback filters, target reporter/route, actor, and reason beside
rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now
accepts `event_type` and `target_type` filters, including repeated or
comma-separated values, and web-admin loads recent fabric investigation
breadcrumbs with a dedicated filtered request instead of depending on the
generic latest-100 cluster audit list.
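
Accepting both repeated and comma-separated filter values, as C18Z104 describes, can be sketched with the standard library. The de-duplication and whitespace trimming are assumptions about the desired behaviour, and `example.other` is a made-up event type.

```python
from urllib.parse import parse_qs


def parse_filter(query, name):
    # Accepts ?name=a&name=b as well as ?name=a,b (or a mix of both).
    values = []
    for raw in parse_qs(query).get(name, []):
        for part in raw.split(","):
            part = part.strip()
            if part and part not in values:
                values.append(part)
    return values
```
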
C18Z105 correlates those focused audit breadcrumbs back to currently visible
diagnostics in web-admin. Recent investigation rows now show whether the
breadcrumb still matches an active rebuild-health feedback breakdown or visible
rebuild incident, and provide an `open` action to jump back into the matching
filtered ledger path.
C18Z106 moves that correlation into the backend/API. `GET /audit` with
`correlation=fabric_diagnostics` now returns `correlation_hints` for focused
fabric investigation breadcrumbs, including current diagnostic status
(`breakdown_active`, `incident_visible`, or `not_visible`) and the matching
breakdown/incident object when present. Web-admin consumes those hints and keeps
its previous local matching as fallback. During verification the noisy test
history exposed that rebuild-health feedback breakdowns were capped too tightly;
the backend now returns up to 100 breakdown groups so fresh failure classes are
not pushed out by older smoke history.
C18Z107 adds a compact backend-provided `audit_summary` beside `audit_events`.
For focused Fabric diagnostics audit reads, the summary includes total count,
counts by event/target type, counts by current diagnostic status, counts by
feedback source/violation status, correlated count, not-visible count, and
latest time. Web-admin shows these as Recent investigations chips and short
source/violation lines without recalculating the aggregate in the browser.
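The aggregate described for C18Z107 could be computed server-side along these lines. This is a hedged sketch: the input field names (`event_type`, `status`, `time`) and the output key names are hypothetical, and timestamps are assumed to be same-format UTC ISO-8601 strings so that string comparison orders them correctly.

```python
from collections import Counter


def summarize_audit_events(events: list[dict]) -> dict:
    """Build a compact summary over focused audit events.

    Field and key names are assumptions of this sketch, not the real
    `audit_summary` schema.
    """
    by_status = Counter(e["status"] for e in events)
    not_visible = by_status.get("not_visible", 0)
    return {
        "total": len(events),
        "counts_by_event_type": dict(Counter(e["event_type"] for e in events)),
        "counts_by_status": dict(by_status),
        # Anything with a status other than not_visible counts as correlated.
        "correlated": len(events) - not_visible,
        "not_visible": not_visible,
        # Same-format UTC ISO-8601 strings compare chronologically.
        "latest_time": max((e["time"] for e in events), default=None),
    }
```

Computing this once in the backend is what lets the browser render chips directly instead of re-deriving counts from the event list.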
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
cluster audit read path. The backend now exposes
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
with a dedicated `rebuild_investigation_breadcrumbs` contract containing
events plus summary. Web-admin uses this endpoint for Recent investigations
and keeps generic audit semantics separate from Fabric diagnostics workflow
state.
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
endpoint accepts `current_window_seconds` and `history_window_seconds`, annotates
each breadcrumb with `correlation_hints.breadcrumb_status` (`current`, `stale`,
or `expired`) plus age/window seconds, returns current/stale/expired totals, and
adds `counts_by_breadcrumb_status` to the summary. Web-admin shows freshness
chips and an age column in Recent investigations, so operators can separate live
workflow hints from stale history without deleting audit records.
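The freshness classification from C18Z109 can be illustrated with a small sketch. The three status names and the two window parameters come from the description above; the inclusive boundaries, the function names, and the per-breadcrumb `age_seconds` field are assumptions.

```python
from collections import Counter


def breadcrumb_status(age_seconds: float, current_window_seconds: float,
                      history_window_seconds: float) -> str:
    """Classify a breadcrumb by age against the two freshness windows.

    Boundaries are treated as inclusive here, which is an assumption.
    """
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def counts_by_breadcrumb_status(breadcrumbs: list[dict], current_window_seconds: float,
                                history_window_seconds: float) -> dict:
    """Tally statuses over breadcrumbs carrying a hypothetical `age_seconds` field."""
    return dict(Counter(
        breadcrumb_status(b["age_seconds"], current_window_seconds, history_window_seconds)
        for b in breadcrumbs
    ))
```

Because the status is computed at read time from age and window, changing the windows re-labels existing breadcrumbs without touching or deleting the stored audit records.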
Live verification passed with these smoke scripts:

- `scripts/fabric/c18z48-service-channel-introspection-smoke.ps1`
- `scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1`
- `scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1`
- `scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1`
- `scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1`
- `scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1`
- `scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1`
- `scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1`
- `scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1`
- `scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1`
- `scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1`
- `scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1`
- `scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1`
- `scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1`
- `scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1`
- `scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1`
- `scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1`
- `scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1`

Artifacts:

- `artifacts/c18z48-service-channel-introspection-smoke-result.json`
- `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`
- `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`
- `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`
- `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`
- `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`
- `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`
- `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`
- `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`
- `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`
- `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`
- `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`
- `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`
- `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`
- `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`
- `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`
- `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`
- `artifacts/c18z68-service-channel-flow-health-api-result.json`
- `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`
- `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`
- `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`
- `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`
- `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`
- `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`
- `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`
- `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`
- `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`
- `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`
- `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`
- `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`
- `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`
- `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`
- `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`
- `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`
- `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`
- `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`
- `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`
- `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`
- `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`
- `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`
- `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`
- `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`
- `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`
- `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`
- `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`
- `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`
- `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`
- `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`
- `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`
- `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`
- `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`
- `artifacts/c18z104-focused-fabric-audit-smoke-result.json`
- `artifacts/c18z106-audit-correlation-hints-smoke-result.json`
- `artifacts/c18z107-audit-correlation-summary-smoke-result.json`
- `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`
- `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`

Current active continuation after C20Z6:
C20Z1 through C20Z6 are implemented and runtime-smoke-proven. The C20 stage is
terminal-complete by contract. It opened and validated a new explicit
real-adapter enablement request as a contract-only transition:
`rap.remote_workspace_real_adapter_c20_stage_terminal_complete.v1`, with
`terminal_status=stage_terminal_complete_contract_only`,
`stage_status=complete_no_more_c20_layers_required`,
`stage_name=c20_real_adapter_new_explicit_enablement_request`,
`validation_chain_status=complete_contract_only`,
`enablement_boundary=runtime_enablement_requires_next_explicit_runtime_stage`,
`enablement_decision=validated_contract_only_not_enabled`,
`enablement_status=validated_not_enabled`,
`runtime_gate_state=validated_contract_only_not_enabled`,
`runtime_effect=contract_only_no_runtime_enablement`,
`operator_default_action=keep_real_adapter_disabled_until_next_explicit_runtime_stage`,
`next_allowed_entrypoint=next_explicit_runtime_enablement_stage_only`,
`allows_process_start=false`, and `allows_payload_traffic=false`. Docker-test
`test-1/2/3` remain on
`rap-node-agent:codex-service-supervisor-20260513z52`. Verification artifact:
`artifacts/c20z6-remote-workspace-real-adapter-stage-terminal-complete-compatibility-smoke-result.json`.
The not-approved factory remains terminal-complete by contract, and C20 is now
also terminal-complete by contract. Do not add more C20 continuation layers.
The only allowed next entrypoint is a new explicit runtime enablement stage.
Keep the real adapter disabled until that new stage explicitly changes runtime
state: no process start, no real RDP frame transport, no Android work, no
backend relay semantics, and no production adapter payload forwarding.
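The "keep the real adapter disabled" rule can be expressed as a fail-closed guard over the contract payload. This is only a sketch: the function name `adapter_permissions` and the idea of reading the payload as a plain dictionary are assumptions, while the `allows_process_start` and `allows_payload_traffic` field names are taken from the contract described above.

```python
def adapter_permissions(contract: dict) -> dict:
    """Derive what the real adapter may do from a contract payload.

    Fails closed: a missing field or any value other than the boolean
    True is treated as "not allowed".
    """
    return {
        "process_start": contract.get("allows_process_start") is True,
        "payload_traffic": contract.get("allows_payload_traffic") is True,
    }


# Flags as stated for the C20 terminal-complete contract.
c20_terminal = {"allows_process_start": False, "allows_payload_traffic": False}
assert adapter_permissions(c20_terminal) == {"process_start": False, "payload_traffic": False}
```

Checking for the literal boolean `True` rather than general truthiness means a malformed payload, for example the string `"false"`, still leaves the adapter disabled.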