rdp-proxy/docs/codex/NEXT_STEP_PROMPT.md at 5e4c0d596b5c85b061cd7f5624adde0eedc10042

2026-05-14 23:30:34 +03:00

Current product decision:

Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and future services as separate transport implementations. The next implementation work should focus on the shared Fabric Service Channel runtime described in docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md.

The immediate engineering target is:

backend service-channel lease/route-generation contract
node-agent entry runtime for client/service live connections
service-neutral channel scheduling, bounded queues, route health, and failover
VPN packet flow as the proving service over that common channel
backend relay only as explicit degraded fallback

Backend service-channel lease/route-generation contract is now started:

POST /clusters/{clusterID}/fabric/service-channels/leases issues rap.fabric_service_channel_lease.v1
VPN client profiles embed fabric_service_channel_lease
tests cover ready route and degraded backend-relay fallback behavior
leases include entry HTTP/WebSocket endpoint templates for the selected service channel
leases include cluster-authority-signed rap.fabric_service_channel_lease_authority.v1 payloads that bind token hash, selected route, generation, fencing epoch, and expiry
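
A minimal sketch of how a verifier might check the bindings listed above, once the authority signature itself has already been validated. The field names (token_hash, route_id, generation, fencing_epoch, expires_at), the SHA-256 token hash, and the error labels are assumptions for illustration, not the actual rap.fabric_service_channel_lease_authority.v1 schema.

```python
import hashlib
import hmac


def lease_binding_errors(payload, token, route_id, generation, fencing_epoch, now):
    """Return the binding checks that fail; an empty list means acceptable.

    `payload` stands in for an already signature-verified authority payload.
    Every field name used here is hypothetical.
    """
    errors = []
    token_hash = hashlib.sha256(token.encode("utf-8")).hexdigest()
    # Constant-time comparison avoids leaking how much of the hash matched.
    if not hmac.compare_digest(token_hash, payload["token_hash"]):
        errors.append("token_hash_mismatch")
    if payload["route_id"] != route_id:
        errors.append("route_mismatch")
    if payload["generation"] != generation:
        errors.append("generation_mismatch")
    # A lease issued under an older fencing epoch is treated as fenced.
    if payload["fencing_epoch"] < fencing_epoch:
        errors.append("fenced")
    if now >= payload["expires_at"]:
        errors.append("expired")
    return errors
```
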

Node-agent entry runtime is now started:

rap-node-agent accepts VPN packet batches through /api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets and /packets/ws
entry runtime requires a rap_fsc_* service-channel token and maps packet batches to the existing production vpn_packet fabric route
route failure falls back to the canonical backend relay endpoint so degraded compatibility remains explicit
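
The entry-runtime behaviour above can be sketched roughly as follows. The two send callables, the HTTP status codes, and the returned path labels are hypothetical; only the rap_fsc_ token prefix and the explicit fallback to backend relay come from the description.

```python
def handle_packet_batch(token, batch, send_over_fabric, send_over_backend_relay):
    """Route one VPN packet batch and return a (status, path) tuple."""
    # The entry runtime only accepts service-channel tokens.
    if not token.startswith("rap_fsc_"):
        return (401, "rejected")
    try:
        send_over_fabric(batch)
        return (202, "fabric_route")
    except ConnectionError:
        # Degraded compatibility path: visible in the result, never silent.
        send_over_backend_relay(batch)
        return (202, "backend_relay_fallback")
```
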

Next narrow runtime layer:

persist cluster-level default window policy for Fabric diagnostics investigation breadcrumbs and expose a small admin control for it
keep this in the shared Fabric Service Channel runtime contract and telemetry
do not add Android/RDP protocol work in this slice

C17Z20 is complete.

Installation Authority foundation is also complete:

production config requires strict authority mode with Product Root public key
first-owner bootstrap requires a signed activation manifest in strict mode
installation_authority and signed platform_role_grants are persisted
strict platform-admin checks ignore direct users.platform_role edits unless a valid signed grant exists
web-admin shows installation status and first-owner bootstrap
scripts/installation/product-root-tool.go can generate Ed25519 Product Root keys and sign activation manifests; private keys must stay outside the repo

Cluster Authority foundation is now also complete:

every newly created cluster gets an Ed25519 cluster_authorities key record
cluster authority private keys are encrypted at rest when SECRET_ENCRYPTION_KEY_B64/file is configured; production already requires a secret encryption key
legacy/default clusters are backfilled lazily through EnsureClusterAuthority
backend signs join-token scope material, node approval/bootstrap material, and node-scoped synthetic mesh config snapshots
node-agent verifies signed Control Plane synthetic config when authority_required=true or signature fields are present
node-agent can pin RAP_CLUSTER_AUTHORITY_PUBLIC_KEY and RAP_CLUSTER_AUTHORITY_FINGERPRINT, and identity state can store the same trust anchor after approval
web-admin shows cluster key fingerprints on summaries, join-token output, approval rows, and synthetic config visibility
docker-test lifecycle smoke is complete: fresh dev install, first-owner bootstrap, cluster creation, signed join token, real node-agent enrollment, owner approval, automatic signed bootstrap polling, authority pin persistence, heartbeat, and signed synthetic config verification all passed
rap-node-agent desired-workload polling/status reporting is gated by RAP_WORKLOAD_SUPERVISION_ENABLED, which defaults to false while service runtime supervision remains a stub

Node enrollment bootstrap polling is also complete:

backend exposes /node-agents/enrollments/{requestID}/bootstrap
pending agents prove cluster_id, node_fingerprint, and public_key before receiving status/bootstrap material
rap-node-agent stores pending_join_request_id, polls approval, verifies the signed bootstrap contract, then persists node_id, identity_status, and cluster authority pin into identity.json
polling is controlled by RAP_ENROLLMENT_POLL_INTERVAL_SECONDS and RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS
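
A minimal sketch of a polling loop driven by those two variables. The default values (5 and 300 seconds), the "pending" status string, and the fetch_status callable are assumptions; the real agent also proves cluster_id, node_fingerprint, and public_key on each request and verifies the signed bootstrap contract, which is omitted here.

```python
import os
import time


def poll_enrollment(fetch_status, environ=os.environ, sleep=time.sleep, clock=time.monotonic):
    """Poll until the enrollment leaves "pending" or the timeout elapses."""
    interval = float(environ.get("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", "5"))
    timeout = float(environ.get("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", "300"))
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status != "pending":
            return status
        # Stop instead of sleeping past the deadline.
        if clock() + interval > deadline:
            return "timeout"
        sleep(interval)
```

Injecting sleep and clock keeps the loop testable without real waiting.
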

Current state:

C17Z12 added rendezvous/relay control-plane leases for peers that would otherwise stay in waiting_rendezvous.
C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh for renewal/stale relay recovery.
C17Z15 added backend stale-relay replacement/withdrawal policy and alternate relay-pool scoring.
C17Z16 added Control Plane route_path_decisions.
C17Z17 added node-side route generation apply/withdraw tracking.
C17Z18 applies Control Plane route_path_decisions to synthetic route-health route config only. The synthetic fabric.route_health runtime now probes the selected effective path, including replacement relay paths, and reports expected/observed hops plus drift state.
C17Z19 consumes those synthetic route-health observations in backend relay scoring. Drift/unreachable/failure feedback marks the exact selected relay stale and can trigger replacement; healthy low-latency route-health boosts alternate relay score reasons. Migration 000022 adds the synthetic mesh service class, and web-admin marks relay policy rh (route-health) feedback.
C17Z20 closes the node-side feedback loop. After node-agent reports synthetic route-health drift/unreachable/failure, it performs a bounded node-scoped synthetic-config refresh, applies returned replacement route decisions to route-health config immediately, and reports c17z20.mesh_route_health_feedback_refresh_report.v1.
Backend mesh_latest_links now keeps latest observations per observation type/route, so synthetic_route_health is not overwritten by peer_connection_manager.
Web-admin Fabric links now show observation type, selected relay, and route-health effective/observed path.
All of this remains control-plane/synthetic route-health only. It does not forward RDP/VPN/service payloads, does not start VPN runtime, and does not implement arbitrary relay packet forwarding.
Cluster Authority and node enrollment bootstrap are docker-test lifecycle-smoke verified in run dev-bootstrap-20260428-201430.
Fresh migration replay found and fixed a PostgreSQL view replacement issue in 000021_cluster_authority_keys; the migration now drops/recreates cluster_admin_summaries in up/down paths.

Runtime report:

artifacts/c17z18-route-health-effective-path-report.md
artifacts/c17z19-route-health-feedback-report.md
artifacts/c17z19-route-health-feedback-smoke-result.json
artifacts/c17z20-route-health-feedback-refresh-report.md
artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md
artifacts/c18w-service-channel-route-manager-smoke-result.json
artifacts/c18x-service-channel-logical-channel-smoke-result.json
artifacts/c18y-route-intent-lifecycle-smoke-result.json
artifacts/c18z-service-channel-load-smoke-result.json
artifacts/c18z1-live-service-channel-ingress-smoke-result.json
artifacts/c18z2-live-service-channel-soak-smoke-result.json
artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json
artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json
artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json
artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json
artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json
artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json
artifacts/c18z9-live-service-channel-route-pool-smoke-result.json
artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json
artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json
artifacts/c18z12-service-channel-route-quality-smoke-result.json
artifacts/c18z13-live-service-channel-route-quality-smoke-result.json
artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json
artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json
artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json
artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json
artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json
artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json
artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json
Docker-test smoke command: pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
Dev lifecycle smoke command: pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning
Last proven runtime run: c17z18-20260428-221601 (legacy smoke script name, current C17Z20 node-agent code)
Last proven dev lifecycle run: dev-bootstrap-20260428-201430
Admin: http://192.168.200.61:18080/
C17Z20 multi-agent API: http://192.168.200.61:18120/api/v1
C17Z19 backend-only API: http://192.168.200.61:18122/api/v1
Dev lifecycle API: http://192.168.200.61:18121/api/v1

Do not automatically continue into:

RDP/VNC/SSH/file/video/service workload traffic over mesh
VPN/IP tunnel runtime implementation
arbitrary relay packet forwarding
production payload forwarding for relay paths
QUIC/WebRTC or STUN/TURN/ICE
TUN/TAP, host route, DNS, or firewall manipulation
backend/session lifecycle changes
Windows client changes

Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness window proof:

C18Z48 through C18Z109 are complete.
Backend is deployed on docker-test as rap-backend:fabric-service-channel-0.2.281-c18z109; migration 000029_fabric_service_channel_leases is applied on the shared test database.
Node-agent image rap-node-agent:0.2.270-c18z95 is built and deployed on test-1/2/3; web-admin is rebuilt and deployed to rap_web_admin.
All three test nodes run the C18Z92 image and are healthy and current after the policy update.
Node-agent still requires signed service-channel lease authority when cluster authority is pinned, but if legacy clients cannot send signed lease headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as accepted_by=signed|introspection|legacy_unsigned in structured node logs and via X-RAP-Service-Channel-Accepted-By on HTTP packet ingress.
Durable introspection stores only token_hash plus a scrubbed lease payload, so backend restarts no longer break compatibility clients.
Live lease maintenance now lists active/expired durable compatibility leases and runs bounded cleanup through the admin API/panel.
Durable access telemetry now aggregates node-reported accepted ingress counters by signed/introspection/legacy path, with heartbeat metadata fallback and admin-panel visibility.
Access telemetry now also correlates active durable service-channel leases with entry/exit nodes, primary route status, backend fallback, and latest route-quality feedback when a route exists.
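
The accepted_by decision described above could be sketched like this. The function shape, the introspect callable, and the rule that legacy_unsigned applies only when no cluster authority is pinned are assumptions; the text only fixes the three accepted_by values and the introspection step for pinned clusters.

```python
def classify_ingress(authority_pinned, has_signed_lease, signature_valid, introspect):
    """Return the accepted_by value, or None when ingress must be rejected.

    `introspect` is a hypothetical callable that asks the backend whether the
    unsigned token maps to an active lease.
    """
    if has_signed_lease:
        return "signed" if signature_valid else None
    if authority_pinned:
        # Legacy client without signed headers: require backend introspection.
        return "introspection" if introspect() else None
    # Assumed: unsigned tokens are accepted directly only without a pin.
    return "legacy_unsigned"
```
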
Normal-route access diagnostics are smoke-proven with a temporary direct vpn_packets route and healthy rolling quality window.
Degraded normal-route diagnostics are also smoke-proven: the active channel stays on a normal primary route with force_backend_fallback=false while route feedback becomes fenced and rolling failure/drop/slow counters are visible.
Active-channel remediation diagnostics now expose remediation_action, reason, optional alternate route id/status, and operator hint, with unit coverage for healthy/noop, rebuild, backend fallback, and authorized alternate decisions.
The alternate-route remediation branch is now live-smoke-proven: a selected primary route is degraded after lease issuance and access telemetry recommends prefer_alternate_route while keeping force_backend_fallback=false.
C18Z57 turns that recommendation into a bounded machine-readable remediation_command on the active channel row, including the primary route, replacement route, issued time, and command TTL capped to the lease lifetime.
C18Z58 projects those commands into node-scoped synthetic mesh config and node-agent consumes prefer_alternate_route as an explicit route-manager applied decision with source service_channel_remediation_command.
C18Z59 proves active traffic follows the replacement route after remediation: runtime heartbeat evidence shows last_selected_route_id and flow-scheduler last_route_id on the replacement route, with no local/backend fallback and no route send failures.
C18Z60 proves the same replacement path under multiple independent VPN flow channels: a twelve-packet batch is classified across multiple flow-scheduler channels, all observed replacement-route sends avoid local/backend fallback, flow drops, and route failures.
C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like packets; runtime evidence shows 32 replacement-route flow stats, scheduler high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
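
A minimal sketch of the C18Z57 TTL cap, assuming timestamps in epoch seconds. The dictionary keys and the function name are hypothetical; only the rule that the command TTL never outlives the lease is taken from the text.

```python
def build_remediation_command(primary_route_id, replacement_route_id,
                              issued_at, requested_ttl, lease_expires_at):
    """Build a prefer_alternate_route command whose TTL is capped to the lease."""
    # Time left on the lease; never negative for an already expired lease.
    remaining = max(0, lease_expires_at - issued_at)
    return {
        "action": "prefer_alternate_route",
        "primary_route_id": primary_route_id,
        "replacement_route_id": replacement_route_id,
        "issued_at": issued_at,
        "ttl_seconds": min(requested_ttl, remaining),
    }
```
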
C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP ingress: X-RAP-Traffic-Class can mark control, interactive, reliable, bulk, or droppable; default traffic remains backward-compatible bulk. Unit tests prove scheduler priority order, and live smoke proves a bulk 128-packet pressure batch plus an interactive packet both move over the replacement route with separate traffic-class flow stats and no fallback, drops, or route failures.
C18Z63 adds a controlled concurrent runtime proof: a bulk traffic-class send is held in-flight while an independent interactive traffic-class packet is sent through the same ingress, and interactive completes before bulk release with MaxInFlight >= 2, no drops, and no failures.
C18Z64 adds compact runtime telemetry: rap.fabric_flow_scheduler.v1 snapshots include traffic_class_counts, so backend/admin/diagnostics can show active flow-channel counts per traffic class without scanning each channel stat. It is live-proven on rap-node-agent:0.2.239-c18z64; latest test-1 snapshot showed bulk=32, interactive=12, drops 0.
C18Z65/C18Z66 project those counts and flow pressure fields into backend access telemetry at node, active-channel, and cluster aggregate levels, and web-admin shows cluster/node/channel flow QoS visibility. Live aggregate API result showed bulk=32, interactive=12, flow_channel_count=44, flow_max_in_flight=4.
C18Z67 adds a live HTTP concurrent QoS proof: six parallel bulk service-channel requests ran while an interactive traffic-class request was injected on the same entry path after remediation; the interactive request completed in 132 ms, all 6 bulk requests were accepted, 3072 post-remediation packets moved over the replacement route, 32 bulk and 12 interactive replacement-route flow stats were observed, and fallback, route failures, flow drops, and scheduler drops stayed at 0.
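
A small sketch of the traffic-class marking (C18Z62) and the per-class counts (C18Z64). Treating an unknown header value as bulk and the shape of the flow-channel records are assumptions; the source only states that the default class is bulk.

```python
from collections import Counter

TRAFFIC_CLASSES = ("control", "interactive", "reliable", "bulk", "droppable")


def traffic_class_from_headers(headers):
    """Read X-RAP-Traffic-Class from a plain dict; default to bulk."""
    # A plain dict lookup is case-sensitive on the header name.
    value = headers.get("X-RAP-Traffic-Class", "").strip().lower()
    return value if value in TRAFFIC_CLASSES else "bulk"


def traffic_class_counts(flow_channels):
    """Count active flow channels per traffic class in a single pass."""
    return dict(Counter(ch["traffic_class"] for ch in flow_channels))
```
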
C18Z68 adds backend/admin flow-health guard diagnostics over that telemetry: flow_health_status and flow_health_reason are projected at cluster, node, and active-channel levels from traffic-class pressure, queue pressure, flow drops, backend fallback, route-quality failures/drops/slow samples, and route send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now report per-class recommended_parallel_windows plus adaptive_backpressure_active/reason, and the ingress send path uses the traffic-class-specific window. Under pressure, bulk/droppable are reduced first, reliable is reduced moderately, and control/interactive keep their full window unless their own class degrades. Live smoke verified bulk=1, droppable=1, reliable=3, interactive=4, control=4, no drops, and bulk_window_reduced_to_protect_interactive.
C18Z70 projects those adaptive runtime fields into backend/admin access telemetry at cluster, node, and active-channel levels. Cluster windows are aggregated by minimum non-zero per-class recommendation, and web-admin shows adaptive window chips beside flow health/QoS. Live API artifact shows adaptive=true, bulk_window_reduced_to_protect_interactive, and windows bulk=1, droppable=1, reliable=3, interactive=4, control=4.
C18Z71 adds the cluster-level adaptive policy contract: GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy. The policy stores audited thresholds and class windows in cluster metadata, projects the effective fingerprint into signed node-scoped synthetic config, and node-agent heartbeat/runtime telemetry reports adaptive_policy_fingerprint. The node scheduler consumes the policy at runtime; default policy preserves bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke proved an operator policy with max window 6 and bulk=2 changes the live recommended windows without breaking interactive/control.
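
The C18Z70 cluster aggregation ("minimum non-zero per-class recommendation") can be illustrated with a short sketch. The input shape (one dict of class to window per node) and treating zero as "no recommendation" are assumptions.

```python
def aggregate_cluster_windows(node_windows):
    """Aggregate per-node windows by the minimum non-zero value per class."""
    cluster = {}
    for windows in node_windows:
        for traffic_class, window in windows.items():
            if window <= 0:
                continue  # zero means "no recommendation" in this sketch
            current = cluster.get(traffic_class)
            cluster[traffic_class] = window if current is None else min(current, window)
    return cluster
```
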
A signed-config hash mismatch found during the smoke was fixed by preserving all signed adaptive policy provenance fields in the node-agent client model.
C18Z72 adds the cluster-level pool/failover policy contract: GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy. Lease issuance now applies the effective entry/exit pool constraints and preferred entry/exit before route selection, stores the effective policy on the lease, and signs it into rap.fabric_service_channel_lease_authority.v1. Live smoke proved a policy-constrained lease selects only the policy entry/exit from a wider requested pool and carries matching signed pool_policy provenance.
C18Z73 projects that signed pool-policy fingerprint into active access telemetry and guards remediation commands against routes outside the signed lease pools.
C18Z74 correlates active remediation commands with entry-node route-manager heartbeats and reports execution states such as waiting_node_apply, applied, rejected_by_policy_guard, pending_rebuild_request, and expired.
C18Z75 records rebuild_route remediation as durable rebuild ledger intent rows when node-scoped synthetic config is fetched, and access telemetry reports rebuild_request_recorded or rebuild_request_rejected.
C18Z76 makes the allowed rebuild_route command visible from the node side: node-agent consumes it as a route-manager pending_degraded_fallback decision with source service_channel_remediation_command, and backend access telemetry correlates that with the durable ledger as rebuild_request_recorded_node_pending.
C18Z77 resolves durable remediation rebuild requests inside the shared Control Plane planner: signed-pool-valid alternates become applied / replacement_selected and are projected as route-manager decisions with the same command id, missing safe alternates become no_alternate, lease/policy blocks become deferred_by_policy, and stale commands become expired.
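
The four C18Z77 planner outcomes can be sketched as a single decision function. The order in which the checks run, the argument names, and the tuple return shape are assumptions; the outcome names come from the text.

```python
def resolve_rebuild_request(command_expired, blocked_by_policy, alternates, signed_pool):
    """Return (outcome, replacement_route_id) for one durable rebuild request."""
    if command_expired:
        return ("expired", None)
    if blocked_by_policy:
        return ("deferred_by_policy", None)
    # Only alternates inside the signed lease pool are safe to select.
    for route_id in alternates:
        if route_id in signed_pool:
            return ("applied", route_id)
    return ("no_alternate", None)
```
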
C18Z78 adds operator-visible planner outcome chips in web-admin and proves the applied branch live by adding an alternate route after lease issuance and verifying the existing rebuild command resolves to rebuild_request_applied.
C18Z79 closes the planner-to-runtime proof loop for that branch: after planner resolution, the entry node reports a route-manager decision with the same rebuild_request_id, the transition is applied_rebuild, and live service-channel packet traffic selects the replacement route without local/backend fallback, route failures, or flow drops.
C18Z80 hardens that same path under sustained pressure: after planner-applied rebuild, five post-rebuild bursts of mixed interactive, bulk, and reliable VPN packet batches stay on the replacement route, the stale primary is not reselected, and fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
C18Z81 adds the negative/rollback proof: after the initial replacement is applied and used, a generation-valid fenced feedback report for that replacement causes the Control Plane to select a new safe recovery route; live traffic then moves to the recovery route, the degraded replacement is not reselected, and fallback/failure/drop deltas stay zero for the recovery send. The C18Z81 work also tightened older smoke checks to use per-run counter deltas instead of absolute cumulative runtime counters.
C18Z82 closes the no-safe-recovery branch: after the replacement route reports generation-valid fenced feedback and no new safe recovery route is created, node-scoped synthetic config surfaces service_channel_feedback_no_alternate with pending_degraded_fallback, no_unfenced_alternate_route, and backend_relay_degraded_fallback_until_rebuild, proving the Control Plane exposes a degraded/no-alternate state instead of silently sticking to a bad replacement.
C18Z83 projects those route-manager decisions into active access telemetry and web-admin: active channels now expose route-decision source, route id, replacement route id, rebuild status/reason/generation, and score reasons. The live smoke proves the no-safe state is visible through access telemetry as service_channel_feedback_no_alternate / pending_degraded_fallback, with operator execution state remaining compatible with durable ledger rebuild_request_no_alternate.
C18Z84 aggregates those per-channel decisions at the access-telemetry summary level: route-decision channel count, replacement decision count, applied rebuild count, recovery decision count, and no-safe recovery count are exposed to the API and web-admin summary chips. The no-safe branch now prioritizes the aggregate status reason active_channels_no_safe_recovery over generic missing access-report noise.
C18Z85 projects access-decision aggregates into rebuild health and incident diagnostics. Health summary now carries access decision counts and prioritizes inspect_access_no_safe_recovery_route_pool_and_signed_policy when no-safe is active. Rebuild incidents now include incident_source=access_decision rows for active channel decisions such as access_no_safe_recovery, with bad severity and channel id, so operators see route-decision failures beside ledger incidents.
C18Z86 adds silence/acknowledgement behavior for those incident_source=access_decision incidents. Silence requests now carry incident_source and channel_id; access-decision no-safe silences are stored with a channel-scoped route key, applied back into rebuild health/incidents, and exact current-generation incidents stop contributing to active bad count. Generation-changing access-decision resurfacing is unit-tested; the live smoke proves the operator silence path on docker-test.
C18Z87 exposes active rebuild/access-decision silences to operators and adds unsilence.
The API now lists active rebuild alert silences, returns access-decision incident_source, channel_id, and display route id, and allows deleting a silence by id. Web-admin shows an Active rebuild silences table with an unsilence action. The live smoke proves list -> silence -> unsilence and verifies the access no-safe incident becomes active again.
C18Z88 makes access-decision resurfacing operator-visible in live runtime. Access-decision incidents now expose the silence id they resurfaced from, the previous acknowledged generation, and the silence expiry. The live smoke proves: access no-safe incident -> silence current generation -> wait for a new route-decision generation -> incident returns as alert_resurfaced=true, active bad count is restored, and previous generation metadata is preserved.
C18Z89 closes the resurfaced-incident operator action loop for generation changes. Resurfaced access-decision incidents now expose alert_resurfaced_cause, previous route id, and previous channel id; web-admin shows the cause beside resurfaced incidents. The live smoke proves the operator can re-acknowledge the resurfaced generation, the active-channel decision context matches the incident route/generation, and the current generation returns to a silenced state.
C18Z90 introduces the explicit signed production data-plane contract on service-channel leases. data_plane is now part of the lease, authority payload, introspection response, and lease-maintenance/admin list. It declares that control-plane traffic uses backend API, working data uses the fabric service channel over fabric routes, backend relay is degraded fallback only, production forwarding is required, and logical flows are service-neutral, protocol-agnostic, and isolated. Web-admin shows this contract in the service-channel lease table.
C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred fabric route, emits data-plane mode/transport/fallback/logical-flow fields in access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a signed lease says backend_relay_policy=disabled, route failure or missing fabric route returns a visible 503 instead of silently proxying working data through backend relay.
C18Z93 promotes that data-plane contract telemetry into backend access telemetry and web-admin active-channel diagnostics: cluster, node, and active-channel rows now show contract adoption count, last working transport, steady-state transport, backend relay policy, data-plane mode, and logical flow mode.
C18Z94 turns those data-plane/fallback signals into operator incidents. data_plane_contract incident rows are now emitted for missing data-plane contract reports after accepted service-channel traffic, wrong working or steady-state transport, wrong logical flow mode, disabled backend relay observed, and degraded backend relay usage. The incident list/readiness path can now surface a recommended action such as restoring the fabric route instead of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane contract disables backend relay and the entry runtime cannot use a fabric route, node-agent reports backend_fallback_blocked, the last data-plane violation status/reason, and backend/admin project those fields to cluster, node, channel, and data_plane_contract incident diagnostics. Disabled-policy refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the existing route feedback and rebuild planner path.
When heartbeat access telemetry reports fabric_route_send_failed_backend_fallback_blocked, backend correlates the entry node's active service-channel leases, records fenced fabric_service_channel_route_feedback for the selected primary route, and the existing planner can select an alternate/replacement route. This keeps blocked fallback from becoming a dead-end operator alert.
C18Z97 adds bounded deduplication for those access-report-derived route feedback records. Repeated blocked-fallback send-failure heartbeats no longer rewrite the same active fenced feedback or churn planner rebuild attempts while the first access-report feedback is still active. Runtime feedback from the flow scheduler remains independent.
C18Z98 carries that feedback identity into the replacement decision and rebuild-attempt ledger: decision and ledger rows now expose feedback_observation_id, feedback_source, feedback observed/expiry time, channel/resource ids, and data-plane violation status/reason. Web-admin shows that correlation in Route decisions and Rebuild ledger.
C18Z99 adds rebuild ledger filters for those correlation fields. The backend /fabric/service-channels/rebuild-attempts API accepts feedback_source, feedback_channel_id, and feedback_violation_status, and web-admin exposes the same filters in the rebuild ledger form. The live smoke proves source, channel, violation, combined filters, and wrong-channel exclusion.
C18Z100 adds rebuild-health feedback breakdown aggregation for the same correlation fields. The backend rebuild-health summary now returns feedback_breakdowns grouped by feedback source, feedback channel id, and feedback violation status, including total/good/warn/bad/unknown counts, active warn/bad counts, silenced count, latest observation time, and affected reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health panel so operators can see which access-report-derived failure classes dominate active warn/bad rebuild state.
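
A rough sketch of the C18Z97 deduplication idea: a repeated report for the same route and channel is ignored while the first feedback is still active. The key choice (route id plus channel id), the in-memory dict, and the TTL parameter are assumptions; the backend's actual storage and expiry rules are not described here.

```python
def record_access_feedback(active, route_id, channel_id, now, ttl_seconds):
    """Record fenced feedback unless an active record already covers it.

    `active` maps (route_id, channel_id) to the expiry time of the current
    feedback. Returns True when a new record was written.
    """
    key = (route_id, channel_id)
    expires_at = active.get(key)
    if expires_at is not None and now < expires_at:
        # Still active: do not rewrite it or churn the rebuild planner.
        return False
    active[key] = now + ttl_seconds
    return True
```
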
C18Z101 wires that breakdown into operator workflow in web-admin. Each feedback-breakdown row now shows related incident context by channel/reporter/route overlap and has an open ledger action that switches to the deep rebuild ledger with feedback_source, feedback_channel_id, and feedback_violation_status prefilled from the breakdown row.
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild investigation endpoint now accepts feedback source/channel/violation drilldown payloads, records fabric.service_channel_rebuild_feedback_breakdown.investigation_opened cluster audit events, and web-admin records one before opening the filtered deep ledger from a rebuild-health feedback breakdown row.
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel. Web-admin now filters the loaded cluster audit list for rebuild incident and feedback-breakdown investigation events and shows recent drilldowns with time, source, feedback filters, target reporter/route, actor, and reason beside rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now accepts event_type and target_type filters, including repeated or comma-separated values, and web-admin loads recent fabric investigation breadcrumbs with a dedicated filtered request instead of depending on the generic latest-100 cluster audit list.
C18Z105 correlates those focused audit breadcrumbs back to currently visible diagnostics in web-admin. Recent investigation rows now show whether the breadcrumb still matches an active rebuild-health feedback breakdown or visible rebuild incident, and provide an open action to jump back into the matching filtered ledger path.
C18Z106 moves that correlation into the backend/API.
GET /audit with correlation=fabric_diagnostics now returns correlation_hints for focused fabric investigation breadcrumbs, including current diagnostic status (breakdown_active, incident_visible, or not_visible) and the matching breakdown/incident object when present. Web-admin consumes those hints and keeps its previous local matching as fallback. During verification the noisy test history exposed that rebuild-health feedback breakdowns were capped too tightly; the backend now returns up to 100 breakdown groups so fresh failure classes are not pushed out by older smoke history.
C18Z107 adds a compact backend-provided audit_summary beside audit_events. For focused Fabric diagnostics audit reads, the summary includes total count, counts by event/target type, counts by current diagnostic status, counts by feedback source/violation status, correlated count, not-visible count, and latest time. Web-admin shows these as Recent investigations chips and short source/violation lines without recalculating the aggregate in the browser.
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic cluster audit read path. Backend now exposes GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs with a dedicated rebuild_investigation_breadcrumbs contract containing events plus summary. Web-admin uses this endpoint for Recent investigations and keeps generic audit semantics separate from Fabric diagnostics workflow state.
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The endpoint accepts current_window_seconds and history_window_seconds, annotates each breadcrumb with correlation_hints.breadcrumb_status (current, stale, or expired) plus age/window seconds, returns current/stale/expired totals, and adds counts_by_breadcrumb_status to the summary.
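
A minimal sketch of how the C18Z109 freshness windows could classify breadcrumbs. The inclusive boundaries and the helper names are assumptions; the status values and the two window parameters come from the contract description.

```python
def breadcrumb_status(age_seconds, current_window_seconds, history_window_seconds):
    """Classify one breadcrumb as current, stale, or expired by its age."""
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def counts_by_breadcrumb_status(ages, current_window_seconds, history_window_seconds):
    """Summarize a list of breadcrumb ages into per-status totals."""
    counts = {"current": 0, "stale": 0, "expired": 0}
    for age in ages:
        status = breadcrumb_status(age, current_window_seconds, history_window_seconds)
        counts[status] += 1
    return counts
```
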
Web-admin shows freshness chips and an age column in Recent investigations, so operators can separate live workflow hints from stale history without deleting audit records.

Live verification passed with these scripts:

scripts/fabric/c18z48-service-channel-introspection-smoke.ps1
scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1
scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1
scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1
scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1
scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1
scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1
scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1
scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1
scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1
scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1
scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1
scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1
scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1
scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1
scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1
scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1
scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1

Artifacts:

artifacts/c18z48-service-channel-introspection-smoke-result.json
artifacts/c18z50-service-channel-durable-introspection-smoke-result.json
artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json
artifacts/c18z52-service-channel-access-telemetry-smoke-result.json
artifacts/c18z53-service-channel-access-correlation-smoke-result.json
artifacts/c18z54-service-channel-normal-route-access-smoke-result.json
artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json
artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json
artifacts/c18z57-service-channel-remediation-command-smoke-result.json
artifacts/c18z58-service-channel-remediation-apply-smoke-result.json
artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json
artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json
artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json
artifacts/c18z62-service-channel-remediation-qos-smoke-result.json
artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl
artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl
artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json
artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json
artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json
artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json
artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json
artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json
artifacts/c18z68-service-channel-flow-health-api-result.json
artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json
artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json
artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json
artifacts/c18z72-service-channel-pool-policy-smoke-result.json
artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json
artifacts/c18z74-service-channel-remediation-execution-smoke-result.json
artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json
artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json
artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json
artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json
artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json
artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json
artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json
artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json
artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json
artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json
artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json
artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json
artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json
artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json
artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json
artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json
artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json
artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json
artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json
artifacts/c18z94-data-plane-contract-incident-smoke-result.json
artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json
artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json
artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json
artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json
artifacts/c18z99-rebuild-correlation-filter-smoke-result.json
artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json
artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json
artifacts/c18z104-focused-fabric-audit-smoke-result.json
artifacts/c18z106-audit-correlation-hints-smoke-result.json
artifacts/c18z107-audit-correlation-summary-smoke-result.json
artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json
artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json
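For reference, the C18Z104 filter semantics described earlier (event_type and target_type accepting repeated or comma-separated values) can be sketched as below. This is only an illustration of one plausible reading: the helper name filter_values is hypothetical, and trimming whitespace, dropping empty parts, and de-duplicating while preserving order are assumptions rather than confirmed backend behavior.

```python
from urllib.parse import parse_qs


def filter_values(query_string, name):
    """Collect values for one filter from a raw query string.

    Accepts both repeated parameters (a=x&a=y) and comma-separated
    values (a=x,y). Assumption: blanks are ignored and duplicates
    are dropped while keeping first-seen order.
    """
    values = []
    for raw in parse_qs(query_string).get(name, []):
        for part in raw.split(","):
            part = part.strip()
            if part and part not in values:
                values.append(part)
    return values
```

With this reading, a request that mixes both forms, such as event_type=a,b&event_type=c, yields the same filter set as three repeated event_type parameters.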

Current active continuation after C20Z6:

C20Z1 through C20Z6 are implemented and runtime-smoke-proven. The C20 stage is terminal-complete by contract. It opened and validated a new explicit real-adapter enablement request as a contract-only transition, rap.remote_workspace_real_adapter_c20_stage_terminal_complete.v1, with:

terminal_status=stage_terminal_complete_contract_only
stage_status=complete_no_more_c20_layers_required
stage_name=c20_real_adapter_new_explicit_enablement_request
validation_chain_status=complete_contract_only
enablement_boundary=runtime_enablement_requires_next_explicit_runtime_stage
enablement_decision=validated_contract_only_not_enabled
enablement_status=validated_not_enabled
runtime_gate_state=validated_contract_only_not_enabled
runtime_effect=contract_only_no_runtime_enablement
operator_default_action=keep_real_adapter_disabled_until_next_explicit_runtime_stage
next_allowed_entrypoint=next_explicit_runtime_enablement_stage_only
allows_process_start=false
allows_payload_traffic=false

Docker-test test-1/2/3 remain on rap-node-agent:codex-service-supervisor-20260513z52. Verification artifact: artifacts/c20z6-remote-workspace-real-adapter-stage-terminal-complete-compatibility-smoke-result.json.

The not-approved factory remains terminal-complete by contract, and C20 is now also terminal-complete by contract. Do not add more C20 continuation layers. The only allowed next entrypoint is a new explicit runtime enablement stage. Keep the real adapter disabled until that new stage explicitly changes runtime state: no process start, no real RDP frame transport, no Android work, no backend relay semantics, and no production adapter payload forwarding.
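A fail-closed reading of that rule can be sketched as a small guard. The function name and the choice to require both flags are hypothetical; the field names and values are taken from the C20 terminal-complete contract above, reduced to a subset for brevity.

```python
# Subset of the C20 terminal-complete contract fields listed above.
C20_TERMINAL_CONTRACT = {
    "terminal_status": "stage_terminal_complete_contract_only",
    "enablement_status": "validated_not_enabled",
    "runtime_effect": "contract_only_no_runtime_enablement",
    "next_allowed_entrypoint": "next_explicit_runtime_enablement_stage_only",
    "allows_process_start": False,
    "allows_payload_traffic": False,
}


def real_adapter_may_run(contract):
    """Return True only when both runtime flags are explicitly True.

    Assumption: a missing or non-boolean flag is treated as disabled,
    so an incomplete contract can never enable the real adapter.
    """
    return (
        contract.get("allows_process_start") is True
        and contract.get("allows_payload_traffic") is True
    )
```

Under the current contract this guard returns False, which matches the instruction to keep the real adapter disabled until a new explicit runtime enablement stage changes runtime state.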
