Current product decision:
Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
future services as separate transport implementations. The next implementation
work should focus on the shared Fabric Service Channel runtime described in
docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md.
The immediate engineering target is:
- backend service-channel lease/route-generation contract
- node-agent entry runtime for client/service live connections
- service-neutral channel scheduling, bounded queues, route health, and failover
- VPN packet flow as the proving service over that common channel
- backend relay only as explicit degraded fallback
Backend service-channel lease/route-generation contract is now started:
- POST /clusters/{clusterID}/fabric/service-channels/leases issues rap.fabric_service_channel_lease.v1
- VPN client profiles embed fabric_service_channel_lease
- tests cover ready route and degraded backend-relay fallback behavior
- leases include entry HTTP/WebSocket endpoint templates for the selected service channel
- leases include cluster-authority-signed rap.fabric_service_channel_lease_authority.v1 payloads that bind token hash, selected route, generation, fencing epoch, and expiry
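The binding described for the signed lease authority payload can be sketched as follows. This is a minimal illustration, not the backend implementation: the struct fields, the JSON encoding, the SHA-256 token hash, and the helper names (newAuthorityKey, hashToken, signLease, verifyLease) are all assumptions. Only the Ed25519 key type and the list of bound values come from the notes above.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// LeaseAuthorityPayload is a hypothetical shape; field names are assumptions.
type LeaseAuthorityPayload struct {
	Schema        string `json:"schema"`
	TokenHash     string `json:"token_hash"`
	RouteID       string `json:"route_id"`
	Generation    int64  `json:"generation"`
	FencingEpoch  int64  `json:"fencing_epoch"`
	ExpiresAtUnix int64  `json:"expires_at_unix"`
}

// newAuthorityKey generates a throwaway Ed25519 key pair for the demo.
func newAuthorityKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// hashToken assumes SHA-256 hex; the real token hash scheme is not stated.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// signLease serializes the payload and signs those exact bytes.
func signLease(priv ed25519.PrivateKey, p LeaseAuthorityPayload) (raw, sig []byte, err error) {
	raw, err = json.Marshal(p)
	if err != nil {
		return nil, nil, err
	}
	return raw, ed25519.Sign(priv, raw), nil
}

// verifyLease checks the signature first, then the token binding, then expiry.
func verifyLease(pub ed25519.PublicKey, raw, sig []byte, token string, nowUnix int64) (LeaseAuthorityPayload, error) {
	var p LeaseAuthorityPayload
	if !ed25519.Verify(pub, raw, sig) {
		return p, errors.New("signature does not verify")
	}
	if err := json.Unmarshal(raw, &p); err != nil {
		return p, err
	}
	if p.TokenHash != hashToken(token) {
		return p, errors.New("token hash mismatch")
	}
	if nowUnix >= p.ExpiresAtUnix {
		return p, errors.New("lease expired")
	}
	return p, nil
}

func main() {
	pub, priv := newAuthorityKey()
	raw, sig, err := signLease(priv, LeaseAuthorityPayload{
		Schema:    "rap.fabric_service_channel_lease_authority.v1",
		TokenHash: hashToken("rap_fsc_demo"),
		RouteID:   "route-a", Generation: 3, FencingEpoch: 7, ExpiresAtUnix: 2000,
	})
	if err != nil {
		panic(err)
	}
	p, err := verifyLease(pub, raw, sig, "rap_fsc_demo", 1000)
	fmt.Println(p.RouteID, p.Generation, p.FencingEpoch, err)
}
```

Signing the serialized bytes rather than re-marshalling on the verifier side avoids hash mismatches caused by field ordering or dropped fields.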
Node-agent entry runtime is now started:
- rap-node-agent accepts VPN packet batches through /api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets and /packets/ws
- entry runtime requires a rap_fsc_* service-channel token and maps packet batches to the existing production vpn_packet fabric route
- route failure falls back to the canonical backend relay endpoint so degraded compatibility remains explicit
Next narrow runtime layer:
- persist cluster-level default window policy for Fabric diagnostics investigation breadcrumbs and expose a small admin control for it
- keep this in the shared Fabric Service Channel runtime contract and telemetry
- do not add Android/RDP protocol work in this slice
C17Z20 is complete.
Installation Authority foundation is also complete:
- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- installation_authority and signed platform_role_grants are persisted
- strict platform-admin checks ignore direct users.platform_role edits unless a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- scripts/installation/product-root-tool.go can generate Ed25519 Product Root keys and sign activation manifests; private keys must stay outside the repo
Cluster Authority foundation is now also complete:
- every newly created cluster gets an Ed25519 cluster_authorities key record
- cluster authority private keys are encrypted at rest when SECRET_ENCRYPTION_KEY_B64/file is configured; production already requires a secret encryption key
- legacy/default clusters are backfilled lazily through EnsureClusterAuthority
- backend signs join-token scope material, node approval/bootstrap material, and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when authority_required=true or signature fields are present
- node-agent can pin RAP_CLUSTER_AUTHORITY_PUBLIC_KEY and RAP_CLUSTER_AUTHORITY_FINGERPRINT, and identity state can store the same trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output, approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner bootstrap, cluster creation, signed join token, real node-agent enrollment, owner approval, automatic signed bootstrap polling, authority pin persistence, heartbeat, and signed synthetic config verification all passed
- rap-node-agent desired-workload polling/status reporting is gated by RAP_WORKLOAD_SUPERVISION_ENABLED=false by default while service runtime supervision remains a stub
Node enrollment bootstrap polling is also complete:
- backend exposes /node-agents/enrollments/{requestID}/bootstrap
- pending agents prove cluster_id, node_fingerprint, and public_key before receiving status/bootstrap material
- rap-node-agent stores pending_join_request_id, polls approval, verifies the signed bootstrap contract, then persists node_id, identity_status, and cluster authority pin into identity.json
- polling is controlled by RAP_ENROLLMENT_POLL_INTERVAL_SECONDS and RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS
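The interval/timeout polling loop above can be sketched roughly like this. The two environment variable names are from the notes; the fallback values (5 s and 300 s), the helper names envSeconds and pollApproval, and the check callback standing in for the bootstrap endpoint call are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
	"time"
)

var errPollTimeout = errors.New("enrollment poll timed out")

// envSeconds reads a positive whole number of seconds from an environment
// variable and falls back to the given default when unset or invalid.
func envSeconds(name string, fallback int) time.Duration {
	if v, err := strconv.Atoi(os.Getenv(name)); err == nil && v > 0 {
		return time.Duration(v) * time.Second
	}
	return time.Duration(fallback) * time.Second
}

// pollApproval calls check once per interval until it reports approval,
// returns an error, or the next attempt would land past the timeout.
// It returns the number of attempts made.
func pollApproval(interval, timeout time.Duration, check func() (bool, error)) (int, error) {
	deadline := time.Now().Add(timeout)
	attempts := 0
	for {
		attempts++
		approved, err := check()
		if err != nil {
			return attempts, err
		}
		if approved {
			return attempts, nil
		}
		if time.Now().Add(interval).After(deadline) {
			return attempts, errPollTimeout
		}
		time.Sleep(interval)
	}
}

func main() {
	interval := envSeconds("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5)
	timeout := envSeconds("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 300)
	fmt.Println("configured:", interval, timeout)

	// Demo with short durations and a fake check that approves on call 3.
	calls := 0
	attempts, err := pollApproval(10*time.Millisecond, time.Second, func() (bool, error) {
		calls++
		return calls >= 3, nil
	})
	fmt.Println(attempts, err)
}
```

In the real agent, the signed bootstrap contract would still need to be verified after approval and before anything is written to identity.json.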
Current state:
- C17Z12 added rendezvous/relay control-plane leases for peers that would otherwise stay in waiting_rendezvous.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate relay-pool scoring.
- C17Z16 added Control Plane route_path_decisions.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane route_path_decisions to synthetic route-health route config only. The synthetic fabric.route_health runtime now probes the selected effective path, including replacement relay paths, and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay scoring. Drift/unreachable/failure feedback marks the exact selected relay stale and can trigger replacement; healthy low-latency route-health boosts alternate relay score reasons. Migration 000022 adds the synthetic mesh service class, and web-admin marks relay policy rh feedback.
- C17Z20 closes the node-side feedback loop. After node-agent reports synthetic route-health drift/unreachable/failure, it performs a bounded node-scoped synthetic-config refresh, applies returned replacement route decisions to route-health config immediately, and reports c17z20.mesh_route_health_feedback_refresh_report.v1.
- Backend mesh_latest_links now keeps latest observations per observation type/route, so synthetic_route_health is not overwritten by peer_connection_manager.
- Web-admin Fabric links now show observation type, selected relay, and route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not forward RDP/VPN/service payloads, does not start VPN runtime, and does not implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test lifecycle-smoke verified in run dev-bootstrap-20260428-201430.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in 000021_cluster_authority_keys; the migration now drops/recreates cluster_admin_summaries in up/down paths.
Runtime report:
- artifacts/c17z18-route-health-effective-path-report.md
- artifacts/c17z19-route-health-feedback-report.md
- artifacts/c17z19-route-health-feedback-smoke-result.json
- artifacts/c17z20-route-health-feedback-refresh-report.md
- artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md
- artifacts/c18w-service-channel-route-manager-smoke-result.json
- artifacts/c18x-service-channel-logical-channel-smoke-result.json
- artifacts/c18y-route-intent-lifecycle-smoke-result.json
- artifacts/c18z-service-channel-load-smoke-result.json
- artifacts/c18z1-live-service-channel-ingress-smoke-result.json
- artifacts/c18z2-live-service-channel-soak-smoke-result.json
- artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json
- artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json
- artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json
- artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json
- artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json
- artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json
- artifacts/c18z9-live-service-channel-route-pool-smoke-result.json
- artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json
- artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json
- artifacts/c18z12-service-channel-route-quality-smoke-result.json
- artifacts/c18z13-live-service-channel-route-quality-smoke-result.json
- artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json
- artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json
- artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json
- artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json
- artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json
- artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json
- artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json
- Docker-test smoke command: pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning
- Dev lifecycle smoke command: pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning
- Last proven runtime run: c17z18-20260428-221601 (legacy smoke script name, current C17Z20 node-agent code)
- Last proven dev lifecycle run: dev-bootstrap-20260428-201430
- Admin: http://192.168.200.61:18080/
- C17Z20 multi-agent API: http://192.168.200.61:18120/api/v1
- C17Z19 backend-only API: http://192.168.200.61:18122/api/v1
- Dev lifecycle API: http://192.168.200.61:18121/api/v1
Do not automatically continue into:
- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes
Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness window proof:
C18Z48 through C18Z109 are complete. Backend is deployed on docker-test as
rap-backend:fabric-service-channel-0.2.281-c18z109; migration
000029_fabric_service_channel_leases is applied on the shared test database.
Node-agent image rap-node-agent:0.2.270-c18z95 is built and deployed on
test-1/2/3; web-admin is rebuilt and deployed to rap_web_admin.
All three test nodes run the C18Z92 image and are healthy and current after the
policy update. Node-agent still requires signed service-channel lease authority
when cluster authority is pinned, but when legacy clients cannot send signed
lease headers, it now calls backend introspection before accepting the unsigned
token.
Accepted ingress is visible as accepted_by=signed|introspection|legacy_unsigned
in structured node logs and via X-RAP-Service-Channel-Accepted-By on HTTP
packet ingress. Durable introspection stores only token_hash plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
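A rough sketch of how an ingress request could be classified into the accepted_by paths and counted per path. The three label values are from the notes; the ingressRequest fields, the introspect callback, and the rule that legacy_unsigned applies only when no cluster authority is pinned are assumptions, not confirmed node-agent behavior.

```go
package main

import "fmt"

// ingressRequest is a hypothetical summary of what the entry runtime knows
// about one incoming packet request.
type ingressRequest struct {
	AuthorityPinned bool // node has a pinned cluster authority
	HasSignedLease  bool // client sent signed lease headers
	SignatureValid  bool // signature verified against the pinned authority
}

// acceptedBy returns the acceptance path label and whether the request is
// accepted. introspect stands in for the backend introspection call.
func acceptedBy(req ingressRequest, introspect func() bool) (string, bool) {
	if req.HasSignedLease {
		if req.SignatureValid {
			return "signed", true
		}
		return "", false // a bad signature is never downgraded
	}
	if req.AuthorityPinned {
		if introspect() {
			return "introspection", true
		}
		return "", false
	}
	// Assumption: unsigned tokens without a pinned authority use the legacy path.
	return "legacy_unsigned", true
}

func main() {
	counters := map[string]int{}
	requests := []ingressRequest{
		{AuthorityPinned: true, HasSignedLease: true, SignatureValid: true},
		{AuthorityPinned: true},
		{},
	}
	for _, r := range requests {
		if path, ok := acceptedBy(r, func() bool { return true }); ok {
			counters[path]++
		}
	}
	fmt.Println(counters)
}
```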
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct vpn_packets route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with force_backend_fallback=false
while route feedback becomes fenced and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
remediation_action, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends prefer_alternate_route while keeping
force_backend_fallback=false. C18Z57 turns that recommendation into a bounded
machine-readable remediation_command on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
config and node-agent consumes prefer_alternate_route as an explicit
route-manager applied decision with source
service_channel_remediation_command. C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
last_selected_route_id and flow-scheduler last_route_id on the replacement
route, with no local/backend fallback and no route send failures. C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
ingress: X-RAP-Traffic-Class can mark control, interactive, reliable,
bulk, or droppable; default traffic remains backward-compatible bulk.
Unit tests prove scheduler priority order, and live smoke proves a bulk
128-packet pressure batch plus an interactive packet both move over the
replacement route with separate traffic-class flow stats and no fallback,
drops, or route failures. C18Z63 adds a controlled concurrent runtime proof: a
bulk traffic-class send is held in-flight while an independent interactive
traffic-class packet is sent through the same ingress, and interactive completes
before bulk release with MaxInFlight >= 2, no drops, and no failures.
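The traffic-class marking can be illustrated with a small sketch. The class names and the bulk default are from the notes. The priority order used here (control first, droppable last) is inferred from the order in which the classes are listed and is an assumption, as is mapping unknown header values to bulk; parseTrafficClass and nextClass are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
)

// classOrder lists traffic classes from highest to lowest scheduling
// priority. The order is an assumption for this sketch.
var classOrder = []string{"control", "interactive", "reliable", "bulk", "droppable"}

// parseTrafficClass normalizes an X-RAP-Traffic-Class header value.
// Missing or unknown values fall back to bulk for backward compatibility.
func parseTrafficClass(headerValue string) string {
	v := strings.ToLower(strings.TrimSpace(headerValue))
	for _, c := range classOrder {
		if v == c {
			return c
		}
	}
	return "bulk"
}

// nextClass picks the highest-priority class that has pending packets,
// or "" when nothing is queued.
func nextClass(pending map[string]int) string {
	for _, c := range classOrder {
		if pending[c] > 0 {
			return c
		}
	}
	return ""
}

func main() {
	pending := map[string]int{}
	pending[parseTrafficClass("")] += 128 // unmarked pressure batch
	pending[parseTrafficClass("interactive")]++
	fmt.Println(nextClass(pending), pending)
}
```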
C18Z64 adds compact runtime telemetry: rap.fabric_flow_scheduler.v1 snapshots
include traffic_class_counts, so backend/admin/diagnostics can show active
flow-channel counts per traffic class without scanning each channel stat. It is
live-proven on rap-node-agent:0.2.239-c18z64; latest test-1 snapshot showed
bulk=32, interactive=12, drops 0. C18Z65/C18Z66 project those counts and
flow pressure fields into backend access telemetry at node, active-channel, and
cluster aggregate levels, and web-admin shows cluster/node/channel flow QoS
visibility. Live aggregate API result showed bulk=32, interactive=12,
flow_channel_count=44, flow_max_in_flight=4. C18Z67 adds a live HTTP
concurrent QoS proof: six parallel bulk service-channel requests ran while an
interactive traffic-class request was injected on the same entry path after
remediation; the interactive request completed in 132 ms, all 6 bulk requests
were accepted, 3072 post-remediation packets moved over the replacement route,
32 bulk and 12 interactive replacement-route flow stats were observed, and
fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
flow_health_status and flow_health_reason are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class recommended_parallel_windows plus
adaptive_backpressure_active/reason, and the ingress send path uses the
traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
reliable is reduced moderately, and control/interactive keep their full window
unless their own class degrades. Live smoke verified bulk=1, droppable=1,
reliable=3, interactive=4, control=4, no drops, and
bulk_window_reduced_to_protect_interactive. C18Z70 projects those adaptive
runtime fields into backend/admin access telemetry at cluster, node, and
active-channel levels. Cluster windows are aggregated by minimum non-zero
per-class recommendation, and web-admin shows adaptive window chips beside flow
health/QoS. Live API artifact shows adaptive=true,
bulk_window_reduced_to_protect_interactive, and windows bulk=1,
droppable=1, reliable=3, interactive=4, control=4. C18Z71 adds the
cluster-level adaptive policy contract:
GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy.
The policy stores audited thresholds and class windows in cluster metadata,
projects the effective fingerprint into signed node-scoped synthetic config,
and node-agent heartbeat/runtime telemetry reports adaptive_policy_fingerprint.
The node scheduler consumes the policy at runtime; default policy preserves
bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
proved an operator policy with max window 6 and bulk=2 changes the live
recommended windows without breaking interactive/control. A signed-config hash
mismatch found during the smoke was fixed by preserving all signed adaptive
policy provenance fields in the node-agent client model. C18Z72 adds the
cluster-level pool/failover policy contract:
GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy. Lease
issuance now applies the effective entry/exit pool constraints and preferred
entry/exit before route selection, stores the effective policy on the lease,
and signs it into rap.fabric_service_channel_lease_authority.v1. Live smoke
proved a policy-constrained lease selects only the policy entry/exit from a
wider requested pool and carries matching signed pool_policy provenance.
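A minimal sketch of applying a pool constraint and a preferred node before route selection. The function name constrainPool, the treatment of an empty policy pool as "no constraint", and the node names are assumptions; the notes only state that a constrained lease selects the policy entry/exit from a wider requested pool.

```go
package main

import "fmt"

// constrainPool keeps only requested nodes that the policy allows, in the
// requested order, then moves the preferred node (if present) to the front.
// An empty allowed list is treated as "no constraint" in this sketch.
func constrainPool(requested, allowed []string, preferred string) []string {
	allow := make(map[string]bool, len(allowed))
	for _, n := range allowed {
		allow[n] = true
	}
	var out []string
	for _, n := range requested {
		if len(allowed) == 0 || allow[n] {
			out = append(out, n)
		}
	}
	for i, n := range out {
		if n == preferred {
			copy(out[1:i+1], out[:i]) // shift earlier entries right by one
			out[0] = n
			break
		}
	}
	return out
}

func main() {
	requested := []string{"test-1", "test-2", "test-3"}
	fmt.Println(constrainPool(requested, []string{"test-2"}, ""))
	fmt.Println(constrainPool(requested, []string{"test-1", "test-3"}, "test-3"))
}
```

The same helper would be applied separately to the entry pool and the exit pool before any route is scored.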
C18Z73 projects that signed pool-policy fingerprint into active access
telemetry and guards remediation commands against routes outside the signed
lease pools. C18Z74 correlates active remediation commands with entry-node
route-manager heartbeats and reports execution states such as
waiting_node_apply, applied, rejected_by_policy_guard,
pending_rebuild_request, and expired. C18Z75 records rebuild_route
remediation as durable rebuild ledger intent rows when node-scoped synthetic
config is fetched, and access telemetry reports rebuild_request_recorded or
rebuild_request_rejected. C18Z76 makes the allowed rebuild_route command
visible from the node side: node-agent consumes it as a route-manager
pending_degraded_fallback decision with source
service_channel_remediation_command, and backend access telemetry correlates
that with the durable ledger as rebuild_request_recorded_node_pending.
C18Z77 resolves durable remediation rebuild requests inside the shared Control
Plane planner: signed-pool-valid alternates become applied /
replacement_selected and are projected as route-manager decisions with the
same command id, missing safe alternates become no_alternate, lease/policy
blocks become deferred_by_policy, and stale commands become expired.
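The C18Z77 planner outcomes can be sketched as a small resolution function. The outcome strings are from the notes; the struct shapes, the order in which conditions are checked (expiry, then policy block, then alternates), and the name resolveRebuild are assumptions for illustration.

```go
package main

import "fmt"

// alternate is a hypothetical candidate replacement route.
type alternate struct {
	RouteID      string
	InSignedPool bool // route stays inside the signed lease pools
	Fenced       bool // route currently has fenced feedback
}

// rebuildRequest is a hypothetical durable remediation rebuild request.
type rebuildRequest struct {
	CommandID     string
	ExpiresAtUnix int64
	PolicyBlocked bool // lease or policy forbids rebuilding right now
	Alternates    []alternate
}

type outcome struct {
	Status, Reason, RouteID, CommandID string
}

// resolveRebuild maps one request to a planner outcome.
func resolveRebuild(req rebuildRequest, nowUnix int64) outcome {
	switch {
	case nowUnix >= req.ExpiresAtUnix:
		return outcome{Status: "expired", CommandID: req.CommandID}
	case req.PolicyBlocked:
		return outcome{Status: "deferred_by_policy", CommandID: req.CommandID}
	}
	for _, a := range req.Alternates {
		if a.InSignedPool && !a.Fenced {
			// The decision keeps the same command id for later correlation.
			return outcome{Status: "applied", Reason: "replacement_selected",
				RouteID: a.RouteID, CommandID: req.CommandID}
		}
	}
	return outcome{Status: "no_alternate", CommandID: req.CommandID}
}

func main() {
	req := rebuildRequest{CommandID: "cmd-1", ExpiresAtUnix: 2000, Alternates: []alternate{
		{RouteID: "route-b", InSignedPool: false},
		{RouteID: "route-c", InSignedPool: true},
	}}
	fmt.Printf("%+v\n", resolveRebuild(req, 1000))
}
```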
C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
applied branch live by adding an alternate route after lease issuance and
verifying the existing rebuild command resolves to rebuild_request_applied.
C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
rebuild_request_id, the transition is applied_rebuild, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops. C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed interactive, bulk, and reliable VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
C18Z81 adds the negative/rollback proof: after the initial replacement is
applied and used, a generation-valid fenced feedback report for that
replacement causes the Control Plane to select a new safe recovery route; live
traffic then moves to the recovery route, the degraded replacement is not
reselected, and fallback/failure/drop deltas stay zero for the recovery send.
The C18Z81 work also tightened older smoke checks to use per-run counter deltas
instead of absolute cumulative runtime counters.
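A small sketch of the per-run delta check. The counter names and the reset handling (treating a counter that went backwards as restarted from zero) are assumptions; the notes only say the smokes compare deltas rather than absolute cumulative counters.

```go
package main

import "fmt"

// runtimeCounters is a hypothetical snapshot of cumulative runtime counters.
type runtimeCounters struct {
	BackendFallbacks  uint64
	RouteSendFailures uint64
	FlowDrops         uint64
}

// sub returns after-before, treating a decrease as a counter reset
// (for example an agent restart) so the delta never underflows.
func sub(before, after uint64) uint64 {
	if after < before {
		return after
	}
	return after - before
}

// delta computes what happened between two snapshots of the same node.
func delta(before, after runtimeCounters) runtimeCounters {
	return runtimeCounters{
		BackendFallbacks:  sub(before.BackendFallbacks, after.BackendFallbacks),
		RouteSendFailures: sub(before.RouteSendFailures, after.RouteSendFailures),
		FlowDrops:         sub(before.FlowDrops, after.FlowDrops),
	}
}

// isZero reports whether no fallback, failure, or drop occurred in the run.
func (c runtimeCounters) isZero() bool {
	return c == runtimeCounters{}
}

func main() {
	// Earlier smoke runs left non-zero cumulative values behind.
	before := runtimeCounters{BackendFallbacks: 2, RouteSendFailures: 1}
	after := runtimeCounters{BackendFallbacks: 2, RouteSendFailures: 1}
	fmt.Println(delta(before, after).isZero())
}
```

An absolute check would fail this run because of history from earlier smokes, while the delta check passes it.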
C18Z82 closes the no-safe-recovery branch: after the replacement route reports
generation-valid fenced feedback and no new safe recovery route is created,
node-scoped synthetic config surfaces service_channel_feedback_no_alternate
with pending_degraded_fallback, no_unfenced_alternate_route, and
backend_relay_degraded_fallback_until_rebuild, proving the Control Plane
exposes a degraded/no-alternate state instead of silently sticking to a bad
replacement.
C18Z83 projects those route-manager decisions into active access telemetry and
web-admin: active channels now expose route-decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
The live smoke proves the no-safe state is visible through access telemetry as
service_channel_feedback_no_alternate /
pending_degraded_fallback, with operator execution state remaining compatible
with durable ledger rebuild_request_no_alternate.
C18Z84 aggregates those per-channel decisions at the access-telemetry summary
level: route-decision channel count, replacement decision count, applied
rebuild count, recovery decision count, and no-safe recovery count are exposed
to the API and web-admin summary chips. The no-safe branch now prioritizes the
aggregate status reason active_channels_no_safe_recovery over generic missing
access-report noise.
C18Z85 projects access-decision aggregates into rebuild health and incident
diagnostics. Health summary now carries access decision counts and prioritizes
inspect_access_no_safe_recovery_route_pool_and_signed_policy when no-safe is
active. Rebuild incidents now include incident_source=access_decision rows
for active channel decisions such as access_no_safe_recovery, with bad
severity and channel id, so operators see route-decision failures beside ledger
incidents.
C18Z86 adds silence/acknowledgement behavior for those
incident_source=access_decision incidents. Silence requests now carry
incident_source and channel_id; access-decision no-safe silences are stored
with a channel-scoped route key, applied back into rebuild health/incidents,
and exact current-generation incidents stop contributing to active bad count.
Generation-changing access-decision resurfacing is unit-tested; the live smoke
proves the operator silence path on docker-test.
C18Z87 exposes active rebuild/access-decision silences to operators and adds
unsilence. The API now lists active rebuild alert silences, returns
access-decision incident_source, channel_id, and display route id, and
allows deleting a silence by id. Web-admin shows an Active rebuild silences
table with an unsilence action. The live smoke proves list -> silence ->
unsilence and verifies the access no-safe incident becomes active again.
C18Z88 makes access-decision resurfacing operator-visible in live runtime.
Access-decision incidents now expose the silence id they resurfaced from, the
previous acknowledged generation, and the silence expiry. The live smoke
proves: access no-safe incident -> silence current generation -> wait for a new
route-decision generation -> incident returns as alert_resurfaced=true, active
bad count is restored, and previous generation metadata is preserved.
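The generation-scoped silence and resurfacing behavior can be sketched like this. The alert_resurfaced idea, the silence id, and the previous generation are from the notes; the struct shapes, the name evaluateSilence, and treating an expired silence as simply inactive are assumptions.

```go
package main

import "fmt"

// silence is a hypothetical stored acknowledgement for one channel and
// one route-decision generation.
type silence struct {
	ID            string
	ChannelID     string
	Generation    int64
	ExpiresAtUnix int64
}

// incidentView is what an operator-facing incident row would carry.
type incidentView struct {
	Silenced                bool
	AlertResurfaced         bool
	ResurfacedFromSilenceID string
	PreviousGeneration      int64
}

// evaluateSilence applies a silence to an incident. An exact generation
// match silences it; a newer generation resurfaces it and keeps the
// previous acknowledgement metadata.
func evaluateSilence(channelID string, generation int64, s *silence, nowUnix int64) incidentView {
	if s == nil || s.ChannelID != channelID || nowUnix >= s.ExpiresAtUnix {
		return incidentView{} // plain active incident
	}
	if s.Generation == generation {
		return incidentView{Silenced: true}
	}
	return incidentView{
		AlertResurfaced:         true,
		ResurfacedFromSilenceID: s.ID,
		PreviousGeneration:      s.Generation,
	}
}

func main() {
	s := &silence{ID: "sil-1", ChannelID: "ch-1", Generation: 4, ExpiresAtUnix: 5000}
	fmt.Printf("%+v\n", evaluateSilence("ch-1", 4, s, 1000))
	fmt.Printf("%+v\n", evaluateSilence("ch-1", 5, s, 1000))
}
```

Only rows that are not silenced would contribute to the active bad count in this model.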
C18Z89 closes the resurfaced-incident operator action loop for generation
changes. Resurfaced access-decision incidents now expose
alert_resurfaced_cause, previous route id, and previous channel id; web-admin
shows the cause beside resurfaced incidents. The live smoke proves the operator
can re-acknowledge the resurfaced generation, the active-channel decision
context matches the incident route/generation, and the current generation
returns to a silenced state.
C18Z90 introduces the explicit signed production data-plane contract on
service-channel leases. data_plane is now part of the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
that control-plane traffic uses backend API, working data uses the fabric
service channel over fabric routes, backend relay is degraded fallback only,
production forwarding is required, and logical flows are service-neutral,
protocol-agnostic, and isolated. Web-admin shows this contract in the
service-channel lease table.
C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
signed lease says backend_relay_policy=disabled, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
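A hedged sketch of the disabled-fallback rule. The backend_relay_policy=disabled value, the 503 response, and the backend_fallback_blocked reason are from the notes; the sendResult struct, the callback-based sendWorkingData helper, the success status, and the other status/reason values are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errRouteDown = errors.New("fabric route send failed")

// sendResult is a hypothetical summary of one working-data send attempt.
type sendResult struct {
	Status    int
	Transport string
	Reason    string
}

// sendWorkingData tries the fabric route first. A nil fabricSend models a
// missing fabric route. Backend relay is used only when policy allows it.
func sendWorkingData(backendRelayPolicy string, fabricSend, relaySend func() error) sendResult {
	if fabricSend != nil {
		if err := fabricSend(); err == nil {
			return sendResult{Status: http.StatusOK, Transport: "fabric_route"}
		}
	}
	if backendRelayPolicy == "disabled" {
		// Visible refusal instead of silently proxying through backend relay.
		return sendResult{Status: http.StatusServiceUnavailable, Reason: "backend_fallback_blocked"}
	}
	if err := relaySend(); err != nil {
		return sendResult{Status: http.StatusBadGateway, Reason: "backend_relay_failed"}
	}
	return sendResult{Status: http.StatusOK, Transport: "backend_relay", Reason: "degraded_fallback"}
}

func main() {
	fail := func() error { return errRouteDown }
	ok := func() error { return nil }
	fmt.Printf("%+v\n", sendWorkingData("disabled", fail, ok))
	fmt.Printf("%+v\n", sendWorkingData("degraded_fallback_only", fail, ok))
}
```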
C18Z93 promotes that data-plane contract telemetry into backend access
telemetry and web-admin active-channel diagnostics: cluster, node, and
active-channel rows now show contract adoption count, last working transport,
steady-state transport, backend relay policy, data-plane mode, and logical
flow mode.
C18Z94 turns those data-plane/fallback signals into operator incidents.
data_plane_contract incident rows are now emitted for missing data-plane
contract reports after accepted service-channel traffic, wrong working or
steady-state transport, wrong logical flow mode, disabled backend relay
observed, and degraded backend relay usage. The incident list/readiness path
can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports backend_fallback_blocked, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and data_plane_contract incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports fabric_route_send_failed_backend_fallback_blocked, backend
correlates the entry node's active service-channel leases, records fenced
fabric_service_channel_route_feedback for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
fallback from becoming a dead-end operator alert.
C18Z97 adds bounded deduplication for those access-report-derived route
feedback records. Repeated blocked-fallback send-failure heartbeats no longer
rewrite the same active fenced feedback or churn planner rebuild attempts while
the first access-report feedback is still active. Runtime feedback from the
flow scheduler remains independent.
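One way to picture the bounded deduplication is a time-bounded active set keyed by channel, route, and reason. The key fields, the TTL mechanism, and the names feedbackDedup and shouldRecord are assumptions; the notes only state that repeated heartbeats do not rewrite feedback while the first record is still active.

```go
package main

import "fmt"

// feedbackKey identifies one access-report-derived feedback record.
type feedbackKey struct {
	ChannelID, RouteID, Reason string
}

// feedbackDedup remembers until when each feedback record is active.
type feedbackDedup struct {
	activeUntil map[feedbackKey]int64
}

func newFeedbackDedup() *feedbackDedup {
	return &feedbackDedup{activeUntil: map[feedbackKey]int64{}}
}

// shouldRecord reports whether a new feedback record should be written.
// It returns false while an earlier record for the same key is active,
// and drops expired entries so the set stays bounded by active feedback.
func (d *feedbackDedup) shouldRecord(k feedbackKey, nowUnix, ttlSeconds int64) bool {
	for key, until := range d.activeUntil {
		if nowUnix >= until {
			delete(d.activeUntil, key)
		}
	}
	if _, active := d.activeUntil[k]; active {
		return false
	}
	d.activeUntil[k] = nowUnix + ttlSeconds
	return true
}

func main() {
	d := newFeedbackDedup()
	k := feedbackKey{"ch-1", "route-a", "fabric_route_send_failed_backend_fallback_blocked"}
	for _, now := range []int64{100, 110, 120, 160} {
		fmt.Println(now, d.shouldRecord(k, now, 60))
	}
}
```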
C18Z98 carries that feedback identity into the replacement decision and
rebuild-attempt ledger: decision and ledger rows now expose
feedback_observation_id, feedback_source, feedback observed/expiry time,
channel/resource ids, and data-plane violation status/reason. Web-admin shows
that correlation in Route decisions and Rebuild ledger.
C18Z99 adds rebuild ledger filters for those correlation fields. The backend
/fabric/service-channels/rebuild-attempts API accepts feedback_source,
feedback_channel_id, and feedback_violation_status, and web-admin exposes
the same filters in the rebuild ledger form. The live smoke proves source,
channel, violation, combined filters, and wrong-channel exclusion.
C18Z100 adds rebuild-health feedback breakdown aggregation for the same
correlation fields. The backend rebuild-health summary now returns
feedback_breakdowns grouped by feedback source, feedback channel id, and
feedback violation status, including total/good/warn/bad/unknown counts,
active warn/bad counts, silenced count, latest observation time, and affected
reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
panel so operators can see which access-report-derived failure classes dominate
active warn/bad rebuild state.
C18Z101 wires that breakdown into operator workflow in web-admin. Each
feedback-breakdown row now shows related incident context by channel/reporter/
route overlap and has an open ledger action that switches to the deep rebuild
ledger with feedback_source, feedback_channel_id, and
feedback_violation_status prefilled from the breakdown row.
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
investigation endpoint now accepts feedback source/channel/violation drilldown
payloads, records
fabric.service_channel_rebuild_feedback_breakdown.investigation_opened
cluster audit events, and web-admin records one before opening the filtered
deep ledger from a rebuild-health feedback breakdown row.
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
Web-admin now filters the loaded cluster audit list for rebuild incident and
feedback-breakdown investigation events and shows recent drilldowns with time,
source, feedback filters, target reporter/route, actor, and reason beside
rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now
accepts event_type and target_type filters, including repeated or
comma-separated values, and web-admin loads recent fabric investigation
breadcrumbs with a dedicated filtered request instead of depending on the
generic latest-100 cluster audit list.
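Accepting both repeated and comma-separated filter values can be sketched with the standard net/url package. The parameter names event_type and target_type are from the notes; the helper names, the trimming, and the de-duplication are assumptions, and the event type values in the demo are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// filterValues collects every value for key, splitting comma-separated
// entries, trimming whitespace, and dropping empties and duplicates while
// keeping first-seen order.
func filterValues(q url.Values, key string) []string {
	var out []string
	seen := map[string]bool{}
	for _, raw := range q[key] {
		for _, part := range strings.Split(raw, ",") {
			v := strings.TrimSpace(part)
			if v != "" && !seen[v] {
				seen[v] = true
				out = append(out, v)
			}
		}
	}
	return out
}

// parseFilter is a convenience wrapper for a raw query string.
func parseFilter(rawQuery, key string) ([]string, error) {
	q, err := url.ParseQuery(rawQuery)
	if err != nil {
		return nil, err
	}
	return filterValues(q, key), nil
}

func main() {
	// Placeholder event types; real values would be audit event names.
	types, err := parseFilter("event_type=a,b&event_type=c&target_type=x", "event_type")
	fmt.Println(types, err)
}
```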
C18Z105 correlates those focused audit breadcrumbs back to currently visible
diagnostics in web-admin. Recent investigation rows now show whether the
breadcrumb still matches an active rebuild-health feedback breakdown or visible
rebuild incident, and provide an open action to jump back into the matching
filtered ledger path.
C18Z106 moves that correlation into the backend/API. GET /audit with
correlation=fabric_diagnostics now returns correlation_hints for focused
fabric investigation breadcrumbs, including current diagnostic status
(breakdown_active, incident_visible, or not_visible) and the matching
breakdown/incident object when present. Web-admin consumes those hints and keeps
its previous local matching as fallback. During verification the noisy test
history exposed that rebuild-health feedback breakdowns were capped too tightly;
the backend now returns up to 100 breakdown groups so fresh failure classes are
not pushed out by older smoke history.
C18Z107 adds a compact backend-provided audit_summary beside audit_events.
For focused Fabric diagnostics audit reads, the summary includes total count,
counts by event/target type, counts by current diagnostic status, counts by
feedback source/violation status, correlated count, not-visible count, and
latest time. Web-admin shows these as Recent investigations chips and short
source/violation lines without recalculating the aggregate in the browser.
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
cluster audit read path. Backend now exposes
GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs
with a dedicated rebuild_investigation_breadcrumbs contract containing
events plus summary. Web-admin uses this endpoint for Recent investigations
and keeps generic audit semantics separate from Fabric diagnostics workflow
state.
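For orientation, a client of the dedicated endpoint could build the request path and unpack the contract as sketched here. The path and the events/summary members come from the description above; the base URL, the function names, and the assumption that the response wraps the contract under a rebuild_investigation_breadcrumbs key are hypothetical.

```python
from typing import Tuple
from urllib.parse import quote


def breadcrumbs_url(base_url: str, cluster_id: str) -> str:
    """Build the dedicated breadcrumbs URL for one cluster."""
    return (
        f"{base_url.rstrip('/')}/clusters/{quote(cluster_id, safe='')}"
        "/fabric/service-channels/rebuild-investigations/breadcrumbs"
    )


def unpack_breadcrumbs(payload: dict) -> Tuple[list, dict]:
    """Split a response into (events, summary).

    Assumption: the contract is nested under 'rebuild_investigation_breadcrumbs'.
    Missing members default to empty values rather than raising.
    """
    contract = payload.get("rebuild_investigation_breadcrumbs", {})
    return contract.get("events", []), contract.get("summary", {})
```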
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
endpoint accepts current_window_seconds and history_window_seconds, annotates
each breadcrumb with correlation_hints.breadcrumb_status (current, stale,
or expired) plus age/window seconds, returns current/stale/expired totals, and
adds counts_by_breadcrumb_status to the summary. Web-admin shows freshness
chips and an age column in Recent investigations, so operators can separate live
workflow hints from stale history without deleting audit records.
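The freshness classification can be sketched as a pure function of breadcrumb age and the two windows. The status names and parameter names follow the contract above; the inclusive boundaries (an age exactly equal to a window still falls inside it) and the function names are assumptions.

```python
from collections import Counter
from typing import Iterable


def breadcrumb_status(age_seconds: float,
                      current_window_seconds: float,
                      history_window_seconds: float) -> str:
    """Classify one breadcrumb as current, stale, or expired.

    Assumption: window boundaries are inclusive.
    """
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def counts_by_breadcrumb_status(ages: Iterable[float],
                                current_window_seconds: float,
                                history_window_seconds: float) -> dict:
    """Return totals per status, always including all three keys."""
    counts = Counter(
        breadcrumb_status(a, current_window_seconds, history_window_seconds)
        for a in ages
    )
    return {s: counts.get(s, 0) for s in ("current", "stale", "expired")}
```

Because the status is derived at read time from age and windows, nothing is deleted: an expired breadcrumb is still a stored audit record, only labeled differently.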
Live verification passed with these smoke scripts:
- scripts/fabric/c18z48-service-channel-introspection-smoke.ps1
- scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1
- scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1
- scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1
- scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1
- scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1
- scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1
- scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1
- scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1
- scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1
- scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1
- scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1
- scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1
- scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1
- scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1
- scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1
- scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1
- scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1
Verification artifacts:
- artifacts/c18z48-service-channel-introspection-smoke-result.json
- artifacts/c18z50-service-channel-durable-introspection-smoke-result.json
- artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json
- artifacts/c18z52-service-channel-access-telemetry-smoke-result.json
- artifacts/c18z53-service-channel-access-correlation-smoke-result.json
- artifacts/c18z54-service-channel-normal-route-access-smoke-result.json
- artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json
- artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json
- artifacts/c18z57-service-channel-remediation-command-smoke-result.json
- artifacts/c18z58-service-channel-remediation-apply-smoke-result.json
- artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json
- artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json
- artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json
- artifacts/c18z62-service-channel-remediation-qos-smoke-result.json
- artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl
- artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl
- artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json
- artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json
- artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json
- artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json
- artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json
- artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json
- artifacts/c18z68-service-channel-flow-health-api-result.json
- artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json
- artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json
- artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json
- artifacts/c18z72-service-channel-pool-policy-smoke-result.json
- artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json
- artifacts/c18z74-service-channel-remediation-execution-smoke-result.json
- artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json
- artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json
- artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json
- artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json
- artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json
- artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json
- artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json
- artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json
- artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json
- artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json
- artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json
- artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json
- artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json
- artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json
- artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json
- artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json
- artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json
- artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json
- artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json
- artifacts/c18z94-data-plane-contract-incident-smoke-result.json
- artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json
- artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json
- artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json
- artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json
- artifacts/c18z99-rebuild-correlation-filter-smoke-result.json
- artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json
- artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json
- artifacts/c18z104-focused-fabric-audit-smoke-result.json
- artifacts/c18z106-audit-correlation-hints-smoke-result.json
- artifacts/c18z107-audit-correlation-summary-smoke-result.json
- artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json
- artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json
Current active continuation after C20Z6:
C20Z1 through C20Z6 are implemented and runtime-smoke-proven. The C20 stage is
terminal-complete by contract. It opened and validated a new explicit
real-adapter enablement request as a contract-only transition,
rap.remote_workspace_real_adapter_c20_stage_terminal_complete.v1, with:
- terminal_status=stage_terminal_complete_contract_only
- stage_status=complete_no_more_c20_layers_required
- stage_name=c20_real_adapter_new_explicit_enablement_request
- validation_chain_status=complete_contract_only
- enablement_boundary=runtime_enablement_requires_next_explicit_runtime_stage
- enablement_decision=validated_contract_only_not_enabled
- enablement_status=validated_not_enabled
- runtime_gate_state=validated_contract_only_not_enabled
- runtime_effect=contract_only_no_runtime_enablement
- operator_default_action=keep_real_adapter_disabled_until_next_explicit_runtime_stage
- next_allowed_entrypoint=next_explicit_runtime_enablement_stage_only
- allows_process_start=false
- allows_payload_traffic=false
Docker-test test-1/2/3 remain on
rap-node-agent:codex-service-supervisor-20260513z52. Verification artifact:
artifacts/c20z6-remote-workspace-real-adapter-stage-terminal-complete-compatibility-smoke-result.json.
The not-approved factory remains terminal-complete by contract, and C20 is now also terminal-complete by contract. Do not add more C20 continuation layers. The only allowed next entrypoint is a new explicit runtime enablement stage. Keep the real adapter disabled until that new stage explicitly changes runtime state: no process start, no real RDP frame transport, no Android work, no backend relay semantics, and no production adapter payload forwarding.
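A smoke-style guard for this boundary could be as small as the sketch below. The field names and expected values are taken from the terminal-complete contract described in this section; the EXPECTED_DISABLED table, the function name, and the idea of checking only this subset of fields are assumptions for illustration, not the project's actual smoke logic.

```python
# Subset of contract fields that must hold while the real adapter stays disabled.
EXPECTED_DISABLED = {
    "terminal_status": "stage_terminal_complete_contract_only",
    "enablement_status": "validated_not_enabled",
    "runtime_effect": "contract_only_no_runtime_enablement",
    "allows_process_start": False,
    "allows_payload_traffic": False,
}


def contract_violations(contract: dict) -> list:
    """Return the sorted field names that differ from the disabled contract.

    Fail-closed: a missing field counts as a violation, because None never
    equals the expected value.
    """
    return sorted(
        key for key, expected in EXPECTED_DISABLED.items()
        if contract.get(key) != expected
    )
```

An empty result means the payload still matches the contract-only state; any returned field name indicates that runtime state changed outside a new explicit enablement stage.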