Current product decision:

Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
future services as separate transport implementations. The next implementation
work should focus on the shared Fabric Service Channel runtime described in
`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`.

The immediate engineering target is:

- backend service-channel lease/route-generation contract
- node-agent entry runtime for client/service live connections
- service-neutral channel scheduling, bounded queues, route health, and
  failover
- VPN packet flow as the proving service over that common channel
- backend relay only as explicit degraded fallback

Backend service-channel lease/route-generation contract is now started:

- `POST /clusters/{clusterID}/fabric/service-channels/leases` issues
  `rap.fabric_service_channel_lease.v1`
- VPN client profiles embed `fabric_service_channel_lease`
- tests cover ready route and degraded backend-relay fallback behavior
- leases include entry HTTP/WebSocket endpoint templates for the selected
  service channel
- leases include cluster-authority-signed
  `rap.fabric_service_channel_lease_authority.v1` payloads that bind token
  hash, selected route, generation, fencing epoch, and expiry
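
The binding listed in the last bullet can be pictured with a small sketch. This is illustrative only: the field names, the use of SHA-256 for the token hash, and the JSON canonicalization are assumptions, and the Ed25519 signing step itself is left out because it is not described here in detail.

```python
import hashlib
import json


def token_hash(token: str) -> str:
    """Hex SHA-256 of the bearer token (the hash algorithm is an assumption)."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def build_authority_payload(token, route_id, generation, fencing_epoch, expires_at):
    """Collect the fields the lease authority payload is described as binding."""
    return {
        "schema": "rap.fabric_service_channel_lease_authority.v1",
        "token_hash": token_hash(token),
        "selected_route_id": route_id,   # hypothetical field name
        "generation": generation,
        "fencing_epoch": fencing_epoch,
        "expires_at": expires_at,         # unix seconds, hypothetical encoding
    }


def canonical_bytes(payload: dict) -> bytes:
    """Deterministic serialization so signer and verifier see the same bytes."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def binding_matches(payload, token, route_id, generation, fencing_epoch, now):
    """Check a presented token/route against an already signature-verified payload."""
    return (
        payload["token_hash"] == token_hash(token)
        and payload["selected_route_id"] == route_id
        and payload["generation"] == generation
        and payload["fencing_epoch"] == fencing_epoch
        and now < payload["expires_at"]
    )
```

The point of the sketch is that a token presented on a different route, generation, or fencing epoch, or after expiry, fails the check even when the signature over the payload is valid.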

Node-agent entry runtime is now started:

- `rap-node-agent` accepts VPN packet batches through
  `/api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets`
  and `/packets/ws`
- entry runtime requires a `rap_fsc_*` service-channel token and maps packet
  batches to the existing production `vpn_packet` fabric route
- route failure falls back to the canonical backend relay endpoint so degraded
  compatibility remains explicit

Next narrow runtime layer:

- persist cluster-level default window policy for Fabric diagnostics
  investigation breadcrumbs and expose a small admin control for it
- keep this in the shared Fabric Service Channel runtime contract and telemetry
- do not add Android/RDP protocol work in this slice

C17Z20 is complete.

Installation Authority foundation is also complete:

- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- `installation_authority` and signed `platform_role_grants` are persisted
- strict platform-admin checks ignore direct `users.platform_role` edits unless
  a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
  keys and sign activation manifests; private keys must stay outside the repo

Cluster Authority foundation is now also complete:

- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
  a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
  and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
  `authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
  trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
  approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
  bootstrap, cluster creation, signed join token, real node-agent enrollment,
  owner approval, automatic signed bootstrap polling, authority pin
  persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
  `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
  supervision remains a stub

Node enrollment bootstrap polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
  before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
  the signed bootstrap contract, then persists `node_id`, `identity_status`,
  and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
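
As a rough picture of how an interval/timeout pair like the two variables above usually drives a polling loop, here is a minimal sketch. It is not the node-agent code: `fetch_status` is a hypothetical stand-in for the bootstrap request, and the status values `pending`, `approved`, `rejected`, and `timeout` are assumptions.

```python
import time


def poll_bootstrap(fetch_status, interval_seconds, timeout_seconds,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the enrollment leaves the pending state or the timeout passes.

    fetch_status is a zero-argument callable returning a dict with a "status"
    key; sleep and clock are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = fetch_status()
        if result["status"] != "pending":
            return result
        # Stop instead of sleeping past the deadline.
        if clock() + interval_seconds > deadline:
            return {"status": "timeout"}
        sleep(interval_seconds)
```

Verification of the signed bootstrap contract and persistence into `identity.json` would happen on the returned result and are outside this sketch.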

Current state:

- C17Z12 added rendezvous/relay control-plane leases for peers that would
  otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
  for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
  relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
  route-health route config only. The synthetic `fabric.route_health` runtime
  now probes the selected effective path, including replacement relay paths,
  and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
  scoring. Drift/unreachable/failure feedback marks the exact selected relay
  stale and can trigger replacement; healthy low-latency route-health boosts
  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
  service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
  synthetic route-health drift/unreachable/failure, it performs a bounded
  node-scoped synthetic-config refresh, applies returned replacement route
  decisions to route-health config immediately, and reports
  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
  type/route, so `synthetic_route_health` is not overwritten by
  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
  `000021_cluster_authority_keys`; the migration now drops/recreates
  `cluster_admin_summaries` in up/down paths.

Runtime report:

- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- `artifacts/c18w-service-channel-route-manager-smoke-result.json`
- `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
- `artifacts/c18y-route-intent-lifecycle-smoke-result.json`
- `artifacts/c18z-service-channel-load-smoke-result.json`
- `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`
- `artifacts/c18z2-live-service-channel-soak-smoke-result.json`
- `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
- `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
- `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`
- `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`
- `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
- `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
- `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`
- `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`
- `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`
- `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:

- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness
window proof:

C18Z48 through C18Z109 are complete. Backend is deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.281-c18z109`; migration
`000029_fabric_service_channel_leases` is applied on the shared test database.
Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
All three test nodes run the C18Z92 image and are healthy and current after the
policy update.

Node-agent still requires signed service-channel lease authority when
cluster authority is pinned, but if legacy clients cannot send signed lease
headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by
signed/introspection/legacy path, with heartbeat metadata fallback and
admin-panel visibility.
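
The three `accepted_by` paths and their per-path counters can be sketched as follows. This is a reading of the paragraph above, not the node-agent implementation: the function names are hypothetical, `introspect` stands in for the backend introspection call, and the rule that `legacy_unsigned` applies only when no cluster authority is pinned is an assumption.

```python
from collections import Counter


def classify_ingress(authority_pinned, has_signed_headers, signature_valid, introspect):
    """Return the accepted_by label for a request, or None when it is rejected.

    introspect is a zero-argument callable that returns True when the backend
    recognises the unsigned token.
    """
    if has_signed_headers:
        # Signed lease headers are verified whenever they are present.
        return "signed" if signature_valid else None
    if authority_pinned:
        # Pinned authority: an unsigned token is accepted only after introspection.
        return "introspection" if introspect() else None
    # Assumption: without a pinned authority, unsigned tokens keep the legacy path.
    return "legacy_unsigned"


def count_accepted(labels):
    """Aggregate accepted ingress by path, skipping rejected requests."""
    return Counter(label for label in labels if label is not None)
```

A caller would put the returned label into the structured log line and the `X-RAP-Service-Channel-Accepted-By` response header, and report the counter in heartbeat telemetry.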

Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`.

C18Z57 turns that recommendation into a bounded machine-readable
`remediation_command` on the active channel row, including the primary route,
replacement route, issued time, and command TTL capped to the lease lifetime.

C18Z58 projects those commands into node-scoped synthetic mesh config and
node-agent consumes `prefer_alternate_route` as an explicit route-manager
`applied` decision with source `service_channel_remediation_command`.

C18Z59 proves active traffic follows the replacement route after remediation:
runtime heartbeat evidence shows `last_selected_route_id` and flow-scheduler
`last_route_id` on the replacement route, with no local/backend fallback and no
route send failures.

C18Z60 proves the same replacement path under multiple independent VPN flow
channels: a twelve-packet batch is classified across multiple flow-scheduler
channels, all observed replacement-route sends avoid local/backend fallback,
flow drops, and route failures.

C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like packets; runtime
evidence shows 32 replacement-route flow stats, scheduler high-watermark 5,
max-in-flight 4, no fallback, no drops, and no route failures.

C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
ingress: `X-RAP-Traffic-Class` can mark `control`, `interactive`, `reliable`,
`bulk`, or `droppable`; default traffic remains backward-compatible bulk.
Unit tests prove scheduler priority order, and live smoke proves a bulk
128-packet pressure batch plus an interactive packet both move over the
replacement route with separate traffic-class flow stats and no fallback,
drops, or route failures.

C18Z63 adds a controlled concurrent runtime proof: a bulk traffic-class send is
held in-flight while an independent interactive traffic-class packet is sent
through the same ingress, and interactive completes before bulk release with
`MaxInFlight >= 2`, no drops, and no failures.
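
The traffic-class marking and priority behaviour described for C18Z62/C18Z63 can be illustrated with a toy scheduler. This is a sketch under assumptions: the priority order is taken to be the order in which the classes are listed, unknown header values are treated like a missing header, and `ClassScheduler` is a hypothetical name rather than the runtime's flow scheduler.

```python
import heapq
import itertools

# Assumption: priority follows the order in which the classes are listed.
TRAFFIC_CLASSES = ("control", "interactive", "reliable", "bulk", "droppable")
PRIORITY = {name: rank for rank, name in enumerate(TRAFFIC_CLASSES)}
DEFAULT_CLASS = "bulk"


def traffic_class_from_headers(headers):
    """Read X-RAP-Traffic-Class; missing or unknown values fall back to bulk."""
    normalized = {key.lower(): value for key, value in headers.items()}
    value = normalized.get("x-rap-traffic-class", "").strip().lower()
    return value if value in PRIORITY else DEFAULT_CLASS


class ClassScheduler:
    """Toy priority queue: lower rank first, FIFO within one class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps insertion order

    def push(self, traffic_class, packet):
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._seq), packet))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

With this shape, an interactive packet queued behind a large bulk batch is dequeued ahead of the remaining bulk packets, which is the property the concurrent proof checks end to end.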

C18Z64 adds compact runtime telemetry: `rap.fabric_flow_scheduler.v1` snapshots
include `traffic_class_counts`, so backend/admin/diagnostics can show active
flow-channel counts per traffic class without scanning each channel stat. It is
live-proven on `rap-node-agent:0.2.239-c18z64`; latest test-1 snapshot showed
`bulk=32`, `interactive=12`, drops 0.

C18Z65/C18Z66 project those counts and flow pressure fields into backend access
telemetry at node, active-channel, and cluster aggregate levels, and web-admin
shows cluster/node/channel `flow QoS` visibility. Live aggregate API result
showed `bulk=32`, `interactive=12`, `flow_channel_count=44`,
`flow_max_in_flight=4`.

C18Z67 adds a live HTTP concurrent QoS proof: six parallel bulk service-channel
requests ran while an interactive traffic-class request was injected on the same
entry path after remediation; the interactive request completed in 132 ms, all
6 bulk requests were accepted, 3072 post-remediation packets moved over the
replacement route, 32 bulk and 12 interactive replacement-route flow stats were
observed, and fallback, route failures, flow drops, and scheduler drops stayed
at 0.

C18Z68 adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.

C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
`adaptive_backpressure_active/reason`, and the ingress send path uses the
traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
reliable is reduced moderately, and control/interactive keep their full window
unless their own class degrades. Live smoke verified `bulk=1`, `droppable=1`,
`reliable=3`, `interactive=4`, `control=4`, no drops, and
`bulk_window_reduced_to_protect_interactive`.

C18Z70 projects those adaptive runtime fields into backend/admin access
telemetry at cluster, node, and active-channel levels. Cluster windows are
aggregated by minimum non-zero per-class recommendation, and web-admin shows
adaptive window chips beside flow health/QoS. Live API artifact shows
`adaptive=true`, `bulk_window_reduced_to_protect_interactive`, and windows
`bulk=1`, `droppable=1`, `reliable=3`, `interactive=4`, `control=4`.

C18Z71 adds the cluster-level adaptive policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy`.
The policy stores audited thresholds and class windows in cluster metadata,
projects the effective fingerprint into signed node-scoped synthetic config,
and node-agent heartbeat/runtime telemetry reports `adaptive_policy_fingerprint`.
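
The C18Z70 cluster aggregation rule ("minimum non-zero per-class recommendation") is compact enough to sketch. The function name and input shape are hypothetical; treating a missing or zero entry as "no recommendation" is an assumption about what non-zero means here.

```python
TRAFFIC_CLASSES = ("control", "interactive", "reliable", "bulk", "droppable")


def aggregate_cluster_windows(node_windows):
    """Per class, take the smallest non-zero window reported by any node.

    node_windows is a list of dicts such as {"bulk": 1, "interactive": 4}.
    Missing or zero entries are ignored; a class that no node reported is
    left out of the result.
    """
    cluster = {}
    for name in TRAFFIC_CLASSES:
        values = [w[name] for w in node_windows if w.get(name, 0) > 0]
        if values:
            cluster[name] = min(values)
    return cluster
```

Taking the minimum means the cluster view shows the most constrained node per class, so one node under pressure is not hidden by healthier peers.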

The node scheduler consumes the policy at runtime; default policy preserves
bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
proved an operator policy with max window 6 and `bulk=2` changes the live
recommended windows without breaking interactive/control. A signed-config hash
mismatch found during the smoke was fixed by preserving all signed adaptive
policy provenance fields in the node-agent client model.

C18Z72 adds the cluster-level pool/failover policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy`. Lease
issuance now applies the effective entry/exit pool constraints and preferred
entry/exit before route selection, stores the effective policy on the lease,
and signs it into `rap.fabric_service_channel_lease_authority.v1`. Live smoke
proved a policy-constrained lease selects only the policy entry/exit from a
wider requested pool and carries matching signed `pool_policy` provenance.
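
The pool constraint step in C18Z72 can be pictured as a filter applied to the requested pool before route selection. This is only a sketch: `constrain_pool` is a hypothetical helper, and reading an empty policy pool as "no constraint" is an assumption.

```python
def constrain_pool(requested, allowed, preferred=None):
    """Keep the requested nodes the policy allows; put the preferred node first.

    An empty `allowed` list is read as "no constraint" (an assumption).
    """
    allowed_set = set(allowed)
    pool = [node for node in requested if not allowed_set or node in allowed_set]
    if preferred in pool:
        pool.remove(preferred)
        pool.insert(0, preferred)
    return pool
```

The same helper would be applied once to the entry pool and once to the exit pool, and the resulting effective pools are what gets stored on the lease and signed.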

C18Z73 projects that signed pool-policy fingerprint into active access
telemetry and guards remediation commands against routes outside the signed
lease pools.

C18Z74 correlates active remediation commands with entry-node route-manager
heartbeats and reports execution states such as `waiting_node_apply`,
`applied`, `rejected_by_policy_guard`, `pending_rebuild_request`, and
`expired`.

C18Z75 records `rebuild_route` remediation as durable rebuild ledger intent
rows when node-scoped synthetic config is fetched, and access telemetry reports
`rebuild_request_recorded` or `rebuild_request_rejected`.

C18Z76 makes the allowed `rebuild_route` command visible from the node side:
node-agent consumes it as a route-manager `pending_degraded_fallback` decision
with source `service_channel_remediation_command`, and backend access telemetry
correlates that with the durable ledger as
`rebuild_request_recorded_node_pending`.

C18Z77 resolves durable remediation rebuild requests inside the shared Control
Plane planner: signed-pool-valid alternates become `applied` /
`replacement_selected` and are projected as route-manager decisions with the
same command id, missing safe alternates become `no_alternate`, lease/policy
blocks become `deferred_by_policy`, and stale commands become `expired`.
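
The four C18Z77 planner outcomes map naturally onto a small decision function. The sketch below is an assumption-laden illustration: the route and pool dict shapes are hypothetical, and the order of checks (expiry, then policy, then alternates) is a guess rather than documented behaviour.

```python
def resolve_rebuild_request(now, command_expires_at, policy_blocked, alternates, signed_pool):
    """Map one durable rebuild request to a planner outcome.

    alternates: candidate route dicts with "route_id", "entry", "exit", "fenced".
    signed_pool: {"entry": set_of_nodes, "exit": set_of_nodes} from the signed lease.
    """
    if now >= command_expires_at:
        return {"status": "expired"}
    if policy_blocked:
        return {"status": "deferred_by_policy"}
    for route in alternates:
        in_pool = (
            route["entry"] in signed_pool["entry"]
            and route["exit"] in signed_pool["exit"]
        )
        # Only unfenced routes inside the signed lease pools are safe alternates.
        if in_pool and not route["fenced"]:
            return {
                "status": "applied",
                "reason": "replacement_selected",
                "replacement_route_id": route["route_id"],
            }
    return {"status": "no_alternate"}
```

Keeping the outcome a plain value makes it easy to project the same result into the ledger row, the route-manager decision, and access telemetry.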

C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
`applied` branch live by adding an alternate route after lease issuance and
verifying the existing rebuild command resolves to `rebuild_request_applied`.

C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops.

C18Z80 hardens that same path under sustained pressure: after planner-applied
rebuild, five post-rebuild bursts of mixed `interactive`, `bulk`, and
`reliable` VPN packet batches stay on the replacement route, the stale primary
is not reselected, and fallback/route-failure/drop deltas stay zero from the
pre-pressure baseline.

C18Z81 adds the negative/rollback proof: after the initial replacement is
applied and used, a generation-valid fenced feedback report for that
replacement causes the Control Plane to select a new safe recovery route; live
traffic then moves to the recovery route, the degraded replacement is not
reselected, and fallback/failure/drop deltas stay zero for the recovery send.
The C18Z81 work also tightened older smoke checks to use per-run counter deltas
instead of absolute cumulative runtime counters.
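
The difference between absolute counters and per-run deltas is easy to show. In this sketch the counter names are hypothetical; the idea is only that a smoke check snapshots the cumulative counters before the run and asserts on the change, so failures left over from earlier runs on a long-lived node do not fail the current one.

```python
def counter_deltas(baseline, current, keys):
    """Per-run change of cumulative counters since the baseline snapshot."""
    return {key: current.get(key, 0) - baseline.get(key, 0) for key in keys}


def run_is_clean(baseline, current,
                 keys=("backend_fallback", "route_failures", "flow_drops")):
    """True when none of the watched counters moved during this run."""
    return all(delta == 0 for delta in counter_deltas(baseline, current, keys).values())
```

For example, a node that already carries `route_failures=1` from an older smoke still passes as long as the value is unchanged after the new run.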

C18Z82 closes the no-safe-recovery branch: after the replacement route reports
generation-valid fenced feedback and no new safe recovery route is created,
node-scoped synthetic config surfaces `service_channel_feedback_no_alternate`
with `pending_degraded_fallback`, `no_unfenced_alternate_route`, and
`backend_relay_degraded_fallback_until_rebuild`, proving the Control Plane
exposes a degraded/no-alternate state instead of silently sticking to a bad
replacement.

C18Z83 projects those route-manager decisions into active access telemetry and
web-admin: active channels now expose route-decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
The live smoke proves the no-safe state is visible through access telemetry as
`service_channel_feedback_no_alternate` /
`pending_degraded_fallback`, with operator execution state remaining compatible
with durable ledger `rebuild_request_no_alternate`.

C18Z84 aggregates those per-channel decisions at the access-telemetry summary
level: route-decision channel count, replacement decision count, applied
rebuild count, recovery decision count, and no-safe recovery count are exposed
to the API and web-admin summary chips. The no-safe branch now prioritizes the
aggregate status reason `active_channels_no_safe_recovery` over generic missing
access-report noise.

C18Z85 projects access-decision aggregates into rebuild health and incident
diagnostics. Health summary now carries access decision counts and prioritizes
`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe is
active. Rebuild incidents now include `incident_source=access_decision` rows
for active channel decisions such as `access_no_safe_recovery`, with bad
severity and channel id, so operators see route-decision failures beside ledger
incidents.

C18Z86 adds silence/acknowledgement behavior for those
`incident_source=access_decision` incidents. Silence requests now carry
`incident_source` and `channel_id`; access-decision no-safe silences are stored
with a channel-scoped route key, applied back into rebuild health/incidents,
and exact current-generation incidents stop contributing to active bad count.
Generation-changing access-decision resurfacing is unit-tested; the live smoke
proves the operator silence path on docker-test.

C18Z87 exposes active rebuild/access-decision silences to operators and adds
unsilence. The API now lists active rebuild alert silences, returns
access-decision `incident_source`, `channel_id`, and display route id, and
allows deleting a silence by id. Web-admin shows an `Active rebuild silences`
table with an unsilence action. The live smoke proves list -> silence ->
unsilence and verifies the access no-safe incident becomes active again.

C18Z88 makes access-decision resurfacing operator-visible in live runtime.
Access-decision incidents now expose the silence id they resurfaced from, the
previous acknowledged generation, and the silence expiry. The live smoke
proves: access no-safe incident -> silence current generation -> wait for a new
route-decision generation -> incident returns as `alert_resurfaced=true`, active
bad count is restored, and previous generation metadata is preserved.
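
The generation-scoped silence behaviour from C18Z86 through C18Z88 can be sketched as a small matching rule. The dict keys and the three state names are illustrative assumptions, not the backend's schema.

```python
def evaluate_silence(incident, silence, now):
    """Decide whether a silence still covers an access-decision incident.

    Returns "silenced", "resurfaced", or "active".
    """
    same_target = (
        silence["incident_source"] == incident["incident_source"]
        and silence["channel_id"] == incident["channel_id"]
    )
    if not same_target or now >= silence["expires_at"]:
        return "active"
    if silence["generation"] == incident["generation"]:
        return "silenced"   # exact current generation: not counted as active bad
    return "resurfaced"     # generation changed after the acknowledgement


def active_bad_count(incidents, silences, now):
    """Count bad incidents that no same-generation silence covers."""
    count = 0
    for incident in incidents:
        states = [evaluate_silence(incident, s, now) for s in silences]
        if incident["severity"] == "bad" and "silenced" not in states:
            count += 1
    return count
```

A resurfaced incident counts as active again, which matches the live smoke where the bad count is restored once a new route-decision generation appears.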

C18Z89 closes the resurfaced-incident operator action loop for generation
changes. Resurfaced access-decision incidents now expose
`alert_resurfaced_cause`, previous route id, and previous channel id; web-admin
shows the cause beside resurfaced incidents. The live smoke proves the operator
can re-acknowledge the resurfaced generation, the active-channel decision
context matches the incident route/generation, and the current generation
returns to a silenced state.

C18Z90 introduces the explicit signed production data-plane contract on
service-channel leases. `data_plane` is now part of the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
that control-plane traffic uses backend API, working data uses the fabric
service channel over fabric routes, backend relay is degraded fallback only,
production forwarding is required, and logical flows are service-neutral,
protocol-agnostic, and isolated. Web-admin shows this contract in the
service-channel lease table.

C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.

C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
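
The C18Z92 rule can be sketched as a small ingress decision. This is an illustration only: the two send callables are hypothetical stand-ins, the transport labels are invented for the example, and any policy value other than `disabled` is assumed to allow degraded fallback.

```python
def handle_packet_batch(send_fabric, send_backend_relay, backend_relay_policy):
    """Return (http_status, outcome) for one ingress packet batch.

    send_fabric / send_backend_relay are zero-argument callables returning
    True on success.
    """
    if send_fabric():
        return 200, "fabric_route"
    if backend_relay_policy == "disabled":
        # Fail visibly instead of silently proxying working data.
        return 503, "backend_fallback_blocked"
    if send_backend_relay():
        return 200, "backend_relay_degraded"
    return 503, "fabric_and_relay_failed"
```

The important property is that with the disabled policy the relay callable is never invoked, so refusal and real relay usage stay distinguishable in telemetry.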

C18Z93 promotes that data-plane contract telemetry into backend access
telemetry and web-admin active-channel diagnostics: cluster, node, and
active-channel rows now show contract adoption count, last working transport,
steady-state transport, backend relay policy, data-plane mode, and logical
flow mode.

C18Z94 turns those data-plane/fallback signals into operator incidents.
`data_plane_contract` incident rows are now emitted for missing data-plane
contract reports after accepted service-channel traffic, wrong working or
steady-state transport, wrong logical flow mode, disabled backend relay
observed, and degraded backend relay usage. The incident list/readiness path
can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.

C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.

C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
fallback from becoming a dead-end operator alert.

C18Z97 adds bounded deduplication for those access-report-derived route
feedback records. Repeated blocked-fallback send-failure heartbeats no longer
rewrite the same active fenced feedback or churn planner rebuild attempts while
the first access-report feedback is still active. Runtime feedback from the
flow scheduler remains independent.
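
Bounded deduplication of the kind described for C18Z97 can be sketched with an expiry table keyed by channel, route, and reason. The class name, key shape, TTL, and eviction rule are all assumptions made for illustration.

```python
class FeedbackDeduper:
    """Suppress repeated access-report feedback while an earlier record is active."""

    def __init__(self, ttl_seconds, max_entries=1024):
        self.ttl = ttl_seconds
        self.max_entries = max_entries   # upper bound on tracked keys
        self._active_until = {}          # (channel_id, route_id, reason) -> expiry

    def should_record(self, channel_id, route_id, reason, now):
        # Drop expired entries first so the table stays bounded.
        self._active_until = {k: t for k, t in self._active_until.items() if t > now}
        key = (channel_id, route_id, reason)
        if key in self._active_until:
            return False                 # same fenced feedback is still active
        if len(self._active_until) >= self.max_entries:
            # Evict the entry that would expire soonest.
            oldest = min(self._active_until, key=self._active_until.get)
            del self._active_until[oldest]
        self._active_until[key] = now + self.ttl
        return True
```

Only the first heartbeat for a given key writes feedback; later heartbeats are ignored until the record expires, which avoids planner churn without touching feedback from other sources.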
|
|
C18Z98 carries that feedback identity into the replacement decision and
|
|
rebuild-attempt ledger: decision and ledger rows now expose
|
|
`feedback_observation_id`, `feedback_source`, feedback observed/expiry time,
|
|
channel/resource ids, and data-plane violation status/reason. Web-admin shows
|
|
that correlation in Route decisions and Rebuild ledger.
|
|
C18Z99 adds rebuild ledger filters for those correlation fields. The backend
|
|
`/fabric/service-channels/rebuild-attempts` API accepts `feedback_source`,
|
|
`feedback_channel_id`, and `feedback_violation_status`, and web-admin exposes
|
|
the same filters in the rebuild ledger form. The live smoke proves source,
|
|
channel, violation, combined filters, and wrong-channel exclusion.
|
|
C18Z100 adds rebuild-health feedback breakdown aggregation for the same
|
|
correlation fields. The backend rebuild-health summary now returns
|
|
`feedback_breakdowns` grouped by feedback source, feedback channel id, and
|
|
feedback violation status, including total/good/warn/bad/unknown counts,
|
|
active warn/bad counts, silenced count, latest observation time, and affected
|
|
reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
|
|
panel so operators can see which access-report-derived failure classes dominate
|
|
active warn/bad rebuild state.
|
|
C18Z101 wires that breakdown into operator workflow in web-admin. Each
|
|
feedback-breakdown row now shows related incident context by channel/reporter/
|
|
route overlap and has an `open ledger` action that switches to the deep rebuild
|
|
ledger with `feedback_source`, `feedback_channel_id`, and
|
|
`feedback_violation_status` prefilled from the breakdown row.
|
|
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
|
|
investigation endpoint now accepts feedback source/channel/violation drilldown
|
|
payloads, records
|
|
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
|
|
cluster audit events, and web-admin records one before opening the filtered
|
|
deep ledger from a rebuild-health feedback breakdown row.
|
|
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
Web-admin now filters the loaded cluster audit list for rebuild incident and
feedback-breakdown investigation events and shows recent drilldowns with time,
source, feedback filters, target reporter/route, actor, and reason beside
rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now
accepts `event_type` and `target_type` filters, including repeated or
comma-separated values, and web-admin loads recent fabric investigation
breadcrumbs with a dedicated filtered request instead of depending on the
generic latest-100 cluster audit list.
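
As an illustration of accepting "repeated or comma-separated values", the
sketch below normalizes one filter from a raw query string into a flat,
de-duplicated list. The helper name `filter_values` is hypothetical and the
backend's real parsing may differ; only the `event_type` and `target_type`
parameter names come from this document.

```python
from urllib.parse import parse_qs


def filter_values(query_string, name):
    """Collect a filter given as repeated params and/or comma-separated lists."""
    values = []
    # parse_qs returns every occurrence of a repeated parameter as a list.
    for raw in parse_qs(query_string).get(name, []):
        for part in raw.split(","):
            part = part.strip()
            if part and part not in values:
                values.append(part)  # keep first-seen order, drop duplicates
    return values


# Both spellings yield the same list of requested event types.
print(filter_values("event_type=a,b&event_type=c", "event_type"))
print(filter_values("event_type=a&event_type=b&event_type=c", "event_type"))
```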
C18Z105 correlates those focused audit breadcrumbs back to currently visible
diagnostics in web-admin. Recent investigation rows now show whether the
breadcrumb still matches an active rebuild-health feedback breakdown or visible
rebuild incident, and provide an `open` action to jump back into the matching
filtered ledger path.
C18Z106 moves that correlation into the backend/API. `GET /audit` with
`correlation=fabric_diagnostics` now returns `correlation_hints` for focused
fabric investigation breadcrumbs, including current diagnostic status
(`breakdown_active`, `incident_visible`, or `not_visible`) and the matching
breakdown/incident object when present. Web-admin consumes those hints and keeps
its previous local matching as a fallback. During verification, noisy test
history showed that rebuild-health feedback breakdowns were capped too tightly;
the backend now returns up to 100 breakdown groups so fresh failure classes are
not pushed out by older smoke history.
C18Z107 adds a compact backend-provided `audit_summary` beside `audit_events`.
For focused Fabric diagnostics audit reads, the summary includes total count,
counts by event/target type, counts by current diagnostic status, counts by
feedback source/violation status, correlated count, not-visible count, and
latest time. Web-admin shows these as Recent investigations chips and short
source/violation lines without recalculating the aggregate in the browser.
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
cluster audit read path. The backend now exposes
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
with a dedicated `rebuild_investigation_breadcrumbs` contract containing
events plus summary. Web-admin uses this endpoint for Recent investigations
and keeps generic audit semantics separate from Fabric diagnostics workflow
state.
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
endpoint accepts `current_window_seconds` and `history_window_seconds`, annotates
each breadcrumb with `correlation_hints.breadcrumb_status` (`current`, `stale`,
or `expired`) plus age/window seconds, returns current/stale/expired totals, and
adds `counts_by_breadcrumb_status` to the summary. Web-admin shows freshness
chips and an age column in Recent investigations, so operators can separate live
workflow hints from stale history without deleting audit records.
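
One plausible reading of the freshness windows, sketched for illustration. The
threshold semantics here are an assumption, not confirmed by this document: a
breadcrumb is `current` while its age is within `current_window_seconds`,
`stale` while within `history_window_seconds`, and `expired` after that. The
function names are hypothetical.

```python
def breadcrumb_status(age_seconds, current_window_seconds,
                      history_window_seconds):
    """Classify one breadcrumb by age (assumed inclusive window bounds)."""
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def counts_by_breadcrumb_status(ages, current_window_seconds,
                                history_window_seconds):
    """Aggregate the per-breadcrumb statuses into summary totals."""
    counts = {"current": 0, "stale": 0, "expired": 0}
    for age in ages:
        status = breadcrumb_status(age, current_window_seconds,
                                   history_window_seconds)
        counts[status] += 1
    return counts


# Ages in seconds for four breadcrumbs, with a 60 s current window and a
# one-hour history window.
print(counts_by_breadcrumb_status([5, 10, 61, 3601], 60, 3600))
```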
Live verification passed:

- `scripts/fabric/c18z48-service-channel-introspection-smoke.ps1`
- `scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1`
- `scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1`
- `scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1`
- `scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1`
- `scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1`
- `scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1`
- `scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1`
- `scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1`
- `scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1`
- `scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1`
- `scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1`
- `scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1`
- `scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1`
- `scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1`
- `scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1`
- `scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1`
- `scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1`

Artifacts:

- `artifacts/c18z48-service-channel-introspection-smoke-result.json`
- `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`
- `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`
- `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`
- `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`
- `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`
- `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`
- `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`
- `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`
- `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`
- `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`
- `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`
- `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`
- `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`
- `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`
- `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`
- `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`
- `artifacts/c18z68-service-channel-flow-health-api-result.json`
- `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`
- `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`
- `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`
- `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`
- `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`
- `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`
- `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`
- `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`
- `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`
- `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`
- `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`
- `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`
- `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`
- `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`
- `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`
- `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`
- `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`
- `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`
- `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`
- `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`
- `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`
- `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`
- `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`
- `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`
- `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`
- `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`
- `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`
- `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`
- `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`
- `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`
- `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`
- `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`
- `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`
- `artifacts/c18z104-focused-fabric-audit-smoke-result.json`
- `artifacts/c18z106-audit-correlation-hints-smoke-result.json`
- `artifacts/c18z107-audit-correlation-summary-smoke-result.json`
- `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`
- `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`

Current active continuation after C20Z6:

C20Z1 through C20Z6 are implemented and runtime-smoke-proven. The C20 stage is
terminal-complete by contract. It opened and validated a new explicit
real-adapter enablement request as a contract-only transition:
`rap.remote_workspace_real_adapter_c20_stage_terminal_complete.v1`, with
`terminal_status=stage_terminal_complete_contract_only`,
`stage_status=complete_no_more_c20_layers_required`,
`stage_name=c20_real_adapter_new_explicit_enablement_request`,
`validation_chain_status=complete_contract_only`,
`enablement_boundary=runtime_enablement_requires_next_explicit_runtime_stage`,
`enablement_decision=validated_contract_only_not_enabled`,
`enablement_status=validated_not_enabled`,
`runtime_gate_state=validated_contract_only_not_enabled`,
`runtime_effect=contract_only_no_runtime_enablement`,
`operator_default_action=keep_real_adapter_disabled_until_next_explicit_runtime_stage`,
`next_allowed_entrypoint=next_explicit_runtime_enablement_stage_only`,
`allows_process_start=false`, and `allows_payload_traffic=false`. Docker-test
`test-1/2/3` remain on
`rap-node-agent:codex-service-supervisor-20260513z52`. Verification artifact:
`artifacts/c20z6-remote-workspace-real-adapter-stage-terminal-complete-compatibility-smoke-result.json`.

The not-approved factory remains terminal-complete by contract, and C20 is now
also terminal-complete by contract. Do not add more C20 continuation layers.
The only allowed next entrypoint is a new explicit runtime enablement stage.
Keep the real adapter disabled until that new stage explicitly changes runtime
state: no process start, no real RDP frame transport, no Android work, no
backend relay semantics, and no production adapter payload forwarding.
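
A small sketch of a fail-closed check a reviewer or smoke script could run
against the C20 terminal contract. The field names and expected values are the
ones listed above; the assumption that the contract arrives as a flat JSON
object, and the helper name `contract_mismatches`, are hypothetical.

```python
# Subset of the C20 terminal-complete fields that keep the adapter disabled.
EXPECTED_C20_TERMINAL_FIELDS = {
    "terminal_status": "stage_terminal_complete_contract_only",
    "enablement_status": "validated_not_enabled",
    "runtime_effect": "contract_only_no_runtime_enablement",
    "allows_process_start": False,
    "allows_payload_traffic": False,
}


def contract_mismatches(contract):
    """Return the sorted field names that are missing or differ from expected."""
    mismatches = []
    for field, expected in EXPECTED_C20_TERMINAL_FIELDS.items():
        actual = contract.get(field)
        # Compare type as well as value so 0 or "" never passes for False.
        if type(actual) is not type(expected) or actual != expected:
            mismatches.append(field)
    return sorted(mismatches)


# An empty contract fails closed: every guarded field is reported.
print(contract_mismatches({}))
```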