CODEX CONTEXT
Project identity
This project is a production-grade distributed secure access platform.
It started as a custom RDP proxy with persistent server-side sessions, but the final target architecture is broader:
- distributed secure access fabric
- multi-tenant platform
- session broker for GUI and future non-GUI protocols
- cluster mesh of nodes
- connector/VPN layer
- customer-managed and platform-managed nodes
- node-agent based self-update / rollback / health supervision
Product architecture rule: VPN and Remote Workspace are separate products/layers
Do not merge VPN/IP tunnel work with Remote Workspace / remote desktop work.
- VPN is a universal network-layer IP tunnel. It carries any traffic generated by a phone, Windows PC, Linux host, or other client device: HTTP, DNS, ping, RDP clients, SSH clients, SMB, business apps, and future protocols. VPN must stay protocol-agnostic and must not contain remote-desktop-specific logic.
- Remote Workspace is an application/session-layer service. The client talks to RAP using RAP's own client protocol. RAP workers/connectors then talk to the target server using protocol adapters such as RDP, SSH, VNC, or future adapters, convert screen/input/clipboard/files/audio/control into RAP's format, and render it in the RAP client.
- VPN optimization work must focus on generic data-plane transport, full-tunnel/split-tunnel routing, DNS, MTU/MSS, QoS, NAT traversal, direct UDP/QUIC transport, fallback relay, diagnostics, and stability for arbitrary traffic.
- Remote Workspace optimization work must focus on server catalog, session broker, workers/connectors, protocol adapters, RAP client protocol, separate connection windows, rendering/input/clipboard/file/audio behavior, and user-facing remote-workspace UX.
- Both VPN and Remote Workspace must consume the shared Fabric Service Channel runtime. Control/API traffic may use backend/admin ingress, but working service data must use the fabric channel whenever available. Backend relay is a compatibility/degraded fallback, not the production steady-state.
- The accepted service-channel direction is documented in docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md: a service requests a channel with entry pool, exit pool, roles, service class, channel classes, QoS and failover policy; the fabric selects the fastest healthy route and rebuilds it on failure. Protocol-specific services must not reimplement this transport.
- Current implementation: backend issues rap.fabric_service_channel_lease.v1 leases and embeds them in VPN client profiles. Leases include cluster-authority-signed rap.fabric_service_channel_lease_authority.v1 payloads that bind token hash, selected route, generation, fencing epoch, and expiry, plus a signed data_plane contract declaring that working data uses the Fabric Service Channel over fabric routes while backend relay is only an explicit degraded/disabled fallback policy. rap-node-agent accepts the first VPN packet service-channel entry endpoint under /api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets plus /packets/ws. The endpoint validates the signed or introspected data-plane contract, applies the preferred fabric route, uses the existing production vpn_packet fabric route, reports contract adoption in heartbeat access telemetry, and refuses backend relay when the contract disables it. Backend access telemetry and web-admin now show data-plane adoption, working/steady-state transport, backend relay policy, data-plane mode, and logical flow mode at cluster/node/channel levels. The next slice is explicit route/fallback violation incidents from that telemetry, plus client consumption of the lease endpoint template.
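The lease rules above (expiry, fencing epoch, and a data-plane contract that can disable backend relay) can be illustrated with a minimal Go sketch. The `Lease` type, its field names, and `checkLease` are hypothetical and do not reflect the real rap.fabric_service_channel_lease.v1 schema; signature verification is assumed to have happened before this check.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Lease is a hypothetical, simplified view of a service-channel lease.
// Field names are illustrative assumptions, not the real lease schema.
type Lease struct {
	RouteID             string
	FencingEpoch        uint64
	ExpiresAt           time.Time
	BackendRelayAllowed bool // assumed to come from the signed data_plane contract
}

var (
	ErrExpired       = errors.New("lease expired")
	ErrFenced        = errors.New("lease fencing epoch is stale")
	ErrRelayDisabled = errors.New("backend relay disabled by data-plane contract")
)

// checkLease decides whether a packet may use the lease.
// It assumes the cluster-authority signature was verified earlier.
func checkLease(l Lease, now time.Time, currentEpoch uint64, viaBackendRelay bool) error {
	if !now.Before(l.ExpiresAt) {
		return ErrExpired
	}
	if l.FencingEpoch < currentEpoch {
		return ErrFenced
	}
	if viaBackendRelay && !l.BackendRelayAllowed {
		return ErrRelayDisabled
	}
	return nil
}

func main() {
	now := time.Now()
	l := Lease{RouteID: "route-a", FencingEpoch: 7, ExpiresAt: now.Add(time.Minute)}
	fmt.Println(checkLease(l, now, 7, false)) // traffic over the fabric route
	fmt.Println(checkLease(l, now, 7, true))  // attempt to use backend relay
	fmt.Println(checkLease(l, now, 8, false)) // cluster has moved to a newer epoch
}
```

Keeping the relay decision inside the same check as expiry and fencing means a refused backend relay is an explicit, reportable outcome rather than a silent fallback.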
Current proven foundation
The current codebase already proved the most risky low-level lifecycle assumptions for RDP:
- real FreeRDP connect works
- session state transitions to active work
- terminate works
- detach works without killing the remote session
- reattach works without recreating the remote session
- takeover works without recreating the remote session
- per-resource certificate verification policy exists
  - certificate_verification_mode = strict | ignore
  - strict is default
  - ignore works on a per-resource basis
- worker build is reproducible
- backend build is reproducible
This proven lifecycle must NOT be broken by future architecture work.
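The per-resource certificate policy listed above (strict by default, ignore only when a resource asks for it) can be sketched as follows. The `Resource` type and `effectiveMode` function are hypothetical names for illustration only; treating an unknown value as an error is an assumption, not documented behavior.

```go
package main

import "fmt"

// CertVerificationMode mirrors the two documented values.
type CertVerificationMode string

const (
	CertVerifyStrict CertVerificationMode = "strict"
	CertVerifyIgnore CertVerificationMode = "ignore"
)

// Resource is a hypothetical per-resource record.
type Resource struct {
	Name                        string
	CertificateVerificationMode string
}

// effectiveMode resolves the mode for one resource.
// An empty value falls back to strict; an unknown value is rejected
// (an assumption) and strict is returned alongside the error.
func effectiveMode(r Resource) (CertVerificationMode, error) {
	switch r.CertificateVerificationMode {
	case "", string(CertVerifyStrict):
		return CertVerifyStrict, nil
	case string(CertVerifyIgnore):
		return CertVerifyIgnore, nil
	default:
		return CertVerifyStrict, fmt.Errorf("unknown certificate_verification_mode %q", r.CertificateVerificationMode)
	}
}

func main() {
	for _, r := range []Resource{
		{Name: "prod-rdp"},
		{Name: "lab-rdp", CertificateVerificationMode: "ignore"},
		{Name: "typo-rdp", CertificateVerificationMode: "off"},
	} {
		mode, err := effectiveMode(r)
		fmt.Println(r.Name, mode, err)
	}
}
```

Because the mode is resolved from the resource record on every call, there is no global switch that could turn certificate checks off for every resource at once.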
Current architecture baseline
Current audit and baseline snapshot:
- docs/audits/PROJECT_AUDIT_2026-04-26.md
- docs/audits/CURRENT_BASELINE_MATRIX.md
Test environment
- Canonical test Docker host: 192.168.200.61
- Canonical Docker context: test-ubuntu
- Canonical SSH alias: docker-test
- Current external control-plane endpoint for remote/offsite node enrollment: http://94.141.118.222:19191 / http://vpn.cin.su:19191.
- Current port forward: 94.141.118.222:19191 -> 192.168.200.61:18080.
- For offsite Windows/Linux nodes, install profiles should use http://vpn.cin.su:19191/api/v1 as control-plane endpoint and http://vpn.cin.su:19191/downloads as artifact endpoint unless the user explicitly chooses the raw IP endpoint.
- Backend API for local/client smoke runs: http://192.168.200.61:8080/api/v1
- WebSocket gateway for local/client smoke runs: ws://192.168.200.61:8080/api/v1/gateway/ws
- Stage C17 planning is completed.
- C17A synthetic mesh runtime skeleton is implemented and test-proven in rap-node-agent only. It is disabled by default and carries synthetic fabric.probe/fabric.probe_ack messages only.
- C17B route health and failover probes are implemented and test-proven in rap-node-agent only. They are disabled by default and carry synthetic fabric.route_health/fabric.route_health_ack messages only.
- C17C relay semantic hardening is implemented and test-proven in rap-node-agent only. It is disabled by default and models synthetic per-channel queues/QoS/backpressure only.
- C17D non-production test-service path is implemented and test-proven in rap-node-agent only. It is disabled by default and carries only bounded synthetic.echo test payloads.
- C17E/C17F/C17G are implemented and proven for live synthetic HTTP transport, scoped synthetic route config, and Control Plane scoped synthetic config consumption.
- C17H deployed multi-agent synthetic config smoke is runtime-proven on docker-test: five running rap-node-agent containers consume backend-issued node-scoped synthetic config, direct and single-relay synthetic route-health observations return to the Control Plane, and production forwarding remains disabled.
- C17I production forwarding gate foundation is implemented and test-proven: rap-node-agent has an explicit production-forwarding gate, while /mesh/v1/forward still refuses production payload forwarding until a later approved runtime stage.
- C17J production envelope contract is implemented and test-proven: /mesh/v1/forward validates route-bound production envelopes for fabric_control/fabric.control only when the gate is enabled, rejects service channels, and still refuses production forwarding.
- C17K production envelope observation is implemented and test-proven: valid accepted envelopes can be observed locally as metadata-only records after validation; rejected envelopes are not observed, observation failure fails closed, and production forwarding remains unavailable.
- C17L bounded production observation sink is implemented and test-proven: accepted metadata-only observations can be retained locally with fixed capacity, oldest-entry drop behavior, and no payload body storage.
- C17M production observation sink wiring is implemented and test-proven: node-agent can wire the bounded local metadata-only sink when RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY is explicitly greater than zero; the wiring is disabled by default and exposes no read API.
- C17N production observation sink metrics are implemented and test-proven: local sink metrics expose only capacity, current depth, accepted total, and dropped-oldest total; they expose no observation records or payload metadata.
- C17O production observation sink local metrics logging is implemented and test-proven: node-agent logs aggregate sink metrics locally when the sink is explicitly enabled; no read API or Control Plane reporting is added.
- C17P production observation sink change-driven metrics logging is implemented and test-proven: node-agent suppresses repeated identical local sink metrics logs; no read API or Control Plane reporting is added.
- C17Q production forwarding gate/runtime log boundary is implemented and test-proven: node-agent logs production forwarding gate state separately from production forwarding runtime state. Runtime state remained false until C17Z introduced gate-controlled fabric.control direct forwarding.
- C17R production observation sink capacity guard is implemented and test-proven: RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY is rejected above 10000.
- C17S production observation panic fail-closed hardening is implemented and test-proven: observer errors and observer panics both fail closed as observation failure.
- C17T production envelope payload boundary is implemented and test-proven: validated production fabric.control envelope payloads are bounded to 4096 bytes and oversized envelopes are rejected before observation.
- C17U production envelope created-at skew boundary is implemented and test-proven: validated production fabric.control envelopes whose created_at is more than one minute in the future are rejected before observation.
- C17V peer endpoint candidate model is implemented and test-proven: node-scoped synthetic mesh config now carries route-scoped endpoint candidates with transport, address, reachability, NAT type, connectivity mode, priority, policy tags, verification time, and metadata. This is a model/config boundary only; no production route scoring, NAT traversal, shortcut routing, or forwarding runtime is implemented.
- C17W peer endpoint candidate scoring model is implemented and test-proven: rap-node-agent can rank already-scoped endpoint candidates using soft inputs such as transport, reachability, connectivity mode, NAT type, priority, region, policy tags, channel class, and verification age. This is a scoring helper only; it does not open connections, choose production routes, or forward payloads.
- C17X health-aware endpoint candidate scoring overlay is implemented and test-proven: endpoint candidate scoring can optionally use local health observations keyed by endpoint_id, including latency, success/failure history, recent failure reason, reliability score, and observation freshness. This remains advisory scoring only and is not wired into production route execution.
- C17Y Platform Owner synthetic mesh visibility is implemented and build/test-proven: web-admin reads node-scoped synthetic mesh config and shows config enabled state, route counts, peer endpoints, endpoint candidates, C17X advisory scoring boundary, and production_forwarding. This remains platform-owner visibility only and does not enable production forwarding.
- C17Z production fabric-control direct forwarding boundary is implemented and test-proven: when RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true, /mesh/v1/forward can deliver valid route-bound fabric.control envelopes at the local destination or forward them to a direct next hop from explicit peer endpoint config. Service channels, arbitrary relay forwarding, multi-hop production route execution, and RDP/VPN/file/video/service payloads remain unavailable.
- C17Z1 production fabric-control multi-hop route-path boundary is implemented and test-proven: production fabric.control envelopes can carry route_path and visited_node_ids; relay nodes validate path position, forward only to the next path node, update TTL/hop/visited metadata, and reject loops. Service payloads remain unavailable.
- C17Z2 production fabric-control forwarding observability boundary is implemented and test-proven: node-agent emits local mesh_production_forward_event logs for accepted, forwarded, delivered, and rejected production fabric.control envelopes. Logs are metadata-only and include no payload bodies or read API.
- C17Z3 production fabric-control route-config boundary is implemented and test-proven: when scoped/control-plane mesh routes are available locally, production fabric.control envelopes must match configured route_id/path/next-hop/channel/expiry/TTL/hop limits before forwarding.
- C17Z4 scoped peer directory and recovery seeds boundary is implemented and test/build-proven: node-scoped mesh config carries scoped peer_directory and explicit bounded recovery_seeds; node-agent parses/validates them and web-admin shows counts.
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven: node-agent builds a local PeerCache, selects bounded warm peers, probes warm peers with /mesh/v1/health, and reports metadata-only mesh-link observations when synthetic mesh testing is enabled.
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven: node-agent reports explicit advertised mesh endpoint metadata in heartbeat, and Control Plane projects latest reported endpoints/candidates into node-scoped synthetic mesh config.
- C17Z7 private/corporate endpoint candidate boundary is implemented and test-proven: node-agent reports multiple advertised endpoint candidates, scoring rewards private/corporate same-site candidates, and peer cache can use the best candidate address for warm health.
- C17Z8 peer connection state machine boundary is implemented and test-proven: node-agent tracks warm-peer states disconnected, connecting, ready, degraded, and backoff, with bounded backoff after repeated health probe failures.
- C17Z9 peer recovery planner boundary is implemented and test-proven: node-agent targets a bounded stable ready-peer set, enters recovery when ready peers fall below target, and selects bounded recovery probes from warm peers, recovery seeds, and other connectable scoped peers.
- C17Z10 peer connection intent planner boundary is implemented and test-proven: node-agent classifies bounded peer work as maintain/probe/recover and classifies transport readiness as direct/private_lan/corporate_lan/outbound_only/relay_required, with rendezvous-required metadata only.
- C17Z11 peer connection manager runtime boundary is implemented and test-proven: node-agent uses a reusable HTTP keep-alive client for real control-plane health probes of direct/private/corporate peers and records waiting_rendezvous for outbound-only/relay-required peers.
- C17Z12 rendezvous/relay control-plane contract is implemented and docker-test-runtime-proven: backend issues node-scoped rendezvous_leases, node-agent resolves matching waiting_rendezvous intents into relay_control, probes relay /mesh/v1/health, records and maintains relay_ready, and keeps service payload forwarding disabled.
- C17Z13 rendezvous lease telemetry is implemented and docker-test-runtime-proven: node-agent reports mesh_rendezvous_lease_report with relay admission, peer admission, TTL/renewal posture, relay_ready, and explicit no-payload boundary flags; web-admin shows rv leases in recent heartbeat tables.
- C17Z14 rendezvous lease refresh contract is implemented and docker-test-runtime-proven: node-agent refreshes renewal-needed/stale rendezvous leases through node-scoped synthetic config reload, updates the running peer cache/route/lease state, and reports refresh plus stale relay withdrawal/reselection telemetry. Service payload forwarding remains unavailable.
- C17Z15 backend relay replacement policy is implemented and docker-test-runtime-proven: backend consumes recent stale-relay heartbeat feedback, withdraws stale explicit rendezvous leases, scores alternate relay candidates from route adjacency, endpoint priority, policy tags, and recent mesh-link health, and returns replacement leases plus rendezvous_relay_policy decisions in node-scoped synthetic config. Node-agent reports c17z15.mesh_rendezvous_lease_report.v1 and keeps stale state scoped to the exact lease/relay, so replacement leases for the same peer are not marked stale by association. Service payload forwarding remains unavailable.
- C17Z16 route/path decision artifact is implemented and docker-test-runtime-proven: backend c17z16.synthetic.v1 config includes route_path_decisions with original hops, effective hops, local previous/next hop, selected replacement relay, generation, score reasons, and no-payload boundary flags. Node-agent stores the control-plane route generation and reports c17z16.mesh_route_path_decision_report.v1 plus c17z16.mesh_rendezvous_lease_report.v1. Service payload forwarding remains unavailable.
- C17Z17 node-side route generation tracker is implemented and docker-test-runtime-proven: backend c17z17.synthetic.v1 config and node-agent mesh_route_generation_report track active/applied/unchanged/withdrawn route decisions, generation changes, total counters, and withdrawn_by_replacement records for stale relay paths when replacement is first observed. Service payload forwarding remains unavailable.
- C17Z18 synthetic route-health effective path runtime is implemented and docker-test-runtime-proven: backend c17z18.synthetic.v1 config and node-agent mesh_route_health_config_report apply Control Plane route_path_decisions to synthetic route-health route config only. The synthetic runtime probes selected effective paths through replacement relays, reports expected/observed hops and drift state, and backend latest mesh links preserve route-health observations separately from connection-manager observations. Service payload forwarding remains unavailable.
- C17Z19 synthetic route-health feedback scoring is implemented and docker-test-runtime-proven: backend consumes recent synthetic_route_health observations in relay scoring, uses drift/unreachable/failure metadata to mark the exact selected relay stale, boosts healthy low-latency relay candidates, and returns replacement leases/route decisions through the existing synthetic config contract. Migration 000022 adds the synthetic mesh service class. Service payload forwarding remains unavailable.
- C17Z20 node-side route-health feedback refresh is implemented and docker-test-runtime-proven: after reporting synthetic route-health drift/unreachable/failure, node-agent performs a bounded node-scoped synthetic-config refresh, applies returned replacement route decisions to route-health config immediately, and reports c17z20.mesh_route_health_feedback_refresh_report.v1. Service payload forwarding remains unavailable.
- C17Z21 offsite control-plane bootstrap relay and Windows updater foundation are implemented and docker-test/runtime-proven: backend exposes /mesh/v1/health through the admin/nginx control-plane origin and issues control-plane-only bootstrap rendezvous leases for outbound-only nodes using their reported public control-plane URL. Remote Windows node ifcm-rufms-s-mo1cr resolved 3/3 peers to relay_ready through http://94.141.118.222:19191, while service/RDP/VPN payload forwarding remains disabled. Release 0.1.3 is published for Docker and Windows windows_service artifacts, and install-windows now installs a per-node Scheduled Task updater for future Windows node-agent updates.
- C17Z22 updater observability and Windows host-agent self-update staging are implemented and test-proven: rap-host-agent reports phase=plan,status=noop for already-current/no-op plans, update state is scoped per product so rap-node-agent and rap-host-agent do not overwrite each other's current version, and the Windows updater wrapper runs short one-shot cycles that can apply staged rap-host-agent.exe.next before the next update check. Release rap-host-agent 0.1.3 is published for linux_binary and windows_binary; Docker updater containers on test-1/2/3 report no-op plans.
- Installation Authority foundation is implemented: production requires strict Product Root public key config, first-owner bootstrap uses signed Ed25519 activation manifests, installation_authority and signed platform_role_grants are persisted, and strict platform-admin checks ignore direct users.platform_role database edits without a valid signed grant. Web-admin exposes installation status/first-owner bootstrap, and scripts/installation/product-root-tool.go generates keys/manifests for offline product-root operations.
- Cluster Authority and node enrollment bootstrap are docker-test lifecycle smoke-proven in run dev-bootstrap-20260428-201430: a fresh dev install bootstrapped the first owner, created a cluster, issued a signed join token, accepted real rap-node-agent enrollment, owner-approved the join request, agent-polled signed bootstrap, persisted cluster authority pin, heartbeated, and verified signed c17z18.synthetic.v1 Control Plane config. Production service payload forwarding remains unavailable.
- Migration 000021_cluster_authority_keys drops/recreates cluster_admin_summaries because fresh replay proved PostgreSQL cannot change that view layout via CREATE OR REPLACE VIEW.
- rap-node-agent desired-workload polling/status reporting is gated by RAP_WORKLOAD_SUPERVISION_ENABLED=false by default while service runtime supervision remains a stub.
- C18 VPN/IP tunnel service target design is completed as documentation only.
- C18A VPN/IP tunnel control-plane data model foundation is implemented and backend-test-proven.
- C18B VPN/IP tunnel lease/fencing hardening is implemented and backend-test-proven.
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is implemented and backend-test-proven.
- No next platform-core implementation step is automatically authorized after C17Z20. The next mesh layer should stay limited to route-health feedback refresh dampening/no-change cooldown unless the user explicitly chooses another staged task.
- Latest RDP performance reference image: rap-rdp-worker:rdp-perf6-dirty-region
- Stage 5.2 file-download runtime artifacts remain preserved for when RDP work resumes, but they are not the active next task.
- Do not use docker.cin.su for this project unless explicitly requested for a separate one-off check.
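The C17L, C17N, and C17R items above describe a bounded metadata-only sink: fixed capacity, oldest-entry drop, four aggregate metrics, and a capacity limit of 10000. A minimal ring-buffer sketch of that behavior is shown here. The type and function names (`Sink`, `Observation`, `NewSink`) are hypothetical, the observation fields are placeholders, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

// maxSinkCapacity matches the documented guard: values above 10000 are rejected.
const maxSinkCapacity = 10000

// Observation is a metadata-only record; the fields are hypothetical placeholders.
type Observation struct {
	EnvelopeID string
	RouteID    string
}

// Metrics exposes only the four aggregate values named in the list above.
type Metrics struct {
	Capacity           int
	Depth              int
	AcceptedTotal      uint64
	DroppedOldestTotal uint64
}

// Sink is a fixed-capacity ring buffer that drops the oldest entry when full.
type Sink struct {
	buf      []Observation
	head     int // index of the oldest retained entry
	depth    int
	accepted uint64
	dropped  uint64
}

// NewSink rejects a non-positive capacity (sink disabled) and any value above the limit.
func NewSink(capacity int) (*Sink, error) {
	if capacity <= 0 {
		return nil, errors.New("sink disabled: capacity must be greater than zero")
	}
	if capacity > maxSinkCapacity {
		return nil, fmt.Errorf("capacity %d exceeds limit %d", capacity, maxSinkCapacity)
	}
	return &Sink{buf: make([]Observation, capacity)}, nil
}

// Accept stores one observation, overwriting the oldest entry when the sink is full.
func (s *Sink) Accept(o Observation) {
	if s.depth == len(s.buf) {
		s.buf[s.head] = o
		s.head = (s.head + 1) % len(s.buf)
		s.dropped++
	} else {
		s.buf[(s.head+s.depth)%len(s.buf)] = o
		s.depth++
	}
	s.accepted++
}

// Metrics returns aggregate counters only; it never returns stored records.
func (s *Sink) Metrics() Metrics {
	return Metrics{Capacity: len(s.buf), Depth: s.depth, AcceptedTotal: s.accepted, DroppedOldestTotal: s.dropped}
}

func main() {
	sink, err := NewSink(2)
	if err != nil {
		panic(err)
	}
	for _, id := range []string{"e1", "e2", "e3"} {
		sink.Accept(Observation{EnvelopeID: id, RouteID: "route-a"})
	}
	fmt.Printf("%+v\n", sink.Metrics())
}
```

The sketch deliberately has no method that returns stored records, which mirrors the "no read API" boundary stated for C17M through C17P.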
Backend
- Go
- PostgreSQL = source of truth
- Redis = live coordination / routing only
- REST for control plane
- WebSocket for live session channel
Worker
- C++ worker
- FreeRDP integration
- worker runtime hides FreeRDP details from backend
- The C++ worker remains the primary RDP runtime.
- Target RDP performance direction: docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md.
- The RDP performance rewrite scope is limited to C++ RDP service adapter internals. It must not redesign backend control plane, cluster transport, organizations, leases, or session lifecycle.
- The C# RDP service skeleton is inactive research scaffolding and is not the current runtime direction.
- Current RDP Adapter baseline: RDP-Perf-6 dirty-region direct binary rendering is completed and smoke-proven on docker-test. RDP work is paused by product decision; next active work is Fabric Core / cluster foundation.
- P3/P3.1 security-readiness foundation exists: production mode rejects plaintext credential-like resource metadata, requires secret_ref for RDP/VNC/SSH resources, and has an encrypted PostgreSQL-backed resource secret storage/resolver MVP. P3.2 direct-worker TLS/PKI guard exists.
- P3.3 production-like test-stand smoke is complete on docker-test: backend runs in APP_ENV=production with a test-only secret key file, a secret-backed RDP resource starts real sessions through the resolver path, metadata/audit do not contain plaintext credentials, and backend gateway fallback remains available when direct worker WSS trust is smoke_insecure.
- P3.4 production direct-worker WSS trust model is documented in docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md; it defines platform CA/public CA behavior, worker certificate SAN/identity requirements, app-local Windows trust direction, rotation/revocation, and the future platform_ca smoke plan. No RDP runtime behavior changed in P3.4.
- P3.5 app-local platform CA trust is implemented and runtime-proven on docker-test: Windows client validates direct worker WSS with an app-local platform CA bundle, keeps hostname/SAN validation enabled, selects direct_worker_wss without insecure TLS bypass, and falls back to backend gateway for unknown CA / smoke-only production cases.
- P3.6 stale Redis worker/live event idempotency is implemented and runtime-proven: stale worker events for terminal PostgreSQL sessions are ignored, backend restart survives stale Redis events, and terminal sessions are not reopened.
- Stage 5.2 server-to-client file download core data path is runtime-proven: direct worker WSS and backend gateway fallback both download text/binary files from RAP_Transfers\ToClient with matching size/hash, and direct policy blocking is proven for disabled and client_to_server. Lifecycle blocking is also runtime-proven for detach, old-client takeover, and worker failure. Runtime report: artifacts/stage5-2-file-download-runtime-report.md.
- Stage 5.2 is not fully accepted yet. Remaining proof: Windows desktop UI download path and regression matrix for rendering/input/clipboard/upload/reconnect/takeover.
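The P3.6 rule above (stale worker events must not reopen sessions that PostgreSQL already records as terminal) can be sketched as a small guard. The state names, `WorkerEvent`, and `applyWorkerEvent` are hypothetical; the only behavior taken from the text is that the persisted state wins over a late live event.

```go
package main

import "fmt"

// SessionState values are illustrative; the real state names may differ.
type SessionState string

const (
	StateActive     SessionState = "active"
	StateDetached   SessionState = "detached"
	StateTerminated SessionState = "terminated"
	StateFailed     SessionState = "failed"
)

// isTerminal reports whether a persisted state is final (assumed set).
func isTerminal(s SessionState) bool {
	return s == StateTerminated || s == StateFailed
}

// WorkerEvent is a hypothetical live event delivered through Redis.
type WorkerEvent struct {
	SessionID string
	NewState  SessionState
}

// applyWorkerEvent takes the state persisted in PostgreSQL (the source of
// truth) and a live event. Events for terminal sessions are ignored, so
// replaying the same stale event any number of times has no effect.
func applyWorkerEvent(persisted SessionState, ev WorkerEvent) (SessionState, bool) {
	if isTerminal(persisted) {
		return persisted, false
	}
	return ev.NewState, true
}

func main() {
	stale := WorkerEvent{SessionID: "s-1", NewState: StateActive}
	next, applied := applyWorkerEvent(StateTerminated, stale)
	fmt.Println(next, applied)

	next, applied = applyWorkerEvent(StateDetached, stale)
	fmt.Println(next, applied)
}
```

Because the decision depends only on the persisted state, a backend restart that re-reads old Redis events reaches the same result as the first delivery.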
Clients
- future native clients:
  - Windows: native desktop client first
  - Linux: native desktop client later
- web UI is admin/control plane, not the primary power-user client
Final architecture direction
The long-term target architecture is documented in:
- docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md
- docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md
- docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md
This document defines the target Secure Access Fabric architecture only. It is not the current implementation scope and must not be used as permission to start mesh, VPN, multi-cluster, updater, or realtime data-plane migration work without an explicit staged prompt.
CLUSTER_NODE_ADMIN_FOUNDATION.md defines the next platform-core planning
baseline for clusters, node enrollment, native node-agent identity, platform
admin console, multi-cluster administration, and future organization admin
visibility. It is a staged foundation document, not permission to implement
mesh packet routing or VPN runtime.
WEB_INGRESS_AND_ADMIN_UI_MODEL.md defines WEB as HTTP/HTTPS ingress and
Admin UI presentation only. Cluster configuration remains Control Plane
ownership through scoped APIs, PostgreSQL source-of-truth mutations, and audit.
Dynamic pages must be safe schema-driven projections and must not embed
internal topology, peer caches, route caches, secrets, raw credentials, or
arbitrary executable code.
Admin endpoint placement is explicit. Fabric Storage / Config Storage nodes do not automatically host or move the cluster panel. Platform Owner Console remains global platform-owner scope. Cluster Admin Endpoint requires explicit admin/web ingress role assignment, cluster health/trust readiness, and Control Plane authorization. Organization Admin Panel remains a tenant-safe projection.
The final platform must support:
- Multi-tenancy / Organizations
  - platform has many organizations
  - each organization has isolated users, groups, resources, policies, audit, connectors
  - users may belong to multiple organizations
  - organization admins only see their organization
  - platform admins see platform scope
- Identity federation
  - local users
  - LDAP / Active Directory
  - OIDC
  - future extensibility for more identity sources
  - access mappings based on external groups / claims
- Cluster of nodes
  - no mandatory single central node
  - many nodes across many sites
  - nodes can be platform-managed or customer-managed
  - customer-managed nodes are sandboxed cluster participants, not full cluster owners
- Node agent
  - small stable always-running agent on every node
  - supervises services
  - downloads updates
  - verifies signed artifacts
  - can rollback to previous version
  - can restart crashed services
  - can work on thin or thick nodes
- Service-based node model
  Each node is not monolithic. A node has:
  - capabilities: what it can do physically/technically
  - enabled services: what it is allowed/assigned to do
  Possible services include:
  - ingress-gateway
  - mesh-router
  - relay
  - connector-host
  - vpn-adapter
  - session-worker
  - media-relay
  - file-relay
  - update-cache
  - config-replica
  - audit-sink
  - metrics-exporter
- Cluster mesh and routing
  - encrypted inter-node communication
  - dynamic topology
  - no need for full mesh
  - multi-hop routing allowed
  - route failover
  - client failover between ingress nodes
  - connector failover between nodes
- Split-brain prevention
  - quorum-based cluster behavior
  - minority partition must not become a second authoritative cluster
  - degraded / recovery / isolated modes
  - manual recovery / promote decision by platform recovery admin
- Connector / VPN layer
  - connectors are reusable network access methods
  - one connector may be used by multiple resources
  - connector placement and failover are controlled by policy
  - nodes may be allowed or disallowed to host connectors
  - direct access, VPN, relay and future egress modes must fit this model
- Future exit mode
  - split tunnel
  - full tunnel
  - internet access through cluster
  - not first implementation priority
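The split-brain rule in the list above (a minority partition must not become a second authoritative cluster) can be illustrated with a strict-majority check. The source only says "quorum-based"; the strict-majority rule, the two mode names, and the function names are assumptions made for this sketch.

```go
package main

import "fmt"

// PartitionMode is a hypothetical, reduced set of cluster modes.
type PartitionMode string

const (
	ModeAuthoritative PartitionMode = "authoritative"
	ModeIsolated      PartitionMode = "isolated"
)

// hasQuorum assumes a strict majority of voting nodes is required.
// With an even split (for example 2 of 4) neither side has quorum.
func hasQuorum(reachableVoters, totalVoters int) bool {
	return totalVoters > 0 && reachableVoters > totalVoters/2
}

// partitionMode maps the quorum result to a mode. A partition without
// quorum stays isolated until a recovery admin decides otherwise.
func partitionMode(reachableVoters, totalVoters int) PartitionMode {
	if hasQuorum(reachableVoters, totalVoters) {
		return ModeAuthoritative
	}
	return ModeIsolated
}

func main() {
	fmt.Println(partitionMode(3, 5)) // majority side of a 5-node cluster
	fmt.Println(partitionMode(2, 5)) // minority side of the same split
	fmt.Println(partitionMode(2, 4)) // even split
}
```

Under this rule at most one side of any partition can be authoritative, which is the property the split-brain bullet asks for; the manual promote path mentioned in the list is outside the sketch.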
Non-negotiable design rules
- Do not rewrite proven session lifecycle carelessly.
- Do not turn Redis into a source of truth.
- Do not make certificate-ignore a global worker setting.
- Do not make customer-managed nodes platform-wide trusted by default.
- Do not create a separate cluster per organization.
- Do not assume a single permanently reachable central node.
- Do not rely on “secret protocol with no docs” as security.
- Security must come from crypto, auth, isolation, policy and observability.
- Prefer incremental evolution from current proven system.
- Do not collapse platform control plane and data plane into one vague layer.
Implementation strategy
The codebase must evolve in phases.
Current implementation focus remains:
- RDP work is paused by product decision
- preserve the accepted RDP Adapter baseline and Stage 5.x file-transfer work
- do not delete or rewrite the current RDP MVP while platform-core work starts
- C1-C9 platform-core foundations are implemented and verified: clusters, node enrollment, node-agent scaffold, platform admin console, workload supervision contract, mesh control-plane prep, mesh skeleton, multi-cluster hardening, and organization admin foundation
- C10 Fabric Core configuration distribution design is completed
- C11 signed scoped cluster snapshot model is completed
- C12 node local state store is completed
- C13 Fabric Storage / Config Storage service foundation is completed
- C14 peer directory and cache model is completed
- C15 Fabric Routing Engine skeleton is completed
- C16 secure node-to-node channel lifecycle is completed
- C17 mesh routing runtime implementation plan is completed
- C17A synthetic mesh runtime skeleton is implemented and test-proven with synthetic fabric messages only, no RDP/VPN/production service traffic
- C17B route health and failover probes are implemented and test-proven with synthetic traffic only, no RDP/VPN/production service traffic
- C17C relay semantic hardening is implemented and test-proven with synthetic channel classes only, no RDP/VPN/production service traffic
- C17D non-production test-service path is implemented and test-proven with bounded synthetic.echo traffic only, no RDP/VPN/production service traffic
- C17E live node-to-node synthetic HTTP transport is implemented and smoke-proven with synthetic traffic only
- C17F scoped synthetic route config loading and route-health reporting is implemented and smoke-proven with synthetic traffic only
- C17G Control Plane scoped synthetic config read/consume is implemented and test-proven with synthetic traffic only
- C17H deployed multi-agent synthetic config smoke is implemented and runtime-proven on docker-test with synthetic traffic only
- C17I production forwarding gate foundation is implemented and test-proven; production forwarding remains unavailable
- C17J production envelope contract validation is implemented and test-proven; production forwarding remains unavailable
- C17K production envelope observation is implemented and test-proven; production forwarding remains unavailable
- C17L bounded production observation sink is implemented and test-proven; production forwarding remains unavailable
- C17M production observation sink wiring is implemented and test-proven; production forwarding remains unavailable
- C17N production observation sink metrics are implemented and test-proven; production forwarding remains unavailable
- C17O production observation sink local metrics logging is implemented and test-proven; production forwarding remains unavailable
- C17P production observation sink change-driven metrics logging is implemented and test-proven; production forwarding remains unavailable
- C17Q production forwarding gate/runtime log boundary is implemented and test-proven; production forwarding remains unavailable
- C17R production observation sink capacity guard is implemented and test-proven; production forwarding remains unavailable
- C17S production observation panic fail-closed hardening is implemented and test-proven; production forwarding remains unavailable
- C17T production envelope payload boundary is implemented and test-proven; production forwarding remains unavailable
- C17U production envelope created-at skew boundary is implemented and test-proven; production forwarding remains unavailable
- C17V peer endpoint candidate model and NAT/connectivity hints are implemented and test-proven; production forwarding remains unavailable
- C17W peer endpoint candidate scoring model is implemented and test-proven; production forwarding remains unavailable
- C17X health-aware endpoint candidate scoring overlay is implemented and test-proven; production forwarding remains unavailable
- C17Y Platform Owner synthetic mesh visibility is implemented and build/test-proven; production forwarding remains unavailable
- C17Z production fabric-control direct forwarding is implemented and test-proven; production service traffic remains unavailable
- C17Z1 production fabric-control multi-hop route-path forwarding is implemented and test-proven; production service traffic remains unavailable
- C17Z2 production fabric-control forwarding observability is implemented and test-proven; production service traffic remains unavailable
- C17Z3 production fabric-control route-config boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z4 scoped peer directory/recovery seed boundary is implemented and test/build-proven; production service traffic remains unavailable
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z7 private/corporate endpoint candidate boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z8 peer connection state machine boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z9 peer recovery planner boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z10 peer connection intent planner boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z11 peer connection manager runtime boundary is implemented and test-proven; production service traffic remains unavailable
- C17Z12 rendezvous/relay control-plane contract is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z13 rendezvous lease telemetry is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z14 rendezvous lease refresh contract is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z15 backend relay replacement policy is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z16 route/path decision artifact is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z17 node-side route generation tracker is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z18 synthetic route-health effective path runtime is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z19 synthetic route-health feedback scoring is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z20 node-side route-health feedback refresh is implemented and docker-test-runtime-proven; production service traffic remains unavailable
- C17Z21 node installation/update control-plane is implemented and docker-test-runtime-proven for Docker nodes; production service traffic remains unavailable
- C17Z22 Windows host-agent install/update supervision is implemented and runtime-proven on the remote Windows node; production service traffic remains unavailable
- C17Z23 update observability is implemented in backend/admin UI: per-node updater status history is exposed and deployed on docker-test, so node-agent and host-agent update activity can be audited from node details
- C17Z24 combined updater reporting is implemented and docker-test-proven:
Linux/Docker `rap-host-agent update-loop` now also polls/reports `rap-host-agent` status, release `0.1.4` is published for node-agent and host-agent artifacts, and docker-test nodes `test-1/2/3` auto-updated to node-agent `0.1.4` while reporting host-agent `0.1.4` no-op status.
- C17Z25 Windows updater repair visibility is implemented in admin UI: node
details / Updates now shows a ready CMD repair command for existing Windows
nodes using `http://vpn.cin.su:19191/api/v1`, `--replace`, and `--auto-update-current-version 0.0.0` so a stale updater wrapper can be recreated without a new join token.
- C17Z26 updater fleet visibility is implemented in admin UI: the node list now
shows per-node updater status based on latest `rap-node-agent` and `rap-host-agent` reports, explicitly flagging missing host-agent reports, stale update reports, or update errors before opening node details.
- C17Z27 backend version-state projection is implemented and deployed on
docker-test: node list responses now derive `version_state` from active `rap-node-agent` desired policy plus latest update report. Docker/Linux nodes on `0.1.4` show `current`; the remote Windows node still on `0.1.3` shows `outdated` while remaining heartbeat-healthy.
- C17Z28 Windows updater loop hardening is implemented and partially
docker-test-proven via release `0.1.5`: Windows host-agent updater scripts now run combined `update-loop --max-runs 1`, and Windows `update-loop` also polls/applies `rap-host-agent` updates. Release `0.1.5` artifacts are published for Docker/Linux and Windows; docker-test nodes `test-1/2/3` updated to `rap-node-agent 0.1.5`. Existing remote Windows nodes with a stale pre-0.1.5 updater wrapper still require one repair command from admin UI to replace their local wrapper, after which automatic polling should continue.
- Admin UI now marks missing host-agent updater reports as
`repair updater` in the node list and explains in node details / Updates when to run the Windows repair command. The command uses the external control-plane endpoint and does not require a join token for already enrolled Windows nodes.
- Admin UI node details / Updates also provides a ready downloadable `rap-repair-updater-<node>.cmd` plus copy-command action for Windows repair, reducing operator copy/paste mistakes on remote Windows hosts.
- Windows repair command generation was hardened after the first remote repair:
foreground `update-loop` now includes explicit `--node-id`, copies any staged `rap-host-agent.exe.next` over the main host-agent binary after the one-shot loop exits, deletes the staged file, and runs the updater scheduled task. The node list now distinguishes `host-agent staged` from generic stale/error.
- C17Z29 Windows persistent updater repair is implemented in
`rap-host-agent` release `0.1.6`: `install-windows` accepts `--node-id` and writes that node id into the persistent Windows updater wrapper so Scheduled Task polling no longer depends on finding `identity.json` in the expected state directory. Docker-test nodes `test-1/2/3` updated to `0.1.6`; existing Windows and off-host Docker nodes still need their local updater wrappers to pick up the 0.1.6 host-agent repair path.
- C17Z30 operator-configured public mesh endpoints are implemented and
docker-test-deployed: desired `mesh-listener.advertise_endpoint` is now projected into peer endpoint candidates for other nodes and preferred over auto-discovered private heartbeat endpoints. `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`) is configured with `listen_addr=0.0.0.0:19131`, `advertise_endpoint=http://94.141.118.222:19199`, `connectivity_mode=direct`, `nat_type=port_restricted`, `region=home`. `test-1` synthetic config now receives `home-1` peer endpoint `http://94.141.118.222:19199`; internal `192.168.200.85:19131` responds with HTTP 405 on GET, while external `94.141.118.222:19199` currently refuses TCP, so router/firewall forwarding still needs correction outside the platform.
- C17Z31 offsite bootstrap peer selection is implemented and docker-test
deployed: operator-configured public/direct desired mesh-listener endpoints
are kept in core-mesh bootstrap even after the default warm-peer target is
reached. This fixes the case where remote Windows node
`ifcm-rufms-s-mo1cr` received only `test-*` warm peers and no `home-1`. Its synthetic config now includes `home-1` endpoint `http://94.141.118.222:19199` and candidates ordered as operator public, heartbeat advertised public, then private LAN converted to relay-required for offsite. External TCP to `94.141.118.222:19199` still failed from Codex and docker-test checks while internal `192.168.200.85:19131` succeeds, so a real offsite `Test-NetConnection 94.141.118.222 -Port 19199` is the next network validation.
- C17Z32 native Ubuntu/Linux service install is implemented and docker-test
deployed: backend exposes `/node-agents/linux-install-profile`, host-agent supports `install-linux`, installs `rap-node-agent` under `/opt/rap/<node>`, state under `/var/lib/rap/nodes/<node>`, config under `/etc/rap/<node>`, creates `rap-node-agent-<node>.service`, and creates a persistent `rap-host-agent-updater-<node>.service` for automatic node-agent and host-agent updates. Release `0.1.7` is published for `rap-node-agent` (`linux_binary`, `windows_service`) and `rap-host-agent` (`linux_binary`, `windows_binary`). Admin UI now has an `Ubuntu service` install profile and generates profile-based `install-linux` commands. A one-use token for `vps-ubuntu-1` is active until 2026-05-02T08:41:41Z: `rap_join_a23Xhz63YstshWUBAPGPz5fzQ8YpHDP05RXaaYa4DoA`; scope roles are `core-mesh` and `relay-node`, control-plane endpoint is `http://vpn.cin.su:19191/api/v1`, artifact endpoint is `http://vpn.cin.su:19191/downloads`.
- Admin UI and docs now cover the full Windows updater operational workflow:
node details shows an `Updater health` summary, generated repair CMD prints scheduled-task and binary diagnostics before/after repair, applies staged host-agent binaries, restarts the updater task, and README documents first install, repair without join-token, system-task/user-task behavior, staged host-agent recovery, and reboot/autostart verification.
- Cluster Authority plus node enrollment bootstrap polling are docker-test lifecycle-smoke-proven; fresh install migration replay is fixed for `cluster_admin_summaries`
- C18 VPN/IP tunnel service target design is completed as documentation only
- C18A VPN/IP tunnel control-plane data model foundation is implemented and backend-test-proven
- C18B VPN/IP tunnel lease/fencing hardening is implemented and backend-test-proven
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is implemented and backend-test-proven
- Version Storage / Update Repository is documented as a future Fabric Core service for signed release manifests, OS/arch artifacts, stable/current/candidate channels, update-cache mirroring, node-agent update supervision, rollback, and explicit data-structure migration bundles. Runtime updater behavior is partially implemented for the current Docker and Windows node-agent/host-agent paths; broader staged rollout policy and service payload forwarding remain separate work.
- no next platform-core implementation step is automatically authorized after C17Z20; choose the next narrow staged prompt explicitly before continuing
- preserve the proven RDP lifecycle behavior
- keep the current backend gateway available as the active/fallback implementation path
- accepted VPN data-plane target: the phone/client connects only to an available entry node; the entry node uses the existing mesh/fabric route to a selected exit node/pool, and the exit node handles LAN/internet egress. Nodes behind NAT may participate when they can maintain outbound mesh/control sessions. Backend packet relay must remain a compatibility/fallback path, not the desired steady-state path.
- C18D VPN-over-fabric foundation is implemented and docker-test-started:
VPN client profiles include `vpn_fabric_route` with entry pool, exit pool, selected entry/exit, preferred `fabric_mesh` data-plane, and `backend_relay` fallback. Node-agent `0.2.39` adds a dedicated production `vpn_packet` channel (`vpn.packet_batch`, 256 KiB batch limit), destination delivery hook, `vpnruntime.FabricPacketTransport`, and `vpn_fabric_packet_transport` heartbeat capability. `home-1` auto-updated to `0.2.39`; other nodes have automatic desired policy `0.2.39` and should move as their updater loops pick it up. Live Android VPN traffic still uses backend relay until entry-node client ingress is wired to the fabric transport.
- C18E VPN-over-fabric route contract is backend-deployed on docker-test as
`rap-backend:test-vpn-fabric-route-0.2.41`: when a VPN client profile selects different entry and exit nodes, backend now ensures two active `mesh_route_intents` with service_class `vpn_packets` and allowed channel `vpn_packet`. The live HOME profile currently selects `usa-los-1` as entry and `home-1` as exit when `entry_node_id=b829ffde-...` is requested, and the synthetic config for both nodes includes the two `vpn_packet` routes. Existing fallback remains `backend_relay`; production forwarding gate is still disabled on old/live remote nodes until their runtime is explicitly updated/enabled.
- External/offsite updater gap found and fixed for version
`0.2.40`: native `rap-node-agent` binaries for `linux_binary`, `linux_service`, and `windows_service` plus matching `rap-host-agent` binaries are copied under `/downloads` and registered in channel `dev-external`. Update plans for `usa-los-1` (`linux_binary`) and `ifcm-rufms-s-mo1cr` (`windows_service`) now return `action=update`, `target_version=0.2.40` instead of `no_matching_artifact`.
- C18F production-forwarding gate work is partially live: backend
`rap-backend:test-vpn-fabric-route-0.2.42` signs node synthetic configs with `production_forwarding=true`/`control_plane_only=false` when the node's desired `mesh-listener` workload has `production_forwarding_enabled=true`. `home-1` and `usa-los-1` desired mesh-listener configs have this flag enabled. Node-agent `0.2.44` accepts signed production-forwarding mesh configs and host-agent `0.2.44` fixes Docker updater behavior so synthetic mesh runtime is not disabled on Docker updates. Runtime status: `usa-los-1` reports `mesh_production_forwarding=true`; `home-1` reports `0.2.44` and synthetic runtime enabled, but its listener report is still `disabled`/`listen_addr_empty`, so `home-1` is not yet a usable production fabric endpoint. Next action is to repair why `home-1` is not applying the signed mesh-listener config (`listen_addr=0.0.0.0:19131`) after Docker updater restart.
- C18G VPN-over-fabric runtime path is live-tested on docker-test. Backend is
deployed as `rap-backend:test-vpn-fabric-route-0.2.43`; VPN route intents now allow both `vpn_packet` data and `fabric_control` health probes. Node-agent `0.2.47` fixes initial production VPN packet envelope hop addressing and reports the matching version. `home-1` and `usa-los-1` both report `0.2.47`, healthy, listener `0.0.0.0:19131`, and `mesh_production_forwarding=true`. Live route health is reachable in both directions (`usa-los-1 -> home-1` around 200 ms, `home-1 -> usa-los-1` around 200-415 ms). A direct live POST to `http://195.123.240.88:19131/api/v1/clusters/.../vpn-connections/.../tunnel/client/packets` returns `202 Accepted`, proving entry-node VPN packet ingress can forward over fabric to the home exit. The HOME VPN placement policy now has entry pool `[usa-los-1, home-1]` and exit `home-1`; client profile with preferred `usa-los-1` selects `usa-los-1 -> home-1`.
- C18H live VPN triage on 2026-05-04:
`home-1` and `usa-los-1` report node-agent `0.2.48`, healthy heartbeats, active HOME VPN assignment on `home-1`, and `packet_forwarding=true`/`runtime_available=true`. Manual packet tests through the USA entry proved the path Android-style packet -> `usa-los-1` -> fabric -> `home-1` -> LAN/DNS -> fabric -> `usa-los-1` -> client can return ICMP and DNS replies. The remaining live symptom was the phone not sending fresh packets to the current entry after the backend relay queue was cleared. Android VPN app `0.2.59` was built and published to `/downloads/rap-android-rdp-vpn-latest-debug.apk`; it normalizes old saved backend URLs (`vpn.cin.su:19191`, `94.141.118.222:19191`, `192.168.200.61:18080`, etc.) to the current USA entry backend `http://195.123.240.88:19131/api/v1` and shows app version, device id, and connection id in the header for live log correlation.
- C18I fabric service-channel foundation is live on 2026-05-07. Backend,
node-agent, and Android VPN release `0.2.159` are published. VPN profiles now include a signed `rap.fabric_service_channel_lease.v1` with `entry_direct_http_v1` packet and WebSocket templates. Android consumes this lease and sends service-channel headers. The `usa-los-1` entry endpoint validates the cluster-authority signed lease payload and token hash; a live smoke through `http://195.123.240.88:19131/.../fabric/service-channels/...` succeeded with a valid lease and rejected a bad token with `403`. Current HOME profile selects `usa-los-1` as entry and `home-1` as exit; both nodes report `0.2.159`. Docker-test nodes `test-1`, `test-2`, and `test-3` also report `0.2.159`. `ifcm-rufms-s-mo1cr` is still on `0.2.119`; it has staged the host-agent `0.2.159` update and should finish on the next Windows updater loop/restart.
- C18J fabric service-channel runtime route-manager slice is live on
2026-05-07 as node/host-agent `0.2.162`. The entry-node `FabricClientPacketIngress` now preserves its runtime object across synthetic config refreshes, so heartbeat telemetry reports the same ingress object that serves HTTP/WebSocket service-channel traffic. It tracks send/receive batches, route attempts/failures, selected route/next hop, local-gateway fallback, and inbox queue depths. `SendClientPacketBatch` now retries all valid `vpn_packet` route candidates with sticky preference before backend relay is allowed as degraded compatibility fallback. Release `0.2.161` was superseded because its Docker tar was rebuilt after registration; `0.2.162` is the clean published release with matching artifact hashes. Docker-test `test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.162`; `home-1` is healthy and still on `0.2.161` awaiting its next updater loop. Live smoke through `http://195.123.240.88:19131/.../fabric/service-channels` returned `202` and `usa-los-1` telemetry then showed route attempts, one route failure, and selected next hop `home-1`, proving live ingress telemetry and alternate-route retry are active.
- C18K service-neutral flow/channel scheduler is live on 2026-05-07 as
node/host-agent `0.2.163`. The VPN proving service still carries universal IP packets and does not route by application protocol, but the entry runtime now hashes packets by IP 5-tuple, or packet hash for non-IP/invalid packets, into 32 logical `flow-*` channels. Each channel has bounded queue accounting, high-watermark/backpressure/dropped telemetry, and batches are fanned out per logical channel before being sent through the same fabric route-manager. Live smoke against `usa-los-1` posted two different IP flows through the signed service-channel endpoint and heartbeat reported `send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`, `enqueued=2`, `dequeued=2`, `dropped=0`, with queue depths for `flow-12` and `flow-14`. All six current cluster nodes (`home-1`, `usa-los-1`, `ifcm-rufms-s-mo1cr`, `test-1`, `test-2`, `test-3`) report node-agent `0.2.163` and healthy.
- C18L active flow scheduling telemetry is live on 2026-05-07 as
node/host-agent `0.2.164`. Each `flow-*` channel now keeps route memory, served count, last served time, last route/next hop, failed-route marker, consecutive failures, stall count, last send duration, and explicit `route_rebuild_recommended`/`degraded_fallback_recommended` signals. The scheduler drains non-stalled channels first, prefers less-served/older channels, avoids a channel's last failed route on the next send, and only marks degraded fallback after repeated failures. Live smoke against `usa-los-1` posted two IP flows through the signed service-channel endpoint: heartbeat reported schema `c18l.fabric_service_channel_runtime_report.v1`, `send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`, `dropped=0`, `backpressure=false`, `last_next_hop=home-1`, and per-flow `served=1`. One stale candidate route failed and was bypassed before the successful route to `home-1`. All six current cluster nodes (`home-1`, `usa-los-1`, `ifcm-rufms-s-mo1cr`, `test-1`, `test-2`, `test-3`) report node-agent `0.2.164` and healthy.
- C18M Control Plane service-channel feedback is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.165` is deployed on docker-test, and node/host-agent `0.2.165` artifacts are published. When issuing `rap.fabric_service_channel_lease.v1`, backend now reads fresh entry-node heartbeat metadata `fabric_service_channel_runtime_report.ingress.flow_scheduler.channel_stats`, builds per-route service-channel feedback, boosts recently successful routes, penalizes recent failures, and fences routes that report `route_rebuild_recommended`, `degraded_fallback_recommended`, or repeated consecutive failures. Fenced routes are not selected as primary or alternate; if all selected entry/exit routes are fenced, the lease uses explicit degraded backend fallback with reason `fabric_routes_fenced_by_service_channel_feedback`. Live smoke created two short-lived `test-1 -> test-2` route intents, injected a fresh service-channel flow feedback heartbeat marking the higher-priority route as rebuild-required, and the next lease selected the lower-priority healthy route with score reason `service_channel_recent_success`; the bad route was not offered as an alternate. Current node rollout: `home-1`, `usa-los-1`, `test-1`, `test-2`, and `test-3` report `0.2.165`; Windows `ifcm-rufms-s-mo1cr` remains healthy on `0.2.164` and should move on its next updater cycle.
- C18N durable service-channel route feedback is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.166` is deployed on docker-test with migration `000025_fabric_service_channel_route_feedback`. Heartbeats now persist service-neutral route observations into `fabric_service_channel_route_feedback_observations` and maintain an expiring latest view in `fabric_service_channel_route_feedback_latest`. Lease selection reads this durable latest feedback before falling back to in-memory heartbeat parsing, so route fencing survives backend restarts and stale heartbeat replacement. Node/host-agent `0.2.166` artifacts and Docker image are published, update policies target `0.2.166`, and `test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.166`; `home-1` is healthy but still on `0.2.165` until its next updater cycle. Live smoke created two short-lived `test-1 -> test-2` routes, persisted a fenced observation for the higher-priority bad route and a healthy observation for the lower-priority route, restarted backend, and the next lease selected the healthy route with `service_channel_recent_success`.
- C18O service-channel feedback diagnostics and synthetic route avoidance are
live on 2026-05-07. Backend image `rap-backend:fabric-service-channel-0.2.167` is deployed on docker-test and web-admin is rebuilt/published. Admin/API now expose fresh durable feedback through `GET /clusters/{clusterID}/fabric/service-channels/route-feedback`, and each node synthetic config includes `service_channel_route_feedback` with healthy/degraded/fenced counts and observations. Synthetic config generation skips routes fenced by the local node's durable service-channel feedback, so nodes stop receiving known-bad route configs while the feedback is active. Live smoke created fresh `test-1 -> test-2` routes, persisted `fenced` feedback for the higher-priority route and `healthy` feedback for the lower-priority route, confirmed the API returned both observations, and confirmed `test-1` synthetic config excluded the bad route while keeping the healthy route.
- C18P proactive service-channel replacement decisions are live on 2026-05-07.
Backend image `rap-backend:fabric-service-channel-0.2.168` is deployed on docker-test and web-admin is rebuilt/published. When synthetic config generation withholds a route fenced by local service-channel feedback, it now records a `route_path_decisions` item with `decision_source=service_channel_feedback_replacement`, `replacement_route_id`, effective replacement hops, and score reasons. If no alternate exists, the decision source becomes `service_channel_feedback_no_alternate` with visible score reason `no_unfenced_alternate_route`. Live smoke created fresh `test-1 -> test-2` bad/good routes, fenced the bad route, disabled older smoke routes, and confirmed `test-1` synthetic config excluded the bad route, kept the good route, and reported replacement from bad route to good route.
- C18Q service-channel replacement dampening is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.169`, node/host-agent `0.2.169` artifacts, Docker image, update policies, and web-admin are published on docker-test. Replacement selection now gives a large stable preference to routes with active healthy durable feedback, adding `active_healthy_feedback_dampening_window` to score reasons, so a recently successful replacement wins over a higher-priority but unproven route until the feedback window expires or a newer fenced/healthy observation changes the state. `RoutePathDecisionReport` now includes `degraded_decision_count` for `service_channel_feedback_no_alternate`, and node-agent heartbeat reports include `replacement_route_id` and degraded counts after upgrade. Live smoke fenced a high-priority bad `test-1 -> test-2` route, supplied healthy feedback for a low-priority route, also created a higher-priority unproven route, and confirmed replacement selected the healthy route because of the dampening window.
- C18Q hotfix
`0.2.171` is published on 2026-05-07. Node-agent now includes `service_channel_route_feedback` in the signed synthetic config model before recalculating the authority payload hash. Without this, upgraded backend configs were signed correctly but `0.2.169` agents rejected them with `control-plane synthetic mesh config authority payload hash mismatch`. Regression coverage verifies a signed config containing durable service-channel feedback. Artifacts, Docker image, latest download aliases, and update policies were moved to `0.2.171`; `test-1/2/3` are running `0.2.171` and loading `source=control_plane` again. The release includes `linux_service`, Docker, Windows service, and binary artifacts so service installs can auto-update. Old C18 smoke/expired route intents were disabled after validation.
- C18R fleet diagnostics/operator action slice is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.172` adds route feedback filters (`route_id`, `feedback_status`, `include_expired`) and `POST /clusters/{clusterID}/fabric/service-channels/route-feedback/expire`. The expire action is cluster-mutable/admin gated and marks latest feedback expired without deleting historical observations. Web-admin / Fabric Links now shows a cluster-level service-channel feedback panel with fenced, degraded, healthy and no-alternate counts, replacement/no-alternate decisions, and an operator `expire` action for stale non-healthy feedback.
- C18S service-channel feedback churn guardrails are implemented on
2026-05-07. Operator expire now records `fabric.service_channel_route_feedback.expired` audit events, returns and persists a short `operator_retry_cooldown_until`, and route generation adds `service_channel_route_retry_after_operator_expire` when a manually expired route is being retried. During that cooldown, repeated non-healthy feedback from the same reporter/route/service is suppressed as `operator_retry_cooldown` instead of immediately fencing the route again. Web-admin shows the retry/cooldown state in Fabric Links.
- C18T automatic rebuild decision contract is implemented on 2026-05-07.
`RoutePathDecision` now carries `rebuild_request_id`, `rebuild_status`, `rebuild_reason`, and `rebuild_attempt`. When fenced service-channel feedback keeps failing outside manual retry cooldown, Control Plane records a bounded rebuild request. If an unfenced alternate exists, the decision is marked `rebuild_status=applied`; if not, it is `pending_degraded_fallback` and leases expose backend relay with reason `fabric_route_rebuild_pending_backend_relay`. Web-admin shows rebuild counts, status, and attempts in Fabric Links. A live smoke on docker-test created short-lived `test-1 -> test-2` bad/good routes, reported fenced feedback for the bad route and healthy feedback for the good route, and confirmed scoped synthetic config returned `service_channel_feedback_replacement` with `rebuild_status=applied` and `rebuild_attempt=3`. Node/host-agent `0.2.175` is published so agents preserve the new signed rebuild fields.
- C18U node-agent route-manager rebuild consumption is live on 2026-05-07.
Node-agent `0.2.176` now converts backend rebuild decisions into a service-channel route-manager snapshot, counts rebuild requests/applies, marks applied/pending-degraded routes as withdrawn, clears a withdrawn cached selected route, and excludes withdrawn routes from new service-channel route candidates. This keeps new flows from retrying a route that Control Plane has already rebuilt away from. Unit coverage verifies a bad route is skipped in favor of its replacement. Node/host-agent `0.2.176` artifacts, Docker image, latest download aliases, release manifests, and node policies are published. `test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.176`. Backend `rap-backend:fabric-service-channel-0.2.176` is deployed with a panel consistency fix: if a node reports the target version, stale failed update status no longer overrides `version_state=current`.
- C18V route-manager churn telemetry is live on 2026-05-07. Node-agent
`0.2.177` adds `route_manager_transition` to the service-channel runtime report with previous/current generation, transition status, decision counts, withdrawn/restored route counts, pending-degraded fallback count, rebuild applied count, and any cleared cached route. Tests cover applied rebuild replacement, pending degraded fallback with no alternate, and restoration by a fresh config so withdrawn routes do not become sticky local state. Artifacts, Docker image, latest download aliases, release manifests, and node policies are published. `test-1/2/3` run `0.2.177`; their heartbeat metadata exposes `rap.fabric_service_channel_route_manager_transition.v1`.
- C18W live Control Plane/runtime verification is implemented and smoke-passed
on 2026-05-07. Script `scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` drives the whole loop against docker-test API: creates temporary service-channel route intents for `test-1 -> test-2`, injects fenced/healthy route feedback through heartbeat, verifies scoped config emits `rebuild_status=applied`, waits for node-agent heartbeat `route_manager_transition.status=applied_rebuild`, expires the feedback, verifies the restored config has no rebuild decision, and waits for `restored_by_new_config`. Result artifact: `artifacts/c18w-service-channel-route-manager-smoke-result.json` with run `c18w-20260507-173226`. During the smoke, operator expire exposed live pgx parameter issues; backend `rap-backend:fabric-service-channel-0.2.179` is deployed with safer UUID/text timestamp handling for feedback expire.
- C18X logical-channel isolation and bounded backpressure coverage is
implemented and smoke-passed on 2026-05-07. Node-agent/host-agent
0.2.180 artifacts, Docker image, latest download aliases, release manifests, and node policies are published. The key runtime fix is in FabricClientPacketIngress.routeCandidatesForChannel: a channel with a local failed-route avoid state no longer falls back to the global last selected route, so one degraded logical flow cannot drag unrelated flows back onto the failed path. Coverage proves independent logical-channel failover, bounded same-channel backpressure/drop telemetry, and packet-flow hashing. Script scripts/fabric/c18x-service-channel-logical-channel-smoke.ps1 passes with result artifact artifacts/c18x-service-channel-logical-channel-smoke-result.json run c18x-20260507-180647. Test docker nodes test-1/2/3 are running rap-node-agent:0.2.180; backend remains rap-backend:fabric-service-channel-0.2.179.
- C18Y route-intent lifecycle cleanup is implemented and smoke-passed on
2026-05-07. Backend
rap-backend:fabric-service-channel-0.2.181 is deployed on docker-test, and web-admin Fabric Links now shows route-intent lifecycle counts/table with operator expire and disable actions. Route intents are enriched with lifecycle_status, is_expired, and policy_expires_at. Node-scoped synthetic mesh config now filters out expired policy routes, so stale smoke routes no longer get emitted to agents for route-health probing. API actions are available at POST /clusters/{clusterID}/mesh/route-intents/{routeIntentID}/expire and /disable. Script scripts/fabric/c18y-route-intent-lifecycle-smoke.ps1 passed against docker-test API, result artifacts/c18y-route-intent-lifecycle-smoke-result.json run c18y-20260507-192702. During deploy, docker-test root disk was full from build cache/images; docker builder prune -af and docker image prune -f freed space before redeploy.
- C18Z bounded service-channel load coverage is implemented, published, and
smoke-passed on 2026-05-07. Node-agent/host-agent
0.2.181 artifacts, Docker image rap-node-agent:0.2.181, latest download aliases, release manifests, and update policies are published. test-1/2/3 are restarted on rap-node-agent:0.2.181; usa-los-1 also reports 0.2.181. The key runtime fix is in FabricFlowScheduler.Snapshot: backpressure remains visible when bounded drops occurred, even after the queue drains. Coverage proves multi-channel rebuild away from a withdrawn primary route and per-channel bounded drop/high-water telemetry. Script scripts/fabric/c18z-service-channel-load-smoke.ps1 passed against docker-test API, result artifacts/c18z-service-channel-load-smoke-result.json run c18z-20260507-194616. Release artifacts were corrected after initial publication to use backend-relative /downloads/... primary URLs plus internal/external mirror URLs, so offsite nodes resolve downloads through their own control-plane origin such as http://vpn.cin.su:19191. Current caveat: ifcm-rufms-s-mo1cr and home-1 remained version_state=failed at the last check; their next update plan now points to reachable 0.2.181 artifacts, but the local updater loop still needs to retry/report success.
- C18Z1 live service-channel ingress is implemented, published, and
smoke-passed on 2026-05-07. Node-agent/host-agent
0.2.182 artifacts, Docker image rap-node-agent:0.2.182, release manifests, and update policies are published. Backend rap-backend:fabric-service-channel-0.2.182 is deployed on docker-test. The runtime fix is a dynamic mesh listener handler: synthetic config refreshes now update /mesh/v1/forward, service-channel ingress, production routes, delivery inbox, and forward transport without requiring a port/listener restart. Backend route-feedback latest policy now prevents a fresh healthy heartbeat from immediately overwriting active degraded/fenced feedback before TTL expiry, so rebuild decisions survive long enough for nodes to apply them. Script scripts/fabric/c18z1-live-service-channel-ingress-smoke.ps1 posts signed generic packet batches to the running test-1 service-channel HTTP endpoint, waits for both entry and exit runtime configs, verifies exit inbox delivery, injects route feedback, observes Control Plane rebuild, waits for node applied_rebuild, sends a second batch over the replacement route, and expires both temporary route intents. Result: artifacts/c18z1-live-service-channel-ingress-smoke-result.json run c18z1-20260507-203628. All current nodes report 0.2.182/current at the last check.
- C18Z2 live service-channel sustained soak/failure smoke is implemented and
passed on 2026-05-07 without a new runtime release. Script
scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1 drives signed generic packet batches through the running test-1 service-channel HTTP endpoint, keeps temporary primary/alternate test-1 -> test-2 route intents visible, restarts the exit-node container rap_test_node_test_2, waits for the exit runtime to reload synthetic config, and verifies recovery batches reach the exit fabric inbox after the restart. Result: artifacts/c18z2-live-service-channel-soak-smoke-result.json run c18z2-20260507-205112: warm batches 6/6, during-restart batches 3/3, recovery batches 8/8, exit inbox depth grew from post-restart baseline 0 to 88, drops 0, and both temporary route intents expired.
- C18Z3 live service-channel entry/WebSocket/degraded-fallback smoke is
implemented, published, and passed on 2026-05-07. Node-agent/host-agent
0.2.183 artifacts and Docker image rap-node-agent:0.2.183 are published to docker-test downloads; update policies for test-1/2/3 are set to rolling target 0.2.183, and the test containers run that image. The runtime fix makes the entry node honor the signed service-channel lease authority: leases with status=degraded_fallback or primary_route.status=missing_route_intent now force backend fallback instead of reusing stale generic route candidates. The same fallback rule is applied to HTTP and WebSocket packet ingress. Script scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1 verifies signed HTTP warm batches, WebSocket ingress parity, entry-node container restart while the lease exists, recovery batches over the same lease, explicit degraded fallback for a no-route exit, and route-intent expiry. Result: artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json run c18z3-20260507-211402: warm 4/4, WebSocket packets 8, recovery 4/4, backend fallback queue 0 -> 8, route failures 0, and all checks passed. During publication the first 0.2.183 Docker tar had a malformed entrypoint and stale size/hash metadata; it was rebuilt, the latest tar alias was replaced, and the release artifact row was corrected to sha256 231286cf5860b22cf8ca6550f67f61b0ca4b5011ab9b09995bcabbafe883fee1, size 7261696.
- C18Z4 live service-channel long-session pressure smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond
0.2.183. Script scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1 opens one signed long-lived service-channel WebSocket from test-1 to test-2, sends 48 packet batches / 384 packets, expires the primary route intent while the WebSocket session is still active, waits for dynamic synthetic-config refresh, and verifies the remaining packets use the alternate route. Result: artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json run c18z4-20260507-212748: exit inbox depth 0 -> 384, route failure delta 0, flow drop delta 0, backend fallback queue 0 -> 0, primary route removed from entry/exit configs, alternate route selected after the switch, and both route intents expired. This proves the shared Fabric Service Channel can keep a service session alive while Control Plane changes the live route set, without falling back to backend relay.
- C18Z5 live service-channel exit-restart smoke is implemented and passed on
2026-05-07 without a new runtime release beyond
0.2.183. Script scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1 keeps one signed WebSocket service-channel session open from test-1 to test-2, sends pre-outage traffic, stops test-2 for a bounded outage while traffic continues, starts it again, waits for runtime readiness, then sends recovery traffic over the same WebSocket. Result: artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json run c18z5-20260507-213745: pre/outage/recovery batches 12/24/24, total packets 480, route failure delta 48, backend fallback queue 0 -> 192, flow drop delta 0, and recovery exit inbox 0 -> 192. This proves real exit-node failure is visible as fallback/failure telemetry while the long-lived service channel remains usable and fabric delivery resumes after the exit runtime returns. After the test, test-2 and all active cluster nodes were healthy/current on 0.2.183.
- C18Z6 live service-channel active rebuild smoke is implemented and passed on
2026-05-07 without a new runtime release beyond
0.2.183. Script scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1 keeps a signed WebSocket service-channel session open from test-1 to test-2, sends pre-rebuild traffic, injects route-health feedback that marks the primary route stale and names the alternate route as replacement, waits for Control Plane rebuild_status=applied, waits for node-agent route_manager_transition.status=applied_rebuild, then continues sending over the same WebSocket. Result: artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json run c18z6-20260507-214900: pre/post batches 16/32, total packets 384, exit inbox depth 0 -> 384, Control Plane replacement route b2f3c510-46d2-4dce-8389-3952a99d0311, route failure delta 0, flow drop delta 0, backend fallback queue 0 -> 0, all checks passed, and all active nodes remained healthy/current on 0.2.183. This proves a live service channel can apply a route-manager rebuild decision without rebuilding the service WebSocket.
- C18Z7 live service-channel concurrent isolation smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond
0.2.183. Script scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1 opens three signed WebSocket service-channel sessions over the same test-1 -> test-2 entry/exit pair, interleaves packet batches across all sessions, injects primary-route stale feedback, waits for Control Plane rebuild_status=applied and node-agent applied_rebuild, then continues all sessions over the same sockets. Result: artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json run c18z7-20260507-215727: 3 sessions, 36 rounds, 288 packets per session, 864 packets total, each session exit inbox depth 288, total exit depth 864, backend fallback delta 0, route failure delta 0, flow drop delta 0, and all active nodes healthy/current on 0.2.183. This proves rebuild and route-manager state are shared correctly without one active service session starving or poisoning the other concurrent sessions.
- C18Z8 live service-channel backpressure isolation smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond
0.2.183. Script scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1 opens two interactive signed WebSocket sessions plus one abusive session over the same test-1 -> test-2 entry/exit pair. The abusive session sends 1300 packets on one stable 5-tuple to force a single flow shard to hit bounded queue pressure while the interactive sessions continue sending small batches. Result: artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json run c18z8-20260507-221347: both interactive sessions delivered 192 packets each, the abusive flow reached scheduler high watermark 1024, scheduled 1030 packets on the hottest channel, dropped 282 packets on that channel, produced backend fallback delta 0, route failure delta 0, and all active nodes stayed healthy/current on 0.2.183. This proves bounded backpressure is visible and isolated to the overloaded logical flow without starving other active service sessions.
- C18Z9 route-pool runtime selection is implemented, released as node/host
agent
0.2.184, published to docker-test downloads, and passed on 2026-05-07. Runtime fix: when Control Plane marks a service-channel route rebuild_status=applied and provides replacement_route_id, node-agent now treats that replacement as the preferred route for sticky flow/channel selection instead of merely withdrawing the bad route and falling back to config order. Unit coverage: TestFabricClientPacketIngressPrefersControlPlaneReplacementOverConfigOrder. Live script scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1 creates a route pool with slow relay primary test-1 -> test-3 -> test-2 and fast direct replacement test-1 -> test-2, keeps one signed WebSocket active, injects stale-route feedback, waits for Control Plane and node-agent applied_rebuild, then verifies the same service session continues over the direct replacement. Result: artifacts/c18z9-live-service-channel-route-pool-smoke-result.json run c18z9-20260507-224901: 54 batches / 432 packets sent and delivered to exit, backend fallback delta 0, route failure delta 0, flow drop delta 0, and temporary route intents expired. Test containers test-1/2/3 run rap-node-agent:0.2.184; usa-los-1, home-1, and ifcm-rufms-s-mo1cr remain healthy on 0.2.183 until their rollout policy is advanced.
- C18Z10 service-channel exit-pool failover is implemented, released as
node/host-agent
0.2.185, published to docker-test downloads, registered in the stable update channel, and passed on 2026-05-07. Backend service-channel leases now bind signed entry/exit pools, selected exit follows the selected primary route, and Control Plane replacement can cross to another authorized exit when route intents share an exit-pool/resource metadata key. Node-agent now honors the signed lease primary route as the initial service-channel preference before normal config-order selection. Unit coverage: TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateExitFromPool, TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossExitPool, and TestFabricClientPacketIngressUsesLeasePreferredRouteBeforeConfigOrder. Live script scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1 creates a primary exit route test-1 -> test-2 and an alternate exit route test-1 -> test-3 in the same exit pool, keeps one signed WebSocket active, verifies pre-rebuild traffic reaches the primary exit, injects stale-route feedback, waits for Control Plane/node-agent applied_rebuild, then verifies post-rebuild traffic reaches the alternate exit. Result: artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json run c18z10-20260507-232645: 54 batches / 432 packets sent, primary exit queue 144, alternate exit queue 288, backend fallback 0, route failure delta 0, flow drop delta 0, decision source service_channel_feedback_exit_pool_replacement, and temporary route intents expired. Backend and test-1/2/3 are running 0.2.185; update plans now return download URLs on 192.168.200.61:18080 when the API is reached directly on 18121.
- C18Z11 service-channel entry-pool failover contract is implemented and
backend-deployed as
rap-backend:fabric-service-channel-0.2.186; node-agent remains 0.2.185 because no node runtime binary change was required. Backend lease selection now keeps selected_entry_node_id aligned with the selected primary route when the healthy route starts at another authorized entry node. Route replacement scope also understands entry-pool metadata keys (entry_pool_id, service_entry_pool_id, fabric_entry_pool_id) in addition to exit-pool/resource keys, and route decision reports count entry-pool replacement decisions. Unit coverage: TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateEntryFromPool and TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossEntryPool. Live script scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1 creates primary entry route test-1 -> test-2 and alternate entry route test-3 -> test-2, verifies the initial lease uses test-1, sends 144 packets, injects service-channel feedback fencing the primary entry route, verifies a refreshed lease selects test-3, then sends 288 more packets through the alternate entry to the same exit. Result: artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json run c18z11-20260507-235341: exit queue 432, backend fallback 0, route failure deltas 0/0, flow drop deltas 0/0, and temporary route intents expired. This is a lease refresh/reconnect contract for entry replacement; preserving a broken client-to-entry socket across an entry node outage is not expected.
- C18Z12 service-channel route quality scoring is implemented and
backend-deployed as
rap-backend:fabric-service-channel-0.2.187; node-agent remains 0.2.185. Backend now uses service-neutral runtime quality feedback from fabric_service_channel_runtime_report.ingress.flow_scheduler when scoring lease routes: last_send_duration_ms adds deterministic latency boosts/penalties, and recent failures/stalls apply bounded penalties. This is protocol-agnostic and applies to the shared fabric channel, not HTTP/RDP/DNS special cases. Unit coverage: TestIssueFabricServiceChannelLeasePrefersFastHealthyRouteFeedback. Live script scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1 creates a high-priority slow relay route test-1 -> test-3 -> test-2 and a lower-priority fast direct route test-1 -> test-2; the initial lease selects the slow route by policy priority, then quality telemetry reports fast route 8ms and slow route 900ms, and the refreshed lease selects the fast route with score reason service_channel_quality_latency_le_10ms. Result: artifacts/c18z12-service-channel-route-quality-smoke-result.json run c18z12-20260508-000209; all checks passed and temporary route intents expired.
- C18Z13 live service-channel route quality self-learning is implemented,
released as node-agent
0.2.188, published to docker-test downloads, registered in the stable update channel, and deployed to docker-test containers test-1/2/3. Runtime fix: positive sub-millisecond service-channel send durations are rounded to 1ms, preventing fast local routes from looking like "no quality sample". Unit coverage: TestFabricFlowSchedulerRoundsSubMillisecondSendDuration. Live script scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1 proves the self-learning path without heartbeat injection: initial lease picks a higher-priority relay route, real service-channel traffic sends 24 batches / 192 packets over the fast direct route, backend persists healthy route feedback from the node-agent heartbeat (last_send_duration_ms=1, score_adjustment=90), and a refreshed lease prefers that fast route over a newly introduced higher-priority relay candidate. Result: artifacts/c18z13-live-service-channel-route-quality-smoke-result.json run c18z13-20260508-001610; backend fallback 0, flow drops 0, temporary route intents expired. Published release id: 64effc62-18b6-4eeb-a1c9-f5fb8e251491.
- C18Z14 active-session route-quality preference is implemented. Backend
rap-backend:fabric-service-channel-0.2.190 and node-agent 0.2.189 are deployed to docker-test test-1/2/3; node-agent 0.2.189 is published to docker-test downloads and registered in the stable update channel as release 9bda9bac-71f3-4e8f-ae70-2abccb1cb866. Backend now decays older healthy service-channel feedback before lease scoring so stale success loses weight before expiry. Node-agent consumes healthy route-quality observations from signed synthetic config and can override sticky per-flow/config-order route choice when a learned route is significantly better. Unit coverage: TestFabricClientPacketIngressQualityPreferenceOverridesStickyRoute and TestIssueFabricServiceChannelLeaseDecaysOlderHealthyRouteFeedback. Live script scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1 keeps one signed WebSocket open while route policy changes: it starts on a higher-priority relay route, expires that route, sends real traffic through the fast direct route to teach feedback, introduces a new higher-priority relay candidate, and verifies the same active session stays on the learned fast route. Result: artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json run c18z14-20260508-071644; 60 batches / 480 packets delivered, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z15 effective route-quality score telemetry is implemented. Backend
rap-backend:fabric-service-channel-0.2.191 is deployed on docker-test, and node-agent 0.2.190 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3. Published release id: 2e4cd0c8-2480-4637-b845-6dcb115dbebd. Backend feedback reports now include decayed effective_score_adjustment alongside raw score_adjustment; node-agent consumes the effective score for active route-quality preference and exposes sorted route_quality_preferences in runtime telemetry with raw/effective score and decay reasons. Unit coverage: TestFabricClientPacketIngressQualityPreferenceUsesEffectiveScore and TestServiceChannelRouteFeedbackReportIncludesEffectiveDecayedScore. Live script scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1 verifies route-quality preference telemetry, effective score visibility, and decayed effective score visibility after the active-session quality-shift scenario. Result: artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json run c18z14-20260508-073538; 60 batches / 480 packets delivered, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z16 per-channel route-quality fairness telemetry is implemented. Node-agent
0.2.191 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: f072759c-5c3b-4ba0-936a-f59b6d3d7632. Flow-scheduler channel stats now expose the applied quality_preference_route_id, effective/raw preference score, and preference reasons, so operators can see which logical channels actually used learned route quality. Unit coverage: TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness. Live script scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1 validates multi-channel quality-preference fairness after the active-session route-quality shift. Result: artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json run c18z14-20260508-074943; 60 batches / 480 packets delivered, 32 served logical channels, 32 channels with quality preference applied, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z17 stale route-quality marker cleanup is implemented. Node-agent
0.2.192 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: 846881bd-e7e0-4212-b8c9-4a6012c6eff7. Flow-scheduler channel stats now clear quality preference markers when the preference is no longer in the effective preference set or when the route manager withdraws that route. Unit coverage: TestFabricClientPacketIngressClearsStaleQualityPreferenceMarkers and TestFabricClientPacketIngressClearsWithdrawnQualityPreferenceMarkers. Live script scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1 verifies cleanup after the active-session quality/fairness scenario. Result: artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json run c18z14-20260508-075750; 60 batches / 480 packets delivered, active quality markers 32, stale quality markers 0, visible preferences 3, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z18 service-session-scoped flow scheduler memory is implemented.
Node-agent
0.2.193 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: 05a3d29e-8a62-4bc8-84a3-1d00b794b9c9. Runtime-sent flow scheduler channel keys now include the VPN/service session: vpn:{vpnConnectionID}:flow-NN. This keeps route memory, failed-route avoidance, served/drop counters, and route-quality markers isolated when several service-channel sessions share one entry/exit and hash to the same logical flow shard. Unit coverage: TestFabricClientPacketIngressIsolatesRouteMemoryPerVPNConnection and TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness. Live script scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1 wraps the live C18Z17 quality path and verifies served live channels are session-scoped, unscoped served flow-NN channels are absent, quality markers are session-scoped, backend fallback is 0, and flow drops are 0. Result: artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json run c18z14-20260508-082520; 60 batches / 480 packets delivered, served channels 32, session-scoped served channels 32, session-scoped quality channels 32, unscoped served channels 0, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z19 bounded parallel logical-flow send window is implemented. Node-agent
0.2.194 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: 926e5b84-4b0b-4f47-b1fe-798d8105679f. The live node-agent runtime enables MaxParallelFlowSends=4, so independent scheduled logical channels can send concurrently instead of one slow channel blocking all following channels. This remains service-neutral and does not inspect HTTP/RDP/DNS/application traffic. Telemetry now exposes max_parallel_flow_sends and send_flow_parallel_batches. Unit coverage: TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel. Live script scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1 wraps the C18Z18 live route-quality/session-scoped path and verifies the parallel window is enabled and observed while backend fallback and flow drops stay at zero. Result: artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json run c18z14-20260508-084133; 60 batches / 480 packets delivered, max_parallel_flow_sends=4, send_flow_parallel_batches=60, served channels 32, session-scoped quality channels 32, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z20 per-channel latency/retry/in-flight telemetry and adaptive recommended
send-window telemetry are implemented. Node-agent
0.2.195 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: b9e198e0-e012-4600-ad14-856820aff41c. Scheduler telemetry now includes global in_flight, max_in_flight, slow/failing channel counts, and per-channel send_attempts, send_successes, send_failures, in_flight, max_in_flight, and latency buckets. Ingress telemetry now includes recommended_parallel_flow_sends; the recommendation shrinks under bounded drops, degraded fallback recommendations, repeated failures, or slow/stalled channels. Unit coverage: TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure and TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel. Live script scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1 wraps the C18Z19 live path and verifies the new telemetry on real docker-test nodes. Result: artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json run c18z14-20260508-085635; 60 batches / 480 packets delivered, max_parallel_flow_sends=4, recommended_parallel_flow_sends=4, scheduler_max_in_flight=4, attempts/success/latency visible on 32 channels, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z21 rolling per-channel/session quality windows are implemented.
Node-agent
0.2.196 is built, published to docker-test downloads, registered in the stable update channel, and deployed to test-1/2/3; backend remains rap-backend:fabric-service-channel-0.2.191. Published release id: 813b2050-4d4e-444c-9bde-72b1d1f7dd35. Scheduler decisions now use a bounded fresh quality window instead of lifetime-only drop/failure counters, so old pressure rolls out after newer successful samples. Telemetry now exposes scheduler-level quality_window_sample_count, quality_window_failure_count, quality_window_slow_count, quality_window_drop_count, and per-channel success/failure/slow/drop sample counts, average latency, and last update time. Unit coverage: TestFabricFlowSchedulerRollingQualityWindowForgetsOldPressure, TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure, and TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel. Live script scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1 wraps the C18Z20 live path and verifies the rolling-window telemetry on real docker-test nodes. Result: artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json run c18z14-20260508-091952; 60 batches / 480 packets delivered, scheduler quality-window samples 480, failures 0, drops 0, window samples/success/latency visible on 32 channels, recommended_parallel_flow_sends=4, backend fallback 0, flow drops 0, temporary route intents expired.
- C18Z22 backend durable route feedback now consumes the rolling quality
window from node-agent heartbeat metadata. Backend
rap-backend:fabric-service-channel-0.2.197 is built and deployed on docker-test; node-agent remains 0.2.196 on test-1/2/3. For agents that expose quality_window_*, backend uses fresh rolling failure/drop/slow counts and rolling average latency when creating fabric_service_channel route feedback; old last_failed_route_id, consecutive_failures, and stall_count remain fallback inputs for older agents only. This prevents old route failures from dominating durable scoring after the channel has recovered with a clean rolling window. Unit coverage: TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback and TestRecordHeartbeatPersistsServiceChannelRouteFeedbackForLaterLease. Live script scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1 wraps the C18Z21 live path and verifies persisted route feedback contains service_channel_rolling_quality_window plus payload quality_window_* fields. Result: artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json run c18z14-20260508-093100; 60 batches / 480 packets delivered, route feedback count 1, rolling feedback count 1, healthy rolling feedback count 1, rolling payload count 1, backend fallback 0, flow drops 0.
- C18Z23 recovery hysteresis is implemented for recovered service-channel
routes. Backend
rap-backend:fabric-service-channel-0.2.198 is built and deployed on docker-test; node-agent remains 0.2.196 on test-1/2/3. When a route has an operator-expire/manual retry cooldown from prior fenced feedback but now also has healthy rolling-window feedback, backend re-admits the route as authorized while applying a bounded recovery hysteresis score penalty (150) and a service_channel_recovery_hysteresis reason. This keeps recovered routes available as alternates without immediately displacing a steady route, and it reduces route-selection flapping. Unit coverage: TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown and TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback. Live script scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1 wraps the C18Z22 live path and verifies backend 0.2.198, rolling feedback, and clean live forwarding. Result: artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json run c18z14-20260508-094111; 60 batches / 480 packets delivered, backend fallback 0, flow drops 0, recovery hysteresis penalty 150.
- C18Z24 recovery visibility is implemented for service-channel route
diagnostics. Backend
rap-backend:fabric-service-channel-0.2.199 is built and deployed on docker-test; node-agent remains 0.2.196 on test-1/2/3. Route feedback API responses and node-scoped service-channel feedback reports now expose recovery_state, recovery_hysteresis_active, and recovery_hysteresis_penalty, while route path decision reports count recovery_hysteresis_count. Admin diagnostics now show recovered/hysteresis chips and a recovery column beside route feedback status. Unit coverage: TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown, TestServiceChannelRouteFeedbackReportExposesRecoveryState, and TestRoutePathDecisionReportCountsRecoveryHysteresis. Smoke result: artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json; route feedback API exposed recovery shape for 109 observations, backend image 0.2.199 was live, and the web-admin build was published to rap_web_admin.
- C18Z25 recovery promotion policy is implemented. Backend
rap-backend:fabric-service-channel-0.2.200 is built and deployed on docker-test; node-agent remains 0.2.196. A route under manual retry cooldown remains recovered with hysteresis penalty until it reports at least 64 clean rolling-window samples (success >= 64, failures/slow/drops zero). After that it is promoted back to steady healthy, gets recovery_promoted=true, service_channel_recovery_promoted, and no hysteresis penalty. Admin/API now expose promoted counts/flags alongside recovered/hysteresis state. Smoke result: artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json; backend image 0.2.200 was live and route-feedback API exposed recovery state for 109 observations.
- C18Z26 recovery demotion policy is implemented. Backend
rap-backend:fabric-service-channel-0.2.201 is built and deployed on docker-test; node-agent remains 0.2.196. If a previously recovered or promoted route under retry cooldown reports fresh rolling failures, drops, slow samples, degraded fallback, rebuild recommendation, or fenced feedback, backend now exposes recovery_demoted=true with a concrete recovery_reason such as service_channel_recovery_demoted_failure, ..._slow, ..._rebuild, or ..._fenced. Route score reasons include service_channel_recovery_demoted and the specific demotion reason, and route path decision reports count recovery_demoted_count. Admin diagnostics now show demoted feedback/path chips and the demotion reason. Smoke result: artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json; backend image 0.2.201 was live and route-feedback API exposed recovery state for 109 observations.
- C18Z27 recovery policy tuning is implemented. Backend
rap-backend:fabric-service-channel-0.2.202 is built and deployed on docker-test; node-agent remains 0.2.196. Effective service-channel recovery policy now has a strict default contract and optional cluster metadata override at fabric_service_channel_recovery_policy. API endpoints GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy expose and update hysteresis penalty, promotion minimum samples, demotion thresholds for failures/drops/slow samples, and rebuild/fenced demotion toggles. Lease route selection, route feedback reports, and node-scoped synthetic config feedback consume the effective policy. Web-admin shows and edits the policy in the service-channel route feedback card. Smoke result: artifacts/c18z27-service-channel-recovery-policy-smoke-result.json; live API updated policy values, then restored strict defaults (penalty=150, promotion_min_samples=64, demotion thresholds 1).
- C18Z28 recovery policy provenance is implemented. Backend
rap-backend:fabric-service-channel-0.2.203 is built and deployed on docker-test; node-agent remains 0.2.196. FabricServiceChannelRoute, FabricServiceChannelLease, signed lease authority payloads, service-channel route feedback reports, and route path decision reports now carry the effective recovery policy used for scoring and recovery decisions. This makes every primary/alternate/fallback choice auditable against the policy source and thresholds that produced it. Web-admin node diagnostics show the service-channel feedback policy and route decision policy source. Smoke result: artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json; live synthetic config and live lease issuance both exposed recovery policy provenance on docker-test.
- C18Z29 feedback provenance guardrails are implemented. Backend
rap-backend:fabric-service-channel-0.2.204 is built and deployed on docker-test; node-agent remains 0.2.196. Recovery policy now has a stable fingerprint. Backend recognizes optional runtime feedback provenance fields (recovery_policy_fingerprint, route_generation, route_policy_version, policy_version), exposes observed/effective fingerprints/generations on route feedback observations, and reports missing/stale counters. Explicit stale policy/generation feedback is scored conservatively, cannot fence a current route, and cannot request rebuild/demotion; missing provenance stays compatible for current old agents but is visible in diagnostics. Web-admin shows provenance warnings in service-channel feedback. Smoke result: artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json.
- C18Z30 node-agent feedback provenance is implemented. Backend
rap-backend:fabric-service-channel-0.2.209 and node-agent 0.2.208 are built and deployed on docker-test (test-1/2/3). Node-agent now preserves the signed synthetic config contract for recovery feedback/route decision fields and records per-flow recovery_policy_fingerprint, route_policy_version, and route_generation at send time, so feedback remains auditable even after route churn/expiry. Backend heartbeat parsing now preserves those fields into durable service-channel feedback payloads. Live smoke passed with 28/28 runtime channel stats carrying provenance, 3/3 feedback observations carrying provenance, and no missing/stale provenance counters. Artifacts: artifacts/c18z30-node-telemetry-provenance-live-smoke-base-result.json and artifacts/c18z30-node-agent-feedback-provenance-smoke-result.json.
- C18Z31 service-channel rebuild ledger is implemented. Backend
rap-backend:fabric-service-channel-0.2.211 is built and deployed on docker-test; node-agent remains 0.2.208 on test-1/2/3. Backend now keeps durable route rebuild attempt history in fabric_service_channel_route_rebuild_attempts, upserted from synthetic config route decisions when service-channel feedback requests rebuild. The ledger stores trigger/rebuild status, old route, selected replacement, policy fingerprint, generation, feedback status/reasons, latency/failure counters, outcome, and compact decision payload. API endpoint GET /clusters/{clusterID}/fabric/service-channels/rebuild-attempts exposes the history; web-admin loads it into Service-channel route feedback diagnostics as a rebuild ledger table. Migration 000026 is applied on docker-test. Live smoke passed: artifacts/c18z31-base-active-rebuild-smoke-result.json and artifacts/c18z31-service-channel-rebuild-ledger-smoke-result.json.
- C18Z32 service-channel rebuild timeline is implemented. Backend
rap-backend:fabric-service-channel-0.2.213 is built and deployed on docker-test; node-agent remains 0.2.208 on test-1/2/3. The rebuild attempts API now enriches durable ledger rows with node-agent heartbeat correlation: matching route_manager_transition, route-generation apply or withdrawn decision, post-rebuild selected route, flow packet/drop/failure counters, and a compact chronological timeline with backend_decision, node_route_generation_apply, node_route_manager_transition, and post_rebuild_traffic stages. Matching is generation-strict when the backend attempt has a generation, preventing stale transition/status matches. Web-admin rebuild ledger shows backend, agent, route-generation, and traffic columns. Live smoke passed: artifacts/c18z32-base-rebuild-ledger-smoke-result.json and artifacts/c18z32-service-channel-rebuild-timeline-smoke-result.json.
- C18Z33 service-channel rebuild guardrails are implemented. Backend
rap-backend:fabric-service-channel-0.2.214 is built and deployed on docker-test; node-agent remains 0.2.208. Rebuild attempts API now adds computed guard fields: guard_status, guard_severity, guard_reason, age, and transition/traffic deadlines. Successful correlated rebuilds report guard_status=ok, guard_severity=good; missing node transition, route-generation correlation, post-rebuild traffic, unexpected selected route, or post-rebuild drops/failures surface as warn/bad states. Web-admin shows guard chips and counts in the service-channel rebuild ledger. Live smoke passed: artifacts/c18z33-base-rebuild-ledger-smoke-result.json and artifacts/c18z33-service-channel-rebuild-guard-smoke-result.json.
- C18Z34 service-channel rebuild health summary is implemented. Backend
rap-backend:fabric-service-channel-0.2.215 is built and deployed on docker-test; node-agent remains 0.2.208. New endpoint GET /clusters/{clusterID}/fabric/service-channels/rebuild-health returns a cluster-level operational summary over the durable rebuild ledger/timeline: counts by guard status/severity, applied/pending counts, affected reporter nodes/routes, most recent bad attempts, and recommended operator action. Web-admin shows the summary as a Rebuild health subpanel above the rebuild ledger. Live smoke passed: artifacts/c18z34-base-rebuild-guard-smoke-result.json and artifacts/c18z34-service-channel-rebuild-health-smoke-result.json.
- C18Z35 service-channel rebuild alert silence lifecycle is implemented.
Backend
rap-backend:fabric-service-channel-0.2.216 is built and deployed on docker-test; node-agent remains 0.2.208. Migration 000027 creates fabric_service_channel_rebuild_alert_silences, applied on docker-test. New API POST /clusters/{clusterID}/fabric/service-channels/rebuild-health/silences records bounded operator silence for an exact alert fingerprint: reporter node, route, guard status, and generation. Rebuild health now separates total bad/warn from active bad/warn and silenced counts; silenced alerts are omitted from affected nodes/routes and active bad attempt lists. A new generation, route, or reporter remains active by design. Web-admin exposes silence 6h on active bad rebuild-health rows. Live smoke passed: artifacts/c18z35-base-rebuild-health-smoke-result.json and artifacts/c18z35-service-channel-rebuild-alert-silence-smoke-result.json.
- C18Z36 service-channel rebuild alert resurfacing is implemented. Backend
rap-backend:fabric-service-channel-0.2.217 is built and deployed on docker-test; node-agent remains 0.2.208. Rebuild health marks active bad/warn attempts as alert_resurfaced when an active silence exists for the same reporter node, route, and guard status but a different generation. The summary exposes resurfaced_count and resurfaced_attempts, including the previous silenced generation and silence expiry. Web-admin shows a resurfaced chip/table and allows silencing the new generation separately. Live smoke passed: artifacts/c18z36-base-rebuild-health-smoke-result.json and artifacts/c18z36-service-channel-rebuild-alert-resurface-smoke-result.json.
- C18Z37 service-channel readiness gate is implemented. Backend
rap-backend:fabric-service-channel-0.2.218 is built and deployed on docker-test; node-agent remains 0.2.208. New endpoint GET /clusters/{clusterID}/fabric/service-channels/readiness returns a fast recent-window verdict: clean, degraded, or blocked, with active bad/warn counts, resurfaced/silenced counts, missing transition, route-generation, post-rebuild traffic, unexpected-route, and post-rebuild degraded counters plus blocking/degraded reasons and recommended operator action. Web-admin shows this as a top-level readiness panel in Service-channel route feedback. Readiness and default admin health queries are intentionally capped to a small recent window so the operator view stays responsive after many rebuild attempts; deep ledger diagnostics remain a separate next layer. Live smoke passed: artifacts/c18z37-base-rebuild-health-smoke-result.json and artifacts/c18z37-service-channel-readiness-smoke-result.json.
- C18Z38 service-channel rebuild ledger enrichment split is implemented.
Backend
rap-backend:fabric-service-channel-0.2.219 is built and deployed on docker-test; node-agent remains 0.2.208. The rebuild attempts API now defaults to enrichment=summary, returning durable ledger rows without the expensive heartbeat/timeline guard correlation. Operators can request enrichment=deep explicitly for per-route investigation. Web-admin defaults to the fast ledger, shows timeline/guard fields as deep-only in summary mode, and provides a manual deep ledger toggle. C18Z32/C18Z33 smokes now request deep enrichment. Live smoke passed: artifacts/c18z38-service-channel-rebuild-ledger-enrichment-smoke-result.json.
- C18Z39 service-channel rebuild ledger drilldown is implemented. Backend
rap-backend:fabric-service-channel-0.2.220 is built and deployed on docker-test; node-agent remains 0.2.208. The rebuild attempts API now accepts generation and offset, allowing narrow deep investigations by reporter node, route, service class, and route generation with bounded pagination. Web-admin adds rebuild ledger filters for reporter/route/generation/service plus prev/next paging in deep mode. Live smoke passed: artifacts/c18z39-service-channel-rebuild-ledger-drilldown-smoke-result.json.
- C18Z40 service-channel rebuild incident grouping is implemented. Backend
rap-backend:fabric-service-channel-0.2.222 is built and deployed on docker-test; node-agent remains 0.2.208. New endpoint GET /clusters/{clusterID}/fabric/service-channels/rebuild-incidents groups the bounded recent rebuild window by reporter node, route, service class, generation, and guard status, exposing first/last seen, attempt count, latest guard/replacement/outcome, silence/resurface flags, and recommended action. The incident window is capped to 5 to keep default admin refresh bounded; broader investigation still uses filtered deep ledger. Web-admin shows a Rebuild incidents list, and open deep loads the exact filtered deep ledger slice for that incident. Live smoke passed: artifacts/c18z40-service-channel-rebuild-incidents-smoke-result.json.
- C18Z41 service-channel rebuild incident actions are implemented. Backend
rap-backend:fabric-service-channel-0.2.223 is built and deployed on docker-test; node-agent remains 0.2.208. New API POST /clusters/{clusterID}/fabric/service-channels/rebuild-incidents/investigations records an audit event when an operator opens a deep rebuild investigation. Web-admin incident rows now expose open deep with audit and silence 6h using the incident fingerprint fields; after silence the panel refreshes only rebuild health/readiness/incidents instead of the whole cluster scope. Live smoke passed: artifacts/c18z41-service-channel-rebuild-incident-actions-smoke-result.json.
- C18Z42 service-channel rebuild correlation snapshots are implemented.
Backend
rap-backend:fabric-service-channel-0.2.224 is built and deployed on docker-test; node-agent remains 0.2.208. Migration 000028 adds durable correlation/guard snapshot columns to fabric_service_channel_route_rebuild_attempts, including node transition, route-generation, post-rebuild traffic, guard status/severity/reason, compact timeline, and correlation_snapshot_at. Deep enrichment now writes the snapshot once; later deep/readiness/health/incidents reuse it and only recompute age-sensitive guard state without scanning heartbeat history. External summary ledger still strips guard/timeline fields to preserve the fast C18Z38 contract. On docker-test, applying 000028 manually was required before smoke because this manual backend redeploy path does not auto-apply migrations. Live smoke passed twice; after warm snapshot timings were roughly summary 92 ms, deep 2 ms, incidents 2 ms: artifacts/c18z42-service-channel-rebuild-correlation-snapshot-smoke-result.json.
- C18Z43 service-channel schema preflight is implemented. Backend
rap-backend:fabric-service-channel-0.2.225 is built and deployed on docker-test; web-admin is redeployed. New endpoint GET /clusters/{clusterID}/fabric/service-channels/schema-status checks the DB relation/columns required by migration 000028 before operators rely on rebuild health/readiness/incidents. Web-admin shows a Fabric schema preflight panel beside service-channel readiness, with required/missing check counts and operator action. Live smoke passed: artifacts/c18z43-service-channel-schema-preflight-smoke-result.json.
- C18Z44 service-channel rebuild snapshot warmup is implemented. Backend
rap-backend:fabric-service-channel-0.2.226 is built and deployed on docker-test; web-admin is redeployed. New endpoint POST /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/warmup performs a bounded proactive pass over recent rebuild attempts. It fills missing correlation snapshots, counts stale snapshots, and defers heavy stale rescans because age-sensitive guard state is already recomputed from cached snapshots on read. Web-admin adds a warm snapshots action and displays warmed/fresh/missing/stale/deferred/error counts. Live smoke passed: artifacts/c18z44-service-channel-rebuild-snapshot-warmup-smoke-result.json.
- C18Z45 service-channel rebuild snapshot auto-warmup is implemented. Backend
rap-backend:fabric-service-channel-0.2.227 is built and deployed on docker-test; node-agent remains 0.2.208. Heartbeat processing now performs a bounded missing-snapshot maintenance pass for the reporting node's recent rebuild attempts. It only persists a snapshot when the heartbeat contains runtime evidence such as post-rebuild traffic or matched route-manager/route-generation state, preventing backend-only timelines from becoming stale cache entries. Auto-warmup writes an audit event fabric.service_channel_rebuild_snapshot.auto_warmup with trigger, heartbeat, warmed route IDs, generations, rebuild IDs, counts, and errors. Live smoke passed: artifacts/c18z45-service-channel-rebuild-snapshot-auto-warmup-smoke-result.json.
- C18Z46 service-channel rebuild snapshot maintenance health is implemented.
Backend
rap-backend:fabric-service-channel-0.2.228 is built and deployed on docker-test; web-admin is redeployed. New endpoint GET /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/health exposes bounded snapshot-cache maintenance status: recent attempt count, valid/missing/overdue runtime-evidence snapshots, heartbeat threshold, latest auto-warmup audit summary, and per-node warmed/error/missing counts. Web-admin adds a Snapshot maintenance panel beside schema/readiness. Live smoke passed: artifacts/c18z46-service-channel-rebuild-snapshot-health-smoke-result.json.
- C18Z47 service-channel signed lease enforcement is implemented. Node-agent
release
0.2.230 is built, published under /downloads, registered as the active rap-node-agent dev release, and deployed on docker-test test-1/2/3; all three report 0.2.230, healthy, and current after policy update. When a cluster authority public key is pinned, the node-agent now rejects unsigned rap_fsc_* service-channel requests and requires the signed rap.fabric_service_channel_lease_authority.v1 payload/signature headers. Legacy unsigned tokens remain accepted only in unpinned test mode. Live smoke proved unsigned POST is rejected with 403 while signed lease POST is accepted with 202: artifacts/c18z47-service-channel-signed-lease-enforcement-smoke-result.json.
- C18Z48 service-channel backend introspection compatibility is implemented.
Backend
rap-backend:fabric-service-channel-0.2.231 is built/deployed on docker-test. Node-agent/host-agent artifacts 0.2.232 are published under /downloads; rap-node-agent release 0.2.232 is registered and deployed on test-1/2/3, and all three report healthy/current. When signed service-channel authority headers are absent but cluster authority is pinned, node-agent now calls backend lease introspection before accepting an unsigned token. Bad tokens are still rejected. Live smoke passed: artifacts/c18z48-service-channel-introspection-smoke-result.json.
- C18Z49 service-channel acceptance telemetry is implemented in node-agent
0.2.232. Each accepted Fabric Service Channel ingress records accepted_by=signed|introspection|legacy_unsigned, route preference, and backend-fallback state in structured node logs. HTTP packet ingress also returns X-RAP-Service-Channel-Accepted-By for smoke/diagnostics.
- C18Z50 durable service-channel lease introspection is implemented. Migration
000029_fabric_service_channel_leases adds a durable lease table keyed by cluster/channel and stores only token_hash plus a scrubbed lease payload with the raw bearer token removed. Backend rap-backend:fabric-service-channel-0.2.233 is built/deployed on docker-test after applying the migration. Introspection now reads memory first, then durable storage, so compatibility clients survive backend restart. Live smoke restarted rap_test_backend, accepted the unsigned token through introspection, rejected a bad token, and verified the durable lease omits the raw token: artifacts/c18z50-service-channel-durable-introspection-smoke-result.json.
- C18Z51 service-channel lease maintenance is implemented. Backend
rap-backend:fabric-service-channel-0.2.234 is built/deployed on docker-test. New endpoints list durable service-channel lease maintenance state and run bounded expired-lease cleanup: GET /clusters/{clusterID}/fabric/service-channels/leases and POST /clusters/{clusterID}/fabric/service-channels/leases/cleanup. Web-admin adds a Service-channel leases panel with active/expired counts, recent lease rows, and cleanup action. Live smoke issued a 1-second lease, observed it as expired, cleaned it up, and verified it disappeared: artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json.
- C18Z52 service-channel access telemetry visibility is implemented. Backend
rap-backend:fabric-service-channel-0.2.235 is built/deployed on docker-test; node-agent/host-agent 0.2.235 artifacts are published under /downloads, registered as active dev releases, and deployed on test-1/2/3. Node-agent now reports accepted service-channel ingress counters by signed, introspection, and legacy_unsigned, including backend-fallback count and last accepted timestamp. Backend exposes GET /clusters/{clusterID}/fabric/service-channels/access-telemetry, reading telemetry observations with heartbeat metadata fallback. Web-admin adds a Service-channel access panel with cluster totals and per-node rows. Live smoke sent packets through test-1, observed X-RAP-Service-Channel-Accepted-By: introspection, and verified backend aggregate visibility: artifacts/c18z52-service-channel-access-telemetry-smoke-result.json.
- C18Z53 service-channel access/session correlation is implemented. Backend
rap-backend:fabric-service-channel-0.2.236 is built/deployed on docker-test; node-agent remains 0.2.235. The access telemetry endpoint now correlates accepted ingress counters with active durable service-channel leases, selected entry/exit nodes, primary route status, explicit backend fallback, and latest route-quality feedback when a route exists. Web-admin's Service-channel access panel now shows active channel rows before per-node counters, so operators can see whether a live service channel is using normal route quality feedback or degraded backend fallback. Live smoke created an active lease, sent ingress traffic through test-1, and verified active channel correlation plus fallback visibility: artifacts/c18z53-service-channel-access-correlation-smoke-result.json.
- C18Z54 normal-route access correlation is smoke-proven on the existing
C18Z53 backend/admin surface. New smoke creates a temporary direct
vpn_packets route intent, injects healthy route-quality heartbeat telemetry, issues a service-channel lease that selects the normal primary route, sends ingress traffic, and verifies the access telemetry active channel row is ready, not backend fallback, with route_feedback_status healthy, rolling quality counters, and last send duration: artifacts/c18z54-service-channel-normal-route-access-smoke-result.json.
- C18Z55 degraded normal-route access correlation is smoke-proven on the same
backend/admin surface. The smoke first issues a lease on a normal primary
vpn_packets route, then injects degraded/fenced route-quality heartbeat feedback for that already-selected route. Access telemetry correctly reports the active channel as ready and force_backend_fallback=false, while route feedback is fenced, rolling failure/drop/slow counters are visible, and the aggregate access status becomes degraded because degraded_route_count > 0: artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json.
- C18Z56 active-channel remediation diagnostics are implemented. Backend
rap-backend:fabric-service-channel-0.2.237 is built/deployed on docker-test; node-agent remains 0.2.235. Active access telemetry channel rows now include remediation_action, remediation_reason, remediation_route_id, remediation_route_status, and an operator hint. Decisions distinguish explicit backend fallback, degraded/fenced normal route with an authorized alternate (prefer_alternate_route), degraded/fenced route needing rebuild (rebuild_route), and healthy route (none). Web-admin shows the remediation action in the Service-channel access active-channel table. C18Z55 smoke now verifies remediation_action=rebuild_route; backend unit coverage verifies the alternate-route remediation branch.
- C18Z56 alternate-route remediation is also live-smoke-proven. New smoke
creates primary and authorized alternate
vpn_packets routes, issues a lease while primary is still healthy/selected, then injects fenced feedback for the selected primary. Access telemetry keeps the active channel on the normal route with force_backend_fallback=false, reports route_feedback_status fenced, and recommends remediation_action=prefer_alternate_route with the alternate route id/status; degraded_fallback_channel_count stays zero: artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json.
- C18Z57 bounded remediation command contract is implemented. Backend
rap-backend:fabric-service-channel-0.2.238 is built/deployed on docker-test; node-agent remains 0.2.235. Active access telemetry channel rows now include remediation_command for non-noop remediation actions, with schema version, deterministic command id, action, channel/resource/service, entry/exit, primary route, replacement route when present, reason/operator hint, issued time, and a bounded TTL capped to the lease lifetime. Web-admin marks remediation rows with cmd when this machine-readable command is present. Live smoke proves a fenced selected primary route with an authorized alternate emits a prefer_alternate_route command pointing at the alternate: artifacts/c18z57-service-channel-remediation-command-smoke-result.json.
- C18Z58 service-channel remediation command consumption is implemented.
Backend
rap-backend:fabric-service-channel-0.2.239 and node-agent rap-node-agent:0.2.237 are built/deployed on docker-test (test-1/2/3). Backend now projects active remediation_command items into node-scoped synthetic mesh config as service_channel_remediation_commands. Node-agent parses those commands and turns prefer_alternate_route into an explicit route-manager applied decision with source service_channel_remediation_command, so an active channel that still presents the old primary route can be routed through the replacement route. Web-admin node details show remediation-command count/table in the Mesh tab. Live smoke proves access telemetry, synthetic config projection, and node-agent route-manager consumption: artifacts/c18z58-service-channel-remediation-apply-smoke-result.json.
- C18Z59 active remediation traffic proof is smoke-proven on the same
backend/node-agent images with production forwarding enabled on docker-test
test-1/2/3. The smoke sends service-channel traffic before/after the remediation command is consumed, then verifies runtime heartbeat evidence: last_selected_route_id and flow-scheduler last_route_id move to the replacement route, send_successes=1, send_failures=0, send_fallback_local=0, and no degraded backend fallback is recommended. Result: artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json.
- C18Z60 multi-flow remediation traffic proof is smoke-proven. The smoke sends
a batch of twelve IPv4/TCP-like packets that classify into multiple
independent VPN flow channels after the remediation command is consumed.
Runtime heartbeat evidence shows the replacement route selected, at least two
flow-scheduler channels on that route, no local/backend fallback, no flow
drops, and no route send failures. Result:
artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json.
- C18Z61 pressure remediation traffic proof is smoke-proven. The smoke sends a
batch of 128 IPv4/TCP-like packets after remediation; runtime evidence shows
32 replacement-route flow stats, scheduler high-watermark 5,
max-in-flight 4,
send_fallback_local=0, route failures 0, and flow/scheduler drops 0. Result: artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json.
- C18Z62 service-channel QoS class wiring is implemented in node-agent and
live-smoke-proven on docker-test image
rap-node-agent:0.2.238-c18z62. Service-channel HTTP ingress accepts neutral X-RAP-Traffic-Class (control, interactive, reliable, bulk, droppable) and the flow scheduler keeps distinct traffic-class channel ids/stats while preserving the old default bulk channel ids. Unit tests prove priority ordering control > interactive > reliable > bulk > droppable; live smoke proves a bulk 128-packet pressure batch plus an interactive packet both move through the remediation replacement route with no local/backend fallback, drops, or route failures. Result: artifacts/c18z62-service-channel-remediation-qos-smoke-result.json.
- C18Z63 concurrent QoS isolation is implemented and unit-proven. A controlled
runtime test holds a bulk traffic-class send in-flight with a blocking
production transport, then sends an independent interactive traffic-class
packet through the same ingress; the interactive send completes before the
bulk release, with
MaxInFlight >= 2, traffic-class-specific stats, no drops, and no failures. This proves the shared Fabric Service Channel runtime does not globally serialize interactive/control-style traffic behind bulk work. Artifact: artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl.
- C18Z64 traffic-class telemetry aggregation is implemented and live-proven on
docker-test image
rap-node-agent:0.2.239-c18z64. rap.fabric_flow_scheduler.v1 snapshots now include traffic_class_counts, giving backend/admin/diagnostics a compact count of active flow channels per traffic class without scanning every channel stat. Unit coverage proves the counts for explicit control/interactive/bulk classes and for the concurrent bulk+interactive isolation case. Live smoke re-ran the QoS path on test-1/2/3; latest heartbeat snapshot showed traffic_class_counts bulk=32, interactive=12, drops 0. Artifacts: artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl, artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json, and artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json.
- C18Z65/C18Z66 backend/admin QoS diagnostics are implemented and live-proven.
Backend
rap-backend:fabric-service-channel-0.2.241-c18z66 is deployed on docker-test and projects runtime traffic_class_counts, flow channel count, max in-flight, dropped, and high-watermark from node heartbeats into GET /fabric/service-channels/access-telemetry at node, active-channel, and cluster aggregate levels. Web-admin Service-channel access shows flow QoS chips/rows for cluster totals, active channels, and nodes. Live API aggregate result showed bulk=32, interactive=12, flow_channel_count=44, flow_max_in_flight=4. Artifacts: artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json, artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json, and artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json.
- C18Z67 live concurrent QoS proof is implemented and smoke-proven against
docker-test backend
rap-backend:fabric-service-channel-0.2.241-c18z66 and node-agent image rap-node-agent:0.2.239-c18z64. The smoke pushes six parallel bulk service-channel HTTP packet requests while an interactive traffic-class request is injected through the same entry path after remediation. Run c18z67-20260508-213452 accepted all 6 bulk requests, forwarded 3072 post-remediation packets, completed the interactive request in 132 ms, observed 32 bulk and 12 interactive replacement-route flow stats, and kept local/backend fallback, route failures, flow drops, and scheduler drops at 0. Artifact: artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json.
- C18Z68 service-channel flow-health guard is implemented and deployed on
docker-test as
rap-backend:fabric-service-channel-0.2.242-c18z68, with web-admin rebuilt/deployed. Access telemetry now projects flow_health_status and flow_health_reason at cluster, node, and active-channel levels from traffic-class counts, queue pressure, flow drops, backend fallback, route-quality failures/drops/slow samples, and route send latency. Web-admin shows explicit flow-health chips beside flow QoS so sustained bulk pressure, degraded latency, fallback, and drops are visible before adding user services. Verification passed: go test ./internal/modules/cluster, web-admin npm run build, updated C18Z67 live smoke against backend 0.2.242-c18z68, and live API artifact artifacts/c18z68-service-channel-flow-health-api-result.json.
- C18Z69 node-side adaptive backpressure is implemented and deployed on
docker-test image
rap-node-agent:0.2.243-c18z69 for test-1/2/3. FabricFlowScheduler now calculates per-traffic-class recommended_parallel_windows and reports adaptive_backpressure_active/adaptive_backpressure_reason in runtime heartbeat snapshots. Bulk and droppable classes are reduced first under pressure, reliable is reduced moderately, while control/interactive keep their full window unless their own class has drops/failures/slow samples. Live C18Z69 smoke wraps the C18Z67 pressure path and verified recommended windows bulk=1, droppable=1, reliable=3, interactive=4, control=4, flow counts bulk=32, interactive=12, high-watermark 72, max-in-flight 4, drops 0, and reason bulk_window_reduced_to_protect_interactive. Artifacts: artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json and artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json.
- C18Z70 backend/admin adaptive backpressure visibility is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.244-c18z70; web-admin is rebuilt and deployed. Access telemetry now projects node-agent recommended_parallel_windows, adaptive_backpressure_active, and adaptive_backpressure_reason at cluster, node, and active-channel levels. Cluster aggregation uses the minimum non-zero recommended window per class, so the operator sees the most conservative active runtime limit. Web-admin shows adaptive windows next to flow health and flow QoS. Live API returned adaptive=true, reason bulk_window_reduced_to_protect_interactive, and windows bulk=1, droppable=1, reliable=3, interactive=4, control=4. Verification passed: go test ./internal/modules/cluster, web-admin npm run build, C18Z69 live smoke, and artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json.
- C18Z71 adaptive policy contract is implemented and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.245-c18z71 with node-agent image rap-node-agent:0.2.245-c18z71 on test-1/2/3. Backend exposes audited GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy for max parallel window, queue/bulk pressure thresholds, and per-class windows. The effective policy is embedded in signed node synthetic config, and node-agent runtime heartbeat snapshots now report adaptive_policy_fingerprint. The scheduler consumes the policy at runtime: default policy preserves the C18Z69 behavior, while the C18Z71 live smoke proved an operator policy can raise max window to 6 and bulk pressure window to 2 while keeping interactive/control at 6. During smoke, a signed synthetic config hash mismatch was found and fixed by preserving adaptive policy provenance fields in the node-agent client model. Verification passed: go test ./internal/modules/cluster, go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/client ./internal/config, web-admin npm run build, C18Z71 live smoke, and C18Z69 regression smoke. Artifacts: artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json and artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json.
- C18Z72 service-channel pool/failover policy contract is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.246-c18z72; node-agent remains rap-node-agent:0.2.245-c18z71 on test-1/2/3. Backend exposes audited GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy for entry/exit pool constraints, preferred entry/exit, selection strategy, route/entry/exit failover modes, backend fallback allowance, and sticky session mode. Lease issuance now applies the effective policy before route selection, constrains entry_pool/exit_pool, chooses policy preferred nodes when present, embeds pool_policy provenance in the lease, and signs it into rap.fabric_service_channel_lease_authority.v1. Web-admin API/types know the new policy contract. Verification passed: go test ./internal/modules/cluster, web-admin npm run build, C18Z72 live smoke, and C18Z71 regression smoke. Artifact: artifacts/c18z72-service-channel-pool-policy-smoke-result.json.
- C18Z73 pool-policy remediation guard and telemetry is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.247-c18z73 with node-agent image rap-node-agent:0.2.247-c18z73 on test-1/2/3; web-admin is rebuilt and deployed. Active access telemetry now projects the signed pool_policy_fingerprint, remediation guard status/reason, and guarded remediation commands. Backend remediation rejects an alternate route outside the signed entry/exit lease pools and emits rebuild_route instead of prefer_alternate_route; node-agent defensively ignores guarded rejected remediation commands before route-manager application. Web-admin shows guard chips in access telemetry and node synthetic-config remediation rows. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config, web-admin npm run build, C18Z73 live smoke, C18Z72 regression smoke, and C18Z71/C18Z67 live regression smoke. Artifacts: artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json, artifacts/c18z72-service-channel-pool-policy-smoke-result.json, artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json, and artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json.
- C18Z74 service-channel remediation execution visibility is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.248-c18z74 with node-agent image rap-node-agent:0.2.248-c18z74 on test-1/2/3; web-admin is rebuilt and deployed. Active access telemetry now computes remediation_execution_status, reason, generation, and observed timestamp by correlating active remediation commands with the entry node's latest route-manager heartbeat. prefer_alternate_route commands show waiting_node_apply until the node reports a matching route-manager decision and then applied; guarded commands show rejected_by_policy_guard; bounded rebuild_route commands show pending_rebuild_request. The execution state is copied into the machine-readable remediation command and displayed in web-admin access telemetry / node synthetic remediation rows. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config, web-admin npm run build, C18Z74 live smoke, C18Z73 regression smoke, and C18Z72 regression smoke. Artifacts: artifacts/c18z74-service-channel-remediation-execution-smoke-result.json, artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json, artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json, and artifacts/c18z72-service-channel-pool-policy-smoke-result.json.
- C18Z75 durable remediation rebuild intent foundation is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.249-c18z75; node-agent remains rap-node-agent:0.2.248-c18z74 on test-1/2/3. When a node fetches synthetic config containing a rebuild_route remediation command, backend now records a durable row in the existing fabric_service_channel_route_rebuild_attempts ledger with rebuild_status=requested/outcome=rebuild_requested, or rebuild_status=rejected/outcome=policy_guard_rejected when the pool policy guard rejects it. Access telemetry correlates that ledger row back to the active channel and reports rebuild_request_recorded or rebuild_request_rejected in remediation_execution_status. The C18Z75 smoke isolates a route pair, proves rebuild_route, fetches synthetic config to persist the intent, verifies the rebuild ledger row, and verifies access telemetry reports the recorded execution state. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config, web-admin npm run build, C18Z75 live smoke, C18Z73 regression smoke, and C18Z72 regression smoke. Artifacts: artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json, artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json, and artifacts/c18z72-service-channel-pool-policy-smoke-result.json.
- C18Z76 service-channel rebuild-route node acknowledgement is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.250-c18z76 with node-agent image rap-node-agent:0.2.250-c18z76 on test-1/2/3. Node-agent now consumes allowed rebuild_route remediation commands as route-manager decisions with rebuild_status=pending_degraded_fallback and decision_source=service_channel_remediation_command; guarded commands are still ignored. Backend access telemetry correlates this route-manager acknowledgement with the durable ledger intent and reports rebuild_request_recorded_node_pending. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z76 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression smoke. Artifacts: artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json, artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json, artifacts/c18z74-service-channel-remediation-execution-smoke-result.json, and artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json.
- C18Z77 service-channel rebuild planner resolution is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.251-c18z77 with node-agent image rap-node-agent:0.2.251-c18z77 on test-1/2/3. Backend now resolves durable rebuild_route remediation requests during node-scoped synthetic config generation: it keeps lease pool-policy guardrails, records applied/replacement_selected when a signed-pool-valid alternate route exists, records no_alternate when no safe alternate exists, records deferred_by_policy when the active lease cannot authorize the replacement, and records expired for stale commands. When a replacement is applied, the same command id is projected as a route-manager decision so node-agent can consume the resolved planner decision without duplicating the raw command. Access telemetry reports planner states such as rebuild_request_applied and rebuild_request_no_alternate. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z77 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression smoke. Artifacts: artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json, artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json, artifacts/c18z74-service-channel-remediation-execution-smoke-result.json, and artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json.
- C18Z78 service-channel rebuild planner applied-branch visibility is
implemented and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.252-c18z78 with node-agent image rap-node-agent:0.2.252-c18z78 on test-1/2/3; web-admin is rebuilt and deployed to rap_web_admin. The admin access-telemetry execution column and node synthetic remediation rows now render planner outcomes with explicit labels and tones: rebuild_request_applied is good, rebuild_request_recorded(_node_pending), rebuild_request_no_alternate, and rebuild_request_deferred_by_policy are warning states, while rejected or expired requests are bad states. The C18Z78 live smoke proves the applied planner branch: a primary route is leased first, the primary route is then degraded, an alternate route is added after the lease, synthetic config fetch resolves the existing rebuild_route command to applied/replacement_selected, and access telemetry reports rebuild_request_applied. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, web-admin npm run build, C18Z78 live smoke, C18Z77 regression smoke, and C18Z74/C18Z67 regression smoke. Artifacts: artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json, artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json, artifacts/c18z74-service-channel-remediation-execution-smoke-result.json, and artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json.
- C18Z79 service-channel planner-to-runtime loop proof is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.253-c18z79 with node-agent image rap-node-agent:0.2.253-c18z79 on test-1/2/3. The new live smoke extends the C18Z78 applied branch: after planner resolves the existing rebuild_route command to applied/replacement_selected, the entry node reports a route-manager decision for the same rebuild_request_id, reports transition applied_rebuild, and live service-channel packet ingress selects the replacement route with no local/backend fallback, route failures, or flow drops. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z79 live smoke, C18Z78 and C18Z77 sequential regressions, and C18Z67 concurrent QoS regression. Artifact: artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json.
- C18Z80 service-channel sustained post-rebuild pressure proof is implemented
and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.254-c18z80 with node-agent image rap-node-agent:0.2.254-c18z80 on test-1/2/3. The new live smoke keeps the C18Z79 planner-applied loop, then sends five post-rebuild bursts of mixed interactive, bulk, and reliable VPN packet batches. It proves every burst is accepted by the service-channel runtime, every burst reports the replacement route, the stale primary is not reselected, and fallback, route-failure, flow-drop, and scheduler-drop deltas stay zero from the pre-pressure baseline. Smoke route hygiene was tightened: C18Z67 now disables pre-existing active vpn_packets intents for its entry/exit pair, and C18Z79/C18Z80 expire their temporary primary/alternate intents after a successful run. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z80 live smoke, C18Z79 regression smoke, and C18Z67 concurrent QoS regression. Artifact: artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json.
- C18Z81 service-channel replacement-degradation recovery proof is implemented
and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.255-c18z81 with node-agent image rap-node-agent:0.2.255-c18z81 on test-1/2/3. The new live smoke proves the negative branch after C18Z80: once the initial replacement is applied and used, a generation-valid fenced feedback report for that replacement causes the Control Plane to select a new safe recovery route. Live traffic then moves to the recovery route, the degraded replacement is not reselected, and fallback, route-failure, flow-drop, and scheduler-drop deltas stay zero for the recovery send. The smoke also documents an important guardrail: stale route-generation feedback must not trigger recovery. C18Z67/C18Z79 were tightened to check per-run counter deltas rather than cumulative runtime counters. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z81 live smoke, C18Z80 regression smoke, C18Z79 regression smoke, and C18Z67 concurrent QoS regression. Artifact: artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json.
- C18Z82 service-channel no-safe-recovery proof is implemented and deployed on
docker-test as
rap-backend:fabric-service-channel-0.2.256-c18z82 with node-agent image rap-node-agent:0.2.256-c18z82 on test-1/2/3. The new live smoke proves the branch where the original primary is degraded, the replacement is applied and used, then that replacement reports generation-valid fenced feedback while no new safe recovery route exists. Node-scoped synthetic config reports service_channel_feedback_no_alternate with pending_degraded_fallback; score reasons include no_unfenced_alternate_route and backend_relay_degraded_fallback_until_rebuild, so the Control Plane exposes an explicit degraded/no-alternate state instead of silently sticking to a bad replacement. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, C18Z82 live smoke, C18Z81 recovery regression, C18Z80 pressure regression, and C18Z67 concurrent QoS regression. Artifact: artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json.
- C18Z83 service-channel access-telemetry no-safe projection is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.257-c18z83; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Active access telemetry channels now expose route-decision source, route id, replacement route id, rebuild status/reason/generation, and score reasons. Web-admin shows a dedicated decision column in the active-channel table. The live smoke proves no-safe recovery is visible through access telemetry as service_channel_feedback_no_alternate/pending_degraded_fallback, while durable ledger state can still report rebuild_request_no_alternate. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, and C18Z83 live smoke. Artifact: artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json.
- C18Z84 service-channel access-decision aggregate proof is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.258-c18z84; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Access telemetry now exposes aggregate route-decision counters: route_decision_channel_count, replacement_decision_count, applied_rebuild_decision_count, recovery_decision_count, and no_safe_recovery_decision_count. Web-admin summary chips show these counts, and no-safe route decisions now prioritize the aggregate reason active_channels_no_safe_recovery over generic missing access-report noise. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z84 live smoke, and C18Z83 regression smoke. Artifact: artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json.
- C18Z85 service-channel access-decision incident projection is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.259-c18z85; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Rebuild health summary now carries access decision counts and prioritizes inspect_access_no_safe_recovery_route_pool_and_signed_policy when no-safe is active. Rebuild incidents now include incident_source=access_decision entries with channel id and operator-facing severity/action, including access_no_safe_recovery as a bad incident. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z85 live smoke, and C18Z84 regression smoke. Artifact: artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json.
- C18Z86 service-channel access-decision silence/acknowledgement is
implemented and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.261-c18z86; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Rebuild alert silence requests now carry incident_source and channel_id; incident_source=access_decision no-safe incidents require channel_id and are stored with channel-scoped route keys. Rebuild health and incident lists apply those silences, so an acknowledged current-generation access no-safe incident is silenced and no longer contributes to active bad count. Generation-change resurfacing is covered in unit tests; live smoke proves the channel-scoped silence path. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z86 live smoke, and C18Z85 regression smoke. Artifact: artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json.
- C18Z87 service-channel access-decision silence management is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.262-c18z87; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Backend now exposes active rebuild alert silences, enriches access-decision silences with incident_source, channel_id, and display_route_id, and supports unsilence by id. Web-admin shows an Active rebuild silences table with an unsilence action. The live smoke proves the operator path: access no-safe incident -> silence -> active silence listed -> unsilence -> active bad incident restored. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z87 live smoke, and C18Z86 regression smoke. Artifact: artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json.
- C18Z88 service-channel access-decision resurface proof is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.263-c18z88; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Access-decision incidents now include resurface details (alert_resurfaced_from_silence_id, alert_resurfaced_previous_generation, and alert_resurfaced_previous_until) when a previously acknowledged access-decision incident changes generation/route/channel and becomes active again. Web-admin shows the previous generation/expiry beside resurfaced incidents. The live smoke proves access no-safe -> silence current generation -> route-decision generation changes -> incident resurfaces as active bad with previous-generation metadata preserved. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z88 live smoke, and C18Z87 regression smoke. Artifact: artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json.
- C18Z89 service-channel access-decision resurface action loop is implemented
and deployed on docker-test as
rap-backend:fabric-service-channel-0.2.264-c18z89; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Resurfaced access-decision incidents now include alert_resurfaced_cause, alert_resurfaced_previous_route_id, and alert_resurfaced_previous_channel_id. Web-admin shows the cause beside the resurfaced action text. The live smoke proves the operator path: access no-safe -> silence current generation -> generation changes and resurfaces -> active-channel decision context matches the incident -> re-acknowledge current generation -> incident returns to silenced state. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z89 live smoke, and C18Z88 regression smoke. Artifact: artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json.
- C18Z90 service-channel production data-plane contract is implemented and
deployed on docker-test as
rap-backend:fabric-service-channel-0.2.265-c18z90; node-agent remains rap-node-agent:0.2.256-c18z82 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Service-channel leases now include a signed data_plane contract in the lease, authority payload, introspection response, and lease-maintenance/admin list. The contract declares backend API as control-plane transport, fabric service channel over fabric routes as working/steady-state data transport, backend relay as degraded fallback only, production forwarding required, and service-neutral protocol-agnostic logical flow isolation. Web-admin shows data-plane/fallback policy in service-channel leases. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z90 live smoke, and C18Z89 regression smoke. Artifact: artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json.
- C18Z91 node-agent data-plane contract consumption is implemented and
deployed on docker-test as
rap-node-agent:0.2.266-c18z91 on test-1/2/3 with backend still rap-backend:fabric-service-channel-0.2.265-c18z90. Service-channel VPN packet ingress now parses signed/introspected data_plane, validates the production contract, applies the preferred fabric route, logs data-plane mode/transports/backend-relay policy/logical-flow mode, and reports data_plane_contract plus last transport/policy fields in heartbeat access telemetry. Verification passed: go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config, backend cluster tests, web-admin build, C18Z91 live smoke, and C18Z90 regression smoke. Artifact: artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json.
- C18Z92 node-agent backend-fallback policy enforcement is implemented and
deployed on docker-test as
rap-node-agent:0.2.267-c18z92 on test-1/2/3. If a signed data-plane contract has backend_relay_policy=disabled, the service-channel runtime no longer proxies failed/missing fabric-route working data through backend relay; it returns a visible service unavailable result. The live smoke temporarily disables backend fallback in pool policy, issues a no-route lease, verifies backend_relay_policy=disabled, posts to test-1, and proves the node rejects with 503 instead of backend relay. Verification passed: node-agent tests, C18Z92 live smoke, and C18Z91 regression smoke. Artifact: artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json.
- C18Z93 access-telemetry data-plane projection is implemented and deployed on
docker-test as
rap-backend:fabric-service-channel-0.2.268-c18z93; node-agent remains rap-node-agent:0.2.267-c18z92 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Backend access telemetry now promotes node-reported data_plane_contract and last data-plane mode/working transport/steady-state transport/backend relay policy/logical flow mode to cluster, node, and active-channel diagnostics. Web-admin shows summary chips plus channel/node table columns for data-plane adoption and relay policy. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z93 live smoke, C18Z92 regression smoke, and C18Z91 regression smoke. Artifact: artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json.
- C18Z94 data-plane contract incident diagnostics are implemented and deployed
on docker-test as
rap-backend:fabric-service-channel-0.2.269-c18z94; node-agent remains rap-node-agent:0.2.267-c18z92 on test-1/2/3, and web-admin is rebuilt/deployed to rap_web_admin. Access/rebuild incident diagnostics now include incident_source=data_plane_contract rows for missing data-plane contract reports after accepted traffic, working/steady transport mismatches, logical-flow mismatch, disabled backend relay observed, and degraded/backend-relay policy violations. The smoke now proves disabled backend relay is emitted as a bad incident with action restore_fabric_route_or_change_signed_backend_relay_policy_before_retry. Verification passed: go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent, web-admin npm run build, C18Z94 live smoke, C18Z93 regression smoke, C18Z92 regression smoke, and C18Z91 regression smoke. Artifact: artifacts/c18z94-data-plane-contract-incident-smoke-result.json.
- C18Z95 node-agent blocked-fallback telemetry is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.270-c18z95 and node-agent rap-node-agent:0.2.270-c18z95 on test-1/2/3; web-admin is rebuilt/deployed to rap_web_admin. Node-agent now reports backend_fallback_blocked, fabric_route_send_failure, and last data-plane violation status/reason in fabric_service_channel_access_report. Backend access telemetry projects those fields to cluster, node, and active-channel rows, and data_plane_contract incidents distinguish policy-blocked fallback from real backend relay usage. Verification passed: node-agent tests, backend tests, web-admin build, C18Z95 live smoke, and C18Z94/C18Z93/C18Z92 regressions. Artifact: artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json.
- C18Z96 blocked-fallback rebuild feedback is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; node-agent remains rap-node-agent:0.2.270-c18z95 on test-1/2/3, and web-admin remains deployed. Backend now converts heartbeat access reports with fabric_route_send_failed_backend_fallback_blocked into durable fenced fabric_service_channel_route_feedback for the active channel primary route. The existing route rebuild planner then selects an authorized replacement route when one exists. Verification passed: backend tests, node-agent tests, web-admin build, C18Z96 live smoke, and C18Z95/C18Z93 regressions. Artifact: artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json.
- C18Z97 blocked-fallback feedback dedup is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109. Backend now suppresses repeated access-report-derived route feedback while an active fenced/degraded observation from fabric_service_channel_access_report already exists for the same cluster, reporter node, route, and service class. This keeps repeated blocked-fallback send-failure heartbeats from refreshing the same feedback and churning rebuild attempts. Verification passed: backend tests, node-agent tests, C18Z97 live smoke, and C18Z96/C18Z95 regressions. Artifact: artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json.
- C18Z98 blocked-fallback rebuild correlation is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. Backend now carries the originating access-report route-feedback identity into replacement decisions and rebuild-attempt ledger rows: feedback_observation_id, feedback_source, feedback observed/expiry times, channel/resource ids, and data-plane violation status/reason. Web-admin shows this correlation in Route decisions and Rebuild ledger. Verification passed: backend tests, node-agent tests, web-admin build, C18Z98 live smoke, and C18Z97/C18Z96/C18Z95 regressions. Artifact: artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json.
- C18Z99 rebuild correlation filters are implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. The rebuild-attempt ledger API now accepts feedback_source, feedback_channel_id, and feedback_violation_status filters, and web-admin exposes them in the rebuild ledger filter form. Verification passed: backend tests, node-agent tests, web-admin build, C18Z99 live smoke, and C18Z98/C18Z97/C18Z96/C18Z95/C18Z93 regressions. Artifact: artifacts/c18z99-rebuild-correlation-filter-smoke-result.json.
- C18Z100 rebuild-health feedback breakdown is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. The rebuild-health summary now returns feedback_breakdowns grouped by feedback source, feedback channel id, and feedback violation status, with total/good/warn/bad/unknown counts, active warn/bad counts, silenced count, latest observation time, and affected reporter nodes/routes. Web-admin shows the breakdown in the Rebuild health panel. Verification passed: backend tests, node-agent tests, web-admin build, C18Z100 live smoke, and C18Z99/C18Z98/C18Z97/C18Z96/C18Z95/C18Z93 regressions. Artifact: artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json.
- C18Z101 rebuild-health feedback drilldown UI is implemented and deployed to
rap_web_admin; backend remains rap-backend:fabric-service-channel-0.2.281-c18z109. Web-admin now shows related incident context on rebuild-health feedback breakdown rows and an open ledger action that switches to deep rebuild ledger with feedback_source, feedback_channel_id, and feedback_violation_status prefilled from the selected breakdown. Verification passed: web-admin build and deployed asset/download checks.
- C18Z102 rebuild-health feedback drilldown audit breadcrumbs are implemented
and deployed on docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. The existing rebuild investigation endpoint now accepts feedback source/channel/violation drilldown payloads and records fabric.service_channel_rebuild_feedback_breakdown.investigation_opened cluster audit events before web-admin opens the filtered deep rebuild ledger. Verification passed: backend tests, web-admin build, C18Z102 live smoke, and C18Z100/C18Z99/C18Z98 regressions. Artifact: artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json.
- C18Z103 Fabric diagnostics drilldown audit visibility is implemented and
deployed to
rap_web_admin; backend remains rap-backend:fabric-service-channel-0.2.281-c18z109. Web-admin now filters the loaded cluster audit list for rebuild incident and feedback-breakdown investigation events and shows recent drilldowns in the Fabric diagnostics panel with time, source, feedback filters, target reporter/route, actor, and reason. Verification passed: web-admin build and deployed asset/download checks.
- C18Z104 focused Fabric audit loading is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. The cluster audit API now accepts repeated or comma-separated event_type filters plus target_type filters, and Fabric diagnostics loads recent rebuild incident/feedback breakdown investigation breadcrumbs with a dedicated filtered request instead of depending on the generic latest-100 audit list. Verification passed: backend tests, web-admin build, C18Z104 live smoke, and C18Z102/C18Z100 regressions. Artifact: artifacts/c18z104-focused-fabric-audit-smoke-result.json.
- C18Z105 Fabric drilldown breadcrumb correlation UI is implemented and
deployed to
rap_web_admin; backend remains rap-backend:fabric-service-channel-0.2.281-c18z109. Recent investigation rows in Fabric diagnostics now show whether each breadcrumb still matches a current rebuild-health feedback breakdown or visible rebuild incident, and provide an open action to jump back into the matching filtered ledger path. Verification passed: web-admin build and deployed asset/download checks.
- C18Z106 server-side Fabric drilldown breadcrumb correlation is implemented
and deployed on docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. Focused audit reads with correlation=fabric_diagnostics now return correlation_hints with current diagnostic status and the matching rebuild-health feedback breakdown or rebuild incident when present. Web-admin consumes those hints and keeps local matching as a fallback. The rebuild-health feedback breakdown window is raised to 100 groups after the C18Z100 regression exposed that the previous cap could hide fresh failure classes on noisy test history. Verification passed: backend tests, web-admin build, C18Z106 live smoke, and C18Z104/C18Z100 regressions. Artifact: artifacts/c18z106-audit-correlation-hints-smoke-result.json.
- C18Z107 drilldown breadcrumb summary is implemented and deployed on
docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. Audit responses now include compact audit_summary aggregates beside audit_events; focused Fabric diagnostics uses them to show counts by current diagnostic status, feedback source, feedback violation status, correlated/not-visible totals, and latest time above the Recent investigations rows. Verification passed: backend tests, web-admin build, C18Z107 live smoke, and C18Z106/C18Z104 regressions. Artifact: artifacts/c18z107-audit-correlation-summary-smoke-result.json.
- C18Z108 dedicated Fabric diagnostics breadcrumbs are implemented and deployed
on docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. Backend exposes GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs returning rebuild_investigation_breadcrumbs with events and summary, so the operator Recent investigations workflow no longer overloads the generic cluster audit endpoint. Verification passed: backend tests, web-admin build, C18Z108 live smoke, and C18Z107/C18Z106/C18Z100 regressions. Artifact: artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json.
- C18Z109 Fabric diagnostics breadcrumb freshness windows are implemented and
deployed on docker-test as backend
rap-backend:fabric-service-channel-0.2.281-c18z109; web-admin is rebuilt/deployed to rap_web_admin. The dedicated breadcrumb endpoint accepts current_window_seconds and history_window_seconds, annotates events with correlation_hints.breadcrumb_status (current, stale, expired) plus age/window seconds, returns current/stale/expired totals, and includes counts_by_breadcrumb_status in summary. Web-admin shows freshness chips and age in Recent investigations. Verification passed: backend tests, web-admin build, C18Z109 live smoke, and C18Z108/C18Z107/C18Z106 regressions. Artifact: artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json.
- C19Q Remote Workspace mailbox guardrails are implemented and
runtime-smoke-proven on docker-test. The adapter-session mailbox handoff now
has unit and live coverage for invalid adapter session IDs, unknown sessions,
invalid limits, and bounded
drain=true&limit=N partial drain semantics. This remains probe-only and node-local: it does not enable RDP protocol forwarding, desktop frame transport, Android work, or backend relay behavior. Verification passed: go test ./internal/mesh in agents/rap-node-agent and scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1. Artifact: artifacts/c19q-remote-workspace-adapter-mailbox-guardrails-smoke-result.json.
- C19R Remote Workspace mailbox long-poll ergonomics are implemented and
runtime-smoke-proven on docker-test. The mailbox endpoint now accepts bounded
wait_ms, returns explicit empty, waited, wait_timeout, and wait_ms fields, and wakes when a delayed mailbox event arrives before timeout. Node-agent image rap-node-agent:codex-service-supervisor-20260512s is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19R live smoke, and C19Q regression smoke. Artifact: artifacts/c19r-remote-workspace-mailbox-long-poll-smoke-result.json.
- C19S Remote Workspace mailbox telemetry is implemented and
runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry
now expose mailbox read/wait/timeout/empty-read counters plus last mailbox
read metadata, so adapter consumer polling behavior is visible without
enabling desktop frame transport. Node-agent image
rap-node-agent:codex-service-supervisor-20260512t is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19S live smoke, and C19R regression smoke. Artifact: artifacts/c19s-remote-workspace-mailbox-telemetry-smoke-result.json.
- C19T Remote Workspace mailbox consumer checkpoint/ack metadata is implemented
and runtime-smoke-proven on docker-test. The mailbox endpoint now accepts a
validated
consumer_id and optional ack_sequence, returns consumer checkpoint/ack/lag/read metadata, and keeps bounded per-session node-local consumer cursor state. Workload status and heartbeat telemetry expose aggregate/current-session consumer read and ack counters. Node-agent image rap-node-agent:codex-service-supervisor-20260512u is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19T live smoke, and C19S regression smoke. Artifact: artifacts/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke-result.json.
- C19U Remote Workspace mailbox consumer lifecycle guardrails are implemented
and runtime-smoke-proven on docker-test. Consumers can pass
reset_consumer=true with a validated consumer_id to clear cursor state before the current read is recorded. Mailbox responses expose consumer count/capacity, created/reset/evicted lifecycle flags, and consumer timestamps; workload status and heartbeat telemetry expose consumer reset and eviction counters. Node-agent image rap-node-agent:codex-service-supervisor-20260512v is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19U live smoke, and C19T regression smoke. Artifact: artifacts/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke-result.json.
- C19V Remote Workspace mailbox consumer cursor inspection is implemented and
runtime-smoke-proven on docker-test. Active adapter sessions now expose a
read-only
/mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers endpoint with bounded cursor snapshots: consumer ids, checkpoint/ack sequences, lag, read/ack totals, and timestamps. The endpoint is read-only and does not increment mailbox reads, acks, resets, or drain events. Node-agent image rap-node-agent:codex-service-supervisor-20260512w is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19V live smoke, and C19U regression smoke. Artifact: artifacts/c19v-remote-workspace-mailbox-consumer-snapshot-smoke-result.json.
- C19W Remote Workspace mailbox cursor-aware resume reads are implemented and
runtime-smoke-proven on docker-test. The mailbox endpoint now accepts
after_sequence for non-destructive reads, returns skipped_count and returned_count, and long-polls for events newer than the requested sequence. after_sequence with drain=true is rejected to keep resume reads separate from destructive drains. Node-agent image rap-node-agent:codex-service-supervisor-20260512x is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19W live smoke, and C19V regression smoke. Artifact: artifacts/c19w-remote-workspace-mailbox-after-sequence-smoke-result.json.
- C19X Remote Workspace mailbox consumer-aware resume is implemented and
runtime-smoke-proven on docker-test. Mailbox reads with
consumer_id can pass resume_from=ack|checkpoint; the node-agent resolves the stored cursor to after_sequence before reading and returns resume_from/resume_sequence. Guardrails reject mixing resume with manual after_sequence, drain, reset, missing consumers, or invalid cursor names. Node-agent image rap-node-agent:codex-service-supervisor-20260512y is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19X live smoke, and C19W regression smoke. Artifact: artifacts/c19x-remote-workspace-mailbox-consumer-resume-smoke-result.json.
- C19Y Remote Workspace mailbox resume telemetry is implemented and
runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry
now expose resume/after-sequence read totals, returned/skipped totals, and the
last resume cursor/sequence/consumer plus returned/skipped counts for
operator diagnostics. Session snapshots include the same per-session resume
counters. Node-agent image
rap-node-agent:codex-service-supervisor-20260512z is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19Y live smoke, C19X source smoke, and C19W regression smoke. Artifact: artifacts/c19y-remote-workspace-mailbox-resume-telemetry-smoke-result.json.
- C19Z Remote Workspace adapter runtime readiness summary is implemented and
runtime-smoke-proven on docker-test. The sink report now includes compact
adapter_runtime_readiness diagnostics with session lifecycle state, mailbox depth, consumer cursor, resume cursor, skipped/returned counts, and ready/diagnostic status for operator handoff checks. Node-agent image rap-node-agent:codex-service-supervisor-20260512z1 is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19Z live smoke, C19X source smoke, and C19Y regression smoke. Artifact: artifacts/c19z-remote-workspace-adapter-readiness-smoke-result.json.
- C19Z1 Remote Workspace mailbox handoff preflight is implemented and
runtime-smoke-proven on docker-test. The node-agent now exposes read-only
GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight for consumer_id plus resume_from=ack|checkpoint; it validates the cursor and reports the expected next event window without reading, draining, acking, or mutating consumer state. Node-agent image rap-node-agent:codex-service-supervisor-20260512z2 is built and deployed on test-1/2/3. Verification passed: go test ./internal/mesh, C19Z1 live smoke, C19X source smoke, and C19Z regression smoke. Artifact: artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json.
The current phase is NOT:
- full mesh routing implementation
- full VPN orchestration
- multi-cluster runtime traffic handling
- production data-plane migration
- complete updater rollout orchestration
- video meetings
- final native client UI redesign
Future mesh, VPN, multi-cluster, node-agent updater, and production realtime data-plane work must be introduced only through explicit, narrow, staged implementation prompts.
Always keep the project production-oriented. Do not simplify it into a toy app.