rdp-proxy/CODEX_CONTEXT.md
2026-05-12 21:02:29 +03:00

# CODEX CONTEXT
## Project identity
This project is a production-grade distributed secure access platform.
It started as a custom RDP proxy with persistent server-side sessions, but the final target architecture is broader:
- distributed secure access fabric
- multi-tenant platform
- session broker for GUI and future non-GUI protocols
- cluster mesh of nodes
- connector/VPN layer
- customer-managed and platform-managed nodes
- node-agent based self-update / rollback / health supervision
## Product architecture rule: VPN and Remote Workspace are separate products/layers
Do not merge VPN/IP tunnel work with Remote Workspace / remote desktop work.
- VPN is a universal network-layer IP tunnel. It carries any traffic generated
by a phone, Windows PC, Linux host, or other client device: HTTP, DNS, ping,
RDP clients, SSH clients, SMB, business apps, and future protocols. VPN must
stay protocol-agnostic and must not contain remote-desktop-specific logic.
- Remote Workspace is an application/session-layer service. The client talks to
RAP using RAP's own client protocol. RAP workers/connectors then talk to the
target server using protocol adapters such as RDP, SSH, VNC, or future
adapters, convert screen/input/clipboard/files/audio/control into RAP's
format, and render it in the RAP client.
- VPN optimization work must focus on generic data-plane transport,
full-tunnel/split-tunnel routing, DNS, MTU/MSS, QoS, NAT traversal, direct
UDP/QUIC transport, fallback relay, diagnostics, and stability for arbitrary
traffic.
- Remote Workspace optimization work must focus on server catalog, session
broker, workers/connectors, protocol adapters, RAP client protocol, separate
connection windows, rendering/input/clipboard/file/audio behavior, and
user-facing remote-workspace UX.
- Both VPN and Remote Workspace must consume the shared Fabric Service Channel
runtime. Control/API traffic may use backend/admin ingress, but working
service data must use the fabric channel whenever available. Backend relay is
a compatibility/degraded fallback, not the production steady-state.
- The accepted service-channel direction is documented in
`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`: a service requests a
channel with entry pool, exit pool, roles, service class, channel classes,
QoS and failover policy; the fabric selects the fastest healthy route and
rebuilds it on failure. Protocol-specific services must not reimplement this
transport.
- Current implementation: backend issues `rap.fabric_service_channel_lease.v1`
leases and embeds them in VPN client profiles. Leases include
cluster-authority-signed `rap.fabric_service_channel_lease_authority.v1`
payloads that bind token hash, selected route, generation, fencing epoch, and
expiry, plus a signed `data_plane` contract declaring that working data uses
the Fabric Service Channel over fabric routes while backend relay is only an
explicit degraded/disabled fallback policy. `rap-node-agent` accepts the
first VPN packet service-channel entry endpoint under
`/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets`
plus `/packets/ws`. The endpoint validates the signed or introspected
data-plane contract, applies the preferred fabric route, uses the existing
production `vpn_packet` fabric route, reports contract adoption in heartbeat
access telemetry, and refuses backend relay when the contract disables it.
Backend access telemetry and web-admin now show data-plane adoption,
working/steady-state transport, backend relay policy, data-plane mode, and
logical flow mode at cluster/node/channel levels. The next slice is explicit
route/fallback violation incidents from that telemetry, plus client
consumption of the lease endpoint template.
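
The relay-refusal rule above (fabric route preferred, backend relay only as an explicitly permitted degraded fallback) can be illustrated with a minimal Go sketch. `DataPlaneContract`, `Transport`, and `ChooseTransport` are hypothetical names for illustration only; they are not taken from the `rap.fabric_service_channel_lease.v1` schema or from `rap-node-agent`.

```go
package main

import (
	"errors"
	"fmt"
)

// Transport values are illustrative labels, not identifiers from the lease schema.
type Transport string

const (
	TransportFabric       Transport = "fabric_route"
	TransportBackendRelay Transport = "backend_relay"
)

// DataPlaneContract is a hypothetical, reduced view of the signed data_plane
// contract: it only records whether backend relay may be used as a fallback.
type DataPlaneContract struct {
	BackendRelayAllowed bool
}

// ErrRelayDisabled is returned when no fabric route is usable and the
// contract forbids falling back to backend relay.
var ErrRelayDisabled = errors.New("backend relay disabled by data-plane contract")

// ChooseTransport prefers the fabric route. Backend relay is used only as a
// degraded fallback, and only when the contract explicitly allows it.
func ChooseTransport(c DataPlaneContract, fabricHealthy bool) (Transport, error) {
	if fabricHealthy {
		return TransportFabric, nil
	}
	if c.BackendRelayAllowed {
		return TransportBackendRelay, nil
	}
	return "", ErrRelayDisabled
}

func main() {
	strict := DataPlaneContract{BackendRelayAllowed: false}
	if _, err := ChooseTransport(strict, false); err != nil {
		fmt.Println("refused:", err)
	}
}
```

The point of the sketch is that the fallback decision is driven by the contract rather than by the caller, so a protocol-specific service cannot silently make backend relay its steady state.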
## Current proven foundation
The current codebase already proved the most risky low-level lifecycle assumptions for RDP:
- real FreeRDP connect works
- session state transition to active works
- terminate works
- detach works without killing the remote session
- reattach works without recreating the remote session
- takeover works without recreating the remote session
- per-resource certificate verification policy exists
  - `certificate_verification_mode = strict | ignore`
  - `strict` is default
  - `ignore` works on a per-resource basis
- worker build is reproducible
- backend build is reproducible
This proven lifecycle must NOT be broken by future architecture work.
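
As an illustration of the per-resource certificate policy listed above, here is a minimal Go sketch of how a `certificate_verification_mode` value might be parsed with `strict` as the default. The type and function names are hypothetical, and rejecting unknown values is an assumption of this sketch rather than documented backend behavior.

```go
package main

import "fmt"

// CertVerificationMode mirrors the two documented values of
// certificate_verification_mode. The Go type itself is hypothetical.
type CertVerificationMode string

const (
	CertStrict CertVerificationMode = "strict"
	CertIgnore CertVerificationMode = "ignore"
)

// ParseCertVerificationMode maps a per-resource setting to a mode.
// An empty value falls back to strict, the documented default.
// Unknown values are rejected (an assumption) and also reported as strict,
// so a caller that ignores the error still fails safe.
func ParseCertVerificationMode(raw string) (CertVerificationMode, error) {
	switch raw {
	case "", "strict":
		return CertStrict, nil
	case "ignore":
		return CertIgnore, nil
	default:
		return CertStrict, fmt.Errorf("unknown certificate_verification_mode %q", raw)
	}
}

func main() {
	for _, raw := range []string{"", "strict", "ignore", "off"} {
		mode, err := ParseCertVerificationMode(raw)
		fmt.Printf("%q -> %s (err: %v)\n", raw, mode, err)
	}
}
```

Because the mode is resolved per resource, `ignore` never has to become a worker-wide switch.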
## Current architecture baseline
Current audit and baseline snapshot:
- `docs/audits/PROJECT_AUDIT_2026-04-26.md`
- `docs/audits/CURRENT_BASELINE_MATRIX.md`
### Test environment
- Canonical test Docker host: `192.168.200.61`
- Canonical Docker context: `test-ubuntu`
- Canonical SSH alias: `docker-test`
- Current external control-plane endpoint for remote/offsite node enrollment:
`http://94.141.118.222:19191` / `http://vpn.cin.su:19191`.
- Current port forward: `94.141.118.222:19191` -> `192.168.200.61:18080`.
- For offsite Windows/Linux nodes, install profiles should use:
`http://vpn.cin.su:19191/api/v1` as control-plane endpoint and
`http://vpn.cin.su:19191/downloads` as artifact endpoint unless the user
explicitly chooses the raw IP endpoint.
- Backend API for local/client smoke runs: `http://192.168.200.61:8080/api/v1`
- WebSocket gateway for local/client smoke runs: `ws://192.168.200.61:8080/api/v1/gateway/ws`
- Stage C17 planning is completed.
- C17A synthetic mesh runtime skeleton is implemented and test-proven in
`rap-node-agent` only. It is disabled by default and carries synthetic
`fabric.probe` / `fabric.probe_ack` messages only.
- C17B route health and failover probes are implemented and test-proven in
`rap-node-agent` only. They are disabled by default and carry synthetic
`fabric.route_health` / `fabric.route_health_ack` messages only.
- C17C relay semantic hardening is implemented and test-proven in
`rap-node-agent` only. It is disabled by default and models synthetic
per-channel queues/QoS/backpressure only.
- C17D non-production test-service path is implemented and test-proven in
`rap-node-agent` only. It is disabled by default and carries only bounded
`synthetic.echo` test payloads.
- C17E/C17F/C17G are implemented and proven for live synthetic HTTP transport,
scoped synthetic route config, and Control Plane scoped synthetic config
consumption.
- C17H deployed multi-agent synthetic config smoke is runtime-proven on
`docker-test`: five running `rap-node-agent` containers consume
backend-issued node-scoped synthetic config, direct and single-relay
synthetic route-health observations return to the Control Plane, and
production forwarding remains disabled.
- C17I production forwarding gate foundation is implemented and test-proven:
`rap-node-agent` has an explicit production-forwarding gate, while
`/mesh/v1/forward` still refuses production payload forwarding until a later
approved runtime stage.
- C17J production envelope contract is implemented and test-proven:
`/mesh/v1/forward` validates route-bound production envelopes for
`fabric_control` / `fabric.control` only when the gate is enabled, rejects
service channels, and still refuses production forwarding.
- C17K production envelope observation is implemented and test-proven:
valid accepted envelopes can be observed locally as metadata-only records
after validation; rejected envelopes are not observed, observation failure
fails closed, and production forwarding remains unavailable.
- C17L bounded production observation sink is implemented and test-proven:
accepted metadata-only observations can be retained locally with fixed
capacity, oldest-entry drop behavior, and no payload body storage.
- C17M production observation sink wiring is implemented and test-proven:
node-agent can wire the bounded local metadata-only sink when
`RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is explicitly greater than
zero; the wiring is disabled by default and exposes no read API.
- C17N production observation sink metrics are implemented and test-proven:
local sink metrics expose only capacity, current depth, accepted total, and
dropped-oldest total; they expose no observation records or payload metadata.
- C17O production observation sink local metrics logging is implemented and
test-proven: node-agent logs aggregate sink metrics locally when the sink is
explicitly enabled; no read API or Control Plane reporting is added.
- C17P production observation sink change-driven metrics logging is implemented
and test-proven: node-agent suppresses repeated identical local sink metrics
logs; no read API or Control Plane reporting is added.
- C17Q production forwarding gate/runtime log boundary is implemented and
test-proven: node-agent logs production forwarding gate state separately from
production forwarding runtime state. Runtime state remained false until
C17Z introduced gate-controlled `fabric.control` direct forwarding.
- C17R production observation sink capacity guard is implemented and
test-proven: `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is rejected
above `10000`.
- C17S production observation panic fail-closed hardening is implemented and
test-proven: observer errors and observer panics both fail closed as
observation failure.
- C17T production envelope payload boundary is implemented and test-proven:
validated production `fabric.control` envelope payloads are bounded to
`4096` bytes and oversized envelopes are rejected before observation.
- C17U production envelope created-at skew boundary is implemented and
test-proven: validated production `fabric.control` envelopes whose
`created_at` is more than one minute in the future are rejected before
observation.
- C17V peer endpoint candidate model is implemented and test-proven:
node-scoped synthetic mesh config now carries route-scoped endpoint
candidates with transport, address, reachability, NAT type, connectivity
mode, priority, policy tags, verification time, and metadata. This is a
model/config boundary only; no production route scoring, NAT traversal,
shortcut routing, or forwarding runtime is implemented.
- C17W peer endpoint candidate scoring model is implemented and test-proven:
`rap-node-agent` can rank already-scoped endpoint candidates using soft
inputs such as transport, reachability, connectivity mode, NAT type,
priority, region, policy tags, channel class, and verification age. This is
a scoring helper only; it does not open connections, choose production
routes, or forward payloads.
- C17X health-aware endpoint candidate scoring overlay is implemented and
test-proven: endpoint candidate scoring can optionally use local health
observations keyed by `endpoint_id`, including latency, success/failure
history, recent failure reason, reliability score, and observation freshness.
This remains advisory scoring only and is not wired into production route
execution.
- C17Y Platform Owner synthetic mesh visibility is implemented and
build/test-proven: `web-admin` reads node-scoped synthetic mesh config and
shows config enabled state, route counts, peer endpoints, endpoint
candidates, C17X advisory scoring boundary, and `production_forwarding`.
This remains platform-owner visibility only and does not enable production
forwarding.
- C17Z production fabric-control direct forwarding boundary is implemented and
test-proven: when `RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`,
`/mesh/v1/forward` can deliver valid route-bound `fabric.control` envelopes
at the local destination or forward them to a direct next hop from explicit
peer endpoint config. Service channels, arbitrary relay forwarding,
multi-hop production route execution, and RDP/VPN/file/video/service payloads
remain unavailable.
- C17Z1 production fabric-control multi-hop route-path boundary is implemented
and test-proven: production `fabric.control` envelopes can carry
`route_path` and `visited_node_ids`; relay nodes validate path position,
forward only to the next path node, update TTL/hop/visited metadata, and
reject loops. Service payloads remain unavailable.
- C17Z2 production fabric-control forwarding observability boundary is
implemented and test-proven: node-agent emits local
`mesh_production_forward_event` logs for accepted, forwarded, delivered, and
rejected production `fabric.control` envelopes. Logs are metadata-only and
include no payload bodies or read API.
- C17Z3 production fabric-control route-config boundary is implemented and
test-proven: when scoped/control-plane mesh routes are available locally,
production `fabric.control` envelopes must match configured route_id/path/
next-hop/channel/expiry/TTL/hop limits before forwarding.
- C17Z4 scoped peer directory and recovery seeds boundary is implemented and
test/build-proven: node-scoped mesh config carries scoped `peer_directory`
and explicit bounded `recovery_seeds`; node-agent parses/validates them and
web-admin shows counts.
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven:
node-agent builds a local `PeerCache`, selects bounded warm peers, probes warm
peers with `/mesh/v1/health`, and reports metadata-only mesh-link
observations when synthetic mesh testing is enabled.
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven:
node-agent reports explicit advertised mesh endpoint metadata in heartbeat,
and Control Plane projects latest reported endpoints/candidates into
node-scoped synthetic mesh config.
- C17Z7 private/corporate endpoint candidate boundary is implemented and
test-proven: node-agent reports multiple advertised endpoint candidates,
scoring rewards private/corporate same-site candidates, and peer cache can
use the best candidate address for warm health.
- C17Z8 peer connection state machine boundary is implemented and test-proven:
node-agent tracks warm-peer states `disconnected`, `connecting`, `ready`,
`degraded`, and `backoff`, with bounded backoff after repeated health probe
failures.
- C17Z9 peer recovery planner boundary is implemented and test-proven:
node-agent targets a bounded stable ready-peer set, enters recovery when
ready peers fall below target, and selects bounded recovery probes from warm
peers, recovery seeds, and other connectable scoped peers.
- C17Z10 peer connection intent planner boundary is implemented and
test-proven: node-agent classifies bounded peer work as maintain/probe/
recover and classifies transport readiness as direct/private_lan/
corporate_lan/outbound_only/relay_required, with rendezvous-required
metadata only.
- C17Z11 peer connection manager runtime boundary is implemented and
test-proven: node-agent uses a reusable HTTP keep-alive client for real
control-plane health probes of direct/private/corporate peers and records
`waiting_rendezvous` for outbound-only/relay-required peers.
- C17Z12 rendezvous/relay control-plane contract is implemented and
docker-test-runtime-proven: backend issues node-scoped `rendezvous_leases`,
node-agent resolves matching `waiting_rendezvous` intents into
`relay_control`, probes relay `/mesh/v1/health`, records and maintains
`relay_ready`, and keeps service payload forwarding disabled.
- C17Z13 rendezvous lease telemetry is implemented and
docker-test-runtime-proven: node-agent reports
`mesh_rendezvous_lease_report` with relay admission, peer admission,
TTL/renewal posture, `relay_ready`, and explicit no-payload boundary flags;
web-admin shows `rv leases` in recent heartbeat tables.
- C17Z14 rendezvous lease refresh contract is implemented and
docker-test-runtime-proven: node-agent refreshes renewal-needed/stale
rendezvous leases through node-scoped synthetic config reload, updates the
running peer cache/route/lease state, and reports refresh plus stale relay
withdrawal/reselection telemetry. Service payload forwarding remains
unavailable.
- C17Z15 backend relay replacement policy is implemented and
docker-test-runtime-proven: backend consumes recent stale-relay heartbeat
feedback, withdraws stale explicit rendezvous leases, scores alternate relay
candidates from route adjacency, endpoint priority, policy tags, and recent
mesh-link health, and returns replacement leases plus
`rendezvous_relay_policy` decisions in node-scoped synthetic config.
Node-agent reports `c17z15.mesh_rendezvous_lease_report.v1` and keeps stale
state scoped to the exact lease/relay, so replacement leases for the same
peer are not marked stale by association. Service payload forwarding remains
unavailable.
- C17Z16 route/path decision artifact is implemented and
docker-test-runtime-proven: backend `c17z16.synthetic.v1` config includes
`route_path_decisions` with original hops, effective hops, local previous/
next hop, selected replacement relay, generation, score reasons, and
no-payload boundary flags. Node-agent stores the control-plane route
generation and reports `c17z16.mesh_route_path_decision_report.v1` plus
`c17z16.mesh_rendezvous_lease_report.v1`. Service payload forwarding remains
unavailable.
- C17Z17 node-side route generation tracker is implemented and
docker-test-runtime-proven: backend `c17z17.synthetic.v1` config and
node-agent `mesh_route_generation_report` track active/applied/unchanged/
withdrawn route decisions, generation changes, total counters, and
`withdrawn_by_replacement` records for stale relay paths when replacement is
first observed. Service payload forwarding remains unavailable.
- C17Z18 synthetic route-health effective path runtime is implemented and
docker-test-runtime-proven: backend `c17z18.synthetic.v1` config and
node-agent `mesh_route_health_config_report` apply Control Plane
`route_path_decisions` to synthetic route-health route config only. The
synthetic runtime probes selected effective paths through replacement relays,
reports expected/observed hops and drift state, and backend latest mesh links
preserve route-health observations separately from connection-manager
observations. Service payload forwarding remains unavailable.
- C17Z19 synthetic route-health feedback scoring is implemented and
docker-test-runtime-proven: backend consumes recent `synthetic_route_health`
observations in relay scoring, uses drift/unreachable/failure metadata to
mark the exact selected relay stale, boosts healthy low-latency relay
candidates, and returns replacement leases/route decisions through the
existing synthetic config contract. Migration `000022` adds the `synthetic`
mesh service class. Service payload forwarding remains unavailable.
- C17Z20 node-side route-health feedback refresh is implemented and
docker-test-runtime-proven: after reporting synthetic route-health
drift/unreachable/failure, node-agent performs a bounded node-scoped
synthetic-config refresh, applies returned replacement route decisions to
route-health config immediately, and reports
`c17z20.mesh_route_health_feedback_refresh_report.v1`. Service payload
forwarding remains unavailable.
- C17Z21 offsite control-plane bootstrap relay and Windows updater foundation
are implemented and docker-test/runtime-proven: backend exposes
`/mesh/v1/health` through the admin/nginx control-plane origin and issues
control-plane-only bootstrap rendezvous leases for outbound-only nodes using
their reported public control-plane URL. Remote Windows node
`ifcm-rufms-s-mo1cr` resolved 3/3 peers to `relay_ready` through
`http://94.141.118.222:19191`, while service/RDP/VPN payload forwarding
remains disabled. Release `0.1.3` is published for Docker and Windows
`windows_service` artifacts, and `install-windows` now installs a
per-node Scheduled Task updater for future Windows node-agent updates.
- C17Z22 updater observability and Windows host-agent self-update staging are
implemented and test-proven: `rap-host-agent` reports `phase=plan`,
`status=noop` for already-current/no-op plans, update state is scoped per
product so `rap-node-agent` and `rap-host-agent` do not overwrite each
other's current version, and the Windows updater wrapper runs short
one-shot cycles that can apply staged `rap-host-agent.exe.next` before the
next update check. Release `rap-host-agent 0.1.3` is published for
`linux_binary` and `windows_binary`; Docker updater containers on
`test-1/2/3` report no-op plans.
- Installation Authority foundation is implemented: production requires strict
Product Root public key config, first-owner bootstrap uses signed Ed25519
activation manifests, `installation_authority` and signed
`platform_role_grants` are persisted, and strict platform-admin checks ignore
direct `users.platform_role` database edits without a valid signed grant.
Web-admin exposes installation status/first-owner bootstrap, and
`scripts/installation/product-root-tool.go` generates keys/manifests for
offline product-root operations.
- Cluster Authority and node enrollment bootstrap are docker-test lifecycle
smoke-proven in run `dev-bootstrap-20260428-201430`: a fresh dev install
bootstrapped the first owner, created a cluster, issued a signed join token,
accepted real `rap-node-agent` enrollment, owner-approved the join request,
agent-polled signed bootstrap, persisted cluster authority pin, heartbeated,
and verified signed `c17z18.synthetic.v1` Control Plane config. Production
service payload forwarding remains unavailable.
- Migration `000021_cluster_authority_keys` drops/recreates
`cluster_admin_summaries` because fresh replay proved PostgreSQL cannot
change that view layout via `CREATE OR REPLACE VIEW`.
- `rap-node-agent` desired-workload polling/status reporting is gated by
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
supervision remains a stub.
- C18 VPN/IP tunnel service target design is completed as documentation only.
- C18A VPN/IP tunnel control-plane data model foundation is implemented and
backend-test-proven.
- C18B VPN/IP tunnel lease/fencing hardening is implemented and
backend-test-proven.
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is
implemented and backend-test-proven.
- No next platform-core implementation step is automatically authorized after
C17Z20. The next mesh layer should stay limited to route-health feedback
refresh dampening/no-change cooldown unless the user explicitly chooses
another staged task.
- Latest RDP performance reference image:
`rap-rdp-worker:rdp-perf6-dirty-region`
- Stage 5.2 file-download runtime artifacts remain preserved for when RDP work
resumes, but they are not the active next task.
- Do not use `docker.cin.su` for this project unless explicitly requested for a separate one-off check.
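
Several of the C17 boundaries above are simple numeric guards: the `4096`-byte payload bound (C17T), the one-minute `created_at` skew bound (C17U), the `10000` sink capacity guard (C17R), drop-oldest retention (C17L), and aggregate-only metrics (C17N). The following Go sketch shows how they might fit together. All type and function names are hypothetical and are not taken from `rap-node-agent`; only the numeric limits come from the text above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxSinkCapacity = 10000       // upper bound from the C17R capacity guard
	maxPayloadBytes = 4096        // payload bound from C17T
	maxFutureSkew   = time.Minute // created_at skew bound from C17U
)

// Observation is a hypothetical metadata-only record; it never holds a payload body.
type Observation struct {
	RouteID     string
	CreatedAt   time.Time
	PayloadSize int
}

// CheckEnvelope rejects oversized or future-dated envelopes before observation.
func CheckEnvelope(payloadLen int, createdAt, now time.Time) error {
	if payloadLen > maxPayloadBytes {
		return errors.New("payload exceeds 4096 bytes")
	}
	if createdAt.After(now.Add(maxFutureSkew)) {
		return errors.New("created_at is more than one minute in the future")
	}
	return nil
}

// Sink retains at most capacity observations and drops the oldest when full.
// It is not safe for concurrent use; a real sink would need locking.
type Sink struct {
	capacity      int
	records       []Observation
	accepted      uint64
	droppedOldest uint64
}

// NewSink enforces the 1..10000 capacity range. A capacity of zero means the
// sink is disabled, so this sketch refuses to build one for it.
func NewSink(capacity int) (*Sink, error) {
	if capacity <= 0 || capacity > maxSinkCapacity {
		return nil, fmt.Errorf("sink capacity %d outside 1..%d", capacity, maxSinkCapacity)
	}
	return &Sink{capacity: capacity, records: make([]Observation, 0, capacity)}, nil
}

// Accept stores one observation, evicting the oldest entry if the sink is full.
func (s *Sink) Accept(o Observation) {
	if len(s.records) == s.capacity {
		copy(s.records, s.records[1:])
		s.records = s.records[:len(s.records)-1]
		s.droppedOldest++
	}
	s.records = append(s.records, o)
	s.accepted++
}

// Metrics exposes aggregate counters only, never the records themselves.
type Metrics struct {
	Capacity           int
	Depth              int
	AcceptedTotal      uint64
	DroppedOldestTotal uint64
}

// Metrics returns the current aggregate counters.
func (s *Sink) Metrics() Metrics {
	return Metrics{s.capacity, len(s.records), s.accepted, s.droppedOldest}
}

func main() {
	sink, err := NewSink(2)
	if err != nil {
		panic(err)
	}
	now := time.Now()
	for _, id := range []string{"route-a", "route-b", "route-c"} {
		if CheckEnvelope(128, now, now) == nil {
			sink.Accept(Observation{RouteID: id, CreatedAt: now, PayloadSize: 128})
		}
	}
	fmt.Printf("%+v\n", sink.Metrics())
}
```

Validation runs before `Accept`, which matches the rule that rejected envelopes are never observed.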
### Backend
- Go
- PostgreSQL = source of truth
- Redis = live coordination / routing only
- REST for control plane
- WebSocket for live session channel
### Worker
- C++ worker
- FreeRDP integration
- worker runtime hides FreeRDP details from backend
- The C++ worker remains the primary RDP runtime.
- Target RDP performance direction: `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`.
- The RDP performance rewrite scope is limited to C++ RDP service adapter
internals. It must not redesign backend control plane, cluster transport,
organizations, leases, or session lifecycle.
- The C# RDP service skeleton is inactive research scaffolding and is not the
current runtime direction.
- Current RDP Adapter baseline: RDP-Perf-6 dirty-region direct binary rendering
is completed and smoke-proven on `docker-test`. RDP work is paused by product
decision; next active work is Fabric Core / cluster foundation.
- P3/P3.1 security-readiness foundation exists: production mode rejects
plaintext credential-like resource metadata, requires `secret_ref` for
RDP/VNC/SSH resources, and has an encrypted PostgreSQL-backed resource secret
storage/resolver MVP. P3.2 direct-worker TLS/PKI guard exists.
- P3.3 production-like test-stand smoke is complete on `docker-test`: backend
runs in `APP_ENV=production` with a test-only secret key file, a secret-backed
RDP resource starts real sessions through the resolver path, metadata/audit do
not contain plaintext credentials, and backend gateway fallback remains
available when direct worker WSS trust is `smoke_insecure`.
- P3.4 production direct-worker WSS trust model is documented in
`docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`; it defines
platform CA/public CA behavior, worker certificate SAN/identity requirements,
app-local Windows trust direction, rotation/revocation, and the future
`platform_ca` smoke plan. No RDP runtime behavior changed in P3.4.
- P3.5 app-local platform CA trust is implemented and runtime-proven on
`docker-test`: Windows client validates direct worker WSS with an app-local
platform CA bundle, keeps hostname/SAN validation enabled, selects
`direct_worker_wss` without insecure TLS bypass, and falls back to backend
gateway for unknown CA / smoke-only production cases.
- P3.6 stale Redis worker/live event idempotency is implemented and
runtime-proven: stale worker events for terminal PostgreSQL sessions are
ignored, backend restart survives stale Redis events, and terminal sessions
are not reopened.
- Stage 5.2 server-to-client file download core data path is runtime-proven:
direct worker WSS and backend gateway fallback both download text/binary
files from `RAP_Transfers\ToClient` with matching size/hash, and direct
policy blocking is proven for `disabled` and `client_to_server`. Lifecycle
blocking is also runtime-proven for detach, old-client takeover, and worker
failure. Runtime report:
`artifacts/stage5-2-file-download-runtime-report.md`.
- Stage 5.2 is not fully accepted yet. Remaining proof: Windows desktop UI
download path and regression matrix for rendering/input/clipboard/upload/
reconnect/takeover.
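
The P3.6 idempotency rule above (stale worker events for terminal PostgreSQL sessions are ignored, and terminal sessions are never reopened) can be sketched in Go as follows. The state names and `ShouldApplyWorkerEvent` are hypothetical; the real session state model is not shown in this document.

```go
package main

import "fmt"

// SessionState values are illustrative; the real set of states may differ.
type SessionState string

const (
	StateActive     SessionState = "active"
	StateDetached   SessionState = "detached"
	StateTerminated SessionState = "terminated"
	StateFailed     SessionState = "failed"
)

// IsTerminal reports whether a persisted session can no longer change state.
func IsTerminal(s SessionState) bool {
	return s == StateTerminated || s == StateFailed
}

// ShouldApplyWorkerEvent decides whether a live event read from Redis may
// mutate a session. The persisted PostgreSQL state is authoritative: once it
// is terminal, late or replayed worker events are dropped.
func ShouldApplyWorkerEvent(persisted SessionState) bool {
	return !IsTerminal(persisted)
}

func main() {
	for _, s := range []SessionState{StateActive, StateDetached, StateTerminated, StateFailed} {
		fmt.Printf("%-10s apply=%v\n", s, ShouldApplyWorkerEvent(s))
	}
}
```

Checking the persisted state first keeps Redis in its coordination-only role: a stale event can be lost or replayed without changing what PostgreSQL records.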
### Clients
- future native clients:
  - Windows: native desktop client first
  - Linux: native desktop client later
- web UI is admin/control plane, not the primary power-user client
## Final architecture direction
The long-term target architecture is documented in:
- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`
- `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md`
- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md`
`SECURE_ACCESS_FABRIC_TARGET.md` defines the target Secure Access Fabric architecture only. It is not the current implementation scope and must not be used as permission to start mesh, VPN, multi-cluster, updater, or realtime data-plane migration work without an explicit staged prompt.
`CLUSTER_NODE_ADMIN_FOUNDATION.md` defines the next platform-core planning
baseline for clusters, node enrollment, native node-agent identity, platform
admin console, multi-cluster administration, and future organization admin
visibility. It is a staged foundation document, not permission to implement
mesh packet routing or VPN runtime.
`WEB_INGRESS_AND_ADMIN_UI_MODEL.md` defines WEB as HTTP/HTTPS ingress and
Admin UI presentation only. Cluster configuration remains Control Plane
ownership through scoped APIs, PostgreSQL source-of-truth mutations, and audit.
Dynamic pages must be safe schema-driven projections and must not embed
internal topology, peer caches, route caches, secrets, raw credentials, or
arbitrary executable code.
Admin endpoint placement is explicit. Fabric Storage / Config Storage nodes do
not automatically host or move the cluster panel. Platform Owner Console
remains global platform-owner scope. Cluster Admin Endpoint requires explicit
admin/web ingress role assignment, cluster health/trust readiness, and Control
Plane authorization. Organization Admin Panel remains a tenant-safe projection.
The final platform must support:
1. Multi-tenancy / Organizations
   - platform has many organizations
   - each organization has isolated users, groups, resources, policies, audit, connectors
   - users may belong to multiple organizations
   - organization admins only see their organization
   - platform admins see platform scope
2. Identity federation
   - local users
   - LDAP / Active Directory
   - OIDC
   - future extensibility for more identity sources
   - access mappings based on external groups / claims
3. Cluster of nodes
   - no mandatory single central node
   - many nodes across many sites
   - nodes can be platform-managed or customer-managed
   - customer-managed nodes are sandboxed cluster participants, not full cluster owners
4. Node agent
   - small stable always-running agent on every node
   - supervises services
   - downloads updates
   - verifies signed artifacts
   - can roll back to the previous version
   - can restart crashed services
   - can work on thin or thick nodes
5. Service-based node model

   Each node is not monolithic. A node has:
   - capabilities: what it can do physically/technically
   - enabled services: what it is allowed/assigned to do

   Possible services include:
   - ingress-gateway
   - mesh-router
   - relay
   - connector-host
   - vpn-adapter
   - session-worker
   - media-relay
   - file-relay
   - update-cache
   - config-replica
   - audit-sink
   - metrics-exporter
6. Cluster mesh and routing
   - encrypted inter-node communication
   - dynamic topology
   - no need for full mesh
   - multi-hop routing allowed
   - route failover
   - client failover between ingress nodes
   - connector failover between nodes
7. Split-brain prevention
   - quorum-based cluster behavior
   - minority partition must not become a second authoritative cluster
   - degraded / recovery / isolated modes
   - manual recovery / promote decision by platform recovery admin
8. Connector / VPN layer
   - connectors are reusable network access methods
   - one connector may be used by multiple resources
   - connector placement and failover are controlled by policy
   - nodes may be allowed or disallowed to host connectors
   - direct access, VPN, relay and future egress modes must fit this model
9. Future exit mode
   - split tunnel
   - full tunnel
   - internet access through cluster
   - not first implementation priority
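
The split-brain requirement in item 7 can be made concrete with a small Go sketch. It assumes a strict-majority quorum over a fixed set of voting nodes. This document does not specify the actual quorum rule, so the rule, `HasQuorum`, and `PartitionMode` are all illustrative assumptions.

```go
package main

import "fmt"

// HasQuorum reports whether a partition that can reach reachableVoters out of
// totalVoters voting nodes holds a strict majority. With an even split
// neither side has quorum, so no second authoritative cluster can form.
func HasQuorum(reachableVoters, totalVoters int) bool {
	return totalVoters > 0 && reachableVoters > totalVoters/2
}

// PartitionMode maps the quorum result onto two of the modes named above.
// The mode strings are hypothetical labels.
func PartitionMode(reachableVoters, totalVoters int) string {
	if HasQuorum(reachableVoters, totalVoters) {
		return "authoritative"
	}
	return "degraded"
}

func main() {
	fmt.Println(PartitionMode(3, 5)) // majority side of a 3/2 split
	fmt.Println(PartitionMode(2, 5)) // minority side of the same split
	fmt.Println(PartitionMode(2, 4)) // even split: neither side is authoritative
}
```

A minority partition staying in a degraded mode until a platform recovery admin makes an explicit promote decision is consistent with the manual recovery step listed in item 7.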
## Non-negotiable design rules
- Do not rewrite proven session lifecycle carelessly.
- Do not turn Redis into a source of truth.
- Do not make certificate-ignore a global worker setting.
- Do not make customer-managed nodes platform-wide trusted by default.
- Do not create a separate cluster per organization.
- Do not assume a single permanently reachable central node.
- Do not rely on “secret protocol with no docs” as security.
- Security must come from crypto, auth, isolation, policy and observability.
- Prefer incremental evolution from current proven system.
- Do not collapse platform control plane and data plane into one vague layer.
## Implementation strategy
The codebase must evolve in phases.
Current implementation focus remains:
- RDP work is paused by product decision
- preserve the accepted RDP Adapter baseline and Stage 5.x file-transfer work
- do not delete or rewrite the current RDP MVP while platform-core work starts
- C1-C9 platform-core foundations are implemented and verified: clusters,
node enrollment, node-agent scaffold, platform admin console, workload
supervision contract, mesh control-plane prep, mesh skeleton, multi-cluster
hardening, and organization admin foundation
- C10 Fabric Core configuration distribution design is completed
- C11 signed scoped cluster snapshot model is completed
- C12 node local state store is completed
- C13 Fabric Storage / Config Storage service foundation is completed
- C14 peer directory and cache model is completed
- C15 Fabric Routing Engine skeleton is completed
- C16 secure node-to-node channel lifecycle is completed
- C17 mesh routing runtime implementation plan is completed
- C17A synthetic mesh runtime skeleton is implemented and test-proven with
synthetic fabric messages only, no RDP/VPN/production service traffic
- C17B route health and failover probes are implemented and test-proven with
synthetic traffic only, no RDP/VPN/production service traffic
- C17C relay semantic hardening is implemented and test-proven with synthetic
channel classes only, no RDP/VPN/production service traffic
- C17D non-production test-service path is implemented and test-proven with
bounded `synthetic.echo` traffic only, no RDP/VPN/production service traffic
- C17E live node-to-node synthetic HTTP transport is implemented and
smoke-proven with synthetic traffic only
- C17F scoped synthetic route config loading and route-health reporting is
implemented and smoke-proven with synthetic traffic only
- C17G Control Plane scoped synthetic config read/consume is implemented and
test-proven with synthetic traffic only
- C17H deployed multi-agent synthetic config smoke is implemented and
runtime-proven on `docker-test` with synthetic traffic only
- C17I production forwarding gate foundation is implemented and test-proven;
production forwarding remains unavailable
- C17J production envelope contract validation is implemented and test-proven;
production forwarding remains unavailable
- C17K production envelope observation is implemented and test-proven;
production forwarding remains unavailable
- C17L bounded production observation sink is implemented and test-proven;
production forwarding remains unavailable
- C17M production observation sink wiring is implemented and test-proven;
production forwarding remains unavailable
- C17N production observation sink metrics are implemented and test-proven;
production forwarding remains unavailable
- C17O production observation sink local metrics logging is implemented and
test-proven; production forwarding remains unavailable
- C17P production observation sink change-driven metrics logging is implemented
and test-proven; production forwarding remains unavailable
- C17Q production forwarding gate/runtime log boundary is implemented and
test-proven; production forwarding remains unavailable
- C17R production observation sink capacity guard is implemented and
test-proven; production forwarding remains unavailable
- C17S production observation panic fail-closed hardening is implemented and
test-proven; production forwarding remains unavailable
- C17T production envelope payload boundary is implemented and test-proven;
production forwarding remains unavailable
- C17U production envelope created-at skew boundary is implemented and
test-proven; production forwarding remains unavailable
- C17V peer endpoint candidate model and NAT/connectivity hints are
implemented and test-proven; production forwarding remains unavailable
- C17W peer endpoint candidate scoring model is implemented and test-proven;
production forwarding remains unavailable
- C17X health-aware endpoint candidate scoring overlay is implemented and
test-proven; production forwarding remains unavailable
- C17Y Platform Owner synthetic mesh visibility is implemented and
build/test-proven; production forwarding remains unavailable
- C17Z production fabric-control direct forwarding is implemented and
test-proven; production service traffic remains unavailable
- C17Z1 production fabric-control multi-hop route-path forwarding is
implemented and test-proven; production service traffic remains unavailable
- C17Z2 production fabric-control forwarding observability is implemented and
test-proven; production service traffic remains unavailable
- C17Z3 production fabric-control route-config boundary is implemented and
test-proven; production service traffic remains unavailable
- C17Z4 scoped peer directory/recovery seed boundary is implemented and
test/build-proven; production service traffic remains unavailable
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven;
production service traffic remains unavailable
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven;
production service traffic remains unavailable
- C17Z7 private/corporate endpoint candidate boundary is implemented and
test-proven; production service traffic remains unavailable
- C17Z8 peer connection state machine boundary is implemented and test-proven;
production service traffic remains unavailable
- C17Z9 peer recovery planner boundary is implemented and test-proven;
production service traffic remains unavailable
- C17Z10 peer connection intent planner boundary is implemented and
test-proven; production service traffic remains unavailable
- C17Z11 peer connection manager runtime boundary is implemented and
test-proven; production service traffic remains unavailable
- C17Z12 rendezvous/relay control-plane contract is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z13 rendezvous lease telemetry is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z14 rendezvous lease refresh contract is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z15 backend relay replacement policy is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z16 route/path decision artifact is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z17 node-side route generation tracker is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z18 synthetic route-health effective path runtime is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z19 synthetic route-health feedback scoring is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z20 node-side route-health feedback refresh is implemented and
docker-test-runtime-proven; production service traffic remains unavailable
- C17Z21 node installation/update control-plane is implemented and
docker-test-runtime-proven for Docker nodes; production service traffic
remains unavailable
- C17Z22 Windows host-agent install/update supervision is implemented and
runtime-proven on the remote Windows node; production service traffic remains
unavailable
- C17Z23 update observability is implemented in backend/admin UI: per-node
updater status history is exposed and deployed on docker-test, so node-agent
and host-agent update activity can be audited from node details
- C17Z24 combined updater reporting is implemented and docker-test-proven:
Linux/Docker `rap-host-agent update-loop` now also polls/reports
`rap-host-agent` status, release `0.1.4` is published for node-agent and
host-agent artifacts, and docker-test nodes `test-1/2/3` auto-updated to
node-agent `0.1.4` while reporting host-agent `0.1.4` no-op status.
- C17Z25 Windows updater repair visibility is implemented in admin UI: node
details / Updates now shows a ready CMD repair command for existing Windows
nodes using `http://vpn.cin.su:19191/api/v1`, `--replace`, and
`--auto-update-current-version 0.0.0` so a stale updater wrapper can be
recreated without a new join token.
- C17Z26 updater fleet visibility is implemented in admin UI: the node list now
shows per-node updater status based on latest `rap-node-agent` and
`rap-host-agent` reports, explicitly flagging missing host-agent reports,
stale update reports, or update errors before opening node details.
- C17Z27 backend version-state projection is implemented and deployed on
docker-test: node list responses now derive `version_state` from active
`rap-node-agent` desired policy plus latest update report. Docker/Linux nodes
on `0.1.4` show `current`; the remote Windows node still on `0.1.3` shows
`outdated` while remaining heartbeat-healthy.
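
The C17Z27 projection can be pictured as a small pure function. The following is a hypothetical Python sketch, not the backend code: the function name, argument names, and the `unknown` label are assumptions, and it treats any version mismatch as `outdated`. Only `current` and `outdated` come from the text above.

```python
from typing import Optional


def project_version_state(desired_version: Optional[str],
                          reported_version: Optional[str]) -> str:
    """Derive a version_state label from desired policy and the latest report.

    Hypothetical sketch: 'unknown' is an assumed label for missing inputs.
    """
    if not desired_version or not reported_version:
        return "unknown"
    return "current" if reported_version == desired_version else "outdated"


# Mirrors the rollout above: Docker/Linux nodes on 0.1.4, Windows on 0.1.3.
states = {
    "test-1": project_version_state("0.1.4", "0.1.4"),
    "windows": project_version_state("0.1.4", "0.1.3"),
}
```

Heartbeat health is deliberately not an input here, which matches the note that an `outdated` node can still be heartbeat-healthy.
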
- C17Z28 Windows updater loop hardening is implemented and partially
docker-test-proven via release `0.1.5`: Windows host-agent updater scripts now
run combined `update-loop --max-runs 1`, and Windows `update-loop` also
polls/applies `rap-host-agent` updates. Release `0.1.5` artifacts are
published for Docker/Linux and Windows; docker-test nodes `test-1/2/3`
  updated to `rap-node-agent 0.1.5`. Existing remote Windows nodes with a stale
  pre-0.1.5 updater wrapper still require one repair command from the admin UI
  to replace their local wrapper, after which automatic polling should continue.
- Admin UI now marks missing host-agent updater reports as `repair updater` in
the node list and explains in node details / Updates when to run the Windows
repair command. The command uses the external control-plane endpoint and does
not require a join token for already enrolled Windows nodes.
- Admin UI node details / Updates also provides a ready downloadable
`rap-repair-updater-<node>.cmd` plus copy-command action for Windows repair,
reducing operator copy/paste mistakes on remote Windows hosts.
- Windows repair command generation was hardened after the first remote repair:
foreground `update-loop` now includes explicit `--node-id`, copies any staged
`rap-host-agent.exe.next` over the main host-agent binary after the one-shot
loop exits, deletes the staged file, and runs the updater scheduled task.
The node list now distinguishes `host-agent staged` from generic stale/error.
- C17Z29 Windows persistent updater repair is implemented in `rap-host-agent`
release `0.1.6`: `install-windows` accepts `--node-id` and writes that node
id into the persistent Windows updater wrapper so Scheduled Task polling no
longer depends on finding `identity.json` in the expected state directory.
Docker-test nodes `test-1/2/3` updated to `0.1.6`; existing Windows and
off-host Docker nodes still need their local updater wrappers to pick up the
0.1.6 host-agent repair path.
- C17Z30 operator-configured public mesh endpoints are implemented and
docker-test-deployed: desired `mesh-listener.advertise_endpoint` is now
projected into peer endpoint candidates for other nodes and preferred over
auto-discovered private heartbeat endpoints. `home-1`
(`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`) is configured with
`listen_addr=0.0.0.0:19131`, `advertise_endpoint=http://94.141.118.222:19199`,
`connectivity_mode=direct`, `nat_type=port_restricted`, `region=home`.
`test-1` synthetic config now receives `home-1` peer endpoint
`http://94.141.118.222:19199`; internal `192.168.200.85:19131` responds with
HTTP 405 on GET, while external `94.141.118.222:19199` currently refuses TCP,
so router/firewall forwarding still needs correction outside the platform.
- C17Z31 offsite bootstrap peer selection is implemented and
  docker-test-deployed: operator-configured public/direct desired
  mesh-listener endpoints are kept in core-mesh bootstrap even after the
  default warm-peer target is
reached. This fixes the case where remote Windows node
`ifcm-rufms-s-mo1cr` received only `test-*` warm peers and no `home-1`.
Its synthetic config now includes `home-1` endpoint
`http://94.141.118.222:19199` and candidates ordered as operator public,
heartbeat advertised public, then private LAN converted to relay-required for
offsite. External TCP to `94.141.118.222:19199` still failed from Codex and
docker-test checks while internal `192.168.200.85:19131` succeeds, so a real
offsite `Test-NetConnection 94.141.118.222 -Port 19199` is the next network
validation.
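
The candidate ordering described for C17Z31 (operator public, then heartbeat advertised public, then private LAN converted to relay-required for offsite) could look roughly like this. It is an illustrative Python sketch only: the `EndpointCandidate` type, the source labels, the `relay_required` spelling, and the `203.0.113.10` address are assumptions and do not come from the platform code.

```python
from dataclasses import dataclass, replace
from typing import List

# Assumed source labels; a lower rank sorts first.
SOURCE_RANK = {"operator_public": 0, "heartbeat_public": 1, "private_lan": 2}


@dataclass(frozen=True)
class EndpointCandidate:
    url: str
    source: str
    connectivity_mode: str = "direct"


def order_candidates(candidates: List[EndpointCandidate],
                     offsite: bool) -> List[EndpointCandidate]:
    """Sort by source rank; for offsite peers mark private LAN as relay-required."""
    ordered = sorted(candidates, key=lambda c: SOURCE_RANK[c.source])
    if not offsite:
        return ordered
    return [replace(c, connectivity_mode="relay_required")
            if c.source == "private_lan" else c
            for c in ordered]


candidates = [
    EndpointCandidate("http://192.168.200.85:19131", "private_lan"),
    EndpointCandidate("http://203.0.113.10:19131", "heartbeat_public"),  # hypothetical
    EndpointCandidate("http://94.141.118.222:19199", "operator_public"),
]
ordered = order_candidates(candidates, offsite=True)
```
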
- C17Z32 native Ubuntu/Linux service install is implemented and
  docker-test-deployed: backend exposes `/node-agents/linux-install-profile`,
  and host-agent supports `install-linux`, which installs `rap-node-agent`
  under `/opt/rap/<node>`, keeps state under `/var/lib/rap/nodes/<node>` and
  config under `/etc/rap/<node>`, creates `rap-node-agent-<node>.service`, and
  creates a persistent `rap-host-agent-updater-<node>.service` for automatic
  node-agent and host-agent updates. Release `0.1.7` is published for
  `rap-node-agent`
(`linux_binary`, `windows_service`) and `rap-host-agent`
(`linux_binary`, `windows_binary`). Admin UI now has an `Ubuntu service`
install profile and generates profile-based `install-linux` commands.
A one-use token for `vps-ubuntu-1` is active until 2026-05-02T08:41:41Z:
`rap_join_a23Xhz63YstshWUBAPGPz5fzQ8YpHDP05RXaaYa4DoA`; scope roles are
`core-mesh` and `relay-node`, control-plane endpoint is
`http://vpn.cin.su:19191/api/v1`, artifact endpoint is
`http://vpn.cin.su:19191/downloads`.
- Admin UI and docs now cover the full Windows updater operational workflow:
node details shows an `Updater health` summary, generated repair CMD prints
scheduled-task and binary diagnostics before/after repair, applies staged
host-agent binaries, restarts the updater task, and README documents first
install, repair without join-token, system-task/user-task behavior, staged
host-agent recovery, and reboot/autostart verification.
- Cluster Authority plus node enrollment bootstrap polling are docker-test
lifecycle-smoke-proven; fresh install migration replay is fixed for
`cluster_admin_summaries`
- C18 VPN/IP tunnel service target design is completed as documentation only
- C18A VPN/IP tunnel control-plane data model foundation is implemented and
backend-test-proven
- C18B VPN/IP tunnel lease/fencing hardening is implemented and
backend-test-proven
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is
implemented and backend-test-proven
- Version Storage / Update Repository is documented as a future Fabric Core
service for signed release manifests, OS/arch artifacts,
stable/current/candidate channels, update-cache mirroring, node-agent
update supervision, rollback, and explicit data-structure migration bundles.
Runtime updater behavior is partially implemented for the current Docker and
Windows node-agent/host-agent paths; broader staged rollout policy and
service payload forwarding remain separate work.
- no next platform-core implementation step is automatically authorized after
C17Z20; choose the next narrow staged prompt explicitly before continuing
- preserve the proven RDP lifecycle behavior
- keep the current backend gateway available as the active/fallback implementation path
- accepted VPN data-plane target: the phone/client connects only to an
available entry node; the entry node uses the existing mesh/fabric route to a
selected exit node/pool, and the exit node handles LAN/internet egress. Nodes
behind NAT may participate when they can maintain outbound mesh/control
sessions. Backend packet relay must remain a compatibility/fallback path, not
the desired steady-state path.
- C18D VPN-over-fabric foundation is implemented and docker-test-started:
VPN client profiles include `vpn_fabric_route` with entry pool, exit pool,
selected entry/exit, preferred `fabric_mesh` data-plane, and
`backend_relay` fallback. Node-agent `0.2.39` adds a dedicated production
`vpn_packet` channel (`vpn.packet_batch`, 256 KiB batch limit), destination
delivery hook, `vpnruntime.FabricPacketTransport`, and
`vpn_fabric_packet_transport` heartbeat capability. `home-1` auto-updated to
`0.2.39`; other nodes have automatic desired policy `0.2.39` and should move
as their updater loops pick it up. Live Android VPN traffic still uses backend
relay until entry-node client ingress is wired to the fabric transport.
- C18E VPN-over-fabric route contract is backend-deployed on docker-test as
`rap-backend:test-vpn-fabric-route-0.2.41`: when a VPN client profile selects
different entry and exit nodes, backend now ensures two active
`mesh_route_intents` with service_class `vpn_packets` and allowed channel
`vpn_packet`. The live HOME profile currently selects `usa-los-1` as entry
and `home-1` as exit when `entry_node_id=b829ffde-...` is requested, and the
synthetic config for both nodes includes the two `vpn_packet` routes. Existing
fallback remains `backend_relay`; production forwarding gate is still disabled
on old/live remote nodes until their runtime is explicitly updated/enabled.
- External/offsite updater gap found and fixed for version `0.2.40`: native
`rap-node-agent` binaries for `linux_binary`, `linux_service`, and
`windows_service` plus matching `rap-host-agent` binaries are copied under
`/downloads` and registered in channel `dev-external`. Update plans for
`usa-los-1` (`linux_binary`) and `ifcm-rufms-s-mo1cr` (`windows_service`) now
return `action=update`, `target_version=0.2.40` instead of
`no_matching_artifact`.
- C18F production-forwarding gate work is partially live: backend
`rap-backend:test-vpn-fabric-route-0.2.42` signs node synthetic configs with
`production_forwarding=true` / `control_plane_only=false` when the node's
desired `mesh-listener` workload has `production_forwarding_enabled=true`.
`home-1` and `usa-los-1` desired mesh-listener configs have this flag enabled.
Node-agent `0.2.44` accepts signed production-forwarding mesh configs and
host-agent `0.2.44` fixes Docker updater behavior so synthetic mesh runtime is
not disabled on Docker updates. Runtime status: `usa-los-1` reports
`mesh_production_forwarding=true`; `home-1` reports `0.2.44` and synthetic
runtime enabled, but its listener report is still `disabled/listen_addr_empty`,
so `home-1` is not yet a usable production fabric endpoint. Next action is to
repair why `home-1` is not applying the signed mesh-listener config
(`listen_addr=0.0.0.0:19131`) after Docker updater restart.
- C18G VPN-over-fabric runtime path is live-tested on docker-test. Backend is
deployed as `rap-backend:test-vpn-fabric-route-0.2.43`; VPN route intents now
allow both `vpn_packet` data and `fabric_control` health probes. Node-agent
`0.2.47` fixes initial production VPN packet envelope hop addressing and
reports the matching version. `home-1` and `usa-los-1` both report
`0.2.47`, healthy, listener `0.0.0.0:19131`, and
`mesh_production_forwarding=true`. Live route health is reachable in both
directions (`usa-los-1 -> home-1` around 200 ms, `home-1 -> usa-los-1`
around 200-415 ms). A direct live POST to
`http://195.123.240.88:19131/api/v1/clusters/.../vpn-connections/.../tunnel/client/packets`
returns `202 Accepted`, proving entry-node VPN packet ingress can forward
over fabric to the home exit. The HOME VPN placement policy now has entry
pool `[usa-los-1, home-1]` and exit `home-1`; client profile with preferred
`usa-los-1` selects `usa-los-1 -> home-1`.
- C18H live VPN triage on 2026-05-04: `home-1` and `usa-los-1` report
node-agent `0.2.48`, healthy heartbeats, active HOME VPN assignment on
`home-1`, and `packet_forwarding=true` / `runtime_available=true`. Manual
  packet tests through the USA entry proved that the path
  Android-style packet -> `usa-los-1` -> fabric -> `home-1` -> LAN/DNS ->
  fabric -> `usa-los-1` -> client can return ICMP and DNS replies. The remaining
live symptom was the phone not sending fresh packets to the current entry
after the backend relay queue was cleared. Android VPN app `0.2.59` was built
and published to `/downloads/rap-android-rdp-vpn-latest-debug.apk`; it
normalizes old saved backend URLs (`vpn.cin.su:19191`,
`94.141.118.222:19191`, `192.168.200.61:18080`, etc.) to the current USA
entry backend `http://195.123.240.88:19131/api/v1` and shows app version,
device id, and connection id in the header for live log correlation.
- C18I fabric service-channel foundation is live on 2026-05-07. Backend,
node-agent, and Android VPN release `0.2.159` are published. VPN profiles now
include a signed `rap.fabric_service_channel_lease.v1` with
`entry_direct_http_v1` packet and WebSocket templates. Android consumes this
lease and sends service-channel headers. The `usa-los-1` entry endpoint
validates the cluster-authority signed lease payload and token hash; a live
smoke through `http://195.123.240.88:19131/.../fabric/service-channels/...`
succeeded with a valid lease and rejected a bad token with `403`. Current HOME
profile selects `usa-los-1` as entry and `home-1` as exit; both nodes report
`0.2.159`. Docker-test nodes `test-1`, `test-2`, and `test-3` also report
`0.2.159`. `ifcm-rufms-s-mo1cr` is still on `0.2.119`; it has staged the
host-agent `0.2.159` update and should finish on the next Windows updater
loop/restart.
- C18J fabric service-channel runtime route-manager slice is live on
2026-05-07 as node/host-agent `0.2.162`. The entry-node
`FabricClientPacketIngress` now preserves its runtime object across synthetic
config refreshes, so heartbeat telemetry reports the same ingress object that
serves HTTP/WebSocket service-channel traffic. It tracks send/receive batches,
route attempts/failures, selected route/next hop, local-gateway fallback, and
inbox queue depths. `SendClientPacketBatch` now retries all valid
`vpn_packet` route candidates with sticky preference before backend relay is
allowed as degraded compatibility fallback. Release `0.2.161` was superseded
because its Docker tar was rebuilt after registration; `0.2.162` is the
clean published release with matching artifact hashes. Docker-test
`test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.162`;
`home-1` is healthy and still on `0.2.161` awaiting its next updater loop.
Live smoke through `http://195.123.240.88:19131/.../fabric/service-channels`
returned `202` and `usa-los-1` telemetry then showed route attempts,
one route failure, and selected next hop `home-1`, proving live ingress
telemetry and alternate-route retry are active.
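
The C18J retry behavior (try every valid route candidate with sticky preference, and only then allow backend relay as degraded fallback) can be sketched as below. This is a simplified Python illustration under stated assumptions: `send_with_retry`, the `try_send` callback, and the route ids are hypothetical names, not the `SendClientPacketBatch` implementation.

```python
from typing import Callable, List, Optional, Tuple


def send_with_retry(candidates: List[str],
                    sticky: Optional[str],
                    try_send: Callable[[str], bool]) -> Tuple[str, Optional[str], int]:
    """Try the sticky route first, then the remaining candidates in order.

    Returns (path, route_id, failures); path is 'fabric_mesh' when a route
    accepted the batch, or 'backend_relay' when every candidate failed.
    """
    ordered = ([sticky] if sticky in candidates else []) + \
              [c for c in candidates if c != sticky]
    failures = 0
    for route_id in ordered:
        if try_send(route_id):
            return "fabric_mesh", route_id, failures
        failures += 1
    return "backend_relay", None, failures


# One stale route fails, the next candidate succeeds.
result = send_with_retry(["route-stale", "route-home"], "route-stale",
                         lambda route_id: route_id == "route-home")
```

The returned failure count corresponds loosely to the route-failure telemetry mentioned above; a real runtime would also update the sticky route after a successful send.
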
- C18K service-neutral flow/channel scheduler is live on 2026-05-07 as
node/host-agent `0.2.163`. The VPN proving service still carries universal
IP packets and does not route by application protocol, but the entry runtime
now hashes packets by IP 5-tuple, or packet hash for non-IP/invalid packets,
into 32 logical `flow-*` channels. Each channel has bounded queue accounting,
high-watermark/backpressure/dropped telemetry, and batches are fanned out per
logical channel before being sent through the same fabric route-manager. Live
smoke against `usa-los-1` posted two different IP flows through the signed
service-channel endpoint and heartbeat reported `send_packets=2`,
`send_flow_batches=2`, `flow_scheduler.channel_count=2`, `enqueued=2`,
`dequeued=2`, `dropped=0`, with queue depths for `flow-12` and `flow-14`.
All six current cluster nodes (`home-1`, `usa-los-1`, `ifcm-rufms-s-mo1cr`,
`test-1`, `test-2`, `test-3`) report node-agent `0.2.163` and healthy.
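
To make the C18K flow mapping concrete, here is a minimal Python sketch of hashing a packet into one of 32 `flow-*` channels. It is not the node-agent code: the SHA-256 hash, the helper names, and the IPv4-only parsing are assumptions chosen for brevity, and any packet that is not well-formed IPv4 falls back to hashing the whole packet. The key is direction-sensitive, so request and reply packets may land in different channels.

```python
import hashlib
import socket
import struct

CHANNEL_COUNT = 32  # channel count taken from the text above


def flow_key(packet: bytes) -> bytes:
    """Return bytes identifying a flow: IPv4 5-tuple, else the whole packet."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return packet  # non-IPv4 or truncated: hash the packet itself
    ihl = (packet[0] & 0x0F) * 4
    if ihl < 20 or len(packet) < ihl:
        return packet
    proto = packet[9]
    src_dst = packet[12:20]
    ports = b"\x00\x00\x00\x00"
    if proto in (6, 17) and len(packet) >= ihl + 4:  # TCP or UDP ports
        ports = packet[ihl:ihl + 4]
    return bytes([proto]) + src_dst + ports


def channel_for(packet: bytes) -> str:
    digest = hashlib.sha256(flow_key(packet)).digest()
    index = int.from_bytes(digest[:4], "big") % CHANNEL_COUNT
    return f"flow-{index}"


def ipv4_udp(src: str, dst: str, sport: int, dport: int, payload: bytes = b"") -> bytes:
    """Build a minimal IPv4/UDP packet for demonstration (checksums left as zero)."""
    header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 28 + len(payload), 0, 0, 64, 17, 0,
                         socket.inet_aton(src), socket.inet_aton(dst))
    udp = struct.pack("!HHHH", sport, dport, 8 + len(payload), 0)
    return header + udp + payload
```

Packets that share a 5-tuple always map to the same channel regardless of payload, which is what keeps per-flow ordering inside one logical queue.
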
- C18L active flow scheduling telemetry is live on 2026-05-07 as
node/host-agent `0.2.164`. Each `flow-*` channel now keeps route memory,
served count, last served time, last route/next hop, failed-route marker,
consecutive failures, stall count, last send duration, and explicit
`route_rebuild_recommended` / `degraded_fallback_recommended` signals. The
scheduler drains non-stalled channels first, prefers less-served/older
channels, avoids a channel's last failed route on the next send, and only
marks degraded fallback after repeated failures. Live smoke against
`usa-los-1` posted two IP flows through the signed service-channel endpoint:
heartbeat reported schema `c18l.fabric_service_channel_runtime_report.v1`,
`send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`,
`dropped=0`, `backpressure=false`, `last_next_hop=home-1`, and per-flow
`served=1`. One stale candidate route failed and was bypassed before the
successful route to `home-1`. All six current cluster nodes (`home-1`,
`usa-los-1`, `ifcm-rufms-s-mo1cr`, `test-1`, `test-2`, `test-3`) report
node-agent `0.2.164` and healthy.
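
The C18L drain order (non-stalled channels first, then less-served, then older) can be expressed as a single sort key. The sketch below is a hypothetical Python illustration; `ChannelState` and its field names are assumptions loosely modeled on the telemetry fields listed above, not the runtime's actual types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChannelState:
    name: str
    queued: int
    served: int
    last_served_at: float  # seconds; 0.0 means never served
    stalled: bool = False


def drain_order(channels: List[ChannelState]) -> List[str]:
    """Order channels with queued work: non-stalled, then less-served, then older."""
    ready = [c for c in channels if c.queued > 0]
    ready.sort(key=lambda c: (c.stalled, c.served, c.last_served_at))
    return [c.name for c in ready]


order = drain_order([
    ChannelState("flow-12", queued=3, served=5, last_served_at=10.0),
    ChannelState("flow-14", queued=1, served=1, last_served_at=20.0),
    ChannelState("flow-3", queued=2, served=0, last_served_at=0.0, stalled=True),
    ChannelState("flow-7", queued=0, served=0, last_served_at=0.0),
])
```

Here `flow-14` drains before `flow-12` because it has been served less, and the stalled `flow-3` goes last even though it has never been served.
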
- C18M Control Plane service-channel feedback is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.165` is deployed on
docker-test, and node/host-agent `0.2.165` artifacts are published. When
issuing `rap.fabric_service_channel_lease.v1`, backend now reads fresh
entry-node heartbeat metadata
`fabric_service_channel_runtime_report.ingress.flow_scheduler.channel_stats`,
builds per-route service-channel feedback, boosts recently successful routes,
penalizes recent failures, and fences routes that report
`route_rebuild_recommended`, `degraded_fallback_recommended`, or repeated
consecutive failures. Fenced routes are not selected as primary or alternate;
if all selected entry/exit routes are fenced, the lease uses explicit
degraded backend fallback with reason
`fabric_routes_fenced_by_service_channel_feedback`. Live smoke created two
short-lived `test-1 -> test-2` route intents, injected a fresh
service-channel flow feedback heartbeat marking the higher-priority route as
rebuild-required, and the next lease selected the lower-priority healthy
route with score reason `service_channel_recent_success`; the bad route was
not offered as an alternate. Current node rollout: `home-1`, `usa-los-1`,
`test-1`, `test-2`, and `test-3` report `0.2.165`; Windows `ifcm-rufms-s-mo1cr`
remains healthy on `0.2.164` and should move on its next updater cycle.
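
The C18M selection rules (boost recent success, penalize recent failure, fence on rebuild/degraded signals or repeated failures, fall back only when everything is fenced) can be summarized in a short sketch. This Python example is an assumption-laden illustration: the threshold, score weights, type names, and returned dictionary shape are hypothetical, while the fallback reason string is taken from the text above.

```python
from dataclasses import dataclass
from typing import Dict

FAILURE_FENCE_THRESHOLD = 3  # assumed value for "repeated consecutive failures"
SUCCESS_BOOST = 100          # assumed score weights
FAILURE_PENALTY = 50


@dataclass
class RouteFeedback:
    recent_success: bool = False
    recent_failure: bool = False
    consecutive_failures: int = 0
    route_rebuild_recommended: bool = False
    degraded_fallback_recommended: bool = False


def is_fenced(fb: RouteFeedback) -> bool:
    return (fb.route_rebuild_recommended
            or fb.degraded_fallback_recommended
            or fb.consecutive_failures >= FAILURE_FENCE_THRESHOLD)


def select_lease(priorities: Dict[str, int],
                 feedback: Dict[str, RouteFeedback]) -> dict:
    """Pick primary/alternate routes; fenced routes are never offered."""
    scored = []
    for route_id, priority in priorities.items():
        fb = feedback.get(route_id, RouteFeedback())
        if is_fenced(fb):
            continue
        score = priority
        score += SUCCESS_BOOST if fb.recent_success else 0
        score -= FAILURE_PENALTY if fb.recent_failure else 0
        scored.append((score, route_id))
    if not scored:
        return {"data_plane": "backend_relay", "primary": None, "alternates": [],
                "reason": "fabric_routes_fenced_by_service_channel_feedback"}
    scored.sort(key=lambda item: (-item[0], item[1]))
    return {"data_plane": "fabric_mesh", "primary": scored[0][1],
            "alternates": [route_id for _, route_id in scored[1:]], "reason": None}


lease = select_lease(
    {"route-high": 100, "route-low": 50},
    {"route-high": RouteFeedback(route_rebuild_recommended=True),
     "route-low": RouteFeedback(recent_success=True)},
)
```

This reproduces the shape of the live smoke: the higher-priority route is fenced, the lower-priority healthy route becomes primary, and the bad route is not offered as an alternate.
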
- C18N durable service-channel route feedback is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.166` is deployed on
docker-test with migration `000025_fabric_service_channel_route_feedback`.
Heartbeats now persist service-neutral route observations into
`fabric_service_channel_route_feedback_observations` and maintain an
expiring latest view in `fabric_service_channel_route_feedback_latest`.
Lease selection reads this durable latest feedback before falling back to
in-memory heartbeat parsing, so route fencing survives backend restarts and
stale heartbeat replacement. Node/host-agent `0.2.166` artifacts and Docker
image are published, update policies target `0.2.166`, and `test-1/2/3`,
`usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.166`; `home-1` is healthy
but still on `0.2.165` until its next updater cycle. Live smoke created two
short-lived `test-1 -> test-2` routes, persisted a fenced observation for the
higher-priority bad route and a healthy observation for the lower-priority
route, restarted backend, and the next lease selected the healthy route with
`service_channel_recent_success`.
- C18O service-channel feedback diagnostics and synthetic route avoidance are
live on 2026-05-07. Backend image
`rap-backend:fabric-service-channel-0.2.167` is deployed on docker-test and
web-admin is rebuilt/published. Admin/API now expose fresh durable feedback
through `GET /clusters/{clusterID}/fabric/service-channels/route-feedback`,
and each node synthetic config includes
`service_channel_route_feedback` with healthy/degraded/fenced counts and
observations. Synthetic config generation skips routes fenced by the local
node's durable service-channel feedback, so nodes stop receiving known-bad
route configs while the feedback is active. Live smoke created fresh
`test-1 -> test-2` routes, persisted `fenced` feedback for the higher-priority
route and `healthy` feedback for the lower-priority route, confirmed the API
returned both observations, and confirmed `test-1` synthetic config excluded
the bad route while keeping the healthy route.
- C18P proactive service-channel replacement decisions are live on 2026-05-07.
Backend image `rap-backend:fabric-service-channel-0.2.168` is deployed on
docker-test and web-admin is rebuilt/published. When synthetic config
generation withholds a route fenced by local service-channel feedback, it now
records a `route_path_decisions` item with
`decision_source=service_channel_feedback_replacement`,
`replacement_route_id`, effective replacement hops, and score reasons. If no
alternate exists, the decision source becomes
`service_channel_feedback_no_alternate` with visible score reason
`no_unfenced_alternate_route`. Live smoke created fresh `test-1 -> test-2`
bad/good routes, fenced the bad route, disabled older smoke routes, and
confirmed `test-1` synthetic config excluded the bad route, kept the good
route, and reported replacement from bad route to good route.
- C18Q service-channel replacement dampening is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.169`, node/host-agent
`0.2.169` artifacts, Docker image, update policies, and web-admin are
published on docker-test. Replacement selection now gives a large stable
preference to routes with active healthy durable feedback, adding
`active_healthy_feedback_dampening_window` to score reasons, so a recently
successful replacement wins over a higher-priority but unproven route until
the feedback window expires or a newer fenced/healthy observation changes the
state. `RoutePathDecisionReport` now includes `degraded_decision_count` for
`service_channel_feedback_no_alternate`, and node-agent heartbeat reports
include `replacement_route_id` and degraded counts after upgrade. Live smoke
fenced a high-priority bad `test-1 -> test-2` route, supplied healthy feedback
for a low-priority route, also created a higher-priority unproven route, and
confirmed replacement selected the healthy route because of the dampening
window.
- C18Q hotfix `0.2.171` is published on 2026-05-07. Node-agent now includes
`service_channel_route_feedback` in the signed synthetic config model before
recalculating the authority payload hash. Without this, upgraded backend
configs were signed correctly but `0.2.169` agents rejected them with
`control-plane synthetic mesh config authority payload hash mismatch`.
Regression coverage verifies a signed config containing durable
service-channel feedback. Artifacts, Docker image, latest download aliases,
and update policies were moved to `0.2.171`; `test-1/2/3` are running
`0.2.171` and loading `source=control_plane` again. The release includes
`linux_service`, Docker, Windows service, and binary artifacts so service
installs can auto-update. Old C18 smoke/expired route intents were disabled
after validation.
- C18R fleet diagnostics/operator action slice is live on 2026-05-07. Backend
image `rap-backend:fabric-service-channel-0.2.172` adds route feedback
filters (`route_id`, `feedback_status`, `include_expired`) and
`POST /clusters/{clusterID}/fabric/service-channels/route-feedback/expire`.
  The expire action is cluster-mutable/admin-gated and marks the latest feedback
  as expired without deleting historical observations. Web-admin / Fabric Links
now shows a cluster-level service-channel feedback panel with fenced,
degraded, healthy and no-alternate counts, replacement/no-alternate decisions,
and an operator `expire` action for stale non-healthy feedback.
- C18S service-channel feedback churn guardrails are implemented on
2026-05-07. Operator expire now records
`fabric.service_channel_route_feedback.expired` audit events, returns and
persists a short `operator_retry_cooldown_until`, and route generation adds
`service_channel_route_retry_after_operator_expire` when a manually expired
route is being retried. During that cooldown, repeated non-healthy feedback
from the same reporter/route/service is suppressed as
`operator_retry_cooldown` instead of immediately fencing the route again.
Web-admin shows the retry/cooldown state in Fabric Links.
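
The C18S cooldown rule can be illustrated with a small time-window check. This is a hypothetical Python sketch: the five-minute duration and the function names are assumptions, and only the `operator_retry_cooldown` label and the idea of a persisted `operator_retry_cooldown_until` come from the text above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

OPERATOR_RETRY_COOLDOWN = timedelta(minutes=5)  # assumed duration


def cooldown_after_expire(expired_at: datetime) -> datetime:
    """Compute operator_retry_cooldown_until for a manual expire."""
    return expired_at + OPERATOR_RETRY_COOLDOWN


def effective_feedback(status: str, reported_at: datetime,
                       cooldown_until: Optional[datetime]) -> str:
    """Suppress non-healthy feedback that arrives inside the retry cooldown."""
    in_cooldown = cooldown_until is not None and reported_at < cooldown_until
    if status != "healthy" and in_cooldown:
        return "operator_retry_cooldown"
    return status


expired_at = datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc)
cooldown_until = cooldown_after_expire(expired_at)
during = effective_feedback("fenced", expired_at + timedelta(minutes=1), cooldown_until)
after = effective_feedback("fenced", expired_at + timedelta(minutes=10), cooldown_until)
```

Healthy feedback passes through unchanged during the window, so a retried route that recovers is recognized immediately.
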
- C18T automatic rebuild decision contract is implemented on 2026-05-07.
`RoutePathDecision` now carries `rebuild_request_id`, `rebuild_status`,
`rebuild_reason`, and `rebuild_attempt`. When fenced service-channel feedback
keeps failing outside manual retry cooldown, Control Plane records a bounded
rebuild request. If an unfenced alternate exists, the decision is marked
  `rebuild_status=applied`; if not, it is marked
  `pending_degraded_fallback`, and leases expose backend relay with reason
`fabric_route_rebuild_pending_backend_relay`. Web-admin shows rebuild counts,
status, and attempts in Fabric Links. A live smoke on docker-test created
short-lived `test-1 -> test-2` bad/good routes, reported fenced feedback for
the bad route and healthy feedback for the good route, and confirmed scoped
synthetic config returned `service_channel_feedback_replacement` with
`rebuild_status=applied` and `rebuild_attempt=3`. Node/host-agent `0.2.175`
is published so agents preserve the new signed rebuild fields.
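
The C18T decision contract can be sketched as a function that either applies a replacement or records a pending degraded fallback. This Python example is illustrative only: the function name, argument shapes, the `lease_reason` key, and the choice of the first unfenced candidate as replacement are assumptions, while the status, decision-source, and reason strings are the ones quoted in this document.

```python
from typing import Dict, List, Optional, Set


def rebuild_decision(fenced_route: str, candidates: List[str],
                     fenced: Set[str], attempt: int) -> Dict[str, Optional[object]]:
    """Build a rebuild decision for a route that keeps failing while fenced."""
    alternates = [r for r in candidates if r != fenced_route and r not in fenced]
    if alternates:
        return {
            "decision_source": "service_channel_feedback_replacement",
            "rebuild_status": "applied",
            "replacement_route_id": alternates[0],  # assumed: first unfenced candidate
            "rebuild_attempt": attempt,
            "lease_reason": None,
        }
    return {
        "decision_source": "service_channel_feedback_no_alternate",
        "rebuild_status": "pending_degraded_fallback",
        "replacement_route_id": None,
        "rebuild_attempt": attempt,
        "lease_reason": "fabric_route_rebuild_pending_backend_relay",
    }


applied = rebuild_decision("route-bad", ["route-bad", "route-good"], {"route-bad"}, 3)
pending = rebuild_decision("route-bad", ["route-bad"], {"route-bad"}, 1)
```
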
- C18U node-agent route-manager rebuild consumption is live on 2026-05-07.
Node-agent `0.2.176` now converts backend rebuild decisions into a
service-channel route-manager snapshot, counts rebuild requests/applies,
marks applied/pending-degraded routes as withdrawn, clears a withdrawn cached
selected route, and excludes withdrawn routes from new service-channel route
candidates. This keeps new flows from retrying a route that Control Plane has
already rebuilt away from. Unit coverage verifies a bad route is skipped in
favor of its replacement. Node/host-agent `0.2.176` artifacts, Docker image,
latest download aliases, release manifests, and node policies are published.
`test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.176`.
Backend `rap-backend:fabric-service-channel-0.2.176` is deployed with a
panel consistency fix: if a node reports the target version, stale failed
update status no longer overrides `version_state=current`.
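
A rough sketch of the C18U route-manager behavior, assuming a simplified in-memory model. The `routeManager` type and its methods are illustrative names, not the node-agent implementation.

```go
package main

import "fmt"

// routeManager is a hypothetical, simplified route-manager snapshot.
type routeManager struct {
	withdrawn map[string]bool
	selected  string // cached selected route; empty when none is cached
}

// applyDecision withdraws a route whose rebuild status is applied or
// pending-degraded, and clears the cached selection if it was that route.
func (m *routeManager) applyDecision(routeID, rebuildStatus string) {
	if rebuildStatus != "applied" && rebuildStatus != "pending_degraded_fallback" {
		return
	}
	m.withdrawn[routeID] = true
	if m.selected == routeID {
		m.selected = ""
	}
}

// candidates returns config-order routes with withdrawn routes removed.
func (m *routeManager) candidates(configOrder []string) []string {
	out := make([]string, 0, len(configOrder))
	for _, id := range configOrder {
		if !m.withdrawn[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	m := &routeManager{withdrawn: map[string]bool{}, selected: "route-bad"}
	m.applyDecision("route-bad", "applied")
	fmt.Println(m.candidates([]string{"route-bad", "route-good"}), m.selected == "")
}
```
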
- C18V route-manager churn telemetry is live on 2026-05-07. Node-agent
`0.2.177` adds `route_manager_transition` to the service-channel runtime
report with previous/current generation, transition status, decision counts,
withdrawn/restored route counts, pending-degraded fallback count, rebuild
applied count, and any cleared cached route. Tests cover applied rebuild
replacement, pending degraded fallback with no alternate, and restoration by
a fresh config so withdrawn routes do not become sticky local state. Artifacts,
Docker image, latest download aliases, release manifests, and node policies
are published. `test-1/2/3` run `0.2.177`; their heartbeat metadata exposes
`rap.fabric_service_channel_route_manager_transition.v1`.
- C18W live Control Plane/runtime verification is implemented and smoke-passed
on 2026-05-07. Script
`scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` drives the
whole loop against docker-test API: creates temporary service-channel route
intents for `test-1 -> test-2`, injects fenced/healthy route feedback through
heartbeat, verifies scoped config emits `rebuild_status=applied`, waits for
node-agent heartbeat `route_manager_transition.status=applied_rebuild`,
expires the feedback, verifies the restored config has no rebuild decision,
and waits for `restored_by_new_config`. Result artifact:
`artifacts/c18w-service-channel-route-manager-smoke-result.json` with run
  `c18w-20260507-173226`. During the smoke, the operator expire action exposed
  pgx parameter issues on the live backend; backend
  `rap-backend:fabric-service-channel-0.2.179` is deployed with safer
  UUID/text timestamp handling for feedback expire.
- C18X logical-channel isolation and bounded backpressure coverage is
implemented and smoke-passed on 2026-05-07. Node-agent/host-agent `0.2.180`
artifacts, Docker image, latest download aliases, release manifests, and
node policies are published. The key runtime fix is in
`FabricClientPacketIngress.routeCandidatesForChannel`: a channel with a local
failed-route avoid state no longer falls back to the global last selected
route, so one degraded logical flow cannot drag unrelated flows back onto the
failed path. Coverage proves independent logical-channel failover, bounded
same-channel backpressure/drop telemetry, and packet-flow hashing. Script
`scripts/fabric/c18x-service-channel-logical-channel-smoke.ps1` passes with
result artifact `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
run `c18x-20260507-180647`. Test docker nodes `test-1/2/3` are running
`rap-node-agent:0.2.180`; backend remains
`rap-backend:fabric-service-channel-0.2.179`.
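
The C18X isolation rule can be illustrated like this. It is a sketch under assumptions: `ingress` and `sketchRouteCandidates` are hypothetical and do not reproduce `FabricClientPacketIngress.routeCandidatesForChannel`. They only show the idea that the global last-selected hint is ignored once a channel has local avoid state.

```go
package main

import "fmt"

// ingress is a hypothetical, reduced model of per-channel route memory.
type ingress struct {
	lastSelected string                     // global last selected route
	avoid        map[string]map[string]bool // channel -> failed routes to avoid
}

// sketchRouteCandidates uses the global hint only for channels without local
// avoid state, so a degraded channel never pulls others onto the failed path.
func (in *ingress) sketchRouteCandidates(channel string, routes []string) []string {
	avoid := in.avoid[channel]
	hint := ""
	if len(avoid) == 0 {
		hint = in.lastSelected
	}
	var out []string
	if hint != "" {
		out = append(out, hint)
	}
	for _, r := range routes {
		if avoid[r] || r == hint {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	in := &ingress{
		lastSelected: "route-1",
		avoid:        map[string]map[string]bool{"flow-01": {"route-1": true}},
	}
	routes := []string{"route-1", "route-2"}
	fmt.Println(in.sketchRouteCandidates("flow-01", routes)) // channel that saw the failure
	fmt.Println(in.sketchRouteCandidates("flow-02", routes)) // unrelated channel
}
```
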
- C18Y route-intent lifecycle cleanup is implemented and smoke-passed on
2026-05-07. Backend `rap-backend:fabric-service-channel-0.2.181` is deployed
on docker-test, and web-admin Fabric Links now shows route-intent lifecycle
counts/table with operator `expire` and `disable` actions. Route intents are
enriched with `lifecycle_status`, `is_expired`, and `policy_expires_at`.
Node-scoped synthetic mesh config now filters out expired policy routes, so
stale smoke routes no longer get emitted to agents for route-health probing.
API actions are available at
`POST /clusters/{clusterID}/mesh/route-intents/{routeIntentID}/expire` and
`/disable`. Script `scripts/fabric/c18y-route-intent-lifecycle-smoke.ps1`
passed against docker-test API, result
`artifacts/c18y-route-intent-lifecycle-smoke-result.json` run
`c18y-20260507-192702`. During deploy, docker-test root disk was full from
build cache/images; `docker builder prune -af` and `docker image prune -f`
freed space before redeploy.
- C18Z bounded service-channel load coverage is implemented, published, and
smoke-passed on 2026-05-07. Node-agent/host-agent `0.2.181` artifacts,
Docker image `rap-node-agent:0.2.181`, latest download aliases, release
manifests, and update policies are published. `test-1/2/3` are restarted on
`rap-node-agent:0.2.181`; `usa-los-1` also reports `0.2.181`. The key runtime
fix is in `FabricFlowScheduler.Snapshot`: backpressure remains visible when
bounded drops occurred, even after the queue drains. Coverage proves
multi-channel rebuild away from a withdrawn primary route and per-channel
bounded drop/high-water telemetry. Script
`scripts/fabric/c18z-service-channel-load-smoke.ps1` passed against
docker-test API, result
`artifacts/c18z-service-channel-load-smoke-result.json` run
`c18z-20260507-194616`. Release artifacts were corrected after initial
publication to use backend-relative `/downloads/...` primary URLs plus
internal/external mirror URLs, so offsite nodes resolve downloads through
their own control-plane origin such as `http://vpn.cin.su:19191`. Current
caveat: `ifcm-rufms-s-mo1cr` and `home-1` remained `version_state=failed`
at the last check; their next update plan now points to reachable `0.2.181`
artifacts, but the local updater loop still needs to retry/report success.
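
The C18Z snapshot fix can be pictured with a small bounded queue. This is an assumed model, not `FabricFlowScheduler` itself: the `scheduler` type, its fields, and the snapshot shape are hypothetical.

```go
package main

import "fmt"

// schedulerSnapshot is a hypothetical telemetry view.
type schedulerSnapshot struct {
	QueueDepth     int
	DroppedPackets int
	Backpressure   bool
}

// scheduler is a minimal bounded queue that counts drops.
type scheduler struct {
	queue   []int
	limit   int
	dropped int
}

// enqueue drops the packet once the queue is at its bound.
func (s *scheduler) enqueue(p int) {
	if len(s.queue) >= s.limit {
		s.dropped++
		return
	}
	s.queue = append(s.queue, p)
}

func (s *scheduler) drain() { s.queue = s.queue[:0] }

// snapshot keeps backpressure visible if drops occurred, even after a drain.
func (s *scheduler) snapshot() schedulerSnapshot {
	return schedulerSnapshot{
		QueueDepth:     len(s.queue),
		DroppedPackets: s.dropped,
		Backpressure:   len(s.queue) >= s.limit || s.dropped > 0,
	}
}

func main() {
	s := &scheduler{limit: 2}
	for i := 0; i < 3; i++ {
		s.enqueue(i)
	}
	s.drain()
	fmt.Printf("%+v\n", s.snapshot())
}
```
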
- C18Z1 live service-channel ingress is implemented, published, and
smoke-passed on 2026-05-07. Node-agent/host-agent `0.2.182` artifacts,
Docker image `rap-node-agent:0.2.182`, release manifests, and update
policies are published. Backend `rap-backend:fabric-service-channel-0.2.182`
is deployed on docker-test. The runtime fix is a dynamic mesh listener
handler: synthetic config refreshes now update `/mesh/v1/forward`,
service-channel ingress, production routes, delivery inbox, and forward
  transport without requiring a port/listener restart. The backend policy for
  selecting the latest route feedback now prevents a fresh healthy heartbeat
  from immediately overwriting active degraded/fenced feedback before TTL
  expiry, so rebuild decisions survive long enough for nodes to apply them.
  Script
`scripts/fabric/c18z1-live-service-channel-ingress-smoke.ps1` posts signed
generic packet batches to the running `test-1` service-channel HTTP endpoint,
  waits for both entry and exit runtime configs, verifies exit inbox delivery,
  injects route feedback, observes Control Plane rebuild, waits for node
  `applied_rebuild`, sends a second batch over the replacement route, and
expires both temporary route intents. Result:
`artifacts/c18z1-live-service-channel-ingress-smoke-result.json` run
`c18z1-20260507-203628`. All current nodes report `0.2.182/current` at the
last check.
- C18Z2 live service-channel sustained soak/failure smoke is implemented and
passed on 2026-05-07 without a new runtime release. Script
`scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1` drives signed
generic packet batches through the running `test-1` service-channel HTTP
endpoint, keeps temporary primary/alternate `test-1 -> test-2` route intents
visible, restarts the exit-node container `rap_test_node_test_2`, waits for
the exit runtime to reload synthetic config, and verifies recovery batches
reach the exit fabric inbox after the restart. Result:
`artifacts/c18z2-live-service-channel-soak-smoke-result.json` run
`c18z2-20260507-205112`: warm batches `6/6`, during-restart batches `3/3`,
recovery batches `8/8`, exit inbox depth grew from post-restart baseline
`0` to `88`, drops `0`, and both temporary route intents expired.
- C18Z3 live service-channel entry/WebSocket/degraded-fallback smoke is
implemented, published, and passed on 2026-05-07. Node-agent/host-agent
`0.2.183` artifacts and Docker image `rap-node-agent:0.2.183` are published
to docker-test downloads; update policies for `test-1/2/3` are set to
`rolling` target `0.2.183`, and the test containers run that image. The
runtime fix makes the entry node honor the signed service-channel lease
authority: leases with `status=degraded_fallback` or
`primary_route.status=missing_route_intent` now force backend fallback instead
of reusing stale generic route candidates. The same fallback rule is applied
to HTTP and WebSocket packet ingress. Script
`scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
verifies signed HTTP warm batches, WebSocket ingress parity, entry-node
container restart while the lease exists, recovery batches over the same
lease, explicit degraded fallback for a no-route exit, and route-intent
expiry. Result:
`artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
run `c18z3-20260507-211402`: warm `4/4`, WebSocket packets `8`, recovery
`4/4`, backend fallback queue `0 -> 8`, route failures `0`, and all checks
passed. During publication the first `0.2.183` Docker tar had a malformed
entrypoint and stale size/hash metadata; it was rebuilt, the latest tar alias
was replaced, and the release artifact row was corrected to sha256
`231286cf5860b22cf8ca6550f67f61b0ca4b5011ab9b09995bcabbafe883fee1`, size
`7261696`.
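
The C18Z3 lease-authority rule is small enough to sketch directly. The `lease` struct, `pickPath`, and the `backend_fallback` return value are hypothetical; only the two status strings come from the notes above. Treating an empty candidate list as fallback is also an assumption of this sketch.

```go
package main

import "fmt"

// lease is a hypothetical, reduced view of a signed service-channel lease.
type lease struct {
	Status             string
	PrimaryRouteStatus string
}

// mustUseBackendFallback reports whether the lease itself forbids fabric routes.
func mustUseBackendFallback(l lease) bool {
	return l.Status == "degraded_fallback" || l.PrimaryRouteStatus == "missing_route_intent"
}

// pickPath is meant to be shared by HTTP and WebSocket ingress so both apply
// the same rule; stale generic candidates are ignored when the lease says so.
func pickPath(l lease, candidates []string) string {
	if mustUseBackendFallback(l) || len(candidates) == 0 {
		return "backend_fallback"
	}
	return candidates[0]
}

func main() {
	stale := []string{"route-stale"}
	fmt.Println(pickPath(lease{Status: "degraded_fallback"}, stale))
	fmt.Println(pickPath(lease{Status: "active"}, stale))
}
```
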
- C18Z4 live service-channel long-session pressure smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script
`scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1` opens
one signed long-lived service-channel WebSocket from `test-1` to `test-2`,
sends 48 packet batches / 384 packets, expires the primary route intent while
the WebSocket session is still active, waits for dynamic synthetic-config
refresh, and verifies the remaining packets use the alternate route. Result:
`artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
run `c18z4-20260507-212748`: exit inbox depth `0 -> 384`, route failure delta
`0`, flow drop delta `0`, backend fallback queue `0 -> 0`, primary route
removed from entry/exit configs, alternate route selected after the switch,
and both route intents expired. This proves the shared Fabric Service Channel
can keep a service session alive while Control Plane changes the live route
set, without falling back to backend relay.
- C18Z5 live service-channel exit-restart smoke is implemented and passed on
2026-05-07 without a new runtime release beyond `0.2.183`. Script
`scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1` keeps one
signed WebSocket service-channel session open from `test-1` to `test-2`,
sends pre-outage traffic, stops `test-2` for a bounded outage while traffic
  continues, starts it again, waits for runtime readiness, then sends recovery
traffic over the same WebSocket. Result:
`artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json` run
`c18z5-20260507-213745`: pre/outage/recovery batches `12/24/24`, total
packets `480`, route failure delta `48`, backend fallback queue `0 -> 192`,
flow drop delta `0`, and recovery exit inbox `0 -> 192`. This proves real
exit-node failure is visible as fallback/failure telemetry while the
long-lived service channel remains usable and fabric delivery resumes after
the exit runtime returns. After the test, `test-2` and all active cluster
nodes were healthy/current on `0.2.183`.
- C18Z6 live service-channel active rebuild smoke is implemented and passed on
2026-05-07 without a new runtime release beyond `0.2.183`. Script
`scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1` keeps a
signed WebSocket service-channel session open from `test-1` to `test-2`,
sends pre-rebuild traffic, injects route-health feedback that marks the
primary route stale and names the alternate route as replacement, waits for
Control Plane `rebuild_status=applied`, waits for node-agent
`route_manager_transition.status=applied_rebuild`, then continues sending
over the same WebSocket. Result:
`artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json` run
`c18z6-20260507-214900`: pre/post batches `16/32`, total packets `384`,
exit inbox depth `0 -> 384`, Control Plane replacement route
`b2f3c510-46d2-4dce-8389-3952a99d0311`, route failure delta `0`, flow drop
delta `0`, backend fallback queue `0 -> 0`, all checks passed, and all
active nodes remained healthy/current on `0.2.183`. This proves a live
service channel can apply a route-manager rebuild decision without rebuilding
the service WebSocket.
- C18Z7 live service-channel concurrent isolation smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script
`scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1`
opens three signed WebSocket service-channel sessions over the same
`test-1 -> test-2` entry/exit pair, interleaves packet batches across all
sessions, injects primary-route stale feedback, waits for Control Plane
`rebuild_status=applied` and node-agent `applied_rebuild`, then continues all
sessions over the same sockets. Result:
`artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
run `c18z7-20260507-215727`: 3 sessions, 36 rounds, 288 packets per session,
864 packets total, each session exit inbox depth `288`, total exit depth
`864`, backend fallback delta `0`, route failure delta `0`, flow drop delta
`0`, and all active nodes healthy/current on `0.2.183`. This proves rebuild
and route-manager state are shared correctly without one active service
session starving or poisoning the other concurrent sessions.
- C18Z8 live service-channel backpressure isolation smoke is implemented and
passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script
`scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1`
opens two interactive signed WebSocket sessions plus one abusive session over
the same `test-1 -> test-2` entry/exit pair. The abusive session sends 1300
packets on one stable 5-tuple to force a single flow shard to hit bounded
queue pressure while the interactive sessions continue sending small batches.
Result:
`artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
run `c18z8-20260507-221347`: both interactive sessions delivered 192 packets
each, the abusive flow reached scheduler high watermark `1024`, scheduled
`1030` packets on the hottest channel, dropped `282` packets on that channel,
produced backend fallback delta `0`, route failure delta `0`, and all active
nodes stayed healthy/current on `0.2.183`. This proves bounded backpressure is
visible and isolated to the overloaded logical flow without starving other
active service sessions.
- C18Z9 route-pool runtime selection is implemented, released as
  node/host-agent `0.2.184`, published to docker-test downloads, and passed on
2026-05-07. Runtime fix: when Control Plane marks a service-channel route
`rebuild_status=applied` and provides `replacement_route_id`, node-agent now
treats that replacement as the preferred route for sticky flow/channel
selection instead of merely withdrawing the bad route and falling back to
config order. Unit coverage:
`TestFabricClientPacketIngressPrefersControlPlaneReplacementOverConfigOrder`.
Live script
`scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1` creates a
route pool with slow relay primary `test-1 -> test-3 -> test-2` and fast
direct replacement `test-1 -> test-2`, keeps one signed WebSocket active,
injects stale-route feedback, waits for Control Plane and node-agent
`applied_rebuild`, then verifies the same service session continues over the
direct replacement. Result:
`artifacts/c18z9-live-service-channel-route-pool-smoke-result.json` run
`c18z9-20260507-224901`: 54 batches / 432 packets sent and delivered to exit,
backend fallback delta `0`, route failure delta `0`, flow drop delta `0`, and
temporary route intents expired. Test containers `test-1/2/3` run
`rap-node-agent:0.2.184`; `usa-los-1`, `home-1`, and
`ifcm-rufms-s-mo1cr` remain healthy on `0.2.183` until their rollout policy is
advanced.
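
The C18Z9 preference for a Control Plane replacement over config order could look roughly like this. `preferReplacement` and the route names are illustrative assumptions; the real selection also involves sticky flow/channel state that is omitted here.

```go
package main

import "fmt"

// preferReplacement puts the Control Plane replacement first when it is
// present and not withdrawn, then the remaining routes in config order.
func preferReplacement(configOrder []string, withdrawn map[string]bool, replacementID string) []string {
	out := make([]string, 0, len(configOrder))
	for _, id := range configOrder {
		if id == replacementID && !withdrawn[id] {
			out = append(out, id)
		}
	}
	for _, id := range configOrder {
		if id != replacementID && !withdrawn[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	order := []string{"relay-primary", "other", "direct"}
	withdrawn := map[string]bool{"relay-primary": true}
	fmt.Println(preferReplacement(order, withdrawn, "direct"))
}
```
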
- C18Z10 service-channel exit-pool failover is implemented, released as
node/host-agent `0.2.185`, published to docker-test downloads, registered in
the stable update channel, and passed on 2026-05-07. Backend service-channel
leases now bind signed entry/exit pools, selected exit follows the selected
primary route, and Control Plane replacement can cross to another authorized
exit when route intents share an exit-pool/resource metadata key. Node-agent
now honors the signed lease primary route as the initial service-channel
preference before normal config-order selection. Unit coverage:
`TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateExitFromPool`,
`TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossExitPool`,
and `TestFabricClientPacketIngressUsesLeasePreferredRouteBeforeConfigOrder`.
Live script
`scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1` creates a
primary exit route `test-1 -> test-2` and an alternate exit route
`test-1 -> test-3` in the same exit pool, keeps one signed WebSocket active,
verifies pre-rebuild traffic reaches the primary exit, injects stale-route
feedback, waits for Control Plane/node-agent `applied_rebuild`, then verifies
post-rebuild traffic reaches the alternate exit. Result:
`artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json` run
`c18z10-20260507-232645`: 54 batches / 432 packets sent, primary exit queue
`144`, alternate exit queue `288`, backend fallback `0`, route failure delta
`0`, flow drop delta `0`, decision source
`service_channel_feedback_exit_pool_replacement`, and temporary route intents
expired. Backend and `test-1/2/3` are running `0.2.185`; update plans now
return download URLs on `192.168.200.61:18080` when the API is reached
directly on `18121`.
- C18Z11 service-channel entry-pool failover contract is implemented and
backend-deployed as `rap-backend:fabric-service-channel-0.2.186`; node-agent
remains `0.2.185` because no node runtime binary change was required.
Backend lease selection now keeps `selected_entry_node_id` aligned with the
selected primary route when the healthy route starts at another authorized
entry node. Route replacement scope also understands entry-pool metadata
keys (`entry_pool_id`, `service_entry_pool_id`, `fabric_entry_pool_id`) in
addition to exit-pool/resource keys, and route decision reports count
entry-pool replacement decisions. Unit coverage:
`TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateEntryFromPool` and
`TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossEntryPool`.
Live script
`scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1` creates
primary entry route `test-1 -> test-2` and alternate entry route
`test-3 -> test-2`, verifies the initial lease uses `test-1`, sends 144
packets, injects service-channel feedback fencing the primary entry route,
verifies a refreshed lease selects `test-3`, then sends 288 more packets
through the alternate entry to the same exit. Result:
`artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json` run
`c18z11-20260507-235341`: exit queue `432`, backend fallback `0`, route
failure deltas `0/0`, flow drop deltas `0/0`, and temporary route intents
expired. This is a lease refresh/reconnect contract for entry replacement;
preserving a broken client-to-entry socket across an entry node outage is not
expected.
- C18Z12 service-channel route quality scoring is implemented and
backend-deployed as `rap-backend:fabric-service-channel-0.2.187`; node-agent
remains `0.2.185`. Backend now uses service-neutral runtime quality feedback
from `fabric_service_channel_runtime_report.ingress.flow_scheduler` when
scoring lease routes: `last_send_duration_ms` adds deterministic latency
boosts/penalties, and recent failures/stalls apply bounded penalties. This is
protocol-agnostic and applies to the shared fabric channel, not HTTP/RDP/DNS
special cases. Unit coverage:
`TestIssueFabricServiceChannelLeasePrefersFastHealthyRouteFeedback`. Live
script `scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1`
creates a high-priority slow relay route `test-1 -> test-3 -> test-2` and a
lower-priority fast direct route `test-1 -> test-2`; the initial lease
selects the slow route by policy priority, then quality telemetry reports
fast route `8ms` and slow route `900ms`, and the refreshed lease selects the
fast route with score reason `service_channel_quality_latency_le_10ms`.
Result: `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
run `c18z12-20260508-000209`; all checks passed and temporary route intents
expired.
- C18Z13 live service-channel route quality self-learning is implemented,
released as node-agent `0.2.188`, published to docker-test downloads,
registered in the stable update channel, and deployed to docker-test
containers `test-1/2/3`. Runtime fix: positive sub-millisecond
service-channel send durations are rounded to `1ms`, preventing fast local
routes from looking like "no quality sample". Unit coverage:
`TestFabricFlowSchedulerRoundsSubMillisecondSendDuration`. Live script
`scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1` proves
the self-learning path without heartbeat injection: initial lease picks a
higher-priority relay route, real service-channel traffic sends 24 batches /
192 packets over the fast direct route, backend persists healthy route
feedback from the node-agent heartbeat (`last_send_duration_ms=1`,
`score_adjustment=90`), and a refreshed lease prefers that fast route over a
newly introduced higher-priority relay candidate. Result:
`artifacts/c18z13-live-service-channel-route-quality-smoke-result.json` run
`c18z13-20260508-001610`; backend fallback `0`, flow drops `0`, temporary
route intents expired. Published release id:
`64effc62-18b6-4eeb-a1c9-f5fb8e251491`.
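
The C18Z13 rounding fix can be shown in a few lines. `sendDurationMS` is a hypothetical helper name. It assumes a stored value of `0` means "no quality sample", as described above.

```go
package main

import (
	"fmt"
	"time"
)

// sendDurationMS converts a measured send duration to whole milliseconds.
// Positive durations below 1ms report 1, so a fast route is never confused
// with "no sample" (0).
func sendDurationMS(d time.Duration) int64 {
	if d <= 0 {
		return 0
	}
	ms := d.Milliseconds()
	if ms < 1 {
		return 1
	}
	return ms
}

func main() {
	fmt.Println(sendDurationMS(250 * time.Microsecond))
	fmt.Println(sendDurationMS(8 * time.Millisecond))
	fmt.Println(sendDurationMS(0))
}
```
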
- C18Z14 active-session route-quality preference is implemented. Backend
`rap-backend:fabric-service-channel-0.2.190` and node-agent `0.2.189` are
deployed to docker-test `test-1/2/3`; node-agent `0.2.189` is published to
docker-test downloads and registered in the stable update channel as release
`9bda9bac-71f3-4e8f-ae70-2abccb1cb866`. Backend now decays older healthy
service-channel feedback before lease scoring so stale success loses weight
before expiry. Node-agent consumes healthy route-quality observations from
signed synthetic config and can override sticky per-flow/config-order route
choice when a learned route is significantly better. Unit coverage:
`TestFabricClientPacketIngressQualityPreferenceOverridesStickyRoute` and
`TestIssueFabricServiceChannelLeaseDecaysOlderHealthyRouteFeedback`. Live
script
`scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1`
keeps one signed WebSocket open while route policy changes: it starts on a
higher-priority relay route, expires that route, sends real traffic through
the fast direct route to teach feedback, introduces a new higher-priority
relay candidate, and verifies the same active session stays on the learned
fast route. Result:
`artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
run `c18z14-20260508-071644`; 60 batches / 480 packets delivered, backend
fallback `0`, flow drops `0`, temporary route intents expired.
- C18Z15 effective route-quality score telemetry is implemented. Backend
`rap-backend:fabric-service-channel-0.2.191` is deployed on docker-test, and
node-agent `0.2.190` is built, published to docker-test downloads, registered
in the stable update channel, and deployed to `test-1/2/3`. Published release
id: `2e4cd0c8-2480-4637-b845-6dcb115dbebd`. Backend feedback reports now
include decayed `effective_score_adjustment` alongside raw
`score_adjustment`; node-agent consumes the effective score for active
route-quality preference and exposes sorted `route_quality_preferences` in
runtime telemetry with raw/effective score and decay reasons. Unit coverage:
`TestFabricClientPacketIngressQualityPreferenceUsesEffectiveScore` and
`TestServiceChannelRouteFeedbackReportIncludesEffectiveDecayedScore`. Live
script
`scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1`
verifies route-quality preference telemetry, effective score visibility, and
decayed effective score visibility after the active-session quality-shift
scenario. Result:
`artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
run `c18z14-20260508-073538`; 60 batches / 480 packets delivered, backend
fallback `0`, flow drops `0`, temporary route intents expired.
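
To illustrate the relationship between raw `score_adjustment` and decayed `effective_score_adjustment`, here is a sketch that assumes linear decay over the feedback TTL. The decay curve, the `decayedScore` helper, and the 300-second TTL are assumptions. The notes above only state that older healthy feedback loses weight before expiry.

```go
package main

import "fmt"

// decayedScore scales a raw score linearly from full weight at age 0 down to
// 0 at the TTL. Linear decay is an assumption made for this sketch.
func decayedScore(raw, ageSeconds, ttlSeconds int) int {
	if ttlSeconds <= 0 || ageSeconds >= ttlSeconds {
		return 0
	}
	if ageSeconds <= 0 {
		return raw
	}
	return raw * (ttlSeconds - ageSeconds) / ttlSeconds
}

func main() {
	for _, age := range []int{0, 150, 300} {
		fmt.Printf("age=%ds raw=90 effective=%d\n", age, decayedScore(90, age, 300))
	}
}
```
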
- C18Z16 per-channel route-quality fairness telemetry is implemented. Node-agent
`0.2.191` is built, published to docker-test downloads, registered in the
stable update channel, and deployed to `test-1/2/3`; backend remains
`rap-backend:fabric-service-channel-0.2.191`. Published release id:
`f072759c-5c3b-4ba0-936a-f59b6d3d7632`. Flow-scheduler channel stats now
expose the applied `quality_preference_route_id`, effective/raw preference
score, and preference reasons, so operators can see which logical channels
actually used learned route quality. Unit coverage:
`TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness`.
Live script
`scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1`
validates multi-channel quality-preference fairness after the active-session
route-quality shift. Result:
`artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
run `c18z14-20260508-074943`; 60 batches / 480 packets delivered, 32 served
logical channels, 32 channels with quality preference applied, backend
fallback `0`, flow drops `0`, temporary route intents expired.
- C18Z17 stale route-quality marker cleanup is implemented. Node-agent
`0.2.192` is built, published to docker-test downloads, registered in the
stable update channel, and deployed to `test-1/2/3`; backend remains
`rap-backend:fabric-service-channel-0.2.191`. Published release id:
`846881bd-e7e0-4212-b8c9-4a6012c6eff7`. Flow-scheduler channel stats now
clear quality preference markers when the preference is no longer in the
effective preference set or when the route manager withdraws that route. Unit
coverage:
`TestFabricClientPacketIngressClearsStaleQualityPreferenceMarkers` and
`TestFabricClientPacketIngressClearsWithdrawnQualityPreferenceMarkers`.
Live script
`scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
verifies cleanup after the active-session quality/fairness scenario. Result:
`artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
run `c18z14-20260508-075750`; 60 batches / 480 packets delivered, active
quality markers `32`, stale quality markers `0`, visible preferences `3`,
backend fallback `0`, flow drops `0`, temporary route intents expired.
- C18Z18 service-session-scoped flow scheduler memory is implemented.
Node-agent `0.2.193` is built, published to docker-test downloads,
registered in the stable update channel, and deployed to `test-1/2/3`;
backend remains `rap-backend:fabric-service-channel-0.2.191`. Published
release id: `05a3d29e-8a62-4bc8-84a3-1d00b794b9c9`. Runtime-sent flow
scheduler channel keys now include the VPN/service session:
`vpn:{vpnConnectionID}:flow-NN`. This keeps route memory, failed-route
avoidance, served/drop counters, and route-quality markers isolated when
several service-channel sessions share one entry/exit and hash to the same
logical flow shard. Unit coverage:
`TestFabricClientPacketIngressIsolatesRouteMemoryPerVPNConnection` and
`TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness`.
Live script
`scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
wraps the live C18Z17 quality path and verifies served live channels are
session-scoped, unscoped served `flow-NN` channels are absent, quality
markers are session-scoped, backend fallback is `0`, and flow drops are `0`.
Result:
`artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
run `c18z14-20260508-082520`; 60 batches / 480 packets delivered, served
channels `32`, session-scoped served channels `32`, session-scoped quality
channels `32`, unscoped served channels `0`, backend fallback `0`, flow drops
`0`, temporary route intents expired.
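
A minimal sketch of the session-scoped channel key. The key format follows `vpn:{vpnConnectionID}:flow-NN` from the notes above; the FNV hash, the shard count of 32, and the 5-tuple string are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flowShards is a hypothetical shard count for this sketch.
const flowShards = 32

// shardFor hashes a flow identifier (for example a 5-tuple string) to a shard.
func shardFor(fiveTuple string, shards uint32) int {
	h := fnv.New32a()
	h.Write([]byte(fiveTuple))
	return int(h.Sum32() % shards)
}

// channelKey scopes the logical flow shard to one VPN/service session, so two
// sessions that hash to the same shard keep separate scheduler state.
func channelKey(vpnConnectionID string, shard int) string {
	return fmt.Sprintf("vpn:%s:flow-%02d", vpnConnectionID, shard)
}

func main() {
	shard := shardFor("10.0.0.1:5000->10.0.0.2:443/tcp", flowShards)
	fmt.Println(channelKey("conn-a", shard))
	fmt.Println(channelKey("conn-b", shard))
}
```
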
- C18Z19 bounded parallel logical-flow send window is implemented. Node-agent
`0.2.194` is built, published to docker-test downloads, registered in the
stable update channel, and deployed to `test-1/2/3`; backend remains
`rap-backend:fabric-service-channel-0.2.191`. Published release id:
`926e5b84-4b0b-4f47-b1fe-798d8105679f`. The live node-agent runtime enables
`MaxParallelFlowSends=4`, so independent scheduled logical channels can send
concurrently instead of one slow channel blocking all following channels.
This remains service-neutral and does not inspect HTTP/RDP/DNS/application
traffic. Telemetry now exposes `max_parallel_flow_sends` and
`send_flow_parallel_batches`. Unit coverage:
`TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`.
Live script
`scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1` wraps
the C18Z18 live route-quality/session-scoped path and verifies the parallel
window is enabled and observed while backend fallback and flow drops stay at
zero. Result:
`artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
run `c18z14-20260508-084133`; 60 batches / 480 packets delivered,
`max_parallel_flow_sends=4`, `send_flow_parallel_batches=60`, served
channels `32`, session-scoped quality channels `32`, backend fallback `0`,
flow drops `0`, temporary route intents expired.
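
The bounded parallel send window can be sketched with a buffered channel used as a semaphore. `sendAll` and its signature are hypothetical. The node-agent's `MaxParallelFlowSends` handling may differ, and the sketch does not inspect payloads.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// sendAll sends one batch per logical channel with at most maxParallel sends
// in flight, and returns the highest concurrency observed.
func sendAll(channels []string, maxParallel int, send func(channel string)) int32 {
	if maxParallel < 1 {
		maxParallel = 1
	}
	sem := make(chan struct{}, maxParallel)
	var wg sync.WaitGroup
	var inFlight, peak int32
	for _, ch := range channels {
		wg.Add(1)
		sem <- struct{}{} // blocks while the window is full
		go func(ch string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot after the counter drops
			n := atomic.AddInt32(&inFlight, 1)
			for {
				p := atomic.LoadInt32(&peak)
				if n <= p || atomic.CompareAndSwapInt32(&peak, p, n) {
					break
				}
			}
			send(ch)
			atomic.AddInt32(&inFlight, -1)
		}(ch)
	}
	wg.Wait()
	return atomic.LoadInt32(&peak)
}

func main() {
	channels := []string{"flow-00", "flow-01", "flow-02", "flow-03", "flow-04", "flow-05"}
	peak := sendAll(channels, 4, func(string) { time.Sleep(10 * time.Millisecond) })
	fmt.Println("peak in-flight sends:", peak)
}
```

A slow channel occupies only one slot of the window, so the remaining slots keep serving independent channels.
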
- C18Z20 per-channel latency/retry/in-flight telemetry and adaptive recommended
send-window telemetry are implemented. Node-agent `0.2.195` is built,
published to docker-test downloads, registered in the stable update channel,
and deployed to `test-1/2/3`; backend remains
`rap-backend:fabric-service-channel-0.2.191`. Published release id:
`b9e198e0-e012-4600-ad14-856820aff41c`. Scheduler telemetry now includes
global `in_flight`, `max_in_flight`, slow/failing channel counts, and
per-channel `send_attempts`, `send_successes`, `send_failures`,
`in_flight`, `max_in_flight`, and latency buckets. Ingress telemetry now
includes `recommended_parallel_flow_sends`; the recommendation shrinks under
bounded drops, degraded fallback recommendations, repeated failures, or
slow/stalled channels. Unit coverage:
`TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure` and
`TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`.
Live script
`scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1`
wraps the C18Z19 live path and verifies the new telemetry on real docker-test
nodes. Result:
`artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
run `c18z14-20260508-085635`; 60 batches / 480 packets delivered,
`max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
`scheduler_max_in_flight=4`, attempts/success/latency visible on 32 channels,
backend fallback `0`, flow drops `0`, temporary route intents expired.
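
One way to picture `recommended_parallel_flow_sends` is the sketch below. The halving rule, the "two or more failures" threshold, and the `pressure` struct are assumptions. The notes above only state which signals shrink the recommendation.

```go
package main

import "fmt"

// pressure is a hypothetical summary of scheduler pressure signals.
type pressure struct {
	Drops            int
	Failures         int
	SlowChannels     int
	DegradedFallback bool
}

// recommendedParallelFlowSends returns the configured window when healthy and
// half of it (never below 1) under any pressure signal.
func recommendedParallelFlowSends(max int, p pressure) int {
	rec := max
	if p.Drops > 0 || p.Failures >= 2 || p.SlowChannels > 0 || p.DegradedFallback {
		rec = max / 2
	}
	if rec < 1 {
		rec = 1
	}
	return rec
}

func main() {
	fmt.Println(recommendedParallelFlowSends(4, pressure{}))
	fmt.Println(recommendedParallelFlowSends(4, pressure{Drops: 3}))
}
```
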
- C18Z21 rolling per-channel/session quality windows are implemented.
Node-agent `0.2.196` is built, published to docker-test downloads,
registered in the stable update channel, and deployed to `test-1/2/3`;
backend remains `rap-backend:fabric-service-channel-0.2.191`. Published
release id: `813b2050-4d4e-444c-9bde-72b1d1f7dd35`. Scheduler decisions now
use a bounded fresh quality window instead of lifetime-only drop/failure
counters, so old pressure rolls out after newer successful samples. Telemetry
now exposes scheduler-level `quality_window_sample_count`,
`quality_window_failure_count`, `quality_window_slow_count`,
`quality_window_drop_count`, and per-channel success/failure/slow/drop sample
counts, average latency, and last update time. Unit coverage:
`TestFabricFlowSchedulerRollingQualityWindowForgetsOldPressure`,
`TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure`, and
`TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`.
Live script
`scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1`
wraps the C18Z20 live path and verifies the rolling-window telemetry on real
docker-test nodes. Result:
`artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json`
run `c18z14-20260508-091952`; 60 batches / 480 packets delivered,
scheduler quality-window samples `480`, failures `0`, drops `0`, window
samples/success/latency visible on 32 channels, `recommended_parallel_flow_sends=4`,
backend fallback `0`, flow drops `0`, temporary route intents expired.
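
The rolling quality window can be sketched as a bounded slice of recent samples. The `qualityWindow` type, the window size, and the sample shape are hypothetical. The point shown is that old pressure rolls out once newer successful samples fill the window.

```go
package main

import "fmt"

// sample is one hypothetical send outcome.
type sample struct {
	Failed, Slow, Dropped bool
}

// qualityWindow keeps only the most recent `size` samples.
type qualityWindow struct {
	samples []sample
	size    int
}

// add appends a sample and discards the oldest ones beyond the bound.
func (w *qualityWindow) add(s sample) {
	w.samples = append(w.samples, s)
	if len(w.samples) > w.size {
		w.samples = w.samples[len(w.samples)-w.size:]
	}
}

// counts returns totals over the current window only.
func (w *qualityWindow) counts() (total, failures, slow, drops int) {
	for _, s := range w.samples {
		total++
		if s.Failed {
			failures++
		}
		if s.Slow {
			slow++
		}
		if s.Dropped {
			drops++
		}
	}
	return
}

func main() {
	w := &qualityWindow{size: 4}
	w.add(sample{Failed: true})
	w.add(sample{Dropped: true})
	for i := 0; i < 4; i++ {
		w.add(sample{}) // newer successful samples push old pressure out
	}
	fmt.Println(w.counts())
}
```
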
- C18Z22 backend durable route feedback now consumes the rolling quality
window from node-agent heartbeat metadata. Backend
`rap-backend:fabric-service-channel-0.2.197` is built and deployed on
docker-test; node-agent remains `0.2.196` on `test-1/2/3`. For agents that
expose `quality_window_*`, backend uses fresh rolling failure/drop/slow
counts and rolling average latency when creating `fabric_service_channel`
route feedback; old `last_failed_route_id`, `consecutive_failures`, and
`stall_count` remain fallback inputs for older agents only. This prevents old
route failures from dominating durable scoring after the channel has recovered
with a clean rolling window. Unit coverage:
`TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback` and
`TestRecordHeartbeatPersistsServiceChannelRouteFeedbackForLaterLease`.
Live script
`scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1` wraps the
C18Z21 live path and verifies persisted route feedback contains
`service_channel_rolling_quality_window` plus payload `quality_window_*`
fields. Result:
`artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json` run
`c18z14-20260508-093100`; 60 batches / 480 packets delivered, route feedback
count `1`, rolling feedback count `1`, healthy rolling feedback count `1`,
rolling payload count `1`, backend fallback `0`, flow drops `0`.
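
A rough sketch of the C18Z22 input selection, assuming heartbeat metadata arrives as a flat dict. The function name, the `legacy_counters` label, and the mapping of `stall_count` onto the slow counter are assumptions made for illustration; the field names themselves are the ones listed above.

```python
def route_feedback_inputs(heartbeat_meta):
    """Pick scoring inputs for route feedback (illustrative only)."""
    has_rolling = any(k.startswith("quality_window_") for k in heartbeat_meta)
    if has_rolling:
        # Fresh rolling counters win; lifetime counters are ignored entirely.
        return {
            "source": "service_channel_rolling_quality_window",
            "failures": heartbeat_meta.get("quality_window_failure_count", 0),
            "drops": heartbeat_meta.get("quality_window_drop_count", 0),
            "slow": heartbeat_meta.get("quality_window_slow_count", 0),
        }
    # Older agents only: fall back to the legacy fields.
    return {
        "source": "legacy_counters",  # hypothetical label
        "failures": heartbeat_meta.get("consecutive_failures", 0),
        "drops": 0,
        "slow": heartbeat_meta.get("stall_count", 0),  # assumed mapping
    }
```

A recovered channel that still carries `consecutive_failures=9` from an old incident is scored on its clean rolling window, not on the stale lifetime counter.
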
- C18Z23 recovery hysteresis is implemented for recovered service-channel
routes. Backend `rap-backend:fabric-service-channel-0.2.198` is built and
deployed on docker-test; node-agent remains `0.2.196` on `test-1/2/3`.
When a route has an operator-expire/manual retry cooldown from prior fenced
feedback but now also has healthy rolling-window feedback, backend re-admits
the route as `authorized` while applying a bounded recovery hysteresis score
penalty (`150`) and a `service_channel_recovery_hysteresis` reason. This keeps
recovered routes available as alternates without immediately displacing a
steady route, which reduces route-selection flapping. Unit coverage:
`TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown`
and `TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback`. Live
script
`scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1` wraps
the C18Z22 live path and verifies backend `0.2.198`, rolling feedback, and
clean live forwarding. Result:
`artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json` run
`c18z14-20260508-094111`; 60 batches / 480 packets delivered, backend
fallback `0`, flow drops `0`, recovery hysteresis penalty `150`.
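
The dampening effect of the C18Z23 penalty can be sketched as below. The sketch assumes that a higher score is better and that the penalty is subtracted; the function name, the route dict shape, and the base scores are hypothetical. Only the penalty value `150` and the reason string come from the text above.

```python
RECOVERY_HYSTERESIS_PENALTY = 150  # value stated above


def rank_routes(routes):
    """Rank candidate routes, dampening recovered ones (illustrative sketch)."""
    ranked = []
    for route in routes:
        score = route["base_score"]
        reasons = []
        if route["in_retry_cooldown"]:
            if not route["rolling_healthy"]:
                continue  # still cooling down without clean feedback: not re-admitted
            score -= RECOVERY_HYSTERESIS_PENALTY
            reasons.append("service_channel_recovery_hysteresis")
        ranked.append({"id": route["id"], "status": "authorized",
                       "score": score, "reasons": reasons})
    # Assumption: higher score wins.
    return sorted(ranked, key=lambda r: r["score"], reverse=True)


candidates = [
    {"id": "steady", "base_score": 1000, "in_retry_cooldown": False, "rolling_healthy": True},
    {"id": "recovered", "base_score": 1100, "in_retry_cooldown": True, "rolling_healthy": True},
    {"id": "fenced", "base_score": 1200, "in_retry_cooldown": True, "rolling_healthy": False},
]
ranking = rank_routes(candidates)
```

In this example the recovered route would have displaced the steady one on raw score, but the penalty keeps it as an authorized alternate instead.
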
- C18Z24 recovery visibility is implemented for service-channel route
diagnostics. Backend `rap-backend:fabric-service-channel-0.2.199` is built
and deployed on docker-test; node-agent remains `0.2.196` on `test-1/2/3`.
Route feedback API responses and node-scoped service-channel feedback reports
now expose `recovery_state`, `recovery_hysteresis_active`, and
`recovery_hysteresis_penalty`, while route path decision reports count
`recovery_hysteresis_count`. Admin diagnostics now show recovered/hysteresis
chips and a recovery column beside route feedback status. Unit coverage:
`TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown`,
`TestServiceChannelRouteFeedbackReportExposesRecoveryState`, and
`TestRoutePathDecisionReportCountsRecoveryHysteresis`. Smoke result:
`artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json`;
route feedback API exposed recovery shape for 109 observations, backend
image `0.2.199` was live, and the web-admin build was published to
`rap_web_admin`.
- C18Z25 recovery promotion policy is implemented. Backend
`rap-backend:fabric-service-channel-0.2.200` is built and deployed on
docker-test; node-agent remains `0.2.196`. A route under manual retry
cooldown remains `recovered` with hysteresis penalty until it reports at
least 64 clean rolling-window samples (`success >= 64`, failures/slow/drops
zero). After that it is promoted back to steady `healthy`, is marked
`recovery_promoted=true` and `service_channel_recovery_promoted`, and carries
no hysteresis penalty. Admin/API now expose promoted counts/flags alongside
recovered/hysteresis state. Smoke result:
`artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json`;
backend image `0.2.200` was live and route-feedback API exposed recovery
state for 109 observations.
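
The C18Z25 promotion rule reduces to a small predicate. The sketch below is an assumption-laden illustration: the function name and return shape are hypothetical, while the 64-sample threshold, the `150` penalty, and the field names `recovery_state`, `recovery_promoted`, and `recovery_hysteresis_penalty` are taken from the entries above.

```python
def recovery_state(success, failures, slow, drops,
                   promotion_min_samples=64, hysteresis_penalty=150):
    """Classify a route that is under manual retry cooldown (illustrative)."""
    clean = failures == 0 and slow == 0 and drops == 0
    if clean and success >= promotion_min_samples:
        # Enough clean rolling-window samples: back to steady healthy.
        return {"recovery_state": "healthy",
                "recovery_promoted": True,
                "recovery_hysteresis_penalty": 0}
    # Otherwise the route stays recovered and keeps the penalty.
    return {"recovery_state": "recovered",
            "recovery_promoted": False,
            "recovery_hysteresis_penalty": hysteresis_penalty}
```

A route with 63 clean samples stays `recovered`; the 64th clean sample promotes it, and a single slow sample in the window blocks promotion regardless of the success count.
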
- C18Z26 recovery demotion policy is implemented. Backend
`rap-backend:fabric-service-channel-0.2.201` is built and deployed on
docker-test; node-agent remains `0.2.196`. If a previously recovered or
promoted route under retry cooldown reports fresh rolling failures, drops,
slow samples, degraded fallback, rebuild recommendation, or fenced feedback,
backend now exposes `recovery_demoted=true` with a concrete
`recovery_reason` such as `service_channel_recovery_demoted_failure`,
`..._slow`, `..._rebuild`, or `..._fenced`. Route score reasons include
`service_channel_recovery_demoted` and the specific demotion reason, and
route path decision reports count `recovery_demoted_count`. Admin diagnostics
now show demoted feedback/path chips and the demotion reason. Smoke result:
`artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json`;
backend image `0.2.201` was live and route-feedback API exposed recovery
state for 109 observations.
- C18Z27 recovery policy tuning is implemented. Backend
`rap-backend:fabric-service-channel-0.2.202` is built and deployed on
docker-test; node-agent remains `0.2.196`. Effective service-channel
recovery policy now has a strict default contract and optional cluster
metadata override at `fabric_service_channel_recovery_policy`. API endpoints
`GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy`
expose and update hysteresis penalty, promotion minimum samples, demotion
thresholds for failures/drops/slow samples, and rebuild/fenced demotion
toggles. Lease route selection, route feedback reports, and node-scoped
synthetic config feedback consume the effective policy. Web-admin shows and
edits the policy in the service-channel route feedback card. Smoke result:
`artifacts/c18z27-service-channel-recovery-policy-smoke-result.json`; live
API updated policy values, then restored strict defaults
(`penalty=150`, `promotion_min_samples=64`, demotion thresholds `1`).
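
One way to picture the C18Z27 effective policy is a strict default dict with an optional cluster override layered on top. Everything below is a sketch under assumptions: the key names inside the policy, the `source` labels, and the default `True` for the rebuild/fenced toggles are hypothetical; the metadata key `fabric_service_channel_recovery_policy` and the default values `150`, `64`, and `1` come from the entry above.

```python
STRICT_DEFAULTS = {
    "hysteresis_penalty": 150,
    "promotion_min_samples": 64,
    "demotion_failure_threshold": 1,
    "demotion_drop_threshold": 1,
    "demotion_slow_threshold": 1,
    "demote_on_rebuild": True,   # assumed default
    "demote_on_fenced": True,    # assumed default
}


def effective_policy(cluster_metadata):
    """Merge the optional cluster override onto the strict defaults."""
    override = cluster_metadata.get("fabric_service_channel_recovery_policy") or {}
    policy = dict(STRICT_DEFAULTS)
    # Unknown override keys are ignored so the contract stays strict.
    policy.update({k: v for k, v in override.items() if k in STRICT_DEFAULTS})
    policy["source"] = "cluster_override" if override else "strict_default"
    return policy


def should_demote(policy, failures=0, drops=0, slow=0, rebuild=False, fenced=False):
    """Apply the demotion thresholds and toggles from the effective policy."""
    return (
        failures >= policy["demotion_failure_threshold"]
        or drops >= policy["demotion_drop_threshold"]
        or slow >= policy["demotion_slow_threshold"]
        or (rebuild and policy["demote_on_rebuild"])
        or (fenced and policy["demote_on_fenced"])
    )
```

With strict defaults a single fresh failure demotes a recovered route; a cluster that raises `demotion_failure_threshold` to `3` tolerates two.
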
- C18Z28 recovery policy provenance is implemented. Backend
`rap-backend:fabric-service-channel-0.2.203` is built and deployed on
docker-test; node-agent remains `0.2.196`. `FabricServiceChannelRoute`,
`FabricServiceChannelLease`, signed lease authority payloads,
service-channel route feedback reports, and route path decision reports now
carry the effective recovery policy used for scoring and recovery decisions.
This makes every primary/alternate/fallback choice auditable against the
policy source and thresholds that produced it. Web-admin node diagnostics
show the service-channel feedback policy and route decision policy source.
Smoke result:
`artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json`;
live synthetic config and live lease issuance both exposed recovery policy
provenance on docker-test.
- C18Z29 feedback provenance guardrails are implemented. Backend
`rap-backend:fabric-service-channel-0.2.204` is built and deployed on
docker-test; node-agent remains `0.2.196`. Recovery policy now has a stable
fingerprint. Backend recognizes optional runtime feedback provenance fields
(`recovery_policy_fingerprint`, `route_generation`, `route_policy_version`,
`policy_version`), exposes observed/effective fingerprints/generations on
route feedback observations, and reports missing/stale counters. Feedback
that explicitly carries a stale policy or generation is scored conservatively,
cannot fence a current route, and cannot request rebuild/demotion; feedback
with missing provenance is still accepted for compatibility with older agents
currently deployed, but is flagged in diagnostics. Web-admin
shows provenance warnings in service-channel feedback. Smoke result:
`artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json`.
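
A stable fingerprint and the stale/missing distinction from C18Z29 might look like the sketch below. The hashing scheme (SHA-256 over canonical JSON), the function names, and the `current`/`stale`/`missing` labels are assumptions; the backend's actual fingerprint format is not described here. The feedback field names are the ones listed above.

```python
import hashlib
import json


def policy_fingerprint(policy):
    """Stable fingerprint: canonical JSON (sorted keys) hashed with SHA-256."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_provenance(feedback, effective_fingerprint, effective_generation):
    """Return 'current', 'stale', or 'missing' for one feedback observation."""
    observed_fp = feedback.get("recovery_policy_fingerprint")
    observed_gen = feedback.get("route_generation")
    if observed_fp is None and observed_gen is None:
        return "missing"  # old agent: still accepted, but counted in diagnostics
    stale_fp = observed_fp is not None and observed_fp != effective_fingerprint
    stale_gen = observed_gen is not None and observed_gen != effective_generation
    return "stale" if (stale_fp or stale_gen) else "current"


def may_fence_or_rebuild(provenance):
    """Explicitly stale feedback must not fence a route or request rebuild/demotion."""
    return provenance != "stale"
```

Sorting keys before hashing is what makes the fingerprint independent of field order, so two nodes serializing the same policy differently still agree.
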
- C18Z30 node-agent feedback provenance is implemented. Backend
`rap-backend:fabric-service-channel-0.2.209` and node-agent `0.2.208` are
built and deployed on docker-test (`test-1/2/3`). Node-agent now preserves the
signed synthetic config contract for recovery feedback/route decision fields
and records per-flow `recovery_policy_fingerprint`, `route_policy_version`,
and `route_generation` at send time, so feedback remains auditable even after
route churn/expiry. Backend heartbeat parsing now preserves those fields into
durable service-channel feedback payloads. Live smoke passed with 28/28
runtime channel stats carrying provenance, 3/3 feedback observations carrying
provenance, and no missing/stale provenance counters. Artifacts:
`artifacts/c18z30-node-telemetry-provenance-live-smoke-base-result.json` and
`artifacts/c18z30-node-agent-feedback-provenance-smoke-result.json`.
- C18Z31 service-channel rebuild ledger is implemented. Backend
`rap-backend:fabric-service-channel-0.2.211` is built and deployed on
docker-test; node-agent remains `0.2.208` on `test-1/2/3`. Backend now keeps
durable route rebuild attempt history in
`fabric_service_channel_route_rebuild_attempts`, upserted from synthetic
config route decisions when service-channel feedback requests rebuild. The
ledger stores trigger/rebuild status, old route, selected replacement,
policy fingerprint, generation, feedback status/reasons, latency/failure
counters, outcome, and compact decision payload. API endpoint
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-attempts` exposes
the history; web-admin loads it into Service-channel route feedback
diagnostics as a rebuild ledger table. Migration `000026` is applied on
docker-test. Live smoke passed:
`artifacts/c18z31-base-active-rebuild-smoke-result.json` and
`artifacts/c18z31-service-channel-rebuild-ledger-smoke-result.json`.
- C18Z32 service-channel rebuild timeline is implemented. Backend
`rap-backend:fabric-service-channel-0.2.213` is built and deployed on
docker-test; node-agent remains `0.2.208` on `test-1/2/3`. The rebuild
attempts API now enriches durable ledger rows with node-agent heartbeat
correlation: matching `route_manager_transition`, route-generation apply or
withdrawn decision, post-rebuild selected route, flow packet/drop/failure
counters, and a compact chronological `timeline` with
`backend_decision`, `node_route_generation_apply`,
`node_route_manager_transition`, and `post_rebuild_traffic` stages. Matching
is generation-strict when the backend attempt has a generation, preventing
stale transition/status matches. Web-admin rebuild ledger shows backend,
agent, route-generation, and traffic columns. Live smoke passed:
`artifacts/c18z32-base-rebuild-ledger-smoke-result.json` and
`artifacts/c18z32-service-channel-rebuild-timeline-smoke-result.json`.
- C18Z33 service-channel rebuild guardrails are implemented. Backend
`rap-backend:fabric-service-channel-0.2.214` is built and deployed on
docker-test; node-agent remains `0.2.208`. Rebuild attempts API now adds
computed guard fields: `guard_status`, `guard_severity`, `guard_reason`,
age, and transition/traffic deadlines. Successful correlated rebuilds report
`guard_status=ok`, `guard_severity=good`; missing node transition,
route-generation correlation, post-rebuild traffic, unexpected selected
route, or post-rebuild drops/failures surface as warn/bad states. Web-admin
shows guard chips and counts in the service-channel rebuild ledger. Live
smoke passed: `artifacts/c18z33-base-rebuild-ledger-smoke-result.json` and
`artifacts/c18z33-service-channel-rebuild-guard-smoke-result.json`.
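
The C18Z33 guard computation can be pictured as an ordered set of checks. This is a hedged sketch: only `guard_status=ok` and `guard_severity=good` are stated above, so the non-ok status labels, the reason strings, the attempt dict keys, and both deadline values are assumptions introduced for illustration.

```python
def rebuild_guard(attempt, age_s, transition_deadline_s=60, traffic_deadline_s=180):
    """Return (guard_status, guard_severity, guard_reason) for one rebuild attempt."""
    if attempt.get("post_rebuild_drops", 0) or attempt.get("post_rebuild_failures", 0):
        return ("failed", "bad", "post_rebuild_degraded")
    selected = attempt.get("post_rebuild_selected_route")
    if selected is not None and selected != attempt["replacement_route"]:
        return ("failed", "bad", "unexpected_selected_route")
    if not attempt.get("node_transition_seen"):
        # Missing evidence is only a warning until its deadline passes.
        severity = "bad" if age_s > transition_deadline_s else "warn"
        return ("missing", severity, "missing_node_transition")
    if not attempt.get("post_rebuild_traffic_seen"):
        severity = "bad" if age_s > traffic_deadline_s else "warn"
        return ("missing", severity, "missing_post_rebuild_traffic")
    return ("ok", "good", "")
```

Because severity depends on `age_s`, the same stored evidence can move from warn to bad over time without any new heartbeat arriving.
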
- C18Z34 service-channel rebuild health summary is implemented. Backend
`rap-backend:fabric-service-channel-0.2.215` is built and deployed on
docker-test; node-agent remains `0.2.208`. New endpoint
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-health` returns a
cluster-level operational summary over the durable rebuild ledger/timeline:
counts by guard status/severity, applied/pending counts, affected reporter
nodes/routes, most recent bad attempts, and recommended operator action.
Web-admin shows the summary as a Rebuild health subpanel above the rebuild
ledger. Live smoke passed:
`artifacts/c18z34-base-rebuild-guard-smoke-result.json` and
`artifacts/c18z34-service-channel-rebuild-health-smoke-result.json`.
- C18Z35 service-channel rebuild alert silence lifecycle is implemented.
Backend `rap-backend:fabric-service-channel-0.2.216` is built and deployed on
docker-test; node-agent remains `0.2.208`. Migration `000027` creates
`fabric_service_channel_rebuild_alert_silences`, applied on docker-test. New
API `POST /clusters/{clusterID}/fabric/service-channels/rebuild-health/silences`
records bounded operator silence for an exact alert fingerprint:
reporter node, route, guard status, and generation. Rebuild health now
separates total bad/warn from active bad/warn and silenced counts; silenced
alerts are omitted from affected nodes/routes and active bad attempt lists.
An alert for a new generation, route, or reporter remains active by design. Web-admin
exposes `silence 6h` on active bad rebuild-health rows. Live smoke passed:
`artifacts/c18z35-base-rebuild-health-smoke-result.json` and
`artifacts/c18z35-service-channel-rebuild-alert-silence-smoke-result.json`.
- C18Z36 service-channel rebuild alert resurfacing is implemented. Backend
`rap-backend:fabric-service-channel-0.2.217` is built and deployed on
docker-test; node-agent remains `0.2.208`. Rebuild health marks active
bad/warn attempts as `alert_resurfaced` when an active silence exists for the
same reporter node, route, and guard status but a different generation. The
summary exposes `resurfaced_count` and `resurfaced_attempts`, including the
previous silenced generation and silence expiry. Web-admin shows a resurfaced
chip/table and allows silencing the new generation separately. Live smoke
passed: `artifacts/c18z36-base-rebuild-health-smoke-result.json` and
`artifacts/c18z36-service-channel-rebuild-alert-resurface-smoke-result.json`.
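
The C18Z35/C18Z36 silence matching can be sketched as follows, assuming silences and attempts are plain dicts. The function names, dict keys, and the `silenced`/`active` return labels are hypothetical; `alert_resurfaced` and the four fingerprint components (reporter node, route, guard status, generation) come from the entries above.

```python
from datetime import datetime, timedelta, timezone


def alert_key(item):
    """Fingerprint components shared by silences and attempts, minus generation."""
    return (item["reporter_node"], item["route_id"], item["guard_status"])


def classify_alert(attempt, silences, now):
    """Return 'silenced', 'alert_resurfaced', or 'active' for a bad/warn attempt."""
    matching = [s for s in silences
                if s["expires_at"] > now and alert_key(s) == alert_key(attempt)]
    if any(s["generation"] == attempt["generation"] for s in matching):
        return "silenced"  # exact fingerprint, silence still active
    if matching:
        return "alert_resurfaced"  # same alert, different generation
    return "active"


now = datetime(2026, 5, 8, 12, 0, tzinfo=timezone.utc)
silences = [{"reporter_node": "test-1", "route_id": "r1", "guard_status": "failed",
             "generation": 7, "expires_at": now + timedelta(hours=6)}]
same_generation = {"reporter_node": "test-1", "route_id": "r1",
                   "guard_status": "failed", "generation": 7}
next_generation = dict(same_generation, generation=8)
```

Keeping generation out of `alert_key` is what lets a new generation be recognised as the same alert resurfacing instead of an unrelated one.
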
- C18Z37 service-channel readiness gate is implemented. Backend
`rap-backend:fabric-service-channel-0.2.218` is built and deployed on
docker-test; node-agent remains `0.2.208`. New endpoint
`GET /clusters/{clusterID}/fabric/service-channels/readiness` returns a fast
recent-window verdict: `clean`, `degraded`, or `blocked`, with active
bad/warn counts, resurfaced/silenced counts, missing transition,
route-generation, post-rebuild traffic, unexpected-route, and post-rebuild
degraded counters plus blocking/degraded reasons and recommended operator
action. Web-admin shows this as a top-level readiness panel in
Service-channel route feedback. Readiness and default admin health queries
are intentionally capped to a small recent window so the operator view stays
responsive after many rebuild attempts; deep ledger diagnostics remain a
separate next layer. Live smoke passed:
`artifacts/c18z37-base-rebuild-health-smoke-result.json` and
`artifacts/c18z37-service-channel-readiness-smoke-result.json`.
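
A minimal sketch of the C18Z37 verdict over a capped recent window. The mapping used here (any active bad attempt gives `blocked`, any active warn gives `degraded`, otherwise `clean`) is an assumption, as are the function name, the window size, and the input dict keys; only the three verdict values and the active/silenced split are stated above.

```python
def readiness(attempts, window=20):
    """Summarise recent rebuild attempts into a verdict (illustrative sketch)."""
    recent = attempts[:window]  # assumes newest-first input; window size is assumed
    active = [a for a in recent if not a.get("silenced")]
    active_bad = sum(1 for a in active if a["guard_severity"] == "bad")
    active_warn = sum(1 for a in active if a["guard_severity"] == "warn")
    if active_bad:
        verdict = "blocked"
    elif active_warn:
        verdict = "degraded"
    else:
        verdict = "clean"
    return {
        "verdict": verdict,
        "active_bad_count": active_bad,
        "active_warn_count": active_warn,
        "silenced_count": len(recent) - len(active),
    }
```

Capping the input before any counting keeps the cost of the verdict independent of how long the rebuild ledger has grown.
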
- C18Z38 service-channel rebuild ledger enrichment split is implemented.
Backend `rap-backend:fabric-service-channel-0.2.219` is built and deployed
on docker-test; node-agent remains `0.2.208`. The rebuild attempts API now
defaults to `enrichment=summary`, returning durable ledger rows without the
expensive heartbeat/timeline guard correlation. Operators can request
`enrichment=deep` explicitly for per-route investigation. Web-admin defaults
to the fast ledger, shows timeline/guard fields as deep-only in summary mode,
and provides a manual deep ledger toggle. C18Z32/C18Z33 smokes now request
deep enrichment. Live smoke passed:
`artifacts/c18z38-service-channel-rebuild-ledger-enrichment-smoke-result.json`.
- C18Z39 service-channel rebuild ledger drilldown is implemented. Backend
`rap-backend:fabric-service-channel-0.2.220` is built and deployed on
docker-test; node-agent remains `0.2.208`. The rebuild attempts API now
accepts `generation` and `offset`, allowing narrow deep investigations by
reporter node, route, service class, and route generation with bounded
pagination. Web-admin adds rebuild ledger filters for reporter/route/
generation/service plus prev/next paging in deep mode. Live smoke passed:
`artifacts/c18z39-service-channel-rebuild-ledger-drilldown-smoke-result.json`.
- C18Z40 service-channel rebuild incident grouping is implemented. Backend
`rap-backend:fabric-service-channel-0.2.222` is built and deployed on
docker-test; node-agent remains `0.2.208`. New endpoint
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-incidents`
groups the bounded recent rebuild window by reporter node, route, service
class, generation, and guard status, exposing first/last seen, attempt count,
latest guard/replacement/outcome, silence/resurface flags, and recommended
action. The incident window is capped to 5 to keep default admin refresh
bounded; broader investigation still uses filtered deep ledger. Web-admin
shows a Rebuild incidents list and `open deep` loads the exact filtered deep
ledger slice for that incident. Live smoke passed:
`artifacts/c18z40-service-channel-rebuild-incidents-smoke-result.json`.
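
The C18Z40 grouping key and first/last-seen bookkeeping can be sketched as below. The function name, attempt dict keys, and integer timestamps are hypothetical, and the sketch assumes the cap of 5 applies to the returned incidents; the entry above does not say whether it bounds incidents or underlying attempts.

```python
def group_incidents(attempts, cap=5):
    """Group rebuild attempts into incidents, newest incident first."""
    incidents = {}
    for a in attempts:
        key = (a["reporter_node"], a["route_id"], a["service_class"],
               a["generation"], a["guard_status"])
        inc = incidents.setdefault(key, {
            "key": key, "first_seen": a["at"], "last_seen": a["at"], "attempt_count": 0,
        })
        inc["first_seen"] = min(inc["first_seen"], a["at"])
        inc["last_seen"] = max(inc["last_seen"], a["at"])
        inc["attempt_count"] += 1
    newest_first = sorted(incidents.values(), key=lambda i: i["last_seen"], reverse=True)
    return newest_first[:cap]  # assumed meaning of the cap


attempts = [
    {"reporter_node": "test-1", "route_id": "r1", "service_class": "vpn_packets",
     "generation": 3, "guard_status": "failed", "at": 100},
    {"reporter_node": "test-1", "route_id": "r1", "service_class": "vpn_packets",
     "generation": 3, "guard_status": "failed", "at": 130},
    {"reporter_node": "test-2", "route_id": "r4", "service_class": "vpn_packets",
     "generation": 1, "guard_status": "ok", "at": 90},
]
incidents = group_incidents(attempts)
```

The same five key fields are what an `open deep` action would pass as filters to load the matching deep ledger slice.
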
- C18Z41 service-channel rebuild incident actions are implemented. Backend
`rap-backend:fabric-service-channel-0.2.223` is built and deployed on
docker-test; node-agent remains `0.2.208`. New API
`POST /clusters/{clusterID}/fabric/service-channels/rebuild-incidents/investigations`
records an audit event when an operator opens a deep rebuild investigation.
Web-admin incident rows now expose `open deep` with audit and `silence 6h`
using the incident fingerprint fields; after silence the panel refreshes only
rebuild health/readiness/incidents instead of the whole cluster scope. Live
smoke passed:
`artifacts/c18z41-service-channel-rebuild-incident-actions-smoke-result.json`.
- C18Z42 service-channel rebuild correlation snapshots are implemented.
Backend `rap-backend:fabric-service-channel-0.2.224` is built and deployed
on docker-test; node-agent remains `0.2.208`. Migration `000028` adds
durable correlation/guard snapshot columns to
`fabric_service_channel_route_rebuild_attempts`, including node transition,
route-generation, post-rebuild traffic, guard status/severity/reason,
compact timeline, and `correlation_snapshot_at`. Deep enrichment now writes
the snapshot once; later deep/readiness/health/incidents reuse it and only
recompute age-sensitive guard state without scanning heartbeat history.
External summary ledger still strips guard/timeline fields to preserve the
fast C18Z38 contract. On docker-test, applying `000028` manually was required
before smoke because this manual backend redeploy path does not auto-apply
migrations. Live smoke passed twice; with warm snapshots, timings were roughly
summary 92 ms, deep 2 ms, and incidents 2 ms:
`artifacts/c18z42-service-channel-rebuild-correlation-snapshot-smoke-result.json`.
- C18Z43 service-channel schema preflight is implemented. Backend
`rap-backend:fabric-service-channel-0.2.225` is built and deployed on
docker-test; web-admin is redeployed. New endpoint
`GET /clusters/{clusterID}/fabric/service-channels/schema-status` checks the
DB relation/columns required by migration `000028` before operators rely on
rebuild health/readiness/incidents. Web-admin shows a Fabric schema preflight
panel beside service-channel readiness, with required/missing check counts and
operator action. Live smoke passed:
`artifacts/c18z43-service-channel-schema-preflight-smoke-result.json`.
- C18Z44 service-channel rebuild snapshot warmup is implemented. Backend
`rap-backend:fabric-service-channel-0.2.226` is built and deployed on
docker-test; web-admin is redeployed. New endpoint
`POST /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/warmup`
performs a bounded proactive pass over recent rebuild attempts. It fills
missing correlation snapshots, counts stale snapshots, and defers heavy stale
rescans because age-sensitive guard state is already recomputed from cached
snapshots on read. Web-admin adds a `warm snapshots` action and displays
warmed/fresh/missing/stale/deferred/error counts. Live smoke passed:
`artifacts/c18z44-service-channel-rebuild-snapshot-warmup-smoke-result.json`.
- C18Z45 service-channel rebuild snapshot auto-warmup is implemented. Backend
`rap-backend:fabric-service-channel-0.2.227` is built and deployed on
docker-test; node-agent remains `0.2.208`. Heartbeat processing now performs a
bounded missing-snapshot maintenance pass for the reporting node's recent
rebuild attempts. It only persists a snapshot when the heartbeat contains
runtime evidence such as post-rebuild traffic or matched route-manager/
route-generation state, preventing backend-only timelines from becoming stale
cache entries. Auto-warmup writes an audit event
`fabric.service_channel_rebuild_snapshot.auto_warmup` with trigger, heartbeat,
warmed route IDs, generations, rebuild IDs, counts, and errors. Live smoke
passed:
`artifacts/c18z45-service-channel-rebuild-snapshot-auto-warmup-smoke-result.json`.
- C18Z46 service-channel rebuild snapshot maintenance health is implemented.
Backend `rap-backend:fabric-service-channel-0.2.228` is built and deployed
on docker-test; web-admin is redeployed. New endpoint
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/health`
exposes bounded snapshot-cache maintenance status: recent attempt count,
valid/missing/overdue runtime-evidence snapshots, heartbeat threshold, latest
auto-warmup audit summary, and per-node warmed/error/missing counts. Web-admin
adds a `Snapshot maintenance` panel beside schema/readiness. Live smoke
passed:
`artifacts/c18z46-service-channel-rebuild-snapshot-health-smoke-result.json`.
- C18Z47 service-channel signed lease enforcement is implemented. Node-agent
release `0.2.230` is built, published under `/downloads`, registered as the
active `rap-node-agent` dev release, and deployed on docker-test
`test-1/2/3`; all three report `0.2.230`, healthy, and current after policy
update. When a cluster authority public key is pinned, the node-agent now
rejects unsigned `rap_fsc_*` service-channel requests and requires the
signed `rap.fabric_service_channel_lease_authority.v1` payload/signature
headers. Legacy unsigned tokens remain accepted only in unpinned test mode.
Live smoke proved unsigned POST is rejected with 403 while signed lease POST
is accepted with 202:
`artifacts/c18z47-service-channel-signed-lease-enforcement-smoke-result.json`.
- C18Z48 service-channel backend introspection compatibility is implemented.
Backend `rap-backend:fabric-service-channel-0.2.231` is built/deployed on
docker-test. Node-agent/host-agent artifacts `0.2.232` are published under
`/downloads`; `rap-node-agent` release `0.2.232` is registered and deployed
on `test-1/2/3`, and all three report healthy/current. When signed
service-channel authority headers are absent but cluster authority is pinned,
node-agent now calls backend lease introspection before accepting an unsigned
token. Bad tokens are still rejected. Live smoke passed:
`artifacts/c18z48-service-channel-introspection-smoke-result.json`.
- C18Z49 service-channel acceptance telemetry is implemented in node-agent
`0.2.232`. Each accepted Fabric Service Channel ingress records
`accepted_by=signed|introspection|legacy_unsigned`, route preference, and
backend-fallback state in structured node logs. HTTP packet ingress also
returns `X-RAP-Service-Channel-Accepted-By` for smoke/diagnostics.
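
Taken together, C18Z47 through C18Z49 describe a small decision tree for ingress acceptance. The sketch below is an approximation under assumptions: the function name, its parameters, and the use of `None` for rejection are hypothetical, and real signature verification is replaced by a precomputed flag. The three `accepted_by` values are the ones listed above.

```python
def accepted_by(token, authority_pinned, signature_ok=None, introspect=lambda t: False):
    """Decide how a service-channel request is accepted, or None to reject.

    signature_ok is None when the signed authority headers are absent,
    otherwise the (precomputed) result of verifying them.
    """
    if signature_ok is True:
        return "signed"
    if signature_ok is False:
        return None  # headers present but invalid: reject
    if authority_pinned:
        # Headers absent on a pinned cluster: ask the backend before accepting.
        return "introspection" if introspect(token) else None
    # Unpinned test mode only.
    return "legacy_unsigned"


known_tokens = {"rap_fsc_example"}  # hypothetical token value


def backend_lookup(token):
    """Stand-in for backend lease introspection."""
    return token in known_tokens
```

A caller would translate `None` into a rejection response and any other value into an accepted response carrying `X-RAP-Service-Channel-Accepted-By`.
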
- C18Z50 durable service-channel lease introspection is implemented. Migration
`000029_fabric_service_channel_leases` adds a durable lease table keyed by
cluster/channel and stores only `token_hash` plus a scrubbed lease payload
with the raw bearer token removed. Backend
`rap-backend:fabric-service-channel-0.2.233` is built/deployed on
docker-test after applying the migration. Introspection now reads memory
first, then durable storage, so compatibility clients survive backend
restart. Live smoke restarted `rap_test_backend`, accepted the unsigned token
through introspection, rejected a bad token, and verified the durable lease
omits the raw token:
`artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`.
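
The C18Z50 storage rule (hash plus scrubbed payload, memory first, then durable) can be sketched with two in-process dicts standing in for the cache and the lease table. The class name, the SHA-256 choice, and the lease dict shape are assumptions; lease expiry is omitted for brevity.

```python
import hashlib
import hmac


def token_hash(token):
    """Hash of the bearer token; the hash algorithm here is an assumption."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class LeaseStore:
    """In-memory stand-in for the lease cache plus the durable table."""

    def __init__(self):
        self.memory = {}   # (cluster_id, channel_id) -> full lease, lost on restart
        self.durable = {}  # (cluster_id, channel_id) -> token_hash + scrubbed payload

    def issue(self, cluster_id, channel_id, lease):
        key = (cluster_id, channel_id)
        self.memory[key] = lease
        scrubbed = {k: v for k, v in lease.items() if k != "token"}
        self.durable[key] = {"token_hash": token_hash(lease["token"]), "payload": scrubbed}

    def restart(self):
        self.memory.clear()

    def introspect(self, cluster_id, channel_id, token):
        key = (cluster_id, channel_id)
        lease = self.memory.get(key)
        if lease is not None:
            return hmac.compare_digest(lease["token"], token)
        row = self.durable.get(key)
        # Durable path compares hashes only; the raw token is never stored.
        return row is not None and hmac.compare_digest(row["token_hash"], token_hash(token))
```

After `restart()` the memory map is empty, yet a valid token still introspects successfully through its stored hash.
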
- C18Z51 service-channel lease maintenance is implemented. Backend
`rap-backend:fabric-service-channel-0.2.234` is built/deployed on
docker-test. New endpoints list durable service-channel lease maintenance
state and run bounded expired-lease cleanup:
`GET /clusters/{clusterID}/fabric/service-channels/leases` and
`POST /clusters/{clusterID}/fabric/service-channels/leases/cleanup`.
Web-admin adds a `Service-channel leases` panel with active/expired counts,
recent lease rows, and cleanup action. Live smoke issued a 1-second lease,
observed it as expired, cleaned it up, and verified it disappeared:
`artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`.
- C18Z52 service-channel access telemetry visibility is implemented. Backend
`rap-backend:fabric-service-channel-0.2.235` is built/deployed on
docker-test; node-agent/host-agent `0.2.235` artifacts are published under
`/downloads`, registered as active dev releases, and deployed on
`test-1/2/3`. Node-agent now reports accepted service-channel ingress
counters by `signed`, `introspection`, and `legacy_unsigned`, including
backend-fallback count and last accepted timestamp. Backend exposes
`GET /clusters/{clusterID}/fabric/service-channels/access-telemetry`,
reading telemetry observations with heartbeat metadata fallback. Web-admin
adds a `Service-channel access` panel with cluster totals and per-node rows.
Live smoke sent packets through test-1, observed
`X-RAP-Service-Channel-Accepted-By: introspection`, and verified backend
aggregate visibility:
`artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`.
- C18Z53 service-channel access/session correlation is implemented. Backend
`rap-backend:fabric-service-channel-0.2.236` is built/deployed on
docker-test; node-agent remains `0.2.235`. The access telemetry endpoint now
correlates accepted ingress counters with active durable service-channel
leases, selected entry/exit nodes, primary route status, explicit backend
fallback, and latest route-quality feedback when a route exists. Web-admin's
`Service-channel access` panel now shows active channel rows before per-node
counters, so operators can see whether a live service channel is using normal
route quality feedback or degraded backend fallback. Live smoke created an
active lease, sent ingress traffic through test-1, and verified active
channel correlation plus fallback visibility:
`artifacts/c18z53-service-channel-access-correlation-smoke-result.json`.
- C18Z54 normal-route access correlation is smoke-proven on the existing
C18Z53 backend/admin surface. New smoke creates a temporary direct
`vpn_packets` route intent, injects healthy route-quality heartbeat
telemetry, issues a service-channel lease that selects the normal primary
route, sends ingress traffic, and verifies the access telemetry active
channel row is `ready`, not backend fallback, with `route_feedback_status`
`healthy`, rolling quality counters, and last send duration:
`artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`.
- C18Z55 degraded normal-route access correlation is smoke-proven on the same
backend/admin surface. The smoke first issues a lease on a normal primary
`vpn_packets` route, then injects degraded/fenced route-quality heartbeat
feedback for that already-selected route. Access telemetry correctly reports
the active channel as `ready` and `force_backend_fallback=false`, while route
feedback is `fenced`, rolling failure/drop/slow counters are visible, and the
aggregate access status becomes `degraded` because `degraded_route_count > 0`:
`artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`.
- C18Z56 active-channel remediation diagnostics are implemented. Backend
`rap-backend:fabric-service-channel-0.2.237` is built/deployed on
docker-test; node-agent remains `0.2.235`. Active access telemetry channel
rows now include `remediation_action`, `remediation_reason`,
`remediation_route_id`, `remediation_route_status`, and an operator hint.
Decisions distinguish explicit backend fallback, degraded/fenced normal
route with an authorized alternate (`prefer_alternate_route`), degraded/fenced
route needing rebuild (`rebuild_route`), and healthy route (`none`).
Web-admin shows the remediation action in the `Service-channel access`
active-channel table. C18Z55 smoke now verifies
`remediation_action=rebuild_route`; backend unit coverage verifies the
alternate-route remediation branch.
- C18Z56 alternate-route remediation is also live-smoke-proven. New smoke
creates primary and authorized alternate `vpn_packets` routes, issues a lease
while primary is still healthy/selected, then injects fenced feedback for the
selected primary. Access telemetry keeps the active channel on the normal
route with `force_backend_fallback=false`, reports `route_feedback_status`
`fenced`, and recommends `remediation_action=prefer_alternate_route` with the
alternate route id/status; `degraded_fallback_channel_count` stays zero:
`artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`.
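
The C18Z56 remediation branches can be summarised in a few lines. In this sketch the function name, the alternate dict shape, and the `explicit_backend_fallback` label are assumptions (the entry above does not name the action used for explicit fallback); `prefer_alternate_route`, `rebuild_route`, and `none` are taken from the text.

```python
def remediation(force_backend_fallback, route_feedback_status, alternates):
    """Return (remediation_action, remediation_route_id) for an active channel."""
    if force_backend_fallback:
        return ("explicit_backend_fallback", None)  # label assumed, not from the text
    if route_feedback_status in ("degraded", "fenced"):
        for alt in alternates:
            if alt["status"] == "authorized":
                return ("prefer_alternate_route", alt["route_id"])
        # No usable alternate: the selected route has to be rebuilt.
        return ("rebuild_route", None)
    return ("none", None)
```

This matches the two live smokes above: a fenced primary with no alternate yields `rebuild_route`, and the same primary with an authorized alternate yields `prefer_alternate_route`.
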
- C18Z57 bounded remediation command contract is implemented. Backend
`rap-backend:fabric-service-channel-0.2.238` is built/deployed on
docker-test; node-agent remains `0.2.235`. Active access telemetry channel
rows now include `remediation_command` for non-noop remediation actions, with
schema version, deterministic command id, action, channel/resource/service,
entry/exit, primary route, replacement route when present, reason/operator
hint, issued time, and a bounded TTL capped to the lease lifetime. Web-admin
marks remediation rows with `cmd` when this machine-readable command is
present. Live smoke proves a fenced selected primary route with an authorized
alternate emits a `prefer_alternate_route` command pointing at the alternate:
`artifacts/c18z57-service-channel-remediation-command-smoke-result.json`.
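
Two properties of the C18Z57 command are easy to illustrate: a deterministic command id and a TTL that never outlives the lease. The sketch below assumes Unix-second timestamps; the function name, the id derivation (truncated SHA-256 over a JSON identity tuple), the `max_ttl_s` default, the field names, and the schema version string are all hypothetical placeholders.

```python
import hashlib
import json


def build_command(action, channel_id, primary_route_id, replacement_route_id,
                  issued_at, lease_expires_at, max_ttl_s=300):
    """Build a machine-readable remediation command, or None for a no-op."""
    if action == "none":
        return None
    # Same inputs always give the same id, so repeated reads do not mint new commands.
    identity = json.dumps([action, channel_id, primary_route_id, replacement_route_id],
                          separators=(",", ":"))
    command_id = hashlib.sha256(identity.encode("utf-8")).hexdigest()[:32]
    ttl_s = max(0, min(max_ttl_s, lease_expires_at - issued_at))
    return {
        "schema_version": "example.v1",  # placeholder, real value not given above
        "command_id": command_id,
        "action": action,
        "channel_id": channel_id,
        "primary_route_id": primary_route_id,
        "replacement_route_id": replacement_route_id,
        "issued_at": issued_at,
        "ttl_seconds": ttl_s,
    }
```

A consumer such as a node agent can use the stable id to deduplicate a command it has already applied when the same projection arrives again.
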
- C18Z58 service-channel remediation command consumption is implemented.
Backend `rap-backend:fabric-service-channel-0.2.239` and node-agent
`rap-node-agent:0.2.237` are built/deployed on docker-test (`test-1/2/3`).
Backend now projects active `remediation_command` items into node-scoped
synthetic mesh config as `service_channel_remediation_commands`. Node-agent
parses those commands and turns `prefer_alternate_route` into an explicit
route-manager `applied` decision with source
`service_channel_remediation_command`, so an active channel that still
presents the old primary route can be routed through the replacement route.
Web-admin node details show remediation-command count/table in the Mesh tab.
Live smoke proves access telemetry, synthetic config projection, and
node-agent route-manager consumption:
`artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`.
- C18Z59 active remediation traffic proof is smoke-proven on the same
backend/node-agent images with production forwarding enabled on docker-test
`test-1/2/3`. The smoke sends service-channel traffic before/after the
remediation command is consumed, then verifies runtime heartbeat evidence:
`last_selected_route_id` and flow-scheduler `last_route_id` move to the
replacement route, `send_successes=1`, `send_failures=0`,
`send_fallback_local=0`, and no degraded backend fallback is recommended.
Result:
`artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`.
- C18Z60 multi-flow remediation traffic proof is smoke-proven. The smoke sends
a batch of twelve IPv4/TCP-like packets that classify into multiple
independent VPN flow channels after the remediation command is consumed.
Runtime heartbeat evidence shows the replacement route selected, at least two
flow-scheduler channels on that route, no local/backend fallback, no flow
drops, and no route send failures. Result:
`artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`.
- C18Z61 pressure remediation traffic proof is smoke-proven. The smoke sends a
batch of 128 IPv4/TCP-like packets after remediation; runtime evidence shows
32 replacement-route flow stats, scheduler high-watermark 5,
max-in-flight 4, `send_fallback_local=0`, route failures 0, and flow/scheduler
drops 0. Result:
`artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`.
- C18Z62 service-channel QoS class wiring is implemented in node-agent and
live-smoke-proven on docker-test image `rap-node-agent:0.2.238-c18z62`.
Service-channel HTTP ingress accepts neutral `X-RAP-Traffic-Class`
(`control`, `interactive`, `reliable`, `bulk`, `droppable`) and the flow
scheduler keeps distinct traffic-class channel ids/stats while preserving the
old default bulk channel ids. Unit tests prove priority ordering
`control > interactive > reliable > bulk > droppable`; live smoke proves a
bulk 128-packet pressure batch plus an interactive packet both move through
the remediation replacement route with no local/backend fallback, drops, or
route failures. Result:
`artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`.
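
The C18Z62 priority ordering could be expressed as a small rank table. The sketch below is illustrative only: `classRank`, `normalizeClass`, and `sortByPriority` are hypothetical names, and mapping an empty or unknown `X-RAP-Traffic-Class` value to `bulk` is an assumption based on the entry's note that the old default bulk channel ids are preserved.

```go
package main

import (
	"fmt"
	"sort"
)

// classRank encodes control > interactive > reliable > bulk > droppable.
// A lower rank means a higher priority.
var classRank = map[string]int{
	"control":     0,
	"interactive": 1,
	"reliable":    2,
	"bulk":        3,
	"droppable":   4,
}

// normalizeClass returns the header value when it is a known class and falls
// back to "bulk" otherwise (assumed default, not confirmed by the source).
func normalizeClass(header string) string {
	if _, ok := classRank[header]; ok {
		return header
	}
	return "bulk"
}

// sortByPriority returns a copy of classes ordered from highest to lowest
// priority; equal-priority entries keep their relative order.
func sortByPriority(classes []string) []string {
	out := append([]string(nil), classes...)
	sort.SliceStable(out, func(i, j int) bool {
		return classRank[normalizeClass(out[i])] < classRank[normalizeClass(out[j])]
	})
	return out
}

func main() {
	fmt.Println(sortByPriority([]string{"bulk", "droppable", "control", "reliable", "interactive"}))
}
```
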
- C18Z63 concurrent QoS isolation is implemented and unit-proven. A controlled
runtime test holds a bulk traffic-class send in-flight with a blocking
production transport, then sends an independent interactive traffic-class
packet through the same ingress; the interactive send completes before the
bulk release, with `MaxInFlight >= 2`, traffic-class-specific stats, no drops,
and no failures. This proves the shared Fabric Service Channel runtime does
not globally serialize interactive/control-style traffic behind bulk work.
Artifact:
`artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`.
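
One way to picture the C18Z63 isolation property is a scheduler that gives each traffic class its own in-flight window, so a blocked bulk send cannot hold up an interactive one. The `Scheduler` type, its per-class buffered channels, and `isolationDemo` below are assumptions made for illustration; the real runtime test and scheduler internals are not shown in this document.

```go
package main

import (
	"fmt"
	"sync"
)

// Scheduler is a hypothetical per-class window scheduler.
type Scheduler struct {
	mu          sync.Mutex
	windows     map[string]chan struct{}
	inFlight    int
	MaxInFlight int
}

// NewScheduler registers one window of the given size per traffic class.
func NewScheduler(window int, classes ...string) *Scheduler {
	s := &Scheduler{windows: map[string]chan struct{}{}}
	for _, c := range classes {
		s.windows[c] = make(chan struct{}, window)
	}
	return s
}

// Send runs transport inside the window of its own class only. It returns
// false when the class was never registered.
func (s *Scheduler) Send(class string, transport func()) bool {
	slot, ok := s.windows[class]
	if !ok {
		return false
	}
	slot <- struct{}{} // blocks only when this class's window is full
	s.mu.Lock()
	s.inFlight++
	if s.inFlight > s.MaxInFlight {
		s.MaxInFlight = s.inFlight
	}
	s.mu.Unlock()
	transport()
	s.mu.Lock()
	s.inFlight--
	s.mu.Unlock()
	<-slot
	return true
}

// isolationDemo holds a bulk send in flight with a blocking transport, then
// sends an interactive packet and reports whether it finished first.
func isolationDemo() (interactiveFirst bool, maxInFlight int) {
	s := NewScheduler(1, "bulk", "interactive")
	started, release, bulkDone := make(chan struct{}), make(chan struct{}), make(chan struct{})
	go func() {
		s.Send("bulk", func() { close(started); <-release })
		close(bulkDone)
	}()
	<-started
	s.Send("interactive", func() {})
	select {
	case <-bulkDone:
	default:
		interactiveFirst = true // bulk is still blocked, interactive is done
	}
	close(release)
	<-bulkDone
	s.mu.Lock()
	defer s.mu.Unlock()
	return interactiveFirst, s.MaxInFlight
}

func main() {
	fmt.Println(isolationDemo())
}
```
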
- C18Z64 traffic-class telemetry aggregation is implemented and live-proven on
docker-test image `rap-node-agent:0.2.239-c18z64`. `rap.fabric_flow_scheduler.v1`
snapshots now include `traffic_class_counts`, giving backend/admin/diagnostics
a compact count of active flow channels per traffic class without scanning
every channel stat. Unit coverage proves the counts for explicit
control/interactive/bulk classes and for the concurrent bulk+interactive
isolation case. Live smoke re-ran the QoS path on `test-1/2/3`; latest
heartbeat snapshot showed `traffic_class_counts` `bulk=32`,
`interactive=12`, drops 0. Artifacts:
`artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`,
`artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`,
and
`artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`.
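
As a rough sketch of the `traffic_class_counts` aggregation in C18Z64: the snapshot builder walks the channel stats once and emits a compact per-class count. `FlowChannelStat` and `trafficClassCounts` are hypothetical names, and counting a channel with an empty class as `bulk` is an assumption.

```go
package main

import "fmt"

// FlowChannelStat is a hypothetical per-channel stat entry.
type FlowChannelStat struct {
	ChannelID    string
	TrafficClass string
}

// trafficClassCounts returns the number of active flow channels per traffic
// class. Channels without a class are counted as bulk (assumed default).
func trafficClassCounts(stats []FlowChannelStat) map[string]int {
	counts := map[string]int{}
	for _, s := range stats {
		class := s.TrafficClass
		if class == "" {
			class = "bulk"
		}
		counts[class]++
	}
	return counts
}

func main() {
	stats := []FlowChannelStat{
		{ChannelID: "f1", TrafficClass: "bulk"},
		{ChannelID: "f2", TrafficClass: "interactive"},
		{ChannelID: "f3"},
	}
	fmt.Println(trafficClassCounts(stats))
}
```
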
- C18Z65/C18Z66 backend/admin QoS diagnostics are implemented and live-proven.
Backend `rap-backend:fabric-service-channel-0.2.241-c18z66` is deployed on
docker-test and projects runtime `traffic_class_counts`, flow channel count,
max in-flight, dropped, and high-watermark from node heartbeats into
`GET /fabric/service-channels/access-telemetry` at node, active-channel, and
  cluster aggregate levels. The web-admin Service-channel access view shows
  flow QoS chips/rows for cluster totals, active channels, and nodes.
  Live API aggregate
result showed `bulk=32`, `interactive=12`, `flow_channel_count=44`,
`flow_max_in_flight=4`. Artifacts:
`artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`,
`artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`,
and
`artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`.
- C18Z67 live concurrent QoS proof is implemented and smoke-proven against
docker-test backend `rap-backend:fabric-service-channel-0.2.241-c18z66` and
node-agent image `rap-node-agent:0.2.239-c18z64`. The smoke pushes six
parallel bulk service-channel HTTP packet requests while an interactive
traffic-class request is injected through the same entry path after
remediation. Run `c18z67-20260508-213452` accepted all 6 bulk requests,
forwarded 3072 post-remediation packets, completed the interactive request in
132 ms, observed 32 bulk and 12 interactive replacement-route flow stats, and
kept local/backend fallback, route failures, flow drops, and scheduler drops
at 0. Artifact:
`artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`.
- C18Z68 service-channel flow-health guard is implemented and deployed on
docker-test as `rap-backend:fabric-service-channel-0.2.242-c18z68`, with
web-admin rebuilt/deployed. Access telemetry now projects
`flow_health_status` and `flow_health_reason` at cluster, node, and
active-channel levels from traffic-class counts, queue pressure, flow drops,
backend fallback, route-quality failures/drops/slow samples, and route send
latency. Web-admin shows explicit flow-health chips beside flow QoS so
sustained bulk pressure, degraded latency, fallback, and drops are visible
before adding user services. Verification passed:
`go test ./internal/modules/cluster`, web-admin `npm run build`, updated
C18Z67 live smoke against backend `0.2.242-c18z68`, and live API artifact
`artifacts/c18z68-service-channel-flow-health-api-result.json`.
- C18Z69 node-side adaptive backpressure is implemented and deployed on
docker-test image `rap-node-agent:0.2.243-c18z69` for `test-1/2/3`.
`FabricFlowScheduler` now calculates per-traffic-class
`recommended_parallel_windows` and reports `adaptive_backpressure_active` /
`adaptive_backpressure_reason` in runtime heartbeat snapshots. Bulk and
droppable classes are reduced first under pressure, reliable is reduced
moderately, while control/interactive keep their full window unless their own
  class has drops/failures/slow samples. Live C18Z69 smoke wraps the C18Z67
  pressure path and verified recommended windows `bulk=1`, `droppable=1`,
  `reliable=3`, `interactive=4`, `control=4`, traffic-class counts `bulk=32`,
  `interactive=12`, high-watermark 72, max-in-flight 4, drops 0, and reason
  `bulk_window_reduced_to_protect_interactive`. Artifacts:
`artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json` and
`artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`.
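
The window calculation described for C18Z69 might look something like the sketch below. The exact reduction rules are not stated in this document, so the formulas are assumptions chosen to reproduce the live values (`bulk=1`, `droppable=1`, `reliable=3`, `interactive=4`, `control=4` at a maximum window of 4): bulk and droppable drop to a pressure window, reliable drops to three quarters of the maximum, and control/interactive are halved only when their own class reports trouble. `Input`, `ClassSignals`, and `recommendedWindows` are hypothetical names.

```go
package main

import "fmt"

// ClassSignals holds per-class trouble counters (hypothetical shape).
type ClassSignals struct{ Drops, Failures, SlowSamples int }

// Input is a hypothetical scheduler input for one evaluation.
type Input struct {
	MaxWindow          int
	BulkPressureWindow int
	UnderPressure      bool
	Signals            map[string]ClassSignals
}

func atLeastOne(n int) int {
	if n < 1 {
		return 1
	}
	return n
}

// recommendedWindows returns per-class windows, whether adaptive
// backpressure is active, and the reason string.
func recommendedWindows(in Input) (map[string]int, bool, string) {
	windows := map[string]int{}
	for _, c := range []string{"control", "interactive", "reliable", "bulk", "droppable"} {
		windows[c] = in.MaxWindow
	}
	if !in.UnderPressure {
		return windows, false, ""
	}
	// Bulk and droppable are reduced first (assumed: straight to the pressure window).
	windows["bulk"] = atLeastOne(in.BulkPressureWindow)
	windows["droppable"] = atLeastOne(in.BulkPressureWindow)
	// Reliable is reduced moderately (assumed: three quarters of the maximum).
	windows["reliable"] = atLeastOne(in.MaxWindow * 3 / 4)
	// Control and interactive keep the full window unless their own class has issues.
	for _, c := range []string{"control", "interactive"} {
		s := in.Signals[c]
		if s.Drops+s.Failures+s.SlowSamples > 0 {
			windows[c] = atLeastOne(in.MaxWindow / 2)
		}
	}
	return windows, true, "bulk_window_reduced_to_protect_interactive"
}

func main() {
	w, active, reason := recommendedWindows(Input{MaxWindow: 4, BulkPressureWindow: 1, UnderPressure: true})
	fmt.Println(w, active, reason)
}
```
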
- C18Z70 backend/admin adaptive backpressure visibility is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.244-c18z70`; web-admin is rebuilt and
deployed. Access telemetry now projects node-agent
`recommended_parallel_windows`, `adaptive_backpressure_active`, and
`adaptive_backpressure_reason` at cluster, node, and active-channel levels.
Cluster aggregation uses the minimum non-zero recommended window per class,
so the operator sees the most conservative active runtime limit. Web-admin
shows adaptive windows next to flow health and flow QoS. Live API returned
`adaptive=true`, reason `bulk_window_reduced_to_protect_interactive`, and
windows `bulk=1`, `droppable=1`, `reliable=3`, `interactive=4`,
`control=4`. Verification passed: `go test ./internal/modules/cluster`,
web-admin `npm run build`, C18Z69 live smoke, and
`artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`.
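
The cluster aggregation rule from C18Z70 (minimum non-zero recommended window per class) can be sketched as below. `aggregateWindows` is a hypothetical helper; it treats a zero or missing value as "no report for this class" rather than as a limit.

```go
package main

import "fmt"

// aggregateWindows folds per-node recommended windows into one cluster view,
// keeping the smallest non-zero window seen for each traffic class.
func aggregateWindows(nodes []map[string]int) map[string]int {
	out := map[string]int{}
	for _, node := range nodes {
		for class, window := range node {
			if window <= 0 {
				continue // zero means the node reported nothing for this class
			}
			if cur, ok := out[class]; !ok || window < cur {
				out[class] = window
			}
		}
	}
	return out
}

func main() {
	fmt.Println(aggregateWindows([]map[string]int{
		{"bulk": 4, "interactive": 4},
		{"bulk": 1, "interactive": 0},
		{"bulk": 2},
	}))
}
```
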
- C18Z71 adaptive policy contract is implemented and deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.245-c18z71` with node-agent image
`rap-node-agent:0.2.245-c18z71` on `test-1/2/3`. Backend exposes audited
`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy` for
max parallel window, queue/bulk pressure thresholds, and per-class windows.
The effective policy is embedded in signed node synthetic config and
node-agent runtime heartbeat snapshots now report
`adaptive_policy_fingerprint`. The scheduler consumes the policy at runtime:
default policy preserves the C18Z69 behavior, while the C18Z71 live smoke
proved an operator policy can raise max window to 6 and bulk pressure window
to 2 while keeping interactive/control at 6. During smoke, a signed synthetic
config hash mismatch was found and fixed by preserving adaptive policy
provenance fields in the node-agent client model. Verification passed:
`go test ./internal/modules/cluster`,
`go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/client ./internal/config`,
web-admin `npm run build`, C18Z71 live smoke, and C18Z69 regression smoke.
Artifacts:
`artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json` and
`artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`.
- C18Z72 service-channel pool/failover policy contract is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.246-c18z72`; node-agent remains
`rap-node-agent:0.2.245-c18z71` on `test-1/2/3`. Backend exposes audited
`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy` for
entry/exit pool constraints, preferred entry/exit, selection strategy,
route/entry/exit failover modes, backend fallback allowance, and sticky
session mode. Lease issuance now applies the effective policy before route
selection, constrains `entry_pool`/`exit_pool`, chooses policy preferred
nodes when present, embeds `pool_policy` provenance in the lease, and signs
it into `rap.fabric_service_channel_lease_authority.v1`. Web-admin API/types
know the new policy contract. Verification passed:
`go test ./internal/modules/cluster`, web-admin `npm run build`,
C18Z72 live smoke, and C18Z71 regression smoke. Artifact:
`artifacts/c18z72-service-channel-pool-policy-smoke-result.json`.
- C18Z73 pool-policy remediation guard and telemetry are implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.247-c18z73` with node-agent image
`rap-node-agent:0.2.247-c18z73` on `test-1/2/3`; web-admin is rebuilt and
deployed. Active access telemetry now projects the signed
`pool_policy_fingerprint`, remediation guard status/reason, and guarded
remediation commands. Backend remediation rejects an alternate route outside
the signed entry/exit lease pools and emits `rebuild_route` instead of
`prefer_alternate_route`; node-agent defensively ignores guarded rejected
remediation commands before route-manager application. Web-admin shows guard
chips in access telemetry and node synthetic-config remediation rows.
Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
web-admin `npm run build`, C18Z73 live smoke, C18Z72 regression smoke, and
C18Z71/C18Z67 live regression smoke. Artifacts:
`artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`,
`artifacts/c18z72-service-channel-pool-policy-smoke-result.json`,
`artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`, and
`artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`.
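
The C18Z73 guard decision can be illustrated as a pool-membership check on the candidate alternate route. This is a sketch under assumptions: `Lease`, `Route`, `guardedCommand`, and the guard status/reason strings `allowed`, `rejected`, and `alternate_outside_signed_pool` are hypothetical, and membership is checked strictly (an empty pool admits nothing).

```go
package main

import "fmt"

// Lease carries the signed entry/exit pools (hypothetical shape).
type Lease struct {
	EntryPool []string
	ExitPool  []string
}

// Route is a hypothetical candidate alternate route.
type Route struct {
	ID        string
	EntryNode string
	ExitNode  string
}

func contains(pool []string, node string) bool {
	for _, n := range pool {
		if n == node {
			return true
		}
	}
	return false
}

// guardedCommand picks the remediation command for an alternate route.
// An alternate outside the signed pools is never preferred; the backend
// falls back to rebuild_route and marks the command as guarded.
func guardedCommand(lease Lease, alt Route) (command, guardStatus, guardReason string) {
	if contains(lease.EntryPool, alt.EntryNode) && contains(lease.ExitPool, alt.ExitNode) {
		return "prefer_alternate_route", "allowed", ""
	}
	return "rebuild_route", "rejected", "alternate_outside_signed_pool"
}

func main() {
	lease := Lease{EntryPool: []string{"test-1"}, ExitPool: []string{"test-2"}}
	fmt.Println(guardedCommand(lease, Route{ID: "r-ok", EntryNode: "test-1", ExitNode: "test-2"}))
	fmt.Println(guardedCommand(lease, Route{ID: "r-out", EntryNode: "test-1", ExitNode: "test-3"}))
}
```
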
- C18Z74 service-channel remediation execution visibility is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.248-c18z74` with node-agent image
`rap-node-agent:0.2.248-c18z74` on `test-1/2/3`; web-admin is rebuilt and
deployed. Active access telemetry now computes
`remediation_execution_status`, reason, generation, and observed timestamp
by correlating active remediation commands with the entry node's latest
route-manager heartbeat. `prefer_alternate_route` commands show
`waiting_node_apply` until the node reports a matching route-manager decision
and then `applied`; guarded commands show `rejected_by_policy_guard`; bounded
`rebuild_route` commands show `pending_rebuild_request`. The execution state
is copied into the machine-readable remediation command and displayed in
web-admin access telemetry / node synthetic remediation rows. Verification
passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
web-admin `npm run build`, C18Z74 live smoke, C18Z73 regression smoke, and
C18Z72 regression smoke. Artifacts:
`artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`,
`artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`,
`artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`,
and `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`.
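
The C18Z74 status derivation reads naturally as a small state function. The sketch below assumes a command is matched to a heartbeat decision by decision source plus route id; the real correlation keys (for example generation) are not spelled out here, and `Command`, `RouteManagerDecision`, and `executionStatus` are hypothetical names.

```go
package main

import "fmt"

// Command is a hypothetical active remediation command.
type Command struct {
	Kind             string
	AlternateRouteID string
	GuardRejected    bool
}

// RouteManagerDecision is a hypothetical entry from the latest heartbeat.
type RouteManagerDecision struct {
	RouteID string
	Source  string
}

// executionStatus derives remediation_execution_status for one command from
// the entry node's latest route-manager heartbeat.
func executionStatus(cmd Command, heartbeat []RouteManagerDecision) string {
	if cmd.GuardRejected {
		return "rejected_by_policy_guard"
	}
	switch cmd.Kind {
	case "rebuild_route":
		return "pending_rebuild_request"
	case "prefer_alternate_route":
		for _, d := range heartbeat {
			if d.Source == "service_channel_remediation_command" && d.RouteID == cmd.AlternateRouteID {
				return "applied"
			}
		}
		return "waiting_node_apply"
	}
	return "" // unknown command kinds carry no execution status in this sketch
}

func main() {
	cmd := Command{Kind: "prefer_alternate_route", AlternateRouteID: "route-b"}
	fmt.Println(executionStatus(cmd, nil))
	fmt.Println(executionStatus(cmd, []RouteManagerDecision{
		{RouteID: "route-b", Source: "service_channel_remediation_command"},
	}))
}
```
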
- C18Z75 durable remediation rebuild intent foundation is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.249-c18z75`; node-agent remains
`rap-node-agent:0.2.248-c18z74` on `test-1/2/3`. When a node fetches
synthetic config containing a `rebuild_route` remediation command, backend
now records a durable row in the existing
`fabric_service_channel_route_rebuild_attempts` ledger with
`rebuild_status=requested` / `outcome=rebuild_requested`, or
`rebuild_status=rejected` / `outcome=policy_guard_rejected` when the pool
policy guard rejects it. Access telemetry correlates that ledger row back to
the active channel and reports `rebuild_request_recorded` or
`rebuild_request_rejected` in `remediation_execution_status`. The C18Z75
smoke isolates a route pair, proves `rebuild_route`, fetches synthetic
config to persist the intent, verifies the rebuild ledger row, and verifies
access telemetry reports the recorded execution state. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
web-admin `npm run build`, C18Z75 live smoke, C18Z73 regression smoke, and
C18Z72 regression smoke. Artifacts:
`artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`,
`artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`,
and `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`.
- C18Z76 service-channel rebuild-route node acknowledgement is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.250-c18z76` with node-agent image
`rap-node-agent:0.2.250-c18z76` on `test-1/2/3`. Node-agent now consumes
allowed `rebuild_route` remediation commands as route-manager decisions with
`rebuild_status=pending_degraded_fallback` and
`decision_source=service_channel_remediation_command`; guarded commands are
still ignored. Backend access telemetry correlates this route-manager
acknowledgement with the durable ledger intent and reports
`rebuild_request_recorded_node_pending`. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z76 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression
smoke. Artifacts:
`artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`,
`artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`,
`artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`,
and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`.
- C18Z77 service-channel rebuild planner resolution is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.251-c18z77` with node-agent image
`rap-node-agent:0.2.251-c18z77` on `test-1/2/3`. Backend now resolves
durable `rebuild_route` remediation requests during node-scoped synthetic
config generation: it keeps lease pool-policy guardrails, records
`applied` / `replacement_selected` when a signed-pool-valid alternate route
exists, records `no_alternate` when no safe alternate exists, records
`deferred_by_policy` when the active lease cannot authorize the replacement,
and records `expired` for stale commands. When a replacement is applied, the
same command id is projected as a route-manager decision so node-agent can
consume the resolved planner decision without duplicating the raw command.
Access telemetry reports planner states such as `rebuild_request_applied`
and `rebuild_request_no_alternate`. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z77 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression
smoke. Artifacts:
`artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`,
`artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`,
`artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`,
and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`.
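
The planner branches in C18Z77 can be summarized by a resolution function like the one below. How the backend distinguishes `no_alternate` from `deferred_by_policy` is not detailed in this document; the sketch assumes that `no_alternate` means no healthy candidate exists at all, while `deferred_by_policy` means healthy candidates exist but none falls inside the signed lease pools. `Candidate`, `Resolution`, and `resolveRebuild` are hypothetical names, and timestamps are plain Unix seconds.

```go
package main

import "fmt"

// Candidate is a hypothetical alternate route seen by the planner.
type Candidate struct {
	RouteID     string
	Healthy     bool
	InLeasePool bool
}

// Resolution is a hypothetical ledger outcome for one rebuild request.
type Resolution struct {
	Status             string
	Outcome            string
	ReplacementRouteID string
}

// resolveRebuild resolves a durable rebuild_route request during synthetic
// config generation.
func resolveRebuild(expiresAtUnix, nowUnix int64, candidates []Candidate) Resolution {
	if nowUnix >= expiresAtUnix {
		return Resolution{Status: "expired"}
	}
	sawHealthy := false
	for _, c := range candidates {
		if !c.Healthy {
			continue
		}
		sawHealthy = true
		if c.InLeasePool {
			return Resolution{
				Status:             "applied",
				Outcome:            "replacement_selected",
				ReplacementRouteID: c.RouteID,
			}
		}
	}
	if sawHealthy {
		// A usable route exists, but the active lease cannot authorize it.
		return Resolution{Status: "deferred_by_policy"}
	}
	return Resolution{Status: "no_alternate"}
}

func main() {
	cands := []Candidate{
		{RouteID: "r1", Healthy: false, InLeasePool: true},
		{RouteID: "r2", Healthy: true, InLeasePool: true},
	}
	fmt.Println(resolveRebuild(200, 100, cands))
}
```
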
- C18Z78 service-channel rebuild planner applied-branch visibility is
implemented and deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.252-c18z78` with node-agent image
`rap-node-agent:0.2.252-c18z78` on `test-1/2/3`; web-admin is rebuilt and
deployed to `rap_web_admin`. The admin access-telemetry execution column and
node synthetic remediation rows now render planner outcomes with explicit
labels and tones: `rebuild_request_applied` is good,
`rebuild_request_recorded(_node_pending)`, `rebuild_request_no_alternate`,
and `rebuild_request_deferred_by_policy` are warning states, while rejected
or expired requests are bad states. The C18Z78 live smoke proves the applied
planner branch: a primary route is leased first, the primary route is then
degraded, an alternate route is added after the lease, synthetic config
fetch resolves the existing `rebuild_route` command to `applied` /
`replacement_selected`, and access telemetry reports
`rebuild_request_applied`. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
web-admin `npm run build`, C18Z78 live smoke, C18Z77 regression smoke, and
C18Z74/C18Z67 regression smoke. Artifacts:
`artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`,
`artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`,
`artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`,
and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`.
- C18Z79 service-channel planner-to-runtime loop proof is implemented and
deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.253-c18z79` with node-agent image
`rap-node-agent:0.2.253-c18z79` on `test-1/2/3`. The new live smoke extends
the C18Z78 applied branch: after planner resolves the existing
`rebuild_route` command to `applied` / `replacement_selected`, the entry node
reports a route-manager decision for the same `rebuild_request_id`, reports
transition `applied_rebuild`, and live service-channel packet ingress selects
the replacement route with no local/backend fallback, route failures, or flow
drops. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z79 live smoke, C18Z78 and C18Z77 sequential regressions, and C18Z67
concurrent QoS regression. Artifact:
`artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`.
- C18Z80 service-channel sustained post-rebuild pressure proof is implemented
and deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.254-c18z80` with node-agent image
`rap-node-agent:0.2.254-c18z80` on `test-1/2/3`. The new live smoke keeps the
C18Z79 planner-applied loop, then sends five post-rebuild bursts of mixed
`interactive`, `bulk`, and `reliable` VPN packet batches. It proves every
burst is accepted by the service-channel runtime, every burst reports the
replacement route, the stale primary is not reselected, and fallback,
route-failure, flow-drop, and scheduler-drop deltas stay zero from the
pre-pressure baseline. Smoke route hygiene was tightened: C18Z67 now disables
pre-existing active `vpn_packets` intents for its entry/exit pair, and
C18Z79/C18Z80 expire their temporary primary/alternate intents after a
successful run. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z80 live smoke, C18Z79 regression smoke, and C18Z67 concurrent QoS
regression. Artifact:
`artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`.
- C18Z81 service-channel replacement-degradation recovery proof is implemented
and deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.255-c18z81` with node-agent image
`rap-node-agent:0.2.255-c18z81` on `test-1/2/3`. The new live smoke proves
the negative branch after C18Z80: once the initial replacement is applied and
used, a generation-valid fenced feedback report for that replacement causes
the Control Plane to select a new safe recovery route. Live traffic then
moves to the recovery route, the degraded replacement is not reselected, and
fallback, route-failure, flow-drop, and scheduler-drop deltas stay zero for
the recovery send. The smoke also documents an important guardrail: stale
route-generation feedback must not trigger recovery. C18Z67/C18Z79 were
tightened to check per-run counter deltas rather than cumulative runtime
counters. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z81 live smoke, C18Z80 regression smoke, C18Z79 regression smoke, and
C18Z67 concurrent QoS regression. Artifact:
`artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`.
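
Two guardrails from C18Z81 lend themselves to a short sketch: feedback is only acted on when it is fenced and matches the current route generation, and smokes compare per-run counter deltas instead of cumulative counters. The `Counters` type and the helpers `delta`, `allZero`, and `acceptsFeedback` are hypothetical and only illustrate the idea.

```go
package main

import "fmt"

// Counters is a hypothetical snapshot of cumulative runtime counters.
type Counters struct {
	FallbackLocal  int64
	RouteFailures  int64
	FlowDrops      int64
	SchedulerDrops int64
}

// delta returns after minus before, field by field, so a smoke can assert on
// what happened during its own run only.
func delta(before, after Counters) Counters {
	return Counters{
		FallbackLocal:  after.FallbackLocal - before.FallbackLocal,
		RouteFailures:  after.RouteFailures - before.RouteFailures,
		FlowDrops:      after.FlowDrops - before.FlowDrops,
		SchedulerDrops: after.SchedulerDrops - before.SchedulerDrops,
	}
}

// allZero reports whether no counter moved.
func (c Counters) allZero() bool {
	return c == Counters{}
}

// acceptsFeedback returns true only for fenced feedback whose generation
// matches the route's current generation; stale feedback is ignored.
func acceptsFeedback(feedbackGeneration, currentGeneration int64, fenced bool) bool {
	return fenced && feedbackGeneration == currentGeneration
}

func main() {
	before := Counters{RouteFailures: 3, FlowDrops: 1}
	after := Counters{RouteFailures: 3, FlowDrops: 1}
	fmt.Println(delta(before, after).allZero())
	fmt.Println(acceptsFeedback(4, 5, true), acceptsFeedback(5, 5, true))
}
```
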
- C18Z82 service-channel no-safe-recovery proof is implemented and deployed on
docker-test as `rap-backend:fabric-service-channel-0.2.256-c18z82` with
node-agent image `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`. The new
live smoke proves the branch where the original primary is degraded, the
replacement is applied and used, then that replacement reports
generation-valid fenced feedback while no new safe recovery route exists.
Node-scoped synthetic config reports
`service_channel_feedback_no_alternate` with
`pending_degraded_fallback`; score reasons include
`no_unfenced_alternate_route` and
`backend_relay_degraded_fallback_until_rebuild`, so the Control Plane exposes
an explicit degraded/no-alternate state instead of silently sticking to a bad
replacement. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
C18Z82 live smoke, C18Z81 recovery regression, C18Z80 pressure regression,
and C18Z67 concurrent QoS regression. Artifact:
`artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`.
- C18Z83 service-channel access-telemetry no-safe projection is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.257-c18z83`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Active access telemetry
channels now expose route-decision source, route id, replacement route id,
rebuild status/reason/generation, and score reasons. Web-admin shows a
dedicated `decision` column in the active-channel table. The live smoke
proves no-safe recovery is visible through access telemetry as
`service_channel_feedback_no_alternate` /
`pending_degraded_fallback`, while durable ledger state can still report
`rebuild_request_no_alternate`. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, and C18Z83 live smoke. Artifact:
`artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`.
- C18Z84 service-channel access-decision aggregate proof is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.258-c18z84`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Access telemetry now
exposes aggregate route-decision counters:
`route_decision_channel_count`, `replacement_decision_count`,
`applied_rebuild_decision_count`, `recovery_decision_count`, and
`no_safe_recovery_decision_count`. Web-admin summary chips show these counts,
and no-safe route decisions now prioritize the aggregate reason
`active_channels_no_safe_recovery` over generic missing access-report noise.
Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z84 live smoke, and C18Z83 regression smoke.
Artifact:
`artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`.
- C18Z85 service-channel access-decision incident projection is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.259-c18z85`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Rebuild health summary now
carries access decision counts and prioritizes
`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe
is active. Rebuild incidents now include `incident_source=access_decision`
entries with channel id and operator-facing severity/action, including
`access_no_safe_recovery` as a bad incident. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z85 live smoke, and C18Z84 regression smoke.
Artifact:
`artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`.
- C18Z86 service-channel access-decision silence/acknowledgement is
implemented and deployed on docker-test as
`rap-backend:fabric-service-channel-0.2.261-c18z86`; node-agent remains
`rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and web-admin is
rebuilt/deployed to `rap_web_admin`. Rebuild alert silence requests now carry
`incident_source` and `channel_id`; `incident_source=access_decision`
no-safe incidents require `channel_id` and are stored with channel-scoped
route keys. Rebuild health and incident lists apply those silences, so an
acknowledged current-generation access no-safe incident is silenced and no
  longer contributes to the active bad count. Generation-change resurfacing is
covered in unit tests; live smoke proves the channel-scoped silence path.
Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z86 live smoke, and C18Z85 regression smoke.
Artifact:
`artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`.
- C18Z87 service-channel access-decision silence management is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.262-c18z87`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Backend now exposes active
rebuild alert silences, enriches access-decision silences with
`incident_source`, `channel_id`, and `display_route_id`, and supports
unsilence by id. Web-admin shows an `Active rebuild silences` table with an
`unsilence` action. The live smoke proves the operator path:
access no-safe incident -> silence -> active silence listed -> unsilence ->
active bad incident restored. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z87 live smoke, and C18Z86 regression smoke.
Artifact:
`artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`.
- C18Z88 service-channel access-decision resurface proof is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.263-c18z88`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Access-decision incidents
now include resurface details (`alert_resurfaced_from_silence_id`,
`alert_resurfaced_previous_generation`, and
`alert_resurfaced_previous_until`) when a previously acknowledged
access-decision incident changes generation/route/channel and becomes active
again. Web-admin shows the previous generation/expiry beside resurfaced
incidents. The live smoke proves access no-safe -> silence current generation
-> route-decision generation changes -> incident resurfaces as active bad
with previous-generation metadata preserved. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z88 live smoke, and C18Z87 regression smoke.
Artifact:
`artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`.
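
The silence and resurface behaviour from C18Z86 through C18Z88 can be modelled roughly as follows. The sketch assumes silences are matched to incidents by `channel_id`, that an unexpired silence for the same generation and route suppresses the incident, and that any other silence on the same channel marks the incident as resurfaced. `Silence`, `Incident`, and `applySilences` are hypothetical; the resurface fields mirror `alert_resurfaced_from_silence_id`, `alert_resurfaced_previous_generation`, and `alert_resurfaced_previous_until`.

```go
package main

import "fmt"

// Silence is a hypothetical channel-scoped acknowledgement.
type Silence struct {
	ID         string
	ChannelID  string
	RouteID    string
	Generation int64
	UntilUnix  int64
}

// Incident is a hypothetical access-decision incident row.
type Incident struct {
	ChannelID                    string
	RouteID                      string
	Generation                   int64
	Silenced                     bool
	ResurfacedFromSilenceID      string
	ResurfacedPreviousGeneration int64
	ResurfacedPreviousUntil      int64
}

// applySilences silences an incident that has an unexpired silence for its
// current generation and route; otherwise it attaches resurface metadata from
// an earlier silence on the same channel, if any.
func applySilences(inc Incident, silences []Silence, nowUnix int64) Incident {
	for _, s := range silences {
		if s.ChannelID == inc.ChannelID && s.RouteID == inc.RouteID &&
			s.Generation == inc.Generation && nowUnix < s.UntilUnix {
			inc.Silenced = true
			return inc
		}
	}
	for _, s := range silences {
		if s.ChannelID != inc.ChannelID {
			continue
		}
		if s.Generation != inc.Generation || s.RouteID != inc.RouteID {
			inc.ResurfacedFromSilenceID = s.ID
			inc.ResurfacedPreviousGeneration = s.Generation
			inc.ResurfacedPreviousUntil = s.UntilUnix
			return inc
		}
	}
	return inc
}

func main() {
	silences := []Silence{{ID: "s1", ChannelID: "ch-1", RouteID: "r1", Generation: 7, UntilUnix: 1000}}
	fmt.Println(applySilences(Incident{ChannelID: "ch-1", RouteID: "r1", Generation: 7}, silences, 500))
	fmt.Println(applySilences(Incident{ChannelID: "ch-1", RouteID: "r1", Generation: 8}, silences, 500))
}
```
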
- C18Z89 service-channel access-decision resurface action loop is implemented
and deployed on docker-test as `rap-backend:fabric-service-channel-0.2.264-c18z89`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Resurfaced
access-decision incidents now include `alert_resurfaced_cause`,
`alert_resurfaced_previous_route_id`, and
`alert_resurfaced_previous_channel_id`. Web-admin shows the cause beside the
resurfaced action text. The live smoke proves the operator path:
access no-safe -> silence current generation -> generation changes and
resurfaces -> active-channel decision context matches the incident ->
re-acknowledge current generation -> incident returns to silenced state.
Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z89 live smoke, and C18Z88 regression smoke.
Artifact:
`artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`.
- C18Z90 service-channel production data-plane contract is implemented and
deployed on docker-test as `rap-backend:fabric-service-channel-0.2.265-c18z90`;
node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Service-channel leases now
include a signed `data_plane` contract in the lease, authority payload,
introspection response, and lease-maintenance/admin list. The contract
declares backend API as control-plane transport, fabric service channel over
fabric routes as working/steady-state data transport, backend relay as
degraded fallback only, production forwarding required, and service-neutral
protocol-agnostic logical flow isolation. Web-admin shows data-plane/fallback
policy in service-channel leases. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z90 live smoke, and C18Z89 regression smoke.
Artifact:
`artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`.
- C18Z91 node-agent data-plane contract consumption is implemented and
deployed on docker-test as `rap-node-agent:0.2.266-c18z91` on `test-1/2/3`
with backend still `rap-backend:fabric-service-channel-0.2.265-c18z90`.
Service-channel VPN packet ingress now parses signed/introspected
`data_plane`, validates the production contract, applies the preferred fabric
route, logs data-plane mode/transports/backend-relay policy/logical-flow
mode, and reports `data_plane_contract` plus last transport/policy fields in
heartbeat access telemetry. Verification passed:
`go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`,
backend cluster tests, web-admin build, C18Z91 live smoke, and C18Z90
regression smoke. Artifact:
`artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`.
- C18Z92 node-agent backend-fallback policy enforcement is implemented and
deployed on docker-test as `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`.
If a signed data-plane contract has `backend_relay_policy=disabled`, the
service-channel runtime no longer proxies failed/missing fabric-route working
data through backend relay; it returns a visible service unavailable result.
The live smoke temporarily disables backend fallback in pool policy, issues a
no-route lease, verifies `backend_relay_policy=disabled`, posts to test-1,
and proves the node rejects with 503 instead of backend relay. Verification
passed: node-agent tests, C18Z92 live smoke, and C18Z91 regression smoke.
Artifact:
`artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`.
- C18Z93 access-telemetry data-plane projection is implemented and deployed on
docker-test as `rap-backend:fabric-service-channel-0.2.268-c18z93`;
node-agent remains `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Backend access telemetry
now promotes node-reported `data_plane_contract` and last data-plane
mode/working transport/steady-state transport/backend relay policy/logical
flow mode to cluster, node, and active-channel diagnostics. Web-admin shows
summary chips plus channel/node table columns for data-plane adoption and
relay policy. Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z93 live smoke, C18Z92 regression smoke, and
C18Z91 regression smoke. Artifact:
`artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`.
- C18Z94 data-plane contract incident diagnostics are implemented and deployed
on docker-test as `rap-backend:fabric-service-channel-0.2.269-c18z94`;
node-agent remains `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`, and
web-admin is rebuilt/deployed to `rap_web_admin`. Access/rebuild incident
diagnostics now include `incident_source=data_plane_contract` rows for
missing data-plane contract reports after accepted traffic,
working/steady-state transport mismatches, logical-flow mismatches, observed
disabled backend relay, and degraded/backend-relay policy violations. The
smoke now proves disabled
backend relay is emitted as a bad incident with action
`restore_fabric_route_or_change_signed_backend_relay_policy_before_retry`.
Verification passed:
`go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`,
web-admin `npm run build`, C18Z94 live smoke, C18Z93 regression smoke, C18Z92
regression smoke, and C18Z91 regression smoke. Artifact:
`artifacts/c18z94-data-plane-contract-incident-smoke-result.json`.
- C18Z95 node-agent blocked-fallback telemetry is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.270-c18z95`
and node-agent `rap-node-agent:0.2.270-c18z95` on `test-1/2/3`; web-admin is
rebuilt/deployed to `rap_web_admin`. Node-agent now reports
`backend_fallback_blocked`, `fabric_route_send_failure`, and last data-plane
violation status/reason in `fabric_service_channel_access_report`. Backend
access telemetry projects those fields to cluster, node, and active-channel
rows, and `data_plane_contract` incidents distinguish policy-blocked fallback
from real backend relay usage. Verification passed: node-agent tests,
backend tests, web-admin build, C18Z95 live smoke, and C18Z94/C18Z93/C18Z92
regressions. Artifact:
`artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`.
- C18Z96 blocked-fallback rebuild feedback is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
node-agent remains `rap-node-agent:0.2.270-c18z95` on `test-1/2/3`, and
web-admin remains deployed. Backend now converts heartbeat access reports
with `fabric_route_send_failed_backend_fallback_blocked` into durable fenced
`fabric_service_channel_route_feedback` for the active channel primary route.
The existing route rebuild planner then selects an authorized replacement
route when one exists. Verification passed: backend tests, node-agent tests,
web-admin build, C18Z96 live smoke, and C18Z95/C18Z93 regressions. Artifact:
`artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`.
- C18Z97 blocked-fallback feedback dedup is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`.
Backend now suppresses repeated access-report-derived route feedback while an
active fenced/degraded observation from `fabric_service_channel_access_report`
already exists for the same cluster, reporter node, route, and service class.
This keeps repeated blocked-fallback send-failure heartbeats from refreshing
the same feedback and churning rebuild attempts. Verification passed:
backend tests, node-agent tests, C18Z97 live smoke, and C18Z96/C18Z95
regressions. Artifact:
`artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`.
- C18Z98 blocked-fallback rebuild correlation is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. Backend now carries the
originating access-report route-feedback identity into replacement decisions
and rebuild-attempt ledger rows: `feedback_observation_id`,
`feedback_source`, feedback observed/expiry times, channel/resource ids, and
data-plane violation status/reason. Web-admin shows this correlation in
Route decisions and Rebuild ledger. Verification passed: backend tests,
node-agent tests, web-admin build, C18Z98 live smoke, and C18Z97/C18Z96/C18Z95
regressions. Artifact:
`artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`.
- C18Z99 rebuild correlation filters are implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. The rebuild-attempt ledger
API now accepts `feedback_source`, `feedback_channel_id`, and
`feedback_violation_status` filters, and web-admin exposes them in the
rebuild ledger filter form. Verification passed: backend tests, node-agent
tests, web-admin build, C18Z99 live smoke, and C18Z98/C18Z97/C18Z96/C18Z95/
C18Z93 regressions. Artifact:
`artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`.
- C18Z100 rebuild-health feedback breakdown is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. The rebuild-health summary
now returns `feedback_breakdowns` grouped by feedback source, feedback
channel id, and feedback violation status, with total/good/warn/bad/unknown
counts, active warn/bad counts, silenced count, latest observation time, and
affected reporter nodes/routes. Web-admin shows the breakdown in the Rebuild
health panel. Verification passed: backend tests, node-agent tests,
web-admin build, C18Z100 live smoke, and C18Z99/C18Z98/C18Z97/C18Z96/C18Z95/
C18Z93 regressions. Artifact:
`artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`.
- C18Z101 rebuild-health feedback drilldown UI is implemented and deployed to
`rap_web_admin`; backend remains
`rap-backend:fabric-service-channel-0.2.281-c18z109`. Web-admin now shows
related incident context on rebuild-health feedback breakdown rows and an
`open ledger` action that switches to the deep rebuild ledger with
`feedback_source`, `feedback_channel_id`, and `feedback_violation_status`
prefilled from the selected breakdown. Verification passed: web-admin build
and deployed asset/download checks.
- C18Z102 rebuild-health feedback drilldown audit breadcrumbs are implemented
and deployed on docker-test as backend
`rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/
deployed to `rap_web_admin`. The existing rebuild investigation endpoint now
accepts feedback source/channel/violation drilldown payloads and records
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
cluster audit events before web-admin opens the filtered deep rebuild ledger.
Verification passed: backend tests, web-admin build, C18Z102 live smoke, and
C18Z100/C18Z99/C18Z98 regressions. Artifact:
`artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`.
- C18Z103 Fabric diagnostics drilldown audit visibility is implemented and
deployed to `rap_web_admin`; backend remains
`rap-backend:fabric-service-channel-0.2.281-c18z109`. Web-admin now filters
the loaded cluster audit list for rebuild incident and feedback-breakdown
investigation events and shows recent drilldowns in the Fabric diagnostics
panel with time, source, feedback filters, target reporter/route, actor, and
reason. Verification passed: web-admin build and deployed asset/download
checks.
- C18Z104 focused Fabric audit loading is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. The cluster audit API now
accepts repeated or comma-separated `event_type` filters plus `target_type`
filters, and Fabric diagnostics loads recent rebuild incident/feedback
breakdown investigation breadcrumbs with a dedicated filtered request instead
of depending on the generic latest-100 audit list. Verification passed:
backend tests, web-admin build, C18Z104 live smoke, and C18Z102/C18Z100
regressions. Artifact:
`artifacts/c18z104-focused-fabric-audit-smoke-result.json`.
- C18Z105 Fabric drilldown breadcrumb correlation UI is implemented and
deployed to `rap_web_admin`; backend remains
`rap-backend:fabric-service-channel-0.2.281-c18z109`. Recent investigation
rows in Fabric diagnostics now show whether each breadcrumb still matches a
current rebuild-health feedback breakdown or visible rebuild incident, and
provide an `open` action to jump back into the matching filtered ledger path.
Verification passed: web-admin build and deployed asset/download checks.
- C18Z106 server-side Fabric drilldown breadcrumb correlation is implemented
and deployed on docker-test as backend
`rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/
deployed to `rap_web_admin`. Focused audit reads with
`correlation=fabric_diagnostics` now return `correlation_hints` with current
diagnostic status and matching rebuild-health feedback breakdown or rebuild
incident when present. Web-admin consumes those hints and keeps local matching
as fallback. The rebuild-health feedback breakdown window is raised to 100
groups after the C18Z100 regression showed that the previous cap could hide
fresh failure classes on noisy test history. Verification passed: backend tests,
web-admin build, C18Z106 live smoke, and C18Z104/C18Z100 regressions.
Artifact: `artifacts/c18z106-audit-correlation-hints-smoke-result.json`.
- C18Z107 drilldown breadcrumb summary is implemented and deployed on
docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. Audit responses now include
compact `audit_summary` aggregates beside `audit_events`; focused Fabric
diagnostics uses them to show counts by current diagnostic status, feedback
source, feedback violation status, correlated/not-visible totals, and latest
time above the Recent investigations rows. Verification passed: backend
tests, web-admin build, C18Z107 live smoke, and C18Z106/C18Z104 regressions.
Artifact: `artifacts/c18z107-audit-correlation-summary-smoke-result.json`.
- C18Z108 dedicated Fabric diagnostics breadcrumbs are implemented and deployed
on docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`;
web-admin is rebuilt/deployed to `rap_web_admin`. Backend exposes
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
returning `rebuild_investigation_breadcrumbs` with events and summary, so the
operator Recent investigations workflow no longer overloads the generic
cluster audit endpoint. Verification passed: backend tests, web-admin build,
C18Z108 live smoke, and C18Z107/C18Z106/C18Z100 regressions. Artifact:
`artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`.
- C18Z109 Fabric diagnostics breadcrumb freshness windows are implemented and
deployed on docker-test as backend
`rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/
deployed to `rap_web_admin`. The dedicated breadcrumb endpoint accepts
`current_window_seconds` and `history_window_seconds`, annotates events with
`correlation_hints.breadcrumb_status` (`current`, `stale`, `expired`) plus
age/window seconds, returns current/stale/expired totals, and includes
`counts_by_breadcrumb_status` in summary. Web-admin shows freshness chips and
age in Recent investigations. Verification passed: backend tests, web-admin
build, C18Z109 live smoke, and C18Z108/C18Z107/C18Z106 regressions. Artifact:
`artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`.
- C19Q Remote Workspace mailbox guardrails are implemented and
runtime-smoke-proven on docker-test. The adapter-session mailbox handoff now
has unit and live coverage for invalid adapter session IDs, unknown sessions,
invalid limits, and bounded `drain=true&limit=N` partial drain semantics.
This remains probe-only and node-local: it does not enable RDP protocol
forwarding, desktop frame transport, Android work, or backend relay behavior.
Verification passed: `go test ./internal/mesh` in `agents/rap-node-agent` and
`scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`.
Artifact:
`artifacts/c19q-remote-workspace-adapter-mailbox-guardrails-smoke-result.json`.
- C19R Remote Workspace mailbox long-poll ergonomics are implemented and
runtime-smoke-proven on docker-test. The mailbox endpoint now accepts bounded
`wait_ms`, returns explicit `empty`, `waited`, `wait_timeout`, and `wait_ms`
fields, and wakes when a delayed mailbox event arrives before timeout.
Node-agent image `rap-node-agent:codex-service-supervisor-20260512s` is built
and deployed on `test-1/2/3`. Verification passed:
`go test ./internal/mesh`, C19R live smoke, and C19Q regression smoke.
Artifact:
`artifacts/c19r-remote-workspace-mailbox-long-poll-smoke-result.json`.
- C19S Remote Workspace mailbox telemetry is implemented and
runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry
now expose mailbox read/wait/timeout/empty-read counters plus last mailbox
read metadata, so adapter consumer polling behavior is visible without
enabling desktop frame transport. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512t` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19S live
smoke, and C19R regression smoke. Artifact:
`artifacts/c19s-remote-workspace-mailbox-telemetry-smoke-result.json`.
- C19T Remote Workspace mailbox consumer checkpoint/ack metadata is implemented
and runtime-smoke-proven on docker-test. The mailbox endpoint now accepts a
validated `consumer_id` and optional `ack_sequence`, returns consumer
checkpoint/ack/lag/read metadata, and keeps bounded per-session node-local
consumer cursor state. Workload status and heartbeat telemetry expose
aggregate/current-session consumer read and ack counters. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512u` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19T live
smoke, and C19S regression smoke. Artifact:
`artifacts/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke-result.json`.
- C19U Remote Workspace mailbox consumer lifecycle guardrails are implemented
and runtime-smoke-proven on docker-test. Consumers can pass
`reset_consumer=true` with a validated `consumer_id` to clear cursor state
before the current read is recorded. Mailbox responses expose consumer
count/capacity, created/reset/evicted lifecycle flags, and consumer
timestamps; workload status and heartbeat telemetry expose consumer reset and
eviction counters. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512v` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19U live
smoke, and C19T regression smoke. Artifact:
`artifacts/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke-result.json`.
- C19V Remote Workspace mailbox consumer cursor inspection is implemented and
runtime-smoke-proven on docker-test. Active adapter sessions now expose a
read-only
`/mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
endpoint with bounded cursor snapshots: consumer ids, checkpoint/ack
sequences, lag, read/ack totals, and timestamps. Calls to this endpoint do
not increment mailbox reads, acks, resets, or drain events. Node-agent
image `rap-node-agent:codex-service-supervisor-20260512w` is built and
deployed on `test-1/2/3`. Verification passed: `go test ./internal/mesh`,
C19V live smoke, and C19U regression smoke. Artifact:
`artifacts/c19v-remote-workspace-mailbox-consumer-snapshot-smoke-result.json`.
- C19W Remote Workspace mailbox cursor-aware resume reads are implemented and
runtime-smoke-proven on docker-test. The mailbox endpoint now accepts
`after_sequence` for non-destructive reads, returns `skipped_count` and
`returned_count`, and long-polls for events newer than the requested sequence.
`after_sequence` with `drain=true` is rejected to keep resume reads separate
from destructive drains. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512x` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19W live
smoke, and C19V regression smoke. Artifact:
`artifacts/c19w-remote-workspace-mailbox-after-sequence-smoke-result.json`.
- C19X Remote Workspace mailbox consumer-aware resume is implemented and
runtime-smoke-proven on docker-test. Mailbox reads with `consumer_id` can pass
`resume_from=ack|checkpoint`; the node-agent resolves the stored cursor to
`after_sequence` before reading and returns `resume_from`/`resume_sequence`.
Guardrails reject mixing resume with manual `after_sequence`, drain, reset,
missing consumers, or invalid cursor names. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512y` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19X live
smoke, and C19W regression smoke. Artifact:
`artifacts/c19x-remote-workspace-mailbox-consumer-resume-smoke-result.json`.
- C19Y Remote Workspace mailbox resume telemetry is implemented and
runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry
now expose resume/after-sequence read totals, returned/skipped totals, and the
last resume cursor/sequence/consumer plus returned/skipped counts for
operator diagnostics. Session snapshots include the same per-session resume
counters. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512z` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Y live
smoke, C19X source smoke, and C19W regression smoke. Artifact:
`artifacts/c19y-remote-workspace-mailbox-resume-telemetry-smoke-result.json`.
- C19Z Remote Workspace adapter runtime readiness summary is implemented and
runtime-smoke-proven on docker-test. The sink report now includes compact
`adapter_runtime_readiness` diagnostics with session lifecycle state, mailbox
depth, consumer cursor, resume cursor, skipped/returned counts, and
ready/diagnostic status for operator handoff checks. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512z1` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Z live
smoke, C19X source smoke, and C19Y regression smoke. Artifact:
`artifacts/c19z-remote-workspace-adapter-readiness-smoke-result.json`.
- C19Z1 Remote Workspace mailbox handoff preflight is implemented and
runtime-smoke-proven on docker-test. The node-agent now exposes read-only
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight`
for `consumer_id` plus `resume_from=ack|checkpoint`; it validates the cursor
and reports the expected next event window without reading, draining, acking,
or mutating consumer state. Node-agent image
`rap-node-agent:codex-service-supervisor-20260512z2` is built and deployed on
`test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Z1 live
smoke, C19X source smoke, and C19Z regression smoke. Artifact:
`artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json`.
The current phase is NOT:
- full mesh routing implementation
- full VPN orchestration
- multi-cluster runtime traffic handling
- production data-plane migration
- complete updater rollout orchestration
- video meetings
- final native client UI redesign
Future mesh, VPN, multi-cluster, node-agent updater, and production realtime data-plane work must be introduced only through explicit, narrow, staged implementation prompts.
Always keep the project production-oriented. Do not simplify it into a toy app.