# Fabric Service Channel Runtime

Status: accepted product direction and implementation guardrail.

This document defines the common runtime layer that service products must use
for live traffic. VPN, Remote Server/Desktop Access, video meetings, file
transfer, SSH/VNC/RDP adapters, and future services must not each invent their
own route, relay, retry, and failover mechanics.

## Problem

The platform goal is a distributed high-speed access fabric:

```text
client or service ingress
  -> authorized entry node / entry pool
  -> fastest healthy fabric route
  -> authorized exit node / exit pool
  -> target network, adapter, or service runtime
```

Recent VPN work exposed an architectural risk: debugging transport behavior
inside the Android VPN client or temporary backend packet relay can hide the
real missing layer. If the common fabric channel is incomplete, every later
service will repeat the same work and the Remote Server/Desktop Access client
will get stuck on transport issues that should already be solved below it.

The backend/control API remains the control plane. It must not become the
production realtime relay for high-rate service traffic.

## Product Rule

All live service traffic goes through the Fabric Service Channel runtime.

Control-plane and engineering traffic:

- login
- profile refresh
- policy lookup
- session creation
- route authorization
- diagnostics
- update metadata

may use Control API and admin ingress.

Working data traffic:

- VPN IP packets
- remote desktop display/input/control channels
- SSH/VNC streams
- file chunks
- video/audio
- future realtime service payloads

must use Fabric Service Channel unless an explicit compatibility fallback is
selected and reported as degraded.

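
The rule above can be read as a small classification function. The sketch below is illustrative only: the traffic-kind labels, the `Plane` enum, and `select_plane` are hypothetical names, not identifiers from the project, and Python is used purely for readability.

```python
from enum import Enum


class Plane(Enum):
    CONTROL_API = "control_api"
    FABRIC_SERVICE_CHANNEL = "fabric_service_channel"
    DEGRADED_BACKEND_RELAY = "degraded_backend_relay"


# Hypothetical labels for the traffic kinds listed in the Product Rule.
CONTROL_TRAFFIC = frozenset({
    "login", "profile_refresh", "policy_lookup", "session_creation",
    "route_authorization", "diagnostics", "update_metadata",
})
WORKING_TRAFFIC = frozenset({
    "vpn_ip_packets", "remote_desktop", "ssh_vnc_streams",
    "file_chunks", "video_audio",
})


def select_plane(kind: str, fallback_selected: bool = False) -> Plane:
    """Return the plane a traffic kind may use under the Product Rule."""
    if kind in CONTROL_TRAFFIC:
        return Plane.CONTROL_API
    if kind in WORKING_TRAFFIC:
        # Backend relay is only reachable as an explicit, degraded fallback.
        if fallback_selected:
            return Plane.DEGRADED_BACKEND_RELAY
        return Plane.FABRIC_SERVICE_CHANNEL
    # Assumption: unknown kinds are rejected rather than silently routed.
    raise ValueError(f"unclassified traffic kind: {kind!r}")
```

The point of the sketch is that the degraded relay is never a default: it appears only when the caller states that a compatibility fallback was selected.
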
## Service Request Contract

A service requests a channel by logical intent, not by hard-coding a node path.

Target shape:

```json
{
  "service_class": "vpn_packets | remote_workspace | file_transfer | video",
  "organization_id": "...",
  "user_id": "...",
  "resource_id": "...",
  "entry_pool": ["node-a", "node-b"],
  "exit_pool": ["node-x", "node-y"],
  "required_roles": ["entry-node", "vpn-exit"],
  "allowed_channels": ["control", "reliable", "bulk", "droppable"],
  "qos": {
    "interactive": true,
    "bulk_limit_mbps": 0,
    "priority": "interactive | normal | bulk"
  },
  "failover": {
    "route_rebuild": "automatic",
    "exit_failover": "automatic",
    "sticky_session": true
  }
}
```

The control plane returns a short-lived, signed service-channel lease:

- channel/session id
- selected entry
- selected exit
- alternate entries/exits
- primary route path
- alternate route paths
- allowed channel classes
- route generation/fencing epoch
- token expiry and refresh policy
- fallback policy

The service sees a channel endpoint and channel capabilities. It does not see
the full mesh topology unless it is a platform-owner diagnostic view.

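
As a rough illustration of the request contract, the sketch below models a subset of the target shape and checks it before it would be sent to the control plane. `ChannelRequest` and `validate_request` are hypothetical names, and the accepted value sets are assumptions drawn from the JSON shape and the channel table in this document (including `interactive`, which the table lists but the sample `allowed_channels` does not).

```python
from dataclasses import dataclass
from typing import List

# Assumed value sets, taken from the target shape and the channel table.
SERVICE_CLASSES = frozenset(
    {"vpn_packets", "remote_workspace", "file_transfer", "video"}
)
CHANNEL_CLASSES = frozenset(
    {"control", "interactive", "reliable", "bulk", "droppable"}
)


@dataclass
class ChannelRequest:
    service_class: str
    organization_id: str
    user_id: str
    resource_id: str
    entry_pool: List[str]
    exit_pool: List[str]
    required_roles: List[str]
    allowed_channels: List[str]


def validate_request(req: ChannelRequest) -> List[str]:
    """Return a list of problems; an empty list means the shape looks sane."""
    problems: List[str] = []
    if req.service_class not in SERVICE_CLASSES:
        problems.append(f"unknown service_class: {req.service_class}")
    for name in ("organization_id", "user_id", "resource_id"):
        if not getattr(req, name):
            problems.append(f"missing {name}")
    if not req.entry_pool:
        problems.append("entry_pool is empty")
    if not req.exit_pool:
        problems.append("exit_pool is empty")
    unknown = sorted(set(req.allowed_channels) - CHANNEL_CLASSES)
    if unknown:
        problems.append("unknown channel classes: " + ", ".join(unknown))
    return problems
```

Note that the request names pools and roles, never a concrete hop list. Path selection stays with the control plane and routing engine.
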
## Runtime Responsibilities

### Control Plane

- authorizes the service request
- resolves organization/resource policy
- selects candidate entry and exit pools
- issues signed channel leases
- records audit
- publishes route generation and allowed service class
- receives telemetry and route health feedback
- triggers route/exit replacement when needed

### Fabric Routing Engine

- chooses shortest/fastest healthy route
- scores latency, loss, queue depth, bandwidth, node health, NAT mode,
  region/locality, role eligibility, and route generation freshness
- maintains alternate routes
- avoids full-mesh requirements
- rebuilds routes when links/nodes degrade

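
A minimal sketch of how such scoring might look, assuming a simple additive cost over a few of the listed signals. The weights, the `RouteCandidate` fields, and the function names are all hypothetical; the real engine also considers bandwidth, NAT mode, and locality, which are omitted here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RouteCandidate:
    route_id: str
    latency_ms: float
    loss_ratio: float        # 0.0 .. 1.0
    queue_depth: int
    healthy: bool = True
    role_eligible: bool = True
    generation_fresh: bool = True


def route_cost(route: RouteCandidate) -> float:
    """Lower is better. The weights are illustrative assumptions."""
    return route.latency_ms + 1000.0 * route.loss_ratio + 0.5 * route.queue_depth


def rank_routes(candidates: List[RouteCandidate]) -> List[RouteCandidate]:
    """Drop unusable routes, then order the rest from best to worst."""
    usable = [
        c for c in candidates
        if c.healthy and c.role_eligible and c.generation_fresh
    ]
    return sorted(usable, key=route_cost)
```

The first element of the ranked list would serve as the primary route and the remainder as alternates, so a rebuild after degradation is a re-rank rather than a fresh discovery.
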
### Entry Node

- accepts client-facing live connections
- validates service-channel token
- multiplexes logical streams/channels
- applies backpressure and per-channel scheduling
- forwards payloads to the selected route
- switches to alternate route/exit when instructed or when local health proves
  the path bad

### Intermediate Relay Nodes

- forward authorized envelopes only
- enforce route id, channel class, TTL, generation, and next-hop rules
- report link health and queue pressure
- do not own durable session state

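
The relay checks can be sketched as a stateless function over an envelope and a local route table. The `Envelope` and `RouteRule` shapes, the reason strings, and the interpretation of "next-hop rules" as an exact match against the local table are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class Envelope:
    route_id: str
    channel_class: str
    ttl: int
    generation: int
    next_hop: str


@dataclass(frozen=True)
class RouteRule:
    allowed_channel_classes: FrozenSet[str]
    generation: int
    next_hop: str


def check_envelope(env: Envelope, rules: Dict[str, RouteRule]) -> Tuple[bool, str]:
    """Decide whether a relay may forward the envelope, and why not if not."""
    rule = rules.get(env.route_id)
    if rule is None:
        return False, "unknown_route"
    if env.channel_class not in rule.allowed_channel_classes:
        return False, "channel_class_not_allowed"
    if env.ttl <= 0:
        return False, "ttl_expired"
    if env.generation != rule.generation:
        return False, "stale_generation"
    if env.next_hop != rule.next_hop:
        return False, "next_hop_mismatch"
    return True, "forward"
```

Because the function reads only the envelope and the current route table, a relay can restart or be replaced without losing any session: nothing durable lives there.
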
### Exit Node

- terminates the fabric route for the selected service
- connects to LAN/internet/adapter/runtime target
- enforces service policy locally
- reports egress health, DNS policy, and throughput
- can be replaced by another exit from the pool when policy allows

## Channel Model

The common fabric layer is channel-oriented.

| Channel class | Reliability | Typical services | Scheduling |
| --- | --- | --- | --- |
| `control` | reliable | attach/detach, route refresh, service state | highest |
| `interactive` | reliable/low-latency | RDP input, SSH input, cursor/control | highest data |
| `reliable` | ordered bounded | clipboard, small files, terminal output | medium |
| `bulk` | reliable bounded | VPN packets, downloads, large file chunks | lower than interactive |
| `droppable` | latest-wins | video frames, remote display regions, telemetry | drop stale |

VPN packets are protocol-neutral IP packets. They must not be special-cased as
HTTP, RDP, DNS, Telegram, or browser traffic. Optimization must improve the
shared packet path.

Remote Server/Desktop Access uses the same channel runtime, but its adapter
uses service-specific channel classes such as input, display, cursor,
clipboard, file transfer, audio, and telemetry.

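
One way to picture the table is a set of per-class queues drained in priority order, with `droppable` keeping only the newest payload per key. This is a sketch under stated assumptions: strict priority, the position of `droppable` after `bulk`, the queue bound, and the `ChannelQueues` name are not taken from the project. A production scheduler would also need fairness so lower classes are not starved.

```python
from collections import deque
from typing import Any, Dict, Optional, Tuple

# Assumed drain order, highest priority first.
PRIORITY = ("control", "interactive", "reliable", "bulk", "droppable")


class ChannelQueues:
    def __init__(self, bound: int = 1024) -> None:
        self._bound = bound
        self._queues: Dict[str, deque] = {
            c: deque() for c in PRIORITY if c != "droppable"
        }
        # Latest-wins storage: one slot per key, e.g. one per display region.
        self._latest: Dict[str, Any] = {}

    def enqueue(self, channel_class: str, payload: Any, key: str = "") -> bool:
        """Return False when a bounded queue is full (backpressure signal)."""
        if channel_class == "droppable":
            self._latest[key] = payload  # overwrite the stale payload
            return True
        queue = self._queues[channel_class]
        if len(queue) >= self._bound:
            return False
        queue.append(payload)
        return True

    def dequeue(self) -> Optional[Tuple[str, Any]]:
        """Pop the next payload in priority order, or None when idle."""
        for channel_class in PRIORITY:
            if channel_class == "droppable":
                if self._latest:
                    key = next(iter(self._latest))
                    return channel_class, self._latest.pop(key)
            elif self._queues[channel_class]:
                return channel_class, self._queues[channel_class].popleft()
        return None
```

The latest-wins slot is what makes stale video frames or display regions cheap to discard: a newer payload replaces the older one before it is ever sent.
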
## Failover Rules

The fabric must support:

- entry pool selection
- exit pool selection
- alternate route set
- quick route rebuild on node/link failure
- sticky route while healthy to avoid needless TCP disruption
- graceful drain when possible
- hard failover when route is stale or fenced
- explicit degraded fallback when the backend relay is used

VPN failover may still break existing TCP sessions in the initial mode. The
fabric must minimize disruption, but lossless TCP migration is a future mode and
must not be assumed.

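
The sticky, hard-failover, and degraded-fallback rules combine into a small decision function. The sketch below assumes a route counts as fenced when its generation is older than the current fencing epoch; `RouteState`, `select_route`, and the action strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RouteState:
    route_id: str
    healthy: bool
    generation: int


def select_route(
    current: Optional[str],
    routes: List[RouteState],
    fencing_epoch: int,
) -> Tuple[Optional[str], str]:
    """Pick a route id and name the action taken.

    `routes` is assumed to be ordered by preference, primary first.
    """
    usable = [
        r.route_id for r in routes
        if r.healthy and r.generation >= fencing_epoch
    ]
    if current is not None and current in usable:
        return current, "sticky"        # keep the healthy route
    if usable:
        action = "initial" if current is None else "failover"
        return usable[0], action
    # No authorized fabric route left: fallback must be explicit and reported.
    return None, "degraded_fallback"
```

Keeping the current route whenever it is still usable is what avoids needless TCP disruption, even if a nominally better route has appeared in the meantime.
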
## Current Gap

The project already has important pieces:

- signed node identity and scoped mesh config
- production fabric-control forwarding
- production `vpn_packet` envelope tests
- route intents and route health feedback
- entry-node VPN packet ingress prototype
- backend relay fallback for lab compatibility

The missing production layer is the service-channel runtime:

- stable client-to-entry live transport
- multiplexed logical streams/channels
- route manager with primary and alternate paths
- service-neutral QoS/backpressure
- channel-level telemetry
- automatic route and exit replacement contract
- explicit degraded fallback reporting

Until this layer is complete, VPN should be treated as a proving service for
the fabric channel, not as a one-off Android transport project.

## Implemented Foundation

The first backend contract slice is implemented:

- `POST /api/v1/clusters/{cluster_id}/fabric/service-channels/leases`
  issues a `rap.fabric_service_channel_lease.v1` contract.
- The lease contains selected entry/exit nodes, entry/exit pools, service
  class, required roles, allowed channel classes, route generation, fencing
  epoch, primary route, alternate routes, token metadata, entry HTTP/WebSocket
  endpoint templates, QoS, failover policy, and explicit fallback state.
- Each lease includes a cluster-authority-signed
  `rap.fabric_service_channel_lease_authority.v1` payload that binds the
  channel id, service class, selected entry/exit, primary route, generation,
  fencing epoch, expiry, and token hash.
- When an authorized fabric route exists, fallback is available but not
  active.
- When no authorized fabric route exists, the lease is marked
  `degraded_fallback`; backend relay is explicit compatibility fallback rather
  than hidden steady state.
- VPN client profiles now embed `fabric_service_channel_lease` for each planned
  VPN route, making VPN the first consumer of the common channel contract.
- `rap-node-agent` now exposes the first entry runtime endpoint for the VPN
  proving service:
  `/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets`
  and the `/packets/ws` WebSocket variant.
- The entry endpoint requires a `rap_fsc_*` service-channel token, accepts
  packet batches in `application/vnd.rap.vpn-packet-batch.v1`, forwards through
  the existing production `vpn_packet` fabric route, and maps route failures to
  the explicit backend relay compatibility path.
- Service-channel leases now carry a signed `data_plane` contract declaring
  control-plane API use, working-data transport through Fabric Service Channel,
  steady-state fabric routes, backend relay fallback policy, and
  service-neutral multi-flow isolation.
- Node-agent validates the signed or introspected data-plane contract, applies
  the preferred fabric route from the contract, reports contract adoption in
  heartbeat access telemetry, and refuses backend relay when the contract says
  `backend_relay_policy=disabled`.
- Backend access telemetry and web-admin active-channel diagnostics now project
  the data-plane adoption count plus last data-plane mode, working transport,
  steady-state transport, backend relay policy, and logical flow mode at
  cluster, node, and active-channel levels.
- Rebuild/access incident diagnostics now include `data_plane_contract`
  incidents for accepted service-channel traffic without a reported
  data-plane contract, transport/policy mismatches, disabled backend relay
  observations, and degraded backend relay usage. These incidents keep backend
  relay visible as degraded compatibility behavior rather than hidden steady
  state.
- Node-agent access telemetry distinguishes backend relay actually used from
  backend relay blocked by signed data-plane policy. Blocked fallback reports
  include `backend_fallback_blocked` and the last violation status/reason, and
  backend projects them to access telemetry plus `data_plane_contract`
  incidents.
- Backend correlates access-report send failures with active service-channel
  leases. A normal primary route that fails while backend relay is disabled is
  persisted as fenced route feedback, allowing the existing rebuild planner to
  select an authorized alternate instead of leaving the channel stuck at a
  policy-blocked fallback.
- Access-report-derived route feedback is deduplicated while an active fenced
  or degraded observation from `fabric_service_channel_access_report` already
  exists for the same cluster, reporter node, route, and service class. This
  prevents repeated blocked-fallback send-failure heartbeats from continuously
  refreshing the same feedback and churning rebuild attempts.
- Replacement decisions and rebuild-attempt ledger rows carry the originating
  access-report feedback identity: observation id, source, observed/expiry
  timestamps, channel/resource ids, and data-plane violation status/reason.
  This makes the chain `access report -> route feedback -> planner decision ->
  rebuild attempt` visible without opening raw JSON payloads.
- Rebuild-attempt ledger queries can filter by `feedback_source`,
  `feedback_channel_id`, and `feedback_violation_status`. The admin panel
  exposes the same fields so incident drilldown can jump directly to the
  correlated attempts behind an access-report-derived failure.
- Entry token validation now supports cluster-authority signed lease
  enforcement. When the client sends
  `X-RAP-Service-Channel-Authority-Payload` and
  `X-RAP-Service-Channel-Authority-Signature`, the entry node verifies the
  signature, expiry, selected entry node, service class, channel/resource ids,
  allowed `vpn_packet` channel, and token hash before accepting traffic.
- Android VPN release `0.2.159` consumes the profile
  `fabric_service_channel_lease`, builds the entry HTTP/WebSocket URLs from
  the lease templates, and sends the service-channel token and signed authority
  headers. A live smoke against `usa-los-1` accepted a valid signed lease and
  rejected a bad token with `403`.
- Node-agent release `0.2.162` adds the first route-manager behavior inside
  the entry runtime. The VPN packet ingress keeps the same runtime object when
  synthetic mesh config refreshes, records live send/receive counters, selected
  route/next hop, route attempts/failures, local-gateway fallback, and inbox
  queue depths.
- Client packet sends now try all valid `vpn_packet` route candidates, with a
  sticky preference for the last successful route. Backend relay fallback is
  reached only after all fabric candidates fail, and telemetry marks that as
  degraded compatibility behavior rather than normal steady-state transport.
- A live smoke on 2026-05-07 against the `usa-los-1` service-channel endpoint
  returned `202 Accepted` and heartbeat telemetry reported route attempts,
  route failure, and selected next hop `home-1`, proving that the report comes
  from the active ingress handler.
- Node-agent release `0.2.163` adds the first service-neutral flow scheduler.
  The scheduler does not make HTTP/RDP/DNS/application decisions. It hashes
  universal IP packets by 5-tuple, or opaque packet hash when no tuple can be
  read, into logical `flow-*` channels. Each channel records queue depth,
  enqueue/dequeue counts, drops, high-watermark, and backpressure state.
- Client packet batches are now fanned out by logical channel before route
  forwarding. This is the first step toward letting independent sessions share
  one VPN/fabric connection without a stalled flow hiding the health and
  pressure of other flows.
- A live smoke on 2026-05-07 sent two different packet flows through the signed
  service-channel endpoint and telemetry reported two flow batches, two flow
  channels, two enqueues/dequeues, and zero drops.
- Node-agent release `0.2.164` turns those logical channels into the first
  active scheduling behavior. Each channel remembers its last successful route
  and next hop, the last failed route, send duration, served count, stall count,
  consecutive failures, and whether route rebuild or degraded fallback is
  recommended.
- Scheduled batches are drained with a service-neutral fairness rule:
  non-stalled channels first, then less-served channels, then the oldest served
  channel. This still carries raw VPN/IP packets; it does not inspect HTTP,
  RDP, DNS, Telegram, browser traffic, or any other application protocol.
- Route selection is now per-channel. A channel may prefer its last successful
  route and defer its last failed route, so one bad route candidate does not
  keep punishing the same flow on the next send.
- A live smoke on 2026-05-07 posted two flows through `usa-los-1` and reported
  schema `c18l.fabric_service_channel_runtime_report.v1`,
  `send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`,
  `dropped=0`, and per-flow `last_route_id`, `last_next_hop`, `served`,
  `stall_count`, and fallback recommendation fields.
- Backend release `rap-backend:fabric-service-channel-0.2.165` consumes fresh
  entry-node service-channel heartbeat feedback when issuing a new lease. It
  reads `fabric_service_channel_runtime_report.ingress.flow_scheduler`
  `channel_stats`, boosts routes with recent successful flow sends, penalizes
  recent failed routes, and fences routes that explicitly recommend rebuild or
  degraded fallback.
- Fenced routes are not returned as primary or alternate route candidates in a
  service-channel lease. If every route for the selected entry/exit pair is
  fenced by service-channel feedback, the lease enters explicit degraded
  backend fallback with reason
  `fabric_routes_fenced_by_service_channel_feedback`.
- A live smoke on 2026-05-07 created two short-lived `test-1 -> test-2`
  `vpn_packets` route intents, injected fresh service-channel flow feedback
  marking the higher-priority route as rebuild-required, and the next lease
  selected the lower-priority healthy route with score reason
  `service_channel_recent_success`.
- Backend release `rap-backend:fabric-service-channel-0.2.166` makes that
  route feedback durable. Heartbeat telemetry records service-neutral route
  observations in `fabric_service_channel_route_feedback_observations` and
  updates `fabric_service_channel_route_feedback_latest` with expiring latest
  state per reporter node, service class, and route.
- Lease generation now reads durable latest feedback before falling back to
  fresh heartbeat metadata. This keeps route fencing/boosting available across
  backend restarts and prevents a single heartbeat replacement from erasing
  recent route-health evidence.
- A live smoke on 2026-05-07 persisted a fenced observation for a forced-bad
  higher-priority `test-1 -> test-2` route and a healthy observation for the
  lower-priority route. After backend restart, the next service-channel lease
  selected the healthy route with `service_channel_recent_success`; the durable
  latest table showed the bad route as `fenced` and active.
- Backend release `rap-backend:fabric-service-channel-0.2.167` exposes durable
  feedback for diagnostics and starts feeding it back into route-generation.
  Operators can list fresh observations through
  `/clusters/{clusterID}/fabric/service-channels/route-feedback`, and scoped
  node synthetic configs now include a `service_channel_route_feedback` report.
- Synthetic config generation skips routes fenced by the local node's durable
  service-channel feedback while that observation remains active. This is the
  first closed loop from entry-runtime traffic health to the next route config:
  a known-bad route is withheld from that node instead of being re-issued until
  the feedback expires or a new healthy observation replaces it.
- Backend release `rap-backend:fabric-service-channel-0.2.168` adds proactive
  replacement decisions for fenced service-channel routes. When a fenced route
  is withheld, route path decisions now record either
  `service_channel_feedback_replacement` with `replacement_route_id` and
  effective replacement hops, or `service_channel_feedback_no_alternate` when no
  unfenced alternate route exists.
- A live smoke on 2026-05-07 fenced a higher-priority `test-1 -> test-2` route
  and kept a lower-priority healthy route. The scoped `test-1` synthetic config
  excluded the bad route, kept the healthy route, and reported a replacement
  decision from the bad route to the healthy route with score reason
  `selected_unfenced_alternate_route`.
- Backend/node-agent release `0.2.169` adds the first replacement dampening
  behavior. When choosing an alternate for a fenced service-channel route, the
  control plane gives active healthy durable feedback a large stable preference
  and records `active_healthy_feedback_dampening_window` in score reasons. This
  keeps a recently successful replacement selected over a higher-priority but
  unproven route until the feedback expires or a newer observation changes the
  state.
- Route path decision reports now include `degraded_decision_count` for
  `service_channel_feedback_no_alternate`; upgraded node-agents echo
  `replacement_route_id` and degraded counts in heartbeat diagnostics. A live
  smoke on 2026-05-07 confirmed a low-priority healthy replacement beat a
  higher-priority unproven alternate while the healthy feedback was active.
- Node-agent/host-agent hotfix `0.2.171` keeps the signed synthetic config
  contract in sync with the backend feedback report. Agents now preserve
  `service_channel_route_feedback` while recalculating the authority payload
  hash, preventing `0.2.169`-style hash mismatches after C18O/C18Q feedback
  fields are present in control-plane configs. The release is published with
  Docker, Linux service, Windows service, and binary artifacts.
- Backend/web-admin release `0.2.172` adds cluster-level route feedback
  operations: operators can filter current feedback by reporter, route, service
  class, status, or include expired observations, and can expire stale route
  feedback after verification. Expiring feedback removes it from active route
  selection by moving `expires_at` to now while retaining history for audit and
  diagnostics.
- C18S adds operator-expire churn guardrails. A manual expire now creates an
  audit event, sets `operator_retry_cooldown_until`, and lets the route retry
  with explicit decision reason
  `service_channel_route_retry_after_operator_expire`. If the same reporter
  immediately sends another non-healthy observation for the same route/service
  inside the cooldown, Control Plane records it as
  `operator_retry_cooldown` with zero score adjustment instead of immediately
  re-fencing the route.
- C18T starts automatic service-neutral rebuild orchestration. Route path
  decisions now include rebuild request metadata. Fenced runtime feedback that
  keeps failing outside manual retry cooldown creates a bounded rebuild
  request. If an unfenced alternate is available, Control Plane marks the
  rebuild `applied` and selects that route generation; if no alternate exists,
  it records `pending_degraded_fallback` and keeps backend relay as the
  explicit degraded path until a new route appears. The compatibility release
  `0.2.175` keeps node/host-agent signed-config models aligned with these new
  fields.
- C18U moves rebuild metadata into node-agent runtime behavior. Node-agent
  `0.2.176` builds a local service-channel route-manager snapshot from
  `route_path_decisions`, tracks rebuild request/apply/pending-degraded counts,
  marks rebuilt-away routes as withdrawn, clears a withdrawn cached selected
  route, and filters withdrawn routes from new service-channel candidates. This
  keeps service traffic on the Control Plane replacement instead of repeatedly
  choosing a route that was already fenced. Backend `0.2.176` also makes node
  list version state prefer a node's actual reported target version over stale
  failed update-status rows.
- C18V adds route-manager transition telemetry and churn coverage. Node-agent
  `0.2.177` reports `route_manager_transition` alongside the current manager
  snapshot, including previous/current generation, status, decision count,
  withdrawn route count, restored route count, pending-degraded fallback count,
  rebuild applied count, and any cached selected route cleared because Control
  Plane withdrew it. Coverage verifies three service-neutral lifecycle cases:
  applied rebuild replacement, pending degraded fallback when no alternate is
  available, and rollback/restoration when a fresh config removes the rebuild
  decision.
- C18W adds a live docker-test verification loop for that telemetry. The smoke
  script `scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` creates
  short-lived service-channel route intents, injects durable fenced/healthy
  feedback through the heartbeat contract, observes Control Plane
  `rebuild_status=applied`, waits for node-agent `applied_rebuild`, expires the
  feedback through the operator endpoint, verifies the config has no rebuild
  decision, and waits for `restored_by_new_config`. The passing artifact is
  `artifacts/c18w-service-channel-route-manager-smoke-result.json`. The live
  run also hardened feedback expiration in backend `0.2.179` by avoiding pgx
  mixed timestamp/text parameter inference and array-parameter fragility.
- C18X adds service-neutral logical-channel isolation coverage and fixes a
  route-memory bug found by that coverage. Node-agent `0.2.180` keeps global
  last-route stickiness only for channels with no local route state; if a
  channel has a failed route to avoid, candidates are ordered without falling
  back to the global last selected route. This prevents one failed flow from
  poisoning unrelated flows that are still healthy on the primary route. The
  same slice verifies bounded same-channel backpressure/drop telemetry and
  preserves the existing packet-flow hashing split. The passing smoke artifact
  is `artifacts/c18x-service-channel-logical-channel-smoke-result.json`.
- C18Y adds route-intent lifecycle cleanup for operator/test routes. Backend
  `0.2.181` enriches route-intent list responses with lifecycle state, exposes
  platform-admin `expire` and `disable` actions, and prevents expired route
  policies from being emitted in node-scoped synthetic config. This keeps stale
  smoke route intents visible for audit while stopping agents from probing them
  as live routes. Web-admin Fabric Links now shows route-intent lifecycle
  counts and actions. The passing smoke artifact is
  `artifacts/c18y-route-intent-lifecycle-smoke-result.json`.
- C18Z adds bounded service-channel load coverage around the shared runtime.
  Node-agent `0.2.181` verifies many independent logical packet channels can
  rebuild away from a Control Plane-withdrawn primary route without retrying
  the withdrawn candidate, while same-channel overload reports bounded drops
  and high-water marks. `FabricFlowScheduler.Snapshot` now keeps
  `backpressure_active=true` when bounded drops occurred even if the queue has
  already drained. The docker-test smoke also creates temporary route intents,
  verifies their routes are visible, then expires/disables them and proves they
  disappear from scoped synthetic config. The passing smoke artifact is
  `artifacts/c18z-service-channel-load-smoke-result.json`.
- C18Z1 proves the same runtime through the running node HTTP surface instead
  of only in-process transport tests. Node-agent `0.2.182` adds a dynamic mesh
  listener handler so synthetic-config refreshes swap the active
  `/mesh/v1/forward` and service-channel ingress handler state without
  restarting the listening port. This closes the stale-handler failure where
  route-health probes had fresh routes but production forward still rejected
  live packets with `mesh synthetic route not found`. Backend `0.2.182` keeps
  active degraded/fenced route feedback from being immediately overwritten by a
  newer healthy heartbeat until the feedback expires or is explicitly cleared.
  The live smoke posts signed generic packet batches into `test-1`, verifies
  delivery into the `test-2` fabric inbox, forces a route rebuild, waits for
  node `applied_rebuild`, and verifies the second batch uses the replacement
  route. The passing smoke artifact is
  `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`.
- C18Z2 adds a sustained live ingress and exit-restart smoke. The script
  `scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1` keeps the same
  protocol-neutral service-channel shape, sends multiple signed packet batches
  through `test-1`, restarts the `test-2` exit container, waits for the exit
  runtime to reload Control Plane synthetic config, then proves recovery
  batches are accepted and delivered to the exit inbox. The passing artifact is
  `artifacts/c18z2-live-service-channel-soak-smoke-result.json`; run
  `c18z2-20260507-205112` accepted warm/restart/recovery batches and grew the
  post-restart exit inbox depth from `0` to `88` with zero inbox drops.
- C18Z3 adds the matching entry-side resilience and degraded-fallback contract.
  Node-agent `0.2.183` validates the signed service-channel lease authority and
  forces backend fallback when Control Plane has signed
  `status=degraded_fallback` or `primary_route.status=missing_route_intent`.
  This prevents a node from ignoring the lease decision and accidentally using
  older generic route candidates for the same VPN resource. The rule applies to
  both HTTP packet ingress and WebSocket packet ingress. The live smoke
  `scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1`
  proves HTTP warm delivery, WebSocket ingress parity, entry-node restart and
  recovery while a lease exists, explicit backend fallback when no authorized
  fabric route exists, and route-intent expiry. The passing artifact is
  `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`;
  run `c18z3-20260507-211402` accepted warm `4/4`, WebSocket `8` packets,
  recovery `4/4`, and moved the degraded backend fallback queue from `0` to
  `8`.
- C18Z4 adds live long-session pressure coverage without another runtime
  release. The script
  `scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1` holds
  one signed service-channel WebSocket open, sends 48 batches / 384 packets,
  expires the primary route intent mid-session, waits for the dynamic
  synthetic-config refresh, and verifies the post-switch traffic uses the
  alternate route. The passing artifact is
  `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`;
  run `c18z4-20260507-212748` grew the exit inbox from `0` to `384`, kept
  route failure delta `0`, flow drop delta `0`, and backend fallback queue
  `0 -> 0`. This proves route-policy churn can be absorbed by the shared
  fabric runtime while a service WebSocket remains active.
- C18Z5 adds live exit-node failure coverage while the same kind of service
|
|
WebSocket remains active. The script
|
|
`scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1` sends
|
|
pre-outage traffic, stops the `test-2` exit container while traffic continues,
|
|
starts it again, waits runtime readiness, and then sends recovery traffic over
|
|
the same signed WebSocket. The passing artifact is
|
|
`artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`; run
|
|
`c18z5-20260507-213745` sent 480 packets total, observed route failure delta
|
|
`48`, backend fallback queue `0 -> 192`, flow drop delta `0`, and recovery
|
|
exit inbox `0 -> 192`. This proves exit failure is surfaced as explicit
|
|
degraded/fallback telemetry and fabric delivery resumes after runtime
|
|
recovery without requiring the service connection to be rebuilt.
|
|
- C18Z6 adds live Control Plane rebuild coverage while a service WebSocket is
active. The script
`scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1` injects
route-health feedback for the primary route, observes Control Plane
`rebuild_status=applied` with the alternate route as replacement, waits for
node-agent `route_manager_transition.status=applied_rebuild`, and continues
traffic over the same signed WebSocket. The passing artifact is
`artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`; run
`c18z6-20260507-214900` sent 384 packets, delivered all of them to the exit
inbox, selected the replacement route, kept route failure delta `0`, flow
drop delta `0`, and backend fallback queue `0 -> 0`. This proves route-manager
replacement can be applied under an active service session without requiring
the service connection to be recreated.
- C18Z7 adds concurrent service-session isolation coverage. The script
`scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1`
opens three signed service-channel WebSockets over the same entry/exit pair,
interleaves packet batches across them, injects primary-route stale feedback,
waits for Control Plane `rebuild_status=applied` and node-agent
`applied_rebuild`, then continues all sessions. The passing artifact is
`artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`;
run `c18z7-20260507-215727` delivered 864 packets total, 288 packets per
session, with total backend fallback delta `0`, route failure delta `0`, and
flow drop delta `0`. This proves concurrent service sessions keep separate
resource queues and are not starved or poisoned by a shared route-manager
rebuild.
- C18Z8 adds live backpressure/fairness isolation coverage. The script
`scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1`
opens two interactive service-channel WebSockets and one abusive WebSocket on
the same entry/exit pair. The abusive session overloads a single stable
5-tuple with 1300 packets while the interactive sessions continue sending
small batches. The passing artifact is
`artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`;
run `c18z8-20260507-221347` delivered 192 packets per interactive session,
hit flow scheduler high watermark `1024`, scheduled `1030` packets on the
hottest channel, dropped `282` packets on that overloaded channel, and kept
backend fallback delta `0` and route failure delta `0`. This proves bounded
queue pressure is service-neutral, observable, and isolated to the overloaded
logical flow without starving other active sessions.
- C18Z9 adds route-pool replacement preference coverage. Node-agent `0.2.184`
now honors Control Plane `replacement_route_id` as the preferred route when a
service-channel rebuild decision is applied, instead of only withdrawing the
stale route and then relying on synthetic-config ordering. The live smoke
`scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1` creates a
slow relay primary route (`test-1 -> test-3 -> test-2`) and a fast direct
replacement (`test-1 -> test-2`), sends 54 batches / 432 packets over one
signed WebSocket, injects stale-route feedback, waits for Control Plane and
node-agent `applied_rebuild`, and verifies the same service session continues
over the fast route. The passing artifact is
`artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`; run
`c18z9-20260507-224901` kept backend fallback delta `0`, route failure delta
`0`, and flow drop delta `0`.
- C18Z10 adds service-channel exit-pool failover coverage. Backend/node-agent
`0.2.185` binds signed entry/exit pools into the service-channel lease
authority, keeps the selected exit aligned with the selected primary route,
and allows Control Plane replacement to move to another authorized exit when
route intents share the same exit-pool/resource metadata key. Node-agent also
seeds the entry runtime with the signed lease primary route so initial
traffic follows the lease before normal route-manager ordering. The live
smoke `scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1`
creates primary exit `test-1 -> test-2` and alternate exit
`test-1 -> test-3`, sends 54 batches / 432 packets over one signed
WebSocket, verifies 144 packets land on the primary exit before feedback,
injects stale-route feedback, waits for Control Plane and node-agent
`applied_rebuild`, and verifies 288 packets land on the alternate exit. The
passing artifact is
`artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`; run
`c18z10-20260507-232645` kept backend fallback `0`, route failure delta `0`,
and flow drop delta `0`.
- C18Z11 adds service-channel entry-pool failover contract coverage. Backend
`rap-backend:fabric-service-channel-0.2.186` keeps
`selected_entry_node_id` aligned with the selected primary route when the
healthy route starts at another authorized entry node, and route replacement
scope now understands entry-pool metadata keys. The live smoke
`scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1` creates
primary entry `test-1 -> test-2` and alternate entry `test-3 -> test-2`,
sends 144 packets through the initial `test-1` lease, injects feedback for
the primary entry route, refreshes the lease, verifies the new lease selects
`test-3`, and sends 288 more packets through the alternate entry to the same
exit. The passing artifact is
`artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`; run
`c18z11-20260507-235341` delivered 432 packets to the exit, kept backend
fallback `0`, route failure deltas `0/0`, and flow drop deltas `0/0`. This
proves the Control Plane lease/reconnect contract for entry replacement; it
does not claim that a broken client-to-entry socket survives entry-node loss.
- C18Z12 adds the first route quality scoring layer for lease selection.
Backend `rap-backend:fabric-service-channel-0.2.187` consumes
service-neutral runtime feedback from
`fabric_service_channel_runtime_report.ingress.flow_scheduler`: fast
`last_send_duration_ms` values boost a route, slow values penalize it, and
recent failures/stalls apply bounded penalties. This is explicitly
application-protocol neutral; it scores the shared fabric channel rather than
HTTP, RDP, DNS, or any other payload type. The smoke
`scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1` creates a
higher-priority slow relay route and a lower-priority fast direct route. The
initial lease selects the slow route by policy priority; after runtime
telemetry reports fast route `8ms` and slow route `900ms`, the refreshed lease
selects the fast route with score reason
`service_channel_quality_latency_le_10ms`. The passing artifact is
`artifacts/c18z12-service-channel-route-quality-smoke-result.json`; run
`c18z12-20260508-000209` passed and expired its temporary route intents.
- C18Z13 closes the first live self-learning route-quality loop. Node-agent
`0.2.188` records any positive sub-millisecond service-channel send duration
as `1ms` instead of `0ms`, so very fast routes still produce actionable
quality telemetry. The live smoke
`scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1` does
not inject a synthetic heartbeat. It first proves policy priority selects a
higher-priority relay route, expires that route, sends 24 real
service-channel batches / 192 packets through the fast direct route, waits
for the node-agent heartbeat to persist healthy route feedback in the
backend, then introduces a new higher-priority relay candidate. The refreshed
lease selects the already-learned fast route with score reasons
`service_channel_recent_success` and
`service_channel_quality_latency_le_10ms`. The passing artifact is
`artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`; run
`c18z13-20260508-001610` delivered all 192 packets to the exit, kept backend
fallback `0`, flow drops `0`, and expired temporary route intents.
- C18Z14 makes the learned route-quality loop active-session aware. Backend
`rap-backend:fabric-service-channel-0.2.190` decays older healthy
service-channel feedback before route scoring, so stale success does not keep
full weight until expiry. Node-agent `0.2.189` consumes healthy
service-channel route-quality observations from the signed synthetic config
and can prefer a significantly better learned route over a sticky per-flow
route/config-order candidate. The smoke
`scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1`
keeps one signed WebSocket service-channel session open across route
generation changes: it starts on a higher-priority relay route, expires that
route, sends real traffic over the fast route to teach backend feedback, then
introduces a new higher-priority relay candidate. The same active WebSocket
continues on the learned fast route. The passing artifact is
`artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`;
run `c18z14-20260508-071644` sent 60 batches / 480 packets, delivered all
packets to the exit, kept backend fallback `0`, flow drops `0`, and expired
temporary route intents.
- C18Z15 exposes and hardens effective route-quality preference telemetry.
Backend `rap-backend:fabric-service-channel-0.2.191` reports both raw
`score_adjustment` and decayed `effective_score_adjustment` in
service-channel feedback observations. Node-agent `0.2.190` consumes the
effective score for active route preference decisions, keeps the raw score
for diagnostics, and exposes sorted `route_quality_preferences` in runtime
telemetry. The smoke
`scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1`
wraps the active-session quality-shift scenario and verifies that route
preferences, effective scores, and age-decayed scores are visible. The
passing artifact is
`artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`;
run `c18z14-20260508-073538` sent 60 batches / 480 packets, delivered all
packets to the exit, kept backend fallback `0`, flow drops `0`, and exposed
decayed effective scores in node telemetry.
- C18Z16 adds per-channel route-quality preference telemetry and fairness
guardrails. Node-agent `0.2.191` records the applied
`quality_preference_route_id`, effective/raw score, and reasons on each
flow-scheduler channel that uses a quality-preferred route. Unit coverage
proves a learned route-quality preference can move multiple logical channels
to the fast route without merging their queues or dropping packets. The smoke
`scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1`
validates the live route-quality shift with per-channel diagnostics. The
passing artifact is
`artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`;
run `c18z14-20260508-074943` sent 60 batches / 480 packets, served 32
logical channels, applied quality preference telemetry to all 32 served
channels, and kept backend fallback `0` and flow drops `0`.
- C18Z17 clears stale per-channel route-quality markers. Node-agent `0.2.192`
removes channel-level quality preference diagnostics when the preference is no
longer present in the current effective preference set or when the preferred
route is withdrawn by the route manager. The smoke
`scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1`
verifies that active channel markers reference visible preferences, stale
markers are absent, expired route intents are not active, and the session
completes without backend fallback. The passing artifact is
`artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`;
run `c18z14-20260508-075750` sent 60 batches / 480 packets, kept 32 active
quality markers, found `0` stale markers, and kept backend fallback `0` and
flow drops `0`.
- C18Z18 scopes flow-scheduler channel memory by service session. Node-agent
`0.2.193` now keys runtime-sent logical channels as
`vpn:{vpnConnectionID}:flow-NN`, while keeping the low-level scheduler API
compatible with unscoped unit tests. This prevents two simultaneous
VPN/service sessions that share the same entry/exit and same IP-flow shard
from sharing route-failure memory or diagnostic markers. Unit coverage proves
`vpn-a` can avoid a failed primary route while `vpn-b` keeps the healthy
primary route for the same packet flow. The smoke
`scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1`
wraps the live C18Z17 route-quality/fairness path, verifies served live
channel names are session-scoped and no unscoped served `flow-NN` channels
remain, and keeps backend fallback and flow drops at zero. The passing
artifact is
`artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`;
run `c18z14-20260508-082520` served 32 session-scoped channels, applied
quality markers to all 32, and kept backend fallback `0` and flow drops `0`.
- C18Z19 adds the first bounded parallel send window for independent
service-channel logical flows. Node-agent `0.2.194` can send scheduled
logical channels concurrently with `MaxParallelFlowSends=4` in the live
node-agent runtime, while older/default in-process behavior remains
sequential unless the window is explicitly set. This keeps the data path
protocol-neutral: it does not inspect HTTP, RDP, DNS, Telegram, or browser
traffic; it only prevents one slow logical flow/channel from blocking another
independent channel in the same shared fabric service path. Telemetry now
exposes `max_parallel_flow_sends` and `send_flow_parallel_batches`. Unit
coverage blocks one logical channel and proves another channel completes
before the slow channel is released. The smoke
`scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1`
wraps the live C18Z18 path and verifies the parallel window is enabled and
observed in runtime telemetry. The passing artifact is
`artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`;
run `c18z14-20260508-084133` delivered 480 packets, observed
`max_parallel_flow_sends=4`, `send_flow_parallel_batches=60`, backend
fallback `0`, and flow drops `0`.
- C18Z20 adds per-channel latency/retry/in-flight telemetry and the first
adaptive recommended parallel window. Node-agent `0.2.195` tracks
scheduler-level `in_flight`, `max_in_flight`, and slow/failing channel counts,
plus per-channel `send_attempts`, `send_successes`, `send_failures`,
`in_flight`, `max_in_flight`, and latency buckets (`<=10ms`, `<=100ms`,
`<=1000ms`, `>1000ms`). The runtime reports `recommended_parallel_flow_sends`,
which currently shrinks the window under bounded drops, degraded fallback
recommendations, repeated failures, or slow/stalled channels. Unit coverage
proves the recommended window shrinks under queue/route pressure and that the
parallel window still lets an independent channel complete while another is
blocked. The smoke
`scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1`
wraps the live C18Z19 path and verifies the new telemetry is visible on real
docker-test nodes. The passing artifact is
`artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`;
run `c18z14-20260508-085635` delivered 480 packets, observed
`max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`,
`scheduler_max_in_flight=4`, attempt/success/latency telemetry on all 32
served channels, backend fallback `0`, and flow drops `0`.
- C18Z21 adds rolling per-channel/session quality windows. Node-agent `0.2.196`
keeps the lifetime counters for audit visibility, but adaptive send-window
pressure now comes from the bounded recent quality window, so old drops and
old route failures roll out after successful fresh samples. The scheduler
exposes aggregate rolling-window sample/failure/slow/drop counters and each
channel exposes sample, success, failure, slow, drop, average-latency, and
last-updated telemetry. Unit coverage proves old pressure is forgotten by the
rolling window while lifetime counters remain visible. The smoke
`scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1`
wraps the live C18Z20 path and verifies the new telemetry on real docker-test
nodes. The passing artifact is
`artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json`;
run `c18z14-20260508-091952` delivered 480 packets, observed
`scheduler_quality_window_sample_count=480`, rolling failures `0`, rolling
drops `0`, rolling samples/success/latency on all 32 served channels,
`recommended_parallel_flow_sends=4`, backend fallback `0`, and flow drops `0`.
- C18Z22 connects the rolling window to backend durable route feedback. Backend
`rap-backend:fabric-service-channel-0.2.197` reads `quality_window_*` fields
from node-agent heartbeat metadata and uses fresh rolling failure/drop/slow
counts plus rolling average latency when persisting
`fabric_service_channel` route feedback. Lifetime fields remain available as
fallback for older agents, but they no longer dominate scoring when a current
rolling window is present and clean. The smoke
`scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1` wraps the
live C18Z21 path and verifies persisted feedback includes
`service_channel_rolling_quality_window` and payload `quality_window_*`
fields. The passing artifact is
`artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json`; run
`c18z14-20260508-093100` delivered 480 packets, observed one persisted
healthy rolling feedback item with rolling payload, backend fallback `0`, and
flow drops `0`.
- C18Z23 adds route recovery hysteresis. Backend
`rap-backend:fabric-service-channel-0.2.198` re-admits routes that have
healthy rolling-window feedback during an operator-expire/manual retry
cooldown, but applies a bounded score penalty (`150`) and the
`service_channel_recovery_hysteresis` reason. The recovered route remains
authorized and available as an alternate, while a steady healthy route can
remain primary until the recovery window proves stable enough. The smoke
`scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1` wraps
the live C18Z22 path and verifies backend `0.2.198`, rolling feedback, clean
forwarding, and the unit hysteresis contract. The passing artifact is
`artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json`; run
`c18z14-20260508-094111` delivered 480 packets with backend fallback `0` and
flow drops `0`.
- C18Z24 exposes that recovery state to operators and API consumers. Backend
`rap-backend:fabric-service-channel-0.2.199` enriches route feedback API
responses and node-scoped service-channel feedback reports with
`recovery_state`, `recovery_hysteresis_active`, and
`recovery_hysteresis_penalty`; route path decision reports now include
`recovery_hysteresis_count`. Web-admin shows recovered/hysteresis chips and a
recovery column next to route feedback status, score, reasons, retry
cooldown, and expiry. The smoke
`scripts/fabric/c18z24-service-channel-recovery-visibility-smoke.ps1`
verifies backend `0.2.199`, unit recovery visibility, and live
route-feedback API recovery shape. The passing artifact is
`artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json`;
live API returned 109 feedback observations with recovery state shape.
- C18Z25 adds a stability threshold before recovered routes can become steady
again. Backend `rap-backend:fabric-service-channel-0.2.200` keeps routes
recovered through manual retry under hysteresis until they report at least 64
clean rolling-window samples (`success >= 64`, failures/slow/drops `0`). Once
the threshold is met, the route is promoted back to `healthy`, gets
`recovery_promoted=true` and `service_channel_recovery_promoted`, and no
longer receives the hysteresis penalty. Admin/API expose promoted counts and
flags beside recovered/hysteresis state. The smoke
`scripts/fabric/c18z25-service-channel-recovery-promotion-smoke.ps1`
verifies backend `0.2.200`, the promotion unit contract, and live
route-feedback API recovery shape. The passing artifact is
`artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json`.
- C18Z26 adds explicit demotion after recovery promotion. Backend
`rap-backend:fabric-service-channel-0.2.201` marks a recovered/promoted route
under retry cooldown as `recovery_demoted=true` when fresh rolling feedback
shows failures, drops, slow samples, degraded fallback, rebuild
recommendation, or fenced state. The demotion includes a concrete
`recovery_reason`, adds `service_channel_recovery_demoted` plus the specific
reason to route score reasons, and increments `recovery_demoted_count` in
route path decision reports. Web-admin shows demoted feedback/path chips and
reason text. The smoke
`scripts/fabric/c18z26-service-channel-recovery-demotion-smoke.ps1` verifies
backend `0.2.201`, demotion unit coverage, and live route-feedback API
recovery shape. The passing artifact is
`artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json`.
- C18Z27 adds cluster-level recovery policy tuning. Backend
`rap-backend:fabric-service-channel-0.2.202` exposes
`GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy`,
backed by strict defaults plus optional cluster metadata override
`fabric_service_channel_recovery_policy`. The policy controls hysteresis
penalty, promotion minimum samples, demotion thresholds for failures, drops,
and slow samples, and rebuild/fenced demotion toggles. Lease route selection,
route feedback reports, and node-scoped synthetic config feedback consume the
effective policy. Web-admin shows and edits the policy in the
service-channel diagnostics card. The smoke
`scripts/fabric/c18z27-service-channel-recovery-policy-smoke.ps1` verifies
backend `0.2.202`, policy unit coverage, live GET/PUT policy API, and default
restoration. The passing artifact is
`artifacts/c18z27-service-channel-recovery-policy-smoke-result.json`.
- C18Z28 adds recovery policy provenance to service-channel diagnostics.
Backend `rap-backend:fabric-service-channel-0.2.203` includes the effective
recovery policy on `FabricServiceChannelRoute`,
`FabricServiceChannelLease`, signed lease authority payloads, route feedback
reports, and route path decision reports. This lets operators audit a
primary route, alternate route, degraded fallback, or path decision against
the exact policy source and thresholds that produced the score/recovery
state. Web-admin node diagnostics show the policy source and key thresholds
beside service-channel feedback and route decisions. The smoke
`scripts/fabric/c18z28-service-channel-recovery-policy-provenance-smoke.ps1`
verifies backend `0.2.203`, live synthetic config provenance, live lease
provenance, primary route provenance, and signed authority-payload
provenance. The passing artifact is
`artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json`.
- C18Z29 adds feedback provenance guardrails. Backend
`rap-backend:fabric-service-channel-0.2.204` computes a stable recovery
policy fingerprint and recognizes optional runtime feedback provenance:
`recovery_policy_fingerprint`, `route_generation`, `route_policy_version`,
and `policy_version`. Route feedback observations expose observed/effective
policy fingerprints and route generations, while reports expose missing and
stale counters. Feedback that explicitly came from an old policy or route
generation is still visible, but it is scored conservatively and cannot fence
or rebuild a current route. Missing provenance remains compatible for old
node-agents. The smoke
`scripts/fabric/c18z29-service-channel-feedback-provenance-guard-smoke.ps1`
verifies backend `0.2.204`, unit guardrails, live policy fingerprint, and
live feedback provenance counter shape. The passing artifact is
`artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json`.
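
The recovery lifecycle described for C18Z23 through C18Z27 (hysteresis, promotion, demotion, and a tunable policy) can be sketched as a small classifier. This is a minimal illustration, not the backend implementation: the class names, function names, and state strings are hypothetical, and the demotion thresholds of `1` are assumptions. Only the `150` penalty, the 64-sample promotion threshold, and the three score reasons come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPolicy:
    # 150 and 64 are the defaults named in the text; the demotion
    # thresholds of 1 are assumptions made for this sketch.
    hysteresis_penalty: int = 150
    promotion_min_samples: int = 64
    demotion_failures: int = 1
    demotion_drops: int = 1
    demotion_slow: int = 1


@dataclass(frozen=True)
class RollingWindow:
    success: int = 0
    failures: int = 0
    slow: int = 0
    drops: int = 0


def classify_recovery(window: RollingWindow, policy: RecoveryPolicy) -> tuple[str, str]:
    """Return (state, score_reason) for a recovered route in retry cooldown."""
    dirty = (
        window.failures >= policy.demotion_failures
        or window.drops >= policy.demotion_drops
        or window.slow >= policy.demotion_slow
    )
    if dirty:
        return "demoted", "service_channel_recovery_demoted"
    if window.success >= policy.promotion_min_samples:
        return "promoted", "service_channel_recovery_promoted"
    return "hysteresis", "service_channel_recovery_hysteresis"


def score_penalty(state: str, policy: RecoveryPolicy) -> int:
    """Only routes still under hysteresis carry the bounded penalty here.

    How demoted routes are scored is not modeled in this sketch.
    """
    return policy.hysteresis_penalty if state == "hysteresis" else 0


policy = RecoveryPolicy()
state, reason = classify_recovery(RollingWindow(success=20), policy)
print(state, score_penalty(state, policy))  # prints: hysteresis 150
```

Keeping the thresholds in one immutable policy object mirrors the idea that lease selection, feedback reports, and synthetic config should all consume the same effective policy.
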
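
The rolling quality window from C18Z21 separates lifetime counters (kept for audit) from a bounded recent window (used for adaptive pressure). A minimal sketch of that split is shown here, assuming a fixed-size sample window and a `1000` ms slow threshold. The class name, outcome labels, window size, and threshold are all hypothetical; the real node-agent fields and window bounds are not specified in this document.

```python
from collections import deque

SLOW_MS = 1000.0  # assumed slow-sample threshold, not taken from the source


class ChannelQuality:
    """Lifetime counters plus a bounded window of recent send outcomes."""

    def __init__(self, window_size: int = 64) -> None:
        self.lifetime = {"samples": 0, "failures": 0, "drops": 0}
        # deque(maxlen=...) silently evicts the oldest sample when full.
        self._window = deque(maxlen=window_size)

    def record(self, outcome: str, latency_ms: float = 0.0) -> None:
        """Record one sample; outcome is 'ok', 'fail', or 'drop'."""
        if outcome not in ("ok", "fail", "drop"):
            raise ValueError(f"unknown outcome: {outcome}")
        self.lifetime["samples"] += 1
        if outcome == "fail":
            self.lifetime["failures"] += 1
        elif outcome == "drop":
            self.lifetime["drops"] += 1
        self._window.append((outcome, latency_ms))

    def rolling(self) -> dict:
        """Summarize only the samples still inside the recent window."""
        samples = list(self._window)
        n = len(samples)
        total_ms = sum(ms for _, ms in samples)
        return {
            "samples": n,
            "success": sum(1 for o, _ in samples if o == "ok"),
            "failures": sum(1 for o, _ in samples if o == "fail"),
            "drops": sum(1 for o, _ in samples if o == "drop"),
            "slow": sum(1 for _, ms in samples if ms > SLOW_MS),
            "avg_latency_ms": total_ms / n if n else 0.0,
        }


quality = ChannelQuality(window_size=4)
quality.record("fail", 2000.0)      # old pressure
for _ in range(4):
    quality.record("ok", 5.0)       # fresh clean samples push it out
print(quality.rolling()["failures"], quality.lifetime["failures"])  # prints: 0 1
```

The usage lines show the property the unit coverage is described as proving: the old failure rolls out of the window while the lifetime counter still reports it.
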
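
The provenance guard from C18Z29 can be illustrated the same way. The sketch below assumes the fingerprint is a SHA-256 digest over canonical JSON and that route generations are comparable integers; neither detail is stated in this document, and the function names and the `missing`/`stale`/`current` labels are hypothetical. Only the field names `recovery_policy_fingerprint` and `route_generation` and the rule that stale feedback cannot fence or rebuild a current route come from the text.

```python
import hashlib
import json


def policy_fingerprint(policy: dict) -> str:
    """Stable digest of a policy: key order must not change the result."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify_feedback(feedback: dict, effective_fp: str, current_generation: int) -> str:
    """Classify feedback provenance as 'missing', 'stale', or 'current'."""
    observed_fp = feedback.get("recovery_policy_fingerprint")
    observed_gen = feedback.get("route_generation")
    if observed_fp is None and observed_gen is None:
        return "missing"  # old node-agents: stays compatible
    if observed_fp is not None and observed_fp != effective_fp:
        return "stale"    # produced under a different policy
    if observed_gen is not None and observed_gen < current_generation:
        return "stale"    # produced for an older route generation
    return "current"


def may_fence_or_rebuild(feedback: dict, effective_fp: str, current_generation: int) -> bool:
    """Explicitly stale feedback stays visible but cannot fence or rebuild."""
    return classify_feedback(feedback, effective_fp, current_generation) != "stale"


fp = policy_fingerprint({"hysteresis_penalty": 150, "promotion_min_samples": 64})
old = {"recovery_policy_fingerprint": fp, "route_generation": 2}
print(classify_feedback(old, fp, current_generation=3))  # prints: stale
```

Treating missing provenance as acceptable while rejecting explicitly old provenance keeps older agents working without letting outdated observations act on a current route.
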
## Implementation Order
1. Define and test the generic service-channel lease and route-generation
contract in the backend. Done for the first VPN packet consumer.
2. Add a node-agent entry runtime that accepts a client/service live connection
and maps it to a fabric route. Done for the first VPN packet HTTP/WebSocket
ingress with signed lease verification.
3. Add a node-agent route manager with primary/alternate route selection,
generation fencing, health feedback, and failover. The first alternate-route
retry and live telemetry slice is done in `0.2.162`; generation fencing,
active health feedback, and route rebuild triggers remain.
4. Add service-neutral channel scheduling and bounded queues. Protocol-neutral
IP-flow hashing and queue/backpressure telemetry landed in `0.2.163`; the
first fair drain, route memory, failed-route avoidance, and rebuild/degraded
fallback signals landed in `0.2.164`. Async per-channel workers, load
shedding policy, and deeper route rebuild history remain. The first Control
Plane lease-time feedback consumer landed in backend `0.2.165`; durable
latest route feedback landed in backend `0.2.166`; admin diagnostics and
fenced-route avoidance in synthetic config landed in backend `0.2.167`;
proactive replacement decisions landed in backend `0.2.168`; dampened
healthy replacement preference and degraded/no-alternate counts landed in
`0.2.169`; operator-expire retry cooldown guardrails landed in C18S; bounded
rebuild request/decision metadata landed in C18T; node-agent runtime
withdrawal/replacement consumption landed in C18U; route-manager transition
telemetry and restore/pending fallback coverage landed in C18V; live
Control Plane/runtime route-manager verification landed in C18W; per-logical
channel failed-route isolation and bounded backpressure coverage landed in
C18X; route-intent lifecycle cleanup and synthetic-config expired-route
filtering landed in C18Y; bounded multi-channel load/rebuild/drop telemetry
coverage landed in C18Z; live signed service-channel ingress through the
running mesh listener landed in C18Z1; sustained live ingress with exit-node
restart/recovery coverage landed in C18Z2; signed degraded fallback
enforcement plus entry restart/WebSocket parity landed in C18Z3; long-lived
WebSocket pressure with mid-session route-policy churn landed in C18Z4; live
exit-node restart/fallback/recovery under an active WebSocket landed in
C18Z5; live Control Plane rebuild replacement under an active WebSocket
landed in C18Z6; concurrent active WebSocket/session isolation under rebuild
landed in C18Z7; active backpressure/fairness isolation for overloaded
logical flows landed in C18Z8; route-pool replacement preference landed in
C18Z9; exit-pool failover landed in C18Z10; entry-pool failover contract
landed in C18Z11; route quality scoring landed in C18Z12; live
self-learning route quality from real service-channel traffic landed in
C18Z13; active-session route-quality preference and backend feedback age
decay landed in C18Z14; effective route-quality score telemetry and
node-side effective score consumption landed in C18Z15; per-channel
route-quality preference telemetry and multi-channel fairness guardrails
landed in C18Z16; stale route-quality marker cleanup landed in C18Z17;
service-session-scoped flow scheduler memory landed in C18Z18; bounded
parallel logical-flow send windows landed in C18Z19; per-channel
latency/retry/in-flight telemetry plus adaptive recommended window landed in
C18Z20; rolling quality windows landed in C18Z21; backend rolling feedback
consumption landed in C18Z22; recovery hysteresis landed in C18Z23; recovery
state API/admin visibility landed in C18Z24; recovery promotion threshold
policy landed in C18Z25; recovery demotion telemetry/policy landed in
C18Z26; cluster-level recovery policy tuning landed in C18Z27; recovery
policy provenance landed in C18Z28; feedback provenance guardrails landed
in C18Z29; node-agent per-flow feedback provenance and backend heartbeat
preservation landed in C18Z30; durable backend route-rebuild attempt ledger,
API visibility, and admin diagnostics landed in C18Z31; generation-strict
rebuild timeline correlation with node-agent route-manager/route-generation
heartbeat telemetry and post-rebuild traffic counters landed in C18Z32;
computed rebuild guard status/severity/reason fields and admin guard chips
landed in C18Z33; cluster-level rebuild health summary endpoint/admin panel
with affected nodes/routes and recommended operator action landed in C18Z34;
generation-scoped operator silence for rebuild-health alerts landed in
C18Z35; resurfacing detection for new generations after an operator silence
landed in C18Z36; fast service-channel readiness gate landed in C18Z37;
default-fast rebuild ledger summary with explicit deep enrichment landed in
C18Z38; bounded deep-ledger drilldown by reporter/route/service/generation
with offset pagination landed in C18Z39; bounded rebuild incident grouping
with one-click deep investigation landed in C18Z40; audited incident
investigation and incident-level silence actions landed in C18Z41; durable
rebuild correlation/guard snapshots for fast warm readiness/health/incidents
landed in C18Z42; service-channel schema preflight for migration-safe manual
deploys landed in C18Z43; bounded rebuild snapshot warmup for missing
correlation snapshots plus stale-snapshot detection landed in C18Z44;
heartbeat-triggered auto-warmup for runtime-evidence rebuild snapshots landed
in C18Z45; rebuild snapshot maintenance health with overdue/runtime-evidence
visibility landed in C18Z46; node-agent signed service-channel lease
enforcement when cluster authority is pinned landed in C18Z47; backend
introspection fallback for unsigned compatibility clients landed in C18Z48;
accepted-by telemetry for signed/introspection/legacy ingress landed in
C18Z49; durable lease introspection across backend restarts landed in C18Z50;
bounded durable lease cleanup and admin visibility landed in C18Z51; durable
accepted-by access telemetry aggregation with heartbeat fallback and admin
visibility landed in C18Z52; active lease/session correlation with
entry/exit, route status, fallback, and latest route-quality feedback
visibility landed in C18Z53; C18Z54 smoke proves the same diagnostics on a
normal non-fallback primary route with healthy rolling route-quality feedback;
C18Z55 smoke proves degraded/fenced normal-route feedback is shown separately
from explicit backend fallback; C18Z56 adds active-channel remediation
diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
`use_backend_fallback`) to make the next runtime action explicit, and its
alternate-route branch is live-smoke-proven with backend fallback kept off.
C18Z57 adds the bounded machine-readable `remediation_command` contract to
active access telemetry rows so route-manager can consume a short-lived
`prefer_alternate_route` command with primary/replacement route ids and TTL.
C18Z58 projects those commands into node-scoped synthetic mesh config and the
node-agent route-manager consumes them as explicit applied replacement
decisions sourced from `service_channel_remediation_command`. C18Z59 proves
post-remediation service-channel traffic actually selects the replacement
route in runtime/flow telemetry without local/backend fallback. C18Z60 proves
the same remediation path for multiple independent VPN flow channels in one
packet batch, with replacement-route flow stats, no flow drops, no route
failures, and no degraded fallback. C18Z61 proves the remediation replacement
path under a larger 128-packet pressure batch with 32 replacement-route flow
stats, scheduler high-watermark 5, max-in-flight 4, no drops, no route
failures, and no degraded fallback. C18Z62 adds neutral service-channel
traffic-class QoS wiring: HTTP ingress accepts `X-RAP-Traffic-Class`, the
|
|
scheduler keeps distinct traffic-class channel ids/stats, unit tests prove
|
|
priority ordering, and live smoke proves bulk pressure plus interactive
|
|
traffic both use the replacement route without fallback, drops, or route
|
|
failures. C18Z63 proves concurrent QoS isolation in the runtime: an
|
|
interactive traffic-class packet completes while a bulk send is deliberately
|
|
held in-flight, with traffic-class stats, no drops, and no failures. C18Z64
|
|
adds compact `traffic_class_counts` telemetry to flow-scheduler snapshots so
|
|
diagnostics can see active flow-channel distribution by traffic class without
|
|
scanning every channel stat; it is live-proven on docker-test with bulk and
|
|
interactive counts visible in heartbeat metadata. C18Z65/C18Z66 project this
|
|
QoS/pressure telemetry into backend access telemetry and web-admin at cluster,
|
|
node, and active-channel levels. C18Z67 proves the live HTTP concurrent QoS
|
|
path under pressure: six parallel bulk service-channel requests and one
|
|
interactive request share the same entry path after remediation; the
|
|
interactive request completes in 132 ms, 3072 post-remediation packets move
|
|
over the replacement route, bulk/interactive replacement-route flow stats are
|
|
visible, and fallback, route failures, flow drops, and scheduler drops remain
|
|
0. C18Z68 turns this evidence into backend/admin flow-health diagnostics:
|
|
access telemetry now reports `flow_health_status` and `flow_health_reason` at
|
|
cluster, node, and active-channel levels using traffic-class pressure, queue
|
|
pressure, flow drops, backend fallback, route-quality failures/drops/slow
|
|
samples, and route send latency. C18Z69 adds node-side adaptive response:
|
|
runtime heartbeat flow-scheduler snapshots now include per-class
|
|
`recommended_parallel_windows` and adaptive backpressure reason, and the send
|
|
path applies the traffic-class-specific window so bulk/droppable are reduced
|
|
before interactive/control under pressure. C18Z70 projects those adaptive
|
|
runtime fields into backend access telemetry and web-admin at cluster, node,
|
|
and active-channel levels, with cluster windows aggregated by minimum non-zero
|
|
recommended window per class. C18Z71 adds an audited cluster adaptive-policy
|
|
contract for max window, queue/bulk thresholds, and per-class windows; the
|
|
effective policy fingerprint is signed into node synthetic config, reported
|
|
in runtime heartbeats, and consumed by node-agent scheduling so operators can
|
|
tune shared fabric backpressure without changing VPN/RDP-specific code.
|
|
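
The C18Z70 cluster aggregation ("minimum non-zero recommended window per
class") can be sketched as below. The function name and the input shape (one
dict of per-class windows per node) are assumptions for illustration, not the
backend's actual types.

```python
from typing import Dict, Iterable


def aggregate_cluster_windows(node_windows: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Per traffic class, keep the smallest window that is greater than zero."""
    cluster: Dict[str, int] = {}
    for windows in node_windows:
        for traffic_class, window in windows.items():
            if window <= 0:
                # Zero is treated as "no recommendation" and is skipped.
                continue
            current = cluster.get(traffic_class)
            if current is None or window < current:
                cluster[traffic_class] = window
    return cluster
```

Taking the minimum means the cluster view reflects the most constrained node
for each class, while skipping zeros keeps nodes that report no window from
hiding real recommendations.
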

C18Z72 adds an audited pool/failover policy contract for entry/exit pool
constraints, preferred entry/exit, selection strategy, failover modes,
backend fallback allowance, and sticky session mode. Lease issuance applies
that policy before route selection and signs the effective `pool_policy`
provenance into the service-channel lease authority payload. C18Z73 projects
that signed pool-policy fingerprint into active access telemetry and guards
remediation commands: backend rejects alternate routes outside the signed
entry/exit lease pools and emits `rebuild_route`, while node-agent
defensively ignores any guarded rejected `prefer_alternate_route` command
before route-manager application. Web-admin shows pool/remediation guard
status in access telemetry and node synthetic-config remediation rows. C18Z74
correlates active remediation commands with the entry node route-manager
heartbeat so access telemetry shows execution state:
`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
`pending_rebuild_request`, or `expired`, with reason/generation/observed-at.


C18Z75 records `rebuild_route` remediation as durable rebuild ledger intent
rows when node-scoped synthetic config is fetched: allowed commands become
`rebuild_status=requested` / `outcome=rebuild_requested`, while policy-guard
rejects become `rebuild_status=rejected` /
`outcome=policy_guard_rejected`. Access telemetry then reports
`rebuild_request_recorded` or `rebuild_request_rejected` for the active
channel. C18Z76 adds node-side acknowledgement for the allowed
`rebuild_route` branch: node-agent consumes the command as a route-manager
`pending_degraded_fallback` decision with source
`service_channel_remediation_command`, while guarded commands remain ignored.
Backend access telemetry correlates that heartbeat evidence with the durable
ledger and reports `rebuild_request_recorded_node_pending`. C18Z77 resolves
those durable remediation rebuild requests in the Control Plane planner:
valid alternates inside the active signed lease pools become `applied` /
`replacement_selected` route-manager decisions with the same command id,
missing safe alternates become `no_alternate`, policy/lease blocks become
`deferred_by_policy`, and stale commands become `expired`. Access telemetry
reports these as `rebuild_request_applied`,
`rebuild_request_no_alternate`, `rebuild_request_deferred_by_policy`, or
`rebuild_request_expired`. C18Z78 adds operator-facing visibility for those
planner outcomes in web-admin and live-proves the applied branch: when an
alternate route appears after lease issuance, the existing `rebuild_route`
command resolves to `applied` / `replacement_selected` and access telemetry
reports `rebuild_request_applied`.

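
The C18Z77 planner outcomes and their access-telemetry names can be summarized
as a small decision function. This is a sketch under stated assumptions: the
function signature, the order in which the checks run, and the representation
of lease pools as a set of route ids are all hypothetical; only the outcome and
telemetry strings come from the text above.

```python
from typing import Iterable, Set

# Planner outcome -> access telemetry status, as listed for C18Z77.
PLANNER_OUTCOME_TO_TELEMETRY = {
    "applied": "rebuild_request_applied",
    "no_alternate": "rebuild_request_no_alternate",
    "deferred_by_policy": "rebuild_request_deferred_by_policy",
    "expired": "rebuild_request_expired",
}


def resolve_rebuild_request(
    now: float,
    expires_at: float,
    policy_allows: bool,
    alternate_route_ids: Iterable[str],
    lease_pool_route_ids: Set[str],
) -> str:
    """Return the planner outcome for one durable rebuild request."""
    if now >= expires_at:
        return "expired"              # stale command
    if not policy_allows:
        return "deferred_by_policy"   # policy/lease block
    # Only alternates inside the signed lease pools count as safe.
    safe = [r for r in alternate_route_ids if r in lease_pool_route_ids]
    if not safe:
        return "no_alternate"
    return "applied"
```
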

C18Z79 closes that applied-branch proof loop: after the planner resolves the
existing rebuild command to a replacement route, the entry node reports a
route-manager decision for the same `rebuild_request_id`, the transition is
`applied_rebuild`, and live service-channel packet ingress selects the
replacement route with no local/backend fallback, route failures, or flow
drops. C18Z80 extends that into sustained post-rebuild pressure: five mixed
service-channel packet bursts remain on the replacement route, no stale
primary route is reselected, and fallback, route-failure, flow-drop, and
scheduler-drop deltas remain zero from the pre-pressure baseline. C18Z81
proves the negative recovery branch: when the already-applied replacement
route reports generation-valid fenced feedback, the Control Plane selects a
new safe recovery route and live traffic moves to that recovery route without
reselecting the degraded replacement or adding fallback/failure/drop deltas.
C18Z82 proves the no-safe-recovery branch: if that replacement is also fenced
and no safe recovery route exists, synthetic config reports
`service_channel_feedback_no_alternate` / `pending_degraded_fallback` with
`no_unfenced_alternate_route` instead of silently keeping a bad route.

C18Z83 projects that route-manager decision into active access telemetry and
web-admin active-channel diagnostics, including decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
C18Z84 aggregates those decisions at access-telemetry summary level so the
operator can see replacement, applied rebuild, recovery, and no-safe counts
without drilling into individual channel rows.


C18Z85 projects those access-decision aggregates into rebuild health and
incidents, adding `incident_source=access_decision` rows for active
no-safe/recovery/applied route-decision states. C18Z86 adds
channel-scoped silence/acknowledgement for those access-decision incidents:
the silence API accepts `incident_source` and `channel_id`, stores no-safe
access silences under a channel-scoped route key, and rebuild
health/incidents apply those silences so acknowledged current-generation
no-safe decisions are not counted as active bad incidents. Resurfacing on
generation change is covered in unit tests; live runtime smoke proves the
operator silence path. C18Z87 exposes active silences through the API and
web-admin, including access-decision source/channel/display route metadata,
and adds unsilence so an acknowledged access no-safe incident can be made
active again without waiting for TTL expiry. C18Z88 exposes access-decision
resurface details on incidents: the silence id, previous acknowledged
generation, and silence expiry are returned when the current active-channel
decision changes generation after acknowledgement. The live smoke proves the
incident resurfaces as active bad while preserving previous-generation
context for the operator. C18Z89 closes the generation-change operator action
loop for resurfaced access-decision incidents: incidents now include
`alert_resurfaced_cause`, previous route id, and previous channel id;
web-admin shows the cause; and the live smoke proves the operator can
re-acknowledge the resurfaced generation after validating that active-channel
decision route/generation context matches the incident.

C18Z90 introduces an explicit signed production data-plane contract on
service-channel leases: `data_plane` is present in the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
backend API as control-plane transport, fabric service channel/fabric route as
working data/steady-state transport, backend relay as degraded fallback only,
and service-neutral protocol-agnostic isolated logical flows as the runtime
contract for VPN, Remote Workspace, files, video, and future services. C18Z91
makes node-agent consume the signed/introspected data-plane contract, apply
the preferred fabric route, log data-plane mode/transports/fallback policy,
and report contract adoption in heartbeat access telemetry. C18Z92 enforces
the fallback boundary: when `backend_relay_policy=disabled`, route failure or
missing fabric route returns a visible service-channel error instead of
silently proxying working data through backend relay. C18Z93-C18Z95 project
that data-plane contract and blocked-fallback evidence into access telemetry,
incidents, and node-agent heartbeat reports. C18Z96-C18Z98 feed
access-report-derived blocked fallback send failures into durable route
feedback and rebuild ledger correlation, with bounded deduplication and
feedback identity carried into replacement decisions.

C18Z99 adds rebuild ledger filters for `feedback_source`,
`feedback_channel_id`, and `feedback_violation_status`. C18Z100 aggregates
those same fields in rebuild-health `feedback_breakdowns`, including active
warn/bad, silenced, latest observation, and affected reporter node/route
counts, and web-admin shows the breakdown in the Rebuild health panel. C18Z101
connects that operator view to investigation: each breakdown row shows related
incident context by channel/reporter/route overlap and can open the deep
rebuild ledger with source/channel/violation filters prefilled. C18Z102 adds
backend audit breadcrumbs for that drilldown, recording
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
events with the feedback source/channel/violation filters before the panel
opens the filtered deep ledger. C18Z103 surfaces recent rebuild incident and
feedback-breakdown investigation audit breadcrumbs directly in the Fabric
diagnostics panel with time, source, feedback filters, target reporter/route,
actor, and reason. C18Z104 adds focused audit loading: the cluster audit API
accepts `event_type` and `target_type` filters, and the Fabric diagnostics
panel requests just the recent fabric investigation breadcrumbs instead of
relying on the generic latest cluster audit window. C18Z105 correlates those
breadcrumbs back to the currently visible rebuild-health feedback breakdowns
or rebuild incidents in web-admin, marking whether the diagnostic object is
still active/visible and giving the operator a direct `open` action. C18Z106
moves that correlation into the backend/API: focused audit reads with
`correlation=fabric_diagnostics` return `correlation_hints` containing the
current diagnostic status and matching breakdown/incident object when
present. The rebuild-health feedback breakdown window was also raised to 100
groups so fresh failure classes remain visible on noisy long-running test
clusters. C18Z107 adds compact `audit_summary` aggregates for focused Fabric
diagnostics audit reads, including counts by current diagnostic status,
feedback source, feedback violation status, correlated/not-visible totals,
and latest time, and web-admin shows those counts above the investigation
rows.

C18Z108 splits the operator workflow read from generic cluster audit:
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
returns a dedicated `rebuild_investigation_breadcrumbs` contract with events
and summary, and web-admin consumes that endpoint for Recent investigations.


C18Z109 adds freshness windows to that contract: callers can pass
`current_window_seconds` and `history_window_seconds`, events are marked
`current`, `stale`, or `expired` in `correlation_hints.breadcrumb_status`,
and the summary includes counts by breadcrumb status for operator triage.

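
One plausible reading of the C18Z109 freshness windows is sketched below. The
thresholds are an assumption: the text names the two window parameters and the
three statuses but does not state the exact boundaries, so the comparison rules
and helper names here are illustrative only.

```python
from typing import Dict, Iterable


def breadcrumb_status(age_seconds: float, current_window_seconds: float,
                      history_window_seconds: float) -> str:
    """Classify one breadcrumb by age (assumed boundary rules)."""
    if age_seconds <= current_window_seconds:
        return "current"
    if age_seconds <= history_window_seconds:
        return "stale"
    return "expired"


def summarize_breadcrumbs(ages: Iterable[float], current_window_seconds: float,
                          history_window_seconds: float) -> Dict[str, int]:
    """Count breadcrumbs per status, mirroring the summary counts."""
    counts = {"current": 0, "stale": 0, "expired": 0}
    for age in ages:
        counts[breadcrumb_status(age, current_window_seconds, history_window_seconds)] += 1
    return counts
```
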

C19C adds the first non-VPN service-channel lease proof: Remote Workspace uses
the same signed data-plane contract, route intent model, introspection, and
maintenance visibility, but its entry descriptor is service-specific
(`remote-workspaces/.../streams`) and uses a remote-workspace frame batch media
type rather than VPN packet paths.

C19D proves the matching entry-node ingress boundary for Remote Workspace:
node-agent validates signed lease authority or introspection, service class,
channel class, selected entry node, allowed flow isolation, and data-plane
contract on `remote-workspaces/{resource_id}/streams/{channel_class}`. Empty
probe requests return `202` with a remote-workspace ingress probe contract and
access telemetry; real RDP frame forwarding remains deliberately
`not_implemented` until the service adapter work begins.

C19E adds a narrow frame-batch probe on that boundary. The adapter contract
advertises `rap.remote_workspace_frame_batch.v1`, and entry-node accepts
non-empty payloads only when they are JSON probe batches with `probe_only=true`,
valid remote-workspace logical channels, valid directions, and bounded payload
metadata. Accepted probes return `payload_flow=validated_probe_only`; production
frame forwarding is still not enabled.

C19F connects that validated probe to a node-agent local adapter sink. The
in-memory `node_agent_rdp_worker_contract_probe` sink accepts only validated
probe batches and returns `rap.remote_workspace_frame_batch_delivery.v1`
receipts. Entry responses now report `payload_flow=delivered_probe_only` when
the local sink accepts the batch; no RDP server traffic or desktop frame
forwarding is enabled by this stage.

C19G makes that sink delivery observable outside the direct ingress response:
node-agent reports `remote_workspace_adapter_sink` in `rdp-worker` workload
status and `remote_workspace_adapter_sink_report` in node telemetry, including
delivery count, latest sequence, frame count, channel class, adapter contract,
and explicit `payload_traffic=none` proof.

C19H adds negative guardrail proof for the same frame path: `probe_only=false`,
unknown logical channels, invalid channel direction, service/channel mismatch,
and unsupported payload encoding are rejected before adapter delivery. This
keeps the current Remote Workspace path as a contract probe only, not a hidden
RDP payload tunnel.

C19I adds bounded adapter handoff queue/ack semantics to that probe-only sink.
The sink reports queue capacity/depth and accepted, dropped, acked, backpressure,
and drop-policy fields in `rap.remote_workspace_frame_batch_delivery.v1`.
Current capacity is `8`: droppable display overflow is accepted with excess
frames dropped and accepted frames acked, while reliable input overflow returns
backpressure without `adapter_delivery`. The path remains
`payload_traffic=none`; real RDP frame forwarding is still deferred to the
service adapter runtime.

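
The C19I overflow behavior can be illustrated with a small stateless sketch.
The function name, argument names, and result keys are assumptions for
illustration; the capacity of `8` and the droppable-versus-reliable behavior
follow the description above.

```python
from typing import Dict, Union

QUEUE_CAPACITY = 8  # current capacity stated for the probe-only sink


def hand_off(frame_count: int, queue_depth: int, droppable: bool,
             capacity: int = QUEUE_CAPACITY) -> Dict[str, Union[int, bool]]:
    """Decide how many frames of one batch the bounded queue accepts."""
    free = max(capacity - queue_depth, 0)
    if frame_count <= free:
        # The whole batch fits: accept and ack everything.
        return {"accepted": frame_count, "dropped": 0, "acked": frame_count, "backpressure": False}
    if droppable:
        # Droppable (display) overflow: keep what fits, drop the excess.
        return {"accepted": free, "dropped": frame_count - free, "acked": free, "backpressure": False}
    # Reliable (input) overflow: reject the batch and signal backpressure.
    return {"accepted": 0, "dropped": 0, "acked": 0, "backpressure": True}
```

Rejecting a reliable batch as a whole, rather than accepting a prefix, avoids
silently losing input events; the sender is expected to retry after pressure
clears.
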

C19J promotes those queue/backpressure signals into the existing observability
surfaces. Workload status and node telemetry now expose queue capacity/depth,
cumulative accepted/dropped/acked frame counters, `backpressure_count`, and the
latest rejected batch metadata/reason, so adapter pressure can be diagnosed
without relying on the individual ingress response.

C19K binds that queue model to a probe-only adapter session identity. Entry-node
derives `adapter_session_id` from the selected service-channel context and the
adapter sink reports `adapter_runtime_id=node_agent_rdp_worker_contract_probe`
with `session_state=probe_bound` in delivery receipts, workload status, and
telemetry. Rejected reliable overflow batches keep the same session identity,
which gives the future real adapter runtime a stable lifecycle boundary while
payload forwarding remains disabled.

C19L adds lifecycle accounting for those probe-only adapter sessions. Node-agent
tracks active sessions, created/bound totals, last activity timestamps,
per-session delivery/backpressure/frame counters, idle expiry counters, and
`current_session_lifecycle_state`. Successful probe delivery binds the session;
reliable overflow records pressure on the same session instead of hiding it as a
standalone request failure.

C19M adds an explicit local control endpoint for that lifecycle:
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
accepts `close`, `expire`, and `reset`. The control result and report counters
make deliberate session shutdown visible through workload status and telemetry,
which prepares the same lifecycle shape for a real adapter runtime.

C19N adds guardrails for that endpoint: unsupported actions, malformed payloads,
invalid session IDs, unknown sessions, and oversized reasons are rejected before
state mutation. Repeated `close` is idempotent for a terminal session, reporting
the prior terminal state without double-counting closed sessions.

C19O adds a direct snapshot endpoint for diagnostics:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`
returns active and optional terminal adapter sessions with lifecycle state,
activity/backpressure timestamps, counters, and runtime identity. This gives the
future real adapter runtime an operator-facing inspection surface before payload
forwarding is enabled.

C19P adds the runtime handoff mailbox for active adapter sessions. The mailbox is
bounded in memory and stores `frame_batch_probe_delivered` and `backpressure`
events with sequence numbers and service-channel context. A future `rdp-worker`
runtime can read or drain it via
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`,
while snapshots and telemetry expose mailbox depth and enqueue/drain/drop
counters.

C19Q hardens that mailbox handoff surface. Invalid adapter session IDs, unknown
sessions, and invalid limits are rejected without mutating mailbox state, while
`drain=true&limit=N` can remove events in bounded chunks and leave the remaining
depth visible for the next adapter-runtime poll. The mailbox is verified under
pressure as drop-oldest bounded state, and a closed adapter session is no longer
readable as an active runtime mailbox. This preserves the probe-only boundary
and still does not enable RDP frame forwarding.

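
A minimal in-memory sketch of the drop-oldest mailbox with bounded drains
described for C19P/C19Q. The class and method names are hypothetical and the
event shape is reduced to a sequence number and a kind; it is not the
node-agent implementation.

```python
from collections import deque
from typing import Deque, Dict, List


class Mailbox:
    """Bounded mailbox that evicts the oldest event when full."""

    def __init__(self, capacity: int) -> None:
        self._events: Deque[Dict[str, object]] = deque()
        self._capacity = capacity
        self._next_sequence = 1
        self.dropped_total = 0

    def enqueue(self, kind: str) -> Dict[str, object]:
        if len(self._events) >= self._capacity:
            self._events.popleft()      # drop-oldest policy
            self.dropped_total += 1
        event = {"sequence": self._next_sequence, "kind": kind}
        self._next_sequence += 1
        self._events.append(event)
        return event

    def read(self, limit: int, drain: bool = False) -> Dict[str, object]:
        if limit < 1:
            # Invalid limits are rejected before any state is touched.
            raise ValueError("limit must be a positive integer")
        events: List[Dict[str, object]] = list(self._events)[:limit]
        if drain:
            for _ in events:
                self._events.popleft()  # remove only the returned chunk
        return {"events": events, "depth": len(self._events)}
```

A drain with `limit` smaller than the depth leaves the remaining events in
place, so the next poll sees the leftover depth.
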

C19R adds bounded mailbox polling ergonomics for that future runtime consumer.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
now accepts `wait_ms`, returns explicit `empty`, `waited`, `wait_timeout`, and
`wait_ms` fields, and wakes when a new mailbox event arrives before the timeout.
The wait remains node-local and probe-only; it does not enable desktop frame
transport, backend relay, or production RDP payload forwarding.

C19S promotes those mailbox consumer signals into node-agent diagnostics.
Workload status, heartbeat telemetry, and active session snapshots now expose
mailbox read, wait, timeout, and empty-read counters plus last mailbox read
metadata. This lets operators identify hot polling or idle adapter consumers
without opening a data-plane path or forwarding desktop frames.

C19T adds node-local mailbox consumer checkpoint/ack metadata for the future
adapter runtime handoff. The mailbox endpoint accepts `consumer_id` and
`ack_sequence`, validates both before reading state, and returns consumer read,
ack, checkpoint, ack sequence, and lag metadata. The probe sink keeps bounded
per-session consumer cursor state and exposes aggregate/current-session
consumer counters in workload status and heartbeat telemetry. This remains a
diagnostic handoff contract only: no RDP frames are forwarded, no backend relay
semantics are introduced, and the mailbox stays node-local.

C19U adds lifecycle guardrails for those node-local consumer cursors. A consumer
can request `reset_consumer=true` with a valid `consumer_id` to clear its cursor
before the current mailbox read is recorded, and mailbox responses now expose
consumer capacity/count plus created/reset/evicted lifecycle metadata. Workload
status and heartbeat telemetry also expose reset and eviction counters, keeping
cursor cleanup observable without changing mailbox delivery or enabling
payload forwarding.

C19V adds read-only cursor inspection for adapter-runtime handoff recovery.
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
returns the active session's bounded consumer cursor list with checkpoint, ack,
lag, read/ack totals, and timestamps. The endpoint supports a bounded `limit`
and does not read, drain, reset, or mutate mailbox state, so inspection remains
node-local and diagnostic-only.

C19W adds cursor-aware resume reads for mailbox consumers. The mailbox endpoint
now accepts `after_sequence` for non-destructive reads and returns
`after_sequence`, `skipped_count`, and `returned_count` so adapter runtimes can
resume from a checkpoint without client-side filtering. Long-poll waits for
events newer than the requested sequence, and `after_sequence` is rejected with
`drain=true` to keep resume reads separate from destructive mailbox drains.

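
The C19W resume read can be sketched as a pure function over the retained
events. Treating `skipped_count` as the number of retained events at or before
`after_sequence` is an assumption, as are the function name and the use of
`ValueError` for the rejected combination.

```python
from typing import Dict, List


def read_after(events: List[Dict[str, int]], after_sequence: int, limit: int,
               drain: bool = False) -> Dict[str, object]:
    """Non-destructive read of events newer than `after_sequence`."""
    if drain:
        # Resume reads must stay separate from destructive drains.
        raise ValueError("after_sequence cannot be combined with drain=true")
    newer = [e for e in events if e["sequence"] > after_sequence]
    returned = newer[:limit]
    return {
        "after_sequence": after_sequence,
        "skipped_count": len(events) - len(newer),
        "returned_count": len(returned),
        "events": returned,
    }
```

Because the input list is never modified, repeated calls with the same
`after_sequence` return the same window until new events arrive.
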

C19X adds consumer-aware resume convenience on top of that explicit sequence
window. `resume_from=ack|checkpoint` can be used with `consumer_id` to resolve
the read window from the stored consumer cursor before reading the mailbox, and
responses include `resume_from` and `resume_sequence`. Resume requests reject
manual `after_sequence`, `drain=true`, reset, missing consumers, and unknown
consumer cursors so adapter runtimes cannot accidentally mix cursor modes.

C19Y adds resume telemetry for operator diagnostics. Workload status and
heartbeat reports expose resume/after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots mirror the per-session counters so diagnostics
can distinguish normal polling from cursor-resume reads without reading or
draining mailbox state.

C19Z adds a compact adapter-runtime readiness summary to the sink report.
`adapter_runtime_readiness` combines probe-only status, session lifecycle state,
mailbox depth, consumer cursor, resume cursor, lag, and returned/skipped counts
into one diagnostic object so operators can verify handoff readiness without
triggering mailbox reads or drains.

C19Z1 adds a read-only mailbox handoff preflight endpoint. Adapter runtimes can
call `/mailbox/preflight` with `consumer_id` and `resume_from=ack|checkpoint`
to validate the stored cursor and inspect the next expected event window without
reading, draining, acking, or mutating consumer state.

C19Z2 adds separate telemetry for those handoff checks. Workload status and
heartbeat reports expose preflight totals split by ack/checkpoint cursor and the
last preflight session, consumer, cursor, after-sequence, available/returned/
skipped counts, and expected sequence range; readiness diagnostics mirror the
latest preflight summary.

C19Z3 adds stale-cursor diagnostics to preflight. When a consumer cursor points
behind dropped bounded-mailbox events, the preflight response reports retained
sequence bounds, `diagnostic_state=stale_cursor_gap`, `stale_cursor=true`, and
`missing_dropped_count`; workload/heartbeat telemetry and readiness diagnostics
mirror that latest stale state.

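
A sketch of how a stale cursor gap might be detected from the retained sequence
bounds. The arithmetic, the function name, the result keys other than the three
named in C19Z3, and the non-stale `diagnostic_state` value `ok` are assumptions;
the text only fixes `diagnostic_state=stale_cursor_gap`, `stale_cursor`, and
`missing_dropped_count`. The sketch also assumes a non-empty retained window.

```python
from typing import Dict


def preflight_stale_check(resume_sequence: int, first_retained: int,
                          last_retained: int) -> Dict[str, object]:
    """Read-only check: does the cursor point behind dropped events?"""
    expected_next = resume_sequence + 1
    # Events in [expected_next, first_retained) were dropped before being read.
    missing = max(first_retained - expected_next, 0)
    stale = missing > 0
    return {
        "first_retained_sequence": first_retained,
        "last_retained_sequence": last_retained,
        "stale_cursor": stale,
        "missing_dropped_count": missing,
        # "ok" is a placeholder for whatever the real non-stale state is.
        "diagnostic_state": "stale_cursor_gap" if stale else "ok",
    }
```

For example, a cursor at sequence 2 against a mailbox retaining 5 through 9 has
missed sequences 3 and 4, so the check reports two missing events.
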

C19Z4 adds explicit action hints to those diagnostics. Preflight responses now
include `recommended_action` and `action_hints`; stale cursor gaps recommend
resetting the consumer cursor, requesting a full adapter resync, and resuming
from checkpoint after resync. Telemetry and readiness diagnostics mirror the
latest recommended action and hints.

C19Z5 adds remediation provenance for those hints. Preflight responses,
workload/heartbeat telemetry, and readiness diagnostics include
`action_reason` plus structured `action_context` with the resume cursor,
retained sequence bounds, dropped/missing counts, consumer checkpoint/ack, and
expected window counters that explain why the recommended action was chosen.

C19Z6 adds a compact operator-facing preflight summary derived from the same
read-only state. Preflight responses, telemetry, and readiness diagnostics now
include `operator_summary` and `operator_summary_fields` so dashboards can show
the diagnostic state, action, reason, resume cursor, retained bounds, and key
window counters without recomputing or mutating mailbox state.

C19Z7 adds machine-sortable operator status and severity to that summary.
Preflight responses, telemetry, readiness diagnostics, and
`operator_summary_fields` now expose `operator_status` and `operator_severity`
so dashboards can sort ready, caught-up, and resync-required handoffs without
parsing human text.

C19Z8 groups the latest preflight view for admin UI consumption. The readiness
diagnostic keeps all existing flat latest-preflight fields and adds
`last_preflight` with observed time, cursor, counts, diagnostic state, selected
action, action provenance, operator summary, status, severity, and summary
fields.

C19Z9 adds retained-window detail to that grouped readiness view. The
`last_preflight` object now includes first/last retained sequence and mailbox
dropped total so stale-cursor summaries can explain the bounded mailbox window
without requiring a separate raw preflight lookup.

C19Z10 adds a structured remediation checklist to the grouped readiness view.
The `last_preflight.remediation_checklist` entries are derived from diagnostic
state and action hints, marking required/satisfied operator steps for cursor
reset, adapter resync, and post-resync resume without executing those actions.

C19Z11 adds summary status and counts for that checklist. The grouped readiness
view now exposes `remediation_checklist_status` plus total, required,
satisfied, and pending counts so admin UI can render checklist state without
scanning the step array.

C19Z12 adds per-session preflight operator status/severity counters. Readiness
now exposes counts for statuses such as `ready_to_resume`, `caught_up`, and
`resync_required`, plus severity counts such as `ok`, `info`, and `warn`, and
the grouped latest-preflight rollup mirrors those counters for dashboard
context.

C19Z13 derives a compact preflight attention status from those counters.
Readiness and `last_preflight` expose `preflight_attention_status` values such
as `clean`, `needs_attention`, and `repeated_resync_required`, letting admin UI
sort sessions without interpreting count maps directly.

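
The derivation of an attention status from per-session status counters might
look like the sketch below. The thresholds (one `resync_required` preflight
means `needs_attention`, two or more mean `repeated_resync_required`) and the
`unknown` result for an empty counter map are assumptions chosen to match the
status names in this document, not a confirmed rule.

```python
from typing import Dict


def preflight_attention_status(status_counts: Dict[str, int]) -> str:
    """Collapse operator-status counters into one sortable attention value."""
    if sum(status_counts.values()) == 0:
        return "unknown"                     # no preflight observed yet
    resync = status_counts.get("resync_required", 0)
    if resync >= 2:
        return "repeated_resync_required"
    if resync == 1:
        return "needs_attention"
    return "clean"
```

Keeping this derivation on the reporting side means UI code only compares a
single string instead of interpreting the count maps.
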

C19Z14 proves the repeated-resync branch. Unit and live smoke coverage now run
multiple stale preflights on the same active adapter session and verify
`preflight_attention_status=repeated_resync_required` with repeated
`resync_required` / `warn` counters, while the preflight path remains read-only.

C19Z15 adds `preflight_attention_reason` beside the attention status. The reason
is derived from the latest preflight counters/status and explains clean,
attention-needed, and repeated-resync states without requiring UI code to parse
the counter maps.

C19Z16 completes focused proof coverage for those reasons. Unit coverage proves
clean, single-resync, repeated-resync, and no-preflight mappings, and live smoke
proves the single stale-preflight `resync_required_preflight_observed` reason.

C19Z17 adds a diagnostics contract marker to the grouped preflight readiness
rollup. `last_preflight` now includes `diagnostics_schema_version` and a
`diagnostics_contract` list for retained-window, remediation-checklist,
attention, and operator-count fields so admin UI can gate rendering safely.

C19Z18 adds machine-readable feature flags for that contract. `last_preflight`
now includes boolean `diagnostics_features` entries for retained-window,
remediation-checklist, attention, and operator-count diagnostics, allowing UI
and automation clients to check support without scanning the contract list.

C19Z19 adds a compatibility proof for the two contract forms. Unit and live
smoke coverage now verify that workload and telemetry reports expose matching
`diagnostics_contract` entries and `diagnostics_features` booleans for each
preflight diagnostics group.

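
The alignment that C19Z19 proves can be expressed as a simple set comparison.
The entry names used in the usage example (`retained_window`,
`remediation_checklist`, `attention`, `operator_counts`) are hypothetical
spellings of the four groups named above, and the helper itself is only a
sketch of such a check.

```python
from typing import Dict, Iterable


def contract_matches_features(diagnostics_contract: Iterable[str],
                              diagnostics_features: Dict[str, bool]) -> bool:
    """True when every contract entry has a true flag and no extra flag is set."""
    enabled = {name for name, on in diagnostics_features.items() if on}
    return set(diagnostics_contract) == enabled
```

A client could run the same comparison before rendering, for example:

```python
contract = ["retained_window", "remediation_checklist", "attention", "operator_counts"]
features = {name: True for name in contract}
aligned = contract_matches_features(contract, features)
```
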
C19Z20 adds the no-preflight absence proof. Active adapter sessions that have
|
|
not observed a mailbox preflight report `preflight_attention_status=unknown`,
|
|
`preflight_attention_reason=no_preflight_observed`, zero session preflight
|
|
count, and no grouped `last_preflight` rollup, so UI can distinguish "not
|
|
observed yet" from an observed clean state.
|
|
C19Z21 adds the no-active-session readiness proof. After the last adapter
session is closed, readiness reports idle/not-ready with zero active sessions,
no active `adapter_session_id`, no `last_preflight` rollup, and terminal
`last_session_state=closed` from the terminal-session ledger.

C19Z22 extends terminal-state coverage to `expire` and `reset` controls. The
same no-active-session readiness shape now proves `last_session_state=expired`
and `last_session_state=reset` from the terminal-session ledger.

C19Z23 adds grouped terminal-session summary metadata for the no-active-session
case. Readiness now includes `terminal_session_summary` with adapter session id,
terminal state, reason, and control timestamp while retaining flat compatibility
fields.

C19Z24 adds a contract marker to that summary. The grouped
`terminal_session_summary` now carries a schema version and summary-contract
field list so UI can gate rendering explicitly.

C19Z25 adds boolean feature flags for the same grouped terminal summary fields,
mirroring the preflight diagnostics contract/feature pattern.

C19Z26 adds compatibility proof coverage for those two terminal summary contract
forms, verifying that `summary_contract` entries and `summary_features` booleans
stay aligned in workload and telemetry reports.

C19Z27 adds absence proof coverage for a fresh no-session runtime: before any
terminal history exists, readiness stays in `waiting_for_session` and does not
include `terminal_session_summary`.

C19Z28 adds the grouped no-session readiness summary for that empty-runtime
state. Fresh adapter readiness now includes `no_session_summary` with schema
version `rap.remote_workspace_adapter_no_session_summary.v1`, a summary
contract for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, and matching idle/waiting-for-session counts, while
the terminal-session summary remains absent until terminal history exists.

C19Z29 adds boolean `summary_features` to the same grouped no-session summary
for `status`, `diagnostic_state`, `active_session_count`, and
`terminal_session_count`, matching the terminal summary and preflight
diagnostics feature-flag convention.

C19Z30 adds compatibility proof coverage for the grouped no-session summary,
verifying that `summary_contract` entries and `summary_features` booleans stay
aligned in workload and telemetry reports.

C19Z31 adds the inverse terminal-history absence proof: after adapter sessions
reach terminal states, readiness exposes `terminal_session_summary` and omits
`no_session_summary` in workload and telemetry reports.

C19Z32 proves readiness summary exclusivity across the three runtime shapes:
fresh exposes only `no_session_summary`, active exposes neither grouped summary,
and terminal exposes only `terminal_session_summary`.

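The exclusivity rule from C19Z32 is small enough to express as a lookup table. The sketch below assumes the readiness report is a plain mapping and uses `fresh`, `active`, and `terminal` as shape labels; the function names are hypothetical and not taken from the runtime.

```python
from typing import Any, Mapping

GROUPED_SUMMARIES = ("no_session_summary", "terminal_session_summary")

# Which grouped summary each runtime shape is allowed to expose.
EXPECTED_SUMMARIES = {
    "fresh": {"no_session_summary"},
    "active": set(),
    "terminal": {"terminal_session_summary"},
}


def summaries_present(readiness: Mapping[str, Any]) -> set[str]:
    """Return the grouped summaries that appear in a readiness report."""
    return {key for key in GROUPED_SUMMARIES if key in readiness}


def shape_is_exclusive(shape: str, readiness: Mapping[str, Any]) -> bool:
    """True when the report exposes exactly the summaries its shape allows."""
    return summaries_present(readiness) == EXPECTED_SUMMARIES[shape]


fresh_report = {
    "no_session_summary": {
        "schema_version": "rap.remote_workspace_adapter_no_session_summary.v1",
    },
}
print(shape_is_exclusive("fresh", fresh_report))     # prints True
print(shape_is_exclusive("terminal", fresh_report))  # prints False
```
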
C19Z33 adds a compact readiness state matrix artifact for admin/runtime handoff:
fresh, active, and terminal rows are emitted for workload and telemetry with
only the relevant readiness fields and summary-presence booleans.

C19Z34 adds an explicit probe-to-runtime gate artifact. It confirms the current
Remote Workspace runtime is still `contract_probe`, `probe_only=true`, and
`payload_traffic=none`, lists the ready contracts, and records the remaining
runtime gates before real RDP frame transport can be enabled.

C19Z35 adds the disabled-by-default real-adapter supervision scaffold. The
`rdp-worker` contract-probe status now advertises
`rap.remote_workspace_real_adapter_supervision.v1` with future config env names,
status contract fields, and guardrails, while `contract_probe` remains the only
active execution mode and payload traffic remains `none`.

C19Z36 adds compatibility proof for that scaffold, verifying the disabled state,
status contract, env names, process model, and guardrails remain aligned in unit
and live workload status coverage.

C19Z37 adds disabled real-adapter config projection. Node-agent parses the
future `RAP_REMOTE_WORKSPACE_REAL_ADAPTER_*` env values and reports only
sanitized status metadata under
`real_adapter_supervision.config_projection`: whether enable was requested,
whether command/args/workdir are present, args JSON shape, and that raw values
are redacted. This does not activate the real adapter; `enabled=false`,
`activation_allowed=false`, and `payload_traffic=none` remain required.

C19Z38 proves projection compatibility across default/empty and requested
config shapes. Unit and live smoke coverage verify absent env and requested
env both keep activation blocked, raw values redacted, and payload traffic
disabled.

C19Z39 adds an explicit disabled activation decision contract. The real adapter
status now reports `decision=blocked`,
`reason=real_runtime_stage_not_enabled`, `activation_allowed=false`, and the
missing gates before a future stage may start an external RDP worker process.

C19Z40 adds a compact handoff report proving that the supervision scaffold,
config projection, and blocked activation decision remain aligned for both
requested and default config shapes.

C19Z41 adds real-adapter supervision feature flags for config projection,
activation decision, missing gates, and raw-value redaction so UI and
automation clients can gate rendering explicitly.

C19Z42 folds those feature flags into the compact handoff report, proving
scaffold/projection/decision/features alignment for requested and default node
config in one admin/runtime artifact.

C19Z43 proves contract-probe precedence when desired workload config includes
both `adapter_contract_probe` and `real_adapter_supervision`; the runtime stays
running in probe mode and real-adapter activation remains blocked.

C19Z44 proves the real-adapter-only desired workload path remains degraded and
blocked, with the same disabled activation contract and no payload traffic.

C19Z45 adds a compact desired-workload mode matrix for probe-only,
real-adapter-only, and combined requested modes, confirming all paths retain
disabled real-adapter activation and no payload traffic.

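The three rows of the desired-workload mode matrix can be sketched as a pure function of the requested workload keys. The row keys `mode` and `runtime_state` and the mode labels are hypothetical, and the assumption that the probe-only path reports `running` is inferred from the combined case rather than stated outright. The decision fields repeat the values from C19Z39.

```python
from typing import Any, Iterable

PROBE = "adapter_contract_probe"
REAL = "real_adapter_supervision"


def mode_row(desired: Iterable[str]) -> dict[str, Any]:
    """Build one matrix row for a set of desired workload keys."""
    wanted = set(desired)
    probe, real = PROBE in wanted, REAL in wanted
    if probe and real:
        # Contract probe takes precedence when both are requested.
        mode, runtime_state = "combined", "running"
    elif probe:
        mode, runtime_state = "probe_only", "running"
    elif real:
        mode, runtime_state = "real_adapter_only", "degraded"
    else:
        raise ValueError("no supported workload requested")
    return {
        "mode": mode,
        "runtime_state": runtime_state,
        # Identical in every row: activation stays blocked, no payload flows.
        "decision": "blocked",
        "reason": "real_runtime_stage_not_enabled",
        "activation_allowed": False,
        "payload_traffic": "none",
    }


def mode_matrix() -> list[dict[str, Any]]:
    return [mode_row(keys) for keys in ([PROBE], [REAL], [PROBE, REAL])]


for row in mode_matrix():
    print(row["mode"], row["runtime_state"], row["decision"])
```
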
C19Z46 adds compatibility proof for that mode matrix row contract, including
explicit feature-flag and missing-gate visibility markers.

C19Z47 adds a disabled process-supervisor preconditions contract for the future
external RDP worker process while keeping `process_start_allowed=false` and all
payload traffic disabled.

C19Z48 proves that process-supervisor preconditions contract across requested
and default config shapes, including required/missing checks and disabled start.

C19Z49 folds process-supervisor preconditions into the compact handoff report,
proving alignment with projection, activation decision, and feature flags.

C19Z50 folds those preconditions into the desired-workload mode matrix, proving
process start remains disabled across probe-only, real-adapter-only, and
combined requested modes.

C19Z51 adds compatibility proof for that mode matrix v2 row contract.

C19Z52 adds a disabled process-health-probe contract for the future external
RDP worker process while keeping health probes disabled and payload traffic at
`none`.

C19Z53 proves that process-health-probe contract across requested/default
status forms.

C19Z54 folds process-health-probe visibility into the compact handoff report,
proving disabled health probes and payload-free alignment across all
real-adapter handoff contracts.

C19Z55 folds process-health-probe visibility into the desired-workload mode
matrix, proving disabled health probes and no payload traffic across probe-only,
real-adapter-only, and combined requested modes.

C19Z56 adds compatibility proof for that mode matrix v3 row contract.

C19Z57 ties handoff v4 and mode matrix v3 compatibility into a compact disabled
real-adapter readiness/handoff checklist.

C19Z58 adds compatibility proof for that readiness/handoff summary and
checklist contract.

C19Z59 derives a disabled real-adapter operator action map from that checklist
while keeping activation, process start, and payload forwarding blocked.

C19Z60 adds compatibility proof for that operator action map contract.

C19Z61 groups the disabled real-adapter readiness summary, checklist, and
action map into one compact admin handoff bundle.

C19Z62 adds compatibility proof for that admin handoff bundle contract.

C19Z63 derives compact admin handoff digest display rows from the bundle while
preserving disabled runtime guardrails.

C19Z64 adds compatibility proof for that admin handoff digest row contract.

C19Z65 adds a digest rollup with severity/state counts, primary action, and
guardrail summary.

C19Z66 adds compatibility proof for that digest rollup contract.

C19Z67 summarizes the proven disabled real-adapter admin handoff chain from
handoff v4 through digest rollup compatibility.

C19Z68 adds compatibility proof for that full-chain summary contract.

C19Z69 marks the disabled real-adapter admin handoff package as
contract-only-ready while keeping the real runtime stage blocked.

C19Z70 proves the release marker contract remains compatible while keeping the
real runtime stage blocked.

C19Z71 adds a final contract-only package index for the disabled real-adapter
admin handoff chain.

C19Z72 proves the final package index contract for the disabled real-adapter
admin handoff chain.

C19Z73 adds a contract-only runtime gate phase boundary for the next disabled
real-adapter preflight phase.

C19Z74 proves the runtime gate phase boundary contract.

C19Z75 adds a disabled real-adapter runtime gate preflight checklist with all
items still blocking runtime.

C19Z76 proves the disabled real-adapter runtime gate preflight checklist
contract.

C19Z77 adds a disabled real-adapter runtime gate preflight status summary.

C19Z78 proves the disabled real-adapter runtime gate preflight status summary
contract.

C19Z79 adds disabled real-adapter runtime gate preflight action hints.

C19Z80 proves the disabled real-adapter runtime gate preflight action hints
contract.

C19Z81 adds a disabled real-adapter runtime gate preflight operator handoff
bundle.

C19Z82 proves the disabled real-adapter runtime gate preflight operator handoff
bundle contract.

C19Z83 adds a disabled real-adapter runtime gate preflight release marker.

C19Z84 proves the disabled real-adapter runtime gate preflight release marker
contract.

C19Z85 adds a disabled real-adapter runtime gate preflight package index.

C19Z86 proves the disabled real-adapter runtime gate preflight package index
contract.

C19Z87 adds a disabled real-adapter runtime gate preflight closeout summary.

C19Z88 proves the disabled real-adapter runtime gate preflight closeout summary
contract.

C19Z89 starts the explicit real-adapter runtime gate enablement phase with a
contract-only request that remains blocked pending validation.

C19Z90 proves the explicit real-adapter runtime gate enablement request
contract.

C19Z91 adds contract-only operator confirmation validation while keeping the
runtime gate blocked pending remaining validations.

C19Z92 proves the operator confirmation validation contract.

C19Z93 adds contract-only binary validation while keeping the runtime gate
blocked pending remaining validations.

C19Z94 proves the binary validation contract.

C19Z95 adds contract-only permission validation while keeping the runtime gate
blocked pending remaining validations.

C19Z96 proves the permission validation contract.

C19Z97 adds contract-only supervisor validation while keeping the runtime gate
blocked pending remaining validations.

C19Z98 proves the supervisor validation contract.

C19Z99 adds contract-only health probe validation while keeping the runtime gate
blocked pending payload gate validation.

C19Z100 proves the health probe validation contract.

C19Z101 adds contract-only payload gate validation with no remaining required
validations while keeping runtime not enabled.

C19Z102 proves the payload gate validation contract.

C19Z103 adds the runtime gate validation closeout while keeping explicit
operator enablement required.

C19Z104 proves the runtime gate validation closeout contract.

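The validation sequence from C19Z91 through C19Z103 follows one pattern: each stage removes one item from the remaining-validation list, and finishing the list still does not enable the runtime. A minimal sketch of that bookkeeping, with hypothetical identifiers for the six validations and for the status keys:

```python
from typing import Any, Iterable

# Order follows the stages above; the identifiers themselves are hypothetical.
REQUIRED_VALIDATIONS = (
    "operator_confirmation",
    "binary",
    "permission",
    "supervisor",
    "health_probe",
    "payload_gate",
)


def gate_status(passed: Iterable[str]) -> dict[str, Any]:
    """Summarize the runtime gate after some validations have passed."""
    done = set(passed)
    remaining = [name for name in REQUIRED_VALIDATIONS if name not in done]
    return {
        "remaining_validations": remaining,
        "validations_complete": not remaining,
        # Completing every validation does not enable anything by itself.
        "operator_enablement_required": True,
        "runtime_enabled": False,
    }


print(gate_status(REQUIRED_VALIDATIONS[:5])["remaining_validations"])
# prints ['payload_gate']
```
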
C19Z105 adds an operator enablement readiness package while keeping runtime
disabled by default.

C19Z106 proves the operator enablement readiness package contract.

C19Z107 adds an operator enablement readiness release marker while keeping
runtime disabled by default.

C19Z108 proves the operator enablement readiness release marker contract.

C19Z109 adds an operator enablement readiness package index while keeping
runtime disabled by default.

C19Z110 proves the operator enablement readiness package index contract.

C19Z111 adds an operator readiness closeout summary while keeping runtime
disabled by default.

C19Z112 proves the operator readiness closeout summary contract.

C19Z113 adds an operator review decision request while keeping runtime disabled
by default.

C19Z114 proves the operator review decision request contract.

C19Z115 adds an operator decision status summary while keeping runtime disabled
by default.

C19Z116 proves the operator decision status summary contract.

C19Z117 adds an operator approval/rejection outcome contract with the outcome
not approved and runtime disabled by default.

C19Z118 proves the operator approval/rejection outcome contract.

C19Z119 adds an operator outcome closeout/reopen boundary while keeping runtime
disabled by default.

C19Z120 proves the operator outcome closeout/reopen boundary contract.

C19Z121 adds a not-approved outcome release marker while keeping runtime
disabled by default.

C19Z122 proves the not-approved outcome release marker contract.

C19Z123 adds a not-approved outcome package index while keeping runtime disabled
by default.

C19Z124 proves the not-approved outcome package index contract.

C19Z125 adds a not-approved outcome closeout summary while keeping runtime
disabled by default.

C19Z126 proves the not-approved outcome closeout summary contract.

C19Z127 adds a final not-approved outcome release marker while keeping runtime
disabled by default.

C19Z128 proves the final not-approved outcome release marker contract.

C19Z129 adds a final not-approved outcome package index/archive marker while
keeping runtime disabled by default.

C19Z130 proves the final not-approved outcome package index/archive marker
contract.

C19Z131 adds a not-approved outcome archive closeout manifest while keeping
runtime disabled by default.

C19Z132 proves the not-approved outcome archive closeout manifest contract.

C19Z133 adds a stopped-branch sentinel for the not-approved outcome while
keeping runtime disabled by default.

C19Z134 proves the not-approved outcome stopped-branch sentinel contract.

C19Z135 adds a no-continuation guard for the stopped not-approved outcome while
keeping runtime disabled by default.

C19Z136 proves the not-approved outcome no-continuation guard contract.

C19Z137 adds continuation block enforcement for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z138 proves the not-approved outcome continuation block enforcement
contract.

C19Z139 adds a continuation block audit record for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z140 proves the not-approved outcome continuation block audit record
contract.

C19Z141 adds a continuation block audit rollup for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z142 proves the not-approved outcome continuation block audit rollup
contract.

C19Z143 adds an operator stop summary for the stopped not-approved outcome
while keeping runtime disabled by default.

C19Z144 proves the not-approved outcome operator stop summary contract.

C19Z145 adds an operator stop handoff for the stopped not-approved outcome
while keeping runtime disabled by default.

C19Z146 proves the not-approved outcome operator stop handoff contract.

C19Z147 adds an operator stop handoff digest for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z148 proves the not-approved outcome operator stop handoff digest contract.

C19Z149 adds an operator stop status snapshot for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z150 proves the not-approved outcome operator stop status snapshot contract.

C19Z151 adds an operator stop status snapshot index for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z152 proves the not-approved outcome operator stop status snapshot index
contract.

C19Z153 adds an operator stop status catalog for the stopped not-approved
outcome while keeping runtime disabled by default.

C19Z154 proves the not-approved outcome operator stop status catalog contract.

C19Z155 adds an operator stop status catalog release marker for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z156 proves the not-approved outcome operator stop status catalog release
marker contract.

C19Z157 adds an operator stop status catalog package index for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z158 proves the not-approved outcome operator stop status catalog package
index contract.

C19Z159 adds an operator stop status catalog closeout summary for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z160 proves the not-approved outcome operator stop status catalog closeout
summary contract.

C19Z161 adds an operator stop status final archive marker for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z162 proves the not-approved outcome operator stop status final archive
marker contract.

C19Z163 adds an operator stop status final archive manifest for the stopped
not-approved outcome while keeping runtime disabled by default.

C19Z164 proves the not-approved outcome operator stop status final archive
manifest contract.

C19Z165 adds a terminal-complete marker for the stopped not-approved outcome
factory while keeping runtime disabled by default.

C19Z166 proves the not-approved outcome factory terminal-complete contract.

C20Z1 opens a new explicit real-adapter enablement request while keeping
runtime disabled by default.

C20Z2 proves the new explicit real-adapter enablement request contract.

C20Z3 adds the operator validation intake for the new explicit request while
keeping runtime disabled by default.

C20Z4 completes the operator validation checklist contract while keeping
runtime disabled by default.

C20Z5 closes the operator validation chain contract while keeping runtime
disabled by default.

C20Z6 proves the C20 stage terminal-complete contract.

5. Move VPN packet flow to the service channel and keep backend relay only as
   explicit degraded fallback.
6. Run load tests against the fabric channel: many streams, route failure,
   exit failure, NAT/outbound-only nodes, queue pressure, DNS/LAN/Internet
   egress.
7. Build Remote Server/Desktop Access on top of this channel, not beside it.

## Non-Negotiable Guardrails

- Do not solve new service performance problems inside a protocol-specific
  client before checking the common fabric channel.
- Do not add a production service that depends on backend packet/frame relay as
  the steady-state path.
- Do not expose internal mesh topology to organization users.
- Do not merge VPN and Remote Server/Desktop Access into one product.
- Do not let bulk traffic starve interactive traffic.
- Do not hide degraded fallback; report it visibly in diagnostics/admin UI.

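One way to picture the bulk-versus-interactive guardrail is a two-class queue that prefers interactive frames but yields to bulk after a bounded burst, so neither class stalls. This is an illustrative policy only, not the scheduler used by the Fabric Service Channel runtime; the class name, the traffic-class labels, and the `interactive_burst` parameter are all hypothetical.

```python
from collections import deque
from typing import Deque, Optional


class TwoClassScheduler:
    """Serve interactive frames first, but let bulk through after each burst."""

    def __init__(self, interactive_burst: int = 4) -> None:
        self.interactive: Deque[str] = deque()
        self.bulk: Deque[str] = deque()
        self.interactive_burst = interactive_burst
        self._served_in_burst = 0

    def push(self, traffic_class: str, frame: str) -> None:
        queue = self.interactive if traffic_class == "interactive" else self.bulk
        queue.append(frame)

    def pop(self) -> Optional[str]:
        burst_left = self._served_in_burst < self.interactive_burst
        if self.interactive and (burst_left or not self.bulk):
            self._served_in_burst += 1
            return self.interactive.popleft()
        if self.bulk:
            # One bulk frame resets the interactive burst budget.
            self._served_in_burst = 0
            return self.bulk.popleft()
        return None


scheduler = TwoClassScheduler(interactive_burst=2)
for name in ("b1", "b2"):
    scheduler.push("bulk", name)
for name in ("i1", "i2", "i3", "i4", "i5"):
    scheduler.push("interactive", name)

order = [scheduler.pop() for _ in range(7)]
print(order)  # prints ['i1', 'i2', 'b1', 'i3', 'i4', 'b2', 'i5']
```
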