2026-05-18 21:33:39 +03:00
parent 5096155d83
commit 469fa0e860
94 changed files with 8761 additions and 8003 deletions
+15 -2
@@ -6,6 +6,16 @@ This file exists so architecture documents have a stable guardrails reference
inside `docs/architecture`. The operational Codex guardrails remain in
`docs/codex/ARCHITECTURE_GUARDRAILS.md`.
Transport clarification: references in this document to direct worker WSS and
backend gateway fallback belong to the preserved historical RDP service
baseline. They are not the active source of truth for inter-node transport.
Current fabric node-to-node transport is QUIC-only and is defined by
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Node survivability, recovery overlap, and no-manual-access repair rules are
defined by `docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`.
## 1. Preserve the Proven RDP Baseline
The following are already proven and must remain stable:
@@ -16,8 +26,8 @@ The following are already proven and must remain stable:
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- direct worker WSS data plane
- backend gateway fallback
- historical direct worker WSS RDP path
- historical backend gateway fallback for the RDP baseline
- C++ RDP Adapter as the active RDP runtime
Architecture clarification must not silently weaken this behavior.
@@ -191,6 +201,9 @@ Updates must support:
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
- legacy recovery compatibility until the fleet is converged or explicitly
retired
- multi-source artifact retrieval for stranded or NAT-only nodes
Version Storage stores immutable release manifests, artifacts, hashes,
signatures, compatibility metadata, provenance, and approved migration bundles.
@@ -1059,7 +1059,8 @@ accepts a signed/introspected `remote_workspace` service-channel lease on
`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
class, channel class, selected entry node, and data-plane flow isolation, and
reports access telemetry. It intentionally returns a probe contract with
`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
`payload_flow=validated_only` for empty control probes; non-empty RDP payloads are
rejected with `probe_only required`. This stage proves
the Fabric ingress contract without forwarding desktop frames yet. The live
smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.
+7
@@ -1,5 +1,12 @@
# Data Plane v1 for RDP
Archived status: this document is a historical RDP/WebSocket stage record, not
the current runtime source of truth for transport architecture. The active
fabric transport model is QUIC-only between nodes; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: DP-3A grayscale full-frame binary render foundation is implemented and smoke-proven on the test Docker environment as of 2026-04-25. DP-3B adaptive quality policy/selection is intentionally paused. The accepted C++ RDP Adapter baseline is the ordered-region path. RDP-Perf-6 makes direct dirty-region binary render explicit with `render.frame.full` / `render.frame.region` RAP2 message types and is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26. The current test Docker deployment for the RDP Adapter performance path is `rap-rdp-worker:rdp-perf6-dirty-region`. The Stage 5.2 core download data path remains runtime-proven for direct worker WSS and backend gateway fallback. Data-plane and RDP work are paused; the next active focus is Stage C10 Fabric Core / cluster foundation, not another data-plane feature.
This document defines the first staged data-plane evolution for the RDP MVP. It does not implement direct worker WebSocket runtime, mesh routing, VPN, QUIC, UDP, WebRTC, relay nodes, or multi-cluster behavior.
@@ -1,5 +1,12 @@
# Direct Worker WSS TLS / PKI
Archived status: this document captures a direct-worker WSS trust design track
and is no longer the primary reference for node-to-node transport. The active
fabric transport model is QUIC-only between nodes; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: P3.4 trust-model design/prep complete.
This document defines the production trust model for direct worker WSS. It does
@@ -24,6 +24,21 @@ policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.
Node survival and recovery across endpoint moves, NAT-only reachability, legacy
contract overlap, and unavailable manual host access are governed by
`docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
fabric/update/recovery plane even when direct host login is unavailable.
Android implementation contract:
- app install/build contains a QUIC bootstrap seed set;
- runtime launch carries a `fabric_bootstrap_config`, not a backend URL;
- user login/profile selection happens over the fabric control channel;
- the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet
forwarding, WebSocket packet relay, and direct backend packet relay are not
part of the supported runtime path.
## What Was Missing
The current implementation proves route leases and production VPN forwarding,
@@ -60,8 +75,9 @@ route and stream semantics.
must keep working through cached policy, peer directories, route leases, and
local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
primary high-speed data plane.
8. QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may
serve human-facing download or panel pages, but it is not a node data-plane
fallback and must not carry service packets.
9. There must be no single management service that can seize the fabric. Control,
storage, update distribution, route authority, and certificate authority are
fabric roles assigned to eligible nodes and protected by quorum signatures.
@@ -73,6 +89,20 @@ route and stream semantics.
the usable candidate locally by policy, reachability, latency, load, and
trust.
## Transport vs Control API
The system must keep two layers separate in naming, design, and diagnostics:
- `Fabric Transport` means inter-node runtime delivery only. It is QUIC over UDP
and carries leased service-channel/data-plane traffic between nodes.
- `Control API` means human/operator/programmatic management surfaces such as
web-admin, release publication, policy mutation, audit queries, and status
reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.
The HTTP Control API is not a fallback transport for node-to-node runtime
traffic. A `409 Conflict` from the backend, a panel page load, or a release
download is control-plane behavior, not fabric transport behavior.
## Distributed Control And Trust
The target fabric behaves like a distributed network, not a client/server
@@ -145,6 +175,143 @@ Endpoint state is also distributed:
- Neighbor selection is local and latency/load-aware; the state log announces
facts and policy, not a forced single next hop.
### Fabric Registry Gossip
Moving a service must not break the farm.
`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
old nodes. It is not cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.
The rule is:
- any node may relay registry knowledge;
- only authorized signatures can create or replace trusted registry truth;
- a new record becomes active only after signature/authority checks and a
successful live probe through the fabric or a policy-approved direct QUIC
candidate;
- older still-valid records remain as fallback until their TTL expires.
Registry record shape:
```text
schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
- endpoint_id
address: quic://...
transport: direct_quic | relay_quic | reverse_quic
reachability
connectivity_mode
priority / weight
peer_cert_sha256
signatures:
- key_id
issuer_id
role
alg: ed25519
value
```
Acceptance algorithm:
1. Reject records for a different cluster, expired records, records issued in
the future beyond the allowed clock skew, records with an unsupported schema
or no endpoints, and records that list any non-QUIC endpoint.
2. Verify the canonical record payload, excluding `signatures`, against the
configured authority set.
3. Check the signer role is allowed for that service and scope.
4. Require quorum where policy says M-of-N; development may use one trusted
signer but must mark that signer as bootstrap/development authority.
5. Store accepted records as `candidate`.
6. Promote `candidate` to `active` only after live-probing at least one endpoint
and verifying the endpoint identity/pin.
7. Prefer higher epoch, then newer `issued_at`, then generation. Do not replace
a live active record with an older record.
8. Keep the previous active record usable as fallback until TTL expiry when a
newer candidate is not yet live-verified.
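The static parts of the acceptance algorithm can be sketched as follows. This is a minimal illustration, not the node-agent implementation: the `Record` class, the reason strings, the 30-second skew default, and the injected `verify` callback are all assumptions made for the example. Signature verification itself (ed25519 over the canonical payload) and live probing are left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

SUPPORTED_SCHEMA = "rap.fabric.registry.gossip_record.v1"


@dataclass
class Record:
    """Hypothetical in-memory view of a registry gossip record."""
    schema_version: str
    cluster_id: str
    service: str
    epoch: int
    issued_at: float          # seconds since the Unix epoch
    expires_at: float
    endpoints: List[str]      # addresses such as "quic://host:port"
    generation: int = 0
    signer_ids: List[str] = field(default_factory=list)


def static_reject_reason(rec: Record, cluster_id: str, now: float,
                         max_skew: float = 30.0) -> Optional[str]:
    """Step 1: return a rejection reason, or None when the record passes."""
    if rec.schema_version != SUPPORTED_SCHEMA:
        return "unsupported_schema"
    if rec.cluster_id != cluster_id:
        return "wrong_cluster"
    if rec.expires_at <= now:
        return "expired"
    if rec.issued_at > now + max_skew:
        return "issued_in_future"
    if not rec.endpoints:
        return "missing_endpoints"
    if any(not e.startswith("quic://") for e in rec.endpoints):
        return "non_quic_endpoint"
    return None


def has_quorum(rec: Record, allowed_signers: Set[str], required: int,
               verify: Callable[[Record, str], bool]) -> bool:
    """Steps 2-4: count distinct allowed signers whose signature verifies.

    `verify` is supplied by the caller and stands in for the real
    signature check against the configured authority set.
    """
    valid = {s for s in rec.signer_ids if s in allowed_signers and verify(rec, s)}
    return len(valid) >= required


def preference_key(rec: Record):
    """Step 7: higher epoch, then newer issued_at, then generation."""
    return (rec.epoch, rec.issued_at, rec.generation)


def should_replace(active: Optional[Record], candidate: Record) -> bool:
    """Never replace a live active record with an older or equal one."""
    return active is None or preference_key(candidate) > preference_key(active)
```

A record that passes `static_reject_reason` and `has_quorum` would be stored as `candidate`; promotion to `active` still requires the live probe from step 6, which this sketch does not model.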
This is the recovery path for mass moves. If every known service endpoint moves
at once, the operator or a control-authority node only has to deliver a signed
registry record to one reachable fabric node. That node validates it, probes it,
promotes it, and gossips it onward. User/mobile/candidate nodes may carry the
record, but cannot make it authoritative unless their role certificate permits
that service/scope.
Service classes that must use this registry before production hardening:
- `control-api`: heartbeat, auth/profile control projection, node registration,
policy/snapshot fetch.
- `update-store`: signed release manifests and compatibility windows.
- `update-cache`: artifact mirrors close to nodes.
- `web-admin`: management UI/API ingress replicas.
- `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
nodes.
Legacy endpoint compatibility is allowed only for rolling migration:
- Old nodes may use their baked HTTP/control URL only to fetch a new version or
a signed registry bootstrap record.
- New nodes must treat fixed URLs as fallback hints, not as authority.
- Old code is removed only after every live node reports a version that supports
signed registry gossip and service discovery by role.
Listener configuration is split into bind sockets and reachability candidates:
- `listen_addr` is what the local process binds, for example
`0.0.0.0:18080` on `home-1`.
- `endpoint_candidates` is the ordered set of addresses other nodes may try.
A single node can publish LAN addresses, addresses on several network
adapters, STUN/reflexive addresses, and multiple public NAT forwards from
different providers.
- Public NAT forwards are modeled as candidates with metadata, not as a
replacement for the internal bind address. Example:
`quic://94.141.118.222:19199 reachability=public connectivity=direct
provider=isp1 maps_to=192.168.200.85:18080`.
- A candidate may be valid only from outside the NAT. Same-LAN hairpin failure
is not a proof that the public candidate is broken; verification must be
scoped to an external peer or remote probe.
- The route builder scores candidates by reachability, measured latency, loss,
load, policy, and verification freshness. If one provider or interface fails,
the node keeps the same node identity and republishes a new candidate epoch.
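To make the candidate notation concrete, the sketch below parses a candidate line of the form shown in the public NAT forward example into an address plus metadata. It assumes the whitespace-separated `key=value` layout is the wire/config format, which the text above does not confirm; `parse_candidate` and `needs_external_probe` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def parse_candidate(text: str) -> dict:
    """Parse 'quic://host:port key=value ...' into host, port, and metadata."""
    tokens = text.split()
    if not tokens:
        raise ValueError("empty candidate")
    url = urlsplit(tokens[0])
    if url.scheme != "quic" or url.hostname is None or url.port is None:
        raise ValueError("candidate address must be quic://host:port")
    metadata = {}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if not sep or not key:
            raise ValueError("malformed metadata token: " + token)
        metadata[key] = value
    return {"host": url.hostname, "port": url.port, "metadata": metadata}


def needs_external_probe(candidate: dict) -> bool:
    """Assumed rule: public candidates are verified from outside the NAT,
    so a same-LAN hairpin failure is not counted against them."""
    return candidate["metadata"].get("reachability") == "public"
```

For example, parsing the `isp1` forward above yields host `94.141.118.222`, port `19199`, and a `maps_to` entry that still points at the internal bind address, so the bind socket and the published candidate stay separate.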
## Install Artifact Bootstrap Contract
Every installable artifact is a node image plus a bootstrap seed set.
This applies to Android, Docker, Linux services, and Windows services. The seed
set is baked into the artifact or delivered beside it as signed install
metadata. It is not a single backend URL and not a management server choice. It
is a bounded list of known fabric endpoint candidates that may be reachable from
different network positions:
- public QUIC candidates, for example `usa-los-1` or externally reachable
`home-1`;
- private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
- closed-site candidates that have no Internet route themselves but can reach a
neighboring fabric node;
- optional pinned certificate hashes or authority descriptors for high-trust
entry candidates.
On first start the installed node tries the seed set, joins through any reachable
peer, registers as a candidate node with minimal rights, and then receives
signed peer-directory, role, update, and policy state through the fabric. If a
node is installed in an isolated network, it can still become visible and usable
when at least one nearby seed node can route onward to the rest of the fabric.
User login on Android is only identity/profile selection for the `vpn-client`
service; the underlying phone node already exists and participates in the
fabric with candidate permissions.
## Node Roles
Initial role vocabulary:
@@ -172,7 +339,7 @@ uplink stability, foreground state, and user cost policy.
Nodes must advertise capability facts in heartbeats and peer updates:
- supported fabric protocol versions;
- supported transports: UDP/QUIC, TCP, WebSocket, HTTPS fallback;
- supported transport: UDP/QUIC;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
@@ -184,9 +351,8 @@ Nodes must advertise capability facts in heartbeats and peer updates:
## Fabric Data Session V1
The first practical protocol step is a persistent binary data session. It may
initially run over WebSocket/TCP for faster delivery, but the framing must be
transport-neutral so the same protocol can move to QUIC/UDP.
The first practical protocol step is a persistent binary QUIC data session.
The framing stays service-neutral, but the runtime transport is QUIC only.
Minimum frame set:
@@ -338,69 +504,36 @@ Deliverables:
### Stage FNP-3: WebSocket/TCP Compatibility Transport
Status: started with a transport-neutral `io.Reader`/`io.Writer` frame loop,
WebSocket frame adapter in `agents/rap-node-agent/internal/fabricproto`, and a
gated/authenticated mesh smoke endpoint/client at `/mesh/v1/fabric/session/ws`.
`rap-host-agent fabric-session-smoke` provides the first operator smoke command
and can pass signed fabric-session authority payload/signature headers for
authority-pinned nodes.
Node-agent exposes the endpoint only when `RAP_MESH_FABRIC_SESSION_ENABLED` /
`-mesh-fabric-session-enabled` is set, and reports the enabled endpoint in
heartbeat metadata.
`mesh-live-smoke` includes a fabric-session `PING`/`PONG` check alongside the
existing route and test-service probes. Mesh client code now has a reusable
`FabricSessionClient` for multiple frame exchanges over one WebSocket session,
plus a pump mode with outbound/inbound queues for asynchronous stream traffic.
Live smoke verifies two `PING`/`PONG` round trips on the same connection.
`vpnruntime` has a binary VPN packet-batch mapper for `FrameData` payloads so
packet delivery can move away from JSON production envelopes in a gated mode.
`FabricSessionPacketTransport` now adapts that mapper to the existing
`PacketTransport` interface and can demultiplex inbound DATA frames into the
VPN packet inbox by stream id.
`mesh-live-smoke` now sends a real VPN packet batch through
`FabricSessionPacketTransport` over the WebSocket fabric session and requires a
stream ACK from the remote node.
Mesh has a peer session manager that reuses one pump per peer endpoint, giving
VPN transport selection a stable place to acquire long-lived fabric sessions.
Node config now carries a separate gated
`RAP_VPN_FABRIC_SESSION_TRANSPORT_ENABLED` switch and heartbeat report for the
binary VPN packet transport, keeping endpoint exposure and VPN dataplane
rollout independently controllable.
When the VPN fabric-session switch is enabled, node-agent now attempts to use a
long-lived peer session for gateway packet transport and falls back to the
existing HTTP production envelope path when the peer session is unavailable.
Peer session reuse now evicts closed pumps before reuse, so failed WebSocket
sessions can be reopened on the next transport acquisition.
Heartbeat telemetry includes peer session manager counters for active sessions,
reuses, opens, closed-pump evictions, and explicit close operations.
The mesh package now exposes a service-neutral `FabricTransport` abstraction;
the current WebSocket carrier implements it as `WebSocketFabricTransport`, so
future QUIC/UDP transport can be added without changing VPN/RDP/HTTP services.
`QUICFabricTransport` now implements the same interface and carries the same
binary `fabricproto` frames over a QUIC stream, with local smoke coverage for
`PING`/`PONG` and DATA/ACK.
Carrier selection understands QUIC transport labels and `quic://host:port`
endpoints while preserving WebSocket as the default fallback.
`QUICFabricServer` provides the matching node-side QUIC listener for accepting
fabric streams and running the same session frame handler as other carriers.
Node-agent can now gate the QUIC listener with
`RAP_MESH_QUIC_FABRIC_ENABLED` / `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`, report it
in heartbeat metadata, and pass the setting through host-agent install/update
profiles.
`mesh-live-smoke` verifies the QUIC carrier by starting a temporary QUIC fabric
server and requiring a `PING`/`PONG` round trip over `QUICFabricTransport`.
Nodes now advertise enabled QUIC fabric listeners as `direct_quic` fast-path
endpoint candidates, and endpoint ranking prefers QUIC over WebSocket/HTTPS
compatibility candidates for fabric sessions.
Status: retired as a migration-only stage.
This stage existed to bootstrap binary frame semantics before QUIC routing and
carrier reuse were ready. It introduced the transport-neutral frame loop,
session-shaped packet mapper, and early smoke tooling. That work was useful as
scaffolding, but it is no longer the target runtime.
Current rule:
- WebSocket/TCP fabric-session transport is not part of the supported node
dataplane.
- QUIC/UDP is the only supported runtime carrier between fabric nodes.
- Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling
must move to QUIC-native smoke and recovery paths.
- Any routing, heartbeat, registry, peer probe, or service dataplane logic must
reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a
valid alternate path.
What survives from this stage is the service-neutral frame model and the
`FabricSessionPacketTransport` mapping, which now ride on QUIC carriers instead
of a WebSocket fallback.
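The current rule can be illustrated with a small dial loop. This is a sketch under stated assumptions: candidates are plain address strings already ranked by the caller, `try_dial` is a caller-supplied function standing in for the real QUIC dial, and the result keys are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Optional


def dial_first_quic(candidates: Iterable[str],
                    try_dial: Callable[[str], bool]) -> Dict[str, object]:
    """Walk ranked candidates in order and return the first QUIC one that dials.

    Non-QUIC candidates are rejected, never dialed, and never used as a
    fallback. If no QUIC candidate succeeds the target is unavailable.
    """
    rejected: List[str] = []
    attempts = 0
    selected: Optional[str] = None
    for address in candidates:
        if not address.startswith("quic://"):
            rejected.append(address)   # recorded for diagnostics only
            continue
        attempts += 1
        if try_dial(address):
            selected = address
            break
    return {
        "selected": selected,
        "attempts": attempts,
        "rejected_non_quic": rejected,  # only those seen before the loop ended
    }
```

A failed QUIC candidate moves the loop to the next QUIC candidate; it does not re-enable a WebSocket or HTTPS carrier.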
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
so dataplane sessions can select QUIC fast-path candidates and fall back to
legacy peer endpoints when the control plane has not published candidates yet.
so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC
peer endpoints when the control plane has not published valid candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
instead of disabling verification while the cluster CA path is being finished.
VPN fabric-session dialing now walks all ranked endpoint candidates before
falling back to the legacy peer endpoint, so a failed QUIC candidate does not
block WebSocket/HTTPS compatibility transport.
declaring the target unavailable, so a failed QUIC candidate does not silently
re-enable WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport,
certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
@@ -416,8 +549,8 @@ Endpoint health observations are now emitted as a bounded standalone heartbeat
report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral
(`fabric_session_binary_frames`) and reports QUIC/WebSocket as available
carriers instead of describing the dataplane as WebSocket-only.
(`fabric_session_binary_frames`) and reports QUIC selection plus non-QUIC
candidate rejection instead of describing the dataplane as WebSocket-capable.
Endpoint health observations are pruned in-memory by age and count before
snapshot/report generation, preventing long-running nodes from accumulating
unbounded candidate history.
@@ -583,10 +716,10 @@ propagated by host-agent install profiles.
Deliverables:
- carry binary frames over one persistent WebSocket/TCP connection;
- carry binary frames over one persistent QUIC fabric session;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
a gated mode;
- keep HTTP forwarding as fallback.
- remove HTTP/WebSocket packet forwarding from the supported dataplane.
### Stage FNP-4: Android As Mobile Fabric Node
@@ -609,12 +742,12 @@ Deliverables:
### Stage FNP-6: QUIC/UDP Transport
Status: started with `QUICFabricTransport` in `internal/mesh`.
Status: active runtime baseline in `internal/mesh`.
Deliverables:
- implement QUIC transport for Fabric Data Session V1;
- preserve WebSocket/TCP as fallback;
- keep QUIC/UDP as the only supported inter-node runtime transport;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.
@@ -0,0 +1,183 @@
# Fabric Area And Peer Stability Model
Status: active design correction.
This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.
## 1. Why the old "3 connections" rule is not enough
A raw connection count is too weak as a resilience rule.
Three links are not equivalent when:
- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.
Therefore the fabric must not use a single scalar count as the stability
criterion.
## 2. Area
Introduce the concept of an `area`.
An area is a failure domain with high mutual reachability and shared external
risk. Examples:
- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary
An area can be derived from:
- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.
The area label must be part of live node metadata and endpoint candidate
metadata.
## 3. Stability objective
Each node should maintain a working peer set with diversity, not just count.
### 3.1 Minimum stable peer objective
For an ordinary production node:
- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more
than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the
primary recovery path.
For an area gateway or strategically important public node:
- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress
or NAT dependency.
For a node in a tiny fleet where only one external area currently exists:
- the system must report `reduced-diversity mode`, not pretend the target is
fully satisfied.
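A rough sketch of how the objective could be evaluated for one node follows. The `Peer` fields and deficit names are hypothetical. Two simplifications are assumed: for an ordinary node the "ready set" counts direct-ready or relay-ready peers, and the extra relay/rendezvous path and ingress-independence requirements are not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


def evaluate_objective(local_area: str, peers: Iterable[Peer],
                       known_external_areas: int,
                       gateway: bool = False) -> List[str]:
    """Return the list of unmet requirements (empty when the objective holds)."""
    peers = list(peers)
    deficits: List[str] = []

    direct = [p for p in peers if p.direct_ready]
    min_direct = 3 if gateway else 2
    if len(direct) < min_direct:
        deficits.append("direct_ready_deficit")

    # Gateways must show area diversity within the direct-ready set.
    ready = direct if gateway else [p for p in peers if p.direct_ready or p.relay_ready]
    external_areas = {p.area for p in ready if p.area != local_area}
    if known_external_areas > 1:
        if len(external_areas) < 2:
            deficits.append("area_diversity_deficit")
    else:
        # Tiny fleet: report the mode instead of pretending the target is met.
        deficits.append("reduced_diversity_mode")

    if not any(p.recovery_capable and p.area != local_area for p in peers):
        deficits.append("no_external_recovery_path")
    return deficits
```

The same peer set can satisfy an ordinary node and still fail the stricter gateway thresholds, which is the point of separating the two profiles.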
### 3.2 What counts as "ready"
`ready` means:
- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.
`relay_ready` does not replace `direct_ready`.
## 4. What a node must remember
Every node must keep a live working set, not just a tiny current-peer list.
Minimum retained peer memory:
1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working
peers;
3. for every known node:
- node id
- area
- role summary
- latest verified direct candidates
- latest verified relay/rendezvous candidates
- last success timestamp
- last failure class
- NAT / ingress dependency hints
- cert pin / authority compatibility metadata
For the current fleet size, every node should be able to hold the full
directory of every other node. At 6-8 nodes, scale is no excuse for a partial
directory.
## 5. Probe strategy
The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.
### 5.1 Hot set
Always keep a hot set of:
- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.
These should be revalidated frequently.
### 5.2 Warm set
Maintain a warm set of:
- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.
These should be revalidated on a slower cadence and promoted when diversity or
direct-ready count drops.
### 5.3 Cold directory
Retain the full known directory and signed registry records, even if not
actively probed at the same rate.
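The layered strategy can be sketched as a tier assignment over the known directory. The `KnownPeer` fields, the tie-breaking by most recent success, and the choice to treat never-verified peers as cold are assumptions for illustration; probe cadences are deliberately left out because the text does not specify them.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class KnownPeer:
    node_id: str
    area: str
    direct_ready: bool = False
    recovery_capable: bool = False
    last_success: Optional[float] = None   # None means never verified


def assign_tiers(peers: Iterable[KnownPeer], local_area: str) -> Dict[str, str]:
    """Map node_id to 'hot', 'warm', or 'cold'."""
    peers = list(peers)
    tiers: Dict[str, str] = {}

    # Hot: every current direct-ready peer.
    for p in peers:
        if p.direct_ready:
            tiers[p.node_id] = "hot"

    # Hot: one recovery peer outside the local area.
    external = [p for p in peers if p.area != local_area]
    recovery = next((p for p in external if p.recovery_capable), None)
    if recovery is not None:
        tiers[recovery.node_id] = "hot"

    # Hot: one alternate peer per external area, preferring recent success.
    for area in sorted({p.area for p in external}):
        spare = [p for p in external if p.area == area and p.node_id not in tiers]
        if spare:
            best = max(spare, key=lambda p: p.last_success or 0.0)
            tiers[best.node_id] = "hot"

    # Warm: previously successful peers. Cold: the rest of the directory.
    for p in peers:
        if p.node_id not in tiers:
            tiers[p.node_id] = "warm" if p.last_success is not None else "cold"
    return tiers
```

Cold entries are still retained in the result, matching the requirement that the full directory is kept even when it is not actively probed.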
## 6. Failure handling
When a direct-ready peer is lost:
1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
- area diversity
- independent ingress diversity
- direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is
not currently available.
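The replacement order above can be expressed as a ranking key. This is a sketch, not the route builder: `PeerInfo`, its `ingress` and `cost` fields, and the exact tie-breaking are assumptions. It prefers a direct-ready candidate that adds a new area, then one that adds a new ingress, and uses cost only as the last tie-breaker; relay-ready candidates are considered only when no direct-ready candidate exists.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PeerInfo:
    node_id: str
    area: str
    ingress: str               # shared NAT / public ingress identifier
    direct_ready: bool = False
    relay_ready: bool = False
    cost: float = 0.0          # e.g. measured latency; lower is cheaper


def pick_replacement(remaining: Iterable[PeerInfo],
                     candidates: Iterable[PeerInfo]) -> Optional[PeerInfo]:
    """Choose a replacement for a lost direct-ready peer."""
    remaining = list(remaining)
    candidates = list(candidates)

    direct = [c for c in candidates if c.direct_ready]
    pool = direct if direct else [c for c in candidates if c.relay_ready]
    if not pool:
        return None

    areas = {p.area for p in remaining}
    ingresses = {p.ingress for p in remaining}

    def rank(c: PeerInfo):
        # False sorts before True, so new areas and new ingresses win first.
        return (c.area in areas, c.ingress in ingresses, c.cost)

    return min(pool, key=rank)
```

With `home-2` and `usa-los-1` remaining, a more expensive `test` peer would be chosen over a cheaper `home` peer, because it restores area and ingress diversity.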
## 7. Implications for the current fleet
Current area mapping should be treated approximately as:
- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`
Under this model:
- a node in `home` should avoid satisfying its minimum peer objective using
only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even
if its total peer count still looks healthy.
## 8. Required implementation changes
1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
- `direct_ready_count`
- `relay_ready_count`
- `external_area_ready_count`
- `independent_ingress_ready_count`
4. Alert on:
- zero recovery path outside the local area
- direct-ready deficit
- area diversity deficit
- registry resolution deficit
5. Preserve a full node directory for the current small fleet.
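As one possible shape for items 3 and 4, the sketch below derives the four counters from a peer list and turns them into alerts. The counter names come from the list above; the `PeerState` class, the alert strings, the default targets, and the choice to compute area and ingress diversity over direct-ready peers only are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class PeerState:
    node_id: str
    area: str
    ingress: str
    direct_ready: bool = False
    relay_ready: bool = False


def readiness_counters(peers: Iterable[PeerState], local_area: str) -> Dict[str, int]:
    """Compute the separated readiness counters for one node."""
    peers = list(peers)
    direct = [p for p in peers if p.direct_ready]
    relay = [p for p in peers if p.relay_ready]
    return {
        "direct_ready_count": len(direct),
        "relay_ready_count": len(relay),
        "external_area_ready_count": len({p.area for p in direct if p.area != local_area}),
        "independent_ingress_ready_count": len({p.ingress for p in direct}),
    }


def readiness_alerts(counters: Dict[str, int], resolved_service_count: int,
                     external_recovery_paths: int,
                     target_direct: int = 3, target_areas: int = 2) -> List[str]:
    """Translate counters into the alert classes listed in item 4."""
    alerts: List[str] = []
    if external_recovery_paths == 0:
        alerts.append("zero_external_recovery_path")
    if counters["direct_ready_count"] < target_direct:
        alerts.append("direct_ready_deficit")
    if counters["external_area_ready_count"] < target_areas:
        alerts.append("area_diversity_deficit")
    if resolved_service_count == 0:
        alerts.append("registry_resolution_deficit")
    return alerts
```

Keeping `relay_ready_count` separate means a node with many relay paths and few direct peers still raises `direct_ready_deficit` instead of looking healthy by total count.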
@@ -289,7 +289,10 @@ Production fabric-core migration boundary:
LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
`relay_quic` fallback. Candidate metadata carries `local_segment_id`,
`nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
`relay_endpoint` when configured.
`relay_endpoint` when configured. When a relay endpoint is the first physical
QUIC hop, its advertised certificate fingerprint must survive route planning
so public-IP relay paths can verify the relay node by pin instead of falling
back to hostname/IP SAN matching.
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
health observations, latency, reliability, region, policy tags, and live
@@ -0,0 +1,179 @@
# Fabric Live Audit 2026-05-18
Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.
## Current confirmed state
- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The active node set
- `home-1`
- `home-2`
- `home-3`
- `test-1`
- `test-2`
- `test-3`
- `usa-los-1`
- `ifcm-rufms-s-mo1cr`
is converged on `0.2.321-directreadytarget`.
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
no longer stale.
## Why TCP traffic is still visible
Visible TCP traffic is not coming from the inter-node fabric transport. It is
coming from the temporary compatibility recovery overlap that is still active.
Observed live listeners:
- `docker-test`
- `19191/tcp` - compatibility `Control API/downloads` bridge
- `18080/tcp` - web-admin
- `18090/tcp` - release files
- `18121/tcp` - backend Control API
- `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
- `19131/udp` - QUIC fabric listener
- `19191/tcp` - external compatibility bridge currently held open so legacy
recovery contracts can still reach `Control API/downloads`
Therefore:
- `TCP` is still present by design for recovery overlap.
- `UDP/QUIC` is the current node-to-node transport.
- The statement "the fabric is fully UDP-only" is not yet true at the full
system level while `19191/tcp` compatibility recovery remains enabled.
## Why nodes were still falling away
### 1. Nodes do not yet operate from a fully active signed registry gossip plane
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
- `resolved_services.control-api = no_active_record`
- `resolved_services.update-store = no_active_record`
- `resolved_services.update-cache = no_active_record`
This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.
### 2. Legacy control/download contracts are still real dependencies
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:
- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
This confirms the root recovery lesson:
- a NAT node without manual host access was still anchored to the old recovery
contract;
- until that contract was temporarily restored, the node could not advance;
- the node did not disappear because QUIC failed; it disappeared because the
recovery/control overlap was removed before the node had converged.
### 3. Direct peer resilience is still below the intended threshold
Observed from live heartbeat metadata:
- `ifcm-rufms-s-mo1cr`
- `peer_connection_ready = 2`
- `peer_connection_relay_ready = 3`
- `target_ready_peers = 3`
- `usa-los-1`
- `peer_connection_ready = 1`
- `peer_connection_relay_ready = 5`
- `target_ready_peers = 3`
This means the direct-path resilience target is not satisfied yet, even though
the nodes are healthy.
The practical reason is simple:
- the cluster has only a small number of externally reachable direct QUIC
endpoints;
- some nodes still advertise only private/LAN-reachable direct candidates;
- relay-ready adjacency is masking direct peer deficit, but it does not replace
the requirement for at least three direct-ready peers.
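The deficit described here can be computed directly from heartbeat metadata. The following is a minimal sketch, assuming the heartbeat fields are available as a plain dictionary; the function name `direct_ready_deficit` is hypothetical, while the field names and values are the ones observed above.

```python
def direct_ready_deficit(heartbeat: dict) -> int:
    """Return how many direct-ready peers are missing relative to the target.

    Relay-ready peers are deliberately ignored: they mask the deficit but do
    not count toward the direct-ready floor.
    """
    target = heartbeat["target_ready_peers"]
    ready = heartbeat["peer_connection_ready"]
    return max(0, target - ready)


# Values copied from the live heartbeat observations in this section.
observed = {
    "ifcm-rufms-s-mo1cr": {
        "peer_connection_ready": 2,
        "peer_connection_relay_ready": 3,
        "target_ready_peers": 3,
    },
    "usa-los-1": {
        "peer_connection_ready": 1,
        "peer_connection_relay_ready": 5,
        "target_ready_peers": 3,
    },
}

deficits = {node: direct_ready_deficit(hb) for node, hb in observed.items()}
```

With these inputs both nodes show a non-zero deficit even though each has at least three relay-ready peers, which is the masking effect noted above.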
### 4. Observability is still heterogeneous
Live heartbeat coverage is inconsistent:
- `test-*`, `ifcm`, and `usa-los-1` emit rich `c17z20` heartbeat metadata with
endpoint, peer recovery, and registry sections.
- `home-*` nodes currently do not expose the same full sections in their latest
heartbeat rows.
This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.
## What is true right now
1. The fleet is converged on one live node-agent version.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
met for all externally significant nodes.
## Operational rule until the next audit
Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:
- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.
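The rule above can be expressed as a guard that lists every reason the overlap must stay. This is a sketch only: the flattened node dictionary and the keys `registry_status`, `externally_significant`, and `legacy_only_recovery` are assumptions made for illustration, not fields confirmed by the heartbeat schema.

```python
def overlap_removal_blockers(nodes: list[dict]) -> list[str]:
    """List the reasons the 19191/tcp recovery overlap must stay enabled."""
    blockers = []
    for node in nodes:
        name = node["name"]
        # Condition 1: the node still reports a control URL on the 19191 contract.
        if ":19191" in node.get("control_plane_url", ""):
            blockers.append(f"{name}: control_plane_url uses the 19191 contract")
        # Condition 2: the registry runtime is anything other than active.
        if node.get("registry_status") != "active":
            blockers.append(f"{name}: registry status is not active")
        # Condition 3: an externally significant node is below the direct-ready floor.
        if node.get("externally_significant") and node.get("peer_connection_ready", 0) < 3:
            blockers.append(f"{name}: fewer than 3 direct-ready peers")
        # Condition 4: the node can only recover through the legacy overlap.
        if node.get("legacy_only_recovery"):
            blockers.append(f"{name}: recovers only through the legacy overlap")
    return blockers


# Illustrative input; the two boolean flags are assumed, not observed.
example_nodes = [
    {
        "name": "ifcm-rufms-s-mo1cr",
        "control_plane_url": "http://vpn.cin.su:19191/api/v1",
        "registry_status": "candidate_only",
        "peer_connection_ready": 2,
        "externally_significant": True,
        "legacy_only_recovery": True,
    }
]
blockers = overlap_removal_blockers(example_nodes)
```

An empty result would be the only state in which removing the overlap is consistent with this rule.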
## Required next work
### A. Finish signed registry activation
Each node must be able to resolve active records for at least:
- `control-api`
- `update-store`
- `update-cache`
without falling back to the `19191` compatibility contract.
### B. Promote full direct endpoint dissemination
All nodes with public reachability must advertise every valid public direct QUIC
endpoint, and nodes must retain enough live peer memory to reconnect without
operator intervention.
### C. Enforce the direct-ready floor as a live alert
If a node has fewer than 3 direct-ready peers, this must remain a real
operational alert even when relay-ready peers exist.
### D. Normalize heartbeat observability
Every production node must emit the same minimum audit surface:
- endpoint candidates
- peer recovery counts
- registry runtime state
- update runtime state
without mixing rich and reduced heartbeat schemas across the fleet.
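A fleet-wide schema check could flag nodes that emit a reduced heartbeat. The sketch below assumes each heartbeat is a dictionary with one top-level key per section; the key names in `REQUIRED_SECTIONS` are hypothetical stand-ins for the four items listed above.

```python
# Hypothetical section keys standing in for the minimum audit surface.
REQUIRED_SECTIONS = (
    "endpoint_candidates",
    "peer_recovery",
    "registry_runtime",
    "update_runtime",
)


def missing_audit_sections(heartbeat: dict) -> list[str]:
    """Return the required sections that a heartbeat does not contain."""
    return [section for section in REQUIRED_SECTIONS if section not in heartbeat]


rich = {section: {} for section in REQUIRED_SECTIONS}
reduced = {"endpoint_candidates": {}}

rich_missing = missing_audit_sections(rich)
reduced_missing = missing_audit_sections(reduced)
```

This checks presence only; validating the contents of each section would need the actual heartbeat schema.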
### E. Replace the naive peer-count rule
The live fleet shows that a plain "3 links per node" rule is not a sufficient
resilience model.
The current corrective design is documented in
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
and introduces:
- `area` as a failure-domain label;
- direct-ready vs relay-ready separation;
- cross-area diversity requirements;
- full-directory retention for small fleets.
@@ -0,0 +1,427 @@
# Fabric Node Survival And Recovery Policy
Status: active architecture policy.
This document defines the non-negotiable survival, compatibility, and recovery
rules for Secure Access Fabric nodes. It exists because losing a node is not an
acceptable operating model once the fabric grows beyond a small manually
maintained fleet.
Reference incident:
- `ifcm-rufms-s-mo1cr` is the canonical recovery case.
- The node is behind NAT.
- There is no direct administrative access to the Windows host.
- The node must remain recoverable through the fabric/update/recovery plane
without relying on manual host login.
The latest live recovery evidence for this case is documented in
[FABRIC_LIVE_AUDIT_2026-05-18.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_LIVE_AUDIT_2026-05-18.md).
This policy applies to Linux, Windows, Android, containerized nodes, and future
node types.
## 1. Core Decision
The fabric must be able to lose:
- old API endpoints;
- old artifact URLs;
- previous public IP addresses;
- previous NAT mappings;
- previous relay nodes;
- previous route-authority replicas;
- previous update-cache replicas;
- old service locations;
- operator access to the host OS;
- the current physical location of a workload;
- part of the cluster.
And still keep the node recoverable.
Manual repair is allowed as an emergency tool. It must not be the default
survival strategy.
## 2. Non-Negotiable Invariants
### 2.1 Node Identity Must Survive
A recoverable node must preserve:
- `node_id`;
- node keypair or key reference;
- pinned cluster authority / quorum descriptor;
- last accepted signed registry records;
- last accepted bootstrap seed set;
- last known good update policy;
- last known good workload desired state;
- rollback metadata;
- recovery audit trail.
Reinstall or repair must prefer preserving local state. Identity reset is a
high-risk operator action, not the default repair path.
### 2.2 Compatibility Must Stay Until Recovery Is Complete
Any change to the fabric must keep older nodes recoverable until one of these
is true:
1. every node has confirmed the new contract; or
2. the missing nodes were manually retired, revoked, or explicitly accepted as
lost.
This applies to:
- update plan formats;
- signed registry schemas;
- artifact install types;
- authority signature envelopes;
- bootstrap config formats;
- recovery seed formats;
- host-agent / updater runtime contracts;
- control endpoints needed only for migration.
The rule is strict: do not delete the old recovery format while nodes that may
still need it remain unrecovered.
### 2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location
Node-to-node runtime transport remains QUIC over UDP only.
That does not permit:
- one bootstrap address;
- one update mirror;
- one registry carrier;
- one ingress node;
- one relay;
- one control replica.
QUIC is the transport. Survivability requires many signed ways to discover the
current valid QUIC endpoints.
### 2.4 No Single Service May Own Recovery
Recovery must not depend on one:
- backend URL;
- DNS name;
- HTTP ingress;
- update repository host;
- relay node;
- cluster admin node.
Any of those may disappear while the node is still healthy enough to recover.
## 3. Required Recovery Layers
### 3.1 Embedded Bootstrap Seed Set
Each installable node package must contain a bounded bootstrap seed set:
- multiple seed nodes;
- public and private candidates where appropriate;
- QUIC endpoint candidates only;
- signed bootstrap metadata;
- expiry / epoch rules;
- optional organization / cluster scope constraints.
The bootstrap seed set is only the first door, not cluster truth.
### 3.2 Signed Registry Gossip
After bootstrap, a node must learn current service locations through signed
fabric registry records that can be carried by any reachable peer.
Required properties:
- multiple records per service;
- quorum or otherwise policy-approved signatures;
- monotonic epoch/generation;
- expiry and freshness checks;
- live probe before promotion;
- ability to accept newer records from a reachable neighbor even when old
origins are gone.
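The acceptance properties above can be sketched as a single promotion check. This is an illustration under stated assumptions: `RegistryRecord`, its fields, and `should_promote` are hypothetical names, signature verification is represented by a precomputed boolean, and the live probe is passed in as a callable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    endpoint: str
    epoch: int            # monotonic epoch/generation
    expires_at: float     # Unix seconds
    signature_ok: bool    # result of a signature policy check done elsewhere


def should_promote(
    candidate: RegistryRecord,
    current: Optional[RegistryRecord],
    now: float,
    probe: Callable[[str], bool],
) -> bool:
    """Decide whether a gossiped record may replace the current one for a service."""
    if not candidate.signature_ok:
        return False  # unsigned or policy-rejected records are never promoted
    if candidate.expires_at <= now:
        return False  # expired records fail the freshness check
    if current is not None and candidate.epoch <= current.epoch:
        return False  # epochs must move strictly forward
    return probe(candidate.endpoint)  # live probe before promotion


old = RegistryRecord("control-api", "quic://a.example:4433", 7, 2_000.0, True)
new = RegistryRecord("control-api", "quic://b.example:4433", 8, 3_000.0, True)

promoted = should_promote(new, old, now=1_000.0, probe=lambda endpoint: True)
replayed = should_promote(old, new, now=1_000.0, probe=lambda endpoint: True)
```

Note that the check never asks which peer carried the record, which matches the requirement that any reachable neighbor can deliver newer truth.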
### 3.3 Outbound-Only Recovery Attachment
A node behind NAT or in passive mode must be recoverable through an outbound
attachment.
Required behaviors:
- the node can maintain at least one long-lived outbound QUIC control channel;
- that channel survives IP changes by reconnecting through any remaining seed or
signed registry endpoint;
- the node may receive updated registry truth, update triggers, workload
changes, and recovery instructions over that channel;
- the fabric must not require inbound TCP/UDP reachability to repair the node.
### 3.4 Local Recovery Agent Boundary
The node must have a minimal recovery-capable local agent boundary that is
separate from ordinary service workloads.
It must be able to:
- validate signed update plans;
- download artifacts from multiple mirrors;
- stage replacement binaries;
- restart node-agent or host-agent tasks;
- rollback to previous binaries;
- swap to new signed registry/bootstrap records;
- emit recovery status when transport returns.
If node workloads fail, this local recovery boundary must still exist.
### 3.5 Multi-Source Artifact Delivery
Artifacts must be retrievable from more than one source:
- local cached file;
- cluster update-cache;
- organization-local cache if policy allows;
- public or internet-reachable mirror;
- neighbor-assisted relay transfer over the fabric.
A node must not become unrecoverable because one artifact hostname or one
download service disappeared.
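One way to picture multi-source retrieval is an ordered fallback that accepts the first source whose bytes match the expected hash. The sketch below is hypothetical: `fetch_artifact` and the source names are illustrative, each source is modeled as a plain callable, and SHA-256 is assumed as the artifact hash.

```python
import hashlib
from typing import Callable, Iterable


def fetch_artifact(
    sources: Iterable[tuple[str, Callable[[], bytes]]],
    expected_sha256: str,
) -> tuple[str, bytes]:
    """Try each source in order and return the first one that yields valid bytes."""
    errors = []
    for name, fetch in sources:
        try:
            data = fetch()
        except Exception as exc:  # a dead source must not abort the whole attempt
            errors.append(f"{name}: {exc}")
            continue
        if hashlib.sha256(data).hexdigest() != expected_sha256:
            errors.append(f"{name}: hash mismatch")
            continue
        return name, data
    raise RuntimeError("no source delivered a valid artifact: " + "; ".join(errors))


def dead_cache() -> bytes:
    raise ConnectionError("update-cache unreachable")


payload = b"node-agent build"
digest = hashlib.sha256(payload).hexdigest()

used_source, data = fetch_artifact(
    [
        ("cluster update-cache", dead_cache),          # unreachable
        ("public mirror", lambda: b"corrupted"),       # wrong content
        ("neighbor relay", lambda: payload),           # valid
    ],
    digest,
)
```

Verifying the hash per source means a corrupted mirror is treated the same as an unreachable one, so the node simply moves on to the next candidate.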
### 3.6 Trigger And Subscription Plane
Polling alone is not enough for very large fleets.
Required model:
- nodes may still perform slow fallback polling;
- primary update notification uses subscription/signal delivery;
- update-cache or registry service can repeatedly signal pending updates until
acknowledged;
- signals are idempotent;
- signals do not require the old control endpoint to remain alive.
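Idempotent, repeatable signals can be modeled with a generation counter on the node side. This is a minimal sketch, assuming each signal carries an integer generation; the class name `TriggerInbox` and its members are hypothetical.

```python
class TriggerInbox:
    """Node-side handling of update signals that may be delivered repeatedly."""

    def __init__(self) -> None:
        self.acked_generation = 0
        self.applied: list[int] = []

    def deliver(self, generation: int) -> int:
        """Process a signal and return the generation to acknowledge.

        Repeated or older generations are acknowledged but cause no new work,
        so the sender can safely re-signal until it sees the acknowledgement.
        """
        if generation > self.acked_generation:
            self.applied.append(generation)
            self.acked_generation = generation
        return self.acked_generation


inbox = TriggerInbox()
acks = [inbox.deliver(g) for g in (5, 5, 4, 6)]
```

Because the acknowledgement depends only on the generation, the signal can arrive through any carrier, not only the original control endpoint.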
## 4. Update Safety Rules
### 4.1 Upgrade Contracts
Every release that changes recovery-critical contracts must explicitly declare:
- minimum supported old version;
- maximum tolerated skew;
- whether migration is rolling-safe;
- whether the node must first update host-agent or node-agent;
- rollback compatibility;
- whether old bootstrap/registry envelopes remain accepted.
### 4.2 Two-Key Rule For Breaking Changes
Do not simultaneously break:
- discovery of where to get the update; and
- ability to understand the update once found.
At least one of those must remain compatible until fleet convergence or
explicit retirement.
### 4.3 Old Artifact Retention
Recovery-critical artifact versions must remain available until:
- all nodes have moved past them; or
- the remaining nodes are revoked/retired and recorded as intentionally lost.
Do not garbage-collect the last working host-agent or node-agent build for an
unrecovered population.
### 4.4 Install Type Continuity
If historical nodes request different install types for the same product
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
planning must keep compatibility aliases until the fleet converges.
The fabric must not strand nodes on an install-type naming mismatch.
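Alias resolution might look like the sketch below. The alias table shown is purely illustrative: this document does not define which install types are interchangeable, so the mapping, the artifact names, and the function name `resolve_artifact` are all assumptions.

```python
from typing import Optional


def resolve_artifact(
    requested_type: str,
    available: dict[str, str],
    aliases: dict[str, list[str]],
) -> Optional[str]:
    """Return an artifact for the requested install type, trying aliases in order."""
    candidates = [requested_type] + aliases.get(requested_type, [])
    for install_type in candidates:
        if install_type in available:
            return available[install_type]
    return None


# Hypothetical mapping for illustration only.
aliases = {"windows_service": ["windows_binary"]}
available = {
    "windows_binary": "host-agent-windows.zip",
    "linux_binary": "host-agent-linux.tar.gz",
}

legacy_request = resolve_artifact("windows_service", available, aliases)
unknown_request = resolve_artifact("native", available, aliases)
```

A `None` result for a type that historical nodes still request is exactly the naming mismatch this rule is meant to prevent.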
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
A stale node may present the following combination:
- a compatible recovery artifact exists under the current registry; but
- the last local updater/host-agent status still says `no_matching_artifact` or
an equivalent legacy contract failure.
This means the node is not merely waiting for a heartbeat. It is running an
older recovery planner contract and may still depend on:
- historical install-type aliases;
- older artifact matching semantics;
- older update-plan interpretation rules;
- overlap in signed registry / bootstrap envelopes.
This condition must be classified as `legacy recovery contract drift` and must
block compatibility removal the same way an artifact gap does.
Operationally this also means:
- the node requires a `recovery bridge`;
- the cluster enters `bridge hold active` for compatibility-removal decisions;
- `bridge hold` remains active until the node reports a recovery-compatible
status on the current contract or the operator explicitly retires the node;
- when a compatible artifact and target mapping already exist, the node should
be classified as `bridge replay ready`, meaning the system can replay the
legacy-compatible update plan as soon as the node regains an outbound control
cycle;
- operator tooling should expose a canonical `bridge replay plan` per node so
recovery replay uses the same signed update-plan logic as normal updates;
- compatibility aliases / overlap must remain enabled for that node population;
- dashboards and rollout guards must show this separately from ordinary
`waiting recovery heartbeat`.
Canonical example:
- `ifcm-rufms-s-mo1cr` is stale;
- the current backend can match a Windows-compatible host-agent artifact;
- the last host-agent report still says `no_matching_artifact`;
- therefore the node must be treated as a legacy recovery-contract blocker, not
merely as a delayed heartbeat.
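The classification in this section can be sketched as a small decision function. The inputs and the function names are hypothetical; the labels `legacy recovery contract drift`, `waiting recovery heartbeat`, and `artifact gap` come from the text above, while `healthy` is an assumed placeholder for nodes that are not stale.

```python
def classify_stale_node(
    is_stale: bool,
    backend_has_matching_artifact: bool,
    last_agent_status: str,
) -> str:
    """Classify a node for compatibility-removal decisions."""
    if not is_stale:
        return "healthy"  # assumed label, not defined by the policy
    if not backend_has_matching_artifact:
        return "artifact gap"
    if last_agent_status == "no_matching_artifact":
        # The backend can match an artifact, but the node's older planner cannot.
        return "legacy recovery contract drift"
    return "waiting recovery heartbeat"


def blocks_compatibility_removal(classification: str) -> bool:
    """Drift blocks removal the same way an artifact gap does."""
    return classification in {"artifact gap", "legacy recovery contract drift"}


canonical = classify_stale_node(
    is_stale=True,
    backend_has_matching_artifact=True,
    last_agent_status="no_matching_artifact",
)
```

Applied to the canonical example, the node lands in the drift class rather than the ordinary delayed-heartbeat class.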
## 5. Service And Location Mobility Rules
Moving a service must not strand nodes that only know the old location.
Required pattern:
1. publish new signed registry records;
2. keep old records valid during overlap;
3. allow any reachable peer to relay the new records;
4. live-probe and promote the new endpoints;
5. only then retire the old location;
6. keep enough overlap for slow or partitioned nodes to catch up.
This applies to:
- control-api replicas;
- update-cache/update-store replicas;
- web/admin ingress replicas;
- relay/rendezvous nodes;
- service-channel endpoints.
## 6. Failure Classes The Fabric Must Tolerate
The design must explicitly handle all of these:
- node behind NAT with only outbound connectivity;
- several nodes behind one NAT/local segment;
- node changes public IP;
- node changes private IP;
- old DNS/URL becomes dead;
- artifact mirror disappears;
- control ingress disappears;
- relay disappears;
- update install fails halfway;
- binary staged but restart fails;
- old task/service name changes;
- local disk is nearly full;
- time skew causes signature freshness risk;
- authority rotates;
- route authority replica disappears;
- state directory survives but binary is broken;
- binary survives but state directory is partly stale;
- node reboots during update;
- only one peer still knows the new registry truth;
- node is partitioned for a long time and rejoins later;
- platform removes legacy support too early;
- operator has no shell/RDP/WinRM/SSH access to the host.
## 7. Required Local State And Journaling
The node local state store must retain at least:
- active and previous signed registry records;
- active and previous bootstrap seeds;
- last successful update plan per product;
- last applied artifact hash/version;
- last rollback candidate;
- last successful service endpoints used for update/control;
- pending trigger generation;
- recovery attempts with timestamps and reasons;
- last known good runtime command line / task/unit identity;
- last known workload desired states.
Writes must be atomic. A power loss must not leave the node with zero valid
state.
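A common way to meet the atomic-write requirement is to write a temporary file in the same directory, flush it to disk, and then rename it over the target. The sketch below assumes JSON-serializable state and a hypothetical function name `write_state_atomically`; it is one possible approach, not the node agent's actual implementation.

```python
import json
import os
import tempfile


def write_state_atomically(path: str, state: dict) -> None:
    """Replace the state file so readers see either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)  # atomic swap of old and new content
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before `os.replace`, the previous state file is untouched. This sketch does not fsync the containing directory, which some filesystems need for the rename itself to be durable across power loss.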
## 8. Observability And Fleet Safety Rules
The control plane must make invisible-recovery risk explicit.
It must surface:
- nodes with stale heartbeat but recent updater activity;
- nodes with no working compatible recovery artifact;
- nodes whose pinned registry/bootstrap epoch is too old;
- nodes whose only known artifact URL is dead;
- nodes whose desired state requires a contract they cannot parse;
- nodes whose local agent version is below the minimum recovery floor;
- nodes whose last successful contact depended on a single service replica.
Cluster-wide changes that would strand such nodes must be blocked or require an
explicit recovery-admin override.
## 9. Release And Migration Checklist
Before deleting old code, old formats, or old endpoints, verify all of these:
1. every active node has confirmed a compatible version; or the remaining nodes
are explicitly marked for manual retirement/recovery;
2. host-agent and node-agent recovery paths both have matching artifacts;
3. bootstrap/registry overlap exists for the migration window;
4. at least two independent artifact sources remain reachable;
5. signed registry gossip can carry the new locations without the old API
hostname;
6. rollback artifacts are still available;
7. install type aliases remain for historical agents where needed;
8. NAT/passive/outbound-only nodes were explicitly tested;
9. stale-node risk report is empty or consciously accepted by recovery-admin;
10. removal of legacy support is documented with the exact cutoff conditions.
## 10. `ifcm-rufms-s-mo1cr` Rule
`ifcm-rufms-s-mo1cr` is the standing reference case for future work.
For this node class, the platform must assume:
- the host is behind NAT;
- the node may only keep outbound channels;
- no direct Windows administrative access exists;
- old discovery endpoints may disappear;
- only the fabric/update/recovery plane can save the node.
Any future transport, update, authority, bootstrap, registry, or workload
change must be reviewed against this question:
> If `ifcm-rufms-s-mo1cr` is still on the older contract and we cannot log in to
> the host, can the fabric still recover it?
If the answer is no, the change is incomplete.
## 11. Immediate Follow-Through
The system should keep implementing these concrete items:
- separate documented recovery-plane tests for Windows NAT nodes;
- signed registry retention and overlap checks before endpoint migration;
- compatibility alias coverage for historical install types;
- artifact availability health over all mirrors;
- stale-node risk dashboard/report before legacy removal;
- node-local journaling for last good registry/update state;
- neighbor-assisted artifact relay path;
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
## 12. Decision
The fabric must treat node survival as a first-class architecture contract.
A node is not considered safe merely because the happy path works. It is safe
only when it can survive protocol migration, endpoint relocation, partial
cluster loss, artifact source loss, and lack of manual host access without
being abandoned.
@@ -256,9 +256,11 @@ The first backend contract slice is implemented:
observations, and degraded backend relay usage. These incidents keep backend
relay visible as degraded compatibility behavior rather than hidden steady
state.
- Node-agent access telemetry distinguishes backend relay actually used from
backend relay blocked by signed data-plane policy. Blocked fallback reports
include `backend_fallback_blocked` and the last violation status/reason, and
- Node-agent access telemetry distinguishes degraded compatibility requested
from degraded compatibility blocked by signed data-plane policy. Blocked
compatibility reports include `degraded_compatibility_blocked` and the last
violation status/reason, while preserving the original raw violation code in
a separate field for historical correlation, and
backend projects them to access telemetry plus `data_plane_contract`
incidents.
- Backend correlates access-report send failures with active service-channel
@@ -421,8 +423,8 @@ The first backend contract slice is implemented:
keeps failing outside manual retry cooldown creates a bounded rebuild
request. If an unfenced alternate is available, Control Plane marks the
rebuild `applied` and selects that route generation; if no alternate exists,
it records `pending_degraded_fallback` and keeps backend relay as the
explicit degraded path until a new route appears. The compatibility release
it records `pending_degraded_route_state` and keeps the channel in explicit
degraded route state until a new route appears. The compatibility release
`0.2.175` keeps node/host-agent signed-config models aligned with these new
fields.
- C18U moves rebuild metadata into node-agent runtime behavior. Node-agent
@@ -437,10 +439,10 @@ The first backend contract slice is implemented:
- C18V adds route-manager transition telemetry and churn coverage. Node-agent
`0.2.177` reports `route_manager_transition` alongside the current manager
snapshot, including previous/current generation, status, decision count,
withdrawn route count, restored route count, pending-degraded fallback count,
withdrawn route count, restored route count, pending degraded route-state count,
rebuild applied count, and any cached selected route cleared because Control
Plane withdrew it. Coverage verifies three service-neutral lifecycle cases:
applied rebuild replacement, pending degraded fallback when no alternate is
applied rebuild replacement, pending degraded route state when no alternate is
available, and rollback/restoration when a fresh config removes the rebuild
decision.
- C18W adds a live docker-test verification loop for that telemetry. The smoke
@@ -973,8 +975,8 @@ The first backend contract slice is implemented:
in C18Z45; rebuild snapshot maintenance health with overdue/runtime-evidence
visibility landed in C18Z46; node-agent signed service-channel lease
enforcement when cluster authority is pinned landed in C18Z47; backend
introspection fallback for unsigned compatibility clients landed in C18Z48;
accepted-by telemetry for signed/introspection/legacy ingress landed in
introspection fallback for token-authorized compatibility clients landed in C18Z48;
accepted-by telemetry for signed/introspection/token-authorized ingress landed in
C18Z49; durable lease introspection across backend restarts landed in C18Z50;
bounded durable lease cleanup and admin visibility landed in C18Z51; durable
accepted-by access telemetry aggregation with heartbeat fallback and admin
@@ -983,9 +985,9 @@ The first backend contract slice is implemented:
visibility landed in C18Z53; C18Z54 smoke proves the same diagnostics on a
normal non-fallback primary route with healthy rolling route-quality feedback;
C18Z55 smoke proves degraded/fenced normal-route feedback is shown separately
from explicit backend fallback; C18Z56 adds active-channel remediation
from explicit degraded compatibility requests; C18Z56 adds active-channel remediation
diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
`use_backend_fallback`) to make the next runtime action explicit, and its
`hold_degraded_route_state`) to make the next runtime action explicit, and its
alternate-route branch is live-smoke-proven with backend fallback kept off.
C18Z57 adds the bounded machine-readable `remediation_command` contract to
active access telemetry rows so route-manager can consume a short-lived
@@ -1058,7 +1060,7 @@ The first backend contract slice is implemented:
`rebuild_request_recorded` or `rebuild_request_rejected` for the active
channel. C18Z76 adds node-side acknowledgement for the allowed
`rebuild_route` branch: node-agent consumes the command as a route-manager
`pending_degraded_fallback` decision with source
`pending_degraded_route_state` decision with source
`service_channel_remediation_command`, while guarded commands remain ignored.
Backend access telemetry correlates that heartbeat evidence with the durable
ledger and reports `rebuild_request_recorded_node_pending`. C18Z77 resolves
@@ -1089,7 +1091,7 @@ The first backend contract slice is implemented:
reselecting the degraded replacement or adding fallback/failure/drop deltas.
C18Z82 proves the no-safe-recovery branch: if that replacement is also fenced
and no safe recovery route exists, synthetic config reports
`service_channel_feedback_no_alternate` / `pending_degraded_fallback` with
`service_channel_feedback_no_alternate` / `pending_degraded_route_state` with
`no_unfenced_alternate_route` instead of silently keeping a bad route.
C18Z83 projects that route-manager decision into active access telemetry and
web-admin active-channel diagnostics, including decision source, route id,
@@ -1124,7 +1126,8 @@ The first backend contract slice is implemented:
`data_plane` is present in the lease, authority payload, introspection
response, and lease-maintenance/admin list. It declares backend API as
control-plane transport, fabric service channel/fabric route as working
data/steady-state transport, backend relay as degraded fallback only, and
data/steady-state transport, degraded compatibility relay as an explicit
compatibility state only, and
service-neutral protocol-agnostic isolated logical flows as the runtime
contract for VPN, Remote Workspace, files, video, and future services. C18Z91
makes node-agent consume the signed/introspected data-plane contract, apply
@@ -1187,12 +1190,13 @@ channel class, selected entry node, allowed flow isolation, and data-plane
contract on `remote-workspaces/{resource_id}/streams/{channel_class}`. Empty
probe requests return `202` with a remote-workspace ingress probe contract and
access telemetry; real RDP frame forwarding remains deliberately
`not_implemented` until the service adapter work begins.
`validated_only` for empty probes until the service adapter work begins.
C19E adds a narrow frame-batch probe on that boundary. The adapter contract
advertises `rap.remote_workspace_frame_batch.v1`, and entry-node accepts
non-empty payloads only when they are JSON probe batches with `probe_only=true`,
valid remote-workspace logical channels, valid directions, and bounded payload
metadata. Accepted probes return `payload_flow=validated_probe_only`; production
metadata. Accepted frame probes return `payload_flow=validated_probe_only`, while
empty/control probes return `payload_flow=validated_only`; production
frame forwarding is still not enabled.
C19F connects that validated probe to a node-agent local adapter sink. The
in-memory `node_agent_rdp_worker_contract_probe` sink accepts only validated
@@ -3,7 +3,7 @@
Status: Stage C17 planning completed. Stage C17A synthetic mesh runtime
skeleton, Stage C17B route health/failover probes, Stage C17C relay semantic
hardening, Stage C17D non-production test-service path experiment, Stage C17E
live node-to-node synthetic HTTP transport skeleton, Stage C17F scoped
historical live node-to-node synthetic HTTP transport skeleton, Stage C17F scoped
synthetic route config boundary, Stage C17G Control Plane scoped synthetic
config read boundary, Stage C17H deployed multi-agent synthetic config smoke,
Stage C17I production forwarding gate, Stage C17J production envelope
@@ -44,8 +44,9 @@ invalidation. C17C added synthetic relay validation, per-channel bounded
queues, QoS dequeue order, telemetry-only drop/backpressure, and reliable
fabric/control rejection behavior. C17D added one bounded `synthetic.echo`
test-service path over direct, single-relay, and forced fallback routes. C17E
added real HTTP peer transport and a disabled-by-default node-agent synthetic
endpoint/smoke harness for direct and single-relay synthetic traffic. C17F
added one historical real-HTTP peer transport experiment and a
disabled-by-default node-agent synthetic endpoint/smoke harness for direct and
single-relay synthetic traffic only. C17F
added scoped synthetic peer/route config loading and synthetic route-health
link observation reporting. C17G added the Control Plane read boundary for
node-scoped synthetic mesh config. C17H proved that boundary in a deployed
@@ -596,10 +597,12 @@ C17H implemented a deployed multi-agent synthetic config smoke on
VPN/IP tunnel work remains a separate C18 track and must not be mixed into
C17 mesh runtime work.
## 15.4 C17E Result
## 15.4 C17E Historical Result
C17E implemented live node-to-node synthetic HTTP transport while preserving
the production forwarding kill-switch:
C17E implemented a historical live node-to-node synthetic HTTP transport
experiment while preserving the production forwarding kill-switch. This result
is retained only as test-history context; it is not the active transport
direction for the fabric runtime:
- `HTTPPeerTransport` maps explicit peer node IDs to synthetic HTTP endpoint
URLs.
@@ -613,6 +616,13 @@ the production forwarding kill-switch:
- `/mesh/v1/forward` remains disabled.
- no production service traffic is authorized.
Current direction:
- active fabric runtime transport is QUIC-only
- synthetic HTTP motion is historical test-only context
- production forwarding/runtime acceptance must use QUIC route execution rather
than HTTP peer transport
Verification:
```powershell
@@ -888,9 +898,11 @@ runtime. Stage C17A implements the first narrow runtime skeleton for synthetic
Fabric messages only. Stage C17B adds route health/failover observations using
synthetic Fabric messages only. Stage C17C adds relay semantic hardening for
synthetic channel classes only. Stage C17D adds one bounded non-production
`synthetic.echo` service-path experiment only. Stage C17E proves live
node-to-node synthetic HTTP transport using real local endpoints only. Stage
C17F proves scoped synthetic config loading and route-health reporting only.
`synthetic.echo` service-path experiment only. Stage C17E proves one
historical synthetic HTTP carrier experiment using real local endpoints only;
it is test-only and not representative of the active QUIC fabric runtime.
Stage C17F proves scoped synthetic config loading and route-health reporting
only.
Stage C17G proves Control Plane scoped synthetic config read/consume only.
Stage C17H proves deployed multi-agent Control Plane synthetic config
consumption and synthetic route-health reporting on `docker-test` only.
@@ -1,5 +1,12 @@
# Production Direct Worker WSS Trust
Archived status: this document describes an older direct-worker WSS trust
track. It is not the current runtime transport source of truth. For the active
fabric transport model, use
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: P3.4 design/prep complete.
This document defines the production trust model for direct worker WSS. It is a
@@ -1,5 +1,13 @@
# RDP Adapter Runtime
Paused/archival note: this document remains useful for RDP adapter internals,
but it is not the current source of truth for transport/runtime architecture.
Fabric transport is now QUIC-only between nodes. For active transport,
recovery, and routing behavior, see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: active implementation plan for the new C++ RDP Adapter internals.
Current implementation status:
@@ -1,5 +1,12 @@
# RDP Stage 5.2 Design Pass - Server-To-Client File Download
Archived status: this document belongs to the earlier direct-worker/backend-gateway
RDP track and is not the current source of truth for fabric transport
architecture. The active inter-node transport model is QUIC-only; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: design-complete proposal, no runtime implementation in this step.
Date: 2026-04-26
@@ -1,5 +1,13 @@
# RDP Service C++ Performance Target
Paused/archival note: this document is an RDP performance track record, not the
current source of truth for node-to-node transport. Fabric transport is now
QUIC-only between nodes; use
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` for the active transport
model.
## Status
This is the paused RDP service performance direction. The implementation name is `RDP Adapter`: a concrete `Service Adapter` that translates Microsoft RDP into the platform session/data-plane protocol. The common adapter contract is defined in `docs/architecture/SERVICE_ADAPTER_PROTOCOL.md`; the RDP-specific runtime plan is defined in `docs/architecture/RDP_ADAPTER_RUNTIME.md`.
@@ -1,5 +1,13 @@
# RDP Service C# Target Architecture
Archived scope note: this document is retained as historical RDP runtime
research and is not the current source of truth for node-to-node transport.
Fabric transport is now QUIC-only between nodes; use
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` for the active transport
model.
## Status
Superseded.
@@ -8,6 +8,12 @@ The current proven RDP lifecycle remains a preserved implementation baseline.
RDP work is currently paused by product decision. The active architecture focus
is the lower Fabric Core / cluster / node foundation.
Transport clarification: historical references in this document to direct
worker WSS or backend gateway fallback describe the earlier RDP service proof
path and migration context. They must not be read as the current inter-node
transport contract. The active fabric node-to-node runtime transport is
QUIC-only.
## 1. Project Vision
The project is a Secure Access Fabric: a distributed, multi-tenant platform for secure access to private resources across sites, networks, and organizations.
@@ -1702,7 +1708,7 @@ Channels must have independent priority, reliability, and backpressure behavior.
The current RDP MVP proves lifecycle and basic viewer behavior. It is not the target production performance model.
Target RDP realtime model:
Target RDP realtime model for the paused historical RDP service track:
- client connects to direct/relay data plane, not backend frame relay
- input/control channels are separate from render/video
@@ -2459,7 +2465,11 @@ This is an incremental migration plan. It must not be executed as a big-bang rew
### Current Fallback
Keep the current backend WebSocket gateway as fallback while the production data plane is introduced.
Historical migration note: the older RDP MVP kept the backend WebSocket
gateway as a temporary fallback while an earlier production data-plane design
was being introduced. This is not the active fabric transport plan. Current
fabric node-to-node runtime transport is QUIC-only, and old compatibility paths
are being removed rather than extended.
Current RDP MVP remains the preserved service-adapter baseline, but it is not
the active implementation focus while Fabric Core stages are underway.
@@ -2543,9 +2553,14 @@ These stages must be introduced only through explicit, narrow implementation
prompts. RDP/VNC/SSH/VPN/video/file services remain above the Fabric Core and
must not define the lower fabric foundation.
### Stage DP-1: Direct Worker WSS
### Historical Stage DP-1: Direct Worker WSS
Introduce a short-lived authorized direct WSS path from client to worker or worker-local live endpoint.
This stage records an earlier RDP service migration concept. It is paused and
retained for historical context only. It must not be read as the active fabric
transport roadmap.
Introduce a short-lived authorized direct WSS path from client to worker or
worker-local live endpoint.
Goals:
@@ -2554,7 +2569,7 @@ Goals:
- keep session broker lifecycle unchanged
- keep fallback gateway available
### Stage DP-2: Binary Frames
### Historical Stage DP-2: Binary Frames
Replace base64 JSON frame payloads with binary frame messages.
@@ -2565,7 +2580,7 @@ Goals:
- reduce JSON/base64 overhead
- preserve latest-frame-only behavior
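
To illustrate why this historical stage moved away from base64 JSON payloads, here is a minimal sketch comparing the two encodings. The header layout (`seq`, `width`, `height`, payload length) and all function names are hypothetical, chosen for illustration only; they are not the wire format used by the RDP Adapter.

```python
import base64
import json
import struct

# Hypothetical fixed header, big-endian: sequence number (uint32),
# width (uint16), height (uint16), payload length (uint32).
HEADER = struct.Struct(">IHHI")


def encode_json_frame(seq, width, height, payload):
    """Older shape: frame bytes carried as base64 text inside JSON."""
    body = {
        "seq": seq,
        "w": width,
        "h": height,
        "data": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(body).encode("utf-8")


def encode_binary_frame(seq, width, height, payload):
    """Binary shape: fixed header followed by the raw payload bytes."""
    return HEADER.pack(seq, width, height, len(payload)) + payload


def decode_binary_frame(message):
    """Parse a binary frame, rejecting messages shorter than declared."""
    seq, width, height, length = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return seq, width, height, payload
```

Base64 expands the payload by roughly one third before any JSON framing is added, while the binary form adds only a constant-size header, which is the overhead reduction the goals above refer to.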
### Stage DP-3: Adaptive Quality
### Historical Stage DP-3: Adaptive Quality
Implement adaptive RDP quality profiles.
@@ -2577,9 +2592,10 @@ Goals:
- bandwidth and latency feedback
- bounded frame queues
### Stage DP-4: Relay Nodes
### Historical Stage DP-4: Relay Nodes
Introduce `entry-node` and `relay-node` roles for data-plane routing.
Introduce `entry-node` and `relay-node` roles for the earlier service-specific
data-plane routing model.
Goals:
+29 -19
@@ -1,20 +1,28 @@
# Security And Secrets Readiness
Status: P3.3 test-stand smoke complete for encrypted resource secrets,
assignment-time resolution, and production fallback behavior with smoke-only
direct worker WSS trust.
Archived scope note: this document records an earlier RDP/direct-worker trust
and secret-handling stage. It is not the current source of truth for fabric
transport architecture. The active inter-node transport model is QUIC-only; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: P3.3 historical test-stand smoke complete for encrypted resource
secrets, assignment-time resolution, and legacy RDP baseline behavior with
smoke-only direct-worker trust.
This document defines the next security hardening layer around the accepted RDP
MVP baseline. It does not implement mesh, VPN, server-to-client download, new
protocol adapters, or another RDP rendering mode.
## Current Accepted Baseline
## Current Accepted Historical RDP Baseline
- RDP worker baseline: `rap-rdp-worker:rdp-p1-region-order2`
- Backend control plane remains source of truth.
- Redis remains live coordination/routing only.
- Direct worker WSS is preferred for realtime RDP.
- Backend gateway remains fallback/debug.
- Historical direct-worker WSS was the preferred realtime RDP path in this
stage.
- Historical backend gateway remained a fallback/debug path for this stage.
- Text clipboard is policy-gated and accepted.
- Client-to-server file upload and restricted `RAP_Transfers` visibility are
accepted.
@@ -124,22 +132,24 @@ Already accepted:
- worker rejects wrong worker, wrong attachment, wrong organization, wrong
resource, over-broad channels, failed/terminated sessions, and jti replay
Production still needs:
Production still needed for that stage:
- deployed certificate chain for direct worker WSS on production nodes
- pinned or platform-issued worker certificates in live production config
- deployed certificate chain for the historical direct-worker WSS path on
production nodes
- pinned or platform-issued worker certificates in live production config for
that historical path
- no smoke-only TLS bypass in production clients
- rotation process for data-plane signing keys
- audit for failed token validation/bind attempts
P3.2 guard exists:
P3.2 historical guard exists:
- backend distinguishes `smoke_insecure`, `public_ca`, and `platform_ca`
direct worker WSS trust modes
- production backend omits smoke-only direct candidates
- Windows production client skips untrusted or smoke-only direct candidates
- backend distinguished `smoke_insecure`, `public_ca`, and `platform_ca`
direct-worker trust modes for the historical RDP path
- production backend omitted smoke-only direct candidates on that path
- Windows production client skipped untrusted or smoke-only direct candidates
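
The guard described above can be sketched as a small filter. This is an illustrative sketch only: the `Candidate` type, the `filter_direct_candidates` function, and the `app_env` argument are hypothetical names, not the backend's actual API. Only the three trust-mode strings come from the text.

```python
from dataclasses import dataclass

# Trust-mode names as listed in the document; everything else is hypothetical.
TRUST_MODES = {"smoke_insecure", "public_ca", "platform_ca"}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical direct-worker candidate advertised to a client."""
    url: str
    trust_mode: str


def filter_direct_candidates(candidates, app_env):
    """Return the direct candidates a backend running in `app_env` would keep.

    Assumption: a production backend omits smoke-only candidates, while
    development/smoke backends advertise them explicitly.
    """
    kept = []
    for candidate in candidates:
        if candidate.trust_mode not in TRUST_MODES:
            raise ValueError(f"unknown trust mode: {candidate.trust_mode}")
        if app_env == "production" and candidate.trust_mode == "smoke_insecure":
            continue  # never advertise smoke-only trust in production
        kept.append(candidate)
    return kept
```

Under these assumptions, an empty result in production corresponds to the case where only the backend gateway path was returned.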
P3.3 test-stand smoke exists:
P3.3 historical test-stand smoke exists:
- `resource_secrets` migration is applied on `docker-test`
- backend runs as `APP_ENV=production` with a test-only
@@ -149,9 +159,9 @@ P3.3 test-stand smoke exists:
- `resources.metadata`, `remote_sessions.metadata`, and `audit_events` were
checked for plaintext username/password leakage
- production backend with `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
returns backend gateway fallback only
returned the historical backend gateway debug path only
- development/smoke backend with the same trust mode advertises the explicit
smoke-only direct worker WSS candidate
smoke-only historical direct-worker candidate
- `RAP_Transfers` smoke passed on the secret-backed resource
## Required Regression Tests
@@ -202,8 +212,8 @@ P3.1 implemented audit events for:
assignment payload; a future resolver pull/token flow should reduce exposure
in Redis control queues.
- Worker still depends on plaintext assignment metadata for development smoke.
- Production direct worker WSS certificate issuance/rotation and platform CA
distribution are not complete.
- Production certificate issuance/rotation and platform CA distribution for the
historical direct-worker path are not complete.
- The test-stand secret key is a host-local test file, not a production KMS or
HSM-backed key.
- Automated end-to-end policy denial coverage is still thin.
+20 -2
@@ -1,7 +1,21 @@
# Service Adapter Protocol
Scope note: this document remains the common adapter-model reference, but it is
not the current source of truth for transport/runtime topology between fabric
nodes. Fabric transport is now QUIC-only between nodes; for active transport,
routing, and recovery behavior see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Status: target contract and compile-safe foundation. This document defines the common adapter model for RDP, SSH, VNC, and future services. It does not replace the current backend control plane or current RDP runtime by itself.
Transport clarification: historical references in this document to direct
worker WSS, backend gateway fallback, or DP-1 channel shape belong to the
earlier RDP service baseline. They are not the active inter-node transport
contract. Current fabric node-to-node transport is QUIC-only; service adapters
consume fabric routes rather than define transport fallback behavior.
## 1. Purpose
The platform client must not implement third-party protocols directly.
@@ -94,12 +108,16 @@ adapter runtime.
- Service Adapter does not know UI implementation details.
- Control Plane remains authoritative for session lifecycle and policy.
- PostgreSQL remains source of truth; Redis remains live coordination only.
- Direct worker WSS and backend gateway fallback remain valid transports.
- Fabric transport remains QUIC-only between nodes; any historical direct
worker or backend fallback paths belong to paused service-specific baselines,
not to the active fabric transport contract.
- Adapter runtime must not create sessions outside broker/assignment control.
## 4. Logical Channels
The session protocol is channel-oriented even when DP-1 uses one WSS connection.
The session protocol is channel-oriented regardless of the concrete carrier. A
historical DP-1 single-WSS shape may still appear in paused RDP notes, but it
is not the current fabric transport contract.
| Channel | Direction | Reliability | Priority | Purpose |
| --- | --- | --- | --- | --- |
@@ -7,6 +7,11 @@ Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
devices, mesh traffic, service workload execution, API changes, migrations, or
RDP behavior changes.
Transport clarification: this document defines a service layer above Fabric
Core. It does not redefine node-to-node transport. Current fabric inter-node
transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric
routes instead of introducing a separate packet transport contract.
## Purpose
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
@@ -9,6 +9,15 @@ Secure Access Fabric.
The fabric node-to-node transport remains QUIC-only. HTTP/HTTPS is allowed only
as an external client-facing service edge.
Terminology rule:
- `Fabric Transport` = QUIC/UDP node-to-node runtime layer.
- `Control API` = HTTP/HTTPS management surface for UI, automation, releases,
policy, audit, and status.
The Control API may use HTTP/HTTPS, but it is not a fallback or alternate
carrier for fabric node-to-node runtime traffic.
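
The terminology rule can be expressed as a carrier-selection check with no fallback branch. This is a minimal sketch under stated assumptions: `TrafficKind`, `Carrier`, `NoCarrierAvailable`, and `choose_carrier` are hypothetical names introduced for illustration and do not appear in the project.

```python
from enum import Enum


class Carrier(Enum):
    QUIC = "quic"  # Fabric Transport: QUIC/UDP node-to-node runtime layer
    HTTP = "http"  # Control API: HTTP/HTTPS management surface


class TrafficKind(Enum):
    NODE_RUNTIME = "node_runtime"  # fabric node-to-node runtime traffic
    MANAGEMENT = "management"      # UI, automation, releases, policy, audit, status


class NoCarrierAvailable(Exception):
    """Raised when node-to-node traffic has no permitted carrier."""


def choose_carrier(kind, quic_available):
    """Pick a carrier without ever downgrading node traffic to HTTP."""
    if kind is TrafficKind.MANAGEMENT:
        return Carrier.HTTP
    if quic_available:
        return Carrier.QUIC
    # Deliberately no fallback: the Control API is not an alternate carrier.
    raise NoCarrierAvailable("node-to-node traffic requires QUIC")
```

The design point is that an unavailable QUIC path surfaces as an explicit error rather than a silent switch to the management surface.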
## Purpose
The platform needs a clear distinction between: