@@ -6,6 +6,16 @@ This file exists so architecture documents have a stable guardrails reference
inside `docs/architecture`. The operational Codex guardrails remain in
`docs/codex/ARCHITECTURE_GUARDRAILS.md`.

Transport clarification: references in this document to direct worker WSS and
backend gateway fallback belong to the preserved historical RDP service
baseline. They are not the active source of truth for inter-node transport.
Current fabric node-to-node transport is QUIC-only and is defined by
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Node survivability, recovery overlap, and no-manual-access repair rules are
defined by `docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`.

## 1. Preserve the Proven RDP Baseline

The following are already proven and must remain stable:
@@ -16,8 +26,8 @@ The following are already proven and must remain stable:
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- direct worker WSS data plane
- backend gateway fallback
- historical direct worker WSS RDP path
- historical backend gateway fallback for the RDP baseline
- C++ RDP Adapter as the active RDP runtime

Architecture clarification must not silently weaken this behavior.
@@ -191,6 +201,9 @@ Updates must support:
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
- legacy recovery compatibility until the fleet is converged or explicitly
  retired
- multi-source artifact retrieval for stranded or NAT-only nodes

Version Storage stores immutable release manifests, artifacts, hashes,
signatures, compatibility metadata, provenance, and approved migration bundles.

@@ -1059,7 +1059,8 @@ accepts a signed/introspected `remote_workspace` service-channel lease on
`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
class, channel class, selected entry node, and data-plane flow isolation, and
reports access telemetry. It intentionally returns a probe contract with
`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
`payload_flow=validated_only` for empty control probes; non-empty RDP payloads are
rejected with `probe_only required`. This stage proves
the Fabric ingress contract without forwarding desktop frames yet. The live
smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.
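
The updated probe contract in this hunk can be pictured with a minimal Go sketch. The function name `probeContract` and the error variable are hypothetical and are not taken from the repository; only the `validated_only` value and the `probe_only required` rejection text come from the paragraph above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrProbeOnly mirrors the `probe_only required` rejection named in the text.
// The variable name itself is an assumption made for this sketch.
var ErrProbeOnly = errors.New("probe_only required")

// probeContract is a hypothetical helper: an empty control probe yields
// payload_flow=validated_only, and any non-empty RDP payload is rejected.
func probeContract(payload []byte) (string, error) {
	if len(payload) > 0 {
		return "", ErrProbeOnly
	}
	return "validated_only", nil
}

func main() {
	flow, err := probeContract(nil)
	fmt.Println("empty probe:", flow, err)

	_, err = probeContract([]byte{0x03, 0x00})
	fmt.Println("non-empty payload:", err)
}
```

The sketch deliberately forwards nothing, which matches the stated goal of proving the ingress contract without moving desktop frames.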

@@ -1,5 +1,12 @@
# Data Plane v1 for RDP

Archived status: this document is a historical RDP/WebSocket stage record, not
the current runtime source of truth for transport architecture. The active
fabric transport model is QUIC-only between nodes; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.

Status: DP-3A grayscale full-frame binary render foundation is implemented and smoke-proven on the test Docker environment as of 2026-04-25. DP-3B adaptive quality policy/selection is intentionally paused. The accepted C++ RDP Adapter baseline is the ordered-region path. RDP-Perf-6 makes direct dirty-region binary render explicit with `render.frame.full` / `render.frame.region` RAP2 message types and is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26. The current test Docker deployment for the RDP Adapter performance path is `rap-rdp-worker:rdp-perf6-dirty-region`. The Stage 5.2 core download data path remains runtime-proven for direct worker WSS and backend gateway fallback. Data-plane and RDP work are paused; the next active focus is Stage C10 Fabric Core / cluster foundation, not another data-plane feature.

This document defines the first staged data-plane evolution for the RDP MVP. It does not implement direct worker WebSocket runtime, mesh routing, VPN, QUIC, UDP, WebRTC, relay nodes, or multi-cluster behavior.

@@ -1,5 +1,12 @@
# Direct Worker WSS TLS / PKI

Archived status: this document captures a direct-worker WSS trust design track
and is no longer the primary reference for node-to-node transport. The active
fabric transport model is QUIC-only between nodes; see
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.

Status: P3.4 trust-model design/prep complete.

This document defines the production trust model for direct worker WSS. It does

@@ -24,6 +24,21 @@ policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.

Node survival and recovery across endpoint moves, NAT-only reachability, legacy
contract overlap, and unavailable manual host access are governed by
`docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
fabric/update/recovery plane even when direct host login is unavailable.

Android implementation contract:

- app install/build contains a QUIC bootstrap seed set;
- runtime launch carries a `fabric_bootstrap_config`, not a backend URL;
- user login/profile selection happens over the fabric control channel;
- the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet
  forwarding, WebSocket packet relay, and direct backend packet relay are not
  part of the supported runtime path.

## What Was Missing

The current implementation proves route leases and production VPN forwarding,
@@ -60,8 +75,9 @@ route and stream semantics.
   must keep working through cached policy, peer directories, route leases, and
   local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
   primary high-speed data plane.
8. QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may
   serve human-facing download or panel pages, but it is not a node data-plane
   fallback and must not carry service packets.
9. There must be no single management service that can seize the fabric. Control,
   storage, update distribution, route authority, and certificate authority are
   fabric roles assigned to eligible nodes and protected by quorum signatures.
@@ -73,6 +89,20 @@ route and stream semantics.
    the usable candidate locally by policy, reachability, latency, load, and
    trust.

## Transport vs Control API

The system must keep two layers separate in naming, design, and diagnostics:

- `Fabric Transport` means inter-node runtime delivery only. It is QUIC over UDP
  and carries leased service-channel/data-plane traffic between nodes.
- `Control API` means human/operator/programmatic management surfaces such as
  web-admin, release publication, policy mutation, audit queries, and status
  reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.

The HTTP Control API is not a fallback transport for node-to-node runtime
traffic. A `409 Conflict` from the backend, a panel page load, or a release
download is control-plane behavior, not fabric transport behavior.

## Distributed Control And Trust

The target fabric behaves like a distributed network, not a client/server
@@ -145,6 +175,143 @@ Endpoint state is also distributed:
- Neighbor selection is local and latency/load-aware; the state log announces
  facts and policy, not a forced single next hop.

### Fabric Registry Gossip

Moving a service must not break the farm.

`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
old nodes. It is not cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.

The rule is:

- any node may relay registry knowledge;
- only authorized signatures can create or replace trusted registry truth;
- a new record becomes active only after signature/authority checks and a
  successful live probe through the fabric or a policy-approved direct QUIC
  candidate;
- older still-valid records remain as fallback until their TTL expires.

Registry record shape:

```text
schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
  - endpoint_id
    address: quic://...
    transport: direct_quic | relay_quic | reverse_quic
    reachability
    connectivity_mode
    priority / weight
    peer_cert_sha256
signatures:
  - key_id
    issuer_id
    role
    alg: ed25519
    value
```

Acceptance algorithm:

1. Reject records for a different cluster, expired records, future records past
   allowed clock skew, unsupported schema, missing endpoints, or non-QUIC
   endpoints.
2. Verify the canonical record payload, excluding `signatures`, against the
   configured authority set.
3. Check the signer role is allowed for that service and scope.
4. Require quorum where policy says M-of-N; development may use one trusted
   signer but must mark that signer as bootstrap/development authority.
5. Store accepted records as `candidate`.
6. Promote `candidate` to `active` only after live-probing at least one endpoint
   and verifying the endpoint identity/pin.
7. Prefer higher epoch, then newer issued time, then generation. Do not replace
   a live active record with an older record.
8. Keep the previous active record usable as fallback until TTL expiry when a
   newer candidate is not yet live-verified.
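
Steps 1 and 7 of the acceptance algorithm are mechanical enough to sketch in Go. This is an illustration under stated assumptions, not the repository's implementation: the `Record` struct, the use of Unix seconds for timestamps, and the function names `precheck` and `supersedes` are all hypothetical. Signature, role, and quorum verification (steps 2 to 4) and live probing (step 6) are intentionally left out.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const supportedSchema = "rap.fabric.registry.gossip_record.v1"

// Record is a trimmed, hypothetical view of a gossip record.
// Timestamps are Unix seconds to keep the sketch free of parsing concerns.
type Record struct {
	SchemaVersion string
	ClusterID     string
	Epoch         uint64
	Generation    uint64
	IssuedAt      int64
	ExpiresAt     int64
	Endpoints     []string // endpoint addresses, e.g. "quic://host:port"
}

// precheck covers step 1 only: cluster, expiry, clock skew, schema,
// missing endpoints, and non-QUIC endpoints.
func precheck(r Record, clusterID string, now, maxSkew int64) error {
	switch {
	case r.SchemaVersion != supportedSchema:
		return errors.New("unsupported schema")
	case r.ClusterID != clusterID:
		return errors.New("record belongs to a different cluster")
	case r.ExpiresAt <= now:
		return errors.New("record expired")
	case r.IssuedAt > now+maxSkew:
		return errors.New("record issued too far in the future")
	case len(r.Endpoints) == 0:
		return errors.New("record has no endpoints")
	}
	for _, e := range r.Endpoints {
		if !strings.HasPrefix(e, "quic://") {
			return fmt.Errorf("non-QUIC endpoint %q", e)
		}
	}
	return nil
}

// supersedes applies the step 7 ordering: epoch, then issued time,
// then generation. An older record never supersedes a newer one.
func supersedes(candidate, active Record) bool {
	if candidate.Epoch != active.Epoch {
		return candidate.Epoch > active.Epoch
	}
	if candidate.IssuedAt != active.IssuedAt {
		return candidate.IssuedAt > active.IssuedAt
	}
	return candidate.Generation > active.Generation
}

func main() {
	now := int64(1000000)
	active := Record{SchemaVersion: supportedSchema, ClusterID: "cluster-a", Epoch: 4,
		IssuedAt: now - 600, ExpiresAt: now + 3600, Endpoints: []string{"quic://10.0.0.1:18080"}}
	next := active
	next.Epoch = 5
	fmt.Println("precheck:", precheck(next, "cluster-a", now, 30))
	fmt.Println("supersedes:", supersedes(next, active))
}
```

A record that passes `precheck` would still only be stored as `candidate`; promotion to `active` depends on the signature checks and the live probe that this sketch omits.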

This is the recovery path for mass moves. If every known service endpoint moves
at once, the operator or a control-authority node only has to deliver a signed
registry record to one reachable fabric node. That node validates it, probes it,
promotes it, and gossips it onward. User/mobile/candidate nodes may carry the
record, but cannot make it authoritative unless their role certificate permits
that service/scope.

Service classes that must use this registry before production hardening:

- `control-api`: heartbeat, auth/profile control projection, node registration,
  policy/snapshot fetch.
- `update-store`: signed release manifests and compatibility windows.
- `update-cache`: artifact mirrors close to nodes.
- `web-admin`: management UI/API ingress replicas.
- `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
  nodes.

Legacy endpoint compatibility is allowed only for rolling migration:

- Old nodes may use their baked HTTP/control URL only to fetch a new version or
  a signed registry bootstrap record.
- New nodes must treat fixed URLs as fallback hints, not as authority.
- Old code is removed only after every live node reports a version that supports
  signed registry gossip and service discovery by role.

Listener configuration is split into bind sockets and reachability candidates:

- `listen_addr` is what the local process binds, for example
  `0.0.0.0:18080` on `home-1`.
- `endpoint_candidates` is the ordered set of addresses other nodes may try.
  A single node can publish LAN addresses, addresses on several network
  adapters, STUN/reflexive addresses, and multiple public NAT forwards from
  different providers.
- Public NAT forwards are modeled as candidates with metadata, not as a
  replacement for the internal bind address. Example:
  `quic://94.141.118.222:19199 reachability=public connectivity=direct
  provider=isp1 maps_to=192.168.200.85:18080`.
- A candidate may be valid only from outside the NAT. Same-LAN hairpin failure
  is not a proof that the public candidate is broken; verification must be
  scoped to an external peer or remote probe.
- The route builder scores candidates by reachability, measured latency, loss,
  load, policy, and verification freshness. If one provider or interface fails,
  the node keeps the same node identity and republishes a new candidate epoch.
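
The split between a bind socket and published candidates can be sketched as a small Go data model with a ranking step. Everything here is an assumption made for illustration: the struct and field names, the measurement values, and especially the scoring weights are hypothetical, and load and policy inputs are omitted to keep the example short.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical shape for one published reachability candidate.
type Candidate struct {
	Address      string  // address other nodes may try
	Reachability string  // e.g. "public" or "lan"
	Provider     string  // e.g. "isp1"
	MapsTo       string  // internal bind address behind a NAT forward, if any
	LatencyMs    float64 // measured latency
	LossPct      float64 // measured loss
	VerifiedAgeS float64 // seconds since the last successful verification
}

// Listener keeps the local bind address separate from what is advertised.
type Listener struct {
	ListenAddr         string
	EndpointCandidates []Candidate
}

// score returns a cost where lower is better. The weights are illustrative only.
func score(c Candidate) float64 {
	return c.LatencyMs + 20*c.LossPct + c.VerifiedAgeS/10
}

// rank returns a copy of the candidates ordered from best to worst.
func rank(cs []Candidate) []Candidate {
	out := append([]Candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool { return score(out[i]) < score(out[j]) })
	return out
}

func main() {
	l := Listener{
		ListenAddr: "0.0.0.0:18080",
		EndpointCandidates: []Candidate{
			{Address: "quic://192.168.200.85:18080", Reachability: "lan", LatencyMs: 5, VerifiedAgeS: 600},
			{Address: "quic://94.141.118.222:19199", Reachability: "public", Provider: "isp1",
				MapsTo: "192.168.200.85:18080", LatencyMs: 40, VerifiedAgeS: 30},
		},
	}
	for _, c := range rank(l.EndpointCandidates) {
		fmt.Printf("%-32s cost=%.1f\n", c.Address, score(c))
	}
}
```

Because the bind address never appears in the ranking, a failed provider only changes the candidate list, which is consistent with the node keeping its identity and republishing a new candidate epoch.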

## Install Artifact Bootstrap Contract

Every installable artifact is a node image plus a bootstrap seed set.

This applies to Android, Docker, Linux services, and Windows services. The seed
set is baked into the artifact or delivered beside it as signed install
metadata. It is not a single backend URL and not a management server choice. It
is a bounded list of known fabric endpoint candidates that may be reachable from
different network positions:

- public QUIC candidates, for example `usa-los-1` or externally reachable
  `home-1`;
- private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
- closed-site candidates that have no Internet route themselves but can reach a
  neighboring fabric node;
- optional pinned certificate hashes or authority descriptors for high-trust
  entry candidates.

On first start the installed node tries the seed set, joins through any reachable
peer, registers as a candidate node with minimal rights, and then receives
signed peer-directory, role, update, and policy state through the fabric. If a
node is installed in an isolated network, it can still become visible and usable
when at least one nearby seed node can route onward to the rest of the fabric.
User login on Android is only identity/profile selection for the `vpn-client`
service; the underlying phone node already exists and participates in the
fabric with candidate permissions.
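
The first-start behaviour described above (try the seed set, join through any reachable peer) can be sketched in Go. The names `Seed`, `dialFunc`, `joinThroughSeeds`, and `reachableOnly` are hypothetical, the dial step is a stand-in for a real QUIC handshake, and the hostnames and ports are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

// Seed is a hypothetical bootstrap seed entry.
type Seed struct {
	Address    string // QUIC endpoint candidate
	CertSHA256 string // optional pin for high-trust entry candidates
}

// dialFunc stands in for a real QUIC dial; it returns nil when the seed answers.
type dialFunc func(Seed) error

// joinThroughSeeds walks the bounded seed list in order and returns the
// first seed that can be reached.
func joinThroughSeeds(seeds []Seed, dial dialFunc) (Seed, error) {
	for _, s := range seeds {
		if err := dial(s); err == nil {
			return s, nil
		}
	}
	return Seed{}, fmt.Errorf("none of %d bootstrap seeds was reachable", len(seeds))
}

// reachableOnly builds a fake dialer for demonstrations: only addr answers.
func reachableOnly(addr string) dialFunc {
	return func(s Seed) error {
		if s.Address == addr {
			return nil
		}
		return errors.New("unreachable")
	}
}

func main() {
	seeds := []Seed{
		{Address: "quic://usa-los-1.example:18080"},
		{Address: "quic://home-1.example:19199"},
		{Address: "quic://10.20.0.5:18080"}, // closed-site neighbor
	}
	// Simulate an isolated network where only the nearby neighbor answers.
	entry, err := joinThroughSeeds(seeds, reachableOnly("quic://10.20.0.5:18080"))
	fmt.Println(entry.Address, err)
}
```

After a successful join the node would register with minimal rights and wait for signed state through the fabric; that part is outside this sketch.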

## Node Roles

Initial role vocabulary:
@@ -172,7 +339,7 @@ uplink stability, foreground state, and user cost policy.
Nodes must advertise capability facts in heartbeats and peer updates:

- supported fabric protocol versions;
- supported transports: UDP/QUIC, TCP, WebSocket, HTTPS fallback;
- supported transport: UDP/QUIC;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
@@ -184,9 +351,8 @@ Nodes must advertise capability facts in heartbeats and peer updates:

## Fabric Data Session V1

The first practical protocol step is a persistent binary data session. It may
initially run over WebSocket/TCP for faster delivery, but the framing must be
transport-neutral so the same protocol can move to QUIC/UDP.
The first practical protocol step is a persistent binary QUIC data session.
The framing stays service-neutral, but the runtime transport is QUIC only.

Minimum frame set:

@@ -338,69 +504,36 @@ Deliverables:

### Stage FNP-3: WebSocket/TCP Compatibility Transport

Status: started with a transport-neutral `io.Reader`/`io.Writer` frame loop,
WebSocket frame adapter in `agents/rap-node-agent/internal/fabricproto`, and a
gated/authenticated mesh smoke endpoint/client at `/mesh/v1/fabric/session/ws`.
`rap-host-agent fabric-session-smoke` provides the first operator smoke command
and can pass signed fabric-session authority payload/signature headers for
authority-pinned nodes.
Node-agent exposes the endpoint only when `RAP_MESH_FABRIC_SESSION_ENABLED` /
`-mesh-fabric-session-enabled` is set, and reports the enabled endpoint in
heartbeat metadata.
`mesh-live-smoke` includes a fabric-session `PING`/`PONG` check alongside the
existing route and test-service probes. Mesh client code now has a reusable
`FabricSessionClient` for multiple frame exchanges over one WebSocket session,
plus a pump mode with outbound/inbound queues for asynchronous stream traffic.
Live smoke verifies two `PING`/`PONG` round trips on the same connection.
`vpnruntime` has a binary VPN packet-batch mapper for `FrameData` payloads so
packet delivery can move away from JSON production envelopes in a gated mode.
`FabricSessionPacketTransport` now adapts that mapper to the existing
`PacketTransport` interface and can demultiplex inbound DATA frames into the
VPN packet inbox by stream id.
`mesh-live-smoke` now sends a real VPN packet batch through
`FabricSessionPacketTransport` over the WebSocket fabric session and requires a
stream ACK from the remote node.
Mesh has a peer session manager that reuses one pump per peer endpoint, giving
VPN transport selection a stable place to acquire long-lived fabric sessions.
Node config now carries a separate gated
`RAP_VPN_FABRIC_SESSION_TRANSPORT_ENABLED` switch and heartbeat report for the
binary VPN packet transport, keeping endpoint exposure and VPN dataplane
rollout independently controllable.
When the VPN fabric-session switch is enabled, node-agent now attempts to use a
long-lived peer session for gateway packet transport and falls back to the
existing HTTP production envelope path when the peer session is unavailable.
Peer session reuse now evicts closed pumps before reuse, so failed WebSocket
sessions can be reopened on the next transport acquisition.
Heartbeat telemetry includes peer session manager counters for active sessions,
reuses, opens, closed-pump evictions, and explicit close operations.
The mesh package now exposes a service-neutral `FabricTransport` abstraction;
the current WebSocket carrier implements it as `WebSocketFabricTransport`, so
future QUIC/UDP transport can be added without changing VPN/RDP/HTTP services.
`QUICFabricTransport` now implements the same interface and carries the same
binary `fabricproto` frames over a QUIC stream, with local smoke coverage for
`PING`/`PONG` and DATA/ACK.
Carrier selection understands QUIC transport labels and `quic://host:port`
endpoints while preserving WebSocket as the default fallback.
`QUICFabricServer` provides the matching node-side QUIC listener for accepting
fabric streams and running the same session frame handler as other carriers.
Node-agent can now gate the QUIC listener with
`RAP_MESH_QUIC_FABRIC_ENABLED` / `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`, report it
in heartbeat metadata, and pass the setting through host-agent install/update
profiles.
`mesh-live-smoke` verifies the QUIC carrier by starting a temporary QUIC fabric
server and requiring a `PING`/`PONG` round trip over `QUICFabricTransport`.
Nodes now advertise enabled QUIC fabric listeners as `direct_quic` fast-path
endpoint candidates, and endpoint ranking prefers QUIC over WebSocket/HTTPS
compatibility candidates for fabric sessions.
Status: retired as a migration-only stage.

This stage existed to bootstrap binary frame semantics before QUIC routing and
carrier reuse were ready. It introduced the transport-neutral frame loop,
session-shaped packet mapper, and early smoke tooling. That work was useful as
scaffolding, but it is no longer the target runtime.

Current rule:

- WebSocket/TCP fabric-session transport is not part of the supported node
  dataplane.
- QUIC/UDP is the only supported runtime carrier between fabric nodes.
- Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling
  must move to QUIC-native smoke and recovery paths.
- Any routing, heartbeat, registry, peer probe, or service dataplane logic must
  reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a
  valid alternate path.
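
The last rule, rejecting non-QUIC carriers instead of treating them as an alternate path, could look roughly like the Go sketch below. The `Carrier` type and `checkCarrier` function are hypothetical; the accepted transport labels are the QUIC modes named elsewhere in this change (`direct_quic`, `lan_quic`, `ice_quic`, `reverse_quic`, `relay_quic`), and the `websocket` label in the example is only an assumed name for a legacy carrier.

```go
package main

import (
	"fmt"
	"net/url"
)

// quicTransports lists the carrier labels this sketch accepts.
var quicTransports = map[string]bool{
	"direct_quic":  true,
	"lan_quic":     true,
	"ice_quic":     true,
	"reverse_quic": true,
	"relay_quic":   true,
}

// Carrier is a hypothetical pairing of a transport label and an endpoint address.
type Carrier struct {
	Transport string
	Address   string
}

// checkCarrier returns an error for anything that is not a QUIC carrier.
// Callers are expected to drop the carrier, not to downgrade to it.
func checkCarrier(c Carrier) error {
	if !quicTransports[c.Transport] {
		return fmt.Errorf("transport %q is not a QUIC carrier", c.Transport)
	}
	u, err := url.Parse(c.Address)
	if err != nil {
		return fmt.Errorf("address %q: %w", c.Address, err)
	}
	if u.Scheme != "quic" || u.Host == "" {
		return fmt.Errorf("address %q is not a quic://host:port endpoint", c.Address)
	}
	return nil
}

func main() {
	for _, c := range []Carrier{
		{Transport: "direct_quic", Address: "quic://203.0.113.10:18080"},
		{Transport: "websocket", Address: "wss://203.0.113.10/mesh/v1/fabric/session/ws"},
	} {
		fmt.Printf("%s -> %v\n", c.Transport, checkCarrier(c))
	}
}
```

Checking both the label and the address scheme guards against a record that claims a QUIC transport while pointing at a WebSocket or HTTPS URL.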

What survives from this stage is the service-neutral frame model and the
`FabricSessionPacketTransport` mapping, which now ride on QUIC carriers instead
of a WebSocket fallback.
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
so dataplane sessions can select QUIC fast-path candidates and fall back to
legacy peer endpoints when the control plane has not published candidates yet.
so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC
peer endpoints when the control plane has not published valid candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
instead of disabling verification while the cluster CA path is being finished.
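
Pinning a SHA-256 certificate fingerprint, as described for the temporary self-signed listener, can be expressed with the standard `crypto/tls` callback `VerifyPeerCertificate`. The sketch below is an assumption about how such a pin check might be wired, not the project's code: the helper names are hypothetical, and the demo feeds stand-in bytes where a real DER certificate would be.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// fingerprint returns the hex-encoded SHA-256 of a DER-encoded certificate.
func fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// pinMatches compares a certificate against an advertised hex pin,
// ignoring letter case in the pin.
func pinMatches(certDER []byte, pinHex string) bool {
	return strings.EqualFold(fingerprint(certDER), pinHex)
}

// pinnedTLSConfig skips chain and hostname verification (the listener is
// self-signed) but refuses any peer whose leaf does not match the pin.
func pinnedTLSConfig(pinHex string) *tls.Config {
	return &tls.Config{
		MinVersion:         tls.VersionTLS13,
		InsecureSkipVerify: true, // replaced by the explicit pin check below
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 || !pinMatches(rawCerts[0], pinHex) {
				return errors.New("peer certificate does not match advertised pin")
			}
			return nil
		},
	}
}

func main() {
	der := []byte("stand-in bytes, not a real DER certificate")
	cfg := pinnedTLSConfig(fingerprint(der))
	fmt.Println("matching cert:", cfg.VerifyPeerCertificate([][]byte{der}, nil))
	fmt.Println("other cert:", cfg.VerifyPeerCertificate([][]byte{[]byte("other")}, nil))
}
```

With `InsecureSkipVerify` set, Go still invokes `VerifyPeerCertificate` on full handshakes, so the pin is the only trust decision. The callback is not run on resumed sessions, which a production design would need to account for.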
VPN fabric-session dialing now walks all ranked endpoint candidates before
falling back to the legacy peer endpoint, so a failed QUIC candidate does not
block WebSocket/HTTPS compatibility transport.
declaring the target unavailable, so a failed QUIC candidate does not silently
re-enable WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport,
certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
@@ -416,8 +549,8 @@ Endpoint health observations are now emitted as a bounded standalone heartbeat
report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral
(`fabric_session_binary_frames`) and reports QUIC/WebSocket as available
carriers instead of describing the dataplane as WebSocket-only.
(`fabric_session_binary_frames`) and reports QUIC selection plus non-QUIC
candidate rejection instead of describing the dataplane as WebSocket-capable.
Endpoint health observations are pruned in-memory by age and count before
snapshot/report generation, preventing long-running nodes from accumulating
unbounded candidate history.
@@ -583,10 +716,10 @@ propagated by host-agent install profiles.

Deliverables:

- carry binary frames over one persistent WebSocket/TCP connection;
- carry binary frames over one persistent QUIC fabric session;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
  a gated mode;
- keep HTTP forwarding as fallback.
- remove HTTP/WebSocket packet forwarding from the supported dataplane.

### Stage FNP-4: Android As Mobile Fabric Node

@@ -609,12 +742,12 @@ Deliverables:

### Stage FNP-6: QUIC/UDP Transport

Status: started with `QUICFabricTransport` in `internal/mesh`.
Status: active runtime baseline in `internal/mesh`.

Deliverables:

- implement QUIC transport for Fabric Data Session V1;
- preserve WebSocket/TCP as fallback;
- keep QUIC/UDP as the only supported inter-node runtime transport;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.

@@ -0,0 +1,183 @@
# Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.

## 1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links are not equivalent when:

- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability
criterion.

## 2. Area

Introduce the concept of an `area`.

An area is a failure domain with high mutual reachability and shared external
risk. Examples:

- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary

An area can be derived from:

- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate
metadata.

## 3. Stability objective

Each node should maintain a working peer set with diversity, not just count.

### 3.1 Minimum stable peer objective

For an ordinary production node:

- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more
  than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the
  primary recovery path.

For an area gateway or strategically important public node:

- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress
  or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

- the system must report `reduced-diversity mode`, not pretend the target is
  fully satisfied.
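
The ordinary-node objective and the reduced-diversity rule can be turned into a small evaluation function. The Go sketch below is one possible reading, with several assumptions: the `Peer` and `Status` types are hypothetical, the "ready set" is taken to mean peers that are either direct-ready or relay-ready, and the first external recovery-capable peer in the list is treated as the primary recovery path.

```go
package main

import "fmt"

// Peer is a hypothetical summary of one known peer.
type Peer struct {
	NodeID      string
	Area        string
	DirectReady bool // recently verified, usable for immediate QUIC routes
	RelayReady  bool // relay-ready or rendezvous-capable
	Recovery    bool // persistent recovery-capable path
}

// Status reports how a peer set measures against the ordinary-node objective.
type Status struct {
	DirectReady        int
	ExternalAreasReady int
	ExternalRecovery   int
	ExtraRelay         int
	ReducedDiversity   bool
	Satisfied          bool
}

// evaluate checks the ordinary production node objective.
// knownExternalAreas is how many external areas exist in the fleet.
func evaluate(localArea string, peers []Peer, knownExternalAreas int) Status {
	var s Status
	areas := map[string]bool{}
	primaryRecovery := ""
	for _, p := range peers {
		if p.DirectReady {
			s.DirectReady++
		}
		if p.Area == localArea {
			continue
		}
		if p.DirectReady || p.RelayReady {
			areas[p.Area] = true
		}
		if p.Recovery {
			s.ExternalRecovery++
			if primaryRecovery == "" {
				primaryRecovery = p.NodeID
			}
		}
	}
	// Count relay-capable external paths other than the primary recovery peer.
	for _, p := range peers {
		if p.Area != localArea && p.RelayReady && p.NodeID != primaryRecovery {
			s.ExtraRelay++
		}
	}
	s.ExternalAreasReady = len(areas)
	// With a single external area the target cannot be met, so report it.
	s.ReducedDiversity = knownExternalAreas <= 1
	s.Satisfied = !s.ReducedDiversity &&
		s.DirectReady >= 2 &&
		s.ExternalAreasReady >= 2 &&
		s.ExternalRecovery >= 1 &&
		s.ExtraRelay >= 1
	return s
}

func main() {
	peers := []Peer{
		{NodeID: "home-2", Area: "home", DirectReady: true},
		{NodeID: "usa-los-1", Area: "usa", DirectReady: true, Recovery: true},
		{NodeID: "test-1", Area: "test", RelayReady: true},
	}
	fmt.Printf("%+v\n", evaluate("home", peers, 3))
}
```

Keeping `ReducedDiversity` as a separate flag lets diagnostics distinguish "the fleet is too small" from "this node has lost peers", which a single pass/fail value would hide.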

### 3.2 What counts as "ready"

`ready` means:

- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.

`relay_ready` does not replace `direct_ready`.

## 4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working
   peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata

For the current fleet size, every node should indeed be capable of remembering
the full directory of every other node. There is no scale excuse at 6-8 nodes.

## 5. Probe strategy

The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.

### 5.1 Hot set

Always keep a hot set of:

- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.

These should be revalidated frequently.

### 5.2 Warm set

Maintain a warm set of:

- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted when diversity or
direct-ready count drops.

### 5.3 Cold directory

Retain the full known directory and signed registry records, even if not
actively probed at the same rate.
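
The layered probe strategy can be sketched as a tier assignment over the known directory. This Go example is a simplified interpretation: the type and function names are hypothetical, "alternate peer per external area" is read as the first not-yet-hot peer from each external area in directory order, and the warm tier only considers previous success (the diversity-based warm criteria are omitted).

```go
package main

import "fmt"

// Peer is a hypothetical directory entry used for probe scheduling.
type Peer struct {
	NodeID        string
	Area          string
	DirectReady   bool
	Recovery      bool
	EverSucceeded bool
}

// assignTiers maps each node id to "hot", "warm", or "cold".
func assignTiers(localArea string, peers []Peer) map[string]string {
	tiers := make(map[string]string, len(peers))

	// Hot: current direct-ready peers.
	for _, p := range peers {
		if p.DirectReady {
			tiers[p.NodeID] = "hot"
		}
	}
	// Hot: one recovery peer outside the local area.
	for _, p := range peers {
		if p.Area != localArea && p.Recovery {
			tiers[p.NodeID] = "hot"
			break
		}
	}
	// Hot: one alternate peer per external area.
	altChosen := map[string]bool{}
	for _, p := range peers {
		if p.Area == localArea || altChosen[p.Area] || tiers[p.NodeID] == "hot" {
			continue
		}
		tiers[p.NodeID] = "hot"
		altChosen[p.Area] = true
	}
	// Everything else: warm if it has worked before, otherwise cold.
	for _, p := range peers {
		if _, done := tiers[p.NodeID]; done {
			continue
		}
		if p.EverSucceeded {
			tiers[p.NodeID] = "warm"
		} else {
			tiers[p.NodeID] = "cold"
		}
	}
	return tiers
}

func main() {
	peers := []Peer{
		{NodeID: "home-2", Area: "home", DirectReady: true},
		{NodeID: "usa-los-1", Area: "usa", Recovery: true},
		{NodeID: "test-1", Area: "test", EverSucceeded: true},
		{NodeID: "test-2", Area: "test", EverSucceeded: true},
		{NodeID: "test-3", Area: "test"},
	}
	tiers := assignTiers("home", peers)
	for _, p := range peers {
		fmt.Printf("%-10s %s\n", p.NodeID, tiers[p.NodeID])
	}
}
```

A scheduler would then attach a probe cadence to each tier; the actual intervals are a policy decision that the text does not specify, so none are assumed here.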

## 6. Failure handling

When a direct-ready peer is lost:

1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is
   not currently available.
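
The replacement preference above can be sketched as a scoring function over a candidate pool. The Go code below is an illustration with assumed names and weights: a direct-ready candidate always outranks a relay-only one, and among candidates of the same readiness a new area counts for more than a new ingress. The `Ingress` labels in the example are invented placeholders.

```go
package main

import "fmt"

// Candidate is a hypothetical replacement option for a lost peer.
type Candidate struct {
	NodeID      string
	Area        string
	Ingress     string // NAT / public ingress dependency hint
	DirectReady bool
	RelayReady  bool
}

// pickReplacement chooses the best candidate given the peers that remain.
// Weights are assumptions: direct readiness 4, new area 2, new ingress 1.
func pickReplacement(remaining, pool []Candidate) (Candidate, bool) {
	areas := map[string]bool{}
	ingress := map[string]bool{}
	for _, c := range remaining {
		areas[c.Area] = true
		ingress[c.Ingress] = true
	}
	var best Candidate
	bestScore := -1
	for _, c := range pool {
		if !c.DirectReady && !c.RelayReady {
			continue // not usable at all right now
		}
		score := 0
		if c.DirectReady {
			score += 4
		}
		if !areas[c.Area] {
			score += 2
		}
		if !ingress[c.Ingress] {
			score++
		}
		if score > bestScore {
			best, bestScore = c, score
		}
	}
	return best, bestScore >= 0
}

func main() {
	remaining := []Candidate{{NodeID: "home-2", Area: "home", Ingress: "isp1", DirectReady: true}}
	pool := []Candidate{
		{NodeID: "home-3", Area: "home", Ingress: "isp1", DirectReady: true},
		{NodeID: "usa-los-1", Area: "usa", Ingress: "usa-public", DirectReady: true},
		{NodeID: "test-1", Area: "test", Ingress: "test-nat", RelayReady: true},
	}
	choice, ok := pickReplacement(remaining, pool)
	fmt.Println(choice.NodeID, ok)
}
```

Because latency or cost never enters the score, the function cannot pick the "numerically cheapest" peer at the expense of diversity, which is the behaviour the first rule warns against.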
|
||||
|
||||
## 7. Implications for the current fleet
|
||||
|
||||
Current area mapping should be treated approximately as:
|
||||
|
||||
- `home`: `home-1`, `home-2`, `home-3`
|
||||
- `test`: `test-1`, `test-2`, `test-3`
|
||||
- `usa`: `usa-los-1`
|
||||
- `ifcm`: `ifcm-rufms-s-mo1cr`
|
||||
|
||||
Under this model:
|
||||
|
||||
- a node in `home` should avoid satisfying its minimum peer objective using
|
||||
only `home` peers plus one relay;
|
||||
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
|
||||
that span at least two foreign areas when possible;
|
||||
- a fleet-wide alert should trigger when a node loses cross-area diversity even
|
||||
if its total peer count still looks healthy.
|
||||
|
||||
## 8. Required implementation changes
|
||||
|
||||
1. Add `area` to node metadata and endpoint candidate metadata.
|
||||
2. Track peer readiness by area, not only total count.
|
||||
3. Separate:
|
||||
- `direct_ready_count`
|
||||
- `relay_ready_count`
|
||||
- `external_area_ready_count`
|
||||
- `independent_ingress_ready_count`
|
||||
4. Alert on:
|
||||
- zero recovery path outside the local area
|
||||
- direct-ready deficit
|
||||
- area diversity deficit
|
||||
- registry resolution deficit
|
||||
5. Preserve a full node directory for the current small fleet.
|
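A minimal sketch of how the separated counters and alerts listed above could be derived from a node's peer links. The `PeerLink` shape, the alert names, and the default thresholds are assumptions made for illustration. In particular, `external_area_ready_count` is interpreted here as the number of distinct foreign areas with at least one ready link, which the document does not define.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerLink:
    area: str
    ingress: str       # hypothetical independent-ingress label
    direct_ready: bool
    relay_ready: bool


def readiness_counts(local_area, links):
    """Split one total peer count into the four separate counters."""
    ready = [link for link in links if link.direct_ready or link.relay_ready]
    return {
        "direct_ready_count": sum(1 for link in links if link.direct_ready),
        "relay_ready_count": sum(1 for link in links if link.relay_ready),
        "external_area_ready_count": len(
            {link.area for link in ready if link.area != local_area}
        ),
        "independent_ingress_ready_count": len({link.ingress for link in ready}),
    }


def diversity_alerts(counts, resolved_service_count,
                     direct_floor=3, min_external_areas=2):
    """Return alert names; thresholds are illustrative defaults."""
    found = []
    if counts["external_area_ready_count"] == 0:
        found.append("no_recovery_path_outside_local_area")
    if counts["direct_ready_count"] < direct_floor:
        found.append("direct_ready_deficit")
    if counts["external_area_ready_count"] < min_external_areas:
        found.append("area_diversity_deficit")
    if resolved_service_count == 0:
        found.append("registry_resolution_deficit")
    return found
```

Keeping the counters separate is what lets a node with many relay-ready peers still raise a direct-ready deficit.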
||||
@@ -289,7 +289,10 @@ Production fabric-core migration boundary:
|
||||
LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
|
||||
`relay_quic` fallback. Candidate metadata carries `local_segment_id`,
|
||||
`nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
|
||||
`relay_endpoint` when configured.
|
||||
`relay_endpoint` when configured. When a relay endpoint is the first physical
|
||||
QUIC hop, its advertised certificate fingerprint must survive route planning
|
||||
so public-IP relay paths can verify the relay node by pin instead of falling
|
||||
back to hostname/IP SAN matching.
|
||||
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
|
||||
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
|
||||
health observations, latency, reliability, region, policy tags, and live
|
||||
|
||||
@@ -0,0 +1,179 @@
|
||||
# Fabric Live Audit 2026-05-18
|
||||
|
||||
Status: live operational audit of the current fabric. This document records the
|
||||
real state observed on 2026-05-18 and explicitly calls out where runtime
|
||||
behavior still differs from the target architecture.
|
||||
|
||||
## Current confirmed state
|
||||
|
||||
- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
|
||||
- The active node set
|
||||
- `home-1`
|
||||
- `home-2`
|
||||
- `home-3`
|
||||
- `test-1`
|
||||
- `test-2`
|
||||
- `test-3`
|
||||
- `usa-los-1`
|
||||
- `ifcm-rufms-s-mo1cr`
|
||||
is converged on `0.2.321-directreadytarget`.
|
||||
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
|
||||
no longer stale.
|
||||
|
||||
## Why TCP traffic is still visible
|
||||
|
||||
Visible TCP traffic is not coming from the inter-node fabric transport. It is
|
||||
coming from the temporary compatibility recovery overlap that is still active.
|
||||
|
||||
Observed live listeners:
|
||||
|
||||
- `docker-test`
|
||||
- `19191/tcp` - compatibility `Control API/downloads` bridge
|
||||
- `18080/tcp` - web-admin
|
||||
- `18090/tcp` - release files
|
||||
- `18121/tcp` - backend Control API
|
||||
- `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
|
||||
- `usa-los-1`
|
||||
- `19131/udp` - QUIC fabric listener
|
||||
- `19191/tcp` - external compatibility bridge currently held open so legacy
|
||||
recovery contracts can still reach `Control API/downloads`
|
||||
|
||||
Therefore:
|
||||
|
||||
- `TCP` is still present by design for recovery overlap.
|
||||
- `UDP/QUIC` is the current node-to-node transport.
|
||||
- The statement "the fabric is fully UDP-only" is not yet true at the full
|
||||
system level while `19191/tcp` compatibility recovery remains enabled.
|
||||
|
||||
## Why nodes were still falling away
|
||||
|
||||
### 1. Nodes do not yet operate from a fully active signed registry gossip plane
|
||||
|
||||
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
|
||||
|
||||
- `fabric_registry_runtime_report.status = candidate_only`
|
||||
- `resolved_service_count = 0`
|
||||
- `resolved_services.control-api = no_active_record`
|
||||
- `resolved_services.update-store = no_active_record`
|
||||
- `resolved_services.update-cache = no_active_record`
|
||||
|
||||
This means the current runtime still depends on compatibility control URLs more
|
||||
than the target architecture allows. The node is alive in the fabric, but not
|
||||
yet operating from a fully resolved active registry view.
|
||||
|
||||
### 2. Legacy control/download contracts are still real dependencies
|
||||
|
||||
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:
|
||||
|
||||
- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
|
||||
|
||||
This confirms the root recovery lesson:
|
||||
|
||||
- a NAT node without manual host access was still anchored to the old recovery
|
||||
contract;
|
||||
- until that contract was temporarily restored, the node could not advance;
|
||||
- the node did not disappear because QUIC failed; it disappeared because the
|
||||
recovery/control overlap was removed before the node had converged.
|
||||
|
||||
### 3. Direct peer resilience is still below the intended threshold
|
||||
|
||||
Observed from live heartbeat metadata:
|
||||
|
||||
- `ifcm-rufms-s-mo1cr`
|
||||
- `peer_connection_ready = 2`
|
||||
- `peer_connection_relay_ready = 3`
|
||||
- `target_ready_peers = 3`
|
||||
- `usa-los-1`
|
||||
- `peer_connection_ready = 1`
|
||||
- `peer_connection_relay_ready = 5`
|
||||
- `target_ready_peers = 3`
|
||||
|
||||
This means the direct-path resilience target is not satisfied yet, even though
|
||||
the nodes are healthy.
|
||||
|
||||
The practical reason is simple:
|
||||
|
||||
- the cluster has only a small number of externally reachable direct QUIC
|
||||
endpoints;
|
||||
- some nodes still advertise only private/LAN-reachable direct candidates;
|
||||
- relay-ready adjacency is masking direct peer deficit, but it does not replace
|
||||
the requirement for at least three direct-ready peers.
|
||||
|
||||
### 4. Observability is still heterogeneous
|
||||
|
||||
Live heartbeat coverage is inconsistent:
|
||||
|
||||
- `test-*`, `ifcm-rufms-s-mo1cr`, and `usa-los-1` emit rich `c17z20` heartbeat metadata with
|
||||
endpoint, peer recovery, and registry sections.
|
||||
- `home-*` currently do not expose the same full sections in their latest
|
||||
heartbeat rows.
|
||||
|
||||
This means operator visibility is uneven and the documentation must not imply
|
||||
uniform live introspection across every node today.
|
||||
|
||||
## What is true right now
|
||||
|
||||
1. The fleet is converged on one live node-agent version.
|
||||
2. QUIC/UDP is the actual node-to-node transport.
|
||||
3. Compatibility `19191/tcp` is still required for recovery overlap.
|
||||
4. Signed registry gossip is not yet the sole active discovery/control source.
|
||||
5. The "at least 3 direct-ready peers per node" resilience target is not yet
|
||||
met for all externally significant nodes.
|
||||
|
||||
## Operational rule until the next audit
|
||||
|
||||
Do not remove the compatibility `19191/tcp` recovery overlap while any of the
|
||||
following remain true:
|
||||
|
||||
- any live node still reports a `control_plane_url` on the `19191` contract;
|
||||
- any live node has `fabric_registry_runtime_report.status != active`;
|
||||
- any externally significant node has fewer than 3 direct-ready peers;
|
||||
- any node can only recover through legacy `Control API/downloads` overlap.
|
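The four hold conditions above could be checked mechanically before any removal decision. The sketch below assumes a flattened per-node dictionary. The keys `registry_status`, `direct_ready`, `externally_significant`, and `legacy_only_recovery` are hypothetical summaries of heartbeat data, not the real heartbeat schema.

```python
from urllib.parse import urlsplit

COMPAT_PORT = 19191  # compatibility recovery bridge port from this audit


def overlap_removal_blockers(nodes, direct_floor=3):
    """Return (node_id, reasons) for every node that still blocks removal."""
    blockers = []
    for node in nodes:
        reasons = []
        if urlsplit(node.get("control_plane_url", "")).port == COMPAT_PORT:
            reasons.append("control_plane_url_on_19191")
        if node.get("registry_status") != "active":
            reasons.append("registry_not_active")
        if (node.get("externally_significant")
                and node.get("direct_ready", 0) < direct_floor):
            reasons.append("direct_ready_below_floor")
        if node.get("legacy_only_recovery"):
            reasons.append("legacy_only_recovery")
        if reasons:
            blockers.append((node["node_id"], reasons))
    return blockers
```

An empty result is the only state in which removing the overlap would be consistent with this rule.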
||||
|
||||
## Required next work
|
||||
|
||||
### A. Finish signed registry activation
|
||||
|
||||
Each node must be able to resolve active records for at least:
|
||||
|
||||
- `control-api`
|
||||
- `update-store`
|
||||
- `update-cache`
|
||||
|
||||
without falling back to the `19191` compatibility contract.
|
||||
|
||||
### B. Promote full direct endpoint dissemination
|
||||
|
||||
All nodes with public reachability must advertise every valid public direct QUIC
|
||||
endpoint, and nodes must retain enough live peer memory to reconnect without
|
||||
operator intervention.
|
||||
|
||||
### C. Enforce the direct-ready floor as a live alert
|
||||
|
||||
If a node has fewer than 3 direct-ready peers, this must remain a real
|
||||
operational alert even when relay-ready peers exist.
|
||||
|
||||
### D. Normalize heartbeat observability
|
||||
|
||||
Every production node must emit the same minimum audit surface:
|
||||
|
||||
- endpoint candidates
|
||||
- peer recovery counts
|
||||
- registry runtime state
|
||||
- update runtime state
|
||||
|
||||
without mixing rich and reduced heartbeat schemas across the fleet.
|
||||
|
||||
### E. Replace the naive peer-count rule
|
||||
|
||||
The live fleet shows that a plain "3 links per node" rule is not a sufficient
|
||||
resilience model.
|
||||
|
||||
The current corrective design is documented in
|
||||
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
|
||||
and introduces:
|
||||
|
||||
- `area` as a failure-domain label;
|
||||
- direct-ready vs relay-ready separation;
|
||||
- cross-area diversity requirements;
|
||||
- full-directory retention for small fleets.
|
||||
@@ -0,0 +1,427 @@
|
||||
# Fabric Node Survival And Recovery Policy
|
||||
|
||||
Status: active architecture policy.
|
||||
|
||||
This document defines the non-negotiable survival, compatibility, and recovery
|
||||
rules for Secure Access Fabric nodes. It exists because losing a node is not an
|
||||
acceptable operating model once the fabric grows beyond a small manually
|
||||
maintained fleet.
|
||||
|
||||
Reference incident:
|
||||
|
||||
- `ifcm-rufms-s-mo1cr` is the canonical recovery case.
|
||||
- The node is behind NAT.
|
||||
- There is no direct administrative access to the Windows host.
|
||||
- The node must remain recoverable through the fabric/update/recovery plane
|
||||
without relying on manual host login.
|
||||
|
||||
The latest live recovery evidence for this case is documented in
|
||||
[FABRIC_LIVE_AUDIT_2026-05-18.md](FABRIC_LIVE_AUDIT_2026-05-18.md).
|
||||
|
||||
This policy applies to Linux, Windows, Android, containerized nodes, and future
|
||||
node types.
|
||||
|
||||
## 1. Core Decision
|
||||
|
||||
The fabric must be able to lose:
|
||||
|
||||
- old API endpoints;
|
||||
- old artifact URLs;
|
||||
- previous public IP addresses;
|
||||
- previous NAT mappings;
|
||||
- previous relay nodes;
|
||||
- previous route-authority replicas;
|
||||
- previous update-cache replicas;
|
||||
- old service locations;
|
||||
- operator access to the host OS;
|
||||
- the current physical location of a workload;
|
||||
- part of the cluster.
|
||||
|
||||
And still keep the node recoverable.
|
||||
|
||||
Manual repair is allowed as an emergency tool. It must not be the default
|
||||
survival strategy.
|
||||
|
||||
## 2. Non-Negotiable Invariants
|
||||
|
||||
### 2.1 Node Identity Must Survive
|
||||
|
||||
A recoverable node must preserve:
|
||||
|
||||
- `node_id`;
|
||||
- node keypair or key reference;
|
||||
- pinned cluster authority / quorum descriptor;
|
||||
- last accepted signed registry records;
|
||||
- last accepted bootstrap seed set;
|
||||
- last known good update policy;
|
||||
- last known good workload desired state;
|
||||
- rollback metadata;
|
||||
- recovery audit trail.
|
||||
|
||||
Reinstall or repair must prefer preserving local state. Identity reset is a
|
||||
high-risk operator action, not the default repair path.
|
||||
|
||||
### 2.2 Compatibility Must Stay Until Recovery Is Complete
|
||||
|
||||
Any change to the fabric must keep older nodes recoverable until one of these
|
||||
is true:
|
||||
|
||||
1. every node has confirmed the new contract; or
|
||||
2. the missing nodes have been manually retired, revoked, or explicitly accepted as
|
||||
lost.
|
||||
|
||||
This applies to:
|
||||
|
||||
- update plan formats;
|
||||
- signed registry schemas;
|
||||
- artifact install types;
|
||||
- authority signature envelopes;
|
||||
- bootstrap config formats;
|
||||
- recovery seed formats;
|
||||
- host-agent / updater runtime contracts;
|
||||
- control endpoints needed only for migration.
|
||||
|
||||
The rule is strict: do not delete the old recovery format while nodes that may
|
||||
still need it remain unrecovered.
|
||||
|
||||
### 2.3 QUIC-Only Transport Does Not Mean Single Bootstrap Location
|
||||
|
||||
Node-to-node runtime transport remains QUIC over UDP only.
|
||||
|
||||
That does not permit:
|
||||
|
||||
- one bootstrap address;
|
||||
- one update mirror;
|
||||
- one registry carrier;
|
||||
- one ingress node;
|
||||
- one relay;
|
||||
- one control replica.
|
||||
|
||||
QUIC is the transport. Survivability requires many signed ways to discover the
|
||||
current valid QUIC endpoints.
|
||||
|
||||
### 2.4 No Single Service May Own Recovery
|
||||
|
||||
Recovery must not depend on one:
|
||||
|
||||
- backend URL;
|
||||
- DNS name;
|
||||
- HTTP ingress;
|
||||
- update repository host;
|
||||
- relay node;
|
||||
- cluster admin node.
|
||||
|
||||
Any of those may disappear while the node is still healthy enough to recover.
|
||||
|
||||
## 3. Required Recovery Layers
|
||||
|
||||
### 3.1 Embedded Bootstrap Seed Set
|
||||
|
||||
Each installable node package must contain a bounded bootstrap seed set:
|
||||
|
||||
- multiple seed nodes;
|
||||
- public and private candidates where appropriate;
|
||||
- QUIC endpoint candidates only;
|
||||
- signed bootstrap metadata;
|
||||
- expiry / epoch rules;
|
||||
- optional organization / cluster scope constraints.
|
||||
|
||||
The bootstrap seed set is only the first door, not cluster truth.
|
||||
|
||||
### 3.2 Signed Registry Gossip
|
||||
|
||||
After bootstrap, a node must learn current service locations through signed
|
||||
fabric registry records that can be carried by any reachable peer.
|
||||
|
||||
Required properties:
|
||||
|
||||
- multiple records per service;
|
||||
- quorum or otherwise policy-approved signatures;
|
||||
- monotonic epoch/generation;
|
||||
- expiry and freshness checks;
|
||||
- live probe before promotion;
|
||||
- ability to accept newer records from a reachable neighbor even when old
|
||||
origins are gone.
|
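The required properties can be read as an ordered acceptance check. The sketch below is illustrative only: `RegistryRecord`, the pre-counted `valid_signatures` field, and the returned state names are assumptions, and real signature verification is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RegistryRecord:
    service: str
    epoch: int             # monotonic epoch/generation
    expires_at: float      # UNIX timestamp
    valid_signatures: int  # count already verified by a hypothetical verifier


def evaluate_record(new: RegistryRecord, current: Optional[RegistryRecord],
                    now: float, quorum: int, probe_ok: bool) -> str:
    """Return 'rejected', 'candidate', or 'active' for a gossiped record."""
    if new.valid_signatures < quorum:
        return "rejected"   # not quorum/policy approved
    if new.expires_at <= now:
        return "rejected"   # expired or stale
    if current is not None and new.epoch <= current.epoch:
        return "rejected"   # epoch must move forward
    # The carrier is never checked: any reachable peer may relay the record.
    return "active" if probe_ok else "candidate"
```

A record that passes the signature, freshness, and epoch checks but has not yet been live-probed stays a candidate rather than being promoted.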
||||
|
||||
### 3.3 Outbound-Only Recovery Attachment
|
||||
|
||||
A node behind NAT or in passive mode must be recoverable through an outbound
|
||||
attachment.
|
||||
|
||||
Required behaviors:
|
||||
|
||||
- the node can maintain at least one long-lived outbound QUIC control channel;
|
||||
- that channel survives IP changes by reconnecting through any remaining seed or
|
||||
signed registry endpoint;
|
||||
- the node may receive updated registry truth, update triggers, workload
|
||||
changes, and recovery instructions over that channel;
|
||||
- the fabric must not require inbound TCP/UDP reachability to repair the node.
|
||||
|
||||
### 3.4 Local Recovery Agent Boundary
|
||||
|
||||
The node must have a minimal recovery-capable local agent boundary that is
|
||||
separate from ordinary service workloads.
|
||||
|
||||
It must be able to:
|
||||
|
||||
- validate signed update plans;
|
||||
- download artifacts from multiple mirrors;
|
||||
- stage replacement binaries;
|
||||
- restart node-agent or host-agent tasks;
|
||||
- rollback to previous binaries;
|
||||
- swap to new signed registry/bootstrap records;
|
||||
- emit recovery status when transport returns.
|
||||
|
||||
If node workloads fail, this local recovery boundary must still exist.
|
||||
|
||||
### 3.5 Multi-Source Artifact Delivery
|
||||
|
||||
Artifacts must be retrievable from more than one source:
|
||||
|
||||
- local cached file;
|
||||
- cluster update-cache;
|
||||
- organization-local cache if policy allows;
|
||||
- public or internet-reachable mirror;
|
||||
- neighbor-assisted relay transfer over the fabric.
|
||||
|
||||
A node must not become unrecoverable because one artifact hostname or one
|
||||
download service disappeared.
|
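A minimal sketch of multi-source retrieval, assuming each source is a named zero-argument callable that returns bytes and that the signed plan carries an expected SHA-256 digest. Both are assumptions made for illustration, not the updater's real interface.

```python
import hashlib


def fetch_artifact(sources, expected_sha256):
    """Try each (name, fetch) source in order and return the first good copy.

    A source that raises OSError or returns bytes with the wrong digest is
    skipped, so one dead hostname cannot block recovery.
    """
    errors = []
    for name, fetch in sources:
        try:
            data = fetch()
        except OSError as exc:  # covers connection and file errors
            errors.append((name, str(exc)))
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return name, data
        errors.append((name, "hash mismatch"))
    raise LookupError(f"no source delivered a valid artifact: {errors}")
```

The source order would typically follow the list above, from the local cached file out to neighbor-assisted transfer.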
||||
|
||||
### 3.6 Trigger And Subscription Plane
|
||||
|
||||
Polling alone is not enough for very large fleets.
|
||||
|
||||
Required model:
|
||||
|
||||
- nodes may still perform slow fallback polling;
|
||||
- primary update notification uses subscription/signal delivery;
|
||||
- update-cache or registry service can repeatedly signal pending updates until
|
||||
acknowledged;
|
||||
- signals are idempotent;
|
||||
- signals do not require the old control endpoint to remain alive.
|
||||
|
||||
## 4. Update Safety Rules
|
||||
|
||||
### 4.1 Upgrade Contracts
|
||||
|
||||
Every release that changes recovery-critical contracts must explicitly declare:
|
||||
|
||||
- minimum supported old version;
|
||||
- maximum tolerated skew;
|
||||
- whether migration is rolling-safe;
|
||||
- whether the node must first update host-agent or node-agent;
|
||||
- rollback compatibility;
|
||||
- whether old bootstrap/registry envelopes remain accepted.
|
||||
|
||||
### 4.2 Two-Key Rule For Breaking Changes
|
||||
|
||||
Do not simultaneously break:
|
||||
|
||||
- discovery of where to get the update; and
|
||||
- ability to understand the update once found.
|
||||
|
||||
At least one of those must remain compatible until fleet convergence or
|
||||
explicit retirement.
|
||||
|
||||
### 4.3 Old Artifact Retention
|
||||
|
||||
Recovery-critical artifact versions must remain available until:
|
||||
|
||||
- all nodes have moved past them; or
|
||||
- the remaining nodes are revoked/retired and recorded as intentionally lost.
|
||||
|
||||
Do not garbage-collect the last working host-agent or node-agent build for an
|
||||
unrecovered population.
|
||||
|
||||
### 4.4 Install Type Continuity
|
||||
|
||||
If historical nodes request different install types for the same product
|
||||
(`windows_binary`, `windows_service`, `native`, `linux_binary`, etc.), recovery
|
||||
planning must keep compatibility aliases until the fleet converges.
|
||||
|
||||
The fabric must not strand nodes on an install-type naming mismatch.
|
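One way to keep compatibility aliases is a small lookup applied before artifact matching. The alias table below is entirely hypothetical. The document names the install types but does not say which ones are equivalent, so the mapping is only an example of the shape.

```python
# Hypothetical alias table: (os, requested install type) -> canonical type.
INSTALL_TYPE_ALIASES = {
    ("windows", "windows_service"): "windows_binary",
    ("windows", "native"): "windows_binary",
    ("linux", "native"): "linux_binary",
}


def resolve_install_type(os_name, requested):
    """Map a historical install type to the canonical one, if aliased."""
    return INSTALL_TYPE_ALIASES.get((os_name, requested), requested)


def match_artifact(os_name, requested, artifacts):
    """Return the first artifact matching the OS and resolved install type."""
    canonical = resolve_install_type(os_name, requested)
    for artifact in artifacts:
        if artifact["os"] == os_name and artifact["install_type"] == canonical:
            return artifact
    return None
```

Removing an entry from such a table is a compatibility removal and would be subject to the same convergence rule as any other recovery contract.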
||||
|
||||
### 4.5 Legacy Recovery Contract Drift Must Be Treated As A Blocking Risk
|
||||
|
||||
A stale node may present both of the following conditions:
|
||||
|
||||
- a compatible recovery artifact exists under the current registry; but
|
||||
- the last local updater/host-agent status still says `no_matching_artifact` or
|
||||
an equivalent legacy contract failure.
|
||||
|
||||
This means the node is not only waiting for a heartbeat. It is running an older
|
||||
recovery planner contract and may still depend on:
|
||||
|
||||
- historical install-type aliases;
|
||||
- older artifact matching semantics;
|
||||
- older update-plan interpretation rules;
|
||||
- overlap in signed registry / bootstrap envelopes.
|
||||
|
||||
This condition must be classified as `legacy recovery contract drift` and must
|
||||
block compatibility removal the same way an artifact gap does.
|
||||
|
||||
Operationally this also means:
|
||||
|
||||
- the node requires a `recovery bridge`;
|
||||
- the cluster enters `bridge hold active` for compatibility-removal decisions;
|
||||
- `bridge hold` remains active until the node reports a recovery-compatible
|
||||
status on the current contract or the operator explicitly retires the node;
|
||||
- when a compatible artifact and target mapping already exist, the node should
|
||||
be classified as `bridge replay ready`, meaning the system can replay the
|
||||
legacy-compatible update plan as soon as the node regains an outbound control
|
||||
cycle;
|
||||
- operator tooling should expose a canonical `bridge replay plan` per node so
|
||||
recovery replay uses the same signed update-plan logic as normal updates;
|
||||
- compatibility aliases / overlap must remain enabled for that node population;
|
||||
- dashboards and rollout guards must show this separately from ordinary
|
||||
`waiting recovery heartbeat`.
|
||||
|
||||
Canonical example:
|
||||
|
||||
- `ifcm-rufms-s-mo1cr` is stale;
|
||||
- the current backend can match a Windows-compatible host-agent artifact;
|
||||
- the last host-agent report still says `no_matching_artifact`;
|
||||
- therefore the node must be treated as a legacy recovery-contract blocker, not
|
||||
merely as a delayed heartbeat.
|
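The canonical example can be expressed as a small classifier. This is a sketch under assumed inputs: the three parameters and the returned labels other than `legacy_recovery_contract_drift` and `waiting_recovery_heartbeat` are illustrative, and only the `no_matching_artifact` status is checked.

```python
def classify_stale_node(heartbeat_stale, backend_has_compatible_artifact,
                        last_agent_status):
    """Separate legacy recovery contract drift from a merely late heartbeat."""
    if not heartbeat_stale:
        return "healthy"
    if not backend_has_compatible_artifact:
        return "artifact_gap"
    if last_agent_status == "no_matching_artifact":
        # The backend can match an artifact but the node's older planner
        # cannot: this blocks compatibility removal.
        return "legacy_recovery_contract_drift"
    return "waiting_recovery_heartbeat"


def blocks_compatibility_removal(classification):
    """Drift blocks removal the same way an artifact gap does."""
    return classification in {"legacy_recovery_contract_drift", "artifact_gap"}
```

Keeping the drift label distinct from `waiting_recovery_heartbeat` is what allows dashboards and rollout guards to show the two cases separately.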
||||
|
||||
## 5. Service And Location Mobility Rules
|
||||
|
||||
Moving a service must not strand nodes that only know the old location.
|
||||
|
||||
Required pattern:
|
||||
|
||||
1. publish new signed registry records;
|
||||
2. keep old records valid during overlap;
|
||||
3. allow any reachable peer to relay the new records;
|
||||
4. live-probe and promote the new endpoints;
|
||||
5. only then retire the old location;
|
||||
6. keep enough overlap for slow or partitioned nodes to catch up.
|
||||
|
||||
This applies to:
|
||||
|
||||
- control-api replicas;
|
||||
- update-cache/update-store replicas;
|
||||
- web/admin ingress replicas;
|
||||
- relay/rendezvous nodes;
|
||||
- service-channel endpoints.
|
||||
|
||||
## 6. Failure Classes The Fabric Must Tolerate
|
||||
|
||||
The design must explicitly handle all of these:
|
||||
|
||||
- node behind NAT with only outbound connectivity;
|
||||
- several nodes behind one NAT/local segment;
|
||||
- node changes public IP;
|
||||
- node changes private IP;
|
||||
- old DNS/URL becomes dead;
|
||||
- artifact mirror disappears;
|
||||
- control ingress disappears;
|
||||
- relay disappears;
|
||||
- update install fails halfway;
|
||||
- binary staged but restart fails;
|
||||
- old task/service name changes;
|
||||
- local disk is nearly full;
|
||||
- time skew causes signature freshness risk;
|
||||
- authority rotates;
|
||||
- route authority replica disappears;
|
||||
- state directory survives but binary is broken;
|
||||
- binary survives but state directory is partly stale;
|
||||
- node reboots during update;
|
||||
- only one peer still knows the new registry truth;
|
||||
- node is partitioned for a long time and rejoins later;
|
||||
- platform removes legacy support too early;
|
||||
- operator has no shell/RDP/WinRM/SSH access to the host.
|
||||
|
||||
## 7. Required Local State And Journaling
|
||||
|
||||
The node local state store must retain at least:
|
||||
|
||||
- active and previous signed registry records;
|
||||
- active and previous bootstrap seeds;
|
||||
- last successful update plan per product;
|
||||
- last applied artifact hash/version;
|
||||
- last rollback candidate;
|
||||
- last successful service endpoints used for update/control;
|
||||
- pending trigger generation;
|
||||
- recovery attempts with timestamps and reasons;
|
||||
- last known good runtime command line / task/unit identity;
|
||||
- last known workload desired states.
|
||||
|
||||
Writes must be atomic. A power loss must not leave the node with zero valid
|
||||
state.
|
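A common way to meet the atomic-write requirement is to write a temporary file in the same directory, flush it to disk, and then rename it over the target. The sketch below assumes JSON state and a single state file, which the document does not specify.

```python
import json
import os
import tempfile


def write_state_atomically(path, state):
    """Replace the state file so readers see the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target so the
    # final rename is a single atomic step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If power is lost before `os.replace`, the previous state file is still intact. Keeping a previous generation alongside the active one, as the list above requires, gives a second fallback.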
||||
|
||||
## 8. Observability And Fleet Safety Rules
|
||||
|
||||
The control plane must make invisible-recovery risk explicit.
|
||||
|
||||
It must surface:
|
||||
|
||||
- nodes with stale heartbeat but recent updater activity;
|
||||
- nodes with no working compatible recovery artifact;
|
||||
- nodes whose pinned registry/bootstrap epoch is too old;
|
||||
- nodes whose only known artifact URL is dead;
|
||||
- nodes whose desired state requires a contract they cannot parse;
|
||||
- nodes whose local agent version is below the minimum recovery floor;
|
||||
- nodes whose last successful contact depended on a single service replica.
|
||||
|
||||
Cluster-wide changes that would strand such nodes must be blocked or require an
|
||||
explicit recovery-admin override.
|
||||
|
||||
## 9. Release And Migration Checklist
|
||||
|
||||
Before deleting old code, old formats, or old endpoints, verify all of these:
|
||||
|
||||
1. every active node has confirmed a compatible version; or the remaining nodes
|
||||
are explicitly marked for manual retirement/recovery;
|
||||
2. host-agent and node-agent recovery paths both have matching artifacts;
|
||||
3. bootstrap/registry overlap exists for the migration window;
|
||||
4. at least two independent artifact sources remain reachable;
|
||||
5. signed registry gossip can carry the new locations without the old API
|
||||
hostname;
|
||||
6. rollback artifacts are still available;
|
||||
7. install type aliases remain for historical agents where needed;
|
||||
8. NAT/passive/outbound-only nodes were explicitly tested;
|
||||
9. stale-node risk report is empty or consciously accepted by recovery-admin;
|
||||
10. removal of legacy support is documented with the exact cutoff conditions.
|
||||
|
||||
## 10. `ifcm-rufms-s-mo1cr` Rule
|
||||
|
||||
`ifcm-rufms-s-mo1cr` is the standing reference case for future work.
|
||||
|
||||
For this node class, the platform must assume:
|
||||
|
||||
- the host is behind NAT;
|
||||
- the node may only keep outbound channels;
|
||||
- no direct Windows administrative access exists;
|
||||
- old discovery endpoints may disappear;
|
||||
- only the fabric/update/recovery plane can save the node.
|
||||
|
||||
Any future transport, update, authority, bootstrap, registry, or workload
|
||||
change must be reviewed against this question:
|
||||
|
||||
> If `ifcm-rufms-s-mo1cr` is still on the older contract and we cannot log in to
|
||||
> the host, can the fabric still recover it?
|
||||
|
||||
If the answer is no, the change is incomplete.
|
||||
|
||||
## 11. Immediate Follow-Through
|
||||
|
||||
The system should keep implementing these concrete items:
|
||||
|
||||
- separate documented recovery-plane tests for Windows NAT nodes;
|
||||
- signed registry retention and overlap checks before endpoint migration;
|
||||
- compatibility alias coverage for historical install types;
|
||||
- artifact availability health over all mirrors;
|
||||
- stale-node risk dashboard/report before legacy removal;
|
||||
- node-local journaling for last good registry/update state;
|
||||
- neighbor-assisted artifact relay path;
|
||||
- explicit recovery simulation for outbound-only nodes with dead old endpoints.
|
||||
|
||||
## 12. Decision
|
||||
|
||||
The fabric must treat node survival as a first-class architecture contract.
|
||||
|
||||
A node is not considered safe merely because the happy path works. It is safe
|
||||
only when it can survive protocol migration, endpoint relocation, partial
|
||||
cluster loss, artifact source loss, and lack of manual host access without
|
||||
being abandoned.
|
||||
@@ -256,9 +256,11 @@ The first backend contract slice is implemented:
|
||||
observations, and degraded backend relay usage. These incidents keep backend
|
||||
relay visible as degraded compatibility behavior rather than hidden steady
|
||||
state.
|
||||
- Node-agent access telemetry distinguishes backend relay actually used from
|
||||
backend relay blocked by signed data-plane policy. Blocked fallback reports
|
||||
include `backend_fallback_blocked` and the last violation status/reason, and
|
||||
- Node-agent access telemetry distinguishes degraded compatibility requested
|
||||
from degraded compatibility blocked by signed data-plane policy. Blocked
|
||||
compatibility reports include `degraded_compatibility_blocked` and the last
|
||||
violation status/reason, while preserving the original raw violation code in
|
||||
a separate field for historical correlation, and
|
||||
backend projects them to access telemetry plus `data_plane_contract`
|
||||
incidents.
|
||||
- Backend correlates access-report send failures with active service-channel
|
||||
@@ -421,8 +423,8 @@ The first backend contract slice is implemented:
|
||||
keeps failing outside manual retry cooldown creates a bounded rebuild
|
||||
request. If an unfenced alternate is available, Control Plane marks the
|
||||
rebuild `applied` and selects that route generation; if no alternate exists,
|
||||
it records `pending_degraded_fallback` and keeps backend relay as the
|
||||
explicit degraded path until a new route appears. The compatibility release
|
||||
it records `pending_degraded_route_state` and keeps the channel in explicit
|
||||
degraded route state until a new route appears. The compatibility release
|
||||
`0.2.175` keeps node/host-agent signed-config models aligned with these new
|
||||
fields.
|
||||
- C18U moves rebuild metadata into node-agent runtime behavior. Node-agent
|
||||
@@ -437,10 +439,10 @@ The first backend contract slice is implemented:
|
||||
- C18V adds route-manager transition telemetry and churn coverage. Node-agent
|
||||
`0.2.177` reports `route_manager_transition` alongside the current manager
|
||||
snapshot, including previous/current generation, status, decision count,
|
||||
withdrawn route count, restored route count, pending-degraded fallback count,
|
||||
withdrawn route count, restored route count, pending degraded route-state count,
|
||||
rebuild applied count, and any cached selected route cleared because Control
|
||||
Plane withdrew it. Coverage verifies three service-neutral lifecycle cases:
|
||||
applied rebuild replacement, pending degraded fallback when no alternate is
|
||||
applied rebuild replacement, pending degraded route state when no alternate is
|
||||
available, and rollback/restoration when a fresh config removes the rebuild
|
||||
decision.
|
||||
- C18W adds a live docker-test verification loop for that telemetry. The smoke
|
||||
@@ -973,8 +975,8 @@ The first backend contract slice is implemented:
|
||||
in C18Z45; rebuild snapshot maintenance health with overdue/runtime-evidence
|
||||
visibility landed in C18Z46; node-agent signed service-channel lease
|
||||
enforcement when cluster authority is pinned landed in C18Z47; backend
|
||||
introspection fallback for unsigned compatibility clients landed in C18Z48;
|
||||
accepted-by telemetry for signed/introspection/legacy ingress landed in
|
||||
introspection fallback for token-authorized compatibility clients landed in C18Z48;
|
||||
accepted-by telemetry for signed/introspection/token-authorized ingress landed in
|
||||
C18Z49; durable lease introspection across backend restarts landed in C18Z50;
|
||||
bounded durable lease cleanup and admin visibility landed in C18Z51; durable
|
||||
accepted-by access telemetry aggregation with heartbeat fallback and admin
|
||||
@@ -983,9 +985,9 @@ The first backend contract slice is implemented:
|
||||
visibility landed in C18Z53; C18Z54 smoke proves the same diagnostics on a
|
||||
normal non-fallback primary route with healthy rolling route-quality feedback;
|
||||
C18Z55 smoke proves degraded/fenced normal-route feedback is shown separately
|
||||
from explicit backend fallback; C18Z56 adds active-channel remediation
|
||||
from explicit degraded compatibility requests; C18Z56 adds active-channel remediation
|
||||
diagnostics (`none`, `rebuild_route`, `prefer_alternate_route`,
|
||||
`use_backend_fallback`) to make the next runtime action explicit, and its
|
||||
`hold_degraded_route_state`) to make the next runtime action explicit, and its
|
||||
alternate-route branch is live-smoke-proven with backend fallback kept off.
|
||||
C18Z57 adds the bounded machine-readable `remediation_command` contract to
|
||||
active access telemetry rows so route-manager can consume a short-lived
|
||||
@@ -1058,7 +1060,7 @@ The first backend contract slice is implemented:
|
||||
`rebuild_request_recorded` or `rebuild_request_rejected` for the active
|
||||
channel. C18Z76 adds node-side acknowledgement for the allowed
|
||||
`rebuild_route` branch: node-agent consumes the command as a route-manager
|
||||
`pending_degraded_fallback` decision with source
|
||||
`pending_degraded_route_state` decision with source
|
||||
`service_channel_remediation_command`, while guarded commands remain ignored.
|
||||
Backend access telemetry correlates that heartbeat evidence with the durable
|
||||
ledger and reports `rebuild_request_recorded_node_pending`. C18Z77 resolves
|
||||
@@ -1089,7 +1091,7 @@ The first backend contract slice is implemented:
|
||||
reselecting the degraded replacement or adding fallback/failure/drop deltas.
|
||||
C18Z82 proves the no-safe-recovery branch: if that replacement is also fenced
|
||||
and no safe recovery route exists, synthetic config reports
|
||||
`service_channel_feedback_no_alternate` / `pending_degraded_fallback` with
|
||||
`service_channel_feedback_no_alternate` / `pending_degraded_route_state` with
|
||||
`no_unfenced_alternate_route` instead of silently keeping a bad route.
|
||||
C18Z83 projects that route-manager decision into active access telemetry and
|
||||
web-admin active-channel diagnostics, including decision source, route id,
|
||||
@@ -1124,7 +1126,8 @@ The first backend contract slice is implemented:
|
||||
`data_plane` is present in the lease, authority payload, introspection
|
||||
response, and lease-maintenance/admin list. It declares backend API as
|
||||
control-plane transport, fabric service channel/fabric route as working
|
||||
data/steady-state transport, backend relay as degraded fallback only, and
|
||||
data/steady-state transport, degraded compatibility relay as an explicit
|
||||
compatibility state only, and
|
||||
service-neutral protocol-agnostic isolated logical flows as the runtime
|
||||
contract for VPN, Remote Workspace, files, video, and future services. C18Z91
|
||||
makes node-agent consume the signed/introspected data-plane contract, apply
|
||||
@@ -1187,12 +1190,13 @@ channel class, selected entry node, allowed flow isolation, and data-plane
|
||||
contract on `remote-workspaces/{resource_id}/streams/{channel_class}`. Empty
|
||||
probe requests return `202` with a remote-workspace ingress probe contract and
|
||||
access telemetry; real RDP frame forwarding remains deliberately
|
||||
`not_implemented` until the service adapter work begins.
|
||||
`validated_only` for empty probes until the service adapter work begins.
|
||||
C19E adds a narrow frame-batch probe on that boundary. The adapter contract
|
||||
advertises `rap.remote_workspace_frame_batch.v1`, and entry-node accepts
|
||||
non-empty payloads only when they are JSON probe batches with `probe_only=true`,
|
||||
valid remote-workspace logical channels, valid directions, and bounded payload
|
||||
metadata. Accepted probes return `payload_flow=validated_probe_only`; production
|
||||
metadata. Accepted frame probes return `payload_flow=validated_probe_only`, while
|
||||
empty/control probes return `payload_flow=validated_only`; production
|
||||
frame forwarding is still not enabled.
|
||||
C19F connects that validated probe to a node-agent local adapter sink. The
|
||||
in-memory `node_agent_rdp_worker_contract_probe` sink accepts only validated
|
||||
|
||||
@@ -3,7 +3,7 @@
|
||||
Status: Stage C17 planning completed. Stage C17A synthetic mesh runtime
|
||||
skeleton, Stage C17B route health/failover probes, Stage C17C relay semantic
|
||||
hardening, Stage C17D non-production test-service path experiment, Stage C17E
|
||||
live node-to-node synthetic HTTP transport skeleton, Stage C17F scoped
|
||||
historical live node-to-node synthetic HTTP transport skeleton, Stage C17F scoped
|
||||
synthetic route config boundary, Stage C17G Control Plane scoped synthetic
|
||||
config read boundary, Stage C17H deployed multi-agent synthetic config smoke,
|
||||
Stage C17I production forwarding gate, Stage C17J production envelope
|
||||
@@ -44,8 +44,9 @@ invalidation. C17C added synthetic relay validation, per-channel bounded
|
||||
queues, QoS dequeue order, telemetry-only drop/backpressure, and reliable
|
||||
fabric/control rejection behavior. C17D added one bounded `synthetic.echo`
|
||||
test-service path over direct, single-relay, and forced fallback routes. C17E
|
||||
added real HTTP peer transport and a disabled-by-default node-agent synthetic
|
||||
endpoint/smoke harness for direct and single-relay synthetic traffic. C17F
|
||||
added one historical real-HTTP peer transport experiment and a
|
||||
disabled-by-default node-agent synthetic endpoint/smoke harness for direct and
|
||||
single-relay synthetic traffic only. C17F
|
||||
added scoped synthetic peer/route config loading and synthetic route-health
|
||||
link observation reporting. C17G added the Control Plane read boundary for
|
||||
node-scoped synthetic mesh config. C17H proved that boundary in a deployed
|
||||
@@ -596,10 +597,12 @@ C17H implemented a deployed multi-agent synthetic config smoke on
|
||||
VPN/IP tunnel work remains a separate C18 track and must not be mixed into
|
||||
C17 mesh runtime work.
|
||||
|
||||
## 15.4 C17E Result
|
||||
## 15.4 C17E Historical Result
|
||||
|
||||
C17E implemented live node-to-node synthetic HTTP transport while preserving
|
||||
the production forwarding kill-switch:
|
||||
C17E implemented a historical live node-to-node synthetic HTTP transport
|
||||
experiment while preserving the production forwarding kill-switch. This result
|
||||
is retained only as test-history context; it is not the active transport
|
||||
direction for the fabric runtime:
|
||||
|
||||
- `HTTPPeerTransport` maps explicit peer node IDs to synthetic HTTP endpoint
|
||||
URLs.
|
||||
@@ -613,6 +616,13 @@ the production forwarding kill-switch:
|
||||
- `/mesh/v1/forward` remains disabled.
|
||||
- no production service traffic is authorized.
|
||||
|
||||
Current direction:
|
||||
|
||||
- active fabric runtime transport is QUIC-only
|
||||
- synthetic HTTP motion is historical test-only context
|
||||
- production forwarding/runtime acceptance must use QUIC route execution rather
|
||||
than HTTP peer transport
|
||||
|
||||
Verification:
|
||||
|
||||
```powershell
|
||||
@@ -888,9 +898,11 @@ runtime. Stage C17A implements the first narrow runtime skeleton for synthetic
|
||||
Fabric messages only. Stage C17B adds route health/failover observations using
|
||||
synthetic Fabric messages only. Stage C17C adds relay semantic hardening for
|
||||
synthetic channel classes only. Stage C17D adds one bounded non-production
|
||||
`synthetic.echo` service-path experiment only. Stage C17E proves live
|
||||
node-to-node synthetic HTTP transport using real local endpoints only. Stage
|
||||
C17F proves scoped synthetic config loading and route-health reporting only.
|
||||
`synthetic.echo` service-path experiment only. Stage C17E proves one
|
||||
historical synthetic HTTP carrier experiment using real local endpoints only;
|
||||
it is test-only and not representative of the active QUIC fabric runtime.
|
||||
Stage C17F proves scoped synthetic config loading and route-health reporting
|
||||
only.
|
||||
Stage C17G proves Control Plane scoped synthetic config read/consume only.
|
||||
Stage C17H proves deployed multi-agent Control Plane synthetic config
|
||||
consumption and synthetic route-health reporting on `docker-test` only.
|
||||
|
||||
@@ -1,5 +1,12 @@
|
||||
# Production Direct Worker WSS Trust
|
||||
|
||||
Archived status: this document describes an older direct-worker WSS trust
|
||||
track. It is not the current runtime transport source of truth. For the active
|
||||
fabric transport model, use
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: P3.4 design/prep complete.
|
||||
|
||||
This document defines the production trust model for direct worker WSS. It is a
|
||||
|
||||
@@ -1,5 +1,13 @@
|
||||
# RDP Adapter Runtime
|
||||
|
||||
Paused/archival note: this document remains useful for RDP adapter internals,
|
||||
but it is not the current source of truth for transport/runtime architecture.
|
||||
Fabric transport is now QUIC-only between nodes. For active transport,
|
||||
recovery, and routing behavior, see
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: active implementation plan for the new C++ RDP Adapter internals.
|
||||
|
||||
Current implementation status:
|
||||
|
||||
@@ -1,5 +1,12 @@
|
||||
# RDP Stage 5.2 Design Pass - Server-To-Client File Download
|
||||
|
||||
Archived status: this document belongs to the earlier direct-worker/back-gateway
|
||||
RDP track and is not the current source of truth for fabric transport
|
||||
architecture. The active inter-node transport model is QUIC-only; see
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: design-complete proposal, no runtime implementation in this step.
|
||||
|
||||
Date: 2026-04-26
|
||||
|
||||
@@ -1,5 +1,13 @@
|
||||
# RDP Service C++ Performance Target
|
||||
|
||||
Paused/archival note: this document is an RDP performance track record, not the
|
||||
current source of truth for node-to-node transport. Fabric transport is now
|
||||
QUIC-only between nodes; use
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` for the active transport
|
||||
model.
|
||||
|
||||
## Status
|
||||
|
||||
This is the paused RDP service performance direction. The implementation name is `RDP Adapter`: a concrete `Service Adapter` that translates Microsoft RDP into the platform session/data-plane protocol. The common adapter contract is defined in `docs/architecture/SERVICE_ADAPTER_PROTOCOL.md`; the RDP-specific runtime plan is defined in `docs/architecture/RDP_ADAPTER_RUNTIME.md`.
|
||||
|
||||
@@ -1,5 +1,13 @@
|
||||
# RDP Service C# Target Architecture
|
||||
|
||||
Archived scope note: this document is retained as historical RDP runtime
|
||||
research and is not the current source of truth for node-to-node transport.
|
||||
Fabric transport is now QUIC-only between nodes; use
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` for the active transport
|
||||
model.
|
||||
|
||||
## Status
|
||||
|
||||
Superseded.
|
||||
|
||||
@@ -8,6 +8,12 @@ The current proven RDP lifecycle remains a preserved implementation baseline.
|
||||
RDP work is currently paused by product decision. The active architecture focus
|
||||
is the lower Fabric Core / cluster / node foundation.
|
||||
|
||||
Transport clarification: historical references in this document to direct
|
||||
worker WSS or backend gateway fallback describe the earlier RDP service proof
|
||||
path and migration context. They must not be read as the current inter-node
|
||||
transport contract. The active fabric node-to-node runtime transport is
|
||||
QUIC-only.
|
||||
|
||||
## 1. Project Vision
|
||||
|
||||
The project is a Secure Access Fabric: a distributed, multi-tenant platform for secure access to private resources across sites, networks, and organizations.
|
||||
@@ -1702,7 +1708,7 @@ Channels must have independent priority, reliability, and backpressure behavior.
|
||||
|
||||
The current RDP MVP proves lifecycle and basic viewer behavior. It is not the target production performance model.
|
||||
|
||||
Target RDP realtime model:
|
||||
Target RDP realtime model for the paused historical RDP service track:
|
||||
|
||||
- client connects to direct/relay data plane, not backend frame relay
|
||||
- input/control channels are separate from render/video
|
||||
@@ -2459,7 +2465,11 @@ This is an incremental migration plan. It must not be executed as a big-bang rew
|
||||
|
||||
### Current Fallback
|
||||
|
||||
Keep the current backend WebSocket gateway as fallback while the production data plane is introduced.
|
||||
Historical migration note: the older RDP MVP kept the backend WebSocket
|
||||
gateway as a temporary fallback while an earlier production data-plane design
|
||||
was being introduced. This is not the active fabric transport plan. Current
|
||||
fabric node-to-node runtime transport is QUIC-only, and old compatibility paths
|
||||
are being removed rather than extended.
|
||||
|
||||
Current RDP MVP remains the preserved service-adapter baseline, but it is not
|
||||
the active implementation focus while Fabric Core stages are underway.
|
||||
@@ -2543,9 +2553,14 @@ These stages must be introduced only through explicit, narrow implementation
|
||||
prompts. RDP/VNC/SSH/VPN/video/file services remain above the Fabric Core and
|
||||
must not define the lower fabric foundation.
|
||||
|
||||
### Stage DP-1: Direct Worker WSS
|
||||
### Historical Stage DP-1: Direct Worker WSS
|
||||
|
||||
Introduce a short-lived authorized direct WSS path from client to worker or worker-local live endpoint.
|
||||
This stage records an earlier RDP service migration concept. It is paused and
|
||||
retained for historical context only. It must not be read as the active fabric
|
||||
transport roadmap.
|
||||
|
||||
Introduce a short-lived authorized direct WSS path from client to worker or
|
||||
worker-local live endpoint.
|
||||
|
||||
Goals:
|
||||
|
||||
@@ -2554,7 +2569,7 @@ Goals:
|
||||
- keep session broker lifecycle unchanged
|
||||
- keep fallback gateway available
|
||||
|
||||
### Stage DP-2: Binary Frames
|
||||
### Historical Stage DP-2: Binary Frames
|
||||
|
||||
Replace base64 JSON frame payloads with binary frame messages.
|
||||
|
||||
@@ -2565,7 +2580,7 @@ Goals:
|
||||
- reduce JSON/base64 overhead
|
||||
- preserve latest-frame-only behavior
|
||||
|
||||
### Stage DP-3: Adaptive Quality
|
||||
### Historical Stage DP-3: Adaptive Quality
|
||||
|
||||
Implement adaptive RDP quality profiles.
|
||||
|
||||
@@ -2577,9 +2592,10 @@ Goals:
|
||||
- bandwidth and latency feedback
|
||||
- bounded frame queues
|
||||
|
||||
### Stage DP-4: Relay Nodes
|
||||
### Historical Stage DP-4: Relay Nodes
|
||||
|
||||
Introduce `entry-node` and `relay-node` roles for data-plane routing.
|
||||
Introduce `entry-node` and `relay-node` roles for the earlier service-specific
|
||||
data-plane routing model.
|
||||
|
||||
Goals:
|
||||
|
||||
|
||||
@@ -1,20 +1,28 @@
|
||||
# Security And Secrets Readiness
|
||||
|
||||
Status: P3.3 test-stand smoke complete for encrypted resource secrets,
|
||||
assignment-time resolution, and production fallback behavior with smoke-only
|
||||
direct worker WSS trust.
|
||||
Archived scope note: this document records an earlier RDP/direct-worker trust
|
||||
and secret-handling stage. It is not the current source of truth for fabric
|
||||
transport architecture. The active inter-node transport model is QUIC-only; see
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: P3.3 historical test-stand smoke complete for encrypted resource
|
||||
secrets, assignment-time resolution, and legacy RDP baseline behavior with
|
||||
smoke-only direct-worker trust.
|
||||
|
||||
This document defines the next security hardening layer around the accepted RDP
|
||||
MVP baseline. It does not implement mesh, VPN, server-to-client download, new
|
||||
protocol adapters, or another RDP rendering mode.
|
||||
|
||||
## Current Accepted Baseline
|
||||
## Current Accepted Historical RDP Baseline
|
||||
|
||||
- RDP worker baseline: `rap-rdp-worker:rdp-p1-region-order2`
|
||||
- Backend control plane remains source of truth.
|
||||
- Redis remains live coordination/routing only.
|
||||
- Direct worker WSS is preferred for realtime RDP.
|
||||
- Backend gateway remains fallback/debug.
|
||||
- Historical direct-worker WSS was the preferred realtime RDP path in this
|
||||
stage.
|
||||
- Historical backend gateway remained a fallback/debug path for this stage.
|
||||
- Text clipboard is policy-gated and accepted.
|
||||
- Client-to-server file upload and restricted `RAP_Transfers` visibility are
|
||||
accepted.
|
||||
@@ -124,22 +132,24 @@ Already accepted:
|
||||
- worker rejects wrong worker, wrong attachment, wrong organization, wrong
|
||||
resource, over-broad channels, failed/terminated sessions, and jti replay
|
||||
|
||||
Production still needs:
|
||||
Production still needed for that stage:
|
||||
|
||||
- deployed certificate chain for direct worker WSS on production nodes
|
||||
- pinned or platform-issued worker certificates in live production config
|
||||
- deployed certificate chain for the historical direct-worker WSS path on
|
||||
production nodes
|
||||
- pinned or platform-issued worker certificates in live production config for
|
||||
that historical path
|
||||
- no smoke-only TLS bypass in production clients
|
||||
- rotation process for data-plane signing keys
|
||||
- audit for failed token validation/bind attempts
|
||||
|
||||
P3.2 guard exists:
|
||||
P3.2 historical guard exists:
|
||||
|
||||
- backend distinguishes `smoke_insecure`, `public_ca`, and `platform_ca`
|
||||
direct worker WSS trust modes
|
||||
- production backend omits smoke-only direct candidates
|
||||
- Windows production client skips untrusted or smoke-only direct candidates
|
||||
- backend distinguished `smoke_insecure`, `public_ca`, and `platform_ca`
|
||||
direct-worker trust modes for the historical RDP path
|
||||
- production backend omitted smoke-only direct candidates on that path
|
||||
- Windows production client skipped untrusted or smoke-only direct candidates
|
||||
|
||||
P3.3 test-stand smoke exists:
|
||||
P3.3 historical test-stand smoke exists:
|
||||
|
||||
- `resource_secrets` migration is applied on `docker-test`
|
||||
- backend runs as `APP_ENV=production` with a test-only
|
||||
@@ -149,9 +159,9 @@ P3.3 test-stand smoke exists:
|
||||
- `resources.metadata`, `remote_sessions.metadata`, and `audit_events` were
|
||||
checked for plaintext username/password leakage
|
||||
- production backend with `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
|
||||
returns backend gateway fallback only
|
||||
returned the historical backend gateway debug path only
|
||||
- development/smoke backend with the same trust mode advertises the explicit
|
||||
smoke-only direct worker WSS candidate
|
||||
smoke-only historical direct-worker candidate
|
||||
- `RAP_Transfers` smoke passed on the secret-backed resource
|
||||
|
||||
## Required Regression Tests
|
||||
@@ -202,8 +212,8 @@ P3.1 implemented audit events for:
|
||||
assignment payload; a future resolver pull/token flow should reduce exposure
|
||||
in Redis control queues.
|
||||
- Worker still depends on plaintext assignment metadata for development smoke.
|
||||
- Production direct worker WSS certificate issuance/rotation and platform CA
|
||||
distribution are not complete.
|
||||
- Production certificate issuance/rotation and platform CA distribution for the
|
||||
historical direct-worker path are not complete.
|
||||
- The test-stand secret key is a host-local test file, not a production KMS or
|
||||
HSM-backed key.
|
||||
- Automated end-to-end policy denial coverage is still thin.
|
||||
|
||||
@@ -1,7 +1,21 @@
|
||||
# Service Adapter Protocol
|
||||
|
||||
Scope note: this document remains the common adapter-model reference, but it is
|
||||
not the current source of truth for transport/runtime topology between fabric
|
||||
nodes. Fabric transport is now QUIC-only between nodes; for active transport,
|
||||
routing, and recovery behavior see
|
||||
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
|
||||
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
|
||||
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
|
||||
|
||||
Status: target contract and compile-safe foundation. This document defines the common adapter model for RDP, SSH, VNC, and future services. It does not replace the current backend control plane or current RDP runtime by itself.
|
||||
|
||||
Transport clarification: historical references in this document to direct
|
||||
worker WSS, backend gateway fallback, or DP-1 channel shape belong to the
|
||||
earlier RDP service baseline. They are not the active inter-node transport
|
||||
contract. Current fabric node-to-node transport is QUIC-only; service adapters
|
||||
consume fabric routes rather than define transport fallback behavior.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
The platform client must not implement third-party protocols directly.
|
||||
@@ -94,12 +108,16 @@ adapter runtime.
|
||||
- Service Adapter does not know UI implementation details.
|
||||
- Control Plane remains authoritative for session lifecycle and policy.
|
||||
- PostgreSQL remains source of truth; Redis remains live coordination only.
|
||||
- Direct worker WSS and backend gateway fallback remain valid transports.
|
||||
- Fabric transport remains QUIC-only between nodes; any historical direct
|
||||
worker or backend fallback paths belong to paused service-specific baselines,
|
||||
not to the active fabric transport contract.
|
||||
- Adapter runtime must not create sessions outside broker/assignment control.
|
||||
|
||||
## 4. Logical Channels
|
||||
|
||||
The session protocol is channel-oriented even when DP-1 uses one WSS connection.
|
||||
The session protocol is channel-oriented regardless of the concrete carrier. A
|
||||
historical DP-1 single-WSS shape may still appear in paused RDP notes, but it
|
||||
is not the current fabric transport contract.
|
||||
|
||||
| Channel | Direction | Reliability | Priority | Purpose |
|
||||
| --- | --- | --- | --- | --- |
|
||||
|
||||
@@ -7,6 +7,11 @@ Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
|
||||
devices, mesh traffic, service workload execution, API changes, migrations, or
|
||||
RDP behavior changes.
|
||||
|
||||
Transport clarification: this document defines a service layer above Fabric
|
||||
Core. It does not redefine node-to-node transport. Current fabric inter-node
|
||||
transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric
|
||||
routes instead of introducing a separate packet transport contract.
|
||||
|
||||
## Purpose
|
||||
|
||||
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
|
||||
|
||||
@@ -9,6 +9,15 @@ Secure Access Fabric.
|
||||
The fabric node-to-node transport remains QUIC-only. HTTP/HTTPS is allowed only
|
||||
as an external client-facing service edge.
|
||||
|
||||
Terminology rule:
|
||||
|
||||
- `Fabric Transport` = QUIC/UDP node-to-node runtime layer.
|
||||
- `Control API` = HTTP/HTTPS management surface for UI, automation, releases,
|
||||
policy, audit, and status.
|
||||
|
||||
The Control API may use HTTP/HTTPS, but it is not a fallback or alternate
|
||||
carrier for fabric node-to-node runtime traffic.
|
||||
|
||||
## Purpose
|
||||
|
||||
The platform needs a clear distinction between:
|
||||
|
||||