# Distributed Fabric Node Protocol Plan
This document fixes the target direction for the Secure Access Fabric after the
VPN performance investigation. The platform must not be treated as a VPN
server, RDP gateway, or web console. It is a distributed overlay transport where
every participating device is a fabric node, and VPN/RDP/HTTP/admin/storage are
services running over that fabric.
## Core Position
Every device is a node.
A phone, home server, cloud server, relay, admin-console host, storage host, and
update-cache host share the same base identity model. They differ by roles,
capabilities, policy, trust level, and current health.
```text
Node = identity + roles + capabilities + policy + health + local state
```
The Android VPN app is therefore not only a client. It is a mobile fabric node.
It may carry VPN traffic, participate in route discovery, relay traffic when
policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.
## What Was Missing
The current implementation proves route leases and production VPN forwarding,
but it still has a data-plane shape that cannot scale to high throughput:
- too much payload traffic is carried as small request/response HTTP forwarding
calls;
- JSON/base64 payload envelopes add overhead and CPU cost;
- one overloaded stream can delay unrelated traffic;
- route health is visible, but the transport does not yet provide enough
low-latency per-stream feedback;
- the phone behaves mostly as a service client, not as a full fabric node;
- service discovery and route execution are not yet separated cleanly enough;
- fallback paths can keep traffic alive, but can also hide architecture
bottlenecks if used as the primary data plane.
To sustain 100 Mbps per active device across fleets of 1000+ devices, and
eventually millions, the fabric must move to a persistent, binary, multiplexed
data plane with explicit route and stream semantics.
## Non-Negotiable Principles
1. Fabric is the lower transport layer. VPN, RDP, HTTP, admin console, storage,
and update delivery are services above it.
2. Service adapters must not discover topology, own route selection, or invent
failover logic. They request transport from the fabric.
3. Control plane and data plane are separate. API/console traffic must not be
the packet transport mechanism.
4. Every data session carries many independent streams. A blocked bulk download
must not stall RDP, DNS, control, or telemetry.
5. Routes are leased and replaceable. Route selection uses quality, policy,
locality, role eligibility, cost, trust, and current load.
6. The fabric is distributed. Central control can coordinate, but the runtime
must keep working through cached policy, peer directories, route leases, and
local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
primary high-speed data plane.
## Node Roles
Initial role vocabulary:
- `mobile-edge`: mobile Android/iOS fabric node.
- `entry`: accepts external sessions.
- `relay`: forwards fabric traffic between nodes.
- `exit`: terminates routes into a target network or service zone.
- `service-host`: runs service adapters such as admin console, VPN exit, RDP,
HTTP ingress, storage, or update-cache.
- `control-plane`: participates in control authority, policy decisions, route
authority, or quorum work.
- `route-coordinator`: calculates or assists route candidates for a partition,
region, or service class.
- `storage`: stores approved replicated fabric state.
- `observer`: collects telemetry and health without carrying user traffic.
- `update-cache`: mirrors signed artifacts close to nodes.
Roles are policy decisions, not binary builds. A phone can theoretically receive
any role, but scheduler scoring must account for battery, OS restrictions, NAT,
uplink stability, foreground state, and user cost policy.
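The node model and the "roles are policy" rule can be sketched in Go as follows. This is a minimal illustration, not the agent's actual types: `Node`, `Health`, `MayRelay`, and the battery threshold are hypothetical names and rules chosen for the example; only the role strings come from the vocabulary above.

```go
package main

// Role names follow the vocabulary in this document; the Go types and field
// names are illustrative assumptions.
type Role string

const (
	RoleMobileEdge Role = "mobile-edge"
	RoleEntry      Role = "entry"
	RoleRelay      Role = "relay"
	RoleExit       Role = "exit"
)

// Health carries the mobile-specific signals the scheduler has to weigh.
type Health struct {
	BatteryPercent int // 0-100; meaningful only for battery-powered nodes
	Charging       bool
	Metered        bool
}

// Node = identity + roles + capabilities + policy + health.
type Node struct {
	ID            string // stable cryptographic identity, never an IP address
	Roles         map[Role]bool
	Capabilities  []string
	PolicyVersion string
	Health        Health
}

// HasRole reports whether policy has granted role r to this node.
func (n Node) HasRole(r Role) bool { return n.Roles[r] }

// MayRelay applies a stricter (assumed) rule to mobile nodes: a phone that
// holds the relay role is eligible only on an unmetered network and while it
// is charging or above minBattery percent.
func (n Node) MayRelay(minBattery int) bool {
	if !n.HasRole(RoleRelay) {
		return false
	}
	if !n.HasRole(RoleMobileEdge) {
		return true
	}
	h := n.Health
	return !h.Metered && (h.Charging || h.BatteryPercent >= minBattery)
}
```

The same binary can therefore serve a phone and a cloud relay; what changes is the granted role set and the health-dependent eligibility check.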
## Capability Model
Nodes must advertise capability facts in heartbeats and peer updates:
- supported fabric protocol versions;
- supported transports: UDP/QUIC, TCP, WebSocket, HTTPS fallback;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
- battery state, charging state, mobile/Wi-Fi network type, metered policy;
- max relay bandwidth and allowed traffic classes;
- service roles and service capacity;
- trust tier and allowed tenant/organization scopes;
- local policy version, peer directory version, route cache version.
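A capability advertisement along these lines could be carried as a small structured report in each heartbeat. The sketch below assumes a JSON control-plane encoding and invents all field names (`CapabilityReport`, `rtt_ms`, and so on); it covers only a subset of the facts listed above and is not the real heartbeat schema.

```go
package main

import "encoding/json"

// PeerMeasurement is one measured path to a peer or entry candidate.
type PeerMeasurement struct {
	PeerID    string  `json:"peer_id"`
	RTTMillis float64 `json:"rtt_ms"`
	LossRatio float64 `json:"loss_ratio"`
}

// CapabilityReport is a hypothetical heartbeat payload. Field names and the
// JSON encoding are assumptions made for this example.
type CapabilityReport struct {
	NodeID           string            `json:"node_id"`
	ProtocolVersions []int             `json:"protocol_versions"`
	Transports       []string          `json:"transports"`
	NATType          string            `json:"nat_type"`
	Peers            []PeerMeasurement `json:"peers,omitempty"`
	MaxRelayMbps     int               `json:"max_relay_mbps"`
	PolicyVersion    int64             `json:"policy_version"`
	PeerDirVersion   int64             `json:"peer_directory_version"`
}

// EncodeReport serializes a report for a heartbeat body.
func EncodeReport(r CapabilityReport) ([]byte, error) { return json.Marshal(r) }

// DecodeReport parses a heartbeat body back into a report.
func DecodeReport(b []byte) (CapabilityReport, error) {
	var r CapabilityReport
	err := json.Unmarshal(b, &r)
	return r, err
}

// SupportsVersion reports whether the node advertised protocol version v.
func (r CapabilityReport) SupportsVersion(v int) bool {
	for _, have := range r.ProtocolVersions {
		if have == v {
			return true
		}
	}
	return false
}
```

JSON is acceptable here because heartbeats are low-frequency control traffic; the overhead concern raised earlier applies to packet payloads on the data plane.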
## Fabric Data Session V1
The first practical protocol step is a persistent binary data session. It may
initially run over WebSocket/TCP for faster delivery, but the framing must be
transport-neutral so the same protocol can move to QUIC/UDP.
Minimum frame set:
```text
HELLO node identity, protocol version, capabilities
AUTH signed session token or mTLS-bound proof
SESSION_READY accepted limits, route epoch, peer epoch
OPEN_STREAM stream id, service id, traffic class, route id
DATA stream id, sequence, flags, payload
ACK stream id, received sequence/window
PING/PONG RTT and liveness
ROUTE_UPDATE new route lease or alternate route set
STREAM_CREDIT per-stream backpressure window
NODE_PRESSURE queue/cpu/memory/network pressure signal
CLOSE_STREAM normal stream close
RESET_STREAM failed stream, other streams remain alive
GOAWAY draining or protocol shutdown
```
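One possible transport-neutral encoding of this frame set is a fixed header followed by a length-prefixed payload. The sketch below is an assumption for illustration: the 10-byte header layout, the numeric frame type values, and the 64 KiB limit are not taken from the `fabricproto` package, and the DATA sequence number is assumed to travel inside the payload.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// FrameType enumerates the minimum frame set; the numeric values are assumed.
type FrameType uint8

const (
	FrameHello FrameType = iota + 1
	FrameAuth
	FrameSessionReady
	FrameOpenStream
	FrameData
	FrameAck
	FramePing
	FramePong
	FrameRouteUpdate
	FrameStreamCredit
	FrameNodePressure
	FrameCloseStream
	FrameResetStream
	FrameGoAway
)

const (
	headerSize = 10        // type(1) + flags(1) + stream id(4) + payload length(4)
	MaxPayload = 64 * 1024 // assumed per-frame payload limit
)

var (
	ErrShortFrame  = errors.New("frame shorter than header or declared length")
	ErrTooLarge    = errors.New("payload exceeds MaxPayload")
	ErrUnknownType = errors.New("unknown frame type")
)

type Frame struct {
	Type     FrameType
	Flags    uint8
	StreamID uint32
	Payload  []byte
}

func validType(t FrameType) bool { return t >= FrameHello && t <= FrameGoAway }

// Encode writes the header in big-endian order followed by the payload.
func Encode(f Frame) ([]byte, error) {
	if !validType(f.Type) {
		return nil, ErrUnknownType
	}
	if len(f.Payload) > MaxPayload {
		return nil, ErrTooLarge
	}
	buf := make([]byte, headerSize+len(f.Payload))
	buf[0] = byte(f.Type)
	buf[1] = f.Flags
	binary.BigEndian.PutUint32(buf[2:6], f.StreamID)
	binary.BigEndian.PutUint32(buf[6:10], uint32(len(f.Payload)))
	copy(buf[headerSize:], f.Payload)
	return buf, nil
}

// Decode validates type, declared length, and available bytes before it
// returns a frame whose Payload aliases buf.
func Decode(buf []byte) (Frame, error) {
	if len(buf) < headerSize {
		return Frame{}, ErrShortFrame
	}
	t := FrameType(buf[0])
	if !validType(t) {
		return Frame{}, ErrUnknownType
	}
	n := binary.BigEndian.Uint32(buf[6:10])
	if n > MaxPayload {
		return Frame{}, ErrTooLarge
	}
	if uint32(len(buf)-headerSize) < n {
		return Frame{}, ErrShortFrame
	}
	return Frame{Type: t, Flags: buf[1], StreamID: binary.BigEndian.Uint32(buf[2:6]),
		Payload: buf[headerSize : headerSize+int(n)]}, nil
}
```

Because the codec works on byte slices only, the same frames can be written to a WebSocket message, a TCP stream, or a QUIC stream without changes.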
Traffic classes:
- `control`: authorization, route updates, attach/detach, liveness.
- `dns`: small, latency-sensitive name resolution.
- `interactive`: RDP input, SSH interactive, UI control.
- `reliable`: normal web/API traffic.
- `bulk`: downloads, uploads, sync, large media.
- `droppable`: telemetry samples, optional probes, low-value background data.
Each stream has independent flow control and backpressure. Bulk can be slowed or
moved to another route without blocking control or interactive streams.
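Independent per-stream backpressure can be illustrated with a byte-credit window per stream, replenished by `STREAM_CREDIT`. The sketch below is a simplified, single-goroutine model with hypothetical names (`Session`, `Send`, `Grant`); a real runtime would also queue data, handle ACKs, and synchronize access.

```go
package main

import "errors"

// TrafficClass values mirror the class names listed above.
type TrafficClass string

const (
	ClassInteractive TrafficClass = "interactive"
	ClassBulk        TrafficClass = "bulk"
)

var (
	ErrUnknownStream = errors.New("unknown stream id")
	ErrNoCredit      = errors.New("stream send window exhausted")
)

type stream struct {
	class  TrafficClass
	credit int // bytes the peer currently allows on this stream
}

// Session tracks credit per stream. It is not safe for concurrent use; the
// example keeps locking out to stay short.
type Session struct {
	streams map[uint32]*stream
}

func NewSession() *Session { return &Session{streams: make(map[uint32]*stream)} }

// Open registers a stream with its initial send window (OPEN_STREAM).
func (s *Session) Open(id uint32, class TrafficClass, initialCredit int) {
	s.streams[id] = &stream{class: class, credit: initialCredit}
}

// Send consumes credit from one stream only. Exhausting the window of a bulk
// stream never touches the window of any other stream.
func (s *Session) Send(id uint32, payload []byte) error {
	st, ok := s.streams[id]
	if !ok {
		return ErrUnknownStream
	}
	if len(payload) > st.credit {
		return ErrNoCredit
	}
	st.credit -= len(payload)
	return nil
}

// Grant applies a STREAM_CREDIT frame from the peer.
func (s *Session) Grant(id uint32, n int) error {
	st, ok := s.streams[id]
	if !ok {
		return ErrUnknownStream
	}
	st.credit += n
	return nil
}
```

With this shape, a bulk stream that runs out of credit returns `ErrNoCredit` to its own sender while interactive and control streams on the same session continue to send.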
## Route Model
The fabric must maintain multiple candidate routes for an active session:
```text
phone-a -> entry-1 -> home-1
phone-a -> phone-b -> relay-2 -> home-1
phone-a -> entry-2 -> relay-4 -> service-host-7
```
Route scoring inputs:
- policy and role eligibility;
- route length and failure domains;
- RTT, jitter, packet loss, bandwidth estimate;
- queue depth and retransmit pressure;
- current node CPU/memory/socket pressure;
- mobile battery/charging/metered status;
- historical reliability;
- service locality;
- tenant/organization isolation;
- cost and operator preference.
Routes are issued as short leases with route id, epoch, allowed channels,
allowed service classes, hop list or next-hop policy, expiry, and fencing rules.
## Service Discovery
Services are logical names, not fixed hosts:
```text
service: admin-console
replicas: home-1, node-2, node-9
policy: active-active or leader/follower
ingress: vpn.cin.su / admin.cin.su / internal name
```
`vpn.cin.su` as an HTTP/HTTPS entry is a service endpoint. It can be hosted on
any eligible service-host node. If one replica fails, another replica can accept
the service lease and traffic can be routed to it.
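Resolving a logical service name to a replica might look like the following sketch. The `Service` and `Replica` types and the least-loaded selection rule are assumptions for illustration; the real selection would also apply policy, locality, and lease state.

```go
package main

// Replica is one eligible service-host node for a logical service.
type Replica struct {
	NodeID  string
	Healthy bool
	Load    float64 // 0.0 (idle) to 1.0 (saturated); assumed scale
}

// Service is a logical name with its current replica set.
type Service struct {
	Name     string
	Replicas []Replica
}

// Resolve returns the least-loaded healthy replica. The caller never sees a
// fixed host, so a failed replica is simply skipped on the next resolution.
func (s Service) Resolve() (Replica, bool) {
	var best Replica
	found := false
	for _, r := range s.Replicas {
		if !r.Healthy {
			continue
		}
		if !found || r.Load < best.Load {
			best, found = r, true
		}
	}
	return best, found
}
```

For the `admin-console` example above, marking `home-1` unhealthy would make the next `Resolve` call return `node-2` or `node-9`, whichever reports lower load.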
## Scale Model
For 1000 devices, the platform needs entry pools, exit pools, route leases,
session placement, and overload protection.
For millions of devices, the platform additionally needs regional route
coordinators, distributed peer directories, local control partitions, telemetry
sampling, policy sharding, and resource accounting.
Every device joining the system increases potential edge capacity, but only if
the scheduler can safely decide when that node is allowed to relay, store, serve,
or only consume.
## Security And Abuse Controls
The distributed model increases both capability and risk. The following
controls are required before mobile relay/control/storage roles are broadly
enabled:
- node identity is cryptographic; IP address is never identity;
- all route leases are signed or locally verifiable;
- roles are scoped by organization, tenant, service, and time;
- mobile relay is opt-in by policy and user/device state;
- storage uses encrypted shards and explicit retention policy;
- control-plane participation requires trust tier and quorum policy;
- nodes never receive more topology or secret data than their role requires;
- abuse controls rate-limit relay use, route churn, and failed authentication;
- traffic accounting records who relayed what class and how much, without
exposing payload contents.
## Observability
The current tests show why an aggregate "VPN works" signal is not enough. The
fabric needs per-node, per-route, and per-stream metrics:
- throughput by direction and traffic class;
- RTT, jitter, loss, retransmits, queue depth;
- frame encode/decode errors;
- stream resets and close reasons;
- route switch reason and time to recovery;
- node pressure and scheduler decisions;
- service discovery failover events;
- Android foreground/background and network transition events.
## Work Plan
### Stage FNP-0: Architecture Lock
Status: this document.
Deliverables:
- fix "every device is a node" as the model;
- separate fabric, services, control, and data plane;
- define missing protocol, route, scale, security, and observability pieces.
### Stage FNP-1: Binary Frame Contract
Deliverables:
- add a transport-neutral Go package for Fabric Data Session V1 frame types;
- encode/decode binary frames with size limits and validation;
- add tests for malformed frames, max frame size, stream ids, and frame type
compatibility;
- do not connect it to production traffic yet.
### Stage FNP-2: Persistent Session Runtime Skeleton
Status: in progress in `agents/rap-node-agent/internal/fabricproto`.
Deliverables:
- implement in-memory session runtime with streams, sequence numbers, ACK,
stream credit, reset, and close;
- handle protocol frames for open/data/ack/credit/reset/close/ping/goaway;
- prove that a blocked bulk stream does not block control/interactive streams;
- expose per-stream metrics.
### Stage FNP-3: WebSocket/TCP Compatibility Transport
Status: started with a transport-neutral `io.Reader`/`io.Writer` frame loop, a
WebSocket frame adapter in `agents/rap-node-agent/internal/fabricproto`, and a
gated/authenticated mesh smoke endpoint/client at `/mesh/v1/fabric/session/ws`.
`rap-host-agent fabric-session-smoke` provides the first operator smoke command
and can pass signed fabric-session authority payload/signature headers for
authority-pinned nodes.
Node-agent exposes the endpoint only when `RAP_MESH_FABRIC_SESSION_ENABLED` /
`-mesh-fabric-session-enabled` is set, and reports the enabled endpoint in
heartbeat metadata.
`mesh-live-smoke` includes a fabric-session `PING`/`PONG` check alongside the
existing route and test-service probes. Mesh client code now has a reusable
`FabricSessionClient` for multiple frame exchanges over one WebSocket session,
plus a pump mode with outbound/inbound queues for asynchronous stream traffic.
Live smoke verifies two `PING`/`PONG` round trips on the same connection.
`vpnruntime` has a binary VPN packet-batch mapper for `FrameData` payloads so
packet delivery can move away from JSON production envelopes in a gated mode.
`FabricSessionPacketTransport` now adapts that mapper to the existing
`PacketTransport` interface and can demultiplex inbound DATA frames into the
VPN packet inbox by stream id.
`mesh-live-smoke` now sends a real VPN packet batch through
`FabricSessionPacketTransport` over the WebSocket fabric session and requires a
stream ACK from the remote node.
Mesh has a peer session manager that reuses one pump per peer endpoint, giving
VPN transport selection a stable place to acquire long-lived fabric sessions.
Node config now carries a separate gated
`RAP_VPN_FABRIC_SESSION_TRANSPORT_ENABLED` switch and heartbeat report for the
binary VPN packet transport, keeping endpoint exposure and VPN dataplane
rollout independently controllable.
When the VPN fabric-session switch is enabled, node-agent now attempts to use a
long-lived peer session for gateway packet transport and falls back to the
existing HTTP production envelope path when the peer session is unavailable.
Peer session reuse now evicts closed pumps before reuse, so failed WebSocket
sessions can be reopened on the next transport acquisition.
Heartbeat telemetry includes peer session manager counters for active sessions,
reuses, opens, closed-pump evictions, and explicit close operations.
The mesh package now exposes a service-neutral `FabricTransport` abstraction;
the current WebSocket carrier implements it as `WebSocketFabricTransport`, so
future QUIC/UDP transport can be added without changing VPN/RDP/HTTP services.
`QUICFabricTransport` now implements the same interface and carries the same
binary `fabricproto` frames over a QUIC stream, with local smoke coverage for
`PING`/`PONG` and DATA/ACK.
Carrier selection understands QUIC transport labels and `quic://host:port`
endpoints while preserving WebSocket as the default fallback.
`QUICFabricServer` provides the matching node-side QUIC listener for accepting
fabric streams and running the same session frame handler as other carriers.
Node-agent can now gate the QUIC listener with
`RAP_MESH_QUIC_FABRIC_ENABLED` / `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`, report it
in heartbeat metadata, and pass the setting through host-agent install/update
profiles.
`mesh-live-smoke` verifies the QUIC carrier by starting a temporary QUIC fabric
server and requiring a `PING`/`PONG` round trip over `QUICFabricTransport`.
Nodes now advertise enabled QUIC fabric listeners as `direct_quic` fast-path
endpoint candidates, and endpoint ranking prefers QUIC over WebSocket/HTTPS
compatibility candidates for fabric sessions.
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
so dataplane sessions can select QUIC fast-path candidates and fall back to
legacy peer endpoints when the control plane has not published candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
instead of disabling verification while the cluster CA path is being finished.
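Fingerprint pinning for a self-signed listener can be expressed with the standard `crypto/tls` hooks. The sketch below is an assumed shape, not the agent's code: `Fingerprint` and `PinnedTLSConfig` are hypothetical helpers. It sets `InsecureSkipVerify` only to switch off CA chain validation, which a self-signed certificate cannot pass, and replaces it with a mandatory pin check in `VerifyPeerCertificate`.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"crypto/tls"
	"crypto/x509"
	"encoding/hex"
	"errors"
)

// Fingerprint returns the hex SHA-256 digest of a DER-encoded certificate,
// the value a listener would publish in endpoint metadata.
func Fingerprint(certDER []byte) string {
	sum := sha256.Sum256(certDER)
	return hex.EncodeToString(sum[:])
}

// PinnedTLSConfig builds a client config that accepts exactly one leaf
// certificate, identified by its SHA-256 fingerprint.
func PinnedTLSConfig(pinHex string) (*tls.Config, error) {
	want, err := hex.DecodeString(pinHex)
	if err != nil || len(want) != sha256.Size {
		return nil, errors.New("pin must be a hex-encoded SHA-256 digest")
	}
	return &tls.Config{
		MinVersion: tls.VersionTLS13,
		// CA chain validation is skipped; the callback below is the only
		// check, so it must reject everything except the pinned leaf.
		InsecureSkipVerify: true,
		VerifyPeerCertificate: func(rawCerts [][]byte, _ [][]*x509.Certificate) error {
			if len(rawCerts) == 0 {
				return errors.New("peer presented no certificate")
			}
			got := sha256.Sum256(rawCerts[0])
			if subtle.ConstantTimeCompare(got[:], want) != 1 {
				return errors.New("certificate fingerprint mismatch")
			}
			return nil
		},
	}, nil
}
```

Go does not invoke `VerifyPeerCertificate` on resumed TLS sessions, so a client relying on this pattern should either disable session resumption or accept that the pin was checked on the original handshake.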
VPN fabric-session dialing now walks all ranked endpoint candidates before
falling back to the legacy peer endpoint, so a failed QUIC candidate does not
block WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport,
certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
candidate failures, selected transport family, certificate pin usage, and the
last selected endpoint/failure reason.
VPN fabric-session dialing feeds candidate success/failure observations back
into endpoint ranking, so repeated local QUIC failures can temporarily demote
that endpoint while preserving it as a later fallback.
Endpoint scoring no longer treats missing/zero latency on failed observations as
moderate latency, preventing failed candidates from receiving a false score
bonus.
Endpoint health observations are now emitted as a bounded standalone heartbeat
report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral
(`fabric_session_binary_frames`) and reports QUIC/WebSocket as available
carriers instead of describing the dataplane as WebSocket-only.
Endpoint health observations are pruned in-memory by age and count before
snapshot/report generation, preventing long-running nodes from accumulating
unbounded candidate history.
Scoped and control-plane synthetic mesh config can now carry
`peer_endpoint_observations`, and VPN fabric-session endpoint ranking merges
those remote health hints with local observations using the newest signal.
Endpoint health observations include source and reporter node fields so control
plane can distinguish local dial feedback from aggregated or policy-generated
health hints.
The endpoint health heartbeat report also includes the reporter node id at the
report level for simpler multi-node ingestion and diagnostics.
Peer cache construction now applies endpoint health observations when ranking
peer endpoint candidates, so recovery and warm-peer decisions see the same
degraded-path feedback as VPN fabric-session dialing.
Peer cache snapshots expose best-candidate score reasons, giving diagnostics a
direct explanation for why a QUIC, WebSocket, relay, or fallback endpoint was
chosen.
Heartbeat capabilities now advertise that peer-cache endpoint ranking consumes
health observations, allowing control plane and UI diagnostics to detect nodes
running the health-aware peer selection path.
VPN fabric QUIC transport now reuses QUIC connections per peer endpoint and
opens logical fabric-session streams on top, with heartbeat telemetry for QUIC
connection opens, reuses, evictions, and active count.
Cached QUIC connections are pruned by idle TTL, preventing long-running agents
from holding unused peer connections indefinitely.
QUIC carrier connections now track active logical streams and enforce a
per-connection stream limit, exposing stream opens/closes and limit rejects in
transport telemetry.
The per-connection QUIC stream limit is configurable through
`RAP_VPN_FABRIC_QUIC_MAX_STREAMS_PER_CONN` /
`-vpn-fabric-quic-max-streams-per-conn` and propagated by host-agent install
profiles.
QUIC stream-limit rejects are classified as capacity pressure instead of peer
endpoint failure, so local health feedback does not incorrectly demote a healthy
but saturated carrier.
VPN fabric dial telemetry records the last capacity-limited endpoint and
transport, making stream saturation visible without poisoning endpoint health
observations.
The same dial telemetry now keeps bounded per-endpoint capacity-pressure
counters, so operators can see whether stream saturation is occasional or
concentrated on a specific QUIC carrier.
Fresh local capacity-pressure counters also feed endpoint ranking as a bounded
penalty, spreading new fabric sessions away from a saturated carrier without
declaring that carrier failed.
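The difference between a hard failure penalty and a bounded capacity penalty can be sketched as follows. The base scores, penalty sizes, and type names are invented for the example; only the ordering behavior reflects the text above.

```go
package main

import "sort"

// Candidate is an endpoint with its recent local observations.
type Candidate struct {
	Endpoint        string
	Transport       string // "quic", "websocket", or anything else as fallback
	RecentFailures  int    // dial or path failures
	CapacityRejects int    // stream-limit rejects, not failures
}

// Placeholder weights, not tuned values.
const (
	failurePenalty      = 100
	capacityPenaltyStep = 10
	capacityPenaltyCap  = 40 // bound: saturation alone never outweighs one failure
)

func baseScore(transport string) int {
	switch transport {
	case "quic":
		return 300
	case "websocket":
		return 200
	default:
		return 100
	}
}

// Score subtracts an unbounded penalty for failures and a capped penalty for
// capacity pressure, so a saturated carrier is deprioritized but not failed.
func Score(c Candidate) int {
	pressure := c.CapacityRejects * capacityPenaltyStep
	if pressure > capacityPenaltyCap {
		pressure = capacityPenaltyCap
	}
	return baseScore(c.Transport) - c.RecentFailures*failurePenalty - pressure
}

// Rank returns a copy of the candidates ordered best first.
func Rank(cs []Candidate) []Candidate {
	out := append([]Candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool { return Score(out[i]) > Score(out[j]) })
	return out
}
```

With these numbers a saturated QUIC endpoint drops below an idle QUIC endpoint but stays above WebSocket, while a QUIC endpoint with repeated dial failures falls behind both.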
VPN fabric-session transport now opens configurable per-class stream shards
for interactive and bulk packet traffic, so heavy browser flows do not share a
single logical stream with latency-sensitive RDP/control packets.
Host-agent install commands for Docker, Linux, and Windows expose the same
VPN fabric-session/QUIC tuning flags as install profiles, keeping manual and
profile-based rollout paths aligned.
Gateway runtime snapshots include the fabric-session packet transport stream
layout and send counters by traffic class/stream id for load-test diagnosis.
Those snapshots also summarize configured stream class/shard counts and active
send class/stream counts, making sharding health visible without expanding
per-stream maps.
Gateway shutdown now closes all VPN fabric-session stream shards and then the
underlying fabric session, preventing stale logical streams from consuming QUIC
carrier capacity after reconnects or rollout restarts.
Gateway runtime cancellation now fans out to both upload and download loops
when either direction exits, so transport cleanup runs promptly on one-sided
TUN or carrier failures.
Fabric-session packet transport snapshots include close-frame and close-error
counters for verifying that stream shard cleanup is actually happening.
Outgoing VPN packet batches are split by traffic class and selected stream
before they are framed, so one gateway batch containing many browser flows does
not collapse onto the first packet's logical stream.
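Splitting a batch before framing can be sketched as grouping packets by the stream chosen for each packet's class and flow. The `Packet` fields, the `Layout` map, and the modulo shard choice are assumptions for this example rather than the `vpnruntime` mapper's actual logic.

```go
package main

// Packet is one VPN packet with the classification needed for stream
// selection.
type Packet struct {
	Class    string // traffic class name, for example "interactive" or "bulk"
	FlowHash uint32 // stable per-flow hash so a flow stays on one shard
	Data     []byte
}

// Layout maps a traffic class to the stream ids of its shards.
type Layout map[string][]uint32

// Split groups a batch by target stream. Packets whose class has no shards
// go to the fallback stream, so nothing is silently dropped. Packet order is
// preserved within each stream.
func Split(batch []Packet, layout Layout, fallback uint32) map[uint32][]Packet {
	out := make(map[uint32][]Packet)
	for _, p := range batch {
		id := fallback
		if shards := layout[p.Class]; len(shards) > 0 {
			id = shards[p.FlowHash%uint32(len(shards))]
		}
		out[id] = append(out[id], p)
	}
	return out
}
```

Each resulting group can then be framed as its own DATA frame, which keeps an interactive packet from waiting behind bulk packets that happened to arrive in the same gateway batch.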
`mesh-live-smoke` now sends mixed bulk and interactive VPN packets in a single
fabric-session batch and requires them to remain sharded.
Fabric-session packet transport snapshots now report packets per stream plus
last/max batch fanout, making real multi-site load distribution measurable from
gateway status.
Receive-side fabric-session packet counters are reported by traffic class and
stream id as well, so gateway status can compare TX and RX distribution under
browser/RDP load.
Endpoint ranking treats `capacity_limited` observations as a soft pressure
penalty instead of a hard recent failure, enabling load spreading without
marking the carrier unhealthy.
Local QUIC stream-limit pressure is now emitted as a capacity observation with
no failure-count increment, allowing control plane to spread load without
treating saturation as packet-path breakage.
Cached QUIC carrier idle TTL is configurable through
`RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS` / `-vpn-fabric-quic-idle-ttl` and
propagated by host-agent install profiles.
Deliverables:
- carry binary frames over one persistent WebSocket/TCP connection;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
a gated mode;
- keep HTTP forwarding as fallback.
### Stage FNP-4: Android As Mobile Fabric Node
Deliverables:
- Android advertises node capabilities, network state, battery state, and
supported transports;
- Android opens Fabric Data Session V1 to entry;
- VPN packets map to independent streams/classes;
- diagnostics can run per-stream and per-route tests.
### Stage FNP-5: Route Leases And Multipath
Deliverables:
- route result includes primary and alternate routes;
- runtime can switch new streams to a better route;
- interactive streams can recover quickly after route fencing;
- route health uses dataplane metrics, not only HTTP request success.
### Stage FNP-6: QUIC/UDP Transport
Status: started with `QUICFabricTransport` in `internal/mesh`.
Deliverables:
- implement QUIC transport for Fabric Data Session V1;
- preserve WebSocket/TCP as fallback;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.
### Stage FNP-7: Distributed Service Discovery
Deliverables:
- service names map to eligible service replicas;
- admin console and VPN service can move between service-host nodes;
- service failover is expressed as leases and route updates.
### Stage FNP-8: Mobile Relay And Distributed Capacity
Deliverables:
- mobile nodes can opt into relay under strict policy;
- scheduler scores battery, metered network, NAT, trust, and load;
- route planner can use mobile nodes where they are closer/faster;
- accounting and abuse controls are enforced.
### Stage FNP-9: Scale To Large Fleets
Deliverables:
- entry and route coordinator pools;
- peer directory sharding;
- telemetry sampling and aggregation;
- per-tenant quotas and fairness;
- load tests for 1000 simulated devices, then larger synthetic fleets.
## Immediate Next Action
Start Stage FNP-1 in `rap-node-agent` as a non-production protocol package. The
goal is to create the binary frame contract and tests without disturbing the
current VPN path. After that, wire it into a gated persistent session runtime and
only then move Android/VPN traffic onto it.