Fabric Transport Scale Plan
Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.
Invariants
- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
- Control traffic stays small. Bulk and continuous traffic never move as a single control response.
- tunnel_id is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
- Hot traffic is binary framed, not JSON/base64.
- Interactive/control/DNS traffic must not wait behind bulk traffic.
- Route changes preserve the service tunnel identity.
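The rule that interactive, control, and DNS traffic must not wait behind bulk can be illustrated with a strict-priority dequeue. This is a minimal sketch, not the farm's scheduler: the class names in CLASS_PRIORITY, their order, and the ClassScheduler type are hypothetical, and a real scheduler would also need backpressure and some protection against starving bulk.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Hypothetical class names and priority order; earlier entries drain first.
CLASS_PRIORITY = ("control", "dns", "interactive", "bulk")


class ClassScheduler:
    """Strict-priority dequeue: bulk is sent only when higher classes are empty."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[bytes]] = {c: deque() for c in CLASS_PRIORITY}

    def enqueue(self, traffic_class: str, frame: bytes) -> None:
        if traffic_class not in self._queues:
            raise ValueError(f"unknown traffic class: {traffic_class}")
        self._queues[traffic_class].append(frame)

    def next_frame(self) -> Optional[Tuple[str, bytes]]:
        # Scan classes in priority order and return the first queued frame.
        for traffic_class in CLASS_PRIORITY:
            queue = self._queues[traffic_class]
            if queue:
                return traffic_class, queue.popleft()
        return None  # nothing queued in any class
```

With this shape, a DNS frame enqueued after several bulk frames is still returned first, and frames within one class keep their arrival order.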
Planes
- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
- Data plane: binary QUIC stream frames for service traffic.
- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.
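The bulk plane's offset, resume, chunk hash, and final hash fields can be sketched as follows. This is an illustration under assumptions: SHA-256 as the hash, the tiny CHUNK_SIZE, and the BulkReceiver and chunks_from names are all hypothetical and not taken from the fabric code. Mirror failover is modelled simply as a second sender resuming from the receiver's current offset.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Iterator, Tuple

CHUNK_SIZE = 4  # deliberately tiny, for illustration only


def chunks_from(payload: bytes, offset: int,
                chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, bytes, str]]:
    """Yield (offset, chunk, chunk_hash) starting at the given resume offset."""
    for start in range(offset, len(payload), chunk_size):
        chunk = payload[start:start + chunk_size]
        yield start, chunk, hashlib.sha256(chunk).hexdigest()


@dataclass
class BulkReceiver:
    """Accepts in-order chunks, verifies each hash, and tracks the resume offset."""
    expected_final_hash: str
    offset: int = 0
    data: bytearray = field(default_factory=bytearray, repr=False)

    def accept(self, offset: int, chunk: bytes, chunk_hash: str) -> bool:
        if offset != self.offset:
            return False  # gap or duplicate: sender should resume from self.offset
        if hashlib.sha256(chunk).hexdigest() != chunk_hash:
            return False  # corrupted chunk: offset does not advance
        self.data.extend(chunk)
        self.offset += len(chunk)
        return True

    def finish(self) -> bool:
        """Check the whole transfer against the final hash."""
        return hashlib.sha256(self.data).hexdigest() == self.expected_final_hash
```

If a mirror fails mid-transfer, any other mirror holding the same payload can continue with chunks_from(payload, receiver.offset), and finish() confirms the assembled result.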
Service Tunnel Contract
Each service receives:
- tunnel_id
- pool_id
- service_id
- local_service_id
- remote_service_id
- service_kind
- service_class
- service_role
- route_lease_id
- route_generation
- data_plane
- traffic_classes
- stream_shards
VPN default profile:
- pool: ipv4-egress
- service kind: vpn-exit
- service class: vpn_packets
- role: ipv4-egress
Future profiles use the same contract, for example rdp-client, vnc-client, artifact-store, or remote-workspace.
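The contract and the VPN default profile can be pictured as one record type. This is a sketch under assumptions: the field names come from the contract above, but the field types, the ServiceTunnel and vpn_tunnel names, and every value marked as a placeholder are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ServiceTunnel:
    """Fields from the service tunnel contract; the types are assumptions."""
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int
    data_plane: str
    traffic_classes: Tuple[str, ...]
    stream_shards: int


def vpn_tunnel(tunnel_id: str, route_lease_id: str, route_generation: int) -> ServiceTunnel:
    """Build a tunnel record using the VPN default profile."""
    return ServiceTunnel(
        tunnel_id=tunnel_id,
        pool_id="ipv4-egress",        # VPN default profile: pool
        service_kind="vpn-exit",      # VPN default profile: service kind
        service_class="vpn_packets",  # VPN default profile: service class
        service_role="ipv4-egress",   # VPN default profile: role
        route_lease_id=route_lease_id,
        route_generation=route_generation,
        # Everything below is a placeholder value, not taken from the plan.
        service_id="svc-placeholder",
        local_service_id="local-placeholder",
        remote_service_id="remote-placeholder",
        data_plane="quic",
        traffic_classes=("control", "dns", "vpn_packets"),
        stream_shards=4,
    )
```

A future profile such as rdp-client would reuse ServiceTunnel unchanged and differ only in the profile values.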
Implementation Phases
- Generalize the tunnel contract and keep VPN as the first profile. Current code exposes rap.fabric_service_tunnel.v1.
- Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use tunnel_id; VPN ids are compatibility aliases inside the packet payload.
- Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks rap.fabric_service_stream_registry.v1 with per-tunnel stream state.
- Add route epoch migration so a service keeps the same tunnel while the farm changes path. Current contract already carries route_lease_id and route_generation through profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same tunnel_id and update the active transport route epoch without closing service streams.
- Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
- Add per-class stream scheduler and backpressure at tunnel, node, route, and pool levels.
- Add admission control and capacity accounting per node, route, pool, organization, and service.
- Add stress tests for many tunnels, mixed traffic, route failure, node failure, and update while traffic flows.
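Route epoch migration, as described in the phases above, might look roughly like this on the runtime side. The sketch assumes route_generation is a monotonically increasing integer and that stale configs are ignored; the TunnelRuntime type and apply_runtime_config method are hypothetical names, not the actual mobile runtime API.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TunnelRuntime:
    """Per-tunnel runtime state; stream ids stand in for live service streams."""
    tunnel_id: str
    route_lease_id: str
    route_generation: int
    open_streams: Set[int] = field(default_factory=set)

    def apply_runtime_config(self, tunnel_id: str, route_lease_id: str,
                             route_generation: int) -> bool:
        """Switch to a newer route epoch without touching open streams."""
        if tunnel_id != self.tunnel_id:
            raise ValueError("runtime config targets a different tunnel")
        # Assumption: generations only move forward; older or equal ones are stale.
        if route_generation <= self.route_generation:
            return False
        self.route_lease_id = route_lease_id
        self.route_generation = route_generation
        return True  # open_streams is intentionally left unchanged
```

The key property is that tunnel_id and open_streams survive the call; only the lease and generation change.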
Scale Rules
- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by traffic_classes plus stream_shards.
- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB choke.
- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
- Bulk must be drainable and resumable.
- Interactive traffic must stay preemptive over bulk.
- Every transport fact must be observable separately from planned route and endpoint candidates.
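The batch limits in the rules above (2048 packets or 4 MiB per batch, under an 8 MiB frame guardrail) can be sketched as a simple packet batcher. This is an illustration only: batch_packets is a hypothetical helper, and it counts payload bytes without any framing overhead, which real frames would add.

```python
from typing import Iterable, Iterator, List

MAX_PACKETS_PER_BATCH = 2048           # from the VPN mobile batch rule
MAX_BATCH_BYTES = 4 * 1024 * 1024      # 4 MiB per batch
FRAME_GUARDRAIL_BYTES = 8 * 1024 * 1024  # 8 MiB per frame


def batch_packets(packets: Iterable[bytes],
                  max_packets: int = MAX_PACKETS_PER_BATCH,
                  max_bytes: int = MAX_BATCH_BYTES) -> Iterator[List[bytes]]:
    """Group packets so each batch respects both the count and byte limits."""
    batch: List[bytes] = []
    size = 0
    for packet in packets:
        if len(packet) > max_bytes:
            raise ValueError("single packet exceeds the batch byte limit")
        # Close the current batch if adding this packet would break a limit.
        if batch and (len(batch) >= max_packets or size + len(packet) > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(packet)
        size += len(packet)
    if batch:
        yield batch
```

For 1500-byte packets the packet count is the limit that binds first, since 2048 such packets total about 3 MiB. Throughput beyond one batch would then come from sending batches in parallel across stream_shards, not from larger frames.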