# Fabric Transport Scale Plan

Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.
## Invariants

- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
- Control traffic stays small. Bulk and continuous traffic never move as a single control response.
- `tunnel_id` is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
- Hot traffic is binary framed, not JSON/base64.
- Interactive/control/DNS traffic must not wait behind bulk traffic.
- Route changes preserve the service tunnel identity.
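
The rule that interactive, control, and DNS traffic must not wait behind bulk can be illustrated with a strict-priority queue. This is a minimal sketch rather than the farm's scheduler: the class names, the priority values, and the `ClassScheduler` type are assumptions made for illustration, and a real scheduler would also need starvation protection and backpressure.

```python
import heapq
import itertools

# Hypothetical traffic classes; a lower number drains first.
PRIORITY = {"control": 0, "dns": 0, "interactive": 1, "bulk": 2}


class ClassScheduler:
    """Strict-priority queue: bulk is sent only when nothing more urgent waits."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves FIFO order inside one priority level

    def enqueue(self, traffic_class, frame):
        entry = (PRIORITY[traffic_class], next(self._seq), traffic_class, frame)
        heapq.heappush(self._heap, entry)

    def next_frame(self):
        if not self._heap:
            return None
        _, _, traffic_class, frame = heapq.heappop(self._heap)
        return traffic_class, frame


sched = ClassScheduler()
sched.enqueue("bulk", b"artifact-chunk-0")
sched.enqueue("bulk", b"artifact-chunk-1")
sched.enqueue("dns", b"dns-query")
sched.enqueue("interactive", b"rdp-input")

order = []
while (item := sched.next_frame()) is not None:
    order.append(item[0])
print(order)  # ['dns', 'interactive', 'bulk', 'bulk']
```

Although the bulk frames were queued first, the DNS and interactive frames leave first, which is the behaviour the invariant asks for.
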
## Planes

- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
- Data plane: binary QUIC stream frames for service traffic.
- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.
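
To make "binary framed, not JSON/base64" concrete, the sketch below packs a data-plane frame with a fixed binary header. The header layout is purely hypothetical (this plan does not specify the wire format), and representing `tunnel_id` as a 16-byte UUID is an assumption. The 8 MiB limit is the frame guardrail stated under Scale Rules; applying it to header plus payload is also an assumption.

```python
import struct
import uuid

# Hypothetical header: 16-byte tunnel id, u32 stream id, u8 traffic class, u32 payload length.
HEADER = struct.Struct("!16sIBI")
MAX_FRAME_BYTES = 8 * 1024 * 1024  # frame guardrail from the Scale Rules


def encode_frame(tunnel_id, stream_id, traffic_class, payload):
    """Pack one frame; the payload travels as raw bytes with no text encoding."""
    if HEADER.size + len(payload) > MAX_FRAME_BYTES:
        raise ValueError("frame exceeds guardrail; split the payload across frames")
    return HEADER.pack(tunnel_id.bytes, stream_id, traffic_class, len(payload)) + payload


def decode_frame(frame):
    """Unpack one frame and verify that the payload is complete."""
    raw_id, stream_id, traffic_class, length = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame")
    return uuid.UUID(bytes=raw_id), stream_id, traffic_class, payload


tunnel = uuid.uuid4()
frame = encode_frame(tunnel, stream_id=7, traffic_class=1, payload=b"\x45\x00\x00\x54")
decoded = decode_frame(frame)
print(len(frame))  # 29: a 25-byte header plus the 4-byte payload
```

The fixed-size header keeps per-frame overhead constant, whereas base64 alone would inflate every payload by roughly a third.
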
## Service Tunnel Contract

Each service receives:

- `tunnel_id`
- `pool_id`
- `service_id`
- `local_service_id`
- `remote_service_id`
- `service_kind`
- `service_class`
- `service_role`
- `route_lease_id`
- `route_generation`
- `data_plane`
- `traffic_classes`
- `stream_shards`

VPN default profile:

- pool: `ipv4-egress`
- service kind: `vpn-exit`
- service class: `vpn_packets`
- role: `ipv4-egress`

Future profiles use the same contract, for example `rdp-client`, `vnc-client`, `artifact-store`, or `remote-workspace`.
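
The contract above could be modelled as an immutable record, with a profile supplying the fixed values. This is a sketch under assumptions: the field names and the four VPN defaults come from this section, but the field types, the `make_contract` helper, and every example value passed to it are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceTunnelContract:
    # Field names follow the contract list; the types are assumptions.
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int
    data_plane: str
    traffic_classes: tuple
    stream_shards: int


# VPN default profile as listed above.
VPN_DEFAULTS = {
    "pool_id": "ipv4-egress",
    "service_kind": "vpn-exit",
    "service_class": "vpn_packets",
    "service_role": "ipv4-egress",
}


def make_contract(profile, **assigned):
    """Combine a profile's fixed values with the values assigned per tunnel."""
    return ServiceTunnelContract(**profile, **assigned)


contract = make_contract(
    VPN_DEFAULTS,
    tunnel_id="tun-example",            # all values below are placeholders
    service_id="svc-example",
    local_service_id="local-example",
    remote_service_id="remote-example",
    route_lease_id="lease-example",
    route_generation=1,
    data_plane="quic",
    traffic_classes=("control", "dns", "vpn_packets"),
    stream_shards=4,
)
print(contract.service_kind, contract.pool_id)  # vpn-exit ipv4-egress
```

Note that the record carries no node, relay, or endpoint fields: the service only ever sees the pool, the service kind, and the route lease it was handed.
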
## Implementation Phases

1. Generalize the tunnel contract and keep VPN as the first profile. Current code exposes `rap.fabric_service_tunnel.v1`.
2. Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use `tunnel_id`; VPN ids are compatibility aliases inside the packet payload.
3. Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks `rap.fabric_service_stream_registry.v1` with per-tunnel stream state.
4. Add route epoch migration so a service keeps the same tunnel while the farm changes path. The current contract already carries `route_lease_id` and `route_generation` through the profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same `tunnel_id` and update the active transport route epoch without closing service streams.
5. Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
6. Add a per-class stream scheduler and backpressure at the tunnel, node, route, and pool levels.
7. Add admission control and capacity accounting per node, route, pool, organization, and service.
8. Add stress tests for many tunnels, mixed traffic, route failure, node failure, and updates while traffic flows.
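
Phases 2 and 4 can be pictured together: legacy ids resolve to a `tunnel_id`, and a route change updates the route epoch on that tunnel without touching its streams. The sketch below is illustrative only. `Tunnel`, `TunnelRegistry`, and the example ids are hypothetical, and the rule that `route_generation` must strictly increase is an assumption, not something this plan states.

```python
from dataclasses import dataclass, field


@dataclass
class Tunnel:
    tunnel_id: str
    route_lease_id: str
    route_generation: int
    streams: dict = field(default_factory=dict)  # stream id -> state


class TunnelRegistry:
    def __init__(self):
        self._tunnels = {}
        self._aliases = {}  # legacy id -> tunnel_id

    def add(self, tunnel, aliases=()):
        self._tunnels[tunnel.tunnel_id] = tunnel
        for alias in aliases:
            self._aliases[alias] = tunnel.tunnel_id

    def resolve(self, some_id):
        """Accept a tunnel_id or a legacy alias and return the tunnel."""
        return self._tunnels[self._aliases.get(some_id, some_id)]

    def apply_route(self, tunnel_id, route_lease_id, route_generation):
        """Switch the route epoch in place; open streams are left untouched."""
        tunnel = self._tunnels[tunnel_id]
        if route_generation <= tunnel.route_generation:
            return False  # assumed rule: ignore stale or duplicate updates
        tunnel.route_lease_id = route_lease_id
        tunnel.route_generation = route_generation
        return True


registry = TunnelRegistry()
registry.add(Tunnel("tun-1", "lease-a", 1, {7: "open"}), aliases=["vpn-conn-42"])

applied = registry.apply_route("tun-1", "lease-b", 2)
tunnel = registry.resolve("vpn-conn-42")  # the legacy VPN id still finds the tunnel
print(tunnel.tunnel_id, tunnel.route_generation, tunnel.streams)  # tun-1 2 {7: 'open'}
```

After the route update the tunnel identity and the open stream are unchanged; only the lease and generation moved.
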
## Scale Rules

- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by `traffic_classes` plus `stream_shards`.
- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB bottleneck.
- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
- Bulk must be drainable and resumable.
- Interactive traffic must stay preemptive over bulk.
- Every transport fact must be observable separately from planned route and endpoint candidates.
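
The VPN batch limits above (2048 packets or 4 MiB, whichever is reached first) can be sketched as a simple batcher. The limits come from this section; the `batch_packets` function and its handling of oversized packets are assumptions made for illustration.

```python
MAX_BATCH_PACKETS = 2048
MAX_BATCH_BYTES = 4 * 1024 * 1024        # stays under the 8 MiB frame guardrail


def batch_packets(packets):
    """Yield lists of packets, closing a batch before either limit is exceeded."""
    batch, size = [], 0
    for packet in packets:
        if len(packet) > MAX_BATCH_BYTES:
            raise ValueError("single packet larger than the batch byte limit")
        full = len(batch) == MAX_BATCH_PACKETS or size + len(packet) > MAX_BATCH_BYTES
        if batch and full:
            yield batch
            batch, size = [], 0
        batch.append(packet)
        size += len(packet)
    if batch:
        yield batch


# 5000 packets of 1400 bytes: the packet-count limit is hit before the byte limit.
small = [len(b) for b in batch_packets([bytes(1400)] * 5000)]
print(small)  # [2048, 2048, 904]

# 100 packets of 64 KiB: the byte limit is hit first, at 64 packets per batch.
large = [len(b) for b in batch_packets([bytes(65536)] * 100)]
print(large)  # [64, 36]
```

Neither limit raises throughput on its own; per the rules above, throughput is expected to come from running such batches over several `stream_shards` in parallel.
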