rdp-proxy/docs/architecture/FABRIC_TRANSPORT_SCALE_PLAN.md
Working version, but speed is 10 Mbit/s
2026-05-22 21:46:49 +03:00


# Fabric Transport Scale Plan
Goal: make the farm fabric a durable transport layer, not a request/response helper. VPN is the first service adapter, but the same tunnel/session model must carry RDP, VNC, remote workspace, artifact replication, and future services.
## Invariants
- Services request a tunnel to a pool and a remote service kind. They do not select nodes, routes, relays, or endpoints.
- The farm owns route selection, failover, relay/direct decisions, stream scheduling, and recovery.
- Control traffic stays small. Bulk and continuous traffic never move as a single control response.
- `tunnel_id` is the stable service-facing identity. Legacy service ids, including VPN connection ids, are aliases only.
- Hot traffic is binary framed, not JSON/base64.
- Interactive/control/DNS traffic must not wait behind bulk traffic.
- Route changes preserve the service tunnel identity.
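The invariant that interactive, control, and DNS traffic must not wait behind bulk traffic can be illustrated with a strict-priority queue. This is a minimal sketch, not the fabric's actual scheduler: the class names, the `CLASS_PRIORITY` table, and the `ClassScheduler` type are all hypothetical.

```python
import heapq
from itertools import count

# Hypothetical priorities: a lower number is served first.
CLASS_PRIORITY = {"control": 0, "dns": 0, "interactive": 1, "bulk": 2}


class ClassScheduler:
    """Strict-priority queue: bulk is sent only when nothing more urgent waits."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps FIFO order within one priority

    def enqueue(self, traffic_class: str, frame: bytes) -> None:
        priority = CLASS_PRIORITY[traffic_class]
        heapq.heappush(self._heap, (priority, next(self._seq), traffic_class, frame))

    def next_frame(self):
        """Return (traffic_class, frame) for the most urgent frame, or None."""
        if not self._heap:
            return None
        _, _, traffic_class, frame = heapq.heappop(self._heap)
        return traffic_class, frame


scheduler = ClassScheduler()
scheduler.enqueue("bulk", b"artifact-chunk-0")
scheduler.enqueue("bulk", b"artifact-chunk-1")
scheduler.enqueue("dns", b"dns-query")
scheduler.enqueue("interactive", b"key-press")

# Bulk was queued first but is drained last.
order = [scheduler.next_frame()[0] for _ in range(4)]
print(order)  # ['dns', 'interactive', 'bulk', 'bulk']
```

Strict priority can starve bulk under sustained interactive load, so a real scheduler would likely pair it with per-class windows or a minimum bulk share.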
## Planes
- Control plane: signed commands, leases, tunnel creation, route epochs, policy, heartbeat.
- Session plane: tunnel lifetime, stream registry, stream open/close/reset, route migration.
- Data plane: binary QUIC stream frames for service traffic.
- Bulk plane: artifact/file/replication streams with offset, resume, chunk hash, final hash, mirror failover.
- Observability plane: topology facts, route health, pressure, drops, throughput, per-class latency.
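The bulk plane's offset, resume, chunk hash, and final hash can be sketched as follows. This is an illustration under stated assumptions: SHA-256, the tiny `CHUNK_SIZE`, and the `chunk_manifest` and `BulkReceiver` names are hypothetical and not taken from the fabric code.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and deliberately tiny, to keep the example readable


def chunk_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (per-chunk SHA-256 hex digests, SHA-256 hex digest of the whole payload)."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [hashlib.sha256(c).hexdigest() for c in chunks], hashlib.sha256(data).hexdigest()


class BulkReceiver:
    """Accepts chunks in order and exposes the offset to resume from."""

    def __init__(self, chunk_hashes, final_hash, chunk_size: int = CHUNK_SIZE):
        self.chunk_hashes = chunk_hashes
        self.final_hash = final_hash
        self.chunk_size = chunk_size
        self.buffer = bytearray()

    @property
    def offset(self) -> int:
        """Byte offset any sender (original or mirror) should resume from."""
        return len(self.buffer)

    def accept(self, offset: int, chunk: bytes) -> bool:
        """Append a chunk only if it lands at the current offset and matches its hash."""
        if offset != self.offset:
            return False
        index = offset // self.chunk_size
        if index >= len(self.chunk_hashes):
            return False
        if hashlib.sha256(chunk).hexdigest() != self.chunk_hashes[index]:
            return False
        self.buffer.extend(chunk)
        return True

    def complete(self) -> bool:
        """True once the assembled payload matches the final hash."""
        return hashlib.sha256(bytes(self.buffer)).hexdigest() == self.final_hash


artifact = b"fabric-artifact"
hashes, final = chunk_manifest(artifact)
receiver = BulkReceiver(hashes, final)

# The first source delivers one chunk and then fails.
receiver.accept(0, artifact[0:CHUNK_SIZE])

# A mirror holding the same artifact resumes from the receiver's offset.
while receiver.offset < len(artifact):
    start = receiver.offset
    if not receiver.accept(start, artifact[start:start + CHUNK_SIZE]):
        break  # a rejected chunk would call for another mirror in a real system
```

Because the receiver, not the sender, owns the offset, switching to a mirror needs no state transfer between sources.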
## Service Tunnel Contract
Each service receives:
- `tunnel_id`
- `pool_id`
- `service_id`
- `local_service_id`
- `remote_service_id`
- `service_kind`
- `service_class`
- `service_role`
- `route_lease_id`
- `route_generation`
- `data_plane`
- `traffic_classes`
- `stream_shards`

VPN default profile:
- pool: `ipv4-egress`
- service kind: `vpn-exit`
- service class: `vpn_packets`
- role: `ipv4-egress`

Future profiles use the same contract, for example `rdp-client`, `vnc-client`, `artifact-store`, or `remote-workspace`.
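As a rough illustration, the contract fields and the VPN default profile could be modeled like this. The field names and the four VPN defaults come from the lists above. The field types, the `vpn_tunnel` helper, the `data_plane` value, the traffic class names, and the shard count are assumptions made only for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServiceTunnel:
    tunnel_id: str
    pool_id: str
    service_id: str
    local_service_id: str
    remote_service_id: str
    service_kind: str
    service_class: str
    service_role: str
    route_lease_id: str
    route_generation: int   # assumed to be an integer epoch counter
    data_plane: str
    traffic_classes: tuple
    stream_shards: int


def vpn_tunnel(tunnel_id, service_id, local_service_id, remote_service_id,
               route_lease_id, route_generation=1):
    """Build a tunnel description with the VPN default profile (hypothetical helper)."""
    return ServiceTunnel(
        tunnel_id=tunnel_id,
        pool_id="ipv4-egress",
        service_id=service_id,
        local_service_id=local_service_id,
        remote_service_id=remote_service_id,
        service_kind="vpn-exit",
        service_class="vpn_packets",
        service_role="ipv4-egress",
        route_lease_id=route_lease_id,
        route_generation=route_generation,
        data_plane="quic-binary",                                   # placeholder value
        traffic_classes=("control", "dns", "interactive", "bulk"),  # placeholder names
        stream_shards=4,                                            # placeholder count
    )


tunnel = vpn_tunnel("tun-1", "svc-1", "local-1", "remote-1", "lease-1")

# A route change swaps the lease and generation; the tunnel identity is untouched.
moved = replace(tunnel, route_lease_id="lease-2", route_generation=2)
```

The service never supplies a node, route, relay, or endpoint here; those stay on the farm side of the contract.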
## Implementation Phases
1. Generalize the tunnel contract and keep VPN as the first profile. Current code exposes `rap.fabric_service_tunnel.v1`.
2. Move all service traffic to tunnel identity, keeping legacy ids only as aliases. Current VPN packet/session frames use `tunnel_id`; VPN ids are compatibility aliases inside the packet payload.
3. Introduce reusable session streams for all services, not only VPN packet batches. Current code tracks `rap.fabric_service_stream_registry.v1` with per-tunnel stream state.
4. Add route epoch migration so a service keeps the same tunnel while the farm changes path. The current contract already carries `route_lease_id` and `route_generation` through the profile, Android runtime, hello/heartbeat, and peer registry. The mobile runtime can now apply a new runtime config for the same `tunnel_id` and update the active transport route epoch without closing service streams.
5. Move artifact delivery from control request chunks to bulk streams with resume and mirror failover.
6. Add per-class stream scheduler and backpressure at tunnel, node, route, and pool levels.
7. Add admission control and capacity accounting per node, route, pool, organization, and service.
8. Add stress tests for many tunnels, mixed traffic, route failure, node failure, and update while traffic flows.
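The route epoch migration in phase 4 can be sketched as a runtime that adopts a newer route for the same `tunnel_id` while leaving its streams open. This is an illustration only: `TunnelRuntime` and `apply_route` are hypothetical names, and treating `route_generation` as a monotonically increasing integer is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class TunnelRuntime:
    tunnel_id: str
    route_lease_id: str
    route_generation: int
    open_streams: set = field(default_factory=set)

    def apply_route(self, tunnel_id: str, route_lease_id: str, route_generation: int) -> bool:
        """Adopt a newer route epoch for the same tunnel; open streams are not touched."""
        if tunnel_id != self.tunnel_id:
            raise ValueError("runtime config targets a different tunnel")
        if route_generation <= self.route_generation:
            return False  # stale or duplicate config is ignored
        self.route_lease_id = route_lease_id
        self.route_generation = route_generation
        return True


runtime = TunnelRuntime("tun-1", "lease-1", 1, open_streams={"dns-0", "packets-0"})

applied = runtime.apply_route("tun-1", "lease-2", 2)  # farm picked a new path
stale = runtime.apply_route("tun-1", "lease-1", 1)    # late, older config arrives
```

Rejecting configs whose generation is not newer keeps a delayed message from rolling the tunnel back to a path the farm has already abandoned.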
## Scale Rules
- Frame size protects memory and fairness; throughput comes from parallel streams and windows.
- The current fabric frame guardrail is 8 MiB per frame. This is not a bandwidth ceiling; VPN and later services batch below it and scale by `traffic_classes` plus `stream_shards`.
- VPN mobile batches currently allow up to 2048 packets or 4 MiB per batch, so the service stays below the frame guardrail while avoiding the old 1 MiB bottleneck.
- DNS/control/interactive traffic must use separate classes from bulk; DNS maps to reliable fabric frames by default.
- No node should require a full global graph for normal operation. Use scoped directories and area-level summaries.
- Bulk must be drainable and resumable.
- Interactive traffic must stay preemptive over bulk.
- Every transport fact must be observable separately from planned route and endpoint candidates.
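The batch limits above can be made concrete with a small batcher. The three constants come from the rules in this section; the `batch_packets` function is a hypothetical sketch that ignores frame header overhead.

```python
MAX_BATCH_PACKETS = 2048
MAX_BATCH_BYTES = 4 * 1024 * 1024         # 4 MiB per VPN batch
FRAME_GUARDRAIL_BYTES = 8 * 1024 * 1024   # 8 MiB per fabric frame


def batch_packets(packets):
    """Yield lists of packets that respect both the packet-count and byte limits."""
    batch, batch_bytes = [], 0
    for packet in packets:
        if len(packet) > MAX_BATCH_BYTES:
            raise ValueError("packet larger than the batch byte limit")
        full = len(batch) == MAX_BATCH_PACKETS
        too_big = batch_bytes + len(packet) > MAX_BATCH_BYTES
        if batch and (full or too_big):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(packet)
        batch_bytes += len(packet)
    if batch:
        yield batch


# 5000 packets of 1400 bytes: the packet-count limit is reached first,
# because 2048 * 1400 bytes is about 2.7 MiB.
small = [bytes(1400)] * 5000
small_sizes = [len(b) for b in batch_packets(small)]
print(small_sizes)  # [2048, 2048, 904]

# 100 packets of 65535 bytes: the byte limit is reached first.
large = [bytes(65535)] * 100
large_sizes = [len(b) for b in batch_packets(large)]
print(large_sizes)  # [64, 36]
```

Either way a batch stays at or below 4 MiB, half of the frame guardrail, so throughput has to come from running several such batches in parallel across `stream_shards` rather than from larger frames.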