# Distributed Fabric Node Protocol Plan

This document fixes the target direction for the Secure Access Fabric after the
VPN performance investigation. The platform must not be treated as a VPN
server, RDP gateway, or web console. It is a distributed overlay transport where
every participating device is a fabric node, and VPN/RDP/HTTP/admin/storage are
services running over that fabric.

## Core Position

Every device is a node.

A phone, home server, cloud server, relay, admin-console host, storage host, and
update-cache host share the same base identity model. They differ by roles,
capabilities, policy, trust level, and current health.

```text
Node = identity + roles + capabilities + policy + health + local state
```

The Android VPN app is therefore not only a client. It is a mobile fabric node.
It may carry VPN traffic, participate in route discovery, relay traffic when
policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.

## What Was Missing

The current implementation proves route leases and production VPN forwarding,
but it still has a data-plane shape that cannot scale to high throughput:

- too much payload traffic is carried as small request/response HTTP forwarding
  calls;
- JSON/base64 payload envelopes add overhead and CPU cost;
- one overloaded stream can delay unrelated traffic;
- route health is visible, but the transport does not yet provide enough
  low-latency per-stream feedback;
- the phone behaves mostly as a service client, not as a full fabric node;
- service discovery and route execution are not yet separated cleanly enough;
- fallback paths can keep traffic alive, but can also hide architecture
  bottlenecks if used as the primary data plane.

To sustain 100 Mbps per active device, and to grow from 1000+ devices toward
millions, the fabric must move to a persistent, binary, multiplexed data plane
with explicit route and stream semantics.

## Non-Negotiable Principles

1. Fabric is the lower transport layer. VPN, RDP, HTTP, admin console, storage,
   and update delivery are services above it.
2. Service adapters must not discover topology, own route selection, or invent
   failover logic. They request transport from the fabric.
3. Control plane and data plane are separate. API/console traffic must not be
   the packet transport mechanism.
4. Every data session carries many independent streams. A blocked bulk download
   must not stall RDP, DNS, control, or telemetry.
5. Routes are leased and replaceable. Route selection uses quality, policy,
   locality, role eligibility, cost, trust, and current load.
6. The fabric is distributed. Central control can coordinate, but the runtime
   must keep working through cached policy, peer directories, route leases, and
   local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
   primary high-speed data plane.

## Node Roles

Initial role vocabulary:

- `mobile-edge`: mobile Android/iOS fabric node.
- `entry`: accepts external sessions.
- `relay`: forwards fabric traffic between nodes.
- `exit`: terminates routes into a target network or service zone.
- `service-host`: runs service adapters such as admin console, VPN exit, RDP,
  HTTP ingress, storage, or update-cache.
- `control-plane`: participates in control authority, policy decisions, route
  authority, or quorum work.
- `route-coordinator`: calculates or assists route candidates for a partition,
  region, or service class.
- `storage`: stores approved replicated fabric state.
- `observer`: collects telemetry and health without carrying user traffic.
- `update-cache`: mirrors signed artifacts close to nodes.

Roles are policy decisions, not binary builds. A phone can theoretically receive
any role, but scheduler scoring must account for battery, OS restrictions, NAT,
uplink stability, foreground state, and user cost policy.

## Capability Model

Nodes must advertise capability facts in heartbeats and peer updates:

- supported fabric protocol versions;
- supported transports: UDP/QUIC, TCP, WebSocket, HTTPS fallback;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
- battery state, charging state, mobile/wifi network type, metered policy;
- max relay bandwidth and allowed traffic classes;
- service roles and service capacity;
- trust tier and allowed tenant/organization scopes;
- local policy version, peer directory version, route cache version.

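As a rough illustration of the capability list above, the following Go sketch shows one way a heartbeat payload and a relay-eligibility check could look. The `Capabilities` type, its field names, the JSON keys, and the thresholds in `RelayEligible` are all assumptions made for this example; none of them are taken from the existing agent code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Capabilities is a hypothetical heartbeat payload. Field names and JSON
// keys are illustrative only and cover a subset of the advertised facts.
type Capabilities struct {
	ProtocolVersions []int    `json:"protocol_versions"`
	Transports       []string `json:"transports"`
	NATType          string   `json:"nat_type"`
	BatteryPercent   int      `json:"battery_percent"` // -1 when the node has no battery
	Charging         bool     `json:"charging"`
	Metered          bool     `json:"metered"`
	MaxRelayMbps     int      `json:"max_relay_mbps"`
	TrustTier        int      `json:"trust_tier"`
	PolicyVersion    uint64   `json:"policy_version"`
}

// RelayEligible applies an example policy: the node needs enough trust and
// relay bandwidth, must not be on a metered network, and must not be
// running on a battery below 50%. The thresholds are placeholders.
func RelayEligible(c Capabilities, minTrust int) bool {
	if c.TrustTier < minTrust || c.MaxRelayMbps <= 0 {
		return false
	}
	if c.Metered {
		return false
	}
	onBattery := c.BatteryPercent >= 0 && !c.Charging
	if onBattery && c.BatteryPercent < 50 {
		return false
	}
	return true
}

func main() {
	phone := Capabilities{
		ProtocolVersions: []int{1},
		Transports:       []string{"quic", "websocket", "https"},
		NATType:          "symmetric",
		BatteryPercent:   35,
		Charging:         false,
		MaxRelayMbps:     20,
		TrustTier:        2,
		PolicyVersion:    12,
	}
	b, err := json.Marshal(phone)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
	fmt.Println("relay eligible:", RelayEligible(phone, 2))
}
```

Keeping the facts as plain data and the decision in a separate function mirrors the principle that roles are policy decisions: the node reports, and the scheduler decides.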
## Fabric Data Session V1

The first practical protocol step is a persistent binary data session. It may
initially run over WebSocket/TCP for faster delivery, but the framing must be
transport-neutral so the same protocol can move to QUIC/UDP.

Minimum frame set:

```text
HELLO          node identity, protocol version, capabilities
AUTH           signed session token or mTLS-bound proof
SESSION_READY  accepted limits, route epoch, peer epoch
OPEN_STREAM    stream id, service id, traffic class, route id
DATA           stream id, sequence, flags, payload
ACK            stream id, received sequence/window
PING/PONG      RTT and liveness
ROUTE_UPDATE   new route lease or alternate route set
STREAM_CREDIT  per-stream backpressure window
NODE_PRESSURE  queue/cpu/memory/network pressure signal
CLOSE_STREAM   normal stream close
RESET_STREAM   failed stream, other streams remain alive
GOAWAY         draining or protocol shutdown
```

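To make the frame set concrete, here is a minimal Go sketch of a transport-neutral codec that works on byte slices, so it could sit under WebSocket, TCP, or QUIC alike. The wire layout (1-byte type, 1-byte flags, 4-byte stream id, 8-byte sequence, 4-byte payload length, big-endian), the numeric type values, and the 64 KiB payload limit are assumptions for illustration; the document does not specify them, and the real `fabricproto` package may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// FrameType values are hypothetical; only the names follow the frame set.
type FrameType uint8

const (
	FrameHello FrameType = iota + 1
	FrameAuth
	FrameSessionReady
	FrameOpenStream
	FrameData
	FrameAck
	FramePing
	FramePong
	FrameRouteUpdate
	FrameStreamCredit
	FrameNodePressure
	FrameCloseStream
	FrameResetStream
	FrameGoAway
)

const (
	headerSize = 18       // type(1) flags(1) stream(4) sequence(8) length(4)
	maxPayload = 64 << 10 // assumed limit
)

var (
	ErrShortBuffer   = errors.New("fabric frame: buffer too short")
	ErrFrameTooLarge = errors.New("fabric frame: payload exceeds limit")
	ErrUnknownType   = errors.New("fabric frame: unknown frame type")
)

type Frame struct {
	Type     FrameType
	Flags    uint8
	StreamID uint32
	Sequence uint64
	Payload  []byte
}

// AppendFrame validates f and appends its encoding to dst.
func AppendFrame(dst []byte, f Frame) ([]byte, error) {
	if f.Type < FrameHello || f.Type > FrameGoAway {
		return dst, ErrUnknownType
	}
	if len(f.Payload) > maxPayload {
		return dst, ErrFrameTooLarge
	}
	var hdr [headerSize]byte
	hdr[0] = byte(f.Type)
	hdr[1] = f.Flags
	binary.BigEndian.PutUint32(hdr[2:6], f.StreamID)
	binary.BigEndian.PutUint64(hdr[6:14], f.Sequence)
	binary.BigEndian.PutUint32(hdr[14:18], uint32(len(f.Payload)))
	dst = append(dst, hdr[:]...)
	return append(dst, f.Payload...), nil
}

// ParseFrame decodes one frame from b and returns the bytes consumed.
// ErrShortBuffer means the caller should read more data and retry.
func ParseFrame(b []byte) (Frame, int, error) {
	if len(b) < headerSize {
		return Frame{}, 0, ErrShortBuffer
	}
	t := FrameType(b[0])
	if t < FrameHello || t > FrameGoAway {
		return Frame{}, 0, ErrUnknownType
	}
	n32 := binary.BigEndian.Uint32(b[14:18])
	if n32 > maxPayload {
		return Frame{}, 0, ErrFrameTooLarge
	}
	n := int(n32)
	if len(b) < headerSize+n {
		return Frame{}, 0, ErrShortBuffer
	}
	f := Frame{
		Type:     t,
		Flags:    b[1],
		StreamID: binary.BigEndian.Uint32(b[2:6]),
		Sequence: binary.BigEndian.Uint64(b[6:14]),
		Payload:  append([]byte(nil), b[headerSize:headerSize+n]...), // copy out of the read buffer
	}
	return f, headerSize + n, nil
}

func main() {
	buf, err := AppendFrame(nil, Frame{Type: FrameData, StreamID: 7, Sequence: 1, Payload: []byte("pkt")})
	if err != nil {
		panic(err)
	}
	f, n, err := ParseFrame(buf)
	fmt.Println(f.Type, f.StreamID, f.Sequence, string(f.Payload), n, err)
}
```

Rejecting oversized lengths before allocating is the important part: a malformed header must not be able to force a large allocation on the receiving node.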
Traffic classes:

- `control`: authorization, route updates, attach/detach, liveness.
- `dns`: small, latency-sensitive name resolution.
- `interactive`: RDP input, SSH interactive, UI control.
- `reliable`: normal web/API traffic.
- `bulk`: downloads, uploads, sync, large media.
- `droppable`: telemetry samples, optional probes, low-value background data.

Each stream has independent flow control and backpressure. Bulk can be slowed or
moved to another route without blocking control or interactive streams.

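The independence claim above can be sketched with a per-stream credit window: each stream only sends while it holds credit granted by the peer (the role `STREAM_CREDIT` plays in the frame set), so a stalled bulk stream queues locally while other streams keep moving. The `Stream` type and its methods are hypothetical, and counting credit in bytes is an assumption.

```go
package main

import "fmt"

// Stream tracks the send window the peer has granted to one stream.
type Stream struct {
	ID      uint32
	Class   string
	credit  int      // bytes the peer currently allows us to send
	pending [][]byte // payloads waiting for credit, in order
}

// Send queues a payload and returns whatever may go on the wire right now.
func (s *Stream) Send(p []byte) [][]byte {
	s.pending = append(s.pending, p)
	return s.drain()
}

// Grant adds credit received from the peer and returns the payloads it unblocks.
func (s *Stream) Grant(n int) [][]byte {
	s.credit += n
	return s.drain()
}

// drain releases queued payloads in order while credit covers them.
func (s *Stream) drain() [][]byte {
	var out [][]byte
	for len(s.pending) > 0 && len(s.pending[0]) <= s.credit {
		p := s.pending[0]
		s.pending = s.pending[1:]
		s.credit -= len(p)
		out = append(out, p)
	}
	return out
}

func main() {
	bulk := &Stream{ID: 1, Class: "bulk", credit: 4}
	rdp := &Stream{ID: 2, Class: "interactive", credit: 1024}

	fmt.Println("bulk sent now:", len(bulk.Send(make([]byte, 1500))))  // bulk is out of credit and waits
	fmt.Println("interactive sent now:", len(rdp.Send([]byte("key")))) // unaffected by the bulk stall
	fmt.Println("bulk released by grant:", len(bulk.Grant(2000)))
}
```

Because credit is tracked per stream rather than per connection, the scheduler can also shrink a bulk window deliberately to protect interactive traffic.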
## Route Model

The fabric must maintain multiple candidate routes for an active session:

```text
phone-a -> entry-1 -> home-1
phone-a -> phone-b -> relay-2 -> home-1
phone-a -> entry-2 -> relay-4 -> service-host-7
```

Route scoring inputs:

- policy and role eligibility;
- route length and failure domains;
- RTT, jitter, packet loss, bandwidth estimate;
- queue depth and retransmit pressure;
- current node CPU/memory/socket pressure;
- mobile battery/charging/metered status;
- historical reliability;
- service locality;
- tenant/organization isolation;
- cost and operator preference.

Routes are issued as short leases with route id, epoch, allowed channels,
allowed service classes, hop list or next-hop policy, expiry, and fencing rules.

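A small Go sketch of how leases and scoring could combine: unusable leases (expired, or fenced by a newer epoch) are filtered out first, and the rest are ordered by a weighted score. The `RouteLease` and `Candidate` types, the subset of scoring inputs, and every weight in `Score` are assumptions chosen for illustration, not the scheduler's actual formula.

```go
package main

import (
	"fmt"
	"sort"
)

// RouteLease is a hypothetical, reduced lease: id, epoch, hop list, expiry.
type RouteLease struct {
	RouteID    string
	Epoch      uint64
	Hops       []string
	ExpiryUnix int64
}

// Usable reports whether the lease may carry traffic: it has not expired and
// its epoch has not been fenced off by a newer route epoch.
func (l RouteLease) Usable(nowUnix int64, fencedBelow uint64) bool {
	return nowUnix < l.ExpiryUnix && l.Epoch >= fencedBelow
}

// Candidate pairs a lease with a few measured inputs.
type Candidate struct {
	Lease    RouteLease
	RTTms    float64
	LossPct  float64
	Pressure float64 // 0..1, worst node pressure along the route
}

// Score returns a cost where lower is better. The weights are placeholders.
func Score(c Candidate) float64 {
	return c.RTTms + 50*c.LossPct + 100*c.Pressure + 5*float64(len(c.Lease.Hops))
}

// Rank drops unusable leases and sorts the remainder best-first.
func Rank(cands []Candidate, nowUnix int64, fencedBelow uint64) []Candidate {
	var ok []Candidate
	for _, c := range cands {
		if c.Lease.Usable(nowUnix, fencedBelow) {
			ok = append(ok, c)
		}
	}
	sort.SliceStable(ok, func(i, j int) bool { return Score(ok[i]) < Score(ok[j]) })
	return ok
}

func main() {
	cands := []Candidate{
		{Lease: RouteLease{RouteID: "r1", Epoch: 5, Hops: []string{"phone-a", "entry-1", "home-1"}, ExpiryUnix: 1060}, RTTms: 40, Pressure: 0.1},
		{Lease: RouteLease{RouteID: "r2", Epoch: 5, Hops: []string{"phone-a", "phone-b", "relay-2", "home-1"}, ExpiryUnix: 1060}, RTTms: 25, LossPct: 1, Pressure: 0.2},
		{Lease: RouteLease{RouteID: "r3", Epoch: 4, Hops: []string{"phone-a", "entry-2"}, ExpiryUnix: 1060}, RTTms: 5}, // fenced
	}
	for _, c := range Rank(cands, 1000, 5) {
		fmt.Println(c.Lease.RouteID, Score(c))
	}
}
```

Hard constraints such as policy, role eligibility, and tenant isolation would belong in the filter step next to `Usable`, not in the weighted score, so a fast route can never outrank a forbidden one.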
## Service Discovery

Services are logical names, not fixed hosts:

```text
service: admin-console
replicas: home-1, node-2, node-9
policy: active-active or leader/follower
ingress: vpn.cin.su / admin.cin.su / internal name
```

`vpn.cin.su` as an HTTP/HTTPS entry is a service endpoint. It can be hosted on
any eligible service-host node. If one replica fails, another replica can accept
the service lease and traffic can be routed to it.

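The idea that a service name resolves to whichever eligible replica is currently healthy can be sketched as follows. The `Directory` and `Replica` types and the first-healthy selection rule are assumptions for illustration; a real resolver would also consult leases, policy, and locality.

```go
package main

import "fmt"

// Replica is one node that may host a service.
type Replica struct {
	Node    string
	Healthy bool
}

// Directory maps a logical service name to its replicas in preference order.
type Directory map[string][]Replica

// Resolve returns the first healthy replica for the service, if any.
func (d Directory) Resolve(service string) (string, bool) {
	for _, r := range d[service] {
		if r.Healthy {
			return r.Node, true
		}
	}
	return "", false
}

func main() {
	dir := Directory{
		"admin-console": {
			{Node: "home-1", Healthy: true},
			{Node: "node-2", Healthy: true},
			{Node: "node-9", Healthy: true},
		},
	}
	node, _ := dir.Resolve("admin-console")
	fmt.Println("serving from:", node)

	// Simulate a failure of the preferred replica: the name stays the same,
	// only the resolved node changes.
	dir["admin-console"][0].Healthy = false
	node, _ = dir.Resolve("admin-console")
	fmt.Println("after failover:", node)
}
```

Callers hold the service name, never the node name, which is what lets failover be expressed as a lease and route update rather than a client reconfiguration.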
## Scale Model

For 1000 devices, the platform needs entry pools, exit pools, route leases,
session placement, and overload protection.

For millions of devices, the platform additionally needs regional route
coordinators, distributed peer directories, local control partitions, telemetry
sampling, policy sharding, and resource accounting.

Every device joining the system increases potential edge capacity, but only if
the scheduler can safely decide when that node is allowed to relay, store, serve,
or only consume.

## Security And Abuse Controls

The distributed model increases both capability and risk. The following
controls are required before mobile relay/control/storage roles are broadly
enabled:

- node identity is cryptographic; IP address is never identity;
- all route leases are signed or locally verifiable;
- roles are scoped by organization, tenant, service, and time;
- mobile relay is opt-in by policy and user/device state;
- storage uses encrypted shards and explicit retention policy;
- control-plane participation requires trust tier and quorum policy;
- nodes never receive more topology or secret data than their role requires;
- abuse controls rate-limit relay use, route churn, and failed authentication;
- traffic accounting records who relayed what class and how much, without
  exposing payload contents.

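For the "signed or locally verifiable" lease requirement, one possible shape is an Ed25519 signature over a canonical lease encoding, checked by every node that is asked to honour the lease. This is a sketch under assumptions: the document does not name a signature scheme, and the `Lease`, `Authority`, and `Verify` names and the byte layout are invented for the example.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/binary"
	"errors"
	"fmt"
)

var (
	ErrBadSignature = errors.New("lease: signature does not verify")
	ErrExpired      = errors.New("lease: expired")
)

// Lease is a reduced, hypothetical route lease.
type Lease struct {
	RouteID    string
	Epoch      uint64
	ExpiryUnix int64
}

// bytes returns a canonical encoding: epoch(8) | expiry(8) | route id.
func (l Lease) bytes() []byte {
	b := make([]byte, 16, 16+len(l.RouteID))
	binary.BigEndian.PutUint64(b[0:8], l.Epoch)
	binary.BigEndian.PutUint64(b[8:16], uint64(l.ExpiryUnix))
	return append(b, l.RouteID...)
}

// Authority holds the signing key of a lease issuer.
type Authority struct {
	Public  ed25519.PublicKey
	private ed25519.PrivateKey
}

func NewAuthority() (*Authority, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	return &Authority{Public: pub, private: priv}, nil
}

// Sign returns the signature over the canonical lease bytes.
func (a *Authority) Sign(l Lease) []byte {
	return ed25519.Sign(a.private, l.bytes())
}

// Verify checks the signature first, then the expiry, so an unsigned lease
// is never trusted enough to read its fields.
func Verify(pub ed25519.PublicKey, l Lease, sig []byte, nowUnix int64) error {
	if !ed25519.Verify(pub, l.bytes(), sig) {
		return ErrBadSignature
	}
	if nowUnix >= l.ExpiryUnix {
		return ErrExpired
	}
	return nil
}

func main() {
	a, err := NewAuthority()
	if err != nil {
		panic(err)
	}
	l := Lease{RouteID: "r1", Epoch: 7, ExpiryUnix: 2000}
	sig := a.Sign(l)
	fmt.Println("valid lease:", Verify(a.Public, l, sig, 1000))

	l.Epoch = 8 // tampering with any signed field invalidates the signature
	fmt.Println("tampered lease:", Verify(a.Public, l, sig, 1000))
}
```

A relay only needs the issuer's public key to check a lease, which fits the rule that nodes never receive more secret material than their role requires.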
## Observability

The current tests show why an aggregate "VPN works" signal is not enough. The
fabric needs per-node, per-route, and per-stream metrics:

- throughput by direction and traffic class;
- RTT, jitter, loss, retransmits, queue depth;
- frame encode/decode errors;
- stream resets and close reasons;
- route switch reason and time to recovery;
- node pressure and scheduler decisions;
- service discovery failover events;
- Android foreground/background and network transition events.

## Work Plan

### Stage FNP-0: Architecture Lock

Status: this document.

Deliverables:

- fix "every device is a node" as the model;
- separate fabric, services, control, and data plane;
- define missing protocol, route, scale, security, and observability pieces.

### Stage FNP-1: Binary Frame Contract

Deliverables:

- add a transport-neutral Go package for Fabric Data Session V1 frame types;
- encode/decode binary frames with size limits and validation;
- add tests for malformed frames, max frame size, stream ids, and frame type
  compatibility;
- do not connect it to production traffic yet.

### Stage FNP-2: Persistent Session Runtime Skeleton

Status: in progress in `agents/rap-node-agent/internal/fabricproto`.

Deliverables:

- implement in-memory session runtime with streams, sequence numbers, ACK,
  stream credit, reset, and close;
- handle protocol frames for open/data/ack/credit/reset/close/ping/goaway;
- prove that a blocked bulk stream does not block control/interactive streams;
- expose per-stream metrics.

### Stage FNP-3: WebSocket/TCP Compatibility Transport

Status: started with a transport-neutral `io.Reader`/`io.Writer` frame loop,
WebSocket frame adapter in `agents/rap-node-agent/internal/fabricproto`, and a
gated/authenticated mesh smoke endpoint/client at `/mesh/v1/fabric/session/ws`.

- `rap-host-agent fabric-session-smoke` provides the first operator smoke command
  and can pass signed fabric-session authority payload/signature headers for
  authority-pinned nodes.
- Node-agent exposes the endpoint only when `RAP_MESH_FABRIC_SESSION_ENABLED` /
  `-mesh-fabric-session-enabled` is set, and reports the enabled endpoint in
  heartbeat metadata.
- `mesh-live-smoke` includes a fabric-session `PING`/`PONG` check alongside the
  existing route and test-service probes.
- Mesh client code now has a reusable `FabricSessionClient` for multiple frame
  exchanges over one WebSocket session, plus a pump mode with outbound/inbound
  queues for asynchronous stream traffic.
- Live smoke verifies two `PING`/`PONG` round trips on the same connection.
- `vpnruntime` has a binary VPN packet-batch mapper for `FrameData` payloads so
  packet delivery can move away from JSON production envelopes in a gated mode.
- `FabricSessionPacketTransport` now adapts that mapper to the existing
  `PacketTransport` interface and can demultiplex inbound DATA frames into the
  VPN packet inbox by stream id.
- `mesh-live-smoke` now sends a real VPN packet batch through
  `FabricSessionPacketTransport` over the WebSocket fabric session and requires a
  stream ACK from the remote node.
- Mesh has a peer session manager that reuses one pump per peer endpoint, giving
  VPN transport selection a stable place to acquire long-lived fabric sessions.
- Node config now carries a separate gated
  `RAP_VPN_FABRIC_SESSION_TRANSPORT_ENABLED` switch and heartbeat report for the
  binary VPN packet transport, keeping endpoint exposure and VPN dataplane
  rollout independently controllable.
- When the VPN fabric-session switch is enabled, node-agent now attempts to use a
  long-lived peer session for gateway packet transport and falls back to the
  existing HTTP production envelope path when the peer session is unavailable.
- Peer session reuse now evicts closed pumps before reuse, so failed WebSocket
  sessions can be reopened on the next transport acquisition.
- Heartbeat telemetry includes peer session manager counters for active sessions,
  reuses, opens, closed-pump evictions, and explicit close operations.
- The mesh package now exposes a service-neutral `FabricTransport` abstraction;
  the current WebSocket carrier implements it as `WebSocketFabricTransport`, so
  future QUIC/UDP transport can be added without changing VPN/RDP/HTTP services.
- `QUICFabricTransport` now implements the same interface and carries the same
  binary `fabricproto` frames over a QUIC stream, with local smoke coverage for
  `PING`/`PONG` and DATA/ACK.
- Carrier selection understands QUIC transport labels and `quic://host:port`
  endpoints while preserving WebSocket as the default fallback.
- `QUICFabricServer` provides the matching node-side QUIC listener for accepting
  fabric streams and running the same session frame handler as other carriers.
- Node-agent can now gate the QUIC listener with
  `RAP_MESH_QUIC_FABRIC_ENABLED` / `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`, report it
  in heartbeat metadata, and pass the setting through host-agent install/update
  profiles.
- `mesh-live-smoke` verifies the QUIC carrier by starting a temporary QUIC fabric
  server and requiring a `PING`/`PONG` round trip over `QUICFabricTransport`.
- Nodes now advertise enabled QUIC fabric listeners as `direct_quic` fast-path
  endpoint candidates, and endpoint ranking prefers QUIC over WebSocket/HTTPS
  compatibility candidates for fabric sessions.
- VPN fabric-session gateway transport now consumes ranked endpoint candidates,
  so dataplane sessions can select QUIC fast-path candidates and fall back to
  legacy peer endpoints when the control plane has not published candidates yet.
- The temporary self-signed QUIC listener advertises its SHA-256 certificate
  fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
  instead of disabling verification while the cluster CA path is being finished.
- VPN fabric-session dialing now walks all ranked endpoint candidates before
  falling back to the legacy peer endpoint, so a failed QUIC candidate does not
  block WebSocket/HTTPS compatibility transport.
- Successful VPN fabric-session dialing logs the selected candidate, transport,
  certificate pin usage, and remaining fallback count for phone-side diagnostics.
- Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
  candidate failures, selected transport family, certificate pin usage, and the
  last selected endpoint/failure reason.
- VPN fabric-session dialing feeds candidate success/failure observations back
  into endpoint ranking, so repeated local QUIC failures can temporarily demote
  that endpoint while preserving it as a later fallback.
- Endpoint scoring no longer treats missing/zero latency on failed observations as
  moderate latency, preventing failed candidates from receiving a false score
  bonus.
- Endpoint health observations are now emitted as a bounded standalone heartbeat
  report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
  candidate feedback without parsing the transport diagnostics blob.
- VPN fabric-session transport telemetry is carrier-neutral
  (`fabric_session_binary_frames`) and reports QUIC/WebSocket as available
  carriers instead of describing the dataplane as WebSocket-only.
- Endpoint health observations are pruned in-memory by age and count before
  snapshot/report generation, preventing long-running nodes from accumulating
  unbounded candidate history.
- Scoped and control-plane synthetic mesh config can now carry
  `peer_endpoint_observations`, and VPN fabric-session endpoint ranking merges
  those remote health hints with local observations using the newest signal.
- Endpoint health observations include source and reporter node fields so control
  plane can distinguish local dial feedback from aggregated or policy-generated
  health hints.
- The endpoint health heartbeat report also includes the reporter node id at the
  report level for simpler multi-node ingestion and diagnostics.
- Peer cache construction now applies endpoint health observations when ranking
  peer endpoint candidates, so recovery and warm-peer decisions see the same
  degraded-path feedback as VPN fabric-session dialing.
- Peer cache snapshots expose best-candidate score reasons, giving diagnostics a
  direct explanation for why a QUIC, WebSocket, relay, or fallback endpoint was
  chosen.
- Heartbeat capabilities now advertise that peer-cache endpoint ranking consumes
  health observations, allowing control plane and UI diagnostics to detect nodes
  running the health-aware peer selection path.
- VPN fabric QUIC transport now reuses QUIC connections per peer endpoint and
  opens logical fabric-session streams on top, with heartbeat telemetry for QUIC
  connection opens, reuses, evictions, and active count.
- Cached QUIC connections are pruned by idle TTL, preventing long-running agents
  from holding unused peer connections indefinitely.
- QUIC carrier connections now track active logical streams and enforce a
  per-connection stream limit, exposing stream opens/closes and limit rejects in
  transport telemetry.
- The per-connection QUIC stream limit is configurable through
  `RAP_VPN_FABRIC_QUIC_MAX_STREAMS_PER_CONN` /
  `-vpn-fabric-quic-max-streams-per-conn` and propagated by host-agent install
  profiles.
- QUIC stream-limit rejects are classified as capacity pressure instead of peer
  endpoint failure, so local health feedback does not incorrectly demote a healthy
  but saturated carrier.
- VPN fabric dial telemetry records the last capacity-limited endpoint and
  transport, making stream saturation visible without poisoning endpoint health
  observations.
- The same dial telemetry now keeps bounded per-endpoint capacity-pressure
  counters, so operators can see whether stream saturation is occasional or
  concentrated on a specific QUIC carrier.
- Fresh local capacity-pressure counters also feed endpoint ranking as a bounded
  penalty, spreading new fabric sessions away from a saturated carrier without
  declaring that carrier failed.
- VPN fabric-session transport now opens configurable per-class stream shards
  for interactive and bulk packet traffic, so heavy browser flows do not share a
  single logical stream with latency-sensitive RDP/control packets.
- Host-agent install commands for Docker, Linux, and Windows expose the same
  VPN fabric-session/QUIC tuning flags as install profiles, keeping manual and
  profile-based rollout paths aligned.
- Gateway runtime snapshots include the fabric-session packet transport stream
  layout and send counters by traffic class/stream id for load-test diagnosis.
- Those snapshots also summarize configured stream class/shard counts and active
  send class/stream counts, making sharding health visible without expanding
  per-stream maps.
- Gateway shutdown now closes all VPN fabric-session stream shards and then the
  underlying fabric session, preventing stale logical streams from consuming QUIC
  carrier capacity after reconnects or rollout restarts.
- Gateway runtime cancellation now fans out to both upload and download loops
  when either direction exits, so transport cleanup runs promptly on one-sided
  TUN or carrier failures.
- Fabric-session packet transport snapshots include close-frame and close-error
  counters for verifying that stream shard cleanup is actually happening.
- Outgoing VPN packet batches are split by traffic class and selected stream
  before they are framed, so one gateway batch containing many browser flows does
  not collapse onto the first packet's logical stream.
- `mesh-live-smoke` now sends mixed bulk and interactive VPN packets in a single
  fabric-session batch and requires them to remain sharded.
- The smoke report also exposes the mixed-batch frame fanout so regressions show
  up as a concrete fanout drop, not just a failed boolean.
- Batch fanout is bounded by configured stream shards, so a large batch with many
  flows cannot explode into unbounded fabric frames.
- Heartbeat tests assert the advertised VPN fabric stream-shard count and
  capability, keeping control-plane diagnostics aligned with runtime behavior.
- Fabric-session packet transport snapshots now report packets per stream plus
  last/max batch fanout, making real multi-site load distribution measurable from
  gateway status.
- Receive-side fabric-session packet counters are reported by traffic class and
  stream id as well, so gateway status can compare TX and RX distribution under
  browser/RDP load.
- QUIC fabric transport snapshots expose the configured stream limit, saturated
  connection count, and capacity pressure percentage next to stream limit rejects.
- Closed cached QUIC connections discovered during snapshot generation now update
  the transport's cumulative eviction counters, keeping successive heartbeats
  consistent.
- `mesh-live-smoke` reports QUIC fabric capacity-pressure percentage from the
  transport snapshot, verifying that the capacity fields are populated.
- QUIC fabric snapshots now include per-connection pressure, endpoint, and
  saturation state for each cached connection; VPN fabric endpoint ranking
  consumes that live local pressure before stream-limit rejection, spreading new
  sessions away from already busy QUIC carriers.
- Per-connection QUIC snapshot entries are sorted by peer and endpoint so
  heartbeats and diagnostics stay stable across reports.
- When local live QUIC pressure and recent capacity-limit counters overlap, the
  ranking input keeps the stronger pressure signal rather than allowing a weak
  fresh sample to hide a saturated endpoint.
- Heartbeat VPN fabric reports now include a bounded `quic_capacity_pressure`
  summary sorted by busiest cached QUIC connection, making overload diagnosis
  visible without digging through the full carrier snapshot.
- VPN fabric flow-scheduler snapshots now expose bulk pressure activation plus
  bulk and interactive/control channel counts, making mixed browser/RDP load
  diagnosis explicit when bulk windows are reduced to protect interactive traffic.
- `mesh-live-smoke` now exercises that mixed-load scheduler path and reports bulk
  pressure activation plus bulk/interactive window recommendations.
- Flow-scheduler route recovery telemetry now records per-channel route switches,
  the failed route a channel recovered from, and aggregate recovered-channel /
  switch counts, making alternate-route recovery measurable during load tests.
- `mesh-live-smoke` now also exercises a primary-route failure followed by an
  alternate-route success and reports the resulting route switch count.
- The same smoke output reports measured route recovery milliseconds for the
  synthetic failover path.
- Smoke now includes max/average route recovery timing from the scheduler
  aggregate snapshot as well.
- Route recovery telemetry includes failure/switch timestamps and recovery
  duration in milliseconds for each recovered flow channel.
- Scheduler snapshots also aggregate route recovery max/average milliseconds
  across recovered channels for quick load-test health checks.
- Route recovery telemetry now includes normalized switch reasons and aggregate
  reason counts, so load tests can distinguish peer failures, timeouts, and other
  route-break causes.
- `mesh-live-smoke` reports the synthetic route-recovery reason beside recovery
  timing and switch count.
- Common route switch reasons are bucketed into stable labels such as `timeout`,
  `peer_unavailable`, `connection_refused`, `connection_reset`,
  `no_route_to_host`, and `capacity_limited` to keep heartbeat cardinality
  bounded.
- Flow-scheduler snapshots now include a machine-readable pressure level
  (`nominal`, `warning`, `critical`) and bounded reason list derived from drops,
  route failures, route recovery, slow channels, bulk pressure, and adaptive
  backpressure.
- `mesh-live-smoke` reports the mixed-load scheduler pressure level and reasons.
- Endpoint ranking treats `capacity_limited` observations as a soft pressure
  penalty instead of a hard recent failure, enabling load spreading without
  marking the carrier unhealthy.
- Local QUIC stream-limit pressure is now emitted as a capacity observation with
  no failure-count increment, allowing control plane to spread load without
  treating saturation as packet-path breakage.
- Cached QUIC carrier idle TTL is configurable through
  `RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS` / `-vpn-fabric-quic-idle-ttl` and
  propagated by host-agent install profiles.

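The batch-splitting behavior described in the status notes (split by traffic class and selected stream, with fanout bounded by the configured shards) can be sketched as below. The `Packet` and `StreamKey` types and the use of a flow hash modulo the shard count to pick a stream are assumptions for illustration; the notes do not say how the real transport selects a stream within a class.

```go
package main

import "fmt"

// Packet is a simplified VPN packet with a precomputed flow key
// (for example a hash of the 5-tuple, computed elsewhere).
type Packet struct {
	Class   string // "interactive" or "bulk"
	FlowKey uint32
	Data    []byte
}

// StreamKey identifies one logical fabric stream: a class plus a shard index.
type StreamKey struct {
	Class string
	Shard int
}

// SplitBatch groups a batch by (class, shard). shards[class] is the number of
// configured shards for that class; unknown classes get a single shard. The
// number of groups can never exceed the total number of configured shards
// plus one per unknown class, however many flows the batch contains.
func SplitBatch(batch []Packet, shards map[string]int) map[StreamKey][]Packet {
	out := make(map[StreamKey][]Packet)
	for _, p := range batch {
		n := shards[p.Class]
		if n < 1 {
			n = 1
		}
		k := StreamKey{Class: p.Class, Shard: int(p.FlowKey % uint32(n))}
		out[k] = append(out[k], p)
	}
	return out
}

func main() {
	shards := map[string]int{"interactive": 1, "bulk": 2}
	var batch []Packet
	for flow := uint32(0); flow < 10; flow++ {
		batch = append(batch, Packet{Class: "bulk", FlowKey: flow})
	}
	batch = append(batch, Packet{Class: "interactive", FlowKey: 99})

	groups := SplitBatch(batch, shards)
	fmt.Println("fanout:", len(groups))
	for k, pkts := range groups {
		fmt.Println(k.Class, k.Shard, len(pkts))
	}
}
```

Hashing by flow keeps packets of one flow on one stream, which preserves per-flow ordering while still spreading many flows across the shards.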
Deliverables:

- carry binary frames over one persistent WebSocket/TCP connection;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
  a gated mode;
- keep HTTP forwarding as fallback.

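The fallback requirement, together with the status notes on walking ranked candidates and classifying stream-limit rejects as capacity pressure, suggests a dial loop along these lines. This is a hedged sketch: `DialRanked`, `Endpoint`, `DialResult`, and both error values are hypothetical names, and the example addresses are placeholders.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrCapacityLimited = errors.New("dial: carrier at stream limit")
	ErrUnreachable     = errors.New("dial: endpoint unreachable")
)

// Endpoint is one candidate carrier address.
type Endpoint struct {
	Transport string // "quic", "websocket", "https"
	Addr      string
}

// Dialer attempts to open a fabric session on one endpoint.
type Dialer func(Endpoint) error

// DialResult records what the walk observed, for telemetry.
type DialResult struct {
	Selected        Endpoint
	Failures        int // real failures, fed back as health observations
	CapacityLimited int // saturation, kept separate so healthy carriers are not demoted
	UsedLegacy      bool
}

// DialRanked tries every ranked candidate in order and only then falls back
// to the legacy endpoint, so one failed fast-path candidate never blocks the
// compatibility transport.
func DialRanked(cands []Endpoint, legacy Endpoint, dial Dialer) (DialResult, error) {
	var res DialResult
	for _, ep := range cands {
		err := dial(ep)
		if err == nil {
			res.Selected = ep
			return res, nil
		}
		if errors.Is(err, ErrCapacityLimited) {
			res.CapacityLimited++
		} else {
			res.Failures++
		}
	}
	if err := dial(legacy); err != nil {
		return res, fmt.Errorf("all candidates and legacy endpoint failed: %w", err)
	}
	res.Selected = legacy
	res.UsedLegacy = true
	return res, nil
}

func main() {
	cands := []Endpoint{
		{Transport: "quic", Addr: "quic://peer-a:4433"},
		{Transport: "websocket", Addr: "wss://peer-a/mesh/v1/fabric/session/ws"},
	}
	legacy := Endpoint{Transport: "https", Addr: "https://peer-a/mesh/v1/forward"}

	// Simulated dialer: the QUIC carrier is saturated, WebSocket works.
	dial := func(ep Endpoint) error {
		if ep.Transport == "quic" {
			return ErrCapacityLimited
		}
		return nil
	}
	res, err := DialRanked(cands, legacy, dial)
	fmt.Println(res.Selected.Transport, res.CapacityLimited, res.Failures, res.UsedLegacy, err)
}
```

Counting capacity rejects separately from failures is what allows ranking to apply a soft penalty to a saturated carrier instead of marking it unhealthy.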
### Stage FNP-4: Android As Mobile Fabric Node

Deliverables:

- Android advertises node capabilities, network state, battery state, and
  supported transports;
- Android opens Fabric Data Session V1 to entry;
- VPN packets map to independent streams/classes;
- diagnostics can run per-stream and per-route tests.

### Stage FNP-5: Route Leases And Multipath

Deliverables:

- route result includes primary and alternate routes;
- runtime can switch new streams to a better route;
- interactive streams can recover quickly after route fencing;
- route health uses dataplane metrics, not only HTTP request success.

### Stage FNP-6: QUIC/UDP Transport

Status: started with `QUICFabricTransport` in `internal/mesh`.

Deliverables:

- implement QUIC transport for Fabric Data Session V1;
- preserve WebSocket/TCP as fallback;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.

### Stage FNP-7: Distributed Service Discovery

Deliverables:

- service names map to eligible service replicas;
- admin console and VPN service can move between service-host nodes;
- service failover is expressed as leases and route updates.

### Stage FNP-8: Mobile Relay And Distributed Capacity

Deliverables:

- mobile nodes can opt into relay under strict policy;
- scheduler scores battery, metered network, NAT, trust, and load;
- route planner can use mobile nodes where they are closer/faster;
- accounting and abuse controls are enforced.

### Stage FNP-9: Scale To Large Fleets

Deliverables:

- entry and route coordinator pools;
- peer directory sharding;
- telemetry sampling and aggregation;
- per-tenant quotas and fairness;
- load tests for 1000 simulated devices, then larger synthetic fleets.

## Immediate Next Action

Start Stage FNP-1 in `rap-node-agent` as a non-production protocol package. The
goal is to create the binary frame contract and tests without disturbing the
current VPN path. After that, wire it into a gated persistent session runtime and
only then move Android/VPN traffic onto it.