# Distributed Fabric Node Protocol Plan

This document fixes the target direction for the Secure Access Fabric after the
VPN performance investigation. The platform must not be treated as a VPN
server, RDP gateway, or web console. It is a distributed overlay transport where
every participating device is a fabric node, and VPN/RDP/HTTP/admin/storage are
services running over that fabric.

## Core Position

Every device is a node.

A phone, home server, cloud server, relay, admin-console host, storage host, and
update-cache host share the same base identity model. They differ by roles,
capabilities, policy, trust level, and current health.

```text
Node = identity + roles + capabilities + policy + health + local state
```
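
The formula above can be read as a data structure. The following is a minimal Go sketch under stated assumptions: the field names, the field types, and the `CanAct` helper are illustrative and do not come from the existing codebase.

```go
package main

import "fmt"

// Role is a role name from the role vocabulary in this plan.
type Role string

// Node mirrors "identity + roles + capabilities + policy + health + local state".
// All field names and types here are hypothetical.
type Node struct {
	ID           string            // stand-in for the cryptographic node identity
	Roles        map[Role]bool     // roles granted by policy
	Capabilities map[string]string // advertised capability facts
	PolicyEpoch  uint64            // version of the policy the node holds
	Healthy      bool              // coarse stand-in for current health
	LocalState   map[string][]byte // caches, leases, peer directory
}

// CanAct reports whether the node holds a role and is healthy enough to use it.
func (n Node) CanAct(r Role) bool {
	return n.Healthy && n.Roles[r]
}

func main() {
	phone := Node{
		ID:      "phone-a",
		Roles:   map[Role]bool{"mobile-edge": true},
		Healthy: true,
	}
	// The phone is a node with the mobile-edge role, but it is not a relay
	// until policy grants that role.
	fmt.Println(phone.CanAct("mobile-edge"), phone.CanAct("relay"))
}
```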

The Android VPN app is therefore not only a client. It is a mobile fabric node.
It may carry VPN traffic, participate in route discovery, relay traffic when
policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.

Node survival and recovery across endpoint moves, NAT-only reachability, legacy
contract overlap, and unavailable manual host access are governed by
`docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
fabric/update/recovery plane even when direct host login is unavailable.

Android implementation contract:

- app install/build contains a QUIC bootstrap seed set;
- runtime launch carries a `fabric_bootstrap_config`, not a backend URL;
- user login/profile selection happens over the fabric control channel;
- the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet
  forwarding, WebSocket packet relay, and direct backend packet relay are not
  part of the supported runtime path.

## What Was Missing

The current implementation proves route leases and production VPN forwarding,
but it still has a data-plane shape that cannot scale to high throughput:

- too much payload traffic is carried as small request/response HTTP forwarding
  calls;
- JSON/base64 payload envelopes add overhead and CPU cost;
- one overloaded stream can delay unrelated traffic;
- route health is visible, but the transport does not yet provide enough
  low-latency per-stream feedback;
- the phone behaves mostly as a service client, not as a full fabric node;
- service discovery and route execution are not yet separated cleanly enough;
- fallback paths can keep traffic alive, but can also hide architecture
  bottlenecks if used as the primary data plane.

To sustain 100 Mbps per active device at 1000+ devices, and later at millions
of devices, the fabric must move to a persistent, binary, multiplexed data plane
with explicit route and stream semantics.

## Non-Negotiable Principles

1. Fabric is the lower transport layer. VPN, RDP, HTTP, admin console, storage,
   and update delivery are services above it.
2. Service adapters must not discover topology, own route selection, or invent
   failover logic. They request transport from the fabric.
3. Control plane and data plane are separate. API/console traffic must not be
   the packet transport mechanism.
4. Every data session carries many independent streams. A blocked bulk download
   must not stall RDP, DNS, control, or telemetry.
5. Routes are leased and replaceable. Route selection uses quality, policy,
   locality, role eligibility, cost, trust, and current load.
6. The fabric is distributed. Central control can coordinate, but the runtime
   must keep working through cached policy, peer directories, route leases, and
   local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may
   serve human-facing download or panel pages, but it is not a node data-plane
   fallback and must not carry service packets.
9. There must be no single management service that can seize the fabric. Control,
   storage, update distribution, route authority, and certificate authority are
   fabric roles assigned to eligible nodes and protected by quorum signatures.
   A web/API endpoint is only an access replica for a signed state log, not the
   owner of cluster truth.
10. IP addresses and DNS names are never authority. Nodes announce signed
    endpoint candidates for every usable interface, public/reflexive address,
    local segment address, reverse channel, and relay fallback. Neighbors select
    the usable candidate locally by policy, reachability, latency, load, and
    trust.

## Transport vs Control API

The system must keep two layers separate in naming, design, and diagnostics:

- `Fabric Transport` means inter-node runtime delivery only. It is QUIC over UDP
  and carries leased service-channel/data-plane traffic between nodes.
- `Control API` means human/operator/programmatic management surfaces such as
  web-admin, release publication, policy mutation, audit queries, and status
  reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.

The HTTP Control API is not a fallback transport for node-to-node runtime
traffic. A `409 Conflict` from the backend, a panel page load, or a release
download is control-plane behavior, not fabric transport behavior.

## Distributed Control And Trust

The target fabric behaves like a distributed network, not a client/server
management product. The cluster has a replicated signed state log and many
service replicas. Any node with the right role can serve API, storage, update,
or route-coordinator duties, but no single replica can mutate cluster authority
alone.

Required trust model:

- Every node has a long-lived node identity key and short-lived role
  certificates. The node identity is cryptographic; the current IP, hostname,
  NAT address, or container name is only an endpoint candidate.
- Cluster authority is threshold-based. Root or high-risk changes require M-of-N
  signatures from authorized control-authority nodes or hardware/offline
  operator keys.
- Role certificates are scoped by action, organization/tenant, service,
  partition, validity window, and allowed delegation depth.
- Update releases, route leases, peer-directory epochs, storage shard placement,
  node approvals, role changes, and authority rotations are signed records in
  the state log.
- A node accepts control data only when it can verify signatures, epoch/fencing,
  expiry, target cluster, target node or role scope, and monotonic generation.
- A compromised API replica can withhold or delay data, but cannot forge updates,
  route authority, new certificates, node roles, or cluster ownership.
- Bootstrap may use a temporary centralized signer for development, but
  production mode must mark that signer as non-authoritative unless quorum
  signatures are present.

Authority levels:

- `root-authority`: rotates cluster root and quorum membership. Offline or
  hardware-backed where possible. Rarely online.
- `control-authority`: approves node join, role changes, policy epochs, and
  route-authority membership through quorum.
- `route-authority`: signs short-lived route leases and relay/rendezvous
  assignments for a shard or partition.
- `update-authority`: signs release metadata, compatibility, artifact hashes,
  rollback windows, and staged rollout policy.
- `storage-authority`: signs storage shard manifests, replication factors,
  retention policy, and recovery epochs.
- `observer-authority`: can sign telemetry observations only; it cannot mutate
  routing, roles, updates, or secrets.

Required anti-takeover controls:

- No bearer admin token may grant fabric-wide mutation without a signed authority
  envelope.
- No node may accept unsigned update metadata or an artifact whose hash is not
  signed by update-authority quorum.
- No node may accept unsigned route changes for production channels.
- No node may promote itself into control, storage, update, relay, or route
  authority roles without a quorum-signed role certificate.
- Authority and role certificates must have short validity, explicit scopes, and
  revocation/fencing epochs.
- Nodes must pin the cluster root/quorum descriptor and reject unexpected root
  changes unless the old quorum signs the transition or an offline recovery
  policy is invoked.

Endpoint state is also distributed:

- Nodes publish signed endpoint-candidate sets containing local interfaces,
  public/reflexive STUN/ICE candidates, NAT group/local segment identifiers,
  relay fallback, and passive reverse-channel availability.
- Endpoint candidates expire quickly. When a node changes IP, it reconnects
  passively to any reachable fabric peer or API replica and publishes a new
  signed candidate epoch.
- Peers keep using cached valid candidates and route leases while refreshing
  from any reachable replica or neighbor gossip path.
- Neighbor selection is local and latency/load-aware; the state log announces
  facts and policy, not a forced single next hop.

### Fabric Registry Gossip

Moving a service must not break the farm.

`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
old nodes. It is not cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.

The rule is:

- any node may relay registry knowledge;
- only authorized signatures can create or replace trusted registry truth;
- a new record becomes active only after signature/authority checks and a
  successful live probe through the fabric or a policy-approved direct QUIC
  candidate;
- older still-valid records remain as fallback until their TTL expires.

Registry record shape:

```text
schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
  - endpoint_id
    address: quic://...
    transport: direct_quic | relay_quic | reverse_quic
    reachability
    connectivity_mode
    priority / weight
    peer_cert_sha256
signatures:
  - key_id
    issuer_id
    role
    alg: ed25519
    value
```

Acceptance algorithm:

1. Reject records for a different cluster, expired records, future records past
   allowed clock skew, unsupported schema, missing endpoints, or non-QUIC
   endpoints.
2. Verify the canonical record payload, excluding `signatures`, against the
   configured authority set.
3. Check the signer role is allowed for that service and scope.
4. Require quorum where policy says M-of-N; development may use one trusted
   signer but must mark that signer as bootstrap/development authority.
5. Store accepted records as `candidate`.
6. Promote `candidate` to `active` only after live-probing at least one endpoint
   and verifying the endpoint identity/pin.
7. Prefer higher epoch, then newer issued time, then generation. Do not replace
   a live active record with an older record.
8. Keep the previous active record usable as fallback until TTL expiry when a
   newer candidate is not yet live-verified.

This is the recovery path for mass moves. If every known service endpoint moves
at once, the operator or a control-authority node only has to deliver a signed
registry record to one reachable fabric node. That node validates it, probes it,
promotes it, and gossips it onward. User/mobile/candidate nodes may carry the
record, but cannot make it authoritative unless their role certificate permits
that service/scope.

Service classes that must use this registry before production hardening:

- `control-api`: heartbeat, auth/profile control projection, node registration,
  policy/snapshot fetch.
- `update-store`: signed release manifests and compatibility windows.
- `update-cache`: artifact mirrors close to nodes.
- `web-admin`: management UI/API ingress replicas.
- `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
  nodes.

Legacy endpoint compatibility is allowed only for rolling migration:

- Old nodes may use their baked HTTP/control URL only to fetch a new version or
  a signed registry bootstrap record.
- New nodes must treat fixed URLs as fallback hints, not as authority.
- Old code is removed only after every live node reports a version that supports
  signed registry gossip and service discovery by role.

Listener configuration is split into bind sockets and reachability candidates:

- `listen_addr` is what the local process binds, for example
  `0.0.0.0:18080` on `home-1`.
- `endpoint_candidates` is the ordered set of addresses other nodes may try.
  A single node can publish LAN addresses, addresses on several network
  adapters, STUN/reflexive addresses, and multiple public NAT forwards from
  different providers.
- Public NAT forwards are modeled as candidates with metadata, not as a
  replacement for the internal bind address. Example:
  `quic://94.141.118.222:19199 reachability=public connectivity=direct
  provider=isp1 maps_to=192.168.200.85:18080`.
- A candidate may be valid only from outside the NAT. Same-LAN hairpin failure
  is not a proof that the public candidate is broken; verification must be
  scoped to an external peer or remote probe.
- The route builder scores candidates by reachability, measured latency, loss,
  load, policy, and verification freshness. If one provider or interface fails,
  the node keeps the same node identity and republishes a new candidate epoch.

## Install Artifact Bootstrap Contract

Every installable artifact is a node image plus a bootstrap seed set.

This applies to Android, Docker, Linux services, and Windows services. The seed
set is baked into the artifact or delivered beside it as signed install
metadata. It is not a single backend URL and not a management server choice. It
is a bounded list of known fabric endpoint candidates that may be reachable from
different network positions:

- public QUIC candidates, for example `usa-los-1` or externally reachable
  `home-1`;
- private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
- closed-site candidates that have no Internet route themselves but can reach a
  neighboring fabric node;
- optional pinned certificate hashes or authority descriptors for high-trust
  entry candidates.

On first start the installed node tries the seed set, joins through any reachable
peer, registers as a candidate node with minimal rights, and then receives
signed peer-directory, role, update, and policy state through the fabric. If a
node is installed in an isolated network, it can still become visible and usable
when at least one nearby seed node can route onward to the rest of the fabric.
User login on Android is only identity/profile selection for the `vpn-client`
service; the underlying phone node already exists and participates in the
fabric with candidate permissions.

## Node Roles

Initial role vocabulary:

- `mobile-edge`: mobile Android/iOS fabric node.
- `entry`: accepts external sessions.
- `relay`: forwards fabric traffic between nodes.
- `exit`: terminates routes into a target network or service zone.
- `service-host`: runs service adapters such as admin console, VPN exit, RDP,
  HTTP ingress, storage, or update-cache.
- `control-plane`: participates in control authority, policy decisions, route
  authority, or quorum work.
- `route-coordinator`: calculates or assists route candidates for a partition,
  region, or service class.
- `storage`: stores approved replicated fabric state.
- `observer`: collects telemetry and health without carrying user traffic.
- `update-cache`: mirrors signed artifacts close to nodes.

Roles are policy decisions, not binary builds. A phone can theoretically receive
any role, but scheduler scoring must account for battery, OS restrictions, NAT,
uplink stability, foreground state, and user cost policy.

## Capability Model

Nodes must advertise capability facts in heartbeats and peer updates:

- supported fabric protocol versions;
- supported transport: UDP/QUIC;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
- battery state, charging state, mobile/wifi network type, metered policy;
- max relay bandwidth and allowed traffic classes;
- service roles and service capacity;
- trust tier and allowed tenant/organization scopes;
- local policy version, peer directory version, route cache version.

## Fabric Data Session V1

The first practical protocol step is a persistent binary QUIC data session.
The framing stays service-neutral, but the runtime transport is QUIC only.

Minimum frame set:

```text
HELLO          node identity, protocol version, capabilities
AUTH           signed session token or mTLS-bound proof
SESSION_READY  accepted limits, route epoch, peer epoch
OPEN_STREAM    stream id, service id, traffic class, route id
DATA           stream id, sequence, flags, payload
ACK            stream id, received sequence/window
PING/PONG      RTT and liveness
ROUTE_UPDATE   new route lease or alternate route set
STREAM_CREDIT  per-stream backpressure window
NODE_PRESSURE  queue/cpu/memory/network pressure signal
CLOSE_STREAM   normal stream close
RESET_STREAM   failed stream, other streams remain alive
GOAWAY         draining or protocol shutdown
```
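
As an illustration of what a binary `DATA` frame might look like on the wire, here is a minimal Go sketch. The byte layout (1-byte type, 4-byte stream id, 8-byte sequence, 1-byte flags, 4-byte payload length), the numeric value of `FrameData`, and the 64 KiB payload limit are assumptions made for this example; the actual frame contract is the subject of Stage FNP-1.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	FrameData  byte = 5 // placeholder numeric value
	headerLen       = 1 + 4 + 8 + 1 + 4
	MaxPayload      = 64 * 1024 // assumed size limit
)

// Frame is a hypothetical in-memory form of a DATA frame.
type Frame struct {
	Type     byte
	StreamID uint32
	Sequence uint64
	Flags    byte
	Payload  []byte
}

// Encode writes the fixed header followed by the payload.
func Encode(f Frame) ([]byte, error) {
	if len(f.Payload) > MaxPayload {
		return nil, errors.New("payload too large")
	}
	buf := make([]byte, headerLen+len(f.Payload))
	buf[0] = f.Type
	binary.BigEndian.PutUint32(buf[1:5], f.StreamID)
	binary.BigEndian.PutUint64(buf[5:13], f.Sequence)
	buf[13] = f.Flags
	binary.BigEndian.PutUint32(buf[14:18], uint32(len(f.Payload)))
	copy(buf[headerLen:], f.Payload)
	return buf, nil
}

// Decode validates the declared length before trusting the payload.
func Decode(buf []byte) (Frame, error) {
	if len(buf) < headerLen {
		return Frame{}, errors.New("short frame")
	}
	n := binary.BigEndian.Uint32(buf[14:18])
	if n > MaxPayload || int(n) != len(buf)-headerLen {
		return Frame{}, errors.New("bad payload length")
	}
	return Frame{
		Type:     buf[0],
		StreamID: binary.BigEndian.Uint32(buf[1:5]),
		Sequence: binary.BigEndian.Uint64(buf[5:13]),
		Flags:    buf[13],
		Payload:  append([]byte(nil), buf[headerLen:]...),
	}, nil
}

func main() {
	raw, err := Encode(Frame{Type: FrameData, StreamID: 7, Sequence: 42, Payload: []byte("hello")})
	if err != nil {
		panic(err)
	}
	f, err := Decode(raw)
	fmt.Println(len(raw), f.StreamID, f.Sequence, string(f.Payload), err)
}
```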

Traffic classes:

- `control`: authorization, route updates, attach/detach, liveness.
- `dns`: small, latency-sensitive name resolution.
- `interactive`: RDP input, SSH interactive, UI control.
- `reliable`: normal web/API traffic.
- `bulk`: downloads, uploads, sync, large media.
- `droppable`: telemetry samples, optional probes, low-value background data.

Each stream has independent flow control and backpressure. Bulk can be slowed or
moved to another route without blocking control or interactive streams.
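
Independent per-stream credit can be shown with a small in-memory model. This is a sketch only: the `Stream` type and its `Grant` and `Flush` methods are hypothetical, and a real runtime would also need locking, ACK handling, and timeouts.

```go
package main

import "fmt"

// Stream holds a send window granted by the peer through STREAM_CREDIT.
type Stream struct {
	ID     uint32
	Class  string
	Credit int      // bytes the peer currently allows
	Queue  [][]byte // payloads waiting for credit
}

// Grant models receipt of a STREAM_CREDIT frame for this stream.
func (s *Stream) Grant(n int) { s.Credit += n }

// Flush sends queued payloads while credit lasts and returns the bytes sent.
// A stream without credit simply keeps its queue; other streams are unaffected.
func (s *Stream) Flush(send func(id uint32, p []byte)) int {
	sent := 0
	for len(s.Queue) > 0 && len(s.Queue[0]) <= s.Credit {
		p := s.Queue[0]
		s.Queue = s.Queue[1:]
		s.Credit -= len(p)
		sent += len(p)
		send(s.ID, p)
	}
	return sent
}

func main() {
	bulk := &Stream{ID: 1, Class: "bulk", Queue: [][]byte{make([]byte, 4096)}}
	rdp := &Stream{ID: 2, Class: "interactive", Credit: 1024,
		Queue: [][]byte{[]byte("key-event")}}
	send := func(id uint32, p []byte) {
		fmt.Printf("stream %d sent %d bytes\n", id, len(p))
	}
	// The bulk stream has no credit yet, so only the interactive stream sends.
	for _, s := range []*Stream{bulk, rdp} {
		s.Flush(send)
	}
	// Once the peer grants credit, the bulk stream drains its queue.
	bulk.Grant(8192)
	bulk.Flush(send)
}
```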

## Route Model

The fabric must maintain multiple candidate routes for an active session:

```text
phone-a -> entry-1 -> home-1
phone-a -> phone-b -> relay-2 -> home-1
phone-a -> entry-2 -> relay-4 -> service-host-7
```

Route scoring inputs:

- policy and role eligibility;
- route length and failure domains;
- RTT, jitter, packet loss, bandwidth estimate;
- queue depth and retransmit pressure;
- current node CPU/memory/socket pressure;
- mobile battery/charging/metered status;
- historical reliability;
- service locality;
- tenant/organization isolation;
- cost and operator preference.

Routes are issued as short leases with route id, epoch, allowed channels,
allowed service classes, hop list or next-hop policy, expiry, and fencing rules.

## Service Discovery

Services are logical names, not fixed hosts:

```text
service: admin-console
replicas: home-1, node-2, node-9
policy: active-active or leader/follower
ingress: vpn.cin.su / admin.cin.su / internal name
```

`vpn.cin.su` as an HTTP/HTTPS entry is a service endpoint. It can be hosted on
any eligible service-host node. If one replica fails, another replica can accept
the service lease and traffic can be routed to it.

## Scale Model

For 1000 devices, the platform needs entry pools, exit pools, route leases,
session placement, and overload protection.

For millions of devices, the platform additionally needs regional route
coordinators, distributed peer directories, local control partitions, telemetry
sampling, policy sharding, and resource accounting.

Every device joining the system increases potential edge capacity, but only if
the scheduler can safely decide when that node is allowed to relay, store, serve,
or only consume.

## Security And Abuse Controls

The distributed model increases power and also risk. The following controls are
required before mobile relay/control/storage roles are broadly enabled:

- node identity is cryptographic; IP address is never identity;
- all route leases are signed or locally verifiable;
- roles are scoped by organization, tenant, service, and time;
- mobile relay is opt-in by policy and user/device state;
- storage uses encrypted shards and explicit retention policy;
- control-plane participation requires trust tier and quorum policy;
- nodes never receive more topology or secret data than their role requires;
- abuse controls rate-limit relay use, route churn, and failed authentication;
- traffic accounting records who relayed what class and how much, without
  exposing payload contents.

## Observability

The current tests show why aggregate "VPN works" is not enough. The fabric needs
per-node, per-route, and per-stream metrics:

- throughput by direction and traffic class;
- RTT, jitter, loss, retransmits, queue depth;
- frame encode/decode errors;
- stream resets and close reasons;
- route switch reason and time to recovery;
- node pressure and scheduler decisions;
- service discovery failover events;
- Android foreground/background and network transition events.

## Work Plan

### Stage FNP-0: Architecture Lock

Status: this document.

Deliverables:

- fix "every device is a node" as the model;
- separate fabric, services, control, and data plane;
- define missing protocol, route, scale, security, and observability pieces.

### Stage FNP-1: Binary Frame Contract

Deliverables:

- add a transport-neutral Go package for Fabric Data Session V1 frame types;
- encode/decode binary frames with size limits and validation;
- add tests for malformed frames, max frame size, stream ids, and frame type
  compatibility;
- do not connect it to production traffic yet.

### Stage FNP-2: Persistent Session Runtime Skeleton

Status: in progress in `agents/rap-node-agent/internal/fabricproto`.

Deliverables:

- implement in-memory session runtime with streams, sequence numbers, ACK,
  stream credit, reset, and close;
- handle protocol frames for open/data/ack/credit/reset/close/ping/goaway;
- prove that a blocked bulk stream does not block control/interactive streams;
- expose per-stream metrics.
|
|
|
|
### Stage FNP-3: WebSocket/TCP Compatibility Transport
|
|
|
|
Status: retired as a migration-only stage.
|
|
|
|
This stage existed to bootstrap binary frame semantics before QUIC routing and
|
|
carrier reuse were ready. It introduced the transport-neutral frame loop,
|
|
session-shaped packet mapper, and early smoke tooling. That work was useful as
|
|
scaffolding, but it is no longer the target runtime.
|
|
|
|
Current rule:
|
|
|
|
- WebSocket/TCP fabric-session transport is not part of the supported node
|
|
dataplane.
|
|
- QUIC/UDP is the only supported runtime carrier between fabric nodes.
|
|
- Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling
|
|
must move to QUIC-native smoke and recovery paths.
|
|
- Any routing, heartbeat, registry, peer probe, or service dataplane logic must
|
|
reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a
|
|
valid alternate path.
|
|
|
|
What survives from this stage is the service-neutral frame model and the
|
|
`FabricSessionPacketTransport` mapping, which now ride on QUIC carriers instead
|
|
of a WebSocket fallback.
|
|
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
|
|
so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC
|
|
peer endpoints when the control plane has not published valid candidates yet.
|
|
The temporary self-signed QUIC listener advertises its SHA-256 certificate
|
|
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
|
|
instead of disabling verification while the cluster CA path is being finished.
|
|
VPN fabric-session dialing now walks all ranked endpoint candidates before
|
|
declaring the target unavailable, so a failed QUIC candidate does not silently
|
|
re-enable WebSocket/HTTPS compatibility transport.
|
|
Successful VPN fabric-session dialing logs the selected candidate, transport,
|
|
certificate pin usage, and remaining fallback count for phone-side diagnostics.
|
|
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
|
|
candidate failures, selected transport family, certificate pin usage, and the
|
|
last selected endpoint/failure reason.
|
|
VPN fabric-session dialing feeds candidate success/failure observations back
|
|
into endpoint ranking, so repeated local QUIC failures can temporarily demote
|
|
that endpoint while preserving it as a later fallback.
|
|
Endpoint scoring no longer treats missing/zero latency on failed observations as
|
|
moderate latency, preventing failed candidates from receiving a false score
|
|
bonus.
|
|
Endpoint health observations are now emitted as a bounded standalone heartbeat
|
|
report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
|
|
candidate feedback without parsing the transport diagnostics blob.
|
|
VPN fabric-session transport telemetry is carrier-neutral
|
|
(`fabric_session_binary_frames`) and reports QUIC selection plus non-QUIC
|
|
candidate rejection instead of describing the dataplane as WebSocket-capable.
|
|
Endpoint health observations are pruned in-memory by age and count before
|
|
snapshot/report generation, preventing long-running nodes from accumulating
|
|
unbounded candidate history.
|
|
Scoped and control-plane synthetic mesh config can now carry
|
|
`peer_endpoint_observations`, and VPN fabric-session endpoint ranking merges
|
|
those remote health hints with local observations using the newest signal.
|
|
Endpoint health observations include source and reporter node fields so control
|
|
plane can distinguish local dial feedback from aggregated or policy-generated
|
|
health hints.
|
|
The endpoint health heartbeat report also includes the reporter node id at the
|
|
report level for simpler multi-node ingestion and diagnostics.
|
|
Peer cache construction now applies endpoint health observations when ranking
|
|
peer endpoint candidates, so recovery and warm-peer decisions see the same
|
|
degraded-path feedback as VPN fabric-session dialing.
|
|
Peer cache snapshots expose best-candidate score reasons, giving diagnostics a
|
|
direct explanation for why a QUIC, WebSocket, relay, or fallback endpoint was
|
|
chosen.
|
|
Heartbeat capabilities now advertise that peer-cache endpoint ranking consumes
|
|
health observations, allowing control plane and UI diagnostics to detect nodes
|
|
running the health-aware peer selection path.
|
|
VPN fabric QUIC transport now reuses QUIC connections per peer endpoint and
|
|
opens logical fabric-session streams on top, with heartbeat telemetry for QUIC
|
|
connection opens, reuses, evictions, and active count.
|
|
Cached QUIC connections are pruned by idle TTL, preventing long-running agents
|
|
from holding unused peer connections indefinitely.
|
|
QUIC carrier connections now track active logical streams and enforce a
|
|
per-connection stream limit, exposing stream opens/closes and limit rejects in
|
|
transport telemetry.
|
|
The per-connection QUIC stream limit is configurable through
|
|
`RAP_VPN_FABRIC_QUIC_MAX_STREAMS_PER_CONN` /
|
|
`-vpn-fabric-quic-max-streams-per-conn` and propagated by host-agent install
|
|
profiles.
|
|
QUIC stream-limit rejects are classified as capacity pressure instead of peer
|
|
endpoint failure, so local health feedback does not incorrectly demote a healthy
|
|
but saturated carrier.
|
|
VPN fabric dial telemetry records the last capacity-limited endpoint and
|
|
transport, making stream saturation visible without poisoning endpoint health
|
|
observations.
|
|
The same dial telemetry now keeps bounded per-endpoint capacity-pressure
|
|
counters, so operators can see whether stream saturation is occasional or
|
|
concentrated on a specific QUIC carrier.
|
|
Fresh local capacity-pressure counters also feed endpoint ranking as a bounded
|
|
penalty, spreading new fabric sessions away from a saturated carrier without
|
|
declaring that carrier failed.
|
|
VPN fabric-session transport now opens configurable per-class stream shards
|
|
for interactive and bulk packet traffic, so heavy browser flows do not share a
|
|
single logical stream with latency-sensitive RDP/control packets.
|
|
Host-agent install commands for Docker, Linux, and Windows expose the same
|
|
VPN fabric-session/QUIC tuning flags as install profiles, keeping manual and
|
|
profile-based rollout paths aligned.
|
|
Gateway runtime snapshots include the fabric-session packet transport stream
|
|
layout and send counters by traffic class/stream id for load-test diagnosis.
|
|
Those snapshots also summarize configured stream class/shard counts and active
|
|
send class/stream counts, making sharding health visible without expanding
|
|
per-stream maps.
|
|
Gateway shutdown now closes all VPN fabric-session stream shards and then the
|
|
underlying fabric session, preventing stale logical streams from consuming QUIC
|
|
carrier capacity after reconnects or rollout restarts.
|
|
Gateway runtime cancellation now fans out to both upload and download loops
|
|
when either direction exits, so transport cleanup runs promptly on one-sided
|
|
TUN or carrier failures.
|
|
Fabric-session packet transport snapshots include close-frame and close-error
|
|
counters for verifying that stream shard cleanup is actually happening.
|
|
Outgoing VPN packet batches are split by traffic class and selected stream
|
|
before they are framed, so one gateway batch containing many browser flows does
|
|
not collapse onto the first packet's logical stream.
|
|
`mesh-live-smoke` now sends mixed bulk and interactive VPN packets in a single
|
|
fabric-session batch and requires them to remain sharded.
|
|
The smoke report also exposes the mixed-batch frame fanout so regressions show
|
|
up as a concrete fanout drop, not just a failed boolean.
|
|
Batch fanout is bounded by configured stream shards, so a large batch with many
|
|
flows cannot explode into unbounded fabric frames.
|
|
Heartbeat tests assert the advertised VPN fabric stream-shard count and
capability, keeping control-plane diagnostics aligned with runtime behavior.
Fabric-session packet transport snapshots now report packets per stream plus
last/max batch fanout, making real multi-site load distribution measurable from
gateway status.
Receive-side fabric-session packet counters are reported by traffic class and
stream id as well, so gateway status can compare TX and RX distribution under
browser/RDP load.

QUIC fabric transport snapshots expose the configured stream limit, saturated
connection count, and capacity pressure percentage alongside the stream-limit
rejection counters.
Closed cached QUIC connections discovered during snapshot generation now update
the transport's cumulative eviction counters, keeping successive heartbeats
consistent.
`mesh-live-smoke` reports QUIC fabric capacity-pressure percentage from the
transport snapshot, verifying that the capacity fields are populated.

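As a rough illustration of how such capacity fields could be derived, the sketch below computes a per-connection pressure percentage and a summary. The exact formula is not given in this document, so this is an assumption: pressure is taken as open streams divided by the stream limit, a connection counts as saturated at 100%, and the summary reports the busiest connection. All type and field names are hypothetical.

```go
package main

import "fmt"

// ConnState is a hypothetical view of one cached QUIC connection.
type ConnState struct {
	OpenStreams int
	StreamLimit int
}

// CapacitySummary holds illustrative summary fields; the names are not the real schema.
type CapacitySummary struct {
	StreamLimit     int
	SaturatedConns  int
	PressurePercent int // 0-100, taken from the busiest connection
}

// pressurePercent returns stream usage as a percentage clamped to 0-100.
func pressurePercent(c ConnState) int {
	if c.StreamLimit <= 0 || c.OpenStreams <= 0 {
		return 0
	}
	p := c.OpenStreams * 100 / c.StreamLimit
	if p > 100 {
		p = 100
	}
	return p
}

// summarize counts saturated connections and records the highest pressure seen.
func summarize(limit int, conns []ConnState) CapacitySummary {
	s := CapacitySummary{StreamLimit: limit}
	for _, c := range conns {
		p := pressurePercent(c)
		if p >= 100 {
			s.SaturatedConns++
		}
		if p > s.PressurePercent {
			s.PressurePercent = p
		}
	}
	return s
}

func main() {
	conns := []ConnState{
		{OpenStreams: 20, StreamLimit: 100},
		{OpenStreams: 100, StreamLimit: 100},
	}
	fmt.Printf("%+v\n", summarize(100, conns))
}
```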
QUIC fabric snapshots now include the pressure, endpoint, and saturation state
of each cached connection; VPN fabric endpoint ranking consumes that live local
pressure before stream-limit rejection, spreading new sessions away from
already busy QUIC carriers.
Per-connection QUIC snapshot entries are sorted by peer and endpoint so
heartbeats and diagnostics stay stable across reports.

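The stable ordering can be sketched with a two-key sort. `ConnEntry` and its fields are hypothetical names chosen for this example; only the ordering rule (peer first, then endpoint) comes from the text above.

```go
package main

import (
	"fmt"
	"sort"
)

// ConnEntry is a hypothetical per-connection snapshot entry.
type ConnEntry struct {
	Peer            string
	Endpoint        string
	PressurePercent int
	Saturated       bool
}

// sortEntries orders entries by peer, then endpoint, so two snapshots of the
// same connection set serialize identically regardless of map iteration order.
func sortEntries(entries []ConnEntry) {
	sort.Slice(entries, func(i, j int) bool {
		if entries[i].Peer != entries[j].Peer {
			return entries[i].Peer < entries[j].Peer
		}
		return entries[i].Endpoint < entries[j].Endpoint
	})
}

func main() {
	entries := []ConnEntry{
		{Peer: "node-b", Endpoint: "198.51.100.7:443", PressurePercent: 80},
		{Peer: "node-a", Endpoint: "203.0.113.9:443", PressurePercent: 10},
		{Peer: "node-a", Endpoint: "192.0.2.1:443", PressurePercent: 100, Saturated: true},
	}
	sortEntries(entries)
	for _, e := range entries {
		fmt.Println(e.Peer, e.Endpoint, e.PressurePercent, e.Saturated)
	}
}
```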
When local live QUIC pressure and recent capacity-limit counters overlap, the
ranking input keeps the stronger pressure signal rather than allowing a weak
fresh sample to hide a saturated endpoint.

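Expressed as code, keeping the stronger signal amounts to taking a maximum. The sketch assumes both inputs are already normalized to the same 0-100 scale, which this document does not state; `mergedPressure` is a hypothetical helper name.

```go
package main

import "fmt"

// mergedPressure returns the stronger of the live local pressure sample and
// the pressure implied by recent capacity-limit counters. Both values are
// assumed to be on the same 0-100 scale.
func mergedPressure(livePercent, recentPercent int) int {
	if livePercent > recentPercent {
		return livePercent
	}
	return recentPercent
}

func main() {
	// A fresh low sample must not hide an endpoint that recent
	// capacity-limit counters still mark as heavily loaded.
	fmt.Println("ranking input:", mergedPressure(10, 90))
}
```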
Heartbeat VPN fabric reports now include a bounded `quic_capacity_pressure`
summary sorted by busiest cached QUIC connection, making overload diagnosis
visible without digging through the full carrier snapshot.

VPN fabric flow-scheduler snapshots now expose bulk pressure activation plus
bulk and interactive/control channel counts, making mixed browser/RDP load
diagnosis explicit when bulk windows are reduced to protect interactive traffic.
`mesh-live-smoke` now exercises that mixed-load scheduler path and reports bulk
pressure activation plus bulk/interactive window recommendations.

Flow-scheduler route recovery telemetry now records per-channel route switches,
the failed route a channel recovered from, and aggregate recovered-channel and
switch counts, making alternate-route recovery measurable during load tests.
`mesh-live-smoke` now also exercises a primary-route failure followed by an
alternate-route success and reports the resulting route switch count.
The same smoke output reports measured route recovery milliseconds for the
synthetic failover path.
The smoke output now also includes max/average route recovery timing from the
scheduler aggregate snapshot.
Route recovery telemetry includes failure/switch timestamps and recovery
duration in milliseconds for each recovered flow channel.
Scheduler snapshots also aggregate route recovery max/average milliseconds
across recovered channels for quick load-test health checks.

Route recovery telemetry now includes normalized switch reasons and aggregate
reason counts, so load tests can distinguish peer failures, timeouts, and other
route-break causes.
`mesh-live-smoke` reports the synthetic route-recovery reason beside recovery
timing and switch count.
Common route switch reasons are bucketed into stable labels such as `timeout`,
`peer_unavailable`, `connection_refused`, `connection_reset`,
`no_route_to_host`, and `capacity_limited` to keep heartbeat cardinality
bounded.

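A sketch of reason bucketing follows. The labels are the ones listed above; the substring rules used to match raw error text and the `other` fallback label are assumptions made for this example, not the runtime's actual matching logic.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeReason maps free-form error text to a small, stable label set so
// that per-reason counters cannot grow without bound. The matching rules and
// the "other" fallback are illustrative assumptions.
func normalizeReason(raw string) string {
	s := strings.ToLower(raw)
	switch {
	case strings.Contains(s, "timeout"), strings.Contains(s, "timed out"),
		strings.Contains(s, "deadline exceeded"):
		return "timeout"
	case strings.Contains(s, "connection refused"):
		return "connection_refused"
	case strings.Contains(s, "connection reset"):
		return "connection_reset"
	case strings.Contains(s, "no route to host"):
		return "no_route_to_host"
	case strings.Contains(s, "stream limit"), strings.Contains(s, "capacity"):
		return "capacity_limited"
	case strings.Contains(s, "peer unavailable"), strings.Contains(s, "peer not found"):
		return "peer_unavailable"
	default:
		return "other"
	}
}

// countReasons aggregates normalized labels into bounded reason counts.
func countReasons(raw []string) map[string]int {
	counts := make(map[string]int)
	for _, r := range raw {
		counts[normalizeReason(r)]++
	}
	return counts
}

func main() {
	counts := countReasons([]string{
		"dial udp 203.0.113.9:443: i/o timeout",
		"read: connection reset by peer",
		"context deadline exceeded",
	})
	fmt.Println(counts)
}
```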
Flow-scheduler snapshots now include a machine-readable pressure level
(`nominal`, `warning`, `critical`) and bounded reason list derived from drops,
route failures, route recovery, slow channels, bulk pressure, and adaptive
backpressure.
The same pressure classification includes a bounded 0-100 score for automated
route, endpoint, and node comparisons.
`mesh-live-smoke` reports the mixed-load scheduler pressure level, score, and
reasons.

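One way to picture the level, score, and reason list is the sketch below. The level names and the 0-100 bound come from the text; the input signals, the weights, the thresholds (30 for `warning`, 70 for `critical`), and the reason cap of four are invented for illustration and should not be read as the scheduler's real tuning.

```go
package main

import "fmt"

// Signals is a hypothetical subset of the scheduler inputs named in the text.
type Signals struct {
	Drops, RouteFailures, SlowChannels int
	BulkPressure, Backpressure         bool
}

// Pressure is the classification result.
type Pressure struct {
	Level   string
	Score   int
	Reasons []string
}

const maxReasons = 4 // assumed bound on the reason list

// capAt limits v to at most limit.
func capAt(v, limit int) int {
	if v > limit {
		return limit
	}
	return v
}

// classify turns raw signals into a bounded score, a level, and a bounded
// reason list. Weights and thresholds are illustrative only.
func classify(s Signals) Pressure {
	score := 0
	var reasons []string
	add := func(points int, reason string) {
		score += points
		if len(reasons) < maxReasons {
			reasons = append(reasons, reason)
		}
	}
	if s.Drops > 0 {
		add(capAt(s.Drops*5, 40), "drops")
	}
	if s.RouteFailures > 0 {
		add(capAt(s.RouteFailures*15, 45), "route_failures")
	}
	if s.SlowChannels > 0 {
		add(capAt(s.SlowChannels*5, 20), "slow_channels")
	}
	if s.BulkPressure {
		add(15, "bulk_pressure")
	}
	if s.Backpressure {
		add(15, "adaptive_backpressure")
	}
	score = capAt(score, 100)

	level := "nominal"
	switch {
	case score >= 70:
		level = "critical"
	case score >= 30:
		level = "warning"
	}
	return Pressure{Level: level, Score: score, Reasons: reasons}
}

func main() {
	fmt.Printf("%+v\n", classify(Signals{Drops: 2, BulkPressure: true, Backpressure: true}))
}
```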
Heartbeat VPN fabric transport reports now include a compact
`flow_pressure` summary with level, score, reasons, bulk pressure, route
recovery timing, reason counts, and recommended per-class windows.
The `flow_pressure` summary includes a `recommended_action` such as
`observe`, `throttle_bulk`, `reduce_parallelism`, `prefer_faster_route`,
`observe_recovery`, `rebuild_or_reroute`, or `shed_or_reroute`.
`recommended_action` is now part of the shared `FabricFlowSchedulerSnapshot`
contract, so heartbeat reports and smoke diagnostics consume the same runtime
decision.
The scheduler's nominal snapshot explicitly reports the `observe` action.

Flow-scheduler snapshots keep a bounded pressure transition history with the
observed level, score, reasons, and recommended action. Repeated snapshots do
not duplicate unchanged pressure states, so controllers can distinguish current
state from recent worsening or recovery without unbounded heartbeat growth.

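The bounded, de-duplicated transition history can be sketched as a small append-and-trim buffer. The type names and the rule for what counts as "unchanged" (all four fields equal) are assumptions for this example.

```go
package main

import "fmt"

// PressureState is a hypothetical history entry.
type PressureState struct {
	Level   string
	Score   int
	Reasons []string
	Action  string
}

// sameState reports whether two entries describe the same pressure state.
func sameState(a, b PressureState) bool {
	if a.Level != b.Level || a.Score != b.Score || a.Action != b.Action ||
		len(a.Reasons) != len(b.Reasons) {
		return false
	}
	for i := range a.Reasons {
		if a.Reasons[i] != b.Reasons[i] {
			return false
		}
	}
	return true
}

// History keeps at most limit transitions, newest last.
type History struct {
	limit int
	items []PressureState
}

// NewHistory returns a history holding at most limit entries (minimum 1).
func NewHistory(limit int) *History {
	if limit < 1 {
		limit = 1
	}
	return &History{limit: limit}
}

// Observe records s only if it differs from the most recent entry and drops
// the oldest entry when the bound is exceeded. It reports whether s was stored.
func (h *History) Observe(s PressureState) bool {
	if n := len(h.items); n > 0 && sameState(h.items[n-1], s) {
		return false
	}
	h.items = append(h.items, s)
	if len(h.items) > h.limit {
		h.items = h.items[len(h.items)-h.limit:]
	}
	return true
}

// Items returns a copy of the stored transitions, oldest first.
func (h *History) Items() []PressureState {
	return append([]PressureState(nil), h.items...)
}

func main() {
	h := NewHistory(3)
	h.Observe(PressureState{Level: "nominal", Action: "observe"})
	h.Observe(PressureState{Level: "nominal", Action: "observe"}) // unchanged, not stored
	h.Observe(PressureState{Level: "warning", Score: 35, Reasons: []string{"bulk_pressure"}, Action: "throttle_bulk"})
	fmt.Println("transitions stored:", len(h.Items()))
}
```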
`mesh-live-smoke` reports the recommended action for its mixed bulk/interactive
load scenario.
Nodes advertise the `vpn_fabric_flow_pressure` capability when that heartbeat
summary is available.
When the VPN fabric ingress runtime has not been initialized yet, the heartbeat
still emits a nominal `flow_pressure` summary for schema stability.

Endpoint ranking treats `capacity_limited` observations as a soft pressure
penalty instead of a hard recent failure, enabling load spreading without
marking the carrier unhealthy.
Local QUIC stream-limit pressure is now emitted as a capacity observation with
no failure-count increment, allowing the control plane to spread load without
treating saturation as packet-path breakage.

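The difference between a soft capacity penalty and a hard failure can be illustrated with a small ranking sketch. The `capacity_limited` label comes from the text; the types, the penalty weights, and the ordering rule (lowest penalty first) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Observation is a hypothetical endpoint observation with a normalized reason.
type Observation struct {
	Endpoint string
	Reason   string // for example "timeout" or "capacity_limited"
}

// EndpointScore is the ranking input for one endpoint.
type EndpointScore struct {
	Endpoint string
	Failures int // hard failures only; would drive health decisions
	Penalty  int // ranking penalty, lower is better
}

const (
	hardFailurePenalty = 100 // assumed weight
	capacityPenalty    = 10  // assumed weight
)

// rank orders endpoints by penalty. capacity_limited observations add a small
// penalty without incrementing the failure count, so a saturated carrier is
// deprioritized for new sessions but not treated as broken.
func rank(endpoints []string, obs []Observation) []EndpointScore {
	scores := make([]EndpointScore, len(endpoints))
	index := make(map[string]int, len(endpoints))
	for i, e := range endpoints {
		scores[i].Endpoint = e
		index[e] = i
	}
	for _, o := range obs {
		i, ok := index[o.Endpoint]
		if !ok {
			continue
		}
		if o.Reason == "capacity_limited" {
			scores[i].Penalty += capacityPenalty
			continue
		}
		scores[i].Failures++
		scores[i].Penalty += hardFailurePenalty
	}
	sort.SliceStable(scores, func(a, b int) bool { return scores[a].Penalty < scores[b].Penalty })
	return scores
}

func main() {
	ranked := rank(
		[]string{"edge-1", "edge-2", "edge-3"},
		[]Observation{
			{Endpoint: "edge-1", Reason: "capacity_limited"},
			{Endpoint: "edge-2", Reason: "timeout"},
		},
	)
	for _, r := range ranked {
		fmt.Printf("%s failures=%d penalty=%d\n", r.Endpoint, r.Failures, r.Penalty)
	}
}
```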
Cached QUIC carrier idle TTL is configurable through
`RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS` / `-vpn-fabric-quic-idle-ttl` and
propagated by host-agent install profiles.

Deliverables:

- carry binary frames over one persistent QUIC fabric session;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
  a gated mode;
- remove HTTP/WebSocket packet forwarding from the supported dataplane.

### Stage FNP-4: Android As Mobile Fabric Node

Deliverables:

- Android advertises node capabilities, network state, battery state, and
  supported transports;
- Android opens Fabric Data Session V1 to entry;
- VPN packets map to independent streams/classes;
- diagnostics can run per-stream and per-route tests.

### Stage FNP-5: Route Leases And Multipath

Deliverables:

- route result includes primary and alternate routes;
- runtime can switch new streams to a better route;
- interactive streams can recover quickly after route fencing;
- route health uses dataplane metrics, not only HTTP request success.

### Stage FNP-6: QUIC/UDP Transport

Status: active runtime baseline in `internal/mesh`.

Deliverables:

- implement QUIC transport for Fabric Data Session V1;
- keep QUIC/UDP as the only supported inter-node runtime transport;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.

### Stage FNP-7: Distributed Service Discovery

Deliverables:

- service names map to eligible service replicas;
- admin console and VPN service can move between service-host nodes;
- service failover is expressed as leases and route updates.

### Stage FNP-8: Mobile Relay And Distributed Capacity

Deliverables:

- mobile nodes can opt into relay under strict policy;
- scheduler scores battery, metered network, NAT, trust, and load;
- route planner can use mobile nodes where they are closer/faster;
- accounting and abuse controls are enforced.

### Stage FNP-9: Scale To Large Fleets

Deliverables:

- entry and route coordinator pools;
- peer directory sharding;
- telemetry sampling and aggregation;
- per-tenant quotas and fairness;
- load tests for 1000 simulated devices, then larger synthetic fleets.

## Immediate Next Action

Start Stage FNP-1 in `rap-node-agent` as a non-production protocol package. The
goal is to create the binary frame contract and tests without disturbing the
current VPN path. After that, wire it into a gated persistent session runtime and
only then move Android/VPN traffic onto it.