Distributed Fabric Node Protocol Plan
This document fixes the target direction for the Secure Access Fabric after the VPN performance investigation. The platform must not be treated as a VPN server, RDP gateway, or web console. It is a distributed overlay transport where every participating device is a fabric node, and VPN/RDP/HTTP/admin/storage are services running over that fabric.
Core Position
Every device is a node.
A phone, home server, cloud server, relay, admin-console host, storage host, and update-cache host share the same base identity model. They differ by roles, capabilities, policy, trust level, and current health.
Node = identity + roles + capabilities + policy + health + local state
The Android VPN app is therefore not only a client. It is a mobile fabric node. It may carry VPN traffic, participate in route discovery, relay traffic when policy allows, host limited control/storage roles when approved, and report mobile-specific capacity signals such as battery, network type, NAT behavior, foreground/background state, and metered network policy.
Node survival and recovery across endpoint moves, NAT-only reachability, compat
contract overlap, and unavailable manual host access are governed by
docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md. In
particular, nodes like ifcm-rufms-s-mo1cr must remain recoverable through the
fabric/update/recovery plane even when direct host login is unavailable.
Android implementation contract:
- app install/build contains a QUIC bootstrap seed set;
- runtime launch carries a fabric_bootstrap_config, not a backend URL;
- user login/profile selection happens over the fabric control channel;
- the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet forwarding, WebSocket packet relay, and direct backend packet relay are not part of the supported runtime path.
What Was Missing
The current implementation proves route leases and production VPN forwarding, but it still has a data-plane shape that cannot scale to high throughput:
- too much payload traffic is carried as small request/response HTTP forwarding calls;
- JSON/base64 payload envelopes add overhead and CPU cost;
- one overloaded stream can delay unrelated traffic;
- route health is visible, but the transport does not yet provide enough low-latency per-stream feedback;
- the phone behaves mostly as a service client, not as a full fabric node;
- service discovery and route execution are not yet separated cleanly enough;
- fallback paths can keep traffic alive, but can also hide architecture bottlenecks if used as the primary data plane.
For 100 Mbps per active device, and for future deployments of 1000+ or millions of devices, the fabric must move to a persistent, binary, multiplexed data plane with explicit route and stream semantics.
Non-Negotiable Principles
- Fabric is the lower transport layer. VPN, RDP, HTTP, admin console, storage, and update delivery are services above it.
- Service adapters must not discover topology, own route selection, or invent failover logic. They request transport from the fabric.
- Control plane and data plane are separate. API/console traffic must not be the packet transport mechanism.
- Every data session carries many independent streams. A blocked bulk download must not stall RDP, DNS, control, or telemetry.
- Routes are leased and replaceable. Route selection uses quality, policy, locality, role eligibility, cost, trust, and current load.
- The fabric is distributed. Central control can coordinate, but the runtime must keep working through cached policy, peer directories, route leases, and local health when central components are degraded.
- Mobile nodes are first-class nodes with stricter capability scoring.
- QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may serve human-facing download or panel pages, but it is not a node data-plane fallback and must not carry service packets.
- There must be no single management service that can seize the fabric. Control, storage, update distribution, route authority, and certificate authority are fabric roles assigned to eligible nodes and protected by quorum signatures. A web/API endpoint is only an access replica for a signed state log, not the owner of cluster truth.
- IP addresses and DNS names are never authority. Nodes announce signed endpoint candidates for every usable interface, public/reflexive address, local segment address, reverse channel, and relay fallback. Neighbors select the usable candidate locally by policy, reachability, latency, load, and trust.
Transport vs Control API
The system must keep two layers separate in naming, design, and diagnostics:
- Fabric Transport means inter-node runtime delivery only. It is QUIC over UDP and carries leased service-channel/data-plane traffic between nodes.
- Control API means human/operator/programmatic management surfaces such as web-admin, release publication, policy mutation, audit queries, and status reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.
The HTTP Control API is not a fallback transport for node-to-node runtime
traffic. A 409 Conflict from the backend, a panel page load, or a release
download is control-plane behavior, not fabric transport behavior.
Distributed Control And Trust
The target fabric behaves like a distributed network, not a client/server management product. The cluster has a replicated signed state log and many service replicas. Any node with the right role can serve API, storage, update, or route-coordinator duties, but no single replica can mutate cluster authority alone.
Required trust model:
- Every node has a long-lived node identity key and short-lived role certificates. The node identity is cryptographic; the current IP, hostname, NAT address, or container name is only an endpoint candidate.
- Cluster authority is threshold-based. Root or high-risk changes require M-of-N signatures from authorized control-authority nodes or hardware/offline operator keys.
- Role certificates are scoped by action, organization/tenant, service, partition, validity window, and allowed delegation depth.
- Update releases, route leases, peer-directory epochs, storage shard placement, node approvals, role changes, and authority rotations are signed records in the state log.
- A node accepts control data only when it can verify signatures, epoch/fencing, expiry, target cluster, target node or role scope, and monotonic generation.
- A compromised API replica can withhold or delay data, but cannot forge updates, route authority, new certificates, node roles, or cluster ownership.
- Bootstrap may use a temporary centralized signer for development, but production mode must mark that signer as non-authoritative unless quorum signatures are present.
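The M-of-N rule in the trust model can be illustrated with a minimal Go sketch. It assumes Ed25519 keys (the registry record shape in this document names ed25519 as the signature algorithm) and counts only distinct, authorized signers. The names Signature, QuorumMet, and signWithFirst are hypothetical and are not taken from the codebase; canonical payload encoding, role scoping, and epoch/fencing checks are out of scope here.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// Signature pairs a signer key id with an Ed25519 signature over a payload.
type Signature struct {
	KeyID string
	Value []byte
}

// QuorumMet reports whether at least threshold distinct authorized keys
// produced a valid signature over payload. Unknown key ids and repeated
// signatures from the same key are ignored rather than counted.
func QuorumMet(payload []byte, sigs []Signature, authorities map[string]ed25519.PublicKey, threshold int) bool {
	if threshold <= 0 {
		return false
	}
	seen := make(map[string]bool)
	for _, s := range sigs {
		pub, ok := authorities[s.KeyID]
		if !ok || seen[s.KeyID] {
			continue
		}
		if ed25519.Verify(pub, payload, s.Value) {
			seen[s.KeyID] = true
		}
	}
	return len(seen) >= threshold
}

// signWithFirst is a demo helper: it creates n authority keys and signs
// payload with the first `signers` of them.
func signWithFirst(payload []byte, n, signers int) (map[string]ed25519.PublicKey, []Signature) {
	authorities := make(map[string]ed25519.PublicKey, n)
	var sigs []Signature
	for i := 0; i < n; i++ {
		pub, priv, err := ed25519.GenerateKey(rand.Reader)
		if err != nil {
			panic(err)
		}
		id := fmt.Sprintf("ca-%d", i+1)
		authorities[id] = pub
		if i < signers {
			sigs = append(sigs, Signature{KeyID: id, Value: ed25519.Sign(priv, payload)})
		}
	}
	return authorities, sigs
}

func main() {
	payload := []byte("role-change: node-7 gains relay, epoch 12")
	authorities, sigs := signWithFirst(payload, 3, 2)
	fmt.Println("2-of-3 met:", QuorumMet(payload, sigs, authorities, 2))
	fmt.Println("3-of-3 met:", QuorumMet(payload, sigs, authorities, 3))
}
```

Counting distinct key ids matters: a compromised replica that replays one valid signature several times still contributes a single vote.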
Authority levels:
- root-authority: rotates cluster root and quorum membership. Offline or hardware-backed where possible. Rarely online.
- control-authority: approves node join, role changes, policy epochs, and route-authority membership through quorum.
- route-authority: signs short-lived route leases and relay/rendezvous assignments for a shard or partition.
- update-authority: signs release metadata, compatibility, artifact hashes, rollback windows, and staged rollout policy.
- storage-authority: signs storage shard manifests, replication factors, retention policy, and recovery epochs.
- observer-authority: can sign telemetry observations only; it cannot mutate routing, roles, updates, or secrets.
Required anti-takeover controls:
- No bearer admin token may grant fabric-wide mutation without a signed authority envelope.
- No node may accept unsigned update metadata or an artifact whose hash is not signed by update-authority quorum.
- No node may accept unsigned route changes for production channels.
- No node may promote itself into control, storage, update, relay, or route authority roles without a quorum-signed role certificate.
- Authority and role certificates must have short validity, explicit scopes, and revocation/fencing epochs.
- Nodes must pin the cluster root/quorum descriptor and reject unexpected root changes unless the old quorum signs the transition or an offline recovery policy is invoked.
Endpoint state is also distributed:
- Nodes publish signed endpoint-candidate sets containing local interfaces, public/reflexive STUN/ICE candidates, NAT group/local segment identifiers, relay fallback, and passive reverse-channel availability.
- Endpoint candidates expire quickly. When a node changes IP, it reconnects passively to any reachable fabric peer or API replica and publishes a new signed candidate epoch.
- Peers keep using cached valid candidates and route leases while refreshing from any reachable replica or neighbor gossip path.
- Neighbor selection is local and latency/load-aware; the state log announces facts and policy, not a forced single next hop.
Fabric Registry Gossip
Moving a service must not break the farm.
RAP_FABRIC_REGISTRY_RECORDS_JSON and signed registry gossip, not any fixed
HTTP/API address, define cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.
The rule is:
- any node may relay registry knowledge;
- only authorized signatures can create or replace trusted registry truth;
- a new record becomes active only after signature/authority checks and a successful live probe through the fabric or a policy-approved direct QUIC candidate;
- older still-valid records remain as fallback until their TTL expires.
Registry record shape:
schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
  - endpoint_id
    address: quic://...
    transport: direct_quic | relay_quic | reverse_quic
    reachability
    connectivity_mode
    priority / weight
    peer_cert_sha256
signatures:
  - key_id
    issuer_id
    role
    alg: ed25519
    value
Acceptance algorithm:
- Reject records for a different cluster, expired records, future records past allowed clock skew, unsupported schema, missing endpoints, or non-QUIC endpoints.
- Verify the canonical record payload, excluding signatures, against the configured authority set.
- Check the signer role is allowed for that service and scope.
- Require quorum where policy says M-of-N; development may use one trusted signer but must mark that signer as bootstrap/development authority.
- Store accepted records as candidate.
- Promote candidate to active only after live-probing at least one endpoint and verifying the endpoint identity/pin.
- Prefer higher epoch, then newer issued time, then generation. Do not replace a live active record with an older record.
- Keep the previous active record usable as fallback until TTL expiry when a newer candidate is not yet live-verified.
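The acceptance steps above can be sketched in Go as a cheap pre-check, an ordering rule, and a two-state slot. This is an illustration under stated assumptions, not the agent's implementation: Record is a reduced form of the registry record with times as Unix seconds, maxClockSkewSec is an assumed tolerance, and signature/quorum verification, role checks, and the schema check are assumed to happen between Precheck and Offer. All type and function names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Record is a reduced registry record; times are Unix seconds.
type Record struct {
	ClusterID  string
	Epoch      uint64
	Generation uint64
	IssuedAt   int64
	ExpiresAt  int64
	Endpoints  []string
}

// maxClockSkewSec is an assumed tolerance for records issued "in the future".
const maxClockSkewSec = 120

// Precheck applies the cheap rejection rules before any signature work.
func Precheck(r Record, clusterID string, now int64) error {
	switch {
	case r.ClusterID != clusterID:
		return errors.New("record is for a different cluster")
	case r.ExpiresAt <= now:
		return errors.New("record has expired")
	case r.IssuedAt > now+maxClockSkewSec:
		return errors.New("record issued too far in the future")
	case len(r.Endpoints) == 0:
		return errors.New("record has no endpoints")
	}
	for _, e := range r.Endpoints {
		if !strings.HasPrefix(e, "quic://") {
			return fmt.Errorf("non-QUIC endpoint %q", e)
		}
	}
	return nil
}

// Newer reports whether a outranks b: higher epoch, then newer issue time,
// then higher generation.
func Newer(a, b Record) bool {
	if a.Epoch != b.Epoch {
		return a.Epoch > b.Epoch
	}
	if a.IssuedAt != b.IssuedAt {
		return a.IssuedAt > b.IssuedAt
	}
	return a.Generation > b.Generation
}

// Slot holds the active record for one service plus an unverified candidate.
type Slot struct {
	Active    *Record
	Candidate *Record
}

// Offer stores r as the candidate if it passes Precheck and outranks the
// active record. Signature and quorum checks are assumed to have passed.
func (s *Slot) Offer(r Record, clusterID string, now int64) error {
	if err := Precheck(r, clusterID, now); err != nil {
		return err
	}
	if s.Active != nil && !Newer(r, *s.Active) {
		return errors.New("not newer than the active record")
	}
	s.Candidate = &r
	return nil
}

// Promote makes the candidate active only when the live probe succeeded;
// otherwise the previous active record stays in place as fallback.
func (s *Slot) Promote(probeOK bool) bool {
	if s.Candidate == nil || !probeOK {
		return false
	}
	s.Active, s.Candidate = s.Candidate, nil
	return true
}

func main() {
	now := int64(1700000000)
	var slot Slot
	v1 := Record{ClusterID: "c1", Epoch: 1, IssuedAt: now, ExpiresAt: now + 600,
		Endpoints: []string{"quic://10.0.0.5:18080"}}
	v2 := v1
	v2.Epoch = 2
	fmt.Println(slot.Offer(v1, "c1", now), slot.Promote(true))
	// v2 is accepted as candidate, but a failed probe leaves epoch 1 active.
	fmt.Println(slot.Offer(v2, "c1", now), slot.Promote(false), slot.Active.Epoch)
}
```

Keeping Active and Candidate separate is what lets the previous record keep serving traffic while a newer one is still unverified.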
This is the recovery path for mass moves. If every known service endpoint moves at once, the operator or a control-authority node only has to deliver a signed registry record to one reachable fabric node. That node validates it, probes it, promotes it, and gossips it onward. User/mobile/candidate nodes may carry the record, but cannot make it authoritative unless their role certificate permits that service/scope.
Service classes that must use this registry before production hardening:
- control-api: heartbeat, auth/profile control projection, node registration, policy/snapshot fetch.
- update-store: signed release manifests and compatibility windows.
- update-cache: artifact mirrors close to nodes.
- web-admin: management UI/API ingress replicas.
- vpn-egress-pool: user-visible exit pools; users see pools, not backing nodes.
Compatibility with legacy fixed endpoints is allowed only for rolling migration:
- Old nodes may use their baked HTTP/control URL only to fetch a new version or a signed registry bootstrap record.
- New nodes must treat fixed URLs as fallback hints, not as authority.
- Old code is removed only after every live node reports a version that supports signed registry gossip and service discovery by role.
Listener configuration is split into bind sockets and reachability candidates:
- listen_addr is what the local process binds, for example 0.0.0.0:18080 on home-1.
- endpoint_candidates is the ordered set of addresses other nodes may try. A single node can publish LAN addresses, addresses on several network adapters, STUN/reflexive addresses, and multiple public NAT forwards from different providers.
- Public NAT forwards are modeled as candidates with metadata, not as a replacement for the internal bind address. Example: quic://94.141.118.222:19199 reachability=public connectivity=direct provider=isp1 maps_to=192.168.200.85:18080.
- A candidate may be valid only from outside the NAT. Same-LAN hairpin failure is not a proof that the public candidate is broken; verification must be scoped to an external peer or remote probe.
- The route builder scores candidates by reachability, measured latency, loss, load, policy, and verification freshness. If one provider or interface fails, the node keeps the same node identity and republishes a new candidate epoch.
Install Artifact Bootstrap Contract
Every installable artifact is a node image plus a bootstrap seed set.
This applies to Android, Docker, Linux services, and Windows services. The seed set is baked into the artifact or delivered beside it as signed install metadata. It is not a single backend URL and not a management server choice. It is a bounded list of known fabric endpoint candidates that may be reachable from different network positions:
- public QUIC candidates, for example usa-los-1 or externally reachable home-1;
- private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
- closed-site candidates that have no Internet route themselves but can reach a neighboring fabric node;
- optional pinned certificate hashes or authority descriptors for high-trust entry candidates.
On first start the installed node tries the seed set, joins through any reachable
peer, registers as a candidate node with minimal rights, and then receives
signed peer-directory, role, update, and policy state through the fabric. If a
node is installed in an isolated network, it can still become visible and usable
when at least one nearby seed node can route onward to the rest of the fabric.
User login on Android is only identity/profile selection for the vpn-client
service; the underlying phone node already exists and participates in the
fabric with candidate permissions.
Node Roles
Initial role vocabulary:
- mobile-edge: mobile Android/iOS fabric node.
- entry: accepts external sessions.
- relay: forwards fabric traffic between nodes.
- exit: terminates routes into a target network or service zone.
- service-host: runs service adapters such as admin console, VPN exit, RDP, HTTP ingress, storage, or update-cache.
- control-plane: participates in control authority, policy decisions, route authority, or quorum work.
- route-coordinator: calculates or assists route candidates for a partition, region, or service class.
- storage: stores approved replicated fabric state.
- observer: collects telemetry and health without carrying user traffic.
- update-cache: mirrors signed artifacts close to nodes.
Roles are policy decisions, not binary builds. A phone can theoretically receive any role, but scheduler scoring must account for battery, OS restrictions, NAT, uplink stability, foreground state, and user cost policy.
Capability Model
Nodes must advertise capability facts in heartbeats and peer updates:
- supported fabric protocol versions;
- supported transport: UDP/QUIC;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
- battery state, charging state, mobile/wifi network type, metered policy;
- max relay bandwidth and allowed traffic classes;
- service roles and service capacity;
- trust tier and allowed tenant/organization scopes;
- local policy version, peer directory version, route cache version.
Fabric Data Session V1
The first practical protocol step is a persistent binary QUIC data session. The framing stays service-neutral, but the runtime transport is QUIC only.
Minimum frame set:
HELLO          node identity, protocol version, capabilities
AUTH           signed session token or mTLS-bound proof
SESSION_READY  accepted limits, route epoch, peer epoch
OPEN_STREAM    stream id, service id, traffic class, route id
DATA           stream id, sequence, flags, payload
ACK            stream id, received sequence/window
PING/PONG      RTT and liveness
ROUTE_UPDATE   new route lease or alternate route set
STREAM_CREDIT  per-stream backpressure window
NODE_PRESSURE  queue/cpu/memory/network pressure signal
CLOSE_STREAM   normal stream close
RESET_STREAM   failed stream, other streams remain alive
GOAWAY         draining or protocol shutdown
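To make the frame set concrete, here is a minimal Go sketch of a binary encoder/decoder with size and type validation. The wire layout is an assumption made for illustration only (1-byte type, 1-byte flags, 4-byte stream id, 8-byte sequence, 4-byte payload length, big-endian), as are the numeric type codes and the 64 KiB payload limit; the document does not specify any of these.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// FrameType enumerates the V1 frame set; the numeric codes are assumed.
type FrameType uint8

const (
	FrameHello FrameType = iota + 1
	FrameAuth
	FrameSessionReady
	FrameOpenStream
	FrameData
	FrameAck
	FramePing
	FramePong
	FrameRouteUpdate
	FrameStreamCredit
	FrameNodePressure
	FrameCloseStream
	FrameResetStream
	FrameGoAway
)

const (
	headerSize     = 18        // type + flags + stream id + sequence + length
	MaxPayloadSize = 64 * 1024 // assumed per-frame payload limit
)

var (
	ErrUnknownType = errors.New("unknown frame type")
	ErrTooLarge    = errors.New("payload exceeds frame size limit")
	ErrBadLength   = errors.New("truncated frame or length mismatch")
)

// Frame is the decoded, service-neutral representation.
type Frame struct {
	Type     FrameType
	Flags    uint8
	StreamID uint32
	Sequence uint64
	Payload  []byte
}

func validType(t FrameType) bool { return t >= FrameHello && t <= FrameGoAway }

// Encode serializes f after validating its type and payload size.
func Encode(f Frame) ([]byte, error) {
	if !validType(f.Type) {
		return nil, ErrUnknownType
	}
	if len(f.Payload) > MaxPayloadSize {
		return nil, ErrTooLarge
	}
	buf := make([]byte, headerSize+len(f.Payload))
	buf[0] = byte(f.Type)
	buf[1] = f.Flags
	binary.BigEndian.PutUint32(buf[2:6], f.StreamID)
	binary.BigEndian.PutUint64(buf[6:14], f.Sequence)
	binary.BigEndian.PutUint32(buf[14:18], uint32(len(f.Payload)))
	copy(buf[headerSize:], f.Payload)
	return buf, nil
}

// Decode parses exactly one frame from b and rejects malformed input.
func Decode(b []byte) (Frame, error) {
	if len(b) < headerSize {
		return Frame{}, ErrBadLength
	}
	t := FrameType(b[0])
	if !validType(t) {
		return Frame{}, ErrUnknownType
	}
	n := binary.BigEndian.Uint32(b[14:18])
	if n > MaxPayloadSize {
		return Frame{}, ErrTooLarge
	}
	if int(n) != len(b)-headerSize {
		return Frame{}, ErrBadLength
	}
	return Frame{
		Type:     t,
		Flags:    b[1],
		StreamID: binary.BigEndian.Uint32(b[2:6]),
		Sequence: binary.BigEndian.Uint64(b[6:14]),
		Payload:  append([]byte(nil), b[headerSize:]...),
	}, nil
}

func main() {
	raw, err := Encode(Frame{Type: FrameData, StreamID: 7, Sequence: 42, Payload: []byte("hi")})
	if err != nil {
		panic(err)
	}
	f, err := Decode(raw)
	fmt.Println(len(raw), f.Type == FrameData, f.StreamID, f.Sequence, string(f.Payload), err)
}
```

A fixed header with an explicit length field avoids the JSON/base64 envelope cost described earlier and lets the decoder reject oversized or truncated input before copying any payload.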
Traffic classes:
- control: authorization, route updates, attach/detach, liveness.
- dns: small, latency-sensitive name resolution.
- interactive: RDP input, SSH interactive, UI control.
- reliable: normal web/API traffic.
- bulk: downloads, uploads, sync, large media.
- droppable: telemetry samples, optional probes, low-value background data.
Each stream has independent flow control and backpressure. Bulk can be slowed or moved to another route without blocking control or interactive streams.
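A small Go sketch of per-stream credit accounting shows why an exhausted bulk stream does not stall an interactive one. It is an in-memory illustration only: the Session and Stream types, byte-based credit, and the method names are assumptions, and a real runtime would tie Grant to received STREAM_CREDIT frames.

```go
package main

import "fmt"

// Stream tracks the send credit (in bytes) granted by the receiver.
type Stream struct {
	Class  string
	Credit int
}

// Session multiplexes streams; credit is accounted per stream, never shared.
type Session struct {
	streams map[uint32]*Stream
}

// NewSession returns an empty session.
func NewSession() *Session {
	return &Session{streams: make(map[uint32]*Stream)}
}

// Open registers a stream with an initial credit window.
func (s *Session) Open(id uint32, class string, credit int) {
	s.streams[id] = &Stream{Class: class, Credit: credit}
}

// TrySend consumes credit for n bytes. It returns false, without touching
// any other stream, when this stream has no window left.
func (s *Session) TrySend(id uint32, n int) bool {
	st, ok := s.streams[id]
	if !ok || n <= 0 || n > st.Credit {
		return false
	}
	st.Credit -= n
	return true
}

// Grant applies a credit update from the peer to a single stream.
func (s *Session) Grant(id uint32, n int) {
	if st, ok := s.streams[id]; ok && n > 0 {
		st.Credit += n
	}
}

func main() {
	sess := NewSession()
	sess.Open(1, "bulk", 1000)
	sess.Open(2, "interactive", 1000)

	fmt.Println("bulk 1000:", sess.TrySend(1, 1000))      // uses the whole bulk window
	fmt.Println("bulk again:", sess.TrySend(1, 1))        // blocked: no credit left
	fmt.Println("interactive:", sess.TrySend(2, 64))      // unaffected by the bulk stall
	sess.Grant(1, 500)                                    // receiver reopens the bulk window
	fmt.Println("bulk after grant:", sess.TrySend(1, 400))
}
```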
Route Model
The fabric must maintain multiple candidate routes for an active session:
phone-a -> entry-1 -> home-1
phone-a -> phone-b -> relay-2 -> home-1
phone-a -> entry-2 -> relay-4 -> service-host-7
Route scoring inputs:
- policy and role eligibility;
- route length and failure domains;
- RTT, jitter, packet loss, bandwidth estimate;
- queue depth and retransmit pressure;
- current node CPU/memory/socket pressure;
- mobile battery/charging/metered status;
- historical reliability;
- service locality;
- tenant/organization isolation;
- cost and operator preference.
Routes are issued as short leases with route id, epoch, allowed channels, allowed service classes, hop list or next-hop policy, expiry, and fencing rules.
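As an illustration of how a node might gate traffic on a lease, the following Go sketch checks expiry, a fencing epoch, and the allowed service class. The field names, Unix-second timestamps, and the rule that a lease with an epoch below the current fence is rejected are assumptions; signature verification of the lease is out of scope for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// RouteLease is a reduced lease; ExpiresAt is Unix seconds.
type RouteLease struct {
	RouteID        string
	Epoch          uint64
	ExpiresAt      int64
	AllowedClasses []string
	Hops           []string
}

// Check returns nil when the lease may carry the given traffic class now.
// fenceEpoch is the lowest lease epoch the node still honors.
func (l RouteLease) Check(class string, now int64, fenceEpoch uint64) error {
	if now >= l.ExpiresAt {
		return errors.New("lease expired")
	}
	if l.Epoch < fenceEpoch {
		return errors.New("lease fenced by a newer epoch")
	}
	for _, c := range l.AllowedClasses {
		if c == class {
			return nil
		}
	}
	return fmt.Errorf("traffic class %q not allowed on route %s", class, l.RouteID)
}

func main() {
	now := int64(1700000000)
	lease := RouteLease{
		RouteID:        "r-1",
		Epoch:          5,
		ExpiresAt:      now + 30,
		AllowedClasses: []string{"interactive", "reliable"},
		Hops:           []string{"phone-a", "entry-1", "home-1"},
	}
	fmt.Println(lease.Check("interactive", now, 5))
	fmt.Println(lease.Check("bulk", now, 5))
	fmt.Println(lease.Check("interactive", now+31, 5))
	fmt.Println(lease.Check("interactive", now, 6))
}
```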
Service Discovery
Services are logical names, not fixed hosts:
service: admin-console
replicas: home-1, node-2, node-9
policy: active-active or leader/follower
ingress: vpn.cin.su / admin.cin.su / internal name
vpn.cin.su as an HTTP/HTTPS entry is a service endpoint. It can be hosted on
any eligible service-host node. If one replica fails, another replica can accept
the service lease and traffic can be routed to it.
Scale Model
For 1000 devices, the platform needs entry pools, exit pools, route leases, session placement, and overload protection.
For millions of devices, the platform additionally needs regional route coordinators, distributed peer directories, local control partitions, telemetry sampling, policy sharding, and resource accounting.
Every device joining the system increases potential edge capacity, but only if the scheduler can safely decide when that node is allowed to relay, store, serve, or only consume.
Security And Abuse Controls
The distributed model increases power and also risk. The following controls are required before mobile relay/control/storage roles are broadly enabled:
- node identity is cryptographic; IP address is never identity;
- all route leases are signed or locally verifiable;
- roles are scoped by organization, tenant, service, and time;
- mobile relay is opt-in by policy and user/device state;
- storage uses encrypted shards and explicit retention policy;
- control-plane participation requires trust tier and quorum policy;
- nodes never receive more topology or secret data than their role requires;
- abuse controls rate-limit relay use, route churn, and failed authentication;
- traffic accounting records who relayed what class and how much, without exposing payload contents.
Observability
The current tests show why aggregate "VPN works" is not enough. The fabric needs per-node, per-route, and per-stream metrics:
- throughput by direction and traffic class;
- RTT, jitter, loss, retransmits, queue depth;
- frame encode/decode errors;
- stream resets and close reasons;
- route switch reason and time to recovery;
- node pressure and scheduler decisions;
- service discovery failover events;
- Android foreground/background and network transition events.
Work Plan
Stage FNP-0: Architecture Lock
Status: this document.
Deliverables:
- fix "every device is a node" as the model;
- separate fabric, services, control, and data plane;
- define missing protocol, route, scale, security, and observability pieces.
Stage FNP-1: Binary Frame Contract
Deliverables:
- add a transport-neutral Go package for Fabric Data Session V1 frame types;
- encode/decode binary frames with size limits and validation;
- add tests for malformed frames, max frame size, stream ids, and frame type compatibility;
- do not connect it to production traffic yet.
Stage FNP-2: Persistent Session Runtime Skeleton
Status: in progress in agents/rap-node-agent/internal/fabricproto.
Deliverables:
- implement in-memory session runtime with streams, sequence numbers, ACK, stream credit, reset, and close;
- handle protocol frames for open/data/ack/credit/reset/close/ping/goaway;
- prove that a blocked bulk stream does not block control/interactive streams;
- expose per-stream metrics.
Stage FNP-3: WebSocket/TCP Compatibility Transport
Status: removed as a migration-only stage.
This stage existed to bootstrap binary frame semantics before QUIC routing and carrier reuse were ready. It introduced the transport-neutral frame loop, session-shaped packet mapper, and early smoke tooling. That work was useful as scaffolding, but it is no longer the target runtime.
Current rule:
- WebSocket/TCP fabric-session transport is not part of the supported node dataplane.
- QUIC/UDP is the only supported runtime carrier between fabric nodes.
- Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling must move to QUIC-native smoke and recovery paths.
- Any routing, heartbeat, registry, peer probe, or service dataplane logic must reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a valid alternate path.
What survives from this stage is the service-neutral frame model and the
FabricSessionPacketTransport mapping, which now ride on QUIC carriers instead
of a WebSocket fallback.
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC
peer endpoints when the control plane has not published valid candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
instead of disabling verification while the cluster CA path is being finished.
VPN fabric-session dialing now walks all ranked endpoint candidates before
declaring the target unavailable, so a failed QUIC candidate does not silently
re-enable WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport,
certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
candidate failures, selected transport family, certificate pin usage, and the
last selected endpoint/failure reason.
VPN fabric-session dialing feeds candidate success/failure observations back
into endpoint ranking, so repeated local QUIC failures can temporarily demote
that endpoint while preserving it as a later fallback.
Endpoint scoring no longer treats missing/zero latency on failed observations as
moderate latency, preventing failed candidates from receiving a false score
bonus.
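The scoring fix described above can be shown with a toy Go ranking function. The weights are invented for illustration; only the shape of the rule comes from the text: a failed observation is penalized regardless of its latency field, and a missing latency never earns a bonus. Failed candidates stay in the list so they remain available as a later fallback.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate carries the last dial observation for one endpoint.
type Candidate struct {
	Endpoint  string
	Success   bool
	LatencyMs float64 // 0 means "not measured"
}

// Score uses assumed weights. Latency is ignored on failed observations,
// so a zero latency on a failure cannot look like a fast path.
func Score(c Candidate) float64 {
	if !c.Success {
		return -50
	}
	if c.LatencyMs <= 0 {
		return 0 // success without a measurement: neutral, no bonus
	}
	bonus := 50 - c.LatencyMs/10
	if bonus < 0 {
		bonus = 0
	}
	return bonus
}

// Rank returns a copy ordered best-first; failed candidates are demoted
// but kept as later fallbacks.
func Rank(cs []Candidate) []Candidate {
	out := append([]Candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool { return Score(out[i]) > Score(out[j]) })
	return out
}

func main() {
	ranked := Rank([]Candidate{
		{Endpoint: "quic://a:18080", Success: false, LatencyMs: 0},
		{Endpoint: "quic://b:18080", Success: true, LatencyMs: 400},
		{Endpoint: "quic://c:18080", Success: true, LatencyMs: 20},
	})
	for _, c := range ranked {
		fmt.Println(c.Endpoint, Score(c))
	}
}
```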
Endpoint health observations are now emitted as a bounded standalone heartbeat
report (rap.vpn_fabric_endpoint_health_report.v1) so control plane can ingest
candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral
(fabric_session_binary_frames) and reports QUIC selection plus non-QUIC
candidate rejection instead of describing the dataplane as WebSocket-capable.
Endpoint health observations are pruned in-memory by age and count before
snapshot/report generation, preventing long-running nodes from accumulating
unbounded candidate history.
Scoped and control-plane synthetic mesh config can now carry
peer_endpoint_observations, and VPN fabric-session endpoint ranking merges
those remote health hints with local observations using the newest signal.
Endpoint health observations include source and reporter node fields so control
plane can distinguish local dial feedback from aggregated or policy-generated
health hints.
The endpoint health heartbeat report also includes the reporter node id at the
report level for simpler multi-node ingestion and diagnostics.
Peer cache construction now applies endpoint health observations when ranking
peer endpoint candidates, so recovery and warm-peer decisions see the same
degraded-path feedback as VPN fabric-session dialing.
Peer cache snapshots expose best-candidate score reasons, giving diagnostics a
direct explanation for why a QUIC, WebSocket, relay, or fallback endpoint was
chosen.
Heartbeat capabilities now advertise that peer-cache endpoint ranking consumes
health observations, allowing control plane and UI diagnostics to detect nodes
running the health-aware peer selection path.
VPN fabric QUIC transport now reuses QUIC connections per peer endpoint and
opens logical fabric-session streams on top, with heartbeat telemetry for QUIC
connection opens, reuses, evictions, and active count.
Cached QUIC connections are pruned by idle TTL, preventing long-running agents
from holding unused peer connections indefinitely.
QUIC carrier connections now track active logical streams and enforce a
per-connection stream limit, exposing stream opens/closes and limit rejects in
transport telemetry.
The per-connection QUIC stream limit is configurable through
RAP_VPN_FABRIC_QUIC_MAX_STREAMS_PER_CONN /
-vpn-fabric-quic-max-streams-per-conn and propagated by host-agent install
profiles.
QUIC stream-limit rejects are classified as capacity pressure instead of peer
endpoint failure, so local health feedback does not incorrectly demote a healthy
but saturated carrier.
VPN fabric dial telemetry records the last capacity-limited endpoint and
transport, making stream saturation visible without poisoning endpoint health
observations.
The same dial telemetry now keeps bounded per-endpoint capacity-pressure
counters, so operators can see whether stream saturation is occasional or
concentrated on a specific QUIC carrier.
Fresh local capacity-pressure counters also feed endpoint ranking as a bounded
penalty, spreading new fabric sessions away from a saturated carrier without
declaring that carrier failed.
VPN fabric-session transport now opens configurable per-class stream shards
for interactive and bulk packet traffic, so heavy browser flows do not share a
single logical stream with latency-sensitive RDP/control packets.
Host-agent install commands for Docker, Linux, and Windows expose the same
VPN fabric-session/QUIC tuning flags as install profiles, keeping manual and
profile-based rollout paths aligned.
Gateway runtime snapshots include the fabric-session packet transport stream
layout and send counters by traffic class/stream id for load-test diagnosis.
Those snapshots also summarize configured stream class/shard counts and active
send class/stream counts, making sharding health visible without expanding
per-stream maps.
Gateway shutdown now closes all VPN fabric-session stream shards and then the
underlying fabric session, preventing stale logical streams from consuming QUIC
carrier capacity after reconnects or rollout restarts.
Gateway runtime cancellation now fans out to both upload and download loops
when either direction exits, so transport cleanup runs promptly on one-sided
TUN or carrier failures.
Fabric-session packet transport snapshots include close-frame and close-error
counters for verifying that stream shard cleanup is actually happening.
Outgoing VPN packet batches are split by traffic class and selected stream
before they are framed, so one gateway batch containing many browser flows does
not collapse onto the first packet's logical stream.
mesh-live-smoke now sends mixed bulk and interactive VPN packets in a single
fabric-session batch and requires them to remain sharded.
The smoke report also exposes the mixed-batch frame fanout so regressions show
up as a concrete fanout drop, not just a failed boolean.
Batch fanout is bounded by configured stream shards, so a large batch with many
flows cannot explode into unbounded fabric frames.
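The batch splitting and bounded fanout described above might look like the following Go sketch. It assumes packets carry a traffic class and a flow key, and that a shard is chosen by hashing the flow key modulo the configured shard count for that class (FNV-1a here, picked only for the example). The Packet and StreamKey types and the demoBatch helper are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Packet is a reduced VPN packet descriptor.
type Packet struct {
	Class   string
	FlowKey string
}

// StreamKey identifies one logical stream shard within a traffic class.
type StreamKey struct {
	Class string
	Shard int
}

// shardFor maps a flow to a shard index in [0, shards). Classes with no
// configured shard count fall back to a single shard.
func shardFor(flowKey string, shards int) int {
	if shards <= 1 {
		return 0
	}
	h := fnv.New32a()
	h.Write([]byte(flowKey))
	return int(h.Sum32() % uint32(shards))
}

// SplitBatch groups packets by class and shard before framing. For each
// class present, the number of groups cannot exceed its shard count.
func SplitBatch(batch []Packet, shards map[string]int) map[StreamKey][]Packet {
	out := make(map[StreamKey][]Packet)
	for _, p := range batch {
		k := StreamKey{Class: p.Class, Shard: shardFor(p.FlowKey, shards[p.Class])}
		out[k] = append(out[k], p)
	}
	return out
}

// demoBatch builds a batch of many bulk flows plus one interactive packet.
func demoBatch(flows int) []Packet {
	batch := make([]Packet, 0, flows+1)
	for i := 0; i < flows; i++ {
		batch = append(batch, Packet{Class: "bulk", FlowKey: fmt.Sprintf("flow-%d", i)})
	}
	return append(batch, Packet{Class: "interactive", FlowKey: "rdp-1"})
}

func main() {
	groups := SplitBatch(demoBatch(1000), map[string]int{"bulk": 4, "interactive": 2})
	fmt.Println("fanout:", len(groups)) // at most 4 bulk shards plus 1 interactive shard
	for k, pkts := range groups {
		fmt.Println(k.Class, k.Shard, len(pkts))
	}
}
```

Hashing by flow key keeps packets of one flow on one stream, which preserves per-flow ordering while still spreading load across shards.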
Heartbeat tests assert the advertised VPN fabric stream-shard count and
capability, keeping control-plane diagnostics aligned with runtime behavior.
Fabric-session packet transport snapshots now report packets per stream plus
last/max batch fanout, making real multi-site load distribution measurable from
gateway status.
Receive-side fabric-session packet counters are reported by traffic class and
stream id as well, so gateway status can compare TX and RX distribution under
browser/RDP load.
QUIC fabric transport snapshots expose the configured stream limit, saturated
connection count, and capacity pressure percentage next to stream limit rejects.
Closed cached QUIC connections discovered during snapshot generation now update
the transport's cumulative eviction counters, keeping successive heartbeats
consistent.
mesh-live-smoke reports QUIC fabric capacity-pressure percentage from the
transport snapshot, verifying that the capacity fields are populated.
QUIC fabric snapshots now include per cached connection pressure, endpoint, and
saturation state; VPN fabric endpoint ranking consumes that live local pressure
before stream-limit rejection, spreading new sessions away from already busy
QUIC carriers.
Per-connection QUIC snapshot entries are sorted by peer and endpoint so
heartbeats and diagnostics stay stable across reports.
When local live QUIC pressure and recent capacity-limit counters overlap, the
ranking input keeps the stronger pressure signal rather than allowing a weak
fresh sample to hide a saturated endpoint.
Heartbeat VPN fabric reports now include a bounded quic_capacity_pressure
summary sorted by busiest cached QUIC connection, making overload diagnosis
visible without digging through the full carrier snapshot.
VPN fabric flow-scheduler snapshots now expose bulk pressure activation plus
bulk and interactive/control channel counts, making mixed browser/RDP load
diagnosis explicit when bulk windows are reduced to protect interactive traffic.
mesh-live-smoke now exercises that mixed-load scheduler path and reports bulk
pressure activation plus bulk/interactive window recommendations.
Flow-scheduler route recovery telemetry now records per-channel route switches,
the failed route a channel recovered from, and aggregate recovered-channel /
switch counts, making alternate-route recovery measurable during load tests.
mesh-live-smoke now also exercises a primary-route failure followed by an
alternate-route success and reports the resulting route switch count.
The same smoke output reports measured route recovery milliseconds for the
synthetic failover path.
Smoke now includes max/average route recovery timing from the scheduler
aggregate snapshot as well.
Route recovery telemetry includes failure/switch timestamps and recovery
duration in milliseconds for each recovered flow channel.
Scheduler snapshots also aggregate route recovery max/average milliseconds
across recovered channels for quick load-test health checks.
Route recovery telemetry now includes normalized switch reasons and aggregate
reason counts, so load tests can distinguish peer failures, timeouts, and other
route-break causes.
mesh-live-smoke reports the synthetic route-recovery reason beside recovery
timing and switch count.
Common route switch reasons are bucketed into stable labels such as timeout,
peer_unavailable, connection_refused, connection_reset, no_route_to_host, and
capacity_limited to keep heartbeat cardinality bounded.
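
Reason bucketing of this kind can be sketched as a substring match over the raw error text. The labels below are the ones listed above; the matching rules, the `other` fallback label, and the `NormalizeSwitchReason` name are assumptions for illustration only.

```go
package main

import "strings"

// NormalizeSwitchReason maps a free-form error string onto a small fixed
// set of labels so that heartbeat reason counts stay bounded.
// The substrings matched here and the "other" fallback are assumptions.
func NormalizeSwitchReason(raw string) string {
	msg := strings.ToLower(raw)
	switch {
	case strings.Contains(msg, "stream limit"), strings.Contains(msg, "capacity"):
		return "capacity_limited"
	case strings.Contains(msg, "timeout"), strings.Contains(msg, "deadline exceeded"):
		return "timeout"
	case strings.Contains(msg, "connection refused"):
		return "connection_refused"
	case strings.Contains(msg, "connection reset"):
		return "connection_reset"
	case strings.Contains(msg, "no route to host"):
		return "no_route_to_host"
	case strings.Contains(msg, "peer unavailable"):
		return "peer_unavailable"
	default:
		return "other"
	}
}
```

Because every unknown error collapses into a single fallback bucket, the number of distinct keys in an aggregate reason-count map is fixed regardless of how varied the underlying error strings are.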
Flow-scheduler snapshots now include a machine-readable pressure level
(nominal, warning, critical) and a bounded reason list derived from drops,
route failures, route recovery, slow channels, bulk pressure, and adaptive
backpressure.
The same pressure classification includes a bounded 0-100 score for automated
route, endpoint, and node comparisons.
mesh-live-smoke reports the mixed-load scheduler pressure level, score, and
reasons.
Heartbeat VPN fabric transport reports now include a compact
flow_pressure summary with level, score, reasons, bulk pressure, route
recovery timing, reason counts, and recommended per-class windows.
The flow_pressure summary includes a recommended_action such as
observe, throttle_bulk, reduce_parallelism, prefer_faster_route,
observe_recovery, rebuild_or_reroute, or shed_or_reroute.
recommended_action is now part of the shared FabricFlowSchedulerSnapshot
contract, so heartbeat reports and smoke diagnostics consume the same runtime
decision.
The scheduler's nominal snapshot explicitly reports the observe action.
Flow-scheduler snapshots keep a bounded pressure transition history with the
observed level, score, reasons, and recommended action. Repeated snapshots do
not duplicate unchanged pressure states, so controllers can distinguish current
state from recent worsening or recovery without unbounded heartbeat growth.
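
A bounded, de-duplicating transition history like the one described above might look like the following sketch. The types and names are hypothetical, and treating a score change as a new state is an assumption; the real scheduler may compare only level and action.

```go
package main

// PressureState is a hypothetical transition record.
type PressureState struct {
	Level   string
	Score   int
	Reasons []string
	Action  string
}

// PressureHistory keeps at most limit transitions, oldest first.
type PressureHistory struct {
	limit   int
	entries []PressureState
}

// NewPressureHistory returns a history holding at most limit entries.
func NewPressureHistory(limit int) *PressureHistory {
	if limit < 1 {
		limit = 1
	}
	return &PressureHistory{limit: limit}
}

func sameState(a, b PressureState) bool {
	if a.Level != b.Level || a.Score != b.Score || a.Action != b.Action ||
		len(a.Reasons) != len(b.Reasons) {
		return false
	}
	for i := range a.Reasons {
		if a.Reasons[i] != b.Reasons[i] {
			return false
		}
	}
	return true
}

// Observe records s only when it differs from the latest entry and
// reports whether a new transition was stored.
func (h *PressureHistory) Observe(s PressureState) bool {
	if n := len(h.entries); n > 0 && sameState(h.entries[n-1], s) {
		return false // unchanged state: do not grow the history
	}
	h.entries = append(h.entries, s)
	if len(h.entries) > h.limit {
		// Drop the oldest entry in place so memory stays bounded.
		copy(h.entries, h.entries[1:])
		h.entries = h.entries[:h.limit]
	}
	return true
}

// Entries returns a copy of the stored transitions, oldest first.
func (h *PressureHistory) Entries() []PressureState {
	return append([]PressureState(nil), h.entries...)
}
```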
mesh-live-smoke reports the recommended action for its mixed bulk/interactive
load scenario.
Nodes advertise the vpn_fabric_flow_pressure capability when that heartbeat
summary is available.
When the VPN fabric ingress runtime has not been initialized yet, the heartbeat
still emits a nominal flow_pressure summary for schema stability.
Endpoint ranking treats capacity_limited observations as a soft pressure
penalty instead of a hard recent failure, enabling load spreading without
marking the carrier unhealthy.
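
The difference between a soft pressure penalty and a hard failure penalty can be illustrated with a simple cost function. The penalty weights, the RTT-based cost, and all names below are hypothetical; the actual ranking formula is not given in this document.

```go
package main

import "sort"

// EndpointObs is a hypothetical ranking input for one endpoint.
type EndpointObs struct {
	Endpoint        string
	RTTMillis       float64
	RecentFailures  int
	CapacityLimited bool
}

// Hypothetical weights: a failure dominates the cost, while capacity
// pressure only nudges an endpoint behind comparable idle ones.
const (
	failurePenaltyMillis  = 1000.0
	capacityPenaltyMillis = 50.0
)

func rankCost(o EndpointObs) float64 {
	cost := o.RTTMillis + float64(o.RecentFailures)*failurePenaltyMillis
	if o.CapacityLimited {
		cost += capacityPenaltyMillis // soft penalty, not a failure
	}
	return cost
}

// RankEndpoints returns a copy of obs ordered from best to worst.
func RankEndpoints(obs []EndpointObs) []EndpointObs {
	ranked := append([]EndpointObs(nil), obs...)
	sort.SliceStable(ranked, func(i, j int) bool {
		return rankCost(ranked[i]) < rankCost(ranked[j])
	})
	return ranked
}
```

With these weights a saturated endpoint still ranks ahead of one that recently failed, which matches the intent of spreading load without marking the carrier unhealthy.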
Local QUIC stream-limit pressure is now emitted as a capacity observation with
no failure-count increment, allowing the control plane to spread load without
treating saturation as packet-path breakage.
Cached QUIC carrier idle TTL is configurable through
RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS / -vpn-fabric-quic-idle-ttl and
propagated by host-agent install profiles.
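
Resolving the idle TTL from the environment variable and flag named above could be sketched as follows. Only the two names come from this document. The default value, the flag being a Go duration, the flag taking precedence over the environment, and the `ResolveIdleTTL` helper are all assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strconv"
	"time"
)

// defaultIdleTTL is a hypothetical default; the real value is not stated.
const defaultIdleTTL = 120 * time.Second

// ResolveIdleTTL reads the TTL from the environment (whole seconds) and
// then lets the command-line flag override it. Flag-over-env precedence
// and the duration-typed flag are assumptions of this sketch.
func ResolveIdleTTL(args []string, getenv func(string) string) (time.Duration, error) {
	ttl := defaultIdleTTL
	if raw := getenv("RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS"); raw != "" {
		secs, err := strconv.Atoi(raw)
		if err != nil || secs <= 0 {
			return 0, fmt.Errorf("invalid idle TTL seconds %q", raw)
		}
		ttl = time.Duration(secs) * time.Second
	}
	fs := flag.NewFlagSet("node-agent", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr
	flagTTL := fs.Duration("vpn-fabric-quic-idle-ttl", ttl,
		"idle TTL for cached QUIC carriers")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *flagTTL, nil
}
```

A caller would typically pass `os.Args[1:]` and `os.Getenv`; taking both as parameters keeps the helper testable without touching process state.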
Deliverables:
- carry binary frames over one persistent QUIC fabric session;
- replace high-frequency /mesh/v1/forwardpacket POST usage for VPN routes in a gated mode;
- remove HTTP/WebSocket packet forwarding from the supported dataplane.
Stage FNP-4: Android As Mobile Fabric Node
Deliverables:
- Android advertises node capabilities, network state, battery state, and supported transports;
- Android opens Fabric Data Session V1 to an entry node;
- VPN packets map to independent streams/classes;
- diagnostics can run per-stream and per-route tests.
Stage FNP-5: Route Leases And Multipath
Deliverables:
- route result includes primary and alternate routes;
- runtime can switch new streams to a better route;
- interactive streams can recover quickly after route fencing;
- route health uses dataplane metrics, not only HTTP request success.
Stage FNP-6: QUIC/UDP Transport
Status: active runtime baseline in internal/mesh.
Deliverables:
- implement QUIC transport for Fabric Data Session V1;
- keep QUIC/UDP as the only supported inter-node runtime transport;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.
Stage FNP-7: Distributed Service Discovery
Deliverables:
- service names map to eligible service replicas;
- admin console and VPN service can move between service-host nodes;
- service failover is expressed as leases and route updates.
Stage FNP-8: Mobile Relay And Distributed Capacity
Deliverables:
- mobile nodes can opt into relay under strict policy;
- scheduler scores battery, metered network, NAT, trust, and load;
- route planner can use mobile nodes where they are closer/faster;
- accounting and abuse controls are enforced.
Stage FNP-9: Scale To Large Fleets
Deliverables:
- entry and route coordinator pools;
- peer directory sharding;
- telemetry sampling and aggregation;
- per-tenant quotas and fairness;
- load tests for 1000 simulated devices, then larger synthetic fleets.
Immediate Next Action
Start Stage FNP-1 in rap-node-agent as a non-production protocol package. The
goal is to create the binary frame contract and tests without disturbing the
current VPN path. After that, wire it into a gated persistent session runtime and
only then move Android/VPN traffic onto it.