Working version, but throughput is 10 Mbit/s

2026-05-22 21:46:49 +03:00


Distributed Fabric Node Protocol Plan

This document locks in the target direction for the Secure Access Fabric after the VPN performance investigation. The platform must not be treated as a VPN server, RDP gateway, or web console. It is a distributed overlay transport in which every participating device is a fabric node, and VPN/RDP/HTTP/admin/storage are services running over that fabric.

Core Position

Every device is a node.

A phone, home server, cloud server, relay, admin-console host, storage host, and update-cache host share the same base identity model. They differ by roles, capabilities, policy, trust level, and current health.

Node = identity + roles + capabilities + policy + health + local state

The Android VPN app is therefore not only a client. It is a mobile fabric node. It may carry VPN traffic, participate in route discovery, relay traffic when policy allows, host limited control/storage roles when approved, and report mobile-specific capacity signals such as battery, network type, NAT behavior, foreground/background state, and metered network policy.

Node survival and recovery across endpoint moves, NAT-only reachability, compat contract overlap, and unavailable manual host access are governed by docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md. In particular, nodes like ifcm-rufms-s-mo1cr must remain recoverable through the fabric/update/recovery plane even when direct host login is unavailable.

Android implementation contract:

app install/build contains a QUIC bootstrap seed set;
runtime launch carries a fabric_bootstrap_config, not a backend URL;
user login/profile selection happens over the fabric control channel;
the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet forwarding, WebSocket packet relay, and direct backend packet relay are not part of the supported runtime path.

What Was Missing

The current implementation proves route leases and production VPN forwarding, but it still has a data-plane shape that cannot scale to high throughput:

too much payload traffic is carried as small request/response HTTP forwarding calls;
JSON/base64 payload envelopes add overhead and CPU cost;
one overloaded stream can delay unrelated traffic;
route health is visible, but the transport does not yet provide enough low-latency per-stream feedback;
the phone behaves mostly as a service client, not as a full fabric node;
service discovery and route execution are not yet separated cleanly enough;
fallback paths can keep traffic alive, but can also hide architecture bottlenecks if used as the primary data plane.

To sustain 100 Mbps per active device, first at 1000+ devices and later at millions, the fabric must move to a persistent, binary, multiplexed data plane with explicit route and stream semantics.

Non-Negotiable Principles

Fabric is the lower transport layer. VPN, RDP, HTTP, admin console, storage, and update delivery are services above it.
Service adapters must not discover topology, own route selection, or invent failover logic. They request transport from the fabric.
Control plane and data plane are separate. API/console traffic must not be the packet transport mechanism.
Every data session carries many independent streams. A blocked bulk download must not stall RDP, DNS, control, or telemetry.
Routes are leased and replaceable. Route selection uses quality, policy, locality, role eligibility, cost, trust, and current load.
The fabric is distributed. Central control can coordinate, but the runtime must keep working through cached policy, peer directories, route leases, and local health when central components are degraded.
Mobile nodes are first-class nodes with stricter capability scoring.
QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may serve human-facing download or panel pages, but it is not a node data-plane fallback and must not carry service packets.
There must be no single management service that can seize the fabric. Control, storage, update distribution, route authority, and certificate authority are fabric roles assigned to eligible nodes and protected by quorum signatures. A web/API endpoint is only an access replica for a signed state log, not the owner of cluster truth.
IP addresses and DNS names are never authority. Nodes announce signed endpoint candidates for every usable interface, public/reflexive address, local segment address, reverse channel, and relay fallback. Neighbors select the usable candidate locally by policy, reachability, latency, load, and trust.

Transport vs Control API

The system must keep two layers separate in naming, design, and diagnostics:

Fabric Transport means inter-node runtime delivery only. It is QUIC over UDP and carries leased service-channel/data-plane traffic between nodes.
Control API means human/operator/programmatic management surfaces such as web-admin, release publication, policy mutation, audit queries, and status reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.

The HTTP Control API is not a fallback transport for node-to-node runtime traffic. A 409 Conflict from the backend, a panel page load, or a release download is control-plane behavior, not fabric transport behavior.

Distributed Control And Trust

The target fabric behaves like a distributed network, not a client/server management product. The cluster has a replicated signed state log and many service replicas. Any node with the right role can serve API, storage, update, or route-coordinator duties, but no single replica can mutate cluster authority alone.

Required trust model:

Every node has a long-lived node identity key and short-lived role certificates. The node identity is cryptographic; the current IP, hostname, NAT address, or container name is only an endpoint candidate.
Cluster authority is threshold-based. Root or high-risk changes require M-of-N signatures from authorized control-authority nodes or hardware/offline operator keys.
Role certificates are scoped by action, organization/tenant, service, partition, validity window, and allowed delegation depth.
Update releases, route leases, peer-directory epochs, storage shard placement, node approvals, role changes, and authority rotations are signed records in the state log.
A node accepts control data only when it can verify signatures, epoch/fencing, expiry, target cluster, target node or role scope, and monotonic generation.
A compromised API replica can withhold or delay data, but cannot forge updates, route authority, new certificates, node roles, or cluster ownership.
Bootstrap may use a temporary centralized signer for development, but production mode must mark that signer as non-authoritative unless quorum signatures are present.
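
The M-of-N rule in the trust model can be illustrated with a minimal Go sketch. It assumes Ed25519 signatures (the algorithm named in the registry record shape) and counts each authority key at most once. The names `Signature`, `CountValidSigners`, `HasQuorum`, and `demoQuorum` are hypothetical and not taken from the codebase; the demo keys are derived from fixed seeds purely so the example is repeatable and must never be used for real authorities.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"fmt"
)

// Signature pairs an authority key id with its signature over a payload.
type Signature struct {
	KeyID string
	Value []byte
}

// CountValidSigners returns how many distinct known authorities produced a
// valid signature over payload. Unknown keys and duplicates are ignored.
func CountValidSigners(payload []byte, sigs []Signature, authorities map[string]ed25519.PublicKey) int {
	seen := make(map[string]bool)
	for _, s := range sigs {
		pub, known := authorities[s.KeyID]
		if !known || seen[s.KeyID] || len(pub) != ed25519.PublicKeySize {
			continue
		}
		if ed25519.Verify(pub, payload, s.Value) {
			seen[s.KeyID] = true
		}
	}
	return len(seen)
}

// HasQuorum reports whether at least m distinct authorities signed payload.
func HasQuorum(payload []byte, sigs []Signature, authorities map[string]ed25519.PublicKey, m int) bool {
	return m > 0 && CountValidSigners(payload, sigs, authorities) >= m
}

// demoQuorum builds three demo authorities from fixed seeds (illustration
// only) and signs the payload with the first two of them.
func demoQuorum() ([]byte, []Signature, map[string]ed25519.PublicKey) {
	payload := []byte("role-change: node-7 gains relay")
	authorities := make(map[string]ed25519.PublicKey)
	var sigs []Signature
	for i, id := range []string{"auth-a", "auth-b", "auth-c"} {
		seed := bytes.Repeat([]byte{byte(i + 1)}, ed25519.SeedSize)
		priv := ed25519.NewKeyFromSeed(seed)
		authorities[id] = priv.Public().(ed25519.PublicKey)
		if i < 2 {
			sigs = append(sigs, Signature{KeyID: id, Value: ed25519.Sign(priv, payload)})
		}
	}
	return payload, sigs, authorities
}

func main() {
	payload, sigs, authorities := demoQuorum()
	fmt.Println("2-of-3 satisfied:", HasQuorum(payload, sigs, authorities, 2))
	fmt.Println("3-of-3 satisfied:", HasQuorum(payload, sigs, authorities, 3))
}
```

Counting distinct key ids rather than raw signatures is what stops a single compromised replica from replaying its own signature to fake a quorum.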

Authority levels:

root-authority: rotates cluster root and quorum membership. Offline or hardware-backed where possible. Rarely online.
control-authority: approves node join, role changes, policy epochs, and route-authority membership through quorum.
route-authority: signs short-lived route leases and relay/rendezvous assignments for a shard or partition.
update-authority: signs release metadata, compatibility, artifact hashes, rollback windows, and staged rollout policy.
storage-authority: signs storage shard manifests, replication factors, retention policy, and recovery epochs.
observer-authority: can sign telemetry observations only; it cannot mutate routing, roles, updates, or secrets.

Required anti-takeover controls:

No bearer admin token may grant fabric-wide mutation without a signed authority envelope.
No node may accept unsigned update metadata or an artifact whose hash is not signed by update-authority quorum.
No node may accept unsigned route changes for production channels.
No node may promote itself into control, storage, update, relay, or route authority roles without a quorum-signed role certificate.
Authority and role certificates must have short validity, explicit scopes, and revocation/fencing epochs.
Nodes must pin the cluster root/quorum descriptor and reject unexpected root changes unless the old quorum signs the transition or an offline recovery policy is invoked.

Endpoint state is also distributed:

Nodes publish signed endpoint-candidate sets containing local interfaces, public/reflexive STUN/ICE candidates, NAT group/local segment identifiers, relay fallback, and passive reverse-channel availability.
Endpoint candidates expire quickly. When a node changes IP, it reconnects passively to any reachable fabric peer or API replica and publishes a new signed candidate epoch.
Peers keep using cached valid candidates and route leases while refreshing from any reachable replica or neighbor gossip path.
Neighbor selection is local and latency/load-aware; the state log announces facts and policy, not a forced single next hop.

Fabric Registry Gossip

Moving a service must not break the farm.

RAP_FABRIC_REGISTRY_RECORDS_JSON and signed registry gossip, not any fixed HTTP/API address, define cluster truth. After bootstrap, a node finds services by logical role through signed fabric registry records that can be carried by any reachable peer.

The rule is:

any node may relay registry knowledge;
only authorized signatures can create or replace trusted registry truth;
a new record becomes active only after signature/authority checks and a successful live probe through the fabric or a policy-approved direct QUIC candidate;
older still-valid records remain as fallback until their TTL expires.

Registry record shape:

schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
  - endpoint_id
    address: quic://...
    transport: direct_quic | relay_quic | reverse_quic
    reachability
    connectivity_mode
    priority / weight
    peer_cert_sha256
signatures:
  - key_id
    issuer_id
    role
    alg: ed25519
    value

Acceptance algorithm:

Reject records for a different cluster, expired records, future records past allowed clock skew, unsupported schema, missing endpoints, or non-QUIC endpoints.
Verify the canonical record payload, excluding signatures, against the configured authority set.
Check the signer role is allowed for that service and scope.
Require quorum where policy says M-of-N; development may use one trusted signer but must mark that signer as bootstrap/development authority.
Store accepted records as candidate.
Promote candidate to active only after live-probing at least one endpoint and verifying the endpoint identity/pin.
Prefer higher epoch, then newer issued time, then generation. Do not replace a live active record with an older record.
Keep the previous active record usable as fallback until TTL expiry when a newer candidate is not yet live-verified.
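
The structural checks and the preference order from the acceptance algorithm can be sketched in Go as follows. This is a reduced illustration, not the project's code: the `Record` struct keeps only the fields needed here, timestamps are assumed to be Unix seconds, signature and quorum verification are deliberately left out, and `Precheck`, `Supersedes`, and `NextActive` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const SchemaV1 = "rap.fabric.registry.gossip_record.v1"

// Endpoint is a reduced endpoint entry from a registry record.
type Endpoint struct {
	EndpointID string
	Address    string
}

// Record is a reduced registry gossip record (signatures omitted).
type Record struct {
	SchemaVersion string
	ClusterID     string
	Service       string
	Epoch         uint64
	Generation    uint64
	IssuedAt      int64 // Unix seconds (assumption)
	ExpiresAt     int64 // Unix seconds (assumption)
	Endpoints     []Endpoint
}

// Precheck applies the rejection rules that need no cryptography.
func Precheck(r Record, clusterID string, now, maxSkew int64) error {
	switch {
	case r.SchemaVersion != SchemaV1:
		return errors.New("unsupported schema")
	case r.ClusterID != clusterID:
		return errors.New("record for a different cluster")
	case r.ExpiresAt <= now:
		return errors.New("record expired")
	case r.IssuedAt > now+maxSkew:
		return errors.New("record issued beyond allowed clock skew")
	case len(r.Endpoints) == 0:
		return errors.New("missing endpoints")
	}
	for _, e := range r.Endpoints {
		if !strings.HasPrefix(e.Address, "quic://") {
			return fmt.Errorf("non-QUIC endpoint %q", e.Address)
		}
	}
	return nil
}

// Supersedes prefers higher epoch, then newer issue time, then generation.
func Supersedes(candidate, active Record) bool {
	if candidate.Epoch != active.Epoch {
		return candidate.Epoch > active.Epoch
	}
	if candidate.IssuedAt != active.IssuedAt {
		return candidate.IssuedAt > active.IssuedAt
	}
	return candidate.Generation > active.Generation
}

// NextActive promotes the candidate only when it is newer and live-verified;
// otherwise the previous active record stays in use as the fallback.
func NextActive(active, candidate Record, liveVerified bool) Record {
	if liveVerified && Supersedes(candidate, active) {
		return candidate
	}
	return active
}

func main() {
	eps := []Endpoint{{EndpointID: "e1", Address: "quic://198.51.100.10:443"}}
	active := Record{SchemaVersion: SchemaV1, ClusterID: "c1", Epoch: 4, IssuedAt: 1000, ExpiresAt: 2000, Endpoints: eps}
	candidate := active
	candidate.Epoch = 5
	fmt.Println("precheck:", Precheck(candidate, "c1", 1500, 30))
	fmt.Println("active epoch before probe:", NextActive(active, candidate, false).Epoch)
	fmt.Println("active epoch after probe:", NextActive(active, candidate, true).Epoch)
}
```

A full implementation would run signature, signer-role, and quorum verification between `Precheck` and promotion, as the numbered steps require.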

This is the recovery path for mass moves. If every known service endpoint moves at once, the operator or a control-authority node only has to deliver a signed registry record to one reachable fabric node. That node validates it, probes it, promotes it, and gossips it onward. User/mobile/candidate nodes may carry the record, but cannot make it authoritative unless their role certificate permits that service/scope.

Service classes that must use this registry before production hardening:

control-api: heartbeat, auth/profile control projection, node registration, policy/snapshot fetch.
update-store: signed release manifests and compatibility windows.
update-cache: artifact mirrors close to nodes.
web-admin: management UI/API ingress replicas.
vpn-egress-pool: user-visible exit pools; users see pools, not backing nodes.

Legacy fixed-endpoint compatibility is allowed only during rolling migration:

Old nodes may use their baked HTTP/control URL only to fetch a new version or a signed registry bootstrap record.
New nodes must treat fixed URLs as fallback hints, not as authority.
Old code is removed only after every live node reports a version that supports signed registry gossip and service discovery by role.

Listener configuration is split into bind sockets and reachability candidates:

listen_addr is what the local process binds, for example 0.0.0.0:18080 on home-1.
endpoint_candidates is the ordered set of addresses other nodes may try. A single node can publish LAN addresses, addresses on several network adapters, STUN/reflexive addresses, and multiple public NAT forwards from different providers.
Public NAT forwards are modeled as candidates with metadata, not as a replacement for the internal bind address. Example: quic://94.141.118.222:19199 reachability=public connectivity=direct provider=isp1 maps_to=192.168.200.85:18080.
A candidate may be valid only from outside the NAT. Same-LAN hairpin failure is not proof that the public candidate is broken; verification must be scoped to an external peer or remote probe.
The route builder scores candidates by reachability, measured latency, loss, load, policy, and verification freshness. If one provider or interface fails, the node keeps the same node identity and republishes a new candidate epoch.
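
As an illustration of candidate scoring, the Go sketch below ranks endpoint candidates for one node. The weights and the `Candidate`, `Score`, and `Rank` names are assumptions made for this example; the document does not specify a scoring formula. The addresses reuse the home-1 example above. Unmeasured latency (zero) earns no latency bonus, so a failed candidate cannot look fast by accident.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a reduced endpoint candidate with local observations.
type Candidate struct {
	Address        string
	Verified       bool    // verification is fresh
	LatencyMS      float64 // 0 means "not measured"
	LossPercent    float64
	RecentFailures int
}

// Score returns a higher value for a better candidate.
// All weights are illustrative assumptions.
func Score(c Candidate) float64 {
	score := 0.0
	if c.Verified {
		score += 50
	}
	// Only a real measurement below 100 ms earns a latency bonus.
	if c.LatencyMS > 0 && c.LatencyMS < 100 {
		score += 100 - c.LatencyMS
	}
	score -= 2 * c.LossPercent
	score -= 20 * float64(c.RecentFailures)
	return score
}

// Rank returns a copy of cs ordered from best to worst candidate.
func Rank(cs []Candidate) []Candidate {
	out := append([]Candidate(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool { return Score(out[i]) > Score(out[j]) })
	return out
}

func main() {
	ranked := Rank([]Candidate{
		{Address: "quic://94.141.118.222:19199", Verified: true, RecentFailures: 2},
		{Address: "quic://192.168.200.85:18080", Verified: true, LatencyMS: 20},
	})
	for _, c := range ranked {
		fmt.Println(c.Address, Score(c))
	}
}
```

A demoted candidate stays in the ranked list, so it remains available as a later fallback instead of being dropped.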

Install Artifact Bootstrap Contract

Every installable artifact is a node image plus a bootstrap seed set.

This applies to Android, Docker, Linux services, and Windows services. The seed set is baked into the artifact or delivered beside it as signed install metadata. It is not a single backend URL and not a management server choice. It is a bounded list of known fabric endpoint candidates that may be reachable from different network positions:

public QUIC candidates, for example usa-los-1 or externally reachable home-1;
private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
closed-site candidates that have no Internet route themselves but can reach a neighboring fabric node;
optional pinned certificate hashes or authority descriptors for high-trust entry candidates.

On first start the installed node tries the seed set, joins through any reachable peer, registers as a candidate node with minimal rights, and then receives signed peer-directory, role, update, and policy state through the fabric. If a node is installed in an isolated network, it can still become visible and usable when at least one nearby seed node can route onward to the rest of the fabric. User login on Android is only identity/profile selection for the vpn-client service; the underlying phone node already exists and participates in the fabric with candidate permissions.
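
A first-start join over the seed set could look roughly like the Go sketch below. It assumes the seed list is already loaded from the artifact, and it injects the actual QUIC dial as a function so the example stays self-contained. `Seed`, `Dialer`, `JoinThroughSeeds`, and both error values are hypothetical names, and the `.example` hosts are placeholders.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Seed is one bootstrap endpoint candidate shipped with an install artifact.
type Seed struct {
	Address   string // QUIC endpoint candidate, e.g. quic://host:port
	PinSHA256 string // optional pinned certificate hash; empty means none
}

// ErrNoSeedReachable is returned when no seed accepted the join.
var ErrNoSeedReachable = errors.New("no bootstrap seed reachable")

// ErrUnreachable is a placeholder failure used by the demo dialer.
var ErrUnreachable = errors.New("seed unreachable")

// Dialer attempts a fabric join through one seed. A real dialer would open
// a QUIC connection and check the pin; here it is supplied by the caller.
type Dialer func(Seed) error

// JoinThroughSeeds tries the seeds in order and returns the first one that
// accepts the join. Seeds without a quic:// address are skipped, because
// non-QUIC carriers are not a valid node transport.
func JoinThroughSeeds(seeds []Seed, dial Dialer) (Seed, error) {
	attempts := 0
	for _, s := range seeds {
		if !strings.HasPrefix(s.Address, "quic://") {
			continue
		}
		attempts++
		if err := dial(s); err == nil {
			return s, nil
		}
	}
	return Seed{}, fmt.Errorf("%w after %d attempts", ErrNoSeedReachable, attempts)
}

func main() {
	seeds := []Seed{
		{Address: "quic://seed-a.example:443"},
		{Address: "quic://seed-b.example:443"},
	}
	dial := func(s Seed) error {
		if s.Address == "quic://seed-b.example:443" {
			return nil
		}
		return ErrUnreachable
	}
	joined, err := JoinThroughSeeds(seeds, dial)
	fmt.Println(joined.Address, err)
}
```

After a successful join, the node would register with candidate permissions and receive signed state through the fabric, as described above.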

Node Roles

Initial role vocabulary:

mobile-edge: mobile Android/iOS fabric node.
entry: accepts external sessions.
relay: forwards fabric traffic between nodes.
exit: terminates routes into a target network or service zone.
service-host: runs service adapters such as admin console, VPN exit, RDP, HTTP ingress, storage, or update-cache.
control-plane: participates in control authority, policy decisions, route authority, or quorum work.
route-coordinator: calculates or assists route candidates for a partition, region, or service class.
storage: stores approved replicated fabric state.
observer: collects telemetry and health without carrying user traffic.
update-cache: mirrors signed artifacts close to nodes.

Roles are policy decisions, not binary builds. A phone can theoretically receive any role, but scheduler scoring must account for battery, OS restrictions, NAT, uplink stability, foreground state, and user cost policy.
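
To make the "identity + roles + capabilities + policy + health" model concrete, here is a small Go sketch of a node record and a relay eligibility check. The struct fields, the `MayRelay` rule, and the battery threshold are illustrative assumptions; the actual scheduler scoring is not defined in this document.

```go
package main

import "fmt"

// Role names follow the initial role vocabulary.
type Role string

const (
	RoleMobileEdge Role = "mobile-edge"
	RoleRelay      Role = "relay"
	RoleExit       Role = "exit"
)

// Health is a reduced set of mobile-relevant health signals.
type Health struct {
	BatteryPercent int
	Charging       bool
	Metered        bool
}

// Policy is a reduced policy view for one node.
type Policy struct {
	AllowedRoles       map[Role]bool
	MinBatteryForRelay int // hypothetical threshold
}

// Node mirrors identity + roles + capabilities + policy + health.
type Node struct {
	ID           string // stands in for the cryptographic node identity
	Roles        []Role
	Capabilities map[string]string
	Policy       Policy
	Health       Health
}

// MayRelay reports whether the node holds the relay role, policy allows it,
// and current health does not veto it.
func (n Node) MayRelay() bool {
	hasRole := false
	for _, r := range n.Roles {
		if r == RoleRelay {
			hasRole = true
		}
	}
	if !hasRole || !n.Policy.AllowedRoles[RoleRelay] {
		return false
	}
	if n.Health.Metered {
		return false
	}
	return n.Health.Charging || n.Health.BatteryPercent >= n.Policy.MinBatteryForRelay
}

func main() {
	phone := Node{
		ID:     "phone-a",
		Roles:  []Role{RoleMobileEdge, RoleRelay},
		Policy: Policy{AllowedRoles: map[Role]bool{RoleRelay: true}, MinBatteryForRelay: 60},
		Health: Health{BatteryPercent: 35},
	}
	fmt.Println("relay on battery:", phone.MayRelay())
	phone.Health.Charging = true
	fmt.Println("relay while charging:", phone.MayRelay())
}
```

The same binary carries every role; only the signed role set, policy, and live health decide what the node is allowed to do at a given moment.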

Capability Model

Nodes must advertise capability facts in heartbeats and peer updates:

supported fabric protocol versions;
supported transport: UDP/QUIC;
NAT type and reachability;
measured RTT/loss/jitter/bandwidth to peers and entry candidates;
CPU, memory, queue depth, file descriptor/socket pressure;
battery state, charging state, mobile/wifi network type, metered policy;
max relay bandwidth and allowed traffic classes;
service roles and service capacity;
trust tier and allowed tenant/organization scopes;
local policy version, peer directory version, route cache version.

Fabric Data Session V1

The first practical protocol step is a persistent binary QUIC data session. The framing stays service-neutral, but the runtime transport is QUIC only.

Minimum frame set:

HELLO              node identity, protocol version, capabilities
AUTH               signed session token or mTLS-bound proof
SESSION_READY      accepted limits, route epoch, peer epoch
OPEN_STREAM        stream id, service id, traffic class, route id
DATA               stream id, sequence, flags, payload
ACK                stream id, received sequence/window
PING/PONG          RTT and liveness
ROUTE_UPDATE       new route lease or alternate route set
STREAM_CREDIT      per-stream backpressure window
NODE_PRESSURE      queue/cpu/memory/network pressure signal
CLOSE_STREAM       normal stream close
RESET_STREAM       failed stream, other streams remain alive
GOAWAY             draining or protocol shutdown
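
A binary encoding for these frames might be sketched in Go as below. The wire layout (1-byte type, 4-byte big-endian stream id, 4-byte payload length, payload), the 64 KiB payload limit, and all identifiers are assumptions chosen for illustration; the document does not fix a layout, and per-frame fields such as sequence numbers are left inside the payload here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// FrameType enumerates the minimum frame set; PING and PONG get separate codes.
type FrameType uint8

const (
	FrameHello FrameType = iota + 1
	FrameAuth
	FrameSessionReady
	FrameOpenStream
	FrameData
	FrameAck
	FramePing
	FramePong
	FrameRouteUpdate
	FrameStreamCredit
	FrameNodePressure
	FrameCloseStream
	FrameResetStream
	FrameGoAway
)

const (
	headerSize = 9         // type(1) + stream id(4) + payload length(4)
	MaxPayload = 64 * 1024 // illustrative size limit
)

var (
	ErrUnknownType   = errors.New("unknown frame type")
	ErrFrameTooLarge = errors.New("payload exceeds limit")
	ErrShortFrame    = errors.New("buffer shorter than frame")
)

// Frame is one decoded fabric frame.
type Frame struct {
	Type     FrameType
	StreamID uint32
	Payload  []byte
}

func validType(t FrameType) bool { return t >= FrameHello && t <= FrameGoAway }

// Encode serializes a frame after validating type and size.
func Encode(f Frame) ([]byte, error) {
	if !validType(f.Type) {
		return nil, ErrUnknownType
	}
	if len(f.Payload) > MaxPayload {
		return nil, ErrFrameTooLarge
	}
	buf := make([]byte, headerSize+len(f.Payload))
	buf[0] = byte(f.Type)
	binary.BigEndian.PutUint32(buf[1:5], f.StreamID)
	binary.BigEndian.PutUint32(buf[5:9], uint32(len(f.Payload)))
	copy(buf[headerSize:], f.Payload)
	return buf, nil
}

// Decode parses one frame from buf and returns the number of bytes consumed.
func Decode(buf []byte) (Frame, int, error) {
	if len(buf) < headerSize {
		return Frame{}, 0, ErrShortFrame
	}
	t := FrameType(buf[0])
	if !validType(t) {
		return Frame{}, 0, ErrUnknownType
	}
	n := binary.BigEndian.Uint32(buf[5:9])
	if n > MaxPayload {
		return Frame{}, 0, ErrFrameTooLarge
	}
	total := headerSize + int(n)
	if len(buf) < total {
		return Frame{}, 0, ErrShortFrame
	}
	payload := append([]byte(nil), buf[headerSize:total]...)
	return Frame{Type: t, StreamID: binary.BigEndian.Uint32(buf[1:5]), Payload: payload}, total, nil
}

func main() {
	wire, _ := Encode(Frame{Type: FrameData, StreamID: 7, Payload: []byte("hi")})
	f, n, err := Decode(wire)
	fmt.Println(f.Type, f.StreamID, string(f.Payload), n, err)
}
```

Checking the declared length against the limit before touching the payload keeps a malformed or hostile frame from forcing a large allocation.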

Traffic classes:

control: authorization, route updates, attach/detach, liveness.
dns: small, latency-sensitive name resolution.
interactive: RDP input, SSH interactive, UI control.
reliable: normal web/API traffic.
bulk: downloads, uploads, sync, large media.
droppable: telemetry samples, optional probes, low-value background data.

Each stream has independent flow control and backpressure. Bulk can be slowed or moved to another route without blocking control or interactive streams.
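
The independence of streams can be shown with a toy in-memory model in Go. It is only a sketch of per-stream credit accounting under assumed semantics (credit counted in bytes, payloads queued per stream when credit runs out); `Session`, `Open`, `Send`, and `Grant` are hypothetical names and no real transport is involved.

```go
package main

import "fmt"

// stream holds per-stream flow-control state.
type stream struct {
	class  string
	credit int      // bytes the peer currently allows on this stream
	queue  [][]byte // payloads waiting for credit, in send order
}

// Session is a toy data session with independent streams.
type Session struct {
	streams map[uint32]*stream
}

func NewSession() *Session { return &Session{streams: make(map[uint32]*stream)} }

// Open registers a stream with an initial credit window.
func (s *Session) Open(id uint32, class string, credit int) {
	s.streams[id] = &stream{class: class, credit: credit}
}

// Send reports whether the payload could be sent immediately. When credit is
// insufficient, the payload is queued on that stream only; other streams are
// unaffected. Sends to unknown streams are dropped and report false.
func (s *Session) Send(id uint32, payload []byte) bool {
	st, ok := s.streams[id]
	if !ok {
		return false
	}
	if len(st.queue) == 0 && len(payload) <= st.credit {
		st.credit -= len(payload)
		return true
	}
	st.queue = append(st.queue, payload)
	return false
}

// Grant applies a STREAM_CREDIT update and flushes queued payloads in order.
// It returns how many queued payloads were released.
func (s *Session) Grant(id uint32, n int) int {
	st, ok := s.streams[id]
	if !ok {
		return 0
	}
	st.credit += n
	flushed := 0
	for len(st.queue) > 0 && len(st.queue[0]) <= st.credit {
		st.credit -= len(st.queue[0])
		st.queue = st.queue[1:]
		flushed++
	}
	return flushed
}

func main() {
	s := NewSession()
	s.Open(1, "bulk", 4)
	s.Open(2, "interactive", 100)
	fmt.Println("bulk sent now:", s.Send(1, make([]byte, 10)))
	fmt.Println("interactive sent now:", s.Send(2, make([]byte, 5)))
	fmt.Println("bulk flushed after credit:", s.Grant(1, 6))
}
```

Because the queue lives on the stream rather than on the session, a stalled bulk stream holds back only its own payloads.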

Route Model

The fabric must maintain multiple candidate routes for an active session:

phone-a -> entry-1 -> home-1
phone-a -> phone-b -> relay-2 -> home-1
phone-a -> entry-2 -> relay-4 -> service-host-7

Route scoring inputs:

policy and role eligibility;
route length and failure domains;
RTT, jitter, packet loss, bandwidth estimate;
queue depth and retransmit pressure;
current node CPU/memory/socket pressure;
mobile battery/charging/metered status;
historical reliability;
service locality;
tenant/organization isolation;
cost and operator preference.

Routes are issued as short leases with route id, epoch, allowed channels, allowed service classes, hop list or next-hop policy, expiry, and fencing rules.
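
A lease check on the sending side could be sketched in Go like this. The sketch assumes Unix-second expiry, a fencing rule of the form "leases below the current fencing epoch are rejected", and a lease list already ordered by scheduler preference. `RouteLease`, `Usable`, and `PickRoute` are hypothetical names, and signature verification is omitted.

```go
package main

import "fmt"

// RouteLease is a reduced route lease (signature fields omitted).
type RouteLease struct {
	RouteID        string
	Epoch          uint64
	AllowedClasses map[string]bool
	Hops           []string
	ExpiresAt      int64 // Unix seconds (assumption)
}

// Usable reports whether the lease may carry the given traffic class now.
// Leases older than the fencing epoch are treated as revoked.
func (l RouteLease) Usable(class string, now int64, fenceEpoch uint64) bool {
	return now < l.ExpiresAt && l.Epoch >= fenceEpoch && l.AllowedClasses[class]
}

// PickRoute returns the first usable lease from a preference-ordered list,
// which is how a session falls over to an alternate route.
func PickRoute(leases []RouteLease, class string, now int64, fenceEpoch uint64) (RouteLease, bool) {
	for _, l := range leases {
		if l.Usable(class, now, fenceEpoch) {
			return l, true
		}
	}
	return RouteLease{}, false
}

func main() {
	leases := []RouteLease{
		{RouteID: "r1", Epoch: 3, ExpiresAt: 100,
			AllowedClasses: map[string]bool{"interactive": true, "bulk": true},
			Hops:           []string{"phone-a", "entry-1", "home-1"}},
		{RouteID: "r2", Epoch: 3, ExpiresAt: 500,
			AllowedClasses: map[string]bool{"interactive": true},
			Hops:           []string{"phone-a", "phone-b", "relay-2", "home-1"}},
	}
	// At t=200 the primary lease has expired, so the alternate is chosen.
	l, ok := PickRoute(leases, "interactive", 200, 3)
	fmt.Println(l.RouteID, l.Hops, ok)
}
```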

Service Discovery

Services are logical names, not fixed hosts:

service: admin-console
replicas: home-1, node-2, node-9
policy: active-active or leader/follower
ingress: vpn.cin.su / admin.cin.su / internal name

vpn.cin.su as an HTTP/HTTPS entry is a service endpoint. It can be hosted on any eligible service-host node. If one replica fails, another replica can accept the service lease and traffic can be routed to it.

Scale Model

For 1000 devices, the platform needs entry pools, exit pools, route leases, session placement, and overload protection.

For millions of devices, the platform additionally needs regional route coordinators, distributed peer directories, local control partitions, telemetry sampling, policy sharding, and resource accounting.

Every device joining the system increases potential edge capacity, but only if the scheduler can safely decide when that node is allowed to relay, store, serve, or only consume.

Security And Abuse Controls

The distributed model increases both power and risk. The following controls are required before mobile relay/control/storage roles are broadly enabled:

node identity is cryptographic; IP address is never identity;
all route leases are signed or locally verifiable;
roles are scoped by organization, tenant, service, and time;
mobile relay is opt-in by policy and user/device state;
storage uses encrypted shards and explicit retention policy;
control-plane participation requires trust tier and quorum policy;
nodes never receive more topology or secret data than their role requires;
abuse controls rate-limit relay use, route churn, and failed authentication;
traffic accounting records who relayed what class and how much, without exposing payload contents.

Observability

The current tests show why an aggregate "VPN works" signal is not enough. The fabric needs per-node, per-route, and per-stream metrics:

throughput by direction and traffic class;
RTT, jitter, loss, retransmits, queue depth;
frame encode/decode errors;
stream resets and close reasons;
route switch reason and time to recovery;
node pressure and scheduler decisions;
service discovery failover events;
Android foreground/background and network transition events.

Work Plan

Stage FNP-0: Architecture Lock

Status: this document.

Deliverables:

lock in "every device is a node" as the model;
separate fabric, services, control, and data plane;
define missing protocol, route, scale, security, and observability pieces.

Stage FNP-1: Binary Frame Contract

Deliverables:

add a transport-neutral Go package for Fabric Data Session V1 frame types;
encode/decode binary frames with size limits and validation;
add tests for malformed frames, max frame size, stream ids, and frame type compatibility;
do not connect it to production traffic yet.

Stage FNP-2: Persistent Session Runtime Skeleton

Status: in progress in agents/rap-node-agent/internal/fabricproto.

Deliverables:

implement in-memory session runtime with streams, sequence numbers, ACK, stream credit, reset, and close;
handle protocol frames for open/data/ack/credit/reset/close/ping/goaway;
prove that a blocked bulk stream does not block control/interactive streams;
expose per-stream metrics.

Stage FNP-3: WebSocket/TCP Compatibility Transport

Status: removed as a migration-only stage.

This stage existed to bootstrap binary frame semantics before QUIC routing and carrier reuse were ready. It introduced the transport-neutral frame loop, session-shaped packet mapper, and early smoke tooling. That work was useful as scaffolding, but it is no longer the target runtime.

Current rule:

WebSocket/TCP fabric-session transport is not part of the supported node dataplane.
QUIC/UDP is the only supported runtime carrier between fabric nodes.
Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling must move to QUIC-native smoke and recovery paths.
Any routing, heartbeat, registry, peer probe, or service dataplane logic must reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a valid alternate path.

What survives from this stage is the service-neutral frame model and the FabricSessionPacketTransport mapping, which now ride on QUIC carriers instead of a WebSocket fallback.

VPN fabric-session gateway transport now consumes ranked endpoint candidates, so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC peer endpoints when the control plane has not published valid candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint instead of disabling verification while the cluster CA path is being finished.
VPN fabric-session dialing now walks all ranked endpoint candidates before declaring the target unavailable, so a failed QUIC candidate does not silently re-enable WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport, certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts, candidate failures, selected transport family, certificate pin usage, and the last selected endpoint/failure reason.
VPN fabric-session dialing feeds candidate success/failure observations back into endpoint ranking, so repeated local QUIC failures can temporarily demote that endpoint while preserving it as a later fallback.
Endpoint scoring no longer treats missing/zero latency on failed observations as moderate latency, preventing failed candidates from receiving a false score bonus.
Endpoint health observations are now emitted as a bounded standalone heartbeat report (rap.vpn_fabric_endpoint_health_report.v1) so control plane can ingest candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral (fabric_session_binary_frames) and reports QUIC selection plus non-QUIC candidate rejection instead of describing the dataplane as WebSocket-capable.
Endpoint health observations are pruned in-memory by age and count before snapshot/report generation, preventing long-running nodes from accumulating unbounded candidate history.
Scoped and control-plane synthetic mesh config can now carry peer_endpoint_observations, and VPN fabric-session endpoint ranking merges those remote health hints with local observations using the newest signal.
Endpoint health observations include source and reporter node fields so control plane can distinguish local dial feedback from aggregated or policy-generated health hints.
The endpoint health heartbeat report also includes the reporter node id at the report level for simpler multi-node ingestion and diagnostics.
Peer cache construction now applies endpoint health observations when ranking peer endpoint candidates, so recovery and warm-peer decisions see the same degraded-path feedback as VPN fabric-session dialing.
Peer cache snapshots expose best-candidate score reasons, giving diagnostics a direct explanation for why a QUIC, WebSocket, relay, or fallback endpoint was chosen.
Heartbeat capabilities now advertise that peer-cache endpoint ranking consumes health observations, allowing control plane and UI diagnostics to detect nodes running the health-aware peer selection path.
VPN fabric QUIC transport now reuses QUIC connections per peer endpoint and opens logical fabric-session streams on top, with heartbeat telemetry for QUIC connection opens, reuses, evictions, and active count.
Cached QUIC connections are pruned by idle TTL, preventing long-running agents from holding unused peer connections indefinitely.
QUIC carrier connections now track active logical streams and enforce a per-connection stream limit, exposing stream opens/closes and limit rejects in transport telemetry.
The per-connection QUIC stream limit is configurable through RAP_VPN_FABRIC_QUIC_MAX_STREAMS_PER_CONN / -vpn-fabric-quic-max-streams-per-conn and propagated by host-agent install profiles.
QUIC stream-limit rejects are classified as capacity pressure instead of peer endpoint failure, so local health feedback does not incorrectly demote a healthy but saturated carrier.
VPN fabric dial telemetry records the last capacity-limited endpoint and transport, making stream saturation visible without poisoning endpoint health observations.
The same dial telemetry now keeps bounded per-endpoint capacity-pressure counters, so operators can see whether stream saturation is occasional or concentrated on a specific QUIC carrier.
Fresh local capacity-pressure counters also feed endpoint ranking as a bounded penalty, spreading new fabric sessions away from a saturated carrier without declaring that carrier failed.
VPN fabric-session transport now opens configurable per-class stream shards for interactive and bulk packet traffic, so heavy browser flows do not share a single logical stream with latency-sensitive RDP/control packets.
Host-agent install commands for Docker, Linux, and Windows expose the same VPN fabric-session/QUIC tuning flags as install profiles, keeping manual and profile-based rollout paths aligned.
Gateway runtime snapshots include the fabric-session packet transport stream layout and send counters by traffic class/stream id for load-test diagnosis.
Those snapshots also summarize configured stream class/shard counts and active send class/stream counts, making sharding health visible without expanding per-stream maps.
Gateway shutdown now closes all VPN fabric-session stream shards and then the underlying fabric session, preventing stale logical streams from consuming QUIC carrier capacity after reconnects or rollout restarts.
Gateway runtime cancellation now fans out to both upload and download loops when either direction exits, so transport cleanup runs promptly on one-sided TUN or carrier failures.
Fabric-session packet transport snapshots include close-frame and close-error counters for verifying that stream shard cleanup is actually happening.
Outgoing VPN packet batches are split by traffic class and selected stream before they are framed, so one gateway batch containing many browser flows does not collapse onto the first packet's logical stream.
mesh-live-smoke now sends mixed bulk and interactive VPN packets in a single fabric-session batch and requires them to remain sharded.
The smoke report also exposes the mixed-batch frame fanout so regressions show up as a concrete fanout drop, not just a failed boolean.
Batch fanout is bounded by configured stream shards, so a large batch with many flows cannot explode into unbounded fabric frames.
Heartbeat tests assert the advertised VPN fabric stream-shard count and capability, keeping control-plane diagnostics aligned with runtime behavior.
Fabric-session packet transport snapshots now report packets per stream plus last/max batch fanout, making real multi-site load distribution measurable from gateway status.
Receive-side fabric-session packet counters are reported by traffic class and stream id as well, so gateway status can compare TX and RX distribution under browser/RDP load.
QUIC fabric transport snapshots expose the configured stream limit, saturated connection count, and capacity pressure percentage next to stream limit rejects.
Closed cached QUIC connections discovered during snapshot generation now update the transport's cumulative eviction counters, keeping successive heartbeats consistent.
mesh-live-smoke reports QUIC fabric capacity-pressure percentage from the transport snapshot, verifying that the capacity fields are populated.
QUIC fabric snapshots now include per cached connection pressure, endpoint, and saturation state; VPN fabric endpoint ranking consumes that live local pressure before stream-limit rejection, spreading new sessions away from already busy QUIC carriers.
Per-connection QUIC snapshot entries are sorted by peer and endpoint so heartbeats and diagnostics stay stable across reports.
When local live QUIC pressure and recent capacity-limit counters overlap, the ranking input keeps the stronger pressure signal rather than allowing a weak fresh sample to hide a saturated endpoint.
Heartbeat VPN fabric reports now include a bounded quic_capacity_pressure summary sorted by busiest cached QUIC connection, making overload diagnosis visible without digging through the full carrier snapshot.
VPN fabric flow-scheduler snapshots now expose bulk pressure activation plus bulk and interactive/control channel counts, making mixed browser/RDP load diagnosis explicit when bulk windows are reduced to protect interactive traffic.
mesh-live-smoke now exercises that mixed-load scheduler path and reports bulk pressure activation plus bulk/interactive window recommendations.
Flow-scheduler route recovery telemetry now records per-channel route switches, the failed route a channel recovered from, and aggregate recovered-channel / switch counts, making alternate-route recovery measurable during load tests.
mesh-live-smoke now also exercises a primary-route failure followed by an alternate-route success and reports the resulting route switch count.
The same smoke output reports measured route recovery milliseconds for the synthetic failover path.
Smoke now includes max/average route recovery timing from the scheduler aggregate snapshot as well.
Route recovery telemetry includes failure/switch timestamps and recovery duration in milliseconds for each recovered flow channel.
Scheduler snapshots also aggregate route recovery max/average milliseconds across recovered channels for quick load-test health checks. Route recovery telemetry now includes normalized switch reasons and aggregate reason counts, so load tests can distinguish peer failures, timeouts, and other route-break causes. mesh-live-smoke reports the synthetic route-recovery reason beside recovery timing and switch count. Common route switch reasons are bucketed into stable labels such as timeout, peer_unavailable, connection_refused, connection_reset, no_route_to_host, and capacity_limited to keep heartbeat cardinality bounded.

Flow-scheduler snapshots now include a machine-readable pressure level (nominal, warning, critical) and bounded reason list derived from drops, route failures, route recovery, slow channels, bulk pressure, and adaptive backpressure. The same pressure classification includes a bounded 0-100 score for automated route, endpoint, and node comparisons. mesh-live-smoke reports the mixed-load scheduler pressure level, score, and reasons.

Heartbeat VPN fabric transport reports now include a compact flow_pressure summary with level, score, reasons, bulk pressure, route recovery timing, reason counts, and recommended per-class windows. The flow_pressure summary includes a recommended_action such as observe, throttle_bulk, reduce_parallelism, prefer_faster_route, observe_recovery, rebuild_or_reroute, or shed_or_reroute. recommended_action is now part of the shared FabricFlowSchedulerSnapshot contract, so heartbeat reports and smoke diagnostics consume the same runtime decision. The scheduler's nominal snapshot explicitly reports the observe action.

Flow-scheduler snapshots keep a bounded pressure transition history with the observed level, score, reasons, and recommended action. Repeated snapshots do not duplicate unchanged pressure states, so controllers can distinguish current state from recent worsening or recovery without unbounded heartbeat growth.
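The bounded, de-duplicated pressure transition history can be sketched as a small ring-like buffer. This is a minimal Go illustration under stated assumptions: `PressureState` and `PressureHistory` are hypothetical types, and "unchanged" is interpreted here as identical level, score, reasons, and action, which may be stricter or looser than the real comparison.

```go
package main

import "fmt"

// PressureState is a hypothetical snapshot of scheduler pressure.
type PressureState struct {
	Level   string   // nominal, warning, critical
	Score   int      // bounded 0-100
	Reasons []string // bounded reason list
	Action  string   // e.g. observe, throttle_bulk
}

// PressureHistory keeps at most limit transitions, oldest first.
type PressureHistory struct {
	limit   int
	entries []PressureState
}

func NewPressureHistory(limit int) *PressureHistory {
	if limit < 1 {
		limit = 1
	}
	return &PressureHistory{limit: limit}
}

// sameState reports whether two states are identical in every field.
func sameState(a, b PressureState) bool {
	if a.Level != b.Level || a.Score != b.Score || a.Action != b.Action {
		return false
	}
	if len(a.Reasons) != len(b.Reasons) {
		return false
	}
	for i := range a.Reasons {
		if a.Reasons[i] != b.Reasons[i] {
			return false
		}
	}
	return true
}

// Observe records s only if it differs from the most recent entry.
// It returns true when a new transition was appended.
func (h *PressureHistory) Observe(s PressureState) bool {
	if n := len(h.entries); n > 0 && sameState(h.entries[n-1], s) {
		return false // repeated snapshot, nothing changed
	}
	h.entries = append(h.entries, s)
	if len(h.entries) > h.limit {
		// Drop the oldest entry in place so memory stays bounded.
		copy(h.entries, h.entries[1:])
		h.entries = h.entries[:h.limit]
	}
	return true
}

// Entries returns a copy of the recorded transitions, oldest first.
func (h *PressureHistory) Entries() []PressureState {
	return append([]PressureState(nil), h.entries...)
}

func main() {
	h := NewPressureHistory(2)
	h.Observe(PressureState{Level: "nominal", Action: "observe"})
	h.Observe(PressureState{Level: "nominal", Action: "observe"})
	h.Observe(PressureState{Level: "warning", Score: 45, Reasons: []string{"bulk_pressure"}, Action: "throttle_bulk"})
	for _, e := range h.Entries() {
		fmt.Println(e.Level, e.Score, e.Action)
	}
}
```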
mesh-live-smoke reports the recommended action for its mixed bulk/interactive load scenario. Nodes advertise the vpn_fabric_flow_pressure capability when that heartbeat summary is available. When the VPN fabric ingress runtime has not been initialized yet, the heartbeat still emits a nominal flow_pressure summary for schema stability.

Endpoint ranking treats capacity_limited observations as a soft pressure penalty instead of a hard recent failure, enabling load spreading without marking the carrier unhealthy. Local QUIC stream-limit pressure is now emitted as a capacity observation with no failure-count increment, allowing the control plane to spread load without treating saturation as packet-path breakage.

Cached QUIC carrier idle TTL is configurable through RAP_VPN_FABRIC_QUIC_IDLE_TTL_SECONDS / -vpn-fabric-quic-idle-ttl and propagated by host-agent install profiles.
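The ranking behavior described above (capacity pressure as a soft penalty, hard failures counted separately, and the stronger of live and recent pressure winning) might look roughly like the following Go sketch. `EndpointObservation`, the field names, and the weight of 1000 per hard failure are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sort"
)

// EndpointObservation is a hypothetical ranking input for one QUIC endpoint.
type EndpointObservation struct {
	Endpoint          string
	RecentFailures    int // hard failures such as timeouts or resets
	LivePressurePct   int // from the local QUIC snapshot, 0-100
	RecentCapacityPct int // from recent capacity_limited observations, 0-100
}

// effectivePressure keeps the stronger signal, so a weak fresh sample
// cannot hide a recently saturated endpoint.
func effectivePressure(o EndpointObservation) int {
	if o.LivePressurePct > o.RecentCapacityPct {
		return o.LivePressurePct
	}
	return o.RecentCapacityPct
}

// rankScore is lower-is-better. Pressure is a soft penalty (at most 100),
// while each hard failure outweighs any amount of pressure.
func rankScore(o EndpointObservation) int {
	return o.RecentFailures*1000 + effectivePressure(o)
}

// RankEndpoints returns endpoint names from most to least preferred.
// Ties are broken by name so the order is stable across reports.
func RankEndpoints(obs []EndpointObservation) []string {
	sorted := append([]EndpointObservation(nil), obs...)
	sort.SliceStable(sorted, func(i, j int) bool {
		si, sj := rankScore(sorted[i]), rankScore(sorted[j])
		if si != sj {
			return si < sj
		}
		return sorted[i].Endpoint < sorted[j].Endpoint
	})
	names := make([]string, 0, len(sorted))
	for _, o := range sorted {
		names = append(names, o.Endpoint)
	}
	return names
}

func main() {
	ranked := RankEndpoints([]EndpointObservation{
		{Endpoint: "edge-a", LivePressurePct: 10, RecentCapacityPct: 90},
		{Endpoint: "edge-b", LivePressurePct: 30},
		{Endpoint: "edge-c", RecentFailures: 1},
	})
	fmt.Println(ranked)
}
```

A saturated endpoint stays eligible in this sketch: it is only ranked behind less busy carriers, whereas an endpoint with a hard failure always ranks behind every endpoint that merely has pressure.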

Deliverables:

carry binary frames over one persistent QUIC fabric session;
replace high-frequency /mesh/v1/forward packet POST usage for VPN routes in a gated mode;
remove HTTP/WebSocket packet forwarding from the supported dataplane.

Stage FNP-4: Android As Mobile Fabric Node

Deliverables:

Android advertises node capabilities, network state, battery state, and supported transports;
Android opens Fabric Data Session V1 to entry;
VPN packets map to independent streams/classes;
diagnostics can run per-stream and per-route tests.

Stage FNP-5: Route Leases And Multipath

Deliverables:

route result includes primary and alternate routes;
runtime can switch new streams to a better route;
interactive streams can recover quickly after route fencing;
route health uses dataplane metrics, not only HTTP request success.
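
As an illustration of these deliverables, the Go sketch below models a route result with a primary and alternates and picks the route for a new stream from dataplane metrics, skipping fenced routes. `Route`, `RouteLease`, and the cost weighting (RTT plus 50 ms per percent of loss) are hypothetical choices, not the planned contract.

```go
package main

import "fmt"

// Route is a hypothetical route entry with dataplane health metrics.
type Route struct {
	ID        string
	RTTMillis int
	LossPct   float64
	Fenced    bool // fenced routes must not receive new streams
}

// RouteLease is a hypothetical route result: one primary plus alternates.
type RouteLease struct {
	Primary    Route
	Alternates []Route
}

// cost is lower-is-better; the weighting is an illustrative assumption.
func cost(r Route) float64 {
	return float64(r.RTTMillis) + r.LossPct*50
}

// PickForNewStream returns the cheapest unfenced route.
// On equal cost the primary (listed first) is preferred.
func (l RouteLease) PickForNewStream() (Route, bool) {
	candidates := append([]Route{l.Primary}, l.Alternates...)
	var best Route
	found := false
	for _, r := range candidates {
		if r.Fenced {
			continue
		}
		if !found || cost(r) < cost(best) {
			best, found = r, true
		}
	}
	return best, found
}

func main() {
	lease := RouteLease{
		Primary:    Route{ID: "primary", RTTMillis: 20},
		Alternates: []Route{{ID: "alt-1", RTTMillis: 35}},
	}
	r, _ := lease.PickForNewStream()
	fmt.Println("new stream uses:", r.ID)

	lease.Primary.Fenced = true // primary fenced after a failure
	r, _ = lease.PickForNewStream()
	fmt.Println("after fencing:", r.ID)
}
```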

Stage FNP-6: QUIC/UDP Transport

Status: active runtime baseline in internal/mesh.

Deliverables:

implement QUIC transport for Fabric Data Session V1;
keep QUIC/UDP as the only supported inter-node runtime transport;
test 4G/Wi-Fi transition and NAT behavior;
benchmark throughput, latency, and recovery against current HTTP forwarding.

Stage FNP-7: Distributed Service Discovery

Deliverables:

service names map to eligible service replicas;
admin console and VPN service can move between service-host nodes;
service failover is expressed as leases and route updates.
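
A minimal Go sketch of lease-based service resolution follows. It assumes a hypothetical in-memory `Registry` that maps a service name to replicas, each with a health flag and a lease expiry; failover then amounts to the next eligible replica being returned once the current lease lapses or the replica becomes unhealthy.

```go
package main

import "fmt"

// Replica is a hypothetical service replica hosted on a fabric node.
type Replica struct {
	NodeID           string
	Healthy          bool
	LeaseExpiresUnix int64 // lease validity as a Unix timestamp (seconds)
}

// Registry maps a service name to its candidate replicas, in preference order.
type Registry map[string][]Replica

// Resolve returns the first healthy replica whose lease is still valid.
func (r Registry) Resolve(service string, nowUnix int64) (Replica, bool) {
	for _, rep := range r[service] {
		if rep.Healthy && rep.LeaseExpiresUnix > nowUnix {
			return rep, true
		}
	}
	return Replica{}, false
}

func main() {
	reg := Registry{
		"admin-console": {
			{NodeID: "node-a", Healthy: true, LeaseExpiresUnix: 100},
			{NodeID: "node-b", Healthy: true, LeaseExpiresUnix: 500},
		},
	}
	before, _ := reg.Resolve("admin-console", 50)
	after, _ := reg.Resolve("admin-console", 200) // node-a lease has expired
	fmt.Println(before.NodeID, "->", after.NodeID)
}
```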

Stage FNP-8: Mobile Relay And Distributed Capacity

Deliverables:

mobile nodes can opt into relay under strict policy;
scheduler scores battery, metered network, NAT, trust, and load;
route planner can use mobile nodes where they are closer/faster;
accounting and abuse controls are enforced.
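
The scoring inputs listed above could be combined as in the following Go sketch. Every threshold and weight here (the 30% battery floor, the NAT penalty of 20, the trust bonus) is a hypothetical placeholder, and `MobileNodeState` is an illustrative type rather than the real capability report.

```go
package main

import "fmt"

// MobileNodeState is a hypothetical view of a mobile node's relay signals.
type MobileNodeState struct {
	RelayOptIn   bool
	BatteryPct   int
	Charging     bool
	Metered      bool
	SymmetricNAT bool
	TrustLevel   int // assumption: 0 (untrusted) to 3 (fully trusted)
	LoadPct      int
}

// RelayScore returns a 0-100 score and whether the node may relay at all.
// Strict policy gates are applied first; soft signals adjust the score.
func RelayScore(s MobileNodeState) (int, bool) {
	if !s.RelayOptIn || s.Metered || s.TrustLevel < 1 {
		return 0, false
	}
	if !s.Charging && s.BatteryPct < 30 {
		return 0, false
	}
	score := 100
	score -= s.LoadPct / 2
	if !s.Charging {
		score -= (100 - s.BatteryPct) / 4
	}
	if s.SymmetricNAT {
		score -= 20 // harder to reach, so less attractive as a relay
	}
	score += s.TrustLevel * 5
	if score < 0 {
		score = 0
	}
	if score > 100 {
		score = 100
	}
	return score, true
}

func main() {
	score, ok := RelayScore(MobileNodeState{
		RelayOptIn: true, BatteryPct: 50, SymmetricNAT: true, TrustLevel: 1, LoadPct: 40,
	})
	fmt.Println(score, ok)
}
```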

Stage FNP-9: Scale To Large Fleets

Deliverables:

entry and route coordinator pools;
peer directory sharding;
telemetry sampling and aggregation;
per-tenant quotas and fairness;
load tests for 1000 simulated devices, then larger synthetic fleets.

Immediate Next Action

Start Stage FNP-1 in rap-node-agent as a non-production protocol package. The goal is to create the binary frame contract and tests without disturbing the current VPN path. After that, wire it into a gated persistent session runtime and only then move Android/VPN traffic onto it.
