@@ -24,6 +24,21 @@ policy allows, host limited control/storage roles when approved, and report
mobile-specific capacity signals such as battery, network type, NAT behavior,
foreground/background state, and metered network policy.

Node survival and recovery across endpoint moves, NAT-only reachability, legacy
contract overlap, and unavailable manual host access are governed by
`docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`. In
particular, nodes like `ifcm-rufms-s-mo1cr` must remain recoverable through the
fabric/update/recovery plane even when direct host login is unavailable.

Android implementation contract:

- app install/build contains a QUIC bootstrap seed set;
- runtime launch carries a `fabric_bootstrap_config`, not a backend URL;
- user login/profile selection happens over the fabric control channel;
- the Android VPN dataplane is QUIC fabric runtime only; HTTP batch packet
  forwarding, WebSocket packet relay, and direct backend packet relay are not
  part of the supported runtime path.

## What Was Missing

The current implementation proves route leases and production VPN forwarding,
@@ -60,8 +75,9 @@ route and stream semantics.
must keep working through cached policy, peer directories, route leases, and
local health when central components are degraded.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
   primary high-speed data plane.
8. QUIC is the single runtime transport between fabric nodes. HTTP/HTTPS may
   serve human-facing download or panel pages, but it is not a node data-plane
   fallback and must not carry service packets.
9. There must be no single management service that can seize the fabric. Control,
   storage, update distribution, route authority, and certificate authority are
   fabric roles assigned to eligible nodes and protected by quorum signatures.
@@ -73,6 +89,20 @@ route and stream semantics.
the usable candidate locally by policy, reachability, latency, load, and
trust.

## Transport vs Control API

The system must keep two layers separate in naming, design, and diagnostics:

- `Fabric Transport` means inter-node runtime delivery only. It is QUIC over UDP
  and carries leased service-channel/data-plane traffic between nodes.
- `Control API` means human/operator/programmatic management surfaces such as
  web-admin, release publication, policy mutation, audit queries, and status
  reads. Today that surface is HTTP/JSON and may sit behind HTTPS ingress.

The HTTP Control API is not a fallback transport for node-to-node runtime
traffic. A `409 Conflict` from the backend, a panel page load, or a release
download is control-plane behavior, not fabric transport behavior.
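
One way to keep the two layers apart in diagnostics is to classify every endpoint by URL scheme before it reaches any dial or routing code. The sketch below is illustrative only: `classify_endpoint` and `require_fabric_transport` are hypothetical helper names, and mapping `http`/`https` to the Control API label is an assumption drawn from the description above.

```python
from urllib.parse import urlsplit

# Assumption: scheme alone decides the layer. Real nodes may carry richer metadata.
FABRIC_SCHEMES = {"quic"}             # inter-node runtime delivery
CONTROL_SCHEMES = {"http", "https"}   # human/operator/programmatic management


def classify_endpoint(url: str) -> str:
    """Return the layer label for an endpoint URL (hypothetical helper)."""
    scheme = urlsplit(url).scheme.lower()
    if scheme in FABRIC_SCHEMES:
        return "fabric-transport"
    if scheme in CONTROL_SCHEMES:
        return "control-api"
    return "unsupported"


def require_fabric_transport(url: str) -> str:
    """Refuse anything that is not a QUIC endpoint on the runtime path."""
    layer = classify_endpoint(url)
    if layer != "fabric-transport":
        raise ValueError(f"{url!r} is {layer}, not a fabric transport endpoint")
    return url
```

A caller on the data plane would pass every candidate through `require_fabric_transport`, so an HTTP control URL fails loudly instead of being used as a quiet fallback.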

## Distributed Control And Trust

The target fabric behaves like a distributed network, not a client/server
@@ -145,6 +175,143 @@ Endpoint state is also distributed:
- Neighbor selection is local and latency/load-aware; the state log announces
  facts and policy, not a forced single next hop.

### Fabric Registry Gossip

Moving a service must not break the farm.

`RAP_BACKEND_URL` or any fixed HTTP/API address is only a migration fallback for
old nodes. It is not cluster truth. After bootstrap, a node finds services by
logical role through signed fabric registry records that can be carried by any
reachable peer.

The rule is:

- any node may relay registry knowledge;
- only authorized signatures can create or replace trusted registry truth;
- a new record becomes active only after signature/authority checks and a
  successful live probe through the fabric or a policy-approved direct QUIC
  candidate;
- older still-valid records remain as fallback until their TTL expires.

Registry record shape:

```text
schema_version: rap.fabric.registry.gossip_record.v1
cluster_id
service: control-api | update-store | update-cache | web-admin | vpn-egress-pool | ...
scope: farm | cluster | organization
organization_id: optional
epoch: monotonic service epoch
generation: optional human/debug generation
issued_at
expires_at
issuer_node_id
issuer_role: control-authority | update-authority | storage-authority | route-authority
endpoints:
  - endpoint_id
    address: quic://...
    transport: direct_quic | relay_quic | reverse_quic
    reachability
    connectivity_mode
    priority / weight
    peer_cert_sha256
signatures:
  - key_id
    issuer_id
    role
    alg: ed25519
    value
```
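
To make the record shape concrete, here is a minimal sketch of it as Python dataclasses, together with a payload function that drops `signatures` before serializing. Several things here are assumptions rather than part of the specification: the sorted-key compact JSON encoding, the integer timestamps, and the omission of optional fields such as `organization_id`, `generation`, `reachability`, and `priority / weight`. The name `canonical_payload` is hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Endpoint:
    endpoint_id: str
    address: str            # quic://...
    transport: str          # direct_quic | relay_quic | reverse_quic
    peer_cert_sha256: str = ""


@dataclass
class Signature:
    key_id: str
    issuer_id: str
    role: str
    alg: str                # ed25519
    value: str


@dataclass
class RegistryRecord:
    schema_version: str
    cluster_id: str
    service: str
    scope: str
    epoch: int
    issued_at: int          # assumption: Unix seconds
    expires_at: int
    issuer_node_id: str
    issuer_role: str
    endpoints: list = field(default_factory=list)
    signatures: list = field(default_factory=list)


def canonical_payload(record: RegistryRecord) -> bytes:
    """Serialize everything except `signatures` in a stable byte form.

    Assumption: sorted-key compact JSON stands in for whatever canonical
    encoding the real implementation signs.
    """
    body = asdict(record)
    body.pop("signatures")
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
```

Because the signatures are excluded, adding a second authority's signature to a record leaves the signed bytes unchanged, which is what lets M-of-N signers sign the same payload independently.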

Acceptance algorithm:

1. Reject records for a different cluster, expired records, future records past
   allowed clock skew, unsupported schema, missing endpoints, or non-QUIC
   endpoints.
2. Verify the canonical record payload, excluding `signatures`, against the
   configured authority set.
3. Check the signer role is allowed for that service and scope.
4. Require quorum where policy says M-of-N; development may use one trusted
   signer but must mark that signer as bootstrap/development authority.
5. Store accepted records as `candidate`.
6. Promote `candidate` to `active` only after live-probing at least one endpoint
   and verifying the endpoint identity/pin.
7. Prefer higher epoch, then newer issued time, then generation. Do not replace
   a live active record with an older record.
8. Keep the previous active record usable as fallback until TTL expiry when a
   newer candidate is not yet live-verified.
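
The steps above can be sketched as a small state holder per service. This is a sketch under stated assumptions, not the project's implementation: Ed25519 verification (step 2) and the role check (step 3) are assumed to have happened already, so `signers` holds only key ids whose signatures verified; `probe_ok` stands for a live probe plus pin check done elsewhere; `ServiceSlot`, `static_reject_reason`, and the 30-second skew are hypothetical.

```python
from dataclasses import dataclass

SCHEMA = "rap.fabric.registry.gossip_record.v1"


@dataclass(frozen=True)
class Record:
    cluster_id: str
    service: str
    epoch: int
    issued_at: float
    expires_at: float
    endpoints: tuple        # e.g. ("quic://10.0.0.5:18080",)
    signers: tuple          # key ids with already-verified signatures (assumption)
    generation: int = 0
    schema_version: str = SCHEMA


def static_reject_reason(rec, cluster_id, now, max_skew=30.0):
    """Step 1: return a rejection reason, or None if the record passes."""
    if rec.schema_version != SCHEMA:
        return "unsupported schema"
    if rec.cluster_id != cluster_id:
        return "different cluster"
    if rec.expires_at <= now:
        return "expired"
    if rec.issued_at > now + max_skew:
        return "issued in the future"
    if not rec.endpoints:
        return "missing endpoints"
    if any(not e.startswith("quic://") for e in rec.endpoints):
        return "non-QUIC endpoint"
    return None


def has_quorum(rec, allowed_signers, m):
    """Step 4: at least m distinct allowed signers."""
    return len(set(rec.signers) & set(allowed_signers)) >= m


def order_key(rec):
    """Step 7: higher epoch, then newer issued time, then generation."""
    return (rec.epoch, rec.issued_at, rec.generation)


class ServiceSlot:
    """Holds the active, candidate, and fallback record for one service."""

    def __init__(self):
        self.active = None
        self.candidate = None
        self.fallback = None

    def offer(self, rec, cluster_id, now, allowed_signers, m):
        """Steps 1, 4, 5, 7: store rec as candidate if it is valid and newer."""
        if static_reject_reason(rec, cluster_id, now) is not None:
            return False
        if not has_quorum(rec, allowed_signers, m):
            return False
        for current in (self.active, self.candidate):
            if current is not None and order_key(rec) <= order_key(current):
                return False
        self.candidate = rec
        return True

    def promote(self, probe_ok):
        """Step 6: promote only after a successful live probe."""
        if self.candidate is None or not probe_ok:
            return False
        self.fallback, self.active, self.candidate = self.active, self.candidate, None
        return True

    def usable(self, now):
        """Step 8: active first, then the previous record until its TTL expires."""
        return [r for r in (self.active, self.fallback)
                if r is not None and r.expires_at > now]
```

Until `promote(True)` is called, the old active record keeps serving, so a newer record that cannot be probed never displaces a working one.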

This is the recovery path for mass moves. If every known service endpoint moves
at once, the operator or a control-authority node only has to deliver a signed
registry record to one reachable fabric node. That node validates it, probes it,
promotes it, and gossips it onward. User/mobile/candidate nodes may carry the
record, but cannot make it authoritative unless their role certificate permits
that service/scope.

Service classes that must use this registry before production hardening:

- `control-api`: heartbeat, auth/profile control projection, node registration,
  policy/snapshot fetch.
- `update-store`: signed release manifests and compatibility windows.
- `update-cache`: artifact mirrors close to nodes.
- `web-admin`: management UI/API ingress replicas.
- `vpn-egress-pool`: user-visible exit pools; users see pools, not backing
  nodes.

Legacy endpoint compatibility is allowed only for rolling migration:

- Old nodes may use their baked HTTP/control URL only to fetch a new version or
  a signed registry bootstrap record.
- New nodes must treat fixed URLs as fallback hints, not as authority.
- Old code is removed only after every live node reports a version that supports
  signed registry gossip and service discovery by role.

Listener configuration is split into bind sockets and reachability candidates:

- `listen_addr` is what the local process binds, for example
  `0.0.0.0:18080` on `home-1`.
- `endpoint_candidates` is the ordered set of addresses other nodes may try.
  A single node can publish LAN addresses, addresses on several network
  adapters, STUN/reflexive addresses, and multiple public NAT forwards from
  different providers.
- Public NAT forwards are modeled as candidates with metadata, not as a
  replacement for the internal bind address. Example:
  `quic://94.141.118.222:19199 reachability=public connectivity=direct
  provider=isp1 maps_to=192.168.200.85:18080`.
- A candidate may be valid only from outside the NAT. Same-LAN hairpin failure
  is not a proof that the public candidate is broken; verification must be
  scoped to an external peer or remote probe.
- The route builder scores candidates by reachability, measured latency, loss,
  load, policy, and verification freshness. If one provider or interface fails,
  the node keeps the same node identity and republishes a new candidate epoch.
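
As an illustration of candidates carrying metadata, the sketch below parses the example candidate line into an address plus key/value attributes and ranks candidates with a toy score. The weights, the 300-second freshness cutoff, and the names `parse_candidate`, `score`, and `rank` are all assumptions; load and policy, which the route builder also considers, are left out to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    address: str
    attrs: dict = field(default_factory=dict)
    rtt_ms: float = 0.0          # measured latency
    loss: float = 0.0            # packet loss ratio, 0.0 to 1.0
    verified_age_s: float = 0.0  # seconds since the last successful verification


def parse_candidate(text: str) -> Candidate:
    """Split 'quic://host:port key=value ...' into address and attributes."""
    address, *pairs = text.split()
    attrs = dict(pair.split("=", 1) for pair in pairs)
    return Candidate(address=address, attrs=attrs)


def score(c: Candidate, max_age_s: float = 300.0) -> float:
    """Lower is better. Weights are illustrative, not taken from the project."""
    if c.verified_age_s > max_age_s:
        return float("inf")      # stale verification: try last
    return c.rtt_ms + 1000.0 * c.loss + 0.1 * c.verified_age_s


def rank(candidates):
    """Return candidates ordered from most to least preferred."""
    return sorted(candidates, key=score)
```

Usage: `parse_candidate("quic://94.141.118.222:19199 reachability=public connectivity=direct provider=isp1 maps_to=192.168.200.85:18080")` yields a candidate whose `attrs["maps_to"]` still records the internal bind address, so the public forward never replaces it.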

## Install Artifact Bootstrap Contract

Every installable artifact is a node image plus a bootstrap seed set.

This applies to Android, Docker, Linux services, and Windows services. The seed
set is baked into the artifact or delivered beside it as signed install
metadata. It is not a single backend URL and not a management server choice. It
is a bounded list of known fabric endpoint candidates that may be reachable from
different network positions:

- public QUIC candidates, for example `usa-los-1` or externally reachable
  `home-1`;
- private/LAN QUIC candidates, for example Docker-test or home LAN nodes;
- closed-site candidates that have no Internet route themselves but can reach a
  neighboring fabric node;
- optional pinned certificate hashes or authority descriptors for high-trust
  entry candidates.

On first start the installed node tries the seed set, joins through any reachable
peer, registers as a candidate node with minimal rights, and then receives
signed peer-directory, role, update, and policy state through the fabric. If a
node is installed in an isolated network, it can still become visible and usable
when at least one nearby seed node can route onward to the rest of the fabric.
User login on Android is only identity/profile selection for the `vpn-client`
service; the underlying phone node already exists and participates in the
fabric with candidate permissions.
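
A first-start join over the seed set might look like the following sketch. It assumes a hypothetical `dial` callable that returns the SHA-256 fingerprint (hex) of the certificate the peer presented, or `None` when the seed is unreachable; `Seed` and `join_fabric` are illustrative names, and real seed metadata may carry authority descriptors instead of a bare pin.

```python
import hmac
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Seed:
    address: str                           # quic://host:port
    kind: str                              # "public" | "lan" | "closed-site"
    peer_cert_sha256: Optional[str] = None  # optional pin for high-trust entries


def join_fabric(seeds: Iterable[Seed],
                dial: Callable[[str], Optional[str]]) -> Optional[Seed]:
    """Return the first seed that is reachable and matches its pin, else None."""
    for seed in seeds:
        if not seed.address.startswith("quic://"):
            continue                        # seeds are QUIC candidates only
        presented = dial(seed.address)      # hypothetical: fingerprint or None
        if presented is None:
            continue                        # unreachable from this network position
        pin = seed.peer_cert_sha256
        if pin is not None and not hmac.compare_digest(pin, presented):
            continue                        # pinned seed presented a different cert
        return seed
    return None
```

Note that no seed is privileged: a node on an isolated network simply falls through to whichever nearby seed answers, and everything after that arrives as signed state through the fabric.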

## Node Roles

Initial role vocabulary:
@@ -172,7 +339,7 @@ uplink stability, foreground state, and user cost policy.
Nodes must advertise capability facts in heartbeats and peer updates:

- supported fabric protocol versions;
- supported transports: UDP/QUIC, TCP, WebSocket, HTTPS fallback;
- supported transport: UDP/QUIC;
- NAT type and reachability;
- measured RTT/loss/jitter/bandwidth to peers and entry candidates;
- CPU, memory, queue depth, file descriptor/socket pressure;
@@ -184,9 +351,8 @@ Nodes must advertise capability facts in heartbeats and peer updates:

## Fabric Data Session V1

The first practical protocol step is a persistent binary data session. It may
initially run over WebSocket/TCP for faster delivery, but the framing must be
transport-neutral so the same protocol can move to QUIC/UDP.
The first practical protocol step is a persistent binary QUIC data session.
The framing stays service-neutral, but the runtime transport is QUIC only.

Minimum frame set:

@@ -338,69 +504,36 @@ Deliverables:

### Stage FNP-3: WebSocket/TCP Compatibility Transport

Status: started with a transport-neutral `io.Reader`/`io.Writer` frame loop,
WebSocket frame adapter in `agents/rap-node-agent/internal/fabricproto`, and a
gated/authenticated mesh smoke endpoint/client at `/mesh/v1/fabric/session/ws`.
`rap-host-agent fabric-session-smoke` provides the first operator smoke command
and can pass signed fabric-session authority payload/signature headers for
authority-pinned nodes.
Node-agent exposes the endpoint only when `RAP_MESH_FABRIC_SESSION_ENABLED` /
`-mesh-fabric-session-enabled` is set, and reports the enabled endpoint in
heartbeat metadata.
`mesh-live-smoke` includes a fabric-session `PING`/`PONG` check alongside the
existing route and test-service probes. Mesh client code now has a reusable
`FabricSessionClient` for multiple frame exchanges over one WebSocket session,
plus a pump mode with outbound/inbound queues for asynchronous stream traffic.
Live smoke verifies two `PING`/`PONG` round trips on the same connection.
`vpnruntime` has a binary VPN packet-batch mapper for `FrameData` payloads so
packet delivery can move away from JSON production envelopes in a gated mode.
`FabricSessionPacketTransport` now adapts that mapper to the existing
`PacketTransport` interface and can demultiplex inbound DATA frames into the
VPN packet inbox by stream id.
`mesh-live-smoke` now sends a real VPN packet batch through
`FabricSessionPacketTransport` over the WebSocket fabric session and requires a
stream ACK from the remote node.
Mesh has a peer session manager that reuses one pump per peer endpoint, giving
VPN transport selection a stable place to acquire long-lived fabric sessions.
Node config now carries a separate gated
`RAP_VPN_FABRIC_SESSION_TRANSPORT_ENABLED` switch and heartbeat report for the
binary VPN packet transport, keeping endpoint exposure and VPN dataplane
rollout independently controllable.
When the VPN fabric-session switch is enabled, node-agent now attempts to use a
long-lived peer session for gateway packet transport and falls back to the
existing HTTP production envelope path when the peer session is unavailable.
Peer session reuse now evicts closed pumps before reuse, so failed WebSocket
sessions can be reopened on the next transport acquisition.
Heartbeat telemetry includes peer session manager counters for active sessions,
reuses, opens, closed-pump evictions, and explicit close operations.
The mesh package now exposes a service-neutral `FabricTransport` abstraction;
the current WebSocket carrier implements it as `WebSocketFabricTransport`, so
future QUIC/UDP transport can be added without changing VPN/RDP/HTTP services.
`QUICFabricTransport` now implements the same interface and carries the same
binary `fabricproto` frames over a QUIC stream, with local smoke coverage for
`PING`/`PONG` and DATA/ACK.
Carrier selection understands QUIC transport labels and `quic://host:port`
endpoints while preserving WebSocket as the default fallback.
`QUICFabricServer` provides the matching node-side QUIC listener for accepting
fabric streams and running the same session frame handler as other carriers.
Node-agent can now gate the QUIC listener with
`RAP_MESH_QUIC_FABRIC_ENABLED` / `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`, report it
in heartbeat metadata, and pass the setting through host-agent install/update
profiles.
`mesh-live-smoke` verifies the QUIC carrier by starting a temporary QUIC fabric
server and requiring a `PING`/`PONG` round trip over `QUICFabricTransport`.
Nodes now advertise enabled QUIC fabric listeners as `direct_quic` fast-path
endpoint candidates, and endpoint ranking prefers QUIC over WebSocket/HTTPS
compatibility candidates for fabric sessions.
Status: retired as a migration-only stage.

This stage existed to bootstrap binary frame semantics before QUIC routing and
carrier reuse were ready. It introduced the transport-neutral frame loop,
session-shaped packet mapper, and early smoke tooling. That work was useful as
scaffolding, but it is no longer the target runtime.

Current rule:

- WebSocket/TCP fabric-session transport is not part of the supported node
  dataplane.
- QUIC/UDP is the only supported runtime carrier between fabric nodes.
- Old WebSocket/TCP smoke helpers are being removed; migration/debug tooling
  must move to QUIC-native smoke and recovery paths.
- Any routing, heartbeat, registry, peer probe, or service dataplane logic must
  reject WebSocket/TCP carriers as non-QUIC transport, not treat them as a
  valid alternate path.
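
Under this rule, a dial loop over ranked candidates might look like the sketch below: non-QUIC candidates are counted as rejected rather than tried, every QUIC candidate gets one attempt, and exhausting the list raises instead of dropping to another carrier. `dial_fabric_session`, `FabricUnavailable`, and the `dial` callable are hypothetical names, and the returned session is an opaque placeholder.

```python
class FabricUnavailable(Exception):
    """Raised when no QUIC candidate could be dialed."""

    def __init__(self, rejected, failed):
        super().__init__(
            f"target unavailable: {len(failed)} QUIC dial failures, "
            f"{len(rejected)} non-QUIC candidates rejected"
        )
        self.rejected = rejected
        self.failed = failed


def dial_fabric_session(ranked_endpoints, dial):
    """Try ranked candidates in order; never fall back to a non-QUIC carrier.

    `dial` is a hypothetical callable returning a session object or None.
    Returns (session, rejected, failed) so callers can feed telemetry counters.
    """
    rejected, failed = [], []
    for endpoint in ranked_endpoints:
        if not endpoint.startswith("quic://"):
            rejected.append(endpoint)     # counted, never dialed
            continue
        session = dial(endpoint)
        if session is not None:
            return session, rejected, failed
        failed.append(endpoint)           # keep walking the remaining candidates
    raise FabricUnavailable(rejected, failed)
```

Returning the rejected and failed lists alongside the session keeps the diagnostics honest: a successful dial still shows which candidates were refused as non-QUIC.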

What survives from this stage is the service-neutral frame model and the
`FabricSessionPacketTransport` mapping, which now ride on QUIC carriers instead
of a WebSocket fallback.
VPN fabric-session gateway transport now consumes ranked endpoint candidates,
so dataplane sessions can select QUIC fast-path candidates and fall back to
legacy peer endpoints when the control plane has not published candidates yet.
so dataplane sessions can select QUIC fast-path candidates and refuse non-QUIC
peer endpoints when the control plane has not published valid candidates yet.
The temporary self-signed QUIC listener advertises its SHA-256 certificate
fingerprint in endpoint metadata, and the QUIC client can pin that fingerprint
instead of disabling verification while the cluster CA path is being finished.
VPN fabric-session dialing now walks all ranked endpoint candidates before
falling back to the legacy peer endpoint, so a failed QUIC candidate does not
block WebSocket/HTTPS compatibility transport.
declaring the target unavailable, so a failed QUIC candidate does not silently
re-enable WebSocket/HTTPS compatibility transport.
Successful VPN fabric-session dialing logs the selected candidate, transport,
certificate pin usage, and remaining fallback count for phone-side diagnostics.
Heartbeat telemetry now includes VPN fabric-session dial counters for attempts,
@@ -416,8 +549,8 @@ Endpoint health observations are now emitted as a bounded standalone heartbeat
report (`rap.vpn_fabric_endpoint_health_report.v1`) so control plane can ingest
candidate feedback without parsing the transport diagnostics blob.
VPN fabric-session transport telemetry is carrier-neutral
(`fabric_session_binary_frames`) and reports QUIC/WebSocket as available
carriers instead of describing the dataplane as WebSocket-only.
(`fabric_session_binary_frames`) and reports QUIC selection plus non-QUIC
candidate rejection instead of describing the dataplane as WebSocket-capable.
Endpoint health observations are pruned in-memory by age and count before
snapshot/report generation, preventing long-running nodes from accumulating
unbounded candidate history.
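
Pruning by age and count can be sketched in a few lines. The `Observation` fields, the `prune_observations` name, and the choice to keep the newest entries when the count limit applies are assumptions for illustration; the source only states that both bounds are enforced before report generation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    endpoint: str
    observed_at: float   # assumption: seconds on a monotonic or Unix clock
    ok: bool


def prune_observations(observations, now, max_age_s, max_count):
    """Drop observations older than max_age_s, then keep the newest max_count."""
    if max_count <= 0:
        return []
    fresh = [o for o in observations if now - o.observed_at <= max_age_s]
    fresh.sort(key=lambda o: o.observed_at)
    return fresh[-max_count:]
```

Calling this immediately before building a snapshot keeps the report bounded regardless of how long the node has been running.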
@@ -583,10 +716,10 @@ propagated by host-agent install profiles.

Deliverables:

- carry binary frames over one persistent WebSocket/TCP connection;
- carry binary frames over one persistent QUIC fabric session;
- replace high-frequency `/mesh/v1/forward` packet POST usage for VPN routes in
  a gated mode;
- keep HTTP forwarding as fallback.
- remove HTTP/WebSocket packet forwarding from the supported dataplane.

### Stage FNP-4: Android As Mobile Fabric Node

@@ -609,12 +742,12 @@ Deliverables:

### Stage FNP-6: QUIC/UDP Transport

Status: started with `QUICFabricTransport` in `internal/mesh`.
Status: active runtime baseline in `internal/mesh`.

Deliverables:

- implement QUIC transport for Fabric Data Session V1;
- preserve WebSocket/TCP as fallback;
- keep QUIC/UDP as the only supported inter-node runtime transport;
- test 4G/Wi-Fi transition and NAT behavior;
- benchmark throughput, latency, and recovery against current HTTP forwarding.