VPN / IP Tunnel Service Target Design
Status: Stage C18 planning result. Documentation only.
This document defines the target VPN/IP tunnel service architecture for the Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN devices, mesh traffic, service workload execution, API changes, migrations, or RDP behavior changes.
Transport clarification: this document defines a service layer above Fabric Core. It does not redefine node-to-node transport. Current fabric inter-node transport is QUIC-only; VPN/IP tunnel runtime must request and use fabric routes instead of introducing a separate packet transport contract.
Purpose
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
The service must allow managed access to private networks while preserving the platform's core rules:
- PostgreSQL remains the durable source of truth.
- Redis remains live coordination only.
- Fabric Routing Engine owns route choice.
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
Non-Goals
Stage C18 does not implement:
- VPN/IP tunnel runtime
- TUN/TAP device handling
- packet forwarding
- host route or firewall manipulation
- QUIC, WebRTC, relay packet routing, or production mesh traffic
- Windows virtual adapter, Android VpnService, or mobile client work
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes
Core Concepts
vpn_connection
vpn_connection is the logical control-plane entity for one managed VPN/IP
tunnel connection to a private endpoint such as an office, customer site,
branch network, partner network, or private resource zone.
Target fields:
- id
- organization_id
- cluster_id
- name
- target endpoint / office identity
- protocol/provider family
- credential/config reference
- allowed node policy
- mode: single_active for the initial model
- desired state: enabled or disabled
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
- route policy references
- QoS / bandwidth policy references
- placement constraints
- safe status projection
The entity belongs to the control plane. It must not be inferred from a node environment variable, a manually started VPN process, or a host-local config file.
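As a rough illustration of the target fields above, the entity could be modeled as a plain record. This is a sketch only: the Python types, the field spellings after name, and the safe_status helper are assumptions for illustration, not the accepted schema.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SINGLE_ACTIVE = "single_active"  # the only mode in the initial model


class DesiredState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


@dataclass(frozen=True)
class VpnConnection:
    # Field names after `name` are hypothetical spellings of the target fields.
    id: str
    organization_id: str
    cluster_id: str
    name: str
    target_endpoint: str
    protocol_family: str
    credential_ref: str  # a reference only, never the secret itself
    allowed_node_ids: frozenset[str]
    mode: Mode = Mode.SINGLE_ACTIVE
    desired_state: DesiredState = DesiredState.DISABLED
    routing_usage: frozenset[str] = frozenset()
    route_policy_ids: tuple[str, ...] = ()
    qos_policy_ids: tuple[str, ...] = ()


def safe_status(conn: VpnConnection) -> dict[str, str]:
    """Organization-facing projection: no node IDs and no credential reference."""
    return {
        "id": conn.id,
        "name": conn.name,
        "desired_state": conn.desired_state.value,
    }
```

Keeping the record immutable reflects that nodes consume desired state and do not edit it; changes would come from the control plane as a new version of the record.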
vpn_connection_lease
vpn_connection_lease represents current ownership.
Target fields:
- vpn_connection_id
- cluster_id
- organization_id
- owner_node_id
- lease generation / fencing epoch
- expires_at
- renewed_at
- released_at
- fenced_at
- status
Only the current owner with a valid, unexpired, unfenced lease may execute the VPN connection.
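The ownership rule above can be expressed as a single predicate. A minimal sketch, assuming timezone-aware timestamps and omitting the status field; the function name lease_permits_execution and the generation field name are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class VpnConnectionLease:
    vpn_connection_id: str
    cluster_id: str
    organization_id: str
    owner_node_id: str
    generation: int  # lease generation / fencing epoch
    expires_at: datetime
    renewed_at: Optional[datetime] = None
    released_at: Optional[datetime] = None
    fenced_at: Optional[datetime] = None


def lease_permits_execution(
    lease: VpnConnectionLease, node_id: str, now: datetime
) -> bool:
    """True only for the current owner of a valid, unexpired, unfenced lease."""
    return (
        lease.owner_node_id == node_id
        and lease.released_at is None
        and lease.fenced_at is None
        and now < lease.expires_at
    )
```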
vpn_route_policy
vpn_route_policy defines what traffic may use the connection.
Policy dimensions:
- allowed CIDRs
- denied CIDRs
- DNS suffix or DNS server policy
- split-tunnel or full-tunnel eligibility
- service-specific usage
- resource-specific usage
- organization and role scope
- QoS class and bandwidth limits
Route policy is desired state. Runtime nodes apply scoped policy; they do not invent routes.
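The allowed/denied CIDR dimensions could be evaluated along these lines. This sketch assumes that a deny match overrides an allow match and that anything not explicitly allowed is refused; the document does not fix that precedence, and the type and function names are hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class VpnRoutePolicy:
    allowed_cidrs: tuple[str, ...]
    denied_cidrs: tuple[str, ...] = ()


def destination_allowed(policy: VpnRoutePolicy, destination: str) -> bool:
    """Assumed precedence: deny wins over allow, default is refuse."""
    addr = ipaddress.ip_address(destination)

    def matches(cidrs: tuple[str, ...]) -> bool:
        # Membership is False when address and network IP versions differ.
        return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

    if matches(policy.denied_cidrs):
        return False
    return matches(policy.allowed_cidrs)
```

A node would only apply a policy object handed to it in a scoped snapshot; it would not construct one from local configuration.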
vpn_credential_ref
VPN credentials must be referenced through an approved secret resolver.
Nodes receive credentials/config only when authorized and only when required to execute or pre-warm the assigned connection. Nodes must not receive unrelated organization credentials.
Architecture Placement
- Control Plane: owns vpn_connection desired state, policy, lease/fencing, audit
- Fabric Core: owns node identity, role assignment consumption, scoped snapshots, node-local state, and service supervision boundary
- Fabric Routing Engine: chooses path to the current active VPN owner or eligible egress pool
- VPN/IP Tunnel Service Runtime: executes tunnel only when assigned and leased
- Data Plane: carries encrypted tunnel packets later, with QoS and backpressure
The backend/control plane must not become a production VPN packet relay.
Universal Packet Dataplane Principle
The VPN service carries IP packets. It must not classify the product as a web proxy, an RDP helper, or an HTTP-only accelerator. HTTP, DNS, RDP, SSH, VNC, messengers, audio calls, file transfer, application sync, and future mobile or desktop traffic are all just packets flowing through the same tunnel contract.
Implementation rules:
- packet forwarding must not branch on application protocol for correctness
- performance work must optimize the shared packet path, not a specific site or port
- batching, backpressure, retries, and route failover are dataplane mechanics and must apply to all traffic
- diagnostics may summarize protocol/ports for operators, but diagnostics must not decide whether traffic is allowed to flow
- a transient transport error must not permanently downgrade the tunnel to a per-packet request mode
- the control plane chooses entry, exit, route, lease, and policy; packet flow should use the fastest available fabric path
The temporary backend HTTP packet relay is a lab compatibility path. The production target is:
client device
-> selected entry node
-> fabric route / alternate route set
-> selected exit node
-> target private network or Internet gateway
When the cluster grows, route choice must consider latency, loss, queue depth, node health, role eligibility, lease freshness, and regional/network locality. If a node or link degrades, the fabric should switch to an alternate route without requiring the client to understand mesh topology.
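One way to picture route choice over these signals is a filter followed by a cost comparison. This is purely illustrative: the cost formula and its weights are invented for the example, and role eligibility, lease freshness, and locality are collapsed into a single eligible flag.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteCandidate:
    route_id: str
    latency_ms: float
    loss_ratio: float  # 0.0 .. 1.0
    queue_depth: int
    healthy: bool
    eligible: bool  # role eligibility and lease freshness, checked elsewhere


def pick_route(candidates: list[RouteCandidate]) -> Optional[RouteCandidate]:
    """Drop unusable routes, then take the lowest-cost remaining one."""
    usable = [c for c in candidates if c.healthy and c.eligible]
    if not usable:
        return None

    def cost(c: RouteCandidate) -> float:
        # Hypothetical weights: loss is penalized far more than latency.
        return c.latency_ms + 1000.0 * c.loss_ratio + 2.0 * c.queue_depth

    return min(usable, key=cost)
```

Because the client only ever sees the selected entry node, re-running a selection like this after a link degrades does not expose mesh topology to the client.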
Control Plane Responsibilities
The control plane owns:
- durable vpn_connection desired state
- route policy and service usage policy
- allowed node policy
- placement and candidate selection
- lease creation, renewal validation, and fencing decisions
- safe status projection
- audit events
- credential reference ownership
The control plane does not push arbitrary packets. It authorizes and records what should exist.
Node Responsibilities
Nodes do not decide to create VPN connections.
A node may execute a connection only when all of the following are true:
- node belongs to the correct cluster
- node has the required capability and role assignment
- node is allowed by the vpn_connection node policy
- node has a current signed/scoped configuration snapshot
- node holds the active lease
- desired state is enabled
- organization and service policy permit use
The node must stop execution when:
- lease is lost, expired, or fenced
- desired state becomes disabled
- role assignment is removed
- allowed node policy changes
- local node enters unsafe partition/degraded state
- cluster tells the node to drain
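The start and stop conditions above reduce to one gate that a node could evaluate on every snapshot or lease change. A minimal sketch, assuming the node already has each condition as a boolean; the NodeView type and the reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    """Hypothetical snapshot of what a node knows about one vpn_connection."""
    in_correct_cluster: bool
    has_capability_and_role: bool
    allowed_by_node_policy: bool
    has_current_signed_snapshot: bool
    holds_active_lease: bool
    desired_state_enabled: bool
    policy_permits_use: bool
    unsafe_or_partitioned: bool = False
    drain_requested: bool = False


def blocking_reasons(view: NodeView) -> list[str]:
    """Every reason the node must not run the tunnel; empty means it may."""
    checks = [
        (view.in_correct_cluster, "wrong_cluster"),
        (view.has_capability_and_role, "capability_or_role_missing"),
        (view.allowed_by_node_policy, "node_policy_denies"),
        (view.has_current_signed_snapshot, "snapshot_missing_or_stale"),
        (view.holds_active_lease, "lease_missing_or_invalid"),
        (view.desired_state_enabled, "desired_state_disabled"),
        (view.policy_permits_use, "policy_denies_use"),
        (not view.unsafe_or_partitioned, "unsafe_partition_or_degraded"),
        (not view.drain_requested, "drain_requested"),
    ]
    return [reason for ok, reason in checks if not ok]
```

A runtime would start only while the list is empty and stop as soon as it becomes non-empty, which keeps the start and stop rules from drifting apart.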
Single-Active Lease and Fencing Model
The initial mode is single_active.
Correctness requirement:
- at most one node may maintain the active VPN tunnel for a given vpn_connection at any time
- stale owners must be fenced before replacement becomes authoritative
- ownership changes must be monotonic through a lease generation or equivalent fencing epoch
- connect/disconnect must be idempotent
- split-brain must not create duplicate active tunnels
Suggested target mechanics:
- short lease TTL
- periodic renewal
- monotonic lease generation
- node-local watchdog that stops tunnel when renewal fails
- explicit release on graceful shutdown
- fencing event before replacement if previous owner is uncertain
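The suggested mechanics can be sketched with an in-memory table standing in for the durable lease store. Everything here is an assumption for illustration: the class and method names, the use of plain float timestamps, and the choice to fence an expired lease implicitly when a replacement acquires.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner_node_id: str
    generation: int
    expires_at: float
    fenced: bool = False


class SingleActiveLeases:
    """In-memory stand-in for the durable lease store (hypothetical API)."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._leases: dict[str, Lease] = {}

    def acquire(self, conn_id: str, node_id: str, now: float) -> Optional[Lease]:
        current = self._leases.get(conn_id)
        if current and not current.fenced and now < current.expires_at:
            # Idempotent for the current owner; refused for anyone else.
            return current if current.owner_node_id == node_id else None
        if current is not None:
            current.fenced = True  # fence the stale owner before replacing it
        generation = current.generation + 1 if current else 1
        lease = Lease(node_id, generation, now + self.ttl)
        self._leases[conn_id] = lease
        return lease

    def renew(self, conn_id: str, node_id: str, generation: int, now: float) -> bool:
        current = self._leases.get(conn_id)
        if (
            current is None
            or current.fenced
            or current.owner_node_id != node_id
            or current.generation != generation
            or now >= current.expires_at
        ):
            return False  # the node-local watchdog should stop the tunnel
        current.expires_at = now + self.ttl
        return True

    def fence(self, conn_id: str) -> None:
        current = self._leases.get(conn_id)
        if current:
            current.fenced = True
```

The generation only ever increases, so a stale owner that wakes up after a partition fails renewal by generation mismatch even if its node ID later becomes the owner again.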
Routing Policy Model
Traffic references a logical vpn_connection, not a physical node.
Examples:
- RDP resource may require vpn_connection = office-a
- SSH resource may require vpn_connection = office-a
- IP tunnel profile may expose selected CIDRs through office-a
- HTTP/internal app resource may route through the active VPN owner
The Fabric Routing Engine resolves:
service request
-> logical vpn_connection
-> current active owner / eligible egress
-> fabric route
-> VPN service runtime
-> private network target
Route updates should be dynamic. Changing CIDRs, DNS policy, or the active owner should not require manual client reconfiguration when clients use platform-managed access.
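The resolution chain can be sketched as a lookup through three tables. The table shapes, the resolve_route name, and the example identifiers are hypothetical; in the target design these lookups belong to the Fabric Routing Engine, not to the service adapter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedRoute:
    vpn_connection: str
    exit_node_id: str
    fabric_path: tuple[str, ...]


def resolve_route(
    resource_bindings: dict[str, str],  # resource -> logical vpn_connection
    active_owners: dict[str, str],  # vpn_connection -> current owner node
    fabric_paths: dict[tuple[str, str], tuple[str, ...]],  # (entry, exit) -> path
    resource: str,
    entry_node_id: str,
) -> Optional[ResolvedRoute]:
    """Resolve a service request to a fabric path, or None if any step fails."""
    connection = resource_bindings.get(resource)
    if connection is None:
        return None
    owner = active_owners.get(connection)
    if owner is None:
        return None
    path = fabric_paths.get((entry_node_id, owner))
    if path is None:
        return None
    return ResolvedRoute(connection, owner, path)
```

Because the resource binds to the logical name only, a failover that changes the entry in active_owners changes the resolved path without any change to the resource or the client.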
QoS and Bandwidth Rules
VPN bulk traffic must degrade before interactive traffic.
Priority order, highest first:
- RDP input/control
- interactive RDP/VNC/SSH control and render-critical traffic
- clipboard and small reliable control messages
- video/audio adaptive traffic
- file transfer
- VPN bulk packets
- telemetry
Bandwidth policy should support:
- per-organization limits
- per-service limits
- per-vpn_connection limits
- per-node limits
- reserved bandwidth for interactive services
Service adapters must not implement QoS routing themselves. They label traffic or request a channel class; Fabric applies route/QoS policy.
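To show how the ordering makes VPN bulk traffic degrade first, here is a strict-priority allocation over a single link. Strict priority is an assumption chosen for simplicity (a real scheduler might use weights or reservations), and the class labels are hypothetical names for the entries in the priority list.

```python
# Hypothetical labels for the priority list, highest priority first.
PRIORITY_ORDER = (
    "rdp_input_control",
    "interactive_control",
    "clipboard_control",
    "adaptive_media",
    "file_transfer",
    "vpn_bulk",
    "telemetry",
)


def allocate(capacity_kbps: int, demand_kbps: dict[str, int]) -> dict[str, int]:
    """Grant bandwidth class by class; lower classes get what is left."""
    remaining = capacity_kbps
    grants: dict[str, int] = {}
    for traffic_class in PRIORITY_ORDER:
        grant = min(demand_kbps.get(traffic_class, 0), remaining)
        grants[traffic_class] = grant
        remaining -= grant
    return grants
```

With 1000 kbps of capacity, 200 kbps of RDP input demand, and 2000 kbps of VPN bulk demand, the RDP class is granted its full 200 kbps and VPN bulk is cut to the remaining 800 kbps. A service adapter would only supply the class label; the allocation itself stays in Fabric.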
Security Boundaries
Security requirements:
- organization-scoped vpn_connection
- cluster-scoped identity and tokens
- mTLS node-to-node transport
- short-lived route/tunnel authorization tokens when needed
- credentials delivered only through approved resolver
- candidate nodes receive only scoped config
- active owner receives execution secrets only when authorized
- no organization sees another organization's connections, routes, credentials, peer cache, or topology
- platform owner actions are audited
Compromised node blast radius must be bounded. A compromised node must not gain
credentials for unrelated vpn_connection entities or unrelated organizations.
Observability and Audit
Audit events:
- vpn_connection_created
- vpn_connection_enabled
- vpn_connection_disabled
- vpn_connection_policy_changed
- vpn_connection_candidate_changed
- vpn_connection_lease_acquired
- vpn_connection_lease_renewed
- vpn_connection_lease_lost
- vpn_connection_owner_fenced
- vpn_connection_failover_started
- vpn_connection_failover_completed
- vpn_connection_credential_rotated
- vpn_route_policy_changed
Metrics/status:
- desired state
- active owner
- standby/pre-warm owners
- lease generation
- last connect/disconnect time
- route count
- latency/packet loss where observable
- bandwidth by service class
- failover count
- last failure reason
Organization views show safe status. Platform owner views may show active node and operational detail according to platform policy and audit.
Failure Mode Matrix
| Failure | Required behavior | Notes |
|---|---|---|
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical vpn_connection; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |
Stateful Session Limits
VPN failover may disrupt long-lived TCP sessions. The platform should minimize
disruption through sticky placement, graceful drain, standby/pre-warm nodes,
stable route identity, and transparent route refresh, but the initial
single_active mode does not guarantee lossless TCP migration.
Future multi_active or load-balanced VPN modes may reduce disruption. They
must be explicit future modes and must not weaken single_active correctness.
Relationship to Current Mesh Proof Set
C17A-C17G prove synthetic fabric messages, route health/failover probes, relay
semantics, a bounded synthetic.echo path, live synthetic HTTP node-to-node
transport, scoped synthetic route config loading, and Control Plane scoped
synthetic config reads in rap-node-agent.
They do not authorize VPN traffic.
VPN/IP tunnel runtime must wait until the control-plane desired-state model, lease/fencing, scoped snapshots, node-local state, secure node-to-node channels, and Fabric Routing Engine boundaries are accepted for this service.
Future Implementation Stages
C18A - VPN/IP tunnel control-plane data model foundation:
- durable vpn_connections
- route policy tables
- allowed node policy
- lease/fencing model
- audit events
- no runtime packets
Status: completed and backend-test-proven. Result:
artifacts/c18a-vpn-control-plane-data-model-report.md.
C18B - Lease and fencing service:
- single-active ownership service
- TTL renewal/fencing behavior
- stale owner handling
- no real VPN runtime
Status: completed and backend-test-proven. Result:
artifacts/c18b-vpn-lease-fencing-hardening-report.md.
C18C - Node-agent desired-state consumption:
- node reads scoped vpn_connection assignments
- reports status
- does not create real tunnel
Status: completed and backend-test-proven. Result:
artifacts/c18c-vpn-node-agent-desired-state-report.md.
Notes:
- node assignment visibility is limited to eligible candidates or the current active lease owner
- observed assignment status is explicit: not_started, assigned, lease_required, blocked, unknown
- credential_ref is not exposed to node-agent assignment payloads
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet forwarding, or production mesh traffic is implemented
C18D - Secret resolver integration:
- scoped credential/config delivery
- candidate/active-owner restrictions
- credential rotation audit
C18E - Routing policy integration:
- CIDR and service-specific route intent
- route projection to Fabric Routing Engine
- no packet forwarding
C18F - Non-production fake VPN executor:
- synthetic leased service state only
- no TUN, no packets, no private network routing
C18G - Lab-only native VPN executor prototype:
- explicit separate approval required
- native mode preferred for TUN/firewall/QoS
- no privileged container by default
C18H - Client route refresh/resume design:
- route updates
- reconnect behavior
- split/full tunnel client posture
C18I - Production hardening:
- split-brain drills
- failover testing
- QoS load testing
- security review
- observability and incident runbooks
Result / Decision
Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core. The first implementation step must be control-plane desired state and lease/fencing foundation, not packet routing. Nodes are execution units, not owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the source of truth, Redis remains live coordination only, and Fabric Storage/Config Storage remains a scoped distribution/cache layer. RDP, current direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all existing service-adapter behavior are untouched by this document. C18A, C18B, and C18C are now implemented only as control-plane/node-agent contract foundation; they still do not authorize VPN/IP tunnel runtime or host networking changes.