# VPN / IP Tunnel Service Target Design

Status: Stage C18 planning result. Documentation only.

This document defines the target VPN/IP tunnel service architecture for the Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN devices, mesh traffic, service workload execution, API changes, migrations, or RDP behavior changes.

## Purpose

VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.

The service must allow managed access to private networks while preserving the platform's core rules:

- PostgreSQL remains the durable source of truth.
- Redis remains live coordination only.
- Fabric Routing Engine owns route choice.
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.

## Non-Goals

Stage C18 does not implement:

- VPN/IP tunnel runtime
- TUN/TAP device handling
- packet forwarding
- host route or firewall manipulation
- QUIC, WebRTC, relay packet routing, or production mesh traffic
- Windows virtual adapter, Android `VpnService`, or mobile client work
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes

## Core Concepts

### `vpn_connection`

`vpn_connection` is the logical control-plane entity for one managed VPN/IP tunnel connection to a private endpoint such as an office, customer site, branch network, partner network, or private resource zone.

Target fields:

- `id`
- `organization_id`
- `cluster_id`
- `name`
- target endpoint / office identity
- protocol/provider family
- credential/config reference
- allowed node policy
- mode: `single_active` for the initial model
- desired state: `enabled` or `disabled`
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
- route policy references
- QoS / bandwidth policy references
- placement constraints
- safe status projection

The entity belongs to the control plane.
It must not be inferred from a node environment variable, a manually started VPN process, or a host-local config file.

### `vpn_connection_lease`

`vpn_connection_lease` represents current ownership.

Target fields:

- `vpn_connection_id`
- `cluster_id`
- `organization_id`
- `owner_node_id`
- lease generation / fencing epoch
- `expires_at`
- `renewed_at`
- `released_at`
- `fenced_at`
- status

Only the current owner with a valid, unexpired, unfenced lease may execute the VPN connection.

### `vpn_route_policy`

`vpn_route_policy` defines what traffic may use the connection.

Policy dimensions:

- allowed CIDRs
- denied CIDRs
- DNS suffix or DNS server policy
- split-tunnel or full-tunnel eligibility
- service-specific usage
- resource-specific usage
- organization and role scope
- QoS class and bandwidth limits

Route policy is desired state. Runtime nodes apply scoped policy; they do not invent routes.

### `vpn_credential_ref`

VPN credentials must be referenced through an approved secret resolver. Nodes receive credentials/config only when authorized and only when required to execute or pre-warm the assigned connection. Nodes must not receive unrelated organization credentials.

## Architecture Placement

```text
Control Plane
  owns vpn_connection desired state, policy, lease/fencing, audit

Fabric Core
  owns node identity, role assignment consumption, scoped snapshots,
  node-local state, and service supervision boundary

Fabric Routing Engine
  chooses path to the current active VPN owner or eligible egress pool

VPN/IP Tunnel Service Runtime
  executes tunnel only when assigned and leased

Data Plane
  carries encrypted tunnel packets later, with QoS and backpressure
```

The backend/control plane must not become a production VPN packet relay.

## Universal Packet Dataplane Principle

The VPN service carries IP packets. It must not classify the product as a web proxy, an RDP helper, or an HTTP-only accelerator.
HTTP, DNS, RDP, SSH, VNC, messengers, audio calls, file transfer, application sync, and future mobile or desktop traffic are all just packets flowing through the same tunnel contract.

Implementation rules:

- packet forwarding must not branch on application protocol for correctness
- performance work must optimize the shared packet path, not a specific site or port
- batching, backpressure, retries, and route failover are dataplane mechanics and must apply to all traffic
- diagnostics may summarize protocol/ports for operators, but diagnostics must not decide whether traffic is allowed to flow
- a transient transport error must not permanently downgrade the tunnel to a per-packet request mode
- the control plane chooses entry, exit, route, lease, and policy; packet flow should use the fastest available fabric path

The temporary backend HTTP packet relay is a lab compatibility path. The production target is:

```text
client device
  -> selected entry node
  -> fabric route / alternate route set
  -> selected exit node
  -> target private network or Internet gateway
```

When the cluster grows, route choice must consider latency, loss, queue depth, node health, role eligibility, lease freshness, and regional/network locality. If a node or link degrades, the fabric should switch to an alternate route without requiring the client to understand mesh topology.

## Control Plane Responsibilities

The control plane owns:

- durable `vpn_connection` desired state
- route policy and service usage policy
- allowed node policy
- placement and candidate selection
- lease creation, renewal validation, and fencing decisions
- safe status projection
- audit events
- credential reference ownership

The control plane does not push arbitrary packets. It authorizes and records what should exist.

## Node Responsibilities

Nodes do not decide to create VPN connections.
A node may execute a connection only when all of the following are true:

- node belongs to the correct cluster
- node has the required capability and role assignment
- node is allowed by the `vpn_connection` node policy
- node has a current signed/scoped configuration snapshot
- node holds the active lease
- desired state is `enabled`
- organization and service policy permit use

The node must stop execution when:

- lease is lost, expired, or fenced
- desired state becomes `disabled`
- role assignment is removed
- allowed node policy changes
- local node enters unsafe partition/degraded state
- cluster tells the node to drain

## Single-Active Lease and Fencing Model

The initial mode is `single_active`.

Correctness requirement:

- exactly one node may maintain the active VPN tunnel
- stale owners must be fenced before replacement becomes authoritative
- ownership changes must be monotonic through a lease generation or equivalent fencing epoch
- connect/disconnect must be idempotent
- split-brain must not create duplicate active tunnels

Suggested target mechanics:

- short lease TTL
- periodic renewal
- monotonic lease generation
- node-local watchdog that stops tunnel when renewal fails
- explicit release on graceful shutdown
- fencing event before replacement if previous owner is uncertain

## Routing Policy Model

Traffic references a logical `vpn_connection`, not a physical node.

Examples:

- RDP resource may require `vpn_connection = office-a`
- SSH resource may require `vpn_connection = office-a`
- IP tunnel profile may expose selected CIDRs through `office-a`
- HTTP/internal app resource may route through the active VPN owner

The Fabric Routing Engine resolves:

```text
service request
  -> logical vpn_connection
  -> current active owner / eligible egress
  -> fabric route
  -> VPN service runtime
  -> private network target
```

Route updates should be dynamic.
Changing CIDRs, DNS policy, or active owner should not require manual client reconfiguration when clients use platform managed access.

## QoS and Bandwidth Rules

VPN bulk traffic must degrade before interactive traffic.

Priority order:

1. RDP input/control
2. interactive RDP/VNC/SSH control and render-critical traffic
3. clipboard and small reliable control messages
4. video/audio adaptive traffic
5. file transfer
6. VPN bulk packets
7. telemetry

Bandwidth policy should support:

- per-organization limits
- per-service limits
- per-`vpn_connection` limits
- per-node limits
- reserved bandwidth for interactive services

Service adapters must not implement QoS routing themselves. They label traffic or request a channel class; Fabric applies route/QoS policy.

## Security Boundaries

Security requirements:

- organization-scoped `vpn_connection`
- cluster-scoped identity and tokens
- mTLS node-to-node transport
- short-lived route/tunnel authorization tokens when needed
- credentials delivered only through approved resolver
- candidate nodes receive only scoped config
- active owner receives execution secrets only when authorized
- no organization sees another organization's connections, routes, credentials, peer cache, or topology
- platform owner actions are audited

Compromised node blast radius must be bounded. A compromised node must not gain credentials for unrelated `vpn_connection` entities or unrelated organizations.
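
The credential-scoping requirements above could be checked along the following lines. This is a minimal sketch under stated assumptions, not the platform's resolver: `ConnectionView`, `NodeIdentity`, `release_level`, and the three return labels are hypothetical names introduced only for illustration, and the rule that the active owner must also still be in the allowed node set is an assumption drawn from the node responsibilities. A real resolver would additionally verify node identity, tokens, and lease validity, and would audit each decision.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ConnectionView:
    """Hypothetical control-plane view of one vpn_connection."""
    id: str
    cluster_id: str
    desired_state: str                    # "enabled" or "disabled"
    allowed_node_ids: FrozenSet[str]      # nodes permitted by the node policy
    active_owner_node_id: Optional[str]   # current valid lease holder, if any


@dataclass(frozen=True)
class NodeIdentity:
    """Hypothetical authenticated node identity."""
    node_id: str
    cluster_id: str


def release_level(node: NodeIdentity, conn: ConnectionView) -> str:
    """Return what the resolver may hand to this node for this connection.

    "none"              -> node gets nothing for this connection
    "scoped_config"     -> candidate node, scoped config only
    "execution_secrets" -> active owner, secrets needed to run the tunnel
    """
    # A node from another cluster never receives anything.
    if node.cluster_id != conn.cluster_id:
        return "none"
    # A disabled connection should not be executed or pre-warmed.
    if conn.desired_state != "enabled":
        return "none"
    # Nodes outside the allowed node policy are unrelated to this connection.
    if node.node_id not in conn.allowed_node_ids:
        return "none"
    # Only the current lease owner may receive execution secrets.
    if conn.active_owner_node_id == node.node_id:
        return "execution_secrets"
    # Remaining allowed nodes are candidates and get scoped config only.
    return "scoped_config"
```

Defaulting every unmatched case to `"none"` keeps the blast radius of a compromised node limited to the connections it is explicitly allowed to serve.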
## Observability and Audit

Audit events:

- `vpn_connection_created`
- `vpn_connection_enabled`
- `vpn_connection_disabled`
- `vpn_connection_policy_changed`
- `vpn_connection_candidate_changed`
- `vpn_connection_lease_acquired`
- `vpn_connection_lease_renewed`
- `vpn_connection_lease_lost`
- `vpn_connection_owner_fenced`
- `vpn_connection_failover_started`
- `vpn_connection_failover_completed`
- `vpn_connection_credential_rotated`
- `vpn_route_policy_changed`

Metrics/status:

- desired state
- active owner
- standby/pre-warm owners
- lease generation
- last connect/disconnect time
- route count
- latency/packet loss where observable
- bandwidth by service class
- failover count
- last failure reason

Organization views show safe status. Platform owner views may show active node and operational detail according to platform policy and audit.

## Failure Mode Matrix

| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical `vpn_connection`; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |

## Stateful Session Limits

VPN failover may disrupt long-lived TCP sessions. The platform should minimize disruption through sticky placement, graceful drain, standby/pre-warm nodes, stable route identity, and transparent route refresh, but the initial `single_active` mode does not guarantee lossless TCP migration.

Future `multi_active` or load-balanced VPN modes may reduce disruption. They must be explicit future modes and must not weaken `single_active` correctness.

## Relationship to Current Mesh Proof Set

C17A-C17G prove synthetic fabric messages, route health/failover probes, relay semantics, a bounded `synthetic.echo` path, live synthetic HTTP node-to-node transport, scoped synthetic route config loading, and Control Plane scoped synthetic config reads in `rap-node-agent`.

They do not authorize VPN traffic. VPN/IP tunnel runtime must wait until the control-plane desired-state model, lease/fencing, scoped snapshots, node-local state, secure node-to-node channels, and Fabric Routing Engine boundaries are accepted for this service.

## Future Implementation Stages

C18A - VPN/IP tunnel control-plane data model foundation:

- durable `vpn_connections`
- route policy tables
- allowed node policy
- lease/fencing model
- audit events
- no runtime packets

Status: completed and backend-test-proven.

Result: `artifacts/c18a-vpn-control-plane-data-model-report.md`.

C18B - Lease and fencing service:

- single-active ownership service
- TTL renewal/fencing behavior
- stale owner handling
- no real VPN runtime

Status: completed and backend-test-proven.

Result: `artifacts/c18b-vpn-lease-fencing-hardening-report.md`.

C18C - Node-agent desired-state consumption:

- node reads scoped `vpn_connection` assignments
- reports status
- does not create real tunnel

Status: completed and backend-test-proven.

Result: `artifacts/c18c-vpn-node-agent-desired-state-report.md`.

Notes:

- node assignment visibility is limited to eligible candidates or the current active lease owner
- observed assignment status is explicit: `not_started`, `assigned`, `lease_required`, `blocked`, `unknown`
- `credential_ref` is not exposed to node-agent assignment payloads
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet forwarding, or production mesh traffic is implemented

C18D - Secret resolver integration:

- scoped credential/config delivery
- candidate/active-owner restrictions
- credential rotation audit

C18E - Routing policy integration:

- CIDR and service-specific route intent
- route projection to Fabric Routing Engine
- no packet forwarding

C18F - Non-production fake VPN executor:

- synthetic leased service state only
- no TUN, no packets, no private network routing

C18G - Lab-only native VPN executor prototype:

- explicit separate approval required
- native mode preferred for TUN/firewall/QoS
- no privileged container by default

C18H - Client route refresh/resume design:

- route updates
- reconnect behavior
- split/full tunnel client posture

C18I - Production hardening:

- split-brain drills
- failover testing
- QoS load testing
- security review
- observability and incident runbooks

## Result / Decision

Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core. The first implementation step must be control-plane desired state and lease/fencing foundation, not packet routing.

Nodes are execution units, not owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the source of truth, Redis remains live coordination only, and Fabric Storage/Config Storage remains a scoped distribution/cache layer.

RDP, current direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all existing service-adapter behavior are untouched by this document.
C18A, C18B, and C18C are now implemented, but only as a control-plane/node-agent contract foundation; they still do not authorize VPN/IP tunnel runtime or host networking changes.
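
To make the lease/fencing foundation more concrete, the suggested mechanics from the Single-Active Lease and Fencing Model section (short TTL, periodic renewal, monotonic generation, fencing before replacement) might be modelled roughly as follows. This is an in-memory sketch only: `Lease`, `SingleActiveLease`, and their methods are hypothetical names, time is passed in explicitly so the behavior is deterministic, and nothing here represents the durable control-plane model or the actual C18B service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner_node_id: str
    generation: int        # monotonic fencing epoch
    expires_at: float
    fenced: bool = False


class SingleActiveLease:
    """Illustrative stand-in for one vpn_connection_lease record."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self.current: Optional[Lease] = None
        self.generation = 0

    def is_valid(self, node_id: str, generation: int, now: float) -> bool:
        """True only for the current, unexpired, unfenced owner."""
        lease = self.current
        return (
            lease is not None
            and not lease.fenced
            and lease.owner_node_id == node_id
            and lease.generation == generation
            and now < lease.expires_at
        )

    def acquire(self, node_id: str, now: float) -> Optional[int]:
        """Return a lease generation, or None if another owner is still live."""
        lease = self.current
        if lease is not None and not lease.fenced and now < lease.expires_at:
            # Idempotent for the current owner; refused for everyone else.
            return lease.generation if lease.owner_node_id == node_id else None
        if lease is not None:
            # Fence the stale owner before the replacement becomes authoritative.
            lease.fenced = True
        self.generation += 1
        self.current = Lease(node_id, self.generation, now + self.ttl)
        return self.generation

    def renew(self, node_id: str, generation: int, now: float) -> bool:
        """Extend the lease; a False result means the node must stop the tunnel."""
        if not self.is_valid(node_id, generation, now):
            return False
        assert self.current is not None
        self.current.expires_at = now + self.ttl
        return True

    def fence(self) -> None:
        """Explicitly fence the current owner, e.g. when its state is uncertain."""
        if self.current is not None:
            self.current.fenced = True
```

A node-local watchdog in this model would call `renew` on a period shorter than the TTL and stop the tunnel as soon as it returns `False`, since a stale generation can never become valid again.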