rdp-proxy/docs/architecture/VPN_IP_TUNNEL_SERVICE_TARGET.md
2026-04-28 22:29:50 +03:00

VPN / IP Tunnel Service Target Design

Status: Stage C18 planning result. Documentation only.

This document defines the target VPN/IP tunnel service architecture for the Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN devices, mesh traffic, service workload execution, API changes, migrations, or RDP behavior changes.

Purpose

VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.

The service must allow managed access to private networks while preserving the platform's core rules:

  • PostgreSQL remains the durable source of truth.
  • Redis remains live coordination only.
  • Fabric Routing Engine owns route choice.
  • Nodes execute leased work only.
  • Organizations must not see mesh topology.
  • Interactive services such as RDP must not be harmed by VPN bulk traffic.

Non-Goals

Stage C18 does not implement:

  • VPN/IP tunnel runtime
  • TUN/TAP device handling
  • packet forwarding
  • host route or firewall manipulation
  • QUIC, WebRTC, relay packet routing, or production mesh traffic
  • Windows virtual adapter, Android VpnService, or mobile client work
  • RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes

Core Concepts

vpn_connection

vpn_connection is the logical control-plane entity for one managed VPN/IP tunnel connection to a private endpoint such as an office, customer site, branch network, partner network, or private resource zone.

Target fields:

  • id
  • organization_id
  • cluster_id
  • name
  • target endpoint / office identity
  • protocol/provider family
  • credential/config reference
  • allowed node policy
  • mode: single_active for the initial model
  • desired state: enabled or disabled
  • routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
  • route policy references
  • QoS / bandwidth policy references
  • placement constraints
  • safe status projection

The entity belongs to the control plane. It must not be inferred from a node environment variable, a manually started VPN process, or a host-local config file.
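As a rough illustration of the target fields, the entity could be modelled as an immutable record. This is a minimal sketch, not the C18A schema: the Python field names, the representation of the allowed node policy as a set of node ids, the `disabled` default, and the contents of `safe_status` are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Tuple


class DesiredState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class Mode(Enum):
    SINGLE_ACTIVE = "single_active"  # the only mode in the initial model


@dataclass(frozen=True)
class VpnConnection:
    id: str
    organization_id: str
    cluster_id: str
    name: str
    target_endpoint: str
    provider_family: str
    credential_ref: str                    # a reference only, never secret material
    allowed_node_ids: FrozenSet[str]       # assumption: policy reduced to a node id set
    mode: Mode = Mode.SINGLE_ACTIVE
    desired_state: DesiredState = DesiredState.DISABLED  # assumption: off by default
    routing_usage: Tuple[str, ...] = ()
    route_policy_ids: Tuple[str, ...] = ()
    qos_policy_ids: Tuple[str, ...] = ()

    def safe_status(self) -> dict:
        """Organization-facing projection: omits node ids and the credential reference."""
        return {
            "id": self.id,
            "name": self.name,
            "mode": self.mode.value,
            "desired_state": self.desired_state.value,
        }
```

Keeping the record frozen reflects the idea that desired state changes only through the control plane, never by in-place mutation on a node.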

vpn_connection_lease

vpn_connection_lease represents current ownership.

Target fields:

  • vpn_connection_id
  • cluster_id
  • organization_id
  • owner_node_id
  • lease generation / fencing epoch
  • expires_at
  • renewed_at
  • released_at
  • fenced_at
  • status

Only the current owner with a valid, unexpired, unfenced lease may execute the VPN connection.
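The ownership rule above can be expressed as a single predicate. The sketch below assumes a reduced lease record (a subset of the target fields) and a hypothetical helper name `may_execute`; it treats a released lease as no longer valid, which the target fields imply but the rule does not state explicitly.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class VpnConnectionLease:
    vpn_connection_id: str
    owner_node_id: str
    generation: int                        # lease generation / fencing epoch
    expires_at: datetime
    released_at: Optional[datetime] = None
    fenced_at: Optional[datetime] = None


def may_execute(lease: VpnConnectionLease, node_id: str, now: datetime) -> bool:
    """True only for the current owner of an unexpired, unfenced, unreleased lease."""
    return (
        lease.owner_node_id == node_id
        and lease.released_at is None
        and lease.fenced_at is None
        and now < lease.expires_at
    )


# Usage: a lease granted to node-1 that expires 30 seconds after t0.
t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
lease = VpnConnectionLease("vc-1", "node-1", generation=1,
                           expires_at=t0 + timedelta(seconds=30))
```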

vpn_route_policy

vpn_route_policy defines what traffic may use the connection.

Policy dimensions:

  • allowed CIDRs
  • denied CIDRs
  • DNS suffix or DNS server policy
  • split-tunnel or full-tunnel eligibility
  • service-specific usage
  • resource-specific usage
  • organization and role scope
  • QoS class and bandwidth limits

Route policy is desired state. Runtime nodes apply scoped policy; they do not invent routes.
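To make the CIDR dimensions concrete, here is a minimal sketch of an allow/deny check using Python's standard `ipaddress` module. The function name is hypothetical, and the rule that a denied CIDR overrides an allowed one is an assumption: this document does not specify precedence.

```python
import ipaddress
from typing import Iterable


def is_destination_allowed(
    destination: str,
    allowed_cidrs: Iterable[str],
    denied_cidrs: Iterable[str],
) -> bool:
    """Return True if destination falls in an allowed CIDR and in no denied CIDR."""
    addr = ipaddress.ip_address(destination)
    # Assumption: deny wins over allow when both match.
    if any(addr in ipaddress.ip_network(cidr) for cidr in denied_cidrs):
        return False
    # Default deny: an address outside every allowed CIDR is rejected.
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Usage with example ranges (illustrative values only).
allowed = ["10.20.0.0/16"]
denied = ["10.20.99.0/24"]
print(is_destination_allowed("10.20.1.5", allowed, denied))
```

A node applying scoped policy would only evaluate a check like this against the CIDR lists it received; it would not add entries of its own.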

vpn_credential_ref

VPN credentials must be referenced through an approved secret resolver.

Nodes receive credentials/config only when authorized and only when required to execute or pre-warm the assigned connection. Nodes must not receive unrelated organization credentials.
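One way to picture the scoping rule is a resolver that releases secret material only to nodes explicitly authorized for a given reference. This is an in-memory sketch under stated assumptions: the class name, constructor arguments, and the idea of a per-reference authorized node set (active owner plus any pre-warm nodes) are hypothetical and do not describe the approved resolver.

```python
from typing import Dict, Iterable


class ScopedSecretResolver:
    """Illustrative resolver: secrets are keyed by credential reference."""

    def __init__(self, secrets: Dict[str, str],
                 authorized_nodes: Dict[str, Iterable[str]]):
        self._secrets = dict(secrets)
        # Assumption: the control plane supplies, per reference, the nodes
        # currently entitled to it (active owner and pre-warm candidates).
        self._authorized = {ref: frozenset(nodes)
                            for ref, nodes in authorized_nodes.items()}

    def resolve(self, credential_ref: str, node_id: str) -> str:
        if node_id not in self._authorized.get(credential_ref, frozenset()):
            raise PermissionError("node is not authorized for this credential reference")
        return self._secrets[credential_ref]


# Usage: node-1 may resolve ref-a but not ref-b.
resolver = ScopedSecretResolver(
    secrets={"ref-a": "secret-a", "ref-b": "secret-b"},
    authorized_nodes={"ref-a": {"node-1"}, "ref-b": {"node-2"}},
)
```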

Architecture Placement

Control Plane
  owns vpn_connection desired state, policy, lease/fencing, audit

Fabric Core
  owns node identity, role assignment consumption, scoped snapshots,
  node-local state, and service supervision boundary

Fabric Routing Engine
  chooses path to the current active VPN owner or eligible egress pool

VPN/IP Tunnel Service Runtime
  executes tunnel only when assigned and leased

Data Plane
  carries encrypted tunnel packets later, with QoS and backpressure

The backend/control plane must not become a production VPN packet relay.

Control Plane Responsibilities

The control plane owns:

  • durable vpn_connection desired state
  • route policy and service usage policy
  • allowed node policy
  • placement and candidate selection
  • lease creation, renewal validation, and fencing decisions
  • safe status projection
  • audit events
  • credential reference ownership

The control plane does not push arbitrary packets. It authorizes and records what should exist.

Node Responsibilities

Nodes do not decide to create VPN connections.

A node may execute a connection only when all of the following are true:

  • node belongs to the correct cluster
  • node has the required capability and role assignment
  • node is allowed by the vpn_connection node policy
  • node has a current signed/scoped configuration snapshot
  • node holds the active lease
  • desired state is enabled
  • organization and service policy permit use

The node must stop execution when:

  • lease is lost, expired, or fenced
  • desired state becomes disabled
  • role assignment is removed
  • allowed node policy changes
  • local node enters unsafe partition/degraded state
  • cluster tells the node to drain
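The start and stop conditions above collapse into one gate that a node could evaluate on every reconciliation tick. The sketch below is illustrative: the `ExecutionContext` fields and reason strings are hypothetical names, and each boolean is assumed to be computed elsewhere from the scoped snapshot and lease state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExecutionContext:
    in_correct_cluster: bool
    has_capability_and_role: bool
    allowed_by_node_policy: bool
    has_current_signed_snapshot: bool
    holds_active_lease: bool
    desired_state_enabled: bool
    policy_permits_use: bool
    node_safe: bool = True          # False when partitioned or degraded
    drain_requested: bool = False


def blocking_reasons(ctx: ExecutionContext) -> List[str]:
    """List every condition that currently forbids execution."""
    checks = [
        (ctx.in_correct_cluster, "wrong_cluster"),
        (ctx.has_capability_and_role, "missing_capability_or_role"),
        (ctx.allowed_by_node_policy, "not_allowed_by_node_policy"),
        (ctx.has_current_signed_snapshot, "no_current_signed_snapshot"),
        (ctx.holds_active_lease, "no_active_lease"),
        (ctx.desired_state_enabled, "desired_state_disabled"),
        (ctx.policy_permits_use, "policy_denies_use"),
        (ctx.node_safe, "unsafe_partition_or_degraded"),
        (not ctx.drain_requested, "drain_requested"),
    ]
    return [reason for ok, reason in checks if not ok]


def should_run(ctx: ExecutionContext) -> bool:
    """Run only when every condition holds; otherwise the node must stop."""
    return not blocking_reasons(ctx)
```

Evaluating the same gate for both start and stop keeps the two lists from drifting apart: a running tunnel is stopped as soon as any precondition stops holding.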

Single-Active Lease and Fencing Model

The initial mode is single_active.

Correctness requirement:

  • at most one node may maintain the active VPN tunnel at any time
  • stale owners must be fenced before replacement becomes authoritative
  • ownership changes must be monotonic through a lease generation or equivalent fencing epoch
  • connect/disconnect must be idempotent
  • split-brain must not create duplicate active tunnels

Suggested target mechanics:

  • short lease TTL
  • periodic renewal
  • monotonic lease generation
  • node-local watchdog that stops tunnel when renewal fails
  • explicit release on graceful shutdown
  • fencing event before replacement if previous owner is uncertain
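The mechanics listed above can be sketched as a small in-memory lease table. This is an illustration only: the class and method names are hypothetical, time is passed in as a plain number, and a real service would perform each step as a transaction against the durable store rather than in process memory.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    owner_node_id: str
    generation: int          # monotonic fencing epoch
    expires_at: float
    fenced: bool = False
    released: bool = False


class LeaseTable:
    def __init__(self, ttl_seconds: float = 10.0):
        self.ttl = ttl_seconds
        self._leases: Dict[str, Lease] = {}

    def _is_live(self, lease: Lease, now: float) -> bool:
        return not lease.fenced and not lease.released and now < lease.expires_at

    def acquire(self, conn_id: str, node_id: str, now: float) -> Optional[Lease]:
        current = self._leases.get(conn_id)
        if current is not None and self._is_live(current, now):
            # Idempotent for the current owner; refused for anyone else.
            return current if current.owner_node_id == node_id else None
        if current is not None and not current.released:
            current.fenced = True  # fence the stale owner before replacing it
        generation = current.generation + 1 if current is not None else 1
        lease = Lease(node_id, generation, now + self.ttl)
        self._leases[conn_id] = lease
        return lease

    def renew(self, conn_id: str, node_id: str, generation: int, now: float) -> bool:
        current = self._leases.get(conn_id)
        if (current is None or current.owner_node_id != node_id
                or current.generation != generation
                or not self._is_live(current, now)):
            return False  # stale generation or lost lease: the caller must stop
        current.expires_at = now + self.ttl
        return True

    def release(self, conn_id: str, node_id: str, generation: int) -> None:
        current = self._leases.get(conn_id)
        if (current is not None and current.owner_node_id == node_id
                and current.generation == generation):
            current.released = True  # explicit release on graceful shutdown
```

Because every replacement increments the generation, a renewal carrying an old generation is rejected even if the stale node's clock still believes its lease is valid.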

Routing Policy Model

Traffic references a logical vpn_connection, not a physical node.

Examples:

  • RDP resource may require vpn_connection = office-a
  • SSH resource may require vpn_connection = office-a
  • IP tunnel profile may expose selected CIDRs through office-a
  • HTTP/internal app resource may route through the active VPN owner

The Fabric Routing Engine resolves:

service request
  -> logical vpn_connection
  -> current active owner / eligible egress
  -> fabric route
  -> VPN service runtime
  -> private network target

Route updates should be dynamic. Changing CIDRs, DNS policy, or the active owner should not require manual client reconfiguration when clients use platform-managed access.
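The indirection through a logical vpn_connection is what makes owner changes invisible to resources. A minimal sketch, assuming two hypothetical lookup tables (resource to connection name, connection name to current owner) and reusing the office-a example:

```python
from typing import Dict, Optional


def resolve_egress(
    resource: str,
    resource_to_connection: Dict[str, str],
    active_owner: Dict[str, str],
) -> Optional[str]:
    """Resolve a resource to the node currently owning its vpn_connection.

    Returns None when the resource has no vpn_connection requirement or the
    connection has no active owner right now.
    """
    connection = resource_to_connection.get(resource)
    if connection is None:
        return None
    return active_owner.get(connection)


# Usage: the resource mapping never names a node.
resources = {"rdp-desktop-1": "office-a", "ssh-bastion-1": "office-a"}
owners = {"office-a": "node-1"}
print(resolve_egress("rdp-desktop-1", resources, owners))

# After a failover only the owner table changes; resources are untouched.
owners["office-a"] = "node-2"
print(resolve_egress("rdp-desktop-1", resources, owners))
```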

QoS and Bandwidth Rules

VPN bulk traffic must degrade before interactive traffic.

Priority order:

  1. RDP input/control
  2. interactive RDP/VNC/SSH control and render-critical traffic
  3. clipboard and small reliable control messages
  4. video/audio adaptive traffic
  5. file transfer
  6. VPN bulk packets
  7. telemetry

Bandwidth policy should support:

  • per-organization limits
  • per-service limits
  • per-vpn_connection limits
  • per-node limits
  • reserved bandwidth for interactive services

Service adapters must not implement QoS routing themselves. They label traffic or request a channel class; Fabric applies route/QoS policy.
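The priority order can be illustrated with a strict-priority queue in which service adapters only supply a traffic class label. This is a deliberately simplified sketch: the class names are hypothetical, and strict priority is an assumption chosen for brevity. A real scheduler would also need the bandwidth limits and interactive reservations listed above, and some protection against starving the lower classes.

```python
import heapq
from enum import IntEnum
from itertools import count
from typing import Any, List, Optional, Tuple


class TrafficClass(IntEnum):
    # Lower value = served first, mirroring the priority order above.
    RDP_INPUT_CONTROL = 1
    INTERACTIVE_CONTROL = 2
    CLIPBOARD_SMALL_CONTROL = 3
    ADAPTIVE_MEDIA = 4
    FILE_TRANSFER = 5
    VPN_BULK = 6
    TELEMETRY = 7


class PriorityScheduler:
    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = count()  # tie-breaker keeps FIFO order within a class

    def enqueue(self, traffic_class: TrafficClass, payload: Any) -> None:
        heapq.heappush(self._heap, (int(traffic_class), next(self._seq), payload))

    def dequeue(self) -> Optional[Any]:
        if not self._heap:
            return None
        _, _, payload = heapq.heappop(self._heap)
        return payload


# Usage: bulk VPN traffic queued first is still served after RDP input.
sched = PriorityScheduler()
sched.enqueue(TrafficClass.VPN_BULK, "vpn-packet")
sched.enqueue(TrafficClass.TELEMETRY, "metric")
sched.enqueue(TrafficClass.RDP_INPUT_CONTROL, "key-press")
```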

Security Boundaries

Security requirements:

  • organization-scoped vpn_connection
  • cluster-scoped identity and tokens
  • mTLS node-to-node transport
  • short-lived route/tunnel authorization tokens when needed
  • credentials delivered only through approved resolver
  • candidate nodes receive only scoped config
  • active owner receives execution secrets only when authorized
  • no organization sees another organization's connections, routes, credentials, peer cache, or topology
  • platform owner actions are audited

Compromised node blast radius must be bounded. A compromised node must not gain credentials for unrelated vpn_connection entities or unrelated organizations.

Observability and Audit

Audit events:

  • vpn_connection_created
  • vpn_connection_enabled
  • vpn_connection_disabled
  • vpn_connection_policy_changed
  • vpn_connection_candidate_changed
  • vpn_connection_lease_acquired
  • vpn_connection_lease_renewed
  • vpn_connection_lease_lost
  • vpn_connection_owner_fenced
  • vpn_connection_failover_started
  • vpn_connection_failover_completed
  • vpn_connection_credential_rotated
  • vpn_route_policy_changed

Metrics/status:

  • desired state
  • active owner
  • standby/pre-warm owners
  • lease generation
  • last connect/disconnect time
  • route count
  • latency/packet loss where observable
  • bandwidth by service class
  • failover count
  • last failure reason

Organization views show safe status. Platform owner views may show active node and operational detail according to platform policy and audit.

Failure Mode Matrix

| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical vpn_connection; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |
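Several rows in the matrix (lease lost locally, control plane unavailable, partition) depend on the node stopping on its own, without waiting to be told. A minimal sketch of such a node-local deadline follows. The class name and the `safety_margin` parameter are assumptions; the margin is one way to make the node stop slightly before the control plane considers the lease expired.

```python
from typing import Optional


class LeaseWatchdog:
    """Tracks the last successful renewal and decides when the tunnel must stop."""

    def __init__(self, lease_ttl: float, safety_margin: float):
        if not 0 <= safety_margin < lease_ttl:
            raise ValueError("safety_margin must be in [0, lease_ttl)")
        # Stop this long after the last renewal, before the lease itself expires.
        self._stop_after = lease_ttl - safety_margin
        self._last_renewal: Optional[float] = None

    def record_renewal(self, now: float) -> None:
        self._last_renewal = now

    def must_stop(self, now: float) -> bool:
        if self._last_renewal is None:
            return True  # never held a confirmed lease: do not run
        return now - self._last_renewal >= self._stop_after


# Usage: 10 s lease with a 2 s margin, so the node stops 8 s after its last renewal.
watchdog = LeaseWatchdog(lease_ttl=10.0, safety_margin=2.0)
```

In a real agent the `now` values would typically come from a monotonic clock such as `time.monotonic()`, so that wall-clock adjustments cannot extend the deadline.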

Stateful Session Limits

VPN failover may disrupt long-lived TCP sessions. The platform should minimize disruption through sticky placement, graceful drain, standby/pre-warm nodes, stable route identity, and transparent route refresh, but the initial single_active mode does not guarantee lossless TCP migration.

Future multi_active or load-balanced VPN modes may reduce disruption. They must be explicit future modes and must not weaken single_active correctness.

Relationship to Current Mesh Proof Set

C17A-C17G prove synthetic fabric messages, route health/failover probes, relay semantics, a bounded synthetic.echo path, live synthetic HTTP node-to-node transport, scoped synthetic route config loading, and Control Plane scoped synthetic config reads in rap-node-agent.

They do not authorize VPN traffic.

VPN/IP tunnel runtime must wait until the control-plane desired-state model, lease/fencing, scoped snapshots, node-local state, secure node-to-node channels, and Fabric Routing Engine boundaries are accepted for this service.

Future Implementation Stages

C18A - VPN/IP tunnel control-plane data model foundation:

  • durable vpn_connections
  • route policy tables
  • allowed node policy
  • lease/fencing model
  • audit events
  • no runtime packets

Status: completed and backend-test-proven. Result: artifacts/c18a-vpn-control-plane-data-model-report.md.

C18B - Lease and fencing service:

  • single-active ownership service
  • TTL renewal/fencing behavior
  • stale owner handling
  • no real VPN runtime

Status: completed and backend-test-proven. Result: artifacts/c18b-vpn-lease-fencing-hardening-report.md.

C18C - Node-agent desired-state consumption:

  • node reads scoped vpn_connection assignments
  • reports status
  • does not create real tunnel

Status: completed and backend-test-proven. Result: artifacts/c18c-vpn-node-agent-desired-state-report.md.

Notes:

  • node assignment visibility is limited to eligible candidates or the current active lease owner
  • observed assignment status is explicit: not_started, assigned, lease_required, blocked, unknown
  • credential_ref is not exposed to node-agent assignment payloads
  • no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet forwarding, or production mesh traffic is implemented

C18D - Secret resolver integration:

  • scoped credential/config delivery
  • candidate/active-owner restrictions
  • credential rotation audit

C18E - Routing policy integration:

  • CIDR and service-specific route intent
  • route projection to Fabric Routing Engine
  • no packet forwarding

C18F - Non-production fake VPN executor:

  • synthetic leased service state only
  • no TUN, no packets, no private network routing

C18G - Lab-only native VPN executor prototype:

  • explicit separate approval required
  • native mode preferred for TUN/firewall/QoS
  • no privileged container by default

C18H - Client route refresh/resume design:

  • route updates
  • reconnect behavior
  • split/full tunnel client posture

C18I - Production hardening:

  • split-brain drills
  • failover testing
  • QoS load testing
  • security review
  • observability and incident runbooks

Result / Decision

Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core. The first implementation step must be control-plane desired state and lease/fencing foundation, not packet routing. Nodes are execution units, not owners of desired state. Fabric owns routing and QoS.

PostgreSQL remains the source of truth, Redis remains live coordination only, and Fabric Storage/Config Storage remains a scoped distribution/cache layer. RDP, current direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all existing service-adapter behavior are untouched by this document.

C18A, C18B, and C18C are now implemented only as control-plane/node-agent contract foundation; they still do not authorize VPN/IP tunnel runtime or host networking changes.