# VPN / IP Tunnel Service Target Design

Status: Stage C18 planning result. Documentation only.

This document defines the target VPN/IP tunnel service architecture for the
Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
devices, mesh traffic, service workload execution, API changes, migrations, or
RDP behavior changes.

## Purpose

VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.

The service must allow managed access to private networks while preserving the
platform's core rules:

- PostgreSQL remains the durable source of truth.
- Redis remains live coordination only.
- Fabric Routing Engine owns route choice.
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.

## Non-Goals

Stage C18 does not implement:

- VPN/IP tunnel runtime
- TUN/TAP device handling
- packet forwarding
- host route or firewall manipulation
- QUIC, WebRTC, relay packet routing, or production mesh traffic
- Windows virtual adapter, Android `VpnService`, or mobile client work
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes

## Core Concepts

### `vpn_connection`

`vpn_connection` is the logical control-plane entity for one managed VPN/IP
tunnel connection to a private endpoint such as an office, customer site,
branch network, partner network, or private resource zone.

Target fields:

- `id`
- `organization_id`
- `cluster_id`
- `name`
- target endpoint / office identity
- protocol/provider family
- credential/config reference
- allowed node policy
- mode: `single_active` for the initial model
- desired state: `enabled` or `disabled`
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
- route policy references
- QoS / bandwidth policy references
- placement constraints
- safe status projection

The entity belongs to the control plane. It must not be inferred from a node
environment variable, a manually started VPN process, or a host-local config
file.

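As an illustration only, the target fields above could be modeled as an immutable record. This is a minimal sketch, not the C18A schema: the Python names `VpnConnection`, `target_endpoint`, `provider_family`, `allowed_node_ids`, `routing_usage`, `route_policy_ids`, and `qos_policy_ids` are hypothetical, the default of `disabled` for desired state is an assumption, and placement constraints and the safe status projection are omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class DesiredState(str, Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class Mode(str, Enum):
    # Only the initial mode named in this document is modeled.
    SINGLE_ACTIVE = "single_active"


@dataclass(frozen=True)
class VpnConnection:
    """Hypothetical in-memory shape of the control-plane entity."""

    id: str
    organization_id: str
    cluster_id: str
    name: str
    target_endpoint: str               # assumed name for "target endpoint / office identity"
    provider_family: str               # assumed name for "protocol/provider family"
    credential_ref: str                # a reference only, never the secret itself
    allowed_node_ids: frozenset[str]   # assumed encoding of "allowed node policy"
    mode: Mode = Mode.SINGLE_ACTIVE
    desired_state: DesiredState = DesiredState.DISABLED  # assumed default
    routing_usage: frozenset[str] = frozenset()
    route_policy_ids: tuple[str, ...] = ()
    qos_policy_ids: tuple[str, ...] = ()
```

Freezing the record reflects the rule that nodes consume desired state rather than mutate it; any change would go through the control plane and produce a new record.
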
### `vpn_connection_lease`

`vpn_connection_lease` represents current ownership.

Target fields:

- `vpn_connection_id`
- `cluster_id`
- `organization_id`
- `owner_node_id`
- lease generation / fencing epoch
- `expires_at`
- `renewed_at`
- `released_at`
- `fenced_at`
- status

Only the current owner with a valid, unexpired, unfenced lease may execute the
VPN connection.

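The ownership rule above can be expressed as a small predicate. The following is a sketch under assumptions: the class name `VpnConnectionLease`, the field name `generation`, and the function `lease_permits_execution` are hypothetical, the `status` field is left out, and "valid" is interpreted as "not released".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class VpnConnectionLease:
    """Hypothetical in-memory view of a lease row."""

    vpn_connection_id: str
    cluster_id: str
    organization_id: str
    owner_node_id: str
    generation: int                      # lease generation / fencing epoch
    expires_at: datetime
    renewed_at: Optional[datetime] = None
    released_at: Optional[datetime] = None
    fenced_at: Optional[datetime] = None


def lease_permits_execution(
    lease: VpnConnectionLease, node_id: str, now: datetime
) -> bool:
    """True only for the current owner of an unreleased, unfenced, unexpired lease."""
    return (
        lease.owner_node_id == node_id
        and lease.released_at is None
        and lease.fenced_at is None
        and now < lease.expires_at
    )
```

Passing `now` explicitly keeps the check deterministic and testable; which clock is authoritative for expiry is a design decision this sketch does not settle.
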
### `vpn_route_policy`

`vpn_route_policy` defines what traffic may use the connection.

Policy dimensions:

- allowed CIDRs
- denied CIDRs
- DNS suffix or DNS server policy
- split-tunnel or full-tunnel eligibility
- service-specific usage
- resource-specific usage
- organization and role scope
- QoS class and bandwidth limits

Route policy is desired state. Runtime nodes apply scoped policy; they do not
invent routes.

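To make the allowed/denied CIDR dimensions concrete, here is a minimal sketch using Python's standard `ipaddress` module. The names `RoutePolicy` and `is_destination_allowed` are hypothetical, and the rule that a deny match takes precedence over an allow match is an assumption; this document does not specify precedence.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutePolicy:
    """Hypothetical subset of vpn_route_policy: CIDR dimensions only."""

    allowed_cidrs: tuple[str, ...]
    denied_cidrs: tuple[str, ...] = ()


def is_destination_allowed(policy: RoutePolicy, destination: str) -> bool:
    """Return True if the destination IP is allowed and not explicitly denied."""
    addr = ipaddress.ip_address(destination)

    def matches(cidrs: tuple[str, ...]) -> bool:
        # Membership is False when the address and network IP versions differ.
        return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

    # Assumption: deny wins over allow, and anything unlisted is rejected.
    if matches(policy.denied_cidrs):
        return False
    return matches(policy.allowed_cidrs)
```

For example, a policy that allows `10.20.0.0/16` but denies `10.20.99.0/24` would accept `10.20.1.5` and reject both `10.20.99.7` and `192.168.1.1`.
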
### `vpn_credential_ref`

VPN credentials must be referenced through an approved secret resolver.

Nodes receive credentials/config only when authorized and only when required to
execute or pre-warm the assigned connection. Nodes must not receive unrelated
organization credentials.

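A resolver-side gate for the rule above might look like the following sketch. Everything here is illustrative: `ConnectionAuthorization`, `prewarm_nodes`, and `may_resolve_credentials` are hypothetical names, and the real secret resolver interface is not defined in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConnectionAuthorization:
    """Hypothetical per-connection record of who may receive secrets."""

    active_owner: Optional[str]
    prewarm_nodes: frozenset[str] = frozenset()


def may_resolve_credentials(
    node_id: str,
    connection_id: str,
    authorizations: dict[str, ConnectionAuthorization],
) -> bool:
    """Allow resolution only for the active owner or an approved pre-warm node."""
    auth = authorizations.get(connection_id)
    if auth is None:
        # Unknown or unrelated connection: never hand out credentials.
        return False
    return node_id == auth.active_owner or node_id in auth.prewarm_nodes
```

Keying the lookup by connection keeps the default answer negative: a node that asks about a connection it was never assigned gets nothing.
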
## Architecture Placement

```text
Control Plane
  owns vpn_connection desired state, policy, lease/fencing, audit

Fabric Core
  owns node identity, role assignment consumption, scoped snapshots,
  node-local state, and service supervision boundary

Fabric Routing Engine
  chooses path to the current active VPN owner or eligible egress pool

VPN/IP Tunnel Service Runtime
  executes tunnel only when assigned and leased

Data Plane
  carries encrypted tunnel packets later, with QoS and backpressure
```

The backend/control plane must not become a production VPN packet relay.

## Control Plane Responsibilities

The control plane owns:

- durable `vpn_connection` desired state
- route policy and service usage policy
- allowed node policy
- placement and candidate selection
- lease creation, renewal validation, and fencing decisions
- safe status projection
- audit events
- credential reference ownership

The control plane does not push arbitrary packets. It authorizes and records
what should exist.

## Node Responsibilities

Nodes do not decide to create VPN connections.

A node may execute a connection only when all of the following are true:

- node belongs to the correct cluster
- node has the required capability and role assignment
- node is allowed by the `vpn_connection` node policy
- node has a current signed/scoped configuration snapshot
- node holds the active lease
- desired state is `enabled`
- organization and service policy permit use

The node must stop execution when:

- lease is lost, expired, or fenced
- desired state becomes `disabled`
- role assignment is removed
- allowed node policy changes
- local node enters unsafe partition/degraded state
- cluster tells the node to drain

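The execution preconditions above form a conjunction, which a node agent could evaluate as a checklist that also reports why execution is blocked. This is a sketch only: `ExecutionContext`, `blocking_reasons`, `can_run_tunnel`, and the reason strings are hypothetical and are not taken from `rap-node-agent`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionContext:
    """Hypothetical snapshot of the facts a node knows about one connection."""

    in_correct_cluster: bool
    has_capability_and_role: bool
    allowed_by_node_policy: bool
    has_current_snapshot: bool
    holds_active_lease: bool
    desired_state: str
    policy_permits_use: bool


def blocking_reasons(ctx: ExecutionContext) -> list[str]:
    """Return every unmet precondition; an empty list means execution is allowed."""
    checks = [
        (ctx.in_correct_cluster, "wrong_cluster"),
        (ctx.has_capability_and_role, "missing_capability_or_role"),
        (ctx.allowed_by_node_policy, "not_in_allowed_node_policy"),
        (ctx.has_current_snapshot, "stale_or_missing_snapshot"),
        (ctx.holds_active_lease, "no_active_lease"),
        (ctx.desired_state == "enabled", "desired_state_not_enabled"),
        (ctx.policy_permits_use, "policy_denies_use"),
    ]
    return [reason for ok, reason in checks if not ok]


def can_run_tunnel(ctx: ExecutionContext) -> bool:
    return not blocking_reasons(ctx)
```

Re-evaluating the same checklist on every snapshot or lease change also covers most of the stop conditions: as soon as any precondition turns false, `can_run_tunnel` returns `False` and the runtime should stop.
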
## Single-Active Lease and Fencing Model

The initial mode is `single_active`.

Correctness requirements:

- exactly one node may maintain the active VPN tunnel
- stale owners must be fenced before replacement becomes authoritative
- ownership changes must be monotonic through a lease generation or equivalent
  fencing epoch
- connect/disconnect must be idempotent
- split-brain must not create duplicate active tunnels

Suggested target mechanics:

- short lease TTL
- periodic renewal
- monotonic lease generation
- node-local watchdog that stops tunnel when renewal fails
- explicit release on graceful shutdown
- fencing event before replacement if previous owner is uncertain

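The mechanics above can be illustrated with a small in-memory lease table. This is a teaching sketch under stated assumptions, not the C18B service: `LeaseTable`, `Lease`, and their methods are hypothetical, time is passed in as a plain number, and a real implementation would persist state in PostgreSQL and handle concurrency and clock skew, which this code does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner_node_id: str
    generation: int       # monotonic fencing epoch
    expires_at: float
    fenced: bool = False


class LeaseTable:
    """Hypothetical single-active lease bookkeeping for illustration."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._leases: dict[str, Lease] = {}

    def acquire(self, connection_id: str, node_id: str, now: float) -> Optional[Lease]:
        current = self._leases.get(connection_id)
        if current and not current.fenced and now < current.expires_at:
            # Idempotent for the current owner; refused for anyone else.
            return current if current.owner_node_id == node_id else None
        # Expired, fenced, or absent: hand out a strictly higher generation.
        generation = current.generation + 1 if current else 1
        lease = Lease(node_id, generation, now + self.ttl)
        self._leases[connection_id] = lease
        return lease

    def renew(self, connection_id: str, node_id: str, generation: int, now: float) -> bool:
        current = self._leases.get(connection_id)
        if (
            current is None
            or current.fenced
            or current.owner_node_id != node_id
            or current.generation != generation   # stale owner presenting an old epoch
            or now >= current.expires_at
        ):
            return False
        current.expires_at = now + self.ttl
        return True

    def fence(self, connection_id: str) -> None:
        current = self._leases.get(connection_id)
        if current:
            current.fenced = True
```

A node-local watchdog would call `renew` periodically and stop the tunnel the first time it returns `False`. Because a stale owner still holds the old generation, any later renewal attempt from it is rejected even after a replacement has taken over.
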
## Routing Policy Model

Traffic references a logical `vpn_connection`, not a physical node.

Examples:

- RDP resource may require `vpn_connection = office-a`
- SSH resource may require `vpn_connection = office-a`
- IP tunnel profile may expose selected CIDRs through `office-a`
- HTTP/internal app resource may route through the active VPN owner

The Fabric Routing Engine resolves:

```text
service request
  -> logical vpn_connection
  -> current active owner / eligible egress
  -> fabric route
  -> VPN service runtime
  -> private network target
```

Route updates should be dynamic. Changing CIDRs, DNS policy, or active owner
should not require manual client reconfiguration when clients use platform
managed access.

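The indirection described above (resource to logical connection to current owner) is what lets an owner change without touching resource definitions. The sketch below shows only that lookup; `RouteResolver`, `ResolvedRoute`, and the method names are hypothetical and do not describe the Fabric Routing Engine's actual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ResolvedRoute:
    vpn_connection: str
    egress_node_id: str


class RouteResolver:
    """Hypothetical two-step lookup: resource -> vpn_connection -> active owner."""

    def __init__(self) -> None:
        self._resource_to_connection: dict[str, str] = {}
        self._active_owner: dict[str, str] = {}

    def bind_resource(self, resource_id: str, vpn_connection: str) -> None:
        self._resource_to_connection[resource_id] = vpn_connection

    def set_active_owner(self, vpn_connection: str, node_id: str) -> None:
        self._active_owner[vpn_connection] = node_id

    def resolve(self, resource_id: str) -> Optional[ResolvedRoute]:
        connection = self._resource_to_connection.get(resource_id)
        if connection is None:
            return None
        owner = self._active_owner.get(connection)
        if owner is None:
            # No active owner right now, for example during failover.
            return None
        return ResolvedRoute(connection, owner)
```

After a failover, calling `set_active_owner("office-a", ...)` with the new node is the only update needed; every resource bound to `office-a` resolves to the new egress on the next lookup.
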
## QoS and Bandwidth Rules

VPN bulk traffic must degrade before interactive traffic.

Priority order:

1. RDP input/control
2. interactive RDP/VNC/SSH control and render-critical traffic
3. clipboard and small reliable control messages
4. video/audio adaptive traffic
5. file transfer
6. VPN bulk packets
7. telemetry

Bandwidth policy should support:

- per-organization limits
- per-service limits
- per-`vpn_connection` limits
- per-node limits
- reserved bandwidth for interactive services

Service adapters must not implement QoS routing themselves. They label traffic
or request a channel class; Fabric applies route/QoS policy.

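One way to read the priority order is as a strict-priority bandwidth allocation, where lower classes receive only what higher classes leave over. The sketch below assumes strict priority purely for illustration; this document does not prescribe a scheduler, and the names `TrafficClass` and `allocate` are hypothetical. A production policy would also need the limits and reservations listed above, plus some protection against starvation.

```python
from enum import IntEnum


class TrafficClass(IntEnum):
    """Priority order from this section; a lower value means higher priority."""

    RDP_INPUT_CONTROL = 1
    INTERACTIVE_CONTROL = 2
    CLIPBOARD_SMALL_CONTROL = 3
    ADAPTIVE_MEDIA = 4
    FILE_TRANSFER = 5
    VPN_BULK = 6
    TELEMETRY = 7


def allocate(
    capacity_kbps: int, demand_kbps: dict[TrafficClass, int]
) -> dict[TrafficClass, int]:
    """Grant bandwidth in priority order until capacity runs out."""
    remaining = capacity_kbps
    grants: dict[TrafficClass, int] = {}
    for traffic_class in sorted(demand_kbps):
        grant = min(demand_kbps[traffic_class], remaining)
        grants[traffic_class] = grant
        remaining -= grant
    return grants
```

With 1000 kbps of capacity and demands of 200 (RDP input), 500 (file transfer), 2000 (VPN bulk), and 50 (telemetry), the interactive and file classes are served in full, VPN bulk is cut to 300, and telemetry receives nothing.
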
## Security Boundaries

Security requirements:

- organization-scoped `vpn_connection`
- cluster-scoped identity and tokens
- mTLS node-to-node transport
- short-lived route/tunnel authorization tokens when needed
- credentials delivered only through approved resolver
- candidate nodes receive only scoped config
- active owner receives execution secrets only when authorized
- no organization sees another organization's connections, routes, credentials,
  peer cache, or topology
- platform owner actions are audited

Compromised node blast radius must be bounded. A compromised node must not gain
credentials for unrelated `vpn_connection` entities or unrelated organizations.

## Observability and Audit

Audit events:

- `vpn_connection_created`
- `vpn_connection_enabled`
- `vpn_connection_disabled`
- `vpn_connection_policy_changed`
- `vpn_connection_candidate_changed`
- `vpn_connection_lease_acquired`
- `vpn_connection_lease_renewed`
- `vpn_connection_lease_lost`
- `vpn_connection_owner_fenced`
- `vpn_connection_failover_started`
- `vpn_connection_failover_completed`
- `vpn_connection_credential_rotated`
- `vpn_route_policy_changed`

Metrics/status:

- desired state
- active owner
- standby/pre-warm owners
- lease generation
- last connect/disconnect time
- route count
- latency/packet loss where observable
- bandwidth by service class
- failover count
- last failure reason

Organization views show safe status. Platform owner views may show active node
and operational detail according to platform policy and audit.

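The split between organization and platform owner views can be sketched as an allow-list projection. Which fields count as "safe" is an assumption made here for illustration; `ORG_SAFE_FIELDS`, `project_status`, the field keys, and the `viewer_role` values are hypothetical.

```python
# Assumed allow-list: fields that reveal no node identity or mesh topology.
ORG_SAFE_FIELDS = frozenset(
    {
        "desired_state",
        "last_connect_time",
        "last_disconnect_time",
        "last_failure_reason",
    }
)


def project_status(status: dict, viewer_role: str) -> dict:
    """Return the full status for the platform owner, a filtered copy otherwise."""
    if viewer_role == "platform_owner":
        return dict(status)
    # Allow-list rather than deny-list: new fields stay hidden until reviewed.
    return {key: value for key, value in status.items() if key in ORG_SAFE_FIELDS}
```

Using an allow-list means a newly added metric such as a standby node list is hidden from organizations by default, which matches the rule that organizations must not see mesh topology.
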
## Failure Mode Matrix

| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical `vpn_connection`; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |

## Stateful Session Limits

VPN failover may disrupt long-lived TCP sessions. The platform should minimize
disruption through sticky placement, graceful drain, standby/pre-warm nodes,
stable route identity, and transparent route refresh, but the initial
`single_active` mode does not guarantee lossless TCP migration.

Future `multi_active` or load-balanced VPN modes may reduce disruption. They
must be explicit future modes and must not weaken `single_active` correctness.

## Relationship to Current Mesh Proof Set

C17A-C17G prove synthetic fabric messages, route health/failover probes, relay
semantics, a bounded `synthetic.echo` path, live synthetic HTTP node-to-node
transport, scoped synthetic route config loading, and Control Plane scoped
synthetic config reads in `rap-node-agent`.

They do not authorize VPN traffic.

VPN/IP tunnel runtime must wait until the control-plane desired-state model,
lease/fencing, scoped snapshots, node-local state, secure node-to-node
channels, and Fabric Routing Engine boundaries are accepted for this service.

## Future Implementation Stages

C18A - VPN/IP tunnel control-plane data model foundation:

- durable `vpn_connections`
- route policy tables
- allowed node policy
- lease/fencing model
- audit events
- no runtime packets

Status: completed and backend-test-proven. Result:
`artifacts/c18a-vpn-control-plane-data-model-report.md`.

C18B - Lease and fencing service:

- single-active ownership service
- TTL renewal/fencing behavior
- stale owner handling
- no real VPN runtime

Status: completed and backend-test-proven. Result:
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.

C18C - Node-agent desired-state consumption:

- node reads scoped `vpn_connection` assignments
- reports status
- does not create real tunnel

Status: completed and backend-test-proven. Result:
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.

Notes:

- node assignment visibility is limited to eligible candidates or the current
  active lease owner
- observed assignment status is explicit: `not_started`, `assigned`,
  `lease_required`, `blocked`, `unknown`
- `credential_ref` is not exposed to node-agent assignment payloads
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet
  forwarding, or production mesh traffic is implemented

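The C18C notes imply a scoped assignment payload with an explicit status vocabulary. The sketch below only illustrates those two constraints (visibility scoping and the omission of `credential_ref`); `AssignmentStatus`, `build_assignment_payload`, and the dictionary keys are hypothetical and are not the actual C18C contract.

```python
from enum import Enum
from typing import Optional


class AssignmentStatus(str, Enum):
    """The five observed assignment statuses listed in the notes above."""

    NOT_STARTED = "not_started"
    ASSIGNED = "assigned"
    LEASE_REQUIRED = "lease_required"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


def build_assignment_payload(
    connection: dict,
    node_id: str,
    eligible_node_ids: set[str],
    lease_owner_id: Optional[str],
) -> Optional[dict]:
    """Return a scoped payload, or None if the node may not see the assignment."""
    if node_id not in eligible_node_ids and node_id != lease_owner_id:
        return None
    # credential_ref is deliberately stripped from node-agent payloads.
    return {key: value for key, value in connection.items() if key != "credential_ref"}
```

A node outside the candidate set receives `None` rather than a redacted payload, so it cannot learn that the connection exists at all.
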
C18D - Secret resolver integration:

- scoped credential/config delivery
- candidate/active-owner restrictions
- credential rotation audit

C18E - Routing policy integration:

- CIDR and service-specific route intent
- route projection to Fabric Routing Engine
- no packet forwarding

C18F - Non-production fake VPN executor:

- synthetic leased service state only
- no TUN, no packets, no private network routing

C18G - Lab-only native VPN executor prototype:

- explicit separate approval required
- native mode preferred for TUN/firewall/QoS
- no privileged container by default

C18H - Client route refresh/resume design:

- route updates
- reconnect behavior
- split/full tunnel client posture

C18I - Production hardening:

- split-brain drills
- failover testing
- QoS load testing
- security review
- observability and incident runbooks

## Result / Decision

Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core.
The first implementation step must be control-plane desired state and
lease/fencing foundation, not packet routing. Nodes are execution units, not
owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the
source of truth, Redis remains live coordination only, and Fabric
Storage/Config Storage remains a scoped distribution/cache layer. RDP, current
direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all
existing service-adapter behavior are untouched by this document. C18A,
C18B, and C18C are now implemented only as control-plane/node-agent contract
foundation; they still do not authorize VPN/IP tunnel runtime or host
networking changes.