Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,428 @@
# VPN / IP Tunnel Service Target Design
Status: Stage C18 planning result. Documentation only.
This document defines the target VPN/IP tunnel service architecture for the
Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
devices, mesh traffic, service workload execution, API changes, migrations, or
RDP behavior changes.
## Purpose
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
The service must allow managed access to private networks while preserving the
platform's core rules:
- PostgreSQL remains the durable source of truth.
- Redis remains live coordination only.
- Fabric Routing Engine owns route choice.
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
## Non-Goals
Stage C18 does not implement:
- VPN/IP tunnel runtime
- TUN/TAP device handling
- packet forwarding
- host route or firewall manipulation
- QUIC, WebRTC, relay packet routing, or production mesh traffic
- Windows virtual adapter, Android `VpnService`, or mobile client work
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes
## Core Concepts
### `vpn_connection`
`vpn_connection` is the logical control-plane entity for one managed VPN/IP
tunnel connection to a private endpoint such as an office, customer site,
branch network, partner network, or private resource zone.
Target fields:
- `id`
- `organization_id`
- `cluster_id`
- `name`
- target endpoint / office identity
- protocol/provider family
- credential/config reference
- allowed node policy
- mode: `single_active` for the initial model
- desired state: `enabled` or `disabled`
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
- route policy references
- QoS / bandwidth policy references
- placement constraints
- safe status projection
The entity belongs to the control plane. It must not be inferred from a node
environment variable, a manually started VPN process, or a host-local config
file.
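
As an illustration only, the target fields above could be modeled as an immutable record. This is a minimal sketch, not the project's schema: the Python attribute names `target_endpoint`, `provider_family`, `allowed_node_ids`, `routing_usage`, `route_policy_ids`, and `qos_policy_ids` are hypothetical stand-ins for the unnamed fields in the list, and the `disabled` default is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class DesiredState(str, Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


class Mode(str, Enum):
    SINGLE_ACTIVE = "single_active"  # the only mode in the initial model


@dataclass(frozen=True)
class VpnConnection:
    id: str
    organization_id: str
    cluster_id: str
    name: str
    target_endpoint: str   # hypothetical name: target endpoint / office identity
    provider_family: str   # hypothetical name: protocol/provider family
    credential_ref: str    # a reference only, never the secret itself
    allowed_node_ids: frozenset[str] = frozenset()  # hypothetical shape of node policy
    mode: Mode = Mode.SINGLE_ACTIVE
    desired_state: DesiredState = DesiredState.DISABLED  # assumed default
    routing_usage: frozenset[str] = frozenset()
    route_policy_ids: tuple[str, ...] = ()
    qos_policy_ids: tuple[str, ...] = ()


office_a = VpnConnection(
    id="vc-1",
    organization_id="org-1",
    cluster_id="cluster-1",
    name="office-a",
    target_endpoint="office-a-gateway",
    provider_family="example-provider",
    credential_ref="secret-ref-1",
)
```

Freezing the record reflects that desired state is changed through the control plane rather than mutated on a node.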
### `vpn_connection_lease`
`vpn_connection_lease` represents current ownership.
Target fields:
- `vpn_connection_id`
- `cluster_id`
- `organization_id`
- `owner_node_id`
- lease generation / fencing epoch
- `expires_at`
- `renewed_at`
- `released_at`
- `fenced_at`
- status
Only the current owner with a valid, unexpired, unfenced lease may execute the
VPN connection.
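
The ownership rule above can be sketched as a single predicate. This is a hedged illustration: the `generation` attribute name and the `may_execute` helper are assumptions, and the status field is omitted for brevity.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class VpnConnectionLease:
    vpn_connection_id: str
    cluster_id: str
    organization_id: str
    owner_node_id: str
    generation: int  # lease generation / fencing epoch
    expires_at: datetime
    renewed_at: Optional[datetime] = None
    released_at: Optional[datetime] = None
    fenced_at: Optional[datetime] = None


def may_execute(lease: VpnConnectionLease, node_id: str, now: datetime) -> bool:
    """True only for the current owner of a valid, unexpired, unfenced lease."""
    return (
        lease.owner_node_id == node_id
        and lease.released_at is None
        and lease.fenced_at is None
        and now < lease.expires_at
    )


t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
lease = VpnConnectionLease(
    "vc-1", "cluster-1", "org-1", "node-a", 3, t0 + timedelta(seconds=30)
)
fenced = replace(lease, fenced_at=t0)
```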
### `vpn_route_policy`
`vpn_route_policy` defines what traffic may use the connection.
Policy dimensions:
- allowed CIDRs
- denied CIDRs
- DNS suffix or DNS server policy
- split-tunnel or full-tunnel eligibility
- service-specific usage
- resource-specific usage
- organization and role scope
- QoS class and bandwidth limits
Route policy is desired state. Runtime nodes apply scoped policy; they do not
invent routes.
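
One way to evaluate the allowed/denied CIDR dimensions is shown below. This is a sketch under two assumptions the document does not state: a denied CIDR takes precedence over an allowed one, and a destination that matches no allowed CIDR is rejected. The function name is hypothetical.

```python
import ipaddress


def is_destination_allowed(
    destination: str, allowed_cidrs: list[str], denied_cidrs: list[str]
) -> bool:
    """Deny wins over allow (assumed precedence); no match means not allowed."""
    addr = ipaddress.ip_address(destination)
    if any(addr in ipaddress.ip_network(cidr) for cidr in denied_cidrs):
        return False
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


allowed = ["10.1.0.0/16"]
denied = ["10.1.9.0/24"]
```

With these example values, `10.1.2.3` is permitted, `10.1.9.5` is rejected by the deny rule, and `192.168.1.1` is rejected because no allowed CIDR covers it.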
### `vpn_credential_ref`
VPN credentials must be referenced through an approved secret resolver.
Nodes receive credentials/config only when authorized and only when required to
execute or pre-warm the assigned connection. Nodes must not receive unrelated
organization credentials.
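
A resolver-side gate for this rule might look like the following. It is a minimal sketch: the function name and both lookup tables are hypothetical, and a real resolver would also check authorization, cluster, and organization scope.

```python
def may_receive_credentials(
    node_id: str,
    requested_connection_id: str,
    active_owner_by_connection: dict[str, str],
    prewarm_nodes_by_connection: dict[str, frozenset[str]],
) -> bool:
    """Release credentials only to the node executing or pre-warming
    the requested connection; every other request is refused."""
    if active_owner_by_connection.get(requested_connection_id) == node_id:
        return True
    prewarm = prewarm_nodes_by_connection.get(requested_connection_id, frozenset())
    return node_id in prewarm


owners = {"office-a": "node-a"}
prewarm = {"office-a": frozenset({"node-b"})}
```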
## Architecture Placement
```text
Control Plane
owns vpn_connection desired state, policy, lease/fencing, audit
Fabric Core
owns node identity, role assignment consumption, scoped snapshots,
node-local state, and service supervision boundary
Fabric Routing Engine
chooses path to the current active VPN owner or eligible egress pool
VPN/IP Tunnel Service Runtime
executes tunnel only when assigned and leased
Data Plane
carries encrypted tunnel packets later, with QoS and backpressure
```
The backend/control plane must not become a production VPN packet relay.
## Control Plane Responsibilities
The control plane owns:
- durable `vpn_connection` desired state
- route policy and service usage policy
- allowed node policy
- placement and candidate selection
- lease creation, renewal validation, and fencing decisions
- safe status projection
- audit events
- credential reference ownership
The control plane does not push arbitrary packets. It authorizes and records
what should exist.
## Node Responsibilities
Nodes do not decide to create VPN connections.
A node may execute a connection only when all of the following are true:
- node belongs to the correct cluster
- node has the required capability and role assignment
- node is allowed by the `vpn_connection` node policy
- node has a current signed/scoped configuration snapshot
- node holds the active lease
- desired state is `enabled`
- organization and service policy permit use
The node must stop execution when:
- lease is lost, expired, or fenced
- desired state becomes `disabled`
- role assignment is removed
- allowed node policy changes
- local node enters unsafe partition/degraded state
- cluster tells the node to drain
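
The start and stop conditions above can be folded into one check that a node-local supervisor re-evaluates continuously. This is an illustrative sketch: the `ExecutionFacts` type, its attribute names, and the reason strings are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExecutionFacts:
    in_correct_cluster: bool
    has_capability_and_role: bool
    allowed_by_node_policy: bool
    has_current_snapshot: bool
    holds_active_lease: bool
    desired_state_enabled: bool
    policy_permits_use: bool
    unsafe_partition_or_degraded: bool = False
    drain_requested: bool = False


def blocking_reasons(facts: ExecutionFacts) -> list[str]:
    """Return every reason the node must not run (or must stop) the tunnel."""
    checks = {
        "wrong_cluster": not facts.in_correct_cluster,
        "missing_capability_or_role": not facts.has_capability_and_role,
        "not_allowed_by_node_policy": not facts.allowed_by_node_policy,
        "no_current_snapshot": not facts.has_current_snapshot,
        "lease_not_held": not facts.holds_active_lease,
        "desired_state_disabled": not facts.desired_state_enabled,
        "policy_denies_use": not facts.policy_permits_use,
        "unsafe_partition_or_degraded": facts.unsafe_partition_or_degraded,
        "drain_requested": facts.drain_requested,
    }
    return [reason for reason, failed in checks.items() if failed]


healthy = ExecutionFacts(True, True, True, True, True, True, True)
lost_lease = replace(healthy, holds_active_lease=False)
draining = replace(healthy, drain_requested=True)
```

An empty list means execution may start or continue; any entry means the runtime must stop.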
## Single-Active Lease and Fencing Model
The initial mode is `single_active`.
Correctness requirement:
- at most one node may maintain the active VPN tunnel at any time
- stale owners must be fenced before replacement becomes authoritative
- ownership changes must be monotonic through a lease generation or equivalent
fencing epoch
- connect/disconnect must be idempotent
- split-brain must not create duplicate active tunnels
Suggested target mechanics:
- short lease TTL
- periodic renewal
- monotonic lease generation
- node-local watchdog that stops tunnel when renewal fails
- explicit release on graceful shutdown
- fencing event before replacement if previous owner is uncertain
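
The mechanics above can be sketched with an in-memory table. This is a hedged illustration, not the lease service: `SingleActiveLeaseTable` and its methods are hypothetical, time is passed in as a plain number, and an expired lease is treated as safe to replace, which a real design would have to justify with fencing or quorum rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner_node_id: str
    generation: int  # increases monotonically on every ownership change
    expires_at: float
    fenced: bool = False


class SingleActiveLeaseTable:
    """In-memory stand-in for a durable lease store (hypothetical)."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._leases: dict[str, Lease] = {}

    def acquire(self, connection_id: str, node_id: str, now: float) -> Optional[Lease]:
        current = self._leases.get(connection_id)
        if current is not None and not current.fenced and now < current.expires_at:
            # Idempotent for the current owner; refused for anyone else.
            return current if current.owner_node_id == node_id else None
        generation = current.generation + 1 if current is not None else 1
        lease = Lease(node_id, generation, now + self._ttl)
        self._leases[connection_id] = lease
        return lease

    def renew(self, connection_id: str, node_id: str, generation: int, now: float) -> bool:
        current = self._leases.get(connection_id)
        if (
            current is None
            or current.fenced
            or current.owner_node_id != node_id
            or current.generation != generation
            or now >= current.expires_at
        ):
            return False  # the caller's watchdog should stop the tunnel
        current.expires_at = now + self._ttl
        return True

    def fence(self, connection_id: str) -> None:
        current = self._leases.get(connection_id)
        if current is not None:
            current.fenced = True
```

A stale owner that retries `renew` with an old generation is rejected, which is what lets a node-local watchdog stop the tunnel without waiting for any packet path.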
## Routing Policy Model
Traffic references a logical `vpn_connection`, not a physical node.
Examples:
- RDP resource may require `vpn_connection = office-a`
- SSH resource may require `vpn_connection = office-a`
- IP tunnel profile may expose selected CIDRs through `office-a`
- HTTP/internal app resource may route through the active VPN owner
The Fabric Routing Engine resolves:
```text
service request
-> logical vpn_connection
-> current active owner / eligible egress
-> fabric route
-> VPN service runtime
-> private network target
```
Route updates should be dynamic. Changing CIDRs, DNS policy, or active owner
should not require manual client reconfiguration when clients use
platform-managed access.
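
The indirection through a logical `vpn_connection` can be sketched as two lookups. The function name and both mappings are hypothetical; the real Fabric Routing Engine would also compute the fabric route and apply policy.

```python
from typing import Optional


def resolve_egress_node(
    resource_id: str,
    required_connection: dict[str, str],
    active_owner: dict[str, str],
) -> Optional[str]:
    """Map a resource to its logical vpn_connection, then to the current owner.

    Returns None when the resource needs no connection or no owner is active.
    """
    connection = required_connection.get(resource_id)
    if connection is None:
        return None
    return active_owner.get(connection)


required = {"rdp-host-1": "office-a", "ssh-host-2": "office-a"}
owners = {"office-a": "node-a"}
before = resolve_egress_node("rdp-host-1", required, owners)

# After failover only the owner mapping changes; resources stay untouched.
owners["office-a"] = "node-b"
after = resolve_egress_node("rdp-host-1", required, owners)
```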
## QoS and Bandwidth Rules
VPN bulk traffic must degrade before interactive traffic.
Priority order:
1. RDP input/control
2. interactive RDP/VNC/SSH control and render-critical traffic
3. clipboard and small reliable control messages
4. video/audio adaptive traffic
5. file transfer
6. VPN bulk packets
7. telemetry
Bandwidth policy should support:
- per-organization limits
- per-service limits
- per-`vpn_connection` limits
- per-node limits
- reserved bandwidth for interactive services
Service adapters must not implement QoS routing themselves. They label traffic
or request a channel class; Fabric applies route/QoS policy.
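
To make the priority order concrete, the sketch below grants bandwidth class by class. It assumes strict priority, which is only one possible policy and is not specified by this document; the class labels and the `allocate` helper are hypothetical, and reserved bandwidth and per-scope limits are left out.

```python
# Highest priority first, mirroring the list above (labels are hypothetical).
PRIORITY_ORDER = [
    "rdp_input_control",
    "interactive_control",
    "clipboard_control",
    "video_audio",
    "file_transfer",
    "vpn_bulk",
    "telemetry",
]


def allocate(capacity_kbps: int, demand_kbps: dict[str, int]) -> dict[str, int]:
    """Grant demand in priority order until capacity runs out."""
    remaining = capacity_kbps
    grants: dict[str, int] = {}
    for traffic_class in PRIORITY_ORDER:
        grant = min(demand_kbps.get(traffic_class, 0), remaining)
        grants[traffic_class] = grant
        remaining -= grant
    return grants


grants = allocate(
    1000, {"rdp_input_control": 200, "vpn_bulk": 2000, "telemetry": 50}
)
```

In this example RDP input keeps its full 200 kbps, VPN bulk is cut to the remaining 800 kbps, and telemetry receives nothing.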
## Security Boundaries
Security requirements:
- organization-scoped `vpn_connection`
- cluster-scoped identity and tokens
- mTLS node-to-node transport
- short-lived route/tunnel authorization tokens when needed
- credentials delivered only through approved resolver
- candidate nodes receive only scoped config
- active owner receives execution secrets only when authorized
- no organization sees another organization's connections, routes, credentials,
peer cache, or topology
- platform owner actions are audited
Compromised node blast radius must be bounded. A compromised node must not gain
credentials for unrelated `vpn_connection` entities or unrelated organizations.
## Observability and Audit
Audit events:
- `vpn_connection_created`
- `vpn_connection_enabled`
- `vpn_connection_disabled`
- `vpn_connection_policy_changed`
- `vpn_connection_candidate_changed`
- `vpn_connection_lease_acquired`
- `vpn_connection_lease_renewed`
- `vpn_connection_lease_lost`
- `vpn_connection_owner_fenced`
- `vpn_connection_failover_started`
- `vpn_connection_failover_completed`
- `vpn_connection_credential_rotated`
- `vpn_route_policy_changed`
Metrics/status:
- desired state
- active owner
- standby/pre-warm owners
- lease generation
- last connect/disconnect time
- route count
- latency/packet loss where observable
- bandwidth by service class
- failover count
- last failure reason
Organization views show safe status. Platform owner views may show active node
and operational detail according to platform policy and audit.
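
The split between organization and platform owner views can be sketched as an allow-list projection. Which fields count as organization-safe is an assumption made here for illustration; the field names, viewer labels, and `project_status` helper are hypothetical.

```python
# Assumed organization-safe fields; node identities are deliberately excluded.
ORG_SAFE_FIELDS = frozenset({"desired_state", "route_count", "failover_count"})


def project_status(status: dict, viewer: str) -> dict:
    """Return the subset of status a given viewer may see."""
    if viewer == "platform_owner":
        return dict(status)
    if viewer == "organization":
        return {key: value for key, value in status.items() if key in ORG_SAFE_FIELDS}
    raise ValueError(f"unknown viewer: {viewer}")


status = {
    "desired_state": "enabled",
    "active_owner": "node-a",
    "standby_owners": ["node-b"],
    "lease_generation": 7,
    "route_count": 4,
    "failover_count": 1,
}
org_view = project_status(status, "organization")
```

An allow-list means a newly added status field stays hidden from organizations until someone decides it is safe.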
## Failure Mode Matrix
| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical `vpn_connection`; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |
## Stateful Session Limits
VPN failover may disrupt long-lived TCP sessions. The platform should minimize
disruption through sticky placement, graceful drain, standby/pre-warm nodes,
stable route identity, and transparent route refresh, but the initial
`single_active` mode does not guarantee lossless TCP migration.
Future `multi_active` or load-balanced VPN modes may reduce disruption. They
must be explicit future modes and must not weaken `single_active` correctness.
## Relationship to Current Mesh Proof Set
C17A-C17G prove synthetic fabric messages, route health/failover probes, relay
semantics, a bounded `synthetic.echo` path, live synthetic HTTP node-to-node
transport, scoped synthetic route config loading, and Control Plane scoped
synthetic config reads in `rap-node-agent`.
They do not authorize VPN traffic.
VPN/IP tunnel runtime must wait until the control-plane desired-state model,
lease/fencing, scoped snapshots, node-local state, secure node-to-node
channels, and Fabric Routing Engine boundaries are accepted for this service.
## Future Implementation Stages
C18A - VPN/IP tunnel control-plane data model foundation:
- durable `vpn_connections`
- route policy tables
- allowed node policy
- lease/fencing model
- audit events
- no runtime packets
Status: completed and backend-test-proven. Result:
`artifacts/c18a-vpn-control-plane-data-model-report.md`.
C18B - Lease and fencing service:
- single-active ownership service
- TTL renewal/fencing behavior
- stale owner handling
- no real VPN runtime
Status: completed and backend-test-proven. Result:
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.
C18C - Node-agent desired-state consumption:
- node reads scoped `vpn_connection` assignments
- reports status
- does not create real tunnel
Status: completed and backend-test-proven. Result:
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.
Notes:
- node assignment visibility is limited to eligible candidates or the current
active lease owner
- observed assignment status is explicit: `not_started`, `assigned`,
`lease_required`, `blocked`, `unknown`
- `credential_ref` is not exposed to node-agent assignment payloads
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet
forwarding, or production mesh traffic is implemented
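
The visibility and redaction notes above can be illustrated as follows. This is a sketch, not the C18C payload contract: only the five status values come from this document, while the payload keys, the input record shape, and the `assignment_payloads` helper are hypothetical.

```python
from enum import Enum


class ObservedAssignmentStatus(str, Enum):
    NOT_STARTED = "not_started"
    ASSIGNED = "assigned"
    LEASE_REQUIRED = "lease_required"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


def assignment_payloads(node_id: str, connections: list[dict]) -> list[dict]:
    """Build node-agent payloads for connections this node may see.

    Payloads are built from an allow-list of keys, so credential_ref
    can never leak into them.
    """
    payloads = []
    for conn in connections:
        visible = (
            node_id in conn["candidate_node_ids"]
            or conn.get("active_owner_node_id") == node_id
        )
        if visible:
            payloads.append(
                {"vpn_connection_id": conn["id"], "desired_state": conn["desired_state"]}
            )
    return payloads


connections = [
    {"id": "office-a", "desired_state": "enabled", "credential_ref": "secret-ref-1",
     "candidate_node_ids": {"node-a", "node-b"}, "active_owner_node_id": "node-a"},
    {"id": "office-b", "desired_state": "enabled", "credential_ref": "secret-ref-2",
     "candidate_node_ids": {"node-c"}, "active_owner_node_id": None},
]
visible_to_b = assignment_payloads("node-b", connections)
```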
C18D - Secret resolver integration:
- scoped credential/config delivery
- candidate/active-owner restrictions
- credential rotation audit
C18E - Routing policy integration:
- CIDR and service-specific route intent
- route projection to Fabric Routing Engine
- no packet forwarding
C18F - Non-production fake VPN executor:
- synthetic leased service state only
- no TUN, no packets, no private network routing
C18G - Lab-only native VPN executor prototype:
- explicit separate approval required
- native mode preferred for TUN/firewall/QoS
- no privileged container by default
C18H - Client route refresh/resume design:
- route updates
- reconnect behavior
- split/full tunnel client posture
C18I - Production hardening:
- split-brain drills
- failover testing
- QoS load testing
- security review
- observability and incident runbooks
## Result / Decision
Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core.
The first implementation step must be control-plane desired state and
lease/fencing foundation, not packet routing. Nodes are execution units, not
owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the
source of truth, Redis remains live coordination only, and Fabric
Storage/Config Storage remains a scoped distribution/cache layer. RDP, current
direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all
existing service-adapter behavior are untouched by this document. C18A,
C18B, and C18C are now implemented only as control-plane/node-agent contract
foundation; they still do not authorize VPN/IP tunnel runtime or host
networking changes.