# Secure Access Fabric Target Architecture

Status: target architecture, documentation only.

This document defines the long-term architecture direction for the platform. It must not be read as an implementation request for mesh runtime, VPN runtime, multi-cluster runtime, updater runtime, or a rewrite of the current RDP MVP.

The current proven RDP lifecycle remains a preserved implementation baseline.
RDP work is currently paused by product decision. The active architecture focus
is the lower Fabric Core / cluster / node foundation.

## 1. Project Vision

The project is a Secure Access Fabric: a distributed, multi-tenant platform for secure access to private resources across sites, networks, and organizations.

The platform started with RDP because persistent GUI session lifecycle was the highest-risk proof:

- real remote session start
- attach
- detach without destroying the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- terminate
- worker failure detection

RDP is only the first service. The target platform must support additional secure access services through the same control plane and policy model:

- VNC
- SSH
- VPN / connector access
- file transfer
- video / media relay
- internal web applications
- future service-specific adapters

The platform must remain production-oriented. It must not collapse into a toy proxy, a single-node gateway, or a protocol-specific one-off implementation.

## 2. Architecture Layers

Target layer order:

1. Host OS
2. RAP Fabric Core
3. Secure Fabric Network
4. Service Runtime / Service Adapters
5. Access Clients / Admin UI

### RAP Fabric Core

The RAP Fabric Core is the lower distributed runtime foundation of the
platform.

It is not a real operating system. It runs above the host OS and is implemented
through the native `rap-node-agent`, control-plane contracts, signed
configuration snapshots, and service supervision boundaries.

Responsibilities:

- node identity
- node enrollment
- cluster membership bootstrap
- local node state
- capability reporting
- role assignment consumption
- signed scoped cluster snapshots
- update trust
- service workload supervision boundary

Rules:

- Fabric Core comes before mesh runtime.
- Mesh runtime must not be implemented before node identity, enrollment, and
  role assignment are trustworthy.
- Mesh runtime must not be implemented before scoped configuration
  distribution and node-local state are trustworthy.
- RDP, VNC, SSH, VPN, video, and file services consume Fabric Core. They do not
  define it.
- Service workloads are separate from node identity.
- Nodes should be small, fast, and scoped to the data they need.

### Control Plane

The control plane owns durable domain state and policy decisions.

Responsibilities:

- organizations
- users, memberships, and roles
- resources and resource policies
- sessions and authoritative lifecycle state
- node inventory and capabilities
- cluster membership
- service enablement
- update policy
- audit
- billing / quotas later
- route authorization
- data-plane token issuance

Rules:

- PostgreSQL is the durable source of truth.
- Redis may be used only for live coordination, routing hints, heartbeats, leases, short-lived tokens, and ephemeral cache.
- Redis must not store durable topology, durable configuration, node identity,
  policy, organization data, or long-lived route authority.
- The control plane must not become the production realtime relay for high-rate streams.

### Web Ingress / Admin Web

Web Ingress is the HTTP/HTTPS entry and presentation layer for browser-based
administration. It is not the owner of cluster configuration.

Responsibilities:

- accept HTTP/HTTPS
- terminate TLS or sit behind an approved TLS terminator
- serve the Admin UI shell and static assets
- proxy browser/API requests to Control API
- apply edge controls such as headers, rate limits, and future WAF rules

Rules:

- cluster configuration belongs to the Control Plane, not Web Ingress
- Web Ingress must not directly mutate PostgreSQL
- Web Ingress must not store durable topology, node identity, policy, or
  secrets
- dynamic admin pages must be schema-driven safe projections over Control API
- page definitions must not contain internal mesh topology, peer caches, route
  caches, secrets, raw credentials, or arbitrary executable code
- a Fabric Storage / Config Storage node must not automatically become an
  Admin UI endpoint
- cluster-local Admin UI availability requires explicit admin/web ingress role
  assignment, Control Plane authorization, TLS/cert policy, signed scoped
  snapshot health, and cluster health checks

Admin endpoint ownership:

- Platform Owner Console is global Control Plane / platform-owner scope and may
  aggregate visibility across clusters according to platform policy and audit.
- Cluster Admin Endpoint is a cluster-local admin/web ingress service for one
  cluster. It is separate from storage and must be explicitly assigned to
  authorized ingress/admin nodes.
- Organization Admin Panel is a tenant-safe projection and must expose only
  policy-approved organization resources, service endpoints, and safe status.
- Storage nodes distribute/cache scoped configuration. They do not own browser
  admin entry, platform ownership, cluster authority, or UI routing.

Detailed model:

- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md`

### Secure Mesh / Overlay Network

The mesh connects platform-managed and customer-managed nodes using authenticated, encrypted node-to-node communication.

Responsibilities:

- node identity
- mTLS transport
- route discovery
- route authorization
- multi-hop routing where needed
- partition awareness
- health reporting
- failover hints
- service placement constraints

Rules:

- Not every node needs a full mesh connection to every other node.
- Customer-managed nodes must be scoped and must not become platform-wide trusted relays by default.
- Organizations must not see the full mesh topology.

### Realtime Data Plane

The realtime data plane carries live traffic after the control plane authorizes access.

Traffic examples:

- RDP input
- RDP render / video frames
- clipboard events
- file transfer chunks
- VPN packets
- VNC framebuffer updates
- SSH streams
- service telemetry

Rules:

- The backend API must not be the production realtime relay.
- Clients should connect to the nearest authorized entry/live node.
- Traffic should route through the mesh or direct worker path to the service adapter.
- High-rate render/video traffic must use binary transport and bounded queues.
- Stale render/video frames must be droppable.
- Reliable channels must stay separate from droppable channels.
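
The queueing rules above can be illustrated with a minimal Python sketch. The class names `DroppableFrameQueue` and `ReliableQueue` and the queue sizes are assumptions made for illustration; they are not part of the platform's codebase. The sketch shows one bounded queue that sheds the oldest frame when full and a separate bounded queue that refuses new items instead of discarding anything.

```python
from collections import deque


class DroppableFrameQueue:
    """Bounded queue for render/video frames: when full, the oldest frame is dropped."""

    def __init__(self, max_frames: int) -> None:
        self._frames = deque(maxlen=max_frames)
        self.dropped = 0

    def push(self, frame: bytes) -> None:
        if len(self._frames) == self._frames.maxlen:
            self.dropped += 1  # deque(maxlen=...) evicts the oldest entry on append
        self._frames.append(frame)

    def pop(self):
        return self._frames.popleft() if self._frames else None


class ReliableQueue:
    """Bounded queue for input/control: never drops, signals backpressure instead."""

    def __init__(self, max_items: int) -> None:
        self._items = deque()
        self._max_items = max_items

    def try_push(self, item: bytes) -> bool:
        if len(self._items) >= self._max_items:
            return False  # caller must slow down or retry; nothing is discarded
        self._items.append(item)
        return True

    def pop(self):
        return self._items.popleft() if self._items else None
```

Keeping the two queue types as separate objects mirrors the rule that reliable channels stay separate from droppable channels: a burst of frames can only evict other frames, never input or control messages.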

### Service Adapters

Service adapters translate platform-neutral session/channel contracts to protocol-specific runtimes.

Examples:

- `rdp-worker` wraps FreeRDP.
- `vnc-worker` wraps a future VNC client/runtime.
- `vpn-exit` handles exit routing.
- `vpn-connector` handles private network reachability.
- `video-relay` handles media-optimized paths.

Rules:

- Protocol-specific details must remain behind adapter boundaries.
- Backend modules must not expose FreeRDP, VNC, VPN driver, or codec internals.
- Adapters enforce policy again locally. The control plane is not the only enforcement point.

## 3. Node Roles

Node roles are service capabilities and assignments. A physical or virtual node may host multiple roles when policy allows it.

### `control-api`

Control-plane API service.

Responsibilities:

- authentication and authorization
- organization-scoped APIs
- resource and policy CRUD
- session broker APIs
- node registration and inventory
- token issuance
- audit writes

It is not a production high-rate frame relay.

### `entry-node`

Client-facing realtime entry point.

Responsibilities:

- accepts client live connections
- validates short-lived data-plane tokens
- selects or receives routes to workers/services
- enforces per-channel permissions
- applies client-side QoS and backpressure rules

### `relay-node`

Mesh relay for traffic that cannot connect directly.

Responsibilities:

- node-to-node forwarding
- route failover
- traffic classification
- channel priority enforcement

It does not own durable session state.

### `rdp-worker`

RDP service adapter.

Responsibilities:

- wraps FreeRDP
- connects to target RDP hosts
- preserves server-side RDP session lifecycle
- applies input
- captures render/video updates
- handles RDP virtual channels such as clipboard and restricted drive redirection
- enforces RDP resource policy locally

### `vnc-worker`

Future VNC service adapter.

Responsibilities:

- connects to VNC servers
- translates VNC framebuffer updates into platform render channels
- applies keyboard and mouse input
- enforces VNC-specific policy

### `vpn-exit`

Exit point for VPN-like traffic.

Responsibilities:

- forwards authorized tunnel traffic to selected networks
- enforces split/full tunnel policy
- applies route, DNS, and egress restrictions
- reports traffic and health telemetry

### `vpn-connector`

Connector to private networks.

Responsibilities:

- establishes network reachability to private resource zones
- participates in route selection
- enforces connector placement policy
- may be customer-managed and organization-scoped

### `file-storage-cache`

Controlled cache for file transfer and staged artifacts.

Responsibilities:

- temporary file upload/download cache
- content hashing
- quota enforcement
- safe cleanup
- organization and session namespace isolation

It must not become arbitrary shared filesystem access.

### `update-cache`

Local cache for signed node/service update artifacts.

Responsibilities:

- mirrors approved signed artifacts
- reduces download load
- supports staged rollout
- never serves unsigned binaries

It is not the authoritative version repository. It is a scoped cache/mirror of
approved artifacts.

### `version-storage` / `update-repository`

Logical service for storing and distributing approved platform, node-agent, and
service workload versions.

Responsibilities:

- store signed release manifests
- store artifacts for supported OS / architecture combinations
- mark release channels such as `stable`, `current`, and `candidate`
- preserve the last known good stable version for rollback
- expose version metadata to Control Plane and authorized update-cache nodes
- store migration bundles for service-local data structure changes
- keep artifact hashes, signatures, provenance, compatibility constraints, and
  rollout policy references

Rules:

- PostgreSQL remains authoritative for rollout policy, channel assignment,
  approval, and audit.
- Version storage is an artifact repository and distribution source, not a
  general-purpose database.
- Update-cache nodes may mirror approved artifacts but must not approve,
  mutate, or invent versions.
- Unsigned or unapproved artifacts must never be served as runnable versions.
- Different OS/architecture builds are separate artifacts under one release
  manifest.
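
The serving rules above can be sketched as a small gate that an update-cache node might apply before handing out an artifact. This is a hedged illustration only: `ManifestEntry`, `ReleaseManifest`, and `may_serve` are hypothetical names, and the `signature_valid` flag stands in for a real signature check that this sketch does not implement.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    os_name: str
    arch: str
    sha256: str  # hex digest recorded in the release manifest


@dataclass(frozen=True)
class ReleaseManifest:
    version: str
    channel: str            # "stable", "current", or "candidate"
    approved: bool          # approval is decided by the control plane, not by the cache
    signature_valid: bool   # assumed result of a signature check performed elsewhere
    artifacts: tuple        # one ManifestEntry per OS/architecture build


def may_serve(manifest: ReleaseManifest, os_name: str, arch: str, blob: bytes) -> bool:
    """Return True only for an approved, signed artifact whose hash matches the manifest."""
    if not (manifest.approved and manifest.signature_valid):
        return False
    digest = hashlib.sha256(blob).hexdigest()
    return any(
        entry.os_name == os_name and entry.arch == arch and entry.sha256 == digest
        for entry in manifest.artifacts
    )
```

The cache only reads the `approved` flag. It has no code path that sets it, which reflects the rule that update-cache nodes mirror artifacts but never approve, mutate, or invent versions.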

### `video-relay`

Optimized realtime media path.

Responsibilities:

- carries encoded video streams where applicable
- supports adaptive bitrate / FPS
- preserves channel isolation from reliable control/input channels
- drops stale media frames under pressure

## 3.1 Node Runtime Model

A node is a host-level identity managed by the native `rap-node-agent`.

The node identity is not a container. Containers are packaging and isolation
units for service workloads, not the durable security identity of the node.

### Native Node Agent

`rap-node-agent` should run natively on the host, not inside a container.

The native node-agent owns:

- node identity
- node certificates and key custody
- node registration
- cluster membership bootstrap
- update trust
- health checks
- recovery logic
- service workload supervision
- host capability reporting

The node-agent must remain small and stable. Service workloads may be replaced,
rolled back, restarted, or moved without replacing the host-level node identity.

### Service Workload Runtime Modes

Service workloads may run in either of two modes:

- containerized
- native

Containerized workloads are preferred when packaging, repeatability, and
isolation matter more than direct host networking or kernel integration.

Containerized workloads are preferred for:

- `rdp-worker`
- `vnc-worker`
- `relay-node`
- `entry-node`
- `file-storage-cache`
- `update-cache`
- `video-relay`

Native workloads are preferred when the service must manage host networking,
kernel routing, local firewall/QoS policy, or OS-level virtual adapters.

Native workloads are preferred for:

- `vpn-exit`
- `vpn-connector`
- host route manager
- firewall/QoS manager
- Windows virtual adapter service
- Android `VpnService` client

### Realtime Networking Rules

Realtime container workloads should avoid Docker bridge/NAT hot paths.

Preferred deployment modes for latency-sensitive workloads:

- host networking
- native process mode
- explicitly approved low-latency container networking

The platform should not route high-rate input/render/video paths through
avoidable bridge/NAT layers when low latency matters.

Privileged containers are discouraged.

If a workload requires privileged container mode, `NET_ADMIN`, `/dev/net/tun`,
host route manipulation, or host firewall access, native mode should be
preferred unless the containerized mode is explicitly approved by platform
policy and security review.

### Node Capability Flags

Node capabilities are host-level facts reported by `rap-node-agent` and used by
the control plane for placement and routing decisions.

Capability flags:

- `can_accept_client_ingress`
- `can_accept_node_ingress`
- `can_route_mesh`
- `can_egress_internet`
- `can_egress_private_network`
- `can_run_rdp_worker`
- `can_run_vnc_worker`
- `can_run_vpn_exit`
- `can_run_vpn_connector`
- `can_run_file_cache`
- `can_run_update_cache`
- `can_run_video_relay`

Capabilities do not automatically grant permission to run a service. They only
describe what the host can support. Service roles are assigned by policy per
cluster and organization.
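
The distinction between capability and permission can be sketched as a placement check that requires both. The flag names come from the list above; the `Node` class, the role-to-flag mapping `REQUIRED_CAPABILITY`, and the shape of the `assignments` set are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    capabilities: set = field(default_factory=set)  # flags reported by the node agent


# Hypothetical mapping from a service role to the capability flag it needs.
REQUIRED_CAPABILITY = {
    "rdp-worker": "can_run_rdp_worker",
    "vpn-exit": "can_run_vpn_exit",
    "update-cache": "can_run_update_cache",
}


def may_run(node: Node, role: str, cluster_id: str, org_id: str, assignments: set) -> bool:
    """A role may run only if the host is capable AND policy assigned the role."""
    capable = REQUIRED_CAPABILITY.get(role) in node.capabilities
    assigned = (node.node_id, role, cluster_id, org_id) in assignments
    return capable and assigned
```

For example, a node that reports `can_run_vpn_exit` but has no matching assignment for the cluster and organization is rejected, and so is a node that was assigned a role its host cannot support.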

### Installation Profiles

Installation profiles describe expected deployment posture. They are templates
for policy and bootstrap, not hardcoded node types.

`local entry node`:

- accepts nearby client ingress
- may host entry-node and lightweight relay workloads
- should use low-latency networking for realtime services

`remote exit node`:

- provides private-network or internet egress according to policy
- commonly hosts `vpn-exit` or `vpn-connector`
- usually requires native networking components

`intermediate relay node`:

- routes node-to-node traffic where direct connectivity is unavailable
- should not own durable session state
- may be containerized when host networking is sufficient

`VPS/dedicated production node`:

- platform-managed production node
- may combine entry, relay, cache, and worker roles when policy allows
- should use signed updates and strict node-agent recovery controls

`customer-managed node`:

- installed in a customer-controlled environment
- scoped to approved organizations and clusters
- must not expose full platform mesh topology to the organization
- should run only explicitly assigned service roles

### Identity and Policy Boundaries

Containers are packaging and isolation boundaries. They are not the node
identity, the update trust root, or the recovery authority.

The native node-agent owns:

- node identity
- certificates
- update trust
- recovery logic
- service inventory
- host capability reporting

Service roles are assigned by policy per cluster and organization. A host may
be technically capable of running a workload, but the control plane must still
authorize the role assignment for the relevant cluster and organization.

## 3.2 Fabric Peer Discovery and Routing Model

This section defines the target peer discovery, routing, local state, and
configuration distribution model for the lower Secure Access Fabric foundation.

This is a target architecture clarification only. It does not implement mesh
runtime traffic, VPN/IP tunnel runtime, relay packet routing, QUIC/WebRTC, RDP
data-plane changes, or any change to the proven direct worker WSS/backend
gateway fallback paths.

The model extends the node-agent, cluster, service workload, and service adapter
boundaries already defined in this document and in
`CLUSTER_NODE_ADMIN_FOUNDATION.md`.

### Non-Full-Mesh Principle

Nodes must not maintain connections to all nodes.

The fabric is an adaptive overlay, not a mandatory full mesh. Full mesh
connectivity does not scale across many customer sites, regions, mobile
clients, NAT environments, and small edge nodes.

Each node maintains three classes of peer knowledge:

- active peers
- warm candidate peers
- cold / bootstrap peers

Active peers:
Currently connected peers used for health, routing, relay, or service traffic
according to role and policy.

Warm candidate peers:
Known good peers that are not currently active but can be promoted quickly
when active peers fail or a better route is needed.

Cold / bootstrap peers:
Stable discovery peers or last-resort peers used when normal peer refresh or
warm candidates are unavailable.

Recommended active peer counts:

- normal node: 3-5 active peers
- relay / core node: 8-20 active peers
- thin / mobile node: 1-3 active peers

These are target defaults, not hardcoded limits. Policy, cluster size, node
role, region, load, NAT constraints, and observed reliability may adjust them.

### Local Peer Cache Model

`rap-node-agent` maintains a local peer cache for the node's cluster scope.

Peer cache fields:

- `node_id`
- `cluster_id`
- endpoint candidates
- roles / capabilities
- region / location hints
- `last_success_at`
- `last_latency_ms`
- `packet_loss`
- `reliability_score`
- trust / certificate fingerprint
- policy scope
- `last_seen_config_version`

Storage rules:

- stored in local node storage
- scoped to the node's cluster and role
- not stored in Redis
- not treated as PostgreSQL source of truth
- represented as a signed cluster-scoped snapshot plus runtime cache
- not fetched from PostgreSQL on every routing decision

The local cache is used for fast reconnect, route selection, and degraded-mode
operation. It must remain scoped to the node's cluster membership and assigned
roles. A node must not cache full platform topology unless its role explicitly
requires it.
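
One possible in-memory shape for the peer cache is sketched below. The backticked field names follow the list above; the remaining field names (`endpoint_candidates`, `region_hint`, `cert_fingerprint`, `policy_scope`), the `PeerCache` class, and the least-reliable-first eviction rule are assumptions, and the sketch expects `max_entries` to be at least 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PeerCacheEntry:
    node_id: str
    cluster_id: str
    endpoint_candidates: list
    roles: set
    region_hint: str
    cert_fingerprint: str
    policy_scope: str
    last_success_at: Optional[float] = None   # epoch seconds
    last_latency_ms: Optional[float] = None
    packet_loss: float = 0.0
    reliability_score: float = 0.0
    last_seen_config_version: int = 0


class PeerCache:
    """Bounded, cluster-scoped cache; it is runtime state, not a source of truth."""

    def __init__(self, cluster_id: str, max_entries: int) -> None:
        self.cluster_id = cluster_id
        self.max_entries = max_entries
        self._entries = {}

    def upsert(self, entry: PeerCacheEntry) -> bool:
        if entry.cluster_id != self.cluster_id:
            return False  # out-of-scope peers are never cached
        if entry.node_id not in self._entries and len(self._entries) >= self.max_entries:
            worst = min(self._entries.values(), key=lambda e: e.reliability_score)
            del self._entries[worst.node_id]  # assumed eviction rule: least reliable first
        self._entries[entry.node_id] = entry
        return True

    def get(self, node_id: str) -> Optional[PeerCacheEntry]:
        return self._entries.get(node_id)

    def __len__(self) -> int:
        return len(self._entries)
```

Rejecting entries from another cluster at insert time keeps the scoping rule enforced in one place, and the size bound keeps thin nodes from accumulating topology they do not need.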

### Peer Selection Algorithm

Peer selection must be score-based, not latency-only.

Latency is important, but latency alone can choose overloaded, unreliable, or
policy-invalid peers. The Fabric Routing Engine should calculate a score from
multiple inputs.

Score inputs:

- latency
- packet loss
- reliability
- region distance
- node load
- bandwidth availability
- role suitability
- policy constraints
- trust level
- recent failure history

Hard constraints are evaluated before scoring:

- cluster membership
- node identity trust
- certificate validity
- organization / service scope
- role assignment
- allowed ingress/egress/service placement
- partition / authority state

Soft scoring then ranks valid candidates. A lower-latency peer may lose to a
more reliable or policy-suitable peer when the service class requires it.
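
The two-stage selection described above (hard constraints first, then soft scoring) can be sketched as follows. The weights, the latency normalization, and the reduced set of inputs are illustrative assumptions, not values defined by this architecture; `Candidate`, `WEIGHTS`, and `rank_peers` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    node_id: str
    same_cluster: bool
    cert_valid: bool
    role_allowed: bool
    latency_ms: float
    packet_loss: float   # 0.0 to 1.0
    reliability: float   # 0.0 to 1.0
    load: float          # 0.0 to 1.0

# Assumed weights; a real deployment would take these from policy per service class.
WEIGHTS = {"latency": 0.3, "loss": 0.2, "reliability": 0.4, "load": 0.1}


def passes_hard_constraints(c: Candidate) -> bool:
    """Hard constraints are pass/fail and are never traded against score."""
    return c.same_cluster and c.cert_valid and c.role_allowed


def score(c: Candidate) -> float:
    """Higher is better; latency is mapped into (0, 1] so it cannot dominate."""
    latency_term = 1.0 / (1.0 + c.latency_ms / 100.0)
    return (
        WEIGHTS["latency"] * latency_term
        + WEIGHTS["loss"] * (1.0 - c.packet_loss)
        + WEIGHTS["reliability"] * c.reliability
        + WEIGHTS["load"] * (1.0 - c.load)
    )


def rank_peers(candidates: list) -> list:
    valid = [c for c in candidates if passes_hard_constraints(c)]
    return sorted(valid, key=score, reverse=True)
```

With these assumed weights, a 60 ms peer with high reliability and low load outranks a 10 ms peer that is lossy and overloaded, and a peer with an invalid certificate is excluded no matter how fast it is.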

### Refresh and Recovery Model

Peer data is refreshed on multiple cycles.

Recommended refresh cycles:

- active peer heartbeat: 5-15 seconds
- active / warm latency probes: 30-120 seconds
- peer directory refresh: 5-15 minutes
- signed cluster snapshot refresh: version-triggered or periodic
- full resync only when version gap or signature mismatch requires it

The refresh cadence may vary by node role:

- thin/mobile nodes should probe less aggressively
- relay/core nodes should keep richer peer data
- service/egress nodes should prioritize peers relevant to assigned services

If all active links fail, recovery order is:

1. retry active peers
2. try warm candidates
3. try cold / bootstrap peers
4. use config/storage discovery endpoints
5. operate from the last signed snapshot only if policy allows degraded operation

The last known snapshot is a degraded-mode aid, not a durable authority. It may
help reconnect to the control plane or cluster peers, but it must not authorize
new node enrollment, role assignment, policy mutation, or trust-root changes.
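
The recovery order above can be expressed as a simple ordered walk. This is a sketch under assumptions: `recover_connectivity`, the class keys in `RECOVERY_ORDER`, and the `try_connect` callback are hypothetical, and real recovery would add timeouts, backoff, and trust validation.

```python
from typing import Callable, Optional, Tuple

# Steps 1-4 of the recovery order; step 5 (degraded mode) is handled separately.
RECOVERY_ORDER = ("active", "warm", "cold_bootstrap", "discovery")


def recover_connectivity(
    peers_by_class: dict,
    try_connect: Callable[[str], bool],
    degraded_allowed: bool,
) -> Tuple[str, Optional[str], Optional[str]]:
    """Walk peer classes in order and stop at the first endpoint that connects."""
    for peer_class in RECOVERY_ORDER:
        for endpoint in peers_by_class.get(peer_class, ()):
            if try_connect(endpoint):
                return ("connected", peer_class, endpoint)
    if degraded_allowed:
        # The snapshot only supports degraded operation; it grants no new authority.
        return ("degraded", "last_signed_snapshot", None)
    return ("offline", None, None)
```

A caller would pass the node's cached peers grouped by class and a connection probe, for example `recover_connectivity(peers, probe, degraded_allowed=False)`, and treat the `"degraded"` result as read-only operation.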

### Fabric Routing Engine

The Fabric Routing Engine is a logical layer. This section does not require a
specific implementation package, process, or runtime in the current codebase.

Responsibilities:

- peer selection
- route discovery
- route scoring
- path failover
- shortcut connection decision
- QoS-aware routing
- route cache management
- topology hiding
- channel-aware routing

The Fabric Routing Engine belongs to the fabric layer. It may be implemented
across control-plane services, node-agent runtime, entry-node, relay-node, and
egress/service-node components over time.

This is a logical architecture concept only. It is not permission to implement
runtime mesh routing now.

Routing rules:

- routing decisions must not depend on live backend availability
- routing decisions use local node state, signed scoped snapshots, peer cache,
  and route cache
- routing must respect policy, organization scope, cluster boundaries, and
  partition/authority state
- routing must remain independent from RDP, VNC, SSH, VPN, video, file, or any
  other service protocol internals
- routing state must remain reconstructable from authoritative control-plane
  state and signed snapshots

Service adapters must not implement routing logic.

An RDP Adapter, VNC Adapter, SSH Adapter, VPN connector, or video adapter may
request connectivity to a destination node, target resource zone, or egress
pool. It must not decide mesh topology, multi-hop route selection, shortcut
creation, partition recovery, or cross-cluster routing policy.

Service Adapters must not:

- select routes
- discover peers
- manage mesh connections
- implement mesh failover logic
- implement shortcut logic
- implement partition recovery
- implement cross-cluster routing policy

### Service Routing Model

Service Adapters must not own mesh routing.

RDP, VNC, SSH, VPN, video, and file services request one of:

- destination node
- resource target
- egress node
- egress pool

The Fabric Routing Engine chooses the path.

Managed protocol mode example:

```text
Access Client
  -> Ingress Node
  -> Fabric route
  -> Egress / Service Node
  -> Service Adapter
  -> Target Resource
```

Network/IP tunnel mode example:

```text
client virtual adapter
  -> entry node
  -> mesh peers
  -> egress IPv4 router / VPN exit
  -> private network
```

The Service Adapter sees the platform-facing session/channel contract. The
adapter does not need to know whether the fabric selected a direct path,
multi-hop relay path, shortcut path, or fallback path.
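
The boundary between adapter and routing engine can be sketched as a narrow request/handle contract. All names here (`ConnectivityRequest`, `ChannelHandle`, `DemoFabricRouter`, `rdp_adapter_connect`) are hypothetical and exist only to illustrate the separation; the router is a stand-in that looks up a fixed path instead of scoring routes.

```python
from dataclasses import dataclass

# The four request kinds a service may ask for, as listed in this section.
REQUEST_KINDS = {"destination_node", "resource_target", "egress_node", "egress_pool"}


@dataclass(frozen=True)
class ConnectivityRequest:
    kind: str
    target: str
    channel: str


@dataclass(frozen=True)
class ChannelHandle:
    """Opaque handle returned to the adapter; it carries no path or hop data."""
    handle_id: str
    channel: str


class DemoFabricRouter:
    """Stand-in for the Fabric Routing Engine; path choice stays private to it."""

    def __init__(self, paths: dict) -> None:
        self._paths = paths   # target -> list of hops, known only to the router
        self._open = {}

    def open(self, request: ConnectivityRequest) -> ChannelHandle:
        if request.kind not in REQUEST_KINDS:
            raise ValueError(f"unsupported request kind: {request.kind}")
        hops = self._paths[request.target]
        handle = ChannelHandle(f"{request.target}:{request.channel}", request.channel)
        self._open[handle.handle_id] = hops
        return handle


def rdp_adapter_connect(router: DemoFabricRouter, resource_id: str) -> ChannelHandle:
    """The adapter names what it needs, never how to get there."""
    return router.open(ConnectivityRequest("resource_target", resource_id, "input"))
```

Because `ChannelHandle` has no hop or route fields, an adapter written against it cannot come to depend on whether the fabric chose a direct, multi-hop, shortcut, or fallback path.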

### Multi-Hop and Shortcut Connections

Multi-hop routing is allowed.

The platform must support nodes that cannot directly connect because of NAT,
firewalls, region layout, customer network policy, or temporary degradation.

The fabric may create direct shortcut connections when all of the following are
true:

- a long-lived flow exists
- route latency is high
- direct connection is possible
- latency improvement is expected
- policy allows the shortcut
- policy allows direct peer connection
- trust validation succeeds
- shortcut improves latency, jitter, or bandwidth

Shortcut connections must be:

- optional
- reversible
- fallback-safe
- not visible to organizations as topology

A shortcut must not become the only valid path unless policy explicitly allows
that risk. If the shortcut fails, traffic should return to the prior known-good
route, previous multi-hop route, or another scored route.

Shortcut creation must not bypass:

- node identity validation
- cluster membership
- service policy
- organization isolation
- channel permissions
- audit requirements
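
The "all of the following are true" rule for shortcut creation can be sketched as a single predicate. The thresholds below are placeholder assumptions, the `ShortcutContext` fields are hypothetical, and the sketch models only latency gain even though the text also allows jitter or bandwidth improvement.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration; real values would come from policy.
MIN_FLOW_AGE_S = 60.0
HIGH_LATENCY_MS = 80.0
MIN_EXPECTED_GAIN_MS = 20.0


@dataclass(frozen=True)
class ShortcutContext:
    flow_age_s: float
    route_latency_ms: float
    direct_reachable: bool
    expected_direct_latency_ms: float
    policy_allows_shortcut: bool
    policy_allows_direct_peer: bool
    trust_validated: bool


def should_create_shortcut(ctx: ShortcutContext) -> bool:
    """Every condition must hold; a single failing check keeps the existing route."""
    gain_ms = ctx.route_latency_ms - ctx.expected_direct_latency_ms
    return all((
        ctx.flow_age_s >= MIN_FLOW_AGE_S,        # long-lived flow
        ctx.route_latency_ms >= HIGH_LATENCY_MS,  # current route is slow
        ctx.direct_reachable,
        gain_ms >= MIN_EXPECTED_GAIN_MS,          # improvement is expected
        ctx.policy_allows_shortcut,
        ctx.policy_allows_direct_peer,
        ctx.trust_validated,
    ))
```

The predicate only decides whether a shortcut may be attempted. Keeping the prior multi-hop route available as a fallback is a separate responsibility of the routing layer.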

### Channel-Aware Routing

Routing decisions depend on channel type.

Channel routing priorities:

- `input` / `control`: lowest latency and highest reliability
- `render` / `video`: bandwidth, jitter, and latest-frame behavior
- `file_transfer`: throughput, reliability, and fairness
- `clipboard` / bounded control: reliable bounded path
- `telemetry`: low priority and droppable/sampled
- VPN packets: adaptive QoS and bulk protection

The routing layer must preserve the channel separation already defined by the
service adapter protocol.

Input/control should not wait behind render/video/file transfer. Render/video
may degrade before input latency is allowed to rise. File transfer must not
starve interactive sessions. VPN bulk traffic must not starve interactive RDP,
VNC, SSH, or control traffic. Telemetry should be sampled or dropped under
pressure.
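
A minimal sketch of channel-aware scheduling is shown below. The numeric priority order, the `vpn` channel key, the `DROPPABLE` set, and the `ChannelScheduler` class are assumptions for illustration. Strict priority is a simplification: it keeps input ahead of bulk traffic but does not implement the fairness that file transfer and VPN traffic would need in practice.

```python
import heapq
from itertools import count

# Assumed ordering: lower number is served first.
PRIORITY = {
    "input": 0, "control": 0,
    "clipboard": 1,
    "render": 2, "video": 2,
    "vpn": 3, "file_transfer": 3,
    "telemetry": 4,
}
DROPPABLE = {"render", "video", "telemetry"}


class ChannelScheduler:
    def __init__(self, capacity: int) -> None:
        self._heap = []
        self._seq = count()   # tie-breaker that preserves FIFO order within a priority
        self._capacity = capacity

    def enqueue(self, channel: str, payload: bytes) -> bool:
        if len(self._heap) >= self._capacity and not self._make_room(channel):
            return False
        heapq.heappush(self._heap, (PRIORITY[channel], next(self._seq), channel, payload))
        return True

    def _make_room(self, incoming: str) -> bool:
        """Under pressure, shed droppable traffic first; never evict reliable items."""
        if incoming in DROPPABLE:
            return False  # the incoming droppable item itself is shed
        victims = [item for item in self._heap if item[2] in DROPPABLE]
        if not victims:
            return False  # only reliable traffic queued: signal backpressure
        self._heap.remove(max(victims))  # lowest-priority, most recent droppable item
        heapq.heapify(self._heap)
        return True

    def dequeue(self):
        if not self._heap:
            return None
        _, _, channel, payload = heapq.heappop(self._heap)
        return channel, payload
```

When the queue is full, an incoming input message displaces queued telemetry or frames, while an incoming telemetry sample is simply refused.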

### Node Local State Model

`rap-node-agent` stores local state required to operate and recover safely.

Local node state includes:

- node identity
- cluster membership
- signed scoped cluster snapshot
- peer cache
- route cache
- service assignment cache
- local health state
- partition / degraded state
- last applied config version
- pending update metadata

Local node state must be bounded and scoped.

A node must not store full cluster topology unless required by its assigned
role. For example:

- thin/mobile node: minimal entry/bootstrap peers
- service/egress node: peers relevant to assigned service routes
- relay/core node: broader route and peer cache
- platform owner diagnostic node: explicitly authorized wider visibility

Local state must not become a shadow source of truth. It supports runtime
operation, reconnect, and degraded behavior only.

### Data Ownership Model

Data ownership remains split by responsibility.

PostgreSQL:
Durable source of truth for domain state: platform, clusters, organizations,
users, policies, resources, role assignments, nodes, cluster membership, trust
state, service assignments, route intent, audit, and authoritative session
state.

Config/storage services:
Distribution layer for scoped snapshots, peer directories, local/nearby cache,
update metadata, policy snapshots, and node-scoped configuration. These
services may replicate by cluster, organization, and service scope, but they
must not become an uncontrolled second source of truth.

Node-agent:
Local signed snapshot, local cache, in-memory routing/runtime state, and runtime
executor. It stores scoped snapshots and runtime cache, executes assigned roles,
supervises workloads, and reports observations.

Redis:
Live coordination only. It may hold leases, heartbeats, routing hints,
short-lived tokens, ephemeral cache, attach tokens, and short-lived
coordination data. It is not a durable source of truth for peer topology,
cluster policy, or organization state.

### Scoped Cluster Configuration Distribution

Cluster configuration distribution is need-to-know.

A node receives only the configuration required for its cluster membership,
assigned role, service workload, and organization scope.

Core mesh node receives:

- neighbor / peer data
- route policy
- QoS policy
- cluster version
- no RDP credentials
- no full organization user list

Ingress node receives:

- allowed client entry policies
- token validation config
- route entry data
- no full internal topology

Egress / service node receives:

- assigned service configs
- needed resource refs
- needed connector refs
- policy for its assigned services
- secrets only through approved resolver and only at runtime

Storage service receives:

- assigned shard / scope data
- replication metadata
- no unrelated organization data

Signed snapshots must be scoped, versioned, and verifiable. A stale snapshot may
support degraded operation only when policy explicitly allows it. It must not
authorize risky mutations, trust-root changes, node approvals, role assignments,
or partition promotion.
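
Need-to-know distribution can be sketched as an allow-list filter applied per role before a snapshot is signed. The role keys, configuration key names, and `build_scoped_snapshot` function are hypothetical, and signing is deliberately left out of this sketch.

```python
# Assumed allow-lists derived from the per-role lists in this section.
ALLOWED_KEYS_BY_ROLE = {
    "core-mesh": {"peers", "route_policy", "qos_policy"},
    "ingress": {"client_entry_policies", "token_validation", "route_entries"},
    "egress-service": {"service_configs", "resource_refs", "connector_refs", "service_policy"},
    "storage": {"shard_data", "replication_metadata"},
}


def build_scoped_snapshot(cluster_config: dict, role: str, version: int) -> dict:
    """Copy only allow-listed keys; anything not listed is withheld by default."""
    allowed = ALLOWED_KEYS_BY_ROLE[role]  # unknown roles raise KeyError: deny by default
    data = {key: value for key, value in cluster_config.items() if key in allowed}
    return {"role": role, "version": version, "data": data}
```

Using an allow-list instead of a deny-list means a newly added configuration key, such as a credential reference, is not distributed to any role until someone deliberately lists it.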

### Fabric Storage Service / Config Storage Service

The Fabric Storage Service, also called Config Storage Service where the stored
data is configuration-focused, is a target fabric component.

Purpose:

- store and replicate scoped cluster configuration
- distribute signed snapshots
- keep frequently used data near services
- support local high-speed reads
- preserve configuration availability when some nodes disappear

Rules:

- not every node stores all data
- replication factor should be policy-driven
- critical data should be replicated across failure domains
- hot data should be placed near services that use it
- storage service must respect organization and cluster isolation
- PostgreSQL remains the authoritative source of truth
- storage role does not imply web/admin ingress role
- adding a storage node to a new cluster does not move the cluster panel
  automatically
- cluster-local admin endpoints must be explicitly placed on admin/web ingress
  nodes after the cluster has healthy authority, scoped snapshots, certificates,
  and sufficient role coverage

Data classes:

- platform global data
- cluster state
- organization data
- service config data
- realtime / lease state
- audit / event data
- artifacts / update data

Realtime and lease state may be cached for speed but must remain reconcilable
with the authoritative ownership model. Audit/event data may be buffered or
replicated, but final retention and compliance rules are control-plane policy.
|
|
|
|
### Organization Isolation
|
|
|
|
Organizations must not see mesh topology.
|
|
|
|
Routing is internal platform behavior. Organization-facing APIs and admin UI
|
|
may expose only policy-approved service endpoints and safe status.
|
|
|
|
Organizations may see:
|
|
|
|
- allowed ingress endpoints
|
|
- allowed egress/service endpoints
|
|
- resource availability
|
|
- service health summarized for their scope
|
|
- connector/VPN status when policy allows
|
|
|
|
Organizations must not see:
|
|
|
|
- intermediate core mesh nodes
|
|
- full peer graph
|
|
- internal route scores
|
|
- other organizations' nodes or routes
|
|
- other organizations' storage shards
|
|
- other organizations' peer caches
|
|
- platform trust/certificate internals
|
|
- bootstrap peer lists
|
|
|
|
### Multi-Cluster Awareness
|
|
|
|
Clusters may exist independently.
|
|
|
|
A platform admin may manage multiple clusters, but clusters do not form one
|
|
single mesh by default.
|
|
|
|
Rules:
|
|
|
|
- clusters do not form a single mesh by default
|
|
- cross-cluster routing requires explicit trust
|
|
- each node-agent cluster membership is bound to exactly one cluster
|
|
- control plane may aggregate cluster visibility for platform owner
|
|
- a node may participate in multiple clusters only through isolated memberships
|
|
- cluster-scoped identities, certificates, tokens, storage namespaces, and
|
|
policies are required
|
|
|
|
Cross-cluster routing must require an explicit trust and policy model before it
|
|
is allowed. It must not emerge accidentally from shared platform ownership,
|
|
shared Redis, shared admin UI, or shared service adapter code.
|
|
|
|
Multi-cluster admin visibility is a control-plane aggregation feature. It does
|
|
not imply shared data plane, shared route authority, shared node identity, or
|
|
shared organization visibility.
|
|
|
|
Platform owner/admin roles may manage all clusters from one console according
|
|
to platform policy and audit. Organization admins see only authorized clusters,
|
|
resources, service endpoints, and safe status.
|
|
|
|
## 3.3 Cluster Configuration and Storage Model
|
|
|
|
Cluster configuration must not be fully stored on every node.
|
|
|
|
The Fabric Core uses scoped configuration distribution. A node receives only
|
|
the data required for its cluster membership, assigned role, service workload,
|
|
and organization scope. Per-role configuration subsets are mandatory for
|
|
production scale and tenant isolation.
|
|
|
|
Configuration layers:
|
|
|
|
- global platform config
|
|
- cluster config
|
|
- organization config
|
|
- service config
|
|
|
|
Distribution model:
|
|
|
|
- signed snapshots
|
|
- incremental updates
|
|
- version-based sync
|
|
- local cache in `rap-node-agent`
|
|
|
|
Sync rules:
|
|
|
|
- snapshots are scoped by cluster, organization, role, and service where needed
|
|
- incremental updates apply only after signature, version, and scope checks
|
|
- version gaps may trigger full resync
|
|
- signature mismatch must reject the update and require recovery/resync
|
|
- stale snapshots may support degraded operation only when policy allows it
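
The sync rules above imply a fixed order of checks before an incremental update touches the local cache. The sketch below illustrates one possible ordering. It uses an HMAC from the Python standard library as a stand-in for whatever signature scheme the platform adopts, and the update layout, the scope string format, and the result strings are all assumptions made for illustration.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass


@dataclass
class LocalCache:
    scope: str      # hypothetical scope label, e.g. "cluster-a/org-1/ingress"
    version: int
    data: dict


def sign(key: bytes, scope: str, version: int, changes: dict) -> str:
    """HMAC stand-in for the real snapshot signature (assumption)."""
    body = json.dumps([scope, version, changes], sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def apply_update(cache: LocalCache, key: bytes, update: dict) -> str:
    expected = sign(key, update["scope"], update["version"], update["changes"])
    if not hmac.compare_digest(expected, update["signature"]):
        return "rejected_resync"      # signature mismatch: reject and resync
    if update["scope"] != cache.scope:
        return "rejected_scope"       # never apply data from another scope
    if update["version"] <= cache.version:
        return "ignored_old"          # replayed or duplicate update
    if update["version"] != cache.version + 1:
        return "full_resync"          # version gap: request a full snapshot
    cache.data.update(update["changes"])
    cache.version = update["version"]
    return "applied"
```

The cache is mutated only on the final branch, so a rejected or gapped update leaves the last verified state intact for degraded operation.
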
|
|
|
|
The Fabric Storage Service is a logical fabric component.
|
|
|
|
Responsibilities:
|
|
|
|
- distribute configuration
|
|
- cache hot data near services
|
|
- replicate critical data
|
|
- provide local read access
|
|
|
|
Rules:
|
|
|
|
- PostgreSQL = source of truth
|
|
- Storage service = distribution/cache layer
|
|
- Node = local cache only
|
|
- Storage service must not become an uncontrolled second source of truth
|
|
- Storage service must respect cluster and organization isolation
|
|
- Storage service must not become a general-purpose distributed database
|
|
- Storage service must not accept direct writes from nodes as authoritative
|
|
state
|
|
- Storage service must not expose arbitrary query capabilities to nodes or
|
|
services
|
|
- Storage service must not store full cluster or organization data on every
|
|
node
|
|
|
|
Data placement rules:
|
|
|
|
- hot data should be placed near service nodes
|
|
- cold data may remain remote
|
|
- replication is policy-based
|
|
- critical data should replicate across failure domains
|
|
- not all nodes store all data
|
|
- no node receives unrelated organization data
|
|
|
|
## 3.4 Node Local State Model
|
|
|
|
`rap-node-agent` local storage contains only bounded runtime and recovery state.
|
|
|
|
Node-agent local storage contains:
|
|
|
|
- peer cache
|
|
- route cache
|
|
- scoped cluster snapshot
|
|
- service assignment cache
|
|
- last known config version
|
|
- health state
|
|
|
|
Expanded local state may also include:
|
|
|
|
- node identity material and certificate references
|
|
- cluster membership state
|
|
- local partition / degraded state
|
|
- pending update metadata
|
|
- last successful storage/config sync metadata
|
|
|
|
A node must not:
|
|
|
|
- store full cluster topology
|
|
- store full organization data
|
|
- store unrelated peer caches
|
|
- store unrelated storage shards
|
|
- store resource secrets outside approved runtime resolver paths
|
|
- depend on live backend for every routing decision
|
|
|
|
Node-local state is a runtime accelerator and degraded-mode aid. It is not a
|
|
source of truth and must be reconstructable from authoritative control-plane
|
|
state and signed scoped snapshots.
|
|
|
|
## 3.5 Multi-Cluster Model
|
|
|
|
The platform may operate multiple clusters.
|
|
|
|
Clusters are isolated by default:
|
|
|
|
- clusters do not automatically trust each other
|
|
- clusters do not form one shared mesh by default
|
|
- cross-cluster routing requires explicit trust and policy
|
|
- cluster-scoped identities, certificates, tokens, storage namespaces, and
|
|
policies are required
|
|
|
|
Platform owner:
|
|
|
|
- sees all clusters according to platform policy
|
|
- manages trust relationships
|
|
- manages cross-cluster visibility and recovery policy
|
|
- audits cross-cluster actions
|
|
|
|
Organization:
|
|
|
|
- sees only assigned clusters and resources
|
|
- sees only policy-approved service endpoints and safe status
|
|
- does not see intermediate mesh topology
|
|
- does not see other organizations' routes, storage shards, peer caches, or
|
|
node internals
|
|
|
|
Node:
|
|
|
|
- may belong to multiple clusters only through isolated memberships
|
|
- must isolate identities per cluster
|
|
- must isolate certificates per cluster
|
|
- must isolate tokens and local storage namespaces per cluster
|
|
- must not leak route, peer, service, or organization state across clusters
|
|
|
|
Cross-cluster routing must never emerge accidentally from shared platform
|
|
ownership, shared Redis, shared admin UI, shared service adapter code, or
|
|
shared node host placement.
|
|
|
|
## 3.6 Peer Discovery and Recovery
|
|
|
|
Peer discovery is bounded and layered.
|
|
|
|
A node keeps:
|
|
|
|
- active peers
|
|
- warm peers
|
|
- bootstrap peers
|
|
|
|
Bootstrap sources:
|
|
|
|
- local cache
|
|
- signed config snapshot
|
|
- storage service
|
|
- admin-defined seed nodes
|
|
|
|
Recovery flow:
|
|
|
|
1. active peers
|
|
2. warm peers
|
|
3. bootstrap peers
|
|
4. storage/config service
|
|
5. last snapshot
|
|
|
|
Recovery rules:
|
|
|
|
- active peers are retried first to avoid unnecessary topology churn
|
|
- warm peers are promoted when active peers fail or route score changes
|
|
- bootstrap peers are used for reconnect, not as permanent full topology
|
|
- storage/config service may provide a fresh scoped peer directory
|
|
- last snapshot is used only when policy allows degraded operation
|
|
- degraded operation must not authorize new trust, node approval, role
|
|
assignment, policy mutation, partition promotion, or cross-cluster trust
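
The recovery flow above is an ordered fallback through tiers, with the last snapshot gated by policy. A minimal sketch follows. It assumes each tier can be represented as a list of candidate peer addresses and that a caller-supplied `try_connect` function reports reachability; both are simplifications introduced here, as are the tier names.

```python
from typing import Callable, Iterable, Optional, Tuple

# Tier order taken from the recovery flow above; the names are hypothetical.
RECOVERY_ORDER = ["active", "warm", "bootstrap", "storage_service", "last_snapshot"]


def recover(
    sources: dict,
    try_connect: Callable[[str], bool],
    degraded_allowed: bool,
) -> Optional[Tuple[str, str]]:
    """Return (tier, peer) for the first reachable peer, or None."""
    for tier in RECOVERY_ORDER:
        if tier == "last_snapshot" and not degraded_allowed:
            continue  # policy does not permit degraded operation
        candidates: Iterable[str] = sources.get(tier, ())
        for peer in candidates:
            if try_connect(peer):
                return tier, peer
    return None


reachable = {"warm-1"}
sources = {"active": ["act-1", "act-2"], "warm": ["warm-1"], "bootstrap": ["seed-1"]}
print(recover(sources, lambda peer: peer in reachable, degraded_allowed=False))
```

Returning the tier alongside the peer lets the caller report which recovery stage succeeded, which is useful when the node later reports degraded or recovery state.
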
|
|
|
|
## 3.7 Routing Ownership Clarification
|
|
|
|
Routing belongs only to Fabric.
|
|
|
|
Service adapters must not implement routing.
|
|
|
|
Routing decisions must be:
|
|
|
|
- policy-aware
|
|
- QoS-aware
|
|
- channel-aware
|
|
- tenant-aware
|
|
- cluster-aware
|
|
- health-aware
|
|
|
|
RDP/VNC/SSH/VPN must not implement mesh routing logic.
|
|
|
|
RDP, VNC, SSH, VPN, video, file, and future adapters may request:
|
|
|
|
- destination node
|
|
- resource target
|
|
- egress node
|
|
- egress pool
|
|
|
|
The Fabric Routing Engine chooses the path, route score, failover behavior,
|
|
shortcut policy, and channel priority treatment. Service adapters translate
|
|
external protocols into the platform service adapter protocol; they do not own
|
|
mesh topology, route discovery, route cache, multi-hop path selection, shortcut
|
|
creation, or cross-cluster route policy.
|
|
|
|
## 3.8 Shortcut Connection Rule
|
|
|
|
Fabric may create a direct connection between nodes if all required conditions
|
|
are true:
|
|
|
|
- long-lived session detected
|
|
- latency improvement possible
|
|
- direct connection is possible
|
|
- trust allows the connection
|
|
- policy allows the connection
|
|
- shortcut improves latency, jitter, or bandwidth
|
|
|
|
A shortcut must be:
|
|
|
|
- reversible
|
|
- optional
|
|
- fallback-safe
|
|
- invisible to organizations as topology
|
|
- audited where policy requires it
|
|
|
|
If a shortcut fails:
|
|
|
|
- fall back to the previous multi-hop route
|
|
- preserve channel priority
|
|
- preserve tenant isolation
|
|
- preserve service adapter boundaries
|
|
- do not expose topology details to organizations
|
|
|
|
Shortcut connections must not bypass node identity validation, cluster
|
|
membership, organization policy, channel permissions, route authorization,
|
|
audit requirements, or split-brain safeguards.
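
The shortcut rule can be sketched as a conjunction of conditions plus a fallback path choice. The example below is illustrative only. The thresholds, field names, and the decision to measure improvement by latency alone (the rule also mentions jitter and bandwidth) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_SESSION_AGE_S = 60.0   # hypothetical threshold for "long-lived"
MIN_GAIN_MS = 5.0          # hypothetical minimum latency improvement


@dataclass(frozen=True)
class ShortcutCandidate:
    session_age_s: float
    direct_reachable: bool
    trust_allows: bool
    policy_allows: bool
    current_latency_ms: float
    direct_latency_ms: float


def should_create_shortcut(c: ShortcutCandidate) -> bool:
    """Every condition must hold; any single failure vetoes the shortcut."""
    return (
        c.session_age_s >= MIN_SESSION_AGE_S
        and c.direct_reachable
        and c.trust_allows
        and c.policy_allows
        and c.current_latency_ms - c.direct_latency_ms >= MIN_GAIN_MS
    )


def active_path(
    multi_hop: List[str], shortcut: Optional[List[str]], shortcut_healthy: bool
) -> List[str]:
    """Use the shortcut only while it is healthy; otherwise fall back."""
    if shortcut is not None and shortcut_healthy:
        return shortcut
    return multi_hop
```

Keeping the multi-hop route as the default return value makes the shortcut reversible and fallback-safe: the original route is never discarded when a shortcut is created.
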
|
|
|
|
## 3.9 Topology Clarification
|
|
|
|
The platform topology must distinguish mesh infrastructure roles from
|
|
protocol/service edge roles. The secure mesh should remain protocol-neutral,
|
|
while service adapters live at explicitly authorized edge/service nodes.
|
|
|
|
### Core / Internal Mesh Nodes
|
|
|
|
Core mesh nodes maintain the secure overlay network.
|
|
|
|
Responsibilities:
|
|
|
|
- participate in cluster integrity
|
|
- route encrypted logical channels
|
|
- maintain health and route state
|
|
- cache necessary cluster and network configuration
|
|
- enforce QoS and routing policy
|
|
- forward traffic without understanding protocol payloads
|
|
|
|
Core mesh nodes do not need to understand RDP, VNC, SSH, HTTP, video, or audio
|
|
protocols. They also do not need to host service adapters.
|
|
|
|
Intermediate mesh nodes should classify traffic by platform channel metadata,
|
|
priority, and policy, not by parsing service protocol payloads.
|
|
|
|
### Ingress Edge Nodes
|
|
|
|
Ingress edge nodes accept external client connections and map them into the
|
|
authorized platform data plane.
|
|
|
|
Responsibilities:
|
|
|
|
- accept the platform desktop client protocol
|
|
- accept future IP tunnel / VPN-like traffic
|
|
- authenticate and authorize entry into the mesh
|
|
- validate short-lived client/data-plane tokens
|
|
- map client traffic into logical channels
|
|
- apply ingress QoS and backpressure
|
|
- expose only the ingress services allowed by policy
|
|
|
|
Ingress nodes may be close to users, deployed in platform regions, or installed
|
|
in customer environments when policy allows it.
|
|
|
|
### Egress Edge / Service Nodes
|
|
|
|
Egress edge and service nodes connect from the mesh to target resources.
|
|
|
|
They host service adapters such as:
|
|
|
|
- RDP
|
|
- VNC
|
|
- SSH
|
|
- HTTP / internal apps
|
|
- IPv4 router / internet exit
|
|
- video / audio adapter
|
|
|
|
Responsibilities:
|
|
|
|
- connect to target resources
|
|
- enforce protocol-specific policy at the edge
|
|
- translate platform channels to protocol-specific runtime behavior
|
|
- report service health and session/runtime state
|
|
- avoid exposing private resource details beyond authorized scopes
|
|
|
|
### Service Adapters Are Edge Functions
|
|
|
|
Service adapters belong at egress/service edge nodes.
|
|
|
|
Examples:
|
|
|
|
- RDP adapter runs near the RDP target or approved egress path.
|
|
- VNC adapter runs near the VNC target or approved egress path.
|
|
- SSH adapter runs near the SSH target or approved egress path.
|
|
- Video/audio adapters run where media capture or media relay is authorized.
|
|
|
|
Intermediate mesh nodes should route and preserve QoS. They should not parse
|
|
RDP, VNC, SSH, video, audio, clipboard, or file-transfer protocol payloads.
|
|
|
|
### Access Modes
|
|
|
|
The platform supports two complementary access modes.
|
|
|
|
#### Managed Protocol Mode
|
|
|
|
Example:
|
|
|
|
```text
|
|
client -> ingress -> mesh -> rdp-worker/service adapter -> RDP server
|
|
```
|
|
|
|
Managed protocol mode is used when the platform must control product-level
|
|
behavior.
|
|
|
|
Use this mode for:
|
|
|
|
- session lifecycle
|
|
- audit
|
|
- clipboard policy
|
|
- file policy
|
|
- takeover
|
|
- detach / reattach semantics
|
|
- browser and mobile access in future clients
|
|
- protocol-specific UX and safety controls
|
|
|
|
In this mode, the edge service adapter understands the target protocol and
|
|
enforces protocol-specific policy. Core mesh nodes remain protocol-neutral.
|
|
|
|
#### Network / IP Tunnel Mode
|
|
|
|
Example:
|
|
|
|
```text
|
|
client virtual adapter -> ingress IP tunnel -> mesh -> egress IPv4 router -> private network
|
|
```
|
|
|
|
Network/IP tunnel mode is used when the priority is transport-level access and
|
|
native client compatibility.
|
|
|
|
Use this mode for:
|
|
|
|
- maximum speed
|
|
- ordinary `mstsc` / native RDP clients
|
|
- native VNC clients
|
|
- native SSH clients
|
|
- browser access to reachable private web apps
|
|
- lower protocol-specific control
|
|
|
|
In this mode, the platform routes IP traffic according to policy. It does not
|
|
necessarily control RDP/VNC/SSH session internals such as clipboard, takeover,
|
|
or managed lifecycle.
|
|
|
|
### Multi-Role Nodes
|
|
|
|
A single node may have multiple roles when policy allows it.
|
|
|
|
Allowed role combinations include:
|
|
|
|
- core mesh only
|
|
- ingress only
|
|
- egress only
|
|
- ingress + core
|
|
- core + egress
|
|
- full edge node
|
|
|
|
Role combinations are deployment choices, not implicit trust expansion. A node
|
|
must only run the roles explicitly assigned to it.
|
|
|
|
### Capabilities Are Not Permissions
|
|
|
|
Node capabilities describe what a host can technically support.
|
|
|
|
Capabilities do not grant permission.
|
|
|
|
For example, a node may be technically capable of:
|
|
|
|
- egress routing
|
|
- RDP worker hosting
|
|
- VPN exit routing
|
|
- file cache hosting
|
|
|
|
The control plane must still explicitly authorize role assignment per cluster
|
|
and organization before that node may perform the role.
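
The distinction between capability and permission can be expressed as two independent checks that must both pass. This is a minimal sketch; the tuple layout for assignments and the function name are hypothetical.

```python
def may_perform_role(
    node_capabilities: set,
    assignments: set,
    role: str,
    cluster_id: str,
    organization_id: str,
) -> bool:
    """A role needs technical capability AND an explicit scoped assignment."""
    capable = role in node_capabilities
    # Assignments are modeled as (role, cluster_id, organization_id) tuples
    # issued by the control plane; this layout is an assumption.
    authorized = (role, cluster_id, organization_id) in assignments
    return capable and authorized


caps = {"egress_routing", "rdp_worker"}
assigned = {("rdp_worker", "cluster-a", "org-1")}
print(may_perform_role(caps, assigned, "rdp_worker", "cluster-a", "org-1"))
print(may_perform_role(caps, assigned, "egress_routing", "cluster-a", "org-1"))
```

The second call returns `False` even though the node supports egress routing, because no assignment exists for that role in that cluster and organization.
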
|
|
|
|
### Service Visibility
|
|
|
|
Organizations see only allowed ingress, egress, and service endpoints.
|
|
|
|
Organizations must not see:
|
|
|
|
- full intermediate core mesh topology
|
|
- unrelated cluster routes
|
|
- other organizations' service placements
|
|
- platform-internal relay paths
|
|
|
|
Different services may expose different ingress and egress points. For example,
|
|
an organization may have one ingress path for managed RDP sessions and another
|
|
for IP tunnel access.
|
|
|
|
### Future Design Direction
|
|
|
|
Future design must preserve this split:
|
|
|
|
- core mesh layer remains protocol-neutral
|
|
- service adapters remain pluggable edge services
|
|
- IP tunnel mode can allow unmodified native RDP, VNC, SSH, and browser clients
|
|
- managed protocol mode remains the controlled and audited product experience
|
|
|
|
The platform should avoid pushing protocol-specific logic into core mesh
|
|
routing. Protocol knowledge belongs at ingress translation points or
|
|
egress/service adapters, depending on the access mode.
|
|
|
|
## 4. Organizations and Tenancy
|
|
|
|
The platform has a platform owner/admin scope and many organization scopes.
|
|
|
|
### Platform Roles
|
|
|
|
`platform_owner`:
|
|
|
|
- owns the platform-wide trust root
|
|
- can manage clusters and partitions
|
|
- can promote, merge, or split partitions
|
|
- can manage platform update policy
|
|
- can delegate platform administration
|
|
|
|
`platform_admin`:
|
|
|
|
- manages platform-level nodes and settings according to delegated permissions
|
|
- cannot bypass audit
|
|
- cannot silently access organization resources unless explicitly authorized by policy
|
|
|
|
### Organization Roles
|
|
|
|
`organization_owner`:
|
|
|
|
- owns an organization
|
|
- can manage organization admins and high-risk organization settings
|
|
- can approve customer-managed node use for that organization
|
|
|
|
`organization_admin`:
|
|
|
|
- manages users, resources, policies, connectors, and enabled services within one organization
|
|
|
|
`organization_user`:
|
|
|
|
- accesses resources allowed by organization policy
|
|
|
|
### Tenancy Rules
|
|
|
|
Organizations see only their own:
|
|
|
|
- users and memberships
|
|
- resources
|
|
- policies
|
|
- active sessions
|
|
- audit entries
|
|
- connectors
|
|
- enabled services
|
|
- allowed entry/exit points
|
|
|
|
Organizations must not see:
|
|
|
|
- full mesh topology
|
|
- platform-wide node inventory
|
|
- other organizations' resources or sessions
|
|
- raw cross-tenant node routes
|
|
- internal platform recovery state unless explicitly exposed as safe status
|
|
|
|
All identifiers, Redis keys, logs, cache namespaces, file paths, and policy decisions must be scoped to organization where applicable.
|
|
|
|
## 5. Multi-Cluster Model
|
|
|
|
The platform may operate multiple clusters.
|
|
|
|
A cluster is an operational trust and routing domain with its own:
|
|
|
|
- membership
|
|
- node identities
|
|
- role assignments
|
|
- policies
|
|
- tokens
|
|
- routing tables
|
|
- health state
|
|
- storage namespaces
|
|
- partition state
|
|
|
|
A node may serve multiple clusters only through isolated cluster memberships.
|
|
|
|
Rules:
|
|
|
|
- Cluster membership must be explicit.
|
|
- A node identity in one cluster must not automatically grant trust in another cluster.
|
|
- Tokens must be cluster-scoped.
|
|
- Policies must be cluster-scoped when they affect node placement or routes.
|
|
- Storage namespaces must include cluster scope where data may overlap.
|
|
- Logs and telemetry must include cluster identifiers without leaking other tenants' data.
|
|
|
|
Do not create a separate cluster per organization by default. Organizations are tenancy boundaries; clusters are operational/routing boundaries.
|
|
|
|
## 6. Partition and Split-Brain Behavior
|
|
|
|
The platform must prevent split-brain.
|
|
|
|
When a partition occurs:
|
|
|
|
- disconnected minority segments cannot add nodes
|
|
- disconnected minority segments cannot change cluster-wide policies
|
|
- disconnected minority segments cannot promote themselves automatically
|
|
- disconnected minority segments cannot issue broad new trust
|
|
- already-running services may continue only if safe and policy permits
|
|
- nodes must attempt reconnect
|
|
- nodes must report degraded/recovery/isolated state when connectivity returns
|
|
|
|
Allowed safe continuation examples:
|
|
|
|
- an already-attached RDP session may continue if the live path between worker and client remains connected and the session was already authorized by policy
|
|
- local health checks may continue
|
|
- cached signed update metadata may be read but not used to bypass rollout policy
|
|
|
|
Restricted operations in minority partitions:
|
|
|
|
- adding nodes
|
|
- changing organization access policies
|
|
- changing cluster role assignment
|
|
- changing update trust roots
|
|
- expanding connector routes
|
|
- issuing long-lived tokens
|
|
|
|
Only `platform_owner` or an explicitly delegated recovery role may promote, merge, or split partitions.
|
|
|
|
Partition operations must be audited.
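
The partition rules above can be sketched as a gate evaluated before any sensitive operation. The operation names and the `in_majority` flag are hypothetical; how a segment determines that it is a minority is not specified in this section and is left outside the sketch.

```python
# Operations restricted in minority partitions, named after the list above.
RESTRICTED_IN_MINORITY = frozenset({
    "add_node",
    "change_org_access_policy",
    "change_role_assignment",
    "change_update_trust_root",
    "expand_connector_routes",
    "issue_long_lived_token",
})
PARTITION_OPS = frozenset({"promote_partition", "merge_partition", "split_partition"})
# "delegated_recovery" is a placeholder name for the delegated recovery role.
RECOVERY_ROLES = frozenset({"platform_owner", "delegated_recovery"})


def operation_allowed(op: str, actor_role: str, in_majority: bool) -> bool:
    if op in PARTITION_OPS:
        return actor_role in RECOVERY_ROLES
    if op in RESTRICTED_IN_MINORITY:
        return in_majority
    # Other operations fall through to normal policy checks (not modeled here).
    return True


print(operation_allowed("add_node", "platform_admin", in_majority=False))
print(operation_allowed("promote_partition", "platform_owner", in_majority=False))
```

In a real implementation each decision, allowed or denied, would also be written to the audit trail, since partition operations must be audited.
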
|
|
|
|
## 7. Data Plane Design
|
|
|
|
The production data plane must separate realtime traffic from backend domain APIs.
|
|
|
|
Target flow:
|
|
|
|
1. Client authenticates through the control plane.
|
|
2. Client selects an organization and resource.
|
|
3. Control plane authorizes access and issues a short-lived data-plane token.
|
|
4. Client connects to the nearest authorized `entry-node`.
|
|
5. Entry node validates the token and route.
|
|
6. Traffic routes through direct path or mesh relay to the worker/service adapter.
|
|
7. Worker/service adapter enforces policy again.
|
|
8. Control plane remains source of truth for lifecycle and audit.
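
Steps 3 and 5 of the target flow depend on a short-lived token that the entry node can verify without calling the backend on every packet. The sketch below shows one way this could look using only the Python standard library. The token layout, the claim names, the HMAC-based signing, and the binding of a token to one entry node are assumptions, not the platform's defined format.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def issue_token(key: bytes, user: str, org: str, resource: str,
                entry_node: str, ttl_s: int, now: float) -> str:
    """Control-plane side: sign claims with a short expiry."""
    claims = {"sub": user, "org": org, "res": resource,
              "entry": entry_node, "exp": int(now) + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    return (body + b"." + sig).decode()


def validate_token(key: bytes, token: str, entry_node: str,
                   now: float) -> Optional[dict]:
    """Entry-node side: return the claims if valid, otherwise None."""
    try:
        body, sig = token.encode().split(b".")
    except ValueError:
        return None  # malformed token
    expected = hmac.new(key, body, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(expected, sig):
        return None  # tampered or signed with a different key
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now or claims["entry"] != entry_node:
        return None  # expired, or presented at the wrong entry node
    return claims
```

The signature is verified before the payload is decoded, so untrusted bytes are never parsed as JSON. A production design would more likely use asymmetric signatures so that entry nodes cannot mint tokens themselves.
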
|
|
|
|
The backend gateway may remain as fallback during migration, but it must not be the production high-rate realtime relay.
|
|
|
|
### Remote Server/Desktop Access Service
|
|
|
|
The platform exposes one managed product service for remote server/desktop
|
|
access. It is not modeled as separate organization-facing `RDP service`,
|
|
`VNC service`, or `SSH service` objects.
|
|
|
|
The Access Client always speaks the platform protocol to an authorized entry point.
|
|
After authentication, the user selects an organization and one allowed resource.
|
|
The resource definition owns the target protocol:
|
|
|
|
- `protocol=rdp`
|
|
- `protocol=vnc`
|
|
- `protocol=ssh`
|
|
|
|
The selected protocol chooses the internal adapter. The organization does not
|
|
configure or see adapter placement.
|
|
|
|
Resources also reference logical egress pools, for example `Office Moscow`.
|
|
An egress pool may be backed by multiple nodes with network reachability to the
|
|
office. Organization admins reference the logical egress; Fabric and cluster
|
|
policy select the concrete node/path.
|
|
|
|
This keeps node topology, adapter placement, and internal mesh routing hidden
|
|
from organizations while preserving protocol-specific policy enforcement at the
|
|
adapter boundary.
|
|
|
|
### Logical Channels
|
|
|
|
Channels must have independent priority, reliability, and backpressure behavior.
|
|
|
|
`input`:
|
|
|
|
- highest priority
|
|
- low latency
|
|
- reliable enough for ordered key/click events
|
|
- mouse move may be coalesced to latest
|
|
- must not wait behind render/video frames
|
|
|
|
`control`:
|
|
|
|
- reliable
|
|
- ordered
|
|
- used for attach, detach, takeover, terminate, reconnect, and state changes
|
|
|
|
`render/video`:
|
|
|
|
- droppable
|
|
- latest-frame or latest-tile semantics
|
|
- stale frames must be discarded
|
|
- binary payloads only in production
|
|
- must not block input/control
|
|
|
|
`clipboard`:
|
|
|
|
- reliable
|
|
- policy-gated
|
|
- text-only until richer formats are explicitly designed
|
|
|
|
`file_transfer`:
|
|
|
|
- reliable
|
|
- chunked
|
|
- bounded memory
|
|
- content-hashed
|
|
- resumability may be added later
|
|
|
|
`vpn_packets`:
|
|
|
|
- adaptive
|
|
- QoS-classified
|
|
- bulk VPN traffic must not starve RDP input/render
|
|
|
|
`telemetry`:
|
|
|
|
- low priority
|
|
- lossy or sampled where acceptable
|
|
- must never block user interaction
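
The channel rules above can be illustrated with a small strict-priority scheduler. This is a sketch under stated assumptions: the document fixes `input` as highest priority and `telemetry` as low priority, while the relative order of the channels in between, the message layout, and the telemetry queue limit are chosen here for illustration only.

```python
from collections import deque

# Assumed order; only "input first" and "telemetry last" come from the text.
PRIORITY = ["input", "control", "render/video", "clipboard",
            "file_transfer", "vpn_packets", "telemetry"]


class ChannelScheduler:
    def __init__(self, telemetry_limit: int = 100):
        self.queues = {name: deque() for name in PRIORITY}
        self.telemetry_limit = telemetry_limit

    def enqueue(self, channel: str, msg: dict) -> None:
        q = self.queues[channel]
        is_move = channel == "input" and msg.get("type") == "mouse_move"
        if is_move and q and q[-1].get("type") == "mouse_move":
            q[-1] = msg   # coalesce consecutive mouse moves to the latest
            return
        if channel == "telemetry" and len(q) >= self.telemetry_limit:
            q.popleft()   # telemetry is lossy: drop the oldest sample
        q.append(msg)

    def next(self):
        """Return (channel, message) from the highest-priority non-empty queue."""
        for name in PRIORITY:
            if self.queues[name]:
                return name, self.queues[name].popleft()
        return None
```

Only adjacent mouse moves are coalesced, so key and click events keep their order. Strict priority can starve lower channels under sustained load, so a production scheduler would likely add weighted or deadline-based sharing below the `input` and `control` channels.
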
|
|
|
|
## 8. RDP Performance Model
|
|
|
|
The current RDP MVP proves lifecycle and basic viewer behavior. It is not the target production performance model.
|
|
|
|
Target RDP realtime model:
|
|
|
|
- client connects to direct/relay data plane, not backend frame relay
|
|
- input/control channels are separate from render/video
|
|
- render frames are binary, not base64 JSON
|
|
- render uses latest-frame-only under pressure
|
|
- dirty regions / tiles are preferred over full-frame updates
|
|
- adaptive FPS based on motion, bandwidth, latency, and CPU
|
|
- input-triggered frames use a low-latency path
|
|
- render/video queues are bounded
|
|
- stale frames are dropped aggressively
|
|
- worker and client report frame timing and backpressure metrics
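
The bounded-queue and latest-frame-only rules above can be sketched with a fixed-size buffer that always hands the newest frame to the sender. The class name, the default bound, and the `dropped` counter are assumptions for illustration.

```python
from collections import deque


class RenderQueue:
    """Bounded render queue with latest-frame-only delivery."""

    def __init__(self, max_frames: int = 2):
        self._frames = deque(maxlen=max_frames)
        self.dropped = 0   # exposed so it can feed backpressure metrics

    def push(self, seq: int, payload: bytes) -> None:
        if len(self._frames) == self._frames.maxlen:
            self.dropped += 1   # deque evicts the oldest frame on append
        self._frames.append((seq, payload))

    def pop_latest(self):
        """Return the newest frame and discard every older, stale frame."""
        if not self._frames:
            return None
        latest = self._frames[-1]
        self.dropped += len(self._frames) - 1
        self._frames.clear()
        return latest


q = RenderQueue(max_frames=2)
for seq in (1, 2, 3):
    q.push(seq, b"binary-frame")
print(q.pop_latest(), q.dropped)
```

Payloads are `bytes`, in line with the rule that production render frames are binary rather than base64 JSON. Memory use is bounded by `max_frames` regardless of how far the consumer falls behind.
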
|
|
|
|
### Quality Profiles
|
|
|
|
`emergency_grayscale`:
|
|
|
|
- lowest bandwidth survival mode
|
|
- grayscale only
|
|
- low FPS
|
|
- aggressive frame dropping
|
|
- intended for degraded links
|
|
|
|
`low_bandwidth`:
|
|
|
|
- reduced color depth
|
|
- lower FPS
|
|
- dirty regions / tiles preferred
|
|
- stronger compression when available
|
|
|
|
`text_priority`:
|
|
|
|
- optimized for admin tools, terminals, forms, and text-heavy applications
|
|
- favors sharp text over animation smoothness
|
|
- may reduce color depth and background fidelity
|
|
|
|
`balanced`:
|
|
|
|
- default mode
|
|
- moderate FPS
|
|
- full color where bandwidth allows
|
|
- adaptive dirty/full frame behavior
|
|
|
|
`high_quality`:
|
|
|
|
- higher color fidelity
|
|
- higher FPS where bandwidth and CPU allow
|
|
- less aggressive degradation
|
|
|
|
### Color Modes
|
|
|
|
Supported target color modes:
|
|
|
|
- full color
|
|
- 256 colors
|
|
- 64 colors
|
|
- 16 colors
|
|
- grayscale
|
|
|
|
Color mode must be selected by policy/profile and may adapt at runtime when allowed.
|
|
|
|
## 9. VPN-Like Mode
|
|
|
|
VPN-like access is a future service, not part of the current RDP MVP.
|
|
|
|
Target capabilities:
|
|
|
|
- Windows virtual adapter
|
|
- Android `VpnService` later
|
|
- split tunnel
|
|
- full tunnel
|
|
- per-organization exit node policy
|
|
- DNS policy
|
|
- route policy
|
|
- traffic classification
|
|
- QoS so that bulk VPN traffic does not degrade RDP sessions
|
|
|
|
Rules:
|
|
|
|
- VPN traffic must not share unbounded queues with RDP render/input.
|
|
- VPN route expansion must be policy-controlled.
|
|
- DNS behavior must be explicit and auditable.
|
|
- Exit node selection must respect organization, cluster, node health, and policy.
|
|
|
|
## 9.1 Cluster-Managed VPN Connection Model
|
|
|
|
VPN is not a node configuration. VPN is a cluster-managed service.
|
|
|
|
The control plane owns VPN configuration. Nodes are execution units. The
|
|
cluster is responsible for availability, ownership, failover, and continuity.
|
|
|
|
### `vpn_connection`
|
|
|
|
Introduce `vpn_connection` as a logical control-plane entity.
|
|
|
|
A `vpn_connection` represents one managed VPN connection to a target endpoint,
|
|
such as an office, branch, customer network, partner network, or private
|
|
resource zone.
|
|
|
|
Required fields:
|
|
|
|
- `organization_id`
|
|
- target endpoint, for example office/site name and remote address
|
|
- credentials/config, stored through approved secret handling
|
|
- allowed node policy:
|
|
- explicit node list
|
|
- or any capable node allowed by cluster and organization policy
|
|
- mode:
|
|
- `single_active` is required for the initial model
|
|
- desired state:
|
|
- `enabled`
|
|
- `disabled`
|
|
- routing usage:
|
|
- RDP
|
|
- VNC
|
|
- SSH
|
|
- IP tunnel
|
|
- HTTP/internal apps
|
|
- other explicitly enabled services
|
|
- routing policy:
|
|
- allowed CIDRs
|
|
- denied CIDRs
|
|
- DNS routes where applicable
|
|
- service-specific route usage
|
|
- resource-specific route references, such as an RDP, VNC, SSH, HTTP, or
|
|
IP tunnel resource depending on this `vpn_connection`
|
|
- placement policy:
|
|
- allowed regions
|
|
- preferred regions
|
|
- allowed countries or providers where required
|
|
- latency/load constraints
|
|
- standby/pre-warm eligibility
|
|
- bandwidth and QoS policy:
|
|
- per-service limits
|
|
- per-organization limits
|
|
- priority classes
|
|
|
|
The `vpn_connection` belongs to the control plane. It must not be inferred from
|
|
a node-local file, a container environment variable, or an operator manually
|
|
starting a VPN client on a host.
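
To make the field list above more concrete, the following sketch models a subset of `vpn_connection` as an immutable record. Field names beyond those listed in this section, the `secret_ref` string format, and the default values are hypothetical. Credentials are modeled as a reference only, on the assumption that secrets are resolved at runtime through the approved resolver.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class DesiredState(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


@dataclass(frozen=True)
class RoutingPolicy:
    allowed_cidrs: Tuple[str, ...] = ()
    denied_cidrs: Tuple[str, ...] = ()
    dns_routes: Tuple[str, ...] = ()


@dataclass(frozen=True)
class VpnConnection:
    organization_id: str
    target_endpoint: str
    secret_ref: str                               # reference, never the secret
    allowed_nodes: Optional[Tuple[str, ...]] = None  # None: any capable node
    mode: str = "single_active"
    desired_state: DesiredState = DesiredState.ENABLED
    routing_usage: FrozenSet[str] = frozenset()
    routing_policy: RoutingPolicy = RoutingPolicy()

    def __post_init__(self):
        # The initial model requires single_active.
        if self.mode != "single_active":
            raise ValueError("only single_active is supported in this sketch")


conn = VpnConnection(
    organization_id="org-1",
    target_endpoint="office-a",
    secret_ref="secret://vpn/office-a",
    routing_usage=frozenset({"rdp", "ssh"}),
    routing_policy=RoutingPolicy(allowed_cidrs=("10.1.0.0/16",)),
)
print(conn.mode, conn.desired_state.value)
```

Placement policy and bandwidth/QoS policy are omitted to keep the sketch short; they would follow the same pattern as `RoutingPolicy`.
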
|
|
|
|
### Cluster Responsibilities
|
|
|
|
For each enabled `vpn_connection`, the cluster must:
|
|
|
|
- ensure exactly one active node owns and maintains the VPN connection
|
|
- use lease-based ownership with TTL
|
|
- fence stale owners before a replacement owner becomes active
|
|
- reassign the VPN connection when the active node fails
|
|
- automatically recover the VPN connection after node loss
|
|
- expose connection status to authorized users
|
|
- audit ownership changes and failover events
|
|
|
|
The cluster must treat `single_active` as a correctness constraint, not merely a
|
|
placement preference.
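
Lease-based ownership with TTL and fencing can be sketched as follows. This is an in-memory illustration only: the real lease store is not specified in this section, and the class, method names, and monotonically increasing fencing token are assumptions. A real implementation would need an atomic, shared store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    owner: str
    expires_at: float
    fencing_token: int


class LeaseTable:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._leases = {}
        self._counter = 0   # each new ownership gets a strictly larger token

    def acquire(self, conn_id: str, node: str, now: float) -> Optional[Lease]:
        cur = self._leases.get(conn_id)
        if cur is not None and cur.expires_at > now:
            # A live lease blocks every node except its current owner.
            return cur if cur.owner == node else None
        self._counter += 1
        lease = Lease(node, now + self.ttl_s, self._counter)
        self._leases[conn_id] = lease
        return lease

    def renew(self, conn_id: str, node: str, token: int, now: float) -> bool:
        cur = self._leases.get(conn_id)
        if (cur is None or cur.owner != node
                or cur.fencing_token != token or cur.expires_at <= now):
            return False   # the caller must stop the VPN connection
        cur.expires_at = now + self.ttl_s
        return True

    def is_current(self, conn_id: str, token: int) -> bool:
        """Downstream components can reject actions from a stale owner."""
        cur = self._leases.get(conn_id)
        return cur is not None and cur.fencing_token == token
```

Because a stale owner still holds its old token, any component that checks `is_current` before accepting routes or traffic will reject it after a replacement owner is granted a newer token. This is what turns `single_active` into a correctness constraint instead of a best-effort preference.
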
|
|
|
|
### Node Behavior
|
|
|
|
Nodes do not decide to create VPN connections.
|
|
|
|
A node may execute a VPN connection only when all of the following are true:
|
|
|
|
- the node is capable of the required VPN role
|
|
- the node is explicitly allowed by the connection's node policy
|
|
- the node holds the active lease for the `vpn_connection`
|
|
- the connection desired state is `enabled`
|
|
- the cluster and organization policy permit the role assignment
|
|
|
|
Nodes must stop the VPN connection if:
|
|
|
|
- the lease is lost
|
|
- the lease cannot be renewed
|
|
- the connection desired state becomes `disabled`
|
|
- the node is removed from the allowed node policy
|
|
- cluster fencing marks the node as stale or unsafe
|
|
|
|
Node-local VPN execution must be a reaction to cluster assignment, never an
|
|
independent source of desired state.
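
The node-side start and stop conditions above reduce to a single predicate that a node could re-evaluate on every heartbeat. The sketch below is illustrative; the `NodeView` fields are hypothetical names for the inputs described in this section.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NodeView:
    capable: bool                 # node supports the required VPN role
    allowed_by_node_policy: bool  # listed or permitted by the connection policy
    holds_active_lease: bool      # lease is held and was renewed in time
    desired_state: str            # "enabled" or "disabled"
    role_permitted: bool          # cluster and organization policy allow it
    fenced: bool                  # cluster marked the node stale or unsafe


def should_run_vpn(view: NodeView) -> bool:
    """True only while every start condition holds and no stop condition does."""
    return (
        view.capable
        and view.allowed_by_node_policy
        and view.holds_active_lease
        and view.desired_state == "enabled"
        and view.role_permitted
        and not view.fenced
    )


healthy = NodeView(True, True, True, "enabled", True, False)
print(should_run_vpn(healthy))
print(should_run_vpn(replace(healthy, holds_active_lease=False)))
```

Since the predicate has no memory, losing any one input stops the connection on the next evaluation, which keeps node-local execution a reaction to cluster assignment.
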
|
|
|
|
### Failover
|
|
|
|
If the active VPN node disconnects, stops heartbeating, or loses its lease:
|
|
|
|
1. The cluster marks the previous owner stale or fenced.
|
|
2. The cluster selects a new eligible node.
|
|
3. The new node acquires the lease.
|
|
4. The new node establishes the VPN automatically.
|
|
5. Routing updates point traffic through the new active VPN node.
|
|
6. User connectivity should be preserved or restored with minimal interruption.
|
|
|
|
Failover behavior must be visible in status and audit. Operators should not
|
|
need to manually recreate the VPN connection after ordinary node loss.
|
|
|
|
### Visibility
|
|
|
|
Organization visibility:
|
|
|
|
- organization users/admins may see the VPN connection and status when policy
|
|
allows it
|
|
- organization admins may optionally see which node is currently active
|
|
- organizations do not see full cluster topology
|
|
- organizations do not see unrelated candidate nodes or platform-wide routing
|
|
internals
|
|
|
|
Platform owner visibility:
|
|
|
|
- platform owner/admin roles may inspect all nodes, leases, connections,
|
|
failover history, and cluster routing state according to delegated policy
|
|
|
|
### Routing Usage
|
|
|
|
Services may reference a `vpn_connection` as a routing dependency.
|
|
|
|
Examples:
|
|
|
|
- managed RDP session routes through the active VPN node before reaching the RDP
|
|
target
|
|
- VNC or SSH service adapter routes through the active VPN node
|
|
- IP tunnel traffic routes through the active VPN node transparently
|
|
- HTTP/internal app access reaches private endpoints through the active VPN
|
|
connection
|
|
|
|
Traffic should not depend on a specific node name unless policy explicitly
|
|
requires it. Traffic should depend on the logical `vpn_connection`, and the
|
|
cluster resolves that logical connection to the current active owner.
|
|
|
|
### Routing Policy Layer

Routing through a `vpn_connection` is policy-driven.

The routing policy defines:

- CIDR ranges reachable through the VPN connection
- CIDR ranges explicitly denied even if reachable
- DNS behavior and DNS suffix routing where applicable
- which services may use the VPN route
- which resources may use the VPN route
- whether a specific RDP resource may use the `vpn_connection`
- whether a specific VNC resource may use the `vpn_connection`
- whether a specific SSH resource may use the `vpn_connection`
- whether IP tunnel access may use the `vpn_connection`
- whether routes are exposed to managed protocol mode, IP tunnel mode, or both
- whether traffic may egress to the internet through the connection

The control plane publishes route intent. Data-plane nodes apply the authorized
routes for the currently active VPN owner.

Route updates should be dynamic. Adding, removing, or changing CIDR routes
should not require client reconfiguration when the client is using platform
managed routing. Clients should receive route updates through the authorized
data-plane/control channel or through the virtual adapter route manager.

Services such as RDP, VNC, SSH, HTTP/internal apps, and IP tunnel traffic may
reference the same logical `vpn_connection`, but each resource/service must
still pass its own authorization and policy checks.

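As an illustration of the allow/deny CIDR rules and the per-service check, the sketch below evaluates one destination against a policy. The `RoutePolicy` shape and its field names are assumptions made for this example; the standard-library `ipaddress` module does the CIDR matching.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class RoutePolicy:
    """Hypothetical policy shape; field names are illustrative only."""
    allowed_cidrs: list[str]
    denied_cidrs: list[str] = field(default_factory=list)
    allowed_services: set[str] = field(default_factory=set)


def route_permitted(policy: RoutePolicy, service: str, dest_ip: str) -> bool:
    """Deny wins over allow; the service must also be listed explicitly."""
    if service not in policy.allowed_services:
        return False
    addr = ipaddress.ip_address(dest_ip)
    if any(addr in ipaddress.ip_network(c) for c in policy.denied_cidrs):
        return False
    return any(addr in ipaddress.ip_network(c) for c in policy.allowed_cidrs)
```

Evaluating the deny list first matches the rule that a range can be denied "even if reachable". A real policy engine would also cover DNS behavior and per-resource rules, which this sketch omits.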
### Session Stickiness

The cluster should avoid unnecessary VPN owner changes.

If the active VPN node is healthy and policy still allows it, the cluster should
keep the same active owner to preserve:

- TCP session continuity
- route cache stability
- predictable latency
- lower reconnection overhead
- fewer user-visible interruptions

Rebalancing must not preempt stable interactive work unless the node fails, the
policy changes, the node enters maintenance/draining, or a better authorized
placement is selected by explicit policy.

### Standby and Pre-Warm Nodes

The model may support standby/pre-warm nodes to reduce failover time.

Standby nodes may:

- pre-download configuration
- validate credentials are available without exposing them broadly
- establish readiness checks
- keep routing dependencies warm where safe
- report expected failover latency

Standby nodes must not become active owners unless they acquire the lease and
the previous owner is fenced or safely released.

In `single_active` mode, standby/pre-warm nodes must not create a second active
VPN tunnel. They may prepare configuration and readiness state only within the
limits of the VPN technology and security policy.

### Node Selection Strategy

When selecting an active or standby VPN node, the cluster should consider:

- organization and cluster authorization
- explicit allowed-node policy
- organization scope
- policy constraints
- region constraints
- country/provider constraints
- measured latency to target endpoint
- measured latency to expected ingress/client region
- current node load
- available bandwidth
- packet loss and health
- service co-location requirements
- maintenance/draining state

Region-aware placement must support both hard restrictions and soft
preferences. A policy may require traffic to stay in a geography, country, or
provider, or may prefer a nearby region while allowing failover elsewhere.

`platform_owner` may override region/provider constraints only through an
audited platform policy path.

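One way to combine hard restrictions with soft preferences is to filter first and score second. The sketch below is a simplified illustration: the `Candidate` fields, the scoring weights, and `select_node` are all hypothetical, and only a few of the listed factors are modeled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    node_id: str
    authorized: bool
    draining: bool
    region: str
    latency_ms: float
    load: float  # 0.0 (idle) to 1.0 (saturated)


def select_node(candidates, required_regions=None, preferred_region=None):
    """Filter by hard constraints, then rank by a simple soft score."""
    eligible = [
        c for c in candidates
        if c.authorized and not c.draining
        and (required_regions is None or c.region in required_regions)
    ]
    if not eligible:
        return None

    def score(c):
        # Lower is better. The weights are arbitrary placeholders.
        penalty = 0.0 if c.region == preferred_region else 50.0
        return c.latency_ms + 100.0 * c.load + penalty

    return min(eligible, key=score).node_id
```

A hard restriction removes a node from consideration entirely, while a soft preference only adds a penalty, so a lightly loaded node outside the preferred region can still win.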
### QoS and Traffic Prioritization

VPN traffic must not starve interactive services.

Priority rules:

- RDP input/control traffic is highest priority
- RDP interactive traffic is high priority
- VNC/SSH interactive traffic is high priority
- video/audio traffic is adaptive and should degrade bitrate/FPS before
  harming interactive input
- clipboard and small control messages are reliable but bounded
- file transfer is reliable but lower priority than input/render/control
- VPN bulk traffic is lowest priority by default
- telemetry is low priority and may be sampled or dropped

The data plane must preserve channel priority even when VPN traffic is heavy.
Bulk tunnel traffic must degrade before RDP input latency or managed interactive
session control degrades.

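The priority rules can be illustrated with a strict-priority scheduler. This is a sketch under assumptions: the channel class names and their numeric order are invented for the example, and a production scheduler would likely add weights or minimum shares so that low classes are degraded rather than starved outright.

```python
import heapq
import itertools

# Hypothetical channel classes; the lowest number drains first.
PRIORITY = {
    "rdp_input": 0,
    "interactive": 1,
    "media": 2,
    "clipboard": 3,
    "file_transfer": 4,
    "telemetry": 5,
    "vpn_bulk": 6,
}


class PriorityScheduler:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within a class

    def enqueue(self, channel: str, payload: bytes) -> None:
        heapq.heappush(self._heap, (PRIORITY[channel], next(self._seq), payload))

    def dequeue(self):
        """Return the highest-priority pending payload, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

With this ordering, queued RDP input always leaves before bulk tunnel packets, regardless of arrival order.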
### Bandwidth Control Per Service

Bandwidth policy should be enforceable per:

- organization
- `vpn_connection`
- service type
- session
- user or device where policy requires it
- ingress/egress node

Examples:

- cap IP tunnel bulk bandwidth
- reserve minimum bandwidth for RDP input/render
- limit file transfer throughput during interactive sessions
- apply lower priority to background update/cache traffic

Bandwidth controls should be observable and auditable when they affect user
traffic.

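A cap such as "cap IP tunnel bulk bandwidth" is commonly implemented with a token bucket. The document does not prescribe a mechanism, so the class below is one possible sketch; the name `TokenBucket` and its parameters are assumptions. Time is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Caps throughput at `rate` bytes/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, size: int, now: float) -> bool:
        # Refill according to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if size <= self.tokens:
            self.tokens -= size
            return True
        return False
```

One bucket per enforcement scope (organization, `vpn_connection`, session, and so on) would let the same mechanism serve every level in the list above; rejected sends are also a natural point to emit the observability events the section asks for.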
### NAT Traversal and Connectivity

VPN-capable nodes may sit behind NAT or restrictive firewalls.

Connectivity strategy should support:

- outbound-only nodes
- direct node-to-node connectivity when available
- NAT traversal / hole punching where practical and safe
- relay fallback through authorized relay nodes
- explicit detection of symmetric NAT or blocked paths
- health-based route switching

Relay fallback must preserve tenant isolation, channel priority, and policy
enforcement. Relay nodes should not parse VPN payloads.

### Client Reconnect and Resume

Clients should be able to resume after transient network changes without a full
application-level reconnect when possible.

Reconnect behavior should preserve:

- authenticated client identity
- session or tunnel identity
- authorized route set
- logical channel state where safe
- current `vpn_connection` binding
- logical access to office/private-network resources

If the active VPN owner changes during reconnect, the client should learn the
new route through control-plane/data-plane updates without manual
reconfiguration. Active routes should be refreshed transparently.

### Stateful VPN Considerations

VPN and IP tunnel traffic can contain long-lived TCP sessions and stateful flows.

The platform should minimize disruption by:

- keeping sticky ownership while healthy
- draining before planned owner changes
- preserving route identity where possible
- avoiding unnecessary NAT/source address changes
- using graceful reconnect paths for supported protocols
- surfacing unavoidable disruptions clearly

Some traffic may still break during failover. The goal is to preserve user
connectivity where technically possible and recover quickly when preservation is
not possible.

Future `multi_active` or load-balanced modes may reduce disruption, but they
must be designed as explicit modes and must not weaken `single_active`
correctness.

### Security Boundaries

Each VPN connection must be isolated.

Isolation requirements:

- organization-scoped configuration and secrets
- secure credential storage
- secure credential distribution
- credential delivery only to authorized candidate/active nodes
- per-connection routing namespace where practical
- no route leakage between organizations
- no organization may see, reference, or use another organization's
  `vpn_connection`
- no implicit access to all private networks on a node
- least-privilege credentials/config delivery to the active owner
- scoped tokens for route usage
- protocol/service policy checks even when traffic uses the VPN route

The blast radius of a compromised node must be limited. A compromised node
should not gain access to unrelated `vpn_connection` secrets, unrelated
organization routes, or full cluster topology.

### Observability and Audit

The platform must observe and audit VPN lifecycle.

Track:

- desired state
- active node
- standby nodes
- lease owner and lease generation
- connect/disconnect events
- failover events
- fencing events
- enabled/disabled changes
- credential rotation
- active node changes
- route updates
- health checks
- bandwidth usage
- packet loss / latency
- policy denials

Organization admin views should show safe status without exposing full cluster
topology. Platform owner/admin views may show the current active node and full
operational detail across clusters according to delegated platform policy.

### Graceful Node Shutdown

Nodes should support graceful draining before maintenance or shutdown.

Graceful shutdown flow:

1. Node enters draining state.
2. Control plane stops placing new VPN ownership on the node.
3. Node releases or transfers the `vpn_connection` lease before maintenance
   when possible.
4. Cluster selects an eligible replacement or standby.
5. Replacement acquires lease only after safe handoff/fencing rules pass.
6. Routes shift to the replacement.
7. Old node stops the VPN connection and exits.

User connectivity should be preserved when the VPN protocol and routing model
allow it. If disruption is unavoidable, it should be visible and audited.

### Service Chaining

Managed services may chain through a `vpn_connection`.

Examples:

- client -> ingress -> mesh -> RDP adapter -> active VPN node -> RDP target
- client -> ingress -> mesh -> VNC adapter -> active VPN node -> VNC target
- client -> ingress -> mesh -> SSH adapter -> active VPN node -> SSH target
- client -> ingress -> mesh -> video adapter -> active VPN node -> media target
- client virtual adapter -> ingress -> mesh -> active VPN node -> private CIDR
- IP tunnel over `vpn_connection`
- future video/audio service over a selected route

Service chaining must remain explicit in policy. A service may use a
`vpn_connection` only when both the service policy and VPN routing policy allow
it.

### Control Plane vs Data Plane Separation

The control plane owns:

- `vpn_connection` desired state
- route policy
- allowed node policy
- placement decisions
- lease ownership
- fencing decisions
- audit
- status projection

The data plane owns:

- encrypted traffic forwarding
- tunnel packet movement
- QoS enforcement
- channel backpressure
- health/telemetry reporting
- executing connect/disconnect only when leased

Data-plane nodes must not invent desired state. They execute the control-plane
assignment and stop when the assignment or lease is gone, lost, expired, or the
connection is disabled.

### Correctness Constraints

Single-active enforcement is critical.

Required safeguards:

- lease TTL
- fencing of stale owners
- monotonic ownership generation or equivalent epoch
- idempotent connect/disconnect operations
- explicit handling of partition and split-brain scenarios
- audit of ownership changes

Split-brain must be prevented. Two nodes must not maintain the same
single-active VPN connection at the same time.

In a partition, minority segments must not promote a new active VPN owner unless
the cluster's partition/recovery policy explicitly allows it and fencing can be
proven safe.

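The interaction of lease TTL, monotonic generation, and fencing can be sketched with a single-connection lease table. This is an in-memory illustration only: `LeaseTable` and its methods are hypothetical, and a real implementation would need a durable, linearizable store behind the same checks.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseState:
    owner: Optional[str] = None
    generation: int = 0
    expires_at: float = 0.0


class LeaseTable:
    """In-memory stand-in for a durable control-plane lease store."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.state = LeaseState()

    def acquire(self, node_id: str, now: float) -> Optional[int]:
        s = self.state
        if s.owner == node_id and now < s.expires_at:
            s.expires_at = now + self.ttl  # idempotent renew, same generation
            return s.generation
        if s.owner is not None and now < s.expires_at:
            return None  # another node holds a live lease
        # Lease is free or expired: bump the generation to fence the old owner.
        self.state = LeaseState(node_id, s.generation + 1, now + self.ttl)
        return self.state.generation

    def is_current(self, node_id: str, generation: int, now: float) -> bool:
        s = self.state
        return (s.owner == node_id and s.generation == generation
                and now < s.expires_at)
```

A node that was partitioned away still holds its old generation number, so any operation it attempts can be rejected by `is_current` even if the node believes its lease is valid.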
### Future Extensions

Future models may add:

- multi-active VPN connections
- load-balanced VPN ownership
- standby / pre-warm nodes
- health-ranked failover
- traffic-aware placement
- active/passive connection draining

These future modes must be explicit new modes. They must not weaken the initial
`single_active` guarantee.

## 10. Node Agent and Updates

Every node should run a small stable `node-agent`.

The node-agent is not a service workload. It supervises service workloads.

Responsibilities:

- node registration
- node identity management
- health checks
- service supervision
- restarting crashed services
- downloading approved artifacts
- verifying signatures
- applying staged updates
- rolling back on failure
- reporting version state
- reporting partition/recovery state

Update rules:

- all artifacts must be signed
- unsigned binaries must never run
- rollout may be staged or canary-based
- rollback must be supported
- local update cache may be used
- thin nodes may download artifacts but must not store them long term
- update-cache nodes may mirror approved artifacts

### Version Storage and Release Channels

The platform has a logical Version Storage / Update Repository service.

The repository stores release manifests, artifacts, signatures, hashes,
compatibility metadata, and migration bundles. It should keep at least:

- last stable version: the known-good version used for rollback
- current version: the version intended for normal cluster operation
- candidate version: the version staged for testing or rollout
- OS / architecture variants for every supported runtime target
- service-specific artifacts for workloads such as RDP Adapter, relay-node,
  entry-node, storage/cache, and future VPN/VNC/SSH services
- node-agent artifacts with stricter rollout rules than ordinary workloads

Version Storage does not decide rollout by itself. The Control Plane owns
rollout policy, approvals, target rings, canary scope, channel assignment,
audit, and cluster/organization constraints. Version Storage provides immutable
signed artifacts and metadata for those policies.

Update-cache nodes mirror approved artifacts from Version Storage. Node-agents
download from the nearest authorized update-cache or directly from Version
Storage when policy allows. Every download must be verified by hash, signature,
version metadata, and trust root before execution.

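The hash portion of download verification can be sketched with the standard library alone. The manifest keys `size` and `sha256` are assumptions made for this example. Signature and trust-root verification, which the rules above also require, are deliberately not shown because they depend on a signing scheme this document does not specify.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def artifact_matches_manifest(artifact: bytes, manifest: dict) -> bool:
    """Check size and SHA-256 against manifest fields (hypothetical names)."""
    if len(artifact) != manifest["size"]:
        return False
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(sha256_hex(artifact), manifest["sha256"])
```

A hash match alone proves only that the bytes equal what the manifest describes. The manifest itself must still be authenticated before an artifact is allowed to execute.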
### Node-Agent Update and Recovery Loop

The node-agent runs close to the node and remains the small stable supervisor.

It is responsible for:

- checking desired version / channel assignment
- selecting the correct artifact for OS, architecture, role, and service type
- downloading from authorized update-cache / Version Storage
- verifying signatures and hashes
- stopping, starting, and restarting service workloads
- health-checking updated workloads
- promoting a version to local known-good only after health checks pass
- rolling back to last stable / last known good when health checks fail
- reporting update progress, failures, and final version state

For node-agent self-update, extra safeguards are required:

- staged binary replacement or A/B slots
- watchdog or parent process where available
- rollback path that survives process crash
- no unsigned self-update
- no update that removes the agent's ability to contact the Control Plane or
  trusted update source without an explicit break-glass policy

### Data Structure and Migration Updates

Updates may change code and data structure.

Migration artifacts must be modeled explicitly, not hidden inside binaries.

Migration types:

- Control Plane / PostgreSQL schema migrations
- service-local data migrations
- node-local state format migrations
- cache/storage shard format migrations
- protocol/config schema migrations

Rules:

- PostgreSQL schema migrations are orchestrated by the Control Plane release
  process, not independently invented by node-agents.
- Service-local and node-local migrations may be executed by node-agent only
  when included in an approved signed manifest.
- Migration manifests must declare source version, target version,
  compatibility, preflight checks, backup/snapshot requirements, rollback
  behavior, and whether downgrade is possible.
- Prefer expand/contract migrations for durable shared schemas.
- A service workload must not be updated to a version incompatible with the
  currently applied schema/config snapshot.
- A failed migration must leave the node in a known state: rolled back,
  degraded, or fenced, never silently half-updated.
- Rollout policy must support canary, staged rollout, pause, resume, rollback,
  and blocklist of bad versions.

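A migration manifest with the declared fields might be modeled as below. This is a sketch under assumptions: the `MigrationManifest` class, its field names, and the `may_apply` gate are hypothetical, and the `signed` flag stands in for a real signature check that is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationManifest:
    """Hypothetical manifest shape; the field list mirrors the rules above."""
    source_version: str
    target_version: str
    preflight_checks: tuple[str, ...] = ()
    requires_snapshot: bool = True
    downgrade_possible: bool = False
    signed: bool = False  # set by a signature verifier that is not shown here


def may_apply(manifest: MigrationManifest, current_version: str,
              blocklist: frozenset[str] = frozenset()) -> bool:
    """A migration runs only if signed, not blocklisted, and version-matched."""
    return (
        manifest.signed
        and manifest.target_version not in blocklist
        and manifest.source_version == current_version
    )
```

Keeping the gate as a pure function of the manifest and local state makes it straightforward to evaluate the same decision in preflight, in audit, and in tests.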
### Update Safety States

Recommended node/service update states:

- `idle`
- `checking`
- `downloading`
- `verifying`
- `staging`
- `migrating`
- `starting`
- `health_checking`
- `promoting`
- `running`
- `rollback_required`
- `rolling_back`
- `rolled_back`
- `failed`
- `fenced`

The platform owner must be able to see update state, last stable version,
current version, candidate version, failed version, rollback reason, and
whether a data migration was involved.

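The states above suggest a small state machine. The document names the states but does not define the transitions between them, so the transition table below is entirely an assumption, offered only to show how illegal jumps (for example, `downloading` straight to `running`) could be rejected.

```python
# Allowed transitions are an assumption; the document only names the states.
TRANSITIONS = {
    "idle": {"checking"},
    "checking": {"idle", "downloading"},
    "downloading": {"verifying", "failed"},
    "verifying": {"staging", "failed"},
    "staging": {"migrating", "starting", "failed"},
    "migrating": {"starting", "rollback_required"},
    "starting": {"health_checking", "rollback_required"},
    "health_checking": {"promoting", "rollback_required"},
    "promoting": {"running"},
    "running": {"checking"},
    "rollback_required": {"rolling_back", "fenced"},
    "rolling_back": {"rolled_back", "failed", "fenced"},
    "rolled_back": {"idle"},
    "failed": {"idle", "fenced"},
    "fenced": set(),
}


def advance(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal update transition: {current} -> {target}")
    return target
```

In this sketch `fenced` is terminal, reflecting the idea that leaving a fenced state needs an explicit operator or control-plane action rather than an automatic step.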
## 11. Security Model

Security must come from cryptography, scoped trust, policy enforcement, and audit. It must not depend on obscurity.

Required controls:

- mTLS for node-to-node communication
- strong node identity
- short-lived data-plane tokens
- per-channel permissions
- per-organization isolation
- per-cluster isolation
- backend policy enforcement
- worker/service-adapter policy enforcement
- scoped customer-managed node trust
- audit for control-plane actions
- audit for high-risk data-plane events
- revocation support
- signed updates
- no unsigned binaries

Per-channel authorization examples:

- input allowed only for current controller attachment
- render allowed only to authorized attachments
- clipboard direction controlled by `clipboard_mode`
- file transfer direction controlled by `file_transfer_mode`
- VPN routes controlled by organization and resource policy

The backend must authorize, but the worker and live nodes must also enforce. UI-only enforcement is never sufficient.

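A worker-side enforcement check for the per-channel examples could look like the sketch below. The `Attachment` shape, the direction strings, and the `clipboard_mode` values are assumptions for illustration; the document names `clipboard_mode` but does not list its values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attachment:
    attachment_id: str
    is_controller: bool
    # Assumed values: "disabled", "client_to_server", "server_to_client",
    # "bidirectional".
    clipboard_mode: str


def channel_allowed(att: Attachment, channel: str, direction: str) -> bool:
    """direction is "client_to_server" or "server_to_client"."""
    if channel == "input":
        return att.is_controller and direction == "client_to_server"
    if channel == "render":
        return direction == "server_to_client"
    if channel == "clipboard":
        return att.clipboard_mode in ("bidirectional", direction)
    return False  # unknown channels are denied by default
```

Running this check on the worker or live node, and not only in the backend or UI, is what gives the enforcement-at-every-layer property the paragraph above requires.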
### Platform Admin Access and Device Trust

Platform owner/admin access must be risk-based.

Rules:

- platform owner may manage all clusters from one panel
- accredited / trusted devices may have lower friction
- unknown devices require stronger checks
- MFA / 2FA policy must be supported
- device trust state must be supported
- session risk controls must be supported
- high-risk actions require step-up authentication

High-risk actions include:

- cluster trust changes
- node approval
- role assignment
- partition promotion
- cross-cluster trust
- secrets access
- update policy changes

Risk controls must be auditable. Device trust must not bypass organization,
cluster, role, or policy isolation.

## 12. Migration Plan

This is an incremental migration plan. It must not be executed as a big-bang rewrite.

### Current Fallback

Keep the current backend WebSocket gateway as fallback while the production data plane is introduced.

Current RDP MVP remains the preserved service-adapter baseline, but it is not
the active implementation focus while Fabric Core stages are underway.

### Fabric Core Staging Note

The following stages define future platform-core work. This document implements
none of them.

Stage C10: Fabric Core documentation consolidation and config distribution
design. Completed as documentation/planning. Result:
`docs/architecture/FABRIC_CORE_CONFIG_DISTRIBUTION.md`.

Stage C11: signed scoped cluster snapshot model. Completed as
documentation/planning. Result:
`docs/architecture/SIGNED_SCOPED_CLUSTER_SNAPSHOT_MODEL.md`.

Stage C12: node local state store. Completed as documentation/planning. Result:
`docs/architecture/NODE_LOCAL_STATE_STORE.md`.

Stage C13: config/storage service foundation. Completed as
documentation/planning. Result:
`docs/architecture/FABRIC_STORAGE_CONFIG_SERVICE.md`.

Stage C14: peer directory and cache model. Completed as
documentation/planning. Result:
`docs/architecture/FABRIC_PEER_DIRECTORY_CACHE.md`.

Stage C15: Fabric Routing Engine skeleton. Completed as
documentation/planning. Result:
`docs/architecture/FABRIC_ROUTING_ENGINE_SKELETON.md`.

Stage C16: secure node-to-node channel lifecycle. Completed as
documentation/planning. Result:
`docs/architecture/SECURE_NODE_TO_NODE_CHANNEL_LIFECYCLE.md`.

Stage C17: mesh routing runtime planning. Completed as documentation/planning.
Result: `docs/architecture/MESH_ROUTING_RUNTIME_IMPLEMENTATION_PLAN.md`.

- C17A synthetic mesh runtime skeleton is implemented and test-proven in
  `rap-node-agent` only. It carries synthetic `fabric.probe` /
  `fabric.probe_ack` messages behind a disabled-by-default feature flag.
- C17B route health and failover probes are implemented and test-proven with
  synthetic traffic only.
- C17C relay semantic hardening is implemented and test-proven with synthetic
  channel classes only.
- C17D non-production `synthetic.echo` test-service path is implemented and
  test-proven with bounded test payloads only.
- C17E live node-to-node synthetic HTTP transport is implemented and
  smoke-proven using real local HTTP endpoints, still behind a
  disabled-by-default feature flag and still synthetic-only.
- C17F scoped synthetic peer/route config loading and route-health reporting is
  implemented and smoke-proven, still synthetic-only.
- C17G Control Plane scoped synthetic config read/consume boundary is
  implemented and test-proven, still synthetic-only.

Stage C18: VPN/IP tunnel service target design. Completed as
documentation/planning. Result:
`docs/architecture/VPN_IP_TUNNEL_SERVICE_TARGET.md`.

Stage C18A: VPN/IP tunnel control-plane data model foundation. Implemented and
backend-test-proven. Result:
`artifacts/c18a-vpn-control-plane-data-model-report.md`.

Stage C18B: VPN/IP tunnel lease and fencing hardening. Implemented and
backend-test-proven. Result:
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.

Stage C18C: VPN/IP tunnel node-agent desired-state consumption. Implemented
and backend-test-proven. Result:
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.

Stage C19: Version Storage / Update Repository and node-agent update/rollback
foundation. Future stage. It must define signed release manifests, OS/arch
artifact variants, `stable` / `current` / `candidate` channels,
last-known-good rollback behavior, update-cache mirroring, node-agent
download/verify/stage/health-check/promote/rollback states, and explicit data
structure migration bundles. C19 must not implement production updater runtime,
automatic PostgreSQL migration execution by node-agent, mesh/VPN runtime, RDP
changes, or data-plane behavior changes unless a later prompt explicitly
authorizes that narrow work.

These stages must be introduced only through explicit, narrow implementation
prompts. RDP/VNC/SSH/VPN/video/file services remain above the Fabric Core and
must not define the lower fabric foundation.

### Stage DP-1: Direct Worker WSS

Introduce a short-lived authorized direct WSS path from client to worker or worker-local live endpoint.

Goals:

- remove backend from high-rate render path
- keep backend as source of truth
- keep session broker lifecycle unchanged
- keep fallback gateway available

### Stage DP-2: Binary Frames

Replace base64 JSON frame payloads with binary frame messages.

Goals:

- reduce payload size
- reduce CPU cost
- reduce JSON/base64 overhead
- preserve latest-frame-only behavior

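To make the size difference concrete, the sketch below packs a frame with a fixed 16-byte binary header and compares it with a base64-in-JSON encoding. Both wire formats are invented for this example: the header layout, the `RF` magic, and the JSON field names are assumptions, not the platform's actual protocol.

```python
import base64
import json
import struct

# Hypothetical header: magic, version, flags, frame sequence, width, height,
# payload length. Network byte order, 16 bytes total.
HEADER = struct.Struct("!2sBBIHHI")
MAGIC = b"RF"


def encode_frame(seq: int, width: int, height: int, pixels: bytes) -> bytes:
    return HEADER.pack(MAGIC, 1, 0, seq, width, height, len(pixels)) + pixels


def decode_frame(message: bytes):
    magic, version, _flags, seq, width, height, length = HEADER.unpack_from(message)
    if magic != MAGIC or version != 1:
        raise ValueError("unrecognized frame header")
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated frame payload")
    return seq, width, height, payload


def encode_legacy_json(seq: int, width: int, height: int, pixels: bytes) -> bytes:
    """A base64-in-JSON shape of the kind being replaced (field names assumed)."""
    return json.dumps({"seq": seq, "w": width, "h": height,
                       "data": base64.b64encode(pixels).decode("ascii")}).encode()
```

Base64 expands the payload by roughly one third before JSON framing is added, so the binary form is smaller for any non-trivial frame and also avoids the encode/decode CPU cost.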
### Stage DP-3: Adaptive Quality

Implement adaptive RDP quality profiles.

Goals:

- dirty regions / tiles
- adaptive FPS
- color mode switching
- bandwidth and latency feedback
- bounded frame queues

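A bounded frame queue with latest-frame-only behavior can be sketched with `collections.deque`. The class name `LatestFrameQueue` and its drop counter are illustrative assumptions; the drop count is the kind of signal that could feed the bandwidth and latency feedback goal.

```python
from collections import deque


class LatestFrameQueue:
    """Bounded queue that drops the oldest frame when the consumer lags."""

    def __init__(self, max_frames: int = 1):
        self._frames = deque(maxlen=max_frames)
        self.dropped = 0

    def push(self, frame) -> None:
        if len(self._frames) == self._frames.maxlen:
            self.dropped += 1  # deque discards the oldest entry on append
        self._frames.append(frame)

    def pop(self):
        """Return the oldest retained frame, or None if the queue is empty."""
        return self._frames.popleft() if self._frames else None
```

With `max_frames=1` the consumer only ever sees the newest frame, which keeps render latency bounded at the cost of skipping intermediate frames.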
### Stage DP-4: Relay Nodes

Introduce `entry-node` and `relay-node` roles for data-plane routing.

Goals:

- nearest entry selection
- relay when direct worker path is unavailable
- channel priority and QoS
- per-organization route authorization

### Stage DP-5: Mesh Routing

Introduce secure mesh routing across nodes.

Goals:

- mTLS node-to-node routes
- multi-hop support
- route failover
- partition awareness
- cluster-scoped node membership

### Stage VPN-1: Virtual Adapter

Introduce VPN-like client mode.

Goals:

- Windows virtual adapter first
- split tunnel first
- explicit DNS/routing policy
- QoS separation from RDP
- Android `VpnService` later

## 13. Risks and Bottlenecks

### Realtime Through Backend

Risk:

- backend becomes a high-rate frame relay
- control-plane APIs compete with render traffic
- latency grows with users

Target mitigation:

- move production realtime to data plane
- keep backend for authorization and lifecycle

### Full-Frame Rendering

Risk:

- high bandwidth
- high CPU
- UI thread pressure
- poor scaling

Target mitigation:

- binary frames
- dirty regions / tiles
- latest-frame-only
- adaptive FPS
- quality profiles

### Split-Brain

Risk:

- minority partition mutates cluster state
- duplicate authority
- unsafe node additions

Target mitigation:

- quorum/authority model
- restricted mutation in minority segments
- manual promote/merge/split by `platform_owner`

### Tenant Leakage

Risk:

- organizations see other organizations' resources, nodes, routes, logs, or cache keys

Target mitigation:

- organization namespace everywhere
- route abstraction
- filtered topology views
- audit and policy checks at every boundary

### Multi-Cluster Node Isolation

Risk:

- node trust leaks across clusters
- tokens reused across clusters
- storage namespace collision

Target mitigation:

- isolated cluster memberships
- cluster-scoped tokens
- cluster-scoped storage namespaces
- explicit node role assignment per cluster

### VPN Traffic Starving RDP

Risk:

- bulk VPN traffic consumes data-plane capacity
- RDP input/render becomes laggy

Target mitigation:

- QoS classes
- channel prioritization
- separate queues
- traffic shaping
- per-service limits

### Unsafe Updates

Risk:

- compromised binaries
- broken rollout
- no rollback path

Target mitigation:

- signed artifacts only
- staged rollout
- canary rollout
- health checks
- rollback
- update-cache with signature verification

## Non-Goals for the Current Phase

Do not implement the following as part of this document:

- full mesh runtime
- VPN runtime
- multi-cluster runtime
- updater runtime
- new protocol adapters
- RDP MVP rewrite
- UI redesign

This document is a target architecture reference. Implementation must proceed through explicit, narrow, verified stages.

## Result / Decision

The target architecture now treats Fabric Core as the lower distributed runtime
foundation above the host OS and below service adapters. Peer discovery,
routing, scoped configuration distribution, node-local state, storage/cache
services, multi-cluster boundaries, and platform-admin device trust are target
architecture only. No RDP runtime, data-plane behavior, mesh runtime traffic,
VPN/IP tunnel runtime, backend session lifecycle, API, migration, or code change
is implied by this document. RDP/VNC/SSH/VPN/video/file remain service layers
above the Fabric Core; they consume the Fabric foundation and must not define
or own peer discovery, route selection, scoped cluster distribution, or storage
authority.

Sections `3.2` through `3.8` now explicitly separate peer discovery, cluster
configuration distribution, Fabric Storage Service, node-local state,
multi-cluster isolation, recovery, routing ownership, and shortcut connection
rules. This makes future mesh/VPN work possible without making RDP/VNC/SSH/VPN
adapters responsible for routing and without requiring every node to store full
cluster or organization data. Fabric routing is defined as backend-independent
at realtime decision time, Redis is explicitly live-only, and Fabric
Storage/Config Storage is explicitly a scoped distribution/cache layer rather
than a general-purpose distributed database or second source of truth.

Admin UI ownership is separated from storage. A storage node never becomes the
cluster panel automatically. Platform Owner Console remains global platform
scope; cluster-local Admin Endpoints require explicit admin/web ingress role
assignment and health/trust readiness; Organization Admin Panel remains a
tenant-safe projection.

Stage C10 produced the consolidated Fabric Core configuration distribution
contract in `docs/architecture/FABRIC_CORE_CONFIG_DISTRIBUTION.md`. Stage C11
defined the signed scoped cluster snapshot model in
`docs/architecture/SIGNED_SCOPED_CLUSTER_SNAPSHOT_MODEL.md`. Stage C12 defined
node-local state storage in
`docs/architecture/NODE_LOCAL_STATE_STORE.md`. Stage C13 defined Fabric Storage
/ Config Storage in `docs/architecture/FABRIC_STORAGE_CONFIG_SERVICE.md`.
Stage C14 defined peer directory/cache behavior in
`docs/architecture/FABRIC_PEER_DIRECTORY_CACHE.md`. Stage C15 defined the
Fabric Routing Engine skeleton in
`docs/architecture/FABRIC_ROUTING_ENGINE_SKELETON.md`. Stage C16 defined secure
node-to-node channel lifecycle in
`docs/architecture/SECURE_NODE_TO_NODE_CHANNEL_LIFECYCLE.md`.

Stage C17 defined the mesh routing runtime implementation plan in
`docs/architecture/MESH_ROUTING_RUNTIME_IMPLEMENTATION_PLAN.md`. Stage C17A
implemented the first synthetic mesh runtime skeleton and remains limited to
`fabric.probe` / `fabric.probe_ack` test traffic only, not RDP, VPN/IP tunnel,
or production service traffic. Stage C17B added `fabric.route_health` /
`fabric.route_health_ack`, local route observations, fallback selection, and
route-cache invalidation, still synthetic-only. Stage C17C added synthetic
relay validation, per-channel bounded queues, QoS order, telemetry-only
backpressure/drop, and reliable fabric/control rejection behavior. Stage C17D
added one bounded `synthetic.echo` test-service path over direct, single-relay,
and forced fallback routes. No RDP, VPN/IP tunnel, or production service
traffic is authorized by C17D. Stage C17E added live node-to-node synthetic
HTTP transport and a smoke harness only; it still does not authorize production
mesh, RDP, VPN/IP tunnel, file, video, or service workload traffic. Stage C17F
added scoped synthetic config loading and route-health link observation
reporting only. Stage C17G added the Control Plane node-scoped synthetic config
read boundary only.

Stage C18 completed VPN/IP tunnel target planning in
`docs/architecture/VPN_IP_TUNNEL_SERVICE_TARGET.md`; it does not
authorize VPN/IP tunnel runtime, packet routing, TUN/TAP handling, or
production service traffic. Stage C18A added durable PostgreSQL control-plane
tables, backend model/repository/service boundaries, platform-admin API
skeleton, and service tests for future `vpn_connection` work. C18A still does
not authorize VPN/IP tunnel runtime, node-agent VPN execution, packet routing,
or production service traffic. Stage C18B hardened lease owner eligibility,
single-active acquire idempotency, stale lease reclamation, and fencing
semantics, still without VPN runtime, node-agent execution, packet routing, or
production service traffic. Stage C18C added node-agent scoped assignment
visibility and assignment status reporting only; it does not expose
`credential_ref`, execute VPN runtime, manipulate host networking, or carry
production service traffic.

Stage C19 is reserved for Version Storage / Update Repository and node-agent
update/rollback foundation, including release channels, signed artifacts,
update-cache mirroring, workload/self-update safety, and explicit
data-structure migration bundles. C19 is not implemented here and does not
authorize production updater runtime, unsigned artifacts, automatic PostgreSQL
migration by node-agent, RDP changes, mesh/VPN runtime, or data-plane behavior
changes.