Initial project snapshot
@@ -0,0 +1,53 @@
SYSTEM OVERVIEW

Client -> Proxy (custom protocol)
Proxy -> RDP servers (FreeRDP)

CORE FEATURES
- Persistent sessions
- Reconnect / takeover
- No direct client-server access
- Multi-session
- Clipboard + file transfer
- Quality profiles

SESSION RULES
- Session lives on server
- Client disconnect does NOT terminate session
- Reattach allowed from any trusted device
- Single active controller per session

COMPONENTS
- API Gateway (Go)
- Auth Service
- Resource Service
- Session Broker
- Session Gateway (WebSocket)
- RDP Worker (C++ + FreeRDP)

DATABASE ENTITIES
- users
- devices
- resources
- sessions
- secrets
- audit logs

PROTOCOL
- REST for control
- WebSocket for session stream

SECURITY
- MFA
- encrypted secrets
- no direct RDP exposure
- trusted devices

RENDERING
- region updates (NOT full screenshots)
- adaptive quality
- bandwidth profiles

GOAL
User works as if sitting at remote machine.
@@ -0,0 +1,29 @@
You are building a production-grade remote access system.

Architecture:
- Go backend
- C++ RDP worker (FreeRDP)
- WebSocket session streaming
- PostgreSQL + Redis

Core rules:
- Sessions are persistent
- Client disconnect does NOT kill session
- Reconnect must work
- No direct RDP exposure

Tasks:
1. Create backend structure in Go
2. Implement auth (JWT + refresh)
3. Implement session broker
4. Implement WebSocket gateway
5. Define protocol messages
6. Prepare worker interface (C++ stub)

Focus:
- Stability
- Security
- Performance

Do NOT simplify architecture.
@@ -0,0 +1,18 @@
WHAT TO DO

1. Create empty repo
2. Copy docs from archive into /docs
3. Open Codex
4. Paste prompt from:
03_codex_prompts/00_master_prompt.md

Then iterate:
- backend
- protocol
- workers
- clients

IMPORTANT:
Do NOT start with UI.
Start with backend + session model.
@@ -0,0 +1,23 @@
TECH STACK (FINAL)

Backend:
- Go (API, Auth, Broker, Gateway)

RDP Worker:
- C++ + FreeRDP

Clients:
- Windows: C# (WPF)
- Linux: C++ (Qt 6)

Admin panel:
- TypeScript + React

Infra:
- PostgreSQL
- Redis
- Docker

Key principle:
One protocol, multiple native clients.
@@ -0,0 +1,9 @@
START HERE

1. Read 05_decisions/technology_stack_review_2026.md
2. Read 02_specs/technical_specification.md
3. Then open 03_codex_prompts/00_master_prompt.md

This package contains full technical specification and Codex prompts
for building a secure RDP proxy system with custom protocol.
@@ -0,0 +1,261 @@
# Architecture Guardrails

Status: architecture guardrails, documentation only.

This file exists so architecture documents have a stable guardrails reference
inside `docs/architecture`. The operational Codex guardrails remain in
`docs/codex/ARCHITECTURE_GUARDRAILS.md`.

## 1. Preserve the Proven RDP Baseline

The following are already proven and must remain stable:

- live FreeRDP connect
- active session state
- terminate
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- direct worker WSS data plane
- backend gateway fallback
- C++ RDP Adapter as the active RDP runtime

Architecture clarification must not silently weaken this behavior.
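
The detach, reattach, and takeover behavior listed above can be pictured as a small state model. The following is a minimal Go sketch, not the project's broker code: the `Session` type, its fields, and the error values are hypothetical names chosen for illustration. It assumes only what the baseline states, namely that detaching never ends the remote session and that reattach or takeover reuses the same session.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors for this sketch.
var (
	ErrTerminated = errors.New("session is terminated")
	ErrUntrusted  = errors.New("device is not trusted")
)

// Session is a hypothetical broker-side record of one remote session.
type Session struct {
	ID         string
	Alive      bool   // true while the remote RDP session exists
	Controller string // device currently attached; "" when detached
}

// Detach releases the controller without touching the remote session.
func (s *Session) Detach() { s.Controller = "" }

// Attach makes device the controller. When another device is attached this
// is a takeover, and the displaced device ID is returned.
func (s *Session) Attach(device string, trusted bool) (displaced string, err error) {
	if !s.Alive {
		return "", ErrTerminated
	}
	if !trusted {
		return "", ErrUntrusted
	}
	if s.Controller != device {
		displaced = s.Controller
	}
	s.Controller = device
	return displaced, nil
}

// Terminate is the only operation that ends the remote session.
func (s *Session) Terminate() { s.Alive, s.Controller = false, "" }

func main() {
	s := &Session{ID: "s1", Alive: true}
	s.Attach("laptop", true)
	s.Detach()                               // client disconnect: session stays alive
	s.Attach("desktop", true)                // reattach to the same session
	displaced, _ := s.Attach("tablet", true) // takeover
	fmt.Println(s.Alive, s.Controller, displaced)
}
```

Keeping `Terminate` as the only path that clears `Alive` makes it harder for a transport-level disconnect to end a session by accident.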

## 2. Source of Truth

PostgreSQL is the only durable source of truth for domain state.

Redis is live coordination only. It may hold leases, heartbeats, routing hints,
attach tokens, short-lived tokens, and ephemeral cache. It must not become a
durable source of truth for sessions, organizations, policies, cluster trust,
peer topology, durable configuration, organization data, route authority, or
node identity.

## 3. Fabric Core Before Mesh Runtime

RAP Fabric Core is the lower distributed runtime foundation above the host OS.

Fabric Core owns:

- native `rap-node-agent` identity
- enrollment
- local node state
- capability reporting
- role assignment consumption
- signed scoped configuration snapshots
- update trust
- service supervision boundary

Mesh runtime traffic must not be implemented before node identity, enrollment,
role assignment, scoped config distribution, and node-local state are
trustworthy.

## 4. Node Identity and Service Workloads

A node is a host-level identity managed by native `rap-node-agent`.

Service workloads are separate from node identity. They may be containerized or
native, but containers are packaging/isolation boundaries only.

Capabilities are not permissions. Role assignment must be explicit per cluster
and, when needed, per organization.

## 5. Routing Ownership

Routing is owned by the Fabric layer, not individual Service Adapters.

RDP, VNC, SSH, VPN, video, and file services may request a destination node,
resource target, egress node, or egress pool. The Fabric Routing Engine chooses
the path.

Routing decisions must not depend on live backend availability. They use
node-local state, signed scoped snapshots, peer cache, route cache, and policy.

Service Adapters must not select routes (including multi-hop routes), discover
peers or mesh topology, manage mesh connections, or implement mesh failover,
shortcut creation, partition recovery, or cross-cluster routing policy.

## 6. Need-to-Know Configuration

Nodes should be small, fast, and scoped.

A node receives only the configuration required for its cluster membership,
assigned role, service workload, and organization scope. It must not store full
cluster topology, unrelated organization data, unrelated storage shards, peer
caches outside its scope, or secrets it does not need.

Secrets must be delivered only through approved resolvers and only at runtime
when needed.

## 7. Fabric Storage Boundaries

Fabric Storage / Config Storage is a future distribution and cache layer, not a
new source of truth.

Storage service must not:

- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store full cluster or organization data on every node
- expose arbitrary query capabilities
- bypass organization and cluster isolation

## 8. Multi-Tenancy Isolation

Every organization must be isolated by design.

Namespace and authorize:

- resources
- users-in-organization
- groups
- policies
- connectors
- sessions
- service endpoints
- audit
- secret references
- storage/cache scopes
- Redis keys where applicable

Organizations must not see intermediate mesh topology, other organizations'
routes, peer caches, nodes, storage shards, secrets, or platform trust
internals.

## 9. Multi-Cluster Boundaries

A platform may manage multiple clusters, but clusters do not automatically
trust each other and do not form one shared mesh by default.

Cross-cluster routing requires explicit trust and policy.

Cluster-scoped identities, certificates, tokens, storage namespaces, and
policies are required. A node may participate in multiple clusters only through
isolated memberships.

## 10. Split-Brain Prevention

Never allow minority partitions to become a second authoritative cluster
automatically.

Cluster-wide changes, role changes, trust changes, node approvals, policy
mutation, partition promotion, and cross-cluster trust must be restricted in
non-quorum or degraded states.
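
As an illustration of the restriction above, here is a minimal Go sketch of a gate that refuses the listed mutations outside quorum. It assumes quorum means a strict majority of voting members, which this document does not define, and the action labels and function names are hypothetical.

```go
package main

import "fmt"

// highRisk holds hypothetical labels for the mutations named in this section.
var highRisk = map[string]bool{
	"cluster_change":      true,
	"role_change":         true,
	"trust_change":        true,
	"node_approval":       true,
	"policy_mutation":     true,
	"partition_promotion": true,
	"cross_cluster_trust": true,
}

// HasQuorum reports whether a strict majority of voting members is reachable.
// Strict majority is an assumption of this sketch.
func HasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}

// Permit allows a high-risk action only with quorum and outside degraded state.
// Actions not listed as high-risk are not restricted by this gate.
func Permit(action string, reachable, total int, degraded bool) bool {
	if !highRisk[action] {
		return true
	}
	return HasQuorum(reachable, total) && !degraded
}

func main() {
	fmt.Println(Permit("role_change", 2, 5, false)) // minority partition
	fmt.Println(Permit("role_change", 3, 5, false)) // majority, healthy
	fmt.Println(Permit("role_change", 3, 5, true))  // majority, degraded
}
```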

## 11. Control Plane vs Data Plane

Control plane owns durable state and policy:

- organizations
- users
- memberships
- roles
- resources
- policies
- nodes
- cluster membership
- service assignments
- connector/VPN desired state
- updates
- config distribution
- audit

Data plane carries authorized traffic:

- session streams
- worker traffic
- relay traffic
- connector traffic
- future VPN/IP tunnel traffic

Do not collapse control plane and data plane into one vague layer.

## 12. Updates and Trust

Updates must support:

- Version Storage / Update Repository as the signed artifact source
- explicit Control Plane rollout policy and approval
- signed artifacts
- no unsigned binaries
- staged rollout
- canary rollout
- rollback
- health checks
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change

Version Storage stores immutable release manifests, artifacts, hashes,
signatures, compatibility metadata, provenance, and approved migration bundles.
It must not become a second source of truth for rollout policy, approvals,
organization state, cluster state, or audit.

The native node-agent owns local update trust, health supervision, restart, and
recovery logic. It may update, restart, or rollback assigned local workloads
only according to signed manifests and Control Plane policy. Node-agent
self-update requires stricter staged replacement and crash-safe rollback than
ordinary workload updates.

PostgreSQL schema migrations are orchestrated by the Control Plane release
process. Node-agent must not independently invent or execute durable
PostgreSQL schema migrations. Service-local, node-local, cache, or protocol
schema migrations require signed manifest metadata, preflight checks,
rollback/fencing behavior, and explicit compatibility rules.

## 13. Performance and Routing Awareness

Placement and routing decisions must consider:

- CPU
- RAM
- network load
- active sessions
- connector load
- relay load
- service type
- health score
- latency
- packet loss
- bandwidth availability
- policy constraints

Interactive input/control traffic must not wait behind render/video, file
transfer, telemetry, or VPN bulk traffic.
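
One way to read the rule on interactive traffic is as strict priority between two send queues. The Go sketch below is an assumption-laden illustration, not the project's data-plane code: the `Frame` type, the two-queue split, and the `Next` function are hypothetical. It shows a dequeue step that always drains pending interactive frames before any bulk frame.

```go
package main

import "fmt"

// Frame is a hypothetical unit of data-plane traffic.
type Frame struct {
	Class   string
	Payload string
}

// Next returns the next frame to send. Pending interactive frames always win
// over bulk frames (render/video, file transfer, telemetry, VPN bulk).
func Next(interactive, bulk <-chan Frame) (Frame, bool) {
	// Non-blocking check of the interactive queue only.
	select {
	case f := <-interactive:
		return f, true
	default:
	}
	// Nothing interactive was pending; take whatever is ready, if anything.
	select {
	case f := <-interactive:
		return f, true
	case f := <-bulk:
		return f, true
	default:
		return Frame{}, false
	}
}

func main() {
	interactive := make(chan Frame, 4)
	bulk := make(chan Frame, 4)
	bulk <- Frame{Class: "file", Payload: "chunk-1"}
	bulk <- Frame{Class: "render", Payload: "region-7"}
	interactive <- Frame{Class: "input", Payload: "key-down"}
	for {
		f, ok := Next(interactive, bulk)
		if !ok {
			break
		}
		fmt.Println(f.Class, f.Payload)
	}
}
```

Strict priority can starve bulk traffic under sustained input. A real scheduler would likely add a budget or weighting, which is outside what this section specifies.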

## 14. No Runtime Expansion From Documentation

Architecture documentation does not authorize runtime implementation.

Do not start the following without an explicit staged prompt:

- RDP runtime changes
- Windows client behavior changes
- data-plane behavior changes
- backend session lifecycle changes
- mesh runtime traffic
- VPN/IP tunnel runtime
- relay packet routing
- QUIC/WebRTC
- service workload execution
- new protocol adapters

## Result / Decision

These guardrails formalize the Secure Access Fabric lower foundation:
PostgreSQL remains authoritative, Redis remains live-only, Fabric Core comes
before mesh runtime, Fabric routing must not depend on live backend
availability, service adapters do not own routing, nodes receive only
need-to-know scoped configuration, Fabric Storage/Config Storage is not a
general-purpose distributed database, and organizations must not see internal
mesh topology. No code, API, migration, RDP, data-plane, mesh, VPN, relay, or
service workload runtime behavior is changed by this document. Version
Storage/Update Repository is a future signed artifact and release distribution
foundation; it is not an updater runtime until a later explicit staged prompt
authorizes it.
File diff suppressed because it is too large
@@ -0,0 +1,93 @@
# Direct Worker WSS TLS / PKI

Status: P3.4 trust-model design/prep complete.

This document defines the production trust model for direct worker WSS. It does
not implement mesh, relay nodes, VPN, QUIC, WebRTC, or a new RDP runtime.

Detailed P3.4 production certificate lifecycle, worker identity binding, client
trust, rotation, revocation, and future smoke plan are defined in
`docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`.

## Goal

Direct worker WSS is the preferred RDP realtime data-plane path. In production,
the Access Client must only use direct worker WSS when both conditions are true:

- the backend advertises the candidate as production trusted
- normal TLS certificate validation succeeds

The backend gateway remains the safe fallback/debug path.

## Trust Modes

Backend config `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` supports:

- `smoke_insecure`: development/smoke only; candidate metadata is
  `smoke_only=true` and `production_trusted=false`
- `public_ca`: worker certificate chains to an OS/publicly trusted CA;
  candidate metadata is `production_trusted=true`
- `platform_ca`: worker certificate chains to a platform-managed CA;
  candidate metadata is `production_trusted=true`

Optional `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF` labels the platform CA or trust
bundle version in candidate metadata, for example `rap-platform-ca:v1`.

## Backend Enforcement

In production (`APP_ENV=production` or `APP_ENV=prod`):

- backend must not advertise `direct_worker_wss` candidates when
  `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
- backend still advertises `backend_gateway` fallback when configured
- direct candidates include trust metadata only when they are data-capable

Candidate metadata:

```json
{
  "runtime_transport": "json_v1",
  "traffic_ready": true,
  "tls_trust_mode": "platform_ca",
  "production_trusted": true,
  "smoke_only": false,
  "tls_ca_ref": "rap-platform-ca:v1"
}
```
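
The trust modes and the production rule can be combined into one decision function. The following Go sketch is illustrative only: the struct mirrors the JSON fields shown in this document, while `DirectCandidate`, `isProduction`, and the fail-closed handling of unknown modes are assumptions, not the backend's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CandidateMetadata mirrors the candidate metadata JSON in this document.
type CandidateMetadata struct {
	RuntimeTransport  string `json:"runtime_transport"`
	TrafficReady      bool   `json:"traffic_ready"`
	TLSTrustMode      string `json:"tls_trust_mode"`
	ProductionTrusted bool   `json:"production_trusted"`
	SmokeOnly         bool   `json:"smoke_only"`
	TLSCARef          string `json:"tls_ca_ref,omitempty"`
}

func isProduction(appEnv string) bool {
	return appEnv == "production" || appEnv == "prod"
}

// DirectCandidate returns metadata for a direct_worker_wss candidate, or
// ok=false when the candidate must not be advertised.
func DirectCandidate(appEnv, trustMode, caRef string) (CandidateMetadata, bool) {
	m := CandidateMetadata{
		RuntimeTransport: "json_v1",
		TrafficReady:     true,
		TLSTrustMode:     trustMode,
		TLSCARef:         caRef,
	}
	switch trustMode {
	case "smoke_insecure":
		if isProduction(appEnv) {
			return CandidateMetadata{}, false // never advertised in production
		}
		m.SmokeOnly = true
	case "public_ca", "platform_ca":
		m.ProductionTrusted = true
	default:
		// Unknown mode: failing closed is an assumption of this sketch.
		return CandidateMetadata{}, false
	}
	return m, true
}

func main() {
	if m, ok := DirectCandidate("production", "platform_ca", "rap-platform-ca:v1"); ok {
		out, _ := json.MarshalIndent(m, "", "  ")
		fmt.Println(string(out))
	}
	_, ok := DirectCandidate("prod", "smoke_insecure", "")
	fmt.Println("smoke candidate advertised in prod:", ok)
}
```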

## Windows Client Enforcement

Client config `environment=production` or `prod` means:

- smoke-only direct candidates are skipped
- candidates without production trust metadata are skipped
- `allow_insecure_direct_data_plane_tls_for_smoke` is ignored for direct worker
  WSS
- the client falls back to backend gateway instead of weakening TLS

In development/smoke:

- `allow_insecure_direct_data_plane_tls_for_smoke=true` may bypass certificate
  validation only for smoke-only direct candidates
- this bypass must not be used as a production trust mechanism
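
The client-side rules above reduce to a small decision per candidate. The sketch below expresses that logic in Go for consistency with the backend examples, although the Windows client itself is written in C#. The `Candidate`, `Decision`, `Decide`, and `Pick` names are hypothetical, and the sketch assumes candidates arrive in preference order with the backend gateway last.

```go
package main

import "fmt"

// Candidate is a hypothetical view of one advertised data-plane candidate.
type Candidate struct {
	Kind              string // "direct_worker_wss" or "backend_gateway"
	ProductionTrusted bool
	SmokeOnly         bool
}

// Decision says whether to use a candidate and whether TLS validation may be
// skipped for it.
type Decision struct {
	Use           bool
	SkipTLSVerify bool
}

// Decide applies the enforcement rules to a single candidate.
func Decide(env string, allowInsecureSmoke bool, c Candidate) Decision {
	if c.Kind != "direct_worker_wss" {
		return Decision{Use: true} // gateway path: normal TLS validation
	}
	if env == "production" || env == "prod" {
		if c.SmokeOnly || !c.ProductionTrusted {
			return Decision{} // skipped; the insecure flag is ignored
		}
		return Decision{Use: true}
	}
	// Development/smoke: the bypass applies only to smoke-only candidates.
	return Decision{Use: true, SkipTLSVerify: allowInsecureSmoke && c.SmokeOnly}
}

// Pick returns the first usable candidate in the given order.
func Pick(env string, allowInsecureSmoke bool, cands []Candidate) (Candidate, Decision, bool) {
	for _, c := range cands {
		if d := Decide(env, allowInsecureSmoke, c); d.Use {
			return c, d, true
		}
	}
	return Candidate{}, Decision{}, false
}

func main() {
	cands := []Candidate{
		{Kind: "direct_worker_wss", SmokeOnly: true},
		{Kind: "backend_gateway"},
	}
	c, d, _ := Pick("production", true, cands)
	fmt.Println(c.Kind, d.SkipTLSVerify)
	c, d, _ = Pick("development", true, cands)
	fmt.Println(c.Kind, d.SkipTLSVerify)
}
```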

## Worker Requirements

The worker direct WSS endpoint already requires:

- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`
- `RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE` or
  `RDP_WORKER_DATA_PLANE_PUBLIC_KEY_PEM`

Production workers should use certificates issued for their advertised direct
WSS hostname/IP subject alternative names. Platform-managed deployments should
prefer a dedicated platform CA and rotation workflow.

## Remaining Work

- implement app-local platform CA trust bundle handling in Windows clients
- automate worker certificate issuance/rotation
- rotate backend data-plane signing keys
- add live test-stand proof with `platform_ca` production-trusted direct WSS
- later integrate node-agent certificate enrollment
@@ -0,0 +1,465 @@
# Fabric Core Configuration Distribution

Status: Stage C10 result. Documentation and architecture only.

This document consolidates the Fabric Core configuration distribution model for
the Secure Access Fabric platform. It does not implement mesh runtime traffic,
VPN/IP tunnel runtime, relay packet routing, RDP work, service workload
execution, API changes, migrations, or code changes.

## 1. Purpose

Stage C10 defines the boundaries that must exist before the project safely
moves into signed snapshots, node-local storage, config/storage services, peer
directories, routing skeletons, secure node channels, mesh routing, or VPN/IP
tunnel runtime.

The goal is to prevent the lower fabric from growing into an accidental
distributed database, accidental full-mesh topology store, or service-specific
RDP/VPN routing layer.

## 2. Layer Model

The platform layer order remains:

1. Host OS
2. RAP Fabric Core
3. Secure Fabric Network
4. Service Runtime / Service Adapters
5. Access Clients / Admin UI

Fabric Core is the lower distributed runtime foundation above the host OS. It
is not a real operating system. It is implemented through native
`rap-node-agent`, control-plane contracts, scoped signed snapshots, node-local
state, role assignment consumption, update trust, and service supervision
boundaries.

RDP, VNC, SSH, VPN, video, file transfer, and internal-app access are services
above Fabric Core. They consume Fabric Core identity, placement, routing, and
policy; they do not define peer discovery, route selection, cluster authority,
or durable configuration ownership.

## 3. Source of Truth and Cache Boundaries

PostgreSQL remains the only durable source of truth for domain state:

- platform configuration
- clusters
- organizations
- users and memberships
- node identities and enrollment state
- node role assignments
- policies
- resources
- service desired state
- audit
- trust roots and revocation state

Redis remains live coordination only:

- leases
- heartbeats
- ephemeral routing hints
- short-lived tokens
- transient queues
- runtime cache

Redis must not store durable topology, durable configuration, node identity,
policy, organization data, cluster trust, or authoritative route state.

Fabric Storage / Config Storage is a distribution and cache layer. It must not:

- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store every cluster or organization object on every node
- expose arbitrary query capabilities
- bypass organization, cluster, role, or service isolation

Node-local state is runtime state plus signed scoped snapshots. It supports
fast operation and degraded reconnect. It is not a source of truth.

## 4. Configuration Layers

Configuration is separated into layers so nodes receive only what their role
requires.

Global platform configuration:

- platform trust roots
- supported protocol versions
- update trust policy
- platform-wide feature gates
- high-risk admin policy

Cluster configuration:

- cluster identity
- cluster trust roots and certificate policy
- cluster authority/partition state
- node role assignments
- QoS policy
- peer discovery policy
- route policy
- storage/config replication policy

Organization configuration:

- organization identity and status
- organization service enablement
- tenant-visible ingress/egress/service endpoints
- tenant policy references
- organization-specific resource references
- safe status projections

Service configuration:

- assigned service workload configuration
- service-specific policy subset
- resource references needed by the assigned workload
- connector or `vpn_connection` references where authorized
- runtime secret references, resolved only through approved secret resolvers

## 5. Scoped Distribution Principle

Nodes receive configuration on a need-to-know basis.

Core mesh node receives:

- scoped peer/neighbor data
- route policy
- QoS policy
- cluster version and trust metadata
- no RDP credentials
- no full organization user list
- no unrelated service configuration

Ingress node receives:

- allowed client entry policies
- token validation configuration
- entry route hints
- service endpoint mapping allowed for the ingress scope
- no full internal topology
- no unrelated organization data

Egress/service node receives:

- assigned service configs
- needed resource references
- needed connector or `vpn_connection` references
- policy for assigned services
- secrets only through approved resolver and only at runtime

Storage/config node receives:

- assigned shard/scope metadata
- replication metadata
- signed snapshot content for its assigned scope
- no unrelated organization data
- no unrestricted topology query access

Thin/mobile node receives:

- minimal bootstrap peers
- active session/tunnel policy subset
- local trust data required to reconnect
- no broad cluster topology
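
The per-role lists above amount to an allowlist applied when the control plane compiles a node's view. The Go sketch below illustrates that idea under stated assumptions: the role keys, section names, and the `ScopeFor` function are hypothetical labels derived from the lists, not identifiers from the project.

```go
package main

import (
	"fmt"
	"sort"
)

// allowedSections maps a hypothetical role key to the configuration sections
// that role may receive. Anything not listed is withheld.
var allowedSections = map[string][]string{
	"core_mesh": {"peers", "route_policy", "qos_policy", "cluster_trust"},
	"ingress":   {"client_entry_policy", "token_validation", "entry_route_hints", "service_endpoints"},
	"egress":    {"service_config", "resource_refs", "connector_refs", "service_policy"},
	"storage":   {"shard_scope", "replication", "scoped_snapshots"},
	"thin":      {"bootstrap_peers", "session_policy", "local_trust"},
}

// ScopeFor returns only the sections of full that role may receive.
// Unknown roles receive nothing.
func ScopeFor(role string, full map[string]string) map[string]string {
	out := make(map[string]string)
	for _, name := range allowedSections[role] {
		if v, ok := full[name]; ok {
			out[name] = v
		}
	}
	return out
}

func main() {
	full := map[string]string{
		"peers":            "...",
		"route_policy":     "...",
		"qos_policy":       "...",
		"cluster_trust":    "...",
		"service_config":   "...",
		"token_validation": "...",
	}
	scoped := ScopeFor("core_mesh", full)
	keys := make([]string, 0, len(scoped))
	for k := range scoped {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys)
}
```

An allowlist keeps new configuration sections withheld by default until a role is explicitly granted them, which matches the need-to-know principle.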

## 6. Signed Scoped Cluster Snapshot Boundary

C10 defines snapshot boundaries only. C11 will define the full signed scoped
cluster snapshot model.

A scoped snapshot is a signed, versioned, role-limited configuration package
that a node-agent can store locally.

Snapshot properties:

- cluster-scoped
- role-scoped
- organization-scoped where applicable
- versioned
- signed by an authorized control-plane signing key
- bounded in size
- expires or requires refresh according to policy
- reconstructable from PostgreSQL source-of-truth state

Snapshot contents may include:

- cluster id and version
- node membership scope
- assigned roles
- allowed service workload refs
- peer directory subset
- route policy subset
- QoS policy subset
- trust roots and revocation metadata
- storage/config endpoints for refresh
- degraded-mode permissions

Snapshot contents must not include:

- unrelated organization data
- broad user lists
- raw secrets
- RDP/VNC/SSH credentials
- full cluster topology unless node role requires it
- arbitrary query permissions

## 7. Node-Local State Boundary

`rap-node-agent` local state may contain:

- node identity material and certificate metadata
- cluster membership state
- signed scoped cluster snapshot
- peer cache
- route cache
- service assignment cache
- service health/status cache
- local health state
- partition/degraded state
- last applied config version
- pending update metadata
- bounded telemetry buffer

Node-local state must not contain:

- full cluster topology unless explicitly required by role
- full organization data
- unrelated organization secrets
- durable policy authority
- durable route authority
- durable audit authority
- unrelated storage shards

Node-agent must be able to operate from local state for short degraded periods
when policy allows it, but it must not authorize high-risk mutations while
isolated.

## 8. Peer Directory and Cache Boundary

Peer directory data is distributed as scoped configuration, not queried from
PostgreSQL on every routing decision.

Peer directory entry fields:

- `node_id`
- `cluster_id`
- endpoint candidates
- roles/capabilities
- region/location hints
- trust/certificate fingerprint
- policy scope
- config version

Node-local peer cache may add runtime observations:

- `last_success_at`
- `last_latency_ms`
- packet loss
- reliability score
- recent failure history
- observed load hints where allowed
- last seen config version

Peer selection is score-based, not latency-only. Inputs include:

- latency
- packet loss
- reliability
- region distance
- node load
- bandwidth availability
- role suitability
- policy constraints
- trust level
- recent failure history

The Fabric Routing Engine owns route selection. Service Adapters must not
discover peers, select mesh routes, create shortcuts, or implement partition
recovery.
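
To make "score-based, not latency-only" concrete, here is a minimal Go sketch of a scoring function over a subset of the inputs listed above. The weights, the `PeerObservation` fields, and the `Score` and `Rank` functions are placeholders invented for illustration; this document does not define a formula.

```go
package main

import (
	"fmt"
	"sort"
)

// PeerObservation is a hypothetical subset of peer cache data.
type PeerObservation struct {
	NodeID         string
	LatencyMs      float64
	PacketLoss     float64 // fraction, 0..1
	Reliability    float64 // fraction, 0..1
	Load           float64 // fraction, 0..1
	RecentFailures int
	PolicyAllowed  bool
}

// Score returns a higher value for a better peer. The weights are
// illustrative placeholders, not tuned or specified values.
func Score(p PeerObservation) float64 {
	s := 100.0
	s -= p.LatencyMs * 0.2
	s -= p.PacketLoss * 200
	s -= p.Load * 30
	s -= float64(p.RecentFailures) * 10
	s += p.Reliability * 20
	return s
}

// Rank drops peers that policy forbids and sorts the rest best-first.
func Rank(peers []PeerObservation) []PeerObservation {
	out := make([]PeerObservation, 0, len(peers))
	for _, p := range peers {
		if p.PolicyAllowed {
			out = append(out, p)
		}
	}
	sort.SliceStable(out, func(i, j int) bool { return Score(out[i]) > Score(out[j]) })
	return out
}

func main() {
	peers := []PeerObservation{
		{NodeID: "near-but-flaky", LatencyMs: 10, PacketLoss: 0.10, Reliability: 0.5, Load: 0.9, RecentFailures: 3, PolicyAllowed: true},
		{NodeID: "farther-but-solid", LatencyMs: 40, Reliability: 0.99, Load: 0.2, PolicyAllowed: true},
		{NodeID: "forbidden", LatencyMs: 1, Reliability: 1, PolicyAllowed: false},
	}
	for _, p := range Rank(peers) {
		fmt.Println(p.NodeID)
	}
}
```

In this example the lowest-latency allowed peer ranks below a slower peer because loss, load, and recent failures outweigh the latency advantage. Policy is treated as a hard filter rather than a weight.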

## 9. Fabric Storage / Config Storage Role

Fabric Storage / Config Storage is a logical future service. It is a scoped
distribution layer for configuration and signed snapshots.

Responsibilities:

- distribute signed scoped snapshots
- distribute peer directories
- cache hot configuration near service nodes
- replicate critical scoped data across failure domains
- provide nearby read access for node-agent refresh
- support cluster/org/service scope boundaries
- support version-based sync and incremental update delivery

Non-goals:

- no replacement of PostgreSQL
- no arbitrary distributed database behavior
- no direct node writes as authoritative state
- no broad ad hoc query API
- no full topology exposure to tenants
- no full organization data on every node

Placement rules:

- hot data may be placed near services that use it
- cold data may remain remote
- critical data should replicate across failure domains
- replication factor is policy-driven
- storage scope must respect cluster, organization, and service boundaries

## 10. Distribution Flow

Normal flow:

1. Control plane reads authoritative state from PostgreSQL.
2. Control plane compiles scoped configuration views.
3. Control plane signs full scoped snapshots or incremental updates.
4. Fabric Storage / Config Storage distributes and caches scoped artifacts.
5. Node-agent fetches snapshots/updates from authorized endpoints.
6. Node-agent verifies signatures, version, scope, expiry, and trust roots.
7. Node-agent applies configuration into local state.
8. Runtime components consume local state, not live backend calls, for realtime
   route decisions.

Realtime routing decisions must not depend on live backend availability. They
should use verified local state, peer cache, route cache, and policy.
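
Step 6 of the flow (verify signature, scope, expiry, and trust roots) can be sketched with Go's standard `crypto/ed25519` package. The document does not name a signature algorithm or wire format, so Ed25519, the `SignedSnapshot` struct, and the `signingInput` encoding are all assumptions made for illustration. Version checks are left out here.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
	"time"
)

// SignedSnapshot is a hypothetical envelope; field names are assumptions.
type SignedSnapshot struct {
	ClusterID string
	Scope     string
	Version   uint64
	ExpiresAt time.Time
	KeyID     string
	Payload   []byte
	Signature []byte
}

// signingInput builds the bytes covered by the signature (illustrative encoding).
func signingInput(s SignedSnapshot) []byte {
	head := fmt.Sprintf("%s|%s|%d|%d|%s|", s.ClusterID, s.Scope, s.Version, s.ExpiresAt.Unix(), s.KeyID)
	return append([]byte(head), s.Payload...)
}

// Verify checks signer trust, signature, scope, and expiry, in that order.
func Verify(s SignedSnapshot, trusted map[string]ed25519.PublicKey, cluster, scope string, now time.Time) error {
	pub, ok := trusted[s.KeyID]
	if !ok {
		return errors.New("unknown signer key")
	}
	if !ed25519.Verify(pub, signingInput(s), s.Signature) {
		return errors.New("signature mismatch")
	}
	if s.ClusterID != cluster || s.Scope != scope {
		return errors.New("snapshot is outside this node's scope")
	}
	if !now.Before(s.ExpiresAt) {
		return errors.New("snapshot expired")
	}
	return nil
}

// demo builds a freshly signed snapshot and the matching trusted key set.
func demo() (SignedSnapshot, map[string]ed25519.PublicKey, time.Time) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	now := time.Now()
	s := SignedSnapshot{
		ClusterID: "c1", Scope: "ingress", Version: 7,
		ExpiresAt: now.Add(time.Hour), KeyID: "cp-key-1",
		Payload: []byte(`{"route_policy":"default"}`),
	}
	s.Signature = ed25519.Sign(priv, signingInput(s))
	return s, map[string]ed25519.PublicKey{"cp-key-1": pub}, now
}

func main() {
	s, keys, now := demo()
	fmt.Println("valid:", Verify(s, keys, "c1", "ingress", now) == nil)
	s.Payload = []byte("tampered")
	fmt.Println("tampered:", Verify(s, keys, "c1", "ingress", now))
}
```

Because the signature covers the scope, version, and expiry fields along with the payload, a valid payload cannot be replayed under a different scope or with an extended lifetime.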

## 11. Versioning and Consistency Rules

Every snapshot and incremental update must carry:

- `cluster_id`
- scope identifiers
- monotonic config version or equivalent epoch
- issued-at timestamp
- expiry or refresh deadline
- signer id / key id
- signature
- dependency/base version for increments

Rules:

- full snapshot can establish or repair local state
- incremental update applies only to the expected base version
- version gaps require full resync
- signature mismatch rejects the update and triggers recovery
- rollback to older config is forbidden unless explicitly authorized by a
  signed recovery policy
- node must report last applied config version in heartbeat/status
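
The version rules above can be sketched as a single apply step. The Go code below is a minimal illustration that assumes signature verification has already succeeded. The `Update` and `LocalState` types, the `AuthorizedRollback` flag (standing in for a signed recovery policy), and the error names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical errors for this sketch.
var (
	ErrNeedFullResync = errors.New("version gap: full resync required")
	ErrRollback       = errors.New("rollback to older config is forbidden")
)

// Update is a hypothetical, already signature-verified snapshot or increment.
type Update struct {
	Full               bool
	Version            uint64
	BaseVersion        uint64 // expected local version; used by increments only
	AuthorizedRollback bool   // stands in for a signed recovery policy
}

// LocalState tracks the last applied config version on a node.
type LocalState struct {
	Applied uint64
}

// Apply enforces the rollback, base-version, and gap rules.
func (l *LocalState) Apply(u Update) error {
	if u.Version < l.Applied && !u.AuthorizedRollback {
		return ErrRollback
	}
	if !u.Full && u.BaseVersion != l.Applied {
		return ErrNeedFullResync // increment does not match the local base
	}
	l.Applied = u.Version // this value is what the node reports in heartbeat
	return nil
}

func main() {
	st := &LocalState{}
	fmt.Println(st.Apply(Update{Full: true, Version: 5}), st.Applied)
	fmt.Println(st.Apply(Update{Version: 6, BaseVersion: 5}), st.Applied)
	fmt.Println(st.Apply(Update{Version: 9, BaseVersion: 8}), st.Applied)
	fmt.Println(st.Apply(Update{Full: true, Version: 4}), st.Applied)
}
```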

## 12. Degraded Mode Rules

Degraded operation is allowed only when policy permits it.

Allowed examples:

- keep already-running safe services alive
- continue existing authorized routes for a short TTL
- reconnect to known active/warm/bootstrap peers
- use last signed snapshot to find config/storage endpoints
- report degraded status when connectivity returns

Forbidden while degraded:

- approve join requests
- issue node certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partition authority automatically
- access secrets not already authorized for the node's current role

Degraded mode must be time-bounded and observable.
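
A time-bounded degraded gate might look like the following Go sketch. The operation labels, the `DegradedState` type, and the `MaxAge` bound are assumptions for illustration. The document requires a time bound but does not specify its value or where it is configured.

```go
package main

import (
	"fmt"
	"time"
)

// forbiddenWhileDegraded holds hypothetical labels for the forbidden list above.
var forbiddenWhileDegraded = map[string]bool{
	"approve_join":           true,
	"issue_node_certificate": true,
	"assign_role":            true,
	"change_cluster_policy":  true,
	"change_org_policy":      true,
	"rotate_trust_roots":     true,
	"promote_partition":      true,
	"access_new_secret":      true,
}

// DegradedState records when degraded mode began and its policy-defined bound.
type DegradedState struct {
	Since  time.Time
	MaxAge time.Duration
}

// Permit reports whether op may run at time now while the node is degraded.
func (d DegradedState) Permit(op string, now time.Time) bool {
	if forbiddenWhileDegraded[op] {
		return false
	}
	return now.Sub(d.Since) <= d.MaxAge // degraded operation is time-bounded
}

func main() {
	start := time.Now()
	d := DegradedState{Since: start, MaxAge: 5 * time.Minute}
	fmt.Println(d.Permit("continue_route", start.Add(time.Minute)))
	fmt.Println(d.Permit("assign_role", start.Add(time.Minute)))
	fmt.Println(d.Permit("continue_route", start.Add(10*time.Minute)))
}
```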
|
||||
|
||||
## 13. Multi-Cluster Isolation
|
||||
|
||||
Clusters are isolated by default.
|
||||
|
||||
Rules:
|
||||
|
||||
- clusters do not automatically trust each other
|
||||
- clusters do not form one shared mesh by default
|
||||
- cross-cluster routing requires explicit trust and policy
|
||||
- platform owner may manage multiple clusters from one console
|
||||
- organization admins see only authorized clusters/resources
|
||||
- node may participate in multiple clusters only through isolated memberships
|
||||
- cluster-scoped identities, certificates, tokens, storage namespaces, and
|
||||
policies are required
|
||||
|
||||
A multi-cluster node must keep separate local state per cluster:
|
||||
|
||||
- separate identity/certificates
|
||||
- separate snapshots
|
||||
- separate peer cache
|
||||
- separate route cache
|
||||
- separate service assignment cache
|
||||
- separate storage namespace
|
||||
|
||||
## 14. Security Boundaries
|
||||
|
||||
Security requirements:
|
||||
|
||||
- snapshots are signed
|
||||
- transport for snapshot/update distribution is authenticated and encrypted
|
||||
- node-agent verifies signature, scope, expiry, signer, and trust root
|
||||
- secrets are never embedded directly in broad snapshots
|
||||
- secrets are resolved through approved resolvers only at runtime
|
||||
- high-risk admin actions require step-up authentication
|
||||
- all cluster trust and role changes are audited
|
||||
|
||||
High-risk actions include:
|
||||
|
||||
- node approval
|
||||
- role assignment
|
||||
- cluster trust changes
|
||||
- cross-cluster trust
|
||||
- partition promotion
|
||||
- secrets access
|
||||
- update policy changes
|
||||
- signing key rotation
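
The step-up requirement for high-risk actions can be sketched as a check on how recently the administrator re-authenticated. The action identifiers, the reason strings, and the 300-second freshness window below are hypothetical; the document only states that step-up authentication is required.

```python
from typing import Optional

# Hypothetical identifiers for the high-risk actions listed above.
HIGH_RISK_ACTIONS = frozenset({
    "node_approval",
    "role_assignment",
    "cluster_trust_change",
    "cross_cluster_trust",
    "partition_promotion",
    "secrets_access",
    "update_policy_change",
    "signing_key_rotation",
})


def authorize_admin_action(
    action: str,
    authenticated: bool,
    last_step_up_at: Optional[float],
    now: float,
    step_up_max_age_s: float = 300.0,  # assumed freshness window
) -> tuple[bool, str]:
    """Return (allowed, reason); high-risk actions need a fresh step-up."""
    if not authenticated:
        return False, "not_authenticated"
    if action not in HIGH_RISK_ACTIONS:
        return True, "ok"
    if last_step_up_at is None:
        return False, "step_up_required"
    if now - last_step_up_at > step_up_max_age_s:
        return False, "step_up_expired"
    return True, "ok_step_up"
```

The returned reason string is the kind of value that could be written to the audit log alongside the decision.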
|
||||
|
||||
## 15. C11-C18 Staging Boundary
|
||||
|
||||
C10 is a design consolidation stage. It prepares later stages:
|
||||
|
||||
- C11: signed scoped cluster snapshot model
|
||||
- C12: node local state store
|
||||
- C13: config/storage service foundation
|
||||
- C14: peer directory and cache model
|
||||
- C15: Fabric Routing Engine skeleton
|
||||
- C16: secure node-to-node channel lifecycle
|
||||
- C17: mesh routing runtime
|
||||
- C18: VPN/IP tunnel service
|
||||
|
||||
C10 implements none of these. Later stages must be explicit, narrow, and
verified. Mesh routing and VPN/IP tunnel runtime must not start before C11-C16
foundations are accepted.
|
||||
|
||||
## 16. Result / Decision
|
||||
|
||||
Stage C10 consolidates the lower Fabric Core configuration distribution model.
|
||||
|
||||
Decisions:
|
||||
|
||||
- PostgreSQL remains the only durable source of truth.
|
||||
- Redis remains live coordination only.
|
||||
- Fabric Storage / Config Storage is a scoped distribution/cache layer, not a
  second source of truth.
|
||||
- Nodes receive only role/cluster/organization scoped configuration.
|
||||
- Node-local state is bounded and non-authoritative.
|
||||
- Signed scoped snapshots are the required foundation for node-local operation
  and degraded recovery.
|
||||
- Peer directory/cache data is local and scoped; routing remains Fabric-owned.
|
||||
- Service Adapters remain protocol translators above Fabric Core.
|
||||
- Multi-cluster membership requires isolated identities, snapshots, caches,
  tokens, policies, and storage namespaces.
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C10.
|
||||
@@ -0,0 +1,398 @@
|
||||
# Fabric Peer Directory and Cache Model
|
||||
|
||||
Status: Stage C14 result. Documentation and architecture only.
|
||||
|
||||
This document defines the Fabric peer directory and node-local peer cache model.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
The peer directory tells a node which peers it may know about and potentially
connect to. The node-local peer cache stores scoped peer data plus runtime
observations for fast recovery and score-based peer selection.
|
||||
|
||||
The model must avoid:
|
||||
|
||||
- full-mesh assumptions
|
||||
- every node knowing full cluster topology
|
||||
- service adapters owning route selection
|
||||
- Redis as durable peer topology
|
||||
- backend calls on every realtime route decision
|
||||
|
||||
## 2. Peer Knowledge Classes
|
||||
|
||||
Each node maintains three peer classes:
|
||||
|
||||
- active peers
|
||||
- warm candidate peers
|
||||
- cold/bootstrap peers
|
||||
|
||||
Active peers:
|
||||
|
||||
- currently connected or recently used
|
||||
- participate in health, route, relay, or service traffic according to role
|
||||
- small bounded set
|
||||
|
||||
Warm candidate peers:
|
||||
|
||||
- known good but not currently active
|
||||
- promoted when active peers fail or a better path is needed
|
||||
- refreshed less frequently than active peers
|
||||
|
||||
Cold/bootstrap peers:
|
||||
|
||||
- seed or last-resort discovery peers
|
||||
- used when active and warm peers fail
|
||||
- may come from signed snapshot, local cache, storage/config service, or
  admin-defined seed nodes
|
||||
|
||||
Recommended active peer counts:
|
||||
|
||||
- normal node: 3-5
|
||||
- relay/core node: 8-20
|
||||
- thin/mobile node: 1-3
|
||||
|
||||
These are policy defaults, not hardcoded limits.
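
The recommended counts can be treated as defaults that policy may override. The sketch below assumes hypothetical role keys and a simple "keep the best-scoring peers, demote the rest to warm" rule, which the document does not prescribe.

```python
from typing import Optional

# Default (min, max) active peer counts from the recommendations above.
DEFAULT_ACTIVE_PEER_RANGE = {
    "normal": (3, 5),
    "relay_core": (8, 20),
    "thin_mobile": (1, 3),
}


def active_peer_range(role: str, overrides: Optional[dict] = None) -> tuple[int, int]:
    """Return the (min, max) active peer count, letting policy override defaults."""
    low, high = (overrides or {}).get(role, DEFAULT_ACTIVE_PEER_RANGE[role])
    if low < 0 or high < low:
        raise ValueError(f"invalid peer range for {role}: {(low, high)}")
    return low, high


def trim_active_peers(
    scored_peers: list[tuple[str, float]],
    role: str,
    overrides: Optional[dict] = None,
) -> tuple[list[str], list[str]]:
    """Keep the best-scoring peers up to the role's maximum; demote the rest."""
    _, high = active_peer_range(role, overrides)
    ranked = sorted(scored_peers, key=lambda item: item[1], reverse=True)
    active = [node_id for node_id, _ in ranked[:high]]
    warm = [node_id for node_id, _ in ranked[high:]]
    return active, warm
```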
|
||||
|
||||
## 3. Peer Directory Record
|
||||
|
||||
A signed peer directory entry may contain:
|
||||
|
||||
- `node_id`
|
||||
- `cluster_id`
|
||||
- endpoint candidates
|
||||
- advertised roles
|
||||
- verified capabilities
|
||||
- allowed peer relationship type
|
||||
- region/location hints
|
||||
- trust/certificate fingerprint
|
||||
- certificate expiry metadata
|
||||
- policy scope
|
||||
- organization scope where applicable
|
||||
- service scope where applicable
|
||||
- supported transport hints
|
||||
- NAT/connectivity hints
|
||||
- `last_seen_config_version`
|
||||
|
||||
The peer directory is scoped. Ordinary nodes must not receive a full cluster
peer directory unless their role explicitly requires it.
|
||||
|
||||
## 4. Endpoint Candidate Model
|
||||
|
||||
Endpoint candidates describe possible ways to reach a node.
|
||||
|
||||
Candidate fields:
|
||||
|
||||
- endpoint id
|
||||
- transport type
|
||||
- host/IP/DNS name
|
||||
- port
|
||||
- address family
|
||||
- public/private reachability
|
||||
- region
|
||||
- NAT type if known
|
||||
- TLS/mTLS identity expectations
|
||||
- priority
|
||||
- policy tags
|
||||
- last verified timestamp
|
||||
|
||||
Transport types may include future values such as:
|
||||
|
||||
- direct TCP/TLS
|
||||
- WSS
|
||||
- relay-assisted
|
||||
- outbound-only reverse channel
|
||||
- future QUIC/UDP where explicitly approved
|
||||
|
||||
This model is descriptive only. C14 does not implement new transports.
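
As an illustration of the candidate model, the sketch below represents a subset of the candidate fields and filters them by the transports a node supports. The field names, the transport strings, and the "lower priority value is preferred" convention are assumptions, not decisions made by C14.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointCandidate:
    endpoint_id: str
    transport: str        # hypothetical values: "tcp_tls", "wss", "relay", "reverse"
    host: str
    port: int
    address_family: str   # "ipv4", "ipv6", or "dns"
    public: bool
    priority: int         # assumption: lower value means more preferred
    region: str = ""
    policy_tags: tuple[str, ...] = ()


def usable_endpoints(
    candidates: list[EndpointCandidate],
    supported_transports: set[str],
) -> list[EndpointCandidate]:
    """Drop candidates with unsupported transports, then order by priority."""
    usable = [c for c in candidates if c.transport in supported_transports]
    return sorted(usable, key=lambda c: (c.priority, c.endpoint_id))
```

Sorting on `(priority, endpoint_id)` keeps the order deterministic when two candidates share a priority.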
|
||||
|
||||
## 5. Node-Local Peer Cache
|
||||
|
||||
The node-local peer cache contains signed directory data plus runtime
observations.
|
||||
|
||||
Directory-derived fields:
|
||||
|
||||
- peer identity
|
||||
- cluster id
|
||||
- endpoint candidates
|
||||
- roles/capabilities
|
||||
- trust fingerprint
|
||||
- policy scope
|
||||
- config version
|
||||
|
||||
Runtime observation fields:
|
||||
|
||||
- `last_success_at`
|
||||
- `last_failure_at`
|
||||
- `last_latency_ms`
|
||||
- packet loss
|
||||
- jitter
|
||||
- reliability score
|
||||
- recent failure history
|
||||
- observed load hint where allowed
|
||||
- active/warm/cold state
|
||||
- last selected route id if applicable
|
||||
|
||||
Runtime observations are hints. They are not durable authority.
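
The runtime observation fields could be kept in a small mutable record next to the signed entry. The sketch below is illustrative only: the exponentially weighted reliability score, its starting value of 0.5, and the bounded failure history length are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PeerObservations:
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    reliability: float = 0.5  # assumed moving average: success=1.0, failure=0.0
    recent_failures: list[float] = field(default_factory=list)

    def record_success(self, now: float, latency_ms: float, alpha: float = 0.2) -> None:
        """Store the latest success and pull reliability toward 1.0."""
        self.last_success_at = now
        self.last_latency_ms = latency_ms
        self.reliability = (1 - alpha) * self.reliability + alpha

    def record_failure(self, now: float, alpha: float = 0.2, keep: int = 10) -> None:
        """Store the latest failure and pull reliability toward 0.0."""
        self.last_failure_at = now
        self.reliability = (1 - alpha) * self.reliability
        # Keep only the most recent failures so local state stays bounded.
        self.recent_failures = (self.recent_failures + [now])[-keep:]
```

Nothing in this record is signed or persisted as authority; it can be discarded at any time without affecting correctness.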
|
||||
|
||||
## 6. Refresh Cadence
|
||||
|
||||
Recommended cadence:
|
||||
|
||||
- active peer heartbeat: 5-15 seconds
|
||||
- active/warm latency probes: 30-120 seconds
|
||||
- warm peer validation: 2-10 minutes
|
||||
- peer directory refresh: 5-15 minutes
|
||||
- cold/bootstrap validation: periodic or on demand
|
||||
- full peer directory resync: only on version gap, signature mismatch, or
  policy-triggered refresh
|
||||
|
||||
Cadence may vary by role:
|
||||
|
||||
- relay/core nodes maintain richer peer sets
|
||||
- thin/mobile nodes probe less aggressively
|
||||
- egress/service nodes prioritize peers relevant to assigned services
|
||||
- storage/config nodes prioritize configured replica peers
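
The periodic items in the recommended cadence can be expressed as interval ranges. The sketch below picks a random point inside each range so that nodes do not probe in lockstep; that randomization, and the task names, are assumptions rather than part of the recommendation. Cold/bootstrap validation and full resync are event-driven and are left out.

```python
import random

# (min_seconds, max_seconds) taken from the recommended ranges above.
CADENCE_S = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def next_due(task: str, last_run: float, rng: random.Random) -> float:
    """Return the next run time, chosen uniformly inside the task's range."""
    low, high = CADENCE_S[task]
    return last_run + rng.uniform(low, high)
```

A role-specific policy could swap in narrower or wider ranges for relay/core or thin/mobile nodes without changing the scheduling logic.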
|
||||
|
||||
## 7. Peer Selection Scoring
|
||||
|
||||
Selection is score-based, not latency-only.
|
||||
|
||||
Hard checks first:
|
||||
|
||||
- cluster membership
|
||||
- node identity trust
|
||||
- certificate validity
|
||||
- role compatibility
|
||||
- allowed peer relationship
|
||||
- organization/service scope
|
||||
- partition/authority policy
|
||||
- transport compatibility
|
||||
- revocation status
|
||||
|
||||
Soft score inputs:
|
||||
|
||||
- latency
|
||||
- packet loss
|
||||
- jitter
|
||||
- reliability
|
||||
- recent failure history
|
||||
- region distance
|
||||
- node load hint
|
||||
- bandwidth availability
|
||||
- role suitability
|
||||
- route class/channel class
|
||||
- policy preference
|
||||
|
||||
No peer should be selected if it fails hard policy checks, even if latency is
excellent.
|
||||
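
The "hard checks first, then score" rule can be sketched as a filter followed by a maximum over the survivors. Only a few of the hard checks and soft inputs are modelled here, and the weights and the latency normalization are arbitrary illustrative values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerView:
    node_id: str
    same_cluster: bool
    cert_valid: bool
    revoked: bool
    role_compatible: bool
    latency_ms: float
    packet_loss: float   # fraction in 0.0 .. 1.0
    reliability: float   # fraction in 0.0 .. 1.0


def passes_hard_checks(peer: PeerView) -> bool:
    """A subset of the hard checks; any failure excludes the peer entirely."""
    return (
        peer.same_cluster
        and peer.cert_valid
        and not peer.revoked
        and peer.role_compatible
    )


def soft_score(peer: PeerView) -> float:
    """Combine soft inputs; the weights are illustrative, not policy."""
    latency_term = 1.0 / (1.0 + peer.latency_ms / 100.0)
    return 0.5 * latency_term + 0.3 * peer.reliability + 0.2 * (1.0 - peer.packet_loss)


def select_peer(peers: list[PeerView]) -> Optional[PeerView]:
    """Score only peers that pass every hard check; return None if none do."""
    eligible = [p for p in peers if passes_hard_checks(p)]
    if not eligible:
        return None
    return max(eligible, key=soft_score)
```

Because scoring never sees a peer that failed a hard check, an excellent latency value cannot rescue a revoked or out-of-scope peer.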
|
||||
## 8. Recovery Order
|
||||
|
||||
If active peers fail, recovery order is:
|
||||
|
||||
1. retry active peers with bounded backoff
|
||||
2. promote warm candidates
|
||||
3. try cold/bootstrap peers
|
||||
4. query authorized storage/config discovery endpoint
|
||||
5. use last signed snapshot for degraded reconnect if policy allows
|
||||
6. reconnect to control plane when available
|
||||
|
||||
Recovery must not authorize cluster mutation or high-risk actions.
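
The recovery order can be sketched as an ordered walk over step callbacks that stops at the first success. The step names and the callback shape are hypothetical; bounded backoff inside each step is assumed to be handled by the callback itself.

```python
from typing import Callable, Optional

# Step names mirror the numbered recovery order above.
RECOVERY_ORDER = (
    "retry_active",
    "promote_warm",
    "try_bootstrap",
    "query_discovery",
    "degraded_snapshot",
    "control_plane",
)


def recover(steps: dict[str, Callable[[], bool]], degraded_allowed: bool) -> Optional[str]:
    """Run steps in order and return the name of the first that succeeds."""
    for name in RECOVERY_ORDER:
        if name == "degraded_snapshot" and not degraded_allowed:
            continue  # policy does not permit degraded reconnect
        attempt = steps.get(name)
        if attempt is not None and attempt():
            return name
    return None
```

None of the callbacks is expected to mutate cluster state; they only try to re-establish connectivity.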
|
||||
|
||||
## 9. Channel-Aware Peer Preference
|
||||
|
||||
Peer choice depends on channel class.
|
||||
|
||||
Input/control:
|
||||
|
||||
- lowest latency
|
||||
- lowest jitter
|
||||
- high reliability
|
||||
- never behind bulk traffic
|
||||
|
||||
Render/video:
|
||||
|
||||
- bandwidth and jitter aware
|
||||
- stale-frame dropping acceptable
|
||||
- avoid paths with persistent queue growth
|
||||
|
||||
File transfer:
|
||||
|
||||
- throughput and reliability
|
||||
- lower priority than input/control
|
||||
|
||||
Clipboard/control:
|
||||
|
||||
- reliable bounded path
|
||||
- low volume
|
||||
|
||||
Telemetry:
|
||||
|
||||
- low priority
|
||||
- lossy/sampled allowed
|
||||
|
||||
VPN/IP tunnel (future):
|
||||
|
||||
- adaptive QoS
|
||||
- bulk traffic must not starve interactive sessions
|
||||
|
||||
## 10. Full-Mesh Prevention
|
||||
|
||||
Nodes must not attempt to connect to every known node.
|
||||
|
||||
Limits:
|
||||
|
||||
- active peers are bounded by role policy
|
||||
- warm peers are bounded by role policy
|
||||
- peer directory is scoped
|
||||
- full topology is hidden from organizations
|
||||
- service adapters never request arbitrary topology
|
||||
|
||||
Full topology access is reserved only for roles that require it, such as
platform control/admin views or selected core/route-analysis components.
|
||||
|
||||
## 11. Security Boundaries
|
||||
|
||||
Peer cache must enforce:
|
||||
|
||||
- cluster isolation
|
||||
- organization isolation
|
||||
- certificate fingerprint validation
|
||||
- revocation status
|
||||
- role assignment
|
||||
- allowed peer relationship
|
||||
- service scope
|
||||
|
||||
A compromised ordinary node should not learn full cluster topology.
|
||||
|
||||
Peer cache data must not include:
|
||||
|
||||
- unrelated organization resources
|
||||
- raw secrets
|
||||
- broad user lists
|
||||
- arbitrary route authority
|
||||
- cross-cluster trust unless explicitly authorized
|
||||
|
||||
## 12. Multi-Cluster Peer Isolation
|
||||
|
||||
Multi-cluster node membership uses separate peer caches per cluster.
|
||||
|
||||
Per-cluster separation:
|
||||
|
||||
- peer directory
|
||||
- endpoint candidates
|
||||
- trust roots
|
||||
- certificate fingerprints
|
||||
- active/warm/cold peer state
|
||||
- route observations
|
||||
- failure history
|
||||
|
||||
Cross-cluster peer discovery requires explicit trust and policy. Clusters do
not form a single mesh by default.
|
||||
|
||||
## 13. Storage / Snapshot Relationship
|
||||
|
||||
Peer directory data is distributed through signed snapshots or Fabric Storage /
Config Storage artifacts.
|
||||
|
||||
Rules:
|
||||
|
||||
- peer directory version is tracked
|
||||
- node reports last applied peer directory version
|
||||
- version gap triggers refresh/full resync
|
||||
- signature/hash mismatch rejects the directory
|
||||
- revoked peers are removed or marked unusable
|
||||
- runtime observations are preserved only when still valid for the current
  directory version
|
||||
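
The versioning rules above could be applied to an incremental directory update roughly as follows. This is a sketch under assumptions: the `LocalDirectory` shape, the `base_version` field on increments, the result strings, and the rule that observations are dropped whenever a peer's signed entry changes are all illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class LocalDirectory:
    version: int = 0
    peers: dict[str, dict] = field(default_factory=dict)         # node_id -> signed entry
    observations: dict[str, dict] = field(default_factory=dict)  # node_id -> runtime hints


def apply_increment(
    local: LocalDirectory,
    base_version: int,
    new_version: int,
    upserts: dict[str, dict],
    revoked: set[str],
    signature_ok: bool,
) -> str:
    """Return "rejected", "resync_required", or "applied"."""
    if not signature_ok:
        return "rejected"  # signature/hash mismatch rejects the directory
    if base_version != local.version or new_version <= base_version:
        return "resync_required"  # version gap: the increment does not fit
    for node_id, entry in upserts.items():
        if local.peers.get(node_id) != entry:
            # Hints gathered against the old entry are no longer valid.
            local.observations.pop(node_id, None)
        local.peers[node_id] = entry
    for node_id in revoked:
        local.peers.pop(node_id, None)
        local.observations.pop(node_id, None)
    local.version = new_version
    return "applied"
```

The node would then report `local.version` as its last applied peer directory version.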
|
||||
## 14. Service Adapter Boundary
|
||||
|
||||
Service Adapters may request:
|
||||
|
||||
- destination node
|
||||
- resource target
|
||||
- egress node
|
||||
- egress pool
|
||||
- channel class
|
||||
|
||||
Service Adapters must not:
|
||||
|
||||
- enumerate peers
|
||||
- select mesh routes
|
||||
- promote warm peers
|
||||
- create shortcut connections
|
||||
- implement partition recovery
|
||||
- implement cross-cluster routing policy
|
||||
|
||||
The Fabric Routing Engine owns those decisions.
|
||||
|
||||
## 15. Observability
|
||||
|
||||
Node-agent should report safe peer/cache metrics:
|
||||
|
||||
- active peer count
|
||||
- warm peer count
|
||||
- bootstrap peer count
|
||||
- peer directory version
|
||||
- last refresh time
|
||||
- average active peer latency
|
||||
- packet loss summary
|
||||
- failed peer count
|
||||
- recovery mode if active
|
||||
- selected peer class by channel type
|
||||
|
||||
Reports must not expose full topology to organizations.
|
||||
|
||||
## 16. Future Validation Tests
|
||||
|
||||
Future implementation tests must prove:
|
||||
|
||||
- peer directory scope is enforced
|
||||
- wrong-cluster peer is rejected
|
||||
- revoked peer is rejected
|
||||
- invalid certificate fingerprint is rejected
|
||||
- full topology is not distributed to ordinary node
|
||||
- active peer count stays bounded
|
||||
- warm peer promotion works
|
||||
- bootstrap recovery works
|
||||
- score-based selection respects hard policy checks
|
||||
- stale runtime observations are ignored after directory version change
|
||||
- service adapter cannot bypass Fabric peer selection
|
||||
|
||||
## 17. C15 Preparation
|
||||
|
||||
C15 must define the Fabric Routing Engine skeleton boundary.
|
||||
|
||||
The routing engine will consume:
|
||||
|
||||
- peer directory/cache
|
||||
- route policy
|
||||
- QoS policy
|
||||
- channel class
|
||||
- service request metadata
|
||||
- cluster/organization scope
|
||||
- failure history
|
||||
|
||||
C15 must not carry production mesh traffic. It should define route request and
route result boundaries before runtime routing exists.
|
||||
|
||||
## 18. Result / Decision
|
||||
|
||||
Stage C14 defines scoped peer discovery and peer cache behavior.
|
||||
|
||||
Decisions:
|
||||
|
||||
- nodes maintain active, warm, and cold/bootstrap peer classes
|
||||
- nodes do not maintain full mesh connections
|
||||
- peer directory data is scoped and signed
|
||||
- peer cache combines signed directory data with runtime observations
|
||||
- peer selection is score-based with hard policy checks first
|
||||
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
|
||||
- service adapters do not own peer discovery or route selection
|
||||
- C15 must define the Fabric Routing Engine skeleton before mesh runtime
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C14.
|
||||
@@ -0,0 +1,518 @@
|
||||
# Fabric Routing Engine Skeleton
|
||||
|
||||
Status: Stage C15 result. Documentation and architecture only.
|
||||
|
||||
This document defines the Fabric Routing Engine skeleton boundary. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
The Fabric Routing Engine is the logical Fabric layer responsible for choosing
authorized paths between ingress, core, egress, service, storage, and future
VPN/IP-tunnel components.
|
||||
|
||||
C15 defines the route decision boundary before runtime mesh routing exists.
|
||||
|
||||
The purpose is to ensure that future routing:
|
||||
|
||||
- is policy-aware
|
||||
- is QoS-aware
|
||||
- is channel-aware
|
||||
- respects cluster and organization boundaries
|
||||
- uses scoped local state and peer cache
|
||||
- does not depend on live backend availability for realtime decisions
|
||||
- is not implemented independently by Service Adapters
|
||||
|
||||
## 2. Non-Goals
|
||||
|
||||
C15 does not:
|
||||
|
||||
- carry production mesh traffic
|
||||
- implement node-to-node transport
|
||||
- implement relay forwarding
|
||||
- implement VPN/IP tunnel packets
|
||||
- implement QUIC/WebRTC
|
||||
- implement route execution
|
||||
- implement service workloads
|
||||
- change RDP runtime
|
||||
- change backend session lifecycle
|
||||
- change Windows client behavior
|
||||
|
||||
It defines contracts and responsibilities only.
|
||||
|
||||
## 3. Routing Engine Responsibilities
|
||||
|
||||
The Fabric Routing Engine owns:
|
||||
|
||||
- route request validation
|
||||
- peer candidate filtering
|
||||
- route scoring
|
||||
- channel-aware path selection
|
||||
- QoS class selection
|
||||
- route cache lookup/update policy
|
||||
- failover decision boundaries
|
||||
- shortcut recommendation boundaries
|
||||
- topology hiding
|
||||
- policy and cluster-boundary enforcement
|
||||
- service adapter routing integration boundary
|
||||
|
||||
The Routing Engine does not own:
|
||||
|
||||
- PostgreSQL source-of-truth mutation
|
||||
- service protocol translation
|
||||
- RDP/VNC/SSH/VPN implementation details
|
||||
- raw packet forwarding
|
||||
- direct secret resolution
|
||||
- organization admin visibility
|
||||
- node enrollment authority
|
||||
|
||||
## 4. Inputs
|
||||
|
||||
Routing decisions may consume:
|
||||
|
||||
- signed scoped cluster snapshot
|
||||
- node-local peer cache
|
||||
- route cache
|
||||
- peer directory
|
||||
- route policy
|
||||
- QoS policy
|
||||
- service assignment cache
|
||||
- cluster membership
|
||||
- organization scope
|
||||
- service/resource scope
|
||||
- channel class
|
||||
- current health/degraded state
|
||||
- partition/authority state
|
||||
- failure history
|
||||
- load and latency observations
|
||||
|
||||
Routing decisions must not require a live backend call in the realtime path.
|
||||
|
||||
## 5. Route Request Contract
|
||||
|
||||
A route request is a logical request for a path. It is not a packet.
|
||||
|
||||
Required fields:
|
||||
|
||||
- `request_id`
|
||||
- `cluster_id`
|
||||
- `organization_id` where applicable
|
||||
- `source_node_id`
|
||||
- `source_role`
|
||||
- `destination_kind`
|
||||
- `destination_ref`
|
||||
- `service_type`
|
||||
- `channel_class`
|
||||
- `priority_class`
|
||||
- `policy_refs`
|
||||
- `requested_at`
|
||||
|
||||
Destination kinds:
|
||||
|
||||
- `node`
|
||||
- `egress_pool`
|
||||
- `service_instance`
|
||||
- `resource_target`
|
||||
- `vpn_connection`
|
||||
- `storage_scope`
|
||||
- `control_plane_endpoint`
|
||||
|
||||
Optional fields:
|
||||
|
||||
- `session_id`
|
||||
- `attachment_id`
|
||||
- `resource_id`
|
||||
- `user_id`
|
||||
- `device_id`
|
||||
- `region_preference`
|
||||
- `required_capabilities`
|
||||
- `forbidden_nodes`
|
||||
- `preferred_nodes`
|
||||
- `max_latency_ms`
|
||||
- `min_bandwidth_hint`
|
||||
- `stickiness_key`
|
||||
- `previous_route_id`
|
||||
- `failure_context`
|
||||
|
||||
Service adapters may create route requests through an adapter-facing boundary,
but they must not select peers or paths themselves.
|
||||
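
To make the request contract concrete, the sketch below models the required fields plus one optional field and validates the enumerated values. The channel class names are the initial classes this document defines in its Channel Classes section; the dataclass layout, the error strings, and the validation rules themselves are assumptions, not an API defined by C15.

```python
from dataclasses import dataclass
from typing import Optional

DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})
CHANNEL_CLASSES = frozenset({
    "control", "input", "render", "cursor", "clipboard",
    "file_transfer", "telemetry", "vpn_packet", "storage_fetch", "update_fetch",
})


@dataclass(frozen=True)
class RouteRequest:
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: tuple[str, ...]
    requested_at: float
    organization_id: Optional[str] = None  # "where applicable"
    max_latency_ms: Optional[int] = None   # one of the optional fields


def validate_request(req: RouteRequest) -> list[str]:
    """Return a list of validation errors; an empty list means the request is well formed."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind}")
    if req.channel_class not in CHANNEL_CLASSES:
        errors.append(f"unknown channel_class: {req.channel_class}")
    if req.max_latency_ms is not None and req.max_latency_ms <= 0:
        errors.append("max_latency_ms must be positive")
    return errors
```

Note that the request carries no peer or path fields, which matches the rule that adapters describe what they need and leave selection to the Routing Engine.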
|
||||
## 6. Route Result Contract
|
||||
|
||||
A route result is a signed or locally verifiable decision artifact for a
bounded time.
|
||||
|
||||
Required fields:
|
||||
|
||||
- `route_id`
|
||||
- `request_id`
|
||||
- `cluster_id`
|
||||
- `organization_id` where applicable
|
||||
- `route_class`
|
||||
- `channel_class`
|
||||
- `selected_path`
|
||||
- `selected_qos_class`
|
||||
- `score`
|
||||
- `valid_from`
|
||||
- `expires_at`
|
||||
- `route_epoch`
|
||||
- `policy_version`
|
||||
- `decision_reason`
|
||||
|
||||
Selected path contains ordered logical hops:
|
||||
|
||||
- source node
|
||||
- optional ingress node
|
||||
- zero or more core/relay nodes
|
||||
- optional egress/service node
|
||||
- target/service endpoint
|
||||
|
||||
Optional fields:
|
||||
|
||||
- `fallback_paths`
|
||||
- `shortcut_candidate`
|
||||
- `stickiness_key`
|
||||
- `drain_after`
|
||||
- `degraded_mode`
|
||||
- `constraints_applied`
|
||||
- `rejection_reason`
|
||||
|
||||
Route results must be bounded by expiry, policy version, route epoch, and
cluster authority state.
|
||||
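
The four bounds on a route result (expiry, policy version, route epoch, and cluster authority state) can be checked before every use. The sketch below models only a subset of the required fields, and the `authority_ok` flag is a hypothetical stand-in for whatever the cluster authority state check turns out to be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteResult:
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: tuple[str, ...]  # ordered logical hops
    valid_from: float
    expires_at: float
    route_epoch: int
    policy_version: int


def is_usable(
    result: RouteResult,
    now: float,
    current_epoch: int,
    current_policy_version: int,
    authority_ok: bool,
) -> bool:
    """A result is usable only while every bound still holds."""
    if result.route_class == "unavailable":
        return False
    if not authority_ok:
        return False
    if not (result.valid_from <= now < result.expires_at):
        return False
    return (
        result.route_epoch == current_epoch
        and result.policy_version == current_policy_version
    )
```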
|
||||
## 7. Channel Classes
|
||||
|
||||
Routing is channel-aware.
|
||||
|
||||
Initial channel classes:
|
||||
|
||||
- `control`
|
||||
- `input`
|
||||
- `render`
|
||||
- `cursor`
|
||||
- `clipboard`
|
||||
- `file_transfer`
|
||||
- `telemetry`
|
||||
- `vpn_packet`
|
||||
- `storage_fetch`
|
||||
- `update_fetch`
|
||||
|
||||
Rules:
|
||||
|
||||
- `input` and critical `control` prefer lowest latency and lowest jitter.
|
||||
- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
|
||||
- `cursor` is latest-only and should use low-latency paths.
|
||||
- `clipboard` is reliable and bounded.
|
||||
- `file_transfer` prefers throughput but must not starve input/control/render.
|
||||
- `telemetry` is low priority and may be sampled or dropped.
|
||||
- `vpn_packet` uses adaptive QoS and bulk protection.
|
||||
- `storage_fetch` and `update_fetch` should not consume interactive reserves.
|
||||
|
||||
## 8. Route Classes
|
||||
|
||||
Initial route classes:
|
||||
|
||||
- `direct`
|
||||
- `single_relay`
|
||||
- `multi_hop`
|
||||
- `storage_local`
|
||||
- `storage_remote`
|
||||
- `vpn_chained`
|
||||
- `degraded_existing`
|
||||
- `unavailable`
|
||||
|
||||
`direct`:
|
||||
|
||||
- selected when source can safely reach destination directly
|
||||
- trust and policy must allow it
|
||||
|
||||
`single_relay`:
|
||||
|
||||
- selected when one relay improves connectivity or policy requires relay
|
||||
|
||||
`multi_hop`:
|
||||
|
||||
- selected when direct/single relay is unavailable or policy/region requires it
|
||||
|
||||
`storage_local` / `storage_remote`:
|
||||
|
||||
- used for config/snapshot/artifact fetch decisions
|
||||
|
||||
`vpn_chained`:
|
||||
|
||||
- used when a managed service or IP tunnel depends on a logical
  `vpn_connection`
|
||||
|
||||
`degraded_existing`:
|
||||
|
||||
- keeps an already-authorized existing path alive while policy permits
|
||||
|
||||
`unavailable`:
|
||||
|
||||
- explicit denial or no valid route
|
||||
|
||||
## 9. Hard Policy Checks
|
||||
|
||||
Hard checks run before scoring.
|
||||
|
||||
Reject route when:
|
||||
|
||||
- source node is not trusted
|
||||
- source node is not a member of the cluster
|
||||
- destination is outside cluster scope
|
||||
- cross-cluster trust is missing
|
||||
- organization scope does not match
|
||||
- role assignment does not permit the route
|
||||
- peer certificate is invalid or revoked
|
||||
- required channel is not authorized
|
||||
- partition/authority state forbids new route
|
||||
- destination node is draining or disabled and policy forbids placement
|
||||
- route would leak topology or tenant data
|
||||
|
||||
No score can override hard policy rejection.
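
One way to make "hard checks run before scoring" explicit is an ordered table of rejection predicates evaluated against a decision context. The context keys and reason strings below are hypothetical, and only some of the listed rejection conditions are modelled.

```python
from typing import Callable, Optional

# Each entry is (rejection_reason, predicate that returns True when the route must be rejected).
HARD_CHECKS: list[tuple[str, Callable[[dict], bool]]] = [
    ("source_not_trusted", lambda c: not c["source_trusted"]),
    ("source_not_cluster_member", lambda c: c["source_cluster"] != c["cluster_id"]),
    ("destination_outside_cluster_scope",
     lambda c: c["destination_cluster"] != c["cluster_id"] and not c["cross_cluster_trust"]),
    ("organization_scope_mismatch", lambda c: c["request_org"] != c["destination_org"]),
    ("peer_cert_invalid_or_revoked", lambda c: not c["peer_cert_valid"] or c["peer_revoked"]),
    ("channel_not_authorized", lambda c: c["channel_class"] not in c["authorized_channels"]),
    ("partition_forbids_new_route", lambda c: c["partition_blocks_new_routes"]),
]


def first_rejection(ctx: dict) -> Optional[str]:
    """Return the first hard-check failure, or None when all checks pass."""
    for reason, rejects in HARD_CHECKS:
        if rejects(ctx):
            return reason
    return None


def gated_score(ctx: dict, score: float) -> Optional[float]:
    """Expose the soft score only when every hard check has passed."""
    return None if first_rejection(ctx) is not None else score
```

The returned reason maps naturally onto the `rejection_reason` field of an `unavailable` route result.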
|
||||
|
||||
## 10. Scoring Inputs
|
||||
|
||||
Soft scoring inputs:
|
||||
|
||||
- latency
|
||||
- jitter
|
||||
- packet loss
|
||||
- reliability
|
||||
- recent failure history
|
||||
- region distance
|
||||
- load
|
||||
- available bandwidth
|
||||
- role suitability
|
||||
- route length
|
||||
- service co-location
|
||||
- stickiness preference
|
||||
- cost preference
|
||||
- policy preference
|
||||
- health score
|
||||
|
||||
Scoring weights are policy-driven and may differ by channel class.
|
||||
|
||||
Example:
|
||||
|
||||
- input/control heavily weight latency and jitter
|
||||
- file transfer heavily weights throughput and reliability
|
||||
- VPN bulk considers QoS impact on interactive routes
|
||||
- storage fetch considers locality and replica freshness
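
Policy-driven, per-channel weights can be illustrated with a weighted sum over normalized metrics. The weight values and the metric names are invented for the example; real weights would come from route and QoS policy.

```python
# Hypothetical weights per channel class; each row sums to 1.0.
WEIGHTS = {
    "input": {"latency": 0.5, "jitter": 0.3, "reliability": 0.2, "throughput": 0.0},
    "file_transfer": {"latency": 0.05, "jitter": 0.05, "reliability": 0.4, "throughput": 0.5},
}


def score_path(metrics: dict[str, float], channel_class: str) -> float:
    """Weighted sum of metrics normalized to 0..1, where higher is better."""
    weights = WEIGHTS[channel_class]
    return sum(weight * metrics[name] for name, weight in weights.items())


low_latency_path = {"latency": 0.9, "jitter": 0.9, "reliability": 0.8, "throughput": 0.2}
high_bandwidth_path = {"latency": 0.4, "jitter": 0.5, "reliability": 0.8, "throughput": 0.95}
```

With these numbers the same two candidate paths rank differently per channel: `input` prefers `low_latency_path`, while `file_transfer` prefers `high_bandwidth_path`.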
|
||||
|
||||
## 11. Route Cache Relationship
|
||||
|
||||
Route cache is local and bounded.
|
||||
|
||||
Cache key inputs:
|
||||
|
||||
- cluster id
|
||||
- organization id
|
||||
- source node
|
||||
- destination kind/ref
|
||||
- service type
|
||||
- channel class
|
||||
- policy version
|
||||
- route epoch
|
||||
- stickiness key
|
||||
|
||||
Cache entries contain:
|
||||
|
||||
- route result
|
||||
- expiry
|
||||
- score
|
||||
- last success/failure
|
||||
- backoff state
|
||||
- fallback candidates
|
||||
|
||||
Cache invalidation triggers:
|
||||
|
||||
- policy version change
|
||||
- peer directory version change
|
||||
- trust/revocation update
|
||||
- route epoch change
|
||||
- health state change
|
||||
- repeated route failure
|
||||
- expiry
|
||||
|
||||
Route cache is a performance aid, not route authority.
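
A small sketch of a bounded, expiring route cache follows. Putting `policy_version` and `route_epoch` into the key means a version change simply misses the cache. The class names, the oldest-first eviction rule, and the blanket `clear` on directory or trust changes are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteCacheKey:
    cluster_id: str
    organization_id: str
    source_node_id: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: str = ""


class RouteCache:
    def __init__(self, max_entries: int = 1024) -> None:
        self._max = max_entries
        self._entries: dict[RouteCacheKey, tuple[float, dict]] = {}

    def __len__(self) -> int:
        return len(self._entries)

    def put(self, key: RouteCacheKey, result: dict, expires_at: float) -> None:
        if key not in self._entries and len(self._entries) >= self._max:
            # Evict the oldest inserted entry to stay within the bound.
            self._entries.pop(next(iter(self._entries)))
        self._entries[key] = (expires_at, result)

    def get(self, key: RouteCacheKey, now: float) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, result = entry
        if now >= expires_at:
            del self._entries[key]  # expired entries are dropped on access
            return None
        return result

    def clear(self) -> None:
        """Drop everything, e.g. on peer directory or trust/revocation changes."""
        self._entries.clear()
```

A cache miss always falls back to a fresh routing decision, so losing the cache affects latency only.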
|
||||
|
||||
## 12. Failover Boundaries
|
||||
|
||||
Failover decisions may:
|
||||
|
||||
- switch from failed active path to fallback path
|
||||
- promote warm peer path
|
||||
- retry through bootstrap route for recovery
|
||||
- mark route unavailable
|
||||
- request control-plane/config refresh when reachable
|
||||
- keep degraded existing path alive if policy permits
|
||||
|
||||
Failover decisions must not:
|
||||
|
||||
- create new cluster authority
|
||||
- bypass policy
|
||||
- add nodes
|
||||
- approve role changes
|
||||
- cross cluster boundaries without explicit trust
|
||||
- expose topology to organizations
|
||||
|
||||
## 13. Shortcut Decision Boundary
|
||||
|
||||
Shortcut connections are optional optimization recommendations.
|
||||
|
||||
A shortcut may be recommended when:
|
||||
|
||||
- long-lived flow exists
|
||||
- current path latency/jitter is high
|
||||
- direct connectivity appears possible
|
||||
- trust validation succeeds
|
||||
- policy allows shortcut
|
||||
- shortcut improves latency, jitter, or bandwidth
|
||||
- fallback path remains available
|
||||
|
||||
Shortcut recommendation output:
|
||||
|
||||
- source node
|
||||
- destination node
|
||||
- channel classes affected
|
||||
- expected improvement
|
||||
- required validation
|
||||
- expiry
|
||||
- fallback route id
|
||||
|
||||
C15 does not implement shortcut connections. It only defines when a future
Routing Engine may recommend them.
|
||||
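
The shortcut conditions combine into a single conjunction: every safety condition must hold, and the expected gain must be worth it. The sketch below uses latency as the only improvement metric and invents the thresholds; both are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShortcutInputs:
    flow_age_s: float
    current_latency_ms: float
    estimated_direct_latency_ms: float
    direct_reachable: bool
    trust_valid: bool
    policy_allows: bool
    fallback_available: bool


def should_recommend_shortcut(
    s: ShortcutInputs,
    min_flow_age_s: float = 60.0,  # assumed threshold for "long-lived"
    min_gain_ms: float = 10.0,     # assumed minimum worthwhile improvement
) -> bool:
    """Recommend only when all safety conditions hold and the gain is large enough."""
    if not (s.direct_reachable and s.trust_valid and s.policy_allows):
        return False
    if not s.fallback_available:
        return False  # never recommend a shortcut without a fallback path
    if s.flow_age_s < min_flow_age_s:
        return False
    return s.current_latency_ms - s.estimated_direct_latency_ms >= min_gain_ms
```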
|
||||
## 14. Service Adapter Integration
|
||||
|
||||
Service Adapters may ask for routes using service-neutral metadata.
|
||||
|
||||
Examples:
|
||||
|
||||
- RDP Adapter requests route to RDP service/egress node or resource target.
|
||||
- VNC Adapter requests route to VNC target zone.
|
||||
- SSH Adapter requests route to SSH target.
|
||||
- VPN/IP tunnel service requests route through `vpn_connection`.
|
||||
- Storage fetch requests route to config/storage scope.
|
||||
|
||||
Service Adapters must not:
|
||||
|
||||
- enumerate peers
|
||||
- select mesh paths
|
||||
- create relay chains
|
||||
- create shortcuts
|
||||
- implement failover policy
|
||||
- implement partition recovery
|
||||
- implement cross-cluster routing trust
|
||||
|
||||
The adapter consumes a route result and sends/receives through the approved
data-plane boundary when runtime exists.
|
||||
|
||||
## 15. Topology Hiding
|
||||
|
||||
Organizations see:
|
||||
|
||||
- allowed service endpoints
|
||||
- safe ingress/egress status
|
||||
- safe session/resource status
|
||||
- policy-visible route dependency names where allowed
|
||||
|
||||
Organizations must not see:
|
||||
|
||||
- intermediate core mesh nodes
|
||||
- full peer directory
|
||||
- route cache
|
||||
- shortcut candidates
|
||||
- other organizations' route data
|
||||
- storage shard placement
|
||||
|
||||
Platform owners may inspect routing internals according to audited platform
policy.
|
||||
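
Topology hiding is easiest to get right with an allowlist: the tenant-facing view is built from named safe fields rather than by deleting sensitive ones. Which fields count as tenant-visible is an assumption in this sketch, as is exposing the final hop of `selected_path` as the service endpoint.

```python
# Hypothetical allowlist of route-result fields that are safe for organizations.
TENANT_VISIBLE_FIELDS = ("request_id", "channel_class", "expires_at")


def tenant_view(route_result: dict) -> dict:
    """Build an organization-facing view that omits intermediate topology."""
    view = {k: route_result[k] for k in TENANT_VISIBLE_FIELDS if k in route_result}
    path = route_result.get("selected_path") or []
    if path:
        # Only the final target/service endpoint is exposed, never the hops.
        view["service_endpoint"] = path[-1]
    return view
```

New internal fields added to route results later stay hidden by default because they are not on the allowlist.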
|
||||
## 16. Degraded and Partition Behavior
|
||||
|
||||
In degraded mode, Routing Engine may:
|
||||
|
||||
- keep existing authorized routes alive until TTL
|
||||
- use last signed snapshot for recovery
|
||||
- select fallback among already-authorized peers
|
||||
- mark route unavailable when safety cannot be proven
|
||||
|
||||
In degraded mode, Routing Engine must not:
|
||||
|
||||
- authorize new high-risk routes
|
||||
- mutate cluster trust
|
||||
- approve nodes
|
||||
- assign roles
|
||||
- promote partition authority automatically
|
||||
- create cross-cluster trust
|
||||
|
||||
## 17. Observability
|
||||
|
||||
Routing decisions should emit safe telemetry:
|
||||
|
||||
- route selected
|
||||
- route rejected
|
||||
- rejection reason
|
||||
- route class
|
||||
- channel class
|
||||
- score bucket
|
||||
- latency/jitter/packet loss summary
|
||||
- failover count
|
||||
- fallback used
|
||||
- shortcut recommended
|
||||
- policy version
|
||||
- peer directory version
|
||||
- route epoch
|
||||
|
||||
Tenant-visible telemetry must hide topology.
|
||||
|
||||
## 18. Future Validation Tests
|
||||
|
||||
Future implementation tests must prove:
|
||||
|
||||
- route request rejects wrong cluster
|
||||
- route request rejects wrong organization
|
||||
- revoked peer is not selected
|
||||
- unavailable route returns explicit result
|
||||
- cache invalidates on policy version change
|
||||
- cache invalidates on peer directory version change
|
||||
- input route prefers latency over throughput
|
||||
- file transfer route does not starve input class
|
||||
- service adapter cannot bypass routing engine
|
||||
- shortcut recommendation requires fallback path
|
||||
- degraded mode does not authorize new forbidden routes
|
||||
|
||||
## 19. C16 Preparation
|
||||
|
||||
C16 must define the secure node-to-node channel lifecycle that can later carry
route-selected traffic.
|
||||
|
||||
C16 must preserve:
|
||||
|
||||
- routing results are bounded and policy-scoped
|
||||
- channels are authenticated and authorized
|
||||
- trust/revocation affects active channels
|
||||
- Service Adapters remain above Fabric routing
|
||||
- no mesh packet routing starts before explicit C17
|
||||
|
||||
## 20. Result / Decision
|
||||
|
||||
Stage C15 defines the Fabric Routing Engine as a skeleton boundary for route
requests, route results, scoring, cache relationship, failover, shortcut
recommendations, topology hiding, and Service Adapter integration.
|
||||
|
||||
Decisions:
|
||||
|
||||
- Routing belongs to Fabric, not Service Adapters.
|
||||
- Route requests/results are logical contracts, not packet forwarding.
|
||||
- Hard policy checks precede scoring.
|
||||
- Route cache is local, bounded, and non-authoritative.
|
||||
- Routing is channel-aware and QoS-aware.
|
||||
- Shortcut connections are future optional recommendations, not C15 runtime.
|
||||
- C16 must define secure node-to-node channels before mesh routing runtime.
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C15.
|
||||
@@ -0,0 +1,333 @@
|
||||
# Fabric Storage / Config Storage Service
|
||||
|
||||
Status: Stage C13 result. Documentation and architecture only.
|
||||
|
||||
This document defines the Fabric Storage / Config Storage service foundation.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
Fabric Storage / Config Storage is the scoped distribution layer for Fabric
Core configuration artifacts.
|
||||
|
||||
It distributes and caches:
|
||||
|
||||
- signed scoped cluster snapshots
|
||||
- incremental snapshot updates
|
||||
- peer directories
|
||||
- trust bundles
|
||||
- revocation metadata
|
||||
- update artifact metadata
|
||||
- service assignment/config artifacts
|
||||
|
||||
It exists so nodes can refresh local state quickly and reliably without asking
the backend database for every realtime routing or supervision decision.
|
||||
|
||||
## 2. Non-Goals
|
||||
|
||||
Fabric Storage / Config Storage is not:
|
||||
|
||||
- a replacement for PostgreSQL
|
||||
- a second source of truth
|
||||
- a general-purpose distributed database
|
||||
- an arbitrary query engine
|
||||
- a tenant-visible topology database
|
||||
- a place for raw secrets
|
||||
- a durable runtime lease store
|
||||
- a high-rate realtime data-plane relay
|
||||
|
||||
Nodes must not write authoritative configuration directly into Fabric Storage.
|
||||
|
||||
## 3. Authority Model
|
||||
|
||||
Authoritative flow:
|
||||
|
||||
```text
|
||||
PostgreSQL
|
||||
-> control-plane config compiler
|
||||
-> signed scoped artifact
|
||||
-> Fabric Storage / Config Storage distribution
|
||||
-> node-agent local state
|
||||
```
|
||||
|
||||
Only the control plane, or a tightly scoped config compiler operating under
|
||||
control-plane authority, may produce authoritative signed configuration
|
||||
artifacts.
|
||||
|
||||
Fabric Storage may replicate and serve artifacts. It does not decide policy.
|
||||
|
||||
## 4. Artifact Types
|
||||
|
||||
Supported target artifact families:
|
||||
|
||||
- `cluster_snapshot`
|
||||
- `snapshot_increment`
|
||||
- `peer_directory`
|
||||
- `trust_bundle`
|
||||
- `revocation_list`
|
||||
- `service_assignment_bundle`
|
||||
- `route_policy_bundle`
|
||||
- `qos_policy_bundle`
|
||||
- `update_manifest`
|
||||
- `storage_directory`
|
||||
|
||||
Each artifact must carry:
|
||||
|
||||
- artifact id
|
||||
- artifact type
|
||||
- cluster id
|
||||
- scope ids
|
||||
- config version
|
||||
- authority epoch
|
||||
- issued at
|
||||
- expires at or refresh deadline
|
||||
- signer key id
|
||||
- content hash
|
||||
- signature or signature reference
|
||||
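The required metadata can be pictured as a small header type with a validation step. The following is a minimal sketch, not a defined schema: the Go type `ArtifactHeader`, its field names and types, the `Validate` method, and the `exampleHeader` helper are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ArtifactHeader mirrors the required metadata list.
// Field names and types are assumptions for illustration.
type ArtifactHeader struct {
	ArtifactID     string
	ArtifactType   string
	ClusterID      string
	ScopeIDs       []string
	ConfigVersion  uint64
	AuthorityEpoch uint64
	IssuedAt       time.Time
	ExpiresAt      time.Time // expiry or refresh deadline
	SignerKeyID    string
	ContentHash    string // for example, hex-encoded SHA-256
	SignatureRef   string // inline signature or a reference to one
}

// knownTypes lists the artifact families named in this section.
var knownTypes = map[string]bool{
	"cluster_snapshot": true, "snapshot_increment": true,
	"peer_directory": true, "trust_bundle": true,
	"revocation_list": true, "service_assignment_bundle": true,
	"route_policy_bundle": true, "qos_policy_bundle": true,
	"update_manifest": true, "storage_directory": true,
}

// Validate checks that required fields are present and the artifact has not expired.
// It does not verify the signature or the content hash.
func (h ArtifactHeader) Validate(now time.Time) error {
	switch {
	case h.ArtifactID == "" || h.ClusterID == "" || h.SignerKeyID == "":
		return errors.New("missing artifact id, cluster id, or signer key id")
	case !knownTypes[h.ArtifactType]:
		return fmt.Errorf("unknown artifact type %q", h.ArtifactType)
	case len(h.ScopeIDs) == 0:
		return errors.New("missing scope ids")
	case h.ContentHash == "" || h.SignatureRef == "":
		return errors.New("missing content hash or signature")
	case h.IssuedAt.IsZero() || !h.ExpiresAt.After(h.IssuedAt):
		return errors.New("invalid issued-at or expiry timestamp")
	case !now.Before(h.ExpiresAt):
		return errors.New("artifact expired")
	}
	return nil
}

// exampleHeader returns a well-formed header and a matching "now" (hypothetical values).
func exampleHeader() (ArtifactHeader, time.Time) {
	now := time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	return ArtifactHeader{
		ArtifactID: "a-1", ArtifactType: "trust_bundle", ClusterID: "c-1",
		ScopeIDs: []string{"scope-1"}, ConfigVersion: 42, AuthorityEpoch: 3,
		IssuedAt: now.Add(-time.Hour), ExpiresAt: now.Add(time.Hour),
		SignerKeyID: "key-1", ContentHash: "abc123", SignatureRef: "sig-1",
	}, now
}

func main() {
	h, now := exampleHeader()
	fmt.Println("valid header:", h.Validate(now))
	h.ArtifactType = "mystery"
	fmt.Println("unknown type:", h.Validate(now))
}
```

Structural validation like this would run before signature and hash verification, so malformed artifacts are rejected cheaply.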
|
||||
## 5. Scope and Namespace Rules
|
||||
|
||||
Storage namespaces must be scoped by:
|
||||
|
||||
- platform
|
||||
- cluster
|
||||
- organization where applicable
|
||||
- service where applicable
|
||||
- role where applicable
|
||||
- node where applicable
|
||||
- artifact family
|
||||
|
||||
Example logical namespace:
|
||||
|
||||
```text
|
||||
platform/<platform_id>/
|
||||
cluster/<cluster_id>/
|
||||
trust/
|
||||
snapshots/node/<node_id>/
|
||||
snapshots/role/<role>/
|
||||
peers/scope/<scope_id>/
|
||||
services/<service_type>/
|
||||
updates/
|
||||
```
|
||||
|
||||
No node should receive access to namespaces outside its assigned cluster, role,
|
||||
service, and organization scope.
|
||||
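One way to enforce the namespace rule is a prefix check derived from the node's assignment. This is a sketch under stated assumptions: `NodeScope`, `AllowedPrefixes`, and `CanRead` are hypothetical names, the choice to let every node read its own cluster's `trust/` prefix is an assumption, and ids are assumed to be validated elsewhere so they cannot contain `/`.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeScope is a hypothetical description of what a node is assigned to.
type NodeScope struct {
	PlatformID string
	ClusterID  string
	NodeID     string
	Roles      []string
	Services   []string
}

// AllowedPrefixes lists the namespace prefixes this node may read.
func (s NodeScope) AllowedPrefixes() []string {
	base := fmt.Sprintf("platform/%s/cluster/%s/", s.PlatformID, s.ClusterID)
	out := []string{
		base + "trust/", // assumption: every cluster member may read cluster trust
		base + "snapshots/node/" + s.NodeID + "/",
	}
	for _, r := range s.Roles {
		out = append(out, base+"snapshots/role/"+r+"/")
	}
	for _, svc := range s.Services {
		out = append(out, base+"services/"+svc+"/")
	}
	return out
}

// CanRead reports whether path falls under one of the allowed prefixes.
// Paths containing "." or ".." segments are rejected outright.
func (s NodeScope) CanRead(path string) bool {
	for _, seg := range strings.Split(path, "/") {
		if seg == ".." || seg == "." {
			return false
		}
	}
	for _, p := range s.AllowedPrefixes() {
		if strings.HasPrefix(path, p) {
			return true
		}
	}
	return false
}

func main() {
	s := NodeScope{PlatformID: "p1", ClusterID: "c1", NodeID: "n1",
		Roles: []string{"entry"}, Services: []string{"rdp"}}
	fmt.Println(s.CanRead("platform/p1/cluster/c1/snapshots/role/entry/current"))
	fmt.Println(s.CanRead("platform/p1/cluster/c2/trust/bundle"))
}
```

Because every allowed prefix ends with `/`, a node assigned role `entry` does not accidentally match a sibling namespace such as `entry-admin`.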
|
||||
## 6. Replication Policy
|
||||
|
||||
Replication is policy-driven.
|
||||
|
||||
Inputs:
|
||||
|
||||
- artifact criticality
|
||||
- cluster size
|
||||
- failure domains
|
||||
- region placement
|
||||
- node role
|
||||
- organization isolation
|
||||
- service locality
|
||||
- update frequency
|
||||
- recovery time objective
|
||||
|
||||
Rules:
|
||||
|
||||
- critical trust and revocation artifacts replicate across failure domains
|
||||
- hot peer directories should be near entry/core/service nodes that use them
|
||||
- service config should be near assigned service nodes
|
||||
- organization-scoped artifacts must not replicate to unrelated org scopes
|
||||
- thin/mobile nodes should not become broad storage replicas
|
||||
- storage nodes may hold only assigned shards/scopes
|
||||
|
||||
## 7. Distribution Flows
|
||||
|
||||
Full snapshot refresh:
|
||||
|
||||
1. node-agent reports current config versions
|
||||
2. storage service returns available version metadata
|
||||
3. node-agent downloads full scoped snapshot if needed
|
||||
4. node-agent verifies signature and scope
|
||||
5. node-agent applies locally
|
||||
|
||||
Incremental update:
|
||||
|
||||
1. node-agent reports base version
|
||||
2. storage service returns matching increment chain
|
||||
3. node-agent verifies every increment
|
||||
4. node-agent applies only if base versions match
|
||||
5. version gap triggers full snapshot refresh
|
||||
|
||||
Trust/revocation update:
|
||||
|
||||
1. node-agent checks trust bundle/revocation version frequently
|
||||
2. storage service serves signed trust artifacts
|
||||
3. node-agent verifies using existing trust path
|
||||
4. revoked identities/keys immediately affect local validation
|
||||
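The incremental update flow can be sketched as an all-or-nothing chain walk. The `Increment` type, the `Verified` flag (standing in for signature, hash, and scope checks), and the `ErrNeedFullSnapshot` sentinel are illustrative assumptions, not a defined interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Increment is a hypothetical incremental update descriptor.
type Increment struct {
	BaseVersion uint64
	NewVersion  uint64
	Verified    bool // stands in for signature, hash, and scope verification
}

// ErrNeedFullSnapshot signals a version gap that requires a full refresh.
var ErrNeedFullSnapshot = errors.New("version gap: full snapshot refresh required")

// ApplyChain verifies every increment first, then checks that the chain
// links up from current. On any failure the current version is returned unchanged.
func ApplyChain(current uint64, chain []Increment) (uint64, error) {
	for i, inc := range chain {
		if !inc.Verified {
			return current, fmt.Errorf("increment %d failed verification", i)
		}
	}
	v := current
	for _, inc := range chain {
		if inc.BaseVersion != v || inc.NewVersion <= inc.BaseVersion {
			return current, ErrNeedFullSnapshot
		}
		v = inc.NewVersion
	}
	return v, nil
}

func main() {
	ok := []Increment{{5, 6, true}, {6, 7, true}}
	fmt.Println(ApplyChain(5, ok))

	gap := []Increment{{6, 7, true}}
	fmt.Println(ApplyChain(5, gap))
}
```

Returning the unchanged version on failure keeps the node on its last applied config until the full snapshot refresh succeeds.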
|
||||
## 8. Consistency and Invalidation
|
||||
|
||||
Artifacts are immutable by content hash.
|
||||
|
||||
New versions are published as new artifacts plus index updates.
|
||||
|
||||
Rules:
|
||||
|
||||
- node must validate content hash
|
||||
- node must reject stale authority epoch
|
||||
- node must reject invalid signature
|
||||
- node must reject wrong scope
|
||||
- storage index may cache version metadata but not override signatures
|
||||
- deletion/tombstone artifacts must be signed
|
||||
- revoked artifacts must not be served as current versions
|
||||
|
||||
Cache invalidation is version-based, not best-effort string deletion.
|
||||
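The node-side rules above can be sketched as a single verification function. This is an illustration only: the document does not specify a hash or signature algorithm, so SHA-256 and Ed25519 are assumptions, as are the `SignedArtifact` type, the `Sign` and `Verify` helpers, and the signed-message layout.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"errors"
	"fmt"
)

// SignedArtifact is a hypothetical artifact envelope.
type SignedArtifact struct {
	Scope          string
	AuthorityEpoch uint64
	ContentHash    [32]byte
	Content        []byte
	Signature      []byte
}

// signedBytes is the message covered by the signature in this sketch:
// scope, epoch, and content hash, so none can be altered independently.
func signedBytes(scope string, epoch uint64, hash [32]byte) []byte {
	return []byte(fmt.Sprintf("%s|%d|%x", scope, epoch, hash))
}

// Sign builds an artifact the way a control-plane compiler might.
func Sign(priv ed25519.PrivateKey, scope string, epoch uint64, content []byte) SignedArtifact {
	h := sha256.Sum256(content)
	return SignedArtifact{
		Scope: scope, AuthorityEpoch: epoch, ContentHash: h, Content: content,
		Signature: ed25519.Sign(priv, signedBytes(scope, epoch, h)),
	}
}

// Verify applies the node rules: content hash, authority epoch, scope, signature.
func Verify(a SignedArtifact, pub ed25519.PublicKey, wantScope string, minEpoch uint64) error {
	if sha256.Sum256(a.Content) != a.ContentHash {
		return errors.New("content hash mismatch")
	}
	if a.AuthorityEpoch < minEpoch {
		return errors.New("stale authority epoch")
	}
	if a.Scope != wantScope {
		return errors.New("wrong scope")
	}
	if !ed25519.Verify(pub, signedBytes(a.Scope, a.AuthorityEpoch, a.ContentHash), a.Signature) {
		return errors.New("invalid signature")
	}
	return nil
}

// demoKeys generates a throwaway key pair for the example.
func demoKeys() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := demoKeys()
	a := Sign(priv, "cluster/c1/role/entry", 7, []byte("config"))
	fmt.Println("fresh:", Verify(a, pub, "cluster/c1/role/entry", 7))
	fmt.Println("stale:", Verify(a, pub, "cluster/c1/role/entry", 8))
}
```

A storage node that alters content, scope, or epoch cannot produce a matching signature, which is what limits it to distribution and caching.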
|
||||
## 9. Storage Node Behavior
|
||||
|
||||
A storage/config node may:
|
||||
|
||||
- cache assigned artifacts
|
||||
- replicate assigned artifacts
|
||||
- serve artifacts to authorized nodes
|
||||
- report artifact availability
|
||||
- report replication health
|
||||
- evict cold artifacts according to policy
|
||||
|
||||
A storage/config node must not:
|
||||
|
||||
- modify artifact content
|
||||
- sign artifacts
|
||||
- invent new config versions
|
||||
- widen scope
|
||||
- bypass authorization
|
||||
- serve unrelated org/cluster data
|
||||
- accept node writes as authoritative config
|
||||
|
||||
## 10. Authorization
|
||||
|
||||
Artifact fetch authorization must check:
|
||||
|
||||
- node identity
|
||||
- cluster membership
|
||||
- role assignment
|
||||
- artifact scope
|
||||
- organization scope where applicable
|
||||
- artifact type
|
||||
- trust/revocation status
|
||||
- partition/degraded policy
|
||||
|
||||
Storage service authorization may use:
|
||||
|
||||
- mTLS node identity
|
||||
- short-lived scoped tokens
|
||||
- signed node snapshot claims
|
||||
- control-plane issued fetch grants
|
||||
|
||||
Tenant users and organization admins must not directly query internal storage
|
||||
namespaces. They see safe status projections through control-plane APIs.
|
||||
|
||||
## 11. Failure and Degraded Behavior
|
||||
|
||||
If local storage service is unavailable, node-agent recovery order is:
|
||||
|
||||
1. try alternate local/nearby storage endpoint
|
||||
2. try active peers that advertise config/storage availability
|
||||
3. try bootstrap/config endpoints from last signed snapshot
|
||||
4. contact control plane if reachable
|
||||
5. continue from last valid local snapshot only if degraded policy allows it
|
||||
|
||||
Storage service outage must not grant new authority.
|
||||
|
||||
Nodes must not perform high-risk actions based on missing or stale storage.
|
||||
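The recovery order above amounts to an ordered fallback with a policy-gated last resort. A minimal sketch follows; the `Source` type, the `Refresh` function, and the `failing`/`serving` helpers are hypothetical and only model reachability.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnavailable is returned by a source that cannot be reached.
var ErrUnavailable = errors.New("source unavailable")

// Source is one place a node-agent can try to fetch config from.
type Source struct {
	Name  string
	Fetch func() (uint64, error)
}

// failing and serving build demo sources.
func failing(name string) Source {
	return Source{name, func() (uint64, error) { return 0, ErrUnavailable }}
}

func serving(name string, version uint64) Source {
	return Source{name, func() (uint64, error) { return version, nil }}
}

// Refresh tries sources in order. If all fail, it continues from the local
// snapshot only when that snapshot is valid and degraded policy allows it.
func Refresh(sources []Source, localVersion uint64, localValid, degradedAllowed bool) (uint64, string, error) {
	for _, s := range sources {
		if v, err := s.Fetch(); err == nil {
			return v, s.Name, nil
		}
	}
	if localValid && degradedAllowed {
		return localVersion, "local-snapshot-degraded", nil
	}
	return 0, "", errors.New("no config source reachable and degraded mode not permitted")
}

func main() {
	order := []Source{
		failing("alternate-storage"),
		failing("peer-storage"),
		serving("bootstrap-endpoint", 42),
		serving("control-plane", 43),
	}
	fmt.Println(Refresh(order, 40, true, true))
	fmt.Println(Refresh(order[:2], 40, true, false))
}
```

The fallback never produces a version the node did not already hold or fetch from a listed source, which matches the rule that an outage grants no new authority.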
|
||||
## 12. Operational Observability
|
||||
|
||||
Storage service should report:
|
||||
|
||||
- artifact family health
|
||||
- replication lag
|
||||
- missing replica count
|
||||
- stale shard count
|
||||
- fetch latency
|
||||
- fetch failures
|
||||
- authorization denials
|
||||
- version gaps
|
||||
- signature/hash validation failures reported by nodes
|
||||
- storage capacity
|
||||
- eviction stats
|
||||
|
||||
Audit/control-plane events should include:
|
||||
|
||||
- artifact published
|
||||
- artifact revoked/tombstoned
|
||||
- replication policy changed
|
||||
- storage role assigned/removed
|
||||
- unauthorized fetch denied
|
||||
- critical artifact under-replicated
|
||||
|
||||
## 13. Security Requirements
|
||||
|
||||
Required:
|
||||
|
||||
- encrypted node-to-storage transport
|
||||
- authenticated node identity
|
||||
- scoped fetch authorization
|
||||
- immutable signed artifacts
|
||||
- hash verification
|
||||
- no raw secrets in broad artifacts
|
||||
- namespace isolation
|
||||
- audit for high-risk storage/admin actions
|
||||
|
||||
Compromised storage node blast radius must be limited:
|
||||
|
||||
- it cannot sign valid new artifacts
|
||||
- it cannot serve data outside assigned scopes
|
||||
- it cannot modify signed content without detection
|
||||
- it cannot become authoritative truth
|
||||
- nodes reject invalid signatures/hashes
|
||||
|
||||
## 14. Relationship to Runtime State
|
||||
|
||||
Fabric Storage is for configuration and distribution, not realtime runtime
|
||||
coordination.
|
||||
|
||||
Runtime state remains elsewhere:
|
||||
|
||||
- PostgreSQL for durable lifecycle/audit/state
|
||||
- Redis for live coordination/leases/heartbeats/ephemeral routing hints
|
||||
- node-local state for local cache/runtime observations
|
||||
|
||||
Do not store high-rate render frames, input streams, VPN packets, or relay
|
||||
traffic in Fabric Storage.
|
||||
|
||||
## 15. C14 Preparation
|
||||
|
||||
C14 must define the peer directory and peer cache model that Fabric Storage may
|
||||
distribute and node-agent may store locally.
|
||||
|
||||
C14 must preserve:
|
||||
|
||||
- storage service is distribution/cache only
|
||||
- peer directories are scoped
|
||||
- nodes do not learn full topology unless role requires it
|
||||
- routing decisions belong to Fabric Routing Engine, not Service Adapters
|
||||
|
||||
## 16. Result / Decision
|
||||
|
||||
Stage C13 defines Fabric Storage / Config Storage as a scoped distribution and
|
||||
cache service for signed Fabric Core artifacts.
|
||||
|
||||
Decisions:
|
||||
|
||||
- Fabric Storage distributes signed artifacts but does not author them
|
||||
- PostgreSQL remains authoritative
|
||||
- artifacts are immutable by content hash
|
||||
- invalidation is version-based
|
||||
- replication is policy-driven and scope-bound
|
||||
- storage nodes may cache and serve only assigned scopes
|
||||
- storage service is not a realtime data-plane relay
|
||||
- storage service is not a general-purpose database
|
||||
- C14 must define the peer directory/cache artifacts and local runtime use
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
|
||||
workload behavior is changed by C13.
|
||||
@@ -0,0 +1,419 @@
|
||||
# Node Local State Store
|
||||
|
||||
Status: Stage C12 result. Documentation and architecture only.
|
||||
|
||||
This document defines the node-local state store model for native
|
||||
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
|
||||
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
|
||||
workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
The node-local state store lets `rap-node-agent` operate safely without asking
|
||||
the backend for every realtime routing or service supervision decision.
|
||||
|
||||
The local store must support:
|
||||
|
||||
- node identity persistence
|
||||
- cluster membership state
|
||||
- signed scoped snapshot storage
|
||||
- peer cache
|
||||
- route cache
|
||||
- service assignment cache
|
||||
- local health and degraded-mode state
|
||||
- pending update metadata
|
||||
- recovery after process restart or host reboot
|
||||
|
||||
The local store must not become a durable source of truth.
|
||||
|
||||
## 2. Authority Boundaries
|
||||
|
||||
PostgreSQL remains authoritative for durable domain state.
|
||||
|
||||
Fabric Storage / Config Storage distributes signed snapshots and increments.
|
||||
|
||||
Node-local state stores verified local copies and runtime observations.
|
||||
|
||||
Redis remains live coordination only.
|
||||
|
||||
Node-local state must not authorize:
|
||||
|
||||
- node enrollment approval
|
||||
- certificate issuance
|
||||
- role assignment
|
||||
- policy mutation
|
||||
- trust root mutation
|
||||
- organization mutation
|
||||
- partition promotion
|
||||
- cross-cluster trust
|
||||
|
||||
## 3. Storage Root and Namespaces
|
||||
|
||||
The node-agent should use one configured local storage root.
|
||||
|
||||
Example logical layout:
|
||||
|
||||
```text
|
||||
rap-node-agent-state/
|
||||
agent/
|
||||
clusters/
|
||||
<cluster_id>/
|
||||
identity/
|
||||
trust/
|
||||
snapshots/
|
||||
peers/
|
||||
routes/
|
||||
services/
|
||||
health/
|
||||
updates/
|
||||
telemetry/
|
||||
tmp/
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- cluster state is namespace-isolated by `cluster_id`
|
||||
- multi-cluster membership uses separate identities and local state per cluster
|
||||
- temporary files are written under the same cluster namespace before atomic
|
||||
activation
|
||||
- no cluster may read another cluster's local state namespace
|
||||
- file permissions must restrict access to the node-agent service account
|
||||
|
||||
## 4. State Classes
|
||||
|
||||
### Agent State
|
||||
|
||||
Agent-level state:
|
||||
|
||||
- agent install id
|
||||
- agent version
|
||||
- local feature flags
|
||||
- last startup/shutdown status
|
||||
- local diagnostics
|
||||
- update engine metadata
|
||||
|
||||
Agent state is not cluster authority.
|
||||
|
||||
### Identity State
|
||||
|
||||
Cluster identity state:
|
||||
|
||||
- `node_id`
|
||||
- cluster membership id
|
||||
- node certificate metadata
|
||||
- public identity metadata
|
||||
- private key reference
|
||||
- enrollment state
|
||||
- revocation status cache
|
||||
|
||||
Private keys should be stored in an OS-protected key store when available. If
|
||||
file-backed keys are necessary, they must be encrypted at rest and protected by
|
||||
strict filesystem permissions.
|
||||
|
||||
### Trust State
|
||||
|
||||
Trust state:
|
||||
|
||||
- platform root trust refs
|
||||
- cluster trust roots
|
||||
- config signing keys
|
||||
- node-to-node trust bundle
|
||||
- revocation metadata
|
||||
- trust bundle version
|
||||
|
||||
Trust state must be signed and versioned. Unknown or revoked trust roots must
|
||||
not be accepted.
|
||||
|
||||
### Snapshot State
|
||||
|
||||
Snapshot state:
|
||||
|
||||
- active signed scoped snapshot per scope
|
||||
- previous verified snapshot per scope
|
||||
- pending snapshot or incremental update
|
||||
- snapshot verification metadata
|
||||
- last applied config version
|
||||
- expiry and refresh deadlines
|
||||
|
||||
Snapshot activation must be atomic:
|
||||
|
||||
1. write pending snapshot
|
||||
2. verify signature, scope, hash, expiry, and version
|
||||
3. persist verified content
|
||||
4. swap active pointer
|
||||
5. notify affected runtime components
|
||||
6. report applied version
|
||||
|
||||
### Peer Cache
|
||||
|
||||
Peer cache:
|
||||
|
||||
- scoped peer directory entries
|
||||
- endpoint candidates
|
||||
- certificate fingerprints
|
||||
- last success timestamp
|
||||
- latency
|
||||
- packet loss
|
||||
- reliability score
|
||||
- recent failure history
|
||||
- last seen config version
|
||||
|
||||
Peer cache combines signed directory data with runtime observations. Runtime
|
||||
observations are hints, not durable authority.
|
||||
|
||||
### Route Cache
|
||||
|
||||
Route cache:
|
||||
|
||||
- selected routes
|
||||
- route score
|
||||
- route class/channel class
|
||||
- route expiry
|
||||
- failover alternatives
|
||||
- shortcut state if future policy allows it
|
||||
- last successful path
|
||||
- recent failure reason
|
||||
|
||||
Route cache must be reconstructable from signed snapshots, peer cache, and
|
||||
runtime observations. It must not define policy.
|
||||
|
||||
### Service Assignment Cache
|
||||
|
||||
Service assignment cache:
|
||||
|
||||
- assigned service workloads
|
||||
- desired state
|
||||
- last reported state
|
||||
- service version
|
||||
- policy refs
|
||||
- resource refs needed by assigned services
|
||||
- connector or `vpn_connection` refs where authorized
|
||||
|
||||
This cache informs supervision. It does not allow the node to invent new
|
||||
service work.
|
||||
|
||||
### Health and Degraded State
|
||||
|
||||
Health/degraded state:
|
||||
|
||||
- last heartbeat sent
|
||||
- last control-plane contact
|
||||
- last config/storage contact
|
||||
- active degraded-mode reason
|
||||
- partition/degraded flags
|
||||
- local resource pressure
|
||||
- service health summaries
|
||||
- last known safe operation deadline
|
||||
|
||||
Degraded state must be visible in node heartbeat/status when connectivity
|
||||
returns.
|
||||
|
||||
### Update Metadata
|
||||
|
||||
Update state:
|
||||
|
||||
- current agent version
|
||||
- current workload versions
|
||||
- pending update metadata
|
||||
- signed artifact refs
|
||||
- rollout/canary assignment
|
||||
- rollback candidate metadata
|
||||
- last update result
|
||||
|
||||
Unsigned artifacts must never be activated.
|
||||
|
||||
## 5. Encryption and Secret Handling
|
||||
|
||||
The local store should avoid storing secrets. When secret-related data is
|
||||
required, store references and resolver metadata, not plaintext.
|
||||
|
||||
Rules:
|
||||
|
||||
- private keys use OS key store where possible
|
||||
- file-backed sensitive material is encrypted at rest
|
||||
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
|
||||
- runtime secrets are resolved only when assigned service policy permits it
|
||||
- secret material must be wiped from temporary files and memory where practical
|
||||
- logs must not contain secret values
|
||||
|
||||
Recommended OS facilities:
|
||||
|
||||
- Windows: DPAPI or service-account protected certificate store
|
||||
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
|
||||
service-account permissions
|
||||
- macOS (future client/agent): Keychain
|
||||
|
||||
## 6. Atomicity and Durability
|
||||
|
||||
Writes must be safe across process crashes and host reboots.
|
||||
|
||||
Rules:
|
||||
|
||||
- write new content to temporary path
|
||||
- fsync or platform equivalent where needed
|
||||
- verify content before activation
|
||||
- atomically rename/swap active pointer
|
||||
- keep previous verified content for recovery
|
||||
- never partially overwrite active snapshots or identity data
|
||||
- use a store lock to prevent concurrent writers
|
||||
|
||||
Node-agent should tolerate:
|
||||
|
||||
- interrupted writes
|
||||
- corrupted pending updates
|
||||
- missing optional cache files
|
||||
- stale runtime observations
|
||||
|
||||
Node-agent must not tolerate silently corrupted identity, trust, or active
|
||||
snapshot data.
|
||||
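The write rules above correspond to the common write-temp, sync, verify, rename pattern. The sketch below uses only the Go standard library; `WriteAtomic`, the `verify` callback, and the file names are assumptions, and retention of the previous verified copy and the store lock are left out for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// WriteAtomic writes data to a temporary file in the same directory,
// syncs it, verifies what landed on disk, and renames it over path.
func WriteAtomic(path string, data []byte, verify func([]byte) error) error {
	dir := filepath.Dir(path)
	tmp, err := os.CreateTemp(dir, ".pending-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // harmless once the rename has succeeded

	_, werr := tmp.Write(data)
	serr := tmp.Sync()
	cerr := tmp.Close()
	if err := errors.Join(werr, serr, cerr); err != nil {
		return err
	}

	onDisk, err := os.ReadFile(tmpName)
	if err != nil {
		return err
	}
	if err := verify(onDisk); err != nil {
		return fmt.Errorf("pending content rejected: %w", err)
	}
	if err := os.Rename(tmpName, path); err != nil {
		return err
	}
	// Best effort: sync the directory so the rename itself is durable.
	if d, err := os.Open(dir); err == nil {
		_ = d.Sync() // not supported on every platform
		d.Close()
	}
	return nil
}

// demo activates "v1", then shows that rejected content leaves "v1" active.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "rap-state-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	active := filepath.Join(dir, "active.snapshot")
	accept := func([]byte) error { return nil }
	reject := func([]byte) error { return errors.New("bad signature") }

	if err := WriteAtomic(active, []byte("v1"), accept); err != nil {
		return "", err
	}
	if err := WriteAtomic(active, []byte("v2"), reject); err == nil {
		return "", errors.New("rejected content was activated")
	}
	got, err := os.ReadFile(active)
	return string(got), err
}

func main() {
	fmt.Println(demo())
}
```

Creating the temporary file in the same directory keeps the rename on one filesystem, which is what makes the swap atomic on common platforms.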
|
||||
## 7. Cache Expiry and Cleanup
|
||||
|
||||
Local caches must be bounded.
|
||||
|
||||
Cleanup rules:
|
||||
|
||||
- remove expired peer observations
|
||||
- remove expired route cache entries
|
||||
- compact telemetry buffers
|
||||
- retain only policy-defined number of previous snapshots
|
||||
- remove stale pending updates after safe timeout
|
||||
- delete service assignment cache for removed roles after revocation is applied
|
||||
- wipe temporary files on startup
|
||||
|
||||
Caches may be rebuilt. Identity, trust, and active snapshots require stricter
|
||||
recovery behavior.
|
||||
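A bounded cache with expiry can be sketched in a few lines. This is one possible shape, not the agent's design: `BoundedCache`, its eviction choice (drop the entry closest to expiry when full), and the `sec`/`start` demo helpers are assumptions.

```go
package main

import (
	"fmt"
	"time"
)

const sec = time.Second

// start is a fixed reference time for the demo.
var start = time.Unix(0, 0)

type entry struct {
	value   string
	expires time.Time
}

// BoundedCache is a small TTL cache with a hard size limit.
// It is not safe for concurrent use.
type BoundedCache struct {
	max     int
	entries map[string]entry
}

// NewBoundedCache creates a cache holding at most max entries (minimum 1).
func NewBoundedCache(max int) *BoundedCache {
	if max < 1 {
		max = 1
	}
	return &BoundedCache{max: max, entries: make(map[string]entry)}
}

// Sweep removes expired entries and returns how many were removed.
func (c *BoundedCache) Sweep(now time.Time) int {
	removed := 0
	for k, e := range c.entries {
		if !now.Before(e.expires) {
			delete(c.entries, k)
			removed++
		}
	}
	return removed
}

// Put stores a value. When the cache is full, the entry closest to expiry is evicted.
func (c *BoundedCache) Put(key, value string, ttl time.Duration, now time.Time) {
	c.Sweep(now)
	if _, exists := c.entries[key]; !exists && len(c.entries) >= c.max {
		victim, first := "", true
		var earliest time.Time
		for k, e := range c.entries {
			if first || e.expires.Before(earliest) {
				victim, earliest, first = k, e.expires, false
			}
		}
		delete(c.entries, victim)
	}
	c.entries[key] = entry{value: value, expires: now.Add(ttl)}
}

// Get returns a value only if it is present and not expired.
func (c *BoundedCache) Get(key string, now time.Time) (string, bool) {
	e, ok := c.entries[key]
	if !ok || !now.Before(e.expires) {
		delete(c.entries, key)
		return "", false
	}
	return e.value, true
}

// Len reports the number of stored entries.
func (c *BoundedCache) Len() int { return len(c.entries) }

func main() {
	c := NewBoundedCache(2)
	c.Put("peer-a", "10ms", 10*sec, start)
	c.Put("peer-b", "25ms", 20*sec, start)
	c.Put("peer-c", "40ms", 30*sec, start) // cache is full: peer-a is evicted
	_, ok := c.Get("peer-a", start)
	fmt.Println("peer-a present:", ok, "size:", c.Len())
}
```

Because peer and route observations are only hints, losing an entry to eviction costs a re-measurement, not correctness.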
|
||||
## 8. Corruption Recovery
|
||||
|
||||
Recovery order:
|
||||
|
||||
1. load active verified state
|
||||
2. reject corrupted pending state
|
||||
3. fall back to previous verified snapshot if active snapshot is corrupt and
|
||||
policy allows it
|
||||
4. request full snapshot from config/storage service
|
||||
5. use bootstrap peers or control plane if storage/config is unavailable
|
||||
6. enter degraded mode only if a valid snapshot exists and policy allows it
|
||||
7. fail closed for trust/identity corruption
|
||||
|
||||
Corruption must be reported through health/status and local diagnostics.
|
||||
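The recovery order can be read as a small decision function. The sketch below is one interpretation, with hypothetical names (`StoreHealth`, `Decision`, `Decide`). It hoists the trust/identity check to the front, since fail-closed applies regardless of snapshot state, and it folds steps 4 and 5 into a single "remote reachable" input.

```go
package main

import "fmt"

// StoreHealth summarizes what startup checks found (hypothetical inputs).
type StoreHealth struct {
	IdentityOK      bool
	TrustOK         bool
	ActiveOK        bool
	PreviousOK      bool
	FallbackAllowed bool // policy allows use of the previous snapshot
	RemoteReachable bool // storage, bootstrap peers, or control plane
	DegradedAllowed bool // policy allows degraded operation
}

type Action string

const (
	UseActive   Action = "use_active"
	UsePrevious Action = "use_previous"
	FetchFull   Action = "fetch_full_snapshot"
	FailClosed  Action = "fail_closed"
)

// Decision is the chosen action plus whether the node runs degraded.
type Decision struct {
	Action   Action
	Degraded bool
}

// Decide maps store health to a recovery action.
func Decide(h StoreHealth) Decision {
	if !h.IdentityOK || !h.TrustOK {
		return Decision{Action: FailClosed}
	}
	var local Action
	switch {
	case h.ActiveOK:
		local = UseActive
	case h.PreviousOK && h.FallbackAllowed:
		local = UsePrevious
	}
	if h.RemoteReachable {
		if local == "" {
			return Decision{Action: FetchFull}
		}
		return Decision{Action: local}
	}
	// No remote source: degraded mode needs a valid snapshot and policy.
	if local != "" && h.DegradedAllowed {
		return Decision{Action: local, Degraded: true}
	}
	return Decision{Action: FailClosed}
}

func main() {
	healthy := StoreHealth{IdentityOK: true, TrustOK: true, ActiveOK: true, RemoteReachable: true}
	fmt.Println(Decide(healthy))

	isolated := StoreHealth{IdentityOK: true, TrustOK: true, PreviousOK: true,
		FallbackAllowed: true, DegradedAllowed: true}
	fmt.Println(Decide(isolated))
}
```

A node using the previous snapshot would normally still request a fresh full snapshot in the background; that follow-up is outside this sketch.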
|
||||
## 9. Multi-Cluster Isolation
|
||||
|
||||
A node may participate in multiple clusters only through isolated memberships.
|
||||
|
||||
Per-cluster isolation includes:
|
||||
|
||||
- identity
|
||||
- certificates
|
||||
- trust bundle
|
||||
- signed snapshots
|
||||
- peer cache
|
||||
- route cache
|
||||
- service assignment cache
|
||||
- update/workload namespace where needed
|
||||
- telemetry namespace
|
||||
|
||||
Cross-cluster data sharing is forbidden unless explicit platform trust and
|
||||
policy allow it.
|
||||
|
||||
## 10. Service Workload Boundary
|
||||
|
||||
Service workloads do not write authoritative node-local state.
|
||||
|
||||
Allowed workload interactions:
|
||||
|
||||
- read assigned service configuration through node-agent
|
||||
- report health/status to node-agent
|
||||
- request approved secret resolution through node-agent/control boundary
|
||||
- receive lifecycle commands from node-agent
|
||||
|
||||
Forbidden workload interactions:
|
||||
|
||||
- mutate role assignments
|
||||
- mutate snapshots
|
||||
- mutate peer directory authority
|
||||
- write trust roots
|
||||
- write cross-cluster state
|
||||
- store unrelated organization secrets
|
||||
|
||||
## 11. Backup and Restore
|
||||
|
||||
Backup rules:
|
||||
|
||||
- identity/private key backup is platform policy dependent and high-risk
|
||||
- snapshots and caches can usually be reconstructed
|
||||
- local route/peer caches should not be treated as backup-critical
|
||||
- trust state backup must preserve anti-rollback properties
|
||||
- restore must not allow replay of revoked identity or old trust roots
|
||||
|
||||
Restore must require control-plane validation before the node is trusted for
|
||||
new high-risk work.
|
||||
|
||||
## 12. Observability
|
||||
|
||||
Node-agent should report safe local state metadata:
|
||||
|
||||
- last applied config version
|
||||
- snapshot expiry/refresh status
|
||||
- trust bundle version
|
||||
- peer cache size
|
||||
- route cache size
|
||||
- degraded-mode state
|
||||
- local store health
|
||||
- last corruption/recovery event
|
||||
- pending update state
|
||||
|
||||
Reports must not include raw secrets or unrelated topology.
|
||||
|
||||
## 13. Future Validation Tests
|
||||
|
||||
Future implementation tests must prove:
|
||||
|
||||
- fresh install creates expected namespace layout
|
||||
- valid snapshot activates atomically
|
||||
- interrupted activation recovers to previous valid snapshot
|
||||
- corrupted pending update is ignored
|
||||
- corrupted active identity fails closed
|
||||
- peer cache expiry works
|
||||
- route cache expiry works
|
||||
- multi-cluster namespaces stay isolated
|
||||
- service workload cannot mutate authoritative local state
|
||||
- local store reports last applied config version
|
||||
- degraded-mode state is persisted and cleared correctly
|
||||
|
||||
## 14. C13 Preparation
|
||||
|
||||
C13 must define the Fabric Storage / Config Storage service that distributes
|
||||
snapshots, peer directories, trust bundles, and incremental updates to the
|
||||
node-local state store.
|
||||
|
||||
C13 must preserve:
|
||||
|
||||
- PostgreSQL authority
|
||||
- signed snapshot verification
|
||||
- node-local bounded cache behavior
|
||||
- cluster/org/service isolation
|
||||
- no arbitrary query/database behavior
|
||||
|
||||
## 15. Result / Decision
|
||||
|
||||
Stage C12 defines node-local state as a bounded, scoped, verified local store
|
||||
owned by native `rap-node-agent`.
|
||||
|
||||
Decisions:
|
||||
|
||||
- local state is namespaced per cluster
|
||||
- identity, trust, snapshots, peer cache, route cache, service assignment
|
||||
cache, health/degraded state, and update metadata are separate state classes
|
||||
- local state is not durable authority
|
||||
- snapshot activation must be atomic
|
||||
- caches are bounded and reconstructable
|
||||
- private keys and sensitive material require OS-protected or encrypted storage
|
||||
- service workloads cannot mutate authoritative node-local state
|
||||
- C13 must define distribution/storage services without turning them into a
|
||||
second source of truth
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
|
||||
workload behavior is changed by C12.
|
||||
@@ -0,0 +1,351 @@
|
||||
# Production Direct Worker WSS Trust
|
||||
|
||||
Status: P3.4 design/prep complete.
|
||||
|
||||
This document defines the production trust model for direct worker WSS. It is a
|
||||
preparation document only: it does not change RDP runtime behavior, does not
|
||||
remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC,
|
||||
WebRTC, or node-agent enrollment.
|
||||
|
||||
## Goal
|
||||
|
||||
Direct worker WSS should become the preferred production realtime path only
|
||||
when the client can verify both:
|
||||
|
||||
- the backend candidate is authorized and marked `production_trusted=true`
|
||||
- the worker endpoint presents a valid TLS certificate for the advertised URL
|
||||
|
||||
The backend gateway remains the safe fallback/debug path.
|
||||
|
||||
## Trust Modes
|
||||
|
||||
`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` has three modes:
|
||||
|
||||
- `smoke_insecure`: development/smoke only. Backend may advertise a direct
|
||||
candidate only outside production and must mark it `smoke_only=true` and
|
||||
`production_trusted=false`.
|
||||
- `public_ca`: worker WSS certificate chains to an OS/publicly trusted CA.
|
||||
Backend may mark the candidate `production_trusted=true`.
|
||||
- `platform_ca`: worker WSS certificate chains to a platform-managed CA.
|
||||
Backend may mark the candidate `production_trusted=true` and include
|
||||
`tls_ca_ref`.
|
||||
|
||||
Production must not treat `smoke_insecure` as trusted. P3.3 proved that a
|
||||
production backend with `smoke_insecure` falls back to `backend_gateway`.
|
||||
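The three modes map to candidate flags roughly as follows. This is a sketch, not the backend's code: the `Candidate` struct and `BuildCandidate` function are hypothetical, and treating a missing `tls_ca_ref` as "do not advertise" in `platform_ca` mode is an assumption.

```go
package main

import "fmt"

// Candidate is a hypothetical direct worker WSS candidate descriptor.
type Candidate struct {
	URL               string
	TLSTrustMode      string
	ProductionTrusted bool
	SmokeOnly         bool
	TLSCARef          string
}

// BuildCandidate returns the candidate to advertise, or false when the
// backend should advertise nothing and leave the backend_gateway path in use.
func BuildCandidate(mode, url, caRef string, production bool) (Candidate, bool) {
	switch mode {
	case "smoke_insecure":
		if production {
			return Candidate{}, false // never trusted in production
		}
		return Candidate{URL: url, TLSTrustMode: mode, SmokeOnly: true}, true
	case "public_ca":
		return Candidate{URL: url, TLSTrustMode: mode, ProductionTrusted: true}, true
	case "platform_ca":
		if caRef == "" {
			return Candidate{}, false // assumption: a CA reference is required
		}
		return Candidate{URL: url, TLSTrustMode: mode, ProductionTrusted: true, TLSCARef: caRef}, true
	default:
		return Candidate{}, false // unknown mode: do not advertise
	}
}

func main() {
	url := "wss://worker.example.test:18443/rap/v1/data-plane"
	fmt.Println(BuildCandidate("smoke_insecure", url, "", true))
	fmt.Println(BuildCandidate("platform_ca", url, "ca-v1", true))
}
```

Unknown or misconfigured modes default to no direct candidate, so a configuration mistake degrades to the gateway path instead of to untrusted TLS.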
|
||||
## Recommended Mode
|
||||
|
||||
Use `platform_ca` as the default production model for platform-managed and
|
||||
customer-managed worker nodes.
|
||||
|
||||
Use `public_ca` only when the worker direct WSS endpoint is intentionally
|
||||
internet-addressable through stable DNS and a public certificate can be issued
|
||||
and renewed safely.
|
||||
|
||||
Rationale:
|
||||
|
||||
- most worker endpoints will be private, internal, or customer-managed
|
||||
- public CA issuance is often impossible for private IP/DNS names
|
||||
- a platform CA can bind certificates to platform node/worker identity
|
||||
- platform CA trust can later integrate with `rap-node-agent`
|
||||
- backend gateway fallback remains available while trust rollout is staged
|
||||
|
||||
## Certificate Profile
|
||||
|
||||
Worker direct WSS certificates must be server certificates.
|
||||
|
||||
Required X.509 properties:
|
||||
|
||||
- `KeyUsage`: `digitalSignature`, plus `keyEncipherment` where required by the
|
||||
selected TLS key type
|
||||
- `ExtendedKeyUsage`: `serverAuth`
|
||||
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
|
||||
- CN must not be used as the trust identity
|
||||
- validity should be short, with 30-90 days recommended in production
|
||||
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally
|
||||
practical
|
||||
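The profile can be checked mechanically with Go's standard `crypto/x509` package. In the sketch below, `CheckWorkerCert` and `selfSigned` are hypothetical helpers, and treating 90 days as a hard upper bound is an assumption (the text only recommends 30-90 days). The check covers only the `digitalSignature` part of `KeyUsage`. Chain validation against a CA is a separate step and is not shown.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/rsa"
	"crypto/x509"
	"crypto/x509/pkix"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// CheckWorkerCert applies the profile rules to an already parsed certificate.
func CheckWorkerCert(cert *x509.Certificate, host string) error {
	if cert.KeyUsage&x509.KeyUsageDigitalSignature == 0 {
		return errors.New("KeyUsage lacks digitalSignature")
	}
	serverAuth := false
	for _, eku := range cert.ExtKeyUsage {
		if eku == x509.ExtKeyUsageServerAuth {
			serverAuth = true
		}
	}
	if !serverAuth {
		return errors.New("ExtendedKeyUsage lacks serverAuth")
	}
	// In current Go releases VerifyHostname matches DNS/IP SANs and ignores the CN.
	if err := cert.VerifyHostname(host); err != nil {
		return fmt.Errorf("SAN does not cover %q: %w", host, err)
	}
	if cert.NotAfter.Sub(cert.NotBefore) > 90*24*time.Hour {
		return errors.New("validity exceeds 90 days")
	}
	switch pk := cert.PublicKey.(type) {
	case *ecdsa.PublicKey:
		if pk.Curve.Params().Name != "P-256" {
			return errors.New("ECDSA key is not P-256")
		}
	case *rsa.PublicKey:
		if pk.N.BitLen() < 2048 {
			return errors.New("RSA key is shorter than 2048 bits")
		}
	default:
		return errors.New("unsupported key type")
	}
	return nil
}

// selfSigned builds a throwaway ECDSA P-256 server certificate for the demo.
func selfSigned(dnsName string, days int) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "not-used-for-identity"},
		NotBefore:    now,
		NotAfter:     now.Add(time.Duration(days) * 24 * time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		DNSNames:     []string{dnsName},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := selfSigned("worker.example.test", 60)
	if err != nil {
		panic(err)
	}
	fmt.Println("matching host:", CheckWorkerCert(cert, "worker.example.test"))
	fmt.Println("other host:", CheckWorkerCert(cert, "other.example.test"))
}
```

A check like this fits an issuance pipeline or an inventory audit; clients still rely on normal TLS chain and hostname validation at connect time.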
|
||||
Recommended identity SAN:
|
||||
|
||||
```text
|
||||
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
|
||||
```
|
||||
|
||||
For the current single-cluster MVP, `cluster_id` may be `default` until the
|
||||
cluster model becomes explicit.
|
||||
|
||||
The URI SAN is not a replacement for normal hostname verification. It is an
|
||||
additional identity binding for observability, future node-agent enrollment,
|
||||
and future control-plane certificate inventory.
|
||||
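Building and parsing the identity URI is straightforward. The sketch below assumes ids are restricted to `[A-Za-z0-9_-]`, which this document does not state; `WorkerURI` and `ParseWorkerURI` are hypothetical helper names.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"strings"
)

// validID restricts ids to a conservative character set (an assumption).
func validID(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		ok := r >= 'a' && r <= 'z' || r >= 'A' && r <= 'Z' ||
			r >= '0' && r <= '9' || r == '-' || r == '_'
		if !ok {
			return false
		}
	}
	return true
}

// WorkerURI builds spiffe://rap/cluster/<cluster_id>/worker/<worker_id>.
func WorkerURI(clusterID, workerID string) (string, error) {
	if !validID(clusterID) || !validID(workerID) {
		return "", errors.New("ids must be non-empty and use only [A-Za-z0-9_-]")
	}
	u := url.URL{
		Scheme: "spiffe",
		Host:   "rap",
		Path:   "/cluster/" + clusterID + "/worker/" + workerID,
	}
	return u.String(), nil
}

// ParseWorkerURI extracts the cluster and worker ids from an identity URI.
func ParseWorkerURI(raw string) (clusterID, workerID string, err error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", "", err
	}
	parts := strings.Split(strings.TrimPrefix(u.Path, "/"), "/")
	if u.Scheme != "spiffe" || u.Host != "rap" || len(parts) != 4 ||
		parts[0] != "cluster" || parts[2] != "worker" ||
		!validID(parts[1]) || !validID(parts[3]) {
		return "", "", fmt.Errorf("not a worker identity URI: %q", raw)
	}
	return parts[1], parts[3], nil
}

func main() {
	uri, err := WorkerURI("default", "worker-1")
	fmt.Println(uri, err)
	fmt.Println(ParseWorkerURI(uri))
}
```

In a certificate template this value would go into the URI SAN list alongside the DNS/IP SANs, not in place of them.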
|
||||
## Candidate URL Rules
|
||||
|
||||
The backend must advertise a direct worker WSS URL whose host is covered by the
|
||||
worker certificate SAN.
|
||||
|
||||
Examples:
|
||||
|
||||
```text
|
||||
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
|
||||
wss://192.168.200.61:18443/rap/v1/data-plane
|
||||
```
|
||||
|
||||
If the URL uses a DNS name, the certificate must include that DNS SAN.
|
||||
|
||||
If the URL uses an IP address, the certificate must include that IP SAN.
|
||||
|
||||
Preferred production shape is DNS, not raw IP, because DNS gives safer
|
||||
certificate rotation and node replacement.
|
||||
|
||||
## Worker Identity Binding
|
||||
|
||||
Direct worker WSS authentication is layered:
|
||||
|
||||
1. TLS proves that the client reached an endpoint with a certificate trusted for
|
||||
the advertised URL.
|
||||
2. `data_plane_token` proves that the backend authorized the session,
|
||||
attachment, user, organization, resource, worker, and allowed channels.
|
||||
3. The worker validates the token and binds the WSS connection to an existing
|
||||
runtime only.
|
||||
|
||||
The TLS certificate does not replace token validation.
|
||||
|
||||
The token does not replace TLS trust.
|
||||
|
||||
Future production hardening should add control-plane certificate inventory:
|
||||
|
||||
```text
|
||||
worker_certificates
|
||||
worker_id
|
||||
cluster_id
|
||||
tls_ca_ref
|
||||
certificate_fingerprint_sha256
|
||||
serial_number
|
||||
not_before
|
||||
not_after
|
||||
status: active | retiring | revoked | expired
|
||||
```
|
||||
|
||||
Until that inventory exists, backend must be conservative and only mark direct
|
||||
candidates production-trusted when deployment configuration guarantees the
|
||||
worker certificate is trusted for the advertised URL.
|
||||
|
||||
## Platform CA Structure
|
||||
|
||||
Recommended hierarchy:
|
||||
|
||||
```text
|
||||
RAP Platform Offline Root CA
|
||||
-> RAP Data Plane Worker Intermediate CA v1
|
||||
-> worker direct WSS server certificates
|
||||
```
|
||||
|
||||
Rules:
|
||||
|
||||
- Root CA private key must not be present on worker hosts.
|
||||
- Intermediate CA private key must not be present on worker hosts.
|
||||
- Worker receives only its server certificate, private key, and CA chain.
|
||||
- Windows clients receive only the trust bundle, never private keys.
|
||||
- Backend receives CA reference metadata and may carry public trust bundle
|
||||
references, never CA private keys.
|
||||
|
||||
For the current test stand, a temporary test CA may be generated on
|
||||
`docker-test`, but it must be treated as throwaway test material and not
|
||||
committed.
|
||||
|
||||
## Certificate Issuance And Storage
|
||||
|
||||
Future `rap-node-agent` should own enrollment. Before node-agent exists, test
|
||||
stand issuance may be manual.
|
||||
|
||||
Production desired flow:
|
||||
|
||||
1. Platform owner approves node/worker enrollment.
|
||||
2. Node agent generates a private key locally.
|
||||
3. Node agent creates CSR with:
|
||||
- worker/node identity URI SAN
|
||||
- DNS/IP SANs for reachable direct WSS endpoints
|
||||
- cluster id
|
||||
4. Control plane or CA service signs the CSR if node policy allows the role.
|
||||
5. Node agent writes certificate/key to a host-local protected path.
|
||||
6. Worker container mounts certificate/key read-only, or native worker reads
|
||||
protected local files.
|
||||
7. Backend advertises direct candidates with:
|
||||
- `tls_trust_mode=platform_ca`
|
||||
- `production_trusted=true`
|
||||
- `smoke_only=false`
|
||||
- `tls_ca_ref=<active-ca-ref>`
|
||||
|
||||
Container note:
|
||||
|
||||
- Certificates are node/host trust assets, not container identity.
|
||||
- Containers may consume mounted cert/key files.
|
||||
- Container rebuilds must not generate production CA material.
|
||||
|
||||
## Windows Client Trust
|
||||
|
||||
For `public_ca`, the Windows client should rely on normal OS certificate
|
||||
validation.
|
||||
|
||||
For `platform_ca`, the preferred production approach is app-local trust:
|
||||
|
||||
- client configuration references a platform CA bundle by `tls_ca_ref`
|
||||
- WSS TLS validation uses a custom chain policy with an app-managed trust store
|
||||
- hostname/SAN validation remains enabled
|
||||
- revocation/deny-list checks are applied when available
|
||||
- no global insecure callback is used
|
||||
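The app-local trust approach can be illustrated with a TLS configuration that trusts only a supplied CA bundle. This document does not state the Windows client's implementation language; the Go sketch below only shows the shape of the idea, and `PlatformTLSConfig` and `throwawayCAPEM` are hypothetical helpers.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"errors"
	"fmt"
	"math/big"
	"time"
)

// PlatformTLSConfig trusts only the given CA bundle instead of the OS store.
// Hostname/SAN verification stays enabled because InsecureSkipVerify is never set.
func PlatformTLSConfig(caBundlePEM []byte, serverName string) (*tls.Config, error) {
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caBundlePEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}
	return &tls.Config{
		RootCAs:    pool,
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}, nil
}

// throwawayCAPEM generates a short-lived self-signed CA for the demo only.
func throwawayCAPEM() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "throwaway test CA"},
		NotBefore:             time.Now(),
		NotAfter:              time.Now().Add(24 * time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der}), nil
}

func main() {
	bundle, err := throwawayCAPEM()
	if err != nil {
		panic(err)
	}
	cfg, err := PlatformTLSConfig(bundle, "worker.example.test")
	fmt.Println("config built:", cfg != nil, err)

	_, err = PlatformTLSConfig([]byte("not a bundle"), "worker.example.test")
	fmt.Println("bad bundle:", err)
}
```

Scoping trust to one connection type this way avoids adding the platform root to an OS-wide store.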
|
||||
Installing the platform root into the Windows CurrentUser or LocalMachine Root
store may be supported for managed enterprise deployment, but it should not be
required for MVP smoke because it broadens OS-level trust.
|
||||
|
||||
Current state:
|
||||
|
||||
- Windows client already skips smoke-only/untrusted direct candidates in
  production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN
  validation preserved.
- P3.5 smoke proved `platform_ca` direct worker WSS without insecure TLS
  bypass on `docker-test`.
|
||||
|
||||
## Rotation
|
||||
|
||||
Worker certificate rotation:
|
||||
|
||||
- certificates should be renewed before 2/3 of lifetime has elapsed
|
||||
- new cert/key should be staged next to the old files
|
||||
- worker should reload or restart gracefully
|
||||
- backend gateway fallback must remain available during rotation
|
||||
- old cert should remain accepted during a short overlap window
|
||||
- after successful cutover, old cert should be marked retiring/expired
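
The 2/3-of-lifetime rule reduces to simple duration arithmetic. A minimal Go sketch, assuming the caller derives `lifetime` from `NotAfter - NotBefore` and `elapsed` from `now - NotBefore`; the function names are hypothetical:

```go
package main

import "time"

// renewalDeadline returns how far into the certificate lifetime renewal must
// already be complete: two thirds of the total lifetime.
func renewalDeadline(lifetime time.Duration) time.Duration {
	return lifetime * 2 / 3
}

// renewalOverdue reports whether the certificate has reached the 2/3 point
// without being renewed.
func renewalOverdue(elapsed, lifetime time.Duration) bool {
	return elapsed >= renewalDeadline(lifetime)
}
```

For a 90-day certificate the deadline falls at day 60, which leaves 30 days for staging, graceful reload, and the overlap window.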
|
||||
|
||||
Platform CA rotation:
|
||||
|
||||
- introduce new `tls_ca_ref`
|
||||
- distribute the new trust bundle to clients before workers switch
|
||||
- backend may advertise candidates with the new CA only after client trust is
  available
|
||||
- keep old and new CA bundles valid during migration
|
||||
- remove the old CA only after all active workers and clients are migrated
|
||||
|
||||
## Revocation And Deny-List
|
||||
|
||||
Short-lived certificates are the first control.
|
||||
|
||||
Additional revocation controls:
|
||||
|
||||
- stop advertising direct candidates for revoked workers immediately
|
||||
- revoke worker certificate serial/fingerprint in control-plane inventory
|
||||
- optionally distribute a compact deny-list to clients
|
||||
- force backend gateway fallback for revoked/untrusted workers
|
||||
- rotate data-plane signing keys separately if token signing material is at risk
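
A compact deny-list can be as small as a set of certificate fingerprints. The sketch below assumes SHA-256 over the certificate DER bytes as the fingerprint; the document does not fix the fingerprint algorithm, and the `DenyList` type is hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// fingerprint returns the lowercase hex SHA-256 digest of a certificate's DER
// encoding. The choice of SHA-256 is an assumption of this sketch.
func fingerprint(der []byte) string {
	sum := sha256.Sum256(der)
	return hex.EncodeToString(sum[:])
}

// DenyList is a set of revoked certificate fingerprints.
type DenyList map[string]struct{}

// Revoke adds a certificate to the deny-list.
func (d DenyList) Revoke(der []byte) {
	d[fingerprint(der)] = struct{}{}
}

// Denied is evaluated by the verifying side (client or control plane), so the
// check keeps working when the revoked worker does not cooperate.
func (d DenyList) Denied(der []byte) bool {
	_, ok := d[fingerprint(der)]
	return ok
}
```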
|
||||
|
||||
Revocation must not rely on the worker cooperating after compromise.
|
||||
|
||||
## Graceful Failure And Fallback
|
||||
|
||||
Direct WSS must fail closed:
|
||||
|
||||
- expired cert: direct rejected, fallback to backend gateway
|
||||
- hostname mismatch: direct rejected, fallback to backend gateway
|
||||
- untrusted platform CA: direct rejected, fallback to backend gateway
|
||||
- revoked fingerprint: direct rejected, fallback to backend gateway
|
||||
- token validation failure: direct rejected, fallback to backend gateway where
  policy permits
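
The fail-closed table above maps naturally onto a small decision function. In this Go sketch the `DirectFailure` values mirror the listed cases; the transport name `backend_gateway` and the log line format are hypothetical, while `direct_worker_wss` is the name the client logs use.

```go
package main

import "fmt"

// DirectFailure enumerates the direct WSS rejection reasons listed above.
type DirectFailure int

const (
	NoFailure DirectFailure = iota
	ExpiredCert
	HostnameMismatch
	UntrustedPlatformCA
	RevokedFingerprint
	TokenValidationFailed
)

var failureNames = map[DirectFailure]string{
	ExpiredCert:           "expired_cert",
	HostnameMismatch:      "hostname_mismatch",
	UntrustedPlatformCA:   "untrusted_platform_ca",
	RevokedFingerprint:    "revoked_fingerprint",
	TokenValidationFailed: "token_validation_failure",
}

// Decision is the selected transport plus a log line for every rejection.
// An empty Transport means the attempt is refused outright.
type Decision struct {
	Transport string
	LogLine   string
}

// decide never keeps the direct path after a failure (fail closed). Token
// failures fall back only where policy permits; all others always fall back.
func decide(f DirectFailure, tokenFallbackPermitted bool) Decision {
	if f == NoFailure {
		return Decision{Transport: "direct_worker_wss"}
	}
	if f == TokenValidationFailed && !tokenFallbackPermitted {
		return Decision{LogLine: fmt.Sprintf("direct rejected reason=%s fallback=denied_by_policy", failureNames[f])}
	}
	return Decision{
		Transport: "backend_gateway",
		LogLine:   fmt.Sprintf("direct rejected reason=%s fallback=backend_gateway", failureNames[f]),
	}
}
```

Because every rejection produces a `LogLine`, a deployment that runs on the gateway for long periods leaves a visible trail.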
|
||||
|
||||
Every fallback must be logged so that production does not silently run on the
fallback path permanently.
|
||||
|
||||
## Test-Stand P3.5 Smoke Result
|
||||
|
||||
P3.5 proved `platform_ca` without using insecure TLS bypass.
|
||||
|
||||
Sanitized command shape:
|
||||
|
||||
```powershell
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"

# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1

# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
  -v /tmp/rap-p3-5-platform-ca:/certs:ro `
  -e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
  -e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
  -e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
  rap-rdp-worker:rdp-p1-region-order2

# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
  -v /tmp/rap-dp1d1:/certs:ro `
  -v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
  -e APP_ENV=production `
  -e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
  -e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
  -e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
  -e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
  rap-backend-smoke:p3-3

# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false

# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
  -PreferDirectDataPlane:$true `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$false `
  -DirectDataPlaneConnectTimeoutMs 2500 `
  -SkipOrgSwitchAndTokenRefresh
```
|
||||
|
||||
P3.5 PASS conditions:
|
||||
|
||||
- backend candidate metadata includes:
  - `tls_trust_mode=platform_ca`
  - `production_trusted=true`
  - `smoke_only=false`
  - `tls_ca_ref=rap-platform-ca:test-v1`
- Windows client selects `direct_worker_wss` in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or
  direct WSS is unavailable
|
||||
|
||||
Required negative tests:
|
||||
|
||||
- wrong SAN certificate rejected
|
||||
- expired certificate rejected
|
||||
- unknown CA rejected
|
||||
- `smoke_insecure` candidate skipped in production
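
The last negative test, together with the candidate metadata in the PASS conditions, suggests a client-side filter along these lines. This is a Go sketch of the rule, not the Windows client's code: the `Candidate` struct and `usableCandidate` are hypothetical, with field names modeled on the metadata keys `tls_trust_mode`, `production_trusted`, `smoke_only`, and `tls_ca_ref`.

```go
package main

// Candidate mirrors the direct-candidate metadata advertised by the backend.
type Candidate struct {
	TLSTrustMode      string
	ProductionTrusted bool
	SmokeOnly         bool
	TLSCARef          string
}

// usableCandidate decides whether the client may attempt a direct candidate.
// In production, smoke-only or untrusted candidates are skipped, and a
// platform_ca candidate needs a CA bundle the client actually holds.
func usableCandidate(c Candidate, production bool, knownCARefs map[string]bool) bool {
	if !production {
		return true
	}
	if c.SmokeOnly || !c.ProductionTrusted {
		return false
	}
	switch c.TLSTrustMode {
	case "public_ca":
		return true
	case "platform_ca":
		return knownCARefs[c.TLSCARef]
	default: // includes smoke_insecure and any unknown mode
		return false
	}
}
```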
|
||||
|
||||
Runtime proof is recorded in:
|
||||
|
||||
- `artifacts/p3-5-app-local-platform-ca-smoke-report.md`
|
||||
|
||||
## Current Implementation Status
|
||||
|
||||
Existing config fields are sufficient for P3.4:
|
||||
|
||||
- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`
|
||||
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`
|
||||
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
|
||||
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`
|
||||
|
||||
P3.5 added Windows client setting:
|
||||
|
||||
- `direct_data_plane_platform_ca_bundle`
|
||||
|
||||
P3.6 completed stale worker-event/restart idempotency hardening.
|
||||
|
||||
Stage 5.2 server-to-client file download design is complete in
`docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`. The next step should return
to the RDP feature plan with the narrow Stage 5.2 implementation, not RDP
rendering, lifecycle expansion, mesh, VPN, or new protocol adapters.
|
||||
@@ -0,0 +1,559 @@
|
||||
# RDP Adapter Runtime
|
||||
|
||||
Status: active implementation plan for the new C++ RDP Adapter internals.
|
||||
|
||||
Current implementation status:
|
||||
|
||||
- RDP-A1 is build-proven: the common Service Adapter channel model and probe exist.
|
||||
- RDP-A2 is live-smoke-proven on the test Docker environment as of 2026-04-26: `SessionRuntime` depends on `RdpAdapterRuntime`, not directly on FreeRDP runtime types, and a real RDP session still connects through the existing direct data plane.
|
||||
- RDP-Perf-2 is live-smoke-proven on the test Docker environment as of 2026-04-26: the current FreeRDP substrate now logs callback source/timing, capture source, and input-to-first-graphics-callback timing.
|
||||
- RDP-Perf-3 / RDP-A3 region-first BGRA fallback is live-smoke-proven on the test Docker environment as of 2026-04-26: direct binary region frames render in the Windows client and backend gateway fallback remains compatible.
|
||||
- RDP-Perf-4 / RDP-A6 gated RDPGFX foundation is build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26. RDPGFX stays disabled by default because the current live RDP target resets the connection when graphics pipeline support is advertised.
|
||||
- RDP-A4 CursorAdapter is live-smoke-proven on the test Docker environment as of 2026-04-26: FreeRDP pointer callbacks are normalized into latest-only `cursor.update` events, direct worker WSS sends them separately from display frames, and backend gateway fallback remains compatible.
|
||||
- RDP-Perf-5A is build-proven and smoke-proven on the test Docker environment as of 2026-04-26: classic GDI region/interactive frames use a 33 ms publish cadence, hot-loop lease renewal is removed, and direct/fallback paths remain compatible.
|
||||
- RDP-Perf-6 direct dirty-region binary contract is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26: direct `RAP2` frames now distinguish `render.frame.full` from `render.frame.region`, include region payload diagnostics, and the Windows presenter keeps a session framebuffer for region patching. Runtime proof used `P3.3 Secret RDP Resource`; observed dirty-region savings ranged from `82.22%` to `99.56%` versus the `3,686,400` byte full frame.
|
||||
- Current accepted baseline is `rap-rdp-worker:rdp-p1-region-order2`: ordered dirty-region delivery is preserved through `SessionRuntime`, worker direct WSS, Windows transport, and WPF presenter queues. Manual visual smoke accepted idle repaint, Start menu/hover, mouse, keyboard, and session close on 2026-04-26.
|
||||
- Remaining visual limitation is quality/performance rather than correctness: window drag behaves like older/slow-link RDP clients by showing a drag frame, and repaint after releasing a moved window is usable but not yet polished.
|
||||
- FreeRDP is still present as the current internal substrate behind the RDP Adapter boundary. It must not be removed until the adapter path is live-proven and replacement layers are ready.
|
||||
|
||||
This does not change the current cluster/control-plane contracts. The current backend gateway fallback remains available until each data-plane stage is proven.
|
||||
|
||||
## 1. Goal
|
||||
|
||||
The RDP Adapter must translate Microsoft RDP into the platform session/data-plane protocol.
|
||||
|
||||
```text
Access Client
  <-> platform session/data-plane protocol
RDP Adapter
  <-> FreeRDP / project-owned RDP internals
RDP Server
```
|
||||
|
||||
The adapter must process events from both sides:
|
||||
|
||||
- Access Client events: input, clipboard, file upload/download, control.
|
||||
- RDP Server events: graphics updates, cursor updates, clipboard changes, device/drive events, disconnects, errors.
|
||||
|
||||
The adapter must not depend on mouse/keyboard input to discover screen changes.
|
||||
|
||||
## 2. External References And Lessons
|
||||
|
||||
FreeRDP exposes an event-driven client model through:
|
||||
|
||||
- `freerdp_get_event_handles` / `freerdp_check_event_handles` for event dispatch.
|
||||
- `rdpUpdate` callbacks such as `BeginPaint`, `EndPaint`, `BitmapUpdate`, `RefreshRect`, `SurfaceBits`, `SurfaceFrameMarker`, and `SurfaceFrameBits`.
|
||||
- client channel modules such as `cliprdr`, `rdpdr`, and `rdpgfx`.
|
||||
|
||||
Apache Guacamole uses the same architectural principle at a higher level: protocol-specific plugins translate RDP/VNC/SSH into a common client protocol so the client does not implement those protocols directly.
|
||||
|
||||
Design implication for this project:
|
||||
|
||||
- FreeRDP callbacks/channels are adapter-origin event sources.
|
||||
- The platform Access Client receives normalized display/cursor/clipboard/file/control events.
|
||||
- Full-frame polling is only fallback/debug, not the target render mechanism.
|
||||
|
||||
## 3. Runtime Components
|
||||
|
||||
```text
SessionRuntime
  owns lifecycle/assignment/policy/lease boundary
  owns RDP Adapter Runtime

RDP Adapter Runtime
  RdpEventPump
  InputAdapter
  DisplayAdapter
  CursorAdapter
  ClipboardAdapter
  FileTransferAdapter
  QualityController
  AdapterEventRouter

DataPlane Sinks
  direct worker WSS
  backend gateway fallback
```
|
||||
|
||||
### RdpEventPump
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- own the FreeRDP event loop
|
||||
- wait on FreeRDP event handles
|
||||
- dispatch FreeRDP callbacks promptly
|
||||
- never sleep instead of processing available server events
|
||||
- report disconnect/error state
|
||||
|
||||
### InputAdapter
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- accept normalized platform input
|
||||
- preserve keyboard down/up ordering
|
||||
- preserve mouse button/wheel ordering
|
||||
- coalesce pointer move to latest
|
||||
- send focus/move/button/key through FreeRDP input API
|
||||
- never trigger full-frame capture loops as the main render mechanism
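
The two queueing rules above (coalesce pointer moves, never reorder keys, buttons, or wheel events) can be sketched as follows. The worker itself is C++; this Go sketch only illustrates the rule, and all type and method names are hypothetical.

```go
package main

// InputKind distinguishes coalescible pointer moves from ordered events.
type InputKind int

const (
	PointerMove InputKind = iota
	MouseButton
	MouseWheel
	KeyDown
	KeyUp
)

// InputEvent is a normalized platform input event.
type InputEvent struct {
	Kind InputKind
	X, Y int
	Code int
}

// InputQueue keeps ordered events in arrival order and collapses runs of
// consecutive pointer moves to the latest position.
type InputQueue struct {
	events []InputEvent
}

// Push appends an event. A pointer move replaces the queue tail only when the
// tail is also a pointer move, so ordering relative to key and button events
// is never changed.
func (q *InputQueue) Push(e InputEvent) {
	n := len(q.events)
	if e.Kind == PointerMove && n > 0 && q.events[n-1].Kind == PointerMove {
		q.events[n-1] = e
		return
	}
	q.events = append(q.events, e)
}

// Drain returns all pending events in order and empties the queue.
func (q *InputQueue) Drain() []InputEvent {
	out := q.events
	q.events = nil
	return out
}
```

Coalescing only adjacent moves keeps a click at the position where it happened: a move queued before a button press is never merged with a move queued after it.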
|
||||
|
||||
### DisplayAdapter
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- consume FreeRDP update callbacks
|
||||
- generate platform display events
|
||||
- prefer dirty regions/surface updates over full frames
|
||||
- send baseline full frame only on connect/resize/attach/recovery/fallback
|
||||
- keep a full framebuffer only where needed for compatibility
|
||||
- never block input on render work
|
||||
|
||||
Required event sources:
|
||||
|
||||
- `BitmapUpdate`
|
||||
- `RefreshRect`
|
||||
- `SurfaceBits`
|
||||
- `SurfaceFrameMarker`
|
||||
- `SurfaceFrameBits`
|
||||
- `EndPaint`
|
||||
- RDPGFX channel events when enabled and stable
|
||||
- periodic fallback change detection only as a safety net
|
||||
|
||||
### CursorAdapter
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- handle FreeRDP pointer callbacks
|
||||
- publish cursor position/visibility/shape independently from display frames
|
||||
- keep cursor events latest-only
|
||||
|
||||
### ClipboardAdapter
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- use `cliprdr`
|
||||
- preserve existing `clipboard_mode`
|
||||
- text-only until explicitly expanded
|
||||
- enforce max size and lifecycle state
|
||||
- prevent loops using sequence/origin/hash
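
One possible reading of "prevent loops using sequence/origin/hash" is sketched below in Go. The exact rule is an assumption: an update is dropped when its sequence number is stale, or when it carries the same content hash as the last accepted update but arrives from the opposite side (an echo). `ClipboardGuard` and its fields are hypothetical.

```go
package main

import "crypto/sha256"

// Origin says which side produced a clipboard update.
type Origin int

const (
	FromClient Origin = iota
	FromServer
)

// ClipboardGuard remembers the last accepted update.
type ClipboardGuard struct {
	MaxBytes   int
	seen       bool
	lastSeq    uint64
	lastOrigin Origin
	lastHash   [sha256.Size]byte
}

// Accept reports whether a text update should be forwarded to the other side.
func (g *ClipboardGuard) Accept(origin Origin, seq uint64, text string) bool {
	if len(text) > g.MaxBytes {
		return false // max size enforcement
	}
	hash := sha256.Sum256([]byte(text))
	if g.seen {
		if seq <= g.lastSeq {
			return false // stale or replayed sequence
		}
		if hash == g.lastHash && origin != g.lastOrigin {
			return false // same content bouncing back: echo
		}
	}
	g.seen, g.lastSeq, g.lastOrigin, g.lastHash = true, seq, origin, hash
	return true
}
```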
|
||||
|
||||
### FileTransferAdapter
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- preserve existing `file_transfer_mode`
|
||||
- keep upload/download reliable and chunked
|
||||
- enforce session/controller/policy/state
|
||||
- keep restricted drive mapping isolated to per-session visible directory
|
||||
- never expose arbitrary worker filesystem paths
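
Keeping transfers inside the per-session directory comes down to rejecting any name that resolves outside it. A minimal Go sketch, assuming a POSIX worker host and a hypothetical helper `resolveInSessionDir`; it does not resolve symlinks, which a real implementation would also need to handle.

```go
package main

import (
	"errors"
	"path/filepath"
	"strings"
)

var errOutsideSessionDir = errors.New("path escapes per-session directory")

// resolveInSessionDir maps a client-supplied relative name to a path inside
// sessionDir. Absolute paths, backslashes, and any traversal are rejected.
func resolveInSessionDir(sessionDir, name string) (string, error) {
	if name == "" || filepath.IsAbs(name) || strings.ContainsRune(name, '\\') {
		return "", errOutsideSessionDir
	}
	full := filepath.Join(sessionDir, name) // Join also cleans ".." segments
	rel, err := filepath.Rel(sessionDir, full)
	if err != nil || rel == "." || rel == ".." ||
		strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errOutsideSessionDir
	}
	return full, nil
}
```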
|
||||
|
||||
### QualityController
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- choose color mode / FPS / dirty-region threshold
|
||||
- degrade render before input
|
||||
- keep file transfer and future VPN-like bulk traffic from starving interactive channels
|
||||
|
||||
## 4. Data-Plane Streams
|
||||
|
||||
The target adapter uses independent scheduling classes even if they share one WSS connection in DP-1:
|
||||
|
||||
| Stream | Channel | Scheduling |
| --- | --- | --- |
| Critical input | `input` | first, ordered, bounded |
| Control | `control` | reliable, bounded |
| Cursor | `cursor` | latest-only, bypass display cadence |
| Display | `display` | droppable, latest region/frame |
| Clipboard | `clipboard` | reliable, policy-gated |
| File transfer | `file_transfer` | reliable chunked, bandwidth-limited |
| Telemetry | `telemetry` | sampled/droppable |
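
A shared-connection scheduler for these classes could look like the Go sketch below. Two things are assumptions of the sketch rather than statements of the table: strict priority follows the table's row order, and only the cursor channel is modeled as latest-only. Display frames are kept in order here because dirty-region patches are not safe to drop individually. Queue bounds and bandwidth limits are omitted.

```go
package main

// Channel values are ordered by scheduling priority (lowest value first).
type Channel int

const (
	Input Channel = iota
	Control
	Cursor
	Display
	Clipboard
	FileTransfer
	Telemetry
	numChannels
)

// Message is one outbound data-plane message.
type Message struct {
	Channel Channel
	Payload string
}

// Scheduler multiplexes all channels onto one connection.
type Scheduler struct {
	queues [numChannels][]Message
}

// Enqueue adds a message. Cursor is latest-only: a new update replaces any
// pending one instead of queueing behind it.
func (s *Scheduler) Enqueue(m Message) {
	if m.Channel == Cursor {
		s.queues[Cursor] = []Message{m}
		return
	}
	s.queues[m.Channel] = append(s.queues[m.Channel], m)
}

// Next returns the next message to write: the oldest pending message on the
// highest-priority non-empty channel.
func (s *Scheduler) Next() (Message, bool) {
	for c := range s.queues {
		if q := s.queues[c]; len(q) > 0 {
			s.queues[c] = q[1:]
			return q[0], true
		}
	}
	return Message{}, false
}
```

Strict priority can starve low classes under sustained load, so a production scheduler would likely pair it with per-class budgets.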
|
||||
|
||||
Future transports may split streams physically:
|
||||
|
||||
- control/input WSS
|
||||
- display binary WSS or QUIC-like transport
|
||||
- file transfer chunk stream
|
||||
- audio/video adaptive stream
|
||||
|
||||
DP-1 must keep current direct WSS/fallback intact while enforcing scheduling semantics internally.
|
||||
|
||||
## 5. Display Contract
|
||||
|
||||
Display event types:
|
||||
|
||||
- `display.baseline_full_bgra`
|
||||
- `display.region_bgra`
|
||||
- `display.surface_create`
|
||||
- `display.surface_delete`
|
||||
- `display.surface_bits`
|
||||
- `display.encoded_frame`
|
||||
- `display.resize`
|
||||
- `display.sync`
|
||||
|
||||
Rules:
|
||||
|
||||
- Access Client owns the visible framebuffer.
|
||||
- Region updates patch the existing full-size framebuffer.
|
||||
- Adapter must send a baseline frame before region-only updates after connect/attach/resize.
|
||||
- Stale display updates may be dropped.
|
||||
- Cursor updates must not wait for display frames.
|
||||
- Full-frame BGRA is fallback, not production target.
|
||||
- Direct binary display messages use the existing `RAP2` frame header:
  `render.frame.full` for baseline/recovery frames and `render.frame.region`
  for BGRA32 dirty-region payloads.
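
Region patching into a client-owned framebuffer can be sketched as a row-by-row copy. The presenter is a WPF component, so this Go version only illustrates the arithmetic; the `Framebuffer` type is hypothetical, and the layout assumption is tightly packed BGRA32 (4 bytes per pixel) for the framebuffer with a caller-supplied stride for the region payload.

```go
package main

import "errors"

// Framebuffer is a full-size BGRA32 surface with stride Width*4.
type Framebuffer struct {
	Width, Height int
	Pix           []byte
}

// NewFramebuffer allocates a zeroed surface, as a baseline frame would fill it.
func NewFramebuffer(w, h int) *Framebuffer {
	return &Framebuffer{Width: w, Height: h, Pix: make([]byte, w*h*4)}
}

// ApplyRegion copies a w*h BGRA32 region into the framebuffer at (x, y).
// regionStride is the byte distance between region rows in data.
func (fb *Framebuffer) ApplyRegion(x, y, w, h, regionStride int, data []byte) error {
	if x < 0 || y < 0 || w <= 0 || h <= 0 || x+w > fb.Width || y+h > fb.Height {
		return errors.New("region outside framebuffer")
	}
	if regionStride < w*4 || len(data) < regionStride*(h-1)+w*4 {
		return errors.New("region payload too short")
	}
	for row := 0; row < h; row++ {
		src := data[row*regionStride : row*regionStride+w*4]
		dst := ((y+row)*fb.Width + x) * 4
		copy(fb.Pix[dst:dst+w*4], src)
	}
	return nil
}
```

Validating bounds before copying matters here because an invalid region should trigger a full-frame fallback rather than corrupt the visible surface.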
|
||||
|
||||
## 6. FreeRDP Usage Rules
|
||||
|
||||
Default stable mode:
|
||||
|
||||
- GDI/primary framebuffer fallback
|
||||
- update callbacks installed
|
||||
- cliprdr enabled only when policy permits
|
||||
- rdpdr restricted drive only when file transfer policy permits
|
||||
|
||||
Experimental/next modes:
|
||||
|
||||
- RDPGFX dynamic channel behind explicit capability flag
|
||||
- surface/event parsing before enabling by default
|
||||
- encoded graphics payloads only when client capability and server support are proven
|
||||
|
||||
Do not enable unstable graphics paths globally. Each capability must be gated, logged, and fallback-safe.
|
||||
|
||||
## 7. Migration Plan
|
||||
|
||||
### RDP-A1: Contract And Scaffolding
|
||||
|
||||
Deliver:
|
||||
|
||||
- common Service Adapter protocol document
|
||||
- RDP Adapter runtime document
|
||||
- compile-safe adapter channel model
|
||||
- no runtime behavior switch
|
||||
|
||||
Status: completed and build-proven.
|
||||
|
||||
### RDP-A2: Event Router Boundary
|
||||
|
||||
Deliver:
|
||||
|
||||
- route FreeRDP notifications through `AdapterEventRouter`
|
||||
- preserve existing `WorkerEvent` output
|
||||
- prove server-origin display events flow without client input
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Current code boundary:
|
||||
|
||||
- `SessionRuntime` owns `RdpAdapterRuntime`.
|
||||
- `RdpAdapterRuntime` owns the current FreeRDP substrate.
|
||||
- `AdapterEventRouter` normalizes substrate notifications into adapter event descriptors.
|
||||
- Existing worker events and data-plane contracts are preserved.
|
||||
|
||||
Smoke command:
|
||||
|
||||
```powershell
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
  -PreferDirectDataPlane:$true `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$true `
  -DirectDataPlaneConnectTimeoutMs 2500 `
  -DirectDataPlaneColorMode full_color `
  -SkipOrgSwitchAndTokenRefresh
```
|
||||
|
||||
Smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-adapter-a2`
|
||||
- session id: `c835e211-a105-4165-9ed2-885ddf876b84`
|
||||
- worker log: `rdp_adapter.runtime_start substrate=freerdp`
|
||||
- worker log: `adapter_event channel=display type=display.baseline_full_bgra`
|
||||
- worker log: `data_plane_bind_success ... render_transport=binary_v1`
|
||||
- client log: `data_plane.transport selected=direct_worker_wss`
|
||||
- client log: `SessionWindow rendered frame`
|
||||
- smoke result: login/resource/start/input/detach/attach/takeover/taken_over/logout passed
|
||||
- runtime creation count: one `started new runtime for session` entry across start/reattach/takeover
|
||||
|
||||
### RDP-A3: DisplayAdapter Region-First
|
||||
|
||||
Deliver:
|
||||
|
||||
- baseline full frame on connect/attach/resize
|
||||
- region updates as default normal UI path
|
||||
- client framebuffer patch proof
|
||||
- full-frame fallback retained
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Proof summary:
|
||||
|
||||
- `BitmapUpdate` dirty regions are deferred and flushed once at `EndPaint`.
|
||||
- Region payloads are sent over direct binary WSS as `message_type=render.frame.region` with `frame_update_kind=region`.
|
||||
- Windows client renders region frames into the existing framebuffer.
|
||||
- Backend gateway fallback remains available and smoke-proven.
|
||||
- Report: `artifacts/rdp-perf3-report.md`
|
||||
|
||||
Prerequisite proof:
|
||||
|
||||
- RDP-Perf-2 showed active `BitmapUpdate`, `BeginPaint`, and `EndPaint` callbacks in stable GDI mode.
|
||||
- RDP-Perf-2 did not observe `RefreshRect`, `SurfaceBits`, `SurfaceFrameMarker`, `SurfaceFrameBits`, or pointer callbacks in the live smoke.
|
||||
- The next implementation should prefer `BitmapUpdate` dirty regions and treat `EndPaint` as a flush/safety marker instead of producing duplicate captures.
|
||||
|
||||
### RDP-A4: CursorAdapter
|
||||
|
||||
Deliver:
|
||||
|
||||
- cursor position/shape/visibility channel
|
||||
- cursor updates independent from render cadence
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of
|
||||
2026-04-26.
|
||||
|
||||
Implementation notes:
|
||||
|
||||
- `CursorAdapter` produces normalized cursor position, visibility, shape, cache,
  hotspot, and mask metadata.
- The FreeRDP substrate invokes original pointer callbacks first, then publishes
  platform cursor events.
- `session_cursor_updated` is routed as the adapter event `cursor.update`.
- Direct worker WSS keeps cursor as latest-only/droppable and does not block it
  behind render frames.
- The Windows client consumes `cursor.update` without changing session lifecycle
  or UI layout.
|
||||
|
||||
Proof:
|
||||
|
||||
- direct smoke session id: `549806aa-c9db-48a9-917e-cf817cf236b5`
|
||||
- fallback smoke session id: `dee3a856-bee1-4eba-9c10-f62edaf56547`
|
||||
- worker image: `rap-rdp-worker:rdp-a4-cursor-adapter`
|
||||
- report: `artifacts/rdp-a4-cursor-adapter-report.md`
|
||||
|
||||
### RDP-A4.1 / RDP-Perf-5A: GDI Repaint Cadence Hardening
|
||||
|
||||
Deliver:
|
||||
|
||||
- bounded immediate FreeRDP event-handle drain after signaled event checks
|
||||
- rate-limited no-change detector logs
|
||||
- no Redis lease renewal in the hot render/input loop
|
||||
- 33 ms region/interactive render publish cadence
|
||||
- 100 ms full-frame fallback cadence retained
|
||||
- direct worker WSS and backend gateway fallback compatibility
|
||||
|
||||
Status: completed and smoke-proven on the test Docker environment as of
|
||||
2026-04-26.
|
||||
|
||||
Proof summary:
|
||||
|
||||
- direct smoke session id: `0cca4974-2a82-48dc-a0f6-1036ea8e98f0`
|
||||
- fallback smoke session id: `16deb09e-1c44-4e9d-8448-93b42ac66ed0`
|
||||
- worker image: `rap-rdp-worker:rdp-perf5a-repaint-cadence`
|
||||
- direct worker WSS selected in direct smoke
|
||||
- backend gateway selected in fallback smoke
|
||||
- direct render stayed binary and skipped JSON/base64 compatibility frame
  building
|
||||
- backend gateway fallback still built JSON/base64 compatibility frames
|
||||
- render queues stayed bounded in observed direct smoke
|
||||
- report: `artifacts/rdp-perf5a-report.md`
|
||||
|
||||
Follow-up manual validation:
|
||||
|
||||
- keyboard behavior reached a usable level
|
||||
- mouse movement/click behavior became acceptable for the MVP baseline
|
||||
- remote idle updates such as Task Manager percentages now repaint without local
  mouse movement
|
||||
- small redraw artifacts remain and require a focused visual correctness pass
|
||||
|
||||
### RDP-A4.2: Direct Attach Baseline And Region-Loss Repair
|
||||
|
||||
Deliver:
|
||||
|
||||
- request a full-frame baseline when a direct client attaches without a cached
  full frame
- queue direct attach baseline frames as non-droppable reliable events
- preserve region-first rendering for normal updates
- capture throttled full-frame repair when region loss/drop can leave persistent
  artifacts
- keep input, clipboard, file upload, session lifecycle, direct worker WSS, and
  backend gateway fallback unchanged
|
||||
|
||||
Status: previous accepted baseline, superseded by P1 ordered-region delivery on
|
||||
2026-04-26.
|
||||
|
||||
Proof summary:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-region-repair`
|
||||
- worker probes pass for graphics adapter, cursor adapter, service adapter
  protocol, and direct data-plane bind validation
- direct attach no longer starts from a black-only framebuffer when no cached
  full frame is available
- server-origin idle updates are visible without local input
- remaining issue is small redraw artifacts during some region update
  sequences
|
||||
|
||||
Current code boundaries:
|
||||
|
||||
- `SessionRuntime::PublishDirectAttachBaselineIfRequested`
|
||||
- `SessionRuntime::DrainAndPublishRenderNotifications`
|
||||
- `RdpAdapterRuntime::CaptureFullFrameNotification`
|
||||
- `RdpRuntime::CaptureFullFrameNotification`
|
||||
- `DirectWssEventSink::EnqueueEvent`
|
||||
|
||||
Next hardening target:
|
||||
|
||||
- add region sequence/gap diagnostics
|
||||
- identify whether remaining artifacts come from dropped regions, stale ordering,
  wrong client patching, missed callbacks, or repair timing
- apply the smallest fix without returning to full-frame polling as the normal
  render path
|
||||
|
||||
### RDP-A4.3 / P1: Ordered Region Delivery Candidate
|
||||
|
||||
Root cause addressed:
|
||||
|
||||
- Region frames were passing through latest-frame-only queues in the direct
  worker writer, Windows transport, and WPF presenter.
- A second ordered-delivery gap was found in `SessionRuntime`, where frame
  notifications were still coalesced before reaching the direct event sink.
- Latest-frame-only behavior is correct for full frames and cursor updates, but
  it is unsafe for dirty-region patches because dropping an intermediate region
  can leave stale pixels on the client framebuffer.
|
||||
|
||||
Deliver:
|
||||
|
||||
- preserve ordered dirty-region frames through the worker direct WSS writer
- preserve ordered dirty-region frames inside `SessionRuntime` before the direct
  event sink
- preserve ordered dirty-region frames through the Windows direct transport
- preserve ordered dirty-region application in the WPF session presenter
- keep full frames able to supersede pending region queues
- request a throttled full-frame repair if the worker direct region queue
  overflows
- add client diagnostics for frame sequence gaps and regions received before a
  baseline
- keep input, cursor, clipboard, file upload, session lifecycle, direct worker
  WSS, and backend gateway fallback unchanged
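
The queueing behavior in this list (ordered regions, full frames superseding pending regions, repair on overflow) can be sketched as below. This Go version is illustrative only: the real queues live in the C++ worker and the Windows client, `RegionQueue` is a hypothetical type, repair throttling is not modeled, and dropping the incoming region on overflow is an assumption of the sketch.

```go
package main

// FrameKind separates self-contained full frames from region patches.
type FrameKind int

const (
	FullFrame FrameKind = iota
	RegionFrame
)

// Frame is a queued render frame identified by its sequence number.
type Frame struct {
	Kind FrameKind
	Seq  uint64
}

// RegionQueue preserves region order and never drops a region silently.
type RegionQueue struct {
	Max             int
	frames          []Frame
	RepairRequested bool
}

// Push enqueues a frame. A full frame replaces everything pending, since it
// repaints the whole surface. A region that does not fit is dropped and a
// full-frame repair is requested so stale pixels cannot persist.
func (q *RegionQueue) Push(f Frame) {
	if f.Kind == FullFrame {
		q.frames = append(q.frames[:0], f)
		q.RepairRequested = false
		return
	}
	if len(q.frames) >= q.Max {
		q.RepairRequested = true
		return
	}
	q.frames = append(q.frames, f)
}

// Pop returns the oldest pending frame.
func (q *RegionQueue) Pop() (Frame, bool) {
	if len(q.frames) == 0 {
		return Frame{}, false
	}
	f := q.frames[0]
	q.frames = q.frames[1:]
	return f, true
}
```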
|
||||
|
||||
Status: accepted baseline on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Proof summary:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-p1-region-order2`
|
||||
- live test container: `rap_worker_smoke`
|
||||
- backend `go test ./...`: PASS
|
||||
- Windows solution build: PASS
|
||||
- worker graphics adapter probe: PASS
|
||||
- worker cursor adapter probe: PASS
|
||||
- worker service adapter protocol probe: PASS
|
||||
- worker direct data-plane bind valid probe: PASS
|
||||
- worker Redis registration: `worker:registration:rdp-worker-1` reports
  `status=online`
- manual visual smoke: PASS for idle Task Manager updates without local input,
  Start menu/hover without persistent artifacts, window drag usability, mouse,
  keyboard, and session close
- known limitation: drag uses old-client frame-only movement and release repaint
  is not polished
|
||||
|
||||
Current code boundaries:
|
||||
|
||||
- `SessionRuntime::RequestDirectFullFrameRepair`
|
||||
- `SessionRuntime::DrainAndPublishRenderNotifications`
|
||||
- `DirectWssEventSink::EnqueueEvent`
|
||||
- `SessionGatewayClient::QueueFrameEnvelope`
|
||||
- `SessionWindow::QueueFrameForPresentation`
|
||||
- `SessionWindowViewModel::ApplyFramePayload`
|
||||
|
||||
Manual acceptance result:
|
||||
|
||||
- Start menu/menu hover does not leave persistent stale regions.
|
||||
- Task Manager graph/percent updates continue without local input.
|
||||
- Mouse and keyboard responsiveness did not regress.
|
||||
- Session close works normally.
|
||||
- Window drag is workable but uses frame-only movement and non-perfect repaint
  after release; this belongs to the next performance/quality layer.
|
||||
|
||||
### RDP-Perf-6: Dirty-Region Direct Binary Contract
|
||||
|
||||
Goal:
|
||||
|
||||
- make dirty-region direct render explicit at the `RAP2` binary contract level
|
||||
- keep full-frame binary support as baseline/recovery fallback
|
||||
- keep backend gateway JSON/base64 fallback unchanged
|
||||
- avoid routing high-rate binary regions through Redis/backend
|
||||
|
||||
Status: implemented and build/probe/live-smoke-proven on the test Docker
environment as of 2026-04-26 using direct worker WSS and
`rap-rdp-worker:rdp-perf6-dirty-region`.
|
||||
|
||||
Implementation:
|
||||
|
||||
- Worker direct WSS emits `render.frame.full` for first frame, attach/reattach,
  resize, region-loss repair, invalid region fallback, and debug fallback.
- Worker direct WSS emits `render.frame.region` for BGRA32 dirty-region
  payloads from the current classic GDI region-first path.
- Region metadata includes full desktop dimensions, region coordinates,
  `region_stride`, `region_format=BGRA32`, payload length, sequence, and
  capture/input timing fields.
- Worker diagnostics include `full_frame_sent`, `region_frame_sent`,
  `region_bytes`, `full_frame_bytes`, `region_savings_percent`,
  `diff_time_ms`, `render_update_reason`, and
  `fallback_to_full_frame_reason`.
- Windows direct transport accepts `render.frame.full`,
  `render.frame.region`, and legacy `session.frame` binary messages.
- Windows presenter keeps a per-session framebuffer and patches region bytes
  into it before presenting the updated WPF surface.
- Smoke proof showed baseline `render.frame.full` at `3,686,400` bytes and
  dirty-region `render.frame.region` payloads such as `16,384`, `163,840`,
  `327,680`, and `655,360` bytes, with observed savings up to `99.56%`.
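
The reported savings follow directly from the payload sizes. The Go sketch below reproduces the arithmetic; the function names are hypothetical, and the 1280x720 desktop size is an inference from `3,686,400 = 1280 * 720 * 4`, not a figure stated in the smoke report.

```go
package main

import "math"

// fullFrameBytes is the size of a tightly packed BGRA32 frame (4 bytes/pixel).
func fullFrameBytes(width, height int) int {
	return width * height * 4
}

// regionSavingsPercent returns the bandwidth saved by sending a region instead
// of a full frame, rounded to two decimal places.
func regionSavingsPercent(regionBytes, fullBytes int) float64 {
	pct := (1 - float64(regionBytes)/float64(fullBytes)) * 100
	return math.Round(pct*100) / 100
}
```

With a `3,686,400` byte full frame, a `16,384` byte region gives `99.56%` savings and a `655,360` byte region gives `82.22%`, matching the range reported for RDP-Perf-6.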
|
||||
|
||||
Boundaries preserved:
|
||||
|
||||
- no backend session lifecycle changes
|
||||
- no organization/auth/policy changes
|
||||
- no `data_plane_token` contract changes
|
||||
- no clipboard or file-transfer semantic changes
|
||||
- no RDPGFX default enablement
|
||||
- no mesh/VPN/relay/QUIC/WebRTC work
|
||||
- backend gateway fallback remains available
|
||||
|
||||
### RDP-A5: Clipboard/File Adapters
|
||||
|
||||
Deliver:
|
||||
|
||||
- move current cliprdr/file logic behind adapter boundaries
|
||||
- no behavior change
|
||||
- policy enforcement unchanged
|
||||
|
||||
### RDP-A6: RDPGFX Foundation
|
||||
|
||||
Deliver:
|
||||
|
||||
- gated RDPGFX surface event support
|
||||
- fallback to GDI region updates
|
||||
- no default enable until stable
|
||||
|
||||
Status: build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Notes:
|
||||
|
||||
- `RDP_WORKER_RDPGFX_ENABLED=true` is the explicit gated switch.
|
||||
- The default runtime path remains classic GDI region-first.
|
||||
- The current test RDP target failed gated RDPGFX with a connection reset before `rdp.gfx channel_connected`, so no RDPGFX surface lifecycle proof is available for that target yet.
|
||||
- Report: `artifacts/rdp-perf4-report.md`
|
||||
|
||||
### RDP-A7: Encoded/Adaptive Render
|
||||
|
||||
Deliver:
|
||||
|
||||
- encoded display payloads where negotiated
|
||||
- adaptive quality profiles
|
||||
- weak-channel policy
|
||||
|
||||
## 8. Acceptance Criteria For New RDP Adapter
|
||||
|
||||
- idle remote screen changes are visible without local mouse/keyboard input
|
||||
- first click acts on remote UI, not only focus
|
||||
- pointer hover updates are visible
|
||||
- keyboard does not lose characters
|
||||
- detach/reattach/takeover do not recreate remote session
|
||||
- worker death marks session failed/recoverable correctly
|
||||
- clipboard and file transfer remain policy-enforced
|
||||
- direct worker WSS is preferred and fallback remains working
|
||||
- input latency is not affected by render/file/telemetry pressure
|
||||
@@ -0,0 +1,467 @@
|
||||
# RDP Stage 5.2 Design Pass - Server-To-Client File Download
|
||||
|
||||
Status: design-complete proposal, no runtime implementation in this step.
|
||||
|
||||
Date: 2026-04-26
|
||||
|
||||
This document defines the target Stage 5.2 implementation shape for safe
|
||||
server-to-client file download in the RDP service. It preserves the accepted
|
||||
RDP Adapter baseline, direct worker WSS, backend gateway fallback, and the
|
||||
restricted `RAP_Transfers` drive visibility model.
|
||||
|
||||
## 1. Baseline
|
||||
|
||||
Already accepted:
|
||||
|
||||
- client-to-server upload to worker-controlled per-session storage
|
||||
- restricted FreeRDP drive redirection
|
||||
- uploaded files visible inside remote Windows through `RAP_Transfers`
|
||||
- text clipboard
|
||||
- direct worker WSS with backend gateway fallback
|
||||
- C++ RDP Adapter as the active runtime
|
||||
|
||||
Not implemented yet:
|
||||
|
||||
- server-to-client file download
|
||||
- bidirectional file-transfer runtime behavior
|
||||
- remote filesystem browser
|
||||
- Windows session agent
|
||||
- SMB/WebDAV delivery
|
||||
- arbitrary remote path selection
|
||||
|
||||
## 2. Recommended V1 Model
|
||||
|
||||
Use the existing restricted `RAP_Transfers` redirected drive, but add a
|
||||
dedicated remote-to-client drop zone inside it.
|
||||
|
||||
Recommended visible layout:
|
||||
|
||||
```text
|
||||
RAP_Transfers\
|
||||
FromClient\ # future normalized upload destination
|
||||
ToClient\ # remote user places files here for download
|
||||
README.txt # describes policy and size limits
|
||||
```
|
||||
|
||||
For backward compatibility, the current upload path may continue placing files
|
||||
at the visible drive root until a later cleanup stage. Stage 5.2 should add
|
||||
`ToClient` first and avoid breaking accepted upload behavior.
|
||||
|
||||
Why this model:
|
||||
|
||||
- reuses the already proven restricted drive boundary
|
||||
- exposes no worker parent directories
|
||||
- needs no Windows agent
|
||||
- needs no SMB/WebDAV service
|
||||
- gives the remote user a clear, auditable action: copy file into `ToClient`
|
||||
- keeps server-to-client transfer policy-enforced in the real data path
|
||||
- keeps the Access Client independent from RDP internals
|
||||
|
||||
## 3. Non-Goals
|
||||
|
||||
Stage 5.2 must not implement:
|
||||
|
||||
- arbitrary remote path download
|
||||
- remote filesystem browsing
|
||||
- recursive folder download
|
||||
- drag-and-drop shell integration
|
||||
- server-to-client clipboard file lists
|
||||
- shared folders beyond the restricted per-session visible directory
|
||||
- SMB, WebDAV, HTTP drop service, or Windows agent
|
||||
- direct host filesystem exposure
|
||||
- file execution
|
||||
- background sync
|
||||
- cross-session persistent shared storage
|
||||
|
||||
## 4. Data Flow
|
||||
|
||||
```mermaid
|
||||
sequenceDiagram
|
||||
participant Remote as "Remote Windows Session"
|
||||
participant Drive as "RAP_Transfers\\ToClient"
|
||||
participant Worker as "RDP Adapter Worker"
|
||||
participant Backend as "Backend Gateway Fallback"
|
||||
participant Client as "Access Client"
|
||||
|
||||
Remote->>Drive: User copies file into ToClient
|
||||
Worker->>Worker: Detect stable regular file
|
||||
Worker->>Worker: Sanitize name, size check, hash snapshot
|
||||
Worker->>Client: file_download.available
|
||||
Client->>Worker: file_download.start
|
||||
Worker->>Client: file_download.chunk
|
||||
Client->>Worker: file_download.ack
|
||||
Worker->>Client: file_download.completed
|
||||
Worker->>Backend: Audit/status event
|
||||
```
|
||||
|
||||
Backend gateway fallback follows the same logical events but relays them
|
||||
through the existing backend gateway path. Direct worker WSS should be preferred
|
||||
when available.
|
||||
|
||||
## 5. Worker Detection Model
|
||||
|
||||
The worker owns the only trusted observation point: the local per-session
|
||||
visible directory that FreeRDP exposes as `RAP_Transfers`.
|
||||
|
||||
Detection rules:
|
||||
|
||||
- watch only `<transfer_root>/<session_id>/visible/ToClient`
|
||||
- ignore directories in Stage 5.2
|
||||
- ignore hidden/temp files such as `.part`, `.tmp`, `~$*`, and worker-owned
|
||||
transfer temp names
|
||||
- wait until size and modified time are unchanged across at least two consecutive checks
|
||||
- open the file read-only after stability is observed
|
||||
- use no-follow/openat style APIs where available
|
||||
- `fstat` the opened descriptor and reject non-regular files
|
||||
- verify the canonical path remains inside the `ToClient` directory
|
||||
- compute hash from the opened file snapshot
|
||||
- reject files above the configured max size before transfer starts
|
||||
|
||||
The worker must never trust a remote-supplied path. It only trusts a sanitized
|
||||
relative file name derived from the controlled drop directory.
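The detection rules above can be sketched as follows. This is a Python illustration for a POSIX host, not the C++ worker code; `open_regular_file`, `is_stable`, and the 25 MiB default are assumptions made for the example. Opening a single path component relative to the directory descriptor with `O_NOFOLLOW` keeps the lookup inside `ToClient` and refuses symlinks.

```python
import os
import stat
from typing import Sequence, Tuple

MAX_FILE_BYTES = 25 * 1024 * 1024  # assumed default size limit


def is_stable(samples: Sequence[Tuple[int, int]], required: int = 2) -> bool:
    """True when the last `required` (size, mtime_ns) samples are identical."""
    if len(samples) < required:
        return False
    tail = samples[-required:]
    return all(sample == tail[0] for sample in tail)


def open_regular_file(drop_dir: str, name: str, max_bytes: int = MAX_FILE_BYTES) -> int:
    """Open `name` inside `drop_dir` read-only and return the file descriptor.

    Raises ValueError for rejected names or file types, OSError for symlinks.
    """
    if name in ("", ".", "..") or "/" in name or "\\" in name:
        raise ValueError("name must be a single path component")
    dir_fd = os.open(drop_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # O_NOFOLLOW refuses a symlink as the final component;
        # O_NONBLOCK avoids hanging if the entry is a FIFO.
        flags = os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK
        fd = os.open(name, flags, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
    info = os.fstat(fd)  # inspect the opened descriptor, not the path
    if not stat.S_ISREG(info.st_mode) or info.st_size > max_bytes:
        os.close(fd)
        raise ValueError("not a regular file or above the size limit")
    return fd
```

The hash would then be computed by reading from the returned descriptor, so the bytes that are hashed are the bytes that are sent.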
|
||||
|
||||
## 6. File Identity
|
||||
|
||||
Each downloadable file gets an opaque `file_id`.
|
||||
|
||||
Recommended `file_id` input:
|
||||
|
||||
- session id
|
||||
- stable relative file name
|
||||
- file size
|
||||
- modified timestamp
|
||||
- inode/device where available
|
||||
- SHA-256 content hash or snapshot hash
|
||||
|
||||
The Access Client never sends a worker filesystem path. It requests download by
|
||||
`file_id` only.
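One way to derive such an identifier is to hash the listed inputs together. The sketch below is an assumption, not the agreed scheme; `make_file_id` and its parameter names are hypothetical.

```python
import hashlib


def make_file_id(session_id: str, relative_name: str, size: int, mtime_ns: int,
                 inode: int, device: int, content_sha256: str) -> str:
    """Derive an opaque identifier from the snapshot attributes of one file."""
    hasher = hashlib.sha256()
    parts = (session_id, relative_name, str(size), str(mtime_ns),
             str(inode), str(device), content_sha256)
    for part in parts:
        data = part.encode("utf-8")
        # A length prefix keeps ("ab", "c") distinct from ("a", "bc").
        hasher.update(len(data).to_bytes(4, "big"))
        hasher.update(data)
    return hasher.hexdigest()
```

Because every input is part of the digest, a file that is modified after detection produces a different `file_id`. If identifiers must also be unguessable, a keyed construction such as HMAC with a worker-held secret could replace the plain hash.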
|
||||
|
||||
## 7. Policy Model
|
||||
|
||||
Existing `file_transfer_mode` values remain:
|
||||
|
||||
| Mode | Upload client -> server | Download server -> client |
|
||||
| --- | --- | --- |
|
||||
| `disabled` | blocked | blocked |
|
||||
| `client_to_server` | allowed | blocked |
|
||||
| `server_to_client` | blocked | allowed |
|
||||
| `bidirectional` | allowed | allowed |
|
||||
|
||||
Stage 5.2 implements only the server-to-client side of this matrix. It must not
|
||||
regress the accepted client-to-server upload path.
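The matrix above reduces to a small lookup. The following Python sketch is illustrative only; `is_transfer_allowed` and the direction labels `upload` and `download` are hypothetical names, and unknown modes are denied by assumption.

```python
ALLOWED_DIRECTIONS = {
    "disabled": frozenset(),
    "client_to_server": frozenset({"upload"}),
    "server_to_client": frozenset({"download"}),
    "bidirectional": frozenset({"upload", "download"}),
}


def is_transfer_allowed(file_transfer_mode: str, direction: str) -> bool:
    """Return True only when the mode explicitly permits the direction."""
    # Unknown or missing modes fall back to an empty set, which denies everything.
    return direction in ALLOWED_DIRECTIONS.get(file_transfer_mode, frozenset())
```

Keeping the table in one place makes it easier for the backend and the worker to apply the same answer at each enforcement point.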
|
||||
|
||||
Policy enforcement points:
|
||||
|
||||
- backend session gateway and data-plane token allowed channels
|
||||
- worker runtime before publishing availability
|
||||
- worker runtime before reading/sending chunks
|
||||
- Windows client UI before presenting download actions
|
||||
- Windows client transport before sending download requests
|
||||
|
||||
UI checks are convenience only. Backend and worker checks are the security
|
||||
boundary.
|
||||
|
||||
## 8. Lifecycle Gating
|
||||
|
||||
Download is allowed only while all of the following are true:
|
||||
|
||||
- session state is `active`
|
||||
- attachment is the current controller
|
||||
- data-plane token allows file download
|
||||
- worker runtime still owns the active session
|
||||
- resource policy allows server-to-client file transfer
|
||||
|
||||
Download must be blocked or cancelled when:
|
||||
|
||||
- session is detached
|
||||
- attachment is superseded by takeover
|
||||
- session is failed
|
||||
- session is terminated
|
||||
- worker lease/runtime is stale
|
||||
- direct WSS token expires and no valid fallback path exists
|
||||
|
||||
In-flight downloads should be aborted on detach, takeover, failure, terminate,
|
||||
and window close. Stage 5.2 should not continue background downloads for an
|
||||
inactive controller.
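The gating conditions can be expressed as a single check that returns the first reason to block. This is a minimal Python sketch; the `DownloadGate` fields and the reason strings are hypothetical and not part of the existing contract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DownloadGate:
    session_state: str            # e.g. "active", "detached", "failed", "terminated"
    is_current_controller: bool   # False once a takeover supersedes the attachment
    token_allows_download: bool   # data-plane token includes the download channel
    worker_owns_session: bool     # worker lease/runtime is still current
    policy_allows_download: bool  # resource policy permits server-to-client transfer


def block_reason(gate: DownloadGate) -> Optional[str]:
    """Return None when download may proceed, otherwise a short reason code."""
    if gate.session_state != "active":
        return "session_" + gate.session_state
    if not gate.is_current_controller:
        return "attachment_superseded"
    if not gate.token_allows_download:
        return "token_denied"
    if not gate.worker_owns_session:
        return "worker_stale"
    if not gate.policy_allows_download:
        return "policy_denied"
    return None
```

Evaluating the same check before `file_download.start` and again before each chunk would cover both the blocked case and the in-flight cancellation case.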
|
||||
|
||||
## 9. Channel And Message Contract
|
||||
|
||||
Use a new logical channel:
|
||||
|
||||
```text
|
||||
file_download
|
||||
```
|
||||
|
||||
This keeps upload and download direction explicit while remaining under the
|
||||
broader `file_transfer` adapter concept.
|
||||
|
||||
Recommended events:
|
||||
|
||||
```text
|
||||
file_download.available
|
||||
file_download.start
|
||||
file_download.chunk
|
||||
file_download.ack
|
||||
file_download.progress
|
||||
file_download.cancel
|
||||
file_download.completed
|
||||
file_download.failed
|
||||
file_download.blocked
|
||||
```
|
||||
|
||||
Recommended common fields:
|
||||
|
||||
```json
|
||||
{
|
||||
"session_id": "...",
|
||||
"attachment_id": "...",
|
||||
"transfer_id": "...",
|
||||
"file_id": "...",
|
||||
"direction": "server_to_client",
|
||||
"file_name": "report.txt",
|
||||
"file_size": 12345,
|
||||
"offset": 0,
|
||||
"chunk_size": 262144,
|
||||
"total_size": 12345,
|
||||
"content_hash": "sha256:...",
|
||||
"sequence": 1,
|
||||
"status": "in_progress"
|
||||
}
|
||||
```
|
||||
|
||||
For Stage 5.2, chunks may use JSON/base64 payloads for compatibility with the
|
||||
current reliable control envelope model. A future data-plane stage can add
|
||||
binary chunk frames, but that must be a separate performance stage.
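To make the JSON/base64 cost concrete, the sketch below builds and parses a chunk envelope. It is an illustration in Python; the `type` and `data` keys and the helper names are assumptions, since the common fields above do not name the payload field.

```python
import base64
import json
from typing import Tuple


def encode_chunk(session_id: str, transfer_id: str, file_id: str, sequence: int,
                 offset: int, total_size: int, payload: bytes) -> str:
    """Serialize one chunk as a JSON envelope with a base64 payload."""
    envelope = {
        "type": "file_download.chunk",  # hypothetical event discriminator
        "session_id": session_id,
        "transfer_id": transfer_id,
        "file_id": file_id,
        "direction": "server_to_client",
        "sequence": sequence,
        "offset": offset,
        "chunk_size": len(payload),
        "total_size": total_size,
        "data": base64.b64encode(payload).decode("ascii"),  # hypothetical field
    }
    return json.dumps(envelope)


def decode_chunk(text: str) -> Tuple[dict, bytes]:
    """Parse an envelope and check the payload length against chunk_size."""
    envelope = json.loads(text)
    payload = base64.b64decode(envelope["data"], validate=True)
    if len(payload) != envelope["chunk_size"]:
        raise ValueError("payload length does not match chunk_size")
    return envelope, payload


def base64_length(raw_length: int) -> int:
    """Number of base64 characters needed for raw_length bytes, with padding."""
    return 4 * ((raw_length + 2) // 3)
```

For a 262,144-byte raw chunk, `base64_length` gives 349,528 characters, roughly a one-third overhead before JSON framing. That overhead is the main motivation for a later binary chunk stage.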
|
||||
|
||||
## 10. Chunking, Retry, And Backpressure
|
||||
|
||||
Recommended initial limits:
|
||||
|
||||
- max file size: reuse current 25 MiB default unless policy overrides are added
|
||||
- max raw chunk size: reuse current 256 KiB default
|
||||
- one or a small bounded number of concurrent downloads per session
|
||||
- bounded per-connection send queue
|
||||
- input and control always outrank file chunks
|
||||
- display/cursor must not be blocked by file transfer
|
||||
|
||||
Reliability model:
|
||||
|
||||
- client sends `file_download.start`
|
||||
- worker streams chunks in offset order
|
||||
- client writes to a temp local file
|
||||
- client acknowledges received offsets or chunk indexes
|
||||
- client verifies final hash before finalizing the local file
|
||||
- cancel stops worker reads and deletes local temp file
|
||||
- retry may request restart from offset only if the opened file snapshot is
|
||||
still the same; otherwise restart from zero with a new `file_id`
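The client side of this reliability model can be sketched as a receiver that accepts chunks only at the expected offset and checks the hash at the end. This Python sketch is illustrative; `DownloadReceiver` is a hypothetical name, the limits mirror the defaults listed above, and the hash is compared as a bare hex digest rather than the `sha256:` prefixed form.

```python
import hashlib

CHUNK_BYTES = 256 * 1024            # max raw chunk size
MAX_FILE_BYTES = 25 * 1024 * 1024   # max file size


class DownloadReceiver:
    """Tracks in-order chunk delivery and the running SHA-256 of the payload."""

    def __init__(self, total_size: int, expected_sha256: str) -> None:
        if total_size > MAX_FILE_BYTES:
            raise ValueError("file exceeds the configured maximum size")
        self.total_size = total_size
        self.expected_sha256 = expected_sha256
        self.received = 0
        self._hasher = hashlib.sha256()

    def accept(self, offset: int, data: bytes) -> int:
        """Consume one chunk and return the next offset to acknowledge."""
        if offset != self.received:
            raise ValueError("chunk arrived out of order")
        if len(data) > CHUNK_BYTES or self.received + len(data) > self.total_size:
            raise ValueError("chunk exceeds limits")
        self._hasher.update(data)  # a real client would also write to its temp file
        self.received += len(data)
        return self.received

    def verified(self) -> bool:
        """True once all bytes arrived and the digest matches."""
        return (self.received == self.total_size
                and self._hasher.hexdigest() == self.expected_sha256)
```

With these defaults a maximum-size file is exactly 100 chunks, so a bounded send queue of a few chunks keeps memory use small while still allowing acknowledgements to pace the worker.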
|
||||
|
||||
## 11. Local Client Destination
|
||||
|
||||
The Windows client should not auto-save files silently into arbitrary locations.
|
||||
|
||||
Recommended MVP behavior:
|
||||
|
||||
- show `file_download.available`
|
||||
- user selects a local save location or confirms a default downloads directory
|
||||
- write to a temp file first
|
||||
- verify hash
|
||||
- rename into final destination
|
||||
- never execute downloaded files
|
||||
- show localized progress, blocked, failed, cancelled, and completed messages
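The temp-file, verify, rename sequence might look like the following. It is a Python sketch rather than the Windows client implementation; `finalize_download` and its `overwrite` flag are hypothetical, with `overwrite=True` standing in for explicit user confirmation.

```python
import hashlib
import os
import tempfile
from typing import Iterable


def finalize_download(chunks: Iterable[bytes], expected_sha256: str,
                      dest_path: str, overwrite: bool = False) -> None:
    """Write chunks to a temp file, verify the hash, then rename into place."""
    if os.path.exists(dest_path) and not overwrite:
        raise FileExistsError(dest_path)
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    hasher = hashlib.sha256()
    # The temp file lives in the destination directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
                hasher.update(chunk)
        if hasher.hexdigest() != expected_sha256:
            raise ValueError("content hash mismatch")
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Never leave a partial file behind on failure or cancel.
        try:
            os.remove(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

The existence check and the rename are not atomic together, so a production client would still need to handle a destination that appears between the two steps.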
|
||||
|
||||
## 12. Security Constraints
|
||||
|
||||
Mandatory constraints:
|
||||
|
||||
- default policy remains `disabled`
|
||||
- no arbitrary worker filesystem paths
|
||||
- no arbitrary remote Windows paths
|
||||
- no parent path traversal
|
||||
- no symlink escape
|
||||
- no non-regular files
|
||||
- no overwrite without explicit user confirmation on the client side
|
||||
- no automatic file execution
|
||||
- no recursive directories
|
||||
- no server-to-client transfer after detach/takeover/failure/terminate
|
||||
- audit metadata only, never file contents
|
||||
- downloaded file names are display names, not trusted paths
|
||||
- worker cleanup must remove per-session visible storage on termination/failure
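The file-name constraints can be checked with a strict validator that rejects rather than rewrites suspicious names. The sketch below is an assumption about what "sanitized" should mean here; `is_safe_file_name` is a hypothetical helper, and the reserved-name set covers the classic Windows device names.

```python
WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {"COM%d" % i for i in range(1, 10)}
    | {"LPT%d" % i for i in range(1, 10)}
)


def is_safe_file_name(name: str) -> bool:
    """Accept only a plain file name with no path or device semantics."""
    if not name or len(name) > 255:
        return False
    if name in (".", ".."):
        return False
    # Separators and ':' cover traversal, absolute paths, drive letters,
    # and NTFS alternate data streams.
    if any(ch in name for ch in "/\\:"):
        return False
    if any(ord(ch) < 32 for ch in name):
        return False
    if name.endswith(".") or name.endswith(" "):
        return False
    stem = name.split(".", 1)[0].upper()
    return stem not in WINDOWS_RESERVED
```

A name that fails this check would map to the `blocked` outcome in the verification matrix rather than being silently renamed.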
|
||||
|
||||
## 13. Audit And Observability
|
||||
|
||||
Audit in PostgreSQL should record:
|
||||
|
||||
- `file_download_available` where useful and rate-limited
|
||||
- `file_download_started`
|
||||
- `file_download_completed`
|
||||
- `file_download_cancelled`
|
||||
- `file_download_failed`
|
||||
- `file_download_blocked`
|
||||
|
||||
Audit details should include:
|
||||
|
||||
- organization id
|
||||
- resource id
|
||||
- session id
|
||||
- attachment id
|
||||
- user id
|
||||
- worker id
|
||||
- file name
|
||||
- size
|
||||
- content hash
|
||||
- policy mode
|
||||
- reason for block/failure
|
||||
|
||||
Do not store file contents in PostgreSQL, Redis, or audit logs.
|
||||
|
||||
Worker/client diagnostics should expose:
|
||||
|
||||
- detected file count
|
||||
- stable-detection latency
|
||||
- chunk throughput
|
||||
- retry count
|
||||
- cancel count
|
||||
- queue depth
|
||||
- bytes sent
|
||||
- hash verification result
|
||||
|
||||
## 14. Backend Gateway Fallback
|
||||
|
||||
The backend gateway remains a fallback/debug path.
|
||||
|
||||
Rules:
|
||||
|
||||
- direct worker WSS is preferred for file chunks
|
||||
- backend gateway fallback must enforce the same policy and lifecycle gates
|
||||
- fallback may be slower but must remain correct
|
||||
- chunks must not be stored permanently in Redis
|
||||
- Redis is used only for live routing/coordination
|
||||
- PostgreSQL remains authoritative for session/resource/policy state
|
||||
|
||||
## 15. Verification Matrix
|
||||
|
||||
Policy:
|
||||
|
||||
| Case | Expected |
|
||||
| --- | --- |
|
||||
| `disabled` | no availability actions, start blocked |
|
||||
| `client_to_server` | upload still works, download blocked |
|
||||
| `server_to_client` | download works, upload blocked |
|
||||
| `bidirectional` | upload and download work |
|
||||
|
||||
Lifecycle:
|
||||
|
||||
| Case | Expected |
|
||||
| --- | --- |
|
||||
| active current controller | download allowed by policy |
|
||||
| detached | availability/start/chunks blocked or cancelled |
|
||||
| old attachment after takeover | blocked |
|
||||
| failed session | blocked/cancelled |
|
||||
| terminated session | blocked/cancelled |
|
||||
| worker death | client shows failure and no silent queued download |
|
||||
|
||||
Security:
|
||||
|
||||
| Case | Expected |
|
||||
| --- | --- |
|
||||
| normal text file | downloads, hash matches |
|
||||
| normal binary file | downloads, hash matches |
|
||||
| too large file | blocked |
|
||||
| path traversal name | blocked |
|
||||
| absolute path-like name | blocked |
|
||||
| symlink/non-regular file | blocked |
|
||||
| file modified during transfer | fail/restart safely |
|
||||
| duplicate final local file | user confirmation required |
|
||||
|
||||
Regression:
|
||||
|
||||
| Area | Expected |
|
||||
| --- | --- |
|
||||
| rendering | unchanged |
|
||||
| mouse/keyboard | unchanged |
|
||||
| clipboard | unchanged |
|
||||
| upload | unchanged |
|
||||
| detach/reattach/takeover | unchanged |
|
||||
| backend gateway fallback | unchanged |
|
||||
|
||||
## 16. Future Work
|
||||
|
||||
Future stages may add:
|
||||
|
||||
- binary file chunk frames over direct data plane
|
||||
- resumable transfer manifests
|
||||
- folder packaging as explicit archive download
|
||||
- organization-level file DLP scanning
|
||||
- malware scanning hooks
|
||||
- remote shell integration or Windows agent
|
||||
- WebDAV/SMB-like drop service
|
||||
- direct peer/relay data-plane optimization
|
||||
|
||||
None of these belong in Stage 5.2 implementation.
|
||||
|
||||
## 17. Proposed Stage 5.2 Implementation Prompt
|
||||
|
||||
Proceed with Stage 5.2 only - RDP server-to-client file download.
|
||||
|
||||
Goal:
|
||||
|
||||
Implement safe, policy-aware download from the remote RDP session to the
|
||||
Windows Access Client using the restricted `RAP_Transfers\ToClient` drop zone.
|
||||
|
||||
Strict rules:
|
||||
|
||||
- do NOT implement arbitrary remote path download
|
||||
- do NOT implement remote filesystem browser
|
||||
- do NOT implement recursive folder transfer
|
||||
- do NOT implement SMB/WebDAV/Windows agent
|
||||
- do NOT expose any worker path outside the per-session visible directory
|
||||
- do NOT change RDP rendering/input/clipboard behavior
|
||||
- do NOT remove backend gateway fallback
|
||||
- do NOT implement binary file chunk frames yet
|
||||
- do NOT start mesh/VPN/relay work
|
||||
|
||||
Scope:
|
||||
|
||||
1. Create a per-session `visible/ToClient` directory inside the existing
|
||||
restricted `RAP_Transfers` mapping.
|
||||
2. Detect stable regular files in `ToClient` only.
|
||||
3. Sanitize file names, reject traversal/absolute paths/non-regular files, and
|
||||
enforce size limits.
|
||||
4. Add `file_download` logical channel and envelopes:
|
||||
`available`, `start`, `chunk`, `ack`, `progress`, `cancel`, `completed`,
|
||||
`failed`, and `blocked`.
|
||||
5. Enforce `file_transfer_mode` for `server_to_client` and `bidirectional` in
|
||||
backend gateway, data-plane token channels, worker runtime, and Windows
|
||||
client.
|
||||
6. Gate all download actions to active current-controller sessions only.
|
||||
7. Stream chunks reliably with bounded queues and hash verification.
|
||||
8. Keep input/control/render/cursor priority above file chunks.
|
||||
9. Add localized Windows client feedback and safe local temp-file finalization.
|
||||
10. Audit start/completion/cancel/failure/block events without storing file
|
||||
contents.
|
||||
|
||||
Verification:
|
||||
|
||||
- `disabled` blocks download
|
||||
- `client_to_server` blocks download and upload still works
|
||||
- `server_to_client` allows download and blocks upload
|
||||
- `bidirectional` allows upload and download
|
||||
- text file downloads and hash matches
|
||||
- binary file downloads and hash matches
|
||||
- too-large file blocked
|
||||
- traversal/symlink/non-regular files blocked
|
||||
- download blocked after detach
|
||||
- old client after takeover blocked
|
||||
- worker failure cancels/blocks download
|
||||
- rendering, mouse, keyboard, clipboard, upload, reconnect, and takeover remain
|
||||
working
|
||||
- backend gateway fallback remains available
|
||||
|
||||
Deliver:
|
||||
|
||||
- backend policy/token/gateway changes
|
||||
- worker download detector and chunk sender
|
||||
- Windows client download UI/path
|
||||
- localized messages
|
||||
- smoke script/docs update
|
||||
- PASS/FAIL matrix with logs and hash evidence
|
||||
@@ -0,0 +1,597 @@
|
||||
# RDP Service C++ Performance Target
|
||||
|
||||
## Status
|
||||
|
||||
This is the paused RDP service performance direction. The implementation name is `RDP Adapter`: a concrete `Service Adapter` that translates Microsoft RDP into the platform session/data-plane protocol. The common adapter contract is defined in `docs/architecture/SERVICE_ADAPTER_PROTOCOL.md`; the RDP-specific runtime plan is defined in `docs/architecture/RDP_ADAPTER_RUNTIME.md`.
|
||||
|
||||
Current implementation status:
|
||||
|
||||
- RDP-A1 / RDP-Perf-1 is build-proven.
|
||||
- RDP-A2 adapter boundary is live-smoke-proven on the test Docker environment as of 2026-04-26: runtime code now goes through `RdpAdapterRuntime`.
|
||||
- RDP-Perf-2 runtime instrumentation is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
- RDP-Perf-3 region-first BGRA fallback is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
- RDP-Perf-4 gated RDPGFX foundation is build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26. The current live RDP target resets the connection when RDPGFX is advertised, so RDPGFX remains disabled by default.
|
||||
- RDP-A4 CursorAdapter is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26. Cursor events now flow as latest-only adapter-origin `cursor.update` events over direct worker WSS and remain compatible with backend gateway fallback.
|
||||
- RDP-Perf-5A GDI repaint cadence hardening is build-proven and smoke-proven on the test Docker environment as of 2026-04-26. Region/interactive frames now publish on a 33 ms cadence, hot-loop lease renewal was removed, and backend gateway fallback remains compatible.
|
||||
- RDP-Perf-6 dirty-region direct binary contract is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26. Direct `RAP2` render frames now distinguish full frames from dirty-region patches and carry payload savings diagnostics; observed runtime dirty regions reduced payloads from the `3,686,400` byte full frame to examples such as `16,384`, `163,840`, and `655,360` bytes.
|
||||
- Current accepted baseline is `rap-rdp-worker:rdp-p1-region-order2`: dirty-region delivery is preserved in order through `SessionRuntime`, worker direct WSS, Windows transport, and WPF presenter queues. Manual visual smoke accepted idle repaint, Start menu/hover, keyboard, mouse, and session close.
|
||||
- The remaining visual limitation is quality/performance rather than correctness: during a window drag only a frame moves, as in older/slow-link RDP clients, and repaint after release is usable but not polished.
|
||||
- FreeRDP remains the internal substrate behind the adapter boundary until region-first/event-driven replacement paths are live-proven.
|
||||
- RDP performance work is paused by product decision. When RDP work explicitly
|
||||
resumes, the next RDP step should continue from the stable GDI region-first
|
||||
path unless an RDPGFX-compatible target is added for gated testing.
|
||||
|
||||
The C++ worker remains the primary RDP runtime. The goal is not to rewrite the
|
||||
worker in another language. The goal is to replace the slow parts of the RDP
|
||||
service internals while preserving the proven backend/session/cluster/data-plane
|
||||
contracts.
|
||||
|
||||
The C# RDP service skeleton is superseded as a runtime direction and must not be
|
||||
used for implementation unless explicitly re-approved later.
|
||||
|
||||
## Current Problem
|
||||
|
||||
The current MVP proved the hard lifecycle behavior:
|
||||
|
||||
- connect
|
||||
- active state
|
||||
- detach without killing the remote session
|
||||
- reattach
|
||||
- takeover
|
||||
- terminate
|
||||
- clipboard text
|
||||
- file upload to worker storage
|
||||
- direct worker WSS data-plane
|
||||
|
||||
However, the render/input experience is not acceptable.
|
||||
|
||||
Root cause:
|
||||
|
||||
- the worker uses FreeRDP successfully for the RDP connection
|
||||
- but the production render path still behaves like framebuffer capture
|
||||
- the worker copies large BGRA buffers and publishes them as RAP frames
|
||||
- input is fast enough in parts of the path, but visual feedback depends on slow
|
||||
snapshot/frame delivery
|
||||
|
||||
On a LAN faster than 1 Gbit/s this should not be slow. The bottleneck is the RDP service
|
||||
render algorithm, not the network.
|
||||
|
||||
## Non-Negotiable Boundaries
|
||||
|
||||
Do not change:
|
||||
|
||||
- backend control plane
|
||||
- organization/session lifecycle
|
||||
- PostgreSQL source of truth
|
||||
- Redis live coordination model
|
||||
- worker leases and assignment contracts
|
||||
- `data_plane_token` contracts
|
||||
- direct worker WSS transport
|
||||
- backend gateway fallback
|
||||
- clipboard/file-transfer policy semantics
|
||||
|
||||
Only the RDP service adapter internals may change.
|
||||
|
||||
## Target Design
|
||||
|
||||
Keep the worker in C++.
|
||||
|
||||
Use C++ to own the RDP service internals:
|
||||
|
||||
- input adapter
|
||||
- graphics adapter
|
||||
- cursor adapter
|
||||
- virtual channel adapters
|
||||
- quality/adaptive controller
|
||||
- render sink to existing RAP data-plane
|
||||
|
||||
FreeRDP may remain temporarily as a connection/security/channel substrate, but
|
||||
the target production render path must not be FreeRDP GDI framebuffer snapshots.
|
||||
If a FreeRDP layer blocks access to the needed RDP graphics primitives, replace
|
||||
that narrow layer with project-owned C++ code rather than rewriting the full
|
||||
service in another language.
|
||||
|
||||
## High-Performance RDP Model
|
||||
|
||||
Fast RDP clients do not repeatedly send full desktop images. They use protocol
|
||||
updates:
|
||||
|
||||
- dirty rectangles
|
||||
- surface commands
|
||||
- cursor updates
|
||||
- bitmap/cache updates
|
||||
- RDPGFX dynamic virtual channel
|
||||
- RemoteFX Progressive / ClearCodec / H.264 AVC420 / AVC444 / HEVC where
|
||||
negotiated
|
||||
- adaptive graphics and quality selection
|
||||
|
||||
References:
|
||||
|
||||
- https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/da5c75f9-cd99-450c-98c4-014a496942b0
|
||||
- https://learn.microsoft.com/en-us/azure/virtual-desktop/graphics-encoding
|
||||
- https://freerdp-freerdp.mintlify.app/concepts/codecs
|
||||
|
||||
## New Internal Layers
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
Target["Windows RDP Server"]
|
||||
RdpCore["C++ RDP Core / FreeRDP Substrate"]
|
||||
Graphics["Graphics Adapter"]
|
||||
Input["Input Adapter"]
|
||||
Channels["Virtual Channel Adapters"]
|
||||
DataPlane["Existing Direct Worker WSS"]
|
||||
Client["RAP Windows Client"]
|
||||
|
||||
Target <--> RdpCore
|
||||
RdpCore --> Graphics
|
||||
Input --> RdpCore
|
||||
RdpCore <--> Channels
|
||||
Graphics --> DataPlane
|
||||
Channels --> DataPlane
|
||||
DataPlane <--> Client
|
||||
```
|
||||
|
||||
### Graphics Adapter
|
||||
|
||||
The graphics adapter converts RDP graphics primitives into RAP render updates.
|
||||
|
||||
Supported update classes:
|
||||
|
||||
- `frame_full_bgra` only for baseline/debug/fallback
|
||||
- `region_bgra` for dirty regions
|
||||
- `surface_create`
|
||||
- `surface_delete`
|
||||
- `surface_map`
|
||||
- `surface_bits`
|
||||
- `encoded_frame`
|
||||
- `cursor_update`
|
||||
|
||||
Rules:
|
||||
|
||||
- full-frame BGRA is fallback, not the target production path
|
||||
- direct render remains binary
|
||||
- backend gateway fallback may keep JSON/base64 compatibility
|
||||
- stale render updates are droppable
|
||||
- input never waits behind render
|
||||
|
||||
### Input Adapter
|
||||
|
||||
Input stays separate from render.
|
||||
|
||||
Rules:
|
||||
|
||||
- keyboard down/up is reliable and ordered
|
||||
- mouse button down/up and wheel are reliable and ordered
|
||||
- mouse move is latest-only/coalesced
|
||||
- button down must include or be preceded by pointer position
|
||||
- no RAP focus message may consume the first remote click
|
||||
- input must not trigger full-frame capture loops
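The difference between reliable ordered events and latest-only mouse moves can be sketched with a small queue. This Python illustration is not the input adapter's code; `InputQueue` and the event dictionary shape are hypothetical. Only consecutive moves are coalesced, so a move never jumps ahead of a button or key event.

```python
from collections import deque
from typing import Deque, Dict, List


class InputQueue:
    """Ordered queue where back-to-back mouse moves collapse into the latest one."""

    def __init__(self) -> None:
        self._events: Deque[Dict] = deque()

    def push(self, event: Dict) -> None:
        last = self._events[-1] if self._events else None
        if event["type"] == "mouse_move" and last is not None and last["type"] == "mouse_move":
            # Replace the pending move instead of queueing another one.
            self._events[-1] = event
        else:
            self._events.append(event)

    def drain(self) -> List[Dict]:
        """Return all pending events in order and clear the queue."""
        items = list(self._events)
        self._events.clear()
        return items
```

Since button events in this sketch carry their own `x` and `y`, a click still lands at the right position even when the preceding moves were coalesced away.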
|
||||
|
||||
### Virtual Channel Adapters
|
||||
|
||||
Clipboard/file/drive redirection remain isolated:
|
||||
|
||||
- clipboard stays text-only until explicitly expanded
|
||||
- restricted drive mapping remains policy-bound
|
||||
- file upload/download policies stay enforced in the real data path
|
||||
|
||||
## Weak Network Strategy
|
||||
|
||||
Weak-channel performance must degrade render before input.
|
||||
|
||||
Priority order:
|
||||
|
||||
1. input
|
||||
2. control
|
||||
3. clipboard
|
||||
4. render key updates
|
||||
5. file transfer
|
||||
6. telemetry
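The priority order above maps naturally onto a priority queue keyed by channel class. The following Python sketch assumes strict priority with first-in-first-out order inside each class; `SendQueue` and the channel labels are hypothetical names.

```python
import heapq
import itertools
from typing import Any, List, Tuple

CHANNEL_PRIORITY = {
    "input": 0,
    "control": 1,
    "clipboard": 2,
    "render": 3,
    "file_transfer": 4,
    "telemetry": 5,
}


class SendQueue:
    """Outbound queue that always releases the highest-priority message first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per class

    def push(self, channel: str, payload: Any) -> None:
        heapq.heappush(self._heap, (CHANNEL_PRIORITY[channel], next(self._counter), payload))

    def pop(self) -> Any:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Strict priority can starve the lower classes under sustained load, which is acceptable for telemetry but may call for a bounded queue or a minimum share for file transfer.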
|
||||
|
||||
Render adaptation:
|
||||
|
||||
- drop stale render updates
|
||||
- prefer dirty regions over full frames
|
||||
- reduce FPS before increasing input latency
|
||||
- reduce color mode where useful
|
||||
- use text-priority mode for office/admin workloads
|
||||
- use encoded/compressed graphics payloads where negotiated
|
||||
- never let file transfer or VPN-like bulk traffic starve RDP input/control
|
||||
|
||||
Quality profiles:
|
||||
|
||||
- `emergency_grayscale`
|
||||
- `low_bandwidth`
|
||||
- `text_priority`
|
||||
- `balanced`
|
||||
- `high_quality`
|
||||
|
||||
Color modes:
|
||||
|
||||
- full color
|
||||
- 256 colors
|
||||
- 64 colors
|
||||
- 16 colors
|
||||
- grayscale
|
||||
|
||||
## Migration Stages
|
||||
|
||||
### RDP-A1 / RDP-Perf-1: Boundary And Audit
|
||||
|
||||
Create C++ graphics/input adapter boundaries and document the replacement path.
|
||||
Do not change runtime behavior yet.
|
||||
|
||||
Deliver:
|
||||
|
||||
- common `Service Adapter` channel contract
|
||||
- RDP Adapter runtime plan
|
||||
- `graphics_adapter` interface
|
||||
- render update model
|
||||
- compile-safe probe
|
||||
- docs update
|
||||
|
||||
Status: completed and build-proven.
|
||||
|
||||
### RDP-Perf-2: Runtime Instrumentation And Source Selection
|
||||
|
||||
Measure existing FreeRDP update callbacks separately from frame publishing.
|
||||
|
||||
Deliver:
|
||||
|
||||
- update callback rate
|
||||
- dirty region dimensions
|
||||
- framebuffer copy time
|
||||
- binary send time
|
||||
- client render time
|
||||
- first-click trace without RAP focus interference
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Smoke command:
|
||||
|
||||
```powershell
|
||||
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
|
||||
-PreferDirectDataPlane:$true `
|
||||
-AllowInsecureDirectDataPlaneTlsForSmoke:$true `
|
||||
-DirectDataPlaneConnectTimeoutMs 2500 `
|
||||
-DirectDataPlaneColorMode full_color `
|
||||
-SkipOrgSwitchAndTokenRefresh
|
||||
```
|
||||
|
||||
Smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-perf2-instrumented`
|
||||
- session id: `1328b0dd-c5f9-4b15-b2ca-6d196ead5823`
|
||||
- direct data plane selected by the Windows client
|
||||
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
|
||||
- one RDP runtime was created for the session
|
||||
- artifacts:
|
||||
- `artifacts/rdp-perf2-worker-final.log`
|
||||
- `artifacts/rdp-perf2-client-final.log`
|
||||
- `artifacts/rdp-perf2-report.md`
|
||||
|
||||
Measured callback sources:
|
||||
|
||||
| Source | Count / behavior |
|
||||
| --- | --- |
|
||||
| `BeginPaint` | observed |
|
||||
| `EndPaint` | observed |
|
||||
| `BitmapUpdate` | observed and produced dirty region information |
|
||||
| `RefreshRect` | not observed in smoke |
|
||||
| `SurfaceBits` | not observed in smoke |
|
||||
| `SurfaceFrameMarker` | not observed in smoke |
|
||||
| `SurfaceFrameBits` | not observed in smoke |
|
||||
| pointer callbacks | not observed in smoke |
|
||||
|
||||
Measured conclusions:
|
||||
|
||||
- The RDP server/FreeRDP path does emit server-origin graphics callbacks in stable GDI mode.
|
||||
- Idle or server-origin screen changes can be detected without relying on local mouse/keyboard activity.
|
||||
- Full framebuffer copy time is not the main bottleneck in the measured smoke run.
|
||||
- The current render path duplicates work by capturing around both `BitmapUpdate` and `EndPaint`.
|
||||
- `EndPaint` should become a flush/safety marker rather than a second normal capture producer.
|
||||
- RDP-Perf-3 should make `BitmapUpdate` dirty regions the default normal render path and reserve full frames for connect/resize/attach/recovery.
|
||||
|
||||
### RDP-Perf-3: Region-First BGRA Fallback
|
||||
|
||||
Use true dirty regions as the default fallback path.
|
||||
|
||||
Deliver:
|
||||
|
||||
- no full-frame copy for small dirty updates
|
||||
- baseline full frame only on connect/resize/attach
|
||||
- region payloads only for normal UI changes
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-perf3-region-first`
|
||||
- direct smoke session id: `abc11233-34c4-45a6-a55b-0571a09332a1`
|
||||
- fallback smoke session id: `ee756839-6a82-49d4-9619-54acf69e1efd`
|
||||
- direct worker WSS selected and backend gateway fallback separately verified
|
||||
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed in both direct and fallback smoke
|
||||
- direct session cleanup state: `terminated`
|
||||
- fallback session cleanup state: `terminated`
|
||||
- report: `artifacts/rdp-perf3-report.md`
|
||||
|
||||
Measured direct-path results:
|
||||
|
||||
| Metric | Result |
|
||||
| --- | --- |
|
||||
| new RDP runtime count | 1 |
|
||||
| direct data-plane binds | 6 |
|
||||
| worker input apply events | 6 |
|
||||
| deferred `BitmapUpdate` callbacks | 104 |
|
||||
| `bitmap_update_flush` captures | 104 |
|
||||
| region flush captures | 93 |
|
||||
| full flush captures | 11 |
|
||||
| periodic duplicate changes | 0 |
|
||||
| client rendered region frames | 19 |
|
||||
| client skipped region frames | 0 |
|
||||
|
||||
Implementation notes:
|
||||
|
||||
- `BitmapUpdate` is now deferred during a paint cycle.
|
||||
- `EndPaint` flushes the accumulated `BitmapUpdate` dirty region once.
|
||||
- `EndPaint` no longer performs a second normal change-detect capture when a bitmap update was already flushed.
|
||||
- The periodic change detector snapshot is synchronized after callback-driven frame capture, avoiding rediscovery of the same changed pixels.
|
||||
- Direct binary frame metadata now preserves full desktop dimensions separately from region payload dimensions, so the Windows client can patch regions into its framebuffer.
|
||||
- Backend gateway fallback remains compatible with the existing JSON/base64 path.
|
||||
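The deferred-flush behaviour in the notes above can be sketched as follows. This is an illustration only, not the C++ worker code: `Rect`, `DirtyRegionAccumulator`, `on_bitmap_update`, and `on_end_paint` are hypothetical names, and merging all updates into one bounding rectangle is an assumption about how the dirty region is accumulated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    def union(self, other: "Rect") -> "Rect":
        """Return the smallest rectangle that covers both rectangles."""
        left = min(self.x, other.x)
        top = min(self.y, other.y)
        right = max(self.x + self.w, other.x + other.w)
        bottom = max(self.y + self.h, other.y + other.h)
        return Rect(left, top, right - left, bottom - top)


class DirtyRegionAccumulator:
    """Defers BitmapUpdate rectangles and flushes them once per paint cycle."""

    def __init__(self) -> None:
        self._pending: Optional[Rect] = None

    def on_bitmap_update(self, rect: Rect) -> None:
        # Deferred: only record the dirty area, no capture happens here.
        self._pending = rect if self._pending is None else self._pending.union(rect)

    def on_end_paint(self) -> Optional[Rect]:
        # Flush the accumulated region exactly once; None means nothing to capture.
        flushed, self._pending = self._pending, None
        return flushed


acc = DirtyRegionAccumulator()
acc.on_bitmap_update(Rect(10, 10, 20, 20))
acc.on_bitmap_update(Rect(100, 40, 16, 16))
region = acc.on_end_paint()        # one merged region for the whole paint cycle
second_flush = acc.on_end_paint()  # nothing left, so no second capture
```

Because the flush clears the pending region, a second `EndPaint` in the same cycle yields nothing, which mirrors the "no second capture" rule above.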
|
||||
### RDP-Perf-4: RDPGFX Channel Foundation
|
||||
|
||||
Capture and parse RDPGFX surface updates where available.
|
||||
|
||||
Deliver:
|
||||
|
||||
- surface lifecycle
|
||||
- surface bits updates
|
||||
- cursor updates
|
||||
- fallback to region BGRA when RDPGFX unavailable
|
||||
|
||||
Status: build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26.
|
||||
|
||||
Implementation:
|
||||
|
||||
- RDPGFX stays disabled by default.
|
||||
- `RDP_WORKER_RDPGFX_ENABLED=true` is the only runtime switch that enables it.
|
||||
- Worker diagnostics now log RDPGFX configuration, channel subscription, channel connection, GDI graphics pipeline initialization, fallback reasons, and normalized FreeRDP surface update callbacks.
|
||||
- Callback summaries include RDPGFX counters.
|
||||
- The default classic GDI region-first path remains the active safe path.
|
||||
|
||||
Default smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-perf4-rdpgfx-gated`
|
||||
- final default smoke session id: `30e80d99-e3b5-428b-aa18-fea65b8db499`
|
||||
- direct worker WSS selected
|
||||
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
|
||||
- session cleanup state: `terminated`
|
||||
- worker log: `rdp.gfx config requested=false mode=classic_gdi_region_first`
|
||||
- worker log: `rdp.perf callback_summary ... rdpgfx_requested=false ... frame_capture_region=...`
|
||||
|
||||
Gated RDPGFX target compatibility result:
|
||||
|
||||
- gated session id: `aa69f606-9217-4579-b438-b7d3ec5e01d0`
|
||||
- environment: `RDP_WORKER_RDPGFX_ENABLED=true`
|
||||
- result: failed on the current live RDP target
|
||||
- observed: `BIO_read returned a system error 104: Connection reset by peer`
|
||||
- observed: `freerdp_post_connect failed`
|
||||
- no `rdp.gfx channel_connected` or surface callbacks were observed before reset
|
||||
- conclusion: the current target must use the default GDI region-first path
|
||||
|
||||
Report: `artifacts/rdp-perf4-report.md`
|
||||
|
||||
### RDP-Perf-5: Encoded Graphics Payloads
|
||||
|
||||
Support encoded graphics payloads over RAP direct data-plane.
|
||||
|
||||
Deliver:
|
||||
|
||||
- binary encoded payload message type
|
||||
- client decode strategy
|
||||
- fallback to region BGRA
|
||||
|
||||
### RDP-A4: CursorAdapter
|
||||
|
||||
Move cursor handling into the RDP Adapter boundary and keep cursor events
|
||||
independent from display frame cadence.
|
||||
|
||||
Status: completed and live-smoke-proven on the test Docker environment as of
|
||||
2026-04-26.
|
||||
|
||||
Implementation:
|
||||
|
||||
- `CursorAdapter` normalizes FreeRDP pointer callbacks into cursor position,
|
||||
visibility, shape, cache, and mask metadata.
|
||||
- FreeRDP pointer callbacks are installed and restored inside the RDP runtime
|
||||
hook boundary.
|
||||
- Original FreeRDP pointer callbacks are invoked before platform normalization,
|
||||
preserving FreeRDP internal state.
|
||||
- `session_cursor_updated` worker events are mapped to platform
|
||||
`cursor.update` envelopes.
|
||||
- Direct worker WSS treats cursor as latest-only/droppable and schedules it
|
||||
separately from binary render frames.
|
||||
- Backend gateway fallback remains compatible with the same
|
||||
`session_cursor_updated` event payload.
|
||||
- Windows client accepts `cursor.update` through the existing render payload
|
||||
bridge without changing UI layout.
|
||||
|
||||
Smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-a4-cursor-adapter`
|
||||
- direct smoke session id: `549806aa-c9db-48a9-917e-cf817cf236b5`
|
||||
- fallback smoke session id: `dee3a856-bee1-4eba-9c10-f62edaf56547`
|
||||
- direct worker WSS selected in direct smoke
|
||||
- backend gateway selected in fallback smoke
|
||||
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
|
||||
in both direct and fallback smoke
|
||||
- direct session cleanup state: `terminated`
|
||||
- fallback session cleanup state: `terminated`
|
||||
- worker log: `cursor.adapter hooks installed pointer_callbacks=true`
|
||||
- worker log: `adapter_event channel=cursor type=cursor.update origin=adapter`
|
||||
- worker log: `rdp.perf callback_summary ... cursor_updates_enqueued=...`
|
||||
- client log: `SessionWindowViewModel.HandleEnvelopeAsync ... cursor.update`
|
||||
- report: `artifacts/rdp-a4-cursor-adapter-report.md`
|
||||
|
||||
Known limitation:
|
||||
|
||||
- Cursor event separation does not by itself fix delayed hover/menu repaint.
|
||||
The next safe step is a GDI repaint cadence and server-origin update audit on
|
||||
the stable region-first path.
|
||||
|
||||
### RDP-Perf-5A: GDI Repaint Cadence And Hover Feedback Hardening
|
||||
|
||||
Fix the first proven stable-path repaint cadence bottlenecks without changing
|
||||
backend, session lifecycle, data-plane contracts, clipboard/file transfer, or UI
|
||||
layout.
|
||||
|
||||
Status: build-proven and smoke-proven on the test Docker environment as of
|
||||
2026-04-26.
|
||||
|
||||
Implementation:
|
||||
|
||||
- FreeRDP event pump performs a bounded immediate drain after a signaled handle
|
||||
check so already-queued server events are not delayed by the next wait cycle.
|
||||
- Periodic no-change detection logging is rate-limited to avoid hot-loop log
|
||||
pressure while the remote screen is idle.
|
||||
- Worker session runtime renews the worker lease every 5 seconds instead of
|
||||
performing Redis lease I/O on every render/input loop iteration.
|
||||
- Region and interactive render notifications use a 33 ms publish interval.
|
||||
- Full-frame fallback remains at 100 ms.
|
||||
- Direct worker WSS binary writer uses the same 33 ms interval for
|
||||
region/interactive frames.
|
||||
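The two publish intervals above can be illustrated with a small rate-limited publisher. This is a sketch under stated assumptions, not the worker implementation: `LatestOnlyPublisher`, `submit`, and `poll` are hypothetical names, the clock is injected so the behaviour is deterministic, and replacing a not-yet-published frame with the newest one (latest-only) is an assumption rather than something the notes state for this stage.

```python
from typing import Callable, Optional, Tuple

REGION_INTERVAL_S = 0.033  # region/interactive publish interval from the notes above
FULL_INTERVAL_S = 0.100    # full-frame fallback interval from the notes above

Frame = Tuple[str, bytes]  # (kind, payload); kind is "region" or "full"


class LatestOnlyPublisher:
    """Holds at most one pending frame and releases it no faster than its interval."""

    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        self._pending: Optional[Frame] = None
        self._last_publish: Optional[float] = None

    def submit(self, kind: str, payload: bytes) -> None:
        # A newer frame replaces any frame that has not been published yet.
        self._pending = (kind, payload)

    def poll(self) -> Optional[Frame]:
        if self._pending is None:
            return None
        interval = FULL_INTERVAL_S if self._pending[0] == "full" else REGION_INTERVAL_S
        now = self._clock()
        if self._last_publish is not None and now - self._last_publish < interval:
            return None  # still inside the publish window
        frame, self._pending = self._pending, None
        self._last_publish = now
        return frame


now = [0.0]
pub = LatestOnlyPublisher(clock=lambda: now[0])
pub.submit("region", b"frame-1")
first = pub.poll()                 # nothing published before, goes out at once
pub.submit("region", b"frame-2")
now[0] = 0.010
too_early = pub.poll()             # 10 ms elapsed, inside the 33 ms window
pub.submit("region", b"frame-3")   # replaces frame-2
now[0] = 0.040
latest = pub.poll()                # 40 ms elapsed, newest frame is released
```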
|
||||
Smoke evidence:
|
||||
|
||||
- worker image: `rap-rdp-worker:rdp-perf5a-repaint-cadence`
|
||||
- direct smoke session id: `0cca4974-2a82-48dc-a0f6-1036ea8e98f0`
|
||||
- fallback smoke session id: `16deb09e-1c44-4e9d-8448-93b42ac66ed0`
|
||||
- direct worker WSS selected in direct smoke
|
||||
- backend gateway selected in fallback smoke
|
||||
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
|
||||
in both direct and fallback smoke
|
||||
- direct session cleanup state: `terminated`
|
||||
- fallback session cleanup state: `terminated`
|
||||
- report: `artifacts/rdp-perf5a-report.md`
|
||||
|
||||
Measured direct-path results:
|
||||
|
||||
| Metric | Result |
|
||||
| --- | --- |
|
||||
| client rendered frames observed | 65 |
|
||||
| client binary frames observed | 66 |
|
||||
| direct region publishes at 33 ms | 54 |
|
||||
| direct outbound FPS max | 9.705640 |
|
||||
| render seen FPS max | 26.386542 |
|
||||
| render published FPS max | 9.459327 |
|
||||
| direct backpressure frame drops | 0 |
|
||||
| render pending max | 0 |
|
||||
|
||||
Measured conclusion:
|
||||
|
||||
- Region/interactive frames now leave the worker promptly when server-origin
|
||||
changes arrive.
|
||||
- The direct smoke did not show queued FreeRDP event-handle bursts after the new
|
||||
immediate drain path: `event_pump_drained_checks=0`.
|
||||
- The current live target still emits idle/server-origin region changes at
|
||||
roughly 1 FPS in observed stable GDI mode.
|
||||
- Manual UX validation is still required before claiming that hover/menu
|
||||
responsiveness has been accepted by a human operator.
|
||||
|
||||
### RDP-Perf-6: Dirty-Region Direct Binary Render Contract
|
||||
|
||||
Replace full-frame-only direct binary render updates with explicit dirty-region
|
||||
direct binary render updates while preserving full-frame fallback.
|
||||
|
||||
Deliver:
|
||||
|
||||
- direct `RAP2` `message_type=render.frame.full`
|
||||
- direct `RAP2` `message_type=render.frame.region`
|
||||
- one bounding-rectangle dirty-region BGRA payload for normal UI changes
|
||||
- full-frame fallback for first frame, attach/reattach, resize, recovery,
|
||||
invalid region state, and debug/fallback mode
|
||||
- worker diagnostics for `full_frame_sent`, `region_frame_sent`,
|
||||
`region_bytes`, `full_frame_bytes`, `region_savings_percent`,
|
||||
`diff_time_ms`, `render_update_reason`, and
|
||||
`fallback_to_full_frame_reason`
|
||||
- Windows direct receiver support for explicit full/region message types
|
||||
- Windows framebuffer-backed region patching
|
||||
- backend gateway JSON/base64 fallback unchanged
|
||||
|
||||
Status: implemented and build/probe/live-smoke-proven on the test Docker
|
||||
environment as of 2026-04-26 using the current RDP target.
|
||||
|
||||
Build/probe evidence:
|
||||
|
||||
- worker image build: `rap-rdp-worker:rdp-perf6-dirty-region`
|
||||
- Windows client build: PASS
|
||||
- worker graphics adapter probe: PASS
|
||||
- worker direct data-plane bind valid probe: PASS
|
||||
- worker service adapter protocol probe: PASS
|
||||
- direct worker WSS smoke: PASS
|
||||
- backend gateway fallback smoke: PASS
|
||||
|
||||
Implementation notes:
|
||||
|
||||
- The current classic GDI region-first display path remains the source of
|
||||
dirty-region payloads.
|
||||
- The direct worker WSS sender no longer labels all binary render payloads as
|
||||
`session.frame`; it uses `render.frame.full` and `render.frame.region`.
|
||||
- The Windows transport still normalizes direct render frames into the existing
|
||||
application-level `session.frame` pipeline, so session lifecycle, input,
|
||||
clipboard, and file-transfer behavior are unchanged.
|
||||
- The Windows presenter keeps a session framebuffer and applies region patches
|
||||
into it before presenting the updated surface.
|
||||
- Backend gateway fallback remains JSON/base64 and is not used as the
|
||||
production high-rate render relay.
|
||||
- Runtime payload examples: full baseline `3,686,400` bytes; dirty regions
|
||||
`16,384`, `163,840`, `327,680`, `655,360`, and `737,280` bytes.
|
||||
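The payload sizes above are consistent with 4-byte BGRA pixels: `3,686,400` bytes equals 1280 x 720 x 4, and `16,384` bytes equals a 64 x 64 region. The sketch below shows how a receiver could patch such a region into a session framebuffer and how a savings figure could be derived. It is an illustration, not the Windows presenter code: `patch_region` and `region_savings_percent` are hypothetical helpers, the 1280x720 desktop size is inferred from the byte count, and the savings formula is an assumption about what the `region_savings_percent` diagnostic measures.

```python
BYTES_PER_PIXEL = 4  # BGRA


def patch_region(framebuffer: bytearray, fb_width: int,
                 x: int, y: int, w: int, h: int, region: bytes) -> None:
    """Copy a tightly packed w*h BGRA region into the framebuffer at (x, y)."""
    if len(region) != w * h * BYTES_PER_PIXEL:
        raise ValueError("region payload size does not match its dimensions")
    fb_height = len(framebuffer) // (fb_width * BYTES_PER_PIXEL)
    if x < 0 or y < 0 or x + w > fb_width or y + h > fb_height:
        raise ValueError("region falls outside the framebuffer")
    row_bytes = w * BYTES_PER_PIXEL
    for row in range(h):
        dst = ((y + row) * fb_width + x) * BYTES_PER_PIXEL
        src = row * row_bytes
        framebuffer[dst:dst + row_bytes] = region[src:src + row_bytes]


def region_savings_percent(region_bytes: int, full_frame_bytes: int) -> float:
    """Share of bytes avoided by sending a region instead of a full frame."""
    return 100.0 * (1.0 - region_bytes / full_frame_bytes)


full_frame_bytes = 1280 * 720 * BYTES_PER_PIXEL      # 3,686,400 bytes
fb = bytearray(full_frame_bytes)                      # baseline full frame
patch_region(fb, 1280, x=100, y=50, w=64, h=64, region=b"\xff" * 16_384)
savings = region_savings_percent(16_384, full_frame_bytes)
```

The size and bounds checks matter because an invalid region state is one of the listed reasons to fall back to a full frame rather than patch.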
|
||||
### RDP-Perf-7: Adaptive Quality Controller
|
||||
|
||||
Add channel-aware adaptive render quality.
|
||||
|
||||
Deliver:
|
||||
|
||||
- latency-aware profile switching
|
||||
- bandwidth-aware profile switching
|
||||
- latest-only render backpressure
|
||||
- stable input under load
|
||||
|
||||
## Acceptance Targets
|
||||
|
||||
LAN targets:
|
||||
|
||||
- first frame: under 2 seconds after successful RDP login
|
||||
- click to visible response: under 150 ms for common UI
|
||||
- keypress to visible response: under 150 ms for text input
|
||||
- pointer hover response: under 100 ms where the target emits hover changes
|
||||
- one click activates remote buttons correctly
|
||||
- no unbounded frame/input queues
|
||||
|
||||
Weak-channel targets:
|
||||
|
||||
- input remains usable even when render quality degrades
|
||||
- render drops stale updates instead of building backlog
|
||||
- file transfer never starves interactive input
|
||||
|
||||
## RDP Performance Work Paused
|
||||
|
||||
RDP performance work is paused. Next active work is Fabric Core / cluster
|
||||
foundation.
|
||||
|
||||
RDP-Perf-6 remains accepted and smoke-proven. Future RDP roadmap items such as
|
||||
RDP-Perf-7, adaptive quality, encoded payloads, additional RDPGFX testing,
|
||||
tiles, codecs, or further renderer optimization must not start without a new
|
||||
explicit RDP-stage prompt.
|
||||
|
||||
The preserved RDP baseline remains:
|
||||
|
||||
- C++ RDP Adapter runtime
|
||||
- direct worker WSS
|
||||
- backend gateway fallback
|
||||
- dirty-region direct binary render from RDP-Perf-6
|
||||
- proven session lifecycle
|
||||
- existing clipboard and file-transfer semantics
|
||||
@@ -0,0 +1,335 @@
|
||||
# RDP Service C# Target Architecture
|
||||
|
||||
## Status
|
||||
|
||||
Superseded.
|
||||
|
||||
The active direction is now documented in:
|
||||
|
||||
- `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`
|
||||
|
||||
The C++ worker remains the primary RDP runtime. This C# document is retained only
|
||||
as historical/research context and must not be used for implementation unless
|
||||
explicitly re-approved.
|
||||
|
||||
## Problem Statement
|
||||
|
||||
The current RDP MVP proved the platform lifecycle, but its rendering model is not
|
||||
production-grade:
|
||||
|
||||
- FreeRDP connects to the target RDP server.
|
||||
- The worker reads the GDI framebuffer.
|
||||
- The worker publishes full or cropped BGRA frames through RAP direct WSS.
|
||||
- The Windows client renders those frames as a custom viewer.
|
||||
|
||||
This is not how high-performance RDP clients work. On a fast LAN, the network is
|
||||
not the main bottleneck. The bottleneck is that the service is repeatedly copying
|
||||
and publishing screen images instead of consuming the RDP graphics protocol as a
|
||||
graphics protocol.
|
||||
|
||||
Observable symptoms:
|
||||
|
||||
- delayed visual feedback after input
|
||||
- unreliable first-click behavior
|
||||
- poor hover behavior
|
||||
- high CPU/memory pressure from framebuffer copies
|
||||
- unnecessary 1280x720 BGRA full-frame payloads
|
||||
- fragile coupling between input, render snapshots, and UI timing
|
||||
|
||||
## External Reference Model
|
||||
|
||||
Microsoft RDP performance is based on graphics protocol features rather than
|
||||
screen scraping:
|
||||
|
||||
- RDP Graphics Pipeline Extension (`MS-RDPEGFX`) uses a dynamic virtual channel
|
||||
for graphics pipeline updates.
|
||||
- RDP supports adaptive graphics, delta detection, caching, mixed-mode encoding,
|
||||
RemoteFX Progressive, H.264/AVC, AVC444, and HEVC in modern environments.
|
||||
- FreeRDP documentation describes the RDP GFX Pipeline (`rdpgfx`) and codecs such
|
||||
as RemoteFX Progressive, H.264 AVC420/AVC444, ClearCodec, and ZGFX.
|
||||
|
||||
References:
|
||||
|
||||
- https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/da5c75f9-cd99-450c-98c4-014a496942b0
|
||||
- https://learn.microsoft.com/en-us/azure/virtual-desktop/graphics-encoding
|
||||
- https://freerdp-freerdp.mintlify.app/concepts/codecs
|
||||
|
||||
## Target Decision
|
||||
|
||||
Replace the internal RDP engine with a C# implementation owned by this project.
|
||||
|
||||
The new service:
|
||||
|
||||
- is a RAP RDP service adapter, not a generic local RDP client UI
|
||||
- speaks standard RDP to the target Windows RDP server
|
||||
- keeps RDP protocol details inside the RDP service boundary
|
||||
- preserves current backend and cluster data-plane contracts
|
||||
- does not use FreeRDP as the runtime RDP engine
|
||||
- does not require the local Windows desktop client to become mstsc
|
||||
|
||||
The local Windows client remains a RAP client. It receives RAP display/input/
|
||||
clipboard/file messages over the existing direct worker WSS data-plane.
|
||||
|
||||
## What Must Not Change
|
||||
|
||||
The following are outside this rewrite:
|
||||
|
||||
- backend organization/auth/session lifecycle
|
||||
- PostgreSQL source-of-truth model
|
||||
- Redis live coordination model
|
||||
- worker registration and lease semantics
|
||||
- data_plane_token model
|
||||
- direct_worker_wss transport contract
|
||||
- backend gateway fallback
|
||||
- clipboard/file policy semantics
|
||||
- file upload policy semantics
|
||||
- session attach/detach/reattach/takeover/terminate semantics
|
||||
|
||||
Only the RDP service adapter internals change.
|
||||
|
||||
## Service Boundary
|
||||
|
||||
```mermaid
|
||||
flowchart LR
|
||||
Client["Windows RAP Client"]
|
||||
Backend["Backend Control Plane"]
|
||||
Worker["RDP Service Node"]
|
||||
Engine["C# RDP Protocol Engine"]
|
||||
Target["Target Windows RDP Server"]
|
||||
|
||||
Client <--> |"direct_worker_wss RAP channels"| Worker
|
||||
Backend <--> |"assignments, leases, audit"| Worker
|
||||
Worker --> Engine
|
||||
Engine <--> |"standard RDP"| Target
|
||||
```
|
||||
|
||||
The RDP service owns:
|
||||
|
||||
- RDP negotiation and transport
|
||||
- NLA/CredSSP/TLS integration
|
||||
- input translation to RDP fast-path input
|
||||
- graphics channel parsing
|
||||
- virtual channel handling for clipboard and future file features
|
||||
- conversion from RDP graphics units to RAP render messages
|
||||
- session runtime ownership and reconnect/takeover binding
|
||||
|
||||
The data-plane layer owns:
|
||||
|
||||
- data_plane_token validation
|
||||
- direct WSS connection binding
|
||||
- logical channel priority
|
||||
- reliable/droppable semantics
|
||||
- fallback compatibility
|
||||
|
||||
## New RDP Service Components
|
||||
|
||||
### `Rap.Rdp.Service`
|
||||
|
||||
Host process.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- load worker/RDP service configuration
|
||||
- register worker capabilities with the existing coordination layer (later stage)
|
||||
- expose the existing direct WSS endpoint (later stage)
|
||||
- create and supervise RDP sessions
|
||||
- keep the current C++ worker active until cutover
|
||||
|
||||
### `Rap.Rdp.Core`
|
||||
|
||||
Pure C# protocol and runtime boundaries.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- define RDP session lifecycle interfaces
|
||||
- define protocol engine interfaces
|
||||
- define graphics/input/clipboard/file abstractions
|
||||
- avoid any dependency on WPF or backend repositories
|
||||
|
||||
### `Rap.Rdp.Protocol`
|
||||
|
||||
Future implementation module.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- implement RDP connection sequence from Microsoft Open Specifications
|
||||
- implement security/NLA/CredSSP/TLS
|
||||
- implement core channels and fast-path input
|
||||
- implement graphics pipeline negotiation
|
||||
- implement virtual channel framing
|
||||
|
||||
This module must not depend on the Windows desktop UI.
|
||||
|
||||
### `Rap.Rdp.DataPlane`
|
||||
|
||||
Future adapter module.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- map RAP direct WSS JSON/binary envelopes to the protocol engine
|
||||
- keep input highest priority
|
||||
- keep render latest-frame or latest-update droppable
|
||||
- keep clipboard/file reliable and policy-gated
|
||||
|
||||
## Graphics Strategy
|
||||
|
||||
The new render path must not use framebuffer screen scraping as the primary
|
||||
production path.
|
||||
|
||||
Priority order:
|
||||
|
||||
1. RDPGFX graphics pipeline channel.
|
||||
2. Surface/dirty-region updates.
|
||||
3. Encoded graphics payloads where available.
|
||||
4. Raw bitmap fallback only for compatibility/debug.
|
||||
|
||||
Target RAP render message classes:
|
||||
|
||||
- `surface.create`
|
||||
- `surface.delete`
|
||||
- `surface.map`
|
||||
- `surface.region`
|
||||
- `surface.codec_frame`
|
||||
- `cursor.update`
|
||||
- `frame.ack`
|
||||
|
||||
The first usable implementation may still decode some graphics to BGRA, but only
|
||||
as a controlled fallback. It must not become the permanent production model.
|
||||
|
||||
## Input Strategy
|
||||
|
||||
Input must be independent from render.
|
||||
|
||||
Rules:
|
||||
|
||||
- mouse down/up, wheel, and keyboard down/up are reliable and ordered
|
||||
- pointer move is coalesced latest-only
|
||||
- pointer position is explicitly sent before button-down when needed
|
||||
- input never waits behind render
|
||||
- no UI focus event may be inserted into the same ordered sequence in a way that
|
||||
consumes the first remote click
|
||||
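The ordering and coalescing rules above can be sketched with a small queue. This is an illustration only: `InputEvent` and `InputQueue` are hypothetical names, and "when needed" is interpreted here as "when the button-down position differs from the last pointer position already queued", which is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional, Tuple


@dataclass(frozen=True)
class InputEvent:
    kind: str  # "move", "button_down", "button_up", "wheel", "key_down", "key_up"
    pos: Optional[Tuple[int, int]] = None
    code: int = 0


class InputQueue:
    """Keeps reliable events ordered and coalesces consecutive pointer moves."""

    def __init__(self) -> None:
        self._queue: Deque[InputEvent] = deque()
        self._last_pos: Optional[Tuple[int, int]] = None

    def push(self, ev: InputEvent) -> None:
        if ev.kind == "move":
            if self._queue and self._queue[-1].kind == "move":
                self._queue[-1] = ev      # latest-only: replace the trailing move
            else:
                self._queue.append(ev)
            self._last_pos = ev.pos
            return
        if ev.kind == "button_down" and ev.pos is not None and ev.pos != self._last_pos:
            # Send the pointer position explicitly before the button-down.
            self.push(InputEvent("move", pos=ev.pos))
        self._queue.append(ev)            # reliable events are never dropped or reordered

    def drain(self) -> List[InputEvent]:
        out = list(self._queue)
        self._queue.clear()
        return out


q = InputQueue()
q.push(InputEvent("move", pos=(1, 1)))
q.push(InputEvent("move", pos=(5, 5)))                 # coalesced with the previous move
q.push(InputEvent("button_down", pos=(9, 9), code=1))  # position is sent first
q.push(InputEvent("button_up", pos=(9, 9), code=1))
q.push(InputEvent("move", pos=(12, 12)))
q.push(InputEvent("move", pos=(13, 13)))
events = q.drain()
```

Only trailing moves are coalesced, so a move is never reordered across a button or key event.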
|
||||
The current double-click regression is treated as a bug caused by RAP-side
|
||||
focus/input sequencing, not as normal RDP behavior.
|
||||
|
||||
## Clipboard And File Strategy
|
||||
|
||||
Existing policy semantics remain:
|
||||
|
||||
- clipboard modes stay enforced in backend, gateway/data-plane, and RDP service
|
||||
- file transfer modes stay enforced in backend, gateway/data-plane, and RDP service
|
||||
- text clipboard maps to RDP clipboard virtual channel
|
||||
- restricted drive visibility remains a separate policy-controlled feature
|
||||
|
||||
The C# rewrite must not expand clipboard/file scope while replacing render/input.
|
||||
|
||||
## Staged Migration Plan
|
||||
|
||||
### RDP-C#-0: Documentation And Skeleton
|
||||
|
||||
Create a buildable C# RDP service skeleton with interfaces only.
|
||||
|
||||
No runtime cutover.
|
||||
|
||||
### RDP-C#-1: Control-Plane Compatible Worker Shell
|
||||
|
||||
Implement worker registration, heartbeats, lease renewal, assignment consumption,
|
||||
and direct WSS token validation in C# using existing contracts.
|
||||
|
||||
The C++ worker remains default.
|
||||
|
||||
### RDP-C#-2: RDP Handshake Probe
|
||||
|
||||
Implement a non-viewing RDP connection probe:
|
||||
|
||||
- TCP/TLS
|
||||
- basic RDP negotiation
|
||||
- NLA/CredSSP if required
|
||||
- connect/disconnect lifecycle
|
||||
- failure reporting
|
||||
|
||||
No rendering yet.
|
||||
|
||||
### RDP-C#-3: Input-Only Protocol Path
|
||||
|
||||
Once a session is connected, send fast-path keyboard/mouse input to the RDP server.
|
||||
|
||||
Use diagnostic-only graphics or no graphics.
|
||||
|
||||
### RDP-C#-4: Basic Graphics Protocol Path
|
||||
|
||||
Implement the simplest RDP graphics path needed to display a desktop without
|
||||
FreeRDP.
|
||||
|
||||
Allowed as temporary fallback:
|
||||
|
||||
- raw bitmap updates
|
||||
- dirty-region bitmap updates
|
||||
|
||||
Not acceptable as final production:
|
||||
|
||||
- repeated full-frame screenshot capture
|
||||
|
||||
### RDP-C#-5: RDPGFX Foundation
|
||||
|
||||
Implement RDPGFX channel negotiation and surface update handling.
|
||||
|
||||
### RDP-C#-6: Codec Path
|
||||
|
||||
Implement or relay supported encoded graphics modes:
|
||||
|
||||
- RemoteFX Progressive where practical
|
||||
- H.264/AVC420/AVC444 where negotiated
|
||||
- client-side decode through platform APIs where possible
|
||||
|
||||
### RDP-C#-7: Runtime Cutover
|
||||
|
||||
Enable the C# RDP service per worker/resource via feature flag.
|
||||
|
||||
Rollback must switch back to the current C++ worker without changing backend
|
||||
contracts.
|
||||
|
||||
## Performance Requirements
|
||||
|
||||
Target for LAN:
|
||||
|
||||
- first frame under 2 seconds after successful RDP login
|
||||
- click to visible response under 150 ms for normal UI
|
||||
- keypress to visible response under 150 ms for text input
|
||||
- pointer hover response under 100 ms where the target OS emits hover changes
|
||||
- no unbounded frame queue
|
||||
- no render work on UI thread except final apply
|
||||
- no full-frame publish loop for static desktops
|
||||
|
||||
## Risks
|
||||
|
||||
- Implementing RDP from specs is substantial.
|
||||
- NLA/CredSSP correctness is security-sensitive.
|
||||
- Graphics codecs are complex.
|
||||
- Some target servers may negotiate older bitmap paths.
|
||||
- AVC/AVC444 decode support differs by client platform.
|
||||
- A partial RDP engine must not be switched into production before smoke proof.
|
||||
|
||||
## Recommended Immediate Next Step
|
||||
|
||||
Proceed with RDP-C#-0 only.
|
||||
|
||||
Goal:
|
||||
Create a buildable C# RDP service skeleton and protocol boundaries, without
|
||||
switching runtime traffic away from the current worker.
|
||||
|
||||
Strict rules:
|
||||
|
||||
- do not change backend contracts
|
||||
- do not change cluster transport
|
||||
- do not remove C++ worker
|
||||
- do not use FreeRDP in the new C# service
|
||||
- do not use third-party RDP libraries
|
||||
- do not claim the C# engine is runtime-ready
|
||||
|
||||
Deliver:
|
||||
|
||||
- buildable `workers/rdp-service-csharp`
|
||||
- interfaces for protocol engine, data-plane bridge, graphics sink, input source
|
||||
- README with migration stages
|
||||
- docs update marking current C++/FreeRDP path as legacy MVP runtime
|
||||
@@ -0,0 +1,464 @@
|
||||
# Secure Node-to-Node Channel Lifecycle
|
||||
|
||||
Status: Stage C16 result. Documentation and architecture only.
|
||||
|
||||
This document defines the secure node-to-node channel lifecycle for the Secure
|
||||
Access Fabric. It does not implement code, migrations, APIs, mesh runtime
|
||||
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
|
||||
workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
Secure node-to-node channels are the future authenticated transport foundation
|
||||
for Fabric routes. They must exist as a trust and lifecycle model before any
|
||||
production mesh routing runtime carries traffic.
|
||||
|
||||
C16 defines:
|
||||
|
||||
- mTLS identity validation
|
||||
- connection establishment
|
||||
- channel authorization
|
||||
- lifecycle state
|
||||
- heartbeat and liveness
|
||||
- reconnect/backoff
|
||||
- draining
|
||||
- revalidation
|
||||
- trust rotation
|
||||
- revocation handling
|
||||
- failure observability
|
||||
|
||||
## 2. Non-Goals
|
||||
|
||||
C16 does not:
|
||||
|
||||
- implement packet forwarding
|
||||
- implement mesh routing runtime
|
||||
- implement relay node behavior
|
||||
- implement VPN/IP tunnel traffic
|
||||
- implement QUIC/WebRTC
|
||||
- implement service workloads
|
||||
- change RDP runtime
|
||||
- change backend session lifecycle
|
||||
- change Windows client behavior
|
||||
|
||||
It defines the node-channel lifecycle boundary only.
|
||||
|
||||
## 3. Trust Foundation
|
||||
|
||||
Every node-to-node channel must be authenticated.
|
||||
|
||||
Required identity inputs:
|
||||
|
||||
- cluster id
|
||||
- local node id
|
||||
- remote node id
|
||||
- local node certificate
|
||||
- remote node certificate
|
||||
- cluster trust roots
|
||||
- revocation metadata
|
||||
- role assignment snapshot
|
||||
- allowed peer relationship
|
||||
- route/channel authorization policy
|
||||
|
||||
Private keys remain local to the node. The control plane must never store node
|
||||
private keys.
|
||||
|
||||
## 4. mTLS Certificate Requirements
|
||||
|
||||
Node certificates must be cluster-scoped.
|
||||
|
||||
Certificate identity should bind:
|
||||
|
||||
- node id
|
||||
- cluster id
|
||||
- certificate serial
|
||||
- validity period
|
||||
- key usage for node-to-node transport
|
||||
- optional role/service constraints where practical
|
||||
|
||||
Validation must check:
|
||||
|
||||
- certificate chain
|
||||
- cluster trust root
|
||||
- certificate validity time
|
||||
- node id binding
|
||||
- cluster id binding
|
||||
- expected remote node id
|
||||
- revocation status
|
||||
- key usage
|
||||
- policy scope
|
||||
|
||||
A valid TLS certificate is necessary but not sufficient. The channel must also
|
||||
pass role, peer, route, and channel authorization.
|
||||
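The identity checks listed above can be sketched as a function that reports every failed check. This is a sketch under stated assumptions, since C16 defines no code: `NodeCertificate`, its fields, and `validate_peer_certificate` are hypothetical. Chain validation against the cluster trust root is assumed to be done by the TLS library and is represented here only as a boolean. Role, peer, route, and channel authorization are separate steps and are not covered.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, List


@dataclass(frozen=True)
class NodeCertificate:
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    chain_verified: bool        # chain verified against the cluster trust root
    node_transport_usage: bool  # key usage permits node-to-node transport


def validate_peer_certificate(cert: NodeCertificate, *, expected_node_id: str,
                              cluster_id: str, revoked_serials: FrozenSet[str],
                              now: datetime) -> List[str]:
    """Return the names of failed checks; an empty list means identity is valid."""
    failures: List[str] = []
    if not cert.chain_verified:
        failures.append("chain")
    if not (cert.not_before <= now <= cert.not_after):
        failures.append("validity")
    if cert.cluster_id != cluster_id:
        failures.append("cluster_id")
    if cert.node_id != expected_node_id:
        failures.append("node_id")
    if cert.serial in revoked_serials:
        failures.append("revoked")
    if not cert.node_transport_usage:
        failures.append("key_usage")
    return failures


cert = NodeCertificate(
    node_id="node-b", cluster_id="cluster-1", serial="01",
    not_before=datetime(2026, 1, 1, tzinfo=timezone.utc),
    not_after=datetime(2027, 1, 1, tzinfo=timezone.utc),
    chain_verified=True, node_transport_usage=True,
)
now = datetime(2026, 4, 26, tzinfo=timezone.utc)
ok = validate_peer_certificate(cert, expected_node_id="node-b", cluster_id="cluster-1",
                               revoked_serials=frozenset(), now=now)
bad = validate_peer_certificate(cert, expected_node_id="node-c", cluster_id="cluster-1",
                                revoked_serials=frozenset({"01"}), now=now)
```

Collecting all failures instead of stopping at the first one supports the failure observability goal listed for C16.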
|
||||
## 5. Channel Establishment Flow
|
||||
|
||||
Proposed logical flow:
|
||||
|
||||
1. Routing Engine or node-agent selects an allowed peer candidate.
|
||||
2. Local node checks peer directory and local policy.
|
||||
3. Local node opens authenticated transport.
|
||||
4. Both sides perform mTLS handshake.
|
||||
5. Both sides validate certificate identity and cluster scope.
|
||||
6. Both sides exchange channel hello metadata.
|
||||
7. Both sides validate role, channel classes, and policy version.
|
||||
8. Channel enters `established` state.
|
||||
9. Heartbeat/liveness begins.
|
||||
10. Channel is registered in local channel table with expiry/revalidation
|
||||
deadline.
|
||||
|
||||
Channel hello metadata should include:
|
||||
|
||||
- protocol version
|
||||
- cluster id
|
||||
- node id
|
||||
- supported channel classes
|
||||
- supported transport features
|
||||
- local config version
|
||||
- peer directory version
|
||||
- trust bundle version
|
||||
- route epoch
|
||||
- draining status
|
||||
|
||||
## 6. Channel States
|
||||
|
||||
Initial state machine:
|
||||
|
||||
- `idle`
|
||||
- `connecting`
|
||||
- `handshaking`
|
||||
- `authenticating`
|
||||
- `authorizing`
|
||||
- `established`
|
||||
- `revalidating`
|
||||
- `degraded`
|
||||
- `draining`
|
||||
- `closing`
|
||||
- `closed`
|
||||
- `failed`
|
||||
|
||||
State rules:
|
||||
|
||||
- no traffic before `established`
|
||||
- control/liveness may continue in `revalidating`
|
||||
- new non-essential traffic should stop in `draining`
|
||||
- channel must close on failed authentication
|
||||
- channel must close or degrade on failed reauthorization according to policy
|
||||
- terminal `closed`/`failed` channels must not be reused
|
||||
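The state list and rules above can be sketched as a transition table. The state names come from this section; the allowed transitions are an assumption, because C16 lists the states but does not enumerate the edges. `Channel`, `ALLOWED`, and `accepts_new_traffic` are hypothetical names, and treating `degraded` like `revalidating` (essential traffic only) is also an assumption.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState
ALLOWED = {
    S.IDLE: {S.CONNECTING, S.CLOSED},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.CLOSING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.CLOSING, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),   # terminal: must not be reused
    S.FAILED: set(),   # terminal: must not be reused
}


class Channel:
    def __init__(self) -> None:
        self.state = S.IDLE

    def transition(self, target: ChannelState) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.value} -> {target.value}")
        self.state = target

    def accepts_new_traffic(self, essential: bool) -> bool:
        if self.state is S.ESTABLISHED:
            return True
        if self.state in (S.REVALIDATING, S.DEGRADED, S.DRAINING):
            return essential  # control/liveness only
        return False          # no traffic before established or after close


ch = Channel()
for step in (S.CONNECTING, S.HANDSHAKING, S.AUTHENTICATING, S.AUTHORIZING, S.ESTABLISHED):
    ch.transition(step)
```

Keeping the terminal states as empty sets makes reuse of a `closed` or `failed` channel an explicit error instead of a silent reset.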
|
||||
## 7. Channel Classes
|
||||
|
||||
Allowed channel classes map to Fabric routing classes, not service-specific
|
||||
protocol internals.
|
||||
|
||||
Initial channel classes:
|
||||
|
||||
- `fabric_control`
|
||||
- `route_control`
|
||||
- `health`
|
||||
- `telemetry`
|
||||
- `render`
|
||||
- `input`
|
||||
- `clipboard`
|
||||
- `file_transfer`
|
||||
- `storage_fetch`
|
||||
- `update_fetch`
|
||||
- `vpn_packet`
|
||||
|
||||
Authorization is per channel class.
|
||||
|
||||
Rules:
|
||||
|
||||
- `input` and `fabric_control` require high-priority scheduling.
|
||||
- `render` and video-like traffic may be droppable/latest-only.
|
||||
- `file_transfer`, `storage_fetch`, and `update_fetch` must not starve
|
||||
`input` or control.
|
||||
- `vpn_packet` must be QoS-limited so bulk traffic cannot starve interactive
|
||||
channels.
|
||||
- A channel may carry only classes authorized by both local policy and the route result.
|
||||
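Per-class authorization and the priority rules above can be sketched as a set intersection plus a coarse scheduling tier. The class names come from this section; `authorized_classes`, `scheduling_tier`, and the three-tier split are hypothetical and only one possible reading of the rules.

```python
from typing import FrozenSet, Iterable

CHANNEL_CLASSES = frozenset({
    "fabric_control", "route_control", "health", "telemetry", "render", "input",
    "clipboard", "file_transfer", "storage_fetch", "update_fetch", "vpn_packet",
})
HIGH_PRIORITY = frozenset({"input", "fabric_control"})
BULK = frozenset({"file_transfer", "storage_fetch", "update_fetch", "vpn_packet"})


def authorized_classes(requested: Iterable[str], local_policy: Iterable[str],
                       route_result: Iterable[str]) -> FrozenSet[str]:
    """A class is carried only if local policy and the route result both allow it."""
    requested_set = frozenset(requested)
    unknown = requested_set - CHANNEL_CLASSES
    if unknown:
        raise ValueError(f"unknown channel classes: {sorted(unknown)}")
    return requested_set & frozenset(local_policy) & frozenset(route_result)


def scheduling_tier(channel_class: str) -> int:
    """Lower tier is served first, so bulk classes cannot starve input or control."""
    if channel_class in HIGH_PRIORITY:
        return 0
    if channel_class in BULK:
        return 2
    return 1


granted = authorized_classes(
    requested={"input", "render", "vpn_packet"},
    local_policy={"input", "render"},
    route_result={"input", "render", "file_transfer"},
)
order = sorted(["file_transfer", "render", "input"], key=scheduling_tier)
```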
|
||||
## 8. Channel Authorization
|
||||
|
||||
Authorization checks:
|
||||
|
||||
- local node is allowed to connect to remote node
|
||||
- remote node is allowed to accept the connection
|
||||
- cluster id matches
|
||||
- roles are compatible
|
||||
- route result or peer policy permits the relationship
|
||||
- requested channel classes are allowed
|
||||
- organization/service scope is allowed where applicable
|
||||
- partition/degraded state permits the channel
|
||||
- remote node is not revoked, disabled, or disallowed
|
||||
- certificate is not expired or revoked
|
||||
|
||||
Authorization must be repeated when:
|
||||
|
||||
- trust bundle changes
|
||||
- revocation list changes
|
||||
- role assignment changes
|
||||
- route policy changes
|
||||
- route epoch changes
|
||||
- channel is long-lived past revalidation interval
|
||||
|
||||
## 9. Heartbeat and Liveness
|
||||
|
||||
Heartbeats prove liveness, not authority.
|
||||
|
||||
Heartbeat metadata:
|
||||
|
||||
- channel id
|
||||
- local node id
|
||||
- remote node id
|
||||
- timestamp
|
||||
- sequence
|
||||
- observed latency
|
||||
- packet loss/jitter summary where available
|
||||
- local health hint
|
||||
- draining flag
|
||||
- config version
|
||||
- route epoch
|
||||
|
||||
Recommended heartbeat cadence:
|
||||
|
||||
- active control channels: 5-15 seconds
|
||||
- high-priority realtime channels: 2-10 seconds where needed
|
||||
- low-priority/storage channels: 15-60 seconds
|
||||
|
||||
Missing heartbeats should trigger:
|
||||
|
||||
1. suspicion state
|
||||
2. bounded retry
|
||||
3. route failover consideration
|
||||
4. channel close/failure
|
||||
5. health report
|
||||
|
||||
## 10. Reconnect and Backoff
|
||||
|
||||
Reconnect must be bounded and policy-aware.
|
||||
|
||||
Rules:
|
||||
|
||||
- use exponential backoff with jitter
|
||||
- do not stampede bootstrap peers
|
||||
- prefer warm candidates after active peer failure
|
||||
- stop reconnect when peer is revoked or policy disallows it
|
||||
- report repeated failures
|
||||
- preserve route stickiness only while healthy and authorized
|
||||
- avoid reconnect loops during draining or shutdown
|
||||
|
||||
Reconnect should use current peer cache and route policy, not stale hardcoded
|
||||
endpoints.
|
||||
|
||||
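A sketch of bounded exponential backoff with full jitter, plus the policy gate that stops reconnects. The parameter defaults (`base`, `cap`, `max_attempts`) and function names are assumptions for illustration.

```python
import random

def backoff_delays(base=1.0, cap=60.0, max_attempts=6, rng=random.random):
    """Yield at most max_attempts delays, each uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling  # full jitter spreads peers apart and avoids stampedes

def should_reconnect(peer_revoked, policy_allows, draining):
    """Reconnect only while the peer is authorized and the node is not draining."""
    return policy_allows and not peer_revoked and not draining
```

When the generator is exhausted, the caller would stop and report a failure such as `backoff_exhausted` instead of looping indefinitely.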
## 11. Revalidation
|
||||
|
||||
Long-lived channels must revalidate periodically.
|
||||
|
||||
Revalidation checks:
|
||||
|
||||
- certificate still valid
|
||||
- revocation status is fresh enough under policy
|
||||
- cluster trust root still valid
|
||||
- peer relationship still allowed
|
||||
- channel classes still allowed
|
||||
- route epoch/policy version still acceptable
|
||||
- role assignments still active
|
||||
|
||||
If revalidation fails:
|
||||
|
||||
- stop accepting new traffic
|
||||
- drain or close according to policy
|
||||
- report reason
|
||||
- trigger route failover where applicable
|
||||
|
||||
## 12. Draining and Graceful Shutdown
|
||||
|
||||
Draining supports maintenance and safe role removal.
|
||||
|
||||
Draining flow:
|
||||
|
||||
1. node enters draining state
|
||||
2. node advertises draining in heartbeat/channel metadata
|
||||
3. routing stops placing new flows on the node
|
||||
4. existing flows continue until TTL or policy deadline
|
||||
5. new non-essential channels are rejected
|
||||
6. channel closes after active work drains or deadline expires
|
||||
7. node reports drained status
|
||||
|
||||
Draining must not silently drop critical control messages.
|
||||
|
||||
If graceful drain fails, policy decides whether to force-close and fail over.
|
||||
|
||||
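Two decisions from the draining flow, sketched as pure functions. The function and parameter names are hypothetical; how a channel is marked "essential" is left to policy.

```python
def accept_new_channel(draining, essential):
    """While draining, only essential channels (e.g. critical control) are accepted."""
    return (not draining) or essential

def should_close_channel(draining, active_flows, now, deadline):
    """Close once active work has drained or the policy deadline has passed."""
    return draining and (active_flows == 0 or now >= deadline)
```

Keeping essential channels open during a drain is what prevents critical control messages from being dropped silently.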
## 13. Trust Rotation
|
||||
|
||||
Trust rotation must avoid split trust windows.
|
||||
|
||||
Recommended flow:
|
||||
|
||||
1. new trust bundle is signed by current trusted key
|
||||
2. nodes fetch and verify new trust bundle
|
||||
3. dual validation period begins where required
|
||||
4. new certificates are issued/accepted
|
||||
5. old certificates expire or are revoked
|
||||
6. old trust root is retired after rollout threshold
|
||||
|
||||
Channels should revalidate after trust bundle changes.
|
||||
|
||||
## 14. Revocation Handling
|
||||
|
||||
Revocation must affect active channels.
|
||||
|
||||
Revocation inputs:
|
||||
|
||||
- signed revocation list
|
||||
- trust bundle update
|
||||
- control-plane status after reconnect
|
||||
- emergency revocation policy
|
||||
|
||||
On revocation of remote node/certificate/key:
|
||||
|
||||
- stop new channels
|
||||
- mark existing channels as revalidation failed
|
||||
- close or drain according to policy severity
|
||||
- remove peer from eligible active/warm candidates
|
||||
- report and audit event
|
||||
|
||||
High-severity revocation should close immediately.
|
||||
|
||||
## 15. Partition and Degraded Behavior
|
||||
|
||||
In degraded mode, channels may continue only if:
|
||||
|
||||
- current signed snapshot permits it
|
||||
- certificates remain valid
|
||||
- revocation state is not known to reject the peer
|
||||
- route/channel policy permits degraded continuation
|
||||
- TTL has not expired
|
||||
|
||||
Degraded mode must not authorize:
|
||||
|
||||
- new node enrollment
|
||||
- new trust roots
|
||||
- role changes
|
||||
- cross-cluster trust changes
|
||||
- partition promotion
|
||||
- new high-risk channels without explicit policy approval
|
||||
|
||||
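The two lists in this section can be read as a conjunction of conditions and a deny set. The sketch below assumes hypothetical action names and boolean inputs; it is not a defined API.

```python
# Assumed action names for the unconditional prohibitions listed above.
# "New high-risk channels" are conditional on policy and are not modelled here.
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "trust_root_add",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})

def degraded_action_allowed(action):
    return action not in FORBIDDEN_IN_DEGRADED

def channel_may_continue(snapshot_permits, cert_valid, peer_known_revoked,
                         policy_permits_degraded, now, ttl_expires_at):
    """Every condition must hold; any single failure ends degraded continuation."""
    return (snapshot_permits and cert_valid and not peer_known_revoked
            and policy_permits_degraded and now < ttl_expires_at)
```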
## 16. Failure Classification
|
||||
|
||||
Failure reasons:
|
||||
|
||||
- `tls_handshake_failed`
|
||||
- `certificate_invalid`
|
||||
- `certificate_revoked`
|
||||
- `wrong_cluster`
|
||||
- `wrong_node`
|
||||
- `policy_denied`
|
||||
- `channel_class_denied`
|
||||
- `route_epoch_stale`
|
||||
- `heartbeat_timeout`
|
||||
- `peer_draining`
|
||||
- `peer_disabled`
|
||||
- `trust_bundle_stale`
|
||||
- `network_unreachable`
|
||||
- `backoff_exhausted`
|
||||
|
||||
Failures should be structured and safe to log.
|
||||
|
||||
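One way to keep failures structured and safe to log is a closed enumeration plus a record that carries identifiers only. The record fields `channel_id` and `remote_node_id` are assumptions for illustration.

```python
import json
from enum import Enum

class FailureReason(str, Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"

def failure_record(reason, channel_id, remote_node_id):
    """Serialize a failure using ids and the reason only (no keys, tokens, or payloads)."""
    reason = FailureReason(reason)  # raises ValueError for unknown reasons
    return json.dumps({
        "reason": reason.value,
        "channel_id": channel_id,
        "remote_node_id": remote_node_id,
    }, sort_keys=True)
```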
## 17. Observability
|
||||
|
||||
Node-agent should report:
|
||||
|
||||
- channel state
|
||||
- active channel count
|
||||
- channel classes in use
|
||||
- handshake failures
|
||||
- authorization failures
|
||||
- heartbeat latency
|
||||
- reconnect count
|
||||
- backoff state
|
||||
- draining state
|
||||
- revocation actions
|
||||
- revalidation failures
|
||||
- route epoch/policy version
|
||||
|
||||
Tenant views must not expose internal topology. Platform owner views may show
|
||||
full channel diagnostics according to audited policy.
|
||||
|
||||
## 18. Security Requirements
|
||||
|
||||
Required:
|
||||
|
||||
- mTLS for node-to-node channels
|
||||
- cluster-scoped node certificates
|
||||
- certificate revocation support
|
||||
- policy-scoped channel authorization
|
||||
- no unauthenticated peer enumeration
|
||||
- no channel use before authorization
|
||||
- channel class separation
|
||||
- QoS-aware scheduling expectations
|
||||
- structured audit for high-risk channel changes
|
||||
|
||||
Compromised node blast radius must be limited by:
|
||||
|
||||
- scoped certificates
|
||||
- scoped snapshots
|
||||
- role assignment
|
||||
- peer directory scope
|
||||
- channel authorization
|
||||
- revocation
|
||||
- topology hiding
|
||||
|
||||
## 19. Future Validation Tests
|
||||
|
||||
Future implementation tests must prove:
|
||||
|
||||
- valid node-to-node mTLS succeeds
|
||||
- wrong cluster certificate rejected
|
||||
- wrong node id rejected
|
||||
- expired certificate rejected
|
||||
- revoked certificate closes active channel
|
||||
- unauthorized channel class rejected
|
||||
- channel cannot carry traffic before authorization
|
||||
- heartbeat timeout triggers failure
|
||||
- draining stops new channels
|
||||
- trust rotation revalidates channels
|
||||
- degraded mode honors TTL and forbidden actions
|
||||
- tenant-safe views hide topology
|
||||
|
||||
## 20. C17 Preparation
|
||||
|
||||
C17 may plan mesh routing runtime only after C10-C16 are accepted.
|
||||
|
||||
C17 must use:
|
||||
|
||||
- signed snapshots
|
||||
- node-local state store
|
||||
- Fabric Storage / Config Storage
|
||||
- peer directory/cache
|
||||
- Fabric Routing Engine route results
|
||||
- secure node-to-node channels
|
||||
|
||||
C17 must not jump directly to broad production mesh. It should first define a
|
||||
minimal runtime implementation plan, test topology, rollback path, and go/no-go
|
||||
criteria.
|
||||
|
||||
## 21. Result / Decision
|
||||
|
||||
Stage C16 defines secure node-to-node channels as authenticated,
|
||||
policy-authorized, lifecycle-managed connections.
|
||||
|
||||
Decisions:
|
||||
|
||||
- mTLS is required for node-to-node channels.
|
||||
- Certificate validity is necessary but not sufficient; channel policy must
|
||||
authorize role, peer relationship, route, and channel classes.
|
||||
- Active channels must revalidate on trust, revocation, role, and route policy
|
||||
changes.
|
||||
- Draining is a first-class lifecycle state.
|
||||
- Revocation affects active channels.
|
||||
- Degraded mode is bounded and cannot authorize high-risk mutations.
|
||||
- C17 must plan mesh routing runtime using C10-C16 foundations.
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
|
||||
workload behavior is changed by C16.
|
||||
@@ -0,0 +1,209 @@
|
||||
# Security And Secrets Readiness
|
||||
|
||||
Status: P3.3 test-stand smoke complete for encrypted resource secrets,
|
||||
assignment-time resolution, and production fallback behavior with smoke-only
|
||||
direct worker WSS trust.
|
||||
|
||||
This document defines the next security hardening layer around the accepted RDP
|
||||
MVP baseline. It does not implement mesh, VPN, server-to-client download, new
|
||||
protocol adapters, or another RDP rendering mode.
|
||||
|
||||
## Current Accepted Baseline
|
||||
|
||||
- RDP worker baseline: `rap-rdp-worker:rdp-p1-region-order2`
|
||||
- Backend control plane remains source of truth.
|
||||
- Redis remains live coordination/routing only.
|
||||
- Direct worker WSS is preferred for realtime RDP.
|
||||
- Backend gateway remains fallback/debug.
|
||||
- Text clipboard is policy-gated and accepted.
|
||||
- Client-to-server file upload and restricted `RAP_Transfers` visibility are
|
||||
accepted.
|
||||
|
||||
## Problem
|
||||
|
||||
The current smoke/dev path can still seed RDP target credentials inside
|
||||
resource `metadata`. That was acceptable for proving lifecycle and RDP adapter
|
||||
behavior, but it must not be the production contract.
|
||||
|
||||
Production must not rely on plaintext target passwords, usernames, domain
|
||||
credentials, client secrets, tokens, or private keys stored in generic resource
|
||||
metadata.
|
||||
|
||||
## Target Secret Model
|
||||
|
||||
Resources keep non-secret connection shape:
|
||||
|
||||
```json
|
||||
{
|
||||
"id": "...",
|
||||
"organization_id": "...",
|
||||
"protocol": "rdp",
|
||||
"address": "rdp.example.internal:3389",
|
||||
"secret_ref": "rap-secret://org/<org_id>/resources/<resource_id>/rdp-primary",
|
||||
"metadata": {
|
||||
"certificate_verification_mode": "strict",
|
||||
"render_quality_profile": "balanced"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Secrets are stored separately and referenced by `secret_ref`. The secret payload
|
||||
is protocol-specific and versioned:
|
||||
|
||||
```json
|
||||
{
|
||||
"version": 1,
|
||||
"protocol": "rdp",
|
||||
"username": "...",
|
||||
"domain": "...",
|
||||
"password": "...",
|
||||
"rotation_version": 3
|
||||
}
|
||||
```
|
||||
|
||||
The reference, not the plaintext secret, is copied into session metadata and
|
||||
audit context.
|
||||
|
||||
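A sketch of parsing a `secret_ref`, assuming the layout is exactly the one in the example above (`rap-secret://org/<org_id>/resources/<resource_id>/<name>`). The function name and returned keys are illustrative; the real reference grammar may be broader.

```python
PREFIX = "rap-secret://"

def parse_secret_ref(ref):
    """Split a secret_ref into its ids; raise ValueError if the shape is unexpected."""
    if not ref.startswith(PREFIX):
        raise ValueError("malformed secret_ref")
    parts = ref[len(PREFIX):].split("/")
    if (len(parts) != 5 or parts[0] != "org" or parts[2] != "resources"
            or not all(parts)):
        raise ValueError("malformed secret_ref")
    return {"organization_id": parts[1], "resource_id": parts[3], "name": parts[4]}
```

Because the reference carries only ids, it can be copied into session metadata and audit context without exposing secret material.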
## Runtime Secret Resolution
|
||||
|
||||
Production runtime should resolve secrets through a dedicated secret resolver:
|
||||
|
||||
1. Backend validates resource/org/user authorization.
|
||||
2. Backend starts the session using resource `secret_ref`.
|
||||
3. Worker receives assignment with `secret_ref`, not plaintext credentials.
|
||||
4. Worker asks an authorized secret resolver for the secret using:
|
||||
- `organization_id`
|
||||
- `resource_id`
|
||||
- `worker_id`
|
||||
- `session_id`
|
||||
- short-lived lease/session proof
|
||||
5. Secret resolver returns credentials only to authorized workers for active
|
||||
leased sessions.
|
||||
6. Worker keeps secret material in memory only and never logs it.
|
||||
|
||||
The current P3.1 MVP uses an encrypted PostgreSQL-backed store:
|
||||
|
||||
- `resource_secrets` stores ciphertext, nonce, key id, algorithm, version, safe
|
||||
metadata, and `payload_sha256`.
|
||||
- `SECRET_ENCRYPTION_KEY_B64` or `SECRET_ENCRYPTION_KEY_FILE` supplies the
|
||||
AES-256-GCM key.
|
||||
- `SECRET_ENCRYPTION_KEY_ID` labels the active key.
|
||||
- The API can create or rotate a resource secret, but never returns plaintext.
|
||||
- Session assignment resolves the secret only after organization/resource/
|
||||
worker/session/lease checks.
|
||||
|
||||
The resolver boundary can later be backed by KMS, Vault, cloud secret managers,
|
||||
or node-local secure delivery without changing the resource `secret_ref`
|
||||
contract.
|
||||
|
||||
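The resolver-side check in steps 4 and 5 might look like the sketch below. `Lease`, the in-memory `store` and `leases` mappings, and the request keys are assumptions for illustration; decryption, transport, and proof verification are omitted.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lease:  # hypothetical record of an active leased session
    organization_id: str
    resource_id: str
    worker_id: str
    session_id: str
    expires_at: float
    active: bool = True

def resolve_secret(store, leases, request, now):
    """Return the secret payload only for a matching, active, unexpired lease."""
    lease = leases.get(request["session_id"])
    if lease is None or not lease.active or now >= lease.expires_at:
        return None
    for key in ("organization_id", "resource_id", "worker_id"):
        if request[key] != getattr(lease, key):
            return None
    return store.get((lease.organization_id, lease.resource_id))
```

A `None` result corresponds to a denied access, which is where an event such as `resource_secret_access_denied` would be recorded.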
## Production Guard
|
||||
|
||||
In `APP_ENV=production`:
|
||||
|
||||
- RDP/VNC/SSH resources must have `secret_ref`.
|
||||
- Plaintext credential-like keys are rejected in resource `metadata`.
|
||||
- Session start rejects legacy resources that still contain plaintext
|
||||
credential-like metadata.
|
||||
- Backend startup requires secret encryption key material.
|
||||
- Development/smoke environments may continue using plaintext metadata while
|
||||
the resolver path is not used, but this is explicitly not production mode.
|
||||
|
||||
Credential-like metadata keys include password, username, domain, token,
|
||||
private key, client secret, credential, credentials, secret, and common
|
||||
underscore/hyphen variants.
|
||||
|
||||
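A sketch of the metadata guard, assuming keys are compared after stripping underscores, hyphens, and whitespace and lowercasing. Exact matching on the normalized key is an assumption; a production guard might also match substrings such as `rdp_password`.

```python
import re

# Normalized forms of the credential-like key names listed above.
CREDENTIAL_TERMS = {
    "password", "username", "domain", "token", "privatekey",
    "clientsecret", "credential", "credentials", "secret",
}

def _normalize(key):
    return re.sub(r"[_\-\s]", "", key).lower()

def credential_like_keys(metadata):
    """Return the metadata keys that look like credentials, sorted for stable output."""
    return sorted(k for k in metadata if _normalize(k) in CREDENTIAL_TERMS)
```

In production mode a non-empty result would cause the resource write or session start to be rejected.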
## Data Plane Trust
|
||||
|
||||
Already accepted:
|
||||
|
||||
- backend signs `data_plane_token` with RS256 using an RSA private key
|
||||
- worker validates with public key only
|
||||
- token is short-lived
|
||||
- token includes session, attachment, user, organization, worker, resource,
|
||||
allowed channels, expiry, and jti
|
||||
- worker rejects wrong worker, wrong attachment, wrong organization, wrong
|
||||
resource, over-broad channels, failed/terminated sessions, and jti replay
|
||||
|
||||
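The worker-side rejections listed above could be expressed as follows, assuming the RS256 signature has already been verified. The claim names (`worker_id`, `attachment_id`, `channels`, `exp`, `jti`) and the reason strings are hypothetical; the document does not fix the token schema.

```python
def validate_bind(claims, expected, session_state, seen_jti, now):
    """Return None if the bind is acceptable, else a rejection reason.

    Assumes the token signature was verified before this call.
    """
    if now >= claims["exp"]:
        return "expired"
    for name in ("worker_id", "attachment_id", "organization_id", "resource_id"):
        if claims[name] != expected[name]:
            return "wrong_" + name
    if not set(claims["channels"]) <= set(expected["channels"]):
        return "channels_too_broad"
    if session_state in ("failed", "terminated"):
        return "session_" + session_state
    if claims["jti"] in seen_jti:
        return "jti_replay"
    seen_jti.add(claims["jti"])  # remember the id only after every other check passed
    return None
```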
Production still needs:
|
||||
|
||||
- deployed certificate chain for direct worker WSS on production nodes
|
||||
- pinned or platform-issued worker certificates in live production config
|
||||
- no smoke-only TLS bypass in production clients
|
||||
- rotation process for data-plane signing keys
|
||||
- audit for failed token validation/bind attempts
|
||||
|
||||
P3.2 guard exists:
|
||||
|
||||
- backend distinguishes `smoke_insecure`, `public_ca`, and `platform_ca`
|
||||
direct worker WSS trust modes
|
||||
- production backend omits smoke-only direct candidates
|
||||
- Windows production client skips untrusted or smoke-only direct candidates
|
||||
|
||||
P3.3 test-stand smoke exists:
|
||||
|
||||
- `resource_secrets` migration is applied on `docker-test`
|
||||
- backend runs as `APP_ENV=production` with a test-only
|
||||
`SECRET_ENCRYPTION_KEY_FILE`
|
||||
- a secret-backed RDP resource starts a real session through assignment-time
|
||||
secret resolution
|
||||
- `resources.metadata`, `remote_sessions.metadata`, and `audit_events` were
|
||||
checked for plaintext username/password leakage
|
||||
- production backend with `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
|
||||
returns backend gateway fallback only
|
||||
- development/smoke backend with the same trust mode advertises the explicit
|
||||
smoke-only direct worker WSS candidate
|
||||
- `RAP_Transfers` smoke passed on the secret-backed resource
|
||||
|
||||
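The P3.2 guard and the P3.3 observations can be summarized as a candidate filter. The candidate dictionaries and the function name are assumptions; only the trust mode names and the production/smoke behavior come from the lists above.

```python
def advertise_candidates(app_env, trust_mode):
    """List data-plane candidates; production never advertises smoke-only direct WSS."""
    candidates = []
    smoke_only = trust_mode == "smoke_insecure"
    if not (app_env == "production" and smoke_only):
        candidates.append({"kind": "direct_worker_wss", "trust_mode": trust_mode})
    candidates.append({"kind": "backend_gateway"})  # fallback is always present
    return candidates
```

Under this sketch a production backend configured with `smoke_insecure` returns the backend gateway fallback only, while a development backend with the same mode still advertises the direct candidate.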
## Required Regression Tests
|
||||
|
||||
P3 must protect:
|
||||
|
||||
- plaintext resource credentials rejected in production
|
||||
- RDP production resources require `secret_ref`
|
||||
- development smoke plaintext metadata remains allowed
|
||||
- data-plane allowed channels follow runtime policy
|
||||
- direct bind rejects wrong worker
|
||||
- direct bind rejects wrong user
|
||||
- direct bind rejects wrong organization
|
||||
- direct bind rejects wrong resource
|
||||
- direct bind rejects old attachment
|
||||
- direct bind rejects failed/terminated states
|
||||
|
||||
## Audit Events
|
||||
|
||||
Current audit coverage should remain for:
|
||||
|
||||
- session start
|
||||
- attach
|
||||
- detach
|
||||
- takeover
|
||||
- terminate
|
||||
- failure
|
||||
|
||||
Future audit coverage should add:
|
||||
|
||||
- secret deleted
|
||||
- production resource rejected because plaintext credential metadata was found
|
||||
|
||||
Audit entries must reference `secret_ref` and resource/session ids, never
|
||||
plaintext secret values.
|
||||
|
||||
P3.1 implemented audit events for:
|
||||
|
||||
- `resource_secret_rotated`
|
||||
- `resource_secret_accessed`
|
||||
- `resource_secret_access_denied`
|
||||
|
||||
## Remaining Production Gaps
|
||||
|
||||
- External KMS/Vault integration is not implemented yet.
|
||||
- Master-key rotation/re-encryption workflow is not implemented yet.
|
||||
- The worker still receives resolved credentials through the transient
|
||||
assignment payload; a future resolver pull/token flow should reduce exposure
|
||||
in Redis control queues.
|
||||
- Worker still depends on plaintext assignment metadata for development smoke.
|
||||
- Production direct worker WSS certificate issuance/rotation and platform CA
|
||||
distribution are not complete.
|
||||
- The test-stand secret key is a host-local test file, not a production KMS or
|
||||
HSM-backed key.
|
||||
- Automated end-to-end policy denial coverage is still thin.
|
||||
@@ -0,0 +1,211 @@
|
||||
# Service Adapter Protocol
|
||||
|
||||
Status: target contract and compile-safe foundation. This document defines the common adapter model for RDP, SSH, VNC, and future services. It does not replace the current backend control plane or current RDP runtime by itself.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
The platform client must not implement third-party protocols directly.
|
||||
|
||||
```text
|
||||
Access Client <-> Secure Access Session Protocol <-> Service Adapter <-> Target Resource Protocol
|
||||
```
|
||||
|
||||
A `Service Adapter` translates one external service protocol into the platform session/data-plane model. RDP is the first adapter, but the same model must support SSH, VNC, HTTP/internal apps, video, and future services.
|
||||
|
||||
## 2. Terms
|
||||
|
||||
- `Access Client`: user-facing Windows, iOS, Android, or future browser/native client.
|
||||
- `Service Adapter`: protocol translation runtime at the service/egress edge.
|
||||
- `RDP Adapter`: Service Adapter for Microsoft RDP.
|
||||
- `SSH Adapter`: Service Adapter for SSH/terminal/SFTP/port-forward flows.
|
||||
- `VNC Adapter`: Service Adapter for VNC framebuffer/input flows.
|
||||
- `Target Resource`: external resource such as RDP host, SSH host, VNC host, internal app, or video endpoint.
|
||||
- `Control Plane`: backend/API for auth, organization isolation, resource policy, session lifecycle, worker selection, audit, and token issuance.
|
||||
- `Data Plane`: realtime channels between Access Client and adapter.
|
||||
|
||||
### 2.1 Remote Server/Desktop Access Product Model
|
||||
|
||||
The user-facing product service is **Remote Server/Desktop Access**.
|
||||
|
||||
RDP, VNC, and SSH are not separate cluster services exposed to organization
|
||||
administrators. They are internal protocol adapters used by the Remote
|
||||
Server/Desktop Access service.
|
||||
|
||||
The protocol is selected from the organization resource definition:
|
||||
|
||||
```text
|
||||
Organization resource:
|
||||
name: Accounting-01
|
||||
address: 10.10.1.15
|
||||
port: 3389
|
||||
protocol: rdp
|
||||
egress: Office Moscow
|
||||
```
|
||||
|
||||
```text
|
||||
protocol = rdp -> RDP Adapter
|
||||
protocol = vnc -> VNC Adapter
|
||||
protocol = ssh -> SSH Adapter
|
||||
```
|
||||
|
||||
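As a sketch, the mapping above is a lookup keyed by the resource protocol, with the logical egress passed through as a label. The `ADAPTERS` table and `plan_connection` are illustrative names, not an existing API.

```python
ADAPTERS = {"rdp": "RDP Adapter", "vnc": "VNC Adapter", "ssh": "SSH Adapter"}

def plan_connection(resource):
    """Pick the adapter from resource.protocol and the egress pool from resource.egress."""
    protocol = resource["protocol"].lower()
    if protocol not in ADAPTERS:
        raise ValueError(f"no adapter for protocol {protocol!r}")
    # The egress value is a logical label; node/path choice stays with fabric routing.
    return {"adapter": ADAPTERS[protocol], "egress_pool": resource["egress"]}
```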
The Access Client always speaks the platform access protocol. It does not speak
|
||||
RDP, VNC, or SSH directly.
|
||||
|
||||
Cluster operators assign nodes to run the Remote Server/Desktop Access service.
|
||||
They do not separately enable "RDP service", "VNC service", or "SSH service" for
|
||||
the organization. Adapter selection is an internal runtime decision based on the
|
||||
selected resource protocol.
|
||||
|
||||
Organization administrators manage resources and policies, not internal nodes:
|
||||
|
||||
- resource name
|
||||
- target address and port
|
||||
- protocol
|
||||
- allowed users/groups
|
||||
- clipboard/file/session policy
|
||||
- logical egress, such as `Office Moscow`
|
||||
|
||||
The logical egress is not a concrete node. It is an organization-visible egress
|
||||
pool or route label. Internally the cluster may back `Office Moscow` with one or
|
||||
many nodes that have network reachability to that office. Fabric routing and
|
||||
placement choose the concrete node/path.
|
||||
|
||||
Resulting flow:
|
||||
|
||||
```text
|
||||
Access Client
|
||||
-> entry point
|
||||
-> authenticate user
|
||||
-> select organization
|
||||
-> list allowed resources
|
||||
-> select resource
|
||||
-> use resource.protocol to choose adapter
|
||||
-> use resource.egress to choose egress pool/path
|
||||
-> connect to target
|
||||
```
|
||||
|
||||
This decision prevents exposing internal adapter placement and node topology to
|
||||
organizations while preserving protocol-specific policy enforcement inside the
|
||||
adapter runtime.
|
||||
|
||||
## 3. Non-Negotiable Boundaries
|
||||
|
||||
- Access Client does not know RDP/SSH/VNC protocol internals.
|
||||
- Service Adapter does not know UI implementation details.
|
||||
- Control Plane remains authoritative for session lifecycle and policy.
|
||||
- PostgreSQL remains source of truth; Redis remains live coordination only.
|
||||
- Direct worker WSS and backend gateway fallback remain valid transports.
|
||||
- Adapter runtime must not create sessions outside broker/assignment control.
|
||||
|
||||
## 4. Logical Channels
|
||||
|
||||
The session protocol is channel-oriented even when DP-1 uses one WSS connection.
|
||||
|
||||
| Channel | Direction | Reliability | Priority | Purpose |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| `input` | client -> adapter | ordered reliable except move coalescing | highest | keyboard, pointer, wheel, focus |
|
||||
| `control` | both | reliable | high | attach, detach, takeover, state, heartbeat |
|
||||
| `display` | adapter -> client | droppable latest-frame/region | high but below input | frames, regions, surfaces, resize |
|
||||
| `cursor` | adapter -> client | latest-only | high | cursor position, shape, visibility |
|
||||
| `clipboard` | both | reliable | medium | policy-gated clipboard payloads |
|
||||
| `file_transfer` | both | reliable chunked | medium/low | upload/download, progress, cancel |
|
||||
| `audio` | adapter -> client, future client -> adapter | adaptive droppable | medium | future audio streams |
|
||||
| `device` | both | reliable | medium | future printer, smart card, drive policy events |
|
||||
| `telemetry` | adapter -> client/control | droppable | lowest | FPS, latency, queue depth, diagnostics |
|
||||
|
||||
Input must never wait behind display, file transfer, audio, or telemetry.
|
||||
|
||||
## 5. Event Direction Model
|
||||
|
||||
The adapter is not a passive responder. It must publish events whenever the target protocol emits them.
|
||||
|
||||
Client-origin examples:
|
||||
|
||||
- `input.keyboard`
|
||||
- `input.pointer_move`
|
||||
- `input.pointer_button`
|
||||
- `input.wheel`
|
||||
- `clipboard.client_text`
|
||||
- `file_upload.start`
|
||||
- `file_upload.chunk`
|
||||
- `control.detach`
|
||||
- `control.terminate`
|
||||
|
||||
Adapter-origin examples:
|
||||
|
||||
- `session.state`
|
||||
- `display.full_frame`
|
||||
- `display.region`
|
||||
- `display.surface_bits`
|
||||
- `display.encoded_frame`
|
||||
- `cursor.update`
|
||||
- `clipboard.server_text`
|
||||
- `file_transfer.progress`
|
||||
- `session.warning`
|
||||
- `session.failed`
|
||||
|
||||
Screen updates must be adapter-driven:
|
||||
|
||||
```text
|
||||
Target resource update -> adapter callback/event -> display/cursor event -> Access Client render
|
||||
```
|
||||
|
||||
Client input may request a fast refresh, but input must not be the primary trigger for discovering server-side screen changes.
|
||||
|
||||
## 6. Channel Scheduling
|
||||
|
||||
The adapter runtime must maintain separate scheduling semantics:
|
||||
|
||||
- `input`: drain first; keyboard and button events are ordered; pointer move is latest-only.
|
||||
- `control`: reliable and bounded; never behind render backlog.
|
||||
- `display`: droppable; stale frames/regions are discarded before send.
|
||||
- `cursor`: latest-only; may bypass display frame cadence.
|
||||
- `clipboard`: reliable and policy-gated.
|
||||
- `file_transfer`: reliable chunked; bandwidth-limited so it cannot starve input, display, or control.
|
||||
- `telemetry`: sampled/dropped under pressure.
|
||||
|
||||
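A simplified scheduler sketch that drains `input` first and coalesces pointer moves and cursor updates to the latest value. The relative order of `control`, `cursor`, and `display` is an assumption, and display staleness dropping and file-transfer bandwidth limits are not modelled.

```python
from collections import deque

# Assumed strict priority order, highest first.
PRIORITY = ["input", "control", "cursor", "display",
            "clipboard", "file_transfer", "telemetry"]
LATEST_ONLY_CHANNELS = {"cursor"}

class ChannelScheduler:
    def __init__(self):
        self.queues = {name: deque() for name in PRIORITY}

    def enqueue(self, channel, event_type, payload):
        queue = self.queues[channel]
        latest_only = (channel in LATEST_ONLY_CHANNELS
                       or event_type == "input.pointer_move")
        # Coalesce only with the queue tail so ordering against key and
        # button events is preserved.
        if latest_only and queue and queue[-1][0] == event_type:
            queue[-1] = (event_type, payload)
        else:
            queue.append((event_type, payload))

    def next_event(self):
        """Return (channel, event_type, payload) from the highest-priority non-empty queue."""
        for channel in PRIORITY:
            if self.queues[channel]:
                return (channel,) + self.queues[channel].popleft()
        return None
```

Strict priority alone can starve low-priority channels, so a real runtime would pair it with the bandwidth limits described above.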
## 7. Render Model
|
||||
|
||||
Display events should be sent in this preference order:
|
||||
|
||||
1. Encoded/surface updates when supported by the external protocol and client.
|
||||
2. Dirty regions/tiles.
|
||||
3. Full frame only for baseline, resize, attach/recover, or fallback.
|
||||
|
||||
Full-frame BGRA is a compatibility fallback, not the production performance target.
|
||||
|
||||
## 8. Adapter Policy Enforcement
|
||||
|
||||
Policy must be enforced inside the adapter runtime in addition to UI/backend checks:
|
||||
|
||||
- clipboard mode
|
||||
- file transfer mode
|
||||
- allowed channels
|
||||
- attachment/controller ownership
|
||||
- session active/taken_over/failed/terminated state
|
||||
- max payload sizes
|
||||
- dangerous path/name rejection
|
||||
- no arbitrary filesystem exposure
|
||||
|
||||
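Dangerous path/name rejection might start from a check like the one below. The exact rule set is an assumption; it covers path separators, control characters, traversal names, and Windows reserved device names, and it treats the input as a bare file name rather than a path.

```python
WINDOWS_RESERVED = (
    {"con", "prn", "aux", "nul"}
    | {f"com{i}" for i in range(1, 10)}
    | {f"lpt{i}" for i in range(1, 10)}
)

def is_safe_transfer_name(name, max_length=255):
    """Accept only a plain file name that cannot escape the transfer directory."""
    if not name or len(name) > max_length:
        return False
    if name in (".", ".."):
        return False
    if any(ch in name for ch in "/\\:"):
        return False
    if any(ord(ch) < 32 for ch in name):  # control characters, including NUL
        return False
    if name.split(".")[0].lower() in WINDOWS_RESERVED:
        return False
    if name.endswith((" ", ".")):  # trailing space/dot is not valid on Windows
        return False
    return True
```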
## 9. Adapter Lifecycle
|
||||
|
||||
All adapters must support:
|
||||
|
||||
- bind to existing assignment/session runtime
|
||||
- connect to target resource
|
||||
- publish state changes
|
||||
- keep runtime alive through detach when policy allows
|
||||
- reattach without recreating target session where protocol allows
|
||||
- takeover without recreating target session where protocol allows
|
||||
- terminate target session when broker commands terminate
|
||||
- fail fast and report authoritative failure when target runtime is gone
|
||||
|
||||
## 10. Future Adapters
|
||||
|
||||
RDP, SSH, and VNC share the same platform-facing contract but differ internally:
|
||||
|
||||
- RDP: graphics, cursor, keyboard/mouse, cliprdr, rdpdr, rdpgfx/graphics pipeline.
|
||||
- SSH: terminal output, keyboard input, resize, SFTP, port-forward.
|
||||
- VNC: framebuffer updates, pointer, keyboard, clipboard.
|
||||
|
||||
The common contract is the platform session protocol, not the external resource protocol.
|
||||
@@ -0,0 +1,415 @@
|
||||
# Signed Scoped Cluster Snapshot Model
|
||||
|
||||
Status: Stage C11 result. Documentation and architecture only.
|
||||
|
||||
This document defines the signed scoped cluster snapshot model for future
|
||||
`rap-node-agent` node-local operation and degraded-mode recovery. It does not
|
||||
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
|
||||
relay packet routing, RDP work, or service workload execution.
|
||||
|
||||
## 1. Purpose
|
||||
|
||||
Signed scoped cluster snapshots allow a node to operate from verified local
|
||||
configuration without asking the backend for every realtime routing decision.
|
||||
|
||||
The snapshot model must preserve these boundaries:
|
||||
|
||||
- PostgreSQL remains the only durable source of truth.
|
||||
- Fabric Storage / Config Storage distributes signed snapshots and increments.
|
||||
- Node-agent stores only scoped local copies.
|
||||
- Redis remains live coordination only.
|
||||
- Service Adapters consume assigned local configuration but do not define
|
||||
routing or cluster authority.
|
||||
|
||||
## 2. Snapshot Definition
|
||||
|
||||
A scoped cluster snapshot is a signed, versioned configuration package compiled
|
||||
from authoritative control-plane state.
|
||||
|
||||
Snapshot characteristics:
|
||||
|
||||
- cluster-scoped
|
||||
- node-scoped
|
||||
- role-scoped
|
||||
- organization-scoped where applicable
|
||||
- signed by an authorized control-plane/config signing key
|
||||
- bounded in size
|
||||
- time-limited
|
||||
- reconstructable from PostgreSQL
|
||||
- safe to store in node-local state
|
||||
|
||||
Snapshots are not mutable local databases. A node may cache them and use them
|
||||
for runtime decisions within policy, but it must not treat them as new durable
|
||||
truth.
|
||||
|
||||
## 3. Snapshot Envelope
|
||||
|
||||
Every snapshot must have a signed envelope.
|
||||
|
||||
Required envelope fields:
|
||||
|
||||
- `snapshot_id`
|
||||
- `schema_version`
|
||||
- `cluster_id`
|
||||
- `subject_node_id`
|
||||
- `scope_type`
|
||||
- `scope_ids`
|
||||
- `roles`
|
||||
- `organization_ids`
|
||||
- `config_version`
|
||||
- `authority_epoch`
|
||||
- `issued_at`
|
||||
- `valid_from`
|
||||
- `expires_at`
|
||||
- `refresh_after`
|
||||
- `signer_key_id`
|
||||
- `signature_algorithm`
|
||||
- `content_hash`
|
||||
- `signature`
|
||||
|
||||
Recommended signature algorithms:
|
||||
|
||||
- Ed25519 for compact modern signatures where supported
|
||||
- RSA signatures (RS256 or RSA-PSS) where compatibility with existing infrastructure is required
|
||||
|
||||
The wire encoding can start as canonical JSON and may later move to a binary
canonical form. What matters is that signer and verifier derive the same
deterministic canonical bytes for signature verification.
|
||||
|
||||
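A sketch of deterministic bytes and a `content_hash` using only the standard library. This is a simplified canonical form (sorted keys, no insignificant whitespace), not a full RFC 8785 implementation, and the `sha256:` prefix is an assumed encoding. Signing the resulting bytes with Ed25519 or RSA would need a cryptography library and is omitted.

```python
import hashlib
import json

def canonical_bytes(obj):
    """Serialize with sorted keys and fixed separators so equal content gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def content_hash(content):
    """Hash the canonical bytes; the 'sha256:' prefix is an illustrative convention."""
    return "sha256:" + hashlib.sha256(canonical_bytes(content)).hexdigest()
```

Key order in the source object no longer affects the hash, which is the property the signature check depends on.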
## 4. Snapshot Scope Types
|
||||
|
||||
Supported initial scope types:
|
||||
|
||||
- `node_bootstrap`
|
||||
- `node_runtime`
|
||||
- `peer_directory`
|
||||
- `service_assignment`
|
||||
- `route_policy`
|
||||
- `qos_policy`
|
||||
- `trust_bundle`
|
||||
- `storage_directory`
|
||||
- `degraded_mode_policy`
|
||||
|
||||
The control plane may deliver one combined node runtime snapshot or multiple
|
||||
specialized snapshots. The node-agent local store must track version and expiry
|
||||
per scope.
|
||||
|
||||
## 5. Role-Based Snapshot Contents
|
||||
|
||||
Core mesh node snapshot may include:
|
||||
|
||||
- cluster identity
|
||||
- node membership state
|
||||
- allowed peer subset
|
||||
- route policy subset
|
||||
- QoS policy subset
|
||||
- trust bundle
|
||||
- config/storage refresh endpoints
|
||||
- degraded-mode peer recovery policy
|
||||
|
||||
Ingress node snapshot may include:
|
||||
|
||||
- cluster identity
|
||||
- ingress role assignment
|
||||
- client entry policy subset
|
||||
- token validation trust material
|
||||
- route entry policy
|
||||
- allowed service endpoint projections
|
||||
- no full internal topology
|
||||
- no service target credentials
|
||||
|
||||
Egress/service node snapshot may include:
|
||||
|
||||
- assigned service workload refs
|
||||
- assigned resource refs
|
||||
- service policy subset
|
||||
- connector or `vpn_connection` refs when authorized
|
||||
- route policy needed for assigned services
|
||||
- secret resolver refs only, not raw secrets
|
||||
|
||||
Storage/config node snapshot may include:
|
||||
|
||||
- assigned storage/config shard scope
|
||||
- replication metadata
|
||||
- peer/storage refresh policy
|
||||
- allowed snapshot families
|
||||
- no unrelated tenant data
|
||||
|
||||
Thin/mobile node snapshot may include:
|
||||
|
||||
- minimal trust bundle
|
||||
- active session/tunnel policy subset
|
||||
- minimal peer/bootstrap data
|
||||
- route refresh endpoints
|
||||
- no full cluster topology
|
||||
|
||||
## 6. Snapshot Content Rules
|
||||
|
||||
Allowed content:
|
||||
|
||||
- ids and safe metadata
|
||||
- role assignments for the subject scope
|
||||
- policy refs and selected policy bodies needed by the node
|
||||
- peer directory subset
|
||||
- route/QoS policy subset
|
||||
- trust roots and revocation metadata
|
||||
- service workload desired-state refs
|
||||
- secret resolver refs
|
||||
- degraded-mode policy
|
||||
|
||||
Forbidden content:
|
||||
|
||||
- unrelated organization data
|
||||
- broad organization user lists
|
||||
- raw RDP/VNC/SSH credentials
|
||||
- raw VPN credentials
|
||||
- secrets outside approved resolver flow
|
||||
- platform-wide topology for ordinary nodes
|
||||
- arbitrary query grants
|
||||
- audit authority
|
||||
- durable policy mutation authority
|
||||
|
||||
## 7. Full Snapshots and Incremental Updates
|
||||
|
||||
Full snapshot:
|
||||
|
||||
- establishes node-local state for a scope
|
||||
- repairs version gaps
|
||||
- repairs corruption
|
||||
- establishes a new `authority_epoch`
|
||||
- may replace older snapshots for the same scope
|
||||
|
||||
Incremental update:
|
||||
|
||||
- applies to exactly one base `config_version`
|
||||
- carries `base_config_version`
|
||||
- carries `next_config_version`
|
||||
- contains scoped patch operations or replacement sections
|
||||
- is signed independently
|
||||
- must be rejected if base version does not match
|
||||
|
||||
Rules:
|
||||
|
||||
- version gaps require full resync
|
||||
- signature mismatch requires rejection and recovery
|
||||
- expired snapshots cannot authorize new operations
|
||||
- node heartbeat/status must report last applied version per scope
|
||||
- rollback is forbidden unless signed recovery policy explicitly allows it
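
The base-version rule for incremental updates can be sketched as follows. This is a minimal illustration, not the node-agent implementation: it assumes `config_version` is a monotonically increasing integer and that an update carries whole replacement sections. `IncrementalUpdate`, `VersionGap`, and `apply_increment` are hypothetical names, and signature verification is assumed to have happened before the call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IncrementalUpdate:
    base_config_version: int      # version this update applies to
    next_config_version: int      # version the node holds after applying it
    replacement_sections: dict    # section name -> new section body


class VersionGap(Exception):
    """The update does not match the local version; a full resync is required."""


def apply_increment(local_version: int, local_sections: dict,
                    update: IncrementalUpdate) -> tuple[int, dict]:
    # An increment applies to exactly one base version, never "close enough".
    if update.base_config_version != local_version:
        raise VersionGap(
            f"local={local_version}, update base={update.base_config_version}")
    if update.next_config_version <= update.base_config_version:
        raise ValueError("next_config_version must move forward")
    # Build a new mapping so the active state is untouched until activation.
    merged = {**local_sections, **update.replacement_sections}
    return update.next_config_version, merged
```

A caller would treat `VersionGap` as the trigger for requesting a full snapshot rather than retrying the increment.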
|
||||
|
||||
## 8. Trust Roots and Signing Key Rotation
|
||||
|
||||
The node-agent must know which config signing keys are trusted for each cluster.
|
||||
|
||||
Trust material may come from:
|
||||
|
||||
- enrollment response
|
||||
- trust bundle snapshot
|
||||
- manually installed platform root for bootstrap
|
||||
- signed key rotation update
|
||||
|
||||
Signing key rotation rules:
|
||||
|
||||
1. New key is introduced in a signed trust bundle.
|
||||
2. Node verifies the new key through existing trust.
|
||||
3. Snapshots may be dual-signed during transition.
|
||||
4. Old key is retired only after policy-defined rollout.
|
||||
5. Compromised key is revoked through signed revocation metadata or emergency
|
||||
recovery flow.
|
||||
|
||||
A node must reject snapshots signed by unknown, expired, revoked, or
|
||||
cluster-mismatched keys.
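
As a hedged sketch of the rejection rule above, the signer check can be modeled as a lookup against a local trust store. The `SigningKey` fields, the reason strings, and the use of Unix timestamps are assumptions made for illustration; the real trust bundle format is not defined in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SigningKey:
    key_id: str
    cluster_id: str
    not_after: float       # expiry as a Unix timestamp (assumed representation)
    revoked: bool = False


def rejection_reason(trusted: dict, key_id: str,
                     cluster_id: str, now: float) -> Optional[str]:
    """Return why a signer must be rejected, or None if it is acceptable."""
    key = trusted.get(key_id)
    if key is None:
        return "unknown_key"
    if key.revoked:
        return "revoked_key"
    if now >= key.not_after:
        return "expired_key"
    if key.cluster_id != cluster_id:
        return "cluster_mismatch"
    return None
```

During a dual-signed transition, a snapshot would be acceptable if at least one of its signatures comes from a key for which this function returns `None`; whether one or both signatures are required is a policy decision this sketch does not settle.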
|
||||
|
||||
## 9. Verification Algorithm
|
||||
|
||||
Before applying a snapshot, node-agent verifies:
|
||||
|
||||
1. Envelope schema is supported.
|
||||
2. `cluster_id` matches local cluster membership.
|
||||
3. `subject_node_id` matches the local node, unless the scope explicitly allows
|
||||
shared role data.
|
||||
4. Signature key is trusted for the cluster and snapshot scope.
|
||||
5. Signature verifies over canonical bytes.
|
||||
6. `content_hash` matches content.
|
||||
7. `valid_from`, `expires_at`, and `refresh_after` are acceptable.
|
||||
8. `authority_epoch` is not stale.
|
||||
9. `config_version` is newer than the local accepted version or allowed by a
|
||||
signed recovery policy.
|
||||
10. Scope does not grant data beyond node role and organization authorization.
|
||||
11. Snapshot content passes structural validation.
|
||||
12. Snapshot does not contain forbidden raw secrets.
|
||||
|
||||
If any check fails, the node-agent must reject the snapshot and keep the previous valid snapshot active, provided policy still allows its use.
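
A subset of these checks can be sketched in a few lines. The example below covers only steps 2, 3, 6, 7, 8, and 9 and makes several assumptions: the envelope is a plain dictionary, canonical bytes are sorted-key compact JSON, `content_hash` is a hex SHA-256 digest, and versions and epochs are integers. Signature verification, `refresh_after`, scope checks, the shared-role exception in step 3, and secret scanning are deliberately left out.

```python
import hashlib
import json


def canonical_bytes(content: dict) -> bytes:
    # Assumed canonical form: sorted keys, compact separators, UTF-8.
    return json.dumps(content, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_envelope(envelope: dict, local: dict, now: float) -> list:
    """Return the names of failed checks; an empty list means these checks passed."""
    failures = []
    if envelope["cluster_id"] != local["cluster_id"]:
        failures.append("cluster_id")
    if envelope["subject_node_id"] != local["node_id"]:
        failures.append("subject_node_id")
    digest = hashlib.sha256(canonical_bytes(envelope["content"])).hexdigest()
    if digest != envelope["content_hash"]:
        failures.append("content_hash")
    if not (envelope["valid_from"] <= now < envelope["expires_at"]):
        failures.append("validity_window")
    if envelope["authority_epoch"] < local["authority_epoch"]:
        failures.append("authority_epoch")
    if envelope["config_version"] <= local["config_version"]:
        failures.append("config_version")
    return failures
```

Collecting every failure instead of stopping at the first one makes it easier to record a precise rejection reason in node status.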
|
||||
|
||||
## 10. Degraded-Mode Use
|
||||
|
||||
Snapshots define what the node may do when disconnected from the backend or
|
||||
config/storage services.
|
||||
|
||||
Allowed when policy permits:
|
||||
|
||||
- continue already-running assigned services
|
||||
- preserve existing authorized routes for a bounded TTL
|
||||
- reconnect to active/warm/bootstrap peers
|
||||
- use local trust bundle to validate peers
|
||||
- use storage/config endpoints from the last valid snapshot
|
||||
- report degraded status when connectivity returns
|
||||
|
||||
Forbidden in degraded mode:
|
||||
|
||||
- approve node enrollment
|
||||
- issue certificates
|
||||
- assign roles
|
||||
- change cluster policy
|
||||
- change organization policy
|
||||
- rotate trust roots
|
||||
- promote partitions automatically
|
||||
- fetch unrelated secrets
|
||||
- create new service authority outside the snapshot scope
|
||||
|
||||
Degraded mode must be bounded by:
|
||||
|
||||
- snapshot expiry
|
||||
- route/session TTL
|
||||
- degraded-mode policy
|
||||
- partition/authority state
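
One way to read these rules is as a deny-first gate in front of every node-local action. The sketch below is illustrative only: the action names are hypothetical labels for the forbidden list above, and the policy is modeled as a simple set of allowed action names.

```python
# Hypothetical action names mirroring the "Forbidden in degraded mode" list.
FORBIDDEN_IN_DEGRADED_MODE = frozenset({
    "approve_node_enrollment",
    "issue_certificate",
    "assign_role",
    "change_cluster_policy",
    "change_organization_policy",
    "rotate_trust_roots",
    "promote_partition",
    "fetch_unrelated_secret",
    "create_service_authority",
})


def degraded_action_allowed(action: str, policy_allowed: set,
                            now: float, snapshot_expires_at: float) -> bool:
    """Gate a node-local action while disconnected from the control plane."""
    if action in FORBIDDEN_IN_DEGRADED_MODE:
        return False              # never allowed, whatever the policy says
    if now >= snapshot_expires_at:
        return False              # snapshot expiry bounds degraded mode
    return action in policy_allowed
```

The forbidden set is checked before the policy so that a misconfigured degraded-mode policy cannot grant a high-risk action.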
|
||||
|
||||
## 11. Revocation and Expiry
|
||||
|
||||
Snapshots expire. Expiry is a correctness boundary, not just a cache hint.
|
||||
|
||||
Revocation sources:
|
||||
|
||||
- signed trust bundle update
|
||||
- signed revocation list
|
||||
- control-plane status after reconnect
|
||||
- emergency recovery trust path
|
||||
|
||||
Revocation applies to:
|
||||
|
||||
- signing keys
|
||||
- node identities
|
||||
- role assignments
|
||||
- service assignments
|
||||
- peer eligibility
|
||||
- storage/config endpoints
|
||||
- degraded-mode permissions
|
||||
|
||||
If revocation state is unavailable, the node may only continue within the last
|
||||
valid degraded-mode policy and must not perform high-risk actions.
|
||||
|
||||
## 12. Rollback and Recovery
|
||||
|
||||
Normal rollback to an older config is forbidden.
|
||||
|
||||
Allowed recovery cases:
|
||||
|
||||
- local snapshot file corruption
|
||||
- interrupted incremental update
|
||||
- bad non-authoritative cache state
|
||||
- version gap requiring full resync
|
||||
|
||||
Recovery order:
|
||||
|
||||
1. keep last verified active snapshot
|
||||
2. reject bad update
|
||||
3. request full snapshot from config/storage service
|
||||
4. use bootstrap peers if refresh endpoints fail
|
||||
5. reconnect to control plane when available
|
||||
6. enter degraded mode only if policy allows
|
||||
|
||||
Rollback to an older signed snapshot requires explicit signed recovery policy
|
||||
with a newer `authority_epoch` or equivalent anti-rollback guard.
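
The anti-rollback guard might look like the following sketch. It assumes integer versions and epochs and a recovery policy represented as a dictionary whose signature has already been verified elsewhere; the `signature_verified`, `authority_epoch`, and `allowed_versions` keys are hypothetical.

```python
from typing import Optional


def may_activate(candidate_version: int, local_version: int, local_epoch: int,
                 recovery_policy: Optional[dict] = None) -> bool:
    """Decide whether an already-verified snapshot may replace the active one."""
    if candidate_version > local_version:
        return True               # normal forward movement
    if recovery_policy is None:
        return False              # rollback without a recovery policy is forbidden
    return (
        recovery_policy.get("signature_verified") is True
        and recovery_policy.get("authority_epoch", -1) > local_epoch
        and candidate_version in recovery_policy.get("allowed_versions", ())
    )
```

Requiring a strictly newer `authority_epoch` means a captured recovery policy cannot be replayed later to force the same rollback again.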
|
||||
|
||||
## 13. Node-Agent Local Expectations
|
||||
|
||||
Node-agent must store:
|
||||
|
||||
- active snapshot per scope
|
||||
- previous verified snapshot for recovery
|
||||
- pending downloaded snapshot/update before activation
|
||||
- verification metadata
|
||||
- last applied versions
|
||||
- signer key ids
|
||||
- expiry/refresh deadlines
|
||||
- rejection reason for last failed update
|
||||
|
||||
Activation should be atomic from the node-agent perspective:
|
||||
|
||||
- download to pending
|
||||
- verify
|
||||
- write to durable local store
|
||||
- swap active pointer
|
||||
- notify supervised services of relevant changes
|
||||
- report applied version in heartbeat/status
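
The write-then-swap part of this sequence can be illustrated with a temporary file and an atomic rename. This is only a sketch under stated assumptions: the JSON file format and the `<scope>.active.json` naming are invented here, since C12 is expected to define the real layout, and the snapshot is assumed to be verified before the call.

```python
import json
import os
import tempfile
from pathlib import Path


def activate_snapshot(store_dir: Path, scope: str, snapshot: dict) -> Path:
    """Persist a verified snapshot, then atomically swap the active file."""
    store_dir.mkdir(parents=True, exist_ok=True)
    active = store_dir / f"{scope}.active.json"
    # Write to a pending file in the same directory so the rename stays on
    # one filesystem.
    fd, pending_name = tempfile.mkstemp(dir=store_dir, prefix=f"{scope}.pending.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as pending:
            json.dump(snapshot, pending, sort_keys=True)
            pending.flush()
            os.fsync(pending.fileno())   # make the bytes durable before the swap
        os.replace(pending_name, active) # readers see the old or the new file
    except BaseException:
        if os.path.exists(pending_name):
            os.unlink(pending_name)      # do not leave a half-written pending file
        raise
    return active
```

A production store would likely also sync the containing directory and keep the previous verified snapshot for recovery; both are omitted here for brevity.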
|
||||
|
||||
C12 will define the local store layout and durability details.
|
||||
|
||||
## 14. Distribution Relationship
|
||||
|
||||
Snapshot production flow:
|
||||
|
||||
1. PostgreSQL authoritative state changes.
|
||||
2. Control-plane snapshot compiler builds scoped view.
|
||||
3. Compiler validates scope and removes forbidden data.
|
||||
4. Snapshot is signed by config signing key.
|
||||
5. Snapshot or increment is published to Fabric Storage / Config Storage.
|
||||
6. Node-agent refreshes by version.
|
||||
7. Node-agent verifies and applies locally.
|
||||
|
||||
Node-origin reports such as health, heartbeat, or observed latency are not
|
||||
authoritative config writes. They may influence future compiled snapshots only
|
||||
after the control plane accepts them according to policy.
|
||||
|
||||
## 15. Validation and Future Tests
|
||||
|
||||
Future implementation tests must prove:
|
||||
|
||||
- valid snapshot applies
|
||||
- invalid signature rejected
|
||||
- wrong cluster rejected
|
||||
- wrong node rejected
|
||||
- expired snapshot rejected for new authority
|
||||
- rollback rejected
|
||||
- version gap triggers full resync
|
||||
- forbidden raw secret content rejected
|
||||
- unrelated organization data rejected
|
||||
- wrong role scope rejected
|
||||
- incremental update applies only to matching base version
|
||||
- revoked signer rejected
|
||||
- degraded-mode forbidden actions are blocked
|
||||
|
||||
## 16. C12 Preparation
|
||||
|
||||
C12 must define how node-agent stores and protects:
|
||||
|
||||
- snapshot files
|
||||
- identity material references
|
||||
- trust bundle cache
|
||||
- peer cache
|
||||
- route cache
|
||||
- service assignment cache
|
||||
- health/degraded state
|
||||
- update metadata
|
||||
|
||||
C12 must not turn local state into durable authority. It must preserve the C11
|
||||
rule that snapshots are verified scoped copies of PostgreSQL-derived state.
|
||||
|
||||
## 17. Result / Decision
|
||||
|
||||
Stage C11 defines signed scoped cluster snapshots as the required bridge between
|
||||
the authoritative control plane and node-local runtime operation.
|
||||
|
||||
Decisions:
|
||||
|
||||
- snapshots are signed, versioned, scoped, bounded, and expiring
|
||||
- snapshots are generated from PostgreSQL source-of-truth state
|
||||
- snapshots may be distributed by Fabric Storage / Config Storage
|
||||
- node-agent verifies before applying
|
||||
- node-agent may operate from snapshots only within policy
|
||||
- snapshots must not contain raw secrets or unrelated organization data
|
||||
- incremental updates require exact base-version matching
|
||||
- rollback requires explicit signed recovery policy
|
||||
- C12 must define local storage without changing these authority boundaries
|
||||
|
||||
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
|
||||
workload behavior is changed by C11.
|
||||
@@ -0,0 +1,428 @@
|
||||
# VPN / IP Tunnel Service Target Design
|
||||
|
||||
Status: Stage C18 planning result. Documentation only.
|
||||
|
||||
This document defines the target VPN/IP tunnel service architecture for the
|
||||
Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
|
||||
devices, mesh traffic, service workload execution, API changes, migrations, or
|
||||
RDP behavior changes.
|
||||
|
||||
## Purpose
|
||||
|
||||
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
|
||||
|
||||
The service must allow managed access to private networks while preserving the
|
||||
platform's core rules:
|
||||
|
||||
- PostgreSQL remains the durable source of truth.
|
||||
- Redis remains live coordination only.
|
||||
- Fabric Routing Engine owns route choice.
|
||||
- Nodes execute leased work only.
|
||||
- Organizations must not see mesh topology.
|
||||
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
|
||||
|
||||
## Non-Goals
|
||||
|
||||
Stage C18 does not implement:
|
||||
|
||||
- VPN/IP tunnel runtime
|
||||
- TUN/TAP device handling
|
||||
- packet forwarding
|
||||
- host route or firewall manipulation
|
||||
- QUIC, WebRTC, relay packet routing, or production mesh traffic
|
||||
- Windows virtual adapter, Android `VpnService`, or mobile client work
|
||||
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes
|
||||
|
||||
## Core Concepts
|
||||
|
||||
### `vpn_connection`
|
||||
|
||||
`vpn_connection` is the logical control-plane entity for one managed VPN/IP
|
||||
tunnel connection to a private endpoint such as an office, customer site,
|
||||
branch network, partner network, or private resource zone.
|
||||
|
||||
Target fields:
|
||||
|
||||
- `id`
|
||||
- `organization_id`
|
||||
- `cluster_id`
|
||||
- `name`
|
||||
- target endpoint / office identity
|
||||
- protocol/provider family
|
||||
- credential/config reference
|
||||
- allowed node policy
|
||||
- mode: `single_active` for the initial model
|
||||
- desired state: `enabled` or `disabled`
|
||||
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
|
||||
- route policy references
|
||||
- QoS / bandwidth policy references
|
||||
- placement constraints
|
||||
- safe status projection
|
||||
|
||||
The entity belongs to the control plane. It must not be inferred from a node
|
||||
environment variable, a manually started VPN process, or a host-local config
|
||||
file.
|
||||
|
||||
### `vpn_connection_lease`
|
||||
|
||||
`vpn_connection_lease` represents current ownership.
|
||||
|
||||
Target fields:
|
||||
|
||||
- `vpn_connection_id`
|
||||
- `cluster_id`
|
||||
- `organization_id`
|
||||
- `owner_node_id`
|
||||
- lease generation / fencing epoch
|
||||
- `expires_at`
|
||||
- `renewed_at`
|
||||
- `released_at`
|
||||
- `fenced_at`
|
||||
- status
|
||||
|
||||
Only the current owner with a valid, unexpired, unfenced lease may execute the
|
||||
VPN connection.
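
The ownership rule can be expressed as a single predicate over the lease fields. The sketch below uses a subset of the target fields; the field types (Unix timestamps, integer generation) and the `may_execute` name are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VpnConnectionLease:
    vpn_connection_id: str
    owner_node_id: str
    generation: int                      # lease generation / fencing epoch
    expires_at: float
    released_at: Optional[float] = None
    fenced_at: Optional[float] = None


def may_execute(lease: VpnConnectionLease, node_id: str, now: float) -> bool:
    """True only for the current owner of a valid, unexpired, unfenced lease."""
    return (
        lease.owner_node_id == node_id
        and lease.released_at is None
        and lease.fenced_at is None
        and now < lease.expires_at
    )
```

A node-local watchdog could evaluate this predicate on every renewal tick and stop the tunnel as soon as it returns `False`.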
|
||||
|
||||
### `vpn_route_policy`
|
||||
|
||||
`vpn_route_policy` defines what traffic may use the connection.
|
||||
|
||||
Policy dimensions:
|
||||
|
||||
- allowed CIDRs
|
||||
- denied CIDRs
|
||||
- DNS suffix or DNS server policy
|
||||
- split-tunnel or full-tunnel eligibility
|
||||
- service-specific usage
|
||||
- resource-specific usage
|
||||
- organization and role scope
|
||||
- QoS class and bandwidth limits
|
||||
|
||||
Route policy is desired state. Runtime nodes apply scoped policy; they do not
|
||||
invent routes.
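
For the CIDR dimensions, a minimal evaluation sketch follows. It assumes that a denied CIDR always overrides an allowed one, which the policy model above does not state explicitly; `destination_allowed` is a hypothetical helper name.

```python
import ipaddress


def destination_allowed(dest: str, allowed_cidrs: list, denied_cidrs: list) -> bool:
    """Check one destination address against allowed and denied CIDR lists."""
    addr = ipaddress.ip_address(dest)

    def matches(cidrs: list) -> bool:
        return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

    # Assumption: deny wins over allow, and anything not allowed is denied.
    return matches(allowed_cidrs) and not matches(denied_cidrs)
```

For example, with `10.0.0.0/8` allowed and `10.1.2.0/24` denied, `10.2.0.1` is permitted while `10.1.2.3` is not.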
|
||||
|
||||
### `vpn_credential_ref`
|
||||
|
||||
VPN credentials must be referenced through an approved secret resolver.
|
||||
|
||||
Nodes receive credentials/config only when authorized and only when required to
|
||||
execute or pre-warm the assigned connection. Nodes must not receive unrelated
|
||||
organization credentials.
|
||||
|
||||
## Architecture Placement
|
||||
|
||||
```text
Control Plane
  owns vpn_connection desired state, policy, lease/fencing, audit

Fabric Core
  owns node identity, role assignment consumption, scoped snapshots,
  node-local state, and service supervision boundary

Fabric Routing Engine
  chooses path to the current active VPN owner or eligible egress pool

VPN/IP Tunnel Service Runtime
  executes tunnel only when assigned and leased

Data Plane
  carries encrypted tunnel packets later, with QoS and backpressure
```
|
||||
|
||||
The backend/control plane must not become a production VPN packet relay.
|
||||
|
||||
## Control Plane Responsibilities
|
||||
|
||||
The control plane owns:
|
||||
|
||||
- durable `vpn_connection` desired state
|
||||
- route policy and service usage policy
|
||||
- allowed node policy
|
||||
- placement and candidate selection
|
||||
- lease creation, renewal validation, and fencing decisions
|
||||
- safe status projection
|
||||
- audit events
|
||||
- credential reference ownership
|
||||
|
||||
The control plane does not push arbitrary packets. It authorizes and records
|
||||
what should exist.
|
||||
|
||||
## Node Responsibilities
|
||||
|
||||
Nodes do not decide to create VPN connections.
|
||||
|
||||
A node may execute a connection only when all of the following are true:
|
||||
|
||||
- node belongs to the correct cluster
|
||||
- node has the required capability and role assignment
|
||||
- node is allowed by the `vpn_connection` node policy
|
||||
- node has a current signed/scoped configuration snapshot
|
||||
- node holds the active lease
|
||||
- desired state is `enabled`
|
||||
- organization and service policy permit use
|
||||
|
||||
The node must stop execution when:
|
||||
|
||||
- lease is lost, expired, or fenced
|
||||
- desired state becomes `disabled`
|
||||
- role assignment is removed
|
||||
- allowed node policy changes
|
||||
- local node enters unsafe partition/degraded state
|
||||
- cluster tells the node to drain
|
||||
|
||||
## Single-Active Lease and Fencing Model
|
||||
|
||||
The initial mode is `single_active`.
|
||||
|
||||
Correctness requirement:
|
||||
|
||||
- at most one node may maintain the active VPN tunnel at any time
|
||||
- stale owners must be fenced before replacement becomes authoritative
|
||||
- ownership changes must be monotonic through a lease generation or equivalent
|
||||
fencing epoch
|
||||
- connect/disconnect must be idempotent
|
||||
- split-brain must not create duplicate active tunnels
|
||||
|
||||
Suggested target mechanics:
|
||||
|
||||
- short lease TTL
|
||||
- periodic renewal
|
||||
- monotonic lease generation
|
||||
- node-local watchdog that stops tunnel when renewal fails
|
||||
- explicit release on graceful shutdown
|
||||
- fencing event before replacement if previous owner is uncertain
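
These mechanics can be sketched with an in-memory lease table. This is a toy model, not the C18B service: a real implementation would keep leases in the durable store with transactional updates, and the `LeaseTable` class, its method names, and the dictionary layout are all hypothetical.

```python
class LeaseTable:
    """In-memory stand-in for a durable single-active lease table."""

    def __init__(self):
        self._leases = {}   # vpn_connection_id -> lease dict

    @staticmethod
    def _live(lease, now):
        return lease is not None and not lease["fenced"] and now < lease["expires_at"]

    def acquire(self, conn_id, node_id, now, ttl):
        """Return the lease generation, or None if another owner is still live."""
        cur = self._leases.get(conn_id)
        if self._live(cur, now):
            if cur["owner"] != node_id:
                return None
            cur["expires_at"] = now + ttl      # idempotent renewal by the owner
            return cur["generation"]
        # Expired, fenced, or missing: a new owner gets a strictly higher generation.
        generation = cur["generation"] + 1 if cur else 1
        self._leases[conn_id] = {"owner": node_id, "generation": generation,
                                 "expires_at": now + ttl, "fenced": False}
        return generation

    def fence(self, conn_id):
        if conn_id in self._leases:
            self._leases[conn_id]["fenced"] = True

    def is_current(self, conn_id, node_id, generation, now):
        cur = self._leases.get(conn_id)
        return (self._live(cur, now) and cur["owner"] == node_id
                and cur["generation"] == generation)
```

Because the generation only increases, any component that remembers the highest generation it has seen can reject work from a stale owner, even if that owner has not yet noticed it was fenced.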
|
||||
|
||||
## Routing Policy Model
|
||||
|
||||
Traffic references a logical `vpn_connection`, not a physical node.
|
||||
|
||||
Examples:
|
||||
|
||||
- RDP resource may require `vpn_connection = office-a`
|
||||
- SSH resource may require `vpn_connection = office-a`
|
||||
- IP tunnel profile may expose selected CIDRs through `office-a`
|
||||
- HTTP/internal app resource may route through the active VPN owner
|
||||
|
||||
The Fabric Routing Engine resolves:
|
||||
|
||||
```text
service request
  -> logical vpn_connection
  -> current active owner / eligible egress
  -> fabric route
  -> VPN service runtime
  -> private network target
```
|
||||
|
||||
Route updates should be dynamic. Changing CIDRs, DNS policy, or active owner
|
||||
should not require manual client reconfiguration when clients use platform
|
||||
managed access.
|
||||
|
||||
## QoS and Bandwidth Rules
|
||||
|
||||
VPN bulk traffic must degrade before interactive traffic.
|
||||
|
||||
Priority order:
|
||||
|
||||
1. RDP input/control
|
||||
2. interactive RDP/VNC/SSH control and render-critical traffic
|
||||
3. clipboard and small reliable control messages
|
||||
4. video/audio adaptive traffic
|
||||
5. file transfer
|
||||
6. VPN bulk packets
|
||||
7. telemetry
|
||||
|
||||
Bandwidth policy should support:
|
||||
|
||||
- per-organization limits
|
||||
- per-service limits
|
||||
- per-`vpn_connection` limits
|
||||
- per-node limits
|
||||
- reserved bandwidth for interactive services
|
||||
|
||||
Service adapters must not implement QoS routing themselves. They label traffic
|
||||
or request a channel class; Fabric applies route/QoS policy.
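
The priority order can be illustrated with a strict-priority queue keyed by traffic class. The class labels below are hypothetical names for the seven levels listed above, and strict priority is a simplification: a real scheduler would also need per-scope bandwidth limits, reserved capacity, and starvation protection.

```python
import heapq
from itertools import count

# Hypothetical class labels; lower number is served first.
PRIORITY = {
    "rdp_input": 1,
    "interactive": 2,
    "clipboard_control": 3,
    "adaptive_media": 4,
    "file_transfer": 5,
    "vpn_bulk": 6,
    "telemetry": 7,
}


class ClassQueue:
    """Strict-priority queue; FIFO within one traffic class."""

    def __init__(self):
        self._heap = []
        self._seq = count()   # tie-breaker that preserves arrival order

    def push(self, traffic_class: str, payload) -> None:
        heapq.heappush(self._heap, (PRIORITY[traffic_class], next(self._seq), payload))

    def pop(self):
        return heapq.heappop(self._heap)[2]
```

In this model a service adapter only supplies the class label when pushing; the ordering decision stays with the fabric side, which matches the rule that adapters do not implement QoS routing themselves.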
|
||||
|
||||
## Security Boundaries
|
||||
|
||||
Security requirements:
|
||||
|
||||
- organization-scoped `vpn_connection`
|
||||
- cluster-scoped identity and tokens
|
||||
- mTLS node-to-node transport
|
||||
- short-lived route/tunnel authorization tokens when needed
|
||||
- credentials delivered only through approved resolver
|
||||
- candidate nodes receive only scoped config
|
||||
- active owner receives execution secrets only when authorized
|
||||
- no organization sees another organization's connections, routes, credentials,
|
||||
peer cache, or topology
|
||||
- platform owner actions are audited
|
||||
|
||||
Compromised node blast radius must be bounded. A compromised node must not gain
|
||||
credentials for unrelated `vpn_connection` entities or unrelated organizations.
|
||||
|
||||
## Observability and Audit
|
||||
|
||||
Audit events:
|
||||
|
||||
- `vpn_connection_created`
|
||||
- `vpn_connection_enabled`
|
||||
- `vpn_connection_disabled`
|
||||
- `vpn_connection_policy_changed`
|
||||
- `vpn_connection_candidate_changed`
|
||||
- `vpn_connection_lease_acquired`
|
||||
- `vpn_connection_lease_renewed`
|
||||
- `vpn_connection_lease_lost`
|
||||
- `vpn_connection_owner_fenced`
|
||||
- `vpn_connection_failover_started`
|
||||
- `vpn_connection_failover_completed`
|
||||
- `vpn_connection_credential_rotated`
|
||||
- `vpn_route_policy_changed`
|
||||
|
||||
Metrics/status:
|
||||
|
||||
- desired state
|
||||
- active owner
|
||||
- standby/pre-warm owners
|
||||
- lease generation
|
||||
- last connect/disconnect time
|
||||
- route count
|
||||
- latency/packet loss where observable
|
||||
- bandwidth by service class
|
||||
- failover count
|
||||
- last failure reason
|
||||
|
||||
Organization views show safe status. Platform owner views may show active node
|
||||
and operational detail according to platform policy and audit.
|
||||
|
||||
## Failure Mode Matrix
|
||||
|
||||
| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical `vpn_connection`; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |
|
||||
|
||||
## Stateful Session Limits
|
||||
|
||||
VPN failover may disrupt long-lived TCP sessions. The platform should minimize
|
||||
disruption through sticky placement, graceful drain, standby/pre-warm nodes,
|
||||
stable route identity, and transparent route refresh, but the initial
|
||||
`single_active` mode does not guarantee lossless TCP migration.
|
||||
|
||||
Future `multi_active` or load-balanced VPN modes may reduce disruption. They
|
||||
must be explicit future modes and must not weaken `single_active` correctness.
|
||||
|
||||
## Relationship to Current Mesh Proof Set
|
||||
|
||||
C17A-C17G prove synthetic fabric messages, route health/failover probes, relay
|
||||
semantics, a bounded `synthetic.echo` path, live synthetic HTTP node-to-node
|
||||
transport, scoped synthetic route config loading, and Control Plane scoped
|
||||
synthetic config reads in `rap-node-agent`.
|
||||
|
||||
They do not authorize VPN traffic.
|
||||
|
||||
VPN/IP tunnel runtime must wait until the control-plane desired-state model,
|
||||
lease/fencing, scoped snapshots, node-local state, secure node-to-node
|
||||
channels, and Fabric Routing Engine boundaries are accepted for this service.
|
||||
|
||||
## Future Implementation Stages
|
||||
|
||||
C18A - VPN/IP tunnel control-plane data model foundation:
|
||||
|
||||
- durable `vpn_connections`
|
||||
- route policy tables
|
||||
- allowed node policy
|
||||
- lease/fencing model
|
||||
- audit events
|
||||
- no runtime packets
|
||||
|
||||
Status: completed and backend-test-proven. Result:
|
||||
`artifacts/c18a-vpn-control-plane-data-model-report.md`.
|
||||
|
||||
C18B - Lease and fencing service:
|
||||
|
||||
- single-active ownership service
|
||||
- TTL renewal/fencing behavior
|
||||
- stale owner handling
|
||||
- no real VPN runtime
|
||||
|
||||
Status: completed and backend-test-proven. Result:
|
||||
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.
|
||||
|
||||
C18C - Node-agent desired-state consumption:
|
||||
|
||||
- node reads scoped `vpn_connection` assignments
|
||||
- reports status
|
||||
- does not create real tunnel
|
||||
|
||||
Status: completed and backend-test-proven. Result:
|
||||
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.
|
||||
|
||||
Notes:
|
||||
|
||||
- node assignment visibility is limited to eligible candidates or the current
|
||||
active lease owner
|
||||
- observed assignment status is explicit: `not_started`, `assigned`,
|
||||
`lease_required`, `blocked`, `unknown`
|
||||
- `credential_ref` is not exposed to node-agent assignment payloads
|
||||
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet
|
||||
forwarding, or production mesh traffic is implemented
|
||||
|
||||
C18D - Secret resolver integration:
|
||||
|
||||
- scoped credential/config delivery
|
||||
- candidate/active-owner restrictions
|
||||
- credential rotation audit
|
||||
|
||||
C18E - Routing policy integration:
|
||||
|
||||
- CIDR and service-specific route intent
|
||||
- route projection to Fabric Routing Engine
|
||||
- no packet forwarding
|
||||
|
||||
C18F - Non-production fake VPN executor:
|
||||
|
||||
- synthetic leased service state only
|
||||
- no TUN, no packets, no private network routing
|
||||
|
||||
C18G - Lab-only native VPN executor prototype:
|
||||
|
||||
- explicit separate approval required
|
||||
- native mode preferred for TUN/firewall/QoS
|
||||
- no privileged container by default
|
||||
|
||||
C18H - Client route refresh/resume design:
|
||||
|
||||
- route updates
|
||||
- reconnect behavior
|
||||
- split/full tunnel client posture
|
||||
|
||||
C18I - Production hardening:
|
||||
|
||||
- split-brain drills
|
||||
- failover testing
|
||||
- QoS load testing
|
||||
- security review
|
||||
- observability and incident runbooks
|
||||
|
||||
## Result / Decision
|
||||
|
||||
Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core.
|
||||
The first implementation step must be control-plane desired state and
|
||||
lease/fencing foundation, not packet routing. Nodes are execution units, not
|
||||
owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the
|
||||
source of truth, Redis remains live coordination only, and Fabric
|
||||
Storage/Config Storage remains a scoped distribution/cache layer. RDP, current
|
||||
direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all
|
||||
existing service-adapter behavior are untouched by this document. C18A,
|
||||
C18B, and C18C are now implemented only as control-plane/node-agent contract
|
||||
foundation; they still do not authorize VPN/IP tunnel runtime or host
|
||||
networking changes.
|
||||
@@ -0,0 +1,429 @@
|
||||
# Web Ingress and Admin UI Model
|
||||
|
||||
Status: target architecture clarification. Documentation only.
|
||||
|
||||
This document defines how HTTP/HTTPS web entry, Admin UI, dynamic page
|
||||
composition, and cluster configuration responsibilities are separated in the
|
||||
Secure Access Fabric.
|
||||
|
||||
It does not implement code, APIs, UI pages, mesh runtime, VPN runtime, or RDP
|
||||
changes.
|
||||
|
||||
## Purpose
|
||||
|
||||
The platform needs a clear distinction between:
|
||||
|
||||
- Web Service as the HTTP/HTTPS entry layer
|
||||
- Control Plane as the owner of cluster configuration and policy
|
||||
- Admin UI as a safe, scoped user interface over Control Plane APIs
|
||||
|
||||
The Web layer must never become the owner of cluster state, policy, topology,
|
||||
secrets, node identity, or routing authority.
|
||||
|
||||
## Layer Ownership
|
||||
|
||||
### Web Service / Web Ingress
|
||||
|
||||
Web Service is an edge service.
|
||||
|
||||
Suggested role names:
|
||||
|
||||
- `web-ingress`
|
||||
- `admin-web-entry`
|
||||
- `admin-web-shell`
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- accept HTTP/HTTPS
|
||||
- terminate TLS or sit behind the approved TLS terminator
|
||||
- serve Admin UI shell/static assets
|
||||
- proxy browser/API traffic to Control API
|
||||
- apply edge controls such as headers, rate limits, request size limits, and
|
||||
future WAF rules
|
||||
- expose only approved public/admin endpoints
|
||||
|
||||
Web Service must not:
|
||||
|
||||
- own cluster configuration
|
||||
- directly mutate PostgreSQL
|
||||
- store durable topology or policy
|
||||
- store secrets
|
||||
- store node identity or certificates as source of truth
|
||||
- expose internal mesh topology to browser clients
|
||||
- execute cluster decisions locally
|
||||
|
||||
### Control Plane
|
||||
|
||||
Control Plane owns all durable cluster configuration and policy.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- clusters
|
||||
- nodes
|
||||
- node enrollment and approval
|
||||
- role assignments
|
||||
- organization and tenant policy
|
||||
- service desired state
|
||||
- service endpoint visibility
|
||||
- signed scoped snapshots
|
||||
- config distribution rules
|
||||
- audit
|
||||
- high-risk action authorization
|
||||
- step-up authentication requirements
|
||||
|
||||
PostgreSQL remains the durable source of truth. Redis remains live coordination
|
||||
only.
|
||||
|
||||
Cluster configuration is changed only through Control Plane services and APIs.
|
||||
The Web layer is a presentation and ingress layer over those APIs.
|
||||
|
||||
### Admin UI
|
||||
|
||||
Admin UI is a client application served through Web Ingress.
|
||||
|
||||
It renders safe Control Plane projections and submits user actions to Control
|
||||
Plane APIs.
|
||||
|
||||
Admin UI must not:
|
||||
|
||||
- contain embedded internal topology
|
||||
- contain secrets
|
||||
- contain raw credential references beyond safe indicators
|
||||
- contain peer cache data
|
||||
- contain route cache data
|
||||
- contain private node-to-node endpoints unless explicitly authorized for the
|
||||
viewer
|
||||
- contain executable cluster logic
|
||||
|
||||
## Admin Endpoint Placement
|
||||
|
||||
Admin UI endpoint placement is explicit and must not be inferred from storage.
|
||||
|
||||
Scopes:
|
||||
|
||||
- Platform Owner Console: global platform-owner scope. It may aggregate
|
||||
multiple clusters through Control Plane APIs according to platform policy and
|
||||
audit.
|
||||
- Cluster Admin Endpoint: cluster-local admin/web ingress endpoint for a single
|
||||
cluster. It is hosted only by nodes explicitly assigned an approved
|
||||
admin/web ingress role.
|
||||
- Organization Admin Panel: tenant-safe projection for one organization. It
|
||||
must expose only allowed resources, service endpoints, sessions, policies,
|
||||
and safe status.
|
||||
|
||||
Rules:
|
||||
|
||||
- Fabric Storage / Config Storage nodes do not automatically host Admin UI.
|
||||
- Adding a storage node to a new cluster does not move the cluster panel.
|
||||
- Storage nodes distribute/cache scoped configuration and snapshots only.
|
||||
- Admin/web ingress is a separate service role and requires explicit Control
|
||||
Plane assignment.
|
||||
- Cluster-local admin endpoints require valid TLS/cert policy, signed scoped
|
||||
snapshots, current node health, and sufficient role coverage.
|
||||
- Platform Owner Console remains the owner-level view even when cluster-local
|
||||
admin endpoints exist.
|
||||
- Organization Admin Panel must never expose intermediate mesh topology,
|
||||
storage shards, peer caches, route caches, or unrelated cluster data.
|
||||
|
||||
## Request Flow
|
||||
|
||||
```text
Admin Browser
  -> Web Ingress / Admin Web Shell
  -> Control API
  -> PostgreSQL source of truth
  -> signed scoped snapshots / config distribution
  -> rap-node-agent
```
|
||||
|
||||
Web Ingress may cache static assets and safe UI manifests, but it must not
|
||||
become a second source of truth.
|
||||
|
||||
## Dynamic Admin Pages
|
||||
|
||||
Admin pages may be dynamically composed, but they must be generated from safe
|
||||
metadata and scoped projections.
|
||||
|
||||
The recommended model is:
|
||||
|
||||
```text
Admin Web Shell
  -> UI Manifest / Page Definition endpoint
  -> Scoped Control API endpoints
```
|
||||
|
||||
Dynamic pages are allowed for:
|
||||
|
||||
- platform admin sections
|
||||
- cluster admin sections
|
||||
- node detail sections
|
||||
- service adapter safe configuration sections
|
||||
- future organization admin sections
|
||||
|
||||
Dynamic pages must be declarative. They must not inject arbitrary executable
|
||||
code from the backend into the browser.
|
||||
|
||||
## UI Manifest Model

The Control Plane may provide a `ui_manifest` or page definition for a specific
viewer context.

Viewer context includes:

- user id
- platform role
- organization memberships
- cluster access scope
- device trust state
- MFA / step-up state
- feature flags
- service availability

The manifest may include:

- visible navigation sections
- page ids
- component ids from an approved component registry
- form schemas
- table schemas
- safe field labels and message keys
- allowed actions
- action risk level
- API route references
- required permissions
- required step-up authentication flags
- audit event category
- refresh hints

The manifest must not include:

- secrets
- raw credentials
- private keys
- full mesh topology
- full peer cache
- route cache
- unrelated organization data
- unrelated cluster data
- internal node-to-node route details
- arbitrary JavaScript or executable code

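One way to keep a manifest limited to the allowed content is to decode it into a closed set of types and reject any key outside that set. The sketch below is illustrative only: the type names, field names, and JSON keys are assumptions, not the project's real `ui_manifest` schema (which WEB-2 still has to define).

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Action is a hypothetical manifest entry for one allowed operation.
type Action struct {
	ID             string `json:"id"`
	RiskLevel      string `json:"risk_level"`
	APIRoute       string `json:"api_route"`
	Permission     string `json:"required_permission"`
	RequiresStepUp bool   `json:"requires_step_up"`
	AuditCategory  string `json:"audit_category"`
}

// Page references components by id from an approved registry.
type Page struct {
	ID           string   `json:"id"`
	ComponentIDs []string `json:"component_ids"`
	Actions      []Action `json:"actions"`
}

// UIManifest is an assumed shape, not the project's real schema.
type UIManifest struct {
	Navigation     []string `json:"navigation"`
	Pages          []Page   `json:"pages"`
	RefreshSeconds int      `json:"refresh_seconds"`
}

// ParseManifest rejects any field that is not part of the schema, so
// keys such as "script" or "private_key" cannot ride along.
func ParseManifest(raw string) (UIManifest, error) {
	var m UIManifest
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	err := dec.Decode(&m)
	return m, err
}

func main() {
	_, err := ParseManifest(`{"navigation":["nodes"],"script":"alert(1)"}`)
	fmt.Println("rejected:", err != nil)
}
```

A closed schema only prevents unexpected keys; the values inside allowed fields still need the scoping and permission checks described in this document.
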
## Page Definition Safety Rules

Dynamic pages are schema-driven views over safe data.

Rules:

- page definitions are data, not code
- page definitions must use an approved component registry
- fields must be explicitly typed
- actions must map to known Control Plane operations
- every action must be permission checked server-side
- high-risk actions must declare step-up requirements
- all mutations must be audited
- UI labels should use localization message keys with English fallback text
- sensitive responses should use `Cache-Control: no-store`

Client-side hiding is not authorization. The Control Plane must enforce all
permissions and policies even if a browser crafts a request manually.

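The first four rules can be checked mechanically before a page definition is served. The following is a minimal sketch under stated assumptions: the registry contents, field type names, and operation ids are hypothetical placeholders, and permission checks and auditing are left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Field is one explicitly typed field of a page definition.
type Field struct {
	Name string
	Type string
}

// PageDefinition is data only: a component id, typed fields, action ids.
type PageDefinition struct {
	Component string
	Fields    []Field
	Actions   []string
}

// Hypothetical registries; a real system would load these from the Control Plane.
var approvedComponents = map[string]bool{"table": true, "form": true, "detail": true}
var knownFieldTypes = map[string]bool{"string": true, "int": true, "bool": true, "enum": true}
var knownOperations = map[string]bool{"node.list": true, "node.approve": true}

// ValidatePage returns the first rule violation it finds, or nil.
func ValidatePage(p PageDefinition) error {
	if !approvedComponents[p.Component] {
		return errors.New("component not in approved registry: " + p.Component)
	}
	for _, f := range p.Fields {
		if !knownFieldTypes[f.Type] {
			return fmt.Errorf("field %q has no known explicit type", f.Name)
		}
	}
	for _, a := range p.Actions {
		if !knownOperations[a] {
			return errors.New("action does not map to a known operation: " + a)
		}
	}
	return nil
}

func main() {
	p := PageDefinition{
		Component: "table",
		Fields:    []Field{{Name: "hostname", Type: "string"}},
		Actions:   []string{"node.list"},
	}
	fmt.Println("valid:", ValidatePage(p) == nil)
}
```

Validation of this kind protects the browser from malformed definitions; it does not replace the server-side permission check on each action.
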
## Safe Data Projection

The Control Plane should expose different projections for different audiences.

Platform owner/admin may see:

- clusters
- nodes
- join requests
- role assignments
- safe topology summaries
- service placement
- health and audit
- partition/recovery status
- active node for cluster-managed services where allowed

Organization admin may see only:

- organization resources
- organization users/groups where authorized
- organization policies
- active sessions
- allowed ingress endpoints
- allowed egress/service endpoints
- safe VPN/connector status
- organization audit

Organization admin must not see:

- intermediate core mesh topology
- other organizations
- peer caches
- route caches
- unrelated nodes
- platform trust roots
- raw node certificates
- secrets
- unrelated cluster internals

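A projection is easiest to keep safe when it is an allow-list: the outgoing type simply has no field for the forbidden data. The sketch below illustrates this for the "safe VPN/connector status" item; the record layout and names are assumptions made for the example.

```go
package main

import "fmt"

// connectorRecord is a hypothetical internal row. It mixes tenant-safe
// fields with internals that must never leave the Control Plane.
type connectorRecord struct {
	OrgID      string
	Name       string
	Status     string
	RouteCache []string // internal only
	RawCertPEM string   // internal only
}

// OrgConnectorView is the allow-listed shape an organization admin sees.
type OrgConnectorView struct {
	Name   string `json:"name"`
	Status string `json:"status"`
}

// ProjectForOrg copies only allow-listed fields for rows owned by orgID.
func ProjectForOrg(orgID string, rows []connectorRecord) []OrgConnectorView {
	out := make([]OrgConnectorView, 0, len(rows))
	for _, r := range rows {
		if r.OrgID != orgID {
			continue // other organizations are filtered out entirely
		}
		out = append(out, OrgConnectorView{Name: r.Name, Status: r.Status})
	}
	return out
}

func main() {
	rows := []connectorRecord{
		{OrgID: "org-a", Name: "vpn-1", Status: "up", RouteCache: []string{"10.0.0.0/8"}},
		{OrgID: "org-b", Name: "vpn-2", Status: "down"},
	}
	fmt.Printf("%+v\n", ProjectForOrg("org-a", rows))
}
```

Because `OrgConnectorView` has no route-cache or certificate field, adding a new internal column to the record cannot leak into the organization view by accident.
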
## Service Adapter UI Extensions

Service adapters may need configuration UI.

Examples:

- RDP resource settings
- VNC resource settings
- SSH resource settings
- VPN/IP tunnel connection settings
- file policy settings
- video/audio policy settings

Adapter UI extensions must be registered as safe schema descriptors through the
Control Plane. Adapters must not directly publish arbitrary browser code.

Allowed extension content:

- field schema
- validation hints
- policy options
- message keys
- safe help text
- action ids mapped to Control Plane APIs

Disallowed extension content:

- executable code
- protocol secrets
- internal adapter memory/state
- raw target credentials
- unrestricted backend endpoints

## Cluster Configuration Ownership

Cluster configuration belongs to the Control Plane.

Examples:

- cluster creation and disablement
- node approval
- node role assignment
- service desired state
- VPN connection desired state
- allowed node policy
- route policy
- QoS policy
- signed snapshot generation
- storage/config distribution scope

Admin UI may present these controls, but it does not own the decisions.

The authoritative path is:

```text
Admin action
-> Control API authorization
-> policy validation
-> PostgreSQL mutation
-> audit event
-> snapshot/config distribution update
-> node-agent consumption
```

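The ordering of the authoritative path can be sketched as a single function that stops at the first failing step. Everything here is a stand-in chosen for illustration: an in-memory map plays the role of PostgreSQL, a slice plays the audit log, and a counter plays the snapshot version. None of these names come from the codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// AdminAction is a hypothetical request coming from the Admin UI.
type AdminAction struct {
	Actor     string
	Operation string
	Target    string
	Value     string
}

// ControlPlane uses in-memory stand-ins: state for PostgreSQL rows,
// audit for the audit log, snapshotVersion for config distribution.
type ControlPlane struct {
	grants          map[string]bool // key: actor + "|" + operation
	state           map[string]string
	audit           []string
	snapshotVersion int
}

// Apply runs the steps in the documented order and stops at the first failure.
func (cp *ControlPlane) Apply(a AdminAction) error {
	// Control API authorization.
	if !cp.grants[a.Actor+"|"+a.Operation] {
		return errors.New("forbidden")
	}
	// Policy validation (placeholder rule for the sketch).
	if a.Value == "" {
		return errors.New("invalid desired state")
	}
	// Mutation of the source of truth.
	cp.state[a.Target] = a.Value
	// Audit event for the mutation.
	cp.audit = append(cp.audit, a.Actor+" "+a.Operation+" "+a.Target)
	// Snapshot/config distribution update.
	cp.snapshotVersion++
	return nil
}

// Consume is what a node-agent stand-in would call: it only reads state.
func (cp *ControlPlane) Consume(target string) (string, int) {
	return cp.state[target], cp.snapshotVersion
}

func main() {
	cp := &ControlPlane{
		grants: map[string]bool{"alice|node.role.assign": true},
		state:  map[string]string{},
	}
	err := cp.Apply(AdminAction{Actor: "alice", Operation: "node.role.assign", Target: "node-7", Value: "egress"})
	fmt.Println(err, cp.snapshotVersion, len(cp.audit))
}
```

In a real implementation the mutation and the audit event would normally share one database transaction; the sketch only shows the order of the steps.
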
## Security Requirements

Web/Admin security requirements:

- TLS for all browser traffic
- secure cookies or approved token storage model
- CSRF protection where cookie auth is used
- CSP for Admin UI
- no secrets in HTML or JavaScript bundles
- no internal topology embedded in static assets
- no arbitrary backend-provided JavaScript
- strict server-side authorization
- risk-based admin access
- MFA/2FA and step-up for high-risk actions
- audit every mutation
- short-lived UI manifests where sensitive
- no-store cache headers for sensitive API responses

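Two of these requirements, the CSP and the no-store cache header, are plain response headers and can be applied by one piece of middleware. The sketch below uses Go's standard `net/http`; the CSP value and the `/api/v1/admin/nodes` route are placeholders chosen for the example, not project settings.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// applySecurityHeaders sets headers for sensitive admin responses.
// The CSP string is a placeholder policy, not a recommended final value.
func applySecurityHeaders(h http.Header) {
	h.Set("Cache-Control", "no-store")
	h.Set("Content-Security-Policy", "default-src 'self'; script-src 'self'")
	h.Set("X-Content-Type-Options", "nosniff")
}

// Secure wraps a handler so every response carries the headers above.
func Secure(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		applySecurityHeaders(w.Header())
		next.ServeHTTP(w, r)
	})
}

func main() {
	h := Secure(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "{}")
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/api/v1/admin/nodes", nil))
	fmt.Println(rec.Header().Get("Cache-Control"))
}
```
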
High-risk actions include:

- node approval
- role assignment
- cluster trust changes
- cross-cluster trust changes
- partition promotion
- secrets access
- update policy changes
- VPN credential/config resolver access

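A step-up gate for the actions listed above can be a small server-side check that runs after the ordinary permission check. This is a sketch under assumptions: the action identifiers and the `Viewer` fields are invented for the example, and how "fresh" step-up is measured is left open.

```go
package main

import (
	"errors"
	"fmt"
)

// highRiskActions mirrors the list above; the identifiers are hypothetical.
var highRiskActions = map[string]bool{
	"node.approve":               true,
	"role.assign":                true,
	"cluster.trust.change":       true,
	"cross_cluster.trust.change": true,
	"partition.promote":          true,
	"secrets.access":             true,
	"update_policy.change":       true,
	"vpn.resolver.access":        true,
}

// Viewer carries the parts of the viewer context this check needs.
type Viewer struct {
	HasPermission bool
	StepUpFresh   bool // true if step-up authentication succeeded recently
}

// ErrStepUpRequired tells the UI to start a step-up flow and retry.
var ErrStepUpRequired = errors.New("step-up authentication required")

// Authorize is meant to be evaluated server-side for every request.
func Authorize(v Viewer, action string) error {
	if !v.HasPermission {
		return errors.New("forbidden")
	}
	if highRiskActions[action] && !v.StepUpFresh {
		return ErrStepUpRequired
	}
	return nil
}

func main() {
	v := Viewer{HasPermission: true}
	fmt.Println(Authorize(v, "node.list"), Authorize(v, "node.approve"))
}
```

Returning a distinct error for the step-up case lets the Admin UI prompt for re-authentication instead of showing a generic denial.
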
## Deployment Model

Possible deployment modes:

- Web Ingress and Control API in the same deployment for small/test installs
- Web Ingress separated from Control API for production
- multiple Web Ingress nodes for regional/admin access
- Web Ingress behind Caddy/Nginx/enterprise ingress
- Admin UI shell served from Web Ingress while APIs remain on Control API

Even when deployed together, ownership remains separate:

- Web Ingress is entry/presentation
- Control API is authorization/domain logic
- PostgreSQL is source of truth
- Fabric Storage/Config Storage is scoped distribution/cache
- node-agent consumes scoped desired state

## Future Stages

Suggested staged work:

WEB-1: Document Web Ingress and Admin UI ownership model.

WEB-2: Define `ui_manifest` schema and approved component registry.

WEB-3: Add platform-admin Admin Web Shell that consumes scoped manifests.
Initial Platform Owner Control Panel is implemented and build-verified in
`web-admin`. Report:
`artifacts/web-admin-platform-owner-control-panel-report.md`.

WEB-4: Add cluster admin pages using Control Plane projections.

WEB-5: Add organization admin pages using tenant-safe projections.

WEB-6: Add high-risk action step-up and device-trust UI flows.

WEB-7: Add service-adapter UI extension registry.

WEB-8: Add signed/versioned UI manifest distribution if needed for offline or
edge-served admin shells.

## Non-Goals

This document does not authorize:

- implementation of new UI pages
- changing existing Windows client behavior
- changing RDP runtime
- mesh runtime
- VPN runtime
- node-agent service execution changes
- storing cluster configuration inside Web Service
- exposing internal topology to organizations

## Result / Decision

WEB is an ingress and presentation layer, not a cluster configuration owner.
Cluster configuration belongs to the Control Plane and is persisted in
PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
schema-driven projections over Control Plane APIs. They must not embed secrets,
internal topology, peer caches, route caches, or arbitrary executable code.

Admin endpoint placement is explicit. A Fabric Storage / Config Storage node
does not automatically become a cluster panel. Platform Owner Console remains
global platform-owner scope; Cluster Admin Endpoint is a separate cluster-local
admin/web ingress role; Organization Admin Panel remains a tenant-safe
projection.
@@ -0,0 +1,221 @@
# Current Baseline Matrix

Date: 2026-04-26

Purpose: single operational snapshot of the current project baseline. This file
is not a target architecture document. It describes what is currently proven,
what is merely implemented, and what remains unproven.

## Environment

Canonical test environment:

```text
Docker host: 192.168.200.61
SSH alias: docker-test
Docker endpoint: ssh://docker-test
Docker context: test-ubuntu
Backend API: http://192.168.200.61:8080/api/v1
Backend gateway: ws://192.168.200.61:8080/api/v1/gateway/ws
```

Current live/smoke containers:

| Container | Image | Role |
| --- | --- | --- |
| `rap_backend_smoke` | `rap-backend-smoke:stage5-2-download` | backend control plane |
| `rap_worker_smoke` | `rap-rdp-worker:stage5-2-download` | accepted RDP Adapter worker baseline plus runtime-proven Stage 5.2 core download path |
| `rap_postgres` | `postgres:16` | source-of-truth database |
| `rap_redis` | `redis:7` | live coordination/routing |

Current Windows client endpoints:

```json
{
  "api_base_url": "http://192.168.200.61:8080/api/v1",
  "gateway_websocket_url": "ws://192.168.200.61:8080/api/v1/gateway/ws",
  "prefer_direct_data_plane": true,
  "direct_data_plane_connect_timeout_ms": 2500,
  "direct_data_plane_color_mode": "full_color",
  "direct_data_plane_platform_ca_bundle": "artifacts/p3-5-platform-ca.crt",
  "environment": "production",
  "allow_insecure_direct_data_plane_tls_for_smoke": false
}
```

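The combination of `"environment": "production"` and `"allow_insecure_direct_data_plane_tls_for_smoke": false` suggests a guard worth checking when the settings are loaded. The real client is a WPF application; the Go sketch below only illustrates the rule, and both the struct and the rule that production must reject the insecure flag are assumptions drawn from the key names.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ClientConfig maps a subset of the JSON keys shown above.
type ClientConfig struct {
	APIBaseURL               string `json:"api_base_url"`
	GatewayWebSocketURL      string `json:"gateway_websocket_url"`
	PreferDirectDataPlane    bool   `json:"prefer_direct_data_plane"`
	ConnectTimeoutMS         int    `json:"direct_data_plane_connect_timeout_ms"`
	PlatformCABundle         string `json:"direct_data_plane_platform_ca_bundle"`
	Environment              string `json:"environment"`
	AllowInsecureTLSForSmoke bool   `json:"allow_insecure_direct_data_plane_tls_for_smoke"`
}

// LoadConfig parses the settings and applies the assumed production guard.
func LoadConfig(raw []byte) (ClientConfig, error) {
	var c ClientConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return c, err
	}
	if c.Environment == "production" && c.AllowInsecureTLSForSmoke {
		return c, errors.New("insecure direct data-plane TLS is not allowed in production")
	}
	return c, nil
}

func main() {
	_, err := LoadConfig([]byte(`{"environment":"production","allow_insecure_direct_data_plane_tls_for_smoke":true}`))
	fmt.Println("rejected:", err != nil)
}
```
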
## Build And Probe Snapshot

Commands run during P0:

```powershell
go test ./...
dotnet build .\clients\windows\RemoteAccessPlatform.Windows.slnx
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-graphics-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-cursor-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-service-adapter-protocol-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-dataplane-bind-probe --scenario valid
```

Additional accepted P1 baseline checks:

```powershell
go test ./...
dotnet build .\clients\windows\RemoteAccessPlatform.Windows.slnx
docker -H ssh://docker-test build --tag rap-rdp-worker:rdp-p1-region-order2 --file workers/rdp-worker/Dockerfile workers/rdp-worker
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-p1-region-order2 rdp-worker-graphics-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-p1-region-order2 rdp-worker-cursor-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-p1-region-order2 rdp-worker-service-adapter-protocol-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-p1-region-order2 rdp-worker-dataplane-bind-probe --scenario valid
```

Results:

| Check | Result | Notes |
| --- | --- | --- |
| Backend `go test ./...` | PASS | Most packages still have no test files |
| Windows solution build | PASS | 0 warnings, 0 errors |
| Worker graphics adapter probe | PASS | `graphics_adapter_probe ok` |
| Worker cursor adapter probe | PASS | `cursor_adapter_probe ok` |
| Worker service adapter protocol probe | PASS | channel model prints successfully |
| Worker direct bind valid probe | PASS | `PASS scenario=valid` |
| P1 worker image build | PASS | `rap-rdp-worker:rdp-p1-region-order2` |
| P1 worker probes | PASS | graphics, cursor, protocol, direct bind |
| P1 smoke-worker deployment | PASS | `rap_worker_smoke` online on test Docker |
| P3 backend secret guard tests | PASS | production plaintext metadata rejected; dev/smoke allowed |
| P3 data-plane policy test | PASS | allowed channels follow clipboard/file-transfer policy |
| P3 worker bind denial probes | PASS | wrong worker/user/org/resource/attachment/channels/state rejected |
| P3.3 production secret smoke | PASS | secret-backed RDP resource starts real session on test stand |
| P3.3 production fallback smoke | PASS | production backend omits smoke-only direct WSS candidate |
| P3.3 dev/smoke direct candidate | PASS | direct candidate is `smoke_only=true`, not production trusted |
| P3.4 production WSS trust design | PASS | platform CA, certificate lifecycle, app-local trust, smoke plan documented |
| P3.5 app-local platform CA smoke | PASS | direct worker WSS selected without insecure TLS bypass; unknown CA and smoke-only production fallback proved |
| P3.6 stale worker event idempotency | PASS | backend restart survives stale Redis worker events; terminal PostgreSQL sessions stay terminal |
| Stage 5.2 file download build | PASS | backend/worker/client build |
| Stage 5.2 core download runtime | PASS | direct worker WSS and backend gateway text/binary size/hash; policy block for disabled/client_to_server |
| Stage 5.2 download lifecycle blocking | PASS | detach blocks, old-controller takeover returns `session.taken_over`, worker failure marks session `failed` and closes direct WS |

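The P3.6 row ("terminal PostgreSQL sessions stay terminal") describes an idempotency rule that is compact enough to sketch. The state names below are assumptions apart from `failed`, which appears in the table; the real state machine lives in the session broker and is not reproduced here.

```go
package main

import "fmt"

// SessionState is a hypothetical subset of session states.
type SessionState string

const (
	StateActive     SessionState = "active"
	StateDetached   SessionState = "detached"
	StateFailed     SessionState = "failed"
	StateTerminated SessionState = "terminated"
)

func isTerminal(s SessionState) bool {
	return s == StateFailed || s == StateTerminated
}

// ApplyWorkerEvent returns the state to persist after a worker event.
// Events for terminal sessions are ignored, so replaying stale events
// after a backend restart cannot revive a finished session.
func ApplyWorkerEvent(persisted, reported SessionState) SessionState {
	if isTerminal(persisted) {
		return persisted
	}
	return reported
}

func main() {
	fmt.Println(ApplyWorkerEvent(StateTerminated, StateActive))
	fmt.Println(ApplyWorkerEvent(StateActive, StateFailed))
}
```

Applying the same stale event any number of times yields the same persisted state, which is the property the smoke test checks.
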
Important limitation:

- this snapshot does not replace a live manual RDP smoke pass
- the repository directory used for this audit is not currently a Git checkout,
so commit-level provenance is unavailable here

## Feature Matrix

| Area | Status | Current proof level | Next action |
| --- | --- | --- | --- |
| Backend foundation | Implemented | build/test PASS | expand automated tests |
| Auth/refresh/devices | Implemented | previous runtime proof | add regression tests |
| Organization scope | Implemented | previous hardening pass | add cross-org tests |
| Session lifecycle | Implemented | live-proven | protect from regression |
| Worker registration/leases | Implemented | live-proven | protect from regression |
| Worker-death recovery | Implemented | live-proven | add automated smoke |
| Structured messaging/localization | Implemented | runtime-proven | protect from regression |
| Direct worker WSS | Implemented | live-proven | preserve |
| Backend gateway fallback | Implemented | smoke-proven | preserve |
| Binary direct render | Implemented | smoke-proven | preserve |
| RDP region-first render | Implemented | live/manual usable | harden artifacts |
| Direct attach baseline | Implemented | current baseline | preserve |
| Region-loss repair | Implemented | current baseline | diagnose remaining artifacts |
| Ordered region delivery | Implemented | manual visual smoke accepted | protect |
| RDPGFX | Gated only | default path smoke-proven | keep disabled |
| Keyboard/mouse input | Implemented | manually usable | protect |
| Cursor updates | Implemented | probe/smoke-proven | protect |
| Text clipboard | Implemented | accepted | protect |
| File upload | Implemented | accepted to worker storage | protect |
| Restricted drive visibility | Implemented | runtime-proven via `RAP_Transfers` | protect |
| File download | Implemented | core data path and lifecycle blocking runtime-proven; desktop UI proof pending | prove remaining UI next |
| Resource secret readiness | Guard implemented | backend tests PASS | protect |
| Encrypted secret resolver | MVP implemented | live smoke PASS on test stand | harden KMS/rotation later |
| Direct worker WSS TLS/PKI guard | Guard implemented | production platform CA smoke PASS | preserve |
| Stale worker event restart safety | Implemented | runtime smoke PASS | protect |
| Node-agent runtime | Not implemented | control-plane foundation only | future |
| Mesh/VPN/runtime | Not implemented | target architecture only | future |
| SSH/VNC adapters | Not implemented | none | future after RDP |

## RDP Baseline

Current accepted RDP worker image:

```text
rap-rdp-worker:rdp-p1-region-order2
```

Previous accepted baseline image:

```text
rap-rdp-worker:rdp-region-repair
```

Current RDP render model:

- classic FreeRDP/GDI region-first BGRA path
- direct worker WSS binary `RAP2` frames
- backend gateway JSON/base64 fallback
- full frame on connect/attach/baseline/recovery/fallback repair
- dirty region updates as normal display path
- cursor as independent latest-only channel
- input highest priority
- clipboard and file upload reliable/policy-gated

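"Latest-only" for the cursor channel means a slow consumer should only ever see the newest position, never a backlog. The worker itself is C++; the Go sketch below just illustrates the semantics with a channel of capacity 1, and the `CursorUpdate` type is hypothetical.

```go
package main

import "fmt"

// CursorUpdate is a hypothetical cursor message.
type CursorUpdate struct {
	Seq  uint64
	X, Y int
}

// PublishLatest keeps at most one pending update in ch.
// It assumes a single producer goroutine and a channel of capacity 1;
// with an unbuffered channel and no reader it would spin forever.
func PublishLatest(ch chan CursorUpdate, u CursorUpdate) {
	for {
		select {
		case ch <- u:
			return
		default:
			// Slot is full: drop the stale update, then retry the send.
			select {
			case <-ch:
			default:
			}
		}
	}
}

func main() {
	ch := make(chan CursorUpdate, 1)
	PublishLatest(ch, CursorUpdate{Seq: 1, X: 10, Y: 10})
	PublishLatest(ch, CursorUpdate{Seq: 2, X: 12, Y: 11})
	fmt.Println((<-ch).Seq) // the older update was replaced
}
```

Dirty regions cannot use this pattern because dropping one would corrupt the picture; cursor positions can, because each update fully replaces the previous one.
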
Current RDP known limitation:

- window drag uses old-client/slow-link style frame-only movement; repaint after
releasing a moved window is usable but not yet polished

Current accepted P1 behavior:

- dirty-region updates are preserved in-order through `SessionRuntime`, worker
direct WSS, Windows transport, and WPF presenter queues
- full frames still supersede pending region queues
- worker direct region queue overflow requests throttled full-frame repair
- client logs region sequence gaps and regions received before a baseline
- manual visual smoke accepted idle repaint, Start menu/hover, drag usability,
keyboard, mouse, and session close

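The queueing rules in the accepted P1 behavior (ordered regions, full frames superseding pending regions, throttled repair on overflow, gap detection) can be sketched together. This is an illustration in Go rather than the worker's C++ or the client's C#, and the types, the limit, and the millisecond throttle are assumptions.

```go
package main

import "fmt"

// Update is a hypothetical render update; FullFrame marks a baseline.
type Update struct {
	Seq       uint64
	FullFrame bool
}

// RegionQueue keeps dirty regions in order, bounded by limit.
type RegionQueue struct {
	pending      []Update
	limit        int
	lastRepairMS int64
	repairGapMS  int64 // minimum spacing between repair requests
}

// Push returns true when the caller should request a full-frame repair.
func (q *RegionQueue) Push(u Update, nowMS int64) bool {
	if u.FullFrame {
		// A full frame supersedes every pending region.
		q.pending = append(q.pending[:0], u)
		return false
	}
	if len(q.pending) >= q.limit {
		// Overflow: regions were lost, so drop the queue and ask for
		// a repair, but not more often than repairGapMS.
		q.pending = q.pending[:0]
		if nowMS-q.lastRepairMS >= q.repairGapMS {
			q.lastRepairMS = nowMS
			return true
		}
		return false
	}
	q.pending = append(q.pending, u)
	return false
}

// HasGap reports whether seq skips ahead of the last delivered sequence.
func HasGap(last, seq uint64) bool {
	return seq > last+1
}

func main() {
	q := &RegionQueue{limit: 2, repairGapMS: 1000}
	q.Push(Update{Seq: 1}, 5000)
	q.Push(Update{Seq: 2}, 5000)
	fmt.Println("repair requested:", q.Push(Update{Seq: 3}, 5000))
	fmt.Println("gap after 3 -> 5:", HasGap(3, 5))
}
```
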
Current RDP non-goals:

- no DP-3B adaptive quality yet
- no compression/codecs/tiles yet
- no RDPGFX default enable
- no full Stage 5.2 desktop UI acceptance yet
- no UI redesign
- no backend/session lifecycle rewrite

## Documentation Truth Status

Updated during P0:

- `README.md`
- `README_START_HERE.md`
- `docs/codex/CURRENT_STATUS.md`
- `docs/codex/NEXT_STEP_PROMPT.md`
- `clients/windows/README.md`
- `workers/rdp-worker/README.md`
- `docs/architecture/DATA_PLANE_V1.md`
- `docs/architecture/RDP_ADAPTER_RUNTIME.md`
- `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`
- `docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`
- `docs/audits/CURRENT_BASELINE_MATRIX.md`

Current authoritative audit:

- `docs/audits/PROJECT_AUDIT_2026-04-26.md`

Legacy warning:

- `docs/_legacy_v1` is historical reference only and must not be used for
implementation decisions

## Correct Next Step

Proceed with Stage 5.2 remaining live runtime proof - Server-to-Client File
Download:

- keep `rap-backend-smoke:stage5-2-download` and
`rap-rdp-worker:stage5-2-download` deployed on `docker-test`
- prove Windows desktop UI download for files placed in `RAP_Transfers\ToClient`
- prove rendering/input/clipboard/upload/reconnect/takeover regressions
- keep backend gateway fallback active
- do not start arbitrary remote path download, SMB/WebDAV, Windows agent,
binary file chunk frames, DP-3B, mesh/VPN, node-agent runtime, or new adapters
@@ -0,0 +1,662 @@
# Project Audit And Next-Step Plan

Date: 2026-04-26

Status: documentation/audit only. No runtime behavior is changed by this
document.

## 1. Executive Summary

The project is no longer just an RDP proxy. The correct target is a Secure
Access Fabric platform with a control plane, direct realtime data plane,
service adapters, tenant isolation, and future node/mesh/VPN capabilities.

The implementation has reached a much more advanced state than several
operational documents describe. The most important current risk is therefore
not only code quality. It is source-of-truth drift: old prompts and READMEs can
send the next stage in the wrong direction.

The RDP MVP has proven the hard lifecycle assumptions:

- real RDP connection through the worker works
- active/detach/reattach/takeover/terminate flows are proven
- takeover does not recreate the remote session
- worker-death/orphan-active-session recovery is proven
- Windows client can render and control a real remote desktop
- direct worker WSS data plane is implemented and used
- binary render frames are implemented on direct data plane
- backend gateway JSON/base64 path remains available as fallback/debug
- ordered dirty-region delivery is accepted as the current RDP baseline
- text clipboard is implemented and accepted
- client-to-server file upload to worker-controlled storage is accepted
- restricted drive visibility is runtime-proven: uploaded files are visible and
openable inside the remote Windows session through `RAP_Transfers`

The RDP adapter lesson is clear: "make it simple first and patch later" is
dangerous for realtime protocols. Full-frame polling, implicit refresh after
input, and backend/Redis realtime relaying worked for proof, but they caused
the exact class of latency and correctness issues we later had to unwind. From
this point forward, each service adapter must be specified as an event-driven
adapter before implementation.

Recommended immediate priority:

1. Freeze and document the current working baseline.
2. Synchronize stale project docs with the real state.
3. Preserve the accepted RDP visual correctness/stability baseline.
4. Preserve the accepted Stage 5.1.1 restricted drive visibility behavior.
5. Add automated regression gates so manual discoveries become repeatable tests.

## 2. Audit Method

This audit used the current filesystem state in:

```text
\\192.168.220.200\mst\codex\rdp-proxy
```

Important environment note:

- the directory is not currently a Git checkout (`git status` reports that no
`.git` repository exists), so this audit cannot use commit history
- the canonical test Docker host is `docker-test` / `192.168.200.61`
- the live test stack currently contains `rap_backend_smoke`, `rap_worker_smoke`,
`rap_postgres`, and `rap_redis`

Commands run during this audit:

```powershell
go test ./...
dotnet build .\clients\windows\RemoteAccessPlatform.Windows.slnx
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-graphics-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-cursor-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-service-adapter-protocol-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-dataplane-bind-probe --scenario valid
```

Results:

- backend tests: PASS
- Windows client build: PASS, 0 warnings, 0 errors
- worker graphics adapter probe: PASS
- worker cursor adapter probe: PASS
- worker service adapter protocol probe: PASS
- worker data-plane bind valid probe: PASS

Coverage warning:

- most backend modules still report `[no test files]`
- much of the current confidence comes from smoke/manual proofs and logs
- this is not enough for production readiness

## 3. Planned Direction

The authoritative long-term direction is:

- `CODEX_CONTEXT.md`
- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`
- `docs/architecture/DATA_PLANE_V1.md`
- `docs/architecture/SERVICE_ADAPTER_PROTOCOL.md`
- `docs/architecture/RDP_ADAPTER_RUNTIME.md`
- `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`

The target platform model is:

```text
Access Client
-> Ingress / Data Plane
-> Secure Fabric / Routing
-> Service Adapter at egress edge
-> Target service
```

For RDP specifically:

```text
Access Client
<-> platform session/data-plane protocol
RDP Adapter
<-> FreeRDP / project-owned RDP internals
RDP Server
```

This naming should be kept consistent:

- Access Client: native Windows/iOS/Android/Linux client that speaks the
platform protocol
- Control Plane: backend API, auth, orgs, policy, session lifecycle, audit
- Data Plane: realtime session traffic channels
- Service Adapter: protocol translator for RDP/VNC/SSH/video/etc
- RDP Adapter: current C++ RDP service adapter
- Entry/Ingress Node: accepts client connections into the fabric
- Egress/Service Node: reaches target resources and hosts adapters
- Node Agent: native host identity, update, health, and service supervisor

## 4. What Is Implemented

### Backend

Implemented:

- Go backend foundation
- PostgreSQL source-of-truth storage
- Redis live coordination/routing
- auth foundation
- refresh token rotation
- devices/trusted devices
- org-scoped resources and sessions
- platform-core v2 foundation
- identity source foundation
- node/node-agent control-plane foundation
- session broker orchestration
- worker coordination and stale worker monitoring
- structured localization-ready messages
- resource certificate verification policy
- clipboard policy
- file-transfer policy
- data-plane token/candidate generation
- backend gateway fallback

Key files:

- `backend/internal/modules/sessionbroker/service.go`
- `backend/internal/modules/sessionbroker/orchestration.go`
- `backend/internal/modules/sessionbroker/state_machine.go`
- `backend/internal/modules/sessionbroker/dataplane.go`
- `backend/internal/modules/sessiongateway/module.go`
- `backend/internal/modules/worker/monitor.go`
- `backend/internal/modules/resource/module.go`
- `backend/internal/modules/auth/service.go`
- `backend/internal/platform/httpx/message.go`
- `backend/migrations/000005_platform_core_v2.up.sql`
- `backend/migrations/000007_clipboard_policy_mode.up.sql`
- `backend/migrations/000008_file_transfer_policy_mode.up.sql`

Known backend gaps:

- automated test coverage is thin outside `sessionbroker`
- P3/P3.1 resource secret-readiness and encrypted resolver MVP exists;
production mode rejects plaintext credential metadata and requires
`secret_ref` for RDP/VNC/SSH resources
- external KMS/Vault integration and master-key rotation are not implemented
yet
- admin/control UI for safe resource/policy management is not the current focus
- node-agent runtime is not implemented; only control-plane foundation exists
- identity source sync runtime is not implemented

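The secret-readiness rule described in the gaps list (production rejects plaintext credential metadata and requires `secret_ref` for RDP/VNC/SSH) can be sketched as a single validation function. The struct, the environment string, and the list of plaintext keys are assumptions for illustration; the actual guard in the backend may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Resource is a hypothetical view of a resource row.
type Resource struct {
	Protocol  string
	SecretRef string
	Metadata  map[string]string
}

// plaintextKeys is an assumed list of metadata keys that would carry
// credentials in plaintext.
var plaintextKeys = []string{"password", "private_key", "passphrase"}

// ValidateResource applies the production guard; dev/smoke is permissive.
func ValidateResource(env string, r Resource) error {
	if env != "production" {
		return nil
	}
	for _, k := range plaintextKeys {
		if _, ok := r.Metadata[k]; ok {
			return fmt.Errorf("plaintext credential metadata %q rejected in production", k)
		}
	}
	switch r.Protocol {
	case "rdp", "vnc", "ssh":
		if r.SecretRef == "" {
			return errors.New("secret_ref is required in production")
		}
	}
	return nil
}

func main() {
	r := Resource{Protocol: "rdp", Metadata: map[string]string{"password": "hunter2"}}
	fmt.Println("smoke ok:", ValidateResource("smoke", r) == nil)
	fmt.Println("production ok:", ValidateResource("production", r) == nil)
}
```
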
### Windows Client

Implemented:

- WPF client skeleton and build
- auth/login/refresh/logout foundation
- organization selection
- resource list
- active sessions
- session window
- direct data-plane selection with fallback
- binary render receive path
- input capture/forwarding
- cursor/render display
- localization-ready resource layer
- text clipboard UI/path
- file upload UI/path
- failed-session refresh after gateway close

Key files:

- `clients/windows/src/RemoteAccessPlatform.Windows.App/SessionWindow.xaml`
- `clients/windows/src/RemoteAccessPlatform.Windows.Application/ViewModels/SessionWindowViewModel.cs`
- `clients/windows/src/RemoteAccessPlatform.Windows.Transport/SessionGatewayClient.cs`
- `clients/windows/src/RemoteAccessPlatform.Windows.App/Input/SessionInputMapper.cs`
- `clients/windows/src/RemoteAccessPlatform.Windows.Application/Localization/Strings.cs`
- `clients/windows/src/RemoteAccessPlatform.Windows.Application/Resources/Strings.resx`

Known client gaps:

- final UX polish is not complete
- automated client regression tests are missing
- manual RDP UX remains the acceptance authority for now
- some README limitations are stale and understate what exists

### RDP Worker / RDP Adapter

Implemented:

- standalone C++ worker service
- FreeRDP integration behind worker boundary
- worker registration/assignment/lease lifecycle
- direct worker WSS endpoint
- RS256 data-plane token validation
- direct bind policy and current attachment validation
- JSON control/input/clipboard/file-upload envelopes
- binary RAP2 render frames for direct path
- backend gateway JSON/base64 fallback
- region-first BGRA render path
- direct attach baseline full-frame repair
- region-loss full-frame repair throttle
- cursor adapter boundary
- text clipboard through FreeRDP `cliprdr`
- client-to-server file upload
- restricted visible transfer directory
- restricted FreeRDP drive redirection groundwork

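The direct bind policy compares what a data-plane token claims against what the worker knows about the session; the baseline matrix lists wrong worker, user, org, resource, attachment, channels, and state as rejected cases. The sketch below shows that comparison in Go for readability. The worker is C++, the struct layout is hypothetical, and RS256 signature verification is assumed to have happened before this function is called.

```go
package main

import (
	"errors"
	"fmt"
)

// BindRequest holds claims from an already-verified data-plane token.
type BindRequest struct {
	WorkerID, UserID, OrgID, ResourceID, AttachmentID string
	Channels                                          []string
}

// SessionRecord is the worker's view of the session being bound.
type SessionRecord struct {
	WorkerID, UserID, OrgID, ResourceID string
	CurrentAttachmentID                 string
	State                               string
	AllowedChannels                     map[string]bool
}

// ValidateBind rejects the bind on the first mismatch.
func ValidateBind(req BindRequest, s SessionRecord) error {
	switch {
	case req.WorkerID != s.WorkerID:
		return errors.New("wrong worker")
	case req.UserID != s.UserID:
		return errors.New("wrong user")
	case req.OrgID != s.OrgID:
		return errors.New("wrong organization")
	case req.ResourceID != s.ResourceID:
		return errors.New("wrong resource")
	case req.AttachmentID != s.CurrentAttachmentID:
		return errors.New("stale attachment")
	case s.State != "active": // state name is an assumption
		return errors.New("session not in a bindable state")
	}
	for _, ch := range req.Channels {
		if !s.AllowedChannels[ch] {
			return fmt.Errorf("channel %q not allowed by policy", ch)
		}
	}
	return nil
}

func main() {
	s := SessionRecord{WorkerID: "w1", UserID: "u1", OrgID: "o1", ResourceID: "r1",
		CurrentAttachmentID: "a2", State: "active",
		AllowedChannels: map[string]bool{"render": true, "input": true}}
	req := BindRequest{WorkerID: "w1", UserID: "u1", OrgID: "o1", ResourceID: "r1",
		AttachmentID: "a1", Channels: []string{"render"}}
	fmt.Println(ValidateBind(req, s))
}
```

Checking the attachment id is what lets a takeover invalidate the old controller's direct connection without touching the remote session.
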
Key files:

- `workers/rdp-worker/src/main.cpp`
- `workers/rdp-worker/src/runtime/session_runtime.cpp`
- `workers/rdp-worker/include/rdp_worker/runtime/session_runtime.hpp`
- `workers/rdp-worker/src/adapter/rdp_adapter_runtime.cpp`
- `workers/rdp-worker/src/freerdp/rdp_runtime.cpp`
- `workers/rdp-worker/src/dataplane/direct_wss_server.cpp`
- `workers/rdp-worker/src/runtime/direct_bind_policy.cpp`
- `workers/rdp-worker/include/rdp_worker/adapter/service_adapter_protocol.hpp`

Current live/smoke images:

```text
rap-backend-smoke:stage5-2-download
rap-rdp-worker:stage5-2-download
```

Known worker/RDP gaps:

- drag/release repaint is usable but not polished; drag behaves like an older
RDP client on a weak link by moving a frame rather than continuously
repainting the full window
- RDPGFX is gated and disabled by default because the current live target resets
the connection when RDPGFX is advertised
- encoded graphics/codecs/tiles are not production-accepted yet
- file download core data path is runtime-proven through direct worker WSS and
backend gateway fallback, and lifecycle blocking is runtime-proven for
detach, old-controller takeover, and worker failure. Stage 5.2 is not fully
runtime-accepted until Windows desktop UI download is proven
- FreeRDP is still the substrate; replacing it is not justified until the
adapter boundary proves which pieces are actually insufficient

## 5. Plan vs Fact Matrix

| Area | Planned | Current fact | Status |
| --- | --- | --- | --- |
| Backend foundation | Go, config, HTTP, PostgreSQL, Redis | Implemented and builds | Done |
| Auth | access/refresh flow, sessions, devices | Implemented | Done |
| Session lifecycle | start/attach/detach/takeover/terminate/fail/recover | Live-proven earlier and preserved | Done, protect |
| Multi-tenancy | organizations and org-scoped resources/sessions | Implemented | Done, needs more tests |
| Authorization | platform/admin/member boundaries | Implemented foundation | Needs broader tests |
| Worker coordination | registration, lease, stale recovery | Implemented and live-proven | Done, protect |
| Windows client MVP | native WPF client | Implemented and builds | Done |
| Localization messaging | structured backend/client messaging | Implemented and runtime-proven earlier | Done, protect |
| Direct data plane | client-to-worker WSS | Implemented | Done |
| Binary render | direct binary render, fallback JSON/base64 | Implemented | Done |
| RDP adapter event model | event-driven adapter boundary | Implemented and P1 accepted | Done, protect |
| RDP render quality | grayscale foundation | Implemented | Partial |
| RDPGFX/encoded graphics | future performance path | gated only, not accepted | Not production |
| Clipboard | text-only, policy-gated | Accepted | Done |
| File upload | client-to-server to worker storage | Accepted | Done |
| File visibility in RDP | restricted drive redirection | Accepted via `RAP_Transfers` | Done, protect |
| File download | server-to-client | Core and lifecycle runtime-proven, desktop UI proof pending | Prove UI next |
| Mesh/VPN/multi-cluster runtime | target architecture only | Not implemented | Correctly deferred |
| Node-agent runtime/updater | target/foundation only | Not implemented | Future |
| Identity sync runtime | LDAP/OIDC sync | Not implemented | Future |

## 6. Important Source-Of-Truth Drift
|
||||
|
||||
At the start of this audit these files were stale or partly stale:
|
||||
|
||||
- `README.md` still points to old legacy docs and says not to start with UI,
|
||||
while the Windows client already exists
|
||||
- `docs/codex/CURRENT_STATUS.md` says WebSocket takeover proof is still a gap,
|
||||
even though that proof was later closed
|
||||
- `docs/codex/NEXT_STEP_PROMPT.md` previously pointed to platform-core v2 as
|
||||
the next step, although platform-core v2 already exists
|
||||
- `clients/windows/README.md` still says it intentionally stops short of final
|
||||
viewer rendering, but the client now renders the remote desktop
|
||||
- `workers/rdp-worker/README.md` documented recent RDP stages, but previously
|
||||
did not clearly mark the current accepted image and latest manual acceptance
|
||||
- `docs/architecture/DATA_PLANE_V1.md` previously had a stale "Next
|
||||
Implementation Prompt"; it now points to Stage 5.2 live runtime proof
|
||||
- `docs/architecture/RDP_ADAPTER_RUNTIME.md` and
  `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md` still mark manual UX
  acceptance as pending, which reflects the state before the latest fixes
|
||||
|
||||
This was the P0 risk addressed by the baseline-freeze documentation pass. Future
|
||||
stages must keep these files current after every accepted runtime change so a
|
||||
future Codex/session cannot follow an old prompt and reintroduce
|
||||
already-rejected architecture.
|
||||
|
||||
## 7. Lessons From The RDP Adapter Work
|
||||
|
||||
The RDP work exposed several project-level rules:
|
||||
|
||||
1. Realtime protocol features must be designed as channel semantics first.
|
||||
Input, display, cursor, clipboard, file transfer, and telemetry cannot share
|
||||
one undifferentiated queue.
|
||||
|
||||
2. Backend/Redis must not be the production realtime path. It is correct as
|
||||
fallback/debug/control-plane glue, not for high-rate render.
|
||||
|
||||
3. Full-frame rendering is not the normal production model. It is needed for
|
||||
baseline, attach, resize, recovery, and fallback repair.
|
||||
|
||||
4. Dirty regions cannot be blindly latest-only without a repair strategy.
|
||||
Dropping a region update may leave visible artifacts; the current
|
||||
`region_loss_repair` full-frame repair is a pragmatic safety net.
|
||||
|
||||
5. Server-origin events must drive display updates. Remote changes must not
|
||||
depend on local mouse/keyboard events.
|
||||
|
||||
6. Input must be independent from render. A key or click must never wait behind
|
||||
a frame, upload chunk, clipboard message, or lease renewal.
|
||||
|
||||
7. FreeRDP is not the problem by default. The earlier problem was how we pumped
|
||||
events, scheduled frames, relayed payloads, and treated screen updates. The
|
||||
correct direction is an adapter boundary around FreeRDP first, not a full
|
||||
rewrite before we can prove the replacement.
|
||||
|
||||
8. Manual UX proof is essential. Automated input can pass while real user input
|
||||
feels wrong.
|
||||
|
||||
9. Every "temporary" shortcut needs an explicit expiration condition. If it does
|
||||
not have one, it becomes architecture.
|
||||
|
||||
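Rules 1 and 6 can be illustrated with a small Go sketch: one bounded queue per channel, a non-blocking enqueue that counts drops, and a sender that always drains input before render traffic. The names `Outbox`, `ChannelKind`, and `sendOrder`, the three-channel subset, and the priority order are assumptions made for this illustration; they are not taken from the worker code.

```go
package main

import "fmt"

// ChannelKind names a logical channel; the three values are a subset chosen for the sketch.
type ChannelKind int

const (
	Input ChannelKind = iota
	Display
	Clipboard
)

// sendOrder is the assumed priority: input first, bulk traffic last.
var sendOrder = []ChannelKind{Input, Display, Clipboard}

// Outbox keeps one bounded queue per channel so a busy channel cannot delay
// another one. It is not safe for concurrent use; a real worker would add locking.
type Outbox struct {
	queues  map[ChannelKind]chan []byte
	Dropped map[ChannelKind]int
}

func NewOutbox(capacity int) *Outbox {
	o := &Outbox{queues: map[ChannelKind]chan []byte{}, Dropped: map[ChannelKind]int{}}
	for _, k := range sendOrder {
		o.queues[k] = make(chan []byte, capacity)
	}
	return o
}

// Enqueue never blocks. A full queue rejects the message and counts the drop,
// so the caller can schedule a repair instead of silently losing a region.
func (o *Outbox) Enqueue(k ChannelKind, msg []byte) bool {
	select {
	case o.queues[k] <- msg:
		return true
	default:
		o.Dropped[k]++
		return false
	}
}

// Next returns the next message to send, always draining input before render.
func (o *Outbox) Next() (ChannelKind, []byte, bool) {
	for _, k := range sendOrder {
		select {
		case m := <-o.queues[k]:
			return k, m, true
		default:
		}
	}
	return 0, nil, false
}

func main() {
	o := NewOutbox(2)
	o.Enqueue(Display, []byte("region-1"))
	o.Enqueue(Display, []byte("region-2"))
	o.Enqueue(Display, []byte("region-3")) // queue full: counted as a drop
	o.Enqueue(Input, []byte("key-down"))

	k, m, _ := o.Next()
	fmt.Println(k == Input, string(m), o.Dropped[Display])
}
```

The drop counter matters as much as the priority order: per rule 4, a dropped display region should trigger a repair rather than disappear unnoticed.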
## 8. What We May Have Missed
|
||||
|
||||
These are not immediate bugs, but they should be addressed early because they
|
||||
shape the product:
|
||||
|
||||
- RDP server compatibility matrix: Windows Server versions, NLA modes, GDI vs
|
||||
RDPGFX behavior, color depth, TLS/cert behavior, domain login variants
|
||||
- weak-channel simulation: latency, jitter, loss, constrained bandwidth
|
||||
- high-concurrency session model: many users, many workers, CPU/network limits
|
||||
- deterministic smoke reports: every accepted stage should leave reproducible
|
||||
artifacts and commands
|
||||
- secret management: credentials must move out of plain resource metadata
|
||||
- production PKI: direct worker WSS currently uses smoke-friendly TLS handling
|
||||
on the client side
|
||||
- authorization tests: cross-org denial paths need automated coverage
|
||||
- resource policy test matrix: clipboard/file/cert/session policies
|
||||
- file transfer threat model: filename normalization, symlink escape, overwrite
|
||||
behavior, quotas, cleanup, audit
|
||||
- observability: per-channel latency, frame drops, input latency, worker event
|
||||
pump health, adapter callback counters
|
||||
- client UI state machine tests: close/dispose, failed state, reconnect,
|
||||
takeover, detach, old attachment blocking
|
||||
- upgrade/rollback story: node-agent target exists, runtime is not implemented
|
||||
- deployment topology: container host networking vs Docker bridge/NAT for
|
||||
realtime workloads
|
||||
- service adapter conformance suite: RDP now has a pattern that VNC/SSH/video
|
||||
should follow
|
||||
|
||||
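For the file transfer threat model item, a minimal Go sketch of a name check for a restricted drop zone follows. It assumes the client sends only a plain file name and that the drop directory itself is trusted; `ResolveDownload`, `ErrRejected`, and the size limit parameter are hypothetical names, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path"
	"path/filepath"
	"strings"
)

// ErrRejected is returned for any name or file the drop zone must not serve.
var ErrRejected = errors.New("transfer rejected")

// ResolveDownload maps a requested file name to a path inside dropDir.
// It accepts only a plain name of an existing regular file up to maxBytes.
func ResolveDownload(dropDir, name string, maxBytes int64) (string, error) {
	// Treat both separator styles alike, then require the name to be its own base.
	base := path.Base(strings.ReplaceAll(name, "\\", "/"))
	if base != name || base == "." || base == ".." || base == "/" {
		return "", ErrRejected // traversal, nested path, or empty name
	}
	full := filepath.Join(dropDir, base)
	info, err := os.Lstat(full) // Lstat does not follow a symlink at the final element
	if err != nil {
		return "", err
	}
	if !info.Mode().IsRegular() || info.Size() > maxBytes {
		return "", ErrRejected // symlink, directory, device, or too large
	}
	return full, nil
}

func main() {
	dir, err := os.MkdirTemp("", "rap-drop-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := os.WriteFile(filepath.Join(dir, "report.txt"), []byte("ok"), 0o600); err != nil {
		panic(err)
	}
	for _, name := range []string{"report.txt", "../report.txt", `..\secret`, ""} {
		_, err := ResolveDownload(dir, name, 1<<20)
		fmt.Printf("%q allowed=%v\n", name, err == nil)
	}
}
```

This check alone does not close the window between the check and the later open; a production path would also need to open the file safely and re-verify it, and to cover overwrite behavior, quotas, cleanup, and audit.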
## 9. Architectural Decisions To Freeze Now
|
||||
|
||||
These decisions should be treated as current project rules:
|
||||
|
||||
1. PostgreSQL is source of truth.
|
||||
2. Redis is live coordination/routing only.
|
||||
3. Backend is control plane, not production render relay.
|
||||
4. Direct data plane is preferred for realtime RDP traffic.
|
||||
5. Backend gateway remains fallback/debug until direct path is fully mature.
|
||||
6. Service adapters translate external protocols to platform channels.
|
||||
7. RDP Adapter remains C++ and FreeRDP-backed for now.
|
||||
8. FreeRDP details must not leak into backend or Access Client business logic.
|
||||
9. Access Client speaks platform protocol, not RDP.
|
||||
10. Mesh/VPN/multi-cluster/node-agent runtime remain future staged work.
|
||||
11. RDP must be stabilized before adding VNC/SSH/VPN/product expansion.
|
||||
12. No new feature should start while source-of-truth docs are stale.
|
||||
|
||||
## 10. Recommended Next Stages
|
||||
|
||||
### P0. Truth And Baseline Freeze
|
||||
|
||||
Goal: make the current working system impossible to misunderstand.
|
||||
|
||||
Do:
|
||||
|
||||
- update root `README.md`
|
||||
- update `docs/codex/CURRENT_STATUS.md`
|
||||
- update `docs/codex/NEXT_STEP_PROMPT.md`
|
||||
- update `clients/windows/README.md`
|
||||
- update `workers/rdp-worker/README.md`
|
||||
- update `docs/architecture/DATA_PLANE_V1.md` next prompt
|
||||
- update `docs/architecture/RDP_ADAPTER_RUNTIME.md` with latest baseline/region
|
||||
repair status
|
||||
- document current test Docker image/tag and startup commands
|
||||
- preserve the accepted RDP worker baseline
|
||||
- create one "current smoke matrix" document
|
||||
|
||||
Do not:
|
||||
|
||||
- add features
|
||||
- start DP-3B
|
||||
- start server-to-client download
|
||||
- start mesh/VPN/node-agent runtime
|
||||
|
||||
Acceptance:
|
||||
|
||||
- a new engineer/Codex can read the docs and know the actual next step
|
||||
- no doc points to legacy v1 or already-completed stages as next work
|
||||
|
||||
### P1. RDP Visual Correctness Hardening
|
||||
|
||||
Goal: eliminate remaining small artifacts without returning to slow full-frame
|
||||
rendering.
|
||||
|
||||
Do:
|
||||
|
||||
- add explicit region sequence/gap diagnostics
|
||||
- prove when artifacts happen: region drop, stale region ordering, missed server
|
||||
callback, client application bug, or repair interval issue
|
||||
- verify client applies region frames to the correct bitmap area and stride
|
||||
- keep baseline full frame on attach
|
||||
- keep full repair only on loss/recovery, not as normal render loop
|
||||
- collect before/after screenshots/logs
|
||||
|
||||
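The sequence/gap diagnostics item can be sketched in Go as below. The sketch assumes every region update carries a monotonically increasing sequence number, which this document does not confirm; `GapDetector` and its counters are hypothetical names.

```go
package main

import "fmt"

// GapDetector tracks the last region sequence number seen by the client.
type GapDetector struct {
	last    uint64
	started bool
	Gaps    int // updates that arrived after at least one missing sequence number
	Stale   int // late or duplicate updates that were refused
}

// Observe classifies an incoming region update. apply says whether the region
// may be painted; needRepair says whether a full-frame repair should be requested.
func (d *GapDetector) Observe(seq uint64) (apply, needRepair bool) {
	switch {
	case !d.started:
		d.started, d.last = true, seq
		return true, false
	case seq <= d.last:
		d.Stale++ // painting it would put old pixels over newer ones
		return false, false
	case seq == d.last+1:
		d.last = seq
		return true, false
	default:
		d.Gaps++ // at least one region was lost in between
		d.last = seq
		return true, true
	}
}

func main() {
	var d GapDetector
	for _, seq := range []uint64{1, 2, 4, 3} {
		apply, repair := d.Observe(seq)
		fmt.Printf("seq=%d apply=%v repair=%v\n", seq, apply, repair)
	}
	fmt.Printf("gaps=%d stale=%d\n", d.Gaps, d.Stale)
}
```

Counting gaps and stale arrivals separately helps distinguish a region drop from an ordering problem, which are two of the artifact causes listed above.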
Do not:
|
||||
|
||||
- enable RDPGFX globally
|
||||
- add compression/tiles/codecs before correctness is stable
|
||||
- change backend/session lifecycle
|
||||
|
||||
Acceptance:
|
||||
|
||||
- remote idle updates repaint without local input
|
||||
- Start menu/task manager/window movement leave no persistent artifacts
|
||||
- input and close behavior remain usable
|
||||
|
||||
### P2. Stage 5.1.1 Restricted Drive Visibility Proof
|
||||
|
||||
Status: accepted as runtime-proven on the test Docker stand.
|
||||
|
||||
Goal: keep the upload visibility path protected while the RDP Adapter continues
|
||||
to be hardened.
|
||||
|
||||
Do:
|
||||
|
||||
- run live smoke with current RDP adapter baseline
|
||||
- upload file from Windows client
|
||||
- verify file appears in `\\tsclient\RAP_Transfers`
|
||||
- open text and binary files inside the remote Windows session
|
||||
- prove disabled policy blocks upload
|
||||
- prove takeover/detach/failure block old or invalid upload
|
||||
- verify directory cleanup on terminate
|
||||
|
||||
Do not:
|
||||
|
||||
- implement download
|
||||
- expose arbitrary worker filesystem
|
||||
- implement shared folders or SMB/WebDAV
|
||||
|
||||
Accepted proof:
|
||||
|
||||
- uploaded file is visible and openable inside remote Windows
|
||||
- only per-session visible directory is exposed
|
||||
- worker logs show `RAP_Transfers` configured as the only redirected drive
|
||||
- termination cleans the per-session transfer directory
|
||||
|
||||
### P3. Security And Secrets Readiness
|
||||
|
||||
Status: P3.1 MVP complete; production TLS/PKI remains P3.2.
|
||||
|
||||
Goal: remove proof-stage security shortcuts before broad usage.
|
||||
|
||||
Completed:
|
||||
|
||||
- documented secret-reference model in
|
||||
`docs/architecture/SECURITY_SECRETS_READINESS.md`
|
||||
- production mode rejects plaintext credential-like resource metadata
|
||||
- production RDP/VNC/SSH resources require `secret_ref`
|
||||
- session start rejects legacy plaintext resources in production mode
|
||||
- data-plane allowed-channel policy test exists
|
||||
- worker direct-bind denial probes cover wrong worker/user/org/resource,
|
||||
wrong attachment, over-broad channels, and failed/terminated states
|
||||
- encrypted PostgreSQL-backed `resource_secrets` store exists
|
||||
- resource secret create/rotate endpoint updates `resources.secret_ref` without
|
||||
returning plaintext
|
||||
- session assignment resolves `secret_ref` after organization/resource/session/
|
||||
worker/lease checks and does not mutate `remote_sessions.metadata` with
|
||||
plaintext
|
||||
- secret access/access-denied/rotation audit events exist
|
||||
- direct worker WSS TLS trust metadata/guard exists; production backend omits
|
||||
smoke-only direct candidates and production Windows client skips untrusted
|
||||
direct candidates
|
||||
|
||||
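The production-mode rules above (reject plaintext credential-like metadata, require `secret_ref`) might look like the following Go sketch. The list of credential-like key markers and the function name `ValidateResource` are assumptions for illustration; the real backend may classify keys differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// credentialLikeMarkers is an assumed list; the real rule set is not shown in this document.
var credentialLikeMarkers = []string{"password", "passwd", "secret", "token", "private_key"}

// ValidateResource applies the production-mode checks to a resource definition.
// In non-production mode the sketch accepts everything.
func ValidateResource(production bool, secretRef string, metadata map[string]string) error {
	if !production {
		return nil
	}
	for key := range metadata {
		lower := strings.ToLower(key)
		for _, marker := range credentialLikeMarkers {
			if strings.Contains(lower, marker) {
				return fmt.Errorf("plaintext credential-like metadata key %q rejected in production", key)
			}
		}
	}
	if secretRef == "" {
		return errors.New("production resource requires secret_ref")
	}
	return nil
}

func main() {
	legacy := map[string]string{"host": "10.0.0.5", "Password": "hunter2"}
	clean := map[string]string{"host": "10.0.0.5"}

	fmt.Println(ValidateResource(true, "", legacy))
	fmt.Println(ValidateResource(true, "", clean))
	fmt.Println(ValidateResource(true, "secret-ref-example", clean))
}
```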
Still required after P3.2:
|
||||
|
||||
- deploy production direct-worker certificates/platform CA trust
|
||||
- add external KMS/Vault or stronger key-management integration
|
||||
- add master-key rotation/re-encryption workflow
|
||||
- consider future worker pull/token resolver flow to avoid resolved credentials
|
||||
in Redis assignment payloads
|
||||
|
||||
Do not:
|
||||
|
||||
- build full enterprise KMS prematurely
|
||||
- weaken certificate or token model for convenience
|
||||
|
||||
Acceptance:
|
||||
|
||||
- production mode cannot create/start resources with plaintext credential
|
||||
metadata
|
||||
- cross-org, old-attachment, wrong worker/resource/org, and terminal-session
|
||||
denial paths are covered by focused tests/probes
|
||||
|
||||
### P4. Automated Regression Suite
|
||||
|
||||
Goal: convert the painful manual discoveries into repeatable gates.
|
||||
|
||||
Do:
|
||||
|
||||
- add backend unit/integration tests for org scope, session state, data-plane
|
||||
token, stale worker, clipboard/file policies
|
||||
- add worker probes for render sequencing, direct baseline, region repair,
|
||||
adapter event routing
|
||||
- add Windows transport/viewmodel tests for fallback, close/dispose, failed
|
||||
state, frame latest-only, localization resolution
|
||||
- make smoke scripts emit machine-readable PASS/FAIL reports
|
||||
- pin each accepted image/build artifact
|
||||
|
||||
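A machine-readable PASS/FAIL report could take a shape like the Go sketch below. The field names, the check names, and the run ID are hypothetical; only the image tag is taken from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CheckResult is one named smoke check.
type CheckResult struct {
	Name   string `json:"name"`
	Passed bool   `json:"passed"`
	Detail string `json:"detail,omitempty"`
}

// SmokeReport is the artifact a smoke script would write next to its logs.
type SmokeReport struct {
	RunID  string        `json:"run_id"`
	Image  string        `json:"image"`
	Status string        `json:"status"`
	Checks []CheckResult `json:"checks"`
}

// BuildReport derives the overall status: PASS only if there is at least one
// check and every check passed.
func BuildReport(runID, image string, checks []CheckResult) SmokeReport {
	status := "PASS"
	if len(checks) == 0 {
		status = "FAIL" // an empty run proves nothing
	}
	for _, c := range checks {
		if !c.Passed {
			status = "FAIL"
			break
		}
	}
	return SmokeReport{RunID: runID, Image: image, Status: status, Checks: checks}
}

func main() {
	report := BuildReport("example-run", "rap-rdp-worker:stage5-2-download", []CheckResult{
		{Name: "attach_baseline_frame", Passed: true},
		{Name: "takeover_blocks_old_controller", Passed: false, Detail: "old attachment still received frames"},
	})
	out, err := json.MarshalIndent(report, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Recording the image tag in the report ties each result to a pinned artifact, which is the last item in the list above.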
Acceptance:
|
||||
|
||||
- a regression in input, render, worker-death, takeover, clipboard, or upload
|
||||
fails a repeatable test before manual smoke
|
||||
|
||||
### P5. RDP Performance Next Layer
|
||||
|
||||
Goal: improve speed on weak channels after correctness is stable.
|
||||
|
||||
Candidate paths:
|
||||
|
||||
- RDPGFX on compatible target only
|
||||
- encoded graphics payloads
|
||||
- dirty-region compression
|
||||
- tile/region framing
|
||||
- adaptive quality profiles
|
||||
- palette/grayscale/low-bandwidth modes
|
||||
- per-channel QoS and backpressure telemetry
|
||||
|
||||
Do not:
|
||||
|
||||
- replace stable region-first path without fallback
|
||||
- ship a graphics mode that only works on one target
|
||||
|
||||
Acceptance:
|
||||
|
||||
- direct full-color baseline remains available
|
||||
- each new graphics mode has compatibility detection and fallback
|
||||
|
||||
### P6. Product Completion For RDP
|
||||
|
||||
Only after P0-P5 gates are stable:
|
||||
|
||||
- manual desktop acceptance for server-to-client file download from
|
||||
`RAP_Transfers\ToClient`
|
||||
- richer file transfer UX
|
||||
- final RDP UX polish
|
||||
- policy management UI
|
||||
- operational runbooks
|
||||
- release readiness checklist
|
||||
|
||||
### P7. Platform Expansion
|
||||
|
||||
Only after RDP is stable:
|
||||
|
||||
- VNC Adapter
|
||||
- SSH Adapter
|
||||
- node-agent runtime/updater
|
||||
- entry/relay nodes
|
||||
- mesh routing
|
||||
- VPN/IP tunnel mode
|
||||
- Linux/iOS/Android clients
|
||||
|
||||
## 11. Proposed Immediate Next Prompt
|
||||
|
||||
Use this as the next implementation prompt if we continue immediately:
|
||||
|
||||
```text
Proceed with Stage 5.2 remaining desktop UI proof only - RDP server-to-client
file download.

Goal:
Finish acceptance of safe, policy-aware download from the remote RDP session to
the Windows Access Client UI using the restricted RAP_Transfers\ToClient drop
zone.

Strict rules:
- do not implement arbitrary remote path download
- do not implement remote filesystem browser
- do not implement recursive folder transfer
- do not implement SMB/WebDAV/Windows agent
- do not expose any worker path outside the per-session visible directory
- do not change RDP rendering/input/clipboard behavior
- do not remove backend gateway fallback
- do not implement binary file chunk frames yet
- do not start DP-3B, mesh, VPN, node-agent runtime, or new adapters

Scope:
1. Keep the current Stage 5.2 backend/worker deployment on docker-test.
2. Prove Windows desktop UI download for text and binary files placed in
   RAP_Transfers\ToClient.
3. Prove rendering, input, clipboard, upload, lifecycle, and fallback do not
   regress.

Acceptance:
- disabled and client_to_server modes block download
- server_to_client and bidirectional modes allow download
- text and binary files download with matching hashes
- traversal/symlink/non-regular/too-large files are blocked
- rendering, input, clipboard, upload, lifecycle, and fallback do not regress
```
|
||||
|
||||
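The first two acceptance lines of the prompt describe a small decision table. A Go sketch of it follows; the four mode strings come from the prompt, while the type and function names are hypothetical, and treating upload as the mirror image of download is an assumption.

```go
package main

import "fmt"

// TransferMode is the per-resource file transfer policy value.
type TransferMode string

const (
	ModeDisabled       TransferMode = "disabled"
	ModeClientToServer TransferMode = "client_to_server"
	ModeServerToClient TransferMode = "server_to_client"
	ModeBidirectional  TransferMode = "bidirectional"
)

// AllowsDownload reports whether server-to-client transfer is permitted.
// Unknown values fall through to false, so the check fails closed.
func AllowsDownload(m TransferMode) bool {
	return m == ModeServerToClient || m == ModeBidirectional
}

// AllowsUpload reports whether client-to-server transfer is permitted.
func AllowsUpload(m TransferMode) bool {
	return m == ModeClientToServer || m == ModeBidirectional
}

func main() {
	modes := []TransferMode{ModeDisabled, ModeClientToServer, ModeServerToClient, ModeBidirectional, "typo"}
	for _, m := range modes {
		fmt.Printf("%-18s download=%v upload=%v\n", m, AllowsDownload(m), AllowsUpload(m))
	}
}
```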
## 12. Bottom Line
|
||||
|
||||
The project direction is sound, but the process must now become stricter:
|
||||
|
||||
- design channel semantics first
|
||||
- implement through adapter boundaries
|
||||
- prove with live/manual smoke and automated gates
|
||||
- update source-of-truth docs before starting the next major stage
|
||||
- reject "temporary" shortcuts unless they have a documented removal condition
|
||||
|
||||
The RDP Adapter experience was expensive, but useful. It showed exactly where
|
||||
the architecture must be disciplined before adding SSH, VNC, VPN, mobile
|
||||
clients, or mesh runtime.
|
||||
@@ -0,0 +1,148 @@
|
||||
# Architecture guardrails
|
||||
|
||||
These rules are mandatory.
|
||||
|
||||
## 1. Preserve the proven session foundation
|
||||
The following are already proven and must remain stable:
|
||||
- live FreeRDP connect
|
||||
- active session state
|
||||
- terminate
|
||||
- detach without killing remote session
|
||||
- reattach without recreating remote session
|
||||
- takeover without recreating remote session
|
||||
|
||||
No architectural refactor may silently weaken this behavior.
|
||||
|
||||
## 2. Source of truth
|
||||
- PostgreSQL is the only durable source of truth for domain state.
|
||||
- Redis is only for live coordination, routing, heartbeats, leases, attach tokens, and ephemeral cache.
|
||||
|
||||
## 3. Control plane vs data plane
|
||||
Keep them distinct.
|
||||
|
||||
### Control plane
|
||||
- organizations
|
||||
- users
|
||||
- memberships
|
||||
- roles
|
||||
- resources
|
||||
- policies
|
||||
- nodes
|
||||
- services
|
||||
- connectors
|
||||
- cluster membership
|
||||
- updates
|
||||
- config distribution
|
||||
|
||||
### Data plane
|
||||
- session streams
|
||||
- worker traffic
|
||||
- relay traffic
|
||||
- connector traffic
|
||||
- future exit traffic
|
||||
|
||||
## 4. Multi-tenancy isolation
|
||||
Every organization must be isolated by design.
|
||||
|
||||
Namespace by organization for:
|
||||
- resources
|
||||
- users-in-org
|
||||
- groups
|
||||
- policies
|
||||
- connectors
|
||||
- sessions
|
||||
- audit
|
||||
- secret references
|
||||
- Redis keys where applicable
|
||||
|
||||
No cross-org leakage of identifiers, data, logs, cache keys, or policy decisions.
|
||||
|
||||
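One way to keep Redis keys namespaced by organization is to build every key through a single helper. The Go sketch below assumes an `org:<orgID>:<kind>:<id>` layout, which is a hypothetical convention chosen for illustration, not the project's actual key schema.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// OrgKey builds an organization-scoped key. It rejects empty parts and parts
// containing the separator or glob characters, so one organization's
// identifier cannot be crafted to match another organization's keys.
func OrgKey(orgID, kind, id string) (string, error) {
	for _, part := range []string{orgID, kind, id} {
		if part == "" || strings.ContainsAny(part, ": *?[]") {
			return "", errors.New("invalid key part")
		}
	}
	return "org:" + orgID + ":" + kind + ":" + id, nil
}

func main() {
	key, err := OrgKey("org-a", "session-lease", "s-123")
	fmt.Println(key, err)

	_, err = OrgKey("org-a:org-b", "session-lease", "s-123")
	fmt.Println(err)
}
```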
## 5. Customer-managed nodes
|
||||
Customer-managed nodes:
|
||||
- may join the common cluster,
|
||||
- must remain limited to allowed scope,
|
||||
- must not automatically become general-purpose relay/control nodes for other organizations.
|
||||
|
||||
## 6. Node agent design
|
||||
A node agent:
|
||||
- is small,
|
||||
- stable,
|
||||
- always running,
|
||||
- supervises services,
|
||||
- downloads signed updates,
|
||||
- verifies signatures and versions,
|
||||
- can roll back,
|
||||
- can restart services,
|
||||
- can operate on thin nodes and thick nodes.
|
||||
|
||||
The agent is not the same as the service workloads.
|
||||
|
||||
## 7. Split-brain prevention
|
||||
Never allow minority partitions to become a second authoritative cluster automatically.
|
||||
|
||||
Required states:
|
||||
- healthy
|
||||
- degraded
|
||||
- recovery
|
||||
- isolated / emergency
|
||||
|
||||
Cluster-wide changes, role changes and risky mutations must be restricted in non-quorum states.
|
||||
|
||||
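The mutation restriction can be sketched as a gate in Go. The sketch assumes a strict-majority quorum and assumes that the isolated / emergency state never permits cluster-wide mutations; how degraded and recovery are treated in the real design is not specified here, so the sketch lets quorum decide for them.

```go
package main

import "fmt"

// ClusterState mirrors the four required states.
type ClusterState int

const (
	Healthy ClusterState = iota
	Degraded
	Recovery
	Isolated
)

// HasQuorum reports whether this partition sees a strict majority of members.
// An even split (for example 2 of 4) is not a quorum on either side.
func HasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}

// AllowClusterMutation gates cluster-wide changes, role changes, and other
// risky mutations. Treating Isolated as always read-only is an assumption.
func AllowClusterMutation(state ClusterState, reachable, total int) bool {
	return state != Isolated && HasQuorum(reachable, total)
}

func main() {
	fmt.Println(AllowClusterMutation(Healthy, 3, 5))  // majority partition
	fmt.Println(AllowClusterMutation(Degraded, 2, 5)) // minority partition
	fmt.Println(AllowClusterMutation(Degraded, 2, 4)) // even split
	fmt.Println(AllowClusterMutation(Isolated, 5, 5)) // isolated stays read-only here
}
```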
## 8. Service model
|
||||
Each node must separate:
|
||||
- capabilities
|
||||
- enabled services
|
||||
|
||||
Do not encode every function into one monolithic node role.
|
||||
|
||||
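Separating capabilities from enabled services can be modeled as two sets with one invariant: a service may be enabled only if the node is capable of running it. The Go sketch below uses hypothetical service names and a hypothetical `Node` type.

```go
package main

import "fmt"

// Node separates what a node could run from what it is currently asked to run.
type Node struct {
	Capabilities map[string]bool
	Enabled      map[string]bool
}

// NewNode creates a node with the given capabilities and nothing enabled.
func NewNode(capabilities ...string) *Node {
	n := &Node{Capabilities: map[string]bool{}, Enabled: map[string]bool{}}
	for _, c := range capabilities {
		n.Capabilities[c] = true
	}
	return n
}

// Enable turns a service on only if the node declares the matching capability.
func (n *Node) Enable(service string) error {
	if !n.Capabilities[service] {
		return fmt.Errorf("node lacks capability %q", service)
	}
	n.Enabled[service] = true
	return nil
}

func main() {
	node := NewNode("rdp-worker", "relay") // service names are illustrative only
	fmt.Println(node.Enable("relay"))
	fmt.Println(node.Enable("vpn-connector"))
	fmt.Println(node.Enabled)
}
```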
## 9. Security model
|
||||
Security must be based on:
|
||||
- strong crypto
|
||||
- signed artifacts
|
||||
- node identity
|
||||
- short-lived user/session tokens
|
||||
- scoped trust
|
||||
- audit trails
|
||||
- revocation
|
||||
- least privilege
|
||||
|
||||
Do not depend on protocol obscurity.
|
||||
|
||||
## 10. Migration strategy
|
||||
Do not force a big-bang rewrite.
|
||||
Add the platform core around the current system in steps:
|
||||
1. organization / membership model
|
||||
2. org-scoped resource model
|
||||
3. node model and node-agent control interfaces
|
||||
4. connector model
|
||||
5. mesh / routing evolution
|
||||
6. native clients and higher-level features
|
||||
|
||||
## 11. Updates and rollback
|
||||
Updates must support:
|
||||
- manual or automatic policy
|
||||
- staged rollout
|
||||
- canary rollout
|
||||
- rollback to previous version
|
||||
- signed artifacts
|
||||
- optional update mirrors / caches on selected nodes
|
||||
|
||||
Thin nodes may download update artifacts but must not store them.
|
||||
|
||||
## 12. Performance and routing awareness
|
||||
Placement and routing decisions must consider:
|
||||
- CPU
|
||||
- RAM
|
||||
- network load
|
||||
- active sessions
|
||||
- connector load
|
||||
- relay load
|
||||
- service type
|
||||
- health score
|
||||
|
||||
## 13. No feature explosion before platform core
|
||||
Do not jump to:
|
||||
- full collaboration/video meetings
|
||||
- advanced media plane
|
||||
- internet exit mode
|
||||
before the platform core is modeled correctly.
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,129 @@
|
||||
# Final platform technical direction (summary)
|
||||
|
||||
## Product definition
|
||||
A distributed secure access platform with:
|
||||
- multi-tenant organizations
|
||||
- proven persistent session broker for RDP
|
||||
- cluster of platform-managed and customer-managed nodes
|
||||
- node-agent based service fabric
|
||||
- connector/VPN layer
|
||||
- future split/full tunnel capability
|
||||
- future collaboration extensions
|
||||
|
||||
## Main top-level domains
|
||||
|
||||
### Platform
|
||||
Owns:
|
||||
- global policies
|
||||
- cluster control plane
|
||||
- platform admins
|
||||
- node trust
|
||||
- artifact signing and update policy
|
||||
- disaster recovery authority
|
||||
|
||||
### Organization
|
||||
Owns:
|
||||
- users
|
||||
- groups
|
||||
- organization admins
|
||||
- identity sources
|
||||
- resources
|
||||
- policies
|
||||
- connectors
|
||||
- audits
|
||||
- quotas
|
||||
- domains / branding later
|
||||
|
||||
### Node
|
||||
Has:
|
||||
- node identity
|
||||
- ownership type (platform-managed, customer-managed)
|
||||
- capabilities
|
||||
- enabled services
|
||||
- health
|
||||
- update policy
|
||||
- version state
|
||||
- partition state
|
||||
|
||||
### Node Agent
|
||||
Small stable agent that:
|
||||
- keeps running
|
||||
- supervises services
|
||||
- downloads signed updates
|
||||
- verifies integrity
|
||||
- restarts crashed services
|
||||
- rolls back if needed
|
||||
- reports health
|
||||
|
||||
### Connector
|
||||
Reusable network access method:
|
||||
- direct
|
||||
- VPN
|
||||
- relay-backed
|
||||
- future egress mode
|
||||
Bound to resources by policy, not duplicated blindly per server.
|
||||
|
||||
### Session broker
|
||||
Already proven for RDP persistent lifecycle.
|
||||
|
||||
## Mandatory capabilities
|
||||
|
||||
### Multi-tenant
|
||||
- org isolation
|
||||
- organization memberships
|
||||
- user may belong to multiple organizations
|
||||
- clear org switching UX later
|
||||
- org admins only see their org
|
||||
|
||||
### Identity federation
|
||||
- local accounts
|
||||
- LDAP / AD
|
||||
- OIDC
|
||||
- group/claim mapping to access
|
||||
|
||||
### Resource authorization
|
||||
- local manual mapping
|
||||
- external group / claim driven mapping
|
||||
- feature scopes:
  - RDP only
  - connector/VPN only
  - both
  - future scopes
|
||||
|
||||
### Cluster behavior
|
||||
- dynamic membership
|
||||
- encrypted inter-node communication
|
||||
- no mandatory single center
|
||||
- quorum-based authority
|
||||
- degraded / recovery / isolated modes
|
||||
- manual partition promotion only by highly privileged recovery admin
|
||||
- multi-hop route support
|
||||
- not every node needs full mesh
|
||||
|
||||
### Updates
|
||||
- signed artifacts
|
||||
- canary rollout
|
||||
- staged rollout
|
||||
- rollback
|
||||
- thin node vs artifact-cache node
|
||||
|
||||
### Customer-managed nodes
|
||||
- can join common cluster
|
||||
- can be scoped to their organization
|
||||
- can serve ingress / connector / egress functions for that organization
|
||||
- must not automatically become cluster-global trusted nodes
|
||||
|
||||
## What to implement first
|
||||
- organization model
|
||||
- memberships and roles
|
||||
- org-scoped resource model
|
||||
- identity source model
|
||||
- node and node-agent control plane model
|
||||
- service capabilities / enabled services model
|
||||
|
||||
## What to delay
|
||||
- full mesh engine
|
||||
- full connector scheduler
|
||||
- internet exit mode
|
||||
- collaboration/video meetings
|
||||
- heavy media routing
|
||||
@@ -0,0 +1,123 @@
|
||||
C17Z20 is complete.
|
||||
|
||||
Installation Authority foundation is also complete:
|
||||
|
||||
- production config requires strict authority mode with Product Root public key
|
||||
- first-owner bootstrap requires a signed activation manifest in strict mode
|
||||
- `installation_authority` and signed `platform_role_grants` are persisted
|
||||
- strict platform-admin checks ignore direct `users.platform_role` edits unless
|
||||
a valid signed grant exists
|
||||
- web-admin shows installation status and first-owner bootstrap
|
||||
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
|
||||
keys and sign activation manifests; private keys must stay outside the repo
|
||||
|
||||
Cluster Authority foundation is now also complete:
|
||||
|
||||
- every newly created cluster gets an Ed25519 `cluster_authorities` key record
|
||||
- cluster authority private keys are encrypted at rest when
|
||||
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
|
||||
a secret encryption key
|
||||
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
|
||||
- backend signs join-token scope material, node approval/bootstrap material,
|
||||
and node-scoped synthetic mesh config snapshots
|
||||
- node-agent verifies signed Control Plane synthetic config when
|
||||
`authority_required=true` or signature fields are present
|
||||
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
|
||||
`RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
|
||||
trust anchor after approval
|
||||
- web-admin shows cluster key fingerprints on summaries, join-token output,
|
||||
approval rows, and synthetic config visibility
|
||||
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
|
||||
bootstrap, cluster creation, signed join token, real node-agent enrollment,
|
||||
owner approval, automatic signed bootstrap polling, authority pin
|
||||
persistence, heartbeat, and signed synthetic config verification all passed
|
||||
- `rap-node-agent` desired-workload polling/status reporting is gated by
|
||||
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
|
||||
supervision remains a stub
|
||||
|
||||
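The sign-then-verify-against-a-pin flow described above can be sketched with Go's standard `crypto/ed25519` package. The sketch assumes the fingerprint is a hex-encoded SHA-256 of the public key; the actual format behind `RAP_CLUSTER_AUTHORITY_FINGERPRINT` is not stated in this document, and all function names here are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// NewAuthority generates a fresh Ed25519 key pair for a cluster.
func NewAuthority() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// Fingerprint is an assumed format: hex-encoded SHA-256 of the public key.
func Fingerprint(pub ed25519.PublicKey) string {
	sum := sha256.Sum256(pub)
	return hex.EncodeToString(sum[:])
}

// SignConfig signs the exact bytes of a config snapshot.
func SignConfig(priv ed25519.PrivateKey, payload []byte) []byte {
	return ed25519.Sign(priv, payload)
}

// VerifyPinned checks the key against the pinned fingerprint first, then the signature.
func VerifyPinned(pinnedFingerprint string, pub ed25519.PublicKey, payload, sig []byte) error {
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("bad public key size")
	}
	if Fingerprint(pub) != pinnedFingerprint {
		return errors.New("cluster authority key does not match pinned fingerprint")
	}
	if !ed25519.Verify(pub, payload, sig) {
		return errors.New("signature verification failed")
	}
	return nil
}

func main() {
	pub, priv := NewAuthority()
	pin := Fingerprint(pub) // stored by the agent after approval in this sketch
	config := []byte(`{"config_version":"example"}`)
	sig := SignConfig(priv, config)

	fmt.Println("valid:", VerifyPinned(pin, pub, config, sig) == nil)
	fmt.Println("tampered:", VerifyPinned(pin, pub, []byte("tampered"), sig) == nil)
}
```

Checking the pin before the signature means a config signed by some other valid Ed25519 key is still refused.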
Node enrollment bootstrap polling is also complete:
|
||||
|
||||
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
|
||||
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
|
||||
before receiving status/bootstrap material
|
||||
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
|
||||
the signed bootstrap contract, then persists `node_id`, `identity_status`,
|
||||
and cluster authority pin into `identity.json`
|
||||
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
|
||||
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
|
||||
|
||||
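An interval-plus-timeout polling loop of the kind these two settings control can be sketched in Go as follows. `PollUntilApproved`, `ErrPending`, and the simulated backend in `demoPoll` are hypothetical; the real agent's request and verification steps are not shown.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrPending signals that the enrollment request has not been approved yet.
var ErrPending = errors.New("enrollment still pending")

// PollUntilApproved calls check immediately and then once per interval until
// it succeeds, returns a non-pending error, or the timeout elapses.
func PollUntilApproved(ctx context.Context, interval, timeout time.Duration, check func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := check(ctx)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrPending) {
			return err // hard failure: stop polling
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll simulates a backend that approves on the approveOn-th poll.
func demoPoll(approveOn int, timeoutMillis int) (int, error) {
	attempts := 0
	err := PollUntilApproved(context.Background(), 5*time.Millisecond,
		time.Duration(timeoutMillis)*time.Millisecond,
		func(context.Context) error {
			attempts++
			if attempts >= approveOn {
				return nil
			}
			return ErrPending
		})
	return attempts, err
}

func main() {
	attempts, err := demoPoll(3, 1000)
	fmt.Println(attempts, err)

	_, err = demoPoll(1000, 30) // never approved within the timeout
	fmt.Println(err)
}
```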
Current state:
|
||||
|
||||
- C17Z12 added rendezvous/relay control-plane leases for peers that would
|
||||
otherwise stay in `waiting_rendezvous`.
|
||||
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
|
||||
for renewal/stale relay recovery.
|
||||
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
|
||||
relay-pool scoring.
|
||||
- C17Z16 added Control Plane `route_path_decisions`.
|
||||
- C17Z17 added node-side route generation apply/withdraw tracking.
|
||||
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
|
||||
route-health route config only. The synthetic `fabric.route_health` runtime
|
||||
now probes the selected effective path, including replacement relay paths,
|
||||
and reports expected/observed hops plus drift state.
|
||||
- C17Z19 consumes those synthetic route-health observations in backend relay
|
||||
scoring. Drift/unreachable/failure feedback marks the exact selected relay
|
||||
stale and can trigger replacement; healthy low-latency route-health boosts
|
||||
alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
|
||||
service class, and web-admin marks relay policy `rh feedback`.
|
||||
- C17Z20 closes the node-side feedback loop. After node-agent reports
|
||||
synthetic route-health drift/unreachable/failure, it performs a bounded
|
||||
node-scoped synthetic-config refresh, applies returned replacement route
|
||||
decisions to route-health config immediately, and reports
|
||||
`c17z20.mesh_route_health_feedback_refresh_report.v1`.
|
||||
- Backend `mesh_latest_links` now keeps latest observations per observation
|
||||
type/route, so `synthetic_route_health` is not overwritten by
  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
  `000021_cluster_authority_keys`; the migration now drops/recreates
  `cluster_admin_summaries` in up/down paths.

Runtime report:

- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- Docker-test smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:5174/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:

- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Next narrow layer, if approved:

C17Z21 should tighten route-health feedback refresh dampening: if an immediate
feedback refresh returns the same config version or no replacement change, keep
a per-route/relay no-change cooldown before retrying. Keep the boundary
synthetic/control-plane only and keep RDP/VPN/service payload forwarding
untouched.
@@ -0,0 +1,81 @@
# Target project structure for the next phase

This is the desired direction, not necessarily the current exact repo state.

## Root
- `backend/`
- `workers/rdp-worker/`
- `clients/windows/`
- `clients/linux/`
- `web-admin/`
- `scripts/`
- `docs/`
- `deploy/`
- `CODEX_CONTEXT.md`

## Backend suggested evolution
- `internal/platform/`
  - config
  - runtime
  - logging
  - postgres
  - redis
  - module
  - authn middleware
  - authz middleware
- `internal/modules/`
  - auth
  - organization
  - membership
  - identitysource
  - group
  - resource
  - sessionbroker
  - sessiongateway
  - worker
  - node
  - nodeagent
  - connector
  - audit
  - policy
- `pkg/contracts/`
  - session
  - worker
  - node
  - connector

## New modules to add in next phase
- `organization`
- `membership`
- `identitysource`
- `node`
- `nodeagent`
- `policy` (if policy logic is currently too scattered)

## DB evolution direction
New tables/entities should include:
- organizations
- organization_memberships
- organization_roles
- identity_sources
- identity_mappings
- groups
- group_memberships / external_group_bindings
- nodes
- node_services
- node_capabilities
- node_update_policies
- node_partition_states
- connectors
- connector_bindings
- organization_feature_scopes

Keep existing proven session tables intact unless migration is very deliberate.

## Worker
Keep worker independent.
Do not move node-agent responsibilities into the RDP worker.
The worker is one service workload. The node-agent is the supervisor/orchestrator on the node.

## Clients
Do not start final client implementation before the new platform-core backend model is established.