Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,261 @@
# Architecture Guardrails
Status: architecture guardrails, documentation only.
This file exists so architecture documents have a stable guardrails reference
inside `docs/architecture`. The operational Codex guardrails remain in
`docs/codex/ARCHITECTURE_GUARDRAILS.md`.
## 1. Preserve the Proven RDP Baseline
The following are already proven and must remain stable:
- live FreeRDP connect
- active session state
- terminate
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- direct worker WSS data plane
- backend gateway fallback
- C++ RDP Adapter as the active RDP runtime
Architecture clarification must not silently weaken this behavior.
## 2. Source of Truth
PostgreSQL is the only durable source of truth for domain state.
Redis is live coordination only. It may hold leases, heartbeats, routing hints,
attach tokens, short-lived tokens, and ephemeral cache. It must not become a
durable source of truth for sessions, organization data, policies, cluster
trust, peer topology, configuration, route authority, or node identity.
## 3. Fabric Core Before Mesh Runtime
RAP Fabric Core is the lower distributed runtime foundation above the host OS.
Fabric Core owns:
- native `rap-node-agent` identity
- enrollment
- local node state
- capability reporting
- role assignment consumption
- signed scoped configuration snapshots
- update trust
- service supervision boundary
Mesh runtime traffic must not be implemented before node identity, enrollment,
role assignment, scoped config distribution, and node-local state are
trustworthy.
## 4. Node Identity and Service Workloads
A node is a host-level identity managed by native `rap-node-agent`.
Service workloads are separate from node identity. They may be containerized or
native, but containers are packaging/isolation boundaries only.
Capabilities are not permissions. Role assignment must be explicit per cluster
and, when needed, per organization.
## 5. Routing Ownership
Routing is owned by the Fabric layer, not individual Service Adapters.
RDP, VNC, SSH, VPN, video, and file services may request a destination node,
resource target, egress node, or egress pool. The Fabric Routing Engine chooses
the path.
Routing decisions must not depend on live backend availability. They use
node-local state, signed scoped snapshots, peer cache, route cache, and policy.
Service Adapters must not discover peers or mesh topology, select routes
(including multi-hop routes), manage mesh connections, create shortcuts,
implement mesh failover or partition recovery, or define cross-cluster routing
policy.
## 6. Need-to-Know Configuration
Nodes should be small, fast, and scoped.
A node receives only the configuration required for its cluster membership,
assigned role, service workload, and organization scope. It must not store full
cluster topology, unrelated organization data, unrelated storage shards, peer
caches outside its scope, or secrets it does not need.
Secrets must be delivered only through approved resolvers and only at runtime
when needed.
## 7. Fabric Storage Boundaries
Fabric Storage / Config Storage is a future distribution and cache layer, not a
new source of truth.
Storage service must not:
- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store full cluster or organization data on every node
- expose arbitrary query capabilities
- bypass organization and cluster isolation
## 8. Multi-Tenancy Isolation
Every organization must be isolated by design.
Namespace and authorize:
- resources
- users-in-organization
- groups
- policies
- connectors
- sessions
- service endpoints
- audit
- secret references
- storage/cache scopes
- Redis keys where applicable
Organizations must not see intermediate mesh topology, other organizations'
routes, peer caches, nodes, storage shards, secrets, or platform trust
internals.
## 9. Multi-Cluster Boundaries
A platform may manage multiple clusters, but clusters do not automatically
trust each other and do not form one shared mesh by default.
Cross-cluster routing requires explicit trust and policy.
Cluster-scoped identities, certificates, tokens, storage namespaces, and
policies are required. A node may participate in multiple clusters only through
isolated memberships.
## 10. Split-Brain Prevention
Never allow minority partitions to become a second authoritative cluster
automatically.
Cluster-wide changes, role changes, trust changes, node approvals, policy
mutation, partition promotion, and cross-cluster trust must be restricted in
non-quorum or degraded states.
## 11. Control Plane vs Data Plane
Control plane owns durable state and policy:
- organizations
- users
- memberships
- roles
- resources
- policies
- nodes
- cluster membership
- service assignments
- connector/VPN desired state
- updates
- config distribution
- audit
Data plane carries authorized traffic:
- session streams
- worker traffic
- relay traffic
- connector traffic
- future VPN/IP tunnel traffic
Do not collapse control plane and data plane into one vague layer.
## 12. Updates and Trust
Updates must support:
- Version Storage / Update Repository as the signed artifact source
- explicit Control Plane rollout policy and approval
- signed artifacts only; unsigned binaries are never installed
- staged rollout
- canary rollout
- rollback
- health checks
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
Version Storage stores immutable release manifests, artifacts, hashes,
signatures, compatibility metadata, provenance, and approved migration bundles.
It must not become a second source of truth for rollout policy, approvals,
organization state, cluster state, or audit.
The native node-agent owns local update trust, health supervision, restart, and
recovery logic. It may update, restart, or rollback assigned local workloads
only according to signed manifests and Control Plane policy. Node-agent
self-update requires stricter staged replacement and crash-safe rollback than
ordinary workload updates.
PostgreSQL schema migrations are orchestrated by the Control Plane release
process. Node-agent must not independently invent or execute durable
PostgreSQL schema migrations. Service-local, node-local, cache, or protocol
schema migrations require signed manifest metadata, preflight checks,
rollback/fencing behavior, and explicit compatibility rules.
## 13. Performance and Routing Awareness
Placement and routing decisions must consider:
- CPU
- RAM
- network load
- active sessions
- connector load
- relay load
- service type
- health score
- latency
- packet loss
- bandwidth availability
- policy constraints
Interactive input/control traffic must not wait behind render/video, file
transfer, telemetry, or VPN bulk traffic.
## 14. No Runtime Expansion From Documentation
Architecture documentation does not authorize runtime implementation.
Do not start the following without an explicit staged prompt:
- RDP runtime changes
- Windows client behavior changes
- data-plane behavior changes
- backend session lifecycle changes
- mesh runtime traffic
- VPN/IP tunnel runtime
- relay packet routing
- QUIC/WebRTC
- service workload execution
- new protocol adapters
## Result / Decision
These guardrails formalize the Secure Access Fabric lower foundation:
PostgreSQL remains authoritative, Redis remains live-only, Fabric Core comes
before mesh runtime, Fabric routing must not depend on live backend
availability, service adapters do not own routing, nodes receive only
need-to-know scoped configuration, Fabric Storage/Config Storage is not a
general-purpose distributed database, and organizations must not see internal
mesh topology. No code, API, migration, RDP, data-plane, mesh, VPN, relay, or
service workload runtime behavior is changed by this document. Version
Storage/Update Repository is a future signed artifact and release distribution
foundation; it is not an updater runtime until a later explicit staged prompt
authorizes it.
@@ -0,0 +1,93 @@
# Direct Worker WSS TLS / PKI
Status: P3.4 trust-model design/prep complete.
This document defines the production trust model for direct worker WSS. It does
not implement mesh, relay nodes, VPN, QUIC, WebRTC, or a new RDP runtime.
Detailed P3.4 production certificate lifecycle, worker identity binding, client
trust, rotation, revocation, and future smoke plan are defined in
`docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`.
## Goal
Direct worker WSS is the preferred RDP realtime data-plane path. In production,
the Access Client must only use direct worker WSS when both conditions are true:
- the backend advertises the candidate as production trusted
- normal TLS certificate validation succeeds
The backend gateway remains the safe fallback/debug path.
## Trust Modes
Backend config `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` supports:
- `smoke_insecure`: development/smoke only; candidate metadata is
`smoke_only=true` and `production_trusted=false`
- `public_ca`: worker certificate chains to an OS/publicly trusted CA;
candidate metadata is `production_trusted=true`
- `platform_ca`: worker certificate chains to a platform-managed CA;
candidate metadata is `production_trusted=true`
Optional `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF` labels the platform CA or trust
bundle version in candidate metadata, for example `rap-platform-ca:v1`.
## Backend Enforcement
In production (`APP_ENV=production` or `APP_ENV=prod`):
- backend must not advertise `direct_worker_wss` candidates when
`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
- backend still advertises `backend_gateway` fallback when configured
- direct candidates include trust metadata only when they are data-capable
Candidate metadata:
```json
{
  "runtime_transport": "json_v1",
  "traffic_ready": true,
  "tls_trust_mode": "platform_ca",
  "production_trusted": true,
  "smoke_only": false,
  "tls_ca_ref": "rap-platform-ca:v1"
}
```
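The enforcement rule above can be sketched as a small function. This is an illustration only, not the backend's actual code: `build_direct_candidate_metadata` and its signature are hypothetical, and only the trust-related fields from the JSON above are modeled (`runtime_transport` and `traffic_ready` are omitted).

```python
from typing import Optional

PRODUCTION_ENVS = {"production", "prod"}
TRUSTED_MODES = {"public_ca", "platform_ca"}


def build_direct_candidate_metadata(
    app_env: str, trust_mode: str, ca_ref: Optional[str] = None
) -> Optional[dict]:
    """Return trust metadata for a direct_worker_wss candidate, or None
    when the candidate must not be advertised (hypothetical helper)."""
    if trust_mode not in TRUSTED_MODES | {"smoke_insecure"}:
        raise ValueError(f"unknown trust mode: {trust_mode}")
    smoke_only = trust_mode == "smoke_insecure"
    if smoke_only and app_env in PRODUCTION_ENVS:
        # Production never advertises smoke-insecure direct candidates.
        # The backend_gateway fallback is advertised separately.
        return None
    metadata = {
        "tls_trust_mode": trust_mode,
        "production_trusted": not smoke_only,
        "smoke_only": smoke_only,
    }
    if ca_ref:
        # Optional label taken from DATA_PLANE_DIRECT_WORKER_TLS_CA_REF.
        metadata["tls_ca_ref"] = ca_ref
    return metadata
```

Returning `None` rather than a downgraded candidate keeps the decision binary: a direct candidate is either advertised with explicit trust metadata or not advertised at all.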
## Windows Client Enforcement
Client config `environment=production` or `environment=prod` means:
- smoke-only direct candidates are skipped
- candidates without production trust metadata are skipped
- `allow_insecure_direct_data_plane_tls_for_smoke` is ignored for direct worker
WSS
- the client falls back to backend gateway instead of weakening TLS
In development/smoke:
- `allow_insecure_direct_data_plane_tls_for_smoke=true` may bypass certificate
validation only for smoke-only direct candidates
- this bypass must not be used as a production trust mechanism
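The client rules above can be sketched as a selection function. This is a simplified illustration, not the Windows client's implementation: the `kind` and `metadata` keys on each candidate are assumed shapes, and `select_data_plane_candidate` is a hypothetical name. It returns the chosen candidate and whether TLS certificate validation stays enabled.

```python
def select_data_plane_candidate(candidates, environment, allow_insecure_smoke=False):
    """Pick a candidate; returns (candidate, verify_tls). Hypothetical sketch."""
    production = environment in ("production", "prod")
    for candidate in candidates:
        if candidate.get("kind") != "direct_worker_wss":
            continue
        meta = candidate.get("metadata", {})
        if production:
            # Skip smoke-only candidates and candidates without production
            # trust metadata; the insecure-smoke flag is ignored entirely.
            if meta.get("smoke_only") or not meta.get("production_trusted"):
                continue
            return candidate, True
        # Development/smoke: validation may be bypassed only for
        # smoke-only candidates, and only when the flag is set.
        bypass = allow_insecure_smoke and bool(meta.get("smoke_only"))
        return candidate, not bypass
    for candidate in candidates:
        if candidate.get("kind") == "backend_gateway":
            # Fall back to the gateway instead of weakening TLS.
            return candidate, True
    return None, True
```

The sketch does not model a failed TLS handshake on a chosen direct candidate; a real client would also fall back to the gateway in that case.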
## Worker Requirements
The worker direct WSS endpoint already requires:
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`
- `RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE` or
`RDP_WORKER_DATA_PLANE_PUBLIC_KEY_PEM`
Production workers should use certificates issued for their advertised direct
WSS hostname/IP subject alternative names. Platform-managed deployments should
prefer a dedicated platform CA and rotation workflow.
## Remaining Work
- implement app-local platform CA trust bundle handling in Windows clients
- automate worker certificate issuance/rotation
- rotate backend data-plane signing keys
- add live test-stand proof with `platform_ca` production-trusted direct WSS
- later integrate node-agent certificate enrollment
@@ -0,0 +1,465 @@
# Fabric Core Configuration Distribution
Status: Stage C10 result. Documentation and architecture only.
This document consolidates the Fabric Core configuration distribution model for
the Secure Access Fabric platform. It does not implement mesh runtime traffic,
VPN/IP tunnel runtime, relay packet routing, RDP work, service workload
execution, API changes, migrations, or code changes.
## 1. Purpose
Stage C10 defines the boundaries that must exist before the project safely
moves into signed snapshots, node-local storage, config/storage services, peer
directories, routing skeletons, secure node channels, mesh routing, or VPN/IP
tunnel runtime.
The goal is to prevent the lower fabric from growing into an accidental
distributed database, accidental full-mesh topology store, or service-specific
RDP/VPN routing layer.
## 2. Layer Model
The platform layer order remains:
1. Host OS
2. RAP Fabric Core
3. Secure Fabric Network
4. Service Runtime / Service Adapters
5. Access Clients / Admin UI
Fabric Core is the lower distributed runtime foundation above the host OS. It
is not a real operating system. It is implemented through native
`rap-node-agent`, control-plane contracts, scoped signed snapshots, node-local
state, role assignment consumption, update trust, and service supervision
boundaries.
RDP, VNC, SSH, VPN, video, file transfer, and internal-app access are services
above Fabric Core. They consume Fabric Core identity, placement, routing, and
policy; they do not define peer discovery, route selection, cluster authority,
or durable configuration ownership.
## 3. Source of Truth and Cache Boundaries
PostgreSQL remains the only durable source of truth for domain state:
- platform configuration
- clusters
- organizations
- users and memberships
- node identities and enrollment state
- node role assignments
- policies
- resources
- service desired state
- audit
- trust roots and revocation state
Redis remains live coordination only:
- leases
- heartbeats
- ephemeral routing hints
- short-lived tokens
- transient queues
- runtime cache
Redis must not store durable topology, durable configuration, node identity,
policy, organization data, cluster trust, or authoritative route state.
Fabric Storage / Config Storage is a distribution and cache layer. It must not:
- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store every cluster or organization object on every node
- expose arbitrary query capabilities
- bypass organization, cluster, role, or service isolation
Node-local state is runtime state plus signed scoped snapshots. It supports
fast operation and degraded reconnect. It is not a source of truth.
## 4. Configuration Layers
Configuration is separated into layers so nodes receive only what their role
requires.
Global platform configuration:
- platform trust roots
- supported protocol versions
- update trust policy
- platform-wide feature gates
- high-risk admin policy
Cluster configuration:
- cluster identity
- cluster trust roots and certificate policy
- cluster authority/partition state
- node role assignments
- QoS policy
- peer discovery policy
- route policy
- storage/config replication policy
Organization configuration:
- organization identity and status
- organization service enablement
- tenant-visible ingress/egress/service endpoints
- tenant policy references
- organization-specific resource references
- safe status projections
Service configuration:
- assigned service workload configuration
- service-specific policy subset
- resource references needed by the assigned workload
- connector or `vpn_connection` references where authorized
- runtime secret references, resolved only through approved secret resolvers
## 5. Scoped Distribution Principle
Nodes receive configuration on a need-to-know basis.
Core mesh node receives:
- scoped peer/neighbor data
- route policy
- QoS policy
- cluster version and trust metadata
- no RDP credentials
- no full organization user list
- no unrelated service configuration
Ingress node receives:
- allowed client entry policies
- token validation configuration
- entry route hints
- service endpoint mapping allowed for the ingress scope
- no full internal topology
- no unrelated organization data
Egress/service node receives:
- assigned service configs
- needed resource references
- needed connector or `vpn_connection` references
- policy for assigned services
- secrets only through approved resolver and only at runtime
Storage/config node receives:
- assigned shard/scope metadata
- replication metadata
- signed snapshot content for its assigned scope
- no unrelated organization data
- no unrestricted topology query access
Thin/mobile node receives:
- minimal bootstrap peers
- active session/tunnel policy subset
- local trust data required to reconnect
- no broad cluster topology
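A minimal sketch of need-to-know filtering, assuming the control plane holds configuration as named sections and each role has an allowlist. The section names, role keys, and `compile_scoped_view` are hypothetical; they only mirror the lists above.

```python
# Hypothetical allowlists derived from the per-role lists above.
ROLE_SCOPES = {
    "core_mesh": {"peer_subset", "route_policy", "qos_policy", "cluster_trust"},
    "ingress": {"client_entry_policy", "token_validation",
                "entry_route_hints", "service_endpoint_map"},
    "egress_service": {"service_configs", "resource_refs",
                       "connector_refs", "service_policy"},
    "storage_config": {"shard_metadata", "replication_metadata",
                       "scoped_snapshots"},
    "thin_mobile": {"bootstrap_peers", "session_policy_subset", "local_trust"},
}


def compile_scoped_view(role: str, full_config: dict) -> dict:
    """Keep only the sections the role is allowed to receive.

    Allowlisting (rather than blocklisting) means a new section is
    withheld from every role until it is explicitly granted.
    """
    allowed = ROLE_SCOPES[role]
    return {name: value for name, value in full_config.items() if name in allowed}
```

For example, a `core_mesh` view compiled from a configuration that also contains RDP credentials or an organization user list would simply not include those sections.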
## 6. Signed Scoped Cluster Snapshot Boundary
C10 defines snapshot boundaries only. C11 will define the full signed scoped
cluster snapshot model.
A scoped snapshot is a signed, versioned, role-limited configuration package
that a node-agent can store locally.
Snapshot properties:
- cluster-scoped
- role-scoped
- organization-scoped where applicable
- versioned
- signed by an authorized control-plane signing key
- bounded in size
- expires or requires refresh according to policy
- reconstructable from PostgreSQL source-of-truth state
Snapshot contents may include:
- cluster id and version
- node membership scope
- assigned roles
- allowed service workload refs
- peer directory subset
- route policy subset
- QoS policy subset
- trust roots and revocation metadata
- storage/config endpoints for refresh
- degraded-mode permissions
Snapshot contents must not include:
- unrelated organization data
- broad user lists
- raw secrets
- RDP/VNC/SSH credentials
- full cluster topology unless node role requires it
- arbitrary query permissions
## 7. Node-Local State Boundary
`rap-node-agent` local state may contain:
- node identity material and certificate metadata
- cluster membership state
- signed scoped cluster snapshot
- peer cache
- route cache
- service assignment cache
- service health/status cache
- local health state
- partition/degraded state
- last applied config version
- pending update metadata
- bounded telemetry buffer
Node-local state must not contain:
- full cluster topology unless explicitly required by role
- full organization data
- unrelated organization secrets
- durable policy authority
- durable route authority
- durable audit authority
- unrelated storage shards
Node-agent must be able to operate from local state for short degraded periods
when policy allows it, but it must not authorize high-risk mutations while
isolated.
## 8. Peer Directory and Cache Boundary
Peer directory data is distributed as scoped configuration, not queried from
PostgreSQL on every routing decision.
Peer directory entry fields:
- `node_id`
- `cluster_id`
- endpoint candidates
- roles/capabilities
- region/location hints
- trust/certificate fingerprint
- policy scope
- config version
Node-local peer cache may add runtime observations:
- `last_success_at`
- `last_latency_ms`
- packet loss
- reliability score
- recent failure history
- observed load hints where allowed
- last seen config version
Peer selection is score-based, not latency-only. Inputs include:
- latency
- packet loss
- reliability
- region distance
- node load
- bandwidth availability
- role suitability
- policy constraints
- trust level
- recent failure history
The Fabric Routing Engine owns route selection. Service Adapters must not
discover peers, select mesh routes, create shortcuts, or implement partition
recovery.
## 9. Fabric Storage / Config Storage Role
Fabric Storage / Config Storage is a logical future service. It is a scoped
distribution layer for configuration and signed snapshots.
Responsibilities:
- distribute signed scoped snapshots
- distribute peer directories
- cache hot configuration near service nodes
- replicate critical scoped data across failure domains
- provide nearby read access for node-agent refresh
- support cluster/org/service scope boundaries
- support version-based sync and incremental update delivery
Non-goals:
- no replacement of PostgreSQL
- no arbitrary distributed database behavior
- no direct node writes as authoritative state
- no broad ad hoc query API
- no full topology exposure to tenants
- no full organization data on every node
Placement rules:
- hot data may be placed near services that use it
- cold data may remain remote
- critical data should replicate across failure domains
- replication factor is policy-driven
- storage scope must respect cluster, organization, and service boundaries
## 10. Distribution Flow
Normal flow:
1. Control plane reads authoritative state from PostgreSQL.
2. Control plane compiles scoped configuration views.
3. Control plane signs full scoped snapshots or incremental updates.
4. Fabric Storage / Config Storage distributes and caches scoped artifacts.
5. Node-agent fetches snapshots/updates from authorized endpoints.
6. Node-agent verifies signatures, version, scope, expiry, and trust roots.
7. Node-agent applies configuration into local state.
8. Runtime components consume local state, not live backend calls, for realtime
route decisions.
Realtime routing decisions must not depend on live backend availability. They
should use verified local state, peer cache, route cache, and policy.
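Step 6 of the flow can be sketched as an ordered series of checks. This is an assumption-laden illustration: the envelope layout, field names, and return strings are hypothetical, and HMAC-SHA256 is used only as a stand-in so the example runs with the standard library. A real deployment would presumably use asymmetric signatures so that nodes can verify snapshots without being able to forge them.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload: dict) -> bytes:
    # Stable serialization so signer and verifier hash identical bytes.
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()


def sign(payload: dict, key: bytes) -> str:
    # Stand-in signature (HMAC); not the platform's real signing scheme.
    return hmac.new(key, canonical_bytes(payload), hashlib.sha256).hexdigest()


def verify_snapshot(envelope, trusted_keys, node_cluster_id, node_scopes,
                    last_version, now):
    """Return "apply" or a rejection reason for a snapshot envelope."""
    payload = envelope["payload"]
    key = trusted_keys.get(envelope["key_id"])
    if key is None:
        return "reject: unknown signer"
    if not hmac.compare_digest(sign(payload, key), envelope["signature"]):
        return "reject: bad signature"
    if payload["cluster_id"] != node_cluster_id or payload["scope"] not in node_scopes:
        return "reject: out of scope"
    if payload["expires_at"] <= now:
        return "reject: expired"
    if payload["version"] <= last_version:
        return "reject: stale version"
    return "apply"
```

The signature is checked before any payload field is trusted, so scope, expiry, and version are only read from content that the signer actually produced.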
## 11. Versioning and Consistency Rules
Every snapshot and incremental update must carry:
- `cluster_id`
- scope identifiers
- monotonic config version or equivalent epoch
- issued-at timestamp
- expiry or refresh deadline
- signer id / key id
- signature
- dependency/base version for increments
Rules:
- full snapshot can establish or repair local state
- incremental update applies only to the expected base version
- version gaps require full resync
- signature mismatch rejects the update and triggers recovery
- rollback to older config is forbidden unless explicitly authorized by a
signed recovery policy
- node must report last applied config version in heartbeat/status
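The rules above can be read as a small decision function. The sketch below is one possible reading, not a specified algorithm: the dictionary keys and result strings are hypothetical, `signature_ok` stands for the outcome of a prior signature check, and it assumes that an authorized rollback is delivered as a full snapshot.

```python
def classify_update(local_version: int, update: dict) -> str:
    """Decide what a node-agent does with a snapshot or increment."""
    if not update["signature_ok"]:
        # Signature mismatch rejects the update and triggers recovery.
        return "reject_and_recover"
    version = update["version"]
    if version == local_version:
        return "already_applied"
    if version < local_version:
        # Rollback is forbidden unless a signed recovery policy allows it
        # (assumed here to arrive only as a full snapshot).
        if update["kind"] == "full" and update.get("recovery_authorized", False):
            return "apply_full"
        return "reject_rollback"
    if update["kind"] == "full":
        # A full snapshot can establish or repair local state.
        return "apply_full"
    if update["base_version"] != local_version:
        # Version gap: the increment does not build on what we hold.
        return "request_full_resync"
    return "apply_incremental"
```

For instance, a node at version 5 that receives an increment based on version 7 asks for a full resync instead of guessing at the missing changes.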
## 12. Degraded Mode Rules
Degraded operation is allowed only when policy permits it.
Allowed examples:
- keep already-running safe services alive
- continue existing authorized routes for a short TTL
- reconnect to known active/warm/bootstrap peers
- use last signed snapshot to find config/storage endpoints
- report degraded status when connectivity returns
Forbidden while degraded:
- approve join requests
- issue node certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partition authority automatically
- access secrets not already authorized for the node's current role
Degraded mode must be time-bounded and observable.
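A minimal sketch of a degraded-mode guard, assuming actions are identified by short strings. The action names, the 300-second bound, and `degraded_guard_allows` are hypothetical; the guard only models the degraded-mode restriction and does not replace normal authorization.

```python
# Hypothetical action names mirroring the forbidden list above.
FORBIDDEN_WHILE_DEGRADED = frozenset({
    "approve_join_request",
    "issue_node_certificate",
    "assign_role",
    "change_cluster_policy",
    "change_organization_policy",
    "rotate_trust_roots",
    "promote_partition_authority",
    "access_new_secret",
})


def degraded_guard_allows(action: str, degraded: bool,
                          degraded_for_s: float = 0.0,
                          max_degraded_s: float = 300.0) -> bool:
    """Return False when degraded-mode rules block the action."""
    if not degraded:
        # Not degraded: this guard has no objection; other checks still apply.
        return True
    if degraded_for_s > max_degraded_s:
        # Time bound exceeded: stop acting on stale local state.
        return False
    return action not in FORBIDDEN_WHILE_DEGRADED
```

The time bound is a policy value in practice; it is a plain parameter here so the behavior is easy to test.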
## 13. Multi-Cluster Isolation
Clusters are isolated by default.
Rules:
- clusters do not automatically trust each other
- clusters do not form one shared mesh by default
- cross-cluster routing requires explicit trust and policy
- platform owner may manage multiple clusters from one console
- organization admins see only authorized clusters/resources
- node may participate in multiple clusters only through isolated memberships
- cluster-scoped identities, certificates, tokens, storage namespaces, and
policies are required
A multi-cluster node must keep separate local state per cluster:
- separate identity/certificates
- separate snapshots
- separate peer cache
- separate route cache
- separate service assignment cache
- separate storage namespace
## 14. Security Boundaries
Security requirements:
- snapshots are signed
- transport for snapshot/update distribution is authenticated and encrypted
- node-agent verifies signature, scope, expiry, signer, and trust root
- secrets are never embedded directly in broad snapshots
- secrets are resolved through approved resolvers only at runtime
- high-risk admin actions require step-up authentication
- all cluster trust and role changes are audited
High-risk actions include:
- node approval
- role assignment
- cluster trust changes
- cross-cluster trust
- partition promotion
- secrets access
- update policy changes
- signing key rotation
## 15. C11-C18 Staging Boundary
C10 is a design consolidation stage. It prepares later stages:
- C11: signed scoped cluster snapshot model
- C12: node local state store
- C13: config/storage service foundation
- C14: peer directory and cache model
- C15: Fabric Routing Engine skeleton
- C16: secure node-to-node channel lifecycle
- C17: mesh routing runtime
- C18: VPN/IP tunnel service
C10 implements none of these. Later stages must be explicit, narrow, and
verified. Mesh routing and VPN/IP tunnel runtime must not start before C11-C16
foundations are accepted.
## 16. Result / Decision
Stage C10 consolidates the lower Fabric Core configuration distribution model.
Decisions:
- PostgreSQL remains the only durable source of truth.
- Redis remains live coordination only.
- Fabric Storage / Config Storage is a scoped distribution/cache layer, not a
second source of truth.
- Nodes receive only role/cluster/organization scoped configuration.
- Node-local state is bounded and non-authoritative.
- Signed scoped snapshots are the required foundation for node-local operation
and degraded recovery.
- Peer directory/cache data is local and scoped; routing remains Fabric-owned.
- Service Adapters remain protocol translators above Fabric Core.
- Multi-cluster membership requires isolated identities, snapshots, caches,
tokens, policies, and storage namespaces.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C10.
@@ -0,0 +1,398 @@
# Fabric Peer Directory and Cache Model
Status: Stage C14 result. Documentation and architecture only.
This document defines the Fabric peer directory and node-local peer cache model.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.
## 1. Purpose
The peer directory tells a node which peers it may know about and potentially
connect to. The node-local peer cache stores scoped peer data plus runtime
observations for fast recovery and score-based peer selection.
The model must avoid:
- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision
## 2. Peer Knowledge Classes
Each node maintains three peer classes:
- active peers
- warm candidate peers
- cold/bootstrap peers
Active peers:
- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set
Warm candidate peers:
- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers
Cold/bootstrap peers:
- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from signed snapshot, local cache, storage/config service, or
admin-defined seed nodes
Recommended active peer counts:
- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3
These are policy defaults, not hardcoded limits.
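A short sketch of how the bounded active set could be enforced, assuming peers arrive already scored. The role keys and `trim_active_peers` are hypothetical; the numeric ranges are the policy defaults listed above and are passed in as data rather than hardcoded in the logic.

```python
# (lower, upper) active peer counts per role, taken from the defaults above.
ACTIVE_PEER_DEFAULTS = {
    "normal": (3, 5),
    "relay_core": (8, 20),
    "thin_mobile": (1, 3),
}


def trim_active_peers(role, scored_peers, policy=ACTIVE_PEER_DEFAULTS):
    """Split (node_id, score) pairs into active and warm lists.

    The best-scoring peers up to the role's upper bound stay active;
    the remainder are kept as warm candidates.
    """
    _, upper = policy[role]
    ranked = sorted(scored_peers, key=lambda item: item[1], reverse=True)
    active = [node_id for node_id, _ in ranked[:upper]]
    warm = [node_id for node_id, _ in ranked[upper:]]
    return active, warm
```

Because the policy is an argument, a cluster can override the defaults without changing the selection code.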
## 3. Peer Directory Record
A signed peer directory entry may contain:
- `node_id`
- `cluster_id`
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- `last_seen_config_version`
The peer directory is scoped. Ordinary nodes must not receive a full cluster
peer directory unless their role explicitly requires it.
## 4. Endpoint Candidate Model
Endpoint candidates describe possible ways to reach a node.
Candidate fields:
- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp
Transport types may include future values such as:
- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved
This model is descriptive only. C14 does not implement new transports.
## 5. Node-Local Peer Cache
The node-local peer cache contains signed directory data plus runtime
observations.
Directory-derived fields:
- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version
Runtime observation fields:
- `last_success_at`
- `last_failure_at`
- `last_latency_ms`
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable
Runtime observations are hints. They are not durable authority.
## 6. Refresh Cadence
Recommended cadence:
- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or
policy-triggered refresh
Cadence may vary by role:
- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers
## 7. Peer Selection Scoring
Selection is score-based, not latency-only.
Hard checks first:
- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status
Soft score inputs:
- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference
No peer should be selected if it fails hard policy checks, even if latency is
excellent.
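The two-phase selection can be sketched as a filter followed by a weighted score. This is illustrative only: the `PeerCandidate` fields cover just a subset of the inputs listed above, and the weights and latency normalization are arbitrary assumptions, not values from this document.

```python
from dataclasses import dataclass


@dataclass
class PeerCandidate:
    node_id: str
    same_cluster: bool
    cert_valid: bool
    revoked: bool
    role_compatible: bool
    latency_ms: float
    packet_loss: float   # fraction, 0.0 to 1.0
    reliability: float   # fraction, 0.0 to 1.0
    load: float          # fraction, 0.0 to 1.0


def passes_hard_checks(peer: PeerCandidate) -> bool:
    # Hard checks are pass/fail and are never traded against score.
    return (peer.same_cluster and peer.cert_valid
            and not peer.revoked and peer.role_compatible)


def soft_score(peer: PeerCandidate) -> float:
    # Illustrative weights; higher is better.
    latency_term = 1.0 / (1.0 + peer.latency_ms / 100.0)
    return (0.4 * latency_term
            + 0.3 * peer.reliability
            + 0.2 * (1.0 - peer.packet_loss)
            + 0.1 * (1.0 - peer.load))


def select_peer(candidates):
    """Return the best-scoring peer that passes every hard check, or None."""
    eligible = [peer for peer in candidates if passes_hard_checks(peer)]
    return max(eligible, key=soft_score, default=None)
```

Filtering before scoring guarantees that a revoked or wrong-cluster peer cannot win on latency alone.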
## 8. Recovery Order
If active peers fail, recovery order is:
1. retry active peers with bounded backoff
2. promote warm candidates
3. try cold/bootstrap peers
4. query authorized storage/config discovery endpoint
5. use last signed snapshot for degraded reconnect if policy allows
6. reconnect to control plane when available
Recovery must not authorize cluster mutation or high-risk actions.
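The recovery order can be sketched as an ordered list of steps tried until one succeeds. The step names and `run_recovery` are hypothetical; each step is passed in as a callable so the ordering logic stays independent of how a step is performed. Bounded backoff inside the first step is not modeled.

```python
RECOVERY_ORDER = [
    "retry_active_peers",
    "promote_warm_candidates",
    "try_bootstrap_peers",
    "query_storage_config_discovery",
    "degraded_reconnect_from_snapshot",
    "reconnect_control_plane",
]


def run_recovery(steps, degraded_allowed):
    """Try steps in order; return (name of first success or None, attempted).

    steps maps each step name to a zero-argument callable that returns
    True on success. The degraded reconnect step runs only if policy
    allows it.
    """
    attempted = []
    for name in RECOVERY_ORDER:
        if name == "degraded_reconnect_from_snapshot" and not degraded_allowed:
            continue
        attempted.append(name)
        if steps[name]():
            return name, attempted
    return None, attempted
```

None of the steps mutates cluster state; they only restore connectivity, which matches the rule that recovery must not authorize high-risk actions.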
## 9. Channel-Aware Peer Preference
Peer choice depends on channel class.
Input/control:
- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic
Render/video:
- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth
File transfer:
- throughput and reliability
- lower priority than input/control
Clipboard/control:
- reliable bounded path
- low volume
Telemetry:
- low priority
- lossy/sampled allowed
VPN/IP tunnel future:
- adaptive QoS
- bulk traffic must not starve interactive sessions
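One way to illustrate "never behind bulk traffic" is a strict-priority queue keyed by channel class. The numeric priorities, channel keys, and `ChannelScheduler` are assumptions: the text above only fixes that input/control comes first and telemetry is low priority, so the ordering between the middle classes is a guess. Strict priority can starve low classes, so a real scheduler would likely add fairness or adaptive QoS.

```python
import heapq
import itertools

# Lower number is sent first. Ordering of the middle classes is assumed.
CHANNEL_PRIORITY = {
    "input_control": 0,
    "clipboard_control": 1,
    "render_video": 2,
    "file_transfer": 3,
    "vpn_bulk": 3,
    "telemetry": 4,
}


class ChannelScheduler:
    """Strict-priority frame queue; FIFO within a channel class."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves arrival order

    def enqueue(self, channel: str, frame) -> None:
        priority = CHANNEL_PRIORITY[channel]
        heapq.heappush(self._heap, (priority, next(self._seq), channel, frame))

    def dequeue(self):
        """Return (channel, frame) for the next frame, or None if empty."""
        if not self._heap:
            return None
        _, _, channel, frame = heapq.heappop(self._heap)
        return channel, frame
```

With this ordering, an input event enqueued after a large file-transfer backlog is still dequeued before any of the pending file frames.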
## 10. Full-Mesh Prevention
Nodes must not attempt to connect to every known node.
Limits:
- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology
Full topology access is reserved only for roles that require it, such as
platform control/admin views or selected core/route-analysis components.
## 11. Security Boundaries
Peer cache must enforce:
- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope
A compromised ordinary node should not learn full cluster topology.
Peer cache data must not include:
- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized
## 12. Multi-Cluster Peer Isolation
Multi-cluster node membership uses separate peer caches per cluster.
Per-cluster separation:
- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history
Cross-cluster peer discovery requires explicit trust and policy. Clusters do
not form a single mesh by default.
## 13. Storage / Snapshot Relationship
Peer directory data is distributed through signed snapshots or Fabric Storage /
Config Storage artifacts.
Rules:
- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current
directory version
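The rules above can be sketched as a small decision function. This is illustrative only and assumes things the document does not specify: integer directory versions, a `base_version` field on each update, and the simplest possible reading of "still valid", namely that an observation is kept only if it was recorded against the current directory version. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class DirectoryAction(Enum):
    REJECT = "reject"            # signature/hash mismatch
    IGNORE = "ignore"            # not newer than what is already applied
    APPLY = "apply"              # contiguous with the last applied version
    FULL_RESYNC = "full_resync"  # version gap detected


@dataclass(frozen=True)
class DirectoryUpdate:
    version: int
    base_version: int  # assumed field: version this update was built on
    signature_ok: bool
    hash_ok: bool


def decide(last_applied: int, update: DirectoryUpdate) -> DirectoryAction:
    """Classify an incoming peer directory update."""
    if not (update.signature_ok and update.hash_ok):
        return DirectoryAction.REJECT
    if update.version <= last_applied:
        return DirectoryAction.IGNORE
    if update.base_version != last_applied:
        return DirectoryAction.FULL_RESYNC
    return DirectoryAction.APPLY


def keep_observation(observed_at_version: int, current_version: int) -> bool:
    """Assumed rule: keep runtime observations only for the current version."""
    return observed_at_version == current_version
```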
## 14. Service Adapter Boundary
Service Adapters may request:
- destination node
- resource target
- egress node
- egress pool
- channel class
Service Adapters must not:
- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy
The Fabric Routing Engine owns those decisions.
## 15. Observability
Node-agent should report safe peer/cache metrics:
- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type
Reports must not expose full topology to organizations.
## 16. Future Validation Tests
Future implementation tests must prove:
- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to ordinary nodes
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection
## 17. C15 Preparation
C15 must define the Fabric Routing Engine skeleton boundary.
The routing engine will consume:
- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history
C15 must not carry production mesh traffic. It should define route request and
route result boundaries before runtime routing exists.
## 18. Result / Decision
Stage C14 defines scoped peer discovery and peer cache behavior.
Decisions:
- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery tries active, warm, bootstrap, storage/config discovery, then the
last signed snapshot, and reconnects to the control plane when available
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C14.
@@ -0,0 +1,518 @@
# Fabric Routing Engine Skeleton
Status: Stage C15 result. Documentation and architecture only.
This document defines the Fabric Routing Engine skeleton boundary. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.
## 1. Purpose
The Fabric Routing Engine is the logical Fabric layer responsible for choosing
authorized paths between ingress, core, egress, service, storage, and future
VPN/IP-tunnel components.
C15 defines the route decision boundary before runtime mesh routing exists.
The purpose is to ensure that future routing:
- is policy-aware
- is QoS-aware
- is channel-aware
- respects cluster and organization boundaries
- uses scoped local state and peer cache
- does not depend on live backend availability for realtime decisions
- is not implemented independently by Service Adapters
## 2. Non-Goals
C15 does not:
- carry production mesh traffic
- implement node-to-node transport
- implement relay forwarding
- implement VPN/IP tunnel packets
- implement QUIC/WebRTC
- implement route execution
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior
It defines contracts and responsibilities only.
## 3. Routing Engine Responsibilities
The Fabric Routing Engine owns:
- route request validation
- peer candidate filtering
- route scoring
- channel-aware path selection
- QoS class selection
- route cache lookup/update policy
- failover decision boundaries
- shortcut recommendation boundaries
- topology hiding
- policy and cluster-boundary enforcement
- service adapter routing integration boundary
The Routing Engine does not own:
- PostgreSQL source-of-truth mutation
- service protocol translation
- RDP/VNC/SSH/VPN implementation details
- raw packet forwarding
- direct secret resolution
- organization admin visibility
- node enrollment authority
## 4. Inputs
Routing decisions may consume:
- signed scoped cluster snapshot
- node-local peer cache
- route cache
- peer directory
- route policy
- QoS policy
- service assignment cache
- cluster membership
- organization scope
- service/resource scope
- channel class
- current health/degraded state
- partition/authority state
- failure history
- load and latency observations
Routing decisions must not require a live backend call in the realtime path.
## 5. Route Request Contract
A route request is a logical request for a path. It is not a packet.
Required fields:
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `source_node_id`
- `source_role`
- `destination_kind`
- `destination_ref`
- `service_type`
- `channel_class`
- `priority_class`
- `policy_refs`
- `requested_at`
Destination kinds:
- `node`
- `egress_pool`
- `service_instance`
- `resource_target`
- `vpn_connection`
- `storage_scope`
- `control_plane_endpoint`
Optional fields:
- `session_id`
- `attachment_id`
- `resource_id`
- `user_id`
- `device_id`
- `region_preference`
- `required_capabilities`
- `forbidden_nodes`
- `preferred_nodes`
- `max_latency_ms`
- `min_bandwidth_hint`
- `stickiness_key`
- `previous_route_id`
- `failure_context`
Service adapters may create route requests through an adapter-facing boundary,
but they must not select peers or paths themselves.
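As an illustration of the contract above, a route request could be modeled as an immutable record with a small validation step. This is a sketch under stated assumptions: the field types, the subset of optional fields shown, the `validate_request` helper, and the preferred/forbidden conflict check are not defined by C15.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})


@dataclass(frozen=True)
class RouteRequest:
    # Required fields from the contract; types are assumptions.
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: Tuple[str, ...]
    requested_at: float
    organization_id: Optional[str] = None  # "where applicable"
    # A subset of the optional fields.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    preferred_nodes: Tuple[str, ...] = ()
    forbidden_nodes: Tuple[str, ...] = ()


def validate_request(req: RouteRequest) -> List[str]:
    """Return a list of structural problems; an empty list means well-formed."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "channel_class",
                 "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind}")
    # Assumed sanity check, not stated in the contract.
    if set(req.preferred_nodes) & set(req.forbidden_nodes):
        errors.append("node listed as both preferred and forbidden")
    return errors
```

The record carries hints and references only. It contains no peer list or path, which keeps peer and path selection out of the adapter's hands.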
## 6. Route Result Contract
A route result is a signed or locally verifiable decision artifact that is
valid for a bounded time.
Required fields:
- `route_id`
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `route_class`
- `channel_class`
- `selected_path`
- `selected_qos_class`
- `score`
- `valid_from`
- `expires_at`
- `route_epoch`
- `policy_version`
- `decision_reason`
Selected path contains ordered logical hops:
- source node
- optional ingress node
- zero or more core/relay nodes
- optional egress/service node
- target/service endpoint
Optional fields:
- `fallback_paths`
- `shortcut_candidate`
- `stickiness_key`
- `drain_after`
- `degraded_mode`
- `constraints_applied`
- `rejection_reason`
Route results must be bounded by expiry, policy version, route epoch, and
cluster authority state.
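The bounding rule above suggests that a consumer re-checks a route result before every use. A minimal sketch follows, assuming numeric timestamps, integer epochs and versions, and a boolean `authority_allows` input that summarizes cluster authority state. The `is_usable` helper and all field types are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RouteResult:
    # Required fields from the contract; types are assumptions.
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: Tuple[str, ...]  # ordered logical hops
    selected_qos_class: str
    score: float
    valid_from: float
    expires_at: float
    route_epoch: int
    policy_version: int
    decision_reason: str
    organization_id: Optional[str] = None  # "where applicable"


def is_usable(
    result: RouteResult,
    *,
    now: float,
    current_epoch: int,
    current_policy_version: int,
    authority_allows: bool,
) -> bool:
    """A result is usable only while all four bounds still hold."""
    return (
        result.route_class != "unavailable"
        and result.valid_from <= now < result.expires_at
        and result.route_epoch == current_epoch
        and result.policy_version == current_policy_version
        and authority_allows
    )
```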
## 7. Channel Classes
Routing is channel-aware.
Initial channel classes:
- `control`
- `input`
- `render`
- `cursor`
- `clipboard`
- `file_transfer`
- `telemetry`
- `vpn_packet`
- `storage_fetch`
- `update_fetch`
Rules:
- `input` and critical `control` prefer lowest latency and lowest jitter.
- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
- `cursor` is latest-only and should use low-latency paths.
- `clipboard` is reliable and bounded.
- `file_transfer` prefers throughput but must not starve input/control/render.
- `telemetry` is low priority and may be sampled or dropped.
- `vpn_packet` uses adaptive QoS and bulk protection.
- `storage_fetch` and `update_fetch` should not consume interactive reserves.
## 8. Route Classes
Initial route classes:
- `direct`
- `single_relay`
- `multi_hop`
- `storage_local`
- `storage_remote`
- `vpn_chained`
- `degraded_existing`
- `unavailable`
`direct`:
- selected when source can safely reach destination directly
- trust and policy must allow it
`single_relay`:
- selected when one relay improves connectivity or policy requires relay
`multi_hop`:
- selected when direct/single relay is unavailable or policy/region requires it
`storage_local` / `storage_remote`:
- used for config/snapshot/artifact fetch decisions
`vpn_chained`:
- used when a managed service or IP tunnel depends on a logical
`vpn_connection`
`degraded_existing`:
- keeps an already-authorized existing path alive while policy permits
`unavailable`:
- explicit denial or no valid route
## 9. Hard Policy Checks
Hard checks run before scoring.
Reject route when:
- source node is not trusted
- source node is not a member of the cluster
- destination is outside cluster scope
- cross-cluster trust is missing
- organization scope does not match
- role assignment does not permit the route
- peer certificate is invalid or revoked
- required channel is not authorized
- partition/authority state forbids new route
- destination node is draining or disabled and policy forbids placement
- route would leak topology or tenant data
No score can override hard policy rejection.
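One way to guarantee that no score overrides a hard rejection is to filter candidates first and score only the survivors. The sketch below is illustrative: the `Candidate` shape, the three checks shown (a subset of the list above), and the rejection reason strings are assumptions.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    node_id: str
    trusted: bool
    in_cluster: bool
    cert_valid: bool
    score: float  # soft score, computed elsewhere


# (rejection reason, predicate that must hold) for a subset of the hard checks.
HARD_CHECKS = (
    ("source_not_trusted", lambda c: c.trusted),
    ("outside_cluster_scope", lambda c: c.in_cluster),
    ("certificate_invalid_or_revoked", lambda c: c.cert_valid),
)


def first_rejection(candidate: Candidate) -> Optional[str]:
    """Return the first failed hard check, or None if all pass."""
    for reason, passes in HARD_CHECKS:
        if not passes(candidate):
            return reason
    return None


def select(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Score is consulted only for candidates that passed every hard check."""
    eligible = [c for c in candidates if first_rejection(c) is None]
    return max(eligible, key=lambda c: c.score, default=None)
```

With this structure a revoked peer with an excellent score is never compared against valid peers at all, and an empty eligible set maps naturally to the `unavailable` route class.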
## 10. Scoring Inputs
Soft scoring inputs:
- latency
- jitter
- packet loss
- reliability
- recent failure history
- region distance
- load
- available bandwidth
- role suitability
- route length
- service co-location
- stickiness preference
- cost preference
- policy preference
- health score
Scoring weights are policy-driven and may differ by channel class.
Example:
- input/control heavily weight latency and jitter
- file transfer heavily weights throughput and reliability
- VPN bulk considers QoS impact on interactive routes
- storage fetch considers locality and replica freshness
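Policy-driven, per-channel weighting can be sketched as a weighted sum over normalized metrics. The weights below are invented for illustration and are not values from any route policy. Metrics are assumed to be pre-normalized to the range 0..1 with higher meaning better (so a low-latency path has a high `latency` value).

```python
from typing import Dict

# Hypothetical weights; real values would come from route/QoS policy.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "input": {
        "latency": 0.50, "jitter": 0.40, "throughput": 0.05, "reliability": 0.05,
    },
    "file_transfer": {
        "latency": 0.05, "jitter": 0.05, "throughput": 0.50, "reliability": 0.40,
    },
}


def score(channel_class: str, metrics: Dict[str, float]) -> float:
    """Weighted sum of normalized metrics; missing metrics count as 0."""
    weights = WEIGHTS[channel_class]
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items())


# Two hypothetical paths: one low-latency, one high-throughput.
low_latency_path = {
    "latency": 0.9, "jitter": 0.9, "throughput": 0.2, "reliability": 0.8,
}
high_throughput_path = {
    "latency": 0.3, "jitter": 0.4, "throughput": 0.95, "reliability": 0.9,
}
```

Under these assumed weights the same two paths rank in opposite order for `input` and `file_transfer`, which is the behavior the examples above describe.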
## 11. Route Cache Relationship
Route cache is local and bounded.
Cache key inputs:
- cluster id
- organization id
- source node
- destination kind/ref
- service type
- channel class
- policy version
- route epoch
- stickiness key
Cache entries contain:
- route result
- expiry
- score
- last success/failure
- backoff state
- fallback candidates
Cache invalidation triggers:
- policy version change
- peer directory version change
- trust/revocation update
- route epoch change
- health state change
- repeated route failure
- expiry
Route cache is a performance aid, not route authority.
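A sketch of a local, bounded, non-authoritative route cache follows. Because `policy_version` and `route_epoch` are part of the key, a version or epoch change produces a cache miss without any explicit deletion, and stale entries can then be dropped in bulk. The class and method names, the stored value (a route id), and the oldest-first eviction are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class RouteCacheKey:
    cluster_id: str
    organization_id: str
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str] = None


class RouteCache:
    def __init__(self, max_entries: int = 256) -> None:
        self._max = max_entries
        self._entries: Dict[RouteCacheKey, Tuple[str, float]] = {}

    def put(self, key: RouteCacheKey, route_id: str, expires_at: float) -> None:
        self._entries.pop(key, None)
        self._entries[key] = (route_id, expires_at)
        while len(self._entries) > self._max:
            # dicts preserve insertion order, so this evicts the oldest entry.
            self._entries.pop(next(iter(self._entries)))

    def get(self, key: RouteCacheKey, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        route_id, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expiry is an invalidation trigger
            return None
        return route_id

    def drop_other_policy_versions(self, current: int) -> int:
        """Remove entries cached under any other policy version."""
        stale = [k for k in self._entries if k.policy_version != current]
        for k in stale:
            del self._entries[k]
        return len(stale)
```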
## 12. Failover Boundaries
Failover decisions may:
- switch from failed active path to fallback path
- promote warm peer path
- retry through bootstrap route for recovery
- mark route unavailable
- request control-plane/config refresh when reachable
- keep degraded existing path alive if policy permits
Failover decisions must not:
- create new cluster authority
- bypass policy
- add nodes
- approve role changes
- cross cluster boundaries without explicit trust
- expose topology to organizations
## 13. Shortcut Decision Boundary
Shortcut connections are optional optimization recommendations.
A shortcut may be recommended when:
- long-lived flow exists
- current path latency/jitter is high
- direct connectivity appears possible
- trust validation succeeds
- policy allows shortcut
- shortcut improves latency, jitter, or bandwidth
- fallback path remains available
Shortcut recommendation output:
- source node
- destination node
- channel classes affected
- expected improvement
- required validation
- expiry
- fallback route id
C15 does not implement shortcut connections. It only defines when a future
Routing Engine may recommend them.
## 14. Service Adapter Integration
Service Adapters may ask for routes using service-neutral metadata.
Examples:
- RDP Adapter requests route to RDP service/egress node or resource target.
- VNC Adapter requests route to VNC target zone.
- SSH Adapter requests route to SSH target.
- VPN/IP tunnel service requests route through `vpn_connection`.
- Storage fetch requests route to config/storage scope.
Service Adapters must not:
- enumerate peers
- select mesh paths
- create relay chains
- create shortcuts
- implement failover policy
- implement partition recovery
- implement cross-cluster routing trust
The adapter consumes a route result and sends/receives through the approved
data-plane boundary when runtime exists.
## 15. Topology Hiding
Organizations see:
- allowed service endpoints
- safe ingress/egress status
- safe session/resource status
- policy-visible route dependency names where allowed
Organizations must not see:
- intermediate core mesh nodes
- full peer directory
- route cache
- shortcut candidates
- other organizations' route data
- storage shard placement
Platform owners may inspect routing internals according to audited platform
policy.
## 16. Degraded and Partition Behavior
In degraded mode, Routing Engine may:
- keep existing authorized routes alive until TTL
- use last signed snapshot for recovery
- select fallback among already-authorized peers
- mark route unavailable when safety cannot be proven
In degraded mode, Routing Engine must not:
- authorize new high-risk routes
- mutate cluster trust
- approve nodes
- assign roles
- promote partition authority automatically
- create cross-cluster trust
## 17. Observability
Routing decisions should emit safe telemetry:
- route selected
- route rejected
- rejection reason
- route class
- channel class
- score bucket
- latency/jitter/packet loss summary
- failover count
- fallback used
- shortcut recommended
- policy version
- peer directory version
- route epoch
Tenant-visible telemetry must hide topology.
## 18. Future Validation Tests
Future implementation tests must prove:
- route request rejects wrong cluster
- route request rejects wrong organization
- revoked peer is not selected
- unavailable route returns explicit result
- cache invalidates on policy version change
- cache invalidates on peer directory version change
- input route prefers latency over throughput
- file transfer route does not starve input class
- service adapter cannot bypass routing engine
- shortcut recommendation requires fallback path
- degraded mode does not authorize new forbidden routes
## 19. C16 Preparation
C16 must define the secure node-to-node channel lifecycle that can later carry
route-selected traffic.
C16 must preserve:
- routing results are bounded and policy-scoped
- channels are authenticated and authorized
- trust/revocation affects active channels
- Service Adapters remain above Fabric routing
- no mesh packet routing starts before an explicit C17 stage
## 20. Result / Decision
Stage C15 defines Fabric Routing Engine as a skeleton boundary for route
requests, route results, scoring, cache relationship, failover, shortcut
recommendations, topology hiding, and Service Adapter integration.
Decisions:
- Routing belongs to Fabric, not Service Adapters.
- Route requests/results are logical contracts, not packet forwarding.
- Hard policy checks precede scoring.
- Route cache is local, bounded, and non-authoritative.
- Routing is channel-aware and QoS-aware.
- Shortcut connections are future optional recommendations, not C15 runtime.
- C16 must define secure node-to-node channels before mesh routing runtime.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C15.
@@ -0,0 +1,333 @@
# Fabric Storage / Config Storage Service
Status: Stage C13 result. Documentation and architecture only.
This document defines the Fabric Storage / Config Storage service foundation.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.
## 1. Purpose
Fabric Storage / Config Storage is the scoped distribution layer for Fabric
Core configuration artifacts.
It distributes and caches:
- signed scoped cluster snapshots
- incremental snapshot updates
- peer directories
- trust bundles
- revocation metadata
- update artifact metadata
- service assignment/config artifacts
It exists so nodes can refresh local state quickly and reliably without asking
the backend database for every realtime routing or supervision decision.
## 2. Non-Goals
Fabric Storage / Config Storage is not:
- a replacement for PostgreSQL
- a second source of truth
- a general-purpose distributed database
- an arbitrary query engine
- a tenant-visible topology database
- a place for raw secrets
- a durable runtime lease store
- a high-rate realtime data-plane relay
Nodes must not write authoritative configuration directly into Fabric Storage.
## 3. Authority Model
Authoritative flow:
```text
PostgreSQL
-> control-plane config compiler
-> signed scoped artifact
-> Fabric Storage / Config Storage distribution
-> node-agent local state
```
Only the control plane, or a tightly scoped config compiler operating under
control-plane authority, may produce authoritative signed configuration
artifacts.
Fabric Storage may replicate and serve artifacts. It does not decide policy.
## 4. Artifact Types
Supported target artifact families:
- `cluster_snapshot`
- `snapshot_increment`
- `peer_directory`
- `trust_bundle`
- `revocation_list`
- `service_assignment_bundle`
- `route_policy_bundle`
- `qos_policy_bundle`
- `update_manifest`
- `storage_directory`
Each artifact must carry:
- artifact id
- artifact type
- cluster id
- scope ids
- config version
- authority epoch
- issued at
- expires at or refresh deadline
- signer key id
- content hash
- signature or signature reference
## 5. Scope and Namespace Rules
Storage namespaces must be scoped by:
- platform
- cluster
- organization where applicable
- service where applicable
- role where applicable
- node where applicable
- artifact family
Example logical namespace:
```text
platform/<platform_id>/
  cluster/<cluster_id>/
    trust/
    snapshots/node/<node_id>/
    snapshots/role/<role>/
    peers/scope/<scope_id>/
    services/<service_type>/
    updates/
```
No node should receive access to namespaces outside its assigned cluster, role,
service, and organization scope.
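Namespace scoping can be sketched as a prefix check over logical paths. The helper names are hypothetical, the path shape assumes the nested layout of the example namespace above, and a real authorization decision would also involve the identity, role, and revocation checks described elsewhere in this document.

```python
from pathlib import PurePosixPath
from typing import Iterable


def node_snapshot_namespace(
    platform_id: str, cluster_id: str, node_id: str
) -> PurePosixPath:
    """Build the logical namespace for one node's scoped snapshots."""
    return PurePosixPath(
        "platform", platform_id, "cluster", cluster_id,
        "snapshots", "node", node_id,
    )


def is_namespace_allowed(
    requested: PurePosixPath, allowed_roots: Iterable[PurePosixPath]
) -> bool:
    """Allow a path only if it sits at or below one of the assigned roots."""
    # PurePosixPath does not resolve "..", so reject traversal explicitly.
    if requested.is_absolute() or ".." in requested.parts:
        return False
    return any(
        requested == root or root in requested.parents
        for root in allowed_roots
    )
```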
## 6. Replication Policy
Replication is policy-driven.
Inputs:
- artifact criticality
- cluster size
- failure domains
- region placement
- node role
- organization isolation
- service locality
- update frequency
- recovery time objective
Rules:
- critical trust and revocation artifacts replicate across failure domains
- hot peer directories should be near entry/core/service nodes that use them
- service config should be near assigned service nodes
- organization-scoped artifacts must not replicate to unrelated org scopes
- thin/mobile nodes should not become broad storage replicas
- storage nodes may hold only assigned shards/scopes
## 7. Distribution Flows
Full snapshot refresh:
1. node-agent reports current config versions
2. storage service returns available version metadata
3. node-agent downloads full scoped snapshot if needed
4. node-agent verifies signature and scope
5. node-agent applies locally
Incremental update:
1. node-agent reports base version
2. storage service returns matching increment chain
3. node-agent verifies every increment
4. node-agent applies only if base versions match
5. version gap triggers full snapshot refresh
Trust/revocation update:
1. node-agent checks trust bundle/revocation version frequently
2. storage service serves signed trust artifacts
3. node-agent verifies using existing trust path
4. revoked identities/keys immediately affect local validation
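The incremental update flow above can be sketched as an all-or-nothing chain application. This is illustrative only: the `Increment` shape is hypothetical, signature and hash verification are represented by a precomputed `verified` flag, and treating the chain as all-or-nothing is an assumption rather than a stated requirement.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence, Tuple


class ChainOutcome(Enum):
    APPLIED = "applied"
    REJECTED = "rejected"          # at least one increment failed verification
    FULL_REFRESH = "full_refresh"  # version gap: fetch a full snapshot instead


@dataclass(frozen=True)
class Increment:
    base_version: int
    new_version: int
    verified: bool  # outcome of signature/hash verification done elsewhere


def apply_chain(
    current_version: int, chain: Sequence[Increment]
) -> Tuple[int, ChainOutcome]:
    """Return (resulting version, outcome); the version changes only on APPLIED."""
    # Verify every increment before applying any of them.
    if not all(inc.verified for inc in chain):
        return current_version, ChainOutcome.REJECTED
    version = current_version
    for inc in chain:
        if inc.base_version != version:
            return current_version, ChainOutcome.FULL_REFRESH
        version = inc.new_version
    return version, ChainOutcome.APPLIED
```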
## 8. Consistency and Invalidation
Artifacts are immutable by content hash.
New versions are published as new artifacts plus index updates.
Rules:
- node must validate content hash
- node must reject stale authority epoch
- node must reject invalid signature
- node must reject wrong scope
- storage index may cache version metadata but not override signatures
- deletion/tombstone artifacts must be signed
- revoked artifacts must not be served as current versions
Cache invalidation is version-based, not best-effort string deletion.
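The node-side rejection rules above can be sketched as a single validation function. Assumptions are labeled in the code: SHA-256 as the content hash, a reduced `Artifact` envelope, and a caller-supplied `signature_valid` callback standing in for real signature verification. None of these are fixed by C13.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional


@dataclass(frozen=True)
class Artifact:
    # Reduced envelope; the full field list is in the Artifact Types section.
    artifact_id: str
    artifact_type: str
    cluster_id: str
    scope_ids: FrozenSet[str]
    authority_epoch: int
    content_hash: str  # assumed: hex-encoded SHA-256 of content
    content: bytes


def validate(
    artifact: Artifact,
    *,
    node_cluster_id: str,
    node_scopes: FrozenSet[str],
    min_authority_epoch: int,
    signature_valid: Callable[[Artifact], bool],
) -> Optional[str]:
    """Return a rejection reason, or None if the artifact may be applied."""
    if hashlib.sha256(artifact.content).hexdigest() != artifact.content_hash:
        return "content_hash_mismatch"
    if artifact.authority_epoch < min_authority_epoch:
        return "stale_authority_epoch"
    if not signature_valid(artifact):
        return "invalid_signature"
    if (artifact.cluster_id != node_cluster_id
            or not artifact.scope_ids <= node_scopes):
        return "wrong_scope"
    return None
```

Because validation depends only on the artifact and the node's own trust state, a storage index that serves wrong or outdated metadata cannot cause an invalid artifact to be accepted.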
## 9. Storage Node Behavior
A storage/config node may:
- cache assigned artifacts
- replicate assigned artifacts
- serve artifacts to authorized nodes
- report artifact availability
- report replication health
- evict cold artifacts according to policy
A storage/config node must not:
- modify artifact content
- sign artifacts
- invent new config versions
- widen scope
- bypass authorization
- serve unrelated org/cluster data
- accept node writes as authoritative config
## 10. Authorization
Artifact fetch authorization must check:
- node identity
- cluster membership
- role assignment
- artifact scope
- organization scope where applicable
- artifact type
- trust/revocation status
- partition/degraded policy
Storage service authorization may use:
- mTLS node identity
- short-lived scoped tokens
- signed node snapshot claims
- control-plane issued fetch grants
Tenant users and organization admins must not directly query internal storage
namespaces. They see safe status projections through control-plane APIs.
## 11. Failure and Degraded Behavior
If the local storage service is unavailable, node-agent recovery order is:
1. try alternate local/nearby storage endpoint
2. try active peers that advertise config/storage availability
3. try bootstrap/config endpoints from last signed snapshot
4. contact control plane if reachable
5. continue from last valid local snapshot only if degraded policy allows it
Storage service outage must not grant new authority.
Nodes must not perform high-risk actions based on missing or stale storage.
## 12. Operational Observability
Storage service should report:
- artifact family health
- replication lag
- missing replica count
- stale shard count
- fetch latency
- fetch failures
- authorization denials
- version gaps
- signature/hash validation failures reported by nodes
- storage capacity
- eviction stats
Audit/control-plane events should include:
- artifact published
- artifact revoked/tombstoned
- replication policy changed
- storage role assigned/removed
- unauthorized fetch denied
- critical artifact under-replicated
## 13. Security Requirements
Required:
- encrypted node-to-storage transport
- authenticated node identity
- scoped fetch authorization
- immutable signed artifacts
- hash verification
- no raw secrets in broad artifacts
- namespace isolation
- audit for high-risk storage/admin actions
Compromised storage node blast radius must be limited:
- it cannot sign valid new artifacts
- it cannot serve data outside assigned scopes
- it cannot modify signed content without detection
- it cannot become authoritative truth
- nodes reject invalid signatures/hashes
## 14. Relationship to Runtime State
Fabric Storage is for configuration and distribution, not realtime runtime
coordination.
Runtime state remains elsewhere:
- PostgreSQL for durable lifecycle/audit/state
- Redis for live coordination/leases/heartbeats/ephemeral routing hints
- node-local state for local cache/runtime observations
Do not store high-rate render frames, input streams, VPN packets, or relay
traffic in Fabric Storage.
## 15. C14 Preparation
C14 must define the peer directory and peer cache model that Fabric Storage may
distribute and node-agent may store locally.
C14 must preserve:
- storage service is distribution/cache only
- peer directories are scoped
- nodes do not learn full topology unless role requires it
- routing decisions belong to Fabric Routing Engine, not Service Adapters
## 16. Result / Decision
Stage C13 defines Fabric Storage / Config Storage as a scoped distribution and
cache service for signed Fabric Core artifacts.
Decisions:
- Fabric Storage distributes signed artifacts but does not author them
- PostgreSQL remains authoritative
- artifacts are immutable by content hash
- invalidation is version-based
- replication is policy-driven and scope-bound
- storage nodes may cache and serve only assigned scopes
- storage service is not a realtime data-plane relay
- storage service is not a general-purpose database
- C14 must define the peer directory/cache artifacts and local runtime use
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C13.
@@ -0,0 +1,419 @@
# Node Local State Store
Status: Stage C12 result. Documentation and architecture only.
This document defines the node-local state store model for native
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.
## 1. Purpose
The node-local state store lets `rap-node-agent` operate safely without asking
the backend for every realtime routing or service supervision decision.
The local store must support:
- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot
The local store must not become a durable source of truth.
## 2. Authority Boundaries
PostgreSQL remains authoritative for durable domain state.
Fabric Storage / Config Storage distributes signed snapshots and increments.
Node-local state stores verified local copies and runtime observations.
Redis remains live coordination only.
Node-local state must not authorize:
- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust
## 3. Storage Root and Namespaces
The node-agent should use one configured local storage root.
Example logical layout:
```text
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```
Rules:
- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic
activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account
## 4. State Classes
### Agent State
Agent-level state:
- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata
Agent state is not cluster authority.
### Identity State
Cluster identity state:
- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache
Private keys should be stored in an OS-protected key store when available. If
file-backed keys are necessary, they must be encrypted at rest and protected by
strict filesystem permissions.
### Trust State
Trust state:
- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version
Trust state must be signed and versioned. Unknown or revoked trust roots must
not be accepted.
### Snapshot State
Snapshot state:
- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines
Snapshot activation must be atomic:
1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version
### Peer Cache
Peer cache:
- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version
Peer cache combines signed directory data with runtime observations. Runtime
observations are hints, not durable authority.
### Route Cache
Route cache:
- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason
Route cache must be reconstructable from signed snapshots, peer cache, and
runtime observations. It must not define policy.
### Service Assignment Cache
Service assignment cache:
- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized
This cache informs supervision. It does not allow the node to invent new
service work.
### Health and Degraded State
Health/degraded state:
- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline
Degraded state must be visible in node heartbeat/status when connectivity
returns.
### Update Metadata
Update state:
- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result
Unsigned artifacts must never be activated.
## 5. Encryption and Secret Handling
The local store should avoid storing secrets. When secret-related data is
required, store references and resolver metadata, not plaintext.
Rules:
- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values
Recommended OS facilities:
- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
service-account permissions
- macOS (future client/agent): Keychain
## 6. Atomicity and Durability
Writes must be safe across process crashes and host reboots.
Rules:
- write new content to temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers
Node-agent should tolerate:
- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations
Node-agent must not tolerate silently corrupted identity, trust, or active
snapshot data.
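The write rules above correspond to the common write-temp, fsync, rename pattern. The sketch below is illustrative and assumes a SHA-256 content hash and a `.previous` sibling file for the prior verified content. `activate_atomically` is a hypothetical name, and the store lock and directory fsync are omitted for brevity.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def activate_atomically(
    active_path: Path, content: bytes, expected_sha256: str
) -> None:
    """Replace active_path with verified content without partial overwrites."""
    # Verify before anything touches the active file.
    if hashlib.sha256(content).hexdigest() != expected_sha256:
        raise ValueError("content hash mismatch; active state left untouched")
    directory = active_path.parent
    directory.mkdir(parents=True, exist_ok=True)
    # The temporary file lives in the same directory so the rename is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=".pending-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        if active_path.exists():
            # Keep the previous verified content; copy so that the active
            # file never disappears, even if the process crashes here.
            previous = active_path.with_name(active_path.name + ".previous")
            staging = active_path.with_name(active_path.name + ".previous.tmp")
            shutil.copyfile(active_path, staging)
            os.replace(staging, previous)
        os.replace(tmp_name, active_path)  # atomic swap of the active content
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

An interrupted run leaves either the old active file or the new one in place, never a mixture. Leftover `.pending-` files match the "wipe temporary files on startup" cleanup rule.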
## 7. Cache Expiry and Cleanup
Local caches must be bounded.
Cleanup rules:
- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup
Caches may be rebuilt. Identity, trust, and active snapshots require stricter
recovery behavior.
## 8. Corruption Recovery
Recovery order:
1. load active verified state
2. reject corrupted pending state
3. fall back to previous verified snapshot if active snapshot is corrupt and
policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot exists and policy allows it
7. fail closed for trust/identity corruption
Corruption must be reported through health/status and local diagnostics.
## 9. Multi-Cluster Isolation
A node may participate in multiple clusters only through isolated memberships.
Per-cluster isolation includes:
- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace
Cross-cluster data sharing is forbidden unless explicit platform trust and
policy allow it.
## 10. Service Workload Boundary
Service workloads do not write authoritative node-local state.
Allowed workload interactions:
- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent
Forbidden workload interactions:
- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets
## 11. Backup and Restore
Backup rules:
- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots
Restore must require control-plane validation before the node is trusted for
new high-risk work.
## 12. Observability
Node-agent should report safe local state metadata:
- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state
Reports must not include raw secrets or unrelated topology.
## 13. Future Validation Tests
Future implementation tests must prove:
- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly
## 14. C13 Preparation
C13 must define the Fabric Storage / Config Storage service that distributes
snapshots, peer directories, trust bundles, and incremental updates to the
node-local state store.
C13 must preserve:
- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior
## 15. Result / Decision
Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native `rap-node-agent`.
Decisions:
- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment
cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a
second source of truth
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C12.
@@ -0,0 +1,351 @@
# Production Direct Worker WSS Trust
Status: P3.4 design/prep complete.
This document defines the production trust model for direct worker WSS. It is a
preparation document only: it does not change RDP runtime behavior, does not
remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC,
WebRTC, or node-agent enrollment.
## Goal
Direct worker WSS should become the preferred production realtime path only
when the client can verify both:
- the backend candidate is authorized and marked `production_trusted=true`
- the worker endpoint presents a valid TLS certificate for the advertised URL
The backend gateway remains the safe fallback/debug path.
## Trust Modes
`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` has three modes:
- `smoke_insecure`: development/smoke only. Backend may advertise a direct
candidate only outside production and must mark it `smoke_only=true` and
`production_trusted=false`.
- `public_ca`: worker WSS certificate chains to an OS/publicly trusted CA.
Backend may mark the candidate `production_trusted=true`.
- `platform_ca`: worker WSS certificate chains to a platform-managed CA.
Backend may mark the candidate `production_trusted=true` and include
`tls_ca_ref`.
Production must not treat `smoke_insecure` as trusted. P3.3 proved that a
production backend with `smoke_insecure` falls back to `backend_gateway`.
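
The mode rules can be summarized as a mapping from trust mode and environment to candidate flags. This is an illustrative sketch, not the backend implementation: the function name is hypothetical, `smoke_only=false` for `public_ca` is inferred rather than stated, and requiring `tls_ca_ref` for `platform_ca` is an assumption.

```python
def direct_candidate_metadata(trust_mode, app_env, tls_ca_ref=None):
    """Return candidate flags, or None when no direct candidate is advertised."""
    if trust_mode == "smoke_insecure":
        if app_env == "production":
            return None  # production falls back to backend_gateway
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": True,
            "production_trusted": False,
        }
    if trust_mode == "public_ca":
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": False,
            "production_trusted": True,
        }
    if trust_mode == "platform_ca":
        if not tls_ca_ref:
            # Assumption: a platform CA candidate is useless without a CA reference.
            raise ValueError("platform_ca requires tls_ca_ref")
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": False,
            "production_trusted": True,
            "tls_ca_ref": tls_ca_ref,
        }
    raise ValueError(f"unknown trust mode: {trust_mode}")
```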
## Recommended Mode
Use `platform_ca` as the default production model for platform-managed and
customer-managed worker nodes.
Use `public_ca` only when the worker direct WSS endpoint is intentionally
internet-addressable through stable DNS and a public certificate can be issued
and renewed safely.
Rationale:
- most worker endpoints will be private, internal, or customer-managed
- public CA issuance is often impossible for private IP/DNS names
- a platform CA can bind certificates to platform node/worker identity
- platform CA trust can later integrate with `rap-node-agent`
- backend gateway fallback remains available while trust rollout is staged
## Certificate Profile
Worker direct WSS certificates must be server certificates.
Required X.509 properties:
- `KeyUsage`: `digitalSignature`, plus `keyEncipherment` where required by the
selected TLS key type
- `ExtendedKeyUsage`: `serverAuth`
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
- CN must not be used as the trust identity
- validity should be short-lived, recommended 30-90 days in production
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally
practical
Recommended identity SAN:
```text
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
```
For the current single-cluster MVP, `cluster_id` may be `default` until the
cluster model becomes explicit.
The URI SAN is not a replacement for normal hostname verification. It is an
additional identity binding for observability, future node-agent enrollment,
and future control-plane certificate inventory.
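
A small sketch of building and parsing the identity URI in the shape shown above. The helper names are hypothetical, and the rule that ids must not contain `/` is an assumption made to keep parsing unambiguous.

```python
from urllib.parse import urlsplit


def worker_identity_uri(worker_id, cluster_id="default"):
    """Build the URI SAN value for a worker."""
    for part in (cluster_id, worker_id):
        if not part or "/" in part:
            raise ValueError("ids must be non-empty and contain no '/'")
    return f"spiffe://rap/cluster/{cluster_id}/worker/{worker_id}"


def parse_worker_identity_uri(uri):
    """Return (cluster_id, worker_id) or raise ValueError."""
    parts = urlsplit(uri)
    segments = parts.path.split("/")
    valid = (
        parts.scheme == "spiffe"
        and parts.netloc == "rap"
        and len(segments) == 5
        and segments[0] == ""
        and segments[1] == "cluster"
        and segments[3] == "worker"
        and segments[2] != ""
        and segments[4] != ""
    )
    if not valid:
        raise ValueError("not a worker identity URI")
    return segments[2], segments[4]
```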
## Candidate URL Rules
The backend must advertise a direct worker WSS URL whose host is covered by the
worker certificate SAN.
Examples:
```text
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
wss://192.168.200.61:18443/rap/v1/data-plane
```
If the URL uses a DNS name, the certificate must include that DNS SAN.
If the URL uses an IP address, the certificate must include that IP SAN.
Preferred production shape is DNS, not raw IP, because DNS gives safer
certificate rotation and node replacement.
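
Before advertising a candidate, the backend could check that the URL host is covered by the certificate's SAN entries. The sketch below assumes the SAN lists are already extracted from the certificate and uses exact matching only; wildcard DNS SANs are deliberately not handled, and the function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def url_host_covered(url, dns_sans, ip_sans):
    """Return True if the wss:// URL host matches a DNS or IP SAN exactly."""
    parts = urlsplit(url)
    if parts.scheme != "wss" or parts.hostname is None:
        return False
    host = parts.hostname
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None
    if addr is not None:
        # IP hosts must match an IP SAN, never a DNS SAN.
        return any(ipaddress.ip_address(san) == addr for san in ip_sans)
    return host.lower() in {san.lower() for san in dns_sans}
```

Using the example URLs above:

```python
url_host_covered("wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane",
                 ["rdp-worker-1.dp.test.cin.su"], [])
url_host_covered("wss://192.168.200.61:18443/rap/v1/data-plane",
                 [], ["192.168.200.61"])
```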
## Worker Identity Binding
Direct worker WSS authentication is layered:
1. TLS proves that the client reached an endpoint with a certificate trusted for
the advertised URL.
2. `data_plane_token` proves that the backend authorized the session,
attachment, user, organization, resource, worker, and allowed channels.
3. The worker validates the token and binds the WSS connection to an existing
runtime only.
The TLS certificate does not replace token validation.
The token does not replace TLS trust.
Future production hardening should add control-plane certificate inventory:
```text
worker_certificates
worker_id
cluster_id
tls_ca_ref
certificate_fingerprint_sha256
serial_number
not_before
not_after
status: active | retiring | revoked | expired
```
Until that inventory exists, backend must be conservative and only mark direct
candidates production-trusted when deployment configuration guarantees the
worker certificate is trusted for the advertised URL.
## Platform CA Structure
Recommended hierarchy:
```text
RAP Platform Offline Root CA
-> RAP Data Plane Worker Intermediate CA v1
-> worker direct WSS server certificates
```
Rules:
- Root CA private key must not be present on worker hosts.
- Intermediate CA private key must not be present on worker hosts.
- Worker receives only its server certificate, private key, and CA chain.
- Windows clients receive only the trust bundle, never private keys.
- Backend receives CA reference metadata and may carry public trust bundle
references, never CA private keys.
For the current test stand, a temporary test CA may be generated on
`docker-test`, but it must be treated as throwaway test material and not
committed.
## Certificate Issuance And Storage
Future `rap-node-agent` should own enrollment. Before node-agent exists, test
stand issuance may be manual.
Production desired flow:
1. Platform owner approves node/worker enrollment.
2. Node agent generates a private key locally.
3. Node agent creates CSR with:
- worker/node identity URI SAN
- DNS/IP SANs for reachable direct WSS endpoints
- cluster id
4. Control plane or CA service signs the CSR if node policy allows the role.
5. Node agent writes certificate/key to a host-local protected path.
6. Worker container mounts certificate/key read-only, or native worker reads
protected local files.
7. Backend advertises direct candidates with:
- `tls_trust_mode=platform_ca`
- `production_trusted=true`
- `smoke_only=false`
- `tls_ca_ref=<active-ca-ref>`
Container note:
- Certificates are node/host trust assets, not container identity.
- Containers may consume mounted cert/key files.
- Container rebuilds must not generate production CA material.
## Windows Client Trust
For `public_ca`, the Windows client should rely on normal OS certificate
validation.
For `platform_ca`, the preferred production approach is app-local trust:
- client configuration references a platform CA bundle by `tls_ca_ref`
- WSS TLS validation uses a custom chain policy with an app-managed trust store
- hostname/SAN validation remains enabled
- revocation/deny-list checks are applied when available
- no global insecure callback is used
Installing the platform root into the Windows CurrentUser or LocalMachine Root
store may be supported for managed enterprise deployment, but it should not be
required for MVP smoke because it broadens OS-level trust.
Current state:
- Windows client already skips smoke-only/untrusted direct candidates in
production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN
validation preserved.
- P3.5 smoke proved `platform_ca` direct worker WSS without insecure TLS
bypass on `docker-test`.
## Rotation
Worker certificate rotation:
- certificates should be renewed before 2/3 of lifetime has elapsed
- new cert/key should be staged next to the old files
- worker should reload or restart gracefully
- backend gateway fallback must remain available during rotation
- old cert should remain accepted during a short overlap window
- after successful cutover, old cert should be marked retiring/expired
Platform CA rotation:
- introduce new `tls_ca_ref`
- distribute the new trust bundle to clients before workers switch
- backend may advertise candidates with the new CA only after client trust is
available
- keep old and new CA bundles valid during migration
- remove the old CA only after all active workers and clients are migrated
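
The "before 2/3 of lifetime" rule is simple arithmetic over the certificate validity window. A sketch, with hypothetical function names: for a 90-day certificate the latest renewal point falls 60 days after `not_before`.

```python
from datetime import datetime, timedelta, timezone


def renewal_deadline(not_before, not_after):
    """Latest point at which renewal should already have happened."""
    if not_after <= not_before:
        raise ValueError("not_after must be later than not_before")
    return not_before + (not_after - not_before) * 2 / 3


def renewal_overdue(now, not_before, not_after):
    """True once 2/3 of the certificate lifetime has elapsed."""
    return now >= renewal_deadline(not_before, not_after)
```

A scheduler would normally start renewal some margin before this deadline so that staging and graceful reload fit inside the remaining window.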
## Revocation And Deny-List
Short-lived certificates are the first control.
Additional revocation controls:
- stop advertising direct candidates for revoked workers immediately
- revoke worker certificate serial/fingerprint in control-plane inventory
- optionally distribute a compact deny-list to clients
- force backend gateway fallback for revoked/untrusted workers
- rotate data-plane signing keys separately if token signing material is at risk
Revocation must not rely on the worker cooperating after compromise.
## Graceful Failure And Fallback
Direct WSS must fail closed:
- expired cert: direct rejected, fallback to backend gateway
- hostname mismatch: direct rejected, fallback to backend gateway
- untrusted platform CA: direct rejected, fallback to backend gateway
- revoked fingerprint: direct rejected, fallback to backend gateway
- token validation failure: direct rejected, fallback to backend gateway where
policy permits
Fallback must be logged so production does not silently run permanently on the
debug path.
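
A client-side sketch of "reject direct, fall back, and log it". Everything here is illustrative: `validate_direct` is a hypothetical callable that returns `None` on success or a short reason string, and the log line format is not an existing client log. Policy gating for token validation failures is not modeled.

```python
import logging

log = logging.getLogger("data_plane")


def select_transport(candidate, validate_direct):
    """Return the transport to use, logging every fallback with its reason."""
    if candidate is None:
        log.warning("data_plane.fallback reason=no_direct_candidate")
        return "backend_gateway"
    reason = validate_direct(candidate)
    if reason is None:
        return "direct_worker_wss"
    # Direct is rejected; record why so permanent fallback is visible.
    log.warning("data_plane.fallback reason=%s", reason)
    return "backend_gateway"
```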
## Test-Stand P3.5 Smoke Result
P3.5 proved `platform_ca` without using insecure TLS bypass.
Sanitized command shape:
```powershell
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"
# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1
# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
-v /tmp/rap-p3-5-platform-ca:/certs:ro `
-e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
-e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
-e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
rap-rdp-worker:rdp-p1-region-order2
# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
-v /tmp/rap-dp1d1:/certs:ro `
-v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
-e APP_ENV=production `
-e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
-e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
-e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
-e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
rap-backend-smoke:p3-3
# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false
# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
-PreferDirectDataPlane:$true `
-AllowInsecureDirectDataPlaneTlsForSmoke:$false `
-DirectDataPlaneConnectTimeoutMs 2500 `
-SkipOrgSwitchAndTokenRefresh
```
P3.5 PASS conditions:
- backend candidate metadata includes:
- `tls_trust_mode=platform_ca`
- `production_trusted=true`
- `smoke_only=false`
- `tls_ca_ref=rap-platform-ca:test-v1`
- Windows client selects `direct_worker_wss` in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or
direct WSS is unavailable
Required negative tests:
- wrong SAN certificate rejected
- expired certificate rejected
- unknown CA rejected
- `smoke_insecure` candidate skipped in production
Runtime proof is recorded in:
- `artifacts/p3-5-app-local-platform-ca-smoke-report.md`
## Current Implementation Status
Existing config fields are sufficient for P3.4:
- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`
P3.5 added Windows client setting:
- `direct_data_plane_platform_ca_bundle`
P3.6 completed stale worker-event/restart idempotency hardening.
Stage 5.2 server-to-client file download design is complete in
`docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`. The next step should return
to the RDP feature plan with the narrow Stage 5.2 implementation, not RDP
rendering, lifecycle expansion, mesh, VPN, or new protocol adapters.
@@ -0,0 +1,559 @@
# RDP Adapter Runtime
Status: active implementation plan for the new C++ RDP Adapter internals.
Current implementation status:
- RDP-A1 is build-proven: the common Service Adapter channel model and probe exist.
- RDP-A2 is live-smoke-proven on the test Docker environment as of 2026-04-26: `SessionRuntime` depends on `RdpAdapterRuntime`, not directly on FreeRDP runtime types, and a real RDP session still connects through the existing direct data plane.
- RDP-Perf-2 is live-smoke-proven on the test Docker environment as of 2026-04-26: the current FreeRDP substrate now logs callback source/timing, capture source, and input-to-first-graphics-callback timing.
- RDP-Perf-3 / RDP-A3 region-first BGRA fallback is live-smoke-proven on the test Docker environment as of 2026-04-26: direct binary region frames render in the Windows client and backend gateway fallback remains compatible.
- RDP-Perf-4 / RDP-A6 gated RDPGFX foundation is build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26. RDPGFX stays disabled by default because the current live RDP target resets the connection when graphics pipeline support is advertised.
- RDP-A4 CursorAdapter is live-smoke-proven on the test Docker environment as of 2026-04-26: FreeRDP pointer callbacks are normalized into latest-only `cursor.update` events, direct worker WSS sends them separately from display frames, and backend gateway fallback remains compatible.
- RDP-Perf-5A is build-proven and smoke-proven on the test Docker environment as of 2026-04-26: classic GDI region/interactive frames use a 33 ms publish cadence, hot-loop lease renewal is removed, and direct/fallback paths remain compatible.
- RDP-Perf-6 direct dirty-region binary contract is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26: direct `RAP2` frames now distinguish `render.frame.full` from `render.frame.region`, include region payload diagnostics, and the Windows presenter keeps a session framebuffer for region patching. Runtime proof used `P3.3 Secret RDP Resource`; observed dirty-region savings ranged from `82.22%` to `99.56%` versus the `3,686,400` byte full frame.
- Current accepted baseline is `rap-rdp-worker:rdp-p1-region-order2`: ordered dirty-region delivery is preserved through `SessionRuntime`, worker direct WSS, Windows transport, and WPF presenter queues. Manual visual smoke accepted idle repaint, Start menu/hover, mouse, keyboard, and session close on 2026-04-26.
- Remaining visual limitation is quality/performance rather than correctness: window drag behaves like older/slow-link RDP clients by showing a drag frame, and repaint after releasing a moved window is usable but not yet polished.
- FreeRDP is still present as the current internal substrate behind the RDP Adapter boundary. It must not be removed until the adapter path is live-proven and replacement layers are ready.
This does not change the current cluster/control-plane contracts. The current backend gateway fallback remains available until each data-plane stage is proven.
## 1. Goal
The RDP Adapter must translate Microsoft RDP into the platform session/data-plane protocol.
```text
Access Client
<-> platform session/data-plane protocol
RDP Adapter
<-> FreeRDP / project-owned RDP internals
RDP Server
```
The adapter must process events from both sides:
- Access Client events: input, clipboard, file upload/download, control.
- RDP Server events: graphics updates, cursor updates, clipboard changes, device/drive events, disconnects, errors.
The adapter must not depend on mouse/keyboard input to discover screen changes.
## 2. External References And Lessons
FreeRDP exposes an event-driven client model through:
- `freerdp_get_event_handles` / `freerdp_check_event_handles` for event dispatch.
- `rdpUpdate` callbacks such as `BeginPaint`, `EndPaint`, `BitmapUpdate`, `RefreshRect`, `SurfaceBits`, `SurfaceFrameMarker`, and `SurfaceFrameBits`.
- client channel modules such as `cliprdr`, `rdpdr`, and `rdpgfx`.
Apache Guacamole uses the same architectural principle at a higher level: protocol-specific plugins translate RDP/VNC/SSH into a common client protocol so the client does not implement those protocols directly.
Design implication for this project:
- FreeRDP callbacks/channels are adapter-origin event sources.
- The platform Access Client receives normalized display/cursor/clipboard/file/control events.
- Full-frame polling is only fallback/debug, not the target render mechanism.
## 3. Runtime Components
```text
SessionRuntime
owns lifecycle/assignment/policy/lease boundary
owns RDP Adapter Runtime
RDP Adapter Runtime
RdpEventPump
InputAdapter
DisplayAdapter
CursorAdapter
ClipboardAdapter
FileTransferAdapter
QualityController
AdapterEventRouter
DataPlane Sinks
direct worker WSS
backend gateway fallback
```
### RdpEventPump
Responsibilities:
- own the FreeRDP event loop
- wait on FreeRDP event handles
- dispatch FreeRDP callbacks promptly
- never sleep instead of processing available server events
- report disconnect/error state
### InputAdapter
Responsibilities:
- accept normalized platform input
- preserve keyboard down/up ordering
- preserve mouse button/wheel ordering
- coalesce pointer move to latest
- send focus/move/button/key through FreeRDP input API
- never trigger full-frame capture loops as the main render mechanism
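
One way to coalesce pointer moves without disturbing key or button ordering is to merge only consecutive moves. The sketch assumes events are dicts with a `type` field, which is an illustrative shape and not the platform input contract.

```python
def coalesce_pointer_moves(events):
    """Keep only the latest move in each run of consecutive pointer moves."""
    out = []
    for event in events:
        if out and event["type"] == "move" and out[-1]["type"] == "move":
            out[-1] = event  # replace the pending move with the newer one
        else:
            out.append(event)  # keys, buttons, and wheel keep their order
    return out
```

A move is never carried across a button or key event, so a click still lands at the position that preceded it.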
### DisplayAdapter
Responsibilities:
- consume FreeRDP update callbacks
- generate platform display events
- prefer dirty regions/surface updates over full frames
- send baseline full frame only on connect/resize/attach/recovery/fallback
- keep a full framebuffer only where needed for compatibility
- never block input on render work
Required event sources:
- `BitmapUpdate`
- `RefreshRect`
- `SurfaceBits`
- `SurfaceFrameMarker`
- `SurfaceFrameBits`
- `EndPaint`
- RDPGFX channel events when enabled and stable
- periodic fallback change detection only as a safety net
### CursorAdapter
Responsibilities:
- handle FreeRDP pointer callbacks
- publish cursor position/visibility/shape independently from display frames
- keep cursor events latest-only
### ClipboardAdapter
Responsibilities:
- use `cliprdr`
- preserve existing `clipboard_mode`
- text-only until explicitly expanded
- enforce max size and lifecycle state
- prevent loops using sequence/origin/hash
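
A hash-based echo guard is one possible form of the loop prevention mentioned above. This sketch covers only the origin/hash part plus the size limit; sequence numbers, `clipboard_mode`, and lifecycle checks are not modeled, and the class name is hypothetical.

```python
import hashlib


def _digest(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ClipboardLoopGuard:
    """Suppress clipboard changes that are echoes of our own writes."""

    def __init__(self, max_bytes):
        self._max_bytes = max_bytes
        self._last_applied = {}  # side -> digest of the text last written there

    def record_applied(self, side, text):
        """Remember text the adapter itself wrote to 'client' or 'server'."""
        self._last_applied[side] = _digest(text)

    def should_forward(self, origin, text):
        """Decide whether a change observed on origin goes to the other side."""
        if len(text.encode("utf-8")) > self._max_bytes:
            return False  # over the configured size limit
        if self._last_applied.get(origin) == _digest(text):
            return False  # echo of a write the adapter made to this side
        return True
```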
### FileTransferAdapter
Responsibilities:
- preserve existing `file_transfer_mode`
- keep upload/download reliable and chunked
- enforce session/controller/policy/state
- keep restricted drive mapping isolated to per-session visible directory
- never expose arbitrary worker filesystem paths
### QualityController
Responsibilities:
- choose color mode / FPS / dirty-region threshold
- degrade render before input
- keep file transfer and future VPN-like bulk traffic from starving interactive channels
## 4. Data-Plane Streams
The target adapter uses independent scheduling classes even if they share one WSS connection in DP-1:
| Stream | Channel | Scheduling |
| --- | --- | --- |
| Critical input | `input` | first, ordered, bounded |
| Control | `control` | reliable, bounded |
| Cursor | `cursor` | latest-only, bypass display cadence |
| Display | `display` | droppable, latest region/frame |
| Clipboard | `clipboard` | reliable, policy-gated |
| File transfer | `file_transfer` | reliable chunked, bandwidth-limited |
| Telemetry | `telemetry` | sampled/droppable |
Future transports may split streams physically:
- control/input WSS
- display binary WSS or QUIC-like transport
- file transfer chunk stream
- audio/video adaptive stream
DP-1 must keep current direct WSS/fallback intact while enforcing scheduling semantics internally.
## 5. Display Contract
Display event types:
- `display.baseline_full_bgra`
- `display.region_bgra`
- `display.surface_create`
- `display.surface_delete`
- `display.surface_bits`
- `display.encoded_frame`
- `display.resize`
- `display.sync`
Rules:
- Access Client owns the visible framebuffer.
- Region updates patch the existing full-size framebuffer.
- Adapter must send a baseline frame before region-only updates after connect/attach/resize.
- Stale display updates may be dropped.
- Cursor updates must not wait for display frames.
- Full-frame BGRA is fallback, not production target.
- Direct binary display messages use the existing `RAP2` frame header:
`render.frame.full` for baseline/recovery frames and `render.frame.region`
for BGRA32 dirty-region payloads.
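
Patching a dirty region into the client framebuffer is a row-by-row copy. The sketch below assumes tightly packed, top-down BGRA32 rows with no stride padding; the `RAP2` header layout is not modeled and the function name is illustrative.

```python
BYTES_PER_PIXEL = 4  # BGRA32


def patch_region(framebuffer, fb_width, fb_height, x, y, width, height, region):
    """Copy a BGRA32 region into a full-size BGRA32 framebuffer in place."""
    if len(framebuffer) != fb_width * fb_height * BYTES_PER_PIXEL:
        raise ValueError("framebuffer size does not match its dimensions")
    if width <= 0 or height <= 0 or x < 0 or y < 0:
        raise ValueError("region must have positive size and non-negative origin")
    if x + width > fb_width or y + height > fb_height:
        raise ValueError("region exceeds framebuffer bounds")
    if len(region) != width * height * BYTES_PER_PIXEL:
        raise ValueError("region payload size does not match its dimensions")
    row_bytes = width * BYTES_PER_PIXEL
    for row in range(height):
        dst = ((y + row) * fb_width + x) * BYTES_PER_PIXEL
        src = row * row_bytes
        framebuffer[dst:dst + row_bytes] = region[src:src + row_bytes]
```

The bounds checks matter because a region that arrives before a baseline or after a resize may not fit the current framebuffer.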
## 6. FreeRDP Usage Rules
Default stable mode:
- GDI/primary framebuffer fallback
- update callbacks installed
- cliprdr enabled only when policy permits
- rdpdr restricted drive only when file transfer policy permits
Experimental/next modes:
- RDPGFX dynamic channel behind explicit capability flag
- surface/event parsing before enabling by default
- encoded graphics payloads only when client capability and server support are proven
Do not enable unstable graphics paths globally. Each capability must be gated, logged, and fallback-safe.
## 7. Migration Plan
### RDP-A1: Contract And Scaffolding
Deliver:
- common Service Adapter protocol document
- RDP Adapter runtime document
- compile-safe adapter channel model
- no runtime behavior switch
Status: completed and build-proven.
### RDP-A2: Event Router Boundary
Deliver:
- route FreeRDP notifications through `AdapterEventRouter`
- preserve existing `WorkerEvent` output
- prove server-origin display events flow without client input
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
Current code boundary:
- `SessionRuntime` owns `RdpAdapterRuntime`.
- `RdpAdapterRuntime` owns the current FreeRDP substrate.
- `AdapterEventRouter` normalizes substrate notifications into adapter event descriptors.
- Existing worker events and data-plane contracts are preserved.
Smoke command:
```powershell
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
-PreferDirectDataPlane:$true `
-AllowInsecureDirectDataPlaneTlsForSmoke:$true `
-DirectDataPlaneConnectTimeoutMs 2500 `
-DirectDataPlaneColorMode full_color `
-SkipOrgSwitchAndTokenRefresh
```
Smoke evidence:
- worker image: `rap-rdp-worker:rdp-adapter-a2`
- session id: `c835e211-a105-4165-9ed2-885ddf876b84`
- worker log: `rdp_adapter.runtime_start substrate=freerdp`
- worker log: `adapter_event channel=display type=display.baseline_full_bgra`
- worker log: `data_plane_bind_success ... render_transport=binary_v1`
- client log: `data_plane.transport selected=direct_worker_wss`
- client log: `SessionWindow rendered frame`
- smoke result: login/resource/start/input/detach/attach/takeover/taken_over/logout passed
- runtime creation count: one `started new runtime for session` entry across start/reattach/takeover
### RDP-A3: DisplayAdapter Region-First
Deliver:
- baseline full frame on connect/attach/resize
- region updates as default normal UI path
- client framebuffer patch proof
- full-frame fallback retained
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
Proof summary:
- `BitmapUpdate` dirty regions are deferred and flushed once at `EndPaint`.
- Region payloads are sent over direct binary WSS as `message_type=render.frame.region` with `frame_update_kind=region`.
- Windows client renders region frames into the existing framebuffer.
- Backend gateway fallback remains available and smoke-proven.
- Report: `artifacts/rdp-perf3-report.md`
Prerequisite proof:
- RDP-Perf-2 showed active `BitmapUpdate`, `BeginPaint`, and `EndPaint` callbacks in stable GDI mode.
- RDP-Perf-2 did not observe `RefreshRect`, `SurfaceBits`, `SurfaceFrameMarker`, `SurfaceFrameBits`, or pointer callbacks in the live smoke.
- The next implementation should prefer `BitmapUpdate` dirty regions and treat `EndPaint` as a flush/safety marker instead of producing duplicate captures.
### RDP-A4: CursorAdapter
Deliver:
- cursor position/shape/visibility channel
- cursor updates independent from render cadence
Status: completed and live-smoke-proven on the test Docker environment as of
2026-04-26.
Implementation notes:
- `CursorAdapter` produces normalized cursor position, visibility, shape, cache,
hotspot, and mask metadata.
- The FreeRDP substrate invokes original pointer callbacks first, then publishes
platform cursor events.
- `session_cursor_updated` is routed as the adapter event `cursor.update`.
- Direct worker WSS keeps cursor as latest-only/droppable and does not block it
behind render frames.
- The Windows client consumes `cursor.update` without changing session lifecycle
or UI layout.
Proof:
- direct smoke session id: `549806aa-c9db-48a9-917e-cf817cf236b5`
- fallback smoke session id: `dee3a856-bee1-4eba-9c10-f62edaf56547`
- worker image: `rap-rdp-worker:rdp-a4-cursor-adapter`
- report: `artifacts/rdp-a4-cursor-adapter-report.md`
### RDP-A4.1 / RDP-Perf-5A: GDI Repaint Cadence Hardening
Deliver:
- bounded immediate FreeRDP event-handle drain after signaled event checks
- rate-limited no-change detector logs
- no Redis lease renewal in the hot render/input loop
- 33 ms region/interactive render publish cadence
- 100 ms full-frame fallback cadence retained
- direct worker WSS and backend gateway fallback compatibility
Status: completed and smoke-proven on the test Docker environment as of
2026-04-26.
Proof summary:
- direct smoke session id: `0cca4974-2a82-48dc-a0f6-1036ea8e98f0`
- fallback smoke session id: `16deb09e-1c44-4e9d-8448-93b42ac66ed0`
- worker image: `rap-rdp-worker:rdp-perf5a-repaint-cadence`
- direct worker WSS selected in direct smoke
- backend gateway selected in fallback smoke
- direct render stayed binary and skipped JSON/base64 compatibility frame
building
- backend gateway fallback still built JSON/base64 compatibility frames
- render queues stayed bounded in observed direct smoke
- report: `artifacts/rdp-perf5a-report.md`
Follow-up manual validation:
- keyboard behavior reached a usable level
- mouse movement/click behavior became acceptable for the MVP baseline
- remote idle updates such as Task Manager percentages now repaint without local
mouse movement
- small redraw artifacts remain and require a focused visual correctness pass
### RDP-A4.2: Direct Attach Baseline And Region-Loss Repair
Deliver:
- request a full-frame baseline when a direct client attaches without a cached
full frame
- queue direct attach baseline frames as non-droppable reliable events
- preserve region-first rendering for normal updates
- capture throttled full-frame repair when region loss/drop can leave persistent
artifacts
- keep input, clipboard, file upload, session lifecycle, direct worker WSS, and
backend gateway fallback unchanged
Status: previous accepted baseline, superseded by P1 ordered-region delivery on
2026-04-26.
Proof summary:
- worker image: `rap-rdp-worker:rdp-region-repair`
- worker probes pass for graphics adapter, cursor adapter, service adapter
protocol, and direct data-plane bind validation
- direct attach no longer starts from a black-only framebuffer when no cached
full frame is available
- server-origin idle updates are visible without local input
- remaining issue is small redraw artifacts during some region update
sequences
Current code boundaries:
- `SessionRuntime::PublishDirectAttachBaselineIfRequested`
- `SessionRuntime::DrainAndPublishRenderNotifications`
- `RdpAdapterRuntime::CaptureFullFrameNotification`
- `RdpRuntime::CaptureFullFrameNotification`
- `DirectWssEventSink::EnqueueEvent`
Next hardening target:
- add region sequence/gap diagnostics
- identify whether remaining artifacts come from dropped regions, stale ordering,
wrong client patching, missed callbacks, or repair timing
- apply the smallest fix without returning to full-frame polling as the normal
render path
### RDP-A4.3 / P1: Ordered Region Delivery Candidate
Root cause addressed:
- Region frames were passing through latest-frame-only queues in the direct
worker writer, Windows transport, and WPF presenter.
- A second ordered-delivery gap was found in `SessionRuntime`, where frame
notifications were still coalesced before reaching the direct event sink.
- Latest-frame-only behavior is correct for full frames and cursor updates, but
it is unsafe for dirty-region patches because dropping an intermediate region
can leave stale pixels on the client framebuffer.
Deliver:
- preserve ordered dirty-region frames through the worker direct WSS writer
- preserve ordered dirty-region frames inside `SessionRuntime` before the direct
event sink
- preserve ordered dirty-region frames through the Windows direct transport
- preserve ordered dirty-region application in the WPF session presenter
- keep full frames able to supersede pending region queues
- request a throttled full-frame repair if the worker direct region queue
overflows
- add client diagnostics for frame sequence gaps and regions received before a
baseline
- keep input, cursor, clipboard, file upload, session lifecycle, direct worker
WSS, and backend gateway fallback unchanged
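The queueing rules above can be sketched as follows. This is a minimal illustration in Python, not the worker's C++ code. `RegionFrameQueue`, `max_regions`, and the frame dictionary shape are hypothetical names, and clearing the backlog on overflow is only one possible policy.

```python
from collections import deque


class RegionFrameQueue:
    """Ordered queue for dirty-region frames (hypothetical sketch)."""

    def __init__(self, max_regions=64):
        self.max_regions = max_regions  # assumed bound, not a documented value
        self.pending = deque()
        self.repair_requested = False

    def enqueue(self, frame):
        if frame["kind"] == "full":
            # A full frame repaints everything, so pending frames are obsolete.
            self.pending.clear()
            self.pending.append(frame)
            self.repair_requested = False
            return
        if len(self.pending) >= self.max_regions:
            # Dropping one intermediate region would leave stale pixels, so
            # discard the backlog and ask the producer for a full-frame repair.
            self.pending.clear()
            self.repair_requested = True
            return
        self.pending.append(frame)

    def drain(self):
        """Return queued frames in arrival order and empty the queue."""
        frames = list(self.pending)
        self.pending.clear()
        return frames
```

Unlike a latest-frame-only slot, this queue never silently replaces one region with a newer one. Regions are dropped only on the overflow path, which also raises the repair flag.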
Status: accepted baseline on the test Docker environment as of 2026-04-26.
Proof summary:
- worker image: `rap-rdp-worker:rdp-p1-region-order2`
- live test container: `rap_worker_smoke`
- backend `go test ./...`: PASS
- Windows solution build: PASS
- worker graphics adapter probe: PASS
- worker cursor adapter probe: PASS
- worker service adapter protocol probe: PASS
- worker direct data-plane bind valid probe: PASS
- worker Redis registration: `worker:registration:rdp-worker-1` reports
`status=online`
- manual visual smoke: PASS for idle Task Manager updates without local input,
Start menu/hover without persistent artifacts, window drag usability, mouse,
keyboard, and session close
- known limitation: window drag moves only a frame, as older RDP clients do, and
the repaint after release is not polished
Current code boundaries:
- `SessionRuntime::RequestDirectFullFrameRepair`
- `SessionRuntime::DrainAndPublishRenderNotifications`
- `DirectWssEventSink::EnqueueEvent`
- `SessionGatewayClient::QueueFrameEnvelope`
- `SessionWindow::QueueFrameForPresentation`
- `SessionWindowViewModel::ApplyFramePayload`
Manual acceptance result:
- Start menu/menu hover does not leave persistent stale regions.
- Task Manager graph/percent updates continue without local input.
- Mouse and keyboard responsiveness did not regress.
- Session close works normally.
- Window drag is workable but moves only a frame and repaints imperfectly after
release; this belongs to the next performance/quality layer.
### RDP-Perf-6: Dirty-Region Direct Binary Contract
Goal:
- make dirty-region direct render explicit at the `RAP2` binary contract level
- keep full-frame binary support as baseline/recovery fallback
- keep backend gateway JSON/base64 fallback unchanged
- avoid routing high-rate binary regions through Redis/backend
Status: implemented and build/probe/live-smoke-proven on the test Docker
environment as of 2026-04-26 using direct worker WSS and
`rap-rdp-worker:rdp-perf6-dirty-region`.
Implementation:
- Worker direct WSS emits `render.frame.full` for first frame, attach/reattach,
resize, region-loss repair, invalid region fallback, and debug fallback.
- Worker direct WSS emits `render.frame.region` for BGRA32 dirty-region
payloads from the current classic GDI region-first path.
- Region metadata includes full desktop dimensions, region coordinates,
`region_stride`, `region_format=BGRA32`, payload length, sequence, and
capture/input timing fields.
- Worker diagnostics include `full_frame_sent`, `region_frame_sent`,
`region_bytes`, `full_frame_bytes`, `region_savings_percent`,
`diff_time_ms`, `render_update_reason`, and
`fallback_to_full_frame_reason`.
- Windows direct transport accepts `render.frame.full`,
`render.frame.region`, and legacy `session.frame` binary messages.
- Windows presenter keeps a per-session framebuffer and patches region bytes
into it before presenting the updated WPF surface.
- Smoke proof showed baseline `render.frame.full` at `3,686,400` bytes and
dirty-region `render.frame.region` payloads such as `16,384`, `163,840`,
`327,680`, and `655,360` bytes, with observed savings up to `99.56%`.
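As an illustration of the presenter-side patching and the savings figure, here is a minimal Python sketch. The function names are hypothetical, and the savings formula is an assumption about how `region_savings_percent` could be derived. The real presenter is WPF code and may differ.

```python
BYTES_PER_PIXEL = 4  # BGRA32


def patch_region(framebuffer, desktop_width, x, y, width, height,
                 region, region_stride):
    """Copy a BGRA32 region into a full-desktop BGRA32 bytearray in place."""
    row_bytes = width * BYTES_PER_PIXEL
    for row in range(height):
        src = row * region_stride  # the stride may include row padding
        dst = ((y + row) * desktop_width + x) * BYTES_PER_PIXEL
        framebuffer[dst:dst + row_bytes] = region[src:src + row_bytes]


def region_savings_percent(region_bytes, full_frame_bytes):
    """Share of the full-frame payload avoided by sending only the region."""
    return round(100.0 * (1.0 - region_bytes / full_frame_bytes), 2)
```

With the byte counts reported above, `region_savings_percent(16384, 3686400)` gives `99.56`, which agrees with the best observed saving. A full frame of `3,686,400` bytes is consistent with, for example, a 1280x720 BGRA32 desktop, although the resolution is not stated here.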
Boundaries preserved:
- no backend session lifecycle changes
- no organization/auth/policy changes
- no `data_plane_token` contract changes
- no clipboard or file-transfer semantic changes
- no RDPGFX default enablement
- no mesh/VPN/relay/QUIC/WebRTC work
- backend gateway fallback remains available
### RDP-A5: Clipboard/File Adapters
Deliver:
- move current cliprdr/file logic behind adapter boundaries
- no behavior change
- policy enforcement unchanged
### RDP-A6: RDPGFX Foundation
Deliver:
- gated RDPGFX surface event support
- fallback to GDI region updates
- no default enable until stable
Status: build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26.
Notes:
- `RDP_WORKER_RDPGFX_ENABLED=true` is the explicit gated switch.
- The default runtime path remains classic GDI region-first.
- The current test RDP target failed gated RDPGFX with a connection reset before `rdp.gfx channel_connected`, so no RDPGFX surface lifecycle proof is available for that target yet.
- Report: `artifacts/rdp-perf4-report.md`
### RDP-A7: Encoded/Adaptive Render
Deliver:
- encoded display payloads where negotiated
- adaptive quality profiles
- weak-channel policy
## 8. Acceptance Criteria For New RDP Adapter
- idle remote screen changes are visible without local mouse/keyboard input
- first click acts on remote UI, not only focus
- pointer hover updates are visible
- keyboard does not lose characters
- detach/reattach/takeover do not recreate remote session
- worker death marks session failed/recoverable correctly
- clipboard and file transfer remain policy-enforced
- direct worker WSS is preferred and fallback remains working
- input latency is not affected by render/file/telemetry pressure
@@ -0,0 +1,467 @@
# RDP Stage 5.2 Design Pass - Server-To-Client File Download
Status: design-complete proposal, no runtime implementation in this step.
Date: 2026-04-26
This document defines the target Stage 5.2 implementation shape for safe
server-to-client file download in the RDP service. It preserves the accepted
RDP Adapter baseline, direct worker WSS, backend gateway fallback, and the
restricted `RAP_Transfers` drive visibility model.
## 1. Baseline
Already accepted:
- client-to-server upload to worker-controlled per-session storage
- restricted FreeRDP drive redirection
- uploaded files visible inside remote Windows through `RAP_Transfers`
- text clipboard
- direct worker WSS with backend gateway fallback
- C++ RDP Adapter as the active runtime
Not implemented yet:
- server-to-client file download
- bidirectional file-transfer runtime behavior
- remote filesystem browser
- Windows session agent
- SMB/WebDAV delivery
- arbitrary remote path selection
## 2. Recommended V1 Model
Use the existing restricted `RAP_Transfers` redirected drive, but add a
dedicated remote-to-client drop zone inside it.
Recommended visible layout:
```text
RAP_Transfers\
FromClient\ # future normalized upload destination
ToClient\ # remote user places files here for download
README.txt # describes policy and size limits
```
For backward compatibility, the current upload path may continue placing files
at the visible drive root until a later cleanup stage. Stage 5.2 should add
`ToClient` first and avoid breaking accepted upload behavior.
Why this model:
- reuses the already proven restricted drive boundary
- exposes no worker parent directories
- needs no Windows agent
- needs no SMB/WebDAV service
- gives the remote user a clear, auditable action: copy file into `ToClient`
- keeps server-to-client transfer policy-enforced in the real data path
- keeps the Access Client independent from RDP internals
## 3. Non-Goals
Stage 5.2 must not implement:
- arbitrary remote path download
- remote filesystem browsing
- recursive folder download
- drag-and-drop shell integration
- server-to-client clipboard file lists
- shared folders beyond the restricted per-session visible directory
- SMB, WebDAV, HTTP drop service, or Windows agent
- direct host filesystem exposure
- file execution
- background sync
- cross-session persistent shared storage
## 4. Data Flow
```mermaid
sequenceDiagram
participant Remote as "Remote Windows Session"
participant Drive as "RAP_Transfers\\ToClient"
participant Worker as "RDP Adapter Worker"
participant Backend as "Backend Gateway Fallback"
participant Client as "Access Client"
Remote->>Drive: User copies file into ToClient
Worker->>Worker: Detect stable regular file
Worker->>Worker: Sanitize name, size check, hash snapshot
Worker->>Client: file_download.available
Client->>Worker: file_download.start
Worker->>Client: file_download.chunk
Client->>Worker: file_download.ack
Worker->>Client: file_download.completed
Worker->>Backend: Audit/status event
```
Backend gateway fallback follows the same logical events but relays them
through the existing backend gateway path. Direct worker WSS should be preferred
when available.
## 5. Worker Detection Model
The worker owns the only trusted observation point: the local per-session
visible directory that FreeRDP exposes as `RAP_Transfers`.
Detection rules:
- watch only `<transfer_root>/<session_id>/visible/ToClient`
- ignore directories in Stage 5.2
- ignore hidden/temp files such as `.part`, `.tmp`, `~$*`, and worker-owned
transfer temp names
- wait until size and modified time are stable for at least two checks
- open the file read-only after stability is observed
- use no-follow/openat style APIs where available
- `fstat` the opened descriptor and reject non-regular files
- verify the canonical path remains inside the `ToClient` directory
- compute hash from the opened file snapshot
- reject files above the configured max size before transfer starts
The worker must never trust a remote-supplied path. It only trusts a sanitized
relative file name derived from the controlled drop directory.
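A minimal sketch of the detection rules, assuming a POSIX worker host. It is written in Python for brevity, not in the worker's C++. `probe`, `is_stable`, and `MAX_FILE_SIZE` are hypothetical names.

```python
import os
import stat

MAX_FILE_SIZE = 25 * 1024 * 1024  # assumed limit, matching the 25 MiB default
IGNORED_SUFFIXES = (".part", ".tmp")


def probe(drop_dir, name, max_size=MAX_FILE_SIZE):
    """Return (size, mtime_ns) for an acceptable regular file, else None."""
    if "/" in name or "\\" in name or name in (".", ".."):
        return None
    if name.startswith((".", "~$")) or name.endswith(IGNORED_SUFFIXES):
        return None
    dir_fd = os.open(drop_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        try:
            # O_NOFOLLOW refuses a symlink in the final path component;
            # O_NONBLOCK keeps a FIFO from blocking the open call.
            fd = os.open(name, os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK,
                         dir_fd=dir_fd)
        except OSError:
            return None
        try:
            info = os.fstat(fd)  # inspect the opened descriptor, not the path
            if not stat.S_ISREG(info.st_mode) or info.st_size > max_size:
                return None
            return (info.st_size, info.st_mtime_ns)
        finally:
            os.close(fd)
    finally:
        os.close(dir_fd)


def is_stable(samples, required=2):
    """True when the last `required` probes are identical and not None."""
    tail = samples[-required:]
    return (len(tail) == required and tail[0] is not None
            and all(sample == tail[0] for sample in tail))
```

The sketch closes the descriptor after each probe. A real detector would more likely keep the descriptor open after the stability check, so that the hash and the chunks come from the same file snapshot.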
## 6. File Identity
Each downloadable file gets an opaque `file_id`.
Recommended `file_id` input:
- session id
- stable relative file name
- file size
- modified timestamp
- inode/device where available
- SHA-256 content hash or snapshot hash
The Access Client never sends a worker filesystem path. It requests download by
`file_id` only.
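One way to derive such an identifier is to hash the listed inputs together. The sketch below is Python for illustration only. `make_file_id` and its parameter names are hypothetical, and the exact encoding is an assumption.

```python
import hashlib


def make_file_id(session_id, relative_name, size, mtime_ns, inode, device,
                 content_sha256):
    """Derive an opaque id; changing any input yields a different id."""
    fields = [session_id, relative_name, str(size), str(mtime_ns),
              str(inode), str(device), content_sha256]
    # A NUL separator keeps adjacent fields from running together.
    material = "\x00".join(fields).encode("utf-8")
    return hashlib.sha256(material).hexdigest()
```

Because the id changes whenever the size, timestamp, or content hash changes, a file modified after detection gets a new `file_id` and the old id becomes stale. The worker still has to resolve the id through its own lookup table; the id carries no path the client could use.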
## 7. Policy Model
Existing `file_transfer_mode` values remain:
| Mode | Upload client -> server | Download server -> client |
| --- | --- | --- |
| `disabled` | blocked | blocked |
| `client_to_server` | allowed | blocked |
| `server_to_client` | blocked | allowed |
| `bidirectional` | allowed | allowed |
Stage 5.2 implements only the server-to-client side of this matrix. It must not
regress the accepted client-to-server upload path.
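The matrix can be expressed as a small lookup that fails closed. This is a Python sketch for illustration; `ALLOWED_DIRECTIONS` and `is_transfer_allowed` are hypothetical names, while the mode strings come from the table.

```python
ALLOWED_DIRECTIONS = {
    "disabled": frozenset(),
    "client_to_server": frozenset({"upload"}),
    "server_to_client": frozenset({"download"}),
    "bidirectional": frozenset({"upload", "download"}),
}


def is_transfer_allowed(mode, direction):
    """Return True only for a known mode that permits the direction."""
    # Unknown or missing modes map to the empty set, so they block everything.
    return direction in ALLOWED_DIRECTIONS.get(mode, frozenset())
```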
Policy enforcement points:
- backend session gateway and data-plane token allowed channels
- worker runtime before publishing availability
- worker runtime before reading/sending chunks
- Windows client UI before presenting download actions
- Windows client transport before sending download requests
UI checks are convenience only. Backend and worker checks are the security
boundary.
## 8. Lifecycle Gating
Download is allowed only while all of the following are true:
- session state is `active`
- attachment is the current controller
- data-plane token allows file download
- worker runtime still owns the active session
- resource policy allows server-to-client file transfer
Download must be blocked or cancelled when:
- session is detached
- attachment is superseded by takeover
- session is failed
- session is terminated
- worker lease/runtime is stale
- direct WSS token expires and no valid fallback path exists
In-flight downloads should be aborted on detach, takeover, failure, terminate,
and window close. Stage 5.2 should not continue background downloads for an
inactive controller.
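The gate can be sketched as a single check that returns a block reason. This is illustrative Python; `DownloadContext`, its field names, and the reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DownloadContext:
    session_state: str
    is_current_controller: bool
    token_allows_download: bool
    worker_owns_session: bool
    policy_allows_download: bool


def download_block_reason(ctx):
    """Return None when download may proceed, else a short reason string."""
    if ctx.session_state != "active":
        return "session_" + ctx.session_state
    if not ctx.is_current_controller:
        return "attachment_superseded"
    if not ctx.token_allows_download:
        return "token_denied"
    if not ctx.worker_owns_session:
        return "worker_stale"
    if not ctx.policy_allows_download:
        return "policy_denied"
    return None
```

Running the same check before publishing availability and again before each chunk lets an in-flight download be aborted as soon as any condition stops holding.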
## 9. Channel And Message Contract
Use a new logical channel:
```text
file_download
```
This keeps upload and download direction explicit while remaining under the
broader `file_transfer` adapter concept.
Recommended events:
```text
file_download.available
file_download.start
file_download.chunk
file_download.ack
file_download.progress
file_download.cancel
file_download.completed
file_download.failed
file_download.blocked
```
Recommended common fields:
```json
{
"session_id": "...",
"attachment_id": "...",
"transfer_id": "...",
"file_id": "...",
"direction": "server_to_client",
"file_name": "report.txt",
"file_size": 12345,
"offset": 0,
"chunk_size": 262144,
"total_size": 12345,
"content_hash": "sha256:...",
"sequence": 1,
"status": "in_progress"
}
```
For Stage 5.2, chunks may use JSON/base64 payloads for compatibility with the
current reliable control envelope model. A future data-plane stage can add
binary chunk frames, but that must be a separate performance stage.
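A minimal sketch of JSON/base64 chunk envelopes, in Python for illustration. The `type` and `payload_base64` field names are assumptions that the contract above does not define. The other fields follow the recommended common fields.

```python
import base64
import json

CHUNK_SIZE = 256 * 1024  # the 256 KiB raw chunk default


def iter_chunk_envelopes(data, session_id, transfer_id, file_id,
                         chunk_size=CHUNK_SIZE):
    """Yield one JSON envelope per chunk, in offset order."""
    offsets = range(0, len(data), chunk_size)
    for sequence, offset in enumerate(offsets, start=1):
        raw = data[offset:offset + chunk_size]
        yield json.dumps({
            "type": "file_download.chunk",       # hypothetical field
            "session_id": session_id,
            "transfer_id": transfer_id,
            "file_id": file_id,
            "direction": "server_to_client",
            "offset": offset,
            "chunk_size": len(raw),
            "total_size": len(data),
            "sequence": sequence,
            "payload_base64": base64.b64encode(raw).decode("ascii"),  # hypothetical field
        })
```

Base64 inflates each payload by about one third, which is part of why binary chunk frames are left to a later performance stage.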
## 10. Chunking, Retry, And Backpressure
Recommended initial limits:
- max file size: reuse current 25 MiB default unless policy overrides are added
- max raw chunk size: reuse current 256 KiB default
- one concurrent download per session, or a small bounded number
- bounded per-connection send queue
- input and control always outrank file chunks
- display/cursor must not be blocked by file transfer
Reliability model:
- client sends `file_download.start`
- worker streams chunks in offset order
- client writes to a temp local file
- client acknowledges received offsets or chunk indexes
- client verifies final hash before finalizing the local file
- cancel stops worker reads and deletes local temp file
- retry may request restart from offset only if the opened file snapshot is
still the same; otherwise restart from zero with a new `file_id`
## 11. Local Client Destination
The Windows client should not auto-save files silently into arbitrary locations.
Recommended MVP behavior:
- show `file_download.available`
- user selects a local save location or confirms a default downloads directory
- write to a temp file first
- verify hash
- rename into final destination
- never execute downloaded files
- show localized progress, blocked, failed, cancelled, and completed messages
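The temp-file, hash, and rename steps can be sketched as below. This is Python for illustration, since the real client is a Windows application. `finalize_download`, its return values, and the bare hex digest format are assumptions.

```python
import hashlib
import os


def finalize_download(temp_path, final_path, expected_sha256,
                      overwrite_confirmed=False):
    """Verify the temp file and move it into place; return a status string."""
    digest = hashlib.sha256()
    with open(temp_path, "rb") as handle:
        for block in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(block)
    if digest.hexdigest() != expected_sha256:
        os.remove(temp_path)  # never keep unverified content
        return "hash_mismatch"
    if os.path.exists(final_path) and not overwrite_confirmed:
        return "needs_confirmation"  # the UI must ask before overwriting
    os.replace(temp_path, final_path)
    return "completed"
```

Writing to a temp file first means a cancelled or corrupted transfer never appears under the final name.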
## 12. Security Constraints
Mandatory constraints:
- default policy remains `disabled`
- no arbitrary worker filesystem paths
- no arbitrary remote Windows paths
- no parent path traversal
- no symlink escape
- no non-regular files
- no overwrite without explicit user confirmation on the client side
- no automatic file execution
- no recursive directories
- no server-to-client transfer after detach/takeover/failure/terminate
- audit metadata only, never file contents
- downloaded file names are display names, not trusted paths
- worker cleanup must remove per-session visible storage on termination/failure
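The file-name constraints can be illustrated with a strict validator that rejects rather than repairs suspicious names. This is a Python sketch; `sanitize_display_name` is a hypothetical helper and the exact rule set is an assumption.

```python
import re


def sanitize_display_name(name):
    """Return the name if it is a safe single path component, else None."""
    if not name or name in (".", ".."):
        return None
    if "/" in name or "\\" in name:
        return None  # no separators, so no traversal and no absolute paths
    if re.match(r"^[A-Za-z]:", name):
        return None  # Windows drive-letter prefix
    if any(ord(ch) < 32 for ch in name):
        return None  # control characters, including NUL
    return name
```

Rejecting outright is simpler to audit than rewriting a name into something safe, and it gives the block event a clear reason to record.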
## 13. Audit And Observability
Audit in PostgreSQL should record:
- `file_download_available` where useful and rate-limited
- `file_download_started`
- `file_download_completed`
- `file_download_cancelled`
- `file_download_failed`
- `file_download_blocked`
Audit details should include:
- organization id
- resource id
- session id
- attachment id
- user id
- worker id
- file name
- size
- content hash
- policy mode
- reason for block/failure
Do not store file contents in PostgreSQL, Redis, or audit logs.
Worker/client diagnostics should expose:
- detected file count
- stable-detection latency
- chunk throughput
- retry count
- cancel count
- queue depth
- bytes sent
- hash verification result
## 14. Backend Gateway Fallback
The backend gateway remains a fallback and debug path.
Rules:
- direct worker WSS is preferred for file chunks
- backend gateway fallback must enforce the same policy and lifecycle gates
- fallback may be slower but must remain correct
- chunks must not be stored permanently in Redis
- Redis is only live routing/coordination
- PostgreSQL remains authoritative for session/resource/policy state
## 15. Verification Matrix
Policy:
| Case | Expected |
| --- | --- |
| `disabled` | no availability actions, start blocked |
| `client_to_server` | upload still works, download blocked |
| `server_to_client` | download works, upload blocked |
| `bidirectional` | upload and download work |
Lifecycle:
| Case | Expected |
| --- | --- |
| active current controller | download allowed by policy |
| detached | availability/start/chunks blocked or cancelled |
| old attachment after takeover | blocked |
| failed session | blocked/cancelled |
| terminated session | blocked/cancelled |
| worker death | client shows failure; no download stays silently queued |
Security:
| Case | Expected |
| --- | --- |
| normal text file | downloads, hash matches |
| normal binary file | downloads, hash matches |
| too large file | blocked |
| path traversal name | blocked |
| absolute path-like name | blocked |
| symlink/non-regular file | blocked |
| file modified during transfer | fail/restart safely |
| duplicate final local file | user confirmation required |
Regression:
| Area | Expected |
| --- | --- |
| rendering | unchanged |
| mouse/keyboard | unchanged |
| clipboard | unchanged |
| upload | unchanged |
| detach/reattach/takeover | unchanged |
| backend gateway fallback | unchanged |
## 16. Future Work
Future stages may add:
- binary file chunk frames over direct data plane
- resumable transfer manifests
- folder packaging as explicit archive download
- organization-level file DLP scanning
- malware scanning hooks
- remote shell integration or Windows agent
- WebDAV/SMB-like drop service
- direct peer/relay data-plane optimization
None of these belong in Stage 5.2 implementation.
## 17. Proposed Stage 5.2 Implementation Prompt
Proceed with Stage 5.2 only - RDP server-to-client file download.
Goal:
Implement safe, policy-aware download from the remote RDP session to the
Windows Access Client using the restricted `RAP_Transfers\ToClient` drop zone.
Strict rules:
- do NOT implement arbitrary remote path download
- do NOT implement remote filesystem browser
- do NOT implement recursive folder transfer
- do NOT implement SMB/WebDAV/Windows agent
- do NOT expose any worker path outside the per-session visible directory
- do NOT change RDP rendering/input/clipboard behavior
- do NOT remove backend gateway fallback
- do NOT implement binary file chunk frames yet
- do NOT start mesh/VPN/relay work
Scope:
1. Create a per-session `visible/ToClient` directory inside the existing
restricted `RAP_Transfers` mapping.
2. Detect stable regular files in `ToClient` only.
3. Sanitize file names, reject traversal/absolute paths/non-regular files, and
enforce size limits.
4. Add `file_download` logical channel and envelopes:
`available`, `start`, `chunk`, `ack`, `progress`, `cancel`, `completed`,
`failed`, and `blocked`.
5. Enforce `file_transfer_mode` for `server_to_client` and `bidirectional` in
backend gateway, data-plane token channels, worker runtime, and Windows
client.
6. Gate all download actions to active current-controller sessions only.
7. Stream chunks reliably with bounded queues and hash verification.
8. Keep input/control/render/cursor priority above file chunks.
9. Add localized Windows client feedback and safe local temp-file finalization.
10. Audit start/completion/cancel/failure/block events without storing file
contents.
Verification:
- `disabled` blocks download
- `client_to_server` blocks download and upload still works
- `server_to_client` allows download and blocks upload
- `bidirectional` allows upload and download
- text file downloads and hash matches
- binary file downloads and hash matches
- too-large file blocked
- traversal/symlink/non-regular files blocked
- download blocked after detach
- old client after takeover blocked
- worker failure cancels/blocks download
- rendering, mouse, keyboard, clipboard, upload, reconnect, and takeover remain
working
- backend gateway fallback remains available
Deliver:
- backend policy/token/gateway changes
- worker download detector and chunk sender
- Windows client download UI/path
- localized messages
- smoke script/docs update
- PASS/FAIL matrix with logs and hash evidence
@@ -0,0 +1,597 @@
# RDP Service C++ Performance Target
## Status
This is the paused RDP service performance direction. The implementation name is `RDP Adapter`: a concrete `Service Adapter` that translates Microsoft RDP into the platform session/data-plane protocol. The common adapter contract is defined in `docs/architecture/SERVICE_ADAPTER_PROTOCOL.md`; the RDP-specific runtime plan is defined in `docs/architecture/RDP_ADAPTER_RUNTIME.md`.
Current implementation status:
- RDP-A1 / RDP-Perf-1 is build-proven.
- RDP-A2 adapter boundary is live-smoke-proven on the test Docker environment as of 2026-04-26: runtime code now goes through `RdpAdapterRuntime`.
- RDP-Perf-2 runtime instrumentation is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26.
- RDP-Perf-3 region-first BGRA fallback is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26.
- RDP-Perf-4 gated RDPGFX foundation is build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26. The current live RDP target resets the connection when RDPGFX is advertised, so RDPGFX remains disabled by default.
- RDP-A4 CursorAdapter is build-proven and live-smoke-proven on the test Docker environment as of 2026-04-26. Cursor events now flow as latest-only adapter-origin `cursor.update` events over direct worker WSS and remain compatible with backend gateway fallback.
- RDP-Perf-5A GDI repaint cadence hardening is build-proven and smoke-proven on the test Docker environment as of 2026-04-26. Region/interactive frames now publish on a 33 ms cadence, hot-loop lease renewal was removed, and backend gateway fallback remains compatible.
- RDP-Perf-6 dirty-region direct binary contract is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26. Direct `RAP2` render frames now distinguish full frames from dirty-region patches and carry payload savings diagnostics; observed runtime dirty regions reduced payloads from the `3,686,400` byte full frame to examples such as `16,384`, `163,840`, and `655,360` bytes.
- Current accepted baseline is `rap-rdp-worker:rdp-p1-region-order2`: dirty-region delivery is preserved in order through `SessionRuntime`, worker direct WSS, Windows transport, and WPF presenter queues. Manual visual smoke accepted idle repaint, Start menu/hover, keyboard, mouse, and session close.
- Remaining visual limitation is quality/performance rather than correctness: window drag behaves like older/slow-link RDP clients by moving a frame, and repaint after release is usable but not polished.
- FreeRDP remains the internal substrate behind the adapter boundary until region-first/event-driven replacement paths are live-proven.
- RDP performance work is paused by product decision. When RDP work explicitly
resumes, the next RDP step should continue from the stable GDI region-first
path unless an RDPGFX-compatible target is added for gated testing.
The C++ worker remains the primary RDP runtime. The goal is not to rewrite the
worker in another language. The goal is to replace the slow parts of the RDP
service internals while preserving the proven backend/session/cluster/data-plane
contracts.
The C# RDP service skeleton is superseded as a runtime direction and must not be
used for implementation unless explicitly re-approved later.
## Current Problem
The current MVP proved the hard lifecycle behavior:
- connect
- active state
- detach without killing the remote session
- reattach
- takeover
- terminate
- clipboard text
- file upload to worker storage
- direct worker WSS data-plane
However, the render/input experience is not acceptable.
Root cause:
- the worker uses FreeRDP successfully for the RDP connection
- but the production render path still behaves like framebuffer capture
- the worker copies large BGRA buffers and publishes them as RAP frames
- input is fast enough in parts of the path, but visual feedback depends on slow
snapshot/frame delivery
On a LAN faster than 1 Gbit/s this should not be slow. The bottleneck is the RDP
service render algorithm, not the network.
## Non-Negotiable Boundaries
Do not change:
- backend control plane
- organization/session lifecycle
- PostgreSQL source of truth
- Redis live coordination model
- worker leases and assignment contracts
- `data_plane_token` contracts
- direct worker WSS transport
- backend gateway fallback
- clipboard/file-transfer policy semantics
Only the RDP service adapter internals may change.
## Target Design
Keep the worker in C++.
Use C++ to own the RDP service internals:
- input adapter
- graphics adapter
- cursor adapter
- virtual channel adapters
- quality/adaptive controller
- render sink to existing RAP data-plane
FreeRDP may remain temporarily as a connection/security/channel substrate, but
the target production render path must not be FreeRDP GDI framebuffer snapshots.
If a FreeRDP layer blocks access to the needed RDP graphics primitives, replace
that narrow layer with project-owned C++ code rather than rewriting the full
service in another language.
## High-Performance RDP Model
Fast RDP clients do not repeatedly send full desktop images. They use protocol
updates:
- dirty rectangles
- surface commands
- cursor updates
- bitmap/cache updates
- RDPGFX dynamic virtual channel
- RemoteFX Progressive / ClearCodec / H.264 AVC420 / AVC444 / HEVC where
negotiated
- adaptive graphics and quality selection
References:
- https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/da5c75f9-cd99-450c-98c4-014a496942b0
- https://learn.microsoft.com/en-us/azure/virtual-desktop/graphics-encoding
- https://freerdp-freerdp.mintlify.app/concepts/codecs
## New Internal Layers
```mermaid
flowchart LR
Target["Windows RDP Server"]
RdpCore["C++ RDP Core / FreeRDP Substrate"]
Graphics["Graphics Adapter"]
Input["Input Adapter"]
Channels["Virtual Channel Adapters"]
DataPlane["Existing Direct Worker WSS"]
Client["RAP Windows Client"]
Target <--> RdpCore
RdpCore --> Graphics
Input --> RdpCore
RdpCore <--> Channels
Graphics --> DataPlane
Channels --> DataPlane
DataPlane <--> Client
```
### Graphics Adapter
The graphics adapter converts RDP graphics primitives into RAP render updates.
Supported update classes:
- `frame_full_bgra` only for baseline/debug/fallback
- `region_bgra` for dirty regions
- `surface_create`
- `surface_delete`
- `surface_map`
- `surface_bits`
- `encoded_frame`
- `cursor_update`
Rules:
- full-frame BGRA is fallback, not the target production path
- direct render remains binary
- backend gateway fallback may keep JSON/base64 compatibility
- stale render updates are droppable
- input never waits behind render
### Input Adapter
Input stays separate from render.
Rules:
- keyboard down/up is reliable and ordered
- mouse button down/up and wheel are reliable and ordered
- mouse move is latest-only/coalesced
- button down must include or be preceded by pointer position
- no RAP focus message may consume the first remote click
- input must not trigger full-frame capture loops
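These rules can be sketched as a queue that coalesces only consecutive mouse moves. This is illustrative Python, not the adapter's C++; `InputQueue` and the event dictionary shape are hypothetical.

```python
class InputQueue:
    """Ordered input queue with latest-only mouse moves (hypothetical sketch)."""

    def __init__(self):
        self._events = []

    def push(self, event):
        kind = event["type"]
        if kind == "mouse_down" and ("x" not in event or "y" not in event):
            raise ValueError("button down must carry the pointer position")
        last = self._events[-1] if self._events else None
        if kind == "mouse_move" and last is not None and last["type"] == "mouse_move":
            # Replace the previous move; only the latest position matters.
            self._events[-1] = event
            return
        # Keys, buttons, and wheel events are never dropped or reordered.
        self._events.append(event)

    def drain(self):
        events, self._events = self._events, []
        return events
```

Coalescing only adjacent moves keeps every move on the correct side of the button and key events around it, so a click still lands where the pointer was when the button went down.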
### Virtual Channel Adapters
Clipboard/file/drive redirection remain isolated:
- clipboard stays text-only until explicitly expanded
- restricted drive mapping remains policy-bound
- file upload/download policies stay enforced in the real data path
## Weak Network Strategy
Weak-channel performance must degrade render before input.
Priority order:
1. input
2. control
3. clipboard
4. key render updates
5. file transfer
6. telemetry
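A strict-priority outbound scheduler following this order could look like the sketch below. This is Python for illustration; the channel keys and `OutboundScheduler` are hypothetical names.

```python
import heapq
import itertools

PRIORITY = {
    "input": 0,
    "control": 1,
    "clipboard": 2,
    "render": 3,
    "file_transfer": 4,
    "telemetry": 5,
}


class OutboundScheduler:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # keeps FIFO order within a channel

    def push(self, channel, message):
        heapq.heappush(self._heap,
                       (PRIORITY[channel], next(self._counter), message))

    def pop(self):
        """Return the highest-priority message, oldest first within a channel."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Strict priority can starve low-priority channels under sustained load. That matches the stated intent for file transfer and telemetry, but a real scheduler might still want bounded queues per channel.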
Render adaptation:
- drop stale render updates
- prefer dirty regions over full frames
- reduce FPS before increasing input latency
- reduce color mode where useful
- use text-priority mode for office/admin workloads
- use encoded/compressed graphics payloads where negotiated
- never let file transfer or VPN-like bulk traffic starve RDP input/control
Quality profiles:
- `emergency_grayscale`
- `low_bandwidth`
- `text_priority`
- `balanced`
- `high_quality`
Color modes:
- full color
- 256 colors
- 64 colors
- 16 colors
- grayscale
## Migration Stages
### RDP-A1 / RDP-Perf-1: Boundary And Audit
Create C++ graphics/input adapter boundaries and document the replacement path.
Do not change runtime behavior yet.
Deliver:
- common `Service Adapter` channel contract
- RDP Adapter runtime plan
- `graphics_adapter` interface
- render update model
- compile-safe probe
- docs update
Status: completed and build-proven.
### RDP-Perf-2: Runtime Instrumentation And Source Selection
Measure existing FreeRDP update callbacks separately from frame publishing.
Deliver:
- update callback rate
- dirty region dimensions
- framebuffer copy time
- binary send time
- client render time
- first-click trace without RAP focus interference
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
Smoke command:
```powershell
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
-PreferDirectDataPlane:$true `
-AllowInsecureDirectDataPlaneTlsForSmoke:$true `
-DirectDataPlaneConnectTimeoutMs 2500 `
-DirectDataPlaneColorMode full_color `
-SkipOrgSwitchAndTokenRefresh
```
Smoke evidence:
- worker image: `rap-rdp-worker:rdp-perf2-instrumented`
- session id: `1328b0dd-c5f9-4b15-b2ca-6d196ead5823`
- direct data plane selected by the Windows client
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
- one RDP runtime was created for the session
- artifacts:
- `artifacts/rdp-perf2-worker-final.log`
- `artifacts/rdp-perf2-client-final.log`
- `artifacts/rdp-perf2-report.md`
Measured callback sources:
| Source | Count / behavior |
| --- | --- |
| `BeginPaint` | observed |
| `EndPaint` | observed |
| `BitmapUpdate` | observed and produced dirty region information |
| `RefreshRect` | not observed in smoke |
| `SurfaceBits` | not observed in smoke |
| `SurfaceFrameMarker` | not observed in smoke |
| `SurfaceFrameBits` | not observed in smoke |
| pointer callbacks | not observed in smoke |
Measured conclusions:
- The RDP server/FreeRDP path does emit server-origin graphics callbacks in stable GDI mode.
- Idle or server-origin screen changes can be detected without relying on local mouse/keyboard activity.
- Full framebuffer copy time is not the main bottleneck in the measured smoke run.
- The current render path duplicates work by capturing around both `BitmapUpdate` and `EndPaint`.
- `EndPaint` should become a flush/safety marker rather than a second normal capture producer.
- RDP-Perf-3 should make `BitmapUpdate` dirty regions the default normal render path and reserve full frames for connect/resize/attach/recovery.
### RDP-Perf-3: Region-First BGRA Fallback
Use true dirty regions as the default fallback path.
Deliver:
- no full-frame copy for small dirty updates
- baseline full frame only on connect/resize/attach
- region payloads only for normal UI changes
Status: completed and live-smoke-proven on the test Docker environment as of 2026-04-26.
Smoke evidence:
- worker image: `rap-rdp-worker:rdp-perf3-region-first`
- direct smoke session id: `abc11233-34c4-45a6-a55b-0571a09332a1`
- fallback smoke session id: `ee756839-6a82-49d4-9619-54acf69e1efd`
- direct worker WSS selected and backend gateway fallback separately verified
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed in both direct and fallback smoke
- direct session cleanup state: `terminated`
- fallback session cleanup state: `terminated`
- report: `artifacts/rdp-perf3-report.md`
Measured direct-path results:
| Metric | Result |
| --- | --- |
| new RDP runtime count | 1 |
| direct data-plane binds | 6 |
| worker input apply events | 6 |
| deferred `BitmapUpdate` callbacks | 104 |
| `bitmap_update_flush` captures | 104 |
| region flush captures | 93 |
| full flush captures | 11 |
| periodic duplicate changes | 0 |
| client rendered region frames | 19 |
| client skipped region frames | 0 |
Implementation notes:
- `BitmapUpdate` is now deferred during a paint cycle.
- `EndPaint` flushes the accumulated `BitmapUpdate` dirty region once.
- `EndPaint` no longer performs a second normal change-detect capture when a bitmap update was already flushed.
- The periodic change detector snapshot is synchronized after callback-driven frame capture, avoiding rediscovery of the same changed pixels.
- Direct binary frame metadata now preserves full desktop dimensions separately from region payload dimensions, so the Windows client can patch regions into its framebuffer.
- Backend gateway fallback remains compatible with the existing JSON/base64 path.
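The deferred-update behavior in the notes above can be sketched as a dirty-rectangle accumulator that flushes once per paint cycle. This is illustrative Python; `PaintCycle` and `union_rect` are hypothetical names, and merging into one bounding rectangle is an assumption about how the dirty region is accumulated.

```python
def union_rect(a, b):
    """Smallest (x, y, width, height) rectangle covering both inputs."""
    left = min(a[0], b[0])
    top = min(a[1], b[1])
    right = max(a[0] + a[2], b[0] + b[2])
    bottom = max(a[1] + a[3], b[1] + b[3])
    return (left, top, right - left, bottom - top)


class PaintCycle:
    def __init__(self):
        self.dirty = None
        self.flushed = []

    def on_bitmap_update(self, x, y, width, height):
        # Defer: only widen the pending dirty rectangle, capture nothing yet.
        rect = (x, y, width, height)
        self.dirty = rect if self.dirty is None else union_rect(self.dirty, rect)

    def on_end_paint(self):
        # Flush the accumulated region once; do nothing if nothing is pending.
        if self.dirty is not None:
            self.flushed.append(self.dirty)
            self.dirty = None
```

A single bounding rectangle can cover many unchanged pixels when updates are far apart, so a real implementation might keep a short list of rectangles instead.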
### RDP-Perf-4: RDPGFX Channel Foundation
Capture and parse RDPGFX surface updates where available.
Deliver:
- surface lifecycle
- surface bits updates
- cursor updates
- fallback to region BGRA when RDPGFX unavailable
Status: build-proven and default-path smoke-proven on the test Docker environment as of 2026-04-26.
Implementation:
- RDPGFX stays disabled by default.
- Setting `RDP_WORKER_RDPGFX_ENABLED=true` is the only runtime switch that enables RDPGFX.
- Worker diagnostics now log RDPGFX configuration, channel subscription, channel connection, GDI graphics pipeline initialization, fallback reasons, and normalized FreeRDP surface update callbacks.
- Callback summaries include RDPGFX counters.
- The default classic GDI region-first path remains the active safe path.
Default smoke evidence:
- worker image: `rap-rdp-worker:rdp-perf4-rdpgfx-gated`
- final default smoke session id: `30e80d99-e3b5-428b-aa18-fea65b8db499`
- direct worker WSS selected
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
- session cleanup state: `terminated`
- worker log: `rdp.gfx config requested=false mode=classic_gdi_region_first`
- worker log: `rdp.perf callback_summary ... rdpgfx_requested=false ... frame_capture_region=...`
Gated RDPGFX target compatibility result:
- gated session id: `aa69f606-9217-4579-b438-b7d3ec5e01d0`
- environment: `RDP_WORKER_RDPGFX_ENABLED=true`
- result: failed on the current live RDP target
- observed: `BIO_read returned a system error 104: Connection reset by peer`
- observed: `freerdp_post_connect failed`
- no `rdp.gfx channel_connected` or surface callbacks were observed before reset
- conclusion: the current target must use the default GDI region-first path
Report: `artifacts/rdp-perf4-report.md`
### RDP-Perf-5: Encoded Graphics Payloads
Support encoded graphics payloads over RAP direct data-plane.
Deliver:
- binary encoded payload message type
- client decode strategy
- fallback to region BGRA
### RDP-A4: CursorAdapter
Move cursor handling into the RDP Adapter boundary and keep cursor events
independent from display frame cadence.
Status: completed and live-smoke-proven on the test Docker environment as of
2026-04-26.
Implementation:
- `CursorAdapter` normalizes FreeRDP pointer callbacks into cursor position,
visibility, shape, cache, and mask metadata.
- FreeRDP pointer callbacks are installed and restored inside the RDP runtime
hook boundary.
- Original FreeRDP pointer callbacks are invoked before platform normalization,
preserving FreeRDP internal state.
- `session_cursor_updated` worker events are mapped to platform
`cursor.update` envelopes.
- Direct worker WSS treats cursor as latest-only/droppable and schedules it
separately from binary render frames.
- Backend gateway fallback remains compatible with the same
`session_cursor_updated` event payload.
- Windows client accepts `cursor.update` through the existing render payload
bridge without changing UI layout.
Smoke evidence:
- worker image: `rap-rdp-worker:rdp-a4-cursor-adapter`
- direct smoke session id: `549806aa-c9db-48a9-917e-cf817cf236b5`
- fallback smoke session id: `dee3a856-bee1-4eba-9c10-f62edaf56547`
- direct worker WSS selected in direct smoke
- backend gateway selected in fallback smoke
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
in both direct and fallback smoke
- direct session cleanup state: `terminated`
- fallback session cleanup state: `terminated`
- worker log: `cursor.adapter hooks installed pointer_callbacks=true`
- worker log: `adapter_event channel=cursor type=cursor.update origin=adapter`
- worker log: `rdp.perf callback_summary ... cursor_updates_enqueued=...`
- client log: `SessionWindowViewModel.HandleEnvelopeAsync ... cursor.update`
- report: `artifacts/rdp-a4-cursor-adapter-report.md`
Known limitation:
- Cursor event separation does not by itself fix delayed hover/menu repaint.
The next safe step is a GDI repaint cadence and server-origin update audit on
the stable region-first path.
### RDP-Perf-5A: GDI Repaint Cadence And Hover Feedback Hardening
Fix the first proven stable-path repaint cadence bottlenecks without changing
backend, session lifecycle, data-plane contracts, clipboard/file transfer, or UI
layout.
Status: build-proven and smoke-proven on the test Docker environment as of
2026-04-26.
Implementation:
- FreeRDP event pump performs a bounded immediate drain after a signaled handle
check so already-queued server events are not delayed by the next wait cycle.
- Periodic no-change detection logging is rate-limited to avoid hot-loop log
pressure while the remote screen is idle.
- Worker session runtime renews the worker lease every 5 seconds instead of
performing Redis lease I/O on every render/input loop iteration.
- Region and interactive render notifications use a 33 ms publish interval.
- Full-frame fallback remains at 100 ms.
- Direct worker WSS binary writer uses the same 33 ms interval for
region/interactive frames.
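The publish intervals above amount to a simple time gate. The sketch below is a Python illustration under assumptions: the `PublishThrottle` class is hypothetical, and sharing one last-publish timestamp between region and full frames is a simplification, not a statement about the worker's actual structure.

```python
REGION_INTERVAL_MS = 33       # region and interactive frames (from the notes above)
FULL_FRAME_INTERVAL_MS = 100  # full-frame fallback (from the notes above)


class PublishThrottle:
    """Hypothetical helper: decides whether a frame may be published now."""

    def __init__(self) -> None:
        self._last_publish_ms = None

    def should_publish(self, now_ms: int, is_full_frame: bool) -> bool:
        interval = FULL_FRAME_INTERVAL_MS if is_full_frame else REGION_INTERVAL_MS
        if self._last_publish_ms is not None and now_ms - self._last_publish_ms < interval:
            return False  # too soon; the caller keeps only the latest pending frame
        self._last_publish_ms = now_ms
        return True
```

A 33 ms gate caps region publishes at roughly 30 per second, so it only limits bursts; it does not create frames the target never emits.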
Smoke evidence:
- worker image: `rap-rdp-worker:rdp-perf5a-repaint-cadence`
- direct smoke session id: `0cca4974-2a82-48dc-a0f6-1036ea8e98f0`
- fallback smoke session id: `16deb09e-1c44-4e9d-8448-93b42ac66ed0`
- direct worker WSS selected in direct smoke
- backend gateway selected in fallback smoke
- login/resource/start/input/detach/attach/takeover/taken_over/logout passed
in both direct and fallback smoke
- direct session cleanup state: `terminated`
- fallback session cleanup state: `terminated`
- report: `artifacts/rdp-perf5a-report.md`
Measured direct-path results:
| Metric | Result |
| --- | --- |
| client rendered frames observed | 65 |
| client binary frames observed | 66 |
| direct region publishes at 33 ms | 54 |
| direct outbound FPS max | 9.705640 |
| render seen FPS max | 26.386542 |
| render published FPS max | 9.459327 |
| direct backpressure frame drops | 0 |
| render pending max | 0 |
Measured conclusion:
- Region/interactive frames now leave the worker promptly when server-origin
changes arrive.
- The direct smoke did not show queued FreeRDP event-handle bursts after the new
immediate drain path: `event_pump_drained_checks=0`.
- The current live target still emits idle/server-origin region changes at
roughly 1 FPS in observed stable GDI mode.
- Manual UX validation by a human operator is still required before hover/menu
  responsiveness can be claimed as accepted.
### RDP-Perf-6: Dirty-Region Direct Binary Render Contract
Replace full-frame-only direct binary render updates with explicit dirty-region
direct binary render updates while preserving full-frame fallback.
Deliver:
- direct `RAP2` `message_type=render.frame.full`
- direct `RAP2` `message_type=render.frame.region`
- one bounding-rectangle dirty-region BGRA payload for normal UI changes
- full-frame fallback for first frame, attach/reattach, resize, recovery,
invalid region state, and debug/fallback mode
- worker diagnostics for `full_frame_sent`, `region_frame_sent`,
`region_bytes`, `full_frame_bytes`, `region_savings_percent`,
`diff_time_ms`, `render_update_reason`, and
`fallback_to_full_frame_reason`
- Windows direct receiver support for explicit full/region message types
- Windows framebuffer-backed region patching
- backend gateway JSON/base64 fallback unchanged
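Framebuffer-backed region patching can be sketched as a row-by-row copy of a tightly packed BGRA payload into the full-desktop buffer. This Python version is illustrative only; the function name, parameter names, and the tightly packed payload layout are assumptions, and the real Windows client is not written in Python.

```python
BYTES_PER_PIXEL = 4  # BGRA


def patch_region(framebuffer: bytearray, fb_width: int, fb_height: int,
                 x: int, y: int, region_width: int, region_height: int,
                 payload: bytes) -> None:
    """Copy a tightly packed BGRA region into a full-desktop framebuffer in place."""
    if x < 0 or y < 0 or x + region_width > fb_width or y + region_height > fb_height:
        # Invalid region state: the caller should fall back to a full frame.
        raise ValueError("region outside framebuffer")
    if len(payload) != region_width * region_height * BYTES_PER_PIXEL:
        raise ValueError("payload size does not match region dimensions")
    row_bytes = region_width * BYTES_PER_PIXEL
    for row in range(region_height):
        dst = ((y + row) * fb_width + x) * BYTES_PER_PIXEL
        src = row * row_bytes
        framebuffer[dst:dst + row_bytes] = payload[src:src + row_bytes]
```

The bounds check is why frame metadata has to carry the full desktop dimensions separately from the region dimensions: without them the receiver cannot compute the destination offset.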
Status: implemented and build/probe/live-smoke-proven on the test Docker
environment as of 2026-04-26 using the current RDP target.
Build/probe evidence:
- worker image build: `rap-rdp-worker:rdp-perf6-dirty-region`
- Windows client build: PASS
- worker graphics adapter probe: PASS
- worker direct data-plane bind valid probe: PASS
- worker service adapter protocol probe: PASS
- direct worker WSS smoke: PASS
- backend gateway fallback smoke: PASS
Implementation notes:
- The current classic GDI region-first display path remains the source of
dirty-region payloads.
- The direct worker WSS sender no longer labels all binary render payloads as
`session.frame`; it uses `render.frame.full` and `render.frame.region`.
- The Windows transport still normalizes direct render frames into the existing
application-level `session.frame` pipeline, so session lifecycle, input,
clipboard, and file-transfer behavior are unchanged.
- The Windows presenter keeps a session framebuffer and applies region patches
into it before presenting the updated surface.
- Backend gateway fallback remains JSON/base64 and is not used as the
production high-rate render relay.
- Runtime payload examples: full baseline `3,686,400` bytes; dirty regions
`16,384`, `163,840`, `327,680`, `655,360`, and `737,280` bytes.
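The full baseline figure is consistent with a 1280x720 BGRA desktop: `1280 * 720 * 4 = 3,686,400` bytes. The sketch below works out how much each listed region payload saves against that baseline. The formula used for `region_savings_percent` is an assumption about how the worker diagnostic is defined, not a quote from the worker code.

```python
FULL_FRAME_BYTES = 1280 * 720 * 4  # 3,686,400 bytes, assuming a 1280x720 BGRA desktop

REGION_PAYLOAD_BYTES = [16_384, 163_840, 327_680, 655_360, 737_280]


def region_savings_percent(region_bytes: int,
                           full_frame_bytes: int = FULL_FRAME_BYTES) -> float:
    """Assumed definition: share of the full-frame payload that was not sent."""
    return round(100.0 * (1.0 - region_bytes / full_frame_bytes), 2)


for size in REGION_PAYLOAD_BYTES:
    print(f"{size:>9,} bytes -> {region_savings_percent(size):.2f}% saved")
```

Under that definition the listed regions save between 80.00% (`737,280` bytes) and 99.56% (`16,384` bytes) of a full frame.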
### RDP-Perf-7: Adaptive Quality Controller
Add channel-aware adaptive render quality.
Deliver:
- latency-aware profile switching
- bandwidth-aware profile switching
- latest-only render backpressure
- stable input under load
## Acceptance Targets
LAN targets:
- first frame: under 2 seconds after successful RDP login
- click to visible response: under 150 ms for common UI
- keypress to visible response: under 150 ms for text input
- pointer hover response: under 100 ms where the target emits hover changes
- one click activates remote buttons correctly
- no unbounded frame/input queues
Weak-channel targets:
- input remains usable even when render quality degrades
- render drops stale updates instead of building backlog
- file transfer never starves interactive input
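The "drop stale updates instead of building backlog" target can be illustrated with a single-slot mailbox. This is a Python sketch under assumptions; `LatestOnlySlot` is a hypothetical name and does not describe the existing worker or client code.

```python
from typing import Optional


class LatestOnlySlot:
    """Hypothetical single-slot mailbox: a newer render update replaces an unsent one."""

    def __init__(self) -> None:
        self._pending: Optional[bytes] = None
        self.dropped = 0  # stale updates replaced before they were sent

    def offer(self, update: bytes) -> None:
        if self._pending is not None:
            self.dropped += 1  # the older update is stale; drop it instead of queueing
        self._pending = update

    def take(self) -> Optional[bytes]:
        update, self._pending = self._pending, None
        return update
```

The slot bounds memory at one update regardless of channel speed. With region payloads, replacing an unsent update is only safe if the replacement covers the dropped area (for example a merged region or a full frame); otherwise the receiver keeps stale pixels.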
## RDP Performance Work Paused
RDP performance work is paused. Next active work is Fabric Core / cluster
foundation.
RDP-Perf-6 remains accepted and smoke-proven. Future RDP roadmap items such as
RDP-Perf-7, adaptive quality, encoded payloads, additional RDPGFX testing,
tiles, codecs, or further renderer optimization must not start without a new
explicit RDP-stage prompt.
The preserved RDP baseline remains:
- C++ RDP Adapter runtime
- direct worker WSS
- backend gateway fallback
- dirty-region direct binary render from RDP-Perf-6
- proven session lifecycle
- existing clipboard and file-transfer semantics
@@ -0,0 +1,335 @@
# RDP Service C# Target Architecture
## Status
Superseded.
The active direction is now documented in:
- `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`
The C++ worker remains the primary RDP runtime. This C# document is retained only
as historical/research context and must not be used for implementation unless
explicitly re-approved.
## Problem Statement
The current RDP MVP proved the platform lifecycle, but its rendering model is not
production-grade:
- FreeRDP connects to the target RDP server.
- The worker reads the GDI framebuffer.
- The worker publishes full or cropped BGRA frames through RAP direct WSS.
- The Windows client renders those frames as a custom viewer.
This is not how high-performance RDP clients work. On a fast LAN, the network is
not the main bottleneck. The bottleneck is that the service is repeatedly copying
and publishing screen images instead of consuming the RDP graphics protocol as a
graphics protocol.
Observable symptoms:
- delayed visual feedback after input
- unreliable first-click behavior
- poor hover behavior
- high CPU/memory pressure from framebuffer copies
- unnecessary 1280x720 BGRA full-frame payloads
- fragile coupling between input, render snapshots, and UI timing
## External Reference Model
Microsoft RDP performance is based on graphics protocol features rather than
screen scraping:
- RDP Graphics Pipeline Extension (`MS-RDPEGFX`) uses a dynamic virtual channel
for graphics pipeline updates.
- RDP supports adaptive graphics, delta detection, caching, mixed-mode encoding,
RemoteFX Progressive, H.264/AVC, AVC444, and HEVC in modern environments.
- FreeRDP documentation describes the RDP GFX Pipeline (`rdpgfx`) and codecs such
as RemoteFX Progressive, H.264 AVC420/AVC444, ClearCodec, and ZGFX.
References:
- https://learn.microsoft.com/en-us/openspecs/windows_protocols/ms-rdpegfx/da5c75f9-cd99-450c-98c4-014a496942b0
- https://learn.microsoft.com/en-us/azure/virtual-desktop/graphics-encoding
- https://freerdp-freerdp.mintlify.app/concepts/codecs
## Target Decision
Replace the internal RDP engine with a C# implementation owned by this project.
The new service:
- is a RAP RDP service adapter, not a generic local RDP client UI
- speaks standard RDP to the target Windows RDP server
- keeps RDP protocol details inside the RDP service boundary
- preserves current backend and cluster data-plane contracts
- does not use FreeRDP as the runtime RDP engine
- does not require the local Windows desktop client to become mstsc
The local Windows client remains a RAP client. It receives RAP display/input/
clipboard/file messages over the existing direct worker WSS data-plane.
## What Must Not Change
The following are outside this rewrite:
- backend organization/auth/session lifecycle
- PostgreSQL source-of-truth model
- Redis live coordination model
- worker registration and lease semantics
- data_plane_token model
- direct_worker_wss transport contract
- backend gateway fallback
- clipboard/file policy semantics
- file upload policy semantics
- session attach/detach/reattach/takeover/terminate semantics
Only the RDP service adapter internals change.
## Service Boundary
```mermaid
flowchart LR
Client["Windows RAP Client"]
Backend["Backend Control Plane"]
Worker["RDP Service Node"]
Engine["C# RDP Protocol Engine"]
Target["Target Windows RDP Server"]
Client <--> |"direct_worker_wss RAP channels"| Worker
Backend <--> |"assignments, leases, audit"| Worker
Worker --> Engine
Engine <--> |"standard RDP"| Target
```
The RDP service owns:
- RDP negotiation and transport
- NLA/CredSSP/TLS integration
- input translation to RDP fast-path input
- graphics channel parsing
- virtual channel handling for clipboard and future file features
- conversion from RDP graphics units to RAP render messages
- session runtime ownership and reconnect/takeover binding
The data-plane layer owns:
- data_plane_token validation
- direct WSS connection binding
- logical channel priority
- reliable/droppable semantics
- fallback compatibility
## New RDP Service Components
### `Rap.Rdp.Service`
Host process.
Responsibilities:
- load worker/RDP service configuration
- register worker capabilities with the existing coordination layer (later stage)
- expose the existing direct WSS endpoint (later stage)
- create and supervise RDP sessions
- keep the current C++ worker active until cutover
### `Rap.Rdp.Core`
Pure C# protocol and runtime boundaries.
Responsibilities:
- define RDP session lifecycle interfaces
- define protocol engine interfaces
- define graphics/input/clipboard/file abstractions
- avoid any dependency on WPF or backend repositories
### `Rap.Rdp.Protocol`
Future implementation module.
Responsibilities:
- implement RDP connection sequence from Microsoft Open Specifications
- implement security/NLA/CredSSP/TLS
- implement core channels and fast-path input
- implement graphics pipeline negotiation
- implement virtual channel framing
This module must not depend on the Windows desktop UI.
### `Rap.Rdp.DataPlane`
Future adapter module.
Responsibilities:
- map RAP direct WSS JSON/binary envelopes to the protocol engine
- keep input highest priority
- keep render latest-frame or latest-update droppable
- keep clipboard/file reliable and policy-gated
## Graphics Strategy
The new render path must not use framebuffer screen scraping as the primary
production path.
Priority order:
1. RDPGFX graphics pipeline channel.
2. Surface/dirty-region updates.
3. Encoded graphics payloads where available.
4. Raw bitmap fallback only for compatibility/debug.
Target RAP render message classes:
- `surface.create`
- `surface.delete`
- `surface.map`
- `surface.region`
- `surface.codec_frame`
- `cursor.update`
- `frame.ack`
The first usable implementation may still decode some graphics to BGRA, but only
as a controlled fallback. It must not become the permanent production model.
## Input Strategy
Input must be independent from render.
Rules:
- mouse down/up, wheel, and keyboard down/up are reliable and ordered
- pointer move is coalesced latest-only
- pointer position is explicitly sent before button-down when needed
- input never waits behind render
- no UI focus event may be inserted into the same ordered sequence in a way that
consumes the first remote click
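The rules above can be sketched as an outbound queue that keeps button and key events in order while coalescing pointer moves. This Python sketch is illustrative; the `InputQueue` class and the event dictionary shape are hypothetical, and it always sends the pointer position before a button-down rather than only "when needed".

```python
from collections import deque
from typing import Deque, Dict, Optional


class InputQueue:
    """Hypothetical outbound input queue: ordered events plus coalesced pointer moves."""

    def __init__(self) -> None:
        self._events: Deque[Dict] = deque()

    def _push_move(self, x: int, y: int) -> None:
        move = {"type": "pointer_move", "x": x, "y": y}
        if self._events and self._events[-1]["type"] == "pointer_move":
            self._events[-1] = move  # coalesce: only the latest position matters
        else:
            self._events.append(move)

    def push(self, event: Dict) -> None:
        if event["type"] == "pointer_move":
            self._push_move(event["x"], event["y"])
            return
        if event["type"] == "button_down":
            # Send the pointer position explicitly before the button-down.
            self._push_move(event["x"], event["y"])
        self._events.append(event)  # reliable, ordered events are never dropped

    def pop(self) -> Optional[Dict]:
        return self._events.popleft() if self._events else None
```

Coalescing only against the tail of the queue keeps a pointer move from being reordered across a button or key event.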
The current double-click regression is treated as a bug caused by RAP-side
focus/input sequencing, not as normal RDP behavior.
## Clipboard And File Strategy
Existing policy semantics remain:
- clipboard modes stay enforced in backend, gateway/data-plane, and RDP service
- file transfer modes stay enforced in backend, gateway/data-plane, and RDP service
- text clipboard maps to RDP clipboard virtual channel
- restricted drive visibility remains a separate policy-controlled feature
The C# rewrite must not expand clipboard/file scope while replacing render/input.
## Staged Migration Plan
### RDP-C#-0: Documentation And Skeleton
Create a buildable C# RDP service skeleton with interfaces only.
No runtime cutover.
### RDP-C#-1: Control-Plane Compatible Worker Shell
Implement worker registration, heartbeats, lease renewal, assignment consumption,
and direct WSS token validation in C# using existing contracts.
The C++ worker remains default.
### RDP-C#-2: RDP Handshake Probe
Implement a non-viewing RDP connection probe:
- TCP/TLS
- basic RDP negotiation
- NLA/CredSSP if required
- connect/disconnect lifecycle
- failure reporting
No rendering yet.
### RDP-C#-3: Input-Only Protocol Path
Once a session is connected, send fast-path keyboard/mouse input to the RDP server.
Use diagnostic-only graphics or no graphics.
### RDP-C#-4: Basic Graphics Protocol Path
Implement the simplest RDP graphics path needed to display a desktop without
FreeRDP.
Allowed as temporary fallback:
- raw bitmap updates
- dirty-region bitmap updates
Not acceptable as final production:
- repeated full-frame screenshot capture
### RDP-C#-5: RDPGFX Foundation
Implement RDPGFX channel negotiation and surface update handling.
### RDP-C#-6: Codec Path
Implement or relay supported encoded graphics modes:
- RemoteFX Progressive where practical
- H.264/AVC420/AVC444 where negotiated
- client-side decode through platform APIs where possible
### RDP-C#-7: Runtime Cutover
Enable the C# RDP service per worker/resource via feature flag.
Rollback must switch back to the current C++ worker without changing backend
contracts.
## Performance Requirements
Target for LAN:
- first frame under 2 seconds after successful RDP login
- click to visible response under 150 ms for normal UI
- keypress to visible response under 150 ms for text input
- pointer hover response under 100 ms where the target OS emits hover changes
- no unbounded frame queue
- no render work on UI thread except final apply
- no full-frame publish loop for static desktops
## Risks
- Implementing RDP from specs is substantial.
- NLA/CredSSP correctness is security-sensitive.
- Graphics codecs are complex.
- Some target servers may negotiate older bitmap paths.
- AVC/AVC444 decode support differs by client platform.
- A partial RDP engine must not be switched into production before smoke proof.
## Recommended Immediate Next Step
Proceed with RDP-C#-0 only.
Goal:
Create a buildable C# RDP service skeleton and protocol boundaries, without
switching runtime traffic away from the current worker.
Strict rules:
- do not change backend contracts
- do not change cluster transport
- do not remove C++ worker
- do not use FreeRDP in the new C# service
- do not use third-party RDP libraries
- do not claim the C# engine is runtime-ready
Deliver:
- buildable `workers/rdp-service-csharp`
- interfaces for protocol engine, data-plane bridge, graphics sink, input source
- README with migration stages
- docs update marking current C++/FreeRDP path as legacy MVP runtime
File diff suppressed because it is too large
@@ -0,0 +1,464 @@
# Secure Node-to-Node Channel Lifecycle
Status: Stage C16 result. Documentation and architecture only.
This document defines the secure node-to-node channel lifecycle for the Secure
Access Fabric. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.
## 1. Purpose
Secure node-to-node channels are the future authenticated transport foundation
for Fabric routes. They must exist as a trust and lifecycle model before any
production mesh routing runtime carries traffic.
C16 defines:
- mTLS identity validation
- connection establishment
- channel authorization
- lifecycle state
- heartbeat and liveness
- reconnect/backoff
- draining
- revalidation
- trust rotation
- revocation handling
- failure observability
## 2. Non-Goals
C16 does not:
- implement packet forwarding
- implement mesh routing runtime
- implement relay node behavior
- implement VPN/IP tunnel traffic
- implement QUIC/WebRTC
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior
It defines the node-channel lifecycle boundary only.
## 3. Trust Foundation
Every node-to-node channel must be authenticated.
Required identity inputs:
- cluster id
- local node id
- remote node id
- local node certificate
- remote node certificate
- cluster trust roots
- revocation metadata
- role assignment snapshot
- allowed peer relationship
- route/channel authorization policy
Private keys remain local to the node. The control plane must never store node
private keys.
## 4. mTLS Certificate Requirements
Node certificates must be cluster-scoped.
Certificate identity should bind:
- node id
- cluster id
- certificate serial
- validity period
- key usage for node-to-node transport
- optional role/service constraints where practical
Validation must check:
- certificate chain
- cluster trust root
- certificate validity time
- node id binding
- cluster id binding
- expected remote node id
- revocation status
- key usage
- policy scope
A valid TLS certificate is necessary but not sufficient. The channel must also
pass role, peer, route, and channel authorization.
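The certificate checks above can be sketched as a function that returns the first matching failure reason or `None`. This is a Python illustration, not an implementation: `PeerCertificate` is a hypothetical, already-parsed view of a certificate, chain and key-usage verification are assumed to have happened in the TLS layer, and the reason strings are the ones listed in the Failure Classification section of this document.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class PeerCertificate:
    """Hypothetical, already-parsed view of a node certificate."""
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    chain_trusted: bool          # chain verified against the cluster trust root
    allows_node_transport: bool  # key usage permits node-to-node transport


def validate_peer_certificate(cert: PeerCertificate, *, expected_cluster_id: str,
                              expected_node_id: str, revoked_serials: FrozenSet[str],
                              now: datetime) -> Optional[str]:
    """Return a failure reason, or None when the certificate checks pass."""
    if not cert.chain_trusted or not cert.allows_node_transport:
        return "certificate_invalid"
    if not (cert.not_before <= now <= cert.not_after):
        return "certificate_invalid"
    if cert.serial in revoked_serials:
        return "certificate_revoked"
    if cert.cluster_id != expected_cluster_id:
        return "wrong_cluster"
    if cert.node_id != expected_node_id:
        return "wrong_node"
    return None  # identity is valid; channel authorization is a separate step
```

A `None` result covers identity only. Role, peer, route, and channel-class authorization still have to pass before the channel is usable.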
## 5. Channel Establishment Flow
Proposed logical flow:
1. Routing Engine or node-agent selects an allowed peer candidate.
2. Local node checks peer directory and local policy.
3. Local node opens authenticated transport.
4. Both sides perform mTLS handshake.
5. Both sides validate certificate identity and cluster scope.
6. Both sides exchange channel hello metadata.
7. Both sides validate role, channel classes, and policy version.
8. Channel enters `established` state.
9. Heartbeat/liveness begins.
10. Channel is registered in local channel table with expiry/revalidation
deadline.
Channel hello metadata should include:
- protocol version
- cluster id
- node id
- supported channel classes
- supported transport features
- local config version
- peer directory version
- trust bundle version
- route epoch
- draining status
## 6. Channel States
Initial state machine:
- `idle`
- `connecting`
- `handshaking`
- `authenticating`
- `authorizing`
- `established`
- `revalidating`
- `degraded`
- `draining`
- `closing`
- `closed`
- `failed`
State rules:
- no traffic before `established`
- control/liveness may continue in `revalidating`
- new non-essential traffic should stop in `draining`
- channel must close on failed authentication
- channel must close or degrade on failed reauthorization according to policy
- terminal `closed`/`failed` channels must not be reused
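The state list and rules above can be sketched as an explicit transition table. The document names the states but not the edges between them, so the `ALLOWED` table below is an assumption for illustration, as are the `transition` and `traffic_forbidden` helper names.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState
# Assumed edges; only the state names and rules come from the document.
ALLOWED = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),   # terminal: must not be reused
    S.FAILED: set(),   # terminal: must not be reused
}
PRE_ESTABLISHED = {S.IDLE, S.CONNECTING, S.HANDSHAKING, S.AUTHENTICATING, S.AUTHORIZING}


def transition(current: ChannelState, target: ChannelState) -> ChannelState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def traffic_forbidden(state: ChannelState) -> bool:
    # No traffic before `established`, and none on terminal channels.
    return state in PRE_ESTABLISHED or not ALLOWED[state]
```

Keeping the terminal states edge-free means a reconnect always allocates a new channel record instead of reviving a closed or failed one.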
## 7. Channel Classes
Allowed channel classes map to Fabric routing classes, not service-specific
protocol internals.
Initial channel classes:
- `fabric_control`
- `route_control`
- `health`
- `telemetry`
- `render`
- `input`
- `clipboard`
- `file_transfer`
- `storage_fetch`
- `update_fetch`
- `vpn_packet`
Authorization is per channel class.
Rules:
- `input` and `fabric_control` require high-priority scheduling.
- `render` and video-like traffic may be droppable/latest-only.
- `file_transfer`, `storage_fetch`, and `update_fetch` must not starve
`input` or control.
- `vpn_packet` must be QoS-limited so bulk traffic cannot starve interactive
channels.
- A channel may carry only classes authorized by local policy and route result.
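The scheduling and authorization rules above can be sketched as a small priority queue keyed by channel class. This Python sketch is illustrative: the `ChannelScheduler` class is hypothetical, and the priority tier assigned to every class other than `input` and `fabric_control` is an assumption.

```python
import heapq
import itertools
from typing import Iterable, Optional, Tuple

HIGH, NORMAL, BULK = 0, 1, 2
# Only the HIGH tier is stated in the rules above; the other tiers are assumed.
CLASS_PRIORITY = {
    "input": HIGH, "fabric_control": HIGH,
    "route_control": NORMAL, "health": NORMAL, "telemetry": NORMAL,
    "render": NORMAL, "clipboard": NORMAL,
    "file_transfer": BULK, "storage_fetch": BULK,
    "update_fetch": BULK, "vpn_packet": BULK,
}


class ChannelScheduler:
    """Hypothetical per-channel send queue with class authorization."""

    def __init__(self, authorized_classes: Iterable[str]) -> None:
        self._authorized = frozenset(authorized_classes)
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order inside one priority tier

    def enqueue(self, channel_class: str, message: bytes) -> None:
        if channel_class not in self._authorized:
            raise PermissionError("channel_class_denied")
        priority = CLASS_PRIORITY[channel_class]
        heapq.heappush(self._heap, (priority, next(self._seq), channel_class, message))

    def next_message(self) -> Optional[Tuple[str, bytes]]:
        if not self._heap:
            return None
        _, _, channel_class, message = heapq.heappop(self._heap)
        return channel_class, message
```

Strict priority guarantees that bulk classes cannot starve `input`, but it can starve bulk traffic in the other direction; a real scheduler would likely add rate limits or weighted sharing for the lower tiers.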
## 8. Channel Authorization
Authorization checks:
- local node is allowed to connect to remote node
- remote node is allowed to accept the connection
- cluster id matches
- roles are compatible
- route result or peer policy permits the relationship
- requested channel classes are allowed
- organization/service scope is allowed where applicable
- partition/degraded state permits the channel
- remote node is not revoked, disabled, or disallowed
- certificate is not expired or revoked
Authorization must be repeated when:
- trust bundle changes
- revocation list changes
- role assignment changes
- route policy changes
- route epoch changes
- channel is long-lived past revalidation interval
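The re-authorization triggers above reduce to comparing the inputs a channel was authorized with against the current ones, plus an age check. This is a Python sketch under assumptions: `AuthorizationBasis` and its version fields are hypothetical names, and the use of monotonically comparable version numbers is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorizationBasis:
    """Hypothetical record of the inputs a channel was authorized against."""
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def needs_reauthorization(authorized_with: AuthorizationBasis,
                          current: AuthorizationBasis,
                          authorized_at_s: float, now_s: float,
                          revalidation_interval_s: float) -> bool:
    if authorized_with != current:
        return True  # some trust, revocation, role, or route input changed
    # Unchanged inputs still expire once the channel outlives the interval.
    return now_s - authorized_at_s >= revalidation_interval_s
```

Storing the basis alongside the channel record makes the reason for a revalidation visible in diagnostics, since the differing field can be reported.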
## 9. Heartbeat and Liveness
Heartbeats prove liveness, not authority.
Heartbeat metadata:
- channel id
- local node id
- remote node id
- timestamp
- sequence
- observed latency
- packet loss/jitter summary where available
- local health hint
- draining flag
- config version
- route epoch
Recommended heartbeat cadence:
- active control channels: 5-15 seconds
- high-priority realtime channels: 2-10 seconds where needed
- low-priority/storage channels: 15-60 seconds
Missing heartbeats should trigger:
1. suspicion state
2. bounded retry
3. route failover consideration
4. channel close/failure
5. health report
## 10. Reconnect and Backoff
Reconnect must be bounded and policy-aware.
Rules:
- use exponential backoff with jitter
- do not stampede bootstrap peers
- prefer warm candidates after active peer failure
- stop reconnect when peer is revoked or policy disallows it
- report repeated failures
- preserve route stickiness only while healthy and authorized
- avoid reconnect loops during draining or shutdown
Reconnect should use current peer cache and route policy, not stale hardcoded
endpoints.
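Bounded exponential backoff with jitter can be sketched as a generator of delays. The variant shown is "full jitter" (a uniform draw between zero and the exponential ceiling), which is one common choice and not something this document mandates. The function name and the default base, cap, and attempt count are assumptions.

```python
import random
from typing import Iterator, Optional


def reconnect_delays(base_seconds: float = 1.0, cap_seconds: float = 60.0,
                     max_attempts: int = 8,
                     rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized delay per reconnect attempt, then stop."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter spreads peers apart so they do not stampede a bootstrap peer.
        yield rng.uniform(0.0, ceiling)
```

When the generator is exhausted, the caller would stop retrying and report a structured failure such as `backoff_exhausted`. Revocation or a policy change should end the loop earlier.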
## 11. Revalidation
Long-lived channels must revalidate periodically.
Revalidation checks:
- certificate still valid
- revocation status current enough
- cluster trust root still valid
- peer relationship still allowed
- channel classes still allowed
- route epoch/policy version still acceptable
- role assignments still active
If revalidation fails:
- stop accepting new traffic
- drain or close according to policy
- report reason
- trigger route failover where applicable
## 12. Draining and Graceful Shutdown
Draining supports maintenance and safe role removal.
Draining flow:
1. node enters draining state
2. node advertises draining in heartbeat/channel metadata
3. routing stops placing new flows on the node
4. existing flows continue until TTL or policy deadline
5. new non-essential channels are rejected
6. channel closes after active work drains or deadline expires
7. node reports drained status
Draining must not silently drop critical control messages.
If graceful drain fails, policy decides whether to force-close and failover.
## 13. Trust Rotation
Trust rotation must avoid split trust windows.
Recommended flow:
1. new trust bundle is signed by current trusted key
2. nodes fetch and verify new trust bundle
3. dual validation period begins where required
4. new certificates are issued/accepted
5. old certificates expire or are revoked
6. old trust root is retired after rollout threshold
Channels should revalidate after trust bundle changes.
## 14. Revocation Handling
Revocation must affect active channels.
Revocation inputs:
- signed revocation list
- trust bundle update
- control-plane status after reconnect
- emergency revocation policy
On revocation of remote node/certificate/key:
- stop new channels
- mark existing channels as revalidation failed
- close or drain according to policy severity
- remove peer from eligible active/warm candidates
- report and audit event
High-severity revocation should close immediately.
## 15. Partition and Degraded Behavior
In degraded mode, channels may continue only if:
- current signed snapshot permits it
- certificates remain valid
- revocation state is not known to reject the peer
- route/channel policy permits degraded continuation
- TTL has not expired
Degraded mode must not authorize:
- new node enrollment
- new trust roots
- role changes
- cross-cluster trust changes
- partition promotion
- new high-risk channels without policy
## 16. Failure Classification
Failure reasons:
- `tls_handshake_failed`
- `certificate_invalid`
- `certificate_revoked`
- `wrong_cluster`
- `wrong_node`
- `policy_denied`
- `channel_class_denied`
- `route_epoch_stale`
- `heartbeat_timeout`
- `peer_draining`
- `peer_disabled`
- `trust_bundle_stale`
- `network_unreachable`
- `backoff_exhausted`
Failures should be structured and safe to log.
## 17. Observability
Node-agent should report:
- channel state
- active channel count
- channel classes in use
- handshake failures
- authorization failures
- heartbeat latency
- reconnect count
- backoff state
- draining state
- revocation actions
- revalidation failures
- route epoch/policy version
Tenant views must not expose internal topology. Platform owner views may show
full channel diagnostics according to audited policy.
## 18. Security Requirements
Required:
- mTLS for node-to-node channels
- cluster-scoped node certificates
- certificate revocation support
- policy-scoped channel authorization
- no unauthenticated peer enumeration
- no channel use before authorization
- channel class separation
- QoS-aware scheduling expectations
- structured audit for high-risk channel changes
Compromised node blast radius must be limited by:
- scoped certificates
- scoped snapshots
- role assignment
- peer directory scope
- channel authorization
- revocation
- topology hiding
## 19. Future Validation Tests
Future implementation tests must prove:
- valid node-to-node mTLS succeeds
- wrong cluster certificate rejected
- wrong node id rejected
- expired certificate rejected
- revoked certificate closes active channel
- unauthorized channel class rejected
- channel cannot carry traffic before authorization
- heartbeat timeout triggers failure
- draining stops new channels
- trust rotation revalidates channels
- degraded mode honors TTL and forbidden actions
- tenant-safe views hide topology
## 20. C17 Preparation
C17 may plan mesh routing runtime only after C10-C16 are accepted.
C17 must use:
- signed snapshots
- node-local state store
- Fabric Storage / Config Storage
- peer directory/cache
- Fabric Routing Engine route results
- secure node-to-node channels
C17 must not jump directly to broad production mesh. It should first define a
minimal runtime implementation plan, test topology, rollback path, and go/no-go
criteria.
## 21. Result / Decision
Stage C16 defines secure node-to-node channels as authenticated,
policy-authorized, lifecycle-managed connections.
Decisions:
- mTLS is required for node-to-node channels.
- Certificate validity is necessary but not sufficient; channel policy must
authorize role, peer relationship, route, and channel classes.
- Active channels must revalidate on trust, revocation, role, and route policy
changes.
- Draining is a first-class lifecycle state.
- Revocation affects active channels.
- Degraded mode is bounded and cannot authorize high-risk mutations.
- C17 must plan mesh routing runtime using C10-C16 foundations.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C16.
@@ -0,0 +1,209 @@
# Security And Secrets Readiness
Status: P3.3 test-stand smoke complete for encrypted resource secrets,
assignment-time resolution, and production fallback behavior with smoke-only
direct worker WSS trust.
This document defines the next security hardening layer around the accepted RDP
MVP baseline. It does not implement mesh, VPN, server-to-client download, new
protocol adapters, or another RDP rendering mode.
## Current Accepted Baseline
- RDP worker baseline: `rap-rdp-worker:rdp-p1-region-order2`
- Backend control plane remains source of truth.
- Redis remains live coordination/routing only.
- Direct worker WSS is preferred for realtime RDP.
- Backend gateway remains fallback/debug.
- Text clipboard is policy-gated and accepted.
- Client-to-server file upload and restricted `RAP_Transfers` visibility are
accepted.
## Problem
The current smoke/dev path can still seed RDP target credentials inside
resource `metadata`. That was acceptable for proving lifecycle and RDP adapter
behavior, but it must not be the production contract.
Production must not rely on plaintext target passwords, usernames, domain
credentials, client secrets, tokens, or private keys stored in generic resource
metadata.
## Target Secret Model
Resources keep non-secret connection shape:
```json
{
"id": "...",
"organization_id": "...",
"protocol": "rdp",
"address": "rdp.example.internal:3389",
"secret_ref": "rap-secret://org/<org_id>/resources/<resource_id>/rdp-primary",
"metadata": {
"certificate_verification_mode": "strict",
"render_quality_profile": "balanced"
}
}
```
Secrets are stored separately and referenced by `secret_ref`. The secret payload
is protocol-specific and versioned:
```json
{
"version": 1,
"protocol": "rdp",
"username": "...",
"domain": "...",
"password": "...",
"rotation_version": 3
}
```
The reference, not the plaintext secret, is copied into session metadata and
audit context.
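As a minimal sketch of how a `secret_ref` could be checked against the resource that carries it, the helper below parses the reference and compares its parts. The path layout is inferred from the single example above; the names `SecretRef`, `parse_secret_ref`, and `ref_matches_resource` are hypothetical and not part of the current codebase.

```python
import re
from dataclasses import dataclass

# Assumed layout, inferred from the example in this document:
#   rap-secret://org/<org_id>/resources/<resource_id>/<secret_name>
_SECRET_REF = re.compile(
    r"^rap-secret://org/(?P<org_id>[^/]+)"
    r"/resources/(?P<resource_id>[^/]+)"
    r"/(?P<name>[^/]+)$"
)


@dataclass(frozen=True)
class SecretRef:
    org_id: str
    resource_id: str
    name: str


def parse_secret_ref(ref: str) -> SecretRef:
    """Split a secret_ref into its parts, or raise ValueError."""
    match = _SECRET_REF.match(ref)
    if match is None:
        raise ValueError("malformed secret_ref")
    return SecretRef(**match.groupdict())


def ref_matches_resource(ref: str, organization_id: str, resource_id: str) -> bool:
    """True only when the reference points at this organization and resource."""
    parsed = parse_secret_ref(ref)
    return parsed.org_id == organization_id and parsed.resource_id == resource_id
```

A check like this only confirms that the reference is well formed and scoped to the expected resource. It never touches the secret payload itself.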
## Runtime Secret Resolution
Production runtime should resolve secrets through a dedicated secret resolver:
1. Backend validates resource/org/user authorization.
2. Backend starts the session using resource `secret_ref`.
3. Worker receives assignment with `secret_ref`, not plaintext credentials.
4. Worker asks an authorized secret resolver for the secret using:
- `organization_id`
- `resource_id`
- `worker_id`
- `session_id`
- short-lived lease/session proof
5. Secret resolver returns credentials only to authorized workers for active
leased sessions.
6. Worker keeps secret material in memory only and never logs it.
The current P3.1 MVP uses an encrypted PostgreSQL-backed store:
- `resource_secrets` stores ciphertext, nonce, key id, algorithm, version, safe
metadata, and `payload_sha256`.
- `SECRET_ENCRYPTION_KEY_B64` or `SECRET_ENCRYPTION_KEY_FILE` supplies the
AES-256-GCM key.
- `SECRET_ENCRYPTION_KEY_ID` labels the active key.
- the API can create/rotate a resource secret, but never returns plaintext.
- session assignment resolves the secret only after organization/resource/
worker/session/lease checks.
The resolver boundary can later be backed by KMS, Vault, cloud secret managers,
or node-local secure delivery without changing the resource `secret_ref`
contract.
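The organization/resource/worker/session/lease checks described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the backend's actual resolver: `Lease`, `ResolveRequest`, `authorize_resolution`, and the reason strings are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class Lease:
    # Hypothetical lease record; field names are illustrative only.
    organization_id: str
    resource_id: str
    worker_id: str
    session_id: str
    expires_at: datetime


@dataclass(frozen=True)
class ResolveRequest:
    organization_id: str
    resource_id: str
    worker_id: str
    session_id: str


def authorize_resolution(
    req: ResolveRequest,
    lease: Optional[Lease],
    session_active: bool,
    now: datetime,
) -> Tuple[bool, str]:
    """Return (allowed, reason). The reason contains no secret material."""
    if lease is None:
        return False, "no_lease"
    if not session_active:
        return False, "session_not_active"
    if now >= lease.expires_at:
        return False, "lease_expired"
    # Every identity in the request must match the lease exactly.
    for name in ("organization_id", "resource_id", "worker_id", "session_id"):
        if getattr(req, name) != getattr(lease, name):
            return False, f"{name}_mismatch"
    return True, "ok"
```

Only after a check like this returns `(True, "ok")` would the resolver decrypt and return the payload. A denial reason could feed an audit event such as `resource_secret_access_denied`.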
## Production Guard
In `APP_ENV=production`:
- RDP/VNC/SSH resources must have `secret_ref`.
- Plaintext credential-like keys are rejected in resource `metadata`.
- Session start rejects legacy resources that still contain plaintext
  credential-like metadata.
- Backend startup requires secret encryption key material.
- Development/smoke environments may continue using plaintext metadata while
the resolver path is not used, but this is explicitly not production mode.
Credential-like metadata keys include password, username, domain, token,
private key, client secret, credential, credentials, secret, and common
underscore/hyphen variants.
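A minimal sketch of the credential-like key check, assuming keys are compared after stripping underscores, hyphens, and spaces and lowercasing. The function name and the exact-match rule are assumptions. The real guard may also match substrings such as `rdp_password` or inspect nested objects, which this sketch does not do.

```python
import re

# Normalized forms of the key names listed in this document.
_CREDENTIAL_WORDS = {
    "password",
    "username",
    "domain",
    "token",
    "privatekey",
    "clientsecret",
    "credential",
    "credentials",
    "secret",
}


def _normalize(key: str) -> str:
    """Drop separators and case so 'Private-Key' and 'private_key' compare equal."""
    return re.sub(r"[\s_\-]", "", key).lower()


def find_credential_like_keys(metadata: dict) -> list:
    """Return the top-level metadata keys that look like credentials, sorted."""
    return sorted(key for key in metadata if _normalize(key) in _CREDENTIAL_WORDS)
```

In production mode a non-empty result would cause the resource write or the session start to be rejected.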
## Data Plane Trust
Already accepted:
- backend signs `data_plane_token` with RS256 private key
- worker validates with public key only
- token is short-lived
- token includes session, attachment, user, organization, worker, resource,
allowed channels, expiry, and jti
- worker rejects wrong worker, wrong attachment, wrong organization, wrong
resource, over-broad channels, failed/terminated sessions, and jti replay
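The worker-side rejections listed above can be illustrated with a claim check that runs after RS256 signature verification has already succeeded. The claim names (`worker_id`, `attachment_id`, `allowed_channels`, and so on) and the `BindContext` type are assumptions for this sketch. The actual token layout is not specified in this document.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BindContext:
    # What this worker knows about the session it is asked to bind.
    worker_id: str
    organization_id: str
    resource_id: str
    attachment_id: str
    session_state: str
    permitted_channels: frozenset
    seen_jti: set = field(default_factory=set)


def check_claims(claims: dict, ctx: BindContext, now: Optional[float] = None) -> str:
    """Return "ok" or a rejection reason. Signature must be verified beforehand."""
    now = time.time() if now is None else now
    if claims["exp"] <= now:
        return "expired"
    if claims["worker_id"] != ctx.worker_id:
        return "wrong_worker"
    if claims["organization_id"] != ctx.organization_id:
        return "wrong_organization"
    if claims["resource_id"] != ctx.resource_id:
        return "wrong_resource"
    if claims["attachment_id"] != ctx.attachment_id:
        return "wrong_attachment"
    if ctx.session_state in ("failed", "terminated"):
        return "session_not_bindable"
    # The token may narrow the channel set but never widen it.
    if not set(claims["allowed_channels"]) <= ctx.permitted_channels:
        return "channels_too_broad"
    if claims["jti"] in ctx.seen_jti:
        return "jti_replay"
    ctx.seen_jti.add(claims["jti"])
    return "ok"
```

The in-memory `seen_jti` set is only for illustration. A real worker would bound it by token expiry so it cannot grow without limit.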
Production still needs:
- deployed certificate chain for direct worker WSS on production nodes
- pinned or platform-issued worker certificates in live production config
- no smoke-only TLS bypass in production clients
- rotation process for data-plane signing keys
- audit for failed token validation/bind attempts
P3.2 guard exists:
- backend distinguishes `smoke_insecure`, `public_ca`, and `platform_ca`
direct worker WSS trust modes
- production backend omits smoke-only direct candidates
- Windows production client skips untrusted or smoke-only direct candidates
P3.3 test-stand smoke exists:
- `resource_secrets` migration is applied on `docker-test`
- backend runs as `APP_ENV=production` with a test-only
`SECRET_ENCRYPTION_KEY_FILE`
- a secret-backed RDP resource starts a real session through assignment-time
secret resolution
- `resources.metadata`, `remote_sessions.metadata`, and `audit_events` were
checked for plaintext username/password leakage
- production backend with `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
returns backend gateway fallback only
- development/smoke backend with the same trust mode advertises the explicit
smoke-only direct worker WSS candidate
- `RAP_Transfers` smoke passed on the secret-backed resource
## Required Regression Tests
P3 must protect:
- plaintext resource credentials rejected in production
- RDP production resources require `secret_ref`
- development smoke plaintext metadata remains allowed
- data-plane allowed channels follow runtime policy
- direct bind rejects wrong worker
- direct bind rejects wrong user
- direct bind rejects wrong organization
- direct bind rejects wrong resource
- direct bind rejects old attachment
- direct bind rejects failed/terminated states
## Audit Events
Current audit coverage should remain for:
- session start
- attach
- detach
- takeover
- terminate
- failure
Future audit coverage should add:
- secret deleted
- production resource rejected because plaintext credential metadata was found
Audit entries must reference `secret_ref` and resource/session ids, never
plaintext secret values.
P3.1 implemented audit events for:
- `resource_secret_rotated`
- `resource_secret_accessed`
- `resource_secret_access_denied`
## Remaining Production Gaps
- External KMS/Vault integration is not implemented yet.
- Master-key rotation/re-encryption workflow is not implemented yet.
- The worker still receives resolved credentials through the transient
assignment payload; a future resolver pull/token flow should reduce exposure
in Redis control queues.
- Worker still depends on plaintext assignment metadata for development smoke.
- Production direct worker WSS certificate issuance/rotation and platform CA
distribution are not complete.
- The test-stand secret key is a host-local test file, not a production KMS or
HSM-backed key.
- Automated end-to-end policy denial coverage is still thin.
@@ -0,0 +1,211 @@
# Service Adapter Protocol
Status: target contract and compile-safe foundation. This document defines the common adapter model for RDP, SSH, VNC, and future services. It does not replace the current backend control plane or current RDP runtime by itself.
## 1. Purpose
The platform client must not implement third-party protocols directly.
```text
Access Client <-> Secure Access Session Protocol <-> Service Adapter <-> Target Resource Protocol
```
A `Service Adapter` translates one external service protocol into the platform session/data-plane model. RDP is the first adapter, but the same model must support SSH, VNC, HTTP/internal apps, video, and future services.
## 2. Terms
- `Access Client`: user-facing Windows, iOS, Android, or future browser/native client.
- `Service Adapter`: protocol translation runtime at the service/egress edge.
- `RDP Adapter`: Service Adapter for Microsoft RDP.
- `SSH Adapter`: Service Adapter for SSH/terminal/SFTP/port-forward flows.
- `VNC Adapter`: Service Adapter for VNC framebuffer/input flows.
- `Target Resource`: external resource such as RDP host, SSH host, VNC host, internal app, or video endpoint.
- `Control Plane`: backend/API for auth, organization isolation, resource policy, session lifecycle, worker selection, audit, and token issuance.
- `Data Plane`: realtime channels between Access Client and adapter.
## 2.1 Remote Server/Desktop Access Product Model
The user-facing product service is **Remote Server/Desktop Access**.
RDP, VNC, and SSH are not separate cluster services exposed to organization
administrators. They are internal protocol adapters used by the Remote
Server/Desktop Access service.
The protocol is selected from the organization resource definition:
```text
Organization resource:
name: Accounting-01
address: 10.10.1.15
port: 3389
protocol: rdp
egress: Office Moscow
```
```text
protocol = rdp -> RDP Adapter
protocol = vnc -> VNC Adapter
protocol = ssh -> SSH Adapter
```
The Access Client always speaks the platform access protocol. It does not speak
RDP, VNC, or SSH directly.
Cluster operators assign nodes to run the Remote Server/Desktop Access service.
They do not separately enable "RDP service", "VNC service", or "SSH service" for
the organization. Adapter selection is an internal runtime decision based on the
selected resource protocol.
Organization administrators manage resources and policies, not internal nodes:
- resource name
- target address and port
- protocol
- allowed users/groups
- clipboard/file/session policy
- logical egress, such as `Office Moscow`
The logical egress is not a concrete node. It is an organization-visible egress
pool or route label. Internally the cluster may back `Office Moscow` with one or
many nodes that have network reachability to that office. Fabric routing and
placement choose the concrete node/path.
Resulting flow:
```text
Access Client
-> entry point
-> authenticate user
-> select organization
-> list allowed resources
-> select resource
-> use resource.protocol to choose adapter
-> use resource.egress to choose egress pool/path
-> connect to target
```
This decision prevents exposing internal adapter placement and node topology to
organizations while preserving protocol-specific policy enforcement inside the
adapter runtime.
## 3. Non-Negotiable Boundaries
- Access Client does not know RDP/SSH/VNC protocol internals.
- Service Adapter does not know UI implementation details.
- Control Plane remains authoritative for session lifecycle and policy.
- PostgreSQL remains source of truth; Redis remains live coordination only.
- Direct worker WSS and backend gateway fallback remain valid transports.
- Adapter runtime must not create sessions outside broker/assignment control.
## 4. Logical Channels
The session protocol is channel-oriented even when DP-1 uses one WSS connection.
| Channel | Direction | Reliability | Priority | Purpose |
| --- | --- | --- | --- | --- |
| `input` | client -> adapter | ordered reliable except move coalescing | highest | keyboard, pointer, wheel, focus |
| `control` | both | reliable | high | attach, detach, takeover, state, heartbeat |
| `display` | adapter -> client | droppable latest-frame/region | high but below input | frames, regions, surfaces, resize |
| `cursor` | adapter -> client | latest-only | high | cursor position, shape, visibility |
| `clipboard` | both | reliable | medium | policy-gated clipboard payloads |
| `file_transfer` | both | reliable chunked | medium/low | upload/download, progress, cancel |
| `audio` | adapter -> client, future client -> adapter | adaptive droppable | medium | future audio streams |
| `device` | both | reliable | medium | future printer, smart card, drive policy events |
| `telemetry` | adapter -> client/control | droppable | lowest | FPS, latency, queue depth, diagnostics |
Input must never wait behind display, file transfer, audio, or telemetry.
## 5. Event Direction Model
The adapter is not a passive responder. It must publish events whenever the target protocol emits them.
Client-origin examples:
- `input.keyboard`
- `input.pointer_move`
- `input.pointer_button`
- `input.wheel`
- `clipboard.client_text`
- `file_upload.start`
- `file_upload.chunk`
- `control.detach`
- `control.terminate`
Adapter-origin examples:
- `session.state`
- `display.full_frame`
- `display.region`
- `display.surface_bits`
- `display.encoded_frame`
- `cursor.update`
- `clipboard.server_text`
- `file_transfer.progress`
- `session.warning`
- `session.failed`
Screen updates must be adapter-driven:
```text
Target resource update -> adapter callback/event -> display/cursor event -> Access Client render
```
Client input may request a fast refresh, but input must not be the primary trigger for discovering server-side screen changes.
## 6. Channel Scheduling
The adapter runtime must maintain separate scheduling semantics:
- `input`: drain first; keyboard and button events are ordered; pointer move is latest-only.
- `control`: reliable and bounded; never behind render backlog.
- `display`: droppable; stale frames/regions are discarded before send.
- `cursor`: latest-only; may bypass display frame cadence.
- `clipboard`: reliable and policy-gated.
- `file_transfer`: reliable chunked; bandwidth-limited so it cannot starve input, display, or control.
- `telemetry`: sampled/dropped under pressure.
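The scheduling semantics above can be sketched as a strict-priority scheduler with latest-only coalescing. This is an illustrative model, not the adapter runtime: the class name is hypothetical, and the relative order of `control`, `cursor`, and `display` is an assumption, since the document only fixes `input` as highest and `telemetry` as lowest.

```python
from collections import deque

# Assumed drain order; only "input first" and "telemetry last" come from the text.
PRIORITY = ["input", "control", "cursor", "display", "clipboard", "file_transfer", "telemetry"]

# Channels where a newer event of the same kind replaces a queued stale one.
LATEST_ONLY = {"cursor", "display"}


class ChannelScheduler:
    def __init__(self):
        self._queues = {name: deque() for name in PRIORITY}

    def enqueue(self, channel, kind, payload):
        queue = self._queues[channel]
        coalesce = channel in LATEST_ONLY or (channel, kind) == ("input", "pointer_move")
        # Only the tail is replaced, so ordering against keyboard and
        # button events on the same channel is preserved.
        if coalesce and queue and queue[-1][0] == kind:
            queue[-1] = (kind, payload)
        else:
            queue.append((kind, payload))

    def next_event(self):
        """Return (channel, kind, payload) from the highest-priority non-empty queue."""
        for channel in PRIORITY:
            queue = self._queues[channel]
            if queue:
                kind, payload = queue.popleft()
                return channel, kind, payload
        return None
```

A real display queue would coalesce per region or surface rather than per event kind, and `file_transfer` would additionally be rate-limited. Both are omitted here to keep the sketch short.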
## 7. Render Model
Display events should be sent in this preference order:
1. Encoded/surface updates when supported by the external protocol and client.
2. Dirty regions/tiles.
3. Full frame only for baseline, resize, attach/recover, or fallback.
Full-frame BGRA is a compatibility fallback, not the production performance target.
## 8. Adapter Policy Enforcement
Policy must be enforced inside the adapter runtime in addition to UI/backend checks:
- clipboard mode
- file transfer mode
- allowed channels
- attachment/controller ownership
- session active/taken_over/failed/terminated state
- max payload sizes
- dangerous path/name rejection
- no arbitrary filesystem exposure
## 9. Adapter Lifecycle
All adapters must support:
- bind to existing assignment/session runtime
- connect to target resource
- publish state changes
- keep runtime alive through detach when policy allows
- reattach without recreating target session where protocol allows
- takeover without recreating target session where protocol allows
- terminate target session when broker commands terminate
- fail fast and report authoritative failure when target runtime is gone
## 10. Future Adapters
RDP, SSH, and VNC share the same platform-facing contract but differ internally:
- RDP: graphics, cursor, keyboard/mouse, cliprdr, rdpdr, rdpgfx/graphics pipeline.
- SSH: terminal output, keyboard input, resize, SFTP, port-forward.
- VNC: framebuffer updates, pointer, keyboard, clipboard.
The common contract is the platform session protocol, not the external resource protocol.
@@ -0,0 +1,415 @@
# Signed Scoped Cluster Snapshot Model
Status: Stage C11 result. Documentation and architecture only.
This document defines the signed scoped cluster snapshot model for future
`rap-node-agent` node-local operation and degraded-mode recovery. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.
## 1. Purpose
Signed scoped cluster snapshots allow a node to operate from verified local
configuration without asking the backend for every realtime routing decision.
The snapshot model must preserve these boundaries:
- PostgreSQL remains the only durable source of truth.
- Fabric Storage / Config Storage distributes signed snapshots and increments.
- Node-agent stores only scoped local copies.
- Redis remains live coordination only.
- Service Adapters consume assigned local configuration but do not define
routing or cluster authority.
## 2. Snapshot Definition
A scoped cluster snapshot is a signed, versioned configuration package compiled
from authoritative control-plane state.
Snapshot characteristics:
- cluster-scoped
- node-scoped
- role-scoped
- organization-scoped where applicable
- signed by an authorized control-plane/config signing key
- bounded in size
- time-limited
- reconstructable from PostgreSQL
- safe to store in node-local state
Snapshots are not mutable local databases. A node may cache them and use them
for runtime decisions within policy, but it must not treat them as new durable
truth.
## 3. Snapshot Envelope
Every snapshot must have a signed envelope.
Required envelope fields:
- `snapshot_id`
- `schema_version`
- `cluster_id`
- `subject_node_id`
- `scope_type`
- `scope_ids`
- `roles`
- `organization_ids`
- `config_version`
- `authority_epoch`
- `issued_at`
- `valid_from`
- `expires_at`
- `refresh_after`
- `signer_key_id`
- `signature_algorithm`
- `content_hash`
- `signature`
Recommended signature algorithms:
- Ed25519 for compact modern signatures where supported
- RSA-based signatures (RS256, or RSA-PSS such as PS256) where compatibility
  with existing infrastructure is required
The exact wire encoding can be JSON canonicalization first and may evolve to a
binary canonical form later. The important requirement is deterministic
canonical bytes for signature verification.
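As a sketch of what "deterministic canonical bytes" could mean for a JSON-first encoding, the helpers below serialize with sorted keys and fixed separators, then derive `content_hash` and the signing input. The `sha256:` prefix and the choice to exclude only the `signature` field are assumptions. This is also not a full RFC 8785 (JCS) implementation, which has stricter rules for numbers and strings.

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_hash(content: dict) -> str:
    """Hash of the snapshot content, stored in the envelope as content_hash."""
    return "sha256:" + hashlib.sha256(canonical_bytes(content)).hexdigest()


def signing_input(envelope: dict) -> bytes:
    """Bytes the signer signs: every envelope field except the signature itself."""
    unsigned = {key: value for key, value in envelope.items() if key != "signature"}
    return canonical_bytes(unsigned)
```

Because key order and whitespace are fixed, two producers that build the same envelope in different field order yield identical bytes, which is the property signature verification depends on.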
## 4. Snapshot Scope Types
Supported initial scope types:
- `node_bootstrap`
- `node_runtime`
- `peer_directory`
- `service_assignment`
- `route_policy`
- `qos_policy`
- `trust_bundle`
- `storage_directory`
- `degraded_mode_policy`
The control plane may deliver one combined node runtime snapshot or multiple
specialized snapshots. The node-agent local store must track version and expiry
per scope.
## 5. Role-Based Snapshot Contents
Core mesh node snapshot may include:
- cluster identity
- node membership state
- allowed peer subset
- route policy subset
- QoS policy subset
- trust bundle
- config/storage refresh endpoints
- degraded-mode peer recovery policy
Ingress node snapshot may include:
- cluster identity
- ingress role assignment
- client entry policy subset
- token validation trust material
- route entry policy
- allowed service endpoint projections
- no full internal topology
- no service target credentials
Egress/service node snapshot may include:
- assigned service workload refs
- assigned resource refs
- service policy subset
- connector or `vpn_connection` refs when authorized
- route policy needed for assigned services
- secret resolver refs only, not raw secrets
Storage/config node snapshot may include:
- assigned storage/config shard scope
- replication metadata
- peer/storage refresh policy
- allowed snapshot families
- no unrelated tenant data
Thin/mobile node snapshot may include:
- minimal trust bundle
- active session/tunnel policy subset
- minimal peer/bootstrap data
- route refresh endpoints
- no full cluster topology
## 6. Snapshot Content Rules
Allowed content:
- ids and safe metadata
- role assignments for the subject scope
- policy refs and selected policy bodies needed by the node
- peer directory subset
- route/QoS policy subset
- trust roots and revocation metadata
- service workload desired-state refs
- secret resolver refs
- degraded-mode policy
Forbidden content:
- unrelated organization data
- broad organization user lists
- raw RDP/VNC/SSH credentials
- raw VPN credentials
- secrets outside approved resolver flow
- platform-wide topology for ordinary nodes
- arbitrary query grants
- audit authority
- durable policy mutation authority
## 7. Full Snapshots and Incremental Updates
Full snapshot:
- establishes node-local state for a scope
- repairs version gaps
- repairs corruption
- establishes a new `authority_epoch`
- may replace older snapshots for the same scope
Incremental update:
- applies to exactly one base `config_version`
- carries `base_config_version`
- carries `next_config_version`
- contains scoped patch operations or replacement sections
- is signed independently
- must be rejected if base version does not match
Rules:
- version gaps require full resync
- signature mismatch requires rejection and recovery
- expired snapshots cannot authorize new operations
- node heartbeat/status must report last applied version per scope
- rollback is forbidden unless signed recovery policy explicitly allows it
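The base-version rule for incremental updates might look like the sketch below. It assumes the increment's signature has already been verified, that `config_version` is an integer, and that the patch is expressed as whole-section replacements under a hypothetical `replace_sections` key.

```python
from dataclasses import dataclass


class ResyncRequired(Exception):
    """Raised when an increment cannot be applied and a full snapshot is needed."""


@dataclass(frozen=True)
class ScopeState:
    config_version: int
    sections: dict


def apply_increment(state: ScopeState, increment: dict) -> ScopeState:
    """Apply a verified increment, or demand a full resync on any version gap."""
    if increment["base_config_version"] != state.config_version:
        raise ResyncRequired("base version does not match local version")
    if increment["next_config_version"] <= state.config_version:
        # Guards against a rollback disguised as an increment.
        raise ResyncRequired("next version is not newer than local version")
    merged = dict(state.sections)
    merged.update(increment["replace_sections"])
    return ScopeState(increment["next_config_version"], merged)
```

The previous `ScopeState` is left untouched on failure, so the caller can keep it active while requesting a full snapshot.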
## 8. Trust Roots and Signing Key Rotation
The node-agent must know which config signing keys are trusted for each cluster.
Trust material may come from:
- enrollment response
- trust bundle snapshot
- manually installed platform root for bootstrap
- signed key rotation update
Signing key rotation rules:
1. New key is introduced in a signed trust bundle.
2. Node verifies the new key through existing trust.
3. Snapshots may be dual-signed during transition.
4. Old key is retired only after policy-defined rollout.
5. Compromised key is revoked through signed revocation metadata or emergency
recovery flow.
A node must reject snapshots signed by unknown, expired, revoked, or
cluster-mismatched keys.
## 9. Verification Algorithm
Before applying a snapshot, node-agent verifies:
1. Envelope schema is supported.
2. `cluster_id` matches local cluster membership.
3. `subject_node_id` matches the local node, unless the scope explicitly allows
shared role data.
4. Signature key is trusted for the cluster and snapshot scope.
5. Signature verifies over canonical bytes.
6. `content_hash` matches content.
7. The current time falls between `valid_from` and `expires_at`, and
   `refresh_after` lies within that validity window.
8. `authority_epoch` is not stale.
9. `config_version` is newer than the local accepted version or allowed by a
signed recovery policy.
10. Scope does not grant data beyond node role and organization authorization.
11. Snapshot content passes structural validation.
12. Snapshot does not contain forbidden raw secrets.
Failure must leave the previous valid snapshot active if policy allows it.
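A subset of the envelope checks above (steps 1 to 5 and 7 to 9) can be sketched as a function that collects rejection reasons. The `local` dictionary layout, the reason strings, and the injected `verify_signature` callable are assumptions. Content-level checks (steps 6 and 10 to 12) and the shared-role exception in step 3 are not modeled.

```python
SUPPORTED_SCHEMAS = {1}  # assumed schema versions this node-agent understands


def verify_envelope(env: dict, local: dict, now: float, verify_signature) -> list:
    """Return a list of rejection reasons; an empty list means the envelope passes."""
    failures = []
    if env["schema_version"] not in SUPPORTED_SCHEMAS:
        failures.append("unsupported_schema")
    if env["cluster_id"] != local["cluster_id"]:
        failures.append("wrong_cluster")
    if env["subject_node_id"] != local["node_id"]:
        failures.append("wrong_node")
    if env["signer_key_id"] not in local["trusted_key_ids"]:
        failures.append("untrusted_signer")
    elif not verify_signature(env):
        # Only attempt cryptographic verification for a trusted key.
        failures.append("bad_signature")
    if not (env["valid_from"] <= now < env["expires_at"]):
        failures.append("outside_validity_window")
    if env["authority_epoch"] < local["authority_epoch"]:
        failures.append("stale_authority_epoch")
    if env["config_version"] <= local["config_version"]:
        failures.append("not_newer")
    return failures
```

When the list is non-empty, the caller would record the reasons as the last rejection and keep the previously verified snapshot active.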
## 10. Degraded-Mode Use
Snapshots define what the node may do when disconnected from the backend or
config/storage services.
Allowed when policy permits:
- continue already-running assigned services
- preserve existing authorized routes for a bounded TTL
- reconnect to active/warm/bootstrap peers
- use local trust bundle to validate peers
- use storage/config endpoints from the last valid snapshot
- report degraded status when connectivity returns
Forbidden in degraded mode:
- approve node enrollment
- issue certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partitions automatically
- fetch unrelated secrets
- create new service authority outside the snapshot scope
Degraded mode must be bounded by:
- snapshot expiry
- route/session TTL
- degraded-mode policy
- partition/authority state
## 11. Revocation and Expiry
Snapshots expire. Expiry is a correctness boundary, not just a cache hint.
Revocation sources:
- signed trust bundle update
- signed revocation list
- control-plane status after reconnect
- emergency recovery trust path
Revocation applies to:
- signing keys
- node identities
- role assignments
- service assignments
- peer eligibility
- storage/config endpoints
- degraded-mode permissions
If revocation state is unavailable, the node may only continue within the last
valid degraded-mode policy and must not perform high-risk actions.
## 12. Rollback and Recovery
Normal rollback to an older config is forbidden.
Allowed recovery cases:
- local snapshot file corruption
- interrupted incremental update
- bad non-authoritative cache state
- version gap requiring full resync
Recovery order:
1. keep last verified active snapshot
2. reject bad update
3. request full snapshot from config/storage service
4. use bootstrap peers if refresh endpoints fail
5. reconnect to control plane when available
6. enter degraded mode only if policy allows
Rollback to an older signed snapshot requires explicit signed recovery policy
with a newer `authority_epoch` or equivalent anti-rollback guard.
## 13. Node-Agent Local Expectations
Node-agent must store:
- active snapshot per scope
- previous verified snapshot for recovery
- pending downloaded snapshot/update before activation
- verification metadata
- last applied versions
- signer key ids
- expiry/refresh deadlines
- rejection reason for last failed update
Activation should be atomic from the node-agent perspective:
- download to pending
- verify
- write to durable local store
- swap active pointer
- notify supervised services of relevant changes
- report applied version in heartbeat/status
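One common way to get an atomic swap on a local filesystem is to write a pending file, flush it to disk, and rename it over the active file. The sketch below assumes a one-file-per-scope layout and JSON encoding, both of which are placeholders until C12 defines the real store. The `verify` callable stands in for the full verification algorithm.

```python
import json
import os
import tempfile


def activate_snapshot(store_dir: str, scope: str, snapshot: dict, verify) -> str:
    """Verify, persist, then atomically swap the active snapshot for one scope."""
    if not verify(snapshot):
        raise ValueError("snapshot failed verification")
    os.makedirs(store_dir, exist_ok=True)
    active_path = os.path.join(store_dir, f"{scope}.active.json")
    # The pending file lives in the same directory so the rename stays
    # on one filesystem and remains atomic.
    fd, pending_path = tempfile.mkstemp(
        dir=store_dir, prefix=f"{scope}.", suffix=".pending"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(snapshot, handle, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(pending_path, active_path)  # atomic swap of the active pointer
    except BaseException:
        if os.path.exists(pending_path):
            os.unlink(pending_path)
        raise
    return active_path
```

Readers of the active file see either the old snapshot or the new one, never a partial write. Syncing the containing directory, retaining the previous verified snapshot, and notifying supervised services are left out of this sketch.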
C12 will define the local store layout and durability details.
## 14. Distribution Relationship
Snapshot production flow:
1. PostgreSQL authoritative state changes.
2. Control-plane snapshot compiler builds scoped view.
3. Compiler validates scope and removes forbidden data.
4. Snapshot is signed by config signing key.
5. Snapshot or increment is published to Fabric Storage / Config Storage.
6. Node-agent refreshes by version.
7. Node-agent verifies and applies locally.
Node-origin reports such as health, heartbeat, or observed latency are not
authoritative config writes. They may influence future compiled snapshots only
after the control plane accepts them according to policy.
## 15. Validation and Future Tests
Future implementation tests must prove:
- valid snapshot applies
- invalid signature rejected
- wrong cluster rejected
- wrong node rejected
- expired snapshot rejected for new authority
- rollback rejected
- version gap triggers full resync
- forbidden raw secret content rejected
- unrelated organization data rejected
- wrong role scope rejected
- incremental update applies only to matching base version
- revoked signer rejected
- degraded-mode forbidden actions are blocked
## 16. C12 Preparation
C12 must define how node-agent stores and protects:
- snapshot files
- identity material references
- trust bundle cache
- peer cache
- route cache
- service assignment cache
- health/degraded state
- update metadata
C12 must not turn local state into durable authority. It must preserve the C11
rule that snapshots are verified scoped copies of PostgreSQL-derived state.
## 17. Result / Decision
Stage C11 defines signed scoped cluster snapshots as the required bridge between
the authoritative control plane and node-local runtime operation.
Decisions:
- snapshots are signed, versioned, scoped, bounded, and expiring
- snapshots are generated from PostgreSQL source-of-truth state
- snapshots may be distributed by Fabric Storage / Config Storage
- node-agent verifies before applying
- node-agent may operate from snapshots only within policy
- snapshots must not contain raw secrets or unrelated organization data
- incremental updates require exact base-version matching
- rollback requires explicit signed recovery policy
- C12 must define local storage without changing these authority boundaries
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C11.
@@ -0,0 +1,428 @@
# VPN / IP Tunnel Service Target Design
Status: Stage C18 planning result. Documentation only.
This document defines the target VPN/IP tunnel service architecture for the
Secure Access Fabric. It does not implement VPN runtime, packet routing, TUN
devices, mesh traffic, service workload execution, API changes, migrations, or
RDP behavior changes.
## Purpose
VPN/IP tunnel is a service above the Fabric Core, not a node-local setting.
The service must allow managed access to private networks while preserving the
platform's core rules:
- PostgreSQL remains the durable source of truth.
- Redis remains live coordination only.
- Fabric Routing Engine owns route choice.
- Nodes execute leased work only.
- Organizations must not see mesh topology.
- Interactive services such as RDP must not be harmed by VPN bulk traffic.
## Non-Goals
Stage C18 does not implement:
- VPN/IP tunnel runtime
- TUN/TAP device handling
- packet forwarding
- host route or firewall manipulation
- QUIC, WebRTC, relay packet routing, or production mesh traffic
- Windows virtual adapter, Android `VpnService`, or mobile client work
- RDP, VNC, SSH, video, file, clipboard, or data-plane behavior changes
## Core Concepts
### `vpn_connection`
`vpn_connection` is the logical control-plane entity for one managed VPN/IP
tunnel connection to a private endpoint such as an office, customer site,
branch network, partner network, or private resource zone.
Target fields:
- `id`
- `organization_id`
- `cluster_id`
- `name`
- target endpoint / office identity
- protocol/provider family
- credential/config reference
- allowed node policy
- mode: `single_active` for the initial model
- desired state: `enabled` or `disabled`
- routing usage: RDP, VNC, SSH, HTTP/internal app, IP tunnel, or future service
- route policy references
- QoS / bandwidth policy references
- placement constraints
- safe status projection
The entity belongs to the control plane. It must not be inferred from a node
environment variable, a manually started VPN process, or a host-local config
file.
### `vpn_connection_lease`
`vpn_connection_lease` represents current ownership.
Target fields:
- `vpn_connection_id`
- `cluster_id`
- `organization_id`
- `owner_node_id`
- lease generation / fencing epoch
- `expires_at`
- `renewed_at`
- `released_at`
- `fenced_at`
- status
Only the current owner with a valid, unexpired, unfenced lease may execute the
VPN connection.
### `vpn_route_policy`
`vpn_route_policy` defines what traffic may use the connection.
Policy dimensions:
- allowed CIDRs
- denied CIDRs
- DNS suffix or DNS server policy
- split-tunnel or full-tunnel eligibility
- service-specific usage
- resource-specific usage
- organization and role scope
- QoS class and bandwidth limits
Route policy is desired state. Runtime nodes apply scoped policy; they do not
invent routes.
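The allowed/denied CIDR dimensions could be evaluated as in the sketch below. It assumes default-deny and that a denied CIDR always overrides an allowed one. Neither precedence rule is fixed by this document, and the function name is hypothetical.

```python
import ipaddress


def destination_allowed(dest: str, allowed_cidrs: list, denied_cidrs: list) -> bool:
    """Default-deny check: denied CIDRs win over allowed CIDRs (an assumption)."""
    addr = ipaddress.ip_address(dest)

    def in_any(cidrs):
        # Membership is False when address and network IP versions differ.
        return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)

    if in_any(denied_cidrs):
        return False
    return in_any(allowed_cidrs)
```

For example, with `allowed_cidrs=["10.10.0.0/16"]` and `denied_cidrs=["10.10.1.0/28"]`, the address `10.10.2.5` is accepted while `10.10.1.15` is rejected because it falls inside the denied range.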
### `vpn_credential_ref`
VPN credentials must be referenced through an approved secret resolver.
Nodes receive credentials/config only when authorized and only when required to
execute or pre-warm the assigned connection. Nodes must not receive unrelated
organization credentials.
## Architecture Placement
```text
Control Plane
owns vpn_connection desired state, policy, lease/fencing, audit
Fabric Core
owns node identity, role assignment consumption, scoped snapshots,
node-local state, and service supervision boundary
Fabric Routing Engine
chooses path to the current active VPN owner or eligible egress pool
VPN/IP Tunnel Service Runtime
executes tunnel only when assigned and leased
Data Plane
carries encrypted tunnel packets later, with QoS and backpressure
```
The backend/control plane must not become a production VPN packet relay.
## Control Plane Responsibilities
The control plane owns:
- durable `vpn_connection` desired state
- route policy and service usage policy
- allowed node policy
- placement and candidate selection
- lease creation, renewal validation, and fencing decisions
- safe status projection
- audit events
- credential reference ownership
The control plane does not push arbitrary packets. It authorizes and records
what should exist.
## Node Responsibilities
Nodes do not decide to create VPN connections.
A node may execute a connection only when all of the following are true:
- node belongs to the correct cluster
- node has the required capability and role assignment
- node is allowed by the `vpn_connection` node policy
- node has a current signed/scoped configuration snapshot
- node holds the active lease
- desired state is `enabled`
- organization and service policy permit use
The node must stop execution when:
- lease is lost, expired, or fenced
- desired state becomes `disabled`
- role assignment is removed
- allowed node policy changes
- local node enters unsafe partition/degraded state
- cluster tells the node to drain
## Single-Active Lease and Fencing Model
The initial mode is `single_active`.
Correctness requirement:
- at most one node may maintain the active VPN tunnel at any time
- stale owners must be fenced before replacement becomes authoritative
- ownership changes must be monotonic through a lease generation or equivalent
fencing epoch
- connect/disconnect must be idempotent
- split-brain must not create duplicate active tunnels
Suggested target mechanics:
- short lease TTL
- periodic renewal
- monotonic lease generation
- node-local watchdog that stops tunnel when renewal fails
- explicit release on graceful shutdown
- fencing event before replacement if previous owner is uncertain
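The suggested mechanics can be modeled with an in-memory lease that carries a monotonic generation as the fencing token. This is a sketch only: the class name and method names are hypothetical, and a real implementation would keep the lease in PostgreSQL and update it transactionally, which is not shown.

```python
from typing import Optional


class SingleActiveLease:
    """In-memory model of one vpn_connection lease with a fencing generation."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self.owner: Optional[str] = None
        self.generation = 0
        self.expires_at = 0.0

    def acquire(self, node_id: str, now: float) -> Optional[int]:
        """Grant the lease if it is free or expired; return the new generation."""
        held_by_other = (
            self.owner is not None and self.owner != node_id and now < self.expires_at
        )
        if held_by_other:
            return None
        self.generation += 1  # every ownership change fences the older generation
        self.owner = node_id
        self.expires_at = now + self.ttl
        return self.generation

    def renew(self, node_id: str, generation: int, now: float) -> bool:
        """Extend the lease only for the current owner, generation, and unexpired TTL."""
        if not self.may_execute(node_id, generation, now):
            return False
        self.expires_at = now + self.ttl
        return True

    def may_execute(self, node_id: str, generation: int, now: float) -> bool:
        """True only for the current owner holding the current unexpired generation."""
        return (
            node_id == self.owner
            and generation == self.generation
            and now < self.expires_at
        )
```

A node-local watchdog would stop the tunnel as soon as `renew` returns `False`. Any downstream component that receives the generation can reject work from a stale owner by comparing it with the latest value it has seen.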
## Routing Policy Model
Traffic references a logical `vpn_connection`, not a physical node.
Examples:
- RDP resource may require `vpn_connection = office-a`
- SSH resource may require `vpn_connection = office-a`
- IP tunnel profile may expose selected CIDRs through `office-a`
- HTTP/internal app resource may route through the active VPN owner
The Fabric Routing Engine resolves:
```text
service request
-> logical vpn_connection
-> current active owner / eligible egress
-> fabric route
-> VPN service runtime
-> private network target
```
Route updates should be dynamic. Changing CIDRs, DNS policy, or active owner
should not require manual client reconfiguration when clients use
platform-managed access.
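The resolution chain above can be sketched as two lookups: resource to logical `vpn_connection`, then connection to current active owner. The table names, resource keys, and node ids below are hypothetical; a real Fabric Routing Engine would presumably read them from scoped, signed snapshots rather than module-level dictionaries.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class Route(NamedTuple):
    vpn_connection: str
    owner_node: str
    lease_generation: int


# Hypothetical lookup tables used only for illustration.
RESOURCE_TO_CONNECTION: Dict[str, str] = {
    "rdp:finance-desktop": "office-a",
    "ssh:build-host": "office-a",
}
ACTIVE_OWNER: Dict[str, Tuple[str, int]] = {"office-a": ("node-7", 3)}


def resolve(resource: str) -> Optional[Route]:
    """Resolve a service request to the current owner of its logical connection."""
    connection = RESOURCE_TO_CONNECTION.get(resource)
    if connection is None:
        return None                      # resource does not use a vpn_connection
    owner = ACTIVE_OWNER.get(connection)
    if owner is None:
        return None                      # no active owner: nothing to route to
    node, generation = owner
    return Route(connection, node, generation)
```

Because resources reference only the logical name, replacing the `ACTIVE_OWNER` entry after a failover changes the resolved node without touching any resource definition.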
## QoS and Bandwidth Rules
VPN bulk traffic must degrade before interactive traffic.
Priority order:
1. RDP input/control
2. interactive RDP/VNC/SSH control and render-critical traffic
3. clipboard and small reliable control messages
4. video/audio adaptive traffic
5. file transfer
6. VPN bulk packets
7. telemetry
Bandwidth policy should support:
- per-organization limits
- per-service limits
- per-`vpn_connection` limits
- per-node limits
- reserved bandwidth for interactive services
Service adapters must not implement QoS routing themselves. They label traffic
or request a channel class; Fabric applies route/QoS policy.
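One simple way to express "VPN bulk traffic must degrade before interactive traffic" is strict-priority allocation over the classes listed above. This sketch is a deliberate simplification: the `TrafficClass` names and the `allocate` function are hypothetical, and strict priority alone does not model reserved bandwidth or per-organization limits.

```python
from enum import IntEnum
from typing import Dict


class TrafficClass(IntEnum):
    """Lower value means higher priority, following the order above."""
    RDP_INPUT_CONTROL = 1
    INTERACTIVE = 2
    CLIPBOARD_CONTROL = 3
    ADAPTIVE_MEDIA = 4
    FILE_TRANSFER = 5
    VPN_BULK = 6
    TELEMETRY = 7


def allocate(capacity: int, demand: Dict[TrafficClass, int]) -> Dict[TrafficClass, int]:
    """Grant bandwidth to higher-priority classes first until capacity runs out."""
    remaining = capacity
    granted: Dict[TrafficClass, int] = {}
    for cls in sorted(demand):           # IntEnum sorts by priority value
        share = min(demand[cls], remaining)
        granted[cls] = share
        remaining -= share
    return granted
```

In this model a service adapter only supplies the `TrafficClass` label; the allocation decision stays in Fabric, as the rule above requires.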
## Security Boundaries
Security requirements:
- organization-scoped `vpn_connection`
- cluster-scoped identity and tokens
- mTLS node-to-node transport
- short-lived route/tunnel authorization tokens when needed
- credentials delivered only through approved resolver
- candidate nodes receive only scoped config
- active owner receives execution secrets only when authorized
- no organization sees another organization's connections, routes, credentials,
peer cache, or topology
- platform owner actions are audited
Compromised node blast radius must be bounded. A compromised node must not gain
credentials for unrelated `vpn_connection` entities or unrelated organizations.
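The rule that only the authorized active owner receives execution secrets can be sketched as a resolver that checks the caller against the current lease before returning anything. Everything here is hypothetical (the table names, the opaque handle value, and `resolve_execution_secret`); it only illustrates the blast-radius argument and is not the approved resolver.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stores: opaque secret handles and current (owner, generation) leases.
SECRET_HANDLES: Dict[str, str] = {"office-a": "opaque-handle-office-a"}
ACTIVE_LEASES: Dict[str, Tuple[str, int]] = {"office-a": ("node-7", 3)}


def resolve_execution_secret(node_id: str, connection: str, generation: int) -> Optional[str]:
    """Return a secret handle only to the node holding the current lease generation."""
    if ACTIVE_LEASES.get(connection) != (node_id, generation):
        return None          # candidates, stale owners, and unrelated nodes get nothing
    return SECRET_HANDLES.get(connection)
```

A compromised candidate node, or a former owner presenting an old generation, receives `None` for every connection it does not currently own.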
## Observability and Audit
Audit events:
- `vpn_connection_created`
- `vpn_connection_enabled`
- `vpn_connection_disabled`
- `vpn_connection_policy_changed`
- `vpn_connection_candidate_changed`
- `vpn_connection_lease_acquired`
- `vpn_connection_lease_renewed`
- `vpn_connection_lease_lost`
- `vpn_connection_owner_fenced`
- `vpn_connection_failover_started`
- `vpn_connection_failover_completed`
- `vpn_connection_credential_rotated`
- `vpn_route_policy_changed`
Metrics/status:
- desired state
- active owner
- standby/pre-warm owners
- lease generation
- last connect/disconnect time
- route count
- latency/packet loss where observable
- bandwidth by service class
- failover count
- last failure reason
Organization views show safe status. Platform owner views may show active node
and operational detail according to platform policy and audit.
## Failure Mode Matrix
| Failure | Required behavior | Notes |
| --- | --- | --- |
| Active node heartbeat lost | Lease expires or is fenced; cluster selects replacement | Single-active must be preserved |
| Active node loses lease locally | Node stops VPN runtime | Node must not wait for backend packet path |
| Control plane temporarily unavailable | Existing leased runtime may continue only within lease/snapshot policy | No policy mutation in degraded mode |
| Split-brain / partition | Minority must not create second active owner | Fencing/quorum rules required before runtime |
| Credential revoked | Active owner stops or reconnects with rotated credentials | Audit required |
| Route policy changes | Dynamic route update; deny removed routes | No manual client reconfiguration |
| Candidate node becomes overloaded | Keep sticky owner unless policy/failure/maintenance requires move | Avoid needless TCP disruption |
| Graceful node maintenance | Drain, release/transfer lease, then stop | Prefer standby/pre-warm replacement |
| VPN protocol reconnects | Preserve logical `vpn_connection`; refresh routes | Some TCP sessions may still break |
| Relay path unavailable | Fabric reroutes if policy allows | VPN service does not own mesh routing |
## Stateful Session Limits
VPN failover may disrupt long-lived TCP sessions. The platform should minimize
disruption through sticky placement, graceful drain, standby/pre-warm nodes,
stable route identity, and transparent route refresh, but the initial
`single_active` mode does not guarantee lossless TCP migration.
Future `multi_active` or load-balanced VPN modes may reduce disruption. They
must be explicit future modes and must not weaken `single_active` correctness.
## Relationship to Current Mesh Proof Set
C17A-C17G prove synthetic fabric messages, route health/failover probes, relay
semantics, a bounded `synthetic.echo` path, live synthetic HTTP node-to-node
transport, scoped synthetic route config loading, and Control Plane scoped
synthetic config reads in `rap-node-agent`.
They do not authorize VPN traffic.
VPN/IP tunnel runtime must wait until the control-plane desired-state model,
lease/fencing, scoped snapshots, node-local state, secure node-to-node
channels, and Fabric Routing Engine boundaries are accepted for this service.
## Future Implementation Stages
C18A - VPN/IP tunnel control-plane data model foundation:
- durable `vpn_connections`
- route policy tables
- allowed node policy
- lease/fencing model
- audit events
- no runtime packets
Status: completed and backend-test-proven. Result:
`artifacts/c18a-vpn-control-plane-data-model-report.md`.
C18B - Lease and fencing service:
- single-active ownership service
- TTL renewal/fencing behavior
- stale owner handling
- no real VPN runtime
Status: completed and backend-test-proven. Result:
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.
C18C - Node-agent desired-state consumption:
- node reads scoped `vpn_connection` assignments
- reports status
- does not create real tunnel
Status: completed and backend-test-proven. Result:
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.
Notes:
- node assignment visibility is limited to eligible candidates or the current
active lease owner
- observed assignment status is explicit: `not_started`, `assigned`,
`lease_required`, `blocked`, `unknown`
- `credential_ref` is not exposed to node-agent assignment payloads
- no VPN runtime, TUN/TAP, host route/DNS/firewall/QoS manipulation, packet
forwarding, or production mesh traffic is implemented
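The C18C notes above describe two contract properties: visibility is limited to eligible candidates or the current lease owner, and `credential_ref` never reaches the node. A minimal sketch of that filtering follows; the status values are taken from the notes, while `assignment_payload` and the record fields other than `credential_ref` are hypothetical.

```python
from enum import Enum
from typing import Any, Dict, Optional, Set


class AssignmentStatus(str, Enum):
    """Observed assignment statuses listed in the C18C notes."""
    NOT_STARTED = "not_started"
    ASSIGNED = "assigned"
    LEASE_REQUIRED = "lease_required"
    BLOCKED = "blocked"
    UNKNOWN = "unknown"


def assignment_payload(
    connection: Dict[str, Any],
    node_id: str,
    candidates: Set[str],
    lease_owner: Optional[str],
) -> Optional[Dict[str, Any]]:
    """Build the node-visible payload, or None when the node may not see it."""
    if node_id not in candidates and node_id != lease_owner:
        return None
    # credential_ref stays in the control plane and is never sent to the node.
    return {k: v for k, v in connection.items() if k != "credential_ref"}
```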
C18D - Secret resolver integration:
- scoped credential/config delivery
- candidate/active-owner restrictions
- credential rotation audit
C18E - Routing policy integration:
- CIDR and service-specific route intent
- route projection to Fabric Routing Engine
- no packet forwarding
C18F - Non-production fake VPN executor:
- synthetic leased service state only
- no TUN, no packets, no private network routing
C18G - Lab-only native VPN executor prototype:
- explicit separate approval required
- native mode preferred for TUN/firewall/QoS
- no privileged container by default
C18H - Client route refresh/resume design:
- route updates
- reconnect behavior
- split/full tunnel client posture
C18I - Production hardening:
- split-brain drills
- failover testing
- QoS load testing
- security review
- observability and incident runbooks
## Result / Decision
Stage C18 defines VPN/IP tunnel as a cluster-managed service above Fabric Core.
The first implementation step must be control-plane desired state and
lease/fencing foundation, not packet routing. Nodes are execution units, not
owners of desired state. Fabric owns routing and QoS. PostgreSQL remains the
source of truth, Redis remains live coordination only, and Fabric
Storage/Config Storage remains a scoped distribution/cache layer. RDP, current
direct worker WSS, backend gateway fallback, C17 synthetic mesh proofs, and all
existing service-adapter behavior are untouched by this document. C18A,
C18B, and C18C are now implemented only as a control-plane/node-agent contract
foundation; they still do not authorize VPN/IP tunnel runtime or host
networking changes.
@@ -0,0 +1,429 @@
# Web Ingress and Admin UI Model
Status: target architecture clarification. Documentation only.
This document defines how HTTP/HTTPS web entry, Admin UI, dynamic page
composition, and cluster configuration responsibilities are separated in the
Secure Access Fabric.
It does not implement code, APIs, UI pages, mesh runtime, VPN runtime, or RDP
changes.
## Purpose
The platform needs a clear distinction between:
- Web Service as the HTTP/HTTPS entry layer
- Control Plane as the owner of cluster configuration and policy
- Admin UI as a safe, scoped user interface over Control Plane APIs
The Web layer must never become the owner of cluster state, policy, topology,
secrets, node identity, or routing authority.
## Layer Ownership
### Web Service / Web Ingress
Web Service is an edge service.
Suggested role names:
- `web-ingress`
- `admin-web-entry`
- `admin-web-shell`
Responsibilities:
- accept HTTP/HTTPS
- terminate TLS or sit behind the approved TLS terminator
- serve Admin UI shell/static assets
- proxy browser/API traffic to Control API
- apply edge controls such as headers, rate limits, request size limits, and
future WAF rules
- expose only approved public/admin endpoints
Web Service must not:
- own cluster configuration
- directly mutate PostgreSQL
- store durable topology or policy
- store secrets
- store node identity or certificates as source of truth
- expose internal mesh topology to browser clients
- execute cluster decisions locally
### Control Plane
Control Plane owns all durable cluster configuration and policy.
Responsibilities:
- clusters
- nodes
- node enrollment and approval
- role assignments
- organization and tenant policy
- service desired state
- service endpoint visibility
- signed scoped snapshots
- config distribution rules
- audit
- high-risk action authorization
- step-up authentication requirements
PostgreSQL remains the durable source of truth. Redis remains live coordination
only.
Cluster configuration is changed only through Control Plane services and APIs.
The Web layer is a presentation and ingress layer over those APIs.
### Admin UI
Admin UI is a client application served through Web Ingress.
It renders safe Control Plane projections and submits user actions to Control
Plane APIs.
Admin UI must not:
- contain embedded internal topology
- contain secrets
- contain raw credential references beyond safe indicators
- contain peer cache data
- contain route cache data
- contain private node-to-node endpoints unless explicitly authorized for the
viewer
- contain executable cluster logic
## Admin Endpoint Placement
Admin UI endpoint placement is explicit and must not be inferred from storage.
Scopes:
- Platform Owner Console: global platform-owner scope. It may aggregate
multiple clusters through Control Plane APIs according to platform policy and
audit.
- Cluster Admin Endpoint: cluster-local admin/web ingress endpoint for a single
cluster. It is hosted only by nodes explicitly assigned an approved
admin/web ingress role.
- Organization Admin Panel: tenant-safe projection for one organization. It
must expose only allowed resources, service endpoints, sessions, policies,
and safe status.
Rules:
- Fabric Storage / Config Storage nodes do not automatically host Admin UI.
- Adding a storage node to a new cluster does not move the cluster panel.
- Storage nodes distribute/cache scoped configuration and snapshots only.
- Admin/web ingress is a separate service role and requires explicit Control
Plane assignment.
- Cluster-local admin endpoints require valid TLS/cert policy, signed scoped
snapshots, current node health, and sufficient role coverage.
- Platform Owner Console remains the owner-level view even when cluster-local
admin endpoints exist.
- Organization Admin Panel must never expose intermediate mesh topology,
storage shards, peer caches, route caches, or unrelated cluster data.
## Request Flow
```text
Admin Browser
-> Web Ingress / Admin Web Shell
-> Control API
-> PostgreSQL source of truth
-> signed scoped snapshots / config distribution
-> rap-node-agent
```
Web Ingress may cache static assets and safe UI manifests, but it must not
become a second source of truth.
## Dynamic Admin Pages
Admin pages may be dynamically composed, but they must be generated from safe
metadata and scoped projections.
The recommended model is:
```text
Admin Web Shell
-> UI Manifest / Page Definition endpoint
-> Scoped Control API endpoints
```
Dynamic pages are allowed for:
- platform admin sections
- cluster admin sections
- node detail sections
- service adapter safe configuration sections
- future organization admin sections
Dynamic pages must be declarative. They must not inject arbitrary executable
code from the backend into the browser.
## UI Manifest Model
The Control Plane may provide a `ui_manifest` or page definition for a specific
viewer context.
Viewer context includes:
- user id
- platform role
- organization memberships
- cluster access scope
- device trust state
- MFA / step-up state
- feature flags
- service availability
The manifest may include:
- visible navigation sections
- page ids
- component ids from an approved component registry
- form schemas
- table schemas
- safe field labels and message keys
- allowed actions
- action risk level
- API route references
- required permissions
- required step-up authentication flags
- audit event category
- refresh hints
The manifest must not include:
- secrets
- raw credentials
- private keys
- full mesh topology
- full peer cache
- route cache
- unrelated organization data
- unrelated cluster data
- internal node-to-node route details
- arbitrary JavaScript or executable code
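The include/exclude lists above lend themselves to a mechanical check before a manifest leaves the Control Plane. The sketch below walks a manifest and reports forbidden keys and unapproved components. The key names, component ids, and `manifest_violations` are hypothetical, since the `ui_manifest` schema is not yet defined (WEB-2); a production check would more likely validate against a strict allowlist schema than a key denylist.

```python
from typing import Any, List

# Hypothetical names for illustration only.
FORBIDDEN_KEYS = {"secret", "credential", "private_key", "peer_cache", "route_cache", "script"}
APPROVED_COMPONENTS = {"table", "form", "status_badge"}


def manifest_violations(node: Any, path: str = "$") -> List[str]:
    """Recursively collect violations in a manifest made of dicts, lists, and scalars."""
    problems: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}"
            if key in FORBIDDEN_KEYS:
                problems.append(f"forbidden key: {here}")
            if key == "component" and not (
                isinstance(value, str) and value in APPROVED_COMPONENTS
            ):
                problems.append(f"unapproved component: {here}")
            problems.extend(manifest_violations(value, here))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            problems.extend(manifest_violations(item, f"{path}[{index}]"))
    return problems
```

An empty result means the manifest passed this particular check; it does not replace server-side authorization of the actions the manifest references.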
## Page Definition Safety Rules
Dynamic pages are schema-driven views over safe data.
Rules:
- page definitions are data, not code
- page definitions must use an approved component registry
- fields must be explicitly typed
- actions must map to known Control Plane operations
- every action must be permission checked server-side
- high-risk actions must declare step-up requirements
- all mutations must be audited
- UI labels should use localization message keys with English fallback text
- sensitive responses should use `Cache-Control: no-store`
Client-side hiding is not authorization. The Control Plane must enforce all
permissions and policies even if a browser crafts a request manually.
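The point that client-side hiding is not authorization can be made concrete with a server-side check that ignores whatever the browser claims and looks the operation up in a server-owned registry. The registry contents, permission strings, and `authorize` function are hypothetical; only the rule that node approval is high-risk comes from this document.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Operation:
    required_permission: str
    high_risk: bool = False


@dataclass(frozen=True)
class Viewer:
    permissions: FrozenSet[str]
    stepped_up: bool = False     # true after a successful step-up authentication


# Server-owned registry of known Control Plane operations (hypothetical ids).
OPERATIONS: Dict[str, Operation] = {
    "node.approve": Operation("node.manage", high_risk=True),
    "resource.rename": Operation("resource.edit"),
}


def authorize(viewer: Viewer, operation_id: str) -> str:
    """Decide on the server, regardless of what the UI showed or hid."""
    operation = OPERATIONS.get(operation_id)
    if operation is None:
        return "unknown_operation"
    if operation.required_permission not in viewer.permissions:
        return "deny"
    if operation.high_risk and not viewer.stepped_up:
        return "step_up_required"
    return "allow"
```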
## Safe Data Projection
The Control Plane should expose different projections for different audiences.
Platform owner/admin may see:
- clusters
- nodes
- join requests
- role assignments
- safe topology summaries
- service placement
- health and audit
- partition/recovery status
- active node for cluster-managed services where allowed
Organization admin may see only:
- organization resources
- organization users/groups where authorized
- organization policies
- active sessions
- allowed ingress endpoints
- allowed egress/service endpoints
- safe VPN/connector status
- organization audit
Organization admin must not see:
- intermediate core mesh topology
- other organizations
- peer caches
- route caches
- unrelated nodes
- platform trust roots
- raw node certificates
- secrets
- unrelated cluster internals
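A tenant-safe projection can be sketched as a filter that first scopes records to one organization and then copies only allowlisted fields. The field names and `project_for_org` below are hypothetical; the sketch uses an allowlist so that a newly added internal field is hidden by default.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical allowlist of fields an organization admin may see.
ORG_VISIBLE_FIELDS = ("id", "name", "desired_state", "safe_status")


def project_for_org(records: Iterable[Dict[str, Any]], organization_id: str) -> List[Dict[str, Any]]:
    """Return only this organization's records, reduced to allowlisted fields."""
    projected = []
    for record in records:
        if record.get("organization_id") != organization_id:
            continue                     # other organizations are never visible
        projected.append(
            {field: record[field] for field in ORG_VISIBLE_FIELDS if field in record}
        )
    return projected
```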
## Service Adapter UI Extensions
Service adapters may need configuration UI.
Examples:
- RDP resource settings
- VNC resource settings
- SSH resource settings
- VPN/IP tunnel connection settings
- file policy settings
- video/audio policy settings
Adapter UI extensions must be registered as safe schema descriptors through the
Control Plane. Adapters must not directly publish arbitrary browser code.
Allowed extension content:
- field schema
- validation hints
- policy options
- message keys
- safe help text
- action ids mapped to Control Plane APIs
Disallowed extension content:
- executable code
- protocol secrets
- internal adapter memory/state
- raw target credentials
- unrestricted backend endpoints
## Cluster Configuration Ownership
Cluster configuration belongs to Control Plane.
Examples:
- cluster creation and disablement
- node approval
- node role assignment
- service desired state
- VPN connection desired state
- allowed node policy
- route policy
- QoS policy
- signed snapshot generation
- storage/config distribution scope
Admin UI may present these controls, but it does not own the decisions.
The authoritative path is:
```text
Admin action
-> Control API authorization
-> policy validation
-> PostgreSQL mutation
-> audit event
-> snapshot/config distribution update
-> node-agent consumption
```
## Security Requirements
Web/Admin security requirements:
- TLS for all browser traffic
- secure cookies or approved token storage model
- CSRF protection where cookie auth is used
- CSP for Admin UI
- no secrets in HTML or JavaScript bundles
- no internal topology embedded in static assets
- no arbitrary backend-provided JavaScript
- strict server-side authorization
- risk-based admin access
- MFA/2FA and step-up for high-risk actions
- audit every mutation
- short-lived UI manifests where sensitive
- no-store cache headers for sensitive API responses
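Several of the requirements above map onto standard HTTP response headers and cookie attributes. The values below are an illustrative baseline, not an approved policy: the exact CSP directives, HSTS lifetime, and the cookie name `session` are assumptions that a security review would need to set.

```python
from typing import Dict


def sensitive_response_headers() -> Dict[str, str]:
    """Headers for sensitive Admin API responses (illustrative values)."""
    return {
        "Cache-Control": "no-store",
        "Content-Security-Policy": "default-src 'self'; script-src 'self'; object-src 'none'",
        "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
        "X-Content-Type-Options": "nosniff",
    }


def session_cookie(value: str) -> str:
    """Build a Set-Cookie value with the usual hardening attributes."""
    return f"session={value}; Secure; HttpOnly; SameSite=Strict; Path=/"
```

`SameSite=Strict` reduces CSRF exposure for cookie-based auth, but it does not replace explicit CSRF protection on state-changing requests.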
High-risk actions include:
- node approval
- role assignment
- cluster trust changes
- cross-cluster trust changes
- partition promotion
- secrets access
- update policy changes
- VPN credential/config resolver access
## Deployment Model
Possible deployment modes:
- Web Ingress and Control API in the same deployment for small/test installs
- Web Ingress separated from Control API for production
- multiple Web Ingress nodes for regional/admin access
- Web Ingress behind Caddy/Nginx/enterprise ingress
- Admin UI shell served from Web Ingress while APIs remain on Control API
Even when deployed together, ownership remains separate:
- Web Ingress is entry/presentation
- Control API is authorization/domain logic
- PostgreSQL is source of truth
- Fabric Storage/Config Storage is scoped distribution/cache
- node-agent consumes scoped desired state
## Future Stages
Suggested staged work:
WEB-1: Document Web Ingress and Admin UI ownership model.
WEB-2: Define `ui_manifest` schema and approved component registry.
WEB-3: Add platform-admin Admin Web Shell that consumes scoped manifests.
Initial Platform Owner Control Panel is implemented and build-verified in
`web-admin`. Report:
`artifacts/web-admin-platform-owner-control-panel-report.md`.
WEB-4: Add cluster admin pages using Control Plane projections.
WEB-5: Add organization admin pages using tenant-safe projections.
WEB-6: Add high-risk action step-up and device-trust UI flows.
WEB-7: Add service-adapter UI extension registry.
WEB-8: Add signed/versioned UI manifest distribution if needed for offline or
edge-served admin shells.
## Non-Goals
This document does not authorize:
- implementation of new UI pages
- changing existing Windows client behavior
- changing RDP runtime
- mesh runtime
- VPN runtime
- node-agent service execution changes
- storing cluster configuration inside Web Service
- exposing internal topology to organizations
## Result / Decision
The Web layer is an ingress and presentation layer, not a cluster configuration owner.
Cluster configuration belongs to the Control Plane and is persisted in
PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
schema-driven projections over Control Plane APIs. They must not embed secrets,
internal topology, peer caches, route caches, or arbitrary executable code.
Admin endpoint placement is explicit. A Fabric Storage / Config Storage node
does not automatically become a cluster panel. Platform Owner Console remains
global platform-owner scope; Cluster Admin Endpoint is a separate cluster-local
admin/web ingress role; Organization Admin Panel remains a tenant-safe
projection.