Production Direct Worker WSS Trust
Archived status: this document describes an older direct-worker WSS trust
track. It is not the current runtime transport source of truth. For the active
fabric transport model, use
docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md,
docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md, and
docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md.
Status: P3.4 design/prep complete.
This document defines the production trust model for direct worker WSS. It is a preparation document only: it does not change RDP runtime behavior, does not remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC, WebRTC, or node-agent enrollment.
Goal
Direct worker WSS should become the preferred production realtime path only when the client can verify both:
- the backend candidate is authorized and marked production_trusted=true
- the worker endpoint presents a valid TLS certificate for the advertised URL
The backend gateway remains the safe fallback/debug path.
Trust Modes
DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE has three modes:
- smoke_insecure: development/smoke only. Backend may advertise a direct candidate only outside production and must mark it smoke_only=true and production_trusted=false.
- public_ca: worker WSS certificate chains to an OS/publicly trusted CA. Backend may mark the candidate production_trusted=true.
- platform_ca: worker WSS certificate chains to a platform-managed CA. Backend may mark the candidate production_trusted=true and include tls_ca_ref.
Production must not treat smoke_insecure as trusted. P3.3 proved that a
production backend with smoke_insecure falls back to backend_gateway.
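The mode rules above can be sketched as a small decision function. This is an illustrative sketch only, not the backend's actual code: the names CandidateTrust and candidate_trust are hypothetical, and rejecting platform_ca without a tls_ca_ref is an assumption (the rules above only say the reference may be included).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateTrust:
    """Trust metadata attached to an advertised direct candidate (hypothetical type)."""
    tls_trust_mode: str
    smoke_only: bool
    production_trusted: bool
    tls_ca_ref: Optional[str] = None


def candidate_trust(mode: str, is_production: bool,
                    ca_ref: Optional[str] = None) -> Optional[CandidateTrust]:
    """Return trust metadata, or None when no direct candidate may be advertised."""
    if mode == "smoke_insecure":
        if is_production:
            # Production never advertises smoke candidates; the client
            # ends up on backend_gateway instead.
            return None
        return CandidateTrust(mode, smoke_only=True, production_trusted=False)
    if mode == "public_ca":
        return CandidateTrust(mode, smoke_only=False, production_trusted=True)
    if mode == "platform_ca":
        if not ca_ref:
            # Assumption: treat a missing CA reference as a configuration error.
            raise ValueError("platform_ca requires a tls_ca_ref")
        return CandidateTrust(mode, smoke_only=False, production_trusted=True,
                              tls_ca_ref=ca_ref)
    raise ValueError(f"unknown trust mode: {mode!r}")
```

Returning None for smoke_insecure in production mirrors the P3.3 result: with no direct candidate available, the session uses backend_gateway.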
Recommended Mode
Use platform_ca as the default production model for platform-managed and
customer-managed worker nodes.
Use public_ca only when the worker direct WSS endpoint is intentionally
internet-addressable through stable DNS and a public certificate can be issued
and renewed safely.
Rationale:
- most worker endpoints will be private, internal, or customer-managed
- public CA issuance is often impossible for private IP/DNS names
- a platform CA can bind certificates to platform node/worker identity
- platform CA trust can later integrate with rap-node-agent
- backend gateway fallback remains available while trust rollout is staged
Certificate Profile
Worker direct WSS certificates must be server certificates.
Required X.509 properties:
- KeyUsage: digitalSignature, plus keyEncipherment where required by the selected TLS key type
- ExtendedKeyUsage: serverAuth
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
- CN must not be used as the trust identity
- validity should be short-lived, recommended 30-90 days in production
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally practical
Recommended identity SAN:
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
For the current single-cluster MVP, cluster_id may be default until the
cluster model becomes explicit.
The URI SAN is not a replacement for normal hostname verification. It is an additional identity binding for observability, future node-agent enrollment, and future control-plane certificate inventory.
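As a minimal sketch, the identity URI can be built from the two identifiers. The function name worker_identity_uri and the allowed-character rule are assumptions for illustration; the document does not define a character set for cluster_id or worker_id.

```python
import re

# Assumption: identifiers are limited to characters that need no URI escaping.
_SEGMENT = re.compile(r"^[A-Za-z0-9._-]+$")


def worker_identity_uri(worker_id: str, cluster_id: str = "default") -> str:
    """Build the recommended URI SAN value for a worker certificate."""
    for name, value in (("cluster_id", cluster_id), ("worker_id", worker_id)):
        if not _SEGMENT.match(value):
            raise ValueError(f"{name} contains unsupported characters: {value!r}")
    return f"spiffe://rap/cluster/{cluster_id}/worker/{worker_id}"
```

For the single-cluster MVP, worker_identity_uri("rdp-worker-1") yields the same URI SAN used on the test stand, spiffe://rap/cluster/default/worker/rdp-worker-1.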
Candidate URL Rules
The backend must advertise a direct worker WSS URL whose host is covered by the worker certificate SAN.
Examples:
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
wss://192.168.200.61:18443/rap/v1/data-plane
If the URL uses a DNS name, the certificate must include that DNS SAN.
If the URL uses an IP address, the certificate must include that IP SAN.
Preferred production shape is DNS, not raw IP, because DNS gives safer certificate rotation and node replacement.
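The SAN coverage rule can be checked before a candidate is advertised. The sketch below is a hypothetical pre-flight helper (url_host_covered is not an existing function) and does not replace the hostname verification performed by the client TLS stack. It assumes the caller has already extracted the DNS and IP SAN values from the worker certificate.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit


def _dns_match(host: str, pattern: str) -> bool:
    """Exact match, or a wildcard that covers exactly one left-most label."""
    if pattern.startswith("*."):
        suffix = pattern[1:]  # e.g. ".dp.example.test"
        if not host.endswith(suffix):
            return False
        label = host[: -len(suffix)]
        return label != "" and "." not in label
    return host == pattern


def url_host_covered(url: str, dns_sans: Iterable[str], ip_sans: Iterable[str]) -> bool:
    """Return True when the wss:// URL host is covered by a SAN of the right type."""
    parts = urlsplit(url)
    if parts.scheme != "wss" or not parts.hostname:
        return False
    host = parts.hostname
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        addr = None
    if addr is not None:
        # An IP host must match an IP SAN; a DNS SAN holding the same text does not count.
        return any(ipaddress.ip_address(san) == addr for san in ip_sans)
    host = host.lower().rstrip(".")
    return any(_dns_match(host, san.lower()) for san in dns_sans)
```

Keeping the DNS and IP SAN lists separate reflects the rule above: an IP address that appears only as a DNS SAN does not cover a raw-IP URL.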
Worker Identity Binding
Direct worker WSS authentication is layered:
- TLS proves that the client reached an endpoint with a certificate trusted for the advertised URL.
- data_plane_token proves that the backend authorized the session, attachment, user, organization, resource, worker, and allowed channels.
- The worker validates the token and binds the WSS connection to an existing runtime only.
The TLS certificate does not replace token validation.
The token does not replace TLS trust.
Future production hardening should add control-plane certificate inventory:
worker_certificates
  worker_id
  cluster_id
  tls_ca_ref
  certificate_fingerprint_sha256
  serial_number
  not_before
  not_after
  status: active | retiring | revoked | expired
Until that inventory exists, backend must be conservative and only mark direct candidates production-trusted when deployment configuration guarantees the worker certificate is trusted for the advertised URL.
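One possible in-memory shape for an inventory record is sketched below. The field names come from the list above; the Python types, the CertStatus enum, and the is_usable rule (active or retiring, and inside the validity window) are assumptions, since the inventory does not exist yet.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class CertStatus(str, Enum):
    ACTIVE = "active"
    RETIRING = "retiring"
    REVOKED = "revoked"
    EXPIRED = "expired"


@dataclass(frozen=True)
class WorkerCertificate:
    """Hypothetical control-plane record for one worker server certificate."""
    worker_id: str
    cluster_id: str
    tls_ca_ref: str
    certificate_fingerprint_sha256: str
    serial_number: str
    not_before: datetime
    not_after: datetime
    status: CertStatus

    def is_usable(self, now: datetime) -> bool:
        """Assumed rule: usable when active or retiring and within validity."""
        if self.status not in (CertStatus.ACTIVE, CertStatus.RETIRING):
            return False
        return self.not_before <= now <= self.not_after
```

With such a record, the backend could decide production_trusted from inventory state instead of relying on deployment configuration alone.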
Platform CA Structure
Recommended hierarchy:
RAP Platform Offline Root CA
-> RAP Data Plane Worker Intermediate CA v1
-> worker direct WSS server certificates
Rules:
- Root CA private key must not be present on worker hosts.
- Intermediate CA private key must not be present on worker hosts.
- Worker receives only its server certificate, private key, and CA chain.
- Windows clients receive only the trust bundle, never private keys.
- Backend receives CA reference metadata and may carry public trust bundle references, never CA private keys.
For the current test stand, a temporary test CA may be generated on
docker-test, but it must be treated as throwaway test material and not
committed.
Certificate Issuance And Storage
Future rap-node-agent should own enrollment. Before node-agent exists, test
stand issuance may be manual.
Production desired flow:
- Platform owner approves node/worker enrollment.
- Node agent generates a private key locally.
- Node agent creates CSR with:
  - worker/node identity URI SAN
  - DNS/IP SANs for reachable direct WSS endpoints
  - cluster id
- Control plane or CA service signs the CSR if node policy allows the role.
- Node agent writes certificate/key to a host-local protected path.
- Worker container mounts certificate/key read-only, or native worker reads protected local files.
- Backend advertises direct candidates with:
  - tls_trust_mode=platform_ca
  - production_trusted=true
  - smoke_only=false
  - tls_ca_ref=<active-ca-ref>
Container note:
- Certificates are node/host trust assets, not container identity.
- Containers may consume mounted cert/key files.
- Container rebuilds must not generate production CA material.
Windows Client Trust
For public_ca, the Windows client should rely on normal OS certificate
validation.
For platform_ca, the preferred production approach is app-local trust:
- client configuration references a platform CA bundle by tls_ca_ref
- WSS TLS validation uses a custom chain policy with an app-managed trust store
- hostname/SAN validation remains enabled
- revocation/deny-list checks are applied when available
- no global insecure callback is used
Installing the platform root into the Windows CurrentUser or LocalMachine Root store may be supported for managed enterprise deployment, but it should not be required for MVP smoke because it broadens OS-level trust.
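The app-local trust idea can be illustrated with Python's standard ssl module. This is an analogue for explanation only: it is not the Windows client's implementation, and the function names are hypothetical. The context trusts only the supplied bundle, never the OS root store, and keeps hostname/SAN verification enabled.

```python
import ssl


def new_strict_client_context() -> ssl.SSLContext:
    """Client context with verification on and no trust anchors loaded yet."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT already enables both settings; they are set
    # explicitly so the intent survives later edits.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def platform_ca_context(ca_bundle_path: str) -> ssl.SSLContext:
    """Trust only the platform CA bundle at the given path."""
    ctx = new_strict_client_context()
    # load_default_certs() is deliberately not called, so OS roots are not trusted.
    ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx
```

Because no verification callback is overridden, an expired certificate, a SAN mismatch, or an unknown issuer fails the handshake rather than being accepted silently.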
Current state:
- Windows client already skips smoke-only/untrusted direct candidates in production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN validation preserved.
- P3.5 smoke proved platform_ca direct worker WSS without insecure TLS bypass on docker-test.
Rotation
Worker certificate rotation:
- certificates should be renewed before 2/3 of lifetime has elapsed
- new cert/key should be staged next to the old files
- worker should reload or restart gracefully
- backend gateway fallback must remain available during rotation
- old cert should remain accepted during a short overlap window
- after successful cutover, old cert should be marked retiring/expired
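The 2/3-of-lifetime renewal rule is simple date arithmetic. A minimal sketch, with hypothetical function names:

```python
from datetime import datetime, timedelta, timezone


def renew_deadline(not_before: datetime, not_after: datetime) -> datetime:
    """Point in time at which 2/3 of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    return not_before + (lifetime * 2) / 3


def needs_renewal(now: datetime, not_before: datetime, not_after: datetime) -> bool:
    """True once the renewal deadline has been reached."""
    return now >= renew_deadline(not_before, not_after)
```

For a 90-day certificate the deadline falls 60 days after not_before, which leaves a 30-day window for staging, graceful reload, and cutover.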
Platform CA rotation:
- introduce a new tls_ca_ref
- distribute the new trust bundle to clients before workers switch
- backend may advertise candidates with the new CA only after client trust is available
- keep old and new CA bundles valid during migration
- remove the old CA only after all active workers and clients are migrated
Revocation And Deny-List
Short-lived certificates are the first control.
Additional revocation controls:
- stop advertising direct candidates for revoked workers immediately
- revoke worker certificate serial/fingerprint in control-plane inventory
- optionally distribute a compact deny-list to clients
- force backend gateway fallback for revoked/untrusted workers
- rotate data-plane signing keys separately if token signing material is at risk
Revocation must not rely on the worker cooperating after compromise.
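A client-side deny-list check could look like the sketch below. It assumes the deny-list holds SHA-256 fingerprints of the DER-encoded certificate as hex strings, with or without colon separators; the distribution format has not been defined, and the function names are hypothetical.

```python
import hashlib
from typing import Iterable


def fingerprint_sha256(cert_der: bytes) -> str:
    """Lowercase hex SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def is_denied(cert_der: bytes, deny_list: Iterable[str]) -> bool:
    """True when the certificate fingerprint appears in the deny-list."""
    # Normalize so "AB:CD:..." and "abcd..." compare equal.
    normalized = {fp.replace(":", "").lower() for fp in deny_list}
    return fingerprint_sha256(cert_der) in normalized
```

The check runs entirely on the client against the certificate presented in the handshake, so it does not depend on the worker cooperating.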
Graceful Failure And Fallback
Direct WSS must fail closed:
- expired cert: direct rejected, fallback to backend gateway
- hostname mismatch: direct rejected, fallback to backend gateway
- untrusted platform CA: direct rejected, fallback to backend gateway
- revoked fingerprint: direct rejected, fallback to backend gateway
- token validation failure: direct rejected, fallback to backend gateway where policy permits
Fallback must be logged so production does not silently run permanently on the debug path.
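The failure cases above can be expressed as one selection function that always logs when the direct path is rejected. This is a sketch under assumptions: the DirectFailure enum, the select_path name, the logger name, and the allow_fallback_on_token_failure flag (standing in for "where policy permits") are hypothetical.

```python
import logging
from enum import Enum
from typing import Optional

log = logging.getLogger("rap.data_plane")  # hypothetical logger name


class DirectFailure(Enum):
    EXPIRED_CERT = "expired_cert"
    HOSTNAME_MISMATCH = "hostname_mismatch"
    UNTRUSTED_CA = "untrusted_platform_ca"
    REVOKED_FINGERPRINT = "revoked_fingerprint"
    TOKEN_INVALID = "token_validation_failure"


def select_path(failure: Optional[DirectFailure],
                allow_fallback_on_token_failure: bool = True) -> Optional[str]:
    """Return the transport to use, or None when the session must not proceed."""
    if failure is None:
        return "direct_worker_wss"
    if failure is DirectFailure.TOKEN_INVALID and not allow_fallback_on_token_failure:
        log.warning("direct worker WSS rejected (%s); fallback not permitted by policy",
                    failure.value)
        return None
    # The direct path is never retried insecurely; the only alternative is the gateway.
    log.warning("direct worker WSS rejected (%s); falling back to backend_gateway",
                failure.value)
    return "backend_gateway"
```

Counting these warnings per worker gives operators a signal when production traffic stays on the gateway path longer than a rotation window should require.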
Test-Stand P3.5 Smoke Result
P3.5 proved platform_ca without using insecure TLS bypass.
Sanitized command shape:
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"
# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1
# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
-v /tmp/rap-p3-5-platform-ca:/certs:ro `
-e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
-e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
-e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
rap-rdp-worker:rdp-p1-region-order2
# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
-v /tmp/rap-dp1d1:/certs:ro `
-v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
-e APP_ENV=production `
-e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
-e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
-e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
-e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
rap-backend-smoke:p3-3
# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false
# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
-PreferDirectDataPlane:$true `
-AllowInsecureDirectDataPlaneTlsForSmoke:$false `
-DirectDataPlaneConnectTimeoutMs 2500 `
-SkipOrgSwitchAndTokenRefresh
P3.5 PASS conditions:
- backend candidate metadata includes:
  - tls_trust_mode=platform_ca
  - production_trusted=true
  - smoke_only=false
  - tls_ca_ref=rap-platform-ca:test-v1
- Windows client selects direct_worker_wss in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or direct WSS is unavailable
Required negative tests:
- wrong SAN certificate rejected
- expired certificate rejected
- unknown CA rejected
- smoke_insecure candidate skipped in production
Runtime proof is recorded in:
artifacts/p3-5-app-local-platform-ca-smoke-report.md
Current Implementation Status
Existing config fields are sufficient for P3.4:
- DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE
- DATA_PLANE_DIRECT_WORKER_TLS_CA_REF
- RDP_WORKER_DATA_PLANE_TLS_CERT_FILE
- RDP_WORKER_DATA_PLANE_TLS_KEY_FILE
P3.5 added Windows client setting:
direct_data_plane_platform_ca_bundle
P3.6 completed stale worker-event/restart idempotency hardening.
Stage 5.2 server-to-client file download design is complete in
docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md. The next step should return
to the RDP feature plan with the narrow Stage 5.2 implementation, not RDP
rendering, lifecycle expansion, mesh, VPN, or new protocol adapters.