# Production Direct Worker WSS Trust

Status: P3.4 design/prep complete.

This document defines the production trust model for direct worker WSS. It is a preparation document only: it does not change RDP runtime behavior, does not remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC, WebRTC, or node-agent enrollment.

## Goal

Direct worker WSS should become the preferred production realtime path only when the client can verify both:

- the backend candidate is authorized and marked `production_trusted=true`
- the worker endpoint presents a valid TLS certificate for the advertised URL

The backend gateway remains the safe fallback/debug path.

## Trust Modes

`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` has three modes:

- `smoke_insecure`: development/smoke only. Backend may advertise a direct candidate only outside production and must mark it `smoke_only=true` and `production_trusted=false`.
- `public_ca`: worker WSS certificate chains to an OS/publicly trusted CA. Backend may mark the candidate `production_trusted=true`.
- `platform_ca`: worker WSS certificate chains to a platform-managed CA. Backend may mark the candidate `production_trusted=true` and include `tls_ca_ref`.

Production must not treat `smoke_insecure` as trusted. P3.3 proved that a production backend with `smoke_insecure` falls back to `backend_gateway`.

## Recommended Mode

Use `platform_ca` as the default production model for platform-managed and customer-managed worker nodes.

Use `public_ca` only when the worker direct WSS endpoint is intentionally internet-addressable through stable DNS and a public certificate can be issued and renewed safely.

Rationale:

- most worker endpoints will be private, internal, or customer-managed
- public CA issuance is often impossible for private IP/DNS names
- a platform CA can bind certificates to platform node/worker identity
- platform CA trust can later integrate with `rap-node-agent`
- backend gateway fallback remains available while trust rollout is staged

## Certificate Profile

Worker direct WSS certificates must be server certificates.

Required X.509 properties:

- `KeyUsage`: `digitalSignature`, plus `keyEncipherment` where required by the selected TLS key type
- `ExtendedKeyUsage`: `serverAuth`
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
- CN must not be used as the trust identity
- validity should be short-lived, recommended 30-90 days in production
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally practical

Recommended identity SAN:

```text
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
```

For the current single-cluster MVP, `cluster_id` may be `default` until the cluster model becomes explicit.

The URI SAN is not a replacement for normal hostname verification. It is an additional identity binding for observability, future node-agent enrollment, and future control-plane certificate inventory.

## Candidate URL Rules

The backend must advertise a direct worker WSS URL whose host is covered by the worker certificate SAN.

Examples:

```text
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
wss://192.168.200.61:18443/rap/v1/data-plane
```

If the URL uses a DNS name, the certificate must include that DNS SAN. If the URL uses an IP address, the certificate must include that IP SAN.

Preferred production shape is DNS, not raw IP, because DNS gives safer certificate rotation and node replacement.

## Worker Identity Binding

Direct worker WSS authentication is layered:

1. TLS proves that the client reached an endpoint with a certificate trusted for the advertised URL.
2. `data_plane_token` proves that the backend authorized the session, attachment, user, organization, resource, worker, and allowed channels.
3. The worker validates the token and binds the WSS connection to an existing runtime only.

The TLS certificate does not replace token validation. The token does not replace TLS trust.

Future production hardening should add control-plane certificate inventory:

```text
worker_certificates
  worker_id
  cluster_id
  tls_ca_ref
  certificate_fingerprint_sha256
  serial_number
  not_before
  not_after
  status: active | retiring | revoked | expired
```

Until that inventory exists, backend must be conservative and only mark direct candidates production-trusted when deployment configuration guarantees the worker certificate is trusted for the advertised URL.

## Platform CA Structure

Recommended hierarchy:

```text
RAP Platform Offline Root CA
  -> RAP Data Plane Worker Intermediate CA v1
    -> worker direct WSS server certificates
```

Rules:

- Root CA private key must not be present on worker hosts.
- Intermediate CA private key must not be present on worker hosts.
- Worker receives only its server certificate, private key, and CA chain.
- Windows clients receive only the trust bundle, never private keys.
- Backend receives CA reference metadata and may carry public trust bundle references, never CA private keys.

For the current test stand, a temporary test CA may be generated on `docker-test`, but it must be treated as throwaway test material and not committed.

## Certificate Issuance And Storage

Future `rap-node-agent` should own enrollment. Before node-agent exists, test stand issuance may be manual.

Production desired flow:

1. Platform owner approves node/worker enrollment.
2. Node agent generates a private key locally.
3. Node agent creates CSR with:
   - worker/node identity URI SAN
   - DNS/IP SANs for reachable direct WSS endpoints
   - cluster id
4. Control plane or CA service signs the CSR if node policy allows the role.
5. Node agent writes certificate/key to a host-local protected path.
6. Worker container mounts certificate/key read-only, or native worker reads protected local files.
7. Backend advertises direct candidates with:
   - `tls_trust_mode=platform_ca`
   - `production_trusted=true`
   - `smoke_only=false`
   - `tls_ca_ref=`

Container note:

- Certificates are node/host trust assets, not container identity.
- Containers may consume mounted cert/key files.
- Container rebuilds must not generate production CA material.

## Windows Client Trust

For `public_ca`, the Windows client should rely on normal OS certificate validation.

For `platform_ca`, the preferred production approach is app-local trust:

- client configuration references a platform CA bundle by `tls_ca_ref`
- WSS TLS validation uses a custom chain policy with an app-managed trust store
- hostname/SAN validation remains enabled
- revocation/deny-list checks are applied when available
- no global insecure callback is used

Installing the platform root into the Windows CurrentUser or LocalMachine Root store may be supported for managed enterprise deployment, but it should not be required for MVP smoke because it broadens OS-level trust.

Current state:

- Windows client already skips smoke-only/untrusted direct candidates in production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN validation preserved.
- P3.5 smoke proved `platform_ca` direct worker WSS without insecure TLS bypass on `docker-test`.

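
The client-side rules in this section can be illustrated with a minimal Python sketch. It is not the Windows client's implementation: the function names, the `known_ca_refs` set, and the `DirectCandidate` type are hypothetical, and only the field names and the `direct_worker_wss` / `backend_gateway` labels come from this document. It shows two things: gating a backend-advertised candidate before any connection attempt, and building an app-local trust context that trusts only the platform CA bundle while leaving hostname/SAN verification enabled.

```python
import ssl
from dataclasses import dataclass
from typing import Optional, Set

DIRECT = "direct_worker_wss"
GATEWAY = "backend_gateway"


@dataclass(frozen=True)
class DirectCandidate:
    """Backend-advertised direct candidate (hypothetical container type)."""
    url: str
    tls_trust_mode: str  # "smoke_insecure" | "public_ca" | "platform_ca"
    production_trusted: bool
    smoke_only: bool
    tls_ca_ref: Optional[str] = None


def select_transport(
    candidate: Optional[DirectCandidate],
    environment: str,
    known_ca_refs: Set[str],
    allow_insecure_smoke: bool = False,
) -> str:
    """Decide whether a direct attempt is allowed, else use the gateway."""
    if candidate is None or not candidate.url.startswith("wss://"):
        return GATEWAY
    mode = candidate.tls_trust_mode
    if environment == "production":
        # Production never accepts smoke-only or untrusted candidates.
        if candidate.smoke_only or not candidate.production_trusted:
            return GATEWAY
        if mode == "smoke_insecure":
            return GATEWAY
    if mode == "public_ca":
        # OS trust store validates the chain during the handshake.
        return DIRECT
    if mode == "platform_ca":
        # Attempt direct only if the referenced CA bundle is available locally.
        return DIRECT if candidate.tls_ca_ref in known_ca_refs else GATEWAY
    if mode == "smoke_insecure":
        return DIRECT if allow_insecure_smoke else GATEWAY
    return GATEWAY  # unknown mode: fall back to the gateway


def build_platform_ca_context(bundle_path: str) -> ssl.SSLContext:
    """App-local trust: only the given bundle is trusted, hostname checks stay on."""
    # With cafile set, the system default CAs are not loaded into this context.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=bundle_path)
    assert ctx.check_hostname and ctx.verify_mode == ssl.CERT_REQUIRED
    return ctx
```

In this sketch a handshake failure on the direct path (expired certificate, SAN mismatch, unknown CA) would still be handled by the caller as a fallback to `backend_gateway`; the gate above only decides whether a direct attempt is permitted at all.
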
## Rotation

Worker certificate rotation:

- certificates should be renewed before 2/3 of lifetime has elapsed
- new cert/key should be staged next to the old files
- worker should reload or restart gracefully
- backend gateway fallback must remain available during rotation
- old cert should remain accepted during a short overlap window
- after successful cutover, old cert should be marked retiring/expired

Platform CA rotation:

- introduce new `tls_ca_ref`
- distribute the new trust bundle to clients before workers switch
- backend may advertise candidates with the new CA only after client trust is available
- keep old and new CA bundles valid during migration
- remove the old CA only after all active workers and clients are migrated

## Revocation And Deny-List

Short-lived certificates are the first control.

Additional revocation controls:

- stop advertising direct candidates for revoked workers immediately
- revoke worker certificate serial/fingerprint in control-plane inventory
- optionally distribute a compact deny-list to clients
- force backend gateway fallback for revoked/untrusted workers
- rotate data-plane signing keys separately if token signing material is at risk

Revocation must not rely on the worker cooperating after compromise.

## Graceful Failure And Fallback

Direct WSS must fail closed:

- expired cert: direct rejected, fallback to backend gateway
- hostname mismatch: direct rejected, fallback to backend gateway
- untrusted platform CA: direct rejected, fallback to backend gateway
- revoked fingerprint: direct rejected, fallback to backend gateway
- token validation failure: direct rejected, fallback to backend gateway where policy permits

Fallback must be logged so production does not silently run permanently on the debug path.

## Test-Stand P3.5 Smoke Result

P3.5 proved `platform_ca` without using insecure TLS bypass.

Sanitized command shape:

```powershell
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"
# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1

# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
  -v /tmp/rap-p3-5-platform-ca:/certs:ro `
  -e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
  -e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
  -e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
  rap-rdp-worker:rdp-p1-region-order2

# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
  -v /tmp/rap-dp1d1:/certs:ro `
  -v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
  -e APP_ENV=production `
  -e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
  -e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
  -e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
  -e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
  rap-backend-smoke:p3-3

# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false

# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
  -PreferDirectDataPlane:$true `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$false `
  -DirectDataPlaneConnectTimeoutMs 2500 `
  -SkipOrgSwitchAndTokenRefresh
```

P3.5 PASS conditions:

- backend candidate metadata includes:
  - `tls_trust_mode=platform_ca`
  - `production_trusted=true`
  - `smoke_only=false`
  - `tls_ca_ref=rap-platform-ca:test-v1`
- Windows client selects `direct_worker_wss` in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or direct WSS is unavailable

Required negative tests:

- wrong SAN certificate rejected
- expired certificate rejected
- unknown CA rejected
- `smoke_insecure` candidate skipped in production

Runtime proof is recorded in:

- `artifacts/p3-5-app-local-platform-ca-smoke-report.md`

## Current Implementation Status

Existing config fields are sufficient for P3.4:

- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`

P3.5 added Windows client setting:

- `direct_data_plane_platform_ca_bundle`

P3.6 completed stale worker-event/restart idempotency hardening.

Stage 5.2 server-to-client file download design is complete in `docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`. The next step should return to the RDP feature plan with the narrow Stage 5.2 implementation, not RDP rendering, lifecycle expansion, mesh, VPN, or new protocol adapters.
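
For reference, the P3.5 PASS conditions on backend candidate metadata could be asserted in a smoke harness along the following lines. This is a sketch under assumptions: the function name and the plain-dict shape of the candidate are hypothetical and are not taken from the smoke scripts. Only the four field names and their expected values come from the PASS conditions in this document.

```python
from typing import Dict, List

# Expected values copied from the P3.5 PASS conditions.
EXPECTED_P3_5_METADATA: Dict[str, object] = {
    "tls_trust_mode": "platform_ca",
    "production_trusted": True,
    "smoke_only": False,
    "tls_ca_ref": "rap-platform-ca:test-v1",
}


def candidate_metadata_violations(candidate: Dict[str, object]) -> List[str]:
    """Return human-readable violations; an empty list means PASS."""
    violations: List[str] = []
    for field, expected in EXPECTED_P3_5_METADATA.items():
        if field not in candidate:
            violations.append(f"{field}: missing")
            continue
        actual = candidate[field]
        # Compare type as well as value so that 0 or "false" is not
        # silently accepted where a boolean False is required.
        if type(actual) is not type(expected) or actual != expected:
            violations.append(f"{field}: expected {expected!r}, got {actual!r}")
    return violations
```

A harness would fail the run when the returned list is non-empty and print the violations, which keeps a misconfigured backend (for example one still in `smoke_insecure`) from being mistaken for a passing `platform_ca` run.
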