Initial project snapshot
# Production Direct Worker WSS Trust

Status: P3.4 design/prep complete.

This document defines the production trust model for direct worker WSS. It is a
preparation document only: it does not change RDP runtime behavior, does not
remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC,
WebRTC, or node-agent enrollment.

## Goal

Direct worker WSS should become the preferred production realtime path only
when the client can verify both:

- the backend candidate is authorized and marked `production_trusted=true`
- the worker endpoint presents a valid TLS certificate for the advertised URL

The backend gateway remains the safe fallback/debug path.

## Trust Modes

`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` has three modes:

- `smoke_insecure`: development/smoke only. Backend may advertise a direct
  candidate only outside production and must mark it `smoke_only=true` and
  `production_trusted=false`.
- `public_ca`: worker WSS certificate chains to an OS/publicly trusted CA.
  Backend may mark the candidate `production_trusted=true`.
- `platform_ca`: worker WSS certificate chains to a platform-managed CA.
  Backend may mark the candidate `production_trusted=true` and include
  `tls_ca_ref`.

Production must not treat `smoke_insecure` as trusted. P3.3 proved that a
production backend with `smoke_insecure` falls back to `backend_gateway`.

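
The mode rules above can be sketched as a small decision function. This is a
minimal illustration, not the backend's actual code: the function name, the
`app_env` parameter, and the choice to reject `platform_ca` without a CA
reference are assumptions. Only the mode names and metadata field names come
from this document.

```python
PRODUCTION = "production"


def direct_candidate_metadata(trust_mode, app_env, ca_ref=None):
    """Return trust metadata for a direct candidate, or None if none may be advertised.

    Hypothetical helper: the signature and the dict shape are illustrative only.
    """
    if trust_mode == "smoke_insecure":
        if app_env == PRODUCTION:
            # Never advertised in production; the client stays on backend_gateway.
            return None
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": True,
            "production_trusted": False,
        }
    if trust_mode == "public_ca":
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": False,
            "production_trusted": True,
        }
    if trust_mode == "platform_ca":
        if not ca_ref:
            # Assumption: a platform CA candidate without a CA reference is a config error.
            raise ValueError("platform_ca requires a tls_ca_ref")
        return {
            "tls_trust_mode": trust_mode,
            "smoke_only": False,
            "production_trusted": True,
            "tls_ca_ref": ca_ref,
        }
    raise ValueError(f"unknown trust mode: {trust_mode!r}")
```

Returning `None` for `smoke_insecure` in production keeps the untrusted
candidate out of the response entirely, so the client has nothing to skip.
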
## Recommended Mode

Use `platform_ca` as the default production model for platform-managed and
customer-managed worker nodes.

Use `public_ca` only when the worker direct WSS endpoint is intentionally
internet-addressable through stable DNS and a public certificate can be issued
and renewed safely.

Rationale:

- most worker endpoints will be private, internal, or customer-managed
- public CA issuance is often impossible for private IP/DNS names
- a platform CA can bind certificates to platform node/worker identity
- platform CA trust can later integrate with `rap-node-agent`
- backend gateway fallback remains available while trust rollout is staged

## Certificate Profile

Worker direct WSS certificates must be server certificates.

Required X.509 properties:

- `KeyUsage`: `digitalSignature`, plus `keyEncipherment` where required by the
  selected TLS key type
- `ExtendedKeyUsage`: `serverAuth`
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
- CN must not be used as the trust identity
- validity should be short-lived, recommended 30-90 days in production
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally
  practical

Recommended identity SAN:

```text
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
```

For the current single-cluster MVP, `cluster_id` may be `default` until the
cluster model becomes explicit.

The URI SAN is not a replacement for normal hostname verification. It is an
additional identity binding for observability, future node-agent enrollment,
and future control-plane certificate inventory.

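
As an illustration of the identity SAN layout, the sketch below builds and
parses the URI value. The helper names are hypothetical, and rejecting `/`
inside an identifier is an assumption; only the
`spiffe://rap/cluster/<cluster_id>/worker/<worker_id>` shape and the `default`
cluster id come from this document.

```python
from urllib.parse import urlsplit


def worker_spiffe_id(worker_id, cluster_id="default"):
    """Build the URI SAN value for a worker (hypothetical helper)."""
    for part in (cluster_id, worker_id):
        if not part or "/" in part:
            raise ValueError(f"invalid identifier: {part!r}")
    return f"spiffe://rap/cluster/{cluster_id}/worker/{worker_id}"


def parse_worker_spiffe_id(uri):
    """Return (cluster_id, worker_id), or raise ValueError for any other shape."""
    parts = urlsplit(uri)
    # A valid path splits into ["", "cluster", <cluster_id>, "worker", <worker_id>].
    segments = parts.path.split("/")
    well_formed = (
        parts.scheme == "spiffe"
        and parts.netloc == "rap"
        and not parts.query
        and not parts.fragment
        and len(segments) == 5
        and segments[0] == ""
        and segments[1] == "cluster"
        and segments[3] == "worker"
        and segments[2] != ""
        and segments[4] != ""
    )
    if not well_formed:
        raise ValueError(f"not a worker identity URI: {uri!r}")
    return segments[2], segments[4]
```

A parser like this is only useful for observability and inventory lookups. It
does not take part in hostname verification.
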
## Candidate URL Rules

The backend must advertise a direct worker WSS URL whose host is covered by the
worker certificate SAN.

Examples:

```text
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
wss://192.168.200.61:18443/rap/v1/data-plane
```

If the URL uses a DNS name, the certificate must include that DNS SAN.

If the URL uses an IP address, the certificate must include that IP SAN.

Preferred production shape is DNS, not raw IP, because DNS gives safer
certificate rotation and node replacement.

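
A minimal sketch of the DNS-versus-IP rule, assuming a hypothetical
`required_san` helper that a backend or test harness could use to decide which
SAN type a candidate URL needs. It uses only the Python standard library and
does not inspect a real certificate.

```python
import ipaddress
from urllib.parse import urlsplit


def required_san(direct_url):
    """Return ("IP", address) or ("DNS", name) for an advertised direct WSS URL."""
    parts = urlsplit(direct_url)
    if parts.scheme != "wss" or not parts.hostname:
        raise ValueError(f"not a direct worker WSS URL: {direct_url!r}")
    host = parts.hostname  # already lowercased, port and IPv6 brackets removed
    try:
        # IP literals need a matching IP SAN; a DNS SAN does not cover them.
        return ("IP", str(ipaddress.ip_address(host)))
    except ValueError:
        return ("DNS", host)
```

For the two example URLs above, this yields a DNS SAN requirement for
`rdp-worker-1.dp.test.cin.su` and an IP SAN requirement for `192.168.200.61`.
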
## Worker Identity Binding

Direct worker WSS authentication is layered:

1. TLS proves that the client reached an endpoint with a certificate trusted for
   the advertised URL.
2. `data_plane_token` proves that the backend authorized the session,
   attachment, user, organization, resource, worker, and allowed channels.
3. The worker validates the token and binds the WSS connection to an existing
   runtime only.

The TLS certificate does not replace token validation.

The token does not replace TLS trust.

Future production hardening should add control-plane certificate inventory:

```text
worker_certificates
  worker_id
  cluster_id
  tls_ca_ref
  certificate_fingerprint_sha256
  serial_number
  not_before
  not_after
  status: active | retiring | revoked | expired
```

Until that inventory exists, the backend must be conservative and only mark
direct candidates production-trusted when deployment configuration guarantees
the worker certificate is trusted for the advertised URL.

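
One way the inventory record could be modeled is sketched below. The field
names follow the layout above; the class name, the `usable_for_direct` helper,
and the rule that `retiring` certificates stay usable inside their validity
window are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class WorkerCertificate:
    """Hypothetical in-memory form of one worker_certificates row."""
    worker_id: str
    cluster_id: str
    tls_ca_ref: str
    certificate_fingerprint_sha256: str
    serial_number: str
    not_before: datetime
    not_after: datetime
    status: str  # active | retiring | revoked | expired


def usable_for_direct(cert, now):
    """True when the record may back a production-trusted direct candidate."""
    # Assumption: "retiring" covers the rotation overlap window and is still accepted.
    if cert.status not in ("active", "retiring"):
        return False
    # The validity window is checked even if the stored status was never updated.
    return cert.not_before <= now <= cert.not_after
```

Checking the validity window independently of `status` means a stale `active`
row for an expired certificate still fails closed.
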
## Platform CA Structure

Recommended hierarchy:

```text
RAP Platform Offline Root CA
  -> RAP Data Plane Worker Intermediate CA v1
       -> worker direct WSS server certificates
```

Rules:

- Root CA private key must not be present on worker hosts.
- Intermediate CA private key must not be present on worker hosts.
- Worker receives only its server certificate, private key, and CA chain.
- Windows clients receive only the trust bundle, never private keys.
- Backend receives CA reference metadata and may carry public trust bundle
  references, never CA private keys.

For the current test stand, a temporary test CA may be generated on
`docker-test`, but it must be treated as throwaway test material and not
committed.

## Certificate Issuance And Storage

Future `rap-node-agent` should own enrollment. Before node-agent exists, test
stand issuance may be manual.

Production desired flow:

1. Platform owner approves node/worker enrollment.
2. Node agent generates a private key locally.
3. Node agent creates CSR with:
   - worker/node identity URI SAN
   - DNS/IP SANs for reachable direct WSS endpoints
   - cluster id
4. Control plane or CA service signs the CSR if node policy allows the role.
5. Node agent writes certificate/key to a host-local protected path.
6. Worker container mounts certificate/key read-only, or native worker reads
   protected local files.
7. Backend advertises direct candidates with:
   - `tls_trust_mode=platform_ca`
   - `production_trusted=true`
   - `smoke_only=false`
   - `tls_ca_ref=<active-ca-ref>`

Container note:

- Certificates are node/host trust assets, not container identity.
- Containers may consume mounted cert/key files.
- Container rebuilds must not generate production CA material.

## Windows Client Trust

For `public_ca`, the Windows client should rely on normal OS certificate
validation.

For `platform_ca`, the preferred production approach is app-local trust:

- client configuration references a platform CA bundle by `tls_ca_ref`
- WSS TLS validation uses a custom chain policy with an app-managed trust store
- hostname/SAN validation remains enabled
- revocation/deny-list checks are applied when available
- no global insecure callback is used

Installing the platform root into the Windows CurrentUser or LocalMachine Root
store may be supported for managed enterprise deployment, but it should not be
required for MVP smoke because it broadens OS-level trust.

Current state:

- Windows client already skips smoke-only/untrusted direct candidates in
  production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN
  validation preserved.
- P3.5 smoke proved `platform_ca` direct worker WSS without insecure TLS
  bypass on `docker-test`.

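
The app-local trust idea can be illustrated with Python's standard `ssl`
module. This is an analogy only: it is not the Windows client's code, and the
function name and `ca_bundle_path` parameter are assumptions. It shows a TLS
context that trusts nothing except an explicitly loaded bundle while keeping
hostname verification on.

```python
import ssl


def new_app_local_tls_context(ca_bundle_path=None):
    """Create a client TLS context that trusts only the given CA bundle.

    With no bundle loaded the trust store is empty, so every handshake fails
    closed instead of falling through to OS-level roots.
    """
    # SSLContext() does not load the OS root store; roots must be added explicitly.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # These are the PROTOCOL_TLS_CLIENT defaults; set explicitly to document intent.
    ctx.check_hostname = True
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_bundle_path is not None:
        # PEM bundle with the platform CA chain, e.g. the file selected by tls_ca_ref.
        ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx
```

The design point is that trust is scoped to one connection type inside the
application, which avoids installing the platform root into a system store.
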
## Rotation

Worker certificate rotation:

- certificates should be renewed before 2/3 of lifetime has elapsed
- new cert/key should be staged next to the old files
- worker should reload or restart gracefully
- backend gateway fallback must remain available during rotation
- old cert should remain accepted during a short overlap window
- after successful cutover, old cert should be marked retiring/expired

Platform CA rotation:

- introduce new `tls_ca_ref`
- distribute the new trust bundle to clients before workers switch
- backend may advertise candidates with the new CA only after client trust is
  available
- keep old and new CA bundles valid during migration
- remove the old CA only after all active workers and clients are migrated

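
The 2/3-of-lifetime renewal rule is simple arithmetic on the certificate
validity window. The sketch below uses hypothetical helper names and assumes
timezone-aware timestamps.

```python
from datetime import datetime, timedelta, timezone


def renew_deadline(not_before, not_after):
    """Return the instant at which 2/3 of the certificate lifetime has elapsed."""
    lifetime = not_after - not_before
    return not_before + (lifetime * 2) / 3


def renewal_due(not_before, not_after, now):
    """True once the renewal deadline has been reached."""
    return now >= renew_deadline(not_before, not_after)
```

For a 90-day certificate the deadline falls 60 days after `not_before`, which
leaves 30 days for staging, cutover, and retries.
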
## Revocation And Deny-List

Short-lived certificates are the first control.

Additional revocation controls:

- stop advertising direct candidates for revoked workers immediately
- revoke worker certificate serial/fingerprint in control-plane inventory
- optionally distribute a compact deny-list to clients
- force backend gateway fallback for revoked/untrusted workers
- rotate data-plane signing keys separately if token signing material is at risk

Revocation must not rely on the worker cooperating after compromise.

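
A client-side deny-list check could look like the sketch below. It assumes the
deny-list is a set of SHA-256 fingerprints of the DER-encoded certificate,
which matches the `certificate_fingerprint_sha256` inventory field but is not
a format this document specifies. The function names are hypothetical.

```python
import hashlib


def normalize_fingerprint(value):
    """Lowercase a hex fingerprint and drop colon separators."""
    return value.replace(":", "").strip().lower()


def fingerprint_sha256(cert_der):
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(cert_der).hexdigest()


def is_denied(cert_der, deny_list):
    """True when the presented certificate appears on the deny-list."""
    denied = {normalize_fingerprint(entry) for entry in deny_list}
    return fingerprint_sha256(cert_der) in denied
```

Because the check runs on the client against the certificate actually
presented, it does not depend on the worker cooperating.
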
## Graceful Failure And Fallback

Direct WSS must fail closed:

- expired cert: direct rejected, fallback to backend gateway
- hostname mismatch: direct rejected, fallback to backend gateway
- untrusted platform CA: direct rejected, fallback to backend gateway
- revoked fingerprint: direct rejected, fallback to backend gateway
- token validation failure: direct rejected, fallback to backend gateway where
  policy permits

Fallback must be logged so production does not silently run permanently on the
debug path.

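
The fail-closed table above can be sketched as a single decision function. The
failure reason strings, the logger name, and the `token_fallback_allowed`
policy flag are assumptions; the path names `direct_worker_wss` and
`backend_gateway` are the ones used elsewhere in this document.

```python
import logging

logger = logging.getLogger("rap.data_plane")  # hypothetical logger name


def resolve_path(direct_failure, token_fallback_allowed=True):
    """Pick the realtime path after a direct worker WSS attempt.

    direct_failure is None on success, otherwise a short reason string such as
    "expired_cert", "hostname_mismatch", "untrusted_platform_ca",
    "revoked_fingerprint", or "token_validation_failure".
    Returns the selected path name, or None when no path is permitted.
    """
    if direct_failure is None:
        return "direct_worker_wss"
    if direct_failure == "token_validation_failure" and not token_fallback_allowed:
        logger.warning("direct worker WSS rejected (%s); fallback not permitted", direct_failure)
        return None
    # Any other failure, including unknown reasons, rejects direct and falls back.
    logger.warning("direct worker WSS rejected (%s); falling back to backend_gateway", direct_failure)
    return "backend_gateway"
```

Every non-success branch logs at warning level, so a deployment that runs
permanently on the gateway shows up in logs rather than staying silent.
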
## Test-Stand P3.5 Smoke Result

P3.5 proved `platform_ca` without using insecure TLS bypass.

Sanitized command shape:

```powershell
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"

# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1

# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
  -v /tmp/rap-p3-5-platform-ca:/certs:ro `
  -e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
  -e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
  -e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
  rap-rdp-worker:rdp-p1-region-order2

# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
  -v /tmp/rap-dp1d1:/certs:ro `
  -v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
  -e APP_ENV=production `
  -e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
  -e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
  -e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
  -e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
  rap-backend-smoke:p3-3

# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false

# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
  -PreferDirectDataPlane:$true `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$false `
  -DirectDataPlaneConnectTimeoutMs 2500 `
  -SkipOrgSwitchAndTokenRefresh
```

P3.5 PASS conditions:

- backend candidate metadata includes:
  - `tls_trust_mode=platform_ca`
  - `production_trusted=true`
  - `smoke_only=false`
  - `tls_ca_ref=rap-platform-ca:test-v1`
- Windows client selects `direct_worker_wss` in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or
  direct WSS is unavailable

Required negative tests:

- wrong SAN certificate rejected
- expired certificate rejected
- unknown CA rejected
- `smoke_insecure` candidate skipped in production

Runtime proof is recorded in:

- `artifacts/p3-5-app-local-platform-ca-smoke-report.md`

## Current Implementation Status

Existing config fields are sufficient for P3.4:

- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`

P3.5 added a Windows client setting:

- `direct_data_plane_platform_ca_bundle`

P3.6 completed stale worker-event/restart idempotency hardening.

The Stage 5.2 server-to-client file download design is complete in
`docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`. The next step should return
to the RDP feature plan with the narrow Stage 5.2 implementation, not RDP
rendering, lifecycle expansion, mesh, VPN, or new protocol adapters.