Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,351 @@
# Production Direct Worker WSS Trust
Status: P3.4 design/prep complete.
This document defines the production trust model for direct worker WSS. It is a
preparation document only: it does not change RDP runtime behavior, does not
remove backend gateway fallback, and does not implement mesh, relay, VPN, QUIC,
WebRTC, or node-agent enrollment.
## Goal
Direct worker WSS should become the preferred production realtime path only
when the client can verify both:
- the backend candidate is authorized and marked `production_trusted=true`
- the worker endpoint presents a valid TLS certificate for the advertised URL
The backend gateway remains the safe fallback/debug path.
## Trust Modes
`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE` has three modes:
- `smoke_insecure`: development/smoke only. Backend may advertise a direct
candidate only outside production and must mark it `smoke_only=true` and
`production_trusted=false`.
- `public_ca`: worker WSS certificate chains to an OS/publicly trusted CA.
Backend may mark the candidate `production_trusted=true`.
- `platform_ca`: worker WSS certificate chains to a platform-managed CA.
Backend may mark the candidate `production_trusted=true` and include
`tls_ca_ref`.
Production must not treat `smoke_insecure` as trusted. P3.3 proved that a
production backend with `smoke_insecure` falls back to `backend_gateway`.
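
The mode rules above can be sketched as a small decision function. This is a minimal illustration, not the backend's actual code: `build_direct_candidate`, its parameters, and the dict shape are hypothetical names chosen for the example, and requiring `tls_ca_ref` for `platform_ca` is a stricter assumption than the "may include" wording above.

```python
from typing import Optional

TRUST_MODES = {"smoke_insecure", "public_ca", "platform_ca"}


def build_direct_candidate(mode: str, app_env: str, url: str,
                           tls_ca_ref: Optional[str] = None) -> Optional[dict]:
    """Return direct-candidate metadata, or None when none may be advertised."""
    if mode not in TRUST_MODES:
        raise ValueError(f"unknown trust mode: {mode!r}")
    if mode == "smoke_insecure":
        if app_env == "production":
            # No direct candidate: the client stays on backend_gateway.
            return None
        return {"url": url, "tls_trust_mode": mode,
                "smoke_only": True, "production_trusted": False}
    candidate = {"url": url, "tls_trust_mode": mode,
                 "smoke_only": False, "production_trusted": True}
    if mode == "platform_ca":
        if not tls_ca_ref:
            # Assumption: treat a missing CA reference as a config error.
            raise ValueError("platform_ca requires tls_ca_ref")
        candidate["tls_ca_ref"] = tls_ca_ref
    return candidate
```

Returning `None` for `smoke_insecure` in production keeps the "never trusted in production" rule in one place instead of relying on every caller to check the flags.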
## Recommended Mode
Use `platform_ca` as the default production model for platform-managed and
customer-managed worker nodes.
Use `public_ca` only when the worker direct WSS endpoint is intentionally
internet-addressable through stable DNS and a public certificate can be issued
and renewed safely.
Rationale:
- most worker endpoints will be private, internal, or customer-managed
- public CA issuance is often impossible for private IP/DNS names
- a platform CA can bind certificates to platform node/worker identity
- platform CA trust can later integrate with `rap-node-agent`
- backend gateway fallback remains available while trust rollout is staged
## Certificate Profile
Worker direct WSS certificates must be server certificates.
Required X.509 properties:
- `KeyUsage`: `digitalSignature`; add `keyEncipherment` only for RSA keys that
must support RSA key-exchange cipher suites (TLS 1.2 and earlier)
- `ExtendedKeyUsage`: `serverAuth`
- SAN DNS/IP entries must match the host in the advertised direct worker WSS URL
- CN must not be used as the trust identity
- validity should be short-lived, recommended 30-90 days in production
- key type should be ECDSA P-256 or RSA-2048+; prefer ECDSA where operationally
practical
Recommended identity SAN:
```text
URI:spiffe://rap/cluster/<cluster_id>/worker/<worker_id>
```
For the current single-cluster MVP, `cluster_id` may be `default` until the
cluster model becomes explicit.
The URI SAN is not a replacement for normal hostname verification. It is an
additional identity binding for observability, future node-agent enrollment,
and future control-plane certificate inventory.
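
As an illustration of the identity SAN format, the sketch below builds and parses the URI. The helper names `worker_identity_uri` and `parse_worker_identity_uri` are hypothetical, and rejecting `/` inside identifiers is an assumption made so the path stays unambiguous.

```python
from typing import Tuple
from urllib.parse import urlsplit


def worker_identity_uri(worker_id: str, cluster_id: str = "default") -> str:
    """Build the URI SAN value; cluster_id defaults to the MVP placeholder."""
    for part in (cluster_id, worker_id):
        if not part or "/" in part:
            raise ValueError(f"invalid identity segment: {part!r}")
    return f"spiffe://rap/cluster/{cluster_id}/worker/{worker_id}"


def parse_worker_identity_uri(uri: str) -> Tuple[str, str]:
    """Return (cluster_id, worker_id) or raise ValueError on any other shape."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    well_formed = (
        parts.scheme == "spiffe"
        and parts.netloc == "rap"
        and len(segments) == 4
        and segments[0] == "cluster"
        and segments[2] == "worker"
        and all(segments)
    )
    if not well_formed:
        raise ValueError(f"not a worker identity URI: {uri!r}")
    return segments[1], segments[3]
```

The parsed identity is intended for observability and inventory lookups only; hostname verification against the DNS/IP SANs still decides whether the connection is trusted.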
## Candidate URL Rules
The backend must advertise a direct worker WSS URL whose host is covered by the
worker certificate SAN.
Examples:
```text
wss://rdp-worker-1.dp.test.cin.su:18443/rap/v1/data-plane
wss://192.168.200.61:18443/rap/v1/data-plane
```
If the URL uses a DNS name, the certificate must include that DNS SAN.
If the URL uses an IP address, the certificate must include that IP SAN.
Preferred production shape is DNS, not raw IP, because DNS gives safer
certificate rotation and node replacement.
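
A backend-side pre-check for these rules might look like the sketch below. It assumes the SAN lists have already been extracted from the worker certificate; `url_host_covered_by_san` is a hypothetical helper, and it deliberately does exact matching only (no wildcard DNS SANs).

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlsplit


def url_host_covered_by_san(url: str, dns_sans: Iterable[str],
                            ip_sans: Iterable[str]) -> bool:
    """True when the wss:// URL host appears in the matching SAN list."""
    parts = urlsplit(url)
    if parts.scheme != "wss" or not parts.hostname:
        return False
    host = parts.hostname  # urlsplit already lowercases the hostname
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: require an exact, case-insensitive DNS SAN.
        return host in {name.lower() for name in dns_sans}
    # IP literal: DNS SANs never satisfy it, only IP SANs do.
    return any(ipaddress.ip_address(san) == ip for san in ip_sans)
```

This check does not replace TLS validation on the client; it only stops the backend from advertising a URL the certificate cannot cover.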
## Worker Identity Binding
Direct worker WSS authentication is layered:
1. TLS proves that the client reached an endpoint with a certificate trusted for
the advertised URL.
2. `data_plane_token` proves that the backend authorized the session,
attachment, user, organization, resource, worker, and allowed channels.
3. The worker validates the token and binds the WSS connection to an existing
runtime only.
The TLS certificate does not replace token validation.
The token does not replace TLS trust.
Future production hardening should add control-plane certificate inventory:
```text
worker_certificates
  worker_id
  cluster_id
  tls_ca_ref
  certificate_fingerprint_sha256
  serial_number
  not_before
  not_after
  status: active | retiring | revoked | expired
```
Until that inventory exists, the backend must be conservative and mark direct
candidates production-trusted only when deployment configuration guarantees
that the worker certificate is trusted for the advertised URL.
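
One possible in-memory shape for such an inventory record is sketched below. The field names follow the listing above; the `CertStatus` enum, the `usable_for_direct` method, and the choice to keep `retiring` certificates usable during the overlap window are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class CertStatus(str, Enum):
    ACTIVE = "active"
    RETIRING = "retiring"
    REVOKED = "revoked"
    EXPIRED = "expired"


@dataclass(frozen=True)
class WorkerCertificate:
    worker_id: str
    cluster_id: str
    tls_ca_ref: str
    certificate_fingerprint_sha256: str
    serial_number: str
    not_before: datetime
    not_after: datetime
    status: CertStatus

    def usable_for_direct(self, now: datetime) -> bool:
        """True when this record may back a production-trusted candidate."""
        in_window = self.not_before <= now <= self.not_after
        # Assumption: retiring certs stay usable during the overlap window.
        allowed = self.status in (CertStatus.ACTIVE, CertStatus.RETIRING)
        return in_window and allowed
```

Use timezone-aware `datetime` values consistently; Python raises `TypeError` when aware and naive values are compared.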
## Platform CA Structure
Recommended hierarchy:
```text
RAP Platform Offline Root CA
  -> RAP Data Plane Worker Intermediate CA v1
      -> worker direct WSS server certificates
```
Rules:
- Root CA private key must not be present on worker hosts.
- Intermediate CA private key must not be present on worker hosts.
- Worker receives only its server certificate, private key, and CA chain.
- Windows clients receive only the trust bundle, never private keys.
- Backend receives CA reference metadata and may carry public trust bundle
references, never CA private keys.
For the current test stand, a temporary test CA may be generated on
`docker-test`, but it must be treated as throwaway test material and not
committed.
## Certificate Issuance And Storage
The future `rap-node-agent` should own enrollment. Until the node agent exists,
test-stand issuance may be manual.
Production desired flow:
1. Platform owner approves node/worker enrollment.
2. Node agent generates a private key locally.
3. Node agent creates CSR with:
   - worker/node identity URI SAN
   - DNS/IP SANs for reachable direct WSS endpoints
   - cluster id
4. Control plane or CA service signs the CSR if node policy allows the role.
5. Node agent writes certificate/key to a host-local protected path.
6. Worker container mounts certificate/key read-only, or native worker reads
protected local files.
7. Backend advertises direct candidates with:
   - `tls_trust_mode=platform_ca`
   - `production_trusted=true`
   - `smoke_only=false`
   - `tls_ca_ref=<active-ca-ref>`
Container note:
- Certificates are node/host trust assets, not container identity.
- Containers may consume mounted cert/key files.
- Container rebuilds must not generate production CA material.
## Windows Client Trust
For `public_ca`, the Windows client should rely on normal OS certificate
validation.
For `platform_ca`, the preferred production approach is app-local trust:
- client configuration references a platform CA bundle by `tls_ca_ref`
- WSS TLS validation uses a custom chain policy with an app-managed trust store
- hostname/SAN validation remains enabled
- revocation/deny-list checks are applied when available
- no global insecure callback is used
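
The app-local trust idea can be illustrated with Python's standard `ssl` module. This is an analogy only, not the Windows client's implementation, and `strict_client_context` and `platform_ca_context` are hypothetical names. The point is that the context trusts nothing but the referenced bundle while hostname checking stays on.

```python
import ssl


def strict_client_context() -> ssl.SSLContext:
    """Client context with verification on and an empty trust store."""
    # PROTOCOL_TLS_CLIENT enables check_hostname and CERT_REQUIRED by default
    # and does not load the OS trust store.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def platform_ca_context(ca_bundle_path: str) -> ssl.SSLContext:
    """Trust only the platform CA bundle resolved from tls_ca_ref."""
    ctx = strict_client_context()
    # Raises OSError if the bundle is missing, so a bad path fails closed.
    ctx.load_verify_locations(cafile=ca_bundle_path)
    return ctx
```

Because the trust store starts empty, a worker certificate issued by a public CA would be rejected under this context, which matches the intent of scoping trust to a single `tls_ca_ref`.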
Installing the platform root into the Windows CurrentUser or LocalMachine Root
store may be supported for managed enterprise deployment, but it should not be
required for MVP smoke because it broadens OS-level trust.
Current state:
- Windows client already skips smoke-only/untrusted direct candidates in
production.
- P3.5 added app-local platform CA bundle handling with normal hostname/SAN
validation preserved.
- P3.5 smoke proved `platform_ca` direct worker WSS without insecure TLS
bypass on `docker-test`.
## Rotation
Worker certificate rotation:
- certificates should be renewed before 2/3 of lifetime has elapsed
- new cert/key should be staged next to the old files
- worker should reload or restart gracefully
- backend gateway fallback must remain available during rotation
- old cert should remain accepted during a short overlap window and be marked
`retiring` for that period
- after successful cutover, old cert should no longer be served and should be
marked `expired` once its validity ends
Platform CA rotation:
- introduce new `tls_ca_ref`
- distribute the new trust bundle to clients before workers switch
- backend may advertise candidates with the new CA only after client trust is
available
- keep old and new CA bundles valid during migration
- remove the old CA only after all active workers and clients are migrated
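
The 2/3-of-lifetime renewal rule is simple date arithmetic. A minimal sketch, with the hypothetical helpers `renew_by` and `needs_renewal`:

```python
from datetime import datetime, timedelta, timezone


def renew_by(not_before: datetime, not_after: datetime) -> datetime:
    """Latest time by which a replacement certificate should be in place."""
    lifetime = not_after - not_before
    return not_before + lifetime * 2 / 3


def needs_renewal(not_before: datetime, not_after: datetime,
                  now: datetime) -> bool:
    """True once 2/3 of the certificate lifetime has elapsed."""
    return now >= renew_by(not_before, not_after)
```

For a 90-day certificate this puts the renewal deadline at day 60, leaving 30 days for staging, reload, and the overlap window.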
## Revocation And Deny-List
Short-lived certificates are the first control.
Additional revocation controls:
- stop advertising direct candidates for revoked workers immediately
- revoke worker certificate serial/fingerprint in control-plane inventory
- optionally distribute a compact deny-list to clients
- force backend gateway fallback for revoked/untrusted workers
- rotate data-plane signing keys separately if token signing material is at risk
Revocation must not rely on the worker cooperating after compromise.
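
A client-side deny-list check could be as small as the following sketch. It assumes the deny-list is a set of SHA-256 fingerprints over the DER-encoded certificate, matching `certificate_fingerprint_sha256` in the planned inventory; the function names are hypothetical.

```python
import hashlib
from typing import Iterable


def fingerprint_sha256(cert_der: bytes) -> str:
    """Lowercase hex SHA-256 over the DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


def is_denied(cert_der: bytes, denied_fingerprints: Iterable[str]) -> bool:
    """True when the presented certificate is on the deny-list."""
    # Accept both "AB:CD:..." and "abcd..." spellings in the list.
    normalized = {fp.replace(":", "").lower() for fp in denied_fingerprints}
    return fingerprint_sha256(cert_der) in normalized
```

In Python, the DER bytes of the peer certificate are available from `SSLSocket.getpeercert(binary_form=True)` after the handshake; the check runs entirely on the client, so it does not depend on the worker cooperating.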
## Graceful Failure And Fallback
Direct WSS must fail closed: a failed check rejects the direct connection and
never relaxes validation. Expected outcomes:
- expired cert: direct rejected, fallback to backend gateway
- hostname mismatch: direct rejected, fallback to backend gateway
- untrusted platform CA: direct rejected, fallback to backend gateway
- revoked fingerprint: direct rejected, fallback to backend gateway
- token validation failure: direct rejected, fallback to backend gateway where
policy permits
Fallback must be logged so production does not silently run permanently on the
debug path.
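
The fail-closed-then-fallback behavior can be sketched as an ordered list of named checks. `select_path`, the check callables, and the logger name are hypothetical; the path names `direct_worker_wss` and `backend_gateway` come from this document.

```python
import logging
from typing import Callable, Iterable, Optional, Tuple

log = logging.getLogger("rap.data_plane")  # hypothetical logger name

Check = Tuple[str, Callable[[], bool]]


def select_path(checks: Iterable[Check],
                fallback_allowed: bool = True) -> Tuple[Optional[str], Optional[str]]:
    """Return (path, rejection_reason); path is None when nothing is allowed."""
    for reason, passed in checks:
        if passed():
            continue
        # First failing check rejects direct; later checks are not run.
        if fallback_allowed:
            log.warning("direct worker WSS rejected (%s); "
                        "falling back to backend_gateway", reason)
            return "backend_gateway", reason
        log.error("direct worker WSS rejected (%s); fallback not permitted",
                  reason)
        return None, reason
    return "direct_worker_wss", None
```

Logging the rejection reason on every fallback is what makes a deployment that is permanently stuck on the gateway visible in production.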
## Test-Stand P3.5 Smoke Result
P3.5 proved `platform_ca` without using insecure TLS bypass.
Sanitized command shape:
```powershell
# 1. Generate throwaway test CA and worker cert on docker-test.
ssh docker-test "mkdir -p /tmp/rap-p3-5-platform-ca"
# Certificate must include:
# - DNS SAN for the direct WSS host, if using DNS
# - IP SAN 192.168.200.61, if using raw IP
# - URI SAN spiffe://rap/cluster/default/worker/rdp-worker-1
# 2. Restart worker with platform CA-issued server cert.
docker -H ssh://docker-test rm -f rap_worker_smoke
docker -H ssh://docker-test run -d --name rap_worker_smoke --network host `
-v /tmp/rap-p3-5-platform-ca:/certs:ro `
-e RDP_WORKER_DATA_PLANE_TLS_CERT_FILE=/certs/worker.crt `
-e RDP_WORKER_DATA_PLANE_TLS_KEY_FILE=/certs/worker.key `
-e RDP_WORKER_DATA_PLANE_PUBLIC_KEY_FILE=/certs/dp-public.pem `
rap-rdp-worker:rdp-p1-region-order2
# 3. Restart backend in production platform_ca mode.
docker -H ssh://docker-test rm -f rap_backend_smoke
docker -H ssh://docker-test run -d --name rap_backend_smoke --network host `
-v /tmp/rap-dp1d1:/certs:ro `
-v /tmp/rap-p3-3/secret-key.b64:/run/secrets/rap-secret-key.b64:ro `
-e APP_ENV=production `
-e SECRET_ENCRYPTION_KEY_FILE=/run/secrets/rap-secret-key.b64 `
-e DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=platform_ca `
-e DATA_PLANE_DIRECT_WORKER_TLS_CA_REF=rap-platform-ca:test-v1 `
-e DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE=wss://192.168.200.61:18443/rap/v1/data-plane `
rap-backend-smoke:p3-3
# 4. Configure Windows client app-local trust bundle.
# backend.direct_data_plane_platform_ca_bundle = artifacts\p3-5-platform-ca.crt
# backend.environment = production
# backend.allow_insecure_direct_data_plane_tls_for_smoke = false
# 5. Run desktop smoke and verify direct selected.
pwsh -ExecutionPolicy Bypass -File scripts\windows-smoke\desktop-smoke.ps1 `
-PreferDirectDataPlane:$true `
-AllowInsecureDirectDataPlaneTlsForSmoke:$false `
-DirectDataPlaneConnectTimeoutMs 2500 `
-SkipOrgSwitchAndTokenRefresh
```
P3.5 PASS conditions:
- backend candidate metadata includes:
  - `tls_trust_mode=platform_ca`
  - `production_trusted=true`
  - `smoke_only=false`
  - `tls_ca_ref=rap-platform-ca:test-v1`
- Windows client selects `direct_worker_wss` in production mode
- client does not use insecure TLS bypass
- worker direct WSS token validation and runtime binding still pass
- rendering/input/clipboard/file upload still pass
- backend gateway fallback activates when direct cert validation fails or
direct WSS is unavailable
Required negative tests:
- wrong SAN certificate rejected
- expired certificate rejected
- unknown CA rejected
- `smoke_insecure` candidate skipped in production
Runtime proof is recorded in:
- `artifacts/p3-5-app-local-platform-ca-smoke-report.md`
## Current Implementation Status
Existing config fields are sufficient for P3.4:
- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`
- `RDP_WORKER_DATA_PLANE_TLS_CERT_FILE`
- `RDP_WORKER_DATA_PLANE_TLS_KEY_FILE`
P3.5 added Windows client setting:
- `direct_data_plane_platform_ca_bundle`
P3.6 completed stale worker-event/restart idempotency hardening.
Stage 5.2 server-to-client file download design is complete in
`docs/architecture/RDP_FILE_DOWNLOAD_STAGE_5_2.md`. The next step is to return
to the RDP feature plan and implement the narrow Stage 5.2 scope. RDP
rendering, lifecycle expansion, mesh, VPN, and new protocol adapters are not
part of that step.