rdp-proxy/docs/architecture/SECURITY_SECRETS_READINESS.md

# Security And Secrets Readiness

Status: P3.3 test-stand smoke is complete for encrypted resource secrets,
assignment-time secret resolution, and production fallback to the backend
gateway when direct worker WSS trust is smoke-only.

This document defines the next security hardening layer around the accepted RDP
MVP baseline. It does not implement mesh, VPN, server-to-client download, new
protocol adapters, or another RDP rendering mode.

## Current Accepted Baseline

- RDP worker baseline: `rap-rdp-worker:rdp-p1-region-order2`
- Backend control plane remains source of truth.
- Redis remains live coordination/routing only.
- Direct worker WSS is preferred for realtime RDP.
- Backend gateway remains fallback/debug.
- Text clipboard is policy-gated and accepted.
- Client-to-server file upload and restricted `RAP_Transfers` visibility are
  accepted.

## Problem

The current smoke/dev path can still seed RDP target credentials inside
resource `metadata`. That was acceptable for proving lifecycle and RDP adapter
behavior, but it must not be the production contract.

Production must not rely on plaintext target passwords, usernames, domain
credentials, client secrets, tokens, or private keys stored in generic resource
metadata.

## Target Secret Model

Resources keep non-secret connection shape:

```json
{
  "id": "...",
  "organization_id": "...",
  "protocol": "rdp",
  "address": "rdp.example.internal:3389",
  "secret_ref": "rap-secret://org/<org_id>/resources/<resource_id>/rdp-primary",
  "metadata": {
    "certificate_verification_mode": "strict",
    "render_quality_profile": "balanced"
  }
}
```

Secrets are stored separately and referenced by `secret_ref`. The secret payload
is protocol-specific and versioned:

```json
{
  "version": 1,
  "protocol": "rdp",
  "username": "...",
  "domain": "...",
  "password": "...",
  "rotation_version": 3
}
```

The reference, not the plaintext secret, is copied into session metadata and
audit context.
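
As an illustration of how a `secret_ref` can be checked before it is copied into session metadata, here is a minimal Python sketch. The reference layout is inferred only from the example above; `SecretRef`, `parse_secret_ref`, and `ref_matches_resource` are hypothetical names, not part of the codebase.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the example reference in this document:
# rap-secret://org/<org_id>/resources/<resource_id>/<secret_name>
_SECRET_REF_RE = re.compile(
    r"rap-secret://org/(?P<org_id>[^/]+)/resources/(?P<resource_id>[^/]+)/(?P<name>[^/]+)"
)


class SecretRef(NamedTuple):
    """Hypothetical parsed form of a secret_ref."""
    organization_id: str
    resource_id: str
    name: str


def parse_secret_ref(ref: str) -> SecretRef:
    """Split a secret_ref into its parts, or raise ValueError if malformed."""
    match = _SECRET_REF_RE.fullmatch(ref)
    if match is None:
        raise ValueError("malformed secret_ref")
    return SecretRef(match["org_id"], match["resource_id"], match["name"])


def ref_matches_resource(ref: str, organization_id: str, resource_id: str) -> bool:
    """True only when the reference is scoped to the given org and resource."""
    parsed = parse_secret_ref(ref)
    return (parsed.organization_id, parsed.resource_id) == (organization_id, resource_id)
```

Checking the scope embedded in the reference against the resource row is one way to catch a reference that was copied between organizations or resources.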

## Runtime Secret Resolution

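The target production runtime resolves secrets through a dedicated secret
resolver. The current P3.1 MVP does not yet implement the worker pull in steps
3 to 5 (see Remaining Production Gaps):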

1. Backend validates resource/org/user authorization.
2. Backend starts the session using resource `secret_ref`.
3. Worker receives assignment with `secret_ref`, not plaintext credentials.
4. Worker asks an authorized secret resolver for the secret using:
   - `organization_id`
   - `resource_id`
   - `worker_id`
   - `session_id`
   - short-lived lease/session proof
5. Secret resolver returns credentials only to authorized workers for active
   leased sessions.
6. Worker keeps secret material in memory only and never logs it.

The current P3.1 MVP uses an encrypted PostgreSQL-backed store:

- `resource_secrets` stores ciphertext, nonce, key id, algorithm, version, safe
  metadata, and `payload_sha256`.
- `SECRET_ENCRYPTION_KEY_B64` or `SECRET_ENCRYPTION_KEY_FILE` supplies the
  AES-256-GCM key.
- `SECRET_ENCRYPTION_KEY_ID` labels the active key.
- The API can create/rotate a resource secret, but never returns plaintext.
- Session assignment resolves the secret only after organization/resource/
  worker/session/lease checks.
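
A minimal Python sketch of loading the key material at startup is shown below. It assumes that `SECRET_ENCRYPTION_KEY_FILE` points at a file containing base64 text and that the inline variable takes precedence when both are set; neither detail is confirmed by this document, and `load_secret_encryption_key` is a hypothetical name.

```python
import base64
import binascii
from pathlib import Path
from typing import Mapping

AES_256_KEY_BYTES = 32  # AES-256 uses a 256-bit key


def load_secret_encryption_key(env: Mapping[str, str]) -> bytes:
    """Return the 32-byte key from the environment mapping, or raise RuntimeError."""
    inline = env.get("SECRET_ENCRYPTION_KEY_B64")
    key_file = env.get("SECRET_ENCRYPTION_KEY_FILE")
    if inline:
        encoded = inline  # assumption: inline value wins over the file
    elif key_file:
        # Assumption: the file holds the same base64 text as the inline variable.
        encoded = Path(key_file).read_text(encoding="ascii")
    else:
        raise RuntimeError("secret encryption key material is not configured")
    try:
        key = base64.b64decode(encoded.strip(), validate=True)
    except (binascii.Error, ValueError) as exc:
        raise RuntimeError("secret encryption key is not valid base64") from exc
    if len(key) != AES_256_KEY_BYTES:
        raise RuntimeError("AES-256-GCM requires a 32-byte key")
    return key
```

Failing fast on a missing or wrong-sized key keeps a misconfigured backend from starting with unusable key material.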

The resolver boundary can later be backed by KMS, Vault, cloud secret managers,
or node-local secure delivery without changing the resource `secret_ref`
contract.

## Production Guard

In `APP_ENV=production`:

- RDP/VNC/SSH resources must have `secret_ref`.
- Plain credential-like keys are rejected in resource `metadata`.
- Session start rejects legacy resources that still contain plaintext
  credential-like metadata.
- Backend startup requires secret encryption key material.
- Development/smoke environments may continue using plaintext metadata while
  the resolver path is not used, but this is explicitly not production mode.

Credential-like metadata keys include password, username, domain, token,
private key, client secret, credential, credentials, secret, and common
underscore/hyphen variants.
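
One way to detect credential-like keys is to normalize each metadata key and compare it against the listed names. The Python sketch below is illustrative only: the exact-match strategy, the top-level-only scan, and the function names are assumptions, not the backend's actual rule.

```python
import re
from typing import Any, List, Mapping

# Names taken from the list above, stored without separators.
_CREDENTIAL_LIKE = frozenset({
    "password", "username", "domain", "token", "privatekey",
    "clientsecret", "credential", "credentials", "secret",
})


def _normalize(key: str) -> str:
    """Lowercase the key and drop underscores, hyphens, and whitespace."""
    return re.sub(r"[-_\s]", "", key.lower())


def find_credential_like_keys(metadata: Mapping[str, Any]) -> List[str]:
    """Return the top-level metadata keys that look like credentials, sorted."""
    return sorted(key for key in metadata if _normalize(key) in _CREDENTIAL_LIKE)


def assert_production_safe(metadata: Mapping[str, Any]) -> None:
    """Raise ValueError when credential-like keys are present."""
    offending = find_credential_like_keys(metadata)
    if offending:
        # Report key names only, never the values.
        raise ValueError("credential-like metadata keys: " + ", ".join(offending))
```

Exact matching after normalization covers variants such as `private_key`, `private-key`, and `Client_Secret`, but it would not catch prefixed keys such as `rdp_password` or keys nested inside objects; a stricter guard might extend the scan to those.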

## Data Plane Trust

Already accepted:

- backend signs `data_plane_token` with RS256 using its private key
- worker validates with the public key only
- token is short-lived
- token includes session, attachment, user, organization, worker, resource,
  allowed channels, expiry, and jti
- worker rejects wrong worker, wrong attachment, wrong organization, wrong
  resource, over-broad channels, failed/terminated sessions, and jti replay
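
The worker-side checks listed above can be sketched in Python as follows. The sketch assumes the RS256 signature has already been verified by a JWT library and only inspects the decoded claims. Claim names such as `worker_id` and `allowed_channels`, the `BindRejected` exception, and the reading of "over-broad channels" as "the bind requests channels the token does not allow" are all assumptions.

```python
import time
from typing import Any, Iterable, Mapping, Optional, Set


class BindRejected(Exception):
    """Raised when a direct bind attempt must be refused."""


def check_bind_claims(
    claims: Mapping[str, Any],
    *,
    worker_id: str,
    attachment_id: str,
    organization_id: str,
    resource_id: str,
    requested_channels: Iterable[str],
    session_state: str,
    seen_jti: Set[str],
    now: Optional[float] = None,
) -> None:
    """Validate already-verified token claims against the worker's own view."""
    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        raise BindRejected("token expired")
    expected = {
        "worker_id": worker_id,
        "attachment_id": attachment_id,
        "organization_id": organization_id,
        "resource_id": resource_id,
    }
    for name, value in expected.items():
        if claims.get(name) != value:
            raise BindRejected("wrong " + name)
    if not set(requested_channels) <= set(claims.get("allowed_channels", ())):
        raise BindRejected("over-broad channels")
    if session_state in ("failed", "terminated"):
        raise BindRejected("session is " + session_state)
    jti = claims.get("jti")
    if not jti or jti in seen_jti:
        raise BindRejected("missing or replayed jti")
    seen_jti.add(jti)  # record only after every other check has passed
```

In a real worker the `seen_jti` set would need an expiry tied to the token lifetime so that it does not grow without bound.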

Production still needs:

- deployed certificate chain for direct worker WSS on production nodes
- pinned or platform-issued worker certificates in live production config
- no smoke-only TLS bypass in production clients
- rotation process for data-plane signing keys
- audit for failed token validation/bind attempts

The P3.2 guard is in place:

- backend distinguishes `smoke_insecure`, `public_ca`, and `platform_ca`
  direct worker WSS trust modes
- production backend omits smoke-only direct candidates
- Windows production client skips untrusted or smoke-only direct candidates
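
The backend side of this guard can be illustrated with a short Python sketch. The candidate dictionary shape, the `kind` values, and the function name are hypothetical; only the three trust mode names and the production rule come from this document.

```python
from typing import Dict, List

TRUST_MODES = ("smoke_insecure", "public_ca", "platform_ca")


def connection_candidates(
    app_env: str, trust_mode: str, direct_url: str
) -> List[Dict[str, str]]:
    """Build the ordered candidate list advertised to a client.

    The direct worker WSS candidate comes first when it is allowed;
    the backend gateway fallback is always present.
    """
    if trust_mode not in TRUST_MODES:
        raise ValueError("unknown trust mode: " + trust_mode)
    candidates: List[Dict[str, str]] = []
    # Production never advertises a smoke-only direct candidate.
    if not (app_env == "production" and trust_mode == "smoke_insecure"):
        candidates.append(
            {"kind": "direct_worker_wss", "url": direct_url, "trust_mode": trust_mode}
        )
    candidates.append({"kind": "backend_gateway"})
    return candidates
```

Clients should still apply their own check, as the Windows production client does, because the server-side filter does not protect against a misconfigured or older backend.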

P3.3 test-stand smoke is complete:

- `resource_secrets` migration is applied on `docker-test`
- backend runs as `APP_ENV=production` with a test-only
  `SECRET_ENCRYPTION_KEY_FILE`
- a secret-backed RDP resource starts a real session through assignment-time
  secret resolution
- `resources.metadata`, `remote_sessions.metadata`, and `audit_events` were
  checked for plaintext username/password leakage
- production backend with `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`
  returns backend gateway fallback only
- development/smoke backend with the same trust mode advertises the explicit
  smoke-only direct worker WSS candidate
- `RAP_Transfers` smoke passed on the secret-backed resource

## Required Regression Tests

P3 regression tests must cover:

- plaintext resource credentials rejected in production
- RDP production resources require `secret_ref`
- development smoke plaintext metadata remains allowed
- data-plane allowed channels follow runtime policy
- direct bind rejects wrong worker
- direct bind rejects wrong user
- direct bind rejects wrong organization
- direct bind rejects wrong resource
- direct bind rejects old attachment
- direct bind rejects failed/terminated states

## Audit Events

Current audit coverage should remain for:

- session start
- attach
- detach
- takeover
- terminate
- failure

Future audit coverage should add:

- secret deleted
- production resource rejected because plaintext credential metadata was found

Audit entries must reference `secret_ref` and resource/session ids, never
plaintext secret values.

P3.1 implemented audit events for:

- `resource_secret_rotated`
- `resource_secret_accessed`
- `resource_secret_access_denied`

## Remaining Production Gaps

- External KMS/Vault integration is not implemented yet.
- Master-key rotation/re-encryption workflow is not implemented yet.
- The worker still receives resolved credentials through the transient
  assignment payload; a future resolver pull/token flow should reduce exposure
  in Redis control queues.
- The worker still depends on plaintext assignment metadata for development
  smoke.
- Production direct worker WSS certificate issuance/rotation and platform CA
  distribution are not complete.
- The test-stand secret key is a host-local test file, not a production KMS or
  HSM-backed key.
- Automated end-to-end policy denial coverage is still thin.