# Backend Foundation

Production-oriented Go backend skeleton for the remote access platform.

## Scope included

- configuration loading from environment
- HTTP server bootstrap with graceful shutdown
- PostgreSQL and Redis connectivity wiring
- migrations scaffold
- auth foundation with access/refresh tokens, hashed refresh rotation, trusted devices, and persisted auth sessions
- persistent session storage foundation for remote sessions, attachments, resource policies, and audit events
- session broker orchestration for start, attach, detach, takeover, terminate, failure, and detached-session recovery
- Redis-backed live session state, controller binding, attach tokens, heartbeat keys, worker routing, and reconnect support
- Redis-backed worker registration, lease lifecycle, heartbeat tracking, stale lease recovery, and routing queues
- worker assignment queueing and worker event ingestion for the minimal real RDP worker runtime
- websocket live plane with attach handshake, ping/pong heartbeat, state messages, takeover detection, and transport reconnect flow
- module boundaries for auth, resources, session broker, and websocket gateway
- worker registry scaffold to prepare later RDP worker integration
- per-resource certificate verification policy for RDP connections with `strict` default and explicit `ignore` override
- platform-core v2 foundations for organizations, memberships, identity sources, nodes, and node-agent control plane
- Data Plane v1 contract scaffolding for optional session response candidates/tokens, with current backend gateway behavior preserved as fallback
- production resource secret-readiness guard for rejecting plaintext credential-like metadata and requiring `secret_ref` for RDP/VNC/SSH resources in production mode
- encrypted resource secret storage/resolver MVP for production `secret_ref` usage

## Entry point

Run the API from `cmd/api`.

## Local dev

- backend: `pwsh -File scripts/smoke/run-backend.ps1`
- infra: `pwsh -File scripts/smoke/start-infra.ps1`
- migrations: `pwsh -File scripts/smoke/apply-migrations.ps1`
- worker image build: `docker build --tag rap-rdp-worker:dev --file workers/rdp-worker/Dockerfile workers/rdp-worker`
- end-to-end smoke path: [scripts/smoke/README.md](../scripts/smoke/README.md)

## Configuration

Use `configs/api.example.env` as the starting point for local environment variables.

Resource secret-readiness is controlled by `APP_ENV`:

- in `APP_ENV=production` or `APP_ENV=prod`, RDP/VNC/SSH resources must carry
  `secret_ref` and must not include plaintext credential-like fields in
  `metadata`
- in development and smoke environments, plaintext metadata remains allowed
  until the encrypted secret resolver is implemented
- the production guard is enforced both on resource create/update and on
  session start, so compat plaintext resources cannot be started in production
  accidentally
- `SECRET_ENCRYPTION_KEY_B64` or `SECRET_ENCRYPTION_KEY_FILE` supplies the
  AES-256-GCM master key for the MVP encrypted store; production mode refuses
  to start without one
- `SECRET_ENCRYPTION_KEY_ID` labels the active key version in stored records
- `PUT /api/v1/resources/{resourceID}/secret` creates or rotates a resource
  secret and updates `resources.secret_ref`; plaintext is never returned by the
  API
- session assignment keeps PostgreSQL metadata safe: `remote_sessions.metadata`
  stores `secret_ref`, while resolved credentials are merged only into the
  transient worker assignment after session/worker/lease checks
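
The production guard described above can be sketched as a single check that runs on resource create/update and again on session start. This is a minimal illustration, not the backend's actual code: the function names and the deny-list of credential-like keys are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// credentialLikeKeys is a hypothetical deny-list used only in this sketch;
// the real guard may recognize a different set of metadata fields.
var credentialLikeKeys = map[string]bool{
	"password":    true,
	"passwd":      true,
	"private_key": true,
	"token":       true,
}

// isProduction mirrors the two documented production values of APP_ENV.
func isProduction(appEnv string) bool {
	return appEnv == "production" || appEnv == "prod"
}

// requiresSecretRef reports whether the protocol is covered by the guard.
func requiresSecretRef(protocol string) bool {
	switch strings.ToLower(protocol) {
	case "rdp", "vnc", "ssh":
		return true
	}
	return false
}

// checkSecretReadiness returns an error when a production resource lacks
// secret_ref or carries a credential-like metadata field. Outside production
// it accepts everything, matching the compat behavior for dev and smoke.
func checkSecretReadiness(appEnv, protocol, secretRef string, metadata map[string]string) error {
	if !isProduction(appEnv) || !requiresSecretRef(protocol) {
		return nil
	}
	if strings.TrimSpace(secretRef) == "" {
		return errors.New("secret_ref is required in production")
	}
	for key := range metadata {
		if credentialLikeKeys[strings.ToLower(key)] {
			return fmt.Errorf("metadata field %q looks like a plaintext credential", key)
		}
	}
	return nil
}

func main() {
	meta := map[string]string{"host": "10.0.0.5", "password": "hunter2"}
	fmt.Println(checkSecretReadiness("development", "rdp", "", meta))
	fmt.Println(checkSecretReadiness("production", "rdp", "", meta))
	fmt.Println(checkSecretReadiness("production", "rdp", "ref-1", meta))
}
```

Calling the same function from both the resource handlers and the session-start path is one way to get the double enforcement the list describes.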

See `docs/architecture/SECURITY_SECRETS_READINESS.md` for the target
secret-reference model and remaining resolver/PKI gaps.

Data Plane v1 contract scaffolding is controlled by:

- `DATA_PLANE_TOKEN_TTL`, default `1m`
- `DATA_PLANE_TOKEN_PRIVATE_KEY_FILE`, optional path to an RSA private key PEM used to sign RS256 data-plane tokens
- `DATA_PLANE_TOKEN_PRIVATE_KEY_PEM`, optional inline RSA private key PEM; used when file path is not configured
- `DATA_PLANE_BACKEND_GATEWAY_URL`, default `/api/v1/gateway/ws`
- `DATA_PLANE_DIRECT_WORKER_WSS_URL_TEMPLATE`, optional; supports `{worker_id}` replacement
- `DATA_PLANE_DIRECT_WORKER_JSON_RUNTIME`, default `false`; advertises
  `runtime_transport=json_v1` only after the worker direct JSON bridge is
  deployed and verified
- `DATA_PLANE_DIRECT_WORKER_BINARY_RENDER`, default `false`; when the direct
  JSON runtime is enabled, advertises `render_transport=binary_v1` so DP-2
  clients can request binary render frames over direct worker WSS. Binary
  render candidates also advertise `supported_color_modes=["full_color","grayscale"]`
  and `default_color_mode="full_color"` for the DP-3A grayscale foundation.
- `DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE`, default `smoke_insecure`; allowed
  values are `smoke_insecure`, `public_ca`, and `platform_ca`.
- `DATA_PLANE_DIRECT_WORKER_TLS_CA_REF`, optional label for the platform CA or
  trust bundle version advertised to clients.

Data-plane tokens are RS256-signed. Only the backend holds the private key;
workers receive only the matching public key for validation. If no private key
is configured, the backend omits the optional `data_plane` offer and the
backend gateway fallback remains unchanged.
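
To make the key split concrete, the sketch below signs a compact RS256 token with the private key and verifies it with only the public key, using the Go standard library. The claim names (`session_id`, `worker_id`, `exp`) and all function names are assumptions for illustration; the real token schema is not documented in this README.

```go
package main

import (
	"crypto"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"
)

// claims uses hypothetical field names for this sketch.
type claims struct {
	SessionID string `json:"session_id"`
	WorkerID  string `json:"worker_id"`
	ExpiresAt int64  `json:"exp"` // Unix seconds
}

const headerJSON = `{"alg":"RS256","typ":"JWT"}`

var b64 = base64.RawURLEncoding

// newDemoKey generates a throwaway key; a real deployment would load the PEM
// configured for the backend instead.
func newDemoKey() *rsa.PrivateKey {
	key, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		panic(err)
	}
	return key
}

// signToken is the backend side: it needs the private key.
func signToken(key *rsa.PrivateKey, c claims) (string, error) {
	body, err := json.Marshal(c)
	if err != nil {
		return "", err
	}
	signingInput := b64.EncodeToString([]byte(headerJSON)) + "." + b64.EncodeToString(body)
	digest := sha256.Sum256([]byte(signingInput))
	sig, err := rsa.SignPKCS1v15(rand.Reader, key, crypto.SHA256, digest[:])
	if err != nil {
		return "", err
	}
	return signingInput + "." + b64.EncodeToString(sig), nil
}

// verifyToken is the worker side: it needs only the public key.
func verifyToken(pub *rsa.PublicKey, token string, nowUnix int64) (claims, error) {
	var c claims
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return c, errors.New("malformed token")
	}
	if parts[0] != b64.EncodeToString([]byte(headerJSON)) {
		return c, errors.New("unexpected token header")
	}
	sig, err := b64.DecodeString(parts[2])
	if err != nil {
		return c, err
	}
	digest := sha256.Sum256([]byte(parts[0] + "." + parts[1]))
	if err := rsa.VerifyPKCS1v15(pub, crypto.SHA256, digest[:], sig); err != nil {
		return c, err
	}
	body, err := b64.DecodeString(parts[1])
	if err != nil {
		return c, err
	}
	if err := json.Unmarshal(body, &c); err != nil {
		return c, err
	}
	if nowUnix >= c.ExpiresAt {
		return c, errors.New("token expired")
	}
	return c, nil
}

func main() {
	key := newDemoKey()
	now := time.Now()
	// One minute matches the documented DATA_PLANE_TOKEN_TTL default.
	tok, err := signToken(key, claims{SessionID: "s1", WorkerID: "w1", ExpiresAt: now.Add(time.Minute).Unix()})
	if err != nil {
		panic(err)
	}
	c, err := verifyToken(&key.PublicKey, tok, now.Unix())
	fmt.Println(c.SessionID, c.WorkerID, err)
}
```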

If no direct worker WSS URL template is configured, session responses still include the backend gateway fallback candidate only.
If the URL template is configured but `DATA_PLANE_DIRECT_WORKER_JSON_RUNTIME`
is `false`, the direct candidate is still present for contract visibility but is
not marked data-capable; DP-1D Windows clients will skip it and use the backend
gateway fallback.
If `DATA_PLANE_DIRECT_WORKER_BINARY_RENDER` is `false`, direct worker WSS
remains JSON/base64 for render. If it is `true`, only direct worker WSS render
is binary; backend gateway fallback remains JSON/base64.
In production, the backend does not advertise direct worker WSS when
`DATA_PLANE_DIRECT_WORKER_TLS_TRUST_MODE=smoke_insecure`; it keeps the backend
gateway fallback instead. Trusted direct candidates include `tls_trust_mode`,
`production_trusted`, `smoke_only`, and optional `tls_ca_ref` metadata. See
`docs/architecture/DIRECT_WORKER_TLS_PKI.md`.
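
The candidate rules above can be read as one decision function. The following sketch is an interpretation of those rules under stated assumptions: the struct fields, the `Kind` strings, and the candidate ordering are hypothetical, while the `{worker_id}` placeholder, `json_v1`, `binary_v1`, and `smoke_insecure` come from the configuration list.

```go
package main

import (
	"fmt"
	"strings"
)

// dataPlaneConfig is a hypothetical, flattened view of the DATA_PLANE_* settings.
type dataPlaneConfig struct {
	Production        bool
	HasSigningKey     bool
	GatewayURL        string
	WorkerURLTemplate string
	JSONRuntime       bool
	BinaryRender      bool
	TLSTrustMode      string
}

// candidate is a hypothetical shape for one entry of the data_plane offer.
type candidate struct {
	Kind             string
	URL              string
	DataCapable      bool
	RuntimeTransport string
	RenderTransport  string
}

// buildCandidates returns nil when no offer should be sent at all.
func buildCandidates(cfg dataPlaneConfig, workerID string) []candidate {
	if !cfg.HasSigningKey {
		return nil // no private key: the optional data_plane offer is omitted
	}
	gateway := candidate{Kind: "backend_gateway", URL: cfg.GatewayURL, DataCapable: true}
	if cfg.WorkerURLTemplate == "" {
		return []candidate{gateway}
	}
	if cfg.Production && cfg.TLSTrustMode == "smoke_insecure" {
		return []candidate{gateway} // never advertise smoke TLS in production
	}
	direct := candidate{
		Kind: "direct_worker_wss",
		URL:  strings.ReplaceAll(cfg.WorkerURLTemplate, "{worker_id}", workerID),
	}
	if cfg.JSONRuntime {
		direct.DataCapable = true
		direct.RuntimeTransport = "json_v1"
		if cfg.BinaryRender {
			direct.RenderTransport = "binary_v1"
		}
	}
	// Without JSONRuntime the direct candidate stays visible but not data-capable.
	return []candidate{direct, gateway}
}

func main() {
	cfg := dataPlaneConfig{
		HasSigningKey:     true,
		GatewayURL:        "/api/v1/gateway/ws",
		WorkerURLTemplate: "wss://{worker_id}.workers.example.test/dp",
		JSONRuntime:       true,
		TLSTrustMode:      "platform_ca",
	}
	for _, c := range buildCandidates(cfg, "worker-7") {
		fmt.Printf("%+v\n", c)
	}
}
```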

## Module layout

- `internal/platform` shared runtime, config, infra, and bootstrap concerns
- `internal/modules/auth` auth and trusted-device boundary
- `internal/modules/organization` organization model, org roles, and memberships
- `internal/modules/identitysource` local/LDAP/OIDC identity source model and future mapping foundations
- `internal/modules/resource` remote resource inventory boundary
- `internal/modules/sessionbroker` persistent session lifecycle, orchestration, audit, and Redis live-state boundary
- `internal/modules/sessiongateway` websocket attach/reconnect/takeover transport boundary
- `internal/modules/worker` worker registration, lease coordination, and control-plane routing boundary for future C++ RDP workers
- `internal/modules/node` node inventory, capabilities, enabled services, update policy, and partition state
- `internal/modules/nodeagent` node-agent registration, health, service status, and update/rollback control interface
- `pkg/contracts` cross-module contracts for sessions and worker control

## Backend responsibilities

- PostgreSQL remains the source of truth for auth sessions, devices, remote sessions, attachments, resource policies, and audit events
- Redis is used only for live routing and coordination: attach tokens, controller bindings, live session cache, worker registration, worker leases, heartbeats, and routing queues
- `worker:control:<worker_id>` carries worker assignments, `worker:queue:<session_id>` carries live control/input envelopes, and `worker:events` carries worker-reported lifecycle events back into broker processing
- Session broker owns state transitions and orchestration rules; websocket handlers call broker services instead of talking to PostgreSQL repositories directly
- Worker runtime stays behind interfaces and Redis coordination so the backend remains isolated from FreeRDP implementation details while the minimal real RDP worker plugs into the control plane
- RDP certificate verification is configured per resource through `certificate_verification_mode`
- resources are now org-scoped in PostgreSQL and remote sessions persist their owning organization without changing the proven worker/session runtime contracts
- session start/attach/takeover responses may include optional `data_plane` candidates and a short-lived signed data-plane token for DP-1 direct worker WSS migration; existing clients continue to use the current gateway path, and direct realtime use remains gated by explicit candidate metadata

## Authorization model

- `platform_admin` and `platform_recovery_admin` have global access across organizations, resources, and sessions
- in `INSTALLATION_AUTHORITY_MODE=strict`, platform-admin power is effective only
  when the user also has a valid signed row in `platform_role_grants`; changing
  `users.platform_role` in PostgreSQL alone no longer grants owner access
- first-owner bootstrap is available at
  `POST /api/v1/installation/bootstrap-owner` and requires a Product Root
  Ed25519 signature over an activation manifest in strict mode
- production (`APP_ENV=production` or `prod`) requires strict installation
  authority plus `INSTALLATION_PRODUCT_ROOT_PUBLIC_KEY_B64` or
  `INSTALLATION_PRODUCT_ROOT_PUBLIC_KEY_FILE`
- compat/dev installs can keep database-role behavior, and insecure first-owner
  bootstrap is available only when
  `INSTALLATION_INSECURE_BOOTSTRAP_ENABLED=true`
- `org_owner` and `org_admin` can create and update resources inside their organization and can manage any remote session inside that organization
- active non-admin memberships such as `org_operator`, `org_member`, and `org_viewer` are deny-by-default for admin actions; they can only access org-scoped reads and operate on their own session flows where the session broker explicitly allows it
- session start always authorizes the actor against the resource organization before worker reservation
- attach, detach, takeover, and terminate authorize against the owning remote session organization before any state transition is written
- worker-facing events do not bypass this model for user-originated commands; internal worker failure and heartbeat paths remain broker-internal control-plane operations
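
A simplified reading of these rules as code is shown below. It is a sketch under assumptions: the `actor` struct and function names are invented for the example, signature verification of `platform_role_grants` is reduced to a boolean, and the "own session" rule ignores any extra per-role restrictions the session broker may apply.

```go
package main

import "fmt"

// actor is a hypothetical view of the authenticated caller.
type actor struct {
	UserID        string
	PlatformRole  string
	HasValidGrant bool              // a signed platform_role_grants row was verified
	OrgRoles      map[string]string // organization ID -> active membership role
}

// isPlatformAdmin applies the strict-mode rule: the database role alone is
// not enough unless a valid signed grant backs it.
func isPlatformAdmin(a actor, strict bool) bool {
	switch a.PlatformRole {
	case "platform_admin", "platform_recovery_admin":
		return !strict || a.HasValidGrant
	}
	return false
}

// canManageResource covers resource create/update inside an organization.
func canManageResource(a actor, orgID string, strict bool) bool {
	if isPlatformAdmin(a, strict) {
		return true
	}
	switch a.OrgRoles[orgID] {
	case "org_owner", "org_admin":
		return true
	}
	return false
}

// canControlSession covers attach, detach, takeover, and terminate. Other
// active members are limited to sessions they own; non-members are denied.
func canControlSession(a actor, orgID, sessionOwnerID string, strict bool) bool {
	if canManageResource(a, orgID, strict) {
		return true
	}
	_, member := a.OrgRoles[orgID]
	return member && a.UserID == sessionOwnerID
}

func main() {
	operator := actor{UserID: "u2", OrgRoles: map[string]string{"org-1": "org_operator"}}
	fmt.Println(canManageResource(operator, "org-1", true))
	fmt.Println(canControlSession(operator, "org-1", "u2", true))
	fmt.Println(canControlSession(operator, "org-1", "u9", true))
}
```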

## Migration safety

- `000005_platform_core_v2` bootstraps a single `default` organization and backfills existing `resources.organization_id` and `remote_sessions.organization_id` into that organization before setting `NOT NULL`
- `000006_default_org_memberships_backfill` safely restores access continuity by inserting missing active memberships for existing users into the `default` organization
- the backfill is idempotent because it only inserts rows missing under the `(organization_id, user_id)` uniqueness constraint
- platform administrators are backfilled as `org_owner` in the default organization, while other existing users are backfilled as `org_member`
- if `000005` fails before the `NOT NULL` step, PostgreSQL rolls back the transaction and leaves pre-v2 rows untouched; if `000006` is rerun, it skips already-created memberships rather than duplicating them

## Platform-Core V2 Notes

- `organizations`, `organization_memberships`, and `organization_roles` establish multi-tenant ownership and basic org-scoped authorization boundaries
- `identity_sources` and `identity_mappings` are foundation-only in this phase; full LDAP/OIDC sync and claim/group ingestion are intentionally deferred
- `nodes`, `node_capabilities`, `node_services`, `node_update_policies`, `node_partition_states`, and `node_agent_update_runs` provide the first control-plane model for node and node-agent lifecycle
- current proven RDP session lifecycle remains preserved: the session broker still orchestrates the same worker/session behavior, but it now records organization ownership via org-scoped resources
- PostgreSQL remains the source of truth for organizations, memberships, org-scoped resources, identity sources, nodes, node-agent state, and session lifecycle state

## Resource Certificate Verification

- `strict` is the default and keeps normal certificate validation enabled in the worker runtime
- `ignore` must be explicitly stored on the resource and allows that one RDP connection to skip certificate validation
- the backend passes this policy through session assignment data; it is not a global backend toggle

## Messaging Model

- HTTP errors now use a structured envelope:
  - `error.code`
  - `error.message_key`
  - `error.fallback_message`
  - `error.details`
  - `error.trace_id`
- `internal/platform/httpx` owns error normalization and trace-id generation so handlers can keep calling `WriteError(...)` without changing business logic.
- For `5xx` responses, user-facing payloads are normalized to a generic English fallback message; raw internal details stay in logs and diagnostics.
- For `4xx` responses, stable `code` and `message_key` are derived from the current fallback message, so clients can localize without depending on raw English text as the primary contract.
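
As an illustration of the envelope shape, the sketch below models the five documented fields and the `5xx` normalization. It is not the `internal/platform/httpx` implementation: the type names, the `buildError` helper, and the concrete `code` and `message_key` strings are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// errorBody carries the documented fields of the structured envelope.
type errorBody struct {
	Code            string         `json:"code"`
	MessageKey      string         `json:"message_key"`
	FallbackMessage string         `json:"fallback_message"`
	Details         map[string]any `json:"details,omitempty"`
	TraceID         string         `json:"trace_id"`
}

type errorEnvelope struct {
	Error errorBody `json:"error"`
}

// buildError normalizes 5xx responses to a generic message so raw internal
// text never reaches the client. The generic strings here are hypothetical.
func buildError(status int, code, messageKey, message, traceID string) errorEnvelope {
	if status >= 500 {
		code = "internal_error"
		messageKey = "errors.internal"
		message = "Internal server error."
	}
	return errorEnvelope{Error: errorBody{
		Code:            code,
		MessageKey:      messageKey,
		FallbackMessage: message,
		TraceID:         traceID,
	}}
}

func main() {
	env := buildError(401, "auth.invalid_credentials", "errors.auth.invalid_credentials",
		"Invalid login or password.", "trace-123")
	out, err := json.MarshalIndent(env, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A client would localize by `message_key` and fall back to `fallback_message` only when no translation exists.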

## WebSocket Messaging

- Session gateway envelopes keep the existing `type` and `payload` contract.
- User-facing websocket events now also include `event` with:
  - `code`
  - `message_key`
  - `fallback_message`
  - `details`
  - `trace_id`
- `session.taken_over`, terminal `session.state`, `transport.closed`, and protocol-level errors now carry this structured event object.
- Existing payload semantics remain intact for compatibility with the already proven session lifecycle.
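
The compatibility claim can be illustrated with two decoders reading the same frame: one that knows only `type` and `payload`, and one that also reads `event`. The struct definitions, the payload content, and the `code` and `message_key` values below are assumptions; only the field names come from the lists above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// event mirrors the documented structured event fields.
type event struct {
	Code            string         `json:"code"`
	MessageKey      string         `json:"message_key"`
	FallbackMessage string         `json:"fallback_message"`
	Details         map[string]any `json:"details,omitempty"`
	TraceID         string         `json:"trace_id"`
}

// transportEnvelope is a hypothetical shape for a gateway frame.
type transportEnvelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload,omitempty"`
	Event   *event          `json:"event,omitempty"`
}

// legacyEnvelope models a client that predates the event object.
type legacyEnvelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// takenOverFrame builds a session.taken_over frame with a structured event.
func takenOverFrame(traceID string) ([]byte, error) {
	return json.Marshal(transportEnvelope{
		Type:    "session.taken_over",
		Payload: json.RawMessage(`{"reason":"takeover"}`), // illustrative payload
		Event: &event{
			Code:            "session.taken_over",
			MessageKey:      "session.events.taken_over",
			FallbackMessage: "This session was taken over by another client.",
			TraceID:         traceID,
		},
	})
}

func decodeLegacy(raw []byte) (legacyEnvelope, error) {
	var env legacyEnvelope
	err := json.Unmarshal(raw, &env)
	return env, err
}

func decodeFull(raw []byte) (transportEnvelope, error) {
	var env transportEnvelope
	err := json.Unmarshal(raw, &env)
	return env, err
}

func main() {
	raw, err := takenOverFrame("trace-42")
	if err != nil {
		panic(err)
	}
	old, _ := decodeLegacy(raw)
	full, _ := decodeFull(raw)
	fmt.Println(old.Type, string(old.Payload))
	fmt.Println(full.Event.MessageKey, full.Event.TraceID)
}
```

Because `encoding/json` ignores unknown fields by default, the older decoder keeps working when `event` is added.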

## Message Rules

- Keep English as the only development language for `fallback_message`, logs, and diagnostics.
- New HTTP handlers should prefer `httpx.WriteError(...)` for user-facing failures instead of hand-building `"error": "..."` JSON.
- New websocket user-facing notifications should populate `TransportEnvelope.Event` with a stable `code` and `message_key`.
- Do not use raw human-readable English text as the primary client contract; it should only remain as fallback text.
- This messaging layer is now runtime-proven against the live Windows smoke flow for invalid-login errors, websocket takeover delivery, websocket state fallback rendering, and worker-death failure handling.

## Clipboard Policy

RDP text clipboard is controlled per resource through `resource_policies.clipboard_mode`.
Allowed values are `disabled`, `client_to_server`, `server_to_client`, and
`bidirectional`; the default is `disabled`. The compat `clipboard_enabled`
column is retained only for compatibility and migration/backfill, while new
runtime decisions use `clipboard_mode`.

Clipboard enforcement happens in the real data path:

- `sessionbroker.ResourcePolicy.ClipboardMode` is loaded from PostgreSQL and
  embedded into the session assignment metadata sent to the worker.
- `sessiongateway.Module.handleEnvelope` blocks client-to-server clipboard
  envelopes unless the session is `active` and the policy allows that direction.
- `worker.EventProcessor` sends worker-originated clipboard text through
  `sessionbroker.Service.UpdateWorkerClipboardText`, which applies the same
  active-state and server-to-client policy checks before updating live state.
- Clipboard messages carry `sequence_id`, `origin`, and `content_hash` so
  clients and workers can avoid feedback loops across reattach/takeover paths.
- Redis stores clipboard text only as transient live state for routing to the
  active controller; PostgreSQL remains authoritative for policy/session state.
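
The direction check shared by the gateway and the worker event path can be sketched as a small pure function. The function name and the choice to treat unknown modes as `disabled` are assumptions; the mode values and the `active` requirement come from the text above.

```go
package main

import "fmt"

const (
	clientToServer = "client_to_server"
	serverToClient = "server_to_client"
)

// clipboardAllowed reports whether clipboard text may flow in the given
// direction. Any unknown mode falls through to the documented default,
// which is disabled.
func clipboardAllowed(mode, direction, sessionState string) bool {
	if sessionState != "active" {
		return false
	}
	if direction != clientToServer && direction != serverToClient {
		return false
	}
	switch mode {
	case "bidirectional":
		return true
	case clientToServer, serverToClient:
		return mode == direction
	default:
		return false
	}
}

func main() {
	fmt.Println(clipboardAllowed("client_to_server", clientToServer, "active"))
	fmt.Println(clipboardAllowed("client_to_server", serverToClient, "active"))
	fmt.Println(clipboardAllowed("bidirectional", serverToClient, "detached"))
}
```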

## File Upload Policy

Stage 5.1 introduces client-to-server file upload as a policy-gated RDP
feature. The authoritative policy field is
`resource_policies.file_transfer_mode`; allowed values are `disabled`,
`client_to_server`, `server_to_client`, and `bidirectional`, but only
`client_to_server` behavior is implemented in this stage. The default is
`disabled`. The compat `file_transfer_enabled` column is retained only as a
derived compatibility flag and must not be treated as the primary policy.

Enforcement is deliberately duplicated in the real data path:

- `resource.Module` exposes `file_transfer_mode` in resource create, update,
  list, and read payloads.
- `sessionbroker.Service.StartRemoteSession` embeds `file_transfer_mode` into
  assignment metadata and requests the worker `file-transfer` capability only
  when client-to-server upload is allowed.
- `sessiongateway.Module.handleFileUploadStart` and
  `handleFileUploadChunk` require an active session, current controller,
  allowed policy mode, valid UUID `transfer_id`, safe file name, 25 MiB max
  file size, and 256 KiB max chunk size before routing chunks to the worker.
- Redis is used only to route bounded upload envelopes to the worker. The file
  itself is written by the worker to controlled worker storage; PostgreSQL
  remains authoritative for policy and session state.
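
The gateway-side checks listed above can be sketched as two validators, one for the start message and one per chunk. The limits (25 MiB, 256 KiB) and the list of conditions come from this section; the type and function names, the exact file-name rules, and treating `bidirectional` as upload-permitting are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strings"
)

const (
	maxFileSize  = 25 << 20  // 25 MiB
	maxChunkSize = 256 << 10 // 256 KiB
)

var uuidPattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// isSafeFileName uses illustrative rules: no path separators, no drive
// colon, no control characters, and no "." or ".." names.
func isSafeFileName(name string) bool {
	if name == "" || name == "." || name == ".." || len(name) > 255 {
		return false
	}
	if strings.ContainsAny(name, `/\:`) {
		return false
	}
	for _, r := range name {
		if r < 0x20 {
			return false
		}
	}
	return true
}

// uploadStart is a hypothetical view of the upload start payload.
type uploadStart struct {
	TransferID string
	FileName   string
	Size       int64
}

func validateUploadStart(req uploadStart, mode, sessionState string, isController bool) error {
	switch {
	case sessionState != "active":
		return errors.New("session is not active")
	case !isController:
		return errors.New("caller is not the current controller")
	case mode != "client_to_server" && mode != "bidirectional": // bidirectional is an assumption
		return errors.New("file upload is not allowed by policy")
	case !uuidPattern.MatchString(req.TransferID):
		return errors.New("transfer_id must be a UUID")
	case !isSafeFileName(req.FileName):
		return errors.New("unsafe file name")
	case req.Size < 0 || req.Size > maxFileSize:
		return errors.New("file size out of bounds")
	}
	return nil
}

// validateChunk bounds a single chunk and the running total.
func validateChunk(chunkLen int, received, total int64) error {
	if chunkLen <= 0 || chunkLen > maxChunkSize {
		return errors.New("chunk size out of bounds")
	}
	if received+int64(chunkLen) > total {
		return errors.New("chunk exceeds declared file size")
	}
	return nil
}

func main() {
	req := uploadStart{TransferID: "123e4567-e89b-12d3-a456-426614174000", FileName: "report.pdf", Size: 1 << 20}
	fmt.Println(validateUploadStart(req, "client_to_server", "active", true))
	fmt.Println(validateChunk(300<<10, 0, req.Size))
}
```

Running the same start checks again in the chunk handler keeps a late policy change or a takeover from letting an in-flight upload continue.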

## File Download Policy

Stage 5.2 adds a runtime-proven server-to-client download path for RDP. The
policy field remains `resource_policies.file_transfer_mode`; `server_to_client`
and `bidirectional` allow download, while `disabled` and `client_to_server`
block it. The default remains `disabled`.

The v1 download model uses only the restricted `RAP_Transfers\ToClient`
drop-zone inside the existing per-session visible transfer directory. Backend
gateway accepts only `file_download.start`, `file_download.ack`, and
`file_download.cancel` from the current controller of an active session and
routes them to the worker after policy validation. Worker-origin
`file_download.*` events are stored only as transient live state for
backend-gateway fallback delivery; PostgreSQL remains authoritative for
session/resource/policy state and must not store file contents.
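
The gateway gate for client-originated download messages can be sketched as an allow-list plus the policy and controller checks. The message types and mode semantics come from this section; the function names and error texts are assumptions.

```go
package main

import (
	"errors"
	"fmt"
)

// clientDownloadTypes lists the only download messages the gateway accepts
// from a client.
var clientDownloadTypes = map[string]bool{
	"file_download.start":  true,
	"file_download.ack":    true,
	"file_download.cancel": true,
}

// downloadAllowed maps file_transfer_mode to the download decision.
func downloadAllowed(mode string) bool {
	return mode == "server_to_client" || mode == "bidirectional"
}

// checkDownloadEnvelope returns nil when the message may be routed to the worker.
func checkDownloadEnvelope(msgType, mode, sessionState string, isController bool) error {
	switch {
	case !clientDownloadTypes[msgType]:
		return errors.New("unsupported download message type")
	case sessionState != "active":
		return errors.New("session is not active")
	case !isController:
		return errors.New("caller is not the current controller")
	case !downloadAllowed(mode):
		return errors.New("file download is not allowed by policy")
	}
	return nil
}

func main() {
	fmt.Println(checkDownloadEnvelope("file_download.start", "bidirectional", "active", true))
	fmt.Println(checkDownloadEnvelope("file_download.start", "client_to_server", "active", true))
	fmt.Println(checkDownloadEnvelope("file_download.chunk", "bidirectional", "active", true))
}
```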

The direct worker WSS path is also lifecycle-gated: detach returns
`file_download.blocked`, old-controller takeover returns `session.taken_over`,
and worker failure closes the direct transport after PostgreSQL transitions the
session to `failed`.