rdp-proxy/docs/architecture/DATA_PLANE_V1.md
working version, but the speed is 10 Mbit/s
2026-05-22 21:46:49 +03:00


Data Plane v1 for RDP

Archived status: this document is a historical RDP/WebSocket stage record, not the current runtime source of truth for transport architecture. The active fabric transport model is QUIC-only between nodes; see docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md, docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md, and docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md.

Status: DP-3A grayscale full-frame binary render foundation is implemented and smoke-proven on the test Docker environment as of 2026-04-25. DP-3B adaptive quality policy/selection is intentionally paused. The accepted C++ RDP Adapter baseline is the ordered-region path. RDP-Perf-6 makes direct dirty-region binary render explicit with render.frame.full / render.frame.region RAP2 message types and is build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26. The current test Docker deployment for the RDP Adapter performance path is rap-rdp-worker:rdp-perf6-dirty-region. The Stage 5.2 core download data path remains runtime-proven for direct worker WSS and backend gateway fallback. Data-plane and RDP work are paused; the next active focus is Stage C10 Fabric Core / cluster foundation, not another data-plane feature.

This document defines the first staged data-plane evolution for the RDP MVP. It does not implement direct worker WebSocket runtime, mesh routing, VPN, QUIC, UDP, WebRTC, relay nodes, or multi-cluster behavior.

The long-term platform target is defined in docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md. This document narrows that target to DP-1: direct client-to-worker WSS for RDP realtime traffic, with the current backend gateway retained as fallback/debug.

1. Current Problem

The current RDP MVP routes realtime input/render through the backend WebSocket gateway and Redis-backed coordination. This is acceptable for fallback, debugging, lifecycle proof, and early MVP validation.

It is not acceptable as the production realtime path because:

  • render frames are high-rate and high-volume
  • base64/JSON render payloads add CPU and payload overhead
  • backend gateway can become a bottleneck under concurrent sessions
  • input can compete with render/frame processing
  • backend API capacity should be reserved for control-plane work
  • Redis must not become a frame transport or durable render store

The current implementation remains valid as fallback while DP-1 is introduced in stages.

2. Target DP-1 Path

Target DP-1 path:

Windows client
  -> direct WSS data-plane connection
  -> rdp-worker realtime endpoint
  -> existing RDP session runtime
  -> FreeRDP

Control-plane path remains:

Windows client
  -> backend API
  -> auth / org / policy / session broker
  -> worker selection
  -> short-lived data-plane token issuance

Fallback path remains:

Windows client
  -> backend WebSocket gateway
  -> current gateway/Redis/worker coordination path

DP-1 does not replace the session broker. It only moves realtime session traffic away from backend relay when direct worker WSS is available and authorized.

3. Responsibilities

Backend as Control Plane

Backend remains responsible for:

  • authentication
  • organization selection and isolation
  • resource authorization
  • resource policy evaluation
  • session lifecycle
  • worker selection
  • attachment ownership
  • takeover semantics
  • audit
  • short-lived data-plane token issuing
  • returning data-plane candidates
  • retaining backend gateway fallback

Backend must not become the production high-rate render relay.

Worker as Direct Realtime Endpoint

The RDP worker becomes responsible for:

  • exposing an authorized direct WSS endpoint
  • validating data_plane_token
  • binding a WSS connection to an existing session runtime
  • enforcing session, attachment, user, organization, and channel scope
  • carrying realtime logical channels directly:
    • input
    • render
    • clipboard
    • file_upload
    • control / heartbeat
    • telemetry
  • preserving existing FreeRDP runtime boundaries
  • preserving policy enforcement already present in worker runtime

The worker must not create a new RDP session just because a direct WSS connection attaches. It must bind to the existing broker-created session runtime.

Windows Client

The Windows client will eventually:

  • read data-plane candidates from session start/attach responses
  • prefer direct_worker_wss when available
  • fall back to backend_gateway when direct worker WSS is unavailable
  • keep existing lifecycle behavior unchanged
  • keep backend gateway support for debug/fallback

No client behavior changes are required for DP-1A.

4. Backend Contract Proposal

On session start, attach, and takeover, backend should extend the response with data-plane candidates.

Example:

{
  "session_id": "session-123",
  "attachment_id": "attachment-456",
  "gateway_url": "wss://backend.example.com/api/v1/gateway/ws",
  "data_plane": {
    "preferred": "direct_worker_wss",
    "token": "short-lived-data-plane-token",
    "expires_at": "2026-04-25T13:00:00Z",
    "candidates": [
      {
        "type": "direct_worker_wss",
        "url": "wss://worker-node.example.com/rap/v1/data-plane"
      },
      {
        "type": "backend_gateway",
        "url": "wss://backend.example.com/api/v1/gateway/ws"
      }
    ]
  }
}

Compatibility rules:

  • Existing fields must remain valid.
  • Existing clients may ignore data_plane.
  • gateway_url remains available for fallback/debug.
  • The backend must not return direct worker candidates unless the worker is live and route policy permits it.
  • Token TTL must be short.
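The compatibility rules above can be illustrated with a minimal client-side sketch. It assumes the response shape from the example JSON; the function and class names (`parse_offer`, `ordered_urls`, `DataPlaneOffer`, `DataPlaneCandidate`) are hypothetical and are not taken from the real client.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataPlaneCandidate:
    type: str
    url: str
    metadata: dict = field(default_factory=dict)


@dataclass
class DataPlaneOffer:
    preferred: str
    token: str
    expires_at: str
    candidates: list


def parse_offer(response: dict) -> Optional[DataPlaneOffer]:
    """Return the data_plane offer, or None when the backend did not send one."""
    raw = response.get("data_plane")
    if not isinstance(raw, dict):
        return None  # older backend or feature disabled: behave like an old client
    candidates = [
        DataPlaneCandidate(c["type"], c["url"], c.get("metadata") or {})
        for c in raw.get("candidates", [])
        if isinstance(c, dict) and "type" in c and "url" in c
    ]
    return DataPlaneOffer(
        preferred=raw.get("preferred", "backend_gateway"),
        token=raw.get("token", ""),
        expires_at=raw.get("expires_at", ""),
        candidates=candidates,
    )


def ordered_urls(response: dict) -> list:
    """URLs to try in order: preferred candidate type first, gateway_url as last resort."""
    offer = parse_offer(response)
    urls = []
    if offer is not None:
        # sorted() is stable, so non-preferred candidates keep their original order.
        ranked = sorted(offer.candidates, key=lambda c: c.type != offer.preferred)
        urls.extend(c.url for c in ranked)
    gateway = response.get("gateway_url")
    if gateway and gateway not in urls:
        urls.append(gateway)
    return urls
```

A response without `data_plane` yields only `gateway_url`, which matches the rule that existing clients may ignore the new field.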

Proposed DTO Shape

Names are proposals only.

SessionControlResult
  session
  attachment
  attach_token
  gateway_url
  data_plane?

DataPlaneOffer
  preferred
  token
  expires_at
  candidates[]

DataPlaneCandidate
  type
  url
  worker_id?
  node_id?
  cluster_id?
  priority?
  metadata?

5. Data Plane Token Model

data_plane_token must be short-lived and scoped. It is not a general API token.

Required claims:

  • session_id
  • attachment_id
  • user_id
  • organization_id
  • cluster_id if available
  • worker_id
  • resource_id
  • allowed_channels
  • expires_at
  • nonce / jti
  • issued_at
  • issuer
  • audience

Allowed channel values:

  • input
  • render
  • clipboard
  • file_upload
  • control
  • telemetry

Validation rules:

  • token must be signed by the backend using RS256 with the backend's private key
  • worker must validate with public key only and must not hold a signing secret
  • token must be short-lived
  • token must match the worker receiving it
  • token must match an active session runtime
  • token must match current attachment/controller where required
  • token must not grant channels not allowed by resource policy
  • token must not survive session termination
  • token replay must be rejected or bounded by jti / nonce cache
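The claim checks above can be sketched as follows. This is an illustration only: it assumes the RS256 signature has already been verified by a JWT library, that `expires_at` is an ISO-8601 string as in the response example, and that the worker can look up a small view of the active runtime. `RuntimeView`, `check_claims`, and the rejection reason strings are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED_CHANNELS = {"input", "render", "clipboard", "file_upload", "control", "telemetry"}


@dataclass
class RuntimeView:
    """Hypothetical snapshot of the session runtime the token must match."""
    session_id: str
    attachment_id: str
    organization_id: str
    resource_id: str
    active: bool
    policy_channels: frozenset


def check_claims(claims: dict, worker_id: str, runtime, now: datetime) -> list:
    """Return rejection reasons; an empty list means the claims are acceptable."""
    reasons = []
    raw_expiry = claims.get("expires_at")
    if not raw_expiry:
        reasons.append("missing_expires_at")
    else:
        expires_at = datetime.fromisoformat(raw_expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)  # assumption: UTC
        if now >= expires_at:
            reasons.append("expired")
    if claims.get("worker_id") != worker_id:
        reasons.append("wrong_worker")
    if runtime is None or not runtime.active:
        reasons.append("no_active_runtime")
        return reasons  # nothing to compare against; never create a runtime here
    for key in ("session_id", "attachment_id", "organization_id", "resource_id"):
        if claims.get(key) != getattr(runtime, key):
            reasons.append("mismatched_" + key)
    requested = set(claims.get("allowed_channels", []))
    if not requested <= ALLOWED_CHANNELS or not requested <= runtime.policy_channels:
        reasons.append("channels_exceed_policy")
    return reasons
```

Collecting reasons instead of raising on the first failure makes it easier to report a rejected direct attach back to the control plane; replay protection via `jti` is a separate check and is not covered here.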

Token refresh is not part of DP-1A. Future stages may either reissue tokens through the control plane or renew direct connections through a controlled flow.

6. Direct WSS Channel Model

DP-1 uses a single WSS connection with logical channels. Later stages may split transports, but DP-1 must keep the model simple and bounded.

control

Reliable channel.

Used for:

  • attach handshake
  • heartbeat
  • session state messages
  • detach notification
  • takeover notification
  • terminate notification
  • protocol errors

input

Highest-priority channel.

Rules:

  • input never waits behind render
  • key down/up and mouse button/wheel events must be ordered
  • mouse move may be coalesced to latest
  • input queues must be bounded
  • stale mouse move may be dropped
  • click/key/wheel must not be dropped under normal operation
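One way to satisfy these input rules is a bounded queue that coalesces only adjacent mouse moves, so ordering relative to clicks and keys is preserved. The sketch below is an assumption-laden illustration: the `mouse_move` message type string, the `InputQueue` class, and the default bound are hypothetical and do not describe the worker's actual queue.

```python
from collections import deque

MOVE = "mouse_move"  # hypothetical message_type for pointer motion


class InputQueue:
    """Bounded FIFO: mouse moves coalesce to the latest, other events stay ordered."""

    def __init__(self, max_len: int = 256):
        self._items = deque()
        self._max_len = max_len

    def push(self, event: dict) -> bool:
        """Queue an event; return False when it had to be refused."""
        is_move = event["message_type"] == MOVE
        # Coalesce only with a move at the tail, so a move never jumps over a click.
        if is_move and self._items and self._items[-1]["message_type"] == MOVE:
            self._items[-1] = event
            return True
        if len(self._items) >= self._max_len:
            if is_move:
                return False  # a stale move may be dropped
            # Make room for a reliable event by evicting the oldest queued move.
            oldest_move = next(
                (i for i, q in enumerate(self._items) if q["message_type"] == MOVE),
                None,
            )
            if oldest_move is None:
                return False  # queue holds only reliable events: caller must back off
            del self._items[oldest_move]
        self._items.append(event)
        return True

    def drain(self) -> list:
        """Remove and return all queued events in order."""
        items = list(self._items)
        self._items.clear()
        return items
```

Returning `False` for a refused click or key leaves the decision to the caller (for example, closing the connection with a protocol error), since such events must not be dropped silently under normal operation.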

render

Droppable/latest-frame channel.

Rules:

  • stale render frames must be dropped
  • latest frame wins
  • render must not block input/control
  • binary payloads should be used on direct data plane
  • compat fallback may continue existing JSON/base64 behavior during migration
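The "latest frame wins" rule can be pictured as a single-slot mailbox between the capture side and the WSS writer. This is a minimal sketch under assumed names (`LatestFrameSlot`, `publish`, `take`); it is not the worker's actual writer.

```python
import threading


class LatestFrameSlot:
    """Holds at most one pending render frame; newer frames replace older ones."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None  # (sequence, payload) or None
        self.dropped = 0    # frames discarded without being sent

    def publish(self, sequence: int, payload: bytes) -> None:
        """Called by the capture side; never blocks on the network writer."""
        with self._lock:
            if self._frame is not None:
                self.dropped += 1
                if sequence <= self._frame[0]:
                    return  # incoming frame is stale: keep the newer pending one
            self._frame = (sequence, payload)

    def take(self):
        """Called by the writer; returns the pending frame or None."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame
```

Because `publish` only swaps a reference under a short lock, a slow client connection delays render delivery but cannot back up into input or control handling.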

clipboard

Reliable policy-gated channel.

Rules:

  • existing clipboard_mode applies
  • text-only behavior remains until richer formats are explicitly designed
  • blocked behavior must remain localized in clients
  • worker must enforce policy again

file_upload

Reliable chunked channel.

Rules:

  • existing file_transfer_mode applies
  • bounded chunk size
  • content hash
  • transfer id
  • no arbitrary path exposure
  • file upload must not block input
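A sender-side sketch of these file upload rules follows. The `file_upload.start` and `file_upload.chunk` message types appear later in this document; the payload field names, the 64 KiB bound, and the `plan_upload` helper are assumptions made for illustration, and the wire encoding of chunk bytes is left out.

```python
import hashlib
import uuid

MAX_CHUNK_BYTES = 64 * 1024  # assumed bound; the real limit is not stated here


def plan_upload(file_name: str, data: bytes, chunk_size: int = MAX_CHUNK_BYTES):
    """Build a start message and an ordered list of bounded chunk messages."""
    if not 0 < chunk_size <= MAX_CHUNK_BYTES:
        raise ValueError("chunk_size out of bounds")
    # Send only the base name so no client path is exposed to the worker.
    safe_name = file_name.replace("\\", "/").rsplit("/", 1)[-1]
    transfer_id = str(uuid.uuid4())
    start = {
        "message_type": "file_upload.start",
        "transfer_id": transfer_id,
        "file_name": safe_name,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
    chunks = [
        {
            "message_type": "file_upload.chunk",
            "transfer_id": transfer_id,
            "index": index,
            "data": data[offset:offset + chunk_size],
        }
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]
    return start, chunks
```

Keeping each chunk small and bounded lets the sender interleave input and control messages between chunks, which is one way to keep uploads from blocking input.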

telemetry

Low-priority channel.

Rules:

  • sampled or lossy telemetry is acceptable
  • telemetry must not block user traffic
  • useful metrics include input RTT, frame FPS, dropped frames, queue length, decode time, render apply time

7. Message Framing

DP-1 uses:

  • JSON control messages for small envelopes
  • binary WebSocket frames for render payloads
  • no base64 for direct data-plane render frames

Backend fallback keeps the current JSON/base64 frame path for debug/fallback. Direct worker WSS uses binary render frames when the backend advertises render_transport=binary_v1 and the client requests render_transport=binary_v1.

JSON Envelope

Small control/reliable messages may use JSON:

{
  "protocol_version": 1,
  "session_id": "session-123",
  "attachment_id": "attachment-456",
  "channel": "input",
  "message_type": "mouse",
  "sequence": 1024,
  "timestamp": "2026-04-25T13:00:00.000Z",
  "flags": {},
  "payload": {}
}

Binary Frame Header

DP-2 uses a fixed 16-byte preamble followed by a UTF-8 JSON header and a raw binary payload:

offset  size  field
0       4     magic = "RAP2"
4       2     protocol_version, little-endian uint16, currently 1
6       2     flags, little-endian uint16
8       4     header_length, little-endian uint32
12      4     payload_length, little-endian uint32
16      n     UTF-8 JSON header
16+n    m     raw render payload bytes

The DP-2 JSON header contains:

  • protocol_version
  • session_id
  • channel, currently render
  • message_type, currently render.frame.full or render.frame.region on direct worker WSS; session.frame remains accepted as the DP-2 binary message type for compatibility
  • sequence
  • timestamp
  • flags
  • payload_length
  • frame_width
  • frame_height
  • frame_stride
  • frame_format
  • optional region fields when message_type=render.frame.region: region_x, region_y, region_width, region_height, region_stride, region_format=BGRA32
  • optional color_mode, currently full_color or grayscale
  • optional quality_profile
  • optional original_frame_format
  • optional output_frame_format
  • optional raw_frame_bytes
  • optional binary_direct_bytes
  • optional diagnostics: full_frame_bytes, region_bytes, region_savings_percent, diff_time_ms, render_update_reason, fallback_to_full_frame_reason
  • optional input_correlation_id
  • optional worker_frame_captured_at

Binary frames must include a fixed or clearly parseable header before payload.

Required header fields:

  • protocol_version
  • session_id
  • channel
  • message_type
  • sequence
  • timestamp
  • flags
  • payload_length

Render payload must not be base64 encoded on direct data plane.
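The preamble layout above maps directly onto a fixed little-endian struct. The following sketch encodes and decodes one frame under that layout; `encode_frame` and `decode_frame` are hypothetical helper names, the meaning of individual flag bits is not defined in this document and is treated as opaque, and the strict length check is an assumption about how a receiver might validate input.

```python
import json
import struct

MAGIC = b"RAP2"
# magic (4s), protocol_version (uint16), flags (uint16),
# header_length (uint32), payload_length (uint32): 16 bytes, little-endian.
PREAMBLE = struct.Struct("<4sHHII")


def encode_frame(header: dict, payload: bytes, flags: int = 0, version: int = 1) -> bytes:
    """Serialize preamble + UTF-8 JSON header + raw payload bytes."""
    header = dict(header, payload_length=len(payload))
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    preamble = PREAMBLE.pack(MAGIC, version, flags, len(header_bytes), len(payload))
    return preamble + header_bytes + payload


def decode_frame(frame: bytes):
    """Parse one frame; raise ValueError on any structural mismatch."""
    if len(frame) < PREAMBLE.size:
        raise ValueError("frame shorter than preamble")
    magic, version, flags, header_len, payload_len = PREAMBLE.unpack_from(frame)
    if magic != MAGIC:
        raise ValueError("bad magic")
    if version != 1:
        raise ValueError("unsupported protocol_version")
    if len(frame) != PREAMBLE.size + header_len + payload_len:
        raise ValueError("length fields do not match frame size")
    header_end = PREAMBLE.size + header_len
    header = json.loads(frame[PREAMBLE.size:header_end].decode("utf-8"))
    if header.get("payload_length") != payload_len:
        raise ValueError("header payload_length disagrees with preamble")
    return header, flags, frame[header_end:]
```

Checking the lengths before parsing the JSON header keeps a malformed or truncated frame from being partially interpreted.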

Suggested render message types:

  • render.frame.full
  • render.frame.region
  • render.cursor
  • render.resize
  • render.quality.changed

Suggested flags:

  • keyframe
  • droppable
  • latest_only
  • compressed
  • interactive
  • grayscale

8. Quality Profile Foundation

DP-1A defines quality profiles only. It does not implement adaptive rendering.

Profiles:

  • emergency_grayscale
  • low_bandwidth
  • text_priority
  • balanced
  • high_quality

Color modes:

  • full_color
  • 256_colors
  • 64_colors
  • 16_colors
  • grayscale

Rules:

  • quality profile must affect real render behavior in later stages
  • input priority remains absolute
  • render quality must degrade before input latency increases
  • lower profiles may reduce FPS, color depth, region size, or compression settings
  • higher profiles may increase FPS and color fidelity only when queues remain healthy
  • profile selection must be policy-aware and observable

9. Security Model

DP-1 security boundaries:

  • backend authorizes session access
  • backend issues short-lived data-plane token
  • worker validates token before accepting direct WSS
  • worker binds token to existing session runtime
  • worker enforces channel permissions
  • worker rejects mismatched session, attachment, organization, resource, worker, or expired token
  • backend gateway fallback keeps existing auth path

Transport:

  • direct worker WSS must use TLS
  • future node-to-node traffic uses mTLS as defined in the Secure Access Fabric target
  • DP-1 direct WSS may start with worker server TLS plus signed token validation
  • P3.2 direct worker WSS trust metadata distinguishes smoke_insecure, public_ca, and platform_ca
  • production backend must not advertise smoke-only direct candidates
  • production clients must not use insecure TLS bypass and must fall back to backend gateway if direct worker trust is unavailable
  • production deployments should avoid long-lived static worker secrets

Audit:

  • backend audits token issuance
  • backend audits session lifecycle
  • worker should report direct attach/detach/failure events back to control plane
  • direct data-plane traffic does not require auditing every input/render event
  • high-risk events such as takeover, failed token validation, policy denial, and file transfer should be auditable

10. Fallback Backend Gateway Path

The current backend WebSocket gateway remains:

  • fallback path
  • debug path
  • compatibility path for older clients
  • smoke-test path while DP-1 is staged

Fallback activation cases:

  • no direct worker candidate returned
  • direct WSS connect fails
  • token validation fails due to stale route
  • worker endpoint unavailable
  • policy forces backend gateway
  • client version does not support direct WSS

Fallback rules:

  • fallback must preserve existing lifecycle behavior
  • fallback must not silently weaken policy
  • fallback should be visible in logs/telemetry
  • fallback should be measurable against direct path latency

11. Migration Stages

Stage DP-1A: Spec Only

Create architecture/spec documentation.

No runtime behavior changes.

Stage DP-1B: Backend Offers Data Plane Candidates

Status: completed.

Backend extends session start/attach/takeover responses with data_plane.

Client still uses fallback backend gateway.

Implementation status:

  • backend response DTO can include optional gateway_url and data_plane
  • data_plane.token is a short-lived signed token with session, attachment, user, organization, worker, resource, allowed-channel, expiry, and jti scope
  • backend_gateway candidate is always returned when configured
  • direct_worker_wss candidate is returned only when a direct worker WSS URL template is configured
  • current clients may ignore data_plane safely
  • no worker direct WSS runtime is implemented in this stage
  • no client routing behavior changes in this stage

Verification:

  • old clients still work
  • responses include valid candidate shape
  • token is short-lived
  • token is scoped

Stage DP-1C: Worker Direct WSS Endpoint

Status: completed.

Worker exposes direct WSS endpoint and validates data_plane_token.

Windows client still uses fallback backend gateway.

Implementation status:

  • worker has optional /rap/v1/data-plane WSS endpoint
  • endpoint is disabled by default and requires TLS certificate/key paths
  • worker validates signed RS256 data_plane_token with a public key only
  • worker keeps no data-plane signing secret
  • worker rejects reused jti values with a bounded in-memory TTL cache
  • token validation checks session, attachment, user, organization, worker, resource, allowed channels, expiry, audience, and jti
  • endpoint binds only to existing SessionRuntime
  • bind checks reject old attachment after takeover, wrong attachment, wrong worker, wrong organization, wrong resource, missing runtime, failed/terminated runtime state, and channels broader than runtime policy
  • invalid token, wrong worker, expired token, replayed jti, and missing runtime are rejected
  • rdp-worker-dataplane-token-probe validates token behavior in the worker image
  • rdp-worker-dataplane-bind-probe validates attachment/state/channel bind policy without starting RDP
  • backend gateway remains active fallback
  • no Windows client routing change is included in this stage

Verification:

  • token validation works
  • runtime binding rejects missing runtime without creating a new RDP session
  • replayed jti values are rejected
  • wrong attachment and over-broad channels are rejected
  • no new RDP session is created
  • invalid tokens are rejected
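The bounded in-memory TTL cache used for `jti` replay rejection could look roughly like the sketch below. It is an illustration under assumptions: `JtiReplayCache` and `register` are hypothetical names, the capacity is arbitrary, and the choice to fail closed when full is one possible policy, not a documented worker behavior.

```python
import time


class JtiReplayCache:
    """Remembers each jti until its token would have expired anyway."""

    def __init__(self, max_entries: int = 10000, clock=time.monotonic):
        self._seen = {}  # jti -> time after which the entry can be forgotten
        self._max_entries = max_entries
        self._clock = clock

    def register(self, jti: str, ttl_seconds: float) -> bool:
        """Return True on first use of a jti, False on replay or when full."""
        now = self._clock()
        # Forget entries whose tokens are past expiry; expiry validation
        # rejects those tokens regardless, so they need no replay entry.
        for key in [k for k, forget_at in self._seen.items() if forget_at <= now]:
            del self._seen[key]
        if jti in self._seen:
            return False
        if len(self._seen) >= self._max_entries:
            # Fail closed: evicting a live entry would reopen a replay window.
            return False
        self._seen[jti] = now + ttl_seconds
        return True
```

Because token TTLs are short, the cache stays small in normal operation; the bound mainly protects the worker from a flood of distinct valid-looking tokens.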

Stage DP-1D: Windows Client Prefers Direct WSS

Status: completed as hardened client transport selection.

The Windows client uses direct worker WSS only when the candidate is explicitly marked data-capable. The current DP-1C worker endpoint validates and binds but does not yet carry production render/input traffic, so unmarked candidates fall back to the backend gateway immediately.

Implementation status:

  • Windows session DTOs understand optional data_plane offers and candidates
  • transport selection remains behind ISessionGatewayClient
  • direct worker WSS candidates are considered only when metadata contains runtime_transport=json_v1 or traffic_ready=true
  • direct WSS attach attempts use a short bounded timeout and never block the UI
  • failed/unavailable/not-ready direct path automatically uses backend gateway
  • existing backend gateway behavior remains unchanged
  • no worker runtime changes are included in DP-1D
  • no binary render frames or mesh/relay/VPN behavior is included

Verification:

  • Windows client build succeeds
  • fallback works and remains the default runtime path for current DP-1C endpoint
  • direct candidate selection is capability-gated to avoid losing render/input
  • lifecycle behavior remains stable
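The capability-gated selection with automatic fallback can be sketched as below. The real client is a Windows application behind ISessionGatewayClient; this Python version only illustrates the decision order. `select_transport`, `is_traffic_ready`, the `connect` callback, the timeout defaults, and the acceptance of both boolean and string metadata values are assumptions.

```python
def is_traffic_ready(candidate: dict) -> bool:
    """A direct candidate is usable only when its metadata marks it data-capable."""
    meta = candidate.get("metadata") or {}
    return (
        meta.get("runtime_transport") == "json_v1"
        or meta.get("traffic_ready") in (True, "true")
    )


def select_transport(offer: dict, connect, prefer_direct: bool = True,
                     direct_timeout_s: float = 2.5, gateway_timeout_s: float = 10.0):
    """Try direct worker WSS when allowed, otherwise use the backend gateway.

    connect(url, timeout_s) is a caller-supplied function that returns a
    connection object or raises OSError (TimeoutError is a subclass).
    Returns (selected_type, connection, log_lines).
    """
    log = []
    candidates = offer.get("candidates", [])
    direct = next((c for c in candidates if c.get("type") == "direct_worker_wss"), None)
    gateway = next((c for c in candidates if c.get("type") == "backend_gateway"), None)

    if prefer_direct and direct is not None:
        if not is_traffic_ready(direct):
            log.append("direct_worker_wss unavailable_or_not_runtime_ready; using backend_gateway")
        else:
            try:
                return "direct_worker_wss", connect(direct["url"], direct_timeout_s), log
            except OSError:
                log.append("direct_worker_wss failed; falling back to backend_gateway")

    if gateway is None:
        raise LookupError("no backend_gateway candidate to fall back to")
    return "backend_gateway", connect(gateway["url"], gateway_timeout_s), log
```

The log lines mirror the wording of the client log messages quoted in this document so that fallback stays visible; in a UI client the `connect` call would run off the UI thread.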

Stage DP-1D.1: Worker Direct JSON Realtime Bridge

Status: runtime-proven on the test Docker environment as of 2026-04-25.

Worker direct WSS now carries the same JSON realtime envelopes already used by the backend gateway. This is intentionally a bridge stage, not the final production data-plane protocol.

Implementation status:

  • worker direct WSS accepts existing JSON input, control, clipboard, and file_upload envelopes
  • worker direct WSS emits existing JSON session.state, session.frame, session.taken_over, clipboard.text, and file_upload.progress events
  • direct WSS binds only to an existing SessionRuntime; it never creates a new RDP runtime
  • direct inbound envelopes are bounded and drained before Redis fallback input
  • mouse move can be coalesced, but click, wheel, keyboard, clipboard, and file upload envelopes remain reliable within bounded queues
  • direct render is latest-frame-only and droppable in the worker WSS writer
  • direct inbound envelopes are tagged with token-bound session, attachment, user, organization, worker, and resource claims before they enter runtime
  • runtime rejects direct envelopes whose attachment_id no longer matches the current active controller attachment
  • takeover updates emit session.taken_over to the previous direct attachment while normal frame/state events continue only to the current attachment
  • backend advertises direct metadata only when DATA_PLANE_DIRECT_WORKER_JSON_RUNTIME=true
  • backend gateway fallback remains active and unchanged
  • Windows client behavior remains gated by DP-1D metadata selection

Verification performed:

  • backend go test ./... passes
  • Windows client build passes with no routing behavior change required
  • worker canonical Docker image builds with the direct JSON bridge
  • DP-1C endpoint smoke still proves malformed token rejection, valid-token-without-runtime rejection, and jti replay rejection
  • backend tests prove runtime_transport=json_v1 and traffic_ready=true are emitted only when the explicit runtime flag is enabled
  • live runtime proof was run on test Docker 192.168.200.61 with DATA_PLANE_DIRECT_WORKER_JSON_RUNTIME=true
  • backend session start returns direct_worker_wss candidate metadata: runtime_transport=json_v1 and traffic_ready=true
  • Windows desktop smoke selected direct_worker_wss and connected to wss://192.168.200.61:18443/rap/v1/data-plane
  • worker direct WSS validated the token and bound to the existing runtime
  • direct WSS accepted input envelopes and applied mouse/keyboard events through FreeRDP
  • direct WSS emitted JSON render/state events and the Windows client rendered a real desktop frame
  • direct WSS carried text clipboard client-to-server through the existing clipboard envelope and worker policy/cliprdr boundary
  • direct WSS carried chunked file upload through existing file_upload.start / file_upload.chunk envelopes and emitted file_upload.progress
  • fallback was proven by advertising an unavailable direct worker URL; the Windows client timed out direct WSS and selected backend_gateway
  • detach, reattach, takeover, session.taken_over, input, and render remained stable in direct and fallback smoke runs
  • no new RDP runtime was created by direct WSS attach; worker logs showed a single "started new runtime" entry for the session, followed by "updated assignment for existing session" entries on reattach/takeover

Known limitations after DP-1D.1:

  • direct render still uses JSON/base64 full-frame payloads; binary render frames remain DP-2
  • direct server-to-client clipboard was not re-matrixed in this DP proof because Stage 4.1 already proved FreeRDP cliprdr behavior; DP-1D.1 proved that the direct bridge carries clipboard envelopes and preserves worker enforcement
  • file upload direct proof lands in the existing restricted worker visible transfer directory; broader file-transfer UX remains outside DP-1D.1
  • the Windows smoke script reports rendering=false when compact layout hides telemetry controls, even though frame receipt/rendering is proven by logs and UIA event text

Stage DP-1E: Latency Comparison

Status: measurement-complete on the test Docker environment as of 2026-04-25.

Compare direct path vs fallback before starting DP-2 binary render frames.

Metrics:

  • input capture to worker apply
  • worker frame capture to client render
  • frame queue length
  • dropped stale frames
  • close/dispose latency
  • fallback activation count

Smoke commands used:

pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
  -PreferDirectDataPlane:$true `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$true `
  -DirectDataPlaneConnectTimeoutMs 2500 `
  -SkipOrgSwitchAndTokenRefresh

pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 `
  -PreferDirectDataPlane:$false `
  -AllowInsecureDirectDataPlaneTlsForSmoke:$true `
  -DirectDataPlaneConnectTimeoutMs 750 `
  -SkipOrgSwitchAndTokenRefresh

Measured sessions:

  • direct worker WSS: 59af4b37-3708-4cff-8e9d-054869946250
  • backend gateway fallback baseline: 673b7540-6276-4d73-824b-e5b2ea96182a
  • additional fallback-activation proof: direct candidate unavailable/not-ready logs on 8d89dd5c-fb14-4f70-a4e4-01ebb2a37da4 and 673b7540-6276-4d73-824b-e5b2ea96182a

Verification summary:

  • direct smoke passed login, resource list, start, input, detach, reattach, takeover, session.taken_over, and logout
  • fallback smoke passed login, resource list, start, input, detach, reattach, takeover, session.taken_over, and logout
  • smoke rendering=false is a compact-layout harness artifact; session event log contained Desktop frame received and client logs contained SessionWindow rendered frame
  • cleanup probe against /api/v1/sessions/active returned 404 because that endpoint does not exist; the implemented list endpoint is /api/v1/sessions?user_id=...
  • Redis worker queues were empty after the measured runs: worker:queue:59af... = 0, worker:queue:673b... = 0

Latency matrix:

| Metric | Direct worker WSS | Backend gateway fallback |
| --- | --- | --- |
| Client transport selection | selected=direct_worker_wss in desktop logs | selected=backend_gateway in desktop logs |
| Client capture/send to worker apply | direct smoke retained worker-side receive/apply timestamps; client capture timestamp was not retained in the compact smoke log | sampled fallback activation: about 205ms from WPF capture to worker apply for mouse down |
| Backend gateway input hop | bypassed for direct realtime input | backend receive to route typically <1ms |
| Worker receive to FreeRDP apply, mouse down | 0ms to 24ms observed | 0ms to 29ms observed |
| Worker receive to FreeRDP apply, mouse up | 25ms to 26ms observed | 28ms to 29ms observed |
| Worker receive to FreeRDP apply, key down | about 25ms observed | about 33ms observed |
| Worker receive to FreeRDP apply, key up | about 26ms observed | about 30ms observed |
| Backend route to worker receive | not applicable | about 31ms for key down, about 0ms to 31ms for sampled mouse/key events |
| FreeRDP apply to next captured frame | 0ms to 40ms observed | 0ms to 43ms observed |
| Worker frame capture to backend receive | backend still receives worker frame telemetry; observed same-second receive | same-second receive |
| Backend frame receive to client write | not on direct render path | sampled 486ms and 753ms on full-frame JSON/base64 gateway writes |
| Client render proof | session.frame received and frame rendered in direct smoke | SessionWindow rendered frame seq=19 size=1280x720 |
| SessionGatewayConnection dispose | about 1ms in sampled close traces | about 1ms in sampled close traces |
| SessionWindow closed handler | below 1ms in sampled close traces | below 1ms in sampled close traces |

Queue and backpressure observations:

  • direct inbound drained bounded batches before fallback Redis input
  • direct mouse move coalescing was active while preserving click/key ordering
  • direct outbound reported frames_queued_per_second matching frames_sent_per_second; reliable_dropped=0
  • worker render pending remained 0 for both paths
  • fallback Redis append queue length stayed bounded in sampled logs, usually 1 to 3, and returned to 0 after the run

Render observations:

  • direct render is already latest-frame-only/droppable at the worker WSS writer
  • worker render rates during interaction were approximately:
    • direct: ~3.0 to ~5.7 frames/sec sent/published, pending 0
    • fallback: ~2.0 to ~5.0 frames/sec published, pending 0
  • current frames are still JSON/base64 full-frame payloads
  • measured frame payload size remains about 4,915,200 bytes per JSON/base64 frame, so DP-1D.1 improves routing but does not remove the render payload bottleneck

Fallback activation proof:

  • fallback was explicitly selected when the client was configured with PreferDirectDataPlane=false
  • fallback was also visible when direct WSS was unavailable or not runtime ready, with client logs:
    • data_plane.transport direct_worker_wss failed; falling back to backend_gateway
    • data_plane.transport direct_worker_wss unavailable_or_not_runtime_ready; using backend_gateway
    • data_plane.transport selected=backend_gateway

DP-1E conclusion:

  • direct worker WSS removes backend/Redis from the realtime input path
  • fallback backend gateway remains functional and observable
  • neither path showed unbounded input queue growth during smoke
  • close/dispose traces remained fast in sampled logs
  • the dominant remaining bottleneck is render payload format and size, not worker input scheduling
  • DP-2 should focus on binary render frames and avoiding base64/JSON render payloads on the direct data plane

Stage DP-2: Binary Render Frames

Status: implemented and smoke-proven on the test Docker environment as of 2026-04-25.

Direct worker WSS now sends render payloads as binary WebSocket frames when the backend candidate metadata advertises render_transport=binary_v1 and the Windows client requests that transport. Backend gateway fallback continues to use the existing JSON/base64 frame path.

Goals:

  • remove base64 overhead from the direct worker WSS wire path
  • reduce direct render payload size
  • keep backend gateway JSON/base64 fallback intact
  • keep direct render latest-frame-only and droppable
  • keep input/control ahead of render

Implementation notes:

  • Backend advertises binary direct render only when DATA_PLANE_DIRECT_WORKER_BINARY_RENDER=true.
  • Direct candidate metadata includes runtime_transport=json_v1, traffic_ready=true, and render_transport=binary_v1.
  • Worker direct WSS accepts existing JSON envelopes for control/input/clipboard/file_upload and emits binary WebSocket frames for session.frame.
  • Windows client enables binary parsing only for direct candidates that advertise render_transport=binary_v1 or binary_render=true.
  • Backend gateway fallback remains unchanged and continues to deliver session.frame as JSON/base64.

Smoke proof:

  • direct session id: 824c0057-c8a0-4366-b5c2-805597ae2d61
  • fallback session id: 28e4b198-2c27-4971-951a-7b187c11f96d
  • direct client selected direct_worker_wss with render_transport=binary_v1
  • direct worker bind succeeded with render_transport=binary_v1
  • client received binary frames with raw payload size 3,686,400 bytes
  • client rendered binary frames, including frame sequences 1, 2, 4, 7, 9, 12, 14, 15, 17, 18, and 19
  • fallback client selected backend_gateway
  • fallback rendered JSON/base64 frames through the existing backend gateway path

Payload comparison:

  • DP-1E JSON/base64 frame payload: about 4,915,200 bytes for 1280x720 BGRA
  • DP-2 direct binary frame payload: 3,686,400 bytes for the same 1280x720 BGRA frame, plus a small binary preamble and JSON header
  • Direct wire payload reduction is about 25 percent compared with base64.
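The payload figures above follow from the frame geometry and the 4:3 expansion of base64. A short worked check, with hypothetical helper names:

```python
import base64


def raw_frame_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of an uncompressed frame; BGRA uses 4 bytes per pixel."""
    return width * height * bytes_per_pixel


def base64_encoded_bytes(raw_len: int) -> int:
    """Length of padded base64 output: 4 characters per 3 input bytes."""
    return 4 * ((raw_len + 2) // 3)


raw = raw_frame_bytes(1280, 720)      # 3,686,400 bytes
encoded = base64_encoded_bytes(raw)   # 4,915,200 bytes
reduction = 1 - raw / encoded         # 0.25, i.e. about 25 percent
```

The 25 percent figure covers only the frame bytes; the JSON envelope around a base64 frame and the RAP2 preamble/header around a binary frame add a comparatively small amount on each side.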

Latency and queue observations from smoke:

  • direct click frame render sample: worker captured frame at 1777141091937, WPF rendered it at 2026-04-25T21:18:11.6628382+03:00, about 226 ms later
  • direct key-down frame render sample: worker captured frame at 1777141093434, WPF rendered it at 2026-04-25T21:18:13.1614990+03:00, about 727 ms later
  • direct worker render rate sample: seen_per_second=4.953283, published_per_second=3.962626, dropped_per_second=0.990657, pending=0
  • direct data-plane outbound sample: frames_queued_per_second=5.404927, frames_sent_per_second=5.404927, binary_render_bytes_per_second=19926577.806299, json_render_bytes_per_second=0.000000, reliable_dropped=0
  • fallback worker render rate sample: seen_per_second=4.871576, published_per_second=3.897260, dropped_per_second=0.974315, pending=0
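As a consistency check on the direct outbound sample, dividing the byte rate by the frame rate gives the mean size of one frame on the wire. The sketch assumes that `binary_render_bytes_per_second` counts complete RAP2 frames (preamble, JSON header, and payload); that interpretation is not confirmed by this document.

```python
RAW_FRAME_BYTES = 1280 * 720 * 4  # 3,686,400 bytes of BGRA pixels


def mean_bytes_per_frame(bytes_per_second: float, frames_per_second: float) -> float:
    """Average wire size of one frame over the sampling window."""
    if frames_per_second <= 0:
        raise ValueError("frames_per_second must be positive")
    return bytes_per_second / frames_per_second


# Values copied from the direct data-plane outbound sample.
mean_frame = mean_bytes_per_frame(19926577.806299, 5.404927)
overhead = mean_frame - RAW_FRAME_BYTES  # a few hundred bytes per frame
```

Under that assumption the per-frame overhead comes out at a few hundred bytes, which is in line with a 16-byte preamble plus a compact JSON header and negligible next to the 3,686,400-byte payload.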

Known limitations:

  • DP-2.1 removed the internal base64 encode/decode hop from the direct render path. The direct worker WSS sender now receives raw captured frame bytes and writes them into RAP2 binary frames without decoding a compatibility session_frame.
  • The worker still builds compatibility session_frame events with base64 for backend gateway/live-state fallback. That compatibility conversion is intentionally isolated to the fallback boundary and is not used by the direct binary render sink.
  • Backend still receives compatibility worker frame events for fallback/debug. Binary render frames are not routed through Redis or backend gateway.
  • At the DP-2.1 point, dirty regions, tile encoding, adaptive quality, compression/codecs, and color-mode reduction remained later work.
  • Smoke rendering=false remains a compact-layout harness artifact; UIA output and client logs prove Desktop frame received and SessionWindow rendered frame.

Stage DP-2.1: Worker Raw-Frame Split

Status: implemented and smoke-proven on the test Docker environment as of 2026-04-25.

DP-2.1 keeps the DP-2 RAP2 binary frame contract and removes the remaining worker-internal base64 encode/decode hop from the direct render path.

Implementation notes:

  • FreeRDP frame capture now produces raw BGRA frame bytes for worker runtime render notifications.
  • SessionRuntime splits render publication into two outputs:
    • direct binary render sink receives raw frame bytes
    • compatibility fallback sink builds JSON/base64 session_frame only for backend gateway/live-state fallback
  • Worker direct WSS sends raw captured frame bytes as RAP2 binary WebSocket frames when render_transport=binary_v1 is active.
  • Backend gateway fallback remains unchanged and still receives JSON/base64 session.frame compatibility events.
  • Direct render remains latest-frame-only and droppable; input/control scheduling is unchanged.
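
The split described above can be sketched as follows. This is an illustrative Python model, not the worker's actual code: the names RawFrame, LatestFrameSlot, build_compat_session_frame, and publish_render are hypothetical, and the event field names are assumptions. It shows the two properties the notes call out: the direct sink keeps only the newest raw frame, and base64 is produced only at the fallback boundary.

```python
import base64
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawFrame:
    sequence: int
    width: int
    height: int
    data: bytes  # raw BGRA32 pixels as captured


@dataclass
class LatestFrameSlot:
    """Direct sink: holds only the newest frame; an unsent older frame is dropped."""
    frame: Optional[RawFrame] = None
    dropped: int = 0

    def publish(self, frame: RawFrame) -> None:
        if self.frame is not None:
            self.dropped += 1  # stale frame replaced before it was sent
        self.frame = frame

    def take(self) -> Optional[RawFrame]:
        frame, self.frame = self.frame, None
        return frame


def build_compat_session_frame(frame: RawFrame) -> dict:
    """Fallback-only conversion: JSON-friendly event carrying base64 pixels."""
    return {
        "type": "session_frame",
        "sequence": frame.sequence,
        "width": frame.width,
        "height": frame.height,
        "frame_data": base64.b64encode(frame.data).decode("ascii"),
    }


def publish_render(frame, direct_slot, fallback_events, direct_active, fallback_active):
    """Route one captured frame to the direct and/or fallback outputs."""
    if direct_active:
        direct_slot.publish(frame)  # raw bytes, no base64 round trip
    if fallback_active:
        fallback_events.append(build_compat_session_frame(frame))
```

Because the slot replaces rather than queues, a slow direct connection costs dropped frames instead of a growing backlog, which matches the droppable render semantics above.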

Smoke proof:

  • direct session id: b4720057-db61-4c72-bb4c-bccfd7e30008
  • fallback session id: 65d0667b-aaef-4042-ae30-4c34d151e5aa
  • direct client selected direct_worker_wss with render_transport=binary_v1
  • fallback client selected backend_gateway
  • direct client received binary_frame_received payloads of 3,686,400 bytes for 1280x720 BGRA
  • direct client rendered frame sequences including 1, 2, 4, 7, 9, 13, 14, 15, 16, 18, 19, and 20
  • fallback client rendered JSON/base64 session.frame through backend gateway
  • worker logs show raw_frame_bytes=3686400, binary_direct_bytes=3686400, base64_compat_bytes=4915200, encode_skipped_for_direct=true, and fallback_compat_frame_built=true
  • worker direct outbound logs show binary_render_bytes_per_second non-zero and json_render_bytes_per_second=0.000000

Payload and conversion proof:

  • direct raw frame payload remains 3,686,400 bytes plus the small RAP2 preamble/header
  • fallback compatibility payload remains about 4,915,200 base64 bytes for the same frame
  • direct render no longer decodes frame_data from compatibility base64 before binary send
  • base64 is still generated for fallback/debug because the backend gateway path intentionally remains JSON/base64

Known limitations:

  • DP-2.1 is an internal worker render plumbing cleanup only.
  • Full-frame BGRA payloads are still heavy.
  • As of DP-2.1, dirty regions, tiles, adaptive quality, compression/codecs, and color-mode reduction remained future work.
  • Backend gateway fallback remains JSON/base64 by design.

Stage DP-3A: Grayscale Full-Frame Binary Render

Status: implemented and smoke-proven on the test Docker environment as of 2026-04-25.

DP-3A adds the first conservative quality foundation for the direct binary render path. DP-3A itself did not implement tiles, compression, codecs, or adaptive profile switching. Dirty-region direct binary rendering is handled by the later RDP Adapter RDP-Perf-6 path.

Contract changes:

  • backend direct binary render candidates advertise render_transport=binary_v1
  • backend direct binary render candidates advertise supported_color_modes=["full_color","grayscale"]
  • backend direct binary render candidates advertise default_color_mode="full_color"
  • Windows client requests full_color by default
  • Windows smoke can request grayscale through -DirectDataPlaneColorMode grayscale
  • RAP2 binary frame headers carry color_mode, quality_profile, original_frame_format, output_frame_format, raw_frame_bytes, and binary_direct_bytes

Implementation notes:

  • full_color direct render sends the existing raw BGRA frame unchanged.
  • grayscale direct render converts BGRA bytes in the worker direct binary sink before WSS send.
  • The grayscale path preserves BGRA32 output format so the Windows presenter can reuse the existing render path.
  • Backend gateway fallback remains JSON/base64 and is not affected by direct grayscale mode.
  • Direct render remains latest-frame-only and droppable.
  • Input/control scheduling is unchanged and remains higher priority than render.
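
A minimal sketch of a grayscale conversion that preserves the BGRA32 layout, so the output has the same byte count as the input. The luma weights used here (an integer BT.601-style approximation) are an assumption; the document does not state which formula the worker applies. The function name to_grayscale_bgra is hypothetical.

```python
def to_grayscale_bgra(frame: bytes) -> bytes:
    """Convert BGRA32 pixels to gray while keeping the BGRA32 layout."""
    if len(frame) % 4 != 0:
        raise ValueError("BGRA32 frame length must be a multiple of 4")
    out = bytearray(frame)
    for i in range(0, len(out), 4):
        b, g, r = out[i], out[i + 1], out[i + 2]
        # Assumed weights: (77*R + 150*G + 29*B) / 256, which sum to 256.
        y = (77 * r + 150 * g + 29 * b) >> 8
        out[i] = out[i + 1] = out[i + 2] = y
        # out[i + 3] (alpha) is left unchanged.
    return bytes(out)
```

Writing the same luma value into the B, G, and R bytes is what keeps the presenter path unchanged, and it is also why this mode does not shrink the wire payload.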

Smoke proof:

  • direct full-color session id: 74a0e5c6-02e0-487f-a1a1-c2850a13881c
  • direct grayscale session id: 3b616bd7-1179-4ec5-879f-7cd270f92a0a
  • fallback backend-gateway session id: e5724cac-7f09-4931-9ad9-156a3f33d0b1
  • direct full-color client selected direct_worker_wss with render_transport=binary_v1, requested_color_mode=full_color, and applied_color_mode=full_color
  • direct grayscale client selected direct_worker_wss with render_transport=binary_v1, requested_color_mode=grayscale, and applied_color_mode=grayscale
  • fallback smoke selected backend_gateway and continued to render JSON/base64 session.frame events
  • direct full-color frames rendered with color_mode=full_color and bytes=3686400
  • direct grayscale frames rendered with color_mode=grayscale and bytes=3686400
  • worker logs show grayscale_conversion_applied=false for full color and grayscale_conversion_applied=true for grayscale
  • worker logs show raw_frame_bytes_before=3686400, raw_frame_bytes_after=3686400, and binary_direct_bytes=3686400
  • worker grayscale conversion time was observed around 1-2 ms per sampled 1280x720 BGRA frame
  • worker direct outbound logs show binary render traffic and json_render_bytes_per_second=0.000000 on the direct binary path

Verification commands:

pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 -PreferDirectDataPlane:$true -AllowInsecureDirectDataPlaneTlsForSmoke:$true -DirectDataPlaneConnectTimeoutMs 2500 -DirectDataPlaneColorMode full_color -SkipOrgSwitchAndTokenRefresh
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 -PreferDirectDataPlane:$true -AllowInsecureDirectDataPlaneTlsForSmoke:$true -DirectDataPlaneConnectTimeoutMs 2500 -DirectDataPlaneColorMode grayscale -SkipOrgSwitchAndTokenRefresh
pwsh -ExecutionPolicy Bypass -File scripts/windows-smoke/desktop-smoke.ps1 -PreferDirectDataPlane:$false -AllowInsecureDirectDataPlaneTlsForSmoke:$true -DirectDataPlaneConnectTimeoutMs 2500 -DirectDataPlaneColorMode grayscale -SkipOrgSwitchAndTokenRefresh

Known limitations after DP-3A:

  • grayscale currently reduces color fidelity but not wire byte size because the output format remains BGRA32.
  • 256_colors, 64_colors, 16_colors, and palette modes are not implemented.
  • Tiles, compression/codecs, and adaptive profile switching remain future work.
  • Backend gateway fallback remains JSON/base64 by design.
  • The smoke harness still reports rendering=false in some runs, which is a compact-layout harness artifact; the client log entries Desktop frame received and SessionWindow rendered frame prove that frames were rendered.

RDP-Perf-6: Direct Dirty-Region Binary Render Contract

Status: implemented and build/probe/live-smoke-proven on the test Docker environment as of 2026-04-26 using P3.3 Secret RDP Resource, direct worker WSS, and rap-rdp-worker:rdp-perf6-dirty-region.

RDP-Perf-6 keeps the existing RAP2 binary WebSocket transport and adds explicit direct render message types:

  • render.frame.full
  • render.frame.region

Compatibility:

  • Windows client direct transport still accepts the compatibility binary message_type=session.frame.
  • Inside the Windows application pipeline, direct binary frames are normalized back into the existing session.frame envelope so UI, lifecycle, input, clipboard, and file transfer behavior remain unchanged.
  • Backend gateway fallback remains JSON/base64 and is not removed.

Dirty-region frame metadata:

  • frame_width, frame_height, frame_stride, frame_format
  • desktop_width, desktop_height
  • region_x, region_y, region_width, region_height
  • region_stride, region_format=BGRA32
  • payload_length
  • input_correlation_id and worker_frame_captured_at when available

Diagnostics added for payload and latency analysis:

  • full_frame_sent
  • region_frame_sent
  • full_frame_bytes
  • region_bytes
  • region_savings_percent
  • diff_time_ms
  • render_update_reason
  • fallback_to_full_frame_reason

Implementation notes:

  • Worker direct WSS emits render.frame.full for baseline/recovery frames and render.frame.region for dirty-region patches.
  • Worker direct render logs include payload savings and diff/capture timing.
  • Windows direct transport accepts the explicit render message types.
  • Windows DesktopFramePresenter maintains a session framebuffer and patches BGRA32 region payloads into it before presenting the updated surface.
  • Full-frame fallback remains available for first frame, attach/reattach, resize, region-loss repair, and debug/fallback paths.
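
The framebuffer patching step can be sketched as a row-by-row copy. This is an illustrative Python model of the idea, not the DesktopFramePresenter implementation; the function name patch_region is hypothetical, and the parameter names mirror the dirty-region metadata fields listed earlier. Strides are in bytes, and pixels are assumed to be BGRA32.

```python
BYTES_PER_PIXEL = 4  # BGRA32


def patch_region(framebuffer: bytearray, frame_stride: int, frame_height: int,
                 region_x: int, region_y: int, region_width: int,
                 region_height: int, region_stride: int, payload: bytes) -> None:
    """Copy a dirty-region payload into the session framebuffer in place."""
    row_bytes = region_width * BYTES_PER_PIXEL
    if region_x < 0 or region_y < 0 or region_width <= 0 or region_height <= 0:
        raise ValueError("invalid region geometry")
    if len(framebuffer) < frame_stride * frame_height:
        raise ValueError("framebuffer smaller than declared frame")
    if region_x * BYTES_PER_PIXEL + row_bytes > frame_stride:
        raise ValueError("region exceeds frame width")
    if region_y + region_height > frame_height:
        raise ValueError("region exceeds frame height")
    if region_stride < row_bytes:
        raise ValueError("region stride shorter than one region row")
    if len(payload) < region_stride * (region_height - 1) + row_bytes:
        raise ValueError("payload shorter than declared region")

    for row in range(region_height):
        src = row * region_stride
        dst = (region_y + row) * frame_stride + region_x * BYTES_PER_PIXEL
        framebuffer[dst:dst + row_bytes] = payload[src:src + row_bytes]
```

Rejecting a region whose geometry does not fit the current framebuffer is one natural trigger for the full-frame repair path mentioned above.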

Observed runtime proof:

  • Direct transport selected direct_worker_wss with render_transport=binary_v1.
  • Baseline frame used render.frame.full, 1280x720, 3,686,400 bytes.
  • Dirty-region examples used render.frame.region: 64x64 = 16,384 bytes (99.56% savings), 1280x128 = 655,360 bytes (82.22% savings), and 640x64 = 163,840 bytes (95.56% savings).
  • Direct-only binary region frames logged fallback_compat_frame_built=false while backend gateway fallback compatibility remained available separately.
  • Input, detach, reattach, takeover, and takeover event handling remained smoke-proven in the same run.
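
The savings percentages above can be reproduced from the region geometry alone. A minimal sketch, assuming tightly packed BGRA32 payloads and ignoring RAP2 header overhead; the helper name region_savings_percent is borrowed from the diagnostic field, but the exact formula the worker logs is an assumption:

```python
def region_savings_percent(frame_width: int, frame_height: int,
                           region_width: int, region_height: int,
                           bytes_per_pixel: int = 4) -> float:
    """Payload saved by sending a region instead of the full frame, in percent."""
    full_frame_bytes = frame_width * frame_height * bytes_per_pixel
    region_bytes = region_width * region_height * bytes_per_pixel
    return round(100.0 * (1 - region_bytes / full_frame_bytes), 2)


for w, h in [(64, 64), (1280, 128), (640, 64)]:
    print(w, h, w * h * 4, region_savings_percent(1280, 720, w, h))
```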

Stage DP-3B: Adaptive Quality

Status: intentionally paused; see section 14.

DP-3B is planned to implement quality profiles and adaptive render behavior.

Goals:

  • lower latency under load
  • bounded queues
  • real profile behavior
  • color mode adaptation

12. Risks

Token Leakage

Risk:

  • a leaked direct worker token could be replayed by another client.

Mitigation:

  • short TTL
  • jti / nonce
  • worker-scoped audience
  • attachment/session binding
  • TLS
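
The claim checks behind these mitigations can be sketched as below. This is a hedged illustration, not the worker's validator: signature verification is assumed to have happened already, the claim names aud, exp, iat, and jti follow common JWT conventions, and session_id, attachment_id, the 60-second TTL ceiling, and the class name DirectTokenValidator are hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DirectTokenValidator:
    worker_id: str
    max_ttl_seconds: int = 60  # assumed ceiling for a "short TTL"
    seen_jti: set = field(default_factory=set)  # a real store would expire entries

    def validate(self, claims: dict, session_id: str, attachment_id: str,
                 now: Optional[float] = None) -> None:
        """Raise PermissionError unless the already-verified claims are acceptable."""
        now = time.time() if now is None else now
        if claims.get("aud") != self.worker_id:
            raise PermissionError("token audience is not this worker")
        exp, iat = claims.get("exp"), claims.get("iat")
        if not isinstance(exp, (int, float)) or not isinstance(iat, (int, float)):
            raise PermissionError("missing exp/iat")
        if now >= exp:
            raise PermissionError("token expired")
        if exp - iat > self.max_ttl_seconds:
            raise PermissionError("token lifetime exceeds allowed TTL")
        if claims.get("session_id") != session_id:
            raise PermissionError("token not bound to this session")
        if claims.get("attachment_id") != attachment_id:
            raise PermissionError("token not bound to this attachment")
        jti = claims.get("jti")
        if not jti or jti in self.seen_jti:
            raise PermissionError("missing or replayed jti")
        self.seen_jti.add(jti)  # single use: a second presentation is rejected
```

Recording the jti only after every other check passes means a rejected token does not consume the nonce.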

Worker Endpoint Exposure

Risk:

  • worker direct endpoint becomes an attack surface.

Mitigation:

  • token validation before bind
  • rate limits
  • TLS
  • no unauthenticated session enumeration
  • minimal endpoint surface

Policy Drift

Risk:

  • backend and worker disagree on allowed channels.

Mitigation:

  • token claims include allowed channels
  • worker receives policy snapshot in assignment
  • worker enforces policy again
  • policy changes trigger session update or reconnect where required

Fallback Masking Production Problems

Risk:

  • clients silently fall back and hide direct data-plane failure.

Mitigation:

  • log fallback reason
  • expose telemetry
  • smoke tests verify both direct and fallback paths

Render Still Too Heavy

Risk:

  • direct WSS improves routing but full-frame render remains expensive.

Mitigation:

  • DP-2 binary frames
  • DP-3 adaptive quality
  • dirty regions / tiles
  • latest-frame-only semantics

File Upload Starving Input

Risk:

  • reliable file chunks can fill send queues.

Mitigation:

  • channel priority
  • bounded file queues
  • chunk pacing
  • input preemption
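
One way to combine channel priority, bounded file queues, and input preemption on the sending side is a small priority scheduler. The sketch below is illustrative only: the channel names, priority values, queue bound, and the OutboundScheduler class are assumptions, and chunk pacing is left out.

```python
import heapq
from itertools import count

# Hypothetical priorities: lower value is sent first.
PRIORITY = {"input": 0, "control": 0, "clipboard": 1, "file": 2}


class OutboundScheduler:
    def __init__(self, max_file_chunks: int = 4):
        self._heap = []
        self._order = count()  # preserves FIFO order within one priority
        self._file_queued = 0
        self._max_file_chunks = max_file_chunks

    def offer(self, channel: str, payload: bytes) -> bool:
        """Queue a message; returns False when the file queue is full."""
        if channel == "file":
            if self._file_queued >= self._max_file_chunks:
                return False  # backpressure: the uploader retries later
            self._file_queued += 1
        heapq.heappush(self._heap, (PRIORITY[channel], next(self._order), channel, payload))
        return True

    def next_message(self):
        """Pop the highest-priority message, or None when idle."""
        if not self._heap:
            return None
        _, _, channel, payload = heapq.heappop(self._heap)
        if channel == "file":
            self._file_queued -= 1
        return channel, payload
```

With this shape, an input event queued behind pending file chunks is still the next message sent, and the uploader sees backpressure instead of filling the send queue.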

13. Future Verification Plan

Future DP-1 implementation must prove:

  • backend gateway fallback still works
  • direct worker WSS connects
  • token validation works
  • invalid/expired/wrong-worker tokens are rejected
  • direct WSS binds to existing session runtime
  • direct WSS does not recreate remote RDP session
  • input works over direct WSS
  • rendering works over direct WSS
  • clipboard still works
  • file upload still works
  • fallback activates if direct worker path is unavailable
  • input latency improves compared with fallback
  • render backlog does not grow
  • stale render frames are dropped
  • close/dispose is immediate
  • org/session/attachment/channel scope is enforced

14. Next Implementation Prompt

Data-plane and RDP work are paused by product decision.

DP-3B, Stage 5.2 remaining RDP desktop proof, and further RDP performance work must not start without a new explicit RDP/data-plane stage prompt.

The next active project work is Stage C10 in the lower Secure Access Fabric foundation:

Proceed with Stage C10 only.

Goal:
Consolidate Fabric Core architecture and prepare scoped cluster configuration
distribution design.

Scope:
- define signed scoped cluster snapshot model
- define node-local state boundaries
- define peer directory/cache boundaries
- define Fabric Storage / Config Storage role
- define source-of-truth vs distribution/cache boundaries
- define multi-cluster isolation boundaries
- define future implementation stages C11-C18

Do NOT:
- implement mesh runtime
- implement VPN
- implement RDP work
- implement service workloads
- change backend/runtime code

15. Non-Goals

DP-1 does not implement:

  • full mesh
  • VPN
  • QUIC
  • UDP transport
  • WebRTC
  • relay nodes
  • multi-cluster routing
  • adaptive quality beyond DP-3A grayscale full-frame foundation
  • binary render frames for fallback backend gateway
  • adaptive profile switching beyond DP-3A and dirty regions
  • removal of current backend WebSocket gateway
  • RDP MVP rewrite