Project Audit And Next-Step Plan

Date: 2026-04-26

Status: documentation/audit only. No runtime behavior is changed by this document.

1. Executive Summary

The project is no longer just an RDP proxy. The correct target is a Secure Access Fabric platform with a control plane, direct realtime data plane, service adapters, tenant isolation, and future node/mesh/VPN capabilities.

The implementation has reached a much more advanced state than several operational documents describe. The most important current risk is therefore not code quality alone but source-of-truth drift: old prompts and READMEs can send the next stage in the wrong direction.

The RDP MVP has proven the hard lifecycle assumptions:

  • real RDP connection through the worker works
  • active/detach/reattach/takeover/terminate flows are proven
  • takeover does not recreate the remote session
  • worker-death/orphan-active-session recovery is proven
  • Windows client can render and control a real remote desktop
  • direct worker WSS data plane is implemented and used
  • binary render frames are implemented on direct data plane
  • backend gateway JSON/base64 path remains available as fallback/debug
  • ordered dirty-region delivery is accepted as the current RDP baseline
  • text clipboard is implemented and accepted
  • client-to-server file upload to worker-controlled storage is accepted
  • restricted drive visibility is runtime-proven: uploaded files are visible and openable inside the remote Windows session through RAP_Transfers

The RDP adapter lesson is clear: "make it simple first and patch later" is dangerous for realtime protocols. Full-frame polling, implicit refresh after input, and backend/Redis realtime relaying worked for proof, but they caused the exact class of latency and correctness issues we later had to unwind. From this point forward, each service adapter must be specified as an event-driven adapter before implementation.

Recommended immediate priority:

  1. Freeze and document the current working baseline.
  2. Synchronize stale project docs with the real state.
  3. Preserve the accepted RDP visual correctness/stability baseline.
  4. Preserve the accepted Stage 5.1.1 restricted drive visibility behavior.
  5. Add automated regression gates so manual discoveries become repeatable tests.

2. Audit Method

This audit used the current filesystem state in:

\\192.168.220.200\mst\codex\rdp-proxy

Important environment note:

  • the directory is not currently a Git checkout (git status reports that no .git repository exists), so this audit cannot use commit history
  • the canonical test Docker host is docker-test / 192.168.200.61
  • the live test stack currently contains rap_backend_smoke, rap_worker_smoke, rap_postgres, and rap_redis

Commands run during this audit:

go test ./...
dotnet build .\clients\windows\RemoteAccessPlatform.Windows.slnx
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-graphics-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-cursor-adapter-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-service-adapter-protocol-probe
docker -H ssh://docker-test run --rm rap-rdp-worker:rdp-region-repair rdp-worker-dataplane-bind-probe --scenario valid

Results:

  • backend tests: PASS
  • Windows client build: PASS, 0 warnings, 0 errors
  • worker graphics adapter probe: PASS
  • worker cursor adapter probe: PASS
  • worker service adapter protocol probe: PASS
  • worker data-plane bind valid probe: PASS

Coverage warning:

  • most backend modules still report [no test files]
  • much of the current confidence comes from smoke/manual proofs and logs
  • this is not enough for production readiness
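
The coverage warning above can be turned into a number. The following is a minimal sketch, in Python rather than the project's Go, that classifies the per-package lines printed by `go test ./...`. The module path in `SAMPLE` and the helper names are hypothetical; only the `ok`, `FAIL`, and `? ... [no test files]` line shapes come from the Go toolchain.

```python
# Hypothetical sample of `go test ./...` output; the module path is invented.
SAMPLE = (
    "ok  \texample.test/backend/internal/modules/sessionbroker\t0.412s\n"
    "?   \texample.test/backend/internal/modules/auth\t[no test files]\n"
    "?   \texample.test/backend/internal/modules/resource\t[no test files]\n"
    "--- FAIL: TestSomething (0.00s)\n"
    "FAIL\texample.test/backend/internal/modules/worker\t0.050s\n"
    "FAIL\n"
)


def summarize_go_test(output: str) -> dict:
    """Group package names by the status column of each result line."""
    summary = {"ok": [], "fail": [], "no_tests": []}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # e.g. the bare trailing "FAIL" line
        status, package = fields[0], fields[1]
        if status == "ok":
            summary["ok"].append(package)
        elif status == "FAIL":
            summary["fail"].append(package)
        elif status == "?" and "[no test files]" in line:
            summary["no_tests"].append(package)
    return summary


def untested_ratio(summary: dict) -> float:
    """Fraction of listed packages that have no test files at all."""
    total = sum(len(packages) for packages in summary.values())
    return len(summary["no_tests"]) / total if total else 0.0
```

Tracking this ratio per audit would show whether the "[no test files]" share is actually shrinking between stages.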

3. Planned Direction

The authoritative long-term direction is:

  • CODEX_CONTEXT.md
  • docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md
  • docs/architecture/DATA_PLANE_V1.md
  • docs/architecture/SERVICE_ADAPTER_PROTOCOL.md
  • docs/architecture/RDP_ADAPTER_RUNTIME.md
  • docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md

The target platform model is:

Access Client
  -> Ingress / Data Plane
  -> Secure Fabric / Routing
  -> Service Adapter at egress edge
  -> Target service

For RDP specifically:

Access Client
  <-> platform session/data-plane protocol
RDP Adapter
  <-> FreeRDP / project-owned RDP internals
RDP Server

This naming should be kept consistent:

  • Access Client: native Windows/iOS/Android/Linux client that speaks the platform protocol
  • Control Plane: backend API, auth, orgs, policy, session lifecycle, audit
  • Data Plane: realtime session traffic channels
  • Service Adapter: protocol translator for RDP/VNC/SSH/video/etc
  • RDP Adapter: current C++ RDP service adapter
  • Entry/Ingress Node: accepts client connections into the fabric
  • Egress/Service Node: reaches target resources and hosts adapters
  • Node Agent: native host identity, update, health, and service supervisor

4. What Is Implemented

Backend

Implemented:

  • Go backend foundation
  • PostgreSQL source-of-truth storage
  • Redis live coordination/routing
  • auth foundation
  • refresh token rotation
  • devices/trusted devices
  • org-scoped resources and sessions
  • platform-core v2 foundation
  • identity source foundation
  • node/node-agent control-plane foundation
  • session broker orchestration
  • worker coordination and stale worker monitoring
  • structured localization-ready messages
  • resource certificate verification policy
  • clipboard policy
  • file-transfer policy
  • data-plane token/candidate generation
  • backend gateway fallback

Key files:

  • backend/internal/modules/sessionbroker/service.go
  • backend/internal/modules/sessionbroker/orchestration.go
  • backend/internal/modules/sessionbroker/state_machine.go
  • backend/internal/modules/sessionbroker/dataplane.go
  • backend/internal/modules/sessiongateway/module.go
  • backend/internal/modules/worker/monitor.go
  • backend/internal/modules/resource/module.go
  • backend/internal/modules/auth/service.go
  • backend/internal/platform/httpx/message.go
  • backend/migrations/000005_platform_core_v2.up.sql
  • backend/migrations/000007_clipboard_policy_mode.up.sql
  • backend/migrations/000008_file_transfer_policy_mode.up.sql

Known backend gaps:

  • automated test coverage is thin outside sessionbroker
  • P3/P3.1 resource secret-readiness and encrypted resolver MVP exists; production mode rejects plaintext credential metadata and requires secret_ref for RDP/VNC/SSH resources
  • external KMS/Vault integration and master-key rotation are not implemented yet
  • admin/control UI for safe resource/policy management is not the current focus
  • node-agent runtime is not implemented; only control-plane foundation exists
  • identity source sync runtime is not implemented

Windows Client

Implemented:

  • WPF client skeleton and build
  • auth/login/refresh/logout foundation
  • organization selection
  • resource list
  • active sessions
  • session window
  • direct data-plane selection with fallback
  • binary render receive path
  • input capture/forwarding
  • cursor/render display
  • localization-ready resource layer
  • text clipboard UI/path
  • file upload UI/path
  • failed-session refresh after gateway close

Key files:

  • clients/windows/src/RemoteAccessPlatform.Windows.App/SessionWindow.xaml
  • clients/windows/src/RemoteAccessPlatform.Windows.Application/ViewModels/SessionWindowViewModel.cs
  • clients/windows/src/RemoteAccessPlatform.Windows.Transport/SessionGatewayClient.cs
  • clients/windows/src/RemoteAccessPlatform.Windows.App/Input/SessionInputMapper.cs
  • clients/windows/src/RemoteAccessPlatform.Windows.Application/Localization/Strings.cs
  • clients/windows/src/RemoteAccessPlatform.Windows.Application/Resources/Strings.resx

Known client gaps:

  • final UX polish is not complete
  • automated client regression tests are missing
  • manual RDP UX remains the acceptance authority for now
  • some README limitations are stale and understate what exists

RDP Worker / RDP Adapter

Implemented:

  • standalone C++ worker service
  • FreeRDP integration behind worker boundary
  • worker registration/assignment/lease lifecycle
  • direct worker WSS endpoint
  • RS256 data-plane token validation
  • direct bind policy and current attachment validation
  • JSON control/input/clipboard/file-upload envelopes
  • binary RAP2 render frames for direct path
  • backend gateway JSON/base64 fallback
  • region-first BGRA render path
  • direct attach baseline full-frame repair
  • region-loss full-frame repair throttle
  • cursor adapter boundary
  • text clipboard through FreeRDP cliprdr
  • client-to-server file upload
  • restricted visible transfer directory
  • restricted FreeRDP drive redirection groundwork

Key files:

  • workers/rdp-worker/src/main.cpp
  • workers/rdp-worker/src/runtime/session_runtime.cpp
  • workers/rdp-worker/include/rdp_worker/runtime/session_runtime.hpp
  • workers/rdp-worker/src/adapter/rdp_adapter_runtime.cpp
  • workers/rdp-worker/src/freerdp/rdp_runtime.cpp
  • workers/rdp-worker/src/dataplane/direct_wss_server.cpp
  • workers/rdp-worker/src/runtime/direct_bind_policy.cpp
  • workers/rdp-worker/include/rdp_worker/adapter/service_adapter_protocol.hpp

Current live/smoke images:

rap-backend-smoke:stage5-2-download
rap-rdp-worker:stage5-2-download

Known worker/RDP gaps:

  • drag/release repaint is usable but not polished; drag behaves like an older RDP client on a weak link by moving a frame rather than continuously repainting the full window
  • RDPGFX is gated and disabled by default because the current live target resets the connection when RDPGFX is advertised
  • encoded graphics/codecs/tiles are not production-accepted yet
  • file download core data path is runtime-proven through direct worker WSS and backend gateway fallback, and lifecycle blocking is runtime-proven for detach, old-controller takeover, and worker failure. Stage 5.2 is not fully runtime-accepted until Windows desktop UI download is proven
  • FreeRDP is still the substrate; replacing it is not justified until the adapter boundary proves which pieces are actually insufficient

5. Plan vs Fact Matrix

| Area | Planned | Current fact | Status |
|---|---|---|---|
| Backend foundation | Go, config, HTTP, PostgreSQL, Redis | Implemented and builds | Done |
| Auth | access/refresh flow, sessions, devices | Implemented | Done |
| Session lifecycle | start/attach/detach/takeover/terminate/fail/recover | Live-proven earlier and preserved | Done, protect |
| Multi-tenancy | organizations and org-scoped resources/sessions | Implemented | Done, needs more tests |
| Authorization | platform/admin/member boundaries | Implemented foundation | Needs broader tests |
| Worker coordination | registration, lease, stale recovery | Implemented and live-proven | Done, protect |
| Windows client MVP | native WPF client | Implemented and builds | Done |
| Localization messaging | structured backend/client messaging | Implemented and runtime-proven earlier | Done, protect |
| Direct data plane | client-to-worker WSS | Implemented | Done |
| Binary render | direct binary render, fallback JSON/base64 | Implemented | Done |
| RDP adapter event model | event-driven adapter boundary | Implemented and P1 accepted | Done, protect |
| RDP render quality | grayscale foundation | Implemented | Partial |
| RDPGFX/encoded graphics | future performance path | gated only, not accepted | Not production |
| Clipboard | text-only, policy-gated | Accepted | Done |
| File upload | client-to-server to worker storage | Accepted | Done |
| File visibility in RDP | restricted drive redirection | Accepted via RAP_Transfers | Done, protect |
| File download | server-to-client | Core and lifecycle runtime-proven, desktop UI proof pending | Prove UI next |
| Mesh/VPN/multi-cluster runtime | target architecture only | Not implemented | Correctly deferred |
| Node-agent runtime/updater | target/foundation only | Not implemented | Future |
| Identity sync runtime | LDAP/OIDC sync | Not implemented | Future |

6. Important Source-Of-Truth Drift

At the start of this audit these files were stale or partly stale:

  • README.md still points to old compat-era docs and says not to start with UI, while the Windows client already exists
  • docs/codex/CURRENT_STATUS.md says WebSocket takeover proof is still a gap, even though that proof was later closed
  • docs/codex/NEXT_STEP_PROMPT.md previously pointed to platform-core v2 as the next step, although platform-core v2 already exists
  • clients/windows/README.md still says it intentionally stops short of final viewer rendering, but the client now renders the remote desktop
  • workers/rdp-worker/README.md documented recent RDP stages, but previously did not clearly mark the current accepted image and latest manual acceptance
  • docs/architecture/DATA_PLANE_V1.md previously had a stale "Next Implementation Prompt"; it now points to Stage 5.2 live runtime proof
  • docs/architecture/RDP_ADAPTER_RUNTIME.md and docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md still mark manual UX acceptance as pending, reflecting the state before the latest fixes

This was the P0 risk addressed by the baseline-freeze documentation pass. Future stages must keep these files current after every accepted runtime change so a future Codex/session cannot follow an old prompt and reintroduce already-rejected architecture.

7. Lessons From The RDP Adapter Work

The RDP work exposed several project-level rules:

  1. Realtime protocol features must be designed as channel semantics first. Input, display, cursor, clipboard, file transfer, and telemetry cannot share one undifferentiated queue.

  2. Backend/Redis must not be the production realtime path. It is correct as fallback/debug/control-plane glue, not for high-rate render.

  3. Full-frame rendering is not the normal production model. It is needed only for baseline, attach, resize, recovery, and fallback repair.

  4. Dirty regions cannot be blindly latest-only without a repair strategy. Dropping a region update may leave visible artifacts; the current region_loss_repair full-frame repair is a pragmatic safety net.

  5. Server-origin events must drive display updates. Remote changes must not depend on local mouse/keyboard events.

  6. Input must be independent from render. A key or click must never wait behind a frame, upload chunk, clipboard message, or lease renewal.

  7. FreeRDP is not the problem by default. The earlier problem was how we pumped events, scheduled frames, relayed payloads, and treated screen updates. The correct direction is an adapter boundary around FreeRDP first, not a full rewrite before we can prove the replacement.

  8. Manual UX proof is essential. Automated input can pass while real user input feels wrong.

  9. Every "temporary" shortcut needs an explicit expiration condition. If it does not have one, it becomes architecture.
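
Rules 1 and 6 can be illustrated with a small scheduling sketch. It is written in Python as a language-neutral illustration, not as the worker's C++ code. The channel names follow the list in rule 1; the strict priority order and the class name `ChannelScheduler` are assumptions made for this example only.

```python
from collections import deque

# Assumed priority order (highest first). Only the channel names come from
# the rules above; the ordering itself is an illustration.
PRIORITY = ("input", "cursor", "display", "clipboard", "file_transfer", "telemetry")


class ChannelScheduler:
    """One queue per channel, so input never waits behind bulk traffic."""

    def __init__(self):
        self._queues = {name: deque() for name in PRIORITY}

    def enqueue(self, channel: str, message) -> None:
        if channel not in self._queues:
            raise ValueError(f"unknown channel: {channel}")
        self._queues[channel].append(message)

    def next_message(self):
        """Return (channel, message) from the highest-priority non-empty queue."""
        for name in PRIORITY:
            queue = self._queues[name]
            if queue:
                return name, queue.popleft()
        return None
```

With a single shared queue, a key press enqueued after ten upload chunks would be sent eleventh. Here it is sent first. Strict priority can starve low-priority channels under sustained load, so a real scheduler would likely add per-channel budgets or backpressure.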

8. What We May Have Missed

These are not immediate bugs, but they should be addressed early because they shape the product:

  • RDP server compatibility matrix: Windows Server versions, NLA modes, GDI vs RDPGFX behavior, color depth, TLS/cert behavior, domain login variants
  • weak-channel simulation: latency, jitter, loss, constrained bandwidth
  • high-concurrency session model: many users, many workers, CPU/network limits
  • deterministic smoke reports: every accepted stage should leave reproducible artifacts and commands
  • secret management: credentials must move out of plain resource metadata
  • production PKI: direct worker WSS currently uses smoke-friendly TLS handling on the client side
  • authorization tests: cross-org denial paths need automated coverage
  • resource policy test matrix: clipboard/file/cert/session policies
  • file transfer threat model: filename normalization, symlink escape, overwrite behavior, quotas, cleanup, audit
  • observability: per-channel latency, frame drops, input latency, worker event pump health, adapter callback counters
  • client UI state machine tests: close/dispose, failed state, reconnect, takeover, detach, old attachment blocking
  • upgrade/rollback story: node-agent target exists, runtime is not implemented
  • deployment topology: container host networking vs Docker bridge/NAT for realtime workloads
  • service adapter conformance suite: RDP now has a pattern that VNC/SSH/video should follow

9. Architectural Decisions To Freeze Now

These decisions should be treated as current project rules:

  1. PostgreSQL is source of truth.
  2. Redis is live coordination/routing only.
  3. Backend is control plane, not production render relay.
  4. Direct data plane is preferred for realtime RDP traffic.
  5. Backend gateway remains fallback/debug until direct path is fully mature.
  6. Service adapters translate external protocols to platform channels.
  7. RDP Adapter remains C++ and FreeRDP-backed for now.
  8. FreeRDP details must not leak into backend or Access Client business logic.
  9. Access Client speaks platform protocol, not RDP.
  10. Mesh/VPN/multi-cluster/node-agent runtime remain future staged work.
  11. RDP must be stabilized before adding VNC/SSH/VPN/product expansion.
  12. No new feature should start while source-of-truth docs are stale.

10. Next-Step Plan

P0. Truth And Baseline Freeze

Goal: make the current working system impossible to misunderstand.

Do:

  • update root README.md
  • update docs/codex/CURRENT_STATUS.md
  • update docs/codex/NEXT_STEP_PROMPT.md
  • update clients/windows/README.md
  • update workers/rdp-worker/README.md
  • update docs/architecture/DATA_PLANE_V1.md next prompt
  • update docs/architecture/RDP_ADAPTER_RUNTIME.md with latest baseline/region repair status
  • document current test Docker image/tag and startup commands
  • preserve the accepted RDP worker baseline
  • create one "current smoke matrix" document

Do not:

  • add features
  • start DP-3B
  • start server-to-client download
  • start mesh/VPN/node-agent runtime

Acceptance:

  • a new engineer/Codex can read the docs and know the actual next step
  • no doc points to archived v1 or already-completed stages as next work

P1. RDP Visual Correctness Hardening

Goal: eliminate remaining small artifacts without returning to slow full-frame rendering.

Do:

  • add explicit region sequence/gap diagnostics
  • prove when artifacts happen: region drop, stale region ordering, missed server callback, client application bug, or repair interval issue
  • verify client applies region frames to the correct bitmap area and stride
  • keep baseline full frame on attach
  • keep full repair only on loss/recovery, not as normal render loop
  • collect before/after screenshots/logs
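
The sequence/gap diagnostics and the "repair only on loss" rule can be sketched as follows. This is a hedged Python illustration, assuming each region frame carries a monotonically increasing sequence number, which this document does not confirm. The class name, counters, and the one-second throttle are hypothetical.

```python
class RegionGapTracker:
    """Counts region sequence gaps and throttles full-frame repair requests."""

    def __init__(self, min_repair_interval_s: float = 1.0):
        self.min_repair_interval_s = min_repair_interval_s
        self.expected_seq = None   # next sequence number expected
        self.gap_events = 0        # number of detected gaps
        self.missing_regions = 0   # total region frames skipped by gaps
        self.stale_regions = 0     # frames that arrived out of order (too old)
        self._last_repair_at = None

    def observe(self, seq: int, now: float) -> str:
        """Classify one region frame; `now` is a monotonic time in seconds."""
        if self.expected_seq is None or seq == self.expected_seq:
            self.expected_seq = seq + 1
            return "apply"
        if seq < self.expected_seq:
            self.stale_regions += 1
            return "drop_stale"
        # seq is ahead of what we expected: one or more regions were lost.
        self.gap_events += 1
        self.missing_regions += seq - self.expected_seq
        self.expected_seq = seq + 1
        if (self._last_repair_at is None
                or now - self._last_repair_at >= self.min_repair_interval_s):
            self._last_repair_at = now
            return "apply_and_repair"
        return "apply_repair_throttled"
```

The counters separate "region drop" from "stale region ordering", two of the artifact causes listed above. A production version would probably also schedule a deferred repair when a request is throttled, rather than only reporting it.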

Do not:

  • enable RDPGFX globally
  • add compression/tiles/codecs before correctness is stable
  • change backend/session lifecycle

Acceptance:

  • remote idle updates repaint without local input
  • Start menu/task manager/window movement leave no persistent artifacts
  • input and close behavior remain usable

P2. Stage 5.1.1 Restricted Drive Visibility Proof

Status: accepted as runtime-proven on the test Docker stand.

Goal: keep the upload visibility path protected while the RDP Adapter continues to be hardened.

Do:

  • run live smoke with current RDP adapter baseline
  • upload file from Windows client
  • verify file appears in \\tsclient\RAP_Transfers
  • open text and binary files inside the remote Windows session
  • prove disabled policy blocks upload
  • prove takeover/detach/failure block old or invalid upload
  • verify directory cleanup on terminate

Do not:

  • implement download
  • expose arbitrary worker filesystem
  • implement shared folders or SMB/WebDAV

Accepted proof:

  • uploaded file is visible and openable inside remote Windows
  • only per-session visible directory is exposed
  • worker logs show RAP_Transfers configured as the only redirected drive
  • termination cleans the per-session transfer directory

P3. Security And Secrets Readiness

Status: P3.1 MVP complete; production TLS/PKI remains P3.2.

Goal: remove proof-stage security shortcuts before broad usage.

Completed:

  • documented secret-reference model in docs/architecture/SECURITY_SECRETS_READINESS.md
  • production mode rejects plaintext credential-like resource metadata
  • production RDP/VNC/SSH resources require secret_ref
  • session start rejects compat plaintext resources in production mode
  • data-plane allowed-channel policy test exists
  • worker direct-bind denial probes cover wrong worker/user/org/resource, wrong attachment, over-broad channels, and failed/terminated states
  • encrypted PostgreSQL-backed resource_secrets store exists
  • resource secret create/rotate endpoint updates resources.secret_ref without returning plaintext
  • session assignment resolves secret_ref after organization/resource/session/worker/lease checks and does not mutate remote_sessions.metadata with plaintext
  • secret access/access-denied/rotation audit events exist
  • direct worker WSS TLS trust metadata/guard exists; production backend omits smoke-only direct candidates and production Windows client skips untrusted direct candidates
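
The direct-bind denial cases listed above can be expressed as one ordered check. The sketch below is a Python illustration, not the logic in direct_bind_policy.cpp. All field names and denial reason strings are hypothetical, and RS256 signature verification is assumed to have happened before this function is called.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

TERMINAL_STATES = frozenset({"failed", "terminated"})


@dataclass(frozen=True)
class BindClaims:
    """Claims taken from an already-verified data-plane token (assumed shape)."""
    worker_id: str
    user_id: str
    org_id: str
    resource_id: str
    attachment_id: str
    channels: FrozenSet[str]


@dataclass(frozen=True)
class SessionRecord:
    """Worker-side view of the session being bound to (assumed shape)."""
    worker_id: str
    user_id: str
    org_id: str
    resource_id: str
    state: str
    current_attachment_id: str
    allowed_channels: FrozenSet[str]


def bind_denial_reason(claims: BindClaims, session: SessionRecord) -> Optional[str]:
    """Return a denial reason, or None when the bind may proceed."""
    if session.state in TERMINAL_STATES:
        return "session_not_active"
    for name in ("worker_id", "user_id", "org_id", "resource_id"):
        if getattr(claims, name) != getattr(session, name):
            return f"wrong_{name}"
    if claims.attachment_id != session.current_attachment_id:
        return "stale_attachment"
    if not claims.channels <= session.allowed_channels:
        return "channels_too_broad"
    return None
```

Returning a named reason rather than a boolean makes each denial probe assert on the specific path it is meant to cover.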

Still required after P3.2:

  • deploy production direct-worker certificates/platform CA trust
  • add external KMS/Vault or stronger key-management integration
  • add master-key rotation/re-encryption workflow
  • consider future worker pull/token resolver flow to avoid resolved credentials in Redis assignment payloads

Do not:

  • build full enterprise KMS prematurely
  • weaken certificate or token model for convenience

Acceptance:

  • production mode cannot create/start resources with plaintext credential metadata
  • cross-org, old-attachment, wrong worker/resource/org, and terminal-session denial paths are covered by focused tests/probes
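
The first acceptance point can be sketched as a validation function. This is a Python illustration under stated assumptions: the set of "credential-like" key names is invented for the example, metadata is treated as a flat mapping, and the real backend rules live in Go and may differ.

```python
# Assumed list of credential-like keys; the real backend's list is not
# given in this document.
CREDENTIAL_LIKE_KEYS = frozenset({"password", "passwd", "secret", "private_key", "token"})
SECRET_REQUIRED_PROTOCOLS = frozenset({"rdp", "vnc", "ssh"})


def validate_resource(protocol: str, metadata: dict, secret_ref: str, production: bool) -> list:
    """Return a list of violations; an empty list means the resource is acceptable."""
    if not production:
        return []  # compat/smoke mode is not restricted in this sketch
    errors = []
    for key in sorted(metadata):
        if key.lower() in CREDENTIAL_LIKE_KEYS:
            errors.append(f"plaintext credential metadata: {key}")
    if protocol.lower() in SECRET_REQUIRED_PROTOCOLS and not secret_ref:
        errors.append("secret_ref required")
    return errors
```

Running the same function on both resource creation and session start would match the two enforcement points named in the completed list.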

P4. Automated Regression Suite

Goal: convert the painful manual discoveries into repeatable gates.

Do:

  • add backend unit/integration tests for org scope, session state, data-plane token, stale worker, clipboard/file policies
  • add worker probes for render sequencing, direct baseline, region repair, adapter event routing
  • add Windows transport/viewmodel tests for fallback, close/dispose, failed state, frame latest-only, localization resolution
  • make smoke scripts emit machine-readable PASS/FAIL reports
  • pin each accepted image/build artifact

Acceptance:

  • a regression in input, render, worker-death, takeover, clipboard, or upload fails a repeatable test before manual smoke
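
A machine-readable PASS/FAIL report could look like the sketch below. The JSON field names are assumptions for illustration. The image tag in the usage note is the current smoke image named earlier in this audit.

```python
import json


def build_report(stage: str, image: str, results: dict) -> dict:
    """Build a report from {check_name: passed_bool}; no checks means FAIL."""
    checks = [
        {"name": name, "status": "PASS" if passed else "FAIL"}
        for name, passed in sorted(results.items())
    ]
    overall = "PASS" if results and all(results.values()) else "FAIL"
    return {"stage": stage, "image": image, "overall": overall, "checks": checks}


def to_json(report: dict) -> str:
    """Serialize with stable key order so reports diff cleanly between runs."""
    return json.dumps(report, indent=2, sort_keys=True)
```

For example, `build_report("stage5.2", "rap-rdp-worker:stage5-2-download", {"takeover": True, "upload": False})` yields an overall FAIL with one failing check. Treating an empty result set as FAIL avoids a smoke script that silently ran nothing being counted as green.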

P5. RDP Performance Next Layer

Goal: improve speed on weak channels after correctness is stable.

Candidate paths:

  • RDPGFX on compatible target only
  • encoded graphics payloads
  • dirty-region compression
  • tile/region framing
  • adaptive quality profiles
  • palette/grayscale/low-bandwidth modes
  • per-channel QoS and backpressure telemetry

Do not:

  • replace stable region-first path without fallback
  • ship a graphics mode that only works on one target

Acceptance:

  • direct full-color baseline remains available
  • each new graphics mode has compatibility detection and fallback
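
The two acceptance points amount to a negotiation rule: pick the best mode that both sides support, otherwise fall back to the baseline. A minimal Python sketch follows. The mode identifiers are hypothetical labels, not names used by the worker or client.

```python
# Hypothetical label for the accepted region-first full-color path.
BASELINE_MODE = "region_bgra_full_color"


def select_graphics_mode(preferred, target_supports, client_supports) -> str:
    """Pick the first preferred mode supported by both sides, else the baseline."""
    for mode in preferred:
        if mode == BASELINE_MODE:
            break  # baseline reached in the preference list: stop searching
        if mode in target_supports and mode in client_supports:
            return mode
    return BASELINE_MODE
```

Because the function can only return a mode that passed detection or the baseline, a target that resets the connection when RDPGFX is advertised would simply never report that mode as supported and would stay on the baseline.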

P6. Product Completion For RDP

Only after P0-P5 gates are stable:

  • manual desktop acceptance for server-to-client file download from RAP_Transfers\ToClient
  • richer file transfer UX
  • final RDP UX polish
  • policy management UI
  • operational runbooks
  • release readiness checklist

P7. Platform Expansion

Only after RDP is stable:

  • VNC Adapter
  • SSH Adapter
  • node-agent runtime/updater
  • entry/relay nodes
  • mesh routing
  • VPN/IP tunnel mode
  • Linux/iOS/Android clients

11. Proposed Immediate Next Prompt

Use this as the next implementation prompt if we continue immediately:

Proceed with Stage 5.2 remaining desktop UI proof only - RDP server-to-client
file download.

Goal:
Finish acceptance of safe, policy-aware download from the remote RDP session to
the Windows Access Client UI using the restricted RAP_Transfers\ToClient drop
zone.

Strict rules:
- do not implement arbitrary remote path download
- do not implement remote filesystem browser
- do not implement recursive folder transfer
- do not implement SMB/WebDAV/Windows agent
- do not expose any worker path outside the per-session visible directory
- do not change RDP rendering/input/clipboard behavior
- do not remove backend gateway fallback
- do not implement binary file chunk frames yet
- do not start DP-3B, mesh, VPN, node-agent runtime, or new adapters

Scope:
1. Keep the current Stage 5.2 backend/worker deployment on docker-test.
2. Prove Windows desktop UI download for text and binary files placed in
   RAP_Transfers\ToClient.
3. Prove rendering, input, clipboard, upload, lifecycle, and fallback do not
   regress.

Acceptance:
- disabled and client_to_server modes block download
- server_to_client and bidirectional modes allow download
- text and binary files download with matching hashes
- traversal/symlink/non-regular/too-large files are blocked
- rendering, input, clipboard, upload, lifecycle, and fallback do not regress
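
The download acceptance checks above can be sketched in Python as a language-neutral illustration. The four policy mode names come from the acceptance list; the function names, the size limit parameter, and the use of `PermissionError` are assumptions. The sketch checks and then returns a path, so it is not race-free; a real worker would more likely open the file without following symlinks and validate the open handle.

```python
import os
import stat

DOWNLOAD_MODES = frozenset({"server_to_client", "bidirectional"})


def download_allowed(policy_mode: str) -> bool:
    """disabled and client_to_server block download; the other two allow it."""
    return policy_mode in DOWNLOAD_MODES


def resolve_download(drop_zone: str, name: str, max_bytes: int) -> str:
    """Return the absolute path of a downloadable file or raise PermissionError."""
    # Only a plain file name is accepted: no separators, no traversal.
    if name in ("", ".", "..") or "/" in name or "\\" in name:
        raise PermissionError("name must be a plain file name")
    root = os.path.realpath(drop_zone)
    candidate = os.path.join(root, name)
    info = os.lstat(candidate)  # lstat does not follow a final symlink
    if stat.S_ISLNK(info.st_mode):
        raise PermissionError("symlinks are not downloadable")
    if not stat.S_ISREG(info.st_mode):
        raise PermissionError("not a regular file")
    if info.st_size > max_bytes:
        raise PermissionError("file too large")
    # Defence in depth: this should always hold after the checks above.
    if os.path.dirname(os.path.realpath(candidate)) != root:
        raise PermissionError("path escapes the drop zone")
    return candidate
```

A missing file raises `FileNotFoundError` from `os.lstat`, which a caller would map to its own "not found" response.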

12. Bottom Line

The project direction is sound, but the process must now become stricter:

  • design channel semantics first
  • implement through adapter boundaries
  • prove with live/manual smoke and automated gates
  • update source-of-truth docs before starting the next major stage
  • reject "temporary" shortcuts unless they have a documented removal condition

The RDP Adapter experience was expensive, but useful. It showed exactly where the architecture must be disciplined before adding SSH, VNC, VPN, mobile clients, or mesh runtime.