Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,148 @@
+# Architecture guardrails
+
+These rules are mandatory.
+
+## 1. Preserve the proven session foundation
+The following are already proven and must remain stable:
+- live FreeRDP connect
+- active session state
+- terminate
+- detach without killing remote session
+- reattach without recreating remote session
+- takeover without recreating remote session
+
+No architectural refactor may silently weaken this behavior.
+
+## 2. Source of truth
+- PostgreSQL is the only durable source of truth for domain state.
+- Redis is only for live coordination, routing, heartbeats, leases, attach tokens, and ephemeral cache.
+
+## 3. Control plane vs data plane
+Keep them distinct.
+
+### Control plane
+- organizations
+- users
+- memberships
+- roles
+- resources
+- policies
+- nodes
+- services
+- connectors
+- cluster membership
+- updates
+- config distribution
+
+### Data plane
+- session streams
+- worker traffic
+- relay traffic
+- connector traffic
+- future exit traffic
+
+## 4. Multi-tenancy isolation
+Every organization must be isolated by design.
+
+Namespace by organization for:
+- resources
+- users-in-org
+- groups
+- policies
+- connectors
+- sessions
+- audit
+- secrets references
+- Redis keys where applicable
+
+No cross-org leakage of identifiers, data, logs, cache keys, or policy decisions.
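One way to make the Redis part of this rule hard to get wrong is to build every key through a single helper. The sketch below is illustrative only: the `org:<orgID>:<kind>:<id>` layout and the names `OrgKey` and `BelongsToOrg` are assumptions, not identifiers from this repository.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// OrgKey builds a Redis key namespaced by organization.
// The "org:<orgID>:<kind>:<id>" layout is a hypothetical convention.
// Segments containing separators or glob characters are rejected so that
// one organization cannot craft a key that lands in another namespace.
func OrgKey(orgID, kind, id string) (string, error) {
	for _, part := range []string{orgID, kind, id} {
		if part == "" || strings.ContainsAny(part, ": \t\n*") {
			return "", errors.New("invalid key segment")
		}
	}
	return fmt.Sprintf("org:%s:%s:%s", orgID, kind, id), nil
}

// BelongsToOrg reports whether key lives inside orgID's namespace.
// The trailing separator prevents "acm" from matching "acme".
func BelongsToOrg(key, orgID string) bool {
	return orgID != "" && strings.HasPrefix(key, "org:"+orgID+":")
}

func main() {
	key, err := OrgKey("acme", "lease", "session-42")
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
	fmt.Println(BelongsToOrg(key, "acme"), BelongsToOrg(key, "acm"))
}
```

Routing all key construction through one function also gives a single place to audit when checking for cross-org leakage.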
+
+## 5. Customer-managed nodes
+Customer-managed nodes:
+- may join the common cluster,
+- must remain limited to allowed scope,
+- must not automatically become general-purpose relay/control nodes for other organizations.
+
+## 6. Node agent design
+A node agent:
+- is small,
+- is stable,
+- is always running,
+- supervises services,
+- downloads signed updates,
+- verifies signatures and versions,
+- can roll back,
+- can restart services,
+- can operate on thin nodes and thick nodes.
+
+The agent is not the same as the service workloads.
+
+## 7. Split-brain prevention
+Never allow minority partitions to become a second authoritative cluster automatically.
+
+Required states:
+- healthy
+- degraded
+- recovery
+- isolated / emergency
+
+Cluster-wide changes, role changes and risky mutations must be restricted in non-quorum states.
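The restriction above could be enforced with a small gate in front of every risky mutation. This is a minimal sketch under stated assumptions: the rule does not say which of the four states still count as having quorum, so treating `healthy` and `degraded` as mutation-capable (when a strict majority is reachable) is an assumption, and `AllowRiskyMutation` is a hypothetical name.

```go
package main

import "fmt"

// ClusterState mirrors the four required states from the guardrails.
type ClusterState string

const (
	StateHealthy  ClusterState = "healthy"
	StateDegraded ClusterState = "degraded"
	StateRecovery ClusterState = "recovery"
	StateIsolated ClusterState = "isolated"
)

// HasQuorum reports whether the reachable voters form a strict majority.
// An exact half (for example 2 of 4) is not a quorum, which is what keeps
// two equal partitions from both becoming authoritative.
func HasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}

// AllowRiskyMutation is a hypothetical gate for cluster-wide changes,
// role changes, and other risky mutations.
// Assumption: only healthy and degraded states may mutate, and only
// while quorum holds; recovery and isolated states are always refused.
func AllowRiskyMutation(state ClusterState, reachable, total int) bool {
	if !HasQuorum(reachable, total) {
		return false
	}
	return state == StateHealthy || state == StateDegraded
}

func main() {
	fmt.Println(AllowRiskyMutation(StateHealthy, 3, 5))
	fmt.Println(AllowRiskyMutation(StateIsolated, 2, 5))
	fmt.Println(AllowRiskyMutation(StateDegraded, 2, 4))
}
```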
+
+## 8. Service model
+Each node must separate:
+- capabilities
+- enabled services
+
+Do not encode every function into one monolithic node role.
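The separation can be sketched as two independent sets on a node, where a service may only be enabled if the node advertises the matching capability. The type and method names here are illustrative assumptions, not the repository's actual model.

```go
package main

import (
	"fmt"
	"sort"
)

// Node keeps what a node is able to run (Capabilities) apart from what
// it has been told to run (Enabled). Both field names are hypothetical.
type Node struct {
	Capabilities map[string]bool
	Enabled      map[string]bool
}

// Enable turns a service on only when the node has the capability.
func (n *Node) Enable(service string) error {
	if !n.Capabilities[service] {
		return fmt.Errorf("node lacks capability %q", service)
	}
	if n.Enabled == nil {
		n.Enabled = map[string]bool{}
	}
	n.Enabled[service] = true
	return nil
}

// EnabledServices returns the enabled services in sorted order.
func (n *Node) EnabledServices() []string {
	out := make([]string, 0, len(n.Enabled))
	for s := range n.Enabled {
		out = append(out, s)
	}
	sort.Strings(out)
	return out
}

func main() {
	n := &Node{Capabilities: map[string]bool{"relay": true, "rdp-worker": true}}
	fmt.Println(n.Enable("relay"))
	fmt.Println(n.Enable("connector"))
	fmt.Println(n.EnabledServices())
}
```

With this shape, a node that can act as a relay does not become one until that service is explicitly enabled.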
+
+## 9. Security model
+Security must be based on:
+- strong crypto
+- signed artifacts
+- node identity
+- short-lived user/session tokens
+- scoped trust
+- audit trails
+- revocation
+- least privilege
+
+Do not depend on protocol obscurity.
+
+## 10. Migration strategy
+Do not force a big-bang rewrite.
+Add the platform core around the current system in steps:
+1. organization / membership model
+2. org-scoped resource model
+3. node model and node-agent control interfaces
+4. connector model
+5. mesh / routing evolution
+6. native clients and higher-level features
+
+## 11. Updates and rollback
+Updates must support:
+- manual or automatic policy
+- staged rollout
+- canary rollout
+- rollback to previous version
+- signed artifacts
+- optional update mirrors / caches on selected nodes
+
+Thin nodes may download update artifacts for their own use but do not store them as a mirror or cache.
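Staged and canary rollout both need a stable answer to "is this node in the current stage?". One common approach, shown here as an assumption rather than the project's chosen mechanism, is to hash the node ID into a fixed bucket and compare it with the stage percentage. `Bucket` and `InRollout` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Bucket maps a node ID to a stable value in [0, 100).
// FNV-1a is used only for cheap, deterministic spreading; it is not a
// security boundary. Artifact signatures still have to be verified.
func Bucket(nodeID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(nodeID))
	return h.Sum32() % 100
}

// InRollout reports whether nodeID belongs to a stage that covers the
// given percentage of nodes. Because the bucket is stable, a node that
// is in the 5% canary stage is also in every larger stage.
func InRollout(nodeID string, percent uint32) bool {
	return Bucket(nodeID) < percent
}

func main() {
	for _, id := range []string{"node-a", "node-b", "node-c"} {
		fmt.Println(id, Bucket(id), InRollout(id, 5), InRollout(id, 50))
	}
}
```

Widening the percentage moves the rollout forward stage by stage; rolling back to the previous version is a separate action on the nodes that already updated.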
+
+## 12. Performance and routing awareness
+Placement and routing decisions must consider:
+- CPU
+- RAM
+- network load
+- active sessions
+- connector load
+- relay load
+- service type
+- health score
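A placement decision over these inputs might look like the sketch below. It covers only a subset of the listed signals (CPU, RAM, network load, active sessions, service type, health score); the weights, field names, and the `Score` and `Pick` functions are illustrative assumptions, not tuned or sourced values.

```go
package main

import "fmt"

// NodeLoad is a hypothetical snapshot of one node's placement inputs.
type NodeLoad struct {
	Name           string
	CPU            float64 // utilization, 0..1
	RAM            float64 // utilization, 0..1
	Network        float64 // utilization, 0..1
	ActiveSessions int
	MaxSessions    int
	Health         float64 // 0 (down) .. 1 (fully healthy)
	Services       map[string]bool
}

func clamp01(v float64) float64 {
	if v < 0 {
		return 0
	}
	if v > 1 {
		return 1
	}
	return v
}

// Score returns a value in [0, 1] where higher is better, and false when
// the node cannot host the service at all. The weights are placeholders.
func Score(n NodeLoad, service string) (float64, bool) {
	if !n.Services[service] || n.Health <= 0 || n.MaxSessions <= 0 {
		return 0, false
	}
	sessions := clamp01(float64(n.ActiveSessions) / float64(n.MaxSessions))
	load := 0.35*clamp01(n.CPU) + 0.25*clamp01(n.RAM) +
		0.20*clamp01(n.Network) + 0.20*sessions
	return clamp01(n.Health) * (1 - load), true
}

// Pick returns the name of the best-scoring eligible node.
func Pick(nodes []NodeLoad, service string) (string, bool) {
	best, bestScore, found := "", -1.0, false
	for _, n := range nodes {
		if s, ok := Score(n, service); ok && s > bestScore {
			best, bestScore, found = n.Name, s, true
		}
	}
	return best, found
}

func main() {
	rdp := map[string]bool{"rdp-worker": true}
	nodes := []NodeLoad{
		{Name: "busy", CPU: 0.9, RAM: 0.8, Network: 0.7, ActiveSessions: 45, MaxSessions: 50, Health: 1, Services: rdp},
		{Name: "idle", CPU: 0.1, RAM: 0.2, Network: 0.1, ActiveSessions: 2, MaxSessions: 50, Health: 1, Services: rdp},
	}
	fmt.Println(Pick(nodes, "rdp-worker"))
}
```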
+
+## 13. No feature explosion before platform core
+Do not jump to:
+- full collaboration/video meetings
+- advanced media plane
+- internet exit mode
+before the platform core is modeled correctly.
@@ -0,0 +1,129 @@
+# Final platform technical direction (summary)
+
+## Product definition
+A distributed secure access platform with:
+- multi-tenant organizations
+- proven persistent session broker for RDP
+- cluster of platform-managed and customer-managed nodes
+- node-agent based service fabric
+- connector/VPN layer
+- future split/full tunnel capability
+- future collaboration extensions
+
+## Main top-level domains
+
+### Platform
+Owns:
+- global policies
+- cluster control plane
+- platform admins
+- node trust
+- artifact signing and update policy
+- disaster recovery authority
+
+### Organization
+Owns:
+- users
+- groups
+- organization admins
+- identity sources
+- resources
+- policies
+- connectors
+- audits
+- quotas
+- domains / branding later
+
+### Node
+Has:
+- node identity
+- ownership type (platform-managed, customer-managed)
+- capabilities
+- enabled services
+- health
+- update policy
+- version state
+- partition state
+
+### Node Agent
+Small stable agent that:
+- keeps running
+- supervises services
+- downloads signed updates
+- verifies integrity
+- restarts crashed services
+- rolls back if needed
+- reports health
+
+### Connector
+Reusable network access method:
+- direct
+- VPN
+- relay-backed
+- future egress mode
+Bound to resources by policy, not duplicated blindly per server.
+
+### Session broker
+Already proven for RDP persistent lifecycle.
+
+## Mandatory capabilities
+
+### Multi-tenant
+- org isolation
+- organization memberships
+- user may belong to multiple organizations
+- clear org switching UX later
+- org admins only see their org
+
+### Identity federation
+- local accounts
+- LDAP / AD
+- OIDC
+- group/claim mapping to access
+
+### Resource authorization
+- local manual mapping
+- external group / claim driven mapping
+- feature scopes:
+  - RDP only
+  - connector/VPN only
+  - both
+  - future scopes
+
+### Cluster behavior
+- dynamic membership
+- encrypted inter-node communication
+- no mandatory single center
+- quorum-based authority
+- degraded / recovery / isolated modes
+- manual partition promotion only by highly privileged recovery admin
+- multi-hop route support
+- not every node needs full mesh
+
+### Updates
+- signed artifacts
+- canary rollout
+- staged rollout
+- rollback
+- thin node vs artifact-cache node
+
+### Customer-managed nodes
+- can join common cluster
+- can be scoped to their organization
+- can serve ingress / connector / egress functions for that organization
+- must not automatically become cluster-global trusted nodes
+
+## What to implement first
+- organization model
+- memberships and roles
+- org-scoped resource model
+- identity source model
+- node and node-agent control plane model
+- service capabilities / enabled services model
+
+## What to delay
+- full mesh engine
+- full connector scheduler
+- internet exit mode
+- collaboration/video meetings
+- heavy media routing
@@ -0,0 +1,123 @@
+C17Z20 is complete.
+
+Installation Authority foundation is also complete:
+
+- production config requires strict authority mode with Product Root public key
+- first-owner bootstrap requires a signed activation manifest in strict mode
+- `installation_authority` and signed `platform_role_grants` are persisted
+- strict platform-admin checks ignore direct `users.platform_role` edits unless
+  a valid signed grant exists
+- web-admin shows installation status and first-owner bootstrap
+- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
+  keys and sign activation manifests; private keys must stay outside the repo
+
+Cluster Authority foundation is now also complete:
+
+- every newly created cluster gets an Ed25519 `cluster_authorities` key record
+- cluster authority private keys are encrypted at rest when
+  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
+  a secret encryption key
+- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
+- backend signs join-token scope material, node approval/bootstrap material,
+  and node-scoped synthetic mesh config snapshots
+- node-agent verifies signed Control Plane synthetic config when
+  `authority_required=true` or signature fields are present
+- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
+  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
+  trust anchor after approval
+- web-admin shows cluster key fingerprints on summaries, join-token output,
+  approval rows, and synthetic config visibility
+- docker-test lifecycle smoke is complete: fresh dev install, first-owner
+  bootstrap, cluster creation, signed join token, real node-agent enrollment,
+  owner approval, automatic signed bootstrap polling, authority pin
+  persistence, heartbeat, and signed synthetic config verification all passed
+- `rap-node-agent` desired-workload polling/status reporting is gated by
+  `RAP_WORKLOAD_SUPERVISION_ENABLED`, which defaults to `false` while service
+  runtime supervision remains a stub
+
+Node enrollment bootstrap polling is also complete:
+
+- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
+- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
+  before receiving status/bootstrap material
+- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
+  the signed bootstrap contract, then persists `node_id`, `identity_status`,
+  and cluster authority pin into `identity.json`
+- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
+  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
+
+Current state:
+
+- C17Z12 added rendezvous/relay control-plane leases for peers that would
+  otherwise stay in `waiting_rendezvous`.
+- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
+  for renewal/stale relay recovery.
+- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
+  relay-pool scoring.
+- C17Z16 added Control Plane `route_path_decisions`.
+- C17Z17 added node-side route generation apply/withdraw tracking.
+- C17Z18 applies Control Plane `route_path_decisions` to synthetic
+  route-health route config only. The synthetic `fabric.route_health` runtime
+  now probes the selected effective path, including replacement relay paths,
+  and reports expected/observed hops plus drift state.
+- C17Z19 consumes those synthetic route-health observations in backend relay
+  scoring. Drift/unreachable/failure feedback marks the exact selected relay
+  stale and can trigger replacement; healthy low-latency route-health boosts
+  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
+  service class, and web-admin marks relay policy `rh feedback`.
+- C17Z20 closes the node-side feedback loop. After node-agent reports
+  synthetic route-health drift/unreachable/failure, it performs a bounded
+  node-scoped synthetic-config refresh, applies returned replacement route
+  decisions to route-health config immediately, and reports
+  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
+- Backend `mesh_latest_links` now keeps latest observations per observation
+  type/route, so `synthetic_route_health` is not overwritten by
+  `peer_connection_manager`.
+- Web-admin Fabric links now show observation type, selected relay, and
+  route-health effective/observed path.
+- All of this remains control-plane/synthetic route-health only. It does not
+  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
+  implement arbitrary relay packet forwarding.
+- Cluster Authority and node enrollment bootstrap are docker-test
+  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
+- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
+  `000021_cluster_authority_keys`; the migration now drops/recreates
+  `cluster_admin_summaries` in up/down paths.
+
+Runtime reports, smoke commands, and endpoints:
+
+- `artifacts/c17z18-route-health-effective-path-report.md`
+- `artifacts/c17z19-route-health-feedback-report.md`
+- `artifacts/c17z19-route-health-feedback-smoke-result.json`
+- `artifacts/c17z20-route-health-feedback-refresh-report.md`
+- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
+- Docker-test smoke command:
+  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
+- Dev lifecycle smoke command:
+  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
+- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
+  current C17Z20 node-agent code)
+- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
+- Admin: `http://192.168.200.61:5174/`
+- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
+- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
+- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`
+
+Do not automatically continue into:
+
+- RDP/VNC/SSH/file/video/service workload traffic over mesh
+- VPN/IP tunnel runtime implementation
+- arbitrary relay packet forwarding
+- production payload forwarding for relay paths
+- QUIC/WebRTC or STUN/TURN/ICE
+- TUN/TAP, host route, DNS, or firewall manipulation
+- backend/session lifecycle changes
+- Windows client changes
+
+Next narrow layer, if approved:
+
+C17Z21 should tighten route-health feedback refresh dampening: if an immediate
+feedback refresh returns the same config version or no replacement change, keep
+a per-route/relay no-change cooldown before retrying. Keep the boundary
+synthetic/control-plane only and keep RDP/VPN/service payload forwarding
+untouched.
@@ -0,0 +1,81 @@
+# Target project structure for the next phase
+
+This is the desired direction, not necessarily the current exact repo state.
+
+## Root
+- `backend/`
+- `workers/rdp-worker/`
+- `clients/windows/`
+- `clients/linux/`
+- `web-admin/`
+- `scripts/`
+- `docs/`
+- `deploy/`
+- `CODEX_CONTEXT.md`
+
+## Backend suggested evolution
+- `internal/platform/`
+  - config
+  - runtime
+  - logging
+  - postgres
+  - redis
+  - module
+  - authn middleware
+  - authz middleware
+- `internal/modules/`
+  - auth
+  - organization
+  - membership
+  - identitysource
+  - group
+  - resource
+  - sessionbroker
+  - sessiongateway
+  - worker
+  - node
+  - nodeagent
+  - connector
+  - audit
+  - policy
+- `pkg/contracts/`
+  - session
+  - worker
+  - node
+  - connector
+
+## New modules to add in next phase
+- `organization`
+- `membership`
+- `identitysource`
+- `node`
+- `nodeagent`
+- `policy` (if policy logic is currently too scattered)
+
+## DB evolution direction
+New tables/entities should include:
+- organizations
+- organization_memberships
+- organization_roles
+- identity_sources
+- identity_mappings
+- groups
+- group_memberships / external_group_bindings
+- nodes
+- node_services
+- node_capabilities
+- node_update_policies
+- node_partition_states
+- connectors
+- connector_bindings
+- organization_feature_scopes
+
+Keep existing proven session tables intact unless migration is very deliberate.
+
+## Worker
+Keep worker independent.
+Do not move node-agent responsibilities into the RDP worker.
+The worker is one service workload. The node-agent is the supervisor/orchestrator on the node.
+
+## Clients
+Do not start final client implementation before the new platform-core backend model is established.