Initial project snapshot
@@ -0,0 +1,148 @@
# Architecture guardrails

These rules are mandatory.

## 1. Preserve the proven session foundation
The following are already proven and must remain stable:
- live FreeRDP connect
- active session state
- terminate
- detach without killing remote session
- reattach without recreating remote session
- takeover without recreating remote session

No architectural refactor may silently weaken this behavior.

## 2. Source of truth
- PostgreSQL is the only durable source of truth for domain state.
- Redis is only for live coordination, routing, heartbeats, leases, attach tokens, and ephemeral cache.

## 3. Control plane vs data plane
Keep them distinct.

### Control plane
- organizations
- users
- memberships
- roles
- resources
- policies
- nodes
- services
- connectors
- cluster membership
- updates
- config distribution

### Data plane
- session streams
- worker traffic
- relay traffic
- connector traffic
- future exit traffic

## 4. Multi-tenancy isolation
Every organization must be isolated by design.

Namespace by organization for:
- resources
- users-in-org
- groups
- policies
- connectors
- sessions
- audit
- secrets references
- Redis keys where applicable

No cross-org leakage of identifiers, data, logs, cache keys, or policy decisions.
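
One way to keep Redis keys inside an organization namespace is to build every key through a single helper. The sketch below is illustrative only: the `org:<id>:<kind>:<name>` layout and the names `OrgKey` and `BelongsToOrg` are assumptions, not the project's actual key scheme.

```go
package main

import (
	"errors"
	"strings"
)

// OrgKey builds a Redis key namespaced by organization.
// The "org:<id>:<kind>:<name>" layout is an assumption made for this sketch.
func OrgKey(orgID, kind, name string) (string, error) {
	for _, part := range []string{orgID, kind, name} {
		// Reject empty segments and characters that could escape the namespace
		// or act as glob patterns in key scans.
		if part == "" || strings.ContainsAny(part, ":*? \t\n") {
			return "", errors.New("orgkey: empty or unsafe key segment")
		}
	}
	return "org:" + orgID + ":" + kind + ":" + name, nil
}

// BelongsToOrg reports whether key sits inside the namespace of orgID.
func BelongsToOrg(key, orgID string) bool {
	return orgID != "" && strings.HasPrefix(key, "org:"+orgID+":")
}
```

Rejecting the separator character in segments is what prevents one organization ID from being a prefix match for another.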

## 5. Customer-managed nodes
Customer-managed nodes:
- may join the common cluster,
- must remain limited to allowed scope,
- must not automatically become general-purpose relay/control nodes for other organizations.

## 6. Node agent design
A node agent:
- is small,
- is stable,
- is always running,
- supervises services,
- downloads signed updates,
- verifies signatures and versions,
- can roll back,
- can restart services,
- can operate on thin nodes and thick nodes.

The agent is not the same as the service workloads.

## 7. Split-brain prevention
Never allow minority partitions to become a second authoritative cluster automatically.

Required states:
- healthy
- degraded
- recovery
- isolated / emergency

Cluster-wide changes, role changes and risky mutations must be restricted in non-quorum states.
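
A minimal sketch of how the four states and the quorum rule might gate mutations. The identifiers are hypothetical, and treating `Healthy` as the only state that permits risky mutations is a conservative assumption; the rule above only requires restriction in non-quorum states.

```go
package main

// PartitionState mirrors the four required states. Identifiers are illustrative.
type PartitionState int

const (
	Healthy PartitionState = iota
	Degraded
	Recovery
	Isolated // "isolated / emergency"
)

// HasQuorum reports whether the reachable voters form a strict majority.
func HasQuorum(reachable, total int) bool {
	return total > 0 && reachable*2 > total
}

// AllowsRiskyMutation gates cluster-wide changes, role changes, and other
// risky mutations. Assumption: only a healthy partition with quorum qualifies.
func AllowsRiskyMutation(s PartitionState, reachable, total int) bool {
	return s == Healthy && HasQuorum(reachable, total)
}
```

With a strict majority, an even split (2 of 4) gives neither side quorum, so neither side can promote itself automatically.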

## 8. Service model
Each node must separate:
- capabilities
- enabled services

Do not encode every function into one monolithic node role.
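
As a rough illustration of that separation, a node can carry two independent sets, with enabling a service allowed only when the matching capability exists. The `Node` and `Service` types and the service names used with them are assumptions for this sketch.

```go
package main

import "fmt"

// Service names a workload kind; the type is illustrative.
type Service string

// Node keeps what it can run apart from what it is currently asked to run.
type Node struct {
	Capabilities map[Service]bool // what the node is able to host
	Enabled      map[Service]bool // what is switched on right now
}

// Enable turns a service on only if the node has the matching capability.
func (n *Node) Enable(s Service) error {
	if !n.Capabilities[s] {
		return fmt.Errorf("node lacks capability %q", s)
	}
	if n.Enabled == nil {
		n.Enabled = make(map[Service]bool)
	}
	n.Enabled[s] = true
	return nil
}
```

Because the two sets are separate, a node can gain or lose an enabled service without its declared capabilities changing, which a single role field cannot express.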

## 9. Security model
Security must be based on:
- strong crypto
- signed artifacts
- node identity
- short-lived user/session tokens
- scoped trust
- audit trails
- revocation
- least privilege

Do not depend on protocol obscurity.

## 10. Migration strategy
Do not force a big-bang rewrite.
Add the platform core around the current system in steps:
1. organization / membership model
2. org-scoped resource model
3. node model and node-agent control interfaces
4. connector model
5. mesh / routing evolution
6. native clients and higher-level features

## 11. Updates and rollback
Updates must support:
- manual or automatic policy
- staged rollout
- canary rollout
- rollback to previous version
- signed artifacts
- optional update mirrors / caches on selected nodes

Thin nodes may download but not store update artifacts.

## 12. Performance and routing awareness
Placement and routing decisions must consider:
- CPU
- RAM
- network load
- active sessions
- connector load
- relay load
- service type
- health score
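
The inputs listed above could be folded into a single placement score. The sketch below is one possible shape, not the platform's scheduler: the `Candidate` fields, the equal weighting, and the `Score`/`Pick` names are all assumptions, and load values are expected in the range 0 to 1.

```go
package main

// Candidate carries the placement inputs for one node. Fields are assumptions.
type Candidate struct {
	ID                                 string
	CPU, RAM, Network, Connector, Relay float64 // utilization, 0..1
	ActiveSessions, MaxSessions        int
	Health                             float64         // 0..1, where 1 is fully healthy
	Services                           map[string]bool // service types the node offers
}

// Score returns -1 for ineligible nodes, otherwise health scaled by headroom.
// Equal weights are used purely for illustration.
func Score(c Candidate, service string) float64 {
	if !c.Services[service] || c.MaxSessions <= 0 || c.ActiveSessions >= c.MaxSessions {
		return -1
	}
	sessions := float64(c.ActiveSessions) / float64(c.MaxSessions)
	load := (c.CPU + c.RAM + c.Network + c.Connector + c.Relay + sessions) / 6
	return c.Health * (1 - load)
}

// Pick returns the ID of the best eligible candidate, if there is one.
func Pick(cands []Candidate, service string) (string, bool) {
	bestID, best := "", -1.0
	for _, c := range cands {
		if s := Score(c, service); s > best {
			bestID, best = c.ID, s
		}
	}
	return bestID, best >= 0
}
```

Service type acts as a hard filter here rather than a weight, so a lightly loaded node that does not offer the service is never chosen.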

## 13. No feature explosion before platform core
Do not jump to:
- full collaboration/video meetings
- advanced media plane
- internet exit mode

before the platform core is modeled correctly.
@@ -0,0 +1,129 @@
# Final platform technical direction (summary)

## Product definition
A distributed secure access platform with:
- multi-tenant organizations
- proven persistent session broker for RDP
- cluster of platform-managed and customer-managed nodes
- node-agent based service fabric
- connector/VPN layer
- future split/full tunnel capability
- future collaboration extensions

## Main top-level domains

### Platform
Owns:
- global policies
- cluster control plane
- platform admins
- node trust
- artifact signing and update policy
- disaster recovery authority

### Organization
Owns:
- users
- groups
- organization admins
- identity sources
- resources
- policies
- connectors
- audits
- quotas
- domains / branding later

### Node
Has:
- node identity
- ownership type (platform-managed, customer-managed)
- capabilities
- enabled services
- health
- update policy
- version state
- partition state

### Node Agent
Small stable agent that:
- keeps running
- supervises services
- downloads signed updates
- verifies integrity
- restarts crashed services
- rolls back if needed
- reports health

### Connector
Reusable network access method:
- direct
- VPN
- relay-backed
- future egress mode

Bound to resources by policy, not duplicated blindly per server.

### Session broker
Already proven for RDP persistent lifecycle.

## Mandatory capabilities

### Multi-tenant
- org isolation
- organization memberships
- user may belong to multiple organizations
- clear org switching UX later
- org admins only see their org

### Identity federation
- local accounts
- LDAP / AD
- OIDC
- group/claim mapping to access

### Resource authorization
- local manual mapping
- external group / claim driven mapping
- feature scopes:
  - RDP only
  - connector/VPN only
  - both
  - future scopes
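
Feature scopes of this kind map naturally onto bit flags, which leaves room for future scopes. This is a sketch under assumptions: `FeatureScope`, the constant names, and `Allows` are hypothetical and not taken from the codebase.

```go
package main

// FeatureScope is a bit set of features a grant covers. Names are illustrative.
type FeatureScope uint8

const (
	ScopeRDP       FeatureScope = 1 << iota // RDP only
	ScopeConnector                          // connector/VPN only
)

// ScopeBoth covers RDP and connector/VPN access together.
const ScopeBoth = ScopeRDP | ScopeConnector

// Allows reports whether every bit in required is present in granted.
// An empty requirement is rejected so that "no scope" never authorizes.
func (granted FeatureScope) Allows(required FeatureScope) bool {
	return required != 0 && granted&required == required
}
```

A future scope would be one more constant in the `iota` block; existing grants keep their meaning.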

### Cluster behavior
- dynamic membership
- encrypted inter-node communication
- no mandatory single center
- quorum-based authority
- degraded / recovery / isolated modes
- manual partition promotion only by highly privileged recovery admin
- multi-hop route support
- not every node needs full mesh

### Updates
- signed artifacts
- canary rollout
- staged rollout
- rollback
- thin node vs artifact-cache node

### Customer-managed nodes
- can join common cluster
- can be scoped to their organization
- can serve ingress / connector / egress functions for that organization
- must not automatically become cluster-global trusted nodes

## What to implement first
- organization model
- memberships and roles
- org-scoped resource model
- identity source model
- node and node-agent control plane model
- service capabilities / enabled services model

## What to delay
- full mesh engine
- full connector scheduler
- internet exit mode
- collaboration/video meetings
- heavy media routing
@@ -0,0 +1,123 @@
C17Z20 is complete.

Installation Authority foundation is also complete:

- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- `installation_authority` and signed `platform_role_grants` are persisted
- strict platform-admin checks ignore direct `users.platform_role` edits unless
  a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
  keys and sign activation manifests; private keys must stay outside the repo

Cluster Authority foundation is now also complete:

- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
  a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
  and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
  `authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
  trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
  approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
  bootstrap, cluster creation, signed join token, real node-agent enrollment,
  owner approval, automatic signed bootstrap polling, authority pin
  persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
  `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
  supervision remains a stub

Node enrollment bootstrap polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
  before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
  the signed bootstrap contract, then persists `node_id`, `identity_status`,
  and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`

Current state:

- C17Z12 added rendezvous/relay control-plane leases for peers that would
  otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
  for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
  relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
  route-health route config only. The synthetic `fabric.route_health` runtime
  now probes the selected effective path, including replacement relay paths,
  and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
  scoring. Drift/unreachable/failure feedback marks the exact selected relay
  stale and can trigger replacement; healthy low-latency route-health boosts
  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
  service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
  synthetic route-health drift/unreachable/failure, it performs a bounded
  node-scoped synthetic-config refresh, applies returned replacement route
  decisions to route-health config immediately, and reports
  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
  type/route, so `synthetic_route_health` is not overwritten by
  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
  `000021_cluster_authority_keys`; the migration now drops/recreates
  `cluster_admin_summaries` in up/down paths.

Runtime report:

- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- Docker-test smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:5174/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:

- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Next narrow layer, if approved:

C17Z21 should tighten route-health feedback refresh dampening: if an immediate
feedback refresh returns the same config version or no replacement change, keep
a per-route/relay no-change cooldown before retrying. Keep the boundary
synthetic/control-plane only and keep RDP/VPN/service payload forwarding
untouched.
@@ -0,0 +1,81 @@
# Target project structure for the next phase

This is the desired direction, not necessarily the current exact repo state.

## Root
- `backend/`
- `workers/rdp-worker/`
- `clients/windows/`
- `clients/linux/`
- `web-admin/`
- `scripts/`
- `docs/`
- `deploy/`
- `CODEX_CONTEXT.md`

## Backend suggested evolution
- `internal/platform/`
  - config
  - runtime
  - logging
  - postgres
  - redis
  - module
  - authn middleware
  - authz middleware
- `internal/modules/`
  - auth
  - organization
  - membership
  - identitysource
  - group
  - resource
  - sessionbroker
  - sessiongateway
  - worker
  - node
  - nodeagent
  - connector
  - audit
  - policy
- `pkg/contracts/`
  - session
  - worker
  - node
  - connector

## New modules to add in next phase
- `organization`
- `membership`
- `identitysource`
- `node`
- `nodeagent`
- `policy` (if policy logic is currently too scattered)

## DB evolution direction
New tables/entities should include:
- organizations
- organization_memberships
- organization_roles
- identity_sources
- identity_mappings
- groups
- group_memberships / external_group_bindings
- nodes
- node_services
- node_capabilities
- node_update_policies
- node_partition_states
- connectors
- connector_bindings
- organization_feature_scopes

Keep existing proven session tables intact unless migration is very deliberate.

## Worker
Keep worker independent.
Do not move node-agent responsibilities into the RDP worker.
The worker is one service workload. The node-agent is the supervisor/orchestrator on the node.

## Clients
Do not start final client implementation before the new platform-core backend model is established.