Initial project snapshot
@@ -0,0 +1,618 @@
# CODEX CONTEXT

## Project identity

This project is a production-grade distributed secure access platform.

It started as a custom RDP proxy with persistent server-side sessions, but the final target architecture is broader:

- distributed secure access fabric
- multi-tenant platform
- session broker for GUI and future non-GUI protocols
- cluster mesh of nodes
- connector/VPN layer
- customer-managed and platform-managed nodes
- node-agent based self-update / rollback / health supervision

## Current proven foundation

The current codebase already proved the most risky low-level lifecycle assumptions for RDP:

- real FreeRDP connect works
- session state transitions to active work
- terminate works
- detach works without killing the remote session
- reattach works without recreating the remote session
- takeover works without recreating the remote session
- per-resource certificate verification policy exists
  - `certificate_verification_mode = strict | ignore`
  - `strict` is default
  - `ignore` works on a per-resource basis
- worker build is reproducible
- backend build is reproducible

This proven lifecycle must NOT be broken by future architecture work.

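
The per-resource certificate policy listed above can be sketched as follows. This is a minimal illustration, not the project's actual code: the `Resource` struct, the `ResolveCertMode` function, and the error name are hypothetical. Only the `strict` / `ignore` values and the `strict` default come from this document.

```go
package main

import (
	"errors"
	"fmt"
)

// CertVerificationMode mirrors the documented values of
// certificate_verification_mode. The Go names are illustrative.
type CertVerificationMode string

const (
	CertVerifyStrict CertVerificationMode = "strict"
	CertVerifyIgnore CertVerificationMode = "ignore"
)

// ErrUnknownCertMode is returned for any value other than strict or ignore.
var ErrUnknownCertMode = errors.New("unknown certificate_verification_mode")

// Resource is a hypothetical resource record; only the field relevant to
// certificate policy is shown.
type Resource struct {
	Name                        string
	CertificateVerificationMode string // empty means "not set"
}

// ResolveCertMode returns the effective mode for one resource.
// An unset value falls back to strict. There is deliberately no global
// override parameter: ignore can only come from the resource itself.
func ResolveCertMode(r Resource) (CertVerificationMode, error) {
	switch CertVerificationMode(r.CertificateVerificationMode) {
	case "", CertVerifyStrict:
		return CertVerifyStrict, nil
	case CertVerifyIgnore:
		return CertVerifyIgnore, nil
	default:
		return "", fmt.Errorf("%w: %q on resource %q",
			ErrUnknownCertMode, r.CertificateVerificationMode, r.Name)
	}
}

func main() {
	resources := []Resource{
		{Name: "prod-rdp"},
		{Name: "lab-rdp", CertificateVerificationMode: "ignore"},
	}
	for _, r := range resources {
		mode, err := ResolveCertMode(r)
		fmt.Println(r.Name, mode, err)
	}
}
```

Keeping the decision inside a function that takes a single resource is one way to make a global "ignore" switch hard to introduce by accident.
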
## Current architecture baseline

Current audit and baseline snapshot:

- `docs/audits/PROJECT_AUDIT_2026-04-26.md`
- `docs/audits/CURRENT_BASELINE_MATRIX.md`

### Test environment
- Canonical test Docker host: `192.168.200.61`
- Canonical Docker context: `test-ubuntu`
- Canonical SSH alias: `docker-test`
- Backend API for local/client smoke runs: `http://192.168.200.61:8080/api/v1`
- WebSocket gateway for local/client smoke runs: `ws://192.168.200.61:8080/api/v1/gateway/ws`
- Stage C17 planning is completed.
- C17A synthetic mesh runtime skeleton is implemented and test-proven in
  `rap-node-agent` only. It is disabled by default and carries synthetic
  `fabric.probe` / `fabric.probe_ack` messages only.
- C17B route health and failover probes are implemented and test-proven in
  `rap-node-agent` only. They are disabled by default and carry synthetic
  `fabric.route_health` / `fabric.route_health_ack` messages only.
- C17C relay semantic hardening is implemented and test-proven in
  `rap-node-agent` only. It is disabled by default and models synthetic
  per-channel queues/QoS/backpressure only.
- C17D non-production test-service path is implemented and test-proven in
  `rap-node-agent` only. It is disabled by default and carries only bounded
  `synthetic.echo` test payloads.
- C17E/C17F/C17G are implemented and proven for live synthetic HTTP transport,
  scoped synthetic route config, and Control Plane scoped synthetic config
  consumption.
- C17H deployed multi-agent synthetic config smoke is runtime-proven on
  `docker-test`: five running `rap-node-agent` containers consume
  backend-issued node-scoped synthetic config, direct and single-relay
  synthetic route-health observations return to the Control Plane, and
  production forwarding remains disabled.
- C17I production forwarding gate foundation is implemented and test-proven:
  `rap-node-agent` has an explicit production-forwarding gate, while
  `/mesh/v1/forward` still refuses production payload forwarding until a later
  approved runtime stage.
- C17J production envelope contract is implemented and test-proven:
  `/mesh/v1/forward` validates route-bound production envelopes for
  `fabric_control` / `fabric.control` only when the gate is enabled, rejects
  service channels, and still refuses production forwarding.
- C17K production envelope observation is implemented and test-proven:
  valid accepted envelopes can be observed locally as metadata-only records
  after validation; rejected envelopes are not observed, observation failure
  fails closed, and production forwarding remains unavailable.
- C17L bounded production observation sink is implemented and test-proven:
  accepted metadata-only observations can be retained locally with fixed
  capacity, oldest-entry drop behavior, and no payload body storage.
- C17M production observation sink wiring is implemented and test-proven:
  node-agent can wire the bounded local metadata-only sink when
  `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is explicitly greater than
  zero; the wiring is disabled by default and exposes no read API.
- C17N production observation sink metrics are implemented and test-proven:
  local sink metrics expose only capacity, current depth, accepted total, and
  dropped-oldest total; they expose no observation records or payload metadata.
- C17O production observation sink local metrics logging is implemented and
  test-proven: node-agent logs aggregate sink metrics locally when the sink is
  explicitly enabled; no read API or Control Plane reporting is added.
- C17P production observation sink change-driven metrics logging is implemented
  and test-proven: node-agent suppresses repeated identical local sink metrics
  logs; no read API or Control Plane reporting is added.
- C17Q production forwarding gate/runtime log boundary is implemented and
  test-proven: node-agent logs production forwarding gate state separately from
  production forwarding runtime state. Runtime state remained false until
  C17Z introduced gate-controlled `fabric.control` direct forwarding.
- C17R production observation sink capacity guard is implemented and
  test-proven: `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is rejected
  above `10000`.
- C17S production observation panic fail-closed hardening is implemented and
  test-proven: observer errors and observer panics both fail closed as
  observation failure.
- C17T production envelope payload boundary is implemented and test-proven:
  validated production `fabric.control` envelope payloads are bounded to
  `4096` bytes and oversized envelopes are rejected before observation.
- C17U production envelope created-at skew boundary is implemented and
  test-proven: validated production `fabric.control` envelopes whose
  `created_at` is more than one minute in the future are rejected before
  observation.
- C17V peer endpoint candidate model is implemented and test-proven:
  node-scoped synthetic mesh config now carries route-scoped endpoint
  candidates with transport, address, reachability, NAT type, connectivity
  mode, priority, policy tags, verification time, and metadata. This is a
  model/config boundary only; no production route scoring, NAT traversal,
  shortcut routing, or forwarding runtime is implemented.
- C17W peer endpoint candidate scoring model is implemented and test-proven:
  `rap-node-agent` can rank already-scoped endpoint candidates using soft
  inputs such as transport, reachability, connectivity mode, NAT type,
  priority, region, policy tags, channel class, and verification age. This is
  a scoring helper only; it does not open connections, choose production
  routes, or forward payloads.
- C17X health-aware endpoint candidate scoring overlay is implemented and
  test-proven: endpoint candidate scoring can optionally use local health
  observations keyed by `endpoint_id`, including latency, success/failure
  history, recent failure reason, reliability score, and observation freshness.
  This remains advisory scoring only and is not wired into production route
  execution.
- C17Y Platform Owner synthetic mesh visibility is implemented and
  build/test-proven: `web-admin` reads node-scoped synthetic mesh config and
  shows config enabled state, route counts, peer endpoints, endpoint
  candidates, C17X advisory scoring boundary, and `production_forwarding`.
  This remains platform-owner visibility only and does not enable production
  forwarding.
- C17Z production fabric-control direct forwarding boundary is implemented and
  test-proven: when `RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`,
  `/mesh/v1/forward` can deliver valid route-bound `fabric.control` envelopes
  at the local destination or forward them to a direct next hop from explicit
  peer endpoint config. Service channels, arbitrary relay forwarding,
  multi-hop production route execution, and RDP/VPN/file/video/service payloads
  remain unavailable.
- C17Z1 production fabric-control multi-hop route-path boundary is implemented
  and test-proven: production `fabric.control` envelopes can carry
  `route_path` and `visited_node_ids`; relay nodes validate path position,
  forward only to the next path node, update TTL/hop/visited metadata, and
  reject loops. Service payloads remain unavailable.
- C17Z2 production fabric-control forwarding observability boundary is
  implemented and test-proven: node-agent emits local
  `mesh_production_forward_event` logs for accepted, forwarded, delivered, and
  rejected production `fabric.control` envelopes. Logs are metadata-only and
  include no payload bodies or read API.
- C17Z3 production fabric-control route-config boundary is implemented and
  test-proven: when scoped/control-plane mesh routes are available locally,
  production `fabric.control` envelopes must match configured route_id/path/
  next-hop/channel/expiry/TTL/hop limits before forwarding.
- C17Z4 scoped peer directory and recovery seeds boundary is implemented and
  test/build-proven: node-scoped mesh config carries scoped `peer_directory`
  and explicit bounded `recovery_seeds`; node-agent parses/validates them and
  web-admin shows counts.
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven:
  node-agent builds a local `PeerCache`, selects bounded warm peers, probes warm
  peers with `/mesh/v1/health`, and reports metadata-only mesh-link
  observations when synthetic mesh testing is enabled.
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven:
  node-agent reports explicit advertised mesh endpoint metadata in heartbeat,
  and Control Plane projects latest reported endpoints/candidates into
  node-scoped synthetic mesh config.
- C17Z7 private/corporate endpoint candidate boundary is implemented and
  test-proven: node-agent reports multiple advertised endpoint candidates,
  scoring rewards private/corporate same-site candidates, and peer cache can
  use the best candidate address for warm health.
- C17Z8 peer connection state machine boundary is implemented and test-proven:
  node-agent tracks warm-peer states `disconnected`, `connecting`, `ready`,
  `degraded`, and `backoff`, with bounded backoff after repeated health probe
  failures.
- C17Z9 peer recovery planner boundary is implemented and test-proven:
  node-agent targets a bounded stable ready-peer set, enters recovery when
  ready peers fall below target, and selects bounded recovery probes from warm
  peers, recovery seeds, and other connectable scoped peers.
- C17Z10 peer connection intent planner boundary is implemented and
  test-proven: node-agent classifies bounded peer work as maintain/probe/
  recover and classifies transport readiness as direct/private_lan/
  corporate_lan/outbound_only/relay_required, with rendezvous-required
  metadata only.
- C17Z11 peer connection manager runtime boundary is implemented and
  test-proven: node-agent uses a reusable HTTP keep-alive client for real
  control-plane health probes of direct/private/corporate peers and records
  `waiting_rendezvous` for outbound-only/relay-required peers.
- C17Z12 rendezvous/relay control-plane contract is implemented and
  docker-test-runtime-proven: backend issues node-scoped `rendezvous_leases`,
  node-agent resolves matching `waiting_rendezvous` intents into
  `relay_control`, probes relay `/mesh/v1/health`, records and maintains
  `relay_ready`, and keeps service payload forwarding disabled.
- C17Z13 rendezvous lease telemetry is implemented and
  docker-test-runtime-proven: node-agent reports
  `mesh_rendezvous_lease_report` with relay admission, peer admission,
  TTL/renewal posture, `relay_ready`, and explicit no-payload boundary flags;
  web-admin shows `rv leases` in recent heartbeat tables.
- C17Z14 rendezvous lease refresh contract is implemented and
  docker-test-runtime-proven: node-agent refreshes renewal-needed/stale
  rendezvous leases through node-scoped synthetic config reload, updates the
  running peer cache/route/lease state, and reports refresh plus stale relay
  withdrawal/reselection telemetry. Service payload forwarding remains
  unavailable.
- C17Z15 backend relay replacement policy is implemented and
  docker-test-runtime-proven: backend consumes recent stale-relay heartbeat
  feedback, withdraws stale explicit rendezvous leases, scores alternate relay
  candidates from route adjacency, endpoint priority, policy tags, and recent
  mesh-link health, and returns replacement leases plus
  `rendezvous_relay_policy` decisions in node-scoped synthetic config.
  Node-agent reports `c17z15.mesh_rendezvous_lease_report.v1` and keeps stale
  state scoped to the exact lease/relay, so replacement leases for the same
  peer are not marked stale by association. Service payload forwarding remains
  unavailable.
- C17Z16 route/path decision artifact is implemented and
  docker-test-runtime-proven: backend `c17z16.synthetic.v1` config includes
  `route_path_decisions` with original hops, effective hops, local previous/
  next hop, selected replacement relay, generation, score reasons, and
  no-payload boundary flags. Node-agent stores the control-plane route
  generation and reports `c17z16.mesh_route_path_decision_report.v1` plus
  `c17z16.mesh_rendezvous_lease_report.v1`. Service payload forwarding remains
  unavailable.
- C17Z17 node-side route generation tracker is implemented and
  docker-test-runtime-proven: backend `c17z17.synthetic.v1` config and
  node-agent `mesh_route_generation_report` track active/applied/unchanged/
  withdrawn route decisions, generation changes, total counters, and
  `withdrawn_by_replacement` records for stale relay paths when replacement is
  first observed. Service payload forwarding remains unavailable.
- C17Z18 synthetic route-health effective path runtime is implemented and
  docker-test-runtime-proven: backend `c17z18.synthetic.v1` config and
  node-agent `mesh_route_health_config_report` apply Control Plane
  `route_path_decisions` to synthetic route-health route config only. The
  synthetic runtime probes selected effective paths through replacement relays,
  reports expected/observed hops and drift state, and backend latest mesh links
  preserve route-health observations separately from connection-manager
  observations. Service payload forwarding remains unavailable.
- C17Z19 synthetic route-health feedback scoring is implemented and
  docker-test-runtime-proven: backend consumes recent `synthetic_route_health`
  observations in relay scoring, uses drift/unreachable/failure metadata to
  mark the exact selected relay stale, boosts healthy low-latency relay
  candidates, and returns replacement leases/route decisions through the
  existing synthetic config contract. Migration `000022` adds the `synthetic`
  mesh service class. Service payload forwarding remains unavailable.
- C17Z20 node-side route-health feedback refresh is implemented and
  docker-test-runtime-proven: after reporting synthetic route-health
  drift/unreachable/failure, node-agent performs a bounded node-scoped
  synthetic-config refresh, applies returned replacement route decisions to
  route-health config immediately, and reports
  `c17z20.mesh_route_health_feedback_refresh_report.v1`. Service payload
  forwarding remains unavailable.
- Installation Authority foundation is implemented: production requires strict
  Product Root public key config, first-owner bootstrap uses signed Ed25519
  activation manifests, `installation_authority` and signed
  `platform_role_grants` are persisted, and strict platform-admin checks ignore
  direct `users.platform_role` database edits without a valid signed grant.
  Web-admin exposes installation status/first-owner bootstrap, and
  `scripts/installation/product-root-tool.go` generates keys/manifests for
  offline product-root operations.
- Cluster Authority and node enrollment bootstrap are docker-test lifecycle
  smoke-proven in run `dev-bootstrap-20260428-201430`: a fresh dev install
  bootstrapped the first owner, created a cluster, issued a signed join token,
  accepted real `rap-node-agent` enrollment, owner-approved the join request,
  agent-polled signed bootstrap, persisted cluster authority pin, heartbeated,
  and verified signed `c17z18.synthetic.v1` Control Plane config. Production
  service payload forwarding remains unavailable.
- Migration `000021_cluster_authority_keys` drops/recreates
  `cluster_admin_summaries` because fresh replay proved PostgreSQL cannot
  change that view layout via `CREATE OR REPLACE VIEW`.
- `rap-node-agent` desired-workload polling/status reporting is gated by
  `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
  supervision remains a stub.
- C18 VPN/IP tunnel service target design is completed as documentation only.
- C18A VPN/IP tunnel control-plane data model foundation is implemented and
  backend-test-proven.
- C18B VPN/IP tunnel lease/fencing hardening is implemented and
  backend-test-proven.
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is
  implemented and backend-test-proven.
- No next platform-core implementation step is automatically authorized after
  C17Z20. The next mesh layer should stay limited to route-health feedback
  refresh dampening/no-change cooldown unless the user explicitly chooses
  another staged task.
- Latest RDP performance reference image:
  `rap-rdp-worker:rdp-perf6-dirty-region`
- Stage 5.2 file-download runtime artifacts remain preserved for when RDP work
  resumes, but they are not the active next task.
- Do not use `docker.cin.su` for this project unless explicitly requested for a separate one-off check.

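
The C17J, C17T, and C17U envelope boundaries in the list above can be illustrated with a small validation sketch. The `Envelope` struct, its field names, `ValidateEnvelope`, `sampleEnvelope`, and `demoNow` are assumptions made for this example; only the `fabric.control` restriction, the `4096`-byte payload bound, and the one-minute future skew bound come from this document.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	maxProductionPayloadBytes = 4096        // C17T payload bound
	maxCreatedAtFutureSkew    = time.Minute // C17U future skew bound
	fabricControlMessageType  = "fabric.control"
)

var (
	ErrGateDisabled      = errors.New("production forwarding gate is disabled")
	ErrChannelRejected   = errors.New("only fabric.control envelopes are accepted")
	ErrPayloadTooLarge   = errors.New("payload exceeds 4096 bytes")
	ErrCreatedAtInFuture = errors.New("created_at is more than one minute in the future")
)

// Envelope is a hypothetical, reduced envelope shape for illustration only.
type Envelope struct {
	RouteID     string
	MessageType string
	CreatedAt   time.Time
	Payload     []byte
}

// ValidateEnvelope applies the documented boundaries in order and returns
// the first violation. It only validates; it never forwards anything.
func ValidateEnvelope(gateEnabled bool, now time.Time, e Envelope) error {
	if !gateEnabled {
		return ErrGateDisabled
	}
	if e.MessageType != fabricControlMessageType {
		return fmt.Errorf("%w: got %q", ErrChannelRejected, e.MessageType)
	}
	if len(e.Payload) > maxProductionPayloadBytes {
		return fmt.Errorf("%w: got %d", ErrPayloadTooLarge, len(e.Payload))
	}
	if e.CreatedAt.Sub(now) > maxCreatedAtFutureSkew {
		return ErrCreatedAtInFuture
	}
	return nil
}

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2026, 4, 26, 12, 0, 0, 0, time.UTC)

// sampleEnvelope builds a demo envelope created aheadSeconds after now.
func sampleEnvelope(now time.Time, messageType string, payloadLen, aheadSeconds int) Envelope {
	return Envelope{
		RouteID:     "route-demo",
		MessageType: messageType,
		CreatedAt:   now.Add(time.Duration(aheadSeconds) * time.Second),
		Payload:     make([]byte, payloadLen),
	}
}

func main() {
	cases := []Envelope{
		sampleEnvelope(demoNow, "fabric.control", 128, 0),
		sampleEnvelope(demoNow, "synthetic.echo", 128, 0),
		sampleEnvelope(demoNow, "fabric.control", 4097, 0),
		sampleEnvelope(demoNow, "fabric.control", 128, 61),
	}
	for _, e := range cases {
		fmt.Println(e.MessageType, len(e.Payload), ValidateEnvelope(true, demoNow, e))
	}
}
```

Passing `now` in explicitly keeps the skew check testable without touching the system clock.
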
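
The bounded observation sink described in C17L, C17N, and C17R (fixed capacity, oldest-entry drop, aggregate metrics, capacity guard at `10000`) could look roughly like the ring buffer below. All type and function names are hypothetical. The sketch also assumes the caller skips construction entirely when the configured capacity is zero, which matches the disabled-by-default wiring noted in C17M.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const maxSinkCapacity = 10000 // C17R guard

var ErrCapacityOutOfRange = errors.New("sink capacity must be between 1 and 10000")

// Observation holds metadata only; there is intentionally no payload field.
type Observation struct {
	RouteID      string
	MessageType  string
	PayloadBytes int
}

// SinkMetrics exposes aggregates only, never observation records.
type SinkMetrics struct {
	Capacity           int
	Depth              int
	AcceptedTotal      uint64
	DroppedOldestTotal uint64
}

// BoundedSink is a fixed-capacity ring buffer that drops the oldest entry
// when full.
type BoundedSink struct {
	mu       sync.Mutex
	buf      []Observation
	head     int // index of the oldest retained entry
	depth    int
	accepted uint64
	dropped  uint64
}

// NewBoundedSink rejects capacities outside 1..10000.
func NewBoundedSink(capacity int) (*BoundedSink, error) {
	if capacity < 1 || capacity > maxSinkCapacity {
		return nil, fmt.Errorf("%w: got %d", ErrCapacityOutOfRange, capacity)
	}
	return &BoundedSink{buf: make([]Observation, capacity)}, nil
}

// Record stores one observation, overwriting the oldest entry when full.
func (s *BoundedSink) Record(o Observation) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.depth == len(s.buf) {
		s.buf[s.head] = o
		s.head = (s.head + 1) % len(s.buf)
		s.dropped++
	} else {
		s.buf[(s.head+s.depth)%len(s.buf)] = o
		s.depth++
	}
	s.accepted++
}

// Metrics returns a snapshot of the aggregate counters.
func (s *BoundedSink) Metrics() SinkMetrics {
	s.mu.Lock()
	defer s.mu.Unlock()
	return SinkMetrics{
		Capacity:           len(s.buf),
		Depth:              s.depth,
		AcceptedTotal:      s.accepted,
		DroppedOldestTotal: s.dropped,
	}
}

func main() {
	sink, err := NewBoundedSink(2)
	if err != nil {
		panic(err)
	}
	for i := 0; i < 3; i++ {
		sink.Record(Observation{RouteID: "route-demo", MessageType: "fabric.control", PayloadBytes: 64})
	}
	fmt.Printf("%+v\n", sink.Metrics())
}
```

Because the sink has no read method in this sketch, the only thing a caller can learn from it is the four aggregate counters.
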
### Backend
- Go
- PostgreSQL = source of truth
- Redis = live coordination / routing only
- REST for control plane
- WebSocket for live session channel

### Worker
- C++ worker
- FreeRDP integration
- worker runtime hides FreeRDP details from backend
- The C++ worker remains the primary RDP runtime.
- Target RDP performance direction: `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`.
- The RDP performance rewrite scope is limited to C++ RDP service adapter
  internals. It must not redesign backend control plane, cluster transport,
  organizations, leases, or session lifecycle.
- The C# RDP service skeleton is inactive research scaffolding and is not the
  current runtime direction.
- Current RDP Adapter baseline: RDP-Perf-6 dirty-region direct binary rendering
  is completed and smoke-proven on `docker-test`. RDP work is paused by product
  decision; next active work is Fabric Core / cluster foundation.
- P3/P3.1 security-readiness foundation exists: production mode rejects
  plaintext credential-like resource metadata, requires `secret_ref` for
  RDP/VNC/SSH resources, and has an encrypted PostgreSQL-backed resource secret
  storage/resolver MVP. P3.2 direct-worker TLS/PKI guard exists.
- P3.3 production-like test-stand smoke is complete on `docker-test`: backend
  runs in `APP_ENV=production` with a test-only secret key file, a secret-backed
  RDP resource starts real sessions through the resolver path, metadata/audit do
  not contain plaintext credentials, and backend gateway fallback remains
  available when direct worker WSS trust is `smoke_insecure`.
- P3.4 production direct-worker WSS trust model is documented in
  `docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`; it defines
  platform CA/public CA behavior, worker certificate SAN/identity requirements,
  app-local Windows trust direction, rotation/revocation, and the future
  `platform_ca` smoke plan. No RDP runtime behavior changed in P3.4.
- P3.5 app-local platform CA trust is implemented and runtime-proven on
  `docker-test`: Windows client validates direct worker WSS with an app-local
  platform CA bundle, keeps hostname/SAN validation enabled, selects
  `direct_worker_wss` without insecure TLS bypass, and falls back to backend
  gateway for unknown CA / smoke-only production cases.
- P3.6 stale Redis worker/live event idempotency is implemented and
  runtime-proven: stale worker events for terminal PostgreSQL sessions are
  ignored, backend restart survives stale Redis events, and terminal sessions
  are not reopened.
- Stage 5.2 server-to-client file download core data path is runtime-proven:
  direct worker WSS and backend gateway fallback both download text/binary
  files from `RAP_Transfers\ToClient` with matching size/hash, and direct
  policy blocking is proven for `disabled` and `client_to_server`. Lifecycle
  blocking is also runtime-proven for detach, old-client takeover, and worker
  failure. Runtime report:
  `artifacts/stage5-2-file-download-runtime-report.md`.
- Stage 5.2 is not fully accepted yet. Remaining proof: Windows desktop UI
  download path and regression matrix for rendering/input/clipboard/upload/
  reconnect/takeover.

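
The P3.6 rule (stale worker events must not reopen sessions that PostgreSQL already records as terminal) can be sketched as a pure decision function. The state names, the `WorkerEvent` struct, and `ApplyWorkerEvent` are hypothetical; the real backend's session states and event shapes are not shown in this document.

```go
package main

import "fmt"

// SessionState values are illustrative; the real state names may differ.
type SessionState string

const (
	StateActive     SessionState = "active"
	StateDetached   SessionState = "detached"
	StateTerminated SessionState = "terminated"
	StateFailed     SessionState = "failed"
)

// IsTerminal reports whether a persisted state must never be left again.
func (s SessionState) IsTerminal() bool {
	return s == StateTerminated || s == StateFailed
}

// WorkerEvent is a hypothetical live event read from Redis.
type WorkerEvent struct {
	SessionID string
	NewState  SessionState
}

// ApplyWorkerEvent decides what to persist. The persisted state stands in
// for the PostgreSQL row, which is treated as the source of truth: if it is
// terminal, the event is ignored no matter how often it is replayed.
func ApplyWorkerEvent(persisted SessionState, ev WorkerEvent) (SessionState, bool, string) {
	if persisted.IsTerminal() {
		reason := fmt.Sprintf("session %s is already %s; stale event ignored", ev.SessionID, persisted)
		return persisted, false, reason
	}
	return ev.NewState, true, "applied"
}

func main() {
	stale := WorkerEvent{SessionID: "s-1", NewState: StateActive}
	next, applied, reason := ApplyWorkerEvent(StateTerminated, stale)
	fmt.Println(next, applied, reason)

	live := WorkerEvent{SessionID: "s-2", NewState: StateDetached}
	next, applied, reason = ApplyWorkerEvent(StateActive, live)
	fmt.Println(next, applied, reason)
}
```

Because the function has no side effects, replaying the same stale event after a backend restart yields the same "ignored" result each time.
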
### Clients
- future native clients:
  - Windows: native desktop client first
  - Linux: native desktop client later
- web UI is admin/control plane, not the primary power-user client

## Final architecture direction

The long-term target architecture is documented in:

- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`
- `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md`
- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md`

`SECURE_ACCESS_FABRIC_TARGET.md` defines the target Secure Access Fabric architecture only. It is not the current implementation scope and must not be used as permission to start mesh, VPN, multi-cluster, updater, or realtime data-plane migration work without an explicit staged prompt.

`CLUSTER_NODE_ADMIN_FOUNDATION.md` defines the next platform-core planning
baseline for clusters, node enrollment, native node-agent identity, platform
admin console, multi-cluster administration, and future organization admin
visibility. It is a staged foundation document, not permission to implement
mesh packet routing or VPN runtime.

`WEB_INGRESS_AND_ADMIN_UI_MODEL.md` defines WEB as HTTP/HTTPS ingress and
Admin UI presentation only. Cluster configuration remains under Control Plane
ownership through scoped APIs, PostgreSQL source-of-truth mutations, and audit.
Dynamic pages must be safe schema-driven projections and must not embed
internal topology, peer caches, route caches, secrets, raw credentials, or
arbitrary executable code.

Admin endpoint placement is explicit. Fabric Storage / Config Storage nodes do
not automatically host or move the cluster panel. Platform Owner Console
remains global platform-owner scope. Cluster Admin Endpoint requires explicit
admin/web ingress role assignment, cluster health/trust readiness, and Control
Plane authorization. Organization Admin Panel remains a tenant-safe projection.

The final platform must support:

1. Multi-tenancy / Organizations
   - platform has many organizations
   - each organization has isolated users, groups, resources, policies, audit, connectors
   - users may belong to multiple organizations
   - organization admins only see their organization
   - platform admins see platform scope

2. Identity federation
   - local users
   - LDAP / Active Directory
   - OIDC
   - future extensibility for more identity sources
   - access mappings based on external groups / claims

3. Cluster of nodes
   - no mandatory single central node
   - many nodes across many sites
   - nodes can be platform-managed or customer-managed
   - customer-managed nodes are sandboxed cluster participants, not full cluster owners

4. Node agent
   - small stable always-running agent on every node
   - supervises services
   - downloads updates
   - verifies signed artifacts
   - can roll back to the previous version
   - can restart crashed services
   - can work on thin or thick nodes

5. Service-based node model
   Each node is not monolithic.
   A node has:
   - capabilities: what it can do physically/technically
   - enabled services: what it is allowed/assigned to do

   Possible services include:
   - ingress-gateway
   - mesh-router
   - relay
   - connector-host
   - vpn-adapter
   - session-worker
   - media-relay
   - file-relay
   - update-cache
   - config-replica
   - audit-sink
   - metrics-exporter

6. Cluster mesh and routing
   - encrypted inter-node communication
   - dynamic topology
   - no need for full mesh
   - multi-hop routing allowed
   - route failover
   - client failover between ingress nodes
   - connector failover between nodes

7. Split-brain prevention
   - quorum-based cluster behavior
   - minority partition must not become a second authoritative cluster
   - degraded / recovery / isolated modes
   - manual recovery / promote decision by platform recovery admin

8. Connector / VPN layer
   - connectors are reusable network access methods
   - one connector may be used by multiple resources
   - connector placement and failover are controlled by policy
   - nodes may be allowed or disallowed to host connectors
   - direct access, VPN, relay and future egress modes must fit this model

9. Future exit mode
   - split tunnel
   - full tunnel
   - internet access through cluster
   - not first implementation priority

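
The split-brain rule in item 7 rests on a strict-majority check. The sketch below shows one common way to express it. The mode names `normal`, `degraded`, and `isolated`, and the mapping from voter counts to modes, are assumptions for illustration; the document does not define how the real cluster chooses its modes, and the `recovery` mode and manual promote decision are not modeled here.

```go
package main

import (
	"errors"
	"fmt"
)

// ClusterMode values are illustrative labels, not the project's real enum.
type ClusterMode string

const (
	ModeNormal   ClusterMode = "normal"
	ModeDegraded ClusterMode = "degraded"
	ModeIsolated ClusterMode = "isolated"
)

var ErrInvalidMembership = errors.New("invalid voter counts")

// HasQuorum reports whether reachable voters (including the local node)
// form a strict majority of all voters. An even split is not a majority.
func HasQuorum(reachable, total int) bool {
	return total > 0 && reachable > total/2
}

// EvaluatePartition maps voter counts to a mode. A partition without a
// strict majority is isolated and must not act as an authoritative cluster.
func EvaluatePartition(reachable, total int) (ClusterMode, error) {
	if total <= 0 || reachable < 0 || reachable > total {
		return "", fmt.Errorf("%w: reachable=%d total=%d", ErrInvalidMembership, reachable, total)
	}
	switch {
	case reachable == total:
		return ModeNormal, nil
	case HasQuorum(reachable, total):
		return ModeDegraded, nil
	default:
		return ModeIsolated, nil
	}
}

func main() {
	for _, c := range [][2]int{{5, 5}, {3, 5}, {2, 5}, {2, 4}} {
		mode, err := EvaluatePartition(c[0], c[1])
		fmt.Println(c[0], "of", c[1], "->", mode, err)
	}
}
```

With a strict majority, two disjoint partitions can never both pass the check, which is what keeps a minority side from becoming a second authority.
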
## Non-negotiable design rules

- Do not rewrite proven session lifecycle carelessly.
- Do not turn Redis into a source of truth.
- Do not make certificate-ignore a global worker setting.
- Do not make customer-managed nodes platform-wide trusted by default.
- Do not create a separate cluster per organization.
- Do not assume a single permanently reachable central node.
- Do not rely on “secret protocol with no docs” as security.
- Security must come from crypto, auth, isolation, policy and observability.
- Prefer incremental evolution from current proven system.
- Do not collapse platform control plane and data plane into one vague layer.

## Implementation strategy

The codebase must evolve in phases.

Current implementation focus remains:
- RDP work is paused by product decision
- preserve the accepted RDP Adapter baseline and Stage 5.x file-transfer work
- do not delete or rewrite the current RDP MVP while platform-core work starts
- C1-C9 platform-core foundations are implemented and verified: clusters,
  node enrollment, node-agent scaffold, platform admin console, workload
  supervision contract, mesh control-plane prep, mesh skeleton, multi-cluster
  hardening, and organization admin foundation
- C10 Fabric Core configuration distribution design is completed
- C11 signed scoped cluster snapshot model is completed
- C12 node local state store is completed
- C13 Fabric Storage / Config Storage service foundation is completed
- C14 peer directory and cache model is completed
- C15 Fabric Routing Engine skeleton is completed
- C16 secure node-to-node channel lifecycle is completed
- C17 mesh routing runtime implementation plan is completed
- C17A synthetic mesh runtime skeleton is implemented and test-proven with
  synthetic fabric messages only, no RDP/VPN/production service traffic
- C17B route health and failover probes are implemented and test-proven with
  synthetic traffic only, no RDP/VPN/production service traffic
- C17C relay semantic hardening is implemented and test-proven with synthetic
  channel classes only, no RDP/VPN/production service traffic
- C17D non-production test-service path is implemented and test-proven with
  bounded `synthetic.echo` traffic only, no RDP/VPN/production service traffic
- C17E live node-to-node synthetic HTTP transport is implemented and
  smoke-proven with synthetic traffic only
- C17F scoped synthetic route config loading and route-health reporting is
  implemented and smoke-proven with synthetic traffic only
- C17G Control Plane scoped synthetic config read/consume is implemented and
  test-proven with synthetic traffic only
- C17H deployed multi-agent synthetic config smoke is implemented and
  runtime-proven on `docker-test` with synthetic traffic only
- C17I production forwarding gate foundation is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17J production envelope contract validation is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17K production envelope observation is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17L bounded production observation sink is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17M production observation sink wiring is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17N production observation sink metrics are implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17O production observation sink local metrics logging is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17P production observation sink change-driven metrics logging is implemented
|
||||
and test-proven; production forwarding remains unavailable
|
||||
- C17Q production forwarding gate/runtime log boundary is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17R production observation sink capacity guard is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17S production observation panic fail-closed hardening is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17T production envelope payload boundary is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17U production envelope created-at skew boundary is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17V peer endpoint candidate model and NAT/connectivity hints are
|
||||
implemented and test-proven; production forwarding remains unavailable
|
||||
- C17W peer endpoint candidate scoring model is implemented and test-proven;
|
||||
production forwarding remains unavailable
|
||||
- C17X health-aware endpoint candidate scoring overlay is implemented and
|
||||
test-proven; production forwarding remains unavailable
|
||||
- C17Y Platform Owner synthetic mesh visibility is implemented and
|
||||
build/test-proven; production forwarding remains unavailable
- C17Z production fabric-control direct forwarding is implemented and
  test-proven; production service traffic remains unavailable
- C17Z1 production fabric-control multi-hop route-path forwarding is
  implemented and test-proven; production service traffic remains unavailable
- C17Z2 production fabric-control forwarding observability is implemented and
  test-proven; production service traffic remains unavailable
- C17Z3 production fabric-control route-config boundary is implemented and
  test-proven; production service traffic remains unavailable
- C17Z4 scoped peer directory/recovery seed boundary is implemented and
  test/build-proven; production service traffic remains unavailable
- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven;
  production service traffic remains unavailable
- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven;
  production service traffic remains unavailable
- C17Z7 private/corporate endpoint candidate boundary is implemented and
  test-proven; production service traffic remains unavailable
- C17Z8 peer connection state machine boundary is implemented and test-proven;
  production service traffic remains unavailable
- C17Z9 peer recovery planner boundary is implemented and test-proven;
  production service traffic remains unavailable
- C17Z10 peer connection intent planner boundary is implemented and
  test-proven; production service traffic remains unavailable
- C17Z11 peer connection manager runtime boundary is implemented and
  test-proven; production service traffic remains unavailable
- C17Z12 rendezvous/relay control-plane contract is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z13 rendezvous lease telemetry is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z14 rendezvous lease refresh contract is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z15 backend relay replacement policy is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z16 route/path decision artifact is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z17 node-side route generation tracker is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z18 synthetic route-health effective path runtime is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z19 synthetic route-health feedback scoring is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- C17Z20 node-side route-health feedback refresh is implemented and
  docker-test-runtime-proven; production service traffic remains unavailable
- Cluster Authority plus node enrollment bootstrap polling are docker-test
  lifecycle-smoke-proven; fresh install migration replay is fixed for
  `cluster_admin_summaries`
- C18 VPN/IP tunnel service target design is completed as documentation only
- C18A VPN/IP tunnel control-plane data model foundation is implemented and
  backend-test-proven
- C18B VPN/IP tunnel lease/fencing hardening is implemented and
  backend-test-proven
- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is
  implemented and backend-test-proven
- Version Storage / Update Repository is documented as a future Fabric Core
  service for signed release manifests, OS/arch artifacts,
  stable/current/candidate channels, update-cache mirroring, node-agent
  update supervision, rollback, and explicit data-structure migration bundles.
  Runtime updater behavior is not implemented.
- no next platform-core implementation step is automatically authorized after
  C17Z20; choose the next narrow staged prompt explicitly before continuing
- preserve the proven RDP lifecycle behavior
- keep the current backend gateway available as the active/fallback implementation path

The current phase is NOT:
- full mesh routing implementation
- full VPN orchestration
- multi-cluster runtime traffic handling
- production data-plane migration
- updater runtime
- video meetings
- final native client UI redesign

Future mesh, VPN, multi-cluster, node-agent updater, and production realtime data-plane work must be introduced only through explicit, narrow, staged implementation prompts.

Always keep the project production-oriented. Do not simplify it into a toy app.