rdp-proxy/docs/architecture/ARCHITECTURE_GUARDRAILS.md
Working version, but the speed is 10 Mbit/s
2026-05-22 21:46:49 +03:00


# Architecture Guardrails
Status: architecture guardrails, documentation only.
This file exists so architecture documents have a stable guardrails reference
inside `docs/architecture`. The operational Codex guardrails remain in
`docs/codex/ARCHITECTURE_GUARDRAILS.md`.
Transport clarification: references in this document to direct worker WSS and
backend gateway fallback belong to the preserved historical RDP service
baseline. They are not the active source of truth for inter-node transport.
Current fabric node-to-node transport is QUIC-only and is defined by
`docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md`,
`docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md`, and
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`.
Node survivability, recovery overlap, and no-manual-access repair rules are
defined by `docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md`.
## 1. Preserve the Proven RDP Baseline
The following are already proven and must remain stable:
- live FreeRDP connect
- active session state
- terminate
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- historical direct worker WSS RDP path
- historical backend gateway fallback for the RDP baseline
- C++ RDP Adapter as the active RDP runtime
Architecture clarification must not silently weaken this behavior.
## 2. Source of Truth
PostgreSQL is the only durable source of truth for domain state.
Redis is live coordination only. It may hold leases, heartbeats, routing hints,
attach tokens, short-lived tokens, and ephemeral cache. It must not become a
durable source of truth for sessions, organizations or organization data,
policies, cluster trust, peer topology, durable configuration, route authority,
or node identity.
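
The split above can be enforced mechanically with an allow-list. The following is a minimal sketch, not project code: the `Store` enum, the state-kind strings, and the `allowed_store` helper are hypothetical names chosen for illustration.

```python
from enum import Enum


class Store(Enum):
    POSTGRES = "postgres"  # durable source of truth
    REDIS = "redis"        # live coordination only


# Hypothetical state kinds derived from the two lists in this section.
LIVE_ONLY = {"lease", "heartbeat", "routing_hint", "attach_token",
             "short_lived_token", "ephemeral_cache"}
DURABLE = {"session", "organization", "policy", "cluster_trust",
           "peer_topology", "durable_config", "route_authority",
           "node_identity"}


def allowed_store(kind: str) -> Store:
    """Return the only store that may hold authoritative data of this kind."""
    if kind in DURABLE:
        return Store.POSTGRES
    if kind in LIVE_ONLY:
        return Store.REDIS
    # Unknown kinds fail closed: refuse instead of guessing a store.
    raise ValueError(f"unclassified state kind: {kind!r}")
```

Failing closed on unclassified kinds means a new kind of state has to be classified explicitly before anything can write it.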
## 3. Fabric Core Before Mesh Runtime
RAP Fabric Core is the lower distributed runtime foundation above the host OS.
Fabric Core owns:
- native `rap-node-agent` identity
- enrollment
- local node state
- capability reporting
- role assignment consumption
- signed scoped configuration snapshots
- update trust
- service supervision boundary
Mesh runtime traffic must not be implemented before node identity, enrollment,
role assignment, scoped config distribution, and node-local state are
trustworthy.
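
The ordering rule can be expressed as a gate that mesh traffic must pass. This is an illustrative sketch only; `FabricCoreStatus` and its field names are assumptions, and how each flag would actually be established is outside its scope.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class FabricCoreStatus:
    # One flag per prerequisite named in this section (hypothetical names).
    node_identity_ok: bool
    enrollment_ok: bool
    role_assignment_ok: bool
    scoped_config_ok: bool
    local_state_ok: bool


def missing_prerequisites(status: FabricCoreStatus) -> list[str]:
    """List every prerequisite that is not yet trustworthy."""
    return [f.name for f in fields(status) if not getattr(status, f.name)]


def mesh_runtime_allowed(status: FabricCoreStatus) -> bool:
    """Mesh traffic is allowed only when no prerequisite is missing."""
    return not missing_prerequisites(status)
```

Returning the list of missing prerequisites, not just a boolean, makes it easier to report why mesh traffic is being refused.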
## 4. Node Identity and Service Workloads
A node is a host-level identity managed by native `rap-node-agent`.
Service workloads are separate from node identity. They may be containerized or
native, but containers are packaging/isolation boundaries only.
Capabilities are not permissions. Role assignment must be explicit per cluster
and, when needed, per organization.
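
The difference between a capability and a permission is sketched below. All names are hypothetical. Treating an assignment without an organization as cluster-wide is an assumption of the sketch, not something this document defines.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class RoleAssignment:
    node_id: str
    cluster_id: str
    role: str
    # Assumption: None means the assignment is not limited to one organization.
    organization_id: Optional[str] = None


def may_run(node_id: str, capabilities: set[str],
            assignments: Iterable[RoleAssignment],
            cluster_id: str, role: str,
            organization_id: Optional[str] = None) -> bool:
    """A role may run only if the node can do it AND was explicitly assigned it."""
    if role not in capabilities:
        return False  # the node cannot perform the role at all
    for a in assignments:
        if (a.node_id, a.cluster_id, a.role) != (node_id, cluster_id, role):
            continue
        if a.organization_id is None or a.organization_id == organization_id:
            return True
    return False  # capable, but never assigned: capability alone grants nothing
```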
## 5. Routing Ownership
Routing is owned by the Fabric layer, not individual Service Adapters.
RDP, VNC, SSH, VPN, video, and file services may request a destination node,
resource target, egress node, or egress pool. The Fabric Routing Engine chooses
the path.
Routing decisions must not depend on live backend availability. They use
node-local state, signed scoped snapshots, peer cache, route cache, and policy.
Service Adapters must not select routes, discover peers or mesh topology,
manage mesh connections, or implement multi-hop route selection, shortcut
creation, mesh failover, partition recovery, or cross-cluster routing policy.
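
One way to picture this boundary: a Service Adapter hands over a request that names a target, and the engine answers from node-local data. The sketch below is hypothetical throughout (`RouteRequest`, `CachedRoute`, `FabricRoutingEngine`). Picking the lowest-latency allowed route is a placeholder policy, not the project's selection algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RouteRequest:
    """What a Service Adapter may ask for: a target, never a path."""
    service: str
    destination_node: str
    egress_pool: Optional[str] = None


@dataclass(frozen=True)
class CachedRoute:
    hops: tuple[str, ...]   # ordered node ids, ending at the destination
    latency_ms: float
    policy_allowed: bool    # result of a policy check done elsewhere


class FabricRoutingEngine:
    """Chooses paths from a node-local route cache; makes no backend call."""

    def __init__(self, route_cache: dict[str, list[CachedRoute]]):
        self._route_cache = route_cache

    def choose_path(self, request: RouteRequest) -> Optional[CachedRoute]:
        candidates = [r for r in self._route_cache.get(request.destination_node, [])
                      if r.policy_allowed]
        # Placeholder policy: lowest latency among the allowed routes.
        return min(candidates, key=lambda r: r.latency_ms, default=None)
```

The adapter never sees the hop list before the engine returns it, which keeps route selection in one place.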
## 6. Need-to-Know Configuration
Nodes should be small, fast, and scoped.
A node receives only the configuration required for its cluster membership,
assigned role, service workload, and organization scope. It must not store full
cluster topology, unrelated organization data, unrelated storage shards, peer
caches outside its scope, or secrets it does not need.
Secrets must be delivered only through approved resolvers and only at runtime
when needed.
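
Need-to-know scoping amounts to filtering a configuration snapshot before it reaches the node. The sketch below assumes a hypothetical entry layout with `cluster_id`, optional `role`, and optional `organization_id` keys. Signing and secret resolution are deliberately left out.

```python
def scope_snapshot(entries, cluster_id, roles, organization_ids):
    """Keep only entries that match the node's cluster, roles, and organizations.

    An entry whose role or organization_id is missing or None is treated as
    not restricted on that dimension (an assumption of this sketch).
    """
    scoped = []
    for entry in entries:
        if entry["cluster_id"] != cluster_id:
            continue
        role = entry.get("role")
        if role is not None and role not in roles:
            continue
        org = entry.get("organization_id")
        if org is not None and org not in organization_ids:
            continue
        scoped.append(entry)
    return scoped
```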
## 7. Fabric Storage Boundaries
Fabric Storage / Config Storage is a future distribution and cache layer, not a
new source of truth.
The storage service must not:
- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store full cluster or organization data on every node
- expose arbitrary query capabilities
- bypass organization and cluster isolation
## 8. Multi-Tenancy Isolation
Every organization must be isolated by design.
Namespace and authorize:
- resources
- users-in-organization
- groups
- policies
- connectors
- sessions
- service endpoints
- audit
- secret references
- storage/cache scopes
- Redis keys where applicable
Organizations must not see intermediate mesh topology, other organizations'
routes, peer caches, nodes, storage shards, secrets, or platform trust
internals.
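
For Redis keys, namespacing can be as simple as building every key through one helper. The key layout below (`rap:<cluster>:org:<organization>:<kind>:<name>`) is a hypothetical convention used only for illustration, not the project's actual scheme.

```python
import re

# Components may not contain ':' so a value cannot forge another namespace.
_SAFE_COMPONENT = r"[A-Za-z0-9_-]+"


def org_key(cluster_id: str, organization_id: str, kind: str, name: str) -> str:
    """Build a Redis key that is scoped to one cluster and one organization."""
    for part in (cluster_id, organization_id, kind, name):
        if not re.fullmatch(_SAFE_COMPONENT, part):
            raise ValueError(f"unsafe key component: {part!r}")
    return f"rap:{cluster_id}:org:{organization_id}:{kind}:{name}"


def belongs_to(key: str, cluster_id: str, organization_id: str) -> bool:
    """Check that a key sits inside the given organization's namespace."""
    return key.startswith(f"rap:{cluster_id}:org:{organization_id}:")
```

The trailing `:` in the prefix check matters: without it, organization `acm` would match keys that belong to `acme`.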
## 9. Multi-Cluster Boundaries
A platform may manage multiple clusters, but clusters do not automatically
trust each other and do not form one shared mesh by default.
Cross-cluster routing requires explicit trust and policy.
Cluster-scoped identities, certificates, tokens, storage namespaces, and
policies are required. A node may participate in multiple clusters only through
isolated memberships.
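
The default-deny stance between clusters can be sketched as a lookup against explicit grants. `TrustGrant` and its fields are hypothetical. Modelling grants as directed and per-service is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class TrustGrant:
    source_cluster: str
    target_cluster: str
    services: frozenset[str]  # services the grant covers


def cross_cluster_route_allowed(source: str, target: str, service: str,
                                grants: Iterable[TrustGrant]) -> bool:
    """Deny cross-cluster routing unless an explicit grant covers it."""
    if source == target:
        # Not a cross-cluster decision; in-cluster policy still applies elsewhere.
        return True
    return any(g.source_cluster == source
               and g.target_cluster == target
               and service in g.services
               for g in grants)
```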
## 10. Split-Brain Prevention
Never allow minority partitions to become a second authoritative cluster
automatically.
Cluster-wide changes, role changes, trust changes, node approvals, policy
mutation, partition promotion, and cross-cluster trust must be restricted in
non-quorum or degraded states.
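
A minimal sketch of the restriction follows. It assumes quorum means a strict majority of voting members, which this document does not define. The operation names are hypothetical labels for the changes listed above.

```python
# Hypothetical labels for the restricted changes named in this section.
RESTRICTED_OPERATIONS = frozenset({
    "cluster_wide_change", "role_change", "trust_change", "node_approval",
    "policy_mutation", "partition_promotion", "cross_cluster_trust",
})


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Assumption: quorum is a strict majority of all voting members."""
    return reachable_voters > total_voters // 2


def operation_allowed(operation: str, reachable_voters: int,
                      total_voters: int, degraded: bool = False) -> bool:
    """Restricted operations need quorum and a non-degraded cluster."""
    if operation not in RESTRICTED_OPERATIONS:
        return True
    return has_quorum(reachable_voters, total_voters) and not degraded
```

With a strict majority, two halves of an evenly split cluster both fail the check, so neither side can promote itself.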
## 11. Control Plane vs Data Plane
Control plane owns durable state and policy:
- organizations
- users
- memberships
- roles
- resources
- policies
- nodes
- cluster membership
- service assignments
- connector/VPN desired state
- updates
- config distribution
- audit
Data plane carries authorized traffic:
- session streams
- worker traffic
- relay traffic
- connector traffic
- future VPN/IP tunnel traffic
Do not collapse control plane and data plane into one vague layer.
## 12. Updates and Trust
Updates must support:
- Version Storage / Update Repository as the signed artifact source
- explicit Control Plane rollout policy and approval
- signed artifacts
- no unsigned binaries
- staged rollout
- canary rollout
- rollback
- health checks
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
- backward-compatible recovery paths, kept until the fleet has converged or
  they are explicitly removed
- multi-source artifact retrieval for stranded or NAT-only nodes
Version Storage stores immutable release manifests, artifacts, hashes,
signatures, compatibility metadata, provenance, and approved migration bundles.
It must not become a second source of truth for rollout policy, approvals,
organization state, cluster state, or audit.
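
As an illustration of checking an artifact against a release manifest, the sketch below compares a SHA-256 digest with the manifest entry for the node's OS and architecture. The manifest layout is hypothetical, the choice of SHA-256 is an assumption, and the manifest's own signature is assumed to have been verified before this function is called.

```python
import hashlib
import hmac


def artifact_matches_manifest(artifact: bytes, manifest: dict,
                              os_name: str, arch: str) -> bool:
    """Return True only if the artifact hash equals the manifest entry.

    Assumes `manifest` was already signature-checked by the caller and has
    the hypothetical shape {"artifacts": {"<os>/<arch>": {"sha256": "<hex>"}}}.
    """
    entry = manifest.get("artifacts", {}).get(f"{os_name}/{arch}")
    if entry is None:
        return False  # no artifact published for this platform
    actual = hashlib.sha256(artifact).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, entry["sha256"])
```

A hash match on its own proves nothing about origin, so the check is only meaningful once the manifest is trusted.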
The native node-agent owns local update trust, health supervision, restart, and
recovery logic. It may update, restart, or roll back assigned local workloads
only according to signed manifests and Control Plane policy. Node-agent
self-update requires stricter staged replacement and crash-safe rollback than
ordinary workload updates.
PostgreSQL schema migrations are orchestrated by the Control Plane release
process. Node-agent must not independently invent or execute durable
PostgreSQL schema migrations. Service-local, node-local, cache, or protocol
schema migrations require signed manifest metadata, preflight checks,
rollback/fencing behavior, and explicit compatibility rules.
## 13. Performance and Routing Awareness
Placement and routing decisions must consider:
- CPU
- RAM
- network load
- active sessions
- connector load
- relay load
- service type
- health score
- latency
- packet loss
- bandwidth availability
- policy constraints
Interactive input/control traffic must not wait behind render/video, file
transfer, telemetry, or VPN bulk traffic.
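
The ordering requirement for interactive traffic can be illustrated with a strict-priority queue. The traffic class names and priority values are hypothetical. A real scheduler would likely also need fairness or rate limits so that bulk classes are not starved.

```python
import heapq
import itertools

# Hypothetical classes; a lower number is sent first.
PRIORITY = {
    "input": 0, "control": 0,
    "render": 1,
    "file_transfer": 2, "telemetry": 2, "vpn_bulk": 2,
}


class OutboundQueue:
    """Strict-priority queue that stays FIFO within one priority level."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker that preserves arrival order

    def push(self, traffic_class: str, payload) -> None:
        heapq.heappush(self._heap,
                       (PRIORITY[traffic_class], next(self._seq), payload))

    def pop(self):
        """Return the payload that should be sent next."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The sequence counter keeps ordering stable within a class and ensures payloads themselves are never compared.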
## 14. No Runtime Expansion From Documentation
Architecture documentation does not authorize runtime implementation.
Do not start the following without an explicit staged prompt:
- RDP runtime changes
- Windows client behavior changes
- data-plane behavior changes
- backend session lifecycle changes
- mesh runtime traffic
- VPN/IP tunnel runtime
- relay packet routing
- QUIC/WebRTC
- service workload execution
- new protocol adapters
## Result / Decision
These guardrails formalize the Secure Access Fabric lower foundation:
PostgreSQL remains authoritative, Redis remains live-only, Fabric Core comes
before mesh runtime, Fabric routing must not depend on live backend
availability, service adapters do not own routing, nodes receive only
need-to-know scoped configuration, Fabric Storage/Config Storage is not a
general-purpose distributed database, and organizations must not see internal
mesh topology. No code, API, migration, RDP, data-plane, mesh, VPN, relay, or
service workload runtime behavior is changed by this document. Version
Storage/Update Repository is a future signed artifact and release distribution
foundation; it is not an updater runtime until a later explicit staged prompt
authorizes it.