Architecture Guardrails
Status: architecture guardrails, documentation only.
This file exists so architecture documents have a stable guardrails reference
inside docs/architecture. The operational Codex guardrails remain in
docs/codex/ARCHITECTURE_GUARDRAILS.md.
Transport clarification: references in this document to direct worker WSS and
backend gateway fallback belong to the preserved historical RDP service
baseline. They are not the active source of truth for inter-node transport.
Current fabric node-to-node transport is QUIC-only and is defined by
docs/architecture/DISTRIBUTED_FABRIC_NODE_PROTOCOL_PLAN.md,
docs/architecture/FABRIC_FIRST_TRANSPORT_AND_STRESS_PLAN.md, and
docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md.
Node survivability, recovery overlap, and no-manual-access repair rules are
defined by docs/architecture/FABRIC_NODE_SURVIVAL_AND_RECOVERY_POLICY.md.
1. Preserve the Proven RDP Baseline
The following are already proven and must remain stable:
- live FreeRDP connect
- active session state
- terminate
- detach without killing the remote session
- reattach without recreating the remote session
- takeover without recreating the remote session
- historical direct worker WSS RDP path
- historical backend gateway fallback for the RDP baseline
- C++ RDP Adapter as the active RDP runtime
Architecture clarification must not silently weaken this behavior.
2. Source of Truth
PostgreSQL is the only durable source of truth for domain state.
Redis is live coordination only. It may hold leases, heartbeats, routing hints, attach tokens, other short-lived tokens, and ephemeral cache. It must not become a durable source of truth for sessions, organizations and their data, policies, cluster trust, peer topology, durable configuration, route authority, or node identity.
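The Redis boundary above can be sketched as a write-time check. This is an illustration only: `StateKind`, `REDIS_ALLOWED`, and `redis_write_allowed` are hypothetical names, the enum lists only a subset of the kinds named above, and the requirement that every Redis write carries a positive TTL is an assumption, not a rule stated in this document.

```python
from enum import Enum
from typing import Optional


class StateKind(Enum):
    # Live-coordination kinds that the guardrail allows in Redis.
    LEASE = "lease"
    HEARTBEAT = "heartbeat"
    ROUTING_HINT = "routing_hint"
    ATTACH_TOKEN = "attach_token"
    EPHEMERAL_CACHE = "ephemeral_cache"
    # Durable kinds whose source of truth must stay in PostgreSQL.
    SESSION = "session"
    POLICY = "policy"
    NODE_IDENTITY = "node_identity"


REDIS_ALLOWED = frozenset({
    StateKind.LEASE,
    StateKind.HEARTBEAT,
    StateKind.ROUTING_HINT,
    StateKind.ATTACH_TOKEN,
    StateKind.EPHEMERAL_CACHE,
})


def redis_write_allowed(kind: StateKind, ttl_seconds: Optional[int]) -> bool:
    """Return True only for live-coordination kinds written with an expiry."""
    if kind not in REDIS_ALLOWED:
        return False  # durable state belongs in PostgreSQL
    # Assumption: live-only data always expires, so a missing TTL is rejected.
    return ttl_seconds is not None and ttl_seconds > 0
```

A lease with a 30-second TTL passes the check, while a session record or a lease without an expiry is rejected.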
3. Fabric Core Before Mesh Runtime
RAP Fabric Core is the lower distributed runtime foundation above the host OS.
Fabric Core owns:
- native rap-node-agent identity
- enrollment
- local node state
- capability reporting
- role assignment consumption
- signed scoped configuration snapshots
- update trust
- service supervision boundary
Mesh runtime traffic must not be implemented before node identity, enrollment, role assignment, scoped config distribution, and node-local state are trustworthy.
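One way to picture this ordering rule is a readiness gate that names which preconditions are still missing. The sketch below is hypothetical: `FabricCoreReadiness` and its boolean fields simply mirror the five preconditions listed above, and how each one is judged "trustworthy" is left out.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class FabricCoreReadiness:
    # One flag per precondition named in the guardrail.
    node_identity: bool
    enrollment: bool
    role_assignment: bool
    scoped_config_distribution: bool
    node_local_state: bool


def missing_preconditions(readiness: FabricCoreReadiness) -> List[str]:
    """List the preconditions that are not yet trustworthy, in declared order."""
    return [f.name for f in fields(readiness) if not getattr(readiness, f.name)]


def mesh_runtime_allowed(readiness: FabricCoreReadiness) -> bool:
    """Mesh runtime traffic is allowed only when nothing is missing."""
    return not missing_preconditions(readiness)
```

Returning the missing names, rather than a bare boolean, keeps the reason for a refusal visible to whoever reviews the gate.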
4. Node Identity and Service Workloads
A node is a host-level identity managed by the native rap-node-agent.
Service workloads are separate from node identity. They may be containerized or native, but containers are packaging/isolation boundaries only.
Capabilities are not permissions. Role assignment must be explicit per cluster and, when needed, per organization.
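The difference between a capability and a permission can be sketched as two separate checks. Everything here is hypothetical (`RoleAssignment`, `may_serve`, the role names). In particular, treating an assignment without an organization as covering the whole cluster is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RoleAssignment:
    node_id: str
    cluster_id: str
    role: str
    # Assumption: None means the assignment is cluster-wide.
    organization_id: Optional[str] = None


def may_serve(
    capabilities: Set[str],
    assignments: Iterable[RoleAssignment],
    node_id: str,
    cluster_id: str,
    role: str,
    organization_id: Optional[str] = None,
) -> bool:
    """A node serves a role only if it is capable AND explicitly assigned."""
    if role not in capabilities:
        return False  # cannot do it at all
    for a in assignments:
        if (
            a.node_id == node_id
            and a.cluster_id == cluster_id
            and a.role == role
            and a.organization_id in (None, organization_id)
        ):
            return True
    return False  # capable, but never assigned: not permitted
```

A node that reports both relay and RDP worker capability but is assigned only the relay role in one cluster may serve as a relay there and nowhere else.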
5. Routing Ownership
Routing is owned by the Fabric layer, not individual Service Adapters.
RDP, VNC, SSH, VPN, video, and file services may request a destination node, resource target, egress node, or egress pool. The Fabric Routing Engine chooses the path.
Routing decisions must not depend on live backend availability. They use node-local state, signed scoped snapshots, peer cache, route cache, and policy.
Service Adapters must not discover peers or mesh topology, select routes (including multi-hop routes), manage mesh connections, create shortcuts, or implement mesh failover, partition recovery, or cross-cluster routing policy.
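The split of responsibilities can be sketched as follows: an adapter only describes what it needs, and a routing engine picks the path from node-local data. The names `RouteRequest`, `CachedRoute`, and `FabricRoutingEngine` are hypothetical, and choosing the healthy route with the lowest latency is a placeholder selection rule, not the policy this document defines.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RouteRequest:
    # What a Service Adapter is allowed to express: a need, not a path.
    service: str
    destination_node: str


@dataclass(frozen=True)
class CachedRoute:
    hops: Tuple[str, ...]
    latency_ms: float
    healthy: bool


class FabricRoutingEngine:
    """Chooses paths from a node-local route cache, with no backend call."""

    def __init__(self, route_cache: Dict[str, List[CachedRoute]]) -> None:
        self._route_cache = route_cache

    def choose_path(self, request: RouteRequest) -> Optional[Tuple[str, ...]]:
        candidates = [
            route
            for route in self._route_cache.get(request.destination_node, [])
            if route.healthy
        ]
        if not candidates:
            return None  # no usable route in local state
        # Assumption: lowest latency wins; real policy would weigh more inputs.
        return min(candidates, key=lambda route: route.latency_ms).hops
```

The adapter never sees the cache or the candidate list; it receives a chosen path or nothing.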
6. Need-to-Know Configuration
Nodes should be small, fast, and scoped.
A node receives only the configuration required for its cluster membership, assigned role, service workload, and organization scope. It must not store full cluster topology, unrelated organization data, unrelated storage shards, peer caches outside its scope, or secrets it does not need.
Secrets must be delivered only through approved resolvers and only at runtime when needed.
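A minimal sketch of need-to-know scoping, assuming configuration entries are tagged with a cluster, a role, and an optional organization. `ConfigEntry`, `NodeScope`, and `scoped_snapshot` are hypothetical names, and snapshot signing and secret resolution are deliberately left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class ConfigEntry:
    key: str
    cluster_id: str
    role: str
    # Assumption: None marks an entry that is not organization-specific.
    organization_id: Optional[str]


@dataclass(frozen=True)
class NodeScope:
    cluster_id: str
    roles: FrozenSet[str]
    organization_ids: FrozenSet[str]


def scoped_snapshot(entries: Iterable[ConfigEntry], scope: NodeScope) -> List[ConfigEntry]:
    """Keep only the entries this node needs for its cluster, roles, and orgs."""
    return [
        entry
        for entry in entries
        if entry.cluster_id == scope.cluster_id
        and entry.role in scope.roles
        and (
            entry.organization_id is None
            or entry.organization_id in scope.organization_ids
        )
    ]
```

Filtering happens before distribution, so out-of-scope entries never reach the node in the first place.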
7. Fabric Storage Boundaries
Fabric Storage / Config Storage is a future distribution and cache layer, not a new source of truth.
Storage service must not:
- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store full cluster or organization data on every node
- expose arbitrary query capabilities
- bypass organization and cluster isolation
8. Multi-Tenancy Isolation
Every organization must be isolated by design.
Namespace and authorize:
- resources
- users-in-organization
- groups
- policies
- connectors
- sessions
- service endpoints
- audit
- secret references
- storage/cache scopes
- Redis keys where applicable
Organizations must not see intermediate mesh topology, other organizations' routes, peer caches, nodes, storage shards, secrets, or platform trust internals.
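For the "Redis keys where applicable" item, one possible namespacing scheme is sketched below. The `rap:` prefix, the colon separator, and both function names are assumptions made for illustration; the document does not define a key format.

```python
def scoped_key(cluster_id: str, organization_id: str, kind: str, ident: str) -> str:
    """Build a key that always carries its cluster and organization scope."""
    parts = (cluster_id, organization_id, kind, ident)
    for part in parts:
        # Reject empty parts and embedded separators so scopes cannot be spoofed.
        if not part or ":" in part:
            raise ValueError(f"invalid key component: {part!r}")
    return "rap:" + ":".join(parts)


def key_in_scope(key: str, cluster_id: str, organization_id: str) -> bool:
    """Check that a key belongs to exactly this cluster and organization."""
    # The trailing separator stops "org-1" from matching "org-10".
    return key.startswith(f"rap:{cluster_id}:{organization_id}:")
```

Namespacing keys does not replace authorization; it only makes cross-organization access easier to detect and deny.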
9. Multi-Cluster Boundaries
A platform may manage multiple clusters, but clusters do not automatically trust each other and do not form one shared mesh by default.
Cross-cluster routing requires explicit trust and policy.
Cluster-scoped identities, certificates, tokens, storage namespaces, and policies are required. A node may participate in multiple clusters only through isolated memberships.
10. Split-Brain Prevention
Never allow minority partitions to become a second authoritative cluster automatically.
Cluster-wide changes, role changes, trust changes, node approvals, policy mutation, partition promotion, and cross-cluster trust must be restricted in non-quorum or degraded states.
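The restriction can be sketched as a gate in front of the operations listed above. This sketch assumes quorum means a strict majority of voting members and that restricted operations are refused outright when quorum is lost or the cluster is degraded; the operation names and function names are hypothetical.

```python
RESTRICTED_OPERATIONS = frozenset({
    "cluster_wide_change",
    "role_change",
    "trust_change",
    "node_approval",
    "policy_mutation",
    "partition_promotion",
    "cross_cluster_trust",
})


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Assumption: quorum is a strict majority of all voting members."""
    return total_voters > 0 and reachable_voters > total_voters // 2


def operation_allowed(
    operation: str,
    reachable_voters: int,
    total_voters: int,
    degraded: bool = False,
) -> bool:
    """Refuse restricted operations without quorum or in a degraded state."""
    if operation not in RESTRICTED_OPERATIONS:
        return True  # unrestricted operations are outside this gate
    return has_quorum(reachable_voters, total_voters) and not degraded
```

With an even number of voters, an exact half does not count as quorum, so two equal partitions cannot both approve a change.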
11. Control Plane vs Data Plane
Control plane owns durable state and policy:
- organizations
- users
- memberships
- roles
- resources
- policies
- nodes
- cluster membership
- service assignments
- connector/VPN desired state
- updates
- config distribution
- audit
Data plane carries authorized traffic:
- session streams
- worker traffic
- relay traffic
- connector traffic
- future VPN/IP tunnel traffic
Do not collapse control plane and data plane into one vague layer.
12. Updates and Trust
Updates must support:
- Version Storage / Update Repository as the signed artifact source
- explicit Control Plane rollout policy and approval
- signed artifacts
- no unsigned binaries
- staged rollout
- canary rollout
- rollback
- health checks
- local update cache where approved
- OS / architecture specific artifacts under signed release manifests
- explicit migration bundles when data structures change
- legacy recovery compatibility until the fleet is converged or explicitly retired
- multi-source artifact retrieval for stranded or NAT-only nodes
Version Storage stores immutable release manifests, artifacts, hashes, signatures, compatibility metadata, provenance, and approved migration bundles. It must not become a second source of truth for rollout policy, approvals, organization state, cluster state, or audit.
The native node-agent owns local update trust, health supervision, restart, and recovery logic. It may update, restart, or rollback assigned local workloads only according to signed manifests and Control Plane policy. Node-agent self-update requires stricter staged replacement and crash-safe rollback than ordinary workload updates.
PostgreSQL schema migrations are orchestrated by the Control Plane release process. Node-agent must not independently invent or execute durable PostgreSQL schema migrations. Service-local, node-local, cache, or protocol schema migrations require signed manifest metadata, preflight checks, rollback/fencing behavior, and explicit compatibility rules.
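As a small illustration of the artifact checks described above, the sketch below compares a downloaded artifact against the hash recorded in a release manifest. It assumes the manifest's own signature has already been verified elsewhere, and the manifest layout (`{"artifacts": {name: sha256_hex}}`) and function name are hypothetical.

```python
import hashlib
import hmac


def artifact_matches_manifest(artifact: bytes, manifest: dict, name: str) -> bool:
    """Check an artifact's SHA-256 against the hash listed in the manifest.

    Assumption: the manifest signature was verified before this call; this
    function only covers the hash comparison step.
    """
    expected = manifest.get("artifacts", {}).get(name)
    if expected is None:
        return False  # artifacts not listed in the manifest are never trusted
    actual = hashlib.sha256(artifact).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected)
```

An artifact that is missing from the manifest is rejected in the same way as one whose hash differs.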
13. Performance and Routing Awareness
Placement and routing decisions must consider:
- CPU
- RAM
- network load
- active sessions
- connector load
- relay load
- service type
- health score
- latency
- packet loss
- bandwidth availability
- policy constraints
Interactive input/control traffic must not wait behind render/video, file transfer, telemetry, or VPN bulk traffic.
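The traffic-ordering rule can be sketched with a priority queue. The class names and priority values below are assumptions, and strict priority is only one possible scheduling discipline; a real scheduler would likely also need protection against starving the lower classes.

```python
import heapq
import itertools
from typing import Any, List, Tuple

# Assumption: lower number means it is served first.
PRIORITY = {
    "interactive": 0,
    "render": 1,
    "file_transfer": 2,
    "telemetry": 2,
    "vpn_bulk": 2,
}


class TrafficScheduler:
    """Strict-priority queue: interactive input never waits behind bulk data."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # The sequence number keeps FIFO order inside one priority class.
        self._sequence = itertools.count()

    def enqueue(self, traffic_class: str, payload: Any) -> None:
        entry = (PRIORITY[traffic_class], next(self._sequence), payload)
        heapq.heappush(self._heap, entry)

    def dequeue(self) -> Any:
        return heapq.heappop(self._heap)[2]
```

Input events queued after a file-transfer chunk are still dequeued first, and items within one class keep their arrival order.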
14. No Runtime Expansion From Documentation
Architecture documentation does not authorize runtime implementation.
Do not start the following without an explicit staged prompt:
- RDP runtime changes
- Windows client behavior changes
- data-plane behavior changes
- backend session lifecycle changes
- mesh runtime traffic
- VPN/IP tunnel runtime
- relay packet routing
- QUIC/WebRTC
- service workload execution
- new protocol adapters
Result / Decision
These guardrails formalize the Secure Access Fabric lower foundation: PostgreSQL remains authoritative, Redis remains live-only, Fabric Core comes before mesh runtime, Fabric routing must not depend on live backend availability, service adapters do not own routing, nodes receive only need-to-know scoped configuration, Fabric Storage/Config Storage is not a general-purpose distributed database, and organizations must not see internal mesh topology. No code, API, migration, RDP, data-plane, mesh, VPN, relay, or service workload runtime behavior is changed by this document. Version Storage/Update Repository is a future signed artifact and release distribution foundation; it is not an updater runtime until a later explicit staged prompt authorizes it.