Initial project snapshot
# Fabric Core Configuration Distribution

Status: Stage C10 result. Documentation and architecture only.

This document consolidates the Fabric Core configuration distribution model for
the Secure Access Fabric platform. It does not implement mesh runtime traffic,
VPN/IP tunnel runtime, relay packet routing, RDP work, service workload
execution, API changes, migrations, or code changes.

## 1. Purpose

Stage C10 defines the boundaries that must exist before the project safely
moves into signed snapshots, node-local storage, config/storage services, peer
directories, routing skeletons, secure node channels, mesh routing, or VPN/IP
tunnel runtime.

The goal is to prevent the lower fabric from growing into an accidental
distributed database, accidental full-mesh topology store, or service-specific
RDP/VPN routing layer.

## 2. Layer Model

The platform layer order remains:

1. Host OS
2. RAP Fabric Core
3. Secure Fabric Network
4. Service Runtime / Service Adapters
5. Access Clients / Admin UI

Fabric Core is the lower distributed runtime foundation above the host OS. It
is not a real operating system. It is implemented through native
`rap-node-agent`, control-plane contracts, scoped signed snapshots, node-local
state, role assignment consumption, update trust, and service supervision
boundaries.

RDP, VNC, SSH, VPN, video, file transfer, and internal-app access are services
above Fabric Core. They consume Fabric Core identity, placement, routing, and
policy; they do not define peer discovery, route selection, cluster authority,
or durable configuration ownership.

## 3. Source of Truth and Cache Boundaries

PostgreSQL remains the only durable source of truth for domain state:

- platform configuration
- clusters
- organizations
- users and memberships
- node identities and enrollment state
- node role assignments
- policies
- resources
- service desired state
- audit
- trust roots and revocation state

Redis remains live coordination only:

- leases
- heartbeats
- ephemeral routing hints
- short-lived tokens
- transient queues
- runtime cache

Redis must not store durable topology, durable configuration, node identity,
policy, organization data, cluster trust, or authoritative route state.

Fabric Storage / Config Storage is a distribution and cache layer. It must not:

- replace PostgreSQL
- become a general-purpose distributed database
- accept direct node writes as authoritative state
- store every cluster or organization object on every node
- expose arbitrary query capabilities
- bypass organization, cluster, role, or service isolation

Node-local state is runtime state plus signed scoped snapshots. It supports
fast operation and degraded reconnect. It is not a source of truth.

## 4. Configuration Layers

Configuration is separated into layers so nodes receive only what their role
requires.

Global platform configuration:

- platform trust roots
- supported protocol versions
- update trust policy
- platform-wide feature gates
- high-risk admin policy

Cluster configuration:

- cluster identity
- cluster trust roots and certificate policy
- cluster authority/partition state
- node role assignments
- QoS policy
- peer discovery policy
- route policy
- storage/config replication policy

Organization configuration:

- organization identity and status
- organization service enablement
- tenant-visible ingress/egress/service endpoints
- tenant policy references
- organization-specific resource references
- safe status projections

Service configuration:

- assigned service workload configuration
- service-specific policy subset
- resource references needed by the assigned workload
- connector or `vpn_connection` references where authorized
- runtime secret references, resolved only through approved secret resolvers

## 5. Scoped Distribution Principle

Nodes receive configuration on a need-to-know basis.

Core mesh node receives:

- scoped peer/neighbor data
- route policy
- QoS policy
- cluster version and trust metadata
- no RDP credentials
- no full organization user list
- no unrelated service configuration

Ingress node receives:

- allowed client entry policies
- token validation configuration
- entry route hints
- service endpoint mapping allowed for the ingress scope
- no full internal topology
- no unrelated organization data

Egress/service node receives:

- assigned service configs
- needed resource references
- needed connector or `vpn_connection` references
- policy for assigned services
- secrets only through approved resolver and only at runtime

Storage/config node receives:

- assigned shard/scope metadata
- replication metadata
- signed snapshot content for its assigned scope
- no unrelated organization data
- no unrestricted topology query access

Thin/mobile node receives:

- minimal bootstrap peers
- active session/tunnel policy subset
- local trust data required to reconnect
- no broad cluster topology
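
The need-to-know rule can be illustrated as an allow-list projection: the control plane starts from a full configuration view and keeps only the categories a node's roles permit. The following is a minimal sketch, not the project's implementation. The role names, category keys, and the `project_config` helper are all hypothetical and only loosely mirror the lists above.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical role names and config categories, loosely mirroring section 5.
ROLE_SCOPES: Dict[str, FrozenSet[str]] = {
    "core_mesh": frozenset(
        {"peer_subset", "route_policy", "qos_policy", "cluster_trust"}
    ),
    "ingress": frozenset(
        {"client_entry_policy", "token_validation", "entry_route_hints",
         "endpoint_mapping"}
    ),
    "egress_service": frozenset(
        {"service_config", "resource_refs", "connector_refs", "service_policy"}
    ),
    "storage_config": frozenset(
        {"shard_metadata", "replication_metadata", "scoped_snapshots"}
    ),
    "thin_mobile": frozenset(
        {"bootstrap_peers", "session_policy", "local_trust"}
    ),
}


def scope_for_roles(roles: Iterable[str]) -> FrozenSet[str]:
    """Union of allowed categories; an unknown role raises KeyError (fail closed)."""
    allowed = set()
    for role in roles:
        allowed |= ROLE_SCOPES[role]
    return frozenset(allowed)


def project_config(full_config: dict, roles: Iterable[str]) -> dict:
    """Keep only the categories the node's roles are allowed to receive."""
    allowed = scope_for_roles(roles)
    return {key: value for key, value in full_config.items() if key in allowed}
```

Using an allow-list rather than a deny-list means a newly introduced configuration category is withheld from every role until it is explicitly granted, which matches the "no unrelated data" entries in the lists above.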

## 6. Signed Scoped Cluster Snapshot Boundary

C10 defines snapshot boundaries only. C11 will define the full signed scoped
cluster snapshot model.

A scoped snapshot is a signed, versioned, role-limited configuration package
that a node-agent can store locally.

Snapshot properties:

- cluster-scoped
- role-scoped
- organization-scoped where applicable
- versioned
- signed by an authorized control-plane signing key
- bounded in size
- expires or requires refresh according to policy
- reconstructable from PostgreSQL source-of-truth state

Snapshot contents may include:

- cluster id and version
- node membership scope
- assigned roles
- allowed service workload refs
- peer directory subset
- route policy subset
- QoS policy subset
- trust roots and revocation metadata
- storage/config endpoints for refresh
- degraded-mode permissions

Snapshot contents must not include:

- unrelated organization data
- broad user lists
- raw secrets
- RDP/VNC/SSH credentials
- full cluster topology unless node role requires it
- arbitrary query permissions
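
A compiler-side guard could enforce the "bounded in size" and "must not include" rules before a snapshot is signed. The sketch below is illustrative only: the field names in `FORBIDDEN_KEYS`, the `MAX_SNAPSHOT_BYTES` limit, and the function names are assumptions, and the real snapshot model is left to C11.

```python
import json
from typing import List, Set

# Assumed field names for content a snapshot must never carry.
FORBIDDEN_KEYS = frozenset(
    {"raw_secrets", "rdp_credentials", "vnc_credentials", "ssh_credentials",
     "user_list", "query_permissions"}
)
# Assumed size bound; the actual limit would come from policy.
MAX_SNAPSHOT_BYTES = 256 * 1024


def collect_keys(value) -> Set[str]:
    """Collect every dict key at any nesting depth."""
    keys: Set[str] = set()
    if isinstance(value, dict):
        for key, child in value.items():
            keys.add(key)
            keys |= collect_keys(child)
    elif isinstance(value, list):
        for item in value:
            keys |= collect_keys(item)
    return keys


def check_snapshot_payload(payload: dict) -> List[str]:
    """Return a list of boundary violations; an empty list means the payload passes."""
    problems: List[str] = []
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    if len(encoded) > MAX_SNAPSHOT_BYTES:
        problems.append("snapshot too large: %d bytes" % len(encoded))
    for key in sorted(collect_keys(payload) & FORBIDDEN_KEYS):
        problems.append("forbidden field: " + key)
    return problems
```

A key-name check like this catches accidental inclusion only. It does not replace scoping the data correctly when the view is compiled.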

## 7. Node-Local State Boundary

`rap-node-agent` local state may contain:

- node identity material and certificate metadata
- cluster membership state
- signed scoped cluster snapshot
- peer cache
- route cache
- service assignment cache
- service health/status cache
- local health state
- partition/degraded state
- last applied config version
- pending update metadata
- bounded telemetry buffer

Node-local state must not contain:

- full cluster topology unless explicitly required by role
- full organization data
- unrelated organization secrets
- durable policy authority
- durable route authority
- durable audit authority
- unrelated storage shards

Node-agent must be able to operate from local state for short degraded periods
when policy allows it, but it must not authorize high-risk mutations while
isolated.

## 8. Peer Directory and Cache Boundary

Peer directory data is distributed as scoped configuration, not queried from
PostgreSQL on every routing decision.

Peer directory entry fields:

- `node_id`
- `cluster_id`
- endpoint candidates
- roles/capabilities
- region/location hints
- trust/certificate fingerprint
- policy scope
- config version

Node-local peer cache may add runtime observations:

- `last_success_at`
- `last_latency_ms`
- packet loss
- reliability score
- recent failure history
- observed load hints where allowed
- last seen config version

Peer selection is score-based, not latency-only. Inputs include:

- latency
- packet loss
- reliability
- region distance
- node load
- bandwidth availability
- role suitability
- policy constraints
- trust level
- recent failure history
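
To make "score-based, not latency-only" concrete, the sketch below combines a subset of the listed inputs into one weighted score and treats role suitability and policy constraints as hard filters. The weights, the normalization formulas, and the `PeerObservation` type are illustrative assumptions, not values defined by this document.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PeerObservation:
    """Hypothetical per-peer view built from directory data plus local cache."""
    node_id: str
    latency_ms: float
    packet_loss: float      # fraction in 0.0..1.0
    reliability: float      # fraction in 0.0..1.0
    load: float             # fraction in 0.0..1.0
    recent_failures: int
    role_suitable: bool
    policy_allowed: bool


def score_peer(peer: PeerObservation) -> Optional[float]:
    """Return a score in 0..1, or None when the peer is not eligible at all."""
    # Role suitability and policy are hard constraints, not weighted inputs.
    if not (peer.role_suitable and peer.policy_allowed):
        return None
    latency_term = 1.0 / (1.0 + peer.latency_ms / 100.0)
    failure_term = 1.0 / (1.0 + peer.recent_failures)
    # Assumed weights; they sum to 1.0 so the score stays within 0..1.
    return (
        0.30 * latency_term
        + 0.25 * (1.0 - peer.packet_loss)
        + 0.25 * peer.reliability
        + 0.10 * (1.0 - peer.load)
        + 0.10 * failure_term
    )


def select_peer(peers: Iterable[PeerObservation]) -> Optional[PeerObservation]:
    """Pick the highest-scoring eligible peer, or None if none is eligible."""
    best = None
    best_score = -1.0
    for peer in peers:
        score = score_peer(peer)
        if score is not None and score > best_score:
            best, best_score = peer, score
    return best
```

With these assumed weights, a 10 ms peer with 30% packet loss and several recent failures loses to a 40 ms peer with a clean history, which is the behavior a latency-only selector would get wrong.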

The Fabric Routing Engine owns route selection. Service Adapters must not
discover peers, select mesh routes, create shortcuts, or implement partition
recovery.

## 9. Fabric Storage / Config Storage Role

Fabric Storage / Config Storage is a logical future service. It is a scoped
distribution layer for configuration and signed snapshots.

Responsibilities:

- distribute signed scoped snapshots
- distribute peer directories
- cache hot configuration near service nodes
- replicate critical scoped data across failure domains
- provide nearby read access for node-agent refresh
- support cluster/org/service scope boundaries
- support version-based sync and incremental update delivery

Non-goals:

- no replacement of PostgreSQL
- no arbitrary distributed database behavior
- no direct node writes as authoritative state
- no broad ad hoc query API
- no full topology exposure to tenants
- no full organization data on every node

Placement rules:

- hot data may be placed near services that use it
- cold data may remain remote
- critical data should replicate across failure domains
- replication factor is policy-driven
- storage scope must respect cluster, organization, and service boundaries

## 10. Distribution Flow

Normal flow:

1. Control plane reads authoritative state from PostgreSQL.
2. Control plane compiles scoped configuration views.
3. Control plane signs full scoped snapshots or incremental updates.
4. Fabric Storage / Config Storage distributes and caches scoped artifacts.
5. Node-agent fetches snapshots/updates from authorized endpoints.
6. Node-agent verifies signatures, version, scope, expiry, and trust roots.
7. Node-agent applies configuration into local state.
8. Runtime components consume local state, not live backend calls, for realtime
   route decisions.

Realtime routing decisions must not depend on live backend availability. They
should use verified local state, peer cache, route cache, and policy.
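
Step 6 of the flow (verify before apply) might look like the sketch below. It is an assumption-heavy illustration: HMAC-SHA256 from the Python standard library stands in for whatever signature scheme the control plane actually uses (an asymmetric scheme is more likely in practice), and the envelope layout, field names, and `verify_snapshot` function are hypothetical.

```python
import hashlib
import hmac
import json
from typing import Dict, Optional, Tuple


def canonical_bytes(body: dict) -> bytes:
    """Stable serialization so signer and verifier hash identical bytes."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(body: dict, key: bytes) -> str:
    """Stand-in signature: HMAC-SHA256 over the canonical body (assumption)."""
    return hmac.new(key, canonical_bytes(body), hashlib.sha256).hexdigest()


def verify_snapshot(
    envelope: dict,
    trusted_keys: Dict[str, bytes],
    node_scope: str,
    local_version: Optional[int],
    now: float,
) -> Tuple[bool, str]:
    """Check signer, signature, scope, expiry, and version, in that order."""
    body = envelope["body"]
    key = trusted_keys.get(envelope["key_id"])
    if key is None:
        return False, "unknown signer"
    if not hmac.compare_digest(sign(body, key), envelope["signature"]):
        return False, "bad signature"
    if body["scope"] != node_scope:
        return False, "scope mismatch"
    if body["expires_at"] <= now:
        return False, "expired"
    if local_version is not None and body["version"] < local_version:
        return False, "rollback"
    return True, "ok"
```

The signature is checked before any other field is trusted, so scope, expiry, and version are only read from a body that a known signer produced.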

## 11. Versioning and Consistency Rules

Every snapshot and incremental update must carry:

- `cluster_id`
- scope identifiers
- monotonic config version or equivalent epoch
- issued-at timestamp
- expiry or refresh deadline
- signer id / key id
- signature
- dependency/base version for increments

Rules:

- full snapshot can establish or repair local state
- incremental update applies only to the expected base version
- version gaps require full resync
- signature mismatch rejects the update and triggers recovery
- rollback to older config is forbidden unless explicitly authorized by a
  signed recovery policy
- node must report last applied config version in heartbeat/status
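
The rules above reduce to a small decision function. The following sketch is one possible reading, assuming integer versions and a hypothetical update dictionary with `kind`, `version`, and `base_version` keys. The `rollback_authorized` flag stands in for a verified signed recovery policy.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    APPLY = "apply"
    FULL_RESYNC = "full_resync"
    REJECT = "reject"   # the caller is expected to trigger recovery


def decide(
    local_version: Optional[int],
    update: dict,
    signature_ok: bool,
    rollback_authorized: bool = False,
) -> Action:
    """Map one incoming snapshot or increment to an action."""
    # Signature mismatch rejects the update outright.
    if not signature_ok:
        return Action.REJECT
    # Rollback to an older version is forbidden unless explicitly authorized.
    older = local_version is not None and update["version"] < local_version
    if older and not rollback_authorized:
        return Action.REJECT
    # A full snapshot can establish or repair local state.
    if update["kind"] == "full":
        return Action.APPLY
    # An increment applies only on top of the exact expected base version;
    # anything else is a version gap and requires a full resync.
    if local_version is None or update["base_version"] != local_version:
        return Action.FULL_RESYNC
    return Action.APPLY
```

For example, a node at version 5 applies an increment with base 5, requests a full resync for an increment with base 6, and rejects a full snapshot at version 4 unless a recovery policy authorizes it.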

## 12. Degraded Mode Rules

Degraded operation is allowed only when policy permits it.

Allowed examples:

- keep already-running safe services alive
- continue existing authorized routes for a short TTL
- reconnect to known active/warm/bootstrap peers
- use last signed snapshot to find config/storage endpoints
- report degraded status when connectivity returns

Forbidden while degraded:

- approve join requests
- issue node certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partition authority automatically
- access secrets not already authorized for the node's current role

Degraded mode must be time-bounded and observable.
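
A node-agent could gate actions while isolated with a default-deny check like the sketch below. The action identifiers, the `max_degraded_seconds` parameter, and the function name are hypothetical. The sets only paraphrase the allowed and forbidden lists above.

```python
# Hypothetical action identifiers paraphrasing the lists in this section.
ALLOWED_WHILE_DEGRADED = frozenset({
    "keep_service_alive",
    "continue_existing_route",
    "reconnect_known_peer",
    "lookup_storage_endpoint",
})
FORBIDDEN_WHILE_DEGRADED = frozenset({
    "approve_join",
    "issue_node_certificate",
    "assign_role",
    "change_cluster_policy",
    "change_organization_policy",
    "rotate_trust_root",
    "promote_partition_authority",
    "access_unauthorized_secret",
})


def degraded_action_allowed(
    action: str,
    degraded_since: float,
    now: float,
    max_degraded_seconds: float,
    policy_allows_degraded: bool,
) -> bool:
    """Default-deny gate for actions attempted while the node is isolated."""
    if not policy_allows_degraded:
        return False
    # Degraded mode is time-bounded: past the window, nothing is allowed.
    if now - degraded_since > max_degraded_seconds:
        return False
    if action in FORBIDDEN_WHILE_DEGRADED:
        return False
    # Anything not explicitly listed as allowed is denied.
    return action in ALLOWED_WHILE_DEGRADED
```

The explicit forbidden set is redundant with the default-deny return, but keeping it makes the high-risk actions visible in code review and protects against someone later widening the allowed set by mistake.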

## 13. Multi-Cluster Isolation

Clusters are isolated by default.

Rules:

- clusters do not automatically trust each other
- clusters do not form one shared mesh by default
- cross-cluster routing requires explicit trust and policy
- platform owner may manage multiple clusters from one console
- organization admins see only authorized clusters/resources
- node may participate in multiple clusters only through isolated memberships
- cluster-scoped identities, certificates, tokens, storage namespaces, and
  policies are required

A multi-cluster node must keep separate local state per cluster:

- separate identity/certificates
- separate snapshots
- separate peer cache
- separate route cache
- separate service assignment cache
- separate storage namespace
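
One way to keep per-cluster state separate is to key every cache and the storage namespace by cluster id, with no shared containers. The sketch below is an assumption, not the C12 design: `ClusterState`, `NodeStateStore`, the `clusters/<cluster_id>` path layout, and the id pattern are all hypothetical.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Dict, Optional

# Assumed id format; restricting characters keeps ids safe as path segments.
_CLUSTER_ID = re.compile(r"[A-Za-z0-9_-]+")


@dataclass
class ClusterState:
    """All local state for one cluster membership; nothing is shared."""
    cluster_id: str
    storage_namespace: PurePosixPath
    snapshot: Optional[dict] = None
    peer_cache: Dict[str, dict] = field(default_factory=dict)
    route_cache: Dict[str, dict] = field(default_factory=dict)
    service_assignments: Dict[str, dict] = field(default_factory=dict)


class NodeStateStore:
    """Holds one isolated ClusterState per cluster the node belongs to."""

    def __init__(self, root: str) -> None:
        self._root = PurePosixPath(root)
        self._clusters: Dict[str, ClusterState] = {}

    def join(self, cluster_id: str) -> ClusterState:
        if not _CLUSTER_ID.fullmatch(cluster_id):
            raise ValueError("invalid cluster id: %r" % cluster_id)
        if cluster_id in self._clusters:
            raise ValueError("already a member of %s" % cluster_id)
        state = ClusterState(cluster_id, self._root / "clusters" / cluster_id)
        self._clusters[cluster_id] = state
        return state

    def for_cluster(self, cluster_id: str) -> ClusterState:
        # KeyError for an unknown cluster: there is no cross-cluster fallback.
        return self._clusters[cluster_id]
```

Because every lookup goes through `for_cluster`, a component handling traffic for one cluster has no code path that reaches another cluster's peer or route cache.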

## 14. Security Boundaries

Security requirements:

- snapshots are signed
- transport for snapshot/update distribution is authenticated and encrypted
- node-agent verifies signature, scope, expiry, signer, and trust root
- secrets are never embedded directly in broad snapshots
- secrets are resolved through approved resolvers only at runtime
- high-risk admin actions require step-up authentication
- all cluster trust and role changes are audited

High-risk actions include:

- node approval
- role assignment
- cluster trust changes
- cross-cluster trust
- partition promotion
- secrets access
- update policy changes
- signing key rotation

## 15. C11-C18 Staging Boundary

C10 is a design consolidation stage. It prepares later stages:

- C11: signed scoped cluster snapshot model
- C12: node local state store
- C13: config/storage service foundation
- C14: peer directory and cache model
- C15: Fabric Routing Engine skeleton
- C16: secure node-to-node channel lifecycle
- C17: mesh routing runtime
- C18: VPN/IP tunnel service

C10 implements none of these. Later stages must be explicit, narrow, and
verified. Mesh routing and VPN/IP tunnel runtime must not start before C11-C16
foundations are accepted.

## 16. Result / Decision

Stage C10 consolidates the lower Fabric Core configuration distribution model.

Decisions:

- PostgreSQL remains the only durable source of truth.
- Redis remains live coordination only.
- Fabric Storage / Config Storage is a scoped distribution/cache layer, not a
  second source of truth.
- Nodes receive only role/cluster/organization scoped configuration.
- Node-local state is bounded and non-authoritative.
- Signed scoped snapshots are the required foundation for node-local operation
  and degraded recovery.
- Peer directory/cache data is local and scoped; routing remains Fabric-owned.
- Service Adapters remain protocol translators above Fabric Core.
- Multi-cluster membership requires isolated identities, snapshots, caches,
  tokens, policies, and storage namespaces.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C10.