rdp-proxy/docs/architecture/NODE_LOCAL_STATE_STORE.md
2026-04-28 22:29:50 +03:00


# Node Local State Store
Status: Stage C12 result. Documentation and architecture only.
This document defines the node-local state store model for native
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.
## 1. Purpose
The node-local state store lets `rap-node-agent` operate safely without asking
the backend for every realtime routing or service supervision decision.
The local store must support:
- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot
The local store must not become a durable source of truth.
## 2. Authority Boundaries
PostgreSQL remains authoritative for durable domain state.
Fabric Storage / Config Storage distributes signed snapshots and increments.
Node-local state stores verified local copies and runtime observations.
Redis remains live coordination only.
Node-local state must not authorize:
- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust
## 3. Storage Root and Namespaces
The node-agent should use one configured local storage root.
Example logical layout:
```text
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```
Rules:
- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic
activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account
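
The layout rules above can be illustrated with a minimal sketch. It assumes a POSIX-style permission model and a restrictive `cluster_id` character set; the function names, the id pattern, and the `0o700` mode are hypothetical choices for illustration, not part of the C12 design.

```python
import os
import re
from pathlib import Path

# Directory names mirror the logical layout above.
CLUSTER_SUBDIRS = ("identity", "trust", "snapshots", "peers", "routes",
                   "services", "health", "updates", "telemetry", "tmp")

# Assumption: cluster ids are short and limited to safe characters, so an id
# can never contain path separators or "..".
_CLUSTER_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def cluster_root(storage_root: Path, cluster_id: str) -> Path:
    """Return the namespace directory for one cluster, rejecting unsafe ids."""
    if not _CLUSTER_ID_RE.fullmatch(cluster_id):
        raise ValueError(f"invalid cluster_id: {cluster_id!r}")
    return storage_root / "clusters" / cluster_id


def init_cluster_namespace(storage_root: Path, cluster_id: str) -> Path:
    """Create the per-cluster tree and restrict it to the owning account."""
    root = cluster_root(storage_root, cluster_id)
    (storage_root / "agent").mkdir(parents=True, exist_ok=True)
    root.mkdir(parents=True, exist_ok=True)
    os.chmod(root, 0o700)
    for name in CLUSTER_SUBDIRS:
        path = root / name
        path.mkdir(exist_ok=True)
        # Mode bits are only partly honored on Windows; ACLs would be needed there.
        os.chmod(path, 0o700)
    return root
```

Resolving every path through one function such as `cluster_root` is one way to keep a cluster from reaching into another cluster's namespace, because an id like `../other` is rejected before any path is built.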
## 4. State Classes
### Agent State
Agent-level state:
- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata
Agent state is not cluster authority.
### Identity State
Cluster identity state:
- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache
Private keys should be stored in an OS-protected key store when available. If
file-backed keys are necessary, they must be encrypted at rest and protected by
strict filesystem permissions.
### Trust State
Trust state:
- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version
Trust state must be signed and versioned. Unknown or revoked trust roots must
not be accepted.
### Snapshot State
Snapshot state:
- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines
Snapshot activation must be atomic:
1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version
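
The activation steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the envelope fields (`payload`, `sha256`, `signature`), the file names, and the use of HMAC-SHA256 as a stand-in for the real signature scheme are all hypothetical. Steps 5 and 6 (notify and report) are left to the caller, which receives the applied version.

```python
import hashlib
import hmac
import json
import os
import time
from pathlib import Path


def verify_snapshot(envelope: dict, key: bytes, scope: str,
                    active_version: int, now: float) -> bytes:
    """Return the payload bytes if every check passes, else raise ValueError."""
    payload = envelope["payload"].encode("utf-8")
    # HMAC is a stand-in here; a real agent would verify an asymmetric signature.
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), envelope["signature"].encode()):
        raise ValueError("bad signature")
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("hash mismatch")
    body = json.loads(payload)
    if body["scope"] != scope:
        raise ValueError("wrong scope")
    if body["expires_at"] <= now:
        raise ValueError("snapshot expired")
    if body["version"] <= active_version:
        raise ValueError("version is not newer than the active snapshot")
    return payload


def activate_snapshot(snap_dir: Path, envelope: dict, key: bytes, scope: str,
                      active_version: int) -> int:
    """Run steps 1-4 and return the version the caller should report."""
    pending = snap_dir / "pending.json"
    pending.write_text(json.dumps(envelope))               # 1. write pending
    try:
        payload = verify_snapshot(envelope, key, scope,    # 2. verify
                                  active_version, time.time())
    except ValueError:
        pending.unlink()                                   # rejected pending state is discarded
        raise
    version = json.loads(payload)["version"]
    verified = snap_dir / f"v{version}.json"
    with open(verified, "wb") as f:                        # 3. persist verified content
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    pointer_tmp = snap_dir / "active.tmp"
    pointer_tmp.write_text(verified.name)
    os.replace(pointer_tmp, snap_dir / "active")           # 4. swap active pointer
    pending.unlink()
    return version
```

Because the `active` pointer is replaced in a single rename, a reader sees either the old snapshot name or the new one, never a partially written pointer.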
### Peer Cache
Peer cache:
- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version
Peer cache combines signed directory data with runtime observations. Runtime
observations are hints, not durable authority.
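
One way to keep observations from becoming authority is to key the cache strictly on the signed directory. The sketch below assumes hypothetical `DirectoryEntry` and `Observation` types; the field names are illustrative and cover only part of the peer cache list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DirectoryEntry:
    """Signed, immutable data taken from a verified snapshot."""
    node_id: str
    endpoints: Tuple[str, ...]
    cert_fingerprint: str


@dataclass
class Observation:
    """Locally measured hints; never used to add or alter directory entries."""
    last_success: Optional[float] = None
    latency_ms: Optional[float] = None
    recent_failures: int = 0


def build_peer_cache(
    directory: List[DirectoryEntry],
    observations: Dict[str, Observation],
) -> Dict[str, Tuple[DirectoryEntry, Observation]]:
    """Join observations onto signed entries; unknown peers are dropped."""
    return {
        entry.node_id: (entry, observations.get(entry.node_id, Observation()))
        for entry in directory
    }
```

An observation about a peer that is missing from the signed directory is discarded here, so a stale or forged observation cannot introduce a peer.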
### Route Cache
Route cache:
- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason
Route cache must be reconstructable from signed snapshots, peer cache, and
runtime observations. It must not define policy.
### Service Assignment Cache
Service assignment cache:
- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized
This cache informs supervision. It does not allow the node to invent new
service work.
### Health and Degraded State
Health/degraded state:
- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline
Degraded state must be visible in node heartbeat/status when connectivity
returns.
### Update Metadata
Update state:
- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result
Unsigned artifacts must never be activated.
## 5. Encryption and Secret Handling
The local store should avoid storing secrets. When secret-related data is
required, store references and resolver metadata, not plaintext.
Rules:
- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values
Recommended OS facilities:
- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
service-account permissions
- macOS (future client/agent): Keychain
## 6. Atomicity and Durability
Writes must be safe across process crashes and host reboots.
Rules:
- write new content to temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers
Node-agent should tolerate:
- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations
Node-agent must not tolerate silently corrupted identity, trust, or active
snapshot data.
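
The write rules above can be sketched as a temp-file-then-rename helper plus a simple lock file. The helper names are hypothetical, and the lock shown here does not handle stale locks left by a crashed process, which a real agent would need to address.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def store_lock(lock_path: Path):
    """Hold an exclusive lock file; O_EXCL makes creation fail if it exists."""
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def atomic_write(target: Path, data: bytes, tmp_dir: Path) -> None:
    """Write data to a temporary file, flush it to disk, then rename it into place.

    tmp_dir must be on the same filesystem as target, otherwise the rename
    is not atomic.
    """
    tmp = tmp_dir / (target.name + ".tmp")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, target)
    if os.name == "posix":
        # Persist the directory entry as well, so the rename survives a reboot.
        dir_fd = os.open(target.parent, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A caller would typically wrap a group of writes in `with store_lock(...)` so that only one writer touches the store at a time.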
## 7. Cache Expiry and Cleanup
Local caches must be bounded.
Cleanup rules:
- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only the policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup
Caches may be rebuilt. Identity, trust, and active snapshots require stricter
recovery behavior.
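
A bounded cache can be kept within limits by combining expiry with a size cap. The sketch below assumes a hypothetical `CacheEntry` type and a least-recently-used eviction order; the document does not prescribe an eviction policy, so that choice is illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CacheEntry:
    value: str
    expires_at: float   # absolute expiry time, seconds
    last_used: float    # last access time, seconds


def prune(cache: Dict[str, CacheEntry], now: float, max_entries: int) -> List[str]:
    """Drop expired entries, then the least recently used ones above the bound.

    Mutates the cache in place and returns the removed keys.
    """
    removed = [key for key, entry in cache.items() if entry.expires_at <= now]
    for key in removed:
        del cache[key]
    overflow = len(cache) - max_entries
    if overflow > 0:
        oldest_first = sorted(cache, key=lambda key: cache[key].last_used)
        for key in oldest_first[:overflow]:
            del cache[key]
            removed.append(key)
    return removed
```

This applies to rebuildable caches such as peer observations and route entries only; identity, trust, and active snapshots are not candidates for this kind of eviction.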
## 8. Corruption Recovery
Recovery order:
1. load active verified state
2. reject corrupted pending state
3. fall back to the previous verified snapshot if the active snapshot is
corrupt and policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot exists and policy allows it
7. fail closed for trust/identity corruption
Corruption must be reported through health/status and local diagnostics.
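
The recovery order can be read as a decision function. The sketch below is one possible interpretation, not a specification: the `RecoveryAction` names and the boolean inputs are hypothetical, the trust/identity check is evaluated first so that corruption there always fails closed, and neither pending-state rejection (step 2) nor degraded-mode entry (step 6) is modeled.

```python
from enum import Enum


class RecoveryAction(Enum):
    FAIL_CLOSED = "fail_closed"
    USE_ACTIVE = "use_active"
    USE_PREVIOUS = "use_previous"
    FETCH_FULL_SNAPSHOT = "fetch_full_snapshot"
    USE_BOOTSTRAP = "use_bootstrap"


def choose_recovery(identity_ok: bool, trust_ok: bool, active_ok: bool,
                    previous_ok: bool, fallback_allowed: bool,
                    storage_reachable: bool) -> RecoveryAction:
    """Pick the first applicable recovery step for snapshot state."""
    if not (identity_ok and trust_ok):
        return RecoveryAction.FAIL_CLOSED           # step 7, checked first here
    if active_ok:
        return RecoveryAction.USE_ACTIVE            # step 1
    if previous_ok and fallback_allowed:
        return RecoveryAction.USE_PREVIOUS          # step 3
    if storage_reachable:
        return RecoveryAction.FETCH_FULL_SNAPSHOT   # step 4
    return RecoveryAction.USE_BOOTSTRAP             # step 5
```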
## 9. Multi-Cluster Isolation
A node may participate in multiple clusters only through isolated memberships.
Per-cluster isolation includes:
- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace
Cross-cluster data sharing is forbidden unless explicit platform trust and
policy allow it.
## 10. Service Workload Boundary
Service workloads do not write authoritative node-local state.
Allowed workload interactions:
- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent
Forbidden workload interactions:
- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets
## 11. Backup and Restore
Backup rules:
- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots
Restore must require control-plane validation before the node is trusted for
new high-risk work.
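
The anti-rollback and revocation rules can be sketched as a check that runs before restored state is accepted. The sketch assumes the control plane supplies a minimum acceptable trust bundle version and a set of revoked node ids; those inputs and the function name are hypothetical.

```python
from typing import List, Set


def restore_problems(restored_trust_version: int, minimum_trust_version: int,
                     node_id: str, revoked_node_ids: Set[str]) -> List[str]:
    """Return the reasons a restored state must be rejected; empty means none found."""
    problems = []
    if restored_trust_version < minimum_trust_version:
        problems.append("trust bundle is older than the minimum accepted version")
    if node_id in revoked_node_ids:
        problems.append("node identity is revoked")
    return problems
```

An empty result only means these two checks passed; it does not replace the control-plane validation required above.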
## 12. Observability
Node-agent should report safe local state metadata:
- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state
Reports must not include raw secrets or unrelated topology.
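
One way to keep secrets and unrelated topology out of reports is to build them from an explicit allow-list instead of filtering a full state dump. The field names below are hypothetical labels for the metadata listed above.

```python
# Allow-list of report fields; anything not named here is never emitted.
SAFE_STATUS_FIELDS = (
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode",
    "store_health",
    "last_recovery_event",
    "pending_update_state",
)


def build_status_report(local_state: dict) -> dict:
    """Copy only allow-listed metadata from local state into the report."""
    return {name: local_state[name]
            for name in SAFE_STATUS_FIELDS if name in local_state}
```

With this shape, a new field added to local state stays out of reports until someone deliberately adds it to the allow-list.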
## 13. Future Validation Tests
Future implementation tests must prove:
- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly
## 14. C13 Preparation
C13 must define the Fabric Storage / Config Storage service that distributes
snapshots, peer directories, trust bundles, and incremental updates to the
node-local state store.
C13 must preserve:
- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior
## 15. Result / Decision
Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native `rap-node-agent`.
Decisions:
- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment
cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a
second source of truth
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C12.