Node Local State Store
Status: Stage C12 result. Documentation and architecture only.
This document defines the node-local state store model for native
rap-node-agent. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.
1. Purpose
The node-local state store lets rap-node-agent operate safely without asking
the backend for every realtime routing or service supervision decision.
The local store must support:
- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot
The local store must not become a durable source of truth.
2. Authority Boundaries
PostgreSQL remains authoritative for durable domain state.
Fabric Storage / Config Storage distributes signed snapshots and increments.
Node-local state stores verified local copies and runtime observations.
Redis remains live coordination only.
Node-local state must not authorize:
- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust
3. Storage Root and Namespaces
The node-agent should use one configured local storage root.
Example logical layout:
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
Rules:
- cluster state is namespace-isolated by cluster_id
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account
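As a rough illustration of the namespace rules above, the sketch below resolves a per-cluster path under the storage root and rejects identifiers that could escape it. The function name `cluster_path`, the permitted `cluster_id` character set, and the fixed set of area names (copied from the example layout) are assumptions made for illustration; this document does not define them.

```python
import re
from pathlib import Path

# Areas copied from the example logical layout; the exact set is an assumption.
CLUSTER_AREAS = frozenset({
    "identity", "trust", "snapshots", "peers", "routes",
    "services", "health", "updates", "telemetry", "tmp",
})

# Hypothetical cluster_id format: letters, digits, '-' and '_' only.
_CLUSTER_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def cluster_path(root: Path, cluster_id: str, area: str) -> Path:
    """Return <root>/clusters/<cluster_id>/<area>, refusing unsafe input."""
    if not _CLUSTER_ID_RE.fullmatch(cluster_id):
        # Rejects separators and dots, so "../other" cannot leave the namespace.
        raise ValueError(f"invalid cluster_id: {cluster_id!r}")
    if area not in CLUSTER_AREAS:
        raise ValueError(f"unknown state area: {area!r}")
    return root / "clusters" / cluster_id / area
```

Validating the identifier before building the path keeps every per-cluster read and write inside its own namespace. Filesystem permissions for the service account would still need to be applied separately.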
4. State Classes
Agent State
Agent-level state:
- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata
Agent state is not cluster authority.
Identity State
Cluster identity state:
- node_id
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache
Private keys should be stored in an OS-protected key store when available. If file-backed keys are necessary, they must be encrypted at rest and protected by strict filesystem permissions.
Trust State
Trust state:
- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version
Trust state must be signed and versioned. Unknown or revoked trust roots must not be accepted.
Snapshot State
Snapshot state:
- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines
Snapshot activation must be atomic:
- write pending snapshot
- verify signature, scope, hash, expiry, and version
- persist verified content
- swap active pointer
- notify affected runtime components
- report applied version
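The verification step in the sequence above could look roughly like the following. This is a minimal sketch under stated assumptions: the `PendingSnapshot` fields, the `verify_snapshot` name, and the rule that a new version must be strictly greater than the active one are all hypothetical, and the signature check is left to a caller-supplied function because this document does not specify a signature scheme.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PendingSnapshot:
    # Hypothetical field set; the real snapshot format is not defined here.
    scope: str
    version: int
    expires_at: float      # Unix seconds
    content: bytes
    content_sha256: str    # hex digest declared alongside the content
    signature: bytes


def verify_snapshot(
    snap: PendingSnapshot,
    *,
    expected_scope: str,
    active_version: int,
    now: float,
    signature_ok: Callable[[PendingSnapshot], bool],
) -> None:
    """Raise ValueError unless every activation precondition holds."""
    if not signature_ok(snap):
        raise ValueError("signature verification failed")
    if snap.scope != expected_scope:
        raise ValueError("scope mismatch")
    if hashlib.sha256(snap.content).hexdigest() != snap.content_sha256:
        raise ValueError("content hash mismatch")
    if snap.expires_at <= now:
        raise ValueError("snapshot expired")
    # Assumed anti-rollback rule: only strictly newer versions may activate.
    if snap.version <= active_version:
        raise ValueError("version is not newer than the active snapshot")
```

Only a snapshot that passes every check would proceed to the persist and pointer-swap steps; a failure leaves the active snapshot untouched.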
Peer Cache
Peer cache:
- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version
Peer cache combines signed directory data with runtime observations. Runtime observations are hints, not durable authority.
Route Cache
Route cache:
- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason
Route cache must be reconstructable from signed snapshots, peer cache, and runtime observations. It must not define policy.
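To make the "reconstructable, not policy-defining" property concrete, the sketch below rebuilds a route ranking from two inputs only: the set of peers present in the signed directory and the locally observed metrics. The `Observation` type, the `rebuild_route_cache` name, and the scoring formula are placeholders invented for this example; real route scoring is outside the scope of this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    latency_ms: float
    packet_loss: float  # fraction in [0, 1]


def rebuild_route_cache(
    signed_peers: set[str],
    observations: dict[str, Observation],
) -> list[str]:
    """Rank peers for routing; only peers in the signed directory qualify."""
    def score(peer_id: str) -> float:
        obs = observations.get(peer_id)
        if obs is None:
            return float("inf")  # unobserved peers rank last
        # Placeholder scoring: latency inflated by observed loss.
        return obs.latency_ms * (1.0 + 10.0 * obs.packet_loss)

    # Iterating over signed_peers means observations alone never add a peer.
    return sorted(signed_peers, key=lambda p: (score(p), p))
```

Because the ranking is derived entirely from its inputs, deleting the route cache loses nothing that cannot be recomputed, and an observation about a peer outside the signed directory has no effect.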
Service Assignment Cache
Service assignment cache:
- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or vpn_connection refs where authorized
This cache informs supervision. It does not allow the node to invent new service work.
Health and Degraded State
Health/degraded state:
- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline
Degraded state must be visible in node heartbeat/status when connectivity returns.
Update Metadata
Update state:
- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result
Unsigned artifacts must never be activated.
5. Encryption and Secret Handling
The local store should avoid storing secrets. When secret-related data is required, store references and resolver metadata, not plaintext.
Rules:
- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values
Recommended OS facilities:
- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected service-account permissions
- macOS future client/agent: Keychain
6. Atomicity and Durability
Writes must be safe across process crashes and host reboots.
Rules:
- write new content to temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers
Node-agent should tolerate:
- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations
Node-agent must not tolerate silently corrupted identity, trust, or active snapshot data.
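The write rules in this section follow the common write-temp, fsync, rename pattern. The sketch below shows one way to express it; the `atomic_write` name and the `.pending-` prefix are assumptions. It creates the temporary file next to the target so the rename stays on one filesystem, which a `tmp/` directory under the same cluster namespace would also satisfy if it shares that filesystem. Content verification and the store lock are deliberately omitted.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(target: Path, data: bytes) -> None:
    """Replace `target` with `data` so readers see old or new content only."""
    directory = target.parent
    # Same directory as the target keeps the rename on a single filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=".pending-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())      # push file content to stable storage
        os.replace(tmp_name, target)    # rename over the active file
    except BaseException:
        # Leave the active file untouched and drop the partial temp file.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    if hasattr(os, "O_DIRECTORY"):
        # POSIX only: fsync the directory so the rename itself is durable.
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

An interrupted call leaves either the previous file or a stray `.pending-` file, never a half-written active file. Stray temporary files are the kind of residue a startup cleanup pass would remove.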
7. Cache Expiry and Cleanup
Local caches must be bounded.
Cleanup rules:
- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only a policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup
Caches may be rebuilt. Identity, trust, and active snapshots require stricter recovery behavior.
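A bounded cache with expiry can be sketched in a few lines. The `CacheEntry` fields, the `prune` name, and the choice to evict by least-recent use are assumptions for illustration; this document only requires that caches stay bounded and that expired entries are removed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    key: str
    expires_at: float   # Unix seconds
    last_used: float    # Unix seconds


def prune(entries: list[CacheEntry], now: float, max_entries: int) -> list[CacheEntry]:
    """Drop expired entries, then keep only the most recently used ones."""
    live = [e for e in entries if e.expires_at > now]
    live.sort(key=lambda e: e.last_used, reverse=True)
    return live[:max_entries]
```

The same shape would apply to peer observations and route cache entries. Identity, trust, and active snapshots would not pass through a routine like this, since they require stricter recovery behavior.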
8. Corruption Recovery
Recovery order:
- load active verified state
- reject corrupted pending state
- fall back to previous verified snapshot if active snapshot is corrupt and policy allows it
- request full snapshot from config/storage service
- use bootstrap peers or control plane if storage/config is unavailable
- enter degraded mode only if a valid snapshot exists and policy allows it
- fail closed for trust/identity corruption
Corruption must be reported through health/status and local diagnostics.
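The recovery order can be read as a small decision function. The sketch below is one interpretation, not a definition: the `Action` values and the `choose_recovery` name are hypothetical, and it evaluates the trust/identity check first as an overriding precondition, even though the list above states it last. Contacting bootstrap peers and entering degraded mode are left out to keep the example short.

```python
from enum import Enum


class Action(Enum):
    USE_ACTIVE = "use_active"
    USE_PREVIOUS = "use_previous"
    REQUEST_FULL_SNAPSHOT = "request_full_snapshot"
    FAIL_CLOSED = "fail_closed"


def choose_recovery(
    *,
    identity_ok: bool,
    trust_ok: bool,
    active_ok: bool,
    previous_ok: bool,
    policy_allows_fallback: bool,
) -> Action:
    """Pick the first applicable recovery step for one snapshot scope."""
    # Assumed ordering: corrupt identity or trust overrides everything else.
    if not (identity_ok and trust_ok):
        return Action.FAIL_CLOSED
    if active_ok:
        return Action.USE_ACTIVE
    if previous_ok and policy_allows_fallback:
        return Action.USE_PREVIOUS
    return Action.REQUEST_FULL_SNAPSHOT
```

Keeping the decision pure, with no file or network access, would make each branch straightforward to cover in the validation tests listed later in this document.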
9. Multi-Cluster Isolation
A node may participate in multiple clusters only through isolated memberships.
Per-cluster isolation includes:
- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace
Cross-cluster data sharing is forbidden unless explicit platform trust and policy allow it.
10. Service Workload Boundary
Service workloads do not write authoritative node-local state.
Allowed workload interactions:
- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent
Forbidden workload interactions:
- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets
11. Backup and Restore
Backup rules:
- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots
Restore must require control-plane validation before the node is trusted for new high-risk work.
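A restore gate that respects these rules might be sketched as follows. Everything here is an assumption for illustration: the `restore_allowed` name, the idea that trust bundle versions are comparable integers, and the existence of a "highest seen" version supplied by the control plane or another rollback-resistant source.

```python
def restore_allowed(
    *,
    restored_trust_version: int,
    highest_seen_trust_version: int,
    identity_revoked: bool,
    control_plane_validated: bool,
) -> bool:
    """Return True only if restored state may be trusted for high-risk work."""
    if identity_revoked:
        return False  # never replay a revoked identity
    if restored_trust_version < highest_seen_trust_version:
        return False  # restored trust bundle is older than one already seen
    return control_plane_validated
```

Note that a restored backup cannot supply `highest_seen_trust_version` itself, since the backup is exactly what might be stale.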
12. Observability
Node-agent should report safe local state metadata:
- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state
Reports must not include raw secrets or unrelated topology.
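One simple way to keep reports free of secrets and unrelated topology is an allow-list applied at the reporting boundary. The field names below are hypothetical keys derived from the list above, and `build_status_report` is an illustrative name, not a defined API.

```python
# Hypothetical report keys mirroring the safe metadata listed above.
SAFE_REPORT_FIELDS = frozenset({
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode",
    "local_store_health",
    "last_recovery_event",
    "pending_update_state",
})


def build_status_report(local_state: dict[str, object]) -> dict[str, object]:
    """Copy only allow-listed metadata into the outgoing report."""
    return {k: v for k, v in local_state.items() if k in SAFE_REPORT_FIELDS}
```

An allow-list fails safe: a field added to local state later stays out of reports until someone explicitly marks it as safe.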
13. Future Validation Tests
Future implementation tests must prove:
- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly
14. C13 Preparation
C13 must define the Fabric Storage / Config Storage service that distributes snapshots, peer directories, trust bundles, and incremental updates to the node-local state store.
C13 must preserve:
- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior
15. Result / Decision
Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native rap-node-agent.
Decisions:
- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a second source of truth
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C12.