# Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for native `rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The node-local state store lets `rap-node-agent` operate safely without asking the backend for every realtime routing or service supervision decision.

The local store must support:

- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot

The local store must not become a durable source of truth.

## 2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis remains live coordination only.

Node-local state must not authorize:

- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust

## 3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

```text
rap-node-agent-state/
  agent/
  clusters/
    /
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```

Rules:

- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account

## 4. State Classes

### Agent State

Agent-level state:

- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata

Agent state is not cluster authority.

### Identity State

Cluster identity state:

- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache

Private keys should be stored in an OS-protected key store when available. If file-backed keys are necessary, they must be encrypted at rest and protected by strict filesystem permissions.

### Trust State

Trust state:

- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must not be accepted.

### Snapshot State

Snapshot state:

- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines

Snapshot activation must be atomic:

1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version

### Peer Cache

Peer cache:

- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version

Peer cache combines signed directory data with runtime observations. Runtime observations are hints, not durable authority.
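The split between signed directory data and runtime observations in the peer cache can be illustrated with a minimal sketch. It is not part of the C12 model. The class names, field names, and the 300-second observation TTL are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SignedPeerEntry:
    """Directory data copied from a verified signed snapshot (hypothetical shape)."""
    peer_id: str
    endpoint_candidates: tuple
    cert_fingerprint: str
    config_version: int


@dataclass
class PeerObservation:
    """Locally measured hints: safe to lose, never authoritative."""
    latency_ms: Optional[float] = None
    packet_loss: Optional[float] = None
    last_success_ts: Optional[float] = None
    recent_failures: list = field(default_factory=list)
    observed_at: float = 0.0


class PeerCache:
    def __init__(self, observation_ttl_s: float = 300.0):
        self._directory = {}      # peer_id -> SignedPeerEntry
        self._observations = {}   # peer_id -> PeerObservation
        self._ttl = observation_ttl_s

    def load_directory(self, entries):
        """Replace directory data and drop hints for peers no longer listed."""
        self._directory = {e.peer_id: e for e in entries}
        self._observations = {
            p: o for p, o in self._observations.items() if p in self._directory
        }

    def observe(self, peer_id, obs, now=None):
        """Record a hint only for peers the signed directory already lists."""
        if peer_id not in self._directory:
            return False
        obs.observed_at = time.time() if now is None else now
        self._observations[peer_id] = obs
        return True

    def expire(self, now=None):
        """Remove observations older than the TTL; return how many were dropped."""
        now = time.time() if now is None else now
        stale = [p for p, o in self._observations.items()
                 if now - o.observed_at > self._ttl]
        for p in stale:
            del self._observations[p]
        return len(stale)

    def lookup(self, peer_id):
        """Return (signed entry, observation or None), or None for unknown peers."""
        entry = self._directory.get(peer_id)
        if entry is None:
            return None
        return entry, self._observations.get(peer_id)
```

In this sketch an observation can never introduce a peer: only `load_directory`, fed from verified snapshot content, adds entries, which keeps runtime hints from acting as authority.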
### Route Cache

Route cache:

- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and runtime observations. It must not define policy.

### Service Assignment Cache

Service assignment cache:

- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized

This cache informs supervision. It does not allow the node to invent new service work.

### Health and Degraded State

Health/degraded state:

- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline

Degraded state must be visible in node heartbeat/status when connectivity returns.

### Update Metadata

Update state:

- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result

Unsigned artifacts must never be activated.

## 5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is required, store references and resolver metadata, not plaintext.
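As a rough illustration of storing references rather than plaintext, the sketch below serializes a cache record that carries only a secret reference and refuses records with obviously secret-bearing keys. The `SecretRef` shape, the resolver name `os-keystore`, and the forbidden-key list are hypothetical and not defined by this document.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SecretRef:
    """Pointer to a secret held elsewhere; carries no secret value."""
    ref_id: str
    resolver: str  # hypothetical resolver name, e.g. "os-keystore"
    scope: str     # hypothetical scope label for the assigned service


# Assumed denylist of key names that suggest plaintext secret material.
FORBIDDEN_KEYS = {"password", "private_key", "secret", "token"}


def _reject_plaintext(obj):
    """Walk nested dicts/lists and fail on any key that looks like a secret."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            if key.lower() in FORBIDDEN_KEYS:
                raise ValueError(f"refusing to store plaintext field: {key}")
            _reject_plaintext(value)
    elif isinstance(obj, list):
        for item in obj:
            _reject_plaintext(item)


def serialize_for_store(record: dict) -> str:
    """Serialize a cache record after checking it holds references only."""
    _reject_plaintext(record)
    return json.dumps(record, sort_keys=True)
```

A record would then embed `asdict(SecretRef(...))` under a neutral key such as `credential`, and the secret itself would be resolved at runtime only when the assigned service policy permits it.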
Rules:

- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values

Recommended OS facilities:

- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected service-account permissions
- macOS future client/agent: Keychain

## 6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

- write new content to temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers

Node-agent should tolerate:

- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active snapshot data.

## 7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter recovery behavior.

## 8. Corruption Recovery

Recovery order:

1. load active verified state
2. reject corrupted pending state
3. fall back to previous verified snapshot if active snapshot is corrupt and policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot and policy allow it
7. fail closed for trust/identity corruption

Corruption must be reported through health/status and local diagnostics.

## 9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and policy allow it.

## 10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent

Forbidden workload interactions:

- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets

## 11. Backup and Restore

Backup rules:

- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for new high-risk work.

## 12. Observability

Node-agent should report safe local state metadata:

- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state

Reports must not include raw secrets or unrelated topology.

## 13. Future Validation Tests

Future implementation tests must prove:

- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly

## 14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes snapshots, peer directories, trust bundles, and incremental updates to the node-local state store.

C13 must preserve:

- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior

## 15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store owned by native `rap-node-agent`.
Decisions:

- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C12.
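As a non-normative illustration of the atomic-activation decision, a future implementation might resemble the sketch below. The file names (`pending.tmp`, `active.json`, `snapshot-<version>.bin`) and function names are assumptions. Verification is reduced to a SHA-256 check plus a version anti-rollback check; signature, scope, and expiry verification, the store lock, and directory fsync are omitted for brevity.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def _write_durable(path: Path, data: bytes) -> None:
    """Write bytes and fsync so the content survives a crash or reboot."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def activate_snapshot(scope_dir: Path, content: bytes,
                      expected_sha256: str, version: int) -> bool:
    """Write pending, verify, persist, then swap the active pointer."""
    scope_dir.mkdir(parents=True, exist_ok=True)
    pointer = scope_dir / "active.json"
    # Anti-rollback: refuse versions that are not newer than the active one.
    if pointer.exists() and version <= json.loads(pointer.read_bytes())["version"]:
        return False
    pending = scope_dir / "pending.tmp"
    _write_durable(pending, content)
    # Verify what actually reached disk, not the in-memory buffer.
    if hashlib.sha256(pending.read_bytes()).hexdigest() != expected_sha256:
        pending.unlink()
        return False
    verified = scope_dir / f"snapshot-{version}.bin"
    os.replace(pending, verified)  # atomic rename on POSIX filesystems
    pointer_tmp = scope_dir / "active.tmp"
    meta = {"version": version, "file": verified.name, "sha256": expected_sha256}
    _write_durable(pointer_tmp, json.dumps(meta).encode())
    os.replace(pointer_tmp, pointer)  # readers see the old or new pointer, never a mix
    return True


def load_active(scope_dir: Path):
    """Load the active snapshot and fail loudly if its content is corrupt."""
    meta = json.loads((scope_dir / "active.json").read_bytes())
    data = (scope_dir / meta["file"]).read_bytes()
    if hashlib.sha256(data).hexdigest() != meta["sha256"]:
        raise ValueError("active snapshot is corrupt")
    return meta["version"], data
```

Because each verified snapshot is kept under its own versioned file name, the previous verified snapshot remains on disk after a swap and is available for recovery until cleanup policy removes it.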