# Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for native
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.

## 1. Purpose

The node-local state store lets `rap-node-agent` operate safely without asking
the backend for every realtime routing or service supervision decision.

The local store must support:

- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot

The local store must not become a durable source of truth.

## 2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis remains live coordination only.

Node-local state must not authorize:

- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust

## 3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

```text
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```

Rules:

- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic
  activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account

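The namespace rules above can be illustrated with a minimal sketch. It assumes a POSIX-style filesystem and a hypothetical rule that `cluster_id` values contain only letters, digits, `-`, and `_`; the function names `cluster_root` and `ensure_cluster_namespace` are illustrative and not part of any defined API.

```python
from pathlib import Path

# Per-cluster subdirectories, taken from the logical layout in this section.
CLUSTER_DIRS = ("identity", "trust", "snapshots", "peers", "routes",
                "services", "health", "updates", "telemetry", "tmp")


def cluster_root(storage_root: Path, cluster_id: str) -> Path:
    """Return the namespace directory for one cluster, rejecting unsafe ids."""
    # Assumption: ids are restricted to letters, digits, '-' and '_', so a
    # value such as '..' or 'a/b' can never escape the cluster namespace.
    if not cluster_id or not all(c.isalnum() or c in "-_" for c in cluster_id):
        raise ValueError(f"unsafe cluster_id: {cluster_id!r}")
    return storage_root / "clusters" / cluster_id


def ensure_cluster_namespace(storage_root: Path, cluster_id: str) -> Path:
    """Create the per-cluster directories if they do not exist yet."""
    root = cluster_root(storage_root, cluster_id)
    for name in CLUSTER_DIRS:
        # Requests owner-only access (0o700, subject to umask). Parent
        # directories created implicitly get default permissions, and
        # Windows would need ACLs instead of mode bits.
        (root / name).mkdir(mode=0o700, parents=True, exist_ok=True)
    return root
```

Deriving every path from `cluster_root` keeps the isolation check in one place: code that never builds a cluster path by hand cannot accidentally address another cluster's namespace.
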
## 4. State Classes

### Agent State

Agent-level state:

- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata

Agent state is not cluster authority.

### Identity State

Cluster identity state:

- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache

Private keys should be stored in an OS-protected key store when available. If
file-backed keys are necessary, they must be encrypted at rest and protected by
strict filesystem permissions.

### Trust State

Trust state:

- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must
not be accepted.

### Snapshot State

Snapshot state:

- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines

Snapshot activation must be atomic:

1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version

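Steps 1 to 4 of the activation sequence can be sketched as follows. This is an illustration under stated assumptions, not the agent's format: the JSON envelope with `payload` and `signature` fields, the file names `pending.json` and `active`, and the function names are all hypothetical, and an HMAC stands in for both the signature and hash checks (a real deployment would more likely verify an asymmetric signature from the config signing key). Notification and reporting (steps 5 and 6) are left to the caller.

```python
import hashlib
import hmac
import json
import os
from pathlib import Path


def verify_snapshot(raw: bytes, key: bytes, scope: str,
                    active_version: int, now: float) -> dict:
    """Step 2: check signature, scope, expiry, and version of a snapshot."""
    envelope = json.loads(raw)
    payload = envelope["payload"].encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["signature"]):
        raise ValueError("bad signature")
    body = json.loads(payload)
    if body["scope"] != scope:
        raise ValueError("scope mismatch")
    if body["expires_at"] <= now:
        raise ValueError("snapshot expired")
    if body["version"] <= active_version:
        raise ValueError("version is not newer than the active snapshot")
    return body


def activate_snapshot(snap_dir: Path, raw: bytes, key: bytes, scope: str,
                      active_version: int, now: float) -> int:
    """Steps 1-4: write pending, verify, persist, swap the active pointer."""
    pending = snap_dir / "pending.json"
    pending.write_bytes(raw)                                    # step 1
    try:
        body = verify_snapshot(raw, key, scope, active_version, now)
    except (ValueError, KeyError) as exc:
        pending.unlink()                 # rejected pending state is dropped
        raise ValueError(f"snapshot rejected: {exc}") from exc
    verified = snap_dir / f"v{body['version']}.json"
    os.replace(pending, verified)                               # step 3
    pointer_tmp = snap_dir / "active.tmp"
    pointer_tmp.write_text(verified.name)
    os.replace(pointer_tmp, snap_dir / "active")                # step 4
    return body["version"]
```

Because the pointer is swapped with a single rename, a reader sees either the old or the new snapshot name and never a partially written one. The sketch omits `fsync` calls and the store lock for brevity.
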
### Peer Cache

Peer cache:

- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version

Peer cache combines signed directory data with runtime observations. Runtime
observations are hints, not durable authority.

### Route Cache

Route cache:

- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and
runtime observations. It must not define policy.

### Service Assignment Cache

Service assignment cache:

- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized

This cache informs supervision. It does not allow the node to invent new
service work.

### Health and Degraded State

Health/degraded state:

- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline

Degraded state must be visible in node heartbeat/status when connectivity
returns.

### Update Metadata

Update state:

- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result

Unsigned artifacts must never be activated.

## 5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is
required, store references and resolver metadata, not plaintext.

Rules:

- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values

Recommended OS facilities:

- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
  service-account permissions
- macOS future client/agent: Keychain

## 6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

- write new content to a temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers

Node-agent should tolerate:

- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active
snapshot data.

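A minimal sketch of the write rules, assuming the temporary directory lives on the same filesystem as the target (as the per-cluster `tmp/` namespace implies). The helper name `atomic_write` and its `verify` callback are hypothetical; the store lock is not shown.

```python
import os
from pathlib import Path
from typing import Callable


def atomic_write(target: Path, data: bytes, tmp_dir: Path,
                 verify: Callable[[bytes], bool]) -> None:
    """Write data to a temp file, verify it, then rename it over target."""
    tmp_path = tmp_dir / (target.name + ".tmp")
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # push file contents to stable storage
    # Verify what actually landed on disk before it can become active.
    if not verify(tmp_path.read_bytes()):
        tmp_path.unlink()
        raise ValueError("content failed verification; target left untouched")
    os.replace(tmp_path, target)      # atomic when both paths share a filesystem
    # On POSIX, also fsync the directory so the rename itself is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(target.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash before `os.replace` leaves only a stray temporary file, which startup cleanup can remove; a crash after it leaves the new content fully in place. In neither case is the active file partially overwritten.
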
## 7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter
recovery behavior.

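Expiry and bounding for a peer or route cache might look like the sketch below. It assumes each entry carries an `expires_at` timestamp and that, when the cache is over its limit, the entries closest to expiry are evicted first. Both the field name and the eviction policy are assumptions, since the document leaves them to policy.

```python
def prune_cache(entries: dict[str, dict], now: float,
                max_entries: int) -> dict[str, dict]:
    """Drop expired entries, then cap the cache at max_entries."""
    live = {key: entry for key, entry in entries.items()
            if entry["expires_at"] > now}
    if len(live) > max_entries:
        # Assumed policy: keep the entries that stay valid the longest.
        keep = sorted(live, key=lambda k: live[k]["expires_at"],
                      reverse=True)[:max_entries]
        live = {key: live[key] for key in keep}
    return live
```

The function returns a new dictionary instead of mutating its input, so a caller can persist the pruned cache through the same atomic write path used for other state.
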
## 8. Corruption Recovery

Recovery order:

1. load active verified state
2. reject corrupted pending state
3. fall back to the previous verified snapshot if the active snapshot is
   corrupt and policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot and policy allow it
7. fail closed for trust/identity corruption

Corruption must be reported through health/status and local diagnostics.

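The snapshot-related part of the recovery order can be expressed as a small decision function. This is a sketch only: the `Recovery` enum and both function names are hypothetical, and the sketch evaluates trust/identity corruption first on the assumption that a fail-closed condition overrides every other step. Steps 2 and 5 (rejecting pending state, choosing a source for the full snapshot) are outside its scope.

```python
from enum import Enum


class Recovery(Enum):
    FAIL_CLOSED = "fail_closed"
    USE_ACTIVE = "use_active"
    USE_PREVIOUS = "use_previous"
    REQUEST_FULL_SNAPSHOT = "request_full_snapshot"


def plan_recovery(identity_ok: bool, trust_ok: bool, active_ok: bool,
                  previous_ok: bool, fallback_allowed: bool) -> Recovery:
    """Pick the recovery action for one cluster namespace at startup."""
    if not (identity_ok and trust_ok):
        return Recovery.FAIL_CLOSED            # step 7 overrides the rest
    if active_ok:
        return Recovery.USE_ACTIVE             # step 1
    if previous_ok and fallback_allowed:
        return Recovery.USE_PREVIOUS           # step 3
    return Recovery.REQUEST_FULL_SNAPSHOT      # step 4


def may_enter_degraded_mode(has_valid_snapshot: bool,
                            policy_allows: bool) -> bool:
    """Step 6: degraded mode needs both a valid snapshot and policy consent."""
    return has_valid_snapshot and policy_allows
```

Keeping the decision free of I/O makes it straightforward to cover every combination of inputs in the future validation tests.
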
## 9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and
policy allow it.

## 10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent

Forbidden workload interactions:

- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets

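One way to enforce this boundary is a default-deny check on requests that workloads send to the node-agent. The operation names below are hypothetical labels for the allowed interactions listed above; lifecycle commands flow from the agent to the workload, so they do not appear as a workload request.

```python
# Hypothetical operation names for requests a workload may send to node-agent.
ALLOWED_WORKLOAD_REQUESTS = frozenset({
    "read_assigned_config",
    "report_health",
    "request_secret_resolution",
})


def is_workload_request_allowed(operation: str) -> bool:
    """Default deny: anything not explicitly allowed is rejected."""
    return operation in ALLOWED_WORKLOAD_REQUESTS
```

An allow-list means a newly introduced operation stays forbidden until it is explicitly reviewed and added, which matches the intent that workloads never mutate snapshots, trust roots, or role assignments.
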
## 11. Backup and Restore

Backup rules:

- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for
new high-risk work.

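The anti-rollback and revocation checks on restore can be sketched as below. The sketch assumes the control plane supplies a minimum acceptable trust bundle version and a set of revoked certificate fingerprints; the `RestoredState` type, its fields, and the function name are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RestoredState:
    cert_fingerprint: str        # fingerprint of the restored node certificate
    trust_bundle_version: int    # version of the restored trust bundle


def restore_is_acceptable(state: RestoredState, min_trust_version: int,
                          revoked_fingerprints: set[str]) -> bool:
    """Reject restores that would replay revoked identity or old trust."""
    if state.cert_fingerprint in revoked_fingerprints:
        return False             # identity was revoked after the backup
    if state.trust_bundle_version < min_trust_version:
        return False             # trust bundle is older than allowed
    return True
```

Both inputs come from the control plane, not from the backup itself, so a restored node cannot vouch for its own freshness.
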
## 12. Observability

Node-agent should report safe local state metadata:

- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state

Reports must not include raw secrets or unrelated topology.

## 13. Future Validation Tests

Future implementation tests must prove:

- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly

## 14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes
snapshots, peer directories, trust bundles, and incremental updates to the
node-local state store.

C13 must preserve:

- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior

## 15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native `rap-node-agent`.

Decisions:

- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment
  cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a
  second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C12.