Initial project snapshot
@@ -0,0 +1,419 @@
# Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for native
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.

## 1. Purpose

The node-local state store lets `rap-node-agent` operate safely without asking
the backend for every realtime routing or service supervision decision.

The local store must support:

- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot

The local store must not become a durable source of truth.

## 2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis is used for live coordination only.

Node-local state must not authorize:

- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust

## 3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

```text
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```

Rules:

- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic
  activation
- no cluster may read another cluster's local state namespace
- file permissions must restrict access to the node-agent service account
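
The layout and rules above can be sketched in a few lines. This is a minimal illustration, not the agent's implementation: the function names, the `cluster_id` pattern, and the use of POSIX mode `0o700` are all assumptions made for the example.

```python
import os
import re
from pathlib import Path

# Hypothetical constraint: the document does not define the cluster_id format.
# A strict pattern keeps ids such as "../other" out of the path.
_CLUSTER_ID = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_-]{0,63}$")

CLUSTER_SUBDIRS = (
    "identity", "trust", "snapshots", "peers", "routes",
    "services", "health", "updates", "telemetry", "tmp",
)


def _mkdir_private(path: Path) -> None:
    """Create one directory and restrict it to the owning account."""
    path.mkdir(exist_ok=True)
    os.chmod(path, 0o700)  # POSIX owner-only; Windows would need ACLs instead


def cluster_namespace(root: Path, cluster_id: str) -> Path:
    """Return the namespace directory for one cluster, rejecting unsafe ids."""
    if not _CLUSTER_ID.fullmatch(cluster_id):
        raise ValueError(f"invalid cluster_id: {cluster_id!r}")
    return root / "clusters" / cluster_id


def ensure_layout(root: Path, cluster_id: str) -> Path:
    """Create the agent directory and one isolated cluster namespace."""
    namespace = cluster_namespace(root, cluster_id)
    for path in (root, root / "agent", root / "clusters", namespace):
        _mkdir_private(path)
    for name in CLUSTER_SUBDIRS:
        _mkdir_private(namespace / name)
    return namespace
```

Because every path for a cluster is derived from `cluster_namespace`, a component that only ever receives its own namespace path has no accidental route into another cluster's files.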

## 4. State Classes

### Agent State

Agent-level state:

- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata

Agent state is not cluster authority.

### Identity State

Cluster identity state:

- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache

Private keys should be stored in an OS-protected key store when available. If
file-backed keys are necessary, they must be encrypted at rest and protected by
strict filesystem permissions.

### Trust State

Trust state:

- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must
not be accepted.

### Snapshot State

Snapshot state:

- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines

Snapshot activation must be atomic:

1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version
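
The activation steps above can be sketched as follows. This is an illustration under stated assumptions, not the agent's design: the envelope fields (`payload`, `sha256`, `signature`), the file names, and the use of HMAC-SHA256 as a stand-in for real signature verification are all hypothetical, and `fsync` calls are omitted for brevity.

```python
import hashlib
import hmac
import json
import os
import time
from pathlib import Path


def verify_snapshot(envelope: dict, key: bytes, scope: str,
                    current_version: int, now: float) -> bytes:
    """Return the payload bytes if every check passes, else raise ValueError."""
    payload = envelope["payload"].encode("utf-8")
    # HMAC is only a stand-in here; a real agent would verify an asymmetric
    # signature against the config signing keys in its trust state.
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, envelope["signature"]):
        raise ValueError("bad signature")
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("hash mismatch")
    body = json.loads(payload)
    if body["scope"] != scope:
        raise ValueError("wrong scope")
    if body["expires_at"] <= now:
        raise ValueError("expired")
    if body["version"] <= current_version:
        raise ValueError("version is not newer")
    return payload


def activate(snap_dir: Path, envelope: dict, key: bytes, scope: str,
             current_version: int) -> int:
    """Write pending, verify, persist, then swap the active pointer."""
    pending = snap_dir / "pending.json"
    pending.write_text(json.dumps(envelope))                  # step 1
    try:
        payload = verify_snapshot(json.loads(pending.read_text()), key,
                                  scope, current_version, time.time())  # step 2
    except (ValueError, KeyError):
        pending.unlink()          # a rejected pending snapshot never lingers
        raise
    version = json.loads(payload)["version"]
    content = snap_dir / f"snapshot-{version}.json"
    content.write_bytes(payload)                              # step 3
    pointer_tmp = snap_dir / "active.tmp"
    pointer_tmp.write_text(content.name)
    os.replace(pointer_tmp, snap_dir / "active")              # step 4
    pending.unlink()
    return version   # steps 5 and 6 (notify, report) are left to the caller
```

The active pointer is the only thing that changes at activation time, so a crash before `os.replace` leaves the previous snapshot active and a crash after it leaves the new one active.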

### Peer Cache

Peer cache:

- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version

Peer cache combines signed directory data with runtime observations. Runtime
observations are hints, not durable authority.
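
One way to keep the two kinds of data apart is to store them in separate types, so that signed directory fields cannot be changed by local measurement code. The class names, field names, starting score, and the exponential moving average below are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class SignedPeer:
    """Fields copied from a verified, signed peer directory entry."""
    node_id: str
    endpoints: tuple
    cert_fingerprint: str
    config_version: int


@dataclass
class PeerObservations:
    """Locally measured hints; safe to drop and rebuild at any time."""
    last_success: Optional[float] = None
    latency_ms: Optional[float] = None
    reliability: float = 0.5            # hypothetical neutral starting score
    recent_failures: List[str] = field(default_factory=list)

    def record(self, ok: bool, now: float, latency_ms: Optional[float] = None,
               reason: str = "", alpha: float = 0.2, keep: int = 5) -> None:
        # Exponential moving average of success (1.0) and failure (0.0).
        sample = 1.0 if ok else 0.0
        self.reliability = (1 - alpha) * self.reliability + alpha * sample
        if ok:
            self.last_success = now
            self.latency_ms = latency_ms
        else:
            # Keep only the most recent failures so the history stays bounded.
            self.recent_failures = (self.recent_failures + [reason])[-keep:]


@dataclass
class PeerCacheEntry:
    signed: SignedPeer
    observed: PeerObservations = field(default_factory=PeerObservations)
```

Discarding `observed` loses nothing that cannot be measured again, while `signed` can only be replaced by loading a newer verified directory entry.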

### Route Cache

Route cache:

- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and
runtime observations. It must not define policy.
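
A small sketch of what "reconstructable" can mean in practice: the route is a pure function of inputs that already live elsewhere. The names, the 30-second lifetime, and ranking by reliability score are assumptions made for the example, not a routing algorithm taken from this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Route:
    peer_id: str
    score: float
    expires_at: float
    alternatives: List[str]


def rebuild_route(allowed_peers: List[str], reliability: Dict[str, float],
                  now: float, ttl: float = 30.0) -> Optional[Route]:
    """Rank snapshot-approved peers by observed reliability.

    The signed snapshot decides which peers are eligible; runtime
    observations only order them. A peer that is absent from
    allowed_peers is never selected, whatever its score.
    """
    ranked = sorted(allowed_peers,
                    key=lambda peer: reliability.get(peer, 0.0),
                    reverse=True)
    if not ranked:
        return None
    best = ranked[0]
    return Route(best, reliability.get(best, 0.0), now + ttl, ranked[1:])
```

Deleting the route cache and calling `rebuild_route` again yields an equivalent result, which is the property that keeps the cache from turning into policy.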

### Service Assignment Cache

Service assignment cache:

- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized

This cache informs supervision. It does not allow the node to invent new
service work.
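
As a rough illustration of supervision driven only by the cache, the sketch below compares desired state with what is running and emits start/stop actions. The state strings and the choice to stop unassigned workloads are assumptions for the example; the document only requires that the node never invents work.

```python
from typing import Dict, List, Tuple


def plan_actions(assigned: Dict[str, str],
                 running: Dict[str, str]) -> List[Tuple[str, str]]:
    """Derive supervision actions from the assignment cache.

    assigned maps service name to desired state ("running" or "stopped").
    running maps service name to the locally observed state.
    """
    actions: List[Tuple[str, str]] = []
    for service, desired in sorted(assigned.items()):
        actual = running.get(service, "stopped")
        if desired == "running" and actual != "running":
            actions.append(("start", service))
        elif desired == "stopped" and actual == "running":
            actions.append(("stop", service))
    # Assumption: a workload running without an assignment is stopped,
    # never adopted as if it had been assigned.
    for service in sorted(set(running) - set(assigned)):
        if running[service] == "running":
            actions.append(("stop", service))
    return actions
```

A "start" action can only ever name a service that appears in `assigned`, so the planner has no path by which to create new service work.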

### Health and Degraded State

Health/degraded state:

- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline

Degraded state must be visible in node heartbeat/status when connectivity
returns.

### Update Metadata

Update state:

- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result

Unsigned artifacts must never be activated.

## 5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is
required, store references and resolver metadata, not plaintext.

Rules:

- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values
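
The "references, not plaintext" rule can be illustrated with a small sketch. Everything named here is hypothetical: the `SecretRef` type, the resolver name, the shape of the permission set, and the `fetch` callback stand in for whatever the control boundary actually provides.

```python
from dataclasses import dataclass
from typing import Callable, Set, Tuple


@dataclass(frozen=True, repr=False)
class SecretRef:
    """A pointer to a secret; the local store keeps this, never the value."""
    resolver: str    # hypothetical resolver name
    key: str         # lookup key understood by that resolver

    def __repr__(self) -> str:
        # Safe to log: a reference carries no secret material.
        return f"SecretRef({self.resolver}:{self.key})"


def resolve(ref: SecretRef, service: str, permitted: Set[Tuple[str, str]],
            fetch: Callable[[SecretRef], bytearray]) -> bytearray:
    """Resolve only when policy permits this service to use this reference."""
    if (service, ref.key) not in permitted:
        raise PermissionError(f"{service} may not resolve {ref!r}")
    return fetch(ref)


def wipe(secret: bytearray) -> None:
    """Best-effort overwrite; Python cannot guarantee no other copies exist."""
    for index in range(len(secret)):
        secret[index] = 0
```

Returning a `bytearray` rather than `str` or `bytes` is deliberate: it is mutable, so the caller can overwrite it after use, which matches the "where practical" wording of the rule.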

Recommended OS facilities:

- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
  service-account permissions
- macOS (future client/agent): Keychain

## 6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

- write new content to temporary path
- fsync or platform equivalent where needed
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers

Node-agent should tolerate:

- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active
snapshot data.
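
A minimal sketch of the write rules above, using only the Python standard library. The lock file name and the exclusive-create locking scheme are assumptions; a production agent would also need stale-lock handling after a crash and an `fsync` of the containing directory on POSIX systems, both omitted here.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def store_lock(root: Path):
    """Single-writer lock based on exclusive file creation (hypothetical scheme)."""
    lock_path = root / "store.lock"
    # O_EXCL makes creation fail with FileExistsError if another writer holds it.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def atomic_write(target: Path, data: bytes, tmp_dir: Path) -> None:
    """Write to a temporary file, flush it to disk, then rename over target."""
    fd, tmp_name = tempfile.mkstemp(dir=tmp_dir)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when both paths are on the same filesystem,
        # which is why temporary files live under the same namespace.
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Readers of `target` see either the old content or the new content, never a partial file, because the active path is only ever changed by the rename.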

## 7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only the policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter
recovery behavior.
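
"Bounded" has two dimensions here: entries expire, and the total count is capped. The sketch below shows both in one small in-memory structure; the class name and the oldest-first eviction order are assumptions, and an on-disk cache would apply the same idea to files.

```python
from collections import OrderedDict
from typing import Any, Optional


class BoundedCache:
    """Cache with a per-entry deadline and a hard cap on entry count."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._items: "OrderedDict[str, tuple]" = OrderedDict()

    def put(self, key: str, value: Any, expires_at: float) -> None:
        self._items.pop(key, None)           # re-inserting refreshes the order
        self._items[key] = (value, expires_at)
        while len(self._items) > self.max_entries:
            self._items.popitem(last=False)  # evict the oldest insertion

    def get(self, key: str, now: float) -> Optional[Any]:
        item = self._items.get(key)
        if item is None or item[1] <= now:
            self._items.pop(key, None)       # expired entries are dropped on read
            return None
        return item[0]

    def sweep(self, now: float) -> int:
        """Remove every expired entry and return how many were dropped."""
        expired = [key for key, (_, deadline) in self._items.items()
                   if deadline <= now]
        for key in expired:
            del self._items[key]
        return len(expired)

    def __len__(self) -> int:
        return len(self._items)
```

A periodic call to `sweep` covers the "remove expired" rules, while the cap in `put` guarantees an upper bound even if the sweep never runs.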

## 8. Corruption Recovery

Recovery order:

1. load active verified state
2. reject corrupted pending state
3. fall back to previous verified snapshot if active snapshot is corrupt and
   policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot exists and policy allows it
7. fail closed for trust/identity corruption

Corruption must be reported through health/status and local diagnostics.
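
The recovery order can be read as a decision function over integrity check results. The sketch below is one possible reading, with hypothetical names: it checks identity and trust first so that the fail-closed rule cannot be bypassed by a healthy snapshot, and it leaves degraded-mode entry (step 6) to the caller.

```python
from enum import Enum


class Outcome(Enum):
    USE_ACTIVE = "use_active"
    USE_PREVIOUS = "use_previous"
    REQUEST_FULL_SNAPSHOT = "request_full_snapshot"
    FAIL_CLOSED = "fail_closed"


def choose_recovery(identity_ok: bool, trust_ok: bool, active_ok: bool,
                    previous_ok: bool, fallback_allowed: bool) -> Outcome:
    """Map integrity check results to a recovery outcome."""
    if not (identity_ok and trust_ok):
        # Step 7, evaluated first here as an assumption of this sketch.
        return Outcome.FAIL_CLOSED
    if active_ok:
        return Outcome.USE_ACTIVE                 # step 1
    if previous_ok and fallback_allowed:
        return Outcome.USE_PREVIOUS               # step 3
    # Steps 4 and 5: ask config/storage, then bootstrap peers or control plane.
    return Outcome.REQUEST_FULL_SNAPSHOT
```

Corrupted pending state (step 2) does not appear as an input because it is simply discarded and never influences which verified state is loaded.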

## 9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and
policy allow it.

## 10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent

Forbidden workload interactions:

- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets
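
One way to enforce this boundary is an explicit allowlist at the point where workload requests enter the node-agent. The gateway class, operation names, and return shapes below are hypothetical; lifecycle commands flow from agent to workload and so are not modeled as requests.

```python
from typing import Any, Dict

# Operations a workload may request; anything else is refused outright.
ALLOWED_OPS = frozenset({"read_config", "report_health", "request_secret"})


class WorkloadGateway:
    """Entry point for workload requests (illustrative, not the agent's API)."""

    def __init__(self, assignments: Dict[str, dict]):
        self._assignments = assignments
        self.health: Dict[str, str] = {}

    def handle(self, service: str, op: str, arg: Any = None) -> Any:
        if service not in self._assignments:
            raise PermissionError(f"unknown service {service!r}")
        if op not in ALLOWED_OPS:
            raise PermissionError(f"operation {op!r} is not allowed for workloads")
        if op == "read_config":
            # Return a copy so the caller cannot mutate the cached assignment.
            return dict(self._assignments[service])
        if op == "report_health":
            self.health[service] = str(arg)
            return None
        # request_secret: a real agent would forward this to the control
        # boundary for approval; the sketch only acknowledges the request.
        return {"forwarded": True, "ref": arg}
```

Because the check is an allowlist rather than a denylist, a new kind of mutation added later is forbidden by default until someone deliberately permits it.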

## 11. Backup and Restore

Backup rules:

- identity/private key backup is platform policy dependent and high-risk
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for
new high-risk work.
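
The anti-rollback and revocation rules can be expressed as a check against what the control plane currently reports. The data shape and the decision to treat an older trust version as unacceptable until refreshed are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ControlPlaneView:
    """What the control plane reports during restore validation (hypothetical)."""
    current_trust_version: int
    revoked_node_ids: FrozenSet[str]


def restore_is_acceptable(node_id: str, restored_trust_version: int,
                          view: ControlPlaneView) -> bool:
    """Reject restores that would replay a revoked identity or older trust state."""
    if node_id in view.revoked_node_ids:
        return False
    # A restored trust bundle older than the current one must not be trusted;
    # the node would need a fresh trust bundle before doing high-risk work.
    return restored_trust_version >= view.current_trust_version
```

The comparison uses data supplied by the control plane rather than anything in the backup, since the backup is exactly the thing whose freshness is in question.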

## 12. Observability

Node-agent should report safe local state metadata:

- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state

Reports must not include raw secrets or unrelated topology.
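
A report built from an allowlist cannot leak a field that nobody thought to exclude. The field names below are invented for the example and simply mirror the list above.

```python
from typing import Any, Dict

# Hypothetical field names mirroring the reportable metadata list.
SAFE_FIELDS = (
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode",
    "store_health",
    "last_recovery_event",
    "pending_update_state",
)


def build_status_report(local_state: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted metadata; everything else stays on the node."""
    return {name: local_state[name]
            for name in SAFE_FIELDS if name in local_state}
```

Note that the report carries cache sizes rather than cache contents, which keeps peer endpoints and other topology details out of the status channel.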

## 13. Future Validation Tests

Future implementation tests must prove:

- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly

## 14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes
snapshots, peer directories, trust bundles, and incremental updates to the
node-local state store.

C13 must preserve:

- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior

## 15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native `rap-node-agent`.

Decisions:

- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment
  cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a
  second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C12.