rdp-proxy/docs/architecture/NODE_LOCAL_STATE_STORE.md

# Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for native
`rap-node-agent`. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.

## 1. Purpose

The node-local state store lets `rap-node-agent` operate safely without asking
the backend for every realtime routing or service supervision decision.

The local store must support:

- node identity persistence
- cluster membership state
- signed scoped snapshot storage
- peer cache
- route cache
- service assignment cache
- local health and degraded-mode state
- pending update metadata
- recovery after process restart or host reboot

The local store must not become a durable source of truth.

## 2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis remains live coordination only.

Node-local state must not authorize:

- node enrollment approval
- certificate issuance
- role assignment
- policy mutation
- trust root mutation
- organization mutation
- partition promotion
- cross-cluster trust

## 3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

```text
rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/
```

Rules:

- cluster state is namespace-isolated by `cluster_id`
- multi-cluster membership uses separate identities and local state per cluster
- temporary files are written under the same cluster namespace before atomic
  activation
- state in one cluster's namespace must never be read or written on behalf of
  another cluster
- file permissions must restrict access to the node-agent service account

## 4. State Classes

### Agent State

Agent-level state:

- agent install id
- agent version
- local feature flags
- last startup/shutdown status
- local diagnostics
- update engine metadata

Agent state is not cluster authority.

### Identity State

Cluster identity state:

- `node_id`
- cluster membership id
- node certificate metadata
- public identity metadata
- private key reference
- enrollment state
- revocation status cache

Private keys should be stored in an OS-protected key store when available. If
file-backed keys are necessary, they must be encrypted at rest and protected by
strict filesystem permissions.
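
Illustrative sketch only, assuming a POSIX host: the permission half of the file-backed key rule. Encryption at rest is out of scope here, so the caller is assumed to pass already-encrypted bytes. `write_private_file` and `check_private` are hypothetical names.

```python
import os
import stat
from pathlib import Path


def write_private_file(path: Path, ciphertext: bytes) -> None:
    """Create a new file that only the owning account can read or write."""
    # O_EXCL refuses to reuse an existing file; mode 0o600 is owner-only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(ciphertext)
        f.flush()
        os.fsync(f.fileno())


def check_private(path: Path) -> None:
    """Raise if group or other users have any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(f"{path} is accessible beyond its owner: {oct(mode)}")
```

A startup check such as `check_private` lets the agent refuse to load key material whose permissions were loosened after it was written.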

### Trust State

Trust state:

- platform root trust refs
- cluster trust roots
- config signing keys
- node-to-node trust bundle
- revocation metadata
- trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must
not be accepted.

### Snapshot State

Snapshot state:

- active signed scoped snapshot per scope
- previous verified snapshot per scope
- pending snapshot or incremental update
- snapshot verification metadata
- last applied config version
- expiry and refresh deadlines

Snapshot activation must be atomic:

1. write pending snapshot
2. verify signature, scope, hash, expiry, and version
3. persist verified content
4. swap active pointer
5. notify affected runtime components
6. report applied version
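
Illustrative sketch only: steps 1 to 4 above, under loud assumptions. HMAC-SHA256 with a shared key stands in for the real signature scheme, which this document does not specify; the envelope fields, file names, and all function names are hypothetical. Steps 5 and 6 (notify and report) are left to the caller.

```python
import hashlib
import hmac
import json
import os
from pathlib import Path


def _signed_bytes(scope, version, expires_at, body: bytes) -> bytes:
    return f"{scope}|{version}|{expires_at}|".encode() + body


def sign_snapshot(key: bytes, scope: str, version: int, expires_at: int, payload: str) -> bytes:
    """Helper standing in for the distribution side (hypothetical format)."""
    body = payload.encode()
    sig = hmac.new(key, _signed_bytes(scope, version, expires_at, body), hashlib.sha256)
    return json.dumps({
        "scope": scope, "version": version, "expires_at": expires_at,
        "payload": payload, "sha256": hashlib.sha256(body).hexdigest(),
        "signature": sig.hexdigest(),
    }).encode()


def verify_snapshot(env: dict, key: bytes, scope: str, min_version: int, now: int) -> None:
    """Check signature, scope, hash, expiry, and version; raise on any failure."""
    body = env["payload"].encode()
    signed = _signed_bytes(env["scope"], env["version"], env["expires_at"], body)
    expected = hmac.new(key, signed, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, env["signature"]):
        raise ValueError("bad signature")
    if env["scope"] != scope:
        raise ValueError("scope mismatch")
    if hashlib.sha256(body).hexdigest() != env["sha256"]:
        raise ValueError("hash mismatch")
    if env["expires_at"] <= now:
        raise ValueError("snapshot expired")
    if env["version"] <= min_version:
        raise ValueError("version is not newer than the active snapshot")


def _write_durably(path: Path, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def activate_snapshot(snap_dir: Path, raw: bytes, key: bytes, scope: str, now: int) -> int:
    snap_dir.mkdir(parents=True, exist_ok=True)
    pointer = snap_dir / "active.json"
    current = json.loads(pointer.read_text())["version"] if pointer.exists() else 0
    pending = snap_dir / "pending.json"
    _write_durably(pending, raw)                          # 1. write pending
    try:
        env = json.loads(raw)
        verify_snapshot(env, key, scope, current, now)    # 2. verify
    except (ValueError, KeyError, TypeError):
        pending.unlink()                                  # reject bad pending state
        raise
    verified = snap_dir / f"verified-{env['version']}.json"
    os.replace(pending, verified)                         # 3. persist verified content
    tmp_pointer = snap_dir / "active.json.tmp"
    _write_durably(tmp_pointer, json.dumps(
        {"version": env["version"], "file": verified.name}).encode())
    os.replace(tmp_pointer, pointer)                      # 4. swap active pointer
    return env["version"]
```

In this sketch the previous `verified-<version>.json` file is left in place, so a recovery path can still find the last verified snapshot after a newer one is activated.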

### Peer Cache

Peer cache:

- scoped peer directory entries
- endpoint candidates
- certificate fingerprints
- last success timestamp
- latency
- packet loss
- reliability score
- recent failure history
- last seen config version

Peer cache combines signed directory data with runtime observations. Runtime
observations are hints, not durable authority.
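
Illustrative sketch only: one way to keep signed directory data and runtime observations apart inside a peer cache entry. All type and field names are hypothetical, and the exponentially weighted reliability score is an assumed scoring method, not one this document prescribes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class DirectoryEntry:
    """Fields copied from a verified, signed peer directory (read-only)."""
    node_id: str
    endpoints: tuple[str, ...]
    cert_fingerprint: str
    config_version: int


@dataclass
class Observation:
    """Locally measured hints; safe to drop and rebuild at any time."""
    last_success: Optional[float] = None
    latency_ms: Optional[float] = None
    reliability: float = 0.5  # assumed neutral starting score in [0, 1]
    recent_failures: list[str] = field(default_factory=list)

    def record(self, ok: bool, now: float, latency_ms: Optional[float] = None,
               reason: str = "", alpha: float = 0.2, keep: int = 5) -> None:
        # Exponentially weighted average of success (1.0) and failure (0.0).
        self.reliability = (1 - alpha) * self.reliability + alpha * (1.0 if ok else 0.0)
        if ok:
            self.last_success = now
            self.latency_ms = latency_ms
        else:
            # Keep only a bounded failure history.
            self.recent_failures = (self.recent_failures + [reason])[-keep:]


@dataclass
class PeerEntry:
    directory: DirectoryEntry
    observed: Observation = field(default_factory=Observation)
```

Because `DirectoryEntry` is frozen, runtime code can update `observed` freely but cannot alter the signed fields without replacing the whole entry from a newly verified directory.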

### Route Cache

Route cache:

- selected routes
- route score
- route class/channel class
- route expiry
- failover alternatives
- shortcut state if future policy allows it
- last successful path
- recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and
runtime observations. It must not define policy.

### Service Assignment Cache

Service assignment cache:

- assigned service workloads
- desired state
- last reported state
- service version
- policy refs
- resource refs needed by assigned services
- connector or `vpn_connection` refs where authorized

This cache informs supervision. It does not allow the node to invent new
service work.

### Health and Degraded State

Health/degraded state:

- last heartbeat sent
- last control-plane contact
- last config/storage contact
- active degraded-mode reason
- partition/degraded flags
- local resource pressure
- service health summaries
- last known safe operation deadline

Degraded state must be persisted locally and reported in the node
heartbeat/status once connectivity returns.

### Update Metadata

Update state:

- current agent version
- current workload versions
- pending update metadata
- signed artifact refs
- rollout/canary assignment
- rollback candidate metadata
- last update result

Unsigned artifacts must never be activated.

## 5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is
required, store references and resolver metadata, not plaintext.

Rules:

- private keys use OS key store where possible
- file-backed sensitive material is encrypted at rest
- raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
- runtime secrets are resolved only when assigned service policy permits it
- secret material must be wiped from temporary files and memory where practical
- logs must not contain secret values

Recommended OS facilities:

- Windows: DPAPI or service-account protected certificate store
- Linux: kernel keyring, TPM-backed store, or file encryption with protected
  service-account permissions
- macOS (future client/agent): Keychain

## 6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

- write new content to a temporary path on the same filesystem as the target
- fsync the file, and the containing directory where the platform requires it,
  before activation
- verify content before activation
- atomically rename/swap active pointer
- keep previous verified content for recovery
- never partially overwrite active snapshots or identity data
- use a store lock to prevent concurrent writers
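
Illustrative sketch only: a write path that follows the rules above, using a sibling temporary file, `fsync`, and `os.replace`. The lock-file approach and the names `store_lock` and `atomic_write` are assumptions; stale-lock handling after a crash is deliberately left out.

```python
import contextlib
import os
from pathlib import Path


@contextlib.contextmanager
def store_lock(store_dir: Path):
    """Hold an exclusive lock file; a second writer gets FileExistsError."""
    lock_path = store_dir / ".lock"
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    try:
        os.write(fd, str(os.getpid()).encode())
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def atomic_write(target: Path, data: bytes) -> None:
    """Replace target with data so readers see the old or the new content only."""
    tmp = target.with_name(target.name + ".tmp")  # same directory, same filesystem
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())          # content reaches disk before the rename
    os.replace(tmp, target)           # atomic swap of the active file
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(target.parent, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Usage would look like `with store_lock(d): atomic_write(d / "active.json", payload)`. A crash before `os.replace` leaves only a stray `.tmp` file, which startup cleanup can remove.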

Node-agent should tolerate:

- interrupted writes
- corrupted pending updates
- missing optional cache files
- stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active
snapshot data.

## 7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

- remove expired peer observations
- remove expired route cache entries
- compact telemetry buffers
- retain only the policy-defined number of previous snapshots
- remove stale pending updates after safe timeout
- delete service assignment cache for removed roles after revocation is applied
- wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter
recovery behavior.
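
Illustrative sketch only: a size-bounded cache with per-entry expiry, as could back the peer observation or route caches. `BoundedTTLCache` is a hypothetical name, and evicting the oldest inserted entry when full is an assumed policy.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Cache with a hard entry limit and a fixed time-to-live per entry."""

    def __init__(self, max_entries: int, ttl_seconds: float, clock=time.monotonic):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = OrderedDict()  # key -> (expires_at, value)

    def __len__(self) -> int:
        return len(self._items)

    def put(self, key, value) -> None:
        self._items.pop(key, None)                 # re-insert moves key to the end
        self._items[key] = (self._clock() + self._ttl, value)
        while len(self._items) > self._max:
            self._items.popitem(last=False)        # evict the oldest insertion

    def get(self, key):
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if expires_at <= self._clock():
            del self._items[key]                   # expired entries are never served
            return None
        return value

    def purge_expired(self) -> int:
        """Remove all expired entries and return how many were removed."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._items.items() if exp <= now]
        for k in expired:
            del self._items[k]
        return len(expired)
```

The injectable `clock` parameter exists so expiry behavior can be tested without sleeping.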

## 8. Corruption Recovery

Recovery order:

1. load active verified state
2. reject corrupted pending state
3. fall back to the previous verified snapshot if the active snapshot is
   corrupt and policy allows it
4. request full snapshot from config/storage service
5. use bootstrap peers or control plane if storage/config is unavailable
6. enter degraded mode only if a valid snapshot exists and policy allows it
7. fail closed for trust/identity corruption

Corruption must be reported through health/status and local diagnostics.
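
Illustrative sketch only: the recovery order expressed as a pure decision function. The `StoreCheck` fields and `Action` names are hypothetical. Checking identity and trust first is one reading of step 7, on the assumption that fail-closed overrides every other step.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    USE_ACTIVE = "use_active"
    USE_PREVIOUS = "use_previous"
    REQUEST_FULL_SNAPSHOT = "request_full_snapshot"
    USE_BOOTSTRAP_OR_CONTROL_PLANE = "use_bootstrap_or_control_plane"
    FAIL_CLOSED = "fail_closed"


@dataclass(frozen=True)
class StoreCheck:
    identity_ok: bool
    trust_ok: bool
    active_ok: bool
    previous_ok: bool
    policy_allows_fallback: bool
    storage_reachable: bool


def next_recovery_action(c: StoreCheck) -> Action:
    # Step 7 (assumed to take priority): never continue on corrupt identity or trust.
    if not (c.identity_ok and c.trust_ok):
        return Action.FAIL_CLOSED
    # Step 1: a verified active snapshot is used as-is.
    if c.active_ok:
        return Action.USE_ACTIVE
    # Step 3: fall back to the previous verified snapshot only when policy allows.
    if c.previous_ok and c.policy_allows_fallback:
        return Action.USE_PREVIOUS
    # Step 4: otherwise ask the config/storage service for a full snapshot.
    if c.storage_reachable:
        return Action.REQUEST_FULL_SNAPSHOT
    # Step 5: no valid snapshot and no storage; degraded mode is not an option here.
    return Action.USE_BOOTSTRAP_OR_CONTROL_PLANE
```

Step 2 (rejecting corrupted pending state) is not modeled because it never changes which snapshot is served; degraded mode (step 6) would only be entered from the `USE_ACTIVE` or `USE_PREVIOUS` outcomes.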

## 9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

- identity
- certificates
- trust bundle
- signed snapshots
- peer cache
- route cache
- service assignment cache
- update/workload namespace where needed
- telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and
policy allow it.

## 10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

- read assigned service configuration through node-agent
- report health/status to node-agent
- request approved secret resolution through node-agent/control boundary
- receive lifecycle commands from node-agent

Forbidden workload interactions:

- mutate role assignments
- mutate snapshots
- mutate peer directory authority
- write trust roots
- write cross-cluster state
- store unrelated organization secrets

## 11. Backup and Restore

Backup rules:

- identity/private key backup is high-risk and depends on platform policy
- snapshots and caches can usually be reconstructed
- local route/peer caches should not be treated as backup-critical
- trust state backup must preserve anti-rollback properties
- restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for
new high-risk work.

## 12. Observability

Node-agent should report safe local state metadata:

- last applied config version
- snapshot expiry/refresh status
- trust bundle version
- peer cache size
- route cache size
- degraded-mode state
- local store health
- last corruption/recovery event
- pending update state

Reports must not include raw secrets or unrelated topology.
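
Illustrative sketch only: building a status report from an allowlist, so that fields not explicitly listed can never leak into a report. The field names in `SAFE_FIELDS` are hypothetical spellings of the metadata items above.

```python
# Hypothetical field names mirroring the safe metadata list above.
SAFE_FIELDS = (
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode",
    "store_health",
    "last_recovery_event",
    "pending_update_state",
)


def build_status_report(local_state: dict) -> dict:
    """Copy only allowlisted keys; everything else is dropped silently."""
    return {k: local_state[k] for k in SAFE_FIELDS if k in local_state}
```

An allowlist is chosen over a denylist here so that a newly added local field stays unreported until someone decides it is safe.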

## 13. Future Validation Tests

Future implementation tests must prove:

- fresh install creates expected namespace layout
- valid snapshot activates atomically
- interrupted activation recovers to previous valid snapshot
- corrupted pending update is ignored
- corrupted active identity fails closed
- peer cache expiry works
- route cache expiry works
- multi-cluster namespaces stay isolated
- service workload cannot mutate authoritative local state
- local store reports last applied config version
- degraded-mode state is persisted and cleared correctly

## 14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes
snapshots, peer directories, trust bundles, and incremental updates to the
node-local state store.

C13 must preserve:

- PostgreSQL authority
- signed snapshot verification
- node-local bounded cache behavior
- cluster/org/service isolation
- no arbitrary query/database behavior

## 15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store
owned by native `rap-node-agent`.

Decisions:

- local state is namespaced per cluster
- identity, trust, snapshots, peer cache, route cache, service assignment
  cache, health/degraded state, and update metadata are separate state classes
- local state is not durable authority
- snapshot activation must be atomic
- caches are bounded and reconstructable
- private keys and sensitive material require OS-protected or encrypted storage
- service workloads cannot mutate authoritative node-local state
- C13 must define distribution/storage services without turning them into a
  second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C12.