rdp-proxy/docs/architecture/NODE_LOCAL_STATE_STORE.md
2026-04-28 22:29:50 +03:00

Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for native rap-node-agent. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

1. Purpose

The node-local state store lets rap-node-agent operate safely without asking the backend for every realtime routing or service supervision decision.

The local store must support:

  • node identity persistence
  • cluster membership state
  • signed scoped snapshot storage
  • peer cache
  • route cache
  • service assignment cache
  • local health and degraded-mode state
  • pending update metadata
  • recovery after process restart or host reboot

The local store must not become a durable source of truth.

2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis remains limited to live coordination.

Node-local state must not authorize:

  • node enrollment approval
  • certificate issuance
  • role assignment
  • policy mutation
  • trust root mutation
  • organization mutation
  • partition promotion
  • cross-cluster trust

3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/

Rules:

  • cluster state is namespace-isolated by cluster_id
  • multi-cluster membership uses separate identities and local state per cluster
  • temporary files are written under the same cluster namespace before atomic activation
  • no cluster may read another cluster's local state namespace
  • file permissions must restrict access to the node-agent service account
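
The layout and rules above can be sketched as a small initialization helper. This is an illustrative sketch, not the agent's implementation: the function names, the cluster id character set, and the `0o700` mode are assumptions. On Windows, `os.chmod` does not enforce owner-only access, so ACLs would be needed instead.

```python
import os
import re
from pathlib import Path

# Subdirectories taken from the example logical layout.
CLUSTER_SUBDIRS = ("identity", "trust", "snapshots", "peers", "routes",
                   "services", "health", "updates", "telemetry", "tmp")

# Assumption: cluster ids use a conservative character set, so an id can
# never contain a path separator or ".." and escape its namespace.
_CLUSTER_ID = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,63}")


def cluster_root(state_root: Path, cluster_id: str) -> Path:
    """Return the namespace directory for one cluster, rejecting unsafe ids."""
    if not _CLUSTER_ID.fullmatch(cluster_id):
        raise ValueError(f"invalid cluster_id: {cluster_id!r}")
    return state_root / "clusters" / cluster_id


def _mkdir_private(path: Path) -> None:
    """Create a directory and restrict it to the owning account (POSIX)."""
    path.mkdir(parents=True, exist_ok=True)
    os.chmod(path, 0o700)


def init_cluster_namespace(state_root: Path, cluster_id: str) -> Path:
    """Create the agent directory and one isolated cluster namespace."""
    root = cluster_root(state_root, cluster_id)
    for directory in (state_root, state_root / "agent",
                      state_root / "clusters", root):
        _mkdir_private(directory)
    for name in CLUSTER_SUBDIRS:
        _mkdir_private(root / name)
    return root
```

Resolving every path through `cluster_root` keeps namespace isolation in one place: code that handles cluster A never builds a path from a raw string that could point into cluster B.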

4. State Classes

Agent State

Agent-level state:

  • agent install id
  • agent version
  • local feature flags
  • last startup/shutdown status
  • local diagnostics
  • update engine metadata

Agent state is not cluster authority.

Identity State

Cluster identity state:

  • node_id
  • cluster membership id
  • node certificate metadata
  • public identity metadata
  • private key reference
  • enrollment state
  • revocation status cache

Private keys should be stored in an OS-protected key store when available. If file-backed keys are necessary, they must be encrypted at rest and protected by strict filesystem permissions.

Trust State

Trust state:

  • platform root trust refs
  • cluster trust roots
  • config signing keys
  • node-to-node trust bundle
  • revocation metadata
  • trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must not be accepted.

Snapshot State

Snapshot state:

  • active signed scoped snapshot per scope
  • previous verified snapshot per scope
  • pending snapshot or incremental update
  • snapshot verification metadata
  • last applied config version
  • expiry and refresh deadlines

Snapshot activation must be atomic:

  1. write pending snapshot
  2. verify signature, scope, hash, expiry, and version
  3. persist verified content
  4. swap active pointer
  5. notify affected runtime components
  6. report applied version
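
Step 2 of the sequence above might look like the following sketch. The envelope fields and function names are hypothetical. HMAC-SHA256 is used only so the example runs on the standard library; it is a stand-in for whatever signature scheme the config signing keys actually use, which this document does not specify.

```python
import hashlib
import hmac
import json
import time
from dataclasses import dataclass
from typing import Optional


class SnapshotRejected(Exception):
    """Raised when a pending snapshot fails any verification check."""


@dataclass(frozen=True)
class Envelope:
    scope: str
    version: int
    expires_at: float        # Unix seconds
    payload: bytes
    payload_sha256: str      # hex digest declared by the signer
    signature: bytes


def _signed_bytes(scope: str, version: int, expires_at: float, digest: str) -> bytes:
    # The signature covers the metadata and the payload digest.
    return json.dumps([scope, version, expires_at, digest]).encode("utf-8")


def sign_snapshot(key: bytes, scope: str, version: int,
                  expires_at: float, payload: bytes) -> Envelope:
    """Test helper standing in for the (out-of-scope) signing service."""
    digest = hashlib.sha256(payload).hexdigest()
    sig = hmac.new(key, _signed_bytes(scope, version, expires_at, digest),
                   hashlib.sha256).digest()
    return Envelope(scope, version, expires_at, payload, digest, sig)


def verify_snapshot(env: Envelope, key: bytes, expected_scope: str,
                    active_version: int, now: Optional[float] = None) -> None:
    """Check signature, scope, hash, expiry, and version; raise on failure."""
    now = time.time() if now is None else now
    expected = hmac.new(
        key,
        _signed_bytes(env.scope, env.version, env.expires_at, env.payload_sha256),
        hashlib.sha256,
    ).digest()
    if not hmac.compare_digest(expected, env.signature):
        raise SnapshotRejected("bad signature")
    if env.scope != expected_scope:
        raise SnapshotRejected("scope mismatch")
    if hashlib.sha256(env.payload).hexdigest() != env.payload_sha256:
        raise SnapshotRejected("payload hash mismatch")
    if env.expires_at <= now:
        raise SnapshotRejected("snapshot expired")
    if env.version <= active_version:
        raise SnapshotRejected("version is not newer than the active snapshot")
```

Only a snapshot that passes every check would move on to steps 3 and 4. A rejected snapshot stays pending and never touches the active pointer.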

Peer Cache

Peer cache:

  • scoped peer directory entries
  • endpoint candidates
  • certificate fingerprints
  • last success timestamp
  • latency
  • packet loss
  • reliability score
  • recent failure history
  • last seen config version

Peer cache combines signed directory data with runtime observations. Runtime observations are hints, not durable authority.
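
One way to keep the two kinds of data apart is to hold signed directory entries as immutable records and runtime observations as a separate, disposable map. The sketch below assumes hypothetical type names and a simple ordering rule (fewest recent failures, then lowest latency). Dropping observations when a certificate fingerprint changes is also an assumption, not a stated requirement.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class DirectoryEntry:
    """Authoritative fields, copied from a verified signed snapshot."""
    peer_id: str
    endpoints: Tuple[str, ...]
    cert_fingerprint: str


@dataclass
class Observation:
    """Local hints only; safe to lose at any time."""
    last_success: Optional[float] = None
    latency_ms: Optional[float] = None
    recent_failures: int = 0


class PeerCache:
    def __init__(self) -> None:
        self._directory: Dict[str, DirectoryEntry] = {}
        self._observations: Dict[str, Observation] = {}

    def apply_directory(self, entries: Iterable[DirectoryEntry]) -> None:
        """Replace the directory; discard hints for removed or re-keyed peers."""
        new_directory = {e.peer_id: e for e in entries}
        for peer_id in list(self._observations):
            old = self._directory.get(peer_id)
            new = new_directory.get(peer_id)
            if new is None or old is None or old.cert_fingerprint != new.cert_fingerprint:
                del self._observations[peer_id]
        self._directory = new_directory

    def record_success(self, peer_id: str, latency_ms: float, now: float) -> bool:
        """Store a hint. Observations can never create a peer."""
        if peer_id not in self._directory:
            return False
        obs = self._observations.setdefault(peer_id, Observation())
        obs.last_success, obs.latency_ms, obs.recent_failures = now, latency_ms, 0
        return True

    def record_failure(self, peer_id: str) -> bool:
        if peer_id not in self._directory:
            return False
        self._observations.setdefault(peer_id, Observation()).recent_failures += 1
        return True

    def candidates(self) -> List[DirectoryEntry]:
        """Order signed entries by hints; unobserved peers sort after measured ones."""
        def sort_key(entry: DirectoryEntry):
            obs = self._observations.get(entry.peer_id)
            failures = obs.recent_failures if obs else 0
            latency = obs.latency_ms if obs and obs.latency_ms is not None else float("inf")
            return (failures, latency, entry.peer_id)
        return sorted(self._directory.values(), key=sort_key)
```

Because `candidates` only ever returns entries from the signed directory, a wrong or stale observation can change the order in which peers are tried but cannot introduce a peer the snapshot did not authorize.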

Route Cache

Route cache:

  • selected routes
  • route score
  • route class/channel class
  • route expiry
  • failover alternatives
  • shortcut state if future policy allows it
  • last successful path
  • recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and runtime observations. It must not define policy.

Service Assignment Cache

Service assignment cache:

  • assigned service workloads
  • desired state
  • last reported state
  • service version
  • policy refs
  • resource refs needed by assigned services
  • connector or vpn_connection refs where authorized

This cache informs supervision. It does not allow the node to invent new service work.

Health and Degraded State

Health/degraded state:

  • last heartbeat sent
  • last control-plane contact
  • last config/storage contact
  • active degraded-mode reason
  • partition/degraded flags
  • local resource pressure
  • service health summaries
  • last known safe operation deadline

Degraded state must be visible in node heartbeat/status when connectivity returns.

Update Metadata

Update state:

  • current agent version
  • current workload versions
  • pending update metadata
  • signed artifact refs
  • rollout/canary assignment
  • rollback candidate metadata
  • last update result

Unsigned artifacts must never be activated.

5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is required, store references and resolver metadata, not plaintext.

Rules:

  • private keys use OS key store where possible
  • file-backed sensitive material is encrypted at rest
  • raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
  • runtime secrets are resolved only when assigned service policy permits it
  • secret material must be wiped from temporary files and memory where practical
  • logs must not contain secret values

Recommended OS facilities:

  • Windows: DPAPI or service-account protected certificate store
  • Linux: kernel keyring, TPM-backed store, or file encryption with protected service-account permissions
  • macOS (future client/agent): Keychain

6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

  • write new content to temporary path
  • fsync the temporary file, or use the platform equivalent, before activation
  • verify content before activation
  • atomically rename/swap active pointer
  • keep previous verified content for recovery
  • never partially overwrite active snapshots or identity data
  • use a store lock to prevent concurrent writers
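
A minimal sketch of the write-then-rename pattern these rules describe, assuming a file-backed store. The `ACTIVE` and `PREVIOUS` pointer file names and the function names are hypothetical. The store lock is omitted for brevity, and the content passed to `activate` is assumed to be verified already.

```python
import os
from pathlib import Path
from typing import Optional


def _fsync_dir(path: Path) -> None:
    """Flush a directory so a completed rename survives a crash (POSIX only)."""
    if os.name != "posix":
        return
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def atomic_write(target: Path, data: bytes, tmp_dir: Path) -> None:
    """Write data so readers see either the old content or the new, never a mix."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = tmp_dir / f"{target.name}.{os.getpid()}.tmp"
    try:
        with open(tmp_path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())          # content is on disk before the rename
        os.replace(tmp_path, target)      # atomic when both paths share a filesystem
    finally:
        if tmp_path.exists():             # only true if the write or rename failed
            tmp_path.unlink()
    _fsync_dir(target.parent)


def read_pointer(path: Path) -> Optional[str]:
    """Return the version named by a pointer file, or None if it is absent."""
    try:
        return path.read_text(encoding="utf-8").strip() or None
    except FileNotFoundError:
        return None


def activate(snap_dir: Path, version: str, verified: bytes) -> None:
    """Persist verified content, remember the previous version, swap the pointer."""
    tmp_dir = snap_dir.parent / "tmp"     # same cluster namespace as the target
    snap_dir.mkdir(parents=True, exist_ok=True)
    atomic_write(snap_dir / f"{version}.bin", verified, tmp_dir)
    current = read_pointer(snap_dir / "ACTIVE")
    if current is not None and current != version:
        atomic_write(snap_dir / "PREVIOUS", current.encode("utf-8"), tmp_dir)
    atomic_write(snap_dir / "ACTIVE", version.encode("utf-8"), tmp_dir)
```

If the process dies at any point in `activate`, `ACTIVE` still names a complete file: either the old version or the new one. The worst case is a `PREVIOUS` pointer that equals `ACTIVE`, which is harmless.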

Node-agent should tolerate:

  • interrupted writes
  • corrupted pending updates
  • missing optional cache files
  • stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active snapshot data.

7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

  • remove expired peer observations
  • remove expired route cache entries
  • compact telemetry buffers
  • retain only the policy-defined number of previous snapshots
  • remove stale pending updates after safe timeout
  • delete service assignment cache for removed roles after revocation is applied
  • wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter recovery behavior.
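
As an illustration of a bounded cache with expiry, the sketch below caps the entry count and drops entries past a time-to-live. The class name and the eviction policy (oldest insertion first) are assumptions. The clock is injectable so expiry can be tested; a cache that is persisted across reboots would need wall-clock timestamps rather than `time.monotonic`.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class BoundedTTLCache:
    """In-memory cache with a hard size limit and per-entry expiry."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[Hashable, tuple]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._entries)

    def put(self, key: Hashable, value: Any) -> None:
        """Insert or refresh an entry, evicting the oldest if over the limit."""
        self._entries.pop(key, None)
        self._entries[key] = (self._clock() + self._ttl, value)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)

    def get(self, key: Hashable) -> Optional[Any]:
        """Return a live value, or None if the key is missing or expired."""
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if expires_at <= self._clock():
            del self._entries[key]
            return None
        return value

    def purge_expired(self) -> int:
        """Remove every expired entry and return how many were removed."""
        now = self._clock()
        stale = [k for k, (expires_at, _) in self._entries.items() if expires_at <= now]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A periodic cleanup task could call `purge_expired` for the peer and route caches, while the size limit guarantees the bound even if cleanup is delayed.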

8. Corruption Recovery

Recovery order:

  1. load active verified state
  2. reject corrupted pending state
  3. fall back to the previous verified snapshot if the active snapshot is corrupt and policy allows it
  4. request full snapshot from config/storage service
  5. use bootstrap peers or control plane if storage/config is unavailable
  6. enter degraded mode only if a valid snapshot exists and policy allows it
  7. fail closed for trust/identity corruption
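
The recovery order can be read as a pure decision function over the results of local integrity checks. The sketch below is one possible interpretation, not a specification: the field names, the mode strings, and the `halt` outcome for a node with neither a usable snapshot nor a reachable source are all assumptions. Rejection of corrupted pending state (step 2) is assumed to have happened before the checks are collected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreCheck:
    identity_ok: bool
    trust_ok: bool
    active_ok: bool                # active snapshot verified
    previous_ok: bool              # previous snapshot verified
    storage_reachable: bool        # config/storage service
    fallback_reachable: bool       # bootstrap peers or control plane
    allow_previous_fallback: bool  # policy
    allow_degraded: bool           # policy


@dataclass(frozen=True)
class Plan:
    mode: str                      # normal | resync | degraded | halt | fail_closed
    local_snapshot: Optional[str]  # "active", "previous", or None
    fetch_from: Optional[str]      # where to request a full snapshot, if anywhere


def plan_recovery(c: StoreCheck) -> Plan:
    # Trust or identity corruption overrides everything else (step 7).
    if not (c.identity_ok and c.trust_ok):
        return Plan("fail_closed", None, None)

    # Steps 1 and 3: pick the best locally verified snapshot.
    if c.active_ok:
        local = "active"
    elif c.previous_ok and c.allow_previous_fallback:
        local = "previous"
    else:
        local = None

    # Steps 4 and 5: prefer config/storage, then bootstrap peers or control plane.
    if c.storage_reachable:
        remote = "storage"
    elif c.fallback_reachable:
        remote = "bootstrap_or_control_plane"
    else:
        remote = None

    if remote is not None:
        if local == "active":
            return Plan("normal", local, None)
        return Plan("resync", local, remote)

    # Step 6: degraded mode needs both a valid snapshot and policy consent.
    if local is not None and c.allow_degraded:
        return Plan("degraded", local, None)
    return Plan("halt", None, None)
```

Keeping the decision free of I/O makes each branch easy to cover in tests, and the resulting `Plan` can be reported through health/status as the reason for the current mode.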

Corruption must be reported through health/status and local diagnostics.

9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

  • identity
  • certificates
  • trust bundle
  • signed snapshots
  • peer cache
  • route cache
  • service assignment cache
  • update/workload namespace where needed
  • telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and policy allow it.

10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

  • read assigned service configuration through node-agent
  • report health/status to node-agent
  • request approved secret resolution through node-agent/control boundary
  • receive lifecycle commands from node-agent

Forbidden workload interactions:

  • mutate role assignments
  • mutate snapshots
  • mutate peer directory authority
  • write trust roots
  • write cross-cluster state
  • store unrelated organization secrets

11. Backup and Restore

Backup rules:

  • identity/private key backup is high-risk and depends on platform policy
  • snapshots and caches can usually be reconstructed
  • local route/peer caches should not be treated as backup-critical
  • trust state backup must preserve anti-rollback properties
  • restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for new high-risk work.

12. Observability

Node-agent should report safe local state metadata:

  • last applied config version
  • snapshot expiry/refresh status
  • trust bundle version
  • peer cache size
  • route cache size
  • degraded-mode state
  • local store health
  • last corruption/recovery event
  • pending update state

Reports must not include raw secrets or unrelated topology.
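
One simple way to enforce this is an allowlist: the report builder copies only named fields and ignores everything else, so a new field in local state is never exported by accident. The field names below are hypothetical labels for the items listed above.

```python
from typing import Any, Dict, Mapping

# Assumption: illustrative field names mirroring the reportable items above.
REPORTABLE_FIELDS = (
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode",
    "local_store_health",
    "last_recovery_event",
    "pending_update_state",
)


def build_status_report(local_state: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted metadata; unknown or sensitive keys are dropped."""
    return {name: local_state[name] for name in REPORTABLE_FIELDS if name in local_state}
```

For example, passing a state mapping that also contains a `private_key_ref` entry or a full peer list yields a report without those keys.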

13. Future Validation Tests

Future implementation tests must prove:

  • fresh install creates expected namespace layout
  • valid snapshot activates atomically
  • interrupted activation recovers to previous valid snapshot
  • corrupted pending update is ignored
  • corrupted active identity fails closed
  • peer cache expiry works
  • route cache expiry works
  • multi-cluster namespaces stay isolated
  • service workload cannot mutate authoritative local state
  • local store reports last applied config version
  • degraded-mode state is persisted and cleared correctly

14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes snapshots, peer directories, trust bundles, and incremental updates to the node-local state store.

C13 must preserve:

  • PostgreSQL authority
  • signed snapshot verification
  • node-local bounded cache behavior
  • cluster/org/service isolation
  • no arbitrary query/database behavior

15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store owned by native rap-node-agent.

Decisions:

  • local state is namespaced per cluster
  • identity, trust, snapshots, peer cache, route cache, service assignment cache, health/degraded state, and update metadata are separate state classes
  • local state is not durable authority
  • snapshot activation must be atomic
  • caches are bounded and reconstructable
  • private keys and sensitive material require OS-protected or encrypted storage
  • service workloads cannot mutate authoritative node-local state
  • C13 must define distribution/storage services without turning them into a second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C12.