m/rdp-proxy

m 8ba0561f4f Initial project snapshot

2026-04-28 22:29:50 +03:00

Node Local State Store

Status: Stage C12 result. Documentation and architecture only.

This document defines the node-local state store model for the native rap-node-agent. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

1. Purpose

The node-local state store lets rap-node-agent operate safely without asking the backend for every realtime routing or service supervision decision.

The local store must support:

node identity persistence
cluster membership state
signed scoped snapshot storage
peer cache
route cache
service assignment cache
local health and degraded-mode state
pending update metadata
recovery after process restart or host reboot

The local store must not become a durable source of truth.

2. Authority Boundaries

PostgreSQL remains authoritative for durable domain state.

Fabric Storage / Config Storage distributes signed snapshots and increments.

Node-local state stores verified local copies and runtime observations.

Redis remains live coordination only.

Node-local state must not authorize:

node enrollment approval
certificate issuance
role assignment
policy mutation
trust root mutation
organization mutation
partition promotion
cross-cluster trust

3. Storage Root and Namespaces

The node-agent should use one configured local storage root.

Example logical layout:

rap-node-agent-state/
  agent/
  clusters/
    <cluster_id>/
      identity/
      trust/
      snapshots/
      peers/
      routes/
      services/
      health/
      updates/
      telemetry/
      tmp/

Rules:

cluster state is namespace-isolated by cluster_id
multi-cluster membership uses separate identities and local state per cluster
temporary files are written under the same cluster namespace before atomic activation
no cluster may read another cluster's local state namespace
file permissions must restrict access to the node-agent service account
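
The layout and rules above could be enforced along the following lines. This is a minimal sketch, not the agent's implementation: the function names, the allowed character set for cluster_id, and the POSIX mode 0o700 are assumptions made for illustration.

```python
import os
from pathlib import Path

# Subdirectory names mirror the logical layout above.
CLUSTER_SUBDIRS = (
    "identity", "trust", "snapshots", "peers", "routes",
    "services", "health", "updates", "telemetry", "tmp",
)


def cluster_root(state_root: Path, cluster_id: str) -> Path:
    """Return the namespace directory for one cluster, rejecting unsafe ids."""
    # Assumed rule: ids are plain tokens, so "/" or ".." cannot escape the namespace.
    if not cluster_id or not all(c.isalnum() or c in "-_" for c in cluster_id):
        raise ValueError(f"invalid cluster_id: {cluster_id!r}")
    return state_root / "clusters" / cluster_id


def init_cluster_namespace(state_root: Path, cluster_id: str) -> Path:
    """Create the per-cluster directories with owner-only permissions."""
    root = cluster_root(state_root, cluster_id)
    root.mkdir(parents=True, exist_ok=True)
    os.chmod(root, 0o700)
    for name in CLUSTER_SUBDIRS:
        sub = root / name
        sub.mkdir(exist_ok=True)
        os.chmod(sub, 0o700)
    return root


def resolve_in_namespace(root: Path, relative: str) -> Path:
    """Resolve a path and refuse anything that lands outside the namespace."""
    base = root.resolve()
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(base):
        raise PermissionError(f"path escapes cluster namespace: {relative!r}")
    return candidate
```

Routing every file access through a helper like resolve_in_namespace keeps the "no cluster may read another cluster's namespace" rule in one place. On Windows the chmod calls would need to be replaced by ACLs scoped to the service account.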

4. State Classes

Agent State

Agent-level state:

agent install id
agent version
local feature flags
last startup/shutdown status
local diagnostics
update engine metadata

Agent state is not cluster authority.

Identity State

Cluster identity state:

node_id
cluster membership id
node certificate metadata
public identity metadata
private key reference
enrollment state
revocation status cache

Private keys should be stored in an OS-protected key store when available. If file-backed keys are necessary, they must be encrypted at rest and protected by strict filesystem permissions.

Trust State

Trust state:

platform root trust refs
cluster trust roots
config signing keys
node-to-node trust bundle
revocation metadata
trust bundle version

Trust state must be signed and versioned. Unknown or revoked trust roots must not be accepted.

Snapshot State

Snapshot state:

active signed scoped snapshot per scope
previous verified snapshot per scope
pending snapshot or incremental update
snapshot verification metadata
last applied config version
expiry and refresh deadlines

Snapshot activation must be atomic:

write pending snapshot
verify signature, scope, hash, expiry, and version
persist verified content
swap active pointer
notify affected runtime components
report applied version
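
The verification step in the sequence above can be sketched as a single function that checks signature, scope, hash, expiry, and version before anything is activated. The envelope fields, the signed byte layout, and the strictly-increasing version rule are assumptions; this document does not define the snapshot format. Signature verification is passed in as a callable so the sketch does not commit to a particular signing scheme.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SnapshotEnvelope:
    # Hypothetical fields; the real snapshot format is not defined here.
    scope: str
    version: int
    expires_at: float  # Unix seconds
    content_sha256: str
    payload: bytes
    signature: bytes


class SnapshotRejected(Exception):
    """Raised with the name of the first check that failed."""


def signed_bytes(env: SnapshotEnvelope) -> bytes:
    # Assumption: the signature covers the metadata plus the content hash.
    meta = [env.scope, env.version, env.expires_at, env.content_sha256]
    return json.dumps(meta).encode("utf-8")


def verify_snapshot(
    env: SnapshotEnvelope,
    expected_scope: str,
    active_version: int,
    now: float,
    verify_signature: Callable[[bytes, bytes], bool],
) -> None:
    """Raise SnapshotRejected unless every check passes."""
    if not verify_signature(signed_bytes(env), env.signature):
        raise SnapshotRejected("signature")
    if env.scope != expected_scope:
        raise SnapshotRejected("scope")
    if hashlib.sha256(env.payload).hexdigest() != env.content_sha256:
        raise SnapshotRejected("hash")
    if env.expires_at <= now:
        raise SnapshotRejected("expiry")
    # Assumption: versions must strictly increase, so replays are rejected.
    if env.version <= active_version:
        raise SnapshotRejected("version")
```

Only a snapshot that passes verify_snapshot would move on to the persist and pointer-swap steps; a rejected one stays in the pending area until cleanup removes it.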

Peer Cache

Peer cache:

scoped peer directory entries
endpoint candidates
certificate fingerprints
last success timestamp
latency
packet loss
reliability score
recent failure history
last seen config version

Peer cache combines signed directory data with runtime observations. Runtime observations are hints, not durable authority.
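
One way to keep signed directory data and runtime hints apart is to model them as two separate types, with only the observation side mutable. The sketch below is illustrative: the field names, the exponentially weighted reliability score, and its 0.2 weight are assumptions, since this document does not define how the score is computed.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SignedPeerEntry:
    # Comes from a verified snapshot; frozen so runtime code cannot alter it.
    peer_id: str
    endpoints: Tuple[str, ...]
    cert_fingerprint: str
    config_version: int


@dataclass
class PeerObservations:
    # Local hints only; safe to drop and rebuild at any time.
    last_success: Optional[float] = None
    latency_ms: Optional[float] = None
    reliability: float = 0.5
    recent_failures: List[str] = field(default_factory=list)

    def record(self, ok: bool, now: float, latency_ms: Optional[float] = None,
               reason: str = "", alpha: float = 0.2, max_failures: int = 5) -> None:
        """Fold one connection attempt into the hints (assumed EWMA scoring)."""
        self.reliability = (1 - alpha) * self.reliability + alpha * (1.0 if ok else 0.0)
        if ok:
            self.last_success = now
            if latency_ms is not None:
                self.latency_ms = latency_ms
        else:
            # Bounded history: keep only the most recent failure reasons.
            self.recent_failures = (self.recent_failures + [reason])[-max_failures:]


@dataclass
class PeerCacheEntry:
    signed: SignedPeerEntry
    observed: PeerObservations = field(default_factory=PeerObservations)
```

Replacing the signed half when a new snapshot is applied leaves the observations intact, and discarding the observations never touches signed data.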

Route Cache

Route cache:

selected routes
route score
route class/channel class
route expiry
failover alternatives
shortcut state if future policy allows it
last successful path
recent failure reason

Route cache must be reconstructable from signed snapshots, peer cache, and runtime observations. It must not define policy.

Service Assignment Cache

Service assignment cache:

assigned service workloads
desired state
last reported state
service version
policy refs
resource refs needed by assigned services
connector or vpn_connection refs where authorized

This cache informs supervision. It does not allow the node to invent new service work.

Health and Degraded State

Health/degraded state:

last heartbeat sent
last control-plane contact
last config/storage contact
active degraded-mode reason
partition/degraded flags
local resource pressure
service health summaries
last known safe operation deadline

Degraded state must be visible in node heartbeat/status when connectivity returns.

Update Metadata

Update state:

current agent version
current workload versions
pending update metadata
signed artifact refs
rollout/canary assignment
rollback candidate metadata
last update result

Unsigned artifacts must never be activated.

5. Encryption and Secret Handling

The local store should avoid storing secrets. When secret-related data is required, store references and resolver metadata, not plaintext.

Rules:

private keys use OS key store where possible
file-backed sensitive material is encrypted at rest
raw RDP/VNC/SSH/VPN credentials must not be stored in broad local snapshots
runtime secrets are resolved only when assigned service policy permits it
secret material must be wiped from temporary files and memory where practical
logs must not contain secret values

Recommended OS facilities:

Windows: DPAPI or service-account protected certificate store
Linux: kernel keyring, TPM-backed store, or file encryption with protected service-account permissions
macOS future client/agent: Keychain
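
The "references, not plaintext" rule might look like the following in practice. This is a sketch under stated assumptions: SecretRef, the permitted mapping, and the fetch callable are hypothetical names, and the actual resolver (DPAPI, kernel keyring, Keychain, or a control-plane call) is deliberately left abstract.

```python
from dataclasses import dataclass
from typing import AbstractSet, Callable, Mapping


@dataclass(frozen=True)
class SecretRef:
    # Only a reference and resolver metadata are persisted, never the value.
    ref_id: str
    resolver: str


def resolve_secret(
    ref: SecretRef,
    service_id: str,
    permitted: Mapping[str, AbstractSet[str]],
    fetch: Callable[[SecretRef], bytes],
) -> bytearray:
    """Resolve a secret only if the assigned service policy lists this ref."""
    if ref.ref_id not in permitted.get(service_id, frozenset()):
        raise PermissionError(f"service {service_id!r} may not resolve {ref.ref_id!r}")
    # A mutable buffer lets the caller overwrite the value after use.
    return bytearray(fetch(ref))


def wipe(buf: bytearray) -> None:
    """Best-effort zeroing; the Python runtime may still hold other copies."""
    for i in range(len(buf)):
        buf[i] = 0
```

Because SecretRef holds no secret value, it is safe to log or store in the service assignment cache. The wipe helper reflects the "where practical" qualifier: in a managed runtime it reduces exposure but cannot guarantee that no copy remains.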

6. Atomicity and Durability

Writes must be safe across process crashes and host reboots.

Rules:

write new content to temporary path
fsync or platform equivalent where needed
verify content before activation
atomically rename/swap active pointer
keep previous verified content for recovery
never partially overwrite active snapshots or identity data
use a store lock to prevent concurrent writers

Node-agent should tolerate:

interrupted writes
corrupted pending updates
missing optional cache files
stale runtime observations

Node-agent must not tolerate silently corrupted identity, trust, or active snapshot data.
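
The write rules above map onto a familiar write-temp, fsync, verify, rename pattern. The sketch below assumes a POSIX-like filesystem where rename within one filesystem is atomic; the file naming (".pending", ".previous") and the lock-file approach are illustrative choices, not decisions made by this document.

```python
import os
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Callable


@contextmanager
def store_lock(lock_path: Path):
    """Single-writer lock via exclusive file creation (simplified)."""
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def _write_synced(path: Path, data: bytes) -> None:
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())


def atomic_activate(active: Path, tmp_dir: Path, data: bytes,
                    verify: Callable[[bytes], bool]) -> None:
    """Write to a temp path, verify, keep the previous copy, then swap."""
    pending = tmp_dir / (active.name + ".pending")
    _write_synced(pending, data)
    if not verify(pending.read_bytes()):
        pending.unlink()
        raise ValueError("pending content failed verification")
    if active.exists():
        # Copy, then rename, so neither file is ever partially written in place.
        prev_tmp = tmp_dir / (active.name + ".previous.tmp")
        shutil.copyfile(active, prev_tmp)
        os.replace(prev_tmp, active.with_name(active.name + ".previous"))
    os.replace(pending, active)  # atomic when both paths share a filesystem
    if os.name == "posix":
        # Persist the rename itself by syncing the containing directory.
        dir_fd = os.open(active.parent, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A crash at any point leaves either the old active file or the new one, never a mix. The lock file shown here would survive a crash and block the next start, so a real agent would need stale-lock detection or an OS-level advisory lock instead.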

7. Cache Expiry and Cleanup

Local caches must be bounded.

Cleanup rules:

remove expired peer observations
remove expired route cache entries
compact telemetry buffers
retain only the policy-defined number of previous snapshots
remove stale pending updates after safe timeout
delete service assignment cache for removed roles after revocation is applied
wipe temporary files on startup

Caches may be rebuilt. Identity, trust, and active snapshots require stricter recovery behavior.
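
For the rebuildable caches, bounding and expiry can be combined in one small structure. The following is a sketch only: the class name, the oldest-first eviction policy, and the single TTL per cache are assumptions, and a real peer or route cache would likely carry per-entry expiry from signed data.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class BoundedTTLCache:
    """In-memory cache with a hard entry limit and time-based expiry."""

    def __init__(self, max_entries: int, ttl_seconds: float,
                 clock: Callable[[], float] = time.time):
        self._max = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        # Re-inserting moves the key to the newest position.
        self._entries.pop(key, None)
        self._entries[key] = (self._clock() + self._ttl, value)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if expires_at <= self._clock():
            del self._entries[key]
            return None
        return value

    def purge_expired(self) -> int:
        """Remove all expired entries and return how many were dropped."""
        now = self._clock()
        expired = [k for k, (exp, _) in self._entries.items() if exp <= now]
        for k in expired:
            del self._entries[k]
        return len(expired)

    def __len__(self) -> int:
        return len(self._entries)
```

The injectable clock makes the expiry tests listed later in this document straightforward to write without sleeping.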

8. Corruption Recovery

Recovery order:

load active verified state
reject corrupted pending state
fall back to previous verified snapshot if active snapshot is corrupt and policy allows it
request full snapshot from config/storage service
use bootstrap peers or control plane if storage/config is unavailable
enter degraded mode only if a valid snapshot exists and policy allows it
fail closed for trust/identity corruption

Corruption must be reported through health/status and local diagnostics.
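
The recovery order can be expressed as a pure planning function that turns integrity check results into an ordered list of steps. This is one possible reading, not a specification: the flag names and step labels are hypothetical, identity and trust are checked first because failing closed overrides every other step, and degraded mode is entered here only when config/storage is unreachable.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoreCheck:
    # Hypothetical results of startup integrity checks and policy lookups.
    identity_ok: bool
    trust_ok: bool
    active_ok: bool
    previous_ok: bool
    pending_corrupt: bool
    fallback_allowed: bool
    degraded_allowed: bool
    storage_reachable: bool


def plan_recovery(c: StoreCheck) -> List[str]:
    """Return the ordered recovery steps for the given check results."""
    if not (c.identity_ok and c.trust_ok):
        return ["fail_closed"]
    steps: List[str] = []
    if c.pending_corrupt:
        steps.append("reject_pending")
    if c.active_ok:
        steps.append("load_active")
        have_valid = True
    elif c.previous_ok and c.fallback_allowed:
        steps.append("load_previous")
        have_valid = True
    else:
        have_valid = False
    if not c.active_ok:
        # A corrupt active snapshot always triggers a refresh attempt.
        steps.append("request_full_snapshot" if c.storage_reachable
                     else "use_bootstrap_peers_or_control_plane")
    if not c.storage_reachable and have_valid and c.degraded_allowed:
        steps.append("enter_degraded_mode")
    return steps
```

Keeping the plan separate from its execution makes each branch of the recovery order testable without touching the filesystem.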

9. Multi-Cluster Isolation

A node may participate in multiple clusters only through isolated memberships.

Per-cluster isolation includes:

identity
certificates
trust bundle
signed snapshots
peer cache
route cache
service assignment cache
update/workload namespace where needed
telemetry namespace

Cross-cluster data sharing is forbidden unless explicit platform trust and policy allow it.

10. Service Workload Boundary

Service workloads do not write authoritative node-local state.

Allowed workload interactions:

read assigned service configuration through node-agent
report health/status to node-agent
request approved secret resolution through node-agent/control boundary
receive lifecycle commands from node-agent

Forbidden workload interactions:

mutate role assignments
mutate snapshots
mutate peer directory authority
write trust roots
write cross-cluster state
store unrelated organization secrets

11. Backup and Restore

Backup rules:

identity/private key backup is platform policy dependent and high-risk
snapshots and caches can usually be reconstructed
local route/peer caches should not be treated as backup-critical
trust state backup must preserve anti-rollback properties
restore must not allow replay of revoked identity or old trust roots

Restore must require control-plane validation before the node is trusted for new high-risk work.
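
The anti-rollback and revocation rules can be illustrated with a check that compares restored state against what the control plane currently reports. The ControlPlaneView type and its fields are hypothetical; this document does not define the validation exchange.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ControlPlaneView:
    # Hypothetical answer from the control plane during restore validation.
    min_trust_bundle_version: int
    revoked_node_ids: FrozenSet[str]


def restore_violations(node_id: str, restored_trust_version: int,
                       view: ControlPlaneView) -> List[str]:
    """List the reasons a restored node must not be trusted yet."""
    problems: List[str] = []
    if node_id in view.revoked_node_ids:
        problems.append("identity_revoked")
    if restored_trust_version < view.min_trust_bundle_version:
        problems.append("trust_bundle_rollback")
    return problems
```

An empty result would be a precondition for accepting new high-risk work, not a substitute for the control-plane validation itself.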

12. Observability

Node-agent should report safe local state metadata:

last applied config version
snapshot expiry/refresh status
trust bundle version
peer cache size
route cache size
degraded-mode state
local store health
last corruption/recovery event
pending update state

Reports must not include raw secrets or unrelated topology.
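
A simple way to honor the "no raw secrets or unrelated topology" rule is to build reports from an explicit allowlist rather than by filtering out known-bad fields. The key names below are assumed spellings of the metadata items listed above.

```python
from typing import Any, Dict

# Assumed key names for the safe metadata items listed above.
SAFE_REPORT_FIELDS = (
    "last_applied_config_version",
    "snapshot_refresh_status",
    "trust_bundle_version",
    "peer_cache_size",
    "route_cache_size",
    "degraded_mode_state",
    "local_store_health",
    "last_recovery_event",
    "pending_update_state",
)


def build_status_report(local_state: Dict[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; anything else is silently left out."""
    return {k: local_state[k] for k in SAFE_REPORT_FIELDS if k in local_state}
```

With an allowlist, a field added to local state later stays out of reports until someone deliberately marks it safe.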

13. Future Validation Tests

Future implementation tests must prove:

fresh install creates expected namespace layout
valid snapshot activates atomically
interrupted activation recovers to previous valid snapshot
corrupted pending update is ignored
corrupted active identity fails closed
peer cache expiry works
route cache expiry works
multi-cluster namespaces stay isolated
service workload cannot mutate authoritative local state
local store reports last applied config version
degraded-mode state is persisted and cleared correctly

14. C13 Preparation

C13 must define the Fabric Storage / Config Storage service that distributes snapshots, peer directories, trust bundles, and incremental updates to the node-local state store.

C13 must preserve:

PostgreSQL authority
signed snapshot verification
node-local bounded cache behavior
cluster/org/service isolation
no arbitrary query/database behavior

15. Result / Decision

Stage C12 defines node-local state as a bounded, scoped, verified local store owned by native rap-node-agent.

Decisions:

local state is namespaced per cluster
identity, trust, snapshots, peer cache, route cache, service assignment cache, health/degraded state, and update metadata are separate state classes
local state is not durable authority
snapshot activation must be atomic
caches are bounded and reconstructable
private keys and sensitive material require OS-protected or encrypted storage
service workloads cannot mutate authoritative node-local state
C13 must define distribution/storage services without turning them into a second source of truth

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C12.
