# Signed Scoped Cluster Snapshot Model

Status: Stage C11 result. Documentation and architecture only.

This document defines the signed scoped cluster snapshot model for future `rap-node-agent` node-local operation and degraded-mode recovery. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

Signed scoped cluster snapshots allow a node to operate from verified local configuration without asking the backend for every realtime routing decision.

The snapshot model must preserve these boundaries:

- PostgreSQL remains the only durable source of truth.
- Fabric Storage / Config Storage distributes signed snapshots and increments.
- Node-agent stores only scoped local copies.
- Redis remains live coordination only.
- Service Adapters consume assigned local configuration but do not define routing or cluster authority.

## 2. Snapshot Definition

A scoped cluster snapshot is a signed, versioned configuration package compiled from authoritative control-plane state.

Snapshot characteristics:

- cluster-scoped
- node-scoped
- role-scoped
- organization-scoped where applicable
- signed by an authorized control-plane/config signing key
- bounded in size
- time-limited
- reconstructable from PostgreSQL
- safe to store in node-local state

Snapshots are not mutable local databases. A node may cache them and use them for runtime decisions within policy, but it must not treat them as new durable truth.

## 3. Snapshot Envelope

Every snapshot must have a signed envelope.

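
A signed envelope only works if signer and verifier derive identical bytes from the same content. The following is a minimal sketch, assuming sorted-key compact JSON as the canonical form (a simplified stand-in, not a full canonicalization standard) and SHA-256 for `content_hash`. The function names and the `sha256:` prefix are hypothetical and not part of C11.

```python
import hashlib
import json


def canonical_bytes(content: dict) -> bytes:
    """Serialize content deterministically: sorted keys, no extra whitespace."""
    return json.dumps(
        content, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_hash(content: dict) -> str:
    """Hash the canonical bytes; the 'sha256:' prefix is an illustrative choice."""
    return "sha256:" + hashlib.sha256(canonical_bytes(content)).hexdigest()


# Key order in the source dict must not change the canonical form.
a = {"scope_type": "node_runtime", "config_version": 42}
b = {"config_version": 42, "scope_type": "node_runtime"}
assert canonical_bytes(a) == canonical_bytes(b)
assert content_hash(a) == content_hash(b)
```

Any production encoding would also need fixed rules for numbers, Unicode normalization, and nested structures, which this sketch does not address.
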
Required envelope fields:

- `snapshot_id`
- `schema_version`
- `cluster_id`
- `subject_node_id`
- `scope_type`
- `scope_ids`
- `roles`
- `organization_ids`
- `config_version`
- `authority_epoch`
- `issued_at`
- `valid_from`
- `expires_at`
- `refresh_after`
- `signer_key_id`
- `signature_algorithm`
- `content_hash`
- `signature`

Recommended signature algorithms:

- Ed25519 for compact modern signatures where supported
- RS256/RSA-PSS where compatibility with existing infrastructure is required

The exact wire encoding can be JSON canonicalization first and may evolve to a binary canonical form later. The important requirement is deterministic canonical bytes for signature verification.

## 4. Snapshot Scope Types

Supported initial scope types:

- `node_bootstrap`
- `node_runtime`
- `peer_directory`
- `service_assignment`
- `route_policy`
- `qos_policy`
- `trust_bundle`
- `storage_directory`
- `degraded_mode_policy`

The control plane may deliver one combined node runtime snapshot or multiple specialized snapshots. The node-agent local store must track version and expiry per scope.

## 5. Role-Based Snapshot Contents

Core mesh node snapshot may include:

- cluster identity
- node membership state
- allowed peer subset
- route policy subset
- QoS policy subset
- trust bundle
- config/storage refresh endpoints
- degraded-mode peer recovery policy

Ingress node snapshot may include:

- cluster identity
- ingress role assignment
- client entry policy subset
- token validation trust material
- route entry policy
- allowed service endpoint projections
- no full internal topology
- no service target credentials

Egress/service node snapshot may include:

- assigned service workload refs
- assigned resource refs
- service policy subset
- connector or `vpn_connection` refs when authorized
- route policy needed for assigned services
- secret resolver refs only, not raw secrets

Storage/config node snapshot may include:

- assigned storage/config shard scope
- replication metadata
- peer/storage refresh policy
- allowed snapshot families
- no unrelated tenant data

Thin/mobile node snapshot may include:

- minimal trust bundle
- active session/tunnel policy subset
- minimal peer/bootstrap data
- route refresh endpoints
- no full cluster topology

## 6. Snapshot Content Rules

Allowed content:

- ids and safe metadata
- role assignments for the subject scope
- policy refs and selected policy bodies needed by the node
- peer directory subset
- route/QoS policy subset
- trust roots and revocation metadata
- service workload desired-state refs
- secret resolver refs
- degraded-mode policy

Forbidden content:

- unrelated organization data
- broad organization user lists
- raw RDP/VNC/SSH credentials
- raw VPN credentials
- secrets outside approved resolver flow
- platform-wide topology for ordinary nodes
- arbitrary query grants
- audit authority
- durable policy mutation authority

## 7. Full Snapshots and Incremental Updates

Full snapshot:

- establishes node-local state for a scope
- repairs version gaps
- repairs corruption
- establishes a new `authority_epoch`
- may replace older snapshots for the same scope

Incremental update:

- applies to exactly one base `config_version`
- carries `base_config_version`
- carries `next_config_version`
- contains scoped patch operations or replacement sections
- is signed independently
- must be rejected if base version does not match

Rules:

- version gaps require full resync
- signature mismatch requires rejection and recovery
- expired snapshots cannot authorize new operations
- node heartbeat/status must report last applied version per scope
- rollback is forbidden unless signed recovery policy explicitly allows it

## 8. Trust Roots and Signing Key Rotation

The node-agent must know which config signing keys are trusted for each cluster.

Trust material may come from:

- enrollment response
- trust bundle snapshot
- manually installed platform root for bootstrap
- signed key rotation update

Signing key rotation rules:

1. New key is introduced in a signed trust bundle.
2. Node verifies the new key through existing trust.
3. Snapshots may be dual-signed during transition.
4. Old key is removed only after policy-defined rollout.
5. Compromised key is revoked through signed revocation metadata or emergency recovery flow.

A node must reject snapshots signed by unknown, expired, revoked, or cluster-mismatched keys.

## 9. Verification Algorithm

Before applying a snapshot, node-agent verifies:

1. Envelope schema is supported.
2. `cluster_id` matches local cluster membership.
3. `subject_node_id` matches the local node, unless the scope explicitly allows shared role data.
4. Signature key is trusted for the cluster and snapshot scope.
5. Signature verifies over canonical bytes.
6. `content_hash` matches content.
7. `valid_from`, `expires_at`, and `refresh_after` are acceptable.
8. `authority_epoch` is not stale.
9. `config_version` is newer than the local accepted version or allowed by a signed recovery policy.
10. Scope does not grant data beyond node role and organization authorization.
11. Snapshot content passes structural validation.
12. Snapshot does not contain forbidden raw secrets.

Failure must leave the previous valid snapshot active if policy allows it.

## 10. Degraded-Mode Use

Snapshots define what the node may do when disconnected from the backend or config/storage services.

Allowed when policy permits:

- continue already-running assigned services
- preserve existing authorized routes for a bounded TTL
- reconnect to active/warm/bootstrap peers
- use local trust bundle to validate peers
- use storage/config endpoints from the last valid snapshot
- report degraded status when connectivity returns

Forbidden in degraded mode:

- approve node enrollment
- issue certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partitions automatically
- fetch unrelated secrets
- create new service authority outside the snapshot scope

Degraded mode must be bounded by:

- snapshot expiry
- route/session TTL
- degraded-mode policy
- partition/authority state

## 11. Revocation and Expiry

Snapshots expire. Expiry is a correctness boundary, not just a cache hint.

Revocation sources:

- signed trust bundle update
- signed revocation list
- control-plane status after reconnect
- emergency recovery trust path

Revocation applies to:

- signing keys
- node identities
- role assignments
- service assignments
- peer eligibility
- storage/config endpoints
- degraded-mode permissions

If revocation state is unavailable, the node may only continue within the last valid degraded-mode policy and must not perform high-risk actions.

## 12. Rollback and Recovery

Normal rollback to an older config is forbidden.

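
The anti-rollback rule can be illustrated with a small version guard. This is a minimal sketch, assuming `authority_epoch` and `config_version` are monotonically increasing integers (C11 does not fix their types) and that the recovery policy's signature has already been verified elsewhere. `ScopeVersion` and `may_replace` are hypothetical names, and only the newer-epoch guard is modelled, not any "equivalent" guard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeVersion:
    authority_epoch: int   # assumed integer, higher is newer
    config_version: int    # assumed integer, higher is newer


def may_replace(local: ScopeVersion, incoming: ScopeVersion,
                signed_recovery_allows_rollback: bool = False) -> bool:
    """Decide whether an incoming snapshot may replace the local one."""
    # A stale authority epoch is never acceptable.
    if incoming.authority_epoch < local.authority_epoch:
        return False
    # Normal path: strictly newer config version.
    if incoming.config_version > local.config_version:
        return True
    # Same or older version: only under a signed recovery policy
    # that also carries a strictly newer authority epoch.
    return (signed_recovery_allows_rollback
            and incoming.authority_epoch > local.authority_epoch)


current = ScopeVersion(authority_epoch=3, config_version=10)
assert may_replace(current, ScopeVersion(3, 11))        # normal forward update
assert not may_replace(current, ScopeVersion(3, 9))     # plain rollback rejected
assert may_replace(current, ScopeVersion(4, 9), signed_recovery_allows_rollback=True)
```
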
Allowed recovery cases:

- local snapshot file corruption
- interrupted incremental update
- bad non-authoritative cache state
- version gap requiring full resync

Recovery order:

1. keep last verified active snapshot
2. reject bad update
3. request full snapshot from config/storage service
4. use bootstrap peers if refresh endpoints fail
5. reconnect to control plane when available
6. enter degraded mode only if policy allows

Rollback to an older signed snapshot requires explicit signed recovery policy with a newer `authority_epoch` or equivalent anti-rollback guard.

## 13. Node-Agent Local Expectations

Node-agent must store:

- active snapshot per scope
- previous verified snapshot for recovery
- pending downloaded snapshot/update before activation
- verification metadata
- last applied versions
- signer key ids
- expiry/refresh deadlines
- rejection reason for last failed update

Activation should be atomic from the node-agent perspective:

- download to pending
- verify
- write to durable local store
- swap active pointer
- notify supervised services of relevant changes
- report applied version in heartbeat/status

C12 will define the local store layout and durability details.

## 14. Distribution Relationship

Snapshot production flow:

1. PostgreSQL authoritative state changes.
2. Control-plane snapshot compiler builds scoped view.
3. Compiler validates scope and removes forbidden data.
4. Snapshot is signed by config signing key.
5. Snapshot or increment is published to Fabric Storage / Config Storage.
6. Node-agent refreshes by version.
7. Node-agent verifies and applies locally.

Node-origin reports such as health, heartbeat, or observed latency are not authoritative config writes. They may influence future compiled snapshots only after the control plane accepts them according to policy.

## 15. Validation and Future Tests

Future implementation tests must prove:

- valid snapshot applies
- invalid signature rejected
- wrong cluster rejected
- wrong node rejected
- expired snapshot rejected for new authority
- rollback rejected
- version gap triggers full resync
- forbidden raw secret content rejected
- unrelated organization data rejected
- wrong role scope rejected
- incremental update applies only to matching base version
- revoked signer rejected
- degraded-mode forbidden actions are blocked

## 16. C12 Preparation

C12 must define how node-agent stores and protects:

- snapshot files
- identity material references
- trust bundle cache
- peer cache
- route cache
- service assignment cache
- health/degraded state
- update metadata

C12 must not turn local state into durable authority. It must preserve the C11 rule that snapshots are verified scoped copies of PostgreSQL-derived state.

## 17. Result / Decision

Stage C11 defines signed scoped cluster snapshots as the required bridge between the authoritative control plane and node-local runtime operation.

Decisions:

- snapshots are signed, versioned, scoped, bounded, and expiring
- snapshots are generated from PostgreSQL source-of-truth state
- snapshots may be distributed by Fabric Storage / Config Storage
- node-agent verifies before applying
- node-agent may operate from snapshots only within policy
- snapshots must not contain raw secrets or unrelated organization data
- incremental updates require exact base-version matching
- rollback requires explicit signed recovery policy
- C12 must define local storage without changing these authority boundaries

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C11.
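
For illustration only, the verification order from section 9 can be sketched as a function that returns the first failed check. This is a non-normative sketch under several assumptions: timestamps are Unix seconds, versions and epochs are integers, the signature verifier is injected as a callable (`fake_verify` below is a stand-in, not real cryptography), and the `content_hash` check, the shared-role exception in step 3, the recovery-policy exception in step 9, and steps 10 to 12 are omitted. All type names, function names, and reason strings are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

SUPPORTED_SCHEMAS = {1}  # hypothetical supported schema versions


@dataclass(frozen=True)
class Envelope:
    schema_version: int
    cluster_id: str
    subject_node_id: str
    signer_key_id: str
    authority_epoch: int
    config_version: int
    valid_from: float    # Unix seconds (assumed encoding)
    expires_at: float
    signature: bytes


@dataclass(frozen=True)
class LocalState:
    cluster_id: str
    node_id: str
    trusted_key_ids: frozenset
    authority_epoch: int
    config_version: int


def rejection_reason(env: Envelope, local: LocalState, canonical: bytes,
                     verify_sig: Callable[[str, bytes, bytes], bool],
                     now: float) -> Optional[str]:
    """Return None if the envelope passes, else the first failed check."""
    if env.schema_version not in SUPPORTED_SCHEMAS:
        return "unsupported_schema"
    if env.cluster_id != local.cluster_id:
        return "wrong_cluster"
    if env.subject_node_id != local.node_id:
        return "wrong_node"
    if env.signer_key_id not in local.trusted_key_ids:
        return "untrusted_signer"
    if not verify_sig(env.signer_key_id, canonical, env.signature):
        return "bad_signature"
    if not (env.valid_from <= now < env.expires_at):
        return "outside_validity_window"
    if env.authority_epoch < local.authority_epoch:
        return "stale_epoch"
    if env.config_version <= local.config_version:
        return "not_newer"
    return None


def fake_verify(key_id: str, data: bytes, sig: bytes) -> bool:
    # Stand-in verifier for the example; a real one would check Ed25519 or RSA.
    return sig == b"ok"


local = LocalState("c1", "n1", frozenset({"k1"}), authority_epoch=5, config_version=10)
env = Envelope(1, "c1", "n1", "k1", 5, 11, valid_from=100.0, expires_at=200.0, signature=b"ok")

assert rejection_reason(env, local, b"payload", fake_verify, now=150.0) is None
assert rejection_reason(replace(env, cluster_id="c2"), local, b"payload",
                        fake_verify, now=150.0) == "wrong_cluster"
assert rejection_reason(env, local, b"payload", fake_verify, now=250.0) == "outside_validity_window"
```

Returning a reason string rather than a boolean fits the section 13 expectation that node-agent records the rejection reason for the last failed update.
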