# Signed Scoped Cluster Snapshot Model

Status: Stage C11 result. Documentation and architecture only.

This document defines the signed scoped cluster snapshot model for future
`rap-node-agent` node-local operation and degraded-mode recovery. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.

## 1. Purpose

Signed scoped cluster snapshots allow a node to operate from verified local
configuration without asking the backend for every realtime routing decision.

The snapshot model must preserve these boundaries:

- PostgreSQL remains the only durable source of truth.
- Fabric Storage / Config Storage distributes signed snapshots and increments.
- Node-agent stores only scoped local copies.
- Redis remains live coordination only.
- Service Adapters consume assigned local configuration but do not define
  routing or cluster authority.

## 2. Snapshot Definition

A scoped cluster snapshot is a signed, versioned configuration package compiled
from authoritative control-plane state.

Snapshot characteristics:

- cluster-scoped
- node-scoped
- role-scoped
- organization-scoped where applicable
- signed by an authorized control-plane/config signing key
- bounded in size
- time-limited
- reconstructable from PostgreSQL
- safe to store in node-local state

Snapshots are not mutable local databases. A node may cache them and use them
for runtime decisions within policy, but it must not treat them as new durable
truth.

## 3. Snapshot Envelope

Every snapshot must have a signed envelope.

Required envelope fields:

- `snapshot_id`
- `schema_version`
- `cluster_id`
- `subject_node_id`
- `scope_type`
- `scope_ids`
- `roles`
- `organization_ids`
- `config_version`
- `authority_epoch`
- `issued_at`
- `valid_from`
- `expires_at`
- `refresh_after`
- `signer_key_id`
- `signature_algorithm`
- `content_hash`
- `signature`

Recommended signature algorithms:

- Ed25519 for compact modern signatures where supported
- RSA (RS256 or RSA-PSS) where compatibility with existing infrastructure is
  required

The wire encoding can start as canonicalized JSON and may evolve to a binary
canonical form later. The key requirement is that signing and verification
operate on the same deterministic canonical bytes.
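
The canonical-bytes requirement can be illustrated with a minimal sketch. It assumes JSON with sorted keys and compact separators as the canonical form, SHA-256 for `content_hash`, and a signing input that is the envelope minus its `signature` field. None of these choices are fixed by C11, and the helper names are hypothetical. It is not a full RFC 8785 (JCS) implementation; number formatting in particular would need more care.

```python
import hashlib
import json


def canonical_bytes(obj: dict) -> bytes:
    """Serialize to deterministic bytes: sorted keys, no extra whitespace, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_hash(content: dict) -> str:
    """Hex SHA-256 over the canonical content bytes (hash choice is an assumption)."""
    return hashlib.sha256(canonical_bytes(content)).hexdigest()


def signing_input(envelope: dict) -> bytes:
    """Canonical bytes of the envelope with the `signature` field removed."""
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    return canonical_bytes(unsigned)
```

Because key order and whitespace are normalized, two logically equal envelopes produce identical bytes regardless of how the producer ordered its fields.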

## 4. Snapshot Scope Types

Supported initial scope types:

- `node_bootstrap`
- `node_runtime`
- `peer_directory`
- `service_assignment`
- `route_policy`
- `qos_policy`
- `trust_bundle`
- `storage_directory`
- `degraded_mode_policy`

The control plane may deliver one combined node runtime snapshot or multiple
specialized snapshots. The node-agent local store must track version and expiry
per scope.
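
Per-scope tracking could look roughly like the sketch below. `ScopeState` and `ScopeTracker` are hypothetical names, and `config_version` is assumed to be a monotonically increasing integer; the actual store layout is deferred to C12.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopeState:
    scope_type: str        # e.g. "route_policy"
    config_version: int    # assumption: integer versions
    expires_at: datetime
    refresh_after: datetime


class ScopeTracker:
    """Tracks the last accepted version and deadlines for each scope type."""

    def __init__(self) -> None:
        self._scopes = {}

    def record(self, state: ScopeState) -> None:
        # One entry per scope type; a newer accepted snapshot replaces it.
        self._scopes[state.scope_type] = state

    def expired(self, now: datetime) -> list:
        return sorted(s.scope_type for s in self._scopes.values() if now >= s.expires_at)

    def due_for_refresh(self, now: datetime) -> list:
        return sorted(s.scope_type for s in self._scopes.values() if now >= s.refresh_after)
```

Keeping deadlines per scope lets a node refresh an ageing `trust_bundle` without touching a still-fresh `route_policy`.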

## 5. Role-Based Snapshot Contents

A core mesh node snapshot may include:

- cluster identity
- node membership state
- allowed peer subset
- route policy subset
- QoS policy subset
- trust bundle
- config/storage refresh endpoints
- degraded-mode peer recovery policy

An ingress node snapshot may include:

- cluster identity
- ingress role assignment
- client entry policy subset
- token validation trust material
- route entry policy
- allowed service endpoint projections
- no full internal topology
- no service target credentials

An egress/service node snapshot may include:

- assigned service workload refs
- assigned resource refs
- service policy subset
- connector or `vpn_connection` refs when authorized
- route policy needed for assigned services
- secret resolver refs only, not raw secrets

A storage/config node snapshot may include:

- assigned storage/config shard scope
- replication metadata
- peer/storage refresh policy
- allowed snapshot families
- no unrelated tenant data

A thin/mobile node snapshot may include:

- minimal trust bundle
- active session/tunnel policy subset
- minimal peer/bootstrap data
- route refresh endpoints
- no full cluster topology

## 6. Snapshot Content Rules

Allowed content:

- ids and safe metadata
- role assignments for the subject scope
- policy refs and selected policy bodies needed by the node
- peer directory subset
- route/QoS policy subset
- trust roots and revocation metadata
- service workload desired-state refs
- secret resolver refs
- degraded-mode policy

Forbidden content:

- unrelated organization data
- broad organization user lists
- raw RDP/VNC/SSH credentials
- raw VPN credentials
- secrets outside approved resolver flow
- platform-wide topology for ordinary nodes
- arbitrary query grants
- audit authority
- durable policy mutation authority
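
One way a compiler or node-agent might screen for forbidden raw secrets is a recursive walk over the snapshot content. The sketch below is illustrative only: the key names in `FORBIDDEN_KEYS` are hypothetical, and a real deny-list would be derived from the snapshot schema rather than from key-name matching alone.

```python
# Hypothetical key names; the real deny-list would come from the snapshot schema.
FORBIDDEN_KEYS = {"password", "private_key", "vpn_psk", "rdp_password", "ssh_private_key"}


def find_forbidden(content, path=""):
    """Return the paths of any forbidden keys found anywhere in the content."""
    hits = []
    if isinstance(content, dict):
        for key, value in content.items():
            child = f"{path}.{key}" if path else str(key)
            if str(key).lower() in FORBIDDEN_KEYS:
                hits.append(child)
            # Keep descending so nested violations are reported too.
            hits.extend(find_forbidden(value, child))
    elif isinstance(content, list):
        for index, item in enumerate(content):
            hits.extend(find_forbidden(item, f"{path}[{index}]"))
    return hits
```

A snapshot that carries only a resolver reference, such as `{"secret_ref": "resolver://..."}`, passes this check, while one embedding a `password` field is reported with its path.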

## 7. Full Snapshots and Incremental Updates

A full snapshot:

- establishes node-local state for a scope
- repairs version gaps
- repairs corruption
- establishes a new `authority_epoch`
- may replace older snapshots for the same scope

An incremental update:

- applies to exactly one base `config_version`
- carries `base_config_version`
- carries `next_config_version`
- contains scoped patch operations or replacement sections
- is signed independently
- must be rejected if `base_config_version` does not match the locally applied
  `config_version`

Rules:

- version gaps require full resync
- signature mismatch requires rejection and recovery
- expired snapshots cannot authorize new operations
- node heartbeat/status must report last applied version per scope
- rollback is forbidden unless signed recovery policy explicitly allows it
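
The base-version rule can be sketched as a small decision function. It assumes integer versions and that the increment's signature has already been verified; `Decision` and `decide_increment` are hypothetical names.

```python
from enum import Enum


class Decision(Enum):
    APPLY = "apply"
    REJECT = "reject"
    FULL_RESYNC = "full_resync"   # reject the increment and request a full snapshot


def decide_increment(local_version: int, base_version: int, next_version: int) -> Decision:
    """Decide what to do with an already signature-verified incremental update."""
    if next_version <= base_version:
        return Decision.REJECT        # malformed: an increment must move forward
    if base_version == local_version:
        return Decision.APPLY         # exact base match
    if base_version > local_version:
        return Decision.FULL_RESYNC   # one or more versions were missed
    return Decision.REJECT            # stale or replayed update
```

Treating a base version ahead of the local version as a gap, rather than attempting to chain missing increments, keeps the node on the simple path the rules describe.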

## 8. Trust Roots and Signing Key Rotation

The node-agent must know which config signing keys are trusted for each cluster.

Trust material may come from:

- enrollment response
- trust bundle snapshot
- manually installed platform root for bootstrap
- signed key rotation update

Signing key rotation rules:

1. New key is introduced in a signed trust bundle.
2. Node verifies the new key through existing trust.
3. Snapshots may be dual-signed during transition.
4. Old key is removed only after policy-defined rollout.
5. Compromised key is revoked through signed revocation metadata or emergency
   recovery flow.

A node must reject snapshots signed by unknown, expired, revoked, or
cluster-mismatched keys.
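
The four rejection conditions map naturally onto a lookup against the local trust store. This is a sketch under assumptions: `TrustedKey`, its fields, and the reason strings are hypothetical, and the trust store is modelled as a plain dictionary keyed by `signer_key_id`.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TrustedKey:
    key_id: str
    cluster_id: str
    not_after: datetime
    revoked: bool = False


def check_signer(keys: dict, key_id: str, cluster_id: str, now: datetime) -> Optional[str]:
    """Return a rejection reason, or None if the signer key is acceptable."""
    key = keys.get(key_id)
    if key is None:
        return "unknown_key"
    if key.revoked:
        return "revoked_key"
    if key.cluster_id != cluster_id:
        return "cluster_mismatch"
    if now >= key.not_after:
        return "expired_key"
    return None
```

During a rotation window both the old and the new key would be present in `keys`, so snapshots signed by either verify until the old entry is removed or revoked.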

## 9. Verification Algorithm

Before applying a snapshot, node-agent verifies:

1. Envelope schema is supported.
2. `cluster_id` matches local cluster membership.
3. `subject_node_id` matches the local node, unless the scope explicitly allows
   shared role data.
4. Signer key is trusted for the cluster and snapshot scope.
5. Signature verifies over canonical bytes.
6. `content_hash` matches content.
7. `valid_from`, `expires_at`, and `refresh_after` are acceptable.
8. `authority_epoch` is not stale.
9. `config_version` is newer than the local accepted version or allowed by a
   signed recovery policy.
10. Scope does not grant data beyond node role and organization authorization.
11. Snapshot content passes structural validation.
12. Snapshot does not contain forbidden raw secrets.

If any check fails, the node-agent must reject the snapshot and leave the
previous valid snapshot active if policy allows it.
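
The ordered checks above can be sketched as a function that returns the first failure reason. This sketch covers steps 1 through 9 only and omits the shared-role exception, recovery policy, and content validation. The `local` dictionary shape, the reason strings, integer versions and epochs, ISO 8601 timestamps, and SHA-256 hashing are all assumptions. The signature primitive is injected as `verify_sig` so the example does not depend on a particular crypto library.

```python
import hashlib
import json
from datetime import datetime, timezone

SUPPORTED_SCHEMAS = {1}  # assumption: integer schema versions


def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_snapshot(envelope, content, local, verify_sig, now):
    """Run ordered checks; return None on success or the first failure reason.

    `local` holds the node's view: cluster_id, node_id, trusted_key_ids,
    authority_epoch, config_version. `verify_sig(key_id, message, signature)`
    is a caller-supplied callable wrapping the real signature primitive.
    """
    if envelope["schema_version"] not in SUPPORTED_SCHEMAS:
        return "unsupported_schema"
    if envelope["cluster_id"] != local["cluster_id"]:
        return "wrong_cluster"
    if envelope["subject_node_id"] != local["node_id"]:
        return "wrong_node"
    if envelope["signer_key_id"] not in local["trusted_key_ids"]:
        return "untrusted_signer"
    # The signature covers the envelope without its own `signature` field.
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    if not verify_sig(envelope["signer_key_id"], canonical(unsigned), envelope["signature"]):
        return "bad_signature"
    if hashlib.sha256(canonical(content)).hexdigest() != envelope["content_hash"]:
        return "content_hash_mismatch"
    valid_from = datetime.fromisoformat(envelope["valid_from"])
    expires_at = datetime.fromisoformat(envelope["expires_at"])
    if not (valid_from <= now < expires_at):
        return "outside_validity_window"
    if envelope["authority_epoch"] < local["authority_epoch"]:
        return "stale_epoch"
    if envelope["config_version"] <= local["config_version"]:
        return "not_newer"
    return None
```

Returning a reason rather than raising makes it straightforward to record the rejection reason for the last failed update while the previously active snapshot stays in place.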

## 10. Degraded-Mode Use

Snapshots define what the node may do when disconnected from the backend or
config/storage services.

Allowed when policy permits:

- continue already-running assigned services
- preserve existing authorized routes for a bounded TTL
- reconnect to active/warm/bootstrap peers
- use local trust bundle to validate peers
- use storage/config endpoints from the last valid snapshot
- report degraded status when connectivity returns

Forbidden in degraded mode:

- approve node enrollment
- issue certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partitions automatically
- fetch unrelated secrets
- create new service authority outside the snapshot scope

Degraded mode must be bounded by:

- snapshot expiry
- route/session TTL
- degraded-mode policy
- partition/authority state
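
As a rough illustration of these bounds, the sketch below takes the earliest of the time-based limits as a single deadline and denies forbidden actions unconditionally. This is a simplification: it folds the route/session TTL into one node-wide deadline and does not model partition/authority state. The action names and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical labels for the actions forbidden in degraded mode.
FORBIDDEN_IN_DEGRADED = frozenset({
    "approve_enrollment", "issue_certificate", "assign_role",
    "change_cluster_policy", "change_organization_policy",
    "rotate_trust_roots", "promote_partition", "fetch_unrelated_secret",
})


def degraded_deadline(entered_at, snapshot_expires_at, route_ttl, policy_max):
    """Degraded operation ends at the earliest of the applicable time bounds."""
    return min(snapshot_expires_at, entered_at + route_ttl, entered_at + policy_max)


def may_perform(action, allowed_by_policy, now, deadline):
    """Deny forbidden actions outright; allow others only inside the bound."""
    if action in FORBIDDEN_IN_DEGRADED:
        return False
    return action in allowed_by_policy and now < deadline
```

Checking the forbidden set first means a misconfigured policy that lists a forbidden action still cannot enable it.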

## 11. Revocation and Expiry

Snapshots expire. Expiry is a correctness boundary, not just a cache hint.

Revocation sources:

- signed trust bundle update
- signed revocation list
- control-plane status after reconnect
- emergency recovery trust path

Revocation applies to:

- signing keys
- node identities
- role assignments
- service assignments
- peer eligibility
- storage/config endpoints
- degraded-mode permissions

If revocation state is unavailable, the node may only continue within the last
valid degraded-mode policy and must not perform high-risk actions.

## 12. Rollback and Recovery

Normal rollback to an older config is forbidden.

Allowed recovery cases:

- local snapshot file corruption
- interrupted incremental update
- bad non-authoritative cache state
- version gap requiring full resync

Recovery order:

1. keep last verified active snapshot
2. reject bad update
3. request full snapshot from config/storage service
4. use bootstrap peers if refresh endpoints fail
5. reconnect to control plane when available
6. enter degraded mode only if policy allows

Rollback to an older signed snapshot requires explicit signed recovery policy
with a newer `authority_epoch` or equivalent anti-rollback guard.
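
A minimal anti-rollback guard might compare version and epoch as follows. The sketch assumes integer `config_version` and `authority_epoch` values, treats an equal version like an older one, and reduces the signed recovery policy to a boolean flag. `accept_version` is a hypothetical name.

```python
def accept_version(local_version, local_epoch, new_version, new_epoch,
                   recovery_allows_rollback=False):
    """Return True if an already verified snapshot may replace the active one."""
    if new_epoch < local_epoch:
        return False   # stale authority is never accepted
    if new_version > local_version:
        return True    # normal forward progress
    # Older or equal config: only with signed recovery policy AND a newer epoch.
    return recovery_allows_rollback and new_epoch > local_epoch
```

Requiring a newer epoch for any backwards move means a replayed old snapshot cannot be accepted merely because a recovery flag is set.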

## 13. Node-Agent Local Expectations

Node-agent must store:

- active snapshot per scope
- previous verified snapshot for recovery
- pending downloaded snapshot/update before activation
- verification metadata
- last applied versions
- signer key ids
- expiry/refresh deadlines
- rejection reason for last failed update

Activation should be atomic from the node-agent perspective:

- download to pending
- verify
- write to durable local store
- swap active pointer
- notify supervised services of relevant changes
- report applied version in heartbeat/status
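
The verify, write, and swap steps could be sketched with a temporary file and an atomic rename. The file naming (`<scope>.active.json`) and the `activate` helper are purely illustrative, and the sketch does not retain the previous verified snapshot or notify services.

```python
import json
import os
import tempfile
from pathlib import Path


def activate(store_dir: Path, scope: str, snapshot: dict, verify) -> bool:
    """Verify a pending snapshot, persist it, then atomically swap the active file."""
    if not verify(snapshot):
        return False                       # active snapshot is left untouched
    store_dir.mkdir(parents=True, exist_ok=True)
    active = store_dir / f"{scope}.active.json"
    fd, tmp_name = tempfile.mkstemp(dir=store_dir, prefix=f"{scope}.", suffix=".pending")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(snapshot, fh, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())          # make the bytes durable before the swap
        os.replace(tmp_name, active)       # atomic rename on POSIX, same filesystem
    except Exception:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return True
```

Because readers only ever open the active path, they see either the old snapshot or the new one, never a partially written file.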

C12 will define the local store layout and durability details.

## 14. Distribution Relationship

Snapshot production flow:

1. PostgreSQL authoritative state changes.
2. Control-plane snapshot compiler builds scoped view.
3. Compiler validates scope and removes forbidden data.
4. Snapshot is signed by config signing key.
5. Snapshot or increment is published to Fabric Storage / Config Storage.
6. Node-agent refreshes by version.
7. Node-agent verifies and applies locally.

Node-origin reports such as health, heartbeat, or observed latency are not
authoritative config writes. They may influence future compiled snapshots only
after the control plane accepts them according to policy.

## 15. Validation and Future Tests

Future implementation tests must prove:

- valid snapshot applies
- invalid signature rejected
- wrong cluster rejected
- wrong node rejected
- expired snapshot rejected for new authority
- rollback rejected
- version gap triggers full resync
- forbidden raw secret content rejected
- unrelated organization data rejected
- wrong role scope rejected
- incremental update applies only to matching base version
- revoked signer rejected
- degraded-mode forbidden actions are blocked

## 16. C12 Preparation

C12 must define how node-agent stores and protects:

- snapshot files
- identity material references
- trust bundle cache
- peer cache
- route cache
- service assignment cache
- health/degraded state
- update metadata

C12 must not turn local state into durable authority. It must preserve the C11
rule that snapshots are verified scoped copies of PostgreSQL-derived state.

## 17. Result / Decision

Stage C11 defines signed scoped cluster snapshots as the required bridge between
the authoritative control plane and node-local runtime operation.

Decisions:

- snapshots are signed, versioned, scoped, bounded, and expiring
- snapshots are generated from PostgreSQL source-of-truth state
- snapshots may be distributed by Fabric Storage / Config Storage
- node-agent verifies before applying
- node-agent may operate from snapshots only within policy
- snapshots must not contain raw secrets or unrelated organization data
- incremental updates require exact base-version matching
- rollback requires explicit signed recovery policy
- C12 must define local storage without changing these authority boundaries

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C11.