# Signed Scoped Cluster Snapshot Model

Status: Stage C11 result. Documentation and architecture only.

This document defines the signed scoped cluster snapshot model for future
`rap-node-agent` node-local operation and degraded-mode recovery. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.

## 1. Purpose

Signed scoped cluster snapshots allow a node to operate from verified local
configuration without asking the backend for every realtime routing decision.

The snapshot model must preserve these boundaries:

- PostgreSQL remains the only durable source of truth.
- Fabric Storage / Config Storage distributes signed snapshots and increments.
- Node-agent stores only scoped local copies.
- Redis remains live coordination only.
- Service Adapters consume assigned local configuration but do not define
  routing or cluster authority.

## 2. Snapshot Definition

A scoped cluster snapshot is a signed, versioned configuration package compiled
from authoritative control-plane state.

Snapshot characteristics:

- cluster-scoped
- node-scoped
- role-scoped
- organization-scoped where applicable
- signed by an authorized control-plane/config signing key
- bounded in size
- time-limited
- reconstructable from PostgreSQL
- safe to store in node-local state

Snapshots are not mutable local databases. A node may cache them and use them
for runtime decisions within policy, but it must not treat them as new durable
truth.

## 3. Snapshot Envelope

Every snapshot must have a signed envelope.

Required envelope fields:

- `snapshot_id`
- `schema_version`
- `cluster_id`
- `subject_node_id`
- `scope_type`
- `scope_ids`
- `roles`
- `organization_ids`
- `config_version`
- `authority_epoch`
- `issued_at`
- `valid_from`
- `expires_at`
- `refresh_after`
- `signer_key_id`
- `signature_algorithm`
- `content_hash`
- `signature`

Recommended signature algorithms:

- Ed25519 for compact modern signatures where supported
- RS256 (RSASSA-PKCS1-v1_5) or RSA-PSS where compatibility with existing
  infrastructure is required

The wire encoding can start as canonicalized JSON and may evolve to a binary
canonical form later. The important requirement is deterministic canonical
bytes for signature verification.

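
To make the "deterministic canonical bytes" requirement concrete, here is a minimal sketch of an envelope, its `content_hash`, and the bytes a signer would cover. It is illustrative only: the field values, the role name `core_mesh`, the `sha256:` prefix, and the helper names are assumptions, and sorted-key compact JSON is a simple stand-in for a full canonicalization scheme such as RFC 8785 (JCS).

```python
import hashlib
import json


def canonical_bytes(obj) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def content_hash(content: dict) -> str:
    """Hash of the snapshot content (hypothetical 'sha256:' prefix)."""
    return "sha256:" + hashlib.sha256(canonical_bytes(content)).hexdigest()


def signing_input(envelope: dict) -> bytes:
    """Canonical bytes of every envelope field except the signature itself."""
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    return canonical_bytes(unsigned)


# Hypothetical snapshot content and envelope values, for illustration only.
content = {"peers": ["node-b", "node-c"], "route_policy": {"default": "deny"}}
envelope = {
    "snapshot_id": "snap-0001",
    "schema_version": 1,
    "cluster_id": "cluster-1",
    "subject_node_id": "node-a",
    "scope_type": "node_runtime",
    "scope_ids": ["node-a"],
    "roles": ["core_mesh"],
    "organization_ids": ["org-a"],
    "config_version": 42,
    "authority_epoch": 3,
    "issued_at": "2025-01-01T00:00:00+00:00",
    "valid_from": "2025-01-01T00:00:00+00:00",
    "expires_at": "2025-01-02T00:00:00+00:00",
    "refresh_after": "2025-01-01T12:00:00+00:00",
    "signer_key_id": "cfg-key-1",
    "signature_algorithm": "Ed25519",
    "content_hash": content_hash(content),
    "signature": "<filled in by the signer>",
}

# The signer signs, and the node verifies, exactly these bytes.
message = signing_input(envelope)
```

Because keys are sorted and whitespace is fixed, two producers that build the same envelope in a different field order still arrive at the same `message`, which is the property signature verification depends on.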
## 4. Snapshot Scope Types

Supported initial scope types:

- `node_bootstrap`
- `node_runtime`
- `peer_directory`
- `service_assignment`
- `route_policy`
- `qos_policy`
- `trust_bundle`
- `storage_directory`
- `degraded_mode_policy`

The control plane may deliver one combined node runtime snapshot or multiple
specialized snapshots. The node-agent local store must track version and expiry
per scope.

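
As a rough illustration of per-scope tracking, the sketch below keeps one version and one expiry per scope type. `ScopeState` and `ScopeTracker` are hypothetical names, and an in-memory dictionary stands in for whatever local store C12 eventually defines.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopeState:
    config_version: int
    expires_at: datetime


class ScopeTracker:
    """Tracks the last applied version and expiry for each scope type."""

    def __init__(self) -> None:
        self._scopes = {}

    def record(self, scope_type: str, config_version: int, expires_at: datetime) -> None:
        current = self._scopes.get(scope_type)
        # Versions only move forward within a scope.
        if current is not None and config_version <= current.config_version:
            raise ValueError(f"{scope_type}: version {config_version} is not newer")
        self._scopes[scope_type] = ScopeState(config_version, expires_at)

    def expired(self, now: datetime) -> list:
        """Scope types whose snapshot has reached its expiry."""
        return sorted(s for s, st in self._scopes.items() if now >= st.expires_at)

    def versions(self) -> dict:
        """Last applied version per scope, e.g. for heartbeat/status reports."""
        return {s: st.config_version for s, st in self._scopes.items()}


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
tracker = ScopeTracker()
tracker.record("trust_bundle", 5, now + timedelta(days=7))
tracker.record("route_policy", 12, now + timedelta(minutes=30))
```

With this shape, a short-lived `route_policy` snapshot can expire without invalidating a longer-lived `trust_bundle`, whether the scopes arrived combined or separately.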
## 5. Role-Based Snapshot Contents

Core mesh node snapshot may include:

- cluster identity
- node membership state
- allowed peer subset
- route policy subset
- QoS policy subset
- trust bundle
- config/storage refresh endpoints
- degraded-mode peer recovery policy

Ingress node snapshot may include:

- cluster identity
- ingress role assignment
- client entry policy subset
- token validation trust material
- route entry policy
- allowed service endpoint projections
- no full internal topology
- no service target credentials

Egress/service node snapshot may include:

- assigned service workload refs
- assigned resource refs
- service policy subset
- connector or `vpn_connection` refs when authorized
- route policy needed for assigned services
- secret resolver refs only, not raw secrets

Storage/config node snapshot may include:

- assigned storage/config shard scope
- replication metadata
- peer/storage refresh policy
- allowed snapshot families
- no unrelated tenant data

Thin/mobile node snapshot may include:

- minimal trust bundle
- active session/tunnel policy subset
- minimal peer/bootstrap data
- route refresh endpoints
- no full cluster topology

## 6. Snapshot Content Rules

Allowed content:

- ids and safe metadata
- role assignments for the subject scope
- policy refs and selected policy bodies needed by the node
- peer directory subset
- route/QoS policy subset
- trust roots and revocation metadata
- service workload desired-state refs
- secret resolver refs
- degraded-mode policy

Forbidden content:

- unrelated organization data
- broad organization user lists
- raw RDP/VNC/SSH credentials
- raw VPN credentials
- secrets outside approved resolver flow
- platform-wide topology for ordinary nodes
- arbitrary query grants
- audit authority
- durable policy mutation authority

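
Two of the forbidden-content rules above lend themselves to a mechanical check. The following is a minimal sketch under stated assumptions: the key names in `FORBIDDEN_KEYS`, the `organization_id` field, and the `resolver://` reference format are all hypothetical, and a real compiler or node-agent would need a far more complete rule set.

```python
# Hypothetical key names that would indicate a raw secret in snapshot content.
FORBIDDEN_KEYS = frozenset({"password", "private_key", "pre_shared_key"})


def find_forbidden_keys(node, path="$"):
    """Return the paths of any keys that look like raw secrets."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key.lower() in FORBIDDEN_KEYS:
                hits.append(child)
            hits.extend(find_forbidden_keys(value, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_forbidden_keys(item, f"{path}[{index}]"))
    return hits


def find_foreign_orgs(node, allowed_orgs, path="$"):
    """Return the paths of organization references outside the allowed set."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key == "organization_id" and value not in allowed_orgs:
                hits.append(child)
            hits.extend(find_foreign_orgs(value, allowed_orgs, child))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            hits.extend(find_foreign_orgs(item, allowed_orgs, f"{path}[{index}]"))
    return hits


# Hypothetical content: the first entry is acceptable, the second is not.
content = {
    "services": [
        {"id": "svc-1", "organization_id": "org-a", "secret_ref": "resolver://svc-1/db"},
        {"id": "svc-2", "organization_id": "org-b", "password": "hunter2"},
    ]
}

secret_hits = find_forbidden_keys(content)
org_hits = find_foreign_orgs(content, allowed_orgs={"org-a"})
```

A key-name scan like this only catches obvious mistakes. It cannot prove that a value is not a secret, so it complements rather than replaces scoping in the snapshot compiler.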
## 7. Full Snapshots and Incremental Updates

Full snapshot:

- establishes node-local state for a scope
- repairs version gaps
- repairs corruption
- establishes a new `authority_epoch`
- may replace older snapshots for the same scope

Incremental update:

- applies to exactly one base `config_version`
- carries `base_config_version`
- carries `next_config_version`
- contains scoped patch operations or replacement sections
- is signed independently
- must be rejected if base version does not match

Rules:

- version gaps require full resync
- signature mismatch requires rejection and recovery
- expired snapshots cannot authorize new operations
- node heartbeat/status must report last applied version per scope
- rollback is forbidden unless signed recovery policy explicitly allows it

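
The base-version rule can be sketched as follows. This assumes the increment's signature has already been verified and models the patch as whole-section replacement. `replace_sections`, `remove_sections`, and `ResyncRequired` are hypothetical names, not part of the envelope defined above.

```python
class ResyncRequired(Exception):
    """Raised when an increment cannot be applied and a full snapshot is needed."""


def apply_increment(local_version: int, local_content: dict, update: dict):
    """Apply a verified increment, or demand a full resync on any version gap."""
    base = update["base_config_version"]
    target = update["next_config_version"]
    if base != local_version:
        raise ResyncRequired(
            f"increment is based on version {base}, local version is {local_version}"
        )
    if target <= base:
        raise ValueError("next_config_version must be newer than base_config_version")

    # Build the new state on a copy so a failure never leaves a half-applied update.
    new_content = dict(local_content)
    new_content.update(update.get("replace_sections", {}))
    for name in update.get("remove_sections", ()):
        new_content.pop(name, None)
    return target, new_content


local_version = 7
local_content = {"peers": ["node-b"], "qos_policy": {"class": "default"}}
update = {
    "base_config_version": 7,
    "next_config_version": 8,
    "replace_sections": {"peers": ["node-b", "node-c"]},
    "remove_sections": ["qos_policy"],
}

new_version, new_content = apply_increment(local_version, local_content, update)
```

An increment based on any other version raises `ResyncRequired`, which maps to the "version gaps require full resync" rule rather than to a best-effort merge.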
## 8. Trust Roots and Signing Key Rotation

The node-agent must know which config signing keys are trusted for each cluster.

Trust material may come from:

- enrollment response
- trust bundle snapshot
- manually installed platform root for bootstrap
- signed key rotation update

Signing key rotation rules:

1. New key is introduced in a signed trust bundle.
2. Node verifies the new key through existing trust.
3. Snapshots may be dual-signed during transition.
4. Old key is retired only after policy-defined rollout.
5. Compromised key is revoked through signed revocation metadata or emergency
   recovery flow.

A node must reject snapshots signed by unknown, expired, revoked, or
cluster-mismatched keys.

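
The rejection rule for signer keys can be expressed as a small lookup. This is a sketch, not a trust-store design: `TrustedKey`, its fields, and the reason strings are hypothetical, and treating a dual-signed snapshot as acceptable when at least one signer is trusted is an assumption about rotation policy that the text above does not fix. Cryptographic verification of the signatures themselves is out of scope here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class TrustedKey:
    key_id: str
    cluster_id: str
    not_after: datetime
    revoked: bool = False


def key_rejection_reason(keys: dict, key_id: str, cluster_id: str,
                         now: datetime) -> Optional[str]:
    """Return None if the key may sign for this cluster, else why it may not."""
    key = keys.get(key_id)
    if key is None:
        return "unknown_key"
    if key.revoked:
        return "revoked_key"
    if key.cluster_id != cluster_id:
        return "cluster_mismatch"
    if now >= key.not_after:
        return "expired_key"
    return None


def any_signer_trusted(keys: dict, signer_key_ids, cluster_id: str,
                       now: datetime) -> bool:
    """Assumed dual-signing policy: one currently trusted signer is enough."""
    return any(
        key_rejection_reason(keys, key_id, cluster_id, now) is None
        for key_id in signer_key_ids
    )


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
keys = {
    "cfg-key-0": TrustedKey("cfg-key-0", "cluster-1", now + timedelta(days=30), revoked=True),
    "cfg-key-1": TrustedKey("cfg-key-1", "cluster-1", now + timedelta(days=30)),
    "cfg-key-2": TrustedKey("cfg-key-2", "cluster-1", now + timedelta(days=365)),
}
```

During a rotation from `cfg-key-1` to `cfg-key-2`, both keys pass the check. Once `cfg-key-1` passes its `not_after` time or is marked revoked, only snapshots carrying a `cfg-key-2` signature remain acceptable.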
## 9. Verification Algorithm

Before applying a snapshot, node-agent verifies:

1. Envelope schema is supported.
2. `cluster_id` matches local cluster membership.
3. `subject_node_id` matches the local node, unless the scope explicitly allows
   shared role data.
4. Signature key is trusted for the cluster and snapshot scope.
5. Signature verifies over canonical bytes.
6. `content_hash` matches content.
7. `valid_from`, `expires_at`, and `refresh_after` are acceptable.
8. `authority_epoch` is not stale.
9. `config_version` is newer than the local accepted version or allowed by a
   signed recovery policy.
10. Scope does not grant data beyond node role and organization authorization.
11. Snapshot content passes structural validation.
12. Snapshot does not contain forbidden raw secrets.

Failure must leave the previous valid snapshot active if policy allows it.

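
A minimal sketch of checks 1 through 9 as an ordered pipeline that stops at the first failure. Several things here are assumptions made only to keep the example self-contained: HMAC-SHA256 with a shared key stands in for a real asymmetric signature such as Ed25519, sorted-key JSON stands in for the canonical form, the `local` dictionary and the reason strings are hypothetical, and checks 10 through 12 (scope, structure, and secret validation) plus the shared-role and recovery-policy exceptions are omitted.

```python
import hashlib
import hmac
import json
from datetime import datetime, timedelta, timezone

SUPPORTED_SCHEMAS = {1}


def canonical(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign(key: bytes, envelope: dict) -> str:
    """Stand-in signature (HMAC) over every envelope field except `signature`."""
    unsigned = {k: v for k, v in envelope.items() if k != "signature"}
    return hmac.new(key, canonical(unsigned), hashlib.sha256).hexdigest()


def verify(envelope, content, local, trusted_keys, now):
    """Return None if the snapshot is acceptable, else the first rejection reason."""
    if envelope["schema_version"] not in SUPPORTED_SCHEMAS:
        return "unsupported_schema"
    if envelope["cluster_id"] != local["cluster_id"]:
        return "wrong_cluster"
    if envelope["subject_node_id"] != local["node_id"]:
        return "wrong_node"
    key = trusted_keys.get(envelope["signer_key_id"])
    if key is None:
        return "untrusted_key"
    if not hmac.compare_digest(sign(key, envelope), envelope["signature"]):
        return "bad_signature"
    if hashlib.sha256(canonical(content)).hexdigest() != envelope["content_hash"]:
        return "content_hash_mismatch"
    valid_from = datetime.fromisoformat(envelope["valid_from"])
    expires_at = datetime.fromisoformat(envelope["expires_at"])
    if not (valid_from <= now < expires_at):
        return "outside_validity_window"
    if envelope["authority_epoch"] < local["authority_epoch"]:
        return "stale_epoch"
    if envelope["config_version"] <= local["config_version"]:
        return "not_newer"
    return None


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
trusted_keys = {"cfg-key-1": b"demo-shared-key"}  # demo only, not a real trust store
content = {"peers": ["node-b"]}
envelope = {
    "schema_version": 1,
    "cluster_id": "cluster-1",
    "subject_node_id": "node-a",
    "signer_key_id": "cfg-key-1",
    "config_version": 8,
    "authority_epoch": 3,
    "valid_from": (now - timedelta(hours=1)).isoformat(),
    "expires_at": (now + timedelta(hours=1)).isoformat(),
    "content_hash": hashlib.sha256(canonical(content)).hexdigest(),
}
envelope["signature"] = sign(trusted_keys["cfg-key-1"], envelope)
local = {"cluster_id": "cluster-1", "node_id": "node-a",
         "authority_epoch": 3, "config_version": 7}

result = verify(envelope, content, local, trusted_keys, now)
```

Returning a reason instead of a bare boolean gives the node-agent something to store as the rejection reason for the last failed update. The caller keeps the previously active snapshot whenever the result is not `None`.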
## 10. Degraded-Mode Use

Snapshots define what the node may do when disconnected from the backend or
config/storage services.

Allowed when policy permits:

- continue already-running assigned services
- preserve existing authorized routes for a bounded TTL
- reconnect to active/warm/bootstrap peers
- use local trust bundle to validate peers
- use storage/config endpoints from the last valid snapshot
- report degraded status when connectivity returns

Forbidden in degraded mode:

- approve node enrollment
- issue certificates
- assign roles
- change cluster policy
- change organization policy
- rotate trust roots
- promote partitions automatically
- fetch unrelated secrets
- create new service authority outside the snapshot scope

Degraded mode must be bounded by:

- snapshot expiry
- route/session TTL
- degraded-mode policy
- partition/authority state

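
The allow and deny lists above can be read as a default-deny gate. The sketch below is one possible shape, with hypothetical action names. It models only two of the four bounds (snapshot expiry and the degraded-mode policy) and leaves route/session TTL and partition/authority state out.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical action names mirroring the "forbidden in degraded mode" list.
ALWAYS_FORBIDDEN = frozenset({
    "approve_node_enrollment",
    "issue_certificate",
    "assign_role",
    "change_cluster_policy",
    "change_organization_policy",
    "rotate_trust_roots",
    "promote_partition",
    "fetch_unrelated_secret",
    "create_service_authority",
})


def degraded_action_allowed(action: str, policy_allowed, snapshot_expires_at: datetime,
                            now: datetime) -> bool:
    """Default deny: an action needs an unexpired snapshot and explicit policy."""
    if action in ALWAYS_FORBIDDEN:
        return False  # no policy can enable these while degraded
    if now >= snapshot_expires_at:
        return False  # expiry is a correctness boundary, not a cache hint
    return action in policy_allowed


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
expires_at = now + timedelta(hours=6)
# Hypothetical degraded-mode policy taken from the last valid snapshot.
policy_allowed = {"continue_assigned_services", "reconnect_bootstrap_peers"}
```

Checking the forbidden set first means a misconfigured policy that lists, say, `assign_role` still cannot grant it in degraded mode.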
## 11. Revocation and Expiry

Snapshots expire. Expiry is a correctness boundary, not just a cache hint.

Revocation sources:

- signed trust bundle update
- signed revocation list
- control-plane status after reconnect
- emergency recovery trust path

Revocation applies to:

- signing keys
- node identities
- role assignments
- service assignments
- peer eligibility
- storage/config endpoints
- degraded-mode permissions

If revocation state is unavailable, the node may only continue within the last
valid degraded-mode policy and must not perform high-risk actions.

## 12. Rollback and Recovery

Normal rollback to an older config is forbidden.

Allowed recovery cases:

- local snapshot file corruption
- interrupted incremental update
- bad non-authoritative cache state
- version gap requiring full resync

Recovery order:

1. keep last verified active snapshot
2. reject bad update
3. request full snapshot from config/storage service
4. use bootstrap peers if refresh endpoints fail
5. reconnect to control plane when available
6. enter degraded mode only if policy allows

Rollback to an older signed snapshot requires explicit signed recovery policy
with a newer `authority_epoch` or equivalent anti-rollback guard.

## 13. Node-Agent Local Expectations

Node-agent must store:

- active snapshot per scope
- previous verified snapshot for recovery
- pending downloaded snapshot/update before activation
- verification metadata
- last applied versions
- signer key ids
- expiry/refresh deadlines
- rejection reason for last failed update

Activation should be atomic from the node-agent perspective:

- download to pending
- verify
- write to durable local store
- swap active pointer
- notify supervised services of relevant changes
- report applied version in heartbeat/status

C12 will define the local store layout and durability details.

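
One way to picture the pending, verify, and swap steps is with plain files and an atomic rename. This is only a sketch: the file naming scheme and the `activate` helper are hypothetical, `verify` is an injected callable, notification and heartbeat reporting are left out, and the real layout and durability guarantees are deferred to C12.

```python
import json
import os
import shutil
import tempfile
from pathlib import Path


def activate(store: Path, scope: str, snapshot: dict, verify) -> bool:
    """Stage a snapshot, verify it, then swap it in with an atomic rename."""
    pending = store / f"{scope}.pending.json"
    active = store / f"{scope}.active.json"
    previous = store / f"{scope}.previous.json"

    # 1. Download to pending (here: persist what was received) and flush to disk.
    with open(pending, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, sort_keys=True)
        fh.flush()
        os.fsync(fh.fileno())

    # 2. Verify before anything becomes active; a rejected update is discarded.
    if not verify(snapshot):
        pending.unlink()
        return False

    # 3. Keep a copy of the current active snapshot for recovery.
    if active.exists():
        shutil.copyfile(active, previous)

    # 4. Swap the active pointer. os.replace is atomic on POSIX when both
    #    paths are on the same filesystem, so readers see old or new, never partial.
    os.replace(pending, active)
    return True


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    first = activate(store, "route_policy", {"config_version": 1}, verify=lambda s: True)
    rejected = activate(store, "route_policy", {"config_version": 2}, verify=lambda s: False)
    active_version = json.loads(
        (store / "route_policy.active.json").read_text(encoding="utf-8")
    )["config_version"]
```

Copying the active file to `previous` rather than renaming it means there is never a moment without an active snapshot, and a rejected update leaves the active file untouched.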
## 14. Distribution Relationship

Snapshot production flow:

1. PostgreSQL authoritative state changes.
2. Control-plane snapshot compiler builds scoped view.
3. Compiler validates scope and removes forbidden data.
4. Snapshot is signed by config signing key.
5. Snapshot or increment is published to Fabric Storage / Config Storage.
6. Node-agent refreshes by version.
7. Node-agent verifies and applies locally.

Node-origin reports such as health, heartbeat, or observed latency are not
authoritative config writes. They may influence future compiled snapshots only
after the control plane accepts them according to policy.

## 15. Validation and Future Tests

Future implementation tests must prove:

- valid snapshot applies
- invalid signature rejected
- wrong cluster rejected
- wrong node rejected
- expired snapshot rejected for new authority
- rollback rejected
- version gap triggers full resync
- forbidden raw secret content rejected
- unrelated organization data rejected
- wrong role scope rejected
- incremental update applies only to matching base version
- revoked signer rejected
- degraded-mode forbidden actions are blocked

## 16. C12 Preparation

C12 must define how node-agent stores and protects:

- snapshot files
- identity material references
- trust bundle cache
- peer cache
- route cache
- service assignment cache
- health/degraded state
- update metadata

C12 must not turn local state into durable authority. It must preserve the C11
rule that snapshots are verified scoped copies of PostgreSQL-derived state.

## 17. Result / Decision

Stage C11 defines signed scoped cluster snapshots as the required bridge between
the authoritative control plane and node-local runtime operation.

Decisions:

- snapshots are signed, versioned, scoped, bounded, and expiring
- snapshots are generated from PostgreSQL source-of-truth state
- snapshots may be distributed by Fabric Storage / Config Storage
- node-agent verifies before applying
- node-agent may operate from snapshots only within policy
- snapshots must not contain raw secrets or unrelated organization data
- incremental updates require exact base-version matching
- rollback requires explicit signed recovery policy
- C12 must define local storage without changing these authority boundaries

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C11.