Initial project snapshot

# Fabric Peer Directory and Cache Model

Status: Stage C14 result. Documentation and architecture only.

This document defines the Fabric peer directory and node-local peer cache model.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The peer directory tells a node which peers it may know about and potentially
connect to. The node-local peer cache stores scoped peer data plus runtime
observations for fast recovery and score-based peer selection.

The model must avoid:

- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision

## 2. Peer Knowledge Classes

Each node maintains three peer classes:

- active peers
- warm candidate peers
- cold/bootstrap peers

Active peers:

- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set

Warm candidate peers:

- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers

Cold/bootstrap peers:

- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from signed snapshot, local cache, storage/config service, or
  admin-defined seed nodes

Recommended active peer counts:

- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3

These are policy defaults, not hardcoded limits.
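
A minimal sketch of how the three classes and the recommended counts could be represented as overridable policy defaults. The names `PeerClass`, `ACTIVE_PEER_DEFAULTS`, and `active_peer_limit` are hypothetical and exist only for illustration; C14 itself defines no code.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class PeerClass(Enum):
    """The three peer knowledge classes (hypothetical enum)."""
    ACTIVE = "active"
    WARM = "warm"
    COLD = "cold"  # cold/bootstrap


# (min, max) recommended active peer counts per role, copied from the list above.
ACTIVE_PEER_DEFAULTS: Dict[str, Tuple[int, int]] = {
    "normal": (3, 5),
    "relay_core": (8, 20),
    "thin_mobile": (1, 3),
}


def active_peer_limit(
    role: str, overrides: Optional[Dict[str, Tuple[int, int]]] = None
) -> Tuple[int, int]:
    """Return (min, max) active peers for a role; a policy override wins over the default."""
    if overrides and role in overrides:
        return overrides[role]
    return ACTIVE_PEER_DEFAULTS[role]
```

Keeping the defaults in a table that policy can override reflects the statement that the counts are defaults rather than hardcoded limits.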

## 3. Peer Directory Record

A signed peer directory entry may contain:

- `node_id`
- `cluster_id`
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- `last_seen_config_version`

The peer directory is scoped. Ordinary nodes must not receive a full cluster
peer directory unless their role explicitly requires it.
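
The scoping rule can be illustrated with a small filter. This is a sketch under assumptions: `DirectoryEntry` carries only three of the listed fields, `policy_scope` is modeled as a set of scope tags, and `scoped_view` and `full_topology_role` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class DirectoryEntry:
    """Reduced directory entry; real entries would carry the full field list."""
    node_id: str
    cluster_id: str
    policy_scope: FrozenSet[str]  # assumption: scope expressed as a set of tags


def scoped_view(
    entries: Iterable[DirectoryEntry],
    cluster_id: str,
    node_scopes: FrozenSet[str],
    full_topology_role: bool = False,
) -> List[DirectoryEntry]:
    """Return only the entries the requesting node is allowed to see."""
    same_cluster = [e for e in entries if e.cluster_id == cluster_id]
    if full_topology_role:
        # Only roles that explicitly require it get the whole cluster directory.
        return same_cluster
    # Ordinary nodes see entries whose scope overlaps their own.
    return [e for e in same_cluster if e.policy_scope & node_scopes]
```

In this sketch the filtering happens where the directory is produced, so an ordinary node never holds entries outside its scope.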

## 4. Endpoint Candidate Model

Endpoint candidates describe possible ways to reach a node.

Candidate fields:

- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp

Transport types may include future values such as:

- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved

This model is descriptive only. C14 does not implement new transports.
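
To make the candidate model concrete, the sketch below orders the candidates a node can actually use. It assumes a reduced `EndpointCandidate` type, string transport identifiers such as `"tcp_tls"`, and that a lower `priority` value means more preferred; none of these conventions are fixed by this document.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class EndpointCandidate:
    """Subset of the candidate fields listed above (hypothetical type)."""
    endpoint_id: str
    transport: str
    host: str
    port: int
    priority: int  # assumption: lower value = more preferred


def usable_candidates(
    candidates: Iterable[EndpointCandidate], supported_transports: Set[str]
) -> List[EndpointCandidate]:
    """Drop candidates with unsupported transports, then order by priority."""
    usable = [c for c in candidates if c.transport in supported_transports]
    return sorted(usable, key=lambda c: c.priority)
```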

## 5. Node-Local Peer Cache

The node-local peer cache contains signed directory data plus runtime
observations.

Directory-derived fields:

- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version

Runtime observation fields:

- `last_success_at`
- `last_failure_at`
- `last_latency_ms`
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable

Runtime observations are hints. They are not durable authority.
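
A cache entry that keeps directory-derived fields next to runtime observations might look like the following. This is a sketch only: `PeerCacheEntry`, the method names, and the 300-second failure window are assumptions, and only a few of the listed fields are shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeerCacheEntry:
    # Directory-derived fields (authoritative, from the signed directory).
    node_id: str
    cluster_id: str
    config_version: int
    # Runtime observations (hints only, never durable authority).
    state: str = "warm"  # active / warm / cold
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    recent_failures: List[float] = field(default_factory=list)

    def record_success(self, now: float, latency_ms: float) -> None:
        """Store the time and latency of a successful contact."""
        self.last_success_at = now
        self.last_latency_ms = latency_ms

    def record_failure(self, now: float, window_s: float = 300.0) -> None:
        """Store a failure and keep only failures inside the sliding window."""
        self.last_failure_at = now
        self.recent_failures = [
            t for t in self.recent_failures if now - t <= window_s
        ]
        self.recent_failures.append(now)
```

Separating the two field groups makes it straightforward to discard observations without touching signed directory data.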

## 6. Refresh Cadence

Recommended cadence:

- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or
  policy-triggered refresh

Cadence may vary by role:

- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers
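
One way to express the recommended ranges as configuration is a table of (min, max) intervals plus a per-role knob. The `aggressiveness` parameter (1.0 probes as often as the range allows, 0.0 as rarely) and the names `CADENCE_SECONDS` and `pick_interval` are assumptions made for this sketch; the document only states the ranges.

```python
from typing import Dict, Tuple

# (min, max) interval in seconds for each periodic task, from the list above.
CADENCE_SECONDS: Dict[str, Tuple[int, int]] = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def pick_interval(task: str, aggressiveness: float) -> float:
    """Interpolate inside the recommended range; input is clamped to [0, 1]."""
    low, high = CADENCE_SECONDS[task]
    a = min(1.0, max(0.0, aggressiveness))
    return high - (high - low) * a
```

Under this scheme a thin/mobile node would simply be configured with a lower `aggressiveness` than a relay/core node.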

## 7. Peer Selection Scoring

Selection is score-based, not latency-only.

Hard checks first:

- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status

Soft score inputs:

- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference

No peer should be selected if it fails hard policy checks, even if latency is
excellent.
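
The two-phase rule (hard checks as a filter, soft inputs as a ranking) can be sketched as below. The weights in `soft_score` are arbitrary illustration values, not recommended ones, and `Candidate`, `passes_hard_checks`, and `select_peer` are hypothetical names; only three of the soft inputs are used.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    node_id: str
    passes_hard_checks: bool  # combined result of all hard policy checks
    latency_ms: float
    packet_loss: float  # fraction in [0, 1]
    reliability: float  # fraction in [0, 1]


def soft_score(c: Candidate) -> float:
    """Higher is better. Weights are illustrative assumptions only."""
    return c.reliability * 100.0 - c.latency_ms * 0.5 - c.packet_loss * 200.0


def select_peer(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Filter on hard checks first, then rank the survivors by soft score."""
    eligible = [c for c in candidates if c.passes_hard_checks]
    return max(eligible, key=soft_score, default=None)
```

Because ineligible peers are removed before scoring, a peer with excellent latency that fails a hard check can never win the ranking.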

## 8. Recovery Order

If active peers fail, recovery order is:

1. retry active peers with bounded backoff
2. promote warm candidates
3. try cold/bootstrap peers
4. query authorized storage/config discovery endpoint
5. use last signed snapshot for degraded reconnect if policy allows
6. reconnect to control plane when available

Recovery must not authorize cluster mutation or high-risk actions.
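
The ordered fallback above can be sketched as a loop over named steps. Each step is supplied by the caller as a callable that returns `True` on success; the step names, `recover`, and the `snapshot_allowed` flag are assumptions for illustration, and bounded backoff is left to the callables.

```python
from typing import Callable, Dict, Optional

# Step names mirror the numbered list above (hypothetical identifiers).
RECOVERY_STEPS = [
    "retry_active",
    "promote_warm",
    "try_bootstrap",
    "query_discovery",
    "use_signed_snapshot",
    "reconnect_control_plane",
]


def recover(
    attempts: Dict[str, Callable[[], bool]], snapshot_allowed: bool
) -> Optional[str]:
    """Run steps in order and return the first one that succeeds, else None."""
    for step in RECOVERY_STEPS:
        if step == "use_signed_snapshot" and not snapshot_allowed:
            continue  # policy does not permit degraded reconnect from a snapshot
        attempt = attempts.get(step)
        if attempt is not None and attempt():
            return step
    return None
```

The function only reports which step restored connectivity. It grants no authority, in line with the rule that recovery must not authorize cluster mutation or high-risk actions.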

## 9. Channel-Aware Peer Preference

Peer choice depends on channel class.

Input/control:

- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic

Render/video:

- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth

File transfer:

- throughput and reliability
- lower priority than input/control

Clipboard/control:

- reliable bounded path
- low volume

Telemetry:

- low priority
- lossy/sampled allowed

VPN/IP tunnel (future):

- adaptive QoS
- bulk traffic must not starve interactive sessions
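
Channel-aware preference can be modeled as a per-channel weight vector applied to normalized peer metrics. This is a sketch under assumptions: every metric is pre-normalized to [0, 1] with 1 being best, the weights are invented for illustration, and `CHANNEL_WEIGHTS`, `channel_score`, and `best_peer` are hypothetical names. Only three channel classes are shown.

```python
from typing import Dict

# Illustrative weights per channel class; each row sums to 1.0.
CHANNEL_WEIGHTS: Dict[str, Dict[str, float]] = {
    "input_control": {"latency": 0.4, "jitter": 0.4, "reliability": 0.2, "bandwidth": 0.0},
    "render_video": {"latency": 0.1, "jitter": 0.4, "reliability": 0.1, "bandwidth": 0.4},
    "file_transfer": {"latency": 0.0, "jitter": 0.0, "reliability": 0.5, "bandwidth": 0.5},
}


def channel_score(channel: str, metrics: Dict[str, float]) -> float:
    """Weighted sum of normalized metrics (1.0 = best) for a channel class."""
    weights = CHANNEL_WEIGHTS[channel]
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items())


def best_peer(channel: str, peers: Dict[str, Dict[str, float]]) -> str:
    """Return the id of the peer with the highest score for this channel."""
    return max(peers, key=lambda node_id: channel_score(channel, peers[node_id]))
```

With weights like these, the same two peers can rank differently for input/control and for file transfer, which is the behavior the channel classes describe.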

## 10. Full-Mesh Prevention

Nodes must not attempt to connect to every known node.

Limits:

- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology

Full topology access is reserved only for roles that require it, such as
platform control/admin views or selected core/route-analysis components.

## 11. Security Boundaries

Peer cache must enforce:

- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope

A compromised ordinary node should not learn full cluster topology.

Peer cache data must not include:

- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized

## 12. Multi-Cluster Peer Isolation

Multi-cluster node membership uses separate peer caches per cluster.

Per-cluster separation:

- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history

Cross-cluster peer discovery requires explicit trust and policy. Clusters do
not form a single mesh by default.
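
Per-cluster separation can be sketched as a container that keys every lookup by cluster id, so a record stored for one cluster is never returned for another. `ClusterPeerCaches` and its methods are hypothetical, and peer records are plain dictionaries here for brevity.

```python
from typing import Dict, List, Optional


class ClusterPeerCaches:
    """One independent peer cache per cluster; no lookup spans clusters."""

    def __init__(self) -> None:
        self._caches: Dict[str, Dict[str, dict]] = {}

    def put(self, cluster_id: str, node_id: str, record: dict) -> None:
        """Store a peer record in the cache of exactly one cluster."""
        self._caches.setdefault(cluster_id, {})[node_id] = record

    def get(self, cluster_id: str, node_id: str) -> Optional[dict]:
        """Return the record for this cluster only, or None if absent."""
        return self._caches.get(cluster_id, {}).get(node_id)

    def peers(self, cluster_id: str) -> List[str]:
        """List known peer ids for a single cluster, sorted for stable output."""
        return sorted(self._caches.get(cluster_id, {}))
```

The same `node_id` may appear in two clusters with different state, and each copy is tracked independently.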

## 13. Storage / Snapshot Relationship

Peer directory data is distributed through signed snapshots or Fabric Storage /
Config Storage artifacts.

Rules:

- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current
  directory version
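
The version and integrity rules above can be sketched as a single decision function. Two simplifications are assumed: a SHA-256 digest comparison stands in for real signature verification, and a version that is not newer than the current one is ignored, which the document does not specify. `check_directory_update` and its return labels are hypothetical.

```python
import hashlib


def check_directory_update(
    current_version: int, new_version: int, payload: bytes, expected_sha256: str
) -> str:
    """Decide what to do with an incoming peer directory payload."""
    # Hash check as a stand-in for signature verification (assumption).
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        return "reject"
    if new_version <= current_version:
        return "ignore"  # assumption: stale or duplicate versions are dropped
    if new_version > current_version + 1:
        return "full_resync"  # version gap
    return "apply"
```

Checking integrity before looking at the version means a tampered payload cannot trigger a resync.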

## 14. Service Adapter Boundary

Service Adapters may request:

- destination node
- resource target
- egress node
- egress pool
- channel class

Service Adapters must not:

- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy

The Fabric Routing Engine owns those decisions.
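
The adapter boundary can be sketched as an allow-list on request fields: anything beyond the five permitted inputs is refused before it reaches routing logic. The field names and `validate_adapter_request` are assumptions for illustration, not a defined API.

```python
from typing import Any, Dict

# The only inputs a Service Adapter may supply, per the list above.
ALLOWED_REQUEST_FIELDS = frozenset(
    {"destination_node", "resource_target", "egress_node", "egress_pool", "channel_class"}
)


def validate_adapter_request(request: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request, or raise if it carries disallowed fields."""
    extra = set(request) - ALLOWED_REQUEST_FIELDS
    if extra:
        raise ValueError(f"fields not allowed for service adapters: {sorted(extra)}")
    return dict(request)
```

An allow-list fails closed: a new field is rejected until it is explicitly added to the permitted set.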

## 15. Observability

Node-agent should report safe peer/cache metrics:

- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type

Reports must not expose full topology to organizations.

## 16. Future Validation Tests

Future implementation tests must prove:

- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to ordinary nodes
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection

## 17. C15 Preparation

C15 must define the Fabric Routing Engine skeleton boundary.

The routing engine will consume:

- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history

C15 must not carry production mesh traffic. It should define route request and
route result boundaries before runtime routing exists.

## 18. Result / Decision

Stage C14 defines scoped peer discovery and peer cache behavior.

Decisions:

- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C14.