# Fabric Peer Directory and Cache Model

Status: Stage C14 result. Documentation and architecture only.

This document defines the Fabric peer directory and node-local peer cache model.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The peer directory tells a node which peers it may know about and potentially
connect to. The node-local peer cache stores scoped peer data plus runtime
observations for fast recovery and score-based peer selection.

The model must avoid:

- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision

## 2. Peer Knowledge Classes

Each node maintains three peer classes:

- active peers
- warm candidate peers
- cold/bootstrap peers

Active peers:

- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set

Warm candidate peers:

- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers

Cold/bootstrap peers:

- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from signed snapshot, local cache, storage/config service, or
  admin-defined seed nodes

Recommended active peer counts:

- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3

These are policy defaults, not hardcoded limits.

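As a minimal sketch, the recommended counts could be expressed as overridable policy defaults. The names `ActivePeerLimits`, `DEFAULT_ACTIVE_LIMITS`, `limits_for_role`, and the role keys are hypothetical and not part of C14; the numbers are the recommended ranges listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivePeerLimits:
    """Inclusive bounds on the size of a node's active peer set."""
    minimum: int
    maximum: int


# Defaults mirror the recommended counts; role keys are illustrative.
DEFAULT_ACTIVE_LIMITS = {
    "normal": ActivePeerLimits(3, 5),
    "relay_core": ActivePeerLimits(8, 20),
    "thin_mobile": ActivePeerLimits(1, 3),
}


def limits_for_role(role, overrides=None):
    """Return the limits for a role, letting policy overrides win over defaults."""
    merged = {**DEFAULT_ACTIVE_LIMITS, **(overrides or {})}
    return merged[role]


def can_add_active_peer(role, current_active, overrides=None):
    """True if one more active peer still fits under the role's maximum."""
    return current_active < limits_for_role(role, overrides).maximum
```

Keeping the defaults in a table that policy can override reflects the statement that these counts are defaults rather than hardcoded limits.
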
## 3. Peer Directory Record

A signed peer directory entry may contain:

- `node_id`
- `cluster_id`
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- `last_seen_config_version`

The peer directory is scoped. Ordinary nodes must not receive a full cluster
peer directory unless their role explicitly requires it.

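The following sketch shows one possible shape for a directory entry and for the scoping step. It covers only a subset of the fields above. Modelling policy scope as a set of tags, and the names `PeerDirectoryEntry` and `scoped_directory`, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerDirectoryEntry:
    node_id: str
    cluster_id: str
    roles: tuple = ()
    cert_fingerprint: str = ""
    policy_scope: frozenset = frozenset()  # assumption: scope is a set of tags
    last_seen_config_version: int = 0


def scoped_directory(entries, cluster_id, requester_scopes):
    """Keep only entries in the requester's cluster that share a policy scope."""
    wanted = frozenset(requester_scopes)
    return [
        e for e in entries
        if e.cluster_id == cluster_id and e.policy_scope & wanted
    ]
```

A node asking with scope `{"org-a"}` in cluster `c1` would receive only the entries that match both, never the full list.
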
## 4. Endpoint Candidate Model

Endpoint candidates describe possible ways to reach a node.

Candidate fields:

- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp

Transport types may include future values such as:

- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved

This model is descriptive only. C14 does not implement new transports.

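To make the candidate model concrete, here is a small sketch of a candidate record and of ordering candidates for a connection attempt. The transport strings, the `EndpointCandidate` and `usable_candidates` names, and the convention that a lower `priority` number is preferred are all assumptions; C14 does not fix any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EndpointCandidate:
    endpoint_id: str
    transport: str               # e.g. "tcp_tls", "wss", "relay"; illustrative values
    host: str
    port: int
    priority: int = 100          # assumption: lower number means more preferred
    nat_type: Optional[str] = None


def usable_candidates(candidates, supported_transports):
    """Drop candidates the local node cannot speak, then order by priority."""
    usable = [c for c in candidates if c.transport in supported_transports]
    return sorted(usable, key=lambda c: c.priority)
```
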
## 5. Node-Local Peer Cache

The node-local peer cache contains signed directory data plus runtime
observations.

Directory-derived fields:

- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version

Runtime observation fields:

- `last_success_at`
- `last_failure_at`
- `last_latency_ms`
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable

Runtime observations are hints. They are not durable authority.

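A cache entry can keep the two kinds of data apart so that observations never overwrite signed directory fields. The sketch below is one way to do that; the class names, the reduced field set, and the choice to reset the failure counter on success are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuntimeObservations:
    """Locally observed hints; never treated as authoritative."""
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    recent_failures: int = 0
    state: str = "cold"  # "active", "warm", or "cold"


@dataclass
class PeerCacheEntry:
    # Directory-derived fields (subset).
    node_id: str
    cluster_id: str
    config_version: int
    # Runtime observations live in their own object.
    observations: RuntimeObservations = field(default_factory=RuntimeObservations)

    def record_success(self, now, latency_ms):
        self.observations.last_success_at = now
        self.observations.last_latency_ms = latency_ms
        self.observations.recent_failures = 0  # assumption: success clears the streak

    def record_failure(self, now):
        self.observations.last_failure_at = now
        self.observations.recent_failures += 1
```
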
## 6. Refresh Cadence

Recommended cadence:

- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or
  policy-triggered refresh

Cadence may vary by role:

- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers

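One way to use the recommended ranges is to draw each delay at random from within the range, which keeps nodes from probing in lockstep. This is only a sketch: the task keys, the `next_interval` helper, and the uniform draw are assumptions, and the ranges are the recommended values above converted to seconds.

```python
import random

# (min_seconds, max_seconds) taken from the recommended cadence.
CADENCE_SECONDS = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def next_interval(task, rng=random):
    """Pick a delay uniformly inside the task's recommended range."""
    low, high = CADENCE_SECONDS[task]
    return rng.uniform(low, high)
```

A role-specific policy could replace the table entries, for example widening the heartbeat range for thin/mobile nodes.
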
## 7. Peer Selection Scoring

Selection is score-based, not latency-only.

Hard checks first:

- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status

Soft score inputs:

- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference

No peer should be selected if it fails hard policy checks, even if latency is
excellent.

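The two-phase rule above can be sketched as a filter followed by a ranking. The weights and the latency normalization below are placeholder values chosen for the example, not tuned or specified by C14, and only three of the soft inputs are used. `Candidate`, `soft_score`, and `select_peer` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node_id: str
    passes_hard_checks: bool   # combined result of the hard checks
    latency_ms: float
    packet_loss: float         # fraction in [0, 1]
    reliability: float         # fraction in [0, 1]


def soft_score(c):
    """Higher is better. Weights are illustrative placeholders."""
    latency_term = 1.0 / (1.0 + c.latency_ms / 100.0)
    return 0.4 * latency_term + 0.3 * (1.0 - c.packet_loss) + 0.3 * c.reliability


def select_peer(candidates):
    """Filter on hard checks first, then pick the best soft score (or None)."""
    eligible = [c for c in candidates if c.passes_hard_checks]
    return max(eligible, key=soft_score, default=None)
```

Because the filter runs before scoring, a peer with excellent latency that fails a hard check never reaches the ranking step.
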
## 8. Recovery Order

If active peers fail, recovery order is:

1. retry active peers with bounded backoff
2. promote warm candidates
3. try cold/bootstrap peers
4. query authorized storage/config discovery endpoint
5. use last signed snapshot for degraded reconnect if policy allows
6. reconnect to control plane when available

Recovery must not authorize cluster mutation or high-risk actions.

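The ordering above can be sketched as a fixed stage list walked from first to last. This is a simplification under stated assumptions: the stage names and the `recover` helper are hypothetical, each stage is reduced to a callable that reports success, backoff is left to the `retry_active` callable, and control-plane reconnection is treated as just another stage.

```python
RECOVERY_STAGES = (
    "retry_active",
    "promote_warm",
    "try_bootstrap",
    "query_discovery",
    "use_last_snapshot",
    "reconnect_control_plane",
)


def recover(attempts, snapshot_allowed=True):
    """Run stages in order; return the first stage that succeeds, else None.

    `attempts` maps stage name -> zero-argument callable returning True on success.
    """
    for stage in RECOVERY_STAGES:
        if stage == "use_last_snapshot" and not snapshot_allowed:
            continue  # degraded reconnect is gated by policy
        attempt = attempts.get(stage)
        if attempt is not None and attempt():
            return stage
    return None
```

The function only reports which stage restored connectivity; it grants no authority, in line with the rule that recovery must not authorize cluster mutation.
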
## 9. Channel-Aware Peer Preference

Peer choice depends on channel class.

Input/control:

- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic

Render/video:

- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth

File transfer:

- throughput and reliability
- lower priority than input/control

Clipboard/control:

- reliable bounded path
- low volume

Telemetry:

- low priority
- lossy/sampled allowed

VPN/IP tunnel future:

- adaptive QoS
- bulk traffic must not starve interactive sessions

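One possible way to encode these preferences is a per-channel weight table applied to normalized peer metrics. The weights, channel keys, and metric names below are invented for illustration and cover only three of the channel classes; the sketch assumes every metric has already been normalized to [0, 1] with 1 being best.

```python
# Illustrative weights per channel class.
CHANNEL_WEIGHTS = {
    "input_control": {"latency": 0.5, "jitter": 0.3, "reliability": 0.2, "bandwidth": 0.0},
    "render_video": {"latency": 0.1, "jitter": 0.4, "reliability": 0.1, "bandwidth": 0.4},
    "file_transfer": {"latency": 0.0, "jitter": 0.0, "reliability": 0.5, "bandwidth": 0.5},
}


def channel_score(channel, metrics):
    """Weighted sum of normalized metrics for the given channel class."""
    weights = CHANNEL_WEIGHTS[channel]
    return sum(weights[name] * metrics[name] for name in weights)


def best_peer(channel, peers):
    """`peers` maps node id -> normalized metrics; returns the top-scoring node id."""
    return max(peers, key=lambda node_id: channel_score(channel, peers[node_id]))
```

With such a table the same pair of peers can rank differently for input/control and for file transfer, which is the behaviour this section asks for.
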
## 10. Full-Mesh Prevention

Nodes must not attempt to connect to every known node.

Limits:

- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology

Full topology access is reserved only for roles that require it, such as
platform control/admin views or selected core/route-analysis components.

## 11. Security Boundaries

Peer cache must enforce:

- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope

A compromised ordinary node should not learn full cluster topology.

Peer cache data must not include:

- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized

## 12. Multi-Cluster Peer Isolation

Multi-cluster node membership uses separate peer caches per cluster.

Per-cluster separation:

- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history

Cross-cluster peer discovery requires explicit trust and policy. Clusters do
not form a single mesh by default.

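Per-cluster separation can be sketched as a container keyed by cluster id in which lookups never fall through to another cluster. `ClusterPeerCaches` and its methods are hypothetical names, and the entry values are left as opaque objects.

```python
class ClusterPeerCaches:
    """Holds one independent peer cache per cluster id."""

    def __init__(self):
        self._caches = {}

    def cache_for(self, cluster_id):
        # Each cluster gets its own dict; nothing is shared between clusters.
        return self._caches.setdefault(cluster_id, {})

    def upsert(self, cluster_id, node_id, entry):
        self.cache_for(cluster_id)[node_id] = entry

    def lookup(self, cluster_id, node_id):
        """Return the entry for this cluster only, or None if absent."""
        return self._caches.get(cluster_id, {}).get(node_id)
```

A node id that exists in cluster `c1` is simply unknown when looked up under `c2`, so cross-cluster discovery has to go through an explicit, separately authorized path.
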
## 13. Storage / Snapshot Relationship

Peer directory data is distributed through signed snapshots or Fabric Storage /
Config Storage artifacts.

Rules:

- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current
  directory version

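The version and integrity rules can be sketched as a single decision function. This example makes several assumptions: versions are consecutive integers, a SHA-256 digest stands in for the integrity check, and real signature verification is not shown. The `apply_directory` name and its return strings are hypothetical.

```python
import hashlib


def apply_directory(local_version, snapshot_version, payload, expected_sha256):
    """Decide what to do with an incoming directory snapshot.

    Returns "reject", "ignore", "apply", or "full_resync".
    """
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        return "reject"          # hash mismatch rejects the directory
    if snapshot_version <= local_version:
        return "ignore"          # nothing newer to apply
    if snapshot_version > local_version + 1:
        return "full_resync"     # version gap detected
    return "apply"
```

Checking integrity before looking at the version means a tampered snapshot cannot trigger a resync on its own.
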
## 14. Service Adapter Boundary

Service Adapters may request:

- destination node
- resource target
- egress node
- egress pool
- channel class

Service Adapters must not:

- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy

The Fabric Routing Engine owns those decisions.

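One way to hold this boundary is a request type that has fields only for what an adapter may ask for, plus a parser that rejects anything else. The sketch below is an assumption about shape, not a defined API; `ServiceRouteRequest`, `parse_request`, and treating `channel_class` as the only required field are illustrative choices.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class ServiceRouteRequest:
    """Everything an adapter may ask for; there is no peer or route field."""
    channel_class: str
    destination_node: Optional[str] = None
    resource_target: Optional[str] = None
    egress_node: Optional[str] = None
    egress_pool: Optional[str] = None


ALLOWED_FIELDS = frozenset(f.name for f in fields(ServiceRouteRequest))


def parse_request(raw):
    """Build a request from a dict, rejecting any key outside the allowed set."""
    unknown = set(raw) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"fields not allowed for service adapters: {sorted(unknown)}")
    return ServiceRouteRequest(**raw)
```

A request carrying a key such as `peer_list` raises `ValueError` instead of being silently ignored, so attempts to reach past the boundary are visible.
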
## 15. Observability

Node-agent should report safe peer/cache metrics:

- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type

Reports must not expose full topology to organizations.

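A report built only from counts and averages avoids leaking node ids or endpoints. The sketch below covers a subset of the metrics above; the `summarize_peer_cache` name, the input shape, and the metric keys are assumptions.

```python
def summarize_peer_cache(peers, directory_version):
    """Aggregate counts and averages only; no node ids or endpoints are included.

    `peers` is an iterable of dicts with "state" and optional "latency_ms".
    """
    peers = list(peers)
    active_latencies = [
        p["latency_ms"] for p in peers
        if p["state"] == "active" and p.get("latency_ms") is not None
    ]
    return {
        "active_peer_count": sum(p["state"] == "active" for p in peers),
        "warm_peer_count": sum(p["state"] == "warm" for p in peers),
        "bootstrap_peer_count": sum(p["state"] == "cold" for p in peers),
        "peer_directory_version": directory_version,
        "avg_active_latency_ms": (
            sum(active_latencies) / len(active_latencies) if active_latencies else None
        ),
    }
```
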
## 16. Future Validation Tests

Future implementation tests must prove:

- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to ordinary node
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection

## 17. C15 Preparation

C15 must define the Fabric Routing Engine skeleton boundary.

The routing engine will consume:

- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history

C15 must not carry production mesh traffic. It should define route request and
route result boundaries before runtime routing exists.

## 18. Result / Decision

Stage C14 defines scoped peer discovery and peer cache behavior.

Decisions:

- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C14.