Fabric Peer Directory and Cache Model
Status: Stage C14 result. Documentation and architecture only.
This document defines the Fabric peer directory and node-local peer cache model. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.
1. Purpose
The peer directory tells a node which peers it may know about and potentially connect to. The node-local peer cache stores scoped peer data plus runtime observations for fast recovery and score-based peer selection.
The model must avoid:
- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision
2. Peer Knowledge Classes
Each node maintains three peer classes:
- active peers
- warm candidate peers
- cold/bootstrap peers
Active peers:
- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set
Warm candidate peers:
- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers
Cold/bootstrap peers:
- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from a signed snapshot, local cache, storage/config service, or admin-defined seed nodes
Recommended active peer counts:
- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3
These are policy defaults, not hardcoded limits.
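The recommended counts can be kept as overridable policy data rather than constants in node code. The sketch below is illustrative only: the role keys, `ActivePeerPolicy`, and `active_policy_for` are assumed names, not part of any existing Fabric API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ActivePeerPolicy:
    min_active: int
    max_active: int


# Defaults mirror the recommended ranges above; the role keys are assumptions.
DEFAULT_ACTIVE_POLICY: Dict[str, ActivePeerPolicy] = {
    "normal": ActivePeerPolicy(3, 5),
    "relay_core": ActivePeerPolicy(8, 20),
    "thin_mobile": ActivePeerPolicy(1, 3),
}


def active_policy_for(
    role: str, overrides: Optional[Dict[str, ActivePeerPolicy]] = None
) -> ActivePeerPolicy:
    """Return active-peer bounds for a role, preferring policy overrides."""
    if overrides and role in overrides:
        return overrides[role]
    return DEFAULT_ACTIVE_POLICY[role]
```

Passing `overrides` models the statement that these values are defaults that policy may change.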
3. Peer Directory Record
A signed peer directory entry may contain:
- node_id
- cluster_id
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- last_seen_config_version
The peer directory is scoped. Ordinary nodes must not receive a full cluster peer directory unless their role explicitly requires it.
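One way to picture scoping is a filter applied before a directory is handed to a node. This is a minimal sketch under assumptions: `DirectoryEntry` carries only a few of the fields listed above, and `scoped_view` with its matching rules is hypothetical, not a defined Fabric function.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class DirectoryEntry:
    node_id: str
    cluster_id: str
    roles: FrozenSet[str]
    organization_scope: Optional[str] = None      # None: not organization-scoped
    service_scope: FrozenSet[str] = frozenset()   # empty: not service-scoped


def scoped_view(
    entries: Iterable[DirectoryEntry],
    cluster_id: str,
    organization: str,
    services: FrozenSet[str],
) -> List[DirectoryEntry]:
    """Return only the entries the requesting node is allowed to learn about."""
    visible = []
    for entry in entries:
        if entry.cluster_id != cluster_id:
            continue  # never leak peers from another cluster
        if entry.organization_scope not in (None, organization):
            continue  # scoped to a different organization
        if entry.service_scope and not (entry.service_scope & services):
            continue  # scoped to services this node does not take part in
        visible.append(entry)
    return visible
```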
4. Endpoint Candidate Model
Endpoint candidates describe possible ways to reach a node.
Candidate fields:
- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp
Transport types may include future values such as:
- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved
This model is descriptive only. C14 does not implement new transports.
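For illustration, a candidate record and a simple ordering rule might look like the sketch below. The transport strings and the convention that a lower `priority` value is tried first are assumptions; the document does not fix either.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class EndpointCandidate:
    endpoint_id: str
    transport: str                             # e.g. "tcp_tls", "wss", "relay"
    host: str
    port: int
    priority: int                              # assumption: lower is tried first
    last_verified_at: Optional[float] = None   # Unix seconds; None if never verified


def order_candidates(
    candidates: Iterable[EndpointCandidate], allowed_transports: Set[str]
) -> List[EndpointCandidate]:
    """Drop disallowed transports, then sort by priority and recency of verification."""
    usable = [c for c in candidates if c.transport in allowed_transports]
    return sorted(usable, key=lambda c: (c.priority, -(c.last_verified_at or 0.0)))
```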
5. Node-Local Peer Cache
The node-local peer cache contains signed directory data plus runtime observations.
Directory-derived fields:
- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version
Runtime observation fields:
- last_success_at
- last_failure_at
- last_latency_ms
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable
Runtime observations are hints. They are not durable authority.
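The split between signed directory data and mutable observations can be expressed in the type structure. In this sketch the signed part is frozen and only the observation part changes at runtime. All class and method names are assumptions, and only a subset of the listed fields is shown.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class DirectoryFields:
    """Copied from the signed directory; never edited locally."""
    node_id: str
    cluster_id: str
    trust_fingerprint: str
    config_version: int


@dataclass
class Observations:
    """Local hints only; safe to discard at any time."""
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    recent_failures: List[float] = field(default_factory=list)
    state: str = "cold"   # "active", "warm", or "cold"


@dataclass
class PeerCacheEntry:
    directory: DirectoryFields
    observed: Observations = field(default_factory=Observations)

    def record_success(self, now: float, latency_ms: float) -> None:
        self.observed.last_success_at = now
        self.observed.last_latency_ms = latency_ms

    def record_failure(self, now: float, keep: int = 5) -> None:
        # Keep a short bounded history so the cache cannot grow without limit.
        self.observed.last_failure_at = now
        self.observed.recent_failures = (self.observed.recent_failures + [now])[-keep:]
```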
6. Refresh Cadence
Recommended cadence:
- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or policy-triggered refresh
Cadence may vary by role:
- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers
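The cadence ranges can be read as scheduling windows. The sketch below picks the next run time uniformly inside a window so that nodes do not probe in lockstep. The uniform choice and the task names are assumptions, not a mandated scheduler.

```python
import random

# (min_seconds, max_seconds) taken from the recommended cadence above.
CADENCE = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def next_due(task: str, now: float, rng=random) -> float:
    """Return the next run time, chosen uniformly inside the task's window."""
    low, high = CADENCE[task]
    return now + rng.uniform(low, high)
```

Role-specific cadence could be modeled by swapping in a different `CADENCE` table per role.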
7. Peer Selection Scoring
Selection is score-based, not latency-only.
Hard checks first:
- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status
Soft score inputs:
- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference
No peer should be selected if it fails hard policy checks, even if latency is excellent.
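The two-phase rule above (hard checks filter, soft score ranks) can be sketched as follows. The field names, the subset of checks shown, and the weights in `soft_score` are illustrative assumptions; real weights would come from policy.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    node_id: str
    same_cluster: bool
    cert_valid: bool
    revoked: bool
    role_compatible: bool
    latency_ms: float
    packet_loss: float   # 0.0 to 1.0
    reliability: float   # 0.0 to 1.0


def passes_hard_checks(c: Candidate) -> bool:
    """Subset of the hard checks; any failure excludes the peer outright."""
    return c.same_cluster and c.cert_valid and not c.revoked and c.role_compatible


def soft_score(c: Candidate) -> float:
    """Higher is better. The weights are placeholders, not recommended values."""
    latency_term = 1.0 / (1.0 + c.latency_ms / 100.0)
    return 0.4 * latency_term + 0.3 * (1.0 - c.packet_loss) + 0.3 * c.reliability


def select_peer(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    eligible = [c for c in candidates if passes_hard_checks(c)]
    if not eligible:
        return None
    return max(eligible, key=soft_score)
```

Because filtering happens before scoring, a revoked peer with excellent latency is never compared against valid peers at all.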
8. Recovery Order
If active peers fail, recovery order is:
- retry active peers with bounded backoff
- promote warm candidates
- try cold/bootstrap peers
- query authorized storage/config discovery endpoint
- use last signed snapshot for degraded reconnect if policy allows
- reconnect to control plane when available
Recovery must not authorize cluster mutation or high-risk actions.
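The recovery order reads naturally as a ladder that stops at the first step that succeeds. The sketch below assumes each step is a callable returning `True` on success. The step names and the doubling backoff in `backoff_delays` are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def backoff_delays(base_s: float, cap_s: float, attempts: int) -> List[float]:
    """Bounded exponential backoff: doubles each attempt, never above cap_s."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


def recover(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first that succeeds."""
    for name, attempt in steps:
        if attempt():
            return name
    return None  # every step failed; the node stays disconnected


# Order follows the list above; the lambdas stand in for real attempts.
example_steps: List[Step] = [
    ("retry_active", lambda: False),
    ("promote_warm", lambda: False),
    ("try_bootstrap", lambda: True),
    ("query_config_discovery", lambda: True),
    ("degraded_snapshot_reconnect", lambda: True),
    ("reconnect_control_plane", lambda: True),
]
```

None of these steps grants authority: a node that reconnects through the ladder is still bound by the rule that recovery must not authorize cluster mutation.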
9. Channel-Aware Peer Preference
Peer choice depends on channel class.
Input/control:
- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic
Render/video:
- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth
File transfer:
- throughput and reliability
- lower priority than input/control
Clipboard/control:
- reliable bounded path
- low volume
Telemetry:
- low priority
- lossy/sampled allowed
VPN/IP tunnel future:
- adaptive QoS
- bulk traffic must not starve interactive sessions
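Channel-aware preference can be modeled as a per-channel weight table over the same normalized metrics. The sketch assumes every metric is already normalized to 0.0 to 1.0 with higher meaning better. The channel keys and weight values are invented for illustration.

```python
from typing import Dict

# Weights per channel class; each row sums to 1.0. Values are placeholders.
CHANNEL_WEIGHTS: Dict[str, Dict[str, float]] = {
    "input_control": {"latency": 0.45, "jitter": 0.35, "reliability": 0.20, "bandwidth": 0.00},
    "render_video":  {"latency": 0.10, "jitter": 0.40, "reliability": 0.10, "bandwidth": 0.40},
    "file_transfer": {"latency": 0.00, "jitter": 0.00, "reliability": 0.50, "bandwidth": 0.50},
    "telemetry":     {"latency": 0.00, "jitter": 0.00, "reliability": 0.50, "bandwidth": 0.50},
}


def channel_score(channel: str, metrics: Dict[str, float]) -> float:
    """Weighted sum of normalized metrics for the given channel class."""
    weights = CHANNEL_WEIGHTS[channel]
    return sum(weight * metrics[name] for name, weight in weights.items())


def best_peer(channel: str, peers: Dict[str, Dict[str, float]]) -> str:
    """Return the peer id with the highest score for this channel class."""
    return max(peers, key=lambda peer_id: channel_score(channel, peers[peer_id]))
```

With this shape, the same pair of peers can rank differently for input/control and for file transfer, which is the behavior the section describes.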
10. Full-Mesh Prevention
Nodes must not attempt to connect to every known node.
Limits:
- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology
Full topology access is reserved only for roles that require it, such as platform control/admin views or selected core/route-analysis components.
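A bounded active set is the simplest guard against drifting toward full mesh. The sketch below promotes warm peers only while the role's limit allows it. The function name is hypothetical, and it assumes `warm` is already ordered best first.

```python
from typing import List, Tuple


def promote_warm(
    active: List[str], warm: List[str], max_active: int
) -> Tuple[List[str], List[str]]:
    """Promote warm peers into the active set without exceeding max_active."""
    new_active = list(active)
    remaining = list(warm)
    while remaining and len(new_active) < max_active:
        new_active.append(remaining.pop(0))
    return new_active, remaining
```

However many peers the directory lists, the active set never grows past the bound set by role policy.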
11. Security Boundaries
Peer cache must enforce:
- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope
A compromised ordinary node should not learn full cluster topology.
Peer cache data must not include:
- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized
12. Multi-Cluster Peer Isolation
Multi-cluster node membership uses separate peer caches per cluster.
Per-cluster separation:
- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history
Cross-cluster peer discovery requires explicit trust and policy. Clusters do not form a single mesh by default.
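Per-cluster separation can be enforced by keying every lookup on the cluster id, so that no call spans clusters. A minimal sketch with an assumed class name and an opaque `entry` value:

```python
from typing import Any, Dict, List, Optional


class MultiClusterPeerCache:
    """Holds one independent peer cache per cluster membership."""

    def __init__(self) -> None:
        self._caches: Dict[str, Dict[str, Any]] = {}   # cluster_id -> {node_id: entry}

    def put(self, cluster_id: str, node_id: str, entry: Any) -> None:
        self._caches.setdefault(cluster_id, {})[node_id] = entry

    def get(self, cluster_id: str, node_id: str) -> Optional[Any]:
        return self._caches.get(cluster_id, {}).get(node_id)

    def peers(self, cluster_id: str) -> List[str]:
        # Deliberately no method that lists peers across all clusters.
        return sorted(self._caches.get(cluster_id, {}))
```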
13. Storage / Snapshot Relationship
Peer directory data is distributed through signed snapshots or Fabric Storage / Config Storage artifacts.
Rules:
- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current directory version
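The rules above can be sketched as a small decision function plus an observation carry-over step. This assumes integer, sequentially increasing directory versions and treats a non-newer version as ignorable. Both are assumptions, since the document does not define the version scheme.

```python
from typing import Any, Dict


def on_directory_update(applied_version: int, new_version: int, signature_ok: bool) -> str:
    """Decide what to do with an incoming peer directory."""
    if not signature_ok:
        return "reject"        # signature/hash mismatch rejects the directory
    if new_version <= applied_version:
        return "ignore"        # assumption: stale or duplicate delivery
    if new_version == applied_version + 1:
        return "apply"
    return "full_resync"       # version gap


def carry_over_observations(
    observations: Dict[str, Any], directory: Dict[str, Dict[str, Any]]
) -> Dict[str, Any]:
    """Keep observations only for peers still present and not revoked."""
    return {
        node_id: obs
        for node_id, obs in observations.items()
        if node_id in directory and not directory[node_id].get("revoked", False)
    }
```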
14. Service Adapter Boundary
Service Adapters may request:
- destination node
- resource target
- egress node
- egress pool
- channel class
Service Adapters must not:
- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy
The Fabric Routing Engine owns those decisions.
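One way to hold this boundary is to give adapters a request type that has no field for peers or routes. The sketch below is hypothetical: C15 is expected to define the real route request boundary, and `RouteRequest`, its fields, and the channel names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_CHANNELS = {
    "input_control", "render_video", "file_transfer", "clipboard_control", "telemetry",
}


@dataclass(frozen=True)
class RouteRequest:
    """What an adapter may ask for: a target and a channel class, nothing more."""
    channel_class: str
    destination_node: Optional[str] = None
    resource_target: Optional[str] = None
    egress_node: Optional[str] = None
    egress_pool: Optional[str] = None

    def __post_init__(self) -> None:
        if self.channel_class not in ALLOWED_CHANNELS:
            raise ValueError(f"unknown channel class: {self.channel_class}")
        targets = (self.destination_node, self.resource_target,
                   self.egress_node, self.egress_pool)
        if not any(targets):
            raise ValueError("request must name at least one target")
```

Because the type has no peer list or path field, an adapter has no way to express a route choice even by accident.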
15. Observability
Node-agent should report safe peer/cache metrics:
- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type
Reports must not expose full topology to organizations.
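A report built only from aggregates avoids leaking topology by construction. In this sketch each peer is a plain dict with assumed keys (`state`, `latency_ms`, `failed`), and the output never includes node ids or endpoints.

```python
from statistics import mean
from typing import Any, Dict, List


def peer_cache_report(peers: List[Dict[str, Any]], directory_version: int) -> Dict[str, Any]:
    """Summarize the peer cache as counts and averages only."""
    active = [p for p in peers if p["state"] == "active"]
    latencies = [p["latency_ms"] for p in active if p.get("latency_ms") is not None]
    return {
        "active_peer_count": len(active),
        "warm_peer_count": sum(1 for p in peers if p["state"] == "warm"),
        "bootstrap_peer_count": sum(1 for p in peers if p["state"] == "cold"),
        "peer_directory_version": directory_version,
        "avg_active_latency_ms": mean(latencies) if latencies else None,
        "failed_peer_count": sum(1 for p in peers if p.get("failed", False)),
    }
```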
16. Future Validation Tests
Future implementation tests must prove:
- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to ordinary nodes
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection
17. C15 Preparation
C15 must define the Fabric Routing Engine skeleton boundary.
The routing engine will consume:
- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history
C15 must not carry production mesh traffic. It should define route request and route result boundaries before runtime routing exists.
18. Result / Decision
Stage C14 defines scoped peer discovery and peer cache behavior.
Decisions:
- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C14.