# Fabric Peer Directory and Cache Model

Status: Stage C14 result. Documentation and architecture only.

This document defines the Fabric peer directory and node-local peer cache model. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The peer directory tells a node which peers it may know about and potentially connect to.

The node-local peer cache stores scoped peer data plus runtime observations for fast recovery and score-based peer selection.

The model must avoid:

- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision

## 2. Peer Knowledge Classes

Each node maintains three peer classes:

- active peers
- warm candidate peers
- cold/bootstrap peers

Active peers:

- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set

Warm candidate peers:

- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers

Cold/bootstrap peers:

- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from signed snapshot, local cache, storage/config service, or admin-defined seed nodes

Recommended active peer counts:

- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3

These are policy defaults, not hardcoded limits.

## 3. Peer Directory Record

A signed peer directory entry may contain:

- `node_id`
- `cluster_id`
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- `last_seen_config_version`

The peer directory is scoped. Ordinary nodes must not receive a full cluster peer directory unless their role explicitly requires it.

## 4. Endpoint Candidate Model

Endpoint candidates describe possible ways to reach a node.

Candidate fields:

- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp

Transport types may include future values such as:

- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved

This model is descriptive only. C14 does not implement new transports.

## 5. Node-Local Peer Cache

The node-local peer cache contains signed directory data plus runtime observations.

Directory-derived fields:

- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version

Runtime observation fields:

- `last_success_at`
- `last_failure_at`
- `last_latency_ms`
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable

Runtime observations are hints. They are not durable authority.

## 6. Refresh Cadence

Recommended cadence:

- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or policy-triggered refresh

Cadence may vary by role:

- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers

## 7. Peer Selection Scoring

Selection is score-based, not latency-only.

Hard checks first:

- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status

Soft score inputs:

- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference

No peer should be selected if it fails hard policy checks, even if latency is excellent.

## 8. Recovery Order

If active peers fail, recovery order is:

1. retry active peers with bounded backoff
2. promote warm candidates
3. try cold/bootstrap peers
4. query authorized storage/config discovery endpoint
5. use last signed snapshot for degraded reconnect if policy allows
6. reconnect to control plane when available

Recovery must not authorize cluster mutation or high-risk actions.

## 9. Channel-Aware Peer Preference

Peer choice depends on channel class.


Input/control:

- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic

Render/video:

- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth

File transfer:

- throughput and reliability
- lower priority than input/control

Clipboard/control:

- reliable bounded path
- low volume

Telemetry:

- low priority
- lossy/sampled allowed

VPN/IP tunnel future:

- adaptive QoS
- bulk traffic must not starve interactive sessions

## 10. Full-Mesh Prevention

Nodes must not attempt to connect to every known node.

Limits:

- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology

Full topology access is reserved only for roles that require it, such as platform control/admin views or selected core/route-analysis components.

## 11. Security Boundaries

Peer cache must enforce:

- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope

A compromised ordinary node should not learn full cluster topology.

Peer cache data must not include:

- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized

## 12. Multi-Cluster Peer Isolation

Multi-cluster node membership uses separate peer caches per cluster.

Per-cluster separation:

- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history

Cross-cluster peer discovery requires explicit trust and policy. Clusters do not form a single mesh by default.

## 13. Storage / Snapshot Relationship

Peer directory data is distributed through signed snapshots or Fabric Storage / Config Storage artifacts.


Rules:

- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current directory version

## 14. Service Adapter Boundary

Service Adapters may request:

- destination node
- resource target
- egress node
- egress pool
- channel class

Service Adapters must not:

- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy

The Fabric Routing Engine owns those decisions.

## 15. Observability

Node-agent should report safe peer/cache metrics:

- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type

Reports must not expose full topology to organizations.

## 16. Future Validation Tests

Future implementation tests must prove:

- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to ordinary node
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection

## 17. C15 Preparation

C15 must define the Fabric Routing Engine skeleton boundary.

The routing engine will consume:

- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history

C15 must not carry production mesh traffic. It should define route request and route result boundaries before runtime routing exists.

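
As an illustration of the section 13 rules, and of the stale-observation test listed in section 16, the following is a minimal Python sketch of how a node might apply an incoming peer directory. It is not part of C14 and defines no API. Every name here (`PeerCache`, `ApplyResult`, `apply_directory`) is hypothetical. The sketch makes three assumptions: signature/hash verification is reduced to a boolean input, a "version gap" means the incoming version is more than one ahead of the last applied version, and an observation stays valid only while its peer remains in the directory and is not revoked.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, Set


class ApplyResult(Enum):
    APPLIED = "applied"
    REJECTED = "rejected"
    RESYNC_NEEDED = "resync_needed"


@dataclass
class PeerCache:
    """Hypothetical node-local cache: directory data plus runtime hints."""
    version: int = 0
    peers: Set[str] = field(default_factory=set)
    # node_id -> last observed latency in ms (stand-in for all runtime observations)
    observations: Dict[str, float] = field(default_factory=dict)


def apply_directory(
    cache: PeerCache,
    new_version: int,
    peer_ids: Iterable[str],
    revoked_ids: Iterable[str],
    signature_ok: bool,
) -> ApplyResult:
    # Signature/hash mismatch rejects the directory; the cache is left untouched.
    if not signature_ok:
        return ApplyResult.REJECTED
    # Assumption: an older or duplicate version is ignored rather than applied.
    if new_version <= cache.version:
        return ApplyResult.REJECTED
    # Assumption: a gap of more than one version triggers refresh/full resync.
    if new_version > cache.version + 1:
        return ApplyResult.RESYNC_NEEDED
    # Revoked peers are removed from the usable set.
    usable = set(peer_ids) - set(revoked_ids)
    cache.peers = usable
    # Keep runtime observations only for peers still valid in this version.
    cache.observations = {
        node_id: value
        for node_id, value in cache.observations.items()
        if node_id in usable
    }
    cache.version = new_version
    return ApplyResult.APPLIED
```

Returning a result value instead of raising keeps the three outcomes (apply, reject, resync) explicit, so that a caller could report the last applied version as section 15 suggests.
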

## 18. Result / Decision

Stage C14 defines scoped peer discovery and peer cache behavior.

Decisions:

- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C14.
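
For illustration only, the decision that peer selection is score-based with hard policy checks first (section 7) can be sketched in Python as below. This is a sketch under stated assumptions, not an implementation of C14. The type `PeerCandidate`, the functions `passes_hard_checks`, `soft_score`, and `select_peer`, the subset of checks shown, and the weights are all hypothetical placeholders; a real scorer would also consider jitter, region distance, load, channel class, and policy preference.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PeerCandidate:
    """Hypothetical view of one cached peer; field names are illustrative."""
    node_id: str
    cluster_id: str
    cert_valid: bool
    revoked: bool
    role_compatible: bool
    latency_ms: float
    packet_loss: float   # fraction in [0, 1]
    reliability: float   # score in [0, 1], higher is better


def passes_hard_checks(peer: PeerCandidate, local_cluster_id: str) -> bool:
    # Only a subset of the section 7 hard checks is modeled here.
    return (
        peer.cluster_id == local_cluster_id
        and peer.cert_valid
        and not peer.revoked
        and peer.role_compatible
    )


def soft_score(peer: PeerCandidate) -> float:
    # Higher is better. The weights are placeholders, not recommended values.
    latency_term = 1.0 / (1.0 + peer.latency_ms / 100.0)
    loss_term = 1.0 - min(max(peer.packet_loss, 0.0), 1.0)
    return 0.4 * latency_term + 0.3 * loss_term + 0.3 * peer.reliability


def select_peer(
    peers: Iterable[PeerCandidate], local_cluster_id: str
) -> Optional[PeerCandidate]:
    # Hard checks filter first; scoring only ranks the peers that remain.
    eligible = [p for p in peers if passes_hard_checks(p, local_cluster_id)]
    if not eligible:
        return None
    return max(eligible, key=soft_score)


if __name__ == "__main__":
    candidates = [
        # Excellent latency, but revoked: must never be selected.
        PeerCandidate("n1", "c1", True, True, True, 1.0, 0.0, 1.0),
        PeerCandidate("n2", "c1", True, False, True, 40.0, 0.01, 0.95),
        PeerCandidate("n3", "c1", True, False, True, 80.0, 0.0, 0.90),
    ]
    chosen = select_peer(candidates, "c1")
    print(chosen.node_id if chosen else "no eligible peer")
```

Filtering before scoring means a revoked or wrong-cluster peer can never win on latency alone, which matches the section 7 rule that no peer is selected if it fails hard policy checks.
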