rdp-proxy/docs/architecture/FABRIC_PEER_DIRECTORY_CACHE.md

# Fabric Peer Directory and Cache Model

Status: Stage C14 result. Documentation and architecture only.

This document defines the Fabric peer directory and node-local peer cache model.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The peer directory tells a node which peers it may know about and potentially
connect to. The node-local peer cache stores scoped peer data plus runtime
observations for fast recovery and score-based peer selection.

The model must avoid:

- full-mesh assumptions
- every node knowing full cluster topology
- service adapters owning route selection
- Redis as durable peer topology
- backend calls on every realtime route decision

## 2. Peer Knowledge Classes

Each node maintains three peer classes:

- active peers
- warm candidate peers
- cold/bootstrap peers

Active peers:

- currently connected or recently used
- participate in health, route, relay, or service traffic according to role
- small bounded set

Warm candidate peers:

- known good but not currently active
- promoted when active peers fail or a better path is needed
- refreshed less frequently than active peers

Cold/bootstrap peers:

- seed or last-resort discovery peers
- used when active and warm peers fail
- may come from signed snapshot, local cache, storage/config service, or
  admin-defined seed nodes

Recommended active peer counts:

- normal node: 3-5
- relay/core node: 8-20
- thin/mobile node: 1-3

These are policy defaults, not hardcoded limits.
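
As a non-normative illustration, the recommended counts above could be held in a policy table that operators override, rather than in constants. The type `ActivePeerPolicy`, the table `DEFAULT_ACTIVE_PEER_POLICY`, the role keys, and the `overrides` parameter are all hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivePeerPolicy:
    """Bounds on the active peer set for one node role (hypothetical type)."""
    min_active: int
    max_active: int

    def clamp(self, requested: int) -> int:
        """Clamp a requested active peer count into the policy range."""
        return max(self.min_active, min(self.max_active, requested))


# Ranges come from the recommended counts above; the role keys are assumptions.
DEFAULT_ACTIVE_PEER_POLICY = {
    "normal": ActivePeerPolicy(3, 5),
    "relay_core": ActivePeerPolicy(8, 20),
    "thin_mobile": ActivePeerPolicy(1, 3),
}


def active_peer_policy(role, overrides=None):
    """Return the policy for a role; operator overrides replace the defaults."""
    merged = {**DEFAULT_ACTIVE_PEER_POLICY, **(overrides or {})}
    return merged[role]
```

Keeping the defaults in data makes it straightforward for policy to widen or narrow a role's range without a code change.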

## 3. Peer Directory Record

A signed peer directory entry may contain:

- `node_id`
- `cluster_id`
- endpoint candidates
- advertised roles
- verified capabilities
- allowed peer relationship type
- region/location hints
- trust/certificate fingerprint
- certificate expiry metadata
- policy scope
- organization scope where applicable
- service scope where applicable
- supported transport hints
- NAT/connectivity hints
- `last_seen_config_version`

The peer directory is scoped. Ordinary nodes must not receive a full cluster
peer directory unless their role explicitly requires it.
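
A minimal sketch of how scoping might look, assuming a trimmed-down record with only a few of the fields listed above. The type `PeerDirectoryEntry`, the function `scoped_view`, and the exact matching rules (an unset organization scope means "not organization-scoped", an empty service scope means "not service-scoped") are assumptions for illustration, not a defined format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerDirectoryEntry:
    """Subset of the directory fields listed above (hypothetical shape)."""
    node_id: str
    cluster_id: str
    roles: frozenset
    cert_fingerprint: str
    organization_scope: Optional[str] = None  # None: not organization-scoped
    service_scope: frozenset = frozenset()    # empty: not service-scoped
    last_seen_config_version: int = 0


def scoped_view(entries, cluster_id, organization, services):
    """Return only the entries a requesting node is allowed to learn about."""
    visible = []
    for entry in entries:
        if entry.cluster_id != cluster_id:
            continue  # never leak peers from another cluster
        if entry.organization_scope not in (None, organization):
            continue  # scoped to a different organization
        if entry.service_scope and not (entry.service_scope & services):
            continue  # scoped to services the requester does not handle
        visible.append(entry)
    return visible
```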

## 4. Endpoint Candidate Model

Endpoint candidates describe possible ways to reach a node.

Candidate fields:

- endpoint id
- transport type
- host/IP/DNS name
- port
- address family
- public/private reachability
- region
- NAT type if known
- TLS/mTLS identity expectations
- priority
- policy tags
- last verified timestamp

Transport types may include future values such as:

- direct TCP/TLS
- WSS
- relay-assisted
- outbound-only reverse channel
- future QUIC/UDP where explicitly approved

This model is descriptive only. C14 does not implement new transports.
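
For illustration only, an endpoint candidate could be modeled as a small immutable record and filtered by the transports a node supports. The names `Transport`, `EndpointCandidate`, and `usable_candidates` are hypothetical, only a subset of the candidate fields is shown, and the convention that a lower `priority` value is preferred is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    """Transport types named above; QUIC/UDP is omitted until approved."""
    TCP_TLS = "tcp_tls"
    WSS = "wss"
    RELAY_ASSISTED = "relay_assisted"
    REVERSE_CHANNEL = "reverse_channel"


@dataclass(frozen=True)
class EndpointCandidate:
    """Subset of the candidate fields listed above (hypothetical shape)."""
    endpoint_id: str
    transport: Transport
    host: str
    port: int
    priority: int        # assumption: lower value is preferred
    public: bool = True  # public/private reachability


def usable_candidates(candidates, supported):
    """Keep candidates whose transport is supported, best priority first."""
    usable = [c for c in candidates if c.transport in supported]
    return sorted(usable, key=lambda c: c.priority)
```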

## 5. Node-Local Peer Cache

The node-local peer cache contains signed directory data plus runtime
observations.

Directory-derived fields:

- peer identity
- cluster id
- endpoint candidates
- roles/capabilities
- trust fingerprint
- policy scope
- config version

Runtime observation fields:

- `last_success_at`
- `last_failure_at`
- `last_latency_ms`
- packet loss
- jitter
- reliability score
- recent failure history
- observed load hint where allowed
- active/warm/cold state
- last selected route id if applicable

Runtime observations are hints. They are not durable authority.
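
One possible shape for the runtime observation side of a cache entry is sketched below. The class `RuntimeObservations`, the exponentially weighted reliability score, the smoothing factor `alpha`, and the `history_limit` bound are assumptions; this document does not define how reliability is computed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuntimeObservations:
    """Node-local hints about one peer; never treated as durable authority."""
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    reliability: float = 1.0  # 0.0 (always failing) .. 1.0 (always succeeding)
    recent_failures: list = field(default_factory=list)

    def record_success(self, now, latency_ms, alpha=0.2):
        """Move reliability toward 1.0 and remember the measured latency."""
        self.last_success_at = now
        self.last_latency_ms = latency_ms
        self.reliability = (1 - alpha) * self.reliability + alpha

    def record_failure(self, now, alpha=0.2, history_limit=10):
        """Move reliability toward 0.0 and keep a bounded failure history."""
        self.last_failure_at = now
        self.reliability = (1 - alpha) * self.reliability
        self.recent_failures.append(now)
        # Keep only the most recent failures so the history stays bounded.
        self.recent_failures = self.recent_failures[-history_limit:]
```

Because these values are hints, a cache built this way can be discarded at any time without losing authoritative state.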

## 6. Refresh Cadence

Recommended cadence:

- active peer heartbeat: 5-15 seconds
- active/warm latency probes: 30-120 seconds
- warm peer validation: 2-10 minutes
- peer directory refresh: 5-15 minutes
- cold/bootstrap validation: periodic or on demand
- full peer directory resync: only on version gap, signature mismatch, or
  policy-triggered refresh

Cadence may vary by role:

- relay/core nodes maintain richer peer sets
- thin/mobile nodes probe less aggressively
- egress/service nodes prioritize peers relevant to assigned services
- storage/config nodes prioritize configured replica peers
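
As a rough sketch, the cadence ranges could be stored as data and each timer could draw its next interval from within the range, which also keeps nodes from probing in lockstep. Treating the ranges as jitter bounds is an assumption of this sketch, and `CADENCE_SECONDS` and `next_interval` are hypothetical names.

```python
import random

# (min_seconds, max_seconds), taken from the recommended cadence above.
CADENCE_SECONDS = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def next_interval(task, rng=random):
    """Pick the delay before the next run of a task, within its range."""
    low, high = CADENCE_SECONDS[task]
    return rng.uniform(low, high)
```

A role-specific policy could replace the table entries, for example widening the ranges for thin/mobile nodes.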

## 7. Peer Selection Scoring

Selection is score-based, not latency-only.

Hard checks first:

- cluster membership
- node identity trust
- certificate validity
- role compatibility
- allowed peer relationship
- organization/service scope
- partition/authority policy
- transport compatibility
- revocation status

Soft score inputs:

- latency
- packet loss
- jitter
- reliability
- recent failure history
- region distance
- node load hint
- bandwidth availability
- role suitability
- route class/channel class
- policy preference

No peer should be selected if it fails hard policy checks, even if latency is
excellent.
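
The two-phase rule above (hard checks filter, soft score ranks) can be sketched as follows. The `Candidate` fields are a small subset of the inputs listed above, and the latency normalization and weight names are illustrative assumptions rather than a defined scoring formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of a peer at selection time."""
    node_id: str
    same_cluster: bool
    cert_valid: bool
    revoked: bool
    role_compatible: bool
    latency_ms: float
    packet_loss: float  # fraction, 0.0 .. 1.0
    reliability: float  # 0.0 .. 1.0


def passes_hard_checks(c):
    """Hard policy checks: any failure disqualifies the peer outright."""
    return c.same_cluster and c.cert_valid and c.role_compatible and not c.revoked


def soft_score(c, weights):
    """Weighted sum of normalized soft inputs; higher is better."""
    latency_term = 1.0 / (1.0 + c.latency_ms / 100.0)  # assumed normalization
    return (weights["latency"] * latency_term
            + weights["loss"] * (1.0 - c.packet_loss)
            + weights["reliability"] * c.reliability)


def select_peer(candidates, weights):
    """Filter by hard checks first, then pick the best soft score."""
    eligible = [c for c in candidates if passes_hard_checks(c)]
    if not eligible:
        return None
    return max(eligible, key=lambda c: soft_score(c, weights))
```

Because filtering happens before scoring, a revoked peer with 2 ms latency never competes with a valid peer at 80 ms.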

## 8. Recovery Order

If active peers fail, recovery order is:

1. retry active peers with bounded backoff
2. promote warm candidates
3. try cold/bootstrap peers
4. query authorized storage/config discovery endpoint
5. use last signed snapshot for degraded reconnect if policy allows
6. reconnect to control plane when available

Recovery must not authorize cluster mutation or high-risk actions.
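
The ordered fallback above can be sketched as a loop over stages that stops at the first success. The enum `RecoveryStage`, the `attempts` mapping of stage to callable, and the exponential backoff parameters are assumptions for illustration; the sketch only finds a peer to reconnect to and grants no additional authority.

```python
from enum import Enum


class RecoveryStage(Enum):
    """Stages in the order listed above; Enum iteration follows this order."""
    RETRY_ACTIVE = 1
    PROMOTE_WARM = 2
    TRY_BOOTSTRAP = 3
    QUERY_DISCOVERY = 4
    USE_SIGNED_SNAPSHOT = 5
    RECONNECT_CONTROL_PLANE = 6


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Bounded exponential backoff in seconds for retrying active peers."""
    return min(cap, base * 2 ** attempt)


def recover(attempts, snapshot_allowed):
    """Run stages in order; return (stage, peer_id) for the first success.

    attempts maps a RecoveryStage to a zero-argument callable that returns
    a peer id on success or None on failure.
    """
    for stage in RecoveryStage:
        if stage is RecoveryStage.USE_SIGNED_SNAPSHOT and not snapshot_allowed:
            continue  # degraded reconnect only when policy allows it
        attempt = attempts.get(stage)
        if attempt is None:
            continue
        peer = attempt()
        if peer is not None:
            return stage, peer
    return None
```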

## 9. Channel-Aware Peer Preference

Peer choice depends on channel class.

Input/control:

- lowest latency
- lowest jitter
- high reliability
- never behind bulk traffic

Render/video:

- bandwidth and jitter aware
- stale-frame dropping acceptable
- avoid paths with persistent queue growth

File transfer:

- throughput and reliability
- lower priority than input/control

Clipboard/control:

- reliable bounded path
- low volume

Telemetry:

- low priority
- lossy/sampled allowed

VPN/IP tunnel (future):

- adaptive QoS
- bulk traffic must not starve interactive sessions
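
One way to picture channel-aware preference is a weight profile per channel class applied to normalized path metrics. Every number in `CHANNEL_WEIGHTS` is an illustrative assumption, the class keys and function names are hypothetical, and only three of the channel classes above are shown.

```python
# Hypothetical weight profiles; the values are assumptions, not policy.
CHANNEL_WEIGHTS = {
    "input_control": {"latency": 0.4, "jitter": 0.4, "reliability": 0.2, "bandwidth": 0.0},
    "render_video": {"latency": 0.1, "jitter": 0.4, "reliability": 0.1, "bandwidth": 0.4},
    "file_transfer": {"latency": 0.0, "jitter": 0.0, "reliability": 0.5, "bandwidth": 0.5},
}


def channel_score(channel_class, metrics):
    """Weighted sum of normalized metrics (each 0.0 .. 1.0, higher is better)."""
    weights = CHANNEL_WEIGHTS[channel_class]
    return sum(weights[name] * metrics[name] for name in weights)


def best_path(channel_class, paths):
    """paths maps a path label to its normalized metrics dict."""
    return max(paths, key=lambda label: channel_score(channel_class, paths[label]))
```

With these profiles, the same pair of paths can rank differently per channel: a low-latency, low-bandwidth path wins for input/control while a high-bandwidth path wins for file transfer.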

## 10. Full-Mesh Prevention

Nodes must not attempt to connect to every known node.

Limits:

- active peers are bounded by role policy
- warm peers are bounded by role policy
- peer directory is scoped
- full topology is hidden from organizations
- service adapters never request arbitrary topology

Full topology access is reserved for roles that require it, such as
platform control/admin views or selected core/route-analysis components.
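
A minimal sketch of the bounding rule, assuming the limits arrive from role policy: promotion is refused when the active set is full, so knowing about more peers never leads to more connections. `BoundedPeerSet` and its method names are hypothetical.

```python
class BoundedPeerSet:
    """Active and warm peer sets with policy-supplied upper bounds."""

    def __init__(self, max_active, max_warm):
        self.max_active = max_active
        self.max_warm = max_warm
        self.active = set()
        self.warm = set()

    def add_warm(self, node_id):
        """Track a warm candidate; refuse when the warm set is full."""
        if node_id in self.active or node_id in self.warm:
            return True
        if len(self.warm) >= self.max_warm:
            return False
        self.warm.add(node_id)
        return True

    def promote(self, node_id):
        """Move a warm peer to active only if the active bound allows it."""
        if node_id not in self.warm or len(self.active) >= self.max_active:
            return False
        self.warm.remove(node_id)
        self.active.add(node_id)
        return True

    def mark_failed(self, node_id):
        """Drop a failed peer, freeing a slot for a later promotion."""
        self.active.discard(node_id)
        self.warm.discard(node_id)
```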

## 11. Security Boundaries

The peer cache must enforce:

- cluster isolation
- organization isolation
- certificate fingerprint validation
- revocation status
- role assignment
- allowed peer relationship
- service scope

A compromised ordinary node should not learn full cluster topology.

Peer cache data must not include:

- unrelated organization resources
- raw secrets
- broad user lists
- arbitrary route authority
- cross-cluster trust unless explicitly authorized

## 12. Multi-Cluster Peer Isolation

A node that belongs to multiple clusters keeps a separate peer cache for each
cluster.

Per-cluster separation:

- peer directory
- endpoint candidates
- trust roots
- certificate fingerprints
- active/warm/cold peer state
- route observations
- failure history

Cross-cluster peer discovery requires explicit trust and policy. Clusters do
not form a single mesh by default.
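
The separation can be pictured as one cache per cluster id, with writes rejected when an entry's cluster does not match the cache it is being written to. `ClusterPeerCaches` and the dict-shaped entries are assumptions for this sketch.

```python
class ClusterPeerCaches:
    """Holds one independent peer cache per cluster a node belongs to."""

    def __init__(self):
        self._caches = {}  # cluster_id -> {node_id: entry}

    def remember(self, cluster_id, entry):
        """Store an entry, refusing peers that belong to another cluster."""
        if entry["cluster_id"] != cluster_id:
            raise ValueError("peer entry belongs to a different cluster")
        self._caches.setdefault(cluster_id, {})[entry["node_id"]] = entry

    def peers(self, cluster_id):
        """Return a copy of one cluster's cache; other clusters stay hidden."""
        return dict(self._caches.get(cluster_id, {}))
```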

## 13. Storage / Snapshot Relationship

Peer directory data is distributed through signed snapshots or Fabric Storage /
Config Storage artifacts.

Rules:

- peer directory version is tracked
- node reports last applied peer directory version
- version gap triggers refresh/full resync
- signature/hash mismatch rejects the directory
- revoked peers are removed or marked unusable
- runtime observations are preserved only when still valid for the current
  directory version
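
The rules above can be sketched as a single update function. This is a simplified illustration: a SHA-256 hash over canonical JSON stands in for real signature verification, the `base_version` field and the dict shapes of `state` and `update` are assumptions, and "still valid" is interpreted here as "the peer's directory entry is unchanged".

```python
import hashlib
import json


def directory_hash(entries):
    """Hash a canonical JSON encoding of the directory entries."""
    payload = json.dumps(entries, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


def apply_directory(state, update):
    """Apply a directory update to local state and return an outcome string.

    state:  {"version": int, "entries": {node_id: entry},
             "observations": {node_id: obs}}
    update: {"version": int, "base_version": int, "entries": {node_id: entry},
             "revoked": [node_id, ...], "hash": str}
    """
    if directory_hash(update["entries"]) != update["hash"]:
        return "rejected"  # hash mismatch: keep the current directory
    if update["base_version"] != state["version"]:
        return "resync_required"  # version gap: ask for a full resync
    revoked = set(update["revoked"])
    new_entries = {n: e for n, e in update["entries"].items() if n not in revoked}
    # Keep observations only for peers whose directory entry did not change.
    state["observations"] = {
        n: obs for n, obs in state["observations"].items()
        if n in new_entries and state["entries"].get(n) == new_entries[n]
    }
    state["entries"] = new_entries
    state["version"] = update["version"]
    return "applied"
```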

## 14. Service Adapter Boundary

Service Adapters may request:

- destination node
- resource target
- egress node
- egress pool
- channel class

Service Adapters must not:

- enumerate peers
- select mesh routes
- promote warm peers
- create shortcut connections
- implement partition recovery
- implement cross-cluster routing policy

The Fabric Routing Engine owns those decisions.
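
To illustrate the boundary, an adapter-facing request type could carry only the fields adapters are allowed to set and nothing that names peers or routes. `AdapterRouteRequest`, the channel class identifiers, and the validation rules are hypothetical; C15 is expected to define the actual route request boundary.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed identifiers for channel classes; not a defined enumeration.
ALLOWED_CHANNEL_CLASSES = frozenset({
    "input_control", "render_video", "file_transfer",
    "clipboard_control", "telemetry",
})


@dataclass(frozen=True)
class AdapterRouteRequest:
    """What a Service Adapter may ask for; it carries no peer or route choice."""
    channel_class: str
    destination_node: Optional[str] = None
    resource_target: Optional[str] = None
    egress_node: Optional[str] = None
    egress_pool: Optional[str] = None

    def __post_init__(self):
        if self.channel_class not in ALLOWED_CHANNEL_CLASSES:
            raise ValueError(f"unknown channel class: {self.channel_class}")
        targets = (self.destination_node, self.resource_target,
                   self.egress_node, self.egress_pool)
        if all(t is None for t in targets):
            raise ValueError("request must name at least one target")
```

The type has no field for a peer list, a relay hop, or a route id, so an adapter built against it has no way to express those choices.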

## 15. Observability

The node agent should report safe peer/cache metrics:

- active peer count
- warm peer count
- bootstrap peer count
- peer directory version
- last refresh time
- average active peer latency
- packet loss summary
- failed peer count
- recovery mode if active
- selected peer class by channel type

Reports must not expose full topology to organizations.
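
A sketch of a report that stays aggregate-only, assuming each cached peer is a dict with `state`, `latency_ms`, and `failed` keys. The function `peer_cache_metrics` and the metric key names are hypothetical; the point is that no node ids or endpoints appear in the output.

```python
from collections import Counter
from statistics import mean


def peer_cache_metrics(peers, directory_version):
    """Summarize the cache as counts and averages, without peer identities."""
    peers = list(peers)
    counts = Counter(p["state"] for p in peers)
    active_latencies = [
        p["latency_ms"] for p in peers
        if p["state"] == "active" and p["latency_ms"] is not None
    ]
    return {
        "active_peer_count": counts["active"],
        "warm_peer_count": counts["warm"],
        "bootstrap_peer_count": counts["cold"],
        "failed_peer_count": sum(1 for p in peers if p["failed"]),
        "peer_directory_version": directory_version,
        "avg_active_latency_ms": mean(active_latencies) if active_latencies else None,
    }
```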

## 16. Future Validation Tests

Future implementation tests must prove:

- peer directory scope is enforced
- wrong-cluster peer is rejected
- revoked peer is rejected
- invalid certificate fingerprint is rejected
- full topology is not distributed to an ordinary node
- active peer count stays bounded
- warm peer promotion works
- bootstrap recovery works
- score-based selection respects hard policy checks
- stale runtime observations are ignored after directory version change
- service adapter cannot bypass Fabric peer selection

## 17. C15 Preparation

C15 must define the Fabric Routing Engine skeleton boundary.

The routing engine will consume:

- peer directory/cache
- route policy
- QoS policy
- channel class
- service request metadata
- cluster/organization scope
- failure history

C15 must not carry production mesh traffic. It should define route request and
route result boundaries before runtime routing exists.

## 18. Result / Decision

Stage C14 defines scoped peer discovery and peer cache behavior.

Decisions:

- nodes maintain active, warm, and cold/bootstrap peer classes
- nodes do not maintain full mesh connections
- peer directory data is scoped and signed
- peer cache combines signed directory data with runtime observations
- peer selection is score-based with hard policy checks first
- recovery uses active, warm, bootstrap, storage/config, then last snapshot
- service adapters do not own peer discovery or route selection
- C15 must define the Fabric Routing Engine skeleton before mesh runtime

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C14.