rdp-proxy/docs/architecture/FABRIC_PEER_DIRECTORY_CACHE.md
2026-04-28 22:29:50 +03:00

Fabric Peer Directory and Cache Model

Status: Stage C14 result. Documentation and architecture only.

This document defines the Fabric peer directory and node-local peer cache model. It introduces no code, migrations, or APIs, and it does not implement mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

1. Purpose

The peer directory tells a node which peers it may know about and potentially connect to. The node-local peer cache stores scoped peer data plus runtime observations for fast recovery and score-based peer selection.

The model must avoid:

  • full-mesh assumptions
  • every node knowing full cluster topology
  • service adapters owning route selection
  • Redis as durable peer topology
  • backend calls on every realtime route decision

2. Peer Knowledge Classes

Each node maintains three peer classes:

  • active peers
  • warm candidate peers
  • cold/bootstrap peers

Active peers:

  • currently connected or recently used
  • participate in health, route, relay, or service traffic according to role
  • small bounded set

Warm candidate peers:

  • known good but not currently active
  • promoted when active peers fail or a better path is needed
  • refreshed less frequently than active peers

Cold/bootstrap peers:

  • seed or last-resort discovery peers
  • used when active and warm peers fail
  • may come from a signed snapshot, the local cache, a storage/config service, or admin-defined seed nodes

Recommended active peer counts:

  • normal node: 3-5
  • relay/core node: 8-20
  • thin/mobile node: 1-3

These are policy defaults, not hardcoded limits.
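
For illustration only, the recommended counts could be expressed as role-keyed defaults that policy may override. This is a sketch, not part of C14: the names `PeerLimits`, `DEFAULT_ACTIVE_LIMITS`, `active_limits_for`, and the role keys are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerLimits:
    """Inclusive target range for the number of active peers."""
    min_active: int
    max_active: int


# Defaults taken from the recommended counts above; role keys are assumed names.
DEFAULT_ACTIVE_LIMITS = {
    "normal": PeerLimits(3, 5),
    "relay_core": PeerLimits(8, 20),
    "thin_mobile": PeerLimits(1, 3),
}


def active_limits_for(role, overrides=None):
    """Return the limits for a role, letting policy overrides replace defaults."""
    merged = {**DEFAULT_ACTIVE_LIMITS, **(overrides or {})}
    return merged[role]
```

Keeping the defaults in data rather than in control flow is one way to honor the "not hardcoded limits" requirement.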

3. Peer Directory Record

A signed peer directory entry may contain:

  • node_id
  • cluster_id
  • endpoint candidates
  • advertised roles
  • verified capabilities
  • allowed peer relationship type
  • region/location hints
  • trust/certificate fingerprint
  • certificate expiry metadata
  • policy scope
  • organization scope where applicable
  • service scope where applicable
  • supported transport hints
  • NAT/connectivity hints
  • last_seen_config_version

The peer directory is scoped. Ordinary nodes must not receive a full cluster peer directory unless their role explicitly requires it.
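
A minimal sketch of how scoping might look, assuming a reduced record with only a few of the fields listed above. `PeerDirectoryEntry`, `scoped_view`, and the rule that an entry without an organization scope is visible to every organization in the cluster are assumptions made for this example; signature verification is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PeerDirectoryEntry:
    node_id: str
    cluster_id: str
    advertised_roles: frozenset
    cert_fingerprint: str
    organization_scope: Optional[str] = None  # assumption: None means not org-scoped
    last_seen_config_version: int = 0


def scoped_view(entries, cluster_id, organization, full_topology_role=False):
    """Return only the entries a requesting node is allowed to see."""
    same_cluster = [e for e in entries if e.cluster_id == cluster_id]
    if full_topology_role:
        # Only roles that explicitly require it receive the whole cluster view.
        return same_cluster
    return [e for e in same_cluster if e.organization_scope in (None, organization)]
```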

4. Endpoint Candidate Model

Endpoint candidates describe possible ways to reach a node.

Candidate fields:

  • endpoint id
  • transport type
  • host/IP/DNS name
  • port
  • address family
  • public/private reachability
  • region
  • NAT type if known
  • TLS/mTLS identity expectations
  • priority
  • policy tags
  • last verified timestamp

Transport types may include future values such as:

  • direct TCP/TLS
  • WSS
  • relay-assisted
  • outbound-only reverse channel
  • future QUIC/UDP where explicitly approved

This model is descriptive only. C14 does not implement new transports.
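
As a descriptive sketch only, an endpoint candidate and a candidate ordering could look like the following. The enum values, the `EndpointCandidate` field subset, `ordered_candidates`, and the convention that a lower `priority` value is tried first are all assumptions, since the model above does not fix them.

```python
from dataclasses import dataclass
from enum import Enum


class Transport(Enum):
    DIRECT_TLS = "direct_tcp_tls"
    WSS = "wss"
    RELAY_ASSISTED = "relay_assisted"
    REVERSE_CHANNEL = "outbound_only_reverse"


@dataclass(frozen=True)
class EndpointCandidate:
    endpoint_id: str
    transport: Transport
    host: str
    port: int
    priority: int  # assumption: lower value is tried first


def ordered_candidates(candidates, supported):
    """Drop candidates whose transport is unsupported, then order by priority."""
    usable = [c for c in candidates if c.transport in supported]
    return sorted(usable, key=lambda c: c.priority)
```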

5. Node-Local Peer Cache

The node-local peer cache contains signed directory data plus runtime observations.

Directory-derived fields:

  • peer identity
  • cluster id
  • endpoint candidates
  • roles/capabilities
  • trust fingerprint
  • policy scope
  • config version

Runtime observation fields:

  • last_success_at
  • last_failure_at
  • last_latency_ms
  • packet loss
  • jitter
  • reliability score
  • recent failure history
  • observed load hint where allowed
  • active/warm/cold state
  • last selected route id if applicable

Runtime observations are hints. They are not durable authority.
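
One way to keep the two kinds of data apart is to hold the directory-derived fields and the runtime hints in separate structures. The sketch below assumes hypothetical names (`RuntimeObservations`, `PeerCacheEntry`, `record`) and an exponential moving average for the reliability score, which the model above does not prescribe.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RuntimeObservations:
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    last_latency_ms: Optional[float] = None
    reliability: float = 0.5  # assumed neutral starting score in [0, 1]
    state: str = "cold"       # "active", "warm", or "cold"

    def record(self, ok, now, latency_ms=None, alpha=0.2):
        """Fold one probe result into the hints with an exponential moving average."""
        if ok:
            self.last_success_at = now
            self.last_latency_ms = latency_ms
        else:
            self.last_failure_at = now
        sample = 1.0 if ok else 0.0
        self.reliability = (1 - alpha) * self.reliability + alpha * sample


@dataclass
class PeerCacheEntry:
    # Directory-derived fields (subset).
    node_id: str
    cluster_id: str
    config_version: int
    # Runtime hints; never treated as durable authority.
    observations: RuntimeObservations = field(default_factory=RuntimeObservations)
```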

6. Refresh Cadence

Recommended cadence:

  • active peer heartbeat: 5-15 seconds
  • active/warm latency probes: 30-120 seconds
  • warm peer validation: 2-10 minutes
  • peer directory refresh: 5-15 minutes
  • cold/bootstrap validation: periodic or on demand
  • full peer directory resync: only on version gap, signature mismatch, or policy-triggered refresh

Cadence may vary by role:

  • relay/core nodes maintain richer peer sets
  • thin/mobile nodes probe less aggressively
  • egress/service nodes prioritize peers relevant to assigned services
  • storage/config nodes prioritize configured replica peers
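
The recommended ranges could be held as configuration, for example as below. The task names, `CADENCE_SECONDS`, and `next_interval` are hypothetical, and picking a random delay inside each range (to avoid synchronized probes) is an assumption of this sketch rather than a C14 requirement.

```python
import random

# Recommended ranges from the list above, converted to seconds.
CADENCE_SECONDS = {
    "active_heartbeat": (5, 15),
    "latency_probe": (30, 120),
    "warm_validation": (120, 600),
    "directory_refresh": (300, 900),
}


def next_interval(task, rng=random):
    """Pick a randomized delay inside the recommended range for a task."""
    low, high = CADENCE_SECONDS[task]
    return rng.uniform(low, high)
```

Role-specific cadence could then be modeled as a per-role copy of this table with adjusted ranges.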

7. Peer Selection Scoring

Selection is score-based, not latency-only.

Hard checks first:

  • cluster membership
  • node identity trust
  • certificate validity
  • role compatibility
  • allowed peer relationship
  • organization/service scope
  • partition/authority policy
  • transport compatibility
  • revocation status

Soft score inputs:

  • latency
  • packet loss
  • jitter
  • reliability
  • recent failure history
  • region distance
  • node load hint
  • bandwidth availability
  • role suitability
  • route class/channel class
  • policy preference

No peer should be selected if it fails hard policy checks, even if latency is excellent.
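
The two-phase rule can be sketched as a filter followed by a weighted score. Everything named here (`Candidate`, `WEIGHTS`, `soft_score`, `select_peer`, the 500 ms latency cap) is hypothetical; only three of the soft inputs are modeled, and the hard checks are collapsed into a single precomputed flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    node_id: str
    passes_hard_checks: bool  # combined result of the hard checks listed above
    latency_ms: float
    packet_loss: float        # fraction in [0, 1]
    reliability: float        # fraction in [0, 1]


# Hypothetical weights; a real policy would supply these.
WEIGHTS = {"latency": 0.4, "loss": 0.3, "reliability": 0.3}


def soft_score(c, max_latency_ms=500.0):
    """Combine soft inputs into a score in [0, 1]; higher is better."""
    latency_term = 1.0 - min(c.latency_ms, max_latency_ms) / max_latency_ms
    return (WEIGHTS["latency"] * latency_term
            + WEIGHTS["loss"] * (1.0 - c.packet_loss)
            + WEIGHTS["reliability"] * c.reliability)


def select_peer(candidates):
    """Return the best-scoring candidate that passes hard checks, or None."""
    eligible = [c for c in candidates if c.passes_hard_checks]
    return max(eligible, key=soft_score, default=None)
```

Because ineligible candidates are removed before scoring, an excellent latency can never compensate for a failed policy check.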

8. Recovery Order

If active peers fail, recovery order is:

  1. retry active peers with bounded backoff
  2. promote warm candidates
  3. try cold/bootstrap peers
  4. query authorized storage/config discovery endpoint
  5. use last signed snapshot for degraded reconnect if policy allows
  6. reconnect to the control plane when available

Recovery must not authorize cluster mutation or high-risk actions.
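
The ordering above could be sketched as an ordered list of step names tried in sequence. The step names, `RECOVERY_ORDER`, and `recover` are hypothetical; bounded backoff and the actual connection logic are left to the caller-supplied callables.

```python
RECOVERY_ORDER = [
    "retry_active",
    "promote_warm",
    "try_cold_bootstrap",
    "query_storage_config_discovery",
    "use_last_signed_snapshot",
    "reconnect_control_plane",
]


def recover(steps, snapshot_allowed_by_policy=True):
    """Run recovery steps in order; return (step_name, peer) for the first success.

    `steps` maps a step name to a zero-argument callable that returns a peer
    identifier or None. Steps without a callable are skipped.
    """
    for name in RECOVERY_ORDER:
        if name == "use_last_signed_snapshot" and not snapshot_allowed_by_policy:
            continue  # degraded reconnect only if policy allows
        attempt = steps.get(name)
        peer = attempt() if attempt else None
        if peer is not None:
            return name, peer
    return None, None
```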

9. Channel-Aware Peer Preference

Peer choice depends on channel class.

Input/control:

  • lowest latency
  • lowest jitter
  • high reliability
  • never behind bulk traffic

Render/video:

  • bandwidth and jitter aware
  • stale-frame dropping acceptable
  • avoid paths with persistent queue growth

File transfer:

  • throughput and reliability
  • lower priority than input/control

Clipboard/control:

  • reliable bounded path
  • low volume

Telemetry:

  • low priority
  • lossy/sampled allowed

VPN/IP tunnel (future):

  • adaptive QoS
  • bulk traffic must not starve interactive sessions
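
Channel-aware preference could be modeled as a per-channel weight profile over normalized metrics. The weights, the channel keys, `channel_score`, and `prefer` are invented for this sketch, and only three channel classes are shown.

```python
# Hypothetical per-channel weights over normalized metrics in [0, 1],
# where a higher metric value means "better" (1.0 = lowest latency, and so on).
CHANNEL_WEIGHTS = {
    "input_control": {"latency": 0.4, "jitter": 0.3, "reliability": 0.3, "throughput": 0.0},
    "render_video": {"latency": 0.1, "jitter": 0.4, "reliability": 0.1, "throughput": 0.4},
    "file_transfer": {"latency": 0.0, "jitter": 0.0, "reliability": 0.5, "throughput": 0.5},
}


def channel_score(channel, metrics):
    """Weighted sum of a peer's normalized metrics for one channel class."""
    weights = CHANNEL_WEIGHTS[channel]
    return sum(weights[name] * metrics.get(name, 0.0) for name in weights)


def prefer(channel, peers):
    """Return the peer id with the highest score for the given channel class."""
    return max(peers, key=lambda pid: channel_score(channel, peers[pid]))
```

With these example weights, the same pair of peers can rank differently for input/control and for file transfer, which is the behavior the section describes.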

10. Full-Mesh Prevention

Nodes must not attempt to connect to every known node.

Limits:

  • active peers are bounded by role policy
  • warm peers are bounded by role policy
  • peer directory is scoped
  • full topology is hidden from organizations
  • service adapters never request arbitrary topology

Full topology access is reserved for roles that require it, such as platform control/admin views or selected core/route-analysis components.
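
A minimal sketch of bounded peer sets, assuming the limits come from role policy. `BoundedPeerSet` and its methods are hypothetical names used only for this example.

```python
class BoundedPeerSet:
    """Tracks active and warm peers under role-policy limits."""

    def __init__(self, max_active, max_warm):
        self.max_active = max_active
        self.max_warm = max_warm
        self.active = set()
        self.warm = set()

    def add_warm(self, node_id):
        """Admit a warm candidate only while the warm bound has room."""
        if node_id in self.active or node_id in self.warm:
            return True  # already tracked
        if len(self.warm) >= self.max_warm:
            return False
        self.warm.add(node_id)
        return True

    def promote(self, node_id):
        """Move a warm peer to active only while the active bound has room."""
        if node_id not in self.warm or len(self.active) >= self.max_active:
            return False
        self.warm.remove(node_id)
        self.active.add(node_id)
        return True
```

Rejecting admissions at the bound, rather than connecting and pruning later, keeps a node from drifting toward a full mesh as its directory grows.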

11. Security Boundaries

Peer cache must enforce:

  • cluster isolation
  • organization isolation
  • certificate fingerprint validation
  • revocation status
  • role assignment
  • allowed peer relationship
  • service scope

A compromised ordinary node should not learn full cluster topology.

Peer cache data must not include:

  • unrelated organization resources
  • raw secrets
  • broad user lists
  • arbitrary route authority
  • cross-cluster trust unless explicitly authorized

12. Multi-Cluster Peer Isolation

A node that belongs to multiple clusters keeps a separate peer cache for each cluster.

Per-cluster separation:

  • peer directory
  • endpoint candidates
  • trust roots
  • certificate fingerprints
  • active/warm/cold peer state
  • route observations
  • failure history

Cross-cluster peer discovery requires explicit trust and policy. Clusters do not form a single mesh by default.
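
Per-cluster separation can be sketched as one independent table per cluster id, with lookups that never fall back across clusters. `MultiClusterPeerCache` is a hypothetical name, and entries are plain dictionaries for brevity.

```python
from collections import defaultdict


class MultiClusterPeerCache:
    """Keeps one independent peer table per cluster id."""

    def __init__(self):
        self._by_cluster = defaultdict(dict)

    def put(self, cluster_id, node_id, entry):
        self._by_cluster[cluster_id][node_id] = entry

    def get(self, cluster_id, node_id):
        """Look up a peer strictly within one cluster; no cross-cluster fallback."""
        return self._by_cluster.get(cluster_id, {}).get(node_id)

    def peers(self, cluster_id):
        """Return the sorted node ids known for one cluster."""
        return sorted(self._by_cluster.get(cluster_id, {}))
```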

13. Storage / Snapshot Relationship

Peer directory data is distributed through signed snapshots or Fabric Storage / Config Storage artifacts.

Rules:

  • peer directory version is tracked
  • node reports last applied peer directory version
  • version gap triggers refresh/full resync
  • signature/hash mismatch rejects the directory
  • revoked peers are removed or marked unusable
  • runtime observations are preserved only when still valid for the current directory version
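
The rules above might be sketched as a single update function. This is an illustration under several assumptions: the dictionary shapes, the name `apply_directory`, treating a jump of more than one version as a gap, ignoring older versions, and reading "still valid" as "the peer is still present in the new directory". Only a SHA-256 hash check is shown; real signature verification is out of scope here.

```python
import hashlib
import json


def apply_directory(cache, update):
    """Apply a directory update to a cache dict and return the action taken.

    cache:  {"version": int, "peers": {...}, "observations": {node_id: {...}}}
    update: {"version": int, "payload": JSON str, "sha256": str, "revoked": [node_id]}
    """
    digest = hashlib.sha256(update["payload"].encode("utf-8")).hexdigest()
    if digest != update["sha256"]:
        return "rejected"              # hash mismatch rejects the directory
    if update["version"] > cache["version"] + 1:
        return "full_resync_required"  # assumed definition of a version gap
    if update["version"] <= cache["version"]:
        return "ignored"               # assumption: stale updates are ignored
    peers = json.loads(update["payload"])
    for node_id in update["revoked"]:
        peers.pop(node_id, None)       # revoked peers are removed
    cache["peers"] = peers
    cache["version"] = update["version"]
    # Keep observations only for peers still present in the new directory.
    cache["observations"] = {
        n: o for n, o in cache["observations"].items() if n in peers
    }
    return "applied"
```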

14. Service Adapter Boundary

Service Adapters may request:

  • destination node
  • resource target
  • egress node
  • egress pool
  • channel class

Service Adapters must not:

  • enumerate peers
  • select mesh routes
  • promote warm peers
  • create shortcut connections
  • implement partition recovery
  • implement cross-cluster routing policy

The Fabric Routing Engine owns those decisions.
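
One way to make this boundary hard to cross is a request type that simply has no peer or route fields. The sketch below is hypothetical: `AdapterRouteRequest`, the channel class keys, and the rule that at least one target must be named are assumptions, not a defined API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed channel class identifiers, mirroring the classes named in this document.
ALLOWED_CHANNELS = {"input_control", "render_video", "file_transfer", "clipboard", "telemetry"}


@dataclass(frozen=True)
class AdapterRouteRequest:
    """Everything a service adapter may ask for; no peer or route fields exist."""
    channel_class: str
    destination_node: Optional[str] = None
    resource_target: Optional[str] = None
    egress_node: Optional[str] = None
    egress_pool: Optional[str] = None

    def __post_init__(self):
        if self.channel_class not in ALLOWED_CHANNELS:
            raise ValueError(f"unknown channel class: {self.channel_class}")
        targets = (self.destination_node, self.resource_target,
                   self.egress_node, self.egress_pool)
        if not any(targets):
            raise ValueError("request must name at least one target")
```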

15. Observability

The node agent should report safe peer/cache metrics:

  • active peer count
  • warm peer count
  • bootstrap peer count
  • peer directory version
  • last refresh time
  • average active peer latency
  • packet loss summary
  • failed peer count
  • recovery mode if active
  • selected peer class by channel type

Reports must not expose full topology to organizations.
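
A sketch of a report that carries only aggregates, so no peer identities or endpoints leave the node. The function name `peer_metrics`, the input shape, and the metric keys are assumptions for illustration, and only a subset of the metrics above is covered.

```python
from statistics import mean


def peer_metrics(entries, directory_version, recovery_mode=None):
    """Summarize cache entries into aggregate metrics without naming any peer.

    Each entry is a dict with "state" ("active", "warm", or "cold"),
    "failed" (bool), and an optional "latency_ms".
    """
    active_latencies = [
        e["latency_ms"] for e in entries
        if e["state"] == "active" and e.get("latency_ms") is not None
    ]
    return {
        "active_peer_count": sum(e["state"] == "active" for e in entries),
        "warm_peer_count": sum(e["state"] == "warm" for e in entries),
        "bootstrap_peer_count": sum(e["state"] == "cold" for e in entries),
        "failed_peer_count": sum(bool(e["failed"]) for e in entries),
        "peer_directory_version": directory_version,
        "avg_active_latency_ms": mean(active_latencies) if active_latencies else None,
        "recovery_mode": recovery_mode,
    }
```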

16. Future Validation Tests

Future implementation tests must prove:

  • peer directory scope is enforced
  • wrong-cluster peer is rejected
  • revoked peer is rejected
  • invalid certificate fingerprint is rejected
  • full topology is not distributed to ordinary nodes
  • active peer count stays bounded
  • warm peer promotion works
  • bootstrap recovery works
  • score-based selection respects hard policy checks
  • stale runtime observations are ignored after directory version change
  • service adapter cannot bypass Fabric peer selection

17. C15 Preparation

C15 must define the Fabric Routing Engine skeleton boundary.

The routing engine will consume:

  • peer directory/cache
  • route policy
  • QoS policy
  • channel class
  • service request metadata
  • cluster/organization scope
  • failure history

C15 must not carry production mesh traffic. It should define route request and route result boundaries before runtime routing exists.

18. Result / Decision

Stage C14 defines scoped peer discovery and peer cache behavior.

Decisions:

  • nodes maintain active, warm, and cold/bootstrap peer classes
  • nodes do not maintain full mesh connections
  • peer directory data is scoped and signed
  • peer cache combines signed directory data with runtime observations
  • peer selection is score-based with hard policy checks first
  • recovery proceeds through active peers, warm candidates, cold/bootstrap peers, storage/config discovery, then the last signed snapshot
  • service adapters do not own peer discovery or route selection
  • C15 must define the Fabric Routing Engine skeleton before mesh runtime

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C14.