# Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3 connections" with a stability model based on failure domains ("areas"), multi-path reachability, and live peer memory.

It operates at the `Fabric Transport` layer. Services above the transport must consume service channels and must not directly reason about peer topology. See [FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).

## 1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links are not equivalent when:

- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability criterion.

## 2. Area

Introduce the concept of an `area`.

An area is a failure domain with high mutual reachability and shared external risk.

Examples:

- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary

An area can be derived from:

- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate metadata.

For the current fleet, area assignment should be explicit operator metadata, not an inference hidden only inside routing code.

## 3. Stability objective

Each node should maintain a working peer set with diversity, not just count.
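To make the difference between count and diversity concrete, the following minimal sketch compares two peer sets of equal size as seen from a node in `home`. The `Peer` record and the `external_areas` helper are hypothetical names introduced only for illustration; they are not part of the existing transport code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str           # operator-declared area label (section 2)
    direct_ready: bool  # hypothetical flag: recently verified direct path


def external_areas(local_area: str, peers: list[Peer]) -> set[str]:
    """Distinct areas, other than the local one, that have a direct-ready peer."""
    return {p.area for p in peers if p.direct_ready and p.area != local_area}


# Two peer sets with the same scalar count, seen from a node in `home`.
clustered = [Peer("home-2", "home", True), Peer("home-3", "home", True)]
diverse = [Peer("test-1", "test", True), Peer("usa-los-1", "usa", True)]

clustered_areas = external_areas("home", clustered)
diverse_areas = external_areas("home", diverse)
```

Both sets contain two direct-ready peers, but `clustered_areas` is empty while `diverse_areas` spans two external areas. A scalar count cannot distinguish these cases.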

### 3.1 Minimum stable peer objective

For an ordinary production node:

- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the primary recovery path.

For an area gateway or strategically important public node:

- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

- the system must report `reduced-diversity mode`, not pretend the target is fully satisfied.

### 3.2 What counts as "ready"

`ready` means:

- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.

`relay_ready` does not replace `direct_ready`.

## 4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata

For the current fleet size, every node should indeed be capable of remembering the full directory of every other node. There is no scale excuse at 6-8 nodes.

## 5. Probe strategy

The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.

### 5.1 Hot set

Always keep a hot set of:

- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.

These should be revalidated frequently.

### 5.2 Warm set

Maintain a warm set of:

- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted when diversity or direct-ready count drops.

### 5.3 Cold directory

Retain the full known directory and signed registry records, even if not actively probed at the same rate.

## 6. Failure handling

When a direct-ready peer is lost:

1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is not currently available.

## 7. Implications for the current fleet

Current area mapping should be treated approximately as:

- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`

Under this model:

- a node in `home` should avoid satisfying its minimum peer objective using only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even if its total peer count still looks healthy.

## 8. Required implementation changes

1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
   - `direct_ready_count`
   - `relay_ready_count`
   - `external_area_ready_count`
   - `independent_ingress_ready_count`
4. Alert on:
   - zero recovery path outside the local area
   - direct-ready deficit
   - area diversity deficit
   - registry resolution deficit
5. Preserve a full node directory for the current small fleet.
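
The separated counters and the first three alerts could be computed roughly as in the sketch below. It is an illustration under stated assumptions, not the project's implementation: `PeerState`, its `ingress` field, the function names, and the alert strings are all hypothetical. It also assumes that `external_area_ready_count` counts only direct-ready peers and that a relay-ready peer outside the local area qualifies as a recovery path. The registry resolution alert is omitted because this document does not define how it is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerState:
    node_id: str
    area: str
    ingress: str                # hypothetical NAT / public-ingress dependency hint
    direct_ready: bool = False
    relay_ready: bool = False


def readiness_counters(local_area: str, peers: list[PeerState]) -> dict[str, int]:
    """Compute the four separated counters for one node's peer set."""
    direct = [p for p in peers if p.direct_ready]
    return {
        "direct_ready_count": len(direct),
        "relay_ready_count": sum(1 for p in peers if p.relay_ready),
        # Assumption: only direct-ready peers count toward area diversity.
        "external_area_ready_count": len({p.area for p in direct if p.area != local_area}),
        "independent_ingress_ready_count": len({p.ingress for p in direct}),
    }


def stability_alerts(local_area: str, peers: list[PeerState],
                     known_external_areas: int,
                     min_direct: int = 2, min_areas: int = 2) -> list[str]:
    """Return alert names for an ordinary production node (section 3.1 targets)."""
    counters = readiness_counters(local_area, peers)
    alerts = []
    # Assumption: any direct- or relay-ready peer outside the local area is a recovery path.
    if not any(p.area != local_area and (p.direct_ready or p.relay_ready) for p in peers):
        alerts.append("no_recovery_path_outside_local_area")
    if counters["direct_ready_count"] < min_direct:
        alerts.append("direct_ready_deficit")
    if known_external_areas <= 1:
        # Tiny fleet: report reduced diversity instead of claiming the target is met.
        alerts.append("reduced_diversity_mode")
    elif counters["external_area_ready_count"] < min_areas:
        alerts.append("area_diversity_deficit")
    return alerts


# `home-1` holding two local direct peers plus one external relay (the case section 7 warns about).
home_1_peers = [
    PeerState("home-2", "home", "home-nat", direct_ready=True),
    PeerState("home-3", "home", "home-nat", direct_ready=True),
    PeerState("usa-los-1", "usa", "usa-public", relay_ready=True),
]
counters = readiness_counters("home", home_1_peers)
alerts = stability_alerts("home", home_1_peers, known_external_areas=3)
```

In this example the direct-ready count meets the minimum of `2`, yet `alerts` contains `area_diversity_deficit` because no direct-ready peer lies outside `home`. This is the situation a total-count rule would miss.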