# Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3 connections" with a stability model based on failure domains ("areas"), multi-path reachability, and live peer memory.

## 1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links are not equivalent when:

- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability criterion.

## 2. Area

Introduce the concept of an `area`.

An area is a failure domain with high mutual reachability and shared external risk.

Examples:

- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary

An area can be derived from:

- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate metadata.

## 3. Stability objective

Each node should maintain a working peer set with diversity, not just count.

### 3.1 Minimum stable peer objective

For an ordinary production node:

- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the primary recovery path.

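The ordinary-node objective above could be checked roughly as follows. This is a minimal sketch, not the fabric's implementation: the `Peer` record, its field names, the deficit labels, and `check_ordinary_node` are all hypothetical. It assumes that "recently verified" has already been decided upstream, that the "ready set" means peers that are direct-ready or relay-ready, and that the "additional" path must go through a different peer than the recovery path.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Peer:
    """Hypothetical per-peer summary; field names are illustrative only."""
    node_id: str
    area: str
    direct_ready: bool = False      # recently verified, usable for a direct route now
    relay_ready: bool = False       # relay- or rendezvous-capable path
    recovery_capable: bool = False  # persistent recovery-capable path


def check_ordinary_node(local_area: str,
                        peers: Iterable[Peer],
                        known_external_areas: Set[str]) -> List[str]:
    """Return the list of unmet objectives (empty means the target is met)."""
    peers = list(peers)
    deficits: List[str] = []

    # At least 2 direct-ready peers overall.
    if sum(p.direct_ready for p in peers) < 2:
        deficits.append("direct_ready_deficit")

    # At least 2 distinct external areas in the ready set, when more than one exists.
    ready_external = {p.area for p in peers
                      if (p.direct_ready or p.relay_ready) and p.area != local_area}
    if len(known_external_areas) <= 1:
        # Tiny fleet: report the reduced mode instead of claiming success.
        deficits.append("reduced_diversity_mode")
    elif len(ready_external) < 2:
        deficits.append("area_diversity_deficit")

    # At least 1 recovery-capable path outside the local area.
    recovery = [p for p in peers if p.recovery_capable and p.area != local_area]
    if not recovery:
        deficits.append("no_external_recovery_path")

    # At least 1 additional relay-ready path that is not the recovery peer itself.
    relay = [p for p in peers if p.relay_ready]
    if recovery:
        has_extra = any(x.node_id != r.node_id for r in recovery for x in relay)
    else:
        has_extra = bool(relay)
    if not has_extra:
        deficits.append("no_additional_relay_path")

    return deficits
```

With this sketch, a `home` node whose only ready peers are two `home` peers plus one relay-ready `usa` peer reports both an area diversity deficit and a missing external recovery path, even though it holds three connections.
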
For an area gateway or strategically important public node:

- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

- the system must report `reduced-diversity mode`, not pretend the target is fully satisfied.

### 3.2 What counts as "ready"

`ready` means:

- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.

`relay_ready` does not replace `direct_ready`.

## 4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata

For the current fleet size, every node should indeed be capable of remembering the full directory of every other node. There is no scale excuse at 6-8 nodes.

## 5. Probe strategy

The node should not aggressively probe every possible path at full frequency. It should maintain a layered strategy.

### 5.1 Hot set

Always keep a hot set of:

- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.

These should be revalidated frequently.

### 5.2 Warm set

Maintain a warm set of:

- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted when diversity or direct-ready count drops.

### 5.3 Cold directory

Retain the full known directory and signed registry records, even if not actively probed at the same rate.

## 6. Failure handling

When a direct-ready peer is lost:

1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is not currently available.

## 7. Implications for the current fleet

Current area mapping should be treated approximately as:

- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`

Under this model:

- a node in `home` should avoid satisfying its minimum peer objective using only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even if its total peer count still looks healthy.

## 8. Required implementation changes

1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
   - `direct_ready_count`
   - `relay_ready_count`
   - `external_area_ready_count`
   - `independent_ingress_ready_count`
4. Alert on:
   - zero recovery path outside the local area
   - direct-ready deficit
   - area diversity deficit
   - registry resolution deficit
5. Preserve a full node directory for the current small fleet.
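
The separated counters and alerts in items 3 and 4 could be derived along these lines. This is a sketch under stated assumptions rather than a specification: `PeerState`, `ReadinessCounters`, `compute_counters`, `alerts`, the `ingress_id` hint, and the alert labels are hypothetical names. It assumes `external_area_ready_count` counts distinct external areas with at least one direct-ready peer, and `independent_ingress_ready_count` counts distinct ingress hints among direct-ready peers. The registry resolution alert is left out because this document does not define how it is measured.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PeerState:
    """Hypothetical live record for one known peer."""
    node_id: str
    area: str
    ingress_id: Optional[str] = None  # assumed NAT / public ingress dependency hint
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


@dataclass(frozen=True)
class ReadinessCounters:
    direct_ready_count: int
    relay_ready_count: int
    external_area_ready_count: int
    independent_ingress_ready_count: int
    external_recovery_count: int  # extra counter used for the recovery alert


def compute_counters(local_area: str, peers: Iterable[PeerState]) -> ReadinessCounters:
    peers = list(peers)
    direct = [p for p in peers if p.direct_ready]
    return ReadinessCounters(
        direct_ready_count=len(direct),
        relay_ready_count=sum(p.relay_ready for p in peers),
        # Distinct external areas reachable through a direct-ready peer.
        external_area_ready_count=len({p.area for p in direct if p.area != local_area}),
        # Distinct ingress hints among direct-ready peers; unknown hints are ignored.
        independent_ingress_ready_count=len(
            {p.ingress_id for p in direct if p.ingress_id is not None}),
        external_recovery_count=sum(
            p.recovery_capable and p.area != local_area for p in peers),
    )


def alerts(c: ReadinessCounters, min_direct: int = 2,
           min_external_areas: int = 2) -> List[str]:
    """Map counters to alert labels; thresholds default to the ordinary-node targets."""
    out: List[str] = []
    if c.external_recovery_count == 0:
        out.append("no_external_recovery_path")
    if c.direct_ready_count < min_direct:
        out.append("direct_ready_deficit")
    if c.external_area_ready_count < min_external_areas:
        out.append("area_diversity_deficit")
    return out


# Example view from home-1; the ingress hints are made up for illustration.
view = [
    PeerState("home-2", "home", "home-nat", direct_ready=True),
    PeerState("usa-los-1", "usa", "usa-public", direct_ready=True, recovery_capable=True),
    PeerState("test-1", "test", "test-nat", relay_ready=True),
]
print(alerts(compute_counters("home", view)))
```

In this example the node holds three peers and meets the direct-ready target, yet the area diversity alert still fires because only one external area is direct-ready. A scalar connection count hides exactly this case.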