Fabric Area And Peer Stability Model
Status: active design correction.
This document replaces the oversimplified rule "every node must keep 3 connections" with a stability model based on failure domains ("areas"), multi-path reachability, and live peer memory.
1. Why the old "3 connections" rule is not enough
A raw connection count is too weak as a resilience rule.
Three links are not equivalent when:
- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.
Therefore the fabric must not use a single scalar count as the stability criterion.
2. Area
Introduce the concept of an area.
An area is a failure domain with high mutual reachability and shared external risk. Examples:
- home: nodes in the same home/private site;
- test: nodes in the same test Docker/LAN site;
- usa: a public node in a remote Internet site;
- ifcm: a separate NAT/domain behind another administrative boundary.
An area can be derived from:
- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.
The area label must be part of live node metadata and endpoint candidate metadata.
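As a minimal sketch of how the sources above could be combined into one label: the `AreaHints` type, its field names, the label prefixes, and the precedence order (operator label first, then cluster, egress, private network) are all assumptions made for illustration. This document does not define a precedence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AreaHints:
    """Hypothetical inputs for area derivation; all field names are illustrative."""
    operator_label: Optional[str] = None   # operator-declared site/area label
    private_network: Optional[str] = None  # shared private address space or interface group
    public_egress: Optional[str] = None    # shared public egress/NAT identity
    admin_cluster: Optional[str] = None    # shared administrative host or cluster


def derive_area(hints: AreaHints) -> str:
    """Return an area label, trying each source in an assumed precedence order."""
    if hints.operator_label:
        return hints.operator_label
    # Derived labels get a prefix to keep them visually distinct from operator labels.
    for prefix, value in (
        ("cluster", hints.admin_cluster),
        ("egress", hints.public_egress),
        ("net", hints.private_network),
    ):
        if value:
            return f"{prefix}:{value}"
    return "unknown"
```

An explicit operator label wins over any derived hint, so `derive_area(AreaHints(operator_label="home", public_egress="203.0.113.7"))` yields `home`.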
3. Stability objective
Each node should maintain a working peer set with diversity, not just count.
3.1 Minimum stable peer objective
For an ordinary production node:
- at least 2 recently verified direct-ready peers overall;
- at least 2 distinct external areas represented in the ready set when more than one external area exists;
- at least 1 persistent recovery-capable path outside the local area;
- at least 1 additional relay-ready or rendezvous-capable path outside the primary recovery path.
For an area gateway or strategically important public node:
- at least 3 direct-ready peers overall;
- at least 2 distinct external areas represented in the direct-ready set;
- at least 1 extra recovery path that does not share the same public ingress or NAT dependency.
For a node in a tiny fleet where only one external area currently exists:
- the system must report reduced-diversity mode, not pretend the target is fully satisfied.
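The ordinary-node objective can be sketched as a small check. This is an illustration under stated assumptions, not the fabric's implementation: `PeerState`, `evaluate_ordinary_node`, and the status strings are hypothetical names, the "ready set" is assumed to include relay-ready peers, and the two recovery-path requirements are left out because this document does not specify how a recovery path is represented.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class PeerState:
    """Hypothetical per-peer view held by the local node."""
    node_id: str
    area: str
    direct_ready: bool
    relay_ready: bool


def evaluate_ordinary_node(
    local_area: str, peers: List[PeerState], known_areas: Set[str]
) -> Tuple[str, List[str]]:
    """Return (status, deficits) for the count and diversity parts of the objective."""
    direct = [p for p in peers if p.direct_ready]
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    ready_external = {p.area for p in ready if p.area != local_area}
    external_known = known_areas - {local_area}

    deficits = []
    if len(direct) < 2:
        deficits.append("direct-ready deficit")
    # The diversity target only applies when more than one external area exists.
    if len(external_known) > 1 and len(ready_external) < 2:
        deficits.append("area diversity deficit")
    if deficits:
        return "deficit", deficits
    if len(external_known) <= 1:
        # Tiny fleet: report honestly instead of claiming full satisfaction.
        return "reduced-diversity", []
    return "satisfied", []
```

A `home` node whose only ready peers are `home-2` and `home-3` meets the direct-ready count but is still reported as `deficit` with an area diversity deficit.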
3.2 What counts as "ready"
ready means:
- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility Control API/downloads overlap path.
relay_ready does not replace direct_ready.
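The readiness conditions can be expressed as one predicate. The sketch below is illustrative only: `PathObservation`, its fields, and the 60-second freshness window `MAX_VERIFY_AGE_S` are assumptions, since this document does not quantify "recently verified".

```python
from dataclasses import dataclass
from typing import Optional

MAX_VERIFY_AGE_S = 60.0  # assumed freshness window; not specified by this document


@dataclass(frozen=True)
class PathObservation:
    """Hypothetical record of what the node knows about one path to a peer."""
    last_verified_at: Optional[float]  # timestamp of last successful verification
    quic_usable: bool                  # usable for immediate QUIC route establishment
    pending_relay_replacement: bool    # blocked on stale relay replacement
    compatibility_only: bool           # only a compatibility overlap path


def is_ready(obs: PathObservation, now: float) -> bool:
    """True only when every readiness condition holds at time `now`."""
    if obs.last_verified_at is None:
        return False  # historical candidate that was never verified
    if now - obs.last_verified_at > MAX_VERIFY_AGE_S:
        return False  # stale observation
    return (
        obs.quic_usable
        and not obs.pending_relay_replacement
        and not obs.compatibility_only
    )
```

The same predicate would be evaluated separately for direct and relay paths, so that a relay-ready result is never counted as direct-ready.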
4. What a node must remember
Every node must keep a live working set, not just a tiny current-peer list.
Minimum retained peer memory:
- all currently healthy nodes in the fleet, when the fleet is small enough;
- for larger fleets, a bounded full directory plus prioritized recent working peers;
- for every known node:
  - node id
  - area
  - role summary
  - latest verified direct candidates
  - latest verified relay/rendezvous candidates
  - last success timestamp
  - last failure class
  - NAT / ingress dependency hints
  - cert pin / authority compatibility metadata
For the current fleet size, every node should retain the full directory of all other nodes. At 6-8 nodes, scale is not an excuse for remembering less.
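One possible shape for the retained peer memory is sketched below. `PeerRecord` and `PeerDirectory` are hypothetical names, the field types are guesses (candidates as plain strings, timestamps as floats), and the `"timeout"` failure class in the usage note is an invented example value.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class PeerRecord:
    """One directory entry per known node, mirroring the fields listed above."""
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # latest verified direct
    relay_candidates: List[str] = field(default_factory=list)   # latest verified relay/rendezvous
    last_success_at: Optional[float] = None
    last_failure_class: Optional[str] = None
    ingress_hints: List[str] = field(default_factory=list)      # NAT / ingress dependency hints
    cert_pin: Optional[str] = None                              # cert pin / authority metadata


class PeerDirectory:
    """Full in-memory directory; unbounded, which is only reasonable for a small fleet."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def get(self, node_id: str) -> PeerRecord:
        return self._records[node_id]

    def record_success(self, node_id: str, at: float) -> None:
        self._records[node_id].last_success_at = at

    def record_failure(self, node_id: str, failure_class: str) -> None:
        # The last success timestamp is kept; only the failure class changes.
        self._records[node_id].last_failure_class = failure_class

    def areas(self) -> Set[str]:
        return {r.area for r in self._records.values()}
```

For example, after `upsert(PeerRecord("usa-los-1", "usa"))` and `record_failure("usa-los-1", "timeout")`, the entry stays in the directory with its area and failure class available for later replacement decisions.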
5. Probe strategy
The node should not aggressively probe every possible path at full frequency. It should maintain a layered strategy.
5.1 Hot set
Always keep a hot set of:
- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.
These should be revalidated frequently.
5.2 Warm set
Maintain a warm set of:
- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.
These should be revalidated on a slower cadence and promoted when diversity or direct-ready count drops.
5.3 Cold directory
Retain the full known directory and signed registry records, even if not actively probed at the same rate.
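The layering can be sketched as a tier assignment over the known directory. This is a simplified illustration: `KnownPeer` and `assign_tiers` are hypothetical, the alternate peer for an external area is assumed to be the first previously successful non-direct peer in node-id order, and the dedicated recovery peer is not modeled. Peers marked `cold` remain in the directory; they are just not probed at hot or warm cadence.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class KnownPeer:
    """Hypothetical directory view used only for tiering."""
    node_id: str
    area: str
    direct_ready: bool
    ever_succeeded: bool


def assign_tiers(local_area: str, peers: List[KnownPeer]) -> Dict[str, str]:
    """Map node id to "hot", "warm", or "cold"."""
    ordered = sorted(peers, key=lambda p: p.node_id)  # deterministic choice of alternates
    tiers: Dict[str, str] = {}
    alternate_areas: Set[str] = set()

    # Hot: current direct-ready peers plus one alternate per external area.
    for peer in ordered:
        if peer.direct_ready:
            tiers[peer.node_id] = "hot"
        elif (peer.ever_succeeded and peer.area != local_area
              and peer.area not in alternate_areas):
            tiers[peer.node_id] = "hot"
            alternate_areas.add(peer.area)

    # Warm: previously successful peers, or peers from areas with no hot peer.
    hot_areas = {p.area for p in ordered if p.node_id in tiers}
    for peer in ordered:
        if peer.node_id in tiers:
            continue
        if peer.ever_succeeded or peer.area not in hot_areas:
            tiers[peer.node_id] = "warm"
        else:
            tiers[peer.node_id] = "cold"
    return tiers
```

A never-contacted peer from an area with no hot peer lands in the warm set, which matches the intent of keeping candidates that would restore diversity close at hand.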
6. Failure handling
When a direct-ready peer is lost:
- do not merely replace it with the numerically cheapest peer;
- prefer restoring:
  - area diversity
  - independent ingress diversity
  - direct-ready count
- only then fall back to relay-ready stabilization if direct replacement is not currently available.
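A replacement choice along these lines might look as follows. The sketch assumes the listed preferences are applied in order and that each peer exposes a single ingress identifier; `Candidate`, `pick_replacement`, and the ingress strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical replacement candidate taken from the warm set or directory."""
    node_id: str
    area: str
    ingress: str        # identifier of the public ingress / NAT the path depends on
    direct_ready: bool


def pick_replacement(
    local_area: str, remaining: List[Candidate], candidates: List[Candidate]
) -> Optional[Candidate]:
    """Choose a replacement given the peers that are still ready (`remaining`)."""
    covered_areas = {p.area for p in remaining}
    covered_ingress = {p.ingress for p in remaining}

    def diversity_gain(c: Candidate) -> Tuple[bool, bool]:
        adds_area = c.area != local_area and c.area not in covered_areas
        adds_ingress = c.ingress not in covered_ingress
        return (adds_area, adds_ingress)  # area diversity outranks ingress diversity

    # Relay-only candidates are considered only when no direct replacement exists.
    direct = [c for c in candidates if c.direct_ready]
    pool = direct if direct else candidates
    if not pool:
        return None
    return max(pool, key=diversity_gain)
```

With `test-1` still ready on a `home` node, a direct-ready `usa-los-1` is chosen over `home-2` or `test-2`, because it adds both a new external area and a new ingress.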
7. Implications for the current fleet
Current area mapping should be treated approximately as:
- home: home-1, home-2, home-3
- test: test-1, test-2, test-3
- usa: usa-los-1
- ifcm: ifcm-rufms-s-mo1cr
Under this model:
- a node in home should avoid satisfying its minimum peer objective using only home peers plus one relay;
- usa-los-1 and ifcm-rufms-s-mo1cr should both maintain direct-ready links that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even if its total peer count still looks healthy.
8. Required implementation changes
- Add area to node metadata and endpoint candidate metadata.
- Track peer readiness by area, not only total count.
- Separate:
  - direct_ready_count
  - relay_ready_count
  - external_area_ready_count
  - independent_ingress_ready_count
- Alert on:
  - zero recovery path outside the local area
  - direct-ready deficit
  - area diversity deficit
  - registry resolution deficit
- Preserve a full node directory for the current small fleet.
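The separated counters and the alerts derived from them could be computed roughly as below. This is a sketch under assumptions: `PeerStatus` and both function names are hypothetical, the thresholds default to the ordinary-node values, "zero recovery path outside the local area" is approximated as having no ready peer in any external area, and the registry resolution alert is omitted because its inputs are not described here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PeerStatus:
    """Hypothetical per-peer readiness snapshot."""
    node_id: str
    area: str
    ingress: str
    direct_ready: bool
    relay_ready: bool


def readiness_counters(local_area: str, peers: List[PeerStatus]) -> Dict[str, int]:
    """Compute the four counters that replace the single peer count."""
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    return {
        "direct_ready_count": sum(p.direct_ready for p in peers),
        "relay_ready_count": sum(p.relay_ready for p in peers),
        "external_area_ready_count": len({p.area for p in ready if p.area != local_area}),
        "independent_ingress_ready_count": len({p.ingress for p in ready}),
    }


def alerts(
    counters: Dict[str, int], min_direct: int = 2, min_external_areas: int = 2
) -> List[str]:
    """Return the alert names raised by a counter snapshot."""
    raised = []
    if counters["external_area_ready_count"] == 0:
        raised.append("zero recovery path outside the local area")
    if counters["direct_ready_count"] < min_direct:
        raised.append("direct-ready deficit")
    if counters["external_area_ready_count"] < min_external_areas:
        raised.append("area diversity deficit")
    return raised
```

A `home` node with two direct-ready `home` peers and one relay-ready `usa-los-1` has three ready peers in total, yet still raises an area diversity deficit, which is the case a raw connection count would hide.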