2026-05-18 21:33:39 +03:00
parent 5096155d83
commit 469fa0e860
94 changed files with 8761 additions and 8003 deletions
@@ -0,0 +1,183 @@
# Fabric Area And Peer Stability Model
Status: active design correction.
This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.
## 1. Why the old "3 connections" rule is not enough
A raw connection count is too weak as a resilience rule.
Three links do not provide three independent paths when:
- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.
Therefore the fabric must not use a single scalar count as the stability
criterion.
## 2. Area
Introduce the concept of an `area`.
An area is a failure domain with high mutual reachability and shared external
risk. Examples:
- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary
An area can be derived from:
- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.
The area label must be part of live node metadata and endpoint candidate
metadata.
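A minimal sketch of how an area label might be resolved from the sources listed above. The precedence order (operator label first, then NAT identity, then private subnet, then administrative cluster), the `AreaHints` type, and the label prefixes are assumptions for illustration; this document does not fix any of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AreaHints:
    """Hypothetical inputs from which an area label could be derived."""
    operator_label: Optional[str] = None   # operator-declared site/area label
    private_subnet: Optional[str] = None   # shared private address space
    public_egress: Optional[str] = None    # shared public egress/NAT identity
    admin_cluster: Optional[str] = None    # shared administrative host or cluster


def derive_area(hints: AreaHints) -> str:
    """Return an area label, trying sources in an assumed precedence order."""
    if hints.operator_label:
        return hints.operator_label
    if hints.public_egress:
        return f"nat:{hints.public_egress}"
    if hints.private_subnet:
        return f"net:{hints.private_subnet}"
    if hints.admin_cluster:
        return f"cluster:{hints.admin_cluster}"
    return "unknown"
```

An explicit operator label wins here because it is the only source that cannot be misread from network state; a different order may suit other deployments.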
## 3. Stability objective
Each node should maintain a working peer set with diversity, not just count.
### 3.1 Minimum stable peer objective
For an ordinary production node:
- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more
than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the
primary recovery path.
For an area gateway or strategically important public node:
- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress
or NAT dependency.
For a node in a tiny fleet where only one external area currently exists:
- the system must report `reduced-diversity mode`, not pretend the target is
fully satisfied.
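The ordinary-node objective above can be sketched as a single check. This is an illustration under stated assumptions, not the fabric's implementation: `ReadyPeer`, `evaluate_ordinary_node`, and the choice of the first external recovery-capable peer as the "primary recovery path" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class ReadyPeer:
    """Hypothetical view of one recently verified peer."""
    node_id: str
    area: str
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


def evaluate_ordinary_node(local_area: str, peers: List[ReadyPeer],
                           known_external_areas: Set[str]) -> Dict[str, object]:
    external = [p for p in peers if p.area != local_area]
    direct_ready = [p for p in peers if p.direct_ready]
    ready_areas = {p.area for p in external if p.direct_ready or p.relay_ready}
    recovery = [p for p in external if p.recovery_capable]
    # Assumption: the first external recovery-capable peer is the primary path.
    primary = recovery[0].node_id if recovery else None
    extra_relay = [p for p in peers if p.relay_ready and p.node_id != primary]

    # With at most one external area the two-area target cannot be met.
    reduced = len(known_external_areas) <= 1
    required_areas = min(2, len(known_external_areas))
    checks = {
        "direct_ready": len(direct_ready) >= 2,
        "area_diversity": len(ready_areas) >= required_areas,
        "recovery_path": len(recovery) >= 1,
        "extra_relay_path": len(extra_relay) >= 1,
    }
    return {
        "checks": checks,
        "mode": "reduced-diversity" if reduced else "normal",
        # Never report full satisfaction while in reduced-diversity mode.
        "fully_satisfied": all(checks.values()) and not reduced,
    }
```

Keeping `mode` separate from `checks` lets a tiny fleet pass every achievable check while still reporting that the diversity target is not met.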
### 3.2 What counts as "ready"
`ready` means:
- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.
`relay_ready` does not replace `direct_ready`.
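A sketch of the readiness predicate, assuming a hypothetical `PathState` record per path. The 60-second freshness window is a placeholder; this document does not define what "recently" means in seconds.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed freshness window in seconds; the design does not specify a value.
MAX_VERIFY_AGE_S = 60.0


@dataclass(frozen=True)
class PathState:
    """Hypothetical per-path state used to decide readiness."""
    last_verified_at: Optional[float]  # None means a historical candidate only
    quic_usable: bool                  # usable for immediate QUIC route establishment
    awaiting_relay_replacement: bool   # blocked on stale relay replacement
    compat_only: bool                  # only a compatibility overlap path
    via_relay: bool                    # True for relay paths, False for direct ones


def is_ready(path: PathState, now: float) -> bool:
    if path.last_verified_at is None:
        return False
    recently_verified = (now - path.last_verified_at) <= MAX_VERIFY_AGE_S
    return (recently_verified and path.quic_usable
            and not path.awaiting_relay_replacement
            and not path.compat_only)


def is_direct_ready(path: PathState, now: float) -> bool:
    return is_ready(path, now) and not path.via_relay


def is_relay_ready(path: PathState, now: float) -> bool:
    # Kept separate so a relay path can never be counted as direct-ready.
    return is_ready(path, now) and path.via_relay
```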
## 4. What a node must remember
Every node must keep a live working set, not just a tiny current-peer list.
Minimum retained peer memory:
1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working
peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata
For the current fleet size, every node should retain the full directory of all
other nodes. At 6-8 nodes, scale is no excuse for keeping less.
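The per-node record listed above could be held in a structure like the following. Field and class names are hypothetical; only the set of remembered attributes comes from this document.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerRecord:
    """One remembered node; field names are illustrative."""
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # latest verified direct
    relay_candidates: List[str] = field(default_factory=list)   # relay/rendezvous
    last_success_at: Optional[float] = None
    last_failure_class: Optional[str] = None
    ingress_hints: List[str] = field(default_factory=list)      # NAT / ingress dependency
    cert_pin: Optional[str] = None                              # cert pin / authority metadata


class PeerDirectory:
    """Full in-memory directory, suitable for a fleet of a few nodes."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def get(self, node_id: str) -> Optional[PeerRecord]:
        return self._records.get(node_id)

    def record_success(self, node_id: str, at: float, direct_candidate: str) -> None:
        rec = self._records[node_id]
        rec.last_success_at = at
        if direct_candidate not in rec.direct_candidates:
            rec.direct_candidates.insert(0, direct_candidate)

    def record_failure(self, node_id: str, failure_class: str) -> None:
        # A failure updates the record but never evicts it from memory.
        self._records[node_id].last_failure_class = failure_class

    def __len__(self) -> int:
        return len(self._records)
```

The directory keeps failed peers on purpose: a node that is unreachable now may be the only peer that restores diversity later.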
## 5. Probe strategy
The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.
### 5.1 Hot set
Always keep a hot set of:
- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.
These should be revalidated frequently.
### 5.2 Warm set
Maintain a warm set of:
- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.
These should be revalidated on a slower cadence and promoted when diversity or
direct-ready count drops.
### 5.3 Cold directory
Retain the full known directory and signed registry records, even if not
actively probed at the same rate.
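The three layers can be sketched as a tier assignment over the known directory. This is an illustration under assumptions: `TierPeer` and `assign_tiers` are hypothetical names, and "one alternate peer per external area" is read here as one peer per external area that is not already hot, preferring a previously successful one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TierPeer:
    """Hypothetical summary of a known peer for tier assignment."""
    node_id: str
    area: str
    direct_ready: bool = False
    recovery_capable: bool = False
    ever_succeeded: bool = False


def assign_tiers(local_area: str, peers: List[TierPeer]) -> Dict[str, str]:
    # Everything starts in the cold directory.
    tiers = {p.node_id: "cold" for p in peers}
    # Previously successful peers are at least warm.
    for p in peers:
        if p.ever_succeeded:
            tiers[p.node_id] = "warm"
    # Current direct-ready peers are hot.
    for p in peers:
        if p.direct_ready:
            tiers[p.node_id] = "hot"
    # One recovery peer outside the local area is hot.
    for p in peers:
        if p.area != local_area and p.recovery_capable:
            tiers[p.node_id] = "hot"
            break
    # One alternate per external area, chosen among peers not already hot.
    by_area: Dict[str, List[TierPeer]] = {}
    for p in peers:
        if p.area != local_area and tiers[p.node_id] != "hot":
            by_area.setdefault(p.area, []).append(p)
    for candidates in by_area.values():
        best = max(candidates, key=lambda c: c.ever_succeeded)
        tiers[best.node_id] = "hot"
    return tiers
```

Probe cadence would then be a function of the tier, for example a short interval for `hot`, a longer one for `warm`, and no routine probing for `cold`; the actual intervals are left open here.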
## 6. Failure handling
When a direct-ready peer is lost:
1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is
not currently available.
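The replacement rule above can be sketched as a ranked selection. Treating the listed order (area diversity, then ingress diversity) as a strict priority is an assumption, as are the `Candidate` type and the `pick_replacement` name.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical peer description used when choosing a replacement."""
    node_id: str
    area: str
    ingress: str               # identifier of the public ingress / NAT it depends on
    direct_ready: bool = False
    relay_ready: bool = False


def pick_replacement(local_area: str, remaining: List[Candidate],
                     candidates: List[Candidate]) -> Optional[Candidate]:
    covered_areas = {p.area for p in remaining}
    covered_ingress = {p.ingress for p in remaining}

    def diversity_key(c: Candidate) -> Tuple[bool, bool]:
        # Tuples compare left to right: a new external area outranks a new ingress.
        return (
            c.area != local_area and c.area not in covered_areas,
            c.ingress not in covered_ingress,
        )

    # Direct-ready candidates are always tried before relay-ready ones.
    direct = [c for c in candidates if c.direct_ready]
    if direct:
        return max(direct, key=diversity_key)
    relay = [c for c in candidates if c.relay_ready]
    if relay:
        return max(relay, key=diversity_key)
    return None
```

Cost or latency does not appear in the key at all, which is the point of step 1: the cheapest peer is only chosen when it also happens to restore the most diversity.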
## 7. Implications for the current fleet
Current area mapping should be treated approximately as:
- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`
Under this model:
- a node in `home` should avoid satisfying its minimum peer objective using
only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
that span at least two external areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even
if its total peer count still looks healthy.
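As a worked example on the mapping above, the following sketch flags a node whose peer count looks healthy while its cross-area diversity is gone. The function names and the default threshold of two external areas are illustrative.

```python
from typing import Dict, Iterable, Set

# Area mapping taken from the list above.
AREA_OF: Dict[str, str] = {
    "home-1": "home", "home-2": "home", "home-3": "home",
    "test-1": "test", "test-2": "test", "test-3": "test",
    "usa-los-1": "usa",
    "ifcm-rufms-s-mo1cr": "ifcm",
}


def external_areas_covered(node: str, ready_peers: Iterable[str]) -> Set[str]:
    """Areas other than the node's own that appear in its ready peer set."""
    return {AREA_OF[p] for p in ready_peers} - {AREA_OF[node]}


def should_alert_diversity(node: str, ready_peers: Iterable[str],
                           required: int = 2) -> bool:
    return len(external_areas_covered(node, ready_peers)) < required


# Three ready peers, but only one external area: the alert fires.
print(should_alert_diversity("home-1", ["home-2", "home-3", "test-1"]))
```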
## 8. Required implementation changes
1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
   - `direct_ready_count`
   - `relay_ready_count`
   - `external_area_ready_count`
   - `independent_ingress_ready_count`
4. Alert on:
   - zero recovery path outside the local area
   - direct-ready deficit
   - area diversity deficit
   - registry resolution deficit
5. Preserve a full node directory for the current small fleet.
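Items 3 and 4 can be sketched together: compute the separated counters, then derive alerts from them. The `PeerStatus` type, the alert identifiers, the thresholds, and the use of an unresolved-record count as the signal for a registry resolution deficit are assumptions, not defined by this document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PeerStatus:
    """Hypothetical per-peer status as seen by one node."""
    node_id: str
    area: str
    ingress: str
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


def readiness_counters(local_area: str, peers: List[PeerStatus]) -> Dict[str, int]:
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    external_ready = [p for p in ready if p.area != local_area]
    return {
        "direct_ready_count": sum(p.direct_ready for p in peers),
        "relay_ready_count": sum(p.relay_ready for p in peers),
        "external_area_ready_count": len({p.area for p in external_ready}),
        "independent_ingress_ready_count": len({p.ingress for p in ready}),
    }


def alerts(local_area: str, peers: List[PeerStatus],
           unresolved_registry_records: int = 0,
           min_direct: int = 2, min_external_areas: int = 2) -> List[str]:
    c = readiness_counters(local_area, peers)
    has_external_recovery = any(
        p.recovery_capable and p.area != local_area for p in peers)
    out: List[str] = []
    if not has_external_recovery:
        out.append("no_external_recovery_path")
    if c["direct_ready_count"] < min_direct:
        out.append("direct_ready_deficit")
    if c["external_area_ready_count"] < min_external_areas:
        out.append("area_diversity_deficit")
    if unresolved_registry_records > 0:
        out.append("registry_resolution_deficit")
    return out
```

Because the counters are kept separate, a node with two local direct-ready peers and one external relay passes the direct-ready check yet still raises `area_diversity_deficit`.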