# Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.

## 1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links are not equivalent to three independent paths when:

- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability
criterion.

## 2. Area

Introduce the concept of an `area`.

An area is a failure domain with high mutual reachability and shared external
risk. Examples:

- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary

An area can be derived from:

- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate
metadata.
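
The derivation sources above could be combined roughly as follows. This is a minimal sketch, not the fabric's actual code: `AreaHints`, `derive_area`, the field names, and the label prefixes are all hypothetical, and the precedence (the order in which the sources are listed) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AreaHints:
    """Inputs an area could be derived from (all field names are hypothetical)."""
    declared_area: Optional[str] = None    # operator-declared site/area label
    private_network: Optional[str] = None  # shared private address space or interface group
    public_egress: Optional[str] = None    # shared public egress/NAT identity
    admin_cluster: Optional[str] = None    # shared administrative host or cluster


def derive_area(hints: AreaHints) -> str:
    """Return an area label, trying the sources in their listed order.

    The precedence order is an assumption of this sketch.
    """
    if hints.declared_area:
        return hints.declared_area
    if hints.private_network:
        return f"net:{hints.private_network}"
    if hints.public_egress:
        return f"egress:{hints.public_egress}"
    if hints.admin_cluster:
        return f"cluster:{hints.admin_cluster}"
    return "unknown"
```

An operator label wins when present, so `derive_area(AreaHints(declared_area="home", private_network="192.168.1.0/24"))` yields `home`.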

## 3. Stability objective

Each node should maintain a working peer set with diversity, not just count.

### 3.1 Minimum stable peer objective

For an ordinary production node:

- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more
  than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the
  primary recovery path.

For an area gateway or strategically important public node:

- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress
  or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

- the system must report `reduced-diversity mode`, not pretend the target is
  fully satisfied.
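
The ordinary-node objective can be checked mechanically. The sketch below is illustrative only: `Peer`, `evaluate_ordinary_node`, and the deficit names are hypothetical, and it assumes that the "ready set" for the area-diversity rule includes both direct-ready and relay-ready peers.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    direct_ready: bool = False
    relay_ready: bool = False        # relay-ready or rendezvous-capable
    recovery_capable: bool = False   # persistent recovery-capable path


def evaluate_ordinary_node(
    local_area: str, peers: List[Peer], known_areas: Iterable[str]
) -> Dict[str, object]:
    """Check the ordinary-node objective and report the mode plus any deficits."""
    external_areas = set(known_areas) - {local_area}
    deficits: List[str] = []

    # At least 2 direct-ready peers overall.
    if sum(1 for p in peers if p.direct_ready) < 2:
        deficits.append("direct_ready_deficit")

    # At least 2 external areas in the ready set, when more than one exists.
    reduced = len(external_areas) <= 1
    ready_external = {
        p.area
        for p in peers
        if (p.direct_ready or p.relay_ready) and p.area != local_area
    }
    if not reduced and len(ready_external) < 2:
        deficits.append("area_diversity_deficit")

    # At least 1 recovery-capable path outside the local area, plus one
    # relay-ready path that is not that same recovery peer.
    recovery = [p for p in peers if p.recovery_capable and p.area != local_area]
    if not recovery:
        deficits.append("no_external_recovery_path")
    else:
        relay = [p for p in peers if p.relay_ready]
        if not any(b.node_id != r.node_id for r in recovery for b in relay):
            deficits.append("no_additional_relay_path")

    mode = "reduced-diversity" if reduced else "full"
    return {"mode": mode, "deficits": deficits}
```

Reporting the mode separately from the deficits keeps a tiny fleet from looking fully satisfied: a node can have an empty deficit list and still be flagged as running in `reduced-diversity` mode.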

### 3.2 What counts as "ready"

`ready` means:

- recently verified;
- usable for immediate QUIC route establishment;
- not only a historical candidate;
- not blocked on stale relay replacement;
- not only a compatibility `Control API/downloads` overlap path.

`relay_ready` does not replace `direct_ready`.
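
The readiness conditions read naturally as a single predicate. The following is a sketch under stated assumptions: `PathObservation`, its fields, and the 60-second freshness window are hypothetical, since the document does not define how recent "recently verified" is.

```python
from dataclasses import dataclass
from typing import Optional

FRESHNESS_WINDOW_S = 60.0  # assumed value; the document does not specify one


@dataclass(frozen=True)
class PathObservation:
    last_verified_at: Optional[float]  # None means a historical candidate only
    quic_usable: bool                  # usable for immediate QUIC route establishment
    stale_relay_pending: bool = False  # blocked on stale relay replacement
    compat_overlap_only: bool = False  # only a compatibility overlap path


def is_ready(obs: PathObservation, now: float) -> bool:
    """Return True only if every readiness condition holds at time `now`."""
    if obs.last_verified_at is None:
        return False  # never verified: historical candidate only
    if now - obs.last_verified_at > FRESHNESS_WINDOW_S:
        return False  # verified, but not recently
    return (
        obs.quic_usable
        and not obs.stale_relay_pending
        and not obs.compat_overlap_only
    )
```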

## 4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working
   peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata

At the current fleet size, every node should be able to remember the full
directory of every other node. There is no scale excuse at 6-8 nodes.
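
One possible shape for the per-node record and the directory that holds it is sketched below. `PeerRecord`, `PeerDirectory`, and all field and method names are hypothetical. The sketch deliberately never evicts a record on failure, which is an assumption consistent with keeping the full directory at this fleet size.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerRecord:
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # latest verified first
    relay_candidates: List[str] = field(default_factory=list)   # relay/rendezvous
    last_success_at: Optional[float] = None
    last_failure_class: Optional[str] = None
    ingress_hints: List[str] = field(default_factory=list)      # NAT / ingress dependencies
    cert_pin: Optional[str] = None                               # cert pin / authority metadata


class PeerDirectory:
    """Full in-memory directory; failures update a record but never remove it."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def __len__(self) -> int:
        return len(self._records)

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def get(self, node_id: str) -> Optional[PeerRecord]:
        return self._records.get(node_id)

    def record_success(
        self, node_id: str, candidate: str, now: float, relay: bool = False
    ) -> None:
        rec = self._records[node_id]
        bucket = rec.relay_candidates if relay else rec.direct_candidates
        if candidate in bucket:
            bucket.remove(candidate)
        bucket.insert(0, candidate)  # most recently verified candidate first
        rec.last_success_at = now

    def record_failure(self, node_id: str, failure_class: str) -> None:
        self._records[node_id].last_failure_class = failure_class
```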

## 5. Probe strategy

The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.

### 5.1 Hot set

Always keep a hot set of:

- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.

These should be revalidated frequently.

### 5.2 Warm set

Maintain a warm set of:

- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted when diversity or
direct-ready count drops.

### 5.3 Cold directory

Retain the full known directory and signed registry records, even if not
actively probed at the same rate.
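
The three tiers can be sketched as a single classification pass. This is one possible reading, not a definitive algorithm: `PeerView` and `classify_tiers` are hypothetical, ties are broken by `node_id` order, and an external area is treated as "underrepresented" when it has fewer than two hot peers, which is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class PeerView:
    node_id: str
    area: str
    direct_ready: bool = False
    recovery_capable: bool = False
    previously_successful: bool = False


def classify_tiers(local_area: str, peers: List[PeerView]) -> Dict[str, Set[str]]:
    """Split known peers into hot, warm, and cold sets of node ids."""
    ordered = sorted(peers, key=lambda p: p.node_id)

    # Hot: all current direct-ready peers.
    hot = {p.node_id for p in ordered if p.direct_ready}

    # Hot: one recovery peer outside the local area.
    for p in ordered:
        if p.recovery_capable and p.area != local_area:
            hot.add(p.node_id)
            break

    # Hot: one alternate (not yet hot) peer per external area.
    areas_with_alternate: Set[str] = set()
    for p in ordered:
        if (
            p.area != local_area
            and p.node_id not in hot
            and p.area not in areas_with_alternate
        ):
            hot.add(p.node_id)
            areas_with_alternate.add(p.area)

    # Warm: previously successful peers, or peers from external areas that
    # have fewer than two hot members (assumed meaning of "underrepresented").
    hot_per_area = Counter(p.area for p in ordered if p.node_id in hot)
    warm = {
        p.node_id
        for p in ordered
        if p.node_id not in hot
        and (
            p.previously_successful
            or (p.area != local_area and hot_per_area[p.area] < 2)
        )
    }

    # Cold: everything else stays in the directory, probed least often.
    cold = {p.node_id for p in ordered} - hot - warm
    return {"hot": hot, "warm": warm, "cold": cold}
```

The cold set is never dropped; it only determines probe cadence, so a cold peer can still be promoted when a hot peer fails.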

## 6. Failure handling

When a direct-ready peer is lost:

1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is
   not currently available.
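
The replacement rule can be sketched as a ranking over candidates. `Candidate` and `choose_replacement` are hypothetical names, and treating the three "prefer restoring" items as a strict priority order is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    node_id: str
    area: str
    ingress: str               # public ingress / NAT dependency identifier
    direct_ready: bool = False
    relay_ready: bool = False


def choose_replacement(
    local_area: str, current: List[Candidate], candidates: List[Candidate]
) -> Tuple[Optional[Candidate], str]:
    """Pick a replacement for a lost direct-ready peer.

    `current` is the surviving peer set; `candidates` are peers not in it.
    """
    covered_areas = {p.area for p in current if p.direct_ready}
    covered_ingress = {p.ingress for p in current if p.direct_ready}

    direct = [c for c in candidates if c.direct_ready]
    if direct:
        def score(c: Candidate) -> Tuple[bool, bool]:
            adds_area = c.area != local_area and c.area not in covered_areas
            adds_ingress = c.ingress not in covered_ingress
            return (adds_area, adds_ingress)

        # Any direct candidate restores the count; the score decides which one
        # best restores area and ingress diversity first.
        return max(direct, key=score), "direct"

    # Only fall back to relay-ready stabilization when no direct option exists.
    relay = [c for c in candidates if c.relay_ready]
    if relay:
        return relay[0], "relay"
    return None, "none"
```

With survivors in `home` and `test`, a direct-ready `usa` candidate outranks additional `home` or `test` candidates, even if those would be cheaper to reach.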

## 7. Implications for the current fleet

Current area mapping should be treated approximately as:

- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`

Under this model:

- a node in `home` should avoid satisfying its minimum peer objective using
  only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
  that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even
  if its total peer count still looks healthy.
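
Using the approximate mapping above, the cross-area alert can be sketched as follows. `AREA_OF` and `cross_area_alert` are hypothetical names, and the threshold of two external areas is borrowed from section 3.1 as an assumption.

```python
from typing import Dict, Iterable

# Approximate area mapping for the current fleet, as listed above.
AREA_OF: Dict[str, str] = {
    "home-1": "home", "home-2": "home", "home-3": "home",
    "test-1": "test", "test-2": "test", "test-3": "test",
    "usa-los-1": "usa",
    "ifcm-rufms-s-mo1cr": "ifcm",
}

MIN_EXTERNAL_AREAS = 2  # assumed threshold, taken from section 3.1


def cross_area_alert(node_id: str, ready_peer_ids: Iterable[str]) -> bool:
    """Return True when the node's ready peers span too few external areas.

    The total number of ready peers is deliberately ignored.
    """
    local = AREA_OF[node_id]
    external = {AREA_OF[p] for p in ready_peer_ids if AREA_OF[p] != local}
    return len(external) < MIN_EXTERNAL_AREAS
```

For example, `home-1` with ready peers `home-2`, `home-3`, and `test-1` has three peers, which would pass the old rule, but still raises the alert because only one external area is represented.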

## 8. Required implementation changes

1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
   - `direct_ready_count`
   - `relay_ready_count`
   - `external_area_ready_count`
   - `independent_ingress_ready_count`
4. Alert on:
   - zero recovery path outside the local area
   - direct-ready deficit
   - area diversity deficit
   - registry resolution deficit
5. Preserve a full node directory for the current small fleet.
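
The four separated counters from item 3 could be computed in one pass over the peer set. This is a sketch under stated assumptions: `ReadyPeer` and `readiness_counters` are hypothetical, and the two diversity counters are interpreted as counts of distinct external areas and distinct non-local ingress dependencies among ready peers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ReadyPeer:
    node_id: str
    area: str
    ingress: str               # public ingress / NAT dependency identifier
    direct_ready: bool = False
    relay_ready: bool = False


def readiness_counters(
    local_area: str, local_ingress: str, peers: List[ReadyPeer]
) -> Dict[str, int]:
    """Compute the separated readiness counters for one node."""
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    return {
        "direct_ready_count": sum(1 for p in peers if p.direct_ready),
        "relay_ready_count": sum(1 for p in peers if p.relay_ready),
        # Distinct areas other than the local one among ready peers.
        "external_area_ready_count": len(
            {p.area for p in ready if p.area != local_area}
        ),
        # Distinct ingress dependencies not shared with the local node.
        "independent_ingress_ready_count": len(
            {p.ingress for p in ready if p.ingress != local_ingress}
        ),
    }
```

Keeping the counters separate lets each alert in item 4 key off its own value instead of a single combined peer count.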