2026-05-18 21:33:39 +03:00
parent 5096155d83
commit 469fa0e860
94 changed files with 8761 additions and 8003 deletions
@@ -0,0 +1,183 @@
+# Fabric Area And Peer Stability Model
+
+Status: active design correction.
+
+This document replaces the oversimplified rule "every node must keep 3
+connections" with a stability model based on failure domains ("areas"),
+multi-path reachability, and live peer memory.
+
+## 1. Why the old "3 connections" rule is not enough
+
+A raw connection count is too weak as a resilience rule.
+
+Three links do not amount to three independent paths when:
+
+- all three peers are in the same private network;
+- all three depend on the same NAT or relay path;
+- all three depend on the same public ingress;
+- all three are relay-ready but not direct-ready;
+- all three are stale observations rather than recently verified paths.
+
+Therefore the fabric must not use a single scalar count as the stability
+criterion.
+
+## 2. Area
+
+Introduce the concept of an `area`.
+
+An area is a failure domain with high mutual reachability and shared external
+risk. Examples:
+
+- `home` - nodes in the same home/private site
+- `test` - nodes in the same test Docker/LAN site
+- `usa` - a public node in a remote Internet site
+- `ifcm` - a separate NAT/domain behind another administrative boundary
+
+An area can be derived from:
+
+- operator-declared site/area label;
+- shared private address space or local interface group;
+- shared public egress/NAT identity;
+- shared administrative host or cluster.
+
+The area label must be part of live node metadata and endpoint candidate
+metadata.
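
A minimal sketch of how an area label might be derived from the sources listed above. The precedence order (operator label first, then private address space, then public egress, then administrative cluster), the function name `derive_area`, and the `subnet:` / `egress:` / `cluster:` prefixes are all assumptions made for illustration; the document does not specify them.

```python
from typing import Optional


def derive_area(
    operator_label: Optional[str] = None,
    private_subnet: Optional[str] = None,
    public_egress: Optional[str] = None,
    admin_cluster: Optional[str] = None,
) -> str:
    """Return an area label from the first available source.

    The precedence used here is hypothetical: an explicit operator
    declaration wins, and inferred signals are only fallbacks.
    """
    if operator_label:
        return operator_label
    if private_subnet:
        return f"subnet:{private_subnet}"
    if public_egress:
        return f"egress:{public_egress}"
    if admin_cluster:
        return f"cluster:{admin_cluster}"
    return "unknown"


# An operator-declared label overrides anything inferred from addresses.
declared = derive_area(operator_label="home", private_subnet="10.0.0.0/24")
# Without a declaration, the area falls back to an inferred identity.
inferred = derive_area(public_egress="203.0.113.7")
```

Returning `"unknown"` instead of guessing keeps unlabelled nodes visible, so they are not silently merged into an existing area.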
+
+## 3. Stability objective
+
+Each node should maintain a working peer set with diversity, not just count.
+
+### 3.1 Minimum stable peer objective
+
+For an ordinary production node:
+
+- at least `2` recently verified direct-ready peers overall;
+- at least `2` distinct external areas represented in the ready set when more
+  than one external area exists;
+- at least `1` persistent recovery-capable path outside the local area;
+- at least `1` additional relay-ready or rendezvous-capable path that does not
+  depend on the primary recovery path.
+
+For an area gateway or strategically important public node:
+
+- at least `3` direct-ready peers overall;
+- at least `2` distinct external areas represented in the direct-ready set;
+- at least `1` extra recovery path that does not share the same public ingress
+  or NAT dependency.
+
+For a node in a tiny fleet where only one external area currently exists:
+
+- the system must report `reduced-diversity mode`, not pretend the target is
+  fully satisfied.
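
The ordinary-node objective can be sketched as a check over the ready set. This is an illustration only: the `ReadyPeer` type, its field names, and the deficit strings are hypothetical, and the fourth bullet (an additional relay or rendezvous path independent of the primary recovery path) is left out to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class ReadyPeer:
    node_id: str
    area: str
    direct_ready: bool
    recovery_capable: bool = False  # hypothetical flag for a persistent recovery path


def evaluate_ordinary_node(
    local_area: str,
    ready_peers: Iterable[ReadyPeer],
    known_external_areas: Set[str],
) -> Tuple[str, List[str]]:
    """Return (mode, deficits) for an ordinary production node."""
    peers = list(ready_peers)
    direct = [p for p in peers if p.direct_ready]
    external_ready_areas = {p.area for p in peers if p.area != local_area}

    deficits = []
    if len(direct) < 2:
        deficits.append("direct-ready deficit")
    # The two-area rule only applies when more than one external area exists.
    if len(known_external_areas) > 1 and len(external_ready_areas) < 2:
        deficits.append("area diversity deficit")
    if not any(p.recovery_capable and p.area != local_area for p in peers):
        deficits.append("no recovery path outside the local area")

    # A tiny fleet is reported as reduced-diversity instead of fully satisfied.
    mode = "reduced-diversity" if len(known_external_areas) <= 1 else "normal"
    return mode, deficits


peers = [
    ReadyPeer("home-2", "home", direct_ready=True),
    ReadyPeer("usa-los-1", "usa", direct_ready=True, recovery_capable=True),
]
full_fleet = evaluate_ordinary_node("home", peers, {"test", "usa", "ifcm"})
tiny_fleet = evaluate_ordinary_node("home", peers, {"usa"})
```

With the same two peers, the result depends on how many external areas exist: in the full fleet the node has an area diversity deficit, while in a one-external-area fleet it has no deficit but is still reported as `reduced-diversity`.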
+
+### 3.2 What counts as "ready"
+
+`ready` means:
+
+- recently verified;
+- usable for immediate QUIC route establishment;
+- not only a historical candidate;
+- not blocked on stale relay replacement;
+- not only a compatibility `Control API/downloads` overlap path.
+
+`relay_ready` does not replace `direct_ready`.
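
The readiness conditions above can be read as a single predicate. The sketch below assumes a hypothetical `PathState` record and a 60-second freshness window; the document does not define what "recently" means, so that constant is a placeholder.

```python
from dataclasses import dataclass
from typing import Optional

FRESHNESS_WINDOW_S = 60.0  # assumed value; "recently verified" is not quantified in the text


@dataclass(frozen=True)
class PathState:
    last_verified_at: Optional[float]  # seconds; None means historical candidate only
    quic_usable: bool                  # usable for immediate QUIC route establishment
    blocked_on_stale_relay: bool       # waiting on stale relay replacement
    compat_only: bool                  # only a compatibility overlap path


def is_ready(path: PathState, now: float) -> bool:
    """True only if every readiness condition holds at time `now`."""
    return (
        path.last_verified_at is not None
        and now - path.last_verified_at <= FRESHNESS_WINDOW_S
        and path.quic_usable
        and not path.blocked_on_stale_relay
        and not path.compat_only
    )


fresh = PathState(last_verified_at=990.0, quic_usable=True,
                  blocked_on_stale_relay=False, compat_only=False)
stale = PathState(last_verified_at=100.0, quic_usable=True,
                  blocked_on_stale_relay=False, compat_only=False)
```

Evaluating the same predicate separately for direct and relay paths keeps `direct_ready` and `relay_ready` as distinct facts, so one cannot stand in for the other.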
+
+## 4. What a node must remember
+
+Every node must keep a live working set, not just a tiny current-peer list.
+
+Minimum retained peer memory:
+
+1. all currently healthy nodes in the fleet, when the fleet is small enough;
+2. for larger fleets, a bounded full directory plus prioritized recent working
+   peers;
+3. for every known node:
+   - node id
+   - area
+   - role summary
+   - latest verified direct candidates
+   - latest verified relay/rendezvous candidates
+   - last success timestamp
+   - last failure class
+   - NAT / ingress dependency hints
+   - cert pin / authority compatibility metadata
+
+For the current fleet size, every node should hold the full directory of all
+other nodes. At 6-8 nodes, scale is no excuse for remembering less.
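
One possible shape for the per-node record and an in-memory directory is sketched below. The class names `PeerRecord` and `PeerDirectory`, the field names, and the use of plain strings for candidates and hints are assumptions chosen to mirror the list above, not an existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerRecord:
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # latest verified direct
    relay_candidates: List[str] = field(default_factory=list)   # latest verified relay/rendezvous
    last_success_at: Optional[float] = None
    last_failure_class: Optional[str] = None
    dependency_hints: List[str] = field(default_factory=list)   # NAT / ingress hints
    cert_pin: Optional[str] = None                              # cert pin / authority metadata


class PeerDirectory:
    """Full directory of known nodes, keyed by node id."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def get(self, node_id: str) -> Optional[PeerRecord]:
        return self._records.get(node_id)

    def record_success(self, node_id: str, when: float) -> None:
        self._records[node_id].last_success_at = when

    def record_failure(self, node_id: str, failure_class: str) -> None:
        # The last success timestamp is kept so the peer stays a known candidate.
        self._records[node_id].last_failure_class = failure_class

    def __len__(self) -> int:
        return len(self._records)


directory = PeerDirectory()
directory.upsert(PeerRecord(node_id="usa-los-1", area="usa"))
directory.record_success("usa-los-1", when=1000.0)
directory.record_failure("usa-los-1", failure_class="timeout")
```

A failure does not erase the earlier success, which is what lets a node treat a peer as a warm candidate instead of forgetting it.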
+
+## 5. Probe strategy
+
+A node should not probe every possible path at full frequency. Instead, it
+should layer its probing by peer priority.
+
+### 5.1 Hot set
+
+Always keep a hot set of:
+
+- current direct-ready peers;
+- one recovery peer outside the local area;
+- one alternate peer per external area.
+
+These should be revalidated frequently.
+
+### 5.2 Warm set
+
+Maintain a warm set of:
+
+- previously successful peers;
+- peers from underrepresented areas;
+- peers that would restore diversity if a hot peer fails.
+
+These should be revalidated on a slower cadence and promoted when diversity or
+direct-ready count drops.
+
+### 5.3 Cold directory
+
+Retain the full known directory and signed registry records, even if not
+actively probed at the same rate.
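
The three layers can be sketched as a tier assignment over known peers. This is a simplified illustration: the `Peer` type and the `ever_succeeded` flag are hypothetical, the dedicated recovery peer in the hot set is not modelled, and "one alternate peer per external area" is approximated by taking the first previously successful, not currently direct-ready peer seen in each external area.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    direct_ready: bool
    ever_succeeded: bool  # hypothetical: any verified success in the past


def assign_tiers(peers: Iterable[Peer], local_area: str) -> Dict[str, str]:
    """Map node id to 'hot', 'warm', or 'cold'."""
    peers = list(peers)
    tiers: Dict[str, str] = {}

    # Current direct-ready peers are always hot.
    for p in peers:
        if p.direct_ready:
            tiers[p.node_id] = "hot"

    areas_with_alternate = set()
    for p in peers:
        if p.node_id in tiers:
            continue
        is_external = p.area != local_area
        if p.ever_succeeded and is_external and p.area not in areas_with_alternate:
            # One alternate per external area joins the hot set.
            tiers[p.node_id] = "hot"
            areas_with_alternate.add(p.area)
        elif p.ever_succeeded:
            tiers[p.node_id] = "warm"
        else:
            # Known from the directory but never verified: retained, rarely probed.
            tiers[p.node_id] = "cold"
    return tiers


fleet = [
    Peer("home-2", "home", direct_ready=True, ever_succeeded=True),
    Peer("test-1", "test", direct_ready=True, ever_succeeded=True),
    Peer("test-2", "test", direct_ready=False, ever_succeeded=True),
    Peer("test-3", "test", direct_ready=False, ever_succeeded=True),
    Peer("usa-los-1", "usa", direct_ready=False, ever_succeeded=False),
]
tiers = assign_tiers(fleet, local_area="home")
```

Probe cadence per tier is deliberately not shown, since the document gives no intervals; a real implementation would attach a revalidation period to each tier.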
+
+## 6. Failure handling
+
+When a direct-ready peer is lost:
+
+1. do not simply replace it with the lowest-cost peer;
+2. prefer restoring:
+   - area diversity
+   - independent ingress diversity
+   - direct-ready count
+3. only then fall back to relay-ready stabilization if direct replacement is
+   not currently available.
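
The replacement rule can be sketched as a selection function. Treating the order of the list above as a strict priority (area diversity before ingress diversity) is an assumption, as are the `Candidate` type, its fields, and the ingress names used in the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Candidate:
    node_id: str
    area: str
    ingress: str       # hypothetical identifier for the NAT / public ingress dependency
    direct_ready: bool
    relay_ready: bool
    cost: int          # lower is cheaper; intentionally ignored by the selection


def choose_replacement(
    candidates: Iterable[Candidate],
    remaining_areas: Set[str],
    remaining_ingresses: Set[str],
) -> Optional[Candidate]:
    """Pick a replacement for a lost direct-ready peer, favouring diversity."""
    candidates = list(candidates)

    def diversity_key(c: Candidate):
        # True sorts above False: a new area first, then a new ingress.
        return (c.area not in remaining_areas, c.ingress not in remaining_ingresses)

    direct = [c for c in candidates if c.direct_ready]
    # Relay-ready candidates are considered only when no direct one exists.
    pool = direct if direct else [c for c in candidates if c.relay_ready]
    if not pool:
        return None
    return max(pool, key=diversity_key)


candidates = [
    Candidate("home-3", "home", "home-nat", direct_ready=True, relay_ready=True, cost=1),
    Candidate("test-1", "test", "test-nat", direct_ready=True, relay_ready=True, cost=5),
    Candidate("usa-los-1", "usa", "usa-public", direct_ready=False, relay_ready=True, cost=9),
]
chosen = choose_replacement(candidates, remaining_areas={"home"},
                            remaining_ingresses={"home-nat"})
```

Here `home-3` is the cheapest candidate but adds no diversity, so `test-1` is chosen; `usa-los-1` is not considered because a direct replacement is available.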
+
+## 7. Implications for the current fleet
+
+Current area mapping should be treated approximately as:
+
+- `home`: `home-1`, `home-2`, `home-3`
+- `test`: `test-1`, `test-2`, `test-3`
+- `usa`: `usa-los-1`
+- `ifcm`: `ifcm-rufms-s-mo1cr`
+
+Under this model:
+
+- a node in `home` should avoid satisfying its minimum peer objective using
+  only `home` peers plus one relay;
+- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
+  that span at least two foreign areas when possible;
+- a fleet-wide alert should trigger when a node loses cross-area diversity even
+  if its total peer count still looks healthy.
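
As a rough illustration, the mapping above can be expressed as data and used to detect the alert condition in the last bullet. The threshold of two external areas is borrowed from section 3.1; the names `AREAS`, `NODE_AREA`, and `lost_cross_area_diversity` are hypothetical.

```python
from typing import Dict, Iterable, List

# Area mapping as listed in this section.
AREAS: Dict[str, List[str]] = {
    "home": ["home-1", "home-2", "home-3"],
    "test": ["test-1", "test-2", "test-3"],
    "usa": ["usa-los-1"],
    "ifcm": ["ifcm-rufms-s-mo1cr"],
}

# Reverse lookup: node id to area.
NODE_AREA: Dict[str, str] = {
    node: area for area, nodes in AREAS.items() for node in nodes
}


def lost_cross_area_diversity(node: str, ready_peers: Iterable[str]) -> bool:
    """True when fewer than two external areas are represented in the ready set."""
    local = NODE_AREA[node]
    external = {NODE_AREA[p] for p in ready_peers} - {local}
    return len(external) < 2


# Three ready peers, but only one external area: the count looks healthy, diversity is not.
alert_mostly_home = lost_cross_area_diversity("home-1", ["home-2", "home-3", "test-1"])
# Two ready peers spanning two external areas: no alert.
alert_diverse = lost_cross_area_diversity("home-1", ["test-1", "usa-los-1"])
```

The first case has more peers than the second yet is the one that should alert, which is the point of replacing the raw count.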
+
+## 8. Required implementation changes
+
+1. Add `area` to node metadata and endpoint candidate metadata.
+2. Track peer readiness by area, not only total count.
+3. Separate:
+   - `direct_ready_count`
+   - `relay_ready_count`
+   - `external_area_ready_count`
+   - `independent_ingress_ready_count`
+4. Alert on:
+   - zero recovery path outside the local area
+   - direct-ready deficit
+   - area diversity deficit
+   - registry resolution deficit
+5. Preserve a full node directory for the current small fleet.
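
A minimal sketch of the four separated counters from item 3, computed from a list of peer states. The `PeerStatus` type and the ingress identifiers are hypothetical, and counting a peer as "ready" when it is either direct-ready or relay-ready is an assumption about how the area and ingress counters should be scoped.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class PeerStatus:
    node_id: str
    area: str
    ingress: str       # hypothetical NAT / public ingress identifier
    direct_ready: bool
    relay_ready: bool


def readiness_counters(local_area: str, peers: Iterable[PeerStatus]) -> Dict[str, int]:
    """Compute the separated readiness counters for one node."""
    peers = list(peers)
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    return {
        "direct_ready_count": sum(1 for p in peers if p.direct_ready),
        "relay_ready_count": sum(1 for p in peers if p.relay_ready),
        "external_area_ready_count": len({p.area for p in ready if p.area != local_area}),
        "independent_ingress_ready_count": len({p.ingress for p in ready}),
    }


status = [
    PeerStatus("home-2", "home", "home-nat", direct_ready=True, relay_ready=False),
    PeerStatus("test-1", "test", "test-nat", direct_ready=True, relay_ready=True),
    PeerStatus("usa-los-1", "usa", "usa-public", direct_ready=False, relay_ready=True),
    PeerStatus("ifcm-rufms-s-mo1cr", "ifcm", "ifcm-nat", direct_ready=False, relay_ready=False),
]
counters = readiness_counters("home", status)
```

Keeping the counters separate means each alert in item 4 can be driven by its own value instead of by a single total.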