# Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3
connections" with a stability model based on failure domains ("areas"),
multi-path reachability, and live peer memory.

It operates at the `Fabric Transport` layer. Services above the transport must
consume service channels and must not directly reason about peer topology. See
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).

## 1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links do not amount to three independent paths when:

- all three peers are in the same private network;
- all three depend on the same NAT or relay path;
- all three depend on the same public ingress;
- all three are relay-ready but not direct-ready;
- all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability
criterion.

## 2. Area

Introduce the concept of an `area`.

An area is a failure domain with high mutual reachability and shared external
risk. Examples:

- `home` - nodes in the same home/private site
- `test` - nodes in the same test Docker/LAN site
- `usa` - a public node in a remote Internet site
- `ifcm` - a separate NAT/domain behind another administrative boundary

An area can be derived from:

- operator-declared site/area label;
- shared private address space or local interface group;
- shared public egress/NAT identity;
- shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate
metadata.

For the current fleet, area assignment should be explicit operator metadata, not
an inference hidden only inside routing code.

## 3. Stability objective

Each node should maintain a working peer set with diversity, not just count.

### 3.1 Minimum stable peer objective

For an ordinary production node:

- at least `2` recently verified direct-ready peers overall;
- at least `2` distinct external areas represented in the ready set when more
  than one external area exists;
- at least `1` persistent recovery-capable path outside the local area;
- at least `1` additional relay-ready or rendezvous-capable path outside the
  primary recovery path.

For an area gateway or strategically important public node:

- at least `3` direct-ready peers overall;
- at least `2` distinct external areas represented in the direct-ready set;
- at least `1` extra recovery path that does not share the same public ingress
  or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

- the system must report `reduced-diversity mode`, not pretend the target is
  fully satisfied.
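
The ordinary-node objective above can be read as a check over a handful of counters. The following is a minimal sketch under stated assumptions, not the implementation: `PeerSummary`, its field names, and `evaluate_ordinary_node` are hypothetical, and the counters are assumed to be computed elsewhere. Treating a fleet with zero external areas the same as one with a single external area is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PeerSummary:
    """Hypothetical per-node counters; names are illustrative only."""
    direct_ready: int               # recently verified direct-ready peers
    external_areas_ready: int       # distinct external areas in the ready set
    external_areas_known: int       # external areas that currently exist
    recovery_paths_outside: int     # persistent recovery-capable paths outside the local area
    extra_relay_paths_outside: int  # relay/rendezvous paths besides the primary recovery path


def evaluate_ordinary_node(s: PeerSummary) -> Tuple[str, List[str]]:
    """Return (mode, unmet objectives) for an ordinary production node."""
    deficits: List[str] = []
    if s.direct_ready < 2:
        deficits.append("direct-ready deficit")
    # The two-area requirement applies only when more than one external area exists.
    if s.external_areas_known > 1 and s.external_areas_ready < 2:
        deficits.append("area diversity deficit")
    if s.recovery_paths_outside < 1:
        deficits.append("no recovery path outside the local area")
    if s.extra_relay_paths_outside < 1:
        deficits.append("no additional relay or rendezvous path")
    # Tiny fleet: report reduced diversity instead of claiming the full target.
    mode = "reduced-diversity" if s.external_areas_known <= 1 else "normal"
    return mode, deficits
```

Returning the mode separately from the deficit list keeps `reduced-diversity mode` visible even when every other check passes.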

### 3.2 What counts as "ready"

`ready` means:

- recently verified;
- usable for immediate QUIC route establishment;
- not merely a historical candidate;
- not blocked on stale relay replacement;
- not merely a compatibility `Control API/downloads` overlap path.

`relay_ready` does not replace `direct_ready`.
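
One way to express the readiness conditions above is a single predicate plus a classifier that keeps direct and relay readiness apart. This is a sketch only: `PathState`, its fields, and the `MAX_VERIFIED_AGE_S` freshness window are assumptions, since this document does not define what "recently" means in seconds.

```python
from dataclasses import dataclass


@dataclass
class PathState:
    """Hypothetical view of one path to a peer; all fields are illustrative."""
    last_verified_age_s: float        # seconds since the last successful verification
    quic_usable: bool                 # a QUIC route can be established immediately
    historical_only: bool             # known only from past observations
    awaiting_relay_replacement: bool  # blocked on replacing a stale relay
    compat_overlap_only: bool         # only the compatibility Control API/downloads overlap path
    direct: bool                      # True for a direct path, False for a relay path


# Assumed freshness window; the real value is not specified in this document.
MAX_VERIFIED_AGE_S = 60.0


def is_ready(p: PathState) -> bool:
    """All five readiness conditions must hold at once."""
    return (
        p.last_verified_age_s <= MAX_VERIFIED_AGE_S
        and p.quic_usable
        and not p.historical_only
        and not p.awaiting_relay_replacement
        and not p.compat_overlap_only
    )


def readiness_class(p: PathState) -> str:
    """A ready relay path is reported as relay_ready, never as direct_ready."""
    if not is_ready(p):
        return "not_ready"
    return "direct_ready" if p.direct else "relay_ready"
```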

## 4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

1. all currently healthy nodes in the fleet, when the fleet is small enough;
2. for larger fleets, a bounded full directory plus prioritized recent working
   peers;
3. for every known node:
   - node id
   - area
   - role summary
   - latest verified direct candidates
   - latest verified relay/rendezvous candidates
   - last success timestamp
   - last failure class
   - NAT / ingress dependency hints
   - cert pin / authority compatibility metadata

At the current fleet size, every node should hold the full directory of all
other nodes. At 6-8 nodes, scale is no excuse for remembering less.
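
The retained fields listed above map naturally onto one record per known node. The sketch below is illustrative: `PeerRecord`, `PeerDirectory`, and their method names are hypothetical, and candidates are shown as plain strings purely for brevity.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerRecord:
    """Hypothetical per-node memory entry mirroring the field list above."""
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # latest verified direct candidates
    relay_candidates: List[str] = field(default_factory=list)   # latest verified relay/rendezvous candidates
    last_success_ts: Optional[float] = None
    last_failure_class: Optional[str] = None
    dependency_hints: List[str] = field(default_factory=list)   # NAT / ingress dependency hints
    cert_metadata: Dict[str, str] = field(default_factory=dict) # cert pin / authority compatibility


class PeerDirectory:
    """Full in-memory directory, keyed by node id."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def record_success(self, node_id: str, ts: float) -> None:
        self._records[node_id].last_success_ts = ts

    def record_failure(self, node_id: str, failure_class: str) -> None:
        # A failing peer stays in the directory; only its failure class changes.
        self._records[node_id].last_failure_class = failure_class

    def in_area(self, area: str) -> List[PeerRecord]:
        return [r for r in self._records.values() if r.area == area]

    def __len__(self) -> int:
        return len(self._records)
```

Note that `record_failure` never evicts an entry: a node that is currently unreachable is still remembered, which is the point of keeping a directory rather than a current-peer list.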

## 5. Probe strategy

The node should not aggressively probe every possible path at full frequency.
It should maintain a layered strategy.

### 5.1 Hot set

Always keep a hot set of:

- current direct-ready peers;
- one recovery peer outside the local area;
- one alternate peer per external area.

These should be revalidated frequently.

### 5.2 Warm set

Maintain a warm set of:

- previously successful peers;
- peers from underrepresented areas;
- peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted to the hot set
when diversity or the direct-ready count drops.

### 5.3 Cold directory

Retain the full known directory and signed registry records, even for nodes
that are not probed at the hot or warm cadence.
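
The three tiers above can be sketched as a single pass over the known peers. This is a simplified illustration, not the scheduler: `Peer`, its flags, and `assign_tiers` are hypothetical names, and the warm rule here uses only success history, leaving out the "underrepresented area" and "would restore diversity" criteria.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Peer:
    """Hypothetical peer view used only for tier assignment."""
    node_id: str
    area: str
    direct_ready: bool = False
    recovery_capable: bool = False
    previously_successful: bool = False


def assign_tiers(local_area: str, peers: List[Peer]) -> Dict[str, str]:
    # Everything starts in the cold directory and is never dropped from it.
    tiers = {p.node_id: "cold" for p in peers}

    # Hot: every current direct-ready peer.
    for p in peers:
        if p.direct_ready:
            tiers[p.node_id] = "hot"

    external = [p for p in peers if p.area != local_area]

    # Hot: one recovery peer outside the local area, unless one is already hot.
    if not any(p.recovery_capable and tiers[p.node_id] == "hot" for p in external):
        for p in external:
            if p.recovery_capable:
                tiers[p.node_id] = "hot"
                break

    # Hot: one alternate (not yet hot) peer per external area,
    # preferring peers that have worked before.
    covered = set()
    for p in sorted(external, key=lambda q: not q.previously_successful):
        if p.area not in covered and tiers[p.node_id] != "hot":
            tiers[p.node_id] = "hot"
            covered.add(p.area)

    # Warm: remaining peers with a success history.
    for p in peers:
        if tiers[p.node_id] == "cold" and p.previously_successful:
            tiers[p.node_id] = "warm"
    return tiers
```

A probe loop would then pick its revalidation interval from the tier; the intervals themselves are not specified here.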

## 6. Failure handling

When a direct-ready peer is lost:

1. do not merely replace it with the numerically cheapest peer;
2. prefer restoring:
   - area diversity
   - independent ingress diversity
   - direct-ready count
3. only then fall back to relay-ready stabilization if direct replacement is
   not currently available.
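
The replacement preference above can be sketched as a ranking in which cost only breaks ties. The sketch assumes the three restoration goals are listed in priority order, which this document does not state explicitly. `Candidate`, its `ingress` and `cost` fields, and `pick_replacement` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Candidate:
    """Hypothetical replacement candidate."""
    node_id: str
    area: str
    ingress: str          # illustrative id for the public ingress / NAT dependency
    direct_capable: bool
    cost: float           # the metric a "numerically cheapest" rule would use


def pick_replacement(
    candidates: List[Candidate],
    ready_areas: Set[str],
    ready_ingresses: Set[str],
) -> Optional[Candidate]:
    def rank(c: Candidate):
        return (
            c.area in ready_areas,         # False sorts first: a new area restores diversity
            c.ingress in ready_ingresses,  # then a new, independent ingress
            c.cost,                        # cost is only the final tie-breaker
        )

    direct = [c for c in candidates if c.direct_capable]
    # Fall back to relay-only candidates only when no direct replacement exists.
    pool = direct if direct else candidates
    return min(pool, key=rank) if pool else None
```

For example, with `home` and `test` already in the ready set, a more expensive direct peer in `usa` is chosen over a cheaper direct peer in `home`.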

## 7. Implications for the current fleet

Current area mapping should be treated approximately as:

- `home`: `home-1`, `home-2`, `home-3`
- `test`: `test-1`, `test-2`, `test-3`
- `usa`: `usa-los-1`
- `ifcm`: `ifcm-rufms-s-mo1cr`

Under this model:

- a node in `home` should avoid satisfying its minimum peer objective using
  only `home` peers plus one relay;
- `usa-los-1` and `ifcm-rufms-s-mo1cr` should both maintain direct-ready links
  that span at least two foreign areas when possible;
- a fleet-wide alert should trigger when a node loses cross-area diversity even
  if its total peer count still looks healthy.
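
As an illustration of the last point, the mapping above can be held as explicit data and used to flag a node whose peer count looks healthy but whose direct-ready peers span too few foreign areas. `AREA_OF`, `external_areas`, and `diversity_alert` are hypothetical names, and the threshold of two external areas is taken from the stability objective in section 3.1.

```python
from typing import Dict, Iterable, Set

# Explicit operator-declared mapping, copied from the list above.
AREA_OF: Dict[str, str] = {
    "home-1": "home", "home-2": "home", "home-3": "home",
    "test-1": "test", "test-2": "test", "test-3": "test",
    "usa-los-1": "usa",
    "ifcm-rufms-s-mo1cr": "ifcm",
}


def external_areas(node_id: str, peers: Iterable[str]) -> Set[str]:
    """Distinct areas of the given peers, excluding the node's own area."""
    own = AREA_OF[node_id]
    return {AREA_OF[p] for p in peers} - {own}


def diversity_alert(node_id: str, direct_ready_peers: Iterable[str]) -> bool:
    """True when the node's direct-ready peers span too few foreign areas."""
    own = AREA_OF[node_id]
    existing = set(AREA_OF.values()) - {own}
    required = min(2, len(existing))  # cannot require more areas than exist
    return len(external_areas(node_id, direct_ready_peers)) < required
```

With this data, `home-1` holding `home-2`, `home-3`, and `test-1` has three peers but reaches only one foreign area, so the alert fires.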

## 8. Required implementation changes

1. Add `area` to node metadata and endpoint candidate metadata.
2. Track peer readiness by area, not only total count.
3. Separate:
   - `direct_ready_count`
   - `relay_ready_count`
   - `external_area_ready_count`
   - `independent_ingress_ready_count`
4. Alert on:
   - zero recovery path outside the local area
   - direct-ready deficit
   - area diversity deficit
   - registry resolution deficit
5. Preserve a full node directory for the current small fleet.
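
Items 2 to 4 can be sketched together as a counter computation followed by an alert check. This is a hedged illustration: `ReadyPeer`, the `ingress` identifiers, and the default targets are assumptions, and the registry resolution alert is omitted because this document does not define how that deficit is measured.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ReadyPeer:
    """Hypothetical ready peer; `ingress` is an illustrative dependency id."""
    node_id: str
    area: str
    ingress: str
    direct: bool  # True for direct-ready, False for relay-ready


def compute_counters(local_area: str, peers: List[ReadyPeer]) -> Dict[str, int]:
    return {
        "direct_ready_count": sum(1 for p in peers if p.direct),
        "relay_ready_count": sum(1 for p in peers if not p.direct),
        "external_area_ready_count": len({p.area for p in peers if p.area != local_area}),
        "independent_ingress_ready_count": len({p.ingress for p in peers}),
    }


def alerts(
    counters: Dict[str, int],
    outside_recovery_paths: int,
    direct_target: int = 2,  # assumed: the ordinary-node target from section 3.1
    area_target: int = 2,
) -> List[str]:
    out: List[str] = []
    if outside_recovery_paths == 0:
        out.append("zero recovery path outside the local area")
    if counters["direct_ready_count"] < direct_target:
        out.append("direct-ready deficit")
    if counters["external_area_ready_count"] < area_target:
        out.append("area diversity deficit")
    return out
```

Keeping the four counters separate means a healthy `direct_ready_count` cannot mask an `external_area_ready_count` of zero.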