Fabric Area And Peer Stability Model

Status: active design correction.

This document replaces the oversimplified rule "every node must keep 3 connections" with a stability model based on failure domains ("areas"), multi-path reachability, and live peer memory.

It operates at the Fabric Transport layer. Services above the transport must consume service channels and must not directly reason about peer topology. See FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md.

1. Why the old "3 connections" rule is not enough

A raw connection count is too weak as a resilience rule.

Three links are not equivalent when:

  • all three peers are in the same private network;
  • all three depend on the same NAT or relay path;
  • all three depend on the same public ingress;
  • all three are relay-ready but not direct-ready;
  • all three are stale observations rather than recently verified paths.

Therefore the fabric must not use a single scalar count as the stability criterion.

2. Area

Introduce the concept of an area.

An area is a failure domain with high mutual reachability and shared external risk. Examples:

  • home - nodes in the same home/private site
  • test - nodes in the same test Docker/LAN site
  • usa - a public node in a remote Internet site
  • ifcm - a separate NAT/domain behind another administrative boundary

An area can be derived from:

  • operator-declared site/area label;
  • shared private address space or local interface group;
  • shared public egress/NAT identity;
  • shared administrative host or cluster.

The area label must be part of live node metadata and endpoint candidate metadata.

For the current fleet, area assignment should be explicit operator metadata, not an inference hidden only inside routing code.
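
A minimal sketch of how an explicit area label could be carried in metadata, assuming hypothetical names throughout (`AreaAssignment`, `EndpointCandidate`, `resolve_area`, and the `host:port` address form are not taken from the codebase). The operator-declared label wins, and an inferred label is kept only as a visibly marked fallback:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AreaAssignment:
    area: str
    source: str  # "operator" when declared explicitly, "inferred" otherwise


@dataclass(frozen=True)
class EndpointCandidate:
    node_id: str
    address: str  # hypothetical "host:port" form
    area: str     # copied from the owning node so routing never has to guess


def resolve_area(operator_label: Optional[str],
                 inferred_label: Optional[str]) -> Optional[AreaAssignment]:
    """Prefer the operator-declared label; fall back to an inferred one."""
    if operator_label:
        return AreaAssignment(operator_label, "operator")
    if inferred_label:
        return AreaAssignment(inferred_label, "inferred")
    return None


# The inferred value is ignored because the operator declared "usa".
assignment = resolve_area("usa", inferred_label="egress-203.0.113.7")
candidate = EndpointCandidate("usa-los-1", "203.0.113.7:4433", assignment.area)
```

Recording the `source` alongside the label keeps any inference visible in metadata instead of hidden inside routing code.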

3. Stability objective

Each node should maintain a working peer set with diversity, not just count.

3.1 Minimum stable peer objective

For an ordinary production node:

  • at least 2 recently verified direct-ready peers overall;
  • at least 2 distinct external areas represented in the ready set when more than one external area exists;
  • at least 1 persistent recovery-capable path outside the local area;
  • at least 1 additional relay-ready or rendezvous-capable path that does not reuse the primary recovery path.

For an area gateway or strategically important public node:

  • at least 3 direct-ready peers overall;
  • at least 2 distinct external areas represented in the direct-ready set;
  • at least 1 extra recovery path that does not share the same public ingress or NAT dependency.

For a node in a tiny fleet where only one external area currently exists:

  • the system must report reduced-diversity mode, not pretend the target is fully satisfied.
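
The ordinary-node objective can be sketched as a check that returns named deficits instead of a single count. This is an illustration under stated assumptions: the `Peer` fields, the deficit names, and the reading of "ready set" as "direct-ready or relay-ready" are all hypothetical, and the gateway thresholds are not covered.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


@dataclass(frozen=True)
class Evaluation:
    deficits: List[str]
    reduced_diversity: bool  # True when at most one external area exists


def evaluate_ordinary_node(local_area: str, peers: List[Peer],
                           fleet_areas: Iterable[str]) -> Evaluation:
    external_areas = set(fleet_areas) - {local_area}
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    deficits = []

    # At least 2 direct-ready peers overall.
    if sum(1 for p in peers if p.direct_ready) < 2:
        deficits.append("direct_ready_deficit")

    # At least 2 external areas in the ready set, unless the fleet is too small.
    reduced_diversity = len(external_areas) <= 1
    ready_external = {p.area for p in ready if p.area != local_area}
    if not reduced_diversity and len(ready_external) < 2:
        deficits.append("area_diversity_deficit")

    # At least 1 recovery-capable path outside the local area.
    recovery = [p for p in ready if p.recovery_capable and p.area != local_area]
    if not recovery:
        deficits.append("no_external_recovery_path")

    # At least 1 more relay-ready path that is not the primary recovery peer.
    primary = recovery[0].node_id if recovery else None
    if not any(p.relay_ready and p.node_id != primary for p in peers):
        deficits.append("no_additional_relay_path")

    return Evaluation(deficits, reduced_diversity)


# A home node with two home peers plus one relay: the count looks fine, diversity is not.
peers = [Peer("home-2", "home", direct_ready=True),
         Peer("home-3", "home", direct_ready=True),
         Peer("usa-los-1", "usa", relay_ready=True, recovery_capable=True)]
result = evaluate_ordinary_node("home", peers, ["home", "test", "usa", "ifcm"])
```

In a fleet with a single external area the same function sets `reduced_diversity` and skips the area check, so the report states the reduced mode rather than claiming the target is met.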

3.2 What counts as "ready"

A path counts as ready when it is:

  • recently verified;
  • usable for immediate QUIC route establishment;
  • not only a historical candidate;
  • not blocked on stale relay replacement;
  • not only a compatibility Control API/downloads overlap path.

relay_ready does not replace direct_ready.
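
The readiness conditions above can be sketched as a single classifier. The field names and the 60-second freshness window are assumptions; the document does not define how recent "recently verified" is.

```python
from dataclasses import dataclass
from typing import Optional

FRESHNESS_WINDOW_S = 60.0  # assumed value; "recently" is not quantified in the text


@dataclass(frozen=True)
class PathObservation:
    last_verified_at: Optional[float]  # None means a purely historical candidate
    quic_usable: bool                  # usable for immediate QUIC route establishment
    awaiting_relay_replacement: bool   # blocked on stale relay replacement
    compatibility_only: bool           # only a compatibility overlap path
    via_relay: bool


def classify(obs: PathObservation, now: float) -> str:
    """Return "direct_ready", "relay_ready", or "not_ready"."""
    if obs.last_verified_at is None:
        return "not_ready"
    if now - obs.last_verified_at > FRESHNESS_WINDOW_S:
        return "not_ready"
    if (not obs.quic_usable or obs.awaiting_relay_replacement
            or obs.compatibility_only):
        return "not_ready"
    # Relay and direct readiness stay separate so one cannot mask the other.
    return "relay_ready" if obs.via_relay else "direct_ready"


fresh_direct = PathObservation(990.0, True, False, False, via_relay=False)
stale_direct = PathObservation(100.0, True, False, False, via_relay=False)
fresh_relay = PathObservation(990.0, True, False, False, via_relay=True)
```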

4. What a node must remember

Every node must keep a live working set, not just a tiny current-peer list.

Minimum retained peer memory:

  1. all currently healthy nodes in the fleet, when the fleet is small enough;
  2. for larger fleets, a bounded full directory plus prioritized recent working peers;
  3. for every known node:
    • node id
    • area
    • role summary
    • latest verified direct candidates
    • latest verified relay/rendezvous candidates
    • last success timestamp
    • last failure class
    • NAT / ingress dependency hints
    • cert pin / authority compatibility metadata

For the current fleet size, every node should remember the full directory record of every other node. At 6-8 nodes, scale is not an excuse.
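
One possible shape for the per-node record and a full in-memory directory is sketched below. The class and field names are hypothetical and mirror the list above; the example addresses are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PeerRecord:
    node_id: str
    area: str
    role_summary: str = ""
    direct_candidates: List[str] = field(default_factory=list)  # most recent first
    relay_candidates: List[str] = field(default_factory=list)   # most recent first
    last_success_at: Optional[float] = None
    last_failure_class: Optional[str] = None
    ingress_hints: List[str] = field(default_factory=list)      # NAT / ingress dependencies
    cert_pin: Optional[str] = None


class PeerDirectory:
    """Full directory: a failure updates a record but never evicts it."""

    def __init__(self) -> None:
        self._records: Dict[str, PeerRecord] = {}

    def upsert(self, record: PeerRecord) -> None:
        self._records[record.node_id] = record

    def record_success(self, node_id: str, candidate: str, now: float,
                       direct: bool = True) -> None:
        rec = self._records[node_id]
        bucket = rec.direct_candidates if direct else rec.relay_candidates
        if candidate in bucket:
            bucket.remove(candidate)
        bucket.insert(0, candidate)  # keep the latest verified candidate first
        rec.last_success_at = now

    def record_failure(self, node_id: str, failure_class: str) -> None:
        self._records[node_id].last_failure_class = failure_class

    def by_area(self, area: str) -> List[PeerRecord]:
        return [r for r in self._records.values() if r.area == area]


directory = PeerDirectory()
directory.upsert(PeerRecord("usa-los-1", "usa", role_summary="public"))
directory.upsert(PeerRecord("test-1", "test"))
directory.record_success("usa-los-1", "203.0.113.7:4433", now=1000.0)
directory.record_failure("test-1", "timeout")
```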

5. Probe strategy

The node should not aggressively probe every possible path at full frequency. It should maintain a layered strategy.

5.1 Hot set

Always keep a hot set of:

  • current direct-ready peers;
  • one recovery peer outside the local area;
  • one alternate peer per external area.

These should be revalidated frequently.

5.2 Warm set

Maintain a warm set of:

  • previously successful peers;
  • peers from underrepresented areas;
  • peers that would restore diversity if a hot peer fails.

These should be revalidated on a slower cadence and promoted when diversity or direct-ready count drops.

5.3 Cold directory

Retain the full known directory and signed registry records, even if not actively probed at the same rate.
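
The three tiers can be sketched as a pure assignment function. The names are hypothetical, and one interpretation is assumed: the "alternate peer per external area" is taken to be the most recently successful peer in that area that is not already hot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    direct_ready: bool = False
    recovery_capable: bool = False
    last_success_at: Optional[float] = None


def assign_tiers(local_area: str, peers: List[Peer]) -> Dict[str, str]:
    # Everything starts in the cold directory; no peer is ever dropped.
    tiers = {p.node_id: "cold" for p in peers}
    for p in peers:
        if p.last_success_at is not None:
            tiers[p.node_id] = "warm"  # previously successful
    for p in peers:
        if p.direct_ready:
            tiers[p.node_id] = "hot"   # current direct-ready peers

    def most_recent(cands: List[Peer]) -> Optional[Peer]:
        return max(cands, key=lambda p: p.last_success_at, default=None)

    external = [p for p in peers
                if p.area != local_area and p.last_success_at is not None]

    # One recovery peer outside the local area.
    recovery = most_recent([p for p in external if p.recovery_capable])
    if recovery is not None:
        tiers[recovery.node_id] = "hot"

    # One alternate per external area, chosen among peers that are not yet hot.
    for area in sorted({p.area for p in external}):
        alt = most_recent([p for p in external
                           if p.area == area and tiers[p.node_id] != "hot"])
        if alt is not None:
            tiers[alt.node_id] = "hot"
    return tiers


peers = [
    Peer("home-2", "home", direct_ready=True, last_success_at=200.0),
    Peer("usa-los-1", "usa", direct_ready=True, recovery_capable=True,
         last_success_at=200.0),
    Peer("test-1", "test", last_success_at=100.0),
    Peer("test-2", "test", last_success_at=50.0),
    Peer("ifcm-rufms-s-mo1cr", "ifcm"),
]
tiers = assign_tiers("home", peers)
```

Each tier would then map to its own revalidation cadence; the intervals themselves are left open here because the text does not specify them.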

6. Failure handling

When a direct-ready peer is lost:

  1. do not merely replace it with the numerically cheapest peer;
  2. prefer restoring:
    • area diversity
    • independent ingress diversity
    • direct-ready count
  3. only then fall back to relay-ready stabilization if direct replacement is not currently available.
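
The replacement order above can be sketched as a ranking function. The `Peer` fields, the `ingress` labels, and the `cost` metric are illustrative assumptions, not existing fields.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    ingress: str          # hypothetical label for the NAT or public ingress dependency
    direct_capable: bool  # False means relay-only
    cost: int = 0         # hypothetical metric, for example RTT in ms


def choose_replacement(local_area: str, remaining: List[Peer],
                       candidates: List[Peer]) -> Optional[Peer]:
    covered_areas = {p.area for p in remaining} | {local_area}
    covered_ingress = {p.ingress for p in remaining}

    # Relay-only candidates are considered only when no direct one exists.
    direct = [c for c in candidates if c.direct_capable]
    pool = direct or candidates
    if not pool:
        return None

    def rank(c: Peer):
        # False sorts before True: a candidate that adds a new area wins,
        # then one that adds a new ingress; cost only breaks ties.
        return (c.area in covered_areas, c.ingress in covered_ingress, c.cost)

    return min(pool, key=rank)


# home-1 has just lost its direct link to usa-los-1.
remaining = [Peer("home-2", "home", "nat-home", True),
             Peer("test-1", "test", "nat-test", True)]
candidates = [Peer("home-3", "home", "nat-home", True, cost=1),
              Peer("test-2", "test", "nat-test", True, cost=5),
              Peer("ifcm-rufms-s-mo1cr", "ifcm", "nat-ifcm", True, cost=80)]
replacement = choose_replacement("home", remaining, candidates)
```

Here the most expensive candidate is selected because it is the only one that adds a new area and a new ingress; `home-3` would win on cost alone.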

7. Implications for the current fleet

The current area mapping should be treated as approximately:

  • home: home-1, home-2, home-3
  • test: test-1, test-2, test-3
  • usa: usa-los-1
  • ifcm: ifcm-rufms-s-mo1cr

Under this model:

  • a node in home should avoid satisfying its minimum peer objective using only home peers plus one relay;
  • usa-los-1 and ifcm-rufms-s-mo1cr should both maintain direct-ready links that span at least two foreign areas when possible;
  • a fleet-wide alert should trigger when a node loses cross-area diversity even if its total peer count still looks healthy.

8. Required implementation changes

  1. Add area to node metadata and endpoint candidate metadata.
  2. Track peer readiness by area, not only total count.
  3. Separate:
    • direct_ready_count
    • relay_ready_count
    • external_area_ready_count
    • independent_ingress_ready_count
  4. Alert on:
    • no recovery path outside the local area
    • direct-ready deficit
    • area diversity deficit
    • registry resolution deficit
  5. Preserve a full node directory for the current small fleet.
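
A sketch of the separated counters and the alerts derived from them follows. The `Peer` fields, the extra `external_recovery_count`, the alert names, and the default thresholds are assumptions. The registry resolution alert is not modeled because its inputs are not defined here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Peer:
    node_id: str
    area: str
    ingress: str  # hypothetical label for the NAT or public ingress dependency
    direct_ready: bool = False
    relay_ready: bool = False
    recovery_capable: bool = False


def readiness_counters(local_area: str, peers: List[Peer]) -> Dict[str, int]:
    ready = [p for p in peers if p.direct_ready or p.relay_ready]
    external = [p for p in ready if p.area != local_area]
    return {
        "direct_ready_count": sum(1 for p in peers if p.direct_ready),
        "relay_ready_count": sum(1 for p in peers if p.relay_ready),
        "external_area_ready_count": len({p.area for p in external}),
        "independent_ingress_ready_count": len({p.ingress for p in ready}),
        # Hypothetical helper counter backing the recovery-path alert.
        "external_recovery_count": sum(1 for p in external if p.recovery_capable),
    }


def alerts(counters: Dict[str, int], min_direct: int = 2,
           min_external_areas: int = 2) -> List[str]:
    out = []
    if counters["external_recovery_count"] == 0:
        out.append("no_external_recovery_path")
    if counters["direct_ready_count"] < min_direct:
        out.append("direct_ready_deficit")
    if counters["external_area_ready_count"] < min_external_areas:
        out.append("area_diversity_deficit")
    return out


# home-1 with three direct-ready peers: the total looks healthy, diversity does not.
peers = [Peer("home-2", "home", "nat-home", direct_ready=True),
         Peer("home-3", "home", "nat-home", direct_ready=True),
         Peer("test-1", "test", "nat-test", direct_ready=True)]
counters = readiness_counters("home", peers)
raised = alerts(counters)
```

With three direct-ready peers the old rule would be satisfied, yet two alerts fire because only one external area is represented and no recovery path leaves the local area.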