Fabric Live Audit 2026-05-18

Status: live operational audit of the current fabric. This document records the real state observed on 2026-05-18 and explicitly calls out where runtime behavior still differs from the target architecture.

Current confirmed state

  • Inter-node transport for the live node-agent fleet is QUIC over UDP.
  • The active node set is converged on 0.2.321-directreadytarget:
    • home-1
    • home-2
    • home-3
    • test-1
    • test-2
    • test-3
    • usa-los-1
    • ifcm-rufms-s-mo1cr
  • ifcm-rufms-s-mo1cr recovered through the compatibility recovery path and is no longer stale.

Why TCP traffic is still visible

Visible TCP traffic is not coming from the inter-node fabric transport. It is coming from the temporary compatibility recovery overlap that is still active.

Observed live listeners:

  • docker-test
    • 19191/tcp - compatibility Control API/downloads bridge
    • 18080/tcp - web-admin
    • 18090/tcp - release files
    • 18121/tcp - backend Control API
    • 19132/udp, 19133/udp, 19134/udp - QUIC fabric listeners
  • usa-los-1
    • 19131/udp - QUIC fabric listener
    • 19191/tcp - external compatibility bridge currently held open so legacy recovery contracts can still reach Control API/downloads

Therefore:

  • TCP is still present by design for recovery overlap.
  • UDP/QUIC is the current node-to-node transport.
  • The statement "the fabric is fully UDP-only" is not yet true at the full system level while 19191/tcp compatibility recovery remains enabled.

Why nodes were still falling away

1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live ifcm-rufms-s-mo1cr heartbeat:

  • fabric_registry_runtime_report.status = candidate_only
  • resolved_service_count = 0
  • resolved_services.control-api = no_active_record
  • resolved_services.update-store = no_active_record
  • resolved_services.update-cache = no_active_record

This means the current runtime still depends on compatibility control URLs more than the target architecture allows. The node is alive in the fabric, but not yet operating from a fully resolved active registry view.
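
As an illustration, the check an operator performs on this report can be sketched as follows. This is a minimal sketch, not the node-agent's own logic: the nested dictionary shape and the function name registry_gaps are assumptions, and only the field names and values quoted above come from the live heartbeat. The status value active is taken from the operational rule later in this document.

```python
# Services the target architecture expects every node to resolve.
REQUIRED_SERVICES = ("control-api", "update-store", "update-cache")


def registry_gaps(report: dict) -> list[str]:
    """Return the reasons a node is not yet running from an active registry view.

    `report` is assumed to mirror fabric_registry_runtime_report as a plain dict.
    """
    gaps = []
    status = report.get("status")
    if status != "active":
        gaps.append(f"status={status}")
    resolved = report.get("resolved_services", {})
    for name in REQUIRED_SERVICES:
        state = resolved.get(name)
        # A missing entry or an explicit no_active_record both count as unresolved.
        if state is None or state == "no_active_record":
            gaps.append(f"{name}={state}")
    return gaps


# Values observed on the ifcm-rufms-s-mo1cr heartbeat (dict layout is assumed).
observed = {
    "status": "candidate_only",
    "resolved_service_count": 0,
    "resolved_services": {
        "control-api": "no_active_record",
        "update-store": "no_active_record",
        "update-cache": "no_active_record",
    },
}

for gap in registry_gaps(observed):
    print(gap)
```

An empty result would indicate that the node no longer needs the compatibility control URLs for these three services.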

2. Legacy control/download contracts are still real dependencies

Observed on the live ifcm-rufms-s-mo1cr heartbeat after recovery:

  • mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1

This confirms the root recovery lesson:

  • a NAT node without manual host access was still anchored to the old recovery contract;
  • until that contract was temporarily restored, the node could not advance;
  • the node did not disappear because QUIC failed; it disappeared because the recovery/control overlap was removed before the node had converged.

3. Direct peer resilience is still below the intended threshold

Observed from live heartbeat metadata:

  • ifcm-rufms-s-mo1cr
    • peer_connection_ready = 2
    • peer_connection_relay_ready = 3
    • target_ready_peers = 3
  • usa-los-1
    • peer_connection_ready = 1
    • peer_connection_relay_ready = 5
    • target_ready_peers = 3

Both nodes are below target_ready_peers = 3 on direct-ready connections (2 and 1 respectively), so the direct-path resilience target is not yet satisfied, even though the nodes are healthy.

The practical reason is simple:

  • the cluster has only a small number of externally reachable direct QUIC endpoints;
  • some nodes still advertise only private/LAN-reachable direct candidates;
  • relay-ready adjacency is masking direct peer deficit, but it does not replace the requirement for at least three direct-ready peers.
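
The deficit described above can be computed directly from the heartbeat counters. The sketch below assumes that peer_connection_ready counts direct-ready peers and that relay-ready peers must never be counted toward the target; the class name PeerReadiness is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerReadiness:
    node: str
    direct_ready: int  # peer_connection_ready (assumed to mean direct-ready)
    relay_ready: int   # peer_connection_relay_ready
    target: int        # target_ready_peers

    @property
    def direct_deficit(self) -> int:
        # Relay-ready peers are deliberately ignored: they mask the deficit
        # but do not satisfy the direct-ready requirement.
        return max(0, self.target - self.direct_ready)


# Counters taken from the live heartbeat metadata quoted above.
observed = [
    PeerReadiness("ifcm-rufms-s-mo1cr", direct_ready=2, relay_ready=3, target=3),
    PeerReadiness("usa-los-1", direct_ready=1, relay_ready=5, target=3),
]

for p in observed:
    print(f"{p.node}: direct deficit {p.direct_deficit}, relay-ready {p.relay_ready}")
```

With the observed values, ifcm-rufms-s-mo1cr is one direct-ready peer short and usa-los-1 is two short, regardless of how many relay-ready peers each has.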

4. Observability is still heterogeneous

Live heartbeat coverage is inconsistent:

  • test-*, ifcm-rufms-s-mo1cr, and usa-los-1 emit rich c17z20 heartbeat metadata with endpoint, peer recovery, and registry sections.
  • home-* nodes currently do not expose the same full sections in their latest heartbeat rows.

This means operator visibility is uneven and the documentation must not imply uniform live introspection across every node today.

What is true right now

  1. The fleet is converged on one live node-agent version.
  2. QUIC/UDP is the actual node-to-node transport.
  3. Compatibility 19191/tcp is still required for recovery overlap.
  4. Signed registry gossip is not yet the sole active discovery/control source.
  5. The "at least 3 direct-ready peers per node" resilience target is not yet met for all externally significant nodes.

Operational rule until the next audit

Do not remove the compatibility 19191/tcp recovery overlap while any of the following remain true:

  • any live node still reports a control_plane_url on the 19191 contract;
  • any live node has fabric_registry_runtime_report.status != active;
  • any externally significant node has fewer than 3 direct-ready peers;
  • any node can only recover through legacy Control API/downloads overlap.
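
The rule above can be expressed as a pre-removal gate. This is a sketch under stated assumptions: the per-node dictionary keys (registry_status, direct_ready, externally_significant, legacy_recovery_only) are hypothetical names for the four conditions, not fields confirmed in the heartbeat schema, and the two boolean flags on the example node are set for illustration only.

```python
from urllib.parse import urlsplit

COMPAT_PORT = 19191


def overlap_removal_blockers(nodes: list[dict], direct_floor: int = 3) -> list[str]:
    """List every reason the 19191/tcp overlap must stay; empty means safe to remove."""
    blockers = []
    for node in nodes:
        name = node["name"]
        url = node.get("control_plane_url") or ""
        if urlsplit(url).port == COMPAT_PORT:
            blockers.append(f"{name}: control_plane_url still on {COMPAT_PORT}")
        if node.get("registry_status") != "active":
            blockers.append(f"{name}: registry status is {node.get('registry_status')}")
        if node.get("externally_significant") and node.get("direct_ready", 0) < direct_floor:
            blockers.append(f"{name}: only {node.get('direct_ready', 0)} direct-ready peers")
        if node.get("legacy_recovery_only"):
            blockers.append(f"{name}: recovers only through legacy overlap")
    return blockers


fleet = [
    {
        "name": "ifcm-rufms-s-mo1cr",
        "control_plane_url": "http://vpn.cin.su:19191/api/v1",  # from the heartbeat
        "registry_status": "candidate_only",                    # from the heartbeat
        "direct_ready": 2,                                      # from the heartbeat
        "externally_significant": True,   # assumed for illustration
        "legacy_recovery_only": True,     # assumed for illustration
    },
]

for reason in overlap_removal_blockers(fleet):
    print(reason)
```

Returning the full list of reasons, rather than a single boolean, keeps the audit trail visible: the overlap is removable only when the list is empty for the whole fleet.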

Required next work

A. Finish signed registry activation

Each node must be able to resolve active records for at least:

  • control-api
  • update-store
  • update-cache

without falling back to the 19191 compatibility contract.

B. Promote full direct endpoint dissemination

All nodes with public reachability must advertise every valid public direct QUIC endpoint, and nodes must retain enough live peer memory to reconnect without operator intervention.
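
One way to tell publicly reachable candidates from private/LAN-only ones is to classify the advertised address. The sketch below uses Python's standard ipaddress module and assumes candidates are literal IP addresses paired with a UDP port; the addresses shown are hypothetical examples, not endpoints from the live fleet. A globally routable address is necessary but not sufficient for reachability, so this is a filter, not a reachability probe.

```python
import ipaddress


def public_candidates(candidates: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Keep only candidates whose address is globally routable.

    Assumes each host is a literal IP address; hostnames would raise ValueError.
    """
    return [
        (host, port)
        for host, port in candidates
        if ipaddress.ip_address(host).is_global
    ]


# Hypothetical advertised candidates for one node.
advertised = [
    ("192.168.1.20", 19131),  # LAN-only
    ("10.0.0.7", 19132),      # private range
    ("8.8.8.8", 19131),       # stand-in for a public address
]

print(public_candidates(advertised))
```

A node whose filtered list is empty advertises only private/LAN-reachable candidates and cannot count as a direct-ready peer for nodes outside its network.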

C. Enforce the direct-ready floor as a live alert

If a node has fewer than 3 direct-ready peers, this must remain a real operational alert even when relay-ready peers exist.

D. Normalize heartbeat observability

Every production node must emit the same minimum audit surface:

  • endpoint candidates
  • peer recovery counts
  • registry runtime state
  • update runtime state

without mixing rich and reduced heartbeat schemas across the fleet.
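
A minimal sketch of such a schema check follows. The section key names are assumptions chosen to match the four bullets above; the real heartbeat field names may differ.

```python
# Hypothetical keys for the minimum audit surface listed above.
REQUIRED_SECTIONS = (
    "endpoint_candidates",
    "peer_recovery",
    "registry_runtime",
    "update_runtime",
)


def missing_sections(heartbeat: dict) -> list[str]:
    """Return the required sections that a heartbeat row does not carry."""
    return [section for section in REQUIRED_SECTIONS if section not in heartbeat]


# Illustrative rows: one rich heartbeat and one reduced heartbeat.
rich = {
    "endpoint_candidates": [],
    "peer_recovery": {},
    "registry_runtime": {},
    "update_runtime": {},
}
reduced = {"update_runtime": {}}

print(missing_sections(rich))
print(missing_sections(reduced))
```

Running this against the latest heartbeat row of every node would show which nodes still emit the reduced schema.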

E. Replace the naive peer-count rule

The live fleet shows that a plain "3 links per node" rule is not a sufficient resilience model.

The current corrective design is documented in FABRIC_AREA_AND_PEER_STABILITY_MODEL.md and introduces:

  • area as a failure-domain label;
  • direct-ready vs relay-ready separation;
  • cross-area diversity requirements;
  • full-directory retention for small fleets.