# Fabric Live Audit 2026-05-18

Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.

## Current confirmed state

- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The active node set is converged on `0.2.321-directreadytarget`:
  - `home-1`
  - `home-2`
  - `home-3`
  - `test-1`
  - `test-2`
  - `test-3`
  - `usa-los-1`
  - `ifcm-rufms-s-mo1cr`
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
  no longer stale.

## Why TCP traffic is still visible

Visible TCP traffic is not coming from the inter-node fabric transport. It is
coming from the temporary compatibility recovery overlap that is still active.

Observed live listeners:

- `docker-test`
  - `19191/tcp` - compatibility `Control API/downloads` bridge
  - `18080/tcp` - web-admin
  - `18090/tcp` - release files
  - `18121/tcp` - backend Control API
  - `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
  - `19131/udp` - QUIC fabric listener
  - `19191/tcp` - external compatibility bridge currently held open so legacy
    recovery contracts can still reach `Control API/downloads`

Therefore:

- `TCP` is still present by design for recovery overlap.
- `UDP/QUIC` is the current node-to-node transport.
- The statement "the fabric is fully UDP-only" is not yet true at the full
  system level while `19191/tcp` compatibility recovery remains enabled.

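The conclusion above can be checked mechanically against the listener inventory. The following is a minimal sketch, not the fabric's own tooling: the tuple layout and the names `LISTENERS`, `COMPAT_BRIDGE`, and `hosts_with_compat_bridge` are assumptions made for illustration, and the "UDP-only" test follows this document's criterion (the `19191/tcp` bridge being open), not a full port scan.

```python
# Listener inventory copied from this audit; the (port, proto, role) layout
# is an illustrative assumption, not a format the fabric emits.
LISTENERS = {
    "docker-test": [
        (19191, "tcp", "compatibility Control API/downloads bridge"),
        (18080, "tcp", "web-admin"),
        (18090, "tcp", "release files"),
        (18121, "tcp", "backend Control API"),
        (19132, "udp", "QUIC fabric listener"),
        (19133, "udp", "QUIC fabric listener"),
        (19134, "udp", "QUIC fabric listener"),
    ],
    "usa-los-1": [
        (19131, "udp", "QUIC fabric listener"),
        (19191, "tcp", "external compatibility bridge"),
    ],
}

COMPAT_BRIDGE = (19191, "tcp")


def hosts_with_compat_bridge(listeners):
    """Return the sorted hosts that still hold the compatibility bridge open."""
    return sorted(
        host
        for host, entries in listeners.items()
        if any((port, proto) == COMPAT_BRIDGE for port, proto, _role in entries)
    )


compat_hosts = hosts_with_compat_bridge(LISTENERS)
# Per this audit, the "fully UDP-only" claim holds only once no host
# exposes the 19191/tcp compatibility bridge.
udp_only_claim_holds = not compat_hosts
print(compat_hosts, udp_only_claim_holds)
```

With the listeners recorded on 2026-05-18, both hosts still expose the bridge, so the claim does not hold.
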
## Why nodes were still falling away

### 1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:

- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
- `resolved_services.control-api = no_active_record`
- `resolved_services.update-store = no_active_record`
- `resolved_services.update-cache = no_active_record`

This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.

### 2. Legacy control/download contracts are still real dependencies

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:

- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`

This confirms the root recovery lesson:

- a NAT node without manual host access was still anchored to the old recovery
  contract;
- until that contract was temporarily restored, the node could not advance;
- the node did not disappear because QUIC failed; it disappeared because the
  recovery/control overlap was removed before the node had converged.

### 3. Direct peer resilience is still below the intended threshold

Observed from live heartbeat metadata:

- `ifcm-rufms-s-mo1cr`
  - `peer_connection_ready = 2`
  - `peer_connection_relay_ready = 3`
  - `target_ready_peers = 3`
- `usa-los-1`
  - `peer_connection_ready = 1`
  - `peer_connection_relay_ready = 5`
  - `target_ready_peers = 3`

This means the direct-path resilience target is not satisfied yet, even though
the nodes are healthy.

The practical reason is simple:

- the cluster has only a small number of externally reachable direct QUIC
  endpoints;
- some nodes still advertise only private/LAN-reachable direct candidates;
- relay-ready adjacency is masking direct peer deficit, but it does not replace
  the requirement for at least three direct-ready peers.

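The deficit can be computed directly from the three heartbeat counters listed above. This is a minimal sketch under one stated assumption: that `peer_connection_ready` counts direct-ready peers, which is how this audit reads it. The `PeerCounts` class and `direct_deficit` method are hypothetical names, not part of the node-agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerCounts:
    node: str
    peer_connection_ready: int        # assumed to mean direct-ready peers
    peer_connection_relay_ready: int  # relay-ready peers; ignored by the floor
    target_ready_peers: int           # required direct-ready floor

    def direct_deficit(self) -> int:
        """Direct-ready peers still missing; relay-ready peers never offset it."""
        return max(0, self.target_ready_peers - self.peer_connection_ready)


# Values copied from the heartbeat metadata observed in this audit.
observed = [
    PeerCounts("ifcm-rufms-s-mo1cr", 2, 3, 3),
    PeerCounts("usa-los-1", 1, 5, 3),
]

deficits = {p.node: p.direct_deficit() for p in observed}
for node, deficit in deficits.items():
    print(f"{node}: missing {deficit} direct-ready peer(s)")
```

Relay-ready counts are deliberately left out of the calculation, so a node with five relay-ready peers and one direct-ready peer still reports a deficit of two.
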
### 4. Observability is still heterogeneous

Live heartbeat coverage is inconsistent:

- `test-*`, `ifcm-rufms-s-mo1cr`, and `usa-los-1` emit rich `c17z20` heartbeat
  metadata with endpoint, peer recovery, and registry sections.
- `home-*` nodes currently do not expose the same full sections in their latest
  heartbeat rows.

This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.

## What is true right now

1. The fleet is converged on one live node-agent version.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
   met for all externally significant nodes.

## Operational rule until the next audit

Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:

- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.

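The four conditions above can be expressed as a single gate that lists every blocker before the overlap is touched. This is a sketch only: the flat per-node dictionary and its keys (`registry_status`, `direct_ready`, `externally_significant`, `legacy_recovery_only`) are hypothetical simplifications of the heartbeat fields, and the flags set on the sample node are assumptions, not audited values.

```python
from typing import Any


def overlap_removal_blockers(nodes: list[dict[str, Any]], floor: int = 3) -> list[str]:
    """Return every reason the 19191/tcp overlap must stay; empty means clear."""
    blockers = []
    for n in nodes:
        name = n["name"]
        if ":19191" in n.get("control_plane_url", ""):
            blockers.append(f"{name}: control_plane_url still on the 19191 contract")
        if n.get("registry_status") != "active":
            blockers.append(f"{name}: registry status is {n.get('registry_status')!r}")
        if n.get("externally_significant") and n.get("direct_ready", 0) < floor:
            blockers.append(f"{name}: fewer than {floor} direct-ready peers")
        if n.get("legacy_recovery_only"):
            blockers.append(f"{name}: recovers only through the legacy overlap")
    return blockers


fleet = [
    {
        "name": "ifcm-rufms-s-mo1cr",
        "control_plane_url": "http://vpn.cin.su:19191/api/v1",  # from this audit
        "registry_status": "candidate_only",                    # from this audit
        "direct_ready": 2,                                      # from this audit
        "externally_significant": True,   # assumption for illustration
        "legacy_recovery_only": True,     # assumption for illustration
    },
]

blockers = overlap_removal_blockers(fleet)
for reason in blockers:
    print(reason)
```

Returning the full list of reasons, not just a boolean, keeps the check usable as an audit record: the overlap can be removed only when the list is empty for the whole live fleet.
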
## Required next work

### A. Finish signed registry activation

Each node must be able to resolve active records for at least:

- `control-api`
- `update-store`
- `update-cache`

without falling back to the `19191` compatibility contract.

### B. Promote full direct endpoint dissemination

All nodes with public reachability must advertise every valid public direct QUIC
endpoint, and nodes must retain enough live peer memory to reconnect without
operator intervention.

### C. Enforce the direct-ready floor as a live alert

If a node has fewer than 3 direct-ready peers, this must remain a real
operational alert even when relay-ready peers exist.

### D. Normalize heartbeat observability

Every production node must emit the same minimum audit surface:

- endpoint candidates
- peer recovery counts
- registry runtime state
- update runtime state

without mixing rich and reduced heartbeat schemas across the fleet.

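One way to enforce the minimum audit surface is to diff each heartbeat against a required section set. The sketch below assumes a heartbeat is available as a dictionary keyed by section name; the key names in `REQUIRED_SECTIONS` and both sample payloads are hypothetical, chosen only to mirror the four items listed above.

```python
# Hypothetical section keys mirroring the four required audit items.
REQUIRED_SECTIONS = frozenset(
    {"endpoint_candidates", "peer_recovery", "registry_runtime", "update_runtime"}
)


def missing_sections(heartbeat: dict) -> set:
    """Return the required sections that this heartbeat does not carry."""
    return set(REQUIRED_SECTIONS - set(heartbeat))


# Illustrative payloads only; real heartbeat rows are not reproduced here.
rich = {
    "endpoint_candidates": [],
    "peer_recovery": {},
    "registry_runtime": {},
    "update_runtime": {},
}
reduced = {"update_runtime": {}}

rich_missing = missing_sections(rich)
reduced_missing = missing_sections(reduced)
print(sorted(rich_missing), sorted(reduced_missing))
```

A node whose result is non-empty is emitting a reduced schema and should be flagged until it matches the rest of the fleet.
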
### E. Replace the naive peer-count rule

The live fleet shows that a plain "3 links per node" rule is not a sufficient
resilience model.

The current corrective design is documented in
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
and introduces:

- `area` as a failure-domain label;
- direct-ready vs relay-ready separation;
- cross-area diversity requirements;
- full-directory retention for small fleets.