# Fabric Live Audit 2026-05-18

Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.

## Current confirmed state

- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The active node set is converged on `0.2.321-directreadytarget`:
  - `home-1`
  - `home-2`
  - `home-3`
  - `test-1`
  - `test-2`
  - `test-3`
  - `usa-los-1`
  - `ifcm-rufms-s-mo1cr`
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
  no longer stale.

## Why TCP traffic is still visible

Visible TCP traffic is not coming from the inter-node fabric transport. It is
coming from the temporary compatibility recovery overlap that is still active.

Observed live listeners:

- `docker-test`
  - `19191/tcp` - compatibility `Control API/downloads` bridge
  - `18080/tcp` - web-admin
  - `18090/tcp` - release files
  - `18121/tcp` - backend Control API
  - `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
  - `19131/udp` - QUIC fabric listener
  - `19191/tcp` - external compatibility bridge currently held open so legacy
    recovery contracts can still reach `Control API/downloads`

Therefore:

- `TCP` is still present by design for recovery overlap.
- `UDP/QUIC` is the current node-to-node transport.
- The statement "the fabric is fully UDP-only" is not yet true at the full
  system level while `19191/tcp` compatibility recovery remains enabled.

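The listener inventory above can be turned into a small check. The following is a minimal sketch, not part of the fleet tooling: the `Listener` type and both function names are hypothetical, and it assumes the audit's own criterion, namely that the "UDP-only" claim fails only while a `19191/tcp` compatibility listener exists. Other TCP listeners (web-admin, release files, backend Control API) are treated here as non-fabric services.

```python
from dataclasses import dataclass

COMPAT_PORT = 19191  # compatibility recovery port named in the audit


@dataclass(frozen=True)
class Listener:
    """Hypothetical record for one observed listener."""
    host: str
    port: int
    proto: str  # "tcp" or "udp"
    role: str


# Listeners copied from the audit's "Observed live listeners" list.
LISTENERS = [
    Listener("docker-test", 19191, "tcp", "compatibility bridge"),
    Listener("docker-test", 18080, "tcp", "web-admin"),
    Listener("docker-test", 18090, "tcp", "release files"),
    Listener("docker-test", 18121, "tcp", "backend Control API"),
    Listener("docker-test", 19132, "udp", "QUIC fabric listener"),
    Listener("docker-test", 19133, "udp", "QUIC fabric listener"),
    Listener("docker-test", 19134, "udp", "QUIC fabric listener"),
    Listener("usa-los-1", 19131, "udp", "QUIC fabric listener"),
    Listener("usa-los-1", 19191, "tcp", "compatibility bridge"),
]


def compat_listeners(listeners):
    """Return the TCP listeners bound to the compatibility recovery port."""
    return [l for l in listeners if l.proto == "tcp" and l.port == COMPAT_PORT]


def udp_only_claim_holds(listeners):
    """The claim holds only when no compatibility TCP listener remains."""
    return not compat_listeners(listeners)


if __name__ == "__main__":
    for l in compat_listeners(LISTENERS):
        print(f"{l.host}: {l.port}/{l.proto} still open ({l.role})")
    print("UDP-only claim holds:", udp_only_claim_holds(LISTENERS))
```

Keeping the compatibility port as a single constant makes it easy to re-run the same check once the overlap is retired.
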
## Why nodes were still falling away

### 1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:

- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
- `resolved_services.control-api = no_active_record`
- `resolved_services.update-store = no_active_record`
- `resolved_services.update-cache = no_active_record`

This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.

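As an illustration, the registry fields above can be evaluated mechanically. This is a sketch under assumptions: the heartbeat is modeled as a nested dictionary, and nesting `resolved_service_count` and `resolved_services` under `fabric_registry_runtime_report` is a guess at the layout, not a confirmed schema. The function names are hypothetical; the field values are the ones observed in the audit.

```python
REQUIRED_SERVICES = ("control-api", "update-store", "update-cache")

# Values observed on ifcm-rufms-s-mo1cr; the nesting is an assumption.
HEARTBEAT = {
    "fabric_registry_runtime_report": {
        "status": "candidate_only",
        "resolved_service_count": 0,
        "resolved_services": {
            "control-api": "no_active_record",
            "update-store": "no_active_record",
            "update-cache": "no_active_record",
        },
    }
}


def unresolved_services(heartbeat):
    """List required services that have no active registry record."""
    report = heartbeat.get("fabric_registry_runtime_report", {})
    resolved = report.get("resolved_services", {})
    # A missing entry is treated the same as an explicit no_active_record.
    return [
        name
        for name in REQUIRED_SERVICES
        if resolved.get(name, "no_active_record") == "no_active_record"
    ]


def registry_fully_active(heartbeat):
    """True only if status is active and every required service resolves."""
    report = heartbeat.get("fabric_registry_runtime_report", {})
    return report.get("status") == "active" and not unresolved_services(heartbeat)


if __name__ == "__main__":
    print("unresolved:", unresolved_services(HEARTBEAT))
    print("fully active:", registry_fully_active(HEARTBEAT))
```
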
### 2. Legacy control/download contracts are still real dependencies

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:

- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`

This confirms the root recovery lesson:

- a NAT node without manual host access was still anchored to the old recovery
  contract;
- until that contract was temporarily restored, the node could not advance;
- the node did not disappear because QUIC failed; it disappeared because the
  recovery/control overlap was removed before the node had converged.

### 3. Direct peer resilience is still below the intended threshold

Observed from live heartbeat metadata:

- `ifcm-rufms-s-mo1cr`
  - `peer_connection_ready = 2`
  - `peer_connection_relay_ready = 3`
  - `target_ready_peers = 3`
- `usa-los-1`
  - `peer_connection_ready = 1`
  - `peer_connection_relay_ready = 5`
  - `target_ready_peers = 3`

Both nodes report fewer direct-ready peers than their target of 3, so the
direct-path resilience target is not satisfied yet, even though the nodes are
healthy.

The practical reason is simple:

- the cluster has only a small number of externally reachable direct QUIC
  endpoints;
- some nodes still advertise only private/LAN-reachable direct candidates;
- relay-ready adjacency is masking the direct-peer deficit, but it does not
  replace the requirement for at least three direct-ready peers.

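The deficit described above is simple arithmetic over the heartbeat counters. The sketch below assumes that `peer_connection_ready` counts direct-ready peers (the audit implies this but does not state it), and the function names are hypothetical. Relay-ready counts are deliberately ignored, since relay adjacency does not replace the direct-ready requirement.

```python
# Counters copied from the audit's heartbeat observations.
PEER_COUNTS = {
    "ifcm-rufms-s-mo1cr": {
        "peer_connection_ready": 2,
        "peer_connection_relay_ready": 3,
        "target_ready_peers": 3,
    },
    "usa-los-1": {
        "peer_connection_ready": 1,
        "peer_connection_relay_ready": 5,
        "target_ready_peers": 3,
    },
}


def direct_deficit(counts):
    """Direct-ready peers still missing; relay-ready peers do not count."""
    return max(0, counts["target_ready_peers"] - counts["peer_connection_ready"])


def nodes_below_floor(peer_counts):
    """Map each node with a nonzero direct-ready deficit to that deficit."""
    return {
        node: direct_deficit(counts)
        for node, counts in peer_counts.items()
        if direct_deficit(counts) > 0
    }


if __name__ == "__main__":
    for node, missing in nodes_below_floor(PEER_COUNTS).items():
        print(f"{node}: {missing} direct-ready peer(s) short of target")
```

With the audit's numbers, `ifcm-rufms-s-mo1cr` is 1 peer short and `usa-los-1` is 2 peers short, regardless of how many relay-ready peers each has.
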
### 4. Observability is still heterogeneous

Live heartbeat coverage is inconsistent:

- `test-*`, `ifcm-rufms-s-mo1cr`, and `usa-los-1` emit rich `c17z20` heartbeat
  metadata with endpoint, peer recovery, and registry sections.
- `home-*` nodes currently do not expose the same full sections in their latest
  heartbeat rows.

This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.

## What is true right now

1. The fleet is converged on one live node-agent version.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
   met for all externally significant nodes.

## Operational rule until the next audit

Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:

- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.

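The four conditions above can be combined into a single removal gate. This is a minimal sketch, assuming a flattened per-node record; the keys `registry_status`, `direct_ready_peers`, `externally_significant`, and `legacy_recovery_only` are hypothetical names chosen for illustration, not fields confirmed by the heartbeat schema. Only `urllib.parse.urlsplit` from the standard library is used to read the port.

```python
from urllib.parse import urlsplit

COMPAT_PORT = 19191
DIRECT_READY_FLOOR = 3


def blockers(node):
    """Return the reasons this node still blocks removal of the overlap."""
    reasons = []
    url = node.get("control_plane_url")
    if url and urlsplit(url).port == COMPAT_PORT:
        reasons.append("control_plane_url on the 19191 contract")
    if node.get("registry_status") != "active":
        reasons.append("registry status is not active")
    if (
        node.get("externally_significant")
        and node.get("direct_ready_peers", 0) < DIRECT_READY_FLOOR
    ):
        reasons.append("fewer than 3 direct-ready peers")
    if node.get("legacy_recovery_only"):
        reasons.append("recovers only through the legacy overlap")
    return reasons


def can_remove_overlap(fleet):
    """The overlap may go only when no node reports any blocker."""
    return all(not blockers(node) for node in fleet.values())


FLEET = {
    "ifcm-rufms-s-mo1cr": {
        "control_plane_url": "http://vpn.cin.su:19191/api/v1",  # observed
        "registry_status": "candidate_only",  # observed
        "direct_ready_peers": 2,  # observed
        "externally_significant": True,  # illustrative flag, not from the audit
        "legacy_recovery_only": True,  # illustrative flag, not from the audit
    },
}

if __name__ == "__main__":
    for name, node in FLEET.items():
        for reason in blockers(node):
            print(f"{name}: {reason}")
    print("safe to remove overlap:", can_remove_overlap(FLEET))
```

Returning the list of reasons rather than a bare boolean keeps the gate auditable: an operator can see which condition is holding the overlap open for which node.
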
## Required next work

### A. Finish signed registry activation

Each node must be able to resolve active records, without falling back to the
`19191` compatibility contract, for at least:

- `control-api`
- `update-store`
- `update-cache`

### B. Promote full direct endpoint dissemination

All nodes with public reachability must advertise every valid public direct QUIC
endpoint, and nodes must retain enough live peer memory to reconnect without
operator intervention.

### C. Enforce the direct-ready floor as a live alert

If a node has fewer than 3 direct-ready peers, this must remain a real
operational alert even when relay-ready peers exist.

### D. Normalize heartbeat observability

Every production node must emit the same minimum audit surface, without mixing
rich and reduced heartbeat schemas across the fleet:

- endpoint candidates
- peer recovery counts
- registry runtime state
- update runtime state

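A uniform audit surface is easy to verify once the section names are fixed. The sketch below assumes each heartbeat is a dictionary with one top-level key per section; the key names and the node names `node-a` and `node-b` are hypothetical stand-ins, since the audit lists the sections only in prose.

```python
# Hypothetical key names for the four sections listed above.
REQUIRED_SECTIONS = (
    "endpoint_candidates",
    "peer_recovery_counts",
    "registry_runtime_state",
    "update_runtime_state",
)


def missing_sections(heartbeat):
    """Return the required sections absent from one heartbeat."""
    return [name for name in REQUIRED_SECTIONS if name not in heartbeat]


def schema_report(fleet):
    """Map each non-conforming node to the sections it fails to emit."""
    report = {}
    for node, heartbeat in fleet.items():
        missing = missing_sections(heartbeat)
        if missing:
            report[node] = missing
    return report


# Illustrative data only: one rich heartbeat and one reduced heartbeat.
SAMPLE_FLEET = {
    "node-a": {
        "endpoint_candidates": [],
        "peer_recovery_counts": {},
        "registry_runtime_state": {},
        "update_runtime_state": {},
    },
    "node-b": {
        "update_runtime_state": {},
    },
}

if __name__ == "__main__":
    for node, missing in schema_report(SAMPLE_FLEET).items():
        print(f"{node} is missing: {', '.join(missing)}")
```
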
### E. Replace the naive peer-count rule

The live fleet shows that a plain "3 links per node" rule is not a sufficient
resilience model.

The current corrective design is documented in
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
and introduces:

- `area` as a failure-domain label;
- direct-ready vs relay-ready separation;
- cross-area diversity requirements;
- full-directory retention for small fleets.

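To show why a plain peer count falls short, the sketch below counts distinct `area` labels among direct-ready peers. It is only an illustration: the peer record shape, the `state` values, the area names, and the `min_areas=2` threshold are all assumptions, and the actual rules are whatever the referenced model document specifies.

```python
def direct_ready(peers):
    """Keep only direct-ready peers; relay-ready peers are excluded."""
    return [p for p in peers if p["state"] == "direct_ready"]


def area_count(peers):
    """Number of distinct failure-domain areas among direct-ready peers."""
    return len({p["area"] for p in direct_ready(peers)})


def meets_floor(peers, min_direct=3, min_areas=2):
    """Require both enough direct-ready peers and enough distinct areas.

    min_areas is a hypothetical threshold chosen for this example.
    """
    return len(direct_ready(peers)) >= min_direct and area_count(peers) >= min_areas


# Illustrative peers: three direct-ready links that share one area.
SAME_AREA = [
    {"peer": "p1", "state": "direct_ready", "area": "area-x"},
    {"peer": "p2", "state": "direct_ready", "area": "area-x"},
    {"peer": "p3", "state": "direct_ready", "area": "area-x"},
    {"peer": "p4", "state": "relay_ready", "area": "area-y"},
]

if __name__ == "__main__":
    print("direct-ready peers:", len(direct_ready(SAME_AREA)))
    print("distinct areas:", area_count(SAME_AREA))
    print("meets floor:", meets_floor(SAME_AREA))
```

In this example the node passes a naive "3 links" count but fails the diversity check, because all three direct-ready peers sit in one failure domain and the only peer in another area is relay-ready.
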