rdp-proxy/docs/architecture/FABRIC_LIVE_AUDIT_2026-05-18.md
working version, but the speed is 10 Mbit
2026-05-22 21:46:49 +03:00

Fabric Live Audit 2026-05-18

Status: live operational audit of the current fabric. This document records the real state observed on 2026-05-18 and explicitly calls out where runtime behavior still differs from the target architecture.

The target layering model referenced by this audit is documented in FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md. The current execution sequence derived from this audit is maintained in FABRIC_EXECUTION_PLAN_2026-05-19.md.

Current confirmed state

  • Inter-node transport for the live node-agent fleet is QUIC over UDP.
  • The 0.2.327-registrybootstraprewrite rollout initially exposed a backend ingestion defect: fresh home-* / test-* heartbeats were returning HTTP 500, not because QUIC or registry bootstrap was broken, but because PostgreSQL rejected \u0000 inside heartbeat JSON with unsupported Unicode escape sequence (SQLSTATE 22P05).
  • Backend heartbeat ingestion now sanitizes \u0000 before persistence.
  • After that fix, home-* and test-* resumed normal heartbeat flow and converged onto the new release line with live registry promotion.
  • The active node set (home-1, home-2, home-3, test-1, test-2, test-3, usa-los-1, ifcm-rufms-s-mo1cr) currently spans:
    • home-*, test-*, and usa-los-1 on 0.2.327-registrybootstraprewrite;
    • ifcm-rufms-s-mo1cr still remaining on 0.2.322-controlendpointsrewrite.
  • ifcm-rufms-s-mo1cr recovered through the compatibility recovery path and is no longer stale.
  • ifcm-rufms-s-mo1cr has already migrated off compat overlap http://vpn.cin.su:19191/api/v1 and now reports https://vpn.cin.su/api/v1, but it still has not advanced to the new registry-aware release line.
  • home-* and test-* now report:
    • reported_version = 0.2.327-registrybootstraprewrite
    • peer_cache_peers = 7
    • fabric_registry_runtime_report.status = active
  • usa-los-1 is already on 0.2.327-registrybootstraprewrite but still reports fabric_registry_runtime_report.status = missing, which means its remaining problem is a runtime bootstrap/config rewrite gap rather than a version gap.
  • After repairing malformed RAP_MESH_ADVERTISE_ENDPOINTS_JSON on home-1/home-2/home-3, the home area now emits enriched heartbeat metadata again instead of falling back to the thin c3 payload.
  • Live stale-risk snapshot at 2026-05-18T19:39Z now reports:
    • compat_control_dependency_nodes = 1 (ifcm-rufms-s-mo1cr)
    • registry_candidate_only_nodes = 1 (ifcm-rufms-s-mo1cr)
    • direct_peer_alert_nodes = 5
    • area_diversity_alert_nodes = 6
  • Live heartbeat/update status snapshot after the 0.2.325-updatehintwake rollout still shows:
    • ifcm-rufms-s-mo1cr heartbeat fresh at 2026-05-18 23:08:44 UTC
    • fabric_control_endpoint = http://vpn.cin.su:19191/api/v1
    • peer_cache_peers = 7
    • latest update status still stuck at 2026-05-18 20:50 UTC
    • this is now classified as updater_wake_unsupported, not just a generic stale or compat-control symptom
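The heartbeat sanitization mentioned earlier in this list (removing \u0000 before persistence so that PostgreSQL does not raise SQLSTATE 22P05) can be sketched as follows. This is a minimal illustration, not the backend's actual code: the function names `strip_nul` and `sanitize_heartbeat` are hypothetical, and it assumes the heartbeat arrives as a JSON string.

```python
import json
from typing import Any

NUL = "\u0000"  # the code point PostgreSQL refuses inside JSON text


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string, including dict keys."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


def sanitize_heartbeat(raw_json: str) -> str:
    """Hypothetical helper: parse, strip NUL characters, re-serialize."""
    return json.dumps(strip_nul(json.loads(raw_json)))
```

Stripping after parsing, rather than doing a textual replace on the raw payload, avoids touching an escaped backslash that merely happens to be followed by the characters `u0000`.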

Why TCP traffic is still visible

Visible TCP traffic is not coming from the inter-node fabric transport. It is coming from the temporary compatibility recovery overlap that is still active.

Observed live listeners:

  • docker-test
    • 19191/tcp - compatibility Control API/downloads bridge
    • 18080/tcp - web-admin
    • 18090/tcp - release files
    • 18121/tcp - backend Control API
    • 19132/udp, 19133/udp, 19134/udp - QUIC fabric listeners
  • usa-los-1
    • 19131/udp - QUIC fabric listener
    • 19191/tcp - external compatibility bridge currently held open so compat recovery contracts can still reach Control API/downloads

Therefore:

  • TCP is still present by design for recovery overlap.
  • UDP/QUIC is the current node-to-node transport.
  • The statement "the fabric is fully UDP-only" is not yet true at the full system level while 19191/tcp compatibility recovery remains enabled.
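The conclusion above can be checked mechanically against the observed listener table. The sketch below is illustrative only: the `Listener` type and the function names are hypothetical, and the listener data is copied from the lists in this section.

```python
from dataclasses import dataclass
from typing import List

COMPAT_PORT = 19191  # compatibility Control API/downloads bridge


@dataclass(frozen=True)
class Listener:
    host: str
    port: int
    proto: str  # "tcp" or "udp"
    role: str


LISTENERS = [
    Listener("docker-test", 19191, "tcp", "compat bridge"),
    Listener("docker-test", 18080, "tcp", "web-admin"),
    Listener("docker-test", 18090, "tcp", "release files"),
    Listener("docker-test", 18121, "tcp", "backend Control API"),
    Listener("docker-test", 19132, "udp", "QUIC fabric"),
    Listener("docker-test", 19133, "udp", "QUIC fabric"),
    Listener("docker-test", 19134, "udp", "QUIC fabric"),
    Listener("usa-los-1", 19131, "udp", "QUIC fabric"),
    Listener("usa-los-1", 19191, "tcp", "compat bridge"),
]


def compat_overlap_listeners(listeners: List[Listener]) -> List[Listener]:
    """Return the TCP listeners that belong to the recovery overlap."""
    return [l for l in listeners if l.proto == "tcp" and l.port == COMPAT_PORT]


def compat_overlap_active(listeners: List[Listener]) -> bool:
    """True while any 19191/tcp compatibility listener is still open."""
    return bool(compat_overlap_listeners(listeners))
```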

Why nodes were still falling away

1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live ifcm-rufms-s-mo1cr heartbeat after the backend/runtime refresh:

  • fabric_registry_runtime_report.status = candidate_only
  • resolved_service_count = 0
  • resolved_services.control-api = no_active_record
  • resolved_services.update-store = no_active_record
  • resolved_services.update-cache = no_active_record

This means the current runtime still depends on compatibility control URLs more than the target architecture allows. The node is alive in the fabric, but not yet operating from a fully resolved active registry view.

2. Compat control/download contracts are still real dependencies

Observed on the live ifcm-rufms-s-mo1cr heartbeat after recovery:

  • mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1

This confirms the root recovery lesson:

  • a NAT node without manual host access was still anchored to the old recovery contract;
  • until that contract was temporarily restored, the node could not advance;
  • the node did not disappear because QUIC failed; it disappeared because the recovery/control overlap was removed before the node had converged.

3. Direct peer resilience is still below the intended threshold

Observed from the live stale-risk snapshot at 2026-05-18T19:39Z:

  • ifcm-rufms-s-mo1cr
    • peer_connection_ready = 2
    • peer_connection_relay_ready = 3
    • target_ready_peers = 3
  • home-1
    • peer_connection_ready = 1
    • direct_ready_areas = [usa]
    • external_area_ready_count = 1/2
  • home-2
    • peer_connection_ready = 1
    • direct_ready_areas = [usa]
    • external_area_ready_count = 1/2
  • home-3
    • peer_connection_ready = 1
    • direct_ready_areas = [usa]
    • external_area_ready_count = 1/2
  • test-1/2/3
    • peer_connection_ready = 3
    • but direct_ready_areas = [usa]
    • therefore each still triggers external_area_deficit:1_of_2
  • usa-los-1
    • peer_connection_ready = 1
    • direct_ready_areas = [ifcm, home, test]
    • target_ready_peers = 3

This means the direct-path resilience target is not satisfied yet, even though the nodes are healthy.

The practical reason is simple:

  • the cluster has only a small number of externally reachable direct QUIC endpoints;
  • some nodes still advertise only private/LAN-reachable direct candidates;
  • relay-ready adjacency is masking direct peer deficit, but it does not replace the requirement for at least three direct-ready peers.
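The two deficits described above can be reproduced from the snapshot values with a short sketch. Several things here are assumptions rather than facts from the runtime: the `NodeSnapshot` type, the `direct_peer_deficit` label, the area assigned to each node (inferred from its name), and the reading of "1/2" as a floor of two external areas. The snapshot does not record direct_ready_areas for ifcm-rufms-s-mo1cr, so the area check is skipped for that node.

```python
from dataclasses import dataclass
from typing import List, Optional

DIRECT_READY_FLOOR = 3   # "at least three direct-ready peers"
EXTERNAL_AREA_FLOOR = 2  # assumed from external_area_ready_count = 1/2


@dataclass(frozen=True)
class NodeSnapshot:
    name: str
    area: str  # assumed from the node name
    peer_connection_ready: int
    direct_ready_areas: Optional[List[str]]  # None: not recorded in the audit


def node_alerts(node: NodeSnapshot) -> List[str]:
    """Return alert strings for one node; relay-ready peers are ignored."""
    alerts = []
    if node.peer_connection_ready < DIRECT_READY_FLOOR:
        # Hypothetical label; the audit only names the area alert string.
        alerts.append(
            f"direct_peer_deficit:{node.peer_connection_ready}_of_{DIRECT_READY_FLOOR}"
        )
    if node.direct_ready_areas is not None:
        external = set(node.direct_ready_areas) - {node.area}
        if len(external) < EXTERNAL_AREA_FLOOR:
            alerts.append(
                f"external_area_deficit:{len(external)}_of_{EXTERNAL_AREA_FLOOR}"
            )
    return alerts


SNAPSHOT = [
    NodeSnapshot("ifcm-rufms-s-mo1cr", "ifcm", 2, None),
    *[NodeSnapshot(f"home-{i}", "home", 1, ["usa"]) for i in (1, 2, 3)],
    *[NodeSnapshot(f"test-{i}", "test", 3, ["usa"]) for i in (1, 2, 3)],
    NodeSnapshot("usa-los-1", "usa", 1, ["ifcm", "home", "test"]),
]
```

Under these assumptions the sketch flags five nodes for the direct-peer floor and six for area diversity, which agrees with direct_peer_alert_nodes = 5 and area_diversity_alert_nodes = 6 in the stale-risk snapshot.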

3.1 Public endpoint confirmation must be cross-area, not local hairpin

The live home/test topology also exposed a verification mistake in the runtime model:

  • home and test sit behind the same public router address 94.141.118.222;
  • some public QUIC candidates are valid only when tested from another area such as usa or ifcm;
  • a same-area probe can fail purely because the local router does not support hairpin NAT / NAT reflection.

Operational consequence:

  • a public endpoint marked as external-network-required must be treated as non-authoritative when the failure came from self or same_area;
  • the public candidate should be confirmed or rejected by cross_area observers instead.
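One way to express this rule is a verdict function that only lets cross_area probes decide. The sketch below is a simplified illustration, not the runtime's logic: the `Scope` enum, the function name, and the policy that a single successful cross-area probe confirms the endpoint are all assumptions.

```python
from enum import Enum
from typing import List, Tuple


class Scope(Enum):
    SELF = "self"
    SAME_AREA = "same_area"
    CROSS_AREA = "cross_area"


def public_endpoint_verdict(probes: List[Tuple[Scope, bool]]) -> str:
    """Decide the state of a public candidate from (observer scope, success) pairs.

    Results from self and same_area observers are ignored, because a failure
    there may only mean that the local router lacks hairpin NAT.
    """
    cross_area = [ok for scope, ok in probes if scope is Scope.CROSS_AREA]
    if any(cross_area):
        return "confirmed"
    if cross_area:
        return "rejected"     # every cross-area observer failed
    return "unconfirmed"      # no authoritative observation yet
```

For example, a home-* candidate behind 94.141.118.222 that fails only from test-* stays `unconfirmed` until an observer in usa or ifcm has probed it.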

4. Observability is still heterogeneous

Live heartbeat coverage is now richer than it was earlier in the day, but it is still not fully converged in behavior:

  • test-*, ifcm, usa-los-1, and now repaired home-* expose endpoint, peer recovery, and registry sections again.
  • ifcm is still the only node that currently reports compat control and registry candidate_only, so the observability gap has narrowed into a real single-node convergence issue instead of a fleet-wide blind spot.

What is true right now

  1. The fleet is not yet converged on one live node-agent version: ifcm-rufms-s-mo1cr still trails the rest of the fleet.
  2. QUIC/UDP is the actual node-to-node transport.
  3. Compatibility 19191/tcp is still required for recovery overlap.
  4. Signed registry gossip is not yet the sole active discovery/control source.
  5. ifcm still depends on the compat 19191 control overlap.
  6. The plain "3 direct peers" target is insufficient on its own; the live fleet now clearly shows that cross-area direct diversity is the next real gate.

Control/API migration progress

The codebase now carries a more explicit migration contract for control access:

  • install profiles prefer canonical control_plane_endpoints over a compat singleton backend_url;
  • host runtime env generation now exports removed control-plane endpoint env key;
  • node heartbeat/control reporting prefers that canonical endpoint set when it is present;
  • stale updater status behind a fresh heartbeat is now classified separately as updater_subscription_gap;
  • heartbeat update hints now have a second-stage recovery path: after writing update-trigger.json, a live node can also wake its local updater task/service.

This does not instantly rewrite older runtime wrappers on already-installed nodes by itself. It does remove the same trap for the next install, reinstall, or update-service rewrite cycle.
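The separate classification of a stale updater behind a fresh heartbeat, listed above, can be sketched as a comparison of two timestamps. The thresholds, the function name, and the `stale` and `ok` labels are hypothetical; only updater_subscription_gap comes from this document.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_FRESH_WITHIN = timedelta(minutes=5)  # hypothetical threshold
UPDATER_STALE_AFTER = timedelta(hours=1)       # hypothetical threshold


def classify_node(now: datetime, last_heartbeat: datetime,
                  last_update_status: datetime) -> str:
    """Separate a dead node from a live node whose updater stopped reporting."""
    if now - last_heartbeat > HEARTBEAT_FRESH_WITHIN:
        return "stale"  # hypothetical label for a node with no fresh heartbeat
    if now - last_update_status > UPDATER_STALE_AFTER:
        return "updater_subscription_gap"
    return "ok"  # hypothetical label


# Values taken from the ifcm-rufms-s-mo1cr snapshot in this audit.
IFCM_HEARTBEAT = datetime(2026, 5, 18, 23, 8, 44, tzinfo=timezone.utc)
IFCM_UPDATE_STATUS = datetime(2026, 5, 18, 20, 50, tzinfo=timezone.utc)
```

With an evaluation time shortly after the recorded heartbeat, the ifcm-rufms-s-mo1cr values fall into the updater_subscription_gap branch.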

Operational rule until the next audit

Do not remove the compatibility 19191/tcp recovery overlap while any of the following remain true:

  • any live node still reports a fabric_control_endpoint on the 19191 contract;
  • any live node has fabric_registry_runtime_report.status != active;
  • any externally significant node has fewer than 3 direct-ready peers;
  • any node can only recover through compat Control API/downloads overlap.
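The four conditions above can be turned into a pre-removal gate. The sketch below assumes a flattened per-node dictionary; the field names `registry_status`, `direct_ready_peers`, `externally_significant`, and `compat_only_recovery` are hypothetical simplifications of the heartbeat fields, and the function is illustrative rather than an existing tool.

```python
from typing import Dict, List
from urllib.parse import urlsplit

COMPAT_PORT = 19191
DIRECT_READY_FLOOR = 3


def overlap_removal_blockers(nodes: List[Dict]) -> List[str]:
    """Return one reason per violated condition; an empty list means safe."""
    blockers = []
    for node in nodes:
        name = node["name"]
        if urlsplit(node["fabric_control_endpoint"]).port == COMPAT_PORT:
            blockers.append(f"{name}: control endpoint still on {COMPAT_PORT}")
        if node["registry_status"] != "active":
            blockers.append(f"{name}: registry status is {node['registry_status']}")
        if (node["externally_significant"]
                and node["direct_ready_peers"] < DIRECT_READY_FLOOR):
            blockers.append(f"{name}: only {node['direct_ready_peers']} direct-ready peers")
        if node["compat_only_recovery"]:
            blockers.append(f"{name}: recovers only through the compat overlap")
    return blockers
```

Returning the reasons instead of a single boolean keeps the gate usable as an audit artifact: the list shows which node is holding the overlap open and why.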

Required next work

Update 2026-05-19:

  • rap-node-agent 0.2.325-updatehintwake was released with a local updater wake path driven by heartbeat update hints.
  • The release exists because ifcm-rufms-s-mo1cr showed that a node can keep sending fresh heartbeats while the updater subscription plane silently stops progressing.
  • This is now treated as a first-class recovery-plane problem, not as a vague stale-node symptom.
  • The live rollout already moved home-*, test-*, and usa-los-1 onto 0.2.325-updatehintwake.
  • ifcm-rufms-s-mo1cr is now the only remaining updater_wake_unsupported blocker.
  • Live ifcm peer telemetry also exposed a distinct transport-resilience defect: on one stale-relay/bootstrap path the node tried a relay endpoint with the certificate fingerprint from a different private direct candidate, producing CRYPTO_ERROR ... quic peer certificate fingerprint mismatch.
  • That bug is now fixed in the runtime line tracked as 0.2.332-relaycertintentfix.

A. Finish signed registry activation

Each node must be able to resolve active records for at least:

  • control-api
  • update-store
  • update-cache

without falling back to the 19191 compatibility contract.
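A node-level check for this requirement might look like the sketch below. It assumes a report shaped like the fields quoted earlier in this audit (status, resolved_services) and assumes that a resolved service is reported with the value `active`; the audit itself only shows the `no_active_record` value, so that value and the function names are assumptions.

```python
from typing import Dict, List

REQUIRED_SERVICES = ("control-api", "update-store", "update-cache")


def unresolved_services(resolved_services: Dict[str, str]) -> List[str]:
    """List required services that have no active registry record."""
    return [
        service for service in REQUIRED_SERVICES
        # "active" as the resolved value is an assumption.
        if resolved_services.get(service, "no_active_record") != "active"
    ]


def registry_fully_active(report: Dict) -> bool:
    """True when the registry is active and every required service resolves."""
    return (report.get("status") == "active"
            and not unresolved_services(report.get("resolved_services", {})))


# Report observed on ifcm-rufms-s-mo1cr, as quoted in this audit.
IFCM_REPORT = {
    "status": "candidate_only",
    "resolved_services": {
        "control-api": "no_active_record",
        "update-store": "no_active_record",
        "update-cache": "no_active_record",
    },
}
```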

B. Promote full direct endpoint dissemination

All nodes with public reachability must advertise every valid public direct QUIC endpoint, and nodes must retain enough live peer memory to reconnect without operator intervention.

C. Enforce the direct-ready floor as a live alert

If a node has fewer than 3 direct-ready peers, this must remain a real operational alert even when relay-ready peers exist.

D. Normalize heartbeat observability

Every production node must emit the same minimum audit surface:

  • endpoint candidates
  • peer recovery counts
  • registry runtime state
  • update runtime state

without mixing rich and reduced heartbeat schemas across the fleet.
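A fleet-wide schema check for this minimum surface can be sketched as follows. Only `fabric_registry_runtime_report` is a field name taken from this audit; the other three section keys are hypothetical placeholders for the real heartbeat schema.

```python
from typing import Dict, List

# Hypothetical keys, except fabric_registry_runtime_report.
REQUIRED_SECTIONS = (
    "endpoint_candidates",
    "peer_recovery",
    "fabric_registry_runtime_report",
    "update_runtime",
)


def missing_sections(heartbeat: Dict) -> List[str]:
    """Return the required audit sections that a heartbeat does not carry."""
    return [key for key in REQUIRED_SECTIONS if key not in heartbeat]


def reduced_schema_nodes(heartbeats: Dict[str, Dict]) -> Dict[str, List[str]]:
    """Map node name to its missing sections, for nodes on a reduced schema."""
    report = {}
    for name, heartbeat in heartbeats.items():
        missing = missing_sections(heartbeat)
        if missing:
            report[name] = missing
    return report
```

An empty result from `reduced_schema_nodes` would indicate that no node in the given set is mixing a reduced heartbeat into the fleet.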

E. Replace the naive peer-count rule

The live fleet shows that a plain "3 links per node" rule is not a sufficient resilience model.

The current corrective design is documented in FABRIC_AREA_AND_PEER_STABILITY_MODEL.md and introduces:

  • area as a failure-domain label;
  • direct-ready vs relay-ready separation;
  • cross-area diversity requirements;
  • full-directory retention for small fleets.