# Fabric Live Audit 2026-05-18

Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.

The target layering model referenced by this audit is documented in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
The current execution sequence derived from this audit is maintained in
[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).

## Current confirmed state

- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
  ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
  `500`, not because QUIC or registry bootstrap was broken, but because
  PostgreSQL rejected `\u0000` inside heartbeat JSON with
  `unsupported Unicode escape sequence (SQLSTATE 22P05)`.
- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
  converged onto the new release line with live registry promotion.
- The active node set
  - `home-1`
  - `home-2`
  - `home-3`
  - `test-1`
  - `test-2`
  - `test-3`
  - `usa-los-1`
  - `ifcm-rufms-s-mo1cr`

  currently spans:
  - `home-*`, `test-*`, and `usa-los-1` on
    `0.2.327-registrybootstraprewrite`;
  - `ifcm-rufms-s-mo1cr` still remaining on
    `0.2.322-controlendpointsrewrite`.
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
  no longer stale.
- `ifcm-rufms-s-mo1cr` has already migrated off compat overlap
  `http://vpn.cin.su:19191/api/v1` and now reports
  `https://vpn.cin.su/api/v1`, but it still has not advanced to the new
  registry-aware release line.
- `home-*` and `test-*` now report:
  - `reported_version = 0.2.327-registrybootstraprewrite`
  - `peer_cache_peers = 7`
  - `fabric_registry_runtime_report.status = active`
- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
  reports `fabric_registry_runtime_report.status = missing`, which means this
  node remains a runtime bootstrap/config rewrite gap rather than a version-gap.
- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
  `home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
  again instead of falling back to the thin `c3` payload.
- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
  - `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
  - `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
  - `direct_peer_alert_nodes = 5`
  - `area_diversity_alert_nodes = 6`
- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
  rollout still shows:
  - `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
  - `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
  - `peer_cache_peers = 7`
  - latest update status still stuck at `2026-05-18 20:50 UTC`
  - this is now classified as `updater_wake_unsupported`, not just a generic
    stale or compat-control symptom
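
The `\u0000` sanitization mentioned in the list above can be sketched as follows. This is a minimal illustration and not the backend's actual code: the names `strip_nul` and `sanitize_heartbeat` are hypothetical, and dropping the character (rather than replacing it with a placeholder) is an assumption. The sketch relies only on the fact that PostgreSQL's `jsonb` type cannot store the `\u0000` escape.

```python
import json
from typing import Any

NUL = "\u0000"  # the character PostgreSQL rejected in heartbeat JSON


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from every string key and value."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, dict):
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    return value  # numbers, booleans, and None pass through unchanged


def sanitize_heartbeat(raw: str) -> str:
    """Parse heartbeat JSON, drop NUL characters, and re-serialize it."""
    return json.dumps(strip_nul(json.loads(raw)))
```

Sanitizing the parsed structure instead of the raw text avoids corrupting an escaped backslash that happens to precede the literal text `u0000`.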

## Why TCP traffic is still visible

Visible TCP traffic is not coming from the inter-node fabric transport. It is
coming from the temporary compatibility recovery overlap that is still active.

Observed live listeners:

- `docker-test`
  - `19191/tcp` - compatibility `Control API/downloads` bridge
  - `18080/tcp` - web-admin
  - `18090/tcp` - release files
  - `18121/tcp` - backend Control API
  - `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
  - `19131/udp` - QUIC fabric listener
  - `19191/tcp` - external compatibility bridge currently held open so compat
    recovery contracts can still reach `Control API/downloads`

Therefore:

- `TCP` is still present by design for recovery overlap.
- `UDP/QUIC` is the current node-to-node transport.
- The statement "the fabric is fully UDP-only" is not yet true at the full
  system level while `19191/tcp` compatibility recovery remains enabled.

## Why nodes were still falling away

### 1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
refresh:

- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
- `resolved_services.control-api = no_active_record`
- `resolved_services.update-store = no_active_record`
- `resolved_services.update-cache = no_active_record`

This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.

### 2. Compat control/download contracts are still real dependencies

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:

- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`

This confirms the root recovery lesson:

- a NAT node without manual host access was still anchored to the old recovery
  contract;
- until that contract was temporarily restored, the node could not advance;
- the node did not disappear because QUIC failed; it disappeared because the
  recovery/control overlap was removed before the node had converged.

### 3. Direct peer resilience is still below the intended threshold

Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:

- `ifcm-rufms-s-mo1cr`
  - `peer_connection_ready = 2`
  - `peer_connection_relay_ready = 3`
  - `target_ready_peers = 3`
- `home-1`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `home-2`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `home-3`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `test-1/2/3`
  - `peer_connection_ready = 3`
  - but `direct_ready_areas = [usa]`
  - therefore each still triggers `external_area_deficit:1_of_2`
- `usa-los-1`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [ifcm, home, test]`
  - `target_ready_peers = 3`

This means the direct-path resilience target is not satisfied yet, even though
the nodes are healthy.

The practical reason is simple:

- the cluster has only a small number of externally reachable direct QUIC
  endpoints;
- some nodes still advertise only private/LAN-reachable direct candidates;
- relay-ready adjacency is masking direct peer deficit, but it does not replace
  the requirement for at least three direct-ready peers.
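
The `external_area_deficit:1_of_2` alert in the snapshot above can be illustrated with a short sketch. It assumes the alert counts distinct direct-ready areas other than the node's own area against a required minimum of two; the function name `external_area_alert` and its parameters are hypothetical and not taken from the runtime.

```python
from typing import Iterable, Optional


def external_area_alert(
    own_area: str,
    direct_ready_areas: Iterable[str],
    required_external_areas: int = 2,  # assumed from the "1/2" counts above
) -> Optional[str]:
    """Return an alert label when too few foreign areas are direct-ready."""
    # Only areas other than the node's own count as failure-domain diversity.
    external = {area for area in direct_ready_areas if area != own_area}
    if len(external) >= required_external_areas:
        return None
    return f"external_area_deficit:{len(external)}_of_{required_external_areas}"
```

Under these assumptions a `test` node with `direct_ready_areas = [usa]` still raises the deficit even with three direct-ready peers, because all of its direct diversity comes from a single foreign area.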

### 3.1 Public endpoint confirmation must be cross-area, not local hairpin

The live `home/test` topology also exposed a verification mistake in the
runtime model:

- `home` and `test` sit behind the same public router address
  `94.141.118.222`;
- some public QUIC candidates are valid only when tested from another area such
  as `usa` or `ifcm`;
- a same-area probe can fail purely because the local router does not support
  hairpin NAT / NAT reflection.

Operational consequence:

- a public endpoint marked as `external-network-required` must be treated as
  non-authoritative when the failure came from `self` or `same_area`;
- the public candidate should be confirmed or rejected by `cross_area`
  observers instead.
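
A minimal sketch of that rule follows. The `Probe` type, the function name, and the policy that a single successful cross-area probe confirms the endpoint are assumptions made for illustration; only the observer scopes `self`, `same_area`, and `cross_area` come from the text above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Probe:
    observer_scope: str  # "self", "same_area", or "cross_area"
    reachable: bool


def confirm_public_endpoint(probes: Iterable[Probe]) -> Optional[bool]:
    """Decide a public candidate using cross-area observers only.

    Returns True (confirmed), False (rejected), or None (undetermined
    because no cross-area observer has probed the endpoint yet).
    """
    cross_area = [p for p in probes if p.observer_scope == "cross_area"]
    if not cross_area:
        # self / same_area results are non-authoritative (hairpin NAT).
        return None
    # Assumed policy: one successful cross-area probe confirms the endpoint.
    return any(p.reachable for p in cross_area)
```

With this shape, a failed probe from `home` against a `test` endpoint behind the same router leaves the candidate undetermined instead of rejecting it.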

### 4. Observability is still heterogeneous

Live heartbeat coverage is now richer than it was earlier in the day, but it is
still not fully converged in behavior:

- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
  peer recovery, and registry sections again.
- `ifcm` is still the only node that currently reports `compat control` and
  `registry candidate_only`, so the observability gap has narrowed into a real
  single-node convergence issue instead of a fleet-wide blind spot.

## What is true right now

1. The fleet is converged on one live node-agent version, except
   `ifcm-rufms-s-mo1cr`, which remains on an older release line.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. `ifcm` still depends on the compat `19191` control overlap.
6. The plain `3 direct peers` target is insufficient on its own; the live fleet
   now clearly shows that `cross-area direct diversity` is the next real gate.

## Control/API migration progress

The codebase now carries a more explicit migration contract for control access:

- install profiles prefer canonical `control_plane_endpoints` over a compat
  singleton `backend_url`;
- host runtime env generation now exports
  removed control-plane endpoint env key;
- node heartbeat/control reporting prefers that canonical endpoint set when it
  is present;
- stale updater status behind a fresh heartbeat is now classified separately as
  `updater_subscription_gap`;
- heartbeat update hints now have a second-stage recovery path: after writing
  `update-trigger.json`, a live node can also wake its local updater
  task/service.

This does not by itself rewrite older runtime wrappers on already-installed
nodes. It does remove the same trap for the next install, reinstall, or
update-service rewrite cycle.
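
The `updater_subscription_gap` classification listed above can be sketched as a comparison of two timestamps. The thresholds, the function name, and the labels `stale_node` and `ok` are assumptions chosen for illustration; only `updater_subscription_gap` is taken from this audit.

```python
from datetime import datetime, timedelta, timezone


def classify_update_plane(
    now: datetime,
    last_heartbeat: datetime,
    last_update_status: datetime,
    heartbeat_fresh: timedelta = timedelta(minutes=5),  # assumed threshold
    updater_stale: timedelta = timedelta(hours=1),      # assumed threshold
) -> str:
    """Separate a silent updater from a generally stale node."""
    if now - last_heartbeat > heartbeat_fresh:
        return "stale_node"  # hypothetical label: the node itself is quiet
    if now - last_update_status > updater_stale:
        # Heartbeat is fresh but the updater plane stopped progressing.
        return "updater_subscription_gap"
    return "ok"  # hypothetical label


# Timestamps from the ifcm-rufms-s-mo1cr snapshot; `now` is an assumption.
now = datetime(2026, 5, 18, 23, 10, tzinfo=timezone.utc)
heartbeat = datetime(2026, 5, 18, 23, 8, 44, tzinfo=timezone.utc)
update_status = datetime(2026, 5, 18, 20, 50, tzinfo=timezone.utc)
result = classify_update_plane(now, heartbeat, update_status)
```

Checking heartbeat freshness first keeps a fully stale node from being reported as an updater-only problem.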

## Operational rule until the next audit

Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:

- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through compat `Control API/downloads` overlap.
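
The four conditions above can be expressed as a single pre-removal gate. The sketch below is a hypothetical helper: the `NodeState` fields, in particular `externally_significant` and `compat_only_recovery`, are assumed inputs that an operator or the backend would have to supply, and matching the `19191` contract by URL port is also an assumption.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlsplit

COMPAT_PORT = 19191


@dataclass(frozen=True)
class NodeState:
    name: str
    fabric_control_endpoint: str
    registry_status: str           # e.g. "active", "candidate_only", "missing"
    direct_ready_peers: int
    externally_significant: bool   # assumed operator-supplied flag
    compat_only_recovery: bool     # assumed operator-supplied flag


def overlap_removal_blockers(nodes: List[NodeState], direct_floor: int = 3) -> List[str]:
    """List every reason the 19191/tcp overlap must stay enabled."""
    blockers: List[str] = []
    for node in nodes:
        if urlsplit(node.fabric_control_endpoint).port == COMPAT_PORT:
            blockers.append(f"{node.name}: control endpoint on {COMPAT_PORT} contract")
        if node.registry_status != "active":
            blockers.append(f"{node.name}: registry status is {node.registry_status}")
        if node.externally_significant and node.direct_ready_peers < direct_floor:
            blockers.append(f"{node.name}: {node.direct_ready_peers} direct-ready peers")
        if node.compat_only_recovery:
            blockers.append(f"{node.name}: recovers only through compat overlap")
    return blockers
```

An empty result is the only state in which removing the overlap would be consistent with the rule; any non-empty list names the nodes that would be stranded.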

## Required next work

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
  wake path driven by heartbeat update hints.
- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
  sending fresh heartbeat while the updater subscription plane silently stops
  progressing.
- This is now treated as a first-class recovery-plane problem, not as a vague
  stale-node symptom.
- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
  `0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is now the only remaining
  `updater_wake_unsupported` blocker.
- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
  defect: on one stale-relay/bootstrap path the node tried a relay endpoint
  with the certificate fingerprint from a different private direct candidate,
  producing
  `CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
- That bug is now fixed in the runtime line tracked as
  `0.2.332-relaycertintentfix`.

### A. Finish signed registry activation

Each node must be able to resolve active records for at least:

- `control-api`
- `update-store`
- `update-cache`

without falling back to the `19191` compatibility contract.

### B. Promote full direct endpoint dissemination

All nodes with public reachability must advertise every valid public direct QUIC
endpoint, and nodes must retain enough live peer memory to reconnect without
operator intervention.

### C. Enforce the direct-ready floor as a live alert

If a node has fewer than 3 direct-ready peers, this must remain a real
operational alert even when relay-ready peers exist.
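
The floor check can be sketched so that relay-ready peers are reported but never suppress the alert. The `FloorCheck` type and the function name are hypothetical, and treating direct-ready and relay-ready as separate counts is an assumption about the heartbeat fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloorCheck:
    alert: bool            # True when the direct-ready floor is violated
    masked_by_relay: bool  # True when relay-ready peers hide the deficit


def check_direct_floor(direct_ready: int, relay_ready: int, floor: int = 3) -> FloorCheck:
    """Evaluate the direct-ready floor without counting relay-ready peers."""
    alert = direct_ready < floor
    # Relay peers keep the node reachable but do not clear the alert.
    return FloorCheck(alert=alert, masked_by_relay=alert and relay_ready > 0)
```

For the `ifcm-rufms-s-mo1cr` values in this audit (2 direct-ready, 3 relay-ready) the sketch yields an active alert that is flagged as relay-masked.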

### D. Normalize heartbeat observability

Every production node must emit the same minimum audit surface:

- endpoint candidates
- peer recovery counts
- registry runtime state
- update runtime state

without mixing rich and reduced heartbeat schemas across the fleet.
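
A schema check for that minimum surface could look like the sketch below. Apart from `fabric_registry_runtime_report`, which appears in this audit, the section key names are placeholders: the real heartbeat field names are not given here.

```python
from typing import Any, Dict, List

# Assumed key names; only fabric_registry_runtime_report is from the audit.
REQUIRED_SECTIONS = (
    "endpoint_candidates",
    "peer_recovery",
    "fabric_registry_runtime_report",
    "update_runtime",
)


def missing_audit_sections(heartbeat: Dict[str, Any]) -> List[str]:
    """Return the required sections that are absent or null in a heartbeat."""
    return [key for key in REQUIRED_SECTIONS if heartbeat.get(key) is None]
```

A node emitting a reduced payload would show up with a non-empty list, which makes schema drift across the fleet visible per node.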

### E. Replace the naive peer-count rule

The live fleet shows that a plain "3 links per node" rule is not a sufficient
resilience model.

The current corrective design is documented in
[FABRIC_AREA_AND_PEER_STABILITY_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_AREA_AND_PEER_STABILITY_MODEL.md)
and introduces:

- `area` as a failure-domain label;
- direct-ready vs relay-ready separation;
- cross-area diversity requirements;
- full-directory retention for small fleets.