Working version, but speed is only 10 Mbit/s
@@ -4,9 +4,22 @@ Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.

The target layering model referenced by this audit is documented in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
The current execution sequence derived from this audit is maintained in
[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).

## Current confirmed state

- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
  ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
  `500`, not because QUIC or registry bootstrap was broken, but because
  PostgreSQL rejected `\u0000` inside heartbeat JSON with
  `unsupported Unicode escape sequence (SQLSTATE 22P05)`.
- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
  converged onto the new release line with live registry promotion.
- The active node set
  - `home-1`
  - `home-2`
@@ -16,9 +29,40 @@ behavior still differs from the target architecture.
  - `test-3`
  - `usa-los-1`
  - `ifcm-rufms-s-mo1cr`
  is converged on `0.2.321-directreadytarget`.
  currently spans:
  - `home-*`, `test-*`, and `usa-los-1` on
    `0.2.327-registrybootstraprewrite`;
  - `ifcm-rufms-s-mo1cr` still remaining on
    `0.2.322-controlendpointsrewrite`.
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
  no longer stale.
- `ifcm-rufms-s-mo1cr` has already migrated off the compat overlap
  `http://vpn.cin.su:19191/api/v1` and now reports
  `https://vpn.cin.su/api/v1`, but it still has not advanced to the new
  registry-aware release line.
- `home-*` and `test-*` now report:
  - `reported_version = 0.2.327-registrybootstraprewrite`
  - `peer_cache_peers = 7`
  - `fabric_registry_runtime_report.status = active`
- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
  reports `fabric_registry_runtime_report.status = missing`, which means the
  gap on this node is a runtime bootstrap/config rewrite gap rather than a
  version gap.
- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
  `home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
  again instead of falling back to the thin `c3` payload.
- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
  - `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
  - `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
  - `direct_peer_alert_nodes = 5`
  - `area_diversity_alert_nodes = 6`
- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
  rollout still shows:
  - `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
  - `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
  - `peer_cache_peers = 7`
  - latest update status still stuck at `2026-05-18 20:50 UTC`
  - this is now classified as `updater_wake_unsupported`, not just a generic
    stale or compat-control symptom

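
The heartbeat ingestion fix recorded in the list above (removing `\u0000` before persistence) can be illustrated with a minimal sketch. This is not the backend's actual code: the function name `strip_nul` and the sample payload are hypothetical, and the sketch assumes the heartbeat is decoded to Python objects before being written to a column type that cannot represent the NUL character.

```python
import json
from typing import Any


def strip_nul(value: Any) -> Any:
    """Recursively remove NUL characters from every string in a decoded JSON value."""
    if isinstance(value, str):
        return value.replace("\x00", "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        # Keys are strings too, so sanitize them as well as the values.
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans and None pass through unchanged


# Hypothetical heartbeat body carrying an escaped NUL in two fields.
raw = '{"node": "home-1", "note": "ok\\u0000", "peers": ["test-1\\u0000"]}'
clean = strip_nul(json.loads(raw))
sanitized_json = json.dumps(clean)  # safe to hand to the persistence layer
```

Sanitizing the decoded structure rather than the raw text avoids accidentally touching a literal backslash sequence that is not a real NUL escape.
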
## Why TCP traffic is still visible

@@ -35,7 +79,7 @@ Observed live listeners:
  - `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
  - `19131/udp` - QUIC fabric listener
  - `19191/tcp` - external compatibility bridge currently held open so legacy
  - `19191/tcp` - external compatibility bridge currently held open so compat
    recovery contracts can still reach `Control API/downloads`

Therefore:
@@ -49,7 +93,8 @@ Therefore:

### 1. Nodes do not yet operate from a fully active signed registry gossip plane

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
refresh:

- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
@@ -61,11 +106,11 @@ This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.

### 2. Legacy control/download contracts are still real dependencies
### 2. Compat control/download contracts are still real dependencies

Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:

- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`

This confirms the root recovery lesson:

@@ -77,15 +122,31 @@ This confirms the root recovery lesson:

### 3. Direct peer resilience is still below the intended threshold

Observed from live heartbeat metadata:
Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:

- `ifcm-rufms-s-mo1cr`
  - `peer_connection_ready = 2`
  - `peer_connection_relay_ready = 3`
  - `target_ready_peers = 3`
- `home-1`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `home-2`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `home-3`
  - `peer_connection_ready = 1`
  - `direct_ready_areas = [usa]`
  - `external_area_ready_count = 1/2`
- `test-1/2/3`
  - `peer_connection_ready = 3`
  - but `direct_ready_areas = [usa]`
  - therefore each still triggers `external_area_deficit:1_of_2`
- `usa-los-1`
  - `peer_connection_ready = 1`
  - `peer_connection_relay_ready = 5`
  - `direct_ready_areas = [ifcm, home, test]`
  - `target_ready_peers = 3`

This means the direct-path resilience target is not satisfied yet, even though
@@ -99,17 +160,35 @@ The practical reason is simple:
- relay-ready adjacency is masking direct peer deficit, but it does not replace
  the requirement for at least three direct-ready peers.
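
The point above, that relay-ready adjacency does not count toward the direct-ready requirement, can be sketched as a small check. The field names `peer_connection_ready` and `peer_connection_relay_ready` come from the heartbeat fields quoted in this audit; the `PeerReadiness` type, the function name, and the reading of `peer_connection_ready` as the number of direct-ready peers are assumptions made for illustration.

```python
from dataclasses import dataclass

MIN_DIRECT_READY_PEERS = 3  # threshold stated in the audit text


@dataclass
class PeerReadiness:
    node: str
    peer_connection_ready: int        # assumed: peers reachable over a direct path
    peer_connection_relay_ready: int  # peers reachable through a relay


def direct_peer_deficit(r: PeerReadiness) -> int:
    """Return how many direct-ready peers are missing; relay-ready peers never count."""
    return max(0, MIN_DIRECT_READY_PEERS - r.peer_connection_ready)


# Values from the stale-risk snapshot quoted in section 3.
ifcm = PeerReadiness("ifcm-rufms-s-mo1cr", peer_connection_ready=2, peer_connection_relay_ready=3)
usa = PeerReadiness("usa-los-1", peer_connection_ready=1, peer_connection_relay_ready=5)
deficits = {r.node: direct_peer_deficit(r) for r in (ifcm, usa)}
```

Under these assumptions `usa-los-1` has five relay-ready peers yet is still two direct-ready peers short of the target, which is the masking effect described above.
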

### 3.1 Public endpoint confirmation must be cross-area, not local hairpin

The live `home/test` topology also exposed a verification mistake in the
runtime model:

- `home` and `test` sit behind the same public router address
  `94.141.118.222`;
- some public QUIC candidates are valid only when tested from another area such
  as `usa` or `ifcm`;
- a same-area probe can fail purely because the local router does not support
  hairpin NAT / NAT reflection.

Operational consequence:

- a failure on a public endpoint marked as `external-network-required` must be
  treated as non-authoritative when it came from a `self` or `same_area` probe;
- the public candidate should be confirmed or rejected by `cross_area`
  observers instead.
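
The operational consequence above can be sketched as a verdict function that only trusts cross-area observers. The scope names `self`, `same_area`, and `cross_area` are taken from the text; the `ProbeResult` type, the function name, the verdict strings, and the rule that a single successful cross-area probe is enough to confirm are assumptions, not the runtime's actual logic.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    observer_scope: str  # "self", "same_area", or "cross_area"
    reachable: bool


def confirm_public_endpoint(probes: List[ProbeResult]) -> str:
    """Classify a public candidate using cross-area probes only.

    Probes from "self" or "same_area" are ignored because they may fail
    purely due to missing hairpin NAT on the local router.
    """
    cross = [p for p in probes if p.observer_scope == "cross_area"]
    if not cross:
        return "unconfirmed"  # no authoritative observer has reported yet
    return "confirmed" if any(p.reachable for p in cross) else "rejected"


# A same-area failure alone leaves the candidate unconfirmed, not rejected.
verdict = confirm_public_endpoint([ProbeResult("same_area", False)])
```

Keeping an explicit `unconfirmed` state prevents a hairpin failure from being mistaken for a real rejection while no cross-area observer has reported.
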

### 4. Observability is still heterogeneous

Live heartbeat coverage is inconsistent:
Live heartbeat coverage is now richer than it was earlier in the day, but it is
still not fully converged in behavior:

- `test-*`, `ifcm`, `usa-los-1` emit rich `c17z20` heartbeat metadata with
  endpoint, peer recovery, and registry sections.
- `home-*` currently do not expose the same full sections in their latest
  heartbeat rows.

This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.
- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
  peer recovery, and registry sections again.
- `ifcm` is still the only node that currently reports `compat control` and
  `registry candidate_only`, so the observability gap has narrowed into a real
  single-node convergence issue instead of a fleet-wide blind spot.

## What is true right now

@@ -117,21 +196,63 @@ uniform live introspection across every node today.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
   met for all externally significant nodes.
5. `ifcm` still depends on the compat `19191` control overlap.
6. The plain `3 direct peers` target is insufficient on its own; the live fleet
   now clearly shows that `cross-area direct diversity` is the next real gate.

## Control/API migration progress

The codebase now carries a more explicit migration contract for control access:

- install profiles prefer canonical `control_plane_endpoints` over a compat
  singleton `backend_url`;
- host runtime env generation now exports
  removed control-plane endpoint env key;
- node heartbeat/control reporting prefers that canonical endpoint set when it
  is present;
- stale updater status behind a fresh heartbeat is now classified separately as
  `updater_subscription_gap`;
- heartbeat update hints now have a second-stage recovery path: after writing
  `update-trigger.json`, a live node can also wake its local updater
  task/service.
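
The `updater_subscription_gap` classification in the list above (stale updater status behind a fresh heartbeat) can be sketched as follows. The two thresholds, the function name, and the observation time `now` are hypothetical; the audit does not state the real values. The heartbeat and update-status timestamps are the ones quoted in this audit for `ifcm-rufms-s-mo1cr`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical thresholds; the audit does not state the real values.
HEARTBEAT_FRESH_WITHIN = timedelta(minutes=5)
UPDATE_STATUS_STALE_AFTER = timedelta(hours=1)


def classify_updater(
    now: datetime, last_heartbeat: datetime, last_update_status: datetime
) -> Optional[str]:
    """Return "updater_subscription_gap" when the heartbeat is fresh but the updater is stale."""
    heartbeat_fresh = (now - last_heartbeat) <= HEARTBEAT_FRESH_WITHIN
    updater_stale = (now - last_update_status) > UPDATE_STATUS_STALE_AFTER
    if heartbeat_fresh and updater_stale:
        return "updater_subscription_gap"
    return None  # healthy, or stale in the ordinary whole-node sense


# Timestamps quoted in the audit; `now` is an assumed observation time.
now = datetime(2026, 5, 18, 23, 10, 0, tzinfo=timezone.utc)
heartbeat = datetime(2026, 5, 18, 23, 8, 44, tzinfo=timezone.utc)
update_status = datetime(2026, 5, 18, 20, 50, 0, tzinfo=timezone.utc)
result = classify_updater(now, heartbeat, update_status)
```

Separating this case from a generic stale node matters because the node is still reachable through heartbeat hints, so a wake path can be attempted instead of a full recovery.
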

This does not instantly rewrite older runtime wrappers on already-installed
nodes by itself. It does remove the same trap for the next install, reinstall,
or update-service rewrite cycle.

## Operational rule until the next audit

Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:

- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.
- any node can only recover through compat `Control API/downloads` overlap.
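
The conditions above can be sketched as a gate that lists every reason the `19191/tcp` overlap must stay. The `NodeState` type, its field names, and the `externally_significant` and `compat_only_recovery` flags are hypothetical; matching the `19191` contract by the `:19191` substring is also an assumption about how the endpoint would be recognized.

```python
from dataclasses import dataclass
from typing import List

COMPAT_PORT_MARKER = ":19191"  # assumed way to recognize the compat contract
MIN_DIRECT_READY_PEERS = 3


@dataclass
class NodeState:
    name: str
    control_plane_url: str
    fabric_control_endpoint: str
    registry_status: str          # fabric_registry_runtime_report.status
    direct_ready_peers: int
    externally_significant: bool  # hypothetical flag
    compat_only_recovery: bool    # hypothetical flag


def overlap_blockers(nodes: List[NodeState]) -> List[str]:
    """Return one message per condition that still forbids removing the overlap."""
    blockers = []
    for n in nodes:
        if COMPAT_PORT_MARKER in n.control_plane_url:
            blockers.append(f"{n.name}: control_plane_url on 19191")
        if COMPAT_PORT_MARKER in n.fabric_control_endpoint:
            blockers.append(f"{n.name}: fabric_control_endpoint on 19191")
        if n.registry_status != "active":
            blockers.append(f"{n.name}: registry status {n.registry_status}")
        if n.externally_significant and n.direct_ready_peers < MIN_DIRECT_READY_PEERS:
            blockers.append(f"{n.name}: only {n.direct_ready_peers} direct-ready peers")
        if n.compat_only_recovery:
            blockers.append(f"{n.name}: recovery depends on compat overlap")
    return blockers


# ifcm values from the audit; the two boolean flags are illustrative guesses.
ifcm = NodeState("ifcm-rufms-s-mo1cr", "http://vpn.cin.su:19191/api/v1",
                 "http://vpn.cin.su:19191/api/v1", "candidate_only", 2, True, True)
can_remove_overlap = not overlap_blockers([ifcm])
```

Returning the list of blockers rather than a bare boolean keeps the result usable as an audit note for the next review.
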

## Required next work

Update 2026-05-19:

- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
  wake path driven by heartbeat update hints.
- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
  sending fresh heartbeats while the updater subscription plane silently stops
  progressing.
- This is now treated as a first-class recovery-plane problem, not as a vague
  stale-node symptom.
- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
  `0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is now the only remaining
  `updater_wake_unsupported` blocker.
- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
  defect: on one stale-relay/bootstrap path the node tried a relay endpoint
  with the certificate fingerprint from a different private direct candidate,
  producing
  `CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
- That bug is now fixed in the runtime line tracked as
  `0.2.332-relaycertintentfix`.

### A. Finish signed registry activation

Each node must be able to resolve active records for at least:
