Working version, but speed is 10 Mbit/s
build / backend (push) Has been cancelled
build / node-agent (push) Has been cancelled
build / worker (push) Has been cancelled

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
+140 -19
@@ -4,9 +4,22 @@ Status: live operational audit of the current fabric. This document records the
real state observed on 2026-05-18 and explicitly calls out where runtime
behavior still differs from the target architecture.
The target layering model referenced by this audit is documented in
[FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_SERVICE_OVER_TRANSPORT_MODEL.md).
The current execution sequence derived from this audit is maintained in
[FABRIC_EXECUTION_PLAN_2026-05-19.md](\\nas\\MST\\codex\\rdp-proxy\\docs\\architecture\\FABRIC_EXECUTION_PLAN_2026-05-19.md).
## Current confirmed state
- Inter-node transport for the live node-agent fleet is `QUIC over UDP`.
- The `0.2.327-registrybootstraprewrite` rollout initially exposed a backend
ingestion defect: fresh `home-*` / `test-*` heartbeats were returning HTTP
`500`, not because QUIC or registry bootstrap was broken, but because
PostgreSQL rejected `\u0000` inside heartbeat JSON with
`unsupported Unicode escape sequence (SQLSTATE 22P05)`.
- Backend heartbeat ingestion now sanitizes `\u0000` before persistence.
- After that fix, `home-*` and `test-*` resumed normal heartbeat flow and
converged onto the new release line with live registry promotion.
- The active node set
- `home-1`
- `home-2`
@@ -16,9 +29,40 @@ behavior still differs from the target architecture.
- `test-3`
- `usa-los-1`
- `ifcm-rufms-s-mo1cr`
is converged on `0.2.321-directreadytarget`.
currently spans:
- `home-*`, `test-*`, and `usa-los-1` on
`0.2.327-registrybootstraprewrite`;
- `ifcm-rufms-s-mo1cr` still remaining on
`0.2.322-controlendpointsrewrite`.
- `ifcm-rufms-s-mo1cr` recovered through the compatibility recovery path and is
no longer stale.
- `ifcm-rufms-s-mo1cr` has already migrated off compat overlap
`http://vpn.cin.su:19191/api/v1` and now reports
`https://vpn.cin.su/api/v1`, but it still has not advanced to the new
registry-aware release line.
- `home-*` and `test-*` now report:
- `reported_version = 0.2.327-registrybootstraprewrite`
- `peer_cache_peers = 7`
- `fabric_registry_runtime_report.status = active`
- `usa-los-1` is already on `0.2.327-registrybootstraprewrite` but still
reports `fabric_registry_runtime_report.status = missing`, so its remaining
gap is a runtime bootstrap/config rewrite gap, not a version gap.
- After repairing malformed `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` on
`home-1/home-2/home-3`, the `home` area now emits enriched heartbeat metadata
again instead of falling back to the thin `c3` payload.
- Live stale-risk snapshot at `2026-05-18T19:39Z` now reports:
- `compat_control_dependency_nodes = 1` (`ifcm-rufms-s-mo1cr`)
- `registry_candidate_only_nodes = 1` (`ifcm-rufms-s-mo1cr`)
- `direct_peer_alert_nodes = 5`
- `area_diversity_alert_nodes = 6`
- Live heartbeat/update status snapshot after the `0.2.325-updatehintwake`
rollout still shows:
- `ifcm-rufms-s-mo1cr` heartbeat fresh at `2026-05-18 23:08:44 UTC`
- `fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
- `peer_cache_peers = 7`
- latest update status still stuck at `2026-05-18 20:50 UTC`
- this is now classified as `updater_wake_unsupported`, not just a generic
stale or compat-control symptom
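The `\u0000` ingestion fix recorded in this section can be illustrated with a minimal sketch. It assumes the heartbeat arrives as JSON text and that U+0000 is stripped from every string key and value before the payload is stored in PostgreSQL. The function name `strip_nul` and the sample payload are hypothetical; the backend's actual implementation is not shown in this audit.

```python
import json
from typing import Any

NUL = "\u0000"  # the character PostgreSQL rejects inside JSON text


def strip_nul(value: Any) -> Any:
    """Recursively remove U+0000 from all strings in a decoded JSON value."""
    if isinstance(value, str):
        return value.replace(NUL, "")
    if isinstance(value, list):
        return [strip_nul(item) for item in value]
    if isinstance(value, dict):
        return {strip_nul(key): strip_nul(item) for key, item in value.items()}
    return value  # numbers, booleans, and None pass through unchanged


# Hypothetical heartbeat body containing escaped NUL characters.
raw = '{"node": "home-1", "note": "bad\\u0000byte", "peers": ["a\\u0000", "b"]}'
clean = strip_nul(json.loads(raw))
payload = json.dumps(clean)  # safe to persist: no \u0000 escape remains
```

Sanitizing the decoded structure rather than the raw text avoids accidentally rewriting an escaped backslash sequence that only looks like `\u0000`.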
## Why TCP traffic is still visible
@@ -35,7 +79,7 @@ Observed live listeners:
- `19132/udp`, `19133/udp`, `19134/udp` - QUIC fabric listeners
- `usa-los-1`
- `19131/udp` - QUIC fabric listener
- `19191/tcp` - external compatibility bridge currently held open so legacy
- `19191/tcp` - external compatibility bridge currently held open so compat
recovery contracts can still reach `Control API/downloads`
Therefore:
@@ -49,7 +93,8 @@ Therefore:
### 1. Nodes do not yet operate from a fully active signed registry gossip plane
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat:
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after the backend/runtime
refresh:
- `fabric_registry_runtime_report.status = candidate_only`
- `resolved_service_count = 0`
@@ -61,11 +106,11 @@ This means the current runtime still depends on compatibility control URLs more
than the target architecture allows. The node is alive in the fabric, but not
yet operating from a fully resolved active registry view.
### 2. Legacy control/download contracts are still real dependencies
### 2. Compat control/download contracts are still real dependencies
Observed on the live `ifcm-rufms-s-mo1cr` heartbeat after recovery:
- `mesh_outbound_session_report.control_plane_url = http://vpn.cin.su:19191/api/v1`
- `mesh_outbound_session_report.fabric_control_endpoint = http://vpn.cin.su:19191/api/v1`
This confirms the root recovery lesson:
@@ -77,15 +122,31 @@ This confirms the root recovery lesson:
### 3. Direct peer resilience is still below the intended threshold
Observed from live heartbeat metadata:
Observed from the live stale-risk snapshot at `2026-05-18T19:39Z`:
- `ifcm-rufms-s-mo1cr`
- `peer_connection_ready = 2`
- `peer_connection_relay_ready = 3`
- `target_ready_peers = 3`
- `home-1`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `home-2`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `home-3`
- `peer_connection_ready = 1`
- `direct_ready_areas = [usa]`
- `external_area_ready_count = 1/2`
- `test-1/2/3`
- `peer_connection_ready = 3`
- but `direct_ready_areas = [usa]`
- therefore each still triggers `external_area_deficit:1_of_2`
- `usa-los-1`
- `peer_connection_ready = 1`
- `peer_connection_relay_ready = 5`
- `direct_ready_areas = [ifcm, home, test]`
- `target_ready_peers = 3`
This means the direct-path resilience target is not satisfied yet, even though
@@ -99,17 +160,35 @@ The practical reason is simple:
- relay-ready adjacency is masking direct peer deficit, but it does not replace
the requirement for at least three direct-ready peers.
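The two deficits discussed in this section can be expressed as a small check. This is a sketch under stated assumptions: the field names follow the heartbeat metadata quoted above, the threshold of three direct-ready peers comes from this audit, the requirement of two external areas is inferred from the `1/2` counts, and the `direct_peer_deficit` label is hypothetical (only `external_area_deficit:1_of_2` appears in the source).

```python
from dataclasses import dataclass, field

MIN_DIRECT_READY_PEERS = 3  # resilience target stated in this audit


@dataclass
class PeerReport:
    node: str
    peer_connection_ready: int
    direct_ready_areas: list[str] = field(default_factory=list)
    required_external_areas: int = 2  # assumption inferred from the "1/2" counts


def resilience_alerts(report: PeerReport) -> list[str]:
    """Return alert labels; relay-ready peers are deliberately not counted."""
    alerts = []
    if report.peer_connection_ready < MIN_DIRECT_READY_PEERS:
        alerts.append(
            f"direct_peer_deficit:{report.peer_connection_ready}"
            f"_of_{MIN_DIRECT_READY_PEERS}"
        )
    ready_areas = len(set(report.direct_ready_areas))
    if ready_areas < report.required_external_areas:
        alerts.append(
            f"external_area_deficit:{ready_areas}"
            f"_of_{report.required_external_areas}"
        )
    return alerts


home_1 = PeerReport("home-1", peer_connection_ready=1, direct_ready_areas=["usa"])
test_1 = PeerReport("test-1", peer_connection_ready=3, direct_ready_areas=["usa"])
usa = PeerReport("usa-los-1", 1, ["ifcm", "home", "test"])
```

Applied to the snapshot values above, this reproduces the split between direct-peer alerts (`home-*`, `usa-los-1`, `ifcm`) and area-diversity alerts (`home-*`, `test-*`).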
### 3.1 Public endpoint confirmation must be cross-area, not local hairpin
The live `home/test` topology also exposed a verification mistake in the
runtime model:
- `home` and `test` sit behind the same public router address
`94.141.118.222`;
- some public QUIC candidates are valid only when tested from another area such
as `usa` or `ifcm`;
- a same-area probe can fail purely because the local router does not support
hairpin NAT / NAT reflection.
Operational consequence:
- a public endpoint marked as `external-network-required` must be treated as
non-authoritative when the failure came from `self` or `same_area`;
- the public candidate should be confirmed or rejected by `cross_area`
observers instead.
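The operational consequence above can be sketched as a verdict function. The scope names `self`, `same_area`, and `cross_area` come from this audit; the `ProbeResult` type, the verdict strings, and the choice that a single successful cross-area probe is enough to confirm are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    observer_scope: str  # "self", "same_area", or "cross_area"
    reachable: bool


def public_candidate_verdict(probes: list[ProbeResult]) -> str:
    """Decide on a public endpoint using cross-area observers only."""
    cross = [p for p in probes if p.observer_scope == "cross_area"]
    if not cross:
        # self / same_area failures may only reflect missing hairpin NAT
        return "unconfirmed"
    if any(p.reachable for p in cross):
        return "confirmed"
    return "rejected"


hairpin_only = [ProbeResult("self", False), ProbeResult("same_area", False)]
with_remote = hairpin_only + [ProbeResult("cross_area", True)]
```

With this rule a `home` or `test` candidate behind `94.141.118.222` stays `unconfirmed` until an observer in another area such as `usa` or `ifcm` has tested it.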
### 4. Observability is still heterogeneous
Live heartbeat coverage is inconsistent:
Live heartbeat coverage is now richer than it was earlier in the day, but it is
still not fully converged in behavior:
- `test-*`, `ifcm`, `usa-los-1` emit rich `c17z20` heartbeat metadata with
endpoint, peer recovery, and registry sections.
- `home-*` currently do not expose the same full sections in their latest
heartbeat rows.
This means operator visibility is uneven and the documentation must not imply
uniform live introspection across every node today.
- `test-*`, `ifcm`, `usa-los-1`, and now repaired `home-*` expose endpoint,
peer recovery, and registry sections again.
- `ifcm` is still the only node that currently reports `compat control` and
`registry candidate_only`, so the observability gap has narrowed into a real
single-node convergence issue instead of a fleet-wide blind spot.
## What is true right now
@@ -117,21 +196,63 @@ uniform live introspection across every node today.
2. QUIC/UDP is the actual node-to-node transport.
3. Compatibility `19191/tcp` is still required for recovery overlap.
4. Signed registry gossip is not yet the sole active discovery/control source.
5. The "at least 3 direct-ready peers per node" resilience target is not yet
met for all externally significant nodes.
5. `ifcm` still depends on the compat `19191` control overlap.
6. The plain `3 direct peers` target is insufficient on its own; the live fleet
now clearly shows that `cross-area direct diversity` is the next real gate.
## Control/API migration progress
The codebase now carries a more explicit migration contract for control access:
- install profiles prefer canonical `control_plane_endpoints` over a compat
singleton `backend_url`;
- host runtime env generation now exports
removed control-plane endpoint env key;
- node heartbeat/control reporting prefers that canonical endpoint set when it
is present;
- stale updater status behind a fresh heartbeat is now classified separately as
`updater_subscription_gap`;
- heartbeat update hints now have a second-stage recovery path: after writing
`update-trigger.json`, a live node can also wake its local updater
task/service.
This does not instantly rewrite older runtime wrappers on already-installed
nodes by itself. It does remove the same trap for the next install, reinstall,
or update-service rewrite cycle.
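The endpoint preference described in this section can be sketched as follows. The keys `control_plane_endpoints` and `backend_url` are taken from the text above; the function name and the dictionary-shaped install profile are assumptions, not the actual install profile format.

```python
def resolve_control_endpoints(profile: dict) -> list[str]:
    """Prefer the canonical endpoint list; fall back to the compat singleton."""
    endpoints = [e for e in (profile.get("control_plane_endpoints") or []) if e]
    if endpoints:
        return endpoints
    compat = profile.get("backend_url")
    return [compat] if compat else []


canonical = {
    "control_plane_endpoints": ["https://vpn.cin.su/api/v1"],
    "backend_url": "http://vpn.cin.su:19191/api/v1",
}
compat_only = {"backend_url": "http://vpn.cin.su:19191/api/v1"}
```

Returning a list in both cases keeps callers from special-casing the compat singleton.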
## Operational rule until the next audit
Do not remove the compatibility `19191/tcp` recovery overlap while any of the
following remain true:
- any live node still reports a `control_plane_url` on the `19191` contract;
- any live node still reports a `fabric_control_endpoint` on the `19191` contract;
- any live node has `fabric_registry_runtime_report.status != active`;
- any externally significant node has fewer than 3 direct-ready peers;
- any node can only recover through legacy `Control API/downloads` overlap.
- any node can only recover through compat `Control API/downloads` overlap.
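The rule above can be read as a gate that lists every reason the overlap must stay. This is a minimal sketch: the `NodeState` fields mirror the heartbeat keys quoted in this audit, while the `externally_significant` and `compat_only_recovery` flags are assumed to be supplied by the operator because the audit does not say how they are derived.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

COMPAT_PORT = 19191
MIN_DIRECT_READY_PEERS = 3


@dataclass
class NodeState:
    node: str
    control_plane_url: str
    fabric_control_endpoint: str
    registry_status: str
    direct_ready_peers: int
    externally_significant: bool = True  # assumption: operator-supplied flag
    compat_only_recovery: bool = False   # assumption: operator-supplied flag


def uses_compat_port(url: str) -> bool:
    return urlsplit(url).port == COMPAT_PORT


def overlap_blockers(nodes: list[NodeState]) -> list[str]:
    """Return every reason the 19191/tcp overlap must not be removed yet."""
    blockers = []
    for n in nodes:
        if uses_compat_port(n.control_plane_url):
            blockers.append(f"{n.node}: control_plane_url on {COMPAT_PORT}")
        if uses_compat_port(n.fabric_control_endpoint):
            blockers.append(f"{n.node}: fabric_control_endpoint on {COMPAT_PORT}")
        if n.registry_status != "active":
            blockers.append(f"{n.node}: registry status {n.registry_status}")
        if n.externally_significant and n.direct_ready_peers < MIN_DIRECT_READY_PEERS:
            blockers.append(f"{n.node}: {n.direct_ready_peers} direct-ready peers")
        if n.compat_only_recovery:
            blockers.append(f"{n.node}: recovery only through compat overlap")
    return blockers


compat = "http://vpn.cin.su:19191/api/v1"
ifcm = NodeState("ifcm-rufms-s-mo1cr", compat, compat, "candidate_only", 2,
                 compat_only_recovery=True)
test_1 = NodeState("test-1", "https://vpn.cin.su/api/v1",
                   "https://vpn.cin.su/api/v1", "active", 3)
```

The overlap is removable only when `overlap_blockers` returns an empty list for the whole live fleet.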
## Required next work
Update 2026-05-19:
- `rap-node-agent 0.2.325-updatehintwake` was released with a local updater
wake path driven by heartbeat update hints.
- The release exists because `ifcm-rufms-s-mo1cr` showed that a node can keep
sending fresh heartbeats while the updater subscription plane silently stops
progressing.
- This is now treated as a first-class recovery-plane problem, not as a vague
stale-node symptom.
- The live rollout already moved `home-*`, `test-*`, and `usa-los-1` onto
`0.2.325-updatehintwake`.
- `ifcm-rufms-s-mo1cr` is now the only remaining
`updater_wake_unsupported` blocker.
- Live `ifcm` peer telemetry also exposed a distinct transport-resilience
defect: on one stale-relay/bootstrap path the node tried a relay endpoint
with the certificate fingerprint from a different private direct candidate,
producing
`CRYPTO_ERROR ... quic peer certificate fingerprint mismatch`.
- That bug is now fixed in the runtime line tracked as
`0.2.332-relaycertintentfix`.
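The fresh-heartbeat, stalled-updater case described above can be separated from a plain stale node with a timestamp comparison. The sketch below is illustrative only: both thresholds and the `stale_node` and `ok` labels are assumptions, and only `updater_subscription_gap` is a classification named in this audit.

```python
from datetime import datetime, timedelta, timezone


def classify_update_plane(
    now: datetime,
    last_heartbeat: datetime,
    last_update_status: datetime,
    heartbeat_fresh: timedelta = timedelta(minutes=5),  # assumed threshold
    update_lag_limit: timedelta = timedelta(hours=1),   # assumed threshold
) -> str:
    """Distinguish a stale node from a live node whose updater has stalled."""
    if now - last_heartbeat > heartbeat_fresh:
        return "stale_node"
    if last_heartbeat - last_update_status > update_lag_limit:
        return "updater_subscription_gap"
    return "ok"


utc = timezone.utc
# Timestamps taken from the ifcm snapshot in this audit; "now" is assumed.
now = datetime(2026, 5, 18, 23, 10, 0, tzinfo=utc)
heartbeat = datetime(2026, 5, 18, 23, 8, 44, tzinfo=utc)
update_status = datetime(2026, 5, 18, 20, 50, 0, tzinfo=utc)
result = classify_update_plane(now, heartbeat, update_status)
```

Checking heartbeat freshness first keeps a genuinely offline node from being reported as an updater problem.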
### A. Finish signed registry activation
Each node must be able to resolve active records for at least: