Working version, but speed is 10 Mbit/s
build / backend (push) Has been cancelled
build / node-agent (push) Has been cancelled
build / worker (push) Has been cancelled

2026-05-22 21:46:49 +03:00
parent 469fa0e860
commit 20d361a886
280 changed files with 954890 additions and 18524 deletions
+34 -32
@@ -62,7 +62,7 @@ Cluster Authority foundation is now also complete:
- cluster authority private keys are encrypted at rest when
`SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- compat/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
@@ -80,14 +80,14 @@ Cluster Authority foundation is now also complete:
`RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
supervision remains a stub
Node enrollment bootstrap polling is also complete:
- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed bootstrap contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
Node enrollment join polling is also complete:
- backend exposes `/node-agents/enrollments/{requestID}/join`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
before receiving status/join material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
the signed join contract, then persists `node_id`, `identity_status`,
and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
`RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
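The polling behaviour described in the list above can be sketched as a loop bounded by an interval and a timeout. This is a minimal illustration, not the `rap-node-agent` implementation: the function name `poll_enrollment`, the `fetch_status` callable, and the status values `pending`, `approved`, and `rejected` are assumptions made for the example. The two numeric parameters stand in for `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`.

```python
import time
from typing import Callable, Dict


def poll_enrollment(
    fetch_status: Callable[[], Dict[str, str]],
    interval_seconds: float,
    timeout_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll a pending enrollment until it is decided or the timeout elapses.

    fetch_status is a hypothetical callable that queries the enrollment
    endpoint and returns a dict with at least a "status" key.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "approved":
            # The caller would verify the signed material before persisting it.
            return result
        if status == "rejected":
            raise PermissionError("enrollment request was rejected")
        # Still pending: stop if the next attempt would land past the deadline.
        if clock() + interval_seconds > deadline:
            raise TimeoutError("enrollment approval not received in time")
        sleep(interval_seconds)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting; a real agent would also need to handle transport errors from the status call.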
@@ -157,14 +157,16 @@ Runtime report:
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Active fabric standard check:
`pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\check-fabric-standard-boundary.ps1`
- Removed docker-test smoke record:
`removed docker-test smoke script is not part of the active tree`
- Removed dev lifecycle smoke record:
`removed dev lifecycle smoke script is not part of the active tree`
- Last proven runtime run: `c17z18-20260428-221601` (compat smoke script name,
current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
@@ -193,30 +195,30 @@ Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
All three test nodes run the C18Z92 image, healthy, and current after policy
update. Node-agent still requires signed service-channel lease authority when
cluster authority is pinned, but if legacy clients cannot send signed lease
cluster authority is pinned, but if compat clients cannot send signed lease
headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
Accepted ingress is visible as `accepted_by=signed|introspection|compat_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
compat path, with heartbeat metadata fallback and admin-panel visibility.
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
entry/exit nodes, primary route status, compat fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
active channel stays on a normal primary route with `force_compat_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
hint, with unit coverage for healthy/noop, rebuild, compat fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`. C18Z57 turns that recommendation into a bounded
`force_compat_fallback=false`. C18Z57 turns that recommendation into a bounded
machine-readable `remediation_command` on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
@@ -225,10 +227,10 @@ route-manager `applied` decision with source
`service_channel_remediation_command`. C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
route, with no local/backend fallback and no route send failures. C18Z60 proves
route, with no local/compat fallback and no route send failures. C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
observed replacement-route sends avoid local/compat fallback, flow drops, and
route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
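The C18Z57 note above says the `remediation_command` TTL is capped to the lease lifetime. A minimal sketch of that capping rule follows, assuming the command carries an issued time and a requested TTL; the function name `command_expiry` and its parameters are hypothetical and are not taken from the backend code.

```python
from datetime import datetime, timedelta, timezone


def command_expiry(
    issued_at: datetime,
    requested_ttl: timedelta,
    lease_expires_at: datetime,
) -> datetime:
    """Return the command expiry, never later than the lease expiry."""
    remaining = lease_expires_at - issued_at
    if remaining <= timedelta(0):
        # A command cannot be issued against a lease that has already ended.
        raise ValueError("lease already expired at issue time")
    # Use the shorter of the requested TTL and the remaining lease lifetime.
    return issued_at + min(requested_ttl, remaining)


if __name__ == "__main__":
    issued = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    lease_end = issued + timedelta(minutes=2)
    # A 5 minute request is cut down to the 2 minutes left on the lease.
    print(command_expiry(issued, timedelta(minutes=5), lease_end))
```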
@@ -260,7 +262,7 @@ fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
drops, compat fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
@@ -319,7 +321,7 @@ C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops. C18Z80 hardens that
local/compat fallback, route failures, or flow drops. C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
@@ -396,7 +398,7 @@ C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
C18Z92 enforces disabled compat fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
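The C18Z92 rule above can be illustrated with a small decision function. This is a sketch under assumptions, not the node-agent code: only `backend_relay_policy=disabled` and the visible 503 come from the text, while the `IngressDecision` type, the `decide_ingress` name, the `allowed` policy value, and the reason strings are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressDecision:
    http_status: int
    path: str      # which path carried the packet, or "none" when refused
    reason: str


def decide_ingress(backend_relay_policy: str, fabric_route_usable: bool) -> IngressDecision:
    """Pick the data path for a service-channel packet.

    fabric_route_usable is False when the fabric route is missing or a
    send over it failed.
    """
    if fabric_route_usable:
        return IngressDecision(200, "fabric", "fabric_route_selected")
    if backend_relay_policy == "disabled":
        # Refuse visibly instead of silently proxying through backend relay.
        return IngressDecision(503, "none", "fallback_blocked_by_policy")
    # Any other policy value (hypothetical, e.g. "allowed") permits the relay.
    return IngressDecision(200, "backend_relay", "fallback_used")
```

Keeping the refusal as an explicit result, rather than an exception, makes it easy to log and count separately from real relay usage.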
@@ -414,13 +416,13 @@ can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
route, node-agent reports `compat_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
telemetry reports `fabric_route_send_failed_compat_fallback_blocked`, compat
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
@@ -570,7 +572,7 @@ artifacts:
`artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`, and
`artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`, and
`artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`, and
`artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`, and
`artifacts/c18z92-node-agent-disabled-compat-fallback-smoke-result.json`, and
`artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`, and
`artifacts/c18z94-data-plane-contract-incident-smoke-result.json`, and
`artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`, and