rdp-proxy/docs/architecture/MESH_ROUTING_RUNTIME_IMPLEMENTATION_PLAN.md
working version, but speed is 10 Mbit/s
2026-05-22 21:46:49 +03:00


# Mesh Routing Runtime Implementation Plan
Status: Stage C17 planning completed. The following stages are implemented and
test-proven:
- Stage C17A: synthetic mesh runtime skeleton
- Stage C17B: route health/failover probes
- Stage C17C: relay semantic hardening
- Stage C17D: non-production test-service path experiment
- Stage C17E: historical live node-to-node synthetic HTTP transport skeleton
- Stage C17F: scoped synthetic route config boundary
- Stage C17G: Control Plane scoped synthetic config read boundary
- Stage C17H: deployed multi-agent synthetic config smoke
- Stage C17I: production forwarding gate
- Stage C17J: production envelope contract
- Stage C17K: production envelope observation
- Stage C17L: bounded production observation sink
- Stage C17M: production observation sink wiring
- Stage C17N: production observation sink metrics
- Stage C17O: production observation sink local metrics logging
- Stage C17P: production observation sink change-driven metrics logging
- Stage C17Q: production forwarding gate/runtime log boundary
- Stage C17R: production observation sink capacity guard
- Stage C17S: production observation panic fail-closed hardening
- Stage C17T: production envelope payload boundary
- Stage C17U: production envelope created-at skew boundary
- Stage C17V: peer endpoint candidate model and NAT/connectivity hints
- Stage C17W: peer endpoint candidate scoring model
- Stage C17X: health-aware endpoint candidate scoring overlay
- Stage C17Y: Platform Owner synthetic mesh visibility (build/test-proven)
- Stage C17Z: production fabric-control direct forwarding boundary
Stages C17Z1 through C17Z18 are implemented and
test/docker-test-runtime-proven through route-config, peer-directory,
peer-cache, endpoint reporting/candidates, peer connection
state/recovery/intent/manager, rendezvous/relay lease, stale relay
replacement, route/path decision, route generation tracker, and synthetic
route-health effective-path boundaries.
This document defines the implementation plan for the future mesh routing runtime.
It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP
tunnel runtime, relay packet routing, RDP work, or service workload execution.
Production mesh runtime implementation is not authorized by this document.
Stage summary (C17K through C17Y each kept production forwarding unavailable):
- C17A implemented only synthetic `fabric.probe` / `fabric.probe_ack`
  execution behind a disabled-by-default feature flag.
- C17B added synthetic `fabric.route_health` / `fabric.route_health_ack`,
  local route observations, fallback route selection, warm route promotion
  metrics, and route-cache invalidation.
- C17C added synthetic relay validation, per-channel bounded queues, QoS
  dequeue order, telemetry-only drop/backpressure, and reliable
  fabric/control rejection behavior.
- C17D added one bounded `synthetic.echo` test-service path over direct,
  single-relay, and forced fallback routes.
- C17E added one historical real-HTTP peer transport experiment and a
  disabled-by-default node-agent synthetic endpoint/smoke harness for direct
  and single-relay synthetic traffic only.
- C17F added scoped synthetic peer/route config loading and synthetic
  route-health link observation reporting.
- C17G added the Control Plane read boundary for node-scoped synthetic mesh
  config.
- C17H proved that boundary in a deployed multi-agent `docker-test` smoke.
- C17I added an explicit production-forwarding gate.
- C17J added route-bound production envelope validation.
- C17K added metadata-only local observation of accepted production
  envelopes.
- C17L added bounded local retention of accepted metadata-only observations.
- C17M added disabled-by-default node-agent wiring for that bounded local
  sink.
- C17N added local sink metrics without exposing observation records.
- C17O added local aggregate metrics logging without read APIs or Control
  Plane reporting.
- C17P suppressed repeated unchanged local metrics logs.
- C17Q separated production forwarding gate state from runtime state in local
  logs.
- C17R added a maximum local observation sink capacity guard.
- C17S made observer panic handling fail closed.
- C17T added an explicit validated fabric-control payload size boundary.
- C17U added an explicit validated created-at future-skew boundary.
- C17V added route-scoped peer endpoint candidates and NAT/connectivity hints
  to synthetic config.
- C17W added deterministic local endpoint candidate scoring.
- C17X added health-aware local endpoint candidate scoring.
- C17Y added Platform Owner visibility for node-scoped synthetic mesh config.
- C17Z added gate-controlled production `fabric.control` local delivery and
  direct next-hop forwarding while keeping service traffic unavailable.
## 1. Purpose
C17 planning turns the accepted C10-C16 Fabric Core foundation into a safe,
incremental mesh routing runtime plan.
Accepted foundation:
- C10: Fabric Core config distribution design
- C11: signed scoped cluster snapshots
- C12: node local state store
- C13: Fabric Storage / Config Storage
- C14: peer directory/cache
- C15: Fabric Routing Engine skeleton
- C16: secure node-to-node channel lifecycle
C17 planning defines how runtime work should begin without accidentally
creating a broad production mesh, a second source of truth, or a hidden service
transport rewrite.
## 2. Hard Non-Goals
C17 planning and C17A must not:
- carry RDP user traffic
- carry VPN/IP tunnel traffic
- carry production service workload traffic
- replace direct worker WSS
- remove backend gateway fallback
- change backend session lifecycle
- change Windows client behavior
- expose mesh topology to organizations
- implement arbitrary relay packet forwarding
- implement QUIC/WebRTC
- bypass signed snapshots or node-local policy
- allow nodes to invent routes without Fabric Routing Engine boundaries
## 3. Runtime Principle
Mesh runtime must start as a controlled fabric-internal path.
Initial runtime traffic should be limited to:
- fabric control probes
- route health probes
- synthetic test messages
- safe telemetry
Service traffic such as RDP, VNC, SSH, file transfer, video, and VPN/IP tunnel
must remain outside the first runtime skeleton.
## 4. Minimal Runtime Sequence
The first implementation should follow this sequence.
### C17A: Mesh Runtime Skeleton, Synthetic Traffic Only
Status: implemented and test-proven. Report:
`artifacts/c17a-synthetic-mesh-runtime-skeleton-report.md`.
Goal:
Prove route selection, secure node channels, hop forwarding, TTL, observability,
and rollback with synthetic fabric messages only.
Allowed:
- route request/result implementation boundary
- node-to-node secure channel use from C16
- direct path for synthetic control message
- single relay path for synthetic control message
- TTL / hop limit
- route id propagation
- structured logs and metrics
- kill-switch to disable mesh runtime immediately
Not allowed:
- RDP traffic
- VPN/IP tunnel traffic
- service workload traffic
- organization-visible topology
- production data-plane migration
### C17B: Route Health and Failover Probes
Status: implemented and test-proven. Report:
`artifacts/c17b-route-health-failover-probes-report.md`.
Goal:
Prove route health observations and failover decisions with synthetic traffic.
Allowed:
- route health probes
- warm peer promotion
- fallback route selection
- failed route marking
- route cache invalidation on policy/peer changes
Not allowed:
- production service traffic
- tenant-visible routing decisions
### C17C: Relay Runtime Hardening
Status: implemented and test-proven. Report:
`artifacts/c17c-relay-semantic-hardening-report.md`.
Goal:
Harden relay forwarding semantics before service traffic.
Allowed:
- relay envelope validation
- max hops
- loop prevention
- per-channel class queue boundaries
- QoS scheduling with synthetic channel classes
- backpressure and drop rules
Not allowed:
- general-purpose packet relay
- VPN packet forwarding
- RDP render/input migration
### C17D: Non-Production Service-Path Experiment
Status: implemented and test-proven. Report:
`artifacts/c17d-non-production-test-service-path-report.md`.
Goal:
Optionally test a non-production service flow after C17A-C17C are accepted.
Allowed only after explicit approval:
- one test service type
- one test organization
- one test cluster
- forced fallback path
- no production users
RDP must remain paused unless separately approved.
## 5. Route Execution Boundary
Route execution consumes a route result from the Fabric Routing Engine.
Route execution may:
- open an authorized node-to-node channel
- send a route-bound envelope
- forward only if route id, hop id, TTL, and channel class are valid
- report delivery/failure telemetry
- update local route cache observations
Route execution must not:
- choose a route independently
- override hard policy checks
- create shortcut connections on its own
- cross cluster boundaries without explicit trust
- mutate PostgreSQL authority
- expose topology to tenants
## 6. Mesh Envelope Boundary
Initial mesh runtime envelopes should be service-neutral.
Required envelope fields:
- `fabric_protocol_version`
- `message_id`
- `route_id`
- `cluster_id`
- `source_node_id`
- `destination_node_id`
- `current_hop_node_id`
- `next_hop_node_id`
- `channel_class`
- `message_type`
- `ttl`
- `hop_count`
- `created_at`
- `expires_at`
- `payload_length`
- `payload_hash`
Initial allowed message types:
- `fabric.probe`
- `fabric.probe_ack`
- `fabric.route_health`
- `fabric.telemetry`
Payload must remain small and bounded in C17A.
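A minimal Go sketch of the envelope boundary above. The field names come from the required list; the Go types, the `maxPayloadBytes` value, the hex-encoded SHA-256 reading of `payload_hash`, and the names `HashPayload` and `CheckEnvelope` are assumptions for illustration, not the node-agent's actual API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

// Envelope mirrors the required field list. The Go types are assumptions.
type Envelope struct {
	FabricProtocolVersion int
	MessageID             string
	RouteID               string
	ClusterID             string
	SourceNodeID          string
	DestinationNodeID     string
	CurrentHopNodeID      string
	NextHopNodeID         string
	ChannelClass          string
	MessageType           string
	TTL                   int
	HopCount              int
	CreatedAt             time.Time
	ExpiresAt             time.Time
	PayloadLength         int
	PayloadHash           string // assumed: hex-encoded SHA-256 of the payload
}

// maxPayloadBytes is a hypothetical bound; the plan only says "small and bounded".
const maxPayloadBytes = 1024

// allowedMessageTypes lists the initial message types from this section.
var allowedMessageTypes = map[string]bool{
	"fabric.probe":        true,
	"fabric.probe_ack":    true,
	"fabric.route_health": true,
	"fabric.telemetry":    true,
}

// HashPayload returns the hex-encoded SHA-256 digest of payload.
func HashPayload(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// CheckEnvelope validates the service-neutral parts of an envelope against its payload.
func CheckEnvelope(e Envelope, payload []byte, now time.Time) error {
	switch {
	case !allowedMessageTypes[e.MessageType]:
		return errors.New("message type not allowed")
	case len(payload) > maxPayloadBytes:
		return errors.New("payload too large")
	case e.PayloadLength != len(payload):
		return errors.New("payload length mismatch")
	case e.PayloadHash != HashPayload(payload):
		return errors.New("payload hash mismatch")
	case !now.Before(e.ExpiresAt):
		return errors.New("envelope expired")
	}
	return nil
}

func main() {
	now := time.Now()
	payload := []byte("ping")
	e := Envelope{
		MessageType:   "fabric.probe",
		ExpiresAt:     now.Add(5 * time.Second),
		PayloadLength: len(payload),
		PayloadHash:   HashPayload(payload),
	}
	fmt.Println("accepted:", CheckEnvelope(e, payload, now) == nil)
}
```

Route, hop, and TTL checks are deliberately left out here; they belong to the relay forwarding boundary.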
## 7. Relay Forwarding Boundary
Relay forwarding in early runtime is not arbitrary packet forwarding.
Relay may forward only when:
- route id is known and valid
- current node is the expected hop
- next hop is authorized
- channel class is allowed
- TTL is positive
- hop count is within limit
- route has not expired
- source and destination match the route result
- partition/degraded policy allows forwarding
Relay must reject:
- unknown route id
- wrong cluster
- wrong organization scope
- expired route
- TTL exhausted
- hop loop detected
- unauthorized channel class
- revoked peer
- stale policy version
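The forwarding conditions above can be sketched as one fail-closed check. `RouteResult`, `Hop`, and `CheckForward` are hypothetical names, and the sketch covers only the route, hop, TTL, and channel-class conditions; organization scope, revoked peers, stale policy versions, and partition/degraded policy are omitted.

```go
package main

import (
	"errors"
	"fmt"
)

// RouteResult is a hypothetical local view of a Fabric Routing Engine result.
type RouteResult struct {
	ClusterID      string
	Path           []string        // ordered hops: source first, destination last
	AllowedClasses map[string]bool // channel classes permitted on this route
	MaxHops        int
	Expired        bool
}

// Hop carries the envelope fields a relay inspects before forwarding.
type Hop struct {
	RouteID, ClusterID  string
	Source, Destination string
	CurrentHop, NextHop string
	ChannelClass        string
	TTL, HopCount       int
}

// CheckForward returns nil only when every condition holds; any failure rejects.
func CheckForward(localNode string, routes map[string]RouteResult, h Hop) error {
	r, known := routes[h.RouteID]
	switch {
	case !known:
		return errors.New("unknown route id")
	case len(r.Path) < 2:
		return errors.New("malformed route path")
	case h.ClusterID != r.ClusterID:
		return errors.New("wrong cluster")
	case r.Expired:
		return errors.New("expired route")
	case h.TTL <= 0:
		return errors.New("TTL exhausted")
	case h.HopCount >= r.MaxHops:
		return errors.New("hop limit exceeded")
	case !r.AllowedClasses[h.ChannelClass]:
		return errors.New("unauthorized channel class")
	case h.Source != r.Path[0] || h.Destination != r.Path[len(r.Path)-1]:
		return errors.New("source or destination mismatch")
	case h.CurrentHop != localNode:
		return errors.New("current node is not the expected hop")
	}
	// The next hop must directly follow the local node on the route path.
	for i := 0; i+1 < len(r.Path); i++ {
		if r.Path[i] == localNode && r.Path[i+1] == h.NextHop {
			return nil
		}
	}
	return errors.New("next hop not authorized")
}

func main() {
	routes := map[string]RouteResult{"r1": {
		ClusterID:      "cluster-1",
		Path:           []string{"node-a", "node-r", "node-b"},
		AllowedClasses: map[string]bool{"fabric_control": true},
		MaxHops:        4,
	}}
	h := Hop{RouteID: "r1", ClusterID: "cluster-1", Source: "node-a", Destination: "node-b",
		CurrentHop: "node-r", NextHop: "node-b", ChannelClass: "fabric_control", TTL: 3, HopCount: 1}
	fmt.Println("forward allowed:", CheckForward("node-r", routes, h) == nil)
}
```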
## 8. Loop Prevention
Required loop prevention:
- TTL
- max hop count
- visited hop set or compact loop token
- route epoch
- route id validation
- duplicate message id cache with TTL
Loop detection must fail closed and report telemetry.
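As an illustration of three of these mechanisms (TTL, a visited hop set, and a duplicate message id cache with TTL), here is a small sketch. `DedupCache`, `LoopDetected`, and `Accept` are hypothetical names; the cache is not safe for concurrent use and evicts with a linear scan, which a real runtime would likely replace.

```go
package main

import (
	"fmt"
	"time"
)

// DedupCache remembers message ids for a bounded time. Not safe for concurrent use.
type DedupCache struct {
	ttl  time.Duration
	seen map[string]time.Time // message id -> time the entry expires
}

func NewDedupCache(ttl time.Duration) *DedupCache {
	return &DedupCache{ttl: ttl, seen: make(map[string]time.Time)}
}

// Seen reports whether id was already recorded and is still unexpired;
// otherwise it records id and returns false.
func (c *DedupCache) Seen(id string, now time.Time) bool {
	for k, exp := range c.seen {
		if !now.Before(exp) {
			delete(c.seen, k) // evict expired entries
		}
	}
	if _, dup := c.seen[id]; dup {
		return true
	}
	c.seen[id] = now.Add(c.ttl)
	return false
}

// LoopDetected reports whether nextHop already appears in the visited hop set.
func LoopDetected(visited []string, nextHop string) bool {
	for _, v := range visited {
		if v == nextHop {
			return true
		}
	}
	return false
}

// Accept fails closed: an exhausted TTL, a revisited hop, or a duplicate id rejects.
func Accept(c *DedupCache, id string, visited []string, nextHop string, ttl int, now time.Time) bool {
	if ttl <= 0 || LoopDetected(visited, nextHop) {
		return false
	}
	return !c.Seen(id, now)
}

func main() {
	c := NewDedupCache(30 * time.Second)
	now := time.Now()
	fmt.Println(Accept(c, "m1", []string{"node-a"}, "node-r", 3, now)) // first delivery
	fmt.Println(Accept(c, "m1", []string{"node-a"}, "node-r", 3, now)) // replayed message id
}
```

Telemetry reporting on rejection is not shown; the caller would record the reject reason wherever `Accept` returns false.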
## 9. Channel Scheduling
Even synthetic runtime should model future channel priorities.
Priority order:
1. `fabric_control`
2. `input`
3. `route_control`
4. `render`
5. `clipboard`
6. `file_transfer`
7. `storage_fetch`
8. `update_fetch`
9. `vpn_packet`
10. `telemetry`
C17A should only carry `fabric_control`, `route_control`, and `telemetry`, but
the scheduler boundary must not block future channel-aware extension.
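One way to model this ordering is a strict-priority dequeue over bounded per-class queues. Strict priority is an assumption here (the plan does not say whether lower classes get any fairness guarantee), and `Scheduler`, `Enqueue`, and `Dequeue` are hypothetical names.

```go
package main

import "fmt"

// priorityOrder is the channel priority list from this section, highest first.
var priorityOrder = []string{
	"fabric_control", "input", "route_control", "render", "clipboard",
	"file_transfer", "storage_fetch", "update_fetch", "vpn_packet", "telemetry",
}

// Scheduler holds one bounded FIFO queue per channel class.
type Scheduler struct {
	perClassLimit int
	queues        map[string][]string
}

func NewScheduler(perClassLimit int) *Scheduler {
	q := make(map[string][]string, len(priorityOrder))
	for _, class := range priorityOrder {
		q[class] = nil
	}
	return &Scheduler{perClassLimit: perClassLimit, queues: q}
}

// Enqueue returns false for an unknown class or a full queue.
func (s *Scheduler) Enqueue(class, msg string) bool {
	q, known := s.queues[class]
	if !known || len(q) >= s.perClassLimit {
		return false
	}
	s.queues[class] = append(q, msg)
	return true
}

// Dequeue returns the oldest message of the highest-priority non-empty class.
func (s *Scheduler) Dequeue() (string, bool) {
	for _, class := range priorityOrder {
		if q := s.queues[class]; len(q) > 0 {
			s.queues[class] = q[1:]
			return q[0], true
		}
	}
	return "", false
}

func main() {
	s := NewScheduler(8)
	s.Enqueue("telemetry", "t1")
	s.Enqueue("route_control", "rc1")
	s.Enqueue("fabric_control", "fc1")
	for msg, ok := s.Dequeue(); ok; msg, ok = s.Dequeue() {
		fmt.Println(msg)
	}
}
```

Keeping all ten classes in the table while only enqueuing the three C17A classes leaves room for later channel-aware extension without changing the scheduler shape.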
## 10. Route Cache Integration
Route execution may update observations:
- route success
- route failure
- latency
- delivery time
- retry count
- failure reason
- peer health hint
Route execution must not update authoritative policy.
Cache invalidation must occur when:
- route expires
- policy version changes
- peer directory version changes
- trust/revocation changes
- route epoch changes
- repeated failures exceed threshold
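The invalidation triggers above could be checked as in the following sketch. The `Versions` and `CacheEntry` types, the integer version counters, and the reason strings are assumptions; trust/revocation changes are modeled here as a single `Trust` counter.

```go
package main

import (
	"fmt"
	"time"
)

// Versions groups the version counters a cached route was built against.
type Versions struct {
	Policy, PeerDirectory, Trust, RouteEpoch int64
}

// CacheEntry is a hypothetical node-local route cache record.
type CacheEntry struct {
	RouteID             string
	ExpiresAt           time.Time
	Built               Versions
	ConsecutiveFailures int
}

// InvalidationReason returns a non-empty reason when the entry must be dropped,
// or "" when the entry is still usable.
func InvalidationReason(e CacheEntry, current Versions, now time.Time, maxFailures int) string {
	switch {
	case !now.Before(e.ExpiresAt):
		return "route_expired"
	case e.Built.Policy != current.Policy:
		return "policy_version_changed"
	case e.Built.PeerDirectory != current.PeerDirectory:
		return "peer_directory_version_changed"
	case e.Built.Trust != current.Trust:
		return "trust_or_revocation_changed"
	case e.Built.RouteEpoch != current.RouteEpoch:
		return "route_epoch_changed"
	case e.ConsecutiveFailures > maxFailures:
		return "failure_threshold_exceeded"
	}
	return ""
}

func main() {
	now := time.Now()
	built := Versions{Policy: 7, PeerDirectory: 3, Trust: 1, RouteEpoch: 12}
	entry := CacheEntry{RouteID: "r1", ExpiresAt: now.Add(time.Minute), Built: built}
	current := built
	current.Policy = 8 // a policy update arrives after the route was cached
	fmt.Println(InvalidationReason(entry, current, now, 3))
}
```

The function only reads versions; it never writes them, which keeps the cache an observer of policy rather than an authority.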
## 11. Observability
C17 runtime must be observable before it is useful.
Required logs/metrics:
- route requested
- route selected
- route execution started
- channel opened
- envelope sent
- envelope forwarded
- envelope received
- route delivery succeeded
- route delivery failed
- route rejected with reason
- relay rejected with reason
- TTL/hop loop rejected
- fallback route used
- kill-switch activated
Metrics:
- active routes
- active channels
- route success rate
- route failure rate
- relay forwarding count
- relay rejection count
- route latency
- hop latency
- queue depth by channel class
- dropped synthetic messages
Tenant-visible views must not expose topology.
## 12. Rollback and Kill Switch
Mesh runtime must have an immediate rollback path.
Required controls:
- global feature flag: mesh runtime disabled
- cluster feature flag: mesh runtime disabled for cluster
- node feature flag: mesh runtime disabled for node
- route class flag: disable relay/multi-hop
- channel class flag: disable non-control classes
Rollback behavior:
- stop creating new mesh routes
- close synthetic runtime channels after drain or immediately by severity
- keep node enrollment/heartbeat unaffected
- keep RDP direct worker WSS and backend gateway fallback unaffected
- keep backend control plane unaffected
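The layered controls can be pictured as a single predicate in which every level must agree before traffic is allowed. This is a sketch under assumptions: the `Flags` type and `Allows` method are hypothetical, and treating `fabric_control` and `route_control` as the "control classes" is a guess, not something the plan states.

```go
package main

import "fmt"

// Flags models the kill-switch levels. The zero value disables everything,
// which matches a disabled-by-default runtime.
type Flags struct {
	Global            bool
	Cluster           map[string]bool // per-cluster enable
	Node              map[string]bool // per-node enable
	RelayAllowed      bool            // route class flag: relay/multi-hop
	NonControlAllowed bool            // channel class flag: non-control classes
}

// controlClasses is an assumption about which classes count as "control".
var controlClasses = map[string]bool{"fabric_control": true, "route_control": true}

// Allows reports whether a new mesh route may be created for the given scope.
func (f Flags) Allows(clusterID, nodeID string, viaRelay bool, class string) bool {
	if !f.Global || !f.Cluster[clusterID] || !f.Node[nodeID] {
		return false
	}
	if viaRelay && !f.RelayAllowed {
		return false
	}
	return controlClasses[class] || f.NonControlAllowed
}

func main() {
	f := Flags{
		Global:  true,
		Cluster: map[string]bool{"cluster-1": true},
		Node:    map[string]bool{"node-a": true},
	}
	fmt.Println(f.Allows("cluster-1", "node-a", false, "fabric_control")) // direct control route
	fmt.Println(f.Allows("cluster-1", "node-a", true, "fabric_control"))  // relay flag is off
}
```

Because the predicate only gates new mesh routes, flipping any flag leaves enrollment, heartbeat, and the RDP baseline paths untouched.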
## 13. Smoke / Test Topology
Minimum smoke topology:
```text
control-api
   |
   | config/snapshot distribution
   |
node-a: ingress-capable test node
   |
   | direct synthetic route
   v
node-b: service/egress-capable test node

node-a
   |
   | relay synthetic route
   v
node-r: relay-capable test node
   |
   v
node-b
```
Required test roles:
- `node-a`: can_accept_client_ingress, can_accept_node_ingress
- `node-r`: can_accept_node_ingress, can_route_mesh
- `node-b`: can_accept_node_ingress, can_egress_private_network or service test role
Smoke must prove:
- direct synthetic route succeeds
- single-relay synthetic route succeeds
- wrong cluster rejected
- wrong node rejected
- unauthorized channel rejected
- expired route rejected
- TTL loop rejected
- relay disabled kill-switch works
- mesh runtime disabled kill-switch works
- RDP baseline unaffected
## 14. C17A Result
C17A implemented the smallest safe runtime skeleton:
- `rap-node-agent` synthetic runtime is disabled by default
- direct synthetic `fabric.probe` / `fabric.probe_ack` path is test-proven
- single-relay synthetic `fabric.probe` / `fabric.probe_ack` path is
test-proven
- route id, route expiry, TTL, hop count, path validation, and loop protection
are enforced
- wrong cluster, wrong node, unauthorized channel, expired route, TTL
exhaustion, loop, and missing peer are rejected
- structured log and metrics boundaries exist
- existing `/mesh/v1/forward` production forwarding remains disabled
- no RDP, VPN, file, video, or production service traffic uses this skeleton
Verification:
```powershell
go test ./...
```
Run from:
```powershell
agents\rap-node-agent
```
## 15. C17B Result
C17B implemented route health and failover probes using synthetic traffic only:
- keep mesh feature flag disabled by default
- preserve direct and single-relay synthetic probe behavior
- synthetic route health probes
- local route success/failure observations
- failed synthetic route marking in node-local runtime state
- warm peer candidate promotion only in test/smoke topology
- fallback synthetic route selection when the preferred route is unavailable
- route cache invalidation when policy, peer directory, or route version
changes
- no service traffic
- no RDP traffic
- no VPN/IP tunnel traffic
- no organization topology exposure
Verification:
```powershell
go test ./...
```
Run from:
```powershell
agents\rap-node-agent
```
## 15.1 C17C Result
C17C implemented relay forwarding semantic hardening using synthetic channel
classes only:
- keep mesh feature flag disabled by default
- C17A direct/single-relay synthetic probes remain intact
- C17B route health/failover probes remain intact
- stricter relay envelope validation boundaries
- per-channel-class bounded queues for synthetic traffic
- QoS dequeue order for synthetic channel classes
- backpressure and drop rules for synthetic telemetry only
- reliable behavior for synthetic fabric/control health probes
- no service traffic
- no RDP traffic
- no VPN/IP tunnel traffic
- no organization topology exposure
Verification:
```powershell
go test ./...
```
Run from:
```powershell
agents\rap-node-agent
```
## 15.2 C17D Result
C17D implemented a non-production service-path experiment:
- keep mesh feature flag disabled by default
- C17A, C17B, and C17C behavior remains intact
- one test service type only: `synthetic.echo`
- one test organization only: `org-test`
- one test cluster only in tests: `cluster-1`
- bounded test payloads only
- direct test-service route proven
- single-relay test-service route proven
- forced fallback test-service route proven
- no topology exposed to organizations
- no production service traffic
- no RDP traffic
- no VPN/IP tunnel traffic
Verification:
```powershell
go test ./...
```
Run from:
```powershell
agents\rap-node-agent
```
## 15.3 C17H Result
C17H implemented a deployed multi-agent synthetic config smoke on
`docker-test`:
- five running `rap-node-agent` containers consumed backend-issued
node-scoped synthetic config
- direct and relay synthetic route-health observations returned through the
real backend
- Platform Owner summary reflected the C17H test cluster as healthy
- all scoped configs kept `production_forwarding=false`
- no production mesh traffic
- no service workload traffic
- no RDP/VPN/IP tunnel traffic
VPN/IP tunnel work remains a separate C18 track and must not be mixed into
C17 mesh runtime work.
## 15.4 C17E Historical Result
C17E implemented a historical live node-to-node synthetic HTTP transport
experiment while preserving the production forwarding kill-switch. This result
is retained only as test-history context; it is not the active transport
direction for the fabric runtime:
- `QUICPeerTransport` maps explicit peer node IDs to synthetic QUIC endpoint
URLs.
- `rap-node-agent` can start the synthetic fabric runtime only when
`RAP_FABRIC_RUNTIME_ENABLED=true` and `RAP_FABRIC_LISTEN_ADDR` is set.
- peer endpoints and synthetic routes can be injected as JSON for smoke/debug
only.
- `mesh-live-smoke` proves direct and single-relay synthetic traffic over real
local QUIC endpoints.
- bounded `synthetic.echo` remains the only test-service payload.
- `/mesh/v1/forward` remains disabled.
- no production service traffic is authorized.
Current direction:
- active fabric runtime transport is QUIC-only
- the synthetic HTTP transport experiment is historical test-only context
- production forwarding/runtime acceptance must use QUIC route execution rather
than HTTP peer transport
Verification:
```powershell
go test ./...
go run ./cmd/mesh-live-smoke
go build -o bin/rap-node-agent.exe ./cmd/rap-node-agent
go build -o bin/mesh-live-smoke.exe ./cmd/mesh-live-smoke
```
Run from:
```powershell
agents\rap-node-agent
```
## 15.5 C17F Result
C17F implemented scoped synthetic peer/route configuration loading and route
health reporting:
- `ScopedSyntheticConfig` validates `cluster_id`, `local_node_id`, peer
endpoint shape, route cluster, route membership, and route expiry.
- `rap-node-agent` prefers `RAP_MESH_SYNTHETIC_CONFIG` over debug JSON route
and peer endpoint injection.
- debug JSON remains available only as fallback for smoke/debug.
- when Fabric testing flags allow synthetic links, node-agent sends synthetic
route-health probes and reports safe link observations to the Control Plane.
- route-health metadata explicitly marks `traffic_forwarding=false` and
`service_workload_traffic=false`.
- C17E live direct/relay smoke remains intact.
- `/mesh/v1/forward` remains disabled.
Verification:
```powershell
go test ./...
go run ./cmd/mesh-live-smoke
go build -o bin/rap-node-agent.exe ./cmd/rap-node-agent
go build -o bin/mesh-live-smoke.exe ./cmd/mesh-live-smoke
```
Run from:
```powershell
agents\rap-node-agent
```
## 15.6 C17G Result
C17G implemented a Control Plane read boundary for node-scoped synthetic mesh
config:
- backend endpoint:
`/clusters/{clusterID}/nodes/{nodeID}/mesh/synthetic-config`
- endpoint returns no routes/endpoints when effective testing flags do not
allow synthetic links
- route intents remain the source for synthetic test route config
- only route intents whose path contains the requesting node are included
- unrelated peer endpoints are not returned to the requesting node
- `production_forwarding=false` is explicit in the response
- node-agent consumes Control Plane config when local
`RAP_MESH_SYNTHETIC_CONFIG` is not set
- the local scoped config file remains the preferred debug fallback
- debug JSON remains the last fallback only
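The precedence described in the last three bullets can be written as a small resolver. This is a sketch under assumptions: `ChooseConfigSource` and the `ConfigSource` values are hypothetical names, and the debug JSON input is passed as a plain string because its environment variable is not named in this plan.

```go
package main

import (
	"fmt"
	"os"
)

// ConfigSource names where the synthetic mesh config was taken from.
type ConfigSource string

const (
	SourceLocalFile    ConfigSource = "local_scoped_config"
	SourceControlPlane ConfigSource = "control_plane"
	SourceDebugJSON    ConfigSource = "debug_json"
	SourceNone         ConfigSource = "none"
)

// ChooseConfigSource applies the precedence: local scoped config first,
// then Control Plane config, then debug JSON as the last fallback.
func ChooseConfigSource(localConfig string, controlPlaneOK bool, debugJSON string) ConfigSource {
	switch {
	case localConfig != "":
		return SourceLocalFile
	case controlPlaneOK:
		return SourceControlPlane
	case debugJSON != "":
		return SourceDebugJSON
	}
	return SourceNone
}

func main() {
	local := os.Getenv("RAP_MESH_SYNTHETIC_CONFIG")
	// controlPlaneOK and the debug JSON value are placeholders for this demo.
	fmt.Println(ChooseConfigSource(local, true, ""))
}
```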
Verification:
```powershell
go test ./...
```
Run from:
```powershell
backend
agents\rap-node-agent
```
## 15.7 C17I-C17Z18 Result
C17I through C17Z18 added the first production-forwarding boundary checks, the
endpoint candidate config/scoring foundation, narrow production
`fabric.control` forwarding, local forwarding observability, and route-config
validation. They also added scoped peer directory/recovery seeds, warm-peer
connection state/recovery/intent planning, a control-plane health connection
manager, and a node-scoped rendezvous/relay control-plane lease contract with
lease refresh telemetry. The latest stages added stale-relay replacement
policy, route/path decision metadata, node-side route generation
apply/withdraw reporting, and synthetic route-health effective-path probing.
Production service traffic remains unavailable throughout:
- C17I added an explicit `RAP_MESH_PRODUCTION_FORWARDING_ENABLED` node-agent
gate.
- C17J added route-bound production envelope validation on `/mesh/v1/forward`
for `fabric_control` / `fabric.control` only.
- C17K added local metadata-only accepted-envelope observation after
validation.
- C17L added a bounded local in-memory sink for accepted metadata-only
observations.
- C17M added disabled-by-default node-agent wiring for the bounded local sink
through `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY`.
- C17N added local metrics for the bounded local sink.
- C17O added local node-agent logging for aggregate sink metrics.
- C17P added change-driven suppression for unchanged aggregate sink metrics
logs.
- C17Q added local log separation for production forwarding gate state versus
production forwarding runtime state.
- C17R added a maximum capacity guard for the local observation sink.
- C17S added panic-safe fail-closed observer handling.
- C17T added an explicit production `fabric.control` envelope payload size
boundary.
- C17U added an explicit production `fabric.control` envelope `created_at`
future-skew boundary.
- C17V added route-scoped peer endpoint candidates with transport, address,
reachability, NAT type, connectivity mode, priority, policy tags,
verification time, and metadata.
- C17W added deterministic local scoring for already-scoped peer endpoint
candidates.
- C17X added optional local health observation inputs to endpoint candidate
scoring.
- C17Y added Platform Owner Control Panel visibility for node-scoped synthetic
mesh config.
- C17Z added production `fabric.control` local delivery and direct next-hop
forwarding behind the explicit production gate.
- C17Z1 added route-path-bound production `fabric.control` multi-hop
forwarding behind the explicit production gate.
- C17Z2 added local metadata-only production `fabric.control` forwarding
event logs for accepted, forwarded, delivered, and rejected envelopes.
- C17Z3 bound production `fabric.control` forwarding to local route config
when configured routes are available.
- C17Z4 added node-scoped peer directory and explicit bounded recovery seeds
to mesh config.
- C17Z5 added node-agent peer cache runtime state and warm-peer health probes.
- C17Z6 added explicit advertised endpoint reporting and Control Plane
projection of latest reported endpoints into scoped mesh config.
- C17Z7 added multiple advertised endpoint candidates, including
private/corporate LAN endpoints, and peer-cache selection of the best
candidate address for warm health.
- C17Z8 added node-local warm-peer connection states with bounded backoff
after repeated health-probe failures.
- C17Z9 added bounded node-local peer recovery planning over peer cache and
connection states.
- C17Z10 added node-local peer connection intents and transport readiness
classification.
- C17Z11 added a node-local peer connection manager for real control-plane
health probes over reusable HTTP keep-alive transport.
- C17Z12 added node-scoped rendezvous/relay control-plane leases and
relay-control health probes for peers that would otherwise remain
`waiting_rendezvous`.
- C17Z13 added heartbeat telemetry for relay admission, peer admission,
lease renewal posture, and `relay_ready` state.
- C17Z14 added node-scoped synthetic-config refresh for renewal-needed
rendezvous leases plus stale relay withdrawal/reselection telemetry.
- C17Z15 added backend relay replacement/withdrawal policy and alternate
relay scoring for stale rendezvous relays.
- C17Z16 added Control Plane `route_path_decisions` with original/effective
hops, local next hop, selected replacement relay, generation, and boundary
flags.
- C17Z17 added node-side route generation tracking for
`route_path_decisions`, including active/applied/unchanged/withdrawn counts,
generation change state, and `withdrawn_by_replacement` reporting for stale
relay paths.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
route-health route config only, probes selected effective paths through
replacement relays, reports expected/observed hops and drift state, and keeps
latest route-health observations separate from peer connection-manager
observations.
- rejected envelopes are not observed.
- observation failure fails closed.
- the bounded sink drops the oldest observation when full and stores no payload
bodies.
- metrics expose only capacity, current depth, accepted total, and
dropped-oldest total.
- local metrics logging exposes only aggregate sink metrics and adds no read
API or Control Plane reporting.
- unchanged aggregate sink metrics are not repeatedly logged.
- production forwarding runtime is limited to `fabric.control` direct
next-hop and route-path-bound forwarding when the gate is explicitly enabled.
- `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is rejected above `10000`.
- observer errors and observer panics both fail closed as observation failure.
- validated production `fabric.control` envelope payloads are bounded to
`4096` bytes.
- validated production `fabric.control` envelope `created_at` values are
bounded to a one-minute future skew.
- backend synthetic config returns only peer endpoints and endpoint candidates
that belong to the route path containing the requesting node.
- node-agent scoped synthetic config validates endpoint candidate shape.
- endpoint candidate scoring returns ranked candidates and reason labels only;
it does not open connections, choose production routes, or forward payloads.
- health-aware scoring remains advisory and is not wired into route execution.
- Platform Owner visibility shows config/candidate/scoring state without
exposing this to organization panels.
- service channels remain rejected.
- arbitrary relay forwarding is not implemented.
- `/mesh/v1/forward` still returns unavailable for missing production forward
transport.
- local production forward event logs expose only metadata and add no read API
or Control Plane reporting.
- configured production `fabric.control` envelopes must match local route
config route_id, cluster, source, destination, path, next hop, allowed
channel, expiry, max TTL, and max hop count before forwarding.
- scoped peer directory/recovery seeds feed node-local peer cache and recovery
planning only; persistent connection management, NAT traversal, and
relay/rendezvous runtime are not implemented.
- peer cache runtime selects bounded warm peers and probes `/mesh/v1/health`
only; it does not maintain persistent data-plane connections or forward
service payloads.
- dynamic endpoint reporting requires an explicit advertised endpoint; automatic
public IP discovery and STUN/TURN/ICE NAT classification are not implemented.
- private/corporate endpoint handling is candidate/scoring/runtime-health only;
it does not imply automatic subnet discovery or service payload forwarding.
- peer connection state is node-local metadata for warm `/mesh/v1/health`
probes only; it does not create persistent sockets, relay/rendezvous
runtime, or service payload forwarding.
- peer recovery planning chooses bounded health-probe candidates only; it does
not create persistent sockets, automatic NAT traversal, relay/rendezvous
runtime, or service payload forwarding.
- peer connection intents classify planned maintain/probe/recover work and
transport readiness only; they do not open persistent sockets, perform
STUN/TURN/ICE, run relay/rendezvous, or forward service payloads.
- peer connection manager probes only control-plane `/mesh/v1/health`; direct,
private, and corporate peers are probed directly, and C17Z12 can resolve
matching outbound-only/relay-required peers through `rendezvous_leases` as
relay-control health probes. It does not forward service payloads.
- rendezvous lease refresh reloads node-scoped synthetic config and updates
route/peer/lease state in the running agent, but does not forward service
payloads.
- backend relay replacement consumes stale-relay heartbeat feedback, withdraws
stale explicit rendezvous leases, scores alternate relay candidates, and
returns replacement lease decisions as control-plane metadata only.
- route/path decisions publish effective control-plane paths and local next-hop
metadata only; they do not execute service routes or forward payloads.
- no RDP, VPN, file, video, or service workload traffic is forwarded.
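The bounded observation sink rules above (drop-oldest when full, no payload bodies, four aggregate metrics, capacity rejected above `10000`, change-driven logging) can be sketched as follows. All type and function names are hypothetical, and treating a capacity below 1 as an error is an assumption; the plan only states the upper bound.

```go
package main

import (
	"errors"
	"fmt"
)

// maxSinkCapacity is the documented upper bound for the sink capacity.
const maxSinkCapacity = 10000

// Observation holds metadata only; no payload body is stored.
type Observation struct {
	RouteID, MessageID, ChannelClass string
}

// Metrics exposes only the four aggregate values named in this section.
type Metrics struct {
	Capacity, Depth                   int
	AcceptedTotal, DroppedOldestTotal uint64
}

// Sink is a bounded in-memory store. Not safe for concurrent use.
type Sink struct {
	capacity int
	items    []Observation
	accepted uint64
	dropped  uint64
}

// NewSink rejects capacities outside 1..maxSinkCapacity.
func NewSink(capacity int) (*Sink, error) {
	if capacity < 1 || capacity > maxSinkCapacity {
		return nil, errors.New("sink capacity out of range")
	}
	return &Sink{capacity: capacity, items: make([]Observation, 0, capacity)}, nil
}

// Observe stores o, dropping the oldest observation when the sink is full.
func (s *Sink) Observe(o Observation) {
	if len(s.items) == s.capacity {
		copy(s.items, s.items[1:])
		s.items = s.items[:len(s.items)-1]
		s.dropped++
	}
	s.items = append(s.items, o)
	s.accepted++
}

// Metrics returns the current aggregate counters.
func (s *Sink) Metrics() Metrics {
	return Metrics{Capacity: s.capacity, Depth: len(s.items),
		AcceptedTotal: s.accepted, DroppedOldestTotal: s.dropped}
}

// ShouldLog implements change-driven logging: log only when metrics changed.
func ShouldLog(last, current Metrics) bool { return last != current }

func main() {
	s, err := NewSink(2)
	if err != nil {
		panic(err)
	}
	var last Metrics
	for _, id := range []string{"m1", "m2", "m3"} {
		s.Observe(Observation{RouteID: "r1", MessageID: id, ChannelClass: "fabric_control"})
		if m := s.Metrics(); ShouldLog(last, m) {
			fmt.Printf("%+v\n", m)
			last = m
		}
	}
}
```

The copy-based eviction is O(n) per drop; a ring buffer would avoid that, but the simpler form keeps the drop-oldest rule easy to see.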
Verification:
```powershell
go test ./...
```
Run from:
```powershell
agents\rap-node-agent
backend
```
## 16. Risks
Primary risks:
- accidentally routing service traffic too early
- creating hidden topology exposure
- bypassing route policy with direct peer links
- relay turning into arbitrary packet forwarder
- route cache becoming authority
- missing kill-switch or rollback
- mesh runtime interfering with RDP baseline
Mitigation:
- synthetic traffic only at C17A
- strict feature flags
- route result validation at every hop
- no service adapter integration until later approved stage
- topology-safe observability
- explicit rollback
## 17. Result / Decision
Stage C17 planning defines a safe, staged implementation path for mesh routing
runtime. The implemented stages are summarized below.

Stages limited to synthetic traffic or synthetic config:

- Stage C17A implements the first narrow runtime skeleton for synthetic Fabric
  messages only.
- Stage C17B adds route health/failover observations using synthetic Fabric
  messages only.
- Stage C17C adds relay semantic hardening for synthetic channel classes only.
- Stage C17D adds one bounded non-production `synthetic.echo` service-path
  experiment only.
- Stage C17E proves one historical synthetic HTTP carrier experiment using real
  local endpoints only; it is test-only and not representative of the active
  QUIC fabric runtime.
- Stage C17F proves scoped synthetic config loading and route-health reporting
  only.
- Stage C17G proves Control Plane scoped synthetic config read/consume only.
- Stage C17H proves deployed multi-agent Control Plane synthetic config
  consumption and synthetic route-health reporting on `docker-test` only.

Stages that keep production forwarding unavailable:

- Stage C17I adds an explicit production-forwarding gate; production forwarding
  stays unavailable until a later approved runtime stage.
- Stage C17J adds route-bound production envelope validation for fabric-control
  messages only.
- Stage C17K adds metadata-only local observation of accepted production
  envelopes.
- Stage C17L adds bounded local retention of accepted metadata-only
  observations.
- Stage C17M wires that bounded local retention into node-agent only when an
  explicit capacity is configured.
- Stage C17N adds local sink metrics.
- Stage C17O logs aggregate sink metrics locally.
- Stage C17P suppresses repeated unchanged local aggregate sink metrics logs.
- Stage C17Q separates production forwarding gate state from runtime state in
  local logs.
- Stage C17R adds a maximum local observation sink capacity guard.
- Stage C17S makes observer panic handling fail closed.
- Stage C17T adds a validated production fabric-control payload size boundary.
- Stage C17U adds a validated production fabric-control created-at future-skew
  boundary.
- Stage C17V adds a scoped peer endpoint candidate model and NAT/connectivity
  hints.
- Stage C17W adds deterministic local endpoint candidate scoring.
- Stage C17X adds health-aware local endpoint candidate scoring.
- Stage C17Y adds Platform Owner synthetic mesh config visibility.

Stages that keep production service traffic unavailable:

- Stage C17Z adds production fabric-control direct forwarding.
- Stage C17Z1 adds route-path-bound production fabric-control multi-hop
  forwarding.
- Stage C17Z2 adds local metadata-only production fabric-control forwarding
  observability.
- Stage C17Z3 adds route-config-bound production fabric-control forwarding
  validation.
- Stage C17Z4 adds scoped peer directory and bounded recovery seed config.
- Stage C17Z5 adds node-agent peer cache runtime and warm-peer health probes.
- Stage C17Z6 adds dynamic endpoint reporting/config projection.
- Stage C17Z7 adds private/corporate endpoint candidates and same-site scoring.
- Stage C17Z8 adds node-local warm-peer connection states and bounded backoff.
- Stage C17Z9 adds bounded node-local peer recovery planning.
- Stage C17Z10 adds node-local peer connection intent and transport readiness
  classification.
- Stage C17Z11 adds a real node-local peer connection manager for control-plane
  health.
- Stage C17Z12 adds node-scoped rendezvous/relay control-plane leases and
  relay-control health probes.
- Stage C17Z13 adds rendezvous lease admission and renewal-posture telemetry.
- Stage C17Z14 adds rendezvous lease refresh/reload and stale relay
  withdrawal/reselection telemetry.
- Stage C17Z15 adds backend relay replacement/withdrawal policy and alternate
  relay-pool scoring for stale rendezvous relays.
- Stage C17Z16 adds Control Plane route/path decision artifacts for original
  and effective hops.
- Stage C17Z17 adds node-side route generation apply/withdraw tracking for
  Control Plane route/path decisions.
- Stage C17Z18 applies those Control Plane route/path decisions to synthetic
  route-health route config only, so route-health probes can verify replacement
  effective paths.

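
The bounded backoff that Stage C17Z8 applies to warm-peer health probes can be sketched in Go as a doubling delay with a hard cap. This is an assumption about the general technique only: the function name `BackoffDelay` and the `baseDelay` and `maxDelay` values are hypothetical, and this document does not state the real bounds or schedule used by `rap-node-agent`.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical bounds; the values used by rap-node-agent are not stated in
// this document.
const (
	baseDelay = 500 * time.Millisecond
	maxDelay  = 30 * time.Second
)

// BackoffDelay returns the wait before the next health probe after the given
// number of consecutive failures: baseDelay doubled once per failure and
// capped at maxDelay. Zero or negative failure counts return baseDelay.
func BackoffDelay(failures int) time.Duration {
	d := baseDelay
	for i := 0; i < failures; i++ {
		d *= 2
		if d >= maxDelay {
			// Returning early keeps the delay bounded and avoids overflow
			// for very large failure counts.
			return maxDelay
		}
	}
	return d
}

func main() {
	for f := 0; f <= 7; f++ {
		fmt.Printf("failures=%d delay=%s\n", f, BackoffDelay(f))
	}
}
```

The cap is what makes the backoff bounded: a peer that has been unreachable for a long time is still probed at least once per `maxDelay`, so it can return to a warm state without an external reset.
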
Decisions:
- C17 is planning only.
- C17A is implemented and test-proven with synthetic fabric messages only.
- C17B is implemented and test-proven with synthetic route health/failover
messages only.
- C17C is implemented and test-proven with synthetic relay queues/QoS only.
- C17D is implemented and test-proven with one bounded `synthetic.echo`
test-service path only.
- C17E is implemented and smoke-proven with live HTTP synthetic direct and
single-relay paths only.
- C17F is implemented and smoke-proven with scoped synthetic route config and
link observation reporting only.
- C17G is implemented and test-proven with backend scoped synthetic config and
node-agent consumption only.
- C17H is implemented and runtime-proven with five deployed node-agent
containers, backend-issued node-scoped synthetic config, direct and
single-relay synthetic route-health observations, and production forwarding
disabled.
- C17I is implemented and test-proven with an explicit node-agent
production-forwarding gate. Enabling the gate still does not forward
production payloads because no production forwarding runtime is implemented
in this stage.
- C17J is implemented and test-proven with route-bound production envelope
validation on `/mesh/v1/forward`. Only `fabric_control` /
`fabric.control` is accepted for validation in this stage; service channels
are rejected and payloads are not forwarded.
- C17K is implemented and test-proven with metadata-only accepted-envelope
observation. Rejected envelopes are not observed, observation failure fails
closed, and payloads are not forwarded.
- C17L is implemented and test-proven with a bounded local accepted-observation
sink. Oldest observations are dropped when capacity is exceeded, payload
metadata is retained, payload bodies are not stored, and payloads are not
forwarded.
- C17M is implemented and test-proven with disabled-by-default node-agent
wiring for the bounded local accepted-observation sink. No observation read
API or Control Plane reporting is added in this stage.
- C17N is implemented and test-proven with local metrics for the bounded
accepted-observation sink. Metrics expose no observation records or payload
metadata. No observation read API or Control Plane reporting is added in this
stage.
- C17O is implemented and test-proven with local node-agent logging for
aggregate bounded-sink metrics. No observation read API or Control Plane
reporting is added in this stage.
- C17P is implemented and test-proven with change-driven suppression for
unchanged aggregate bounded-sink metrics logs. No observation read API or
Control Plane reporting is added in this stage.
- C17Q is implemented and test-proven with local log separation between
production forwarding gate state and production forwarding runtime state.
Runtime state remains false.
- C17R is implemented and test-proven with a maximum local observation sink
capacity guard.
- C17S is implemented and test-proven with panic-safe fail-closed observation
handling.
- C17T is implemented and test-proven with an explicit validated
fabric-control payload size boundary.
- C17U is implemented and test-proven with an explicit validated
fabric-control created-at future-skew boundary.
- C17V is implemented and test-proven with route-scoped peer endpoint
candidates and NAT/connectivity hints in synthetic config.
- C17W is implemented and test-proven with deterministic local endpoint
candidate scoring.
- C17X is implemented and test-proven with health-aware endpoint candidate
scoring.
- C17Y is implemented and build/test-proven with Platform Owner synthetic mesh
config visibility.
- C17Z is implemented and test-proven with gate-controlled production
`fabric.control` direct forwarding.
- C17Z1 is implemented and test-proven with gate-controlled route-path-bound
production `fabric.control` multi-hop forwarding.
- C17Z2 is implemented and test-proven with local metadata-only production
`fabric.control` forwarding event logs for accepted, forwarded, delivered,
and rejected envelopes.
- C17Z3 is implemented and test-proven with route-config-bound production
`fabric.control` forwarding validation.
- C17Z4 is implemented and test/build-proven with node-scoped peer directory
and recovery seed config.
- C17Z5 is implemented and test-proven with node-agent peer cache runtime and
warm-peer health probes.
- C17Z6 is implemented and test-proven with explicit advertised endpoint
reporting and scoped config projection.
- C17Z7 is implemented and test-proven with multiple public/private/corporate
endpoint candidates and same-site scoring.
- C17Z8 is implemented and test-proven with node-local warm-peer connection
states and bounded backoff.
- C17Z9 is implemented and test-proven with bounded node-local peer recovery
planning.
- C17Z10 is implemented and test-proven with node-local peer connection
intents and transport readiness classification.
- C17Z11 is implemented and test-proven with a node-local peer connection
manager for control-plane health.
- C17Z12 is implemented and docker-test-runtime-proven with node-scoped
`rendezvous_leases`; matching `waiting_rendezvous` intents become
`relay_control` health probes and record/maintain `relay_ready`.
- C17Z13 is implemented and docker-test-runtime-proven with
`mesh_rendezvous_lease_report` heartbeat telemetry for relay admission,
peer admission, TTL/renewal posture, and `relay_ready`.
- C17Z14 is implemented and docker-test-runtime-proven with node-scoped
synthetic-config refresh for renewal-needed rendezvous leases, runtime
peer cache/route/lease reload, refresh counters, and stale relay
withdrawal/reselection telemetry.
- C17Z15 is implemented and docker-test-runtime-proven with backend
stale-relay feedback handling, stale rendezvous lease withdrawal, alternate
relay scoring, replacement lease issuance, and node-agent relay replacement
telemetry.
- C17Z16 is implemented and docker-test-runtime-proven with
`route_path_decisions` in synthetic config and
`mesh_route_path_decision_report` heartbeat telemetry for control-plane
route generation/effective path metadata.
- C17Z17 is implemented and docker-test-runtime-proven with
`mesh_route_generation_report` heartbeat telemetry for active/applied/
unchanged/withdrawn route generation state over control-plane
`route_path_decisions`.
- C17Z18 is implemented and docker-test-runtime-proven with synthetic
route-health effective-path probing from Control Plane
`route_path_decisions`, route-health config telemetry, and latest-link
preservation by observation type/route.
- No RDP, VPN, or production service traffic may use mesh after C17Z18.
- Route execution must consume Fabric Routing Engine route results.
- Relay forwarding must be route-bound, TTL-bound, hop-bound, and policy-bound.
- Observability and kill-switches are required before runtime begins.
- C17A proves direct and single-relay synthetic routes in a test topology.
- No further mesh runtime step is authorized without a new explicit staged
prompt.
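
The bounded accepted-observation sink described for C17L, together with the capacity guard from C17R, can be sketched as a fixed-size ring buffer. The following Go code is a minimal illustration, not the node-agent implementation: `Sink`, `Observation`, `MaxSinkCapacity`, and all field names are hypothetical. `Snapshot` exists only so the sketch can be checked; the stages above add no observation read API.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// MaxSinkCapacity is a hypothetical upper bound standing in for the C17R guard.
const MaxSinkCapacity = 1024

// Observation holds envelope metadata only; there is deliberately no field
// for the payload body.
type Observation struct {
	RouteID      string
	Channel      string
	PayloadBytes int
}

// Sink is a fixed-capacity ring buffer that overwrites the oldest entry when full.
type Sink struct {
	mu      sync.Mutex
	buf     []Observation
	head    int // index of the oldest retained entry
	count   int
	dropped uint64
}

// NewSink rejects capacities that are non-positive or above MaxSinkCapacity.
func NewSink(capacity int) (*Sink, error) {
	if capacity <= 0 || capacity > MaxSinkCapacity {
		return nil, errors.New("sink capacity out of range")
	}
	return &Sink{buf: make([]Observation, capacity)}, nil
}

// Add stores o, dropping the oldest observation if the sink is full.
func (s *Sink) Add(o Observation) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.count == len(s.buf) {
		s.buf[s.head] = o
		s.head = (s.head + 1) % len(s.buf)
		s.dropped++
		return
	}
	s.buf[(s.head+s.count)%len(s.buf)] = o
	s.count++
}

// Snapshot returns retained observations, oldest first.
func (s *Sink) Snapshot() []Observation {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make([]Observation, 0, s.count)
	for i := 0; i < s.count; i++ {
		out = append(out, s.buf[(s.head+i)%len(s.buf)])
	}
	return out
}

// Dropped reports how many observations were overwritten.
func (s *Sink) Dropped() uint64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.dropped
}

func main() {
	s, err := NewSink(2)
	if err != nil {
		panic(err)
	}
	for _, id := range []string{"r1", "r2", "r3"} {
		s.Add(Observation{RouteID: id, Channel: "fabric.control", PayloadBytes: 64})
	}
	fmt.Println(len(s.Snapshot()), s.Dropped())
}
```

A ring buffer keeps memory use constant regardless of traffic volume, and aggregate counters such as `Dropped` can be logged without exposing any observation record.
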
No RDP, data-plane, VPN, relay production traffic, or service workload
behavior is changed by any stage from C17A through C17Z18.
The only runtime code added in `rap-node-agent` is:

- disabled-by-default synthetic mesh probe
- synthetic route health/failover
- synthetic relay scheduling
- bounded `synthetic.echo` test-service execution
- live synthetic HTTP peer transport
- explicit production-forwarding gate checks
- route-bound production envelope validation
- metadata-only accepted-envelope observation
- bounded local accepted-observation retention, wiring, metrics, and local
  change-driven logging
- capacity guarding, fail-closed observation hardening, and payload and
  time-boundary validation
- scoped endpoint candidate config validation/scoring and health-aware scoring
  overlay

Platform Owner visibility is added in `web-admin`.

The stages from C17Z onward each add only the following:

- C17Z: route-bound production `fabric.control` local delivery/direct next-hop
  forwarding behind an explicit gate
- C17Z1: route-path-bound production `fabric.control` multi-hop forwarding
- C17Z2: local metadata-only production `fabric.control` forwarding event logs
- C17Z3: local route-config validation for production `fabric.control`
  forwarding
- C17Z4: scoped peer directory and recovery seed config boundaries
- C17Z5: node-agent peer cache runtime and warm-peer health probes
- C17Z6: explicit advertised endpoint reporting and scoped config projection
- C17Z7: multiple private/corporate endpoint candidates and same-site scoring
- C17Z8: node-local warm-peer connection state tracking and bounded
  health-probe backoff
- C17Z9: bounded peer recovery planning and metadata reporting
- C17Z10: peer connection intent and transport readiness metadata
- C17Z11: control-plane health connection manager probing and metadata
- C17Z12: rendezvous/relay control-plane lease metadata and relay health probes
- C17Z13: rendezvous lease telemetry for admission, renewal posture, and
  relay-ready state
- C17Z14: node-scoped lease refresh/reload, refresh counters, and stale relay
  withdrawal/reselection telemetry
- C17Z15: backend relay replacement policy, alternate relay scoring, and
  replacement lease control-plane metadata
- C17Z16: route/path decision control-plane metadata and node heartbeat
  reporting for those decisions
- C17Z17: node-side route generation apply/withdraw metadata reporting for
  those control-plane decisions
- C17Z18: synthetic route-health effective-path probing, route-health config
  telemetry, drift metadata, and latest-link visibility separation for
  observation types/routes
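
The payload size and created-at future-skew boundaries (C17T and C17U) amount to two cheap checks made before an envelope is accepted. The Go sketch below shows one possible shape under stated assumptions: `Limits`, `CheckBoundaries`, the error values, and the millisecond timestamps are all hypothetical, and the limit values shown are not the ones configured in `rap-node-agent`.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Limits holds hypothetical boundary values for accepted envelopes.
type Limits struct {
	MaxPayloadBytes int
	MaxFutureSkewMs int64
}

var (
	ErrPayloadTooLarge = errors.New("payload exceeds size boundary")
	ErrCreatedInFuture = errors.New("created-at exceeds future-skew boundary")
)

// CheckBoundaries rejects an envelope whose payload is larger than the size
// boundary or whose created-at timestamp lies further in the future than the
// allowed skew. Timestamps are Unix milliseconds. Timestamps in the past are
// not rejected here; expiry is assumed to be handled by a separate TTL check.
func CheckBoundaries(payloadLen int, createdAtMs, nowMs int64, l Limits) error {
	if payloadLen > l.MaxPayloadBytes {
		return ErrPayloadTooLarge
	}
	if createdAtMs-nowMs > l.MaxFutureSkewMs {
		return ErrCreatedInFuture
	}
	return nil
}

func main() {
	limits := Limits{MaxPayloadBytes: 4096, MaxFutureSkewMs: 5000}
	now := time.Now().UnixMilli()

	// Within both boundaries: accepted.
	fmt.Println(CheckBoundaries(128, now, now, limits))
	// Created one minute in the future: rejected.
	fmt.Println(CheckBoundaries(128, now+60000, now, limits))
	// Oversized payload: rejected.
	fmt.Println(CheckBoundaries(8192, now, now, limits))
}
```

A small future-skew allowance tolerates ordinary clock drift between nodes while still rejecting envelopes stamped far ahead of the local clock.
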