# rap-node-agent

Native node agent MVP for the Secure Access Fabric.

Status: Stage C17Z18 synthetic route-health effective path boundary.

This agent is intentionally native. Containers may package service workloads,
but the host-level node identity belongs to `rap-node-agent`.

## Current Scope

Implemented:

- config loading from flags/environment
- local identity state file
- enrollment request client
- heartbeat client
- capability/facts payload
- status-only service reporting payload
- mesh control-channel skeleton
- route-health message skeleton
- relay skeleton that refuses production payload forwarding
- disabled-by-default synthetic mesh runtime for `fabric.probe` /
  `fabric.probe_ack`
- direct and single-relay synthetic route tests
- synthetic `fabric.route_health` / `fabric.route_health_ack`
- local route success/failure observations
- fallback route selection for test topology
- route cache invalidation on version changes
- synthetic relay envelope validation
- per-channel bounded queues for synthetic traffic
- QoS dequeue order: `fabric_control`, then `route_control`, then `telemetry`
- telemetry-only stale message drop under backpressure
- reliable fabric/control queue rejection when full
- bounded non-production `synthetic.echo` test-service path
- direct, single-relay, and forced-fallback test-service proofs
- live QUIC peer transport for synthetic mesh envelopes
- disabled-by-default synthetic mesh QUIC endpoint in `rap-node-agent`
- `mesh-live-smoke` harness proving direct and single-relay synthetic traffic
  over real local QUIC endpoints
- scoped synthetic mesh config file loading for peer endpoints and routes
- Control Plane synthetic mesh config read fallback when no local scoped config
  file is set
- synthetic route-health observations reported to the Control Plane when test
  flags allow synthetic links
- explicit production mesh forwarding gate config; production forwarding still
  has no runtime implementation and remains unavailable
- route-bound production mesh envelope contract and fail-closed validation on
  the QUIC production-forward path
- metadata-only production envelope observation hook for valid envelopes, still
  without forwarding payloads
- bounded metadata-only production envelope observation sink for accepted
  observations
- disabled-by-default node-agent wiring for the bounded observation sink
- local metrics for the bounded observation sink without exposing observation
  records
- local node-agent logging for bounded observation sink metrics
- change-driven suppression for unchanged bounded observation sink metrics logs
- explicit local log distinction between production forwarding gate state and
  production forwarding runtime state
- node-scoped rendezvous lease refresh through Control Plane synthetic config
- stale relay withdrawal/reselection telemetry
- relay replacement contract reporting for stale rendezvous relays
- route/path decision contract reporting for control-plane route generations
- route generation apply/withdraw tracking for control-plane path decisions
- synthetic route-health route config refresh from Control Plane path
  decisions
- route-health expected/observed effective path drift reporting
- host-agent Docker update plan executor with artifact checksum/size
  verification, container replacement, health check, status reporting, and
  rollback attempt
- host-agent update loop for service/timer placement
- host-agent binary self-update loop for the updater service itself
- maximum capacity guard for the local production observation sink
- panic-safe fail-closed production envelope observation wrapper
- explicit `4096` byte payload boundary for validated production
  fabric-control envelopes
- explicit future-skew boundary for validated production envelope `created_at`
- scoped synthetic peer endpoint candidate config with reachability,
  NAT/connectivity hints, priority, policy tags, and metadata
- deterministic local peer endpoint candidate scoring model for synthetic
  config candidates
- optional local health observation overlay for endpoint candidate scoring
- gate-controlled production `fabric.control` direct next-hop delivery
- route-path-bound production `fabric.control` multi-hop forwarding
- local metadata-only production `fabric.control` forwarding event logs
- route-config-bound production `fabric.control` forwarding validation
- scoped peer directory and bounded recovery seed config parsing/validation
- node-local peer cache with bounded warm peer health probes
- advertised mesh endpoint reporting through heartbeat metadata
- multiple advertised endpoint candidates, including private/corporate LAN
- peer connection state machine for warm-peer health
- bounded peer recovery planner over peer cache and connection states
- peer connection intent planner with transport readiness classification
- peer connection manager for real control-plane health over reusable
  QUIC fabric transport
- route-health effective-path runtime through replacement relay control paths

Not implemented yet:

- mesh packet routing
- production mesh service traffic
- VPN runtime
- production workload supervision
- certificate issuance/rotation
- in-agent native updater runtime
- privileged host route/firewall control

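The queueing items in the implemented list (per-channel bounded queues, the
`fabric_control` / `route_control` / `telemetry` dequeue order, telemetry drop
under backpressure, and rejection for full control queues) can be pictured with
a minimal sketch. The type and function names are hypothetical, and treating
the oldest telemetry entry as the stale one is an assumption; the agent's real
staleness rule is not described here.

```go
package main

// message is a hypothetical synthetic mesh message.
type message struct {
	Channel string
	ID      string
}

// dequeueOrder is the QoS order named in the implemented list.
var dequeueOrder = []string{"fabric_control", "route_control", "telemetry"}

// channelQueues holds one bounded FIFO per known channel.
type channelQueues struct {
	capacity  int
	byChannel map[string][]message
}

func newChannelQueues(capacity int) *channelQueues {
	if capacity < 1 {
		capacity = 1
	}
	q := &channelQueues{capacity: capacity, byChannel: map[string][]message{}}
	for _, ch := range dequeueOrder {
		q.byChannel[ch] = nil
	}
	return q
}

// Enqueue reports whether m was accepted. Unknown channels and full control
// queues are rejected; a full telemetry queue drops its oldest entry instead.
func (q *channelQueues) Enqueue(m message) bool {
	items, known := q.byChannel[m.Channel]
	if !known {
		return false
	}
	if len(items) >= q.capacity {
		if m.Channel != "telemetry" {
			return false
		}
		items = items[1:] // assumption: oldest telemetry entry is the stale one
	}
	q.byChannel[m.Channel] = append(items, m)
	return true
}

// Dequeue returns the next message in QoS order, or false when all queues are empty.
func (q *channelQueues) Dequeue() (message, bool) {
	for _, ch := range dequeueOrder {
		if items := q.byChannel[ch]; len(items) > 0 {
			q.byChannel[ch] = items[1:]
			return items[0], true
		}
	}
	return message{}, false
}
```

With this shape a burst of telemetry can never delay or evict a control
message, which is the property the QoS order is meant to protect.
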
## Build

```powershell
cd agents\rap-node-agent
go test ./...
go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent
go build -buildvcs=false -o bin\rap-host-agent.exe .\cmd\rap-host-agent
go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke
```

## Docker Host Agent Bootstrap

`rap-host-agent` is the first host-level installer/updater boundary for Docker
placement. It does not join the mesh itself. It applies the cluster's install
intent locally by running the `rap-node-agent` container with a persistent host
state directory. On Linux it also installs a systemd `update-loop` service by
default, so nodes continue to update from Control Plane policy without operator
commands on each host.

Preferred fabric-native install:

```bash
rap-host-agent install \
  --bootstrap-bundle ./docker-node-1.bootstrap.json
```

Offline/import bootstrap is also supported:

```bash
rap-host-agent install \
  --bootstrap-bundle ./docker-node-1.bootstrap.json
```

The bootstrap bundle carries the signed install profile, pinned cluster
authority key, and QUIC fabric registry seeds. The host-agent applies Docker
image, container, state-dir, mesh listen, advertise, NAT/connectivity, and
region settings locally, then the node-agent enrolls through QUIC fabric.

Manual install is still supported:

```bash
rap-host-agent install \
  --bootstrap-bundle ./docker-node-1.bootstrap.json
```

The command creates or replaces only the local Docker container. The running
node-agent submits the join request, waits for owner approval, stores its
identity in the mounted state directory, and then sends heartbeats. Re-running
with `--replace` updates the container while preserving node identity. Pass
`--auto-update-enabled=false` only for lab/debug installs where the local
systemd updater must not be registered.

Useful checks:

```bash
rap-host-agent status --container-name rap-node-agent-docker-node-1
docker logs -f rap-node-agent-docker-node-1
```

For a node that was installed before the updater existed, register only the
local updater service without recreating the node-agent container:

```bash
rap-host-agent install-updater \
  --state-dir /var/lib/rap/nodes/docker-node-1 \
  --container-name rap-node-agent-docker-node-1
```

## Docker Host Agent Updates

`rap-host-agent update` applies one Control Plane update plan for an already
enrolled Docker node. The host-agent fetches the plan, downloads the selected
Docker image tar, verifies size and sha256, loads the image, recreates the
node-agent container from the existing Docker runtime settings, checks that the
container is running, and reports update phases back to the Control Plane.

```bash
rap-host-agent update \
  --cluster-id <cluster_id> \
  --node-id <node_id> \
  --container-name rap-node-agent-docker-node-1 \
  --current-version 0.1.0-c17z26
```

`rap-host-agent update-loop` is the per-node executor and health boundary. It
does not need to poll for normal releases: the node-agent receives an
`rap.node_update_hint.v1` subscription hint from Control Plane or the assigned
update-cache service during heartbeat, writes `<state-dir>/update-trigger.json`,
and the host-agent wakes immediately. The interval is an emergency fallback for
missed hints, service migration, or a dead update-cache service; keep it long
in production. The loop keeps running after transient errors by default and
advances its in-process current version after a successful update so it does
not repeatedly apply the same plan. When started without `--node-id` it reads
`<state-dir>/identity.json` and waits until the approved node identity appears,
which lets the updater service start immediately during first install. It also
persists the last applied node-agent version in
`<state-dir>/host-update-state.json` so a service restart does not reapply an
already-installed release.

```bash
rap-host-agent update-loop \
  --cluster-id <cluster_id> \
  --node-id <node_id> \
  --container-name rap-node-agent-docker-node-1 \
  --current-version 0.1.0-c17z26 \
  --interval-seconds 21600 \
  --jitter 0.15
```

Update-cache nodes are ordinary cluster nodes with the `update-cache` role.
Control Plane assigns a healthy update-cache node in the heartbeat hint. If the
assigned service disappears, the next hint returns `control_plane_fallback` or a
new service assignment; the local updater stays subscribed and only uses the
long fallback timer as a last resort.

`rap-host-agent update-host-agent-loop` updates the host-agent binary itself.
Only one global systemd unit is installed per Docker host:
`rap-host-agent-self-updater.service`. It uses one approved local node identity
to ask Control Plane for product `rap-host-agent` with install type
`linux_binary`, verifies the downloaded binary size and sha256, atomically
replaces `/usr/local/bin/rap-host-agent`, and reports status. The already
running process continues until systemd restarts it, while new invocations use
the new binary.

```bash
rap-host-agent update-host-agent-loop \
  --cluster-id <cluster_id> \
  --state-dir /var/lib/rap/nodes/docker-node-1 \
  --binary-path /usr/local/bin/rap-host-agent
```

## Windows Host Agent Bootstrap And Updates

Windows uses the same bootstrap bundle model, but the local placement is a
Scheduled Task instead of Docker. In `--startup-mode auto` the installer first
tries an elevated `ONSTART` task running as `SYSTEM`; without admin rights it
falls back to a per-user `ONLOGON` task. The `ONSTART` mode starts after reboot
without an interactive user session. The `ONLOGON` fallback can only start after
that Windows user signs in.

```cmd
%TEMP%\rap-host-agent.exe install-windows --bootstrap-bundle "C:\bootstrap\office-win-1.bootstrap.json" --startup-mode "auto"
```

Offline/import bootstrap is also supported:

```cmd
%TEMP%\rap-host-agent.exe install-windows --bootstrap-bundle "C:\bootstrap\office-win-1.bootstrap.json" --startup-mode "auto"
```

`install-windows` installs two tasks:

- `RAP Node Agent <node>` runs `rap-node-agent.exe`.
- `RAP Host Agent Updater <node>` runs `rap-host-agent update-loop` for product
  `rap-node-agent`, install type `windows_service`, and replaces the local
  `rap-node-agent.exe` from signed release artifacts.

During first bootstrap the updater can read `<state-dir>\identity.json` and
will wait until the join request is approved. For an already-enrolled Windows
node, prefer passing `--node-id` explicitly. That makes the updater wrapper
independent from the local identity file location and is required for repair of
older Windows installs where the node is already heartbeat-healthy but the
host-agent updater has no usable identity file.

The repair path also reuses the local signed bootstrap/runtime state; it does
not require any backend URL.

The admin UI node details page generates a downloadable
`rap-repair-updater-<node>.cmd` for this repair path. It performs these steps:

- prints `schtasks /Query` diagnostics for the node-agent and updater tasks;
- prints the local `rap-*.exe*` files;
- downloads the current `rap-host-agent.exe`;
- reinstalls the Windows updater wrapper with `--node-id`;
- runs a foreground one-shot `update-loop --max-runs 1`;
- applies `rap-host-agent.exe.next` if the running host-agent could not replace
  itself;
- restarts `RAP Host Agent Updater <node>`;
- prints post-repair diagnostics.

Expected successful updater reports in the admin panel:

```text
rap-node-agent <target> -> <target> plan/noop
rap-host-agent <target> -> <target> plan/noop
```

If the latest host-agent report is `apply/staged`, the new host-agent binary
was downloaded as `rap-host-agent.exe.next` but the running process still held
the old executable. End and run the updater task once, or rerun the generated
repair command:

```cmd
schtasks /End /TN "RAP Host Agent Updater office-win-1"
schtasks /Run /TN "RAP Host Agent Updater office-win-1"
```

### Windows Reboot / Autostart Verification

After installation or repair, verify that both scheduled tasks survive a reboot:

1. Reboot the Windows host, or at minimum restart both scheduled tasks.
2. Confirm the tasks exist:

   ```cmd
   schtasks /Query /TN "RAP Node Agent office-win-1" /V /FO LIST
   schtasks /Query /TN "RAP Host Agent Updater office-win-1" /V /FO LIST
   ```

3. Confirm the admin panel shows:

   ```text
   heartbeat: fresh
   rap-node-agent: plan/noop
   rap-host-agent: plan/noop
   node version_state: current
   ```

Without admin rights, `install-windows --startup-mode auto` may fall back to
`user-task`. That node can still heartbeat and update after the user logs in,
but it will not start before logon after a reboot. Use an elevated shell for
production Windows nodes that must recover unattended.

Control Plane release artifacts for Windows must use:

- `product=rap-node-agent`
- `os=windows`
- `arch=amd64`
- `install_type=windows_service`
- `kind=binary`

## First Enrollment

Create a join token from the platform Control Plane, then start the agent with
a signed bootstrap bundle plus QUIC fabric registry seeds. The node enrolls
only through QUIC fabric inside the farm.

The agent submits a pending join request and exits. It does not self-activate.
A platform admin must approve the join request.

## Enrollment Approval

When the agent enrolls, it stores the returned `pending_join_request_id` and
polls the Control Plane bootstrap endpoint until the platform owner approves
the request or the enrollment timeout expires. After approval, the agent
verifies the signed bootstrap contract and writes the approved `node_id`,
`cluster_id`, `identity_status=active`, `cluster_authority_public_key`, and
`cluster_authority_fingerprint` into `identity.json`.

Future C3 hardening can add signed node certificates and automatic secure
certificate material exchange.

Then run the agent again:

```powershell
.\bin\rap-node-agent.exe `
  -state-dir C:\ProgramData\RapNodeAgent
```

It sends periodic heartbeats through the signed `control-api` service over QUIC
fabric:

```text
fabric control path /clusters/{clusterID}/nodes/{nodeID}/heartbeats
```

## Environment Variables

- `RAP_CLUSTER_ID`
- `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY`
- `RAP_CLUSTER_AUTHORITY_FINGERPRINT`
- `RAP_JOIN_TOKEN`
- `RAP_NODE_NAME`
- `RAP_NODE_STATE_DIR`
- `RAP_WORKLOAD_SUPERVISION_ENABLED`
- `RAP_HEARTBEAT_INTERVAL_SECONDS`
- `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS`
- `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
- `RAP_FABRIC_RUNTIME_ENABLED`
- `RAP_FABRIC_LISTEN_ADDR`
- `RAP_MESH_ADVERTISE_ENDPOINT`
- `RAP_MESH_ADVERTISE_ENDPOINTS_JSON`
- `RAP_MESH_ADVERTISE_TRANSPORT`
- `RAP_MESH_CONNECTIVITY_MODE`
- `RAP_MESH_NAT_TYPE`
- `RAP_MESH_REGION`
- `RAP_MESH_SYNTHETIC_CONFIG`
- `RAP_MESH_PEER_ENDPOINTS_JSON`
- `RAP_MESH_SYNTHETIC_ROUTES_JSON`
- `RAP_MESH_PRODUCTION_FORWARDING_ENABLED`
- `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY`

`RAP_FABRIC_RUNTIME_ENABLED` defaults to `false`. It gates only the
C17A/C17B/C17C/C17D/C17E synthetic probe, route-health, relay scheduling,
bounded `synthetic.echo` test-service runtime, and live synthetic QUIC endpoint.
It must not be used for RDP, VPN, file, video, or other production service
traffic.

`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. When enabled, the agent
polls node-scoped desired workloads and reports status. The current bounded
runtime reports built-in `core-mesh` and `fabric-listener` services as running
when enabled, supports the native built-in `synthetic.echo` test workload, and
keeps unsupported production workloads such as RDP workers degraded until their
supervisors are implemented.

For Remote Workspace/RDP integration work, the native `rdp-worker` desired
workload supports only an explicit `adapter_contract_probe` mode. That mode
reports the remote-workspace adapter channel contract and requires Fabric
Service Channel as the future data plane; it does not start FreeRDP, create a
remote session, or carry production RDP payloads.

`RAP_FABRIC_LISTEN_ADDR` names the historical synthetic listener address, but
the current runtime is QUIC-fabric-only and does not start an HTTP listener.
`RAP_MESH_SYNTHETIC_CONFIG` points to a scoped synthetic mesh config snapshot
and is preferred over debug JSON. `RAP_MESH_PEER_ENDPOINTS_JSON` is a JSON
object mapping peer node IDs to endpoint URLs. `RAP_MESH_SYNTHETIC_ROUTES_JSON`
is a JSON array of synthetic route objects. If no local scoped config file is
set, the agent asks the Control Plane for:

```text
/clusters/{clusterID}/nodes/{nodeID}/mesh/synthetic-config
```

The JSON variables are debug fallback only.

Control Plane synthetic config with `authority_required=true` must include a
signed `authority_payload` / `authority_signature` envelope and a
`cluster_authority` descriptor. The agent verifies the signature, validates the
config hash, and rejects mismatched pinned authority values when
`RAP_CLUSTER_AUTHORITY_PUBLIC_KEY`, `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, or the
same fields in `identity.json` are set.

`RAP_MESH_PRODUCTION_FORWARDING_ENABLED` defaults to `false`. It is a future
production-forwarding gate only. Turning it on does not enable production mesh
payload forwarding; the runtime still refuses service traffic after validating
the route-bound production envelope contract, until a later approved
production mesh stage implements route-bound, policy-bound forwarding.

The production envelope contract requires route, hop, TTL, expiry, payload
length, and SHA-256 payload hash fields. C17J accepts only the
`fabric_control` channel class and `fabric.control` message type for
validation. RDP, VPN, render, file, video, and service workload channels are
rejected.

C17K adds a local metadata-only observation hook after successful production
envelope validation. Observations include route/message/hop/channel metadata and
payload length/hash, not the payload body. Observation failure fails closed, and
the endpoint still does not forward payloads.

C17L adds a bounded in-memory observation sink for accepted metadata-only
observations. The sink drops the oldest observation when full and still stores
no payload bodies.

`RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` defaults to `0`. When set above
zero, C17M wires the bounded metadata-only sink into the node-agent mesh server.
This remains local-only, exposes no read API, stores no payload bodies, and
does not enable production forwarding. C17R rejects values above `10000`.

C17N adds local sink metrics: configured capacity, current depth, accepted
total, and dropped-oldest total. Metrics do not expose observation records,
route IDs, message IDs, hashes, payload metadata, or payload bodies.

C17O logs those aggregate metrics locally from the node-agent loop when the
sink is explicitly enabled. This does not add a read API or Control Plane
reporting.

C17P logs aggregate sink metrics only when they change, so steady heartbeat
loops do not repeat identical local metrics lines.

C17Q logs `production_forwarding_gate_enabled` separately from
`production_forwarding_runtime_enabled`. The runtime field remains `false`;
turning on the gate still does not enable production forwarding.

C17S makes production envelope observation panic-safe. Observer errors and
observer panics both fail closed as observation failure; forwarding remains
unavailable.

C17T limits validated production `fabric.control` envelope payloads to 4096
bytes. Oversized envelopes are rejected before observation.

C17U rejects production `fabric.control` envelopes whose `created_at` is more
than one minute in the future.

C17V adds scoped peer endpoint candidates to synthetic mesh config. Candidate
entries describe possible per-node endpoints with transport, address,
reachability, NAT type, connectivity mode, priority, policy tags, verification
time, and metadata. They are model/config hints only; no production route
scoring, NAT traversal, shortcut routing, or forwarding runtime is implemented.

C17W adds deterministic local scoring for scoped endpoint candidates. Scoring
uses transport, reachability, connectivity mode, NAT type, priority, preferred
region, policy tags, channel class, and verification age. It returns ranked
candidates and reason labels only; it does not select production routes, open
connections, perform NAT traversal, or forward payloads.

C17X extends candidate scoring with optional local health observations keyed by
`endpoint_id`. Observations can contribute latency, success/failure history,
recent failure reason, reliability score, and freshness/staleness signals.
The score remains advisory only and is not wired into production forwarding.

C17Z adds the first narrow production forwarding runtime. When
`RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`, the QUIC production-forward
handler can deliver route-bound `fabric.control` envelopes at the local
destination or forward them to a direct next hop from explicit peer endpoint
config. Service channels, RDP/VPN/file/video payloads, arbitrary relay
forwarding, and multi-hop production route execution remain unavailable.

C17Z1 adds route-path-bound multi-hop forwarding for production
`fabric.control` only. Envelopes may carry `route_path` and
`visited_node_ids`; each relay validates its path position, forwards only to
the next route-path node, updates TTL/hop/visited metadata, and rejects loops.
Service payloads remain unavailable.

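The per-relay step in C17Z1 can be sketched as a pure function over the
envelope's path metadata. The field names mirror `route_path` and
`visited_node_ids`, but the struct and `advanceHop` are hypothetical, and the
sketch assumes the visited list records nodes in path order starting with the
source.

```go
package main

import "errors"

// hopEnvelope is a hypothetical view of the path metadata on an envelope.
type hopEnvelope struct {
	RoutePath      []string
	VisitedNodeIDs []string
	TTL            int
	HopCount       int
}

// advanceHop validates that self is the next expected node on the route path,
// then returns the updated envelope and the node to forward to. An empty next
// hop means self is the destination and the envelope is delivered locally.
func advanceHop(e hopEnvelope, self string) (hopEnvelope, string, error) {
	for _, v := range e.VisitedNodeIDs {
		if v == self {
			return e, "", errors.New("loop: node already visited")
		}
	}
	// Assumption: visited nodes are recorded in path order, so the number
	// of visited nodes is this node's expected index in the route path.
	pos := len(e.VisitedNodeIDs)
	if pos >= len(e.RoutePath) || e.RoutePath[pos] != self {
		return e, "", errors.New("node is not at the expected route path position")
	}
	if e.TTL <= 0 {
		return e, "", errors.New("ttl exhausted")
	}
	out := e
	// Copy before appending so the caller's slice is never mutated.
	out.VisitedNodeIDs = append(append([]string(nil), e.VisitedNodeIDs...), self)
	out.TTL--
	out.HopCount++
	if pos == len(e.RoutePath)-1 {
		return out, "", nil
	}
	return out, e.RoutePath[pos+1], nil
}
```

Because the next hop is read from the route path rather than chosen by the
relay, a relay in this sketch cannot redirect an envelope off its declared path.
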
C17Z2 emits local `mesh_production_forward_event` logs for production
`fabric.control` forwarding outcomes: accepted, forwarded, delivered, and
rejected. Logs include route/message/hop/channel/status/reason/TTL/hop count/
route path length/visited count/payload length metadata only. Payload bodies
are not logged, no observation read API is added, and service payloads remain
unavailable.

C17Z3 binds production `fabric.control` forwarding to loaded scoped or
Control Plane route config when routes are available locally. Configured
envelopes must match `route_id`, cluster, source, destination, route path,
next hop, allowed channel, expiry, max TTL, and max hop count before
forwarding. If no route config is present, existing C17Z1 behavior is
preserved. Service payloads remain unavailable.

C17Z4 adds scoped peer directory and recovery seed config. `peer_directory`
describes only peers needed by the node-scoped mesh config. `recovery_seeds`
is an explicit, bounded bootstrap/recovery list and is not a full cluster node
list. The node-agent parses and validates these fields, but does not yet
implement a persistent connection manager, NAT traversal, or
relay/rendezvous runtime.

C17Z5 turns scoped peer directory and recovery seed config into node-local
runtime `PeerCache` state. The cache builds a bounded warm peer set from
route-adjacent peers, recovery seeds, peer endpoints, and endpoint candidates.
When synthetic mesh testing is enabled, the node-agent probes warm peers with
QUIC fabric live probes and reports metadata-only mesh-link observations. This
is not a persistent connection manager and does not forward service payloads.

C17Z6 adds advertised mesh endpoint reporting. When
`RAP_MESH_ADVERTISE_ENDPOINT` is set, node-agent includes a
`mesh_endpoint_report` in heartbeat metadata with transport, connectivity mode,
NAT hint, region, observed time, and endpoint candidate metadata. Control Plane
can project the latest reported endpoint into node-scoped synthetic mesh config
for route-path peers. This does not perform automatic public IP discovery,
STUN/TURN/ICE NAT classification, or service payload forwarding.

C17Z7 adds `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` for multiple advertised
endpoints per node. Candidates can describe public, private, corporate/LAN,
outbound, or relay-style addresses. Endpoint scoring rewards `private-lan`,
`corp-lan`, and `same-site` policy tags, and peer cache can use the best
candidate address for warm-peer health probes. This supports corporate-network
cluster segments without enabling service payload forwarding.

C17Z8 adds a node-local peer connection state machine on top of warm-peer
health probes. Warm peers move through `disconnected`, `connecting`, `ready`,
`degraded`, and `backoff`; repeated probe failures enter bounded backoff, and
successful probes recover to `ready`. Mesh-link observations include
metadata-only connection state. This is not a persistent socket/session manager
and does not forward service payloads.

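A minimal sketch of the C17Z8 state machine follows. The five state names come
from the text above; the transition rules and the threshold of three
consecutive failures before `backoff` are assumptions chosen for illustration,
and the bounded backoff timer itself is left out.

```go
package main

type peerState string

const (
	stateDisconnected peerState = "disconnected"
	stateConnecting   peerState = "connecting"
	stateReady        peerState = "ready"
	stateDegraded     peerState = "degraded"
	stateBackoff      peerState = "backoff"
)

// backoffAfterFailures is an assumed threshold, not a documented value.
const backoffAfterFailures = 3

// peerConn tracks the health state of one warm peer.
type peerConn struct {
	State    peerState
	Failures int // consecutive probe failures
}

// StartProbe marks a disconnected peer as connecting before a probe is sent.
func (p *peerConn) StartProbe() {
	if p.State == stateDisconnected {
		p.State = stateConnecting
	}
}

// Observe applies one probe result. Any success recovers to ready; failures
// degrade a previously ready peer and eventually enter backoff.
func (p *peerConn) Observe(success bool) {
	if success {
		p.State, p.Failures = stateReady, 0
		return
	}
	p.Failures++
	switch {
	case p.Failures >= backoffAfterFailures:
		p.State = stateBackoff
	case p.State == stateReady || p.State == stateDegraded:
		p.State = stateDegraded
	default:
		p.State = stateDisconnected
	}
}
```

Keeping the transitions in one method makes it straightforward to attach the
resulting state to a mesh-link observation as metadata.
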
C17Z9 adds a node-local peer recovery planner. The node targets a bounded
|
|
stable ready-peer set, defaulting to three connectable peers when available,
|
|
instead of probing every known cluster node. When ready peers fall below target,
|
|
the planner selects bounded recovery probes from warm peers, recovery seeds,
|
|
and other connectable scoped peers, skipping active backoff entries. Heartbeats
|
|
include metadata-only `mesh_peer_recovery_report` state. This is not persistent
|
|
connection transport, NAT traversal, relay/rendezvous runtime, or service
|
|
payload forwarding.
|
|
|
|
C17Z10 adds a node-local peer connection intent planner over the C17Z9 recovery
|
|
plan. It classifies bounded peer work as `maintain`, `probe`, or `recover`,
|
|
and classifies transport readiness as `direct`, `private_lan`,
|
|
`corporate_lan`, `outbound_only`, or `relay_required`. Heartbeats include
|
|
metadata-only `mesh_peer_connection_intent_report` counts. This is not
|
|
persistent connection transport, STUN/TURN/ICE, NAT traversal, relay runtime,
|
|
or service payload forwarding.
|
|
|
|
C17Z11 adds the first real node-local peer connection manager for mesh
|
|
control-plane health. It uses a reusable QUIC fabric transport to probe
|
|
direct/private/corporate peer endpoints selected by C17Z10 intents, updates
|
|
the shared peer connection tracker, and records `waiting_rendezvous` for
|
|
outbound-only or relay-required peers. Heartbeats include metadata-only
|
|
`mesh_peer_connection_manager_report` state. This is not STUN/TURN/ICE,
|
|
relay/rendezvous runtime, route lease generation, VPN runtime, or service
|
|
payload forwarding.
|
|
|
|
C17Z12 adds a node-scoped rendezvous/relay control-plane lease contract for
|
|
peers that would otherwise remain `waiting_rendezvous`. The agent consumes
|
|
`rendezvous_leases`, resolves matching intents into `relay_quic`, probes the
|
|
relay node over QUIC fabric live probe, and records `relay_ready` for the peer control
|
|
path. This remains control-plane health only and does not enable RDP/VPN/file/
|
|
video/service payload forwarding, arbitrary relay packet forwarding,
|
|
STUN/TURN/ICE, or host networking changes.
|
|
|
|
C17Z13 adds heartbeat telemetry for rendezvous lease admission and renewal
|
|
posture. The agent emits `mesh_rendezvous_lease_report` with local role,
|
|
relay/peer admission counts, TTL, renewal-after time, renewal-needed status,
|
|
`relay_ready`, and explicit no-payload boundary flags. This remains
|
|
metadata-only control-plane telemetry and does not enable service payload
|
|
forwarding.
|
|
|
|
C17Z14 adds a control-plane refresh contract for rendezvous leases. When a
|
|
lease is renewal-needed, expired, invalid, or tied to a stale relay state, the
|
|
agent reloads node-scoped synthetic config from Control Plane, updates the
|
|
running peer cache/route/lease state, and reports refresh counters plus stale
|
|
relay withdrawal/reselection fields. This remains control-plane health only
|
|
and does not enable service payload forwarding.
|
|
|
|
C17Z15 adds the node side of backend relay replacement policy. The agent
|
|
advertises the relay replacement contract capability and emits
|
|
`c17z15.mesh_rendezvous_lease_report.v1`; stale relay state is matched to the
|
|
exact rendezvous lease/relay when that metadata is present, so an alternate
|
|
replacement lease for the same peer is not treated as stale by association.
|
|
This remains control-plane health only and does not enable service payload
|
|
forwarding.
|
|
|
|
C17Z16 adds route/path decision reporting. The agent consumes
|
|
`route_path_decisions` from Control Plane synthetic config, keeps the latest
|
|
control-plane generation in local state, and emits
|
|
`c17z18.mesh_route_path_decision_report.v1` with effective hops, previous/next
|
|
hop, selected replacement relay, generation, and no-payload boundary flags.
|
|
This remains metadata-only route planning and does not enable service payload
|
|
forwarding.
|
|
|
|
C17Z17 adds node-side route generation tracking for Control Plane
`route_path_decisions`. The agent emits
`c17z18.mesh_route_generation_report.v1` with active, applied, unchanged, and
withdrawn decision counts, total counters, generation change state, active
decision details, and withdrawn decision details. When the first observed
config already contains a stale relay replacement, the tracker emits a
`withdrawn_by_replacement` record for the old relay path. This remains
metadata-only route planning and does not enable service payload forwarding.

C17Z18 applies Control Plane `route_path_decisions` to synthetic route-health
route config only. The agent keeps base routes separate from route-health
routes, periodically refreshes scoped config, emits
`c17z18.mesh_route_health_config_report.v1`, and reports route-health
observations with expected/observed hops and drift status. This probes
replacement relay effective paths for control-plane health only and does not
enable service payload forwarding.

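The expected/observed hop comparison behind the drift status can be sketched
as below. The status strings (`in_sync`, `drifted`, `unobserved`) and the
`driftStatus` helper are hypothetical; the real report may use different
values.

```go
package main

import "fmt"

// driftStatus compares the hop list expected from a route path decision with
// the hop list actually observed by a synthetic route-health probe.
// The returned status strings are illustrative placeholders.
func driftStatus(expected, observed []string) string {
	if len(observed) == 0 {
		return "unobserved"
	}
	if len(expected) != len(observed) {
		return "drifted"
	}
	for i := range expected {
		if expected[i] != observed[i] {
			return "drifted"
		}
	}
	return "in_sync"
}

func main() {
	expected := []string{"node-a", "node-r2", "node-b"}
	fmt.Println(driftStatus(expected, []string{"node-a", "node-r2", "node-b"}))
	fmt.Println(driftStatus(expected, []string{"node-a", "node-r", "node-b"}))
	fmt.Println(driftStatus(expected, nil))
}
```
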
C17Z21 defines the portable inbound listener contract for Docker, Linux
service, Windows service, and future OS-specific node packages. The node-agent
does not stop when the mesh listen port cannot be bound. It keeps the outbound
Control Plane session alive and emits `c17z21.fabric_listener_report.v1` in
heartbeat metadata with configured address, effective address, listen mode,
listener status, inbound reachability, one-way connectivity, failure reason,
and port-conflict diagnostics.

`RAP_FABRIC_LISTEN_PORT_MODE` controls behavior:

- `manual`: bind exactly `RAP_FABRIC_LISTEN_ADDR`; on conflict report
  `listen_failed` and wait for an operator/config change.
- `auto`: try `RAP_FABRIC_LISTEN_ADDR`; on conflict scan
  `RAP_FABRIC_LISTEN_AUTO_PORT_START..RAP_FABRIC_LISTEN_AUTO_PORT_END` and report
  `auto_rebound` when a free port is selected.
- `disabled`: do not open an inbound listener; the node is expected to be
  outbound-only, relay/rendezvous, or Control Plane only.

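The three modes can be sketched as a single resolution function. This is an
illustration under assumptions, not the agent's implementation:
`resolveListener`, the injected `bind` callback, and the `listening` and
`disabled` status strings are hypothetical, while `listen_failed` and
`auto_rebound` come from the contract above. The bind attempt is injected so
the logic can be exercised without opening real sockets.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
)

// errInUse stands in for a real port-conflict error in this sketch.
var errInUse = errors.New("address already in use")

// resolveListener returns the effective listen address and a listener status
// for the given port mode. bind reports whether an address can be bound.
func resolveListener(mode, addr string, start, end int, bind func(string) error) (string, string) {
	if mode == "disabled" {
		return "", "disabled"
	}
	if bind(addr) == nil {
		return addr, "listening"
	}
	if mode != "auto" {
		// manual: report the failure and wait for an operator/config change.
		return "", "listen_failed"
	}
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return "", "listen_failed"
	}
	// auto: scan the configured range for the first free port.
	for port := start; port <= end; port++ {
		candidate := net.JoinHostPort(host, strconv.Itoa(port))
		if bind(candidate) == nil {
			return candidate, "auto_rebound"
		}
	}
	return "", "listen_failed"
}

func main() {
	busy := map[string]bool{"0.0.0.0:19001": true, "0.0.0.0:19100": true}
	bind := func(addr string) error {
		if busy[addr] {
			return errInUse
		}
		return nil
	}
	for _, mode := range []string{"manual", "auto", "disabled"} {
		addr, status := resolveListener(mode, "0.0.0.0:19001", 19100, 19110, bind)
		fmt.Printf("%s: addr=%q status=%s\n", mode, addr, status)
	}
}
```

In a real agent the `bind` callback would attempt an actual UDP bind for the
QUIC endpoint; here it only consults a map of busy addresses.
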
For `RAP_MESH_CONNECTIVITY_MODE=outbound_only`, inbound listener failure is not
treated as node death. The heartbeat remains `healthy` with
`mesh_one_way_connectivity=true` and listener diagnostics. For direct/private
LAN modes, a listener failure degrades the node so the admin panel can show
that the node is alive but cannot accept inbound mesh traffic. Service payload
forwarding is still not enabled by this contract.

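As a rough sketch, the health decision depends on the connectivity mode and
whether the listener failed. The `heartbeatStatus` helper, the `degraded`
status string, and the `direct` mode name are assumptions made for
illustration; only `healthy`, `outbound_only`, and
`mesh_one_way_connectivity` appear in the contract.

```go
package main

import "fmt"

// heartbeatStatus returns the heartbeat status and the value of
// mesh_one_way_connectivity for a given connectivity mode and listener state.
// Assumption: one-way connectivity is only flagged for outbound_only nodes.
func heartbeatStatus(connectivityMode string, listenerFailed bool) (string, bool) {
	if !listenerFailed {
		return "healthy", false
	}
	if connectivityMode == "outbound_only" {
		// The node is alive and reachable through its outbound session only.
		return "healthy", true
	}
	// Direct/private LAN modes expect inbound reachability.
	return "degraded", false
}

func main() {
	for _, mode := range []string{"outbound_only", "direct"} {
		status, oneWay := heartbeatStatus(mode, true)
		fmt.Printf("%s: status=%s mesh_one_way_connectivity=%t\n", mode, status, oneWay)
	}
}
```
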
C17Z22 separates outbound Control Plane presence from inbound mesh
reachability. When synthetic mesh testing is enabled, every heartbeat includes
`c17z22.mesh_outbound_session_report.v1` with node-to-control-plane direction,
keepalive transport, listener conflict state, rendezvous/relay counters, a
`fabric_control_endpoint`, and a flag showing whether the current outbound
session can be used as a reverse control channel. This is the portable basis
for Docker, Linux service, Windows service, and future packages where a node
may be behind NAT or have no stable inbound address. It is still control-plane
telemetry only and does not carry RDP/VPN/service payload traffic.

C17Z24 separates the listener bind address from advertised mesh endpoints. The
agent never advertises loopback addresses discovered from the local listener;
`127.0.0.1`/`::1` are test-only bind details, not cluster reachability data.
When the listener is active, the agent enumerates active non-loopback host
interfaces and reports usable endpoint candidates with interface metadata,
address family, reachability, NAT/connectivity hints, and priority. Container
bridge/veth interfaces and link-local addresses are filtered by default, while
physical and VPN-style interfaces are kept so different cluster segments can
choose the address that matches their network. Operator-provided
`RAP_MESH_ADVERTISE_ENDPOINT` or endpoint-candidate JSON remains authoritative
and is ranked ahead of auto-discovered addresses.

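The filtering and ranking rules can be sketched as follows. Everything here
is hypothetical naming: the `Candidate` type, the `usable` and `rank` helpers,
and the interface-name prefixes used to detect container bridges. It also
assumes that a lower `Priority` value is preferred, which the contract does
not state.

```go
package main

import (
	"fmt"
	"net"
	"sort"
	"strings"
)

// Candidate is a hypothetical endpoint candidate.
type Candidate struct {
	Interface string
	Address   string
	Priority  int // assumption: lower value is preferred
}

// usable drops loopback and link-local addresses and interfaces whose names
// look like container bridges or veth pairs (prefix list is an assumption).
func usable(ifName, ip string) bool {
	parsed := net.ParseIP(ip)
	if parsed == nil || parsed.IsLoopback() || parsed.IsLinkLocalUnicast() {
		return false
	}
	for _, prefix := range []string{"docker", "br-", "veth"} {
		if strings.HasPrefix(ifName, prefix) {
			return false
		}
	}
	return true
}

// rank places operator-provided candidates first, unfiltered, followed by
// the usable auto-discovered candidates ordered by priority.
func rank(operator, discovered []Candidate) []Candidate {
	out := append([]Candidate{}, operator...)
	var kept []Candidate
	for _, c := range discovered {
		if usable(c.Interface, c.Address) {
			kept = append(kept, c)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].Priority < kept[j].Priority })
	return append(out, kept...)
}

func main() {
	operator := []Candidate{{Interface: "operator", Address: "203.0.113.20", Priority: 10}}
	discovered := []Candidate{
		{Interface: "lo", Address: "127.0.0.1", Priority: 1},
		{Interface: "docker0", Address: "172.17.0.1", Priority: 5},
		{Interface: "eth0", Address: "192.168.1.10", Priority: 20},
		{Interface: "eth0", Address: "fe80::1", Priority: 20},
		{Interface: "wg0", Address: "10.8.0.2", Priority: 15},
	}
	for _, c := range rank(operator, discovered) {
		fmt.Println(c.Interface, c.Address)
	}
}
```
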
C17Z25 adds per-peer endpoint fallback probing to the control-plane mesh
manager. A node no longer treats the top-ranked endpoint candidate as the only
possible address for a peer. For each warm direct/private/corporate peer, the
manager probes the ranked candidate list until one QUIC fabric endpoint
responds or all direct candidates fail. Heartbeat metadata includes
`c17z25.mesh_peer_connection_manager_report.v1` with `probe_results`,
`selected_candidate_id`, `selected_endpoint`, and per-candidate success/failure
details. This is still control-plane health and address selection telemetry; it
does not forward RDP/VPN/service payloads.

Scoped synthetic config shape:

```json
{
  "schema_version": "c17z18.synthetic.v1",
  "cluster_id": "cluster-1",
  "local_node_id": "node-a",
  "config_version": "config-v1",
  "peer_directory_version": "peers-v1",
  "policy_version": "policy-v1",
  "peer_endpoints": {
    "node-b": "quic://127.0.0.1:19002"
  },
  "peer_endpoint_candidates": {
    "node-b": [
      {
        "endpoint_id": "node-b-public",
        "node_id": "node-b",
        "transport": "direct_quic",
        "address": "203.0.113.20:443",
        "reachability": "public",
        "nat_type": "restricted",
        "connectivity_mode": "direct",
        "priority": 10
      }
    ]
  },
  "routes": [],
  "route_path_decisions": {
    "schema_version": "c17z18.route_path_decisions.v1",
    "decisions": []
  }
}
```

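For orientation, the config shape above could be decoded in Go roughly as
follows. The struct and function names are hypothetical and are not taken from
the agent source; only the JSON field names come from the example. Route and
decision entries are kept as raw JSON because their shape is not shown here,
and the `local_node_id` check is an assumed validation rule.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// EndpointCandidate mirrors one entry of peer_endpoint_candidates.
type EndpointCandidate struct {
	EndpointID       string `json:"endpoint_id"`
	NodeID           string `json:"node_id"`
	Transport        string `json:"transport"`
	Address          string `json:"address"`
	Reachability     string `json:"reachability"`
	NATType          string `json:"nat_type"`
	ConnectivityMode string `json:"connectivity_mode"`
	Priority         int    `json:"priority"`
}

// RoutePathDecisions keeps decisions opaque in this sketch.
type RoutePathDecisions struct {
	SchemaVersion string            `json:"schema_version"`
	Decisions     []json.RawMessage `json:"decisions"`
}

// SyntheticConfig mirrors the top-level scoped synthetic config.
type SyntheticConfig struct {
	SchemaVersion          string                         `json:"schema_version"`
	ClusterID              string                         `json:"cluster_id"`
	LocalNodeID            string                         `json:"local_node_id"`
	ConfigVersion          string                         `json:"config_version"`
	PeerDirectoryVersion   string                         `json:"peer_directory_version"`
	PolicyVersion          string                         `json:"policy_version"`
	PeerEndpoints          map[string]string              `json:"peer_endpoints"`
	PeerEndpointCandidates map[string][]EndpointCandidate `json:"peer_endpoint_candidates"`
	Routes                 []json.RawMessage              `json:"routes"`
	RoutePathDecisions     RoutePathDecisions             `json:"route_path_decisions"`
}

// parseConfig decodes the config and applies one assumed sanity check.
func parseConfig(data []byte) (SyntheticConfig, error) {
	var cfg SyntheticConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, err
	}
	if cfg.LocalNodeID == "" {
		return cfg, errors.New("local_node_id is required")
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "c17z18.synthetic.v1",
	  "local_node_id": "node-a",
	  "peer_endpoints": {"node-b": "quic://127.0.0.1:19002"},
	  "routes": [],
	  "route_path_decisions": {"schema_version": "c17z18.route_path_decisions.v1", "decisions": []}
	}`)
	cfg, err := parseConfig(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(cfg.LocalNodeID, cfg.PeerEndpoints["node-b"])
}
```
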
## C17E Live Synthetic Smoke

Run:

```powershell
cd agents\rap-node-agent
go run .\cmd\mesh-live-smoke
```

Expected:

- scoped synthetic config loads
- direct `node-a -> node-b` synthetic probe succeeds
- relay `node-a -> node-r -> node-b` synthetic probe succeeds
- bounded `synthetic.echo` test-service succeeds
- `production_forwarding=false`

## Safety Rules

- The agent never assigns roles to itself.
- The agent reports capabilities only.
- Platform policy assigns roles.
- No RDP/VPN/production service traffic is carried by the C17A-C17Z22 staged
  mesh runtime.
- Production forwarding remains disabled by default and limited to
  `fabric.control` when explicitly enabled.
- No privileged operations are performed by the current agent.