Record project continuation changes
@@ -66,6 +66,11 @@ Implemented:
- synthetic route-health route config refresh from Control Plane path
  decisions
- route-health expected/observed effective path drift reporting
- host-agent Docker update plan executor with artifact checksum/size
  verification, container replacement, health check, status reporting, and
  rollback attempt
- host-agent update loop for service/timer placement
- host-agent binary self-update loop for the updater service itself
- maximum capacity guard for the local production observation sink
- panic-safe fail-closed production envelope observation wrapper
- explicit `4096` byte payload boundary for validated production
@@ -98,7 +103,7 @@ Not implemented yet:
- VPN runtime
- production workload supervision
- certificate issuance/rotation
-- updater runtime
+- in-agent native updater runtime
- privileged host route/firewall control

## Build
@@ -107,9 +112,237 @@ Not implemented yet:
cd agents\rap-node-agent
go test ./...
go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent
go build -buildvcs=false -o bin\rap-host-agent.exe .\cmd\rap-host-agent
go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke
```

## Docker Host Agent Bootstrap

`rap-host-agent` is the first host-level installer/updater boundary for Docker
placement. It does not join the mesh itself. It applies the cluster's install
intent locally by running the `rap-node-agent` container with a persistent host
state directory. On Linux it also installs a systemd `update-loop` service by
default, so nodes continue to update from Control Plane policy without operator
commands on each host.

Preferred profile-based install:

```bash
rap-host-agent install \
  --profile-url https://control.example.com/api/v1 \
  --cluster-id <cluster_id> \
  --install-token <one_time_install_token> \
  --node-name docker-node-1
```

The host-agent exchanges the install token for a signed control-plane install
profile, then applies Docker image, container, state-dir, mesh listen,
advertise, NAT/connectivity, and region settings from that profile. The same
token is then used by the node-agent for first enrollment, so the operator does
not need to manually pass cluster/runtime flags.

Manual install is still supported:

```bash
rap-host-agent install \
  --backend-url http://192.168.200.61:18080/api/v1 \
  --cluster-id <cluster_id> \
  --join-token <raw_join_token> \
  --node-name docker-node-1 \
  --image rap-node-agent:dev-enrollment-bootstrap-smoke \
  --container-name rap-node-agent-docker-node-1 \
  --state-dir /var/lib/rap/nodes/docker-node-1 \
  --network host \
  --replace
```

The command creates or replaces only the local Docker container. The running
node-agent submits the join request, waits for owner approval, stores its
identity in the mounted state directory, and then sends heartbeats. Re-running
with `--replace` updates the container while preserving node identity. Pass
`--auto-update-enabled=false` only for lab/debug installs where the local
systemd updater must not be registered.

Useful checks:

```bash
rap-host-agent status --container-name rap-node-agent-docker-node-1
docker logs -f rap-node-agent-docker-node-1
```

For a node that was installed before the updater existed, register only the
local updater service without recreating the node-agent container:

```bash
rap-host-agent install-updater \
  --backend-url http://192.168.200.61:18080/api/v1 \
  --cluster-id <cluster_id> \
  --state-dir /var/lib/rap/nodes/docker-node-1 \
  --container-name rap-node-agent-docker-node-1
```

## Docker Host Agent Updates

`rap-host-agent update` applies one Control Plane update plan for an already
enrolled Docker node. The host-agent fetches the plan, downloads the selected
Docker image tar, verifies size and sha256, loads the image, recreates the
node-agent container from the existing Docker runtime settings, checks that the
container is running, and reports update phases back to the Control Plane.

```bash
rap-host-agent update \
  --backend-url http://192.168.200.61:18080/api/v1 \
  --cluster-id <cluster_id> \
  --node-id <node_id> \
  --container-name rap-node-agent-docker-node-1 \
  --current-version 0.1.0-c17z26
```

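The size and sha256 check described above could look roughly like the Go sketch below. It is an illustration only, not the host-agent's actual code: `verifyArtifact`, `sha256Hex`, and `writeTempArtifact` are hypothetical names, and the expected size and digest are assumed to come from the update plan.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// sha256Hex returns the lowercase hex SHA-256 digest of data.
func sha256Hex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// verifyArtifact streams the file through SHA-256 and compares both the byte
// count and the digest with the expected values (assumed to be taken from the
// update plan). Hypothetical helper, not the real host-agent API.
func verifyArtifact(path string, wantSize int64, wantSHA256 string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	h := sha256.New()
	n, err := io.Copy(h, f)
	if err != nil {
		return err
	}
	if n != wantSize {
		return fmt.Errorf("size mismatch: got %d bytes, want %d", n, wantSize)
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, wantSHA256) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}
	return nil
}

// writeTempArtifact is a demo helper that stores data in a temporary file.
func writeTempArtifact(data []byte) (string, error) {
	f, err := os.CreateTemp("", "artifact-*.tar")
	if err != nil {
		return "", err
	}
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return "", err
	}
	return f.Name(), nil
}

func main() {
	data := []byte("image-bytes")
	path, err := writeTempArtifact(data)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.Remove(path)
	fmt.Println("matching plan:", verifyArtifact(path, int64(len(data)), sha256Hex(data)))
	fmt.Println("wrong size:   ", verifyArtifact(path, 1, sha256Hex(data)))
}
```

Checking the size before comparing digests gives a cheaper and clearer error for truncated downloads, while the digest comparison catches corruption that leaves the length unchanged.
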
`rap-host-agent update-loop` is the per-node executor and health boundary. It
does not need to poll for normal releases: the node-agent receives an
`rap.node_update_hint.v1` subscription hint from Control Plane or the assigned
update-cache service during heartbeat, writes `<state-dir>/update-trigger.json`,
and the host-agent wakes immediately. The interval is an emergency fallback for
missed hints, service migration, or a dead update-cache service; keep it long
in production. The loop keeps running after transient errors by default and
advances its in-process current version after a successful update so it does
not repeatedly apply the same plan. When started without `--node-id` it reads
`<state-dir>/identity.json` and waits until the approved node identity appears,
which lets the updater service start immediately during first install. It also
persists the last applied node-agent version in
`<state-dir>/host-update-state.json` so a service restart does not reapply an
already-installed release.

```bash
rap-host-agent update-loop \
  --backend-url http://192.168.200.61:18080/api/v1 \
  --cluster-id <cluster_id> \
  --node-id <node_id> \
  --container-name rap-node-agent-docker-node-1 \
  --current-version 0.1.0-c17z26 \
  --interval-seconds 21600 \
  --jitter 0.15
```

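To illustrate how `--interval-seconds` and `--jitter` might combine, here is a minimal Go sketch. It assumes the jitter value is a symmetric fraction, so `0.15` spreads each wait across roughly ±15% of the base interval; the README does not state the exact formula, and `jitteredInterval` is a hypothetical name.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval scales the base interval by a random factor in
// [1-jitter, 1+jitter). The symmetric interpretation of the jitter fraction
// is an assumption made for this sketch. rnd must return values in [0, 1).
func jitteredInterval(baseSeconds int64, jitter float64, rnd func() float64) time.Duration {
	base := time.Duration(baseSeconds) * time.Second
	if jitter <= 0 {
		return base
	}
	if jitter > 1 {
		jitter = 1
	}
	// Map rnd() from [0, 1) to [-jitter, +jitter).
	factor := 1 + jitter*(2*rnd()-1)
	return time.Duration(float64(base) * factor)
}

func main() {
	// With the flags from the example above, every wait lands between
	// about 5.1 and 6.9 hours.
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredInterval(21600, 0.15, rand.Float64))
	}
}
```

Spreading the fallback timer this way keeps a fleet of nodes from contacting the Control Plane at the same instant after a shared outage.
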
Update-cache nodes are ordinary cluster nodes with the `update-cache` role.
Control Plane assigns a healthy update-cache node in the heartbeat hint. If the
assigned service disappears, the next hint returns `control_plane_fallback` or a
new service assignment; the local updater stays subscribed and only uses the
long fallback timer as a last resort.

`rap-host-agent update-host-agent-loop` updates the host-agent binary itself.
Only one global systemd unit is installed per Docker host:
`rap-host-agent-self-updater.service`. It uses one approved local node identity
to ask Control Plane for product `rap-host-agent` with install type
`linux_binary`, verifies the downloaded binary size and sha256, atomically
replaces `/usr/local/bin/rap-host-agent`, and reports status. The already
running process continues until systemd restarts it, while new invocations use
the new binary.

```bash
rap-host-agent update-host-agent-loop \
  --backend-url http://192.168.200.61:18080/api/v1 \
  --cluster-id <cluster_id> \
  --state-dir /var/lib/rap/nodes/docker-node-1 \
  --binary-path /usr/local/bin/rap-host-agent
```

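An atomic replacement like the one described above is commonly done by writing a temporary file next to the target and renaming it into place. The Go sketch below shows that pattern on Linux; it is an assumption about the technique, not the host-agent's confirmed implementation, and `replaceFileAtomically` and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceFileAtomically writes data to a temporary file in the same directory
// as target and renames it over target. On POSIX filesystems a rename within
// one filesystem is atomic, so a new invocation sees either the old or the
// new binary, never a partial write.
func replaceFileAtomically(target string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), filepath.Base(target)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // fails harmlessly after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush to disk before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, target)
}

// demo replaces a stand-in binary inside a throwaway directory and returns
// the content found at the target path afterwards.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "self-update-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "rap-host-agent")
	if err := os.WriteFile(target, []byte("old-binary"), 0o755); err != nil {
		return "", err
	}
	if err := replaceFileAtomically(target, []byte("new-binary"), 0o755); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	content, err := demo()
	fmt.Println(content, err)
}
```

Because the running process keeps its already-open executable, the rename does not disturb it, which matches the behavior described above where the old process continues until systemd restarts it.
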
## Windows Host Agent Bootstrap And Updates

Windows uses the same Control Plane install profile, but the local placement is
a Scheduled Task instead of Docker. In `--startup-mode auto` the installer first
tries an elevated `ONSTART` task running as `SYSTEM`; without admin rights it
falls back to a per-user `ONLOGON` task. The `ONSTART` mode starts after reboot
without an interactive user session. The `ONLOGON` fallback can only start after
that Windows user signs in.

```cmd
powershell -NoProfile -ExecutionPolicy Bypass -Command "Invoke-WebRequest -UseBasicParsing 'http://control.example.com/downloads/rap-host-agent-windows-amd64.exe' -OutFile $env:TEMP\rap-host-agent.exe"
%TEMP%\rap-host-agent.exe install-windows --profile-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --install-token "<one_time_install_token>" --node-name "office-win-1" --startup-mode "auto"
```

`install-windows` installs two tasks:

- `RAP Node Agent <node>` runs `rap-node-agent.exe`.
- `RAP Host Agent Updater <node>` runs `rap-host-agent update-loop` for product
  `rap-node-agent`, install type `windows_service`, and replaces the local
  `rap-node-agent.exe` from signed release artifacts.

During first bootstrap the updater can read `<state-dir>\identity.json` and
will wait until the join request is approved. For an already-enrolled Windows
node, prefer passing `--node-id` explicitly. That makes the updater wrapper
independent from the local identity file location and is required for repair of
older Windows installs where the node is already heartbeat-healthy but the
host-agent updater has no usable identity file.

```cmd
%TEMP%\rap-host-agent.exe install-windows --backend-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --node-id "<node_id>" --node-name "office-win-1" --replace --startup-mode "auto" --auto-update-current-version "<current_version>"
```

The admin UI node details page generates a downloadable
`rap-repair-updater-<node>.cmd` for this repair path. It performs these steps:

- prints `schtasks /Query` diagnostics for the node-agent and updater tasks;
- prints the local `rap-*.exe*` files;
- downloads the current `rap-host-agent.exe`;
- reinstalls the Windows updater wrapper with `--node-id`;
- runs a foreground one-shot `update-loop --max-runs 1`;
- applies `rap-host-agent.exe.next` if the running host-agent could not replace
  itself;
- restarts `RAP Host Agent Updater <node>`;
- prints post-repair diagnostics.

Expected successful updater reports in the admin panel:

```text
rap-node-agent <target> -> <target> plan/noop
rap-host-agent <target> -> <target> plan/noop
```

If the latest host-agent report is `apply/staged`, the new host-agent binary
was downloaded as `rap-host-agent.exe.next` but the running process still held
the old executable. End the updater task and run it again, or rerun the
generated repair command:

```cmd
schtasks /End /TN "RAP Host Agent Updater office-win-1"
schtasks /Run /TN "RAP Host Agent Updater office-win-1"
```

### Windows Reboot / Autostart Verification

After installation or repair, verify the service survives a reboot:

1. Reboot the Windows host, or at minimum restart both scheduled tasks.
2. Confirm the tasks exist:

```cmd
schtasks /Query /TN "RAP Node Agent office-win-1" /V /FO LIST
schtasks /Query /TN "RAP Host Agent Updater office-win-1" /V /FO LIST
```

3. Confirm the admin panel shows:

```text
heartbeat: fresh
rap-node-agent: plan/noop
rap-host-agent: plan/noop
node version_state: current
```

Without admin rights, `install-windows --startup-mode auto` may fall back to
`user-task`. That node can still heartbeat and update after the user logs in,
but it will not start before logon after a reboot. Use an elevated shell for
production Windows nodes that must recover unattended.

Control Plane release artifacts for Windows must use:

- `product=rap-node-agent`
- `os=windows`
- `arch=amd64`
- `install_type=windows_service`
- `kind=binary`

## First Enrollment

Create a join token from the platform control plane, then run:
@@ -185,9 +418,18 @@ bounded `synthetic.echo` test-service runtime, and live synthetic HTTP endpoint.
It must not be used for RDP, VPN, file, video, or other production service
traffic.

-`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. While service runtime
-supervision is still a stub, the agent does not poll desired workloads or report
-workload status unless this flag is explicitly enabled.
+`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. When enabled, the agent
+polls node-scoped desired workloads and reports status. The current bounded
+runtime reports built-in `core-mesh` and `mesh-listener` services as running
+when enabled, supports the native built-in `synthetic.echo` test workload, and
+keeps unsupported production workloads such as RDP workers degraded until their
+supervisors are implemented.

For Remote Workspace/RDP integration work, the native `rdp-worker` desired
workload supports only an explicit `adapter_contract_probe` mode. That mode
reports the remote-workspace adapter channel contract and requires Fabric
Service Channel as the future data plane; it does not start FreeRDP, create a
remote session, or carry production RDP payloads.

`RAP_MESH_LISTEN_ADDR` starts the C17E/C17F/C17G synthetic HTTP endpoint only when
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true`. `RAP_MESH_SYNTHETIC_CONFIG` points to
@@ -423,6 +665,63 @@ observations with expected/observed hops and drift status. This probes
replacement relay effective paths for control-plane health only and does not
enable service payload forwarding.

C17Z21 defines the portable inbound listener contract for Docker, Linux
service, Windows service, and future OS-specific node packages. The node-agent
does not stop when the mesh listen port cannot be bound. It keeps the outbound
Control Plane session alive and emits `c17z21.mesh_listener_report.v1` in
heartbeat metadata with configured address, effective address, listen mode,
listener status, inbound reachability, one-way connectivity, failure reason,
and port-conflict diagnostics.

`RAP_MESH_LISTEN_PORT_MODE` controls behavior:

- `manual`: bind exactly `RAP_MESH_LISTEN_ADDR`; on conflict report
  `listen_failed` and wait for an operator/config change.
- `auto`: try `RAP_MESH_LISTEN_ADDR`; on conflict scan
  `RAP_MESH_LISTEN_AUTO_PORT_START..RAP_MESH_LISTEN_AUTO_PORT_END` and report
  `auto_rebound` when a free port is selected.
- `disabled`: do not open an inbound listener; the node is expected to be
  outbound-only, relay/rendezvous, or Control Plane only.

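The three port modes above can be sketched in Go as follows. This is a simplified illustration under stated assumptions: `openMeshListener`, `listenResult`, and `demo` are hypothetical names, the status strings `listening` and `disabled` are invented for the sketch (only `listen_failed` and `auto_rebound` appear in the README), and the scan range in the demo is arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
)

// listenResult carries the listener (nil when disabled or failed) and a
// status string for the heartbeat report.
type listenResult struct {
	Listener net.Listener
	Status   string
}

// openMeshListener applies the manual/auto/disabled rules. It never panics or
// exits on a bind conflict; the caller is expected to report the status.
func openMeshListener(mode, addr string, autoStart, autoEnd int) (listenResult, error) {
	switch mode {
	case "disabled":
		return listenResult{Status: "disabled"}, nil
	case "manual", "auto":
		// handled below
	default:
		return listenResult{}, fmt.Errorf("unknown listen port mode %q", mode)
	}

	ln, err := net.Listen("tcp", addr)
	if err == nil {
		return listenResult{Listener: ln, Status: "listening"}, nil
	}
	if mode == "manual" {
		return listenResult{Status: "listen_failed"}, err
	}

	// auto: keep the configured host and scan the fallback port range.
	host, _, splitErr := net.SplitHostPort(addr)
	if splitErr != nil {
		return listenResult{Status: "listen_failed"}, splitErr
	}
	for port := autoStart; port <= autoEnd; port++ {
		ln, err = net.Listen("tcp", net.JoinHostPort(host, strconv.Itoa(port)))
		if err == nil {
			return listenResult{Listener: ln, Status: "auto_rebound"}, nil
		}
	}
	return listenResult{Status: "listen_failed"}, errors.New("no free port in auto range")
}

// demo occupies a loopback port, then shows how each mode reacts to it.
func demo() (manualStatus, autoStatus string) {
	busy, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "setup_failed", "setup_failed"
	}
	defer busy.Close()
	addr := busy.Addr().String()

	m, _ := openMeshListener("manual", addr, 0, 0)
	a, _ := openMeshListener("auto", addr, 20000, 20100)
	if a.Listener != nil {
		a.Listener.Close()
	}
	return m.Status, a.Status
}

func main() {
	m, a := demo()
	fmt.Println("manual on busy port:", m)
	fmt.Println("auto on busy port:  ", a)
}
```
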
For `RAP_MESH_CONNECTIVITY_MODE=outbound_only`, inbound listener failure is not
treated as node death. The heartbeat remains `healthy` with
`mesh_one_way_connectivity=true` and listener diagnostics. For direct/private
LAN modes, a listener failure degrades the node so the admin panel can show
that the node is alive but cannot accept inbound mesh traffic. Service payload
forwarding is still not enabled by this contract.

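The health rule in the preceding paragraph reduces to a small decision function. The Go sketch below is illustrative only: `classifyHeartbeat` and `heartbeatView` are hypothetical names, the literal status string `degraded` and the mode name `direct` are assumptions, and the one-way flag is assumed to be false whenever the listener works.

```go
package main

import "fmt"

// heartbeatView is a hypothetical projection of the heartbeat fields that the
// paragraph above mentions.
type heartbeatView struct {
	Status                 string
	MeshOneWayConnectivity bool
}

// classifyHeartbeat maps connectivity mode and listener state to a heartbeat
// status. Only outbound_only tolerates a failed inbound listener.
func classifyHeartbeat(connectivityMode string, listenerOK bool) heartbeatView {
	if listenerOK {
		return heartbeatView{Status: "healthy"}
	}
	if connectivityMode == "outbound_only" {
		// Alive and usable through its outbound session only.
		return heartbeatView{Status: "healthy", MeshOneWayConnectivity: true}
	}
	// Direct/private LAN modes: alive, but cannot accept inbound mesh traffic.
	return heartbeatView{Status: "degraded"}
}

func main() {
	fmt.Printf("%+v\n", classifyHeartbeat("outbound_only", false))
	fmt.Printf("%+v\n", classifyHeartbeat("direct", false))
	fmt.Printf("%+v\n", classifyHeartbeat("direct", true))
}
```
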
C17Z22 separates outbound Control Plane presence from inbound mesh
reachability. When synthetic mesh testing is enabled, every heartbeat includes
`c17z22.mesh_outbound_session_report.v1` with node-to-control-plane direction,
keepalive transport, listener conflict state, rendezvous/relay counters, and a
flag showing whether the current outbound session can be used as a reverse
control-channel contract. This is the portable basis for Docker, Linux service,
Windows service, and future packages where a node may be behind NAT or have no
stable inbound address. It is still control-plane telemetry only and does not
carry RDP/VPN/service payload traffic.

C17Z24 separates the listener bind address from advertised mesh endpoints. The
agent never advertises loopback addresses discovered from the local listener;
`127.0.0.1`/`::1` are test-only bind details, not cluster reachability data.
When the listener is active, the agent enumerates active non-loopback host
interfaces and reports usable endpoint candidates with interface metadata,
address family, reachability, NAT/connectivity hints, and priority. Container
bridge/veth interfaces and link-local addresses are filtered by default, while
physical and VPN-style interfaces are kept so different cluster segments can
choose the address that matches their network. Operator-provided
`RAP_MESH_ADVERTISE_ENDPOINT` or endpoint-candidate JSON remains authoritative
and is ranked ahead of auto-discovered addresses.

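The filtering and ranking rules above might be expressed as in the Go sketch below. The types and function names are hypothetical, and the interface name prefixes used to recognize container bridge/veth devices are an assumption; the real agent may classify interfaces differently.

```go
package main

import (
	"fmt"
	"net/netip"
	"sort"
	"strings"
)

// candidate is a hypothetical endpoint candidate with minimal metadata.
type candidate struct {
	Interface        string
	Addr             netip.Addr
	OperatorProvided bool
}

// bridgePrefixes lists assumed name prefixes of container bridge/veth devices.
var bridgePrefixes = []string{"docker", "br-", "veth", "cni"}

// cand is a small constructor used by the demo.
func cand(iface, addr string, operator bool) candidate {
	return candidate{Interface: iface, Addr: netip.MustParseAddr(addr), OperatorProvided: operator}
}

// advertisable reports whether an auto-discovered candidate may be advertised.
func advertisable(c candidate) bool {
	if c.Addr.IsLoopback() || c.Addr.IsLinkLocalUnicast() || c.Addr.IsUnspecified() {
		return false
	}
	for _, p := range bridgePrefixes {
		if strings.HasPrefix(c.Interface, p) {
			return false
		}
	}
	return true
}

// rankCandidates keeps operator-provided entries unconditionally, filters the
// auto-discovered ones, and places operator-provided entries first while
// preserving the input order within each group.
func rankCandidates(in []candidate) []candidate {
	var out []candidate
	for _, c := range in {
		if c.OperatorProvided || advertisable(c) {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].OperatorProvided && !out[j].OperatorProvided
	})
	return out
}

func main() {
	ranked := rankCandidates([]candidate{
		cand("lo", "127.0.0.1", false),
		cand("eth0", "192.168.200.61", false),
		cand("docker0", "172.17.0.1", false),
		cand("eth0", "fe80::1", false),
		cand("wg0", "10.8.0.2", false),
		cand("operator", "203.0.113.10", true),
	})
	for _, c := range ranked {
		fmt.Println(c.Interface, c.Addr)
	}
}
```
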
C17Z25 adds per-peer endpoint fallback probing to the control-plane mesh
manager. A node no longer treats the top-ranked endpoint candidate as the only
possible address for a peer. For each warm direct/private/corporate peer, the
manager probes the ranked candidate list until one `/mesh/v1/health` endpoint
responds or all direct candidates fail. Heartbeat metadata includes
`c17z25.mesh_peer_connection_manager_report.v1` with `probe_results`,
`selected_candidate_id`, `selected_endpoint`, and per-candidate success/failure
details. This is still control-plane health and address selection telemetry; it
does not forward RDP/VPN/service payloads.

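The fallback probing described above can be sketched as a loop over ranked candidates that stops at the first healthy one. This Go example is a simplified illustration: the type and function names are hypothetical, the struct fields only mirror the report keys named above, and treating HTTP 200 as "responds" is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// peerCandidate is one ranked endpoint candidate for a peer.
type peerCandidate struct{ ID, Endpoint string }

// probeResult records the outcome for a single candidate.
type probeResult struct {
	CandidateID string
	Endpoint    string
	OK          bool
	Detail      string
}

// probeReport mirrors probe_results, selected_candidate_id, and
// selected_endpoint from the heartbeat report.
type probeReport struct {
	SelectedCandidateID string
	SelectedEndpoint    string
	ProbeResults        []probeResult
}

// probePeer tries candidates in rank order and stops at the first one whose
// /mesh/v1/health endpoint answers with 200.
func probePeer(client *http.Client, candidates []peerCandidate) probeReport {
	var report probeReport
	for _, c := range candidates {
		res := probeResult{CandidateID: c.ID, Endpoint: c.Endpoint}
		resp, err := client.Get(c.Endpoint + "/mesh/v1/health")
		switch {
		case err != nil:
			res.Detail = err.Error()
		case resp.StatusCode != http.StatusOK:
			res.Detail = resp.Status
			resp.Body.Close()
		default:
			res.OK = true
			resp.Body.Close()
		}
		report.ProbeResults = append(report.ProbeResults, res)
		if res.OK {
			report.SelectedCandidateID = c.ID
			report.SelectedEndpoint = c.Endpoint
			break
		}
	}
	return report
}

// demo probes a dead top-ranked endpoint followed by a live one.
func demo() probeReport {
	live := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/mesh/v1/health" {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
	defer live.Close()

	dead := httptest.NewServer(http.NotFoundHandler())
	deadURL := dead.URL
	dead.Close() // nothing listens on this address any more

	client := &http.Client{Timeout: 2 * time.Second}
	return probePeer(client, []peerCandidate{
		{ID: "public", Endpoint: deadURL},
		{ID: "lan", Endpoint: live.URL},
	})
}

func main() {
	report := demo()
	fmt.Println("selected:", report.SelectedCandidateID)
	for _, r := range report.ProbeResults {
		fmt.Println(r.CandidateID, r.OK, r.Detail)
	}
}
```
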
Scoped synthetic config shape:

```json
@@ -480,7 +779,7 @@ Expected:
- The agent never assigns roles to itself.
- The agent reports capabilities only.
- Platform policy assigns roles.
-- No RDP/VPN/production service traffic is carried by the C17A-C17Z18 staged
+- No RDP/VPN/production service traffic is carried by the C17A-C17Z22 staged
  mesh runtime.
- Production forwarding remains disabled by default and limited to
  `fabric.control` when explicitly enabled.