Record project continuation changes

2026-05-12 21:02:29 +03:00
parent 3059d1d7a3
commit 8f69d53193
339 changed files with 101111 additions and 1769 deletions
@@ -66,6 +66,11 @@ Implemented:
 - synthetic route-health route config refresh from Control Plane path
  decisions
 - route-health expected/observed effective path drift reporting
+- host-agent Docker update plan executor with artifact checksum/size
+  verification, container replacement, health check, status reporting, and
+  rollback attempt
+- host-agent update loop for service/timer placement
+- host-agent binary self-update loop for the updater service itself
 - maximum capacity guard for the local production observation sink
 - panic-safe fail-closed production envelope observation wrapper
 - explicit `4096` byte payload boundary for validated production
@@ -98,7 +103,7 @@ Not implemented yet:
 - VPN runtime
 - production workload supervision
 - certificate issuance/rotation
- updater runtime
+- in-agent native updater runtime
 - privileged host route/firewall control

 ## Build
@@ -107,9 +112,237 @@ Not implemented yet:
 cd agents\rap-node-agent
 go test ./...
 go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent
+go build -buildvcs=false -o bin\rap-host-agent.exe .\cmd\rap-host-agent
 go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke
 ```

+## Docker Host Agent Bootstrap
+
+`rap-host-agent` is the first host-level installer/updater boundary for Docker
+placement. It does not join the mesh itself. It applies the cluster's install
+intent locally by running the `rap-node-agent` container with a persistent host
+state directory. On Linux it also installs a systemd `update-loop` service by
+default, so nodes continue to update from Control Plane policy without operator
+commands on each host.
+
+Preferred profile-based install:
+
+```bash
+rap-host-agent install \
+  --profile-url https://control.example.com/api/v1 \
+  --cluster-id <cluster_id> \
+  --install-token <one_time_install_token> \
+  --node-name docker-node-1
+```
+
+The host-agent exchanges the install token for a signed control-plane install
+profile, then applies Docker image, container, state-dir, mesh listen,
+advertise, NAT/connectivity, and region settings from that profile. The same
+token is then used by the node-agent for first enrollment, so the operator does
+not need to manually pass cluster/runtime flags.
+
+Manual install is still supported:
+
+```bash
+rap-host-agent install \
+  --backend-url http://192.168.200.61:18080/api/v1 \
+  --cluster-id <cluster_id> \
+  --join-token <raw_join_token> \
+  --node-name docker-node-1 \
+  --image rap-node-agent:dev-enrollment-bootstrap-smoke \
+  --container-name rap-node-agent-docker-node-1 \
+  --state-dir /var/lib/rap/nodes/docker-node-1 \
+  --network host \
+  --replace
+```
+
+The command creates or replaces only the local Docker container. The running
+node-agent submits the join request, waits for owner approval, stores its
+identity in the mounted state directory, and then sends heartbeats. Re-running
+with `--replace` updates the container while preserving node identity. Pass
+`--auto-update-enabled=false` only for lab/debug installs where the local
+systemd updater must not be registered.
+
+Useful checks:
+
+```bash
+rap-host-agent status --container-name rap-node-agent-docker-node-1
+docker logs -f rap-node-agent-docker-node-1
+```
+
+For a node that was installed before the updater existed, register only the
+local updater service without recreating the node-agent container:
+
+```bash
+rap-host-agent install-updater \
+  --backend-url http://192.168.200.61:18080/api/v1 \
+  --cluster-id <cluster_id> \
+  --state-dir /var/lib/rap/nodes/docker-node-1 \
+  --container-name rap-node-agent-docker-node-1
+```
+
+## Docker Host Agent Updates
+
+`rap-host-agent update` applies one Control Plane update plan for an already
+enrolled Docker node. The host-agent fetches the plan, downloads the selected
+Docker image tar, verifies size and sha256, loads the image, recreates the
+node-agent container from the existing Docker runtime settings, checks that the
+container is running, and reports update phases back to the Control Plane.
+
+```bash
+rap-host-agent update \
+  --backend-url http://192.168.200.61:18080/api/v1 \
+  --cluster-id <cluster_id> \
+  --node-id <node_id> \
+  --container-name rap-node-agent-docker-node-1 \
+  --current-version 0.1.0-c17z26
+```
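
The size and sha256 check described above can be sketched in Go roughly as follows. This is a minimal illustration, not the host-agent's actual code: `verifyArtifact` and `verifyBytes` are hypothetical names, and the expected size and digest are assumed to come from the update plan.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifyArtifact streams r once, then compares the byte count and the
// SHA-256 digest against the expected values (assumed to come from the plan).
func verifyArtifact(r io.Reader, wantSize int64, wantSHA256 string) error {
	h := sha256.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return fmt.Errorf("read artifact: %w", err)
	}
	if n != wantSize {
		return fmt.Errorf("size mismatch: got %d bytes, want %d", n, wantSize)
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, wantSHA256) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory data (hypothetical helper).
func verifyBytes(data []byte, wantSize int64, wantSHA256 string) error {
	return verifyArtifact(bytes.NewReader(data), wantSize, wantSHA256)
}

func main() {
	payload := []byte("hello")
	const digest = "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"
	fmt.Println("valid artifact:", verifyBytes(payload, 5, digest))
	fmt.Println("wrong size:", verifyBytes(payload, 6, digest))
}
```

Checking the size before comparing digests gives a clearer error for truncated downloads. A real updater would presumably stream the image tar from disk rather than hold it in memory.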
+
+`rap-host-agent update-loop` is the per-node executor and health boundary. It
+does not need to poll for normal releases: the node-agent receives an
+`rap.node_update_hint.v1` subscription hint from Control Plane or the assigned
+update-cache service during heartbeat, writes `<state-dir>/update-trigger.json`,
+and the host-agent wakes immediately. The interval is an emergency fallback for
+missed hints, service migration, or a dead update-cache service; keep it long
+in production. The loop keeps running after transient errors by default and
+advances its in-process current version after a successful update so it does
+not repeatedly apply the same plan. When started without `--node-id` it reads
+`<state-dir>/identity.json` and waits until the approved node identity appears,
+which lets the updater service start immediately during first install. It also
+persists the last applied node-agent version in
+`<state-dir>/host-update-state.json` so a service restart does not reapply an
+already-installed release.
+
+```bash
+rap-host-agent update-loop \
+  --backend-url http://192.168.200.61:18080/api/v1 \
+  --cluster-id <cluster_id> \
+  --node-id <node_id> \
+  --container-name rap-node-agent-docker-node-1 \
+  --current-version 0.1.0-c17z26 \
+  --interval-seconds 21600 \
+  --jitter 0.15
+```
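
To illustrate how a fallback interval with jitter might be computed, here is a small Go sketch. It assumes `--jitter 0.15` means a symmetric spread of plus or minus 15% around `--interval-seconds`; the real semantics are not stated here, and `jitteredSeconds` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredSeconds spreads base by +/- jitter (assumed semantics).
// r is a random sample in [0, 1]; r = 0.5 leaves base unchanged.
func jitteredSeconds(base, jitter, r float64) float64 {
	return base * (1 + jitter*(2*r-1))
}

func main() {
	const base, jitter = 21600.0, 0.15
	for i := 0; i < 3; i++ {
		secs := jitteredSeconds(base, jitter, rand.Float64())
		wait := time.Duration(secs * float64(time.Second)).Round(time.Second)
		fmt.Println("next fallback check in", wait)
	}
}
```

Under this assumption a 21600 second interval lands between 18360 and 24840 seconds, which keeps many nodes from hitting Control Plane at the same moment after a shared outage.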
+
+Update-cache nodes are ordinary cluster nodes with the `update-cache` role.
+Control Plane assigns a healthy update-cache node in the heartbeat hint. If the
+assigned service disappears, the next hint returns `control_plane_fallback` or a
+new service assignment; the local updater stays subscribed and only uses the
+long fallback timer as a last resort.
+
+`rap-host-agent update-host-agent-loop` updates the host-agent binary itself.
+Only one global systemd unit is installed per Docker host:
+`rap-host-agent-self-updater.service`. It uses one approved local node identity
+to ask Control Plane for product `rap-host-agent` with install type
+`linux_binary`, verifies the downloaded binary size and sha256, atomically
+replaces `/usr/local/bin/rap-host-agent`, and reports status. The already
+running process continues until systemd restarts it, while new invocations use
+the new binary.
+
+```bash
+rap-host-agent update-host-agent-loop \
+  --backend-url http://192.168.200.61:18080/api/v1 \
+  --cluster-id <cluster_id> \
+  --state-dir /var/lib/rap/nodes/docker-node-1 \
+  --binary-path /usr/local/bin/rap-host-agent
+```
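
One common way to replace a binary atomically on Linux is to write the new file next to the target and rename it into place. The Go sketch below shows that general technique; whether the host-agent does exactly this is an assumption, and `replaceAtomically` and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceAtomically writes data to a temp file in the target's directory and
// renames it over target. Rename within one filesystem is atomic on POSIX,
// so readers see either the old or the new file, never a partial write.
func replaceAtomically(target string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), filepath.Base(target)+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), target)
}

// demo replaces a stand-in "binary" inside a throwaway directory.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "rap-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	target := filepath.Join(dir, "rap-host-agent")
	if err := os.WriteFile(target, []byte("old-binary"), 0o755); err != nil {
		return "", err
	}
	if err := replaceAtomically(target, []byte("new-binary"), 0o755); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	got, err := demo()
	fmt.Println(got, err)
}
```

A process that already has the old file open keeps running from the old inode, which matches the behavior described above: the running process continues until it is restarted.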
+
+## Windows Host Agent Bootstrap And Updates
+
+Windows uses the same Control Plane install profile, but the local placement is
+a Scheduled Task instead of Docker. In `--startup-mode auto` the installer first
+tries an elevated `ONSTART` task running as `SYSTEM`; without admin rights it
+falls back to a per-user `ONLOGON` task. The `ONSTART` mode starts after reboot
+without an interactive user session. The `ONLOGON` fallback can only start after
+that Windows user signs in.
+
+```cmd
+powershell -NoProfile -ExecutionPolicy Bypass -Command "Invoke-WebRequest -UseBasicParsing 'http://control.example.com/downloads/rap-host-agent-windows-amd64.exe' -OutFile $env:TEMP\rap-host-agent.exe"
+%TEMP%\rap-host-agent.exe install-windows --profile-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --install-token "<one_time_install_token>" --node-name "office-win-1" --startup-mode "auto"
+```
+
+`install-windows` installs two tasks:
+
+- `RAP Node Agent <node>` runs `rap-node-agent.exe`.
+- `RAP Host Agent Updater <node>` runs `rap-host-agent update-loop` for product
+  `rap-node-agent`, install type `windows_service`, and replaces the local
+  `rap-node-agent.exe` from signed release artifacts.
+
+During first bootstrap the updater can read `<state-dir>\identity.json` and
+will wait until the join request is approved. For an already-enrolled Windows
+node, prefer passing `--node-id` explicitly. That makes the updater wrapper
+independent of the local identity file location and is required for repair of
+older Windows installs where the node is already heartbeat-healthy but the
+host-agent updater has no usable identity file.
+
+```cmd
+%TEMP%\rap-host-agent.exe install-windows --backend-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --node-id "<node_id>" --node-name "office-win-1" --replace --startup-mode "auto" --auto-update-current-version "<current_version>"
+```
+
+The admin UI node details page generates a downloadable
+`rap-repair-updater-<node>.cmd` for this repair path. It performs these steps:
+
+- prints `schtasks /Query` diagnostics for the node-agent and updater tasks;
+- prints the local `rap-*.exe*` files;
+- downloads the current `rap-host-agent.exe`;
+- reinstalls the Windows updater wrapper with `--node-id`;
+- runs a foreground one-shot `update-loop --max-runs 1`;
+- applies `rap-host-agent.exe.next` if the running host-agent could not replace
+  itself;
+- restarts `RAP Host Agent Updater <node>`;
+- prints post-repair diagnostics.
+
+Expected successful updater reports in the admin panel:
+
+```text
+rap-node-agent  <target> -> <target>  plan/noop
+rap-host-agent  <target> -> <target>  plan/noop
+```
+
+If the latest host-agent report is `apply/staged`, the new host-agent binary
+was downloaded as `rap-host-agent.exe.next` but the running process still held
+the old executable. Rerun the generated repair command, or end the updater
+task and run it again:
+
+```cmd
+schtasks /End /TN "RAP Host Agent Updater office-win-1"
+schtasks /Run /TN "RAP Host Agent Updater office-win-1"
+```
+
+### Windows Reboot / Autostart Verification
+
+After installation or repair, verify that both scheduled tasks survive a reboot:
+
+1. Reboot the Windows host, or at minimum restart both scheduled tasks.
+2. Confirm the tasks exist:
+
+```cmd
+schtasks /Query /TN "RAP Node Agent office-win-1" /V /FO LIST
+schtasks /Query /TN "RAP Host Agent Updater office-win-1" /V /FO LIST
+```
+
+3. Confirm the admin panel shows:
+
+```text
+heartbeat: fresh
+rap-node-agent: plan/noop
+rap-host-agent: plan/noop
+node version_state: current
+```
+
+Without admin rights, `install-windows --startup-mode auto` may fall back to
+`user-task`. That node can still heartbeat and update after the user logs in,
+but it will not start before logon after a reboot. Use an elevated shell for
+production Windows nodes that must recover unattended.
+
+Control Plane release artifacts for Windows must use:
+
+- `product=rap-node-agent`
+- `os=windows`
+- `arch=amd64`
+- `install_type=windows_service`
+- `kind=binary`
+
 ## First Enrollment

 Create a join token from the platform control plane, then run:
@@ -185,9 +418,18 @@ bounded `synthetic.echo` test-service runtime, and live synthetic HTTP endpoint.
 It must not be used for RDP, VPN, file, video, or other production service
 traffic.

-`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. While service runtime
-supervision is still a stub, the agent does not poll desired workloads or report
-workload status unless this flag is explicitly enabled.
+`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. When enabled, the agent
+polls node-scoped desired workloads and reports status. The current bounded
+runtime reports built-in `core-mesh` and `mesh-listener` services as running
+when enabled, supports the native built-in `synthetic.echo` test workload, and
+keeps unsupported production workloads such as RDP workers degraded until their
+supervisors are implemented.
+
+For Remote Workspace/RDP integration work, the native `rdp-worker` desired
+workload supports only an explicit `adapter_contract_probe` mode. That mode
+reports the remote-workspace adapter channel contract and requires Fabric
+Service Channel as the future data plane; it does not start FreeRDP, create a
+remote session, or carry production RDP payloads.

 `RAP_MESH_LISTEN_ADDR` starts the C17E/C17F/C17G synthetic HTTP endpoint only when
 `RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true`. `RAP_MESH_SYNTHETIC_CONFIG` points to
@@ -423,6 +665,63 @@ observations with expected/observed hops and drift status. This probes
 replacement relay effective paths for control-plane health only and does not
 enable service payload forwarding.

+C17Z21 defines the portable inbound listener contract for Docker, Linux
+service, Windows service, and future OS-specific node packages. The node-agent
+does not stop when the mesh listen port cannot be bound. It keeps the outbound
+Control Plane session alive and emits `c17z21.mesh_listener_report.v1` in
+heartbeat metadata with configured address, effective address, listen mode,
+listener status, inbound reachability, one-way connectivity, failure reason,
+and port-conflict diagnostics.
+
+`RAP_MESH_LISTEN_PORT_MODE` controls behavior:
+
+- `manual`: bind exactly `RAP_MESH_LISTEN_ADDR`; on conflict report
+  `listen_failed` and wait for an operator/config change.
+- `auto`: try `RAP_MESH_LISTEN_ADDR`; on conflict scan
+  `RAP_MESH_LISTEN_AUTO_PORT_START..RAP_MESH_LISTEN_AUTO_PORT_END` and report
+  `auto_rebound` when a free port is selected.
+- `disabled`: do not open an inbound listener; the node is expected to be
+  outbound-only, relay/rendezvous, or Control Plane only.
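
The three modes above can be sketched as a small port-selection function in Go. This is an illustration under stated assumptions, not the agent's implementation: `selectListenPort` and `tcpPortFree` are hypothetical, the `listening` and `disabled` status strings are placeholders (only `listen_failed` and `auto_rebound` appear in the text), and reporting `listen_failed` when the whole auto range is busy is a guess.

```go
package main

import (
	"fmt"
	"net"
)

// selectListenPort picks a port according to the listen mode and returns it
// together with a listener status. canBind reports whether a port is free.
func selectListenPort(mode string, preferred, autoStart, autoEnd int, canBind func(int) bool) (int, string) {
	switch mode {
	case "disabled":
		return 0, "disabled" // no inbound listener at all
	case "manual":
		if canBind(preferred) {
			return preferred, "listening"
		}
		return 0, "listen_failed" // wait for an operator/config change
	case "auto":
		if canBind(preferred) {
			return preferred, "listening"
		}
		for p := autoStart; p <= autoEnd; p++ {
			if canBind(p) {
				return p, "auto_rebound"
			}
		}
	}
	// Exhausted auto range or unknown mode (assumed behavior).
	return 0, "listen_failed"
}

// tcpPortFree probes a port by binding to it on loopback and releasing it.
// A real listener would keep the socket instead, to avoid a bind race.
func tcpPortFree(port int) bool {
	ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
	if err != nil {
		return false
	}
	ln.Close()
	return true
}

func main() {
	port, status := selectListenPort("auto", 19000, 19001, 19010, tcpPortFree)
	fmt.Println(port, status)
}
```

Injecting `canBind` keeps the decision logic testable without opening sockets.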
+
+For `RAP_MESH_CONNECTIVITY_MODE=outbound_only`, inbound listener failure is not
+treated as node death. The heartbeat remains `healthy` with
+`mesh_one_way_connectivity=true` and listener diagnostics. For direct/private
+LAN modes, a listener failure degrades the node so the admin panel can show
+that the node is alive but cannot accept inbound mesh traffic. Service payload
+forwarding is still not enabled by this contract.
+
+C17Z22 separates outbound Control Plane presence from inbound mesh
+reachability. When synthetic mesh testing is enabled, every heartbeat includes
+`c17z22.mesh_outbound_session_report.v1` with node-to-control-plane direction,
+keepalive transport, listener conflict state, rendezvous/relay counters, and a
+flag showing whether the current outbound session can be used as a reverse
+control-channel contract. This is the portable basis for Docker, Linux service,
+Windows service, and future packages where a node may be behind NAT or have no
+stable inbound address. It is still control-plane telemetry only and does not
+carry RDP/VPN/service payload traffic.
+
+C17Z24 separates the listener bind address from advertised mesh endpoints. The
+agent never advertises loopback addresses discovered from the local listener;
+`127.0.0.1`/`::1` are test-only bind details, not cluster reachability data.
+When the listener is active, the agent enumerates active non-loopback host
+interfaces and reports usable endpoint candidates with interface metadata,
+address family, reachability, NAT/connectivity hints, and priority. Container
+bridge/veth interfaces and link-local addresses are filtered by default, while
+physical and VPN-style interfaces are kept so different cluster segments can
+choose the address that matches their network. Operator-provided
+`RAP_MESH_ADVERTISE_ENDPOINT` or endpoint-candidate JSON remains authoritative
+and is ranked ahead of auto-discovered addresses.
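
As a rough Go sketch of the filtering described above: `isAdvertisable` is a hypothetical helper, and the interface-name prefixes used to recognise container bridge/veth devices are an assumption, since the agent's real detection rules are not given here.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Name prefixes treated as container bridge/veth interfaces (assumed list).
var containerPrefixes = []string{"docker", "br-", "veth"}

// isAdvertisable reports whether addr on ifaceName may be offered as a mesh
// endpoint candidate: no loopback, link-local, or container-bridge addresses.
func isAdvertisable(ifaceName, addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil || ip.IsUnspecified() || ip.IsLoopback() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() {
		return false
	}
	for _, p := range containerPrefixes {
		if strings.HasPrefix(ifaceName, p) {
			return false
		}
	}
	return true
}

func main() {
	ifaces, err := net.Interfaces()
	if err != nil {
		fmt.Println("enumerate interfaces:", err)
		return
	}
	for _, ifc := range ifaces {
		if ifc.Flags&net.FlagUp == 0 {
			continue // only active interfaces are considered
		}
		addrs, err := ifc.Addrs()
		if err != nil {
			continue
		}
		for _, a := range addrs {
			ipNet, ok := a.(*net.IPNet)
			if !ok {
				continue
			}
			if isAdvertisable(ifc.Name, ipNet.IP.String()) {
				fmt.Printf("candidate %s on %s\n", ipNet.IP, ifc.Name)
			}
		}
	}
}
```

Operator-provided endpoints would bypass this filter and be ranked first, as the paragraph above states.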
+
+C17Z25 adds per-peer endpoint fallback probing to the control-plane mesh
+manager. A node no longer treats the top-ranked endpoint candidate as the only
+possible address for a peer. For each warm direct/private/corporate peer, the
+manager probes the ranked candidate list until one `/mesh/v1/health` endpoint
+responds or all direct candidates fail. Heartbeat metadata includes
+`c17z25.mesh_peer_connection_manager_report.v1` with `probe_results`,
+`selected_candidate_id`, `selected_endpoint`, and per-candidate success/failure
+details. This is still control-plane health and address selection telemetry; it
+does not forward RDP/VPN/service payloads.
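
The fallback probing described above can be sketched in Go as a loop over an already ranked candidate list. The types and function names are hypothetical, and treating HTTP 200 from `/mesh/v1/health` as success is an assumption. The demo uses a local test server in place of a real peer.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type candidate struct{ ID, Endpoint string }

type probeResult struct {
	CandidateID string
	OK          bool
}

// selectCandidate probes candidates in ranked order and stops at the first
// one that answers. It returns the selected candidate (nil if all failed)
// plus one result per candidate actually probed.
func selectCandidate(ranked []candidate, probe func(endpoint string) bool) (*candidate, []probeResult) {
	var results []probeResult
	for i := range ranked {
		ok := probe(ranked[i].Endpoint)
		results = append(results, probeResult{CandidateID: ranked[i].ID, OK: ok})
		if ok {
			return &ranked[i], results
		}
	}
	return nil, results
}

// httpHealthProbe treats HTTP 200 from /mesh/v1/health as healthy (assumed).
func httpHealthProbe(endpoint string) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(endpoint + "/mesh/v1/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/mesh/v1/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := httptest.NewServer(mux) // stands in for a reachable peer
	defer srv.Close()

	ranked := []candidate{
		{ID: "top-ranked-unreachable", Endpoint: "http://127.0.0.1:1"},
		{ID: "fallback", Endpoint: srv.URL},
	}
	selected, results := selectCandidate(ranked, httpHealthProbe)
	if selected != nil {
		fmt.Println("selected_candidate_id:", selected.ID)
	}
	fmt.Println("probe_results:", results)
}
```

Recording a result for every probed candidate mirrors the per-candidate success/failure details that the report is said to carry.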
+
 Scoped synthetic config shape:

 ```json
@@ -480,7 +779,7 @@ Expected:
 - The agent never assigns roles to itself.
 - The agent reports capabilities only.
 - Platform policy assigns roles.
- No RDP/VPN/production service traffic is carried by the C17A-C17Z18 staged
+- No RDP/VPN/production service traffic is carried by the C17A-C17Z22 staged
  mesh runtime.
 - Production forwarding remains disabled by default and limited to
  `fabric.control` when explicitly enabled.