Record project continuation changes

2026-05-12 21:02:29 +03:00
parent 3059d1d7a3
commit 8f69d53193
339 changed files with 101111 additions and 1769 deletions
+304 -5
@@ -66,6 +66,11 @@ Implemented:
- synthetic route-health route config refresh from Control Plane path
decisions
- route-health expected/observed effective path drift reporting
- host-agent Docker update plan executor with artifact checksum/size
verification, container replacement, health check, status reporting, and
rollback attempt
- host-agent update loop for service/timer placement
- host-agent binary self-update loop for the updater service itself
- maximum capacity guard for the local production observation sink
- panic-safe fail-closed production envelope observation wrapper
- explicit `4096` byte payload boundary for validated production
@@ -98,7 +103,7 @@ Not implemented yet:
- VPN runtime
- production workload supervision
- certificate issuance/rotation
-- updater runtime
+- in-agent native updater runtime
- privileged host route/firewall control
## Build
@@ -107,9 +112,237 @@ Not implemented yet:
cd agents\rap-node-agent
go test ./...
go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent
go build -buildvcs=false -o bin\rap-host-agent.exe .\cmd\rap-host-agent
go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke
```
## Docker Host Agent Bootstrap
`rap-host-agent` is the first host-level installer/updater boundary for Docker
placement. It does not join the mesh itself. It applies the cluster's install
intent locally by running the `rap-node-agent` container with a persistent host
state directory. On Linux it also installs a systemd `update-loop` service by
default, so nodes continue to update from Control Plane policy without operator
commands on each host.
Preferred profile-based install:
```bash
rap-host-agent install \
--profile-url https://control.example.com/api/v1 \
--cluster-id <cluster_id> \
--install-token <one_time_install_token> \
--node-name docker-node-1
```
The host-agent exchanges the install token for a signed control-plane install
profile, then applies Docker image, container, state-dir, mesh listen,
advertise, NAT/connectivity, and region settings from that profile. The same
token is then used by the node-agent for first enrollment, so the operator does
not need to manually pass cluster/runtime flags.
Manual install is still supported:
```bash
rap-host-agent install \
--backend-url http://192.168.200.61:18080/api/v1 \
--cluster-id <cluster_id> \
--join-token <raw_join_token> \
--node-name docker-node-1 \
--image rap-node-agent:dev-enrollment-bootstrap-smoke \
--container-name rap-node-agent-docker-node-1 \
--state-dir /var/lib/rap/nodes/docker-node-1 \
--network host \
--replace
```
The command creates or replaces only the local Docker container. The running
node-agent submits the join request, waits for owner approval, stores its
identity in the mounted state directory, and then sends heartbeats. Re-running
with `--replace` updates the container while preserving node identity. Pass
`--auto-update-enabled=false` only for lab/debug installs where the local
systemd updater must not be registered.
Useful checks:
```bash
rap-host-agent status --container-name rap-node-agent-docker-node-1
docker logs -f rap-node-agent-docker-node-1
```
For a node that was installed before the updater existed, register only the
local updater service without recreating the node-agent container:
```bash
rap-host-agent install-updater \
--backend-url http://192.168.200.61:18080/api/v1 \
--cluster-id <cluster_id> \
--state-dir /var/lib/rap/nodes/docker-node-1 \
--container-name rap-node-agent-docker-node-1
```
## Docker Host Agent Updates
`rap-host-agent update` applies one Control Plane update plan for an already
enrolled Docker node. The host-agent fetches the plan, downloads the selected
Docker image tar, verifies size and sha256, loads the image, recreates the
node-agent container from the existing Docker runtime settings, checks that the
container is running, and reports update phases back to the Control Plane.
```bash
rap-host-agent update \
--backend-url http://192.168.200.61:18080/api/v1 \
--cluster-id <cluster_id> \
--node-id <node_id> \
--container-name rap-node-agent-docker-node-1 \
--current-version 0.1.0-c17z26
```
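The size and sha256 check described above can be sketched in Go roughly as follows. This is a minimal illustration, not the host-agent's actual code: `verifyArtifact`, `verifyBytes`, and `digestOf` are hypothetical names, and the real update plan format is not shown here.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifyArtifact streams r through SHA-256 and compares the byte count and
// digest with the values announced for the artifact.
// Hypothetical helper for illustration only.
func verifyArtifact(r io.Reader, wantSize int64, wantSHA256 string) error {
	h := sha256.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return fmt.Errorf("read artifact: %w", err)
	}
	if n != wantSize {
		return fmt.Errorf("size mismatch: got %d bytes, want %d", n, wantSize)
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, wantSHA256) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory data.
func verifyBytes(data []byte, wantSize int64, wantSHA256 string) error {
	return verifyArtifact(bytes.NewReader(data), wantSize, wantSHA256)
}

// digestOf returns the lowercase hex SHA-256 of data.
func digestOf(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

func main() {
	artifact := []byte("stand-in for a Docker image tar")
	size := int64(len(artifact))
	fmt.Println("intact:", verifyBytes(artifact, size, digestOf(artifact)))
	fmt.Println("truncated:", verifyBytes(artifact[:10], size, digestOf(artifact)))
}
```

Checking the size before comparing digests gives a cheaper and clearer error for truncated downloads; both checks must pass before the image is loaded.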
`rap-host-agent update-loop` is the per-node executor and health boundary. It
does not need to poll for normal releases: the node-agent receives a
`rap.node_update_hint.v1` subscription hint from Control Plane or the assigned
update-cache service during heartbeat, writes `<state-dir>/update-trigger.json`,
and the host-agent wakes immediately. The interval is an emergency fallback for
missed hints, service migration, or a dead update-cache service; keep it long
in production. The loop keeps running after transient errors by default and
advances its in-process current version after a successful update so it does
not repeatedly apply the same plan. When started without `--node-id` it reads
`<state-dir>/identity.json` and waits until the approved node identity appears,
which lets the updater service start immediately during first install. It also
persists the last applied node-agent version in
`<state-dir>/host-update-state.json` so a service restart does not reapply an
already-installed release.
```bash
rap-host-agent update-loop \
--backend-url http://192.168.200.61:18080/api/v1 \
--cluster-id <cluster_id> \
--node-id <node_id> \
--container-name rap-node-agent-docker-node-1 \
--current-version 0.1.0-c17z26 \
--interval-seconds 21600 \
--jitter 0.15
```
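The exact meaning of `--jitter` is not spelled out here. One plausible reading, assumed for this sketch only, is a symmetric fraction applied to the fallback interval so that many nodes do not wake at the same moment. `jitteredInterval` is a hypothetical helper, not the host-agent's implementation.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// jitteredInterval scales base by a random factor in [1-jitter, 1+jitter).
// r must return a value in [0, 1), for example rand.Float64.
// Assumption: jitter is a symmetric fraction of the base interval.
func jitteredInterval(base time.Duration, jitter float64, r func() float64) time.Duration {
	if jitter <= 0 {
		return base
	}
	if jitter > 1 {
		jitter = 1
	}
	factor := 1 + jitter*(2*r()-1)
	return time.Duration(float64(base) * factor)
}

func main() {
	base := 21600 * time.Second // matches --interval-seconds 21600
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredInterval(base, 0.15, rand.Float64))
	}
}
```

Under this assumption, `--interval-seconds 21600 --jitter 0.15` would yield fallback wake-ups between roughly 5.1 and 6.9 hours apart.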
Update-cache nodes are ordinary cluster nodes with the `update-cache` role.
Control Plane assigns a healthy update-cache node in the heartbeat hint. If the
assigned service disappears, the next hint returns `control_plane_fallback` or a
new service assignment; the local updater stays subscribed and only uses the
long fallback timer as a last resort.
`rap-host-agent update-host-agent-loop` updates the host-agent binary itself.
Only one global systemd unit is installed per Docker host:
`rap-host-agent-self-updater.service`. It uses one approved local node identity
to ask Control Plane for product `rap-host-agent` with install type
`linux_binary`, verifies the downloaded binary size and sha256, atomically
replaces `/usr/local/bin/rap-host-agent`, and reports status. The already
running process continues until systemd restarts it, while new invocations use
the new binary.
```bash
rap-host-agent update-host-agent-loop \
--backend-url http://192.168.200.61:18080/api/v1 \
--cluster-id <cluster_id> \
--state-dir /var/lib/rap/nodes/docker-node-1 \
--binary-path /usr/local/bin/rap-host-agent
```
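A common way to replace a binary atomically on Linux is to write the new file next to the target and rename it over the old one. The sketch below shows that pattern under stated assumptions; `replaceAtomically` and `demoReplace` are hypothetical names, and the host-agent's own replacement logic may differ.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// replaceAtomically writes data to a temporary file in the same directory as
// target, then renames it over target. On POSIX systems a rename within one
// filesystem is atomic, so readers see either the old or the new file.
func replaceAtomically(target string, data []byte, mode os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(target), filepath.Base(target)+".tmp-*")
	if err != nil {
		return err
	}
	tmpName := tmp.Name()
	defer os.Remove(tmpName) // harmless after a successful rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Chmod(mode); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // flush contents before the rename
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpName, target)
}

// demoReplace exercises replaceAtomically in a throwaway directory and
// returns the final file content.
func demoReplace() (string, error) {
	dir, err := os.MkdirTemp("", "rap-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "rap-host-agent")
	if err := os.WriteFile(target, []byte("old"), 0o755); err != nil {
		return "", err
	}
	if err := replaceAtomically(target, []byte("new"), 0o755); err != nil {
		return "", err
	}
	got, err := os.ReadFile(target)
	return string(got), err
}

func main() {
	fmt.Println(demoReplace())
}
```

Because the old inode stays open for the running process, this pattern is consistent with the behavior described above: the current process keeps running the old binary until it is restarted.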
## Windows Host Agent Bootstrap And Updates
Windows uses the same Control Plane install profile, but the local placement is
a Scheduled Task instead of Docker. In `--startup-mode auto` the installer first
tries an elevated `ONSTART` task running as `SYSTEM`; without admin rights it
falls back to a per-user `ONLOGON` task. The `ONSTART` mode starts after reboot
without an interactive user session. The `ONLOGON` fallback can only start after
that Windows user signs in.
```cmd
powershell -NoProfile -ExecutionPolicy Bypass -Command "Invoke-WebRequest -UseBasicParsing 'http://control.example.com/downloads/rap-host-agent-windows-amd64.exe' -OutFile $env:TEMP\rap-host-agent.exe"
%TEMP%\rap-host-agent.exe install-windows --profile-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --install-token "<one_time_install_token>" --node-name "office-win-1" --startup-mode "auto"
```
`install-windows` installs two tasks:
- `RAP Node Agent <node>` runs `rap-node-agent.exe`.
- `RAP Host Agent Updater <node>` runs `rap-host-agent update-loop` for product
`rap-node-agent`, install type `windows_service`, and replaces the local
`rap-node-agent.exe` from signed release artifacts.
During first bootstrap the updater can read `<state-dir>\identity.json` and
will wait until the join request is approved. For an already-enrolled Windows
node, prefer passing `--node-id` explicitly. That makes the updater wrapper
independent from the local identity file location and is required for repair of
older Windows installs where the node is already heartbeat-healthy but the
host-agent updater has no usable identity file.
```cmd
%TEMP%\rap-host-agent.exe install-windows --backend-url "http://control.example.com/api/v1" --cluster-id "<cluster_id>" --node-id "<node_id>" --node-name "office-win-1" --replace --startup-mode "auto" --auto-update-current-version "<current_version>"
```
The admin UI node details page generates a downloadable
`rap-repair-updater-<node>.cmd` for this repair path. It performs these steps:
- prints `schtasks /Query` diagnostics for the node-agent and updater tasks;
- prints the local `rap-*.exe*` files;
- downloads the current `rap-host-agent.exe`;
- reinstalls the Windows updater wrapper with `--node-id`;
- runs a foreground one-shot `update-loop --max-runs 1`;
- applies `rap-host-agent.exe.next` if the running host-agent could not replace
itself;
- restarts `RAP Host Agent Updater <node>`;
- prints post-repair diagnostics.
Expected successful updater reports in the admin panel:
```text
rap-node-agent <target> -> <target> plan/noop
rap-host-agent <target> -> <target> plan/noop
```
If the latest host-agent report is `apply/staged`, the new host-agent binary
was downloaded as `rap-host-agent.exe.next` but could not be swapped in because
the running process still held the old executable. Rerun the generated repair
command, or end the updater task and start it again:
```cmd
schtasks /End /TN "RAP Host Agent Updater office-win-1"
schtasks /Run /TN "RAP Host Agent Updater office-win-1"
```
### Windows Reboot / Autostart Verification
After installation or repair, verify that both scheduled tasks survive a reboot:
1. Reboot the Windows host, or at minimum restart both scheduled tasks.
2. Confirm the tasks exist:
```cmd
schtasks /Query /TN "RAP Node Agent office-win-1" /V /FO LIST
schtasks /Query /TN "RAP Host Agent Updater office-win-1" /V /FO LIST
```
3. Confirm the admin panel shows:
```text
heartbeat: fresh
rap-node-agent: plan/noop
rap-host-agent: plan/noop
node version_state: current
```
Without admin rights, `install-windows --startup-mode auto` may fall back to
`user-task`. That node can still heartbeat and update after the user logs in,
but it will not start before logon after a reboot. Use an elevated shell for
production Windows nodes that must recover unattended.
Control Plane release artifacts for Windows must use:
- `product=rap-node-agent`
- `os=windows`
- `arch=amd64`
- `install_type=windows_service`
- `kind=binary`
## First Enrollment
Create a join token from the platform control plane, then run:
@@ -185,9 +418,18 @@ bounded `synthetic.echo` test-service runtime, and live synthetic HTTP endpoint.
It must not be used for RDP, VPN, file, video, or other production service
traffic.
-`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. While service runtime
-supervision is still a stub, the agent does not poll desired workloads or report
-workload status unless this flag is explicitly enabled.
+`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. When enabled, the agent
+polls node-scoped desired workloads and reports status. The current bounded
+runtime reports built-in `core-mesh` and `mesh-listener` services as running
+when enabled, supports the native built-in `synthetic.echo` test workload, and
+keeps unsupported production workloads such as RDP workers degraded until their
+supervisors are implemented.
For Remote Workspace/RDP integration work, the native `rdp-worker` desired
workload supports only an explicit `adapter_contract_probe` mode. That mode
reports the remote-workspace adapter channel contract and requires Fabric
Service Channel as the future data plane; it does not start FreeRDP, create a
remote session, or carry production RDP payloads.
`RAP_MESH_LISTEN_ADDR` starts the C17E/C17F/C17G synthetic HTTP endpoint only when
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true`. `RAP_MESH_SYNTHETIC_CONFIG` points to
@@ -423,6 +665,63 @@ observations with expected/observed hops and drift status. This probes
replacement relay effective paths for control-plane health only and does not
enable service payload forwarding.
C17Z21 defines the portable inbound listener contract for Docker, Linux
service, Windows service, and future OS-specific node packages. The node-agent
does not stop when the mesh listen port cannot be bound. It keeps the outbound
Control Plane session alive and emits `c17z21.mesh_listener_report.v1` in
heartbeat metadata with configured address, effective address, listen mode,
listener status, inbound reachability, one-way connectivity, failure reason,
and port-conflict diagnostics.
`RAP_MESH_LISTEN_PORT_MODE` controls behavior:
- `manual`: bind exactly `RAP_MESH_LISTEN_ADDR`; on conflict report
`listen_failed` and wait for an operator/config change.
- `auto`: try `RAP_MESH_LISTEN_ADDR`; on conflict scan
`RAP_MESH_LISTEN_AUTO_PORT_START..RAP_MESH_LISTEN_AUTO_PORT_END` and report
`auto_rebound` when a free port is selected.
- `disabled`: do not open an inbound listener; the node is expected to be
outbound-only, relay/rendezvous, or Control Plane only.
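The three modes above can be sketched as a small bind routine. This is an illustration under assumptions, not the node-agent's code: `bindMeshListener` and `demoConflict` are hypothetical, the `listening` and `disabled` status strings are invented for the example (only `listen_failed` and `auto_rebound` come from the contract above), and every bind error is treated as a port conflict for simplicity.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// bindMeshListener applies one listen mode to addr. In "auto" mode it scans
// autoStart..autoEnd on the same host after the configured address fails.
// It returns a nil listener when nothing could be bound or the mode is disabled.
func bindMeshListener(mode, addr string, autoStart, autoEnd int) (net.Listener, string) {
	if mode == "disabled" {
		return nil, "disabled"
	}
	if ln, err := net.Listen("tcp", addr); err == nil {
		return ln, "listening"
	}
	if mode != "auto" {
		return nil, "listen_failed" // manual: wait for an operator/config change
	}
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, "listen_failed"
	}
	for port := autoStart; port <= autoEnd; port++ {
		ln, err := net.Listen("tcp", net.JoinHostPort(host, strconv.Itoa(port)))
		if err == nil {
			return ln, "auto_rebound"
		}
	}
	return nil, "listen_failed"
}

// demoConflict occupies a loopback port, then asks bindMeshListener to bind
// the same address in the given mode and returns the resulting status.
func demoConflict(mode string) string {
	occupied, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "setup_failed: " + err.Error()
	}
	defer occupied.Close()

	ln, status := bindMeshListener(mode, occupied.Addr().String(), 42000, 42050)
	if ln != nil {
		ln.Close()
	}
	return status
}

func main() {
	for _, mode := range []string{"manual", "auto", "disabled"} {
		fmt.Println(mode, "->", demoConflict(mode))
	}
}
```

In every branch the caller gets a status back instead of an exit, which mirrors the contract that the agent keeps its outbound Control Plane session alive when the listener cannot be bound.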
For `RAP_MESH_CONNECTIVITY_MODE=outbound_only`, inbound listener failure is not
treated as node death. The heartbeat remains `healthy` with
`mesh_one_way_connectivity=true` and listener diagnostics. For direct/private
LAN modes, a listener failure degrades the node so the admin panel can show
that the node is alive but cannot accept inbound mesh traffic. Service payload
forwarding is still not enabled by this contract.
C17Z22 separates outbound Control Plane presence from inbound mesh
reachability. When synthetic mesh testing is enabled, every heartbeat includes
`c17z22.mesh_outbound_session_report.v1` with node-to-control-plane direction,
keepalive transport, listener conflict state, rendezvous/relay counters, and a
flag showing whether the current outbound session can be used as a reverse
control-channel contract. This is the portable basis for Docker, Linux service,
Windows service, and future packages where a node may be behind NAT or have no
stable inbound address. It is still control-plane telemetry only and does not
carry RDP/VPN/service payload traffic.
C17Z24 separates the listener bind address from advertised mesh endpoints. The
agent never advertises loopback addresses discovered from the local listener;
`127.0.0.1`/`::1` are test-only bind details, not cluster reachability data.
When the listener is active, the agent enumerates active non-loopback host
interfaces and reports usable endpoint candidates with interface metadata,
address family, reachability, NAT/connectivity hints, and priority. Container
bridge/veth interfaces and link-local addresses are filtered by default, while
physical and VPN-style interfaces are kept so different cluster segments can
choose the address that matches their network. Operator-provided
`RAP_MESH_ADVERTISE_ENDPOINT` or endpoint-candidate JSON remains authoritative
and is ranked ahead of auto-discovered addresses.
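The default filtering rules can be approximated with the standard `net` package. The sketch below is an assumption-laden illustration: `candidateAllowed` is a hypothetical helper, and the interface-name prefixes used to detect container bridge/veth devices (`docker`, `br-`, `veth`) are guesses for the example rather than the agent's actual list.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// containerPrefixes lists interface-name prefixes treated as container
// bridge/veth devices. Assumed values for illustration only.
var containerPrefixes = []string{"docker", "br-", "veth"}

// isContainerInterface reports whether name looks like a container device.
func isContainerInterface(name string) bool {
	for _, p := range containerPrefixes {
		if strings.HasPrefix(name, p) {
			return true
		}
	}
	return false
}

// candidateAllowed reports whether addr on interface ifName may be advertised:
// it must parse as an IP and must not be loopback, unspecified, or link-local,
// and the interface must not look like a container device.
func candidateAllowed(ifName, addr string) bool {
	ip := net.ParseIP(addr)
	if ip == nil {
		return false
	}
	if ip.IsLoopback() || ip.IsUnspecified() ||
		ip.IsLinkLocalUnicast() || ip.IsLinkLocalMulticast() {
		return false
	}
	return !isContainerInterface(ifName)
}

func main() {
	samples := [][2]string{
		{"eth0", "192.168.1.10"},
		{"lo", "127.0.0.1"},
		{"eth0", "fe80::1"},
		{"docker0", "172.17.0.1"},
		{"wg0", "10.8.0.2"},
	}
	for _, s := range samples {
		fmt.Printf("%-8s %-14s advertise=%v\n", s[0], s[1], candidateAllowed(s[0], s[1]))
	}
}
```

Operator-provided endpoints would bypass this filter entirely, since they remain authoritative.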
C17Z25 adds per-peer endpoint fallback probing to the control-plane mesh
manager. A node no longer treats the top-ranked endpoint candidate as the only
possible address for a peer. For each warm direct/private/corporate peer, the
manager probes the ranked candidate list until one `/mesh/v1/health` endpoint
responds or all direct candidates fail. Heartbeat metadata includes
`c17z25.mesh_peer_connection_manager_report.v1` with `probe_results`,
`selected_candidate_id`, `selected_endpoint`, and per-candidate success/failure
details. This is still control-plane health and address selection telemetry; it
does not forward RDP/VPN/service payloads.
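The fallback loop can be sketched as "probe candidates in ranked order, stop at the first healthy one". In this illustration the type names, `probeCandidates`, `httpHealthProbe`, and the per-candidate JSON fields are hypothetical; only `probe_results`, `selected_candidate_id`, `selected_endpoint`, and the `/mesh/v1/health` path are taken from the description above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"strings"
)

// Candidate is one ranked endpoint for a peer (hypothetical shape).
type Candidate struct {
	ID       string
	Endpoint string
}

// ProbeResult records the outcome for a single candidate.
type ProbeResult struct {
	CandidateID string `json:"candidate_id"`
	Endpoint    string `json:"endpoint"`
	OK          bool   `json:"ok"`
	Error       string `json:"error,omitempty"`
}

// Report mirrors the selection fields named in the heartbeat metadata.
type Report struct {
	ProbeResults        []ProbeResult `json:"probe_results"`
	SelectedCandidateID string        `json:"selected_candidate_id,omitempty"`
	SelectedEndpoint    string        `json:"selected_endpoint,omitempty"`
}

var errUnreachable = errors.New("unreachable")

// probeCandidates tries each candidate in ranked order and stops at the first
// one whose probe succeeds. Candidates after the selected one are not probed.
func probeCandidates(ranked []Candidate, probe func(endpoint string) error) Report {
	var report Report
	for _, c := range ranked {
		res := ProbeResult{CandidateID: c.ID, Endpoint: c.Endpoint}
		if err := probe(c.Endpoint); err != nil {
			res.Error = err.Error()
			report.ProbeResults = append(report.ProbeResults, res)
			continue
		}
		res.OK = true
		report.ProbeResults = append(report.ProbeResults, res)
		report.SelectedCandidateID = c.ID
		report.SelectedEndpoint = c.Endpoint
		break
	}
	return report
}

// httpHealthProbe returns a probe that issues GET <endpoint>/mesh/v1/health.
// The caller should set a Timeout on client.
func httpHealthProbe(client *http.Client) func(string) error {
	return func(endpoint string) error {
		resp, err := client.Get(strings.TrimRight(endpoint, "/") + "/mesh/v1/health")
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status %d", resp.StatusCode)
		}
		return nil
	}
}

func main() {
	ranked := []Candidate{
		{ID: "a", Endpoint: "http://10.0.0.1:7000"},
		{ID: "b", Endpoint: "http://10.0.0.2:7000"},
	}
	// Stub probe so the example runs without a network: only "b" is healthy.
	stub := func(endpoint string) error {
		if endpoint == "http://10.0.0.2:7000" {
			return nil
		}
		return errUnreachable
	}
	out, _ := json.MarshalIndent(probeCandidates(ranked, stub), "", "  ")
	fmt.Println(string(out))
}
```

Passing the probe as a function keeps the selection logic testable without a network; in a real agent it would be something like `httpHealthProbe` with a short client timeout.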
Scoped synthetic config shape:
```json
@@ -480,7 +779,7 @@ Expected:
- The agent never assigns roles to itself.
- The agent reports capabilities only.
- Platform policy assigns roles.
-- No RDP/VPN/production service traffic is carried by the C17A-C17Z18 staged
+- No RDP/VPN/production service traffic is carried by the C17A-C17Z22 staged
mesh runtime.
- Production forwarding remains disabled by default and limited to
`fabric.control` when explicitly enabled.