@@ -0,0 +1,35 @@
# RAP host-agent monitor

`rap-host-agent monitor-loop` is the local watchdog that runs near a node host.
It complements the update loop:

- starts watched Docker containers when they are stopped;
- restarts watched containers when Docker health is `unhealthy`;
- restarts containers stuck in `restarting` longer than the stale threshold;
- rate-limits repeated remediation with a restart cooldown;
- watches disk pressure and runs safe cleanup when the cleanup threshold is reached;
- removes old `/tmp/rap-*` and `/tmp/go-build*` build directories;
- writes an optional JSON status file;
- reports monitor status to the control plane through the node update-status channel.
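
The container rules in the list above can be read as a small decision function. The sketch below is illustrative only and is not taken from `rap-host-agent`: the function name `decide_action`, the `STALE_SECONDS` and `COOLDOWN_SECONDS` variables, their defaults, and the choice to treat `exited` and `created` as "stopped" are all assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the remediation rules; not the agent's real code.

# decide_action STATE HEALTH SECONDS_IN_STATE SECONDS_SINCE_LAST_ACTION
# Prints one of: start, restart, cooldown, none.
decide_action() {
  local state="$1" health="$2" in_state="$3" since_last="$4"
  local stale="${STALE_SECONDS:-300}"        # assumed stale threshold
  local cooldown="${COOLDOWN_SECONDS:-600}"  # assumed restart cooldown
  local action="none"

  case "$state" in
    exited|created)
      # Assumption: these Docker states count as "stopped".
      action="start"
      ;;
    restarting)
      # Only act when the container has been restarting for too long.
      if [ "$in_state" -gt "$stale" ]; then action="restart"; fi
      ;;
    running)
      if [ "$health" = "unhealthy" ]; then action="restart"; fi
      ;;
  esac

  # Rate-limit: skip remediation if the last one was too recent.
  if [ "$action" != "none" ] && [ "$since_last" -lt "$cooldown" ]; then
    action="cooldown"
  fi
  printf '%s\n' "$action"
}
```

The inputs could come from `docker inspect --format '{{.State.Status}}' NAME` and `docker inspect --format '{{if .State.Health}}{{.State.Health.Status}}{{else}}none{{end}}' NAME`; the second template avoids an error for containers that define no health check.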

Example:

```bash
rap-host-agent monitor-loop \
  --backend-url http://127.0.0.1:18121/api/v1 \
  --cluster-id cfc0743d-d960-49fb-9de8-96e063d5e4aa \
  --node-id 108a0d66-d65e-4dea-b9a8-135366bf7dba \
  --current-version 0.2.261-vpnfarm \
  --interval-seconds 60 \
  --disk-warn-percent 80 \
  --disk-cleanup-percent 85 \
  --disk-critical-percent 95 \
  --status-file /tmp/rap-web-admin/html/downloads/ops/host-monitor-status.json \
  --watch-container rap_test_postgres \
  --watch-container rap_test_redis \
  --watch-container rap_test_backend
```

On the shared test Docker host, the current public status file is:

`http://docker-test.cin.su:18080/downloads/ops/host-monitor-status.json`
@@ -0,0 +1,64 @@
# Test Docker Disk Guard

`test-docker` is a shared build and runtime host. If `/` fills up, Postgres can
restart-loop with `No space left on device`, which breaks VPN diagnostics and
cluster tests. The disk guard is the first operational guardrail for that host.

## What It Does

- Checks `/` usage every run.
- At `>= 85%`, removes safe reclaimable data:
  - Docker build cache.
  - Dangling Docker images.
  - Old RAP temporary build directories under `/tmp`.
- At `>= 85%`, publishes a warning status after cleanup if the host is still above the warning line.
- At `>= 95%` after cleanup, publishes critical status and exits with code `2`.
- Writes machine-readable status to:
  - `http://docker-test.cin.su:18080/downloads/ops/test-docker-disk-guard-status.json`
- Writes host log to:
  - `/tmp/rap-ops/test-docker-disk-guard.log`
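
The threshold behaviour above can be sketched as follows. This is not the contents of `test-docker-disk-guard.sh`: the function names, the `ok` status label, the environment variable names, the 24-hour age cutoff, and the exclusion list are assumptions made for illustration. `df --output=pcent` assumes GNU coreutils.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the disk guard thresholds; not the real script.

# Current usage of / as an integer percentage (GNU df assumed).
disk_used_percent() { df --output=pcent / | tail -n 1 | tr -dc '0-9'; }

# Succeeds when usage has reached the cleanup threshold.
needs_cleanup() { [ "$1" -ge "${CLEANUP_PERCENT:-85}" ]; }

# Reclaim space that is safe to drop. Never called by the sketch itself.
run_cleanup() {
  docker builder prune --force   # Docker build cache
  docker image prune --force     # dangling images only
  # Old temporary build directories. The -mmin cutoff is an assumption, and
  # rap-ops / rap-web-admin are excluded because they hold the log and the
  # published status files.
  find /tmp -mindepth 1 -maxdepth 1 -type d \
    \( -name 'rap-*' -o -name 'go-build*' \) \
    ! -name 'rap-ops' ! -name 'rap-web-admin' \
    -mmin +1440 -exec rm -rf {} +
}

# classify_after_cleanup USED_PERCENT
# Prints ok, warning, or critical; returns 2 on critical.
classify_after_cleanup() {
  local used="$1"
  if [ "$used" -ge "${CRITICAL_PERCENT:-95}" ]; then
    echo critical
    return 2
  elif [ "$used" -ge "${WARN_PERCENT:-85}" ]; then
    echo warning
  else
    echo ok
  fi
}
```

A run would then look roughly like `used=$(disk_used_percent); needs_cleanup "$used" && run_cleanup; classify_after_cleanup "$(disk_used_percent)"`, with the final return value used as the script's exit code.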

## Install Or Refresh Schedule

Run from the repo root on the Windows workstation:

```powershell
pwsh -ExecutionPolicy Bypass -File scripts/ops/test-docker-disk-guard.ps1 -InstallCron -RunOnce
```

The wrapper uploads `scripts/ops/test-docker-disk-guard.sh` to
`/home/test/bin/rap-test-docker-disk-guard` on `test-docker`. It installs a cron
entry when `crontab` exists; otherwise it installs a user systemd timer named
`rap-test-docker-disk-guard.timer`.

## Manual Check

```powershell
pwsh -ExecutionPolicy Bypass -File scripts/ops/test-docker-disk-guard.ps1 -RunOnce
Invoke-RestMethod http://docker-test.cin.su:18080/downloads/ops/test-docker-disk-guard-status.json
```

## Expansion Approach

Cleanup is only a pressure valve. If the status remains `warning` or `critical`
after cleanup, expand the host disk.

The host root filesystem is expected to be on LVM. If the VM already has free VG
space, the guard status will recommend:

```bash
sudo lvextend -r -l +100%FREE /dev/mapper/ubuntu--vg-ubuntu--lv
```

If there is no VG free space, first expand the VM disk in the hypervisor, then
run `pvresize` for the physical volume and finally `lvextend -r` for the root
logical volume.
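
The two expansion paths can be written as a dry-run helper that only prints the commands to run. This is a sketch under stated assumptions: `print_expand_plan` is a hypothetical name, and `/dev/sda3` in the usage note is a placeholder for whatever device actually backs the physical volume on the host.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the expansion steps, runs nothing.

# print_expand_plan VG_FREE_EXTENTS PV_DEVICE LV_PATH
print_expand_plan() {
  local free_extents="$1" pv="$2" lv="$3"
  if [ "$free_extents" -eq 0 ]; then
    # No free space in the volume group: the disk must grow first.
    echo "# Expand the VM disk in the hypervisor before continuing."
    echo "# If the PV is a partition, grow the partition as well."
    echo "sudo pvresize $pv"
  fi
  # -r resizes the filesystem together with the logical volume.
  echo "sudo lvextend -r -l +100%FREE $lv"
}
```

The free extent count could be read with `sudo vgs --noheadings -o vg_free_count ubuntu-vg | tr -d ' '` (the mapper name `ubuntu--vg-ubuntu--lv` corresponds to volume group `ubuntu-vg`), then passed as `print_expand_plan "$free" /dev/sda3 /dev/mapper/ubuntu--vg-ubuntu--lv`.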

## Optional Webhook

The shell guard supports `WEBHOOK_URL`. If set in cron/environment, warning and
critical states are posted as JSON:

```json
{"level":"warning","message":"...","host":"...","observed_at":"..."}
```
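
A minimal sketch of how such a post might be built with `curl` is shown below. It is not the guard's actual code: `build_payload` and `post_status` are hypothetical names, the UTC timestamp format is an assumption, and the JSON is assembled naively, so it is only safe when the message contains no quotes, backslashes, or control characters.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the webhook post; not the real guard script.

# build_payload LEVEL MESSAGE HOST OBSERVED_AT
# Naive JSON assembly: arguments must not need JSON escaping.
build_payload() {
  printf '{"level":"%s","message":"%s","host":"%s","observed_at":"%s"}\n' \
    "$1" "$2" "$3" "$4"
}

# post_status LEVEL MESSAGE
# Posts only warning/critical states, and only when WEBHOOK_URL is set.
post_status() {
  local level="$1" message="$2"
  [ -n "${WEBHOOK_URL:-}" ] || return 0
  case "$level" in
    warning|critical) ;;
    *) return 0 ;;
  esac
  build_payload "$level" "$message" "$(hostname)" \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" |
    curl -fsS -H 'Content-Type: application/json' \
      --data-binary @- "$WEBHOOK_URL"
}
```

For example, `WEBHOOK_URL=https://example.invalid/hook post_status warning "root at 88% after cleanup"` would attempt one POST, while an unset `WEBHOOK_URL` makes the call a no-op.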