Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,487 @@
# rap-node-agent
Native node agent MVP for the Secure Access Fabric.
Status: Stage C17Z18 synthetic route-health effective path boundary.
This agent is intentionally native. Containers may package service workloads,
but the host-level node identity belongs to `rap-node-agent`.
## Current Scope
Implemented:
- config loading from flags/environment
- local identity state file
- enrollment request client
- heartbeat client
- capability/facts payload
- status-only service reporting payload
- mesh control-channel skeleton
- route-health message skeleton
- relay skeleton that refuses production payload forwarding
- disabled-by-default synthetic mesh runtime for `fabric.probe` /
`fabric.probe_ack`
- direct and single-relay synthetic route tests
- synthetic `fabric.route_health` / `fabric.route_health_ack`
- local route success/failure observations
- fallback route selection for test topology
- route cache invalidation on version changes
- synthetic relay envelope validation
- per-channel bounded queues for synthetic traffic
- QoS dequeue order: `fabric_control`, then `route_control`, then `telemetry`
- telemetry-only stale message drop under backpressure
- reliable fabric/control queue rejection when full
- bounded non-production `synthetic.echo` test-service path
- direct, single-relay, and forced-fallback test-service proofs
- live HTTP peer transport for synthetic mesh envelopes
- disabled-by-default synthetic mesh HTTP endpoint in `rap-node-agent`
- `mesh-live-smoke` harness proving direct and single-relay synthetic traffic
over real local HTTP endpoints
- scoped synthetic mesh config file loading for peer endpoints and routes
- Control Plane synthetic mesh config read fallback when no local scoped config
file is set
- synthetic route-health observations reported to the Control Plane when test
flags allow synthetic links
- explicit production mesh forwarding gate config, off by default; when first
introduced, the gate had no forwarding runtime behind it
- route-bound production mesh envelope contract and fail-closed validation on
`/mesh/v1/forward`
- metadata-only production envelope observation hook for valid envelopes, still
without forwarding payloads
- bounded metadata-only production envelope observation sink for accepted
observations
- disabled-by-default node-agent wiring for the bounded observation sink
- local metrics for the bounded observation sink without exposing observation
records
- local node-agent logging for bounded observation sink metrics
- change-driven suppression for unchanged bounded observation sink metrics logs
- explicit local log distinction between production forwarding gate state and
production forwarding runtime state
- node-scoped rendezvous lease refresh through Control Plane synthetic config
- stale relay withdrawal/reselection telemetry
- relay replacement contract reporting for stale rendezvous relays
- route/path decision contract reporting for control-plane route generations
- route generation apply/withdraw tracking for control-plane path decisions
- synthetic route-health route config refresh from Control Plane path
decisions
- route-health expected/observed effective path drift reporting
- maximum capacity guard for the local production observation sink
- panic-safe fail-closed production envelope observation wrapper
- explicit `4096` byte payload boundary for validated production
fabric-control envelopes
- explicit future-skew boundary for validated production envelope `created_at`
- scoped synthetic peer endpoint candidate config with reachability,
NAT/connectivity hints, priority, policy tags, and metadata
- deterministic local peer endpoint candidate scoring model for synthetic
config candidates
- optional local health observation overlay for endpoint candidate scoring
- gate-controlled production `fabric.control` direct next-hop delivery
- route-path-bound production `fabric.control` multi-hop forwarding
- local metadata-only production `fabric.control` forwarding event logs
- route-config-bound production `fabric.control` forwarding validation
- scoped peer directory and bounded recovery seed config parsing/validation
- node-local peer cache with bounded warm peer health probes
- advertised mesh endpoint reporting through heartbeat metadata
- multiple advertised endpoint candidates, including private/corporate LAN
- peer connection state machine for warm-peer health
- bounded peer recovery planner over peer cache and connection states
- peer connection intent planner with transport readiness classification
- peer connection manager for real control-plane health over reusable
HTTP keep-alive transport
- route-health effective-path runtime through replacement relay control paths
Not implemented yet:
- mesh packet routing
- production mesh service traffic
- VPN runtime
- production workload supervision
- certificate issuance/rotation
- updater runtime
- privileged host route/firewall control
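
The queueing items in the implemented list (per-channel bounded queues, the `fabric_control` / `route_control` / `telemetry` dequeue order, telemetry drop under backpressure, and rejection on full reliable queues) can be illustrated with a minimal sketch. The `Scheduler` type, `ErrQueueFull`, and the choice to treat the oldest telemetry message as the stale one are assumptions made for illustration, not the agent's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// dequeueOrder mirrors the documented QoS order.
var dequeueOrder = []string{"fabric_control", "route_control", "telemetry"}

// ErrQueueFull is a hypothetical error for a full reliable queue.
var ErrQueueFull = errors.New("queue full")

// Scheduler keeps one bounded FIFO per channel class (hypothetical type).
type Scheduler struct {
	capacity int
	queues   map[string][]string
}

// NewScheduler clamps capacity to at least 1 so every queue can hold a message.
func NewScheduler(capacity int) *Scheduler {
	if capacity < 1 {
		capacity = 1
	}
	return &Scheduler{capacity: capacity, queues: map[string][]string{}}
}

// Enqueue rejects when a reliable queue is full. A full telemetry queue
// drops one message instead of rejecting.
func (s *Scheduler) Enqueue(class, msg string) error {
	q := s.queues[class]
	if len(q) >= s.capacity {
		if class != "telemetry" {
			return ErrQueueFull
		}
		q = q[1:] // assumption: the oldest telemetry message is the stale one
	}
	s.queues[class] = append(q, msg)
	return nil
}

// Dequeue returns the next message from the highest-priority non-empty queue.
func (s *Scheduler) Dequeue() (string, bool) {
	for _, class := range dequeueOrder {
		if q := s.queues[class]; len(q) > 0 {
			s.queues[class] = q[1:]
			return q[0], true
		}
	}
	return "", false
}

func main() {
	s := NewScheduler(2)
	_ = s.Enqueue("telemetry", "t1")
	_ = s.Enqueue("route_control", "r1")
	_ = s.Enqueue("fabric_control", "f1")
	for msg, ok := s.Dequeue(); ok; msg, ok = s.Dequeue() {
		fmt.Println(msg) // f1, then r1, then t1
	}
}
```

Keeping the drop policy inside `Enqueue` means callers only handle one error, and only for reliable classes.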
## Build
```powershell
cd agents\rap-node-agent
go test ./...
go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent
go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke
```
## First Enrollment
Create a join token from the platform control plane, then run:
```powershell
.\bin\rap-node-agent.exe `
-backend-url http://192.168.200.61:8080/api/v1 `
-cluster-id <cluster_id> `
-join-token <raw_join_token> `
-node-name test-node-1 `
-state-dir C:\ProgramData\RapNodeAgent
```
The agent submits a pending join request and waits for a decision; it does not
self-activate. A platform admin must approve the join request.
## Enrollment Approval
When the agent enrolls, it stores the returned `pending_join_request_id` and
polls the Control Plane bootstrap endpoint until the platform owner approves
the request or the enrollment timeout expires. After approval, the agent
verifies the signed bootstrap contract and writes the approved `node_id`,
`cluster_id`, `identity_status=active`, `cluster_authority_public_key`, and
`cluster_authority_fingerprint` into `identity.json`.
Future C3 hardening can add signed node certificates and automatic secure
certificate material exchange.
After the join request is approved, run the agent again:
```powershell
.\bin\rap-node-agent.exe `
-backend-url http://192.168.200.61:8080/api/v1 `
-state-dir C:\ProgramData\RapNodeAgent
```
It sends periodic heartbeats to:
```text
/api/v1/clusters/{clusterID}/nodes/{nodeID}/heartbeats
```
## Environment Variables
- `RAP_BACKEND_URL`
- `RAP_CLUSTER_ID`
- `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY`
- `RAP_CLUSTER_AUTHORITY_FINGERPRINT`
- `RAP_JOIN_TOKEN`
- `RAP_NODE_NAME`
- `RAP_NODE_STATE_DIR`
- `RAP_WORKLOAD_SUPERVISION_ENABLED`
- `RAP_HEARTBEAT_INTERVAL_SECONDS`
- `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS`
- `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`
- `RAP_MESH_SYNTHETIC_RUNTIME_ENABLED`
- `RAP_MESH_LISTEN_ADDR`
- `RAP_MESH_ADVERTISE_ENDPOINT`
- `RAP_MESH_ADVERTISE_ENDPOINTS_JSON`
- `RAP_MESH_ADVERTISE_TRANSPORT`
- `RAP_MESH_CONNECTIVITY_MODE`
- `RAP_MESH_NAT_TYPE`
- `RAP_MESH_REGION`
- `RAP_MESH_SYNTHETIC_CONFIG`
- `RAP_MESH_PEER_ENDPOINTS_JSON`
- `RAP_MESH_SYNTHETIC_ROUTES_JSON`
- `RAP_MESH_PRODUCTION_FORWARDING_ENABLED`
- `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY`
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED` defaults to `false`. It gates only the
C17A/C17B/C17C/C17D/C17E synthetic probe, route-health, relay scheduling,
bounded `synthetic.echo` test-service runtime, and live synthetic HTTP endpoint.
It must not be used for RDP, VPN, file, video, or other production service
traffic.
`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. While service runtime
supervision is still a stub, the agent does not poll desired workloads or report
workload status unless this flag is explicitly enabled.
`RAP_MESH_LISTEN_ADDR` starts the C17E/C17F/C17G synthetic HTTP endpoint only when
`RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true`. `RAP_MESH_SYNTHETIC_CONFIG` points to
a scoped synthetic mesh config snapshot and is preferred over debug JSON.
`RAP_MESH_PEER_ENDPOINTS_JSON` is a JSON object mapping peer node IDs to
endpoint URLs. `RAP_MESH_SYNTHETIC_ROUTES_JSON` is a JSON array of synthetic
route objects. If no local scoped config file is set, the agent asks the
Control Plane for:
```text
/clusters/{clusterID}/nodes/{nodeID}/mesh/synthetic-config
```
The JSON variables are debug fallback only.
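
As an illustration of the `RAP_MESH_PEER_ENDPOINTS_JSON` shape (a JSON object mapping peer node IDs to endpoint URLs), the following sketch parses and sanity-checks such a value. `parsePeerEndpoints` is a hypothetical helper, and the requirement that each endpoint have a scheme and host is an assumption rather than documented agent behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// parsePeerEndpoints decodes a JSON object of peer node ID -> endpoint URL.
// An empty string yields an empty map, so the variable can be left unset.
func parsePeerEndpoints(raw string) (map[string]string, error) {
	peers := map[string]string{}
	if raw == "" {
		return peers, nil
	}
	if err := json.Unmarshal([]byte(raw), &peers); err != nil {
		return nil, fmt.Errorf("peer endpoints: %w", err)
	}
	for nodeID, endpoint := range peers {
		u, err := url.Parse(endpoint)
		// Assumption: endpoints must be absolute URLs with a scheme and host.
		if err != nil || u.Scheme == "" || u.Host == "" {
			return nil, fmt.Errorf("peer %q: invalid endpoint URL %q", nodeID, endpoint)
		}
	}
	return peers, nil
}

func main() {
	peers, err := parsePeerEndpoints(`{"node-b":"http://127.0.0.1:19002"}`)
	fmt.Println(peers, err)
}
```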
Control Plane synthetic config with `authority_required=true` must include a
signed `authority_payload` / `authority_signature` envelope and a
`cluster_authority` descriptor. The agent verifies the signature, validates the
config hash, and rejects mismatched pinned authority values when
`RAP_CLUSTER_AUTHORITY_PUBLIC_KEY`, `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, or the
same fields in `identity.json` are set.
`RAP_MESH_PRODUCTION_FORWARDING_ENABLED` defaults to `false`. When first
introduced it was a gate only: turning it on did not enable production mesh
payload forwarding, and `/mesh/v1/forward` returned an unavailable runtime
response after validating the route-bound production envelope contract. Since
C17Z the gate enables only the narrow `fabric.control` forwarding runtime
described below; all other production traffic remains unavailable.
The production envelope contract requires route, hop, TTL, expiry, payload
length, and SHA-256 payload hash fields. C17J accepts only the
`fabric_control` channel class and `fabric.control` message type for
validation. RDP, VPN, render, file, video, and service workload channels are
rejected.
C17K adds a local metadata-only observation hook after successful production
envelope validation. Observations include route/message/hop/channel metadata and
payload length/hash, not the payload body. Observation failure fails closed, and
the endpoint still does not forward payloads.
C17L adds a bounded in-memory observation sink for accepted metadata-only
observations. The sink drops the oldest observation when full and still stores
no payload bodies.
`RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` defaults to `0`. When set above
zero, C17M wires the bounded metadata-only sink into the node-agent mesh server.
This remains local-only, exposes no read API, stores no payload bodies, and
does not enable production forwarding. C17R rejects values above `10000`.
C17N adds local sink metrics: configured capacity, current depth, accepted
total, and dropped-oldest total. Metrics do not expose observation records,
route IDs, message IDs, hashes, payload metadata, or payload bodies.
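
The bounded sink and its aggregate metrics can be sketched as follows. `Sink`, `Observation`, and `Metrics` are hypothetical types; the drop-oldest behavior, the four metric fields, and the `10000` capacity ceiling follow the descriptions above. In this sketch a capacity of `0` simply means the caller does not create a sink.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const maxSinkCapacity = 10000 // ceiling described for C17R

// Observation holds metadata only; there is no payload body field.
type Observation struct {
	RouteID       string
	MessageID     string
	PayloadLength int
	PayloadSHA256 string
}

// Metrics exposes aggregate counters only, never observation records.
type Metrics struct {
	Capacity           int
	Depth              int
	AcceptedTotal      uint64
	DroppedOldestTotal uint64
}

// Sink is a hypothetical bounded in-memory observation store.
type Sink struct {
	mu       sync.Mutex
	capacity int
	records  []Observation
	accepted uint64
	dropped  uint64
}

// NewSink rejects capacities outside 1..maxSinkCapacity.
func NewSink(capacity int) (*Sink, error) {
	if capacity < 1 || capacity > maxSinkCapacity {
		return nil, errors.New("sink capacity must be between 1 and 10000")
	}
	return &Sink{capacity: capacity}, nil
}

// Record stores o, dropping the oldest observation first when the sink is full.
func (s *Sink) Record(o Observation) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.records) == s.capacity {
		copy(s.records, s.records[1:])
		s.records = s.records[:len(s.records)-1]
		s.dropped++
	}
	s.records = append(s.records, o)
	s.accepted++
}

// Metrics returns a snapshot of the aggregate counters.
func (s *Sink) Metrics() Metrics {
	s.mu.Lock()
	defer s.mu.Unlock()
	return Metrics{Capacity: s.capacity, Depth: len(s.records), AcceptedTotal: s.accepted, DroppedOldestTotal: s.dropped}
}

func main() {
	sink, err := NewSink(2)
	if err != nil {
		panic(err)
	}
	for _, id := range []string{"m1", "m2", "m3"} {
		sink.Record(Observation{RouteID: "route-1", MessageID: id})
	}
	fmt.Printf("%+v\n", sink.Metrics())
}
```

Because `Metrics` is a plain comparable struct, a caller can keep the last logged value and skip logging when a new snapshot is equal.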
C17O logs those aggregate metrics locally from the node-agent loop when the
sink is explicitly enabled. This does not add a read API or Control Plane
reporting.
C17P logs aggregate sink metrics only when they change, so steady heartbeat
loops do not repeat identical local metrics lines.
C17Q logs `production_forwarding_gate_enabled` separately from
`production_forwarding_runtime_enabled`. The runtime field remains `false`;
turning on the gate still does not enable production forwarding.
C17S makes production envelope observation panic-safe. Observer errors and
observer panics both fail closed as observation failure; forwarding remains
unavailable.
C17T limits validated production `fabric.control` envelope payloads to 4096
bytes. Oversized envelopes are rejected before observation.
C17U rejects production `fabric.control` envelopes whose `created_at` is more
than one minute in the future.
C17V adds scoped peer endpoint candidates to synthetic mesh config. Candidate
entries describe possible per-node endpoints with transport, address,
reachability, NAT type, connectivity mode, priority, policy tags, verification
time, and metadata. They are model/config hints only; no production route
scoring, NAT traversal, shortcut routing, or forwarding runtime is implemented.
C17W adds deterministic local scoring for scoped endpoint candidates. Scoring
uses transport, reachability, connectivity mode, NAT type, priority, preferred
region, policy tags, channel class, and verification age. It returns ranked
candidates and reason labels only; it does not select production routes, open
connections, perform NAT traversal, or forward payloads.
C17X extends candidate scoring with optional local health observations keyed by
`endpoint_id`. Observations can contribute latency, success/failure history,
recent failure reason, reliability score, and freshness/staleness signals.
The score remains advisory only and is not wired into production forwarding.
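
To make "deterministic scoring with reason labels" concrete, here is a small sketch. The weights, the reason label strings, the treatment of a larger `priority` as better, and the failure-count overlay are all assumptions chosen for illustration; only the inputs (reachability, policy tags, priority, health observations keyed by `endpoint_id`) come from the descriptions in this README.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a trimmed, hypothetical view of an endpoint candidate.
type Candidate struct {
	EndpointID   string
	Reachability string
	PolicyTags   []string
	Priority     int
}

// Ranked pairs a candidate with its advisory score and reason labels.
type Ranked struct {
	EndpointID string
	Score      int
	Reasons    []string
}

// score applies hypothetical weights; failures is keyed by endpoint_id.
func score(c Candidate, failures map[string]int) Ranked {
	r := Ranked{EndpointID: c.EndpointID, Score: c.Priority}
	if c.Reachability == "public" {
		r.Score += 20
		r.Reasons = append(r.Reasons, "reachability_public")
	}
	for _, tag := range c.PolicyTags {
		switch tag {
		case "private-lan", "corp-lan", "same-site":
			r.Score += 30
			r.Reasons = append(r.Reasons, "policy_tag_"+tag)
		}
	}
	if n := failures[c.EndpointID]; n > 0 {
		r.Score -= 10 * n
		r.Reasons = append(r.Reasons, "recent_failures")
	}
	return r
}

// rank sorts by score descending, then by endpoint ID, so equal inputs
// always produce the same order.
func rank(cands []Candidate, failures map[string]int) []Ranked {
	out := make([]Ranked, 0, len(cands))
	for _, c := range cands {
		out = append(out, score(c, failures))
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].EndpointID < out[j].EndpointID
	})
	return out
}

func main() {
	cands := []Candidate{
		{EndpointID: "node-b-public", Reachability: "public", Priority: 10},
		{EndpointID: "node-b-lan", Reachability: "private", PolicyTags: []string{"private-lan"}, Priority: 10},
	}
	for _, r := range rank(cands, nil) {
		fmt.Println(r.EndpointID, r.Score, r.Reasons)
	}
}
```

The explicit tie-break on `EndpointID` is what makes the ranking deterministic when two candidates score the same.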
C17Z adds the first narrow production forwarding runtime. When
`RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`, `/mesh/v1/forward` can deliver
route-bound `fabric.control` envelopes at the local destination or forward them
to a direct next hop from explicit peer endpoint config. Service channels,
RDP/VPN/file/video payloads, arbitrary relay forwarding, and multi-hop
production route execution remain unavailable.
C17Z1 adds route-path-bound multi-hop forwarding for production
`fabric.control` only. Envelopes may carry `route_path` and
`visited_node_ids`; each relay validates its path position, forwards only to
the next route-path node, updates TTL/hop/visited metadata, and rejects loops.
Service payloads remain unavailable.
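
The per-relay steps described for C17Z1 (validate path position, forward only to the next route-path node, update TTL/hop/visited metadata, reject loops) can be sketched as below. The `Envelope` fields, the `forwardStep` name, the rule that the visited list must equal the route-path prefix before the local node, and the reading of TTL as "remaining forwards" are assumptions made for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Envelope carries only the routing metadata needed for this sketch.
type Envelope struct {
	RoutePath      []string
	VisitedNodeIDs []string
	TTL            int
	HopCount       int
}

// forwardStep validates the local node's position and returns the next hop.
// An empty next hop with a nil error means the local node is the destination.
func forwardStep(local string, e *Envelope) (string, error) {
	pos := -1
	for i, id := range e.RoutePath {
		if id == local {
			pos = i
			break
		}
	}
	if pos < 0 {
		return "", errors.New("local node is not on the route path")
	}
	for _, id := range e.VisitedNodeIDs {
		if id == local {
			return "", errors.New("loop: local node already visited")
		}
	}
	if len(e.VisitedNodeIDs) != pos {
		return "", errors.New("path position mismatch")
	}
	for i, id := range e.VisitedNodeIDs {
		if e.RoutePath[i] != id {
			return "", errors.New("visited nodes diverge from route path")
		}
	}
	if pos == len(e.RoutePath)-1 {
		return "", nil // destination: deliver locally, do not forward
	}
	if e.TTL < 1 {
		return "", errors.New("ttl exhausted")
	}
	e.TTL--
	e.HopCount++
	e.VisitedNodeIDs = append(e.VisitedNodeIDs, local)
	return e.RoutePath[pos+1], nil
}

func main() {
	e := &Envelope{RoutePath: []string{"node-a", "node-r", "node-b"}, TTL: 2}
	for _, node := range e.RoutePath {
		next, err := forwardStep(node, e)
		fmt.Printf("%s -> %q (err=%v, ttl=%d)\n", node, next, err, e.TTL)
	}
}
```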
C17Z2 emits local `mesh_production_forward_event` logs for production
`fabric.control` forwarding outcomes: accepted, forwarded, delivered, and
rejected. Logs include route/message/hop/channel/status/reason/TTL/hop count/
route path length/visited count/payload length metadata only. Payload bodies
are not logged, no observation read API is added, and service payloads remain
unavailable.
C17Z3 binds production `fabric.control` forwarding to loaded scoped or
Control Plane route config when routes are available locally. Configured
envelopes must match `route_id`, cluster, source, destination, route path,
next hop, allowed channel, expiry, max TTL, and max hop count before
forwarding. If no route config is present, existing C17Z1 behavior is
preserved. Service payloads remain unavailable.
C17Z4 adds scoped peer directory and recovery seed config. `peer_directory`
describes only peers needed by the node-scoped mesh config. `recovery_seeds`
is an explicit, bounded bootstrap/recovery list and is not a full cluster node
list. The node-agent parses and validates these fields, but does not yet
implement a persistent connection manager, NAT traversal, or
relay/rendezvous runtime.
C17Z5 turns scoped peer directory and recovery seed config into node-local
runtime `PeerCache` state. The cache builds a bounded warm peer set from
route-adjacent peers, recovery seeds, peer endpoints, and endpoint candidates.
When synthetic mesh testing is enabled, the node-agent probes warm peers with
`/mesh/v1/health` and reports metadata-only mesh-link observations. This is not
a persistent connection manager and does not forward service payloads.
C17Z6 adds advertised mesh endpoint reporting. When
`RAP_MESH_ADVERTISE_ENDPOINT` is set, node-agent includes a
`mesh_endpoint_report` in heartbeat metadata with transport, connectivity mode,
NAT hint, region, observed time, and endpoint candidate metadata. Control Plane
can project the latest reported endpoint into node-scoped synthetic mesh config
for route-path peers. This does not perform automatic public IP discovery,
STUN/TURN/ICE NAT classification, or service payload forwarding.
C17Z7 adds `RAP_MESH_ADVERTISE_ENDPOINTS_JSON` for multiple advertised
endpoints per node. Candidates can describe public, private, corporate/LAN,
outbound, or relay-style addresses. Endpoint scoring rewards `private-lan`,
`corp-lan`, and `same-site` policy tags, and peer cache can use the best
candidate address for warm-peer health probes. This supports corporate-network
cluster segments without enabling service payload forwarding.
C17Z8 adds a node-local peer connection state machine on top of warm-peer
health probes. Warm peers move through `disconnected`, `connecting`, `ready`,
`degraded`, and `backoff`; repeated probe failures enter bounded backoff, and
successful probes recover to `ready`. Mesh-link observations include
metadata-only connection state. This is not a persistent socket/session manager
and does not forward service payloads.
C17Z9 adds a node-local peer recovery planner. The node targets a bounded
stable ready-peer set, defaulting to three connectable peers when available,
instead of probing every known cluster node. When ready peers fall below target,
the planner selects bounded recovery probes from warm peers, recovery seeds,
and other connectable scoped peers, skipping active backoff entries. Heartbeats
include metadata-only `mesh_peer_recovery_report` state. This is not persistent
connection transport, NAT traversal, relay/rendezvous runtime, or service
payload forwarding.
C17Z10 adds a node-local peer connection intent planner over the C17Z9 recovery
plan. It classifies bounded peer work as `maintain`, `probe`, or `recover`,
and classifies transport readiness as `direct`, `private_lan`,
`corporate_lan`, `outbound_only`, or `relay_required`. Heartbeats include
metadata-only `mesh_peer_connection_intent_report` counts. This is not
persistent connection transport, STUN/TURN/ICE, NAT traversal, relay runtime,
or service payload forwarding.
C17Z11 adds the first real node-local peer connection manager for mesh
control-plane health. It uses a reusable HTTP keep-alive client to probe
direct/private/corporate peer endpoints selected by C17Z10 intents, updates
the shared peer connection tracker, and records `waiting_rendezvous` for
outbound-only or relay-required peers. Heartbeats include metadata-only
`mesh_peer_connection_manager_report` state. This is not STUN/TURN/ICE,
relay/rendezvous runtime, route lease generation, VPN runtime, or service
payload forwarding.
C17Z12 adds a node-scoped rendezvous/relay control-plane lease contract for
peers that would otherwise remain `waiting_rendezvous`. The agent consumes
`rendezvous_leases`, resolves matching intents into `relay_control`, probes the
relay node `/mesh/v1/health`, and records `relay_ready` for the peer control
path. This remains control-plane health only and does not enable RDP/VPN/file/
video/service payload forwarding, arbitrary relay packet forwarding,
STUN/TURN/ICE, or host networking changes.
C17Z13 adds heartbeat telemetry for rendezvous lease admission and renewal
posture. The agent emits `mesh_rendezvous_lease_report` with local role,
relay/peer admission counts, TTL, renewal-after time, renewal-needed status,
`relay_ready`, and explicit no-payload boundary flags. This remains
metadata-only control-plane telemetry and does not enable service payload
forwarding.
C17Z14 adds a control-plane refresh contract for rendezvous leases. When a
lease is renewal-needed, expired, invalid, or tied to a stale relay state, the
agent reloads node-scoped synthetic config from Control Plane, updates the
running peer cache/route/lease state, and reports refresh counters plus stale
relay withdrawal/reselection fields. This remains control-plane health only
and does not enable service payload forwarding.
C17Z15 adds the node side of backend relay replacement policy. The agent
advertises the relay replacement contract capability and emits
`c17z15.mesh_rendezvous_lease_report.v1`; stale relay state is matched to the
exact rendezvous lease/relay when that metadata is present, so an alternate
replacement lease for the same peer is not treated as stale by association.
This remains control-plane health only and does not enable service payload
forwarding.
C17Z16 adds route/path decision reporting. The agent consumes
`route_path_decisions` from Control Plane synthetic config, keeps the latest
control-plane generation in local state, and emits
`c17z18.mesh_route_path_decision_report.v1` with effective hops, previous/next
hop, selected replacement relay, generation, and no-payload boundary flags.
This remains metadata-only route planning and does not enable service payload
forwarding.
C17Z17 adds node-side route generation tracking for Control Plane
`route_path_decisions`. The agent emits
`c17z18.mesh_route_generation_report.v1` with active, applied, unchanged, and
withdrawn decision counts, total counters, generation change state, active
decision details, and withdrawn decision details. When the first observed
config already contains a stale relay replacement, the tracker emits a
`withdrawn_by_replacement` record for the old relay path. This remains
metadata-only route planning and does not enable service payload forwarding.
C17Z18 applies Control Plane `route_path_decisions` to synthetic route-health
route config only. The agent keeps base routes separate from route-health
routes, periodically refreshes scoped config, emits
`c17z18.mesh_route_health_config_report.v1`, and reports route-health
observations with expected/observed hops and drift status. This probes
replacement relay effective paths for control-plane health only and does not
enable service payload forwarding.
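
Expected/observed drift detection amounts to comparing two hop lists. A minimal sketch, assuming hypothetical status labels (`match`, `drift`, `unobserved`) that are not taken from the agent's report schema:

```go
package main

import "fmt"

// driftStatus compares the hops a route decision expects with the hops a
// route-health probe actually observed. The labels are illustrative only.
func driftStatus(expected, observed []string) string {
	if len(observed) == 0 {
		return "unobserved"
	}
	if len(expected) != len(observed) {
		return "drift"
	}
	for i := range expected {
		if expected[i] != observed[i] {
			return "drift"
		}
	}
	return "match"
}

func main() {
	expected := []string{"node-a", "node-r2", "node-b"}
	fmt.Println(driftStatus(expected, []string{"node-a", "node-r2", "node-b"}))
	fmt.Println(driftStatus(expected, []string{"node-a", "node-r1", "node-b"}))
	fmt.Println(driftStatus(expected, nil))
}
```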
Scoped synthetic config shape:
```json
{
  "schema_version": "c17z18.synthetic.v1",
  "cluster_id": "cluster-1",
  "local_node_id": "node-a",
  "config_version": "config-v1",
  "peer_directory_version": "peers-v1",
  "policy_version": "policy-v1",
  "peer_endpoints": {
    "node-b": "http://127.0.0.1:19002"
  },
  "peer_endpoint_candidates": {
    "node-b": [
      {
        "endpoint_id": "node-b-public",
        "node_id": "node-b",
        "transport": "direct_tcp_tls",
        "address": "203.0.113.20:443",
        "reachability": "public",
        "nat_type": "restricted",
        "connectivity_mode": "direct",
        "priority": 10
      }
    ]
  },
  "routes": [],
  "route_path_decisions": {
    "schema_version": "c17z18.route_path_decisions.v1",
    "decisions": []
  }
}
```
## C17E Live Synthetic Smoke
Run:
```powershell
cd agents\rap-node-agent
go run .\cmd\mesh-live-smoke
```
Expected:
- scoped synthetic config loads
- direct `node-a -> node-b` synthetic probe succeeds
- relay `node-a -> node-r -> node-b` synthetic probe succeeds
- bounded `synthetic.echo` test-service succeeds
- `production_forwarding=false`
## Safety Rules
- The agent never assigns roles to itself.
- The agent reports capabilities only.
- Platform policy assigns roles.
- No RDP/VPN/production service traffic is carried by the C17A-C17Z18 staged
mesh runtime.
- Production forwarding remains disabled by default and limited to
`fabric.control` when explicitly enabled.
- No privileged operations are performed by the current agent.