diff --git a/- b/- new file mode 100644 index 0000000..7cd6e43 Binary files /dev/null and b/- differ diff --git a/.gitignore b/.gitignore index ebdbfcb..07b8f7a 100644 --- a/.gitignore +++ b/.gitignore @@ -1,64 +1,65 @@ -# OS -.DS_Store -Thumbs.db - -# IDE -.idea/ -.vscode/ -*.swp -*.swo - -# Logs -*.log -logs/ -tmp/ -temp/ - -# Env -.env -.env.* -!.env.example - -# Go -backend/bin/ -backend/.cache/ -backend/vendor/ -*.test -coverage.out - -# C/C++ -build/ -cmake-build-*/ -CMakeFiles/ -CMakeCache.txt -compile_commands.json -*.o -*.obj -*.so -*.a -*.dll -*.exe - -# .NET / WPF -bin/ -obj/ -*.user -*.suo - -# Node / React -node_modules/ -dist/ -build/ -.next/ -coverage/ - -# Python (if scripts appear later) -__pycache__/ -*.pyc - -# Docker -*.local.yml - +# OS +.DS_Store +Thumbs.db + +# IDE +.idea/ +.vscode/ +*.swp +*.swo + +# Logs +*.log +logs/ +tmp/ +temp/ + +# Env +.env +.env.* +!.env.example + +# Go +backend/bin/ +backend/.cache/ +backend/vendor/ +*.test +coverage.out + +# C/C++ +build/ +cmake-build-*/ +CMakeFiles/ +CMakeCache.txt +compile_commands.json +*.o +*.obj +*.so +*.a +*.dll +*.exe + +# .NET / WPF +bin/ +obj/ +*.user +*.suo + +# Node / React +node_modules/ +dist/ +build/ +.next/ +coverage/ + +# Python (if scripts appear later) +__pycache__/ +*.pyc + +# Docker +*.local.yml + # Generated artifacts artifacts/ -out/ \ No newline at end of file +out/ +web-admin/deploy/html/downloads/ diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..56acd09 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,11 @@ +## Shared Test Docker Host + +- Do not use local Docker Desktop for this project. This Windows workstation runs inside a VM, so nested virtualization / local Docker is not a supported path. +- Use the shared test Docker host for all Docker builds, compose runs, container tests, and image checks. +- Use SSH alias `test-docker` / `docker-test`. 
+- Host: `docker-test.cin.su` (`192.168.200.61`) +- SSH user: `test` +- Preferred Docker endpoint when Docker CLI is available: `ssh://test-docker` +- Current working Docker context may be `test-ubuntu`; it points to the shared test Docker host. +- Portainer: `http://docker-test.cin.su:9000/`, user `admin` +- Do not store the password in repositories or project files; use an SSH key for persistent access. diff --git a/CODEX_CONTEXT.md b/CODEX_CONTEXT.md index aa88f3c..e98824c 100644 --- a/CODEX_CONTEXT.md +++ b/CODEX_CONTEXT.md @@ -1,618 +1,2760 @@ -# CODEX CONTEXT - -## Project identity - -This project is a production-grade distributed secure access platform. - -It started as a custom RDP proxy with persistent server-side sessions, but the final target architecture is broader: - -- distributed secure access fabric -- multi-tenant platform -- session broker for GUI and future non-GUI protocols -- cluster mesh of nodes -- connector/VPN layer -- customer-managed and platform-managed nodes -- node-agent based self-update / rollback / health supervision - -## Current proven foundation - -The current codebase already proved the most risky low-level lifecycle assumptions for RDP: - -- real FreeRDP connect works -- session state transitions to active work -- terminate works -- detach works without killing the remote session -- reattach works without recreating the remote session -- takeover works without recreating the remote session -- per-resource certificate verification policy exists -- `certificate_verification_mode = strict | ignore` -- `strict` is default -- `ignore` works on a per-resource basis -- worker build is reproducible -- backend build is reproducible - -This proven lifecycle must NOT be broken by future architecture work. 
- -## Current architecture baseline - -Current audit and baseline snapshot: - -- `docs/audits/PROJECT_AUDIT_2026-04-26.md` -- `docs/audits/CURRENT_BASELINE_MATRIX.md` - -### Test environment -- Canonical test Docker host: `192.168.200.61` -- Canonical Docker context: `test-ubuntu` -- Canonical SSH alias: `docker-test` -- Backend API for local/client smoke runs: `http://192.168.200.61:8080/api/v1` -- WebSocket gateway for local/client smoke runs: `ws://192.168.200.61:8080/api/v1/gateway/ws` -- Stage C17 planning is completed. -- C17A synthetic mesh runtime skeleton is implemented and test-proven in - `rap-node-agent` only. It is disabled by default and carries synthetic - `fabric.probe` / `fabric.probe_ack` messages only. -- C17B route health and failover probes are implemented and test-proven in - `rap-node-agent` only. They are disabled by default and carry synthetic - `fabric.route_health` / `fabric.route_health_ack` messages only. -- C17C relay semantic hardening is implemented and test-proven in - `rap-node-agent` only. It is disabled by default and models synthetic - per-channel queues/QoS/backpressure only. -- C17D non-production test-service path is implemented and test-proven in - `rap-node-agent` only. It is disabled by default and carries only bounded - `synthetic.echo` test payloads. -- C17E/C17F/C17G are implemented and proven for live synthetic HTTP transport, - scoped synthetic route config, and Control Plane scoped synthetic config - consumption. -- C17H deployed multi-agent synthetic config smoke is runtime-proven on - `docker-test`: five running `rap-node-agent` containers consume - backend-issued node-scoped synthetic config, direct and single-relay - synthetic route-health observations return to the Control Plane, and - production forwarding remains disabled. 
-- C17I production forwarding gate foundation is implemented and test-proven: - `rap-node-agent` has an explicit production-forwarding gate, while - `/mesh/v1/forward` still refuses production payload forwarding until a later - approved runtime stage. -- C17J production envelope contract is implemented and test-proven: - `/mesh/v1/forward` validates route-bound production envelopes for - `fabric_control` / `fabric.control` only when the gate is enabled, rejects - service channels, and still refuses production forwarding. -- C17K production envelope observation is implemented and test-proven: - valid accepted envelopes can be observed locally as metadata-only records - after validation; rejected envelopes are not observed, observation failure - fails closed, and production forwarding remains unavailable. -- C17L bounded production observation sink is implemented and test-proven: - accepted metadata-only observations can be retained locally with fixed - capacity, oldest-entry drop behavior, and no payload body storage. -- C17M production observation sink wiring is implemented and test-proven: - node-agent can wire the bounded local metadata-only sink when - `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is explicitly greater than - zero; the wiring is disabled by default and exposes no read API. -- C17N production observation sink metrics are implemented and test-proven: - local sink metrics expose only capacity, current depth, accepted total, and - dropped-oldest total; they expose no observation records or payload metadata. -- C17O production observation sink local metrics logging is implemented and - test-proven: node-agent logs aggregate sink metrics locally when the sink is - explicitly enabled; no read API or Control Plane reporting is added. -- C17P production observation sink change-driven metrics logging is implemented - and test-proven: node-agent suppresses repeated identical local sink metrics - logs; no read API or Control Plane reporting is added. 
-- C17Q production forwarding gate/runtime log boundary is implemented and - test-proven: node-agent logs production forwarding gate state separately from - production forwarding runtime state. Runtime state remained false until - C17Z introduced gate-controlled `fabric.control` direct forwarding. -- C17R production observation sink capacity guard is implemented and - test-proven: `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is rejected - above `10000`. -- C17S production observation panic fail-closed hardening is implemented and - test-proven: observer errors and observer panics both fail closed as - observation failure. -- C17T production envelope payload boundary is implemented and test-proven: - validated production `fabric.control` envelope payloads are bounded to - `4096` bytes and oversized envelopes are rejected before observation. -- C17U production envelope created-at skew boundary is implemented and - test-proven: validated production `fabric.control` envelopes whose - `created_at` is more than one minute in the future are rejected before - observation. -- C17V peer endpoint candidate model is implemented and test-proven: - node-scoped synthetic mesh config now carries route-scoped endpoint - candidates with transport, address, reachability, NAT type, connectivity - mode, priority, policy tags, verification time, and metadata. This is a - model/config boundary only; no production route scoring, NAT traversal, - shortcut routing, or forwarding runtime is implemented. -- C17W peer endpoint candidate scoring model is implemented and test-proven: - `rap-node-agent` can rank already-scoped endpoint candidates using soft - inputs such as transport, reachability, connectivity mode, NAT type, - priority, region, policy tags, channel class, and verification age. This is - a scoring helper only; it does not open connections, choose production - routes, or forward payloads. 
-- C17X health-aware endpoint candidate scoring overlay is implemented and - test-proven: endpoint candidate scoring can optionally use local health - observations keyed by `endpoint_id`, including latency, success/failure - history, recent failure reason, reliability score, and observation freshness. - This remains advisory scoring only and is not wired into production route - execution. -- C17Y Platform Owner synthetic mesh visibility is implemented and - build/test-proven: `web-admin` reads node-scoped synthetic mesh config and - shows config enabled state, route counts, peer endpoints, endpoint - candidates, C17X advisory scoring boundary, and `production_forwarding`. - This remains platform-owner visibility only and does not enable production - forwarding. -- C17Z production fabric-control direct forwarding boundary is implemented and - test-proven: when `RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`, - `/mesh/v1/forward` can deliver valid route-bound `fabric.control` envelopes - at the local destination or forward them to a direct next hop from explicit - peer endpoint config. Service channels, arbitrary relay forwarding, - multi-hop production route execution, and RDP/VPN/file/video/service payloads - remain unavailable. -- C17Z1 production fabric-control multi-hop route-path boundary is implemented - and test-proven: production `fabric.control` envelopes can carry - `route_path` and `visited_node_ids`; relay nodes validate path position, - forward only to the next path node, update TTL/hop/visited metadata, and - reject loops. Service payloads remain unavailable. -- C17Z2 production fabric-control forwarding observability boundary is - implemented and test-proven: node-agent emits local - `mesh_production_forward_event` logs for accepted, forwarded, delivered, and - rejected production `fabric.control` envelopes. Logs are metadata-only and - include no payload bodies or read API. 
-- C17Z3 production fabric-control route-config boundary is implemented and - test-proven: when scoped/control-plane mesh routes are available locally, - production `fabric.control` envelopes must match configured route_id/path/ - next-hop/channel/expiry/TTL/hop limits before forwarding. -- C17Z4 scoped peer directory and recovery seeds boundary is implemented and - test/build-proven: node-scoped mesh config carries scoped `peer_directory` - and explicit bounded `recovery_seeds`; node-agent parses/validates them and - web-admin shows counts. -- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven: - node-agent builds a local `PeerCache`, selects bounded warm peers, probes warm - peers with `/mesh/v1/health`, and reports metadata-only mesh-link - observations when synthetic mesh testing is enabled. -- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven: - node-agent reports explicit advertised mesh endpoint metadata in heartbeat, - and Control Plane projects latest reported endpoints/candidates into - node-scoped synthetic mesh config. -- C17Z7 private/corporate endpoint candidate boundary is implemented and - test-proven: node-agent reports multiple advertised endpoint candidates, - scoring rewards private/corporate same-site candidates, and peer cache can - use the best candidate address for warm health. -- C17Z8 peer connection state machine boundary is implemented and test-proven: - node-agent tracks warm-peer states `disconnected`, `connecting`, `ready`, - `degraded`, and `backoff`, with bounded backoff after repeated health probe - failures. -- C17Z9 peer recovery planner boundary is implemented and test-proven: - node-agent targets a bounded stable ready-peer set, enters recovery when - ready peers fall below target, and selects bounded recovery probes from warm - peers, recovery seeds, and other connectable scoped peers. 
-- C17Z10 peer connection intent planner boundary is implemented and - test-proven: node-agent classifies bounded peer work as maintain/probe/ - recover and classifies transport readiness as direct/private_lan/ - corporate_lan/outbound_only/relay_required, with rendezvous-required - metadata only. -- C17Z11 peer connection manager runtime boundary is implemented and - test-proven: node-agent uses a reusable HTTP keep-alive client for real - control-plane health probes of direct/private/corporate peers and records - `waiting_rendezvous` for outbound-only/relay-required peers. -- C17Z12 rendezvous/relay control-plane contract is implemented and - docker-test-runtime-proven: backend issues node-scoped `rendezvous_leases`, - node-agent resolves matching `waiting_rendezvous` intents into - `relay_control`, probes relay `/mesh/v1/health`, records and maintains - `relay_ready`, and keeps service payload forwarding disabled. -- C17Z13 rendezvous lease telemetry is implemented and - docker-test-runtime-proven: node-agent reports - `mesh_rendezvous_lease_report` with relay admission, peer admission, - TTL/renewal posture, `relay_ready`, and explicit no-payload boundary flags; - web-admin shows `rv leases` in recent heartbeat tables. -- C17Z14 rendezvous lease refresh contract is implemented and - docker-test-runtime-proven: node-agent refreshes renewal-needed/stale - rendezvous leases through node-scoped synthetic config reload, updates the - running peer cache/route/lease state, and reports refresh plus stale relay - withdrawal/reselection telemetry. Service payload forwarding remains - unavailable. 
-- C17Z15 backend relay replacement policy is implemented and - docker-test-runtime-proven: backend consumes recent stale-relay heartbeat - feedback, withdraws stale explicit rendezvous leases, scores alternate relay - candidates from route adjacency, endpoint priority, policy tags, and recent - mesh-link health, and returns replacement leases plus - `rendezvous_relay_policy` decisions in node-scoped synthetic config. - Node-agent reports `c17z15.mesh_rendezvous_lease_report.v1` and keeps stale - state scoped to the exact lease/relay, so replacement leases for the same - peer are not marked stale by association. Service payload forwarding remains - unavailable. -- C17Z16 route/path decision artifact is implemented and - docker-test-runtime-proven: backend `c17z16.synthetic.v1` config includes - `route_path_decisions` with original hops, effective hops, local previous/ - next hop, selected replacement relay, generation, score reasons, and - no-payload boundary flags. Node-agent stores the control-plane route - generation and reports `c17z16.mesh_route_path_decision_report.v1` plus - `c17z16.mesh_rendezvous_lease_report.v1`. Service payload forwarding remains - unavailable. -- C17Z17 node-side route generation tracker is implemented and - docker-test-runtime-proven: backend `c17z17.synthetic.v1` config and - node-agent `mesh_route_generation_report` track active/applied/unchanged/ - withdrawn route decisions, generation changes, total counters, and - `withdrawn_by_replacement` records for stale relay paths when replacement is - first observed. Service payload forwarding remains unavailable. -- C17Z18 synthetic route-health effective path runtime is implemented and - docker-test-runtime-proven: backend `c17z18.synthetic.v1` config and - node-agent `mesh_route_health_config_report` apply Control Plane - `route_path_decisions` to synthetic route-health route config only. 
The - synthetic runtime probes selected effective paths through replacement relays, - reports expected/observed hops and drift state, and backend latest mesh links - preserve route-health observations separately from connection-manager - observations. Service payload forwarding remains unavailable. -- C17Z19 synthetic route-health feedback scoring is implemented and - docker-test-runtime-proven: backend consumes recent `synthetic_route_health` - observations in relay scoring, uses drift/unreachable/failure metadata to - mark the exact selected relay stale, boosts healthy low-latency relay - candidates, and returns replacement leases/route decisions through the - existing synthetic config contract. Migration `000022` adds the `synthetic` - mesh service class. Service payload forwarding remains unavailable. -- C17Z20 node-side route-health feedback refresh is implemented and - docker-test-runtime-proven: after reporting synthetic route-health - drift/unreachable/failure, node-agent performs a bounded node-scoped - synthetic-config refresh, applies returned replacement route decisions to - route-health config immediately, and reports - `c17z20.mesh_route_health_feedback_refresh_report.v1`. Service payload - forwarding remains unavailable. -- Installation Authority foundation is implemented: production requires strict - Product Root public key config, first-owner bootstrap uses signed Ed25519 - activation manifests, `installation_authority` and signed - `platform_role_grants` are persisted, and strict platform-admin checks ignore - direct `users.platform_role` database edits without a valid signed grant. - Web-admin exposes installation status/first-owner bootstrap, and - `scripts/installation/product-root-tool.go` generates keys/manifests for - offline product-root operations. 
-- Cluster Authority and node enrollment bootstrap are docker-test lifecycle - smoke-proven in run `dev-bootstrap-20260428-201430`: a fresh dev install - bootstrapped the first owner, created a cluster, issued a signed join token, - accepted real `rap-node-agent` enrollment, owner-approved the join request, - agent-polled signed bootstrap, persisted cluster authority pin, heartbeated, - and verified signed `c17z18.synthetic.v1` Control Plane config. Production - service payload forwarding remains unavailable. -- Migration `000021_cluster_authority_keys` drops/recreates - `cluster_admin_summaries` because fresh replay proved PostgreSQL cannot - change that view layout via `CREATE OR REPLACE VIEW`. -- `rap-node-agent` desired-workload polling/status reporting is gated by - `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime - supervision remains a stub. -- C18 VPN/IP tunnel service target design is completed as documentation only. -- C18A VPN/IP tunnel control-plane data model foundation is implemented and - backend-test-proven. -- C18B VPN/IP tunnel lease/fencing hardening is implemented and - backend-test-proven. -- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is - implemented and backend-test-proven. -- No next platform-core implementation step is automatically authorized after - C17Z20. The next mesh layer should stay limited to route-health feedback - refresh dampening/no-change cooldown unless the user explicitly chooses - another staged task. -- Latest RDP performance reference image: - `rap-rdp-worker:rdp-perf6-dirty-region` -- Stage 5.2 file-download runtime artifacts remain preserved for when RDP work - resumes, but they are not the active next task. -- Do not use `docker.cin.su` for this project unless explicitly requested for a separate one-off check. 
- -### Backend -- Go -- PostgreSQL = source of truth -- Redis = live coordination / routing only -- REST for control plane -- WebSocket for live session channel - -### Worker -- C++ worker -- FreeRDP integration -- worker runtime hides FreeRDP details from backend -- The C++ worker remains the primary RDP runtime. -- Target RDP performance direction: `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`. -- The RDP performance rewrite scope is limited to C++ RDP service adapter - internals. It must not redesign backend control plane, cluster transport, - organizations, leases, or session lifecycle. -- The C# RDP service skeleton is inactive research scaffolding and is not the - current runtime direction. -- Current RDP Adapter baseline: RDP-Perf-6 dirty-region direct binary rendering - is completed and smoke-proven on `docker-test`. RDP work is paused by product - decision; next active work is Fabric Core / cluster foundation. -- P3/P3.1 security-readiness foundation exists: production mode rejects - plaintext credential-like resource metadata, requires `secret_ref` for - RDP/VNC/SSH resources, and has an encrypted PostgreSQL-backed resource secret - storage/resolver MVP. P3.2 direct-worker TLS/PKI guard exists. -- P3.3 production-like test-stand smoke is complete on `docker-test`: backend - runs in `APP_ENV=production` with a test-only secret key file, a secret-backed - RDP resource starts real sessions through the resolver path, metadata/audit do - not contain plaintext credentials, and backend gateway fallback remains - available when direct worker WSS trust is `smoke_insecure`. -- P3.4 production direct-worker WSS trust model is documented in - `docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`; it defines - platform CA/public CA behavior, worker certificate SAN/identity requirements, - app-local Windows trust direction, rotation/revocation, and the future - `platform_ca` smoke plan. No RDP runtime behavior changed in P3.4. 
-- P3.5 app-local platform CA trust is implemented and runtime-proven on - `docker-test`: Windows client validates direct worker WSS with an app-local - platform CA bundle, keeps hostname/SAN validation enabled, selects - `direct_worker_wss` without insecure TLS bypass, and falls back to backend - gateway for unknown CA / smoke-only production cases. -- P3.6 stale Redis worker/live event idempotency is implemented and - runtime-proven: stale worker events for terminal PostgreSQL sessions are - ignored, backend restart survives stale Redis events, and terminal sessions - are not reopened. -- Stage 5.2 server-to-client file download core data path is runtime-proven: - direct worker WSS and backend gateway fallback both download text/binary - files from `RAP_Transfers\ToClient` with matching size/hash, and direct - policy blocking is proven for `disabled` and `client_to_server`. Lifecycle - blocking is also runtime-proven for detach, old-client takeover, and worker - failure. Runtime report: - `artifacts/stage5-2-file-download-runtime-report.md`. -- Stage 5.2 is not fully accepted yet. Remaining proof: Windows desktop UI - download path and regression matrix for rendering/input/clipboard/upload/ - reconnect/takeover. - -### Clients -- future native clients: - - Windows: native desktop client first - - Linux: native desktop client later -- web UI is admin/control plane, not the primary power-user client - -## Final architecture direction - -The long-term target architecture is documented in: - -- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` -- `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md` -- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md` - -This document defines the target Secure Access Fabric architecture only. It is not the current implementation scope and must not be used as permission to start mesh, VPN, multi-cluster, updater, or realtime data-plane migration work without an explicit staged prompt. 
- -`CLUSTER_NODE_ADMIN_FOUNDATION.md` defines the next platform-core planning -baseline for clusters, node enrollment, native node-agent identity, platform -admin console, multi-cluster administration, and future organization admin -visibility. It is a staged foundation document, not permission to implement -mesh packet routing or VPN runtime. - -`WEB_INGRESS_AND_ADMIN_UI_MODEL.md` defines WEB as HTTP/HTTPS ingress and -Admin UI presentation only. Cluster configuration remains Control Plane -ownership through scoped APIs, PostgreSQL source-of-truth mutations, and audit. -Dynamic pages must be safe schema-driven projections and must not embed -internal topology, peer caches, route caches, secrets, raw credentials, or -arbitrary executable code. - -Admin endpoint placement is explicit. Fabric Storage / Config Storage nodes do -not automatically host or move the cluster panel. Platform Owner Console -remains global platform-owner scope. Cluster Admin Endpoint requires explicit -admin/web ingress role assignment, cluster health/trust readiness, and Control -Plane authorization. Organization Admin Panel remains a tenant-safe projection. - -The final platform must support: - -1. Multi-tenancy / Organizations -- platform has many organizations -- each organization has isolated users, groups, resources, policies, audit, connectors -- users may belong to multiple organizations -- organization admins only see their organization -- platform admins see platform scope - -2. Identity federation -- local users -- LDAP / Active Directory -- OIDC -- future extensibility for more identity sources -- access mappings based on external groups / claims - -3. Cluster of nodes -- no mandatory single central node -- many nodes across many sites -- nodes can be platform-managed or customer-managed -- customer-managed nodes are sandboxed cluster participants, not full cluster owners - -4. 
Node agent -- small stable always-running agent on every node -- supervises services -- downloads updates -- verifies signed artifacts -- can rollback to previous version -- can restart crashed services -- can work on thin or thick nodes - -5. Service-based node model -Each node is not monolithic. -A node has: -- capabilities: what it can do physically/technically -- enabled services: what it is allowed/assigned to do - -Possible services include: -- ingress-gateway -- mesh-router -- relay -- connector-host -- vpn-adapter -- session-worker -- media-relay -- file-relay -- update-cache -- config-replica -- audit-sink -- metrics-exporter - -6. Cluster mesh and routing -- encrypted inter-node communication -- dynamic topology -- no need for full mesh -- multi-hop routing allowed -- route failover -- client failover between ingress nodes -- connector failover between nodes - -7. Split-brain prevention -- quorum-based cluster behavior -- minority partition must not become a second authoritative cluster -- degraded / recovery / isolated modes -- manual recovery / promote decision by platform recovery admin - -8. Connector / VPN layer -- connectors are reusable network access methods -- one connector may be used by multiple resources -- connector placement and failover are controlled by policy -- nodes may be allowed or disallowed to host connectors -- direct access, VPN, relay and future egress modes must fit this model - -9. Future exit mode -- split tunnel -- full tunnel -- internet access through cluster -- not first implementation priority - -## Non-negotiable design rules - -- Do not rewrite proven session lifecycle carelessly. -- Do not turn Redis into a source of truth. -- Do not make certificate-ignore a global worker setting. -- Do not make customer-managed nodes platform-wide trusted by default. -- Do not create a separate cluster per organization. -- Do not assume a single permanently reachable central node. 
-- Do not rely on “secret protocol with no docs” as security. -- Security must come from crypto, auth, isolation, policy and observability. -- Prefer incremental evolution from current proven system. -- Do not collapse platform control plane and data plane into one vague layer. - -## Implementation strategy - -The codebase must evolve in phases. - -Current implementation focus remains: -- RDP work is paused by product decision -- preserve the accepted RDP Adapter baseline and Stage 5.x file-transfer work -- do not delete or rewrite the current RDP MVP while platform-core work starts -- C1-C9 platform-core foundations are implemented and verified: clusters, - node enrollment, node-agent scaffold, platform admin console, workload - supervision contract, mesh control-plane prep, mesh skeleton, multi-cluster - hardening, and organization admin foundation -- C10 Fabric Core configuration distribution design is completed -- C11 signed scoped cluster snapshot model is completed -- C12 node local state store is completed -- C13 Fabric Storage / Config Storage service foundation is completed -- C14 peer directory and cache model is completed -- C15 Fabric Routing Engine skeleton is completed -- C16 secure node-to-node channel lifecycle is completed -- C17 mesh routing runtime implementation plan is completed -- C17A synthetic mesh runtime skeleton is implemented and test-proven with - synthetic fabric messages only, no RDP/VPN/production service traffic -- C17B route health and failover probes are implemented and test-proven with - synthetic traffic only, no RDP/VPN/production service traffic -- C17C relay semantic hardening is implemented and test-proven with synthetic - channel classes only, no RDP/VPN/production service traffic -- C17D non-production test-service path is implemented and test-proven with - bounded `synthetic.echo` traffic only, no RDP/VPN/production service traffic -- C17E live node-to-node synthetic HTTP transport is implemented and - smoke-proven with 
synthetic traffic only -- C17F scoped synthetic route config loading and route-health reporting is - implemented and smoke-proven with synthetic traffic only -- C17G Control Plane scoped synthetic config read/consume is implemented and - test-proven with synthetic traffic only -- C17H deployed multi-agent synthetic config smoke is implemented and - runtime-proven on `docker-test` with synthetic traffic only -- C17I production forwarding gate foundation is implemented and test-proven; - production forwarding remains unavailable -- C17J production envelope contract validation is implemented and test-proven; - production forwarding remains unavailable -- C17K production envelope observation is implemented and test-proven; - production forwarding remains unavailable -- C17L bounded production observation sink is implemented and test-proven; - production forwarding remains unavailable -- C17M production observation sink wiring is implemented and test-proven; - production forwarding remains unavailable -- C17N production observation sink metrics are implemented and test-proven; - production forwarding remains unavailable -- C17O production observation sink local metrics logging is implemented and - test-proven; production forwarding remains unavailable -- C17P production observation sink change-driven metrics logging is implemented - and test-proven; production forwarding remains unavailable -- C17Q production forwarding gate/runtime log boundary is implemented and - test-proven; production forwarding remains unavailable -- C17R production observation sink capacity guard is implemented and - test-proven; production forwarding remains unavailable -- C17S production observation panic fail-closed hardening is implemented and - test-proven; production forwarding remains unavailable -- C17T production envelope payload boundary is implemented and test-proven; - production forwarding remains unavailable -- C17U production envelope created-at skew boundary is implemented and - 
test-proven; production forwarding remains unavailable -- C17V peer endpoint candidate model and NAT/connectivity hints are - implemented and test-proven; production forwarding remains unavailable -- C17W peer endpoint candidate scoring model is implemented and test-proven; - production forwarding remains unavailable -- C17X health-aware endpoint candidate scoring overlay is implemented and - test-proven; production forwarding remains unavailable -- C17Y Platform Owner synthetic mesh visibility is implemented and - build/test-proven; production forwarding remains unavailable -- C17Z production fabric-control direct forwarding is implemented and - test-proven; production service traffic remains unavailable -- C17Z1 production fabric-control multi-hop route-path forwarding is - implemented and test-proven; production service traffic remains unavailable -- C17Z2 production fabric-control forwarding observability is implemented and - test-proven; production service traffic remains unavailable -- C17Z3 production fabric-control route-config boundary is implemented and - test-proven; production service traffic remains unavailable -- C17Z4 scoped peer directory/recovery seed boundary is implemented and - test/build-proven; production service traffic remains unavailable -- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven; - production service traffic remains unavailable -- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven; - production service traffic remains unavailable -- C17Z7 private/corporate endpoint candidate boundary is implemented and - test-proven; production service traffic remains unavailable -- C17Z8 peer connection state machine boundary is implemented and test-proven; - production service traffic remains unavailable -- C17Z9 peer recovery planner boundary is implemented and test-proven; - production service traffic remains unavailable -- C17Z10 peer connection intent planner boundary is implemented and - 
test-proven; production service traffic remains unavailable -- C17Z11 peer connection manager runtime boundary is implemented and - test-proven; production service traffic remains unavailable -- C17Z12 rendezvous/relay control-plane contract is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z13 rendezvous lease telemetry is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z14 rendezvous lease refresh contract is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z15 backend relay replacement policy is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z16 route/path decision artifact is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z17 node-side route generation tracker is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z18 synthetic route-health effective path runtime is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z19 synthetic route-health feedback scoring is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- C17Z20 node-side route-health feedback refresh is implemented and - docker-test-runtime-proven; production service traffic remains unavailable -- Cluster Authority plus node enrollment bootstrap polling are docker-test - lifecycle-smoke-proven; fresh install migration replay is fixed for - `cluster_admin_summaries` -- C18 VPN/IP tunnel service target design is completed as documentation only -- C18A VPN/IP tunnel control-plane data model foundation is implemented and - backend-test-proven -- C18B VPN/IP tunnel lease/fencing hardening is implemented and - backend-test-proven -- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is - implemented and 
backend-test-proven -- Version Storage / Update Repository is documented as a future Fabric Core - service for signed release manifests, OS/arch artifacts, - stable/current/candidate channels, update-cache mirroring, node-agent - update supervision, rollback, and explicit data-structure migration bundles. - Runtime updater behavior is not implemented. -- no next platform-core implementation step is automatically authorized after - C17Z20; choose the next narrow staged prompt explicitly before continuing -- preserve the proven RDP lifecycle behavior -- keep the current backend gateway available as the active/fallback implementation path +# CODEX CONTEXT + +## Project identity + +This project is a production-grade distributed secure access platform. + +It started as a custom RDP proxy with persistent server-side sessions, but the final target architecture is broader: + +- distributed secure access fabric +- multi-tenant platform +- session broker for GUI and future non-GUI protocols +- cluster mesh of nodes +- connector/VPN layer +- customer-managed and platform-managed nodes +- node-agent based self-update / rollback / health supervision + +## Product architecture rule: VPN and Remote Workspace are separate products/layers + +Do not merge VPN/IP tunnel work with Remote Workspace / remote desktop work. + +- VPN is a universal network-layer IP tunnel. It carries any traffic generated + by a phone, Windows PC, Linux host, or other client device: HTTP, DNS, ping, + RDP clients, SSH clients, SMB, business apps, and future protocols. VPN must + stay protocol-agnostic and must not contain remote-desktop-specific logic. +- Remote Workspace is an application/session-layer service. The client talks to + RAP using RAP's own client protocol. RAP workers/connectors then talk to the + target server using protocol adapters such as RDP, SSH, VNC, or future + adapters, convert screen/input/clipboard/files/audio/control into RAP's + format, and render it in the RAP client. 
+- VPN optimization work must focus on generic data-plane transport, + full-tunnel/split-tunnel routing, DNS, MTU/MSS, QoS, NAT traversal, direct + UDP/QUIC transport, fallback relay, diagnostics, and stability for arbitrary + traffic. +- Remote Workspace optimization work must focus on server catalog, session + broker, workers/connectors, protocol adapters, RAP client protocol, separate + connection windows, rendering/input/clipboard/file/audio behavior, and + user-facing remote-workspace UX. +- Both VPN and Remote Workspace must consume the shared Fabric Service Channel + runtime. Control/API traffic may use backend/admin ingress, but working + service data must use the fabric channel whenever available. Backend relay is + a compatibility/degraded fallback, not the production steady-state. +- The accepted service-channel direction is documented in + `docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`: a service requests a + channel with entry pool, exit pool, roles, service class, channel classes, + QoS and failover policy; the fabric selects the fastest healthy route and + rebuilds it on failure. Protocol-specific services must not reimplement this + transport. +- Current implementation: backend issues `rap.fabric_service_channel_lease.v1` + leases and embeds them in VPN client profiles. Leases include + cluster-authority-signed `rap.fabric_service_channel_lease_authority.v1` + payloads that bind token hash, selected route, generation, fencing epoch, and + expiry, plus a signed `data_plane` contract declaring that working data uses + the Fabric Service Channel over fabric routes while backend relay is only an + explicit degraded/disabled fallback policy. `rap-node-agent` accepts the + first VPN packet service-channel entry + endpoint under + `/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets` + plus `/packets/ws`. 
The endpoint validates the signed or introspected + data-plane contract, applies the preferred fabric route, uses the existing + production `vpn_packet` fabric route, reports contract adoption in heartbeat + access telemetry, and refuses backend relay when the contract disables it. + Backend access telemetry and web-admin now show data-plane adoption, + working/steady-state transport, backend relay policy, data-plane mode, and + logical flow mode at cluster/node/channel levels. The next slice is explicit + route/fallback violation incidents from that telemetry, plus client + consumption of the lease endpoint template. + +## Current proven foundation + +The current codebase already proved the most risky low-level lifecycle assumptions for RDP: + +- real FreeRDP connect works +- session state transitions to active work +- terminate works +- detach works without killing the remote session +- reattach works without recreating the remote session +- takeover works without recreating the remote session +- per-resource certificate verification policy exists +- `certificate_verification_mode = strict | ignore` +- `strict` is default +- `ignore` works on a per-resource basis +- worker build is reproducible +- backend build is reproducible + +This proven lifecycle must NOT be broken by future architecture work. + +## Current architecture baseline + +Current audit and baseline snapshot: + +- `docs/audits/PROJECT_AUDIT_2026-04-26.md` +- `docs/audits/CURRENT_BASELINE_MATRIX.md` + +### Test environment +- Canonical test Docker host: `192.168.200.61` +- Canonical Docker context: `test-ubuntu` +- Canonical SSH alias: `docker-test` +- Current external control-plane endpoint for remote/offsite node enrollment: + `http://94.141.118.222:19191` / `http://vpn.cin.su:19191`. +- Current port forward: `94.141.118.222:19191` -> `192.168.200.61:18080`. 
+- For offsite Windows/Linux nodes, install profiles should use: + `http://vpn.cin.su:19191/api/v1` as control-plane endpoint and + `http://vpn.cin.su:19191/downloads` as artifact endpoint unless the user + explicitly chooses the raw IP endpoint. +- Backend API for local/client smoke runs: `http://192.168.200.61:8080/api/v1` +- WebSocket gateway for local/client smoke runs: `ws://192.168.200.61:8080/api/v1/gateway/ws` +- Stage C17 planning is completed. +- C17A synthetic mesh runtime skeleton is implemented and test-proven in + `rap-node-agent` only. It is disabled by default and carries synthetic + `fabric.probe` / `fabric.probe_ack` messages only. +- C17B route health and failover probes are implemented and test-proven in + `rap-node-agent` only. They are disabled by default and carry synthetic + `fabric.route_health` / `fabric.route_health_ack` messages only. +- C17C relay semantic hardening is implemented and test-proven in + `rap-node-agent` only. It is disabled by default and models synthetic + per-channel queues/QoS/backpressure only. +- C17D non-production test-service path is implemented and test-proven in + `rap-node-agent` only. It is disabled by default and carries only bounded + `synthetic.echo` test payloads. +- C17E/C17F/C17G are implemented and proven for live synthetic HTTP transport, + scoped synthetic route config, and Control Plane scoped synthetic config + consumption. +- C17H deployed multi-agent synthetic config smoke is runtime-proven on + `docker-test`: five running `rap-node-agent` containers consume + backend-issued node-scoped synthetic config, direct and single-relay + synthetic route-health observations return to the Control Plane, and + production forwarding remains disabled. +- C17I production forwarding gate foundation is implemented and test-proven: + `rap-node-agent` has an explicit production-forwarding gate, while + `/mesh/v1/forward` still refuses production payload forwarding until a later + approved runtime stage. 
+- C17J production envelope contract is implemented and test-proven: + `/mesh/v1/forward` validates route-bound production envelopes for + `fabric_control` / `fabric.control` only when the gate is enabled, rejects + service channels, and still refuses production forwarding. +- C17K production envelope observation is implemented and test-proven: + valid accepted envelopes can be observed locally as metadata-only records + after validation; rejected envelopes are not observed, observation failure + fails closed, and production forwarding remains unavailable. +- C17L bounded production observation sink is implemented and test-proven: + accepted metadata-only observations can be retained locally with fixed + capacity, oldest-entry drop behavior, and no payload body storage. +- C17M production observation sink wiring is implemented and test-proven: + node-agent can wire the bounded local metadata-only sink when + `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is explicitly greater than + zero; the wiring is disabled by default and exposes no read API. +- C17N production observation sink metrics are implemented and test-proven: + local sink metrics expose only capacity, current depth, accepted total, and + dropped-oldest total; they expose no observation records or payload metadata. +- C17O production observation sink local metrics logging is implemented and + test-proven: node-agent logs aggregate sink metrics locally when the sink is + explicitly enabled; no read API or Control Plane reporting is added. +- C17P production observation sink change-driven metrics logging is implemented + and test-proven: node-agent suppresses repeated identical local sink metrics + logs; no read API or Control Plane reporting is added. +- C17Q production forwarding gate/runtime log boundary is implemented and + test-proven: node-agent logs production forwarding gate state separately from + production forwarding runtime state. 
Runtime state remained false until + C17Z introduced gate-controlled `fabric.control` direct forwarding. +- C17R production observation sink capacity guard is implemented and + test-proven: `RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY` is rejected + above `10000`. +- C17S production observation panic fail-closed hardening is implemented and + test-proven: observer errors and observer panics both fail closed as + observation failure. +- C17T production envelope payload boundary is implemented and test-proven: + validated production `fabric.control` envelope payloads are bounded to + `4096` bytes and oversized envelopes are rejected before observation. +- C17U production envelope created-at skew boundary is implemented and + test-proven: validated production `fabric.control` envelopes whose + `created_at` is more than one minute in the future are rejected before + observation. +- C17V peer endpoint candidate model is implemented and test-proven: + node-scoped synthetic mesh config now carries route-scoped endpoint + candidates with transport, address, reachability, NAT type, connectivity + mode, priority, policy tags, verification time, and metadata. This is a + model/config boundary only; no production route scoring, NAT traversal, + shortcut routing, or forwarding runtime is implemented. +- C17W peer endpoint candidate scoring model is implemented and test-proven: + `rap-node-agent` can rank already-scoped endpoint candidates using soft + inputs such as transport, reachability, connectivity mode, NAT type, + priority, region, policy tags, channel class, and verification age. This is + a scoring helper only; it does not open connections, choose production + routes, or forward payloads. 
+- C17X health-aware endpoint candidate scoring overlay is implemented and + test-proven: endpoint candidate scoring can optionally use local health + observations keyed by `endpoint_id`, including latency, success/failure + history, recent failure reason, reliability score, and observation freshness. + This remains advisory scoring only and is not wired into production route + execution. +- C17Y Platform Owner synthetic mesh visibility is implemented and + build/test-proven: `web-admin` reads node-scoped synthetic mesh config and + shows config enabled state, route counts, peer endpoints, endpoint + candidates, C17X advisory scoring boundary, and `production_forwarding`. + This remains platform-owner visibility only and does not enable production + forwarding. +- C17Z production fabric-control direct forwarding boundary is implemented and + test-proven: when `RAP_MESH_PRODUCTION_FORWARDING_ENABLED=true`, + `/mesh/v1/forward` can deliver valid route-bound `fabric.control` envelopes + at the local destination or forward them to a direct next hop from explicit + peer endpoint config. Service channels, arbitrary relay forwarding, + multi-hop production route execution, and RDP/VPN/file/video/service payloads + remain unavailable. +- C17Z1 production fabric-control multi-hop route-path boundary is implemented + and test-proven: production `fabric.control` envelopes can carry + `route_path` and `visited_node_ids`; relay nodes validate path position, + forward only to the next path node, update TTL/hop/visited metadata, and + reject loops. Service payloads remain unavailable. +- C17Z2 production fabric-control forwarding observability boundary is + implemented and test-proven: node-agent emits local + `mesh_production_forward_event` logs for accepted, forwarded, delivered, and + rejected production `fabric.control` envelopes. Logs are metadata-only and + include no payload bodies or read API. 
+- C17Z3 production fabric-control route-config boundary is implemented and + test-proven: when scoped/control-plane mesh routes are available locally, + production `fabric.control` envelopes must match configured route_id/path/ + next-hop/channel/expiry/TTL/hop limits before forwarding. +- C17Z4 scoped peer directory and recovery seeds boundary is implemented and + test/build-proven: node-scoped mesh config carries scoped `peer_directory` + and explicit bounded `recovery_seeds`; node-agent parses/validates them and + web-admin shows counts. +- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven: + node-agent builds a local `PeerCache`, selects bounded warm peers, probes warm + peers with `/mesh/v1/health`, and reports metadata-only mesh-link + observations when synthetic mesh testing is enabled. +- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven: + node-agent reports explicit advertised mesh endpoint metadata in heartbeat, + and Control Plane projects latest reported endpoints/candidates into + node-scoped synthetic mesh config. +- C17Z7 private/corporate endpoint candidate boundary is implemented and + test-proven: node-agent reports multiple advertised endpoint candidates, + scoring rewards private/corporate same-site candidates, and peer cache can + use the best candidate address for warm health. +- C17Z8 peer connection state machine boundary is implemented and test-proven: + node-agent tracks warm-peer states `disconnected`, `connecting`, `ready`, + `degraded`, and `backoff`, with bounded backoff after repeated health probe + failures. +- C17Z9 peer recovery planner boundary is implemented and test-proven: + node-agent targets a bounded stable ready-peer set, enters recovery when + ready peers fall below target, and selects bounded recovery probes from warm + peers, recovery seeds, and other connectable scoped peers. 
+- C17Z10 peer connection intent planner boundary is implemented and + test-proven: node-agent classifies bounded peer work as maintain/probe/ + recover and classifies transport readiness as direct/private_lan/ + corporate_lan/outbound_only/relay_required, with rendezvous-required + metadata only. +- C17Z11 peer connection manager runtime boundary is implemented and + test-proven: node-agent uses a reusable HTTP keep-alive client for real + control-plane health probes of direct/private/corporate peers and records + `waiting_rendezvous` for outbound-only/relay-required peers. +- C17Z12 rendezvous/relay control-plane contract is implemented and + docker-test-runtime-proven: backend issues node-scoped `rendezvous_leases`, + node-agent resolves matching `waiting_rendezvous` intents into + `relay_control`, probes relay `/mesh/v1/health`, records and maintains + `relay_ready`, and keeps service payload forwarding disabled. +- C17Z13 rendezvous lease telemetry is implemented and + docker-test-runtime-proven: node-agent reports + `mesh_rendezvous_lease_report` with relay admission, peer admission, + TTL/renewal posture, `relay_ready`, and explicit no-payload boundary flags; + web-admin shows `rv leases` in recent heartbeat tables. +- C17Z14 rendezvous lease refresh contract is implemented and + docker-test-runtime-proven: node-agent refreshes renewal-needed/stale + rendezvous leases through node-scoped synthetic config reload, updates the + running peer cache/route/lease state, and reports refresh plus stale relay + withdrawal/reselection telemetry. Service payload forwarding remains + unavailable. 
+- C17Z15 backend relay replacement policy is implemented and + docker-test-runtime-proven: backend consumes recent stale-relay heartbeat + feedback, withdraws stale explicit rendezvous leases, scores alternate relay + candidates from route adjacency, endpoint priority, policy tags, and recent + mesh-link health, and returns replacement leases plus + `rendezvous_relay_policy` decisions in node-scoped synthetic config. + Node-agent reports `c17z15.mesh_rendezvous_lease_report.v1` and keeps stale + state scoped to the exact lease/relay, so replacement leases for the same + peer are not marked stale by association. Service payload forwarding remains + unavailable. +- C17Z16 route/path decision artifact is implemented and + docker-test-runtime-proven: backend `c17z16.synthetic.v1` config includes + `route_path_decisions` with original hops, effective hops, local previous/ + next hop, selected replacement relay, generation, score reasons, and + no-payload boundary flags. Node-agent stores the control-plane route + generation and reports `c17z16.mesh_route_path_decision_report.v1` plus + `c17z16.mesh_rendezvous_lease_report.v1`. Service payload forwarding remains + unavailable. +- C17Z17 node-side route generation tracker is implemented and + docker-test-runtime-proven: backend `c17z17.synthetic.v1` config and + node-agent `mesh_route_generation_report` track active/applied/unchanged/ + withdrawn route decisions, generation changes, total counters, and + `withdrawn_by_replacement` records for stale relay paths when replacement is + first observed. Service payload forwarding remains unavailable. +- C17Z18 synthetic route-health effective path runtime is implemented and + docker-test-runtime-proven: backend `c17z18.synthetic.v1` config and + node-agent `mesh_route_health_config_report` apply Control Plane + `route_path_decisions` to synthetic route-health route config only. 
The + synthetic runtime probes selected effective paths through replacement relays, + reports expected/observed hops and drift state, and backend latest mesh links + preserve route-health observations separately from connection-manager + observations. Service payload forwarding remains unavailable. +- C17Z19 synthetic route-health feedback scoring is implemented and + docker-test-runtime-proven: backend consumes recent `synthetic_route_health` + observations in relay scoring, uses drift/unreachable/failure metadata to + mark the exact selected relay stale, boosts healthy low-latency relay + candidates, and returns replacement leases/route decisions through the + existing synthetic config contract. Migration `000022` adds the `synthetic` + mesh service class. Service payload forwarding remains unavailable. +- C17Z20 node-side route-health feedback refresh is implemented and + docker-test-runtime-proven: after reporting synthetic route-health + drift/unreachable/failure, node-agent performs a bounded node-scoped + synthetic-config refresh, applies returned replacement route decisions to + route-health config immediately, and reports + `c17z20.mesh_route_health_feedback_refresh_report.v1`. Service payload + forwarding remains unavailable. +- C17Z21 offsite control-plane bootstrap relay and Windows updater foundation + are implemented and docker-test/runtime-proven: backend exposes + `/mesh/v1/health` through the admin/nginx control-plane origin and issues + control-plane-only bootstrap rendezvous leases for outbound-only nodes using + their reported public control-plane URL. Remote Windows node + `ifcm-rufms-s-mo1cr` resolved 3/3 peers to `relay_ready` through + `http://94.141.118.222:19191`, while service/RDP/VPN payload forwarding + remains disabled. Release `0.1.3` is published for Docker and Windows + `windows_service` artifacts, and `install-windows` now installs a + per-node Scheduled Task updater for future Windows node-agent updates. 
+- C17Z22 updater observability and Windows host-agent self-update staging are + implemented and test-proven: `rap-host-agent` reports `phase=plan`, + `status=noop` for already-current/no-op plans, update state is scoped per + product so `rap-node-agent` and `rap-host-agent` do not overwrite each + other's current version, and the Windows updater wrapper runs short + one-shot cycles that can apply staged `rap-host-agent.exe.next` before the + next update check. Release `rap-host-agent 0.1.3` is published for + `linux_binary` and `windows_binary`; Docker updater containers on + `test-1/2/3` report no-op plans. +- Installation Authority foundation is implemented: production requires strict + Product Root public key config, first-owner bootstrap uses signed Ed25519 + activation manifests, `installation_authority` and signed + `platform_role_grants` are persisted, and strict platform-admin checks ignore + direct `users.platform_role` database edits without a valid signed grant. + Web-admin exposes installation status/first-owner bootstrap, and + `scripts/installation/product-root-tool.go` generates keys/manifests for + offline product-root operations. +- Cluster Authority and node enrollment bootstrap are docker-test lifecycle + smoke-proven in run `dev-bootstrap-20260428-201430`: a fresh dev install + bootstrapped the first owner, created a cluster, issued a signed join token, + accepted real `rap-node-agent` enrollment, owner-approved the join request, + agent-polled signed bootstrap, persisted cluster authority pin, heartbeated, + and verified signed `c17z18.synthetic.v1` Control Plane config. Production + service payload forwarding remains unavailable. +- Migration `000021_cluster_authority_keys` drops/recreates + `cluster_admin_summaries` because fresh replay proved PostgreSQL cannot + change that view layout via `CREATE OR REPLACE VIEW`. 
+- `rap-node-agent` desired-workload polling/status reporting is gated by + `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime + supervision remains a stub. +- C18 VPN/IP tunnel service target design is completed as documentation only. +- C18A VPN/IP tunnel control-plane data model foundation is implemented and + backend-test-proven. +- C18B VPN/IP tunnel lease/fencing hardening is implemented and + backend-test-proven. +- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is + implemented and backend-test-proven. +- No next platform-core implementation step is automatically authorized after + C17Z20. The next mesh layer should stay limited to route-health feedback + refresh dampening/no-change cooldown unless the user explicitly chooses + another staged task. +- Latest RDP performance reference image: + `rap-rdp-worker:rdp-perf6-dirty-region` +- Stage 5.2 file-download runtime artifacts remain preserved for when RDP work + resumes, but they are not the active next task. +- Do not use `docker.cin.su` for this project unless explicitly requested for a separate one-off check. + +### Backend +- Go +- PostgreSQL = source of truth +- Redis = live coordination / routing only +- REST for control plane +- WebSocket for live session channel + +### Worker +- C++ worker +- FreeRDP integration +- worker runtime hides FreeRDP details from backend +- The C++ worker remains the primary RDP runtime. +- Target RDP performance direction: `docs/architecture/RDP_SERVICE_CPP_PERFORMANCE_TARGET.md`. +- The RDP performance rewrite scope is limited to C++ RDP service adapter + internals. It must not redesign backend control plane, cluster transport, + organizations, leases, or session lifecycle. +- The C# RDP service skeleton is inactive research scaffolding and is not the + current runtime direction. +- Current RDP Adapter baseline: RDP-Perf-6 dirty-region direct binary rendering + is completed and smoke-proven on `docker-test`. 
RDP work is paused by product + decision; next active work is Fabric Core / cluster foundation. +- P3/P3.1 security-readiness foundation exists: production mode rejects + plaintext credential-like resource metadata, requires `secret_ref` for + RDP/VNC/SSH resources, and has an encrypted PostgreSQL-backed resource secret + storage/resolver MVP. P3.2 direct-worker TLS/PKI guard exists. +- P3.3 production-like test-stand smoke is complete on `docker-test`: backend + runs in `APP_ENV=production` with a test-only secret key file, a secret-backed + RDP resource starts real sessions through the resolver path, metadata/audit do + not contain plaintext credentials, and backend gateway fallback remains + available when direct worker WSS trust is `smoke_insecure`. +- P3.4 production direct-worker WSS trust model is documented in + `docs/architecture/PRODUCTION_DIRECT_WORKER_WSS_TRUST.md`; it defines + platform CA/public CA behavior, worker certificate SAN/identity requirements, + app-local Windows trust direction, rotation/revocation, and the future + `platform_ca` smoke plan. No RDP runtime behavior changed in P3.4. +- P3.5 app-local platform CA trust is implemented and runtime-proven on + `docker-test`: Windows client validates direct worker WSS with an app-local + platform CA bundle, keeps hostname/SAN validation enabled, selects + `direct_worker_wss` without insecure TLS bypass, and falls back to backend + gateway for unknown CA / smoke-only production cases. +- P3.6 stale Redis worker/live event idempotency is implemented and + runtime-proven: stale worker events for terminal PostgreSQL sessions are + ignored, backend restart survives stale Redis events, and terminal sessions + are not reopened. 
+- Stage 5.2 server-to-client file download core data path is runtime-proven: + direct worker WSS and backend gateway fallback both download text/binary + files from `RAP_Transfers\ToClient` with matching size/hash, and direct + policy blocking is proven for `disabled` and `client_to_server`. Lifecycle + blocking is also runtime-proven for detach, old-client takeover, and worker + failure. Runtime report: + `artifacts/stage5-2-file-download-runtime-report.md`. +- Stage 5.2 is not fully accepted yet. Remaining proof: Windows desktop UI + download path and regression matrix for rendering/input/clipboard/upload/ + reconnect/takeover. + +### Clients +- future native clients: + - Windows: native desktop client first + - Linux: native desktop client later +- web UI is admin/control plane, not the primary power-user client + +## Final architecture direction + +The long-term target architecture is documented in: + +- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md` +- `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md` +- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md` + +This document defines the target Secure Access Fabric architecture only. It is not the current implementation scope and must not be used as permission to start mesh, VPN, multi-cluster, updater, or realtime data-plane migration work without an explicit staged prompt. + +`CLUSTER_NODE_ADMIN_FOUNDATION.md` defines the next platform-core planning +baseline for clusters, node enrollment, native node-agent identity, platform +admin console, multi-cluster administration, and future organization admin +visibility. It is a staged foundation document, not permission to implement +mesh packet routing or VPN runtime. + +`WEB_INGRESS_AND_ADMIN_UI_MODEL.md` defines WEB as HTTP/HTTPS ingress and +Admin UI presentation only. Cluster configuration remains Control Plane +ownership through scoped APIs, PostgreSQL source-of-truth mutations, and audit. 
+Dynamic pages must be safe schema-driven projections and must not embed +internal topology, peer caches, route caches, secrets, raw credentials, or +arbitrary executable code. + +Admin endpoint placement is explicit. Fabric Storage / Config Storage nodes do +not automatically host or move the cluster panel. Platform Owner Console +remains global platform-owner scope. Cluster Admin Endpoint requires explicit +admin/web ingress role assignment, cluster health/trust readiness, and Control +Plane authorization. Organization Admin Panel remains a tenant-safe projection. + +The final platform must support: + +1. Multi-tenancy / Organizations +- platform has many organizations +- each organization has isolated users, groups, resources, policies, audit, connectors +- users may belong to multiple organizations +- organization admins only see their organization +- platform admins see platform scope + +2. Identity federation +- local users +- LDAP / Active Directory +- OIDC +- future extensibility for more identity sources +- access mappings based on external groups / claims + +3. Cluster of nodes +- no mandatory single central node +- many nodes across many sites +- nodes can be platform-managed or customer-managed +- customer-managed nodes are sandboxed cluster participants, not full cluster owners + +4. Node agent +- small stable always-running agent on every node +- supervises services +- downloads updates +- verifies signed artifacts +- can rollback to previous version +- can restart crashed services +- can work on thin or thick nodes + +5. Service-based node model +Each node is not monolithic. +A node has: +- capabilities: what it can do physically/technically +- enabled services: what it is allowed/assigned to do + +Possible services include: +- ingress-gateway +- mesh-router +- relay +- connector-host +- vpn-adapter +- session-worker +- media-relay +- file-relay +- update-cache +- config-replica +- audit-sink +- metrics-exporter + +6. 
Cluster mesh and routing +- encrypted inter-node communication +- dynamic topology +- no need for full mesh +- multi-hop routing allowed +- route failover +- client failover between ingress nodes +- connector failover between nodes + +7. Split-brain prevention +- quorum-based cluster behavior +- minority partition must not become a second authoritative cluster +- degraded / recovery / isolated modes +- manual recovery / promote decision by platform recovery admin + +8. Connector / VPN layer +- connectors are reusable network access methods +- one connector may be used by multiple resources +- connector placement and failover are controlled by policy +- nodes may be allowed or disallowed to host connectors +- direct access, VPN, relay and future egress modes must fit this model + +9. Future exit mode +- split tunnel +- full tunnel +- internet access through cluster +- not first implementation priority + +## Non-negotiable design rules + +- Do not rewrite proven session lifecycle carelessly. +- Do not turn Redis into a source of truth. +- Do not make certificate-ignore a global worker setting. +- Do not make customer-managed nodes platform-wide trusted by default. +- Do not create a separate cluster per organization. +- Do not assume a single permanently reachable central node. +- Do not rely on “secret protocol with no docs” as security. +- Security must come from crypto, auth, isolation, policy and observability. +- Prefer incremental evolution from current proven system. +- Do not collapse platform control plane and data plane into one vague layer. + +## Implementation strategy + +The codebase must evolve in phases. 
+ +Current implementation focus remains: +- RDP work is paused by product decision +- preserve the accepted RDP Adapter baseline and Stage 5.x file-transfer work +- do not delete or rewrite the current RDP MVP while platform-core work starts +- C1-C9 platform-core foundations are implemented and verified: clusters, + node enrollment, node-agent scaffold, platform admin console, workload + supervision contract, mesh control-plane prep, mesh skeleton, multi-cluster + hardening, and organization admin foundation +- C10 Fabric Core configuration distribution design is completed +- C11 signed scoped cluster snapshot model is completed +- C12 node local state store is completed +- C13 Fabric Storage / Config Storage service foundation is completed +- C14 peer directory and cache model is completed +- C15 Fabric Routing Engine skeleton is completed +- C16 secure node-to-node channel lifecycle is completed +- C17 mesh routing runtime implementation plan is completed +- C17A synthetic mesh runtime skeleton is implemented and test-proven with + synthetic fabric messages only, no RDP/VPN/production service traffic +- C17B route health and failover probes are implemented and test-proven with + synthetic traffic only, no RDP/VPN/production service traffic +- C17C relay semantic hardening is implemented and test-proven with synthetic + channel classes only, no RDP/VPN/production service traffic +- C17D non-production test-service path is implemented and test-proven with + bounded `synthetic.echo` traffic only, no RDP/VPN/production service traffic +- C17E live node-to-node synthetic HTTP transport is implemented and + smoke-proven with synthetic traffic only +- C17F scoped synthetic route config loading and route-health reporting is + implemented and smoke-proven with synthetic traffic only +- C17G Control Plane scoped synthetic config read/consume is implemented and + test-proven with synthetic traffic only +- C17H deployed multi-agent synthetic config smoke is implemented and 
+ runtime-proven on `docker-test` with synthetic traffic only +- C17I production forwarding gate foundation is implemented and test-proven; + production forwarding remains unavailable +- C17J production envelope contract validation is implemented and test-proven; + production forwarding remains unavailable +- C17K production envelope observation is implemented and test-proven; + production forwarding remains unavailable +- C17L bounded production observation sink is implemented and test-proven; + production forwarding remains unavailable +- C17M production observation sink wiring is implemented and test-proven; + production forwarding remains unavailable +- C17N production observation sink metrics are implemented and test-proven; + production forwarding remains unavailable +- C17O production observation sink local metrics logging is implemented and + test-proven; production forwarding remains unavailable +- C17P production observation sink change-driven metrics logging is implemented + and test-proven; production forwarding remains unavailable +- C17Q production forwarding gate/runtime log boundary is implemented and + test-proven; production forwarding remains unavailable +- C17R production observation sink capacity guard is implemented and + test-proven; production forwarding remains unavailable +- C17S production observation panic fail-closed hardening is implemented and + test-proven; production forwarding remains unavailable +- C17T production envelope payload boundary is implemented and test-proven; + production forwarding remains unavailable +- C17U production envelope created-at skew boundary is implemented and + test-proven; production forwarding remains unavailable +- C17V peer endpoint candidate model and NAT/connectivity hints are + implemented and test-proven; production forwarding remains unavailable +- C17W peer endpoint candidate scoring model is implemented and test-proven; + production forwarding remains unavailable +- C17X health-aware endpoint 
candidate scoring overlay is implemented and + test-proven; production forwarding remains unavailable +- C17Y Platform Owner synthetic mesh visibility is implemented and + build/test-proven; production forwarding remains unavailable +- C17Z production fabric-control direct forwarding is implemented and + test-proven; production service traffic remains unavailable +- C17Z1 production fabric-control multi-hop route-path forwarding is + implemented and test-proven; production service traffic remains unavailable +- C17Z2 production fabric-control forwarding observability is implemented and + test-proven; production service traffic remains unavailable +- C17Z3 production fabric-control route-config boundary is implemented and + test-proven; production service traffic remains unavailable +- C17Z4 scoped peer directory/recovery seed boundary is implemented and + test/build-proven; production service traffic remains unavailable +- C17Z5 node-agent peer cache runtime boundary is implemented and test-proven; + production service traffic remains unavailable +- C17Z6 dynamic endpoint reporting boundary is implemented and test-proven; + production service traffic remains unavailable +- C17Z7 private/corporate endpoint candidate boundary is implemented and + test-proven; production service traffic remains unavailable +- C17Z8 peer connection state machine boundary is implemented and test-proven; + production service traffic remains unavailable +- C17Z9 peer recovery planner boundary is implemented and test-proven; + production service traffic remains unavailable +- C17Z10 peer connection intent planner boundary is implemented and + test-proven; production service traffic remains unavailable +- C17Z11 peer connection manager runtime boundary is implemented and + test-proven; production service traffic remains unavailable +- C17Z12 rendezvous/relay control-plane contract is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z13 
rendezvous lease telemetry is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z14 rendezvous lease refresh contract is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z15 backend relay replacement policy is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z16 route/path decision artifact is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z17 node-side route generation tracker is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z18 synthetic route-health effective path runtime is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z19 synthetic route-health feedback scoring is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z20 node-side route-health feedback refresh is implemented and + docker-test-runtime-proven; production service traffic remains unavailable +- C17Z21 node installation/update control-plane is implemented and + docker-test-runtime-proven for Docker nodes; production service traffic + remains unavailable +- C17Z22 Windows host-agent install/update supervision is implemented and + runtime-proven on the remote Windows node; production service traffic remains + unavailable +- C17Z23 update observability is implemented in backend/admin UI: per-node + updater status history is exposed and deployed on docker-test, so node-agent + and host-agent update activity can be audited from node details +- C17Z24 combined updater reporting is implemented and docker-test-proven: + Linux/Docker `rap-host-agent update-loop` now also polls/reports + `rap-host-agent` status, release `0.1.4` is published for node-agent and + host-agent artifacts, and docker-test nodes `test-1/2/3` auto-updated to + node-agent `0.1.4` while 
reporting host-agent `0.1.4` no-op status. +- C17Z25 Windows updater repair visibility is implemented in admin UI: node + details / Updates now shows a ready CMD repair command for existing Windows + nodes using `http://vpn.cin.su:19191/api/v1`, `--replace`, and + `--auto-update-current-version 0.0.0` so a stale updater wrapper can be + recreated without a new join token. +- C17Z26 updater fleet visibility is implemented in admin UI: the node list now + shows per-node updater status based on latest `rap-node-agent` and + `rap-host-agent` reports, explicitly flagging missing host-agent reports, + stale update reports, or update errors before opening node details. +- C17Z27 backend version-state projection is implemented and deployed on + docker-test: node list responses now derive `version_state` from active + `rap-node-agent` desired policy plus latest update report. Docker/Linux nodes + on `0.1.4` show `current`; the remote Windows node still on `0.1.3` shows + `outdated` while remaining heartbeat-healthy. +- C17Z28 Windows updater loop hardening is implemented and partially + docker-test-proven via release `0.1.5`: Windows host-agent updater scripts now + run combined `update-loop --max-runs 1`, and Windows `update-loop` also + polls/applies `rap-host-agent` updates. Release `0.1.5` artifacts are + published for Docker/Linux and Windows; docker-test nodes `test-1/2/3` + updated to `rap-node-agent 0.1.5`. Existing remote Windows nodes with stale + pre-0.1.5 updater wrapper still require one repair command from admin UI to + replace their local wrapper, after which automatic polling should continue. +- Admin UI now marks missing host-agent updater reports as `repair updater` in + the node list and explains in node details / Updates when to run the Windows + repair command. The command uses the external control-plane endpoint and does + not require a join token for already enrolled Windows nodes. 
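The node-list projection described above (`version_state` from desired policy plus the latest update report, and the `repair updater` marker for a missing host-agent report) can be sketched as below. This is an illustrative sketch only: the function names, the `unknown` and `ok` labels, and the choice to treat a reported version at or above the desired one as `current` are assumptions, not the backend's actual rules.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a dotted numeric version such as '0.1.4' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def version_state(desired: Optional[str], reported: Optional[str]) -> str:
    """Derive a node's version state from desired policy and the latest report."""
    if desired is None or reported is None:
        return "unknown"  # hypothetical label for missing policy or report
    # Assumption: a node at or ahead of the desired version counts as current.
    if parse_version(reported) >= parse_version(desired):
        return "current"
    return "outdated"


def updater_hint(host_agent_report_present: bool) -> str:
    """Return the node-list hint; 'ok' is a hypothetical label for the healthy case."""
    return "ok" if host_agent_report_present else "repair updater"
```

Comparing parsed tuples rather than raw strings avoids ranking `0.1.10` below `0.1.9`. Version state is derived independently of heartbeat health, which matches the note that an `outdated` node can remain heartbeat-healthy.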
+- Admin UI node details / Updates also provides a ready downloadable + `rap-repair-updater-.cmd` plus copy-command action for Windows repair, + reducing operator copy/paste mistakes on remote Windows hosts. +- Windows repair command generation was hardened after the first remote repair: + foreground `update-loop` now includes explicit `--node-id`, copies any staged + `rap-host-agent.exe.next` over the main host-agent binary after the one-shot + loop exits, deletes the staged file, and runs the updater scheduled task. + The node list now distinguishes `host-agent staged` from generic stale/error. +- C17Z29 Windows persistent updater repair is implemented in `rap-host-agent` + release `0.1.6`: `install-windows` accepts `--node-id` and writes that node + id into the persistent Windows updater wrapper so Scheduled Task polling no + longer depends on finding `identity.json` in the expected state directory. + Docker-test nodes `test-1/2/3` updated to `0.1.6`; existing Windows and + off-host Docker nodes still need their local updater wrappers to pick up the + 0.1.6 host-agent repair path. +- C17Z30 operator-configured public mesh endpoints are implemented and + docker-test-deployed: desired `mesh-listener.advertise_endpoint` is now + projected into peer endpoint candidates for other nodes and preferred over + auto-discovered private heartbeat endpoints. `home-1` + (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`) is configured with + `listen_addr=0.0.0.0:19131`, `advertise_endpoint=http://94.141.118.222:19199`, + `connectivity_mode=direct`, `nat_type=port_restricted`, `region=home`. + `test-1` synthetic config now receives `home-1` peer endpoint + `http://94.141.118.222:19199`; internal `192.168.200.85:19131` responds with + HTTP 405 on GET, while external `94.141.118.222:19199` currently refuses TCP, + so router/firewall forwarding still needs correction outside the platform. 
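The C17Z30 preference rule (operator-configured public endpoints win over auto-discovered private heartbeat endpoints) can be sketched as a stable sort over candidate sources. The `EndpointCandidate` type, the source labels, and `order_candidates` are hypothetical names used only for illustration; the real peer endpoint candidate model may carry more fields and scoring inputs.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical source labels; a lower rank is preferred.
SOURCE_RANK = {
    "operator_public": 0,    # desired mesh-listener.advertise_endpoint
    "heartbeat_public": 1,   # public endpoint advertised via heartbeat
    "heartbeat_private": 2,  # auto-discovered private/LAN endpoint
}


@dataclass(frozen=True)
class EndpointCandidate:
    url: str
    source: str


def order_candidates(candidates: List[EndpointCandidate]) -> List[EndpointCandidate]:
    """Order candidates by source preference; unknown sources sort last.

    sorted() is stable, so candidates with equal rank keep their input order.
    """
    return sorted(candidates, key=lambda c: SOURCE_RANK.get(c.source, len(SOURCE_RANK)))
```

With the `home-1` values quoted above, the operator-configured `http://94.141.118.222:19199` would be offered before the private `http://192.168.200.85:19131` regardless of discovery order.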
+- C17Z31 offsite bootstrap peer selection is implemented and docker-test + deployed: operator-configured public/direct desired mesh-listener endpoints + are kept in core-mesh bootstrap even after the default warm-peer target is + reached. This fixes the case where remote Windows node + `ifcm-rufms-s-mo1cr` received only `test-*` warm peers and no `home-1`. + Its synthetic config now includes `home-1` endpoint + `http://94.141.118.222:19199` and candidates ordered as operator public, + heartbeat advertised public, then private LAN converted to relay-required for + offsite. External TCP to `94.141.118.222:19199` still failed from Codex and + docker-test checks while internal `192.168.200.85:19131` succeeds, so a real + offsite `Test-NetConnection 94.141.118.222 -Port 19199` is the next network + validation. +- C17Z32 native Ubuntu/Linux service install is implemented and docker-test + deployed: backend exposes `/node-agents/linux-install-profile`, host-agent + supports `install-linux`, installs `rap-node-agent` under + `/opt/rap/`, state under `/var/lib/rap/nodes/`, config under + `/etc/rap/`, creates `rap-node-agent-.service`, and creates a + persistent `rap-host-agent-updater-.service` for automatic node-agent + and host-agent updates. Release `0.1.7` is published for `rap-node-agent` + (`linux_binary`, `windows_service`) and `rap-host-agent` + (`linux_binary`, `windows_binary`). Admin UI now has an `Ubuntu service` + install profile and generates profile-based `install-linux` commands. + A one-use token for `vps-ubuntu-1` is active until 2026-05-02T08:41:41Z: + `rap_join_a23Xhz63YstshWUBAPGPz5fzQ8YpHDP05RXaaYa4DoA`; scope roles are + `core-mesh` and `relay-node`, control-plane endpoint is + `http://vpn.cin.su:19191/api/v1`, artifact endpoint is + `http://vpn.cin.su:19191/downloads`. 
+- Admin UI and docs now cover the full Windows updater operational workflow: + node details shows an `Updater health` summary, generated repair CMD prints + scheduled-task and binary diagnostics before/after repair, applies staged + host-agent binaries, restarts the updater task, and README documents first + install, repair without join-token, system-task/user-task behavior, staged + host-agent recovery, and reboot/autostart verification. +- Cluster Authority plus node enrollment bootstrap polling are docker-test + lifecycle-smoke-proven; fresh install migration replay is fixed for + `cluster_admin_summaries` +- C18 VPN/IP tunnel service target design is completed as documentation only +- C18A VPN/IP tunnel control-plane data model foundation is implemented and + backend-test-proven +- C18B VPN/IP tunnel lease/fencing hardening is implemented and + backend-test-proven +- C18C VPN/IP tunnel node-agent desired-state consumption/reporting is + implemented and backend-test-proven +- Version Storage / Update Repository is documented as a future Fabric Core + service for signed release manifests, OS/arch artifacts, + stable/current/candidate channels, update-cache mirroring, node-agent + update supervision, rollback, and explicit data-structure migration bundles. + Runtime updater behavior is partially implemented for the current Docker and + Windows node-agent/host-agent paths; broader staged rollout policy and + service payload forwarding remain separate work. +- no next platform-core implementation step is automatically authorized after + C17Z20; choose the next narrow staged prompt explicitly before continuing +- preserve the proven RDP lifecycle behavior +- keep the current backend gateway available as the active/fallback implementation path +- accepted VPN data-plane target: the phone/client connects only to an + available entry node; the entry node uses the existing mesh/fabric route to a + selected exit node/pool, and the exit node handles LAN/internet egress. 
Nodes + behind NAT may participate when they can maintain outbound mesh/control + sessions. Backend packet relay must remain a compatibility/fallback path, not + the desired steady-state path. +- C18D VPN-over-fabric foundation is implemented and docker-test-started: + VPN client profiles include `vpn_fabric_route` with entry pool, exit pool, + selected entry/exit, preferred `fabric_mesh` data-plane, and + `backend_relay` fallback. Node-agent `0.2.39` adds a dedicated production + `vpn_packet` channel (`vpn.packet_batch`, 256 KiB batch limit), destination + delivery hook, `vpnruntime.FabricPacketTransport`, and + `vpn_fabric_packet_transport` heartbeat capability. `home-1` auto-updated to + `0.2.39`; other nodes have automatic desired policy `0.2.39` and should move + as their updater loops pick it up. Live Android VPN traffic still uses backend + relay until entry-node client ingress is wired to the fabric transport. +- C18E VPN-over-fabric route contract is backend-deployed on docker-test as + `rap-backend:test-vpn-fabric-route-0.2.41`: when a VPN client profile selects + different entry and exit nodes, backend now ensures two active + `mesh_route_intents` with service_class `vpn_packets` and allowed channel + `vpn_packet`. The live HOME profile currently selects `usa-los-1` as entry + and `home-1` as exit when `entry_node_id=b829ffde-...` is requested, and the + synthetic config for both nodes includes the two `vpn_packet` routes. Existing + fallback remains `backend_relay`; production forwarding gate is still disabled + on old/live remote nodes until their runtime is explicitly updated/enabled. +- External/offsite updater gap found and fixed for version `0.2.40`: native + `rap-node-agent` binaries for `linux_binary`, `linux_service`, and + `windows_service` plus matching `rap-host-agent` binaries are copied under + `/downloads` and registered in channel `dev-external`. 
Update plans for + `usa-los-1` (`linux_binary`) and `ifcm-rufms-s-mo1cr` (`windows_service`) now + return `action=update`, `target_version=0.2.40` instead of + `no_matching_artifact`. +- C18F production-forwarding gate work is partially live: backend + `rap-backend:test-vpn-fabric-route-0.2.42` signs node synthetic configs with + `production_forwarding=true` / `control_plane_only=false` when the node's + desired `mesh-listener` workload has `production_forwarding_enabled=true`. + `home-1` and `usa-los-1` desired mesh-listener configs have this flag enabled. + Node-agent `0.2.44` accepts signed production-forwarding mesh configs and + host-agent `0.2.44` fixes Docker updater behavior so synthetic mesh runtime is + not disabled on Docker updates. Runtime status: `usa-los-1` reports + `mesh_production_forwarding=true`; `home-1` reports `0.2.44` and synthetic + runtime enabled, but its listener report is still `disabled/listen_addr_empty`, + so `home-1` is not yet a usable production fabric endpoint. Next action is to + repair why `home-1` is not applying the signed mesh-listener config + (`listen_addr=0.0.0.0:19131`) after Docker updater restart. +- C18G VPN-over-fabric runtime path is live-tested on docker-test. Backend is + deployed as `rap-backend:test-vpn-fabric-route-0.2.43`; VPN route intents now + allow both `vpn_packet` data and `fabric_control` health probes. Node-agent + `0.2.47` fixes initial production VPN packet envelope hop addressing and + reports the matching version. `home-1` and `usa-los-1` both report + `0.2.47`, healthy, listener `0.0.0.0:19131`, and + `mesh_production_forwarding=true`. Live route health is reachable in both + directions (`usa-los-1 -> home-1` around 200 ms, `home-1 -> usa-los-1` + around 200-415 ms). A direct live POST to + `http://195.123.240.88:19131/api/v1/clusters/.../vpn-connections/.../tunnel/client/packets` + returns `202 Accepted`, proving entry-node VPN packet ingress can forward + over fabric to the home exit. 
The HOME VPN placement policy now has entry + pool `[usa-los-1, home-1]` and exit `home-1`; client profile with preferred + `usa-los-1` selects `usa-los-1 -> home-1`. +- C18H live VPN triage on 2026-05-04: `home-1` and `usa-los-1` report + node-agent `0.2.48`, healthy heartbeats, active HOME VPN assignment on + `home-1`, and `packet_forwarding=true` / `runtime_available=true`. Manual + packet tests through the USA entry proved the path + Android-style packet -> `usa-los-1` -> fabric -> `home-1` -> LAN/DNS -> + fabric -> `usa-los-1` -> client can return ICMP and DNS replies. The remaining + live symptom was the phone not sending fresh packets to the current entry + after the backend relay queue was cleared. Android VPN app `0.2.59` was built + and published to `/downloads/rap-android-rdp-vpn-latest-debug.apk`; it + normalizes old saved backend URLs (`vpn.cin.su:19191`, + `94.141.118.222:19191`, `192.168.200.61:18080`, etc.) to the current USA + entry backend `http://195.123.240.88:19131/api/v1` and shows app version, + device id, and connection id in the header for live log correlation. +- C18I fabric service-channel foundation is live on 2026-05-07. Backend, + node-agent, and Android VPN release `0.2.159` are published. VPN profiles now + include a signed `rap.fabric_service_channel_lease.v1` with + `entry_direct_http_v1` packet and WebSocket templates. Android consumes this + lease and sends service-channel headers. The `usa-los-1` entry endpoint + validates the cluster-authority signed lease payload and token hash; a live + smoke through `http://195.123.240.88:19131/.../fabric/service-channels/...` + succeeded with a valid lease and rejected a bad token with `403`. Current HOME + profile selects `usa-los-1` as entry and `home-1` as exit; both nodes report + `0.2.159`. Docker-test nodes `test-1`, `test-2`, and `test-3` also report + `0.2.159`. 
`ifcm-rufms-s-mo1cr` is still on `0.2.119`; it has staged the + host-agent `0.2.159` update and should finish on the next Windows updater + loop/restart. +- C18J fabric service-channel runtime route-manager slice is live on + 2026-05-07 as node/host-agent `0.2.162`. The entry-node + `FabricClientPacketIngress` now preserves its runtime object across synthetic + config refreshes, so heartbeat telemetry reports the same ingress object that + serves HTTP/WebSocket service-channel traffic. It tracks send/receive batches, + route attempts/failures, selected route/next hop, local-gateway fallback, and + inbox queue depths. `SendClientPacketBatch` now retries all valid + `vpn_packet` route candidates with sticky preference before backend relay is + allowed as degraded compatibility fallback. Release `0.2.161` was superseded + because its Docker tar was rebuilt after registration; `0.2.162` is the + clean published release with matching artifact hashes. Docker-test + `test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.162`; + `home-1` is healthy and still on `0.2.161` awaiting its next updater loop. + Live smoke through `http://195.123.240.88:19131/.../fabric/service-channels` + returned `202` and `usa-los-1` telemetry then showed route attempts, + one route failure, and selected next hop `home-1`, proving live ingress + telemetry and alternate-route retry are active. +- C18K service-neutral flow/channel scheduler is live on 2026-05-07 as + node/host-agent `0.2.163`. The VPN proving service still carries universal + IP packets and does not route by application protocol, but the entry runtime + now hashes packets by IP 5-tuple, or packet hash for non-IP/invalid packets, + into 32 logical `flow-*` channels. Each channel has bounded queue accounting, + high-watermark/backpressure/dropped telemetry, and batches are fanned out per + logical channel before being sent through the same fabric route-manager. 
Live + smoke against `usa-los-1` posted two different IP flows through the signed + service-channel endpoint and heartbeat reported `send_packets=2`, + `send_flow_batches=2`, `flow_scheduler.channel_count=2`, `enqueued=2`, + `dequeued=2`, `dropped=0`, with queue depths for `flow-12` and `flow-14`. + All six current cluster nodes (`home-1`, `usa-los-1`, `ifcm-rufms-s-mo1cr`, + `test-1`, `test-2`, `test-3`) report node-agent `0.2.163` and healthy. +- C18L active flow scheduling telemetry is live on 2026-05-07 as + node/host-agent `0.2.164`. Each `flow-*` channel now keeps route memory, + served count, last served time, last route/next hop, failed-route marker, + consecutive failures, stall count, last send duration, and explicit + `route_rebuild_recommended` / `degraded_fallback_recommended` signals. The + scheduler drains non-stalled channels first, prefers less-served/older + channels, avoids a channel's last failed route on the next send, and only + marks degraded fallback after repeated failures. Live smoke against + `usa-los-1` posted two IP flows through the signed service-channel endpoint: + heartbeat reported schema `c18l.fabric_service_channel_runtime_report.v1`, + `send_packets=2`, `send_flow_batches=2`, `flow_scheduler.channel_count=2`, + `dropped=0`, `backpressure=false`, `last_next_hop=home-1`, and per-flow + `served=1`. One stale candidate route failed and was bypassed before the + successful route to `home-1`. All six current cluster nodes (`home-1`, + `usa-los-1`, `ifcm-rufms-s-mo1cr`, `test-1`, `test-2`, `test-3`) report + node-agent `0.2.164` and healthy. +- C18M Control Plane service-channel feedback is live on 2026-05-07. Backend + image `rap-backend:fabric-service-channel-0.2.165` is deployed on + docker-test, and node/host-agent `0.2.165` artifacts are published. 
When + issuing `rap.fabric_service_channel_lease.v1`, backend now reads fresh + entry-node heartbeat metadata + `fabric_service_channel_runtime_report.ingress.flow_scheduler.channel_stats`, + builds per-route service-channel feedback, boosts recently successful routes, + penalizes recent failures, and fences routes that report + `route_rebuild_recommended`, `degraded_fallback_recommended`, or repeated + consecutive failures. Fenced routes are not selected as primary or alternate; + if all selected entry/exit routes are fenced, the lease uses explicit + degraded backend fallback with reason + `fabric_routes_fenced_by_service_channel_feedback`. Live smoke created two + short-lived `test-1 -> test-2` route intents, injected a fresh + service-channel flow feedback heartbeat marking the higher-priority route as + rebuild-required, and the next lease selected the lower-priority healthy + route with score reason `service_channel_recent_success`; the bad route was + not offered as an alternate. Current node rollout: `home-1`, `usa-los-1`, + `test-1`, `test-2`, and `test-3` report `0.2.165`; Windows `ifcm-rufms-s-mo1cr` + remains healthy on `0.2.164` and should move on its next updater cycle. +- C18N durable service-channel route feedback is live on 2026-05-07. Backend + image `rap-backend:fabric-service-channel-0.2.166` is deployed on + docker-test with migration `000025_fabric_service_channel_route_feedback`. + Heartbeats now persist service-neutral route observations into + `fabric_service_channel_route_feedback_observations` and maintain an + expiring latest view in `fabric_service_channel_route_feedback_latest`. + Lease selection reads this durable latest feedback before falling back to + in-memory heartbeat parsing, so route fencing survives backend restarts and + stale heartbeat replacement. 
Node/host-agent `0.2.166` artifacts and Docker + image are published, update policies target `0.2.166`, and `test-1/2/3`, + `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.166`; `home-1` is healthy + but still on `0.2.165` until its next updater cycle. Live smoke created two + short-lived `test-1 -> test-2` routes, persisted a fenced observation for the + higher-priority bad route and a healthy observation for the lower-priority + route, restarted backend, and the next lease selected the healthy route with + `service_channel_recent_success`. +- C18O service-channel feedback diagnostics and synthetic route avoidance are + live on 2026-05-07. Backend image + `rap-backend:fabric-service-channel-0.2.167` is deployed on docker-test and + web-admin is rebuilt/published. Admin/API now expose fresh durable feedback + through `GET /clusters/{clusterID}/fabric/service-channels/route-feedback`, + and each node synthetic config includes + `service_channel_route_feedback` with healthy/degraded/fenced counts and + observations. Synthetic config generation skips routes fenced by the local + node's durable service-channel feedback, so nodes stop receiving known-bad + route configs while the feedback is active. Live smoke created fresh + `test-1 -> test-2` routes, persisted `fenced` feedback for the higher-priority + route and `healthy` feedback for the lower-priority route, confirmed the API + returned both observations, and confirmed `test-1` synthetic config excluded + the bad route while keeping the healthy route. +- C18P proactive service-channel replacement decisions are live on 2026-05-07. + Backend image `rap-backend:fabric-service-channel-0.2.168` is deployed on + docker-test and web-admin is rebuilt/published. 
When synthetic config + generation withholds a route fenced by local service-channel feedback, it now + records a `route_path_decisions` item with + `decision_source=service_channel_feedback_replacement`, + `replacement_route_id`, effective replacement hops, and score reasons. If no + alternate exists, the decision source becomes + `service_channel_feedback_no_alternate` with visible score reason + `no_unfenced_alternate_route`. Live smoke created fresh `test-1 -> test-2` + bad/good routes, fenced the bad route, disabled older smoke routes, and + confirmed `test-1` synthetic config excluded the bad route, kept the good + route, and reported replacement from bad route to good route. +- C18Q service-channel replacement dampening is live on 2026-05-07. Backend + image `rap-backend:fabric-service-channel-0.2.169`, node/host-agent + `0.2.169` artifacts, Docker image, update policies, and web-admin are + published on docker-test. Replacement selection now gives a large stable + preference to routes with active healthy durable feedback, adding + `active_healthy_feedback_dampening_window` to score reasons, so a recently + successful replacement wins over a higher-priority but unproven route until + the feedback window expires or a newer fenced/healthy observation changes the + state. `RoutePathDecisionReport` now includes `degraded_decision_count` for + `service_channel_feedback_no_alternate`, and node-agent heartbeat reports + include `replacement_route_id` and degraded counts after upgrade. Live smoke + fenced a high-priority bad `test-1 -> test-2` route, supplied healthy feedback + for a low-priority route, also created a higher-priority unproven route, and + confirmed replacement selected the healthy route because of the dampening + window. +- C18Q hotfix `0.2.171` is published on 2026-05-07. Node-agent now includes + `service_channel_route_feedback` in the signed synthetic config model before + recalculating the authority payload hash. 
Without this, upgraded backend + configs were signed correctly but `0.2.169` agents rejected them with + `control-plane synthetic mesh config authority payload hash mismatch`. + Regression coverage verifies a signed config containing durable + service-channel feedback. Artifacts, Docker image, latest download aliases, + and update policies were moved to `0.2.171`; `test-1/2/3` are running + `0.2.171` and loading `source=control_plane` again. The release includes + `linux_service`, Docker, Windows service, and binary artifacts so service + installs can auto-update. Old C18 smoke/expired route intents were disabled + after validation. +- C18R fleet diagnostics/operator action slice is live on 2026-05-07. Backend + image `rap-backend:fabric-service-channel-0.2.172` adds route feedback + filters (`route_id`, `feedback_status`, `include_expired`) and + `POST /clusters/{clusterID}/fabric/service-channels/route-feedback/expire`. + The expire action is cluster-mutable/admin gated and marks latest feedback + expired without deleting historical observations. Web-admin / Fabric Links + now shows a cluster-level service-channel feedback panel with fenced, + degraded, healthy and no-alternate counts, replacement/no-alternate decisions, + and an operator `expire` action for stale non-healthy feedback. +- C18S service-channel feedback churn guardrails are implemented on + 2026-05-07. Operator expire now records + `fabric.service_channel_route_feedback.expired` audit events, returns and + persists a short `operator_retry_cooldown_until`, and route generation adds + `service_channel_route_retry_after_operator_expire` when a manually expired + route is being retried. During that cooldown, repeated non-healthy feedback + from the same reporter/route/service is suppressed as + `operator_retry_cooldown` instead of immediately fencing the route again. + Web-admin shows the retry/cooldown state in Fabric Links. +- C18T automatic rebuild decision contract is implemented on 2026-05-07. 
+ `RoutePathDecision` now carries `rebuild_request_id`, `rebuild_status`, + `rebuild_reason`, and `rebuild_attempt`. When fenced service-channel feedback + keeps failing outside manual retry cooldown, Control Plane records a bounded + rebuild request. If an unfenced alternate exists, the decision is marked + `rebuild_status=applied`; if not, it is + `pending_degraded_fallback` and leases expose backend relay with reason + `fabric_route_rebuild_pending_backend_relay`. Web-admin shows rebuild counts, + status, and attempts in Fabric Links. A live smoke on docker-test created + short-lived `test-1 -> test-2` bad/good routes, reported fenced feedback for + the bad route and healthy feedback for the good route, and confirmed scoped + synthetic config returned `service_channel_feedback_replacement` with + `rebuild_status=applied` and `rebuild_attempt=3`. Node/host-agent `0.2.175` + is published so agents preserve the new signed rebuild fields. +- C18U node-agent route-manager rebuild consumption is live on 2026-05-07. + Node-agent `0.2.176` now converts backend rebuild decisions into a + service-channel route-manager snapshot, counts rebuild requests/applies, + marks applied/pending-degraded routes as withdrawn, clears a withdrawn cached + selected route, and excludes withdrawn routes from new service-channel route + candidates. This keeps new flows from retrying a route that Control Plane has + already rebuilt away from. Unit coverage verifies a bad route is skipped in + favor of its replacement. Node/host-agent `0.2.176` artifacts, Docker image, + latest download aliases, release manifests, and node policies are published. + `test-1/2/3`, `usa-los-1`, and `ifcm-rufms-s-mo1cr` report `0.2.176`. + Backend `rap-backend:fabric-service-channel-0.2.176` is deployed with a + panel consistency fix: if a node reports the target version, stale failed + update status no longer overrides `version_state=current`. +- C18V route-manager churn telemetry is live on 2026-05-07. 
Node-agent + `0.2.177` adds `route_manager_transition` to the service-channel runtime + report with previous/current generation, transition status, decision counts, + withdrawn/restored route counts, pending-degraded fallback count, rebuild + applied count, and any cleared cached route. Tests cover applied rebuild + replacement, pending degraded fallback with no alternate, and restoration by + a fresh config so withdrawn routes do not become sticky local state. Artifacts, + Docker image, latest download aliases, release manifests, and node policies + are published. `test-1/2/3` run `0.2.177`; their heartbeat metadata exposes + `rap.fabric_service_channel_route_manager_transition.v1`. +- C18W live Control Plane/runtime verification is implemented and smoke-passed + on 2026-05-07. Script + `scripts/fabric/c18w-service-channel-route-manager-smoke.ps1` drives the + whole loop against docker-test API: creates temporary service-channel route + intents for `test-1 -> test-2`, injects fenced/healthy route feedback through + heartbeat, verifies scoped config emits `rebuild_status=applied`, waits for + node-agent heartbeat `route_manager_transition.status=applied_rebuild`, + expires the feedback, verifies the restored config has no rebuild decision, + and waits for `restored_by_new_config`. Result artifact: + `artifacts/c18w-service-channel-route-manager-smoke-result.json` with run + `c18w-20260507-173226`. During the smoke, operator expire exposed live pgx + parameter issues; backend `rap-backend:fabric-service-channel-0.2.179` is + deployed with safer UUID/text timestamp handling for feedback expire. +- C18X logical-channel isolation and bounded backpressure coverage is + implemented and smoke-passed on 2026-05-07. Node-agent/host-agent `0.2.180` + artifacts, Docker image, latest download aliases, release manifests, and + node policies are published. 
The key runtime fix is in + `FabricClientPacketIngress.routeCandidatesForChannel`: a channel with a local + failed-route avoid state no longer falls back to the global last selected + route, so one degraded logical flow cannot drag unrelated flows back onto the + failed path. Coverage proves independent logical-channel failover, bounded + same-channel backpressure/drop telemetry, and packet-flow hashing. Script + `scripts/fabric/c18x-service-channel-logical-channel-smoke.ps1` passes with + result artifact `artifacts/c18x-service-channel-logical-channel-smoke-result.json` + run `c18x-20260507-180647`. Test docker nodes `test-1/2/3` are running + `rap-node-agent:0.2.180`; backend remains + `rap-backend:fabric-service-channel-0.2.179`. +- C18Y route-intent lifecycle cleanup is implemented and smoke-passed on + 2026-05-07. Backend `rap-backend:fabric-service-channel-0.2.181` is deployed + on docker-test, and web-admin Fabric Links now shows route-intent lifecycle + counts/table with operator `expire` and `disable` actions. Route intents are + enriched with `lifecycle_status`, `is_expired`, and `policy_expires_at`. + Node-scoped synthetic mesh config now filters out expired policy routes, so + stale smoke routes no longer get emitted to agents for route-health probing. + API actions are available at + `POST /clusters/{clusterID}/mesh/route-intents/{routeIntentID}/expire` and + `/disable`. Script `scripts/fabric/c18y-route-intent-lifecycle-smoke.ps1` + passed against docker-test API, result + `artifacts/c18y-route-intent-lifecycle-smoke-result.json` run + `c18y-20260507-192702`. During deploy, docker-test root disk was full from + build cache/images; `docker builder prune -af` and `docker image prune -f` + freed space before redeploy. +- C18Z bounded service-channel load coverage is implemented, published, and + smoke-passed on 2026-05-07. 
Node-agent/host-agent `0.2.181` artifacts, + Docker image `rap-node-agent:0.2.181`, latest download aliases, release + manifests, and update policies are published. `test-1/2/3` are restarted on + `rap-node-agent:0.2.181`; `usa-los-1` also reports `0.2.181`. The key runtime + fix is in `FabricFlowScheduler.Snapshot`: backpressure remains visible when + bounded drops occurred, even after the queue drains. Coverage proves + multi-channel rebuild away from a withdrawn primary route and per-channel + bounded drop/high-water telemetry. Script + `scripts/fabric/c18z-service-channel-load-smoke.ps1` passed against + docker-test API, result + `artifacts/c18z-service-channel-load-smoke-result.json` run + `c18z-20260507-194616`. Release artifacts were corrected after initial + publication to use backend-relative `/downloads/...` primary URLs plus + internal/external mirror URLs, so offsite nodes resolve downloads through + their own control-plane origin such as `http://vpn.cin.su:19191`. Current + caveat: `ifcm-rufms-s-mo1cr` and `home-1` remained `version_state=failed` + at the last check; their next update plan now points to reachable `0.2.181` + artifacts, but the local updater loop still needs to retry/report success. +- C18Z1 live service-channel ingress is implemented, published, and + smoke-passed on 2026-05-07. Node-agent/host-agent `0.2.182` artifacts, + Docker image `rap-node-agent:0.2.182`, release manifests, and update + policies are published. Backend `rap-backend:fabric-service-channel-0.2.182` + is deployed on docker-test. The runtime fix is a dynamic mesh listener + handler: synthetic config refreshes now update `/mesh/v1/forward`, + service-channel ingress, production routes, delivery inbox, and forward + transport without requiring a port/listener restart. 
Backend route-feedback + latest policy now prevents a fresh healthy heartbeat from immediately + overwriting active degraded/fenced feedback before TTL expiry, so rebuild + decisions survive long enough for nodes to apply them. Script + `scripts/fabric/c18z1-live-service-channel-ingress-smoke.ps1` posts signed + generic packet batches to the running `test-1` service-channel HTTP endpoint, + waits for both entry and exit runtime configs, verifies exit inbox delivery, + injects route feedback, observes Control Plane rebuild, waits for node + `applied_rebuild`, sends a second batch over the replacement route, and + expires both temporary route intents. Result: + `artifacts/c18z1-live-service-channel-ingress-smoke-result.json` run + `c18z1-20260507-203628`. All current nodes report `0.2.182/current` at the + last check. +- C18Z2 live service-channel sustained soak/failure smoke is implemented and + passed on 2026-05-07 without a new runtime release. Script + `scripts/fabric/c18z2-live-service-channel-soak-smoke.ps1` drives signed + generic packet batches through the running `test-1` service-channel HTTP + endpoint, keeps temporary primary/alternate `test-1 -> test-2` route intents + visible, restarts the exit-node container `rap_test_node_test_2`, waits for + the exit runtime to reload synthetic config, and verifies recovery batches + reach the exit fabric inbox after the restart. Result: + `artifacts/c18z2-live-service-channel-soak-smoke-result.json` run + `c18z2-20260507-205112`: warm batches `6/6`, during-restart batches `3/3`, + recovery batches `8/8`, exit inbox depth grew from post-restart baseline + `0` to `88`, drops `0`, and both temporary route intents expired. +- C18Z3 live service-channel entry/WebSocket/degraded-fallback smoke is + implemented, published, and passed on 2026-05-07.
Node-agent/host-agent + `0.2.183` artifacts and Docker image `rap-node-agent:0.2.183` are published + to docker-test downloads; update policies for `test-1/2/3` are set to + `rolling` target `0.2.183`, and the test containers run that image. The + runtime fix makes the entry node honor the signed service-channel lease + authority: leases with `status=degraded_fallback` or + `primary_route.status=missing_route_intent` now force backend fallback instead + of reusing stale generic route candidates. The same fallback rule is applied + to HTTP and WebSocket packet ingress. Script + `scripts/fabric/c18z3-live-service-channel-entry-ws-fallback-smoke.ps1` + verifies signed HTTP warm batches, WebSocket ingress parity, entry-node + container restart while the lease exists, recovery batches over the same + lease, explicit degraded fallback for a no-route exit, and route-intent + expiry. Result: + `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json` + run `c18z3-20260507-211402`: warm `4/4`, WebSocket packets `8`, recovery + `4/4`, backend fallback queue `0 -> 8`, route failures `0`, and all checks + passed. During publication the first `0.2.183` Docker tar had a malformed + entrypoint and stale size/hash metadata; it was rebuilt, the latest tar alias + was replaced, and the release artifact row was corrected to sha256 + `231286cf5860b22cf8ca6550f67f61b0ca4b5011ab9b09995bcabbafe883fee1`, size + `7261696`. +- C18Z4 live service-channel long-session pressure smoke is implemented and + passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script + `scripts/fabric/c18z4-live-service-channel-session-pressure-smoke.ps1` opens + one signed long-lived service-channel WebSocket from `test-1` to `test-2`, + sends 48 packet batches / 384 packets, expires the primary route intent while + the WebSocket session is still active, waits for dynamic synthetic-config + refresh, and verifies the remaining packets use the alternate route. 
Result: + `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json` + run `c18z4-20260507-212748`: exit inbox depth `0 -> 384`, route failure delta + `0`, flow drop delta `0`, backend fallback queue `0 -> 0`, primary route + removed from entry/exit configs, alternate route selected after the switch, + and both route intents expired. This proves the shared Fabric Service Channel + can keep a service session alive while Control Plane changes the live route + set, without falling back to backend relay. +- C18Z5 live service-channel exit-restart smoke is implemented and passed on + 2026-05-07 without a new runtime release beyond `0.2.183`. Script + `scripts/fabric/c18z5-live-service-channel-exit-restart-smoke.ps1` keeps one + signed WebSocket service-channel session open from `test-1` to `test-2`, + sends pre-outage traffic, stops `test-2` for a bounded outage while traffic + continues, starts it again, waits for runtime readiness, then sends recovery + traffic over the same WebSocket. Result: + `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json` run + `c18z5-20260507-213745`: pre/outage/recovery batches `12/24/24`, total + packets `480`, route failure delta `48`, backend fallback queue `0 -> 192`, + flow drop delta `0`, and recovery exit inbox `0 -> 192`. This proves real + exit-node failure is visible as fallback/failure telemetry while the + long-lived service channel remains usable and fabric delivery resumes after + the exit runtime returns. After the test, `test-2` and all active cluster + nodes were healthy/current on `0.2.183`. +- C18Z6 live service-channel active rebuild smoke is implemented and passed on + 2026-05-07 without a new runtime release beyond `0.2.183`.
Script + `scripts/fabric/c18z6-live-service-channel-active-rebuild-smoke.ps1` keeps a + signed WebSocket service-channel session open from `test-1` to `test-2`, + sends pre-rebuild traffic, injects route-health feedback that marks the + primary route stale and names the alternate route as replacement, waits for + Control Plane `rebuild_status=applied`, waits for node-agent + `route_manager_transition.status=applied_rebuild`, then continues sending + over the same WebSocket. Result: + `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json` run + `c18z6-20260507-214900`: pre/post batches `16/32`, total packets `384`, + exit inbox depth `0 -> 384`, Control Plane replacement route + `b2f3c510-46d2-4dce-8389-3952a99d0311`, route failure delta `0`, flow drop + delta `0`, backend fallback queue `0 -> 0`, all checks passed, and all + active nodes remained healthy/current on `0.2.183`. This proves a live + service channel can apply a route-manager rebuild decision without rebuilding + the service WebSocket. +- C18Z7 live service-channel concurrent isolation smoke is implemented and + passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script + `scripts/fabric/c18z7-live-service-channel-concurrent-isolation-smoke.ps1` + opens three signed WebSocket service-channel sessions over the same + `test-1 -> test-2` entry/exit pair, interleaves packet batches across all + sessions, injects primary-route stale feedback, waits for Control Plane + `rebuild_status=applied` and node-agent `applied_rebuild`, then continues all + sessions over the same sockets. Result: + `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json` + run `c18z7-20260507-215727`: 3 sessions, 36 rounds, 288 packets per session, + 864 packets total, each session exit inbox depth `288`, total exit depth + `864`, backend fallback delta `0`, route failure delta `0`, flow drop delta + `0`, and all active nodes healthy/current on `0.2.183`. 
This proves rebuild + and route-manager state are shared correctly without one active service + session starving or poisoning the other concurrent sessions. +- C18Z8 live service-channel backpressure isolation smoke is implemented and + passed on 2026-05-07 without a new runtime release beyond `0.2.183`. Script + `scripts/fabric/c18z8-live-service-channel-backpressure-isolation-smoke.ps1` + opens two interactive signed WebSocket sessions plus one abusive session over + the same `test-1 -> test-2` entry/exit pair. The abusive session sends 1300 + packets on one stable 5-tuple to force a single flow shard to hit bounded + queue pressure while the interactive sessions continue sending small batches. + Result: + `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json` + run `c18z8-20260507-221347`: both interactive sessions delivered 192 packets + each, the abusive flow reached scheduler high watermark `1024`, scheduled + `1030` packets on the hottest channel, dropped `282` packets on that channel, + produced backend fallback delta `0`, route failure delta `0`, and all active + nodes stayed healthy/current on `0.2.183`. This proves bounded backpressure is + visible and isolated to the overloaded logical flow without starving other + active service sessions. +- C18Z9 route-pool runtime selection is implemented, released as node/host + agent `0.2.184`, published to docker-test downloads, and passed on + 2026-05-07. Runtime fix: when Control Plane marks a service-channel route + `rebuild_status=applied` and provides `replacement_route_id`, node-agent now + treats that replacement as the preferred route for sticky flow/channel + selection instead of merely withdrawing the bad route and falling back to + config order. Unit coverage: + `TestFabricClientPacketIngressPrefersControlPlaneReplacementOverConfigOrder`. 
+ Live script + `scripts/fabric/c18z9-live-service-channel-route-pool-smoke.ps1` creates a + route pool with slow relay primary `test-1 -> test-3 -> test-2` and fast + direct replacement `test-1 -> test-2`, keeps one signed WebSocket active, + injects stale-route feedback, waits for Control Plane and node-agent + `applied_rebuild`, then verifies the same service session continues over the + direct replacement. Result: + `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json` run + `c18z9-20260507-224901`: 54 batches / 432 packets sent and delivered to exit, + backend fallback delta `0`, route failure delta `0`, flow drop delta `0`, and + temporary route intents expired. Test containers `test-1/2/3` run + `rap-node-agent:0.2.184`; `usa-los-1`, `home-1`, and + `ifcm-rufms-s-mo1cr` remain healthy on `0.2.183` until their rollout policy is + advanced. +- C18Z10 service-channel exit-pool failover is implemented, released as + node/host-agent `0.2.185`, published to docker-test downloads, registered in + the stable update channel, and passed on 2026-05-07. Backend service-channel + leases now bind signed entry/exit pools, selected exit follows the selected + primary route, and Control Plane replacement can cross to another authorized + exit when route intents share an exit-pool/resource metadata key. Node-agent + now honors the signed lease primary route as the initial service-channel + preference before normal config-order selection. Unit coverage: + `TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateExitFromPool`, + `TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossExitPool`, + and `TestFabricClientPacketIngressUsesLeasePreferredRouteBeforeConfigOrder`. 
+ Live script + `scripts/fabric/c18z10-live-service-channel-exit-pool-smoke.ps1` creates a + primary exit route `test-1 -> test-2` and an alternate exit route + `test-1 -> test-3` in the same exit pool, keeps one signed WebSocket active, + verifies pre-rebuild traffic reaches the primary exit, injects stale-route + feedback, waits for Control Plane/node-agent `applied_rebuild`, then verifies + post-rebuild traffic reaches the alternate exit. Result: + `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json` run + `c18z10-20260507-232645`: 54 batches / 432 packets sent, primary exit queue + `144`, alternate exit queue `288`, backend fallback `0`, route failure delta + `0`, flow drop delta `0`, decision source + `service_channel_feedback_exit_pool_replacement`, and temporary route intents + expired. Backend and `test-1/2/3` are running `0.2.185`; update plans now + return download URLs on `192.168.200.61:18080` when the API is reached + directly on `18121`. +- C18Z11 service-channel entry-pool failover contract is implemented and + backend-deployed as `rap-backend:fabric-service-channel-0.2.186`; node-agent + remains `0.2.185` because no node runtime binary change was required. + Backend lease selection now keeps `selected_entry_node_id` aligned with the + selected primary route when the healthy route starts at another authorized + entry node. Route replacement scope also understands entry-pool metadata + keys (`entry_pool_id`, `service_entry_pool_id`, `fabric_entry_pool_id`) in + addition to exit-pool/resource keys, and route decision reports count + entry-pool replacement decisions. Unit coverage: + `TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateEntryFromPool` and + `TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossEntryPool`. 
+ Live script + `scripts/fabric/c18z11-live-service-channel-entry-pool-smoke.ps1` creates + primary entry route `test-1 -> test-2` and alternate entry route + `test-3 -> test-2`, verifies the initial lease uses `test-1`, sends 144 + packets, injects service-channel feedback fencing the primary entry route, + verifies a refreshed lease selects `test-3`, then sends 288 more packets + through the alternate entry to the same exit. Result: + `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json` run + `c18z11-20260507-235341`: exit queue `432`, backend fallback `0`, route + failure deltas `0/0`, flow drop deltas `0/0`, and temporary route intents + expired. This is a lease refresh/reconnect contract for entry replacement; + preserving a broken client-to-entry socket across an entry node outage is not + expected. +- C18Z12 service-channel route quality scoring is implemented and + backend-deployed as `rap-backend:fabric-service-channel-0.2.187`; node-agent + remains `0.2.185`. Backend now uses service-neutral runtime quality feedback + from `fabric_service_channel_runtime_report.ingress.flow_scheduler` when + scoring lease routes: `last_send_duration_ms` adds deterministic latency + boosts/penalties, and recent failures/stalls apply bounded penalties. This is + protocol-agnostic and applies to the shared fabric channel, not HTTP/RDP/DNS + special cases. Unit coverage: + `TestIssueFabricServiceChannelLeasePrefersFastHealthyRouteFeedback`. Live + script `scripts/fabric/c18z12-service-channel-route-quality-smoke.ps1` + creates a high-priority slow relay route `test-1 -> test-3 -> test-2` and a + lower-priority fast direct route `test-1 -> test-2`; the initial lease + selects the slow route by policy priority, then quality telemetry reports + fast route `8ms` and slow route `900ms`, and the refreshed lease selects the + fast route with score reason `service_channel_quality_latency_le_10ms`. 
+ Result: `artifacts/c18z12-service-channel-route-quality-smoke-result.json` + run `c18z12-20260508-000209`; all checks passed and temporary route intents + expired. +- C18Z13 live service-channel route quality self-learning is implemented, + released as node-agent `0.2.188`, published to docker-test downloads, + registered in the stable update channel, and deployed to docker-test + containers `test-1/2/3`. Runtime fix: positive sub-millisecond + service-channel send durations are rounded to `1ms`, preventing fast local + routes from looking like "no quality sample". Unit coverage: + `TestFabricFlowSchedulerRoundsSubMillisecondSendDuration`. Live script + `scripts/fabric/c18z13-live-service-channel-route-quality-smoke.ps1` proves + the self-learning path without heartbeat injection: initial lease picks a + higher-priority relay route, real service-channel traffic sends 24 batches / + 192 packets over the fast direct route, backend persists healthy route + feedback from the node-agent heartbeat (`last_send_duration_ms=1`, + `score_adjustment=90`), and a refreshed lease prefers that fast route over a + newly introduced higher-priority relay candidate. Result: + `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json` run + `c18z13-20260508-001610`; backend fallback `0`, flow drops `0`, temporary + route intents expired. Published release id: + `64effc62-18b6-4eeb-a1c9-f5fb8e251491`. +- C18Z14 active-session route-quality preference is implemented. Backend + `rap-backend:fabric-service-channel-0.2.190` and node-agent `0.2.189` are + deployed to docker-test `test-1/2/3`; node-agent `0.2.189` is published to + docker-test downloads and registered in the stable update channel as release + `9bda9bac-71f3-4e8f-ae70-2abccb1cb866`. Backend now decays older healthy + service-channel feedback before lease scoring so stale success loses weight + before expiry. 
Node-agent consumes healthy route-quality observations from + signed synthetic config and can override sticky per-flow/config-order route + choice when a learned route is significantly better. Unit coverage: + `TestFabricClientPacketIngressQualityPreferenceOverridesStickyRoute` and + `TestIssueFabricServiceChannelLeaseDecaysOlderHealthyRouteFeedback`. Live + script + `scripts/fabric/c18z14-live-service-channel-active-quality-shift-smoke.ps1` + keeps one signed WebSocket open while route policy changes: it starts on a + higher-priority relay route, expires that route, sends real traffic through + the fast direct route to teach feedback, introduces a new higher-priority + relay candidate, and verifies the same active session stays on the learned + fast route. Result: + `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json` + run `c18z14-20260508-071644`; 60 batches / 480 packets delivered, backend + fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z15 effective route-quality score telemetry is implemented. Backend + `rap-backend:fabric-service-channel-0.2.191` is deployed on docker-test, and + node-agent `0.2.190` is built, published to docker-test downloads, registered + in the stable update channel, and deployed to `test-1/2/3`. Published release + id: `2e4cd0c8-2480-4637-b845-6dcb115dbebd`. Backend feedback reports now + include decayed `effective_score_adjustment` alongside raw + `score_adjustment`; node-agent consumes the effective score for active + route-quality preference and exposes sorted `route_quality_preferences` in + runtime telemetry with raw/effective score and decay reasons. Unit coverage: + `TestFabricClientPacketIngressQualityPreferenceUsesEffectiveScore` and + `TestServiceChannelRouteFeedbackReportIncludesEffectiveDecayedScore`. 
Live + script + `scripts/fabric/c18z15-live-service-channel-effective-quality-smoke.ps1` + verifies route-quality preference telemetry, effective score visibility, and + decayed effective score visibility after the active-session quality-shift + scenario. Result: + `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json` + run `c18z14-20260508-073538`; 60 batches / 480 packets delivered, backend + fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z16 per-channel route-quality fairness telemetry is implemented. Node-agent + `0.2.191` is built, published to docker-test downloads, registered in the + stable update channel, and deployed to `test-1/2/3`; backend remains + `rap-backend:fabric-service-channel-0.2.191`. Published release id: + `f072759c-5c3b-4ba0-936a-f59b6d3d7632`. Flow-scheduler channel stats now + expose the applied `quality_preference_route_id`, effective/raw preference + score, and preference reasons, so operators can see which logical channels + actually used learned route quality. Unit coverage: + `TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness`. + Live script + `scripts/fabric/c18z16-live-service-channel-quality-fairness-smoke.ps1` + validates multi-channel quality-preference fairness after the active-session + route-quality shift. Result: + `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json` + run `c18z14-20260508-074943`; 60 batches / 480 packets delivered, 32 served + logical channels, 32 channels with quality preference applied, backend + fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z17 stale route-quality marker cleanup is implemented. Node-agent + `0.2.192` is built, published to docker-test downloads, registered in the + stable update channel, and deployed to `test-1/2/3`; backend remains + `rap-backend:fabric-service-channel-0.2.191`. Published release id: + `846881bd-e7e0-4212-b8c9-4a6012c6eff7`. 
Flow-scheduler channel stats now + clear quality preference markers when the preference is no longer in the + effective preference set or when the route manager withdraws that route. Unit + coverage: + `TestFabricClientPacketIngressClearsStaleQualityPreferenceMarkers` and + `TestFabricClientPacketIngressClearsWithdrawnQualityPreferenceMarkers`. + Live script + `scripts/fabric/c18z17-live-service-channel-quality-cleanup-smoke.ps1` + verifies cleanup after the active-session quality/fairness scenario. Result: + `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json` + run `c18z14-20260508-075750`; 60 batches / 480 packets delivered, active + quality markers `32`, stale quality markers `0`, visible preferences `3`, + backend fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z18 service-session-scoped flow scheduler memory is implemented. + Node-agent `0.2.193` is built, published to docker-test downloads, + registered in the stable update channel, and deployed to `test-1/2/3`; + backend remains `rap-backend:fabric-service-channel-0.2.191`. Published + release id: `05a3d29e-8a62-4bc8-84a3-1d00b794b9c9`. Runtime-sent flow + scheduler channel keys now include the VPN/service session: + `vpn:{vpnConnectionID}:flow-NN`. This keeps route memory, failed-route + avoidance, served/drop counters, and route-quality markers isolated when + several service-channel sessions share one entry/exit and hash to the same + logical flow shard. Unit coverage: + `TestFabricClientPacketIngressIsolatesRouteMemoryPerVPNConnection` and + `TestFabricClientPacketIngressQualityPreferencePreservesMultiChannelFairness`. + Live script + `scripts/fabric/c18z18-service-channel-session-scoped-fairness-smoke.ps1` + wraps the live C18Z17 quality path and verifies served live channels are + session-scoped, unscoped served `flow-NN` channels are absent, quality + markers are session-scoped, backend fallback is `0`, and flow drops are `0`. 
+ Result: + `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json` + run `c18z14-20260508-082520`; 60 batches / 480 packets delivered, served + channels `32`, session-scoped served channels `32`, session-scoped quality + channels `32`, unscoped served channels `0`, backend fallback `0`, flow drops + `0`, temporary route intents expired. +- C18Z19 bounded parallel logical-flow send window is implemented. Node-agent + `0.2.194` is built, published to docker-test downloads, registered in the + stable update channel, and deployed to `test-1/2/3`; backend remains + `rap-backend:fabric-service-channel-0.2.191`. Published release id: + `926e5b84-4b0b-4f47-b1fe-798d8105679f`. The live node-agent runtime enables + `MaxParallelFlowSends=4`, so independent scheduled logical channels can send + concurrently instead of one slow channel blocking all following channels. + This remains service-neutral and does not inspect HTTP/RDP/DNS/application + traffic. Telemetry now exposes `max_parallel_flow_sends` and + `send_flow_parallel_batches`. Unit coverage: + `TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`. + Live script + `scripts/fabric/c18z19-service-channel-parallel-flow-window-smoke.ps1` wraps + the C18Z18 live route-quality/session-scoped path and verifies the parallel + window is enabled and observed while backend fallback and flow drops stay at + zero. Result: + `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json` + run `c18z14-20260508-084133`; 60 batches / 480 packets delivered, + `max_parallel_flow_sends=4`, `send_flow_parallel_batches=60`, served + channels `32`, session-scoped quality channels `32`, backend fallback `0`, + flow drops `0`, temporary route intents expired. +- C18Z20 per-channel latency/retry/in-flight telemetry and adaptive recommended + send-window telemetry are implemented. 
Node-agent `0.2.195` is built, + published to docker-test downloads, registered in the stable update channel, + and deployed to `test-1/2/3`; backend remains + `rap-backend:fabric-service-channel-0.2.191`. Published release id: + `b9e198e0-e012-4600-ad14-856820aff41c`. Scheduler telemetry now includes + global `in_flight`, `max_in_flight`, slow/failing channel counts, and + per-channel `send_attempts`, `send_successes`, `send_failures`, + `in_flight`, `max_in_flight`, and latency buckets. Ingress telemetry now + includes `recommended_parallel_flow_sends`; the recommendation shrinks under + bounded drops, degraded fallback recommendations, repeated failures, or + slow/stalled channels. Unit coverage: + `TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure` and + `TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`. + Live script + `scripts/fabric/c18z20-service-channel-adaptive-window-telemetry-smoke.ps1` + wraps the C18Z19 live path and verifies the new telemetry on real docker-test + nodes. Result: + `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json` + run `c18z14-20260508-085635`; 60 batches / 480 packets delivered, + `max_parallel_flow_sends=4`, `recommended_parallel_flow_sends=4`, + `scheduler_max_in_flight=4`, attempts/success/latency visible on 32 channels, + backend fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z21 rolling per-channel/session quality windows are implemented. + Node-agent `0.2.196` is built, published to docker-test downloads, + registered in the stable update channel, and deployed to `test-1/2/3`; + backend remains `rap-backend:fabric-service-channel-0.2.191`. Published + release id: `813b2050-4d4e-444c-9bde-72b1d1f7dd35`. Scheduler decisions now + use a bounded fresh quality window instead of lifetime-only drop/failure + counters, so old pressure rolls out after newer successful samples. 
Telemetry + now exposes scheduler-level `quality_window_sample_count`, + `quality_window_failure_count`, `quality_window_slow_count`, + `quality_window_drop_count`, and per-channel success/failure/slow/drop sample + counts, average latency, and last update time. Unit coverage: + `TestFabricFlowSchedulerRollingQualityWindowForgetsOldPressure`, + `TestFabricFlowSchedulerRecommendsSmallerWindowUnderPressure`, and + `TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel`. + Live script + `scripts/fabric/c18z21-service-channel-rolling-quality-window-smoke.ps1` + wraps the C18Z20 live path and verifies the rolling-window telemetry on real + docker-test nodes. Result: + `artifacts/c18z21-service-channel-rolling-quality-window-smoke-result.json` + run `c18z14-20260508-091952`; 60 batches / 480 packets delivered, + scheduler quality-window samples `480`, failures `0`, drops `0`, window + samples/success/latency visible on 32 channels, `recommended_parallel_flow_sends=4`, + backend fallback `0`, flow drops `0`, temporary route intents expired. +- C18Z22 backend durable route feedback now consumes the rolling quality + window from node-agent heartbeat metadata. Backend + `rap-backend:fabric-service-channel-0.2.197` is built and deployed on + docker-test; node-agent remains `0.2.196` on `test-1/2/3`. For agents that + expose `quality_window_*`, backend uses fresh rolling failure/drop/slow + counts and rolling average latency when creating `fabric_service_channel` + route feedback; old `last_failed_route_id`, `consecutive_failures`, and + `stall_count` remain fallback inputs for older agents only. This prevents old + route failures from dominating durable scoring after the channel has recovered + with a clean rolling window. Unit coverage: + `TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback` and + `TestRecordHeartbeatPersistsServiceChannelRouteFeedbackForLaterLease`. 
+ Live script + `scripts/fabric/c18z22-service-channel-rolling-feedback-smoke.ps1` wraps the + C18Z21 live path and verifies persisted route feedback contains + `service_channel_rolling_quality_window` plus payload `quality_window_*` + fields. Result: + `artifacts/c18z22-service-channel-rolling-feedback-smoke-result.json` run + `c18z14-20260508-093100`; 60 batches / 480 packets delivered, route feedback + count `1`, rolling feedback count `1`, healthy rolling feedback count `1`, + rolling payload count `1`, backend fallback `0`, flow drops `0`. +- C18Z23 recovery hysteresis is implemented for recovered service-channel + routes. Backend `rap-backend:fabric-service-channel-0.2.198` is built and + deployed on docker-test; node-agent remains `0.2.196` on `test-1/2/3`. + When a route has an operator-expire/manual retry cooldown from prior fenced + feedback but now also has healthy rolling-window feedback, backend re-admits + the route as `authorized` while applying a bounded recovery hysteresis score + penalty (`150`) and `service_channel_recovery_hysteresis` reason. This keeps + recovered routes available as alternates without immediately displacing a + steady route, which reduces route-selection flapping. Unit coverage: + `TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown` + and `TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback`. Live + script + `scripts/fabric/c18z23-service-channel-recovery-hysteresis-smoke.ps1` wraps + the C18Z22 live path and verifies backend `0.2.198`, rolling feedback, and + clean live forwarding. Result: + `artifacts/c18z23-service-channel-recovery-hysteresis-smoke-result.json` run + `c18z14-20260508-094111`; 60 batches / 480 packets delivered, backend + fallback `0`, flow drops `0`, recovery hysteresis penalty `150`. +- C18Z24 recovery visibility is implemented for service-channel route + diagnostics. 
Backend `rap-backend:fabric-service-channel-0.2.199` is built + and deployed on docker-test; node-agent remains `0.2.196` on `test-1/2/3`. + Route feedback API responses and node-scoped service-channel feedback reports + now expose `recovery_state`, `recovery_hysteresis_active`, and + `recovery_hysteresis_penalty`, while route path decision reports count + `recovery_hysteresis_count`. Admin diagnostics now show recovered/hysteresis + chips and a recovery column beside route feedback status. Unit coverage: + `TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown`, + `TestServiceChannelRouteFeedbackReportExposesRecoveryState`, and + `TestRoutePathDecisionReportCountsRecoveryHysteresis`. Smoke result: + `artifacts/c18z24-service-channel-recovery-visibility-smoke-result.json`; + route feedback API exposed recovery shape for 109 observations, backend + image `0.2.199` was live, and the web-admin build was published to + `rap_web_admin`. +- C18Z25 recovery promotion policy is implemented. Backend + `rap-backend:fabric-service-channel-0.2.200` is built and deployed on + docker-test; node-agent remains `0.2.196`. A route under manual retry + cooldown remains `recovered` with hysteresis penalty until it reports at + least 64 clean rolling-window samples (`success >= 64`, failures/slow/drops + zero). After that it is promoted back to steady `healthy`, gets + `recovery_promoted=true`, `service_channel_recovery_promoted`, and no + hysteresis penalty. Admin/API now expose promoted counts/flags alongside + recovered/hysteresis state. Smoke result: + `artifacts/c18z25-service-channel-recovery-promotion-smoke-result.json`; + backend image `0.2.200` was live and route-feedback API exposed recovery + state for 109 observations. +- C18Z26 recovery demotion policy is implemented. Backend + `rap-backend:fabric-service-channel-0.2.201` is built and deployed on + docker-test; node-agent remains `0.2.196`. 
If a previously recovered or + promoted route under retry cooldown reports fresh rolling failures, drops, + slow samples, degraded fallback, rebuild recommendation, or fenced feedback, + backend now exposes `recovery_demoted=true` with a concrete + `recovery_reason` such as `service_channel_recovery_demoted_failure`, + `..._slow`, `..._rebuild`, or `..._fenced`. Route score reasons include + `service_channel_recovery_demoted` and the specific demotion reason, and + route path decision reports count `recovery_demoted_count`. Admin diagnostics + now show demoted feedback/path chips and the demotion reason. Smoke result: + `artifacts/c18z26-service-channel-recovery-demotion-smoke-result.json`; + backend image `0.2.201` was live and route-feedback API exposed recovery + state for 109 observations. +- C18Z27 recovery policy tuning is implemented. Backend + `rap-backend:fabric-service-channel-0.2.202` is built and deployed on + docker-test; node-agent remains `0.2.196`. Effective service-channel + recovery policy now has a strict default contract and optional cluster + metadata override at `fabric_service_channel_recovery_policy`. API endpoints + `GET/PUT /clusters/{clusterID}/fabric/service-channels/recovery-policy` + expose and update hysteresis penalty, promotion minimum samples, demotion + thresholds for failures/drops/slow samples, and rebuild/fenced demotion + toggles. Lease route selection, route feedback reports, and node-scoped + synthetic config feedback consume the effective policy. Web-admin shows and + edits the policy in the service-channel route feedback card. Smoke result: + `artifacts/c18z27-service-channel-recovery-policy-smoke-result.json`; live + API updated policy values, then restored strict defaults + (`penalty=150`, `promotion_min_samples=64`, demotion thresholds `1`). +- C18Z28 recovery policy provenance is implemented. Backend + `rap-backend:fabric-service-channel-0.2.203` is built and deployed on + docker-test; node-agent remains `0.2.196`. 
`FabricServiceChannelRoute`, + `FabricServiceChannelLease`, signed lease authority payloads, + service-channel route feedback reports, and route path decision reports now + carry the effective recovery policy used for scoring and recovery decisions. + This makes every primary/alternate/fallback choice auditable against the + policy source and thresholds that produced it. Web-admin node diagnostics + show the service-channel feedback policy and route decision policy source. + Smoke result: + `artifacts/c18z28-service-channel-recovery-policy-provenance-smoke-result.json`; + live synthetic config and live lease issuance both exposed recovery policy + provenance on docker-test. +- C18Z29 feedback provenance guardrails are implemented. Backend + `rap-backend:fabric-service-channel-0.2.204` is built and deployed on + docker-test; node-agent remains `0.2.196`. Recovery policy now has a stable + fingerprint. Backend recognizes optional runtime feedback provenance fields + (`recovery_policy_fingerprint`, `route_generation`, `route_policy_version`, + `policy_version`), exposes observed/effective fingerprints/generations on + route feedback observations, and reports missing/stale counters. Explicit + stale policy/generation feedback is scored conservatively, cannot fence a + current route, and cannot request rebuild/demotion; missing provenance stays + compatible for current old agents but is visible in diagnostics. Web-admin + shows provenance warnings in service-channel feedback. Smoke result: + `artifacts/c18z29-service-channel-feedback-provenance-guard-smoke-result.json`. +- C18Z30 node-agent feedback provenance is implemented. Backend + `rap-backend:fabric-service-channel-0.2.209` and node-agent `0.2.208` are + built and deployed on docker-test (`test-1/2/3`). 
Node-agent now preserves the + signed synthetic config contract for recovery feedback/route decision fields + and records per-flow `recovery_policy_fingerprint`, `route_policy_version`, + and `route_generation` at send time, so feedback remains auditable even after + route churn/expiry. Backend heartbeat parsing now preserves those fields into + durable service-channel feedback payloads. Live smoke passed with 28/28 + runtime channel stats carrying provenance, 3/3 feedback observations carrying + provenance, and no missing/stale provenance counters. Artifacts: + `artifacts/c18z30-node-telemetry-provenance-live-smoke-base-result.json` and + `artifacts/c18z30-node-agent-feedback-provenance-smoke-result.json`. +- C18Z31 service-channel rebuild ledger is implemented. Backend + `rap-backend:fabric-service-channel-0.2.211` is built and deployed on + docker-test; node-agent remains `0.2.208` on `test-1/2/3`. Backend now keeps + durable route rebuild attempt history in + `fabric_service_channel_route_rebuild_attempts`, upserted from synthetic + config route decisions when service-channel feedback requests rebuild. The + ledger stores trigger/rebuild status, old route, selected replacement, + policy fingerprint, generation, feedback status/reasons, latency/failure + counters, outcome, and compact decision payload. API endpoint + `GET /clusters/{clusterID}/fabric/service-channels/rebuild-attempts` exposes + the history; web-admin loads it into Service-channel route feedback + diagnostics as a rebuild ledger table. Migration `000026` is applied on + docker-test. Live smoke passed: + `artifacts/c18z31-base-active-rebuild-smoke-result.json` and + `artifacts/c18z31-service-channel-rebuild-ledger-smoke-result.json`. +- C18Z32 service-channel rebuild timeline is implemented. Backend + `rap-backend:fabric-service-channel-0.2.213` is built and deployed on + docker-test; node-agent remains `0.2.208` on `test-1/2/3`. 
The rebuild + attempts API now enriches durable ledger rows with node-agent heartbeat + correlation: matching `route_manager_transition`, route-generation apply or + withdrawn decision, post-rebuild selected route, flow packet/drop/failure + counters, and a compact chronological `timeline` with + `backend_decision`, `node_route_generation_apply`, + `node_route_manager_transition`, and `post_rebuild_traffic` stages. Matching + is generation-strict when the backend attempt has a generation, preventing + stale transition/status matches. Web-admin rebuild ledger shows backend, + agent, route-generation, and traffic columns. Live smoke passed: + `artifacts/c18z32-base-rebuild-ledger-smoke-result.json` and + `artifacts/c18z32-service-channel-rebuild-timeline-smoke-result.json`. +- C18Z33 service-channel rebuild guardrails are implemented. Backend + `rap-backend:fabric-service-channel-0.2.214` is built and deployed on + docker-test; node-agent remains `0.2.208`. Rebuild attempts API now adds + computed guard fields: `guard_status`, `guard_severity`, `guard_reason`, + age, and transition/traffic deadlines. Successful correlated rebuilds report + `guard_status=ok`, `guard_severity=good`; missing node transition, + route-generation correlation, post-rebuild traffic, unexpected selected + route, or post-rebuild drops/failures surface as warn/bad states. Web-admin + shows guard chips and counts in the service-channel rebuild ledger. Live + smoke passed: `artifacts/c18z33-base-rebuild-ledger-smoke-result.json` and + `artifacts/c18z33-service-channel-rebuild-guard-smoke-result.json`. +- C18Z34 service-channel rebuild health summary is implemented. Backend + `rap-backend:fabric-service-channel-0.2.215` is built and deployed on + docker-test; node-agent remains `0.2.208`. 
New endpoint + `GET /clusters/{clusterID}/fabric/service-channels/rebuild-health` returns a + cluster-level operational summary over the durable rebuild ledger/timeline: + counts by guard status/severity, applied/pending counts, affected reporter + nodes/routes, most recent bad attempts, and recommended operator action. + Web-admin shows the summary as a Rebuild health subpanel above the rebuild + ledger. Live smoke passed: + `artifacts/c18z34-base-rebuild-guard-smoke-result.json` and + `artifacts/c18z34-service-channel-rebuild-health-smoke-result.json`. +- C18Z35 service-channel rebuild alert silence lifecycle is implemented. + Backend `rap-backend:fabric-service-channel-0.2.216` is built and deployed on + docker-test; node-agent remains `0.2.208`. Migration `000027` creates + `fabric_service_channel_rebuild_alert_silences`, applied on docker-test. New + API `POST /clusters/{clusterID}/fabric/service-channels/rebuild-health/silences` + records bounded operator silence for an exact alert fingerprint: + reporter node, route, guard status, and generation. Rebuild health now + separates total bad/warn from active bad/warn and silenced counts; silenced + alerts are omitted from affected nodes/routes and active bad attempt lists. + A new generation, route, or reporter remains active by design. Web-admin + exposes `silence 6h` on active bad rebuild-health rows. Live smoke passed: + `artifacts/c18z35-base-rebuild-health-smoke-result.json` and + `artifacts/c18z35-service-channel-rebuild-alert-silence-smoke-result.json`. +- C18Z36 service-channel rebuild alert resurfacing is implemented. Backend + `rap-backend:fabric-service-channel-0.2.217` is built and deployed on + docker-test; node-agent remains `0.2.208`. Rebuild health marks active + bad/warn attempts as `alert_resurfaced` when an active silence exists for the + same reporter node, route, and guard status but a different generation. 
The + summary exposes `resurfaced_count` and `resurfaced_attempts`, including the + previous silenced generation and silence expiry. Web-admin shows a resurfaced + chip/table and allows silencing the new generation separately. Live smoke + passed: `artifacts/c18z36-base-rebuild-health-smoke-result.json` and + `artifacts/c18z36-service-channel-rebuild-alert-resurface-smoke-result.json`. +- C18Z37 service-channel readiness gate is implemented. Backend + `rap-backend:fabric-service-channel-0.2.218` is built and deployed on + docker-test; node-agent remains `0.2.208`. New endpoint + `GET /clusters/{clusterID}/fabric/service-channels/readiness` returns a fast + recent-window verdict: `clean`, `degraded`, or `blocked`, with active + bad/warn counts, resurfaced/silenced counts, missing transition, + route-generation, post-rebuild traffic, unexpected-route, and post-rebuild + degraded counters plus blocking/degraded reasons and recommended operator + action. Web-admin shows this as a top-level readiness panel in + Service-channel route feedback. Readiness and default admin health queries + are intentionally capped to a small recent window so the operator view stays + responsive after many rebuild attempts; deep ledger diagnostics remain a + separate next layer. Live smoke passed: + `artifacts/c18z37-base-rebuild-health-smoke-result.json` and + `artifacts/c18z37-service-channel-readiness-smoke-result.json`. +- C18Z38 service-channel rebuild ledger enrichment split is implemented. + Backend `rap-backend:fabric-service-channel-0.2.219` is built and deployed + on docker-test; node-agent remains `0.2.208`. The rebuild attempts API now + defaults to `enrichment=summary`, returning durable ledger rows without the + expensive heartbeat/timeline guard correlation. Operators can request + `enrichment=deep` explicitly for per-route investigation. 
Web-admin defaults + to the fast ledger, shows timeline/guard fields as deep-only in summary mode, + and provides a manual deep ledger toggle. C18Z32/C18Z33 smokes now request + deep enrichment. Live smoke passed: + `artifacts/c18z38-service-channel-rebuild-ledger-enrichment-smoke-result.json`. +- C18Z39 service-channel rebuild ledger drilldown is implemented. Backend + `rap-backend:fabric-service-channel-0.2.220` is built and deployed on + docker-test; node-agent remains `0.2.208`. The rebuild attempts API now + accepts `generation` and `offset`, allowing narrow deep investigations by + reporter node, route, service class, and route generation with bounded + pagination. Web-admin adds rebuild ledger filters for reporter/route/ + generation/service plus prev/next paging in deep mode. Live smoke passed: + `artifacts/c18z39-service-channel-rebuild-ledger-drilldown-smoke-result.json`. +- C18Z40 service-channel rebuild incident grouping is implemented. Backend + `rap-backend:fabric-service-channel-0.2.222` is built and deployed on + docker-test; node-agent remains `0.2.208`. New endpoint + `GET /clusters/{clusterID}/fabric/service-channels/rebuild-incidents` + groups the bounded recent rebuild window by reporter node, route, service + class, generation, and guard status, exposing first/last seen, attempt count, + latest guard/replacement/outcome, silence/resurface flags, and recommended + action. The incident window is capped to 5 to keep default admin refresh + bounded; broader investigation still uses filtered deep ledger. Web-admin + shows a Rebuild incidents list and `open deep` loads the exact filtered deep + ledger slice for that incident. Live smoke passed: + `artifacts/c18z40-service-channel-rebuild-incidents-smoke-result.json`. +- C18Z41 service-channel rebuild incident actions are implemented. Backend + `rap-backend:fabric-service-channel-0.2.223` is built and deployed on + docker-test; node-agent remains `0.2.208`. 
New API + `POST /clusters/{clusterID}/fabric/service-channels/rebuild-incidents/investigations` + records an audit event when an operator opens a deep rebuild investigation. + Web-admin incident rows now expose `open deep` with audit and `silence 6h` + using the incident fingerprint fields; after silence the panel refreshes only + rebuild health/readiness/incidents instead of the whole cluster scope. Live + smoke passed: + `artifacts/c18z41-service-channel-rebuild-incident-actions-smoke-result.json`. +- C18Z42 service-channel rebuild correlation snapshots are implemented. + Backend `rap-backend:fabric-service-channel-0.2.224` is built and deployed + on docker-test; node-agent remains `0.2.208`. Migration `000028` adds + durable correlation/guard snapshot columns to + `fabric_service_channel_route_rebuild_attempts`, including node transition, + route-generation, post-rebuild traffic, guard status/severity/reason, + compact timeline, and `correlation_snapshot_at`. Deep enrichment now writes + the snapshot once; later deep/readiness/health/incidents reuse it and only + recompute age-sensitive guard state without scanning heartbeat history. + External summary ledger still strips guard/timeline fields to preserve the + fast C18Z38 contract. On docker-test, applying `000028` manually was required + before smoke because this manual backend redeploy path does not auto-apply + migrations. Live smoke passed twice; after warm snapshot timings were roughly + summary 92 ms, deep 2 ms, incidents 2 ms: + `artifacts/c18z42-service-channel-rebuild-correlation-snapshot-smoke-result.json`. +- C18Z43 service-channel schema preflight is implemented. Backend + `rap-backend:fabric-service-channel-0.2.225` is built and deployed on + docker-test; web-admin is redeployed. New endpoint + `GET /clusters/{clusterID}/fabric/service-channels/schema-status` checks the + DB relation/columns required by migration `000028` before operators rely on + rebuild health/readiness/incidents. 
Web-admin shows a Fabric schema preflight + panel beside service-channel readiness, with required/missing check counts and + operator action. Live smoke passed: + `artifacts/c18z43-service-channel-schema-preflight-smoke-result.json`. +- C18Z44 service-channel rebuild snapshot warmup is implemented. Backend + `rap-backend:fabric-service-channel-0.2.226` is built and deployed on + docker-test; web-admin is redeployed. New endpoint + `POST /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/warmup` + performs a bounded proactive pass over recent rebuild attempts. It fills + missing correlation snapshots, counts stale snapshots, and defers heavy stale + rescans because age-sensitive guard state is already recomputed from cached + snapshots on read. Web-admin adds a `warm snapshots` action and displays + warmed/fresh/missing/stale/deferred/error counts. Live smoke passed: + `artifacts/c18z44-service-channel-rebuild-snapshot-warmup-smoke-result.json`. +- C18Z45 service-channel rebuild snapshot auto-warmup is implemented. Backend + `rap-backend:fabric-service-channel-0.2.227` is built and deployed on + docker-test; node-agent remains `0.2.208`. Heartbeat processing now performs a + bounded missing-snapshot maintenance pass for the reporting node's recent + rebuild attempts. It only persists a snapshot when the heartbeat contains + runtime evidence such as post-rebuild traffic or matched route-manager/ + route-generation state, preventing backend-only timelines from becoming stale + cache entries. Auto-warmup writes an audit event + `fabric.service_channel_rebuild_snapshot.auto_warmup` with trigger, heartbeat, + warmed route IDs, generations, rebuild IDs, counts, and errors. Live smoke + passed: + `artifacts/c18z45-service-channel-rebuild-snapshot-auto-warmup-smoke-result.json`. +- C18Z46 service-channel rebuild snapshot maintenance health is implemented. 
+ Backend `rap-backend:fabric-service-channel-0.2.228` is built and deployed + on docker-test; web-admin is redeployed. New endpoint + `GET /clusters/{clusterID}/fabric/service-channels/rebuild-snapshots/health` + exposes bounded snapshot-cache maintenance status: recent attempt count, + valid/missing/overdue runtime-evidence snapshots, heartbeat threshold, latest + auto-warmup audit summary, and per-node warmed/error/missing counts. Web-admin + adds a `Snapshot maintenance` panel beside schema/readiness. Live smoke + passed: + `artifacts/c18z46-service-channel-rebuild-snapshot-health-smoke-result.json`. +- C18Z47 service-channel signed lease enforcement is implemented. Node-agent + release `0.2.230` is built, published under `/downloads`, registered as the + active `rap-node-agent` dev release, and deployed on docker-test + `test-1/2/3`; all three report `0.2.230`, healthy, and current after policy + update. When a cluster authority public key is pinned, the node-agent now + rejects unsigned `rap_fsc_*` service-channel requests and requires the + signed `rap.fabric_service_channel_lease_authority.v1` payload/signature + headers. Legacy unsigned tokens remain accepted only in unpinned test mode. + Live smoke proved unsigned POST is rejected with 403 while signed lease POST + is accepted with 202: + `artifacts/c18z47-service-channel-signed-lease-enforcement-smoke-result.json`. +- C18Z48 service-channel backend introspection compatibility is implemented. + Backend `rap-backend:fabric-service-channel-0.2.231` is built/deployed on + docker-test. Node-agent/host-agent artifacts `0.2.232` are published under + `/downloads`; `rap-node-agent` release `0.2.232` is registered and deployed + on `test-1/2/3`, and all three report healthy/current. When signed + service-channel authority headers are absent but cluster authority is pinned, + node-agent now calls backend lease introspection before accepting an unsigned + token. Bad tokens are still rejected. 
Live smoke passed: + `artifacts/c18z48-service-channel-introspection-smoke-result.json`. +- C18Z49 service-channel acceptance telemetry is implemented in node-agent + `0.2.232`. Each accepted Fabric Service Channel ingress records + `accepted_by=signed|introspection|legacy_unsigned`, route preference, and + backend-fallback state in structured node logs. HTTP packet ingress also + returns `X-RAP-Service-Channel-Accepted-By` for smoke/diagnostics. +- C18Z50 durable service-channel lease introspection is implemented. Migration + `000029_fabric_service_channel_leases` adds a durable lease table keyed by + cluster/channel and stores only `token_hash` plus a scrubbed lease payload + with the raw bearer token removed. Backend + `rap-backend:fabric-service-channel-0.2.233` is built/deployed on + docker-test after applying the migration. Introspection now reads memory + first, then durable storage, so compatibility clients survive backend + restart. Live smoke restarted `rap_test_backend`, accepted the unsigned token + through introspection, rejected a bad token, and verified the durable lease + omits the raw token: + `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`. +- C18Z51 service-channel lease maintenance is implemented. Backend + `rap-backend:fabric-service-channel-0.2.234` is built/deployed on + docker-test. New endpoints list durable service-channel lease maintenance + state and run bounded expired-lease cleanup: + `GET /clusters/{clusterID}/fabric/service-channels/leases` and + `POST /clusters/{clusterID}/fabric/service-channels/leases/cleanup`. + Web-admin adds a `Service-channel leases` panel with active/expired counts, + recent lease rows, and cleanup action. Live smoke issued a 1-second lease, + observed it as expired, cleaned it up, and verified it disappeared: + `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`. +- C18Z52 service-channel access telemetry visibility is implemented. 
Backend + `rap-backend:fabric-service-channel-0.2.235` is built/deployed on + docker-test; node-agent/host-agent `0.2.235` artifacts are published under + `/downloads`, registered as active dev releases, and deployed on + `test-1/2/3`. Node-agent now reports accepted service-channel ingress + counters by `signed`, `introspection`, and `legacy_unsigned`, including + backend-fallback count and last accepted timestamp. Backend exposes + `GET /clusters/{clusterID}/fabric/service-channels/access-telemetry`, + reading telemetry observations with heartbeat metadata fallback. Web-admin + adds a `Service-channel access` panel with cluster totals and per-node rows. + Live smoke sent packets through test-1, observed + `X-RAP-Service-Channel-Accepted-By: introspection`, and verified backend + aggregate visibility: + `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`. +- C18Z53 service-channel access/session correlation is implemented. Backend + `rap-backend:fabric-service-channel-0.2.236` is built/deployed on + docker-test; node-agent remains `0.2.235`. The access telemetry endpoint now + correlates accepted ingress counters with active durable service-channel + leases, selected entry/exit nodes, primary route status, explicit backend + fallback, and latest route-quality feedback when a route exists. Web-admin's + `Service-channel access` panel now shows active channel rows before per-node + counters, so operators can see whether a live service channel is using normal + route quality feedback or degraded backend fallback. Live smoke created an + active lease, sent ingress traffic through test-1, and verified active + channel correlation plus fallback visibility: + `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`. +- C18Z54 normal-route access correlation is smoke-proven on the existing + C18Z53 backend/admin surface. 
New smoke creates a temporary direct + `vpn_packets` route intent, injects healthy route-quality heartbeat + telemetry, issues a service-channel lease that selects the normal primary + route, sends ingress traffic, and verifies the access telemetry active + channel row is `ready`, not backend fallback, with `route_feedback_status` + `healthy`, rolling quality counters, and last send duration: + `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`. +- C18Z55 degraded normal-route access correlation is smoke-proven on the same + backend/admin surface. The smoke first issues a lease on a normal primary + `vpn_packets` route, then injects degraded/fenced route-quality heartbeat + feedback for that already-selected route. Access telemetry correctly reports + the active channel as `ready` and `force_backend_fallback=false`, while route + feedback is `fenced`, rolling failure/drop/slow counters are visible, and the + aggregate access status becomes `degraded` because `degraded_route_count > 0`: + `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`. +- C18Z56 active-channel remediation diagnostics are implemented. Backend + `rap-backend:fabric-service-channel-0.2.237` is built/deployed on + docker-test; node-agent remains `0.2.235`. Active access telemetry channel + rows now include `remediation_action`, `remediation_reason`, + `remediation_route_id`, `remediation_route_status`, and an operator hint. + Decisions distinguish explicit backend fallback, degraded/fenced normal + route with an authorized alternate (`prefer_alternate_route`), degraded/fenced + route needing rebuild (`rebuild_route`), and healthy route (`none`). + Web-admin shows the remediation action in the `Service-channel access` + active-channel table. C18Z55 smoke now verifies + `remediation_action=rebuild_route`; backend unit coverage verifies the + alternate-route remediation branch. +- C18Z56 alternate-route remediation is also live-smoke-proven. 
New smoke + creates primary and authorized alternate `vpn_packets` routes, issues a lease + while primary is still healthy/selected, then injects fenced feedback for the + selected primary. Access telemetry keeps the active channel on the normal + route with `force_backend_fallback=false`, reports `route_feedback_status` + `fenced`, and recommends `remediation_action=prefer_alternate_route` with the + alternate route id/status; `degraded_fallback_channel_count` stays zero: + `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`. +- C18Z57 bounded remediation command contract is implemented. Backend + `rap-backend:fabric-service-channel-0.2.238` is built/deployed on + docker-test; node-agent remains `0.2.235`. Active access telemetry channel + rows now include `remediation_command` for non-noop remediation actions, with + schema version, deterministic command id, action, channel/resource/service, + entry/exit, primary route, replacement route when present, reason/operator + hint, issued time, and a bounded TTL capped to the lease lifetime. Web-admin + marks remediation rows with `cmd` when this machine-readable command is + present. Live smoke proves a fenced selected primary route with an authorized + alternate emits a `prefer_alternate_route` command pointing at the alternate: + `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`. +- C18Z58 service-channel remediation command consumption is implemented. + Backend `rap-backend:fabric-service-channel-0.2.239` and node-agent + `rap-node-agent:0.2.237` are built/deployed on docker-test (`test-1/2/3`). + Backend now projects active `remediation_command` items into node-scoped + synthetic mesh config as `service_channel_remediation_commands`. 
Node-agent + parses those commands and turns `prefer_alternate_route` into an explicit + route-manager `applied` decision with source + `service_channel_remediation_command`, so an active channel that still + presents the old primary route can be routed through the replacement route. + Web-admin node details show remediation-command count/table in the Mesh tab. + Live smoke proves access telemetry, synthetic config projection, and + node-agent route-manager consumption: + `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`. +- C18Z59 active remediation traffic proof is smoke-proven on the same + backend/node-agent images with production forwarding enabled on docker-test + `test-1/2/3`. The smoke sends service-channel traffic before/after the + remediation command is consumed, then verifies runtime heartbeat evidence: + `last_selected_route_id` and flow-scheduler `last_route_id` move to the + replacement route, `send_successes=1`, `send_failures=0`, + `send_fallback_local=0`, and no degraded backend fallback is recommended. + Result: + `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`. +- C18Z60 multi-flow remediation traffic proof is smoke-proven. The smoke sends + a batch of twelve IPv4/TCP-like packets that classify into multiple + independent VPN flow channels after the remediation command is consumed. + Runtime heartbeat evidence shows the replacement route selected, at least two + flow-scheduler channels on that route, no local/backend fallback, no flow + drops, and no route send failures. Result: + `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`. +- C18Z61 pressure remediation traffic proof is smoke-proven. The smoke sends a + batch of 128 IPv4/TCP-like packets after remediation; runtime evidence shows + 32 replacement-route flow stats, scheduler high-watermark 5, + max-in-flight 4, `send_fallback_local=0`, route failures 0, and flow/scheduler + drops 0. 
Result: + `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`. +- C18Z62 service-channel QoS class wiring is implemented in node-agent and + live-smoke-proven on docker-test image `rap-node-agent:0.2.238-c18z62`. + Service-channel HTTP ingress accepts neutral `X-RAP-Traffic-Class` + (`control`, `interactive`, `reliable`, `bulk`, `droppable`) and the flow + scheduler keeps distinct traffic-class channel ids/stats while preserving the + old default bulk channel ids. Unit tests prove priority ordering + `control > interactive > reliable > bulk > droppable`; live smoke proves a + bulk 128-packet pressure batch plus an interactive packet both move through + the remediation replacement route with no local/backend fallback, drops, or + route failures. Result: + `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`. +- C18Z63 concurrent QoS isolation is implemented and unit-proven. A controlled + runtime test holds a bulk traffic-class send in-flight with a blocking + production transport, then sends an independent interactive traffic-class + packet through the same ingress; the interactive send completes before the + bulk release, with `MaxInFlight >= 2`, traffic-class-specific stats, no drops, + and no failures. This proves the shared Fabric Service Channel runtime does + not globally serialize interactive/control-style traffic behind bulk work. + Artifact: + `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`. +- C18Z64 traffic-class telemetry aggregation is implemented and live-proven on + docker-test image `rap-node-agent:0.2.239-c18z64`. `rap.fabric_flow_scheduler.v1` + snapshots now include `traffic_class_counts`, giving backend/admin/diagnostics + a compact count of active flow channels per traffic class without scanning + every channel stat. Unit coverage proves the counts for explicit + control/interactive/bulk classes and for the concurrent bulk+interactive + isolation case. 
Live smoke re-ran the QoS path on `test-1/2/3`; latest + heartbeat snapshot showed `traffic_class_counts` `bulk=32`, + `interactive=12`, drops 0. Artifacts: + `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`, + `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`, + and + `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`. +- C18Z65/C18Z66 backend/admin QoS diagnostics are implemented and live-proven. + Backend `rap-backend:fabric-service-channel-0.2.241-c18z66` is deployed on + docker-test and projects runtime `traffic_class_counts`, flow channel count, + max in-flight, dropped, and high-watermark from node heartbeats into + `GET /fabric/service-channels/access-telemetry` at node, active-channel, and + cluster aggregate levels. Web-admin Service-channel access shows flow QoS + chips/rows for cluster totals, active channels, and nodes. Live API aggregate + result showed `bulk=32`, `interactive=12`, `flow_channel_count=44`, + `flow_max_in_flight=4`. Artifacts: + `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`, + `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`, + and + `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`. +- C18Z67 live concurrent QoS proof is implemented and smoke-proven against + docker-test backend `rap-backend:fabric-service-channel-0.2.241-c18z66` and + node-agent image `rap-node-agent:0.2.239-c18z64`. The smoke pushes six + parallel bulk service-channel HTTP packet requests while an interactive + traffic-class request is injected through the same entry path after + remediation. Run `c18z67-20260508-213452` accepted all 6 bulk requests, + forwarded 3072 post-remediation packets, completed the interactive request in + 132 ms, observed 32 bulk and 12 interactive replacement-route flow stats, and + kept local/backend fallback, route failures, flow drops, and scheduler drops + at 0. 
Artifact: + `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`. +- C18Z68 service-channel flow-health guard is implemented and deployed on + docker-test as `rap-backend:fabric-service-channel-0.2.242-c18z68`, with + web-admin rebuilt/deployed. Access telemetry now projects + `flow_health_status` and `flow_health_reason` at cluster, node, and + active-channel levels from traffic-class counts, queue pressure, flow drops, + backend fallback, route-quality failures/drops/slow samples, and route send + latency. Web-admin shows explicit flow-health chips beside flow QoS so + sustained bulk pressure, degraded latency, fallback, and drops are visible + before adding user services. Verification passed: + `go test ./internal/modules/cluster`, web-admin `npm run build`, updated + C18Z67 live smoke against backend `0.2.242-c18z68`, and live API artifact + `artifacts/c18z68-service-channel-flow-health-api-result.json`. +- C18Z69 node-side adaptive backpressure is implemented and deployed on + docker-test image `rap-node-agent:0.2.243-c18z69` for `test-1/2/3`. + `FabricFlowScheduler` now calculates per-traffic-class + `recommended_parallel_windows` and reports `adaptive_backpressure_active` / + `adaptive_backpressure_reason` in runtime heartbeat snapshots. Bulk and + droppable classes are reduced first under pressure, reliable is reduced + moderately, while control/interactive keep their full window unless their own + class has drops/failures/slow samples. Live C18Z69 smoke wraps the C18Z67 + pressure path and verified recommended windows `bulk=1`, `droppable=1`, + `reliable=3`, `interactive=4`, `control=4`, flow traffic-class counts + `bulk=32`, `interactive=12`, high-watermark 72, max-in-flight 4, drops 0, and + reason `bulk_window_reduced_to_protect_interactive`. Artifacts: + `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json` and + `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`.
+- C18Z70 backend/admin adaptive backpressure visibility is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.244-c18z70`; web-admin is rebuilt and + deployed. Access telemetry now projects node-agent + `recommended_parallel_windows`, `adaptive_backpressure_active`, and + `adaptive_backpressure_reason` at cluster, node, and active-channel levels. + Cluster aggregation uses the minimum non-zero recommended window per class, + so the operator sees the most conservative active runtime limit. Web-admin + shows adaptive windows next to flow health and flow QoS. Live API returned + `adaptive=true`, reason `bulk_window_reduced_to_protect_interactive`, and + windows `bulk=1`, `droppable=1`, `reliable=3`, `interactive=4`, + `control=4`. Verification passed: `go test ./internal/modules/cluster`, + web-admin `npm run build`, C18Z69 live smoke, and live API artifact + `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`. +- C18Z71 adaptive policy contract is implemented and deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.245-c18z71` with node-agent image + `rap-node-agent:0.2.245-c18z71` on `test-1/2/3`. Backend exposes audited + `GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy` for + max parallel window, queue/bulk pressure thresholds, and per-class windows. + The effective policy is embedded in signed node synthetic config, and + node-agent runtime heartbeat snapshots now report + `adaptive_policy_fingerprint`. The scheduler consumes the policy at runtime: + the default policy preserves the C18Z69 behavior, while the C18Z71 live smoke + proved an operator policy can raise max window to 6 and bulk pressure window + to 2 while keeping interactive/control at 6. During smoke, a signed synthetic + config hash mismatch was found and fixed by preserving adaptive policy + provenance fields in the node-agent client model.
Verification passed: + `go test ./internal/modules/cluster`, + `go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/client ./internal/config`, + web-admin `npm run build`, C18Z71 live smoke, and C18Z69 regression smoke. + Artifacts: + `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json` and + `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`. +- C18Z72 service-channel pool/failover policy contract is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.246-c18z72`; node-agent remains + `rap-node-agent:0.2.245-c18z71` on `test-1/2/3`. Backend exposes audited + `GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy` for + entry/exit pool constraints, preferred entry/exit, selection strategy, + route/entry/exit failover modes, backend fallback allowance, and sticky + session mode. Lease issuance now applies the effective policy before route + selection, constrains `entry_pool`/`exit_pool`, chooses policy preferred + nodes when present, embeds `pool_policy` provenance in the lease, and signs + it into `rap.fabric_service_channel_lease_authority.v1`. Web-admin API/types + know the new policy contract. Verification passed: + `go test ./internal/modules/cluster`, web-admin `npm run build`, + C18Z72 live smoke, and C18Z71 regression smoke. Artifact: + `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`. +- C18Z73 pool-policy remediation guard and telemetry is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.247-c18z73` with node-agent image + `rap-node-agent:0.2.247-c18z73` on `test-1/2/3`; web-admin is rebuilt and + deployed. Active access telemetry now projects the signed + `pool_policy_fingerprint`, remediation guard status/reason, and guarded + remediation commands. 
Backend remediation rejects an alternate route outside + the signed entry/exit lease pools and emits `rebuild_route` instead of + `prefer_alternate_route`; node-agent defensively ignores guard-rejected + remediation commands before route-manager application. Web-admin shows guard + chips in access telemetry and node synthetic-config remediation rows. + Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + web-admin `npm run build`, C18Z73 live smoke, C18Z72 regression smoke, and + C18Z71/C18Z67 live regression smoke. Artifacts: + `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`, + `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`, + `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`, and + `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`. +- C18Z74 service-channel remediation execution visibility is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.248-c18z74` with node-agent image + `rap-node-agent:0.2.248-c18z74` on `test-1/2/3`; web-admin is rebuilt and + deployed. Active access telemetry now computes + `remediation_execution_status`, reason, generation, and observed timestamp + by correlating active remediation commands with the entry node's latest + route-manager heartbeat. `prefer_alternate_route` commands show + `waiting_node_apply` until the node reports a matching route-manager decision + and then `applied`; guarded commands show `rejected_by_policy_guard`; bounded + `rebuild_route` commands show `pending_rebuild_request`. The execution state + is copied into the machine-readable remediation command and displayed in + web-admin access telemetry / node synthetic remediation rows.
Verification + passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + web-admin `npm run build`, C18Z74 live smoke, C18Z73 regression smoke, and + C18Z72 regression smoke. Artifacts: + `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`, + `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`, + `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`, + and `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`. +- C18Z75 durable remediation rebuild intent foundation is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.249-c18z75`; node-agent remains + `rap-node-agent:0.2.248-c18z74` on `test-1/2/3`. When a node fetches + synthetic config containing a `rebuild_route` remediation command, backend + now records a durable row in the existing + `fabric_service_channel_route_rebuild_attempts` ledger with + `rebuild_status=requested` / `outcome=rebuild_requested`, or + `rebuild_status=rejected` / `outcome=policy_guard_rejected` when the pool + policy guard rejects it. Access telemetry correlates that ledger row back to + the active channel and reports `rebuild_request_recorded` or + `rebuild_request_rejected` in `remediation_execution_status`. The C18Z75 + smoke isolates a route pair, proves `rebuild_route`, fetches synthetic + config to persist the intent, verifies the rebuild ledger row, and verifies + access telemetry reports the recorded execution state. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + web-admin `npm run build`, C18Z75 live smoke, C18Z73 regression smoke, and + C18Z72 regression smoke. 
Artifacts: + `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`, + `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`, + and `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`. +- C18Z76 service-channel rebuild-route node acknowledgement is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.250-c18z76` with node-agent image + `rap-node-agent:0.2.250-c18z76` on `test-1/2/3`. Node-agent now consumes + allowed `rebuild_route` remediation commands as route-manager decisions with + `rebuild_status=pending_degraded_fallback` and + `decision_source=service_channel_remediation_command`; guarded commands are + still ignored. Backend access telemetry correlates this route-manager + acknowledgement with the durable ledger intent and reports + `rebuild_request_recorded_node_pending`. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z76 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression + smoke. Artifacts: + `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`, + `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`, + `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`, + and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`. +- C18Z77 service-channel rebuild planner resolution is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.251-c18z77` with node-agent image + `rap-node-agent:0.2.251-c18z77` on `test-1/2/3`. 
Backend now resolves + durable `rebuild_route` remediation requests during node-scoped synthetic + config generation: it keeps lease pool-policy guardrails, records + `applied` / `replacement_selected` when a signed-pool-valid alternate route + exists, records `no_alternate` when no safe alternate exists, records + `deferred_by_policy` when the active lease cannot authorize the replacement, + and records `expired` for stale commands. When a replacement is applied, the + same command id is projected as a route-manager decision so node-agent can + consume the resolved planner decision without duplicating the raw command. + Access telemetry reports planner states such as `rebuild_request_applied` + and `rebuild_request_no_alternate`. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z77 live smoke, C18Z75 regression smoke, and C18Z74/C18Z67 regression + smoke. Artifacts: + `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`, + `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`, + `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`, + and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`. +- C18Z78 service-channel rebuild planner applied-branch visibility is + implemented and deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.252-c18z78` with node-agent image + `rap-node-agent:0.2.252-c18z78` on `test-1/2/3`; web-admin is rebuilt and + deployed to `rap_web_admin`. 
The admin access-telemetry execution column and + node synthetic remediation rows now render planner outcomes with explicit + labels and tones: `rebuild_request_applied` is good, + `rebuild_request_recorded(_node_pending)`, `rebuild_request_no_alternate`, + and `rebuild_request_deferred_by_policy` are warning states, while rejected + or expired requests are bad states. The C18Z78 live smoke proves the applied + planner branch: a primary route is leased first, the primary route is then + degraded, an alternate route is added after the lease, synthetic config + fetch resolves the existing `rebuild_route` command to `applied` / + `replacement_selected`, and access telemetry reports + `rebuild_request_applied`. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + web-admin `npm run build`, C18Z78 live smoke, C18Z77 regression smoke, and + C18Z74/C18Z67 regression smoke. Artifacts: + `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`, + `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`, + `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`, + and `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`. +- C18Z79 service-channel planner-to-runtime loop proof is implemented and + deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.253-c18z79` with node-agent image + `rap-node-agent:0.2.253-c18z79` on `test-1/2/3`. 
The new live smoke extends + the C18Z78 applied branch: after planner resolves the existing + `rebuild_route` command to `applied` / `replacement_selected`, the entry node + reports a route-manager decision for the same `rebuild_request_id`, reports + transition `applied_rebuild`, and live service-channel packet ingress selects + the replacement route with no local/backend fallback, route failures, or flow + drops. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z79 live smoke, C18Z78 and C18Z77 sequential regressions, and C18Z67 + concurrent QoS regression. Artifact: + `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`. +- C18Z80 service-channel sustained post-rebuild pressure proof is implemented + and deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.254-c18z80` with node-agent image + `rap-node-agent:0.2.254-c18z80` on `test-1/2/3`. The new live smoke keeps the + C18Z79 planner-applied loop, then sends five post-rebuild bursts of mixed + `interactive`, `bulk`, and `reliable` VPN packet batches. It proves every + burst is accepted by the service-channel runtime, every burst reports the + replacement route, the stale primary is not reselected, and fallback, + route-failure, flow-drop, and scheduler-drop deltas stay zero from the + pre-pressure baseline. Smoke route hygiene was tightened: C18Z67 now disables + pre-existing active `vpn_packets` intents for its entry/exit pair, and + C18Z79/C18Z80 expire their temporary primary/alternate intents after a + successful run. 
Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z80 live smoke, C18Z79 regression smoke, and C18Z67 concurrent QoS + regression. Artifact: + `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`. +- C18Z81 service-channel replacement-degradation recovery proof is implemented + and deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.255-c18z81` with node-agent image + `rap-node-agent:0.2.255-c18z81` on `test-1/2/3`. The new live smoke proves + the negative branch after C18Z80: once the initial replacement is applied and + used, a generation-valid fenced feedback report for that replacement causes + the Control Plane to select a new safe recovery route. Live traffic then + moves to the recovery route, the degraded replacement is not reselected, and + fallback, route-failure, flow-drop, and scheduler-drop deltas stay zero for + the recovery send. The smoke also documents an important guardrail: stale + route-generation feedback must not trigger recovery. C18Z67/C18Z79 were + tightened to check per-run counter deltas rather than cumulative runtime + counters. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z81 live smoke, C18Z80 regression smoke, C18Z79 regression smoke, and + C18Z67 concurrent QoS regression. Artifact: + `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`. +- C18Z82 service-channel no-safe-recovery proof is implemented and deployed on + docker-test as `rap-backend:fabric-service-channel-0.2.256-c18z82` with + node-agent image `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`. 
The new + live smoke proves the branch where the original primary is degraded, the + replacement is applied and used, then that replacement reports + generation-valid fenced feedback while no new safe recovery route exists. + Node-scoped synthetic config reports + `service_channel_feedback_no_alternate` with + `pending_degraded_fallback`; score reasons include + `no_unfenced_alternate_route` and + `backend_relay_degraded_fallback_until_rebuild`, so the Control Plane exposes + an explicit degraded/no-alternate state instead of silently sticking to a bad + replacement. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + C18Z82 live smoke, C18Z81 recovery regression, C18Z80 pressure regression, + and C18Z67 concurrent QoS regression. Artifact: + `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`. +- C18Z83 service-channel access-telemetry no-safe projection is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.257-c18z83`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Active access telemetry + channels now expose route-decision source, route id, replacement route id, + rebuild status/reason/generation, and score reasons. Web-admin shows a + dedicated `decision` column in the active-channel table. The live smoke + proves no-safe recovery is visible through access telemetry as + `service_channel_feedback_no_alternate` / + `pending_degraded_fallback`, while durable ledger state can still report + `rebuild_request_no_alternate`. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, and C18Z83 live smoke. 
Artifact: + `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`. +- C18Z84 service-channel access-decision aggregate proof is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.258-c18z84`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Access telemetry now + exposes aggregate route-decision counters: + `route_decision_channel_count`, `replacement_decision_count`, + `applied_rebuild_decision_count`, `recovery_decision_count`, and + `no_safe_recovery_decision_count`. Web-admin summary chips show these counts, + and no-safe route decisions now prioritize the aggregate reason + `active_channels_no_safe_recovery` over generic missing access-report noise. + Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z84 live smoke, and C18Z83 regression smoke. + Artifact: + `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`. +- C18Z85 service-channel access-decision incident projection is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.259-c18z85`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Rebuild health summary now + carries access decision counts and prioritizes + `inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe + is active. Rebuild incidents now include `incident_source=access_decision` + entries with channel id and operator-facing severity/action, including + `access_no_safe_recovery` as a bad incident. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z85 live smoke, and C18Z84 regression smoke. 
+ Artifact: + `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`. +- C18Z86 service-channel access-decision silence/acknowledgement is + implemented and deployed on docker-test as + `rap-backend:fabric-service-channel-0.2.261-c18z86`; node-agent remains + `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and web-admin is + rebuilt/deployed to `rap_web_admin`. Rebuild alert silence requests now carry + `incident_source` and `channel_id`; `incident_source=access_decision` + no-safe incidents require `channel_id` and are stored with channel-scoped + route keys. Rebuild health and incident lists apply those silences, so an + acknowledged current-generation access no-safe incident is silenced and no + longer contributes to active bad count. Generation-change resurfacing is + covered in unit tests; live smoke proves the channel-scoped silence path. + Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z86 live smoke, and C18Z85 regression smoke. + Artifact: + `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`. +- C18Z87 service-channel access-decision silence management is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.262-c18z87`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Backend now exposes active + rebuild alert silences, enriches access-decision silences with + `incident_source`, `channel_id`, and `display_route_id`, and supports + unsilence by id. Web-admin shows an `Active rebuild silences` table with an + `unsilence` action. The live smoke proves the operator path: + access no-safe incident -> silence -> active silence listed -> unsilence -> + active bad incident restored. 
Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z87 live smoke, and C18Z86 regression smoke. + Artifact: + `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`. +- C18Z88 service-channel access-decision resurface proof is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.263-c18z88`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Access-decision incidents + now include resurface details (`alert_resurfaced_from_silence_id`, + `alert_resurfaced_previous_generation`, and + `alert_resurfaced_previous_until`) when a previously acknowledged + access-decision incident changes generation/route/channel and becomes active + again. Web-admin shows the previous generation/expiry beside resurfaced + incidents. The live smoke proves access no-safe -> silence current generation + -> route-decision generation changes -> incident resurfaces as active bad + with previous-generation metadata preserved. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z88 live smoke, and C18Z87 regression smoke. + Artifact: + `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`. +- C18Z89 service-channel access-decision resurface action loop is implemented + and deployed on docker-test as `rap-backend:fabric-service-channel-0.2.264-c18z89`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Resurfaced + access-decision incidents now include `alert_resurfaced_cause`, + `alert_resurfaced_previous_route_id`, and + `alert_resurfaced_previous_channel_id`. Web-admin shows the cause beside the + resurfaced action text. 
The live smoke proves the operator path: + access no-safe -> silence current generation -> generation changes and + resurfaces -> active-channel decision context matches the incident -> + re-acknowledge current generation -> incident returns to silenced state. + Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z89 live smoke, and C18Z88 regression smoke. + Artifact: + `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`. +- C18Z90 service-channel production data-plane contract is implemented and + deployed on docker-test as `rap-backend:fabric-service-channel-0.2.265-c18z90`; + node-agent remains `rap-node-agent:0.2.256-c18z82` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Service-channel leases now + include a signed `data_plane` contract in the lease, authority payload, + introspection response, and lease-maintenance/admin list. The contract + declares backend API as control-plane transport, fabric service channel over + fabric routes as working/steady-state data transport, backend relay as + degraded fallback only, production forwarding required, and service-neutral + protocol-agnostic logical flow isolation. Web-admin shows data-plane/fallback + policy in service-channel leases. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z90 live smoke, and C18Z89 regression smoke. + Artifact: + `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`. +- C18Z91 node-agent data-plane contract consumption is implemented and + deployed on docker-test as `rap-node-agent:0.2.266-c18z91` on `test-1/2/3` + with backend still `rap-backend:fabric-service-channel-0.2.265-c18z90`. 
+ Service-channel VPN packet ingress now parses signed/introspected + `data_plane`, validates the production contract, applies the preferred fabric + route, logs data-plane mode/transports/backend-relay policy/logical-flow + mode, and reports `data_plane_contract` plus last transport/policy fields in + heartbeat access telemetry. Verification passed: + `go test ./cmd/rap-node-agent ./internal/agent ./internal/mesh ./internal/vpnruntime ./internal/config`, + backend cluster tests, web-admin build, C18Z91 live smoke, and C18Z90 + regression smoke. Artifact: + `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`. +- C18Z92 node-agent backend-fallback policy enforcement is implemented and + deployed on docker-test as `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`. + If a signed data-plane contract has `backend_relay_policy=disabled`, the + service-channel runtime no longer proxies failed/missing fabric-route working + data through backend relay; it returns a visible service unavailable result. + The live smoke temporarily disables backend fallback in pool policy, issues a + no-route lease, verifies `backend_relay_policy=disabled`, posts to test-1, + and proves the node rejects with 503 instead of backend relay. Verification + passed: node-agent tests, C18Z92 live smoke, and C18Z91 regression smoke. + Artifact: + `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`. +- C18Z93 access-telemetry data-plane projection is implemented and deployed on + docker-test as `rap-backend:fabric-service-channel-0.2.268-c18z93`; + node-agent remains `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Backend access telemetry + now promotes node-reported `data_plane_contract` and last data-plane + mode/working transport/steady-state transport/backend relay policy/logical + flow mode to cluster, node, and active-channel diagnostics. 
Web-admin shows + summary chips plus channel/node table columns for data-plane adoption and + relay policy. Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z93 live smoke, C18Z92 regression smoke, and + C18Z91 regression smoke. Artifact: + `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`. +- C18Z94 data-plane contract incident diagnostics are implemented and deployed + on docker-test as `rap-backend:fabric-service-channel-0.2.269-c18z94`; + node-agent remains `rap-node-agent:0.2.267-c18z92` on `test-1/2/3`, and + web-admin is rebuilt/deployed to `rap_web_admin`. Access/rebuild incident + diagnostics now include `incident_source=data_plane_contract` rows for + missing data-plane contract reports after accepted traffic, working/steady + transport mismatches, logical-flow mismatch, disabled backend relay observed, + and degraded/backend-relay policy violations. The smoke now proves disabled + backend relay is emitted as a bad incident with action + `restore_fabric_route_or_change_signed_backend_relay_policy_before_retry`. + Verification passed: + `go test ./internal/modules/cluster ./internal/platform/runtime ./internal/modules/nodeagent`, + web-admin `npm run build`, C18Z94 live smoke, C18Z93 regression smoke, C18Z92 + regression smoke, and C18Z91 regression smoke. Artifact: + `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`. +- C18Z95 node-agent blocked-fallback telemetry is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.270-c18z95` + and node-agent `rap-node-agent:0.2.270-c18z95` on `test-1/2/3`; web-admin is + rebuilt/deployed to `rap_web_admin`. Node-agent now reports + `backend_fallback_blocked`, `fabric_route_send_failure`, and last data-plane + violation status/reason in `fabric_service_channel_access_report`. 
Backend + access telemetry projects those fields to cluster, node, and active-channel + rows, and `data_plane_contract` incidents distinguish policy-blocked fallback + from real backend relay usage. Verification passed: node-agent tests, + backend tests, web-admin build, C18Z95 live smoke, and C18Z94/C18Z93/C18Z92 + regressions. Artifact: + `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`. +- C18Z96 blocked-fallback rebuild feedback is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + node-agent remains `rap-node-agent:0.2.270-c18z95` on `test-1/2/3`, and + web-admin remains deployed. Backend now converts heartbeat access reports + with `fabric_route_send_failed_backend_fallback_blocked` into durable fenced + `fabric_service_channel_route_feedback` for the active channel primary route. + The existing route rebuild planner then selects an authorized replacement + route when one exists. Verification passed: backend tests, node-agent tests, + web-admin build, C18Z96 live smoke, and C18Z95/C18Z93 regressions. Artifact: + `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`. +- C18Z97 blocked-fallback feedback dedup is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`. + Backend now suppresses repeated access-report-derived route feedback while an + active fenced/degraded observation from `fabric_service_channel_access_report` + already exists for the same cluster, reporter node, route, and service class. + This keeps repeated blocked-fallback send-failure heartbeats from refreshing + the same feedback and churning rebuild attempts. Verification passed: + backend tests, node-agent tests, C18Z97 live smoke, and C18Z96/C18Z95 + regressions. Artifact: + `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`. 
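The dedup rule described for C18Z97 can be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: the `feedbackKey` and `feedbackDedup` types, the `ShouldRecord` method, and the example TTL are all hypothetical names chosen for this sketch. Only the dedup scope (cluster, reporter node, route, service class) comes from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// feedbackKey is a hypothetical dedup key. Its four fields mirror the scope
// named in the changelog: cluster, reporter node, route, and service class.
type feedbackKey struct {
	ClusterID      string
	ReporterNodeID string
	RouteID        string
	ServiceClass   string
}

// feedbackDedup remembers, per key, when the active observation expires.
type feedbackDedup struct {
	activeUntil map[feedbackKey]time.Time
}

func newFeedbackDedup() *feedbackDedup {
	return &feedbackDedup{activeUntil: make(map[feedbackKey]time.Time)}
}

// ShouldRecord reports whether a new observation should be written.
// While an unexpired observation exists for the same key it returns false
// and leaves the stored expiry untouched, so repeated heartbeats neither
// refresh the feedback nor trigger another rebuild attempt.
func (d *feedbackDedup) ShouldRecord(k feedbackKey, now time.Time, ttl time.Duration) bool {
	if until, ok := d.activeUntil[k]; ok && now.Before(until) {
		return false
	}
	d.activeUntil[k] = now.Add(ttl)
	return true
}

// Example values (assumptions made only for this illustration).
const exampleTTL = time.Minute

var exampleStart = time.Unix(0, 0)

func main() {
	d := newFeedbackDedup()
	k := feedbackKey{"cluster-a", "test-1", "route-1", "vpn"}

	fmt.Println(d.ShouldRecord(k, exampleStart, exampleTTL))                   // first report: recorded
	fmt.Println(d.ShouldRecord(k, exampleStart.Add(exampleTTL/2), exampleTTL)) // repeat while active: suppressed
	fmt.Println(d.ShouldRecord(k, exampleStart.Add(2*exampleTTL), exampleTTL)) // after expiry: recorded again
}
```

Keying on the full tuple rather than on the route alone means a second reporter node, or a different service class on the same route, can still produce its own feedback.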
+- C18Z98 blocked-fallback rebuild correlation is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. Backend now carries the + originating access-report route-feedback identity into replacement decisions + and rebuild-attempt ledger rows: `feedback_observation_id`, + `feedback_source`, feedback observed/expiry times, channel/resource ids, and + data-plane violation status/reason. Web-admin shows this correlation in + Route decisions and Rebuild ledger. Verification passed: backend tests, + node-agent tests, web-admin build, C18Z98 live smoke, and C18Z97/C18Z96/C18Z95 + regressions. Artifact: + `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`. +- C18Z99 rebuild correlation filters are implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. The rebuild-attempt ledger + API now accepts `feedback_source`, `feedback_channel_id`, and + `feedback_violation_status` filters, and web-admin exposes them in the + rebuild ledger filter form. Verification passed: backend tests, node-agent + tests, web-admin build, C18Z99 live smoke, and C18Z98/C18Z97/C18Z96/C18Z95/ + C18Z93 regressions. Artifact: + `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`. +- C18Z100 rebuild-health feedback breakdown is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. The rebuild-health summary + now returns `feedback_breakdowns` grouped by feedback source, feedback + channel id, and feedback violation status, with total/good/warn/bad/unknown + counts, active warn/bad counts, silenced count, latest observation time, and + affected reporter nodes/routes. Web-admin shows the breakdown in the Rebuild + health panel. 
Verification passed: backend tests, node-agent tests, + web-admin build, C18Z100 live smoke, and C18Z99/C18Z98/C18Z97/C18Z96/C18Z95/ + C18Z93 regressions. Artifact: + `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`. +- C18Z101 rebuild-health feedback drilldown UI is implemented and deployed to + `rap_web_admin`; backend remains + `rap-backend:fabric-service-channel-0.2.281-c18z109`. Web-admin now shows + related incident context on rebuild-health feedback breakdown rows and an + `open ledger` action that switches to deep rebuild ledger with + `feedback_source`, `feedback_channel_id`, and `feedback_violation_status` + prefilled from the selected breakdown. Verification passed: web-admin build + and deployed asset/download checks. +- C18Z102 rebuild-health feedback drilldown audit breadcrumbs are implemented + and deployed on docker-test as backend + `rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/ + deployed to `rap_web_admin`. The existing rebuild investigation endpoint now + accepts feedback source/channel/violation drilldown payloads and records + `fabric.service_channel_rebuild_feedback_breakdown.investigation_opened` + cluster audit events before web-admin opens the filtered deep rebuild ledger. + Verification passed: backend tests, web-admin build, C18Z102 live smoke, and + C18Z100/C18Z99/C18Z98 regressions. Artifact: + `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`. +- C18Z103 Fabric diagnostics drilldown audit visibility is implemented and + deployed to `rap_web_admin`; backend remains + `rap-backend:fabric-service-channel-0.2.281-c18z109`. Web-admin now filters + the loaded cluster audit list for rebuild incident and feedback-breakdown + investigation events and shows recent drilldowns in the Fabric diagnostics + panel with time, source, feedback filters, target reporter/route, actor, and + reason. Verification passed: web-admin build and deployed asset/download + checks. 
+- C18Z104 focused Fabric audit loading is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. The cluster audit API now + accepts repeated or comma-separated `event_type` filters plus `target_type` + filters, and Fabric diagnostics loads recent rebuild incident/feedback + breakdown investigation breadcrumbs with a dedicated filtered request instead + of depending on the generic latest-100 audit list. Verification passed: + backend tests, web-admin build, C18Z104 live smoke, and C18Z102/C18Z100 + regressions. Artifact: + `artifacts/c18z104-focused-fabric-audit-smoke-result.json`. +- C18Z105 Fabric drilldown breadcrumb correlation UI is implemented and + deployed to `rap_web_admin`; backend remains + `rap-backend:fabric-service-channel-0.2.281-c18z109`. Recent investigation + rows in Fabric diagnostics now show whether each breadcrumb still matches a + current rebuild-health feedback breakdown or visible rebuild incident, and + provide an `open` action to jump back into the matching filtered ledger path. + Verification passed: web-admin build and deployed asset/download checks. +- C18Z106 server-side Fabric drilldown breadcrumb correlation is implemented + and deployed on docker-test as backend + `rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/ + deployed to `rap_web_admin`. Focused audit reads with + `correlation=fabric_diagnostics` now return `correlation_hints` with current + diagnostic status and matching rebuild-health feedback breakdown or rebuild + incident when present. Web-admin consumes those hints and keeps local matching + as fallback. The rebuild-health feedback breakdown window is raised to 100 + groups after C18Z100 regression exposed the previous cap could hide fresh + failure classes on noisy test history. Verification passed: backend tests, + web-admin build, C18Z106 live smoke, and C18Z104/C18Z100 regressions. 
+ Artifact: `artifacts/c18z106-audit-correlation-hints-smoke-result.json`. +- C18Z107 drilldown breadcrumb summary is implemented and deployed on + docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. Audit responses now include + compact `audit_summary` aggregates beside `audit_events`; focused Fabric + diagnostics uses them to show counts by current diagnostic status, feedback + source, feedback violation status, correlated/not-visible totals, and latest + time above the Recent investigations rows. Verification passed: backend + tests, web-admin build, C18Z107 live smoke, and C18Z106/C18Z104 regressions. + Artifact: `artifacts/c18z107-audit-correlation-summary-smoke-result.json`. +- C18Z108 dedicated Fabric diagnostics breadcrumbs are implemented and deployed + on docker-test as backend `rap-backend:fabric-service-channel-0.2.281-c18z109`; + web-admin is rebuilt/deployed to `rap_web_admin`. Backend exposes + `GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs` + returning `rebuild_investigation_breadcrumbs` with events and summary, so the + operator Recent investigations workflow no longer overloads the generic + cluster audit endpoint. Verification passed: backend tests, web-admin build, + C18Z108 live smoke, and C18Z107/C18Z106/C18Z100 regressions. Artifact: + `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`. +- C18Z109 Fabric diagnostics breadcrumb freshness windows are implemented and + deployed on docker-test as backend + `rap-backend:fabric-service-channel-0.2.281-c18z109`; web-admin is rebuilt/ + deployed to `rap_web_admin`. The dedicated breadcrumb endpoint accepts + `current_window_seconds` and `history_window_seconds`, annotates events with + `correlation_hints.breadcrumb_status` (`current`, `stale`, `expired`) plus + age/window seconds, returns current/stale/expired totals, and includes + `counts_by_breadcrumb_status` in summary. 
Web-admin shows freshness chips and + age in Recent investigations. Verification passed: backend tests, web-admin + build, C18Z109 live smoke, and C18Z108/C18Z107/C18Z106 regressions. Artifact: + `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`. +- C19Q Remote Workspace mailbox guardrails are implemented and + runtime-smoke-proven on docker-test. The adapter-session mailbox handoff now + has unit and live coverage for invalid adapter session IDs, unknown sessions, + invalid limits, and bounded `drain=true&limit=N` partial drain semantics. + This remains probe-only and node-local: it does not enable RDP protocol + forwarding, desktop frame transport, Android work, or backend relay behavior. + Verification passed: `go test ./internal/mesh` in `agents/rap-node-agent` and + `scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`. + Artifact: + `artifacts/c19q-remote-workspace-adapter-mailbox-guardrails-smoke-result.json`. +- C19R Remote Workspace mailbox long-poll ergonomics are implemented and + runtime-smoke-proven on docker-test. The mailbox endpoint now accepts bounded + `wait_ms`, returns explicit `empty`, `waited`, `wait_timeout`, and `wait_ms` + fields, and wakes when a delayed mailbox event arrives before timeout. + Node-agent image `rap-node-agent:codex-service-supervisor-20260512s` is built + and deployed on `test-1/2/3`. Verification passed: + `go test ./internal/mesh`, C19R live smoke, and C19Q regression smoke. + Artifact: + `artifacts/c19r-remote-workspace-mailbox-long-poll-smoke-result.json`. +- C19S Remote Workspace mailbox telemetry is implemented and + runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry + now expose mailbox read/wait/timeout/empty-read counters plus last mailbox + read metadata, so adapter consumer polling behavior is visible without + enabling desktop frame transport. 
Node-agent image + `rap-node-agent:codex-service-supervisor-20260512t` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19S live + smoke, and C19R regression smoke. Artifact: + `artifacts/c19s-remote-workspace-mailbox-telemetry-smoke-result.json`. +- C19T Remote Workspace mailbox consumer checkpoint/ack metadata is implemented + and runtime-smoke-proven on docker-test. The mailbox endpoint now accepts a + validated `consumer_id` and optional `ack_sequence`, returns consumer + checkpoint/ack/lag/read metadata, and keeps bounded per-session node-local + consumer cursor state. Workload status and heartbeat telemetry expose + aggregate/current-session consumer read and ack counters. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512u` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19T live + smoke, and C19S regression smoke. Artifact: + `artifacts/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke-result.json`. +- C19U Remote Workspace mailbox consumer lifecycle guardrails are implemented + and runtime-smoke-proven on docker-test. Consumers can pass + `reset_consumer=true` with a validated `consumer_id` to clear cursor state + before the current read is recorded. Mailbox responses expose consumer + count/capacity, created/reset/evicted lifecycle flags, and consumer + timestamps; workload status and heartbeat telemetry expose consumer reset and + eviction counters. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512v` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19U live + smoke, and C19T regression smoke. Artifact: + `artifacts/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke-result.json`. +- C19V Remote Workspace mailbox consumer cursor inspection is implemented and + runtime-smoke-proven on docker-test. 
Active adapter sessions now expose a + read-only + `/mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers` + endpoint with bounded cursor snapshots: consumer ids, checkpoint/ack + sequences, lag, read/ack totals, and timestamps. The endpoint is read-only and + does not increment mailbox reads, acks, resets, or drain events. Node-agent + image `rap-node-agent:codex-service-supervisor-20260512w` is built and + deployed on `test-1/2/3`. Verification passed: `go test ./internal/mesh`, + C19V live smoke, and C19U regression smoke. Artifact: + `artifacts/c19v-remote-workspace-mailbox-consumer-snapshot-smoke-result.json`. +- C19W Remote Workspace mailbox cursor-aware resume reads are implemented and + runtime-smoke-proven on docker-test. The mailbox endpoint now accepts + `after_sequence` for non-destructive reads, returns `skipped_count` and + `returned_count`, and long-polls for events newer than the requested sequence. + `after_sequence` with `drain=true` is rejected to keep resume reads separate + from destructive drains. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512x` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19W live + smoke, and C19V regression smoke. Artifact: + `artifacts/c19w-remote-workspace-mailbox-after-sequence-smoke-result.json`. +- C19X Remote Workspace mailbox consumer-aware resume is implemented and + runtime-smoke-proven on docker-test. Mailbox reads with `consumer_id` can pass + `resume_from=ack|checkpoint`; the node-agent resolves the stored cursor to + `after_sequence` before reading and returns `resume_from`/`resume_sequence`. + Guardrails reject mixing resume with manual `after_sequence`, drain, reset, + missing consumers, or invalid cursor names. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512y` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19X live + smoke, and C19W regression smoke. 
Artifact: + `artifacts/c19x-remote-workspace-mailbox-consumer-resume-smoke-result.json`. +- C19Y Remote Workspace mailbox resume telemetry is implemented and + runtime-smoke-proven on docker-test. Workload status and heartbeat telemetry + now expose resume/after-sequence read totals, returned/skipped totals, and the + last resume cursor/sequence/consumer plus returned/skipped counts for + operator diagnostics. Session snapshots include the same per-session resume + counters. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512z` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Y live + smoke, C19X source smoke, and C19W regression smoke. Artifact: + `artifacts/c19y-remote-workspace-mailbox-resume-telemetry-smoke-result.json`. +- C19Z Remote Workspace adapter runtime readiness summary is implemented and + runtime-smoke-proven on docker-test. The sink report now includes compact + `adapter_runtime_readiness` diagnostics with session lifecycle state, mailbox + depth, consumer cursor, resume cursor, skipped/returned counts, and + ready/diagnostic status for operator handoff checks. Node-agent image + `rap-node-agent:codex-service-supervisor-20260512z1` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Z live + smoke, C19X source smoke, and C19Y regression smoke. Artifact: + `artifacts/c19z-remote-workspace-adapter-readiness-smoke-result.json`. +- C19Z1 Remote Workspace mailbox handoff preflight is implemented and + runtime-smoke-proven on docker-test. The node-agent now exposes read-only + `GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight` + for `consumer_id` plus `resume_from=ack|checkpoint`; it validates the cursor + and reports the expected next event window without reading, draining, acking, + or mutating consumer state. 
Node-agent image + `rap-node-agent:codex-service-supervisor-20260512z2` is built and deployed on + `test-1/2/3`. Verification passed: `go test ./internal/mesh`, C19Z1 live + smoke, C19X source smoke, and C19Z regression smoke. Artifact: + `artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json`. The current phase is NOT: -- full mesh routing implementation -- full VPN orchestration -- multi-cluster runtime traffic handling -- production data-plane migration -- updater runtime -- video meetings -- final native client UI redesign - -Future mesh, VPN, multi-cluster, node-agent updater, and production realtime data-plane work must be introduced only through explicit, narrow, staged implementation prompts. - -Always keep the project production-oriented. Do not simplify it into a toy app. +- full mesh routing implementation +- full VPN orchestration +- multi-cluster runtime traffic handling +- production data-plane migration +- complete updater rollout orchestration +- video meetings +- final native client UI redesign + +Future mesh, VPN, multi-cluster, node-agent updater, and production realtime data-plane work must be introduced only through explicit, narrow, staged implementation prompts. + +Always keep the project production-oriented. Do not simplify it into a toy app. diff --git a/README.md b/README.md index 3c04cff..2e724e4 100644 --- a/README.md +++ b/README.md @@ -9,6 +9,9 @@ The project started as an RDP proxy, but the target architecture is broader: - service adapters for RDP now and VNC/SSH/VPN/file/video later - native Access Clients - future secure mesh / node-agent / updater / connector model +- shared Fabric Service Channel runtime for working service data so VPN, + Remote Workspace, video, file, and future services request a common channel + instead of reimplementing transport, routing, and failover RDP is the first proven service baseline. 
RDP work is currently paused by product decision while the project moves to the Secure Access Fabric @@ -57,6 +60,7 @@ See the current audit and baseline matrix before starting new work: - `workers/rdp-worker/` - active C++ RDP Adapter worker - `workers/rdp-service-csharp/` - inactive research scaffold, not current runtime - `clients/windows/` - Windows native Access Client +- `clients/android/` - Android VPN client - `docs/architecture/` - target and staged architecture documents - `docs/codex/` - current Codex status and next-step prompts - `docs/audits/` - current audits and baseline matrices @@ -64,6 +68,49 @@ See the current audit and baseline matrix before starting new work: - `deploy/` - deployment assets - `web-admin/` - future/admin UI area +### Quick local Android APK build after an update + +```powershell +pwsh -ExecutionPolicy Bypass -File scripts\prepare-local-build-workstation.ps1 -SetEnvironment +pwsh -ExecutionPolicy Bypass -File scripts\android\prepare-android-build-environment.ps1 -SetEnvironment +pwsh -ExecutionPolicy Bypass -File scripts\android\rebuild-and-publish-android-apk.ps1 +``` + +For a quick release (build + publish + optional manifest check): + +```powershell +pwsh -ExecutionPolicy Bypass -File scripts\android\release-android-apk.ps1 ` + -InstallMissing ` + -PublishToTestDockerDownloads +``` + +For a single all-in-one step: + +```powershell +pwsh -ExecutionPolicy Bypass -File scripts\android\fast-release-android-apk.ps1 +``` + +Or via `.cmd` (Windows, double-click or shortcut): + +```text +scripts\android\fast-release-android-apk.cmd +``` + +After updating the Android client, run on the build machine: + +```powershell +pwsh -ExecutionPolicy Bypass -File scripts\android\rebuild-and-publish-android-apk.ps1 -InstallMissing -BuildType release +``` + +The script checks the environment itself, installs any missing SDK components +if needed (the `-InstallMissing` flag), and builds the APK. 
The artifact is then immediately available in +`web-admin/deploy/html/downloads` for download from the panel. + +Release process note: every new version number must go through the full +`build → publish → manifest check` cycle and always land in `latest-release` + +`releases/` on the distribution side, so that nodes and users always +update from the current build. + ## Read Order 1. `CODEX_CONTEXT.md` @@ -73,8 +120,9 @@ See the current audit and baseline matrix before starting new work: 5. `docs/codex/ARCHITECTURE_GUARDRAILS.md` 6. `docs/architecture/RDP_ADAPTER_RUNTIME.md` 7. `docs/architecture/DATA_PLANE_V1.md` -8. `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md` -9. `docs/codex/NEXT_STEP_PROMPT.md` +8. `docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md` +9. `docs/architecture/CLUSTER_NODE_ADMIN_FOUNDATION.md` +10. `docs/codex/NEXT_STEP_PROMPT.md` Do not use `docs/_legacy_v1` for implementation decisions. Legacy files are historical reference only. diff --git a/_tmp_android_build/.gradle/9.5.0/checksums/checksums.lock b/_tmp_android_build/.gradle/9.5.0/checksums/checksums.lock new file mode 100644 index 0000000..512ecf3 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/checksums/checksums.lock differ diff --git a/_tmp_android_build/.gradle/9.5.0/checksums/md5-checksums.bin b/_tmp_android_build/.gradle/9.5.0/checksums/md5-checksums.bin new file mode 100644 index 0000000..86f9a49 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/checksums/md5-checksums.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/checksums/sha1-checksums.bin b/_tmp_android_build/.gradle/9.5.0/checksums/sha1-checksums.bin new file mode 100644 index 0000000..a6ffb8c Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/checksums/sha1-checksums.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.bin b/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.bin new file mode 100644 index 
0000000..78cc373 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.lock b/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.lock new file mode 100644 index 0000000..bf65bf4 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/executionHistory/executionHistory.lock differ diff --git a/_tmp_android_build/.gradle/9.5.0/fileChanges/last-build.bin b/_tmp_android_build/.gradle/9.5.0/fileChanges/last-build.bin new file mode 100644 index 0000000..f76dd23 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/fileChanges/last-build.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.bin b/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.bin new file mode 100644 index 0000000..e3c57d8 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.lock b/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.lock new file mode 100644 index 0000000..f0be626 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/fileHashes/fileHashes.lock differ diff --git a/_tmp_android_build/.gradle/9.5.0/fileHashes/resourceHashesCache.bin b/_tmp_android_build/.gradle/9.5.0/fileHashes/resourceHashesCache.bin new file mode 100644 index 0000000..6023641 Binary files /dev/null and b/_tmp_android_build/.gradle/9.5.0/fileHashes/resourceHashesCache.bin differ diff --git a/_tmp_android_build/.gradle/9.5.0/gc.properties b/_tmp_android_build/.gradle/9.5.0/gc.properties new file mode 100644 index 0000000..e69de29 diff --git a/_tmp_android_build/.gradle/buildOutputCleanup/buildOutputCleanup.lock b/_tmp_android_build/.gradle/buildOutputCleanup/buildOutputCleanup.lock new file mode 100644 index 0000000..6141af9 Binary files /dev/null and 
b/_tmp_android_build/.gradle/buildOutputCleanup/buildOutputCleanup.lock differ diff --git a/_tmp_android_build/.gradle/buildOutputCleanup/cache.properties b/_tmp_android_build/.gradle/buildOutputCleanup/cache.properties new file mode 100644 index 0000000..971d8fa --- /dev/null +++ b/_tmp_android_build/.gradle/buildOutputCleanup/cache.properties @@ -0,0 +1,2 @@ +#Fri May 01 13:10:35 MSK 2026 +gradle.version=9.5.0 diff --git a/_tmp_android_build/.gradle/buildOutputCleanup/outputFiles.bin b/_tmp_android_build/.gradle/buildOutputCleanup/outputFiles.bin new file mode 100644 index 0000000..66d92eb Binary files /dev/null and b/_tmp_android_build/.gradle/buildOutputCleanup/outputFiles.bin differ diff --git a/_tmp_android_build/.gradle/vcs-1/gc.properties b/_tmp_android_build/.gradle/vcs-1/gc.properties new file mode 100644 index 0000000..e69de29 diff --git a/_tmp_android_build/README.md b/_tmp_android_build/README.md new file mode 100644 index 0000000..13ee83e --- /dev/null +++ b/_tmp_android_build/README.md @@ -0,0 +1,41 @@ +# RAP Android VPN + +This is the Android client for the experimental RAP VPN service. + +Implemented now: + +- login through `/auth/login`; +- trusted-device reconnect through `/auth/refresh` without retyping the password + while the device session is valid; +- load organization-scoped VPN client profile from `/clusters/{clusterID}/vpn/client-profile`; +- request Android VPN permission and create a `VpnService` TUN interface; +- relay TUN packets through the Control Plane HTTP packet relay to the active + `home-1` gateway lease; +- user-facing HOME-first screen: connect/disconnect is primary, while backend, + cluster, organization, login, and password are kept in the settings dialog; +- saved connection settings in app preferences so repeat connects do not require + retyping the profile; +- encrypted refresh-token storage through Android Keystore.
If the trusted + device session is revoked or expires, the app asks for the password once and + then rotates the device keys/profile again. + +This is still a lab runtime, not a production WireGuard/IPsec implementation. +The active Linux gateway node must be able to create `/dev/net/tun`, run `ip`, +`sysctl`, and `iptables`, and enable NAT for `10.77.0.0/24`. + +Build from this repository on Windows: + +```powershell +$env:ANDROID_HOME="C:\Android\Sdk" +$env:ANDROID_SDK_ROOT="C:\Android\Sdk" +pwsh -ExecutionPolicy Bypass -File ..\..\scripts\android\build-android-apk.ps1 +adb install -r app/build/outputs/apk/debug/app-debug.apk +``` + +Or run directly from the project: + +```powershell +$env:ANDROID_HOME="C:\Android\Sdk" +$env:ANDROID_SDK_ROOT="C:\Android\Sdk" +gradle assembleDebug +``` diff --git a/_tmp_android_build/app/build.gradle b/_tmp_android_build/app/build.gradle new file mode 100644 index 0000000..6902b7b --- /dev/null +++ b/_tmp_android_build/app/build.gradle @@ -0,0 +1,24 @@ +plugins { + id "com.android.application" +} + +android { + namespace "su.cin.rapvpn" + compileSdk 35 + + buildFeatures { + buildConfig true + } + + defaultConfig { + applicationId "su.cin.rapvpn" + minSdk 26 + targetSdk 35 + versionCode 64 + versionName "0.2.64" + } +} + +dependencies { + implementation "com.squareup.okhttp3:okhttp:4.12.0" +} diff --git a/_tmp_android_build/app/src/main/AndroidManifest.xml b/_tmp_android_build/app/src/main/AndroidManifest.xml new file mode 100644 index 0000000..3472b91 --- /dev/null +++ b/_tmp_android_build/app/src/main/AndroidManifest.xml @@ -0,0 +1,59 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/MainActivity.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/MainActivity.java new file mode 100644 index 0000000..46e2337 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/MainActivity.java @@ -0,0 +1,650 @@ +package su.cin.rapvpn; + 
+import android.app.Activity; +import android.app.AlertDialog; +import android.content.SharedPreferences; +import android.content.Intent; +import android.net.VpnService; +import android.os.Bundle; +import android.text.InputType; +import android.widget.Button; +import android.widget.CheckBox; +import android.widget.EditText; +import android.widget.LinearLayout; +import android.widget.TextView; + +import org.json.JSONArray; +import org.json.JSONObject; + +import java.util.Locale; + + +public class MainActivity extends Activity { + private static final String APP_VERSION = BuildConfig.VERSION_NAME; + private static final String DEFAULT_BACKEND_URL = "http://195.123.240.88:19131/api/v1"; + private static final String DEFAULT_ENTRY_NODE_ID = "b829ffde-690b-47ab-9522-0f22ab42596d"; + private static final int VPN_PREPARE_REQUEST = 42; + private static final String PREFS = "rap-vpn"; + private static final String PREF_DEVICE_FINGERPRINT = "device_fingerprint"; + private static final String PREF_REFRESH_TOKEN = "refresh_token"; + private static final String PREF_REFRESH_EXPIRES_AT = "refresh_expires_at"; + private static final String PREF_USER_ID = "user_id"; + private static final String PREF_DEVICE_ID = "device_id"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + private EditText backendUrl; + private EditText clusterId; + private EditText organizationId; + private EditText email; + private EditText password; + private TextView status; + private TextView profileSummary; + private TextView serverDirectory; + private TextView runtimeStatus; + private String profileJson = ""; + private String vpnConnectionId = ""; + private JSONArray lastResources = new JSONArray(); + private RapApiClient.AuthContext authContext = null; + private SharedPreferences prefs; + private SharedPreferences runtimePrefs; + private SecureTokenStore secureTokens; + + @Override + protected void 
onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + runtimePrefs = getSharedPreferences("rap-vpn-runtime", MODE_PRIVATE); + secureTokens = new SecureTokenStore(this); + LinearLayout root = new LinearLayout(this); + root.setOrientation(LinearLayout.VERTICAL); + root.setBackgroundColor(0xff101820); + int pad = dp(20); + root.setPadding(pad, pad, pad, pad); + + backendUrl = field("Backend URL", preferredBackendUrl()); + clusterId = field("Cluster ID", prefs.getString("cluster_id", "cfc0743d-d960-49fb-9de8-96e063d5e4aa")); + organizationId = field("Organization ID", prefs.getString("organization_id", "125ff8b2-5ac1-4406-9bbb-ebbe18f7c7ed")); + email = field("Email", prefs.getString("email", "m")); + password = field("Password", ""); + password.setInputType(InputType.TYPE_CLASS_TEXT | InputType.TYPE_TEXT_VARIATION_PASSWORD); + profileJson = prefs.getString(PREF_PROFILE_JSON, ""); + vpnConnectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + restoreAuthContext(); + + TextView title = new TextView(this); + title.setText("RAP HOME VPN " + APP_VERSION); + title.setTextColor(0xffffffff); + title.setTextSize(26); + title.setPadding(0, 0, 0, dp(8)); + + profileSummary = new TextView(this); + profileSummary.setTextColor(0xffc8d6df); + profileSummary.setTextSize(14); + profileSummary.setText(summaryText()); + + serverDirectory = new TextView(this); + serverDirectory.setTextColor(0xffe8eef2); + serverDirectory.setTextSize(15); + serverDirectory.setPadding(0, dp(14), 0, dp(14)); + serverDirectory.setText(""); + + status = new TextView(this); + status.setTextColor(0xffd8eadf); + status.setPadding(0, dp(14), 0, dp(14)); + status.setText("Готово. 
Версия " + APP_VERSION + "."); + + runtimeStatus = new TextView(this); + runtimeStatus.setTextColor(0xff9fb6c2); + runtimeStatus.setTextSize(13); + runtimeStatus.setPadding(0, 0, 0, dp(10)); + runtimeStatus.setText(runtimeStatusText()); + + Button load = new Button(this); + load.setText("Войти / обновить профиль"); + load.setOnClickListener(v -> loadProfile(false)); + + Button start = new Button(this); + start.setText("Включить HOME VPN"); + start.setOnClickListener(v -> prepareVpn()); + + Button stop = new Button(this); + stop.setText("Отключить VPN"); + stop.setOnClickListener(v -> { + Intent stopIntent = new Intent(this, RapVpnService.class); + stopIntent.setAction(RapVpnService.ACTION_STOP); + startService(stopIntent); + status.setText("VPN остановлен."); + runtimeStatus.setText(runtimeStatusText()); + }); + + Button settings = new Button(this); + settings.setText("Настройки"); + settings.setOnClickListener(v -> showSettingsDialog()); + + Button servers = new Button(this); + servers.setText("Открыть удаленный сервер"); + servers.setOnClickListener(v -> showServerPicker()); + + root.addView(title); + root.addView(profileSummary); + root.addView(load); + root.addView(servers); + root.addView(start); + root.addView(stop); + root.addView(settings); + root.addView(status); + root.addView(runtimeStatus); + setContentView(root); + scheduleRuntimeStatusRefresh(); + if (authContext != null && !authContext.deviceId.isEmpty()) { + startDiagnosticChannel(); + } + } + + @Override + protected void onDestroy() { + super.onDestroy(); + } + + private EditText field(String hint, String value) { + EditText input = new EditText(this); + input.setHint(hint); + input.setText(value); + input.setSingleLine(true); + return input; + } + + private void loadProfile() { + loadProfile(false); + } + + private void loadProfile(boolean startAfterLoad) { + status.setText("Загрузка..."); + saveSettings(); + new Thread(() -> { + try { + RapApiClient client = new 
RapApiClient(backendUrl.getText().toString(), this); + authContext = authenticate(client); + String activeOrganizationId = resolveOrganizationId(client, authContext.userId); + profileJson = client.vpnClientProfile(clusterId.getText().toString(), activeOrganizationId, authContext.userId, DEFAULT_ENTRY_NODE_ID); + vpnConnectionId = firstConnectionId(profileJson); + saveProfileState(); + JSONObject resourcePayload = client.resources(activeOrganizationId, authContext.userId); + lastResources = resourcePayload.optJSONArray("resources"); + if (lastResources == null) { + lastResources = new JSONArray(); + } + String resourcesText = resourcesText(resourcePayload); + runOnUiThread(() -> { + profileSummary.setText(summaryText()); + serverDirectory.setText(resourcesText); + status.setText(startAfterLoad ? "Профиль обновлен. Запускаю VPN..." : "Профиль и ключи устройства обновлены."); + startDiagnosticChannel(); + if (startAfterLoad) { + requestVpnPermission(); + } + }); + } catch (Exception ex) { + runOnUiThread(() -> { + String message = friendlyError(ex); + status.setText("Ошибка: " + message); + if (message.contains("логин") || message.contains("пароль") || message.contains("Сессия устройства")) { + clearSavedAuth(false); + showSettingsDialog(); + } + }); + } + }).start(); + } + + private void prepareVpn() { + loadProfile(true); + status.setText("Обновляю сессию устройства и VPN-профиль..."); + } + + private void requestVpnPermission() { + if (profileJson.isEmpty()) { + status.setText("VPN-профиль не загружен."); + return; + } + Intent prepare = VpnService.prepare(this); + if (prepare != null) { + startActivityForResult(prepare, VPN_PREPARE_REQUEST); + return; + } + startVpn(); + } + + @Override + protected void onActivityResult(int requestCode, int resultCode, Intent data) { + super.onActivityResult(requestCode, resultCode, data); + if (requestCode == VPN_PREPARE_REQUEST && resultCode == RESULT_OK) { + startVpn(); + } + } + + private void startVpn() { + Intent intent = 
new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, backendUrl.getText().toString()); + intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, clusterId.getText().toString()); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + startForegroundService(intent); + status.setText("VPN включен. Версия " + APP_VERSION + ". Backend: " + backendUrl.getText() + ". Connection: " + vpnConnectionId); + runtimeStatus.setText(runtimeStatusText()); + } + + private void scheduleRuntimeStatusRefresh() { + runtimeStatus.postDelayed(() -> { + runtimeStatus.setText(runtimeStatusText()); + scheduleRuntimeStatusRefresh(); + }, 1500); + } + + private String runtimeStatusText() { + String state = runtimePrefs.getString("state", "нет данных"); + String message = runtimePrefs.getString("message", ""); + long updatedAt = runtimePrefs.getLong("updated_at", 0); + long read = runtimePrefs.getLong("uplink_read", 0); + long sent = runtimePrefs.getLong("uplink_sent", 0); + long down = runtimePrefs.getLong("downlink_received", 0); + long errors = runtimePrefs.getLong("errors", 0); + long readBytes = runtimePrefs.getLong("uplink_read_bytes", 0); + long sentBytes = runtimePrefs.getLong("uplink_sent_bytes", 0); + long downBytes = runtimePrefs.getLong("downlink_received_bytes", 0); + long droppedRead = runtimePrefs.getLong("uplink_dropped_packets", 0); + long droppedDown = runtimePrefs.getLong("downlink_dropped_packets", 0); + float uplinkReadMbps = runtimePrefs.getFloat("uplink_read_mbps", 0f); + float uplinkSentMbps = runtimePrefs.getFloat("uplink_sent_mbps", 0f); + float downlinkMbps = runtimePrefs.getFloat("downlink_received_mbps", 0f); + float uplinkReadPps = runtimePrefs.getFloat("uplink_read_pps", 0f); + float uplinkSentPps = runtimePrefs.getFloat("uplink_sent_pps", 0f); + float downlinkPps = runtimePrefs.getFloat("downlink_received_pps", 0f); + int workerCount = 
runtimePrefs.getInt("uplink_worker_count", 0); + int queueDepthTotal = runtimePrefs.getInt("uplink_queue_depth_total", 0); + int queueDepthMax = runtimePrefs.getInt("uplink_queue_depth_max", 0); + String queueDepths = runtimePrefs.getString("uplink_queue_depths", ""); + long queue0Drops = runtimePrefs.getLong("uplink_queue_0_drops", 0); + long queue1Drops = runtimePrefs.getLong("uplink_queue_1_drops", 0); + long queue2Drops = runtimePrefs.getLong("uplink_queue_2_drops", 0); + long queue3Drops = runtimePrefs.getLong("uplink_queue_3_drops", 0); + long queue0Offers = runtimePrefs.getLong("uplink_queue_0_offers", 0); + long queue1Offers = runtimePrefs.getLong("uplink_queue_1_offers", 0); + long queue2Offers = runtimePrefs.getLong("uplink_queue_2_offers", 0); + long queue3Offers = runtimePrefs.getLong("uplink_queue_3_offers", 0); + long sender0Packets = runtimePrefs.getLong("uplink_sender_worker_packets_0", 0); + long sender1Packets = runtimePrefs.getLong("uplink_sender_worker_packets_1", 0); + long sender2Packets = runtimePrefs.getLong("uplink_sender_worker_packets_2", 0); + long sender3Packets = runtimePrefs.getLong("uplink_sender_worker_packets_3", 0); + long sender0Errors = runtimePrefs.getLong("uplink_sender_worker_errors_0", 0); + long sender1Errors = runtimePrefs.getLong("uplink_sender_worker_errors_1", 0); + long sender2Errors = runtimePrefs.getLong("uplink_sender_worker_errors_2", 0); + long sender3Errors = runtimePrefs.getLong("uplink_sender_worker_errors_3", 0); + String age = updatedAt <= 0 ? 
"никогда" : ((System.currentTimeMillis() - updatedAt) / 1000) + " сек назад"; + return "Диагностика: " + state + + "\n" + message + + "\nread/sent/down: " + read + "/" + sent + "/" + down + + "\nerrors/drops: " + errors + "/" + (droppedRead + droppedDown) + + "\nthroughput Mbps: up " + String.format(Locale.US, "%.2f", uplinkSentMbps) + + " / down " + String.format(Locale.US, "%.2f", downlinkMbps) + + "\npps: up " + String.format(Locale.US, "%.1f", uplinkSentPps) + + " / down " + String.format(Locale.US, "%.1f", downlinkPps) + + "\nbytes read/sent/down: " + readBytes + "/" + sentBytes + "/" + downBytes + + "\nworkers: " + workerCount + + "\nqueue depth total/max: " + queueDepthTotal + " / " + queueDepthMax + + "\nqueue depths: " + (queueDepths.isEmpty() ? "-" : queueDepths) + + "\nqueue0 q/s: " + queue0Offers + "/" + queue0Drops + + " q1 " + queue1Offers + "/" + queue1Drops + + " q2 " + queue2Offers + "/" + queue2Drops + + " q3 " + queue3Offers + "/" + queue3Drops + + "\nsender pkt/err: w0 " + sender0Packets + "/" + sender0Errors + + " w1 " + sender1Packets + "/" + sender1Errors + + " w2 " + sender2Packets + "/" + sender2Errors + + " w3 " + sender3Packets + "/" + sender3Errors + + "\nобновлено: " + age; + } + + private void startDiagnosticChannel() { + if (authContext == null || authContext.deviceId.isEmpty()) { + return; + } + RapDiagnosticService.start(this); + } + + private String firstConnectionId(String profile) throws Exception { + JSONObject root = new JSONObject(profile); + JSONObject vpnProfile = root.getJSONObject("vpn_client_profile"); + JSONArray connections = vpnProfile.getJSONArray("connections"); + if (connections.length() == 0) { + throw new IllegalStateException("VPN profile has no connections"); + } + return connections.getJSONObject(0).getString("id"); + } + + private String resourcesText(JSONObject payload) throws Exception { + JSONArray resources = payload.optJSONArray("resources"); + if (resources == null || resources.length() == 0) { + return 
"Серверы: доступных ресурсов нет."; + } + StringBuilder text = new StringBuilder("Серверы:\n"); + int limit = Math.min(resources.length(), 6); + for (int i = 0; i < limit; i++) { + JSONObject resource = resources.getJSONObject(i); + text.append("• ") + .append(resource.optString("name", "server")) + .append(" ") + .append(resource.optString("protocol", "rdp")) + .append(" ") + .append(resource.optString("address", "")) + .append('\n'); + } + if (resources.length() > limit) { + text.append("и еще ").append(resources.length() - limit).append("..."); + } + return text.toString().trim(); + } + + private int dp(int value) { + return (int) (value * getResources().getDisplayMetrics().density); + } + + private String summaryText() { + String deviceId = prefs == null ? "" : prefs.getString(PREF_DEVICE_ID, ""); + String connectionId = vpnConnectionId == null || vpnConnectionId.isEmpty() + ? (prefs == null ? "" : prefs.getString(PREF_VPN_CONNECTION_ID, "")) + : vpnConnectionId; + return "Версия: " + APP_VERSION + + "\nВход: usa-los-1" + + "\nОрганизация: HOME" + + "\nВыход: home-1" + + "\nBackend: " + backendUrl.getText() + + "\nDevice: " + (deviceId.isEmpty() ? "нет" : deviceId) + + "\nConnection: " + (connectionId.isEmpty() ? "нет" : connectionId); + } + + private String preferredBackendUrl() { + String saved = prefs.getString("backend_url", DEFAULT_BACKEND_URL); + String normalized = normalizeBackendUrl(saved); + if (!normalized.equals(saved == null ? 
"" : saved.trim())) { + prefs.edit().putString("backend_url", normalized).apply(); + } + return normalized; + } + + private void saveSettings() { + String normalizedBackend = normalizeBackendUrl(backendUrl.getText().toString()); + if (!normalizedBackend.equals(backendUrl.getText().toString().trim())) { + backendUrl.setText(normalizedBackend); + } + prefs.edit() + .putString("backend_url", normalizedBackend) + .putString("cluster_id", clusterId.getText().toString()) + .putString("organization_id", organizationId.getText().toString()) + .putString("email", email.getText().toString()) + .apply(); + } + + private String normalizeBackendUrl(String value) { + String candidate = value == null ? "" : value.trim().replaceAll("/+$", ""); + if (candidate.isEmpty() || isLegacyControlPlaneUrl(candidate)) { + return DEFAULT_BACKEND_URL; + } + return candidate; + } + + private boolean isLegacyControlPlaneUrl(String value) { + String lower = value.toLowerCase(); + return lower.equals("http://94.141.118.222:19191/api/v1") + || lower.equals("http://vpn.cin.su:19191/api/v1") + || lower.equals("http://192.168.200.61:18080/api/v1") + || lower.equals("http://docker-test.cin.su:18080/api/v1") + || lower.equals("http://docker-test.cin.su/api/v1") + || lower.equals("http://192.168.200.61/api/v1"); + } + + private RapApiClient.AuthContext authenticate(RapApiClient client) throws Exception { + String savedRefresh = savedRefreshToken(); + if (!savedRefresh.isEmpty()) { + try { + RapApiClient.AuthContext refreshed = client.refresh(savedRefresh); + saveAuthContext(refreshed); + return refreshed; + } catch (Exception ignored) { + clearSavedAuth(false); + } + } + String passwordValue = password.getText().toString().trim(); + if (passwordValue.isEmpty()) { + throw new IllegalStateException("Сессия устройства истекла или отозвана. 
Введите пароль один раз, дальше ключи обновятся автоматически."); + } + RapApiClient.AuthContext loggedIn = client.login(email.getText().toString().trim(), passwordValue, deviceFingerprint()); + saveAuthContext(loggedIn); + return loggedIn; + } + + private String resolveOrganizationId(RapApiClient client, String userId) throws Exception { + JSONObject payload = client.organizations(userId); + JSONArray organizations = payload.optJSONArray("organizations"); + if (organizations == null || organizations.length() == 0) { + throw new IllegalStateException("У пользователя нет активной организации."); + } + String configured = organizationId.getText().toString().trim(); + JSONObject fallback = null; + for (int i = 0; i < organizations.length(); i++) { + JSONObject item = organizations.optJSONObject(i); + if (item == null) { + continue; + } + String id = item.optString("id", ""); + String name = item.optString("name", ""); + String slug = item.optString("slug", ""); + if (!configured.isEmpty() && configured.equals(id)) { + return configured; + } + if (fallback == null || "HOME".equalsIgnoreCase(name) || "home".equalsIgnoreCase(slug)) { + fallback = item; + } + } + String selected = fallback != null ? 
fallback.optString("id", "") : ""; + if (selected.isEmpty()) { + throw new IllegalStateException("Не удалось выбрать организацию пользователя."); + } + runOnUiThread(() -> { + organizationId.setText(selected); + saveSettings(); + }); + return selected; + } + + private void saveAuthContext(RapApiClient.AuthContext context) throws Exception { + secureTokens.put(PREF_REFRESH_TOKEN, context.refreshToken); + prefs.edit() + .putString(PREF_USER_ID, context.userId) + .putString(PREF_DEVICE_ID, context.deviceId) + .putString(PREF_REFRESH_EXPIRES_AT, context.refreshTokenExpiresAt) + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, vpnConnectionId) + .apply(); + } + + private void saveProfileState() { + prefs.edit() + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, vpnConnectionId) + .apply(); + } + + private void restoreAuthContext() { + String userId = prefs.getString(PREF_USER_ID, ""); + String deviceId = prefs.getString(PREF_DEVICE_ID, ""); + if (!userId.isEmpty() && !deviceId.isEmpty()) { + authContext = new RapApiClient.AuthContext( + userId, + deviceId, + "", + "", + secureTokens.get(PREF_REFRESH_TOKEN), + prefs.getString(PREF_REFRESH_EXPIRES_AT, "")); + } + } + + private void clearSavedAuth(boolean clearProfile) { + secureTokens.remove(PREF_REFRESH_TOKEN); + SharedPreferences.Editor editor = prefs.edit() + .remove(PREF_REFRESH_EXPIRES_AT) + .remove(PREF_USER_ID) + .remove(PREF_DEVICE_ID); + if (clearProfile) { + editor.remove(PREF_PROFILE_JSON).remove(PREF_VPN_CONNECTION_ID); + profileJson = ""; + vpnConnectionId = ""; + } + editor.apply(); + authContext = null; + } + + private String savedRefreshToken() { + String token = secureTokens.get(PREF_REFRESH_TOKEN); + if (!token.isEmpty()) { + return token; + } + String legacyToken = prefs.getString(PREF_REFRESH_TOKEN, ""); + if (!legacyToken.isEmpty()) { + try { + secureTokens.put(PREF_REFRESH_TOKEN, legacyToken); + 
prefs.edit().remove(PREF_REFRESH_TOKEN).apply(); + } catch (Exception ignored) { + } + } + return legacyToken; + } + + private String deviceFingerprint() { + String existing = prefs.getString(PREF_DEVICE_FINGERPRINT, ""); + if (!existing.isEmpty()) { + return existing; + } + String generated = "android-" + java.util.UUID.randomUUID(); + prefs.edit().putString(PREF_DEVICE_FINGERPRINT, generated).apply(); + return generated; + } + + private void showSettingsDialog() { + LinearLayout form = new LinearLayout(this); + form.setOrientation(LinearLayout.VERTICAL); + int pad = dp(12); + form.setPadding(pad, pad, pad, pad); + EditText backendDraft = field("Backend URL", backendUrl.getText().toString()); + EditText clusterDraft = field("Cluster ID", clusterId.getText().toString()); + EditText organizationDraft = field("Organization ID", organizationId.getText().toString()); + EditText emailDraft = field("Email", email.getText().toString()); + EditText passwordDraft = field("Password", password.getText().toString()); + passwordDraft.setInputType(InputType.TYPE_CLASS_TEXT | InputType.TYPE_TEXT_VARIATION_PASSWORD); + passwordDraft.setHint("Password (не сохраняется)"); + CheckBox showPassword = new CheckBox(this); + showPassword.setText("Показать пароль"); + showPassword.setTextColor(0xff111111); + showPassword.setOnCheckedChangeListener((buttonView, isChecked) -> { + passwordDraft.setInputType(InputType.TYPE_CLASS_TEXT | (isChecked + ? 
InputType.TYPE_TEXT_VARIATION_VISIBLE_PASSWORD + : InputType.TYPE_TEXT_VARIATION_PASSWORD)); + passwordDraft.setSelection(passwordDraft.getText().length()); + }); + form.addView(backendDraft); + form.addView(clusterDraft); + form.addView(organizationDraft); + form.addView(emailDraft); + form.addView(passwordDraft); + form.addView(showPassword); + new AlertDialog.Builder(this) + .setTitle("Настройки подключения") + .setView(form) + .setPositiveButton("Сохранить", (dialog, which) -> { + backendUrl.setText(backendDraft.getText().toString()); + clusterId.setText(clusterDraft.getText().toString()); + organizationId.setText(organizationDraft.getText().toString()); + email.setText(emailDraft.getText().toString()); + password.setText(passwordDraft.getText().toString()); + saveSettings(); + profileSummary.setText(summaryText()); + }) + .setNeutralButton("Забыть устройство", (dialog, which) -> { + clearSavedAuth(true); + status.setText("Устройство забыто. Для следующего входа нужен пароль."); + }) + .setNegativeButton("Отмена", null) + .show(); + } + + private String friendlyError(Exception ex) { + String message = ex.getMessage(); + if (message == null || message.trim().isEmpty()) { + return "неизвестная ошибка"; + } + if (message.contains("auth.invalid_credentials") || message.contains("Неверный логин")) { + int passwordLength = password.getText() == null ? 0 : password.getText().toString().length(); + return "Неверный логин или пароль. Проверьте раскладку и спецсимволы. Длина введенного пароля: " + passwordLength + "."; + } + if (message.contains("auth.invalid_refresh_token") || message.contains("invalid refresh token")) { + return "Сессия устройства истекла. 
Введите пароль один раз, дальше ключи обновятся автоматически."; + } + return message; + } + + private void showServerPicker() { + if (lastResources.length() == 0) { + loadProfile(); + status.setText("Загружаю список серверов..."); + return; + } + String[] labels = new String[lastResources.length()]; + for (int i = 0; i < lastResources.length(); i++) { + JSONObject resource = lastResources.optJSONObject(i); + labels[i] = resource == null + ? "server" + : resource.optString("name", "server") + " " + resource.optString("address", ""); + } + new AlertDialog.Builder(this) + .setTitle("Удаленный сервер") + .setItems(labels, (dialog, which) -> startRemoteDesktop(which)) + .show(); + } + + private void startRemoteDesktop(int index) { + JSONObject resource = lastResources.optJSONObject(index); + if (resource == null) { + return; + } + if (authContext == null || authContext.userId.isEmpty() || authContext.deviceId.isEmpty()) { + loadProfile(); + status.setText("Профиль обновляется. Повторите открытие сервера."); + return; + } + status.setText("Открываю " + resource.optString("name", "сервер") + "..."); + new Thread(() -> { + try { + RapApiClient client = new RapApiClient(backendUrl.getText().toString(), this); + JSONObject result = client.startSession(resource.getString("id"), authContext.userId, authContext.deviceId); + Intent intent = new Intent(this, RdpActivity.class); + intent.putExtra(RdpActivity.EXTRA_SESSION_RESULT, result.toString()); + intent.putExtra(RdpActivity.EXTRA_GATEWAY_URL, gatewayUrl()); + intent.putExtra(RdpActivity.EXTRA_RESOURCE_NAME, resource.optString("name", "Remote Desktop")); + runOnUiThread(() -> { + status.setText("Сессия создана."); + startActivity(intent); + }); + } catch (Exception ex) { + runOnUiThread(() -> status.setText("Ошибка RDP: " + ex.getMessage())); + } + }).start(); + } + + private String gatewayUrl() { + String api = backendUrl.getText().toString().trim(); + String gateway = api.replace("https://", "wss://").replace("http://", 
"ws://"); + if (gateway.endsWith("/")) { + gateway = gateway.substring(0, gateway.length() - 1); + } + return gateway + "/gateway/ws"; + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapApiClient.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapApiClient.java new file mode 100644 index 0000000..cf52cac --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapApiClient.java @@ -0,0 +1,583 @@ +package su.cin.rapvpn; + +import android.content.Context; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.VpnService; + +import okhttp3.MediaType; +import okhttp3.OkHttpClient; +import okhttp3.Dispatcher; +import okhttp3.ConnectionPool; +import okhttp3.Request; +import okhttp3.RequestBody; +import okhttp3.Response; +import okhttp3.ResponseBody; + +import org.json.JSONObject; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.io.InterruptedIOException; +import java.net.InetAddress; +import java.net.InetSocketAddress; +import java.net.Socket; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.List; +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import javax.net.SocketFactory; + +final class RapApiClient { + private static final MediaType JSON = MediaType.get("application/json; charset=utf-8"); + private static final MediaType OCTET_STREAM = MediaType.get("application/octet-stream"); + private static final int MAX_PACKET_BATCH_PACKETS = 128; + private static final int MAX_PACKET_BATCH_BYTES = 128 * 1024; + private static final int MAX_SINGLE_PACKET_BYTES = 65535; + private static final int MAX_BATCH_HEADER_BYTES = 4; + private static final int BATCH_RETRY_THRESHOLD = 2; + private final String baseUrl; + private final OkHttpClient httpClient; + private final String networkMode; + private volatile boolean batchModeEnabled = true; + private volatile int 
batchModeFailures = 0; + + RapApiClient(String baseUrl) { + this(baseUrl, null); + } + + RapApiClient(String baseUrl, Context context) { + this.baseUrl = trimRight(baseUrl); + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + SocketFactory socketFactory = context == null ? null : underlyingSocketFactory(context); + if (socketFactory != null) { + builder.socketFactory(socketFactory); + this.networkMode = "direct_network"; + } else { + this.networkMode = "default_network"; + } + builder.connectTimeout(10, TimeUnit.SECONDS); + builder.writeTimeout(45, TimeUnit.SECONDS); + builder.readTimeout(45, TimeUnit.SECONDS); + builder.callTimeout(50, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + RapApiClient(String baseUrl, VpnService vpnService) { + this.baseUrl = trimRight(baseUrl); + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + if (vpnService != null) { + SocketFactory socketFactory = underlyingSocketFactory(vpnService); + builder.socketFactory(socketFactory != null ? socketFactory : new ProtectedSocketFactory(vpnService)); + this.networkMode = socketFactory != null ? 
"direct_network" : "protected_socket"; + } else { + this.networkMode = "default_network"; + } + builder.connectTimeout(10, TimeUnit.SECONDS); + builder.writeTimeout(45, TimeUnit.SECONDS); + builder.readTimeout(45, TimeUnit.SECONDS); + builder.callTimeout(50, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + String networkMode() { + return networkMode; + } + + private SocketFactory underlyingSocketFactory(Context context) { + ConnectivityManager connectivity = (ConnectivityManager) context.getSystemService(Context.CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities == null) { + continue; + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + continue; + } + if (!capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)) { + continue; + } + return network.getSocketFactory(); + } + return null; + } + + AuthContext login(String email, String password, String deviceFingerprint) throws Exception { + JSONObject body = new JSONObject(); + body.put("email", email); + body.put("password", password); + body.put("device_fingerprint", deviceFingerprint); + body.put("device_label", "RAP Android VPN"); + body.put("trust_device", true); + JSONObject response = post("/auth/login", body); + return parseAuthContext(response); + } + + AuthContext refresh(String refreshToken) throws Exception { + JSONObject body = new JSONObject(); + body.put("refresh_token", refreshToken); + return parseAuthContext(post("/auth/refresh", body)); + } + + String vpnClientProfile(String clusterId, String organizationId, 
String userId, String entryNodeId) throws Exception { + String path = "/clusters/" + clusterId + "/vpn/client-profile?organization_id=" + organizationId + "&user_id=" + userId; + if (entryNodeId != null && !entryNodeId.trim().isEmpty()) { + path += "&entry_node_id=" + entryNodeId.trim(); + } + return get(path).toString(); + } + + JSONObject organizations(String userId) throws Exception { + return get("/organizations/?user_id=" + userId); + } + + JSONObject resources(String organizationId, String userId) throws Exception { + String path = "/resources/?organization_id=" + organizationId + "&user_id=" + userId; + return get(path); + } + + JSONObject startSession(String resourceId, String userId, String deviceId) throws Exception { + JSONObject body = new JSONObject(); + body.put("resource_id", resourceId); + body.put("user_id", userId); + body.put("device_id", deviceId); + return post("/sessions/", body); + } + + JSONObject reportVPNDiagnosticStatus(String clusterId, String deviceId, JSONObject payload) throws Exception { + return post("/clusters/" + clusterId + "/vpn/client-diagnostics/" + deviceId + "/status", payload); + } + + JSONObject nextVPNDiagnosticCommand(String clusterId, String deviceId, int timeoutMs) throws Exception { + byte[] payload = getBytes("/clusters/" + clusterId + "/vpn/client-diagnostics/" + deviceId + "/commands?timeout_ms=" + timeoutMs); + if (payload.length == 0) { + return null; + } + return new JSONObject(new String(payload, StandardCharsets.UTF_8)); + } + + JSONObject vpnPacketStats(String clusterId, String vpnConnectionId) throws Exception { + return get("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/stats"); + } + + JSONObject resetVPNPacketQueues(String clusterId, String vpnConnectionId) throws Exception { + return post("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/reset", new JSONObject()); + } + + void sendClientPacket(String clusterId, String vpnConnectionId, byte[] packet, 
int length) throws Exception { + postBytes("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets", packet, length); + } + + void sendClientPacketBatch(String clusterId, String vpnConnectionId, List<byte[]> packets) throws Exception { + if (!batchModeEnabled) { + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + sendClientPacket(clusterId, vpnConnectionId, packet, packet.length); + } + return; + } + if (packets == null || packets.isEmpty()) { + return; + } + try { + List<List<byte[]>> chunks = chunkPacketsForBatch(packets); + if (chunks.isEmpty()) { + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + sendClientPacket(clusterId, vpnConnectionId, packet, packet.length); + } + return; + } + for (List<byte[]> chunk : chunks) { + postBytes("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets?batch=true", encodePacketBatch(chunk)); + } + resetBatchMode(); + } catch (Exception e) { + if (shouldDisableBatchMode(e)) { + disableBatchMode(); + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + sendClientPacket(clusterId, vpnConnectionId, packet, packet.length); + } + return; + } + throw e; + } + } + + byte[] receiveClientPacket(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + try { + return getBytes("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets?timeout_ms=" + timeoutMs); + } catch (InterruptedIOException e) { + return new byte[0]; + } catch (IOException e) { + if (e.getMessage() != null && e.getMessage().toLowerCase().contains("timeout")) { + return new byte[0]; + } + throw e; + } catch (IllegalStateException e) { + String message = e.getMessage(); + if (message != null && (message.contains("HTTP 502") || message.contains("HTTP 503") || message.contains("HTTP 504"))) { + return new byte[0]; + } + throw e; + } + } + + List<byte[]>
receiveClientPacketBatch(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + if (!batchModeEnabled) { + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + byte[] payload; + try { + payload = getBytes("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets?batch=true&timeout_ms=" + timeoutMs); + if (payload == null || payload.length == 0) { + return new ArrayList<>(); + } + if (!isLikelyPacketBatch(payload)) { + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + return decodePacketBatch(payload); + } catch (InterruptedIOException e) { + return new ArrayList<>(); + } catch (IOException e) { + if (e.getMessage() != null && e.getMessage().toLowerCase().contains("timeout")) { + return new ArrayList<>(); + } + if (shouldDisableBatchMode(e)) { + disableBatchMode(); + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + throw e; + } catch (IllegalStateException e) { + String message = e.getMessage(); + if (message != null && (message.contains("HTTP 502") || message.contains("HTTP 503") || message.contains("HTTP 504"))) { + return new ArrayList<>(); + } + if (shouldDisableBatchMode(e)) { + disableBatchMode(); + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + throw e; + } catch (RuntimeException e) { + if (shouldDisableBatchMode(e)) { + disableBatchMode(); + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + throw e; + } + } + + private JSONObject get(String path) throws Exception { + Request request = new Request.Builder().url(baseUrl + path).get().build(); + return read(request); + } + + private JSONObject post(String path, JSONObject body) throws Exception { + Request request = new Request.Builder() + .url(baseUrl + path) + .post(RequestBody.create(body.toString().getBytes(StandardCharsets.UTF_8), JSON)) + .build(); + return read(request); + } + + private byte[] 
getBytes(String path) throws Exception { + Request request = new Request.Builder().url(baseUrl + path).get().build(); + try (Response response = httpClient.newCall(request).execute()) { + if (response.code() == 204) { + return new byte[0]; + } + if (!response.isSuccessful()) { + throw new IllegalStateException("HTTP " + response.code()); + } + ResponseBody body = response.body(); + return body == null ? new byte[0] : body.bytes(); + } + } + + private void postBytes(String path, byte[] packet, int length) throws Exception { + byte[] bodyBytes = new byte[length]; + System.arraycopy(packet, 0, bodyBytes, 0, length); + postBytes(path, bodyBytes); + } + + private void postBytes(String path, byte[] bodyBytes) throws Exception { + Request request = new Request.Builder() + .url(baseUrl + path) + .post(RequestBody.create(bodyBytes, OCTET_STREAM)) + .build(); + try (Response response = httpClient.newCall(request).execute()) { + if (!response.isSuccessful()) { + throw new IllegalStateException("HTTP " + response.code()); + } + } + } + + private byte[] encodePacketBatch(List<byte[]> packets) { + int total = 0; + for (byte[] packet : packets) { + if (packet != null && packet.length > 0) { + total += 4 + packet.length; + } + } + byte[] out = new byte[total]; + int offset = 0; + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + int length = packet.length; + out[offset] = (byte) ((length >> 24) & 0xff); + out[offset + 1] = (byte) ((length >> 16) & 0xff); + out[offset + 2] = (byte) ((length >> 8) & 0xff); + out[offset + 3] = (byte) (length & 0xff); + offset += 4; + System.arraycopy(packet, 0, out, offset, length); + offset += length; + } + return out; + } + + private JSONObject read(Request request) throws Exception { + try (Response response = httpClient.newCall(request).execute()) { + ResponseBody body = response.body(); + String text = body == null ? 
"" : body.string(); + if (!response.isSuccessful()) { + if (response.code() == 401 && text.contains("auth.invalid_credentials")) { + throw new IllegalStateException("Invalid login or password."); + } + if (response.code() == 401 && text.contains("auth.invalid_refresh_token")) { + throw new IllegalStateException("The device session has expired. Enter your password once."); + } + throw new IllegalStateException("HTTP " + response.code() + ": " + text); + } + return new JSONObject(text); + } + } + + private List<byte[]> decodePacketBatch(byte[] payload) { + List<byte[]> packets = new ArrayList<>(); + int offset = 0; + while (payload != null && offset + 4 <= payload.length) { + int length = ((payload[offset] & 0xff) << 24) + | ((payload[offset + 1] & 0xff) << 16) + | ((payload[offset + 2] & 0xff) << 8) + | (payload[offset + 3] & 0xff); + offset += 4; + if (length <= 0 || offset + length > payload.length) { + break; + } + byte[] packet = new byte[length]; + System.arraycopy(payload, offset, packet, 0, length); + packets.add(packet); + offset += length; + } + return packets; + } + + private List<List<byte[]>> chunkPacketsForBatch(List<byte[]> packets) { + List<List<byte[]>> chunks = new ArrayList<>(); + List<byte[]> current = new ArrayList<>(); + int currentBytes = 0; + boolean hasData = false; + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + if (packet.length > MAX_SINGLE_PACKET_BYTES) { + continue; + } + hasData = true; + + int projected = currentBytes + MAX_BATCH_HEADER_BYTES + packet.length; + if (!current.isEmpty() && (current.size() >= MAX_PACKET_BATCH_PACKETS || projected > MAX_PACKET_BATCH_BYTES)) { + chunks.add(current); + current = new ArrayList<>(); + currentBytes = 0; + } + current.add(packet); + currentBytes = projected; + } + if (!hasData) { + return chunks; + } + if (!current.isEmpty()) { + chunks.add(current); + } + return chunks; + } + + private boolean isLikelyPacketBatch(byte[] payload) { + if (payload == null || payload.length < MAX_BATCH_HEADER_BYTES) { + return 
false; + } + int offset = 0; + int consumed = 0; + while (offset + MAX_BATCH_HEADER_BYTES <= payload.length) { + int length = ((payload[offset] & 0xff) << 24) + | ((payload[offset + 1] & 0xff) << 16) + | ((payload[offset + 2] & 0xff) << 8) + | (payload[offset + 3] & 0xff); + offset += MAX_BATCH_HEADER_BYTES; + if (length <= 0 || length > MAX_SINGLE_PACKET_BYTES) { + return false; + } + if (offset + length > payload.length) { + return false; + } + offset += length; + consumed++; + if (consumed > MAX_PACKET_BATCH_PACKETS) { + return false; + } + } + return offset == payload.length && consumed > 0; + } + + private List<byte[]> receiveSinglePacketAsBatch(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + byte[] payload = receiveClientPacket(clusterId, vpnConnectionId, timeoutMs); + if (payload == null || payload.length == 0) { + return new ArrayList<>(); + } + return new ArrayList<>(Collections.singletonList(payload)); + } + + private boolean shouldDisableBatchMode(Throwable error) { + return error != null; + } + + private synchronized void disableBatchMode() { + batchModeFailures++; + if (batchModeFailures >= BATCH_RETRY_THRESHOLD) { + batchModeEnabled = false; + } + } + + private synchronized void resetBatchMode() { + batchModeFailures = 0; + batchModeEnabled = true; + } + + private AuthContext parseAuthContext(JSONObject response) throws Exception { + JSONObject user = response.getJSONObject("user"); + String userId = user.optString("id", ""); + if (userId.isEmpty()) { + userId = user.optString("ID", ""); + } + JSONObject device = response.optJSONObject("device"); + String deviceId = device != null ? device.optString("id", "") : ""; + if (deviceId.isEmpty() && device != null) { + deviceId = device.optString("ID", ""); + } + JSONObject tokens = response.optJSONObject("tokens"); + String accessToken = tokens != null ? tokens.optString("access_token", "") : ""; + String accessExpiresAt = tokens != null ? 
tokens.optString("access_token_expires_at", "") : ""; + String refreshToken = tokens != null ? tokens.optString("refresh_token", "") : ""; + String refreshExpiresAt = tokens != null ? tokens.optString("refresh_token_expires_at", "") : ""; + return new AuthContext(userId, deviceId, accessToken, accessExpiresAt, refreshToken, refreshExpiresAt); + } + + private String trimRight(String value) { + while (value.endsWith("/")) { + value = value.substring(0, value.length() - 1); + } + return value; + } + + private static final class ProtectedSocketFactory extends SocketFactory { + private final SocketFactory delegate = SocketFactory.getDefault(); + private final VpnService vpnService; + + ProtectedSocketFactory(VpnService vpnService) { + this.vpnService = vpnService; + } + + @Override + public Socket createSocket() throws IOException { + return protect(delegate.createSocket()); + } + + @Override + public Socket createSocket(String host, int port) throws IOException { + Socket socket = protect(delegate.createSocket()); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(String host, int port, InetAddress localHost, int localPort) throws IOException { + Socket socket = protect(delegate.createSocket()); + socket.bind(new InetSocketAddress(localHost, localPort)); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(InetAddress host, int port) throws IOException { + Socket socket = protect(delegate.createSocket()); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort) throws IOException { + Socket socket = protect(delegate.createSocket()); + socket.bind(new InetSocketAddress(localAddress, localPort)); + socket.connect(new InetSocketAddress(address, port)); + return socket; + } + + private Socket protect(Socket socket) 
throws IOException { + if (!vpnService.protect(socket)) { + try { + socket.close(); + } catch (IOException ignored) { + } + throw new IOException("protect control-plane socket failed"); + } + return socket; + } + } + + static final class AuthContext { + final String userId; + final String deviceId; + final String accessToken; + final String accessTokenExpiresAt; + final String refreshToken; + final String refreshTokenExpiresAt; + + AuthContext(String userId, String deviceId, String accessToken, String accessTokenExpiresAt, String refreshToken, String refreshTokenExpiresAt) { + this.userId = userId; + this.deviceId = deviceId; + this.accessToken = accessToken; + this.accessTokenExpiresAt = accessTokenExpiresAt; + this.refreshToken = refreshToken; + this.refreshTokenExpiresAt = refreshTokenExpiresAt; + } + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java new file mode 100644 index 0000000..eaf0a58 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java @@ -0,0 +1,708 @@ +package su.cin.rapvpn; + +import android.app.Notification; +import android.app.NotificationChannel; +import android.app.NotificationManager; +import android.app.Service; +import android.content.Intent; +import android.content.SharedPreferences; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.Uri; +import android.net.VpnService; +import android.os.Build; +import android.os.IBinder; + +import org.json.JSONObject; + +import java.net.DatagramPacket; +import java.net.DatagramSocket; +import java.net.HttpURLConnection; +import java.net.InetAddress; +import java.net.SocketTimeoutException; +import java.net.URI; +import java.net.URL; +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.List; +import java.util.Random; +import 
java.text.SimpleDateFormat; +import java.util.Date; + +public class RapDiagnosticService extends Service { + static final String ACTION_START = "su.cin.rapvpn.DIAGNOSTIC_START"; + static final String ACTION_STOP = "su.cin.rapvpn.DIAGNOSTIC_STOP"; + private static final String CHANNEL_ID = "rap-vpn-diagnostics"; + private static final String APP_VERSION = BuildConfig.VERSION_NAME; + private static final String DEFAULT_BACKEND_URL = "http://195.123.240.88:19131/api/v1"; + private static final String DEFAULT_ENTRY_NODE_ID = "b829ffde-690b-47ab-9522-0f22ab42596d"; + private static final String PREFS = "rap-vpn"; + private static final String RUNTIME_PREFS = "rap-vpn-runtime"; + private static final String PREF_REFRESH_TOKEN = "refresh_token"; + private static final String PREF_USER_ID = "user_id"; + private static final String PREF_DEVICE_ID = "device_id"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + private volatile boolean running; + private Thread worker; + private String serviceState = ""; + private String lastCommandType = ""; + private String lastCommandResult = ""; + private long lastCommandAt = 0; + private long lastHeartbeatAt = 0; + private long lastCommandPollAt = 0; + private String controlNetworkMode = ""; + + @Override + public int onStartCommand(Intent intent, int flags, int startId) { + if (intent != null && ACTION_STOP.equals(intent.getAction())) { + running = false; + if (worker != null) { + worker.interrupt(); + } + stopForeground(true); + stopSelf(); + return START_NOT_STICKY; + } + startForeground(1002, notification()); + startWorker(); + return START_STICKY; + } + + @Override + public void onDestroy() { + running = false; + if (worker != null) { + worker.interrupt(); + } + super.onDestroy(); + } + + @Override + public IBinder onBind(Intent intent) { + return null; + } + + static void start(android.content.Context context) { + Intent intent = new 
Intent(context, RapDiagnosticService.class); + intent.setAction(ACTION_START); + if (Build.VERSION.SDK_INT >= 26) { + context.startForegroundService(intent); + } else { + context.startService(intent); + } + } + + private void startWorker() { + if (worker != null && worker.isAlive()) { + return; + } + running = true; + worker = new Thread(this::runLoop, "rap-vpn-diagnostic-service"); + worker.start(); + } + + private void runLoop() { + while (running) { + try { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + String backendUrl = normalizeBackendUrl(prefs.getString("backend_url", "")); + if (!backendUrl.equals(prefs.getString("backend_url", ""))) { + prefs.edit().putString("backend_url", backendUrl).apply(); + } + String clusterId = prefs.getString("cluster_id", ""); + String deviceId = prefs.getString(PREF_DEVICE_ID, ""); + if (backendUrl.isEmpty() || clusterId.isEmpty() || deviceId.isEmpty()) { + Thread.sleep(3000); + continue; + } + RapApiClient client = new RapApiClient(backendUrl, this); + controlNetworkMode = client.networkMode(); + lastHeartbeatAt = System.currentTimeMillis(); + serviceState = "online " + new SimpleDateFormat("HH:mm:ss").format(new Date()); + client.reportVPNDiagnosticStatus(clusterId, deviceId, statusPayload("heartbeat")); + lastCommandPollAt = System.currentTimeMillis(); + JSONObject commandEnvelope = client.nextVPNDiagnosticCommand(clusterId, deviceId, 5000); + if (commandEnvelope != null) { + handleCommand(client, clusterId, deviceId, commandEnvelope); + } + } catch (InterruptedException ignored) { + return; + } catch (Exception e) { + serviceState = "error: " + e.getMessage(); + try { + Thread.sleep(3000); + } catch (InterruptedException interrupted) { + return; + } + } + } + } + + private void handleCommand(RapApiClient client, String clusterId, String deviceId, JSONObject envelope) throws Exception { + JSONObject command = envelope.optJSONObject("vpn_client_diagnostic_command"); + JSONObject payload = command == 
null ? envelope.optJSONObject("payload") : command.optJSONObject("payload"); + if (payload == null) { + return; + } + String type = payload.optString("type", ""); + String result; + if ("start_vpn".equals(type)) { + result = startVPNFromSavedProfile(); + } else if ("stop_vpn".equals(type)) { + Intent stopIntent = new Intent(this, RapVpnService.class); + stopIntent.setAction(RapVpnService.ACTION_STOP); + startService(stopIntent); + result = "stop_vpn accepted"; + } else if ("http_get".equals(type)) { + result = runHttpGet(payload.optString("url", "http://192.168.200.61:18080/")); + } else if ("vpn_http_get".equals(type)) { + result = runVPNHttpGet(payload.optString("url", "http://192.168.200.61:18080/")); + } else if ("vpn_dns_lookup".equals(type)) { + result = runVPNDNSLookup(payload.optString("host", "2ip.ru")); + } else if ("open_url".equals(type)) { + String url = payload.optString("url", "http://2ip.ru/"); + Intent open = new Intent(Intent.ACTION_VIEW, Uri.parse(url)); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(open); + result = "open_url accepted " + url; + } else if ("vpn_stats".equals(type)) { + result = collectVPNStats(client, clusterId); + } else if ("full_vpn_test".equals(type)) { + result = runFullVPNTest(client, clusterId, payload); + } else if ("refresh_profile".equals(type)) { + result = refreshProfile(); + } else { + result = "unknown command " + type; + } + lastCommandType = type; + lastCommandResult = result; + lastCommandAt = System.currentTimeMillis(); + JSONObject report = statusPayload("command_result"); + report.put("command_type", type); + report.put("command_result", result); + client.reportVPNDiagnosticStatus(clusterId, deviceId, report); + } + + private String startVPNFromSavedProfile() { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + String profileJson = prefs.getString(PREF_PROFILE_JSON, ""); + String backendUrl = prefs.getString("backend_url", ""); + String clusterId = 
prefs.getString("cluster_id", ""); + String vpnConnectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + if (profileJson.isEmpty() || backendUrl.isEmpty() || clusterId.isEmpty() || vpnConnectionId.isEmpty()) { + return "start_vpn skipped: profile/backend/cluster/connection missing"; + } + if (VpnService.prepare(this) != null) { + Intent launcher = new Intent(this, TestVpnActivity.class); + launcher.putExtra(TestVpnActivity.EXTRA_PROFILE_JSON, profileJson); + launcher.putExtra(TestVpnActivity.EXTRA_BACKEND_URL, backendUrl); + launcher.putExtra(TestVpnActivity.EXTRA_CLUSTER_ID, clusterId); + launcher.putExtra(TestVpnActivity.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + launcher.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(launcher); + return "start_vpn permission required: opened vpn launcher " + vpnConnectionId; + } + Intent intent = new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, backendUrl); + intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, clusterId); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + if (Build.VERSION.SDK_INT >= 26) { + startForegroundService(intent); + } else { + startService(intent); + } + return "start_vpn accepted " + vpnConnectionId; + } + + private String refreshProfile() { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + try { + String refreshToken = new SecureTokenStore(this).get(PREF_REFRESH_TOKEN); + if (refreshToken.isEmpty()) { + return "refresh_profile skipped: refresh token missing"; + } + RapApiClient client = new RapApiClient(normalizeBackendUrl(prefs.getString("backend_url", "")), this); + RapApiClient.AuthContext auth = client.refresh(refreshToken); + String organizationId = prefs.getString("organization_id", ""); + String clusterId = prefs.getString("cluster_id", ""); + String profileJson = client.vpnClientProfile(clusterId, organizationId, auth.userId, 
DEFAULT_ENTRY_NODE_ID); + JSONObject root = new JSONObject(profileJson); + JSONObject profile = root.getJSONObject("vpn_client_profile"); + String connectionId = profile.getJSONArray("connections").getJSONObject(0).getString("id"); + prefs.edit() + .putString(PREF_USER_ID, auth.userId) + .putString(PREF_DEVICE_ID, auth.deviceId) + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, connectionId) + .apply(); + new SecureTokenStore(this).put(PREF_REFRESH_TOKEN, auth.refreshToken); + return "refresh_profile ok " + connectionId; + } catch (Exception e) { + return "refresh_profile failed: " + e.getMessage(); + } + } + + private JSONObject statusPayload(String event) throws Exception { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("event", event); + payload.put("app_version", APP_VERSION); + payload.put("service", "diagnostic"); + payload.put("user_id", prefs.getString(PREF_USER_ID, "")); + payload.put("device_id", prefs.getString(PREF_DEVICE_ID, "")); + payload.put("organization_id", prefs.getString("organization_id", "")); + payload.put("vpn_connection_id", prefs.getString(PREF_VPN_CONNECTION_ID, "")); + payload.put("backend_url", prefs.getString("backend_url", "")); + payload.put("control_network_mode", controlNetworkMode); + payload.put("profile_loaded", !prefs.getString(PREF_PROFILE_JSON, "").isEmpty()); + payload.put("runtime", runtimeSnapshot()); + payload.put("vpn_config", vpnConfigSnapshot()); + payload.put("service_state", serviceState); + payload.put("last_result", lastCommandResult); + payload.put("last_command_type", lastCommandType); + payload.put("last_command_result", lastCommandResult); + payload.put("last_command_at", lastCommandAt); + payload.put("last_heartbeat_at", lastHeartbeatAt); + payload.put("last_command_poll_at", lastCommandPollAt); + return payload; + } + + private String normalizeBackendUrl(String value) { + String candidate = value == 
null ? "" : value.trim().replaceAll("/+$", ""); + if (candidate.isEmpty() || isLegacyControlPlaneUrl(candidate)) { + return DEFAULT_BACKEND_URL; + } + return candidate; + } + + private boolean isLegacyControlPlaneUrl(String value) { + String lower = value.toLowerCase(); + return lower.equals("http://94.141.118.222:19191/api/v1") + || lower.equals("http://vpn.cin.su:19191/api/v1") + || lower.equals("http://192.168.200.61:18080/api/v1") + || lower.equals("http://docker-test.cin.su:18080/api/v1") + || lower.equals("http://docker-test.cin.su/api/v1") + || lower.equals("http://192.168.200.61/api/v1"); + } + + private String collectVPNStats(RapApiClient client, String clusterId) { + String connectionId = getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_VPN_CONNECTION_ID, ""); + if (connectionId.isEmpty()) { + return "vpn_stats skipped: connection missing"; + } + try { + JSONObject stats = client.vpnPacketStats(clusterId, connectionId); + return "vpn_stats " + compact(stats.toString(), 900); + } catch (Exception e) { + return "vpn_stats failed: " + e.getMessage(); + } + } + + private String runFullVPNTest(RapApiClient client, String clusterId, JSONObject payload) { + String url = payload.optString("url", "http://2ip.ru/"); + int watchSeconds = payload.optInt("watch_seconds", 30); + if (watchSeconds < 5) { + watchSeconds = 5; + } + if (watchSeconds > 120) { + watchSeconds = 120; + } + String connectionId = getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_VPN_CONNECTION_ID, ""); + StringBuilder result = new StringBuilder(); + try { + result.append(refreshProfile()).append(" | "); + if (!connectionId.isEmpty()) { + result.append("reset=").append(compact(client.resetVPNPacketQueues(clusterId, connectionId).toString(), 240)).append(" | "); + } + result.append(startVPNFromSavedProfile()).append(" | "); + Thread.sleep(3000); + Intent open = new Intent(Intent.ACTION_VIEW, Uri.parse(url)); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(open); + 
result.append("open_url=").append(url); + long deadline = System.currentTimeMillis() + watchSeconds * 1000L; + while (running && System.currentTimeMillis() < deadline) { + Thread.sleep(5000); + JSONObject report = statusPayload("full_vpn_test_watch"); + report.put("test_url", url); + if (!connectionId.isEmpty()) { + report.put("packet_stats", client.vpnPacketStats(clusterId, connectionId)); + } + client.reportVPNDiagnosticStatus(clusterId, getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_DEVICE_ID, ""), report); + } + if (!connectionId.isEmpty()) { + result.append(" | stats=").append(compact(client.vpnPacketStats(clusterId, connectionId).toString(), 900)); + } + } catch (Exception e) { + result.append(" | full_vpn_test failed: ").append(e.getClass().getSimpleName()).append(": ").append(e.getMessage()); + } + return compact(result.toString(), 1200); + } + + private String compact(String value, int maxLength) { + if (value == null) { + return ""; + } + String compacted = value.replace('\n', ' ').replace('\r', ' '); + if (compacted.length() <= maxLength) { + return compacted; + } + return compacted.substring(0, Math.max(0, maxLength - 3)) + "..."; + } + + private JSONObject runtimeSnapshot() throws Exception { + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("state", runtime.getString("state", "")); + payload.put("message", runtime.getString("message", "")); + payload.put("updated_at", runtime.getLong("updated_at", 0)); + payload.put("runtime_started_at", runtime.getLong("runtime_started_at", 0)); + payload.put("uplink_read", runtime.getLong("uplink_read", 0)); + payload.put("uplink_sent", runtime.getLong("uplink_sent", 0)); + payload.put("downlink_received", runtime.getLong("downlink_received", 0)); + payload.put("uplink_read_total", runtime.getLong("uplink_read_total", 0)); + payload.put("uplink_read_bytes", runtime.getLong("uplink_read_bytes", 0)); + 
payload.put("uplink_sent_total", runtime.getLong("uplink_sent_total", 0)); + payload.put("uplink_sent_bytes", runtime.getLong("uplink_sent_bytes", 0)); + payload.put("downlink_received_total", runtime.getLong("downlink_received_total", 0)); + payload.put("downlink_received_bytes", runtime.getLong("downlink_received_bytes", 0)); + payload.put("uplink_read_mbps", runtime.getFloat("uplink_read_mbps", 0f)); + payload.put("uplink_sent_mbps", runtime.getFloat("uplink_sent_mbps", 0f)); + payload.put("downlink_received_mbps", runtime.getFloat("downlink_received_mbps", 0f)); + payload.put("uplink_read_pps", runtime.getFloat("uplink_read_pps", 0f)); + payload.put("uplink_sent_pps", runtime.getFloat("uplink_sent_pps", 0f)); + payload.put("downlink_received_pps", runtime.getFloat("downlink_received_pps", 0f)); + payload.put("uplink_dropped_packets", runtime.getLong("uplink_dropped_packets", 0)); + payload.put("uplink_dropped_bytes", runtime.getLong("uplink_dropped_bytes", 0)); + payload.put("downlink_dropped_packets", runtime.getLong("downlink_dropped_packets", 0)); + payload.put("downlink_dropped_bytes", runtime.getLong("downlink_dropped_bytes", 0)); + payload.put("errors", runtime.getLong("errors", 0)); + payload.put("uplink", runtimePrefix(runtime, "uplink")); + payload.put("uplink_sender", runtimePrefix(runtime, "uplink_sender")); + payload.put("downlink", runtimePrefix(runtime, "downlink")); + payload.put("relay", runtimePrefix(runtime, "relay")); + payload.put("uplink_worker_count", runtime.getInt("uplink_worker_count", 0)); + payload.put("uplink_queue_depth_total", runtime.getInt("uplink_queue_depth_total", 0)); + payload.put("uplink_queue_depth_max", runtime.getInt("uplink_queue_depth_max", 0)); + payload.put("uplink_queue_depths", runtime.getString("uplink_queue_depths", "")); + payload.put("uplink_queue_0_offers", runtime.getLong("uplink_queue_0_offers", 0)); + payload.put("uplink_queue_1_offers", runtime.getLong("uplink_queue_1_offers", 0)); + 
payload.put("uplink_queue_2_offers", runtime.getLong("uplink_queue_2_offers", 0)); + payload.put("uplink_queue_3_offers", runtime.getLong("uplink_queue_3_offers", 0)); + payload.put("uplink_queue_0_drops", runtime.getLong("uplink_queue_0_drops", 0)); + payload.put("uplink_queue_1_drops", runtime.getLong("uplink_queue_1_drops", 0)); + payload.put("uplink_queue_2_drops", runtime.getLong("uplink_queue_2_drops", 0)); + payload.put("uplink_queue_3_drops", runtime.getLong("uplink_queue_3_drops", 0)); + payload.put("uplink_sender_worker_packets_0", runtime.getLong("uplink_sender_worker_packets_0", 0)); + payload.put("uplink_sender_worker_packets_1", runtime.getLong("uplink_sender_worker_packets_1", 0)); + payload.put("uplink_sender_worker_packets_2", runtime.getLong("uplink_sender_worker_packets_2", 0)); + payload.put("uplink_sender_worker_packets_3", runtime.getLong("uplink_sender_worker_packets_3", 0)); + payload.put("uplink_sender_worker_errors_0", runtime.getLong("uplink_sender_worker_errors_0", 0)); + payload.put("uplink_sender_worker_errors_1", runtime.getLong("uplink_sender_worker_errors_1", 0)); + payload.put("uplink_sender_worker_errors_2", runtime.getLong("uplink_sender_worker_errors_2", 0)); + payload.put("uplink_sender_worker_errors_3", runtime.getLong("uplink_sender_worker_errors_3", 0)); + payload.put("uplink_queue_depth", runtime.getInt("uplink_queue_depth", 0)); + payload.put("downlink_restarts", runtime.getLong("downlink_restarts", 0)); + return payload; + } + + private JSONObject vpnConfigSnapshot() throws Exception { + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("vpn_address", runtime.getString("vpn_address", "")); + payload.put("dns_servers", runtime.getString("dns_servers", "")); + payload.put("routes", runtime.getString("routes", "")); + payload.put("full_tunnel", runtime.getBoolean("full_tunnel", false)); + return payload; + } + + private JSONObject 
runtimePrefix(SharedPreferences runtime, String prefix) throws Exception { + JSONObject payload = new JSONObject(); + payload.put("state", runtime.getString(prefix + "_state", "")); + payload.put("message", runtime.getString(prefix + "_message", "")); + payload.put("updated_at", runtime.getLong(prefix + "_updated_at", 0)); + payload.put("packets", runtime.getLong(prefix + "_packets", 0)); + payload.put("bytes", runtime.getLong(prefix + "_bytes", 0)); + payload.put("errors", runtime.getLong(prefix + "_errors", 0)); + payload.put("error_type", runtime.getString(prefix + "_error_type", "")); + payload.put("thread_alive", runtime.getBoolean(prefix + "_thread_alive", false)); + payload.put("rate_mbps", runtime.getFloat(prefix + "_rate_mbps", 0f)); + payload.put("rate_pps", runtime.getFloat(prefix + "_rate_pps", 0f)); + return payload; + } + + private String runHttpGet(String target) { + try { + HttpURLConnection connection = (HttpURLConnection) new URL(target).openConnection(); + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "http_get " + target + " -> HTTP " + code; + } catch (Exception e) { + return "http_get " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String runVPNHttpGet(String target) { + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_http_get " + target + " -> vpn network not found"; + } + URL url = new URL(target); + HttpURLConnection connection; + String resolved = ""; + if ("http".equalsIgnoreCase(url.getProtocol()) && !isIPv4Literal(url.getHost())) { + resolved = firstManualVPNAddress(vpn, url.getHost()); + } + if (!resolved.isEmpty()) { + URL resolvedURL = new URL(url.getProtocol(), resolved, url.getPort(), url.getFile()); + connection = (HttpURLConnection) vpn.openConnection(resolvedURL); + connection.setRequestProperty("Host", 
hostHeader(url)); + } else { + connection = (HttpURLConnection) vpn.openConnection(url); + } + try { + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "vpn_http_get " + target + " -> HTTP " + code; + } catch (UnknownHostException e) { + String fallbackResolved = firstManualVPNAddress(vpn, url.getHost()); + if (fallbackResolved.isEmpty() || !"http".equalsIgnoreCase(url.getProtocol())) { + throw e; + } + URL resolvedURL = new URL(url.getProtocol(), fallbackResolved, url.getPort(), url.getFile()); + connection = (HttpURLConnection) vpn.openConnection(resolvedURL); + connection.setRequestProperty("Host", hostHeader(url)); + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "vpn_http_get " + target + " -> HTTP " + code; + } + } catch (Exception e) { + return "vpn_http_get " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private boolean isIPv4Literal(String host) { + if (host == null) { + return false; + } + String[] parts = host.split("\\."); + if (parts.length != 4) { + return false; + } + try { + for (String part : parts) { + int value = Integer.parseInt(part); + if (value < 0 || value > 255) { + return false; + } + } + return true; + } catch (NumberFormatException e) { + return false; + } + } + + private String runVPNDNSLookup(String host) { + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_dns_lookup " + host + " -> vpn network not found"; + } + StringBuilder result = new StringBuilder(); + try { + InetAddress[] system = vpn.getAllByName(host); + result.append("system="); + appendAddresses(result, system); + } catch (Exception e) { + 
result.append("system=").append(e.getClass().getSimpleName()).append(":").append(e.getMessage()); + } + String manual = manualVPNDNSLookup(vpn, host); + result.append(" manual=").append(manual); + return "vpn_dns_lookup " + host + " -> " + result; + } catch (Exception e) { + return "vpn_dns_lookup " + host + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String firstManualVPNAddress(Network vpn, String host) { + String result = manualVPNDNSLookup(vpn, host); + if (result.startsWith("ok:")) { + String addresses = result.substring(3); + int comma = addresses.indexOf(','); + return comma >= 0 ? addresses.substring(0, comma) : addresses; + } + return ""; + } + + private String manualVPNDNSLookup(Network vpn, String host) { + String dnsServers = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE).getString("dns_servers", ""); + if (dnsServers.isEmpty()) { + return "skipped:no_dns_servers"; + } + String dnsServer = dnsServers.split(",", 2)[0].trim(); + if (dnsServer.isEmpty()) { + return "skipped:no_dns_servers"; + } + try (DatagramSocket socket = new DatagramSocket()) { + vpn.bindSocket(socket); + socket.setSoTimeout(5000); + byte[] query = buildDNSQuery(host); + DatagramPacket packet = new DatagramPacket(query, query.length, InetAddress.getByName(dnsServer), 53); + socket.send(packet); + byte[] response = new byte[512]; + DatagramPacket answer = new DatagramPacket(response, response.length); + socket.receive(answer); + List<String> addresses = parseDNSAResponse(response, answer.getLength()); + if (addresses.isEmpty()) { + return "empty:" + dnsServer; + } + return "ok:" + String.join(",", addresses); + } catch (SocketTimeoutException e) { + return "timeout:" + dnsServer; + } catch (Exception e) { + return e.getClass().getSimpleName() + ":" + e.getMessage(); + } + } + + private byte[] buildDNSQuery(String host) throws Exception { + byte[] out = new byte[512]; + int id = new Random().nextInt(0xffff); + out[0] = (byte) ((id >> 8) & 0xff); + out[1] 
= (byte) (id & 0xff); + out[2] = 0x01; + out[5] = 0x01; + int offset = 12; + for (String label : host.split("\\.")) { + byte[] bytes = label.getBytes("UTF-8"); + out[offset++] = (byte) bytes.length; + System.arraycopy(bytes, 0, out, offset, bytes.length); + offset += bytes.length; + } + out[offset++] = 0; + out[offset++] = 0; + out[offset++] = 1; + out[offset++] = 0; + out[offset++] = 1; + byte[] query = new byte[offset]; + System.arraycopy(out, 0, query, 0, offset); + return query; + } + + private List<String> parseDNSAResponse(byte[] packet, int length) { + List<String> addresses = new ArrayList<>(); + if (length < 12) { + return addresses; + } + int qd = u16(packet, 4); + int an = u16(packet, 6); + int offset = 12; + for (int i = 0; i < qd; i++) { + offset = skipDNSName(packet, length, offset); + offset += 4; + if (offset > length) { + return addresses; + } + } + for (int i = 0; i < an && offset < length; i++) { + offset = skipDNSName(packet, length, offset); + if (offset + 10 > length) { + return addresses; + } + int type = u16(packet, offset); + int cls = u16(packet, offset + 2); + int rdLen = u16(packet, offset + 8); + offset += 10; + if (type == 1 && cls == 1 && rdLen == 4 && offset + 4 <= length) { + addresses.add((packet[offset] & 0xff) + "." + (packet[offset + 1] & 0xff) + "." + (packet[offset + 2] & 0xff) + "." 
+ (packet[offset + 3] & 0xff)); + } + offset += rdLen; + } + return addresses; + } + + private int skipDNSName(byte[] packet, int length, int offset) { + while (offset < length) { + int value = packet[offset] & 0xff; + offset++; + if (value == 0) { + break; + } + if ((value & 0xc0) == 0xc0) { + offset++; + break; + } + offset += value; + } + return offset; + } + + private int u16(byte[] packet, int offset) { + if (packet == null || offset + 1 >= packet.length) { + return 0; + } + return ((packet[offset] & 0xff) << 8) | (packet[offset + 1] & 0xff); + } + + private void appendAddresses(StringBuilder result, InetAddress[] addresses) { + if (addresses == null || addresses.length == 0) { + result.append("empty"); + return; + } + for (int i = 0; i < addresses.length; i++) { + if (i > 0) { + result.append(","); + } + result.append(addresses[i].getHostAddress()); + } + } + + private String hostHeader(URL url) { + if (url.getPort() > 0) { + return url.getHost() + ":" + url.getPort(); + } + return url.getHost(); + } + + private Network vpnNetwork() { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities != null && capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + return network; + } + } + return null; + } + + private Notification notification() { + if (Build.VERSION.SDK_INT >= 26) { + NotificationChannel channel = new NotificationChannel(CHANNEL_ID, "RAP VPN diagnostics", NotificationManager.IMPORTANCE_LOW); + NotificationManager manager = getSystemService(NotificationManager.class); + if (manager != null) { + manager.createNotificationChannel(channel); + } + } + Notification.Builder builder = Build.VERSION.SDK_INT >= 26 ? 
new Notification.Builder(this, CHANNEL_ID) : new Notification.Builder(this); + return builder + .setContentTitle("RAP VPN diagnostics") + .setContentText("Diagnostic channel is active") + .setSmallIcon(android.R.drawable.stat_sys_upload_done) + .build(); + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapVpnService.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapVpnService.java new file mode 100644 index 0000000..74abf02 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RapVpnService.java @@ -0,0 +1,1384 @@ +package su.cin.rapvpn; + +import android.app.Notification; +import android.app.NotificationChannel; +import android.app.NotificationManager; +import android.content.SharedPreferences; +import android.content.Intent; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.VpnService; +import android.os.Build; +import android.os.ParcelFileDescriptor; +import android.system.ErrnoException; +import android.system.Os; +import android.system.OsConstants; +import android.util.Log; + +import org.json.JSONArray; +import org.json.JSONObject; + +import java.io.FileDescriptor; +import java.net.URI; +import java.util.ArrayList; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Set; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicLong; + +public class RapVpnService extends VpnService { + static final String EXTRA_PROFILE_JSON = "profile_json"; + static final String EXTRA_BACKEND_URL = "backend_url"; + static final String EXTRA_CLUSTER_ID = "cluster_id"; + static final String EXTRA_VPN_CONNECTION_ID = "vpn_connection_id"; + static final String ACTION_STOP = "su.cin.rapvpn.STOP"; + private static final String CHANNEL_ID = "rap-vpn"; + private static final String TAG = "RapVpnService"; + private 
static final String PREFS = "rap-vpn-runtime"; + private static final int DEFAULT_VPN_MTU = 1420; + private static final int VPN_BATCH_MAX_PACKETS = 2048; + private static final int VPN_BATCH_MAX_BYTES = 4 * 1024 * 1024; + private static final int UPLINK_WORKER_MAX_COUNT = 4; + private static final int UPLINK_QUEUE_CAPACITY = 16384; + private static final int DOWNLINK_POLL_MS_MIN = 2; + private static final int DOWNLINK_POLL_MS_MAX = 120; + private static final int DOWNLINK_POLL_MS_STEP = 10; + private static final int UPLINK_BATCH_GATHER_MS = 6; + private static final int TUN_WRITE_MAX_RETRIES = 128; + private static final int TUN_EAGAIN_SLEEP_MS = 1; + private static final int RUNTIME_DETAIL_INTERVAL_MS = 250; + private static final int RUNTIME_STATUS_INTERVAL_MS = 500; + private ParcelFileDescriptor tunnel; + private Thread uplinkThread; + private Thread[] uplinkSenderThreads; + private Thread downlinkThread; + private BlockingQueue<byte[]>[] uplinkQueues; + private volatile AtomicLong[] uplinkQueueOffersByWorker; + private volatile AtomicLong[] uplinkQueueDropsByWorker; + private volatile AtomicLong[] uplinkSenderPacketsByWorker; + private volatile AtomicLong[] uplinkSenderErrorsByWorker; + private volatile int uplinkWorkerCount; + private volatile boolean running; + private volatile String vpnAddressIPv4 = "10.77.0.2"; + private volatile byte[] vpnAddressIPv4Bytes = new byte[]{10, 77, 0, 2}; + private volatile byte[] backendBypassIPv4; + private volatile int backendBypassPort; + private volatile long downlinkRestarts; + private volatile long lastRuntimeDetailAt; + private volatile long lastRuntimeStatusAt; + private volatile long runtimeStartedAt; + private volatile long lastThroughputCalcAt; + private volatile long lastRateUplinkReadBytes; + private volatile long lastRateUplinkSentBytes; + private volatile long lastRateDownlinkReceivedBytes; + private volatile float uplinkReadMbps; + private volatile float uplinkSentMbps; + private volatile float 
downlinkReceivedMbps; + private volatile float uplinkReadPps; + private volatile float uplinkSentPps; + private volatile float downlinkReceivedPps; + private final AtomicLong uplinkReadPackets = new AtomicLong(); + private final AtomicLong uplinkReadBytes = new AtomicLong(); + private final AtomicLong uplinkSentPackets = new AtomicLong(); + private final AtomicLong uplinkSentBytes = new AtomicLong(); + private final AtomicLong downlinkReceivedPackets = new AtomicLong(); + private final AtomicLong downlinkReceivedBytes = new AtomicLong(); + private final AtomicLong uplinkDroppedPackets = new AtomicLong(); + private final AtomicLong uplinkDroppedBytes = new AtomicLong(); + private final AtomicLong downlinkDroppedPackets = new AtomicLong(); + private final AtomicLong downlinkDroppedBytes = new AtomicLong(); + + @Override + public int onStartCommand(Intent intent, int flags, int startId) { + if (intent != null && ACTION_STOP.equals(intent.getAction())) { + writeRuntimeStatus("stopping", "stop requested", 0, 0, 0, 0); + shutdown(); + stopSelf(); + return START_NOT_STICKY; + } + startForeground(1001, notification()); + String profile = intent != null ? intent.getStringExtra(EXTRA_PROFILE_JSON) : ""; + startSafeInterface(profile == null ? "" : profile); + String backendUrl = intent != null ? intent.getStringExtra(EXTRA_BACKEND_URL) : ""; + String clusterId = intent != null ? intent.getStringExtra(EXTRA_CLUSTER_ID) : ""; + String vpnConnectionId = intent != null ? 
intent.getStringExtra(EXTRA_VPN_CONNECTION_ID) : ""; + startPacketRelay(backendUrl, clusterId, vpnConnectionId); + return START_NOT_STICKY; + } + + @Override + public void onDestroy() { + shutdown(); + super.onDestroy(); + } + + private void shutdown() { + try { + running = false; + if (uplinkThread != null) { + uplinkThread.interrupt(); + } + if (uplinkSenderThreads != null) { + for (Thread sender : uplinkSenderThreads) { + if (sender != null) { + sender.interrupt(); + } + } + } + if (downlinkThread != null) { + downlinkThread.interrupt(); + } + if (tunnel != null) { + tunnel.close(); + tunnel = null; + } + } catch (Exception ignored) { + } + stopForeground(true); + } + + private void startSafeInterface(String profileJson) { + try { + stopPacketRelay(); + if (tunnel != null) { + tunnel.close(); + tunnel = null; + } + resetRuntimeMetrics(); + VpnClientConfig config = parseClientConfig(profileJson); + Builder builder = new Builder() + .setSession("RAP HOME VPN") + .setMtu(config.mtu) + .setBlocking(true); + vpnAddressIPv4 = cidrHost(config.vpnAddress); + vpnAddressIPv4Bytes = ipv4Bytes(vpnAddressIPv4); + if (vpnAddressIPv4Bytes == null || vpnAddressIPv4Bytes.length != 4) { + vpnAddressIPv4 = "10.77.0.2"; + vpnAddressIPv4Bytes = new byte[]{10, 77, 0, 2}; + } + addCIDRAddress(builder, config.vpnAddress); + for (String dnsServer : config.dnsServers) { + builder.addDnsServer(dnsServer); + addCIDRRoute(builder, dnsServer + "/32"); + } + for (String route : config.effectiveRoutes()) { + addCIDRRoute(builder, route); + } + writeRuntimeConfig(config); + setUnderlyingNetworks(builder); + tunnel = builder.establish(); + if (tunnel == null) { + Log.e(TAG, "vpn tunnel establish returned null"); + writeRuntimeStatus("error", "tunnel establish returned null", 0, 0, 0, 0); + } else { + writeRuntimeStatus("tunnel", "tunnel established " + config.vpnAddress, 0, 0, 0, 0); + } + } catch (Exception e) { + Log.e(TAG, "vpn tunnel establish failed", e); + writeRuntimeStatus("error", 
"tunnel failed: " + e.getMessage(), 0, 0, 0, 0); + } + } + + private VpnClientConfig parseClientConfig(String profileJson) { + VpnClientConfig config = new VpnClientConfig(); + config.vpnAddress = "10.77.0.2/32"; + try { + JSONObject root = new JSONObject(profileJson == null ? "" : profileJson); + JSONObject profile = root.optJSONObject("vpn_client_profile"); + JSONArray connections = profile != null ? profile.optJSONArray("connections") : null; + JSONObject connection = connections != null && connections.length() > 0 ? connections.optJSONObject(0) : null; + if (connection == null) { + return config; + } + JSONObject clientConfig = connection.optJSONObject("client_config"); + if (clientConfig != null) { + String vpnAddress = clientConfig.optString("vpn_address", ""); + if (!vpnAddress.isEmpty()) { + config.vpnAddress = vpnAddress; + } + config.mtu = parseMtu(clientConfig.optInt("mtu", config.mtu)); + if (clientConfig.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + readStringArray(clientConfig.optJSONArray("dns_servers"), config.dnsServers, true); + readStringArray(clientConfig.optJSONArray("routes"), config.splitRoutes, false); + } + JSONObject routePolicy = connection.optJSONObject("route_policy"); + if (routePolicy != null) { + if (routePolicy.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + readStringArray(routePolicy.optJSONArray("allowed_cidrs"), config.splitRoutes, false); + readStringArray(routePolicy.optJSONArray("dns_servers"), config.dnsServers, true); + } + JSONArray routePolicies = connection.optJSONArray("route_policies"); + if (routePolicies != null) { + for (int i = 0; i < routePolicies.length(); i++) { + JSONObject item = routePolicies.optJSONObject(i); + if (item != null && "allow".equals(item.optString("action"))) { + String destination = item.optString("destination", ""); + if (!destination.isEmpty()) { + config.splitRoutes.add(destination); + } + JSONObject policy = item.optJSONObject("policy"); + if 
(policy != null && policy.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + } + } + } + } catch (Exception ignored) { + } + return config; + } + + private int parseMtu(int mtu) { + if (mtu <= 0) { + return DEFAULT_VPN_MTU; + } + if (mtu < 576) { + return 576; + } + if (mtu > 1500) { + return 1500; + } + return mtu; + } + + private void readStringArray(JSONArray array, Set<String> target, boolean replace) { + if (array == null) { + return; + } + if (replace) { + target.clear(); + } + for (int i = 0; i < array.length(); i++) { + String value = array.optString(i, ""); + if (!value.isEmpty()) { + target.add(value); + } + } + } + + private void writeRuntimeConfig(VpnClientConfig config) { + try { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("vpn_address", config.vpnAddress) + .putInt("vpn_mtu", config.mtu) + .putString("dns_servers", join(config.dnsServers)) + .putString("routes", join(config.effectiveRoutes())) + .putBoolean("full_tunnel", config.fullTunnel) + .apply(); + } catch (Exception ignored) { + } + } + + private String join(Set<String> values) { + StringBuilder out = new StringBuilder(); + for (String value : values) { + if (out.length() > 0) { + out.append(","); + } + out.append(value); + } + return out.toString(); + } + + private void addCIDRAddress(Builder builder, String cidr) { + String[] parts = cidr.split("/", 2); + if (parts.length == 2) { + builder.addAddress(parts[0], Integer.parseInt(parts[1])); + } else { + builder.addAddress(cidr, 32); + } + } + + private void addCIDRRoute(Builder builder, String cidr) { + String[] parts = cidr.split("/", 2); + if (parts.length == 2) { + builder.addRoute(parts[0], Integer.parseInt(parts[1])); + } + } + + private String cidrHost(String cidr) { + if (cidr == null || cidr.isEmpty()) { + return "10.77.0.2"; + } + String[] parts = cidr.split("/", 2); + return parts.length > 0 && !parts[0].isEmpty() ? 
parts[0] : "10.77.0.2"; + } + + private void startPacketRelay(String backendUrl, String clusterId, String vpnConnectionId) { + if (tunnel == null || backendUrl == null || backendUrl.isEmpty() || clusterId == null || clusterId.isEmpty() || vpnConnectionId == null || vpnConnectionId.isEmpty()) { + Log.e(TAG, "packet relay not started: tunnel=" + (tunnel != null) + + " backend=" + present(backendUrl) + + " cluster=" + present(clusterId) + + " vpn_connection=" + present(vpnConnectionId)); + writeRuntimeStatus("error", "relay not started: tunnel=" + (tunnel != null) + + " backend=" + present(backendUrl) + + " cluster=" + present(clusterId) + + " connection=" + present(vpnConnectionId), 0, 0, 0, 0); + return; + } + stopPacketRelay(); + running = true; + runtimeStartedAt = System.currentTimeMillis(); + uplinkWorkerCount = Math.max(1, Math.min(UPLINK_WORKER_MAX_COUNT, Math.max(1, Runtime.getRuntime().availableProcessors() - 1))); + if (uplinkWorkerCount < 2) { + uplinkWorkerCount = 1; + } + uplinkQueueOffersByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkQueueDropsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkSenderPacketsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkSenderErrorsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkQueues = new ArrayBlockingQueue[uplinkWorkerCount]; + for (int i = 0; i < uplinkWorkerCount; i++) { + uplinkQueues[i] = new ArrayBlockingQueue<>(UPLINK_QUEUE_CAPACITY); + } + configureBackendBypass(backendUrl); + RapApiClient uplinkClient = new RapApiClient(backendUrl, this); + RapApiClient downlinkClient = new RapApiClient(backendUrl, this); + try { + JSONObject reset = uplinkClient.resetVPNPacketQueues(clusterId, vpnConnectionId); + Log.i(TAG, "packet relay queues reset: " + reset.toString()); + writeRuntimeStatus("relay_reset", reset.toString(), 0, 0, 0, 0); + } catch (Exception e) { + Log.w(TAG, "vpn relay queue reset failed; continuing", e); + writeRuntimeStatus("relay_reset_warning", "queue 
reset failed: " + e.getMessage(), 0, 0, 0, 1); + } + Log.i(TAG, "packet relay starting: backend=" + backendUrl + " cluster=" + clusterId + " vpn_connection=" + vpnConnectionId); + writeRuntimeStatus("relay", "relay starting " + vpnConnectionId, 0, 0, 0, 0); + writeRuntimeDetail("running", "packet relay active", "relay", 0, 0, ""); + uplinkThread = new Thread(() -> pumpTunToRelay(uplinkClient, clusterId, vpnConnectionId), "rap-vpn-uplink"); + uplinkSenderThreads = new Thread[uplinkWorkerCount]; + for (int i = 0; i < uplinkWorkerCount; i++) { + final int workerIndex = i; + uplinkSenderThreads[i] = new Thread(() -> pumpUplinkQueueToRelay(workerIndex, uplinkClient, clusterId, vpnConnectionId), "rap-vpn-uplink-sender-" + workerIndex); + } + downlinkThread = new Thread(() -> runDownlinkWithRestart(downlinkClient, clusterId, vpnConnectionId), "rap-vpn-downlink"); + uplinkThread.start(); + for (Thread senderThread : uplinkSenderThreads) { + senderThread.start(); + } + downlinkThread.start(); + } + + private void stopPacketRelay() { + running = false; + interruptAndJoin(uplinkThread); + if (uplinkSenderThreads != null) { + for (Thread senderThread : uplinkSenderThreads) { + interruptAndJoin(senderThread); + } + } + interruptAndJoin(downlinkThread); + uplinkThread = null; + uplinkSenderThreads = null; + downlinkThread = null; + uplinkWorkerCount = 0; + uplinkQueues = null; + uplinkQueueOffersByWorker = null; + uplinkQueueDropsByWorker = null; + uplinkSenderPacketsByWorker = null; + uplinkSenderErrorsByWorker = null; + } + + private void resetRuntimeMetrics() { + uplinkReadPackets.set(0); + uplinkReadBytes.set(0); + uplinkSentPackets.set(0); + uplinkSentBytes.set(0); + downlinkReceivedPackets.set(0); + downlinkReceivedBytes.set(0); + uplinkDroppedPackets.set(0); + uplinkDroppedBytes.set(0); + downlinkDroppedPackets.set(0); + downlinkDroppedBytes.set(0); + uplinkWorkerCount = 0; + runtimeStartedAt = System.currentTimeMillis(); + lastThroughputCalcAt = runtimeStartedAt; + 
lastRateUplinkReadBytes = 0; + lastRateUplinkSentBytes = 0; + lastRateDownlinkReceivedBytes = 0; + uplinkReadMbps = 0f; + uplinkSentMbps = 0f; + downlinkReceivedMbps = 0f; + uplinkReadPps = 0f; + uplinkSentPps = 0f; + downlinkReceivedPps = 0f; + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("state", "resetting") + .putString("message", "runtime counters reset") + .putLong("updated_at", runtimeStartedAt) + .putLong("uplink_read", 0) + .putLong("uplink_sent", 0) + .putLong("downlink_received", 0) + .putLong("errors", 0) + .putFloat("uplink_read_mbps", 0f) + .putFloat("uplink_sent_mbps", 0f) + .putFloat("downlink_received_mbps", 0f) + .putFloat("uplink_read_pps", 0f) + .putFloat("uplink_sent_pps", 0f) + .putFloat("downlink_received_pps", 0f) + .putLong("runtime_started_at", runtimeStartedAt) + .putLong("uplink_read_total", 0) + .putLong("uplink_read_bytes", 0) + .putLong("uplink_sent_total", 0) + .putLong("uplink_sent_bytes", 0) + .putLong("downlink_received_total", 0) + .putLong("downlink_received_bytes", 0) + .putLong("uplink_dropped_packets", 0) + .putLong("uplink_dropped_bytes", 0) + .putLong("downlink_dropped_packets", 0) + .putLong("downlink_dropped_bytes", 0) + .putInt("uplink_worker_count", 0) + .putString("uplink_queue_depths", "") + .putInt("uplink_queue_depth_max", 0) + .putInt("uplink_queue_depth_total", 0) + .apply(); + } + + private static AtomicLong[] createAtomicCounters(int count) { + AtomicLong[] values = new AtomicLong[count]; + for (int i = 0; i < count; i++) { + values[i] = new AtomicLong(); + } + return values; + } + + private void recordUplinkRead(int bytes) { + uplinkReadPackets.incrementAndGet(); + if (bytes > 0) { + uplinkReadBytes.addAndGet(bytes); + } + } + + private void recordUplinkDrop(int bytes) { + uplinkDroppedPackets.incrementAndGet(); + if (bytes > 0) { + uplinkDroppedBytes.addAndGet(bytes); + } + } + + private void recordUplinkSent(int packets, int bytes) { + if (packets > 0) { + 
uplinkSentPackets.addAndGet(packets); + } + if (bytes > 0) { + uplinkSentBytes.addAndGet(bytes); + } + } + + private void recordDownlinkReceived(int bytes) { + downlinkReceivedPackets.incrementAndGet(); + if (bytes > 0) { + downlinkReceivedBytes.addAndGet(bytes); + } + } + + private void recordDownlinkDrop(int bytes) { + downlinkDroppedPackets.incrementAndGet(); + if (bytes > 0) { + downlinkDroppedBytes.addAndGet(bytes); + } + } + + private void interruptAndJoin(Thread thread) { + if (thread == null) { + return; + } + thread.interrupt(); + if (thread == Thread.currentThread()) { + return; + } + try { + thread.join(750); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + + private void runDownlinkWithRestart(RapApiClient client, String clusterId, String vpnConnectionId) { + long restarts = 0; + while (running) { + downlinkRestarts = restarts; + writeRuntimeDetail("starting", "downlink loop starting", "downlink", 0, restarts, ""); + pumpRelayToTun(client, clusterId, vpnConnectionId); + if (!running) { + return; + } + restarts++; + downlinkRestarts = restarts; + writeRuntimeStatus("downlink_restart", "restarting downlink count=" + restarts, 0, 0, 0, restarts); + writeRuntimeDetail("restart", "downlink loop restarting", "downlink", 0, restarts, ""); + try { + Thread.sleep(100); + } catch (InterruptedException e) { + if (!running) { + return; + } + } + } + } + + private String present(String value) { + return value == null || value.isEmpty() ? 
"missing" : "present"; + } + + private void pumpTunToRelay(RapApiClient client, String clusterId, String vpnConnectionId) { + byte[] packet = new byte[32767]; + long readPackets = 0; + FileDescriptor fd = null; + try { + fd = Os.dup(tunnel.getFileDescriptor()); + while (running) { + int n; + try { + n = Os.read(fd, packet, 0, packet.length); + } catch (ErrnoException e) { + if (e.errno == OsConstants.EINTR) { + continue; + } + throw e; + } + if (n > 0) { + readPackets++; + recordUplinkRead(n); + if (readPackets == 1 || readPackets % 25 == 0) { + writeRuntimeStatus("uplink_read", packetSummary(packet, n), readPackets, 0, 0, 0); + writeRuntimeDetail("read", packetSummary(packet, n), "uplink", readPackets, 0, ""); + } + queueUplinkPacket(packet, n); + } + } + } catch (Exception e) { + if (running) { + Log.e(TAG, "vpn uplink stopped", e); + writeRuntimeStatus("error", "uplink stopped: " + e.getMessage(), readPackets, 0, 0, 0); + writeRuntimeDetail("stopped", "uplink stopped: " + e.getMessage(), "uplink", readPackets, 0, e.getClass().getSimpleName()); + } + } finally { + closeFdQuietly(fd); + } + } + + private void queueUplinkPacket(byte[] packet, int length) { + if (!shouldForwardUplinkPacket(packet, length)) { + recordUplinkDrop(length); + return; + } + byte[] copy = new byte[length]; + System.arraycopy(packet, 0, copy, 0, length); + if (!hasIPv4Source(copy, length)) { + Log.w(TAG, "vpn uplink source is not vpn address; dropping " + packetSummary(copy, length)); + writeRuntimeDetail("source_drop", packetSummary(copy, length), "uplink", -1, -1, "SOURCE_MISMATCH"); + recordUplinkDrop(length); + return; + } + int queueIndex = shardForUplinkPacket(copy, length); + queueIndex = normalizeQueueIndex(queueIndex); + BlockingQueue queue = queueForUplinkPacket(copy, length, queueIndex); + if (queue != null) { + AtomicLong[] offers = uplinkQueueOffersByWorker; + if (offers != null && queueIndex >= 0 && queueIndex < offers.length) { + offers[queueIndex].incrementAndGet(); + } + } 
+ if (queue != null && !queue.offer(copy)) { + Log.w(TAG, "vpn uplink queue full; dropping packet"); + recordUplinkDrop(length); + AtomicLong[] drops = uplinkQueueDropsByWorker; + if (drops != null && queueIndex >= 0 && queueIndex < drops.length) { + drops[queueIndex].incrementAndGet(); + } + writeRuntimeDetail("queue_full", packetSummary(copy, length), "uplink", -1, -1, "QUEUE_FULL"); + } + } + + private BlockingQueue<byte[]> queueForUplinkPacket(byte[] packet, int length, int queueIndex) { + BlockingQueue<byte[]>[] queues = uplinkQueues; + if (queues == null || queues.length == 0) { + return null; + } + queueIndex = normalizeQueueIndex(queueIndex); + return queues[queueIndex]; + } + + private int normalizeQueueIndex(int queueIndex) { + BlockingQueue<byte[]>[] queues = uplinkQueues; + if (queues == null || queues.length == 0) { + return 0; + } + if (queueIndex >= 0 && queueIndex < queues.length) { + return queueIndex; + } + // Mask the sign bit instead of Math.abs(): Math.abs(Integer.MIN_VALUE) is still negative. + return (queueIndex & 0x7fffffff) % queues.length; + } + + private int shardForUplinkPacket(byte[] packet, int length) { + // Snapshot the field once; stopPacketRelay() may null it from another thread. + BlockingQueue<byte[]>[] queues = uplinkQueues; + if (packet == null || length < 20 || queues == null || queues.length == 0) { + return 0; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return Math.abs(length) % queues.length; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl + 4) { + return Math.abs(length) % queues.length; + } + int proto = packet[9] & 0xff; + int srcPort = 0; + int dstPort = 0; + if (proto == 6 || proto == 17) { + srcPort = u16(packet, ihl); + dstPort = u16(packet, ihl + 2); + } + int srcIp = ((packet[12] & 0xff) << 24) | ((packet[13] & 0xff) << 16) | ((packet[14] & 0xff) << 8) | (packet[15] & 0xff); + int dstIp = ((packet[16] & 0xff) << 24) | ((packet[17] & 0xff) << 16) | ((packet[18] & 0xff) << 8) | (packet[19] & 0xff); + // Hash the flow tuple so packets of one TCP/UDP flow always land on the same worker queue. + int hash = srcIp ^ Integer.rotateLeft(dstIp, 8) ^ (proto << 24) ^ (srcPort << 8) ^ dstPort; + return (hash & 0x7fffffff) % queues.length; + } + + private boolean hasIPv4Source(byte[] packet, 
int length) { + byte[] address = vpnAddressIPv4Bytes; + if (address == null || length < 20) { + return false; + } + return (packet[12] & 0xff) == (address[0] & 0xff) + && (packet[13] & 0xff) == (address[1] & 0xff) + && (packet[14] & 0xff) == (address[2] & 0xff) + && (packet[15] & 0xff) == (address[3] & 0xff); + } + + private boolean hasIPv4Destination(byte[] packet, int length) { + byte[] address = vpnAddressIPv4Bytes; + if (address == null || length < 20) { + return false; + } + return (packet[16] & 0xff) == (address[0] & 0xff) + && (packet[17] & 0xff) == (address[1] & 0xff) + && (packet[18] & 0xff) == (address[2] & 0xff) + && (packet[19] & 0xff) == (address[3] & 0xff); + } + + private byte[] ipv4Bytes(String value) { + if (value == null) { + return null; + } + String[] parts = value.split("\\."); + if (parts.length != 4) { + return null; + } + byte[] out = new byte[4]; + try { + for (int i = 0; i < 4; i++) { + int parsed = Integer.parseInt(parts[i]); + if (parsed < 0 || parsed > 255) { + return null; + } + out[i] = (byte) parsed; + } + return out; + } catch (NumberFormatException e) { + return null; + } + } + + private void pumpUplinkQueueToRelay(int workerIndex, RapApiClient client, String clusterId, String vpnConnectionId) { + long sentPackets = 0; + long errors = 0; + List batch = new ArrayList<>(VPN_BATCH_MAX_PACKETS); + while (running) { + try { + BlockingQueue[] queues = uplinkQueues; + if (queues == null || workerIndex < 0 || workerIndex >= queues.length) { + Thread.sleep(25); + continue; + } + BlockingQueue queue = queues[workerIndex]; + if (queue == null) { + Thread.sleep(25); + continue; + } + batch.clear(); + byte[] first = queue.take(); + if (first == null) { + continue; + } + batch.add(first); + int batchBytes = first.length + 4; + long gatherUntil = System.currentTimeMillis() + UPLINK_BATCH_GATHER_MS; + while (batch.size() < VPN_BATCH_MAX_PACKETS) { + long waitMs = gatherUntil - System.currentTimeMillis(); + if (waitMs <= 0) { + break; + } + byte[] 
next = queue.poll(waitMs, TimeUnit.MILLISECONDS); + if (next == null) { + break; + } + if (next.length <= 0) { + continue; + } + int projectedBytes = batchBytes + 4 + next.length; + if (projectedBytes > VPN_BATCH_MAX_BYTES) { + if (!queue.offer(next)) { + Log.w(TAG, "vpn uplink queue reinsert failed; dropping packet"); + } + break; + } + batch.add(next); + batchBytes = projectedBytes; + } + + client.sendClientPacketBatch(clusterId, vpnConnectionId, batch); + sentPackets += batch.size(); + recordUplinkSent(batch.size(), Math.max(0, batchBytes - 4)); + AtomicLong[] senderPackets = uplinkSenderPacketsByWorker; + if (senderPackets != null && workerIndex >= 0 && workerIndex < senderPackets.length) { + senderPackets[workerIndex].addAndGet(batch.size()); + } + writeRuntimeStatus("uplink_sent", "sent batch=" + batch.size(), 0, sentPackets, 0, errors); + writeRuntimeDetail("sent", "worker=" + workerIndex + " sent batch=" + batch.size(), "uplink_sender", sentPackets, errors, "", workerIndex); + } catch (InterruptedException e) { + if (!running) { + return; + } + writeRuntimeDetail("read_wait", "uplink queue wait interrupted", "uplink_sender", sentPackets, errors, e.getClass().getSimpleName(), workerIndex); + } catch (Exception e) { + if (running) { + Log.w(TAG, "vpn uplink batch send failed; continuing", e); + errors++; + AtomicLong[] senderErrors = uplinkSenderErrorsByWorker; + if (senderErrors != null && workerIndex >= 0 && workerIndex < senderErrors.length) { + senderErrors[workerIndex].incrementAndGet(); + } + writeRuntimeStatus("error", "uplink send failed: " + e.getMessage(), 0, sentPackets, 0, errors); + writeRuntimeDetail("error", "uplink send failed: " + e.getMessage(), "uplink_sender", sentPackets, errors, e.getClass().getSimpleName(), workerIndex); + try { + Thread.sleep(100); + } catch (InterruptedException interrupted) { + if (!running) { + return; + } + } + } + } + } + } + + private boolean shouldForwardUplinkPacket(byte[] packet, int length) { + if (packet == 
null || length < 20) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + if (isBroadcastOrMulticastIPv4(packet)) { + return false; + } + return !isBackendControlPlanePacket(packet, length, ihl); + } + + private void configureBackendBypass(String backendUrl) { + backendBypassIPv4 = null; + backendBypassPort = 0; + try { + URI uri = URI.create(backendUrl == null ? "" : backendUrl); + byte[] host = ipv4Bytes(uri.getHost()); + if (host == null) { + return; + } + int port = uri.getPort(); + if (port <= 0) { + port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80; + } + backendBypassIPv4 = host; + backendBypassPort = port; + } catch (Exception ignored) { + } + } + + private boolean isBroadcastOrMulticastIPv4(byte[] packet) { + int first = packet[16] & 0xff; + return first >= 224 || first == 255; + } + + private boolean isBackendControlPlanePacket(byte[] packet, int length, int ihl) { + byte[] host = backendBypassIPv4; + int port = backendBypassPort; + if (host == null || port <= 0 || length < ihl + 4) { + return false; + } + if ((packet[16] & 0xff) != (host[0] & 0xff) + || (packet[17] & 0xff) != (host[1] & 0xff) + || (packet[18] & 0xff) != (host[2] & 0xff) + || (packet[19] & 0xff) != (host[3] & 0xff)) { + return false; + } + int proto = packet[9] & 0xff; + if (proto != 6 && proto != 17) { + return false; + } + int dstPort = u16(packet, ihl + 2); + return dstPort == port; + } + + private void setUnderlyingNetworks(Builder builder) { + if (Build.VERSION.SDK_INT < 22) { + return; + } + try { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return; + } + List networks = new ArrayList<>(); + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities 
== null) { + continue; + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + continue; + } + if (!capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)) { + continue; + } + networks.add(network); + } + if (!networks.isEmpty()) { + builder.setUnderlyingNetworks(networks.toArray(new Network[0])); + } + } catch (Exception e) { + Log.w(TAG, "vpn underlying networks not set", e); + } + } + + private void pumpRelayToTun(RapApiClient client, String clusterId, String vpnConnectionId) { + long receivedPackets = 0; + long errors = 0; + int downlinkPollMs = DOWNLINK_POLL_MS_MIN; + FileDescriptor fd = null; + try { + fd = Os.dup(tunnel.getFileDescriptor()); + while (running) { + try { + List packets = client.receiveClientPacketBatch(clusterId, vpnConnectionId, downlinkPollMs); + for (byte[] packet : packets) { + if (!isIPv4Packet(packet)) { + recordDownlinkDrop(packet == null ? 0 : packet.length); + continue; + } + int length = effectiveIPv4Length(packet, packet.length); + if (length <= 0) { + errors++; + recordDownlinkDrop(packet.length); + writeRuntimeDetail("length_drop", packetSummary(packet, packet.length), "downlink", receivedPackets, errors, "LENGTH"); + continue; + } + if (!hasIPv4Destination(packet, length)) { + continue; + } + if (!normalizeIPv4PacketChecksums(packet, length)) { + errors++; + recordDownlinkDrop(length); + writeRuntimeDetail("normalize_drop", packetSummary(packet, length), "downlink", receivedPackets, errors, "CHECKSUM_NORMALIZE"); + continue; + } + if (writePacketToTun(fd, packet, length)) { + receivedPackets++; + recordDownlinkReceived(length); + } else { + errors++; + recordDownlinkDrop(length); + writeRuntimeDetail("write_drop", packetSummary(packet, length), "downlink", receivedPackets, errors, "EAGAIN"); + } + } + if (!packets.isEmpty()) { + downlinkPollMs = Math.max(DOWNLINK_POLL_MS_MIN, downlinkPollMs - DOWNLINK_POLL_MS_STEP); + writeRuntimeStatus("downlink", "received batch=" + packets.size(), 0, 0, 
receivedPackets, errors); + } else if (receivedPackets > 0) { + downlinkPollMs = Math.min(DOWNLINK_POLL_MS_MAX, downlinkPollMs + DOWNLINK_POLL_MS_STEP); + writeRuntimeStatus("downlink_idle", "waiting for gateway packets", 0, 0, receivedPackets, errors); + } + } catch (Exception e) { + if (running) { + Log.w(TAG, "vpn downlink receive failed; continuing", e); + errors++; + writeRuntimeStatus("error", "downlink failed: " + e.getMessage(), 0, 0, receivedPackets, errors); + writeRuntimeDetail("error", "downlink failed: " + e.getMessage(), "downlink", receivedPackets, errors, e.getClass().getSimpleName()); + try { + Thread.sleep(100); + } catch (InterruptedException interrupted) { + if (!running) { + return; + } + } + } + } + } + } catch (Exception e) { + if (running) { + Log.e(TAG, "vpn downlink stopped", e); + writeRuntimeStatus("error", "downlink stopped: " + e.getMessage(), 0, 0, receivedPackets, errors); + writeRuntimeDetail("stopped", "downlink stopped: " + e.getMessage(), "downlink", receivedPackets, errors, e.getClass().getSimpleName()); + } + } finally { + closeFdQuietly(fd); + } + } + + private boolean writePacketToTun(FileDescriptor fd, byte[] packet, int packetLength) throws Exception { + int offset = 0; + int attempts = 0; + if (packetLength < 20 || packetLength > packet.length) { + return false; + } + while (running && offset < packetLength) { + try { + int written = Os.write(fd, packet, offset, packetLength - offset); + if (written > 0) { + offset += written; + attempts = 0; + continue; + } + } catch (ErrnoException e) { + if (e.errno != OsConstants.EAGAIN) { + throw e; + } + } + attempts++; + if (attempts > TUN_WRITE_MAX_RETRIES) { + return false; + } + sleepQuietly(TUN_EAGAIN_SLEEP_MS); + } + return offset == packetLength; + } + + private int effectiveIPv4Length(byte[] packet, int maxLength) { + if (packet == null || maxLength < 20) { + return -1; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return -1; + } + int ihl = (packet[0] 
& 0x0f) * 4; + if (ihl < 20 || maxLength < ihl) { + return -1; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0) { + return maxLength; + } + if (totalLength < ihl || totalLength > maxLength) { + return -1; + } + return totalLength; + } + + private void sleepQuietly(long millis) { + try { + Thread.sleep(millis); + } catch (InterruptedException e) { + if (!running) { + Thread.currentThread().interrupt(); + } + } + } + + private void closeFdQuietly(FileDescriptor fd) { + if (fd == null) { + return; + } + try { + Os.close(fd); + } catch (Exception ignored) { + } + } + + private void refreshRuntimeRates(long now) { + if (now <= 0) { + now = System.currentTimeMillis(); + } + if (lastThroughputCalcAt <= 0) { + lastThroughputCalcAt = now; + lastRateUplinkReadBytes = uplinkReadBytes.get(); + lastRateUplinkSentBytes = uplinkSentBytes.get(); + lastRateDownlinkReceivedBytes = downlinkReceivedBytes.get(); + return; + } + long elapsed = now - lastThroughputCalcAt; + if (elapsed < 250) { + return; + } + long readDelta = Math.max(0, uplinkReadBytes.get() - lastRateUplinkReadBytes); + long sentDelta = Math.max(0, uplinkSentBytes.get() - lastRateUplinkSentBytes); + long downDelta = Math.max(0, downlinkReceivedBytes.get() - lastRateDownlinkReceivedBytes); + float seconds = Math.max(0.001f, elapsed / 1000f); + uplinkReadMbps = (float) (readDelta * 8.0d / 1_000_000d / seconds); + uplinkSentMbps = (float) (sentDelta * 8.0d / 1_000_000d / seconds); + downlinkReceivedMbps = (float) (downDelta * 8.0d / 1_000_000d / seconds); + uplinkReadPps = (float) (readDelta / seconds); + uplinkSentPps = (float) (sentDelta / seconds); + downlinkReceivedPps = (float) (downDelta / seconds); + lastThroughputCalcAt = now; + lastRateUplinkReadBytes = uplinkReadBytes.get(); + lastRateUplinkSentBytes = uplinkSentBytes.get(); + lastRateDownlinkReceivedBytes = downlinkReceivedBytes.get(); + } + + private void writeRuntimeStatus(String state, String message, long readPackets, long sentPackets, long 
receivedPackets, long errors) { + long now = System.currentTimeMillis(); + boolean important = "error".equals(state) + || "stopped".equals(state) + || "relay".equals(state) + || "relay_reset_warning".equals(state) + || "tunnel".equals(state) + || "relay_reset".equals(state) + || "downlink_restart".equals(state); + if (!important && now - lastRuntimeStatusAt < RUNTIME_STATUS_INTERVAL_MS) { + return; + } + if (!important) { + lastRuntimeStatusAt = now; + } + refreshRuntimeRates(now); + try { + SharedPreferences.Editor editor = getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("state", state) + .putString("message", message == null ? "" : message) + .putLong("updated_at", now) + .putLong("runtime_started_at", runtimeStartedAt) + .putLong("uplink_read_total", uplinkReadPackets.get()) + .putLong("uplink_read_bytes", uplinkReadBytes.get()) + .putLong("uplink_sent_total", uplinkSentPackets.get()) + .putLong("uplink_sent_bytes", uplinkSentBytes.get()) + .putLong("downlink_received_total", downlinkReceivedPackets.get()) + .putLong("downlink_received_bytes", downlinkReceivedBytes.get()) + .putLong("uplink_dropped_packets", uplinkDroppedPackets.get()) + .putLong("uplink_dropped_bytes", uplinkDroppedBytes.get()) + .putLong("downlink_dropped_packets", downlinkDroppedPackets.get()) + .putLong("downlink_dropped_bytes", downlinkDroppedBytes.get()) + .putFloat("uplink_read_mbps", uplinkReadMbps) + .putFloat("uplink_sent_mbps", uplinkSentMbps) + .putFloat("downlink_received_mbps", downlinkReceivedMbps) + .putFloat("uplink_read_pps", uplinkReadPps) + .putFloat("uplink_sent_pps", uplinkSentPps) + .putFloat("downlink_received_pps", downlinkReceivedPps); + if (readPackets > 0) { + editor.putLong("uplink_read", readPackets); + } + if (sentPackets > 0) { + editor.putLong("uplink_sent", sentPackets); + } + if (receivedPackets > 0) { + editor.putLong("downlink_received", receivedPackets); + } + if (errors > 0) { + editor.putLong("errors", errors); + } + editor.apply(); + } catch 
(Exception ignored) { + } + } + + private void writeRuntimeDetail(String state, String message, String prefix, long packets, long errors, String errorType) { + writeRuntimeDetail(state, message, prefix, packets, errors, errorType, -1); + } + + private void writeRuntimeDetail(String state, String message, String prefix, long packets, long errors, String errorType, int workerIndex) { + long now = System.currentTimeMillis(); + boolean important = "error".equals(state) + || "stopped".equals(state) + || "write_drop".equals(state) + || "source_drop".equals(state) + || "normalize_drop".equals(state) + || "length_drop".equals(state) + || ("downlink".equals(prefix) && !"batch".equals(state) && !"restart".equals(state) && !"running".equals(state)); + if (!important && now - lastRuntimeDetailAt < RUNTIME_DETAIL_INTERVAL_MS) { + return; + } + lastRuntimeDetailAt = now; + try { + SharedPreferences.Editor editor = getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString(prefix + "_state", state == null ? "" : state) + .putString(prefix + "_message", message == null ? 
"" : message) + .putLong(prefix + "_updated_at", System.currentTimeMillis()) + .putBoolean(prefix + "_thread_alive", Thread.currentThread().isAlive()); + if (packets >= 0) { + editor.putLong(prefix + "_packets", packets); + } + if (errors >= 0) { + editor.putLong(prefix + "_errors", errors); + } + if ("uplink".equals(prefix)) { + editor.putLong(prefix + "_bytes", uplinkReadBytes.get()); + editor.putFloat(prefix + "_rate_mbps", uplinkReadMbps); + editor.putFloat(prefix + "_rate_pps", uplinkReadPps); + } else if ("uplink_sender".equals(prefix)) { + editor.putLong(prefix + "_bytes", uplinkSentBytes.get()); + editor.putFloat(prefix + "_rate_mbps", uplinkSentMbps); + editor.putFloat(prefix + "_rate_pps", uplinkSentPps); + if (workerIndex >= 0) { + AtomicLong[] senderPackets = uplinkSenderPacketsByWorker; + AtomicLong[] senderErrors = uplinkSenderErrorsByWorker; + if (senderPackets != null && workerIndex < senderPackets.length) { + editor.putLong(prefix + "_worker_packets_" + workerIndex, senderPackets[workerIndex].get()); + } + if (senderErrors != null && workerIndex < senderErrors.length) { + editor.putLong(prefix + "_worker_errors_" + workerIndex, senderErrors[workerIndex].get()); + } + } + } else if ("downlink".equals(prefix)) { + editor.putLong(prefix + "_bytes", downlinkReceivedBytes.get()); + editor.putFloat(prefix + "_rate_mbps", downlinkReceivedMbps); + editor.putFloat(prefix + "_rate_pps", downlinkReceivedPps); + } else { + editor.putLong(prefix + "_bytes", 0); + editor.putFloat(prefix + "_rate_mbps", 0f); + editor.putFloat(prefix + "_rate_pps", 0f); + } + editor.putString(prefix + "_error_type", errorType == null ? "" : errorType); + if ("downlink".equals(prefix)) { + editor.putLong("downlink_restarts", downlinkRestarts); + } + AtomicLong[] queueOffers = uplinkQueueOffersByWorker; + AtomicLong[] queueDrops = uplinkQueueDropsByWorker; + BlockingQueue[] queues = uplinkQueues; + int workerCount = queues != null ? 
queues.length : 0; + int depth = 0; + if (queues != null) { + for (BlockingQueue queue : queues) { + if (queue != null) { + depth += queue.size(); + } + } + } + editor.putInt("uplink_worker_count", workerCount); + if (queues != null && queues.length > 0) { + int[] queueDepths = new int[queues.length]; + int maxDepth = 0; + for (int i = 0; i < queues.length; i++) { + BlockingQueue queue = queues[i]; + int queueDepth = queue == null ? 0 : queue.size(); + queueDepths[i] = queueDepth; + maxDepth = Math.max(maxDepth, queueDepth); + if (queueOffers != null && i < queueOffers.length) { + editor.putLong("uplink_queue_" + i + "_offers", queueOffers[i].get()); + } + if (queueDrops != null && i < queueDrops.length) { + editor.putLong("uplink_queue_" + i + "_drops", queueDrops[i].get()); + } + } + editor.putString("uplink_queue_depths", encodeIntArray(queueDepths)); + editor.putInt("uplink_queue_depth_max", maxDepth); + editor.putInt("uplink_queue_depth_total", depth); + editor.putInt("uplink_queue_depth", depth); + } else { + editor.putInt("uplink_queue_depth", 0); + } + editor.apply(); + } catch (Exception ignored) { + } + } + + private boolean isIPv4Packet(byte[] packet) { + return packet != null && packet.length >= 20 && ((packet[0] >> 4) & 0x0f) == 4; + } + + private String encodeIntArray(int[] values) { + if (values == null || values.length == 0) { + return ""; + } + StringBuilder out = new StringBuilder(); + for (int i = 0; i < values.length; i++) { + if (i > 0) { + out.append(","); + } + out.append(values[i]); + } + return out.toString(); + } + + private boolean normalizeIPv4PacketChecksums(byte[] packet, int length) { + if (packet == null || length < 20) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0 || totalLength > length) { + totalLength = length; + } + if 
(totalLength < ihl) { + return false; + } + packet[10] = 0; + packet[11] = 0; + putU16(packet, 10, checksum(packet, 0, ihl)); + + int proto = packet[9] & 0xff; + int fragFlags = u16(packet, 6); + boolean transportHeaderPresent = (fragFlags & 0x1fff) == 0; + int payloadOffset = ihl; + int payloadLength = totalLength - ihl; + if (transportHeaderPresent && proto == 6 && payloadLength >= 20) { + packet[payloadOffset + 16] = 0; + packet[payloadOffset + 17] = 0; + putU16(packet, payloadOffset + 16, transportChecksum(packet, payloadOffset, payloadLength, proto)); + } else if (transportHeaderPresent && proto == 17 && payloadLength >= 8) { + packet[payloadOffset + 6] = 0; + packet[payloadOffset + 7] = 0; + int sum = transportChecksum(packet, payloadOffset, payloadLength, proto); + putU16(packet, payloadOffset + 6, sum == 0 ? 0xffff : sum); + } else if (proto == 1 && payloadLength >= 4) { + packet[payloadOffset + 2] = 0; + packet[payloadOffset + 3] = 0; + putU16(packet, payloadOffset + 2, checksum(packet, payloadOffset, payloadLength)); + } + return true; + } + + private int transportChecksum(byte[] packet, int payloadOffset, int payloadLength, int proto) { + long sum = 0; + sum += u16(packet, 12); + sum += u16(packet, 14); + sum += u16(packet, 16); + sum += u16(packet, 18); + sum += proto & 0xff; + sum += payloadLength & 0xffff; + sum += checksumSum(packet, payloadOffset, payloadLength); + return finishChecksum(sum); + } + + private int checksum(byte[] packet, int offset, int length) { + return finishChecksum(checksumSum(packet, offset, length)); + } + + private long checksumSum(byte[] packet, int offset, int length) { + long sum = 0; + int end = offset + length; + int i = offset; + while (i + 1 < end) { + sum += u16(packet, i); + i += 2; + } + if (i < end) { + sum += (packet[i] & 0xff) << 8; + } + return sum; + } + + private int finishChecksum(long sum) { + while ((sum >> 16) != 0) { + sum = (sum & 0xffff) + (sum >> 16); + } + return (int) (~sum) & 0xffff; + } + + private 
Notification notification() { + if (Build.VERSION.SDK_INT >= 26) { + NotificationChannel channel = new NotificationChannel(CHANNEL_ID, "RAP VPN", NotificationManager.IMPORTANCE_LOW); + getSystemService(NotificationManager.class).createNotificationChannel(channel); + } + Notification.Builder builder = Build.VERSION.SDK_INT >= 26 ? new Notification.Builder(this, CHANNEL_ID) : new Notification.Builder(this); + return builder + .setContentTitle("RAP VPN") + .setContentText("VPN tunnel is active") + .setSmallIcon(android.R.drawable.stat_sys_upload_done) + .build(); + } + + private String packetSummary(byte[] packet, int length) { + if (packet == null || length < 20) { + return "size=" + length; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return "size=" + length + " ip_version=" + version; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return "size=" + length + " ipv4=truncated"; + } + int proto = packet[9] & 0xff; + String base = "size=" + length + + " " + ipv4(packet, 12) + + " -> " + ipv4(packet, 16) + + " proto=" + proto; + if ((proto == 6 || proto == 17) && length >= ihl + 4) { + int srcPort = u16(packet, ihl); + int dstPort = u16(packet, ihl + 2); + base += " " + srcPort + "->" + dstPort; + if (proto == 6 && length >= ihl + 14) { + base += " flags=" + tcpFlags(packet[ihl + 13] & 0xff); + } + } else if (proto == 1 && length >= ihl + 2) { + base += " icmp_type=" + (packet[ihl] & 0xff) + " icmp_code=" + (packet[ihl + 1] & 0xff); + } + return base; + } + + private String ipv4(byte[] packet, int offset) { + return (packet[offset] & 0xff) + "." + + (packet[offset + 1] & 0xff) + "." + + (packet[offset + 2] & 0xff) + "." 
+ + (packet[offset + 3] & 0xff); + } + + private int u16(byte[] packet, int offset) { + return ((packet[offset] & 0xff) << 8) | (packet[offset + 1] & 0xff); + } + + private void putU16(byte[] packet, int offset, int value) { + packet[offset] = (byte) ((value >> 8) & 0xff); + packet[offset + 1] = (byte) (value & 0xff); + } + + private String tcpFlags(int flags) { + StringBuilder out = new StringBuilder(); + if ((flags & 0x02) != 0) out.append("S"); + if ((flags & 0x10) != 0) out.append("A"); + if ((flags & 0x01) != 0) out.append("F"); + if ((flags & 0x04) != 0) out.append("R"); + if ((flags & 0x08) != 0) out.append("P"); + return out.length() == 0 ? String.valueOf(flags) : out.toString(); + } + + private static class VpnClientConfig { + String vpnAddress; + boolean fullTunnel = true; + int mtu = DEFAULT_VPN_MTU; + final Set dnsServers = new LinkedHashSet<>(); + final Set splitRoutes = new LinkedHashSet<>(); + + Set effectiveRoutes() { + LinkedHashSet routes = new LinkedHashSet<>(); + if (fullTunnel) { + routes.add("0.0.0.0/0"); + return routes; + } + routes.addAll(splitRoutes); + return routes; + } + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RdpActivity.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RdpActivity.java new file mode 100644 index 0000000..fd6b3d3 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/RdpActivity.java @@ -0,0 +1,209 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.graphics.Bitmap; +import android.graphics.BitmapFactory; +import android.os.Bundle; +import android.util.Base64; +import android.view.MotionEvent; +import android.view.View; +import android.widget.FrameLayout; +import android.widget.ImageView; +import android.widget.TextView; + +import org.json.JSONObject; + +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.charset.StandardCharsets; +import java.util.UUID; + +import okhttp3.OkHttpClient; +import okhttp3.Request; +import 
okhttp3.Response; +import okhttp3.WebSocket; +import okhttp3.WebSocketListener; + +public class RdpActivity extends Activity { + static final String EXTRA_SESSION_RESULT = "session_result"; + static final String EXTRA_GATEWAY_URL = "gateway_url"; + static final String EXTRA_RESOURCE_NAME = "resource_name"; + + private final OkHttpClient http = new OkHttpClient(); + private ImageView desktop; + private TextView overlay; + private WebSocket webSocket; + private int desktopWidth = 1; + private int desktopHeight = 1; + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + getWindow().getDecorView().setSystemUiVisibility( + View.SYSTEM_UI_FLAG_FULLSCREEN + | View.SYSTEM_UI_FLAG_HIDE_NAVIGATION + | View.SYSTEM_UI_FLAG_IMMERSIVE_STICKY + | View.SYSTEM_UI_FLAG_LAYOUT_FULLSCREEN + | View.SYSTEM_UI_FLAG_LAYOUT_HIDE_NAVIGATION + | View.SYSTEM_UI_FLAG_LAYOUT_STABLE); + + FrameLayout root = new FrameLayout(this); + root.setBackgroundColor(0xff05090c); + desktop = new ImageView(this); + desktop.setScaleType(ImageView.ScaleType.FIT_CENTER); + desktop.setBackgroundColor(0xff05090c); + desktop.setOnTouchListener((view, event) -> { + sendTouch(event); + return true; + }); + overlay = new TextView(this); + overlay.setTextColor(0xffffffff); + overlay.setTextSize(14); + overlay.setBackgroundColor(0x66000000); + overlay.setPadding(14, 10, 14, 10); + overlay.setText("Подключение..."); + root.addView(desktop, new FrameLayout.LayoutParams(-1, -1)); + root.addView(overlay, new FrameLayout.LayoutParams(-2, -2)); + setContentView(root); + connect(); + } + + @Override + protected void onDestroy() { + if (webSocket != null) { + webSocket.close(1000, "activity closed"); + } + super.onDestroy(); + } + + private void connect() { + try { + JSONObject result = new JSONObject(getIntent().getStringExtra(EXTRA_SESSION_RESULT)); + JSONObject token = result.getJSONObject("attach_token"); + String attachToken = token.getString("token"); + String 
gatewayUrl = getIntent().getStringExtra(EXTRA_GATEWAY_URL); + String url = gatewayUrl + "?attach_token=" + attachToken; + runOnUiThread(() -> overlay.setText(getIntent().getStringExtra(EXTRA_RESOURCE_NAME))); + Request request = new Request.Builder().url(url).build(); + webSocket = http.newWebSocket(request, new WebSocketListener() { + @Override + public void onOpen(WebSocket webSocket, Response response) { + runOnUiThread(() -> overlay.setText("Подключено")); + } + + @Override + public void onMessage(WebSocket webSocket, String text) { + handleEnvelope(text); + } + + @Override + public void onFailure(WebSocket webSocket, Throwable t, Response response) { + runOnUiThread(() -> overlay.setText("Ошибка: " + t.getMessage())); + } + + @Override + public void onClosed(WebSocket webSocket, int code, String reason) { + runOnUiThread(() -> overlay.setText("Отключено")); + } + }); + } catch (Exception ex) { + overlay.setText("Ошибка запуска: " + ex.getMessage()); + } + } + + private void handleEnvelope(String text) { + try { + JSONObject envelope = new JSONObject(text); + String type = envelope.optString("type"); + if ("session.state".equals(type)) { + JSONObject payload = envelope.optJSONObject("payload"); + String state = payload == null ? 
"" : payload.optString("state", ""); + if (!state.isEmpty() && !"active".equals(state)) { + runOnUiThread(() -> overlay.setText("Сессия: " + state)); + } + return; + } + if (!"session.frame".equals(type)) { + return; + } + JSONObject payload = envelope.optJSONObject("payload"); + if (payload == null) { + return; + } + String frameData = payload.optString("frame_data", ""); + int width = payload.optInt("frame_width", payload.optInt("desktop_width", 0)); + int height = payload.optInt("frame_height", payload.optInt("desktop_height", 0)); + byte[] bytes = Base64.decode(frameData, Base64.DEFAULT); + Bitmap bitmap = decodeFrame(bytes, width, height, payload.optString("frame_format", "")); + if (bitmap != null) { + desktopWidth = Math.max(1, width); + desktopHeight = Math.max(1, height); + runOnUiThread(() -> { + desktop.setImageBitmap(bitmap); + overlay.setText(""); + }); + } + } catch (Exception ex) { + runOnUiThread(() -> overlay.setText("Кадр: " + ex.getMessage())); + } + } + + private Bitmap decodeFrame(byte[] bytes, int width, int height, String format) { + Bitmap compressed = BitmapFactory.decodeByteArray(bytes, 0, bytes.length); + if (compressed != null) { + return compressed; + } + if (width <= 0 || height <= 0 || bytes.length < width * height * 4) { + return null; + } + int[] colors = new int[width * height]; + ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN); + for (int i = 0; i < colors.length; i++) { + int b = buffer.get() & 0xff; + int g = buffer.get() & 0xff; + int r = buffer.get() & 0xff; + int a = buffer.get() & 0xff; + colors[i] = (a << 24) | (r << 16) | (g << 8) | b; + } + return Bitmap.createBitmap(colors, width, height, Bitmap.Config.ARGB_8888); + } + + private void sendTouch(MotionEvent event) { + if (webSocket == null || desktop.getWidth() <= 0 || desktop.getHeight() <= 0) { + return; + } + String action; + switch (event.getActionMasked()) { + case MotionEvent.ACTION_DOWN: + action = "down"; + break; + case 
MotionEvent.ACTION_UP: + action = "up"; + break; + case MotionEvent.ACTION_MOVE: + action = "move"; + break; + default: + return; + } + double x = Math.max(0, Math.min(1, event.getX() / Math.max(1f, desktop.getWidth()))); + double y = Math.max(0, Math.min(1, event.getY() / Math.max(1f, desktop.getHeight()))); + try { + JSONObject payload = new JSONObject(); + payload.put("correlation_id", UUID.randomUUID().toString()); + payload.put("client_captured_at", java.time.Instant.now().toString()); + payload.put("kind", "mouse"); + payload.put("action", action); + payload.put("button", "left"); + payload.put("normalized_x", x); + payload.put("normalized_y", y); + payload.put("surface_width", desktopWidth); + payload.put("surface_height", desktopHeight); + JSONObject envelope = new JSONObject(); + envelope.put("type", "input"); + envelope.put("payload", payload); + webSocket.send(envelope.toString()); + } catch (Exception ignored) { + } + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java new file mode 100644 index 0000000..a8a88f0 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java @@ -0,0 +1,90 @@ +package su.cin.rapvpn; + +import android.content.Context; +import android.content.SharedPreferences; +import android.security.keystore.KeyGenParameterSpec; +import android.security.keystore.KeyProperties; +import android.util.Base64; + +import java.nio.charset.StandardCharsets; +import java.security.KeyStore; +import java.util.Arrays; + +import javax.crypto.Cipher; +import javax.crypto.KeyGenerator; +import javax.crypto.SecretKey; +import javax.crypto.spec.GCMParameterSpec; + +final class SecureTokenStore { + private static final String PREFS = "rap-vpn-secure"; + private static final String KEY_ALIAS = "rap-vpn-refresh-token"; + private static final 
String ANDROID_KEYSTORE = "AndroidKeyStore"; + private static final int IV_LENGTH = 12; + private static final int TAG_LENGTH_BITS = 128; + + private final SharedPreferences prefs; + + SecureTokenStore(Context context) { + prefs = context.getSharedPreferences(PREFS, Context.MODE_PRIVATE); + } + + void put(String name, String value) throws Exception { + if (value == null || value.isEmpty()) { + remove(name); + return; + } + Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding"); + cipher.init(Cipher.ENCRYPT_MODE, key()); + byte[] ciphertext = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8)); + byte[] iv = cipher.getIV(); + if (iv == null || iv.length == 0) { + throw new IllegalStateException("Android Keystore did not provide encryption IV"); + } + byte[] payload = new byte[iv.length + ciphertext.length]; + System.arraycopy(iv, 0, payload, 0, iv.length); + System.arraycopy(ciphertext, 0, payload, iv.length, ciphertext.length); + prefs.edit().putString(name, Base64.encodeToString(payload, Base64.NO_WRAP)).apply(); + } + + String get(String name) { + String encoded = prefs.getString(name, ""); + if (encoded.isEmpty()) { + return ""; + } + try { + byte[] payload = Base64.decode(encoded, Base64.NO_WRAP); + if (payload.length <= IV_LENGTH) { + return ""; + } + byte[] iv = Arrays.copyOfRange(payload, 0, IV_LENGTH); + byte[] ciphertext = Arrays.copyOfRange(payload, IV_LENGTH, payload.length); + Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding"); + cipher.init(Cipher.DECRYPT_MODE, key(), new GCMParameterSpec(TAG_LENGTH_BITS, iv)); + return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8); + } catch (Exception ignored) { + return ""; + } + } + + void remove(String name) { + prefs.edit().remove(name).apply(); + } + + private SecretKey key() throws Exception { + KeyStore keyStore = KeyStore.getInstance(ANDROID_KEYSTORE); + keyStore.load(null); + KeyStore.Entry entry = keyStore.getEntry(KEY_ALIAS, null); + if (entry instanceof KeyStore.SecretKeyEntry) { 
+ return ((KeyStore.SecretKeyEntry) entry).getSecretKey(); + } + KeyGenerator generator = KeyGenerator.getInstance(KeyProperties.KEY_ALGORITHM_AES, ANDROID_KEYSTORE); + generator.init(new KeyGenParameterSpec.Builder( + KEY_ALIAS, + KeyProperties.PURPOSE_ENCRYPT | KeyProperties.PURPOSE_DECRYPT) + .setBlockModes(KeyProperties.BLOCK_MODE_GCM) + .setEncryptionPaddings(KeyProperties.ENCRYPTION_PADDING_NONE) + .setRandomizedEncryptionRequired(true) + .build()); + return generator.generateKey(); + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java new file mode 100644 index 0000000..4c7715c --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java @@ -0,0 +1,40 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.os.Bundle; +import android.widget.TextView; + +import java.net.HttpURLConnection; +import java.net.URL; + +public class TestTrafficActivity extends Activity { + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + TextView text = new TextView(this); + text.setText("traffic test starting"); + setContentView(text); + String url = getIntent().getStringExtra("url"); + if (url == null || url.isEmpty()) { + url = "http://192.168.200.61:18080/"; + } + String target = url; + new Thread(() -> runRequest(text, target), "rap-test-traffic").start(); + } + + private void runRequest(TextView text, String target) { + String result; + try { + HttpURLConnection connection = (HttpURLConnection) new URL(target).openConnection(); + connection.setConnectTimeout(30000); + connection.setReadTimeout(30000); + connection.setInstanceFollowRedirects(false); + result = "HTTP " + connection.getResponseCode(); + connection.disconnect(); + } catch (Exception e) { + result = e.getClass().getSimpleName() + ": " + e.getMessage(); + } + String finalResult = result; + 
runOnUiThread(() -> text.setText(finalResult)); + } +} diff --git a/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java new file mode 100644 index 0000000..a97dbf1 --- /dev/null +++ b/_tmp_android_build/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java @@ -0,0 +1,70 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.content.Intent; +import android.net.VpnService; +import android.os.Bundle; +import android.util.Base64; +import android.widget.TextView; +import java.nio.charset.StandardCharsets; + +public class TestVpnActivity extends Activity { + public static final String EXTRA_PROFILE_JSON = "profile_json"; + public static final String EXTRA_PROFILE_BASE64 = "profile_base64"; + public static final String EXTRA_BACKEND_URL = "backend_url"; + public static final String EXTRA_CLUSTER_ID = "cluster_id"; + public static final String EXTRA_VPN_CONNECTION_ID = "vpn_connection_id"; + private static final int VPN_PREPARE_REQUEST = 77; + + private Intent serviceIntent; + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + TextView text = new TextView(this); + text.setText("RAP VPN test launcher"); + setContentView(text); + serviceIntent = buildServiceIntent(getIntent()); + Intent prepare = VpnService.prepare(this); + if (prepare != null) { + startActivityForResult(prepare, VPN_PREPARE_REQUEST); + return; + } + startVpn(); + } + + @Override + protected void onActivityResult(int requestCode, int resultCode, Intent data) { + super.onActivityResult(requestCode, resultCode, data); + if (requestCode == VPN_PREPARE_REQUEST && resultCode == RESULT_OK) { + startVpn(); + } + } + + private Intent buildServiceIntent(Intent source) { + Intent intent = new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson(source)); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, 
source.getStringExtra(EXTRA_BACKEND_URL)); + intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, source.getStringExtra(EXTRA_CLUSTER_ID)); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, source.getStringExtra(EXTRA_VPN_CONNECTION_ID)); + return intent; + } + + private String profileJson(Intent source) { + String direct = source.getStringExtra(EXTRA_PROFILE_JSON); + if (direct != null && !direct.isEmpty()) { + return direct; + } + String encoded = source.getStringExtra(EXTRA_PROFILE_BASE64); + if (encoded == null || encoded.isEmpty()) { + return ""; + } + byte[] raw = Base64.decode(encoded, Base64.DEFAULT); + return new String(raw, StandardCharsets.UTF_8); + } + + private void startVpn() { + startForegroundService(serviceIntent); + finish(); + } +} diff --git a/_tmp_android_build/app/src/main/res/values/styles.xml b/_tmp_android_build/app/src/main/res/values/styles.xml new file mode 100644 index 0000000..59653b8 --- /dev/null +++ b/_tmp_android_build/app/src/main/res/values/styles.xml @@ -0,0 +1,7 @@ + + + diff --git a/_tmp_android_build/build.gradle b/_tmp_android_build/build.gradle new file mode 100644 index 0000000..48d2fa1 --- /dev/null +++ b/_tmp_android_build/build.gradle @@ -0,0 +1,3 @@ +plugins { + id "com.android.application" version "8.7.3" apply false +} diff --git a/_tmp_android_build/local.properties b/_tmp_android_build/local.properties new file mode 100644 index 0000000..4fc69df --- /dev/null +++ b/_tmp_android_build/local.properties @@ -0,0 +1 @@ +sdk.dir=C:\Android\sdk diff --git a/_tmp_android_build/settings.gradle b/_tmp_android_build/settings.gradle new file mode 100644 index 0000000..902a398 --- /dev/null +++ b/_tmp_android_build/settings.gradle @@ -0,0 +1,18 @@ +pluginManagement { + repositories { + google() + mavenCentral() + gradlePluginPortal() + } +} + +dependencyResolutionManagement { + repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS) + repositories { + google() + mavenCentral() + } +} + +rootProject.name = "RapAndroidVpn" 
+include ":app" diff --git a/_tmp_release_node_agent_0.2.170/rap-host-agent-0.2.170-linux-amd64 b/_tmp_release_node_agent_0.2.170/rap-host-agent-0.2.170-linux-amd64 new file mode 100644 index 0000000..5524a2f Binary files /dev/null and b/_tmp_release_node_agent_0.2.170/rap-host-agent-0.2.170-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.170/rap-node-agent-0.2.170-linux-amd64 b/_tmp_release_node_agent_0.2.170/rap-node-agent-0.2.170-linux-amd64 new file mode 100644 index 0000000..b6579fe Binary files /dev/null and b/_tmp_release_node_agent_0.2.170/rap-node-agent-0.2.170-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.171/rap-host-agent-0.2.171-linux-amd64 b/_tmp_release_node_agent_0.2.171/rap-host-agent-0.2.171-linux-amd64 new file mode 100644 index 0000000..6a4db03 Binary files /dev/null and b/_tmp_release_node_agent_0.2.171/rap-host-agent-0.2.171-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.171/rap-node-agent-0.2.171-linux-amd64 b/_tmp_release_node_agent_0.2.171/rap-node-agent-0.2.171-linux-amd64 new file mode 100644 index 0000000..fbccc7d Binary files /dev/null and b/_tmp_release_node_agent_0.2.171/rap-node-agent-0.2.171-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.175/rap-host-agent-0.2.175-linux-amd64 b/_tmp_release_node_agent_0.2.175/rap-host-agent-0.2.175-linux-amd64 new file mode 100644 index 0000000..5d8d23a Binary files /dev/null and b/_tmp_release_node_agent_0.2.175/rap-host-agent-0.2.175-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.175/rap-node-agent-0.2.175-linux-amd64 b/_tmp_release_node_agent_0.2.175/rap-node-agent-0.2.175-linux-amd64 new file mode 100644 index 0000000..c0114e3 Binary files /dev/null and b/_tmp_release_node_agent_0.2.175/rap-node-agent-0.2.175-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.176/api/rap-host-agent-release.json b/_tmp_release_node_agent_0.2.176/api/rap-host-agent-release.json new file mode 100644 index 0000000..693b40e --- /dev/null +++ 
b/_tmp_release_node_agent_0.2.176/api/rap-host-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-host-agent","version":"0.2.176","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0"},"changelog":"C18U host-agent companion release for node-agent 0.2.176 route-manager runtime rollout.","artifacts":[{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.176-linux-amd64","sha256":"88b34dcd5f9ae83519d478b66d2695db6f46e5b76c9a14142f95b56f3babe2fe","size_bytes":9625505,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.176-windows-amd64.exe","sha256":"b6333e57efedd45af23c94863f432477eb54f0e77fe1c05a18492c2caa1d7344","size_bytes":9651712,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.176/api/rap-node-agent-release.json b/_tmp_release_node_agent_0.2.176/api/rap-node-agent-release.json new file mode 100644 index 0000000..f085ee9 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/api/rap-node-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","version":"0.2.176","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0","signed_synthetic_config":"c18u_route_manager_rebuild_fields"},"changelog":"C18U node-agent service-channel route-manager consumes backend rebuild decisions and withdraws fenced routes at 
runtime.","artifacts":[{"os":"linux","arch":"amd64","install_type":"docker","kind":"docker_image_tar","url":"/downloads/rap-node-agent-0.2.176-docker-amd64.tar","sha256":"cdb69ea16de30f79be345e397f24a3dbbafb7fe5fd74bb203ba310c55c698037","size_bytes":41406976,"metadata":{"image":"rap-node-agent:0.2.176"}},{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-node-agent-0.2.176-linux-amd64","sha256":"09c76f40fc94d405c5f99c196e3a88a0f426b581617bbde316ce9cf0d2cccf0c","size_bytes":11345378,"metadata":{}},{"os":"linux","arch":"amd64","install_type":"linux_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.176-linux-amd64","sha256":"09c76f40fc94d405c5f99c196e3a88a0f426b581617bbde316ce9cf0d2cccf0c","size_bytes":11345378,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.176-windows-amd64.exe","sha256":"1199da0d86435331de9143f52495149e301da32edbc7ad2db6f9f771a0e609f4","size_bytes":12167168,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.176/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json b/_tmp_release_node_agent_0.2.176/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.176/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json b/_tmp_release_node_agent_0.2.176/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json @@ -0,0 +1 @@ 
+{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.176/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json b/_tmp_release_node_agent_0.2.176/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.176/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json b/_tmp_release_node_agent_0.2.176/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.176/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json b/_tmp_release_node_agent_0.2.176/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git 
a/_tmp_release_node_agent_0.2.176/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json b/_tmp_release_node_agent_0.2.176/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json new file mode 100644 index 0000000..d2035b2 --- /dev/null +++ b/_tmp_release_node_agent_0.2.176/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.176","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.176/rap-host-agent-0.2.176-linux-amd64 b/_tmp_release_node_agent_0.2.176/rap-host-agent-0.2.176-linux-amd64 new file mode 100644 index 0000000..34dcfd5 Binary files /dev/null and b/_tmp_release_node_agent_0.2.176/rap-host-agent-0.2.176-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.176/rap-node-agent-0.2.176-linux-amd64 b/_tmp_release_node_agent_0.2.176/rap-node-agent-0.2.176-linux-amd64 new file mode 100644 index 0000000..54441e3 Binary files /dev/null and b/_tmp_release_node_agent_0.2.176/rap-node-agent-0.2.176-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.177/api/rap-host-agent-release.json b/_tmp_release_node_agent_0.2.177/api/rap-host-agent-release.json new file mode 100644 index 0000000..0cab3b1 --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/api/rap-host-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-host-agent","version":"0.2.177","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0"},"changelog":"C18V host-agent companion release for node-agent 0.2.177 route-manager transition telemetry 
rollout.","artifacts":[{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.177-linux-amd64","sha256":"babb74419a9c414caa9ba8612a9a8a745c1b2dc40bd4d83456cc84bdaf6c1fab","size_bytes":9625505,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.177-windows-amd64.exe","sha256":"eba255f8685c3141f8cb80be345d007aad4873d7445ab474424b673e715f0c6b","size_bytes":9651712,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.177/api/rap-node-agent-release.json b/_tmp_release_node_agent_0.2.177/api/rap-node-agent-release.json new file mode 100644 index 0000000..f79d421 --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/api/rap-node-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","version":"0.2.177","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0","service_channel_route_manager":"c18v_transition_telemetry"},"changelog":"C18V node-agent service-channel route-manager transition telemetry and lifecycle coverage for rebuild apply, pending degraded fallback, and restore by fresh 
config.","artifacts":[{"os":"linux","arch":"amd64","install_type":"docker","kind":"docker_image_tar","url":"/downloads/rap-node-agent-0.2.177-docker-amd64.tar","sha256":"17f6448c3ed8939643fddf5180375a43a66d28604d85573806522fa1180180bc","size_bytes":41411072,"metadata":{"image":"rap-node-agent:0.2.177"}},{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-node-agent-0.2.177-linux-amd64","sha256":"a7d077818c49a942d091d65ec6887ca435077b2bfcbfa95fa696b5fca301e143","size_bytes":11350475,"metadata":{}},{"os":"linux","arch":"amd64","install_type":"linux_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.177-linux-amd64","sha256":"a7d077818c49a942d091d65ec6887ca435077b2bfcbfa95fa696b5fca301e143","size_bytes":11350475,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.177-windows-amd64.exe","sha256":"e4a25be5b413742bdb0dd6c544f500300b6ebeb6873eaa979f6d780cab861f1b","size_bytes":12173824,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.177/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json b/_tmp_release_node_agent_0.2.177/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.177/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json b/_tmp_release_node_agent_0.2.177/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json @@ -0,0 +1 @@ 
+{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.177/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json b/_tmp_release_node_agent_0.2.177/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.177/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json b/_tmp_release_node_agent_0.2.177/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.177/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json b/_tmp_release_node_agent_0.2.177/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git 
a/_tmp_release_node_agent_0.2.177/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json b/_tmp_release_node_agent_0.2.177/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json new file mode 100644 index 0000000..7642fea --- /dev/null +++ b/_tmp_release_node_agent_0.2.177/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.177","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.177/rap-host-agent-0.2.177-linux-amd64 b/_tmp_release_node_agent_0.2.177/rap-host-agent-0.2.177-linux-amd64 new file mode 100644 index 0000000..73c00d9 Binary files /dev/null and b/_tmp_release_node_agent_0.2.177/rap-host-agent-0.2.177-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.177/rap-node-agent-0.2.177-linux-amd64 b/_tmp_release_node_agent_0.2.177/rap-node-agent-0.2.177-linux-amd64 new file mode 100644 index 0000000..a610c7a Binary files /dev/null and b/_tmp_release_node_agent_0.2.177/rap-node-agent-0.2.177-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.180/api/rap-host-agent-release.json b/_tmp_release_node_agent_0.2.180/api/rap-host-agent-release.json new file mode 100644 index 0000000..f601578 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/api/rap-host-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-host-agent","version":"0.2.180","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0"},"changelog":"C18X host-agent companion release for node-agent 0.2.180 service-channel scheduler 
rollout.","artifacts":[{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.180-linux-amd64","sha256":"7dbaabebfa26c97cef443eb1e79729c758453e05ecb0218470d6e4cbcade7a38","size_bytes":9625505,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_binary","kind":"binary","url":"/downloads/rap-host-agent-0.2.180-windows-amd64.exe","sha256":"4e5391b3f3770d6dd00a8c66977a933fb1a610de750c3194fb9a6e37d92e8d74","size_bytes":9651712,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.180/api/rap-node-agent-release.json b/_tmp_release_node_agent_0.2.180/api/rap-node-agent-release.json new file mode 100644 index 0000000..9ebfdaf --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/api/rap-node-agent-release.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","version":"0.2.180","channel":"dev","status":"active","compatibility":{"min_version":"0.0.0","service_channel_scheduler":"c18x_per_logical_channel_failover"},"changelog":"C18X service-channel scheduler fix: per-logical-channel failed route avoidance no longer falls back to global last route; adds bounded backpressure 
coverage.","artifacts":[{"os":"linux","arch":"amd64","install_type":"docker","kind":"docker_image_tar","url":"/downloads/rap-node-agent-0.2.180-docker-amd64.tar","sha256":"a393ad343a58bf606dab9246e2f2adefa1be5ae49c15305d5af033c937f4cac1","size_bytes":41411072,"metadata":{"image":"rap-node-agent:0.2.180"}},{"os":"linux","arch":"amd64","install_type":"linux_binary","kind":"binary","url":"/downloads/rap-node-agent-0.2.180-linux-amd64","sha256":"ebddd7f0e8dec761f1a8c397cfb56552fd995e6c182b1d6c88df6f7806f03600","size_bytes":11350467,"metadata":{}},{"os":"linux","arch":"amd64","install_type":"linux_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.180-linux-amd64","sha256":"ebddd7f0e8dec761f1a8c397cfb56552fd995e6c182b1d6c88df6f7806f03600","size_bytes":11350467,"metadata":{}},{"os":"windows","arch":"amd64","install_type":"windows_service","kind":"binary","url":"/downloads/rap-node-agent-0.2.180-windows-amd64.exe","sha256":"8218497fb1b150f74478d2041973de93f303ca72a99702a0f8f347125877a000","size_bytes":12173824,"metadata":{}}]} diff --git a/_tmp_release_node_agent_0.2.180/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json b/_tmp_release_node_agent_0.2.180/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-108a0d66-d65e-4dea-b9a8-135366bf7dba.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.180/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json b/_tmp_release_node_agent_0.2.180/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-1af4c7b6-b04a-41c3-91e3-fbf2fa29fc72.json @@ -0,0 +1 @@ 
+{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.180/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json b/_tmp_release_node_agent_0.2.180/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-830a26de-e7e8-462f-8f37-5189027955d5.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.180/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json b/_tmp_release_node_agent_0.2.180/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-8ad04829-cd30-4290-913d-1ce5c7ef7bb3.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.180/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json b/_tmp_release_node_agent_0.2.180/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-b829ffde-690b-47ab-9522-0f22ab42596d.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git 
a/_tmp_release_node_agent_0.2.180/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json b/_tmp_release_node_agent_0.2.180/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json new file mode 100644 index 0000000..77699c3 --- /dev/null +++ b/_tmp_release_node_agent_0.2.180/policies/policy-f3c95cb7-a189-4dbb-b5d7-5ff93ba9c040.json @@ -0,0 +1 @@ +{"actor_user_id":"f67d943f-5397-4b3a-a229-695fe67ad700","product":"rap-node-agent","channel":"dev","target_version":"0.2.180","strategy":"rolling","enabled":true,"rollback_allowed":true,"health_window_seconds":90} diff --git a/_tmp_release_node_agent_0.2.180/rap-host-agent-0.2.180-linux-amd64 b/_tmp_release_node_agent_0.2.180/rap-host-agent-0.2.180-linux-amd64 new file mode 100644 index 0000000..d84a2c8 Binary files /dev/null and b/_tmp_release_node_agent_0.2.180/rap-host-agent-0.2.180-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.180/rap-node-agent-0.2.180-linux-amd64 b/_tmp_release_node_agent_0.2.180/rap-node-agent-0.2.180-linux-amd64 new file mode 100644 index 0000000..53c377f Binary files /dev/null and b/_tmp_release_node_agent_0.2.180/rap-node-agent-0.2.180-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.181/rap-host-agent-0.2.181-linux-amd64 b/_tmp_release_node_agent_0.2.181/rap-host-agent-0.2.181-linux-amd64 new file mode 100644 index 0000000..89c7147 Binary files /dev/null and b/_tmp_release_node_agent_0.2.181/rap-host-agent-0.2.181-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.181/rap-node-agent-0.2.181-linux-amd64 b/_tmp_release_node_agent_0.2.181/rap-node-agent-0.2.181-linux-amd64 new file mode 100644 index 0000000..1d8cc38 Binary files /dev/null and b/_tmp_release_node_agent_0.2.181/rap-node-agent-0.2.181-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.183/rap-host-agent-0.2.183-linux-amd64 b/_tmp_release_node_agent_0.2.183/rap-host-agent-0.2.183-linux-amd64 new file mode 100644 index 0000000..e25eff1 Binary files /dev/null and 
b/_tmp_release_node_agent_0.2.183/rap-host-agent-0.2.183-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.183/rap-node-agent-0.2.183-linux-amd64 b/_tmp_release_node_agent_0.2.183/rap-node-agent-0.2.183-linux-amd64 new file mode 100644 index 0000000..a43d93b Binary files /dev/null and b/_tmp_release_node_agent_0.2.183/rap-node-agent-0.2.183-linux-amd64 differ diff --git a/_tmp_release_node_agent_0.2.97/rap-node-agent-0.2.97-linux-amd64 b/_tmp_release_node_agent_0.2.97/rap-node-agent-0.2.97-linux-amd64 new file mode 100644 index 0000000..a407d35 Binary files /dev/null and b/_tmp_release_node_agent_0.2.97/rap-node-agent-0.2.97-linux-amd64 differ diff --git a/agents/rap-node-agent/Dockerfile b/agents/rap-node-agent/Dockerfile index cc31f74..1353fa8 100644 --- a/agents/rap-node-agent/Dockerfile +++ b/agents/rap-node-agent/Dockerfile @@ -1,4 +1,4 @@ -FROM golang:1.23-bookworm AS build +FROM golang:1.25-bookworm AS build WORKDIR /src COPY agents/rap-node-agent/go.mod ./ @@ -6,8 +6,10 @@ RUN go mod download COPY agents/rap-node-agent/ ./ RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o /out/rap-node-agent ./cmd/rap-node-agent -FROM gcr.io/distroless/static-debian12:nonroot +FROM debian:bookworm-slim +RUN apt-get update \ + && apt-get install -y --no-install-recommends ca-certificates iproute2 iptables procps \ + && rm -rf /var/lib/apt/lists/* COPY --from=build /out/rap-node-agent /usr/local/bin/rap-node-agent -USER nonroot:nonroot ENTRYPOINT ["/usr/local/bin/rap-node-agent"] diff --git a/agents/rap-node-agent/README.md b/agents/rap-node-agent/README.md index b355b5e..b2a439c 100644 --- a/agents/rap-node-agent/README.md +++ b/agents/rap-node-agent/README.md @@ -66,6 +66,11 @@ Implemented: - synthetic route-health route config refresh from Control Plane path decisions - route-health expected/observed effective path drift reporting +- host-agent Docker update plan executor with artifact checksum/size + verification, container replacement, health check, status 
reporting, and + rollback attempt +- host-agent update loop for service/timer placement +- host-agent binary self-update loop for the updater service itself - maximum capacity guard for the local production observation sink - panic-safe fail-closed production envelope observation wrapper - explicit `4096` byte payload boundary for validated production @@ -98,7 +103,7 @@ Not implemented yet: - VPN runtime - production workload supervision - certificate issuance/rotation -- updater runtime +- in-agent native updater runtime - privileged host route/firewall control ## Build @@ -107,9 +112,237 @@ Not implemented yet: cd agents\rap-node-agent go test ./... go build -o bin\rap-node-agent.exe .\cmd\rap-node-agent +go build -buildvcs=false -o bin\rap-host-agent.exe .\cmd\rap-host-agent go build -o bin\mesh-live-smoke.exe .\cmd\mesh-live-smoke ``` +## Docker Host Agent Bootstrap + +`rap-host-agent` is the first host-level installer/updater boundary for Docker +placement. It does not join the mesh itself. It applies the cluster's install +intent locally by running the `rap-node-agent` container with a persistent host +state directory. On Linux it also installs a systemd `update-loop` service by +default, so nodes continue to update from Control Plane policy without operator +commands on each host. + +Preferred profile-based install: + +```bash +rap-host-agent install \ + --profile-url https://control.example.com/api/v1 \ + --cluster-id \ + --install-token \ + --node-name docker-node-1 +``` + +The host-agent exchanges the install token for a signed control-plane install +profile, then applies Docker image, container, state-dir, mesh listen, +advertise, NAT/connectivity, and region settings from that profile. The same +token is then used by the node-agent for first enrollment, so the operator does +not need to manually pass cluster/runtime flags. 
+ +Manual install is still supported: + +```bash +rap-host-agent install \ + --backend-url http://192.168.200.61:18080/api/v1 \ + --cluster-id \ + --join-token \ + --node-name docker-node-1 \ + --image rap-node-agent:dev-enrollment-bootstrap-smoke \ + --container-name rap-node-agent-docker-node-1 \ + --state-dir /var/lib/rap/nodes/docker-node-1 \ + --network host \ + --replace +``` + +The command creates or replaces only the local Docker container. The running +node-agent submits the join request, waits for owner approval, stores its +identity in the mounted state directory, and then sends heartbeats. Re-running +with `--replace` updates the container while preserving node identity. Pass +`--auto-update-enabled=false` only for lab/debug installs where the local +systemd updater must not be registered. + +Useful checks: + +```bash +rap-host-agent status --container-name rap-node-agent-docker-node-1 +docker logs -f rap-node-agent-docker-node-1 +``` + +For a node that was installed before the updater existed, register only the +local updater service without recreating the node-agent container: + +```bash +rap-host-agent install-updater \ + --backend-url http://192.168.200.61:18080/api/v1 \ + --cluster-id \ + --state-dir /var/lib/rap/nodes/docker-node-1 \ + --container-name rap-node-agent-docker-node-1 +``` + +## Docker Host Agent Updates + +`rap-host-agent update` applies one Control Plane update plan for an already +enrolled Docker node. The host-agent fetches the plan, downloads the selected +Docker image tar, verifies size and sha256, loads the image, recreates the +node-agent container from the existing Docker runtime settings, checks that the +container is running, and reports update phases back to the Control Plane. 
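The size and sha256 verification step mentioned above could look roughly like the following. This is a sketch under assumptions, not the host-agent's actual code: `verifyArtifact` and `verifyBytes` are hypothetical names, and the expected size and digest are assumed to come from the update plan.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// verifyArtifact streams r through SHA-256 and checks both the byte count
// and the hex digest against the expected values (assumed to be supplied by
// the update plan). Hypothetical helper, not taken from the hostagent package.
func verifyArtifact(r io.Reader, wantSize int64, wantSHA256 string) error {
	h := sha256.New()
	n, err := io.Copy(h, r)
	if err != nil {
		return fmt.Errorf("read artifact: %w", err)
	}
	if n != wantSize {
		return fmt.Errorf("size mismatch: got %d bytes, want %d", n, wantSize)
	}
	got := hex.EncodeToString(h.Sum(nil))
	if !strings.EqualFold(got, strings.TrimSpace(wantSHA256)) {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, wantSHA256)
	}
	return nil
}

// verifyBytes is a convenience wrapper for in-memory data.
func verifyBytes(data []byte, wantSize int64, wantSHA256 string) error {
	return verifyArtifact(bytes.NewReader(data), wantSize, wantSHA256)
}

func main() {
	// SHA-256 of the ASCII string "abc" (standard test vector).
	const digest = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println("matching artifact:", verifyBytes([]byte("abc"), 3, digest))
	fmt.Println("tampered artifact:", verifyBytes([]byte("abd"), 3, digest))
}
```

Checking the size before comparing digests gives a cheaper and clearer failure for truncated downloads, and only a fully verified artifact should be handed to the image load step.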
+ +```bash +rap-host-agent update \ + --backend-url http://192.168.200.61:18080/api/v1 \ + --cluster-id \ + --node-id \ + --container-name rap-node-agent-docker-node-1 \ + --current-version 0.1.0-c17z26 +``` + +`rap-host-agent update-loop` is the per-node executor and health boundary. It +does not need to poll for normal releases: the node-agent receives an +`rap.node_update_hint.v1` subscription hint from Control Plane or the assigned +update-cache service during heartbeat, writes `/update-trigger.json`, +and the host-agent wakes immediately. The interval is an emergency fallback for +missed hints, service migration, or a dead update-cache service; keep it long +in production. The loop keeps running after transient errors by default and +advances its in-process current version after a successful update so it does +not repeatedly apply the same plan. When started without `--node-id` it reads +`/identity.json` and waits until the approved node identity appears, +which lets the updater service start immediately during first install. It also +persists the last applied node-agent version in +`/host-update-state.json` so a service restart does not reapply an +already-installed release. + +```bash +rap-host-agent update-loop \ + --backend-url http://192.168.200.61:18080/api/v1 \ + --cluster-id \ + --node-id \ + --container-name rap-node-agent-docker-node-1 \ + --current-version 0.1.0-c17z26 \ + --interval-seconds 21600 \ + --jitter 0.15 +``` + +Update-cache nodes are ordinary cluster nodes with the `update-cache` role. +Control Plane assigns a healthy update-cache node in the heartbeat hint. If the +assigned service disappears, the next hint returns `control_plane_fallback` or a +new service assignment; the local updater stays subscribed and only uses the +long fallback timer as a last resort. + +`rap-host-agent update-host-agent-loop` updates the host-agent binary itself. +Only one global systemd unit is installed per Docker host: +`rap-host-agent-self-updater.service`. 
It uses one approved local node identity +to ask Control Plane for product `rap-host-agent` with install type +`linux_binary`, verifies the downloaded binary size and sha256, atomically +replaces `/usr/local/bin/rap-host-agent`, and reports status. The already +running process continues until systemd restarts it, while new invocations use +the new binary. + +```bash +rap-host-agent update-host-agent-loop \ + --backend-url http://192.168.200.61:18080/api/v1 \ + --cluster-id \ + --state-dir /var/lib/rap/nodes/docker-node-1 \ + --binary-path /usr/local/bin/rap-host-agent +``` + +## Windows Host Agent Bootstrap And Updates + +Windows uses the same Control Plane install profile, but the local placement is +a Scheduled Task instead of Docker. In `--startup-mode auto` the installer first +tries an elevated `ONSTART` task running as `SYSTEM`; without admin rights it +falls back to a per-user `ONLOGON` task. The `ONSTART` mode starts after reboot +without an interactive user session. The `ONLOGON` fallback can only start after +that Windows user signs in. + +```cmd +powershell -NoProfile -ExecutionPolicy Bypass -Command "Invoke-WebRequest -UseBasicParsing 'http://control.example.com/downloads/rap-host-agent-windows-amd64.exe' -OutFile $env:TEMP\rap-host-agent.exe" +%TEMP%\rap-host-agent.exe install-windows --profile-url "http://control.example.com/api/v1" --cluster-id "" --install-token "" --node-name "office-win-1" --startup-mode "auto" +``` + +`install-windows` installs two tasks: + +- `RAP Node Agent ` runs `rap-node-agent.exe`. +- `RAP Host Agent Updater ` runs `rap-host-agent update-loop` for product + `rap-node-agent`, install type `windows_service`, and replaces the local + `rap-node-agent.exe` from signed release artifacts. + +During first bootstrap the updater can read `\identity.json` and +will wait until the join request is approved. For an already-enrolled Windows +node, prefer passing `--node-id` explicitly. 
That makes the updater wrapper +independent from the local identity file location and is required for repair of +older Windows installs where the node is already heartbeat-healthy but the +host-agent updater has no usable identity file. + +```cmd +%TEMP%\rap-host-agent.exe install-windows --backend-url "http://control.example.com/api/v1" --cluster-id "" --node-id "" --node-name "office-win-1" --replace --startup-mode "auto" --auto-update-current-version "" +``` + +The admin UI node details page generates a downloadable +`rap-repair-updater-.cmd` for this repair path. It performs these steps: + +- prints `schtasks /Query` diagnostics for the node-agent and updater tasks; +- prints the local `rap-*.exe*` files; +- downloads the current `rap-host-agent.exe`; +- reinstalls the Windows updater wrapper with `--node-id`; +- runs a foreground one-shot `update-loop --max-runs 1`; +- applies `rap-host-agent.exe.next` if the running host-agent could not replace + itself; +- restarts `RAP Host Agent Updater `; +- prints post-repair diagnostics. + +Expected successful updater reports in the admin panel: + +```text +rap-node-agent -> plan/noop +rap-host-agent -> plan/noop +``` + +If the latest host-agent report is `apply/staged`, the new host-agent binary +was downloaded as `rap-host-agent.exe.next` but the running process still held +the old executable. End and run the updater task once, or rerun the generated +repair command: + +```cmd +schtasks /End /TN "RAP Host Agent Updater office-win-1" +schtasks /Run /TN "RAP Host Agent Updater office-win-1" +``` + +### Windows Reboot / Autostart Verification + +After installation or repair, verify the service survives a reboot: + +1. Reboot the Windows host, or at minimum restart both scheduled tasks. +2. Confirm the tasks exist: + +```cmd +schtasks /Query /TN "RAP Node Agent office-win-1" /V /FO LIST +schtasks /Query /TN "RAP Host Agent Updater office-win-1" /V /FO LIST +``` + +3. 
Confirm the admin panel shows: + +```text +heartbeat: fresh +rap-node-agent: plan/noop +rap-host-agent: plan/noop +node version_state: current +``` + +Without admin rights, `install-windows --startup-mode auto` may fall back to +`user-task`. That node can still heartbeat and update after the user logs in, +but it will not start before logon after a reboot. Use an elevated shell for +production Windows nodes that must recover unattended. + +Control Plane release artifacts for Windows must use: + +- `product=rap-node-agent` +- `os=windows` +- `arch=amd64` +- `install_type=windows_service` +- `kind=binary` + ## First Enrollment Create a join token from the platform control plane, then run: @@ -185,9 +418,18 @@ bounded `synthetic.echo` test-service runtime, and live synthetic HTTP endpoint. It must not be used for RDP, VPN, file, video, or other production service traffic. -`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. While service runtime -supervision is still a stub, the agent does not poll desired workloads or report -workload status unless this flag is explicitly enabled. +`RAP_WORKLOAD_SUPERVISION_ENABLED` defaults to `false`. When enabled, the agent +polls node-scoped desired workloads and reports status. The current bounded +runtime reports built-in `core-mesh` and `mesh-listener` services as running +when enabled, supports the native built-in `synthetic.echo` test workload, and +keeps unsupported production workloads such as RDP workers degraded until their +supervisors are implemented. + +For Remote Workspace/RDP integration work, the native `rdp-worker` desired +workload supports only an explicit `adapter_contract_probe` mode. That mode +reports the remote-workspace adapter channel contract and requires Fabric +Service Channel as the future data plane; it does not start FreeRDP, create a +remote session, or carry production RDP payloads. 
`RAP_MESH_LISTEN_ADDR` starts the C17E/C17F/C17G synthetic HTTP endpoint only when `RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true`. `RAP_MESH_SYNTHETIC_CONFIG` points to @@ -423,6 +665,63 @@ observations with expected/observed hops and drift status. This probes replacement relay effective paths for control-plane health only and does not enable service payload forwarding. +C17Z21 defines the portable inbound listener contract for Docker, Linux +service, Windows service, and future OS-specific node packages. The node-agent +does not stop when the mesh listen port cannot be bound. It keeps the outbound +Control Plane session alive and emits `c17z21.mesh_listener_report.v1` in +heartbeat metadata with configured address, effective address, listen mode, +listener status, inbound reachability, one-way connectivity, failure reason, +and port-conflict diagnostics. + +`RAP_MESH_LISTEN_PORT_MODE` controls behavior: + +- `manual`: bind exactly `RAP_MESH_LISTEN_ADDR`; on conflict report + `listen_failed` and wait for an operator/config change. +- `auto`: try `RAP_MESH_LISTEN_ADDR`; on conflict scan + `RAP_MESH_LISTEN_AUTO_PORT_START..RAP_MESH_LISTEN_AUTO_PORT_END` and report + `auto_rebound` when a free port is selected. +- `disabled`: do not open an inbound listener; the node is expected to be + outbound-only, relay/rendezvous, or Control Plane only. + +For `RAP_MESH_CONNECTIVITY_MODE=outbound_only`, inbound listener failure is not +treated as node death. The heartbeat remains `healthy` with +`mesh_one_way_connectivity=true` and listener diagnostics. For direct/private +LAN modes, a listener failure degrades the node so the admin panel can show +that the node is alive but cannot accept inbound mesh traffic. Service payload +forwarding is still not enabled by this contract. + +C17Z22 separates outbound Control Plane presence from inbound mesh +reachability. 
When synthetic mesh testing is enabled, every heartbeat includes +`c17z22.mesh_outbound_session_report.v1` with node-to-control-plane direction, +keepalive transport, listener conflict state, rendezvous/relay counters, and a +flag showing whether the current outbound session can be used as a reverse +control-channel contract. This is the portable basis for Docker, Linux service, +Windows service, and future packages where a node may be behind NAT or have no +stable inbound address. It is still control-plane telemetry only and does not +carry RDP/VPN/service payload traffic. + +C17Z24 separates the listener bind address from advertised mesh endpoints. The +agent never advertises loopback addresses discovered from the local listener; +`127.0.0.1`/`::1` are test-only bind details, not cluster reachability data. +When the listener is active, the agent enumerates active non-loopback host +interfaces and reports usable endpoint candidates with interface metadata, +address family, reachability, NAT/connectivity hints, and priority. Container +bridge/veth interfaces and link-local addresses are filtered by default, while +physical and VPN-style interfaces are kept so different cluster segments can +choose the address that matches their network. Operator-provided +`RAP_MESH_ADVERTISE_ENDPOINT` or endpoint-candidate JSON remains authoritative +and is ranked ahead of auto-discovered addresses. + +C17Z25 adds per-peer endpoint fallback probing to the control-plane mesh +manager. A node no longer treats the top-ranked endpoint candidate as the only +possible address for a peer. For each warm direct/private/corporate peer, the +manager probes the ranked candidate list until one `/mesh/v1/health` endpoint +responds or all direct candidates fail. Heartbeat metadata includes +`c17z25.mesh_peer_connection_manager_report.v1` with `probe_results`, +`selected_candidate_id`, `selected_endpoint`, and per-candidate success/failure +details. 
This is still control-plane health and address selection telemetry; it +does not forward RDP/VPN/service payloads. + Scoped synthetic config shape: ```json @@ -480,7 +779,7 @@ Expected: - The agent never assigns roles to itself. - The agent reports capabilities only. - Platform policy assigns roles. -- No RDP/VPN/production service traffic is carried by the C17A-C17Z18 staged +- No RDP/VPN/production service traffic is carried by the C17A-C17Z22 staged mesh runtime. - Production forwarding remains disabled by default and limited to `fabric.control` when explicitly enabled. diff --git a/agents/rap-node-agent/cmd/rap-host-agent/main.go b/agents/rap-node-agent/cmd/rap-host-agent/main.go new file mode 100644 index 0000000..dae6908 --- /dev/null +++ b/agents/rap-node-agent/cmd/rap-host-agent/main.go @@ -0,0 +1,744 @@ +package main + +import ( + "context" + "flag" + "fmt" + "log" + "os" + "os/signal" + "runtime" + "strings" + "syscall" + "time" + + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/agent" + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/hostagent" +) + +type installCommandConfig struct { + Runtime hostagent.RuntimeConfig + DryRun bool + AutoUpdateEnabled bool + AutoUpdate hostagent.UpdateServiceConfig +} + +func main() { + log.SetFlags(0) + applyStagedSelfUpdate() + if len(os.Args) < 2 { + usage() + os.Exit(2) + } + + ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM) + defer stop() + switch os.Args[1] { + case "install": + if err := runInstall(ctx, os.Args[2:]); err != nil { + log.Fatalf("install failed: %v", err) + } + case "install-windows": + if err := runInstallWindows(ctx, os.Args[2:]); err != nil { + log.Fatalf("install-windows failed: %v", err) + } + case "install-linux": + if err := runInstallLinux(ctx, os.Args[2:]); err != nil { + log.Fatalf("install-linux failed: %v", err) + } + case "status": + if err := runStatus(ctx, os.Args[2:]); err != nil { + 
log.Fatalf("status failed: %v", err) + } + case "update": + if err := runUpdate(ctx, os.Args[2:]); err != nil { + log.Fatalf("update failed: %v", err) + } + case "update-loop": + if err := runUpdateLoop(ctx, os.Args[2:]); err != nil { + log.Fatalf("update-loop failed: %v", err) + } + case "install-updater": + if err := runInstallUpdater(ctx, os.Args[2:]); err != nil { + log.Fatalf("install-updater failed: %v", err) + } + case "update-host-agent": + if err := runUpdateHostAgent(ctx, os.Args[2:]); err != nil { + log.Fatalf("update-host-agent failed: %v", err) + } + case "update-host-agent-loop": + if err := runUpdateHostAgentLoop(ctx, os.Args[2:]); err != nil { + log.Fatalf("update-host-agent-loop failed: %v", err) + } + default: + usage() + os.Exit(2) + } +} + +func applyStagedSelfUpdate() { + if runtime.GOOS == "windows" { + return + } + executable, err := os.Executable() + if err != nil { + return + } + staged := executable + ".next" + if _, err := os.Stat(staged); err != nil { + return + } + backup := executable + ".old" + _ = os.Remove(backup) + if err := os.Rename(executable, backup); err != nil { + return + } + if err := os.Rename(staged, executable); err != nil { + _ = os.Rename(backup, executable) + return + } + _ = os.Chmod(executable, 0o755) + _ = os.Remove(backup) +} + +func runInstallLinux(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("install-linux", flag.ContinueOnError) + cfg := hostagent.LinuxInstallConfig{} + var profileURL string + var installToken string + fs.StringVar(&cfg.RuntimeConfig.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&cfg.RuntimeConfig.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&cfg.NodeID, "node-id", getenv("RAP_NODE_ID", ""), "Already enrolled node ID used by updater repair mode.") + fs.StringVar(&cfg.RuntimeConfig.JoinToken, "join-token", getenv("RAP_JOIN_TOKEN", ""), "One-time join token for first 
enrollment.") + fs.StringVar(&profileURL, "profile-url", getenv("RAP_INSTALL_PROFILE_URL", ""), "Control Plane API base URL or /node-agents/linux-install-profile URL for profile-based install.") + fs.StringVar(&installToken, "install-token", getenv("RAP_INSTALL_TOKEN", ""), "One-time install token used to fetch Linux install profile.") + fs.StringVar(&cfg.RuntimeConfig.NodeName, "node-name", getenv("RAP_NODE_NAME", ""), "Node display name.") + fs.StringVar(&cfg.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", ""), "Node state directory.") + fs.StringVar(&cfg.InstallDir, "install-dir", getenv("RAP_LINUX_INSTALL_DIR", ""), "Directory for rap-node-agent and rap-host-agent.") + fs.StringVar(&cfg.ConfigDir, "config-dir", getenv("RAP_LINUX_CONFIG_DIR", ""), "Directory for node-agent env file.") + fs.StringVar(&cfg.StartupMode, "startup-mode", getenv("RAP_LINUX_STARTUP_MODE", "systemd"), "Startup mode: systemd, auto, or none.") + fs.BoolVar(&cfg.Replace, "replace", getenvBool("RAP_REPLACE", true), "Replace local node-agent binary/config when an artifact is available.") + fs.BoolVar(&cfg.DryRun, "dry-run", false, "Print resolved placement without installing.") + fs.BoolVar(&cfg.AutoUpdateEnabled, "auto-update-enabled", getenvBool("RAP_AUTO_UPDATE_ENABLED", true), "Install and start the Linux host-agent update service.") + fs.StringVar(&cfg.AutoUpdateCurrentVersion, "auto-update-current-version", getenv("RAP_NODE_AGENT_VERSION", agent.Version), "Initial node-agent version used by update-loop before the first successful update.") + fs.StringVar(&cfg.AutoUpdateChannel, "auto-update-channel", getenv("RAP_UPDATE_CHANNEL", ""), "Optional update channel override for update-loop.") + fs.IntVar(&cfg.AutoUpdateIntervalSeconds, "auto-update-interval-seconds", getenvInt("RAP_UPDATE_INTERVAL_SECONDS", 21600), "Emergency fallback plan poll interval in seconds. 
Update-service/heartbeat hints trigger normal runs.") + fs.IntVar(&cfg.AutoUpdateInitialDelaySeconds, "auto-update-initial-delay-seconds", getenvInt("RAP_UPDATE_INITIAL_DELAY_SECONDS", 15), "Update-loop initial delay in seconds.") + fs.IntVar(&cfg.AutoUpdateHealthTimeoutSeconds, "auto-update-health-timeout-seconds", getenvInt("RAP_UPDATE_HEALTH_TIMEOUT_SECONDS", 30), "Updated service health timeout in seconds.") + fs.StringVar(&cfg.HostAgentSourcePath, "host-agent-source-path", getenv("RAP_HOST_AGENT_SOURCE_PATH", ""), "Source rap-host-agent path copied to the persistent updater location.") + fs.BoolVar(&cfg.RuntimeConfig.WorkloadSupervisionEnabled, "workload-supervision-enabled", getenvBool("RAP_WORKLOAD_SUPERVISION_ENABLED", false), "Enable node-agent workload status reporting.") + fs.BoolVar(&cfg.RuntimeConfig.MeshSyntheticRuntimeEnabled, "mesh-synthetic-runtime-enabled", getenvBool("RAP_MESH_SYNTHETIC_RUNTIME_ENABLED", true), "Enable synthetic mesh runtime.") + fs.BoolVar(&cfg.RuntimeConfig.MeshProductionForwardingEnabled, "mesh-production-forwarding-enabled", getenvBool("RAP_MESH_PRODUCTION_FORWARDING_ENABLED", false), "Enable production forwarding gate; runtime still fail-closed if unavailable.") + fs.StringVar(&cfg.RuntimeConfig.MeshListenAddr, "mesh-listen-addr", getenv("RAP_MESH_LISTEN_ADDR", ":19131"), "Synthetic mesh HTTP listen address.") + fs.StringVar(&cfg.RuntimeConfig.MeshListenPortMode, "mesh-listen-port-mode", getenv("RAP_MESH_LISTEN_PORT_MODE", "auto"), "Mesh listen port behavior: manual, auto, or disabled.") + fs.IntVar(&cfg.RuntimeConfig.MeshListenAutoPortStart, "mesh-listen-auto-port-start", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_START", 19131), "First port used when mesh listen port mode is auto.") + fs.IntVar(&cfg.RuntimeConfig.MeshListenAutoPortEnd, "mesh-listen-auto-port-end", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_END", 19231), "Last port used when mesh listen port mode is auto.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseEndpoint, 
"mesh-advertise-endpoint", getenv("RAP_MESH_ADVERTISE_ENDPOINT", ""), "Advertised mesh endpoint.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseEndpointsJSON, "mesh-advertise-endpoints-json", getenv("RAP_MESH_ADVERTISE_ENDPOINTS_JSON", ""), "Advertised endpoint candidates JSON.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseTransport, "mesh-advertise-transport", getenv("RAP_MESH_ADVERTISE_TRANSPORT", "direct_http"), "Advertised transport.") + fs.StringVar(&cfg.RuntimeConfig.MeshConnectivityMode, "mesh-connectivity-mode", getenv("RAP_MESH_CONNECTIVITY_MODE", "outbound_only"), "Connectivity mode hint.") + fs.StringVar(&cfg.RuntimeConfig.MeshNATType, "mesh-nat-type", getenv("RAP_MESH_NAT_TYPE", "unknown"), "NAT type hint.") + fs.StringVar(&cfg.RuntimeConfig.MeshRegion, "mesh-region", getenv("RAP_MESH_REGION", "linux"), "Region/site hint.") + fs.IntVar(&cfg.RuntimeConfig.HeartbeatIntervalSeconds, "heartbeat-interval-seconds", getenvInt("RAP_HEARTBEAT_INTERVAL_SECONDS", 15), "Heartbeat interval seconds.") + fs.IntVar(&cfg.RuntimeConfig.EnrollmentPollIntervalSeconds, "enrollment-poll-interval-seconds", getenvInt("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5), "Enrollment poll interval seconds.") + fs.IntVar(&cfg.RuntimeConfig.EnrollmentPollTimeoutSeconds, "enrollment-poll-timeout-seconds", getenvInt("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 0), "Enrollment approval timeout seconds. 
Use 0 to wait indefinitely.") + if err := fs.Parse(args); err != nil { + return err + } + if strings.TrimSpace(profileURL) != "" || strings.TrimSpace(installToken) != "" { + dryRun := cfg.DryRun + startupMode := strings.TrimSpace(cfg.StartupMode) + autoUpdateEnabled := cfg.AutoUpdateEnabled + autoUpdateCurrentVersion := cfg.AutoUpdateCurrentVersion + autoUpdateChannel := cfg.AutoUpdateChannel + autoUpdateIntervalSeconds := cfg.AutoUpdateIntervalSeconds + autoUpdateInitialDelaySeconds := cfg.AutoUpdateInitialDelaySeconds + autoUpdateHealthTimeoutSeconds := cfg.AutoUpdateHealthTimeoutSeconds + hostAgentSourcePath := cfg.HostAgentSourcePath + profile, err := hostagent.FetchLinuxInstallProfile(ctx, hostagent.ProfileRequest{URL: profileURL, ClusterID: cfg.RuntimeConfig.ClusterID, InstallToken: installToken, NodeName: cfg.RuntimeConfig.NodeName}) + if err != nil { + return err + } + cfg = hostagent.LinuxInstallConfigFromProfile(profile) + cfg.Replace = true + cfg.DryRun = dryRun + cfg.AutoUpdateEnabled = autoUpdateEnabled + cfg.AutoUpdateCurrentVersion = autoUpdateCurrentVersion + cfg.AutoUpdateChannel = autoUpdateChannel + cfg.AutoUpdateIntervalSeconds = autoUpdateIntervalSeconds + cfg.AutoUpdateInitialDelaySeconds = autoUpdateInitialDelaySeconds + cfg.AutoUpdateHealthTimeoutSeconds = autoUpdateHealthTimeoutSeconds + cfg.HostAgentSourcePath = hostAgentSourcePath + if startupMode != "" { + cfg.StartupMode = startupMode + } + } + result, err := (hostagent.LinuxManager{}).Install(ctx, cfg) + if err != nil { + return err + } + fmt.Printf("node=%s install_dir=%s state_dir=%s node_agent=%s unit=%s downloaded=%t started=%t updater_unit=%s updater_started=%t\n", + result.NodeName, result.InstallDir, result.StateDir, result.NodeAgentPath, result.UnitName, result.Downloaded, result.Started, result.UpdaterUnitName, result.UpdaterStarted) + fmt.Println("next: approve the join request in the platform admin panel, then the Linux node-agent will finish bootstrap and start heartbeats") 
+ return nil +} + +func runInstallWindows(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("install-windows", flag.ContinueOnError) + cfg := hostagent.WindowsInstallConfig{} + var profileURL string + var installToken string + fs.StringVar(&cfg.RuntimeConfig.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&cfg.RuntimeConfig.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&cfg.NodeID, "node-id", getenv("RAP_NODE_ID", ""), "Already enrolled node ID used by updater repair mode.") + fs.StringVar(&cfg.RuntimeConfig.JoinToken, "join-token", getenv("RAP_JOIN_TOKEN", ""), "One-time join token for first enrollment.") + fs.StringVar(&profileURL, "profile-url", getenv("RAP_INSTALL_PROFILE_URL", ""), "Control Plane API base URL or /node-agents/windows-install-profile URL for profile-based install.") + fs.StringVar(&installToken, "install-token", getenv("RAP_INSTALL_TOKEN", ""), "One-time install token used to fetch Windows install profile.") + fs.StringVar(&cfg.RuntimeConfig.NodeName, "node-name", getenv("RAP_NODE_NAME", ""), "Node display name.") + fs.StringVar(&cfg.RuntimeConfig.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", ""), "Node state directory.") + fs.StringVar(&cfg.InstallDir, "install-dir", getenv("RAP_WINDOWS_INSTALL_DIR", ""), "Directory for rap-node-agent.exe and wrapper scripts.") + fs.StringVar(&cfg.StartupMode, "startup-mode", getenv("RAP_WINDOWS_STARTUP_MODE", "auto"), "Startup mode: auto, system-task, user-task, or none.") + fs.BoolVar(&cfg.Replace, "replace", getenvBool("RAP_REPLACE", true), "Replace local node-agent binary/config when an artifact is available.") + fs.BoolVar(&cfg.DryRun, "dry-run", false, "Print resolved placement without installing.") + fs.BoolVar(&cfg.AutoUpdateEnabled, "auto-update-enabled", getenvBool("RAP_AUTO_UPDATE_ENABLED", true), "Install and start the Windows host-agent update task.") + 
fs.StringVar(&cfg.AutoUpdateCurrentVersion, "auto-update-current-version", getenv("RAP_NODE_AGENT_VERSION", agent.Version), "Initial node-agent version used by update-loop before the first successful update.") + fs.StringVar(&cfg.AutoUpdateChannel, "auto-update-channel", getenv("RAP_UPDATE_CHANNEL", ""), "Optional update channel override for update-loop.") + fs.IntVar(&cfg.AutoUpdateIntervalSeconds, "auto-update-interval-seconds", getenvInt("RAP_UPDATE_INTERVAL_SECONDS", 21600), "Emergency fallback plan poll interval in seconds. Update-service/heartbeat hints trigger normal runs.") + fs.IntVar(&cfg.AutoUpdateInitialDelaySeconds, "auto-update-initial-delay-seconds", getenvInt("RAP_UPDATE_INITIAL_DELAY_SECONDS", 15), "Update-loop initial delay in seconds.") + fs.IntVar(&cfg.AutoUpdateHealthTimeoutSeconds, "auto-update-health-timeout-seconds", getenvInt("RAP_UPDATE_HEALTH_TIMEOUT_SECONDS", 30), "Updated service health timeout in seconds.") + fs.StringVar(&cfg.HostAgentSourcePath, "host-agent-source-path", getenv("RAP_HOST_AGENT_SOURCE_PATH", ""), "Source rap-host-agent.exe path copied to the persistent updater location.") + fs.BoolVar(&cfg.RuntimeConfig.WorkloadSupervisionEnabled, "workload-supervision-enabled", getenvBool("RAP_WORKLOAD_SUPERVISION_ENABLED", false), "Enable node-agent workload status reporting.") + fs.BoolVar(&cfg.RuntimeConfig.MeshSyntheticRuntimeEnabled, "mesh-synthetic-runtime-enabled", getenvBool("RAP_MESH_SYNTHETIC_RUNTIME_ENABLED", true), "Enable synthetic mesh runtime.") + fs.BoolVar(&cfg.RuntimeConfig.MeshProductionForwardingEnabled, "mesh-production-forwarding-enabled", getenvBool("RAP_MESH_PRODUCTION_FORWARDING_ENABLED", false), "Enable production forwarding gate; runtime still fail-closed if unavailable.") + fs.StringVar(&cfg.RuntimeConfig.MeshListenAddr, "mesh-listen-addr", getenv("RAP_MESH_LISTEN_ADDR", ":19131"), "Synthetic mesh HTTP listen address.") + fs.StringVar(&cfg.RuntimeConfig.MeshListenPortMode, "mesh-listen-port-mode", 
getenv("RAP_MESH_LISTEN_PORT_MODE", "auto"), "Mesh listen port behavior: manual, auto, or disabled.") + fs.IntVar(&cfg.RuntimeConfig.MeshListenAutoPortStart, "mesh-listen-auto-port-start", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_START", 19131), "First port used when mesh listen port mode is auto.") + fs.IntVar(&cfg.RuntimeConfig.MeshListenAutoPortEnd, "mesh-listen-auto-port-end", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_END", 19231), "Last port used when mesh listen port mode is auto.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseEndpoint, "mesh-advertise-endpoint", getenv("RAP_MESH_ADVERTISE_ENDPOINT", ""), "Advertised mesh endpoint.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseEndpointsJSON, "mesh-advertise-endpoints-json", getenv("RAP_MESH_ADVERTISE_ENDPOINTS_JSON", ""), "Advertised endpoint candidates JSON.") + fs.StringVar(&cfg.RuntimeConfig.MeshAdvertiseTransport, "mesh-advertise-transport", getenv("RAP_MESH_ADVERTISE_TRANSPORT", "direct_http"), "Advertised transport.") + fs.StringVar(&cfg.RuntimeConfig.MeshConnectivityMode, "mesh-connectivity-mode", getenv("RAP_MESH_CONNECTIVITY_MODE", "outbound_only"), "Connectivity mode hint.") + fs.StringVar(&cfg.RuntimeConfig.MeshNATType, "mesh-nat-type", getenv("RAP_MESH_NAT_TYPE", "unknown"), "NAT type hint.") + fs.StringVar(&cfg.RuntimeConfig.MeshRegion, "mesh-region", getenv("RAP_MESH_REGION", "windows"), "Region/site hint.") + fs.IntVar(&cfg.RuntimeConfig.HeartbeatIntervalSeconds, "heartbeat-interval-seconds", getenvInt("RAP_HEARTBEAT_INTERVAL_SECONDS", 15), "Heartbeat interval seconds.") + fs.IntVar(&cfg.RuntimeConfig.EnrollmentPollIntervalSeconds, "enrollment-poll-interval-seconds", getenvInt("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5), "Enrollment poll interval seconds.") + fs.IntVar(&cfg.RuntimeConfig.EnrollmentPollTimeoutSeconds, "enrollment-poll-timeout-seconds", getenvInt("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 0), "Enrollment approval timeout seconds. 
Use 0 to wait indefinitely.") + if err := fs.Parse(args); err != nil { + return err + } + if strings.TrimSpace(profileURL) != "" || strings.TrimSpace(installToken) != "" { + dryRun := cfg.DryRun + startupMode := strings.TrimSpace(cfg.StartupMode) + autoUpdateEnabled := cfg.AutoUpdateEnabled + autoUpdateCurrentVersion := cfg.AutoUpdateCurrentVersion + autoUpdateChannel := cfg.AutoUpdateChannel + autoUpdateIntervalSeconds := cfg.AutoUpdateIntervalSeconds + autoUpdateInitialDelaySeconds := cfg.AutoUpdateInitialDelaySeconds + autoUpdateHealthTimeoutSeconds := cfg.AutoUpdateHealthTimeoutSeconds + hostAgentSourcePath := cfg.HostAgentSourcePath + profile, err := hostagent.FetchWindowsInstallProfile(ctx, hostagent.ProfileRequest{ + URL: profileURL, + ClusterID: cfg.RuntimeConfig.ClusterID, + InstallToken: installToken, + NodeName: cfg.RuntimeConfig.NodeName, + }) + if err != nil { + return err + } + cfg = hostagent.WindowsInstallConfigFromProfile(profile) + cfg.Replace = true + cfg.DryRun = dryRun + cfg.AutoUpdateEnabled = autoUpdateEnabled + cfg.AutoUpdateCurrentVersion = autoUpdateCurrentVersion + cfg.AutoUpdateChannel = autoUpdateChannel + cfg.AutoUpdateIntervalSeconds = autoUpdateIntervalSeconds + cfg.AutoUpdateInitialDelaySeconds = autoUpdateInitialDelaySeconds + cfg.AutoUpdateHealthTimeoutSeconds = autoUpdateHealthTimeoutSeconds + cfg.HostAgentSourcePath = hostAgentSourcePath + if startupMode != "" { + cfg.StartupMode = startupMode + } + } + result, err := (hostagent.WindowsManager{}).Install(ctx, cfg) + if err != nil { + return err + } + fmt.Printf("node=%s install_dir=%s state_dir=%s node_agent=%s startup_mode=%s task=%s downloaded=%t started=%t updater_task=%s updater_started=%t admin_fallback=%t\n", + result.NodeName, result.InstallDir, result.StateDir, result.NodeAgentPath, result.StartupMode, result.TaskName, result.Downloaded, result.Started, result.UpdaterTaskName, result.UpdaterStarted, result.AdminFallback) + fmt.Println("next: approve the join request in 
the platform admin panel, then the Windows node-agent will finish bootstrap and start heartbeats") + return nil +} + +func runInstall(ctx context.Context, args []string) error { + installCfg, err := parseInstall(args) + if err != nil { + return err + } + cfg := installCfg.Runtime.Normalize() + cfg = cfg.Normalize() + runArgs := hostagent.DockerRunArgs(cfg) + if installCfg.DryRun { + fmt.Printf("docker %s\n", shellJoin(hostagent.RedactedArgs(runArgs))) + if installCfg.AutoUpdateEnabled { + service := installCfg.AutoUpdate + service.RuntimeConfig = cfg + service.DryRun = true + result, err := (hostagent.DockerManager{}).InstallUpdateService(ctx, service) + if err != nil { + return err + } + fmt.Print(result.Unit) + } + return nil + } + result, err := (hostagent.DockerManager{}).Install(ctx, cfg) + if err != nil { + return err + } + fmt.Printf("container=%s image=%s id=%s pulled=%t replaced=%t\n", result.ContainerName, result.Image, result.ContainerID, result.Pulled, result.Replaced) + if installCfg.AutoUpdateEnabled { + service := installCfg.AutoUpdate + service.RuntimeConfig = cfg + service.ManageSystemd = true + serviceResult, err := (hostagent.DockerManager{}).InstallUpdateService(ctx, service) + if err != nil { + return err + } + fmt.Printf("updater_service=%s unit=%s binary=%s started=%t\n", serviceResult.UnitName, serviceResult.UnitPath, serviceResult.BinaryPath, serviceResult.Started) + } + fmt.Println("next: approve the join request in the platform admin panel, then the node-agent will finish bootstrap and start heartbeats") + return nil +} + +func runStatus(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("status", flag.ContinueOnError) + containerName := fs.String("container-name", hostagent.DefaultContainerName, "Docker container name.") + if err := fs.Parse(args); err != nil { + return err + } + out, err := (hostagent.DockerManager{}).Status(ctx, *containerName) + if err != nil { + return err + } + fmt.Print(out) + return nil +} + +func 
runUpdate(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("update", flag.ContinueOnError) + req := hostagent.UpdateRequest{} + var healthTimeoutSeconds int + registerUpdateFlags(fs, &req, &healthTimeoutSeconds) + if err := fs.Parse(args); err != nil { + return err + } + req.HealthTimeout = time.Duration(healthTimeoutSeconds) * time.Second + if req.DryRun { + plan, err := hostagent.FetchNodeUpdatePlan(ctx, req) + if err != nil { + return err + } + fmt.Printf("action=%s reason=%s target=%s production_forwarding=%t\n", plan.Action, plan.Reason, plan.TargetVersion, plan.ProductionForwarding) + if plan.Artifact != nil { + fmt.Printf("artifact=%s sha256=%s size=%d\n", plan.Artifact.URL, plan.Artifact.SHA256, plan.Artifact.SizeBytes) + } + return nil + } + var result hostagent.UpdateResult + var err error + if req.InstallType == hostagent.WindowsUpdateInstallType || runtime.GOOS == "windows" { + result, err = (hostagent.WindowsManager{}).ApplyUpdate(ctx, req) + } else if req.InstallType == hostagent.BinaryUpdateInstallType { + result, err = (hostagent.LinuxManager{}).ApplyUpdate(ctx, req) + } else { + result, err = (hostagent.DockerManager{}).ApplyUpdate(ctx, req) + } + if err != nil { + return err + } + fmt.Printf("action=%s reason=%s target=%s container=%s image=%s id=%s loaded=%t replaced=%t rolled_back=%t\n", + result.Action, + result.Reason, + result.TargetVersion, + result.ContainerName, + result.NewImage, + result.ContainerID, + result.Loaded, + result.Replaced, + result.RolledBack, + ) + return nil +} + +func runUpdateLoop(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("update-loop", flag.ContinueOnError) + req := hostagent.UpdateRequest{} + var healthTimeoutSeconds int + var intervalSeconds int + var initialDelaySeconds int + var maxRuns int + var jitter float64 + var stopOnError bool + var hostAgentStatusEnabled bool + var hostAgentVersion string + var hostAgentBinaryPath string + registerUpdateFlags(fs, &req, 
&healthTimeoutSeconds) + fs.IntVar(&intervalSeconds, "interval-seconds", getenvInt("RAP_UPDATE_INTERVAL_SECONDS", 21600), "Seconds between emergency fallback update plan polls. Update-service/heartbeat hints trigger normal runs.") + fs.IntVar(&initialDelaySeconds, "initial-delay-seconds", getenvInt("RAP_UPDATE_INITIAL_DELAY_SECONDS", 0), "Seconds to wait before the first poll.") + fs.Float64Var(&jitter, "jitter", getenvFloat("RAP_UPDATE_JITTER", 0.15), "Fractional random jitter for interval and initial delay, 0..1.") + fs.IntVar(&maxRuns, "max-runs", getenvInt("RAP_UPDATE_MAX_RUNS", 0), "Maximum loop iterations. Use 0 to run until stopped.") + fs.BoolVar(&stopOnError, "stop-on-error", getenvBool("RAP_UPDATE_STOP_ON_ERROR", false), "Stop the loop after the first failed update attempt.") + fs.BoolVar(&hostAgentStatusEnabled, "host-agent-update-status-enabled", getenvBool("RAP_HOST_AGENT_UPDATE_STATUS_ENABLED", true), "Also poll/report rap-host-agent update status from this loop.") + fs.StringVar(&hostAgentVersion, "host-agent-current-version", getenv("RAP_HOST_AGENT_VERSION", agent.Version), "Current rap-host-agent version reported by the loop.") + fs.StringVar(&hostAgentBinaryPath, "host-agent-binary-path", getenv("RAP_HOST_AGENT_BINARY_PATH", hostagent.DefaultHostAgentInstallPath), "rap-host-agent binary path used for host-agent update status.") + if err := fs.Parse(args); err != nil { + return err + } + req.HealthTimeout = time.Duration(healthTimeoutSeconds) * time.Second + cfg := hostagent.UpdateLoopConfig{ + Request: req, + Interval: time.Duration(intervalSeconds) * time.Second, + InitialDelay: time.Duration(initialDelaySeconds) * time.Second, + Jitter: jitter, + MaxRuns: maxRuns, + StopOnError: stopOnError, + Logf: func(format string, args ...any) { + fmt.Printf(format+"\n", args...) 
+ }, + } + cfg.HostAgentUpdateEnabled = hostAgentStatusEnabled + cfg.HostAgentUpdateRequest = hostagent.HostAgentUpdateRequest{ + BackendURL: req.BackendURL, + ClusterID: req.ClusterID, + NodeID: req.NodeID, + StateDir: req.StateDir, + CurrentVersion: hostAgentVersion, + Channel: req.Channel, + OS: firstNonEmptyLocal(req.OS, runtime.GOOS), + Arch: firstNonEmptyLocal(req.Arch, runtime.GOARCH), + InstallType: hostagent.BinaryUpdateInstallType, + BinaryPath: hostAgentBinaryPath, + } + if req.InstallType == hostagent.WindowsUpdateInstallType || runtime.GOOS == "windows" { + cfg.HostAgentUpdateRequest.InstallType = "windows_binary" + return (hostagent.WindowsManager{}).RunUpdateLoop(ctx, cfg) + } + if req.InstallType == hostagent.BinaryUpdateInstallType { + return (hostagent.LinuxManager{}).RunUpdateLoop(ctx, cfg) + } + return (hostagent.DockerManager{}).RunUpdateLoop(ctx, cfg) +} + +func firstNonEmptyLocal(values ...string) string { + for _, value := range values { + if strings.TrimSpace(value) != "" { + return value + } + } + return "" +} + +func runInstallUpdater(ctx context.Context, args []string) error { + fs := flag.NewFlagSet("install-updater", flag.ContinueOnError) + runtimeCfg := hostagent.RuntimeConfig{} + service := hostagent.UpdateServiceConfig{} + var dryRun bool + var selfUpdater bool + fs.StringVar(&runtimeCfg.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&runtimeCfg.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&runtimeCfg.ContainerName, "container-name", getenv("RAP_NODE_AGENT_CONTAINER", hostagent.DefaultContainerName), "Docker container name to update.") + fs.StringVar(&runtimeCfg.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", hostagent.DefaultStateDir), "Host path containing node-agent identity.json.") + fs.StringVar(&service.CurrentVersion, "current-version", getenv("RAP_NODE_AGENT_VERSION", agent.Version), "Initial node-agent version before first 
successful update.") + fs.StringVar(&service.Channel, "channel", getenv("RAP_UPDATE_CHANNEL", ""), "Optional update channel override.") + fs.IntVar(&service.IntervalSeconds, "interval-seconds", getenvInt("RAP_UPDATE_INTERVAL_SECONDS", 21600), "Emergency fallback plan poll interval in seconds. Update-service/heartbeat hints trigger normal runs.") + fs.IntVar(&service.InitialDelaySeconds, "initial-delay-seconds", getenvInt("RAP_UPDATE_INITIAL_DELAY_SECONDS", 15), "Update-loop initial delay in seconds.") + fs.Float64Var(&service.Jitter, "jitter", getenvFloat("RAP_UPDATE_JITTER", 0.15), "Update-loop interval jitter, 0..1.") + fs.IntVar(&service.HealthTimeoutSec, "health-timeout-seconds", getenvInt("RAP_UPDATE_HEALTH_TIMEOUT_SECONDS", 30), "Updated container running-state timeout in seconds.") + fs.StringVar(&service.BinaryInstallPath, "binary-path", getenv("RAP_HOST_AGENT_BINARY_PATH", hostagent.DefaultHostAgentInstallPath), "Persistent host path for rap-host-agent binary used by the service.") + fs.BoolVar(&selfUpdater, "self-updater-enabled", getenvBool("RAP_HOST_AGENT_SELF_UPDATE_ENABLED", true), "Install and start one global host-agent binary self-updater service.") + fs.BoolVar(&dryRun, "dry-run", false, "Print the systemd unit without installing it.") + if err := fs.Parse(args); err != nil { + return err + } + service.RuntimeConfig = runtimeCfg + service.ManageSystemd = !dryRun + service.DryRun = dryRun + service.InstallSelfUpdater = selfUpdater + service.SelfUpdateVersion = agent.Version + result, err := (hostagent.DockerManager{}).InstallUpdateService(ctx, service) + if err != nil { + return err + } + if dryRun { + fmt.Print(result.Unit) + if result.SelfUnit != "" { + fmt.Print(result.SelfUnit) + } + return nil + } + fmt.Printf("updater_service=%s unit=%s binary=%s started=%t self_updater=%s\n", result.UnitName, result.UnitPath, result.BinaryPath, result.Started, result.SelfUnitName) + return nil +} + +func runUpdateHostAgent(ctx context.Context, args []string) 
error { + req, interval, initialDelay, jitter, maxRuns, stopOnError, loop, err := parseHostAgentUpdate(args) + _, _, _, _, _ = interval, initialDelay, jitter, maxRuns, stopOnError + if err != nil { + return err + } + if loop { + return fmt.Errorf("internal parser error: loop flag set for one-shot update") + } + result, err := (hostagent.DockerManager{}).ApplyHostAgentUpdate(ctx, req) + if err != nil { + return err + } + fmt.Printf("action=%s reason=%s target=%s binary=%s replaced=%t restart_needed=%t\n", result.Action, result.Reason, result.TargetVersion, result.NewImage, result.Replaced, result.RestartNeeded) + return nil +} + +func runUpdateHostAgentLoop(ctx context.Context, args []string) error { + req, interval, initialDelay, jitter, maxRuns, stopOnError, _, err := parseHostAgentUpdate(args) + if err != nil { + return err + } + return (hostagent.DockerManager{}).RunHostAgentUpdateLoop(ctx, hostagent.HostAgentUpdateLoopConfig{ + Request: req, + Interval: time.Duration(interval) * time.Second, + InitialDelay: time.Duration(initialDelay) * time.Second, + Jitter: jitter, + MaxRuns: maxRuns, + StopOnError: stopOnError, + Logf: func(format string, args ...any) { + fmt.Printf(format+"\n", args...) 
+ }, + }) +} + +func parseHostAgentUpdate(args []string) (hostagent.HostAgentUpdateRequest, int, int, float64, int, bool, bool, error) { + fs := flag.NewFlagSet("update-host-agent", flag.ContinueOnError) + req := hostagent.HostAgentUpdateRequest{} + var intervalSeconds int + var initialDelaySeconds int + var maxRuns int + var jitter float64 + var stopOnError bool + fs.StringVar(&req.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&req.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&req.NodeID, "node-id", getenv("RAP_NODE_ID", ""), "Already enrolled node ID.") + fs.StringVar(&req.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", ""), "Host path containing node-agent identity.json.") + fs.StringVar(&req.CurrentVersion, "current-version", getenv("RAP_HOST_AGENT_VERSION", agent.Version), "Currently installed rap-host-agent version.") + fs.StringVar(&req.Channel, "channel", getenv("RAP_UPDATE_CHANNEL", ""), "Optional update channel override.") + fs.StringVar(&req.OS, "os", getenv("RAP_HOST_AGENT_UPDATE_OS", runtime.GOOS), "Host-agent artifact OS selector.") + fs.StringVar(&req.Arch, "arch", getenv("RAP_HOST_AGENT_UPDATE_ARCH", runtime.GOARCH), "Host-agent artifact architecture selector.") + fs.StringVar(&req.InstallType, "install-type", getenv("RAP_HOST_AGENT_UPDATE_INSTALL_TYPE", hostagent.BinaryUpdateInstallType), "Host-agent artifact install type.") + fs.StringVar(&req.BinaryPath, "binary-path", getenv("RAP_HOST_AGENT_BINARY_PATH", hostagent.DefaultHostAgentInstallPath), "rap-host-agent binary path to replace atomically.") + fs.BoolVar(&req.DryRun, "dry-run", false, "Fetch and print the update plan without applying it.") + fs.IntVar(&intervalSeconds, "interval-seconds", getenvInt("RAP_HOST_AGENT_UPDATE_INTERVAL_SECONDS", 900), "Seconds between host-agent update plan polls.") + fs.IntVar(&initialDelaySeconds, "initial-delay-seconds", 
getenvInt("RAP_HOST_AGENT_UPDATE_INITIAL_DELAY_SECONDS", 45), "Seconds to wait before the first poll.") + fs.Float64Var(&jitter, "jitter", getenvFloat("RAP_UPDATE_JITTER", 0.15), "Fractional random jitter for interval and initial delay, 0..1.") + fs.IntVar(&maxRuns, "max-runs", getenvInt("RAP_UPDATE_MAX_RUNS", 0), "Maximum loop iterations. Use 0 to run until stopped.") + fs.BoolVar(&stopOnError, "stop-on-error", getenvBool("RAP_UPDATE_STOP_ON_ERROR", false), "Stop the loop after the first failed update attempt.") + if err := fs.Parse(args); err != nil { + return hostagent.HostAgentUpdateRequest{}, 0, 0, 0, 0, false, false, err + } + return req, intervalSeconds, initialDelaySeconds, jitter, maxRuns, stopOnError, false, nil +} + +func registerUpdateFlags(fs *flag.FlagSet, req *hostagent.UpdateRequest, healthTimeoutSeconds *int) { + fs.StringVar(&req.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&req.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&req.NodeID, "node-id", getenv("RAP_NODE_ID", ""), "Already enrolled node ID.") + fs.StringVar(&req.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", ""), "Host path containing node-agent identity.json; used when node-id is not known yet.") + fs.StringVar(&req.Product, "product", getenv("RAP_UPDATE_PRODUCT", hostagent.DefaultUpdateProduct), "Update product name.") + fs.StringVar(&req.CurrentVersion, "current-version", getenv("RAP_NODE_AGENT_VERSION", agent.Version), "Currently running product version.") + fs.StringVar(&req.OS, "os", getenv("RAP_UPDATE_OS", runtime.GOOS), "Artifact OS selector.") + fs.StringVar(&req.Arch, "arch", getenv("RAP_UPDATE_ARCH", runtime.GOARCH), "Artifact architecture selector.") + fs.StringVar(&req.InstallType, "install-type", getenv("RAP_UPDATE_INSTALL_TYPE", hostagent.DefaultUpdateInstallType), "Artifact install type.") + fs.StringVar(&req.Channel, "channel", getenv("RAP_UPDATE_CHANNEL", ""), 
"Optional update channel override.") + fs.StringVar(&req.ContainerName, "container-name", getenv("RAP_NODE_AGENT_CONTAINER", hostagent.DefaultContainerName), "Docker container name to update.") + fs.StringVar(&req.BinaryPath, "binary-path", getenv("RAP_NODE_AGENT_BINARY_PATH", ""), "Windows node-agent binary path to replace.") + fs.StringVar(&req.WindowsTaskName, "windows-task-name", getenv("RAP_WINDOWS_TASK_NAME", ""), "Windows Scheduled Task name used to restart node-agent.") + fs.StringVar(&req.SystemdUnitName, "systemd-unit", getenv("RAP_SYSTEMD_UNIT", ""), "Linux systemd unit used to restart node-agent.") + fs.IntVar(healthTimeoutSeconds, "health-timeout-seconds", getenvInt("RAP_UPDATE_HEALTH_TIMEOUT_SECONDS", 30), "Seconds to wait for the updated container to be running.") + fs.BoolVar(&req.DryRun, "dry-run", false, "Fetch and print the update plan without applying it.") +} + +func parseInstall(args []string) (installCommandConfig, error) { + fs := flag.NewFlagSet("install", flag.ContinueOnError) + cfg := hostagent.RuntimeConfig{} + var dryRun bool + var profileURL string + var installToken string + var autoUpdateEnabled bool + autoUpdate := hostagent.UpdateServiceConfig{} + fs.StringVar(&cfg.BackendURL, "backend-url", getenv("RAP_BACKEND_URL", ""), "Control Plane API base URL.") + fs.StringVar(&cfg.ClusterID, "cluster-id", getenv("RAP_CLUSTER_ID", ""), "Cluster ID.") + fs.StringVar(&cfg.JoinToken, "join-token", getenv("RAP_JOIN_TOKEN", ""), "One-time join token for first enrollment.") + fs.StringVar(&profileURL, "profile-url", getenv("RAP_INSTALL_PROFILE_URL", ""), "Control Plane API base URL or /node-agents/docker-install-profile URL for profile-based install.") + fs.StringVar(&installToken, "install-token", getenv("RAP_INSTALL_TOKEN", ""), "One-time install token used to fetch Docker install profile.") + fs.StringVar(&cfg.NodeName, "node-name", getenv("RAP_NODE_NAME", ""), "Node display name.") + fs.StringVar(&cfg.Image, "image", 
getenv("RAP_NODE_AGENT_IMAGE", hostagent.DefaultImage), "Docker image for rap-node-agent.") + fs.StringVar(&cfg.ContainerName, "container-name", getenv("RAP_NODE_AGENT_CONTAINER", hostagent.DefaultContainerName), "Docker container name.") + fs.StringVar(&cfg.StateDir, "state-dir", getenv("RAP_NODE_STATE_DIR", hostagent.DefaultStateDir), "Host path mounted as node-agent state.") + fs.StringVar(&cfg.Network, "network", getenv("RAP_DOCKER_NETWORK", hostagent.DefaultNetwork), "Docker network mode/name.") + fs.StringVar(&cfg.RestartPolicy, "restart", getenv("RAP_DOCKER_RESTART", "unless-stopped"), "Docker restart policy.") + fs.BoolVar(&cfg.PullImage, "pull", getenvBool("RAP_DOCKER_PULL", false), "Pull image before running.") + fs.BoolVar(&cfg.Replace, "replace", getenvBool("RAP_DOCKER_REPLACE", false), "Remove an existing container with the same name before run.") + fs.BoolVar(&cfg.DockerVPNGatewayEnabled, "docker-vpn-gateway-enabled", getenvBool("RAP_DOCKER_VPN_GATEWAY_ENABLED", false), "Run Docker node-agent with NET_ADMIN and /dev/net/tun for VPN gateway mode.") + fs.StringVar(&cfg.ImageArtifactSHA256, "image-artifact-sha256", getenv("RAP_NODE_AGENT_IMAGE_ARTIFACT_SHA256", ""), "Expected SHA-256 for a Docker image tar artifact.") + fs.Int64Var(&cfg.ImageArtifactSizeBytes, "image-artifact-size-bytes", getenvInt64("RAP_NODE_AGENT_IMAGE_ARTIFACT_SIZE_BYTES", 0), "Expected byte size for a Docker image tar artifact (used as a best-effort check when sha256 is provided).") + fs.BoolVar(&dryRun, "dry-run", false, "Print the docker command with secrets redacted.") + fs.BoolVar(&autoUpdateEnabled, "auto-update-enabled", getenvBool("RAP_AUTO_UPDATE_ENABLED", true), "Install and start the local update-loop service.") + fs.BoolVar(&autoUpdate.InstallSelfUpdater, "host-agent-self-update-enabled", getenvBool("RAP_HOST_AGENT_SELF_UPDATE_ENABLED", true), "Install and start one global host-agent binary self-updater service.") + fs.StringVar(&autoUpdate.CurrentVersion, 
"auto-update-current-version", getenv("RAP_NODE_AGENT_VERSION", agent.Version), "Initial node-agent version used by update-loop before the first successful update.") + fs.StringVar(&autoUpdate.SelfUpdateVersion, "host-agent-current-version", getenv("RAP_HOST_AGENT_VERSION", agent.Version), "Initial host-agent binary version used by the self-updater.") + fs.StringVar(&autoUpdate.Channel, "auto-update-channel", getenv("RAP_UPDATE_CHANNEL", ""), "Optional update channel override for update-loop.") + fs.IntVar(&autoUpdate.IntervalSeconds, "auto-update-interval-seconds", getenvInt("RAP_UPDATE_INTERVAL_SECONDS", 21600), "Emergency fallback plan poll interval in seconds. Update-service/heartbeat hints trigger normal runs.") + fs.IntVar(&autoUpdate.InitialDelaySeconds, "auto-update-initial-delay-seconds", getenvInt("RAP_UPDATE_INITIAL_DELAY_SECONDS", 15), "Update-loop initial delay in seconds.") + fs.Float64Var(&autoUpdate.Jitter, "auto-update-jitter", getenvFloat("RAP_UPDATE_JITTER", 0.15), "Update-loop interval jitter, 0..1.") + fs.IntVar(&autoUpdate.HealthTimeoutSec, "auto-update-health-timeout-seconds", getenvInt("RAP_UPDATE_HEALTH_TIMEOUT_SECONDS", 30), "Updated container running-state timeout in seconds.") + fs.StringVar(&autoUpdate.BinaryInstallPath, "auto-update-binary-path", getenv("RAP_HOST_AGENT_BINARY_PATH", hostagent.DefaultHostAgentInstallPath), "Persistent host path for rap-host-agent binary used by the service.") + fs.BoolVar(&cfg.WorkloadSupervisionEnabled, "workload-supervision-enabled", getenvBool("RAP_WORKLOAD_SUPERVISION_ENABLED", false), "Enable node-agent workload status reporting.") + fs.BoolVar(&cfg.MeshSyntheticRuntimeEnabled, "mesh-synthetic-runtime-enabled", getenvBool("RAP_MESH_SYNTHETIC_RUNTIME_ENABLED", false), "Enable synthetic mesh runtime.") + fs.BoolVar(&cfg.MeshProductionForwardingEnabled, "mesh-production-forwarding-enabled", getenvBool("RAP_MESH_PRODUCTION_FORWARDING_ENABLED", false), "Enable production forwarding gate; runtime still 
fail-closed if unavailable.") + fs.StringVar(&cfg.MeshListenAddr, "mesh-listen-addr", getenv("RAP_MESH_LISTEN_ADDR", ""), "Synthetic mesh HTTP listen address inside container.") + fs.StringVar(&cfg.MeshListenPortMode, "mesh-listen-port-mode", getenv("RAP_MESH_LISTEN_PORT_MODE", ""), "Mesh listen port behavior: manual, auto, or disabled.") + fs.IntVar(&cfg.MeshListenAutoPortStart, "mesh-listen-auto-port-start", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_START", 0), "First port used when mesh listen port mode is auto.") + fs.IntVar(&cfg.MeshListenAutoPortEnd, "mesh-listen-auto-port-end", getenvInt("RAP_MESH_LISTEN_AUTO_PORT_END", 0), "Last port used when mesh listen port mode is auto.") + fs.StringVar(&cfg.MeshAdvertiseEndpoint, "mesh-advertise-endpoint", getenv("RAP_MESH_ADVERTISE_ENDPOINT", ""), "Advertised mesh endpoint.") + fs.StringVar(&cfg.MeshAdvertiseEndpointsJSON, "mesh-advertise-endpoints-json", getenv("RAP_MESH_ADVERTISE_ENDPOINTS_JSON", ""), "Advertised endpoint candidates JSON.") + fs.StringVar(&cfg.MeshAdvertiseTransport, "mesh-advertise-transport", getenv("RAP_MESH_ADVERTISE_TRANSPORT", ""), "Advertised transport.") + fs.StringVar(&cfg.MeshConnectivityMode, "mesh-connectivity-mode", getenv("RAP_MESH_CONNECTIVITY_MODE", ""), "Connectivity mode hint.") + fs.StringVar(&cfg.MeshNATType, "mesh-nat-type", getenv("RAP_MESH_NAT_TYPE", ""), "NAT type hint.") + fs.StringVar(&cfg.MeshRegion, "mesh-region", getenv("RAP_MESH_REGION", ""), "Region/site hint.") + fs.IntVar(&cfg.HeartbeatIntervalSeconds, "heartbeat-interval-seconds", getenvInt("RAP_HEARTBEAT_INTERVAL_SECONDS", 15), "Heartbeat interval seconds.") + fs.IntVar(&cfg.EnrollmentPollIntervalSeconds, "enrollment-poll-interval-seconds", getenvInt("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5), "Enrollment poll interval seconds.") + fs.IntVar(&cfg.EnrollmentPollTimeoutSeconds, "enrollment-poll-timeout-seconds", getenvInt("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 0), "Enrollment approval timeout seconds. 
Use 0 to wait indefinitely.") + fs.IntVar(&cfg.ProductionObservationSinkCap, "production-observation-sink-capacity", getenvInt("RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY", 0), "Production observation sink capacity.") + extraEnv := repeatedFlag{} + extraRunArg := repeatedFlag{} + imageArtifactURL := repeatedFlag{} + fs.Var(&extraEnv, "env", "Extra KEY=VALUE env passed to node-agent container; may be repeated.") + fs.Var(&extraRunArg, "docker-run-arg", "Extra raw docker run argument; may be repeated.") + fs.Var(&imageArtifactURL, "image-artifact-url", "Docker image tar artifact URL to docker load before running; may be repeated.") + if err := fs.Parse(args); err != nil { + return installCommandConfig{}, err + } + cfg.ExtraEnv = extraEnv + cfg.AdditionalDockerRunArgs = extraRunArg + cfg.ImageArtifactURLs = append(cfg.ImageArtifactURLs, imageArtifactURL...) + if strings.TrimSpace(profileURL) != "" || strings.TrimSpace(installToken) != "" { + profile, err := hostagent.FetchDockerInstallProfile(context.Background(), hostagent.ProfileRequest{ + URL: profileURL, + ClusterID: cfg.ClusterID, + InstallToken: installToken, + NodeName: cfg.NodeName, + }) + if err != nil { + return installCommandConfig{}, err + } + profileCfg := hostagent.RuntimeConfigFromProfile(profile) + profileCfg.ExtraEnv = cfg.ExtraEnv + profileCfg.AdditionalDockerRunArgs = cfg.AdditionalDockerRunArgs + profileCfg.DockerVPNGatewayEnabled = profileCfg.DockerVPNGatewayEnabled || cfg.DockerVPNGatewayEnabled + if len(imageArtifactURL) > 0 { + profileCfg.ImageArtifactURLs = append([]string(nil), imageArtifactURL...) 
+ } + if cfg.ImageArtifactSHA256 != "" { + profileCfg.ImageArtifactSHA256 = cfg.ImageArtifactSHA256 + } + if cfg.ImageArtifactSizeBytes > 0 { + profileCfg.ImageArtifactSizeBytes = cfg.ImageArtifactSizeBytes + } + cfg = profileCfg + } + if err := cfg.ValidateInstall(); err != nil { + return installCommandConfig{}, err + } + return installCommandConfig{ + Runtime: cfg, + DryRun: dryRun, + AutoUpdateEnabled: autoUpdateEnabled, + AutoUpdate: autoUpdate, + }, nil +} + +type repeatedFlag []string + +func (f *repeatedFlag) String() string { + return strings.Join(*f, ",") +} + +func (f *repeatedFlag) Set(value string) error { + *f = append(*f, value) + return nil +} + +func getenv(key, fallback string) string { + if value := strings.TrimSpace(os.Getenv(key)); value != "" { + return value + } + return fallback +} + +func getenvBool(key string, fallback bool) bool { + switch strings.ToLower(strings.TrimSpace(os.Getenv(key))) { + case "1", "true", "yes", "y", "on": + return true + case "0", "false", "no", "n", "off": + return false + default: + return fallback + } +} + +func getenvInt(key string, fallback int) int { + var out int + if _, err := fmt.Sscanf(strings.TrimSpace(os.Getenv(key)), "%d", &out); err == nil { + return out + } + return fallback +} + +func getenvInt64(key string, fallback int64) int64 { + var out int64 + if _, err := fmt.Sscanf(strings.TrimSpace(os.Getenv(key)), "%d", &out); err == nil { + return out + } + return fallback +} + +func getenvFloat(key string, fallback float64) float64 { + var out float64 + if _, err := fmt.Sscanf(strings.TrimSpace(os.Getenv(key)), "%f", &out); err == nil { + return out + } + return fallback +} + +func shellJoin(args []string) string { + parts := make([]string, 0, len(args)) + for _, arg := range args { + if strings.ContainsAny(arg, " \t\"'") { + parts = append(parts, `"`+strings.ReplaceAll(arg, `"`, `\"`)+`"`) + } else { + parts = append(parts, arg) + } + } + return strings.Join(parts, " ") +} + +func usage() { + 
fmt.Fprintln(os.Stderr, `usage: + rap-host-agent install -profile-url URL -install-token TOKEN [-node-name NAME] [docker options] + rap-host-agent install -backend-url URL -cluster-id ID -join-token TOKEN -node-name NAME [docker options] + rap-host-agent install-windows -profile-url URL -install-token TOKEN [-node-name NAME] [windows options] + rap-host-agent install-linux -profile-url URL -install-token TOKEN [-node-name NAME] [linux/systemd options] + rap-host-agent install-updater -backend-url URL -cluster-id ID -state-dir DIR -container-name NAME + rap-host-agent update-host-agent -backend-url URL -cluster-id ID -state-dir DIR + rap-host-agent update-host-agent-loop -backend-url URL -cluster-id ID -state-dir DIR + rap-host-agent update -backend-url URL -cluster-id ID -node-id ID [-container-name NAME] + rap-host-agent update-loop -backend-url URL -cluster-id ID -node-id ID [-container-name NAME] + rap-host-agent status [-container-name NAME]`) +} diff --git a/agents/rap-node-agent/cmd/rap-node-agent/main.go b/agents/rap-node-agent/cmd/rap-node-agent/main.go index c78daf5..a62054c 100644 --- a/agents/rap-node-agent/cmd/rap-node-agent/main.go +++ b/agents/rap-node-agent/cmd/rap-node-agent/main.go @@ -5,12 +5,16 @@ import ( "encoding/json" "fmt" "log" + "net" "net/http" "os" + "os/exec" "os/signal" "path/filepath" + "runtime" "sort" "strings" + "sync/atomic" "syscall" "time" @@ -18,9 +22,11 @@ import ( "github.com/example/remote-access-platform/agents/rap-node-agent/internal/authority" "github.com/example/remote-access-platform/agents/rap-node-agent/internal/client" "github.com/example/remote-access-platform/agents/rap-node-agent/internal/config" + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/hostagent" "github.com/example/remote-access-platform/agents/rap-node-agent/internal/mesh" "github.com/example/remote-access-platform/agents/rap-node-agent/internal/state" 
"github.com/example/remote-access-platform/agents/rap-node-agent/internal/supervisor" + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/vpnruntime" ) const ( @@ -72,7 +78,8 @@ func main() { } log.Printf("node-agent started: node_id=%s cluster_id=%s backend=%s", identity.NodeID, identity.ClusterID, cfg.BackendURL) - meshState, stopMeshEndpoint, err := startSyntheticMeshEndpoint(ctx, cancel, cfg, identity, api) + vpnGateway := &vpnruntime.Gateway{API: api} + meshState, stopMeshEndpoint, err := startSyntheticMeshEndpoint(ctx, cancel, cfg, identity, api, vpnGateway) if err != nil { log.Fatalf("start synthetic mesh endpoint: %v", err) } @@ -88,17 +95,33 @@ func main() { log.Printf("heartbeat failed: %v", err) } if flags.Enabled && flags.TelemetryEnabled { - if err := api.ReportTelemetry(ctx, identity.ClusterID, identity.NodeID, agent.TelemetryPayload(identity, startedAt)); err != nil { + telemetry := agent.TelemetryPayload(identity, startedAt) + if telemetry.Payload == nil { + telemetry.Payload = map[string]any{} + } + if meshState != nil && meshState.ServiceChannelAccessStats != nil { + telemetry.Payload["fabric_service_channel_access_report"] = meshState.ServiceChannelAccessStats.Report(time.Now().UTC()) + } + if meshState != nil && meshState.RemoteWorkspaceFrameSink != nil { + telemetry.Payload["remote_workspace_adapter_sink_report"] = meshState.RemoteWorkspaceFrameSink.Report(time.Now().UTC()) + } + if err := api.ReportTelemetry(ctx, identity.ClusterID, identity.NodeID, telemetry); err != nil { log.Printf("telemetry failed: %v", err) } else { log.Printf("telemetry sent: node_id=%s cluster_id=%s scopes=%v", identity.NodeID, identity.ClusterID, flags.AppliedScopes) } } if cfg.WorkloadSupervisionEnabled { - if err := reportWorkloadStatus(ctx, api, supervisor, identity); err != nil { + if err := reportWorkloadStatus(ctx, api, supervisor, identity, meshState); err != nil { log.Printf("workload status failed: %v", err) } } + if err := 
ensureVPNGatewayRuntime(ctx, api, identity, vpnGateway, meshState); err != nil { + log.Printf("vpn gateway runtime failed: %v", err) + } + if err := reportVPNAssignmentStatus(ctx, api, identity, vpnGateway); err != nil { + log.Printf("vpn assignment status failed: %v", err) + } logProductionObservationSinkMetrics(meshState) if flags.Enabled && flags.SyntheticLinksEnabled { if err := api.ReportMeshLink(ctx, identity.ClusterID, agent.MeshSelfObservationPayload(identity)); err != nil { @@ -211,7 +234,7 @@ func ensureApprovedIdentity(ctx context.Context, cfg config.Config, identity sta } else { log.Printf("enrollment bootstrap poll failed: %v", err) } - if cfg.EnrollmentPollTimeout == 0 || (!deadline.IsZero() && !time.Now().UTC().Before(deadline)) { + if cfg.EnrollmentPollTimeout > 0 && !deadline.IsZero() && !time.Now().UTC().Before(deadline) { return identity, nil } select { @@ -287,33 +310,231 @@ func verifyEnrollmentBootstrap(bootstrap client.NodeBootstrap, identity state.Id } type syntheticMeshState struct { - Runtime *mesh.SyntheticRuntime - Routes []mesh.SyntheticRoute - RouteHealthRoutes []mesh.SyntheticRoute - Source string - PeerCache *mesh.PeerCache - RendezvousLeases []mesh.PeerRendezvousLease - RoutePathDecisions *client.RoutePathDecisionReport - RouteGenerationTracker *meshRouteGenerationTracker - ConfigVersion string - PeerDirectoryVersion string - PolicyVersion string - PeerConnections *mesh.PeerConnectionTracker - PeerConnectionManager *mesh.PeerConnectionManager - LastPeerRecoveryPlan *mesh.PeerRecoveryPlan - LastPeerConnectionIntent *mesh.PeerConnectionIntentPlan - LastConfigRefreshAt time.Time - LastLeaseRefresh *meshRendezvousLeaseRefreshState - LeaseRefreshAttempts int - LeaseRefreshSuccesses int - LeaseRefreshFailures int - LastRouteHealthRefresh *meshRouteHealthFeedbackRefreshState - RouteHealthRefreshAttempts int - RouteHealthRefreshSuccesses int - RouteHealthRefreshFailures int - RouteHealthRefreshSuppressed int - ProductionObservationSink 
*mesh.ProductionEnvelopeObservationSink - LastProductionSinkMetrics *mesh.ProductionEnvelopeObservationSinkMetrics + Runtime *mesh.SyntheticRuntime + Routes []mesh.SyntheticRoute + RouteHealthRoutes []mesh.SyntheticRoute + Source string + PeerCache *mesh.PeerCache + RendezvousLeases []mesh.PeerRendezvousLease + RoutePathDecisions *client.RoutePathDecisionReport + ServiceChannelFeedback *client.FabricServiceChannelFeedbackReport + ServiceChannelAdaptivePolicy *client.FabricServiceChannelAdaptivePolicy + ServiceChannelRemediationCommands []client.FabricServiceChannelRemediationCommand + RouteGenerationTracker *meshRouteGenerationTracker + ConfigVersion string + PeerDirectoryVersion string + PolicyVersion string + PeerConnections *mesh.PeerConnectionTracker + PeerConnectionManager *mesh.PeerConnectionManager + LastPeerRecoveryPlan *mesh.PeerRecoveryPlan + LastPeerConnectionIntent *mesh.PeerConnectionIntentPlan + LastConfigRefreshAt time.Time + LastLeaseRefresh *meshRendezvousLeaseRefreshState + LeaseRefreshAttempts int + LeaseRefreshSuccesses int + LeaseRefreshFailures int + LastRouteHealthRefresh *meshRouteHealthFeedbackRefreshState + RouteHealthRefreshAttempts int + RouteHealthRefreshSuccesses int + RouteHealthRefreshFailures int + RouteHealthRefreshSuppressed int + ProductionObservationSink *mesh.ProductionEnvelopeObservationSink + ProductionForwardTransport mesh.ProductionForwardTransport + ProductionForwardingEnabled bool + VPNFabricInbox *vpnruntime.FabricPacketInbox + VPNFabricIngress *vpnruntime.FabricClientPacketIngress + VPNGateway *vpnruntime.Gateway + ServiceChannelAccessStats *fabricServiceChannelAccessStats + RemoteWorkspaceFrameSink *mesh.RemoteWorkspaceFrameProbeSink + LastProductionSinkMetrics *mesh.ProductionEnvelopeObservationSinkMetrics + ListenerReport meshListenerReport + ListenerConfigKey string + ListenerRuntimeConfig config.Config + ListenerHandler *dynamicHTTPHandler + StopListener func() + ConfigLoadError string +} + +type 
fabricServiceChannelAccessStats struct { + Total atomic.Int64 + Signed atomic.Int64 + Introspection atomic.Int64 + LegacyUnsigned atomic.Int64 + BackendFallback atomic.Int64 + BackendFallbackBlocked atomic.Int64 + FabricRouteSendFailure atomic.Int64 + DataPlaneContract atomic.Int64 + LastAcceptedUnixSec atomic.Int64 + LastDataPlaneMode atomic.Value + LastWorkingData atomic.Value + LastSteadyState atomic.Value + LastBackendRelay atomic.Value + LastLogicalFlowMode atomic.Value + LastViolationStatus atomic.Value + LastViolationReason atomic.Value +} + +func newFabricServiceChannelAccessStats() *fabricServiceChannelAccessStats { + return &fabricServiceChannelAccessStats{} +} + +func (s *fabricServiceChannelAccessStats) Observe(entry mesh.FabricServiceChannelAccessLogEntry) { + if s == nil { + return + } + s.Total.Add(1) + switch strings.TrimSpace(entry.AcceptedBy) { + case "signed": + s.Signed.Add(1) + case "introspection": + s.Introspection.Add(1) + case "legacy_unsigned": + s.LegacyUnsigned.Add(1) + } + if entry.ForceBackendFallback && strings.TrimSpace(entry.BackendRelayPolicy) != "disabled" { + s.BackendFallback.Add(1) + } + switch strings.TrimSpace(entry.ViolationStatus) { + case "backend_fallback_blocked_by_policy": + s.BackendFallbackBlocked.Add(1) + case "fabric_route_send_failed_backend_fallback_blocked": + s.BackendFallbackBlocked.Add(1) + s.FabricRouteSendFailure.Add(1) + } + if strings.TrimSpace(entry.ViolationStatus) != "" { + s.LastViolationStatus.Store(strings.TrimSpace(entry.ViolationStatus)) + s.LastViolationReason.Store(strings.TrimSpace(entry.ViolationReason)) + } + if entry.DataPlaneValid { + s.DataPlaneContract.Add(1) + s.LastDataPlaneMode.Store(strings.TrimSpace(entry.DataPlaneMode)) + s.LastWorkingData.Store(strings.TrimSpace(entry.WorkingDataTransport)) + s.LastSteadyState.Store(strings.TrimSpace(entry.SteadyStateTransport)) + s.LastBackendRelay.Store(strings.TrimSpace(entry.BackendRelayPolicy)) + 
s.LastLogicalFlowMode.Store(strings.TrimSpace(entry.LogicalFlowMode)) + } + occurredAt := entry.OccurredAt + if occurredAt.IsZero() { + occurredAt = time.Now().UTC() + } + s.LastAcceptedUnixSec.Store(occurredAt.Unix()) +} + +func (s *fabricServiceChannelAccessStats) Report(observedAt time.Time) map[string]any { + if s == nil { + return nil + } + if observedAt.IsZero() { + observedAt = time.Now().UTC() + } + report := map[string]any{ + "schema_version": "c18z52.fabric_service_channel_access_report.v1", + "observed_at": observedAt.UTC().Format(time.RFC3339Nano), + "total": s.Total.Load(), + "signed": s.Signed.Load(), + "introspection": s.Introspection.Load(), + "legacy_unsigned": s.LegacyUnsigned.Load(), + "backend_fallback": s.BackendFallback.Load(), + "backend_fallback_blocked": s.BackendFallbackBlocked.Load(), + "fabric_route_send_failure": s.FabricRouteSendFailure.Load(), + "data_plane_contract": s.DataPlaneContract.Load(), + "accepted_by_signed": s.Signed.Load(), + "accepted_by_introspection": s.Introspection.Load(), + "accepted_by_legacy_unsigned": s.LegacyUnsigned.Load(), + } + if value, ok := s.LastDataPlaneMode.Load().(string); ok && value != "" { + report["last_data_plane_mode"] = value + } + if value, ok := s.LastWorkingData.Load().(string); ok && value != "" { + report["last_working_data_transport"] = value + } + if value, ok := s.LastSteadyState.Load().(string); ok && value != "" { + report["last_steady_state_transport"] = value + } + if value, ok := s.LastBackendRelay.Load().(string); ok && value != "" { + report["last_backend_relay_policy"] = value + } + if value, ok := s.LastLogicalFlowMode.Load().(string); ok && value != "" { + report["last_logical_flow_mode"] = value + } + if value, ok := s.LastViolationStatus.Load().(string); ok && value != "" { + report["last_data_plane_violation_status"] = value + } + if value, ok := s.LastViolationReason.Load().(string); ok && value != "" { + report["last_data_plane_violation_reason"] = value + } + if last := 
s.LastAcceptedUnixSec.Load(); last > 0 { + report["last_accepted_at"] = time.Unix(last, 0).UTC().Format(time.RFC3339Nano) + } + return report +} + +type dynamicHTTPHandler struct { + current atomic.Value +} + +func newDynamicHTTPHandler(handler http.Handler) *dynamicHTTPHandler { + out := &dynamicHTTPHandler{} + out.Update(handler) + return out +} + +func (h *dynamicHTTPHandler) Update(handler http.Handler) { + if handler == nil { + handler = http.NotFoundHandler() + } + h.current.Store(handler) +} + +func (h *dynamicHTTPHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) { + if h == nil { + http.NotFound(w, r) + return + } + handler, _ := h.current.Load().(http.Handler) + if handler == nil { + http.NotFound(w, r) + return + } + handler.ServeHTTP(w, r) +} + +type meshListenerReport struct { + SchemaVersion string `json:"schema_version"` + ConfiguredListenAddr string `json:"configured_listen_addr,omitempty"` + EffectiveListenAddr string `json:"effective_listen_addr,omitempty"` + ListenPortMode string `json:"listen_port_mode"` + Status string `json:"status"` + InboundReachability string `json:"inbound_reachability"` + ControlPlaneReachable bool `json:"control_plane_reachable"` + OneWayConnectivity bool `json:"one_way_connectivity"` + FailureReason string `json:"failure_reason,omitempty"` + FailureError string `json:"failure_error,omitempty"` + PortConflict bool `json:"port_conflict,omitempty"` + AutoPortSelected bool `json:"auto_port_selected,omitempty"` + ObservedAt string `json:"observed_at"` +} + +type meshOutboundSessionReport struct { + SchemaVersion string `json:"schema_version"` + Status string `json:"status"` + Direction string `json:"direction"` + Transport string `json:"transport"` + ControlPlaneURL string `json:"control_plane_url,omitempty"` + ConnectivityMode string `json:"connectivity_mode,omitempty"` + InboundListenerRequired bool `json:"inbound_listener_required"` + UsableForInboundControl bool `json:"usable_for_inbound_control"` + 
ListenerStatus string `json:"listener_status,omitempty"` + ListenerFailureReason string `json:"listener_failure_reason,omitempty"` + ListenerPortConflict bool `json:"listener_port_conflict,omitempty"` + ConfigLoadError string `json:"config_load_error,omitempty"` + PeerConnectionReady int `json:"peer_connection_ready"` + PeerConnectionRelayReady int `json:"peer_connection_relay_ready"` + PeerConnectionWaiting int `json:"peer_connection_waiting"` + RendezvousLeaseCount int `json:"rendezvous_lease_count"` + ProductionForwarding bool `json:"production_forwarding"` + ServiceWorkloadTraffic bool `json:"service_workload_traffic"` + ObservedAt string `json:"observed_at"` } type meshRendezvousLeaseRefreshState struct { @@ -361,35 +582,46 @@ type meshRouteHealthFeedbackRefreshState struct { } type loadedSyntheticMeshConfig struct { - PeerEndpoints map[string]string - PeerEndpointCandidates map[string][]mesh.PeerEndpointCandidate - PeerDirectory []mesh.PeerDirectoryEntry - RecoverySeeds []mesh.PeerRecoverySeed - RendezvousLeases []mesh.PeerRendezvousLease - RoutePathDecisions *client.RoutePathDecisionReport - Routes []mesh.SyntheticRoute - Source string - ConfigVersion string - PeerDirectoryVersion string - PolicyVersion string + PeerEndpoints map[string]string + PeerEndpointCandidates map[string][]mesh.PeerEndpointCandidate + PeerDirectory []mesh.PeerDirectoryEntry + RecoverySeeds []mesh.PeerRecoverySeed + RendezvousLeases []mesh.PeerRendezvousLease + RoutePathDecisions *client.RoutePathDecisionReport + ServiceChannelFeedback *client.FabricServiceChannelFeedbackReport + ServiceChannelRemediationCommands []client.FabricServiceChannelRemediationCommand + ServiceChannelAdaptivePolicy *client.FabricServiceChannelAdaptivePolicy + MeshListener *client.MeshListenerConfig + Routes []mesh.SyntheticRoute + Source string + ConfigVersion string + PeerDirectoryVersion string + PolicyVersion string + ProductionForwarding bool } -func startSyntheticMeshEndpoint(ctx context.Context, cancel 
context.CancelFunc, cfg config.Config, identity state.Identity, api *client.Client) (*syntheticMeshState, func(), error) { +func startSyntheticMeshEndpoint(ctx context.Context, _ context.CancelFunc, cfg config.Config, identity state.Identity, api *client.Client, vpnGateway *vpnruntime.Gateway) (*syntheticMeshState, func(), error) { noop := func() {} if !cfg.MeshSyntheticRuntimeEnabled { return nil, noop, nil } - if cfg.MeshListenAddr == "" { - log.Print("synthetic mesh runtime enabled, but RAP_MESH_LISTEN_ADDR is empty; endpoint not started") - return nil, noop, nil - } local := mesh.PeerIdentity{ClusterID: identity.ClusterID, NodeID: identity.NodeID} loadedConfig, err := loadSyntheticMeshConfig(ctx, cfg, identity, api) if err != nil { - return nil, noop, err + log.Printf("synthetic mesh config load failed; starting diagnostics-only mesh state: %v", err) + loadedConfig = loadedSyntheticMeshConfig{ + PeerEndpoints: map[string]string{}, + PeerEndpointCandidates: map[string][]mesh.PeerEndpointCandidate{}, + PeerDirectory: []mesh.PeerDirectoryEntry{}, + RecoverySeeds: []mesh.PeerRecoverySeed{}, + RendezvousLeases: []mesh.PeerRendezvousLease{}, + Routes: []mesh.SyntheticRoute{}, + Source: "config_load_failed", + } } peerEndpoints := loadedConfig.PeerEndpoints routes := loadedConfig.Routes + productionForwardingEnabled := cfg.MeshProductionForwardingEnabled || loadedConfig.ProductionForwarding routeHealthRoutes := routeHealthRoutesFromPathDecisions(routes, loadedConfig.RoutePathDecisions) peerCache := mesh.NewPeerCache(mesh.PeerCacheConfig{ Local: local, @@ -426,7 +658,7 @@ func startSyntheticMeshEndpoint(ctx context.Context, cancel context.CancelFunc, RendezvousLeases: loadedConfig.RendezvousLeases, }) routeGenerationTracker := newMeshRouteGenerationTracker(loadedConfig.RoutePathDecisions, time.Now().UTC()) - gateEnabled, runtimeEnabled := productionForwardingLogState(cfg) + gateEnabled, runtimeEnabled := productionForwardingLogState(cfg, 
loadedConfig.ProductionForwarding) log.Printf( "synthetic mesh config loaded: source=%s node_id=%s cluster_id=%s peers=%d routes=%d peer_cache_peers=%d warm_peers=%d recovery_seeds=%d rendezvous_leases=%d peer_connection_states=%d peer_recovery_mode=%s peer_recovery_target_ready_peers=%d peer_connection_intents=%d rendezvous_required=%d rendezvous_resolved=%d production_forwarding_gate_enabled=%t production_forwarding_runtime_enabled=%t", loadedConfig.Source, @@ -468,43 +700,335 @@ func startSyntheticMeshEndpoint(ctx context.Context, cancel context.CancelFunc, productionEnvelopeObserver = productionObservationSink.Observe } var productionForwardTransport mesh.ProductionForwardTransport - if cfg.MeshProductionForwardingEnabled { + if productionForwardingEnabled { productionForwardTransport = mesh.NewHTTPProductionForwardTransport(peerEndpoints) } + vpnFabricInbox := vpnruntime.NewFabricPacketInbox(4096) + serviceChannelAccessStats := newFabricServiceChannelAccessStats() + remoteWorkspaceFrameSink := mesh.NewRemoteWorkspaceFrameProbeSink() + vpnFabricIngress := &vpnruntime.FabricClientPacketIngress{ + ForwardTransport: productionForwardTransport, + Inbox: vpnFabricInbox, + FlowScheduler: vpnruntime.NewFabricFlowScheduler(0, 0), + MaxParallelFlowSends: 4, + ClusterID: identity.ClusterID, + LocalNodeID: identity.NodeID, + LocalGateway: func(vpnConnectionID string) bool { + return vpnGateway != nil && vpnGateway.IsReadyForConnection(vpnConnectionID) + }, + Routes: func() []mesh.SyntheticRoute { + return routes + }, + } + initialRouteManagerAt := time.Now().UTC() + vpnFabricIngress.UpdateRouteManager(routeManagerDecisionsFromControlPlane(loadedConfig.RoutePathDecisions, loadedConfig.ServiceChannelRemediationCommands), loadedConfig.ConfigVersion, initialRouteManagerAt) + vpnFabricIngress.UpdateRouteQualityPreferences(routeQualityPreferencesFromServiceChannelFeedback(loadedConfig.ServiceChannelFeedback, initialRouteManagerAt), initialRouteManagerAt) + serverHandler := 
mesh.Server{ + Local: local, + SyntheticRuntime: runtime, + ProductionForwardingEnabled: productionForwardingEnabled, + ProductionEnvelopeObserver: productionEnvelopeObserver, + ProductionEnvelopeDelivery: vpnFabricInbox.DeliverProductionEnvelope, + ProductionForwardTransport: productionForwardTransport, + ProductionForwardLogger: func(entry mesh.ProductionForwardLogEntry) { + payload, err := json.Marshal(entry) + if err != nil { + log.Printf("mesh production forward event marshal failed: %v", err) + return + } + log.Printf("mesh_production_forward_event=%s", string(payload)) + }, + FabricServiceChannelLogger: func(entry mesh.FabricServiceChannelAccessLogEntry) { + serviceChannelAccessStats.Observe(entry) + payload, err := json.Marshal(entry) + if err != nil { + log.Printf("fabric service channel access event marshal failed: %v", err) + return + } + log.Printf("fabric_service_channel_access_event=%s", string(payload)) + }, + RemoteWorkspaceFrameSink: remoteWorkspaceFrameSink, + ProductionRoutes: routes, + VPNPacketIngress: vpnFabricIngress, + BackendProxyBaseURL: cfg.BackendURL, + ClusterAuthorityPublicKey: firstNonEmpty(identity.ClusterAuthorityPublicKey, cfg.ClusterAuthorityPublicKey), + }.Handler() + dynamicListenerHandler := newDynamicHTTPHandler(serverHandler) + listenerCfg := meshListenerRuntimeConfig(cfg, loadedConfig.MeshListener) + listenerReport, stopListener := startSyntheticMeshHTTPServer(ctx, listenerCfg, identity, dynamicListenerHandler, len(peerEndpoints), len(routes), gateEnabled, runtimeEnabled) + return &syntheticMeshState{ + Runtime: runtime, + Routes: routes, + RouteHealthRoutes: routeHealthRoutes, + Source: loadedConfig.Source, + PeerCache: peerCache, + RendezvousLeases: loadedConfig.RendezvousLeases, + RoutePathDecisions: loadedConfig.RoutePathDecisions, + ServiceChannelFeedback: loadedConfig.ServiceChannelFeedback, + ServiceChannelRemediationCommands: append([]client.FabricServiceChannelRemediationCommand{}, 
loadedConfig.ServiceChannelRemediationCommands...), + RouteGenerationTracker: routeGenerationTracker, + ConfigVersion: loadedConfig.ConfigVersion, + PeerDirectoryVersion: loadedConfig.PeerDirectoryVersion, + PolicyVersion: loadedConfig.PolicyVersion, + LastConfigRefreshAt: time.Now().UTC(), + PeerConnections: peerConnections, + PeerConnectionManager: peerConnectionManager, + LastPeerRecoveryPlan: &peerRecoveryPlan, + LastPeerConnectionIntent: &peerConnectionIntentPlan, + ProductionObservationSink: productionObservationSink, + ProductionForwardTransport: productionForwardTransport, + ProductionForwardingEnabled: productionForwardingEnabled, + VPNFabricInbox: vpnFabricInbox, + VPNFabricIngress: vpnFabricIngress, + VPNGateway: vpnGateway, + ServiceChannelAccessStats: serviceChannelAccessStats, + RemoteWorkspaceFrameSink: remoteWorkspaceFrameSink, + ListenerReport: listenerReport, + ListenerConfigKey: meshListenerConfigKey(listenerCfg), + ListenerRuntimeConfig: listenerCfg, + ListenerHandler: dynamicListenerHandler, + StopListener: stopListener, + ConfigLoadError: errorString(err), + }, stopListener, nil +} + +func productionForwardingLogState(cfg config.Config, signedControlPlaneEnabled bool) (gateEnabled bool, runtimeEnabled bool) { + enabled := cfg.MeshProductionForwardingEnabled || signedControlPlaneEnabled + return enabled, enabled +} + +func newVPNFabricIngress(meshState *syntheticMeshState, identity state.Identity, routes []mesh.SyntheticRoute, decisions *client.RoutePathDecisionReport, remediationCommands []client.FabricServiceChannelRemediationCommand, serviceChannelFeedback *client.FabricServiceChannelFeedbackReport, adaptivePolicy *client.FabricServiceChannelAdaptivePolicy, configVersion string, vpnGateway *vpnruntime.Gateway) *vpnruntime.FabricClientPacketIngress { + if meshState == nil || meshState.VPNFabricInbox == nil { + return nil + } + ingress := meshState.VPNFabricIngress + if ingress == nil { + ingress = &vpnruntime.FabricClientPacketIngress{} + } + 
ingress.UpdateRuntime( + meshState.ProductionForwardTransport, + meshState.VPNFabricInbox, + identity.ClusterID, + identity.NodeID, + func(vpnConnectionID string) bool { + return vpnGateway != nil && vpnGateway.IsReadyForConnection(vpnConnectionID) + }, + func() []mesh.SyntheticRoute { + return routes + }, + serviceChannelRecoveryPolicyFingerprint(serviceChannelFeedback), + vpnruntimeAdaptivePolicy(adaptivePolicy), + ) + appliedAt := time.Now().UTC() + ingress.UpdateRouteManager(routeManagerDecisionsFromControlPlane(decisions, remediationCommands), configVersion, appliedAt) + ingress.UpdateRouteQualityPreferences(routeQualityPreferencesFromServiceChannelFeedback(serviceChannelFeedback, appliedAt), appliedAt) + return ingress +} + +func vpnruntimeAdaptivePolicy(policy *client.FabricServiceChannelAdaptivePolicy) vpnruntime.FabricServiceChannelAdaptivePolicy { + if policy == nil { + return vpnruntime.FabricServiceChannelAdaptivePolicy{} + } + return vpnruntime.FabricServiceChannelAdaptivePolicy{ + SchemaVersion: policy.SchemaVersion, + Fingerprint: policy.Fingerprint, + MaxParallelWindow: policy.MaxParallelWindow, + BulkPressureChannelThreshold: policy.BulkPressureChannelThreshold, + QueuePressureHighWatermark: policy.QueuePressureHighWatermark, + QueuePressureMaxInFlight: policy.QueuePressureMaxInFlight, + ClassWindows: policy.ClassWindows, + } +} + +func serviceChannelRecoveryPolicyFingerprint(report *client.FabricServiceChannelFeedbackReport) string { + if report == nil || report.RecoveryPolicy == nil { + return "" + } + return strings.TrimSpace(report.RecoveryPolicy.Fingerprint) +} + +func routeQualityPreferencesFromServiceChannelFeedback(report *client.FabricServiceChannelFeedbackReport, observedAt time.Time) []vpnruntime.FabricServiceChannelRouteQualityPreference { + if report == nil { + return nil + } + now := observedAt.UTC() + if now.IsZero() { + now = time.Now().UTC() + } + out := make([]vpnruntime.FabricServiceChannelRouteQualityPreference, 0, 
len(report.Observations)) + for _, observation := range report.Observations { + effectiveScore := observation.EffectiveScoreAdjustment + if effectiveScore <= 0 { + effectiveScore = observation.ScoreAdjustment + } + if strings.TrimSpace(observation.RouteID) == "" || strings.TrimSpace(observation.FeedbackStatus) != "healthy" || effectiveScore <= 0 { + continue + } + if !observation.ExpiresAt.IsZero() && !observation.ExpiresAt.After(now) { + continue + } + out = append(out, vpnruntime.FabricServiceChannelRouteQualityPreference{ + RouteID: observation.RouteID, + FeedbackStatus: observation.FeedbackStatus, + ScoreAdjustment: effectiveScore, + RawScoreAdjustment: observation.ScoreAdjustment, + Reasons: append([]string{}, observation.Reasons...), + LastSendDurationMs: observation.LastSendDurationMs, + ObservedAt: observation.ObservedAt.UTC().Format(time.RFC3339Nano), + ExpiresAt: observation.ExpiresAt.UTC().Format(time.RFC3339Nano), + }) + } + return out +} + +func routeManagerDecisionsFromPathDecisions(report *client.RoutePathDecisionReport) []vpnruntime.FabricServiceChannelRouteManagerDecision { + if report == nil { + return nil + } + out := make([]vpnruntime.FabricServiceChannelRouteManagerDecision, 0, len(report.Decisions)) + for _, decision := range report.Decisions { + if strings.TrimSpace(decision.RebuildStatus) == "" { + continue + } + out = append(out, vpnruntime.FabricServiceChannelRouteManagerDecision{ + RouteID: decision.RouteID, + ReplacementRouteID: decision.ReplacementRouteID, + RebuildRequestID: decision.RebuildRequestID, + RebuildStatus: decision.RebuildStatus, + RebuildReason: decision.RebuildReason, + RebuildAttempt: decision.RebuildAttempt, + DecisionSource: decision.DecisionSource, + Generation: decision.Generation, + EffectiveHops: append([]string{}, decision.EffectiveHops...), + }) + } + return out +} + +func routeManagerDecisionsFromControlPlane(report *client.RoutePathDecisionReport, commands []client.FabricServiceChannelRemediationCommand) 
[]vpnruntime.FabricServiceChannelRouteManagerDecision { + out := routeManagerDecisionsFromPathDecisions(report) + if len(commands) == 0 { + return out + } + decisionByRequestID := map[string]struct{}{} + for _, decision := range out { + if requestID := strings.TrimSpace(decision.RebuildRequestID); requestID != "" { + decisionByRequestID[requestID] = struct{}{} + } + } + now := time.Now().UTC() + for _, command := range commands { + action := strings.TrimSpace(command.Action) + if action != "prefer_alternate_route" && action != "rebuild_route" { + continue + } + guardStatus := strings.TrimSpace(command.GuardStatus) + if guardStatus != "" && guardStatus != "allowed" { + continue + } + primaryRouteID := strings.TrimSpace(command.PrimaryRouteID) + replacementRouteID := strings.TrimSpace(command.ReplacementRouteID) + if primaryRouteID == "" { + continue + } + if !command.ExpiresAt.IsZero() && !command.ExpiresAt.After(now) { + continue + } + if commandID := strings.TrimSpace(command.CommandID); commandID != "" { + if _, ok := decisionByRequestID[commandID]; ok { + continue + } + } + rebuildStatus := "pending_degraded_fallback" + if action == "prefer_alternate_route" { + if replacementRouteID == "" || primaryRouteID == replacementRouteID { + continue + } + rebuildStatus = "applied" + } + out = append(out, vpnruntime.FabricServiceChannelRouteManagerDecision{ + RouteID: primaryRouteID, + ReplacementRouteID: replacementRouteID, + RebuildRequestID: strings.TrimSpace(command.CommandID), + RebuildStatus: rebuildStatus, + RebuildReason: firstNonEmpty(command.Reason, "service_channel_remediation_"+action), + DecisionSource: "service_channel_remediation_command", + Generation: strings.TrimSpace(command.CommandID), + }) + } + return out +} + +func errorString(err error) string { + if err == nil { + return "" + } + return err.Error() +} + +func startSyntheticMeshHTTPServer(ctx context.Context, cfg config.Config, identity state.Identity, handler http.Handler, peerCount int, 
routeCount int, gateEnabled bool, runtimeEnabled bool) (meshListenerReport, func()) { + now := time.Now().UTC() + mode := defaultString(cfg.MeshListenPortMode, "manual") + baseReport := meshListenerReport{ + SchemaVersion: "c17z21.mesh_listener_report.v1", + ConfiguredListenAddr: cfg.MeshListenAddr, + ListenPortMode: mode, + Status: "disabled", + InboundReachability: "unavailable", + ControlPlaneReachable: true, + OneWayConnectivity: true, + ObservedAt: now.Format(time.RFC3339Nano), + } + if mode == "disabled" || strings.TrimSpace(cfg.MeshListenAddr) == "" { + if strings.TrimSpace(cfg.MeshListenAddr) == "" { + baseReport.FailureReason = "listen_addr_empty" + log.Print("synthetic mesh runtime enabled, but RAP_MESH_LISTEN_ADDR is empty; inbound endpoint disabled") + } else { + baseReport.FailureReason = "listen_disabled" + log.Printf("synthetic mesh endpoint disabled by listen port mode: node_id=%s cluster_id=%s", identity.NodeID, identity.ClusterID) + } + return baseReport, func() {} + } + + listener, effectiveAddr, autoSelected, bindErr := bindSyntheticMeshListener(cfg) + if bindErr != nil { + baseReport.Status = "listen_failed" + baseReport.FailureReason = "bind_failed" + baseReport.FailureError = bindErr.Error() + baseReport.PortConflict = isAddressInUse(bindErr) + log.Printf("synthetic mesh endpoint unavailable: listen_addr=%s mode=%s node_id=%s cluster_id=%s err=%v", cfg.MeshListenAddr, mode, identity.NodeID, identity.ClusterID, bindErr) + return baseReport, func() {} + } + + report := baseReport + report.Status = "listening" + if autoSelected { + report.Status = "auto_rebound" + } + report.EffectiveListenAddr = effectiveAddr + report.InboundReachability = reachabilityFromConnectivityMode(cfg.MeshConnectivityMode) + report.OneWayConnectivity = cfg.MeshConnectivityMode == "outbound_only" + report.AutoPortSelected = autoSelected server := &http.Server{ - Addr: cfg.MeshListenAddr, - Handler: mesh.Server{ - Local: local, - SyntheticRuntime: runtime, - 
ProductionForwardingEnabled: cfg.MeshProductionForwardingEnabled, - ProductionEnvelopeObserver: productionEnvelopeObserver, - ProductionForwardTransport: productionForwardTransport, - ProductionForwardLogger: func(entry mesh.ProductionForwardLogEntry) { - payload, err := json.Marshal(entry) - if err != nil { - log.Printf("mesh production forward event marshal failed: %v", err) - return - } - log.Printf("mesh_production_forward_event=%s", string(payload)) - }, - ProductionRoutes: routes, - }.Handler(), + Addr: effectiveAddr, + Handler: handler, ReadHeaderTimeout: 5 * time.Second, } go func() { log.Printf( - "synthetic mesh endpoint starting: listen_addr=%s node_id=%s cluster_id=%s peers=%d routes=%d production_forwarding_gate_enabled=%t production_forwarding_runtime_enabled=%t", + "synthetic mesh endpoint starting: listen_addr=%s effective_listen_addr=%s mode=%s node_id=%s cluster_id=%s peers=%d routes=%d production_forwarding_gate_enabled=%t production_forwarding_runtime_enabled=%t", cfg.MeshListenAddr, + effectiveAddr, + mode, identity.NodeID, identity.ClusterID, - len(peerEndpoints), - len(routes), + peerCount, + routeCount, gateEnabled, runtimeEnabled, ) - if err := server.ListenAndServe(); err != nil && err != http.ErrServerClosed { - log.Printf("synthetic mesh endpoint failed: %v", err) - cancel() + if err := server.Serve(listener); err != nil && err != http.ErrServerClosed { + log.Printf("synthetic mesh endpoint stopped unexpectedly: %v", err) } }() go func() { @@ -515,33 +1039,95 @@ func startSyntheticMeshEndpoint(ctx context.Context, cancel context.CancelFunc, log.Printf("synthetic mesh endpoint shutdown failed: %v", err) } }() - return &syntheticMeshState{ - Runtime: runtime, - Routes: routes, - RouteHealthRoutes: routeHealthRoutes, - Source: loadedConfig.Source, - PeerCache: peerCache, - RendezvousLeases: loadedConfig.RendezvousLeases, - RoutePathDecisions: loadedConfig.RoutePathDecisions, - RouteGenerationTracker: routeGenerationTracker, - ConfigVersion: 
loadedConfig.ConfigVersion, - PeerDirectoryVersion: loadedConfig.PeerDirectoryVersion, - PolicyVersion: loadedConfig.PolicyVersion, - LastConfigRefreshAt: time.Now().UTC(), - PeerConnections: peerConnections, - PeerConnectionManager: peerConnectionManager, - LastPeerRecoveryPlan: &peerRecoveryPlan, - LastPeerConnectionIntent: &peerConnectionIntentPlan, - ProductionObservationSink: productionObservationSink, - }, func() { - shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second) - defer shutdownCancel() - _ = server.Shutdown(shutdownCtx) - }, nil + return report, func() { + shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 5*time.Second) + defer shutdownCancel() + _ = server.Shutdown(shutdownCtx) + } } -func productionForwardingLogState(cfg config.Config) (gateEnabled bool, runtimeEnabled bool) { - return cfg.MeshProductionForwardingEnabled, cfg.MeshProductionForwardingEnabled +func meshListenerRuntimeConfig(base config.Config, desired *client.MeshListenerConfig) config.Config { + out := base + if desired == nil { + return out + } + if desired.ListenAddr != "" { + out.MeshListenAddr = strings.TrimSpace(desired.ListenAddr) + } + if desired.ListenPortMode != "" { + out.MeshListenPortMode = strings.ToLower(strings.TrimSpace(desired.ListenPortMode)) + } + if desired.DesiredState != "" && desired.DesiredState != "enabled" { + out.MeshListenPortMode = "disabled" + } + if desired.AutoPortStart > 0 { + out.MeshListenAutoPortStart = desired.AutoPortStart + } + if desired.AutoPortEnd > 0 { + out.MeshListenAutoPortEnd = desired.AutoPortEnd + } + if desired.AdvertiseEndpoint != "" { + out.MeshAdvertiseEndpoint = strings.TrimRight(strings.TrimSpace(desired.AdvertiseEndpoint), "/") + } + if desired.AdvertiseTransport != "" { + out.MeshAdvertiseTransport = strings.TrimSpace(desired.AdvertiseTransport) + } + if desired.ConnectivityMode != "" { + out.MeshConnectivityMode = strings.TrimSpace(desired.ConnectivityMode) + } + if 
desired.NATType != "" { + out.MeshNATType = strings.TrimSpace(desired.NATType) + } + if desired.Region != "" { + out.MeshRegion = strings.TrimSpace(desired.Region) + } + out.MeshProductionForwardingEnabled = base.MeshProductionForwardingEnabled || desired.ProductionForwarding + return out +} + +func meshListenerConfigKey(cfg config.Config) string { + return strings.Join([]string{ + strings.TrimSpace(cfg.MeshListenAddr), + strings.ToLower(strings.TrimSpace(cfg.MeshListenPortMode)), + fmt.Sprintf("%d", cfg.MeshListenAutoPortStart), + fmt.Sprintf("%d", cfg.MeshListenAutoPortEnd), + strings.TrimRight(strings.TrimSpace(cfg.MeshAdvertiseEndpoint), "/"), + strings.TrimSpace(cfg.MeshAdvertiseTransport), + strings.TrimSpace(cfg.MeshConnectivityMode), + strings.TrimSpace(cfg.MeshNATType), + strings.TrimSpace(cfg.MeshRegion), + fmt.Sprintf("%t", cfg.MeshProductionForwardingEnabled), + }, "|") +} + +func bindSyntheticMeshListener(cfg config.Config) (net.Listener, string, bool, error) { + listener, err := net.Listen("tcp", cfg.MeshListenAddr) + if err == nil { + return listener, listener.Addr().String(), false, nil + } + if cfg.MeshListenPortMode != "auto" { + return nil, "", false, err + } + host, _, splitErr := net.SplitHostPort(cfg.MeshListenAddr) + if splitErr != nil { + host = "" + } + for port := cfg.MeshListenAutoPortStart; port <= cfg.MeshListenAutoPortEnd; port++ { + addr := net.JoinHostPort(host, fmt.Sprintf("%d", port)) + listener, listenErr := net.Listen("tcp", addr) + if listenErr == nil { + return listener, listener.Addr().String(), true, nil + } + } + return nil, "", false, err +} + +func isAddressInUse(err error) bool { + if err == nil { + return false + } + text := strings.ToLower(err.Error()) + return strings.Contains(text, "address already in use") || strings.Contains(text, "only one usage of each socket address") } func productionEnvelopeObservationSinkFromConfig(cfg config.Config) *mesh.ProductionEnvelopeObservationSink { @@ -597,6 +1183,7 @@ func 
loadSyntheticMeshConfig(ctx context.Context, cfg config.Config, identity st ConfigVersion: scoped.ConfigVersion, PeerDirectoryVersion: scoped.PeerDirectoryVersion, PolicyVersion: scoped.PolicyVersion, + ProductionForwarding: false, }, nil } if api != nil { @@ -608,17 +1195,22 @@ func loadSyntheticMeshConfig(ctx context.Context, cfg config.Config, identity st } if err == nil && remote.Enabled { return loadedSyntheticMeshConfig{ - PeerEndpoints: remote.PeerEndpoints, - PeerEndpointCandidates: peerEndpointCandidatesFromControlPlane(remote.PeerEndpointCandidates), - PeerDirectory: peerDirectoryFromControlPlane(remote.PeerDirectory), - RecoverySeeds: recoverySeedsFromControlPlane(remote.RecoverySeeds), - RendezvousLeases: rendezvousLeasesFromControlPlane(remote.RendezvousLeases), - RoutePathDecisions: remote.RoutePathDecisions, - Routes: syntheticRoutesFromControlPlane(remote.Routes), - Source: "control_plane", - ConfigVersion: remote.ConfigVersion, - PeerDirectoryVersion: remote.PeerDirectoryVersion, - PolicyVersion: remote.PolicyVersion, + PeerEndpoints: remote.PeerEndpoints, + PeerEndpointCandidates: peerEndpointCandidatesFromControlPlane(remote.PeerEndpointCandidates), + PeerDirectory: peerDirectoryFromControlPlane(remote.PeerDirectory), + RecoverySeeds: recoverySeedsFromControlPlane(remote.RecoverySeeds), + RendezvousLeases: rendezvousLeasesFromControlPlane(remote.RendezvousLeases), + RoutePathDecisions: remote.RoutePathDecisions, + ServiceChannelFeedback: remote.ServiceChannelFeedback, + ServiceChannelAdaptivePolicy: remote.ServiceChannelAdaptivePolicy, + ServiceChannelRemediationCommands: append([]client.FabricServiceChannelRemediationCommand{}, remote.ServiceChannelRemediationCommands...), + MeshListener: remote.MeshListener, + Routes: syntheticRoutesFromControlPlane(remote.Routes), + Source: "control_plane", + ConfigVersion: remote.ConfigVersion, + PeerDirectoryVersion: remote.PeerDirectoryVersion, + PolicyVersion: remote.PolicyVersion, + ProductionForwarding: 
remote.ProductionForwarding, }, nil } if err != nil { @@ -706,8 +1298,11 @@ func verifyControlPlaneSyntheticMeshConfig(remote client.SyntheticMeshConfig, id if payload.ConfigVersion != remote.ConfigVersion { return fmt.Errorf("control-plane synthetic mesh config authority payload version mismatch") } - if !payload.ControlPlaneOnly || payload.ProductionForwarding || remote.ProductionForwarding { - return fmt.Errorf("control-plane synthetic mesh config authority payload forbids production forwarding") + if payload.ControlPlaneOnly == payload.ProductionForwarding { + return fmt.Errorf("synthetic mesh config authority payload control-plane/production forwarding flags mismatch") + } + if payload.ProductionForwarding != remote.ProductionForwarding { + return fmt.Errorf("synthetic mesh config authority payload production forwarding mismatch") } if !payload.ExpiresAt.IsZero() && !payload.ExpiresAt.After(time.Now().UTC()) { return fmt.Errorf("control-plane synthetic mesh config authority payload expired") @@ -723,7 +1318,25 @@ func verifyControlPlaneSyntheticMeshConfig(remote client.SyntheticMeshConfig, id } func syntheticMeshConfigAuthorityHash(remote client.SyntheticMeshConfig) (string, error) { + if !rawMessageEmpty(remote.Raw) { + var unsigned map[string]json.RawMessage + if err := json.Unmarshal(remote.Raw, &unsigned); err != nil { + return "", fmt.Errorf("decode raw control-plane synthetic mesh config for authority hash: %w", err) + } + delete(unsigned, "authority_payload") + delete(unsigned, "authority_signature") + raw, err := json.Marshal(unsigned) + if err != nil { + return "", fmt.Errorf("marshal raw control-plane synthetic mesh config for authority hash: %w", err) + } + hash, err := authority.HashRaw(raw) + if err != nil { + return "", fmt.Errorf("hash raw control-plane synthetic mesh config authority payload: %w", err) + } + return hash, nil + } unsigned := remote + unsigned.Raw = nil unsigned.AuthorityPayload = nil unsigned.AuthoritySignature = nil raw, err := 
json.Marshal(unsigned) @@ -806,7 +1419,7 @@ func refreshRendezvousLeasesIfNeeded(ctx context.Context, cfg config.Config, ide meshState.LeaseRefreshFailures++ return err } - applyRefreshedSyntheticMeshConfig(meshState, loadedConfig, local, cfg.MeshRegion, completedAt) + applyRefreshedSyntheticMeshConfig(ctx, cfg, identity, meshState, loadedConfig, local, cfg.MeshRegion, completedAt) refresh.Status = "succeeded" refresh.RefreshedLeaseCount = len(loadedConfig.RendezvousLeases) refresh.ConfigVersion = loadedConfig.ConfigVersion @@ -829,7 +1442,7 @@ func refreshSyntheticMeshConfigIfDue(ctx context.Context, cfg config.Config, ide if !meshState.LastConfigRefreshAt.IsZero() && meshState.LastConfigRefreshAt.Add(meshSyntheticConfigRefreshInterval).After(observedAt) { return nil } - if api == nil || meshState.Source != "control_plane" || cfg.MeshSyntheticConfigPath != "" { + if api == nil || cfg.MeshSyntheticConfigPath != "" { meshState.LastConfigRefreshAt = observedAt return nil } @@ -844,7 +1457,7 @@ func refreshSyntheticMeshConfigIfDue(ctx context.Context, cfg config.Config, ide return err } previousVersion := meshState.ConfigVersion - applyRefreshedSyntheticMeshConfig(meshState, loadedConfig, local, cfg.MeshRegion, completedAt) + applyRefreshedSyntheticMeshConfig(ctx, cfg, identity, meshState, loadedConfig, local, cfg.MeshRegion, completedAt) log.Printf( "mesh synthetic config refreshed: previous_config_version=%s refreshed_config_version=%s route_health_routes=%d", previousVersion, @@ -908,7 +1521,7 @@ func refreshSyntheticMeshConfigForRouteHealthFeedback(ctx context.Context, cfg c meshState.RouteHealthRefreshFailures++ return err } - applyRefreshedSyntheticMeshConfig(meshState, loadedConfig, local, cfg.MeshRegion, completedAt) + applyRefreshedSyntheticMeshConfig(ctx, cfg, identity, meshState, loadedConfig, local, cfg.MeshRegion, completedAt) refresh.Status = "succeeded" refresh.RefreshedConfigVersion = loadedConfig.ConfigVersion refresh.RefreshedRouteHealthRouteCount = 
len(meshState.RouteHealthRoutes) @@ -924,7 +1537,7 @@ func refreshSyntheticMeshConfigForRouteHealthFeedback(ctx context.Context, cfg c return nil } -func applyRefreshedSyntheticMeshConfig(meshState *syntheticMeshState, loadedConfig loadedSyntheticMeshConfig, local mesh.PeerIdentity, preferredRegion string, observedAt time.Time) { +func applyRefreshedSyntheticMeshConfig(ctx context.Context, cfg config.Config, identity state.Identity, meshState *syntheticMeshState, loadedConfig loadedSyntheticMeshConfig, local mesh.PeerIdentity, preferredRegion string, observedAt time.Time) { routeHealthRoutes := routeHealthRoutesFromPathDecisions(loadedConfig.Routes, loadedConfig.RoutePathDecisions) peerCache := mesh.NewPeerCache(mesh.PeerCacheConfig{ Local: local, @@ -974,20 +1587,106 @@ func applyRefreshedSyntheticMeshConfig(meshState *syntheticMeshState, loadedConf } else { meshState.RouteGenerationTracker.Apply(loadedConfig.RoutePathDecisions, observedAt) } + productionForwardingEnabled := cfg.MeshProductionForwardingEnabled || loadedConfig.ProductionForwarding + meshState.ProductionForwardingEnabled = productionForwardingEnabled + if productionForwardingEnabled { + meshState.ProductionForwardTransport = mesh.NewHTTPProductionForwardTransport(loadedConfig.PeerEndpoints) + } else { + meshState.ProductionForwardTransport = nil + } + vpnFabricIngress := newVPNFabricIngress(meshState, identity, loadedConfig.Routes, loadedConfig.RoutePathDecisions, loadedConfig.ServiceChannelRemediationCommands, loadedConfig.ServiceChannelFeedback, loadedConfig.ServiceChannelAdaptivePolicy, loadedConfig.ConfigVersion, meshState.VPNGateway) + meshState.VPNFabricIngress = vpnFabricIngress + if meshState.ServiceChannelAccessStats == nil { + meshState.ServiceChannelAccessStats = newFabricServiceChannelAccessStats() + } + if meshState.RemoteWorkspaceFrameSink == nil { + meshState.RemoteWorkspaceFrameSink = mesh.NewRemoteWorkspaceFrameProbeSink() + } + nextListenerHandler := mesh.Server{ + Local: local, + 
SyntheticRuntime: meshState.Runtime, + ProductionForwardingEnabled: productionForwardingEnabled, + ProductionEnvelopeDelivery: func() mesh.ProductionEnvelopeDelivery { + if meshState.VPNFabricInbox == nil { + return nil + } + return meshState.VPNFabricInbox.DeliverProductionEnvelope + }(), + ProductionForwardTransport: meshState.ProductionForwardTransport, + ProductionForwardLogger: func(entry mesh.ProductionForwardLogEntry) { + payload, err := json.Marshal(entry) + if err != nil { + log.Printf("mesh production forward event marshal failed: %v", err) + return + } + log.Printf("mesh_production_forward_event=%s", string(payload)) + }, + FabricServiceChannelLogger: func(entry mesh.FabricServiceChannelAccessLogEntry) { + meshState.ServiceChannelAccessStats.Observe(entry) + payload, err := json.Marshal(entry) + if err != nil { + log.Printf("fabric service channel access event marshal failed: %v", err) + return + } + log.Printf("fabric_service_channel_access_event=%s", string(payload)) + }, + RemoteWorkspaceFrameSink: meshState.RemoteWorkspaceFrameSink, + ProductionRoutes: loadedConfig.Routes, + VPNPacketIngress: vpnFabricIngress, + BackendProxyBaseURL: cfg.BackendURL, + ClusterAuthorityPublicKey: firstNonEmpty(identity.ClusterAuthorityPublicKey, cfg.ClusterAuthorityPublicKey), + }.Handler() + if meshState.ListenerHandler == nil { + meshState.ListenerHandler = newDynamicHTTPHandler(nextListenerHandler) + } else { + meshState.ListenerHandler.Update(nextListenerHandler) + } + applyMeshListenerConfigIfChanged(ctx, cfg, identity, meshState, loadedConfig, observedAt) meshState.Routes = loadedConfig.Routes meshState.RouteHealthRoutes = routeHealthRoutes meshState.Source = loadedConfig.Source meshState.PeerCache = peerCache meshState.RendezvousLeases = loadedConfig.RendezvousLeases meshState.RoutePathDecisions = loadedConfig.RoutePathDecisions + meshState.ServiceChannelFeedback = loadedConfig.ServiceChannelFeedback + meshState.ServiceChannelRemediationCommands = 
append([]client.FabricServiceChannelRemediationCommand{}, loadedConfig.ServiceChannelRemediationCommands...) meshState.ConfigVersion = loadedConfig.ConfigVersion meshState.PeerDirectoryVersion = loadedConfig.PeerDirectoryVersion meshState.PolicyVersion = loadedConfig.PolicyVersion + meshState.ConfigLoadError = "" meshState.LastConfigRefreshAt = observedAt meshState.LastPeerRecoveryPlan = &peerRecoveryPlan meshState.LastPeerConnectionIntent = &peerConnectionIntentPlan } +func applyMeshListenerConfigIfChanged(ctx context.Context, base config.Config, identity state.Identity, meshState *syntheticMeshState, loadedConfig loadedSyntheticMeshConfig, observedAt time.Time) { + if meshState == nil || meshState.ListenerHandler == nil { + return + } + nextCfg := meshListenerRuntimeConfig(base, loadedConfig.MeshListener) + nextKey := meshListenerConfigKey(nextCfg) + if nextKey == meshState.ListenerConfigKey { + return + } + if meshState.StopListener != nil { + meshState.StopListener() + } + gateEnabled, runtimeEnabled := productionForwardingLogState(nextCfg, loadedConfig.ProductionForwarding) + report, stop := startSyntheticMeshHTTPServer(ctx, nextCfg, identity, meshState.ListenerHandler, len(loadedConfig.PeerEndpoints), len(loadedConfig.Routes), gateEnabled, runtimeEnabled) + meshState.ListenerReport = report + meshState.ListenerConfigKey = nextKey + meshState.ListenerRuntimeConfig = nextCfg + meshState.StopListener = stop + log.Printf( + "mesh listener config applied: mode=%s listen_addr=%s status=%s config_version=%s observed_at=%s", + nextCfg.MeshListenPortMode, + nextCfg.MeshListenAddr, + report.Status, + loadedConfig.ConfigVersion, + observedAt.Format(time.RFC3339Nano), + ) +} + func meshRendezvousLeasePostureForState(meshState *syntheticMeshState, identity state.Identity, observedAt time.Time) meshRendezvousLeasePosture { posture := meshRendezvousLeasePosture{} if meshState == nil { @@ -1418,6 +2117,7 @@ func probeWarmPeerHealth(ctx context.Context, api *client.Client, 
identity state "traffic_forwarding": false, "observation_type": "peer_connection_manager", "config_source": meshState.Source, + "manager_probe_status": result.LinkStatus, "manager_mode": cycle.Mode, "manager_attempted": cycle.Attempted, "manager_succeeded": cycle.Succeeded, @@ -1457,7 +2157,7 @@ func probeWarmPeerHealth(ctx context.Context, api *client.Client, identity state if err := api.ReportMeshLink(ctx, identity.ClusterID, client.MeshLinkObservationRequest{ SourceNodeID: identity.NodeID, TargetNodeID: result.NodeID, - LinkStatus: result.LinkStatus, + LinkStatus: meshLinkStatusFromPeerProbe(result.LinkStatus), LatencyMs: latency, QualityScore: qualityScore, Metadata: metadata, @@ -1580,6 +2280,21 @@ func probeWarmPeerHealth(ctx context.Context, api *client.Client, identity state return nil } +func meshLinkStatusFromPeerProbe(status string) string { + switch status { + case mesh.PeerConnectionProbeReachable: + return "reachable" + case mesh.PeerConnectionProbeUnreachable: + return "unreachable" + case mesh.PeerConnectionProbeDeferred: + return "degraded" + case mesh.PeerConnectionProbeSkipped: + return "unknown" + default: + return "unknown" + } +} + func peerRecoveryPlan(meshState *syntheticMeshState, now time.Time) mesh.PeerRecoveryPlan { if meshState == nil || meshState.PeerCache == nil { return mesh.PeerRecoveryPlan{} @@ -1641,18 +2356,56 @@ func sendHeartbeat(ctx context.Context, api *client.Client, cfg config.Config, i response, err := api.Heartbeat(ctx, identity.ClusterID, identity.NodeID, heartbeatPayload(cfg, identity, meshState, time.Now().UTC())) if err == nil { log.Printf("heartbeat sent: node_id=%s cluster_id=%s", identity.NodeID, identity.ClusterID) + if err := persistUpdateHintTrigger(cfg.StateDir, response.UpdateHint); err != nil { + log.Printf("update hint trigger failed: %v", err) + } } return response.TestingFlags, err } +func persistUpdateHintTrigger(stateDir string, hint *client.NodeUpdateHint) error { + if hint == nil || !hint.CheckNow || 
strings.TrimSpace(hint.Generation) == "" { + return nil + } + current := hostagent.CurrentUpdateTriggerGenerationForNodeAgent(stateDir) + if current == strings.TrimSpace(hint.Generation) { + return nil + } + return hostagent.SaveUpdateTrigger(stateDir, hostagent.UpdateTrigger{ + SchemaVersion: "rap.node_update_trigger.v1", + Generation: strings.TrimSpace(hint.Generation), + Products: hint.Products, + Reason: hint.Reason, + DeliveryMode: hint.DeliveryMode, + SubscriptionStatus: hint.SubscriptionStatus, + FallbackPollSeconds: hint.FallbackPollSeconds, + UpdateServiceNodeID: func() string { + if hint.UpdateService == nil { + return "" + } + return hint.UpdateService.NodeID + }(), + UpdateServiceStatus: func() string { + if hint.UpdateService == nil { + return "" + } + return hint.UpdateService.Status + }(), + ObservedAt: time.Now().UTC(), + }) +} + func heartbeatPayload(cfg config.Config, identity state.Identity, meshState *syntheticMeshState, observedAt time.Time) client.HeartbeatRequest { + if meshState != nil && meshState.ListenerRuntimeConfig.BackendURL != "" { + cfg = meshState.ListenerRuntimeConfig + } payload := agent.HeartbeatPayload() - candidates, err := advertisedEndpointCandidates(cfg, identity, observedAt) + candidates, err := advertisedEndpointCandidates(cfg, identity, meshState, observedAt) if err != nil { log.Printf("mesh endpoint report skipped: %v", err) return payload } - if len(candidates) == 0 && (meshState == nil || meshState.PeerCache == nil) { + if len(candidates) == 0 && (meshState == nil || (meshState.PeerCache == nil && meshState.ListenerReport.SchemaVersion == "")) { return payload } if payload.Metadata == nil { @@ -1662,6 +2415,33 @@ func heartbeatPayload(cfg config.Config, identity state.Identity, meshState *syn payload.Capabilities = map[string]any{} } payload.Metadata["stage"] = "c17z20" + if meshState != nil && meshState.ListenerReport.SchemaVersion != "" { + report := meshState.ListenerReport + report.ObservedAt = 
observedAt.UTC().Format(time.RFC3339Nano) + payload.Metadata["mesh_listener_report"] = report + payload.Capabilities["mesh_listener_diagnostics"] = true + if report.OneWayConnectivity { + payload.Capabilities["mesh_one_way_connectivity"] = true + } + if report.Status == "listen_failed" && cfg.MeshConnectivityMode != "outbound_only" { + payload.HealthStatus = "warning" + } + } + if cfg.MeshSyntheticRuntimeEnabled { + payload.Metadata["mesh_outbound_session_report"] = meshOutboundSessionReportFromState(cfg, meshState, observedAt) + payload.Capabilities["mesh_outbound_control_session"] = true + payload.Capabilities["mesh_reverse_control_channel_contract"] = true + if meshState != nil && meshState.ServiceChannelAccessStats != nil { + payload.Metadata["fabric_service_channel_access_report"] = meshState.ServiceChannelAccessStats.Report(observedAt) + payload.Capabilities["fabric_service_channel_access_telemetry"] = true + } + if cfg.MeshProductionForwardingEnabled || (meshState != nil && meshState.ProductionForwardingEnabled) { + payload.Capabilities["mesh_production_forwarding"] = true + } + if meshState != nil && meshState.ConfigLoadError != "" { + payload.HealthStatus = "warning" + } + } if len(candidates) > 0 { payload.Metadata["mesh_endpoint_report"] = meshEndpointReport(cfg, identity, meshState, observedAt, candidates) payload.Capabilities["mesh_dynamic_endpoint_reporting"] = true @@ -1678,6 +2458,7 @@ func heartbeatPayload(cfg config.Config, identity state.Identity, meshState *syn payload.Capabilities["mesh_peer_recovery_planning"] = true payload.Capabilities["mesh_peer_connection_intent_planning"] = true payload.Capabilities["mesh_peer_connection_manager"] = true + payload.Capabilities["mesh_per_peer_endpoint_probe_fallback"] = true payload.Capabilities["mesh_rendezvous_relay_control_contract"] = true payload.Capabilities[meshRendezvousLeaseTelemetryCapability] = true payload.Capabilities[meshRendezvousLeaseRefreshCapability] = true @@ -1687,9 +2468,120 @@ func 
heartbeatPayload(cfg config.Config, identity state.Identity, meshState *syn payload.Capabilities[meshRouteHealthConfigCapability] = true payload.Capabilities[meshRouteHealthFeedbackRefreshCapability] = true } + if meshState != nil && (meshState.VPNFabricIngress != nil || meshState.VPNFabricInbox != nil) { + payload.Metadata["fabric_service_channel_runtime_report"] = fabricServiceChannelRuntimeReport(meshState, identity, observedAt) + payload.Capabilities["fabric_service_channel_runtime"] = true + payload.Capabilities["fabric_service_channel_route_manager"] = true + } return payload } +func fabricServiceChannelRuntimeReport(meshState *syntheticMeshState, identity state.Identity, observedAt time.Time) map[string]any { + report := map[string]any{ + "schema_version": "c18l.fabric_service_channel_runtime_report.v1", + "cluster_id": identity.ClusterID, + "node_id": identity.NodeID, + "service_class": "vpn_packets", + "channel_class": mesh.ProductionChannelVPNPacket, + "route_manager": "primary_sticky_with_alternate_route_failover", + "backend_relay_fallback": true, + "backend_relay_fallback_position": "after_all_fabric_routes_fail", + "application_protocol_agnostic": true, + "observed_at": observedAt.UTC().Format(time.RFC3339Nano), + } + if meshState == nil { + report["enabled"] = false + return report + } + report["enabled"] = meshState.VPNFabricIngress != nil + report["production_payload_forwarding"] = meshState.ProductionForwardingEnabled + report["route_candidate_total"] = countVPNPacketRoutes(meshState.Routes, identity.ClusterID, identity.NodeID) + report["config_source"] = meshState.Source + report["config_version"] = meshState.ConfigVersion + if meshState.VPNFabricIngress != nil { + report["ingress"] = meshState.VPNFabricIngress.Snapshot(identity.ClusterID) + } + if meshState.VPNFabricInbox != nil { + report["inbox"] = meshState.VPNFabricInbox.Snapshot() + } + return report +} + +func countVPNPacketRoutes(routes []mesh.SyntheticRoute, clusterID string, localNodeID 
string) int { + count := 0 + now := time.Now().UTC() + for _, route := range routes { + if route.ClusterID != clusterID || route.SourceNodeID != localNodeID || !containsString(route.AllowedChannels, mesh.ProductionChannelVPNPacket) { + continue + } + if !route.ExpiresAt.IsZero() && !route.ExpiresAt.After(now) { + continue + } + nextHop := serviceChannelNextHopAfter(route.Hops, localNodeID, route.DestinationNodeID) + if nextHop == "" || nextHop == localNodeID { + continue + } + count++ + } + return count +} + +func serviceChannelNextHopAfter(path []string, localNodeID string, destinationNodeID string) string { + if len(path) == 0 { + return destinationNodeID + } + for index, nodeID := range path { + if nodeID == localNodeID { + if index+1 < len(path) { + return path[index+1] + } + return localNodeID + } + } + return destinationNodeID +} + +func meshOutboundSessionReportFromState(cfg config.Config, meshState *syntheticMeshState, observedAt time.Time) meshOutboundSessionReport { + report := meshOutboundSessionReport{ + SchemaVersion: "c17z22.mesh_outbound_session_report.v1", + Status: "ready", + Direction: "node_to_control_plane", + Transport: "heartbeat_keepalive", + ControlPlaneURL: cfg.BackendURL, + ConnectivityMode: defaultString(cfg.MeshConnectivityMode, "direct"), + InboundListenerRequired: false, + ProductionForwarding: false, + ServiceWorkloadTraffic: false, + ObservedAt: observedAt.UTC().Format(time.RFC3339Nano), + } + if meshState != nil { + listener := meshState.ListenerReport + report.ListenerStatus = listener.Status + report.ListenerFailureReason = listener.FailureReason + report.ListenerPortConflict = listener.PortConflict + report.ConfigLoadError = meshState.ConfigLoadError + report.UsableForInboundControl = listener.Status == "listening" || + listener.Status == "auto_rebound" || + listener.OneWayConnectivity || + listener.Status == "listen_failed" || + cfg.MeshConnectivityMode == "outbound_only" + if meshState.PeerConnections != nil { + snapshot := 
meshState.PeerConnections.Snapshot() + report.PeerConnectionReady = snapshot.Ready + report.PeerConnectionRelayReady = snapshot.RelayReady + report.PeerConnectionWaiting = snapshot.Waiting + } + report.RendezvousLeaseCount = len(meshState.RendezvousLeases) + if meshState.ConfigLoadError != "" { + report.Status = "degraded" + report.ListenerFailureReason = firstNonEmpty(report.ListenerFailureReason, "mesh_config_load_failed") + } + } else { + report.UsableForInboundControl = cfg.MeshConnectivityMode == "outbound_only" + } + return report +} + func meshEndpointReport(cfg config.Config, identity state.Identity, meshState *syntheticMeshState, observedAt time.Time, candidates []mesh.PeerEndpointCandidate) map[string]any { transport := cfg.MeshAdvertiseTransport if transport == "" { @@ -1832,7 +2724,7 @@ func meshPeerConnectionIntentReport(meshState *syntheticMeshState, observedAt ti func meshPeerConnectionManagerReport(meshState *syntheticMeshState, observedAt time.Time) map[string]any { report := map[string]any{ - "schema_version": "c17z12.mesh_peer_connection_manager_report.v1", + "schema_version": "c17z25.mesh_peer_connection_manager_report.v1", "service_workload_traffic": false, "production_payload_forwarding": false, "persistent_connection_transport": true, @@ -1858,6 +2750,7 @@ func meshPeerConnectionManagerReport(meshState *syntheticMeshState, observedAt t report["relay_control_count"] = cycle.RelayControlCount report["last_started_at"] = cycle.StartedAt report["last_completed_at"] = cycle.CompletedAt + report["probe_results"] = cycle.Results if meshState.PeerConnections != nil { connectionSnapshot := meshState.PeerConnections.Snapshot() report["peer_connection_ready"] = connectionSnapshot.Ready @@ -2554,6 +3447,9 @@ func meshRoutePathDecisionReport(meshState *syntheticMeshState, identity state.I report["generation"] = decisionReport.Generation report["decision_count"] = decisionReport.DecisionCount report["replacement_decision_count"] = 
decisionReport.ReplacementDecisionCount + report["degraded_decision_count"] = decisionReport.DegradedDecisionCount + report["rebuild_request_count"] = decisionReport.RebuildRequestCount + report["rebuild_applied_count"] = decisionReport.RebuildAppliedCount report["control_plane_report_only"] = decisionReport.ControlPlaneOnly report["control_plane_report_production_forwarding"] = decisionReport.ProductionForwarding decisions := make([]map[string]any, 0, minInt(len(decisionReport.Decisions), maxMeshRendezvousLeaseReportEntries)) @@ -2580,6 +3476,11 @@ func meshRoutePathDecisionReport(meshState *syntheticMeshState, identity state.I decisions = append(decisions, map[string]any{ "decision_id": decision.DecisionID, "route_id": decision.RouteID, + "replacement_route_id": decision.ReplacementRouteID, + "rebuild_request_id": decision.RebuildRequestID, + "rebuild_status": decision.RebuildStatus, + "rebuild_reason": decision.RebuildReason, + "rebuild_attempt": decision.RebuildAttempt, "source_node_id": decision.SourceNodeID, "destination_node_id": decision.DestinationNodeID, "original_hops": append([]string{}, decision.OriginalHops...), @@ -2723,7 +3624,7 @@ func formatOptionalTime(value time.Time) string { return value.UTC().Format(time.RFC3339Nano) } -func advertisedEndpointCandidates(cfg config.Config, identity state.Identity, observedAt time.Time) ([]mesh.PeerEndpointCandidate, error) { +func advertisedEndpointCandidates(cfg config.Config, identity state.Identity, meshState *syntheticMeshState, observedAt time.Time) ([]mesh.PeerEndpointCandidate, error) { var candidates []mesh.PeerEndpointCandidate if cfg.MeshAdvertiseEndpointsJSON != "" { if err := json.Unmarshal([]byte(cfg.MeshAdvertiseEndpointsJSON), &candidates); err != nil { @@ -2743,6 +3644,7 @@ func advertisedEndpointCandidates(cfg config.Config, identity state.Identity, ob Priority: 10, }) } + candidates = append(candidates, interfaceEndpointCandidates(cfg, identity, meshState, observedAt)...) 
for i := range candidates { if candidates[i].EndpointID == "" { candidates[i].EndpointID = fmt.Sprintf("%s-advertised-%d", identity.NodeID, i+1) @@ -2786,9 +3688,219 @@ func advertisedEndpointCandidates(cfg config.Config, identity state.Identity, ob candidates[i].Metadata = metadata } } + sort.SliceStable(candidates, func(i, j int) bool { + if candidates[i].Priority == candidates[j].Priority { + return candidates[i].EndpointID < candidates[j].EndpointID + } + return candidates[i].Priority < candidates[j].Priority + }) return candidates, nil } +func interfaceEndpointCandidates(cfg config.Config, identity state.Identity, meshState *syntheticMeshState, observedAt time.Time) []mesh.PeerEndpointCandidate { + if meshState == nil { + return nil + } + report := meshState.ListenerReport + if report.Status != "listening" && report.Status != "auto_rebound" { + return nil + } + if cfg.MeshConnectivityMode == "outbound_only" { + return nil + } + port := listenerPort(report.EffectiveListenAddr, report.ConfiguredListenAddr, cfg.MeshListenAddr) + if port == "" { + return nil + } + interfaces, err := net.Interfaces() + if err != nil { + log.Printf("mesh interface discovery skipped: %v", err) + return nil + } + var candidates []mesh.PeerEndpointCandidate + for _, iface := range interfaces { + if iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagLoopback != 0 { + continue + } + interfaceType := classifyNetworkInterface(iface.Name) + if interfaceType == "container" { + continue + } + addrs, err := iface.Addrs() + if err != nil { + continue + } + for _, addr := range addrs { + ip := ipFromAddr(addr) + if ip == nil || ip.IsLoopback() || ip.IsUnspecified() || ip.IsMulticast() || ip.IsLinkLocalMulticast() || ip.IsLinkLocalUnicast() { + continue + } + addressFamily := "ipv6" + if ip.To4() != nil { + addressFamily = "ipv4" + } + reachability := "public" + connectivityMode := defaultString(cfg.MeshConnectivityMode, "direct") + if ip.IsPrivate() || ip.IsLinkLocalUnicast() { + reachability = 
"private" + if connectivityMode == "direct" { + connectivityMode = "private_lan" + } + } + metadata, _ := json.Marshal(map[string]any{ + "source": "node-agent-interface-discovery", + "runtime": "c17z24", + "interface_name": iface.Name, + "interface_index": iface.Index, + "interface_type": interfaceType, + "listen_effective_addr": report.EffectiveListenAddr, + "listen_configured_addr": report.ConfiguredListenAddr, + "loopback_filtered": true, + "link_local_filtered": true, + "container_iface_filtered": true, + "operator_override_allowed": true, + "observed_at": observedAt.UTC().Format(time.RFC3339Nano), + }) + candidates = append(candidates, mesh.PeerEndpointCandidate{ + EndpointID: fmt.Sprintf("%s-if-%s-%s-%s", identity.NodeID, safeEndpointIDPart(iface.Name), safeEndpointIDPart(ip.String()), addressFamily), + NodeID: identity.NodeID, + Transport: defaultString(cfg.MeshAdvertiseTransport, "direct_http"), + Address: endpointAddress(defaultString(cfg.MeshAdvertiseTransport, "direct_http"), ip, port), + AddressFamily: addressFamily, + Reachability: reachability, + NATType: defaultString(cfg.MeshNATType, "unknown"), + ConnectivityMode: connectivityMode, + Region: cfg.MeshRegion, + Priority: endpointPriority(reachability, addressFamily, interfaceType, len(candidates)), + PolicyTags: []string{"auto_discovered", "non_loopback", interfaceType}, + LastVerifiedAt: &observedAt, + Metadata: metadata, + }) + } + } + return candidates +} + +func classifyNetworkInterface(name string) string { + normalized := strings.ToLower(strings.TrimSpace(name)) + switch { + case strings.HasPrefix(normalized, "docker"), + strings.HasPrefix(normalized, "br-"), + strings.HasPrefix(normalized, "veth"), + strings.HasPrefix(normalized, "virbr"), + strings.HasPrefix(normalized, "cni"), + strings.HasPrefix(normalized, "flannel"), + strings.HasPrefix(normalized, "calico"), + strings.HasPrefix(normalized, "kube"): + return "container" + case strings.HasPrefix(normalized, "tun"), + 
strings.HasPrefix(normalized, "tap"), + strings.HasPrefix(normalized, "wg"), + strings.Contains(normalized, "tailscale"), + strings.Contains(normalized, "zerotier"), + strings.HasPrefix(normalized, "zt"): + return "vpn" + case strings.HasPrefix(normalized, "eth"), + strings.HasPrefix(normalized, "ens"), + strings.HasPrefix(normalized, "eno"), + strings.HasPrefix(normalized, "enp"), + strings.HasPrefix(normalized, "wlan"), + strings.HasPrefix(normalized, "wl"), + strings.HasPrefix(normalized, "bond"): + return "physical" + default: + return "unknown" + } +} + +func listenerPort(addrs ...string) string { + for _, addr := range addrs { + addr = strings.TrimSpace(addr) + if addr == "" { + continue + } + _, port, err := net.SplitHostPort(addr) + if err == nil && port != "" { + return port + } + if strings.HasPrefix(addr, ":") && len(addr) > 1 { + return strings.TrimPrefix(addr, ":") + } + } + return "" +} + +func ipFromAddr(addr net.Addr) net.IP { + switch v := addr.(type) { + case *net.IPNet: + return v.IP + case *net.IPAddr: + return v.IP + default: + host, _, err := net.SplitHostPort(v.String()) + if err != nil { + host = v.String() + } + return net.ParseIP(host) + } +} + +func endpointAddress(transport string, ip net.IP, port string) string { + host := ip.String() + if ip.To4() == nil { + host = "[" + host + "]" + } + scheme := "http" + switch strings.ToLower(strings.TrimSpace(transport)) { + case "wss": + scheme = "wss" + case "https", "direct_https": + scheme = "https" + } + return scheme + "://" + host + ":" + port +} + +func endpointPriority(reachability string, addressFamily string, interfaceType string, offset int) int { + base := 40 + if reachability == "public" { + base = 20 + } else if reachability == "private" { + base = 30 + } + switch interfaceType { + case "vpn": + base += 0 + case "physical": + base += 5 + default: + base += 10 + } + if addressFamily == "ipv6" { + base += 20 + } + return base + offset +} + +func safeEndpointIDPart(value string) string 
{ + value = strings.ToLower(strings.TrimSpace(value)) + var out strings.Builder + lastDash := false + for _, r := range value { + if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') { + out.WriteRune(r) + lastDash = false + } else if !lastDash { + out.WriteByte('-') + lastDash = true + } + } + result := strings.Trim(out.String(), "-") + if result == "" { + return "iface" + } + return result +} + func defaultString(value string, fallback string) string { if strings.TrimSpace(value) == "" { return fallback @@ -2831,6 +3943,8 @@ func reachabilityFromConnectivityMode(connectivityMode string) string { return "outbound_only" case "relay_required": return "relay" + case "private_lan": + return "private" case "direct": return "public" default: @@ -2838,7 +3952,7 @@ func reachabilityFromConnectivityMode(connectivityMode string) string { } } -func reportWorkloadStatus(ctx context.Context, api *client.Client, supervisor supervisor.Supervisor, identity state.Identity) error { +func reportWorkloadStatus(ctx context.Context, api *client.Client, supervisor supervisor.Supervisor, identity state.Identity, meshState *syntheticMeshState) error { desired, err := api.DesiredWorkloads(ctx, identity.ClusterID, identity.NodeID) if err != nil { return err @@ -2847,6 +3961,7 @@ func reportWorkloadStatus(ctx context.Context, api *client.Client, supervisor su if err != nil { return err } + enrichWorkloadStatuses(statuses, desired, meshState) for i, status := range statuses { if i >= len(desired) { break @@ -2860,3 +3975,366 @@ func reportWorkloadStatus(ctx context.Context, api *client.Client, supervisor su } return nil } + +func enrichWorkloadStatuses(statuses []client.WorkloadStatusRequest, desired []client.DesiredWorkload, meshState *syntheticMeshState) { + if meshState == nil || meshState.RemoteWorkspaceFrameSink == nil { + return + } + sinkReport := meshState.RemoteWorkspaceFrameSink.Report(time.Now().UTC()) + for i := range statuses { + if i >= len(desired) { + return + } + if 
strings.TrimSpace(desired[i].ServiceType) != "rdp-worker" { + continue + } + if statuses[i].StatusPayload == nil { + statuses[i].StatusPayload = map[string]any{} + } + statuses[i].StatusPayload["remote_workspace_adapter_sink"] = sinkReport + } +} + +func reportVPNAssignmentStatus(ctx context.Context, api *client.Client, identity state.Identity, gateway *vpnruntime.Gateway) error { + assignments, err := api.NodeVPNAssignments(ctx, identity.ClusterID, identity.NodeID) + if err != nil { + return err + } + for _, assignment := range assignments { + status := "lease_required" + reason := "eligible_candidate_waiting_for_active_lease" + runtimeAvailable := false + packetForwarding := false + runtimeError := "" + if assignment.ActiveLease != nil && assignment.ActiveLease.OwnerNodeID == identity.NodeID { + running, lastErr := gateway.Status() + runtimeAvailable = running + packetForwarding = running + runtimeError = lastErr + if running { + status = "assigned" + reason = "active_lease_owned_by_local_node" + } else { + status = "blocked" + reason = "vpn_gateway_runtime_unavailable" + if runtimeError == "" { + runtimeError = "vpn gateway runtime is not running" + } + } + } + if assignment.DesiredState != "enabled" { + status = "blocked" + reason = "vpn_connection_disabled" + } + payload := map[string]any{ + "schema_version": "rap.node_vpn_assignment_status.v1", + "assignment_reason": assignment.AssignmentReason, + "protocol_family": assignment.ProtocolFamily, + "runtime_available": runtimeAvailable, + "packet_forwarding": packetForwarding, + "reason": reason, + "native_vpn_runtime_note": "experimental packet tunnel runtime is enabled for active linux gateway leases", + "gateway_interface": "rapvpn0", + "gateway_vpn_cidr": "10.77.0.0/24", + "relay_transport": "not_active_owner", + } + if dnsServers := vpnAssignmentDNSServers(assignment); len(dnsServers) > 0 { + payload["exit_dns_servers"] = dnsServers + } + if runtimeError != "" { + payload["runtime_error"] = runtimeError + } 
+ if assignment.ActiveLease != nil && assignment.ActiveLease.OwnerNodeID == identity.NodeID { + gatewayRuntime := gateway.Snapshot() + payload["gateway_runtime"] = gatewayRuntime + if transport, ok := gatewayRuntime["transport"].(string); ok && strings.TrimSpace(transport) != "" { + payload["relay_transport"] = transport + } + } + if assignment.ActiveLease != nil { + payload["active_lease_id"] = assignment.ActiveLease.LeaseID + payload["lease_generation"] = assignment.ActiveLease.LeaseGeneration + payload["lease_expires_at"] = assignment.ActiveLease.ExpiresAt + } + if err := api.ReportNodeVPNAssignmentStatus(ctx, identity.ClusterID, identity.NodeID, assignment.VPNConnectionID, client.NodeVPNAssignmentStatusRequest{ + ObservedStatus: status, + StatusPayload: payload, + ObservedAt: time.Now().UTC(), + }); err != nil { + return err + } + } + if len(assignments) > 0 { + log.Printf("vpn assignment status reported: count=%d", len(assignments)) + } + return nil +} + +func exitDNSServers() []string { + if configured := parseDNSServerList(os.Getenv("RAP_VPN_EXIT_DNS_SERVERS")); len(configured) > 0 { + return configured + } + if configured := parseDNSServerList(os.Getenv("RAP_EXIT_DNS_SERVERS")); len(configured) > 0 { + return configured + } + if runtime.GOOS == "windows" { + return windowsExitDNSServers() + } + seen := map[string]bool{} + var out []string + for _, path := range []string{ + "/run/systemd/resolve/resolv.conf", + "/etc/resolv.conf", + "/run/systemd/resolve/stub-resolv.conf", + } { + data, err := os.ReadFile(path) + if err != nil { + continue + } + for _, line := range strings.Split(string(data), "\n") { + fields := strings.Fields(line) + if len(fields) < 2 || fields[0] != "nameserver" { + continue + } + server := strings.TrimSpace(fields[1]) + ip := net.ParseIP(server) + if ip == nil || ip.IsLoopback() || ip.IsUnspecified() || ip.IsLinkLocalUnicast() { + continue + } + if seen[server] { + continue + } + seen[server] = true + out = append(out, server) + } + if 
len(out) > 0 { + break + } + } + return out +} + +func vpnAssignmentDNSServers(assignment client.NodeVPNAssignment) []string { + if servers := exitDNSServers(); len(servers) > 0 { + return servers + } + for _, raw := range []json.RawMessage{assignment.RoutePolicy, assignment.TargetEndpoint} { + if servers := dnsServersFromRawPolicy(raw); len(servers) > 0 { + return servers + } + } + return nil +} + +func windowsExitDNSServers() []string { + ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second) + defer cancel() + output, err := exec.CommandContext(ctx, "netsh", "interface", "ip", "show", "dnsservers").CombinedOutput() + if err != nil || len(output) == 0 { + return nil + } + return parseDNSServerList(string(output)) +} + +func dnsServersFromRawPolicy(raw json.RawMessage) []string { + var payload map[string]json.RawMessage + if len(raw) == 0 || json.Unmarshal(raw, &payload) != nil { + return nil + } + for _, key := range []string{"dns_servers", "exit_dns_servers"} { + var values []string + if item, ok := payload[key]; ok && json.Unmarshal(item, &values) == nil { + if servers := normalizeDNSServers(values); len(servers) > 0 { + return servers + } + } + } + return nil +} + +func normalizeDNSServers(values []string) []string { + seen := map[string]bool{} + out := make([]string, 0, len(values)) + for _, value := range values { + server := strings.TrimSpace(value) + ip := net.ParseIP(server) + if ip == nil || ip.IsLoopback() || ip.IsUnspecified() || ip.IsLinkLocalUnicast() || seen[server] { + continue + } + seen[server] = true + out = append(out, server) + } + return out +} + +func parseDNSServerList(value string) []string { + seen := map[string]bool{} + var out []string + for _, field := range strings.FieldsFunc(value, func(r rune) bool { + return r == ',' || r == ';' || r == ' ' || r == '\t' || r == '\n' || r == '\r' + }) { + server := strings.TrimSpace(field) + ip := net.ParseIP(server) + if ip == nil || ip.IsLoopback() || ip.IsUnspecified() || 
ip.IsLinkLocalUnicast() || seen[server] { + continue + } + seen[server] = true + out = append(out, server) + } + return out +} + +func ensureVPNGatewayRuntime(ctx context.Context, api *client.Client, identity state.Identity, gateway *vpnruntime.Gateway, meshState *syntheticMeshState) error { + assignments, err := api.NodeVPNAssignments(ctx, identity.ClusterID, identity.NodeID) + if err != nil { + return err + } + activeOwner := false + for _, assignment := range assignments { + if assignment.AssignmentReason != "active_owner" { + continue + } + if assignment.ActiveLease == nil || assignment.ActiveLease.OwnerNodeID != identity.NodeID { + continue + } + activeOwner = true + gateway.ClusterID = identity.ClusterID + gateway.VPNConnectionID = assignment.VPNConnectionID + gateway.InterfaceName = "rapvpn0" + gateway.AddressCIDR = "10.77.0.1/24" + gateway.RouteCIDR = "10.77.0.0/24" + gateway.PollTimeout = 25 * time.Second + if transport := fabricGatewayTransportForAssignment(identity, assignment, meshState, api); transport != nil { + if _, ok := gateway.Transport.(vpnruntime.BackendPacketTransport); ok { + gateway.Stop() + } + gateway.Transport = transport + } else if transport := localGatewayTransportForAssignment(identity, assignment, meshState, api); transport != nil { + if _, ok := gateway.Transport.(vpnruntime.BackendPacketTransport); ok { + gateway.Stop() + } + gateway.Transport = transport + } else if _, ok := gateway.Transport.(*vpnruntime.FabricPacketTransport); ok { + gateway.Stop() + gateway.Transport = nil + } else if _, ok := gateway.Transport.(*vpnruntime.AdaptivePacketTransport); ok { + gateway.Stop() + gateway.Transport = nil + } + if err := gateway.EnsureStarted(ctx); err != nil { + return err + } + if err := renewOwnedVPNLease(ctx, api, identity, assignment); err != nil { + return err + } + log.Printf("vpn gateway runtime ensured: vpn_connection_id=%s interface=%s", assignment.VPNConnectionID, gateway.InterfaceName) + return nil + } + if !activeOwner { + 
gateway.Stop() + } + return nil +} + +func localGatewayTransportForAssignment(identity state.Identity, assignment client.NodeVPNAssignment, meshState *syntheticMeshState, api *client.Client) vpnruntime.PacketTransport { + if meshState == nil || meshState.VPNFabricInbox == nil || assignment.VPNConnectionID == "" { + return nil + } + local := &vpnruntime.LocalPacketTransport{ + Inbox: meshState.VPNFabricInbox, + VPNConnectionID: assignment.VPNConnectionID, + } + if api == nil { + return local + } + return &vpnruntime.AdaptivePacketTransport{ + Primary: local, + Fallback: vpnruntime.BackendPacketTransport{ + API: api, + ClusterID: identity.ClusterID, + VPNConnectionID: assignment.VPNConnectionID, + }, + PrimaryTimeout: 50 * time.Millisecond, + } +} + +func fabricGatewayTransportForAssignment(identity state.Identity, assignment client.NodeVPNAssignment, meshState *syntheticMeshState, api *client.Client) vpnruntime.PacketTransport { + if meshState == nil || meshState.ProductionForwardTransport == nil || meshState.VPNFabricInbox == nil { + return nil + } + route, nextHop, ok := selectVPNPacketRoute(meshState.Routes, identity.ClusterID, identity.NodeID) + if !ok { + return nil + } + fabric := &vpnruntime.FabricPacketTransport{ + ForwardTransport: meshState.ProductionForwardTransport, + Inbox: meshState.VPNFabricInbox, + ClusterID: identity.ClusterID, + VPNConnectionID: assignment.VPNConnectionID, + RouteID: route.RouteID, + LocalNodeID: identity.NodeID, + RemoteNodeID: route.DestinationNodeID, + NextHopNodeID: nextHop, + RoutePath: route.Hops, + SendDirection: vpnruntime.FabricDirectionGatewayToClient, + ReceiveDirection: vpnruntime.FabricDirectionClientToGateway, + } + if api == nil { + return fabric + } + return &vpnruntime.AdaptivePacketTransport{ + Primary: fabric, + Fallback: vpnruntime.BackendPacketTransport{ + API: api, + ClusterID: identity.ClusterID, + VPNConnectionID: assignment.VPNConnectionID, + }, + PrimaryTimeout: 50 * time.Millisecond, + } +} + +func 
selectVPNPacketRoute(routes []mesh.SyntheticRoute, clusterID string, localNodeID string) (mesh.SyntheticRoute, string, bool) { + now := time.Now().UTC() + for _, route := range routes { + if route.ClusterID != clusterID || route.SourceNodeID != localNodeID || !containsString(route.AllowedChannels, mesh.ProductionChannelVPNPacket) { + continue + } + if !route.ExpiresAt.IsZero() && !route.ExpiresAt.After(now) { + continue + } + nextHop := nextRouteHop(route.Hops, localNodeID, route.DestinationNodeID) + if nextHop == "" || nextHop == localNodeID { + continue + } + return route, nextHop, true + } + return mesh.SyntheticRoute{}, "", false +} + +func nextRouteHop(path []string, localNodeID string, destinationNodeID string) string { + if len(path) == 0 { + return destinationNodeID + } + for index, nodeID := range path { + if nodeID == localNodeID { + if index+1 < len(path) { + return path[index+1] + } + return localNodeID + } + } + return destinationNodeID +} + +func renewOwnedVPNLease(ctx context.Context, api *client.Client, identity state.Identity, assignment client.NodeVPNAssignment) error { + if assignment.ActiveLease == nil || assignment.ActiveLease.OwnerNodeID != identity.NodeID { + return nil + } + if err := api.RenewNodeVPNAssignmentLease(ctx, identity.ClusterID, identity.NodeID, assignment.VPNConnectionID, assignment.ActiveLease.LeaseID, client.NodeVPNAssignmentLeaseRenewRequest{ + TTLSeconds: 300, + }); err != nil { + return err + } + log.Printf("vpn lease renewed: vpn_connection_id=%s lease_id=%s ttl_seconds=300", assignment.VPNConnectionID, assignment.ActiveLease.LeaseID) + return nil +} diff --git a/agents/rap-node-agent/cmd/rap-node-agent/main_test.go b/agents/rap-node-agent/cmd/rap-node-agent/main_test.go index 5d65845..830ecdc 100644 --- a/agents/rap-node-agent/cmd/rap-node-agent/main_test.go +++ b/agents/rap-node-agent/cmd/rap-node-agent/main_test.go @@ -78,6 +78,202 @@ func TestLoadSyntheticMeshConfigPrefersScopedFile(t *testing.T) { } } +func 
TestSyntheticMeshConfigAuthorityHashUsesRawConfigPayload(t *testing.T) { + raw := json.RawMessage(`{ + "enabled": true, + "schema_version": "c18z-test.synthetic.v1", + "cluster_id": "cluster-1", + "local_node_id": "node-a", + "authority_required": true, + "cluster_authority": {"schema_version":"rap.cluster_authority.v1"}, + "authority_payload": {"ignored": true}, + "authority_signature": {"ignored": true}, + "config_version": "config-1", + "peer_endpoints": {}, + "routes": [], + "production_forwarding": true, + "future_backend_field": {"must_remain_hash_visible": true} + }`) + var remote client.SyntheticMeshConfig + if err := json.Unmarshal(raw, &remote); err != nil { + t.Fatalf("unmarshal synthetic config: %v", err) + } + var unsigned map[string]json.RawMessage + if err := json.Unmarshal(raw, &unsigned); err != nil { + t.Fatalf("unmarshal unsigned map: %v", err) + } + delete(unsigned, "authority_payload") + delete(unsigned, "authority_signature") + unsignedRaw, err := json.Marshal(unsigned) + if err != nil { + t.Fatalf("marshal unsigned map: %v", err) + } + want, err := agentauthority.HashRaw(unsignedRaw) + if err != nil { + t.Fatalf("hash unsigned map: %v", err) + } + got, err := syntheticMeshConfigAuthorityHash(remote) + if err != nil { + t.Fatalf("hash synthetic config: %v", err) + } + if got != want { + t.Fatalf("hash = %s, want raw-preserving hash %s", got, want) + } +} + +func TestRouteManagerDecisionsFromControlPlaneConsumesRemediationCommand(t *testing.T) { + now := time.Now().UTC() + decisions := routeManagerDecisionsFromControlPlane(nil, []client.FabricServiceChannelRemediationCommand{{ + SchemaVersion: "rap.fabric_service_channel_access_remediation_command.v1", + CommandID: "cmd-1", + Action: "prefer_alternate_route", + ClusterID: "cluster-1", + ChannelID: "channel-1", + ServiceClass: "vpn_packets", + PrimaryRouteID: "route-primary", + ReplacementRouteID: "route-alternate", + Reason: "authorized_alternate_route_available", + IssuedAt: now, + ExpiresAt: 
now.Add(time.Minute), + }}) + if len(decisions) != 1 { + t.Fatalf("decisions = %+v, want one remediation decision", decisions) + } + decision := decisions[0] + if decision.RouteID != "route-primary" || + decision.ReplacementRouteID != "route-alternate" || + decision.RebuildStatus != "applied" || + decision.DecisionSource != "service_channel_remediation_command" || + decision.RebuildRequestID != "cmd-1" { + t.Fatalf("unexpected remediation decision: %+v", decision) + } +} + +func TestRouteManagerDecisionsFromControlPlaneConsumesRebuildRouteCommand(t *testing.T) { + now := time.Now().UTC() + decisions := routeManagerDecisionsFromControlPlane(nil, []client.FabricServiceChannelRemediationCommand{{ + SchemaVersion: "rap.fabric_service_channel_access_remediation_command.v1", + CommandID: "cmd-rebuild", + Action: "rebuild_route", + ClusterID: "cluster-1", + ChannelID: "channel-1", + ServiceClass: "vpn_packets", + PrimaryRouteID: "route-primary", + Reason: "route_feedback_recommends_rebuild", + GuardStatus: "allowed", + IssuedAt: now, + ExpiresAt: now.Add(time.Minute), + }}) + if len(decisions) != 1 { + t.Fatalf("decisions = %+v, want one rebuild remediation decision", decisions) + } + decision := decisions[0] + if decision.RouteID != "route-primary" || + decision.RebuildStatus != "pending_degraded_fallback" || + decision.DecisionSource != "service_channel_remediation_command" || + decision.RebuildRequestID != "cmd-rebuild" { + t.Fatalf("unexpected rebuild remediation decision: %+v", decision) + } +} + +func TestRouteManagerDecisionsFromControlPlaneRejectsGuardedRemediationCommand(t *testing.T) { + now := time.Now().UTC() + decisions := routeManagerDecisionsFromControlPlane(nil, []client.FabricServiceChannelRemediationCommand{{ + SchemaVersion: "rap.fabric_service_channel_access_remediation_command.v1", + CommandID: "cmd-guarded", + Action: "prefer_alternate_route", + ClusterID: "cluster-1", + ChannelID: "channel-1", + ServiceClass: "vpn_packets", + PrimaryRouteID: 
"route-primary", + ReplacementRouteID: "route-outside-policy", + GuardStatus: "rejected", + GuardReason: "replacement_exit_outside_signed_pool_policy", + IssuedAt: now, + ExpiresAt: now.Add(time.Minute), + }}) + if len(decisions) != 0 { + t.Fatalf("guarded remediation command must not reach route-manager: %+v", decisions) + } +} + +func TestRouteManagerDecisionsFromControlPlaneKeepsExplicitRemediationCommand(t *testing.T) { + now := time.Now().UTC() + report := &client.RoutePathDecisionReport{Decisions: []client.RoutePathDecision{{ + RouteID: "route-primary", + ReplacementRouteID: "route-alternate", + RebuildRequestID: "feedback-rebuild", + RebuildStatus: "applied", + RebuildReason: "service_channel_feedback_rebuild_applied_to_alternate", + DecisionSource: "service_channel_feedback_replacement", + Generation: "gen-1", + }}} + decisions := routeManagerDecisionsFromControlPlane(report, []client.FabricServiceChannelRemediationCommand{{ + CommandID: "cmd-1", + Action: "prefer_alternate_route", + PrimaryRouteID: "route-primary", + ReplacementRouteID: "route-alternate", + Reason: "authorized_alternate_route_available", + IssuedAt: now, + ExpiresAt: now.Add(time.Minute), + }}) + if len(decisions) != 2 { + t.Fatalf("decisions = %+v, want feedback and explicit remediation command", decisions) + } + if decisions[1].DecisionSource != "service_channel_remediation_command" || decisions[1].RebuildRequestID != "cmd-1" { + t.Fatalf("remediation command was not kept as explicit route-manager input: %+v", decisions) + } +} + +func TestRouteManagerDecisionsFromControlPlaneSkipsCommandAlreadyResolvedByPlanner(t *testing.T) { + now := time.Now().UTC() + report := &client.RoutePathDecisionReport{Decisions: []client.RoutePathDecision{{ + RouteID: "route-primary", + ReplacementRouteID: "route-planner", + RebuildRequestID: "cmd-rebuild", + RebuildStatus: "applied", + RebuildReason: "remediation_rebuild_applied_to_alternate", + DecisionSource: "service_channel_remediation_command", + 
Generation: "config-c18z77", + }}} + decisions := routeManagerDecisionsFromControlPlane(report, []client.FabricServiceChannelRemediationCommand{{ + CommandID: "cmd-rebuild", + Action: "rebuild_route", + PrimaryRouteID: "route-primary", + Reason: "route_feedback_recommends_rebuild", + GuardStatus: "allowed", + IssuedAt: now, + ExpiresAt: now.Add(time.Minute), + }}) + if len(decisions) != 1 { + t.Fatalf("decisions = %+v, want only planner-resolved decision", decisions) + } + if decisions[0].RebuildStatus != "applied" || decisions[0].ReplacementRouteID != "route-planner" { + t.Fatalf("unexpected planner decision: %+v", decisions[0]) + } +} + +func TestFabricServiceChannelAccessStatsReportsDataPlaneViolations(t *testing.T) { + stats := newFabricServiceChannelAccessStats() + stats.Observe(mesh.FabricServiceChannelAccessLogEntry{ + Event: "fabric_service_channel_data_plane_violation", + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-1", + BackendRelayPolicy: "disabled", + ViolationStatus: "fabric_route_send_failed_backend_fallback_blocked", + ViolationReason: "mesh synthetic route not found", + OccurredAt: time.Unix(10, 0).UTC(), + }) + report := stats.Report(time.Unix(20, 0).UTC()) + if report["backend_fallback_blocked"] != int64(1) || + report["fabric_route_send_failure"] != int64(1) || + report["last_data_plane_violation_status"] != "fabric_route_send_failed_backend_fallback_blocked" || + report["last_data_plane_violation_reason"] != "mesh synthetic route not found" { + t.Fatalf("unexpected violation report: %+v", report) + } +} + func TestVerifyEnrollmentBootstrapAcceptsSignedApproval(t *testing.T) { publicKey, privateKey, err := ed25519.GenerateKey(nil) if err != nil { @@ -134,6 +330,134 @@ func TestVerifyEnrollmentBootstrapAcceptsSignedApproval(t *testing.T) { } } +func TestVerifyControlPlaneSyntheticMeshConfigAcceptsSignedServiceChannelFeedback(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + 
t.Fatalf("generate key: %v", err) + } + publicKeyB64 := base64.StdEncoding.EncodeToString(publicKey) + fingerprint := agentauthority.Fingerprint(publicKey) + now := time.Now().UTC() + remote := client.SyntheticMeshConfig{ + Enabled: true, + SchemaVersion: "c17z18.synthetic.v1", + ClusterID: "cluster-1", + LocalNodeID: "node-a", + AuthorityRequired: true, + ClusterAuthority: &client.ClusterAuthorityDescriptor{ + SchemaVersion: agentauthority.AuthoritySchemaVersion, + ClusterID: "cluster-1", + AuthorityState: "authoritative", + KeyAlgorithm: agentauthority.AlgorithmEd25519, + PublicKey: publicKeyB64, + PublicKeyFingerprint: fingerprint, + }, + ConfigVersion: "config-v1", + PeerDirectoryVersion: "config-v1", + PolicyVersion: "config-v1", + PeerEndpoints: map[string]string{}, + PeerEndpointCandidates: map[string][]client.PeerEndpointCandidate{}, + PeerDirectory: []client.PeerDirectoryEntry{}, + RecoverySeeds: []client.PeerRecoverySeed{}, + RendezvousLeases: []client.PeerRendezvousLease{}, + RoutePathDecisions: &client.RoutePathDecisionReport{ + SchemaVersion: "c17z18.route_path_decisions.v1", + DecisionMode: "control_plane_effective_path_from_relay_policy_and_service_channel_feedback", + Generation: "config-v1", + DecisionCount: 1, + ReplacementDecisionCount: 1, + RebuildRequestCount: 1, + RebuildAppliedCount: 1, + ControlPlaneOnly: true, + Decisions: []client.RoutePathDecision{{ + DecisionID: "route-ab-path-node-a-service-channel-feedback", + RouteID: "route-ab", + ReplacementRouteID: "route-ac", + RebuildRequestID: "route-ab-node-a-config-v1-rebuild", + RebuildStatus: "applied", + RebuildReason: "service_channel_feedback_rebuild_applied_to_alternate", + RebuildAttempt: 2, + ClusterID: "cluster-1", + LocalNodeID: "node-a", + SourceNodeID: "node-a", + DestinationNodeID: "node-b", + OriginalHops: []string{"node-a", "node-b"}, + EffectiveHops: []string{"node-a", "node-c", "node-b"}, + LocalRole: "source", + DecisionSource: "service_channel_feedback_replacement", + 
Generation: "config-v1", + PathScore: 1000, + ScoreReasons: []string{"service_channel_rebuild_applied"}, + ControlPlaneOnly: true, + ExpiresAt: now.Add(30 * time.Second), + }}, + }, + ServiceChannelFeedback: &client.FabricServiceChannelFeedbackReport{ + SchemaVersion: "c18n.fabric_service_channel_route_feedback_report.v1", + GeneratedAt: now, + FeedbackMaxAgeSeconds: 30, + ObservationCount: 1, + FencedRouteCount: 1, + Observations: []client.FabricServiceChannelFeedbackObservation{{ + ClusterID: "cluster-1", + ReporterNodeID: "node-a", + RouteID: "route-ab", + ServiceClass: "vpn_packets", + FeedbackStatus: "fenced", + ScoreAdjustment: -1000, + Reasons: []string{"route_rebuild_recommended"}, + ConsecutiveFailures: 2, + Payload: json.RawMessage(`{"route_rebuild_recommended":true}`), + ObservedAt: now, + ExpiresAt: now.Add(30 * time.Second), + }}, + }, + MeshListener: nil, + Routes: []client.SyntheticMeshRouteConfig{}, + ProductionForwarding: false, + } + configHash, err := syntheticMeshConfigAuthorityHash(remote) + if err != nil { + t.Fatalf("config hash: %v", err) + } + payload, err := json.Marshal(controlPlaneMeshConfigAuthorityPayload{ + SchemaVersion: "rap.cluster.mesh_config_snapshot.v1", + ClusterID: "cluster-1", + LocalNodeID: "node-a", + ConfigVersion: "config-v1", + ConfigSHA256: configHash, + IssuedAt: now, + ExpiresAt: now.Add(time.Hour), + ControlPlaneOnly: true, + ProductionForwarding: false, + }) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := agentauthority.CanonicalJSON(payload) + if err != nil { + t.Fatalf("canonical json: %v", err) + } + remote.AuthorityPayload = payload + remote.AuthoritySignature = &client.ClusterSignature{ + SchemaVersion: agentauthority.SignatureSchemaVersion, + Algorithm: agentauthority.AlgorithmEd25519, + KeyFingerprint: fingerprint, + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + SignedAt: now, + } + + err = verifyControlPlaneSyntheticMeshConfig(remote, 
state.Identity{ + ClusterID: "cluster-1", + NodeID: "node-a", + ClusterAuthorityPublicKey: publicKeyB64, + ClusterAuthorityFingerprint: fingerprint, + }, config.Config{}) + if err != nil { + t.Fatalf("verify control-plane synthetic mesh config: %v", err) + } +} + func TestVerifyEnrollmentBootstrapRejectsPinnedAuthorityMismatch(t *testing.T) { bootstrap := client.NodeBootstrap{ NodeID: "node-1", @@ -155,6 +479,54 @@ func TestVerifyEnrollmentBootstrapRejectsPinnedAuthorityMismatch(t *testing.T) { } } +func TestEnsureApprovedIdentityKeepsPollingWhenTimeoutDisabled(t *testing.T) { + var bootstrapPolls int + ctx, cancel := context.WithCancel(context.Background()) + defer cancel() + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.URL.Path == "/node-agents/enroll": + _ = json.NewEncoder(w).Encode(map[string]any{ + "status": "pending", + "join_request": map[string]any{"id": "join-request-1"}, + }) + case r.URL.Path == "/node-agents/enrollments/join-request-1/bootstrap": + bootstrapPolls++ + if bootstrapPolls >= 2 { + cancel() + } + _ = json.NewEncoder(w).Encode(map[string]any{ + "status": "pending", + "join_request": map[string]any{"id": "join-request-1"}, + }) + default: + http.NotFound(w, r) + } + })) + defer server.Close() + + dir := t.TempDir() + identity, err := state.LoadOrCreate(dir, "cluster-1", "node-a") + if err != nil { + t.Fatalf("load identity: %v", err) + } + _, err = ensureApprovedIdentity(ctx, config.Config{ + BackendURL: server.URL, + ClusterID: "cluster-1", + JoinToken: "join-token", + NodeName: "node-a", + StateDir: dir, + EnrollmentPollInterval: time.Millisecond, + EnrollmentPollTimeout: 0, + }, identity, client.New(server.URL)) + if err == nil || !strings.Contains(err.Error(), "context canceled") { + t.Fatalf("ensureApprovedIdentity err = %v, want context canceled", err) + } + if bootstrapPolls < 2 { + t.Fatalf("bootstrap polls = %d, want at least 2", bootstrapPolls) + } +} + func 
TestSyntheticQualityScoreIsBounded(t *testing.T) { cases := []struct { latency int @@ -209,6 +581,168 @@ func TestHeartbeatPayloadIncludesMeshEndpointReport(t *testing.T) { } } +func TestHeartbeatPayloadReportsMeshListenerFailureWithoutKillingHeartbeat(t *testing.T) { + now := time.Date(2026, 4, 30, 9, 0, 0, 0, time.UTC) + payload := heartbeatPayload(config.Config{ + MeshConnectivityMode: "private_lan", + }, state.Identity{ + ClusterID: "cluster-1", + NodeID: "node-a", + }, &syntheticMeshState{ + ListenerReport: meshListenerReport{ + SchemaVersion: "c17z21.mesh_listener_report.v1", + ConfiguredListenAddr: ":19131", + ListenPortMode: "manual", + Status: "listen_failed", + InboundReachability: "unavailable", + ControlPlaneReachable: true, + OneWayConnectivity: true, + FailureReason: "bind_failed", + FailureError: "listen tcp :19131: bind: address already in use", + PortConflict: true, + }, + }, now) + + report, ok := payload.Metadata["mesh_listener_report"].(meshListenerReport) + if !ok { + t.Fatalf("mesh listener report missing: %+v", payload.Metadata) + } + if payload.HealthStatus != "warning" || report.Status != "listen_failed" || !report.PortConflict { + t.Fatalf("unexpected listener health report: status=%s report=%+v", payload.HealthStatus, report) + } + if payload.Capabilities["mesh_listener_diagnostics"] != true || payload.Capabilities["mesh_one_way_connectivity"] != true { + t.Fatalf("listener capabilities missing: %+v", payload.Capabilities) + } +} + +func TestAdvertisedEndpointCandidatesPreferManualEndpoints(t *testing.T) { + now := time.Date(2026, 4, 30, 9, 0, 0, 0, time.UTC) + candidates, err := advertisedEndpointCandidates(config.Config{ + MeshAdvertiseEndpointsJSON: `[{"endpoint_id":"node-a-json","node_id":"node-a","transport":"direct_http","address":"http://10.10.10.10:19131","priority":12,"connectivity_mode":"private_lan","reachability":"private"}]`, + MeshAdvertiseEndpoint: "http://203.0.113.10:19131", + MeshAdvertiseTransport: "direct_http", + 
MeshConnectivityMode: "direct", + MeshNATType: "port_restricted", + MeshRegion: "edge", + }, state.Identity{ + ClusterID: "cluster-1", + NodeID: "node-a", + }, nil, now) + if err != nil { + t.Fatalf("advertised endpoint candidates failed: %v", err) + } + if len(candidates) != 2 { + t.Fatalf("expected two manual candidates, got %d: %+v", len(candidates), candidates) + } + if candidates[0].Address != "http://203.0.113.10:19131" || candidates[0].Priority != 10 { + t.Fatalf("explicit advertise endpoint must win: %+v", candidates) + } + if candidates[1].Address != "http://10.10.10.10:19131" || candidates[1].Priority != 12 { + t.Fatalf("json candidate order mismatch: %+v", candidates) + } +} + +func TestNetworkInterfaceClassificationSkipsContainerNoise(t *testing.T) { + tests := map[string]string{ + "ens160": "physical", + "wg0": "vpn", + "tailscale0": "vpn", + "docker0": "container", + "br-a1b2c3d4": "container", + "vethabc123": "container", + } + for name, want := range tests { + if got := classifyNetworkInterface(name); got != want { + t.Fatalf("classifyNetworkInterface(%q)=%q, want %q", name, got, want) + } + } +} + +func TestHeartbeatPayloadTreatsOutboundOnlyListenerFailureAsOneWayConnectivity(t *testing.T) { + payload := heartbeatPayload(config.Config{ + MeshSyntheticRuntimeEnabled: true, + MeshConnectivityMode: "outbound_only", + }, state.Identity{ + ClusterID: "cluster-1", + NodeID: "node-a", + }, &syntheticMeshState{ + ListenerReport: meshListenerReport{ + SchemaVersion: "c17z21.mesh_listener_report.v1", + ConfiguredListenAddr: ":19131", + ListenPortMode: "manual", + Status: "listen_failed", + InboundReachability: "unavailable", + ControlPlaneReachable: true, + OneWayConnectivity: true, + FailureReason: "bind_failed", + }, + }, time.Date(2026, 4, 30, 9, 0, 0, 0, time.UTC)) + + if payload.HealthStatus != "healthy" { + t.Fatalf("HealthStatus = %q, want healthy for outbound-only listener failure", payload.HealthStatus) + } + report, ok := 
payload.Metadata["mesh_outbound_session_report"].(meshOutboundSessionReport) + if !ok { + t.Fatalf("mesh outbound session report missing: %+v", payload.Metadata) + } + if report.Status != "ready" || !report.UsableForInboundControl || report.ListenerStatus != "listen_failed" { + t.Fatalf("unexpected outbound session report: %+v", report) + } + if payload.Capabilities["mesh_outbound_control_session"] != true || + payload.Capabilities["mesh_reverse_control_channel_contract"] != true { + t.Fatalf("outbound session capabilities missing: %+v", payload.Capabilities) + } +} + +func TestHeartbeatPayloadReportsMeshConfigLoadFailureWithoutDroppingPresence(t *testing.T) { + payload := heartbeatPayload(config.Config{ + MeshSyntheticRuntimeEnabled: true, + MeshConnectivityMode: "private_lan", + }, state.Identity{ + ClusterID: "cluster-1", + NodeID: "node-a", + }, &syntheticMeshState{ + ConfigLoadError: "control-plane synthetic mesh config unavailable", + ListenerReport: meshListenerReport{ + SchemaVersion: "c17z21.mesh_listener_report.v1", + ConfiguredListenAddr: ":19131", + ListenPortMode: "manual", + Status: "listening", + InboundReachability: "private", + ControlPlaneReachable: true, + }, + }, time.Date(2026, 4, 30, 9, 0, 0, 0, time.UTC)) + + report, ok := payload.Metadata["mesh_outbound_session_report"].(meshOutboundSessionReport) + if !ok { + t.Fatalf("mesh outbound session report missing: %+v", payload.Metadata) + } + if payload.HealthStatus != "warning" || report.Status != "degraded" || report.ConfigLoadError == "" { + t.Fatalf("unexpected config-load diagnostic heartbeat: health=%s report=%+v", payload.HealthStatus, report) + } +} + +func TestOutboundSessionReportTreatsListeningPrivateLANAsUsable(t *testing.T) { + report := meshOutboundSessionReportFromState(config.Config{ + BackendURL: "http://control/api/v1", + MeshConnectivityMode: "private_lan", + MeshSyntheticRuntimeEnabled: true, + }, &syntheticMeshState{ + ListenerReport: meshListenerReport{ + SchemaVersion: 
"c17z21.mesh_listener_report.v1", + Status: "listening", + InboundReachability: reachabilityFromConnectivityMode("private_lan"), + }, + }, time.Date(2026, 4, 30, 9, 0, 0, 0, time.UTC)) + + if !report.UsableForInboundControl { + t.Fatalf("listening private LAN listener must be usable: %+v", report) + } + if reachabilityFromConnectivityMode("private_lan") != "private" { + t.Fatalf("private_lan reachability mismatch") + } +} + func TestHeartbeatPayloadReportsMultipleMeshEndpoints(t *testing.T) { payload := heartbeatPayload(config.Config{ MeshAdvertiseEndpointsJSON: `[{ @@ -1050,17 +1584,36 @@ func TestProductionEnvelopeObservationSinkFromConfigCreatesBoundedSink(t *testin func TestProductionForwardingLogStateDistinguishesGateFromRuntime(t *testing.T) { gateEnabled, runtimeEnabled := productionForwardingLogState(config.Config{ MeshProductionForwardingEnabled: true, - }) + }, false) if !gateEnabled { t.Fatal("gateEnabled = false, want true") } if !runtimeEnabled { t.Fatal("runtimeEnabled = false, want true") } - gateEnabled, runtimeEnabled = productionForwardingLogState(config.Config{}) + gateEnabled, runtimeEnabled = productionForwardingLogState(config.Config{}, false) if gateEnabled || runtimeEnabled { t.Fatalf("default log state = gate:%t runtime:%t, want false/false", gateEnabled, runtimeEnabled) } + gateEnabled, runtimeEnabled = productionForwardingLogState(config.Config{}, true) + if !gateEnabled || !runtimeEnabled { + t.Fatalf("signed control-plane log state = gate:%t runtime:%t, want true/true", gateEnabled, runtimeEnabled) + } +} + +func TestMeshLinkStatusFromPeerProbeMapsDeferredForLatestLinks(t *testing.T) { + cases := map[string]string{ + mesh.PeerConnectionProbeReachable: "reachable", + mesh.PeerConnectionProbeUnreachable: "unreachable", + mesh.PeerConnectionProbeDeferred: "degraded", + mesh.PeerConnectionProbeSkipped: "unknown", + "unexpected": "unknown", + } + for input, want := range cases { + if got := meshLinkStatusFromPeerProbe(input); got != want { + 
t.Fatalf("meshLinkStatusFromPeerProbe(%q) = %q, want %q", input, got, want) + } + } } func TestLogProductionObservationSinkMetricsToleratesNilState(t *testing.T) { diff --git a/agents/rap-node-agent/go.mod b/agents/rap-node-agent/go.mod index 31bd64b..c63200c 100644 --- a/agents/rap-node-agent/go.mod +++ b/agents/rap-node-agent/go.mod @@ -1,3 +1,14 @@ module github.com/example/remote-access-platform/agents/rap-node-agent -go 1.23.2 +go 1.25.5 + +require golang.zx2c4.com/wireguard v0.0.0-20250521234502-f333402bd9cb + +require ( + github.com/gorilla/websocket v1.5.3 // indirect + golang.org/x/net v0.53.0 // indirect + golang.org/x/sys v0.43.0 // indirect + golang.org/x/time v0.15.0 // indirect + golang.zx2c4.com/wintun v0.0.0-20230126152724-0fa3db229ce2 // indirect + gvisor.dev/gvisor v0.0.0-20260505022556-2306ef3db943 // indirect +) diff --git a/agents/rap-node-agent/go.sum b/agents/rap-node-agent/go.sum new file mode 100644 index 0000000..782c60c --- /dev/null +++ b/agents/rap-node-agent/go.sum @@ -0,0 +1,16 @@ +github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg= +github.com/gorilla/websocket v1.5.3/go.mod h1:YR8l580nyteQvAITg2hZ9XVh4b55+EU/adAjf1fMHhE= +golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa h1:FRnLl4eNAQl8hwxVVC17teOw8kdjVDVAiFMtgUdTSRQ= +golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa/go.mod h1:zk2irFbV9DP96SEBUUAy67IdHUaZuSnrz1n472HUCLE= +golang.org/x/net v0.53.0 h1:d+qAbo5L0orcWAr0a9JweQpjXF19LMXJE8Ey7hwOdUA= +golang.org/x/net v0.53.0/go.mod h1:JvMuJH7rrdiCfbeHoo3fCQU24Lf5JJwT9W3sJFulfgs= +golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI= +golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw= +golang.org/x/time v0.15.0 h1:bbrp8t3bGUeFOx08pvsMYRTCVSMk89u4tKbNOZbp88U= +golang.org/x/time v0.15.0/go.mod h1:Y4YMaQmXwGQZoFaVFk4YpCt4FLQMYKZe9oeV/f4MSno= +golang.zx2c4.com/wintun v0.0.0-20230126152724-0fa3db229ce2 h1:B82qJJgjvYKsXS9jeunTOisW56dUokqW/FOteYJJ/yg= 
+golang.zx2c4.com/wintun v0.0.0-20230126152724-0fa3db229ce2/go.mod h1:deeaetjYA+DHMHg+sMSMI58GrEteJUUzzw7en6TJQcI= +golang.zx2c4.com/wireguard v0.0.0-20250521234502-f333402bd9cb h1:whnFRlWMcXI9d+ZbWg+4sHnLp52d5yiIPUxMBSt4X9A= +golang.zx2c4.com/wireguard v0.0.0-20250521234502-f333402bd9cb/go.mod h1:rpwXGsirqLqN2L0JDJQlwOboGHmptD5ZD6T2VmcqhTw= +gvisor.dev/gvisor v0.0.0-20260505022556-2306ef3db943 h1:YUPk0vGbex2+Jk7XXIgLIPG6oEAD9ml0x7wd6i/bmA4= +gvisor.dev/gvisor v0.0.0-20260505022556-2306ef3db943/go.mod h1:xQ2PWgHmWJA/Ph4i1q1jBm39BKhc3W0DXqWoDSyuBOY= diff --git a/agents/rap-node-agent/internal/agent/payload.go b/agents/rap-node-agent/internal/agent/payload.go index 9f4430f..50625c7 100644 --- a/agents/rap-node-agent/internal/agent/payload.go +++ b/agents/rap-node-agent/internal/agent/payload.go @@ -7,7 +7,7 @@ import ( "github.com/example/remote-access-platform/agents/rap-node-agent/internal/state" ) -const Version = "0.1.0-c3" +const Version = "0.2.256-c18z82" func EnrollmentPayload(clusterID, joinToken string, identity state.Identity) client.EnrollRequest { return client.EnrollRequest{ @@ -17,18 +17,26 @@ func EnrollmentPayload(clusterID, joinToken string, identity state.Identity) cli NodeFingerprint: identity.NodeFingerprint, PublicKey: identity.PublicKey, ReportedCapabilities: map[string]any{ - "can_accept_client_ingress": false, - "can_accept_node_ingress": false, - "can_route_mesh": false, - "can_run_rdp_worker": true, - "can_run_vnc_worker": false, - "can_run_vpn_exit": false, - "can_run_vpn_connector": false, - "can_run_file_cache": false, - "can_run_update_cache": false, - "can_run_video_relay": false, - "native_node_agent_version": Version, - "service_supervision_enabled": false, + "can_accept_client_ingress": false, + "can_accept_node_ingress": false, + "can_route_mesh": false, + "can_run_rdp_worker": true, + "can_run_vnc_worker": false, + "can_run_vpn_exit": true, + "can_run_vpn_connector": true, + "can_run_file_cache": false, + "can_run_update_cache": 
false, + "can_run_video_relay": false, + "native_node_agent_version": Version, + "node_update_plan_contract": "rap.node_update_plan.v1", + "node_update_status_report": true, + "host_agent_update_required": true, + "service_supervision_enabled": false, + "vpn_assignment_status": true, + "vpn_packet_forwarding": true, + "vpn_fabric_packet_transport": true, + "vpn_local_gateway_shortcut": true, + "external_backend_entry_proxy": true, }, ReportedFacts: map[string]any{ "os": runtime.GOOS, @@ -45,13 +53,28 @@ func HeartbeatPayload() client.HeartbeatRequest { HealthStatus: "healthy", ReportedVersion: Version, Capabilities: map[string]any{ - "native_node_agent": true, + "native_node_agent": true, + "node_update_plan_contract": "rap.node_update_plan.v1", + "node_update_status_report": true, + "vpn_assignment_status": true, + "vpn_packet_forwarding": true, + "vpn_fabric_packet_transport": true, + "vpn_local_gateway_shortcut": true, + "external_backend_entry_proxy": true, }, ServiceStates: map[string]any{ "workload_supervision": "not_implemented_c3", }, Metadata: map[string]any{ "stage": "c3", + "update_runtime": map[string]any{ + "product": "rap-node-agent", + "current_version": Version, + "host_agent_present": true, + "self_update_enabled": true, + "rollback_executor_ready": true, + "reason": "host-agent updater active", + }, }, } } diff --git a/agents/rap-node-agent/internal/client/client.go b/agents/rap-node-agent/internal/client/client.go index d708639..98fc616 100644 --- a/agents/rap-node-agent/internal/client/client.go +++ b/agents/rap-node-agent/internal/client/client.go @@ -260,6 +260,7 @@ type SyntheticMeshRouteConfig struct { } type SyntheticMeshConfig struct { + Raw json.RawMessage `json:"-"` Enabled bool `json:"enabled"` SchemaVersion string `json:"schema_version"` ClusterID string `json:"cluster_id"` @@ -286,6 +287,17 @@ type SyntheticMeshConfig struct { ProductionForwarding bool `json:"production_forwarding"` } +func (c *SyntheticMeshConfig) UnmarshalJSON(data 
[]byte) error { + type syntheticMeshConfigAlias SyntheticMeshConfig + var decoded syntheticMeshConfigAlias + if err := json.Unmarshal(data, &decoded); err != nil { + return err + } + *c = SyntheticMeshConfig(decoded) + c.Raw = append(c.Raw[:0], data...) + return nil +} + type FabricServiceChannelRemediationCommand struct { SchemaVersion string `json:"schema_version"` CommandID string `json:"command_id"` diff --git a/agents/rap-node-agent/internal/config/config.go b/agents/rap-node-agent/internal/config/config.go index 0af0625..cc648ee 100644 --- a/agents/rap-node-agent/internal/config/config.go +++ b/agents/rap-node-agent/internal/config/config.go @@ -28,6 +28,9 @@ type Config struct { MeshProductionForwardingEnabled bool MeshProductionObservationSinkCapacity int MeshListenAddr string + MeshListenPortMode string + MeshListenAutoPortStart int + MeshListenAutoPortEnd int MeshAdvertiseEndpoint string MeshAdvertiseEndpointsJSON string MeshAdvertiseTransport string @@ -58,6 +61,9 @@ func Load(args []string, env map[string]string) (Config, error) { fs.BoolVar(&cfg.MeshProductionForwardingEnabled, "mesh-production-forwarding-enabled", getEnvBool(env, "RAP_MESH_PRODUCTION_FORWARDING_ENABLED", false), "Enable production fabric-control direct next-hop forwarding gate. Disabled by default.") fs.IntVar(&cfg.MeshProductionObservationSinkCapacity, "mesh-production-observation-sink-capacity", getEnvSignedInt(env, "RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY", 0), "Bounded local metadata-only production envelope observation sink capacity. 
Disabled when 0.") fs.StringVar(&cfg.MeshListenAddr, "mesh-listen-addr", getEnv(env, "RAP_MESH_LISTEN_ADDR", ""), "Listen address for disabled-by-default C17E synthetic mesh HTTP endpoint.") + fs.StringVar(&cfg.MeshListenPortMode, "mesh-listen-port-mode", getEnv(env, "RAP_MESH_LISTEN_PORT_MODE", "manual"), "Mesh listen port behavior: manual, auto, or disabled.") + fs.IntVar(&cfg.MeshListenAutoPortStart, "mesh-listen-auto-port-start", getEnvInt(env, "RAP_MESH_LISTEN_AUTO_PORT_START", 19131), "First port used when mesh listen port mode is auto.") + fs.IntVar(&cfg.MeshListenAutoPortEnd, "mesh-listen-auto-port-end", getEnvInt(env, "RAP_MESH_LISTEN_AUTO_PORT_END", 19231), "Last port used when mesh listen port mode is auto.") fs.StringVar(&cfg.MeshAdvertiseEndpoint, "mesh-advertise-endpoint", getEnv(env, "RAP_MESH_ADVERTISE_ENDPOINT", ""), "Advertised mesh endpoint reported to the Control Plane. Empty disables endpoint reporting.") fs.StringVar(&cfg.MeshAdvertiseEndpointsJSON, "mesh-advertise-endpoints-json", getEnv(env, "RAP_MESH_ADVERTISE_ENDPOINTS_JSON", ""), "JSON array of advertised mesh endpoint candidates, including private/corporate endpoints.") fs.StringVar(&cfg.MeshAdvertiseTransport, "mesh-advertise-transport", getEnv(env, "RAP_MESH_ADVERTISE_TRANSPORT", "direct_tcp_tls"), "Transport label for the advertised mesh endpoint.") @@ -70,7 +76,7 @@ func Load(args []string, env map[string]string) (Config, error) { heartbeatSeconds := getEnvInt(env, "RAP_HEARTBEAT_INTERVAL_SECONDS", 15) fs.DurationVar(&cfg.HeartbeatInterval, "heartbeat-interval", time.Duration(heartbeatSeconds)*time.Second, "Heartbeat interval.") enrollmentPollIntervalSeconds := getEnvInt(env, "RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5) - enrollmentPollTimeoutSeconds := getEnvInt(env, "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 600) + enrollmentPollTimeoutSeconds := getEnvSignedInt(env, "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 0) fs.DurationVar(&cfg.EnrollmentPollInterval, "enrollment-poll-interval", 
time.Duration(enrollmentPollIntervalSeconds)*time.Second, "Enrollment approval polling interval.") fs.DurationVar(&cfg.EnrollmentPollTimeout, "enrollment-poll-timeout", time.Duration(enrollmentPollTimeoutSeconds)*time.Second, "Enrollment approval polling timeout.") if err := fs.Parse(args); err != nil { @@ -84,6 +90,7 @@ func Load(args []string, env map[string]string) (Config, error) { cfg.NodeName = strings.TrimSpace(cfg.NodeName) cfg.StateDir = strings.TrimSpace(cfg.StateDir) cfg.MeshListenAddr = strings.TrimSpace(cfg.MeshListenAddr) + cfg.MeshListenPortMode = strings.ToLower(strings.TrimSpace(cfg.MeshListenPortMode)) cfg.MeshAdvertiseEndpoint = strings.TrimRight(strings.TrimSpace(cfg.MeshAdvertiseEndpoint), "/") cfg.MeshAdvertiseEndpointsJSON = strings.TrimSpace(cfg.MeshAdvertiseEndpointsJSON) cfg.MeshAdvertiseTransport = strings.TrimSpace(cfg.MeshAdvertiseTransport) @@ -117,6 +124,20 @@ func Load(args []string, env map[string]string) (Config, error) { if cfg.MeshProductionObservationSinkCapacity > MaxMeshProductionObservationSinkCapacity { return Config{}, errors.New("mesh production observation sink capacity exceeds maximum") } + switch cfg.MeshListenPortMode { + case "", "manual", "auto", "disabled": + if cfg.MeshListenPortMode == "" { + cfg.MeshListenPortMode = "manual" + } + default: + return Config{}, errors.New("mesh listen port mode must be manual, auto, or disabled") + } + if cfg.MeshListenAutoPortStart <= 0 || cfg.MeshListenAutoPortEnd <= 0 { + return Config{}, errors.New("mesh listen auto port range must be positive") + } + if cfg.MeshListenAutoPortStart > cfg.MeshListenAutoPortEnd { + return Config{}, errors.New("mesh listen auto port start must be less than or equal to end") + } return cfg, nil } diff --git a/agents/rap-node-agent/internal/config/config_test.go b/agents/rap-node-agent/internal/config/config_test.go index 19d8fba..100d5ad 100644 --- a/agents/rap-node-agent/internal/config/config_test.go +++ 
b/agents/rap-node-agent/internal/config/config_test.go @@ -22,6 +22,9 @@ func TestLoadConfigFromEnvAndArgs(t *testing.T) { "RAP_MESH_PRODUCTION_FORWARDING_ENABLED": "true", "RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY": "5", "RAP_MESH_LISTEN_ADDR": "127.0.0.1:19001", + "RAP_MESH_LISTEN_PORT_MODE": "auto", + "RAP_MESH_LISTEN_AUTO_PORT_START": "19010", + "RAP_MESH_LISTEN_AUTO_PORT_END": "19020", "RAP_MESH_ADVERTISE_ENDPOINT": "https://node-a.example.test:443/", "RAP_MESH_ADVERTISE_ENDPOINTS_JSON": `[{"endpoint_id":"node-a-lan","address":"10.10.0.20:19001"}]`, "RAP_MESH_ADVERTISE_TRANSPORT": "wss", @@ -65,6 +68,9 @@ func TestLoadConfigFromEnvAndArgs(t *testing.T) { if cfg.MeshListenAddr != "127.0.0.1:19001" { t.Fatalf("MeshListenAddr = %q", cfg.MeshListenAddr) } + if cfg.MeshListenPortMode != "auto" || cfg.MeshListenAutoPortStart != 19010 || cfg.MeshListenAutoPortEnd != 19020 { + t.Fatalf("unexpected mesh listen port config: %+v", cfg) + } if cfg.MeshAdvertiseEndpoint != "https://node-a.example.test:443" || cfg.MeshAdvertiseEndpointsJSON == "" || cfg.MeshAdvertiseTransport != "wss" || @@ -81,6 +87,19 @@ func TestLoadConfigFromEnvAndArgs(t *testing.T) { } } +func TestLoadConfigDefaultsEnrollmentPollingToNoTimeout(t *testing.T) { + cfg, err := Load(nil, map[string]string{ + "RAP_BACKEND_URL": "http://backend/api/v1", + "RAP_NODE_NAME": "node-a", + }) + if err != nil { + t.Fatalf("load config: %v", err) + } + if cfg.EnrollmentPollTimeout != 0 { + t.Fatalf("EnrollmentPollTimeout = %s, want no timeout", cfg.EnrollmentPollTimeout) + } +} + func TestLoadConfigRejectsNegativeProductionObservationSinkCapacity(t *testing.T) { _, err := Load(nil, map[string]string{ "RAP_BACKEND_URL": "http://backend/api/v1", diff --git a/agents/rap-node-agent/internal/hostagent/config.go b/agents/rap-node-agent/internal/hostagent/config.go new file mode 100644 index 0000000..341bd44 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/config.go @@ -0,0 +1,135 @@ +package hostagent + 
+import ( + "errors" + "fmt" + "strings" +) + +const ( + DefaultContainerName = "rap-node-agent" + DefaultImage = "rap-node-agent:latest" + DefaultStateDir = "/var/lib/rap-node-agent" + DefaultNetwork = "host" +) + +type RuntimeConfig struct { + BackendURL string + ClusterID string + JoinToken string + NodeName string + Image string + ContainerName string + StateDir string + Network string + RestartPolicy string + PullImage bool + Replace bool + DockerVPNGatewayEnabled bool + WorkloadSupervisionEnabled bool + MeshSyntheticRuntimeEnabled bool + MeshProductionForwardingEnabled bool + MeshListenAddr string + MeshListenPortMode string + MeshListenAutoPortStart int + MeshListenAutoPortEnd int + MeshAdvertiseEndpoint string + MeshAdvertiseEndpointsJSON string + MeshAdvertiseTransport string + MeshConnectivityMode string + MeshNATType string + MeshRegion string + HeartbeatIntervalSeconds int + EnrollmentPollIntervalSeconds int + EnrollmentPollTimeoutSeconds int + ExtraEnv []string + AdditionalDockerRunArgs []string + ProductionObservationSinkCap int + ImageArtifactURLs []string + ImageArtifactSHA256 string + ImageArtifactSizeBytes int64 +} + +func (cfg RuntimeConfig) Normalize() RuntimeConfig { + cfg.BackendURL = strings.TrimRight(strings.TrimSpace(cfg.BackendURL), "/") + cfg.ClusterID = strings.TrimSpace(cfg.ClusterID) + cfg.JoinToken = strings.TrimSpace(cfg.JoinToken) + cfg.NodeName = strings.TrimSpace(cfg.NodeName) + cfg.Image = firstNonEmpty(cfg.Image, DefaultImage) + cfg.ContainerName = firstNonEmpty(cfg.ContainerName, DefaultContainerName) + cfg.StateDir = firstNonEmpty(cfg.StateDir, DefaultStateDir) + cfg.Network = firstNonEmpty(cfg.Network, DefaultNetwork) + cfg.RestartPolicy = firstNonEmpty(cfg.RestartPolicy, "unless-stopped") + cfg.MeshListenAddr = strings.TrimSpace(cfg.MeshListenAddr) + cfg.MeshListenPortMode = strings.ToLower(strings.TrimSpace(cfg.MeshListenPortMode)) + cfg.MeshAdvertiseEndpoint = strings.TrimRight(strings.TrimSpace(cfg.MeshAdvertiseEndpoint), 
"/") + cfg.MeshAdvertiseEndpointsJSON = strings.TrimSpace(cfg.MeshAdvertiseEndpointsJSON) + cfg.MeshAdvertiseTransport = strings.TrimSpace(cfg.MeshAdvertiseTransport) + cfg.MeshConnectivityMode = strings.TrimSpace(cfg.MeshConnectivityMode) + cfg.MeshNATType = strings.TrimSpace(cfg.MeshNATType) + cfg.MeshRegion = strings.TrimSpace(cfg.MeshRegion) + cfg.ImageArtifactSHA256 = strings.TrimSpace(cfg.ImageArtifactSHA256) + if cfg.HeartbeatIntervalSeconds == 0 { + cfg.HeartbeatIntervalSeconds = 15 + } + if cfg.EnrollmentPollIntervalSeconds == 0 { + cfg.EnrollmentPollIntervalSeconds = 5 + } + return cfg +} + +func (cfg RuntimeConfig) ValidateInstall() error { + cfg = cfg.Normalize() + var missing []string + if cfg.BackendURL == "" { + missing = append(missing, "backend-url") + } + if cfg.ClusterID == "" { + missing = append(missing, "cluster-id") + } + if cfg.NodeName == "" { + missing = append(missing, "node-name") + } + if len(missing) > 0 { + return fmt.Errorf("missing required install settings: %s", strings.Join(missing, ", ")) + } + if cfg.JoinToken == "" && !cfg.Replace { + return errors.New("join-token is required for first install; pass -replace only when updating an already enrolled local state") + } + if cfg.HeartbeatIntervalSeconds <= 0 { + return errors.New("heartbeat interval must be positive") + } + if cfg.EnrollmentPollIntervalSeconds <= 0 { + return errors.New("enrollment poll interval must be positive") + } + if cfg.EnrollmentPollTimeoutSeconds < 0 { + return errors.New("enrollment poll timeout must not be negative") + } + switch cfg.MeshListenPortMode { + case "", "manual", "auto", "disabled": + default: + return errors.New("mesh listen port mode must be manual, auto, or disabled") + } + if cfg.MeshListenAutoPortStart < 0 || cfg.MeshListenAutoPortEnd < 0 { + return errors.New("mesh listen auto port range must not be negative") + } + if cfg.MeshListenAutoPortStart > 0 && cfg.MeshListenAutoPortEnd > 0 && cfg.MeshListenAutoPortStart > 
cfg.MeshListenAutoPortEnd { + return errors.New("mesh listen auto port start must be less than or equal to end") + } + if cfg.ProductionObservationSinkCap < 0 { + return errors.New("production observation sink capacity must not be negative") + } + for _, item := range cfg.ExtraEnv { + if !strings.Contains(item, "=") { + return fmt.Errorf("extra env %q must be KEY=VALUE", item) + } + } + return nil +} + +func firstNonEmpty(value, fallback string) string { + if strings.TrimSpace(value) == "" { + return fallback + } + return strings.TrimSpace(value) +} diff --git a/agents/rap-node-agent/internal/hostagent/docker.go b/agents/rap-node-agent/internal/hostagent/docker.go new file mode 100644 index 0000000..4eab7e3 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/docker.go @@ -0,0 +1,335 @@ +package hostagent + +import ( + "context" + "crypto/sha256" + "encoding/hex" + "fmt" + "io" + "net/http" + "os" + "os/exec" + "path/filepath" + "strconv" + "strings" +) + +type CommandRunner interface { + Run(ctx context.Context, name string, args ...string) (string, error) +} + +type ExecRunner struct{} + +func (ExecRunner) Run(ctx context.Context, name string, args ...string) (string, error) { + cmd := exec.CommandContext(ctx, name, args...) 
+ out, err := cmd.CombinedOutput() + if err != nil { + return string(out), fmt.Errorf("%s %s: %w\n%s", name, strings.Join(args, " "), err, strings.TrimSpace(string(out))) + } + return string(out), nil +} + +type DockerManager struct { + Runner CommandRunner + Binary string +} + +var statHostPath = os.Stat + +type InstallResult struct { + ContainerName string + Image string + Replaced bool + Pulled bool + Loaded bool + ContainerID string +} + +func (m DockerManager) Install(ctx context.Context, cfg RuntimeConfig) (InstallResult, error) { + if err := cfg.ValidateInstall(); err != nil { + return InstallResult{}, err + } + cfg = cfg.Normalize() + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + docker := firstNonEmpty(m.Binary, "docker") + result := InstallResult{ContainerName: cfg.ContainerName, Image: cfg.Image} + + if err := PrepareStateDir(cfg.StateDir); err != nil { + return result, err + } + if cfg.DockerVPNGatewayEnabled { + if err := ensureHostTunDevice(ctx, runner); err != nil { + return result, err + } + } + + if cfg.PullImage { + if _, err := runner.Run(ctx, docker, "pull", cfg.Image); err != nil { + return result, err + } + result.Pulled = true + } else if len(cfg.ImageArtifactURLs) > 0 { + loaded, err := m.ensureImageFromArtifact(ctx, runner, docker, cfg) + if err != nil { + return result, err + } + result.Loaded = loaded + } + + if cfg.Replace { + if _, err := runner.Run(ctx, docker, "rm", "-f", cfg.ContainerName); err != nil && !isNoSuchContainerError(err) { + return result, err + } + result.Replaced = true + } + + args := DockerRunArgs(cfg) + out, err := runner.Run(ctx, docker, args...) 
+ if err != nil { + return result, err + } + result.ContainerID = strings.TrimSpace(out) + return result, nil +} + +func ensureHostTunDevice(ctx context.Context, runner CommandRunner) error { + if _, err := statHostPath("/dev/net/tun"); err == nil { + return nil + } + if _, err := runner.Run(ctx, "modprobe", "tun"); err != nil { + return fmt.Errorf("docker vpn gateway requires host /dev/net/tun; modprobe tun failed: %w", err) + } + if _, err := statHostPath("/dev/net/tun"); err != nil { + return fmt.Errorf("docker vpn gateway requires host /dev/net/tun after modprobe tun: %w", err) + } + return nil +} + +func (m DockerManager) ensureImageFromArtifact(ctx context.Context, runner CommandRunner, docker string, cfg RuntimeConfig) (bool, error) { + if _, err := runner.Run(ctx, docker, "image", "inspect", cfg.Image); err == nil && !cfg.Replace { + return false, nil + } + path, err := downloadFirstArtifact(ctx, cfg.ImageArtifactURLs, cfg.ImageArtifactSHA256, cfg.ImageArtifactSizeBytes) + if err != nil { + return false, err + } + defer os.Remove(path) + if _, err := runner.Run(ctx, docker, "load", "-i", path); err != nil { + return false, err + } + if _, err := runner.Run(ctx, docker, "image", "inspect", cfg.Image); err != nil { + return true, fmt.Errorf("loaded artifact but image %q is not available: %w", cfg.Image, err) + } + return true, nil +} + +func downloadFirstArtifact(ctx context.Context, urls []string, expectedSHA256 string, expectedSizeBytes int64) (string, error) { + var lastErr error + for _, rawURL := range urls { + rawURL = strings.TrimSpace(rawURL) + if rawURL == "" { + continue + } + for attempt := 1; attempt <= 3; attempt++ { + path, err := downloadArtifact(ctx, rawURL, expectedSHA256, expectedSizeBytes) + if err == nil { + return path, nil + } + lastErr = err + } + } + if lastErr != nil { + return "", lastErr + } + return "", fmt.Errorf("no artifact URLs configured") +} + +func downloadArtifact(ctx context.Context, rawURL, expectedSHA256 string, 
expectedSizeBytes int64) (string, error) { + req, err := http.NewRequestWithContext(ctx, http.MethodGet, rawURL, nil) + if err != nil { + return "", err + } + resp, err := http.DefaultClient.Do(req) + if err != nil { + return "", fmt.Errorf("download artifact %s: %w", rawURL, err) + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return "", fmt.Errorf("download artifact %s: %s", rawURL, resp.Status) + } + file, err := os.CreateTemp("", "rap-docker-image-*.tar") + if err != nil { + return "", err + } + path := file.Name() + hasher := sha256.New() + written, copyErr := io.Copy(io.MultiWriter(file, hasher), resp.Body) + closeErr := file.Close() + if copyErr != nil { + os.Remove(path) + return "", copyErr + } + if closeErr != nil { + os.Remove(path) + return "", closeErr + } + if resp.ContentLength >= 0 && written != resp.ContentLength { + os.Remove(path) + return "", fmt.Errorf("artifact download truncated for %s: got %d bytes want content-length %d", rawURL, written, resp.ContentLength) + } + if expectedSizeBytes > 0 && written != expectedSizeBytes { + if strings.TrimSpace(expectedSHA256) != "" { + os.Remove(path) + return "", fmt.Errorf("artifact size mismatch for %s: got %d bytes want %d", rawURL, written, expectedSizeBytes) + } + fmt.Printf("artifact size mismatch for %s: got %d bytes want %d; proceeding without checksum for backward-compatible installs\n", rawURL, written, expectedSizeBytes) + } + actual := hex.EncodeToString(hasher.Sum(nil)) + if expected := strings.TrimSpace(expectedSHA256); expected != "" && !strings.EqualFold(actual, expected) { + os.Remove(path) + return "", fmt.Errorf("artifact checksum mismatch for %s: got %s want %s", rawURL, actual, expected) + } + return path, nil +} + +func (m DockerManager) Status(ctx context.Context, containerName string) (string, error) { + containerName = firstNonEmpty(containerName, DefaultContainerName) + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + 
docker := firstNonEmpty(m.Binary, "docker") + return runner.Run(ctx, docker, "ps", "-a", "--filter", "name=^/"+containerName+"$", "--format", "{{.Names}}\t{{.Image}}\t{{.Status}}") +} + +func PrepareStateDir(stateDir string) error { + stateDir = strings.TrimSpace(stateDir) + if stateDir == "" || !looksLikeHostPath(stateDir) { + return nil + } + if err := os.MkdirAll(stateDir, 0o777); err != nil { + return fmt.Errorf("prepare state dir %q: %w", stateDir, err) + } + if err := os.Chmod(stateDir, 0o777); err != nil { + if isAccessDenied(err) { + return nil + } + return fmt.Errorf("chmod state dir %q: %w", stateDir, err) + } + return nil +} + +func DockerRunArgs(cfg RuntimeConfig) []string { + cfg = cfg.Normalize() + args := []string{ + "run", "-d", + "--name", cfg.ContainerName, + "--restart", cfg.RestartPolicy, + "--network", cfg.Network, + "-v", cfg.StateDir + ":/var/lib/rap-node-agent", + } + if cfg.DockerVPNGatewayEnabled { + args = append(args, + "--privileged", + "--cap-add", "NET_ADMIN", + "--device", "/dev/net/tun:/dev/net/tun", + ) + } + args = append(args, cfg.AdditionalDockerRunArgs...) 
+ for _, env := range NodeAgentEnv(cfg) { + args = append(args, "-e", env) + } + args = append(args, cfg.Image) + return args +} + +func NodeAgentEnv(cfg RuntimeConfig) []string { + return NodeAgentEnvWithStateDir(cfg, "/var/lib/rap-node-agent") +} + +func NodeAgentEnvWithStateDir(cfg RuntimeConfig, stateDir string) []string { + cfg = cfg.Normalize() + stateDir = firstNonEmpty(stateDir, cfg.StateDir) + env := []string{ + "RAP_BACKEND_URL=" + cfg.BackendURL, + "RAP_CLUSTER_ID=" + cfg.ClusterID, + "RAP_NODE_NAME=" + cfg.NodeName, + "RAP_NODE_STATE_DIR=" + stateDir, + "RAP_HEARTBEAT_INTERVAL_SECONDS=" + strconv.Itoa(cfg.HeartbeatIntervalSeconds), + "RAP_ENROLLMENT_POLL_INTERVAL_SECONDS=" + strconv.Itoa(cfg.EnrollmentPollIntervalSeconds), + "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS=" + strconv.Itoa(cfg.EnrollmentPollTimeoutSeconds), + "RAP_WORKLOAD_SUPERVISION_ENABLED=" + boolString(cfg.WorkloadSupervisionEnabled), + "RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=" + boolString(cfg.MeshSyntheticRuntimeEnabled), + "RAP_MESH_PRODUCTION_FORWARDING_ENABLED=" + boolString(cfg.MeshProductionForwardingEnabled), + } + if cfg.JoinToken != "" { + env = append(env, "RAP_JOIN_TOKEN="+cfg.JoinToken) + } + if cfg.MeshListenAddr != "" { + env = append(env, "RAP_MESH_LISTEN_ADDR="+cfg.MeshListenAddr) + } + if cfg.MeshListenPortMode != "" { + env = append(env, "RAP_MESH_LISTEN_PORT_MODE="+cfg.MeshListenPortMode) + } + if cfg.MeshListenAutoPortStart > 0 { + env = append(env, "RAP_MESH_LISTEN_AUTO_PORT_START="+strconv.Itoa(cfg.MeshListenAutoPortStart)) + } + if cfg.MeshListenAutoPortEnd > 0 { + env = append(env, "RAP_MESH_LISTEN_AUTO_PORT_END="+strconv.Itoa(cfg.MeshListenAutoPortEnd)) + } + if cfg.MeshAdvertiseEndpoint != "" { + env = append(env, "RAP_MESH_ADVERTISE_ENDPOINT="+cfg.MeshAdvertiseEndpoint) + } + if cfg.MeshAdvertiseEndpointsJSON != "" { + env = append(env, "RAP_MESH_ADVERTISE_ENDPOINTS_JSON="+cfg.MeshAdvertiseEndpointsJSON) + } + if cfg.MeshAdvertiseTransport != "" { + env = append(env, 
"RAP_MESH_ADVERTISE_TRANSPORT="+cfg.MeshAdvertiseTransport) + } + if cfg.MeshConnectivityMode != "" { + env = append(env, "RAP_MESH_CONNECTIVITY_MODE="+cfg.MeshConnectivityMode) + } + if cfg.MeshNATType != "" { + env = append(env, "RAP_MESH_NAT_TYPE="+cfg.MeshNATType) + } + if cfg.MeshRegion != "" { + env = append(env, "RAP_MESH_REGION="+cfg.MeshRegion) + } + if cfg.ProductionObservationSinkCap > 0 { + env = append(env, "RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY="+strconv.Itoa(cfg.ProductionObservationSinkCap)) + } + env = append(env, cfg.ExtraEnv...) + return env +} + +func RedactedArgs(args []string) []string { + out := append([]string(nil), args...) + for i := 0; i < len(out)-1; i++ { + if out[i] == "-e" && strings.HasPrefix(out[i+1], "RAP_JOIN_TOKEN=") { + out[i+1] = "RAP_JOIN_TOKEN=***" + } + } + return out +} + +func isNoSuchContainerError(err error) bool { + value := strings.ToLower(err.Error()) + return strings.Contains(value, "no such container") || strings.Contains(value, "no such object") +} + +func looksLikeHostPath(value string) bool { + if filepath.IsAbs(value) { + return true + } + return strings.HasPrefix(value, ".") || strings.HasPrefix(value, "~") || strings.Contains(value, "/") || strings.Contains(value, `\`) +} + +func boolString(value bool) string { + if value { + return "true" + } + return "false" +} diff --git a/agents/rap-node-agent/internal/hostagent/docker_test.go b/agents/rap-node-agent/internal/hostagent/docker_test.go new file mode 100644 index 0000000..4ee9185 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/docker_test.go @@ -0,0 +1,366 @@ +package hostagent + +import ( + "context" + "encoding/json" + "fmt" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "strings" + "testing" +) + +type recordingRunner struct { + calls [][]string +} + +func (r *recordingRunner) Run(_ context.Context, name string, args ...string) (string, error) { + r.calls = append(r.calls, append([]string{name}, args...)) + if len(args) 
> 0 && args[0] == "run" { + return "container-1\n", nil + } + return "", nil +} + +type imageMissingRunner struct { + calls [][]string + inspectSeen int +} + +func (r *imageMissingRunner) Run(_ context.Context, name string, args ...string) (string, error) { + r.calls = append(r.calls, append([]string{name}, args...)) + if len(args) >= 3 && args[0] == "image" && args[1] == "inspect" { + r.inspectSeen++ + if r.inspectSeen == 1 { + return "", fmt.Errorf("No such image") + } + return "[]", nil + } + if len(args) > 0 && args[0] == "run" { + return "container-1\n", nil + } + return "", nil +} + +type imagePresentRunner struct { + calls [][]string +} + +func (r *imagePresentRunner) Run(_ context.Context, name string, args ...string) (string, error) { + r.calls = append(r.calls, append([]string{name}, args...)) + if len(args) > 0 && args[0] == "run" { + return "container-1\n", nil + } + return "[]", nil +} + +func TestDockerRunArgsBuildNodeRuntimePlacement(t *testing.T) { + args := DockerRunArgs(RuntimeConfig{ + BackendURL: "http://control/api/v1/", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + Image: "rap-node-agent:test", + ContainerName: "rap-node-agent-node-a", + StateDir: "/srv/rap/node-a", + MeshSyntheticRuntimeEnabled: true, + MeshListenAddr: ":19131", + MeshAdvertiseEndpoint: "http://10.0.0.11:19131/", + MeshConnectivityMode: "private_lan", + }) + + joined := strings.Join(args, "\x00") + for _, want := range []string{ + "run", "-d", "--name\x00rap-node-agent-node-a", "--network\x00host", + "-v\x00/srv/rap/node-a:/var/lib/rap-node-agent", + "RAP_BACKEND_URL=http://control/api/v1", + "RAP_CLUSTER_ID=cluster-1", + "RAP_JOIN_TOKEN=join-secret", + "RAP_NODE_STATE_DIR=/var/lib/rap-node-agent", + "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS=0", + "RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true", + "RAP_MESH_LISTEN_ADDR=:19131", + "RAP_MESH_ADVERTISE_ENDPOINT=http://10.0.0.11:19131", + "RAP_MESH_CONNECTIVITY_MODE=private_lan", + "rap-node-agent:test", + } 
{ + if !strings.Contains(joined, want) { + t.Fatalf("docker args missing %q in %#v", want, args) + } + } +} + +func TestDockerRunArgsEnableVPNGatewayDevice(t *testing.T) { + args := DockerRunArgs(RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + StateDir: "rap-node-state", + DockerVPNGatewayEnabled: true, + }) + + joined := strings.Join(args, "\x00") + for _, want := range []string{ + "--privileged", + "--cap-add\x00NET_ADMIN", + "--device\x00/dev/net/tun:/dev/net/tun", + } { + if !strings.Contains(joined, want) { + t.Fatalf("docker vpn gateway args missing %q in %#v", want, args) + } + } +} + +func TestPrepareStateDirCreatesWritableHostPath(t *testing.T) { + dir := filepath.Join(t.TempDir(), "node-state") + if err := PrepareStateDir(dir); err != nil { + t.Fatalf("prepare state dir: %v", err) + } + info, err := os.Stat(dir) + if err != nil { + t.Fatalf("stat state dir: %v", err) + } + if !info.IsDir() { + t.Fatalf("state path is not a directory") + } + if info.Mode().Perm()&0o777 != 0o777 { + t.Fatalf("state dir mode = %v, want writable for container nonroot user", info.Mode().Perm()) + } +} + +func TestPrepareStateDirSkipsNamedVolume(t *testing.T) { + if err := PrepareStateDir("rap-node-state"); err != nil { + t.Fatalf("named volume should be ignored: %v", err) + } +} + +func TestFetchDockerInstallProfileBuildsRuntimeConfig(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/api/v1/node-agents/docker-install-profile" { + t.Fatalf("path = %s", r.URL.Path) + } + _ = json.NewEncoder(w).Encode(map[string]any{ + "docker_install_profile": map[string]any{ + "cluster_id": "cluster-1", + "backend_url": "https://control.example.test/api/v1", + "join_token": "rap_join_profile", + "node_name": "node-a", + "image": "rap-node-agent:test", + "artifact_endpoints": []string{"https://cache.example.test/artifacts"}, + 
"docker_image_artifact": map[string]any{ + "kind": "docker_image_tar", + "image": "rap-node-agent:test", + "file_name": "rap-node-agent-test.tar", + "size_bytes": 21, + }, + "container_name": "rap-node-agent-node-a", + "state_dir": "/var/lib/rap/nodes/node-a", + "network": "host", + "restart_policy": "unless-stopped", + "replace": true, + "mesh_synthetic_runtime_enabled": true, + "mesh_connectivity_mode": "outbound_only", + }, + }) + })) + defer server.Close() + + profile, err := FetchDockerInstallProfile(context.Background(), ProfileRequest{ + URL: server.URL + "/api/v1", + ClusterID: "cluster-1", + InstallToken: "rap_join_profile", + NodeName: "node-a", + }) + if err != nil { + t.Fatalf("fetch profile: %v", err) + } + cfg := RuntimeConfigFromProfile(profile).Normalize() + if cfg.BackendURL != "https://control.example.test/api/v1" || + cfg.ClusterID != "cluster-1" || + cfg.JoinToken != "rap_join_profile" || + cfg.ContainerName != "rap-node-agent-node-a" || + len(cfg.ImageArtifactURLs) != 1 || + cfg.ImageArtifactSizeBytes != 21 || + !cfg.MeshSyntheticRuntimeEnabled || + cfg.MeshConnectivityMode != "outbound_only" { + t.Fatalf("unexpected cfg: %+v", cfg) + } +} + +func TestInstallLoadsImageArtifactWhenImageMissing(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + _, _ = w.Write([]byte("fake docker image tar")) + })) + defer server.Close() + runner := &imageMissingRunner{} + + result, err := (DockerManager{Runner: runner}).Install(context.Background(), RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + Image: "rap-node-agent:test", + ContainerName: "rap-node-agent-node-a", + StateDir: "rap-node-state", + Replace: true, + ImageArtifactURLs: []string{server.URL + "/rap-node-agent-test.tar"}, + ImageArtifactSHA256: "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + ImageArtifactSizeBytes: 21, + }) + if err != 
nil { + t.Fatalf("install: %v", err) + } + if !result.Loaded || result.ContainerID != "container-1" { + t.Fatalf("result = %+v", result) + } + joined := strings.Join(flattenCalls(runner.calls), "\x00") + if !strings.Contains(joined, "load\x00-i") || !strings.Contains(joined, "run\x00-d") { + t.Fatalf("expected docker load and run calls, got %#v", runner.calls) + } +} + +func TestInstallAcceptsSizeMismatchWhenChecksumMissing(t *testing.T) { + const payload = "fake docker image tar" + const wrongSize = 999 + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + _, _ = w.Write([]byte(payload)) + })) + defer server.Close() + runner := &imageMissingRunner{} + + result, err := (DockerManager{Runner: runner}).Install(context.Background(), RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + Image: "rap-node-agent:test", + ContainerName: "rap-node-agent-node-a", + StateDir: "rap-node-state", + Replace: true, + ImageArtifactURLs: []string{server.URL + "/rap-node-agent-test.tar"}, + ImageArtifactSHA256: "", // intentionally absent -> size mismatch should not block install + ImageArtifactSizeBytes: wrongSize, + }) + if err != nil { + t.Fatalf("install: %v", err) + } + if !result.Loaded || result.ContainerID != "container-1" { + t.Fatalf("result = %+v", result) + } +} + +func TestInstallReloadsImageArtifactWhenReplacingMutableTag(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + _, _ = w.Write([]byte("fake docker image tar")) + })) + defer server.Close() + runner := &imagePresentRunner{} + + result, err := (DockerManager{Runner: runner}).Install(context.Background(), RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + Image: "rap-node-agent:test", + ContainerName: "rap-node-agent-node-a", + StateDir: "rap-node-state", 
+ Replace: true, + ImageArtifactURLs: []string{server.URL + "/rap-node-agent-test.tar"}, + ImageArtifactSHA256: "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + ImageArtifactSizeBytes: 21, + }) + if err != nil { + t.Fatalf("install: %v", err) + } + if !result.Loaded { + t.Fatalf("expected image artifact reload, got %+v", result) + } + joined := strings.Join(flattenCalls(runner.calls), "\x00") + if !strings.Contains(joined, "load\x00-i") { + t.Fatalf("expected docker load even when image exists during replace, got %#v", runner.calls) + } +} + +func TestDockerInstallLoadsExplicitArtifactBeforeReplace(t *testing.T) { + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/rap-node-agent-test.tar" { + t.Fatalf("unexpected path %s", r.URL.Path) + } + _, _ = w.Write([]byte("fake docker image tar")) + })) + defer server.Close() + + runner := &imageMissingRunner{} + result, err := (DockerManager{Runner: runner}).Install(context.Background(), RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + Image: "rap-node-agent:test", + ContainerName: "rap-node-agent-node-a", + StateDir: "rap-node-state", + Replace: true, + ImageArtifactURLs: []string{server.URL + "/rap-node-agent-test.tar"}, + ImageArtifactSHA256: "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + ImageArtifactSizeBytes: 21, + }) + if err != nil { + t.Fatalf("install: %v", err) + } + if !result.Loaded || !result.Replaced { + t.Fatalf("expected explicit artifact load and replace, got %+v", result) + } + joined := strings.Join(flattenCalls(runner.calls), "\x00") + if !strings.Contains(joined, "load\x00-i") { + t.Fatalf("expected docker load call, got %#v", runner.calls) + } +} + +func flattenCalls(calls [][]string) []string { + out := []string{} + for _, call := range calls { + out = append(out, call...) 
+ } + return out +} + +func TestInstallCanPullReplaceAndRedactsJoinToken(t *testing.T) { + runner := &recordingRunner{} + result, err := (DockerManager{Runner: runner}).Install(context.Background(), RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + JoinToken: "join-secret", + NodeName: "node-a", + PullImage: true, + Replace: true, + ContainerName: "rap-node-agent-node-a", + StateDir: "rap-node-state", + }) + if err != nil { + t.Fatalf("install: %v", err) + } + if !result.Pulled || !result.Replaced || result.ContainerID != "container-1" { + t.Fatalf("result = %+v", result) + } + if len(runner.calls) != 3 { + t.Fatalf("calls = %#v", runner.calls) + } + redacted := strings.Join(RedactedArgs(runner.calls[2][1:]), " ") + if strings.Contains(redacted, "join-secret") || !strings.Contains(redacted, "RAP_JOIN_TOKEN=***") { + t.Fatalf("redacted args leaked token: %s", redacted) + } +} + +func TestValidateRequiresJoinTokenUnlessReplacingExistingState(t *testing.T) { + err := RuntimeConfig{BackendURL: "http://control/api/v1", ClusterID: "cluster-1", NodeName: "node-a"}.ValidateInstall() + if err == nil || !strings.Contains(err.Error(), "join-token") { + t.Fatalf("expected join token validation error, got %v", err) + } + err = RuntimeConfig{BackendURL: "http://control/api/v1", ClusterID: "cluster-1", NodeName: "node-a", Replace: true}.ValidateInstall() + if err != nil { + t.Fatalf("replace update should allow missing join token: %v", err) + } +} diff --git a/agents/rap-node-agent/internal/hostagent/linux.go b/agents/rap-node-agent/internal/hostagent/linux.go new file mode 100644 index 0000000..dd9973f --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/linux.go @@ -0,0 +1,481 @@ +package hostagent + +import ( + "context" + "errors" + "fmt" + "os" + "path/filepath" + "runtime" + "strings" + "time" +) + +const ( + DefaultLinuxInstallRoot = "/opt/rap" + DefaultLinuxStateRoot = "/var/lib/rap/nodes" + DefaultLinuxConfigRoot = "/etc/rap" +) + 
+type LinuxInstallConfig struct { + RuntimeConfig RuntimeConfig + NodeID string + InstallDir string + StateDir string + ConfigDir string + UnitDir string + StartupMode string + ArtifactURLs []string + ArtifactSHA256 string + ArtifactSizeBytes int64 + Replace bool + DryRun bool + AutoUpdateEnabled bool + AutoUpdateCurrentVersion string + AutoUpdateChannel string + AutoUpdateIntervalSeconds int + AutoUpdateInitialDelaySeconds int + AutoUpdateHealthTimeoutSeconds int + HostAgentSourcePath string +} + +type LinuxInstallResult struct { + NodeName string + InstallDir string + StateDir string + ConfigDir string + NodeAgentPath string + HostAgentPath string + EnvPath string + UnitName string + UnitPath string + UpdaterUnitName string + Downloaded bool + Started bool + UpdaterStarted bool +} + +type LinuxManager struct { + Runner CommandRunner +} + +func LinuxInstallConfigFromProfile(profile LinuxInstallProfile) LinuxInstallConfig { + stateDir := firstNonEmpty(profile.StateDir, filepath.Join(DefaultLinuxStateRoot, safeUnitSlug(profile.NodeName))) + installDir := firstNonEmpty(profile.InstallDir, filepath.Join(DefaultLinuxInstallRoot, safeUnitSlug(profile.NodeName))) + return LinuxInstallConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: profile.BackendURL, + ClusterID: profile.ClusterID, + JoinToken: profile.JoinToken, + NodeName: profile.NodeName, + StateDir: stateDir, + WorkloadSupervisionEnabled: profile.WorkloadSupervisionEnabled, + MeshSyntheticRuntimeEnabled: profile.MeshSyntheticRuntimeEnabled, + MeshProductionForwardingEnabled: profile.MeshProductionForwardingEnabled, + MeshListenAddr: profile.MeshListenAddr, + MeshListenPortMode: profile.MeshListenPortMode, + MeshListenAutoPortStart: profile.MeshListenAutoPortStart, + MeshListenAutoPortEnd: profile.MeshListenAutoPortEnd, + MeshAdvertiseEndpoint: profile.MeshAdvertiseEndpoint, + MeshAdvertiseEndpointsJSON: string(profile.MeshAdvertiseEndpointsJSON), + MeshAdvertiseTransport: profile.MeshAdvertiseTransport, + 
MeshConnectivityMode: profile.MeshConnectivityMode, + MeshNATType: profile.MeshNATType, + MeshRegion: profile.MeshRegion, + HeartbeatIntervalSeconds: profile.HeartbeatIntervalSeconds, + EnrollmentPollIntervalSeconds: profile.EnrollmentPollIntervalSeconds, + EnrollmentPollTimeoutSeconds: profile.EnrollmentPollTimeoutSeconds, + ProductionObservationSinkCap: profile.ProductionObservationSinkCapacity, + }, + InstallDir: installDir, + StateDir: stateDir, + ConfigDir: filepath.Join(DefaultLinuxConfigRoot, safeUnitSlug(profile.NodeName)), + StartupMode: firstNonEmpty(profile.StartupMode, "systemd"), + ArtifactURLs: linuxArtifactURLs(profile), + ArtifactSHA256: linuxArtifactSHA256(profile), + ArtifactSizeBytes: linuxArtifactSizeBytes(profile), + Replace: true, + AutoUpdateEnabled: true, + } +} + +func linuxArtifactURLs(profile LinuxInstallProfile) []string { + if profile.NodeAgentArtifact != nil && len(profile.NodeAgentArtifact.URLs) > 0 { + return append([]string(nil), profile.NodeAgentArtifact.URLs...) 
+ } + if profile.NodeAgentArtifact == nil || strings.TrimSpace(profile.NodeAgentArtifact.FileName) == "" { + return nil + } + out := []string{} + fileName := strings.TrimLeft(strings.TrimSpace(profile.NodeAgentArtifact.FileName), "/") + for _, endpoint := range profile.ArtifactEndpoints { + if trimmed := strings.TrimRight(strings.TrimSpace(endpoint), "/"); trimmed != "" { + out = append(out, trimmed+"/"+fileName) + } + } + return out +} + +func linuxArtifactSHA256(profile LinuxInstallProfile) string { + if profile.NodeAgentArtifact == nil { + return "" + } + return strings.TrimSpace(profile.NodeAgentArtifact.SHA256) +} + +func linuxArtifactSizeBytes(profile LinuxInstallProfile) int64 { + if profile.NodeAgentArtifact == nil { + return 0 + } + return profile.NodeAgentArtifact.SizeBytes +} + +func (m LinuxManager) Install(ctx context.Context, cfg LinuxInstallConfig) (LinuxInstallResult, error) { + cfg.NodeID = strings.TrimSpace(cfg.NodeID) + cfg.RuntimeConfig.Replace = cfg.Replace + cfg.RuntimeConfig.StateDir = firstNonEmpty(cfg.StateDir, cfg.RuntimeConfig.StateDir) + cfg.RuntimeConfig = cfg.RuntimeConfig.Normalize() + if err := cfg.RuntimeConfig.ValidateInstall(); err != nil { + return LinuxInstallResult{}, err + } + slug := safeUnitSlug(cfg.RuntimeConfig.NodeName) + cfg.InstallDir = firstNonEmpty(cfg.InstallDir, filepath.Join(DefaultLinuxInstallRoot, slug)) + cfg.StateDir = firstNonEmpty(cfg.RuntimeConfig.StateDir, filepath.Join(DefaultLinuxStateRoot, slug)) + cfg.ConfigDir = firstNonEmpty(cfg.ConfigDir, filepath.Join(DefaultLinuxConfigRoot, slug)) + cfg.UnitDir = firstNonEmpty(cfg.UnitDir, DefaultSystemdUnitDir) + cfg.StartupMode = strings.ToLower(firstNonEmpty(cfg.StartupMode, "systemd")) + unitName := "rap-node-agent-" + slug + ".service" + result := LinuxInstallResult{ + NodeName: cfg.RuntimeConfig.NodeName, + InstallDir: cfg.InstallDir, + StateDir: cfg.StateDir, + ConfigDir: cfg.ConfigDir, + NodeAgentPath: filepath.Join(cfg.InstallDir, "rap-node-agent"), + 
 HostAgentPath: filepath.Join(cfg.InstallDir, "rap-host-agent"), + EnvPath: filepath.Join(cfg.ConfigDir, "rap-node-agent.env"), + UnitName: unitName, + UnitPath: filepath.Join(cfg.UnitDir, unitName), + } + if cfg.DryRun { + return result, nil + } + if runtime.GOOS != "linux" { + return result, fmt.Errorf("linux install is only supported on linux hosts") + } + if err := os.MkdirAll(cfg.InstallDir, 0o755); err != nil { + return result, err + } + if err := os.MkdirAll(cfg.StateDir, 0o700); err != nil { + return result, err + } + if err := os.MkdirAll(cfg.ConfigDir, 0o755); err != nil { + return result, err + } + if len(cfg.ArtifactURLs) > 0 && (cfg.Replace || !fileExists(result.NodeAgentPath)) { + m.stopService(ctx, result.UnitName) + path, err := downloadFirstArtifact(ctx, cfg.ArtifactURLs, cfg.ArtifactSHA256, cfg.ArtifactSizeBytes) + if err != nil { + return result, err + } + defer os.Remove(path) + if err := copyFile(path, result.NodeAgentPath, 0o755); err != nil { + m.stopService(ctx, result.UnitName) + if retryErr := copyFile(path, result.NodeAgentPath, 0o755); retryErr != nil { + return result, retryErr + } + } + result.Downloaded = true + } + if !fileExists(result.NodeAgentPath) { + return result, fmt.Errorf("node-agent binary is missing at %s and no artifact was available", result.NodeAgentPath) + } + if err := os.WriteFile(result.EnvPath, []byte(linuxEnvFile(cfg.RuntimeConfig, cfg.StateDir)), 0o600); err != nil { + return result, err + } + if cfg.StartupMode != "none" { + if err := os.MkdirAll(cfg.UnitDir, 0o755); err != nil { + return result, err + } + if err := os.WriteFile(result.UnitPath, []byte(linuxNodeAgentUnit(result)), 0o644); err != nil { + return result, err + } + runner := m.runner() + if _, err := runner.Run(ctx, "systemctl", "daemon-reload"); err != nil { + return result, err + } + if _, err := runner.Run(ctx, "systemctl", "enable", "--now", result.UnitName); err != nil { + return result, err + } + result.Started = true + } + return 
installLinuxHostAgentUpdater(ctx, m, result, cfg) +} + +func (m LinuxManager) stopService(ctx context.Context, unitName string) { + if strings.TrimSpace(unitName) == "" { + return + } + _, _ = m.runner().Run(ctx, "systemctl", "stop", unitName) +} + +func (m LinuxManager) runner() CommandRunner { + if m.Runner != nil { + return m.Runner + } + return ExecRunner{} +} + +func linuxEnvFile(cfg RuntimeConfig, stateDir string) string { + lines := []string{} + for _, env := range NodeAgentEnvWithStateDir(cfg, stateDir) { + key, value, ok := strings.Cut(env, "=") + if !ok { + continue + } + lines = append(lines, key+"="+systemdQuote(value)) + } + return strings.Join(lines, "\n") + "\n" +} + +func linuxNodeAgentUnit(result LinuxInstallResult) string { + return fmt.Sprintf(`[Unit] +Description=RAP node-agent %s +After=network-online.target +Wants=network-online.target + +[Service] +Type=simple +EnvironmentFile=%s +ExecStart=%s +Restart=always +RestartSec=10 + +[Install] +WantedBy=multi-user.target +`, result.NodeName, systemdQuote(result.EnvPath), systemdQuote(result.NodeAgentPath)) +} + +func installLinuxHostAgentUpdater(ctx context.Context, m LinuxManager, result LinuxInstallResult, cfg LinuxInstallConfig) (LinuxInstallResult, error) { + if !cfg.AutoUpdateEnabled || strings.EqualFold(cfg.StartupMode, "none") { + return result, nil + } + if cfg.AutoUpdateCurrentVersion == "" || (cfg.Replace && !result.Downloaded) { + cfg.AutoUpdateCurrentVersion = "0.0.0" + } + if err := installHostAgentBinary(cfg.HostAgentSourcePath, result.HostAgentPath); err != nil { + return result, err + } + interval := cfg.AutoUpdateIntervalSeconds + if interval == 0 { + interval = 21600 + } + initialDelay := cfg.AutoUpdateInitialDelaySeconds + if initialDelay == 0 { + initialDelay = 15 + } + healthTimeout := cfg.AutoUpdateHealthTimeoutSeconds + if healthTimeout == 0 { + healthTimeout = 30 + } + args := []string{ + result.HostAgentPath, + "update-loop", + "--backend-url", cfg.RuntimeConfig.BackendURL, 
+ "--cluster-id", cfg.RuntimeConfig.ClusterID, + "--state-dir", result.StateDir, + "--current-version", cfg.AutoUpdateCurrentVersion, + "--os", "linux", + "--arch", runtime.GOARCH, + "--install-type", BinaryUpdateInstallType, + "--binary-path", result.NodeAgentPath, + "--systemd-unit", result.UnitName, + "--health-timeout-seconds", fmt.Sprintf("%d", healthTimeout), + "--interval-seconds", fmt.Sprintf("%d", interval), + "--initial-delay-seconds", fmt.Sprintf("%d", initialDelay), + "--host-agent-update-status-enabled", + "--host-agent-current-version", firstNonEmpty(cfg.AutoUpdateCurrentVersion, "0.0.0"), + "--host-agent-binary-path", result.HostAgentPath, + } + if strings.TrimSpace(cfg.NodeID) != "" { + args = append(args, "--node-id", strings.TrimSpace(cfg.NodeID)) + } + if strings.TrimSpace(cfg.AutoUpdateChannel) != "" { + args = append(args, "--channel", strings.TrimSpace(cfg.AutoUpdateChannel)) + } + unitName := "rap-host-agent-updater-" + safeUnitSlug(result.NodeName) + ".service" + unitPath := filepath.Join(firstNonEmpty(cfg.UnitDir, DefaultSystemdUnitDir), unitName) + unit := fmt.Sprintf(`[Unit] +Description=RAP host-agent updater for %s +After=network-online.target %s +Wants=network-online.target + +[Service] +Type=simple +ExecStart=%s +Restart=always +RestartSec=30 + +[Install] +WantedBy=multi-user.target +`, result.NodeName, result.UnitName, systemdJoin(args)) + if err := os.WriteFile(unitPath, []byte(unit), 0o644); err != nil { + return result, err + } + runner := m.runner() + if _, err := runner.Run(ctx, "systemctl", "daemon-reload"); err != nil { + return result, err + } + if _, err := runner.Run(ctx, "systemctl", "enable", "--now", unitName); err != nil { + return result, err + } + result.UpdaterUnitName = unitName + result.UpdaterStarted = true + return result, nil +} + +func (m LinuxManager) ApplyUpdate(ctx context.Context, req UpdateRequest) (UpdateResult, error) { + req.InstallType = firstNonEmpty(req.InstallType, BinaryUpdateInstallType) + req.OS 
= firstNonEmpty(req.OS, "linux") + req.Arch = firstNonEmpty(req.Arch, runtime.GOARCH) + req = req.Normalize() + var err error + req, err = resolveUpdateRequest(req) + if err != nil { + return UpdateResult{}, err + } + plan, err := FetchNodeUpdatePlan(ctx, req) + if err != nil { + return UpdateResult{}, err + } + result := UpdateResult{Action: plan.Action, Reason: plan.Reason, TargetVersion: plan.TargetVersion, ContainerName: req.SystemdUnitName, NewImage: req.BinaryPath} + if plan.Action != "update" { + if !req.DryRun { + status := statusFromNoopPlan(req, plan) + if status.Payload == nil { + status.Payload = map[string]any{} + } + status.Payload["systemd_unit"] = req.SystemdUnitName + status.Payload["binary_path"] = req.BinaryPath + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, status) + } + return result, nil + } + if plan.ProductionForwarding && !req.AllowProductionMesh { + err := errors.New("refusing update plan with production forwarding enabled") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact == nil { + err := errors.New("update plan has no artifact") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact.InstallType != "" && plan.Artifact.InstallType != BinaryUpdateInstallType { + err := fmt.Errorf("unsupported update artifact install type %q", plan.Artifact.InstallType) + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if req.DryRun { + return result, nil + } + urls := artifactURLsForBackend(*plan.Artifact, req.BackendURL) + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{Product: req.Product, CurrentVersion: 
req.CurrentVersion, TargetVersion: plan.TargetVersion, Phase: "download", Status: "started", AttemptID: updateAttemptID(plan), ObservedAt: time.Now().UTC(), Payload: map[string]any{"artifact_url": plan.Artifact.URL, "artifact_urls": urls, "binary_path": req.BinaryPath}}) + path, err := downloadFirstArtifact(ctx, urls, plan.Artifact.SHA256, plan.Artifact.SizeBytes) + if err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "download", "failed", err)) + return result, err + } + defer os.Remove(path) + runner := m.runner() + _, _ = runner.Run(ctx, "systemctl", "stop", req.SystemdUnitName) + if err := copyFile(path, req.BinaryPath, 0o755); err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "apply", "failed", err)) + return result, err + } + result.Replaced = true + if _, err := runner.Run(ctx, "systemctl", "restart", req.SystemdUnitName); err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "restart", "failed", err)) + return result, err + } + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{Product: req.Product, CurrentVersion: req.CurrentVersion, TargetVersion: plan.TargetVersion, Phase: "health_check", Status: "succeeded", AttemptID: updateAttemptID(plan), ObservedAt: time.Now().UTC(), Payload: map[string]any{"systemd_unit": req.SystemdUnitName, "binary_path": req.BinaryPath}}) + _ = saveUpdateState(req.StateDir, UpdateState{Product: req.Product, CurrentVersion: plan.TargetVersion, TargetVersion: plan.TargetVersion, Image: req.BinaryPath, UpdatedAt: time.Now().UTC()}) + return result, nil +} + +func (m LinuxManager) RunUpdateLoop(ctx context.Context, cfg UpdateLoopConfig) error { + req := cfg.Request + req.InstallType = firstNonEmpty(req.InstallType, BinaryUpdateInstallType) + req.OS = firstNonEmpty(req.OS, "linux") + req.Arch 
= firstNonEmpty(req.Arch, runtime.GOARCH) + cfg.Request = req + return runLinuxUpdateLoop(ctx, m, cfg) +} + +func runLinuxUpdateLoop(ctx context.Context, m LinuxManager, cfg UpdateLoopConfig) error { + if cfg.Interval == 0 { + cfg.Interval = time.Hour + } + logf := cfg.Logf + if logf == nil { + logf = func(string, ...any) {} + } + if cfg.InitialDelay > 0 { + if err := sleepContext(ctx, jitteredDuration(cfg.InitialDelay, cfg.Jitter)); err != nil { + return err + } + } + runs := 0 + lastTriggerGeneration := currentUpdateTriggerGeneration(cfg.Request.StateDir) + for { + runs++ + result, err := m.ApplyUpdate(ctx, cfg.Request) + if err != nil { + if errors.Is(err, ErrNodeIdentityNotReady) { + logf("linux_update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, cfg.Request.StateDir) + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepUntilUpdateIntervalOrTrigger(ctx, cfg.Request.StateDir, jitteredDuration(cfg.Interval, cfg.Jitter), &lastTriggerGeneration); err != nil { + return err + } + continue + } else { + logf("linux_update_loop run=%d status=failed error=%v", runs, err) + if cfg.StopOnError { + return err + } + } + } else { + logf("linux_update_loop run=%d action=%s reason=%s target=%s unit=%s replaced=%t", runs, result.Action, result.Reason, result.TargetVersion, result.ContainerName, result.Replaced) + if result.Action == "update" && result.TargetVersion != "" { + cfg.Request.CurrentVersion = result.TargetVersion + } + } + if cfg.HostAgentUpdateEnabled { + hostReq := cfg.HostAgentUpdateRequest + hostReq.BackendURL = firstNonEmpty(hostReq.BackendURL, cfg.Request.BackendURL) + hostReq.ClusterID = firstNonEmpty(hostReq.ClusterID, cfg.Request.ClusterID) + hostReq.NodeID = firstNonEmpty(hostReq.NodeID, cfg.Request.NodeID) + hostReq.StateDir = firstNonEmpty(hostReq.StateDir, cfg.Request.StateDir) + hostReq.Channel = firstNonEmpty(hostReq.Channel, cfg.Request.Channel) + hostReq.OS = firstNonEmpty(hostReq.OS, "linux") + 
hostReq.Arch = firstNonEmpty(hostReq.Arch, runtime.GOARCH) + hostReq.InstallType = firstNonEmpty(hostReq.InstallType, BinaryUpdateInstallType) + hostResult, hostErr := (DockerManager{}).ApplyHostAgentUpdate(ctx, hostReq) + if hostErr != nil { + logf("linux_host_agent_update_loop run=%d status=failed error=%v", runs, hostErr) + } else { + logf("linux_host_agent_update_loop run=%d action=%s reason=%s target=%s binary=%s replaced=%t restart_needed=%t", runs, hostResult.Action, hostResult.Reason, hostResult.TargetVersion, hostResult.NewImage, hostResult.Replaced, hostResult.RestartNeeded) + if hostResult.Action == "update" && hostResult.TargetVersion != "" && !hostResult.RolledBack { + cfg.HostAgentUpdateRequest.CurrentVersion = hostResult.TargetVersion + } + } + } + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepUntilUpdateIntervalOrTrigger(ctx, cfg.Request.StateDir, jitteredDuration(cfg.Interval, cfg.Jitter), &lastTriggerGeneration); err != nil { + return err + } + } +} diff --git a/agents/rap-node-agent/internal/hostagent/profile.go b/agents/rap-node-agent/internal/hostagent/profile.go new file mode 100644 index 0000000..86a266b --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/profile.go @@ -0,0 +1,333 @@ +package hostagent + +import ( + "bytes" + "context" + "encoding/json" + "fmt" + "net/http" + "strings" + "time" +) + +type DockerInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints"` + ArtifactEndpoints []string `json:"artifact_endpoints"` + DockerImageArtifact *DockerArtifact `json:"docker_image_artifact"` + JoinToken string `json:"join_token"` + NodeName string `json:"node_name"` + Image string `json:"image"` + ContainerName string `json:"container_name"` + StateDir string `json:"state_dir"` + Network string `json:"network"` + RestartPolicy string 
`json:"restart_policy"` + PullImage bool `json:"pull_image"` + Replace bool `json:"replace"` + DockerVPNGatewayEnabled bool `json:"docker_vpn_gateway_enabled"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr"` + MeshListenPortMode string `json:"mesh_listen_port_mode"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport"` + MeshConnectivityMode string `json:"mesh_connectivity_mode"` + MeshNATType string `json:"mesh_nat_type"` + MeshRegion string `json:"mesh_region"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity"` + Roles []string `json:"roles"` +} + +type DockerArtifact struct { + Kind string `json:"kind"` + Image string `json:"image"` + MediaType string `json:"media_type"` + FileName string `json:"file_name"` + URLs []string `json:"urls"` + SHA256 string `json:"sha256"` + SizeBytes int64 `json:"size_bytes"` +} + +type WindowsInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints"` + ArtifactEndpoints []string `json:"artifact_endpoints"` + NodeAgentArtifact *DockerArtifact `json:"node_agent_artifact"` + JoinToken string `json:"join_token"` + 
NodeName string `json:"node_name"` + StateDir string `json:"state_dir"` + InstallDir string `json:"install_dir"` + StartupMode string `json:"startup_mode"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr"` + MeshListenPortMode string `json:"mesh_listen_port_mode"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport"` + MeshConnectivityMode string `json:"mesh_connectivity_mode"` + MeshNATType string `json:"mesh_nat_type"` + MeshRegion string `json:"mesh_region"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity"` + Roles []string `json:"roles"` +} + +type LinuxInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints"` + ArtifactEndpoints []string `json:"artifact_endpoints"` + NodeAgentArtifact *DockerArtifact `json:"node_agent_artifact"` + JoinToken string `json:"join_token"` + NodeName string `json:"node_name"` + StateDir string `json:"state_dir"` + InstallDir string `json:"install_dir"` + StartupMode string `json:"startup_mode"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool 
`json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr"` + MeshListenPortMode string `json:"mesh_listen_port_mode"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport"` + MeshConnectivityMode string `json:"mesh_connectivity_mode"` + MeshNATType string `json:"mesh_nat_type"` + MeshRegion string `json:"mesh_region"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity"` + Roles []string `json:"roles"` +} + +type ProfileRequest struct { + URL string + ClusterID string + InstallToken string + NodeName string + HTTPClient *http.Client +} + +func FetchDockerInstallProfile(ctx context.Context, req ProfileRequest) (DockerInstallProfile, error) { + url := strings.TrimRight(strings.TrimSpace(req.URL), "/") + if url == "" || strings.TrimSpace(req.InstallToken) == "" { + return DockerInstallProfile{}, fmt.Errorf("profile-url and install-token are required") + } + if !strings.HasSuffix(url, "/node-agents/docker-install-profile") { + url += "/node-agents/docker-install-profile" + } + body, err := json.Marshal(map[string]string{ + "cluster_id": strings.TrimSpace(req.ClusterID), + "install_token": strings.TrimSpace(req.InstallToken), + "node_name": strings.TrimSpace(req.NodeName), + }) + if err != nil { + return DockerInstallProfile{}, err + } + httpClient := req.HTTPClient + if httpClient == nil { + httpClient = &http.Client{Timeout: 
20 * time.Second} + } + httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body)) + if err != nil { + return DockerInstallProfile{}, err + } + httpReq.Header.Set("Content-Type", "application/json") + resp, err := httpClient.Do(httpReq) + if err != nil { + return DockerInstallProfile{}, err + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return DockerInstallProfile{}, fmt.Errorf("fetch docker install profile: %s", resp.Status) + } + var envelope struct { + Profile DockerInstallProfile `json:"docker_install_profile"` + } + if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil { + return DockerInstallProfile{}, err + } + if strings.TrimSpace(envelope.Profile.BackendURL) == "" && len(envelope.Profile.ControlPlaneEndpoints) > 0 { + envelope.Profile.BackendURL = envelope.Profile.ControlPlaneEndpoints[0] + } + return envelope.Profile, nil +} + +func FetchWindowsInstallProfile(ctx context.Context, req ProfileRequest) (WindowsInstallProfile, error) { + url := strings.TrimRight(strings.TrimSpace(req.URL), "/") + if url == "" || strings.TrimSpace(req.InstallToken) == "" { + return WindowsInstallProfile{}, fmt.Errorf("profile-url and install-token are required") + } + if !strings.HasSuffix(url, "/node-agents/windows-install-profile") { + url += "/node-agents/windows-install-profile" + } + body, err := json.Marshal(map[string]string{ + "cluster_id": strings.TrimSpace(req.ClusterID), + "install_token": strings.TrimSpace(req.InstallToken), + "node_name": strings.TrimSpace(req.NodeName), + }) + if err != nil { + return WindowsInstallProfile{}, err + } + httpClient := req.HTTPClient + if httpClient == nil { + httpClient = &http.Client{Timeout: 20 * time.Second} + } + httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body)) + if err != nil { + return WindowsInstallProfile{}, err + } + httpReq.Header.Set("Content-Type", "application/json") + resp, err := 
httpClient.Do(httpReq) + if err != nil { + return WindowsInstallProfile{}, err + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return WindowsInstallProfile{}, fmt.Errorf("fetch windows install profile: %s", resp.Status) + } + var envelope struct { + Profile WindowsInstallProfile `json:"windows_install_profile"` + } + if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil { + return WindowsInstallProfile{}, err + } + if strings.TrimSpace(envelope.Profile.BackendURL) == "" && len(envelope.Profile.ControlPlaneEndpoints) > 0 { + envelope.Profile.BackendURL = envelope.Profile.ControlPlaneEndpoints[0] + } + return envelope.Profile, nil +} + +func FetchLinuxInstallProfile(ctx context.Context, req ProfileRequest) (LinuxInstallProfile, error) { + url := strings.TrimRight(strings.TrimSpace(req.URL), "/") + if url == "" || strings.TrimSpace(req.InstallToken) == "" { + return LinuxInstallProfile{}, fmt.Errorf("profile-url and install-token are required") + } + if !strings.HasSuffix(url, "/node-agents/linux-install-profile") { + url += "/node-agents/linux-install-profile" + } + body, err := json.Marshal(map[string]string{ + "cluster_id": strings.TrimSpace(req.ClusterID), + "install_token": strings.TrimSpace(req.InstallToken), + "node_name": strings.TrimSpace(req.NodeName), + }) + if err != nil { + return LinuxInstallProfile{}, err + } + httpClient := req.HTTPClient + if httpClient == nil { + httpClient = &http.Client{Timeout: 20 * time.Second} + } + httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body)) + if err != nil { + return LinuxInstallProfile{}, err + } + httpReq.Header.Set("Content-Type", "application/json") + resp, err := httpClient.Do(httpReq) + if err != nil { + return LinuxInstallProfile{}, err + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return LinuxInstallProfile{}, fmt.Errorf("fetch linux install profile: %s", resp.Status) + } + var 
envelope struct { + Profile LinuxInstallProfile `json:"linux_install_profile"` + } + if err := json.NewDecoder(resp.Body).Decode(&envelope); err != nil { + return LinuxInstallProfile{}, err + } + if strings.TrimSpace(envelope.Profile.BackendURL) == "" && len(envelope.Profile.ControlPlaneEndpoints) > 0 { + envelope.Profile.BackendURL = envelope.Profile.ControlPlaneEndpoints[0] + } + return envelope.Profile, nil +} + +func RuntimeConfigFromProfile(profile DockerInstallProfile) RuntimeConfig { + return RuntimeConfig{ + BackendURL: profile.BackendURL, + ClusterID: profile.ClusterID, + JoinToken: profile.JoinToken, + NodeName: profile.NodeName, + Image: profile.Image, + ContainerName: profile.ContainerName, + StateDir: profile.StateDir, + Network: profile.Network, + RestartPolicy: profile.RestartPolicy, + PullImage: profile.PullImage, + Replace: profile.Replace, + DockerVPNGatewayEnabled: profile.DockerVPNGatewayEnabled, + WorkloadSupervisionEnabled: profile.WorkloadSupervisionEnabled, + MeshSyntheticRuntimeEnabled: profile.MeshSyntheticRuntimeEnabled, + MeshProductionForwardingEnabled: profile.MeshProductionForwardingEnabled, + MeshListenAddr: profile.MeshListenAddr, + MeshListenPortMode: profile.MeshListenPortMode, + MeshListenAutoPortStart: profile.MeshListenAutoPortStart, + MeshListenAutoPortEnd: profile.MeshListenAutoPortEnd, + MeshAdvertiseEndpoint: profile.MeshAdvertiseEndpoint, + MeshAdvertiseEndpointsJSON: string(profile.MeshAdvertiseEndpointsJSON), + MeshAdvertiseTransport: profile.MeshAdvertiseTransport, + MeshConnectivityMode: profile.MeshConnectivityMode, + MeshNATType: profile.MeshNATType, + MeshRegion: profile.MeshRegion, + HeartbeatIntervalSeconds: profile.HeartbeatIntervalSeconds, + EnrollmentPollIntervalSeconds: profile.EnrollmentPollIntervalSeconds, + EnrollmentPollTimeoutSeconds: profile.EnrollmentPollTimeoutSeconds, + ProductionObservationSinkCap: profile.ProductionObservationSinkCapacity, + ImageArtifactURLs: dockerArtifactURLs(profile), + 
ImageArtifactSHA256: dockerArtifactSHA256(profile), + ImageArtifactSizeBytes: dockerArtifactSizeBytes(profile), + } +} + +func dockerArtifactURLs(profile DockerInstallProfile) []string { + if profile.DockerImageArtifact != nil && len(profile.DockerImageArtifact.URLs) > 0 { + return append([]string(nil), profile.DockerImageArtifact.URLs...) + } + if profile.DockerImageArtifact == nil || strings.TrimSpace(profile.DockerImageArtifact.FileName) == "" { + return nil + } + out := []string{} + fileName := strings.TrimLeft(strings.TrimSpace(profile.DockerImageArtifact.FileName), "/") + for _, endpoint := range profile.ArtifactEndpoints { + if trimmed := strings.TrimRight(strings.TrimSpace(endpoint), "/"); trimmed != "" { + out = append(out, trimmed+"/"+fileName) + } + } + return out +} + +func dockerArtifactSHA256(profile DockerInstallProfile) string { + if profile.DockerImageArtifact == nil { + return "" + } + return strings.TrimSpace(profile.DockerImageArtifact.SHA256) +} + +func dockerArtifactSizeBytes(profile DockerInstallProfile) int64 { + if profile.DockerImageArtifact == nil { + return 0 + } + return profile.DockerImageArtifact.SizeBytes +} diff --git a/agents/rap-node-agent/internal/hostagent/self_update.go b/agents/rap-node-agent/internal/hostagent/self_update.go new file mode 100644 index 0000000..50c90b5 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/self_update.go @@ -0,0 +1,258 @@ +package hostagent + +import ( + "context" + "errors" + "fmt" + "os" + "strings" + "time" +) + +type HostAgentUpdateRequest struct { + BackendURL string + ClusterID string + NodeID string + StateDir string + CurrentVersion string + Channel string + OS string + Arch string + InstallType string + BinaryPath string + DryRun bool + RestartService string + RestartAfterApply bool +} + +type HostAgentUpdateLoopConfig struct { + Request HostAgentUpdateRequest + Interval time.Duration + InitialDelay time.Duration + Jitter float64 + MaxRuns int + StopOnError bool + Logf 
func(format string, args ...any) +} + +func (req HostAgentUpdateRequest) updateRequest() UpdateRequest { + return UpdateRequest{ + BackendURL: req.BackendURL, + ClusterID: req.ClusterID, + NodeID: req.NodeID, + StateDir: req.StateDir, + Product: HostAgentUpdateProduct, + CurrentVersion: req.CurrentVersion, + OS: firstNonEmpty(req.OS, "linux"), + Arch: req.Arch, + InstallType: firstNonEmpty(req.InstallType, BinaryUpdateInstallType), + Channel: req.Channel, + ContainerName: "host-agent-service", + DryRun: req.DryRun, + } +} + +func (m DockerManager) ApplyHostAgentUpdate(ctx context.Context, req HostAgentUpdateRequest) (UpdateResult, error) { + binaryPath := firstNonEmpty(req.BinaryPath, DefaultHostAgentInstallPath) + planReq := req.updateRequest() + planReq.BinaryDefaults() + resolved, err := resolveUpdateRequest(planReq) + if err != nil { + return UpdateResult{}, err + } + plan, err := FetchNodeUpdatePlan(ctx, resolved) + if err != nil { + return UpdateResult{}, err + } + result := UpdateResult{ + Action: plan.Action, + Reason: plan.Reason, + TargetVersion: plan.TargetVersion, + ContainerName: "host-agent-service", + NewImage: binaryPath, + } + if plan.Action != "update" { + if !req.DryRun { + status := statusFromNoopPlan(resolved, plan) + status.Product = HostAgentUpdateProduct + if status.Payload == nil { + status.Payload = map[string]any{} + } + status.Payload["binary_path"] = binaryPath + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, status) + } + return result, nil + } + if plan.Artifact == nil { + err := errors.New("host-agent update plan has no artifact") + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, statusFromError(resolved, plan, "preflight", "failed", err)) + return result, err + } + if !isBinaryInstallType(plan.Artifact.InstallType) { + err := fmt.Errorf("unsupported host-agent artifact install type %q", plan.Artifact.InstallType) + _ = ReportNodeUpdateStatus(ctx, 
resolved.BackendURL, resolved.ClusterID, resolved.NodeID, statusFromError(resolved, plan, "preflight", "failed", err)) + return result, err + } + if req.DryRun { + return result, nil + } + urls := artifactURLsForBackend(*plan.Artifact, resolved.BackendURL) + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, NodeUpdateStatusRequest{ + Product: HostAgentUpdateProduct, + CurrentVersion: resolved.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "download", + Status: "started", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"artifact_url": plan.Artifact.URL, "artifact_urls": urls, "binary_path": binaryPath}, + }) + path, err := downloadFirstArtifact(ctx, urls, plan.Artifact.SHA256, plan.Artifact.SizeBytes) + if err != nil { + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, statusFromError(resolved, plan, "download", "failed", err)) + return result, err + } + defer os.Remove(path) + if err := installHostAgentBinary(path, binaryPath); err != nil { + stageErr := stageHostAgentBinary(path, binaryPath) + if stageErr == nil { + result.RestartNeeded = true + _ = saveUpdateState(resolved.StateDir, UpdateState{ + Product: HostAgentUpdateProduct, + CurrentVersion: plan.TargetVersion, + TargetVersion: plan.TargetVersion, + ContainerName: "host-agent-service", + Image: binaryPath, + UpdatedAt: time.Now().UTC(), + }) + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, NodeUpdateStatusRequest{ + Product: HostAgentUpdateProduct, + CurrentVersion: resolved.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "apply", + Status: "staged", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"binary_path": binaryPath, "staged_path": binaryPath + ".next", "restart_needed": true, "replace_error": err.Error()}, + }) + return result, nil + } + _ = 
ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, statusFromError(resolved, plan, "apply", "failed", fmt.Errorf("%w; stage failed: %v", err, stageErr))) + return result, err + } + result.Loaded = true + result.Replaced = true + result.RestartNeeded = true + _ = saveUpdateState(resolved.StateDir, UpdateState{ + Product: HostAgentUpdateProduct, + CurrentVersion: plan.TargetVersion, + TargetVersion: plan.TargetVersion, + ContainerName: "host-agent-service", + Image: binaryPath, + UpdatedAt: time.Now().UTC(), + }) + _ = ReportNodeUpdateStatus(ctx, resolved.BackendURL, resolved.ClusterID, resolved.NodeID, NodeUpdateStatusRequest{ + Product: HostAgentUpdateProduct, + CurrentVersion: resolved.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "apply", + Status: "succeeded", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"binary_path": binaryPath, "restart_needed": true}, + }) + if req.RestartAfterApply && strings.TrimSpace(req.RestartService) != "" { + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + _, err = runner.Run(ctx, "systemctl", "restart", req.RestartService) + if err != nil { + return result, err + } + result.RestartNeeded = false + } + return result, nil +} + +func (m DockerManager) RunHostAgentUpdateLoop(ctx context.Context, cfg HostAgentUpdateLoopConfig) error { + if cfg.Interval == 0 { + cfg.Interval = time.Hour + } + if cfg.InitialDelay < 0 || cfg.Interval < 0 { + return errors.New("host-agent update loop durations must not be negative") + } + if cfg.Jitter < 0 || cfg.Jitter > 1 { + return errors.New("host-agent update loop jitter must be between 0 and 1") + } + logf := cfg.Logf + if logf == nil { + logf = func(string, ...any) {} + } + if cfg.InitialDelay > 0 { + if err := sleepContext(ctx, jitteredDuration(cfg.InitialDelay, cfg.Jitter)); err != nil { + return err + } + } + runs := 0 + req := cfg.Request + for { + runs++ + result, err := 
m.ApplyHostAgentUpdate(ctx, req) + if err != nil { + if errors.Is(err, ErrNodeIdentityNotReady) { + logf("host_agent_update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, req.StateDir) + } else { + logf("host_agent_update_loop run=%d status=failed error=%v", runs, err) + if cfg.StopOnError { + return err + } + } + } else { + logf("host_agent_update_loop run=%d action=%s reason=%s target=%s binary=%s replaced=%t restart_needed=%t", + runs, + result.Action, + result.Reason, + result.TargetVersion, + result.NewImage, + result.Replaced, + result.RestartNeeded, + ) + if result.Action == "update" && result.TargetVersion != "" { + req.CurrentVersion = result.TargetVersion + } + } + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepContext(ctx, jitteredDuration(cfg.Interval, cfg.Jitter)); err != nil { + return err + } + } +} + +func (req *UpdateRequest) BinaryDefaults() { + req.Product = firstNonEmpty(req.Product, HostAgentUpdateProduct) + req.InstallType = firstNonEmpty(req.InstallType, BinaryUpdateInstallType) + req.OS = firstNonEmpty(req.OS, "linux") +} + +func isBinaryInstallType(value string) bool { + switch strings.TrimSpace(value) { + case "", BinaryUpdateInstallType, "windows_binary", "binary", "host_binary", "linux-amd64-binary", "windows-amd64-binary": + return true + default: + return false + } +} + +func hostAgentInstallTypeFor(nodeInstallType string) string { + if strings.TrimSpace(nodeInstallType) == WindowsUpdateInstallType { + return "windows_binary" + } + return BinaryUpdateInstallType +} + +func stageHostAgentBinary(sourcePath, binaryPath string) error { + return copyFile(sourcePath, binaryPath+".next", 0o755) +} diff --git a/agents/rap-node-agent/internal/hostagent/service.go b/agents/rap-node-agent/internal/hostagent/service.go new file mode 100644 index 0000000..b5450d3 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/service.go @@ -0,0 +1,321 @@ +package hostagent + +import ( + "context" + 
"fmt" + "io" + "os" + "path/filepath" + "runtime" + "strings" +) + +const ( + DefaultHostAgentInstallPath = "/usr/local/bin/rap-host-agent" + DefaultSystemdUnitDir = "/etc/systemd/system" +) + +type UpdateServiceConfig struct { + RuntimeConfig RuntimeConfig + Product string + CurrentVersion string + Channel string + IntervalSeconds int + InitialDelaySeconds int + Jitter float64 + HealthTimeoutSec int + BinaryInstallPath string + SourceBinaryPath string + UnitDir string + ManageSystemd bool + DryRun bool + InstallSelfUpdater bool + SelfUpdateVersion string +} + +type UpdateServiceResult struct { + Installed bool + Started bool + UnitName string + UnitPath string + BinaryPath string + Unit string + SelfUnitName string + SelfUnitPath string + SelfUnit string +} + +func (m DockerManager) InstallUpdateService(ctx context.Context, cfg UpdateServiceConfig) (UpdateServiceResult, error) { + cfg.RuntimeConfig = cfg.RuntimeConfig.Normalize() + if cfg.Product == "" { + cfg.Product = DefaultUpdateProduct + } + if cfg.IntervalSeconds == 0 { + cfg.IntervalSeconds = 21600 + } + if cfg.Jitter == 0 { + cfg.Jitter = 0.15 + } + if cfg.HealthTimeoutSec == 0 { + cfg.HealthTimeoutSec = 30 + } + cfg.BinaryInstallPath = firstNonEmpty(cfg.BinaryInstallPath, DefaultHostAgentInstallPath) + cfg.UnitDir = firstNonEmpty(cfg.UnitDir, DefaultSystemdUnitDir) + unitName := "rap-host-agent-updater-" + safeUnitSlug(cfg.RuntimeConfig.ContainerName) + ".service" + result := UpdateServiceResult{ + UnitName: unitName, + UnitPath: filepath.Join(cfg.UnitDir, unitName), + BinaryPath: cfg.BinaryInstallPath, + } + unit, err := buildUpdateServiceUnit(cfg) + if err != nil { + return result, err + } + result.Unit = unit + if cfg.DryRun { + if cfg.InstallSelfUpdater { + selfUnit, selfUnitName, selfUnitPath, err := buildHostAgentSelfUpdateUnit(cfg) + if err != nil { + return result, err + } + result.SelfUnit = selfUnit + result.SelfUnitName = selfUnitName + result.SelfUnitPath = selfUnitPath + } + return result, 
nil + } + if runtime.GOOS != "linux" && cfg.UnitDir == DefaultSystemdUnitDir { + return result, fmt.Errorf("systemd update service install is only supported on linux") + } + if err := installHostAgentBinary(cfg.SourceBinaryPath, cfg.BinaryInstallPath); err != nil { + return result, err + } + if err := os.MkdirAll(cfg.UnitDir, 0o755); err != nil { + return result, err + } + if err := os.WriteFile(result.UnitPath, []byte(unit), 0o644); err != nil { + return result, err + } + if cfg.InstallSelfUpdater { + selfUnit, selfUnitName, selfUnitPath, err := buildHostAgentSelfUpdateUnit(cfg) + if err != nil { + return result, err + } + if err := os.WriteFile(selfUnitPath, []byte(selfUnit), 0o644); err != nil { + return result, err + } + result.SelfUnit = selfUnit + result.SelfUnitName = selfUnitName + result.SelfUnitPath = selfUnitPath + } + result.Installed = true + if cfg.ManageSystemd { + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + if _, err := runner.Run(ctx, "systemctl", "daemon-reload"); err != nil { + return result, err + } + if _, err := runner.Run(ctx, "systemctl", "enable", "--now", unitName); err != nil { + return result, err + } + if cfg.InstallSelfUpdater && result.SelfUnitName != "" { + if _, err := runner.Run(ctx, "systemctl", "enable", "--now", result.SelfUnitName); err != nil { + return result, err + } + } + result.Started = true + } + return result, nil +} + +func buildUpdateServiceUnit(cfg UpdateServiceConfig) (string, error) { + runtimeCfg := cfg.RuntimeConfig.Normalize() + var missing []string + if runtimeCfg.BackendURL == "" { + missing = append(missing, "backend-url") + } + if runtimeCfg.ClusterID == "" { + missing = append(missing, "cluster-id") + } + if runtimeCfg.ContainerName == "" { + missing = append(missing, "container-name") + } + if runtimeCfg.StateDir == "" { + missing = append(missing, "state-dir") + } + if len(missing) > 0 { + return "", fmt.Errorf("missing required update service settings: %s", strings.Join(missing, 
", ")) + } + args := []string{ + cfg.BinaryInstallPath, + "update-loop", + "--backend-url", runtimeCfg.BackendURL, + "--cluster-id", runtimeCfg.ClusterID, + "--state-dir", runtimeCfg.StateDir, + "--container-name", runtimeCfg.ContainerName, + "--product", firstNonEmpty(cfg.Product, DefaultUpdateProduct), + "--current-version", strings.TrimSpace(cfg.CurrentVersion), + "--interval-seconds", fmt.Sprintf("%d", cfg.IntervalSeconds), + "--initial-delay-seconds", fmt.Sprintf("%d", cfg.InitialDelaySeconds), + "--jitter", fmt.Sprintf("%.3f", cfg.Jitter), + "--health-timeout-seconds", fmt.Sprintf("%d", cfg.HealthTimeoutSec), + } + if strings.TrimSpace(cfg.Channel) != "" { + args = append(args, "--channel", strings.TrimSpace(cfg.Channel)) + } + execStart := systemdJoin(args) + return fmt.Sprintf(`[Unit] +Description=RAP host-agent updater for %s +After=network-online.target docker.service +Wants=network-online.target +Requires=docker.service + +[Service] +Type=simple +ExecStart=%s +Restart=always +RestartSec=30 + +[Install] +WantedBy=multi-user.target +`, runtimeCfg.ContainerName, execStart), nil +} + +func buildHostAgentSelfUpdateUnit(cfg UpdateServiceConfig) (string, string, string, error) { + runtimeCfg := cfg.RuntimeConfig.Normalize() + if runtimeCfg.BackendURL == "" || runtimeCfg.ClusterID == "" || runtimeCfg.StateDir == "" { + return "", "", "", fmt.Errorf("backend-url, cluster-id, and state-dir are required for host-agent self updater") + } + unitName := "rap-host-agent-self-updater.service" + unitPath := filepath.Join(firstNonEmpty(cfg.UnitDir, DefaultSystemdUnitDir), unitName) + currentVersion := firstNonEmpty(cfg.SelfUpdateVersion, cfg.CurrentVersion) + args := []string{ + cfg.BinaryInstallPath, + "update-host-agent-loop", + "--backend-url", runtimeCfg.BackendURL, + "--cluster-id", runtimeCfg.ClusterID, + "--state-dir", runtimeCfg.StateDir, + "--binary-path", firstNonEmpty(cfg.BinaryInstallPath, DefaultHostAgentInstallPath), + "--current-version", currentVersion, + 
"--interval-seconds", fmt.Sprintf("%d", cfg.IntervalSeconds), + "--initial-delay-seconds", fmt.Sprintf("%d", cfg.InitialDelaySeconds+30), + "--jitter", fmt.Sprintf("%.3f", cfg.Jitter), + } + if strings.TrimSpace(cfg.Channel) != "" { + args = append(args, "--channel", strings.TrimSpace(cfg.Channel)) + } + return fmt.Sprintf(`[Unit] +Description=RAP host-agent self updater +After=network-online.target docker.service +Wants=network-online.target +Requires=docker.service + +[Service] +Type=simple +ExecStart=%s +Restart=always +RestartSec=60 + +[Install] +WantedBy=multi-user.target +`, systemdJoin(args)), unitName, unitPath, nil +} + +func installHostAgentBinary(sourcePath, targetPath string) error { + sourcePath = strings.TrimSpace(sourcePath) + targetPath = strings.TrimSpace(targetPath) + if sourcePath == "" { + var err error + sourcePath, err = os.Executable() + if err != nil { + return err + } + } + if samePath(sourcePath, targetPath) { + return os.Chmod(targetPath, 0o755) + } + src, err := os.Open(sourcePath) + if err != nil { + return err + } + defer src.Close() + if err := os.MkdirAll(filepath.Dir(targetPath), 0o755); err != nil { + return err + } + tmp := targetPath + ".tmp" + dst, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0o755) + if err != nil { + return err + } + if _, err := io.Copy(dst, src); err != nil { + _ = dst.Close() + _ = os.Remove(tmp) + return err + } + if err := dst.Close(); err != nil { + _ = os.Remove(tmp) + return err + } + if err := os.Chmod(tmp, 0o755); err != nil { + _ = os.Remove(tmp) + return err + } + return os.Rename(tmp, targetPath) +} + +func samePath(a, b string) bool { + absA, errA := filepath.Abs(a) + absB, errB := filepath.Abs(b) + if errA == nil && errB == nil { + return absA == absB + } + return a == b +} + +func safeUnitSlug(value string) string { + value = strings.ToLower(strings.TrimSpace(value)) + if value == "" { + value = DefaultContainerName + } + var b strings.Builder + lastDash := false + for _, r := 
range value { + ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') + if ok { + b.WriteRune(r) + lastDash = false + continue + } + if !lastDash { + b.WriteByte('-') + lastDash = true + } + } + out := strings.Trim(b.String(), "-") + if out == "" { + return DefaultContainerName + } + return out +} + +func systemdJoin(args []string) string { + out := make([]string, 0, len(args)) + for _, arg := range args { + out = append(out, systemdQuote(arg)) + } + return strings.Join(out, " ") +} + +func systemdQuote(arg string) string { + if arg == "" { + return `""` + } + if !strings.ContainsAny(arg, " \t\n\"'\\") { + return arg + } + arg = strings.ReplaceAll(arg, `\`, `\\`) + arg = strings.ReplaceAll(arg, `"`, `\"`) + return `"` + arg + `"` +} diff --git a/agents/rap-node-agent/internal/hostagent/service_test.go b/agents/rap-node-agent/internal/hostagent/service_test.go new file mode 100644 index 0000000..7d3b3ec --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/service_test.go @@ -0,0 +1,171 @@ +package hostagent + +import ( + "context" + "os" + "path/filepath" + "strings" + "testing" +) + +func TestInstallUpdateServiceWritesSystemdUnit(t *testing.T) { + dir := t.TempDir() + source := filepath.Join(dir, "rap-host-agent-src") + if err := os.WriteFile(source, []byte("binary"), 0o755); err != nil { + t.Fatalf("write source: %v", err) + } + unitDir := filepath.Join(dir, "systemd") + binaryPath := filepath.Join(dir, "bin", "rap-host-agent") + result, err := (DockerManager{}).InstallUpdateService(context.Background(), UpdateServiceConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + NodeName: "node-a", + ContainerName: "rap-node-agent-node-a", + StateDir: "/var/lib/rap/nodes/node-a", + }, + CurrentVersion: "0.1.0-current", + IntervalSeconds: 60, + Jitter: 0.2, + SourceBinaryPath: source, + BinaryInstallPath: binaryPath, + UnitDir: unitDir, + ManageSystemd: false, + InstallSelfUpdater: true, + SelfUpdateVersion: 
"0.1.0-host", + }) + if err != nil { + t.Fatalf("install update service: %v", err) + } + if !result.Installed || result.Started { + t.Fatalf("unexpected result: %+v", result) + } + unit, err := os.ReadFile(result.UnitPath) + if err != nil { + t.Fatalf("read unit: %v", err) + } + text := string(unit) + for _, want := range []string{ + "ExecStart=", + " update-loop", + "--backend-url http://control/api/v1", + "--cluster-id cluster-1", + "--state-dir /var/lib/rap/nodes/node-a", + "--container-name rap-node-agent-node-a", + "--current-version 0.1.0-current", + "--interval-seconds 60", + "Restart=always", + } { + if !strings.Contains(text, want) { + t.Fatalf("unit missing %q:\n%s", want, text) + } + } + if payload, err := os.ReadFile(binaryPath); err != nil || string(payload) != "binary" { + t.Fatalf("binary copy = %q, %v", payload, err) + } + if result.SelfUnitName != "rap-host-agent-self-updater.service" || result.SelfUnitPath == "" { + t.Fatalf("self updater result = %+v", result) + } + selfUnit, err := os.ReadFile(result.SelfUnitPath) + if err != nil { + t.Fatalf("read self unit: %v", err) + } + if text := string(selfUnit); !strings.Contains(text, "update-host-agent-loop") || !strings.Contains(text, "--current-version 0.1.0-host") { + t.Fatalf("unexpected self unit:\n%s", text) + } +} + +func TestWindowsHostAgentUpdateScriptTargetsWindowsService(t *testing.T) { + cfg := WindowsInstallConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + }, + NodeID: "node-1", + AutoUpdateCurrentVersion: "0.1.2", + AutoUpdateIntervalSeconds: 120, + AutoUpdateInitialDelaySeconds: 7, + AutoUpdateHealthTimeoutSeconds: 11, + } + result := WindowsInstallResult{ + NodeName: "win-a", + StateDir: `C:\ProgramData\RAP\nodes\win-a`, + NodeAgentPath: `C:\Program Files\RAP\win-a\rap-node-agent.exe`, + TaskName: "RAP Node Agent win-a", + } + script := windowsHostAgentUpdateScript(`C:\Program Files\RAP\win-a\rap-host-agent.exe`, cfg, result) + 
for _, want := range []string{ + ":loop", + "rap-host-agent.exe.next", + "update-loop --backend-url", + "--backend-url \"http://control/api/v1\"", + "--cluster-id \"cluster-1\"", + "--node-id \"node-1\"", + "--state-dir \"C:\\ProgramData\\RAP\\nodes\\win-a\"", + "--install-type windows_service", + "--binary-path \"C:\\Program Files\\RAP\\win-a\\rap-node-agent.exe\"", + "--host-agent-binary-path \"C:\\Program Files\\RAP\\win-a\\rap-host-agent.exe\"", + "--windows-task-name \"RAP Node Agent win-a\"", + "--current-version 0.1.2", + "--host-agent-current-version 0.1.2", + "--interval-seconds 120", + "timeout /t 120", + } { + if !strings.Contains(script, want) { + t.Fatalf("script missing %q:\n%s", want, script) + } + } +} + +func TestWindowsInstallReplaceAllowsExistingNodeWithoutJoinToken(t *testing.T) { + result, err := (WindowsManager{}).Install(context.Background(), WindowsInstallConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: "http://control/api/v1", + ClusterID: "cluster-1", + NodeName: "win-a", + }, + InstallDir: `C:\Program Files\RAP\win-a`, + Replace: true, + DryRun: true, + }) + if err != nil { + t.Fatalf("replace install should not require join token: %v", err) + } + if result.NodeName != "win-a" || result.NodeAgentPath == "" { + t.Fatalf("unexpected dry-run result: %+v", result) + } +} + +func TestWindowsRepairUpdaterStartsFromUnknownVersion(t *testing.T) { + dir := t.TempDir() + source := filepath.Join(dir, "rap-host-agent.exe") + if err := os.WriteFile(source, []byte("binary"), 0o755); err != nil { + t.Fatalf("write source: %v", err) + } + result, err := installWindowsHostAgentUpdater(context.Background(), WindowsManager{Runner: &recordingRunner{}}, WindowsInstallResult{ + NodeName: "win-a", + InstallDir: dir, + StateDir: dir, + NodeAgentPath: filepath.Join(dir, "rap-node-agent.exe"), + TaskName: "RAP Node Agent win-a", + StartupMode: "user-task", + }, WindowsInstallConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: "http://control/api/v1", + 
ClusterID: "cluster-1", + }, + Replace: true, + AutoUpdateEnabled: true, + HostAgentSourcePath: source, + }) + if err != nil { + t.Fatalf("install updater: %v", err) + } + script, err := os.ReadFile(filepath.Join(result.InstallDir, "rap-host-agent-update.cmd")) + if err != nil { + t.Fatalf("read updater script: %v", err) + } + if !strings.Contains(string(script), "--current-version 0.0.0") { + t.Fatalf("repair updater should force unknown current version:\n%s", script) + } +} diff --git a/agents/rap-node-agent/internal/hostagent/update.go b/agents/rap-node-agent/internal/hostagent/update.go new file mode 100644 index 0000000..216f918 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/update.go @@ -0,0 +1,947 @@ +package hostagent + +import ( + "bytes" + "context" + "encoding/json" + "errors" + "fmt" + "math/rand" + "net/http" + "net/url" + "os" + "path/filepath" + "runtime" + "strconv" + "strings" + "time" + + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/state" +) + +const ( + DefaultUpdateProduct = "rap-node-agent" + HostAgentUpdateProduct = "rap-host-agent" + DefaultUpdateInstallType = "docker" + BinaryUpdateInstallType = "linux_binary" + WindowsUpdateInstallType = "windows_service" + UpdateStateFileName = "host-update-state.json" + UpdateTriggerFileName = "update-trigger.json" +) + +var ErrNodeIdentityNotReady = errors.New("node identity is not approved yet") + +type UpdateRequest struct { + BackendURL string + ClusterID string + NodeID string + StateDir string + Product string + CurrentVersion string + OS string + Arch string + InstallType string + Channel string + ContainerName string + BinaryPath string + WindowsTaskName string + SystemdUnitName string + HealthTimeout time.Duration + DryRun bool + AllowProductionMesh bool +} + +type UpdateResult struct { + Action string + Reason string + TargetVersion string + ContainerName string + PreviousImageID string + NewImage string + ContainerID string + Loaded bool + Replaced 
bool + RolledBack bool + RestartNeeded bool +} + +type UpdateLoopConfig struct { + Request UpdateRequest + Interval time.Duration + InitialDelay time.Duration + Jitter float64 + MaxRuns int + StopOnError bool + HostAgentUpdateEnabled bool + HostAgentUpdateRequest HostAgentUpdateRequest + Logf func(format string, args ...any) +} + +type UpdateState struct { + Product string `json:"product"` + CurrentVersion string `json:"current_version"` + TargetVersion string `json:"target_version,omitempty"` + ContainerName string `json:"container_name,omitempty"` + Image string `json:"image,omitempty"` + UpdatedAt time.Time `json:"updated_at"` +} + +type UpdateTrigger struct { + SchemaVersion string `json:"schema_version"` + Generation string `json:"generation"` + Products []string `json:"products,omitempty"` + Reason string `json:"reason,omitempty"` + DeliveryMode string `json:"delivery_mode,omitempty"` + SubscriptionStatus string `json:"subscription_status,omitempty"` + UpdateServiceNodeID string `json:"update_service_node_id,omitempty"` + UpdateServiceStatus string `json:"update_service_status,omitempty"` + FallbackPollSeconds int `json:"fallback_poll_seconds,omitempty"` + ObservedAt time.Time `json:"observed_at"` +} + +type NodeUpdatePlanResponse struct { + Plan NodeUpdatePlan `json:"node_update_plan"` +} + +type NodeUpdatePlan struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + Product string `json:"product"` + CurrentVersion string `json:"current_version,omitempty"` + Action string `json:"action"` + Reason string `json:"reason"` + TargetVersion string `json:"target_version,omitempty"` + Channel string `json:"channel,omitempty"` + Strategy string `json:"strategy,omitempty"` + RollbackAllowed bool `json:"rollback_allowed"` + HealthWindowSec int `json:"health_window_seconds,omitempty"` + Artifact *ReleaseArtifact `json:"artifact,omitempty"` + AuthorityPayload json.RawMessage 
`json:"authority_payload,omitempty"` + AuthoritySignature json.RawMessage `json:"authority_signature,omitempty"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type ReleaseArtifact struct { + ID string `json:"id"` + ReleaseID string `json:"release_id"` + ClusterID string `json:"cluster_id"` + Product string `json:"product"` + Version string `json:"version"` + OS string `json:"os"` + Arch string `json:"arch"` + InstallType string `json:"install_type"` + Kind string `json:"kind"` + URL string `json:"url"` + URLs []string `json:"urls,omitempty"` + SHA256 string `json:"sha256"` + SizeBytes int64 `json:"size_bytes"` + Signature *string `json:"signature,omitempty"` + Metadata json.RawMessage `json:"metadata"` + CreatedAt time.Time `json:"created_at"` +} + +type NodeUpdateStatusRequest struct { + Product string `json:"product"` + CurrentVersion string `json:"current_version,omitempty"` + TargetVersion string `json:"target_version,omitempty"` + Phase string `json:"phase"` + Status string `json:"status"` + AttemptID string `json:"attempt_id,omitempty"` + ErrorMessage *string `json:"error_message,omitempty"` + RollbackVersion *string `json:"rollback_version,omitempty"` + Payload map[string]any `json:"payload,omitempty"` + ObservedAt time.Time `json:"observed_at,omitempty"` +} + +type dockerInspectContainer struct { + ID string `json:"Id"` + Image string `json:"Image"` + Config struct { + Image string `json:"Image"` + Env []string `json:"Env"` + } `json:"Config"` + HostConfig struct { + Privileged bool `json:"Privileged"` + NetworkMode string `json:"NetworkMode"` + CapAdd []string `json:"CapAdd"` + Devices []struct { + PathOnHost string `json:"PathOnHost"` + PathInContainer string `json:"PathInContainer"` + CgroupPermissions string `json:"CgroupPermissions"` + } `json:"Devices"` + RestartPolicy struct { + Name string `json:"Name"` + } `json:"RestartPolicy"` + } `json:"HostConfig"` + Mounts []struct { + Source string `json:"Source"` + Destination string 
`json:"Destination"` + } `json:"Mounts"` + State struct { + Running bool `json:"Running"` + } `json:"State"` +} + +func (req UpdateRequest) Normalize() UpdateRequest { + req.BackendURL = strings.TrimRight(strings.TrimSpace(req.BackendURL), "/") + req.ClusterID = strings.TrimSpace(req.ClusterID) + req.NodeID = strings.TrimSpace(req.NodeID) + req.StateDir = strings.TrimSpace(req.StateDir) + req.Product = firstNonEmpty(req.Product, DefaultUpdateProduct) + req.OS = firstNonEmpty(req.OS, runtime.GOOS) + req.Arch = firstNonEmpty(req.Arch, runtime.GOARCH) + req.InstallType = firstNonEmpty(req.InstallType, DefaultUpdateInstallType) + req.Channel = strings.TrimSpace(req.Channel) + req.ContainerName = firstNonEmpty(req.ContainerName, DefaultContainerName) + req.BinaryPath = strings.TrimSpace(req.BinaryPath) + req.WindowsTaskName = strings.TrimSpace(req.WindowsTaskName) + req.SystemdUnitName = strings.TrimSpace(req.SystemdUnitName) + if req.HealthTimeout == 0 { + req.HealthTimeout = 30 * time.Second + } + return req +} + +func (req UpdateRequest) Validate() error { + req = req.Normalize() + var missing []string + if req.BackendURL == "" { + missing = append(missing, "backend-url") + } + if req.ClusterID == "" { + missing = append(missing, "cluster-id") + } + if req.NodeID == "" && req.StateDir == "" { + missing = append(missing, "node-id-or-state-dir") + } + if req.InstallType == WindowsUpdateInstallType { + if req.BinaryPath == "" { + missing = append(missing, "binary-path") + } + if req.WindowsTaskName == "" { + missing = append(missing, "windows-task-name") + } + } else if req.InstallType == BinaryUpdateInstallType && req.Product != HostAgentUpdateProduct { + if req.BinaryPath == "" { + missing = append(missing, "binary-path") + } + if req.SystemdUnitName == "" { + missing = append(missing, "systemd-unit") + } + } else if req.ContainerName == "" { + missing = append(missing, "container-name") + } + if len(missing) > 0 { + return fmt.Errorf("missing required update 
settings: %s", strings.Join(missing, ", ")) + } + if req.HealthTimeout < 0 { + return errors.New("health timeout must not be negative") + } + return nil +} + +func (m DockerManager) ApplyUpdate(ctx context.Context, req UpdateRequest) (UpdateResult, error) { + req = req.Normalize() + var err error + req, err = resolveUpdateRequest(req) + if err != nil { + return UpdateResult{}, err + } + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + docker := firstNonEmpty(m.Binary, "docker") + + plan, err := FetchNodeUpdatePlan(ctx, req) + if err != nil { + return UpdateResult{}, err + } + if plan.HealthWindowSec > 0 && req.HealthTimeout == 30*time.Second { + req.HealthTimeout = time.Duration(plan.HealthWindowSec) * time.Second + } + result := UpdateResult{ + Action: plan.Action, + Reason: plan.Reason, + TargetVersion: plan.TargetVersion, + ContainerName: req.ContainerName, + } + if plan.Action != "update" { + if !req.DryRun { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromNoopPlan(req, plan)) + } + return result, nil + } + if plan.ProductionForwarding && !req.AllowProductionMesh { + err := errors.New("refusing update plan with production forwarding enabled") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact == nil { + err := errors.New("update plan has no artifact") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact.InstallType != "" && plan.Artifact.InstallType != DefaultUpdateInstallType { + err := fmt.Errorf("unsupported update artifact install type %q", plan.Artifact.InstallType) + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if req.DryRun { + 
result.NewImage = artifactImage(*plan.Artifact, "") + return result, nil + } + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "planned", + Status: "accepted", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"strategy": plan.Strategy, "reason": plan.Reason}, + }) + + current, cfg, err := m.runtimeConfigFromContainer(ctx, runner, docker, req.ContainerName) + if err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "inspect", "failed", err)) + return result, err + } + result.PreviousImageID = current.Image + cfg.BackendURL = firstNonEmpty(cfg.BackendURL, req.BackendURL) + cfg.ClusterID = firstNonEmpty(cfg.ClusterID, req.ClusterID) + cfg.ContainerName = req.ContainerName + cfg.Image = artifactImage(*plan.Artifact, cfg.Image) + cfg.ImageArtifactURLs = artifactURLsForBackend(*plan.Artifact, req.BackendURL) + cfg.ImageArtifactSHA256 = plan.Artifact.SHA256 + cfg.ImageArtifactSizeBytes = plan.Artifact.SizeBytes + cfg.Replace = true + cfg.JoinToken = "" + result.NewImage = cfg.Image + + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "download", + Status: "started", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"artifact_url": plan.Artifact.URL, "artifact_urls": cfg.ImageArtifactURLs, "image": cfg.Image}, + }) + installed, err := m.Install(ctx, cfg) + if err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "apply", "failed", err)) + rollbackErr := m.rollbackContainer(ctx, runner, docker, cfg, current, plan.RollbackAllowed) + if rollbackErr == 
nil && plan.RollbackAllowed { + result.RolledBack = true + } + return result, err + } + result.Loaded = installed.Loaded + result.Replaced = installed.Replaced + result.ContainerID = installed.ContainerID + + if err := m.waitContainerRunning(ctx, runner, docker, req.ContainerName, req.HealthTimeout); err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "health_check", "failed", err)) + rollbackErr := m.rollbackContainer(ctx, runner, docker, cfg, current, plan.RollbackAllowed) + if rollbackErr == nil && plan.RollbackAllowed { + result.RolledBack = true + } + return result, err + } + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "health_check", + Status: "succeeded", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"container_id": installed.ContainerID, "image": cfg.Image}, + }) + _ = saveUpdateState(req.StateDir, UpdateState{ + Product: req.Product, + CurrentVersion: plan.TargetVersion, + TargetVersion: plan.TargetVersion, + ContainerName: req.ContainerName, + Image: cfg.Image, + UpdatedAt: time.Now().UTC(), + }) + return result, nil +} + +func (m DockerManager) RunUpdateLoop(ctx context.Context, cfg UpdateLoopConfig) error { + req := cfg.Request.Normalize() + if err := req.Validate(); err != nil { + return err + } + if cfg.Interval == 0 { + cfg.Interval = time.Hour + } + if cfg.Interval < 0 { + return errors.New("update loop interval must not be negative") + } + if cfg.InitialDelay < 0 { + return errors.New("update loop initial delay must not be negative") + } + if cfg.Jitter < 0 || cfg.Jitter > 1 { + return errors.New("update loop jitter must be between 0 and 1") + } + logf := cfg.Logf + if logf == nil { + logf = func(string, ...any) {} + } + if cfg.InitialDelay > 0 { + if err := sleepContext(ctx, 
jitteredDuration(cfg.InitialDelay, cfg.Jitter)); err != nil { + return err + } + } + runs := 0 + lastTriggerGeneration := currentUpdateTriggerGeneration(req.StateDir) + for { + runs++ + result, err := m.ApplyUpdate(ctx, req) + if err != nil { + if errors.Is(err, ErrNodeIdentityNotReady) { + logf("update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, req.StateDir) + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepContext(ctx, jitteredDuration(cfg.Interval, cfg.Jitter)); err != nil { + return err + } + continue + } + logf("update_loop run=%d status=failed error=%v", runs, err) + if cfg.StopOnError { + return err + } + } else { + logf("update_loop run=%d action=%s reason=%s target=%s container=%s loaded=%t replaced=%t rolled_back=%t", + runs, + result.Action, + result.Reason, + result.TargetVersion, + result.ContainerName, + result.Loaded, + result.Replaced, + result.RolledBack, + ) + if result.Action == "update" && result.TargetVersion != "" && !result.RolledBack { + req.CurrentVersion = result.TargetVersion + } + } + if cfg.HostAgentUpdateEnabled { + hostReq := cfg.HostAgentUpdateRequest + hostReq.BackendURL = firstNonEmpty(hostReq.BackendURL, req.BackendURL) + hostReq.ClusterID = firstNonEmpty(hostReq.ClusterID, req.ClusterID) + hostReq.NodeID = firstNonEmpty(hostReq.NodeID, req.NodeID) + hostReq.StateDir = firstNonEmpty(hostReq.StateDir, req.StateDir) + hostReq.Channel = firstNonEmpty(hostReq.Channel, req.Channel) + hostReq.CurrentVersion = firstNonEmpty(hostReq.CurrentVersion, req.CurrentVersion) + hostReq.OS = firstNonEmpty(hostReq.OS, req.OS) + hostReq.Arch = firstNonEmpty(hostReq.Arch, req.Arch) + hostReq.InstallType = firstNonEmpty(hostReq.InstallType, hostAgentInstallTypeFor(req.InstallType)) + result, err := m.ApplyHostAgentUpdate(ctx, hostReq) + if err != nil { + if errors.Is(err, ErrNodeIdentityNotReady) { + logf("host_agent_update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, 
hostReq.StateDir) + } else { + logf("host_agent_update_loop run=%d status=failed error=%v", runs, err) + if cfg.StopOnError { + return err + } + } + } else { + logf("host_agent_update_loop run=%d action=%s reason=%s target=%s binary=%s replaced=%t restart_needed=%t", + runs, + result.Action, + result.Reason, + result.TargetVersion, + result.NewImage, + result.Replaced, + result.RestartNeeded, + ) + if result.Action == "update" && result.TargetVersion != "" { + cfg.HostAgentUpdateRequest.CurrentVersion = result.TargetVersion + } + if result.RestartNeeded { + return nil + } + } + } + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepUntilUpdateIntervalOrTrigger(ctx, req.StateDir, jitteredDuration(cfg.Interval, cfg.Jitter), &lastTriggerGeneration); err != nil { + return err + } + } +} + +func FetchNodeUpdatePlan(ctx context.Context, req UpdateRequest) (NodeUpdatePlan, error) { + var err error + req, err = resolveUpdateRequest(req) + if err != nil { + return NodeUpdatePlan{}, err + } + values := url.Values{} + values.Set("product", req.Product) + values.Set("current_version", req.CurrentVersion) + values.Set("os", req.OS) + values.Set("arch", req.Arch) + values.Set("install_type", req.InstallType) + if req.Channel != "" { + values.Set("channel", req.Channel) + } + endpoint := fmt.Sprintf("%s/clusters/%s/nodes/%s/updates/plan?%s", req.BackendURL, url.PathEscape(req.ClusterID), url.PathEscape(req.NodeID), values.Encode()) + httpReq, err := http.NewRequestWithContext(ctx, http.MethodGet, endpoint, nil) + if err != nil { + return NodeUpdatePlan{}, err + } + resp, err := http.DefaultClient.Do(httpReq) + if err != nil { + return NodeUpdatePlan{}, err + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return NodeUpdatePlan{}, fmt.Errorf("fetch update plan: %s", resp.Status) + } + var out NodeUpdatePlanResponse + if err := json.NewDecoder(resp.Body).Decode(&out); err != nil { + return NodeUpdatePlan{}, err + } + 
return out.Plan, nil +} + +func resolveUpdateRequest(req UpdateRequest) (UpdateRequest, error) { + req = req.Normalize() + if err := req.Validate(); err != nil { + return UpdateRequest{}, err + } + if req.NodeID == "" { + identity, err := state.Load(filepath.Join(req.StateDir, state.FileName)) + if err != nil { + if errors.Is(err, os.ErrNotExist) { + return UpdateRequest{}, ErrNodeIdentityNotReady + } + return UpdateRequest{}, err + } + if strings.TrimSpace(identity.NodeID) == "" { + return UpdateRequest{}, ErrNodeIdentityNotReady + } + req.NodeID = strings.TrimSpace(identity.NodeID) + if req.ClusterID == "" { + req.ClusterID = strings.TrimSpace(identity.ClusterID) + } + } + if updateState, err := loadUpdateState(req.StateDir, req.Product); err == nil && updateState.Product == req.Product && updateState.CurrentVersion != "" { + req.CurrentVersion = updateState.CurrentVersion + } + return req, nil +} + +func ReportNodeUpdateStatus(ctx context.Context, backendURL, clusterID, nodeID string, request NodeUpdateStatusRequest) error { + backendURL = strings.TrimRight(strings.TrimSpace(backendURL), "/") + endpoint := fmt.Sprintf("%s/clusters/%s/nodes/%s/updates/status", backendURL, url.PathEscape(clusterID), url.PathEscape(nodeID)) + body, err := json.Marshal(request) + if err != nil { + return err + } + httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(body)) + if err != nil { + return err + } + httpReq.Header.Set("Content-Type", "application/json") + resp, err := http.DefaultClient.Do(httpReq) + if err != nil { + return err + } + defer resp.Body.Close() + if resp.StatusCode < 200 || resp.StatusCode >= 300 { + return fmt.Errorf("report update status: %s", resp.Status) + } + return nil +} + +func (m DockerManager) runtimeConfigFromContainer(ctx context.Context, runner CommandRunner, docker, containerName string) (dockerInspectContainer, RuntimeConfig, error) { + out, err := runner.Run(ctx, docker, "inspect", containerName) + if err 
!= nil { + return dockerInspectContainer{}, RuntimeConfig{}, err + } + var inspected []dockerInspectContainer + if err := json.Unmarshal([]byte(out), &inspected); err != nil { + return dockerInspectContainer{}, RuntimeConfig{}, err + } + if len(inspected) == 0 { + return dockerInspectContainer{}, RuntimeConfig{}, fmt.Errorf("container %q not found", containerName) + } + env := envMap(inspected[0].Config.Env) + cfg := RuntimeConfig{ + BackendURL: env["RAP_BACKEND_URL"], + ClusterID: env["RAP_CLUSTER_ID"], + NodeName: firstNonEmpty(env["RAP_NODE_NAME"], containerName), + Image: inspected[0].Config.Image, + ContainerName: containerName, + StateDir: hostStateDir(inspected[0]), + Network: firstNonEmpty(inspected[0].HostConfig.NetworkMode, DefaultNetwork), + RestartPolicy: firstNonEmpty(inspected[0].HostConfig.RestartPolicy.Name, "unless-stopped"), + WorkloadSupervisionEnabled: parseBool(env["RAP_WORKLOAD_SUPERVISION_ENABLED"]), + MeshSyntheticRuntimeEnabled: true, + MeshProductionForwardingEnabled: parseBool(env["RAP_MESH_PRODUCTION_FORWARDING_ENABLED"]), + MeshListenAddr: env["RAP_MESH_LISTEN_ADDR"], + MeshListenPortMode: env["RAP_MESH_LISTEN_PORT_MODE"], + MeshListenAutoPortStart: parseInt(env["RAP_MESH_LISTEN_AUTO_PORT_START"]), + MeshListenAutoPortEnd: parseInt(env["RAP_MESH_LISTEN_AUTO_PORT_END"]), + MeshAdvertiseEndpoint: env["RAP_MESH_ADVERTISE_ENDPOINT"], + MeshAdvertiseEndpointsJSON: env["RAP_MESH_ADVERTISE_ENDPOINTS_JSON"], + MeshAdvertiseTransport: env["RAP_MESH_ADVERTISE_TRANSPORT"], + MeshConnectivityMode: env["RAP_MESH_CONNECTIVITY_MODE"], + MeshNATType: env["RAP_MESH_NAT_TYPE"], + MeshRegion: env["RAP_MESH_REGION"], + HeartbeatIntervalSeconds: parseInt(env["RAP_HEARTBEAT_INTERVAL_SECONDS"]), + EnrollmentPollIntervalSeconds: parseInt(env["RAP_ENROLLMENT_POLL_INTERVAL_SECONDS"]), + EnrollmentPollTimeoutSeconds: parseInt(env["RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS"]), + ProductionObservationSinkCap: 
parseInt(env["RAP_MESH_PRODUCTION_OBSERVATION_SINK_CAPACITY"]), + DockerVPNGatewayEnabled: dockerInspectHasVPNGatewayRuntime(inspected[0]), + } + return inspected[0], cfg.Normalize(), nil +} + +func dockerInspectHasVPNGatewayRuntime(container dockerInspectContainer) bool { + hasNetAdmin := false + for _, cap := range container.HostConfig.CapAdd { + if strings.EqualFold(strings.TrimSpace(cap), "NET_ADMIN") { + hasNetAdmin = true + break + } + } + hasTun := false + for _, device := range container.HostConfig.Devices { + if device.PathOnHost == "/dev/net/tun" || device.PathInContainer == "/dev/net/tun" { + hasTun = true + break + } + } + return (container.HostConfig.Privileged || hasNetAdmin) && hasTun +} + +func (m DockerManager) waitContainerRunning(ctx context.Context, runner CommandRunner, docker, containerName string, timeout time.Duration) error { + deadline := time.Now().Add(timeout) + for { + out, err := runner.Run(ctx, docker, "inspect", "--format", "{{.State.Running}}", containerName) + if err == nil && strings.TrimSpace(out) == "true" { + return nil + } + if timeout == 0 || time.Now().After(deadline) { + if err != nil { + return err + } + return fmt.Errorf("container %q is not running", containerName) + } + select { + case <-ctx.Done(): + return ctx.Err() + case <-time.After(time.Second): + } + } +} + +func (m DockerManager) rollbackContainer(ctx context.Context, runner CommandRunner, docker string, cfg RuntimeConfig, previous dockerInspectContainer, allowed bool) error { + if !allowed || strings.TrimSpace(previous.Image) == "" { + return nil + } + rollbackCfg := cfg + rollbackCfg.Image = previous.Image + rollbackCfg.ImageArtifactURLs = nil + rollbackCfg.ImageArtifactSHA256 = "" + rollbackCfg.ImageArtifactSizeBytes = 0 + rollbackCfg.Replace = true + _, err := m.Install(ctx, rollbackCfg) + if err == nil { + _, _ = runner.Run(ctx, docker, "inspect", "--format", "{{.State.Running}}", cfg.ContainerName) + } + return err +} + +func artifactImage(artifact 
ReleaseArtifact, fallback string) string { + if len(artifact.Metadata) > 0 { + var metadata struct { + Image string `json:"image"` + } + if err := json.Unmarshal(artifact.Metadata, &metadata); err == nil && strings.TrimSpace(metadata.Image) != "" { + return strings.TrimSpace(metadata.Image) + } + } + if artifact.InstallType == DefaultUpdateInstallType && artifact.Product != "" && artifact.Version != "" { + return strings.TrimSpace(artifact.Product) + ":" + strings.TrimSpace(artifact.Version) + } + return firstNonEmpty(fallback, DefaultImage) +} + +func artifactURLs(artifact ReleaseArtifact) []string { + out := make([]string, 0, 1+len(artifact.URLs)) + for _, raw := range append([]string{artifact.URL}, artifact.URLs...) { + raw = strings.TrimSpace(raw) + if raw == "" || containsArtifactURL(out, raw) { + continue + } + out = append(out, raw) + } + return out +} + +func artifactURLsForBackend(artifact ReleaseArtifact, backendURL string) []string { + urls := artifactURLs(artifact) + base, err := url.Parse(strings.TrimSpace(backendURL)) + if err != nil || base.Scheme == "" || base.Host == "" { + return urls + } + origin := base.Scheme + "://" + base.Host + out := make([]string, 0, len(urls)) + for _, raw := range urls { + if strings.HasPrefix(raw, "/") { + raw = origin + raw + } + if !containsArtifactURL(out, raw) { + out = append(out, raw) + } + } + return out +} + +func containsArtifactURL(values []string, value string) bool { + for _, item := range values { + if item == value { + return true + } + } + return false +} + +func statusFromError(req UpdateRequest, plan NodeUpdatePlan, phase, status string, err error) NodeUpdateStatusRequest { + message := err.Error() + return NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: phase, + Status: status, + AttemptID: updateAttemptID(plan), + ErrorMessage: &message, + ObservedAt: time.Now().UTC(), + } +} + +func statusFromNoopPlan(req 
UpdateRequest, plan NodeUpdatePlan) NodeUpdateStatusRequest { + return NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "plan", + Status: "noop", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{ + "action": plan.Action, + "reason": plan.Reason, + "strategy": plan.Strategy, + "channel": plan.Channel, + }, + } +} + +func updateAttemptID(plan NodeUpdatePlan) string { + parts := []string{plan.NodeID, plan.Product, plan.TargetVersion} + if plan.Artifact != nil { + parts = append(parts, plan.Artifact.ID) + } + return strings.Join(parts, ":") +} + +func envMap(items []string) map[string]string { + out := map[string]string{} + for _, item := range items { + key, value, ok := strings.Cut(item, "=") + if ok { + out[key] = value + } + } + return out +} + +func hostStateDir(container dockerInspectContainer) string { + for _, mount := range container.Mounts { + if mount.Destination == "/var/lib/rap-node-agent" && mount.Source != "" { + return mount.Source + } + } + return DefaultStateDir +} + +func parseBool(value string) bool { + switch strings.ToLower(strings.TrimSpace(value)) { + case "1", "true", "yes", "y", "on": + return true + default: + return false + } +} + +func parseInt(value string) int { + out, _ := strconv.Atoi(strings.TrimSpace(value)) + return out +} + +func loadUpdateState(stateDir string, product string) (UpdateState, error) { + stateDir = strings.TrimSpace(stateDir) + if stateDir == "" { + return UpdateState{}, os.ErrNotExist + } + product = firstNonEmpty(normalizeUpdateProductToken(product), DefaultUpdateProduct) + payload, err := os.ReadFile(updateStatePath(stateDir, product)) + if err != nil && product == DefaultUpdateProduct { + payload, err = os.ReadFile(filepath.Join(stateDir, UpdateStateFileName)) + } + if err != nil { + return UpdateState{}, err + } + var item UpdateState + if err := json.Unmarshal(payload, &item); err != 
nil { + return UpdateState{}, err + } + item.Product = firstNonEmpty(item.Product, product) + return item, nil +} + +func saveUpdateState(stateDir string, item UpdateState) error { + stateDir = strings.TrimSpace(stateDir) + if stateDir == "" || item.CurrentVersion == "" { + return nil + } + item.Product = firstNonEmpty(item.Product, DefaultUpdateProduct) + if item.UpdatedAt.IsZero() { + item.UpdatedAt = time.Now().UTC() + } + if err := os.MkdirAll(stateDir, 0o700); err != nil { + return err + } + payload, err := json.MarshalIndent(item, "", " ") + if err != nil { + return err + } + return os.WriteFile(updateStatePath(stateDir, item.Product), payload, 0o600) +} + +func updateStatePath(stateDir, product string) string { + product = normalizeUpdateProductToken(firstNonEmpty(product, DefaultUpdateProduct)) + if product == "" || product == DefaultUpdateProduct { + return filepath.Join(stateDir, UpdateStateFileName) + } + return filepath.Join(stateDir, "host-update-state-"+product+".json") +} + +func UpdateTriggerPath(stateDir string) string { + return filepath.Join(strings.TrimSpace(stateDir), UpdateTriggerFileName) +} + +func SaveUpdateTrigger(stateDir string, trigger UpdateTrigger) error { + stateDir = strings.TrimSpace(stateDir) + trigger.Generation = strings.TrimSpace(trigger.Generation) + if stateDir == "" || trigger.Generation == "" { + return nil + } + if trigger.SchemaVersion == "" { + trigger.SchemaVersion = "rap.node_update_trigger.v1" + } + if trigger.ObservedAt.IsZero() { + trigger.ObservedAt = time.Now().UTC() + } + if err := os.MkdirAll(stateDir, 0o700); err != nil { + return err + } + payload, err := json.MarshalIndent(trigger, "", " ") + if err != nil { + return err + } + return os.WriteFile(UpdateTriggerPath(stateDir), payload, 0o600) +} + +func currentUpdateTriggerGeneration(stateDir string) string { + payload, err := os.ReadFile(UpdateTriggerPath(stateDir)) + if err != nil { + return "" + } + var trigger UpdateTrigger + if err := 
json.Unmarshal(payload, &trigger); err != nil { + return "" + } + return strings.TrimSpace(trigger.Generation) +} + +func CurrentUpdateTriggerGenerationForNodeAgent(stateDir string) string { + return currentUpdateTriggerGeneration(stateDir) +} + +func normalizeUpdateProductToken(value string) string { + value = strings.ToLower(strings.TrimSpace(value)) + var b strings.Builder + for _, r := range value { + if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' || r == '_' || r == '.' { + b.WriteRune(r) + } + } + return b.String() +} + +func sleepContext(ctx context.Context, duration time.Duration) error { + if duration <= 0 { + return nil + } + timer := time.NewTimer(duration) + defer timer.Stop() + select { + case <-ctx.Done(): + return ctx.Err() + case <-timer.C: + return nil + } +} + +func sleepUntilUpdateIntervalOrTrigger(ctx context.Context, stateDir string, duration time.Duration, lastGeneration *string) error { + if duration <= 0 { + return nil + } + deadline := time.NewTimer(duration) + defer deadline.Stop() + ticker := time.NewTicker(5 * time.Second) + defer ticker.Stop() + for { + select { + case <-ctx.Done(): + return ctx.Err() + case <-deadline.C: + return nil + case <-ticker.C: + generation := currentUpdateTriggerGeneration(stateDir) + if generation != "" && lastGeneration != nil && generation != *lastGeneration { + *lastGeneration = generation + return nil + } + } + } +} + +func jitteredDuration(base time.Duration, jitter float64) time.Duration { + if base <= 0 || jitter <= 0 { + return base + } + spread := int64(float64(base) * jitter) + if spread <= 0 { + return base + } + offset := rand.Int63n(spread*2+1) - spread + return base + time.Duration(offset) +} diff --git a/agents/rap-node-agent/internal/hostagent/update_test.go b/agents/rap-node-agent/internal/hostagent/update_test.go new file mode 100644 index 0000000..ff3b423 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/update_test.go @@ -0,0 +1,672 @@ +package hostagent + 
+import ( + "context" + "encoding/json" + "fmt" + "net/http" + "net/http/httptest" + "os" + "path/filepath" + "strings" + "testing" + "time" + + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/state" +) + +type updateRunner struct { + calls [][]string + healthOkay bool + inspectJSON string +} + +func TestArtifactURLsForBackendResolvesControlPlaneRelativeDownloads(t *testing.T) { + urls := artifactURLsForBackend(ReleaseArtifact{ + URL: "/downloads/rap-node-agent-0.2.92.tar", + URLs: []string{"/downloads/mirror.tar", "https://cdn.example.test/agent.tar"}, + }, "http://control.example.test:18080/api/v1") + want := []string{ + "http://control.example.test:18080/downloads/rap-node-agent-0.2.92.tar", + "http://control.example.test:18080/downloads/mirror.tar", + "https://cdn.example.test/agent.tar", + } + if len(urls) != len(want) { + t.Fatalf("urls = %#v", urls) + } + for i := range want { + if urls[i] != want[i] { + t.Fatalf("urls[%d] = %q, want %q; all=%#v", i, urls[i], want[i], urls) + } + } +} + +func (r *updateRunner) Run(_ context.Context, name string, args ...string) (string, error) { + r.calls = append(r.calls, append([]string{name}, args...)) + if len(args) >= 2 && args[0] == "inspect" && args[1] == "--format" { + if r.healthOkay { + return "true\n", nil + } + return "false\n", nil + } + if len(args) == 2 && args[0] == "inspect" { + return r.inspectJSON, nil + } + if len(args) >= 2 && args[0] == "image" && args[1] == "inspect" { + return "[]", nil + } + if len(args) > 0 && args[0] == "run" { + return "updated-container\n", nil + } + return "", nil +} + +func TestApplyUpdateFetchesPlanLoadsImageAndRecreatesContainer(t *testing.T) { + artifactBody := []byte("fake docker image tar") + statuses := []NodeUpdateStatusRequest{} + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + _ = 
json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "schema_version": "rap.node_update_plan.v1", + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": "rap-node-agent", + "current_version": "0.1.0-old", + "action": "update", + "reason": "matching_release_available", + "target_version": "0.1.0-new", + "rollback_allowed": true, + "health_window_seconds": 1, + "production_forwarding": false, + "artifact": map[string]any{ + "id": "artifact-1", + "product": "rap-node-agent", + "version": "0.1.0-new", + "os": "linux", + "arch": "amd64", + "install_type": "docker", + "url": serverArtifactURL(r), + "sha256": "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + "size_bytes": len(artifactBody), + "metadata": map[string]any{"image": "rap-node-agent:test-new"}, + }, + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + var status NodeUpdateStatusRequest + if err := json.NewDecoder(r.Body).Decode(&status); err != nil { + t.Fatalf("decode status: %v", err) + } + statuses = append(statuses, status) + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + case r.Method == http.MethodGet && r.URL.Path == "/artifact.tar": + _, _ = w.Write(artifactBody) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + + runner := &updateRunner{healthOkay: true, inspectJSON: dockerInspectFixture(server.URL)} + result, err := (DockerManager{Runner: runner}).ApplyUpdate(context.Background(), UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.1.0-old", + ContainerName: "rap-node-agent-node-1", + HealthTimeout: time.Second, + }) + if err != nil { + t.Fatalf("apply update: %v", err) + } + if result.Action != "update" || !result.Loaded || !result.Replaced || result.NewImage != "rap-node-agent:test-new" { + t.Fatalf("unexpected result: %+v", 
result) + } + joined := strings.Join(flattenCalls(runner.calls), "\x00") + for _, want := range []string{"inspect\x00rap-node-agent-node-1", "load\x00-i", "rm\x00-f\x00rap-node-agent-node-1", "run\x00-d", "RAP_NODE_NAME=node-a"} { + if !strings.Contains(joined, want) { + t.Fatalf("missing docker call part %q in %#v", want, runner.calls) + } + } + if len(statuses) != 3 || statuses[0].Phase != "planned" || statuses[1].Phase != "download" || statuses[2].Status != "succeeded" { + t.Fatalf("statuses = %+v", statuses) + } +} + +func TestApplyUpdatePreservesDockerVPNGatewayRuntime(t *testing.T) { + previousStatHostPath := statHostPath + statHostPath = func(string) (os.FileInfo, error) { return nil, nil } + t.Cleanup(func() { statHostPath = previousStatHostPath }) + + artifactBody := []byte("fake docker image tar") + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + _ = json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "schema_version": "rap.node_update_plan.v1", + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": "rap-node-agent", + "current_version": "0.2.7", + "action": "update", + "reason": "matching_release_available", + "target_version": "0.2.8", + "rollback_allowed": true, + "health_window_seconds": 1, + "artifact": map[string]any{ + "id": "artifact-1", + "product": "rap-node-agent", + "version": "0.2.8", + "os": "linux", + "arch": "amd64", + "install_type": "docker", + "url": serverArtifactURL(r), + "sha256": "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + "size_bytes": len(artifactBody), + "metadata": map[string]any{"image": "rap-node-agent:test-new"}, + }, + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + case 
r.Method == http.MethodGet && r.URL.Path == "/artifact.tar": + _, _ = w.Write(artifactBody) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + + runner := &updateRunner{healthOkay: true, inspectJSON: dockerInspectFixtureWithVPNGatewayRuntime()} + result, err := (DockerManager{Runner: runner}).ApplyUpdate(context.Background(), UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.2.7", + ContainerName: "rap-node-agent-node-1", + HealthTimeout: time.Second, + }) + if err != nil { + t.Fatalf("ApplyUpdate failed: %v", err) + } + if !result.Replaced { + t.Fatalf("expected replacement") + } + joined := strings.Join(flattenCalls(runner.calls), "\x00") + for _, want := range []string{"--privileged", "--cap-add\x00NET_ADMIN", "--device\x00/dev/net/tun:/dev/net/tun"} { + if !strings.Contains(joined, want) { + t.Fatalf("docker run did not preserve %q in %#v", want, runner.calls) + } + } +} + +func TestApplyUpdateNoopsWithoutDockerWhenPlanHasNoAction(t *testing.T) { + statuses := []NodeUpdateStatusRequest{} + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + _ = json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": "rap-node-agent", + "current_version": "0.1.3", + "action": "none", + "reason": "already_current", + "target_version": "0.1.3", + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + var status NodeUpdateStatusRequest + if err := json.NewDecoder(r.Body).Decode(&status); err != nil { + t.Fatalf("decode status: %v", err) + } + statuses = append(statuses, status) + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + default: + 
t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + runner := &updateRunner{} + result, err := (DockerManager{Runner: runner}).ApplyUpdate(context.Background(), UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.1.3", + ContainerName: "rap-node-agent-node-1", + }) + if err != nil { + t.Fatalf("apply update: %v", err) + } + if result.Action != "none" || result.Reason != "already_current" { + t.Fatalf("result = %+v", result) + } + if len(runner.calls) != 0 { + t.Fatalf("docker should not be called, got %#v", runner.calls) + } + if len(statuses) != 1 || statuses[0].Phase != "plan" || statuses[0].Status != "noop" || statuses[0].TargetVersion != "0.1.3" { + t.Fatalf("statuses = %+v", statuses) + } +} + +func TestWindowsApplyUpdateNoopReportsTaskStatus(t *testing.T) { + statuses := []NodeUpdateStatusRequest{} + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + _ = json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": "rap-node-agent", + "current_version": "0.1.3", + "action": "none", + "reason": "already_current", + "target_version": "0.1.3", + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + var status NodeUpdateStatusRequest + if err := json.NewDecoder(r.Body).Decode(&status); err != nil { + t.Fatalf("decode status: %v", err) + } + statuses = append(statuses, status) + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + result, err := (WindowsManager{Runner: &updateRunner{}}).ApplyUpdate(context.Background(), UpdateRequest{ + BackendURL: 
server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.1.3", + InstallType: WindowsUpdateInstallType, + BinaryPath: `C:\Program Files\RAP\node\rap-node-agent.exe`, + WindowsTaskName: "RAP Node Agent node", + }) + if err != nil { + t.Fatalf("windows apply update: %v", err) + } + if result.Action != "none" || result.Reason != "already_current" { + t.Fatalf("result = %+v", result) + } + if len(statuses) != 1 || statuses[0].Phase != "plan" || statuses[0].Status != "noop" { + t.Fatalf("statuses = %+v", statuses) + } + if statuses[0].Payload["task"] != "RAP Node Agent node" { + t.Fatalf("status payload = %+v", statuses[0].Payload) + } +} + +func TestRunUpdateLoopAdvancesCurrentVersionAfterSuccessfulUpdate(t *testing.T) { + artifactBody := []byte("fake docker image tar") + planRequests := []string{} + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + current := r.URL.Query().Get("current_version") + planRequests = append(planRequests, current) + action := "update" + reason := "matching_release_available" + if current == "0.1.0-new" { + action = "none" + reason = "already_current" + } + plan := map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": "rap-node-agent", + "current_version": current, + "action": action, + "reason": reason, + "target_version": "0.1.0-new", + "rollback_allowed": true, + "production_forwarding": false, + } + if action == "update" { + plan["artifact"] = map[string]any{ + "id": "artifact-1", + "product": "rap-node-agent", + "version": "0.1.0-new", + "os": "linux", + "arch": "amd64", + "install_type": "docker", + "url": serverArtifactURL(r), + "sha256": "5c2fbd41c87e83dc372690e8e1244b98baf8aded64870b369c28c4b313e15cc2", + "size_bytes": len(artifactBody), + "metadata": map[string]any{"image": "rap-node-agent:test-new"}, + } + } + _ = 
json.NewEncoder(w).Encode(map[string]any{"node_update_plan": plan}) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + case r.Method == http.MethodGet && r.URL.Path == "/artifact.tar": + _, _ = w.Write(artifactBody) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + + runner := &updateRunner{healthOkay: true, inspectJSON: dockerInspectFixture(server.URL)} + err := (DockerManager{Runner: runner}).RunUpdateLoop(context.Background(), UpdateLoopConfig{ + Request: UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.1.0-old", + ContainerName: "rap-node-agent-node-1", + HealthTimeout: time.Second, + }, + Interval: time.Millisecond, + MaxRuns: 2, + }) + if err != nil { + t.Fatalf("run update loop: %v", err) + } + if strings.Join(planRequests, ",") != "0.1.0-old,0.1.0-new" { + t.Fatalf("plan current versions = %#v", planRequests) + } +} + +func TestRunUpdateLoopReportsHostAgentStatusWhenEnabled(t *testing.T) { + statuses := []NodeUpdateStatusRequest{} + planProducts := []string{} + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + product := r.URL.Query().Get("product") + planProducts = append(planProducts, product) + _ = json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": product, + "current_version": "0.1.3", + "action": "none", + "reason": "already_current", + "target_version": "0.1.3", + "rollback_allowed": true, + "production_forwarding": false, + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + var status NodeUpdateStatusRequest + if err := 
json.NewDecoder(r.Body).Decode(&status); err != nil { + t.Fatalf("decode status: %v", err) + } + statuses = append(statuses, status) + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + + err := (DockerManager{}).RunUpdateLoop(context.Background(), UpdateLoopConfig{ + Request: UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + NodeID: "node-1", + CurrentVersion: "0.1.3", + ContainerName: "rap-node-agent-node-1", + }, + HostAgentUpdateEnabled: true, + HostAgentUpdateRequest: HostAgentUpdateRequest{ + CurrentVersion: "0.1.3", + BinaryPath: filepath.Join(t.TempDir(), "rap-host-agent"), + }, + MaxRuns: 1, + }) + if err != nil { + t.Fatalf("run update loop: %v", err) + } + if strings.Join(planProducts, ",") != "rap-node-agent,rap-host-agent" { + t.Fatalf("plan products = %#v", planProducts) + } + if len(statuses) != 2 || statuses[0].Product != "rap-node-agent" || statuses[1].Product != "rap-host-agent" { + t.Fatalf("statuses = %+v", statuses) + } + if statuses[1].Phase != "plan" || statuses[1].Status != "noop" { + t.Fatalf("host-agent status = %+v", statuses[1]) + } +} + +func TestFetchNodeUpdatePlanResolvesNodeIDAndVersionFromStateDir(t *testing.T) { + dir := t.TempDir() + if err := state.Save(filepath.Join(dir, state.FileName), state.Identity{ + NodeID: "node-from-state", + ClusterID: "cluster-1", + NodeName: "node-a", + }); err != nil { + t.Fatalf("save identity: %v", err) + } + if err := saveUpdateState(dir, UpdateState{ + Product: "rap-node-agent", + CurrentVersion: "0.1.0-state", + }); err != nil { + t.Fatalf("save update state: %v", err) + } + var gotPath string + var gotCurrent string + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + gotPath = r.URL.Path + gotCurrent = r.URL.Query().Get("current_version") + _ = 
json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-from-state", + "product": "rap-node-agent", + "action": "none", + "reason": "already_current", + }, + }) + })) + defer server.Close() + if _, err := FetchNodeUpdatePlan(context.Background(), UpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + StateDir: dir, + CurrentVersion: "0.1.0-flag", + }); err != nil { + t.Fatalf("fetch plan: %v", err) + } + if !strings.Contains(gotPath, "/nodes/node-from-state/updates/plan") || gotCurrent != "0.1.0-state" { + t.Fatalf("path/current = %q/%q", gotPath, gotCurrent) + } +} + +func TestApplyHostAgentUpdateDownloadsAndReplacesBinary(t *testing.T) { + dir := t.TempDir() + if err := state.Save(filepath.Join(dir, state.FileName), state.Identity{ + NodeID: "node-1", + ClusterID: "cluster-1", + NodeName: "node-a", + }); err != nil { + t.Fatalf("save identity: %v", err) + } + binaryPath := filepath.Join(dir, "rap-host-agent") + artifactBody := []byte("new host agent binary") + server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch { + case r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/updates/plan"): + if r.URL.Query().Get("product") != HostAgentUpdateProduct || r.URL.Query().Get("install_type") != BinaryUpdateInstallType { + t.Fatalf("unexpected query: %s", r.URL.RawQuery) + } + _ = json.NewEncoder(w).Encode(map[string]any{ + "node_update_plan": map[string]any{ + "cluster_id": "cluster-1", + "node_id": "node-1", + "product": HostAgentUpdateProduct, + "action": "update", + "reason": "matching_release_available", + "target_version": "0.1.0-host-new", + "rollback_allowed": false, + "production_forwarding": false, + "artifact": map[string]any{ + "id": "artifact-host-1", + "product": HostAgentUpdateProduct, + "version": "0.1.0-host-new", + "os": "linux", + "arch": "amd64", + "install_type": BinaryUpdateInstallType, + "url": 
serverArtifactURL(r), + "sha256": "adc549d9e66ef64a507dd6880590d31309e16a3be965a92d849edd103cfb1815", + "size_bytes": len(artifactBody), + }, + }, + }) + case r.Method == http.MethodPost && strings.HasSuffix(r.URL.Path, "/updates/status"): + w.WriteHeader(http.StatusOK) + _, _ = w.Write([]byte(`{"node_update_status":{"id":"status-1"}}`)) + case r.Method == http.MethodGet && r.URL.Path == "/artifact.tar": + _, _ = w.Write(artifactBody) + default: + t.Fatalf("unexpected request %s %s", r.Method, r.URL.String()) + } + })) + defer server.Close() + result, err := (DockerManager{}).ApplyHostAgentUpdate(context.Background(), HostAgentUpdateRequest{ + BackendURL: server.URL, + ClusterID: "cluster-1", + StateDir: dir, + CurrentVersion: "0.1.0-host-old", + BinaryPath: binaryPath, + }) + if err != nil { + t.Fatalf("apply host-agent update: %v", err) + } + if !result.Replaced || !result.RestartNeeded { + t.Fatalf("result = %+v", result) + } + payload, err := os.ReadFile(binaryPath) + if err != nil || string(payload) != string(artifactBody) { + t.Fatalf("binary payload = %q, %v", payload, err) + } + updateState, err := loadUpdateState(dir, HostAgentUpdateProduct) + if err != nil { + t.Fatalf("load update state: %v", err) + } + if updateState.Product != HostAgentUpdateProduct || updateState.CurrentVersion != "0.1.0-host-new" { + t.Fatalf("update state = %+v", updateState) + } +} + +func TestUpdateStateIsProductScoped(t *testing.T) { + dir := t.TempDir() + if err := saveUpdateState(dir, UpdateState{Product: DefaultUpdateProduct, CurrentVersion: "node-v"}); err != nil { + t.Fatalf("save node state: %v", err) + } + if err := saveUpdateState(dir, UpdateState{Product: HostAgentUpdateProduct, CurrentVersion: "host-v"}); err != nil { + t.Fatalf("save host state: %v", err) + } + nodeState, err := loadUpdateState(dir, DefaultUpdateProduct) + if err != nil { + t.Fatalf("load node state: %v", err) + } + hostState, err := loadUpdateState(dir, HostAgentUpdateProduct) + if err != nil { + 
t.Fatalf("load host state: %v", err) + } + if nodeState.CurrentVersion != "node-v" || hostState.CurrentVersion != "host-v" { + t.Fatalf("states overlapped: node=%+v host=%+v", nodeState, hostState) + } +} + +func TestArtifactImageDerivesDockerTagFromProductAndVersion(t *testing.T) { + got := artifactImage(ReleaseArtifact{ + Product: "rap-node-agent", + Version: "0.2.77", + InstallType: DefaultUpdateInstallType, + }, "rap-node-agent:old") + if got != "rap-node-agent:0.2.77" { + t.Fatalf("expected versioned docker image, got %q", got) + } +} + +func serverArtifactURL(r *http.Request) string { + scheme := "http" + if r.TLS != nil { + scheme = "https" + } + return fmt.Sprintf("%s://%s/artifact.tar", scheme, r.Host) +} + +func dockerInspectFixture(_ string) string { + return `[ + { + "Id": "old-container", + "Image": "sha256:oldimage", + "Config": { + "Image": "rap-node-agent:test-old", + "Env": [ + "RAP_BACKEND_URL=http://control/api/v1", + "RAP_CLUSTER_ID=cluster-1", + "RAP_NODE_NAME=node-a", + "RAP_NODE_STATE_DIR=/var/lib/rap-node-agent", + "RAP_HEARTBEAT_INTERVAL_SECONDS=15", + "RAP_ENROLLMENT_POLL_INTERVAL_SECONDS=5", + "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS=0", + "RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true", + "RAP_MESH_LISTEN_ADDR=:19131" + ] + }, + "HostConfig": { + "NetworkMode": "host", + "RestartPolicy": {"Name": "unless-stopped"} + }, + "Mounts": [ + {"Source": "/var/lib/rap/nodes/node-a", "Destination": "/var/lib/rap-node-agent"} + ], + "State": {"Running": true} + } +]` +} + +func dockerInspectFixtureWithVPNGatewayRuntime() string { + return `[ + { + "Id": "old-container", + "Image": "sha256:oldimage", + "Config": { + "Image": "rap-node-agent:test-old", + "Env": [ + "RAP_BACKEND_URL=http://control/api/v1", + "RAP_CLUSTER_ID=cluster-1", + "RAP_NODE_NAME=node-a", + "RAP_NODE_STATE_DIR=/var/lib/rap-node-agent", + "RAP_HEARTBEAT_INTERVAL_SECONDS=15", + "RAP_ENROLLMENT_POLL_INTERVAL_SECONDS=5", + "RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS=0", + 
"RAP_MESH_SYNTHETIC_RUNTIME_ENABLED=true", + "RAP_MESH_LISTEN_ADDR=:19131" + ] + }, + "HostConfig": { + "NetworkMode": "host", + "Privileged": true, + "CapAdd": ["NET_ADMIN"], + "Devices": [ + {"PathOnHost": "/dev/net/tun", "PathInContainer": "/dev/net/tun", "CgroupPermissions": "rwm"} + ], + "RestartPolicy": {"Name": "unless-stopped"} + }, + "Mounts": [ + {"Source": "/var/lib/rap/nodes/node-a", "Destination": "/var/lib/rap-node-agent"} + ], + "State": {"Running": true} + } +]` +} diff --git a/agents/rap-node-agent/internal/hostagent/windows.go b/agents/rap-node-agent/internal/hostagent/windows.go new file mode 100644 index 0000000..f507ed8 --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/windows.go @@ -0,0 +1,368 @@ +package hostagent + +import ( + "context" + "fmt" + "io" + "os" + "path/filepath" + "runtime" + "strings" +) + +const ( + DefaultWindowsInstallDir = `C:\Program Files\RAP` + DefaultWindowsStateRoot = `C:\ProgramData\RAP\nodes` +) + +type WindowsInstallConfig struct { + RuntimeConfig RuntimeConfig + NodeID string + InstallDir string + StartupMode string + ArtifactURLs []string + ArtifactSHA256 string + ArtifactSizeBytes int64 + Replace bool + DryRun bool + AutoUpdateEnabled bool + AutoUpdateCurrentVersion string + AutoUpdateChannel string + AutoUpdateIntervalSeconds int + AutoUpdateInitialDelaySeconds int + AutoUpdateHealthTimeoutSeconds int + HostAgentSourcePath string +} + +type WindowsInstallResult struct { + NodeName string + InstallDir string + StateDir string + NodeAgentPath string + WrapperPath string + StartupMode string + TaskName string + HostAgentPath string + UpdaterTaskName string + Downloaded bool + Started bool + UpdaterStarted bool + AdminFallback bool +} + +type WindowsManager struct { + Runner CommandRunner +} + +func WindowsInstallConfigFromProfile(profile WindowsInstallProfile) WindowsInstallConfig { + stateDir := firstNonEmpty(profile.StateDir, filepath.Join(DefaultWindowsStateRoot, safeUnitSlug(profile.NodeName))) + 
return WindowsInstallConfig{ + RuntimeConfig: RuntimeConfig{ + BackendURL: profile.BackendURL, + ClusterID: profile.ClusterID, + JoinToken: profile.JoinToken, + NodeName: profile.NodeName, + StateDir: stateDir, + WorkloadSupervisionEnabled: profile.WorkloadSupervisionEnabled, + MeshSyntheticRuntimeEnabled: profile.MeshSyntheticRuntimeEnabled, + MeshProductionForwardingEnabled: profile.MeshProductionForwardingEnabled, + MeshListenAddr: profile.MeshListenAddr, + MeshListenPortMode: profile.MeshListenPortMode, + MeshListenAutoPortStart: profile.MeshListenAutoPortStart, + MeshListenAutoPortEnd: profile.MeshListenAutoPortEnd, + MeshAdvertiseEndpoint: profile.MeshAdvertiseEndpoint, + MeshAdvertiseEndpointsJSON: string(profile.MeshAdvertiseEndpointsJSON), + MeshAdvertiseTransport: profile.MeshAdvertiseTransport, + MeshConnectivityMode: profile.MeshConnectivityMode, + MeshNATType: profile.MeshNATType, + MeshRegion: profile.MeshRegion, + HeartbeatIntervalSeconds: profile.HeartbeatIntervalSeconds, + EnrollmentPollIntervalSeconds: profile.EnrollmentPollIntervalSeconds, + EnrollmentPollTimeoutSeconds: profile.EnrollmentPollTimeoutSeconds, + ProductionObservationSinkCap: profile.ProductionObservationSinkCapacity, + }, + InstallDir: firstNonEmpty(profile.InstallDir, filepath.Join(DefaultWindowsInstallDir, safeUnitSlug(profile.NodeName))), + StartupMode: firstNonEmpty(profile.StartupMode, "auto"), + ArtifactURLs: binaryArtifactURLs(profile), + ArtifactSHA256: binaryArtifactSHA256(profile), + ArtifactSizeBytes: binaryArtifactSizeBytes(profile), + Replace: true, + AutoUpdateEnabled: true, + } +} + +func (m WindowsManager) Install(ctx context.Context, cfg WindowsInstallConfig) (WindowsInstallResult, error) { + cfg.NodeID = strings.TrimSpace(cfg.NodeID) + if strings.TrimSpace(cfg.RuntimeConfig.StateDir) == "" { + cfg.RuntimeConfig.StateDir = filepath.Join(DefaultWindowsStateRoot, safeUnitSlug(cfg.RuntimeConfig.NodeName)) + } + cfg.RuntimeConfig.Replace = cfg.Replace + 
cfg.RuntimeConfig = cfg.RuntimeConfig.Normalize() + if err := cfg.RuntimeConfig.ValidateInstall(); err != nil { + return WindowsInstallResult{}, err + } + cfg.StartupMode = strings.ToLower(firstNonEmpty(cfg.StartupMode, "auto")) + noAdminPreferred := cfg.StartupMode == "user-task" + cfg.InstallDir = firstNonEmpty(cfg.InstallDir, defaultWindowsInstallDir(cfg.RuntimeConfig.NodeName, noAdminPreferred)) + cfg.StartupMode = strings.ToLower(firstNonEmpty(cfg.StartupMode, "auto")) + if noAdminPreferred && strings.HasPrefix(strings.ToLower(cfg.RuntimeConfig.StateDir), strings.ToLower(DefaultWindowsStateRoot)) { + cfg.RuntimeConfig.StateDir = defaultWindowsStateDir(cfg.RuntimeConfig.NodeName, true) + } + result := WindowsInstallResult{ + NodeName: cfg.RuntimeConfig.NodeName, + InstallDir: cfg.InstallDir, + StateDir: cfg.RuntimeConfig.StateDir, + NodeAgentPath: filepath.Join(cfg.InstallDir, "rap-node-agent.exe"), + WrapperPath: filepath.Join(cfg.InstallDir, "rap-node-agent-run.cmd"), + StartupMode: cfg.StartupMode, + TaskName: "RAP Node Agent " + safeUnitSlug(cfg.RuntimeConfig.NodeName), + } + if cfg.DryRun { + return result, nil + } + if runtime.GOOS != "windows" { + return result, fmt.Errorf("windows install is only supported on windows hosts") + } + if err := os.MkdirAll(cfg.InstallDir, 0o755); err != nil { + if cfg.StartupMode != "auto" || !isAccessDenied(err) { + return result, err + } + cfg.InstallDir = defaultWindowsInstallDir(cfg.RuntimeConfig.NodeName, true) + cfg.RuntimeConfig.StateDir = defaultWindowsStateDir(cfg.RuntimeConfig.NodeName, true) + result.InstallDir = cfg.InstallDir + result.StateDir = cfg.RuntimeConfig.StateDir + result.NodeAgentPath = filepath.Join(cfg.InstallDir, "rap-node-agent.exe") + result.WrapperPath = filepath.Join(cfg.InstallDir, "rap-node-agent-run.cmd") + if err := os.MkdirAll(cfg.InstallDir, 0o755); err != nil { + return result, err + } + result.AdminFallback = true + } + if err := os.MkdirAll(cfg.RuntimeConfig.StateDir, 0o700); err != 
nil { + return result, err + } + if len(cfg.ArtifactURLs) > 0 && (cfg.Replace || !fileExists(result.NodeAgentPath)) { + m.stopExistingNodeAgent(ctx, result.TaskName, result.NodeAgentPath) + path, err := downloadFirstArtifact(ctx, cfg.ArtifactURLs, cfg.ArtifactSHA256, cfg.ArtifactSizeBytes) + if err != nil { + return result, err + } + defer os.Remove(path) + if err := copyFile(path, result.NodeAgentPath, 0o755); err != nil { + m.stopExistingNodeAgent(ctx, result.TaskName, result.NodeAgentPath) + if retryErr := copyFile(path, result.NodeAgentPath, 0o755); retryErr == nil { + result.Downloaded = true + goto binaryReady + } + return result, err + } + result.Downloaded = true + } +binaryReady: + if !fileExists(result.NodeAgentPath) { + return result, fmt.Errorf("node-agent binary is missing at %s and no artifact was available", result.NodeAgentPath) + } + if err := os.WriteFile(filepath.Join(cfg.InstallDir, "rap-node-agent.env.cmd"), []byte(windowsEnvScript(cfg.RuntimeConfig)), 0o600); err != nil { + return result, err + } + if err := os.WriteFile(result.WrapperPath, []byte(windowsWrapperScript(result.NodeAgentPath, filepath.Join(cfg.InstallDir, "rap-node-agent.env.cmd"))), 0o755); err != nil { + return result, err + } + logPath := filepath.Join(cfg.RuntimeConfig.StateDir, "rap-node-agent.log") + started, fallback, mode, err := m.installStartupTask(ctx, result.TaskName, result.WrapperPath, logPath, cfg.StartupMode) + if err != nil { + return result, err + } + result.Started = started + result.AdminFallback = fallback + result.StartupMode = mode + result, err = installWindowsHostAgentUpdater(ctx, m, result, cfg) + if err != nil { + return result, err + } + return result, nil +} + +func (m WindowsManager) stopExistingNodeAgent(ctx context.Context, taskName, nodeAgentPath string) { + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + _, _ = runner.Run(ctx, "schtasks", "/End", "/TN", taskName) + escapedPath := strings.ReplaceAll(nodeAgentPath, `'`, `''`) 
+ _, _ = runner.Run(ctx, "powershell", "-NoProfile", "-ExecutionPolicy", "Bypass", "-Command", + `Get-Process rap-node-agent -ErrorAction SilentlyContinue | Where-Object { $_.Path -eq '`+escapedPath+`' } | Stop-Process -Force -ErrorAction SilentlyContinue`) +} + +func (m WindowsManager) installStartupTask(ctx context.Context, taskName, wrapperPath, logPath, mode string) (bool, bool, string, error) { + if mode == "none" { + return false, false, mode, nil + } + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + if mode == "auto" || mode == "system-task" { + _, err := runner.Run(ctx, "schtasks", "/Create", "/TN", taskName, "/SC", "ONSTART", "/RU", "SYSTEM", "/RL", "HIGHEST", "/TR", windowsTaskAction(wrapperPath, logPath), "/F") + if err == nil { + _, _ = runner.Run(ctx, "schtasks", "/Run", "/TN", taskName) + return true, false, "system-task", nil + } + if mode == "system-task" { + return false, false, mode, err + } + } + _, err := runner.Run(ctx, "schtasks", "/Create", "/TN", taskName, "/SC", "ONLOGON", "/TR", windowsTaskAction(wrapperPath, logPath), "/F") + if err != nil { + return false, mode == "auto", "user-task", err + } + _, _ = runner.Run(ctx, "schtasks", "/Run", "/TN", taskName) + return true, mode == "auto", "user-task", nil +} + +func windowsTaskAction(wrapperPath, logPath string) string { + return `cmd.exe /c ""` + wrapperPath + `" >> "` + logPath + `" 2>&1"` +} + +func windowsEnvScript(cfg RuntimeConfig) string { + lines := []string{"@echo off"} + for _, env := range NodeAgentEnv(cfg) { + key, value, ok := strings.Cut(env, "=") + if !ok { + continue + } + lines = append(lines, "set "+key+"="+value) + } + return strings.Join(lines, "\r\n") + "\r\n" +} + +func windowsWrapperScript(nodeAgentPath, envPath string) string { + return strings.Join([]string{ + "@echo off", + `call "` + envPath + `"`, + `"` + nodeAgentPath + `"`, + }, "\r\n") + "\r\n" +} + +func binaryArtifactURLs(profile WindowsInstallProfile) []string { + if 
profile.NodeAgentArtifact != nil && len(profile.NodeAgentArtifact.URLs) > 0 { + return append([]string(nil), profile.NodeAgentArtifact.URLs...) + } + if profile.NodeAgentArtifact == nil || strings.TrimSpace(profile.NodeAgentArtifact.FileName) == "" { + return nil + } + out := []string{} + fileName := strings.TrimLeft(strings.TrimSpace(profile.NodeAgentArtifact.FileName), "/") + for _, endpoint := range profile.ArtifactEndpoints { + if trimmed := strings.TrimRight(strings.TrimSpace(endpoint), "/"); trimmed != "" { + out = append(out, trimmed+"/"+fileName) + } + } + return out +} + +func binaryArtifactSHA256(profile WindowsInstallProfile) string { + if profile.NodeAgentArtifact == nil { + return "" + } + return strings.TrimSpace(profile.NodeAgentArtifact.SHA256) +} + +func binaryArtifactSizeBytes(profile WindowsInstallProfile) int64 { + if profile.NodeAgentArtifact == nil { + return 0 + } + return profile.NodeAgentArtifact.SizeBytes +} + +func fileExists(path string) bool { + _, err := os.Stat(path) + return err == nil +} + +func copyFile(source, target string, mode os.FileMode) error { + src, err := os.Open(source) + if err != nil { + return err + } + defer src.Close() + if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil { + return err + } + tmp := target + ".tmp" + dst, err := os.OpenFile(tmp, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, mode) + if err != nil { + return err + } + if _, err := io.Copy(dst, src); err != nil { + _ = dst.Close() + _ = os.Remove(tmp) + return err + } + if err := dst.Close(); err != nil { + _ = os.Remove(tmp) + return err + } + if err := replaceFile(tmp, target); err != nil { + _ = os.Remove(tmp) + return err + } + return nil +} + +func replaceFile(tmp, target string) error { + if runtime.GOOS != "windows" { + return os.Rename(tmp, target) + } + backup := target + ".bak" + _ = os.Remove(backup) + if fileExists(target) { + if err := os.Rename(target, backup); err != nil { + return err + } + } + if err := os.Rename(tmp, target); err != 
nil { + if fileExists(backup) { + _ = os.Rename(backup, target) + } + return err + } + _ = os.Remove(backup) + return nil +} + +func defaultWindowsInstallDir(nodeName string, userMode bool) string { + slug := safeUnitSlug(nodeName) + if userMode { + if base := strings.TrimSpace(os.Getenv("LOCALAPPDATA")); base != "" { + return filepath.Join(base, "RAP", slug) + } + if base := strings.TrimSpace(os.Getenv("USERPROFILE")); base != "" { + return filepath.Join(base, "AppData", "Local", "RAP", slug) + } + } + return filepath.Join(DefaultWindowsInstallDir, slug) +} + +func defaultWindowsStateDir(nodeName string, userMode bool) string { + slug := safeUnitSlug(nodeName) + if userMode { + if base := strings.TrimSpace(os.Getenv("LOCALAPPDATA")); base != "" { + return filepath.Join(base, "RAP", "nodes", slug) + } + if base := strings.TrimSpace(os.Getenv("USERPROFILE")); base != "" { + return filepath.Join(base, "AppData", "Local", "RAP", "nodes", slug) + } + } + return filepath.Join(DefaultWindowsStateRoot, slug) +} + +func isAccessDenied(err error) bool { + if err == nil { + return false + } + value := strings.ToLower(err.Error()) + return strings.Contains(value, "access is denied") || + strings.Contains(value, "permission denied") || + strings.Contains(value, "operation not permitted") +} diff --git a/agents/rap-node-agent/internal/hostagent/windows_update.go b/agents/rap-node-agent/internal/hostagent/windows_update.go new file mode 100644 index 0000000..3f11dbb --- /dev/null +++ b/agents/rap-node-agent/internal/hostagent/windows_update.go @@ -0,0 +1,337 @@ +package hostagent + +import ( + "context" + "errors" + "fmt" + "os" + "path/filepath" + "strings" + "time" +) + +func (m WindowsManager) ApplyUpdate(ctx context.Context, req UpdateRequest) (UpdateResult, error) { + if strings.TrimSpace(req.InstallType) == "" || req.InstallType == DefaultUpdateInstallType { + req.InstallType = WindowsUpdateInstallType + } + req.OS = firstNonEmpty(req.OS, "windows") + req.Arch = 
firstNonEmpty(req.Arch, "amd64") + req = req.Normalize() + var err error + req, err = resolveUpdateRequest(req) + if err != nil { + return UpdateResult{}, err + } + runner := m.Runner + if runner == nil { + runner = ExecRunner{} + } + plan, err := FetchNodeUpdatePlan(ctx, req) + if err != nil { + return UpdateResult{}, err + } + if plan.HealthWindowSec > 0 && req.HealthTimeout == 30*time.Second { + req.HealthTimeout = time.Duration(plan.HealthWindowSec) * time.Second + } + result := UpdateResult{ + Action: plan.Action, + Reason: plan.Reason, + TargetVersion: plan.TargetVersion, + ContainerName: req.WindowsTaskName, + NewImage: req.BinaryPath, + } + if plan.Action != "update" { + if !req.DryRun { + status := statusFromNoopPlan(req, plan) + if status.Payload == nil { + status.Payload = map[string]any{} + } + status.Payload["task"] = req.WindowsTaskName + status.Payload["binary_path"] = req.BinaryPath + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, status) + } + return result, nil + } + if plan.ProductionForwarding && !req.AllowProductionMesh { + err := errors.New("refusing update plan with production forwarding enabled") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact == nil { + err := errors.New("update plan has no artifact") + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if plan.Artifact.InstallType != "" && plan.Artifact.InstallType != WindowsUpdateInstallType { + err := fmt.Errorf("unsupported update artifact install type %q", plan.Artifact.InstallType) + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "preflight", "failed", err)) + return result, err + } + if req.DryRun { + return result, nil + } + _ = ReportNodeUpdateStatus(ctx, 
req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "planned", + Status: "accepted", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"strategy": plan.Strategy, "reason": plan.Reason, "task": req.WindowsTaskName}, + }) + urls := artifactURLsForBackend(*plan.Artifact, req.BackendURL) + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "download", + Status: "started", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"artifact_url": plan.Artifact.URL, "artifact_urls": urls, "binary_path": req.BinaryPath}, + }) + path, err := downloadFirstArtifact(ctx, urls, plan.Artifact.SHA256, plan.Artifact.SizeBytes) + if err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "download", "failed", err)) + return result, err + } + defer os.Remove(path) + m.stopExistingNodeAgent(ctx, req.WindowsTaskName, req.BinaryPath) + if err := copyFile(path, req.BinaryPath, 0o755); err != nil { + m.stopExistingNodeAgent(ctx, req.WindowsTaskName, req.BinaryPath) + if retryErr := copyFile(path, req.BinaryPath, 0o755); retryErr != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "apply", "failed", err)) + return result, err + } + } + result.Replaced = true + if _, err := runner.Run(ctx, "schtasks", "/Run", "/TN", req.WindowsTaskName); err != nil { + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, statusFromError(req, plan, "restart", "failed", err)) + return result, err + } + _ = ReportNodeUpdateStatus(ctx, req.BackendURL, req.ClusterID, req.NodeID, NodeUpdateStatusRequest{ + Product: 
req.Product, + CurrentVersion: req.CurrentVersion, + TargetVersion: plan.TargetVersion, + Phase: "health_check", + Status: "succeeded", + AttemptID: updateAttemptID(plan), + ObservedAt: time.Now().UTC(), + Payload: map[string]any{"task": req.WindowsTaskName, "binary_path": req.BinaryPath}, + }) + _ = saveUpdateState(req.StateDir, UpdateState{ + Product: req.Product, + CurrentVersion: plan.TargetVersion, + TargetVersion: plan.TargetVersion, + Image: req.BinaryPath, + UpdatedAt: time.Now().UTC(), + }) + return result, nil +} + +func (m WindowsManager) RunUpdateLoop(ctx context.Context, cfg UpdateLoopConfig) error { + req := cfg.Request + if strings.TrimSpace(req.InstallType) == "" || req.InstallType == DefaultUpdateInstallType { + req.InstallType = WindowsUpdateInstallType + } + req.OS = firstNonEmpty(req.OS, "windows") + req.Arch = firstNonEmpty(req.Arch, "amd64") + req = req.Normalize() + if err := req.Validate(); err != nil { + return err + } + if cfg.Interval == 0 { + cfg.Interval = time.Hour + } + if cfg.Interval < 0 { + return errors.New("update loop interval must not be negative") + } + if cfg.InitialDelay < 0 { + return errors.New("update loop initial delay must not be negative") + } + if cfg.Jitter < 0 || cfg.Jitter > 1 { + return errors.New("update loop jitter must be between 0 and 1") + } + logf := cfg.Logf + if logf == nil { + logf = func(string, ...any) {} + } + if cfg.InitialDelay > 0 { + if err := sleepContext(ctx, jitteredDuration(cfg.InitialDelay, cfg.Jitter)); err != nil { + return err + } + } + runs := 0 + lastTriggerGeneration := currentUpdateTriggerGeneration(req.StateDir) + for { + runs++ + result, err := m.ApplyUpdate(ctx, req) + if err != nil { + if errors.Is(err, ErrNodeIdentityNotReady) { + logf("windows_update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, req.StateDir) + if cfg.MaxRuns > 0 && runs >= cfg.MaxRuns { + return nil + } + if err := sleepContext(ctx, jitteredDuration(cfg.Interval, cfg.Jitter)); err != nil { + 
return err + } + continue + } + logf("windows_update_loop run=%d status=failed error=%v", runs, err) + if cfg.StopOnError { + return err + } + } else { + logf("windows_update_loop run=%d action=%s reason=%s target=%s task=%s replaced=%t", + runs, + result.Action, + result.Reason, + result.TargetVersion, + result.ContainerName, + result.Replaced, + ) + if result.Action == "update" && result.TargetVersion != "" && !result.RolledBack { + req.CurrentVersion = result.TargetVersion + } + } + if cfg.HostAgentUpdateEnabled { + hostReq := cfg.HostAgentUpdateRequest + hostReq.BackendURL = firstNonEmpty(hostReq.BackendURL, req.BackendURL) + hostReq.ClusterID = firstNonEmpty(hostReq.ClusterID, req.ClusterID) + hostReq.NodeID = firstNonEmpty(hostReq.NodeID, req.NodeID) + hostReq.StateDir = firstNonEmpty(hostReq.StateDir, req.StateDir) + hostReq.Channel = firstNonEmpty(hostReq.Channel, req.Channel) + hostReq.OS = firstNonEmpty(hostReq.OS, "windows") + hostReq.Arch = firstNonEmpty(hostReq.Arch, "amd64") + hostReq.InstallType = firstNonEmpty(hostReq.InstallType, "windows_binary") + hostResult, hostErr := (DockerManager{}).ApplyHostAgentUpdate(ctx, hostReq) + if hostErr != nil { + if errors.Is(hostErr, ErrNodeIdentityNotReady) { + logf("windows_host_agent_update_loop run=%d status=waiting_for_node_identity state_dir=%s", runs, hostReq.StateDir) + } else { + logf("windows_host_agent_update_loop run=%d status=failed error=%v", runs, hostErr) + if cfg.StopOnError { + return hostErr + } + } + } else { + logf("windows_host_agent_update_loop run=%d action=%s reason=%s target=%s binary=%s replaced=%t restart_needed=%t", + runs, + hostResult.Action, + hostResult.Reason, + hostResult.TargetVersion, + hostResult.NewImage, + hostResult.Replaced, + hostResult.RestartNeeded, + ) + if hostResult.Action == "update" && hostResult.TargetVersion != "" && !hostResult.RolledBack { + cfg.HostAgentUpdateRequest.CurrentVersion = hostResult.TargetVersion + } + } + } + if cfg.MaxRuns > 0 && runs >= 
cfg.MaxRuns { + return nil + } + if err := sleepUntilUpdateIntervalOrTrigger(ctx, req.StateDir, jitteredDuration(cfg.Interval, cfg.Jitter), &lastTriggerGeneration); err != nil { + return err + } + } +} + +func installWindowsHostAgentUpdater(ctx context.Context, m WindowsManager, result WindowsInstallResult, cfg WindowsInstallConfig) (WindowsInstallResult, error) { + if !cfg.AutoUpdateEnabled || strings.EqualFold(result.StartupMode, "none") { + return result, nil + } + if cfg.AutoUpdateCurrentVersion == "" || (cfg.Replace && !result.Downloaded) { + cfg.AutoUpdateCurrentVersion = "0.0.0" + } + hostAgentPath := filepath.Join(result.InstallDir, "rap-host-agent.exe") + if err := installHostAgentBinary(cfg.HostAgentSourcePath, hostAgentPath); err != nil { + return result, err + } + wrapperPath := filepath.Join(result.InstallDir, "rap-host-agent-update.cmd") + logPath := filepath.Join(result.StateDir, "rap-host-agent-update.log") + taskName := "RAP Host Agent Updater " + safeUnitSlug(result.NodeName) + script := windowsHostAgentUpdateScript(hostAgentPath, cfg, result) + if err := os.WriteFile(wrapperPath, []byte(script), 0o755); err != nil { + return result, err + } + started, fallback, mode, err := m.installStartupTask(ctx, taskName, wrapperPath, logPath, cfg.StartupMode) + if err != nil { + return result, err + } + result.HostAgentPath = hostAgentPath + result.UpdaterTaskName = taskName + result.UpdaterStarted = started + if fallback { + result.AdminFallback = true + } + if mode != "" && mode != result.StartupMode { + result.StartupMode = mode + } + return result, nil +} + +func windowsHostAgentUpdateScript(hostAgentPath string, cfg WindowsInstallConfig, result WindowsInstallResult) string { + currentVersion := firstNonEmpty(cfg.AutoUpdateCurrentVersion, "0.0.0") + interval := cfg.AutoUpdateIntervalSeconds + if interval == 0 { + interval = 21600 + } + initialDelay := cfg.AutoUpdateInitialDelaySeconds + if initialDelay == 0 { + initialDelay = 15 + } + healthTimeout := 
cfg.AutoUpdateHealthTimeoutSeconds + if healthTimeout == 0 { + healthTimeout = 30 + } + updateLoopArgs := []string{ + `"` + hostAgentPath + `"`, + "update-loop", + "--backend-url", `"` + cfg.RuntimeConfig.BackendURL + `"`, + "--cluster-id", `"` + cfg.RuntimeConfig.ClusterID + `"`, + "--state-dir", `"` + result.StateDir + `"`, + "--current-version", currentVersion, + "--os", "windows", + "--arch", "amd64", + "--install-type", WindowsUpdateInstallType, + "--binary-path", `"` + result.NodeAgentPath + `"`, + "--windows-task-name", `"` + result.TaskName + `"`, + "--health-timeout-seconds", fmt.Sprintf("%d", healthTimeout), + "--interval-seconds", fmt.Sprintf("%d", interval), + "--initial-delay-seconds", "0", + "--host-agent-update-status-enabled", + "--host-agent-current-version", currentVersion, + "--host-agent-binary-path", `"` + hostAgentPath + `"`, + } + if strings.TrimSpace(cfg.NodeID) != "" { + updateLoopArgs = append(updateLoopArgs, "--node-id", `"`+strings.TrimSpace(cfg.NodeID)+`"`) + } + if strings.TrimSpace(cfg.AutoUpdateChannel) != "" { + updateLoopArgs = append(updateLoopArgs, "--channel", strings.TrimSpace(cfg.AutoUpdateChannel)) + } + lines := []string{ + "@echo off", + "setlocal", + "set RAP_HOST_AGENT=" + `"` + hostAgentPath + `"`, + "set RAP_HOST_AGENT_NEXT=" + `"` + hostAgentPath + `.next"`, + } + if initialDelay > 0 { + lines = append(lines, "timeout /t "+fmt.Sprintf("%d", initialDelay)+" /nobreak >NUL") + } + lines = append(lines, []string{ + ":loop", + "if exist %RAP_HOST_AGENT_NEXT% (", + " copy /Y %RAP_HOST_AGENT_NEXT% %RAP_HOST_AGENT% >NUL", + " if not errorlevel 1 del /F /Q %RAP_HOST_AGENT_NEXT%", + ")", + strings.Join(updateLoopArgs, " "), + "timeout /t " + fmt.Sprintf("%d", interval) + " /nobreak >NUL", + "goto loop", + "endlocal", + "rem initial-delay-seconds " + fmt.Sprintf("%d", initialDelay), + }...) 
+ return strings.Join(lines, "\r\n") + "\r\n" +} diff --git a/agents/rap-node-agent/internal/mesh/contracts.go b/agents/rap-node-agent/internal/mesh/contracts.go index 8e9e44b..3b1dbf6 100644 --- a/agents/rap-node-agent/internal/mesh/contracts.go +++ b/agents/rap-node-agent/internal/mesh/contracts.go @@ -63,10 +63,12 @@ const ( ProductionChannelVPNPacket = "vpn_packet" ProductionMessageVPNPacketBatch = "vpn.packet_batch" FabricServiceClassVPNPackets = "vpn_packets" + FabricServiceClassRemoteWorkspace = "remote_workspace" FabricServiceChannelBulk = "bulk" FabricServiceChannelControl = "control" FabricServiceChannelInteractive = "interactive" FabricServiceChannelReliable = "reliable" + FabricServiceChannelDroppable = "droppable" MaxProductionEnvelopePayloadBytes = 4096 MaxProductionVPNPacketPayloadBytes = 256 * 1024 MaxProductionEnvelopeFutureSkew = time.Minute diff --git a/agents/rap-node-agent/internal/mesh/endpoint_candidate_scoring.go b/agents/rap-node-agent/internal/mesh/endpoint_candidate_scoring.go index a80b76b..a25d930 100644 --- a/agents/rap-node-agent/internal/mesh/endpoint_candidate_scoring.go +++ b/agents/rap-node-agent/internal/mesh/endpoint_candidate_scoring.go @@ -59,9 +59,9 @@ func scorePeerEndpointCandidate(candidate PeerEndpointCandidate, opts EndpointCa reasons := []string{"base"} switch candidate.Transport { - case "direct_tcp_tls": + case "direct_tcp_tls", "direct_http", "direct_https": score += 35 - reasons = append(reasons, "transport:direct_tcp_tls") + reasons = append(reasons, "transport:direct") case "wss": score += 25 reasons = append(reasons, "transport:wss") diff --git a/agents/rap-node-agent/internal/mesh/peer_cache.go b/agents/rap-node-agent/internal/mesh/peer_cache.go index 1f97f38..ab1abcc 100644 --- a/agents/rap-node-agent/internal/mesh/peer_cache.go +++ b/agents/rap-node-agent/internal/mesh/peer_cache.go @@ -37,27 +37,28 @@ type PeerCacheSnapshot struct { } type PeerCacheEntry struct { - NodeID string `json:"node_id"` - RouteIDs 
[]string `json:"route_ids,omitempty"` - Endpoint string `json:"endpoint,omitempty"` - EndpointCount int `json:"endpoint_count"` - CandidateCount int `json:"candidate_count"` - ConnectivityModes []string `json:"connectivity_modes,omitempty"` - RecoverySeed bool `json:"recovery_seed"` - Warm bool `json:"warm"` - WarmReason string `json:"warm_reason,omitempty"` - BestCandidateID string `json:"best_candidate_id,omitempty"` - BestCandidateAddr string `json:"best_candidate_addr,omitempty"` - BestTransport string `json:"best_transport,omitempty"` - BestReachability string `json:"best_reachability,omitempty"` - BestConnectivity string `json:"best_connectivity,omitempty"` - BestNATType string `json:"best_nat_type,omitempty"` - BestPolicyTags []string `json:"best_policy_tags,omitempty"` - BestCandidateScore int `json:"best_candidate_score,omitempty"` - RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` - RelayNodeID string `json:"relay_node_id,omitempty"` - RelayEndpoint string `json:"relay_endpoint,omitempty"` - RelayControl bool `json:"relay_control"` + NodeID string `json:"node_id"` + RouteIDs []string `json:"route_ids,omitempty"` + Endpoint string `json:"endpoint,omitempty"` + EndpointCount int `json:"endpoint_count"` + CandidateCount int `json:"candidate_count"` + ConnectivityModes []string `json:"connectivity_modes,omitempty"` + RecoverySeed bool `json:"recovery_seed"` + Warm bool `json:"warm"` + WarmReason string `json:"warm_reason,omitempty"` + BestCandidateID string `json:"best_candidate_id,omitempty"` + BestCandidateAddr string `json:"best_candidate_addr,omitempty"` + BestTransport string `json:"best_transport,omitempty"` + BestReachability string `json:"best_reachability,omitempty"` + BestConnectivity string `json:"best_connectivity,omitempty"` + BestNATType string `json:"best_nat_type,omitempty"` + BestPolicyTags []string `json:"best_policy_tags,omitempty"` + BestCandidateScore int `json:"best_candidate_score,omitempty"` + EndpointCandidates 
[]PeerEndpointCandidate `json:"endpoint_candidates,omitempty"` + RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` + RelayNodeID string `json:"relay_node_id,omitempty"` + RelayEndpoint string `json:"relay_endpoint,omitempty"` + RelayControl bool `json:"relay_control"` } type peerCacheBuildEntry struct { @@ -117,6 +118,10 @@ func NewPeerCache(cfg PeerCacheConfig) *PeerCache { MaxVerificationAge: time.Hour, }) if len(scored) > 0 { + entry.EndpointCandidates = make([]PeerEndpointCandidate, 0, len(scored)) + for _, scoredCandidate := range scored { + entry.EndpointCandidates = append(entry.EndpointCandidates, scoredCandidate.Candidate) + } entry.BestCandidateID = scored[0].Candidate.EndpointID entry.BestCandidateAddr = scored[0].Candidate.Address entry.BestTransport = scored[0].Candidate.Transport diff --git a/agents/rap-node-agent/internal/mesh/peer_connection_manager.go b/agents/rap-node-agent/internal/mesh/peer_connection_manager.go index 053ae2a..2867313 100644 --- a/agents/rap-node-agent/internal/mesh/peer_connection_manager.go +++ b/agents/rap-node-agent/internal/mesh/peer_connection_manager.go @@ -66,24 +66,44 @@ type PeerConnectionManagerSnapshot struct { } type PeerConnectionProbeResult struct { - NodeID string `json:"node_id"` - LinkStatus string `json:"link_status"` - Action string `json:"action"` - Reason string `json:"reason"` - Endpoint string `json:"endpoint,omitempty"` - ConnectionState PeerConnectionState `json:"connection_state"` - TransportMode string `json:"transport_mode"` - RequiresRendezvous bool `json:"requires_rendezvous"` - RendezvousResolved bool `json:"rendezvous_resolved"` - DirectCandidate bool `json:"direct_candidate"` - RelayCandidate bool `json:"relay_candidate"` - RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` - RelayNodeID string `json:"relay_node_id,omitempty"` - RelayEndpoint string `json:"relay_endpoint,omitempty"` - LatencyMs int `json:"latency_ms,omitempty"` - FailureReason string 
`json:"failure_reason,omitempty"` - StartedAt time.Time `json:"started_at"` - CompletedAt time.Time `json:"completed_at"` + NodeID string `json:"node_id"` + LinkStatus string `json:"link_status"` + Action string `json:"action"` + Reason string `json:"reason"` + Endpoint string `json:"endpoint,omitempty"` + SelectedCandidateID string `json:"selected_candidate_id,omitempty"` + SelectedEndpoint string `json:"selected_endpoint,omitempty"` + ConnectionState PeerConnectionState `json:"connection_state"` + TransportMode string `json:"transport_mode"` + RequiresRendezvous bool `json:"requires_rendezvous"` + RendezvousResolved bool `json:"rendezvous_resolved"` + DirectCandidate bool `json:"direct_candidate"` + RelayCandidate bool `json:"relay_candidate"` + RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` + RelayNodeID string `json:"relay_node_id,omitempty"` + RelayEndpoint string `json:"relay_endpoint,omitempty"` + LatencyMs int `json:"latency_ms,omitempty"` + FailureReason string `json:"failure_reason,omitempty"` + CandidateResults []PeerConnectionCandidateProbeResult `json:"candidate_results,omitempty"` + StartedAt time.Time `json:"started_at"` + CompletedAt time.Time `json:"completed_at"` +} + +type PeerConnectionCandidateProbeResult struct { + CandidateID string `json:"candidate_id,omitempty"` + Endpoint string `json:"endpoint"` + Transport string `json:"transport,omitempty"` + LinkStatus string `json:"link_status"` + LatencyMs int `json:"latency_ms,omitempty"` + FailureReason string `json:"failure_reason,omitempty"` + StartedAt time.Time `json:"started_at"` + CompletedAt time.Time `json:"completed_at"` +} + +type peerConnectionProbeTarget struct { + CandidateID string + Endpoint string + Transport string } func NewPeerConnectionManager(cfg PeerConnectionManagerConfig) *PeerConnectionManager { @@ -137,6 +157,10 @@ func (m *PeerConnectionManager) ProbeOnce(ctx context.Context) PeerConnectionMan RendezvousLeases: rendezvousLeases, Now: startedAt, }) + 
entriesByNode := map[string]PeerCacheEntry{} + for _, entry := range peerSnapshot.Entries { + entriesByNode[entry.NodeID] = entry + } cycle := PeerConnectionManagerCycle{ Mode: recoveryPlan.Mode, StartedAt: startedAt, @@ -150,7 +174,7 @@ func (m *PeerConnectionManager) ProbeOnce(ctx context.Context) PeerConnectionMan Results: make([]PeerConnectionProbeResult, 0, len(intentPlan.Intents)), } for _, intent := range intentPlan.Intents { - result := m.probeIntent(ctx, intent) + result := m.probeIntent(ctx, intent, entriesByNode[intent.NodeID]) cycle.Results = append(cycle.Results, result) switch result.LinkStatus { case PeerConnectionProbeReachable: @@ -200,7 +224,7 @@ func (m *PeerConnectionManager) peerConfigSnapshot() (*PeerCache, []PeerRendezvo return m.peerCache, append([]PeerRendezvousLease{}, m.rendezvousLeases...) } -func (m *PeerConnectionManager) probeIntent(ctx context.Context, intent PeerConnectionIntent) PeerConnectionProbeResult { +func (m *PeerConnectionManager) probeIntent(ctx context.Context, intent PeerConnectionIntent, cacheEntry PeerCacheEntry) PeerConnectionProbeResult { startedAt := normalizedNow(m.now()) result := PeerConnectionProbeResult{ NodeID: intent.NodeID, @@ -254,9 +278,6 @@ func (m *PeerConnectionManager) probeIntent(ctx context.Context, intent PeerConn result.CompletedAt = normalizedNow(m.now()) return result } - m.tracker.BeginProbe(peer, startedAt) - probeCtx, cancel := context.WithTimeout(ctx, m.probeTimeout) - defer cancel() target := PeerIdentity{ ClusterID: m.local.ClusterID, NodeID: intent.NodeID, @@ -264,30 +285,118 @@ func (m *PeerConnectionManager) probeIntent(ctx context.Context, intent PeerConn if intent.RelayCandidate && intent.RelayNodeID != "" { target.NodeID = intent.RelayNodeID } - _, err := NewClient(strings.TrimRight(intent.Endpoint, "/")).withHTTPClient(m.httpClient).SendHealth(probeCtx, NewHealthMessage(m.local, target)) - completedAt := normalizedNow(m.now()) - if err != nil { - result.LinkStatus = 
PeerConnectionProbeUnreachable - result.FailureReason = err.Error() - result.ConnectionState = m.tracker.RecordFailure(intent.NodeID, err.Error(), completedAt) + targets := []peerConnectionProbeTarget{{ + CandidateID: intent.BestCandidateID, + Endpoint: intent.Endpoint, + Transport: intent.Transport, + }} + if intent.DirectCandidate { + targets = peerConnectionProbeTargets(intent, cacheEntry) + } + var lastFailure string + for _, probeTarget := range targets { + probePeer := peer + probePeer.Endpoint = strings.TrimRight(strings.TrimSpace(probeTarget.Endpoint), "/") + probePeer.BestCandidateID = strings.TrimSpace(probeTarget.CandidateID) + probePeer.BestCandidateAddr = probePeer.Endpoint + probePeer.BestTransport = strings.TrimSpace(probeTarget.Transport) + if probePeer.Endpoint == "" { + continue + } + candidateStartedAt := normalizedNow(m.now()) + m.tracker.BeginProbe(probePeer, candidateStartedAt) + probeCtx, cancel := context.WithTimeout(ctx, m.probeTimeout) + _, err := NewClient(probePeer.Endpoint).withHTTPClient(m.httpClient).SendHealth(probeCtx, NewHealthMessage(m.local, target)) + cancel() + completedAt := normalizedNow(m.now()) + candidateResult := PeerConnectionCandidateProbeResult{ + CandidateID: probePeer.BestCandidateID, + Endpoint: probePeer.Endpoint, + Transport: probePeer.BestTransport, + StartedAt: candidateStartedAt, + CompletedAt: completedAt, + } + if err != nil { + lastFailure = err.Error() + candidateResult.LinkStatus = PeerConnectionProbeUnreachable + candidateResult.FailureReason = lastFailure + result.CandidateResults = append(result.CandidateResults, candidateResult) + continue + } + latency := int(completedAt.Sub(candidateStartedAt).Milliseconds()) + if latency < 0 { + latency = 0 + } + candidateResult.LinkStatus = PeerConnectionProbeReachable + candidateResult.LatencyMs = latency + result.CandidateResults = append(result.CandidateResults, candidateResult) + result.LinkStatus = PeerConnectionProbeReachable + result.Endpoint = 
probePeer.Endpoint + result.SelectedCandidateID = probePeer.BestCandidateID + result.SelectedEndpoint = probePeer.Endpoint + result.LatencyMs = latency + if intent.RelayCandidate { + result.ConnectionState = m.tracker.RecordRelayReady(probePeer, latency, completedAt) + } else { + result.ConnectionState = m.tracker.RecordSuccessForPeer(probePeer, latency, completedAt) + } result.CompletedAt = completedAt return result } - latency := int(completedAt.Sub(startedAt).Milliseconds()) - if latency < 0 { - latency = 0 - } - result.LinkStatus = PeerConnectionProbeReachable - result.LatencyMs = latency - if intent.RelayCandidate { - result.ConnectionState = m.tracker.RecordRelayReady(peer, latency, completedAt) - } else { - result.ConnectionState = m.tracker.RecordSuccess(intent.NodeID, latency, completedAt) + completedAt := normalizedNow(m.now()) + if lastFailure == "" { + lastFailure = "no_probe_endpoint_available" } + result.LinkStatus = PeerConnectionProbeUnreachable + result.FailureReason = lastFailure + result.ConnectionState = m.tracker.RecordFailure(intent.NodeID, lastFailure, completedAt) result.CompletedAt = completedAt return result } +func peerConnectionProbeTargets(intent PeerConnectionIntent, cacheEntry PeerCacheEntry) []peerConnectionProbeTarget { + seen := map[string]struct{}{} + out := make([]peerConnectionProbeTarget, 0, len(cacheEntry.EndpointCandidates)+1) + add := func(candidateID, endpoint, transport string) { + endpoint = strings.TrimRight(strings.TrimSpace(endpoint), "/") + if endpoint == "" { + return + } + key := candidateID + "|" + endpoint + if _, ok := seen[key]; ok { + return + } + seen[key] = struct{}{} + out = append(out, peerConnectionProbeTarget{ + CandidateID: strings.TrimSpace(candidateID), + Endpoint: endpoint, + Transport: strings.TrimSpace(transport), + }) + } + for _, candidate := range cacheEntry.EndpointCandidates { + if !candidateUsableForDirectProbe(candidate) { + continue + } + add(candidate.EndpointID, candidate.Address, 
candidate.Transport) + } + add(intent.BestCandidateID, intent.Endpoint, intent.Transport) + return out +} + +func candidateUsableForDirectProbe(candidate PeerEndpointCandidate) bool { + endpoint := strings.TrimSpace(candidate.Address) + if endpoint == "" || strings.HasPrefix(endpoint, "relay://") || strings.HasPrefix(endpoint, "outbound://") { + return false + } + connectivity := strings.ToLower(strings.TrimSpace(candidate.ConnectivityMode)) + reachability := strings.ToLower(strings.TrimSpace(candidate.Reachability)) + transport := strings.ToLower(strings.TrimSpace(candidate.Transport)) + if connectivity == "outbound_only" || connectivity == "relay_required" || reachability == "outbound_only" || reachability == "relay" { + return false + } + return transport == "" || strings.Contains(transport, "direct") || transport == "wss" || strings.HasPrefix(endpoint, "http://") || strings.HasPrefix(endpoint, "https://") +} + func (m *PeerConnectionManager) connectionState(nodeID string) PeerConnectionState { snapshot := m.tracker.Snapshot() for _, entry := range snapshot.Entries { diff --git a/agents/rap-node-agent/internal/mesh/peer_connection_manager_test.go b/agents/rap-node-agent/internal/mesh/peer_connection_manager_test.go index 5d64f3a..0fe1716 100644 --- a/agents/rap-node-agent/internal/mesh/peer_connection_manager_test.go +++ b/agents/rap-node-agent/internal/mesh/peer_connection_manager_test.go @@ -188,3 +188,71 @@ func TestPeerConnectionManagerProbesRelayControlLease(t *testing.T) { t.Fatalf("unexpected tracker snapshot: %+v", snapshot) } } + +func TestPeerConnectionManagerFallsBackAcrossEndpointCandidates(t *testing.T) { + now := time.Date(2026, 4, 30, 12, 0, 0, 0, time.UTC) + current := now + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "node-b"}, + }.Handler()) + defer server.Close() + + local := PeerIdentity{ClusterID: "cluster-1", NodeID: "node-a"} + cache := NewPeerCache(PeerCacheConfig{ + Local: local, + 
PeerEndpointCandidates: map[string][]PeerEndpointCandidate{ + "node-b": { + { + EndpointID: "node-b-dead", + NodeID: "node-b", + Transport: "direct_http", + Address: "http://127.0.0.1:1", + Reachability: "private", + ConnectivityMode: "private_lan", + Priority: 1, + }, + { + EndpointID: "node-b-live", + NodeID: "node-b", + Transport: "direct_http", + Address: server.URL, + Reachability: "private", + ConnectivityMode: "private_lan", + Priority: 2, + }, + }, + }, + WarmPeerLimit: 1, + Now: now, + }) + tracker := NewPeerConnectionTracker(cache.Snapshot(), now) + manager := NewPeerConnectionManager(PeerConnectionManagerConfig{ + Local: local, + PeerCache: cache, + Tracker: tracker, + HTTPClient: &http.Client{Timeout: 100 * time.Millisecond}, + ProbeTimeout: 100 * time.Millisecond, + Now: func() time.Time { + current = current.Add(10 * time.Millisecond) + return current + }, + }) + + cycle := manager.ProbeOnce(context.Background()) + if cycle.Attempted != 1 || cycle.Succeeded != 1 || cycle.Failed != 0 || len(cycle.Results) != 1 { + t.Fatalf("unexpected cycle: %+v", cycle) + } + result := cycle.Results[0] + if result.LinkStatus != PeerConnectionProbeReachable || result.SelectedCandidateID != "node-b-live" || result.SelectedEndpoint != server.URL { + t.Fatalf("fallback did not select live candidate: %+v", result) + } + if len(result.CandidateResults) != 2 || + result.CandidateResults[0].LinkStatus != PeerConnectionProbeUnreachable || + result.CandidateResults[1].LinkStatus != PeerConnectionProbeReachable { + t.Fatalf("candidate probe trail mismatch: %+v", result.CandidateResults) + } + snapshot := tracker.Snapshot() + if snapshot.Ready != 1 || len(snapshot.Entries) != 1 || snapshot.Entries[0].BestCandidateID != "node-b-live" || snapshot.Entries[0].Endpoint != server.URL { + t.Fatalf("tracker did not retain selected candidate: %+v", snapshot) + } +} diff --git a/agents/rap-node-agent/internal/mesh/peer_connection_state.go 
b/agents/rap-node-agent/internal/mesh/peer_connection_state.go index 92d5952..00e9539 100644 --- a/agents/rap-node-agent/internal/mesh/peer_connection_state.go +++ b/agents/rap-node-agent/internal/mesh/peer_connection_state.go @@ -138,6 +138,32 @@ func (t *PeerConnectionTracker) RecordSuccess(nodeID string, latencyMs int, now return entry } +func (t *PeerConnectionTracker) RecordSuccessForPeer(peer PeerCacheEntry, latencyMs int, now time.Time) PeerConnectionState { + if t == nil { + return PeerConnectionState{} + } + t.mu.Lock() + defer t.mu.Unlock() + now = normalizedNow(now) + entry := t.entry(peer, now) + entry.ConsecutiveSuccesses++ + entry.ConsecutiveFailures = 0 + entry.LastLatencyMs = latencyMs + entry.LastFailureReason = "" + entry.LastProbeAt = now + entry.BackoffUntil = time.Time{} + nextState := PeerConnectionReady + if latencyMs >= 500 { + nextState = PeerConnectionDegraded + } + if entry.State != nextState { + entry.State = nextState + entry.LastTransitionAt = now + } + t.entries[peer.NodeID] = entry + return entry +} + func (t *PeerConnectionTracker) RecordRelayReady(peer PeerCacheEntry, latencyMs int, now time.Time) PeerConnectionState { if t == nil { return PeerConnectionState{} diff --git a/agents/rap-node-agent/internal/mesh/production_envelope.go b/agents/rap-node-agent/internal/mesh/production_envelope.go index 7cd323d..05a777f 100644 --- a/agents/rap-node-agent/internal/mesh/production_envelope.go +++ b/agents/rap-node-agent/internal/mesh/production_envelope.go @@ -34,12 +34,20 @@ func ValidateProductionEnvelope(local PeerIdentity, envelope ProductionEnvelope, return err } } - if envelope.ChannelClass != ProductionChannelFabricControl { + maxPayloadBytes := MaxProductionEnvelopePayloadBytes + switch envelope.ChannelClass { + case ProductionChannelFabricControl: + if envelope.MessageType != ProductionMessageFabricControl { + return fmt.Errorf("%w: unsupported message_type", ErrForwardEnvelopeInvalid) + } + case ProductionChannelVPNPacket: + if 
envelope.MessageType != ProductionMessageVPNPacketBatch { + return fmt.Errorf("%w: unsupported message_type", ErrForwardEnvelopeInvalid) + } + maxPayloadBytes = MaxProductionVPNPacketPayloadBytes + default: return ErrUnauthorizedChannel } - if envelope.MessageType != ProductionMessageFabricControl { - return fmt.Errorf("%w: unsupported message_type", ErrForwardEnvelopeInvalid) - } if envelope.TTL <= 0 { return ErrTTLExhausted } @@ -58,8 +66,8 @@ func ValidateProductionEnvelope(local PeerIdentity, envelope ProductionEnvelope, if envelope.PayloadLength != len(envelope.Payload) { return fmt.Errorf("%w: payload_length mismatch", ErrForwardEnvelopeInvalid) } - if envelope.PayloadLength > MaxProductionEnvelopePayloadBytes { - return fmt.Errorf("%w: payload exceeds fabric-control limit", ErrForwardEnvelopeInvalid) + if envelope.PayloadLength > maxPayloadBytes { + return fmt.Errorf("%w: payload exceeds channel limit", ErrForwardEnvelopeInvalid) } if envelope.PayloadHash == "" { return fmt.Errorf("%w: payload_hash is required", ErrForwardEnvelopeInvalid) diff --git a/agents/rap-node-agent/internal/mesh/production_route_config.go b/agents/rap-node-agent/internal/mesh/production_route_config.go index 3a4c9f0..eda1d2a 100644 --- a/agents/rap-node-agent/internal/mesh/production_route_config.go +++ b/agents/rap-node-agent/internal/mesh/production_route_config.go @@ -22,7 +22,7 @@ func ValidateProductionEnvelopeRouteConfig(local PeerIdentity, envelope Producti if route.ExpiresAt.IsZero() || !route.ExpiresAt.After(now.UTC()) || envelope.ExpiresAt.After(route.ExpiresAt) { return ErrRouteExpired } - if !contains(route.AllowedChannels, ProductionChannelFabricControl) { + if !contains(route.AllowedChannels, envelope.ChannelClass) { return ErrUnauthorizedChannel } path := routePath(route) diff --git a/agents/rap-node-agent/internal/mesh/remote_workspace_sink.go b/agents/rap-node-agent/internal/mesh/remote_workspace_sink.go new file mode 100644 index 0000000..6ec4781 --- /dev/null +++ 
b/agents/rap-node-agent/internal/mesh/remote_workspace_sink.go @@ -0,0 +1,1577 @@ +package mesh + +import ( + "context" + "fmt" + "sort" + "strings" + "sync" + "time" +) + +const DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity = 8 +const DefaultRemoteWorkspaceFrameProbeSinkSessionTTL = 2 * time.Minute +const DefaultRemoteWorkspaceAdapterMailboxCapacity = 16 +const DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity = 32 +const RemoteWorkspaceFrameProbeSinkRuntimeID = "node_agent_rdp_worker_contract_probe" + +type RemoteWorkspaceFrameProbeSink struct { + mu sync.Mutex + sequence int64 + queueCapacity int + sessionTTL time.Duration + sessions map[string]*remoteWorkspaceAdapterProbeSession + terminalSessions map[string]remoteWorkspaceAdapterProbeTerminalSession + sessionCreatedTotal int64 + sessionBoundTotal int64 + sessionBackpressureTotal int64 + sessionExpiredTotal int64 + sessionClosedTotal int64 + sessionResetTotal int64 + sessionControlTotal int64 + mailboxEventSequence int64 + mailboxEnqueuedTotal int64 + mailboxDrainedTotal int64 + mailboxDroppedTotal int64 + mailboxReadTotal int64 + mailboxWaitTotal int64 + mailboxWaitTimeoutTotal int64 + mailboxEmptyReadTotal int64 + mailboxResumeReadTotal int64 + mailboxAfterSequenceReadTotal int64 + mailboxReturnedTotal int64 + mailboxSkippedTotal int64 + mailboxConsumerReadTotal int64 + mailboxConsumerAckTotal int64 + mailboxConsumerResetTotal int64 + mailboxConsumerEvictedTotal int64 + lastMailboxReadAt string + lastMailboxAdapterSessionID string + lastMailboxWaitMs int + lastMailboxWaited bool + lastMailboxWaitTimeout bool + lastMailboxEmpty bool + lastMailboxResumeFrom string + lastMailboxResumeSequence int64 + lastMailboxResumeConsumerID string + lastMailboxAfterSequence int64 + lastMailboxSkippedCount int + lastMailboxReturnedCount int + lastMailboxConsumerID string + lastMailboxConsumerAdapterSessionID string + lastMailboxConsumerReadAt string + lastMailboxConsumerAckAt string + lastMailboxConsumerCheckpoint 
int64 + lastMailboxConsumerAck int64 + acceptedFramesTotal int64 + droppedFramesTotal int64 + ackedFramesTotal int64 + backpressureCount int64 + lastBackpressureAt string + lastBackpressureReason string + lastRejectedFrameCount int + lastRejectedAdapterSessionID string + lastRejectedChannelClass string + lastRejectedAdapterContractID string + lastRejectedQueueCapacity int + lastRejectedQueueDepth int + lastControl RemoteWorkspaceAdapterSessionControlResult + last RemoteWorkspaceFrameBatchDeliveryReceipt +} + +type remoteWorkspaceAdapterProbeSession struct { + ID string + State string + CreatedAt time.Time + BoundAt time.Time + LastActivityAt time.Time + LastBackpressureAt time.Time + ClosedAt time.Time + DeliveryCount int64 + BackpressureCount int64 + AcceptedFrames int64 + DroppedFrames int64 + AckedFrames int64 + Mailbox []RemoteWorkspaceAdapterMailboxEvent + MailboxEnqueued int64 + MailboxDrained int64 + MailboxDropped int64 + MailboxRead int64 + MailboxWait int64 + MailboxWaitTimeout int64 + MailboxEmptyRead int64 + MailboxResumeRead int64 + MailboxAfterSequenceRead int64 + MailboxReturnedTotal int64 + MailboxSkippedTotal int64 + MailboxConsumers map[string]*remoteWorkspaceAdapterMailboxConsumerState + MailboxConsumerReadTotal int64 + MailboxConsumerAckTotal int64 + MailboxConsumerResetTotal int64 + MailboxConsumerEvictedTotal int64 + LastMailboxConsumerID string + LastMailboxConsumerReadAt time.Time + LastMailboxConsumerAckAt time.Time + LastMailboxConsumerCheckpoint int64 + LastMailboxConsumerAck int64 + LastMailboxReadAt time.Time + LastMailboxWaitMs int + LastMailboxWaited bool + LastMailboxTimeout bool + LastMailboxEmpty bool + LastMailboxResumeFrom string + LastMailboxResumeSequence int64 + LastMailboxResumeConsumerID string + LastMailboxAfterSequence int64 + LastMailboxSkippedCount int + LastMailboxReturnedCount int + LastChannelID string + LastResourceID string + LastRouteID string + LastReason string +} + +type 
remoteWorkspaceAdapterMailboxConsumerState struct { + ID string + CreatedAt time.Time + ReadTotal int64 + AckTotal int64 + CheckpointSequence int64 + AckSequence int64 + LastReadAt time.Time + LastAckAt time.Time +} + +type remoteWorkspaceAdapterProbeTerminalSession struct { + State string + ControlledAt time.Time + Reason string +} + +type RemoteWorkspaceAdapterSessionControlResult struct { + SchemaVersion string `json:"schema_version"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + AdapterSessionID string `json:"adapter_session_id"` + Action string `json:"action"` + Accepted bool `json:"accepted"` + PreviousState string `json:"previous_state,omitempty"` + SessionState string `json:"session_state"` + Reason string `json:"reason,omitempty"` + ControlledAt string `json:"controlled_at"` + ActiveSessions int `json:"active_session_count"` +} + +type RemoteWorkspaceAdapterMailboxEvent struct { + SchemaVersion string `json:"schema_version"` + Sequence int64 `json:"sequence"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + AdapterSessionID string `json:"adapter_session_id"` + Event string `json:"event"` + ChannelID string `json:"channel_id,omitempty"` + ResourceID string `json:"resource_id,omitempty"` + RouteID string `json:"route_id,omitempty"` + ChannelClass string `json:"channel_class,omitempty"` + FrameCount int `json:"frame_count,omitempty"` + AcceptedFrames int `json:"accepted_frames,omitempty"` + DroppedFrames int `json:"dropped_frames,omitempty"` + AckedFrames int `json:"acked_frames,omitempty"` + Backpressure bool `json:"backpressure,omitempty"` + Reason string `json:"reason,omitempty"` + CreatedAt string `json:"created_at"` +} + +type RemoteWorkspaceAdapterMailboxSnapshot struct { + SchemaVersion string `json:"schema_version"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + AdapterSessionID string `json:"adapter_session_id"` + ObservedAt string `json:"observed_at"` + Drained bool `json:"drained"` + Empty bool `json:"empty"` + Waited bool 
`json:"waited,omitempty"` + WaitTimeout bool `json:"wait_timeout,omitempty"` + WaitMs int `json:"wait_ms,omitempty"` + MailboxCapacity int `json:"mailbox_capacity"` + MailboxDepth int `json:"mailbox_depth"` + DepthAfter int `json:"depth_after"` + EnqueuedTotal int64 `json:"enqueued_total"` + DrainedTotal int64 `json:"drained_total"` + DroppedTotal int64 `json:"dropped_total"` + AfterSequence int64 `json:"after_sequence,omitempty"` + ResumeFrom string `json:"resume_from,omitempty"` + ResumeSequence int64 `json:"resume_sequence"` + SkippedCount int `json:"skipped_count"` + ReturnedCount int `json:"returned_count"` + ConsumerID string `json:"consumer_id,omitempty"` + ConsumerReadTotal int64 `json:"consumer_read_total"` + ConsumerAckTotal int64 `json:"consumer_ack_total"` + ConsumerResetTotal int64 `json:"consumer_reset_total"` + ConsumerEvictedTotal int64 `json:"consumer_evicted_total"` + ConsumerCheckpointSequence int64 `json:"consumer_checkpoint_sequence"` + ConsumerAckSequence int64 `json:"consumer_ack_sequence"` + ConsumerLagCount int `json:"consumer_lag_count"` + ConsumerCount int `json:"consumer_count"` + ConsumerCapacity int `json:"consumer_capacity"` + ConsumerCreated bool `json:"consumer_created,omitempty"` + ConsumerReset bool `json:"consumer_reset,omitempty"` + ConsumerEvicted bool `json:"consumer_evicted,omitempty"` + ConsumerCreatedAt string `json:"consumer_created_at,omitempty"` + ConsumerLastReadAt string `json:"consumer_last_read_at,omitempty"` + ConsumerLastAckAt string `json:"consumer_last_ack_at,omitempty"` + Events []RemoteWorkspaceAdapterMailboxEvent `json:"events"` +} + +type RemoteWorkspaceAdapterMailboxConsumerSnapshot struct { + SchemaVersion string `json:"schema_version"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + AdapterSessionID string `json:"adapter_session_id"` + ObservedAt string `json:"observed_at"` + ConsumerCapacity int `json:"consumer_capacity"` + ConsumerCount int `json:"consumer_count"` + ConsumerReadTotal int64 
`json:"consumer_read_total"` + ConsumerAckTotal int64 `json:"consumer_ack_total"` + ConsumerResetTotal int64 `json:"consumer_reset_total"` + ConsumerEvictedTotal int64 `json:"consumer_evicted_total"` + MailboxDepth int `json:"mailbox_depth"` + MailboxEnqueued int64 `json:"mailbox_enqueued_total"` + MailboxDrained int64 `json:"mailbox_drained_total"` + MailboxDropped int64 `json:"mailbox_dropped_total"` + Consumers []RemoteWorkspaceAdapterMailboxConsumer `json:"consumers"` +} + +type RemoteWorkspaceAdapterMailboxConsumer struct { + ConsumerID string `json:"consumer_id"` + CreatedAt string `json:"created_at,omitempty"` + ReadTotal int64 `json:"consumer_read_total"` + AckTotal int64 `json:"consumer_ack_total"` + CheckpointSequence int64 `json:"consumer_checkpoint_sequence"` + AckSequence int64 `json:"consumer_ack_sequence"` + LagCount int `json:"consumer_lag_count"` + LastReadAt string `json:"last_read_at,omitempty"` + LastAckAt string `json:"last_ack_at,omitempty"` +} + +type RemoteWorkspaceAdapterMailboxPreflightSnapshot struct { + SchemaVersion string `json:"schema_version"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + AdapterSessionID string `json:"adapter_session_id"` + ObservedAt string `json:"observed_at"` + ReadOnly bool `json:"read_only"` + ConsumerID string `json:"consumer_id"` + ResumeFrom string `json:"resume_from"` + ResumeSequence int64 `json:"resume_sequence"` + AfterSequence int64 `json:"after_sequence"` + Limit int `json:"limit"` + MailboxDepth int `json:"mailbox_depth"` + MailboxEnqueued int64 `json:"mailbox_enqueued_total"` + MailboxReadTotal int64 `json:"mailbox_read_total"` + ConsumerReadTotal int64 `json:"consumer_read_total"` + ConsumerAckTotal int64 `json:"consumer_ack_total"` + ConsumerCheckpointSequence int64 `json:"consumer_checkpoint_sequence"` + ConsumerAckSequence int64 `json:"consumer_ack_sequence"` + ConsumerLagCount int `json:"consumer_lag_count"` + ExpectedAvailableCount int `json:"expected_available_count"` + 
ExpectedReturnedCount int `json:"expected_returned_count"` + ExpectedSkippedCount int `json:"expected_skipped_count"` + FirstExpectedSequence int64 `json:"first_expected_sequence,omitempty"` + LastExpectedSequence int64 `json:"last_expected_sequence,omitempty"` +} + +type RemoteWorkspaceAdapterSessionSnapshot struct { + SchemaVersion string `json:"schema_version"` + AdapterRuntimeID string `json:"adapter_runtime_id"` + ObservedAt string `json:"observed_at"` + ActiveSessionCount int `json:"active_session_count"` + TerminalSessionCount int `json:"terminal_session_count"` + Sessions []RemoteWorkspaceAdapterSessionView `json:"sessions"` + TerminalSessions []RemoteWorkspaceAdapterTerminalSession `json:"terminal_sessions,omitempty"` +} + +type RemoteWorkspaceAdapterSessionView struct { + AdapterSessionID string `json:"adapter_session_id"` + SessionState string `json:"session_state"` + CreatedAt string `json:"created_at"` + BoundAt string `json:"bound_at,omitempty"` + LastActivityAt string `json:"last_activity_at"` + LastBackpressureAt string `json:"last_backpressure_at,omitempty"` + LastBackpressureReason string `json:"last_backpressure_reason,omitempty"` + DeliveryCount int64 `json:"delivery_count"` + BackpressureCount int64 `json:"backpressure_count"` + AcceptedFrames int64 `json:"accepted_frames"` + DroppedFrames int64 `json:"dropped_frames"` + AckedFrames int64 `json:"acked_frames"` + MailboxCapacity int `json:"mailbox_capacity"` + MailboxDepth int `json:"mailbox_depth"` + MailboxEnqueued int64 `json:"mailbox_enqueued_total"` + MailboxDrained int64 `json:"mailbox_drained_total"` + MailboxDropped int64 `json:"mailbox_dropped_total"` + MailboxRead int64 `json:"mailbox_read_total"` + MailboxWait int64 `json:"mailbox_wait_total"` + MailboxWaitTimeout int64 `json:"mailbox_wait_timeout_total"` + MailboxEmptyRead int64 `json:"mailbox_empty_read_total"` + MailboxResumeRead int64 `json:"mailbox_resume_read_total"` + MailboxAfterSequenceRead int64 
`json:"mailbox_after_sequence_read_total"` + MailboxReturnedTotal int64 `json:"mailbox_returned_total"` + MailboxSkippedTotal int64 `json:"mailbox_skipped_total"` + MailboxConsumerCount int `json:"mailbox_consumer_count"` + MailboxConsumerRead int64 `json:"mailbox_consumer_read_total"` + MailboxConsumerAck int64 `json:"mailbox_consumer_ack_total"` + MailboxConsumerReset int64 `json:"mailbox_consumer_reset_total"` + MailboxConsumerEvicted int64 `json:"mailbox_consumer_evicted_total"` + LastMailboxConsumerID string `json:"last_mailbox_consumer_id,omitempty"` + LastMailboxConsumerReadAt string `json:"last_mailbox_consumer_read_at,omitempty"` + LastMailboxConsumerAckAt string `json:"last_mailbox_consumer_ack_at,omitempty"` + LastMailboxConsumerCheckpoint int64 `json:"last_mailbox_consumer_checkpoint_sequence,omitempty"` + LastMailboxConsumerAck int64 `json:"last_mailbox_consumer_ack_sequence,omitempty"` + LastMailboxReadAt string `json:"last_mailbox_read_at,omitempty"` + LastMailboxWaitMs int `json:"last_mailbox_wait_ms,omitempty"` + LastMailboxWaited bool `json:"last_mailbox_waited,omitempty"` + LastMailboxWaitTimeout bool `json:"last_mailbox_wait_timeout,omitempty"` + LastMailboxEmpty bool `json:"last_mailbox_empty,omitempty"` + LastMailboxResumeFrom string `json:"last_mailbox_resume_from,omitempty"` + LastMailboxResumeSequence int64 `json:"last_mailbox_resume_sequence,omitempty"` + LastMailboxResumeConsumerID string `json:"last_mailbox_resume_consumer_id,omitempty"` + LastMailboxAfterSequence int64 `json:"last_mailbox_after_sequence,omitempty"` + LastMailboxSkippedCount int `json:"last_mailbox_skipped_count"` + LastMailboxReturnedCount int `json:"last_mailbox_returned_count"` + ChannelID string `json:"channel_id,omitempty"` + ResourceID string `json:"resource_id,omitempty"` + RouteID string `json:"route_id,omitempty"` +} + +type RemoteWorkspaceAdapterTerminalSession struct { + AdapterSessionID string `json:"adapter_session_id"` + SessionState string 
`json:"session_state"` + ControlledAt string `json:"controlled_at"` + Reason string `json:"reason,omitempty"` +} + +func NewRemoteWorkspaceFrameProbeSink() *RemoteWorkspaceFrameProbeSink { + return &RemoteWorkspaceFrameProbeSink{ + queueCapacity: DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity, + sessionTTL: DefaultRemoteWorkspaceFrameProbeSinkSessionTTL, + sessions: map[string]*remoteWorkspaceAdapterProbeSession{}, + terminalSessions: map[string]remoteWorkspaceAdapterProbeTerminalSession{}, + } +} + +func (s *RemoteWorkspaceFrameProbeSink) AcceptRemoteWorkspaceFrameBatchProbe(_ context.Context, delivery RemoteWorkspaceFrameBatchDelivery) (RemoteWorkspaceFrameBatchDeliveryReceipt, error) { + if s == nil { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + if strings.TrimSpace(delivery.ServiceClass) != FabricServiceClassRemoteWorkspace { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, fmt.Errorf("remote workspace adapter sink service class mismatch") + } + if strings.TrimSpace(delivery.ChannelClass) == "" { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, fmt.Errorf("remote workspace adapter sink channel class required") + } + if strings.TrimSpace(delivery.AdapterSessionID) == "" { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, fmt.Errorf("remote workspace adapter sink session id required") + } + if len(delivery.Frames) == 0 { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, fmt.Errorf("remote workspace adapter sink requires frames") + } + queueCapacity := s.queueCapacity + if queueCapacity <= 0 { + queueCapacity = DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity + } + acceptedFrames := 0 + droppedFrames := 0 + for _, frame := range delivery.Frames { + if acceptedFrames < queueCapacity { + acceptedFrames++ + continue + } + if frame.Droppable { + droppedFrames++ + continue + } + err := fmt.Errorf("remote workspace adapter sink backpressure") + s.recordBackpressure(delivery, 
queueCapacity, err.Error()) + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, err + } + s.mu.Lock() + defer s.mu.Unlock() + now := time.Now().UTC() + s.expireIdleSessionsLocked(now) + session := s.ensureSessionLocked(delivery, now) + session.State = "probe_bound" + if session.BoundAt.IsZero() { + session.BoundAt = now + s.sessionBoundTotal++ + } + session.LastActivityAt = now + session.DeliveryCount++ + session.AcceptedFrames += int64(acceptedFrames) + session.DroppedFrames += int64(droppedFrames) + session.AckedFrames += int64(acceptedFrames) + session.LastChannelID = strings.TrimSpace(delivery.ChannelID) + session.LastResourceID = strings.TrimSpace(delivery.ResourceID) + session.LastRouteID = strings.TrimSpace(delivery.PreferredRouteID) + s.enqueueAdapterMailboxEventLocked(session, RemoteWorkspaceAdapterMailboxEvent{ + Event: "frame_batch_probe_delivered", + ChannelID: strings.TrimSpace(delivery.ChannelID), + ResourceID: strings.TrimSpace(delivery.ResourceID), + RouteID: strings.TrimSpace(delivery.PreferredRouteID), + ChannelClass: strings.TrimSpace(delivery.ChannelClass), + FrameCount: len(delivery.Frames), + AcceptedFrames: acceptedFrames, + DroppedFrames: droppedFrames, + AckedFrames: acceptedFrames, + }, now) + s.sequence++ + s.acceptedFramesTotal += int64(acceptedFrames) + s.droppedFramesTotal += int64(droppedFrames) + s.ackedFramesTotal += int64(acceptedFrames) + receipt := RemoteWorkspaceFrameBatchDeliveryReceipt{ + SchemaVersion: "rap.remote_workspace_frame_batch_delivery.v1", + Sink: RemoteWorkspaceFrameProbeSinkRuntimeID, + Accepted: true, + ProbeOnly: true, + ClusterID: strings.TrimSpace(delivery.ClusterID), + ChannelID: strings.TrimSpace(delivery.ChannelID), + ResourceID: strings.TrimSpace(delivery.ResourceID), + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: strings.TrimSpace(delivery.ChannelClass), + AdapterContractID: strings.TrimSpace(delivery.AdapterContractID), + AdapterSessionID: 
strings.TrimSpace(delivery.AdapterSessionID), + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + SessionState: "probe_bound", + SessionCreatedAt: session.CreatedAt.Format(time.RFC3339Nano), + SessionBoundAt: session.BoundAt.Format(time.RFC3339Nano), + SessionLastActive: session.LastActivityAt.Format(time.RFC3339Nano), + SessionLifecycle: session.State, + SessionDeliveries: session.DeliveryCount, + SessionPressure: session.BackpressureCount, + MailboxDepth: len(session.Mailbox), + MailboxEnqueued: session.MailboxEnqueued, + FrameCount: len(delivery.Frames), + QueueCapacity: queueCapacity, + QueueDepth: 0, + AcceptedFrames: acceptedFrames, + DroppedFrames: droppedFrames, + AckedFrames: acceptedFrames, + Backpressure: false, + DropPolicy: "drop_droppable_overflow_ack_accepted", + DeliverySequence: s.sequence, + DeliveredAt: now.Format(time.RFC3339Nano), + } + s.last = receipt + return receipt, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) recordBackpressure(delivery RemoteWorkspaceFrameBatchDelivery, queueCapacity int, reason string) { + s.mu.Lock() + defer s.mu.Unlock() + now := time.Now().UTC() + s.expireIdleSessionsLocked(now) + session := s.ensureSessionLocked(delivery, now) + session.State = "backpressure" + session.LastActivityAt = now + session.LastBackpressureAt = now + session.BackpressureCount++ + session.LastChannelID = strings.TrimSpace(delivery.ChannelID) + session.LastResourceID = strings.TrimSpace(delivery.ResourceID) + session.LastRouteID = strings.TrimSpace(delivery.PreferredRouteID) + session.LastReason = strings.TrimSpace(reason) + s.enqueueAdapterMailboxEventLocked(session, RemoteWorkspaceAdapterMailboxEvent{ + Event: "backpressure", + ChannelID: strings.TrimSpace(delivery.ChannelID), + ResourceID: strings.TrimSpace(delivery.ResourceID), + RouteID: strings.TrimSpace(delivery.PreferredRouteID), + ChannelClass: strings.TrimSpace(delivery.ChannelClass), + FrameCount: len(delivery.Frames), + Backpressure: true, + Reason: 
strings.TrimSpace(reason), + }, now) + s.backpressureCount++ + s.sessionBackpressureTotal++ + s.lastBackpressureAt = now.Format(time.RFC3339Nano) + s.lastBackpressureReason = strings.TrimSpace(reason) + s.lastRejectedFrameCount = len(delivery.Frames) + s.lastRejectedAdapterSessionID = strings.TrimSpace(delivery.AdapterSessionID) + s.lastRejectedChannelClass = strings.TrimSpace(delivery.ChannelClass) + s.lastRejectedAdapterContractID = strings.TrimSpace(delivery.AdapterContractID) + s.lastRejectedQueueCapacity = queueCapacity + s.lastRejectedQueueDepth = queueCapacity +} + +func (s *RemoteWorkspaceFrameProbeSink) ControlAdapterSession(action string, adapterSessionID string, reason string, now time.Time) (RemoteWorkspaceAdapterSessionControlResult, error) { + if s == nil { + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + action = strings.TrimSpace(strings.ToLower(action)) + adapterSessionID = strings.TrimSpace(adapterSessionID) + reason = strings.TrimSpace(reason) + if adapterSessionID == "" { + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("remote workspace adapter session id required") + } + if !isValidRemoteWorkspaceAdapterSessionID(adapterSessionID) { + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("invalid remote workspace adapter session id") + } + if len(reason) > 512 { + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("remote workspace adapter session control reason too long") + } + switch action { + case "close", "expire", "reset": + default: + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("unsupported remote workspace adapter session control action") + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + if s.terminalSessions == nil { + s.terminalSessions = map[string]remoteWorkspaceAdapterProbeTerminalSession{} + } + 
session := s.sessions[adapterSessionID] + previousState := "" + if session != nil { + previousState = session.State + } + nextState := actionToAdapterSessionState(action) + if session == nil { + if terminal, ok := s.terminalSessions[adapterSessionID]; ok { + result := RemoteWorkspaceAdapterSessionControlResult{ + SchemaVersion: "rap.remote_workspace_adapter_session_control.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + AdapterSessionID: adapterSessionID, + Action: action, + Accepted: true, + PreviousState: terminal.State, + SessionState: nextState, + Reason: reason, + ControlledAt: now.Format(time.RFC3339Nano), + ActiveSessions: len(s.sessions), + } + s.sessionControlTotal++ + s.lastControl = result + return result, nil + } + return RemoteWorkspaceAdapterSessionControlResult{}, fmt.Errorf("remote workspace adapter session not found") + } else { + session.State = nextState + session.LastActivityAt = now + session.ClosedAt = now + session.LastReason = reason + } + s.sessionControlTotal++ + switch action { + case "expire": + s.sessionExpiredTotal++ + s.sessionClosedTotal++ + case "close": + s.sessionClosedTotal++ + case "reset": + s.sessionResetTotal++ + s.sessionClosedTotal++ + } + delete(s.sessions, adapterSessionID) + s.terminalSessions[adapterSessionID] = remoteWorkspaceAdapterProbeTerminalSession{ + State: nextState, + ControlledAt: now, + Reason: reason, + } + result := RemoteWorkspaceAdapterSessionControlResult{ + SchemaVersion: "rap.remote_workspace_adapter_session_control.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + AdapterSessionID: adapterSessionID, + Action: action, + Accepted: true, + PreviousState: previousState, + SessionState: nextState, + Reason: reason, + ControlledAt: now.Format(time.RFC3339Nano), + ActiveSessions: len(s.sessions), + } + s.lastControl = result + return result, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) enqueueAdapterMailboxEventLocked(session *remoteWorkspaceAdapterProbeSession, event 
RemoteWorkspaceAdapterMailboxEvent, now time.Time) { + if session == nil { + return + } + s.mailboxEventSequence++ + event.SchemaVersion = "rap.remote_workspace_adapter_mailbox_event.v1" + event.Sequence = s.mailboxEventSequence + event.AdapterRuntimeID = RemoteWorkspaceFrameProbeSinkRuntimeID + event.AdapterSessionID = session.ID + event.CreatedAt = now.UTC().Format(time.RFC3339Nano) + if len(session.Mailbox) >= DefaultRemoteWorkspaceAdapterMailboxCapacity { + session.Mailbox = append([]RemoteWorkspaceAdapterMailboxEvent(nil), session.Mailbox[1:]...) + session.MailboxDropped++ + s.mailboxDroppedTotal++ + } + session.Mailbox = append(session.Mailbox, event) + session.MailboxEnqueued++ + s.mailboxEnqueuedTotal++ +} + +func isValidRemoteWorkspaceAdapterSessionID(adapterSessionID string) bool { + const prefix = "rap-rw-adapter-session-" + if !strings.HasPrefix(adapterSessionID, prefix) || len(adapterSessionID) != len(prefix)+24 { + return false + } + for _, ch := range adapterSessionID[len(prefix):] { + if (ch < '0' || ch > '9') && (ch < 'a' || ch > 'f') { + return false + } + } + return true +} + +func actionToAdapterSessionState(action string) string { + switch action { + case "expire": + return "expired" + case "reset": + return "reset" + default: + return "closed" + } +} + +func (s *RemoteWorkspaceFrameProbeSink) ensureSessionLocked(delivery RemoteWorkspaceFrameBatchDelivery, now time.Time) *remoteWorkspaceAdapterProbeSession { + if s.sessions == nil { + s.sessions = map[string]*remoteWorkspaceAdapterProbeSession{} + } + if s.terminalSessions == nil { + s.terminalSessions = map[string]remoteWorkspaceAdapterProbeTerminalSession{} + } + sessionID := strings.TrimSpace(delivery.AdapterSessionID) + session := s.sessions[sessionID] + if session == nil { + session = &remoteWorkspaceAdapterProbeSession{ + ID: sessionID, + State: "created", + CreatedAt: now, + LastActivityAt: now, + MailboxConsumers: map[string]*remoteWorkspaceAdapterMailboxConsumerState{}, + } + 
s.sessions[sessionID] = session + s.sessionCreatedTotal++ + } + return session +} + +func (s *RemoteWorkspaceFrameProbeSink) expireIdleSessionsLocked(now time.Time) { + ttl := s.sessionTTL + if ttl <= 0 { + ttl = DefaultRemoteWorkspaceFrameProbeSinkSessionTTL + } + for id, session := range s.sessions { + if session == nil || session.State == "closed" || session.State == "expired" { + continue + } + if !session.LastActivityAt.IsZero() && now.Sub(session.LastActivityAt) > ttl { + session.State = "expired" + session.ClosedAt = now + s.sessionExpiredTotal++ + s.sessionClosedTotal++ + delete(s.sessions, id) + } + } +} + +func (s *RemoteWorkspaceFrameProbeSink) LastReceipt() (RemoteWorkspaceFrameBatchDeliveryReceipt, bool) { + if s == nil { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, false + } + s.mu.Lock() + defer s.mu.Unlock() + if s.sequence == 0 { + return RemoteWorkspaceFrameBatchDeliveryReceipt{}, false + } + return s.last, true +} + +func (s *RemoteWorkspaceFrameProbeSink) SnapshotAdapterSessions(includeTerminal bool, limit int, now time.Time) RemoteWorkspaceAdapterSessionSnapshot { + if limit <= 0 { + limit = 50 + } + if limit > 200 { + limit = 200 + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + snapshot := RemoteWorkspaceAdapterSessionSnapshot{ + SchemaVersion: "rap.remote_workspace_adapter_session_snapshot.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + ObservedAt: now.Format(time.RFC3339Nano), + } + if s == nil { + return snapshot + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + activeIDs := make([]string, 0, len(s.sessions)) + for id := range s.sessions { + activeIDs = append(activeIDs, id) + } + sort.Strings(activeIDs) + for _, id := range activeIDs { + if len(snapshot.Sessions) >= limit { + break + } + session := s.sessions[id] + if session == nil { + continue + } + snapshot.Sessions = append(snapshot.Sessions, remoteWorkspaceAdapterSessionView(*session)) + } + if 
includeTerminal { + terminalIDs := make([]string, 0, len(s.terminalSessions)) + for id := range s.terminalSessions { + terminalIDs = append(terminalIDs, id) + } + sort.Strings(terminalIDs) + for _, id := range terminalIDs { + if len(snapshot.TerminalSessions) >= limit { + break + } + terminal := s.terminalSessions[id] + snapshot.TerminalSessions = append(snapshot.TerminalSessions, RemoteWorkspaceAdapterTerminalSession{ + AdapterSessionID: id, + SessionState: terminal.State, + ControlledAt: terminal.ControlledAt.Format(time.RFC3339Nano), + Reason: terminal.Reason, + }) + } + } + snapshot.ActiveSessionCount = len(s.sessions) + snapshot.TerminalSessionCount = len(s.terminalSessions) + return snapshot +} + +func (s *RemoteWorkspaceFrameProbeSink) ReadAdapterSessionMailbox(adapterSessionID string, drain bool, limit int, afterSequence int64, now time.Time) (RemoteWorkspaceAdapterMailboxSnapshot, error) { + adapterSessionID = strings.TrimSpace(adapterSessionID) + if !isValidRemoteWorkspaceAdapterSessionID(adapterSessionID) { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("invalid remote workspace adapter session id") + } + if afterSequence < 0 { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("invalid remote workspace adapter session mailbox after sequence") + } + if drain && afterSequence > 0 { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("remote workspace adapter session mailbox after sequence cannot drain") + } + if limit <= 0 { + limit = 50 + } + if limit > 200 { + limit = 200 + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + if s == nil { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + session := s.sessions[adapterSessionID] + if session == nil { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("remote workspace adapter session not found") + 
} + beforeDepth := len(session.Mailbox) + startIndex := 0 + if afterSequence > 0 { + for startIndex < len(session.Mailbox) && session.Mailbox[startIndex].Sequence <= afterSequence { + startIndex++ + } + } + eventCount := len(session.Mailbox) - startIndex + if eventCount > limit { + eventCount = limit + } + events := append([]RemoteWorkspaceAdapterMailboxEvent(nil), session.Mailbox[startIndex:startIndex+eventCount]...) + if drain && eventCount > 0 { + session.Mailbox = append([]RemoteWorkspaceAdapterMailboxEvent(nil), session.Mailbox[eventCount:]...) + session.MailboxDrained += int64(eventCount) + s.mailboxDrainedTotal += int64(eventCount) + session.LastActivityAt = now + } + return RemoteWorkspaceAdapterMailboxSnapshot{ + SchemaVersion: "rap.remote_workspace_adapter_mailbox_snapshot.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + AdapterSessionID: adapterSessionID, + ObservedAt: now.Format(time.RFC3339Nano), + Drained: drain, + Empty: eventCount == 0, + MailboxCapacity: DefaultRemoteWorkspaceAdapterMailboxCapacity, + MailboxDepth: beforeDepth, + DepthAfter: len(session.Mailbox), + EnqueuedTotal: session.MailboxEnqueued, + DrainedTotal: session.MailboxDrained, + DroppedTotal: session.MailboxDropped, + AfterSequence: afterSequence, + SkippedCount: startIndex, + ReturnedCount: len(events), + Events: events, + }, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) RecordAdapterSessionMailboxRead(snapshot RemoteWorkspaceAdapterMailboxSnapshot, now time.Time) { + if s == nil { + return + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + adapterSessionID := strings.TrimSpace(snapshot.AdapterSessionID) + s.mu.Lock() + defer s.mu.Unlock() + s.mailboxReadTotal++ + if snapshot.Waited { + s.mailboxWaitTotal++ + } + if snapshot.WaitTimeout { + s.mailboxWaitTimeoutTotal++ + } + if snapshot.Empty { + s.mailboxEmptyReadTotal++ + } + if snapshot.ResumeFrom != "" { + s.mailboxResumeReadTotal++ + } + if snapshot.AfterSequence > 0 { + 
s.mailboxAfterSequenceReadTotal++ + } + s.mailboxReturnedTotal += int64(snapshot.ReturnedCount) + s.mailboxSkippedTotal += int64(snapshot.SkippedCount) + s.lastMailboxReadAt = now.Format(time.RFC3339Nano) + s.lastMailboxAdapterSessionID = adapterSessionID + s.lastMailboxWaitMs = snapshot.WaitMs + s.lastMailboxWaited = snapshot.Waited + s.lastMailboxWaitTimeout = snapshot.WaitTimeout + s.lastMailboxEmpty = snapshot.Empty + s.lastMailboxAfterSequence = snapshot.AfterSequence + s.lastMailboxSkippedCount = snapshot.SkippedCount + s.lastMailboxReturnedCount = snapshot.ReturnedCount + if snapshot.ResumeFrom != "" { + s.lastMailboxResumeFrom = snapshot.ResumeFrom + s.lastMailboxResumeSequence = snapshot.ResumeSequence + s.lastMailboxResumeConsumerID = snapshot.ConsumerID + } + if session := s.sessions[adapterSessionID]; session != nil { + session.MailboxRead++ + if snapshot.Waited { + session.MailboxWait++ + } + if snapshot.WaitTimeout { + session.MailboxWaitTimeout++ + } + if snapshot.Empty { + session.MailboxEmptyRead++ + } + if snapshot.ResumeFrom != "" { + session.MailboxResumeRead++ + } + if snapshot.AfterSequence > 0 { + session.MailboxAfterSequenceRead++ + } + session.MailboxReturnedTotal += int64(snapshot.ReturnedCount) + session.MailboxSkippedTotal += int64(snapshot.SkippedCount) + session.LastMailboxReadAt = now + session.LastMailboxWaitMs = snapshot.WaitMs + session.LastMailboxWaited = snapshot.Waited + session.LastMailboxTimeout = snapshot.WaitTimeout + session.LastMailboxEmpty = snapshot.Empty + session.LastMailboxAfterSequence = snapshot.AfterSequence + session.LastMailboxSkippedCount = snapshot.SkippedCount + session.LastMailboxReturnedCount = snapshot.ReturnedCount + if snapshot.ResumeFrom != "" { + session.LastMailboxResumeFrom = snapshot.ResumeFrom + session.LastMailboxResumeSequence = snapshot.ResumeSequence + session.LastMailboxResumeConsumerID = snapshot.ConsumerID + } + } +} + +func (s *RemoteWorkspaceFrameProbeSink) 
RecordAdapterSessionMailboxConsumerRead(snapshot RemoteWorkspaceAdapterMailboxSnapshot, consumerID string, ackSequence int64, reset bool, now time.Time) (RemoteWorkspaceAdapterMailboxSnapshot, error) { + if s == nil { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + consumerID = strings.TrimSpace(consumerID) + if !isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID) { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("invalid remote workspace adapter mailbox consumer") + } + if ackSequence < 0 { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("invalid remote workspace adapter mailbox ack sequence") + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + adapterSessionID := strings.TrimSpace(snapshot.AdapterSessionID) + s.mu.Lock() + defer s.mu.Unlock() + session := s.sessions[adapterSessionID] + if session == nil { + return RemoteWorkspaceAdapterMailboxSnapshot{}, fmt.Errorf("remote workspace adapter session not found") + } + if session.MailboxConsumers == nil { + session.MailboxConsumers = map[string]*remoteWorkspaceAdapterMailboxConsumerState{} + } + if reset { + if _, ok := session.MailboxConsumers[consumerID]; ok { + delete(session.MailboxConsumers, consumerID) + } + session.MailboxConsumerResetTotal++ + s.mailboxConsumerResetTotal++ + } + consumer := session.MailboxConsumers[consumerID] + created := false + evicted := false + if consumer == nil { + evicted = s.evictOldestMailboxConsumerLocked(session) + if evicted { + session.MailboxConsumerEvictedTotal++ + s.mailboxConsumerEvictedTotal++ + } + consumer = &remoteWorkspaceAdapterMailboxConsumerState{ + ID: consumerID, + CreatedAt: now, + } + session.MailboxConsumers[consumerID] = consumer + created = true + } + maxSequence := consumer.CheckpointSequence + for _, event := range snapshot.Events { + if event.Sequence > maxSequence { + maxSequence = event.Sequence + } + } + 
consumer.ReadTotal++ + consumer.LastReadAt = now + consumer.CheckpointSequence = maxSequence + if ackSequence > consumer.AckSequence { + consumer.AckSequence = ackSequence + } + if ackSequence > 0 { + consumer.AckTotal++ + consumer.LastAckAt = now + session.MailboxConsumerAckTotal++ + s.mailboxConsumerAckTotal++ + s.lastMailboxConsumerAckAt = now.Format(time.RFC3339Nano) + } + session.MailboxConsumerReadTotal++ + session.LastMailboxConsumerID = consumerID + session.LastMailboxConsumerReadAt = now + session.LastMailboxConsumerAckAt = consumer.LastAckAt + session.LastMailboxConsumerCheckpoint = consumer.CheckpointSequence + session.LastMailboxConsumerAck = consumer.AckSequence + s.mailboxConsumerReadTotal++ + s.lastMailboxConsumerID = consumerID + s.lastMailboxConsumerAdapterSessionID = adapterSessionID + s.lastMailboxConsumerReadAt = now.Format(time.RFC3339Nano) + s.lastMailboxConsumerCheckpoint = consumer.CheckpointSequence + s.lastMailboxConsumerAck = consumer.AckSequence + + snapshot.ConsumerID = consumerID + snapshot.ConsumerReadTotal = consumer.ReadTotal + snapshot.ConsumerAckTotal = consumer.AckTotal + snapshot.ConsumerResetTotal = session.MailboxConsumerResetTotal + snapshot.ConsumerEvictedTotal = session.MailboxConsumerEvictedTotal + snapshot.ConsumerCheckpointSequence = consumer.CheckpointSequence + snapshot.ConsumerAckSequence = consumer.AckSequence + snapshot.ConsumerLagCount = countMailboxEventsAfterSequence(session.Mailbox, consumer.AckSequence) + snapshot.ConsumerCount = len(session.MailboxConsumers) + snapshot.ConsumerCapacity = DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity + snapshot.ConsumerCreated = created + snapshot.ConsumerReset = reset + snapshot.ConsumerEvicted = evicted + snapshot.ConsumerCreatedAt = consumer.CreatedAt.Format(time.RFC3339Nano) + snapshot.ConsumerLastReadAt = consumer.LastReadAt.Format(time.RFC3339Nano) + if !consumer.LastAckAt.IsZero() { + snapshot.ConsumerLastAckAt = consumer.LastAckAt.Format(time.RFC3339Nano) + } + 
return snapshot, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) SnapshotAdapterSessionMailboxConsumers(adapterSessionID string, limit int, now time.Time) (RemoteWorkspaceAdapterMailboxConsumerSnapshot, error) { + adapterSessionID = strings.TrimSpace(adapterSessionID) + if !isValidRemoteWorkspaceAdapterSessionID(adapterSessionID) { + return RemoteWorkspaceAdapterMailboxConsumerSnapshot{}, fmt.Errorf("invalid remote workspace adapter session id") + } + if limit <= 0 { + limit = 50 + } + if limit > DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity { + limit = DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + if s == nil { + return RemoteWorkspaceAdapterMailboxConsumerSnapshot{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + session := s.sessions[adapterSessionID] + if session == nil { + return RemoteWorkspaceAdapterMailboxConsumerSnapshot{}, fmt.Errorf("remote workspace adapter session not found") + } + snapshot := RemoteWorkspaceAdapterMailboxConsumerSnapshot{ + SchemaVersion: "rap.remote_workspace_adapter_mailbox_consumer_snapshot.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + AdapterSessionID: adapterSessionID, + ObservedAt: now.Format(time.RFC3339Nano), + ConsumerCapacity: DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity, + ConsumerCount: len(session.MailboxConsumers), + ConsumerReadTotal: session.MailboxConsumerReadTotal, + ConsumerAckTotal: session.MailboxConsumerAckTotal, + ConsumerResetTotal: session.MailboxConsumerResetTotal, + ConsumerEvictedTotal: session.MailboxConsumerEvictedTotal, + MailboxDepth: len(session.Mailbox), + MailboxEnqueued: session.MailboxEnqueued, + MailboxDrained: session.MailboxDrained, + MailboxDropped: session.MailboxDropped, + } + consumerIDs := make([]string, 0, len(session.MailboxConsumers)) + for id := range 
session.MailboxConsumers { + consumerIDs = append(consumerIDs, id) + } + sort.Strings(consumerIDs) + for _, id := range consumerIDs { + if len(snapshot.Consumers) >= limit { + break + } + consumer := session.MailboxConsumers[id] + if consumer == nil { + continue + } + view := RemoteWorkspaceAdapterMailboxConsumer{ + ConsumerID: consumer.ID, + ReadTotal: consumer.ReadTotal, + AckTotal: consumer.AckTotal, + CheckpointSequence: consumer.CheckpointSequence, + AckSequence: consumer.AckSequence, + LagCount: countMailboxEventsAfterSequence(session.Mailbox, consumer.AckSequence), + } + if !consumer.CreatedAt.IsZero() { + view.CreatedAt = consumer.CreatedAt.Format(time.RFC3339Nano) + } + if !consumer.LastReadAt.IsZero() { + view.LastReadAt = consumer.LastReadAt.Format(time.RFC3339Nano) + } + if !consumer.LastAckAt.IsZero() { + view.LastAckAt = consumer.LastAckAt.Format(time.RFC3339Nano) + } + snapshot.Consumers = append(snapshot.Consumers, view) + } + return snapshot, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) ResolveAdapterSessionMailboxConsumerResume(adapterSessionID string, consumerID string, resumeFrom string, now time.Time) (int64, error) { + adapterSessionID = strings.TrimSpace(adapterSessionID) + consumerID = strings.TrimSpace(consumerID) + resumeFrom = strings.TrimSpace(strings.ToLower(resumeFrom)) + if !isValidRemoteWorkspaceAdapterSessionID(adapterSessionID) { + return 0, fmt.Errorf("invalid remote workspace adapter session id") + } + if !isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID) { + return 0, fmt.Errorf("invalid remote workspace adapter mailbox consumer") + } + if resumeFrom != "ack" && resumeFrom != "checkpoint" { + return 0, fmt.Errorf("invalid remote workspace adapter mailbox resume cursor") + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + if s == nil { + return 0, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + 
session := s.sessions[adapterSessionID] + if session == nil { + return 0, fmt.Errorf("remote workspace adapter session not found") + } + consumer := session.MailboxConsumers[consumerID] + if consumer == nil { + return 0, fmt.Errorf("remote workspace adapter mailbox consumer not found") + } + if resumeFrom == "checkpoint" { + return consumer.CheckpointSequence, nil + } + return consumer.AckSequence, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) PreflightAdapterSessionMailboxConsumerResume(adapterSessionID string, consumerID string, resumeFrom string, limit int, now time.Time) (RemoteWorkspaceAdapterMailboxPreflightSnapshot, error) { + adapterSessionID = strings.TrimSpace(adapterSessionID) + consumerID = strings.TrimSpace(consumerID) + resumeFrom = strings.TrimSpace(strings.ToLower(resumeFrom)) + if !isValidRemoteWorkspaceAdapterSessionID(adapterSessionID) { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, fmt.Errorf("invalid remote workspace adapter session id") + } + if !isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID) { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, fmt.Errorf("invalid remote workspace adapter mailbox consumer") + } + if resumeFrom == "" { + resumeFrom = "checkpoint" + } + if resumeFrom != "ack" && resumeFrom != "checkpoint" { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, fmt.Errorf("invalid remote workspace adapter mailbox resume cursor") + } + if limit <= 0 { + limit = 50 + } + if limit > DefaultRemoteWorkspaceAdapterMailboxCapacity { + limit = DefaultRemoteWorkspaceAdapterMailboxCapacity + } + if now.IsZero() { + now = time.Now().UTC() + } else { + now = now.UTC() + } + if s == nil { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, fmt.Errorf("remote workspace adapter probe sink unavailable") + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now) + session := s.sessions[adapterSessionID] + if session == nil { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, 
fmt.Errorf("remote workspace adapter session not found") + } + consumer := session.MailboxConsumers[consumerID] + if consumer == nil { + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{}, fmt.Errorf("remote workspace adapter mailbox consumer not found") + } + resumeSequence := consumer.AckSequence + if resumeFrom == "checkpoint" { + resumeSequence = consumer.CheckpointSequence + } + startIndex := 0 + for startIndex < len(session.Mailbox) && session.Mailbox[startIndex].Sequence <= resumeSequence { + startIndex++ + } + available := len(session.Mailbox) - startIndex + if available < 0 { + available = 0 + } + returned := available + if returned > limit { + returned = limit + } + var firstExpected int64 + var lastExpected int64 + if returned > 0 { + firstExpected = session.Mailbox[startIndex].Sequence + lastExpected = session.Mailbox[startIndex+returned-1].Sequence + } + return RemoteWorkspaceAdapterMailboxPreflightSnapshot{ + SchemaVersion: "rap.remote_workspace_adapter_mailbox_preflight.v1", + AdapterRuntimeID: RemoteWorkspaceFrameProbeSinkRuntimeID, + AdapterSessionID: adapterSessionID, + ObservedAt: now.Format(time.RFC3339Nano), + ReadOnly: true, + ConsumerID: consumerID, + ResumeFrom: resumeFrom, + ResumeSequence: resumeSequence, + AfterSequence: resumeSequence, + Limit: limit, + MailboxDepth: len(session.Mailbox), + MailboxEnqueued: session.MailboxEnqueued, + MailboxReadTotal: session.MailboxRead, + ConsumerReadTotal: session.MailboxConsumerReadTotal, + ConsumerAckTotal: session.MailboxConsumerAckTotal, + ConsumerCheckpointSequence: consumer.CheckpointSequence, + ConsumerAckSequence: consumer.AckSequence, + ConsumerLagCount: countMailboxEventsAfterSequence(session.Mailbox, consumer.AckSequence), + ExpectedAvailableCount: available, + ExpectedReturnedCount: returned, + ExpectedSkippedCount: startIndex, + FirstExpectedSequence: firstExpected, + LastExpectedSequence: lastExpected, + }, nil +} + +func (s *RemoteWorkspaceFrameProbeSink) 
evictOldestMailboxConsumerLocked(session *remoteWorkspaceAdapterProbeSession) bool { + if session == nil || len(session.MailboxConsumers) < DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity { + return false + } + oldestID := "" + var oldestAt time.Time + for id, consumer := range session.MailboxConsumers { + if consumer == nil { + delete(session.MailboxConsumers, id) + return true + } + at := consumer.LastReadAt + if at.IsZero() { + at = consumer.CreatedAt + } + if oldestID == "" || at.Before(oldestAt) || (at.Equal(oldestAt) && id < oldestID) { + oldestID = id + oldestAt = at + } + } + if oldestID != "" { + delete(session.MailboxConsumers, oldestID) + return true + } + return false +} + +func countMailboxEventsAfterSequence(events []RemoteWorkspaceAdapterMailboxEvent, sequence int64) int { + count := 0 + for _, event := range events { + if event.Sequence > sequence { + count++ + } + } + return count +} + +func countMailboxConsumersLocked(sessions map[string]*remoteWorkspaceAdapterProbeSession) int { + count := 0 + for _, session := range sessions { + if session != nil { + count += len(session.MailboxConsumers) + } + } + return count +} + +func remoteWorkspaceAdapterRuntimeReadinessLocked(s *RemoteWorkspaceFrameProbeSink, session *remoteWorkspaceAdapterProbeSession, now time.Time) map[string]any { + readiness := map[string]any{ + "schema_version": "rap.remote_workspace_adapter_runtime_readiness.v1", + "adapter_runtime_id": RemoteWorkspaceFrameProbeSinkRuntimeID, + "observed_at": now.UTC().Format(time.RFC3339Nano), + "probe_only": true, + "payload_traffic": "none", + "status": "idle", + "diagnostic_state": "waiting_for_session", + "ready": false, + "active_session_count": len(s.sessions), + "terminal_session_count": len(s.terminalSessions), + "mailbox_capacity": DefaultRemoteWorkspaceAdapterMailboxCapacity, + "consumer_capacity": DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity, + "mailbox_read_total": s.mailboxReadTotal, + "mailbox_resume_total": 
s.mailboxResumeReadTotal, + } + if session == nil { + if s.sequence > 0 { + readiness["last_adapter_session_id"] = s.last.AdapterSessionID + readiness["last_session_state"] = s.last.SessionState + readiness["diagnostic_state"] = "last_session_terminal_or_expired" + } + return readiness + } + + consumerLag := countMailboxEventsAfterSequence(session.Mailbox, session.LastMailboxConsumerAck) + status := "session_active" + diagnosticState := "waiting_for_consumer" + if session.State == "backpressure" { + status = "backpressure" + diagnosticState = "backpressure" + } else if len(session.MailboxConsumers) > 0 { + status = "cursor_ready" + diagnosticState = "adapter_cursor_ready" + } + readiness["status"] = status + readiness["diagnostic_state"] = diagnosticState + readiness["ready"] = len(session.MailboxConsumers) > 0 && session.State != "backpressure" + readiness["adapter_session_id"] = session.ID + readiness["session_state"] = session.State + readiness["mailbox_depth"] = len(session.Mailbox) + readiness["mailbox_enqueued_total"] = session.MailboxEnqueued + readiness["mailbox_read_total"] = session.MailboxRead + readiness["mailbox_resume_read_total"] = session.MailboxResumeRead + readiness["mailbox_after_sequence_read_total"] = session.MailboxAfterSequenceRead + readiness["mailbox_returned_total"] = session.MailboxReturnedTotal + readiness["mailbox_skipped_total"] = session.MailboxSkippedTotal + readiness["consumer_count"] = len(session.MailboxConsumers) + readiness["consumer_read_total"] = session.MailboxConsumerReadTotal + readiness["consumer_ack_total"] = session.MailboxConsumerAckTotal + readiness["last_consumer_id"] = session.LastMailboxConsumerID + readiness["last_consumer_checkpoint_sequence"] = session.LastMailboxConsumerCheckpoint + readiness["last_consumer_ack_sequence"] = session.LastMailboxConsumerAck + readiness["last_consumer_lag_count"] = consumerLag + readiness["last_resume_from"] = session.LastMailboxResumeFrom + readiness["last_resume_sequence"] = 
session.LastMailboxResumeSequence + readiness["last_resume_consumer_id"] = session.LastMailboxResumeConsumerID + readiness["last_after_sequence"] = session.LastMailboxAfterSequence + readiness["last_returned_count"] = session.LastMailboxReturnedCount + readiness["last_skipped_count"] = session.LastMailboxSkippedCount + if !session.LastActivityAt.IsZero() { + readiness["last_activity_at"] = session.LastActivityAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxReadAt.IsZero() { + readiness["last_mailbox_read_at"] = session.LastMailboxReadAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxConsumerReadAt.IsZero() { + readiness["last_consumer_read_at"] = session.LastMailboxConsumerReadAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxConsumerAckAt.IsZero() { + readiness["last_consumer_ack_at"] = session.LastMailboxConsumerAckAt.Format(time.RFC3339Nano) + } + return readiness +} + +func remoteWorkspaceAdapterSessionView(session remoteWorkspaceAdapterProbeSession) RemoteWorkspaceAdapterSessionView { + view := RemoteWorkspaceAdapterSessionView{ + AdapterSessionID: session.ID, + SessionState: session.State, + CreatedAt: session.CreatedAt.Format(time.RFC3339Nano), + LastActivityAt: session.LastActivityAt.Format(time.RFC3339Nano), + DeliveryCount: session.DeliveryCount, + BackpressureCount: session.BackpressureCount, + AcceptedFrames: session.AcceptedFrames, + DroppedFrames: session.DroppedFrames, + AckedFrames: session.AckedFrames, + MailboxCapacity: DefaultRemoteWorkspaceAdapterMailboxCapacity, + MailboxDepth: len(session.Mailbox), + MailboxEnqueued: session.MailboxEnqueued, + MailboxDrained: session.MailboxDrained, + MailboxDropped: session.MailboxDropped, + MailboxRead: session.MailboxRead, + MailboxWait: session.MailboxWait, + MailboxWaitTimeout: session.MailboxWaitTimeout, + MailboxEmptyRead: session.MailboxEmptyRead, + MailboxResumeRead: session.MailboxResumeRead, + MailboxAfterSequenceRead: session.MailboxAfterSequenceRead, + MailboxReturnedTotal: 
session.MailboxReturnedTotal, + MailboxSkippedTotal: session.MailboxSkippedTotal, + MailboxConsumerCount: len(session.MailboxConsumers), + MailboxConsumerRead: session.MailboxConsumerReadTotal, + MailboxConsumerAck: session.MailboxConsumerAckTotal, + MailboxConsumerReset: session.MailboxConsumerResetTotal, + MailboxConsumerEvicted: session.MailboxConsumerEvictedTotal, + LastMailboxConsumerID: session.LastMailboxConsumerID, + LastMailboxConsumerCheckpoint: session.LastMailboxConsumerCheckpoint, + LastMailboxConsumerAck: session.LastMailboxConsumerAck, + LastMailboxWaitMs: session.LastMailboxWaitMs, + LastMailboxWaited: session.LastMailboxWaited, + LastMailboxWaitTimeout: session.LastMailboxTimeout, + LastMailboxEmpty: session.LastMailboxEmpty, + LastMailboxResumeFrom: session.LastMailboxResumeFrom, + LastMailboxResumeSequence: session.LastMailboxResumeSequence, + LastMailboxResumeConsumerID: session.LastMailboxResumeConsumerID, + LastMailboxAfterSequence: session.LastMailboxAfterSequence, + LastMailboxSkippedCount: session.LastMailboxSkippedCount, + LastMailboxReturnedCount: session.LastMailboxReturnedCount, + ChannelID: session.LastChannelID, + ResourceID: session.LastResourceID, + RouteID: session.LastRouteID, + } + if !session.BoundAt.IsZero() { + view.BoundAt = session.BoundAt.Format(time.RFC3339Nano) + } + if !session.LastBackpressureAt.IsZero() { + view.LastBackpressureAt = session.LastBackpressureAt.Format(time.RFC3339Nano) + view.LastBackpressureReason = session.LastReason + } + if !session.LastMailboxReadAt.IsZero() { + view.LastMailboxReadAt = session.LastMailboxReadAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxConsumerReadAt.IsZero() { + view.LastMailboxConsumerReadAt = session.LastMailboxConsumerReadAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxConsumerAckAt.IsZero() { + view.LastMailboxConsumerAckAt = session.LastMailboxConsumerAckAt.Format(time.RFC3339Nano) + } + return view +} + +func (s *RemoteWorkspaceFrameProbeSink) Report(now 
time.Time) map[string]any { + report := map[string]any{ + "schema_version": "rap.remote_workspace_adapter_sink_report.v1", + "sink": RemoteWorkspaceFrameProbeSinkRuntimeID, + "adapter_runtime_id": RemoteWorkspaceFrameProbeSinkRuntimeID, + "status": "ready", + "session_state": "idle", + "probe_only": true, + "payload_traffic": "none", + "observed_at": now.UTC().Format(time.RFC3339Nano), + } + if s == nil { + report["status"] = "unavailable" + return report + } + s.mu.Lock() + defer s.mu.Unlock() + s.expireIdleSessionsLocked(now.UTC()) + queueCapacity := s.queueCapacity + if queueCapacity <= 0 { + queueCapacity = DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity + } + report["delivery_count"] = s.sequence + report["queue_capacity"] = queueCapacity + report["queue_depth"] = 0 + report["total_accepted_frames"] = s.acceptedFramesTotal + report["total_dropped_frames"] = s.droppedFramesTotal + report["total_acked_frames"] = s.ackedFramesTotal + report["backpressure_count"] = s.backpressureCount + report["session_ttl_seconds"] = int64(s.sessionTTL.Seconds()) + report["active_session_count"] = len(s.sessions) + report["session_created_total"] = s.sessionCreatedTotal + report["session_bound_total"] = s.sessionBoundTotal + report["session_backpressure_total"] = s.sessionBackpressureTotal + report["session_expired_total"] = s.sessionExpiredTotal + report["session_closed_total"] = s.sessionClosedTotal + report["session_reset_total"] = s.sessionResetTotal + report["session_control_total"] = s.sessionControlTotal + report["mailbox_capacity"] = DefaultRemoteWorkspaceAdapterMailboxCapacity + report["mailbox_enqueued_total"] = s.mailboxEnqueuedTotal + report["mailbox_drained_total"] = s.mailboxDrainedTotal + report["mailbox_dropped_total"] = s.mailboxDroppedTotal + report["mailbox_read_total"] = s.mailboxReadTotal + report["mailbox_wait_total"] = s.mailboxWaitTotal + report["mailbox_wait_timeout_total"] = s.mailboxWaitTimeoutTotal + report["mailbox_empty_read_total"] = 
s.mailboxEmptyReadTotal + report["mailbox_resume_read_total"] = s.mailboxResumeReadTotal + report["mailbox_after_sequence_read_total"] = s.mailboxAfterSequenceReadTotal + report["mailbox_returned_total"] = s.mailboxReturnedTotal + report["mailbox_skipped_total"] = s.mailboxSkippedTotal + report["mailbox_consumer_capacity"] = DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity + report["mailbox_consumer_count"] = countMailboxConsumersLocked(s.sessions) + report["mailbox_consumer_read_total"] = s.mailboxConsumerReadTotal + report["mailbox_consumer_ack_total"] = s.mailboxConsumerAckTotal + report["mailbox_consumer_reset_total"] = s.mailboxConsumerResetTotal + report["mailbox_consumer_evicted_total"] = s.mailboxConsumerEvictedTotal + if s.mailboxReadTotal > 0 { + report["last_mailbox_read_at"] = s.lastMailboxReadAt + report["last_mailbox_adapter_session_id"] = s.lastMailboxAdapterSessionID + report["last_mailbox_wait_ms"] = s.lastMailboxWaitMs + report["last_mailbox_waited"] = s.lastMailboxWaited + report["last_mailbox_wait_timeout"] = s.lastMailboxWaitTimeout + report["last_mailbox_empty"] = s.lastMailboxEmpty + report["last_mailbox_after_sequence"] = s.lastMailboxAfterSequence + report["last_mailbox_skipped_count"] = s.lastMailboxSkippedCount + report["last_mailbox_returned_count"] = s.lastMailboxReturnedCount + } + if s.mailboxResumeReadTotal > 0 { + report["last_mailbox_resume_from"] = s.lastMailboxResumeFrom + report["last_mailbox_resume_sequence"] = s.lastMailboxResumeSequence + report["last_mailbox_resume_consumer_id"] = s.lastMailboxResumeConsumerID + } + if s.mailboxConsumerReadTotal > 0 { + report["last_mailbox_consumer_id"] = s.lastMailboxConsumerID + report["last_mailbox_consumer_read_at"] = s.lastMailboxConsumerReadAt + report["last_mailbox_consumer_adapter_session_id"] = s.lastMailboxConsumerAdapterSessionID + report["last_mailbox_consumer_checkpoint_sequence"] = s.lastMailboxConsumerCheckpoint + report["last_mailbox_consumer_ack_sequence"] = 
s.lastMailboxConsumerAck + } + if s.mailboxConsumerAckTotal > 0 { + report["last_mailbox_consumer_ack_at"] = s.lastMailboxConsumerAckAt + } + var currentSession *remoteWorkspaceAdapterProbeSession + if s.sequence > 0 { + session := s.sessions[s.last.AdapterSessionID] + currentSession = session + report["session_state"] = s.last.SessionState + report["current_adapter_session_id"] = s.last.AdapterSessionID + report["current_channel_id"] = s.last.ChannelID + report["current_resource_id"] = s.last.ResourceID + report["last_delivery"] = s.last + report["last_delivery_sequence"] = s.last.DeliverySequence + report["last_frame_count"] = s.last.FrameCount + report["last_queue_capacity"] = s.last.QueueCapacity + report["last_queue_depth"] = s.last.QueueDepth + report["last_accepted_frames"] = s.last.AcceptedFrames + report["last_dropped_frames"] = s.last.DroppedFrames + report["last_acked_frames"] = s.last.AckedFrames + report["last_backpressure"] = s.last.Backpressure + report["drop_policy"] = s.last.DropPolicy + report["last_channel_class"] = s.last.ChannelClass + report["last_adapter_contract_id"] = s.last.AdapterContractID + report["last_adapter_session_id"] = s.last.AdapterSessionID + if session != nil { + report["current_session_lifecycle_state"] = session.State + report["current_session_created_at"] = session.CreatedAt.Format(time.RFC3339Nano) + if !session.BoundAt.IsZero() { + report["current_session_bound_at"] = session.BoundAt.Format(time.RFC3339Nano) + } + report["current_session_last_activity_at"] = session.LastActivityAt.Format(time.RFC3339Nano) + report["current_session_delivery_count"] = session.DeliveryCount + report["current_session_backpressure_count"] = session.BackpressureCount + report["current_session_accepted_frames"] = session.AcceptedFrames + report["current_session_dropped_frames"] = session.DroppedFrames + report["current_session_acked_frames"] = session.AckedFrames + report["current_session_mailbox_depth"] = len(session.Mailbox) + report["current_session_mailbox_enqueued_total"] = 
session.MailboxEnqueued + report["current_session_mailbox_drained_total"] = session.MailboxDrained + report["current_session_mailbox_dropped_total"] = session.MailboxDropped + report["current_session_mailbox_read_total"] = session.MailboxRead + report["current_session_mailbox_wait_total"] = session.MailboxWait + report["current_session_mailbox_wait_timeout_total"] = session.MailboxWaitTimeout + report["current_session_mailbox_empty_read_total"] = session.MailboxEmptyRead + report["current_session_mailbox_resume_read_total"] = session.MailboxResumeRead + report["current_session_mailbox_after_sequence_read_total"] = session.MailboxAfterSequenceRead + report["current_session_mailbox_returned_total"] = session.MailboxReturnedTotal + report["current_session_mailbox_skipped_total"] = session.MailboxSkippedTotal + report["current_session_mailbox_consumer_count"] = len(session.MailboxConsumers) + report["current_session_mailbox_consumer_read_total"] = session.MailboxConsumerReadTotal + report["current_session_mailbox_consumer_ack_total"] = session.MailboxConsumerAckTotal + report["current_session_mailbox_consumer_reset_total"] = session.MailboxConsumerResetTotal + report["current_session_mailbox_consumer_evicted_total"] = session.MailboxConsumerEvictedTotal + if session.MailboxConsumerReadTotal > 0 { + report["current_session_last_mailbox_consumer_id"] = session.LastMailboxConsumerID + report["current_session_last_mailbox_consumer_read_at"] = session.LastMailboxConsumerReadAt.Format(time.RFC3339Nano) + report["current_session_last_mailbox_consumer_checkpoint_sequence"] = session.LastMailboxConsumerCheckpoint + report["current_session_last_mailbox_consumer_ack_sequence"] = session.LastMailboxConsumerAck + } + if session.MailboxConsumerAckTotal > 0 { + report["current_session_last_mailbox_consumer_ack_at"] = session.LastMailboxConsumerAckAt.Format(time.RFC3339Nano) + } + if !session.LastMailboxReadAt.IsZero() { + report["current_session_last_mailbox_read_at"] = 
session.LastMailboxReadAt.Format(time.RFC3339Nano) + report["current_session_last_mailbox_wait_ms"] = session.LastMailboxWaitMs + report["current_session_last_mailbox_waited"] = session.LastMailboxWaited + report["current_session_last_mailbox_wait_timeout"] = session.LastMailboxTimeout + report["current_session_last_mailbox_empty"] = session.LastMailboxEmpty + report["current_session_last_mailbox_after_sequence"] = session.LastMailboxAfterSequence + report["current_session_last_mailbox_skipped_count"] = session.LastMailboxSkippedCount + report["current_session_last_mailbox_returned_count"] = session.LastMailboxReturnedCount + } + if session.MailboxResumeRead > 0 { + report["current_session_last_mailbox_resume_from"] = session.LastMailboxResumeFrom + report["current_session_last_mailbox_resume_sequence"] = session.LastMailboxResumeSequence + report["current_session_last_mailbox_resume_consumer_id"] = session.LastMailboxResumeConsumerID + } + if !session.LastBackpressureAt.IsZero() { + report["current_session_last_backpressure_at"] = session.LastBackpressureAt.Format(time.RFC3339Nano) + report["current_session_last_backpressure_reason"] = session.LastReason + } + } + } + if s.backpressureCount > 0 { + report["last_backpressure_at"] = s.lastBackpressureAt + report["last_backpressure_reason"] = s.lastBackpressureReason + report["last_rejected_frame_count"] = s.lastRejectedFrameCount + report["last_rejected_adapter_session_id"] = s.lastRejectedAdapterSessionID + report["last_rejected_channel_class"] = s.lastRejectedChannelClass + report["last_rejected_adapter_contract_id"] = s.lastRejectedAdapterContractID + report["last_rejected_queue_capacity"] = s.lastRejectedQueueCapacity + report["last_rejected_queue_depth"] = s.lastRejectedQueueDepth + } + if s.sessionControlTotal > 0 { + report["last_session_control"] = s.lastControl + report["last_controlled_adapter_session_id"] = s.lastControl.AdapterSessionID + report["last_session_control_action"] = s.lastControl.Action + 
report["last_session_control_state"] = s.lastControl.SessionState + report["last_session_control_at"] = s.lastControl.ControlledAt + } + report["adapter_runtime_readiness"] = remoteWorkspaceAdapterRuntimeReadinessLocked(s, currentSession, now.UTC()) + return report +} diff --git a/agents/rap-node-agent/internal/mesh/server.go b/agents/rap-node-agent/internal/mesh/server.go index b0f8b45..6c04211 100644 --- a/agents/rap-node-agent/internal/mesh/server.go +++ b/agents/rap-node-agent/internal/mesh/server.go @@ -1,23 +1,85 @@ package mesh import ( + "bytes" "context" + "crypto/sha256" + "encoding/base64" + "encoding/binary" + "encoding/hex" "encoding/json" + "fmt" + "io" "net/http" + "net/http/httputil" + "net/url" + "strconv" + "strings" "time" + + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/authority" + "github.com/gorilla/websocket" ) type ProductionEnvelopeObserver func(context.Context, ProductionEnvelopeObservation) error +type ProductionEnvelopeDelivery func(context.Context, ProductionEnvelope) error type ProductionForwardLogger func(ProductionForwardLogEntry) +type FabricServiceChannelAccessLogger func(FabricServiceChannelAccessLogEntry) +type RemoteWorkspaceFrameSink interface { + AcceptRemoteWorkspaceFrameBatchProbe(context.Context, RemoteWorkspaceFrameBatchDelivery) (RemoteWorkspaceFrameBatchDeliveryReceipt, error) +} +type RemoteWorkspaceFrameSinkSessionControl interface { + ControlAdapterSession(action string, adapterSessionID string, reason string, now time.Time) (RemoteWorkspaceAdapterSessionControlResult, error) +} +type RemoteWorkspaceFrameSinkSessionSnapshot interface { + SnapshotAdapterSessions(includeTerminal bool, limit int, now time.Time) RemoteWorkspaceAdapterSessionSnapshot +} +type RemoteWorkspaceFrameSinkSessionMailbox interface { + ReadAdapterSessionMailbox(adapterSessionID string, drain bool, limit int, afterSequence int64, now time.Time) (RemoteWorkspaceAdapterMailboxSnapshot, error) +} +type 
RemoteWorkspaceFrameSinkSessionMailboxTelemetry interface { + RecordAdapterSessionMailboxRead(snapshot RemoteWorkspaceAdapterMailboxSnapshot, now time.Time) +} +type RemoteWorkspaceFrameSinkSessionMailboxConsumer interface { + RecordAdapterSessionMailboxConsumerRead(snapshot RemoteWorkspaceAdapterMailboxSnapshot, consumerID string, ackSequence int64, reset bool, now time.Time) (RemoteWorkspaceAdapterMailboxSnapshot, error) +} +type RemoteWorkspaceFrameSinkSessionMailboxConsumerSnapshot interface { + SnapshotAdapterSessionMailboxConsumers(adapterSessionID string, limit int, now time.Time) (RemoteWorkspaceAdapterMailboxConsumerSnapshot, error) +} +type RemoteWorkspaceFrameSinkSessionMailboxConsumerResume interface { + ResolveAdapterSessionMailboxConsumerResume(adapterSessionID string, consumerID string, resumeFrom string, now time.Time) (int64, error) +} +type RemoteWorkspaceFrameSinkSessionMailboxPreflight interface { + PreflightAdapterSessionMailboxConsumerResume(adapterSessionID string, consumerID string, resumeFrom string, limit int, now time.Time) (RemoteWorkspaceAdapterMailboxPreflightSnapshot, error) +} +type VPNPacketIngress interface { + SendClientPacketBatch(ctx context.Context, clusterID string, vpnConnectionID string, packets [][]byte) error + ReceiveClientPacketBatch(ctx context.Context, clusterID string, vpnConnectionID string, timeout time.Duration) ([][]byte, error) +} + +type VPNPacketIngressTrafficClass interface { + SendClientPacketBatchWithTrafficClass(ctx context.Context, clusterID string, vpnConnectionID string, trafficClass string, packets [][]byte) error +} + +type VPNPacketIngressRoutePreference interface { + PreferClientRoute(routeID string) +} type Server struct { Local PeerIdentity SyntheticRuntime *SyntheticRuntime ProductionForwardingEnabled bool ProductionEnvelopeObserver ProductionEnvelopeObserver + ProductionEnvelopeDelivery ProductionEnvelopeDelivery ProductionForwardTransport ProductionForwardTransport ProductionForwardLogger 
ProductionForwardLogger + FabricServiceChannelLogger FabricServiceChannelAccessLogger + RemoteWorkspaceFrameSink RemoteWorkspaceFrameSink ProductionRoutes []SyntheticRoute + VPNPacketIngress VPNPacketIngress + BackendProxyBaseURL string + ClusterAuthorityPublicKey string + ServiceChannelIntrospection bool } func (s Server) Handler() http.Handler { @@ -25,9 +87,1693 @@ func (s Server) Handler() http.Handler { mux.HandleFunc("/mesh/v1/health", s.handleHealth) mux.HandleFunc("/mesh/v1/forward", s.handleForward) mux.HandleFunc("/mesh/v1/synthetic/probe", s.handleSyntheticProbe) + if s.RemoteWorkspaceFrameSink != nil { + mux.HandleFunc("/mesh/v1/remote-workspace/adapter-sessions/", s.handleRemoteWorkspaceAdapterSessionControl) + } + if s.VPNPacketIngress != nil || s.BackendProxyBaseURL != "" { + mux.HandleFunc("/api/v1/clusters/", func(w http.ResponseWriter, r *http.Request) { + if s.handleFabricServiceChannelRemoteWorkspaceIngress(w, r) { + return + } + if s.VPNPacketIngress != nil && s.handleFabricServiceChannelVPNPacketIngress(w, r) { + return + } + if s.VPNPacketIngress != nil && s.handleVPNPacketIngress(w, r) { + return + } + if s.BackendProxyBaseURL != "" { + s.backendProxy().ServeHTTP(w, r) + return + } + http.NotFound(w, r) + }) + } + if s.BackendProxyBaseURL != "" { + proxy := s.backendProxy() + mux.Handle("/api/v1/", proxy) + mux.Handle("/api/v1", proxy) + mux.Handle("/downloads/", proxy) + mux.Handle("/downloads", proxy) + } return mux } +func (s Server) handleRemoteWorkspaceAdapterSessionControl(w http.ResponseWriter, r *http.Request) { + if r.Method == http.MethodGet && isRemoteWorkspaceAdapterSessionListPath(r.URL.Path) { + s.handleRemoteWorkspaceAdapterSessionSnapshot(w, r) + return + } + if r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/mailbox/consumers") { + s.handleRemoteWorkspaceAdapterSessionMailboxConsumers(w, r) + return + } + if r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/mailbox/preflight") { + 
s.handleRemoteWorkspaceAdapterSessionMailboxPreflight(w, r)
+		return
+	}
+	if r.Method == http.MethodGet && strings.HasSuffix(r.URL.Path, "/mailbox") {
+		s.handleRemoteWorkspaceAdapterSessionMailbox(w, r)
+		return
+	}
+	controller, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionControl)
+	if !ok {
+		http.Error(w, "remote workspace adapter session control unavailable", http.StatusServiceUnavailable)
+		return
+	}
+	sessionID, ok := parseRemoteWorkspaceAdapterSessionControlPath(r.URL.Path)
+	if !ok {
+		http.NotFound(w, r)
+		return
+	}
+	if r.Method != http.MethodPost {
+		w.WriteHeader(http.StatusMethodNotAllowed)
+		return
+	}
+	var request struct {
+		Action string `json:"action"`
+		Reason string `json:"reason,omitempty"`
+	}
+	if err := json.NewDecoder(http.MaxBytesReader(w, r.Body, 16*1024)).Decode(&request); err != nil {
+		http.Error(w, "invalid remote workspace adapter session control payload", http.StatusBadRequest)
+		return
+	}
+	result, err := controller.ControlAdapterSession(request.Action, sessionID, request.Reason, time.Now().UTC())
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(result)
+}
+
+func (s Server) handleRemoteWorkspaceAdapterSessionSnapshot(w http.ResponseWriter, r *http.Request) {
+	snapshotter, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionSnapshot)
+	if !ok {
+		http.Error(w, "remote workspace adapter session snapshot unavailable", http.StatusServiceUnavailable)
+		return
+	}
+	includeTerminal := strings.EqualFold(r.URL.Query().Get("include_terminal"), "true")
+	limit := 50
+	if rawLimit := strings.TrimSpace(r.URL.Query().Get("limit")); rawLimit != "" {
+		parsed, err := strconv.Atoi(rawLimit)
+		if err != nil || parsed <= 0 {
+			http.Error(w, "invalid remote workspace adapter session snapshot limit", http.StatusBadRequest)
+			return
+		}
+		limit = parsed
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(snapshotter.SnapshotAdapterSessions(includeTerminal, limit, time.Now().UTC()))
+}
+
+func (s Server) handleRemoteWorkspaceAdapterSessionMailbox(w http.ResponseWriter, r *http.Request) {
+	reader, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailbox)
+	if !ok {
+		http.Error(w, "remote workspace adapter session mailbox unavailable", http.StatusServiceUnavailable)
+		return
+	}
+	sessionID, ok := parseRemoteWorkspaceAdapterSessionMailboxPath(r.URL.Path)
+	if !ok {
+		http.NotFound(w, r)
+		return
+	}
+	drain := strings.EqualFold(r.URL.Query().Get("drain"), "true")
+	limit := 50
+	if rawLimit := strings.TrimSpace(r.URL.Query().Get("limit")); rawLimit != "" {
+		parsed, err := strconv.Atoi(rawLimit)
+		if err != nil || parsed <= 0 {
+			http.Error(w, "invalid remote workspace adapter session mailbox limit", http.StatusBadRequest)
+			return
+		}
+		limit = parsed
+	}
+	waitMs := 0
+	if rawWait := strings.TrimSpace(r.URL.Query().Get("wait_ms")); rawWait != "" {
+		parsed, err := strconv.Atoi(rawWait)
+		if err != nil || parsed < 0 {
+			http.Error(w, "invalid remote workspace adapter session mailbox wait", http.StatusBadRequest)
+			return
+		}
+		if parsed > 30000 {
+			parsed = 30000
+		}
+		waitMs = parsed
+	}
+	afterSequence := int64(0)
+	if rawAfter := strings.TrimSpace(r.URL.Query().Get("after_sequence")); rawAfter != "" {
+		parsed, err := strconv.ParseInt(rawAfter, 10, 64)
+		if err != nil || parsed < 0 {
+			http.Error(w, "invalid remote workspace adapter session mailbox after sequence", http.StatusBadRequest)
+			return
+		}
+		afterSequence = parsed
+	}
+	if drain && afterSequence > 0 {
+		http.Error(w, "remote workspace adapter session mailbox after sequence cannot drain", http.StatusBadRequest)
+		return
+	}
+	consumerID := strings.TrimSpace(r.URL.Query().Get("consumer_id"))
+	if consumerID != "" && !isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID) {
+		http.Error(w, "invalid remote workspace adapter mailbox consumer", http.StatusBadRequest)
+		return
+	}
+	resumeFrom := strings.TrimSpace(strings.ToLower(r.URL.Query().Get("resume_from")))
+	if resumeFrom != "" {
+		switch resumeFrom {
+		case "ack", "checkpoint":
+		default:
+			http.Error(w, "invalid remote workspace adapter mailbox resume cursor", http.StatusBadRequest)
+			return
+		}
+		if consumerID == "" {
+			http.Error(w, "remote workspace adapter mailbox consumer required for resume", http.StatusBadRequest)
+			return
+		}
+		if afterSequence > 0 {
+			http.Error(w, "remote workspace adapter mailbox resume cannot combine with after sequence", http.StatusBadRequest)
+			return
+		}
+		if drain {
+			http.Error(w, "remote workspace adapter mailbox resume cannot drain", http.StatusBadRequest)
+			return
+		}
+	}
+	resetConsumer := false
+	if rawReset := strings.TrimSpace(r.URL.Query().Get("reset_consumer")); rawReset != "" {
+		switch strings.ToLower(rawReset) {
+		case "true":
+			resetConsumer = true
+		case "false":
+		default:
+			http.Error(w, "invalid remote workspace adapter mailbox consumer reset", http.StatusBadRequest)
+			return
+		}
+	}
+	if resetConsumer && consumerID == "" {
+		http.Error(w, "remote workspace adapter mailbox consumer required for reset", http.StatusBadRequest)
+		return
+	}
+	if resetConsumer && resumeFrom != "" {
+		http.Error(w, "remote workspace adapter mailbox resume cannot reset consumer", http.StatusBadRequest)
+		return
+	}
+	ackSequence := int64(0)
+	if rawAck := strings.TrimSpace(r.URL.Query().Get("ack_sequence")); rawAck != "" {
+		parsed, err := strconv.ParseInt(rawAck, 10, 64)
+		if err != nil || parsed < 0 {
+			http.Error(w, "invalid remote workspace adapter mailbox ack sequence", http.StatusBadRequest)
+			return
+		}
+		ackSequence = parsed
+	}
+	resumeSequence := int64(0)
+	if resumeFrom != "" {
+		resolver, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailboxConsumerResume)
+		if !ok {
+			http.Error(w, "remote workspace adapter mailbox resume unavailable", http.StatusServiceUnavailable)
+			return
+		}
+		resolved, err := resolver.ResolveAdapterSessionMailboxConsumerResume(sessionID, consumerID, resumeFrom, time.Now().UTC())
+		if err != nil {
+			http.Error(w, err.Error(), http.StatusBadRequest)
+			return
+		}
+		afterSequence = resolved
+		resumeSequence = resolved
+	}
+	mailbox, err := s.readRemoteWorkspaceAdapterSessionMailbox(r.Context(), reader, sessionID, drain, limit, afterSequence, waitMs)
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	if resumeFrom != "" {
+		mailbox.ResumeFrom = resumeFrom
+		mailbox.ResumeSequence = resumeSequence
+	}
+	if consumerID != "" {
+		mailbox.ConsumerID = consumerID
+	}
+	if recorder, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailboxTelemetry); ok {
+		recorder.RecordAdapterSessionMailboxRead(mailbox, time.Now().UTC())
+	}
+	if consumerID != "" {
+		recorder, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailboxConsumer)
+		if !ok {
+			http.Error(w, "remote workspace adapter mailbox consumer unavailable", http.StatusServiceUnavailable)
+			return
+		}
+		mailbox, err = recorder.RecordAdapterSessionMailboxConsumerRead(mailbox, consumerID, ackSequence, resetConsumer, time.Now().UTC())
+		if err != nil {
+			http.Error(w, err.Error(), http.StatusBadRequest)
+			return
+		}
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(mailbox)
+}
+
+func (s Server) handleRemoteWorkspaceAdapterSessionMailboxConsumers(w http.ResponseWriter, r *http.Request) {
+	snapshotter, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailboxConsumerSnapshot)
+	if !ok {
+		http.Error(w, "remote workspace adapter mailbox consumer snapshot unavailable", http.StatusServiceUnavailable)
+		return
+	}
+	sessionID, ok := parseRemoteWorkspaceAdapterSessionMailboxConsumersPath(r.URL.Path)
+	if !ok {
+		http.NotFound(w, r)
+		return
+	}
+	limit := 50
+	if rawLimit := strings.TrimSpace(r.URL.Query().Get("limit")); rawLimit != "" {
+		parsed, err := strconv.Atoi(rawLimit)
+		if err != nil || parsed <= 0 {
+			http.Error(w, "invalid remote workspace adapter mailbox consumer snapshot limit", http.StatusBadRequest)
+			return
+		}
+		limit = parsed
+	}
+	snapshot, err := snapshotter.SnapshotAdapterSessionMailboxConsumers(sessionID, limit, time.Now().UTC())
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(snapshot)
+}
+
+func (s Server) handleRemoteWorkspaceAdapterSessionMailboxPreflight(w http.ResponseWriter, r *http.Request) {
+	preflighter, ok := s.RemoteWorkspaceFrameSink.(RemoteWorkspaceFrameSinkSessionMailboxPreflight)
+	if !ok {
+		http.Error(w, "remote workspace adapter mailbox preflight unavailable", http.StatusServiceUnavailable)
+		return
+	}
+	sessionID, ok := parseRemoteWorkspaceAdapterSessionMailboxPreflightPath(r.URL.Path)
+	if !ok {
+		http.NotFound(w, r)
+		return
+	}
+	consumerID := strings.TrimSpace(r.URL.Query().Get("consumer_id"))
+	if consumerID == "" {
+		http.Error(w, "remote workspace adapter mailbox consumer required for preflight", http.StatusBadRequest)
+		return
+	}
+	if !isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID) {
+		http.Error(w, "invalid remote workspace adapter mailbox consumer", http.StatusBadRequest)
+		return
+	}
+	resumeFrom := strings.TrimSpace(strings.ToLower(r.URL.Query().Get("resume_from")))
+	if resumeFrom == "" {
+		resumeFrom = "checkpoint"
+	}
+	if resumeFrom != "ack" && resumeFrom != "checkpoint" {
+		http.Error(w, "invalid remote workspace adapter mailbox resume cursor", http.StatusBadRequest)
+		return
+	}
+	limit := 50
+	if rawLimit := strings.TrimSpace(r.URL.Query().Get("limit")); rawLimit != "" {
+		parsed, err := strconv.Atoi(rawLimit)
+		if err != nil || parsed <= 0 {
+			http.Error(w, "invalid remote workspace adapter mailbox preflight limit", http.StatusBadRequest)
+			return
+		}
+		limit = parsed
+	}
+	snapshot, err := preflighter.PreflightAdapterSessionMailboxConsumerResume(sessionID, consumerID, resumeFrom, limit, time.Now().UTC())
+	if err != nil {
+		http.Error(w, err.Error(), http.StatusBadRequest)
+		return
+	}
+	w.Header().Set("Content-Type", "application/json")
+	_ = json.NewEncoder(w).Encode(snapshot)
+}
+
+func (s Server) readRemoteWorkspaceAdapterSessionMailbox(ctx context.Context, reader RemoteWorkspaceFrameSinkSessionMailbox, sessionID string, drain bool, limit int, afterSequence int64, waitMs int) (RemoteWorkspaceAdapterMailboxSnapshot, error) {
+	if waitMs <= 0 {
+		return reader.ReadAdapterSessionMailbox(sessionID, drain, limit, afterSequence, time.Now().UTC())
+	}
+	deadline := time.Now().UTC().Add(time.Duration(waitMs) * time.Millisecond)
+	waited := false
+	for {
+		mailbox, err := reader.ReadAdapterSessionMailbox(sessionID, drain, limit, afterSequence, time.Now().UTC())
+		if err != nil {
+			return mailbox, err
+		}
+		if mailbox.ReturnedCount > 0 {
+			mailbox.Waited = waited
+			mailbox.WaitMs = waitMs
+			return mailbox, nil
+		}
+		now := time.Now().UTC()
+		if !now.Before(deadline) {
+			mailbox.Waited = true
+			mailbox.WaitTimeout = true
+			mailbox.WaitMs = waitMs
+			return mailbox, nil
+		}
+		waited = true
+		sleepFor := 25 * time.Millisecond
+		if remaining := time.Until(deadline); remaining < sleepFor {
+			sleepFor = remaining
+		}
+		timer := time.NewTimer(sleepFor)
+		select {
+		case <-ctx.Done():
+			timer.Stop()
+			return RemoteWorkspaceAdapterMailboxSnapshot{}, ctx.Err()
+		case <-timer.C:
+		}
+	}
+}
+
+func isRemoteWorkspaceAdapterSessionListPath(path string) bool {
+	return path == "/mesh/v1/remote-workspace/adapter-sessions" || path == "/mesh/v1/remote-workspace/adapter-sessions/"
+}
+
+func parseRemoteWorkspaceAdapterSessionMailboxPath(path string) (string, bool) {
+	const prefix = "/mesh/v1/remote-workspace/adapter-sessions/"
+	if !strings.HasPrefix(path, prefix) || !strings.HasSuffix(path, "/mailbox") {
+		return "", false
+	}
+	sessionID := strings.TrimSuffix(strings.TrimPrefix(path, prefix), "/mailbox")
+	sessionID = strings.Trim(sessionID, "/")
+	if strings.TrimSpace(sessionID) == "" || strings.Contains(sessionID, "/") {
+		return "", false
+	}
+	return sessionID, true
+}
+
+func parseRemoteWorkspaceAdapterSessionMailboxConsumersPath(path string) (string, bool) {
+	const prefix = "/mesh/v1/remote-workspace/adapter-sessions/"
+	const suffix = "/mailbox/consumers"
+	if !strings.HasPrefix(path, prefix) || !strings.HasSuffix(path, suffix) {
+		return "", false
+	}
+	sessionID := strings.TrimSuffix(strings.TrimPrefix(path, prefix), suffix)
+	sessionID = strings.Trim(sessionID, "/")
+	if strings.TrimSpace(sessionID) == "" || strings.Contains(sessionID, "/") {
+		return "", false
+	}
+	return sessionID, true
+}
+
+func parseRemoteWorkspaceAdapterSessionMailboxPreflightPath(path string) (string, bool) {
+	const prefix = "/mesh/v1/remote-workspace/adapter-sessions/"
+	const suffix = "/mailbox/preflight"
+	if !strings.HasPrefix(path, prefix) || !strings.HasSuffix(path, suffix) {
+		return "", false
+	}
+	sessionID := strings.TrimSuffix(strings.TrimPrefix(path, prefix), suffix)
+	sessionID = strings.Trim(sessionID, "/")
+	if strings.TrimSpace(sessionID) == "" || strings.Contains(sessionID, "/") {
+		return "", false
+	}
+	return sessionID, true
+}
+
+func isValidRemoteWorkspaceAdapterMailboxConsumerID(consumerID string) bool {
+	if consumerID == "" || len(consumerID) > 64 {
+		return false
+	}
+	for _, ch := range consumerID {
+		if (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') {
+			continue
+		}
+		switch ch {
+		case '.', '_', ':', '-':
+			continue
+		default:
+			return false
+		}
+	}
+	return true
+}
+
+func parseRemoteWorkspaceAdapterSessionControlPath(path string) (string, bool) {
+	const prefix = "/mesh/v1/remote-workspace/adapter-sessions/"
+	if !strings.HasPrefix(path, prefix) || !strings.HasSuffix(path, "/control") {
+		return "", false
+	}
+	sessionID := strings.TrimSuffix(strings.TrimPrefix(path, prefix), "/control")
+	sessionID = strings.Trim(sessionID, "/")
+	if strings.TrimSpace(sessionID) == "" || strings.Contains(sessionID, "/") {
+		return "", false
+	}
+	return sessionID, true
+}
+
+func (s Server) handleVPNPacketIngress(w http.ResponseWriter, r *http.Request) bool {
+	if clusterID, vpnConnectionID, ok := parseVPNClientPacketWebSocketPath(r.URL.Path); ok {
+		s.handleVPNPacketWebSocket(w, r, clusterID, "", vpnConnectionID, false, true, "")
+		return true
+	}
+	clusterID, vpnConnectionID, ok := parseVPNClientPacketPath(r.URL.Path)
+	if !ok {
+		return false
+	}
+	return s.handleVPNPacketHTTP(w, r, clusterID, "", vpnConnectionID, "", false, true, "")
+}
+
+func (s Server) handleFabricServiceChannelRemoteWorkspaceIngress(w http.ResponseWriter, r *http.Request) bool {
+	clusterID, channelID, resourceID, channelClass, webSocket, ok := parseFabricServiceChannelRemoteWorkspacePath(r.URL.Path)
+	if !ok {
+		return false
+	}
+	if webSocket {
+		http.Error(w, "remote workspace service-channel websocket forwarding is not implemented", http.StatusNotImplemented)
+		return true
+	}
+	decision, valid := s.validateFabricServiceChannelRequest(w, r, clusterID, channelID, resourceID, FabricServiceClassRemoteWorkspace, channelClass)
+	if !valid {
+		return true
+	}
+	w.Header().Set("X-RAP-Service-Channel-Accepted-By", decision.AcceptedBy)
+	s.logFabricServiceChannelAccess(r, clusterID, channelID, resourceID, decision)
+	if r.Method != http.MethodPost && r.Method != http.MethodGet {
+		w.WriteHeader(http.StatusMethodNotAllowed)
+		return true
+	}
+	if r.Method == http.MethodPost {
+		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, MaxProductionEnvelopePayloadBytes))
+		if err != nil {
+			http.Error(w, "invalid remote workspace probe payload", http.StatusBadRequest)
+			return true
+		}
+		if len(strings.TrimSpace(string(body))) > 0 {
+			frameProbe, err := validateRemoteWorkspaceFrameBatchProbe(body, decision.ChannelClass)
+			if err != nil {
+				http.Error(w, err.Error(), http.StatusBadRequest)
+				return true
+			}
+			payloadFlow := "validated_probe_only"
+			adapterSessionID := remoteWorkspaceAdapterSessionID(clusterID, channelID, resourceID, decision.PreferredRouteID)
+			var deliveryReceipt *RemoteWorkspaceFrameBatchDeliveryReceipt
+			if s.RemoteWorkspaceFrameSink != nil {
+				receipt, err := s.RemoteWorkspaceFrameSink.AcceptRemoteWorkspaceFrameBatchProbe(r.Context(), RemoteWorkspaceFrameBatchDelivery{
+					ClusterID: clusterID,
+					ChannelID: channelID,
+					ResourceID: resourceID,
+					AdapterSessionID: adapterSessionID,
+					ServiceClass: FabricServiceClassRemoteWorkspace,
+					ChannelClass: decision.ChannelClass,
+					AdapterContractID: frameProbe.AdapterContractID,
+					Frames: frameProbe.Frames,
+					AcceptedBy: decision.AcceptedBy,
+					PreferredRouteID: decision.PreferredRouteID,
+				})
+				if err != nil {
+					http.Error(w, err.Error(), http.StatusServiceUnavailable)
+					return true
+				}
+				deliveryReceipt = &receipt
+				payloadFlow = "delivered_probe_only"
+			}
+			w.Header().Set("Content-Type", "application/json")
+			w.WriteHeader(http.StatusAccepted)
+			response := map[string]any{
+				"schema_version": "rap.remote_workspace_service_channel_ingress_probe.v1",
+				"accepted": true,
+				"service_class": FabricServiceClassRemoteWorkspace,
+				"channel_class": decision.ChannelClass,
+				"channel_id": channelID,
+				"resource_id": resourceID,
+				"adapter_session_id": adapterSessionID,
+				"data_plane": "validated",
+				"payload_flow": payloadFlow,
+				"frame_batch_schema": frameProbe.SchemaVersion,
+				"frame_count": len(frameProbe.Frames),
+				"adapter_contract_id": frameProbe.AdapterContractID,
+			}
+			if deliveryReceipt != nil {
+				response["adapter_delivery"] = deliveryReceipt
+			}
+			_ = json.NewEncoder(w).Encode(response)
+			return true
+		}
+	}
+	w.Header().Set("Content-Type", "application/json")
+	w.WriteHeader(http.StatusAccepted)
+	_ = json.NewEncoder(w).Encode(map[string]any{
+		"schema_version": "rap.remote_workspace_service_channel_ingress_probe.v1",
+		"accepted": true,
+		"service_class": FabricServiceClassRemoteWorkspace,
+		"channel_class": decision.ChannelClass,
+		"channel_id": channelID,
+		"resource_id": resourceID,
+		"data_plane": "validated",
+		"payload_flow": "not_implemented",
+	})
+	return true
+}
+
+type remoteWorkspaceFrameBatchProbe struct {
+	SchemaVersion string `json:"schema_version"`
+	ProbeOnly bool `json:"probe_only"`
+	ServiceClass string `json:"service_class"`
+	ChannelClass string `json:"channel_class"`
+	AdapterContractID string `json:"adapter_contract_id,omitempty"`
+	Frames []RemoteWorkspaceFrameProbeRecord `json:"frames"`
+}
+
+type RemoteWorkspaceFrameProbeRecord struct {
+	Channel string `json:"channel"`
+	Direction string `json:"direction"`
+	PayloadEncoding string `json:"payload_encoding,omitempty"`
+	PayloadLength int `json:"payload_length,omitempty"`
+	Droppable bool `json:"droppable,omitempty"`
+}
+
+type RemoteWorkspaceFrameBatchDelivery struct {
+	ClusterID string `json:"cluster_id"`
+	ChannelID string `json:"channel_id"`
+	ResourceID string `json:"resource_id"`
+	AdapterSessionID string `json:"adapter_session_id,omitempty"`
+	ServiceClass string `json:"service_class"`
+	ChannelClass string `json:"channel_class"`
+	AdapterContractID string `json:"adapter_contract_id,omitempty"`
+	Frames []RemoteWorkspaceFrameProbeRecord `json:"frames"`
+	AcceptedBy string `json:"accepted_by,omitempty"`
+	PreferredRouteID string `json:"preferred_route_id,omitempty"`
+}
+
+type RemoteWorkspaceFrameBatchDeliveryReceipt struct {
+	SchemaVersion string `json:"schema_version"`
+	Sink string `json:"sink"`
+	Accepted bool `json:"accepted"`
+	ProbeOnly bool `json:"probe_only"`
+	ClusterID string `json:"cluster_id,omitempty"`
+	ChannelID string `json:"channel_id,omitempty"`
+	ResourceID string `json:"resource_id,omitempty"`
+	ServiceClass string `json:"service_class"`
+	ChannelClass string `json:"channel_class"`
+	AdapterContractID string `json:"adapter_contract_id,omitempty"`
+	AdapterSessionID string `json:"adapter_session_id,omitempty"`
+	AdapterRuntimeID string `json:"adapter_runtime_id,omitempty"`
+	SessionState string `json:"session_state,omitempty"`
+	SessionCreatedAt string `json:"session_created_at,omitempty"`
+	SessionBoundAt string `json:"session_bound_at,omitempty"`
+	SessionLastActive string `json:"session_last_activity_at,omitempty"`
+	SessionLifecycle string `json:"session_lifecycle_state,omitempty"`
+	SessionDeliveries int64 `json:"session_delivery_count,omitempty"`
+	SessionPressure int64 `json:"session_backpressure_count,omitempty"`
+	MailboxDepth int `json:"mailbox_depth,omitempty"`
+	MailboxEnqueued int64 `json:"mailbox_enqueued_total,omitempty"`
+	FrameCount int `json:"frame_count"`
+	QueueCapacity int `json:"queue_capacity"`
+	QueueDepth int `json:"queue_depth"`
+	AcceptedFrames int `json:"accepted_frames"`
+	DroppedFrames int `json:"dropped_frames"`
+	AckedFrames int `json:"acked_frames"`
+	Backpressure bool `json:"backpressure"`
+	DropPolicy string `json:"drop_policy,omitempty"`
+	DeliverySequence int64 `json:"delivery_sequence"`
+	DeliveredAt string `json:"delivered_at"`
+}
+
+func remoteWorkspaceAdapterSessionID(clusterID string, channelID string, resourceID string, preferredRouteID string) string {
+	seed := strings.Join([]string{
+		strings.TrimSpace(clusterID),
+		strings.TrimSpace(channelID),
+		strings.TrimSpace(resourceID),
+		strings.TrimSpace(preferredRouteID),
+	}, "\x00")
+	sum := sha256.Sum256([]byte(seed))
+	return "rap-rw-adapter-session-" + hex.EncodeToString(sum[:])[:24]
+}
+
+func validateRemoteWorkspaceFrameBatchProbe(payload []byte, requiredChannelClass string) (remoteWorkspaceFrameBatchProbe, error) {
+	var decoded remoteWorkspaceFrameBatchProbe
+	if err := json.Unmarshal(payload, &decoded); err != nil {
+		return decoded, fmt.Errorf("invalid remote workspace frame batch probe")
+	}
+	if decoded.SchemaVersion != "rap.remote_workspace_frame_batch.v1" {
+		return decoded, fmt.Errorf("unsupported remote workspace frame batch schema")
+	}
+	if !decoded.ProbeOnly {
+		return decoded, fmt.Errorf("remote workspace payload forwarding is not implemented")
+	}
+	if strings.TrimSpace(strings.ToLower(decoded.ServiceClass)) != FabricServiceClassRemoteWorkspace {
+		return decoded, fmt.Errorf("remote workspace frame batch service class mismatch")
+	}
+	requiredChannelClass = strings.TrimSpace(strings.ToLower(requiredChannelClass))
+	if strings.TrimSpace(strings.ToLower(decoded.ChannelClass)) != requiredChannelClass {
+		return decoded, fmt.Errorf("remote workspace frame batch channel class mismatch")
+	}
+	if len(decoded.Frames) == 0 || len(decoded.Frames) > 32 {
+		return decoded, fmt.Errorf("remote workspace frame batch probe must contain 1..32 frames")
+	}
+	for _, frame := range decoded.Frames {
+		channel := strings.TrimSpace(strings.ToLower(frame.Channel))
+		direction := strings.TrimSpace(strings.ToLower(frame.Direction))
+		if !isAllowedRemoteWorkspaceAdapterFrameChannel(channel) {
+			return decoded, fmt.Errorf("unsupported remote workspace adapter frame channel")
+		}
+		if !isAllowedRemoteWorkspaceAdapterFrameDirection(channel, direction) {
+			return decoded, fmt.Errorf("unsupported remote workspace adapter frame direction")
+		}
+		encoding := strings.TrimSpace(strings.ToLower(frame.PayloadEncoding))
+		if encoding != "" && encoding != "none" && encoding != "base64" {
+			return decoded, fmt.Errorf("unsupported remote workspace frame payload encoding")
+		}
+		if frame.PayloadLength < 0 || frame.PayloadLength > MaxProductionEnvelopePayloadBytes {
+			return decoded, fmt.Errorf("remote workspace frame payload length out of bounds")
+		}
+	}
+	return decoded, nil
+}
+
+func isAllowedRemoteWorkspaceAdapterFrameChannel(channel string) bool {
+	switch channel {
+	case "input", "control", "display", "cursor", "clipboard", "file_transfer", "audio", "device", "telemetry":
+		return true
+	default:
+		return false
+	}
+}
+
+func isAllowedRemoteWorkspaceAdapterFrameDirection(channel string, direction string) bool {
+	switch channel {
+	case "input":
+		return direction == "client_to_adapter"
+	case "display", "cursor", "audio", "telemetry":
+		return direction == "adapter_to_client"
+	case "control", "clipboard", "file_transfer", "device":
+		return direction == "client_to_adapter" || direction == "adapter_to_client" || direction == "bidirectional"
+	default:
+		return false
+	}
+}
+
+func (s Server) handleFabricServiceChannelVPNPacketIngress(w http.ResponseWriter, r *http.Request) bool {
+	if clusterID, channelID, vpnConnectionID, ok := parseFabricServiceChannelVPNPacketWebSocketPath(r.URL.Path); ok {
+		decision, valid := s.validateFabricServiceChannelVPNRequest(w, r, clusterID, channelID, vpnConnectionID)
+		if !valid {
+			return true
+		}
+		s.logFabricServiceChannelAccess(r, clusterID, channelID, vpnConnectionID, decision)
+		s.preferVPNPacketIngressRoute(decision.PreferredRouteID)
+		s.handleVPNPacketWebSocket(w, r, clusterID, channelID, vpnConnectionID, decision.ForceBackendFallback, decision.BackendFallbackAllowed(), decision.BackendRelayPolicy)
+		return true
+	}
+	clusterID, channelID, vpnConnectionID, ok := parseFabricServiceChannelVPNPacketPath(r.URL.Path)
+	if !ok {
+		return false
+	}
+	decision, valid := s.validateFabricServiceChannelVPNRequest(w, r, clusterID, channelID, vpnConnectionID)
+	if !valid {
+		return true
+	}
+	w.Header().Set("X-RAP-Service-Channel-Accepted-By", decision.AcceptedBy)
+	s.logFabricServiceChannelAccess(r, clusterID, channelID, vpnConnectionID, decision)
+	s.preferVPNPacketIngressRoute(decision.PreferredRouteID)
+	backendPath := "/api/v1/clusters/" + clusterID + "/vpn-connections/" + vpnConnectionID + "/tunnel/client/packets"
+	return s.handleVPNPacketHTTP(w, r, clusterID, channelID, vpnConnectionID, backendPath, decision.ForceBackendFallback, decision.BackendFallbackAllowed(), decision.BackendRelayPolicy)
+}
+
+func (s Server) preferVPNPacketIngressRoute(routeID string) {
+	routeID = strings.TrimSpace(routeID)
+	if routeID == "" || s.VPNPacketIngress == nil {
+		return
+	}
+	if preferred, ok := s.VPNPacketIngress.(VPNPacketIngressRoutePreference); ok {
+		preferred.PreferClientRoute(routeID)
+	}
+}
+
+func (s Server) handleVPNPacketHTTP(w http.ResponseWriter, r *http.Request, clusterID string, channelID string, vpnConnectionID string, backendFallbackPath string, forceBackendFallback bool, backendFallbackAllowed bool, backendRelayPolicy string) bool {
+	switch r.Method {
+	case http.MethodPost:
+		body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, MaxProductionVPNPacketPayloadBytes))
+		if err != nil {
+			http.Error(w, "invalid vpn packet payload", http.StatusBadRequest)
+			return true
+		}
+		if r.URL.Query().Get("batch") != "true" && len(body) == 0 {
+			http.Error(w, "empty vpn packet payload", http.StatusBadRequest)
+			return true
+		}
+		packets := [][]byte{body}
+		if r.URL.Query().Get("batch") == "true" {
+			packets, err = decodeVPNIngressPacketBatch(body)
+			if err != nil {
+				http.Error(w, "invalid vpn packet batch", http.StatusBadRequest)
+				return true
+			}
+		}
+		packets = cleanVPNIngressPacketBatch(packets)
+		if len(packets) == 0 {
+			http.Error(w, "empty vpn packet batch", http.StatusBadRequest)
+			return true
+		}
+		if forceBackendFallback {
+			if backendFallbackAllowed && s.proxyVPNPacketIngressToBackendPath(w, r, body, backendFallbackPath) {
+				return true
+			}
+			s.logFabricServiceChannelViolation(r, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "backend_fallback_blocked_by_policy", ErrRouteNotFound.Error())
+			http.Error(w, ErrRouteNotFound.Error(), vpnIngressStatusCode(ErrRouteNotFound))
+			return true
+		}
+		trafficClass := r.Header.Get("X-RAP-Traffic-Class")
+		var sendErr error
+		if classIngress, ok := s.VPNPacketIngress.(VPNPacketIngressTrafficClass); ok {
+			sendErr = classIngress.SendClientPacketBatchWithTrafficClass(r.Context(), clusterID, vpnConnectionID, trafficClass, packets)
+		} else {
+			sendErr = s.VPNPacketIngress.SendClientPacketBatch(r.Context(), clusterID, vpnConnectionID, packets)
+		}
+		if sendErr != nil {
+			if backendFallbackAllowed && s.proxyVPNPacketIngressToBackendPath(w, r, body, backendFallbackPath) {
+				return true
+			}
+			s.logFabricServiceChannelViolation(r, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "fabric_route_send_failed_backend_fallback_blocked", sendErr.Error())
+			http.Error(w, sendErr.Error(), vpnIngressStatusCode(sendErr))
+			return true
+		}
+		w.WriteHeader(http.StatusAccepted)
+		return true
+	case http.MethodGet:
+		if forceBackendFallback {
+			if backendFallbackAllowed && s.proxyVPNPacketIngressToBackendPath(w, r, nil, backendFallbackPath) {
+				return true
+			}
+			s.logFabricServiceChannelViolation(r, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "backend_fallback_blocked_by_policy", ErrRouteNotFound.Error())
+			w.WriteHeader(http.StatusNoContent)
+			return true
+		}
+		timeout := vpnIngressTimeout(r)
+		packets, err := s.VPNPacketIngress.ReceiveClientPacketBatch(r.Context(), clusterID, vpnConnectionID, timeout)
+		if err != nil {
+			http.Error(w, err.Error(), vpnIngressStatusCode(err))
+			return true
+		}
+		packets = cleanVPNIngressPacketBatch(packets)
+		if len(packets) == 0 {
+			if backendFallbackAllowed && s.proxyVPNPacketIngressToBackendPath(w, r, nil, backendFallbackPath) {
+				return true
+			}
+			w.WriteHeader(http.StatusNoContent)
+			return true
+		}
+		if r.URL.Query().Get("batch") == "true" {
+			w.Header().Set("Content-Type", "application/vnd.rap.vpn-packet-batch.v1")
+			_, _ = w.Write(encodeVPNIngressPacketBatch(packets))
+			return true
+		}
+		w.Header().Set("Content-Type", "application/octet-stream")
+		_, _ = w.Write(packets[0])
+		return true
+	default:
+		w.WriteHeader(http.StatusMethodNotAllowed)
+		return true
+	}
+}
+
+func (s Server) handleVPNPacketWebSocket(w http.ResponseWriter, r *http.Request, clusterID string, channelID string, vpnConnectionID string, forceBackendFallback bool, backendFallbackAllowed bool, backendRelayPolicy string) {
+	if r.Method != http.MethodGet {
+		w.WriteHeader(http.StatusMethodNotAllowed)
+		return
+	}
+	if s.VPNPacketIngress == nil {
+		http.Error(w, ErrForwardRuntimeUnavailable.Error(), http.StatusServiceUnavailable)
+		return
+	}
+	upgrader := websocket.Upgrader{
+		CheckOrigin: func(_ *http.Request) bool { return true },
+	}
+	conn, err := upgrader.Upgrade(w, r, nil)
+	if err != nil {
+		return
+	}
+	defer conn.Close()
+	conn.SetReadLimit(MaxProductionVPNPacketPayloadBytes)
+
+	ctx, cancel := context.WithCancel(r.Context())
+	defer cancel()
+	trafficClass := r.Header.Get("X-RAP-Traffic-Class")
+	errCh := make(chan error, 2)
+	go func() {
+		errCh <- s.readVPNPacketWebSocket(ctx, conn, clusterID, channelID, vpnConnectionID, trafficClass, forceBackendFallback, backendFallbackAllowed, backendRelayPolicy)
+	}()
+	go func() {
+		errCh <- s.writeVPNPacketWebSocket(ctx, conn, clusterID, channelID, vpnConnectionID, forceBackendFallback, backendFallbackAllowed, backendRelayPolicy)
+	}()
+
+	select {
+	case <-ctx.Done():
+	case <-errCh:
+		cancel()
+	}
+}
+
+func (s Server) readVPNPacketWebSocket(ctx context.Context, conn *websocket.Conn, clusterID string, channelID string, vpnConnectionID string, trafficClass string, forceBackendFallback bool, backendFallbackAllowed bool, backendRelayPolicy string) error {
+	for {
+		messageType, payload, err := conn.ReadMessage()
+		if err != nil {
+			return err
+		}
+		if messageType != websocket.BinaryMessage {
+			continue
+		}
+		packets, err := decodeVPNIngressPacketBatch(payload)
+		if err != nil {
+			return err
+		}
+		packets = cleanVPNIngressPacketBatch(packets)
+		if len(packets) == 0 {
+			continue
+		}
+		if forceBackendFallback {
+			if !backendFallbackAllowed {
+				s.logFabricServiceChannelViolation(nil, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "backend_fallback_blocked_by_policy", ErrRouteNotFound.Error())
+				return ErrRouteNotFound
+			}
+			if proxyErr := s.backendVPNPacketPost(ctx, clusterID, vpnConnectionID, payload); proxyErr != nil {
+				return proxyErr
+			}
+			continue
+		}
+		var sendErr error
+		if classIngress, ok := s.VPNPacketIngress.(VPNPacketIngressTrafficClass); ok {
+			sendErr = classIngress.SendClientPacketBatchWithTrafficClass(ctx, clusterID, vpnConnectionID, trafficClass, packets)
+		} else {
+			sendErr = s.VPNPacketIngress.SendClientPacketBatch(ctx, clusterID, vpnConnectionID, packets)
+		}
+		if sendErr != nil {
+			if !backendFallbackAllowed {
+				s.logFabricServiceChannelViolation(nil, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "fabric_route_send_failed_backend_fallback_blocked", sendErr.Error())
+				return sendErr
+			}
+			if proxyErr := s.backendVPNPacketPost(ctx, clusterID, vpnConnectionID, payload); proxyErr != nil {
+				return sendErr
+			}
+		}
+	}
+}
+
+func (s Server) writeVPNPacketWebSocket(ctx context.Context, conn *websocket.Conn, clusterID string, channelID string, vpnConnectionID string, forceBackendFallback bool, backendFallbackAllowed bool, backendRelayPolicy string) error {
+	lastPing := time.Now()
+	for {
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		default:
+		}
+		var packets [][]byte
+		var err error
+		if !forceBackendFallback {
+			packets, err = s.VPNPacketIngress.ReceiveClientPacketBatch(ctx, clusterID, vpnConnectionID, 50*time.Millisecond)
+		}
+		if forceBackendFallback && !backendFallbackAllowed {
+			s.logFabricServiceChannelViolation(nil, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "backend_fallback_blocked_by_policy", ErrRouteNotFound.Error())
+			return ErrRouteNotFound
+		}
+		if err != nil && !backendFallbackAllowed {
+			s.logFabricServiceChannelViolation(nil, clusterID, channelID, vpnConnectionID, backendRelayPolicy, "fabric_route_receive_failed_backend_fallback_blocked", err.Error())
+			return err
+		}
+		if backendFallbackAllowed && (forceBackendFallback || err != nil || len(packets) == 0) {
+			backendPackets, proxyErr := s.backendVPNPacketGet(ctx, clusterID, vpnConnectionID, 50*time.Millisecond)
+			if proxyErr != nil && err != nil {
+				return err
+			}
+			if len(backendPackets) > 0 {
+				packets = backendPackets
+			}
+		}
+		if len(packets) > 0 {
+			if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
+				return err
+			}
+			if err := conn.WriteMessage(websocket.BinaryMessage, encodeVPNIngressPacketBatch(packets)); err != nil {
+				return err
+			}
+			continue
+		}
+		if time.Since(lastPing) >= 15*time.Second {
+			if err := conn.SetWriteDeadline(time.Now().Add(5 * time.Second)); err != nil {
+				return err
+			}
+			if err := conn.WriteMessage(websocket.PingMessage, []byte("rap-vpn")); err != nil {
+				return err
+			}
+			lastPing = time.Now()
+		}
+	}
+}
+
+func (s Server) backendVPNPacketPost(ctx context.Context, clusterID string, vpnConnectionID string, batchPayload []byte) error {
+	target := strings.TrimRight(strings.TrimSpace(s.BackendProxyBaseURL), "/")
+	if target == "" {
+		return ErrRouteNotFound
+	}
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, target+"/clusters/"+clusterID+"/vpn-connections/"+vpnConnectionID+"/tunnel/client/packets?batch=true", bytes.NewReader(batchPayload))
+	if err != nil {
+		return err
+	}
+	req.Header.Set("Content-Type", "application/octet-stream")
+	req.Header.Set("X-RAP-Entry-Node", s.Local.NodeID)
+	req.Header.Set("X-RAP-Entry-Cluster", s.Local.ClusterID)
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+		return fmt.Errorf("backend vpn packet post failed: status=%d", resp.StatusCode)
+	}
+	return nil
+}
+
+func (s Server) backendVPNPacketGet(ctx context.Context, clusterID string, vpnConnectionID string, timeout time.Duration) ([][]byte, error) {
+	target := strings.TrimRight(strings.TrimSpace(s.BackendProxyBaseURL), "/")
+	if target == "" {
+		return nil, ErrRouteNotFound
+	}
+	if timeout <= 0 {
+		timeout = 50 * time.Millisecond
+	}
+	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target+"/clusters/"+clusterID+"/vpn-connections/"+vpnConnectionID+"/tunnel/client/packets?batch=true&timeout_ms="+strconv.FormatInt(timeout.Milliseconds(), 10), nil)
+	if err != nil {
+		return nil, err
+	}
+	req.Header.Set("Accept", "application/vnd.rap.vpn-packet-batch.v1")
+	req.Header.Set("X-RAP-Entry-Node", s.Local.NodeID)
+	req.Header.Set("X-RAP-Entry-Cluster", s.Local.ClusterID)
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()
+	if resp.StatusCode == http.StatusNoContent {
+		return nil, nil
+	}
+	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
+		return nil, fmt.Errorf("backend vpn packet get failed: status=%d", resp.StatusCode)
+	}
+	body, err := io.ReadAll(io.LimitReader(resp.Body, MaxProductionVPNPacketPayloadBytes))
+	if err != nil {
+		return nil, err
+	}
+	if len(body) == 0 {
+		return nil, nil
+	}
+	return decodeVPNIngressPacketBatch(body)
+}
+
+func (s Server) proxyVPNPacketIngressToBackend(w http.ResponseWriter, r *http.Request, body []byte) bool {
+	return s.proxyVPNPacketIngressToBackendPath(w, r, body, "")
+}
+
+func (s Server) proxyVPNPacketIngressToBackendPath(w http.ResponseWriter, r *http.Request, body []byte, backendPath string) bool {
+	if strings.TrimSpace(s.BackendProxyBaseURL) == "" {
+		return false
+	}
+	target, err := url.Parse(s.BackendProxyBaseURL)
+	if err != nil || target.Scheme == "" || target.Host == "" {
+		return false
+	}
+	if strings.EqualFold(target.Host, r.Host) {
+		return false
+	}
+	var reader io.Reader
+	if body != nil {
+		reader = bytes.NewReader(body)
+	}
+	requestURI := r.URL.RequestURI()
+	if backendPath != "" {
+		requestURI = backendPath
+		if r.URL.RawQuery != "" {
+			requestURI += "?" + r.URL.RawQuery
+		}
+	}
+	req, err := http.NewRequestWithContext(r.Context(), r.Method, target.Scheme+"://"+target.Host+requestURI, reader)
+	if err != nil {
+		return false
+	}
+	for _, key := range []string{"Accept", "Content-Type"} {
+		if value := r.Header.Get(key); value != "" {
+			req.Header.Set(key, value)
+		}
+	}
+	req.Header.Set("X-RAP-Entry-Node", s.Local.NodeID)
+	req.Header.Set("X-RAP-Entry-Cluster", s.Local.ClusterID)
+	resp, err := http.DefaultClient.Do(req)
+	if err != nil {
+		return false
+	}
+	defer resp.Body.Close()
+	for _, key := range []string{"Content-Type"} {
+		if value := resp.Header.Get(key); value != "" {
+			w.Header().Set(key, value)
+		}
+	}
+	w.WriteHeader(resp.StatusCode)
+	_, _ = io.Copy(w, resp.Body)
+	return true
+}
+
+type fabricServiceChannelLeaseAuthorityPayload struct {
+	SchemaVersion string `json:"schema_version"`
+	ChannelID string `json:"channel_id"`
+	ClusterID string `json:"cluster_id"`
+	ResourceID string `json:"resource_id,omitempty"`
+	ServiceClass string `json:"service_class"`
+	Status string `json:"status"`
+	SelectedEntryNodeID string `json:"selected_entry_node_id"`
+	SelectedExitNodeID string `json:"selected_exit_node_id"`
+	AllowedChannels []string `json:"allowed_channels"`
+	RouteGeneration string `json:"route_generation"`
+	FencingEpoch int64 `json:"fencing_epoch"`
+	TokenHash string `json:"token_hash"`
+	IssuedAt time.Time `json:"issued_at"`
+	ExpiresAt time.Time `json:"expires_at"`
+	DataPlane fabricServiceChannelDataPlaneContract `json:"data_plane,omitempty"`
+	PrimaryRoute struct {
+		RouteID string `json:"route_id"`
+		Status string `json:"status"`
+	} `json:"primary_route"`
+}
+
+type fabricServiceChannelDataPlaneContract struct {
+	SchemaVersion string `json:"schema_version,omitempty"`
+	Mode string `json:"mode,omitempty"`
+	ControlPlaneTransport string `json:"control_plane_transport,omitempty"`
+	WorkingDataTransport string `json:"working_data_transport,omitempty"`
+	SteadyStateTransport string `json:"steady_state_transport,omitempty"`
+	BackendRelayPolicy string `json:"backend_relay_policy,omitempty"`
+	ProductionForwardingRequired bool `json:"production_forwarding_required,omitempty"`
+	ServiceNeutral bool `json:"service_neutral,omitempty"`
+	ProtocolAgnostic bool `json:"protocol_agnostic,omitempty"`
+	LogicalFlowMode string `json:"logical_flow_mode,omitempty"`
+	RequiredFlowIsolationClasses []string `json:"required_flow_isolation_classes,omitempty"`
+	RouteSelectionStrategy string `json:"route_selection_strategy,omitempty"`
+	EntryFailoverMode string `json:"entry_failover_mode,omitempty"`
+	ExitFailoverMode string `json:"exit_failover_mode,omitempty"`
+	RouteRebuildMode string `json:"route_rebuild_mode,omitempty"`
+	FailureDetectionSource string `json:"failure_detection_source,omitempty"`
+	DegradedFallbackVisibility string `json:"degraded_fallback_visibility,omitempty"`
+	StableContractForServiceClass string `json:"stable_contract_for_service_class,omitempty"`
+}
+
+type fabricServiceChannelRequestDecision struct {
+	ForceBackendFallback bool
+	PreferredRouteID string
+	AcceptedBy string
+	ServiceClass string
+	ChannelClass string
+	DataPlane fabricServiceChannelDataPlaneContract
+	DataPlaneValid bool
+	DataPlaneMode string
+	BackendRelayPolicy string
+}
+
+func (d fabricServiceChannelRequestDecision) BackendFallbackAllowed() bool {
+	return strings.TrimSpace(d.BackendRelayPolicy) != "disabled"
+}
+
+func (s Server) validateFabricServiceChannelVPNRequest(w http.ResponseWriter, r *http.Request, clusterID string, channelID string, vpnConnectionID string) (fabricServiceChannelRequestDecision, bool) {
+	return s.validateFabricServiceChannelRequest(w, r, clusterID, channelID, vpnConnectionID, FabricServiceClassVPNPackets, ProductionChannelVPNPacket)
+}
+
+func (s Server) validateFabricServiceChannelRequest(w http.ResponseWriter, r *http.Request, clusterID string, channelID string, resourceID string, expectedServiceClass string, defaultChannelClass string) 
(fabricServiceChannelRequestDecision, bool) { + var decision fabricServiceChannelRequestDecision + expectedServiceClass = strings.TrimSpace(strings.ToLower(expectedServiceClass)) + defaultChannelClass = strings.TrimSpace(strings.ToLower(defaultChannelClass)) + if strings.TrimSpace(clusterID) == "" || strings.TrimSpace(channelID) == "" { + http.Error(w, "invalid fabric service channel target", http.StatusBadRequest) + return decision, false + } + if headerChannelID := strings.TrimSpace(r.Header.Get("X-RAP-Fabric-Channel-ID")); headerChannelID != "" && headerChannelID != channelID { + http.Error(w, "fabric service channel mismatch", http.StatusForbidden) + return decision, false + } + serviceClass := strings.TrimSpace(strings.ToLower(r.Header.Get("X-RAP-Service-Class"))) + if serviceClass == "" { + serviceClass = expectedServiceClass + } + if serviceClass != expectedServiceClass { + http.Error(w, "unsupported fabric service class", http.StatusForbidden) + return decision, false + } + channelClass := strings.TrimSpace(strings.ToLower(r.Header.Get("X-RAP-Channel-Class"))) + if channelClass == "" { + channelClass = defaultChannelClass + } + if !isAllowedFabricServiceChannelForClass(serviceClass, channelClass) { + http.Error(w, "unsupported fabric service channel class", http.StatusForbidden) + return decision, false + } + token := fabricServiceChannelBearerToken(r) + if !strings.HasPrefix(token, "rap_fsc_") { + http.Error(w, "fabric service channel token is required", http.StatusUnauthorized) + return decision, false + } + payload, err := s.verifyFabricServiceChannelLeaseAuthority(r, clusterID, channelID, resourceID, serviceClass, channelClass, token) + if err != nil { + http.Error(w, err.Error(), http.StatusForbidden) + return decision, false + } + decision.AcceptedBy = "legacy_unsigned" + decision.ServiceClass = serviceClass + decision.ChannelClass = channelClass + if payload != nil && (payload.Status == "degraded_fallback" || payload.PrimaryRoute.Status == 
"missing_route_intent") { + decision.ForceBackendFallback = true + } + if payload != nil { + if err := validateFabricServiceChannelDataPlaneContract(payload.DataPlane, channelClass); err != nil { + http.Error(w, err.Error(), http.StatusForbidden) + return decision, false + } + decision.DataPlane = payload.DataPlane + decision.DataPlaneValid = strings.TrimSpace(payload.DataPlane.SchemaVersion) != "" + decision.DataPlaneMode = strings.TrimSpace(payload.DataPlane.Mode) + decision.BackendRelayPolicy = strings.TrimSpace(payload.DataPlane.BackendRelayPolicy) + if payload.DataPlane.Mode == "degraded_backend_fallback" { + decision.ForceBackendFallback = true + } + } + if payload != nil && !decision.ForceBackendFallback { + decision.PreferredRouteID = strings.TrimSpace(payload.PrimaryRoute.RouteID) + } + if payload != nil && payload.SchemaVersion == "rap.fabric_service_channel_introspection.v1" { + decision.AcceptedBy = "introspection" + } else if payload != nil { + decision.AcceptedBy = "signed" + } + return decision, true +} + +type FabricServiceChannelAccessLogEntry struct { + Event string `json:"event"` + ClusterID string `json:"cluster_id,omitempty"` + ChannelID string `json:"channel_id,omitempty"` + ResourceID string `json:"resource_id,omitempty"` + LocalNodeID string `json:"local_node_id,omitempty"` + Method string `json:"method,omitempty"` + Path string `json:"path,omitempty"` + AcceptedBy string `json:"accepted_by,omitempty"` + PreferredRouteID string `json:"preferred_route_id,omitempty"` + ServiceClass string `json:"service_class,omitempty"` + ChannelClass string `json:"channel_class,omitempty"` + ForceBackendFallback bool `json:"force_backend_fallback,omitempty"` + DataPlaneMode string `json:"data_plane_mode,omitempty"` + WorkingDataTransport string `json:"working_data_transport,omitempty"` + SteadyStateTransport string `json:"steady_state_transport,omitempty"` + BackendRelayPolicy string `json:"backend_relay_policy,omitempty"` + LogicalFlowMode string 
`json:"logical_flow_mode,omitempty"` + DataPlaneValid bool `json:"data_plane_valid,omitempty"` + ViolationStatus string `json:"violation_status,omitempty"` + ViolationReason string `json:"violation_reason,omitempty"` + OccurredAt time.Time `json:"occurred_at"` +} + +func (s Server) logFabricServiceChannelAccess(r *http.Request, clusterID string, channelID string, resourceID string, decision fabricServiceChannelRequestDecision) { + if s.FabricServiceChannelLogger == nil { + return + } + entry := FabricServiceChannelAccessLogEntry{ + Event: "fabric_service_channel_access_accepted", + ClusterID: clusterID, + ChannelID: channelID, + ResourceID: resourceID, + LocalNodeID: s.Local.NodeID, + AcceptedBy: decision.AcceptedBy, + PreferredRouteID: decision.PreferredRouteID, + ServiceClass: decision.ServiceClass, + ChannelClass: decision.ChannelClass, + ForceBackendFallback: decision.ForceBackendFallback, + DataPlaneMode: decision.DataPlaneMode, + WorkingDataTransport: strings.TrimSpace(decision.DataPlane.WorkingDataTransport), + SteadyStateTransport: strings.TrimSpace(decision.DataPlane.SteadyStateTransport), + BackendRelayPolicy: decision.BackendRelayPolicy, + LogicalFlowMode: strings.TrimSpace(decision.DataPlane.LogicalFlowMode), + DataPlaneValid: decision.DataPlaneValid, + OccurredAt: time.Now().UTC(), + } + if r != nil { + entry.Method = r.Method + if r.URL != nil { + entry.Path = r.URL.Path + } + } + s.FabricServiceChannelLogger(entry) +} + +func (s Server) logFabricServiceChannelViolation(r *http.Request, clusterID string, channelID string, resourceID string, backendRelayPolicy string, status string, reason string) { + if s.FabricServiceChannelLogger == nil || strings.TrimSpace(channelID) == "" { + return + } + entry := FabricServiceChannelAccessLogEntry{ + Event: "fabric_service_channel_data_plane_violation", + ClusterID: clusterID, + ChannelID: channelID, + ResourceID: resourceID, + LocalNodeID: s.Local.NodeID, + BackendRelayPolicy: 
strings.TrimSpace(backendRelayPolicy), + ViolationStatus: strings.TrimSpace(status), + ViolationReason: strings.TrimSpace(reason), + OccurredAt: time.Now().UTC(), + } + if r != nil { + entry.Method = r.Method + if r.URL != nil { + entry.Path = r.URL.Path + } + } + s.FabricServiceChannelLogger(entry) +} + +func (s Server) verifyFabricServiceChannelLeaseAuthority(r *http.Request, clusterID string, channelID string, resourceID string, serviceClass string, channelClass string, token string) (*fabricServiceChannelLeaseAuthorityPayload, error) { + publicKey := strings.TrimSpace(s.ClusterAuthorityPublicKey) + payloadHeader := strings.TrimSpace(r.Header.Get("X-RAP-Service-Channel-Authority-Payload")) + signatureHeader := strings.TrimSpace(r.Header.Get("X-RAP-Service-Channel-Authority-Signature")) + if payloadHeader == "" && signatureHeader == "" { + if publicKey != "" { + if payload, ok, err := s.introspectFabricServiceChannelLease(r, clusterID, channelID, resourceID, serviceClass, channelClass, token); ok || err != nil { + return payload, err + } + return nil, fmt.Errorf("%w: signed service channel authority is required", ErrUnauthorizedChannel) + } + return nil, nil + } + if publicKey == "" { + return nil, ErrUnauthorizedChannel + } + if payloadHeader == "" || signatureHeader == "" { + return nil, fmt.Errorf("%w: service channel authority payload and signature are required together", ErrUnauthorizedChannel) + } + payloadRaw, err := decodeHeaderJSON(payloadHeader) + if err != nil { + return nil, fmt.Errorf("%w: invalid service channel authority payload", ErrUnauthorizedChannel) + } + signatureRaw, err := decodeHeaderJSON(signatureHeader) + if err != nil { + return nil, fmt.Errorf("%w: invalid service channel authority signature", ErrUnauthorizedChannel) + } + var signature authority.Signature + if err := json.Unmarshal(signatureRaw, &signature); err != nil { + return nil, fmt.Errorf("%w: invalid service channel authority signature", ErrUnauthorizedChannel) + } + if err := 
authority.VerifyRaw(publicKey, payloadRaw, signature); err != nil { + return nil, fmt.Errorf("%w: service channel authority signature rejected", ErrUnauthorizedChannel) + } + var payload fabricServiceChannelLeaseAuthorityPayload + if err := json.Unmarshal(payloadRaw, &payload); err != nil { + return nil, fmt.Errorf("%w: invalid service channel authority payload", ErrUnauthorizedChannel) + } + if payload.SchemaVersion != "rap.fabric_service_channel_lease_authority.v1" || + payload.ClusterID != clusterID || + payload.ChannelID != channelID || + payload.ResourceID != resourceID || + payload.ServiceClass != serviceClass || + payload.TokenHash != fabricServiceChannelTokenHash(token) || + !containsString(payload.AllowedChannels, channelClass) { + return nil, fmt.Errorf("%w: service channel authority payload mismatch", ErrUnauthorizedChannel) + } + if payload.SelectedEntryNodeID != "" && s.Local.NodeID != "" && payload.SelectedEntryNodeID != s.Local.NodeID { + return nil, fmt.Errorf("%w: service channel entry node mismatch", ErrUnauthorizedChannel) + } + if !payload.ExpiresAt.IsZero() && !payload.ExpiresAt.After(time.Now().UTC()) { + return nil, fmt.Errorf("%w: service channel lease expired", ErrUnauthorizedChannel) + } + return &payload, nil +} + +func validateFabricServiceChannelDataPlaneContract(contract fabricServiceChannelDataPlaneContract, requiredFlowClass string) error { + if strings.TrimSpace(contract.SchemaVersion) == "" { + return nil + } + requiredFlowClass = strings.TrimSpace(strings.ToLower(requiredFlowClass)) + if contract.SchemaVersion != "rap.fabric_service_channel_data_plane.v1" || + contract.WorkingDataTransport != "fabric_service_channel" || + contract.SteadyStateTransport != "fabric_route" || + (contract.BackendRelayPolicy != "degraded_fallback_only" && contract.BackendRelayPolicy != "disabled") || + !contract.ServiceNeutral || + !contract.ProtocolAgnostic || + contract.LogicalFlowMode != "multi_flow_isolated" { + return fmt.Errorf("%w: unsupported 
service channel data-plane contract", ErrUnauthorizedChannel) + } + if contract.Mode != "" && contract.Mode != "fabric_primary" && contract.Mode != "degraded_backend_fallback" { + return fmt.Errorf("%w: unsupported service channel data-plane mode", ErrUnauthorizedChannel) + } + if requiredFlowClass != "" && len(contract.RequiredFlowIsolationClasses) > 0 && !containsString(contract.RequiredFlowIsolationClasses, requiredFlowClass) { + return fmt.Errorf("%w: service channel data-plane missing required flow isolation", ErrUnauthorizedChannel) + } + return nil +} + +type fabricServiceChannelIntrospectionResponse struct { + Introspection fabricServiceChannelIntrospection `json:"fabric_service_channel_introspection"` +} + +type fabricServiceChannelIntrospection struct { + Allowed bool `json:"allowed"` + Status string `json:"status"` + Reason string `json:"reason"` + SelectedEntryNodeID string `json:"selected_entry_node_id"` + AllowedChannels []string `json:"allowed_channels"` + PreferredRouteID string `json:"preferred_route_id"` + ForceBackendFallback bool `json:"force_backend_fallback"` + DataPlane fabricServiceChannelDataPlaneContract `json:"data_plane,omitempty"` + LeaseStatus string `json:"lease_status"` + PrimaryRoute struct { + RouteID string `json:"route_id"` + Status string `json:"status"` + } `json:"primary_route"` + ExpiresAt time.Time `json:"expires_at"` +} + +func (s Server) introspectFabricServiceChannelLease(r *http.Request, clusterID string, channelID string, resourceID string, serviceClass string, channelClass string, token string) (*fabricServiceChannelLeaseAuthorityPayload, bool, error) { + baseURL := strings.TrimRight(strings.TrimSpace(s.BackendProxyBaseURL), "/") + if baseURL == "" { + return nil, false, nil + } + serviceClass = strings.TrimSpace(strings.ToLower(firstNonEmpty(serviceClass, r.Header.Get("X-RAP-Service-Class"), FabricServiceClassVPNPackets))) + channelClass = strings.TrimSpace(strings.ToLower(firstNonEmpty(channelClass, 
r.Header.Get("X-RAP-Channel-Class"), ProductionChannelVPNPacket))) + path := "/clusters/" + clusterID + "/fabric/service-channels/" + channelID + "/introspect" + body, _ := json.Marshal(map[string]any{ + "token": token, + "resource_id": resourceID, + "service_class": serviceClass, + "channel_class": channelClass, + "entry_node_id": s.Local.NodeID, + }) + ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second) + defer cancel() + req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+path, bytes.NewReader(body)) + if err != nil { + return nil, true, fmt.Errorf("%w: service channel introspection request failed", ErrUnauthorizedChannel) + } + req.Header.Set("Content-Type", "application/json") + req.Header.Set("X-RAP-Entry-Node", s.Local.NodeID) + req.Header.Set("X-RAP-Entry-Cluster", s.Local.ClusterID) + resp, err := http.DefaultClient.Do(req) + if err != nil { + return nil, true, fmt.Errorf("%w: service channel introspection unavailable", ErrUnauthorizedChannel) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + return nil, true, fmt.Errorf("%w: service channel introspection denied", ErrUnauthorizedChannel) + } + var decoded fabricServiceChannelIntrospectionResponse + if err := json.NewDecoder(io.LimitReader(resp.Body, 64*1024)).Decode(&decoded); err != nil { + return nil, true, fmt.Errorf("%w: invalid service channel introspection response", ErrUnauthorizedChannel) + } + item := decoded.Introspection + if !item.Allowed { + return nil, true, fmt.Errorf("%w: service channel introspection denied", ErrUnauthorizedChannel) + } + if item.SelectedEntryNodeID != "" && s.Local.NodeID != "" && item.SelectedEntryNodeID != s.Local.NodeID { + return nil, true, fmt.Errorf("%w: service channel entry node mismatch", ErrUnauthorizedChannel) + } + if !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(time.Now().UTC()) { + return nil, true, fmt.Errorf("%w: service channel lease expired", ErrUnauthorizedChannel) + } + payload := 
&fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_introspection.v1", + ChannelID: channelID, + ClusterID: clusterID, + ResourceID: resourceID, + ServiceClass: serviceClass, + Status: item.LeaseStatus, + SelectedEntryNodeID: item.SelectedEntryNodeID, + AllowedChannels: item.AllowedChannels, + DataPlane: item.DataPlane, + ExpiresAt: item.ExpiresAt, + } + payload.PrimaryRoute.RouteID = strings.TrimSpace(firstNonEmpty(item.PreferredRouteID, item.PrimaryRoute.RouteID)) + payload.PrimaryRoute.Status = item.PrimaryRoute.Status + if item.ForceBackendFallback { + payload.Status = "degraded_fallback" + if payload.PrimaryRoute.Status == "" { + payload.PrimaryRoute.Status = "missing_route_intent" + } + } + return payload, true, nil +} + +func decodeHeaderJSON(value string) (json.RawMessage, error) { + if value == "" { + return nil, fmt.Errorf("empty header") + } + if decoded, err := base64.RawURLEncoding.DecodeString(value); err == nil { + return json.RawMessage(decoded), nil + } + if decoded, err := base64.StdEncoding.DecodeString(value); err == nil { + return json.RawMessage(decoded), nil + } + return json.RawMessage(value), nil +} + +func fabricServiceChannelTokenHash(token string) string { + sum := sha256.Sum256([]byte(strings.TrimSpace(token))) + return hex.EncodeToString(sum[:]) +} + +func fabricServiceChannelBearerToken(r *http.Request) string { + if r == nil { + return "" + } + if token := strings.TrimSpace(r.Header.Get("X-RAP-Service-Channel-Token")); token != "" { + return token + } + auth := strings.TrimSpace(r.Header.Get("Authorization")) + if len(auth) > len("Bearer ") && strings.EqualFold(auth[:len("Bearer ")], "Bearer ") { + return strings.TrimSpace(auth[len("Bearer "):]) + } + return strings.TrimSpace(r.URL.Query().Get("service_channel_token")) +} + +func isAllowedFabricServiceVPNChannel(channel string) bool { + return isAllowedFabricServiceChannelForClass(FabricServiceClassVPNPackets, channel) +} + +func 
isAllowedFabricServiceChannelForClass(serviceClass string, channel string) bool { + serviceClass = strings.TrimSpace(strings.ToLower(serviceClass)) + channel = strings.TrimSpace(strings.ToLower(channel)) + switch channel { + case ProductionChannelVPNPacket, + FabricServiceChannelBulk, + FabricServiceChannelControl, + FabricServiceChannelInteractive, + FabricServiceChannelReliable, + FabricServiceChannelDroppable: + if serviceClass == FabricServiceClassRemoteWorkspace && channel == ProductionChannelVPNPacket { + return false + } + return channel != "" + } + return false +} + +func containsString(values []string, target string) bool { + for _, value := range values { + if value == target { + return true + } + } + return false +} + +func parseFabricServiceChannelVPNPacketWebSocketPath(path string) (string, string, string, bool) { + parts := strings.Split(strings.Trim(path, "/"), "/") + if len(parts) != 11 || + parts[0] != "api" || + parts[1] != "v1" || + parts[2] != "clusters" || + parts[4] != "fabric" || + parts[5] != "service-channels" || + parts[7] != "vpn-connections" || + parts[9] != "packets" || + parts[10] != "ws" { + return "", "", "", false + } + if parts[3] == "" || parts[6] == "" || parts[8] == "" { + return "", "", "", false + } + return parts[3], parts[6], parts[8], true +} + +func parseFabricServiceChannelRemoteWorkspacePath(path string) (string, string, string, string, bool, bool) { + parts := strings.Split(strings.Trim(path, "/"), "/") + if len(parts) == 11 && + parts[0] == "api" && + parts[1] == "v1" && + parts[2] == "clusters" && + parts[4] == "fabric" && + parts[5] == "service-channels" && + parts[7] == "remote-workspaces" && + parts[9] == "streams" && + parts[10] == "ws" && + parts[3] != "" && + parts[6] != "" && + parts[8] != "" { + return parts[3], parts[6], parts[8], FabricServiceChannelInteractive, true, true + } + if len(parts) != 11 || + parts[0] != "api" || + parts[1] != "v1" || + parts[2] != "clusters" || + parts[4] != "fabric" || + 
parts[5] != "service-channels" || + parts[7] != "remote-workspaces" || + parts[9] != "streams" { + return "", "", "", "", false, false + } + if parts[3] == "" || parts[6] == "" || parts[8] == "" || parts[10] == "" { + return "", "", "", "", false, false + } + return parts[3], parts[6], parts[8], strings.TrimSpace(strings.ToLower(parts[10])), false, true +} + +func parseFabricServiceChannelVPNPacketPath(path string) (string, string, string, bool) { + parts := strings.Split(strings.Trim(path, "/"), "/") + if len(parts) != 10 || + parts[0] != "api" || + parts[1] != "v1" || + parts[2] != "clusters" || + parts[4] != "fabric" || + parts[5] != "service-channels" || + parts[7] != "vpn-connections" || + parts[9] != "packets" { + return "", "", "", false + } + if parts[3] == "" || parts[6] == "" || parts[8] == "" { + return "", "", "", false + } + return parts[3], parts[6], parts[8], true +} + +func parseVPNClientPacketWebSocketPath(path string) (string, string, bool) { + parts := strings.Split(strings.Trim(path, "/"), "/") + if len(parts) != 10 || + parts[0] != "api" || + parts[1] != "v1" || + parts[2] != "clusters" || + parts[4] != "vpn-connections" || + parts[6] != "tunnel" || + parts[7] != "client" || + parts[8] != "packets" || + parts[9] != "ws" { + return "", "", false + } + if parts[3] == "" || parts[5] == "" { + return "", "", false + } + return parts[3], parts[5], true +} + +func parseVPNClientPacketPath(path string) (string, string, bool) { + parts := strings.Split(strings.Trim(path, "/"), "/") + if len(parts) != 9 || + parts[0] != "api" || + parts[1] != "v1" || + parts[2] != "clusters" || + parts[4] != "vpn-connections" || + parts[6] != "tunnel" || + parts[7] != "client" || + parts[8] != "packets" { + return "", "", false + } + if parts[3] == "" || parts[5] == "" { + return "", "", false + } + return parts[3], parts[5], true +} + +func vpnIngressTimeout(r *http.Request) time.Duration { + timeoutMs, _ := strconv.Atoi(r.URL.Query().Get("timeout_ms")) + if timeoutMs 
<= 0 { + timeoutMs = 25000 + } + if timeoutMs > 30000 { + timeoutMs = 30000 + } + return time.Duration(timeoutMs) * time.Millisecond +} + +func vpnIngressStatusCode(err error) int { + switch err { + case ErrForwardRuntimeUnavailable, ErrRouteNotFound, ErrForwardPeerUnavailable: + return http.StatusServiceUnavailable + case ErrUnauthorizedChannel, ErrClusterMismatch, ErrNodeMismatch: + return http.StatusForbidden + default: + return http.StatusBadGateway + } +} + +func encodeVPNIngressPacketBatch(packets [][]byte) []byte { + packets = cleanVPNIngressPacketBatch(packets) + total := 0 + for _, packet := range packets { + total += 4 + len(packet) + } + out := make([]byte, total) + offset := 0 + for _, packet := range packets { + binary.BigEndian.PutUint32(out[offset:offset+4], uint32(len(packet))) + offset += 4 + copy(out[offset:offset+len(packet)], packet) + offset += len(packet) + } + return out +} + +func cleanVPNIngressPacketBatch(packets [][]byte) [][]byte { + if len(packets) == 0 { + return nil + } + cleaned := make([][]byte, 0, len(packets)) + for _, packet := range packets { + if len(packet) == 0 { + continue + } + cleaned = append(cleaned, packet) + } + return cleaned +} + +func decodeVPNIngressPacketBatch(payload []byte) ([][]byte, error) { + var packets [][]byte + for offset := 0; offset < len(payload); { + if offset+4 > len(payload) { + return nil, fmt.Errorf("%w: truncated vpn packet batch header", ErrForwardEnvelopeInvalid) + } + size := int(binary.BigEndian.Uint32(payload[offset : offset+4])) + offset += 4 + if size <= 0 || size > 65535 { + return nil, fmt.Errorf("%w: invalid vpn packet batch item size", ErrForwardEnvelopeInvalid) + } + if offset+size > len(payload) { + return nil, fmt.Errorf("%w: truncated vpn packet batch item", ErrForwardEnvelopeInvalid) + } + packets = append(packets, append([]byte(nil), payload[offset:offset+size]...)) + offset += size + } + if len(packets) == 0 { + return nil, fmt.Errorf("%w: empty vpn packet batch", 
ErrForwardEnvelopeInvalid) + } + return packets, nil +} + +func (s Server) backendProxy() http.Handler { + target, err := url.Parse(s.BackendProxyBaseURL) + if err != nil || target.Scheme == "" || target.Host == "" { + return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + http.Error(w, "backend proxy misconfigured", http.StatusServiceUnavailable) + }) + } + proxy := &httputil.ReverseProxy{ + Director: func(r *http.Request) { + r.URL.Scheme = target.Scheme + r.URL.Host = target.Host + r.Host = target.Host + r.Header.Set("X-RAP-Entry-Node", s.Local.NodeID) + r.Header.Set("X-RAP-Entry-Cluster", s.Local.ClusterID) + }, + } + return proxy +} + func (s Server) handleHealth(w http.ResponseWriter, r *http.Request) { if r.Method != http.MethodPost { w.WriteHeader(http.StatusMethodNotAllowed) @@ -106,6 +1852,11 @@ func (s Server) handleForward(w http.ResponseWriter, r *http.Request) { } } if envelope.DestinationNodeID == s.Local.NodeID { + if err := deliverProductionEnvelope(r.Context(), s.ProductionEnvelopeDelivery, envelope); err != nil { + s.logProductionForward(productionForwardLogEntry("production_forward_rejected", s.Local, envelope, ErrForwardDeliveryFailed.Error(), http.StatusInternalServerError)) + http.Error(w, ErrForwardDeliveryFailed.Error(), http.StatusInternalServerError) + return + } s.logProductionForward(productionForwardLogEntry("production_forward_delivered", s.Local, envelope, "", http.StatusOK)) writeProductionForwardResult(w, ProductionForwardResult{ Accepted: true, @@ -224,6 +1975,18 @@ func observeProductionEnvelope(ctx context.Context, observer ProductionEnvelopeO return observer(ctx, observation) } +func deliverProductionEnvelope(ctx context.Context, delivery ProductionEnvelopeDelivery, envelope ProductionEnvelope) (err error) { + if delivery == nil { + return nil + } + defer func() { + if recover() != nil { + err = ErrForwardDeliveryFailed + } + }() + return delivery(ctx, envelope) +} + func (s Server) handleSyntheticProbe(w 
http.ResponseWriter, r *http.Request) { if r.Method != http.MethodPost { w.WriteHeader(http.StatusMethodNotAllowed) diff --git a/agents/rap-node-agent/internal/mesh/server_test.go b/agents/rap-node-agent/internal/mesh/server_test.go index 2f2e663..d49c389 100644 --- a/agents/rap-node-agent/internal/mesh/server_test.go +++ b/agents/rap-node-agent/internal/mesh/server_test.go @@ -3,14 +3,23 @@ package mesh import ( "bytes" "context" + "crypto/ed25519" "crypto/sha256" + "encoding/base64" "encoding/hex" "encoding/json" "errors" + "fmt" + "io" "net/http" "net/http/httptest" + "strings" + "sync" "testing" "time" + + "github.com/example/remote-access-platform/agents/rap-node-agent/internal/authority" + "github.com/gorilla/websocket" ) func TestMeshHealthAcceptsSameCluster(t *testing.T) { @@ -778,6 +787,3034 @@ func configuredProductionRoute(routeID string, hops []string) SyntheticRoute { } } +func TestProductionForwardDeliversVPNPacketBatchOnAuthorizedVPNChannel(t *testing.T) { + local := PeerIdentity{ClusterID: "cluster-1", NodeID: "node-c"} + payload := json.RawMessage(`{"vpn_connection_id":"vpn-1","packets":["AAAA"]}`) + sum := sha256.Sum256(payload) + now := time.Now().UTC() + var delivered ProductionEnvelope + server := httptest.NewServer(Server{ + Local: local, + ProductionForwardingEnabled: true, + ProductionRoutes: []SyntheticRoute{{ + RouteID: "route-vpn-1", + ClusterID: "cluster-1", + SourceNodeID: "node-a", + DestinationNodeID: "node-c", + Hops: []string{"node-a", "node-c"}, + AllowedChannels: []string{ProductionChannelVPNPacket}, + ExpiresAt: now.Add(time.Hour), + MaxTTL: 8, + MaxHops: 8, + }}, + ProductionEnvelopeDelivery: func(_ context.Context, envelope ProductionEnvelope) error { + delivered = envelope + return nil + }, + }.Handler()) + defer server.Close() + + envelope := ProductionEnvelope{ + FabricProtocolVersion: ProtocolVersion, + MessageID: "vpn-message-1", + RouteID: "route-vpn-1", + ClusterID: "cluster-1", + SourceNodeID: "node-a", + 
DestinationNodeID: "node-c", + CurrentHopNodeID: "node-c", + NextHopNodeID: "node-c", + RoutePath: []string{"node-a", "node-c"}, + ChannelClass: ProductionChannelVPNPacket, + MessageType: ProductionMessageVPNPacketBatch, + TTL: 4, + HopCount: 1, + CreatedAt: now, + ExpiresAt: now.Add(time.Minute), + PayloadLength: len(payload), + PayloadHash: hex.EncodeToString(sum[:]), + Payload: payload, + } + body, err := json.Marshal(envelope) + if err != nil { + t.Fatalf("marshal envelope: %v", err) + } + resp, err := http.Post(server.URL+"/mesh/v1/forward", "application/json", bytes.NewReader(body)) + if err != nil { + t.Fatalf("post forward: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusOK) + } + if delivered.MessageID != envelope.MessageID || string(delivered.Payload) != string(payload) { + t.Fatalf("delivered envelope = %+v", delivered) + } +} + +func TestProductionForwardRejectsVPNPacketOnFabricControlRoute(t *testing.T) { + local := PeerIdentity{ClusterID: "cluster-1", NodeID: "node-c"} + envelope := validProductionEnvelope(local) + envelope.RouteID = "route-vpn-blocked" + envelope.RoutePath = []string{"node-a", "node-c"} + envelope.ChannelClass = ProductionChannelVPNPacket + envelope.MessageType = ProductionMessageVPNPacketBatch + payload := json.RawMessage(`{"vpn_connection_id":"vpn-1","packets":["AAAA"]}`) + sum := sha256.Sum256(payload) + envelope.Payload = payload + envelope.PayloadLength = len(payload) + envelope.PayloadHash = hex.EncodeToString(sum[:]) + + server := httptest.NewServer(Server{ + Local: local, + ProductionForwardingEnabled: true, + ProductionRoutes: []SyntheticRoute{configuredProductionRoute("route-vpn-blocked", []string{"node-a", "node-c"})}, + }.Handler()) + defer server.Close() + + body, err := json.Marshal(envelope) + if err != nil { + t.Fatalf("marshal envelope: %v", err) + } + resp, err := http.Post(server.URL+"/mesh/v1/forward", 
"application/json", bytes.NewReader(body)) + if err != nil { + t.Fatalf("post forward: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusForbidden { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusForbidden) + } +} + +func TestVPNPacketIngressFallsBackToBackendRelayWhenFabricPeerUnavailable(t *testing.T) { + var backendBody []byte + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets" { + t.Fatalf("backend path = %s", r.URL.Path) + } + if r.Header.Get("X-RAP-Entry-Node") != "entry-1" { + t.Fatalf("entry header = %q", r.Header.Get("X-RAP-Entry-Node")) + } + var err error + backendBody, err = io.ReadAll(r.Body) + if err != nil { + t.Fatalf("read backend body: %v", err) + } + w.WriteHeader(http.StatusAccepted) + })) + defer backend.Close() + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{sendErr: ErrForwardPeerUnavailable}, + BackendProxyBaseURL: backend.URL + "/api/v1", + }.Handler()) + defer server.Close() + + resp, err := http.Post(server.URL+"/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets", "application/octet-stream", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("post vpn packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + if string(backendBody) != "packet" { + t.Fatalf("backend body = %q", string(backendBody)) + } +} + +func TestVPNPacketIngressFallsBackToBackendRelayWhenFabricInboxIsEmpty(t *testing.T) { + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { + w.Header().Set("Content-Type", "application/octet-stream") + _, _ = w.Write([]byte("reply")) + })) + defer backend.Close() + server := 
httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{}, + BackendProxyBaseURL: backend.URL + "/api/v1", + }.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets?timeout_ms=2") + if err != nil { + t.Fatalf("get vpn packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusOK) + } + body, err := io.ReadAll(resp.Body) + if err != nil { + t.Fatalf("read body: %v", err) + } + if string(body) != "reply" { + t.Fatalf("body = %q", string(body)) + } +} + +func TestFabricServiceChannelVPNPacketIngressRequiresLeaseToken(t *testing.T) { + ingress := &recordingVPNPacketIngress{} + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + }.Handler()) + defer server.Close() + + resp, err := http.Post(server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", "application/octet-stream", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusUnauthorized { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusUnauthorized) + } + ingress.mu.Lock() + defer ingress.mu.Unlock() + if len(ingress.sent) != 0 { + t.Fatalf("unexpected sent packets = %#v", ingress.sent) + } +} + +func TestFabricServiceChannelVPNPacketIngressMovesBatchOverFabricRuntime(t *testing.T) { + ingress := &recordingVPNPacketIngress{ + receive: [][]byte{[]byte("reply-1"), []byte("reply-2")}, + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + }.Handler()) + defer server.Close() + + body := 
encodeVPNIngressPacketBatch([][]byte{[]byte("packet-1"), []byte("packet-2")}) + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets?batch=true", bytes.NewReader(body)) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("Authorization", "Bearer rap_fsc_testtoken") + req.Header.Set("X-RAP-Service-Class", FabricServiceClassVPNPackets) + req.Header.Set("X-RAP-Channel-Class", ProductionChannelVPNPacket) + req.Header.Set("X-RAP-Traffic-Class", "interactive") + req.Header.Set("X-RAP-Fabric-Channel-ID", "channel-1") + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet batch: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("post status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + ingress.mu.Lock() + if ingress.clusterID != "cluster-1" || ingress.vpnConnectionID != "vpn-1" { + t.Fatalf("ingress ids = %s %s", ingress.clusterID, ingress.vpnConnectionID) + } + if len(ingress.sent) != 2 || string(ingress.sent[0]) != "packet-1" || string(ingress.sent[1]) != "packet-2" { + t.Fatalf("sent packets = %#v", ingress.sent) + } + if ingress.trafficClass != "interactive" { + t.Fatalf("traffic class = %q, want interactive", ingress.trafficClass) + } + ingress.mu.Unlock() + + req, err = http.NewRequest(http.MethodGet, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets?batch=true&timeout_ms=2", nil) + if err != nil { + t.Fatalf("new get request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", "rap_fsc_testtoken") + resp, err = http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("get service channel packet batch: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + t.Fatalf("get status = %d, want %d", resp.StatusCode, http.StatusOK) + } + payload, err := 
io.ReadAll(resp.Body) + if err != nil { + t.Fatalf("read get body: %v", err) + } + packets, err := decodeVPNIngressPacketBatch(payload) + if err != nil { + t.Fatalf("decode get batch: %v", err) + } + if len(packets) != 2 || string(packets[0]) != "reply-1" || string(packets[1]) != "reply-2" { + t.Fatalf("reply packets = %#v", packets) + } +} + +func TestFabricServiceChannelVPNPacketIngressRequiresSignedLeaseWhenAuthorityPinned(t *testing.T) { + publicKey, _, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + ingress := &recordingVPNPacketIngress{} + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", "rap_fsc_unsigned") + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusForbidden { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusForbidden) + } + ingress.mu.Lock() + defer ingress.mu.Unlock() + if len(ingress.sent) != 0 { + t.Fatalf("unexpected sent packets = %#v", ingress.sent) + } +} + +func TestFabricServiceChannelVPNPacketIngressUsesBackendIntrospectionWhenUnsigned(t *testing.T) { + publicKey, _, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + var introspected bool + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/introspect" { + 
t.Fatalf("introspection path = %s", r.URL.Path) + } + introspected = true + w.Header().Set("Content-Type", "application/json") + _, _ = w.Write([]byte(`{"fabric_service_channel_introspection":{"allowed":true,"status":"allowed","selected_entry_node_id":"entry-1","allowed_channels":["vpn_packet"],"preferred_route_id":"route-1","lease_status":"ready","primary_route":{"route_id":"route-1","status":"ready"},"data_plane":{"schema_version":"rap.fabric_service_channel_data_plane.v1","mode":"fabric_primary","working_data_transport":"fabric_service_channel","steady_state_transport":"fabric_route","backend_relay_policy":"degraded_fallback_only","service_neutral":true,"protocol_agnostic":true,"logical_flow_mode":"multi_flow_isolated","required_flow_isolation_classes":["control","vpn_packet"]},"expires_at":"2099-01-01T00:00:00Z"}}`)) + })) + defer backend.Close() + ingress := &recordingVPNPacketIngress{} + var accessEvents []FabricServiceChannelAccessLogEntry + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + BackendProxyBaseURL: backend.URL + "/api/v1", + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + FabricServiceChannelLogger: func(entry FabricServiceChannelAccessLogEntry) { + accessEvents = append(accessEvents, entry) + }, + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", "rap_fsc_unsigned") + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + if got := 
resp.Header.Get("X-RAP-Service-Channel-Accepted-By"); got != "introspection" { + t.Fatalf("accepted-by header = %q, want introspection", got) + } + if !introspected { + t.Fatal("backend introspection was not called") + } + if len(accessEvents) != 1 || accessEvents[0].AcceptedBy != "introspection" || accessEvents[0].PreferredRouteID != "route-1" || + !accessEvents[0].DataPlaneValid || + accessEvents[0].WorkingDataTransport != "fabric_service_channel" || + accessEvents[0].SteadyStateTransport != "fabric_route" || + accessEvents[0].BackendRelayPolicy != "degraded_fallback_only" { + t.Fatalf("unexpected access events: %+v", accessEvents) + } + ingress.mu.Lock() + defer ingress.mu.Unlock() + if len(ingress.sent) != 1 || string(ingress.sent[0]) != "packet" { + t.Fatalf("sent packets = %#v", ingress.sent) + } +} + +func TestFabricServiceChannelVPNPacketIngressVerifiesSignedLeaseAuthority(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + token := "rap_fsc_signedtest" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-1", + ClusterID: "cluster-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{ProductionChannelVPNPacket}, + RouteGeneration: "rg-1", + FencingEpoch: 7, + TokenHash: fabricServiceChannelTokenHash(token), + IssuedAt: time.Now().UTC().Add(-time.Minute), + ExpiresAt: time.Now().UTC().Add(time.Minute), + DataPlane: fabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: "fabric_primary", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: "degraded_fallback_only", + ProductionForwardingRequired: true, + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: 
"multi_flow_isolated", + RequiredFlowIsolationClasses: []string{"control", ProductionChannelVPNPacket}, + }, + } + payload.PrimaryRoute.RouteID = "route-signed" + payload.PrimaryRoute.Status = "authorized" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + ingress := &recordingVPNPacketIngress{} + var accessEvents []FabricServiceChannelAccessLogEntry + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + FabricServiceChannelLogger: func(entry FabricServiceChannelAccessLogEntry) { + accessEvents = append(accessEvents, entry) + }, + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + 
t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + if len(accessEvents) != 1 || + accessEvents[0].AcceptedBy != "signed" || + accessEvents[0].PreferredRouteID != "route-signed" || + !accessEvents[0].DataPlaneValid || + accessEvents[0].WorkingDataTransport != "fabric_service_channel" || + accessEvents[0].SteadyStateTransport != "fabric_route" || + accessEvents[0].BackendRelayPolicy != "degraded_fallback_only" { + t.Fatalf("unexpected signed data-plane access events: %+v", accessEvents) + } +} + +func TestFabricServiceChannelRemoteWorkspaceIngressValidatesSignedLeaseAuthority(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + token := "rap_fsc_remoteworkspace" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-rw", + ClusterID: "cluster-1", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{FabricServiceChannelControl, FabricServiceChannelInteractive, FabricServiceChannelReliable, FabricServiceChannelDroppable}, + RouteGeneration: "rg-rw", + FencingEpoch: 11, + TokenHash: fabricServiceChannelTokenHash(token), + IssuedAt: time.Now().UTC().Add(-time.Minute), + ExpiresAt: time.Now().UTC().Add(time.Minute), + DataPlane: fabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: "fabric_primary", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: "degraded_fallback_only", + ProductionForwardingRequired: true, + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: "multi_flow_isolated", + RequiredFlowIsolationClasses: []string{FabricServiceChannelControl, FabricServiceChannelInteractive, FabricServiceChannelReliable, 
FabricServiceChannelDroppable}, + }, + } + payload.PrimaryRoute.RouteID = "route-rw" + payload.PrimaryRoute.Status = "authorized" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + var accessEvents []FabricServiceChannelAccessLogEntry + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: &recordingVPNPacketIngress{}, + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + FabricServiceChannelLogger: func(entry FabricServiceChannelAccessLogEntry) { + accessEvents = append(accessEvents, entry) + }, + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-rw/remote-workspaces/workspace-1/streams/interactive", nil) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Class", FabricServiceClassRemoteWorkspace) + req.Header.Set("X-RAP-Channel-Class", FabricServiceChannelInteractive) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post remote workspace probe: %v", err) + } + defer resp.Body.Close() + if 
resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + var decoded map[string]any + if err := json.NewDecoder(resp.Body).Decode(&decoded); err != nil { + t.Fatalf("decode response: %v", err) + } + if decoded["service_class"] != FabricServiceClassRemoteWorkspace || decoded["channel_class"] != FabricServiceChannelInteractive || decoded["payload_flow"] != "not_implemented" { + t.Fatalf("unexpected response: %+v", decoded) + } + if len(accessEvents) != 1 || + accessEvents[0].AcceptedBy != "signed" || + accessEvents[0].ServiceClass != FabricServiceClassRemoteWorkspace || + accessEvents[0].ChannelClass != FabricServiceChannelInteractive || + accessEvents[0].PreferredRouteID != "route-rw" || + !accessEvents[0].DataPlaneValid || + accessEvents[0].WorkingDataTransport != "fabric_service_channel" || + accessEvents[0].SteadyStateTransport != "fabric_route" { + t.Fatalf("unexpected remote workspace access events: %+v", accessEvents) + } +} + +func TestFabricServiceChannelRemoteWorkspaceIngressAcceptsFrameBatchProbeOnly(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + token := "rap_fsc_remoteworkspace_frames" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-rw", + ClusterID: "cluster-1", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + SelectedEntryNodeID: "entry-1", + AllowedChannels: []string{FabricServiceChannelInteractive, FabricServiceChannelDroppable}, + TokenHash: fabricServiceChannelTokenHash(token), + ExpiresAt: time.Now().UTC().Add(time.Minute), + DataPlane: fabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: "fabric_primary", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: 
"degraded_fallback_only", + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: "multi_flow_isolated", + RequiredFlowIsolationClasses: []string{FabricServiceChannelInteractive, FabricServiceChannelDroppable}, + }, + } + payload.PrimaryRoute.RouteID = "route-rw" + payload.PrimaryRoute.Status = "authorized" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: &recordingVPNPacketIngress{}, + RemoteWorkspaceFrameSink: NewRemoteWorkspaceFrameProbeSink(), + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + }.Handler()) + defer server.Close() + + frameBatch := map[string]any{ + "schema_version": "rap.remote_workspace_frame_batch.v1", + "probe_only": true, + "service_class": FabricServiceClassRemoteWorkspace, + "channel_class": FabricServiceChannelInteractive, + "adapter_contract_id": "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + "frames": []map[string]any{ + {"channel": "input", "direction": "client_to_adapter", "payload_encoding": "none", "payload_length": 0, "droppable": true}, + {"channel": "display", "direction": "adapter_to_client", "payload_encoding": "none", "payload_length": 0, "droppable": true}, + }, + } + rawFrameBatch, err := json.Marshal(frameBatch) + if err != nil { + t.Fatalf("marshal frame batch: %v", err) + } + req, err := 
http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-rw/remote-workspaces/workspace-1/streams/interactive", bytes.NewReader(rawFrameBatch)) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Class", FabricServiceClassRemoteWorkspace) + req.Header.Set("X-RAP-Channel-Class", FabricServiceChannelInteractive) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post remote workspace frame probe: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + var decoded map[string]any + if err := json.NewDecoder(resp.Body).Decode(&decoded); err != nil { + t.Fatalf("decode response: %v", err) + } + if decoded["payload_flow"] != "delivered_probe_only" || decoded["frame_batch_schema"] != "rap.remote_workspace_frame_batch.v1" || int(decoded["frame_count"].(float64)) != 2 { + t.Fatalf("unexpected response: %+v", decoded) + } + adapterSessionID, ok := decoded["adapter_session_id"].(string) + if !ok || !strings.HasPrefix(adapterSessionID, "rap-rw-adapter-session-") { + t.Fatalf("adapter_session_id = %#v", decoded["adapter_session_id"]) + } + delivery, ok := decoded["adapter_delivery"].(map[string]any) + if !ok || delivery["sink"] != "node_agent_rdp_worker_contract_probe" || delivery["schema_version"] != "rap.remote_workspace_frame_batch_delivery.v1" { + t.Fatalf("unexpected adapter delivery: %+v", decoded["adapter_delivery"]) + } + if delivery["adapter_session_id"] != adapterSessionID || delivery["adapter_runtime_id"] != "node_agent_rdp_worker_contract_probe" || delivery["session_state"] != 
"probe_bound" { + t.Fatalf("unexpected adapter session delivery fields: %+v", delivery) + } + if int(delivery["accepted_frames"].(float64)) != 2 || int(delivery["acked_frames"].(float64)) != 2 || int(delivery["queue_depth"].(float64)) != 0 { + t.Fatalf("unexpected queue delivery fields: %+v", delivery) + } +} + +func TestRemoteWorkspaceFrameBatchProbeRejectsGuardrailViolations(t *testing.T) { + valid := map[string]any{ + "schema_version": "rap.remote_workspace_frame_batch.v1", + "probe_only": true, + "service_class": FabricServiceClassRemoteWorkspace, + "channel_class": FabricServiceChannelInteractive, + "adapter_contract_id": "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + "frames": []map[string]any{ + {"channel": "input", "direction": "client_to_adapter", "payload_encoding": "none", "payload_length": 0, "droppable": true}, + }, + } + tests := []struct { + name string + mutate func(map[string]any) + wantErr string + }{ + { + name: "production payload forwarding disabled", + mutate: func(item map[string]any) { + item["probe_only"] = false + }, + wantErr: "remote workspace payload forwarding is not implemented", + }, + { + name: "unknown logical channel", + mutate: func(item map[string]any) { + item["frames"] = []map[string]any{{"channel": "unknown", "direction": "client_to_adapter"}} + }, + wantErr: "unsupported remote workspace adapter frame channel", + }, + { + name: "wrong direction", + mutate: func(item map[string]any) { + item["frames"] = []map[string]any{{"channel": "display", "direction": "client_to_adapter"}} + }, + wantErr: "unsupported remote workspace adapter frame direction", + }, + { + name: "service mismatch", + mutate: func(item map[string]any) { + item["service_class"] = FabricServiceClassVPNPackets + }, + wantErr: "remote workspace frame batch service class mismatch", + }, + { + name: "channel mismatch", + mutate: func(item map[string]any) { + item["channel_class"] = FabricServiceChannelReliable + }, + wantErr: "remote workspace 
frame batch channel class mismatch", + }, + { + name: "unsupported encoding", + mutate: func(item map[string]any) { + item["frames"] = []map[string]any{{"channel": "input", "direction": "client_to_adapter", "payload_encoding": "raw"}} + }, + wantErr: "unsupported remote workspace frame payload encoding", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + item := map[string]any{} + for key, value := range valid { + item[key] = value + } + tt.mutate(item) + raw, err := json.Marshal(item) + if err != nil { + t.Fatalf("marshal frame batch: %v", err) + } + _, err = validateRemoteWorkspaceFrameBatchProbe(raw, FabricServiceChannelInteractive) + if err == nil || !strings.Contains(err.Error(), tt.wantErr) { + t.Fatalf("err = %v, want contains %q", err, tt.wantErr) + } + }) + } +} + +func TestRemoteWorkspaceFrameProbeSinkAppliesBoundedQueuePolicy(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + droppable := RemoteWorkspaceFrameBatchDelivery{ + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: "rap-rw-adapter-session-test", + } + for i := 0; i < DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity+3; i++ { + droppable.Frames = append(droppable.Frames, RemoteWorkspaceFrameProbeRecord{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }) + } + receipt, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), droppable) + if err != nil { + t.Fatalf("accept droppable overflow: %v", err) + } + if !receipt.Accepted || receipt.AcceptedFrames != DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity || receipt.DroppedFrames != 3 || receipt.AckedFrames != receipt.AcceptedFrames || receipt.QueueDepth != 0 { + t.Fatalf("droppable overflow receipt = %+v", receipt) + } + report := sink.Report(time.Unix(10, 0).UTC()) + if report["queue_capacity"] != 
DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity || + report["queue_depth"] != 0 || + report["total_accepted_frames"] != int64(DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity) || + report["total_dropped_frames"] != int64(3) || + report["total_acked_frames"] != int64(DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity) || + report["active_session_count"] != 1 || + report["session_created_total"] != int64(1) || + report["session_bound_total"] != int64(1) || + report["current_session_lifecycle_state"] != "probe_bound" || + report["current_session_delivery_count"] != int64(1) || + report["current_session_dropped_frames"] != int64(3) { + t.Fatalf("droppable overflow report = %+v", report) + } + + reliable := RemoteWorkspaceFrameBatchDelivery{ + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: "rap-rw-adapter-session-test", + } + for i := 0; i < DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity+1; i++ { + reliable.Frames = append(reliable.Frames, RemoteWorkspaceFrameProbeRecord{ + Channel: "input", + Direction: "client_to_adapter", + Droppable: false, + }) + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), reliable); err == nil || !strings.Contains(err.Error(), "backpressure") { + t.Fatalf("reliable overflow err = %v, want backpressure", err) + } + report = sink.Report(time.Unix(11, 0).UTC()) + if report["backpressure_count"] != int64(1) || + report["last_rejected_frame_count"] != DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity+1 || + report["last_rejected_queue_capacity"] != DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity || + report["last_rejected_queue_depth"] != DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity || + report["last_rejected_adapter_session_id"] != "rap-rw-adapter-session-test" || + report["last_rejected_channel_class"] != FabricServiceChannelInteractive || + 
report["last_rejected_adapter_contract_id"] != "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1" || + report["session_backpressure_total"] != int64(1) || + report["current_session_lifecycle_state"] != "backpressure" || + report["current_session_backpressure_count"] != int64(1) { + t.Fatalf("backpressure report = %+v", report) + } + report = sink.Report(time.Now().UTC().Add(DefaultRemoteWorkspaceFrameProbeSinkSessionTTL + time.Second)) + if report["active_session_count"] != 0 || + report["session_expired_total"] != int64(1) || + report["session_closed_total"] != int64(1) { + t.Fatalf("expired lifecycle report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionControlEndpointClosesSession(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: "rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa", + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + body := bytes.NewReader([]byte(`{"action":"close","reason":"unit test close"}`)) + controlURL := server.URL + "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa/control" + resp, err := http.Post(controlURL, "application/json", body) + if err != nil { + t.Fatalf("post control: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + raw, _ := io.ReadAll(resp.Body) + 
t.Fatalf("status = %d body=%s", resp.StatusCode, string(raw)) + } + var result RemoteWorkspaceAdapterSessionControlResult + if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { + t.Fatalf("decode control result: %v", err) + } + if !result.Accepted || + result.Action != "close" || + result.AdapterSessionID != "rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa" || + result.PreviousState != "probe_bound" || + result.SessionState != "closed" || + result.ActiveSessions != 0 { + t.Fatalf("control result = %+v", result) + } + report := sink.Report(time.Now().UTC()) + if report["active_session_count"] != 0 || + report["session_control_total"] != int64(1) || + report["session_closed_total"] != int64(1) || + report["last_controlled_adapter_session_id"] != "rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa" || + report["last_session_control_action"] != "close" || + report["last_session_control_state"] != "closed" { + t.Fatalf("control report = %+v", report) + } + + resp, err = http.Post(controlURL, "application/json", bytes.NewReader([]byte(`{"action":"close","reason":"repeat close"}`))) + if err != nil { + t.Fatalf("post repeat control: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + raw, _ := io.ReadAll(resp.Body) + t.Fatalf("repeat status = %d body=%s", resp.StatusCode, string(raw)) + } + if err := json.NewDecoder(resp.Body).Decode(&result); err != nil { + t.Fatalf("decode repeat control result: %v", err) + } + if result.PreviousState != "closed" || result.SessionState != "closed" { + t.Fatalf("repeat control result = %+v", result) + } + report = sink.Report(time.Now().UTC()) + if report["session_control_total"] != int64(2) || report["session_closed_total"] != int64(1) { + t.Fatalf("repeat control report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionControlRejectsInvalidRequests(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: 
sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + path string + body string + statusCode int + want string + }{ + { + name: "unknown action", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa/control", + body: `{"action":"launch"}`, + statusCode: http.StatusBadRequest, + want: "unsupported remote workspace adapter session control action", + }, + { + name: "invalid id", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-nothex/control", + body: `{"action":"close"}`, + statusCode: http.StatusBadRequest, + want: "invalid remote workspace adapter session id", + }, + { + name: "unknown session", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-bbbbbbbbbbbbbbbbbbbbbbbb/control", + body: `{"action":"close"}`, + statusCode: http.StatusBadRequest, + want: "remote workspace adapter session not found", + }, + { + name: "bad json", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa/control", + body: `{`, + statusCode: http.StatusBadRequest, + want: "invalid remote workspace adapter session control payload", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Post(server.URL+tt.path, "application/json", strings.NewReader(tt.body)) + if err != nil { + t.Fatalf("post control: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != tt.statusCode || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want %d containing %q", resp.StatusCode, string(raw), tt.statusCode, tt.want) + } + }) + } +} + +func TestRemoteWorkspaceAdapterSessionSnapshotEndpointListsActiveAndTerminalSessions(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: 
FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: "rap-rw-adapter-session-cccccccccccccccccccccccc", + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions?include_terminal=true") + if err != nil { + t.Fatalf("get snapshot: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + raw, _ := io.ReadAll(resp.Body) + t.Fatalf("status = %d body=%s", resp.StatusCode, string(raw)) + } + var snapshot RemoteWorkspaceAdapterSessionSnapshot + if err := json.NewDecoder(resp.Body).Decode(&snapshot); err != nil { + t.Fatalf("decode snapshot: %v", err) + } + if snapshot.SchemaVersion != "rap.remote_workspace_adapter_session_snapshot.v1" || + snapshot.ActiveSessionCount != 1 || + len(snapshot.Sessions) != 1 || + snapshot.Sessions[0].AdapterSessionID != "rap-rw-adapter-session-cccccccccccccccccccccccc" || + snapshot.Sessions[0].SessionState != "probe_bound" { + t.Fatalf("active snapshot = %+v", snapshot) + } + + controlBody := bytes.NewReader([]byte(`{"action":"close","reason":"snapshot terminal"}`)) + controlResp, err := http.Post(server.URL+"/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-cccccccccccccccccccccccc/control", "application/json", controlBody) + if err != nil { + t.Fatalf("post control: %v", err) + } + controlResp.Body.Close() + if controlResp.StatusCode != http.StatusOK { + t.Fatalf("control status = %d", controlResp.StatusCode) + } + resp, err = http.Get(server.URL + 
"/mesh/v1/remote-workspace/adapter-sessions?include_terminal=true") + if err != nil { + t.Fatalf("get terminal snapshot: %v", err) + } + defer resp.Body.Close() + if err := json.NewDecoder(resp.Body).Decode(&snapshot); err != nil { + t.Fatalf("decode terminal snapshot: %v", err) + } + if snapshot.ActiveSessionCount != 0 || + snapshot.TerminalSessionCount != 1 || + len(snapshot.TerminalSessions) != 1 || + snapshot.TerminalSessions[0].AdapterSessionID != "rap-rw-adapter-session-cccccccccccccccccccccccc" || + snapshot.TerminalSessions[0].SessionState != "closed" { + t.Fatalf("terminal snapshot = %+v", snapshot) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointReadsAndDrainsEvents(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-dddddddddddddddddddddddd" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + reliable := delivery + reliable.Frames = nil + for i := 0; i < DefaultRemoteWorkspaceFrameProbeSinkQueueCapacity+1; i++ { + reliable.Frames = append(reliable.Frames, RemoteWorkspaceFrameProbeRecord{ + Channel: "input", + Direction: "client_to_adapter", + Droppable: false, + }) + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), reliable); err == nil || !strings.Contains(err.Error(), "backpressure") { + t.Fatalf("reliable overflow err = %v, want backpressure", err) + } + server := 
httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?limit=10") + if err != nil { + t.Fatalf("get mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode mailbox: %v", err) + } + if mailbox.SchemaVersion != "rap.remote_workspace_adapter_mailbox_snapshot.v1" || + mailbox.MailboxDepth != 2 || + mailbox.DepthAfter != 2 || + len(mailbox.Events) != 2 || + mailbox.Events[0].Event != "frame_batch_probe_delivered" || + mailbox.Events[1].Event != "backpressure" { + t.Fatalf("mailbox = %+v", mailbox) + } + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?drain=true&limit=10") + if err != nil { + t.Fatalf("drain mailbox: %v", err) + } + defer resp.Body.Close() + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode drained mailbox: %v", err) + } + if !mailbox.Drained || mailbox.MailboxDepth != 2 || mailbox.DepthAfter != 0 || mailbox.DrainedTotal != 2 { + t.Fatalf("drained mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["current_session_mailbox_depth"] != 0 || + report["current_session_mailbox_enqueued_total"] != int64(2) || + report["current_session_mailbox_drained_total"] != int64(2) { + t.Fatalf("mailbox report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointRejectsInvalidRequests(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-eeeeeeeeeeeeeeeeeeeeeeee" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + 
AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + path string + statusCode int + want string + }{ + { + name: "invalid id", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-nothex/mailbox", + statusCode: http.StatusBadRequest, + want: "invalid remote workspace adapter session id", + }, + { + name: "unknown session", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-ffffffffffffffffffffffff/mailbox", + statusCode: http.StatusBadRequest, + want: "remote workspace adapter session not found", + }, + { + name: "bad limit", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?limit=bad", + statusCode: http.StatusBadRequest, + want: "invalid remote workspace adapter session mailbox limit", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + tt.path) + if err != nil { + t.Fatalf("get mailbox: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != tt.statusCode || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want %d containing %q", resp.StatusCode, string(raw), tt.statusCode, tt.want) + } + }) + } + report := sink.Report(time.Now().UTC()) + if report["current_session_mailbox_depth"] != 1 || report["mailbox_drained_total"] != int64(0) { + t.Fatalf("invalid mailbox requests mutated report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointDrainsWithLimit(t *testing.T) 
{ + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-111111111111111111111111" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < 3; i++ { + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?drain=true&limit=1") + if err != nil { + t.Fatalf("drain mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode drained mailbox: %v", err) + } + if !mailbox.Drained || mailbox.MailboxDepth != 3 || mailbox.DepthAfter != 2 || len(mailbox.Events) != 1 || mailbox.DrainedTotal != 1 { + t.Fatalf("partial drained mailbox = %+v", mailbox) + } + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?limit=10") + if err != nil { + t.Fatalf("read mailbox: %v", err) + } + defer resp.Body.Close() + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode mailbox: %v", err) + } + if mailbox.MailboxDepth != 2 || mailbox.DepthAfter != 2 || len(mailbox.Events) != 2 || mailbox.DrainedTotal != 1 { + t.Fatalf("remaining mailbox = %+v", mailbox) + } +} + +func 
TestRemoteWorkspaceAdapterSessionMailboxDropsOldestWhenFull(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-eeeeeeeeeeeeeeeeeeeeeeee" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < DefaultRemoteWorkspaceAdapterMailboxCapacity+2; i++ { + delivery.ResourceID = fmt.Sprintf("workspace-%d", i) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + + mailbox, err := sink.ReadAdapterSessionMailbox(sessionID, false, 50, 0, time.Now().UTC()) + if err != nil { + t.Fatalf("read mailbox: %v", err) + } + if mailbox.MailboxCapacity != DefaultRemoteWorkspaceAdapterMailboxCapacity || + mailbox.MailboxDepth != DefaultRemoteWorkspaceAdapterMailboxCapacity || + mailbox.DepthAfter != DefaultRemoteWorkspaceAdapterMailboxCapacity || + mailbox.EnqueuedTotal != int64(DefaultRemoteWorkspaceAdapterMailboxCapacity+2) || + mailbox.DroppedTotal != 2 || + len(mailbox.Events) != DefaultRemoteWorkspaceAdapterMailboxCapacity { + t.Fatalf("mailbox overflow snapshot = %+v", mailbox) + } + if mailbox.Events[0].Sequence != 3 || + mailbox.Events[len(mailbox.Events)-1].Sequence != int64(DefaultRemoteWorkspaceAdapterMailboxCapacity+2) { + t.Fatalf("mailbox event window = first %d last %d", mailbox.Events[0].Sequence, mailbox.Events[len(mailbox.Events)-1].Sequence) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_dropped_total"] != int64(2) || + report["current_session_mailbox_dropped_total"] != 
int64(2) { + t.Fatalf("mailbox drop report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxGuardrailsAndClosedSession(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-ffffffffffffffffffffffff" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-nothex/mailbox") + if err != nil { + t.Fatalf("get invalid mailbox: %v", err) + } + resp.Body.Close() + if resp.StatusCode != http.StatusBadRequest { + t.Fatalf("invalid mailbox status = %d, want 400", resp.StatusCode) + } + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?limit=0") + if err != nil { + t.Fatalf("get invalid limit mailbox: %v", err) + } + resp.Body.Close() + if resp.StatusCode != http.StatusBadRequest { + t.Fatalf("invalid limit mailbox status = %d, want 400", resp.StatusCode) + } + + controlBody := bytes.NewReader([]byte(`{"action":"close","reason":"mailbox closed"}`)) + controlResp, err := http.Post(server.URL+"/mesh/v1/remote-workspace/adapter-sessions/"+sessionID+"/control", "application/json", controlBody) + if err != nil { + t.Fatalf("post control: %v", err) + } + controlResp.Body.Close() + if 
controlResp.StatusCode != http.StatusOK { + t.Fatalf("control status = %d", controlResp.StatusCode) + } + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox") + if err != nil { + t.Fatalf("get closed mailbox: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusBadRequest { + t.Fatalf("closed mailbox status = %d, want 400", resp.StatusCode) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointLongPollsUntilEvent(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-222222222222222222222222" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + if _, err := sink.ReadAdapterSessionMailbox(sessionID, true, 10, 0, time.Now().UTC()); err != nil { + t.Fatalf("drain initial mailbox: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + result := make(chan RemoteWorkspaceAdapterMailboxSnapshot, 1) + errs := make(chan error, 1) + go func() { + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?wait_ms=250&limit=10") + if err != nil { + errs <- err + return + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + raw, _ := io.ReadAll(resp.Body) + errs <- fmt.Errorf("status=%d body=%s", resp.StatusCode, string(raw)) + return + } + var mailbox 
RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + errs <- err + return + } + result <- mailbox + }() + + time.Sleep(50 * time.Millisecond) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept delayed frame batch: %v", err) + } + select { + case err := <-errs: + t.Fatalf("long poll failed: %v", err) + case mailbox := <-result: + if !mailbox.Waited || mailbox.WaitTimeout || mailbox.WaitMs != 250 || mailbox.Empty || + mailbox.MailboxDepth != 1 || len(mailbox.Events) != 1 || mailbox.Events[0].Event != "frame_batch_probe_delivered" { + t.Fatalf("long-poll mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(1) || + report["mailbox_wait_total"] != int64(1) || + report["mailbox_wait_timeout_total"] != int64(0) || + report["mailbox_empty_read_total"] != int64(0) || + report["current_session_mailbox_wait_total"] != int64(1) || + report["current_session_last_mailbox_wait_ms"] != 250 { + t.Fatalf("long-poll report = %+v", report) + } + case <-time.After(time.Second): + t.Fatal("timed out waiting for long-poll mailbox") + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointReturnsEmptyAfterWaitTimeout(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-333333333333333333333333" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); 
err != nil { + t.Fatalf("accept frame batch: %v", err) + } + if _, err := sink.ReadAdapterSessionMailbox(sessionID, true, 10, 0, time.Now().UTC()); err != nil { + t.Fatalf("drain initial mailbox: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?wait_ms=20&limit=10") + if err != nil { + t.Fatalf("get mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode mailbox: %v", err) + } + if !mailbox.Empty || !mailbox.Waited || !mailbox.WaitTimeout || mailbox.WaitMs != 20 || + mailbox.MailboxDepth != 0 || mailbox.DepthAfter != 0 || len(mailbox.Events) != 0 { + t.Fatalf("empty timeout mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(1) || + report["mailbox_wait_total"] != int64(1) || + report["mailbox_wait_timeout_total"] != int64(1) || + report["mailbox_empty_read_total"] != int64(1) || + report["current_session_mailbox_read_total"] != int64(1) || + report["current_session_mailbox_wait_timeout_total"] != int64(1) || + report["current_session_mailbox_empty_read_total"] != int64(1) || + report["current_session_last_mailbox_wait_timeout"] != true || + report["current_session_last_mailbox_empty"] != true { + t.Fatalf("empty timeout report = %+v", report) + } + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?wait_ms=bad") + if err != nil { + t.Fatalf("get bad wait mailbox: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), "invalid remote workspace adapter session mailbox wait") { + t.Fatalf("bad wait status=%d body=%s", resp.StatusCode, string(raw)) + 
} +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointFiltersAfterSequence(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < 3; i++ { + delivery.ResourceID = fmt.Sprintf("workspace-%d", i) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?after_sequence=1&limit=10") + if err != nil { + t.Fatalf("get filtered mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode filtered mailbox: %v", err) + } + if mailbox.AfterSequence != 1 || + mailbox.SkippedCount != 1 || + mailbox.ReturnedCount != 2 || + mailbox.MailboxDepth != 3 || + mailbox.DepthAfter != 3 || + mailbox.Empty || + len(mailbox.Events) != 2 || + mailbox.Events[0].Sequence != 2 || + mailbox.Events[1].Sequence != 3 { + t.Fatalf("filtered mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(1) || + report["current_session_mailbox_depth"] != 3 { + t.Fatalf("filtered mailbox report = %+v", report) + } +} + +func 
TestRemoteWorkspaceAdapterSessionMailboxEndpointLongPollsAfterSequence(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-bbbbbbbbbbbbbbbbbbbbbbbb" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept initial frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + result := make(chan RemoteWorkspaceAdapterMailboxSnapshot, 1) + errs := make(chan error, 1) + go func() { + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?after_sequence=1&wait_ms=250&limit=10") + if err != nil { + errs <- err + return + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusOK { + raw, _ := io.ReadAll(resp.Body) + errs <- fmt.Errorf("status=%d body=%s", resp.StatusCode, string(raw)) + return + } + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + errs <- err + return + } + result <- mailbox + }() + + time.Sleep(50 * time.Millisecond) + delivery.ResourceID = "workspace-delayed" + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept delayed frame batch: %v", err) + } + select { + case err := <-errs: + t.Fatalf("after-sequence long poll failed: %v", err) + case mailbox := <-result: + if mailbox.AfterSequence != 1 || + 
!mailbox.Waited || + mailbox.WaitTimeout || + mailbox.ReturnedCount != 1 || + len(mailbox.Events) != 1 || + mailbox.Events[0].Sequence != 2 || + mailbox.MailboxDepth != 2 || + mailbox.DepthAfter != 2 { + t.Fatalf("after-sequence long-poll mailbox = %+v", mailbox) + } + case <-time.After(time.Second): + t.Fatal("timed out waiting for after-sequence long-poll mailbox") + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointRejectsInvalidAfterSequence(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-cccccccccccccccccccccccc" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + path string + want string + }{ + { + name: "bad after sequence", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?after_sequence=bad", + want: "invalid remote workspace adapter session mailbox after sequence", + }, + { + name: "drain after sequence", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?after_sequence=1&drain=true", + want: "remote workspace adapter session mailbox after sequence cannot drain", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + tt.path) + if err != nil { + t.Fatalf("get mailbox: %v", err) + 
} + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want 400 containing %q", resp.StatusCode, string(raw), tt.want) + } + }) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(0) || + report["current_session_mailbox_depth"] != 1 { + t.Fatalf("invalid after-sequence requests mutated report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxEndpointResumesFromConsumerCursor(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-dddddddddddddddddddddddd" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < 3; i++ { + delivery.ResourceID = fmt.Sprintf("workspace-%d", i) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&limit=2") + if err != nil { + t.Fatalf("get consumer mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode consumer mailbox: %v", err) + } + if mailbox.ConsumerCheckpointSequence != 2 { + t.Fatalf("checkpoint 
mailbox = %+v", mailbox) + } + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&resume_from=checkpoint&limit=10") + if err != nil { + t.Fatalf("resume checkpoint mailbox: %v", err) + } + defer resp.Body.Close() + mailbox = RemoteWorkspaceAdapterMailboxSnapshot{} + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode resume checkpoint mailbox: %v", err) + } + if mailbox.ResumeFrom != "checkpoint" || + mailbox.ResumeSequence != 2 || + mailbox.AfterSequence != 2 || + mailbox.SkippedCount != 2 || + mailbox.ReturnedCount != 1 || + len(mailbox.Events) != 1 || + mailbox.Events[0].Sequence != 3 || + mailbox.ConsumerCheckpointSequence != 3 { + t.Fatalf("resume checkpoint mailbox = %+v", mailbox) + } + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&ack_sequence=2&resume_from=ack&limit=10") + if err != nil { + t.Fatalf("resume ack mailbox: %v", err) + } + defer resp.Body.Close() + mailbox = RemoteWorkspaceAdapterMailboxSnapshot{} + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode resume ack mailbox: %v", err) + } + if mailbox.ResumeFrom != "ack" || + mailbox.ResumeSequence != 0 || + mailbox.AfterSequence != 0 || + mailbox.ReturnedCount != 3 || + mailbox.ConsumerAckSequence != 2 { + t.Fatalf("resume ack mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_resume_read_total"] != int64(2) || + report["mailbox_after_sequence_read_total"] != int64(1) || + report["mailbox_returned_total"] != int64(6) || + report["mailbox_skipped_total"] != int64(2) || + report["last_mailbox_resume_from"] != "ack" || + report["last_mailbox_resume_sequence"] != int64(0) || + report["last_mailbox_resume_consumer_id"] != "rdp-worker-probe" || + report["last_mailbox_after_sequence"] != int64(0) || + 
report["last_mailbox_skipped_count"] != 0 || + report["last_mailbox_returned_count"] != 3 || + report["current_session_mailbox_resume_read_total"] != int64(2) || + report["current_session_mailbox_after_sequence_read_total"] != int64(1) || + report["current_session_mailbox_returned_total"] != int64(6) || + report["current_session_mailbox_skipped_total"] != int64(2) || + report["current_session_last_mailbox_resume_from"] != "ack" || + report["current_session_last_mailbox_resume_sequence"] != int64(0) || + report["current_session_last_mailbox_resume_consumer_id"] != "rdp-worker-probe" || + report["current_session_last_mailbox_after_sequence"] != int64(0) || + report["current_session_last_mailbox_skipped_count"] != 0 || + report["current_session_last_mailbox_returned_count"] != 3 { + t.Fatalf("invalid resume telemetry report = %+v", report) + } + readiness, ok := report["adapter_runtime_readiness"].(map[string]any) + if !ok { + t.Fatalf("adapter runtime readiness missing from report = %+v", report) + } + if readiness["schema_version"] != "rap.remote_workspace_adapter_runtime_readiness.v1" || + readiness["status"] != "cursor_ready" || + readiness["diagnostic_state"] != "adapter_cursor_ready" || + readiness["ready"] != true || + readiness["adapter_session_id"] != sessionID || + readiness["session_state"] != "probe_bound" || + readiness["mailbox_depth"] != 3 || + readiness["consumer_count"] != 1 || + readiness["last_consumer_id"] != "rdp-worker-probe" || + readiness["last_consumer_checkpoint_sequence"] != int64(3) || + readiness["last_consumer_ack_sequence"] != int64(2) || + readiness["last_consumer_lag_count"] != 1 || + readiness["last_resume_from"] != "ack" || + readiness["last_resume_sequence"] != int64(0) || + readiness["last_resume_consumer_id"] != "rdp-worker-probe" || + readiness["last_returned_count"] != 3 || + readiness["last_skipped_count"] != 0 { + t.Fatalf("invalid adapter runtime readiness = %+v", readiness) + } +} + +func 
TestRemoteWorkspaceAdapterSessionMailboxEndpointRejectsInvalidResumeFrom(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-eeeeeeeeeeeeeeeeeeeeeeee" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + query string + want string + }{ + { + name: "invalid resume cursor", + query: "consumer_id=rdp-worker-probe&resume_from=bad", + want: "invalid remote workspace adapter mailbox resume cursor", + }, + { + name: "resume without consumer", + query: "resume_from=ack", + want: "remote workspace adapter mailbox consumer required for resume", + }, + { + name: "resume with after", + query: "consumer_id=rdp-worker-probe&resume_from=ack&after_sequence=1", + want: "remote workspace adapter mailbox resume cannot combine with after sequence", + }, + { + name: "resume with drain", + query: "consumer_id=rdp-worker-probe&resume_from=ack&drain=true", + want: "remote workspace adapter mailbox resume cannot drain", + }, + { + name: "resume with reset", + query: "consumer_id=rdp-worker-probe&resume_from=ack&reset_consumer=true", + want: "remote workspace adapter mailbox resume cannot reset consumer", + }, + { + name: "unknown consumer", + query: "consumer_id=rdp-worker-probe&resume_from=ack", + want: "remote workspace adapter mailbox 
consumer not found", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?" + tt.query) + if err != nil { + t.Fatalf("get mailbox: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want 400 containing %q", resp.StatusCode, string(raw), tt.want) + } + }) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(0) || + report["mailbox_consumer_read_total"] != int64(0) || + report["current_session_mailbox_depth"] != 1 { + t.Fatalf("invalid resume requests mutated report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxConsumerCheckpointAndAck(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-444444444444444444444444" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < 2; i++ { + delivery.ResourceID = fmt.Sprintf("workspace-%d", i) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&limit=10") + if err != nil { + 
t.Fatalf("get consumer mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode consumer mailbox: %v", err) + } + if mailbox.ConsumerID != "rdp-worker-probe" || + mailbox.ConsumerReadTotal != 1 || + mailbox.ConsumerAckTotal != 0 || + mailbox.ConsumerCheckpointSequence != mailbox.Events[len(mailbox.Events)-1].Sequence || + mailbox.ConsumerAckSequence != 0 || + mailbox.ConsumerLagCount != 2 || + mailbox.ConsumerCount != 1 || + mailbox.ConsumerCapacity != DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity || + !mailbox.ConsumerCreated || + mailbox.ConsumerCreatedAt == "" || + mailbox.ConsumerLastReadAt == "" { + t.Fatalf("consumer mailbox = %+v", mailbox) + } + + resp, err = http.Get(fmt.Sprintf("%s/mesh/v1/remote-workspace/adapter-sessions/%s/mailbox?consumer_id=rdp-worker-probe&ack_sequence=%d&limit=10", server.URL, sessionID, mailbox.ConsumerCheckpointSequence)) + if err != nil { + t.Fatalf("ack consumer mailbox: %v", err) + } + defer resp.Body.Close() + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode acked consumer mailbox: %v", err) + } + if mailbox.ConsumerReadTotal != 2 || + mailbox.ConsumerAckTotal != 1 || + mailbox.ConsumerAckSequence != mailbox.ConsumerCheckpointSequence || + mailbox.ConsumerLagCount != 0 { + t.Fatalf("acked consumer mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_consumer_count"] != 1 || + report["mailbox_consumer_read_total"] != int64(2) || + report["mailbox_consumer_ack_total"] != int64(1) || + report["last_mailbox_consumer_id"] != "rdp-worker-probe" || + report["last_mailbox_consumer_checkpoint_sequence"] != mailbox.ConsumerCheckpointSequence || + report["last_mailbox_consumer_ack_sequence"] != mailbox.ConsumerAckSequence || + report["current_session_mailbox_consumer_count"] != 1 || + 
report["current_session_mailbox_consumer_read_total"] != int64(2) || + report["current_session_mailbox_consumer_ack_total"] != int64(1) { + t.Fatalf("consumer report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxConsumerRejectsInvalidCursorInputs(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-555555555555555555555555" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + query string + want string + }{ + { + name: "invalid consumer", + query: "consumer_id=bad%2Fconsumer", + want: "invalid remote workspace adapter mailbox consumer", + }, + { + name: "invalid ack", + query: "consumer_id=rdp-worker-probe&ack_sequence=-1", + want: "invalid remote workspace adapter mailbox ack sequence", + }, + { + name: "invalid reset", + query: "consumer_id=rdp-worker-probe&reset_consumer=maybe", + want: "invalid remote workspace adapter mailbox consumer reset", + }, + { + name: "reset without consumer", + query: "reset_consumer=true", + want: "remote workspace adapter mailbox consumer required for reset", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?" 
+ tt.query) + if err != nil { + t.Fatalf("get mailbox: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want 400 containing %q", resp.StatusCode, string(raw), tt.want) + } + }) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_read_total"] != int64(0) || + report["mailbox_consumer_read_total"] != int64(0) || + report["current_session_mailbox_depth"] != 1 { + t.Fatalf("invalid consumer requests mutated report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxConsumerResetCursor(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-777777777777777777777777" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&ack_sequence=1") + if err != nil { + t.Fatalf("ack consumer mailbox: %v", err) + } + defer resp.Body.Close() + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode ack mailbox: %v", err) + } + if mailbox.ConsumerReadTotal != 1 || mailbox.ConsumerAckTotal 
!= 1 || mailbox.ConsumerAckSequence != 1 { + t.Fatalf("acked consumer mailbox = %+v", mailbox) + } + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&reset_consumer=true") + if err != nil { + t.Fatalf("reset consumer mailbox: %v", err) + } + defer resp.Body.Close() + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + t.Fatalf("decode reset mailbox: %v", err) + } + if !mailbox.ConsumerReset || + !mailbox.ConsumerCreated || + mailbox.ConsumerReadTotal != 1 || + mailbox.ConsumerAckTotal != 0 || + mailbox.ConsumerAckSequence != 0 || + mailbox.ConsumerResetTotal != 1 || + mailbox.ConsumerCount != 1 { + t.Fatalf("reset consumer mailbox = %+v", mailbox) + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_consumer_reset_total"] != int64(1) || + report["current_session_mailbox_consumer_reset_total"] != int64(1) { + t.Fatalf("reset consumer report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxConsumerStateIsBounded(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-666666666666666666666666" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + var mailbox RemoteWorkspaceAdapterMailboxSnapshot + for i := 0; i < 
DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity+1; i++ { + resp, err := http.Get(fmt.Sprintf("%s/mesh/v1/remote-workspace/adapter-sessions/%s/mailbox?consumer_id=consumer-%02d", server.URL, sessionID, i)) + if err != nil { + t.Fatalf("get consumer mailbox %d: %v", i, err) + } + if resp.StatusCode != http.StatusOK { + resp.Body.Close() + t.Fatalf("consumer mailbox %d status = %d", i, resp.StatusCode) + } + if err := json.NewDecoder(resp.Body).Decode(&mailbox); err != nil { + resp.Body.Close() + t.Fatalf("decode consumer mailbox %d: %v", i, err) + } + resp.Body.Close() + } + report := sink.Report(time.Now().UTC()) + if report["mailbox_consumer_count"] != DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity || + report["current_session_mailbox_consumer_count"] != DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity || + report["mailbox_consumer_evicted_total"] != int64(1) || + report["current_session_mailbox_consumer_evicted_total"] != int64(1) || + !mailbox.ConsumerEvicted || + mailbox.ConsumerEvictedTotal != 1 || + mailbox.ConsumerCount != DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity { + t.Fatalf("bounded consumer report = %+v", report) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxConsumerSnapshotIsReadOnly(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-888888888888888888888888" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } 
+ server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=consumer-b&ack_sequence=1") + if err != nil { + t.Fatalf("ack consumer b: %v", err) + } + resp.Body.Close() + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=consumer-a") + if err != nil { + t.Fatalf("read consumer a: %v", err) + } + resp.Body.Close() + reportBefore := sink.Report(time.Now().UTC()) + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/consumers?limit=1") + if err != nil { + t.Fatalf("get consumer snapshot: %v", err) + } + defer resp.Body.Close() + var snapshot RemoteWorkspaceAdapterMailboxConsumerSnapshot + if err := json.NewDecoder(resp.Body).Decode(&snapshot); err != nil { + t.Fatalf("decode consumer snapshot: %v", err) + } + if snapshot.SchemaVersion != "rap.remote_workspace_adapter_mailbox_consumer_snapshot.v1" || + snapshot.AdapterSessionID != sessionID || + snapshot.ConsumerCapacity != DefaultRemoteWorkspaceAdapterMailboxConsumerCapacity || + snapshot.ConsumerCount != 2 || + snapshot.ConsumerReadTotal != 2 || + snapshot.ConsumerAckTotal != 1 || + len(snapshot.Consumers) != 1 || + snapshot.Consumers[0].ConsumerID != "consumer-a" || + snapshot.Consumers[0].CheckpointSequence != 1 || + snapshot.Consumers[0].LagCount != 1 { + t.Fatalf("consumer snapshot = %+v", snapshot) + } + reportAfter := sink.Report(time.Now().UTC()) + if reportAfter["mailbox_read_total"] != reportBefore["mailbox_read_total"] || + reportAfter["mailbox_consumer_read_total"] != reportBefore["mailbox_consumer_read_total"] || + reportAfter["mailbox_consumer_ack_total"] != reportBefore["mailbox_consumer_ack_total"] { + t.Fatalf("consumer snapshot mutated report before=%+v after=%+v", reportBefore, reportAfter) + } +} + +func 
TestRemoteWorkspaceAdapterSessionMailboxConsumerSnapshotRejectsInvalidRequests(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-999999999999999999999999" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + path string + want string + }{ + { + name: "bad limit", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/consumers?limit=bad", + want: "invalid remote workspace adapter mailbox consumer snapshot limit", + }, + { + name: "invalid id", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-nothex/mailbox/consumers", + want: "invalid remote workspace adapter session id", + }, + { + name: "unknown session", + path: "/mesh/v1/remote-workspace/adapter-sessions/rap-rw-adapter-session-aaaaaaaaaaaaaaaaaaaaaaaa/mailbox/consumers", + want: "remote workspace adapter session not found", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + tt.path) + if err != nil { + t.Fatalf("get consumer snapshot: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want 400 containing %q", 
resp.StatusCode, string(raw), tt.want) + } + }) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxPreflightIsReadOnly(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-abababababababababababab" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + for i := 0; i < 3; i++ { + delivery.ResourceID = fmt.Sprintf("workspace-preflight-%d", i) + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch %d: %v", i, err) + } + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + resp, err := http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&limit=2") + if err != nil { + t.Fatalf("seed checkpoint: %v", err) + } + resp.Body.Close() + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox?consumer_id=rdp-worker-probe&ack_sequence=1&limit=1") + if err != nil { + t.Fatalf("seed ack: %v", err) + } + resp.Body.Close() + reportBefore := sink.Report(time.Now().UTC()) + + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight?consumer_id=rdp-worker-probe&resume_from=ack&limit=1") + if err != nil { + t.Fatalf("get preflight ack: %v", err) + } + defer resp.Body.Close() + var preflight RemoteWorkspaceAdapterMailboxPreflightSnapshot + if err := json.NewDecoder(resp.Body).Decode(&preflight); err != nil { + 
t.Fatalf("decode preflight ack: %v", err) + } + if preflight.SchemaVersion != "rap.remote_workspace_adapter_mailbox_preflight.v1" || + preflight.AdapterSessionID != sessionID || + !preflight.ReadOnly || + preflight.ConsumerID != "rdp-worker-probe" || + preflight.ResumeFrom != "ack" || + preflight.ResumeSequence != 1 || + preflight.AfterSequence != 1 || + preflight.Limit != 1 || + preflight.MailboxDepth != 3 || + preflight.ConsumerCheckpointSequence != 2 || + preflight.ConsumerAckSequence != 1 || + preflight.ConsumerLagCount != 2 || + preflight.ExpectedAvailableCount != 2 || + preflight.ExpectedReturnedCount != 1 || + preflight.ExpectedSkippedCount != 1 || + preflight.FirstExpectedSequence != 2 || + preflight.LastExpectedSequence != 2 { + t.Fatalf("preflight ack = %+v", preflight) + } + resp, err = http.Get(server.URL + "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight?consumer_id=rdp-worker-probe&resume_from=checkpoint&limit=10") + if err != nil { + t.Fatalf("get preflight checkpoint: %v", err) + } + defer resp.Body.Close() + preflight = RemoteWorkspaceAdapterMailboxPreflightSnapshot{} + if err := json.NewDecoder(resp.Body).Decode(&preflight); err != nil { + t.Fatalf("decode preflight checkpoint: %v", err) + } + if preflight.ResumeFrom != "checkpoint" || + preflight.ResumeSequence != 2 || + preflight.ExpectedAvailableCount != 1 || + preflight.ExpectedReturnedCount != 1 || + preflight.ExpectedSkippedCount != 2 || + preflight.FirstExpectedSequence != 3 || + preflight.LastExpectedSequence != 3 { + t.Fatalf("preflight checkpoint = %+v", preflight) + } + reportAfter := sink.Report(time.Now().UTC()) + if reportAfter["mailbox_read_total"] != reportBefore["mailbox_read_total"] || + reportAfter["mailbox_consumer_read_total"] != reportBefore["mailbox_consumer_read_total"] || + reportAfter["mailbox_consumer_ack_total"] != reportBefore["mailbox_consumer_ack_total"] || + reportAfter["current_session_mailbox_consumer_read_total"] != 
reportBefore["current_session_mailbox_consumer_read_total"] || + reportAfter["current_session_mailbox_consumer_ack_total"] != reportBefore["current_session_mailbox_consumer_ack_total"] { + t.Fatalf("preflight mutated report before=%+v after=%+v", reportBefore, reportAfter) + } +} + +func TestRemoteWorkspaceAdapterSessionMailboxPreflightRejectsInvalidRequests(t *testing.T) { + sink := NewRemoteWorkspaceFrameProbeSink() + sessionID := "rap-rw-adapter-session-acacacacacacacacacacacac" + delivery := RemoteWorkspaceFrameBatchDelivery{ + ClusterID: "cluster-1", + ChannelID: "channel-rw", + ResourceID: "workspace-1", + ServiceClass: FabricServiceClassRemoteWorkspace, + ChannelClass: FabricServiceChannelInteractive, + AdapterContractID: "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + AdapterSessionID: sessionID, + Frames: []RemoteWorkspaceFrameProbeRecord{{ + Channel: "display", + Direction: "adapter_to_client", + Droppable: true, + }}, + } + if _, err := sink.AcceptRemoteWorkspaceFrameBatchProbe(context.Background(), delivery); err != nil { + t.Fatalf("accept frame batch: %v", err) + } + server := httptest.NewServer(Server{RemoteWorkspaceFrameSink: sink}.Handler()) + defer server.Close() + + tests := []struct { + name string + path string + want string + }{ + { + name: "missing consumer", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight", + want: "remote workspace adapter mailbox consumer required for preflight", + }, + { + name: "invalid resume", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight?consumer_id=consumer-a&resume_from=bogus", + want: "invalid remote workspace adapter mailbox resume cursor", + }, + { + name: "bad limit", + path: "/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight?consumer_id=consumer-a&limit=bad", + want: "invalid remote workspace adapter mailbox preflight limit", + }, + { + name: "unknown consumer", + path: 
"/mesh/v1/remote-workspace/adapter-sessions/" + sessionID + "/mailbox/preflight?consumer_id=consumer-a", + want: "remote workspace adapter mailbox consumer not found", + }, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + resp, err := http.Get(server.URL + tt.path) + if err != nil { + t.Fatalf("get preflight: %v", err) + } + defer resp.Body.Close() + raw, _ := io.ReadAll(resp.Body) + if resp.StatusCode != http.StatusBadRequest || !strings.Contains(string(raw), tt.want) { + t.Fatalf("status=%d body=%s, want 400 containing %q", resp.StatusCode, string(raw), tt.want) + } + }) + } +} + +func TestFabricServiceChannelVPNPacketIngressHonorsDisabledBackendRelayPolicy(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + var backendCalled bool + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + backendCalled = true + w.WriteHeader(http.StatusAccepted) + })) + defer backend.Close() + token := "rap_fsc_nobackend" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-1", + ClusterID: "cluster-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{ProductionChannelVPNPacket}, + TokenHash: fabricServiceChannelTokenHash(token), + ExpiresAt: time.Now().UTC().Add(time.Minute), + DataPlane: fabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: "fabric_primary", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: "disabled", + ProductionForwardingRequired: true, + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: "multi_flow_isolated", + RequiredFlowIsolationClasses: []string{"control", 
ProductionChannelVPNPacket}, + }, + } + payload.PrimaryRoute.RouteID = "route-signed" + payload.PrimaryRoute.Status = "authorized" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{sendErr: ErrRouteNotFound}, + BackendProxyBaseURL: backend.URL + "/api/v1", + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusServiceUnavailable { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusServiceUnavailable) + } + if backendCalled { + t.Fatal("backend relay was called despite disabled data-plane policy") + } +} + +func 
TestFabricServiceChannelVPNPacketWebSocketHonorsDisabledBackendRelayPolicy(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + backendCalled := make(chan struct{}, 1) + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + select { + case backendCalled <- struct{}{}: + default: + } + w.WriteHeader(http.StatusAccepted) + })) + defer backend.Close() + token := "rap_fsc_nobackend_ws" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-1", + ClusterID: "cluster-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{ProductionChannelVPNPacket}, + TokenHash: fabricServiceChannelTokenHash(token), + ExpiresAt: time.Now().UTC().Add(time.Minute), + DataPlane: fabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: "fabric_primary", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: "disabled", + ProductionForwardingRequired: true, + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: "multi_flow_isolated", + RequiredFlowIsolationClasses: []string{"control", ProductionChannelVPNPacket}, + }, + } + payload.PrimaryRoute.RouteID = "route-signed" + payload.PrimaryRoute.Status = "authorized" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: 
base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + violations := make(chan FabricServiceChannelAccessLogEntry, 2) + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{sendErr: ErrRouteNotFound}, + BackendProxyBaseURL: backend.URL + "/api/v1", + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + FabricServiceChannelLogger: func(entry FabricServiceChannelAccessLogEntry) { + if entry.Event == "fabric_service_channel_data_plane_violation" { + violations <- entry + } + }, + }.Handler()) + defer server.Close() + + wsURL := "ws" + strings.TrimPrefix(server.URL, "http") + "/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets/ws" + headers := http.Header{} + headers.Set("X-RAP-Service-Channel-Token", token) + headers.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + headers.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + conn, _, err := websocket.DefaultDialer.Dial(wsURL, headers) + if err != nil { + t.Fatalf("dial websocket: %v", err) + } + defer conn.Close() + + if err := conn.WriteMessage(websocket.BinaryMessage, encodeVPNIngressPacketBatch([][]byte{[]byte("packet")})); err != nil { + t.Fatalf("write packet batch: %v", err) + } + + select { + case entry := <-violations: + if entry.ViolationStatus != "fabric_route_send_failed_backend_fallback_blocked" || + entry.BackendRelayPolicy != "disabled" || + entry.ChannelID != "channel-1" || + entry.ResourceID != "vpn-1" { + t.Fatalf("violation = %+v", entry) + } + case <-time.After(2 * time.Second): + t.Fatal("blocked fallback violation was not logged") + } + select { + case <-backendCalled: + t.Fatal("backend relay was called despite 
disabled data-plane policy") + default: + } +} + +func TestFabricServiceChannelVPNPacketIngressRejectsSignedLeaseForDifferentEntry(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + token := "rap_fsc_signedtest" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-1", + ClusterID: "cluster-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "other-entry", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{ProductionChannelVPNPacket}, + TokenHash: fabricServiceChannelTokenHash(token), + ExpiresAt: time.Now().UTC().Add(time.Minute), + } + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: &recordingVPNPacketIngress{}, + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", 
base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusForbidden { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusForbidden) + } +} + +func TestFabricServiceChannelVPNPacketIngressFallsBackToBackendRelay(t *testing.T) { + var backendPath string + var backendBody []byte + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + backendPath = r.URL.Path + var err error + backendBody, err = io.ReadAll(r.Body) + if err != nil { + t.Errorf("read backend body: %v", err) + } + w.WriteHeader(http.StatusAccepted) + })) + defer backend.Close() + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{sendErr: ErrRouteNotFound}, + BackendProxyBaseURL: backend.URL + "/api/v1", + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", "rap_fsc_testtoken") + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + if backendPath != "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets" { + t.Fatalf("backend path = %s", backendPath) + } + if string(backendBody) != "packet" { + t.Fatalf("backend body = %q", string(backendBody)) + } +} + +func
TestFabricServiceChannelVPNPacketIngressUsesSignedDegradedFallback(t *testing.T) { + publicKey, privateKey, err := ed25519.GenerateKey(nil) + if err != nil { + t.Fatalf("generate key: %v", err) + } + token := "rap_fsc_degradedtest" + payload := fabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: "channel-1", + ClusterID: "cluster-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + Status: "degraded_fallback", + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: []string{ProductionChannelVPNPacket}, + TokenHash: fabricServiceChannelTokenHash(token), + ExpiresAt: time.Now().UTC().Add(time.Minute), + } + payload.PrimaryRoute.Status = "missing_route_intent" + rawPayload, err := json.Marshal(payload) + if err != nil { + t.Fatalf("marshal payload: %v", err) + } + canonical, err := authority.CanonicalJSON(rawPayload) + if err != nil { + t.Fatalf("canonical payload: %v", err) + } + signature := authority.Signature{ + SchemaVersion: authority.SignatureSchemaVersion, + Algorithm: authority.AlgorithmEd25519, + KeyFingerprint: authority.Fingerprint(publicKey), + Signature: base64.StdEncoding.EncodeToString(ed25519.Sign(privateKey, canonical)), + } + rawSignature, err := json.Marshal(signature) + if err != nil { + t.Fatalf("marshal signature: %v", err) + } + var backendBody []byte + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + if r.URL.Path != "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets" { + t.Fatalf("backend path = %s", r.URL.Path) + } + var err error + backendBody, err = io.ReadAll(r.Body) + if err != nil { + t.Fatalf("read backend body: %v", err) + } + w.WriteHeader(http.StatusAccepted) + })) + defer backend.Close() + ingress := &recordingVPNPacketIngress{} + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + 
VPNPacketIngress: ingress, + BackendProxyBaseURL: backend.URL + "/api/v1", + ClusterAuthorityPublicKey: base64.StdEncoding.EncodeToString(publicKey), + }.Handler()) + defer server.Close() + + req, err := http.NewRequest(http.MethodPost, server.URL+"/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets", bytes.NewReader([]byte("packet"))) + if err != nil { + t.Fatalf("new request: %v", err) + } + req.Header.Set("X-RAP-Service-Channel-Token", token) + req.Header.Set("X-RAP-Service-Channel-Authority-Payload", base64.RawURLEncoding.EncodeToString(rawPayload)) + req.Header.Set("X-RAP-Service-Channel-Authority-Signature", base64.RawURLEncoding.EncodeToString(rawSignature)) + resp, err := http.DefaultClient.Do(req) + if err != nil { + t.Fatalf("post service channel packet: %v", err) + } + defer resp.Body.Close() + if resp.StatusCode != http.StatusAccepted { + t.Fatalf("status = %d, want %d", resp.StatusCode, http.StatusAccepted) + } + if string(backendBody) != "packet" { + t.Fatalf("backend body = %q", string(backendBody)) + } + ingress.mu.Lock() + defer ingress.mu.Unlock() + if len(ingress.sent) != 0 { + t.Fatalf("fabric ingress should not receive degraded fallback packets: %#v", ingress.sent) + } +} + +func TestVPNPacketIngressWebSocketMovesBatchesBothDirections(t *testing.T) { + ingress := &recordingVPNPacketIngress{ + receive: [][]byte{[]byte("reply-1"), []byte("reply-2")}, + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + }.Handler()) + defer server.Close() + + wsURL := "ws" + strings.TrimPrefix(server.URL, "http") + "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets/ws" + conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil) + if err != nil { + t.Fatalf("dial websocket: %v", err) + } + defer conn.Close() + + if err := conn.WriteMessage(websocket.BinaryMessage, encodeVPNIngressPacketBatch([][]byte{[]byte("packet-1"), 
[]byte("packet-2")})); err != nil { + t.Fatalf("write packet batch: %v", err) + } + if err := conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil { + t.Fatalf("set read deadline: %v", err) + } + messageType, payload, err := conn.ReadMessage() + if err != nil { + t.Fatalf("read packet batch: %v", err) + } + if messageType != websocket.BinaryMessage { + t.Fatalf("message type = %d, want binary", messageType) + } + packets, err := decodeVPNIngressPacketBatch(payload) + if err != nil { + t.Fatalf("decode reply batch: %v", err) + } + if len(packets) != 2 || string(packets[0]) != "reply-1" || string(packets[1]) != "reply-2" { + t.Fatalf("reply packets = %#v", packets) + } + + deadline := time.Now().Add(2 * time.Second) + for { + ingress.mu.Lock() + sent := append([][]byte(nil), ingress.sent...) + clusterID := ingress.clusterID + vpnConnectionID := ingress.vpnConnectionID + ingress.mu.Unlock() + if len(sent) == 2 { + if clusterID != "cluster-1" || vpnConnectionID != "vpn-1" { + t.Fatalf("ingress ids = %s %s", clusterID, vpnConnectionID) + } + if string(sent[0]) != "packet-1" || string(sent[1]) != "packet-2" { + t.Fatalf("sent packets = %#v", sent) + } + break + } + if time.Now().After(deadline) { + t.Fatalf("sent packets = %#v", sent) + } + time.Sleep(10 * time.Millisecond) + } +} + +func TestFabricServiceChannelVPNPacketWebSocketPreservesTrafficClass(t *testing.T) { + ingress := &recordingVPNPacketIngress{ + receive: [][]byte{[]byte("reply")}, + } + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: ingress, + }.Handler()) + defer server.Close() + + wsURL := "ws" + strings.TrimPrefix(server.URL, "http") + "/api/v1/clusters/cluster-1/fabric/service-channels/channel-1/vpn-connections/vpn-1/packets/ws" + headers := http.Header{} + headers.Set("Authorization", "Bearer rap_fsc_testtoken") + headers.Set("X-RAP-Service-Class", FabricServiceClassVPNPackets) + headers.Set("X-RAP-Channel-Class", 
ProductionChannelVPNPacket) + headers.Set("X-RAP-Traffic-Class", "interactive") + headers.Set("X-RAP-Fabric-Channel-ID", "channel-1") + conn, _, err := websocket.DefaultDialer.Dial(wsURL, headers) + if err != nil { + t.Fatalf("dial websocket: %v", err) + } + defer conn.Close() + + if err := conn.WriteMessage(websocket.BinaryMessage, encodeVPNIngressPacketBatch([][]byte{[]byte("packet")})); err != nil { + t.Fatalf("write packet batch: %v", err) + } + if err := conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil { + t.Fatalf("set read deadline: %v", err) + } + if _, _, err := conn.ReadMessage(); err != nil { + t.Fatalf("read packet batch: %v", err) + } + + deadline := time.Now().Add(2 * time.Second) + for { + ingress.mu.Lock() + trafficClass := ingress.trafficClass + sent := append([][]byte(nil), ingress.sent...) + ingress.mu.Unlock() + if trafficClass == "interactive" && len(sent) == 1 && string(sent[0]) == "packet" { + break + } + if time.Now().After(deadline) { + t.Fatalf("traffic class = %q sent packets = %#v, want interactive packet", trafficClass, sent) + } + time.Sleep(10 * time.Millisecond) + } +} + +func TestVPNPacketIngressWebSocketFallsBackToBackendRelay(t *testing.T) { + var backendBody []byte + postSeen := make(chan struct{}, 1) + backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { + switch r.Method { + case http.MethodPost: + if r.URL.Path != "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets" || r.URL.Query().Get("batch") != "true" { + t.Fatalf("backend post target = %s?%s", r.URL.Path, r.URL.RawQuery) + } + var err error + backendBody, err = io.ReadAll(r.Body) + if err != nil { + t.Fatalf("read backend body: %v", err) + } + select { + case postSeen <- struct{}{}: + default: + } + w.WriteHeader(http.StatusAccepted) + case http.MethodGet: + if r.URL.Path != "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets" || r.URL.Query().Get("batch") != "true" { + 
t.Fatalf("backend get target = %s?%s", r.URL.Path, r.URL.RawQuery) + } + w.Header().Set("Content-Type", "application/vnd.rap.vpn-packet-batch.v1") + _, _ = w.Write(encodeVPNIngressPacketBatch([][]byte{[]byte("backend-reply")})) + default: + w.WriteHeader(http.StatusMethodNotAllowed) + } + })) + defer backend.Close() + server := httptest.NewServer(Server{ + Local: PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, + VPNPacketIngress: failingVPNPacketIngress{sendErr: ErrRouteNotFound, receiveErr: ErrRouteNotFound}, + BackendProxyBaseURL: backend.URL + "/api/v1", + }.Handler()) + defer server.Close() + + wsURL := "ws" + strings.TrimPrefix(server.URL, "http") + "/api/v1/clusters/cluster-1/vpn-connections/vpn-1/tunnel/client/packets/ws" + conn, _, err := websocket.DefaultDialer.Dial(wsURL, nil) + if err != nil { + t.Fatalf("dial websocket: %v", err) + } + defer conn.Close() + + sentPayload := encodeVPNIngressPacketBatch([][]byte{[]byte("packet")}) + if err := conn.WriteMessage(websocket.BinaryMessage, sentPayload); err != nil { + t.Fatalf("write packet batch: %v", err) + } + if err := conn.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil { + t.Fatalf("set read deadline: %v", err) + } + _, payload, err := conn.ReadMessage() + if err != nil { + t.Fatalf("read backend packet batch: %v", err) + } + packets, err := decodeVPNIngressPacketBatch(payload) + if err != nil { + t.Fatalf("decode backend batch: %v", err) + } + if len(packets) != 1 || string(packets[0]) != "backend-reply" { + t.Fatalf("backend reply packets = %#v", packets) + } + select { + case <-postSeen: + case <-time.After(2 * time.Second): + t.Fatal("backend POST was not observed") + } + if !bytes.Equal(backendBody, sentPayload) { + t.Fatalf("backend body = %q want %q", string(backendBody), string(sentPayload)) + } +} + +func TestNewProductionVPNPacketBatchEnvelopeRoundTripsPayload(t *testing.T) { + now := time.Now().UTC() + envelope, err := 
NewProductionVPNPacketBatchEnvelope(ProductionVPNPacketEnvelopeInput{ + MessageID: "vpn-message-builder", + RouteID: "route-vpn-1", + ClusterID: "cluster-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + CurrentHopNodeID: "entry-1", + NextHopNodeID: "exit-1", + RoutePath: []string{"entry-1", "exit-1"}, + VPNConnectionID: "vpn-1", + Direction: "client_to_gateway", + Packets: [][]byte{[]byte("packet-1"), []byte("packet-2")}, + Now: now, + }) + if err != nil { + t.Fatalf("new vpn packet envelope: %v", err) + } + if err := ValidateProductionEnvelope(PeerIdentity{ClusterID: "cluster-1", NodeID: "entry-1"}, envelope, now); err != nil { + t.Fatalf("validate envelope: %v", err) + } + payload, err := DecodeProductionVPNPacketBatch(envelope) + if err != nil { + t.Fatalf("decode vpn packet batch: %v", err) + } + if payload.VPNConnectionID != "vpn-1" || payload.Direction != "client_to_gateway" || string(payload.Packets[1]) != "packet-2" { + t.Fatalf("payload = %+v", payload) + } +} + +func TestNewProductionVPNPacketBatchEnvelopeRejectsEmptyPackets(t *testing.T) { + _, err := NewProductionVPNPacketBatchEnvelope(ProductionVPNPacketEnvelopeInput{ + MessageID: "vpn-message-builder", + RouteID: "route-vpn-1", + ClusterID: "cluster-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + CurrentHopNodeID: "entry-1", + NextHopNodeID: "exit-1", + RoutePath: []string{"entry-1", "exit-1"}, + VPNConnectionID: "vpn-1", + Direction: "client_to_gateway", + Packets: [][]byte{nil, []byte{}}, + }) + if err == nil { + t.Fatal("expected empty packet batch to be rejected") + } +} + +func TestEncodeVPNIngressPacketBatchSkipsEmptyPackets(t *testing.T) { + encoded := encodeVPNIngressPacketBatch([][]byte{nil, []byte("reply"), []byte{}}) + decoded, err := decodeVPNIngressPacketBatch(encoded) + if err != nil { + t.Fatalf("decode batch: %v", err) + } + if len(decoded) != 1 || string(decoded[0]) != "reply" { + t.Fatalf("decoded = %#v", decoded) + } +} + +type failingVPNPacketIngress 
struct { + sendErr error + receiveErr error +} + +func (i failingVPNPacketIngress) SendClientPacketBatch(context.Context, string, string, [][]byte) error { + return i.sendErr +} + +func (i failingVPNPacketIngress) ReceiveClientPacketBatch(context.Context, string, string, time.Duration) ([][]byte, error) { + return nil, i.receiveErr +} + +type recordingVPNPacketIngress struct { + mu sync.Mutex + clusterID string + vpnConnectionID string + trafficClass string + sent [][]byte + receive [][]byte +} + +func (i *recordingVPNPacketIngress) SendClientPacketBatch(_ context.Context, clusterID string, vpnConnectionID string, packets [][]byte) error { + i.mu.Lock() + defer i.mu.Unlock() + i.clusterID = clusterID + i.vpnConnectionID = vpnConnectionID + i.sent = cleanVPNIngressPacketBatch(packets) + return nil +} + +func (i *recordingVPNPacketIngress) SendClientPacketBatchWithTrafficClass(_ context.Context, clusterID string, vpnConnectionID string, trafficClass string, packets [][]byte) error { + i.mu.Lock() + defer i.mu.Unlock() + i.clusterID = clusterID + i.vpnConnectionID = vpnConnectionID + i.trafficClass = trafficClass + i.sent = cleanVPNIngressPacketBatch(packets) + return nil +} + +func (i *recordingVPNPacketIngress) ReceiveClientPacketBatch(_ context.Context, clusterID string, vpnConnectionID string, _ time.Duration) ([][]byte, error) { + i.mu.Lock() + defer i.mu.Unlock() + i.clusterID = clusterID + i.vpnConnectionID = vpnConnectionID + packets := i.receive + i.receive = nil + return packets, nil +} + func hasProductionForwardEvent(events []ProductionForwardLogEntry, event string) bool { for _, item := range events { if item.Event == event { diff --git a/agents/rap-node-agent/internal/supervisor/supervisor.go b/agents/rap-node-agent/internal/supervisor/supervisor.go index 1bc6863..b553641 100644 --- a/agents/rap-node-agent/internal/supervisor/supervisor.go +++ b/agents/rap-node-agent/internal/supervisor/supervisor.go @@ -2,6 +2,8 @@ package supervisor import ( "context" 
+ "strings" + "time" "github.com/example/remote-access-platform/agents/rap-node-agent/internal/client" ) @@ -17,24 +19,146 @@ type StubSupervisor struct { func (s StubSupervisor) Apply(_ context.Context, desired []client.DesiredWorkload) ([]client.WorkloadStatusRequest, error) { statuses := make([]client.WorkloadStatusRequest, 0, len(desired)) for _, workload := range desired { - state := "degraded" - if workload.DesiredState == "disabled" { - state = "stopped" - } - version := workload.Version - if version == "" { - version = s.Version - } - statuses = append(statuses, client.WorkloadStatusRequest{ - ReportedState: state, - RuntimeMode: workload.RuntimeMode, - Version: version, - StatusPayload: map[string]any{ - "supervisor": "stub", - "desired_state": workload.DesiredState, - "service_type": workload.ServiceType, - }, - }) + statuses = append(statuses, s.applyOne(workload)) } return statuses, nil } + +func (s StubSupervisor) applyOne(workload client.DesiredWorkload) client.WorkloadStatusRequest { + serviceType := strings.TrimSpace(workload.ServiceType) + desiredState := strings.TrimSpace(strings.ToLower(workload.DesiredState)) + if desiredState == "" { + desiredState = "disabled" + } + runtimeMode := strings.TrimSpace(strings.ToLower(workload.RuntimeMode)) + if runtimeMode == "" { + runtimeMode = "native" + } + version := strings.TrimSpace(workload.Version) + if version == "" { + version = s.Version + } + payload := map[string]any{ + "schema_version": "rap.node_agent.workload_supervision.v1", + "supervisor": "node-agent-local", + "desired_state": desiredState, + "service_type": serviceType, + "runtime_mode": runtimeMode, + "observed_at": time.Now().UTC().Format(time.RFC3339Nano), + } + if desiredState != "enabled" { + payload["reason"] = "desired_state_not_enabled" + return client.WorkloadStatusRequest{ + ReportedState: "stopped", + RuntimeMode: runtimeMode, + Version: version, + StatusPayload: payload, + } + } + if serviceType == "core-mesh" || serviceType == 
"mesh-listener" { + payload["reason"] = "builtin_node_agent_service_ready" + payload["execution_mode"] = "builtin" + payload["traffic"] = serviceTrafficMode(serviceType) + return client.WorkloadStatusRequest{ + ReportedState: "running", + RuntimeMode: runtimeMode, + Version: version, + StatusPayload: payload, + } + } + if serviceType == "synthetic.echo" && runtimeMode == "native" { + payload["reason"] = "internal_synthetic_echo_ready" + payload["execution_mode"] = "builtin" + payload["traffic"] = "test_service_only" + return client.WorkloadStatusRequest{ + ReportedState: "running", + RuntimeMode: runtimeMode, + Version: version, + StatusPayload: payload, + } + } + if serviceType == "rdp-worker" && runtimeMode == "native" && boolConfig(workload.Config, "adapter_contract_probe") { + payload["reason"] = "remote_workspace_adapter_contract_probe_ready" + payload["execution_mode"] = "contract_probe" + payload["service_class"] = "remote_workspace" + payload["fabric_service_channel_required"] = true + payload["backend_relay_steady_state"] = false + payload["channels"] = remoteWorkspaceAdapterChannels() + payload["frame_batch_contract"] = remoteWorkspaceFrameBatchContract() + payload["traffic"] = "none" + return client.WorkloadStatusRequest{ + ReportedState: "running", + RuntimeMode: runtimeMode, + Version: version, + StatusPayload: payload, + } + } + payload["reason"] = "service_runtime_not_implemented" + payload["traffic"] = "blocked" + return client.WorkloadStatusRequest{ + ReportedState: "degraded", + RuntimeMode: runtimeMode, + Version: version, + StatusPayload: payload, + } +} + +func boolConfig(values map[string]any, key string) bool { + if values == nil { + return false + } + value, ok := values[key] + if !ok { + return false + } + switch typed := value.(type) { + case bool: + return typed + case string: + return strings.EqualFold(strings.TrimSpace(typed), "true") + default: + return false + } +} + +func remoteWorkspaceAdapterChannels() []map[string]any { + return 
[]map[string]any{ + {"name": "input", "direction": "client_to_adapter", "reliability": "reliable_ordered", "priority": "critical", "droppable": true, "may_block_input": false}, + {"name": "control", "direction": "bidirectional", "reliability": "reliable_ordered", "priority": "high", "droppable": false, "may_block_input": false}, + {"name": "display", "direction": "adapter_to_client", "reliability": "droppable_latest", "priority": "high", "droppable": true, "may_block_input": false}, + {"name": "cursor", "direction": "adapter_to_client", "reliability": "droppable_latest", "priority": "high", "droppable": true, "may_block_input": false}, + {"name": "clipboard", "direction": "bidirectional", "reliability": "reliable_ordered", "priority": "medium", "droppable": false, "may_block_input": false}, + {"name": "file_transfer", "direction": "bidirectional", "reliability": "reliable_chunked", "priority": "medium", "droppable": false, "may_block_input": false}, + {"name": "audio", "direction": "adapter_to_client", "reliability": "adaptive_droppable", "priority": "medium", "droppable": true, "may_block_input": false}, + {"name": "device", "direction": "bidirectional", "reliability": "reliable_ordered", "priority": "medium", "droppable": false, "may_block_input": false}, + {"name": "telemetry", "direction": "adapter_to_client", "reliability": "sampled_droppable", "priority": "low", "droppable": true, "may_block_input": false}, + } +} + +func remoteWorkspaceFrameBatchContract() map[string]any { + return map[string]any{ + "schema_version": "rap.remote_workspace_frame_batch.v1", + "adapter_contract_id": "rap.rdp_worker.remote_workspace_adapter_contract_probe.v1", + "probe_only": true, + "payload_forwarding": "not_implemented", + "service_class": "remote_workspace", + "allowed_flow_classes": []string{"control", "interactive", "reliable", "bulk", "droppable"}, + "allowed_payload_encodings": []string{ + "none", + "base64", + }, + "max_probe_frames": 32, + "channels": 
remoteWorkspaceAdapterChannels(), + } +} + +func serviceTrafficMode(serviceType string) string { + switch serviceType { + case "core-mesh": + return "fabric_control" + case "mesh-listener": + return "entry_listener" + default: + return "unknown" + } +} diff --git a/agents/rap-node-agent/internal/supervisor/supervisor_test.go b/agents/rap-node-agent/internal/supervisor/supervisor_test.go index 9ffed26..ffaf144 100644 --- a/agents/rap-node-agent/internal/supervisor/supervisor_test.go +++ b/agents/rap-node-agent/internal/supervisor/supervisor_test.go @@ -33,3 +33,101 @@ func TestStubSupervisorReportsStoppedForDisabledWorkload(t *testing.T) { t.Fatalf("ReportedState = %q", statuses[0].ReportedState) } } + +func TestStubSupervisorRunsInternalSyntheticEchoWorkload(t *testing.T) { + statuses, err := (StubSupervisor{Version: "test"}).Apply(context.Background(), []client.DesiredWorkload{ + {ServiceType: "synthetic.echo", DesiredState: "enabled", RuntimeMode: "native"}, + }) + if err != nil { + t.Fatalf("apply desired workload: %v", err) + } + if statuses[0].ReportedState != "running" { + t.Fatalf("ReportedState = %q", statuses[0].ReportedState) + } + if statuses[0].StatusPayload["reason"] != "internal_synthetic_echo_ready" { + t.Fatalf("reason = %v", statuses[0].StatusPayload["reason"]) + } + if statuses[0].StatusPayload["execution_mode"] != "builtin" { + t.Fatalf("execution_mode = %v", statuses[0].StatusPayload["execution_mode"]) + } +} + +func TestStubSupervisorReportsBuiltinFabricServicesRunning(t *testing.T) { + statuses, err := (StubSupervisor{Version: "test"}).Apply(context.Background(), []client.DesiredWorkload{ + {ServiceType: "core-mesh", DesiredState: "enabled", RuntimeMode: "container"}, + {ServiceType: "mesh-listener", DesiredState: "enabled", RuntimeMode: "container"}, + }) + if err != nil { + t.Fatalf("apply desired workload: %v", err) + } + if len(statuses) != 2 { + t.Fatalf("statuses length = %d", len(statuses)) + } + for _, status := range statuses { + if 
status.ReportedState != "running" { + t.Fatalf("ReportedState = %q", status.ReportedState) + } + if status.StatusPayload["reason"] != "builtin_node_agent_service_ready" { + t.Fatalf("reason = %v", status.StatusPayload["reason"]) + } + } +} + +func TestStubSupervisorKeepsUnsupportedEnabledWorkloadDegraded(t *testing.T) { + statuses, err := (StubSupervisor{Version: "test"}).Apply(context.Background(), []client.DesiredWorkload{ + {ServiceType: "rdp-worker", DesiredState: "enabled", RuntimeMode: "container"}, + }) + if err != nil { + t.Fatalf("apply desired workload: %v", err) + } + if statuses[0].ReportedState != "degraded" { + t.Fatalf("ReportedState = %q", statuses[0].ReportedState) + } + if statuses[0].StatusPayload["reason"] != "service_runtime_not_implemented" { + t.Fatalf("reason = %v", statuses[0].StatusPayload["reason"]) + } +} + +func TestStubSupervisorRunsRDPWorkerAdapterContractProbeOnly(t *testing.T) { + statuses, err := (StubSupervisor{Version: "test"}).Apply(context.Background(), []client.DesiredWorkload{ + { + ServiceType: "rdp-worker", + DesiredState: "enabled", + RuntimeMode: "native", + Config: map[string]any{ + "adapter_contract_probe": true, + }, + }, + }) + if err != nil { + t.Fatalf("apply desired workload: %v", err) + } + if statuses[0].ReportedState != "running" { + t.Fatalf("ReportedState = %q", statuses[0].ReportedState) + } + if statuses[0].StatusPayload["reason"] != "remote_workspace_adapter_contract_probe_ready" { + t.Fatalf("reason = %v", statuses[0].StatusPayload["reason"]) + } + if statuses[0].StatusPayload["service_class"] != "remote_workspace" { + t.Fatalf("service_class = %v", statuses[0].StatusPayload["service_class"]) + } + if statuses[0].StatusPayload["backend_relay_steady_state"] != false { + t.Fatalf("backend_relay_steady_state = %v", statuses[0].StatusPayload["backend_relay_steady_state"]) + } + channels, ok := statuses[0].StatusPayload["channels"].([]map[string]any) + if !ok || len(channels) != 9 { + t.Fatalf("channels = %#v", 
statuses[0].StatusPayload["channels"]) + } + if channels[0]["name"] != "input" || channels[0]["priority"] != "critical" || channels[0]["droppable"] != true || channels[0]["may_block_input"] != false { + t.Fatalf("unexpected input channel: %#v", channels[0]) + } + frameBatch, ok := statuses[0].StatusPayload["frame_batch_contract"].(map[string]any) + if !ok { + t.Fatalf("frame_batch_contract = %#v", statuses[0].StatusPayload["frame_batch_contract"]) + } + if frameBatch["schema_version"] != "rap.remote_workspace_frame_batch.v1" || + frameBatch["payload_forwarding"] != "not_implemented" || + frameBatch["service_class"] != "remote_workspace" { + t.Fatalf("unexpected frame batch contract: %#v", frameBatch) + } +} diff --git a/agents/rap-node-agent/internal/vpnruntime/fabric_transport.go b/agents/rap-node-agent/internal/vpnruntime/fabric_transport.go index fc1ee70..6fa5e53 100644 --- a/agents/rap-node-agent/internal/vpnruntime/fabric_transport.go +++ b/agents/rap-node-agent/internal/vpnruntime/fabric_transport.go @@ -385,32 +385,37 @@ func (s *FabricFlowScheduler) ConfigureAdaptivePolicy(policy FabricServiceChanne } func (s *FabricFlowScheduler) ScheduleClientPackets(packets [][]byte) []FabricScheduledPacketBatch { - return s.scheduleClientPackets("", "", packets) + scheduled, _ := s.scheduleClientPackets("", "", packets) + return scheduled } func (s *FabricFlowScheduler) ScheduleClientPacketsForConnection(vpnConnectionID string, packets [][]byte) []FabricScheduledPacketBatch { - return s.scheduleClientPackets(vpnConnectionID, "", packets) + scheduled, _ := s.scheduleClientPackets(vpnConnectionID, "", packets) + return scheduled } func (s *FabricFlowScheduler) ScheduleClientPacketsForConnectionClass(vpnConnectionID string, trafficClass string, packets [][]byte) []FabricScheduledPacketBatch { - return s.scheduleClientPackets(vpnConnectionID, trafficClass, packets) + scheduled, _ := s.scheduleClientPackets(vpnConnectionID, trafficClass, packets) + return scheduled } -func 
(s *FabricFlowScheduler) scheduleClientPackets(vpnConnectionID string, trafficClass string, packets [][]byte) []FabricScheduledPacketBatch { +func (s *FabricFlowScheduler) scheduleClientPackets(vpnConnectionID string, trafficClass string, packets [][]byte) ([]FabricScheduledPacketBatch, uint64) { packets = cleanPacketBatch(packets) if len(packets) == 0 { - return nil + return nil, 0 } if s == nil { s = NewFabricFlowScheduler(0, 0) } trafficClass = normalizeFabricTrafficClass(trafficClass) grouped := map[string]*FabricScheduledPacketBatch{} + var droppedCount uint64 for _, packet := range packets { flowID, shard := classifyPacketFlow(packet, s.shardCountValue()) channelID := fabricFlowChannelIDForClass(vpnConnectionID, trafficClass, shard) queueDepth, dropped := s.enqueue(channelID, trafficClass) if dropped { + droppedCount++ continue } batch := grouped[channelID] @@ -433,7 +438,7 @@ func (s *FabricFlowScheduler) scheduleClientPackets(vpnConnectionID string, traf out = append(out, *batch) } s.sortScheduledBatches(out) - return out + return out, droppedCount } func fabricFlowChannelID(vpnConnectionID string, shard int) string { @@ -1441,11 +1446,9 @@ func (i *FabricClientPacketIngress) SendClientPacketBatchWithTrafficClass(ctx co } i.recordSendBatch(len(packets)) scheduler := i.flowScheduler() - droppedBefore := scheduler.Dropped() - scheduled := scheduler.ScheduleClientPacketsForConnectionClass(vpnConnectionID, trafficClass, packets) - droppedAfter := scheduler.Dropped() - if droppedAfter > droppedBefore { - i.recordFlowDropped(droppedAfter - droppedBefore) + scheduled, droppedCount := scheduler.scheduleClientPackets(vpnConnectionID, trafficClass, packets) + if droppedCount > 0 { + i.recordFlowDropped(droppedCount) } if len(scheduled) == 0 { i.recordError(mesh.ErrSyntheticRelayQueueFull) @@ -1657,8 +1660,10 @@ func (i *FabricClientPacketIngress) routeCandidatesWithPreference(clusterID stri if i == nil || routesFunc == nil { return nil } + localClusterID := 
i.clusterID() + localNodeID := i.localNodeID() if clusterID == "" { - clusterID = i.ClusterID + clusterID = localClusterID } now := time.Now().UTC() var preferred []fabricClientRouteCandidate @@ -1676,7 +1681,7 @@ func (i *FabricClientPacketIngress) routeCandidatesWithPreference(clusterID stri } } for _, route := range routesFunc() { - if route.ClusterID != clusterID || route.SourceNodeID != i.LocalNodeID || !containsString(route.AllowedChannels, mesh.ProductionChannelVPNPacket) { + if route.ClusterID != clusterID || route.SourceNodeID != localNodeID || !containsString(route.AllowedChannels, mesh.ProductionChannelVPNPacket) { continue } if manager.isWithdrawn(route.RouteID) { @@ -1685,8 +1690,8 @@ func (i *FabricClientPacketIngress) routeCandidatesWithPreference(clusterID stri if !route.ExpiresAt.IsZero() && !route.ExpiresAt.After(now) { continue } - nextHop := nextHopAfter(route.Hops, i.LocalNodeID, route.DestinationNodeID) - if nextHop == "" || nextHop == i.LocalNodeID { + nextHop := nextHopAfter(route.Hops, localNodeID, route.DestinationNodeID) + if nextHop == "" || nextHop == localNodeID { continue } candidate := fabricClientRouteCandidate{Route: route, NextHop: nextHop} @@ -2024,7 +2029,7 @@ func (i *FabricClientPacketIngress) routeProvenance(clusterID string) map[string if i == nil || routesFunc == nil { return out } - localNodeID := strings.TrimSpace(i.LocalNodeID) + localNodeID := i.localNodeID() for _, route := range routesFunc() { if strings.TrimSpace(route.RouteID) == "" { continue @@ -2322,6 +2327,24 @@ func (i *FabricClientPacketIngress) routesFunc() func() []mesh.SyntheticRoute { return i.Routes } +func (i *FabricClientPacketIngress) clusterID() string { + if i == nil { + return "" + } + i.mu.Lock() + defer i.mu.Unlock() + return strings.TrimSpace(i.ClusterID) +} + +func (i *FabricClientPacketIngress) localNodeID() string { + if i == nil { + return "" + } + i.mu.Lock() + defer i.mu.Unlock() + return strings.TrimSpace(i.LocalNodeID) +} + func (i 
*FabricClientPacketIngress) flowScheduler() *FabricFlowScheduler { if i == nil { return NewFabricFlowScheduler(0, 0) diff --git a/agents/rap-node-agent/internal/vpnruntime/fabric_transport_test.go b/agents/rap-node-agent/internal/vpnruntime/fabric_transport_test.go index 4a6509f..afbf28c 100644 --- a/agents/rap-node-agent/internal/vpnruntime/fabric_transport_test.go +++ b/agents/rap-node-agent/internal/vpnruntime/fabric_transport_test.go @@ -324,10 +324,13 @@ func TestFabricFlowSchedulerDropsWhenChannelQueueIsFull(t *testing.T) { packetA := testIPv4TCPPacket([4]byte{10, 77, 0, 2}, [4]byte{192, 168, 200, 95}, 51000, 3389) packetB := testIPv4TCPPacket([4]byte{10, 77, 0, 2}, [4]byte{192, 168, 200, 95}, 51000, 3389) - batches := scheduler.ScheduleClientPackets([][]byte{packetA, packetB}) + batches, dropped := scheduler.scheduleClientPackets("", "", [][]byte{packetA, packetB}) if len(batches) != 1 || len(batches[0].Packets) != 1 { t.Fatalf("batches = %#v, want one accepted packet", batches) } + if dropped != 1 { + t.Fatalf("dropped = %d, want per-call drop count 1", dropped) + } snapshot := scheduler.Snapshot() if snapshot.Dropped != 1 || !snapshot.BackpressureActive { t.Fatalf("snapshot = %+v, want one dropped packet and active backpressure", snapshot) @@ -1069,6 +1072,60 @@ func TestFabricClientPacketIngressIsolatesRouteMemoryPerVPNConnection(t *testing } } +func TestFabricClientPacketIngressRouteSelectionUsesUpdatedRuntimeIdentity(t *testing.T) { + transport := &captureManyProductionTransport{} + ingress := &FabricClientPacketIngress{ + ForwardTransport: transport, + Inbox: NewFabricPacketInbox(8), + ClusterID: "cluster-1", + LocalNodeID: "entry-1", + Routes: func() []mesh.SyntheticRoute { + return []mesh.SyntheticRoute{{ + RouteID: "route-entry-1", + ClusterID: "cluster-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + Hops: []string{"entry-1", "relay-1", "exit-1"}, + AllowedChannels: []string{mesh.ProductionChannelVPNPacket}, + ExpiresAt: 
time.Now().UTC().Add(time.Minute), + MaxTTL: 8, + }} + }, + } + ingress.UpdateRuntime( + transport, + NewFabricPacketInbox(8), + "cluster-1", + "entry-2", + nil, + func() []mesh.SyntheticRoute { + return []mesh.SyntheticRoute{{ + RouteID: "route-entry-2", + ClusterID: "cluster-1", + SourceNodeID: "entry-2", + DestinationNodeID: "exit-2", + Hops: []string{"entry-2", "relay-2", "exit-2"}, + AllowedChannels: []string{mesh.ProductionChannelVPNPacket}, + ExpiresAt: time.Now().UTC().Add(time.Minute), + MaxTTL: 8, + }} + }, + "policy-updated", + ) + + packet := testIPv4TCPPacket([4]byte{10, 77, 0, 2}, [4]byte{192, 168, 200, 95}, 51000, 443) + if err := ingress.SendClientPacketBatch(context.Background(), "", "vpn-1", [][]byte{packet}); err != nil { + t.Fatalf("send after runtime update: %v", err) + } + if len(transport.envelopes) != 1 { + t.Fatalf("envelopes = %d, want one send", len(transport.envelopes)) + } + envelope := transport.envelopes[0] + if envelope.RouteID != "route-entry-2" || envelope.SourceNodeID != "entry-2" || transport.calls[0] != "relay-2" { + t.Fatalf("envelope route/source/next-hop = %s/%s/%s, want updated entry-2 route", envelope.RouteID, envelope.SourceNodeID, transport.calls[0]) + } +} + func TestFabricClientPacketIngressParallelFlowWindowDoesNotBlockIndependentChannel(t *testing.T) { scheduler := NewFabricFlowScheduler(8, 16) slowPacket, fastPacket := packetsForOrderedDistinctChannels(scheduler.shardCountValue()) diff --git a/agents/rap-node-agent/internal/vpnruntime/tun_windows.go b/agents/rap-node-agent/internal/vpnruntime/tun_windows.go new file mode 100644 index 0000000..87540f5 --- /dev/null +++ b/agents/rap-node-agent/internal/vpnruntime/tun_windows.go @@ -0,0 +1,170 @@ +//go:build windows && rap_vpn_windows_tun + +package vpnruntime + +import ( + "crypto/sha256" + _ "embed" + "fmt" + "net" + "os" + "os/exec" + "path/filepath" + "strings" + + wgtun "golang.zx2c4.com/wireguard/tun" +) + +const windowsGatewayMTU = 1420 + +//go:embed 
assets/windows/amd64/wintun.dll +var embeddedWintunDLL []byte + +type tunDevice struct { + dev wgtun.Device + name string +} + +func openGatewayTun(name, addressCIDR, routeCIDR string) (*tunDevice, error) { + if _, _, err := net.ParseCIDR(addressCIDR); err != nil { + return nil, fmt.Errorf("invalid vpn gateway address %q: %w", addressCIDR, err) + } + if _, _, err := net.ParseCIDR(routeCIDR); err != nil { + return nil, fmt.Errorf("invalid vpn gateway route %q: %w", routeCIDR, err) + } + if err := ensureWintunDLL(); err != nil { + return nil, err + } + + dev, err := wgtun.CreateTUN(name, windowsGatewayMTU) + if err != nil { + return nil, fmt.Errorf("create wintun interface %s: %w", name, err) + } + if err := configureGatewayInterface(name, addressCIDR, routeCIDR); err != nil { + _ = dev.Close() + return nil, err + } + return &tunDevice{dev: dev, name: name}, nil +} + +func (d *tunDevice) Read(packet []byte) (int, error) { + bufs := [][]byte{packet} + sizes := []int{0} + n, err := d.dev.Read(bufs, sizes, 0) + if err != nil { + return 0, err + } + if n <= 0 { + return 0, nil + } + return sizes[0], nil +} + +func (d *tunDevice) Write(packet []byte) (int, error) { + n, err := d.dev.Write([][]byte{packet}, 0) + if err != nil { + return 0, err + } + if n <= 0 { + return 0, nil + } + return len(packet), nil +} + +func (d *tunDevice) Close() error { + _ = removeWindowsGatewayNat() + return d.dev.Close() +} + +func configureGatewayInterface(name, addressCIDR, routeCIDR string) error { + ip, network, err := net.ParseCIDR(addressCIDR) + if err != nil { + return fmt.Errorf("invalid vpn gateway address %q: %w", addressCIDR, err) + } + ones, bits := network.Mask.Size() + if bits != 32 || ones <= 0 { + return fmt.Errorf("invalid vpn gateway prefix %q", addressCIDR) + } + _, route, err := net.ParseCIDR(routeCIDR) + if err != nil { + return fmt.Errorf("invalid vpn gateway route %q: %w", routeCIDR, err) + } + + script := fmt.Sprintf(` +$ErrorActionPreference = 'Stop' +$alias = %s 
+$address = %s +$prefixLength = %d +$natPrefix = %s +$natName = 'RAPVPN' +$adapter = Get-NetAdapter -Name $alias -ErrorAction Stop +$adapter | Enable-NetAdapter -Confirm:$false -ErrorAction SilentlyContinue | Out-Null +$existing = Get-NetIPAddress -InterfaceAlias $alias -AddressFamily IPv4 -ErrorAction SilentlyContinue +foreach ($addr in $existing) { + if ($addr.IPAddress -ne $address -or $addr.PrefixLength -ne $prefixLength) { + Remove-NetIPAddress -InterfaceAlias $alias -IPAddress $addr.IPAddress -Confirm:$false -ErrorAction SilentlyContinue + } +} +if (-not (Get-NetIPAddress -InterfaceAlias $alias -IPAddress $address -AddressFamily IPv4 -ErrorAction SilentlyContinue)) { + New-NetIPAddress -InterfaceAlias $alias -IPAddress $address -PrefixLength $prefixLength -Type Unicast | Out-Null +} +Set-NetIPInterface -InterfaceAlias $alias -AddressFamily IPv4 -Forwarding Enabled +Get-NetIPInterface -AddressFamily IPv4 | Where-Object { $_.ConnectionState -eq 'Connected' -and $_.InterfaceAlias -ne 'Loopback Pseudo-Interface 1' } | Set-NetIPInterface -Forwarding Enabled +$existingNat = Get-NetNat -Name $natName -ErrorAction SilentlyContinue +if ($existingNat -and $existingNat.InternalIPInterfaceAddressPrefix -ne $natPrefix) { + $existingNat | Remove-NetNat -Confirm:$false + $existingNat = $null +} +if (-not $existingNat) { + New-NetNat -Name $natName -InternalIPInterfaceAddressPrefix $natPrefix | Out-Null +} +`, psQuote(name), psQuote(ip.String()), ones, psQuote(route.String())) + + if err := runPowerShell(script); err != nil { + return fmt.Errorf("configure windows vpn gateway interface %s: %w", name, err) + } + return nil +} + +func removeWindowsGatewayNat() error { + return runPowerShell(`Get-NetNat -Name 'RAPVPN' -ErrorAction SilentlyContinue | Remove-NetNat -Confirm:$false -ErrorAction SilentlyContinue`) +} + +func runPowerShell(script string) error { + cmd := exec.Command("powershell.exe", "-NoProfile", "-ExecutionPolicy", "Bypass", "-Command", script) + if out, err := 
cmd.CombinedOutput(); err != nil { + return fmt.Errorf("powershell failed: %w: %s", err, strings.TrimSpace(string(out))) + } + return nil +} + +func psQuote(value string) string { + return "'" + strings.ReplaceAll(value, "'", "''") + "'" +} + +func ensureWintunDLL() error { + exePath, err := os.Executable() + if err != nil { + return fmt.Errorf("locate node-agent executable for wintun.dll: %w", err) + } + target := filepath.Join(filepath.Dir(exePath), "wintun.dll") + if payload, err := os.ReadFile(target); err == nil && sameSHA256(payload, embeddedWintunDLL) { + return nil + } + tmp := target + ".tmp" + if err := os.WriteFile(tmp, embeddedWintunDLL, 0o644); err != nil { + return fmt.Errorf("write embedded wintun.dll: %w", err) + } + _ = os.Remove(target) + if err := os.Rename(tmp, target); err != nil { + _ = os.Remove(tmp) + return fmt.Errorf("install embedded wintun.dll: %w", err) + } + return nil +} + +func sameSHA256(a, b []byte) bool { + left := sha256.Sum256(a) + right := sha256.Sum256(b) + return left == right +} diff --git a/backend/Dockerfile b/backend/Dockerfile index 90fa09a..a7734b1 100644 --- a/backend/Dockerfile +++ b/backend/Dockerfile @@ -1,4 +1,4 @@ -FROM golang:1.23-bookworm AS build +FROM golang:1.25-bookworm AS build WORKDIR /src diff --git a/backend/go.mod b/backend/go.mod index 28998cc..bfbead7 100644 --- a/backend/go.mod +++ b/backend/go.mod @@ -1,23 +1,24 @@ module github.com/example/remote-access-platform/backend -go 1.23.2 +go 1.25.0 require ( - github.com/go-chi/chi/v5 v5.2.1 - github.com/golang-jwt/jwt/v5 v5.2.2 + github.com/go-chi/chi/v5 v5.2.5 + github.com/golang-jwt/jwt/v5 v5.3.1 github.com/google/uuid v1.6.0 github.com/gorilla/websocket v1.5.3 - github.com/jackc/pgx/v5 v5.7.4 - github.com/redis/go-redis/v9 v9.8.0 - golang.org/x/crypto v0.37.0 + github.com/jackc/pgx/v5 v5.9.2 + github.com/redis/go-redis/v9 v9.19.0 + golang.org/x/crypto v0.50.0 ) require ( github.com/cespare/xxhash/v2 v2.3.0 // indirect - 
github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f // indirect github.com/jackc/pgpassfile v1.0.0 // indirect github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 // indirect github.com/jackc/puddle/v2 v2.2.2 // indirect - golang.org/x/sync v0.13.0 // indirect - golang.org/x/text v0.24.0 // indirect + github.com/klauspost/cpuid/v2 v2.3.0 // indirect + go.uber.org/atomic v1.11.0 // indirect + golang.org/x/sync v0.20.0 // indirect + golang.org/x/text v0.36.0 // indirect ) diff --git a/backend/go.sum b/backend/go.sum index 1942278..d0f0768 100644 --- a/backend/go.sum +++ b/backend/go.sum @@ -7,12 +7,10 @@ github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XL github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c= github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38= -github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f h1:lO4WD4F/rVNCu3HqELle0jiPLLBs70cWOduZpkS1E78= -github.com/dgryski/go-rendezvous v0.0.0-20200823014737-9f7001d12a5f/go.mod h1:cuUVRXasLTGF7a8hSLbxyZXjz+1KgoB3wDUb6vlszIc= -github.com/go-chi/chi/v5 v5.2.1 h1:KOIHODQj58PmL80G2Eak4WdvUzjSJSm0vG72crDCqb8= -github.com/go-chi/chi/v5 v5.2.1/go.mod h1:L2yAIGWB3H+phAw1NxKwWM+7eUH/lU8pOMm5hHcoops= -github.com/golang-jwt/jwt/v5 v5.2.2 h1:Rl4B7itRWVtYIHFrSNd7vhTiz9UpLdi6gZhZ3wEeDy8= -github.com/golang-jwt/jwt/v5 v5.2.2/go.mod h1:pqrtFR0X4osieyHYxtmOUWsAWrfe1Q5UVIyoH402zdk= +github.com/go-chi/chi/v5 v5.2.5 h1:Eg4myHZBjyvJmAFjFvWgrqDTXFyOzjj7YIm3L3mu6Ug= +github.com/go-chi/chi/v5 v5.2.5/go.mod h1:X7Gx4mteadT3eDOMTsXzmI4/rwUpOwBHLpAfupzFJP0= +github.com/golang-jwt/jwt/v5 v5.3.1 h1:kYf81DTWFe7t+1VvL7eS+jKFVWaUnK9cB1qbwn63YCY= +github.com/golang-jwt/jwt/v5 v5.3.1/go.mod h1:fxCRLWMO43lRc8nhHWY6LGqRcf+1gQWArsqaEUEa5bE= github.com/google/uuid v1.6.0 h1:NIvaJDMOsjHA8n1jAhLSgzrAzy1Hgr+hNrb57e+94F0= 
github.com/google/uuid v1.6.0/go.mod h1:TIyPZe4MgqvfeYDBFedMoGGpEw/LqOeaOT+nhxU+yHo= github.com/gorilla/websocket v1.5.3 h1:saDtZ6Pbx/0u+bgYQ3q96pZgCzfhKXGPqt7kZ72aNNg= @@ -21,25 +19,33 @@ github.com/jackc/pgpassfile v1.0.0 h1:/6Hmqy13Ss2zCq62VdNG8tM1wchn8zjSGOBJ6icpsI github.com/jackc/pgpassfile v1.0.0/go.mod h1:CEx0iS5ambNFdcRtxPj5JhEz+xB6uRky5eyVu/W2HEg= github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761 h1:iCEnooe7UlwOQYpKFhBabPMi4aNAfoODPEFNiAnClxo= github.com/jackc/pgservicefile v0.0.0-20240606120523-5a60cdf6a761/go.mod h1:5TJZWKEWniPve33vlWYSoGYefn3gLQRzjfDlhSJ9ZKM= -github.com/jackc/pgx/v5 v5.7.4 h1:9wKznZrhWa2QiHL+NjTSPP6yjl3451BX3imWDnokYlg= -github.com/jackc/pgx/v5 v5.7.4/go.mod h1:ncY89UGWxg82EykZUwSpUKEfccBGGYq1xjrOpsbsfGQ= +github.com/jackc/pgx/v5 v5.9.2 h1:3ZhOzMWnR4yJ+RW1XImIPsD1aNSz4T4fyP7zlQb56hw= +github.com/jackc/pgx/v5 v5.9.2/go.mod h1:mal1tBGAFfLHvZzaYh77YS/eC6IX9OWbRV1QIIM0Jn4= github.com/jackc/puddle/v2 v2.2.2 h1:PR8nw+E/1w0GLuRFSmiioY6UooMp6KJv0/61nB7icHo= github.com/jackc/puddle/v2 v2.2.2/go.mod h1:vriiEXHvEE654aYKXXjOvZM39qJ0q+azkZFrfEOc3H4= +github.com/klauspost/cpuid/v2 v2.3.0 h1:S4CRMLnYUhGeDFDqkGriYKdfoFlDnMtqTiI/sFzhA9Y= +github.com/klauspost/cpuid/v2 v2.3.0/go.mod h1:hqwkgyIinND0mEev00jJYCxPNVRVXFQeu1XKlok6oO0= github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM= github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4= -github.com/redis/go-redis/v9 v9.8.0 h1:q3nRvjrlge/6UD7eTu/DSg2uYiU2mCL0G/uzBWqhicI= -github.com/redis/go-redis/v9 v9.8.0/go.mod h1:huWgSWd8mW6+m0VPhJjSSQ+d6Nh1VICQ6Q5lHuCH/Iw= +github.com/redis/go-redis/v9 v9.19.0 h1:XPVaaPSnG6RhYf7p+rmSa9zZfeVAnWsH5h3lxthOm/k= +github.com/redis/go-redis/v9 v9.19.0/go.mod h1:v/M13XI1PVCDcm01VtPFOADfZtHf8YW3baQf57KlIkA= github.com/stretchr/objx v0.1.0/go.mod h1:HFkY916IF+rwdDfMAkV7OtwuqBVzrE8GR6GFx+wExME= github.com/stretchr/testify v1.3.0/go.mod h1:M5WIy9Dh21IEIfnGCwXGc5bZfKNJtfHm1UVUgZn+9EI= 
github.com/stretchr/testify v1.7.0/go.mod h1:6Fq8oRcR53rry900zMqJjRRixrwX3KX962/h/Wwjteg= -github.com/stretchr/testify v1.8.1 h1:w7B6lhMri9wdJUVmEZPGGhZzrYTPvgJArz7wNPgYKsk= -github.com/stretchr/testify v1.8.1/go.mod h1:w2LPCIKwWwSfY2zedu0+kehJoqGctiVI29o6fzry7u4= -golang.org/x/crypto v0.37.0 h1:kJNSjF/Xp7kU0iB2Z+9viTPMW4EqqsrywMXLJOOsXSE= -golang.org/x/crypto v0.37.0/go.mod h1:vg+k43peMZ0pUMhYmVAWysMK35e6ioLh3wB8ZCAfbVc= -golang.org/x/sync v0.13.0 h1:AauUjRAJ9OSnvULf/ARrrVywoJDy0YS2AwQ98I37610= -golang.org/x/sync v0.13.0/go.mod h1:1dzgHSNfp02xaA81J2MS99Qcpr2w7fw1gpm99rleRqA= -golang.org/x/text v0.24.0 h1:dd5Bzh4yt5KYA8f9CJHCP4FB4D51c2c6JvN37xJJkJ0= -golang.org/x/text v0.24.0/go.mod h1:L8rBsPeo2pSS+xqN0d5u2ikmjtmoJbDBT1b7nHvFCdU= +github.com/stretchr/testify v1.11.1 h1:7s2iGBzp5EwR7/aIZr8ao5+dra3wiQyKjjFuvgVKu7U= +github.com/stretchr/testify v1.11.1/go.mod h1:wZwfW3scLgRK+23gO65QZefKpKQRnfz6sD981Nm4B6U= +github.com/zeebo/xxh3 v1.1.0 h1:s7DLGDK45Dyfg7++yxI0khrfwq9661w9EN78eP/UZVs= +github.com/zeebo/xxh3 v1.1.0/go.mod h1:IisAie1LELR4xhVinxWS5+zf1lA4p0MW4T+w+W07F5s= +go.uber.org/atomic v1.11.0 h1:ZvwS0R+56ePWxUNi+Atn9dWONBPp/AUETXlHW0DxSjE= +go.uber.org/atomic v1.11.0/go.mod h1:LUxbIzbOniOlMKjJjyPfpl4v+PKK2cNJn91OQbhoJI0= +golang.org/x/crypto v0.50.0 h1:zO47/JPrL6vsNkINmLoo/PH1gcxpls50DNogFvB5ZGI= +golang.org/x/crypto v0.50.0/go.mod h1:3muZ7vA7PBCE6xgPX7nkzzjiUq87kRItoJQM1Yo8S+Q= +golang.org/x/sync v0.20.0 h1:e0PTpb7pjO8GAtTs2dQ6jYa5BWYlMuX047Dco/pItO4= +golang.org/x/sync v0.20.0/go.mod h1:9xrNwdLfx4jkKbNva9FpL6vEN7evnE43NNNJQ2LF3+0= +golang.org/x/sys v0.43.0 h1:Rlag2XtaFTxp19wS8MXlJwTvoh8ArU6ezoyFsMyCTNI= +golang.org/x/sys v0.43.0/go.mod h1:4GL1E5IUh+htKOUEOaiffhrAeqysfVGipDYzABqnCmw= +golang.org/x/text v0.36.0 h1:JfKh3XmcRPqZPKevfXVpI1wXPTqbkE5f7JA92a55Yxg= +golang.org/x/text v0.36.0/go.mod h1:NIdBknypM8iqVmPiuco0Dh6P5Jcdk8lJL0CUebqK164= gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0= gopkg.in/yaml.v3 
v3.0.0-20200313102051-9f266ea9e77c/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM= gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA= diff --git a/backend/internal/modules/auth/models.go b/backend/internal/modules/auth/models.go index 0410689..998fad6 100644 --- a/backend/internal/modules/auth/models.go +++ b/backend/internal/modules/auth/models.go @@ -14,12 +14,13 @@ const ( ) type User struct { - ID string - Email string - PasswordHash string - MFAEnabled bool - CreatedAt time.Time - UpdatedAt time.Time + ID string `json:"id"` + Email string `json:"email"` + PasswordHash string `json:"-"` + MFAEnabled bool `json:"mfa_enabled"` + PlatformRole string `json:"platform_role"` + CreatedAt time.Time `json:"created_at"` + UpdatedAt time.Time `json:"updated_at"` } type Device struct { @@ -40,7 +41,7 @@ type AuthSession struct { ID string UserID string DeviceID string - RefreshTokenHash string + RefreshTokenHash string `json:"-"` RefreshExpiresAt time.Time LastSeenAt *time.Time LastRotatedAt *time.Time @@ -69,6 +70,13 @@ type BootstrapOwnerCommand struct { ActivationSignature string `json:"activation_signature"` } +type CreateUserCommand struct { + ActorUserID string `json:"actor_user_id"` + Email string `json:"email"` + Password string `json:"password"` + PlatformRole string `json:"platform_role"` +} + type RevokeAuthSessionCommand struct { UserID string `json:"user_id"` AuthSessionID string `json:"auth_session_id"` diff --git a/backend/internal/modules/auth/module.go b/backend/internal/modules/auth/module.go index 8af90e3..65453d7 100644 --- a/backend/internal/modules/auth/module.go +++ b/backend/internal/modules/auth/module.go @@ -34,6 +34,10 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Get("/devices", m.handleTrustedDevices) r.Post("/devices/{deviceID}/revoke", m.handleRevokeTrustedDevice) }) + router.Route("/users", func(r chi.Router) { + r.Get("/", m.handleListUsers) + r.Post("/", m.handleCreateUser) + }) } func (m *Module) 
handleInstallationStatus(w http.ResponseWriter, r *http.Request) { @@ -78,6 +82,32 @@ func (m *Module) handleLogin(w http.ResponseWriter, r *http.Request) { httpx.WriteJSON(w, http.StatusOK, result) } +func (m *Module) handleListUsers(w http.ResponseWriter, r *http.Request) { + actorUserID := r.URL.Query().Get("actor_user_id") + users, err := m.service.ListUsers(r.Context(), actorUserID) + if err != nil { + status, message := m.service.MapError(err) + httpx.WriteError(w, status, message) + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"users": users}) +} + +func (m *Module) handleCreateUser(w http.ResponseWriter, r *http.Request) { + var cmd CreateUserCommand + if err := json.NewDecoder(r.Body).Decode(&cmd); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid user payload") + return + } + user, err := m.service.CreateUser(r.Context(), cmd) + if err != nil { + status, message := m.service.MapError(err) + httpx.WriteError(w, status, message) + return + } + httpx.WriteJSON(w, http.StatusCreated, map[string]any{"user": user}) +} + func (m *Module) handleRefresh(w http.ResponseWriter, r *http.Request) { var cmd RefreshCommand if err := json.NewDecoder(r.Body).Decode(&cmd); err != nil { diff --git a/backend/internal/modules/auth/postgres_store.go b/backend/internal/modules/auth/postgres_store.go index 78b976c..3e7534e 100644 --- a/backend/internal/modules/auth/postgres_store.go +++ b/backend/internal/modules/auth/postgres_store.go @@ -70,7 +70,7 @@ type postgresInstallationRepository struct { func (r *postgresUserRepository) GetByEmail(ctx context.Context, email string) (*User, error) { const query = ` -SELECT id::text, email, password_hash, mfa_enabled, created_at, updated_at +SELECT id::text, email, password_hash, mfa_enabled, platform_role, created_at, updated_at FROM users WHERE email = $1 ` @@ -79,13 +79,53 @@ WHERE email = $1 func (r *postgresUserRepository) GetByID(ctx context.Context, userID string) (*User, error) { const query = ` 
-SELECT id::text, email, password_hash, mfa_enabled, created_at, updated_at +SELECT id::text, email, password_hash, mfa_enabled, platform_role, created_at, updated_at FROM users WHERE id = $1::uuid ` return scanOptionalUser(r.db.QueryRow(ctx, query, userID)) } +func (r *postgresUserRepository) List(ctx context.Context) ([]User, error) { + const query = ` +SELECT id::text, email, password_hash, mfa_enabled, platform_role, created_at, updated_at +FROM users +ORDER BY created_at DESC +` + rows, err := r.db.Query(ctx, query) + if err != nil { + return nil, fmt.Errorf("query users: %w", err) + } + defer rows.Close() + var users []User + for rows.Next() { + user, err := scanOptionalUser(rows) + if err != nil { + return nil, err + } + if user != nil { + users = append(users, *user) + } + } + return users, rows.Err() +} + +func (r *postgresUserRepository) Create(ctx context.Context, user User) (*User, error) { + const query = ` +INSERT INTO users (email, password_hash, mfa_enabled, platform_role, created_at, updated_at) +VALUES ($1, $2, $3, $4, $5, $6) +RETURNING id::text, email, password_hash, mfa_enabled, platform_role, created_at, updated_at +` + return scanOptionalUser(r.db.QueryRow(ctx, query, + user.Email, + user.PasswordHash, + user.MFAEnabled, + user.PlatformRole, + user.CreatedAt, + user.UpdatedAt, + )) +} + func (r *postgresDeviceRepository) Upsert(ctx context.Context, params UpsertDeviceParams) (*Device, error) { const query = ` INSERT INTO devices ( @@ -348,7 +388,7 @@ ON CONFLICT (email) DO UPDATE SET password_hash = EXCLUDED.password_hash, platform_role = EXCLUDED.platform_role, updated_at = EXCLUDED.updated_at -RETURNING id::text, email, password_hash, mfa_enabled, created_at, updated_at +RETURNING id::text, email, password_hash, mfa_enabled, platform_role, created_at, updated_at `, email, params.PasswordHash, params.Role, now)) if err != nil { return nil, fmt.Errorf("upsert bootstrap owner: %w", err) @@ -461,6 +501,7 @@ func scanOptionalUser(row scanner) 
(*User, error) { &user.Email, &user.PasswordHash, &user.MFAEnabled, + &user.PlatformRole, &user.CreatedAt, &user.UpdatedAt, ); err != nil { diff --git a/backend/internal/modules/auth/repository.go b/backend/internal/modules/auth/repository.go index 0f31164..e8e5f47 100644 --- a/backend/internal/modules/auth/repository.go +++ b/backend/internal/modules/auth/repository.go @@ -7,8 +7,10 @@ import ( ) type UserRepository interface { + List(ctx context.Context) ([]User, error) GetByEmail(ctx context.Context, email string) (*User, error) GetByID(ctx context.Context, userID string) (*User, error) + Create(ctx context.Context, user User) (*User, error) } type DeviceRepository interface { diff --git a/backend/internal/modules/auth/service.go b/backend/internal/modules/auth/service.go index 7c1b1f7..a733b1b 100644 --- a/backend/internal/modules/auth/service.go +++ b/backend/internal/modules/auth/service.go @@ -13,11 +13,13 @@ import ( "github.com/example/remote-access-platform/backend/internal/platform/authority" "github.com/example/remote-access-platform/backend/internal/platform/module" + postgresplatform "github.com/example/remote-access-platform/backend/internal/platform/postgres" ) type Service struct { cfg module.Config store Store + db postgresplatform.DBTX transactor Transactor tokenManager *TokenManager authority *authority.Verifier @@ -31,7 +33,7 @@ func NewService(deps module.Dependencies, store Store, transactor Transactor, ve } else if verifier, err := authority.NewVerifier(deps.Config.Installation); err == nil { authorityVerifier = verifier } - return &Service{ + service := &Service{ cfg: deps.Config, store: store, transactor: transactor, @@ -45,6 +47,10 @@ func NewService(deps module.Dependencies, store Store, transactor Transactor, ve authority: authorityVerifier, now: time.Now, } + if postgresStore, ok := store.(*postgresStore); ok { + service.db = postgresStore.db + } + return service } func (s *Service) Login(ctx context.Context, cmd LoginCommand) 
(*AuthResult, error) { @@ -120,6 +126,44 @@ func (s *Service) Login(ctx context.Context, cmd LoginCommand) (*AuthResult, err return &result, nil } +func (s *Service) ListUsers(ctx context.Context, actorUserID string) ([]User, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + return s.store.Users().List(ctx) +} + +func (s *Service) CreateUser(ctx context.Context, cmd CreateUserCommand) (*User, error) { + if err := s.ensurePlatformAdmin(ctx, cmd.ActorUserID); err != nil { + return nil, err + } + email := strings.ToLower(strings.TrimSpace(cmd.Email)) + password := strings.TrimSpace(cmd.Password) + role := strings.TrimSpace(cmd.PlatformRole) + if role == "" { + role = "user" + } + if email == "" || !strings.Contains(email, "@") || len(password) < 8 { + return nil, ErrInvalidBootstrapOwner + } + if role != "user" && role != authority.PlatformRoleAdmin && role != authority.PlatformRoleRecoveryAdmin { + return nil, ErrInvalidBootstrapOwner + } + passwordHash, err := bcrypt.GenerateFromPassword([]byte(password), bcrypt.DefaultCost) + if err != nil { + return nil, fmt.Errorf("hash user password: %w", err) + } + now := s.now().UTC() + return s.store.Users().Create(ctx, User{ + Email: email, + PasswordHash: string(passwordHash), + MFAEnabled: false, + PlatformRole: role, + CreatedAt: now, + UpdatedAt: now, + }) +} + func (s *Service) Refresh(ctx context.Context, cmd RefreshCommand) (*AuthResult, error) { authSessionID, err := s.tokenManager.ParseRefreshToken(cmd.RefreshToken) if err != nil { @@ -438,3 +482,25 @@ func (s *Service) installationStatusFromRecord(record *InstallationAuthorityStat func (s *Service) strictAuthority() bool { return s.authority != nil && s.authority.Strict() } + +func (s *Service) ensurePlatformAdmin(ctx context.Context, actorUserID string) error { + if actorUserID == "" { + return ErrInvalidCredentials + } + role := authority.PlatformRoleUser + if s.db != nil { + effectiveRole, err := 
authority.EffectivePlatformRole(ctx, s.db, s.authority, actorUserID) + if err != nil { + return err + } + role = effectiveRole + } else if user, err := s.store.Users().GetByID(ctx, actorUserID); err != nil { + return err + } else if user != nil && user.PlatformRole != "" { + role = user.PlatformRole + } + if role != authority.PlatformRoleAdmin && role != authority.PlatformRoleRecoveryAdmin { + return ErrDeviceRevoked + } + return nil +} diff --git a/backend/internal/modules/cluster/models.go b/backend/internal/modules/cluster/models.go index f23c6c9..1b5a739 100644 --- a/backend/internal/modules/cluster/models.go +++ b/backend/internal/modules/cluster/models.go @@ -45,6 +45,20 @@ const ( VPNAssignmentStatusLeaseRequired = "lease_required" VPNAssignmentStatusBlocked = "blocked" VPNAssignmentStatusUnknown = "unknown" + + FabricServiceChannelStatusReady = "ready" + FabricServiceChannelStatusDegradedFallback = "degraded_fallback" + + FabricServiceClassVPNPackets = "vpn_packets" + FabricServiceClassRemoteWorkspace = "remote_workspace" + FabricServiceClassFileTransfer = "file_transfer" + FabricServiceClassVideo = "video" + + FabricChannelControl = "control" + FabricChannelInteractive = "interactive" + FabricChannelReliable = "reliable" + FabricChannelBulk = "bulk" + FabricChannelDroppable = "droppable" ) var allowedNodeRoles = map[string]struct{}{ @@ -124,6 +138,240 @@ type CreatedJoinToken struct { Token string `json:"token"` } +type DockerInstallProfileRequest struct { + ClusterID string `json:"cluster_id"` + InstallToken string `json:"install_token"` + NodeName string `json:"node_name,omitempty"` + HostFacts json.RawMessage `json:"host_facts,omitempty"` +} + +type DockerInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints,omitempty"` + ArtifactEndpoints []string `json:"artifact_endpoints,omitempty"` + 
DockerImageArtifact *DockerArtifact `json:"docker_image_artifact,omitempty"` + JoinToken string `json:"join_token"` + NodeName string `json:"node_name"` + Image string `json:"image"` + ContainerName string `json:"container_name"` + StateDir string `json:"state_dir"` + Network string `json:"network"` + RestartPolicy string `json:"restart_policy"` + PullImage bool `json:"pull_image"` + Replace bool `json:"replace"` + DockerVPNGatewayEnabled bool `json:"docker_vpn_gateway_enabled"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr,omitempty"` + MeshListenPortMode string `json:"mesh_listen_port_mode,omitempty"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start,omitempty"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end,omitempty"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint,omitempty"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json,omitempty"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport,omitempty"` + MeshConnectivityMode string `json:"mesh_connectivity_mode,omitempty"` + MeshNATType string `json:"mesh_nat_type,omitempty"` + MeshRegion string `json:"mesh_region,omitempty"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity,omitempty"` + Roles []string `json:"roles,omitempty"` +} + +type WindowsInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string 
`json:"control_plane_endpoints,omitempty"` + ArtifactEndpoints []string `json:"artifact_endpoints,omitempty"` + NodeAgentArtifact *DockerArtifact `json:"node_agent_artifact,omitempty"` + JoinToken string `json:"join_token"` + NodeName string `json:"node_name"` + StateDir string `json:"state_dir"` + InstallDir string `json:"install_dir"` + StartupMode string `json:"startup_mode"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr,omitempty"` + MeshListenPortMode string `json:"mesh_listen_port_mode,omitempty"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start,omitempty"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end,omitempty"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint,omitempty"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json,omitempty"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport,omitempty"` + MeshConnectivityMode string `json:"mesh_connectivity_mode,omitempty"` + MeshNATType string `json:"mesh_nat_type,omitempty"` + MeshRegion string `json:"mesh_region,omitempty"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity,omitempty"` + Roles []string `json:"roles,omitempty"` +} + +type LinuxInstallProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints,omitempty"` + ArtifactEndpoints []string `json:"artifact_endpoints,omitempty"` + 
NodeAgentArtifact *DockerArtifact `json:"node_agent_artifact,omitempty"` + JoinToken string `json:"join_token"` + NodeName string `json:"node_name"` + StateDir string `json:"state_dir"` + InstallDir string `json:"install_dir"` + StartupMode string `json:"startup_mode"` + WorkloadSupervisionEnabled bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr,omitempty"` + MeshListenPortMode string `json:"mesh_listen_port_mode,omitempty"` + MeshListenAutoPortStart int `json:"mesh_listen_auto_port_start,omitempty"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end,omitempty"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint,omitempty"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json,omitempty"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport,omitempty"` + MeshConnectivityMode string `json:"mesh_connectivity_mode,omitempty"` + MeshNATType string `json:"mesh_nat_type,omitempty"` + MeshRegion string `json:"mesh_region,omitempty"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity,omitempty"` + Roles []string `json:"roles,omitempty"` +} + +type DockerArtifact struct { + Kind string `json:"kind"` + Image string `json:"image,omitempty"` + MediaType string `json:"media_type,omitempty"` + FileName string `json:"file_name,omitempty"` + URLs []string `json:"urls,omitempty"` + SHA256 string `json:"sha256,omitempty"` + SizeBytes int64 `json:"size_bytes,omitempty"` +} + +type ReleaseVersion struct { + ID string `json:"id"` + ClusterID string `json:"cluster_id"` + Product 
string `json:"product"` + Version string `json:"version"` + Channel string `json:"channel"` + Status string `json:"status"` + Compatibility json.RawMessage `json:"compatibility"` + Changelog *string `json:"changelog,omitempty"` + CreatedByUserID *string `json:"created_by_user_id,omitempty"` + CreatedAt time.Time `json:"created_at"` + Artifacts []ReleaseArtifact `json:"artifacts,omitempty"` + AuthorityPayload json.RawMessage `json:"authority_payload,omitempty"` + AuthoritySignature *ClusterSignature `json:"authority_signature,omitempty"` +} + +type ReleaseArtifact struct { + ID string `json:"id"` + ReleaseID string `json:"release_id"` + ClusterID string `json:"cluster_id"` + Product string `json:"product"` + Version string `json:"version"` + OS string `json:"os"` + Arch string `json:"arch"` + InstallType string `json:"install_type"` + Kind string `json:"kind"` + URL string `json:"url"` + URLs []string `json:"urls,omitempty"` + SHA256 string `json:"sha256"` + SizeBytes int64 `json:"size_bytes"` + Signature *string `json:"signature,omitempty"` + Metadata json.RawMessage `json:"metadata"` + CreatedAt time.Time `json:"created_at"` +} + +type NodeUpdatePolicy struct { + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + Product string `json:"product"` + Channel string `json:"channel"` + TargetVersion *string `json:"target_version,omitempty"` + Strategy string `json:"strategy"` + Enabled bool `json:"enabled"` + RollbackAllowed bool `json:"rollback_allowed"` + HealthWindowSec int `json:"health_window_seconds"` + UpdatedByUserID *string `json:"updated_by_user_id,omitempty"` + UpdatedAt time.Time `json:"updated_at"` +} + +type NodeUpdateHint struct { + SchemaVersion string `json:"schema_version"` + Generation string `json:"generation,omitempty"` + CheckNow bool `json:"check_now"` + Products []string `json:"products,omitempty"` + Reason string `json:"reason,omitempty"` + DeliveryMode string `json:"delivery_mode,omitempty"` + SubscriptionStatus string 
`json:"subscription_status,omitempty"` + UpdateService *NodeUpdateServiceAssignment `json:"update_service,omitempty"` + FallbackPollSeconds int `json:"fallback_poll_seconds,omitempty"` +} + +type NodeUpdateServiceAssignment struct { + SchemaVersion string `json:"schema_version"` + NodeID string `json:"node_id,omitempty"` + NodeName string `json:"node_name,omitempty"` + Endpoint string `json:"endpoint,omitempty"` + Region string `json:"region,omitempty"` + Status string `json:"status"` + Reason string `json:"reason,omitempty"` + AssignedAt time.Time `json:"assigned_at"` + ExpiresAt time.Time `json:"expires_at"` +} + +type NodeUpdateServiceCandidate struct { + NodeID string + NodeName string + Endpoint string + Region string + LastSeenAt *time.Time +} + +type NodeUpdatePlan struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + Product string `json:"product"` + CurrentVersion string `json:"current_version,omitempty"` + Action string `json:"action"` + Reason string `json:"reason"` + TargetVersion string `json:"target_version,omitempty"` + Channel string `json:"channel,omitempty"` + Strategy string `json:"strategy,omitempty"` + RollbackAllowed bool `json:"rollback_allowed"` + HealthWindowSec int `json:"health_window_seconds,omitempty"` + Artifact *ReleaseArtifact `json:"artifact,omitempty"` + AuthorityPayload json.RawMessage `json:"authority_payload,omitempty"` + AuthoritySignature *ClusterSignature `json:"authority_signature,omitempty"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type NodeUpdateStatus struct { + ID string `json:"id"` + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + Product string `json:"product"` + CurrentVersion string `json:"current_version,omitempty"` + TargetVersion string `json:"target_version,omitempty"` + Phase string `json:"phase"` + Status string `json:"status"` + AttemptID string `json:"attempt_id,omitempty"` + 
ErrorMessage *string `json:"error_message,omitempty"` + RollbackVersion *string `json:"rollback_version,omitempty"` + Payload json.RawMessage `json:"payload"` + ObservedAt time.Time `json:"observed_at"` +} + type NodeBootstrap struct { NodeID string `json:"node_id"` ClusterID string `json:"cluster_id"` @@ -227,6 +475,9 @@ type MeshRouteIntent struct { ServiceClass string `json:"service_class"` Priority int `json:"priority"` Status string `json:"status"` + LifecycleStatus string `json:"lifecycle_status,omitempty"` + IsExpired bool `json:"is_expired"` + PolicyExpiresAt *time.Time `json:"policy_expires_at,omitempty"` Policy json.RawMessage `json:"policy"` CreatedByUserID *string `json:"created_by_user_id,omitempty"` CreatedAt time.Time `json:"created_at"` @@ -248,6 +499,89 @@ type SyntheticMeshRouteConfig struct { PeerDirectoryVersion string `json:"peer_directory_version,omitempty"` } +type FabricServiceChannelRouteFeedbackReport struct { + SchemaVersion string `json:"schema_version"` + GeneratedAt time.Time `json:"generated_at"` + FeedbackMaxAgeSeconds int `json:"feedback_max_age_seconds"` + RecoveryPolicy *FabricServiceChannelRecoveryPolicy `json:"recovery_policy,omitempty"` + MissingProvenanceCount int `json:"missing_provenance_count,omitempty"` + StalePolicyCount int `json:"stale_policy_count,omitempty"` + StaleGenerationCount int `json:"stale_generation_count,omitempty"` + ObservationCount int `json:"observation_count"` + FencedRouteCount int `json:"fenced_route_count"` + DegradedRouteCount int `json:"degraded_route_count"` + HealthyRouteCount int `json:"healthy_route_count"` + RecoveredRouteCount int `json:"recovered_route_count,omitempty"` + RecoveryHysteresisCount int `json:"recovery_hysteresis_count,omitempty"` + RecoveryPromotedCount int `json:"recovery_promoted_count,omitempty"` + RecoveryDemotedCount int `json:"recovery_demoted_count,omitempty"` + Observations []FabricServiceChannelRouteFeedbackObservation `json:"observations,omitempty"` +} + +type 
FabricServiceChannelRecoveryPolicy struct { + SchemaVersion string `json:"schema_version"` + Fingerprint string `json:"fingerprint,omitempty"` + HysteresisPenalty int `json:"hysteresis_penalty"` + PromotionMinSamples int `json:"promotion_min_samples"` + DemotionFailureThreshold int `json:"demotion_failure_threshold"` + DemotionDropThreshold int `json:"demotion_drop_threshold"` + DemotionSlowThreshold int `json:"demotion_slow_threshold"` + DemotionRebuildEnabled bool `json:"demotion_rebuild_enabled"` + DemotionFencedEnabled bool `json:"demotion_fenced_enabled"` + Source string `json:"source"` + UpdatedByUserID *string `json:"updated_by_user_id,omitempty"` + UpdatedAt time.Time `json:"updated_at,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type FabricServiceChannelAdaptivePolicy struct { + SchemaVersion string `json:"schema_version"` + Fingerprint string `json:"fingerprint,omitempty"` + MaxParallelWindow int `json:"max_parallel_window"` + BulkPressureChannelThreshold int `json:"bulk_pressure_channel_threshold"` + QueuePressureHighWatermark int `json:"queue_pressure_high_watermark"` + QueuePressureMaxInFlight int `json:"queue_pressure_max_in_flight"` + ClassWindows map[string]int `json:"class_windows"` + Source string `json:"source"` + UpdatedByUserID *string `json:"updated_by_user_id,omitempty"` + UpdatedAt time.Time `json:"updated_at,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type FabricServiceChannelPoolPolicy struct { + SchemaVersion string `json:"schema_version"` + Fingerprint string `json:"fingerprint,omitempty"` + EntryPoolNodeIDs []string `json:"entry_pool_node_ids,omitempty"` + ExitPoolNodeIDs []string `json:"exit_pool_node_ids,omitempty"` + PreferredEntryNodeID string `json:"preferred_entry_node_id,omitempty"` + PreferredExitNodeID string `json:"preferred_exit_node_id,omitempty"` + 
SelectionStrategy string `json:"selection_strategy"` + RouteRebuild string `json:"route_rebuild"` + EntryFailover string `json:"entry_failover"` + ExitFailover string `json:"exit_failover"` + BackendFallbackAllowed bool `json:"backend_fallback_allowed"` + StickySession bool `json:"sticky_session"` + Source string `json:"source"` + UpdatedByUserID *string `json:"updated_by_user_id,omitempty"` + UpdatedAt time.Time `json:"updated_at,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type FabricServiceChannelBreadcrumbWindowPolicy struct { + SchemaVersion string `json:"schema_version"` + Fingerprint string `json:"fingerprint,omitempty"` + CurrentWindowSeconds int64 `json:"current_window_seconds"` + HistoryWindowSeconds int64 `json:"history_window_seconds"` + Source string `json:"source"` + UpdatedByUserID *string `json:"updated_by_user_id,omitempty"` + UpdatedAt time.Time `json:"updated_at,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` +} + type PeerEndpointCandidate struct { EndpointID string `json:"endpoint_id"` NodeID string `json:"node_id"` @@ -325,64 +659,108 @@ type RendezvousRelayPolicyReport struct { } type RoutePathDecision struct { - DecisionID string `json:"decision_id"` - RouteID string `json:"route_id"` - ClusterID string `json:"cluster_id"` - LocalNodeID string `json:"local_node_id"` - SourceNodeID string `json:"source_node_id"` - DestinationNodeID string `json:"destination_node_id"` - OriginalHops []string `json:"original_hops"` - EffectiveHops []string `json:"effective_hops"` - PreviousHopID string `json:"previous_hop_id,omitempty"` - NextHopID string `json:"next_hop_id,omitempty"` - LocalRole string `json:"local_role"` - SelectedRelayID string `json:"selected_relay_id,omitempty"` - SelectedRelayEndpoint string `json:"selected_relay_endpoint,omitempty"` - StaleRelayNodeID string 
`json:"stale_relay_node_id,omitempty"` - RendezvousPeerNodeID string `json:"rendezvous_peer_node_id,omitempty"` - RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` - RendezvousLeaseReason string `json:"rendezvous_lease_reason,omitempty"` - DecisionSource string `json:"decision_source"` - Generation string `json:"generation"` - PathScore int `json:"path_score,omitempty"` - ScoreReasons []string `json:"score_reasons,omitempty"` - ControlPlaneOnly bool `json:"control_plane_only"` - ProductionForwarding bool `json:"production_forwarding"` - ExpiresAt time.Time `json:"expires_at"` + DecisionID string `json:"decision_id"` + RouteID string `json:"route_id"` + ReplacementRouteID string `json:"replacement_route_id,omitempty"` + RebuildRequestID string `json:"rebuild_request_id,omitempty"` + RebuildStatus string `json:"rebuild_status,omitempty"` + RebuildReason string `json:"rebuild_reason,omitempty"` + RebuildAttempt int `json:"rebuild_attempt,omitempty"` + FeedbackObservationID string `json:"feedback_observation_id,omitempty"` + FeedbackSource string `json:"feedback_source,omitempty"` + FeedbackObservedAt *time.Time `json:"feedback_observed_at,omitempty"` + FeedbackExpiresAt *time.Time `json:"feedback_expires_at,omitempty"` + FeedbackChannelID string `json:"feedback_channel_id,omitempty"` + FeedbackResourceID string `json:"feedback_resource_id,omitempty"` + FeedbackViolationStatus string `json:"feedback_violation_status,omitempty"` + FeedbackViolationReason string `json:"feedback_violation_reason,omitempty"` + ClusterID string `json:"cluster_id"` + LocalNodeID string `json:"local_node_id"` + SourceNodeID string `json:"source_node_id"` + DestinationNodeID string `json:"destination_node_id"` + OriginalHops []string `json:"original_hops"` + EffectiveHops []string `json:"effective_hops"` + PreviousHopID string `json:"previous_hop_id,omitempty"` + NextHopID string `json:"next_hop_id,omitempty"` + LocalRole string `json:"local_role"` + SelectedRelayID string 
`json:"selected_relay_id,omitempty"` + SelectedRelayEndpoint string `json:"selected_relay_endpoint,omitempty"` + StaleRelayNodeID string `json:"stale_relay_node_id,omitempty"` + RendezvousPeerNodeID string `json:"rendezvous_peer_node_id,omitempty"` + RendezvousLeaseID string `json:"rendezvous_lease_id,omitempty"` + RendezvousLeaseReason string `json:"rendezvous_lease_reason,omitempty"` + DecisionSource string `json:"decision_source"` + Generation string `json:"generation"` + PathScore int `json:"path_score,omitempty"` + ScoreReasons []string `json:"score_reasons,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` + ExpiresAt time.Time `json:"expires_at"` } type RoutePathDecisionReport struct { - SchemaVersion string `json:"schema_version"` - DecisionMode string `json:"decision_mode"` - Generation string `json:"generation"` - DecisionCount int `json:"decision_count"` - ReplacementDecisionCount int `json:"replacement_decision_count"` - ControlPlaneOnly bool `json:"control_plane_only"` - ProductionForwarding bool `json:"production_forwarding"` - Decisions []RoutePathDecision `json:"decisions,omitempty"` + SchemaVersion string `json:"schema_version"` + DecisionMode string `json:"decision_mode"` + Generation string `json:"generation"` + RecoveryPolicy *FabricServiceChannelRecoveryPolicy `json:"recovery_policy,omitempty"` + DecisionCount int `json:"decision_count"` + ReplacementDecisionCount int `json:"replacement_decision_count"` + DegradedDecisionCount int `json:"degraded_decision_count"` + RebuildRequestCount int `json:"rebuild_request_count,omitempty"` + RebuildAppliedCount int `json:"rebuild_applied_count,omitempty"` + RecoveryHysteresisCount int `json:"recovery_hysteresis_count,omitempty"` + RecoveryPromotedCount int `json:"recovery_promoted_count,omitempty"` + RecoveryDemotedCount int `json:"recovery_demoted_count,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + 
ProductionForwarding bool `json:"production_forwarding"` + Decisions []RoutePathDecision `json:"decisions,omitempty"` } type NodeSyntheticMeshConfig struct { - Enabled bool `json:"enabled"` - SchemaVersion string `json:"schema_version"` - ClusterID string `json:"cluster_id"` - LocalNodeID string `json:"local_node_id"` - AuthorityRequired bool `json:"authority_required"` - ClusterAuthority *ClusterAuthorityDescriptor `json:"cluster_authority,omitempty"` - AuthorityPayload json.RawMessage `json:"authority_payload,omitempty"` - AuthoritySignature *ClusterSignature `json:"authority_signature,omitempty"` - ConfigVersion string `json:"config_version,omitempty"` - PeerDirectoryVersion string `json:"peer_directory_version,omitempty"` - PolicyVersion string `json:"policy_version,omitempty"` - PeerEndpoints map[string]string `json:"peer_endpoints"` - PeerEndpointCandidates map[string][]PeerEndpointCandidate `json:"peer_endpoint_candidates,omitempty"` - PeerDirectory []PeerDirectoryEntry `json:"peer_directory,omitempty"` - RecoverySeeds []PeerRecoverySeed `json:"recovery_seeds,omitempty"` - RendezvousLeases []PeerRendezvousLease `json:"rendezvous_leases,omitempty"` - RendezvousRelayPolicy *RendezvousRelayPolicyReport `json:"rendezvous_relay_policy,omitempty"` - RoutePathDecisions *RoutePathDecisionReport `json:"route_path_decisions,omitempty"` - Routes []SyntheticMeshRouteConfig `json:"routes"` - ProductionForwarding bool `json:"production_forwarding"` + Enabled bool `json:"enabled"` + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + LocalNodeID string `json:"local_node_id"` + AuthorityRequired bool `json:"authority_required"` + ClusterAuthority *ClusterAuthorityDescriptor `json:"cluster_authority,omitempty"` + AuthorityPayload json.RawMessage `json:"authority_payload,omitempty"` + AuthoritySignature *ClusterSignature `json:"authority_signature,omitempty"` + ConfigVersion string `json:"config_version,omitempty"` + PeerDirectoryVersion 
string `json:"peer_directory_version,omitempty"` + PolicyVersion string `json:"policy_version,omitempty"` + PeerEndpoints map[string]string `json:"peer_endpoints"` + PeerEndpointCandidates map[string][]PeerEndpointCandidate `json:"peer_endpoint_candidates,omitempty"` + PeerDirectory []PeerDirectoryEntry `json:"peer_directory,omitempty"` + RecoverySeeds []PeerRecoverySeed `json:"recovery_seeds,omitempty"` + RendezvousLeases []PeerRendezvousLease `json:"rendezvous_leases,omitempty"` + RendezvousRelayPolicy *RendezvousRelayPolicyReport `json:"rendezvous_relay_policy,omitempty"` + RoutePathDecisions *RoutePathDecisionReport `json:"route_path_decisions,omitempty"` + ServiceChannelFeedback *FabricServiceChannelRouteFeedbackReport `json:"service_channel_route_feedback,omitempty"` + ServiceChannelAdaptivePolicy *FabricServiceChannelAdaptivePolicy `json:"service_channel_adaptive_policy,omitempty"` + ServiceChannelRemediationCommands []FabricServiceChannelAccessRemediationCommand `json:"service_channel_remediation_commands,omitempty"` + MeshListener *NodeMeshListenerConfig `json:"mesh_listener,omitempty"` + Routes []SyntheticMeshRouteConfig `json:"routes"` + ProductionForwarding bool `json:"production_forwarding"` +} + +type NodeMeshListenerConfig struct { + SchemaVersion string `json:"schema_version"` + Source string `json:"source"` + DesiredState string `json:"desired_state"` + ListenAddr string `json:"listen_addr"` + ListenPortMode string `json:"listen_port_mode"` + AutoPortStart int `json:"auto_port_start,omitempty"` + AutoPortEnd int `json:"auto_port_end,omitempty"` + AdvertiseEndpoint string `json:"advertise_endpoint,omitempty"` + AdvertiseTransport string `json:"advertise_transport,omitempty"` + ConnectivityMode string `json:"connectivity_mode,omitempty"` + NATType string `json:"nat_type,omitempty"` + Region string `json:"region,omitempty"` + ConfigVersion string `json:"config_version,omitempty"` + UpdatedByUserID string `json:"updated_by_user_id,omitempty"` + 
UpdatedAt string `json:"updated_at,omitempty"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` } type MeshQoSPolicy struct { @@ -448,6 +826,700 @@ type FabricEgressPoolNode struct { AddedAt time.Time `json:"added_at"` } +type FabricServiceChannelNodeCandidate struct { + NodeID string `json:"node_id"` + Role string `json:"role,omitempty"` + Priority int `json:"priority,omitempty"` + Status string `json:"status"` + Metadata json.RawMessage `json:"metadata,omitempty"` +} + +type FabricServiceChannelRoute struct { + RouteID string `json:"route_id,omitempty"` + ClusterID string `json:"cluster_id"` + ServiceClass string `json:"service_class"` + SourceNodeID string `json:"source_node_id"` + DestinationNodeID string `json:"destination_node_id"` + Hops []string `json:"hops"` + AllowedChannels []string `json:"allowed_channels"` + RouteVersion string `json:"route_version,omitempty"` + PolicyVersion string `json:"policy_version,omitempty"` + Generation string `json:"generation"` + Status string `json:"status"` + RecoveryState string `json:"recovery_state,omitempty"` + RecoveryPenalty int `json:"recovery_penalty,omitempty"` + RecoveryPromoted bool `json:"recovery_promoted,omitempty"` + RecoveryDemoted bool `json:"recovery_demoted,omitempty"` + RecoveryReason string `json:"recovery_reason,omitempty"` + RecoveryPolicy *FabricServiceChannelRecoveryPolicy `json:"recovery_policy,omitempty"` + PathScore int `json:"path_score,omitempty"` + ScoreReasons []string `json:"score_reasons,omitempty"` + ExpiresAt time.Time `json:"expires_at"` +} + +type FabricServiceChannelFallback struct { + Allowed bool `json:"allowed"` + Active bool `json:"active"` + Transport string `json:"transport,omitempty"` + Reason string `json:"reason,omitempty"` + Degraded bool `json:"degraded"` + BackendRelay bool `json:"backend_relay"` + Compatibility bool `json:"compatibility"` +} + +type FabricServiceChannelToken struct { + Type string `json:"type"` 
+ Token string `json:"token"` + TTLSeconds int `json:"ttl_seconds"` + IntrospectionPath string `json:"introspection_path,omitempty"` +} + +type FabricServiceChannelHTTPIngress struct { + Type string `json:"type"` + PathTemplate string `json:"path_template"` + WebSocketPathTemplate string `json:"websocket_path_template,omitempty"` + TokenHeader string `json:"token_header"` + ServiceClassHeader string `json:"service_class_header,omitempty"` + ChannelClassHeader string `json:"channel_class_header,omitempty"` + PacketBatchFormat string `json:"packet_batch_format"` + SupportedMethods []string `json:"supported_methods"` +} + +type FabricServiceChannelDataPlaneContract struct { + SchemaVersion string `json:"schema_version"` + Mode string `json:"mode"` + ControlPlaneTransport string `json:"control_plane_transport"` + WorkingDataTransport string `json:"working_data_transport"` + SteadyStateTransport string `json:"steady_state_transport"` + BackendRelayPolicy string `json:"backend_relay_policy"` + ProductionForwardingRequired bool `json:"production_forwarding_required"` + ServiceNeutral bool `json:"service_neutral"` + ProtocolAgnostic bool `json:"protocol_agnostic"` + LogicalFlowMode string `json:"logical_flow_mode"` + RequiredFlowIsolationClasses []string `json:"required_flow_isolation_classes,omitempty"` + RouteSelectionStrategy string `json:"route_selection_strategy"` + EntryFailoverMode string `json:"entry_failover_mode"` + ExitFailoverMode string `json:"exit_failover_mode"` + RouteRebuildMode string `json:"route_rebuild_mode"` + FailureDetectionSource string `json:"failure_detection_source"` + DegradedFallbackVisibility string `json:"degraded_fallback_visibility"` + StableContractForServiceClass string `json:"stable_contract_for_service_class,omitempty"` +} + +type FabricServiceChannelLeaseAuthorityPayload struct { + SchemaVersion string `json:"schema_version"` + ChannelID string `json:"channel_id"` + ClusterID string `json:"cluster_id"` + OrganizationID string 
`json:"organization_id"` + UserID string `json:"user_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + Status string `json:"status"` + SelectedEntryNodeID string `json:"selected_entry_node_id"` + SelectedExitNodeID string `json:"selected_exit_node_id"` + EntryPool []FabricServiceChannelNodeCandidate `json:"entry_pool,omitempty"` + ExitPool []FabricServiceChannelNodeCandidate `json:"exit_pool,omitempty"` + AllowedChannels []string `json:"allowed_channels"` + PrimaryRoute FabricServiceChannelRoute `json:"primary_route"` + RecoveryPolicy *FabricServiceChannelRecoveryPolicy `json:"recovery_policy,omitempty"` + PoolPolicy *FabricServiceChannelPoolPolicy `json:"pool_policy,omitempty"` + DataPlane FabricServiceChannelDataPlaneContract `json:"data_plane"` + RouteGeneration string `json:"route_generation"` + FencingEpoch int64 `json:"fencing_epoch"` + TokenHash string `json:"token_hash"` + IssuedAt time.Time `json:"issued_at"` + ExpiresAt time.Time `json:"expires_at"` +} + +type FabricServiceChannelLease struct { + SchemaVersion string `json:"schema_version"` + ChannelID string `json:"channel_id"` + ClusterID string `json:"cluster_id"` + OrganizationID string `json:"organization_id"` + UserID string `json:"user_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + Status string `json:"status"` + SelectedEntryNodeID string `json:"selected_entry_node_id"` + SelectedExitNodeID string `json:"selected_exit_node_id"` + EntryPool []FabricServiceChannelNodeCandidate `json:"entry_pool"` + ExitPool []FabricServiceChannelNodeCandidate `json:"exit_pool"` + RequiredRoles []string `json:"required_roles"` + AllowedChannels []string `json:"allowed_channels"` + PrimaryRoute FabricServiceChannelRoute `json:"primary_route"` + AlternateRoutes []FabricServiceChannelRoute `json:"alternate_routes,omitempty"` + RecoveryPolicy *FabricServiceChannelRecoveryPolicy `json:"recovery_policy,omitempty"` + 
PoolPolicy *FabricServiceChannelPoolPolicy `json:"pool_policy,omitempty"` + DataPlane FabricServiceChannelDataPlaneContract `json:"data_plane"` + QoS json.RawMessage `json:"qos"` + Failover json.RawMessage `json:"failover"` + Fallback FabricServiceChannelFallback `json:"fallback"` + Token FabricServiceChannelToken `json:"token"` + EntryHTTP FabricServiceChannelHTTPIngress `json:"entry_http"` + RouteGeneration string `json:"route_generation"` + FencingEpoch int64 `json:"fencing_epoch"` + IssuedAt time.Time `json:"issued_at"` + ExpiresAt time.Time `json:"expires_at"` + Metadata json.RawMessage `json:"metadata,omitempty"` + AuthorityPayload json.RawMessage `json:"authority_payload,omitempty"` + AuthoritySignature *ClusterSignature `json:"authority_signature,omitempty"` +} + +type FabricServiceChannelLeaseRecord struct { + ClusterID string `json:"cluster_id"` + ChannelID string `json:"channel_id"` + TokenHash string `json:"token_hash"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + SelectedEntryNodeID string `json:"selected_entry_node_id,omitempty"` + ExpiresAt time.Time `json:"expires_at"` + Lease FabricServiceChannelLease `json:"lease"` + CreatedAt time.Time `json:"created_at"` + UpdatedAt time.Time `json:"updated_at"` +} + +type FabricServiceChannelLeaseSummary struct { + ClusterID string `json:"cluster_id"` + ChannelID string `json:"channel_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + Status string `json:"status"` + SelectedEntryNodeID string `json:"selected_entry_node_id,omitempty"` + SelectedExitNodeID string `json:"selected_exit_node_id,omitempty"` + AllowedChannels []string `json:"allowed_channels,omitempty"` + PrimaryRouteID string `json:"primary_route_id,omitempty"` + PrimaryRouteStatus string `json:"primary_route_status,omitempty"` + DataPlane FabricServiceChannelDataPlaneContract `json:"data_plane,omitempty"` + ForceBackendFallback bool 
`json:"force_backend_fallback"` + Expired bool `json:"expired"` + IssuedAt time.Time `json:"issued_at"` + ExpiresAt time.Time `json:"expires_at"` + CreatedAt time.Time `json:"created_at"` + UpdatedAt time.Time `json:"updated_at"` +} + +type FabricServiceChannelLeaseMaintenance struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + Status string `json:"status"` + Reason string `json:"reason"` + ObservedAt time.Time `json:"observed_at"` + ActiveCount int `json:"active_count"` + ExpiredCount int `json:"expired_count"` + ScannedCount int `json:"scanned_count"` + DeletedExpiredCount int `json:"deleted_expired_count,omitempty"` + WindowLimit int `json:"window_limit"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` + Leases []FabricServiceChannelLeaseSummary `json:"leases,omitempty"` +} + +type FabricServiceChannelAccessTelemetry struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + Status string `json:"status"` + Reason string `json:"reason"` + ObservedAt time.Time `json:"observed_at"` + NodeCount int `json:"node_count"` + ReportingNodeCount int `json:"reporting_node_count"` + TotalAccepted int `json:"total_accepted"` + SignedAccepted int `json:"signed_accepted"` + IntrospectionAccepted int `json:"introspection_accepted"` + LegacyUnsignedAccepted int `json:"legacy_unsigned_accepted"` + BackendFallbackCount int `json:"backend_fallback_count"` + BackendFallbackBlockedCount int `json:"backend_fallback_blocked_count,omitempty"` + FabricRouteSendFailureCount int `json:"fabric_route_send_failure_count,omitempty"` + DataPlaneContractCount int `json:"data_plane_contract_count,omitempty"` + LastDataPlaneMode string `json:"last_data_plane_mode,omitempty"` + LastWorkingDataTransport string `json:"last_working_data_transport,omitempty"` + LastSteadyStateTransport string `json:"last_steady_state_transport,omitempty"` + LastBackendRelayPolicy string 
`json:"last_backend_relay_policy,omitempty"` + LastLogicalFlowMode string `json:"last_logical_flow_mode,omitempty"` + LastDataPlaneViolationStatus string `json:"last_data_plane_violation_status,omitempty"` + LastDataPlaneViolationReason string `json:"last_data_plane_violation_reason,omitempty"` + ActiveChannelCount int `json:"active_channel_count"` + DegradedFallbackChannelCount int `json:"degraded_fallback_channel_count"` + CorrelatedRouteCount int `json:"correlated_route_count"` + DegradedRouteCount int `json:"degraded_route_count"` + RouteDecisionChannelCount int `json:"route_decision_channel_count,omitempty"` + ReplacementDecisionCount int `json:"replacement_decision_count,omitempty"` + AppliedRebuildDecisionCount int `json:"applied_rebuild_decision_count,omitempty"` + RecoveryDecisionCount int `json:"recovery_decision_count,omitempty"` + NoSafeRecoveryDecisionCount int `json:"no_safe_recovery_decision_count,omitempty"` + TrafficClassCounts map[string]int `json:"traffic_class_counts,omitempty"` + FlowChannelCount int `json:"flow_channel_count,omitempty"` + FlowDropped int `json:"flow_dropped,omitempty"` + FlowHighWatermark int `json:"flow_high_watermark,omitempty"` + FlowMaxInFlight int `json:"flow_max_in_flight,omitempty"` + FlowHealthStatus string `json:"flow_health_status,omitempty"` + FlowHealthReason string `json:"flow_health_reason,omitempty"` + RecommendedParallelWindows map[string]int `json:"recommended_parallel_windows,omitempty"` + AdaptiveBackpressureActive bool `json:"adaptive_backpressure_active,omitempty"` + AdaptiveBackpressureReason string `json:"adaptive_backpressure_reason,omitempty"` + AdaptivePolicyFingerprint string `json:"adaptive_policy_fingerprint,omitempty"` + LatestAcceptedAt *time.Time `json:"latest_accepted_at,omitempty"` + Nodes []FabricServiceChannelAccessTelemetryNode `json:"nodes,omitempty"` + ActiveChannels []FabricServiceChannelAccessTelemetryChannel `json:"active_channels,omitempty"` + RecommendedOperatorAction string 
`json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelAccessTelemetryNode struct { + NodeID string `json:"node_id"` + NodeName string `json:"node_name,omitempty"` + ObservedAt time.Time `json:"observed_at"` + TotalAccepted int `json:"total_accepted"` + SignedAccepted int `json:"signed_accepted"` + IntrospectionAccepted int `json:"introspection_accepted"` + LegacyUnsignedAccepted int `json:"legacy_unsigned_accepted"` + BackendFallbackCount int `json:"backend_fallback_count"` + BackendFallbackBlockedCount int `json:"backend_fallback_blocked_count,omitempty"` + FabricRouteSendFailureCount int `json:"fabric_route_send_failure_count,omitempty"` + DataPlaneContractCount int `json:"data_plane_contract_count,omitempty"` + LastDataPlaneMode string `json:"last_data_plane_mode,omitempty"` + LastWorkingDataTransport string `json:"last_working_data_transport,omitempty"` + LastSteadyStateTransport string `json:"last_steady_state_transport,omitempty"` + LastBackendRelayPolicy string `json:"last_backend_relay_policy,omitempty"` + LastLogicalFlowMode string `json:"last_logical_flow_mode,omitempty"` + LastDataPlaneViolationStatus string `json:"last_data_plane_violation_status,omitempty"` + LastDataPlaneViolationReason string `json:"last_data_plane_violation_reason,omitempty"` + TrafficClassCounts map[string]int `json:"traffic_class_counts,omitempty"` + FlowChannelCount int `json:"flow_channel_count,omitempty"` + FlowDropped int `json:"flow_dropped,omitempty"` + FlowHighWatermark int `json:"flow_high_watermark,omitempty"` + FlowMaxInFlight int `json:"flow_max_in_flight,omitempty"` + FlowHealthStatus string `json:"flow_health_status,omitempty"` + FlowHealthReason string `json:"flow_health_reason,omitempty"` + RecommendedParallelWindows map[string]int `json:"recommended_parallel_windows,omitempty"` + AdaptiveBackpressureActive bool `json:"adaptive_backpressure_active,omitempty"` + AdaptiveBackpressureReason string `json:"adaptive_backpressure_reason,omitempty"` 
+ AdaptivePolicyFingerprint string `json:"adaptive_policy_fingerprint,omitempty"` + LastAcceptedAt *time.Time `json:"last_accepted_at,omitempty"` +} + +type FabricServiceChannelAccessTelemetryChannel struct { + ChannelID string `json:"channel_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + Status string `json:"status"` + SelectedEntryNodeID string `json:"selected_entry_node_id,omitempty"` + SelectedExitNodeID string `json:"selected_exit_node_id,omitempty"` + PrimaryRouteID string `json:"primary_route_id,omitempty"` + PrimaryRouteStatus string `json:"primary_route_status,omitempty"` + ForceBackendFallback bool `json:"force_backend_fallback"` + EntryNodeTotalAccepted int `json:"entry_node_total_accepted"` + EntryNodeIntrospectionAccepted int `json:"entry_node_introspection_accepted"` + EntryNodeBackendFallbackCount int `json:"entry_node_backend_fallback_count"` + EntryNodeBackendFallbackBlockedCount int `json:"entry_node_backend_fallback_blocked_count,omitempty"` + EntryNodeFabricRouteSendFailureCount int `json:"entry_node_fabric_route_send_failure_count,omitempty"` + EntryNodeDataPlaneContractCount int `json:"entry_node_data_plane_contract_count,omitempty"` + EntryNodeLastDataPlaneMode string `json:"entry_node_last_data_plane_mode,omitempty"` + EntryNodeLastWorkingDataTransport string `json:"entry_node_last_working_data_transport,omitempty"` + EntryNodeLastSteadyStateTransport string `json:"entry_node_last_steady_state_transport,omitempty"` + EntryNodeLastBackendRelayPolicy string `json:"entry_node_last_backend_relay_policy,omitempty"` + EntryNodeLastLogicalFlowMode string `json:"entry_node_last_logical_flow_mode,omitempty"` + EntryNodeLastDataPlaneViolationStatus string `json:"entry_node_last_data_plane_violation_status,omitempty"` + EntryNodeLastDataPlaneViolationReason string `json:"entry_node_last_data_plane_violation_reason,omitempty"` + EntryNodeTrafficClassCounts map[string]int 
`json:"entry_node_traffic_class_counts,omitempty"` + EntryNodeFlowChannelCount int `json:"entry_node_flow_channel_count,omitempty"` + EntryNodeFlowDropped int `json:"entry_node_flow_dropped,omitempty"` + EntryNodeFlowHighWatermark int `json:"entry_node_flow_high_watermark,omitempty"` + EntryNodeFlowMaxInFlight int `json:"entry_node_flow_max_in_flight,omitempty"` + EntryNodeFlowHealthStatus string `json:"entry_node_flow_health_status,omitempty"` + EntryNodeFlowHealthReason string `json:"entry_node_flow_health_reason,omitempty"` + EntryNodeRecommendedParallelWindows map[string]int `json:"entry_node_recommended_parallel_windows,omitempty"` + EntryNodeAdaptiveBackpressureActive bool `json:"entry_node_adaptive_backpressure_active,omitempty"` + EntryNodeAdaptiveBackpressureReason string `json:"entry_node_adaptive_backpressure_reason,omitempty"` + EntryNodeAdaptivePolicyFingerprint string `json:"entry_node_adaptive_policy_fingerprint,omitempty"` + RouteFeedbackStatus string `json:"route_feedback_status,omitempty"` + RouteFeedbackObservedAt *time.Time `json:"route_feedback_observed_at,omitempty"` + RouteFeedbackScoreAdjustment int `json:"route_feedback_score_adjustment,omitempty"` + RouteFeedbackEffectiveScoreAdjustment int `json:"route_feedback_effective_score_adjustment,omitempty"` + RouteFeedbackReasons []string `json:"route_feedback_reasons,omitempty"` + RouteQualityWindowSampleCount int `json:"route_quality_window_sample_count,omitempty"` + RouteQualityWindowFailureCount int `json:"route_quality_window_failure_count,omitempty"` + RouteQualityWindowDropCount int `json:"route_quality_window_drop_count,omitempty"` + RouteQualityWindowSlowCount int `json:"route_quality_window_slow_count,omitempty"` + LastSendDurationMs int64 `json:"last_send_duration_ms,omitempty"` + RemediationAction string `json:"remediation_action,omitempty"` + RemediationReason string `json:"remediation_reason,omitempty"` + RemediationRouteID string `json:"remediation_route_id,omitempty"` + 
RemediationRouteStatus string `json:"remediation_route_status,omitempty"` + RemediationGuardStatus string `json:"remediation_guard_status,omitempty"` + RemediationGuardReason string `json:"remediation_guard_reason,omitempty"` + RemediationExecutionStatus string `json:"remediation_execution_status,omitempty"` + RemediationExecutionReason string `json:"remediation_execution_reason,omitempty"` + RemediationExecutionGeneration string `json:"remediation_execution_generation,omitempty"` + RemediationExecutionObservedAt string `json:"remediation_execution_observed_at,omitempty"` + RouteDecisionSource string `json:"route_decision_source,omitempty"` + RouteDecisionRouteID string `json:"route_decision_route_id,omitempty"` + RouteDecisionReplacementRouteID string `json:"route_decision_replacement_route_id,omitempty"` + RouteDecisionRebuildStatus string `json:"route_decision_rebuild_status,omitempty"` + RouteDecisionRebuildReason string `json:"route_decision_rebuild_reason,omitempty"` + RouteDecisionGeneration string `json:"route_decision_generation,omitempty"` + RouteDecisionScoreReasons []string `json:"route_decision_score_reasons,omitempty"` + PoolPolicyFingerprint string `json:"pool_policy_fingerprint,omitempty"` + DataPlane FabricServiceChannelDataPlaneContract `json:"data_plane,omitempty"` + RemediationCommand *FabricServiceChannelAccessRemediationCommand `json:"remediation_command,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` + ExpiresAt time.Time `json:"expires_at"` +} + +type FabricServiceChannelAccessRemediationCommand struct { + SchemaVersion string `json:"schema_version"` + CommandID string `json:"command_id"` + Action string `json:"action"` + ClusterID string `json:"cluster_id"` + ChannelID string `json:"channel_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + EntryNodeID string `json:"entry_node_id,omitempty"` + ExitNodeID string `json:"exit_node_id,omitempty"` + 
PrimaryRouteID string `json:"primary_route_id,omitempty"` + ReplacementRouteID string `json:"replacement_route_id,omitempty"` + ReplacementRouteStatus string `json:"replacement_route_status,omitempty"` + PoolPolicyFingerprint string `json:"pool_policy_fingerprint,omitempty"` + GuardStatus string `json:"guard_status,omitempty"` + GuardReason string `json:"guard_reason,omitempty"` + ExecutionStatus string `json:"execution_status,omitempty"` + ExecutionReason string `json:"execution_reason,omitempty"` + ExecutionGeneration string `json:"execution_generation,omitempty"` + ExecutionObservedAt string `json:"execution_observed_at,omitempty"` + Reason string `json:"reason,omitempty"` + OperatorAction string `json:"operator_action,omitempty"` + IssuedAt time.Time `json:"issued_at"` + ExpiresAt time.Time `json:"expires_at"` +} + +type FabricServiceChannelLeaseIntrospection struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + ChannelID string `json:"channel_id"` + ResourceID string `json:"resource_id,omitempty"` + ServiceClass string `json:"service_class"` + Allowed bool `json:"allowed"` + Status string `json:"status"` + Reason string `json:"reason,omitempty"` + AcceptedBy string `json:"accepted_by"` + SelectedEntryNodeID string `json:"selected_entry_node_id,omitempty"` + SelectedExitNodeID string `json:"selected_exit_node_id,omitempty"` + AllowedChannels []string `json:"allowed_channels,omitempty"` + PreferredRouteID string `json:"preferred_route_id,omitempty"` + ForceBackendFallback bool `json:"force_backend_fallback"` + LeaseStatus string `json:"lease_status,omitempty"` + PrimaryRoute FabricServiceChannelRoute `json:"primary_route,omitempty"` + DataPlane FabricServiceChannelDataPlaneContract `json:"data_plane,omitempty"` + RouteGeneration string `json:"route_generation,omitempty"` + FencingEpoch int64 `json:"fencing_epoch,omitempty"` + ExpiresAt time.Time `json:"expires_at,omitempty"` +} + +type 
FabricServiceChannelRouteFeedbackObservation struct { + ID string `json:"id,omitempty"` + ClusterID string `json:"cluster_id"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + ServiceClass string `json:"service_class"` + FeedbackStatus string `json:"feedback_status"` + ScoreAdjustment int `json:"score_adjustment"` + EffectiveScoreAdjustment int `json:"effective_score_adjustment,omitempty"` + Reasons []string `json:"reasons,omitempty"` + LastError string `json:"last_error,omitempty"` + ConsecutiveFailures int `json:"consecutive_failures,omitempty"` + StallCount int `json:"stall_count,omitempty"` + LastSendDurationMs int64 `json:"last_send_duration_ms,omitempty"` + RecoveryState string `json:"recovery_state,omitempty"` + RecoveryHysteresisActive bool `json:"recovery_hysteresis_active,omitempty"` + RecoveryHysteresisPenalty int `json:"recovery_hysteresis_penalty,omitempty"` + RecoveryPromoted bool `json:"recovery_promoted,omitempty"` + RecoveryDemoted bool `json:"recovery_demoted,omitempty"` + RecoveryReason string `json:"recovery_reason,omitempty"` + ObservedPolicyFingerprint string `json:"observed_policy_fingerprint,omitempty"` + EffectivePolicyFingerprint string `json:"effective_policy_fingerprint,omitempty"` + ObservedRouteGeneration string `json:"observed_route_generation,omitempty"` + EffectiveRouteGeneration string `json:"effective_route_generation,omitempty"` + ProvenanceMissing bool `json:"provenance_missing,omitempty"` + StalePolicy bool `json:"stale_policy,omitempty"` + StaleGeneration bool `json:"stale_generation,omitempty"` + StaleReason string `json:"stale_reason,omitempty"` + Payload json.RawMessage `json:"payload"` + ObservedAt time.Time `json:"observed_at"` + ExpiresAt time.Time `json:"expires_at"` + RetryCooldownUntil *time.Time `json:"retry_cooldown_until,omitempty"` +} + +type FabricServiceChannelRouteRebuildAttempt struct { + ID string `json:"id"` + ClusterID string `json:"cluster_id"` + ReporterNodeID string 
`json:"reporter_node_id"` + ServiceClass string `json:"service_class"` + RouteID string `json:"route_id"` + ReplacementRouteID string `json:"replacement_route_id,omitempty"` + RebuildRequestID string `json:"rebuild_request_id"` + RebuildStatus string `json:"rebuild_status"` + RebuildReason string `json:"rebuild_reason,omitempty"` + RebuildAttempt int `json:"rebuild_attempt,omitempty"` + DecisionSource string `json:"decision_source"` + Outcome string `json:"outcome"` + Generation string `json:"generation,omitempty"` + PolicyFingerprint string `json:"policy_fingerprint,omitempty"` + ObservedPolicyFingerprint string `json:"observed_policy_fingerprint,omitempty"` + ObservedRouteGeneration string `json:"observed_route_generation,omitempty"` + EffectiveRouteGeneration string `json:"effective_route_generation,omitempty"` + FeedbackStatus string `json:"feedback_status,omitempty"` + FeedbackObservationID string `json:"feedback_observation_id,omitempty"` + FeedbackSource string `json:"feedback_source,omitempty"` + FeedbackObservedAt *time.Time `json:"feedback_observed_at,omitempty"` + FeedbackExpiresAt *time.Time `json:"feedback_expires_at,omitempty"` + FeedbackChannelID string `json:"feedback_channel_id,omitempty"` + FeedbackResourceID string `json:"feedback_resource_id,omitempty"` + FeedbackViolationStatus string `json:"feedback_violation_status,omitempty"` + FeedbackViolationReason string `json:"feedback_violation_reason,omitempty"` + FeedbackScoreAdjustment int `json:"feedback_score_adjustment,omitempty"` + FeedbackEffectiveScoreAdjustment int `json:"feedback_effective_score_adjustment,omitempty"` + FeedbackReasons []string `json:"feedback_reasons,omitempty"` + LastError string `json:"last_error,omitempty"` + ConsecutiveFailures int `json:"consecutive_failures,omitempty"` + StallCount int `json:"stall_count,omitempty"` + LastSendDurationMs int64 `json:"last_send_duration_ms,omitempty"` + QualityWindowSampleCount int `json:"quality_window_sample_count,omitempty"` + 
QualityWindowFailureCount int `json:"quality_window_failure_count,omitempty"` + QualityWindowDropCount int `json:"quality_window_drop_count,omitempty"` + QualityWindowSlowCount int `json:"quality_window_slow_count,omitempty"` + OldHops []string `json:"old_hops,omitempty"` + ReplacementHops []string `json:"replacement_hops,omitempty"` + NodeTransitionStatus string `json:"node_transition_status,omitempty"` + NodeTransitionGeneration string `json:"node_transition_generation,omitempty"` + NodeTransitionObservedAt string `json:"node_transition_observed_at,omitempty"` + NodeTransitionMatched bool `json:"node_transition_matched,omitempty"` + NodeRouteGenerationStatus string `json:"node_route_generation_status,omitempty"` + NodeRouteGenerationAppliedAt string `json:"node_route_generation_applied_at,omitempty"` + NodeRouteGenerationWithdrawnAt string `json:"node_route_generation_withdrawn_at,omitempty"` + NodeRouteGenerationMatched bool `json:"node_route_generation_matched,omitempty"` + PostRebuildSelectedRouteID string `json:"post_rebuild_selected_route_id,omitempty"` + PostRebuildSendPackets uint64 `json:"post_rebuild_send_packets,omitempty"` + PostRebuildSendFailures uint64 `json:"post_rebuild_send_failures,omitempty"` + PostRebuildSendFlowPackets uint64 `json:"post_rebuild_send_flow_packets,omitempty"` + PostRebuildSendFlowDropped uint64 `json:"post_rebuild_send_flow_dropped,omitempty"` + GuardStatus string `json:"guard_status,omitempty"` + GuardSeverity string `json:"guard_severity,omitempty"` + GuardReason string `json:"guard_reason,omitempty"` + GuardAgeSeconds int64 `json:"guard_age_seconds,omitempty"` + GuardTransitionDeadlineSeconds int64 `json:"guard_transition_deadline_seconds,omitempty"` + GuardTrafficDeadlineSeconds int64 `json:"guard_traffic_deadline_seconds,omitempty"` + AlertSilenced bool `json:"alert_silenced,omitempty"` + AlertSilenceID string `json:"alert_silence_id,omitempty"` + AlertSilenceReason string `json:"alert_silence_reason,omitempty"` + 
AlertSilencedUntil *time.Time `json:"alert_silenced_until,omitempty"` + AlertResurfaced bool `json:"alert_resurfaced,omitempty"` + AlertResurfacedFromSilenceID string `json:"alert_resurfaced_from_silence_id,omitempty"` + AlertResurfacedCause string `json:"alert_resurfaced_cause,omitempty"` + AlertResurfacedPreviousRouteID string `json:"alert_resurfaced_previous_route_id,omitempty"` + AlertResurfacedPreviousChannelID string `json:"alert_resurfaced_previous_channel_id,omitempty"` + AlertResurfacedPreviousGeneration string `json:"alert_resurfaced_previous_generation,omitempty"` + AlertResurfacedPreviousUntil *time.Time `json:"alert_resurfaced_previous_until,omitempty"` + Timeline []FabricServiceChannelRouteRebuildTimelineEvent `json:"timeline,omitempty"` + CorrelationSnapshotAt *time.Time `json:"correlation_snapshot_at,omitempty"` + Payload json.RawMessage `json:"payload"` + CreatedAt time.Time `json:"created_at"` + UpdatedAt time.Time `json:"updated_at"` +} + +type FabricServiceChannelRouteRebuildHealthSummary struct { + ClusterID string `json:"cluster_id"` + ObservedAt time.Time `json:"observed_at"` + WindowLimit int `json:"window_limit"` + TotalAttempts int `json:"total_attempts"` + GoodCount int `json:"good_count"` + WarnCount int `json:"warn_count"` + BadCount int `json:"bad_count"` + UnknownCount int `json:"unknown_count"` + ActiveBadCount int `json:"active_bad_count"` + ActiveWarnCount int `json:"active_warn_count"` + SilencedCount int `json:"silenced_count"` + ResurfacedCount int `json:"resurfaced_count"` + AppliedCount int `json:"applied_count"` + PendingCount int `json:"pending_count"` + AccessRouteDecisionCount int `json:"access_route_decision_count,omitempty"` + AccessReplacementCount int `json:"access_replacement_count,omitempty"` + AccessAppliedCount int `json:"access_applied_count,omitempty"` + AccessRecoveryCount int `json:"access_recovery_count,omitempty"` + AccessNoSafeCount int `json:"access_no_safe_count,omitempty"` + CountsByGuardStatus 
map[string]int `json:"counts_by_guard_status,omitempty"` + CountsByGuardSeverity map[string]int `json:"counts_by_guard_severity,omitempty"` + FeedbackBreakdowns []FabricServiceChannelRouteRebuildFeedbackHealthBreakdown `json:"feedback_breakdowns,omitempty"` + AffectedReporterNodeIDs []string `json:"affected_reporter_node_ids,omitempty"` + AffectedRouteIDs []string `json:"affected_route_ids,omitempty"` + MostRecentBadAttempts []FabricServiceChannelRouteRebuildAttempt `json:"most_recent_bad_attempts,omitempty"` + ResurfacedAttempts []FabricServiceChannelRouteRebuildAttempt `json:"resurfaced_attempts,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelRouteRebuildFeedbackHealthBreakdown struct { + FeedbackSource string `json:"feedback_source,omitempty"` + FeedbackChannelID string `json:"feedback_channel_id,omitempty"` + FeedbackViolationStatus string `json:"feedback_violation_status,omitempty"` + TotalCount int `json:"total_count"` + GoodCount int `json:"good_count,omitempty"` + WarnCount int `json:"warn_count,omitempty"` + BadCount int `json:"bad_count,omitempty"` + UnknownCount int `json:"unknown_count,omitempty"` + ActiveWarnCount int `json:"active_warn_count,omitempty"` + ActiveBadCount int `json:"active_bad_count,omitempty"` + SilencedCount int `json:"silenced_count,omitempty"` + LatestObservedAt time.Time `json:"latest_observed_at,omitempty"` + AffectedReporterNodeIDs []string `json:"affected_reporter_node_ids,omitempty"` + AffectedRouteIDs []string `json:"affected_route_ids,omitempty"` +} + +type FabricServiceChannelReadiness struct { + ClusterID string `json:"cluster_id"` + ObservedAt time.Time `json:"observed_at"` + Status string `json:"status"` + Reason string `json:"reason"` + ActiveAlertCount int `json:"active_alert_count"` + ActiveBadCount int `json:"active_bad_count"` + ActiveWarnCount int `json:"active_warn_count"` + ResurfacedCount int `json:"resurfaced_count"` + SilencedCount int 
`json:"silenced_count"` + MissingTransitionCount int `json:"missing_transition_count"` + MissingRouteGenerationCount int `json:"missing_route_generation_count"` + MissingPostTrafficCount int `json:"missing_post_rebuild_traffic_count"` + UnexpectedRouteCount int `json:"unexpected_route_count"` + PostRebuildDegradedCount int `json:"post_rebuild_degraded_count"` + BlockingReasons []string `json:"blocking_reasons,omitempty"` + DegradedReasons []string `json:"degraded_reasons,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelSchemaStatus struct { + ClusterID string `json:"cluster_id"` + ObservedAt time.Time `json:"observed_at"` + Status string `json:"status"` + Reason string `json:"reason"` + RequiredMigration string `json:"required_migration"` + RequiredCheckCount int `json:"required_check_count"` + PassedCheckCount int `json:"passed_check_count"` + MissingCheckCount int `json:"missing_check_count"` + RequiredChecks []FabricServiceChannelSchemaCheck `json:"required_checks"` + MissingChecks []FabricServiceChannelSchemaCheck `json:"missing_checks,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelSchemaCheck struct { + CheckID string `json:"check_id"` + RelationName string `json:"relation_name"` + ColumnName string `json:"column_name,omitempty"` + Status string `json:"status"` + RequiredBy string `json:"required_by"` +} + +type FabricServiceChannelRebuildSnapshotWarmup struct { + ClusterID string `json:"cluster_id"` + ObservedAt time.Time `json:"observed_at"` + WindowLimit int `json:"window_limit"` + StaleAfterSeconds int64 `json:"stale_after_seconds"` + ScannedCount int `json:"scanned_count"` + WarmedCount int `json:"warmed_count"` + AlreadyFreshCount int `json:"already_fresh_count"` + MissingSnapshotCount int `json:"missing_snapshot_count"` + StaleSnapshotCount int `json:"stale_snapshot_count"` + DeferredStaleCount int 
`json:"deferred_stale_count"` + ErrorCount int `json:"error_count"` + Status string `json:"status"` + Reason string `json:"reason"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelRebuildSnapshotMaintenanceHealth struct { + ClusterID string `json:"cluster_id"` + ObservedAt time.Time `json:"observed_at"` + Status string `json:"status"` + Reason string `json:"reason"` + WindowLimit int `json:"window_limit"` + MinAgeSeconds int64 `json:"min_age_seconds"` + HeartbeatThreshold int `json:"heartbeat_threshold"` + RecentAttemptCount int `json:"recent_attempt_count"` + ValidSnapshotCount int `json:"valid_snapshot_count"` + MissingSnapshotCount int `json:"missing_snapshot_count"` + OverdueMissingSnapshotCount int `json:"overdue_missing_snapshot_count"` + AutoWarmupEventCount int `json:"auto_warmup_event_count"` + AutoWarmupWarmedCount int `json:"auto_warmup_warmed_count"` + AutoWarmupAlreadyFreshCount int `json:"auto_warmup_already_fresh_count"` + AutoWarmupErrorCount int `json:"auto_warmup_error_count"` + LatestAutoWarmupAt *time.Time `json:"latest_auto_warmup_at,omitempty"` + Nodes []FabricServiceChannelRebuildSnapshotNodeHealth `json:"nodes,omitempty"` + OverdueMissingSnapshotAttempts []FabricServiceChannelRouteRebuildAttempt `json:"overdue_missing_snapshot_attempts,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelRebuildSnapshotNodeHealth struct { + NodeID string `json:"node_id"` + RecentAttemptCount int `json:"recent_attempt_count"` + ValidSnapshotCount int `json:"valid_snapshot_count"` + MissingSnapshotCount int `json:"missing_snapshot_count"` + OverdueMissingSnapshotCount int `json:"overdue_missing_snapshot_count"` + HeartbeatAfterAttemptCount int `json:"heartbeat_after_attempt_count"` + LastHeartbeatAt *time.Time `json:"last_heartbeat_at,omitempty"` + AutoWarmupEventCount int `json:"auto_warmup_event_count"` + 
AutoWarmupWarmedCount int `json:"auto_warmup_warmed_count"` + AutoWarmupErrorCount int `json:"auto_warmup_error_count"` + LatestAutoWarmupAt *time.Time `json:"latest_auto_warmup_at,omitempty"` +} + +type FabricServiceChannelRouteRebuildIncident struct { + Fingerprint string `json:"fingerprint"` + ClusterID string `json:"cluster_id"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + ServiceClass string `json:"service_class"` + Generation string `json:"generation,omitempty"` + IncidentSource string `json:"incident_source,omitempty"` + ChannelID string `json:"channel_id,omitempty"` + GuardStatus string `json:"guard_status"` + GuardSeverity string `json:"guard_severity"` + GuardReason string `json:"guard_reason,omitempty"` + AttemptCount int `json:"attempt_count"` + FirstSeenAt time.Time `json:"first_seen_at"` + LastSeenAt time.Time `json:"last_seen_at"` + LatestReplacementRouteID string `json:"latest_replacement_route_id,omitempty"` + LatestRebuildStatus string `json:"latest_rebuild_status,omitempty"` + LatestOutcome string `json:"latest_outcome,omitempty"` + AlertSilenced bool `json:"alert_silenced,omitempty"` + AlertResurfaced bool `json:"alert_resurfaced,omitempty"` + AlertResurfacedFromSilenceID string `json:"alert_resurfaced_from_silence_id,omitempty"` + AlertResurfacedCause string `json:"alert_resurfaced_cause,omitempty"` + AlertResurfacedPreviousRouteID string `json:"alert_resurfaced_previous_route_id,omitempty"` + AlertResurfacedPreviousChannelID string `json:"alert_resurfaced_previous_channel_id,omitempty"` + AlertResurfacedPreviousGeneration string `json:"alert_resurfaced_previous_generation,omitempty"` + AlertResurfacedPreviousUntil *time.Time `json:"alert_resurfaced_previous_until,omitempty"` + RecommendedOperatorAction string `json:"recommended_operator_action,omitempty"` +} + +type FabricServiceChannelRouteRebuildAlertSilence struct { + ID string `json:"id"` + ClusterID string `json:"cluster_id"` + IncidentSource 
string `json:"incident_source,omitempty"` + ChannelID string `json:"channel_id,omitempty"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + DisplayRouteID string `json:"display_route_id,omitempty"` + GuardStatus string `json:"guard_status"` + Generation string `json:"generation,omitempty"` + Reason string `json:"reason,omitempty"` + CreatedByUserID *string `json:"created_by_user_id,omitempty"` + CreatedAt time.Time `json:"created_at"` + ExpiresAt time.Time `json:"expires_at"` + Payload json.RawMessage `json:"payload"` +} + +type FabricServiceChannelRouteRebuildTimelineEvent struct { + Stage string `json:"stage"` + Status string `json:"status"` + At string `json:"at,omitempty"` + RouteID string `json:"route_id,omitempty"` + Generation string `json:"generation,omitempty"` + Payload json.RawMessage `json:"payload,omitempty"` +} + type ClusterAuthorityState struct { ClusterID string `json:"cluster_id"` AuthorityState string `json:"authority_state"` @@ -494,14 +1566,66 @@ type ClusterAdminSummary struct { } type ClusterAuditEvent struct { - ID string `json:"id"` - ClusterID *string `json:"cluster_id,omitempty"` - ActorUserID *string `json:"actor_user_id,omitempty"` - EventType string `json:"event_type"` - TargetType string `json:"target_type"` - TargetID *string `json:"target_id,omitempty"` - Payload json.RawMessage `json:"payload"` - CreatedAt time.Time `json:"created_at"` + ID string `json:"id"` + ClusterID *string `json:"cluster_id,omitempty"` + ActorUserID *string `json:"actor_user_id,omitempty"` + EventType string `json:"event_type"` + TargetType string `json:"target_type"` + TargetID *string `json:"target_id,omitempty"` + Payload json.RawMessage `json:"payload"` + CorrelationHints *ClusterAuditCorrelationHints `json:"correlation_hints,omitempty"` + CreatedAt time.Time `json:"created_at"` +} + +type ClusterAuditCorrelationHints struct { + Scope string `json:"scope,omitempty"` + CurrentDiagnosticStatus string 
`json:"current_diagnostic_status,omitempty"` + BreadcrumbStatus string `json:"breadcrumb_status,omitempty"` + BreadcrumbAgeSeconds int64 `json:"breadcrumb_age_seconds,omitempty"` + BreadcrumbCurrentWindow int64 `json:"breadcrumb_current_window_seconds,omitempty"` + BreadcrumbHistoryWindow int64 `json:"breadcrumb_history_window_seconds,omitempty"` + FeedbackBreakdown *FabricServiceChannelRouteRebuildFeedbackHealthBreakdown `json:"feedback_breakdown,omitempty"` + RebuildIncident *FabricServiceChannelRouteRebuildIncident `json:"rebuild_incident,omitempty"` + RecommendedAction string `json:"recommended_action,omitempty"` +} + +type ClusterAuditSummary struct { + TotalCount int `json:"total_count"` + CountsByEventType map[string]int `json:"counts_by_event_type,omitempty"` + CountsByTargetType map[string]int `json:"counts_by_target_type,omitempty"` + CountsByCurrentDiagnosticStatus map[string]int `json:"counts_by_current_diagnostic_status,omitempty"` + CountsByFeedbackSource map[string]int `json:"counts_by_feedback_source,omitempty"` + CountsByFeedbackViolationStatus map[string]int `json:"counts_by_feedback_violation_status,omitempty"` + CountsByBreadcrumbStatus map[string]int `json:"counts_by_breadcrumb_status,omitempty"` + CorrelatedCount int `json:"correlated_count,omitempty"` + NotVisibleCount int `json:"not_visible_count,omitempty"` + LatestAt time.Time `json:"latest_at,omitempty"` +} + +type ListAuditEventsInput struct { + ClusterID string + EventTypes []string + TargetTypes []string + Correlation string + Limit int +} + +type FabricServiceChannelRebuildInvestigationBreadcrumbs struct { + ClusterID string `json:"cluster_id"` + Events []ClusterAuditEvent `json:"events"` + Summary ClusterAuditSummary `json:"summary"` + CurrentWindowSeconds int64 `json:"current_window_seconds"` + HistoryWindowSeconds int64 `json:"history_window_seconds"` + CurrentCount int `json:"current_count"` + StaleCount int `json:"stale_count"` + ExpiredCount int `json:"expired_count"` +} + +type 
ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput struct { + ClusterID string + Limit int + CurrentWindowSeconds int64 + HistoryWindowSeconds int64 } type FabricTestingFlag struct { @@ -660,6 +1784,35 @@ type NodeVPNAssignmentStatus struct { ObservedAt time.Time `json:"observed_at"` } +type VPNClientProfile struct { + SchemaVersion string `json:"schema_version"` + ClusterID string `json:"cluster_id"` + OrganizationID string `json:"organization_id"` + UserID string `json:"user_id"` + Connections []VPNClientConnection `json:"connections"` + GeneratedAt time.Time `json:"generated_at"` +} + +type VPNClientConnection struct { + ID string `json:"id"` + Name string `json:"name"` + ProtocolFamily string `json:"protocol_family"` + Mode string `json:"mode"` + DesiredState string `json:"desired_state"` + Status string `json:"status"` + TargetEndpoint json.RawMessage `json:"target_endpoint"` + RoutingUsage json.RawMessage `json:"routing_usage"` + RoutePolicy json.RawMessage `json:"route_policy"` + QoSPolicy json.RawMessage `json:"qos_policy"` + PlacementPolicy json.RawMessage `json:"placement_policy"` + AllowedNodeIDs []string `json:"allowed_node_ids"` + EntryNodeIDs []string `json:"entry_node_ids"` + ExitNodeID string `json:"exit_node_id,omitempty"` + ActiveLease *NodeVPNAssignmentLease `json:"active_lease,omitempty"` + RoutePolicies json.RawMessage `json:"route_policies"` + ClientConfig json.RawMessage `json:"client_config"` +} + type CreateClusterInput struct { ActorUserID string Slug string @@ -710,6 +1863,7 @@ type ApproveJoinRequestInput struct { NodeKey string OwnershipType string OwnerOrganizationID *string + NodeGroupID *string } type ApprovedJoinRequest struct { @@ -784,6 +1938,13 @@ type DisableMembershipInput struct { Reason string } +type DeleteClusterNodeInput struct { + ActorUserID string + ClusterID string + NodeID string + Reason string +} + type RecordHeartbeatInput struct { ClusterID string NodeID string @@ -794,6 +1955,70 @@ type 
RecordHeartbeatInput struct { Metadata json.RawMessage } +type CreateReleaseVersionInput struct { + ActorUserID string + ClusterID string + Product string + Version string + Channel string + Status string + Compatibility json.RawMessage + Changelog *string + Artifacts []ReleaseArtifactInput +} + +type ReleaseArtifactInput struct { + OS string `json:"os"` + Arch string `json:"arch"` + InstallType string `json:"install_type"` + Kind string `json:"kind"` + URL string `json:"url"` + SHA256 string `json:"sha256"` + SizeBytes int64 `json:"size_bytes"` + Signature *string `json:"signature"` + Metadata json.RawMessage `json:"metadata"` +} + +type UpsertNodeUpdatePolicyInput struct { + ActorUserID string + ClusterID string + NodeID string + Product string + Channel string + TargetVersion *string + Strategy string + Enabled bool + RollbackAllowed bool + HealthWindowSec int +} + +type GetNodeUpdatePlanInput struct { + ClusterID string + NodeID string + Product string + CurrentVersion string + OS string + Arch string + InstallType string + Channel string + ArtifactOrigin string +} + +type ReportNodeUpdateStatusInput struct { + ClusterID string + NodeID string + Product string + CurrentVersion string + TargetVersion string + Phase string + Status string + AttemptID string + ErrorMessage *string + RollbackVersion *string + Payload json.RawMessage + ObservedAt time.Time +} + type UpsertFabricTestingFlagInput struct { ActorUserID string ScopeType string @@ -869,6 +2094,13 @@ type CreateRouteIntentInput struct { Policy json.RawMessage } +type RouteIntentLifecycleInput struct { + ActorUserID string + ClusterID string + RouteIntentID string + Reason string +} + type CreateFabricEntryPointInput struct { ActorUserID string ClusterID string @@ -911,6 +2143,312 @@ type SetFabricEgressPoolNodeInput struct { Metadata json.RawMessage } +type IssueFabricServiceChannelLeaseInput struct { + ActorUserID string + ClusterID string + OrganizationID string + UserID string + ResourceID string + 
ServiceClass string + EntryNodeIDs []string + ExitNodeIDs []string + PreferredEntryNodeID string + PreferredExitNodeID string + RequiredRoles []string + AllowedChannels []string + QoS json.RawMessage + Failover json.RawMessage + Metadata json.RawMessage + TTL time.Duration +} + +type UpdateFabricServiceChannelPoolPolicyInput struct { + ActorUserID string + ClusterID string + EntryPoolNodeIDs []string + ExitPoolNodeIDs []string + PreferredEntryNodeID string + PreferredExitNodeID string + SelectionStrategy string + RouteRebuild string + EntryFailover string + ExitFailover string + BackendFallbackAllowed *bool + StickySession *bool +} + +type UpdateFabricServiceChannelBreadcrumbWindowPolicyInput struct { + ActorUserID string + ClusterID string + CurrentWindowSeconds int64 + HistoryWindowSeconds int64 +} + +type IntrospectFabricServiceChannelLeaseInput struct { + ClusterID string + ChannelID string + ResourceID string + ServiceClass string + ChannelClass string + Token string + EntryNodeID string + RequestSourceIP string +} + +type StoreFabricServiceChannelLeaseInput struct { + Lease FabricServiceChannelLease + TokenHash string +} + +type ListFabricServiceChannelLeasesInput struct { + ClusterID string + ServiceClass string + EntryNodeID string + ResourceID string + IncludeExpired bool + Limit int + Now time.Time +} + +type CleanupFabricServiceChannelLeasesInput struct { + ActorUserID string + ClusterID string + Limit int + Now time.Time +} + +type GetFabricServiceChannelAccessTelemetryInput struct { + ClusterID string + Limit int + Now time.Time +} + +type RecordFabricServiceChannelRouteFeedbackInput struct { + ClusterID string + ReporterNodeID string + RouteID string + ServiceClass string + FeedbackStatus string + ScoreAdjustment int + Reasons []string + LastError string + ConsecutiveFailures int + StallCount int + LastSendDurationMs int64 + Payload json.RawMessage + ObservedAt time.Time + ExpiresAt time.Time +} + +type ListFabricServiceChannelRouteFeedbackInput 
struct { + ClusterID string + ReporterNodeID string + RouteID string + ServiceClass string + FeedbackStatus string + Now time.Time + IncludeExpired bool +} + +type RecordFabricServiceChannelRouteRebuildAttemptInput struct { + ClusterID string + ReporterNodeID string + ServiceClass string + RouteID string + ReplacementRouteID string + RebuildRequestID string + RebuildStatus string + RebuildReason string + RebuildAttempt int + DecisionSource string + Outcome string + Generation string + PolicyFingerprint string + ObservedPolicyFingerprint string + ObservedRouteGeneration string + EffectiveRouteGeneration string + FeedbackStatus string + FeedbackObservationID string + FeedbackSource string + FeedbackObservedAt *time.Time + FeedbackExpiresAt *time.Time + FeedbackChannelID string + FeedbackResourceID string + FeedbackViolationStatus string + FeedbackViolationReason string + FeedbackScoreAdjustment int + FeedbackEffectiveScoreAdjustment int + FeedbackReasons []string + LastError string + ConsecutiveFailures int + StallCount int + LastSendDurationMs int64 + QualityWindowSampleCount int + QualityWindowFailureCount int + QualityWindowDropCount int + QualityWindowSlowCount int + OldHops []string + ReplacementHops []string + Payload json.RawMessage +} + +type ListFabricServiceChannelRouteRebuildAttemptsInput struct { + ClusterID string + ReporterNodeID string + RouteID string + ReplacementRouteID string + ServiceClass string + RebuildStatus string + RebuildRequestID string + Generation string + FeedbackSource string + FeedbackChannelID string + FeedbackViolationStatus string + EnrichmentMode string + UseCachedSnapshot bool + Limit int + Offset int +} + +type UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput struct { + ID string + NodeTransitionStatus string + NodeTransitionGeneration string + NodeTransitionObservedAt string + NodeTransitionMatched bool + NodeRouteGenerationStatus string + NodeRouteGenerationAppliedAt string + NodeRouteGenerationWithdrawnAt string 
+ NodeRouteGenerationMatched bool + PostRebuildSelectedRouteID string + PostRebuildSendPackets uint64 + PostRebuildSendFailures uint64 + PostRebuildSendFlowPackets uint64 + PostRebuildSendFlowDropped uint64 + GuardStatus string + GuardSeverity string + GuardReason string + GuardTransitionDeadlineSeconds int64 + GuardTrafficDeadlineSeconds int64 + Timeline []FabricServiceChannelRouteRebuildTimelineEvent + CorrelationSnapshotAt time.Time +} + +type GetFabricServiceChannelRouteRebuildHealthSummaryInput struct { + ClusterID string + Limit int +} + +type GetFabricServiceChannelReadinessInput struct { + ClusterID string + Limit int +} + +type GetFabricServiceChannelSchemaStatusInput struct { + ClusterID string +} + +type GetFabricServiceChannelRebuildSnapshotMaintenanceHealthInput struct { + ClusterID string + Limit int + MinAgeSeconds int64 + HeartbeatThreshold int +} + +type WarmupFabricServiceChannelRebuildSnapshotsInput struct { + ActorUserID string + ClusterID string + Limit int + StaleAfterSeconds int64 + Now time.Time +} + +type ListFabricServiceChannelRouteRebuildIncidentsInput struct { + ClusterID string + Limit int +} + +type RecordFabricServiceChannelRouteRebuildInvestigationInput struct { + ActorUserID string + ClusterID string + ReporterNodeID string + RouteID string + ServiceClass string + Generation string + GuardStatus string + IncidentID string + FeedbackSource string + FeedbackChannelID string + FeedbackViolationStatus string + DrilldownSource string + Reason string + Now time.Time +} + +type SilenceFabricServiceChannelRouteRebuildAlertInput struct { + ActorUserID string + ClusterID string + IncidentSource string + ChannelID string + ReporterNodeID string + RouteID string + GuardStatus string + Generation string + Reason string + TTL time.Duration + Now time.Time +} + +type UnsilenceFabricServiceChannelRouteRebuildAlertInput struct { + ActorUserID string + ClusterID string + SilenceID string + Reason string + Now time.Time +} + +type 
ExpireFabricServiceChannelRouteFeedbackInput struct { + ActorUserID string + ClusterID string + ReporterNodeID string + RouteID string + ServiceClass string + Reason string + Now time.Time +} + +type UpdateFabricServiceChannelRecoveryPolicyInput struct { + ActorUserID string + ClusterID string + HysteresisPenalty int + PromotionMinSamples int + DemotionFailureThreshold int + DemotionDropThreshold int + DemotionSlowThreshold int + DemotionRebuildEnabled *bool + DemotionFencedEnabled *bool +} + +type UpdateFabricServiceChannelAdaptivePolicyInput struct { + ActorUserID string + ClusterID string + MaxParallelWindow int + BulkPressureChannelThreshold int + QueuePressureHighWatermark int + QueuePressureMaxInFlight int + ClassWindows map[string]int +} + +type ExpireFabricServiceChannelRouteFeedbackResult struct { + ClusterID string `json:"cluster_id"` + ReporterNodeID string `json:"reporter_node_id,omitempty"` + RouteID string `json:"route_id"` + ServiceClass string `json:"service_class,omitempty"` + ExpiredCount int `json:"expired_count"` + ExpiredAt time.Time `json:"expired_at"` + CooldownUntil time.Time `json:"cooldown_until"` +} + type UpdateClusterAuthorityInput struct { ActorUserID string ClusterID string @@ -985,6 +2523,14 @@ type RenewVPNConnectionLeaseInput struct { TTL time.Duration } +type RenewNodeVPNAssignmentLeaseInput struct { + ClusterID string + VPNConnectionID string + LeaseID string + OwnerNodeID string + TTL time.Duration +} + type ReleaseVPNConnectionLeaseInput struct { ActorUserID string ClusterID string diff --git a/backend/internal/modules/cluster/module.go b/backend/internal/modules/cluster/module.go index 619e3db..c511730 100644 --- a/backend/internal/modules/cluster/module.go +++ b/backend/internal/modules/cluster/module.go @@ -1,10 +1,21 @@ package cluster import ( + "context" + "crypto/sha256" + "encoding/binary" + "encoding/hex" "encoding/json" "errors" + "fmt" + "io" "net/http" + "os" + "reflect" + "sort" "strconv" + "strings" + "sync" 
"time" "github.com/go-chi/chi/v5" @@ -17,7 +28,9 @@ import ( ) type Module struct { - service *Service + service *Service + vpnPacketHub *vpnPacketHub + vpnClientDiagnosticHub *vpnClientDiagnosticHub } func NewModule(deps module.Dependencies, verifiers ...*authority.Verifier) *Module { @@ -27,7 +40,11 @@ func NewModule(deps module.Dependencies, verifiers ...*authority.Verifier) *Modu store.WithClusterKeyEncryptor(encryptor) } } - return &Module{service: NewService(store)} + return &Module{ + service: NewService(store), + vpnPacketHub: newVPNPacketHub(), + vpnClientDiagnosticHub: newVPNClientDiagnosticHub(), + } } func (m *Module) Name() string { @@ -47,12 +64,19 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Post("/{clusterID}/join-requests", m.createJoinRequest) r.Post("/{clusterID}/join-requests/{requestID}/approve", m.approveJoinRequest) r.Post("/{clusterID}/join-requests/{requestID}/reject", m.rejectJoinRequest) + r.Get("/{clusterID}/join-tokens", m.listJoinTokens) r.Post("/{clusterID}/join-tokens", m.createJoinToken) r.Post("/{clusterID}/join-tokens/{tokenID}/revoke", m.revokeJoinToken) r.Get("/{clusterID}/nodes/{nodeID}/roles", m.listNodeRoles) r.Post("/{clusterID}/nodes/{nodeID}/roles", m.assignNodeRole) r.Post("/{clusterID}/nodes/{nodeID}/heartbeats", m.recordHeartbeat) r.Get("/{clusterID}/nodes/{nodeID}/heartbeats", m.listNodeHeartbeats) + r.Get("/{clusterID}/updates/releases", m.listReleaseVersions) + r.Post("/{clusterID}/updates/releases", m.createReleaseVersion) + r.Put("/{clusterID}/nodes/{nodeID}/updates/policy", m.upsertNodeUpdatePolicy) + r.Get("/{clusterID}/nodes/{nodeID}/updates/plan", m.getNodeUpdatePlan) + r.Post("/{clusterID}/nodes/{nodeID}/updates/status", m.reportNodeUpdateStatus) + r.Get("/{clusterID}/nodes/{nodeID}/updates/statuses", m.listNodeUpdateStatuses) r.Get("/{clusterID}/nodes/{nodeID}/testing-flags", m.getEffectiveNodeTestingFlags) r.Get("/{clusterID}/nodes/{nodeID}/mesh/synthetic-config", m.getNodeSyntheticMeshConfig) 
r.Post("/{clusterID}/nodes/{nodeID}/telemetry", m.recordNodeTelemetry) @@ -61,6 +85,7 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Put("/{clusterID}/nodes/{nodeID}/group", m.assignNodeGroup) r.Post("/{clusterID}/nodes/{nodeID}/identity/revoke", m.revokeNodeIdentity) r.Post("/{clusterID}/nodes/{nodeID}/membership/disable", m.disableMembership) + r.Delete("/{clusterID}/nodes/{nodeID}", m.deleteClusterNode) r.Get("/{clusterID}/nodes/{nodeID}/workloads/desired", m.listDesiredWorkloads) r.Put("/{clusterID}/nodes/{nodeID}/workloads/{serviceType}/desired", m.setDesiredWorkload) r.Post("/{clusterID}/nodes/{nodeID}/workloads/{serviceType}/status", m.reportWorkloadStatus) @@ -69,6 +94,8 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Post("/{clusterID}/mesh/links", m.reportMeshLink) r.Get("/{clusterID}/mesh/route-intents", m.listRouteIntents) r.Post("/{clusterID}/mesh/route-intents", m.createRouteIntent) + r.Post("/{clusterID}/mesh/route-intents/{routeIntentID}/expire", m.expireRouteIntent) + r.Post("/{clusterID}/mesh/route-intents/{routeIntentID}/disable", m.disableRouteIntent) r.Get("/{clusterID}/mesh/qos-policies", m.listQoSPolicies) r.Get("/{clusterID}/fabric/entry-points", m.listFabricEntryPoints) r.Post("/{clusterID}/fabric/entry-points", m.createFabricEntryPoint) @@ -78,6 +105,33 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Post("/{clusterID}/fabric/egress-pools", m.createFabricEgressPool) r.Get("/{clusterID}/fabric/egress-pools/{egressPoolID}/nodes", m.listFabricEgressPoolNodes) r.Put("/{clusterID}/fabric/egress-pools/{egressPoolID}/nodes/{nodeID}", m.setFabricEgressPoolNode) + r.Get("/{clusterID}/fabric/service-channels/route-feedback", m.listFabricServiceChannelRouteFeedback) + r.Post("/{clusterID}/fabric/service-channels/route-feedback/expire", m.expireFabricServiceChannelRouteFeedback) + r.Get("/{clusterID}/fabric/service-channels/rebuild-attempts", m.listFabricServiceChannelRouteRebuildAttempts) + 
r.Get("/{clusterID}/fabric/service-channels/rebuild-health", m.getFabricServiceChannelRouteRebuildHealthSummary) + r.Get("/{clusterID}/fabric/service-channels/readiness", m.getFabricServiceChannelReadiness) + r.Get("/{clusterID}/fabric/service-channels/schema-status", m.getFabricServiceChannelSchemaStatus) + r.Get("/{clusterID}/fabric/service-channels/rebuild-snapshots/health", m.getFabricServiceChannelRebuildSnapshotMaintenanceHealth) + r.Post("/{clusterID}/fabric/service-channels/rebuild-snapshots/warmup", m.warmupFabricServiceChannelRebuildSnapshots) + r.Get("/{clusterID}/fabric/service-channels/rebuild-incidents", m.listFabricServiceChannelRouteRebuildIncidents) + r.Post("/{clusterID}/fabric/service-channels/rebuild-incidents/investigations", m.recordFabricServiceChannelRouteRebuildInvestigation) + r.Get("/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs", m.listFabricServiceChannelRebuildInvestigationBreadcrumbs) + r.Get("/{clusterID}/fabric/service-channels/rebuild-health/silences", m.listFabricServiceChannelRouteRebuildAlertSilences) + r.Post("/{clusterID}/fabric/service-channels/rebuild-health/silences", m.silenceFabricServiceChannelRouteRebuildAlert) + r.Delete("/{clusterID}/fabric/service-channels/rebuild-health/silences/{silenceID}", m.unsilenceFabricServiceChannelRouteRebuildAlert) + r.Get("/{clusterID}/fabric/service-channels/recovery-policy", m.getFabricServiceChannelRecoveryPolicy) + r.Put("/{clusterID}/fabric/service-channels/recovery-policy", m.updateFabricServiceChannelRecoveryPolicy) + r.Get("/{clusterID}/fabric/service-channels/adaptive-policy", m.getFabricServiceChannelAdaptivePolicy) + r.Put("/{clusterID}/fabric/service-channels/adaptive-policy", m.updateFabricServiceChannelAdaptivePolicy) + r.Get("/{clusterID}/fabric/service-channels/pool-policy", m.getFabricServiceChannelPoolPolicy) + r.Put("/{clusterID}/fabric/service-channels/pool-policy", m.updateFabricServiceChannelPoolPolicy) + 
r.Get("/{clusterID}/fabric/service-channels/breadcrumb-window-policy", m.getFabricServiceChannelBreadcrumbWindowPolicy) + r.Put("/{clusterID}/fabric/service-channels/breadcrumb-window-policy", m.updateFabricServiceChannelBreadcrumbWindowPolicy) + r.Post("/{clusterID}/fabric/service-channels/leases", m.issueFabricServiceChannelLease) + r.Get("/{clusterID}/fabric/service-channels/leases", m.listFabricServiceChannelLeases) + r.Post("/{clusterID}/fabric/service-channels/leases/cleanup", m.cleanupFabricServiceChannelLeases) + r.Get("/{clusterID}/fabric/service-channels/access-telemetry", m.getFabricServiceChannelAccessTelemetry) + r.Post("/{clusterID}/fabric/service-channels/{channelID}/introspect", m.introspectFabricServiceChannelLease) r.Post("/{clusterID}/vpn-connection-leases/expire-stale", m.expireStaleVPNConnectionLeases) r.Get("/{clusterID}/vpn-connections", m.listVPNConnections) r.Post("/{clusterID}/vpn-connections", m.createVPNConnection) @@ -92,9 +146,25 @@ func (m *Module) RegisterRoutes(router chi.Router) { r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/leases/{leaseID}/renew", m.renewVPNConnectionLease) r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/leases/{leaseID}/release", m.releaseVPNConnectionLease) r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/leases/{leaseID}/fence", m.fenceVPNConnectionLease) + r.Get("/{clusterID}/nodes/{nodeID}/vpn/assignments", m.listNodeVPNAssignments) + r.Post("/{clusterID}/nodes/{nodeID}/vpn/assignments/{vpnConnectionID}/lease/{leaseID}/renew", m.renewNodeVPNAssignmentLease) + r.Post("/{clusterID}/nodes/{nodeID}/vpn/assignments/{vpnConnectionID}/status", m.reportNodeVPNAssignmentStatus) + r.Get("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/stats", m.getVPNPacketStats) + r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/reset", m.resetVPNPacketQueues) + r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/client/packets", m.postVPNClientPacket) + 
r.Get("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/client/packets", m.getVPNClientPacket) + r.Post("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/gateway/packets", m.postVPNGatewayPacket) + r.Get("/{clusterID}/vpn-connections/{vpnConnectionID}/tunnel/gateway/packets", m.getVPNGatewayPacket) + r.Get("/{clusterID}/vpn/client-diagnostics", m.listVPNClientDiagnosticStatuses) + r.Post("/{clusterID}/vpn/client-diagnostics/{deviceID}/status", m.reportVPNClientDiagnosticStatus) + r.Get("/{clusterID}/vpn/client-diagnostics/{deviceID}/status", m.getVPNClientDiagnosticStatus) + r.Post("/{clusterID}/vpn/client-diagnostics/{deviceID}/commands", m.enqueueVPNClientDiagnosticCommand) + r.Get("/{clusterID}/vpn/client-diagnostics/{deviceID}/commands", m.getVPNClientDiagnosticCommand) + r.Get("/{clusterID}/vpn/client-profile", m.getVPNClientProfile) r.Get("/{clusterID}/authority", m.getClusterAuthority) r.Put("/{clusterID}/authority", m.updateClusterAuthority) r.Get("/{clusterID}/audit", m.listAuditEvents) + r.Get("/{clusterID}/events", m.streamClusterEvents) }) router.Get("/cluster-admin-summaries", m.listClusterAdminSummaries) router.Get("/fabric/testing-flags", m.listFabricTestingFlags) @@ -284,6 +354,7 @@ func (m *Module) approveJoinRequest(w http.ResponseWriter, r *http.Request) { NodeKey string `json:"node_key"` OwnershipType string `json:"ownership_type"` OwnerOrganizationID *string `json:"owner_organization_id"` + NodeGroupID *string `json:"node_group_id"` } if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { httpx.WriteError(w, http.StatusBadRequest, "invalid join request approval payload") @@ -296,6 +367,7 @@ func (m *Module) approveJoinRequest(w http.ResponseWriter, r *http.Request) { NodeKey: payload.NodeKey, OwnershipType: payload.OwnershipType, OwnerOrganizationID: payload.OwnerOrganizationID, + NodeGroupID: payload.NodeGroupID, }) if writeServiceError(w, err) { return @@ -343,6 +415,14 @@ func (m *Module) revokeJoinToken(w 
http.ResponseWriter, r *http.Request) { httpx.WriteJSON(w, http.StatusOK, map[string]any{"join_token": item}) } +func (m *Module) listJoinTokens(w http.ResponseWriter, r *http.Request) { + items, err := m.service.ListJoinTokens(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"join_tokens": items}) +} + func (m *Module) assignNodeRole(w http.ResponseWriter, r *http.Request) { var payload struct { ActorUserID string `json:"actor_user_id"` @@ -403,7 +483,8 @@ func (m *Module) recordHeartbeat(w http.ResponseWriter, r *http.Request) { return } flags, _ := m.service.GetEffectiveNodeTestingFlags(r.Context(), chi.URLParam(r, "clusterID"), chi.URLParam(r, "nodeID")) - httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"heartbeat": item, "testing_flags": flags}) + updateHint := m.service.GetNodeUpdateHint(r.Context(), chi.URLParam(r, "clusterID"), chi.URLParam(r, "nodeID")) + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"heartbeat": item, "testing_flags": flags, "update_hint": updateHint}) } func (m *Module) listNodeHeartbeats(w http.ResponseWriter, r *http.Request) { @@ -415,6 +496,212 @@ func (m *Module) listNodeHeartbeats(w http.ResponseWriter, r *http.Request) { httpx.WriteJSON(w, http.StatusOK, map[string]any{"heartbeats": items}) } +func (m *Module) createReleaseVersion(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Product string `json:"product"` + Version string `json:"version"` + Channel string `json:"channel"` + Status string `json:"status"` + Compatibility json.RawMessage `json:"compatibility"` + Changelog *string `json:"changelog"` + Artifacts []ReleaseArtifactInput `json:"artifacts"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid release payload") + return + } + item, err := 
m.service.CreateReleaseVersion(r.Context(), CreateReleaseVersionInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + Product: payload.Product, + Version: payload.Version, + Channel: payload.Channel, + Status: payload.Status, + Compatibility: payload.Compatibility, + Changelog: payload.Changelog, + Artifacts: payload.Artifacts, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusCreated, map[string]any{"release_version": item}) +} + +func (m *Module) listReleaseVersions(w http.ResponseWriter, r *http.Request) { + items, err := m.service.ListReleaseVersions( + r.Context(), + r.URL.Query().Get("actor_user_id"), + chi.URLParam(r, "clusterID"), + r.URL.Query().Get("product"), + r.URL.Query().Get("channel"), + ) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"release_versions": items}) +} + +func (m *Module) upsertNodeUpdatePolicy(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Product string `json:"product"` + Channel string `json:"channel"` + TargetVersion *string `json:"target_version"` + Strategy string `json:"strategy"` + Enabled bool `json:"enabled"` + RollbackAllowed bool `json:"rollback_allowed"` + HealthWindowSec int `json:"health_window_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid update policy payload") + return + } + item, err := m.service.UpsertNodeUpdatePolicy(r.Context(), UpsertNodeUpdatePolicyInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + NodeID: chi.URLParam(r, "nodeID"), + Product: payload.Product, + Channel: payload.Channel, + TargetVersion: payload.TargetVersion, + Strategy: payload.Strategy, + Enabled: payload.Enabled, + RollbackAllowed: payload.RollbackAllowed, + HealthWindowSec: payload.HealthWindowSec, + }) + if writeServiceError(w, err) { + 
return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"node_update_policy": item}) +} + +func (m *Module) getNodeUpdatePlan(w http.ResponseWriter, r *http.Request) { + item, err := m.service.GetNodeUpdatePlan(r.Context(), GetNodeUpdatePlanInput{ + ClusterID: chi.URLParam(r, "clusterID"), + NodeID: chi.URLParam(r, "nodeID"), + Product: r.URL.Query().Get("product"), + CurrentVersion: r.URL.Query().Get("current_version"), + OS: r.URL.Query().Get("os"), + Arch: r.URL.Query().Get("arch"), + InstallType: r.URL.Query().Get("install_type"), + Channel: r.URL.Query().Get("channel"), + ArtifactOrigin: requestOrigin(r), + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"node_update_plan": item}) +} + +func requestOrigin(r *http.Request) string { + proto := strings.TrimSpace(r.Header.Get("X-Forwarded-Proto")) + if proto == "" { + proto = strings.TrimSpace(r.Header.Get("X-Forwarded-Scheme")) + } + if proto == "" { + if r.TLS != nil { + proto = "https" + } else { + proto = "http" + } + } + host := strings.TrimSpace(r.Header.Get("X-Forwarded-Host")) + if host == "" { + host = strings.TrimSpace(r.Host) + } + if host == "" { + return "" + } + if comma := strings.Index(host, ","); comma >= 0 { + host = strings.TrimSpace(host[:comma]) + } + if comma := strings.Index(proto, ","); comma >= 0 { + proto = strings.TrimSpace(proto[:comma]) + } + if !strings.Contains(host, ":") { + if port := strings.TrimSpace(r.Header.Get("X-Forwarded-Port")); port != "" && port != "80" && port != "443" { + host += ":" + port + } + } + if proto == "" || host == "" { + return "" + } + return remapDirectBackendDownloadOrigin(proto, host) +} + +func remapDirectBackendDownloadOrigin(proto, host string) string { + httpPort := strings.TrimSpace(os.Getenv("HTTP_PORT")) + if httpPort == "" { + httpPort = "18121" + } + downloadPort := strings.TrimSpace(os.Getenv("RAP_DOWNLOAD_PORT")) + if downloadPort == "" { + downloadPort = "18080" + } + suffix := ":" + 
httpPort + if !strings.HasSuffix(host, suffix) { + return proto + "://" + host + } + hostOnly := strings.TrimSuffix(host, suffix) + return proto + "://" + hostOnly + ":" + downloadPort +} + +func (m *Module) reportNodeUpdateStatus(w http.ResponseWriter, r *http.Request) { + var payload struct { + Product string `json:"product"` + CurrentVersion string `json:"current_version"` + TargetVersion string `json:"target_version"` + Phase string `json:"phase"` + Status string `json:"status"` + AttemptID string `json:"attempt_id"` + ErrorMessage *string `json:"error_message"` + RollbackVersion *string `json:"rollback_version"` + Payload json.RawMessage `json:"payload"` + ObservedAt *time.Time `json:"observed_at"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid update status payload") + return + } + observedAt := time.Time{} + if payload.ObservedAt != nil { + observedAt = *payload.ObservedAt + } + item, err := m.service.ReportNodeUpdateStatus(r.Context(), ReportNodeUpdateStatusInput{ + ClusterID: chi.URLParam(r, "clusterID"), + NodeID: chi.URLParam(r, "nodeID"), + Product: payload.Product, + CurrentVersion: payload.CurrentVersion, + TargetVersion: payload.TargetVersion, + Phase: payload.Phase, + Status: payload.Status, + AttemptID: payload.AttemptID, + ErrorMessage: payload.ErrorMessage, + RollbackVersion: payload.RollbackVersion, + Payload: payload.Payload, + ObservedAt: observedAt, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"node_update_status": item}) +} + +func (m *Module) listNodeUpdateStatuses(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + items, err := m.service.ListNodeUpdateStatuses( + r.Context(), + r.URL.Query().Get("actor_user_id"), + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "nodeID"), + limit, + ) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, 
http.StatusOK, map[string]any{"node_update_statuses": items}) +} + func (m *Module) getEffectiveNodeTestingFlags(w http.ResponseWriter, r *http.Request) { item, err := m.service.GetEffectiveNodeTestingFlags(r.Context(), chi.URLParam(r, "clusterID"), chi.URLParam(r, "nodeID")) if writeServiceError(w, err) { @@ -568,6 +855,27 @@ func (m *Module) disableMembership(w http.ResponseWriter, r *http.Request) { httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"status": "accepted"}) } +func (m *Module) deleteClusterNode(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Reason string `json:"reason"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid node delete payload") + return + } + err := m.service.DeleteClusterNode(r.Context(), DeleteClusterNodeInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + NodeID: chi.URLParam(r, "nodeID"), + Reason: payload.Reason, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"status": "accepted"}) +} + func (m *Module) setDesiredWorkload(w http.ResponseWriter, r *http.Request) { var payload struct { ActorUserID string `json:"actor_user_id"` @@ -714,6 +1022,48 @@ func (m *Module) listRouteIntents(w http.ResponseWriter, r *http.Request) { httpx.WriteJSON(w, http.StatusOK, map[string]any{"route_intents": items}) } +func (m *Module) expireRouteIntent(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Reason string `json:"reason"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid route intent expire payload") + return + } + item, err := m.service.ExpireRouteIntent(r.Context(), RouteIntentLifecycleInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + 
RouteIntentID: chi.URLParam(r, "routeIntentID"), + Reason: payload.Reason, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"route_intent": item}) +} + +func (m *Module) disableRouteIntent(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Reason string `json:"reason"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid route intent disable payload") + return + } + item, err := m.service.DisableRouteIntent(r.Context(), RouteIntentLifecycleInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + RouteIntentID: chi.URLParam(r, "routeIntentID"), + Reason: payload.Reason, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"route_intent": item}) +} + func (m *Module) listQoSPolicies(w http.ResponseWriter, r *http.Request) { items, err := m.service.ListQoSPolicies(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) if writeServiceError(w, err) { @@ -876,6 +1226,566 @@ func (m *Module) listFabricEgressPoolNodes(w http.ResponseWriter, r *http.Reques httpx.WriteJSON(w, http.StatusOK, map[string]any{"egress_pool_nodes": items}) } +func (m *Module) issueFabricServiceChannelLease(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + OrganizationID string `json:"organization_id"` + UserID string `json:"user_id"` + ResourceID string `json:"resource_id"` + ServiceClass string `json:"service_class"` + EntryNodeIDs []string `json:"entry_node_ids"` + ExitNodeIDs []string `json:"exit_node_ids"` + PreferredEntryNodeID string `json:"preferred_entry_node_id"` + PreferredExitNodeID string `json:"preferred_exit_node_id"` + RequiredRoles []string `json:"required_roles"` + AllowedChannels []string `json:"allowed_channels"` + QoS 
json.RawMessage `json:"qos"` + Failover json.RawMessage `json:"failover"` + Metadata json.RawMessage `json:"metadata"` + TTLSeconds int `json:"ttl_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid fabric service channel lease payload") + return + } + item, err := m.service.IssueFabricServiceChannelLease(r.Context(), IssueFabricServiceChannelLeaseInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + OrganizationID: payload.OrganizationID, + UserID: payload.UserID, + ResourceID: payload.ResourceID, + ServiceClass: payload.ServiceClass, + EntryNodeIDs: payload.EntryNodeIDs, + ExitNodeIDs: payload.ExitNodeIDs, + PreferredEntryNodeID: payload.PreferredEntryNodeID, + PreferredExitNodeID: payload.PreferredExitNodeID, + RequiredRoles: payload.RequiredRoles, + AllowedChannels: payload.AllowedChannels, + QoS: payload.QoS, + Failover: payload.Failover, + Metadata: payload.Metadata, + TTL: time.Duration(payload.TTLSeconds) * time.Second, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusCreated, map[string]any{"fabric_service_channel_lease": item}) +} + +func (m *Module) introspectFabricServiceChannelLease(w http.ResponseWriter, r *http.Request) { + var payload struct { + Token string `json:"token"` + ResourceID string `json:"resource_id"` + ServiceClass string `json:"service_class"` + ChannelClass string `json:"channel_class"` + EntryNodeID string `json:"entry_node_id"` + RequestSourceIP string `json:"request_source_ip"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid fabric service channel introspection payload") + return + } + if payload.EntryNodeID == "" { + payload.EntryNodeID = r.Header.Get("X-RAP-Entry-Node") + } + if payload.RequestSourceIP == "" { + payload.RequestSourceIP = r.RemoteAddr + } + item, err := 
m.service.IntrospectFabricServiceChannelLease(r.Context(), IntrospectFabricServiceChannelLeaseInput{ + ClusterID: chi.URLParam(r, "clusterID"), + ChannelID: chi.URLParam(r, "channelID"), + ResourceID: payload.ResourceID, + ServiceClass: payload.ServiceClass, + ChannelClass: payload.ChannelClass, + Token: payload.Token, + EntryNodeID: payload.EntryNodeID, + RequestSourceIP: payload.RequestSourceIP, + }) + if writeServiceError(w, err) { + return + } + status := http.StatusOK + if !item.Allowed { + status = http.StatusForbidden + } + httpx.WriteJSON(w, status, map[string]any{"fabric_service_channel_introspection": item}) +} + +func (m *Module) listFabricServiceChannelLeases(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + items, err := m.service.ListFabricServiceChannelLeases(r.Context(), r.URL.Query().Get("actor_user_id"), ListFabricServiceChannelLeasesInput{ + ClusterID: chi.URLParam(r, "clusterID"), + ServiceClass: r.URL.Query().Get("service_class"), + EntryNodeID: r.URL.Query().Get("entry_node_id"), + ResourceID: r.URL.Query().Get("resource_id"), + IncludeExpired: r.URL.Query().Get("include_expired") == "true", + Limit: limit, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_lease_maintenance": items}) +} + +func (m *Module) cleanupFabricServiceChannelLeases(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Limit int `json:"limit"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil && !errors.Is(err, io.EOF) { + httpx.WriteError(w, http.StatusBadRequest, "invalid fabric service channel lease cleanup payload") + return + } + result, err := m.service.CleanupFabricServiceChannelLeases(r.Context(), CleanupFabricServiceChannelLeasesInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + Limit: payload.Limit, + }) + if 
writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"fabric_service_channel_lease_maintenance": result}) +} + +func (m *Module) getFabricServiceChannelAccessTelemetry(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + item, err := m.service.GetFabricServiceChannelAccessTelemetry(r.Context(), r.URL.Query().Get("actor_user_id"), GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_access_telemetry": item}) +} + +func (m *Module) listFabricServiceChannelRouteFeedback(w http.ResponseWriter, r *http.Request) { + items, err := m.service.ListFabricServiceChannelRouteFeedback(r.Context(), r.URL.Query().Get("actor_user_id"), ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: chi.URLParam(r, "clusterID"), + ReporterNodeID: r.URL.Query().Get("reporter_node_id"), + RouteID: r.URL.Query().Get("route_id"), + ServiceClass: r.URL.Query().Get("service_class"), + FeedbackStatus: r.URL.Query().Get("feedback_status"), + IncludeExpired: r.URL.Query().Get("include_expired") == "true", + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"route_feedback": items}) +} + +func (m *Module) listFabricServiceChannelRouteRebuildAttempts(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + offset, _ := strconv.Atoi(r.URL.Query().Get("offset")) + items, err := m.service.ListFabricServiceChannelRouteRebuildAttempts(r.Context(), r.URL.Query().Get("actor_user_id"), ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: chi.URLParam(r, "clusterID"), + ReporterNodeID: r.URL.Query().Get("reporter_node_id"), + RouteID: r.URL.Query().Get("route_id"), + ReplacementRouteID: r.URL.Query().Get("replacement_route_id"), + ServiceClass: 
r.URL.Query().Get("service_class"), + RebuildStatus: r.URL.Query().Get("rebuild_status"), + RebuildRequestID: r.URL.Query().Get("rebuild_request_id"), + Generation: r.URL.Query().Get("generation"), + FeedbackSource: r.URL.Query().Get("feedback_source"), + FeedbackChannelID: r.URL.Query().Get("feedback_channel_id"), + FeedbackViolationStatus: r.URL.Query().Get("feedback_violation_status"), + EnrichmentMode: r.URL.Query().Get("enrichment"), + Limit: limit, + Offset: offset, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_attempts": items}) +} + +func (m *Module) getFabricServiceChannelRouteRebuildHealthSummary(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + summary, err := m.service.GetFabricServiceChannelRouteRebuildHealthSummary(r.Context(), r.URL.Query().Get("actor_user_id"), GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_health": summary}) +} + +func (m *Module) getFabricServiceChannelReadiness(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + readiness, err := m.service.GetFabricServiceChannelReadiness(r.Context(), r.URL.Query().Get("actor_user_id"), GetFabricServiceChannelReadinessInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_readiness": readiness}) +} + +func (m *Module) getFabricServiceChannelSchemaStatus(w http.ResponseWriter, r *http.Request) { + status, err := m.service.GetFabricServiceChannelSchemaStatus(r.Context(), r.URL.Query().Get("actor_user_id"), GetFabricServiceChannelSchemaStatusInput{ + ClusterID: chi.URLParam(r, "clusterID"), + }) + if writeServiceError(w, err) { + 
return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_schema_status": status}) +} + +func (m *Module) getFabricServiceChannelRebuildSnapshotMaintenanceHealth(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + minAgeSeconds, _ := strconv.ParseInt(r.URL.Query().Get("min_age_seconds"), 10, 64) + heartbeatThreshold, _ := strconv.Atoi(r.URL.Query().Get("heartbeat_threshold")) + health, err := m.service.GetFabricServiceChannelRebuildSnapshotMaintenanceHealth(r.Context(), r.URL.Query().Get("actor_user_id"), GetFabricServiceChannelRebuildSnapshotMaintenanceHealthInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + MinAgeSeconds: minAgeSeconds, + HeartbeatThreshold: heartbeatThreshold, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_snapshot_health": health}) +} + +func (m *Module) warmupFabricServiceChannelRebuildSnapshots(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Limit int `json:"limit"` + StaleAfterSeconds int64 `json:"stale_after_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil && !errors.Is(err, io.EOF) { + httpx.WriteError(w, http.StatusBadRequest, "invalid rebuild snapshot warmup payload") + return + } + result, err := m.service.WarmupFabricServiceChannelRebuildSnapshots(r.Context(), WarmupFabricServiceChannelRebuildSnapshotsInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + Limit: payload.Limit, + StaleAfterSeconds: payload.StaleAfterSeconds, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"rebuild_snapshot_warmup": result}) +} + +func (m *Module) listFabricServiceChannelRouteRebuildIncidents(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + incidents, err := 
m.service.ListFabricServiceChannelRouteRebuildIncidents(r.Context(), r.URL.Query().Get("actor_user_id"), ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_incidents": incidents}) +} + +func (m *Module) recordFabricServiceChannelRouteRebuildInvestigation(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + ServiceClass string `json:"service_class"` + Generation string `json:"generation"` + GuardStatus string `json:"guard_status"` + IncidentID string `json:"incident_id"` + FeedbackSource string `json:"feedback_source"` + FeedbackChannelID string `json:"feedback_channel_id"` + FeedbackViolationStatus string `json:"feedback_violation_status"` + DrilldownSource string `json:"drilldown_source"` + Reason string `json:"reason"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid rebuild investigation payload") + return + } + if err := m.service.RecordFabricServiceChannelRouteRebuildInvestigation(r.Context(), RecordFabricServiceChannelRouteRebuildInvestigationInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + ReporterNodeID: payload.ReporterNodeID, + RouteID: payload.RouteID, + ServiceClass: payload.ServiceClass, + Generation: payload.Generation, + GuardStatus: payload.GuardStatus, + IncidentID: payload.IncidentID, + FeedbackSource: payload.FeedbackSource, + FeedbackChannelID: payload.FeedbackChannelID, + FeedbackViolationStatus: payload.FeedbackViolationStatus, + DrilldownSource: payload.DrilldownSource, + Reason: payload.Reason, + }); writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusCreated, map[string]any{"status": "recorded"}) +} 
+ +func (m *Module) listFabricServiceChannelRebuildInvestigationBreadcrumbs(w http.ResponseWriter, r *http.Request) { + limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) + currentWindowSeconds, _ := strconv.ParseInt(r.URL.Query().Get("current_window_seconds"), 10, 64) + historyWindowSeconds, _ := strconv.ParseInt(r.URL.Query().Get("history_window_seconds"), 10, 64) + breadcrumbs, err := m.service.ListFabricServiceChannelRebuildInvestigationBreadcrumbs(r.Context(), r.URL.Query().Get("actor_user_id"), ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput{ + ClusterID: chi.URLParam(r, "clusterID"), + Limit: limit, + CurrentWindowSeconds: currentWindowSeconds, + HistoryWindowSeconds: historyWindowSeconds, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_investigation_breadcrumbs": breadcrumbs}) +} + +func (m *Module) silenceFabricServiceChannelRouteRebuildAlert(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + IncidentSource string `json:"incident_source"` + ChannelID string `json:"channel_id"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + GuardStatus string `json:"guard_status"` + Generation string `json:"generation"` + Reason string `json:"reason"` + TTLSeconds int64 `json:"ttl_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid rebuild alert silence payload") + return + } + silence, err := m.service.SilenceFabricServiceChannelRouteRebuildAlert(r.Context(), SilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + IncidentSource: payload.IncidentSource, + ChannelID: payload.ChannelID, + ReporterNodeID: payload.ReporterNodeID, + RouteID: payload.RouteID, + GuardStatus: payload.GuardStatus, + Generation: payload.Generation, + Reason: 
payload.Reason, + TTL: time.Duration(payload.TTLSeconds) * time.Second, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusCreated, map[string]any{"rebuild_alert_silence": silence}) +} + +func (m *Module) listFabricServiceChannelRouteRebuildAlertSilences(w http.ResponseWriter, r *http.Request) { + silences, err := m.service.ListFabricServiceChannelRouteRebuildAlertSilences( + r.Context(), + r.URL.Query().Get("actor_user_id"), + chi.URLParam(r, "clusterID"), + time.Time{}, + ) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_alert_silences": silences}) +} + +func (m *Module) unsilenceFabricServiceChannelRouteRebuildAlert(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + Reason string `json:"reason"` + } + if r.Body != nil { + _ = json.NewDecoder(r.Body).Decode(&payload) + } + if payload.ActorUserID == "" { + payload.ActorUserID = r.URL.Query().Get("actor_user_id") + } + silence, err := m.service.UnsilenceFabricServiceChannelRouteRebuildAlert(r.Context(), UnsilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + SilenceID: chi.URLParam(r, "silenceID"), + Reason: payload.Reason, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"rebuild_alert_silence": silence}) +} + +func (m *Module) expireFabricServiceChannelRouteFeedback(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + ReporterNodeID string `json:"reporter_node_id"` + RouteID string `json:"route_id"` + ServiceClass string `json:"service_class"` + Reason string `json:"reason"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid route feedback expire payload") + return + } + result, err := 
m.service.ExpireFabricServiceChannelRouteFeedback(r.Context(), ExpireFabricServiceChannelRouteFeedbackInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + ReporterNodeID: payload.ReporterNodeID, + RouteID: payload.RouteID, + ServiceClass: payload.ServiceClass, + Reason: payload.Reason, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"route_feedback_expire": result}) +} + +func (m *Module) getFabricServiceChannelRecoveryPolicy(w http.ResponseWriter, r *http.Request) { + item, err := m.service.GetFabricServiceChannelRecoveryPolicy(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_recovery_policy": item}) +} + +func (m *Module) updateFabricServiceChannelRecoveryPolicy(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + HysteresisPenalty int `json:"hysteresis_penalty"` + PromotionMinSamples int `json:"promotion_min_samples"` + DemotionFailureThreshold int `json:"demotion_failure_threshold"` + DemotionDropThreshold int `json:"demotion_drop_threshold"` + DemotionSlowThreshold int `json:"demotion_slow_threshold"` + DemotionRebuildEnabled *bool `json:"demotion_rebuild_enabled"` + DemotionFencedEnabled *bool `json:"demotion_fenced_enabled"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid recovery policy payload") + return + } + item, err := m.service.UpdateFabricServiceChannelRecoveryPolicy(r.Context(), UpdateFabricServiceChannelRecoveryPolicyInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + HysteresisPenalty: payload.HysteresisPenalty, + PromotionMinSamples: payload.PromotionMinSamples, + DemotionFailureThreshold: payload.DemotionFailureThreshold, + 
DemotionDropThreshold: payload.DemotionDropThreshold, + DemotionSlowThreshold: payload.DemotionSlowThreshold, + DemotionRebuildEnabled: payload.DemotionRebuildEnabled, + DemotionFencedEnabled: payload.DemotionFencedEnabled, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_recovery_policy": item}) +} + +func (m *Module) getFabricServiceChannelAdaptivePolicy(w http.ResponseWriter, r *http.Request) { + item, err := m.service.GetFabricServiceChannelAdaptivePolicy(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_adaptive_policy": item}) +} + +func (m *Module) updateFabricServiceChannelAdaptivePolicy(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + MaxParallelWindow int `json:"max_parallel_window"` + BulkPressureChannelThreshold int `json:"bulk_pressure_channel_threshold"` + QueuePressureHighWatermark int `json:"queue_pressure_high_watermark"` + QueuePressureMaxInFlight int `json:"queue_pressure_max_in_flight"` + ClassWindows map[string]int `json:"class_windows"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid adaptive policy payload") + return + } + item, err := m.service.UpdateFabricServiceChannelAdaptivePolicy(r.Context(), UpdateFabricServiceChannelAdaptivePolicyInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + MaxParallelWindow: payload.MaxParallelWindow, + BulkPressureChannelThreshold: payload.BulkPressureChannelThreshold, + QueuePressureHighWatermark: payload.QueuePressureHighWatermark, + QueuePressureMaxInFlight: payload.QueuePressureMaxInFlight, + ClassWindows: payload.ClassWindows, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, 
http.StatusOK, map[string]any{"fabric_service_channel_adaptive_policy": item}) +} + +func (m *Module) getFabricServiceChannelPoolPolicy(w http.ResponseWriter, r *http.Request) { + item, err := m.service.GetFabricServiceChannelPoolPolicy(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_pool_policy": item}) +} + +func (m *Module) updateFabricServiceChannelPoolPolicy(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + EntryPoolNodeIDs []string `json:"entry_pool_node_ids"` + ExitPoolNodeIDs []string `json:"exit_pool_node_ids"` + PreferredEntryNodeID string `json:"preferred_entry_node_id"` + PreferredExitNodeID string `json:"preferred_exit_node_id"` + SelectionStrategy string `json:"selection_strategy"` + RouteRebuild string `json:"route_rebuild"` + EntryFailover string `json:"entry_failover"` + ExitFailover string `json:"exit_failover"` + BackendFallbackAllowed *bool `json:"backend_fallback_allowed"` + StickySession *bool `json:"sticky_session"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid pool policy payload") + return + } + item, err := m.service.UpdateFabricServiceChannelPoolPolicy(r.Context(), UpdateFabricServiceChannelPoolPolicyInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + EntryPoolNodeIDs: payload.EntryPoolNodeIDs, + ExitPoolNodeIDs: payload.ExitPoolNodeIDs, + PreferredEntryNodeID: payload.PreferredEntryNodeID, + PreferredExitNodeID: payload.PreferredExitNodeID, + SelectionStrategy: payload.SelectionStrategy, + RouteRebuild: payload.RouteRebuild, + EntryFailover: payload.EntryFailover, + ExitFailover: payload.ExitFailover, + BackendFallbackAllowed: payload.BackendFallbackAllowed, + StickySession: payload.StickySession, + }) + if 
writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_pool_policy": item}) +} + +func (m *Module) getFabricServiceChannelBreadcrumbWindowPolicy(w http.ResponseWriter, r *http.Request) { + item, err := m.service.GetFabricServiceChannelBreadcrumbWindowPolicy(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_breadcrumb_window_policy": item}) +} + +func (m *Module) updateFabricServiceChannelBreadcrumbWindowPolicy(w http.ResponseWriter, r *http.Request) { + var payload struct { + ActorUserID string `json:"actor_user_id"` + CurrentWindowSeconds int64 `json:"current_window_seconds"` + HistoryWindowSeconds int64 `json:"history_window_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid service-channel breadcrumb window policy payload") + return + } + item, err := m.service.UpdateFabricServiceChannelBreadcrumbWindowPolicy(r.Context(), UpdateFabricServiceChannelBreadcrumbWindowPolicyInput{ + ActorUserID: payload.ActorUserID, + ClusterID: chi.URLParam(r, "clusterID"), + CurrentWindowSeconds: payload.CurrentWindowSeconds, + HistoryWindowSeconds: payload.HistoryWindowSeconds, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"fabric_service_channel_breadcrumb_window_policy": item}) +} + func (m *Module) createVPNConnection(w http.ResponseWriter, r *http.Request) { var payload struct { ActorUserID string `json:"actor_user_id"` @@ -1154,6 +2064,1056 @@ func (m *Module) expireStaleVPNConnectionLeases(w http.ResponseWriter, r *http.R httpx.WriteJSON(w, http.StatusOK, map[string]any{"expired_leases": items}) } +func (m *Module) listNodeVPNAssignments(w http.ResponseWriter, r *http.Request) { + items, err := 
m.service.ListNodeVPNAssignments(r.Context(), chi.URLParam(r, "clusterID"), chi.URLParam(r, "nodeID")) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"vpn_assignments": items}) +} + +func (m *Module) renewNodeVPNAssignmentLease(w http.ResponseWriter, r *http.Request) { + var payload struct { + TTLSeconds int `json:"ttl_seconds"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid vpn node lease renew payload") + return + } + item, err := m.service.RenewNodeVPNAssignmentLease(r.Context(), RenewNodeVPNAssignmentLeaseInput{ + ClusterID: chi.URLParam(r, "clusterID"), + VPNConnectionID: chi.URLParam(r, "vpnConnectionID"), + LeaseID: chi.URLParam(r, "leaseID"), + OwnerNodeID: chi.URLParam(r, "nodeID"), + TTL: time.Duration(payload.TTLSeconds) * time.Second, + }) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"lease": NodeVPNAssignmentLease{ + LeaseID: item.ID, + OwnerNodeID: item.OwnerNodeID, + LeaseGeneration: item.LeaseGeneration, + Status: item.Status, + RenewedAt: item.RenewedAt, + ExpiresAt: item.ExpiresAt, + }}) +} + +func (m *Module) reportNodeVPNAssignmentStatus(w http.ResponseWriter, r *http.Request) { + var payload struct { + ObservedStatus string `json:"observed_status"` + StatusPayload json.RawMessage `json:"status_payload"` + ObservedAt time.Time `json:"observed_at"` + } + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid vpn assignment status payload") + return + } + item, err := m.service.ReportNodeVPNAssignmentStatus(r.Context(), ReportNodeVPNAssignmentStatusInput{ + ClusterID: chi.URLParam(r, "clusterID"), + NodeID: chi.URLParam(r, "nodeID"), + VPNConnectionID: chi.URLParam(r, "vpnConnectionID"), + ObservedStatus: payload.ObservedStatus, + StatusPayload: payload.StatusPayload, + ObservedAt: payload.ObservedAt, + }) + 
if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"vpn_assignment_status": item}) +} + +func (m *Module) getVPNClientProfile(w http.ResponseWriter, r *http.Request) { + preferredEntryNodeID := strings.TrimSpace(r.URL.Query().Get("entry_node_id")) + if preferredEntryNodeID == "" { + preferredEntryNodeID = strings.TrimSpace(r.Header.Get("X-RAP-Entry-Node")) + } + preferredExitNodeID := strings.TrimSpace(r.URL.Query().Get("exit_node_id")) + if preferredExitNodeID == "" { + preferredExitNodeID = strings.TrimSpace(r.Header.Get("X-RAP-Exit-Node")) + } + item, err := m.service.GetVPNClientProfile( + r.Context(), + chi.URLParam(r, "clusterID"), + r.URL.Query().Get("organization_id"), + r.URL.Query().Get("user_id"), + preferredEntryNodeID, + preferredExitNodeID, + ) + if writeServiceError(w, err) { + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"vpn_client_profile": item}) +} + +func (m *Module) getVPNPacketStats(w http.ResponseWriter, r *http.Request) { + httpx.WriteJSON(w, http.StatusOK, map[string]any{ + "vpn_packet_stats": m.vpnPacketHub.Snapshot( + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "vpnConnectionID"), + ), + }) +} + +func (m *Module) reportVPNClientDiagnosticStatus(w http.ResponseWriter, r *http.Request) { + var payload map[string]any + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid vpn client diagnostic status payload") + return + } + item := m.vpnClientDiagnosticHub.Report( + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "deviceID"), + payload, + ) + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"vpn_client_diagnostic_status": item}) +} + +func (m *Module) listVPNClientDiagnosticStatuses(w http.ResponseWriter, r *http.Request) { + items := m.vpnClientDiagnosticHub.List(chi.URLParam(r, "clusterID")) + httpx.WriteJSON(w, http.StatusOK, map[string]any{"vpn_client_diagnostic_statuses": items}) +} + +func (m 
*Module) getVPNClientDiagnosticStatus(w http.ResponseWriter, r *http.Request) { + item, ok := m.vpnClientDiagnosticHub.Status( + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "deviceID"), + ) + if !ok { + httpx.WriteError(w, http.StatusNotFound, "vpn client diagnostic status not found") + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"vpn_client_diagnostic_status": item}) +} + +func (m *Module) enqueueVPNClientDiagnosticCommand(w http.ResponseWriter, r *http.Request) { + var payload map[string]any + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid vpn client diagnostic command payload") + return + } + commandType, _ := payload["type"].(string) + if strings.TrimSpace(commandType) == "" { + httpx.WriteError(w, http.StatusBadRequest, "vpn client diagnostic command type is required") + return + } + item := m.vpnClientDiagnosticHub.Enqueue( + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "deviceID"), + payload, + ) + httpx.WriteJSON(w, http.StatusAccepted, map[string]any{"vpn_client_diagnostic_command": item}) +} + +func (m *Module) getVPNClientDiagnosticCommand(w http.ResponseWriter, r *http.Request) { + timeout := 25 * time.Second + if raw := r.URL.Query().Get("timeout_ms"); raw != "" { + if parsed, err := strconv.Atoi(raw); err == nil && parsed >= 0 && parsed <= 30000 { + timeout = time.Duration(parsed) * time.Millisecond + } + } + item, ok := m.vpnClientDiagnosticHub.Pop( + r.Context(), + chi.URLParam(r, "clusterID"), + chi.URLParam(r, "deviceID"), + timeout, + ) + if !ok { + w.WriteHeader(http.StatusNoContent) + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"vpn_client_diagnostic_command": item}) +} + +func (m *Module) postVPNClientPacket(w http.ResponseWriter, r *http.Request) { + m.handleVPNPacketPost(w, r, vpnDirectionClientToGateway) +} + +func (m *Module) getVPNClientPacket(w http.ResponseWriter, r *http.Request) { + m.handleVPNPacketGet(w, r, 
vpnDirectionGatewayToClient) +} + +func (m *Module) postVPNGatewayPacket(w http.ResponseWriter, r *http.Request) { + m.handleVPNPacketPost(w, r, vpnDirectionGatewayToClient) +} + +func (m *Module) getVPNGatewayPacket(w http.ResponseWriter, r *http.Request) { + m.handleVPNPacketGet(w, r, vpnDirectionClientToGateway) +} + +func (m *Module) resetVPNPacketQueues(w http.ResponseWriter, r *http.Request) { + clusterID := chi.URLParam(r, "clusterID") + vpnConnectionID := chi.URLParam(r, "vpnConnectionID") + clientToGateway := m.vpnPacketHub.Clear(vpnPacketKey{ + ClusterID: clusterID, + VPNConnectionID: vpnConnectionID, + Direction: vpnDirectionClientToGateway, + }) + gatewayToClient := m.vpnPacketHub.Clear(vpnPacketKey{ + ClusterID: clusterID, + VPNConnectionID: vpnConnectionID, + Direction: vpnDirectionGatewayToClient, + }) + httpx.WriteJSON(w, http.StatusOK, map[string]any{ + "vpn_packet_queues_reset": map[string]any{ + "client_to_gateway": clientToGateway, + "gateway_to_client": gatewayToClient, + }, + }) +} + +func (m *Module) handleVPNPacketPost(w http.ResponseWriter, r *http.Request, direction string) { + maxBodyBytes := int64(vpnPacketMaxBytes) + if r.URL.Query().Get("batch") == "true" { + maxBodyBytes = int64(vpnPacketBatchMaxBytes) + } + body, err := io.ReadAll(http.MaxBytesReader(w, r.Body, maxBodyBytes)) + if err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid vpn packet payload") + return + } + if len(body) == 0 { + httpx.WriteError(w, http.StatusBadRequest, "empty vpn packet payload") + return + } + key := vpnPacketKey{ + ClusterID: chi.URLParam(r, "clusterID"), + VPNConnectionID: chi.URLParam(r, "vpnConnectionID"), + Direction: direction, + } + if r.URL.Query().Get("batch") == "true" { + packets, err := decodeVPNPacketBatch(body) + if err != nil { + httpx.WriteError(w, http.StatusBadRequest, err.Error()) + return + } + for _, packet := range packets { + if err := m.vpnPacketHub.Push(key, packet); err != nil { + httpx.WriteError(w, 
http.StatusServiceUnavailable, err.Error()) + return + } + } + w.WriteHeader(http.StatusAccepted) + return + } + if err := m.vpnPacketHub.Push(key, body); err != nil { + httpx.WriteError(w, http.StatusServiceUnavailable, err.Error()) + return + } + w.WriteHeader(http.StatusAccepted) +} + +func (m *Module) handleVPNPacketGet(w http.ResponseWriter, r *http.Request, direction string) { + timeout := 25 * time.Second + if raw := r.URL.Query().Get("timeout_ms"); raw != "" { + if parsed, err := strconv.Atoi(raw); err == nil && parsed >= 0 && parsed <= 30000 { + timeout = time.Duration(parsed) * time.Millisecond + } + } + key := vpnPacketKey{ + ClusterID: chi.URLParam(r, "clusterID"), + VPNConnectionID: chi.URLParam(r, "vpnConnectionID"), + Direction: direction, + } + if r.URL.Query().Get("batch") == "true" { + packets := m.vpnPacketHub.PopBatch(r.Context(), key, timeout, vpnPacketBatchMaxPackets, vpnPacketBatchMaxBytes) + if len(packets) == 0 { + w.WriteHeader(http.StatusNoContent) + return + } + w.Header().Set("Content-Type", "application/vnd.rap.vpn-packet-batch.v1") + w.WriteHeader(http.StatusOK) + _, _ = w.Write(encodeVPNPacketBatch(packets)) + return + } + packet, ok := m.vpnPacketHub.Pop(r.Context(), key, timeout) + if !ok { + w.WriteHeader(http.StatusNoContent) + return + } + w.Header().Set("Content-Type", "application/octet-stream") + w.WriteHeader(http.StatusOK) + _, _ = w.Write(packet) +} + +const ( + vpnDirectionClientToGateway = "client_to_gateway" + vpnDirectionGatewayToClient = "gateway_to_client" + + vpnPacketMaxBytes = 65535 + vpnPacketFlowShardCount = 16 + vpnPacketFlowShardDepth = 4096 + vpnPacketQueueDepth = vpnPacketFlowShardCount * vpnPacketFlowShardDepth + vpnPacketBatchMaxPackets = 1024 + vpnPacketBatchMaxBytes = 4 * 1024 * 1024 + vpnPacketBatchGatherTimeout = 3 * time.Millisecond + vpnPacketStatsWindow = 5 * time.Second +) + +type vpnPacketKey struct { + ClusterID string + VPNConnectionID string + Direction string +} + +type vpnPacketHub struct { + 
queuesMu sync.RWMutex + statsMu sync.Mutex + queues map[vpnPacketKey]*vpnPacketQueue + stats map[vpnPacketKey]vpnPacketStats +} + +type vpnPacketQueue struct { + shards []chan []byte + popMu sync.Mutex + popCursor int +} + +type vpnPacketStats struct { + Pushed uint64 + Popped uint64 + Dropped uint64 + QueueFullDrops uint64 + RequeueDrops uint64 + ClearedStale uint64 + PushedBytes uint64 + PoppedBytes uint64 + WindowStartedAt time.Time + WindowPushed uint64 + WindowPopped uint64 + WindowPushedBytes uint64 + WindowPoppedBytes uint64 + QueueDepthHigh int + QueueDepthHighAt time.Time + ShardDepthHigh int + ShardDepthHighAt time.Time + LastPushSize int + LastPopSize int + LastPushAt time.Time + LastPopAt time.Time + LastPushSummary string + LastPopSummary string + Recent []vpnPacketTrace +} + +type vpnPacketTrace struct { + Event string `json:"event"` + Summary string `json:"summary"` + Size int `json:"size"` + CreatedAt time.Time `json:"created_at"` +} + +func newVPNPacketHub() *vpnPacketHub { + return &vpnPacketHub{ + queues: map[vpnPacketKey]*vpnPacketQueue{}, + stats: map[vpnPacketKey]vpnPacketStats{}, + } +} + +func (h *vpnPacketHub) Push(key vpnPacketKey, packet []byte) error { + queue := h.queue(key) + if queue.push(packet) { + _, queueDepth, shardDepth := queue.depths() + h.recordPush(key, packet, time.Now().UTC(), queueDepth, shardDepth) + return nil + } + h.recordQueueFullDrop(key, packet, time.Now().UTC()) + return fmt.Errorf("vpn packet queue is full") +} + +func (h *vpnPacketHub) Pop(ctx context.Context, key vpnPacketKey, timeout time.Duration) ([]byte, bool) { + queue := h.queue(key) + if packet, ok := queue.pop(ctx, timeout); ok { + h.recordPop(key, packet, time.Now().UTC()) + return packet, true + } + return nil, false +} + +func (h *vpnPacketHub) PopBatch(ctx context.Context, key vpnPacketKey, timeout time.Duration, maxPackets, maxBytes int) [][]byte { + if maxPackets <= 0 { + maxPackets = 1 + } + if maxBytes <= 0 { + maxBytes = vpnPacketMaxBytes + } + 
first, ok := h.Pop(ctx, key, timeout) + if !ok { + return nil + } + packets := [][]byte{first} + total := len(first) + 4 + gatherUntil := time.Now().Add(vpnPacketBatchGatherTimeout) + queue := h.queue(key) + for len(packets) < maxPackets && total < maxBytes { + packet, ok := queue.pop(ctx, 0) + if !ok { + remaining := time.Until(gatherUntil) + if remaining <= 0 { + return packets + } + packet, ok = queue.pop(ctx, remaining) + if !ok { + return packets + } + } + if total+len(packet)+4 > maxBytes { + if !queue.requeue(packet) { + h.recordRequeueDrop(key, packet, time.Now().UTC()) + } + return packets + } + h.recordPop(key, packet, time.Now().UTC()) + packets = append(packets, packet) + total += len(packet) + 4 + } + return packets +} + +func (h *vpnPacketHub) addWindowSamplesLocked( + stats *vpnPacketStats, + now time.Time, + pushedPackets, poppedPackets, + pushedBytes, poppedBytes uint64, +) { + if stats.WindowStartedAt.IsZero() || now.Sub(stats.WindowStartedAt) >= vpnPacketStatsWindow { + stats.WindowStartedAt = now + stats.WindowPushed = 0 + stats.WindowPopped = 0 + stats.WindowPushedBytes = 0 + stats.WindowPoppedBytes = 0 + } + stats.WindowPushed += pushedPackets + stats.WindowPopped += poppedPackets + stats.WindowPushedBytes += pushedBytes + stats.WindowPoppedBytes += poppedBytes +} + +func computeVPNRateStats(now time.Time, startedAt time.Time, packets, bytes uint64) (float64, float64) { + if packets == 0 || bytes == 0 || startedAt.IsZero() { + return 0, 0 + } + elapsed := now.Sub(startedAt).Seconds() + if elapsed <= 0 { + return 0, 0 + } + pps := float64(packets) / elapsed + mbps := float64(bytes*8) / (elapsed * 1_000_000) + return pps, mbps +} + +func (h *vpnPacketHub) appendRateStats(metrics map[string]any, now time.Time, stats vpnPacketStats) map[string]any { + pushPps, pushMbps := computeVPNRateStats(now, stats.WindowStartedAt, stats.WindowPushed, stats.WindowPushedBytes) + popPps, popMbps := computeVPNRateStats(now, stats.WindowStartedAt, stats.WindowPopped,
stats.WindowPoppedBytes) + metrics["rate_window_seconds"] = vpnPacketStatsWindow.Seconds() + metrics["window_push_rate_pps"] = pushPps + metrics["window_push_rate_mbps"] = pushMbps + metrics["window_pop_rate_pps"] = popPps + metrics["window_pop_rate_mbps"] = popMbps + metrics["window_push_packets"] = stats.WindowPushed + metrics["window_pop_packets"] = stats.WindowPopped + metrics["window_push_bytes"] = stats.WindowPushedBytes + metrics["window_pop_bytes"] = stats.WindowPoppedBytes + return metrics +} + +func (h *vpnPacketHub) Clear(key vpnPacketKey) int { + queue := h.queue(key) + cleared := queue.clear() + h.recordClear(key, cleared) + return cleared +} + +func encodeVPNPacketBatch(packets [][]byte) []byte { + total := 0 + for _, packet := range packets { + total += 4 + len(packet) + } + out := make([]byte, total) + offset := 0 + for _, packet := range packets { + binary.BigEndian.PutUint32(out[offset:offset+4], uint32(len(packet))) + offset += 4 + copy(out[offset:offset+len(packet)], packet) + offset += len(packet) + } + return out +} + +func decodeVPNPacketBatch(payload []byte) ([][]byte, error) { + var packets [][]byte + for offset := 0; offset < len(payload); { + if offset+4 > len(payload) { + return nil, fmt.Errorf("truncated vpn packet batch header") + } + size := int(binary.BigEndian.Uint32(payload[offset : offset+4])) + offset += 4 + if size <= 0 || size > vpnPacketMaxBytes { + return nil, fmt.Errorf("invalid vpn packet batch item size") + } + if offset+size > len(payload) { + return nil, fmt.Errorf("truncated vpn packet batch item") + } + packets = append(packets, append([]byte(nil), payload[offset:offset+size]...)) + offset += size + } + if len(packets) == 0 { + return nil, fmt.Errorf("empty vpn packet batch") + } + return packets, nil +} + +func (h *vpnPacketHub) queue(key vpnPacketKey) *vpnPacketQueue { + h.queuesMu.RLock() + queue := h.queues[key] + h.queuesMu.RUnlock() + if queue != nil { + return queue + } + h.queuesMu.Lock() + defer 
h.queuesMu.Unlock() + queue = h.queues[key] + if queue != nil { + return queue + } + queue = newVPNPacketQueue() + h.queues[key] = queue + return queue +} + +func newVPNPacketQueue() *vpnPacketQueue { + shardCount := vpnPacketFlowShardCount + if shardCount <= 0 { + shardCount = 1 + } + shardDepth := vpnPacketFlowShardDepth + if shardDepth <= 0 { + shardDepth = 1 + } + queue := &vpnPacketQueue{ + shards: make([]chan []byte, shardCount), + } + for i := range queue.shards { + queue.shards[i] = make(chan []byte, shardDepth) + } + return queue +} + +func (q *vpnPacketQueue) push(packet []byte) bool { + return q.enqueue(append([]byte(nil), packet...)) +} + +func (q *vpnPacketQueue) requeue(packet []byte) bool { + return q.enqueue(packet) +} + +func (q *vpnPacketQueue) enqueue(packet []byte) bool { + if len(q.shards) == 0 { + return false + } + shard := vpnPacketFlowShard(packet, len(q.shards)) + select { + case q.shards[shard] <- packet: + return true + default: + return false + } +} + +func (q *vpnPacketQueue) pop(ctx context.Context, timeout time.Duration) ([]byte, bool) { + q.popMu.Lock() + defer q.popMu.Unlock() + if packet, ok := q.popNonBlockingLocked(); ok { + return packet, true + } + if timeout <= 0 || len(q.shards) == 0 { + return nil, false + } + timer := time.NewTimer(timeout) + defer timer.Stop() + cases := make([]reflect.SelectCase, 0, len(q.shards)+2) + cases = append(cases, + reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ctx.Done())}, + reflect.SelectCase{Dir: reflect.SelectRecv, Chan: reflect.ValueOf(timer.C)}, + ) + start := q.popCursor % len(q.shards) + for i := range q.shards { + cases = append(cases, reflect.SelectCase{ + Dir: reflect.SelectRecv, + Chan: reflect.ValueOf(q.shards[(start+i)%len(q.shards)]), + }) + } + chosen, value, ok := reflect.Select(cases) + if chosen < 2 || !ok { + return nil, false + } + shard := (start + chosen - 2) % len(q.shards) + q.popCursor = (shard + 1) % len(q.shards) + packet, ok := 
value.Interface().([]byte) + return packet, ok +} + +func (q *vpnPacketQueue) popNonBlockingLocked() ([]byte, bool) { + if len(q.shards) == 0 { + return nil, false + } + start := q.popCursor % len(q.shards) + for i := range q.shards { + shard := (start + i) % len(q.shards) + select { + case packet := <-q.shards[shard]: + q.popCursor = (shard + 1) % len(q.shards) + return packet, true + default: + } + } + return nil, false +} + +func (q *vpnPacketQueue) clear() int { + cleared := 0 + for _, shard := range q.shards { + for { + select { + case <-shard: + cleared++ + default: + goto nextShard + } + } + nextShard: + } + return cleared +} + +func (q *vpnPacketQueue) depths() ([]int, int, int) { + depths := make([]int, len(q.shards)) + total := 0 + maxDepth := 0 + for i, shard := range q.shards { + depth := len(shard) + depths[i] = depth + total += depth + if depth > maxDepth { + maxDepth = depth + } + } + return depths, total, maxDepth +} + +func (h *vpnPacketHub) Snapshot(clusterID, vpnConnectionID string) map[string]any { + now := time.Now().UTC() + out := map[string]any{} + for _, direction := range []string{vpnDirectionClientToGateway, vpnDirectionGatewayToClient} { + key := vpnPacketKey{ClusterID: clusterID, VPNConnectionID: vpnConnectionID, Direction: direction} + h.statsMu.Lock() + stats := h.stats[key] + h.statsMu.Unlock() + queueDepth := 0 + queueDepthMax := 0 + queueDepths := []int{} + h.queuesMu.RLock() + queue := h.queues[key] + h.queuesMu.RUnlock() + if queue != nil { + queueDepths, queueDepth, queueDepthMax = queue.depths() + } + metric := map[string]any{ + "pushed": stats.Pushed, + "pushed_bytes": stats.PushedBytes, + "popped": stats.Popped, + "popped_bytes": stats.PoppedBytes, + "dropped": stats.Dropped, + "queue_full_drops": stats.QueueFullDrops, + "requeue_drops": stats.RequeueDrops, + "cleared_stale_packets": stats.ClearedStale, + "queue_depth": queueDepth, + "queue_depths": queueDepths, + "queue_depth_max": queueDepthMax, + 
"queue_depth_high_watermark": stats.QueueDepthHigh, + "queue_depth_high_at": stats.QueueDepthHighAt, + "shard_depth_high_watermark": stats.ShardDepthHigh, + "shard_depth_high_at": stats.ShardDepthHighAt, + "queue_capacity": vpnPacketQueueDepth, + "queue_shard_capacity": vpnPacketFlowShardDepth, + "flow_shard_count": len(queueDepths), + "flow_isolation": "ipv4_5tuple_sharded_round_robin", + "last_push_size": stats.LastPushSize, + "last_pop_size": stats.LastPopSize, + "last_push_at": stats.LastPushAt, + "last_pop_at": stats.LastPopAt, + "last_push": stats.LastPushSummary, + "last_pop": stats.LastPopSummary, + "recent": stats.Recent, + } + h.appendRateStats(metric, now, stats) + out[direction] = metric + } + return out +} + +func (h *vpnPacketHub) recordPush(key vpnPacketKey, packet []byte, now time.Time, queueDepth, shardDepth int) { + if !h.statsMu.TryLock() { + return + } + defer h.statsMu.Unlock() + stats := h.stats[key] + stats.Pushed++ + stats.PushedBytes += uint64(len(packet)) + stats.LastPushSize = len(packet) + stats.LastPushAt = now + stats.LastPushSummary = vpnPacketSummary(packet) + if queueDepth > stats.QueueDepthHigh { + stats.QueueDepthHigh = queueDepth + stats.QueueDepthHighAt = now + } + if shardDepth > stats.ShardDepthHigh { + stats.ShardDepthHigh = shardDepth + stats.ShardDepthHighAt = now + } + h.addWindowSamplesLocked(&stats, now, 1, 0, uint64(len(packet)), 0) + stats.Recent = appendVPNPacketTrace(stats.Recent, "push", packet, stats.LastPushAt) + h.stats[key] = stats +} + +func (h *vpnPacketHub) recordPop(key vpnPacketKey, packet []byte, now time.Time) { + if !h.statsMu.TryLock() { + return + } + defer h.statsMu.Unlock() + stats := h.stats[key] + stats.Popped++ + stats.PoppedBytes += uint64(len(packet)) + stats.LastPopSize = len(packet) + stats.LastPopAt = now + stats.LastPopSummary = vpnPacketSummary(packet) + h.addWindowSamplesLocked(&stats, now, 0, 1, 0, uint64(len(packet))) + stats.Recent = appendVPNPacketTrace(stats.Recent, "pop", packet, 
stats.LastPopAt) + h.stats[key] = stats +} + +func (h *vpnPacketHub) recordQueueFullDrop(key vpnPacketKey, packet []byte, now time.Time) { + if !h.statsMu.TryLock() { + return + } + defer h.statsMu.Unlock() + stats := h.stats[key] + stats.Dropped++ + stats.QueueFullDrops++ + stats.Recent = appendVPNPacketTrace(stats.Recent, "drop_queue_full", packet, now) + h.stats[key] = stats +} + +func (h *vpnPacketHub) recordRequeueDrop(key vpnPacketKey, packet []byte, now time.Time) { + if !h.statsMu.TryLock() { + return + } + defer h.statsMu.Unlock() + stats := h.stats[key] + stats.Dropped++ + stats.RequeueDrops++ + stats.Recent = appendVPNPacketTrace(stats.Recent, "drop_requeue_full", packet, now) + h.stats[key] = stats +} + +func (h *vpnPacketHub) recordClear(key vpnPacketKey, cleared int) { + if cleared <= 0 { + return + } + h.statsMu.Lock() + defer h.statsMu.Unlock() + now := time.Now().UTC() + stats := h.stats[key] + stats.ClearedStale += uint64(cleared) + stats.Recent = append(stats.Recent, vpnPacketTrace{ + Event: "clear", + Summary: fmt.Sprintf("cleared stale packets=%d", cleared), + Size: cleared, + CreatedAt: now, + }) + const maxRecentVPNPacketTrace = 24 + if len(stats.Recent) > maxRecentVPNPacketTrace { + stats.Recent = stats.Recent[len(stats.Recent)-maxRecentVPNPacketTrace:] + } + h.stats[key] = stats +} + +func appendVPNPacketTrace(recent []vpnPacketTrace, event string, packet []byte, at time.Time) []vpnPacketTrace { + recent = append(recent, vpnPacketTrace{ + Event: event, + Summary: vpnPacketSummary(packet), + Size: len(packet), + CreatedAt: at, + }) + const maxRecentVPNPacketTrace = 24 + if len(recent) > maxRecentVPNPacketTrace { + recent = recent[len(recent)-maxRecentVPNPacketTrace:] + } + return recent +} + +func vpnPacketFlowShard(packet []byte, shardCount int) int { + if shardCount <= 1 { + return 0 + } + if len(packet) < 20 { + return len(packet) % shardCount + } + version := (packet[0] >> 4) & 0x0f + if version != 4 { + return len(packet) % shardCount + 
} + ihl := int(packet[0]&0x0f) * 4 + if ihl < 20 || len(packet) < ihl { + return len(packet) % shardCount + } + proto := uint32(packet[9]) + srcIP := vpnIPv4Uint32(packet[12:16]) + dstIP := vpnIPv4Uint32(packet[16:20]) + srcPort := uint32(0) + dstPort := uint32(0) + if (proto == 6 || proto == 17) && len(packet) >= ihl+4 { + srcPort = uint32(vpnU16(packet[ihl : ihl+2])) + dstPort = uint32(vpnU16(packet[ihl+2 : ihl+4])) + } else if proto == 1 && len(packet) >= ihl+2 { + srcPort = uint32(packet[ihl]) + dstPort = uint32(packet[ihl+1]) + } + hash := srcIP ^ vpnRotateLeft32(dstIP, 7) ^ (proto << 24) ^ (srcPort << 11) ^ dstPort + hash ^= hash >> 16 + hash *= 0x7feb352d + hash ^= hash >> 15 + return int(hash % uint32(shardCount)) +} + +func vpnIPv4Uint32(raw []byte) uint32 { + if len(raw) < 4 { + return 0 + } + return uint32(raw[0])<<24 | uint32(raw[1])<<16 | uint32(raw[2])<<8 | uint32(raw[3]) +} + +func vpnRotateLeft32(value uint32, shift uint) uint32 { + shift %= 32 + if shift == 0 { + return value + } + return (value << shift) | (value >> (32 - shift)) +} + +func vpnPacketSummary(packet []byte) string { + if len(packet) < 20 { + return fmt.Sprintf("size=%d", len(packet)) + } + version := (packet[0] >> 4) & 0x0f + if version != 4 { + return fmt.Sprintf("size=%d ip_version=%d", len(packet), version) + } + ihl := int(packet[0]&0x0f) * 4 + if ihl < 20 || len(packet) < ihl { + return fmt.Sprintf("size=%d ipv4=truncated", len(packet)) + } + proto := int(packet[9]) + base := fmt.Sprintf("size=%d %s -> %s proto=%d", len(packet), vpnIPv4(packet[12:16]), vpnIPv4(packet[16:20]), proto) + if (proto == 6 || proto == 17) && len(packet) >= ihl+4 { + base += fmt.Sprintf(" %d->%d", vpnU16(packet[ihl:ihl+2]), vpnU16(packet[ihl+2:ihl+4])) + if proto == 6 && len(packet) >= ihl+14 { + base += " flags=" + vpnTCPFlags(packet[ihl+13]) + } + } else if proto == 1 && len(packet) >= ihl+2 { + base += fmt.Sprintf(" icmp_type=%d icmp_code=%d", packet[ihl], packet[ihl+1]) + } + return base +} + +func 
vpnIPv4(raw []byte) string { + if len(raw) < 4 { + return "0.0.0.0" + } + return fmt.Sprintf("%d.%d.%d.%d", raw[0], raw[1], raw[2], raw[3]) +} + +func vpnU16(raw []byte) int { + if len(raw) < 2 { + return 0 + } + return int(raw[0])<<8 | int(raw[1]) +} + +func vpnTCPFlags(flags byte) string { + out := strings.Builder{} + if flags&0x02 != 0 { + out.WriteByte('S') + } + if flags&0x10 != 0 { + out.WriteByte('A') + } + if flags&0x01 != 0 { + out.WriteByte('F') + } + if flags&0x04 != 0 { + out.WriteByte('R') + } + if flags&0x08 != 0 { + out.WriteByte('P') + } + if out.Len() == 0 { + return fmt.Sprintf("%d", flags) + } + return out.String() +} + +type vpnClientDiagnosticKey struct { + ClusterID string + DeviceID string +} + +type vpnClientDiagnosticStatus struct { + ClusterID string `json:"cluster_id"` + DeviceID string `json:"device_id"` + Payload map[string]any `json:"payload"` + ObservedAt time.Time `json:"observed_at"` +} + +type vpnClientDiagnosticCommand struct { + ID string `json:"id"` + ClusterID string `json:"cluster_id"` + DeviceID string `json:"device_id"` + Payload map[string]any `json:"payload"` + CreatedAt time.Time `json:"created_at"` +} + +type vpnClientDiagnosticHub struct { + mu sync.Mutex + statuses map[vpnClientDiagnosticKey]vpnClientDiagnosticStatus + queues map[vpnClientDiagnosticKey]chan vpnClientDiagnosticCommand +} + +func newVPNClientDiagnosticHub() *vpnClientDiagnosticHub { + return &vpnClientDiagnosticHub{ + statuses: map[vpnClientDiagnosticKey]vpnClientDiagnosticStatus{}, + queues: map[vpnClientDiagnosticKey]chan vpnClientDiagnosticCommand{}, + } +} + +func (h *vpnClientDiagnosticHub) Report(clusterID, deviceID string, payload map[string]any) vpnClientDiagnosticStatus { + key := vpnClientDiagnosticKey{ClusterID: clusterID, DeviceID: deviceID} + item := vpnClientDiagnosticStatus{ + ClusterID: clusterID, + DeviceID: deviceID, + Payload: cloneDiagnosticPayload(payload), + ObservedAt: time.Now().UTC(), + } + h.mu.Lock() + h.statuses[key] = item + 
h.mu.Unlock() + return item +} + +func (h *vpnClientDiagnosticHub) Status(clusterID, deviceID string) (vpnClientDiagnosticStatus, bool) { + key := vpnClientDiagnosticKey{ClusterID: clusterID, DeviceID: deviceID} + h.mu.Lock() + defer h.mu.Unlock() + item, ok := h.statuses[key] + return item, ok +} + +func (h *vpnClientDiagnosticHub) List(clusterID string) []vpnClientDiagnosticStatus { + h.mu.Lock() + defer h.mu.Unlock() + out := make([]vpnClientDiagnosticStatus, 0) + for key, item := range h.statuses { + if key.ClusterID == clusterID { + out = append(out, item) + } + } + sort.Slice(out, func(i, j int) bool { + return out[i].ObservedAt.After(out[j].ObservedAt) + }) + return out +} + +func (h *vpnClientDiagnosticHub) Enqueue(clusterID, deviceID string, payload map[string]any) vpnClientDiagnosticCommand { + item := vpnClientDiagnosticCommand{ + ID: fmt.Sprintf("vpn_diag_%d", time.Now().UnixNano()), + ClusterID: clusterID, + DeviceID: deviceID, + Payload: cloneDiagnosticPayload(payload), + CreatedAt: time.Now().UTC(), + } + queue := h.queue(vpnClientDiagnosticKey{ClusterID: clusterID, DeviceID: deviceID}) + if vpnClientDiagnosticCommandIsPriorityStop(payload) { + drainVPNClientDiagnosticQueue(queue) + } + select { + case queue <- item: + default: + select { + case <-queue: + default: + } + queue <- item + } + return item +} + +func vpnClientDiagnosticCommandIsPriorityStop(payload map[string]any) bool { + commandType, _ := payload["type"].(string) + return strings.TrimSpace(commandType) == "stop_vpn" +} + +func drainVPNClientDiagnosticQueue(queue chan vpnClientDiagnosticCommand) { + for { + select { + case <-queue: + default: + return + } + } +} + +func (h *vpnClientDiagnosticHub) Pop(ctx context.Context, clusterID, deviceID string, timeout time.Duration) (vpnClientDiagnosticCommand, bool) { + queue := h.queue(vpnClientDiagnosticKey{ClusterID: clusterID, DeviceID: deviceID}) + if timeout <= 0 { + select { + case item := <-queue: + return item, true + default: + return 
vpnClientDiagnosticCommand{}, false + } + } + timer := time.NewTimer(timeout) + defer timer.Stop() + select { + case item := <-queue: + return item, true + case <-timer.C: + return vpnClientDiagnosticCommand{}, false + case <-ctx.Done(): + return vpnClientDiagnosticCommand{}, false + } +} + +func (h *vpnClientDiagnosticHub) queue(key vpnClientDiagnosticKey) chan vpnClientDiagnosticCommand { + h.mu.Lock() + defer h.mu.Unlock() + queue := h.queues[key] + if queue == nil { + queue = make(chan vpnClientDiagnosticCommand, 32) + h.queues[key] = queue + } + return queue +} + +func cloneDiagnosticPayload(payload map[string]any) map[string]any { + out := map[string]any{} + for key, value := range payload { + out[key] = value + } + return out +} + func (m *Module) getClusterAuthority(w http.ResponseWriter, r *http.Request) { item, err := m.service.GetClusterAuthorityState(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID")) if writeServiceError(w, err) { @@ -1237,11 +3197,173 @@ func (m *Module) upsertFabricTestingFlag(w http.ResponseWriter, r *http.Request) func (m *Module) listAuditEvents(w http.ResponseWriter, r *http.Request) { limit, _ := strconv.Atoi(r.URL.Query().Get("limit")) - items, err := m.service.ListAuditEvents(r.Context(), r.URL.Query().Get("actor_user_id"), chi.URLParam(r, "clusterID"), limit) + items, err := m.service.ListAuditEvents(r.Context(), r.URL.Query().Get("actor_user_id"), ListAuditEventsInput{ + ClusterID: chi.URLParam(r, "clusterID"), + EventTypes: queryStringList(r, "event_type"), + TargetTypes: queryStringList(r, "target_type"), + Correlation: r.URL.Query().Get("correlation"), + Limit: limit, + }) if writeServiceError(w, err) { return } - httpx.WriteJSON(w, http.StatusOK, map[string]any{"audit_events": items}) + httpx.WriteJSON(w, http.StatusOK, map[string]any{ + "audit_events": items, + "audit_summary": summarizeClusterAuditEvents(items), + }) +} + +func queryStringList(r *http.Request, key string) []string { + values 
:= []string{} + for _, raw := range r.URL.Query()[key] { + for _, part := range strings.Split(raw, ",") { + if value := strings.TrimSpace(part); value != "" { + values = append(values, value) + } + } + } + return values +} + +func (m *Module) streamClusterEvents(w http.ResponseWriter, r *http.Request) { + flusher, ok := w.(http.Flusher) + if !ok { + httpx.WriteError(w, http.StatusInternalServerError, "streaming is not supported") + return + } + actorUserID := r.URL.Query().Get("actor_user_id") + clusterID := chi.URLParam(r, "clusterID") + w.Header().Set("Content-Type", "text/event-stream") + w.Header().Set("Cache-Control", "no-cache, no-transform") + w.Header().Set("Connection", "keep-alive") + w.Header().Set("X-Accel-Buffering", "no") + _, _ = fmt.Fprint(w, "retry: 5000\n\n") + flusher.Flush() + + ticker := time.NewTicker(5 * time.Second) + defer ticker.Stop() + var lastRevision string + for { + revision, payload, err := m.clusterEventSnapshot(r, actorUserID, clusterID) + if err != nil { + _ = writeSSE(w, "cluster.error", map[string]any{ + "cluster_id": clusterID, + "error": err.Error(), + "observed_at": time.Now().UTC(), + }) + flusher.Flush() + return + } + if revision != lastRevision { + lastRevision = revision + _ = writeSSE(w, "cluster.changed", payload) + flusher.Flush() + } else { + _, _ = fmt.Fprintf(w, ": keepalive %s\n\n", time.Now().UTC().Format(time.RFC3339Nano)) + flusher.Flush() + } + select { + case <-r.Context().Done(): + return + case <-ticker.C: + } + } +} + +func (m *Module) clusterEventSnapshot(r *http.Request, actorUserID, clusterID string) (string, map[string]any, error) { + ctx := r.Context() + summaries, err := m.service.ListClusterAdminSummaries(ctx, actorUserID) + if err != nil { + return "", nil, err + } + var selected *ClusterAdminSummary + for i := range summaries { + if summaries[i].ClusterID == clusterID { + selected = &summaries[i] + break + } + } + if selected == nil { + return "", nil, pgx.ErrNoRows + } + nodes, err := 
m.service.ListClusterNodes(ctx, actorUserID, clusterID) + if err != nil { + return "", nil, err + } + joinRequests, err := m.service.ListJoinRequests(ctx, actorUserID, clusterID) + if err != nil { + return "", nil, err + } + meshLinks, err := m.service.ListMeshLinks(ctx, actorUserID, clusterID) + if err != nil { + return "", nil, err + } + payload := map[string]any{ + "cluster_id": clusterID, + "observed_at": time.Now().UTC(), + "node_count": len(nodes), + "join_request_count": len(joinRequests), + "mesh_link_count": len(meshLinks), + "summary": selected, + "latest_node_seen_at": latestNodeSeenAt(nodes), + "latest_mesh_seen_at": latestMeshLinkSeenAt(meshLinks), + } + revisionPayload := map[string]any{ + "cluster_id": clusterID, + "node_count": len(nodes), + "join_request_count": len(joinRequests), + "mesh_link_count": len(meshLinks), + "summary": selected, + "latest_node_seen_at": latestNodeSeenAt(nodes), + "latest_mesh_seen_at": latestMeshLinkSeenAt(meshLinks), + } + revisionBytes, err := json.Marshal(revisionPayload) + if err != nil { + return "", nil, err + } + sum := sha256.Sum256(revisionBytes) + revision := hex.EncodeToString(sum[:]) + payload["revision"] = revision + return revision, payload, nil +} + +func writeSSE(w http.ResponseWriter, event string, payload any) error { + encoded, err := json.Marshal(payload) + if err != nil { + return err + } + if _, err := fmt.Fprintf(w, "event: %s\n", event); err != nil { + return err + } + if _, err := fmt.Fprintf(w, "data: %s\n\n", encoded); err != nil { + return err + } + return nil +} + +func latestNodeSeenAt(nodes []ClusterNode) *time.Time { + var latest *time.Time + for i := range nodes { + if nodes[i].LastSeenAt == nil { + continue + } + if latest == nil || nodes[i].LastSeenAt.After(*latest) { + value := nodes[i].LastSeenAt.UTC() + latest = &value + } + } + return latest +} + +func latestMeshLinkSeenAt(links []MeshLinkObservation) *time.Time { + var latest *time.Time + for i := range links { + if latest == nil 
|| links[i].ObservedAt.After(*latest) { + value := links[i].ObservedAt.UTC() + latest = &value + } + } + return latest } func writeServiceError(w http.ResponseWriter, err error) bool { diff --git a/backend/internal/modules/cluster/module_vpn_packet_hub_test.go b/backend/internal/modules/cluster/module_vpn_packet_hub_test.go new file mode 100644 index 0000000..4c37793 --- /dev/null +++ b/backend/internal/modules/cluster/module_vpn_packet_hub_test.go @@ -0,0 +1,216 @@ +package cluster + +import ( + "context" + "testing" + "time" +) + +func TestVPNPacketHubPopBatchAndStatsKeys(t *testing.T) { + hub := newVPNPacketHub() + key := vpnPacketKey{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + Direction: vpnDirectionClientToGateway, + } + + packetA := []byte{ + 0x45, 0x00, 0x00, 20, + 0x00, 0x01, 0x00, 0x00, + 64, 17, 0, 0, + 192, 168, 0, 1, + 192, 168, 0, 2, + 0x00, 0x50, 0x01, 0xBB, + } + + packetB := make([]byte, len(packetA)) + copy(packetB, packetA) + packetB[19] = 0xBA + + hub.Push(key, packetA) + hub.Push(key, packetB) + + packets := hub.PopBatch(context.Background(), key, 0, vpnPacketBatchMaxPackets, vpnPacketBatchMaxBytes) + if len(packets) != 2 { + t.Fatalf("expected 2 packets in batch, got %d", len(packets)) + } + + statsAny := hub.Snapshot("cluster-1", "vpn-1")[vpnDirectionClientToGateway] + stats, ok := statsAny.(map[string]any) + if !ok { + t.Fatalf("unexpected stats payload type: %T", statsAny) + } + + for _, keyName := range []string{ + "pushed", + "pushed_bytes", + "popped", + "popped_bytes", + "window_push_rate_pps", + "window_pop_rate_pps", + "window_push_rate_mbps", + "window_pop_rate_mbps", + "window_push_packets", + "window_pop_packets", + "queue_depth", + "queue_depths", + "queue_depth_max", + "queue_depth_high_watermark", + "queue_depth_high_at", + "shard_depth_high_watermark", + "shard_depth_high_at", + "queue_capacity", + "queue_shard_capacity", + "queue_full_drops", + "requeue_drops", + "cleared_stale_packets", + "flow_shard_count", + 
"flow_isolation", + } { + if _, found := stats[keyName]; !found { + t.Fatalf("missing vpn packet stat key %s", keyName) + } + } + + if got, ok := stats["popped"].(uint64); !ok || got != 2 { + t.Fatalf("expected popped=2, got %v (ok=%v)", stats["popped"], ok) + } + if got, ok := stats["pushed"].(uint64); !ok || got != 2 { + t.Fatalf("expected pushed=2, got %v (ok=%v)", stats["pushed"], ok) + } + if got, ok := stats["queue_depth_high_watermark"].(int); !ok || got < 1 { + t.Fatalf("expected queue depth high watermark, got %v (ok=%v)", stats["queue_depth_high_watermark"], ok) + } +} + +func TestVPNPacketHubGatherBehavior(t *testing.T) { + hub := newVPNPacketHub() + key := vpnPacketKey{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + Direction: vpnDirectionGatewayToClient, + } + packet := []byte{ + 0x45, 0x00, 0x00, 20, + 0x00, 0x01, 0x00, 0x00, + 64, 6, 0, 0, + 10, 0, 0, 1, + 10, 0, 0, 2, + 0x12, 0x34, 0x56, 0x78, + } + + hub.Push(key, packet) + hub.Push(key, packet) + hub.Push(key, packet) + + first, ok := hub.Pop(context.Background(), key, 0) + if !ok { + t.Fatal("expected packet from queue") + } + batch := hub.PopBatch(context.Background(), key, 0, 1, 1024) + if len(batch) != 1 { + t.Fatalf("expected 1 packet because batch limit 1, got %d", len(batch)) + } + + _ = first +} + +func TestVPNPacketHubFlowShardsReportDepths(t *testing.T) { + hub := newVPNPacketHub() + key := vpnPacketKey{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + Direction: vpnDirectionGatewayToClient, + } + + for i := byte(1); i <= 8; i++ { + packet := []byte{ + 0x45, 0x00, 0x00, 24, + 0x00, i, 0x00, 0x00, + 64, 6, 0, 0, + 10, 0, 0, i, + 192, 168, 200, i, + 0x12, i, 0x56, i, + } + if err := hub.Push(key, packet); err != nil { + t.Fatalf("push packet %d: %v", i, err) + } + } + + statsAny := hub.Snapshot("cluster-1", "vpn-1")[vpnDirectionGatewayToClient] + stats, ok := statsAny.(map[string]any) + if !ok { + t.Fatalf("unexpected stats payload type: %T", statsAny) + } + if got, ok := 
stats["queue_depth"].(int); !ok || got != 8 { + t.Fatalf("expected queue_depth=8, got %v (ok=%v)", stats["queue_depth"], ok) + } + depths, ok := stats["queue_depths"].([]int) + if !ok { + t.Fatalf("unexpected queue_depths payload type: %T", stats["queue_depths"]) + } + if len(depths) != vpnPacketFlowShardCount { + t.Fatalf("expected %d queue shards, got %d", vpnPacketFlowShardCount, len(depths)) + } + nonEmpty := 0 + for _, depth := range depths { + if depth > 0 { + nonEmpty++ + } + } + if nonEmpty < 2 { + t.Fatalf("expected packets to be distributed across at least 2 shards, got depths=%v", depths) + } +} + +func TestVPNPacketHubClearDoesNotCountAsDrop(t *testing.T) { + hub := newVPNPacketHub() + key := vpnPacketKey{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + Direction: vpnDirectionClientToGateway, + } + packet := []byte{ + 0x45, 0x00, 0x00, 20, + 0x00, 0x01, 0x00, 0x00, + 64, 6, 0, 0, + 10, 0, 0, 1, + 10, 0, 0, 2, + 0x12, 0x34, 0x56, 0x78, + } + if err := hub.Push(key, packet); err != nil { + t.Fatalf("push packet: %v", err) + } + if cleared := hub.Clear(key); cleared != 1 { + t.Fatalf("expected cleared=1, got %d", cleared) + } + statsAny := hub.Snapshot("cluster-1", "vpn-1")[vpnDirectionClientToGateway] + stats, ok := statsAny.(map[string]any) + if !ok { + t.Fatalf("unexpected stats payload type: %T", statsAny) + } + if got, ok := stats["dropped"].(uint64); !ok || got != 0 { + t.Fatalf("expected dropped=0 for stale clear, got %v (ok=%v)", stats["dropped"], ok) + } + if got, ok := stats["cleared_stale_packets"].(uint64); !ok || got != 1 { + t.Fatalf("expected cleared_stale_packets=1, got %v (ok=%v)", stats["cleared_stale_packets"], ok) + } +} + +func TestVPNClientDiagnosticStopCommandDrainsPendingWork(t *testing.T) { + hub := newVPNClientDiagnosticHub() + hub.Enqueue("cluster-1", "device-1", map[string]any{"type": "vpn_page_probe", "url": "https://speedtest.rt.ru/"}) + hub.Enqueue("cluster-1", "device-1", map[string]any{"type": "vpn_tcp_connect", 
"host": "192.168.200.95"}) + hub.Enqueue("cluster-1", "device-1", map[string]any{"type": "stop_vpn"}) + + item, ok := hub.Pop(context.Background(), "cluster-1", "device-1", time.Millisecond) + if !ok { + t.Fatal("expected priority stop command") + } + if got, _ := item.Payload["type"].(string); got != "stop_vpn" { + t.Fatalf("first command = %q, want stop_vpn", got) + } + if item, ok := hub.Pop(context.Background(), "cluster-1", "device-1", 0); ok { + t.Fatalf("expected old commands to be drained, got %#v", item.Payload) + } +} diff --git a/backend/internal/modules/cluster/postgres_store.go b/backend/internal/modules/cluster/postgres_store.go index a968eab..b4b7683 100644 --- a/backend/internal/modules/cluster/postgres_store.go +++ b/backend/internal/modules/cluster/postgres_store.go @@ -26,6 +26,9 @@ type PostgresStore struct { const encryptedClusterAuthorityKeyPrefix = "enc:v1:" +const nodeHeartbeatStaleIntervalSQL = "1 minute" +const meshLinkStaleIntervalSQL = "2 minutes" + func NewPostgresStore(db *pgxpool.Pool, verifiers ...*authority.Verifier) *PostgresStore { var authorityVerifier *authority.Verifier if len(verifiers) > 0 { @@ -81,6 +84,13 @@ func clusterAuthorityPrivateKeyAAD(clusterID string) []byte { return []byte("rap-cluster-authority-v1|" + strings.TrimSpace(clusterID)) } +func stringPtrValue(value *string) string { + if value == nil { + return "" + } + return strings.TrimSpace(*value) +} + func (s *PostgresStore) GetPlatformRole(ctx context.Context, userID string) (string, error) { return authority.EffectivePlatformRole(ctx, s.db, s.authority, userID) } @@ -231,7 +241,30 @@ func (s *PostgresStore) EnsureClusterAuthority(ctx context.Context, clusterID st func (s *PostgresStore) ListClusterNodes(ctx context.Context, clusterID string) ([]ClusterNode, error) { rows, err := s.db.Query(ctx, ` SELECT n.id::text, n.owner_organization_id::text, n.node_key, n.name, n.ownership_type, - n.registration_status, n.health_status, n.version_state, n.partition_state, + 
n.registration_status, + CASE + WHEN n.registration_status = 'active' + AND COALESCE(n.last_seen_at, n.updated_at, n.created_at) < NOW() - $2::interval THEN 'offline' + ELSE n.health_status + END AS health_status, + CASE + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version = update_policy.target_version THEN 'current' + WHEN update_status.status IN ('failed', 'error') THEN 'failed' + WHEN update_status.phase = 'rollback' OR update_status.status = 'rolled_back' THEN 'rollback' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version + AND update_status.target_version = update_policy.target_version + AND update_status.phase IN ('planned', 'download', 'apply', 'health_check') + AND update_status.status IN ('accepted', 'started', 'staged', 'running') THEN 'updating' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version THEN 'outdated' + ELSE n.version_state + END AS version_state, + n.partition_state, n.reported_version, n.last_seen_at, cm.membership_status, cm.metadata, ng.id::text, ng.name, n.created_at, n.updated_at @@ -239,9 +272,28 @@ func (s *PostgresStore) ListClusterNodes(ctx context.Context, clusterID string) JOIN nodes n ON n.id = cm.node_id LEFT JOIN cluster_node_group_memberships ngm ON ngm.cluster_id = cm.cluster_id AND ngm.node_id = cm.node_id LEFT JOIN cluster_node_groups ng ON ng.cluster_id = ngm.cluster_id AND ng.id = ngm.group_id + LEFT JOIN LATERAL ( + SELECT p.enabled, p.target_version + FROM node_update_desired_policies p + WHERE p.cluster_id = cm.cluster_id + AND p.node_id = cm.node_id + AND p.product = 'rap-node-agent' + AND p.enabled + ORDER BY p.updated_at DESC + LIMIT 1 + ) update_policy ON true + LEFT JOIN LATERAL ( + SELECT s.target_version, s.phase, s.status + FROM node_update_status_reports s + WHERE s.cluster_id 
= cm.cluster_id + AND s.node_id = cm.node_id + AND s.product = 'rap-node-agent' + ORDER BY s.observed_at DESC + LIMIT 1 + ) update_status ON true WHERE cm.cluster_id = $1::uuid ORDER BY n.created_at DESC - `, clusterID) + `, clusterID, nodeHeartbeatStaleIntervalSQL) if err != nil { return nil, err } @@ -332,6 +384,32 @@ func (s *PostgresStore) CreateJoinToken(ctx context.Context, input CreateJoinTok return scanJoinToken(row) } +func (s *PostgresStore) ListJoinTokens(ctx context.Context, clusterID string) ([]NodeJoinToken, error) { + rows, err := s.db.Query(ctx, ` + SELECT id::text, cluster_id::text, scope, expires_at, max_uses, used_count, status, + created_by_user_id::text, created_at, revoked_at, authority_payload, authority_signature + FROM node_join_tokens + WHERE cluster_id = $1::uuid + ORDER BY created_at DESC + `, clusterID) + if err != nil { + return nil, err + } + defer rows.Close() + var out []NodeJoinToken + for rows.Next() { + item, err := scanJoinToken(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + if out == nil { + out = []NodeJoinToken{} + } + return out, rows.Err() +} + func (s *PostgresStore) SetJoinTokenAuthority(ctx context.Context, clusterID, tokenID string, payload json.RawMessage, signature ClusterSignature) (NodeJoinToken, error) { signatureJSON, err := json.Marshal(signature) if err != nil { @@ -494,6 +572,13 @@ func (s *PostgresStore) ApproveJoinRequest(ctx context.Context, input ApproveJoi if ownershipType == "" { ownershipType = "platform_managed" } + nodeGroupID := strings.TrimSpace(stringPtrValue(input.NodeGroupID)) + if nodeGroupID == "" { + nodeGroupID, err = s.joinRequestTokenNodeGroupID(ctx, tx, req) + if err != nil { + return ApprovedJoinRequest{}, err + } + } if _, err := tx.Exec(ctx, ` INSERT INTO nodes ( @@ -510,6 +595,37 @@ func (s *PostgresStore) ApproveJoinRequest(ctx context.Context, input ApproveJoi `, input.ClusterID, nodeID, now, []byte(`{"created_from_join_request":true}`)); err != nil { 
return ApprovedJoinRequest{}, err } + if nodeGroupID != "" { + tag, err := tx.Exec(ctx, ` + INSERT INTO cluster_node_group_memberships (cluster_id, node_id, group_id, assigned_by_user_id, assigned_at, metadata) + SELECT $1::uuid, $2::uuid, id, $4::uuid, $5, $6::jsonb + FROM cluster_node_groups + WHERE cluster_id = $1::uuid + AND id = $3::uuid + `, input.ClusterID, nodeID, nodeGroupID, input.ActorUserID, now, []byte(`{"source":"join_token_scope"}`)) + if err != nil { + return ApprovedJoinRequest{}, err + } + if tag.RowsAffected() != 1 { + return ApprovedJoinRequest{}, ErrInvalidPayload + } + } + roles, err := s.joinRequestTokenRoles(ctx, tx, req) + if err != nil { + return ApprovedJoinRequest{}, err + } + for _, role := range roles { + if _, ok := allowedNodeRoles[role]; !ok { + return ApprovedJoinRequest{}, ErrInvalidPayload + } + if _, err := tx.Exec(ctx, ` + INSERT INTO node_role_assignments (id, cluster_id, node_id, role, status, policy, assigned_by_user_id, assigned_at) + VALUES ($1::uuid, $2::uuid, $3::uuid, $4, 'active', $5::jsonb, $6::uuid, $7) + ON CONFLICT DO NOTHING + `, uuid.NewString(), input.ClusterID, nodeID, role, []byte(`{"source":"join_token_scope"}`), input.ActorUserID, now); err != nil { + return ApprovedJoinRequest{}, err + } + } if _, err := tx.Exec(ctx, ` INSERT INTO node_identities (node_id, public_key, identity_status, metadata, created_at, updated_at) @@ -560,6 +676,75 @@ func (s *PostgresStore) ApproveJoinRequest(ctx context.Context, input ApproveJoi }, nil } +func (s *PostgresStore) joinRequestTokenNodeGroupID(ctx context.Context, tx pgx.Tx, req NodeJoinRequest) (string, error) { + if req.JoinTokenID == nil || strings.TrimSpace(*req.JoinTokenID) == "" { + return "", nil + } + var scopeBytes []byte + if err := tx.QueryRow(ctx, ` + SELECT scope + FROM node_join_tokens + WHERE cluster_id = $1::uuid + AND id = $2::uuid + `, req.ClusterID, *req.JoinTokenID).Scan(&scopeBytes); err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return "", nil + 
} + return "", err + } + var scope struct { + NodeGroupID string `json:"node_group_id"` + } + if len(scopeBytes) == 0 || !json.Valid(scopeBytes) { + return "", nil + } + if err := json.Unmarshal(scopeBytes, &scope); err != nil { + return "", err + } + return strings.TrimSpace(scope.NodeGroupID), nil +} + +func (s *PostgresStore) joinRequestTokenRoles(ctx context.Context, tx pgx.Tx, req NodeJoinRequest) ([]string, error) { + if req.JoinTokenID == nil || strings.TrimSpace(*req.JoinTokenID) == "" { + return nil, nil + } + var scopeBytes []byte + if err := tx.QueryRow(ctx, ` + SELECT scope + FROM node_join_tokens + WHERE cluster_id = $1::uuid + AND id = $2::uuid + `, req.ClusterID, *req.JoinTokenID).Scan(&scopeBytes); err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return nil, nil + } + return nil, err + } + var scope struct { + Roles []string `json:"roles"` + } + if len(scopeBytes) == 0 || !json.Valid(scopeBytes) { + return nil, nil + } + if err := json.Unmarshal(scopeBytes, &scope); err != nil { + return nil, err + } + out := make([]string, 0, len(scope.Roles)) + seen := map[string]struct{}{} + for _, role := range scope.Roles { + role = strings.TrimSpace(role) + if role == "" { + continue + } + if _, ok := seen[role]; ok { + continue + } + seen[role] = struct{}{} + out = append(out, role) + } + return out, nil +} + func (s *PostgresStore) SetJoinRequestApprovalAuthority(ctx context.Context, clusterID, joinRequestID string, payload json.RawMessage, signature ClusterSignature) (NodeJoinRequest, error) { signatureJSON, err := json.Marshal(signature) if err != nil { @@ -726,7 +911,30 @@ func (s *PostgresStore) AttachExistingNodeToCluster(ctx context.Context, input A row := tx.QueryRow(ctx, ` SELECT n.id::text, n.owner_organization_id::text, n.node_key, n.name, n.ownership_type, - n.registration_status, n.health_status, n.version_state, n.partition_state, + n.registration_status, + CASE + WHEN n.registration_status = 'active' + AND COALESCE(n.last_seen_at, 
n.updated_at, n.created_at) < NOW() - $3::interval THEN 'offline' + ELSE n.health_status + END AS health_status, + CASE + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version = update_policy.target_version THEN 'current' + WHEN update_status.status IN ('failed', 'error') THEN 'failed' + WHEN update_status.phase = 'rollback' OR update_status.status = 'rolled_back' THEN 'rollback' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version + AND update_status.target_version = update_policy.target_version + AND update_status.phase IN ('planned', 'download', 'apply', 'health_check') + AND update_status.status IN ('accepted', 'started', 'staged', 'running') THEN 'updating' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version THEN 'outdated' + ELSE n.version_state + END AS version_state, + n.partition_state, n.reported_version, n.last_seen_at, cm.membership_status, cm.metadata, ng.id::text, ng.name, n.created_at, n.updated_at @@ -734,9 +942,28 @@ func (s *PostgresStore) AttachExistingNodeToCluster(ctx context.Context, input A JOIN nodes n ON n.id = cm.node_id LEFT JOIN cluster_node_group_memberships ngm ON ngm.cluster_id = cm.cluster_id AND ngm.node_id = cm.node_id LEFT JOIN cluster_node_groups ng ON ng.cluster_id = ngm.cluster_id AND ng.id = ngm.group_id + LEFT JOIN LATERAL ( + SELECT p.enabled, p.target_version + FROM node_update_desired_policies p + WHERE p.cluster_id = cm.cluster_id + AND p.node_id = cm.node_id + AND p.product = 'rap-node-agent' + AND p.enabled + ORDER BY p.updated_at DESC + LIMIT 1 + ) update_policy ON true + LEFT JOIN LATERAL ( + SELECT s.target_version, s.phase, s.status + FROM node_update_status_reports s + WHERE s.cluster_id = cm.cluster_id + AND s.node_id = cm.node_id + AND s.product = 'rap-node-agent' + ORDER BY 
s.observed_at DESC + LIMIT 1 + ) update_status ON true WHERE cm.cluster_id = $1::uuid AND cm.node_id = $2::uuid - `, input.ClusterID, input.NodeID) + `, input.ClusterID, input.NodeID, nodeHeartbeatStaleIntervalSQL) item, err := scanClusterNode(row) if err != nil { return ClusterNode{}, err @@ -833,7 +1060,30 @@ func (s *PostgresStore) AssignNodeToGroup(ctx context.Context, input AssignNodeG row := tx.QueryRow(ctx, ` SELECT n.id::text, n.owner_organization_id::text, n.node_key, n.name, n.ownership_type, - n.registration_status, n.health_status, n.version_state, n.partition_state, + n.registration_status, + CASE + WHEN n.registration_status = 'active' + AND COALESCE(n.last_seen_at, n.updated_at, n.created_at) < NOW() - $3::interval THEN 'offline' + ELSE n.health_status + END AS health_status, + CASE + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version = update_policy.target_version THEN 'current' + WHEN update_status.status IN ('failed', 'error') THEN 'failed' + WHEN update_status.phase = 'rollback' OR update_status.status = 'rolled_back' THEN 'rollback' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version + AND update_status.target_version = update_policy.target_version + AND update_status.phase IN ('planned', 'download', 'apply', 'health_check') + AND update_status.status IN ('accepted', 'started', 'staged', 'running') THEN 'updating' + WHEN update_policy.enabled + AND update_policy.target_version IS NOT NULL + AND n.reported_version IS DISTINCT FROM update_policy.target_version THEN 'outdated' + ELSE n.version_state + END AS version_state, + n.partition_state, n.reported_version, n.last_seen_at, cm.membership_status, cm.metadata, ng.id::text, ng.name, n.created_at, n.updated_at @@ -841,9 +1091,28 @@ func (s *PostgresStore) AssignNodeToGroup(ctx context.Context, input AssignNodeG JOIN nodes n ON n.id = cm.node_id LEFT JOIN 
cluster_node_group_memberships ngm ON ngm.cluster_id = cm.cluster_id AND ngm.node_id = cm.node_id LEFT JOIN cluster_node_groups ng ON ng.cluster_id = ngm.cluster_id AND ng.id = ngm.group_id + LEFT JOIN LATERAL ( + SELECT p.enabled, p.target_version + FROM node_update_desired_policies p + WHERE p.cluster_id = cm.cluster_id + AND p.node_id = cm.node_id + AND p.product = 'rap-node-agent' + AND p.enabled + ORDER BY p.updated_at DESC + LIMIT 1 + ) update_policy ON true + LEFT JOIN LATERAL ( + SELECT s.target_version, s.phase, s.status + FROM node_update_status_reports s + WHERE s.cluster_id = cm.cluster_id + AND s.node_id = cm.node_id + AND s.product = 'rap-node-agent' + ORDER BY s.observed_at DESC + LIMIT 1 + ) update_status ON true WHERE cm.cluster_id = $1::uuid AND cm.node_id = $2::uuid - `, input.ClusterID, input.NodeID) + `, input.ClusterID, input.NodeID, nodeHeartbeatStaleIntervalSQL) item, err := scanClusterNode(row) if err != nil { return ClusterNode{}, err @@ -941,6 +1210,220 @@ func (s *PostgresStore) ListNodeHeartbeats(ctx context.Context, clusterID, nodeI return out, rows.Err() } +func (s *PostgresStore) CreateReleaseVersion(ctx context.Context, input CreateReleaseVersionInput) (ReleaseVersion, error) { + tx, err := s.db.Begin(ctx) + if err != nil { + return ReleaseVersion{}, err + } + defer tx.Rollback(ctx) + + releaseID := uuid.NewString() + row := tx.QueryRow(ctx, ` + INSERT INTO release_versions ( + id, cluster_id, product, version, channel, status, compatibility, changelog, created_by_user_id, created_at + ) VALUES ($1::uuid, $2::uuid, $3, $4, $5, $6, $7::jsonb, $8, $9::uuid, NOW()) + RETURNING id::text, cluster_id::text, product, version, channel, status, compatibility, + changelog, created_by_user_id::text, created_at, authority_payload, authority_signature + `, releaseID, input.ClusterID, input.Product, input.Version, input.Channel, input.Status, []byte(input.Compatibility), input.Changelog, input.ActorUserID) + item, err := scanReleaseVersion(row) + 
if err != nil { + return ReleaseVersion{}, err + } + for _, artifact := range input.Artifacts { + artifactID := uuid.NewString() + row := tx.QueryRow(ctx, ` + INSERT INTO release_artifacts ( + id, release_id, cluster_id, product, version, os, arch, install_type, kind, + url, sha256, size_bytes, signature, metadata, created_at + ) VALUES ( + $1::uuid, $2::uuid, $3::uuid, $4, $5, $6, $7, $8, $9, + $10, $11, $12, $13, $14::jsonb, NOW() + ) + RETURNING id::text, release_id::text, cluster_id::text, product, version, os, arch, + install_type, kind, url, sha256, size_bytes, signature, metadata, created_at + `, artifactID, releaseID, input.ClusterID, input.Product, input.Version, artifact.OS, artifact.Arch, artifact.InstallType, artifact.Kind, artifact.URL, artifact.SHA256, artifact.SizeBytes, artifact.Signature, []byte(artifact.Metadata)) + storedArtifact, err := scanReleaseArtifact(row) + if err != nil { + return ReleaseVersion{}, err + } + item.Artifacts = append(item.Artifacts, storedArtifact) + } + if err := tx.Commit(ctx); err != nil { + return ReleaseVersion{}, err + } + return item, nil +} + +func (s *PostgresStore) ListReleaseVersions(ctx context.Context, clusterID, product, channel string) ([]ReleaseVersion, error) { + query := ` + SELECT id::text, cluster_id::text, product, version, channel, status, compatibility, + changelog, created_by_user_id::text, created_at, authority_payload, authority_signature + FROM release_versions + WHERE cluster_id = $1::uuid + AND ($2 = '' OR product = $2) + AND ($3 = '' OR channel = $3) + ORDER BY created_at DESC, version DESC + ` + rows, err := s.db.Query(ctx, query, clusterID, product, channel) + if err != nil { + return nil, err + } + defer rows.Close() + var out []ReleaseVersion + for rows.Next() { + item, err := scanReleaseVersion(rows) + if err != nil { + return nil, err + } + artifacts, err := s.listReleaseArtifacts(ctx, item.ID) + if err != nil { + return nil, err + } + item.Artifacts = artifacts + out = append(out, item) 
+ } + return out, rows.Err() +} + +func (s *PostgresStore) listReleaseArtifacts(ctx context.Context, releaseID string) ([]ReleaseArtifact, error) { + rows, err := s.db.Query(ctx, ` + SELECT id::text, release_id::text, cluster_id::text, product, version, os, arch, + install_type, kind, url, sha256, size_bytes, signature, metadata, created_at + FROM release_artifacts + WHERE release_id = $1::uuid + ORDER BY os, arch, install_type, kind + `, releaseID) + if err != nil { + return nil, err + } + defer rows.Close() + var out []ReleaseArtifact + for rows.Next() { + item, err := scanReleaseArtifact(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) ListNodeUpdateServiceCandidates(ctx context.Context, clusterID string) ([]NodeUpdateServiceCandidate, error) { + rows, err := s.db.Query(ctx, ` + SELECT n.id::text, + n.name, + COALESCE(lh.metadata #>> '{mesh_endpoint_report,peer_endpoint}', '') AS endpoint, + COALESCE(lh.metadata #>> '{mesh_endpoint_report,region}', '') AS region, + n.last_seen_at + FROM node_role_assignments r + JOIN nodes n ON n.id = r.node_id + JOIN cluster_memberships cm ON cm.cluster_id = r.cluster_id AND cm.node_id = r.node_id + LEFT JOIN node_latest_heartbeats lh ON lh.cluster_id = r.cluster_id AND lh.node_id = r.node_id + WHERE r.cluster_id = $1::uuid + AND r.role = 'update-cache' + AND r.status = 'active' + AND cm.membership_status = 'active' + AND n.registration_status = 'active' + AND n.health_status = 'healthy' + AND COALESCE(n.last_seen_at, n.updated_at, n.created_at) >= NOW() - $2::interval + ORDER BY + CASE WHEN COALESCE(lh.metadata #>> '{mesh_endpoint_report,peer_endpoint}', '') = '' THEN 1 ELSE 0 END, + n.last_seen_at DESC NULLS LAST, + n.name ASC + `, clusterID, nodeHeartbeatStaleIntervalSQL) + if err != nil { + return nil, err + } + defer rows.Close() + var out []NodeUpdateServiceCandidate + for rows.Next() { + var item NodeUpdateServiceCandidate + if err := 
rows.Scan(&item.NodeID, &item.NodeName, &item.Endpoint, &item.Region, &item.LastSeenAt); err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) UpsertNodeUpdatePolicy(ctx context.Context, input UpsertNodeUpdatePolicyInput) (NodeUpdatePolicy, error) { + row := s.db.QueryRow(ctx, ` + INSERT INTO node_update_desired_policies ( + cluster_id, node_id, product, channel, target_version, strategy, enabled, + rollback_allowed, health_window_seconds, updated_by_user_id, updated_at + ) VALUES ( + $1::uuid, $2::uuid, $3, $4, $5, $6, $7, $8, $9, $10::uuid, NOW() + ) + ON CONFLICT (cluster_id, node_id, product) DO UPDATE SET + channel = EXCLUDED.channel, + target_version = EXCLUDED.target_version, + strategy = EXCLUDED.strategy, + enabled = EXCLUDED.enabled, + rollback_allowed = EXCLUDED.rollback_allowed, + health_window_seconds = EXCLUDED.health_window_seconds, + updated_by_user_id = EXCLUDED.updated_by_user_id, + updated_at = NOW() + RETURNING cluster_id::text, node_id::text, product, channel, target_version, + strategy, enabled, rollback_allowed, health_window_seconds, + updated_by_user_id::text, updated_at + `, input.ClusterID, input.NodeID, input.Product, input.Channel, input.TargetVersion, input.Strategy, input.Enabled, input.RollbackAllowed, input.HealthWindowSec, input.ActorUserID) + return scanNodeUpdatePolicy(row) +} + +func (s *PostgresStore) GetNodeUpdatePolicy(ctx context.Context, clusterID, nodeID, product string) (NodeUpdatePolicy, error) { + row := s.db.QueryRow(ctx, ` + SELECT cluster_id::text, node_id::text, product, channel, target_version, + strategy, enabled, rollback_allowed, health_window_seconds, + updated_by_user_id::text, updated_at + FROM node_update_desired_policies + WHERE cluster_id = $1::uuid + AND node_id = $2::uuid + AND product = $3 + `, clusterID, nodeID, product) + return scanNodeUpdatePolicy(row) +} + +func (s *PostgresStore) ReportNodeUpdateStatus(ctx context.Context, input 
ReportNodeUpdateStatusInput) (NodeUpdateStatus, error) { + id := uuid.NewString() + row := s.db.QueryRow(ctx, ` + INSERT INTO node_update_status_reports ( + id, cluster_id, node_id, product, current_version, target_version, phase, status, + attempt_id, error_message, rollback_version, payload, observed_at + ) VALUES ($1::uuid, $2::uuid, $3::uuid, $4, $5, $6, $7, $8, $9, $10, $11, $12::jsonb, $13) + RETURNING id::text, cluster_id::text, node_id::text, product, current_version, + target_version, phase, status, attempt_id, error_message, rollback_version, + payload, observed_at + `, id, input.ClusterID, input.NodeID, input.Product, input.CurrentVersion, input.TargetVersion, input.Phase, input.Status, input.AttemptID, input.ErrorMessage, input.RollbackVersion, []byte(input.Payload), input.ObservedAt) + return scanNodeUpdateStatus(row) +} + +func (s *PostgresStore) ListNodeUpdateStatuses(ctx context.Context, clusterID, nodeID string, limit int) ([]NodeUpdateStatus, error) { + if limit <= 0 || limit > 200 { + limit = 50 + } + rows, err := s.db.Query(ctx, ` + SELECT id::text, cluster_id::text, node_id::text, product, current_version, target_version, + phase, status, attempt_id, error_message, rollback_version, payload, observed_at + FROM node_update_status_reports + WHERE cluster_id = $1::uuid AND node_id = $2::uuid + ORDER BY observed_at DESC + LIMIT $3 + `, clusterID, nodeID, limit) + if err != nil { + return nil, err + } + defer rows.Close() + out := []NodeUpdateStatus{} + for rows.Next() { + item, err := scanNodeUpdateStatus(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + func (s *PostgresStore) RevokeNodeIdentity(ctx context.Context, input RevokeNodeIdentityInput) error { tx, err := s.db.Begin(ctx) if err != nil { @@ -1011,6 +1494,79 @@ func (s *PostgresStore) DisableClusterMembership(ctx context.Context, input Disa }) } +func (s *PostgresStore) DeleteClusterNode(ctx context.Context, input 
DeleteClusterNodeInput) error { + tx, err := s.db.Begin(ctx) + if err != nil { + return err + } + defer tx.Rollback(ctx) + now := time.Now().UTC() + var nodeName string + if err := tx.QueryRow(ctx, ` + SELECT n.name + FROM cluster_memberships cm + JOIN nodes n ON n.id = cm.node_id + WHERE cm.cluster_id = $1::uuid + AND cm.node_id = $2::uuid + `, input.ClusterID, input.NodeID).Scan(&nodeName); err != nil { + return err + } + auditPayload, err := json.Marshal(map[string]any{ + "reason": input.Reason, + "node_name": nodeName, + "deleted_at": now.Format(time.RFC3339Nano), + }) + if err != nil { + return err + } + if _, err := tx.Exec(ctx, ` + INSERT INTO cluster_audit_events (cluster_id, actor_user_id, event_type, target_type, target_id, payload, created_at) + VALUES ($1::uuid, $2::uuid, 'cluster_node.deleted', 'node', $3, $4::jsonb, $5) + `, input.ClusterID, input.ActorUserID, input.NodeID, auditPayload, now); err != nil { + return err + } + if _, err := tx.Exec(ctx, ` + UPDATE node_identities + SET identity_status = 'revoked', + revoked_at = COALESCE(revoked_at, $2), + updated_at = $2, + metadata = metadata || $3::jsonb + WHERE node_id = $1::uuid + `, input.NodeID, now, []byte(fmt.Sprintf(`{"revocation_reason":%q,"revoked_by_delete":true}`, input.Reason))); err != nil { + return err + } + if _, err := tx.Exec(ctx, ` + DELETE FROM cluster_node_group_memberships + WHERE cluster_id = $1::uuid + AND node_id = $2::uuid + `, input.ClusterID, input.NodeID); err != nil { + return err + } + tag, err := tx.Exec(ctx, ` + DELETE FROM cluster_memberships + WHERE cluster_id = $1::uuid + AND node_id = $2::uuid + `, input.ClusterID, input.NodeID) + if err != nil { + return err + } + if tag.RowsAffected() != 1 { + return pgx.ErrNoRows + } + if _, err := tx.Exec(ctx, ` + DELETE FROM nodes n + WHERE n.id = $1::uuid + AND NOT EXISTS ( + SELECT 1 + FROM cluster_memberships cm + WHERE cm.node_id = n.id + ) + `, input.NodeID); err != nil { + return err + } + return tx.Commit(ctx) +} + func 
(s *PostgresStore) UpsertFabricTestingFlag(ctx context.Context, input UpsertFabricTestingFlagInput) (FabricTestingFlag, error) { if input.HistoryRetentionHours <= 0 { input.HistoryRetentionHours = 24 @@ -1373,12 +1929,44 @@ func meshMetadataString(values map[string]any, key string) string { func (s *PostgresStore) ListMeshLinks(ctx context.Context, clusterID string) ([]MeshLinkObservation, error) { rows, err := s.db.Query(ctx, ` - SELECT COALESCE(observation_id::text, '00000000-0000-0000-0000-000000000000'), cluster_id::text, - source_node_id::text, target_node_id::text, link_status, latency_ms, quality_score, metadata, observed_at - FROM mesh_latest_links - WHERE cluster_id = $1::uuid + SELECT COALESCE(observation_id::text, '00000000-0000-0000-0000-000000000000'), + cluster_id::text, source_node_id::text, target_node_id::text, + CASE WHEN stale THEN 'stale' ELSE link_status END AS link_status, + latency_ms, quality_score, + CASE + WHEN stale THEN metadata || jsonb_build_object( + 'derived_link_status', 'stale', + 'derived_link_stale', true, + 'derived_link_stale_reason', + CASE + WHEN observation_stale THEN 'observation_expired' + WHEN source_stale THEN 'source_node_offline' + WHEN target_stale THEN 'target_node_offline' + ELSE 'endpoint_unavailable' + END + ) + ELSE metadata + END AS metadata, + observed_at + FROM ( + SELECT ml.*, + ml.observed_at < NOW() - $2::interval AS observation_stale, + sn.registration_status = 'active' + AND COALESCE(sn.last_seen_at, sn.updated_at, sn.created_at) < NOW() - $3::interval AS source_stale, + tn.registration_status = 'active' + AND COALESCE(tn.last_seen_at, tn.updated_at, tn.created_at) < NOW() - $3::interval AS target_stale, + ml.observed_at < NOW() - $2::interval + OR (sn.registration_status = 'active' + AND COALESCE(sn.last_seen_at, sn.updated_at, sn.created_at) < NOW() - $3::interval) + OR (tn.registration_status = 'active' + AND COALESCE(tn.last_seen_at, tn.updated_at, tn.created_at) < NOW() - $3::interval) AS stale + 
FROM mesh_latest_links ml + JOIN nodes sn ON sn.id = ml.source_node_id + JOIN nodes tn ON tn.id = ml.target_node_id + WHERE ml.cluster_id = $1::uuid + ) latest ORDER BY observed_at DESC - `, clusterID) + `, clusterID, meshLinkStaleIntervalSQL, nodeHeartbeatStaleIntervalSQL) if err != nil { return nil, err } @@ -1400,7 +1988,7 @@ func (s *PostgresStore) CreateRouteIntent(ctx context.Context, input CreateRoute INSERT INTO mesh_route_intents ( id, cluster_id, source_selector, destination_selector, service_class, priority, status, policy, created_by_user_id, created_at, updated_at - ) VALUES ($1::uuid, $2::uuid, $3::jsonb, $4::jsonb, $5, $6, 'active', $7::jsonb, $8::uuid, NOW(), NOW()) + ) VALUES ($1::uuid, $2::uuid, $3::jsonb, $4::jsonb, $5, $6, 'active', $7::jsonb, NULLIF($8, '')::uuid, NOW(), NOW()) RETURNING id::text, cluster_id::text, source_selector, destination_selector, service_class, priority, status, policy, created_by_user_id::text, created_at, updated_at `, id, input.ClusterID, []byte(input.SourceSelector), []byte(input.DestinationSelector), input.ServiceClass, input.Priority, []byte(input.Policy), input.ActorUserID) @@ -1430,6 +2018,681 @@ func (s *PostgresStore) ListRouteIntents(ctx context.Context, clusterID string) return out, rows.Err() } +func (s *PostgresStore) ExpireRouteIntent(ctx context.Context, input RouteIntentLifecycleInput, expiresAt time.Time) (MeshRouteIntent, error) { + expiresText := expiresAt.UTC().Format(time.RFC3339Nano) + reason := strings.TrimSpace(input.Reason) + row := s.db.QueryRow(ctx, ` + UPDATE mesh_route_intents + SET policy = jsonb_set( + jsonb_set(COALESCE(policy, '{}'::jsonb), '{expires_at}', to_jsonb($3::text), true), + '{operator_expire}', + jsonb_build_object('expired_at', $3::text, 'reason', $4::text), + true + ), + updated_at = NOW() + WHERE cluster_id = $1::uuid + AND id = $2::uuid + RETURNING id::text, cluster_id::text, source_selector, destination_selector, service_class, + priority, status, policy, 
created_by_user_id::text, created_at, updated_at + `, input.ClusterID, input.RouteIntentID, expiresText, reason) + return scanRouteIntent(row) +} + +func (s *PostgresStore) DisableRouteIntent(ctx context.Context, input RouteIntentLifecycleInput) (MeshRouteIntent, error) { + reason := strings.TrimSpace(input.Reason) + disabledAt := time.Now().UTC().Format(time.RFC3339Nano) + row := s.db.QueryRow(ctx, ` + UPDATE mesh_route_intents + SET status = 'disabled', + policy = jsonb_set( + COALESCE(policy, '{}'::jsonb), + '{operator_disable}', + jsonb_build_object('disabled_at', $3::text, 'reason', $4::text), + true + ), + updated_at = NOW() + WHERE cluster_id = $1::uuid + AND id = $2::uuid + RETURNING id::text, cluster_id::text, source_selector, destination_selector, service_class, + priority, status, policy, created_by_user_id::text, created_at, updated_at + `, input.ClusterID, input.RouteIntentID, disabledAt, reason) + return scanRouteIntent(row) +} + +func (s *PostgresStore) RecordFabricServiceChannelRouteFeedback(ctx context.Context, input RecordFabricServiceChannelRouteFeedbackInput) (FabricServiceChannelRouteFeedbackObservation, error) { + tx, err := s.db.Begin(ctx) + if err != nil { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + defer tx.Rollback(ctx) + + id := uuid.NewString() + observedAt := input.ObservedAt.UTC() + if observedAt.IsZero() { + observedAt = time.Now().UTC() + } + if input.FeedbackStatus != "healthy" { + var currentPayload json.RawMessage + err := tx.QueryRow(ctx, ` + SELECT payload + FROM fabric_service_channel_route_feedback_latest + WHERE cluster_id = $1::uuid + AND reporter_node_id = $2::uuid + AND route_id = $3 + `, input.ClusterID, input.ReporterNodeID, input.RouteID).Scan(&currentPayload) + if err != nil && !errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + if cooldownUntil := fabricServiceChannelRetryCooldownUntil(currentPayload); cooldownUntil != nil && 
cooldownUntil.After(observedAt) { + input = fabricServiceChannelFeedbackSuppressedByOperatorCooldown(input, *cooldownUntil, observedAt) + } + } + expiresAt := input.ExpiresAt.UTC() + if expiresAt.IsZero() { + expiresAt = observedAt.Add(fabricServiceChannelFeedbackMaxAge) + } + row := tx.QueryRow(ctx, ` + INSERT INTO fabric_service_channel_route_feedback_observations ( + id, cluster_id, reporter_node_id, route_id, service_class, feedback_status, + score_adjustment, reasons, last_error, consecutive_failures, stall_count, + last_send_duration_ms, payload, observed_at, expires_at + ) VALUES ($1::uuid, $2::uuid, $3::uuid, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13::jsonb, $14, $15) + RETURNING id::text, cluster_id::text, reporter_node_id::text, route_id, service_class, + feedback_status, score_adjustment, reasons, last_error, consecutive_failures, + stall_count, last_send_duration_ms, payload, observed_at, expires_at + `, id, input.ClusterID, input.ReporterNodeID, input.RouteID, input.ServiceClass, input.FeedbackStatus, input.ScoreAdjustment, + input.Reasons, input.LastError, input.ConsecutiveFailures, input.StallCount, input.LastSendDurationMs, []byte(input.Payload), observedAt, expiresAt) + item, err := scanFabricServiceChannelRouteFeedback(row) + if err != nil { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + if _, err := tx.Exec(ctx, ` + INSERT INTO fabric_service_channel_route_feedback_latest ( + cluster_id, reporter_node_id, route_id, observation_id, service_class, feedback_status, + score_adjustment, reasons, last_error, consecutive_failures, stall_count, + last_send_duration_ms, payload, observed_at, expires_at + ) VALUES ($1::uuid, $2::uuid, $3, $4::uuid, $5, $6, $7, $8, $9, $10, $11, $12, $13::jsonb, $14, $15) + ON CONFLICT (cluster_id, reporter_node_id, route_id) DO UPDATE SET + observation_id = EXCLUDED.observation_id, + service_class = EXCLUDED.service_class, + feedback_status = EXCLUDED.feedback_status, + score_adjustment = 
EXCLUDED.score_adjustment, + reasons = EXCLUDED.reasons, + last_error = EXCLUDED.last_error, + consecutive_failures = EXCLUDED.consecutive_failures, + stall_count = EXCLUDED.stall_count, + last_send_duration_ms = EXCLUDED.last_send_duration_ms, + payload = EXCLUDED.payload, + observed_at = EXCLUDED.observed_at, + expires_at = EXCLUDED.expires_at + WHERE fabric_service_channel_route_feedback_latest.observed_at <= EXCLUDED.observed_at + AND NOT ( + EXCLUDED.feedback_status = 'healthy' + AND fabric_service_channel_route_feedback_latest.feedback_status IN ('degraded', 'fenced') + AND fabric_service_channel_route_feedback_latest.expires_at > EXCLUDED.observed_at + ) + `, item.ClusterID, item.ReporterNodeID, item.RouteID, item.ID, item.ServiceClass, item.FeedbackStatus, item.ScoreAdjustment, + item.Reasons, item.LastError, item.ConsecutiveFailures, item.StallCount, item.LastSendDurationMs, []byte(item.Payload), item.ObservedAt, item.ExpiresAt); err != nil { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + if err := tx.Commit(ctx); err != nil { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + return item, nil +} + +func (s *PostgresStore) ListFabricServiceChannelRouteFeedback(ctx context.Context, input ListFabricServiceChannelRouteFeedbackInput) ([]FabricServiceChannelRouteFeedbackObservation, error) { + now := input.Now.UTC() + if now.IsZero() { + now = time.Now().UTC() + } + rows, err := s.db.Query(ctx, ` + SELECT observation_id::text, cluster_id::text, reporter_node_id::text, route_id, service_class, + feedback_status, score_adjustment, reasons, last_error, consecutive_failures, + stall_count, last_send_duration_ms, payload, observed_at, expires_at + FROM fabric_service_channel_route_feedback_latest + WHERE cluster_id = $1::uuid + AND (NULLIF($2, '') IS NULL OR reporter_node_id = NULLIF($2, '')::uuid) + AND ($3 = '' OR route_id = $3) + AND ($4 = '' OR service_class = $4) + AND ($5 = '' OR feedback_status = $5) + AND ($6::boolean OR 
expires_at > $7) + ORDER BY observed_at DESC + `, input.ClusterID, input.ReporterNodeID, input.RouteID, input.ServiceClass, input.FeedbackStatus, input.IncludeExpired, now) + if err != nil { + return nil, err + } + defer rows.Close() + out := []FabricServiceChannelRouteFeedbackObservation{} + for rows.Next() { + item, err := scanFabricServiceChannelRouteFeedback(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) StoreFabricServiceChannelLease(ctx context.Context, input StoreFabricServiceChannelLeaseInput) (FabricServiceChannelLeaseRecord, error) { + lease := input.Lease + if lease.ClusterID == "" || lease.ChannelID == "" || input.TokenHash == "" { + return FabricServiceChannelLeaseRecord{}, ErrInvalidPayload + } + storedLease := lease + storedLease.Token.Token = "" + rawLease, err := json.Marshal(storedLease) + if err != nil { + return FabricServiceChannelLeaseRecord{}, err + } + row := s.db.QueryRow(ctx, ` + INSERT INTO fabric_service_channel_leases ( + cluster_id, channel_id, token_hash, resource_id, service_class, + selected_entry_node_id, expires_at, lease, created_at, updated_at + ) VALUES ( + $1::uuid, $2::uuid, $3, $4, $5, $6::uuid, $7, $8::jsonb, NOW(), NOW() + ) + ON CONFLICT (cluster_id, channel_id) DO UPDATE SET + token_hash = EXCLUDED.token_hash, + resource_id = EXCLUDED.resource_id, + service_class = EXCLUDED.service_class, + selected_entry_node_id = EXCLUDED.selected_entry_node_id, + expires_at = EXCLUDED.expires_at, + lease = EXCLUDED.lease, + updated_at = NOW() + RETURNING cluster_id::text, channel_id::text, token_hash, resource_id, service_class, + selected_entry_node_id::text, expires_at, lease, created_at, updated_at + `, lease.ClusterID, lease.ChannelID, input.TokenHash, lease.ResourceID, lease.ServiceClass, lease.SelectedEntryNodeID, lease.ExpiresAt, rawLease) + return scanFabricServiceChannelLeaseRecord(row) +} + +func (s *PostgresStore) 
GetFabricServiceChannelLease(ctx context.Context, clusterID, channelID string) (FabricServiceChannelLeaseRecord, error) { + row := s.db.QueryRow(ctx, ` + SELECT cluster_id::text, channel_id::text, token_hash, resource_id, service_class, + selected_entry_node_id::text, expires_at, lease, created_at, updated_at + FROM fabric_service_channel_leases + WHERE cluster_id = $1::uuid + AND channel_id = $2::uuid + `, clusterID, channelID) + return scanFabricServiceChannelLeaseRecord(row) +} + +func (s *PostgresStore) ListFabricServiceChannelLeases(ctx context.Context, input ListFabricServiceChannelLeasesInput) ([]FabricServiceChannelLeaseRecord, error) { + now := input.Now.UTC() + if now.IsZero() { + now = time.Now().UTC() + } + if input.Limit <= 0 || input.Limit > 500 { + input.Limit = 100 + } + rows, err := s.db.Query(ctx, ` + SELECT cluster_id::text, channel_id::text, token_hash, resource_id, service_class, + selected_entry_node_id::text, expires_at, lease, created_at, updated_at + FROM fabric_service_channel_leases + WHERE cluster_id = $1::uuid + AND ($2 = '' OR service_class = $2) + AND (NULLIF($3, '') IS NULL OR selected_entry_node_id = NULLIF($3, '')::uuid) + AND ($4 = '' OR resource_id = $4) + AND ($5::boolean OR expires_at > $6) + ORDER BY expires_at DESC + LIMIT $7 + `, input.ClusterID, input.ServiceClass, input.EntryNodeID, input.ResourceID, input.IncludeExpired, now, input.Limit) + if err != nil { + return nil, err + } + defer rows.Close() + out := []FabricServiceChannelLeaseRecord{} + for rows.Next() { + item, err := scanFabricServiceChannelLeaseRecord(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) CleanupExpiredFabricServiceChannelLeases(ctx context.Context, clusterID string, now time.Time, limit int) (int, error) { + if now.IsZero() { + now = time.Now().UTC() + } + if limit <= 0 || limit > 1000 { + limit = 100 + } + tag, err := s.db.Exec(ctx, ` + DELETE FROM 
fabric_service_channel_leases + WHERE (cluster_id, channel_id) IN ( + SELECT cluster_id, channel_id + FROM fabric_service_channel_leases + WHERE cluster_id = $1::uuid + AND expires_at <= $2 + ORDER BY expires_at ASC + LIMIT $3 + ) + `, clusterID, now.UTC(), limit) + if err != nil { + return 0, err + } + return int(tag.RowsAffected()), nil +} + +func (s *PostgresStore) RecordFabricServiceChannelRouteRebuildAttempt(ctx context.Context, input RecordFabricServiceChannelRouteRebuildAttemptInput) (FabricServiceChannelRouteRebuildAttempt, error) { + id := uuid.NewString() + payload := defaultJSON(input.Payload, `{}`) + row := s.db.QueryRow(ctx, ` + INSERT INTO fabric_service_channel_route_rebuild_attempts ( + id, cluster_id, reporter_node_id, service_class, route_id, replacement_route_id, + rebuild_request_id, rebuild_status, rebuild_reason, rebuild_attempt, decision_source, + outcome, generation, policy_fingerprint, observed_policy_fingerprint, + observed_route_generation, effective_route_generation, feedback_status, + feedback_score_adjustment, feedback_effective_score_adjustment, feedback_reasons, + last_error, consecutive_failures, stall_count, last_send_duration_ms, + quality_window_sample_count, quality_window_failure_count, quality_window_drop_count, + quality_window_slow_count, old_hops, replacement_hops, payload + ) VALUES ( + $1::uuid, $2::uuid, $3::uuid, $4, $5, $6, + $7, $8, $9, $10, $11, + $12, $13, $14, $15, + $16, $17, $18, + $19, $20, $21, + $22, $23, $24, $25, + $26, $27, $28, + $29, $30, $31, $32::jsonb + ) + ON CONFLICT (cluster_id, reporter_node_id, service_class, route_id, rebuild_request_id) DO UPDATE SET + replacement_route_id = EXCLUDED.replacement_route_id, + rebuild_status = EXCLUDED.rebuild_status, + rebuild_reason = EXCLUDED.rebuild_reason, + rebuild_attempt = EXCLUDED.rebuild_attempt, + decision_source = EXCLUDED.decision_source, + outcome = EXCLUDED.outcome, + generation = EXCLUDED.generation, + policy_fingerprint = 
EXCLUDED.policy_fingerprint, + observed_policy_fingerprint = EXCLUDED.observed_policy_fingerprint, + observed_route_generation = EXCLUDED.observed_route_generation, + effective_route_generation = EXCLUDED.effective_route_generation, + feedback_status = EXCLUDED.feedback_status, + feedback_score_adjustment = EXCLUDED.feedback_score_adjustment, + feedback_effective_score_adjustment = EXCLUDED.feedback_effective_score_adjustment, + feedback_reasons = EXCLUDED.feedback_reasons, + last_error = EXCLUDED.last_error, + consecutive_failures = EXCLUDED.consecutive_failures, + stall_count = EXCLUDED.stall_count, + last_send_duration_ms = EXCLUDED.last_send_duration_ms, + quality_window_sample_count = EXCLUDED.quality_window_sample_count, + quality_window_failure_count = EXCLUDED.quality_window_failure_count, + quality_window_drop_count = EXCLUDED.quality_window_drop_count, + quality_window_slow_count = EXCLUDED.quality_window_slow_count, + old_hops = EXCLUDED.old_hops, + replacement_hops = EXCLUDED.replacement_hops, + payload = EXCLUDED.payload, + updated_at = NOW() + RETURNING id::text, cluster_id::text, reporter_node_id::text, service_class, route_id, + replacement_route_id, rebuild_request_id, rebuild_status, rebuild_reason, + rebuild_attempt, decision_source, outcome, generation, policy_fingerprint, + observed_policy_fingerprint, observed_route_generation, effective_route_generation, + feedback_status, feedback_score_adjustment, feedback_effective_score_adjustment, + feedback_reasons, last_error, consecutive_failures, stall_count, + last_send_duration_ms, quality_window_sample_count, quality_window_failure_count, + quality_window_drop_count, quality_window_slow_count, old_hops, replacement_hops, + node_transition_status, node_transition_generation, node_transition_observed_at, + node_transition_matched, node_route_generation_status, node_route_generation_applied_at, + node_route_generation_withdrawn_at, node_route_generation_matched, + post_rebuild_selected_route_id, 
post_rebuild_send_packets, post_rebuild_send_failures, + post_rebuild_send_flow_packets, post_rebuild_send_flow_dropped, + guard_status, guard_severity, guard_reason, guard_transition_deadline_seconds, + guard_traffic_deadline_seconds, correlation_timeline, correlation_snapshot_at, + payload, created_at, updated_at + `, id, input.ClusterID, input.ReporterNodeID, input.ServiceClass, input.RouteID, input.ReplacementRouteID, + input.RebuildRequestID, input.RebuildStatus, input.RebuildReason, input.RebuildAttempt, input.DecisionSource, + input.Outcome, input.Generation, input.PolicyFingerprint, input.ObservedPolicyFingerprint, + input.ObservedRouteGeneration, input.EffectiveRouteGeneration, input.FeedbackStatus, + input.FeedbackScoreAdjustment, input.FeedbackEffectiveScoreAdjustment, input.FeedbackReasons, + input.LastError, input.ConsecutiveFailures, input.StallCount, input.LastSendDurationMs, + input.QualityWindowSampleCount, input.QualityWindowFailureCount, input.QualityWindowDropCount, + input.QualityWindowSlowCount, input.OldHops, input.ReplacementHops, []byte(payload)) + return scanFabricServiceChannelRouteRebuildAttempt(row) +} + +func (s *PostgresStore) ListFabricServiceChannelRouteRebuildAttempts(ctx context.Context, input ListFabricServiceChannelRouteRebuildAttemptsInput) ([]FabricServiceChannelRouteRebuildAttempt, error) { + limit := input.Limit + if limit <= 0 || limit > 200 { + limit = 100 + } + offset := input.Offset + if offset < 0 { + offset = 0 + } + rows, err := s.db.Query(ctx, ` + SELECT id::text, cluster_id::text, reporter_node_id::text, service_class, route_id, + replacement_route_id, rebuild_request_id, rebuild_status, rebuild_reason, + rebuild_attempt, decision_source, outcome, generation, policy_fingerprint, + observed_policy_fingerprint, observed_route_generation, effective_route_generation, + feedback_status, feedback_score_adjustment, feedback_effective_score_adjustment, + feedback_reasons, last_error, consecutive_failures, stall_count, + 
last_send_duration_ms, quality_window_sample_count, quality_window_failure_count, + quality_window_drop_count, quality_window_slow_count, old_hops, replacement_hops, + node_transition_status, node_transition_generation, node_transition_observed_at, + node_transition_matched, node_route_generation_status, node_route_generation_applied_at, + node_route_generation_withdrawn_at, node_route_generation_matched, + post_rebuild_selected_route_id, post_rebuild_send_packets, post_rebuild_send_failures, + post_rebuild_send_flow_packets, post_rebuild_send_flow_dropped, + guard_status, guard_severity, guard_reason, guard_transition_deadline_seconds, + guard_traffic_deadline_seconds, correlation_timeline, correlation_snapshot_at, + payload, created_at, updated_at + FROM fabric_service_channel_route_rebuild_attempts + WHERE cluster_id = $1::uuid + AND (NULLIF($2, '') IS NULL OR reporter_node_id = NULLIF($2, '')::uuid) + AND ($3 = '' OR route_id = $3) + AND ($4 = '' OR replacement_route_id = $4) + AND ($5 = '' OR service_class = $5) + AND ($6 = '' OR rebuild_status = $6) + AND ($7 = '' OR rebuild_request_id = $7) + AND ($8 = '' OR generation = $8) + AND ($9 = '' OR payload->>'feedback_source' = $9) + AND ($10 = '' OR payload->>'feedback_channel_id' = $10) + AND ($11 = '' OR payload->>'feedback_violation_status' = $11) + ORDER BY updated_at DESC + LIMIT $12 OFFSET $13 + `, input.ClusterID, input.ReporterNodeID, input.RouteID, input.ReplacementRouteID, input.ServiceClass, input.RebuildStatus, input.RebuildRequestID, input.Generation, input.FeedbackSource, input.FeedbackChannelID, input.FeedbackViolationStatus, limit, offset) + if err != nil { + return nil, err + } + defer rows.Close() + out := []FabricServiceChannelRouteRebuildAttempt{} + for rows.Next() { + item, err := scanFabricServiceChannelRouteRebuildAttempt(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) 
UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(ctx context.Context, input UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput) error { + timeline := mustJSONRaw(input.Timeline) + _, err := s.db.Exec(ctx, ` + UPDATE fabric_service_channel_route_rebuild_attempts + SET node_transition_status = $2, + node_transition_generation = $3, + node_transition_observed_at = $4, + node_transition_matched = $5, + node_route_generation_status = $6, + node_route_generation_applied_at = $7, + node_route_generation_withdrawn_at = $8, + node_route_generation_matched = $9, + post_rebuild_selected_route_id = $10, + post_rebuild_send_packets = $11, + post_rebuild_send_failures = $12, + post_rebuild_send_flow_packets = $13, + post_rebuild_send_flow_dropped = $14, + guard_status = $15, + guard_severity = $16, + guard_reason = $17, + guard_transition_deadline_seconds = $18, + guard_traffic_deadline_seconds = $19, + correlation_timeline = $20::jsonb, + correlation_snapshot_at = $21 + WHERE id = $1::uuid + `, input.ID, input.NodeTransitionStatus, input.NodeTransitionGeneration, input.NodeTransitionObservedAt, + input.NodeTransitionMatched, input.NodeRouteGenerationStatus, input.NodeRouteGenerationAppliedAt, + input.NodeRouteGenerationWithdrawnAt, input.NodeRouteGenerationMatched, input.PostRebuildSelectedRouteID, + int64(input.PostRebuildSendPackets), int64(input.PostRebuildSendFailures), int64(input.PostRebuildSendFlowPackets), + int64(input.PostRebuildSendFlowDropped), input.GuardStatus, input.GuardSeverity, input.GuardReason, + input.GuardTransitionDeadlineSeconds, input.GuardTrafficDeadlineSeconds, []byte(timeline), input.CorrelationSnapshotAt) + return err +} + +func (s *PostgresStore) GetFabricServiceChannelSchemaStatus(ctx context.Context, input GetFabricServiceChannelSchemaStatusInput) (FabricServiceChannelSchemaStatus, error) { + checks := fabricServiceChannelRequiredSchemaChecks() + status := FabricServiceChannelSchemaStatus{ + ClusterID: input.ClusterID, + 
ObservedAt: time.Now().UTC(), + Status: "ready", + Reason: "schema_ready", + RequiredMigration: "000028_fabric_service_channel_rebuild_correlation_snapshot", + RequiredChecks: make([]FabricServiceChannelSchemaCheck, 0, len(checks)), + } + for _, check := range checks { + exists, err := s.fabricServiceChannelSchemaCheckExists(ctx, check) + if err != nil { + return FabricServiceChannelSchemaStatus{}, err + } + check.Status = "present" + if !exists { + check.Status = "missing" + status.MissingChecks = append(status.MissingChecks, check) + } + status.RequiredChecks = append(status.RequiredChecks, check) + } + status.RequiredCheckCount = len(status.RequiredChecks) + status.MissingCheckCount = len(status.MissingChecks) + status.PassedCheckCount = status.RequiredCheckCount - status.MissingCheckCount + if status.MissingCheckCount > 0 { + status.Status = "blocked" + status.Reason = "schema_migration_required" + status.RecommendedOperatorAction = "Apply backend migration 000028 before swapping or using this backend build." 
+ } + return status, nil +} + +func (s *PostgresStore) fabricServiceChannelSchemaCheckExists(ctx context.Context, check FabricServiceChannelSchemaCheck) (bool, error) { + if check.ColumnName == "" { + var exists bool + err := s.db.QueryRow(ctx, ` + SELECT EXISTS ( + SELECT 1 + FROM information_schema.tables + WHERE table_schema = 'public' + AND table_name = $1 + ) + `, check.RelationName).Scan(&exists) + return exists, err + } + var exists bool + err := s.db.QueryRow(ctx, ` + SELECT EXISTS ( + SELECT 1 + FROM information_schema.columns + WHERE table_schema = 'public' + AND table_name = $1 + AND column_name = $2 + ) + `, check.RelationName, check.ColumnName).Scan(&exists) + return exists, err +} + +func fabricServiceChannelRequiredSchemaChecks() []FabricServiceChannelSchemaCheck { + const migration = "000028_fabric_service_channel_rebuild_correlation_snapshot" + const table = "fabric_service_channel_route_rebuild_attempts" + columns := []string{ + "node_transition_status", + "node_transition_generation", + "node_transition_observed_at", + "node_transition_matched", + "node_route_generation_status", + "node_route_generation_applied_at", + "node_route_generation_withdrawn_at", + "node_route_generation_matched", + "post_rebuild_selected_route_id", + "post_rebuild_send_packets", + "post_rebuild_send_failures", + "post_rebuild_send_flow_packets", + "post_rebuild_send_flow_dropped", + "guard_status", + "guard_severity", + "guard_reason", + "guard_transition_deadline_seconds", + "guard_traffic_deadline_seconds", + "correlation_timeline", + "correlation_snapshot_at", + } + checks := []FabricServiceChannelSchemaCheck{{ + CheckID: table, + RelationName: table, + RequiredBy: migration, + }} + for _, column := range columns { + checks = append(checks, FabricServiceChannelSchemaCheck{ + CheckID: table + "." 
+ column, + RelationName: table, + ColumnName: column, + RequiredBy: migration, + }) + } + return checks +} + +func (s *PostgresStore) UpsertFabricServiceChannelRouteRebuildAlertSilence(ctx context.Context, input SilenceFabricServiceChannelRouteRebuildAlertInput, expiresAt time.Time) (FabricServiceChannelRouteRebuildAlertSilence, error) { + payload := mustJSONRaw(map[string]any{ + "schema_version": "rap.fabric_service_channel_rebuild_alert_silence.v1", + "reason": input.Reason, + "incident_source": input.IncidentSource, + "channel_id": input.ChannelID, + }) + row := s.db.QueryRow(ctx, ` + INSERT INTO fabric_service_channel_rebuild_alert_silences ( + cluster_id, reporter_node_id, route_id, guard_status, generation, + reason, created_by_user_id, expires_at, payload + ) + VALUES ($1::uuid, $2::uuid, $3, $4, $5, $6, NULLIF($7, '')::uuid, $8, $9::jsonb) + ON CONFLICT (cluster_id, reporter_node_id, route_id, guard_status, generation) + DO UPDATE SET reason = EXCLUDED.reason, + created_by_user_id = EXCLUDED.created_by_user_id, + created_at = NOW(), + expires_at = EXCLUDED.expires_at, + payload = EXCLUDED.payload + RETURNING id::text, cluster_id::text, reporter_node_id::text, route_id, + guard_status, generation, reason, created_by_user_id::text, + created_at, expires_at, payload + `, input.ClusterID, input.ReporterNodeID, input.RouteID, input.GuardStatus, input.Generation, input.Reason, input.ActorUserID, expiresAt, []byte(payload)) + return scanFabricServiceChannelRouteRebuildAlertSilence(row) +} + +func (s *PostgresStore) ListFabricServiceChannelRouteRebuildAlertSilences(ctx context.Context, clusterID string, now time.Time) ([]FabricServiceChannelRouteRebuildAlertSilence, error) { + rows, err := s.db.Query(ctx, ` + SELECT id::text, cluster_id::text, reporter_node_id::text, route_id, + guard_status, generation, reason, created_by_user_id::text, + created_at, expires_at, payload + FROM fabric_service_channel_rebuild_alert_silences + WHERE cluster_id = $1::uuid + AND 
expires_at > $2 + ORDER BY created_at DESC + `, clusterID, now.UTC()) + if err != nil { + return nil, err + } + defer rows.Close() + out := []FabricServiceChannelRouteRebuildAlertSilence{} + for rows.Next() { + item, err := scanFabricServiceChannelRouteRebuildAlertSilence(rows) + if err != nil { + return nil, err + } + out = append(out, item) + } + return out, rows.Err() +} + +func (s *PostgresStore) DeleteFabricServiceChannelRouteRebuildAlertSilence(ctx context.Context, input UnsilenceFabricServiceChannelRouteRebuildAlertInput) (FabricServiceChannelRouteRebuildAlertSilence, error) { + row := s.db.QueryRow(ctx, ` + DELETE FROM fabric_service_channel_rebuild_alert_silences + WHERE cluster_id = $1::uuid + AND id = $2::uuid + RETURNING id::text, cluster_id::text, reporter_node_id::text, route_id, + guard_status, generation, reason, created_by_user_id::text, + created_at, expires_at, payload + `, input.ClusterID, input.SilenceID) + return scanFabricServiceChannelRouteRebuildAlertSilence(row) +} + +func (s *PostgresStore) ExpireFabricServiceChannelRouteFeedback(ctx context.Context, input ExpireFabricServiceChannelRouteFeedbackInput) (ExpireFabricServiceChannelRouteFeedbackResult, error) { + now := input.Now.UTC() + if now.IsZero() { + now = time.Now().UTC() + } + cooldownUntil := now.Add(fabricServiceChannelOperatorExpireCooldown) + tx, err := s.db.Begin(ctx) + if err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + defer tx.Rollback(ctx) + + expiredAtText := now.Format(time.RFC3339Nano) + cooldownUntilText := cooldownUntil.Format(time.RFC3339Nano) + rows, err := tx.Query(ctx, ` + UPDATE fabric_service_channel_route_feedback_latest + SET expires_at = $6, + payload = jsonb_set( + jsonb_set( + jsonb_set( + jsonb_set(payload, '{operator_expired}', 'true'::jsonb, true), + '{operator_expire_reason}', to_jsonb($5::text), true + ), + '{operator_expired_at}', to_jsonb($8::text), true + ), + '{operator_retry_cooldown_until}', to_jsonb($7::text), true 
+ ) + WHERE cluster_id = $1::uuid + AND route_id = $2 + AND (NULLIF($3, '') IS NULL OR reporter_node_id = NULLIF($3, '')::uuid) + AND ($4 = '' OR service_class = $4) + AND expires_at > $6 + RETURNING observation_id::text + `, input.ClusterID, input.RouteID, input.ReporterNodeID, input.ServiceClass, input.Reason, now, cooldownUntilText, expiredAtText) + if err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + var observationIDs []string + for rows.Next() { + var id string + if err := rows.Scan(&id); err != nil { + rows.Close() + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + observationIDs = append(observationIDs, id) + } + if err := rows.Err(); err != nil { + rows.Close() + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + rows.Close() + if len(observationIDs) > 0 { + for _, observationID := range observationIDs { + if _, err := tx.Exec(ctx, ` + UPDATE fabric_service_channel_route_feedback_observations + SET expires_at = $2, + payload = jsonb_set( + jsonb_set( + jsonb_set( + jsonb_set(payload, '{operator_expired}', 'true'::jsonb, true), + '{operator_expire_reason}', to_jsonb($3::text), true + ), + '{operator_expired_at}', to_jsonb($5::text), true + ), + '{operator_retry_cooldown_until}', to_jsonb($4::text), true + ) + WHERE id = $1::uuid + `, observationID, now, input.Reason, cooldownUntilText, expiredAtText); err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + } + } + if err := tx.Commit(ctx); err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + return ExpireFabricServiceChannelRouteFeedbackResult{ + ClusterID: input.ClusterID, + ReporterNodeID: input.ReporterNodeID, + RouteID: input.RouteID, + ServiceClass: input.ServiceClass, + ExpiredCount: len(observationIDs), + ExpiredAt: now, + CooldownUntil: cooldownUntil, + }, nil +} + func (s *PostgresStore) ListQoSPolicies(ctx context.Context, clusterID string) ([]MeshQoSPolicy, error) { rows, err := 
s.db.Query(ctx, ` SELECT id::text, cluster_id::text, service_class, priority, reliability_mode, @@ -1740,13 +3003,60 @@ func (s *PostgresStore) CreateVPNConnection(ctx context.Context, input CreateVPN func (s *PostgresStore) ListVPNConnections(ctx context.Context, clusterID string) ([]VPNConnection, error) { rows, err := s.db.Query(ctx, ` - SELECT id::text, cluster_id::text, organization_id::text, name, target_endpoint, + SELECT vpn_connections.id::text, vpn_connections.cluster_id::text, organization_id::text, name, target_endpoint, protocol_family, credential_ref, mode, desired_state, allowed_node_policy, - routing_usage, route_policy, qos_policy, placement_policy, status, metadata, - created_by_user_id::text, updated_by_user_id::text, created_at, updated_at + routing_usage, route_policy, qos_policy, placement_policy, vpn_connections.status, + vpn_connections.metadata || CASE WHEN l.id IS NULL THEN '{}'::jsonb ELSE jsonb_build_object( + 'client_config', + COALESCE(vpn_connections.metadata->'client_config', '{}'::jsonb) || + jsonb_build_object('vpn_address', '10.77.0.2/24') || + CASE + WHEN COALESCE((vpn_connections.route_policy->>'full_tunnel')::boolean, false) + THEN jsonb_build_object('routes', jsonb_build_array('0.0.0.0/0')) + ELSE '{}'::jsonb + END || + CASE + WHEN jsonb_typeof(gateway_status.status_payload->'exit_dns_servers') = 'array' + AND jsonb_array_length(gateway_status.status_payload->'exit_dns_servers') > 0 + THEN jsonb_build_object('dns_servers', gateway_status.status_payload->'exit_dns_servers') + WHEN jsonb_typeof(vpn_connections.route_policy->'dns_servers') = 'array' + AND jsonb_array_length(vpn_connections.route_policy->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', vpn_connections.route_policy->'dns_servers') + WHEN jsonb_typeof(vpn_connections.target_endpoint->'dns_servers') = 'array' + AND jsonb_array_length(vpn_connections.target_endpoint->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', 
vpn_connections.target_endpoint->'dns_servers') + ELSE '{}'::jsonb + END || + jsonb_strip_nulls(jsonb_build_object( + 'runtime_status', CASE + WHEN COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false) THEN 'packet_forwarding_active' + WHEN COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false) THEN 'runtime_available' + WHEN gateway_status.observed_status IS NOT NULL THEN gateway_status.observed_status + ELSE 'lease_active' + END, + 'gateway_node_id', l.owner_node_id::text, + 'gateway_assignment_status', gateway_status.observed_status, + 'gateway_interface', gateway_status.status_payload->>'gateway_interface', + 'gateway_vpn_cidr', gateway_status.status_payload->>'gateway_vpn_cidr', + 'relay_transport', gateway_status.status_payload->>'relay_transport', + 'packet_forwarding', COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false), + 'runtime_available', COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false), + 'runtime_observed_at', gateway_status.observed_at + )) + ) END AS metadata, + created_by_user_id::text, updated_by_user_id::text, vpn_connections.created_at, vpn_connections.updated_at FROM vpn_connections - WHERE cluster_id = $1::uuid - ORDER BY created_at DESC + LEFT JOIN vpn_connection_leases l + ON l.cluster_id = vpn_connections.cluster_id + AND l.vpn_connection_id = vpn_connections.id + AND l.status = 'active' + AND l.expires_at > NOW() + LEFT JOIN vpn_connection_assignment_latest_statuses gateway_status + ON gateway_status.cluster_id = vpn_connections.cluster_id + AND gateway_status.vpn_connection_id = vpn_connections.id + AND gateway_status.node_id = l.owner_node_id + WHERE vpn_connections.cluster_id = $1::uuid + ORDER BY vpn_connections.created_at DESC `, clusterID) if err != nil { return nil, err @@ -1757,13 +3067,60 @@ func (s *PostgresStore) ListVPNConnections(ctx context.Context, clusterID string func (s *PostgresStore) GetVPNConnection(ctx 
context.Context, clusterID, vpnConnectionID string) (VPNConnection, error) { row := s.db.QueryRow(ctx, ` - SELECT id::text, cluster_id::text, organization_id::text, name, target_endpoint, + SELECT vpn_connections.id::text, vpn_connections.cluster_id::text, organization_id::text, name, target_endpoint, protocol_family, credential_ref, mode, desired_state, allowed_node_policy, - routing_usage, route_policy, qos_policy, placement_policy, status, metadata, - created_by_user_id::text, updated_by_user_id::text, created_at, updated_at + routing_usage, route_policy, qos_policy, placement_policy, vpn_connections.status, + vpn_connections.metadata || CASE WHEN l.id IS NULL THEN '{}'::jsonb ELSE jsonb_build_object( + 'client_config', + COALESCE(vpn_connections.metadata->'client_config', '{}'::jsonb) || + jsonb_build_object('vpn_address', '10.77.0.2/24') || + CASE + WHEN COALESCE((vpn_connections.route_policy->>'full_tunnel')::boolean, false) + THEN jsonb_build_object('routes', jsonb_build_array('0.0.0.0/0')) + ELSE '{}'::jsonb + END || + CASE + WHEN jsonb_typeof(gateway_status.status_payload->'exit_dns_servers') = 'array' + AND jsonb_array_length(gateway_status.status_payload->'exit_dns_servers') > 0 + THEN jsonb_build_object('dns_servers', gateway_status.status_payload->'exit_dns_servers') + WHEN jsonb_typeof(vpn_connections.route_policy->'dns_servers') = 'array' + AND jsonb_array_length(vpn_connections.route_policy->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', vpn_connections.route_policy->'dns_servers') + WHEN jsonb_typeof(vpn_connections.target_endpoint->'dns_servers') = 'array' + AND jsonb_array_length(vpn_connections.target_endpoint->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', vpn_connections.target_endpoint->'dns_servers') + ELSE '{}'::jsonb + END || + jsonb_strip_nulls(jsonb_build_object( + 'runtime_status', CASE + WHEN COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false) THEN 'packet_forwarding_active' + WHEN 
COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false) THEN 'runtime_available' + WHEN gateway_status.observed_status IS NOT NULL THEN gateway_status.observed_status + ELSE 'lease_active' + END, + 'gateway_node_id', l.owner_node_id::text, + 'gateway_assignment_status', gateway_status.observed_status, + 'gateway_interface', gateway_status.status_payload->>'gateway_interface', + 'gateway_vpn_cidr', gateway_status.status_payload->>'gateway_vpn_cidr', + 'relay_transport', gateway_status.status_payload->>'relay_transport', + 'packet_forwarding', COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false), + 'runtime_available', COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false), + 'runtime_observed_at', gateway_status.observed_at + )) + ) END AS metadata, + created_by_user_id::text, updated_by_user_id::text, vpn_connections.created_at, vpn_connections.updated_at FROM vpn_connections - WHERE cluster_id = $1::uuid - AND id = $2::uuid + LEFT JOIN vpn_connection_leases l + ON l.cluster_id = vpn_connections.cluster_id + AND l.vpn_connection_id = vpn_connections.id + AND l.status = 'active' + AND l.expires_at > NOW() + LEFT JOIN vpn_connection_assignment_latest_statuses gateway_status + ON gateway_status.cluster_id = vpn_connections.cluster_id + AND gateway_status.vpn_connection_id = vpn_connections.id + AND gateway_status.node_id = l.owner_node_id + WHERE vpn_connections.cluster_id = $1::uuid + AND vpn_connections.id = $2::uuid `, clusterID, vpnConnectionID) return scanVPNConnection(row) } @@ -1969,6 +3326,24 @@ func (s *PostgresStore) RenewVPNConnectionLease(ctx context.Context, input Renew return scanVPNLease(row) } +func (s *PostgresStore) RenewNodeVPNAssignmentLease(ctx context.Context, input RenewNodeVPNAssignmentLeaseInput, expiresAt time.Time) (VPNConnectionLease, error) { + row := s.db.QueryRow(ctx, ` + UPDATE vpn_connection_leases + SET renewed_at = NOW(), + expires_at = $5 + WHERE id = 
$1::uuid + AND vpn_connection_id = $2::uuid + AND cluster_id = $3::uuid + AND owner_node_id = $4::uuid + AND status = 'active' + AND expires_at > NOW() + RETURNING id::text, vpn_connection_id::text, cluster_id::text, owner_node_id::text, + lease_generation, fencing_token, status, acquired_at, renewed_at, expires_at, + released_at, fenced_at, metadata + `, input.LeaseID, input.VPNConnectionID, input.ClusterID, input.OwnerNodeID, expiresAt) + return scanVPNLease(row) +} + func (s *PostgresStore) ReleaseVPNConnectionLease(ctx context.Context, input ReleaseVPNConnectionLeaseInput) (VPNConnectionLease, error) { tx, err := s.db.Begin(ctx) if err != nil { @@ -2316,17 +3691,22 @@ func (s *PostgresStore) RecordAudit(ctx context.Context, event ClusterAuditEvent return err } -func (s *PostgresStore) ListAuditEvents(ctx context.Context, clusterID string, limit int) ([]ClusterAuditEvent, error) { - if limit <= 0 || limit > 200 { - limit = 100 +func (s *PostgresStore) ListAuditEvents(ctx context.Context, input ListAuditEventsInput) ([]ClusterAuditEvent, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.EventTypes = compactStringSlice(input.EventTypes) + input.TargetTypes = compactStringSlice(input.TargetTypes) + if input.Limit <= 0 || input.Limit > 200 { + input.Limit = 100 } rows, err := s.db.Query(ctx, ` SELECT id::text, cluster_id::text, actor_user_id::text, event_type, target_type, target_id, payload, created_at FROM cluster_audit_events WHERE cluster_id = $1::uuid + AND (cardinality($2::text[]) = 0 OR event_type = ANY($2::text[])) + AND (cardinality($3::text[]) = 0 OR target_type = ANY($3::text[])) ORDER BY created_at DESC - LIMIT $2 - `, clusterID, limit) + LIMIT $4 + `, input.ClusterID, input.EventTypes, input.TargetTypes, input.Limit) if err != nil { return nil, err } @@ -2342,6 +3722,18 @@ func (s *PostgresStore) ListAuditEvents(ctx context.Context, clusterID string, l return out, rows.Err() } +func compactStringSlice(values []string) []string { + 
out := []string{} + for _, value := range values { + trimmed := strings.TrimSpace(value) + if trimmed == "" || containsString(out, trimmed) { + continue + } + out = append(out, trimmed) + } + return out +} + type scanner interface { Scan(dest ...any) error } @@ -2507,6 +3899,135 @@ func scanHeartbeat(row scanner) (NodeHeartbeat, error) { return item, nil } +func scanReleaseVersion(row scanner) (ReleaseVersion, error) { + var item ReleaseVersion + var signatureRaw json.RawMessage + if err := row.Scan( + &item.ID, + &item.ClusterID, + &item.Product, + &item.Version, + &item.Channel, + &item.Status, + &item.Compatibility, + &item.Changelog, + &item.CreatedByUserID, + &item.CreatedAt, + &item.AuthorityPayload, + &signatureRaw, + ); err != nil { + return ReleaseVersion{}, err + } + ensureRaw(&item.Compatibility, `{}`) + ensureRaw(&item.AuthorityPayload, `{}`) + if len(signatureRaw) > 0 && string(signatureRaw) != "{}" { + var signature ClusterSignature + if err := json.Unmarshal(signatureRaw, &signature); err != nil { + return ReleaseVersion{}, err + } + item.AuthoritySignature = &signature + } + return item, nil +} + +func scanReleaseArtifact(row scanner) (ReleaseArtifact, error) { + var item ReleaseArtifact + if err := row.Scan( + &item.ID, + &item.ReleaseID, + &item.ClusterID, + &item.Product, + &item.Version, + &item.OS, + &item.Arch, + &item.InstallType, + &item.Kind, + &item.URL, + &item.SHA256, + &item.SizeBytes, + &item.Signature, + &item.Metadata, + &item.CreatedAt, + ); err != nil { + return ReleaseArtifact{}, err + } + ensureRaw(&item.Metadata, `{}`) + item.URLs = releaseArtifactURLsFromMetadata(item.Metadata) + return item, nil +} + +func releaseArtifactURLsFromMetadata(metadata json.RawMessage) []string { + var payload map[string]json.RawMessage + if len(metadata) == 0 || json.Unmarshal(metadata, &payload) != nil { + return nil + } + for _, key := range []string{"artifact_urls", "urls"} { + raw, ok := payload[key] + if !ok { + continue + } + var urls 
[]string + if json.Unmarshal(raw, &urls) != nil { + continue + } + out := make([]string, 0, len(urls)) + seen := map[string]bool{} + for _, url := range urls { + url = strings.TrimSpace(url) + if url == "" || seen[url] { + continue + } + seen[url] = true + out = append(out, url) + } + return out + } + return nil +} + +func scanNodeUpdatePolicy(row scanner) (NodeUpdatePolicy, error) { + var item NodeUpdatePolicy + if err := row.Scan( + &item.ClusterID, + &item.NodeID, + &item.Product, + &item.Channel, + &item.TargetVersion, + &item.Strategy, + &item.Enabled, + &item.RollbackAllowed, + &item.HealthWindowSec, + &item.UpdatedByUserID, + &item.UpdatedAt, + ); err != nil { + return NodeUpdatePolicy{}, err + } + return item, nil +} + +func scanNodeUpdateStatus(row scanner) (NodeUpdateStatus, error) { + var item NodeUpdateStatus + if err := row.Scan( + &item.ID, + &item.ClusterID, + &item.NodeID, + &item.Product, + &item.CurrentVersion, + &item.TargetVersion, + &item.Phase, + &item.Status, + &item.AttemptID, + &item.ErrorMessage, + &item.RollbackVersion, + &item.Payload, + &item.ObservedAt, + ); err != nil { + return NodeUpdateStatus{}, err + } + ensureRaw(&item.Payload, `{}`) + return item, nil +} + func scanAuditEvent(row scanner) (ClusterAuditEvent, error) { var item ClusterAuditEvent if err := row.Scan(&item.ID, &item.ClusterID, &item.ActorUserID, &item.EventType, &item.TargetType, &item.TargetID, &item.Payload, &item.CreatedAt); err != nil { @@ -2643,6 +4164,194 @@ func scanRouteIntent(row scanner) (MeshRouteIntent, error) { return item, nil } +func scanFabricServiceChannelRouteFeedback(row scanner) (FabricServiceChannelRouteFeedbackObservation, error) { + var item FabricServiceChannelRouteFeedbackObservation + if err := row.Scan( + &item.ID, + &item.ClusterID, + &item.ReporterNodeID, + &item.RouteID, + &item.ServiceClass, + &item.FeedbackStatus, + &item.ScoreAdjustment, + &item.Reasons, + &item.LastError, + &item.ConsecutiveFailures, + &item.StallCount, + 
&item.LastSendDurationMs, + &item.Payload, + &item.ObservedAt, + &item.ExpiresAt, + ); err != nil { + return FabricServiceChannelRouteFeedbackObservation{}, err + } + ensureRaw(&item.Payload, `{}`) + item.RetryCooldownUntil = fabricServiceChannelRetryCooldownUntil(item.Payload) + return item, nil +} + +func scanFabricServiceChannelLeaseRecord(row scanner) (FabricServiceChannelLeaseRecord, error) { + var item FabricServiceChannelLeaseRecord + var rawLease json.RawMessage + if err := row.Scan( + &item.ClusterID, + &item.ChannelID, + &item.TokenHash, + &item.ResourceID, + &item.ServiceClass, + &item.SelectedEntryNodeID, + &item.ExpiresAt, + &rawLease, + &item.CreatedAt, + &item.UpdatedAt, + ); err != nil { + return FabricServiceChannelLeaseRecord{}, err + } + if len(rawLease) > 0 { + if err := json.Unmarshal(rawLease, &item.Lease); err != nil { + return FabricServiceChannelLeaseRecord{}, err + } + } + return item, nil +} + +func scanFabricServiceChannelRouteRebuildAttempt(row scanner) (FabricServiceChannelRouteRebuildAttempt, error) { + var item FabricServiceChannelRouteRebuildAttempt + var timeline json.RawMessage + if err := row.Scan( + &item.ID, + &item.ClusterID, + &item.ReporterNodeID, + &item.ServiceClass, + &item.RouteID, + &item.ReplacementRouteID, + &item.RebuildRequestID, + &item.RebuildStatus, + &item.RebuildReason, + &item.RebuildAttempt, + &item.DecisionSource, + &item.Outcome, + &item.Generation, + &item.PolicyFingerprint, + &item.ObservedPolicyFingerprint, + &item.ObservedRouteGeneration, + &item.EffectiveRouteGeneration, + &item.FeedbackStatus, + &item.FeedbackScoreAdjustment, + &item.FeedbackEffectiveScoreAdjustment, + &item.FeedbackReasons, + &item.LastError, + &item.ConsecutiveFailures, + &item.StallCount, + &item.LastSendDurationMs, + &item.QualityWindowSampleCount, + &item.QualityWindowFailureCount, + &item.QualityWindowDropCount, + &item.QualityWindowSlowCount, + &item.OldHops, + &item.ReplacementHops, + &item.NodeTransitionStatus, + 
&item.NodeTransitionGeneration, + &item.NodeTransitionObservedAt, + &item.NodeTransitionMatched, + &item.NodeRouteGenerationStatus, + &item.NodeRouteGenerationAppliedAt, + &item.NodeRouteGenerationWithdrawnAt, + &item.NodeRouteGenerationMatched, + &item.PostRebuildSelectedRouteID, + &item.PostRebuildSendPackets, + &item.PostRebuildSendFailures, + &item.PostRebuildSendFlowPackets, + &item.PostRebuildSendFlowDropped, + &item.GuardStatus, + &item.GuardSeverity, + &item.GuardReason, + &item.GuardTransitionDeadlineSeconds, + &item.GuardTrafficDeadlineSeconds, + &timeline, + &item.CorrelationSnapshotAt, + &item.Payload, + &item.CreatedAt, + &item.UpdatedAt, + ); err != nil { + return FabricServiceChannelRouteRebuildAttempt{}, err + } + ensureRaw(&item.Payload, `{}`) + if len(timeline) > 0 && string(timeline) != "null" { + _ = json.Unmarshal(timeline, &item.Timeline) + } + enrichFabricServiceChannelRouteRebuildAttemptFeedbackCorrelation(&item) + return item, nil +} + +func enrichFabricServiceChannelRouteRebuildAttemptFeedbackCorrelation(item *FabricServiceChannelRouteRebuildAttempt) { + if item == nil || len(item.Payload) == 0 || !json.Valid(item.Payload) { + return + } + payload := jsonObject(item.Payload) + item.FeedbackObservationID = firstNonEmptyString(item.FeedbackObservationID, jsonString(payload, "feedback_observation_id")) + item.FeedbackSource = firstNonEmptyString(item.FeedbackSource, jsonString(payload, "feedback_source")) + item.FeedbackChannelID = firstNonEmptyString(item.FeedbackChannelID, jsonString(payload, "feedback_channel_id")) + item.FeedbackResourceID = firstNonEmptyString(item.FeedbackResourceID, jsonString(payload, "feedback_resource_id")) + item.FeedbackViolationStatus = firstNonEmptyString(item.FeedbackViolationStatus, jsonString(payload, "feedback_violation_status")) + item.FeedbackViolationReason = firstNonEmptyString(item.FeedbackViolationReason, jsonString(payload, "feedback_violation_reason")) + if item.FeedbackObservedAt == nil { + 
item.FeedbackObservedAt = parseOptionalPayloadTime(jsonString(payload, "feedback_observed_at")) + } + if item.FeedbackExpiresAt == nil { + item.FeedbackExpiresAt = parseOptionalPayloadTime(jsonString(payload, "feedback_expires_at")) + } +} + +func parseOptionalPayloadTime(value string) *time.Time { + value = strings.TrimSpace(value) + if value == "" { + return nil + } + parsed, err := time.Parse(time.RFC3339Nano, value) + if err != nil { + return nil + } + parsed = parsed.UTC() + return &parsed +} + +func scanFabricServiceChannelRouteRebuildAlertSilence(row scanner) (FabricServiceChannelRouteRebuildAlertSilence, error) { + var item FabricServiceChannelRouteRebuildAlertSilence + if err := row.Scan( + &item.ID, + &item.ClusterID, + &item.ReporterNodeID, + &item.RouteID, + &item.GuardStatus, + &item.Generation, + &item.Reason, + &item.CreatedByUserID, + &item.CreatedAt, + &item.ExpiresAt, + &item.Payload, + ); err != nil { + return FabricServiceChannelRouteRebuildAlertSilence{}, err + } + ensureRaw(&item.Payload, `{}`) + item.DisplayRouteID = item.RouteID + var payload map[string]any + if err := json.Unmarshal(item.Payload, &payload); err == nil && payload != nil { + if value, ok := payload["incident_source"].(string); ok { + item.IncidentSource = strings.TrimSpace(value) + } + if value, ok := payload["channel_id"].(string); ok { + item.ChannelID = strings.TrimSpace(value) + } + } + if channelID, routeID, ok := fabricServiceChannelParseAccessDecisionSilenceRouteID(item.RouteID); ok { + item.IncidentSource = firstNonEmptyString(item.IncidentSource, "access_decision") + item.ChannelID = firstNonEmptyString(item.ChannelID, channelID) + item.DisplayRouteID = routeID + } + return item, nil +} + func scanQoSPolicy(row scanner) (MeshQoSPolicy, error) { var item MeshQoSPolicy if err := row.Scan( @@ -2761,6 +4470,186 @@ func scanAuthorityState(row scanner) (ClusterAuthorityState, error) { return item, nil } +func (s *PostgresStore) GetVPNClientProfile( + ctx context.Context, + 
clusterID, organizationID, userID, preferredEntryNodeID, preferredExitNodeID string, + generatedAt time.Time, +) (VPNClientProfile, error) { + var allowed bool + if err := s.db.QueryRow(ctx, ` + SELECT EXISTS ( + SELECT 1 + FROM organization_memberships + WHERE organization_id = $1::uuid + AND user_id = $2::uuid + AND status = 'active' + ) + `, organizationID, userID).Scan(&allowed); err != nil { + return VPNClientProfile{}, err + } + if !allowed { + return VPNClientProfile{}, ErrVPNLeaseOwnerNotAllowed + } + rows, err := s.db.Query(ctx, ` + SELECT vc.id::text, + vc.name, + vc.protocol_family, + vc.mode, + vc.desired_state, + vc.status, + vc.target_endpoint, + vc.routing_usage, + vc.route_policy, + vc.qos_policy, + vc.placement_policy, + COALESCE(( + SELECT jsonb_agg(van.node_id::text ORDER BY van.created_at, van.node_id::text) + FROM vpn_connection_allowed_nodes van + WHERE van.vpn_connection_id = vc.id + AND van.status = 'active' + ), '[]'::jsonb) AS allowed_node_ids, + COALESCE(vc.placement_policy->'entry_node_ids', '[]'::jsonb) AS entry_node_ids, + COALESCE(vc.placement_policy->>'exit_node_id', '') AS exit_node_id, + CASE WHEN l.id IS NULL THEN NULL ELSE jsonb_build_object( + 'lease_id', l.id::text, + 'owner_node_id', l.owner_node_id::text, + 'lease_generation', l.lease_generation, + 'status', l.status, + 'renewed_at', l.renewed_at, + 'expires_at', l.expires_at + ) END AS active_lease, + COALESCE(( + SELECT jsonb_agg(jsonb_build_object( + 'id', rp.id::text, + 'route_type', rp.route_type, + 'destination', rp.destination, + 'action', rp.action, + 'service_type', rp.service_type, + 'priority', rp.priority, + 'policy', rp.policy, + 'status', rp.status + ) ORDER BY rp.priority, rp.destination) + FROM vpn_connection_route_policies rp + WHERE rp.vpn_connection_id = vc.id + AND rp.status = 'active' + ), '[]'::jsonb) AS route_policies, + COALESCE(vc.metadata->'client_config', '{}'::jsonb) || + jsonb_build_object('vpn_address', '10.77.0.2/24') || + CASE + WHEN 
COALESCE((vc.route_policy->>'full_tunnel')::boolean, false) + THEN jsonb_build_object('routes', jsonb_build_array('0.0.0.0/0')) + ELSE '{}'::jsonb + END || + CASE + WHEN jsonb_typeof(gateway_status.status_payload->'exit_dns_servers') = 'array' + AND jsonb_array_length(gateway_status.status_payload->'exit_dns_servers') > 0 + THEN jsonb_build_object('dns_servers', gateway_status.status_payload->'exit_dns_servers') + WHEN jsonb_typeof(vc.route_policy->'dns_servers') = 'array' + AND jsonb_array_length(vc.route_policy->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', vc.route_policy->'dns_servers') + WHEN jsonb_typeof(vc.target_endpoint->'dns_servers') = 'array' + AND jsonb_array_length(vc.target_endpoint->'dns_servers') > 0 + THEN jsonb_build_object('dns_servers', vc.target_endpoint->'dns_servers') + ELSE '{}'::jsonb + END || + CASE WHEN l.id IS NULL THEN '{}'::jsonb ELSE jsonb_strip_nulls(jsonb_build_object( + 'runtime_status', CASE + WHEN COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false) THEN 'packet_forwarding_active' + WHEN COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false) THEN 'runtime_available' + WHEN gateway_status.observed_status IS NOT NULL THEN gateway_status.observed_status + ELSE 'lease_active' + END, + 'gateway_node_id', l.owner_node_id::text, + 'gateway_assignment_status', gateway_status.observed_status, + 'gateway_interface', gateway_status.status_payload->>'gateway_interface', + 'gateway_vpn_cidr', gateway_status.status_payload->>'gateway_vpn_cidr', + 'relay_transport', gateway_status.status_payload->>'relay_transport', + 'packet_forwarding', COALESCE((gateway_status.status_payload->>'packet_forwarding')::boolean, false), + 'runtime_available', COALESCE((gateway_status.status_payload->>'runtime_available')::boolean, false), + 'runtime_observed_at', gateway_status.observed_at + )) END AS client_config + FROM vpn_connections vc + LEFT JOIN vpn_connection_leases l + ON l.cluster_id = 
vc.cluster_id + AND l.vpn_connection_id = vc.id + AND l.status = 'active' + AND l.expires_at > NOW() + LEFT JOIN vpn_connection_assignment_latest_statuses gateway_status + ON gateway_status.cluster_id = vc.cluster_id + AND gateway_status.vpn_connection_id = vc.id + AND gateway_status.node_id = l.owner_node_id + WHERE vc.cluster_id = $1::uuid + AND vc.organization_id = $2::uuid + AND vc.desired_state = 'enabled' + ORDER BY vc.name ASC, vc.id ASC + `, clusterID, organizationID) + if err != nil { + return VPNClientProfile{}, err + } + defer rows.Close() + profile := VPNClientProfile{ + SchemaVersion: "rap.vpn_client_profile.v1", + ClusterID: clusterID, + OrganizationID: organizationID, + UserID: userID, + GeneratedAt: generatedAt, + } + for rows.Next() { + var item VPNClientConnection + var allowedRaw, entryRaw []byte + var activeLeaseRaw []byte + if err := rows.Scan( + &item.ID, + &item.Name, + &item.ProtocolFamily, + &item.Mode, + &item.DesiredState, + &item.Status, + &item.TargetEndpoint, + &item.RoutingUsage, + &item.RoutePolicy, + &item.QoSPolicy, + &item.PlacementPolicy, + &allowedRaw, + &entryRaw, + &item.ExitNodeID, + &activeLeaseRaw, + &item.RoutePolicies, + &item.ClientConfig, + ); err != nil { + return VPNClientProfile{}, err + } + _ = json.Unmarshal(allowedRaw, &item.AllowedNodeIDs) + _ = json.Unmarshal(entryRaw, &item.EntryNodeIDs) + if len(activeLeaseRaw) > 0 && string(activeLeaseRaw) != "null" { + var lease NodeVPNAssignmentLease + if err := json.Unmarshal(activeLeaseRaw, &lease); err == nil { + item.ActiveLease = &lease + } + } + ensureRaw(&item.TargetEndpoint, `{}`) + ensureRaw(&item.RoutingUsage, `[]`) + ensureRaw(&item.RoutePolicy, `{}`) + ensureRaw(&item.QoSPolicy, `{}`) + ensureRaw(&item.PlacementPolicy, `{}`) + ensureRaw(&item.RoutePolicies, `[]`) + ensureRaw(&item.ClientConfig, `{}`) + item.ClientConfig = enrichVPNClientFabricRoute(item, preferredEntryNodeID, preferredExitNodeID) + profile.Connections = append(profile.Connections, item) + } + if 
err := rows.Err(); err != nil { + return VPNClientProfile{}, err + } + entryEndpoints, err := s.vpnEntryEndpointCandidates(ctx, clusterID, vpnProfileEntryNodeIDs(profile)) + if err != nil { + return VPNClientProfile{}, err + } + for i := range profile.Connections { + profile.Connections[i].ClientConfig = enrichVPNClientEntryEndpointCandidates(profile.Connections[i], entryEndpoints) + } + return profile, nil +} + func scanVPNConnection(row scanner) (VPNConnection, error) { var item VPNConnection if err := row.Scan( @@ -2827,6 +4716,211 @@ func scanVPNAllowedNode(row scanner) (VPNConnectionAllowedNode, error) { return item, nil } +func vpnProfileEntryNodeIDs(profile VPNClientProfile) []string { + var out []string + for _, connection := range profile.Connections { + route := vpnFabricRouteFromClientConfig(connection.ClientConfig) + out = append(out, route.SelectedEntryNodeID) + out = append(out, connection.EntryNodeIDs...) + } + return dedupeStrings(out) +} + +func (s *PostgresStore) vpnEntryEndpointCandidates(ctx context.Context, clusterID string, entryNodeIDs []string) (map[string][]map[string]any, error) { + entryNodeIDs = dedupeStrings(entryNodeIDs) + out := make(map[string][]map[string]any, len(entryNodeIDs)) + if len(entryNodeIDs) == 0 { + return out, nil + } + rows, err := s.db.Query(ctx, ` + SELECT node_id::text, capabilities, metadata + FROM node_latest_heartbeats + WHERE cluster_id = $1::uuid + AND node_id::text = ANY($2::text[]) + `, clusterID, entryNodeIDs) + if err != nil { + return nil, err + } + defer rows.Close() + for rows.Next() { + var nodeID string + var capabilities json.RawMessage + var metadata json.RawMessage + if err := rows.Scan(&nodeID, &capabilities, &metadata); err != nil { + return nil, err + } + candidates := vpnEntryEndpointCandidatesFromHeartbeat(nodeID, capabilities, metadata) + if len(candidates) > 0 { + out[nodeID] = candidates + } + } + return out, rows.Err() +} + +func vpnEntryEndpointCandidatesFromHeartbeat(nodeID string, 
capabilities json.RawMessage, metadata json.RawMessage) []map[string]any { + localGatewayShortcut := heartbeatCapabilityEnabled(capabilities, "vpn_local_gateway_shortcut") + var payload struct { + MeshEndpointReport struct { + PeerEndpoint string `json:"peer_endpoint"` + Transport string `json:"transport"` + ConnectivityMode string `json:"connectivity_mode"` + NATType string `json:"nat_type"` + Region string `json:"region"` + EndpointCandidates []PeerEndpointCandidate `json:"endpoint_candidates"` + } `json:"mesh_endpoint_report"` + } + if len(metadata) == 0 || json.Unmarshal(metadata, &payload) != nil { + return nil + } + report := payload.MeshEndpointReport + var out []map[string]any + for _, candidate := range report.EndpointCandidates { + address := strings.TrimSpace(candidate.Address) + if address == "" { + continue + } + candidateNodeID := strings.TrimSpace(candidate.NodeID) + if candidateNodeID == "" { + candidateNodeID = nodeID + } + transport := strings.TrimSpace(candidate.Transport) + if transport == "" { + transport = strings.TrimSpace(report.Transport) + } + connectivityMode := strings.TrimSpace(candidate.ConnectivityMode) + if connectivityMode == "" { + connectivityMode = strings.TrimSpace(report.ConnectivityMode) + } + natType := strings.TrimSpace(candidate.NATType) + if natType == "" { + natType = strings.TrimSpace(report.NATType) + } + region := strings.TrimSpace(candidate.Region) + if region == "" { + region = strings.TrimSpace(report.Region) + } + reachability := strings.TrimSpace(candidate.Reachability) + if reachability == "" { + reachability = "unverified" + } + endpointID := strings.TrimSpace(candidate.EndpointID) + if endpointID == "" { + endpointID = "mesh-" + candidateNodeID + } + item := map[string]any{ + "node_id": candidateNodeID, + "endpoint_id": endpointID, + "transport": transport, + "address": address, + "reachability": reachability, + "connectivity_mode": connectivityMode, + "nat_type": natType, + "region": region, + "priority": 
candidate.Priority, + "status": "reported", + "source": "node_latest_heartbeat.mesh_endpoint_report.endpoint_candidates", + } + if apiBaseURL := vpnEntryAPIBaseURL(address); apiBaseURL != "" { + item["api_base_url"] = apiBaseURL + } + if localGatewayShortcut { + item["local_gateway_shortcut"] = true + } + out = append(out, item) + } + if len(out) == 0 { + address := strings.TrimSpace(report.PeerEndpoint) + if address != "" { + item := map[string]any{ + "node_id": nodeID, + "endpoint_id": "mesh-peer-endpoint-" + nodeID, + "transport": strings.TrimSpace(report.Transport), + "address": address, + "reachability": "unverified", + "connectivity_mode": strings.TrimSpace(report.ConnectivityMode), + "nat_type": strings.TrimSpace(report.NATType), + "region": strings.TrimSpace(report.Region), + "priority": 100, + "status": "reported", + "source": "node_latest_heartbeat.mesh_endpoint_report.peer_endpoint", + } + if apiBaseURL := vpnEntryAPIBaseURL(address); apiBaseURL != "" { + item["api_base_url"] = apiBaseURL + } + if localGatewayShortcut { + item["local_gateway_shortcut"] = true + } + out = append(out, item) + } + } + return out +} + +func heartbeatCapabilityEnabled(capabilities json.RawMessage, name string) bool { + var cfg map[string]any + if len(capabilities) == 0 || json.Unmarshal(capabilities, &cfg) != nil { + return false + } + value, ok := cfg[name] + if !ok { + return false + } + switch typed := value.(type) { + case bool: + return typed + case string: + return strings.EqualFold(strings.TrimSpace(typed), "true") || strings.EqualFold(strings.TrimSpace(typed), "enabled") + default: + return false + } +} + +func vpnEntryAPIBaseURL(address string) string { + address = strings.TrimRight(strings.TrimSpace(address), "/") + if address == "" { + return "" + } + if !strings.HasPrefix(address, "http://") && !strings.HasPrefix(address, "https://") { + return "" + } + return address + "/api/v1" +} + +func enrichVPNClientEntryEndpointCandidates(connection VPNClientConnection, 
endpoints map[string][]map[string]any) json.RawMessage { + var cfg map[string]any + if err := json.Unmarshal(connection.ClientConfig, &cfg); err != nil || cfg == nil { + cfg = map[string]any{} + } + route := vpnFabricRouteFromClientConfig(connection.ClientConfig) + entryIDs := dedupeStrings(append([]string{route.SelectedEntryNodeID}, connection.EntryNodeIDs...)) + var candidates []map[string]any + seen := map[string]struct{}{} + for _, nodeID := range entryIDs { + for _, candidate := range endpoints[nodeID] { + address, _ := candidate["address"].(string) + endpointID, _ := candidate["endpoint_id"].(string) + key := nodeID + "\x00" + endpointID + "\x00" + address + if _, ok := seen[key]; ok { + continue + } + seen[key] = struct{}{} + enriched := make(map[string]any, len(candidate)+1) + for k, v := range candidate { + enriched[k] = v + } + enriched["selected_entry"] = nodeID != "" && nodeID == route.SelectedEntryNodeID + candidates = append(candidates, enriched) + } + } + cfg["vpn_entry_endpoint_candidates"] = candidates + cfg["vpn_entry_endpoint_candidate_count"] = len(candidates) + out, err := json.Marshal(cfg) + if err != nil { + return connection.ClientConfig + } + return out +} + func listVPNConnectionAllowedNodes(ctx context.Context, q rowQuerier, clusterID, vpnConnectionID string) ([]VPNConnectionAllowedNode, error) { rows, err := q.Query(ctx, ` SELECT vpn_connection_id::text, cluster_id::text, node_id::text, role_preference, @@ -2992,3 +5086,194 @@ func ensureRaw(raw *json.RawMessage, fallback string) { *raw = json.RawMessage(fallback) } } + +func enrichVPNClientFabricRoute(item VPNClientConnection, preferredEntryNodeID, preferredExitNodeID string) json.RawMessage { + var cfg map[string]any + if err := json.Unmarshal(item.ClientConfig, &cfg); err != nil || cfg == nil { + cfg = map[string]any{} + } + entryPool := dedupeStrings(append([]string{}, item.EntryNodeIDs...)) + if len(entryPool) == 0 { + entryPool = dedupeStrings(append([]string{}, 
item.AllowedNodeIDs...)) + } + exitPool := []string{} + if item.ExitNodeID != "" { + exitPool = append(exitPool, item.ExitNodeID) + } + if item.ActiveLease != nil && item.ActiveLease.OwnerNodeID != "" { + exitPool = append(exitPool, item.ActiveLease.OwnerNodeID) + } + exitPool = append(exitPool, item.AllowedNodeIDs...) + exitPool = dedupeStrings(exitPool) + + preferredEntryNodeID = strings.TrimSpace(preferredEntryNodeID) + selectedEntry := selectPreferredNode(entryPool, preferredEntryNodeID) + selectedExit := "" + if item.ActiveLease != nil && item.ActiveLease.OwnerNodeID != "" { + selectedExit = item.ActiveLease.OwnerNodeID + } + if selectedExit == "" { + selectedExit = selectPreferredNode(exitPool, preferredExitNodeID) + } + status := "waiting_for_entry_and_exit" + switch { + case selectedEntry != "" && selectedExit != "": + status = "planned" + case selectedEntry == "": + status = "waiting_for_entry" + case selectedExit == "": + status = "waiting_for_exit" + } + routeCandidates := vpnFabricRouteCandidates(entryPool, exitPool, selectedEntry, selectedExit) + + cfg["vpn_fabric_route"] = map[string]any{ + "schema_version": "rap.vpn_fabric_route.v1", + "status": status, + "preferred_data_plane": "fabric_mesh", + "fallback_data_plane": "backend_relay", + "backend_relay_fallback": true, + "selection_mode": "entry_to_fastest_exit", + "entry_pool_node_ids": entryPool, + "exit_pool_node_ids": exitPool, + "selected_entry_node_id": selectedEntry, + "selected_exit_node_id": selectedExit, + "active_lease_owner_node": selectedExit, + "route_candidates": routeCandidates, + "route_candidate_count": len(routeCandidates), + "route_policy": "full_tunnel_or_connection_policy", + } + cfg["vpn_dataplane_contract"] = map[string]any{ + "schema_version": "rap.vpn_packet_dataplane.v1", + "tunnel_type": "universal_ip_packet", + "application_protocol_agnostic": true, + "packet_forwarding_channel": "vpn_packet", + "control_plane_packet_relay_mode": "lab_fallback_only", + "traffic_contract": 
map[string]any{ + "all_ip_traffic": true, + "protocol_specific_routing": false, + "diagnostics_only_protocol_summaries": true, + }, + "route_selection": map[string]any{ + "mode": "lowest_latency_healthy_route", + "selected_entry_node_id": selectedEntry, + "selected_exit_node_id": selectedExit, + "route_candidates": routeCandidates, + }, + "failover": map[string]any{ + "enabled": true, + "client_topology_hidden": true, + "preserve_vpn_connection_id": true, + "alternate_route_count": alternateVPNRouteCount(routeCandidates, selectedEntry, selectedExit), + "reroute_triggers": []string{ + "entry_unhealthy", + "exit_unhealthy", + "mesh_route_latency_regression", + "mesh_route_loss_regression", + "queue_backpressure", + "lease_owner_replaced", + }, + }, + "backpressure": map[string]any{ + "queue_policy": "bounded_queue_then_route_failover", + "drop_policy": "drop_only_when_all_routes_unavailable_or_queue_full", + "bulk_and_realtime": "same_packet_path", + "flow_isolation": "hash_by_ip_protocol_and_ports", + "target_dataplane": "entry_node_to_exit_node_fabric", + "temporary_fallback": "backend_http_packet_relay", + }, + } + out, err := json.Marshal(cfg) + if err != nil { + return item.ClientConfig + } + return out +} + +func vpnFabricRouteCandidates(entryPool, exitPool []string, selectedEntry, selectedExit string) []map[string]any { + type pair struct { + entry string + exit string + } + pairs := make([]pair, 0, len(entryPool)*len(exitPool)+1) + if selectedEntry != "" && selectedExit != "" { + pairs = append(pairs, pair{entry: selectedEntry, exit: selectedExit}) + } + for _, entry := range entryPool { + for _, exit := range exitPool { + if entry == "" || exit == "" { + continue + } + pairs = append(pairs, pair{entry: entry, exit: exit}) + } + } + seen := map[string]struct{}{} + out := make([]map[string]any, 0, len(pairs)) + for _, pair := range pairs { + key := pair.entry + "\x00" + pair.exit + if _, ok := seen[key]; ok { + continue + } + seen[key] = struct{}{} + priority 
:= len(out) + 1 + role := "alternate" + if pair.entry == selectedEntry && pair.exit == selectedExit { + role = "preferred" + priority = 0 + } + out = append(out, map[string]any{ + "entry_node_id": pair.entry, + "exit_node_id": pair.exit, + "role": role, + "priority": priority, + "status": "candidate", + }) + } + return out +} + +func alternateVPNRouteCount(candidates []map[string]any, selectedEntry, selectedExit string) int { + count := 0 + for _, candidate := range candidates { + entry, _ := candidate["entry_node_id"].(string) + exit, _ := candidate["exit_node_id"].(string) + if entry == "" || exit == "" { + continue + } + if entry == selectedEntry && exit == selectedExit { + continue + } + count++ + } + return count +} + +func selectPreferredNode(pool []string, preferred string) string { + preferred = strings.TrimSpace(preferred) + if preferred != "" { + for _, value := range pool { + if value == preferred { + return value + } + } + } + if len(pool) > 0 { + return pool[0] + } + return "" +} + +func dedupeStrings(values []string) []string { + seen := make(map[string]struct{}, len(values)) + out := make([]string, 0, len(values)) + for _, value := range values { + if value == "" { + continue + } + if _, ok := seen[value]; ok { + continue + } + seen[value] = struct{}{} + out = append(out, value) + } + return out +} diff --git a/backend/internal/modules/cluster/postgres_store_test.go b/backend/internal/modules/cluster/postgres_store_test.go index 1e4f1de..6b789cc 100644 --- a/backend/internal/modules/cluster/postgres_store_test.go +++ b/backend/internal/modules/cluster/postgres_store_test.go @@ -32,3 +32,136 @@ func TestMeshLatestObservationKeyDefaults(t *testing.T) { t.Fatalf("key = %q", key) } } + +func TestEnrichVPNClientFabricRoutePrefersPlacementEntryAndActiveExit(t *testing.T) { + item := VPNClientConnection{ + AllowedNodeIDs: []string{"node-a", "node-b", "node-b"}, + EntryNodeIDs: []string{"entry-1", "entry-2"}, + ExitNodeID: "exit-policy", + ActiveLease: 
&NodeVPNAssignmentLease{ + OwnerNodeID: "exit-active", + }, + ClientConfig: json.RawMessage(`{"routes":["0.0.0.0/0"]}`), + } + + var cfg map[string]any + if err := json.Unmarshal(enrichVPNClientFabricRoute(item, "entry-2", ""), &cfg); err != nil { + t.Fatalf("unmarshal enriched config: %v", err) + } + route, ok := cfg["vpn_fabric_route"].(map[string]any) + if !ok { + t.Fatalf("missing vpn_fabric_route in %#v", cfg) + } + if route["preferred_data_plane"] != "fabric_mesh" || route["fallback_data_plane"] != "backend_relay" { + t.Fatalf("unexpected data-plane route contract: %#v", route) + } + if route["selected_entry_node_id"] != "entry-2" || route["selected_exit_node_id"] != "exit-active" { + t.Fatalf("unexpected selected route endpoints: %#v", route) + } + if route["route_candidate_count"].(float64) != 8 { + t.Fatalf("route candidate count = %#v", route["route_candidate_count"]) + } + candidates := route["route_candidates"].([]any) + firstCandidate := candidates[0].(map[string]any) + if firstCandidate["role"] != "preferred" || firstCandidate["entry_node_id"] != "entry-2" || firstCandidate["exit_node_id"] != "exit-active" { + t.Fatalf("preferred route candidate = %#v", firstCandidate) + } + entryPool := route["entry_pool_node_ids"].([]any) + exitPool := route["exit_pool_node_ids"].([]any) + if len(entryPool) != 2 || entryPool[0] != "entry-1" || entryPool[1] != "entry-2" { + t.Fatalf("entry pool = %#v", entryPool) + } + if len(exitPool) != 4 || exitPool[0] != "exit-policy" || exitPool[1] != "exit-active" || exitPool[2] != "node-a" || exitPool[3] != "node-b" { + t.Fatalf("exit pool = %#v", exitPool) + } + contract, ok := cfg["vpn_dataplane_contract"].(map[string]any) + if !ok { + t.Fatalf("missing vpn_dataplane_contract in %#v", cfg) + } + if contract["tunnel_type"] != "universal_ip_packet" || contract["application_protocol_agnostic"] != true { + t.Fatalf("unexpected dataplane contract: %#v", contract) + } + failover := contract["failover"].(map[string]any) + if 
failover["enabled"] != true || failover["alternate_route_count"].(float64) != 7 { + t.Fatalf("unexpected failover contract: %#v", failover) + } +} + +func TestEnrichVPNClientFabricRoutePrefersExplicitExit(t *testing.T) { + item := VPNClientConnection{ + AllowedNodeIDs: []string{"node-a", "node-b", "node-c"}, + EntryNodeIDs: []string{"entry-1", "entry-2"}, + ExitNodeID: "exit-policy-a", + ActiveLease: &NodeVPNAssignmentLease{ + OwnerNodeID: "", + }, + ClientConfig: json.RawMessage(`{"routes":["0.0.0.0/0"]}`), + } + + var cfg map[string]any + if err := json.Unmarshal(enrichVPNClientFabricRoute(item, "entry-1", "node-c"), &cfg); err != nil { + t.Fatalf("unmarshal enriched config: %v", err) + } + route, ok := cfg["vpn_fabric_route"].(map[string]any) + if !ok { + t.Fatalf("missing vpn_fabric_route in %#v", cfg) + } + if route["selected_entry_node_id"] != "entry-1" { + t.Fatalf("unexpected selected entry: %#v", route["selected_entry_node_id"]) + } + if route["selected_exit_node_id"] != "node-c" { + t.Fatalf("unexpected selected exit: %#v", route["selected_exit_node_id"]) + } +} + +func TestEnrichVPNClientEntryEndpointCandidatesAddsReportedEntryAPI(t *testing.T) { + item := VPNClientConnection{ + EntryNodeIDs: []string{"entry-1"}, + ClientConfig: json.RawMessage(`{ + "vpn_fabric_route": { + "status": "planned", + "selected_entry_node_id": "entry-1", + "selected_exit_node_id": "exit-1" + } + }`), + } + heartbeatMetadata := json.RawMessage(`{ + "mesh_endpoint_report": { + "transport": "direct_http", + "connectivity_mode": "direct", + "nat_type": "none", + "region": "test", + "peer_endpoint": "http://entry.example.test:19131", + "endpoint_candidates": [{ + "endpoint_id": "public-http", + "node_id": "entry-1", + "transport": "direct_http", + "address": "http://entry.example.test:19131", + "reachability": "public", + "priority": 0 + }] + } + }`) + endpoints := map[string][]map[string]any{ + "entry-1": vpnEntryEndpointCandidatesFromHeartbeat("entry-1", 
json.RawMessage(`{"vpn_local_gateway_shortcut":true}`), heartbeatMetadata), + } + + var cfg map[string]any + if err := json.Unmarshal(enrichVPNClientEntryEndpointCandidates(item, endpoints), &cfg); err != nil { + t.Fatalf("unmarshal enriched config: %v", err) + } + if cfg["vpn_entry_endpoint_candidate_count"].(float64) != 1 { + t.Fatalf("candidate count = %#v", cfg["vpn_entry_endpoint_candidate_count"]) + } + candidates := cfg["vpn_entry_endpoint_candidates"].([]any) + candidate := candidates[0].(map[string]any) + if candidate["node_id"] != "entry-1" || candidate["api_base_url"] != "http://entry.example.test:19131/api/v1" { + t.Fatalf("unexpected endpoint candidate: %#v", candidate) + } + if candidate["local_gateway_shortcut"] != true { + t.Fatalf("local gateway shortcut missing: %#v", candidate) + } + if candidate["selected_entry"] != true || candidate["source"] != "node_latest_heartbeat.mesh_endpoint_report.endpoint_candidates" { + t.Fatalf("unexpected endpoint metadata: %#v", candidate) + } +} diff --git a/backend/internal/modules/cluster/repository.go b/backend/internal/modules/cluster/repository.go index 077443b..ac11e63 100644 --- a/backend/internal/modules/cluster/repository.go +++ b/backend/internal/modules/cluster/repository.go @@ -22,6 +22,7 @@ type Repository interface { AssignNodeToGroup(ctx context.Context, input AssignNodeGroupInput) (ClusterNode, error) CreateJoinToken(ctx context.Context, input CreateJoinTokenInput, tokenHash string) (NodeJoinToken, error) + ListJoinTokens(ctx context.Context, clusterID string) ([]NodeJoinToken, error) SetJoinTokenAuthority(ctx context.Context, clusterID, tokenID string, payload json.RawMessage, signature ClusterSignature) (NodeJoinToken, error) GetValidJoinTokenByHash(ctx context.Context, clusterID, tokenHash string) (NodeJoinToken, error) RevokeJoinToken(ctx context.Context, input RevokeJoinTokenInput) (NodeJoinToken, error) @@ -40,8 +41,16 @@ type Repository interface { RecordHeartbeat(ctx context.Context, input 
RecordHeartbeatInput) (NodeHeartbeat, error) ListNodeHeartbeats(ctx context.Context, clusterID, nodeID string, limit int) ([]NodeHeartbeat, error) + CreateReleaseVersion(ctx context.Context, input CreateReleaseVersionInput) (ReleaseVersion, error) + ListReleaseVersions(ctx context.Context, clusterID, product, channel string) ([]ReleaseVersion, error) + ListNodeUpdateServiceCandidates(ctx context.Context, clusterID string) ([]NodeUpdateServiceCandidate, error) + UpsertNodeUpdatePolicy(ctx context.Context, input UpsertNodeUpdatePolicyInput) (NodeUpdatePolicy, error) + GetNodeUpdatePolicy(ctx context.Context, clusterID, nodeID, product string) (NodeUpdatePolicy, error) + ReportNodeUpdateStatus(ctx context.Context, input ReportNodeUpdateStatusInput) (NodeUpdateStatus, error) + ListNodeUpdateStatuses(ctx context.Context, clusterID, nodeID string, limit int) ([]NodeUpdateStatus, error) RevokeNodeIdentity(ctx context.Context, input RevokeNodeIdentityInput) error DisableClusterMembership(ctx context.Context, input DisableMembershipInput) error + DeleteClusterNode(ctx context.Context, input DeleteClusterNodeInput) error UpsertFabricTestingFlag(ctx context.Context, input UpsertFabricTestingFlagInput) (FabricTestingFlag, error) ListFabricTestingFlags(ctx context.Context) ([]FabricTestingFlag, error) GetEffectiveNodeTestingFlags(ctx context.Context, clusterID, nodeID string) (EffectiveNodeTestingFlags, error) @@ -55,6 +64,22 @@ type Repository interface { ListMeshLinks(ctx context.Context, clusterID string) ([]MeshLinkObservation, error) CreateRouteIntent(ctx context.Context, input CreateRouteIntentInput) (MeshRouteIntent, error) ListRouteIntents(ctx context.Context, clusterID string) ([]MeshRouteIntent, error) + ExpireRouteIntent(ctx context.Context, input RouteIntentLifecycleInput, expiresAt time.Time) (MeshRouteIntent, error) + DisableRouteIntent(ctx context.Context, input RouteIntentLifecycleInput) (MeshRouteIntent, error) + RecordFabricServiceChannelRouteFeedback(ctx 
context.Context, input RecordFabricServiceChannelRouteFeedbackInput) (FabricServiceChannelRouteFeedbackObservation, error) + ListFabricServiceChannelRouteFeedback(ctx context.Context, input ListFabricServiceChannelRouteFeedbackInput) ([]FabricServiceChannelRouteFeedbackObservation, error) + ExpireFabricServiceChannelRouteFeedback(ctx context.Context, input ExpireFabricServiceChannelRouteFeedbackInput) (ExpireFabricServiceChannelRouteFeedbackResult, error) + StoreFabricServiceChannelLease(ctx context.Context, input StoreFabricServiceChannelLeaseInput) (FabricServiceChannelLeaseRecord, error) + GetFabricServiceChannelLease(ctx context.Context, clusterID, channelID string) (FabricServiceChannelLeaseRecord, error) + ListFabricServiceChannelLeases(ctx context.Context, input ListFabricServiceChannelLeasesInput) ([]FabricServiceChannelLeaseRecord, error) + CleanupExpiredFabricServiceChannelLeases(ctx context.Context, clusterID string, now time.Time, limit int) (int, error) + RecordFabricServiceChannelRouteRebuildAttempt(ctx context.Context, input RecordFabricServiceChannelRouteRebuildAttemptInput) (FabricServiceChannelRouteRebuildAttempt, error) + ListFabricServiceChannelRouteRebuildAttempts(ctx context.Context, input ListFabricServiceChannelRouteRebuildAttemptsInput) ([]FabricServiceChannelRouteRebuildAttempt, error) + UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(ctx context.Context, input UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput) error + GetFabricServiceChannelSchemaStatus(ctx context.Context, input GetFabricServiceChannelSchemaStatusInput) (FabricServiceChannelSchemaStatus, error) + UpsertFabricServiceChannelRouteRebuildAlertSilence(ctx context.Context, input SilenceFabricServiceChannelRouteRebuildAlertInput, expiresAt time.Time) (FabricServiceChannelRouteRebuildAlertSilence, error) + ListFabricServiceChannelRouteRebuildAlertSilences(ctx context.Context, clusterID string, now time.Time) ([]FabricServiceChannelRouteRebuildAlertSilence, 
error) + DeleteFabricServiceChannelRouteRebuildAlertSilence(ctx context.Context, input UnsilenceFabricServiceChannelRouteRebuildAlertInput) (FabricServiceChannelRouteRebuildAlertSilence, error) ListQoSPolicies(ctx context.Context, clusterID string) ([]MeshQoSPolicy, error) ListFabricEntryPoints(ctx context.Context, clusterID string) ([]FabricEntryPoint, error) CreateFabricEntryPoint(ctx context.Context, input CreateFabricEntryPointInput) (FabricEntryPoint, error) @@ -78,6 +103,7 @@ type Repository interface { ListVPNConnectionAllowedNodes(ctx context.Context, clusterID, vpnConnectionID string) ([]VPNConnectionAllowedNode, error) AcquireVPNConnectionLease(ctx context.Context, input AcquireVPNConnectionLeaseInput, expiresAt time.Time, fencingToken string) (VPNConnectionLease, error) RenewVPNConnectionLease(ctx context.Context, input RenewVPNConnectionLeaseInput, expiresAt time.Time) (VPNConnectionLease, error) + RenewNodeVPNAssignmentLease(ctx context.Context, input RenewNodeVPNAssignmentLeaseInput, expiresAt time.Time) (VPNConnectionLease, error) ReleaseVPNConnectionLease(ctx context.Context, input ReleaseVPNConnectionLeaseInput) (VPNConnectionLease, error) FenceVPNConnectionLease(ctx context.Context, input FenceVPNConnectionLeaseInput) (VPNConnectionLease, error) GetActiveVPNConnectionLease(ctx context.Context, clusterID, vpnConnectionID string) (VPNConnectionLease, error) @@ -85,7 +111,8 @@ type Repository interface { ExpireStaleVPNConnectionLeases(ctx context.Context, clusterID string, now time.Time) ([]VPNConnectionLease, error) ListNodeVPNAssignments(ctx context.Context, clusterID, nodeID string) ([]NodeVPNAssignment, error) ReportNodeVPNAssignmentStatus(ctx context.Context, input ReportNodeVPNAssignmentStatusInput) (NodeVPNAssignmentStatus, error) + GetVPNClientProfile(ctx context.Context, clusterID, organizationID, userID, preferredEntryNodeID, preferredExitNodeID string, generatedAt time.Time) (VPNClientProfile, error) RecordAudit(ctx context.Context, event 
ClusterAuditEvent) error - ListAuditEvents(ctx context.Context, clusterID string, limit int) ([]ClusterAuditEvent, error) + ListAuditEvents(ctx context.Context, input ListAuditEventsInput) ([]ClusterAuditEvent, error) } diff --git a/backend/internal/modules/cluster/service.go b/backend/internal/modules/cluster/service.go index d29ae9e..e33aed4 100644 --- a/backend/internal/modules/cluster/service.go +++ b/backend/internal/modules/cluster/service.go @@ -3,11 +3,16 @@ package cluster import ( "context" "crypto/rand" + "crypto/sha256" "encoding/hex" "encoding/json" "errors" + "fmt" + "net" + "net/url" "sort" "strings" + "sync" "time" "github.com/jackc/pgx/v5" @@ -31,12 +36,17 @@ var ( ) type Service struct { - store Repository - now func() time.Time + store Repository + now func() time.Time + fabricServiceChannelLeaseMu sync.Mutex + fabricServiceChannelLeaseCache map[string]FabricServiceChannelLease } +const fabricServiceChannelFeedbackMaxAge = 2 * time.Minute +const fabricServiceChannelOperatorExpireCooldown = 2 * time.Minute + func NewService(store Repository) *Service { - return &Service{store: store, now: func() time.Time { return time.Now().UTC() }} + return &Service{store: store, now: func() time.Time { return time.Now().UTC() }, fabricServiceChannelLeaseCache: map[string]FabricServiceChannelLease{}} } const ( @@ -102,6 +112,340 @@ func (s *Service) GetCluster(ctx context.Context, actorUserID, clusterID string) return item, err } +func (s *Service) GetFabricServiceChannelRecoveryPolicy(ctx context.Context, actorUserID, clusterID string) (FabricServiceChannelRecoveryPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, strings.TrimSpace(clusterID)) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelRecoveryPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + return 
fabricServiceChannelRecoveryPolicyFromCluster(cluster), nil +} + +func (s *Service) UpdateFabricServiceChannelRecoveryPolicy(ctx context.Context, input UpdateFabricServiceChannelRecoveryPolicyInput) (FabricServiceChannelRecoveryPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelRecoveryPolicy{}, ErrInvalidCluster + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelRecoveryPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + policy := fabricServiceChannelRecoveryPolicyFromCluster(cluster) + if input.HysteresisPenalty > 0 { + policy.HysteresisPenalty = clampInt(input.HysteresisPenalty, 0, 10000) + } + if input.PromotionMinSamples > 0 { + policy.PromotionMinSamples = clampInt(input.PromotionMinSamples, 1, 100000) + } + if input.DemotionFailureThreshold > 0 { + policy.DemotionFailureThreshold = clampInt(input.DemotionFailureThreshold, 1, 100000) + } + if input.DemotionDropThreshold > 0 { + policy.DemotionDropThreshold = clampInt(input.DemotionDropThreshold, 1, 100000) + } + if input.DemotionSlowThreshold > 0 { + policy.DemotionSlowThreshold = clampInt(input.DemotionSlowThreshold, 1, 100000) + } + if input.DemotionRebuildEnabled != nil { + policy.DemotionRebuildEnabled = *input.DemotionRebuildEnabled + } + if input.DemotionFencedEnabled != nil { + policy.DemotionFencedEnabled = *input.DemotionFencedEnabled + } + now := s.now().UTC() + policy.SchemaVersion = "rap.fabric_service_channel_recovery_policy.v1" + policy.Source = "cluster_metadata" + policy.UpdatedByUserID = &input.ActorUserID + 
policy.UpdatedAt = now + policy.ControlPlaneOnly = true + policy.ProductionForwarding = false + metadata, err := upsertFabricServiceChannelRecoveryPolicyMetadata(cluster.Metadata, policy) + if err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + updated, err := s.store.UpdateCluster(ctx, UpdateClusterInput{ + ActorUserID: input.ActorUserID, + ClusterID: cluster.ID, + Name: cluster.Name, + Status: cluster.Status, + Region: cluster.Region, + Metadata: metadata, + }) + if err != nil { + return FabricServiceChannelRecoveryPolicy{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &cluster.ID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel.recovery_policy.updated", + TargetType: "cluster", + TargetID: &cluster.ID, + Payload: metadata, + CreatedAt: now, + }) + return fabricServiceChannelRecoveryPolicyFromCluster(updated), nil +} + +func (s *Service) GetFabricServiceChannelAdaptivePolicy(ctx context.Context, actorUserID, clusterID string) (FabricServiceChannelAdaptivePolicy, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, strings.TrimSpace(clusterID)) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelAdaptivePolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + return fabricServiceChannelAdaptivePolicyFromCluster(cluster), nil +} + +func (s *Service) UpdateFabricServiceChannelAdaptivePolicy(ctx context.Context, input UpdateFabricServiceChannelAdaptivePolicyInput) (FabricServiceChannelAdaptivePolicy, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelAdaptivePolicy{}, ErrInvalidCluster + } + if err := 
s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelAdaptivePolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + policy := fabricServiceChannelAdaptivePolicyFromCluster(cluster) + if input.MaxParallelWindow > 0 { + policy.MaxParallelWindow = clampInt(input.MaxParallelWindow, 1, 64) + } + if input.BulkPressureChannelThreshold > 0 { + policy.BulkPressureChannelThreshold = clampInt(input.BulkPressureChannelThreshold, 1, 100000) + } + if input.QueuePressureHighWatermark > 0 { + policy.QueuePressureHighWatermark = clampInt(input.QueuePressureHighWatermark, 1, 100000) + } + if input.QueuePressureMaxInFlight > 0 { + policy.QueuePressureMaxInFlight = clampInt(input.QueuePressureMaxInFlight, 1, 100000) + } + if len(input.ClassWindows) > 0 { + policy.ClassWindows = normalizeFabricServiceChannelAdaptiveClassWindows(input.ClassWindows, policy.MaxParallelWindow) + } + now := s.now().UTC() + policy.SchemaVersion = "rap.fabric_service_channel_adaptive_policy.v1" + policy.Source = "cluster_metadata" + policy.UpdatedByUserID = &input.ActorUserID + policy.UpdatedAt = now + policy.ControlPlaneOnly = true + policy.ProductionForwarding = false + metadata, err := upsertFabricServiceChannelAdaptivePolicyMetadata(cluster.Metadata, policy) + if err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + updated, err := s.store.UpdateCluster(ctx, UpdateClusterInput{ + ActorUserID: input.ActorUserID, + ClusterID: cluster.ID, + Name: cluster.Name, + Status: cluster.Status, + Region: cluster.Region, + Metadata: metadata, + }) + if err != nil { + return FabricServiceChannelAdaptivePolicy{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &cluster.ID, + ActorUserID: &input.ActorUserID, + EventType: 
"fabric.service_channel.adaptive_policy.updated", + TargetType: "cluster", + TargetID: &cluster.ID, + Payload: metadata, + CreatedAt: now, + }) + return fabricServiceChannelAdaptivePolicyFromCluster(updated), nil +} + +func (s *Service) GetFabricServiceChannelPoolPolicy(ctx context.Context, actorUserID, clusterID string) (FabricServiceChannelPoolPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, strings.TrimSpace(clusterID)) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelPoolPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + return fabricServiceChannelPoolPolicyFromCluster(cluster), nil +} + +func (s *Service) UpdateFabricServiceChannelPoolPolicy(ctx context.Context, input UpdateFabricServiceChannelPoolPolicyInput) (FabricServiceChannelPoolPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelPoolPolicy{}, ErrInvalidCluster + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelPoolPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + policy := fabricServiceChannelPoolPolicyFromCluster(cluster) + policy.EntryPoolNodeIDs = dedupeStrings(input.EntryPoolNodeIDs) + policy.ExitPoolNodeIDs = dedupeStrings(input.ExitPoolNodeIDs) + policy.PreferredEntryNodeID = strings.TrimSpace(input.PreferredEntryNodeID) + policy.PreferredExitNodeID = strings.TrimSpace(input.PreferredExitNodeID) + if input.SelectionStrategy != "" { + 
policy.SelectionStrategy = strings.TrimSpace(input.SelectionStrategy) + } + if input.RouteRebuild != "" { + policy.RouteRebuild = strings.TrimSpace(input.RouteRebuild) + } + if input.EntryFailover != "" { + policy.EntryFailover = strings.TrimSpace(input.EntryFailover) + } + if input.ExitFailover != "" { + policy.ExitFailover = strings.TrimSpace(input.ExitFailover) + } + if input.BackendFallbackAllowed != nil { + policy.BackendFallbackAllowed = *input.BackendFallbackAllowed + } + if input.StickySession != nil { + policy.StickySession = *input.StickySession + } + now := s.now().UTC() + policy.SchemaVersion = "rap.fabric_service_channel_pool_policy.v1" + policy.Source = "cluster_metadata" + policy.UpdatedByUserID = &input.ActorUserID + policy.UpdatedAt = now + policy.ControlPlaneOnly = true + policy.ProductionForwarding = false + policy = normalizeFabricServiceChannelPoolPolicy(policy, defaultFabricServiceChannelPoolPolicy()) + metadata, err := upsertFabricServiceChannelPoolPolicyMetadata(cluster.Metadata, policy) + if err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + updated, err := s.store.UpdateCluster(ctx, UpdateClusterInput{ + ActorUserID: input.ActorUserID, + ClusterID: cluster.ID, + Name: cluster.Name, + Status: cluster.Status, + Region: cluster.Region, + Metadata: metadata, + }) + if err != nil { + return FabricServiceChannelPoolPolicy{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &cluster.ID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel.pool_policy.updated", + TargetType: "cluster", + TargetID: &cluster.ID, + Payload: metadata, + CreatedAt: now, + }) + return fabricServiceChannelPoolPolicyFromCluster(updated), nil +} + +func (s *Service) GetFabricServiceChannelBreadcrumbWindowPolicy(ctx context.Context, actorUserID, clusterID string) (FabricServiceChannelBreadcrumbWindowPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return 
FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, strings.TrimSpace(clusterID)) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelBreadcrumbWindowPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + return fabricServiceChannelBreadcrumbWindowPolicyFromCluster(cluster), nil +} + +func (s *Service) UpdateFabricServiceChannelBreadcrumbWindowPolicy(ctx context.Context, input UpdateFabricServiceChannelBreadcrumbWindowPolicyInput) (FabricServiceChannelBreadcrumbWindowPolicy, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelBreadcrumbWindowPolicy{}, ErrInvalidCluster + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelBreadcrumbWindowPolicy{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + policy := fabricServiceChannelBreadcrumbWindowPolicyFromCluster(cluster) + if input.CurrentWindowSeconds > 0 { + policy.CurrentWindowSeconds = input.CurrentWindowSeconds + } + if input.HistoryWindowSeconds > 0 { + policy.HistoryWindowSeconds = input.HistoryWindowSeconds + } + now := s.now().UTC() + policy.SchemaVersion = "rap.fabric_service_channel_breadcrumb_window_policy.v1" + policy.Source = "cluster_metadata" + policy.UpdatedByUserID = &input.ActorUserID + policy.UpdatedAt = now + policy.ControlPlaneOnly = true + policy.ProductionForwarding = false + policy = normalizeFabricServiceChannelBreadcrumbWindowPolicy(policy, defaultFabricServiceChannelBreadcrumbWindowPolicy()) + 
metadata, err := upsertFabricServiceChannelBreadcrumbWindowPolicyMetadata(cluster.Metadata, policy) + if err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + updated, err := s.store.UpdateCluster(ctx, UpdateClusterInput{ + ActorUserID: input.ActorUserID, + ClusterID: cluster.ID, + Name: cluster.Name, + Status: cluster.Status, + Region: cluster.Region, + Metadata: metadata, + }) + if err != nil { + return FabricServiceChannelBreadcrumbWindowPolicy{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &cluster.ID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel.breadcrumb_window_policy.updated", + TargetType: "cluster", + TargetID: &cluster.ID, + Payload: metadata, + CreatedAt: now, + }) + return fabricServiceChannelBreadcrumbWindowPolicyFromCluster(updated), nil +} + func (s *Service) CreateCluster(ctx context.Context, input CreateClusterInput) (Cluster, error) { if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { return Cluster{}, err @@ -156,6 +500,516 @@ func authorityDescriptor(authorityKey ClusterAuthorityKey) *ClusterAuthorityDesc return &descriptor } +func defaultFabricServiceChannelRecoveryPolicy() FabricServiceChannelRecoveryPolicy { + return FabricServiceChannelRecoveryPolicy{ + SchemaVersion: "rap.fabric_service_channel_recovery_policy.v1", + HysteresisPenalty: fabricServiceChannelRecoveryHysteresisPenalty, + PromotionMinSamples: fabricServiceChannelRecoveryPromotionMinSamples, + DemotionFailureThreshold: 1, + DemotionDropThreshold: 1, + DemotionSlowThreshold: 1, + DemotionRebuildEnabled: true, + DemotionFencedEnabled: true, + Source: "defaults", + ControlPlaneOnly: true, + ProductionForwarding: false, + } +} + +func fabricServiceChannelRecoveryPolicyFromCluster(cluster Cluster) FabricServiceChannelRecoveryPolicy { + policy := defaultFabricServiceChannelRecoveryPolicy() + if len(cluster.Metadata) == 0 || !json.Valid(cluster.Metadata) { + return policy + } + var raw 
struct { + Policy *FabricServiceChannelRecoveryPolicy `json:"fabric_service_channel_recovery_policy"` + } + if err := json.Unmarshal(cluster.Metadata, &raw); err != nil || raw.Policy == nil { + return policy + } + policy = normalizeFabricServiceChannelRecoveryPolicy(*raw.Policy, policy) + policy.Source = "cluster_metadata" + return policy +} + +func normalizeFabricServiceChannelRecoveryPolicy(input FabricServiceChannelRecoveryPolicy, fallback FabricServiceChannelRecoveryPolicy) FabricServiceChannelRecoveryPolicy { + if input.SchemaVersion == "" { + input.SchemaVersion = "rap.fabric_service_channel_recovery_policy.v1" + } + if input.HysteresisPenalty < 0 { + input.HysteresisPenalty = fallback.HysteresisPenalty + } + if input.HysteresisPenalty == 0 { + input.HysteresisPenalty = fallback.HysteresisPenalty + } + if input.PromotionMinSamples <= 0 { + input.PromotionMinSamples = fallback.PromotionMinSamples + } + if input.DemotionFailureThreshold <= 0 { + input.DemotionFailureThreshold = fallback.DemotionFailureThreshold + } + if input.DemotionDropThreshold <= 0 { + input.DemotionDropThreshold = fallback.DemotionDropThreshold + } + if input.DemotionSlowThreshold <= 0 { + input.DemotionSlowThreshold = fallback.DemotionSlowThreshold + } + if input.Source == "" { + input.Source = fallback.Source + } + input.ControlPlaneOnly = true + input.ProductionForwarding = false + input.Fingerprint = fabricServiceChannelRecoveryPolicyFingerprint(input) + return input +} + +func upsertFabricServiceChannelRecoveryPolicyMetadata(metadata json.RawMessage, policy FabricServiceChannelRecoveryPolicy) (json.RawMessage, error) { + raw := map[string]any{} + if len(metadata) > 0 && json.Valid(metadata) { + if err := json.Unmarshal(metadata, &raw); err != nil { + return nil, err + } + } + raw["fabric_service_channel_recovery_policy"] = policy + out, err := json.Marshal(raw) + if err != nil { + return nil, err + } + return json.RawMessage(out), nil +} + +func 
fabricServiceChannelRecoveryPolicyRef(policy FabricServiceChannelRecoveryPolicy) *FabricServiceChannelRecoveryPolicy { + normalized := normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + return &normalized +} + +func fabricServiceChannelRecoveryPolicyFingerprint(policy FabricServiceChannelRecoveryPolicy) string { + policy.Fingerprint = "" + policy.UpdatedAt = time.Time{} + policy.UpdatedByUserID = nil + raw, err := json.Marshal(struct { + SchemaVersion string `json:"schema_version"` + HysteresisPenalty int `json:"hysteresis_penalty"` + PromotionMinSamples int `json:"promotion_min_samples"` + DemotionFailureThreshold int `json:"demotion_failure_threshold"` + DemotionDropThreshold int `json:"demotion_drop_threshold"` + DemotionSlowThreshold int `json:"demotion_slow_threshold"` + DemotionRebuildEnabled bool `json:"demotion_rebuild_enabled"` + DemotionFencedEnabled bool `json:"demotion_fenced_enabled"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` + }{ + SchemaVersion: policy.SchemaVersion, + HysteresisPenalty: policy.HysteresisPenalty, + PromotionMinSamples: policy.PromotionMinSamples, + DemotionFailureThreshold: policy.DemotionFailureThreshold, + DemotionDropThreshold: policy.DemotionDropThreshold, + DemotionSlowThreshold: policy.DemotionSlowThreshold, + DemotionRebuildEnabled: policy.DemotionRebuildEnabled, + DemotionFencedEnabled: policy.DemotionFencedEnabled, + ControlPlaneOnly: true, + ProductionForwarding: false, + }) + if err != nil { + return "" + } + sum := sha256.Sum256(raw) + return hex.EncodeToString(sum[:]) +} + +func defaultFabricServiceChannelAdaptivePolicy() FabricServiceChannelAdaptivePolicy { + return normalizeFabricServiceChannelAdaptivePolicy(FabricServiceChannelAdaptivePolicy{ + SchemaVersion: "rap.fabric_service_channel_adaptive_policy.v1", + MaxParallelWindow: 4, + BulkPressureChannelThreshold: 16, + QueuePressureHighWatermark: 16, + 
QueuePressureMaxInFlight: 16, + ClassWindows: map[string]int{ + "control": 4, + "interactive": 4, + "reliable": 3, + "bulk": 1, + "droppable": 1, + }, + Source: "defaults", + ControlPlaneOnly: true, + ProductionForwarding: false, + }, FabricServiceChannelAdaptivePolicy{}) +} + +func fabricServiceChannelAdaptivePolicyFromCluster(cluster Cluster) FabricServiceChannelAdaptivePolicy { + fallback := defaultFabricServiceChannelAdaptivePolicy() + if len(cluster.Metadata) == 0 || !json.Valid(cluster.Metadata) { + return fallback + } + var raw struct { + Policy *FabricServiceChannelAdaptivePolicy `json:"fabric_service_channel_adaptive_policy"` + } + if err := json.Unmarshal(cluster.Metadata, &raw); err != nil || raw.Policy == nil { + return fallback + } + policy := normalizeFabricServiceChannelAdaptivePolicy(*raw.Policy, fallback) + policy.Source = "cluster_metadata" + return policy +} + +func normalizeFabricServiceChannelAdaptivePolicy(input FabricServiceChannelAdaptivePolicy, fallback FabricServiceChannelAdaptivePolicy) FabricServiceChannelAdaptivePolicy { + if input.SchemaVersion == "" { + input.SchemaVersion = "rap.fabric_service_channel_adaptive_policy.v1" + } + if fallback.MaxParallelWindow <= 0 { + fallback.MaxParallelWindow = 4 + } + if input.MaxParallelWindow <= 0 { + input.MaxParallelWindow = fallback.MaxParallelWindow + } + input.MaxParallelWindow = clampInt(input.MaxParallelWindow, 1, 64) + if input.BulkPressureChannelThreshold <= 0 { + input.BulkPressureChannelThreshold = firstPositive(fallback.BulkPressureChannelThreshold, 16) + } + if input.QueuePressureHighWatermark <= 0 { + input.QueuePressureHighWatermark = firstPositive(fallback.QueuePressureHighWatermark, 16) + } + if input.QueuePressureMaxInFlight <= 0 { + input.QueuePressureMaxInFlight = firstPositive(fallback.QueuePressureMaxInFlight, 16) + } + input.ClassWindows = normalizeFabricServiceChannelAdaptiveClassWindows(firstNonNilStringIntMap(input.ClassWindows, fallback.ClassWindows), 
input.MaxParallelWindow) + if input.Source == "" { + input.Source = fallback.Source + } + if input.Source == "" { + input.Source = "defaults" + } + input.ControlPlaneOnly = true + input.ProductionForwarding = false + input.Fingerprint = fabricServiceChannelAdaptivePolicyFingerprint(input) + return input +} + +func normalizeFabricServiceChannelAdaptiveClassWindows(values map[string]int, maxWindow int) map[string]int { + if maxWindow <= 0 { + maxWindow = 4 + } + defaults := map[string]int{"control": maxWindow, "interactive": maxWindow, "reliable": boundedMinInt(maxWindow, 3), "bulk": 1, "droppable": 1} + out := map[string]int{} + for key, fallback := range defaults { + value := values[key] + if value <= 0 { + value = fallback + } + out[key] = clampInt(value, 1, maxWindow) + } + return out +} + +func upsertFabricServiceChannelAdaptivePolicyMetadata(metadata json.RawMessage, policy FabricServiceChannelAdaptivePolicy) (json.RawMessage, error) { + raw := map[string]any{} + if len(metadata) > 0 && json.Valid(metadata) { + if err := json.Unmarshal(metadata, &raw); err != nil { + return nil, err + } + } + raw["fabric_service_channel_adaptive_policy"] = policy + out, err := json.Marshal(raw) + if err != nil { + return nil, err + } + return json.RawMessage(out), nil +} + +func fabricServiceChannelAdaptivePolicyFingerprint(policy FabricServiceChannelAdaptivePolicy) string { + raw, err := json.Marshal(struct { + SchemaVersion string `json:"schema_version"` + MaxParallelWindow int `json:"max_parallel_window"` + BulkPressureChannelThreshold int `json:"bulk_pressure_channel_threshold"` + QueuePressureHighWatermark int `json:"queue_pressure_high_watermark"` + QueuePressureMaxInFlight int `json:"queue_pressure_max_in_flight"` + ClassWindows map[string]int `json:"class_windows"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` + }{ + SchemaVersion: policy.SchemaVersion, + MaxParallelWindow: policy.MaxParallelWindow, + 
BulkPressureChannelThreshold: policy.BulkPressureChannelThreshold, + QueuePressureHighWatermark: policy.QueuePressureHighWatermark, + QueuePressureMaxInFlight: policy.QueuePressureMaxInFlight, + ClassWindows: policy.ClassWindows, + ControlPlaneOnly: true, + ProductionForwarding: false, + }) + if err != nil { + return "" + } + sum := sha256.Sum256(raw) + return hex.EncodeToString(sum[:]) +} + +func defaultFabricServiceChannelPoolPolicy() FabricServiceChannelPoolPolicy { + return normalizeFabricServiceChannelPoolPolicy(FabricServiceChannelPoolPolicy{ + SchemaVersion: "rap.fabric_service_channel_pool_policy.v1", + SelectionStrategy: "fastest_healthy", + RouteRebuild: "automatic", + EntryFailover: "automatic", + ExitFailover: "automatic", + BackendFallbackAllowed: true, + StickySession: true, + Source: "defaults", + ControlPlaneOnly: true, + ProductionForwarding: false, + }, FabricServiceChannelPoolPolicy{}) +} + +func fabricServiceChannelPoolPolicyFromCluster(cluster Cluster) FabricServiceChannelPoolPolicy { + fallback := defaultFabricServiceChannelPoolPolicy() + if len(cluster.Metadata) == 0 || !json.Valid(cluster.Metadata) { + return fallback + } + var raw struct { + Policy *FabricServiceChannelPoolPolicy `json:"fabric_service_channel_pool_policy"` + } + if err := json.Unmarshal(cluster.Metadata, &raw); err != nil || raw.Policy == nil { + return fallback + } + policy := normalizeFabricServiceChannelPoolPolicy(*raw.Policy, fallback) + policy.Source = "cluster_metadata" + return policy +} + +func normalizeFabricServiceChannelPoolPolicy(input FabricServiceChannelPoolPolicy, fallback FabricServiceChannelPoolPolicy) FabricServiceChannelPoolPolicy { + if input.SchemaVersion == "" { + input.SchemaVersion = firstNonEmptyString(fallback.SchemaVersion, "rap.fabric_service_channel_pool_policy.v1") + } + input.EntryPoolNodeIDs = dedupeStrings(firstNonEmptyStringSlice(input.EntryPoolNodeIDs, fallback.EntryPoolNodeIDs)) + input.ExitPoolNodeIDs = 
dedupeStrings(firstNonEmptyStringSlice(input.ExitPoolNodeIDs, fallback.ExitPoolNodeIDs)) + input.PreferredEntryNodeID = strings.TrimSpace(firstNonEmptyString(input.PreferredEntryNodeID, fallback.PreferredEntryNodeID)) + input.PreferredExitNodeID = strings.TrimSpace(firstNonEmptyString(input.PreferredExitNodeID, fallback.PreferredExitNodeID)) + input.SelectionStrategy = normalizeFabricServiceChannelPoolPolicyMode(firstNonEmptyString(input.SelectionStrategy, fallback.SelectionStrategy), []string{"fastest_healthy", "preferred_first", "stable_first"}, "fastest_healthy") + input.RouteRebuild = normalizeFabricServiceChannelPoolPolicyMode(firstNonEmptyString(input.RouteRebuild, fallback.RouteRebuild), []string{"automatic", "manual", "disabled"}, "automatic") + input.EntryFailover = normalizeFabricServiceChannelPoolPolicyMode(firstNonEmptyString(input.EntryFailover, fallback.EntryFailover), []string{"automatic", "manual", "disabled"}, "automatic") + input.ExitFailover = normalizeFabricServiceChannelPoolPolicyMode(firstNonEmptyString(input.ExitFailover, fallback.ExitFailover), []string{"automatic", "manual", "disabled"}, "automatic") + if input.Source == "" { + input.Source = firstNonEmptyString(fallback.Source, "defaults") + } + input.ControlPlaneOnly = true + input.ProductionForwarding = false + input.Fingerprint = fabricServiceChannelPoolPolicyFingerprint(input) + return input +} + +func normalizeFabricServiceChannelPoolPolicyMode(value string, allowed []string, fallback string) string { + value = strings.TrimSpace(strings.ToLower(value)) + for _, item := range allowed { + if value == item { + return value + } + } + return fallback +} + +func upsertFabricServiceChannelPoolPolicyMetadata(metadata json.RawMessage, policy FabricServiceChannelPoolPolicy) (json.RawMessage, error) { + raw := map[string]any{} + if len(metadata) > 0 && json.Valid(metadata) { + if err := json.Unmarshal(metadata, &raw); err != nil { + return nil, err + } + } + 
raw["fabric_service_channel_pool_policy"] = policy + out, err := json.Marshal(raw) + if err != nil { + return nil, err + } + return json.RawMessage(out), nil +} + +func fabricServiceChannelPoolPolicyRef(policy FabricServiceChannelPoolPolicy) *FabricServiceChannelPoolPolicy { + normalized := normalizeFabricServiceChannelPoolPolicy(policy, defaultFabricServiceChannelPoolPolicy()) + return &normalized +} + +func fabricServiceChannelPoolPolicyFingerprint(policy FabricServiceChannelPoolPolicy) string { + raw, err := json.Marshal(struct { + SchemaVersion string `json:"schema_version"` + EntryPoolNodeIDs []string `json:"entry_pool_node_ids,omitempty"` + ExitPoolNodeIDs []string `json:"exit_pool_node_ids,omitempty"` + PreferredEntryNodeID string `json:"preferred_entry_node_id,omitempty"` + PreferredExitNodeID string `json:"preferred_exit_node_id,omitempty"` + SelectionStrategy string `json:"selection_strategy"` + RouteRebuild string `json:"route_rebuild"` + EntryFailover string `json:"entry_failover"` + ExitFailover string `json:"exit_failover"` + BackendFallbackAllowed bool `json:"backend_fallback_allowed"` + StickySession bool `json:"sticky_session"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` + }{ + SchemaVersion: policy.SchemaVersion, + EntryPoolNodeIDs: policy.EntryPoolNodeIDs, + ExitPoolNodeIDs: policy.ExitPoolNodeIDs, + PreferredEntryNodeID: policy.PreferredEntryNodeID, + PreferredExitNodeID: policy.PreferredExitNodeID, + SelectionStrategy: policy.SelectionStrategy, + RouteRebuild: policy.RouteRebuild, + EntryFailover: policy.EntryFailover, + ExitFailover: policy.ExitFailover, + BackendFallbackAllowed: policy.BackendFallbackAllowed, + StickySession: policy.StickySession, + ControlPlaneOnly: true, + ProductionForwarding: false, + }) + if err != nil { + return "" + } + sum := sha256.Sum256(raw) + return hex.EncodeToString(sum[:]) +} + +func defaultFabricServiceChannelBreadcrumbWindowPolicy() 
FabricServiceChannelBreadcrumbWindowPolicy { + return normalizeFabricServiceChannelBreadcrumbWindowPolicy(FabricServiceChannelBreadcrumbWindowPolicy{ + SchemaVersion: "rap.fabric_service_channel_breadcrumb_window_policy.v1", + CurrentWindowSeconds: int64((30 * time.Minute).Seconds()), + HistoryWindowSeconds: int64((24 * time.Hour).Seconds()), + Source: "defaults", + ControlPlaneOnly: true, + ProductionForwarding: false, + }, FabricServiceChannelBreadcrumbWindowPolicy{}) +} + +func fabricServiceChannelBreadcrumbWindowPolicyFromCluster(cluster Cluster) FabricServiceChannelBreadcrumbWindowPolicy { + fallback := defaultFabricServiceChannelBreadcrumbWindowPolicy() + if len(cluster.Metadata) == 0 || !json.Valid(cluster.Metadata) { + return fallback + } + var raw struct { + Policy *FabricServiceChannelBreadcrumbWindowPolicy `json:"fabric_service_channel_breadcrumb_window_policy"` + } + if err := json.Unmarshal(cluster.Metadata, &raw); err != nil || raw.Policy == nil { + return fallback + } + policy := normalizeFabricServiceChannelBreadcrumbWindowPolicy(*raw.Policy, fallback) + policy.Source = "cluster_metadata" + return policy +} + +func normalizeFabricServiceChannelBreadcrumbWindowPolicy(input FabricServiceChannelBreadcrumbWindowPolicy, fallback FabricServiceChannelBreadcrumbWindowPolicy) FabricServiceChannelBreadcrumbWindowPolicy { + if input.SchemaVersion == "" { + input.SchemaVersion = firstNonEmptyString(fallback.SchemaVersion, "rap.fabric_service_channel_breadcrumb_window_policy.v1") + } + if input.CurrentWindowSeconds <= 0 { + input.CurrentWindowSeconds = firstPositiveInt64(fallback.CurrentWindowSeconds, int64((30 * time.Minute).Seconds())) + } + if input.HistoryWindowSeconds <= 0 { + input.HistoryWindowSeconds = firstPositiveInt64(fallback.HistoryWindowSeconds, int64((24 * time.Hour).Seconds())) + } + input.CurrentWindowSeconds = clampInt64(input.CurrentWindowSeconds, 60, int64((7 * 24 * time.Hour).Seconds())) + input.HistoryWindowSeconds = 
clampInt64(input.HistoryWindowSeconds, input.CurrentWindowSeconds, int64((30 * 24 * time.Hour).Seconds())) + if input.Source == "" { + input.Source = firstNonEmptyString(fallback.Source, "defaults") + } + input.ControlPlaneOnly = true + input.ProductionForwarding = false + input.Fingerprint = fabricServiceChannelBreadcrumbWindowPolicyFingerprint(input) + return input +} + +func upsertFabricServiceChannelBreadcrumbWindowPolicyMetadata(metadata json.RawMessage, policy FabricServiceChannelBreadcrumbWindowPolicy) (json.RawMessage, error) { + raw := map[string]any{} + if len(metadata) > 0 && json.Valid(metadata) { + if err := json.Unmarshal(metadata, &raw); err != nil { + return nil, err + } + } + raw["fabric_service_channel_breadcrumb_window_policy"] = policy + out, err := json.Marshal(raw) + if err != nil { + return nil, err + } + return json.RawMessage(out), nil +} + +func fabricServiceChannelBreadcrumbWindowPolicyFingerprint(policy FabricServiceChannelBreadcrumbWindowPolicy) string { + raw, err := json.Marshal(struct { + SchemaVersion string `json:"schema_version"` + CurrentWindowSeconds int64 `json:"current_window_seconds"` + HistoryWindowSeconds int64 `json:"history_window_seconds"` + ControlPlaneOnly bool `json:"control_plane_only"` + ProductionForwarding bool `json:"production_forwarding"` + }{ + SchemaVersion: policy.SchemaVersion, + CurrentWindowSeconds: policy.CurrentWindowSeconds, + HistoryWindowSeconds: policy.HistoryWindowSeconds, + ControlPlaneOnly: true, + ProductionForwarding: false, + }) + if err != nil { + return "" + } + sum := sha256.Sum256(raw) + return hex.EncodeToString(sum[:]) +} + +func firstNonEmptyStringSlice(values ...[]string) []string { + for _, value := range values { + if len(value) > 0 { + return value + } + } + return nil +} + +func firstPositive(values ...int) int { + for _, value := range values { + if value > 0 { + return value + } + } + return 0 +} + +func firstPositiveInt64(values ...int64) int64 { + for _, value := range values { 
+ if value > 0 { + return value + } + } + return 0 +} + +func firstNonNilStringIntMap(values ...map[string]int) map[string]int { + for _, value := range values { + if len(value) > 0 { + return value + } + } + return nil +} + +func boundedMinInt(a, b int) int { + if a < b { + return a + } + return b +} + +func clampInt(value, minValue, maxValue int) int { + if value < minValue { + return minValue + } + if value > maxValue { + return maxValue + } + return value +} + +func clampInt64(value, minValue, maxValue int64) int64 { + if value < minValue { + return minValue + } + if value > maxValue { + return maxValue + } + return value +} + func (s *Service) UpdateCluster(ctx context.Context, input UpdateClusterInput) (Cluster, error) { if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { return Cluster{}, err @@ -293,6 +1147,103 @@ func (s *Service) CreateJoinToken(ctx context.Context, input CreateJoinTokenInpu return CreatedJoinToken{NodeJoinToken: item, Token: rawToken}, nil } +func (s *Service) ListJoinTokens(ctx context.Context, actorUserID, clusterID string) ([]NodeJoinToken, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + if err := s.store.ExpireJoinTokens(ctx, clusterID); err != nil { + return nil, err + } + return s.store.ListJoinTokens(ctx, clusterID) +} + +func (s *Service) GetDockerInstallProfile(ctx context.Context, input DockerInstallProfileRequest) (DockerInstallProfile, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.InstallToken = strings.TrimSpace(input.InstallToken) + if input.ClusterID == "" || input.InstallToken == "" { + return DockerInstallProfile{}, ErrInvalidPayload + } + if err := s.store.ExpireJoinTokens(ctx, input.ClusterID); err != nil { + return DockerInstallProfile{}, err + } + tokenHash, err := hashJoinToken(input.InstallToken) + if err != nil { + return DockerInstallProfile{}, ErrInvalidJoinToken + } + token, err := s.store.GetValidJoinTokenByHash(ctx, 
input.ClusterID, tokenHash) + if err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return DockerInstallProfile{}, ErrInvalidJoinToken + } + return DockerInstallProfile{}, err + } + profile, err := dockerInstallProfileFromScope(input, token.Scope) + if err != nil { + return DockerInstallProfile{}, err + } + profile.ClusterID = input.ClusterID + profile.JoinToken = input.InstallToken + return profile, nil +} + +func (s *Service) GetWindowsInstallProfile(ctx context.Context, input DockerInstallProfileRequest) (WindowsInstallProfile, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.InstallToken = strings.TrimSpace(input.InstallToken) + if input.ClusterID == "" || input.InstallToken == "" { + return WindowsInstallProfile{}, ErrInvalidPayload + } + if err := s.store.ExpireJoinTokens(ctx, input.ClusterID); err != nil { + return WindowsInstallProfile{}, err + } + tokenHash, err := hashJoinToken(input.InstallToken) + if err != nil { + return WindowsInstallProfile{}, ErrInvalidJoinToken + } + token, err := s.store.GetValidJoinTokenByHash(ctx, input.ClusterID, tokenHash) + if err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return WindowsInstallProfile{}, ErrInvalidJoinToken + } + return WindowsInstallProfile{}, err + } + profile, err := windowsInstallProfileFromScope(input, token.Scope) + if err != nil { + return WindowsInstallProfile{}, err + } + profile.ClusterID = input.ClusterID + profile.JoinToken = input.InstallToken + return profile, nil +} + +func (s *Service) GetLinuxInstallProfile(ctx context.Context, input DockerInstallProfileRequest) (LinuxInstallProfile, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.InstallToken = strings.TrimSpace(input.InstallToken) + if input.ClusterID == "" || input.InstallToken == "" { + return LinuxInstallProfile{}, ErrInvalidPayload + } + if err := s.store.ExpireJoinTokens(ctx, input.ClusterID); err != nil { + return LinuxInstallProfile{}, err + } + tokenHash, err := 
hashJoinToken(input.InstallToken) + if err != nil { + return LinuxInstallProfile{}, ErrInvalidJoinToken + } + token, err := s.store.GetValidJoinTokenByHash(ctx, input.ClusterID, tokenHash) + if err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return LinuxInstallProfile{}, ErrInvalidJoinToken + } + return LinuxInstallProfile{}, err + } + profile, err := linuxInstallProfileFromScope(input, token.Scope) + if err != nil { + return LinuxInstallProfile{}, err + } + profile.ClusterID = input.ClusterID + profile.JoinToken = input.InstallToken + return profile, nil +} + func (s *Service) signJoinToken(ctx context.Context, input CreateJoinTokenInput, item NodeJoinToken) (NodeJoinToken, error) { authorityKey, err := s.ensureClusterAuthority(ctx, input.ClusterID, &input.ActorUserID) if err != nil { @@ -695,6 +1646,29 @@ func (s *Service) DisableClusterMembership(ctx context.Context, input DisableMem return nil } +func (s *Service) DeleteClusterNode(ctx context.Context, input DeleteClusterNodeInput) error { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return err + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return err + } + input.Reason = strings.TrimSpace(input.Reason) + if input.ClusterID == "" || input.NodeID == "" { + return ErrInvalidPayload + } + if input.Reason == "" { + input.Reason = "deleted by platform administrator" + } + if err := s.store.DeleteClusterNode(ctx, input); err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return ErrInvalidPayload + } + return err + } + return nil +} + func (s *Service) RecordHeartbeat(ctx context.Context, input RecordHeartbeatInput) (NodeHeartbeat, error) { if input.ClusterID == "" || input.NodeID == "" { return NodeHeartbeat{}, ErrInvalidPayload @@ -705,7 +1679,13 @@ func (s *Service) RecordHeartbeat(ctx context.Context, input RecordHeartbeatInpu input.Capabilities = defaultJSON(input.Capabilities, `{}`) input.ServiceStates = 
defaultJSON(input.ServiceStates, `{}`) input.Metadata = defaultJSON(input.Metadata, `{}`) - return s.store.RecordHeartbeat(ctx, input) + heartbeat, err := s.store.RecordHeartbeat(ctx, input) + if err != nil { + return NodeHeartbeat{}, err + } + _ = s.recordFabricServiceChannelRouteFeedback(ctx, heartbeat) + _ = s.autoWarmFabricServiceChannelRouteRebuildSnapshotsAfterHeartbeat(ctx, heartbeat) + return heartbeat, nil } func (s *Service) ListNodeHeartbeats(ctx context.Context, actorUserID, clusterID, nodeID string, limit int) ([]NodeHeartbeat, error) { @@ -715,6 +1695,2251 @@ func (s *Service) ListNodeHeartbeats(ctx context.Context, actorUserID, clusterID return s.store.ListNodeHeartbeats(ctx, clusterID, nodeID, limit) } +func (s *Service) ListFabricServiceChannelRouteFeedback(ctx context.Context, actorUserID string, input ListFabricServiceChannelRouteFeedbackInput) ([]FabricServiceChannelRouteFeedbackObservation, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ReporterNodeID = strings.TrimSpace(input.ReporterNodeID) + input.RouteID = strings.TrimSpace(input.RouteID) + input.ServiceClass = strings.TrimSpace(input.ServiceClass) + input.FeedbackStatus = strings.TrimSpace(input.FeedbackStatus) + if input.ClusterID == "" { + return nil, ErrInvalidPayload + } + if input.Now.IsZero() { + input.Now = s.now() + } + observations, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, input) + if err != nil { + return nil, err + } + policy := s.fabricServiceChannelRecoveryPolicy(ctx, input.ClusterID) + intents, err := s.store.ListRouteIntents(ctx, input.ClusterID) + if err != nil { + return nil, err + } + report := serviceChannelRouteFeedbackReportWithPolicyAndProvenance(observations, input.Now, policy, fabricServiceChannelRouteProvenanceFromIntents(intents)) + if report == nil { + return nil, nil + } + return report.Observations, nil +} + +func (s *Service) 
ListFabricServiceChannelRouteRebuildAttempts(ctx context.Context, actorUserID string, input ListFabricServiceChannelRouteRebuildAttemptsInput) ([]FabricServiceChannelRouteRebuildAttempt, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ReporterNodeID = strings.TrimSpace(input.ReporterNodeID) + input.RouteID = strings.TrimSpace(input.RouteID) + input.ReplacementRouteID = strings.TrimSpace(input.ReplacementRouteID) + input.ServiceClass = strings.TrimSpace(input.ServiceClass) + input.RebuildStatus = strings.TrimSpace(input.RebuildStatus) + input.RebuildRequestID = strings.TrimSpace(input.RebuildRequestID) + input.Generation = strings.TrimSpace(input.Generation) + input.FeedbackSource = strings.TrimSpace(input.FeedbackSource) + input.FeedbackChannelID = strings.TrimSpace(input.FeedbackChannelID) + input.FeedbackViolationStatus = strings.TrimSpace(input.FeedbackViolationStatus) + input.EnrichmentMode = strings.TrimSpace(input.EnrichmentMode) + if input.ClusterID == "" { + return nil, ErrInvalidPayload + } + if input.Offset < 0 { + input.Offset = 0 + } + if input.EnrichmentMode == "" { + input.EnrichmentMode = "summary" + } + items, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, input) + if err != nil { + return nil, err + } + if input.EnrichmentMode != "deep" { + return stripFabricServiceChannelRouteRebuildCorrelation(items), nil + } + return s.enrichFabricServiceChannelRouteRebuildAttempts(ctx, input.ClusterID, items, s.now()), nil +} + +func (s *Service) GetFabricServiceChannelRouteRebuildHealthSummary(ctx context.Context, actorUserID string, input GetFabricServiceChannelRouteRebuildHealthSummaryInput) (FabricServiceChannelRouteRebuildHealthSummary, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelRouteRebuildHealthSummary{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) 
+ if input.ClusterID == "" { + return FabricServiceChannelRouteRebuildHealthSummary{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 500 { + input.Limit = 200 + } + now := s.now() + if now.IsZero() { + now = time.Now().UTC() + } + items, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + UseCachedSnapshot: true, + }) + if err != nil { + return FabricServiceChannelRouteRebuildHealthSummary{}, err + } + items = s.enrichFabricServiceChannelRouteRebuildAttempts(ctx, input.ClusterID, items, now) + silences, err := s.store.ListFabricServiceChannelRouteRebuildAlertSilences(ctx, input.ClusterID, now) + if err != nil { + return FabricServiceChannelRouteRebuildHealthSummary{}, err + } + items = applyFabricServiceChannelRouteRebuildAlertSilences(items, silences) + summary := FabricServiceChannelRouteRebuildHealthSummary{ + ClusterID: input.ClusterID, + ObservedAt: now.UTC(), + WindowLimit: input.Limit, + TotalAttempts: len(items), + CountsByGuardStatus: map[string]int{}, + CountsByGuardSeverity: map[string]int{}, + } + affectedNodes := map[string]struct{}{} + affectedRoutes := map[string]struct{}{} + feedbackBreakdowns := map[string]*fabricServiceChannelRebuildFeedbackBreakdownAccumulator{} + for _, item := range items { + severity := firstNonEmptyString(item.GuardSeverity, "unknown") + status := firstNonEmptyString(item.GuardStatus, "unknown") + summary.CountsByGuardSeverity[severity]++ + summary.CountsByGuardStatus[status]++ + switch severity { + case "good": + summary.GoodCount++ + case "warn": + summary.WarnCount++ + if !item.AlertSilenced { + summary.ActiveWarnCount++ + } + case "bad": + summary.BadCount++ + if !item.AlertSilenced { + summary.ActiveBadCount++ + } + default: + summary.UnknownCount++ + } + if item.AlertSilenced { + summary.SilencedCount++ + } + if item.AlertResurfaced { + summary.ResurfacedCount++ + } + if item.RebuildStatus 
== "applied" { + summary.AppliedCount++ + } else if item.RebuildStatus != "" { + summary.PendingCount++ + } + if (severity == "bad" || severity == "warn") && !item.AlertSilenced { + if item.ReporterNodeID != "" { + affectedNodes[item.ReporterNodeID] = struct{}{} + } + if item.RouteID != "" { + affectedRoutes[item.RouteID] = struct{}{} + } + } + if severity == "bad" && !item.AlertSilenced && len(summary.MostRecentBadAttempts) < 10 { + summary.MostRecentBadAttempts = append(summary.MostRecentBadAttempts, item) + } + if item.AlertResurfaced && len(summary.ResurfacedAttempts) < 10 { + summary.ResurfacedAttempts = append(summary.ResurfacedAttempts, item) + } + addFabricServiceChannelRebuildFeedbackBreakdown(feedbackBreakdowns, item, severity) + } + if accessTelemetry, err := s.GetFabricServiceChannelAccessTelemetry(ctx, actorUserID, GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + Now: now, + }); err == nil { + summary.AccessRouteDecisionCount = accessTelemetry.RouteDecisionChannelCount + summary.AccessReplacementCount = accessTelemetry.ReplacementDecisionCount + summary.AccessAppliedCount = accessTelemetry.AppliedRebuildDecisionCount + summary.AccessRecoveryCount = accessTelemetry.RecoveryDecisionCount + summary.AccessNoSafeCount = accessTelemetry.NoSafeRecoveryDecisionCount + accessIncidents := append( + fabricServiceChannelAccessDecisionIncidents(input.ClusterID, accessTelemetry), + fabricServiceChannelDataPlaneContractIncidents(input.ClusterID, accessTelemetry)..., + ) + for _, incident := range applyFabricServiceChannelAccessDecisionIncidentSilences(accessIncidents, silences) { + summary.CountsByGuardStatus[incident.GuardStatus]++ + summary.CountsByGuardSeverity[incident.GuardSeverity]++ + if incident.AlertSilenced { + summary.SilencedCount++ + } + if incident.AlertResurfaced { + summary.ResurfacedCount++ + } + switch incident.GuardSeverity { + case "good": + summary.GoodCount++ + case "warn": + summary.WarnCount++ 
+ if !incident.AlertSilenced { + summary.ActiveWarnCount++ + } + case "bad": + summary.BadCount++ + if !incident.AlertSilenced { + summary.ActiveBadCount++ + } + default: + summary.UnknownCount++ + } + if (incident.GuardSeverity == "bad" || incident.GuardSeverity == "warn") && !incident.AlertSilenced { + if incident.ReporterNodeID != "" { + affectedNodes[incident.ReporterNodeID] = struct{}{} + } + if incident.RouteID != "" { + affectedRoutes[incident.RouteID] = struct{}{} + } + } + } + } + summary.AffectedReporterNodeIDs = sortedStringSetKeys(affectedNodes) + summary.AffectedRouteIDs = sortedStringSetKeys(affectedRoutes) + summary.FeedbackBreakdowns = sortedFabricServiceChannelRebuildFeedbackBreakdowns(feedbackBreakdowns) + summary.RecommendedOperatorAction = fabricServiceChannelRebuildRecommendedAction(summary) + return summary, nil +} + +type fabricServiceChannelRebuildFeedbackBreakdownAccumulator struct { + item FabricServiceChannelRouteRebuildFeedbackHealthBreakdown + nodes map[string]struct{} + routes map[string]struct{} +} + +func addFabricServiceChannelRebuildFeedbackBreakdown(out map[string]*fabricServiceChannelRebuildFeedbackBreakdownAccumulator, attempt FabricServiceChannelRouteRebuildAttempt, severity string) { + payload := jsonObject(attempt.Payload) + source := firstNonEmptyString(attempt.FeedbackSource, jsonString(payload, "feedback_source")) + channelID := firstNonEmptyString(attempt.FeedbackChannelID, jsonString(payload, "feedback_channel_id")) + violationStatus := firstNonEmptyString(attempt.FeedbackViolationStatus, jsonString(payload, "feedback_violation_status")) + if source == "" && channelID == "" && violationStatus == "" { + return + } + key := source + "\x00" + channelID + "\x00" + violationStatus + acc := out[key] + if acc == nil { + acc = &fabricServiceChannelRebuildFeedbackBreakdownAccumulator{ + item: FabricServiceChannelRouteRebuildFeedbackHealthBreakdown{ + FeedbackSource: source, + FeedbackChannelID: channelID, + 
FeedbackViolationStatus: violationStatus, + }, + nodes: map[string]struct{}{}, + routes: map[string]struct{}{}, + } + out[key] = acc + } + acc.item.TotalCount++ + switch severity { + case "good": + acc.item.GoodCount++ + case "warn": + acc.item.WarnCount++ + if !attempt.AlertSilenced { + acc.item.ActiveWarnCount++ + } + case "bad": + acc.item.BadCount++ + if !attempt.AlertSilenced { + acc.item.ActiveBadCount++ + } + default: + acc.item.UnknownCount++ + } + if attempt.AlertSilenced { + acc.item.SilencedCount++ + } + observedAt := time.Time{} + if attempt.FeedbackObservedAt != nil { + observedAt = attempt.FeedbackObservedAt.UTC() + } else if value := strings.TrimSpace(jsonString(payload, "feedback_observed_at")); value != "" { + if parsed, err := time.Parse(time.RFC3339Nano, value); err == nil { + observedAt = parsed.UTC() + } + } + if observedAt.IsZero() { + observedAt = attempt.UpdatedAt.UTC() + } + if observedAt.After(acc.item.LatestObservedAt) { + acc.item.LatestObservedAt = observedAt + } + if attempt.ReporterNodeID != "" { + acc.nodes[attempt.ReporterNodeID] = struct{}{} + } + if attempt.RouteID != "" { + acc.routes[attempt.RouteID] = struct{}{} + } +} + +func sortedFabricServiceChannelRebuildFeedbackBreakdowns(input map[string]*fabricServiceChannelRebuildFeedbackBreakdownAccumulator) []FabricServiceChannelRouteRebuildFeedbackHealthBreakdown { + out := make([]FabricServiceChannelRouteRebuildFeedbackHealthBreakdown, 0, len(input)) + for _, acc := range input { + item := acc.item + item.AffectedReporterNodeIDs = sortedStringSetKeys(acc.nodes) + item.AffectedRouteIDs = sortedStringSetKeys(acc.routes) + out = append(out, item) + } + sort.SliceStable(out, func(i, j int) bool { + leftActive := out[i].ActiveBadCount*100000 + out[i].ActiveWarnCount*1000 + out[i].TotalCount + rightActive := out[j].ActiveBadCount*100000 + out[j].ActiveWarnCount*1000 + out[j].TotalCount + if leftActive != rightActive { + return leftActive > rightActive + } + if 
!out[i].LatestObservedAt.Equal(out[j].LatestObservedAt) { + return out[i].LatestObservedAt.After(out[j].LatestObservedAt) + } + left := out[i].FeedbackSource + out[i].FeedbackChannelID + out[i].FeedbackViolationStatus + right := out[j].FeedbackSource + out[j].FeedbackChannelID + out[j].FeedbackViolationStatus + return left < right + }) + if len(out) > 100 { + out = out[:100] + } + return out +} + +func (s *Service) GetFabricServiceChannelReadiness(ctx context.Context, actorUserID string, input GetFabricServiceChannelReadinessInput) (FabricServiceChannelReadiness, error) { + if input.Limit <= 0 || input.Limit > 5 { + input.Limit = 5 + } + summary, err := s.GetFabricServiceChannelRouteRebuildHealthSummary(ctx, actorUserID, GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + }) + if err != nil { + return FabricServiceChannelReadiness{}, err + } + return fabricServiceChannelReadinessFromRebuildHealth(summary), nil +} + +func (s *Service) GetFabricServiceChannelSchemaStatus(ctx context.Context, actorUserID string, input GetFabricServiceChannelSchemaStatusInput) (FabricServiceChannelSchemaStatus, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelSchemaStatus{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelSchemaStatus{}, ErrInvalidPayload + } + return s.store.GetFabricServiceChannelSchemaStatus(ctx, input) +} + +func (s *Service) GetFabricServiceChannelRebuildSnapshotMaintenanceHealth(ctx context.Context, actorUserID string, input GetFabricServiceChannelRebuildSnapshotMaintenanceHealthInput) (FabricServiceChannelRebuildSnapshotMaintenanceHealth, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelRebuildSnapshotMaintenanceHealth{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return 
FabricServiceChannelRebuildSnapshotMaintenanceHealth{}, ErrInvalidPayload + } + if input.Limit <= 0 { + input.Limit = 50 + } + if input.Limit > 100 { + input.Limit = 100 + } + if input.MinAgeSeconds <= 0 { + input.MinAgeSeconds = 60 + } + if input.MinAgeSeconds > 3600 { + input.MinAgeSeconds = 3600 + } + if input.HeartbeatThreshold <= 0 { + input.HeartbeatThreshold = 2 + } + if input.HeartbeatThreshold > 10 { + input.HeartbeatThreshold = 10 + } + now := s.now() + if now.IsZero() { + now = time.Now().UTC() + } + out := FabricServiceChannelRebuildSnapshotMaintenanceHealth{ + ClusterID: input.ClusterID, + ObservedAt: now.UTC(), + Status: "ready", + Reason: "snapshot_maintenance_ready", + WindowLimit: input.Limit, + MinAgeSeconds: input.MinAgeSeconds, + HeartbeatThreshold: input.HeartbeatThreshold, + } + attempts, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + }) + if err != nil { + return FabricServiceChannelRebuildSnapshotMaintenanceHealth{}, err + } + heartbeatsByNode := map[string][]NodeHeartbeat{} + nodes := map[string]*FabricServiceChannelRebuildSnapshotNodeHealth{} + nodeHealth := func(nodeID string) *FabricServiceChannelRebuildSnapshotNodeHealth { + nodeID = strings.TrimSpace(nodeID) + if nodeID == "" { + nodeID = "unknown" + } + if item, ok := nodes[nodeID]; ok { + return item + } + item := &FabricServiceChannelRebuildSnapshotNodeHealth{NodeID: nodeID} + nodes[nodeID] = item + return item + } + for _, attempt := range attempts { + out.RecentAttemptCount++ + node := nodeHealth(attempt.ReporterNodeID) + node.RecentAttemptCount++ + if fabricServiceChannelRouteRebuildHasCorrelationSnapshot(attempt) { + out.ValidSnapshotCount++ + node.ValidSnapshotCount++ + continue + } + out.MissingSnapshotCount++ + node.MissingSnapshotCount++ + ageSeconds := int64(now.Sub(attempt.UpdatedAt).Seconds()) + if ageSeconds < input.MinAgeSeconds { + continue + } + 
reporterNodeID := strings.TrimSpace(attempt.ReporterNodeID) + if reporterNodeID == "" { + continue + } + heartbeats, ok := heartbeatsByNode[reporterNodeID] + if !ok { + heartbeats, err = s.store.ListNodeHeartbeats(ctx, input.ClusterID, reporterNodeID, input.HeartbeatThreshold+5) + if err != nil { + heartbeats = nil + } + heartbeatsByNode[reporterNodeID] = heartbeats + } + heartbeatAfterAttemptCount := 0 + for _, heartbeat := range heartbeats { + observedAt := heartbeat.ObservedAt + if node.LastHeartbeatAt == nil || observedAt.After(*node.LastHeartbeatAt) { + value := observedAt + node.LastHeartbeatAt = &value + } + if observedAt.After(attempt.UpdatedAt) || observedAt.Equal(attempt.UpdatedAt) { + heartbeatAfterAttemptCount++ + } + } + if heartbeatAfterAttemptCount > node.HeartbeatAfterAttemptCount { + node.HeartbeatAfterAttemptCount = heartbeatAfterAttemptCount + } + if heartbeatAfterAttemptCount >= input.HeartbeatThreshold { + out.OverdueMissingSnapshotCount++ + node.OverdueMissingSnapshotCount++ + if len(out.OverdueMissingSnapshotAttempts) < 10 { + out.OverdueMissingSnapshotAttempts = append(out.OverdueMissingSnapshotAttempts, attempt) + } + } + } + events, err := s.store.ListAuditEvents(ctx, ListAuditEventsInput{ + ClusterID: input.ClusterID, + EventTypes: []string{"fabric.service_channel_rebuild_snapshot.auto_warmup"}, + Limit: 100, + }) + if err != nil { + return FabricServiceChannelRebuildSnapshotMaintenanceHealth{}, err + } + for _, event := range events { + if event.EventType != "fabric.service_channel_rebuild_snapshot.auto_warmup" { + continue + } + payload := jsonObject(event.Payload) + nodeID := jsonString(payload, "reporter_node_id") + node := nodeHealth(nodeID) + out.AutoWarmupEventCount++ + out.AutoWarmupWarmedCount += jsonInt(payload, "warmed_count") + out.AutoWarmupAlreadyFreshCount += jsonInt(payload, "already_fresh_count") + out.AutoWarmupErrorCount += jsonInt(payload, "error_count") + node.AutoWarmupEventCount++ + node.AutoWarmupWarmedCount += 
jsonInt(payload, "warmed_count") + node.AutoWarmupErrorCount += jsonInt(payload, "error_count") + createdAt := event.CreatedAt + if out.LatestAutoWarmupAt == nil || createdAt.After(*out.LatestAutoWarmupAt) { + value := createdAt + out.LatestAutoWarmupAt = &value + } + if node.LatestAutoWarmupAt == nil || createdAt.After(*node.LatestAutoWarmupAt) { + value := createdAt + node.LatestAutoWarmupAt = &value + } + } + out.Nodes = make([]FabricServiceChannelRebuildSnapshotNodeHealth, 0, len(nodes)) + for _, item := range nodes { + out.Nodes = append(out.Nodes, *item) + } + sort.Slice(out.Nodes, func(i, j int) bool { + if out.Nodes[i].OverdueMissingSnapshotCount != out.Nodes[j].OverdueMissingSnapshotCount { + return out.Nodes[i].OverdueMissingSnapshotCount > out.Nodes[j].OverdueMissingSnapshotCount + } + if out.Nodes[i].MissingSnapshotCount != out.Nodes[j].MissingSnapshotCount { + return out.Nodes[i].MissingSnapshotCount > out.Nodes[j].MissingSnapshotCount + } + return out.Nodes[i].NodeID < out.Nodes[j].NodeID + }) + if out.AutoWarmupErrorCount > 0 { + out.Status = "degraded" + out.Reason = "auto_warmup_errors_seen" + out.RecommendedOperatorAction = "Check backend logs and heartbeat metadata for nodes with auto-warmup errors." + } + if out.OverdueMissingSnapshotCount > 0 { + out.Status = "degraded" + out.Reason = "snapshot_warmup_overdue" + out.RecommendedOperatorAction = "Run warm snapshots or inspect reporter nodes whose heartbeat evidence is not producing rebuild snapshots." + } + if out.MissingSnapshotCount > 0 && out.OverdueMissingSnapshotCount == 0 && out.RecommendedOperatorAction == "" { + out.RecommendedOperatorAction = "Recent attempts are still waiting for runtime heartbeat evidence." 
+ } + return out, nil +} + +func (s *Service) WarmupFabricServiceChannelRebuildSnapshots(ctx context.Context, input WarmupFabricServiceChannelRebuildSnapshotsInput) (FabricServiceChannelRebuildSnapshotWarmup, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelRebuildSnapshotWarmup{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelRebuildSnapshotWarmup{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 50 { + input.Limit = 10 + } + if input.StaleAfterSeconds <= 0 || input.StaleAfterSeconds > int64((24*time.Hour).Seconds()) { + input.StaleAfterSeconds = 60 + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + result := FabricServiceChannelRebuildSnapshotWarmup{ + ClusterID: input.ClusterID, + ObservedAt: now.UTC(), + WindowLimit: input.Limit, + StaleAfterSeconds: input.StaleAfterSeconds, + Status: "ready", + Reason: "snapshots_warmed", + } + items, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + }) + if err != nil { + return FabricServiceChannelRebuildSnapshotWarmup{}, err + } + result.ScannedCount = len(items) + heartbeatsByNode := map[string][]NodeHeartbeat{} + staleAfter := time.Duration(input.StaleAfterSeconds) * time.Second + for _, item := range items { + if !fabricServiceChannelRouteRebuildHasCorrelationSnapshot(item) { + result.MissingSnapshotCount++ + } else if fabricServiceChannelRouteRebuildSnapshotIsStale(item, now, staleAfter) { + result.StaleSnapshotCount++ + result.DeferredStaleCount++ + continue + } else { + result.AlreadyFreshCount++ + continue + } + nodeID := strings.TrimSpace(item.ReporterNodeID) + if nodeID == "" { + result.ErrorCount++ + continue + } + if _, ok := heartbeatsByNode[nodeID]; !ok { + heartbeats, err := 
s.store.ListNodeHeartbeats(ctx, input.ClusterID, nodeID, 120) + if err != nil { + result.ErrorCount++ + heartbeats = nil + } + heartbeatsByNode[nodeID] = heartbeats + } + item = enrichFabricServiceChannelRouteRebuildAttempt(item, heartbeatsByNode[nodeID], now) + item.CorrelationSnapshotAt = &now + if err := s.store.UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(ctx, fabricServiceChannelRouteRebuildCorrelationSnapshotInput(item, now)); err != nil { + result.ErrorCount++ + continue + } + result.WarmedCount++ + } + if result.ErrorCount > 0 { + result.Status = "degraded" + result.Reason = "snapshot_warmup_partial" + result.RecommendedOperatorAction = "Check node heartbeat history and backend logs for rebuild snapshot warmup failures." + } else if result.DeferredStaleCount > 0 { + result.Status = "ready" + result.Reason = "missing_snapshots_warmed_stale_deferred" + result.RecommendedOperatorAction = "Stale snapshots were detected and left cached; age-sensitive guard state is recomputed on read." 
+ } + return result, nil +} + +func (s *Service) ListFabricServiceChannelRouteRebuildIncidents(ctx context.Context, actorUserID string, input ListFabricServiceChannelRouteRebuildIncidentsInput) ([]FabricServiceChannelRouteRebuildIncident, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return nil, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 5 { + input.Limit = 5 + } + now := s.now() + if now.IsZero() { + now = time.Now().UTC() + } + items, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + UseCachedSnapshot: true, + }) + if err != nil { + return nil, err + } + items = s.enrichFabricServiceChannelRouteRebuildAttempts(ctx, input.ClusterID, items, now) + silences, err := s.store.ListFabricServiceChannelRouteRebuildAlertSilences(ctx, input.ClusterID, now) + if err != nil { + return nil, err + } + items = applyFabricServiceChannelRouteRebuildAlertSilences(items, silences) + incidents := fabricServiceChannelRouteRebuildIncidentsFromAttempts(input.ClusterID, items) + if accessTelemetry, err := s.GetFabricServiceChannelAccessTelemetry(ctx, actorUserID, GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: input.ClusterID, + Limit: input.Limit, + Now: now, + }); err == nil { + accessIncidents := append( + fabricServiceChannelAccessDecisionIncidents(input.ClusterID, accessTelemetry), + fabricServiceChannelDataPlaneContractIncidents(input.ClusterID, accessTelemetry)..., + ) + incidents = append(incidents, applyFabricServiceChannelAccessDecisionIncidentSilences(accessIncidents, silences)...) 
+ fabricServiceChannelSortRouteRebuildIncidents(incidents) + } + if len(incidents) > input.Limit { + incidents = incidents[:input.Limit] + } + return incidents, nil +} + +func (s *Service) RecordFabricServiceChannelRouteRebuildInvestigation(ctx context.Context, input RecordFabricServiceChannelRouteRebuildInvestigationInput) error { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ReporterNodeID = strings.TrimSpace(input.ReporterNodeID) + input.RouteID = strings.TrimSpace(input.RouteID) + input.ServiceClass = strings.TrimSpace(input.ServiceClass) + input.Generation = strings.TrimSpace(input.Generation) + input.GuardStatus = strings.TrimSpace(input.GuardStatus) + input.IncidentID = strings.TrimSpace(input.IncidentID) + input.FeedbackSource = strings.TrimSpace(input.FeedbackSource) + input.FeedbackChannelID = strings.TrimSpace(input.FeedbackChannelID) + input.FeedbackViolationStatus = strings.TrimSpace(input.FeedbackViolationStatus) + input.DrilldownSource = strings.TrimSpace(input.DrilldownSource) + input.Reason = strings.TrimSpace(input.Reason) + if input.ClusterID == "" || (input.ReporterNodeID == "" && input.RouteID == "" && input.FeedbackSource == "" && input.FeedbackChannelID == "" && input.FeedbackViolationStatus == "") { + return ErrInvalidPayload + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + eventType := "fabric.service_channel_rebuild_incident.investigation_opened" + targetType := "fabric_service_channel_route_rebuild_incident" + targetIDValue := firstNonEmptyString(input.RouteID, input.FeedbackChannelID, input.FeedbackViolationStatus, input.FeedbackSource, input.ReporterNodeID) + if input.DrilldownSource == "rebuild_health_feedback_breakdown" || input.FeedbackSource != "" || input.FeedbackChannelID != "" || input.FeedbackViolationStatus != "" { + eventType = 
"fabric.service_channel_rebuild_feedback_breakdown.investigation_opened" + targetType = "fabric_service_channel_rebuild_feedback_breakdown" + } + return s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: eventType, + TargetType: targetType, + TargetID: &targetIDValue, + Payload: mustJSONRaw(map[string]any{ + "incident_id": input.IncidentID, + "reporter_node_id": input.ReporterNodeID, + "route_id": input.RouteID, + "service_class": input.ServiceClass, + "generation": input.Generation, + "guard_status": input.GuardStatus, + "feedback_source": input.FeedbackSource, + "feedback_channel_id": input.FeedbackChannelID, + "feedback_violation_status": input.FeedbackViolationStatus, + "drilldown_source": input.DrilldownSource, + "reason": input.Reason, + }), + CreatedAt: now.UTC(), + }) +} + +func (s *Service) SilenceFabricServiceChannelRouteRebuildAlert(ctx context.Context, input SilenceFabricServiceChannelRouteRebuildAlertInput) (FabricServiceChannelRouteRebuildAlertSilence, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelRouteRebuildAlertSilence{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ReporterNodeID = strings.TrimSpace(input.ReporterNodeID) + input.RouteID = strings.TrimSpace(input.RouteID) + input.GuardStatus = strings.TrimSpace(input.GuardStatus) + input.Generation = strings.TrimSpace(input.Generation) + input.Reason = strings.TrimSpace(input.Reason) + input.IncidentSource = strings.TrimSpace(input.IncidentSource) + input.ChannelID = strings.TrimSpace(input.ChannelID) + if input.ClusterID == "" || input.ReporterNodeID == "" || input.RouteID == "" || input.GuardStatus == "" { + return FabricServiceChannelRouteRebuildAlertSilence{}, ErrInvalidPayload + } + requestedRouteID := input.RouteID + if input.IncidentSource == "access_decision" || input.IncidentSource == "data_plane_contract" { + if input.ChannelID 
== "" { + return FabricServiceChannelRouteRebuildAlertSilence{}, ErrInvalidPayload + } + input.RouteID = fabricServiceChannelAccessDecisionSilenceRouteID(input.ChannelID, input.RouteID) + } + if input.TTL <= 0 || input.TTL > 7*24*time.Hour { + input.TTL = 6 * time.Hour + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + expiresAt := now.UTC().Add(input.TTL) + silence, err := s.store.UpsertFabricServiceChannelRouteRebuildAlertSilence(ctx, input, expiresAt) + if err != nil { + return FabricServiceChannelRouteRebuildAlertSilence{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel_rebuild_alert.silenced", + TargetType: "fabric_service_channel_route_rebuild_alert", + TargetID: &input.RouteID, + Payload: mustJSONRaw(map[string]any{ + "reporter_node_id": input.ReporterNodeID, + "route_id": requestedRouteID, + "stored_route_id": input.RouteID, + "incident_source": input.IncidentSource, + "channel_id": input.ChannelID, + "guard_status": input.GuardStatus, + "generation": input.Generation, + "reason": input.Reason, + "expires_at": expiresAt.UTC().Format(time.RFC3339Nano), + }), + CreatedAt: now.UTC(), + }) + return silence, nil +} + +func (s *Service) ListFabricServiceChannelRouteRebuildAlertSilences(ctx context.Context, actorUserID string, clusterID string, now time.Time) ([]FabricServiceChannelRouteRebuildAlertSilence, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + clusterID = strings.TrimSpace(clusterID) + if clusterID == "" { + return nil, ErrInvalidPayload + } + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + return s.store.ListFabricServiceChannelRouteRebuildAlertSilences(ctx, clusterID, now) +} + +func (s *Service) UnsilenceFabricServiceChannelRouteRebuildAlert(ctx context.Context, input 
UnsilenceFabricServiceChannelRouteRebuildAlertInput) (FabricServiceChannelRouteRebuildAlertSilence, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelRouteRebuildAlertSilence{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.SilenceID = strings.TrimSpace(input.SilenceID) + input.Reason = strings.TrimSpace(input.Reason) + if input.ClusterID == "" || input.SilenceID == "" { + return FabricServiceChannelRouteRebuildAlertSilence{}, ErrInvalidPayload + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + silence, err := s.store.DeleteFabricServiceChannelRouteRebuildAlertSilence(ctx, input) + if err != nil { + return FabricServiceChannelRouteRebuildAlertSilence{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel_rebuild_alert.unsilenced", + TargetType: "fabric_service_channel_route_rebuild_alert_silence", + TargetID: &input.SilenceID, + Payload: mustJSONRaw(map[string]any{ + "reporter_node_id": silence.ReporterNodeID, + "route_id": silence.DisplayRouteID, + "stored_route_id": silence.RouteID, + "incident_source": silence.IncidentSource, + "channel_id": silence.ChannelID, + "guard_status": silence.GuardStatus, + "generation": silence.Generation, + "reason": input.Reason, + "unsilenced_at": now.UTC().Format(time.RFC3339Nano), + }), + CreatedAt: now.UTC(), + }) + return silence, nil +} + +func (s *Service) enrichFabricServiceChannelRouteRebuildAttempts(ctx context.Context, clusterID string, items []FabricServiceChannelRouteRebuildAttempt, now time.Time) []FabricServiceChannelRouteRebuildAttempt { + if len(items) == 0 { + return items + } + if now.IsZero() { + now = time.Now().UTC() + } + heartbeatsByNode := map[string][]NodeHeartbeat{} + for idx := range items { + if 
fabricServiceChannelRouteRebuildHasCorrelationSnapshot(items[idx]) { + items[idx] = applyFabricServiceChannelRouteRebuildGuard(items[idx], now) + continue + } + nodeID := strings.TrimSpace(items[idx].ReporterNodeID) + if nodeID == "" { + continue + } + if _, ok := heartbeatsByNode[nodeID]; !ok { + heartbeats, err := s.store.ListNodeHeartbeats(ctx, clusterID, nodeID, 120) + if err != nil { + heartbeats = nil + } + heartbeatsByNode[nodeID] = heartbeats + } + items[idx] = enrichFabricServiceChannelRouteRebuildAttempt(items[idx], heartbeatsByNode[nodeID], now) + if fabricServiceChannelRouteRebuildHasRuntimeEvidence(items[idx]) { + items[idx].CorrelationSnapshotAt = &now + _ = s.store.UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(ctx, fabricServiceChannelRouteRebuildCorrelationSnapshotInput(items[idx], now)) + } + } + return items +} + +func fabricServiceChannelRouteRebuildHasCorrelationSnapshot(item FabricServiceChannelRouteRebuildAttempt) bool { + return item.CorrelationSnapshotAt != nil && fabricServiceChannelRouteRebuildHasRuntimeEvidence(item) +} + +func fabricServiceChannelRouteRebuildHasRuntimeEvidence(item FabricServiceChannelRouteRebuildAttempt) bool { + return item.NodeTransitionMatched || + item.NodeRouteGenerationMatched || + item.PostRebuildSelectedRouteID != "" || + item.PostRebuildSendPackets > 0 || + item.PostRebuildSendFlowPackets > 0 +} + +func fabricServiceChannelRouteRebuildSnapshotIsStale(item FabricServiceChannelRouteRebuildAttempt, now time.Time, staleAfter time.Duration) bool { + if item.CorrelationSnapshotAt == nil { + return true + } + if staleAfter <= 0 { + return false + } + snapshotAt := item.CorrelationSnapshotAt.UTC() + if snapshotAt.IsZero() { + return true + } + if now.IsZero() { + now = time.Now().UTC() + } + return now.UTC().Sub(snapshotAt) > staleAfter +} + +func stripFabricServiceChannelRouteRebuildCorrelation(items []FabricServiceChannelRouteRebuildAttempt) []FabricServiceChannelRouteRebuildAttempt { + for idx := range 
items { + items[idx].NodeTransitionStatus = "" + items[idx].NodeTransitionGeneration = "" + items[idx].NodeTransitionObservedAt = "" + items[idx].NodeTransitionMatched = false + items[idx].NodeRouteGenerationStatus = "" + items[idx].NodeRouteGenerationAppliedAt = "" + items[idx].NodeRouteGenerationWithdrawnAt = "" + items[idx].NodeRouteGenerationMatched = false + items[idx].PostRebuildSelectedRouteID = "" + items[idx].PostRebuildSendPackets = 0 + items[idx].PostRebuildSendFailures = 0 + items[idx].PostRebuildSendFlowPackets = 0 + items[idx].PostRebuildSendFlowDropped = 0 + items[idx].GuardStatus = "" + items[idx].GuardSeverity = "" + items[idx].GuardReason = "" + items[idx].GuardAgeSeconds = 0 + items[idx].GuardTransitionDeadlineSeconds = 0 + items[idx].GuardTrafficDeadlineSeconds = 0 + items[idx].Timeline = nil + items[idx].CorrelationSnapshotAt = nil + } + return items +} + +func fabricServiceChannelRouteRebuildCorrelationSnapshotInput(item FabricServiceChannelRouteRebuildAttempt, now time.Time) UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput { + if now.IsZero() { + now = time.Now().UTC() + } + return UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput{ + ID: item.ID, + NodeTransitionStatus: item.NodeTransitionStatus, + NodeTransitionGeneration: item.NodeTransitionGeneration, + NodeTransitionObservedAt: item.NodeTransitionObservedAt, + NodeTransitionMatched: item.NodeTransitionMatched, + NodeRouteGenerationStatus: item.NodeRouteGenerationStatus, + NodeRouteGenerationAppliedAt: item.NodeRouteGenerationAppliedAt, + NodeRouteGenerationWithdrawnAt: item.NodeRouteGenerationWithdrawnAt, + NodeRouteGenerationMatched: item.NodeRouteGenerationMatched, + PostRebuildSelectedRouteID: item.PostRebuildSelectedRouteID, + PostRebuildSendPackets: item.PostRebuildSendPackets, + PostRebuildSendFailures: item.PostRebuildSendFailures, + PostRebuildSendFlowPackets: item.PostRebuildSendFlowPackets, + PostRebuildSendFlowDropped: item.PostRebuildSendFlowDropped, 
+ GuardStatus: item.GuardStatus, + GuardSeverity: item.GuardSeverity, + GuardReason: item.GuardReason, + GuardTransitionDeadlineSeconds: item.GuardTransitionDeadlineSeconds, + GuardTrafficDeadlineSeconds: item.GuardTrafficDeadlineSeconds, + Timeline: item.Timeline, + CorrelationSnapshotAt: now.UTC(), + } +} + +func enrichFabricServiceChannelRouteRebuildAttempt(item FabricServiceChannelRouteRebuildAttempt, heartbeats []NodeHeartbeat, now time.Time) FabricServiceChannelRouteRebuildAttempt { + item.Timeline = append(item.Timeline, FabricServiceChannelRouteRebuildTimelineEvent{ + Stage: "backend_decision", + Status: firstNonEmptyString(item.RebuildStatus, "unknown"), + At: item.UpdatedAt.UTC().Format(time.RFC3339Nano), + RouteID: item.RouteID, + Generation: item.Generation, + Payload: mustJSONRaw(map[string]any{ + "rebuild_request_id": item.RebuildRequestID, + "decision_source": item.DecisionSource, + "outcome": item.Outcome, + "replacement_route_id": item.ReplacementRouteID, + "rebuild_reason": item.RebuildReason, + }), + }) + for _, heartbeat := range heartbeats { + metadata := jsonObject(heartbeat.Metadata) + runtime := jsonMapPath(metadata, "fabric_service_channel_runtime_report") + ingress := jsonMapPath(runtime, "ingress") + transition := jsonMapPath(ingress, "route_manager_transition") + if !item.NodeTransitionMatched && transitionMatchesRebuildAttempt(transition, item) { + item.NodeTransitionMatched = true + item.NodeTransitionStatus = jsonString(transition, "status") + item.NodeTransitionGeneration = jsonString(transition, "generation") + item.NodeTransitionObservedAt = firstNonEmptyString(jsonString(transition, "observed_at"), heartbeat.ObservedAt.UTC().Format(time.RFC3339Nano)) + item.Timeline = append(item.Timeline, FabricServiceChannelRouteRebuildTimelineEvent{ + Stage: "node_route_manager_transition", + Status: item.NodeTransitionStatus, + At: item.NodeTransitionObservedAt, + RouteID: item.RouteID, + Generation: item.NodeTransitionGeneration, + Payload: 
mustJSONRaw(transition), + }) + } + routeGeneration := jsonMapPath(metadata, "mesh_route_generation_report") + if !item.NodeRouteGenerationMatched { + if decision, ok := routeGenerationDecisionForAttempt(routeGeneration, item); ok { + item.NodeRouteGenerationMatched = true + item.NodeRouteGenerationStatus = firstNonEmptyString(jsonString(decision, "status"), jsonString(decision, "apply_status"), jsonString(decision, "withdraw_status")) + item.NodeRouteGenerationAppliedAt = jsonString(decision, "applied_at") + item.NodeRouteGenerationWithdrawnAt = jsonString(decision, "withdrawn_at") + item.Timeline = append(item.Timeline, FabricServiceChannelRouteRebuildTimelineEvent{ + Stage: "node_route_generation_apply", + Status: item.NodeRouteGenerationStatus, + At: firstNonEmptyString(item.NodeRouteGenerationAppliedAt, item.NodeRouteGenerationWithdrawnAt, heartbeat.ObservedAt.UTC().Format(time.RFC3339Nano)), + RouteID: item.RouteID, + Generation: jsonString(decision, "generation"), + Payload: mustJSONRaw(decision), + }) + } + } + if item.PostRebuildSelectedRouteID == "" && !heartbeat.ObservedAt.Before(item.UpdatedAt) { + selectedRouteID := jsonString(ingress, "last_selected_route_id") + if selectedRouteID == item.ReplacementRouteID || selectedRouteID == item.RouteID || selectedRouteID != "" { + item.PostRebuildSelectedRouteID = selectedRouteID + item.PostRebuildSendPackets = jsonUint64(ingress, "send_packets") + item.PostRebuildSendFailures = jsonUint64(ingress, "send_route_failures") + item.PostRebuildSendFlowPackets = jsonUint64(ingress, "send_flow_packets") + item.PostRebuildSendFlowDropped = jsonUint64(ingress, "send_flow_dropped") + item.Timeline = append(item.Timeline, FabricServiceChannelRouteRebuildTimelineEvent{ + Stage: "post_rebuild_traffic", + Status: "observed", + At: heartbeat.ObservedAt.UTC().Format(time.RFC3339Nano), + RouteID: selectedRouteID, + Generation: jsonString(runtime, "config_version"), + Payload: mustJSONRaw(map[string]any{ + 
"last_selected_route_id": selectedRouteID, + "send_packets": item.PostRebuildSendPackets, + "send_route_failures": item.PostRebuildSendFailures, + "send_flow_packets": item.PostRebuildSendFlowPackets, + "send_flow_dropped": item.PostRebuildSendFlowDropped, + "recommended_parallel": jsonUint64(ingress, "recommended_parallel_flow_sends"), + }), + }) + } + } + if item.NodeTransitionMatched && item.NodeRouteGenerationMatched && item.PostRebuildSelectedRouteID != "" { + break + } + } + sort.SliceStable(item.Timeline, func(i, j int) bool { + left, leftErr := time.Parse(time.RFC3339Nano, item.Timeline[i].At) + right, rightErr := time.Parse(time.RFC3339Nano, item.Timeline[j].At) + if leftErr == nil && rightErr == nil && !left.Equal(right) { + return left.Before(right) + } + return item.Timeline[i].Stage < item.Timeline[j].Stage + }) + item = applyFabricServiceChannelRouteRebuildGuard(item, now) + return item +} + +const ( + fabricServiceChannelRebuildTransitionDeadline = 90 * time.Second + fabricServiceChannelRebuildTrafficDeadline = 180 * time.Second +) + +func applyFabricServiceChannelRouteRebuildGuard(item FabricServiceChannelRouteRebuildAttempt, now time.Time) FabricServiceChannelRouteRebuildAttempt { + if now.IsZero() { + now = time.Now().UTC() + } + age := now.Sub(item.UpdatedAt) + if age < 0 { + age = 0 + } + item.GuardAgeSeconds = int64(age / time.Second) + item.GuardTransitionDeadlineSeconds = int64(fabricServiceChannelRebuildTransitionDeadline / time.Second) + item.GuardTrafficDeadlineSeconds = int64(fabricServiceChannelRebuildTrafficDeadline / time.Second) + if item.RebuildStatus == "" { + item.GuardStatus = "unknown" + item.GuardSeverity = "warn" + item.GuardReason = "missing_backend_rebuild_status" + return item + } + if item.RebuildStatus == "pending_degraded_fallback" { + if item.NodeTransitionMatched { + item.GuardStatus = "pending_degraded_fallback_seen" + item.GuardSeverity = "warn" + item.GuardReason = "node_confirmed_pending_degraded_fallback" + return 
item + } + if age > fabricServiceChannelRebuildTransitionDeadline { + item.GuardStatus = "missing_node_transition" + item.GuardSeverity = "bad" + item.GuardReason = "node_did_not_report_pending_fallback_transition" + return item + } + item.GuardStatus = "pending_node_transition" + item.GuardSeverity = "warn" + item.GuardReason = "waiting_for_node_pending_fallback_transition" + return item + } + if item.RebuildStatus != "applied" { + item.GuardStatus = "not_applied" + item.GuardSeverity = "warn" + item.GuardReason = "backend_rebuild_not_applied" + return item + } + if !item.NodeTransitionMatched { + if age > fabricServiceChannelRebuildTransitionDeadline { + item.GuardStatus = "missing_node_transition" + item.GuardSeverity = "bad" + item.GuardReason = "node_did_not_report_applied_rebuild_transition" + return item + } + item.GuardStatus = "pending_node_transition" + item.GuardSeverity = "warn" + item.GuardReason = "waiting_for_node_applied_rebuild_transition" + return item + } + if !item.NodeRouteGenerationMatched { + if age > fabricServiceChannelRebuildTransitionDeadline { + item.GuardStatus = "missing_route_generation" + item.GuardSeverity = "bad" + item.GuardReason = "node_transition_seen_but_route_generation_not_correlated" + return item + } + item.GuardStatus = "pending_route_generation" + item.GuardSeverity = "warn" + item.GuardReason = "waiting_for_route_generation_correlation" + return item + } + if item.PostRebuildSelectedRouteID == "" { + if age > fabricServiceChannelRebuildTrafficDeadline { + item.GuardStatus = "missing_post_rebuild_traffic" + item.GuardSeverity = "bad" + item.GuardReason = "no_post_rebuild_traffic_observed" + return item + } + item.GuardStatus = "pending_post_rebuild_traffic" + item.GuardSeverity = "warn" + item.GuardReason = "waiting_for_post_rebuild_traffic" + return item + } + if item.ReplacementRouteID != "" && item.PostRebuildSelectedRouteID != item.ReplacementRouteID { + item.GuardStatus = "unexpected_post_rebuild_route" + 
item.GuardSeverity = "bad" + item.GuardReason = "post_rebuild_selected_route_differs_from_replacement" + return item + } + if item.PostRebuildSendFailures > 0 || item.PostRebuildSendFlowDropped > 0 { + item.GuardStatus = "post_rebuild_degraded" + item.GuardSeverity = "warn" + item.GuardReason = "post_rebuild_traffic_has_failures_or_drops" + return item + } + item.GuardStatus = "ok" + item.GuardSeverity = "good" + item.GuardReason = "backend_decision_node_transition_and_post_rebuild_traffic_correlated" + return item +} + +func sortedStringSetKeys(values map[string]struct{}) []string { + if len(values) == 0 { + return nil + } + out := make([]string, 0, len(values)) + for value := range values { + out = append(out, value) + } + sort.Strings(out) + return out +} + +func applyFabricServiceChannelRouteRebuildAlertSilences(items []FabricServiceChannelRouteRebuildAttempt, silences []FabricServiceChannelRouteRebuildAlertSilence) []FabricServiceChannelRouteRebuildAttempt { + if len(items) == 0 || len(silences) == 0 { + return items + } + byKey := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + for _, silence := range silences { + byKey[fabricServiceChannelRebuildAlertSilenceKey(silence.ReporterNodeID, silence.RouteID, silence.GuardStatus, silence.Generation)] = silence + } + for idx := range items { + item := &items[idx] + silence, ok := byKey[fabricServiceChannelRebuildAlertSilenceKey(item.ReporterNodeID, item.RouteID, item.GuardStatus, item.Generation)] + if !ok { + continue + } + item.AlertSilenced = true + item.AlertSilenceID = silence.ID + item.AlertSilenceReason = silence.Reason + item.AlertSilencedUntil = &silence.ExpiresAt + } + byResurfaceKey := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + for _, silence := range silences { + key := fabricServiceChannelRebuildAlertResurfaceKey(silence.ReporterNodeID, silence.RouteID, silence.GuardStatus) + current, ok := byResurfaceKey[key] + if !ok || silence.CreatedAt.After(current.CreatedAt) { + 
byResurfaceKey[key] = silence + } + } + for idx := range items { + item := &items[idx] + if item.AlertSilenced || (item.GuardSeverity != "bad" && item.GuardSeverity != "warn") { + continue + } + silence, ok := byResurfaceKey[fabricServiceChannelRebuildAlertResurfaceKey(item.ReporterNodeID, item.RouteID, item.GuardStatus)] + if !ok || strings.TrimSpace(silence.Generation) == strings.TrimSpace(item.Generation) { + continue + } + item.AlertResurfaced = true + item.AlertResurfacedFromSilenceID = silence.ID + item.AlertResurfacedPreviousGeneration = silence.Generation + item.AlertResurfacedPreviousUntil = &silence.ExpiresAt + } + return items +} + +func fabricServiceChannelRebuildAlertSilenceKey(reporterNodeID, routeID, guardStatus, generation string) string { + return strings.TrimSpace(reporterNodeID) + "|" + strings.TrimSpace(routeID) + "|" + strings.TrimSpace(guardStatus) + "|" + strings.TrimSpace(generation) +} + +func fabricServiceChannelRebuildAlertResurfaceKey(reporterNodeID, routeID, guardStatus string) string { + return strings.TrimSpace(reporterNodeID) + "|" + strings.TrimSpace(routeID) + "|" + strings.TrimSpace(guardStatus) +} + +func fabricServiceChannelReadinessFromRebuildHealth(summary FabricServiceChannelRouteRebuildHealthSummary) FabricServiceChannelReadiness { + readiness := FabricServiceChannelReadiness{ + ClusterID: summary.ClusterID, + ObservedAt: summary.ObservedAt, + Status: "clean", + Reason: "no_active_service_channel_rebuild_alerts", + ActiveAlertCount: summary.ActiveBadCount + summary.ActiveWarnCount, + ActiveBadCount: summary.ActiveBadCount, + ActiveWarnCount: summary.ActiveWarnCount, + ResurfacedCount: summary.ResurfacedCount, + SilencedCount: summary.SilencedCount, + MissingTransitionCount: summary.CountsByGuardStatus["missing_node_transition"], + MissingRouteGenerationCount: summary.CountsByGuardStatus["missing_route_generation"], + MissingPostTrafficCount: summary.CountsByGuardStatus["missing_post_rebuild_traffic"], + UnexpectedRouteCount: 
summary.CountsByGuardStatus["unexpected_post_rebuild_route"], + PostRebuildDegradedCount: summary.CountsByGuardStatus["post_rebuild_degraded"], + RecommendedOperatorAction: summary.RecommendedOperatorAction, + } + if summary.ResurfacedCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "resurfaced_rebuild_alert") + } + if summary.ActiveBadCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "active_bad_rebuild_alert") + } + if readiness.MissingTransitionCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "missing_node_transition") + } + if readiness.MissingRouteGenerationCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "missing_route_generation") + } + if readiness.MissingPostTrafficCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "missing_post_rebuild_traffic") + } + if readiness.UnexpectedRouteCount > 0 { + readiness.BlockingReasons = append(readiness.BlockingReasons, "unexpected_post_rebuild_route") + } + if readiness.PostRebuildDegradedCount > 0 { + readiness.DegradedReasons = append(readiness.DegradedReasons, "post_rebuild_degraded") + } + if summary.ActiveWarnCount > 0 { + readiness.DegradedReasons = append(readiness.DegradedReasons, "active_warn_rebuild_alert") + } + if summary.PendingCount > 0 { + readiness.DegradedReasons = append(readiness.DegradedReasons, "pending_rebuild_attempt") + } + if summary.SilencedCount > 0 { + readiness.DegradedReasons = append(readiness.DegradedReasons, "silenced_alert_under_observation") + } + if len(readiness.BlockingReasons) > 0 { + readiness.Status = "blocked" + readiness.Reason = readiness.BlockingReasons[0] + return readiness + } + if len(readiness.DegradedReasons) > 0 { + readiness.Status = "degraded" + readiness.Reason = readiness.DegradedReasons[0] + } + return readiness +} + +func fabricServiceChannelRebuildRecommendedAction(summary FabricServiceChannelRouteRebuildHealthSummary) string { + if 
summary.AccessNoSafeCount > 0 { + return "inspect_access_no_safe_recovery_route_pool_and_signed_policy" + } + if summary.ActiveBadCount > 0 { + if summary.ResurfacedCount > 0 { + return "resurfaced_rebuild_alerts_need_reinspection_new_generation_or_route_changed" + } + return "inspect_bad_rebuild_attempts_check_reporter_node_heartbeats_route_generation_and_post_rebuild_traffic" + } + if summary.ActiveWarnCount > 0 { + return "watch_pending_rebuild_attempts_until_node_transition_and_post_rebuild_traffic_arrive" + } + if summary.SilencedCount > 0 { + return "no_active_rebuild_alerts_silenced_alerts_remain_under_observation" + } + if summary.TotalAttempts == 0 { + return "no_rebuild_attempts_observed" + } + return "no_operator_action_required" +} + +func fabricServiceChannelRouteRebuildIncidentsFromAttempts(clusterID string, items []FabricServiceChannelRouteRebuildAttempt) []FabricServiceChannelRouteRebuildIncident { + byKey := map[string]*FabricServiceChannelRouteRebuildIncident{} + for _, item := range items { + guardStatus := firstNonEmptyString(item.GuardStatus, "unknown") + guardSeverity := firstNonEmptyString(item.GuardSeverity, "unknown") + key := strings.Join([]string{item.ReporterNodeID, item.RouteID, item.ServiceClass, item.Generation, guardStatus}, "|") + incident, ok := byKey[key] + if !ok { + fingerprint := hashStringHex(key) + incident = &FabricServiceChannelRouteRebuildIncident{ + Fingerprint: fingerprint, + ClusterID: clusterID, + ReporterNodeID: item.ReporterNodeID, + RouteID: item.RouteID, + ServiceClass: item.ServiceClass, + Generation: item.Generation, + GuardStatus: guardStatus, + GuardSeverity: guardSeverity, + GuardReason: item.GuardReason, + FirstSeenAt: item.CreatedAt, + LastSeenAt: item.UpdatedAt, + LatestReplacementRouteID: item.ReplacementRouteID, + LatestRebuildStatus: item.RebuildStatus, + LatestOutcome: item.Outcome, + AlertSilenced: item.AlertSilenced, + AlertResurfaced: item.AlertResurfaced, + } + byKey[key] = incident + } + 
incident.AttemptCount++ + if item.CreatedAt.Before(incident.FirstSeenAt) { + incident.FirstSeenAt = item.CreatedAt + } + if item.UpdatedAt.After(incident.LastSeenAt) { + incident.LastSeenAt = item.UpdatedAt + incident.GuardSeverity = guardSeverity + incident.GuardReason = item.GuardReason + incident.LatestReplacementRouteID = item.ReplacementRouteID + incident.LatestRebuildStatus = item.RebuildStatus + incident.LatestOutcome = item.Outcome + } + incident.AlertSilenced = incident.AlertSilenced || item.AlertSilenced + if item.AlertResurfaced { + incident.AlertResurfaced = true + incident.AlertResurfacedFromSilenceID = item.AlertResurfacedFromSilenceID + incident.AlertResurfacedCause = item.AlertResurfacedCause + incident.AlertResurfacedPreviousRouteID = item.AlertResurfacedPreviousRouteID + incident.AlertResurfacedPreviousChannelID = item.AlertResurfacedPreviousChannelID + incident.AlertResurfacedPreviousGeneration = item.AlertResurfacedPreviousGeneration + incident.AlertResurfacedPreviousUntil = item.AlertResurfacedPreviousUntil + } + } + out := make([]FabricServiceChannelRouteRebuildIncident, 0, len(byKey)) + for _, incident := range byKey { + incident.RecommendedOperatorAction = fabricServiceChannelRebuildIncidentRecommendedAction(*incident) + out = append(out, *incident) + } + for idx := range out { + out[idx].RecommendedOperatorAction = fabricServiceChannelRebuildIncidentRecommendedAction(out[idx]) + } + fabricServiceChannelSortRouteRebuildIncidents(out) + return out +} + +func fabricServiceChannelSortRouteRebuildIncidents(out []FabricServiceChannelRouteRebuildIncident) { + sort.SliceStable(out, func(i, j int) bool { + leftRank := fabricServiceChannelRebuildIncidentSeverityRank(out[i]) + rightRank := fabricServiceChannelRebuildIncidentSeverityRank(out[j]) + if leftRank != rightRank { + return leftRank > rightRank + } + return out[i].LastSeenAt.After(out[j].LastSeenAt) + }) +} + +func fabricServiceChannelAccessDecisionIncidents(clusterID string, telemetry 
FabricServiceChannelAccessTelemetry) []FabricServiceChannelRouteRebuildIncident { + out := []FabricServiceChannelRouteRebuildIncident{} + for _, channel := range telemetry.ActiveChannels { + if channel.RouteDecisionSource == "" { + continue + } + status, severity, reason := fabricServiceChannelAccessDecisionIncidentState(channel) + if status == "" { + continue + } + key := strings.Join([]string{"access_decision", channel.ChannelID, channel.RouteDecisionRouteID, status, channel.RouteDecisionGeneration}, "|") + out = append(out, FabricServiceChannelRouteRebuildIncident{ + Fingerprint: hashStringHex(key), + ClusterID: clusterID, + ReporterNodeID: channel.SelectedEntryNodeID, + RouteID: firstNonEmptyString(channel.RouteDecisionRouteID, channel.PrimaryRouteID), + ServiceClass: channel.ServiceClass, + Generation: channel.RouteDecisionGeneration, + IncidentSource: "access_decision", + ChannelID: channel.ChannelID, + GuardStatus: status, + GuardSeverity: severity, + GuardReason: reason, + AttemptCount: 1, + FirstSeenAt: telemetry.ObservedAt, + LastSeenAt: telemetry.ObservedAt, + LatestReplacementRouteID: channel.RouteDecisionReplacementRouteID, + LatestRebuildStatus: channel.RouteDecisionRebuildStatus, + LatestOutcome: channel.RouteDecisionSource, + }) + } + for idx := range out { + out[idx].RecommendedOperatorAction = fabricServiceChannelRebuildIncidentRecommendedAction(out[idx]) + } + fabricServiceChannelSortRouteRebuildIncidents(out) + return out +} + +func fabricServiceChannelDataPlaneContractIncidents(clusterID string, telemetry FabricServiceChannelAccessTelemetry) []FabricServiceChannelRouteRebuildIncident { + out := []FabricServiceChannelRouteRebuildIncident{} + for _, channel := range telemetry.ActiveChannels { + status, severity, reason := fabricServiceChannelDataPlaneContractIncidentState(channel) + if status == "" { + continue + } + routeID := firstNonEmptyString(channel.RouteDecisionRouteID, channel.PrimaryRouteID, "data_plane") + generation := 
firstNonEmptyString(channel.RouteDecisionGeneration, channel.PrimaryRouteID, channel.DataPlane.BackendRelayPolicy, channel.ChannelID) + key := strings.Join([]string{"data_plane_contract", channel.ChannelID, routeID, status, generation}, "|") + out = append(out, FabricServiceChannelRouteRebuildIncident{ + Fingerprint: hashStringHex(key), + ClusterID: clusterID, + ReporterNodeID: channel.SelectedEntryNodeID, + RouteID: routeID, + ServiceClass: channel.ServiceClass, + Generation: generation, + IncidentSource: "data_plane_contract", + ChannelID: channel.ChannelID, + GuardStatus: status, + GuardSeverity: severity, + GuardReason: reason, + AttemptCount: 1, + FirstSeenAt: telemetry.ObservedAt, + LastSeenAt: telemetry.ObservedAt, + LatestOutcome: firstNonEmptyString(channel.EntryNodeLastWorkingDataTransport, channel.DataPlane.WorkingDataTransport, "unknown"), + LatestRebuildStatus: firstNonEmptyString( + channel.EntryNodeLastBackendRelayPolicy, + channel.DataPlane.BackendRelayPolicy, + ), + }) + } + for idx := range out { + out[idx].RecommendedOperatorAction = fabricServiceChannelRebuildIncidentRecommendedAction(out[idx]) + } + fabricServiceChannelSortRouteRebuildIncidents(out) + return out +} + +func applyFabricServiceChannelAccessDecisionIncidentSilences(items []FabricServiceChannelRouteRebuildIncident, silences []FabricServiceChannelRouteRebuildAlertSilence) []FabricServiceChannelRouteRebuildIncident { + if len(items) == 0 || len(silences) == 0 { + return items + } + byKey := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + byResurfaceKey := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + byGeneralResurfaceKey := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + byAccessReporterGuard := map[string]FabricServiceChannelRouteRebuildAlertSilence{} + for _, silence := range silences { + byKey[fabricServiceChannelRebuildAlertSilenceKey(silence.ReporterNodeID, silence.RouteID, silence.GuardStatus, silence.Generation)] = silence + resurfaceKey 
:= fabricServiceChannelRebuildAlertResurfaceKey(silence.ReporterNodeID, silence.RouteID, silence.GuardStatus) + current, ok := byResurfaceKey[resurfaceKey] + if !ok || silence.CreatedAt.After(current.CreatedAt) { + byResurfaceKey[resurfaceKey] = silence + } + if channelID, routeID, ok := fabricServiceChannelParseAccessDecisionSilenceRouteID(silence.RouteID); ok { + _ = channelID + generalKey := fabricServiceChannelRebuildAlertResurfaceKey(silence.ReporterNodeID, routeID, silence.GuardStatus) + current, ok := byGeneralResurfaceKey[generalKey] + if !ok || silence.CreatedAt.After(current.CreatedAt) { + byGeneralResurfaceKey[generalKey] = silence + } + accessKey := fabricServiceChannelRebuildAlertResurfaceKey(silence.ReporterNodeID, "access_decision", silence.GuardStatus) + current, ok = byAccessReporterGuard[accessKey] + if !ok || silence.CreatedAt.After(current.CreatedAt) { + byAccessReporterGuard[accessKey] = silence + } + } + } + for idx := range items { + item := &items[idx] + silenceRouteID := fabricServiceChannelAccessDecisionSilenceRouteID(item.ChannelID, item.RouteID) + silence, ok := byKey[fabricServiceChannelRebuildAlertSilenceKey(item.ReporterNodeID, silenceRouteID, item.GuardStatus, item.Generation)] + if ok { + item.AlertSilenced = true + continue + } + if item.GuardSeverity != "bad" && item.GuardSeverity != "warn" { + continue + } + silence, ok = byResurfaceKey[fabricServiceChannelRebuildAlertResurfaceKey(item.ReporterNodeID, silenceRouteID, item.GuardStatus)] + if !ok || strings.TrimSpace(silence.Generation) == strings.TrimSpace(item.Generation) { + generalSilence, generalOK := byGeneralResurfaceKey[fabricServiceChannelRebuildAlertResurfaceKey(item.ReporterNodeID, item.RouteID, item.GuardStatus)] + if !generalOK || strings.TrimSpace(generalSilence.Generation) == strings.TrimSpace(item.Generation) { + accessSilence, accessOK := byAccessReporterGuard[fabricServiceChannelRebuildAlertResurfaceKey(item.ReporterNodeID, "access_decision", item.GuardStatus)] + 
if !accessOK || !fabricServiceChannelAccessDecisionSilenceDiffers(*item, accessSilence) { + continue + } + generalSilence = accessSilence + } + silence = generalSilence + } + item.AlertResurfaced = true + item.AlertResurfacedFromSilenceID = silence.ID + item.AlertResurfacedCause = fabricServiceChannelAccessDecisionResurfaceCause(*item, silence) + item.AlertResurfacedPreviousRouteID = silence.DisplayRouteID + item.AlertResurfacedPreviousChannelID = silence.ChannelID + item.AlertResurfacedPreviousGeneration = silence.Generation + item.AlertResurfacedPreviousUntil = &silence.ExpiresAt + } + return items +} + +func fabricServiceChannelAccessDecisionSilenceDiffers(item FabricServiceChannelRouteRebuildIncident, silence FabricServiceChannelRouteRebuildAlertSilence) bool { + return strings.TrimSpace(silence.ChannelID) != strings.TrimSpace(item.ChannelID) || + strings.TrimSpace(silence.DisplayRouteID) != strings.TrimSpace(item.RouteID) || + strings.TrimSpace(silence.Generation) != strings.TrimSpace(item.Generation) +} + +func fabricServiceChannelAccessDecisionResurfaceCause(item FabricServiceChannelRouteRebuildIncident, silence FabricServiceChannelRouteRebuildAlertSilence) string { + if strings.TrimSpace(silence.ChannelID) != "" && strings.TrimSpace(silence.ChannelID) != strings.TrimSpace(item.ChannelID) { + return "channel_changed" + } + if strings.TrimSpace(silence.DisplayRouteID) != "" && strings.TrimSpace(silence.DisplayRouteID) != strings.TrimSpace(item.RouteID) { + return "route_changed" + } + if strings.TrimSpace(silence.Generation) != strings.TrimSpace(item.Generation) { + return "generation_changed" + } + return "resurfaced" +} + +func fabricServiceChannelAccessDecisionSilenceRouteID(channelID string, routeID string) string { + return "access:" + strings.TrimSpace(channelID) + ":" + strings.TrimSpace(routeID) +} + +func fabricServiceChannelParseAccessDecisionSilenceRouteID(value string) (string, string, bool) { + value = strings.TrimSpace(value) + if 
!strings.HasPrefix(value, "access:") { + return "", "", false + } + rest := strings.TrimPrefix(value, "access:") + parts := strings.SplitN(rest, ":", 2) + if len(parts) != 2 || strings.TrimSpace(parts[0]) == "" || strings.TrimSpace(parts[1]) == "" { + return "", "", false + } + return strings.TrimSpace(parts[0]), strings.TrimSpace(parts[1]), true +} + +func fabricServiceChannelAccessDecisionIncidentState(channel FabricServiceChannelAccessTelemetryChannel) (string, string, string) { + switch { + case fabricServiceChannelRouteDecisionIsNoSafeRecovery(channel): + return "access_no_safe_recovery", "bad", firstNonEmptyString(channel.RouteDecisionRebuildReason, "no_unfenced_alternate_route") + case fabricServiceChannelRouteDecisionIsRecovery(channel): + return "access_recovery_selected", "warn", firstNonEmptyString(channel.RouteDecisionRebuildReason, "recovery_route_selected") + case channel.RouteDecisionRebuildStatus == "applied" || containsString(channel.RouteDecisionScoreReasons, "service_channel_rebuild_applied"): + return "access_rebuild_applied", "good", firstNonEmptyString(channel.RouteDecisionRebuildReason, "planner_applied_rebuild") + case fabricServiceChannelRouteDecisionIsReplacement(channel): + return "access_replacement_selected", "warn", firstNonEmptyString(channel.RouteDecisionRebuildReason, "replacement_route_selected") + default: + return "", "", "" + } +} + +func fabricServiceChannelDataPlaneContractIncidentState(channel FabricServiceChannelAccessTelemetryChannel) (string, string, string) { + accepted := channel.EntryNodeTotalAccepted > 0 || channel.EntryNodeIntrospectionAccepted > 0 || channel.EntryNodeBackendFallbackCount > 0 + if accepted && channel.EntryNodeDataPlaneContractCount == 0 { + return "data_plane_contract_not_reported", "bad", "entry_node_accepted_service_channel_without_reporting_data_plane_contract" + } + workingTransport := firstNonEmptyString(channel.EntryNodeLastWorkingDataTransport, channel.DataPlane.WorkingDataTransport) + if 
workingTransport != "" && workingTransport != "fabric_service_channel" { + return "data_plane_working_transport_violation", "bad", "working_data_transport_must_be_fabric_service_channel" + } + steadyTransport := firstNonEmptyString(channel.EntryNodeLastSteadyStateTransport, channel.DataPlane.SteadyStateTransport) + if steadyTransport != "" && steadyTransport != "fabric_route" { + return "data_plane_steady_state_transport_violation", "bad", "steady_state_transport_must_be_fabric_route" + } + logicalFlowMode := firstNonEmptyString(channel.EntryNodeLastLogicalFlowMode, channel.DataPlane.LogicalFlowMode) + if logicalFlowMode != "" && logicalFlowMode != "multi_flow_isolated" { + return "data_plane_logical_flow_violation", "bad", "logical_flow_mode_must_be_multi_flow_isolated" + } + backendRelayPolicy := firstNonEmptyString(channel.EntryNodeLastBackendRelayPolicy, channel.DataPlane.BackendRelayPolicy) + if channel.EntryNodeBackendFallbackBlockedCount > 0 { + return firstNonEmptyString(channel.EntryNodeLastDataPlaneViolationStatus, "data_plane_backend_fallback_blocked"), "bad", firstNonEmptyString(channel.EntryNodeLastDataPlaneViolationReason, "backend_fallback_blocked_by_data_plane_policy") + } + if channel.EntryNodeFabricRouteSendFailureCount > 0 { + return firstNonEmptyString(channel.EntryNodeLastDataPlaneViolationStatus, "data_plane_fabric_route_send_failed"), "bad", firstNonEmptyString(channel.EntryNodeLastDataPlaneViolationReason, "fabric_route_send_failed") + } + if backendRelayPolicy == "disabled" && (channel.EntryNodeBackendFallbackCount > 0 || channel.ForceBackendFallback) { + return "data_plane_disabled_backend_relay_observed", "bad", "backend_relay_policy_disabled_but_backend_fallback_was_observed" + } + if backendRelayPolicy == "degraded_fallback_only" && channel.EntryNodeBackendFallbackCount > 0 { + return "data_plane_degraded_backend_relay_observed", "warn", "backend_relay_used_as_degraded_fallback_for_working_data" + } + return "", "", "" +} + +func 
hashStringHex(value string) string { + sum := sha256.Sum256([]byte(value)) + return hex.EncodeToString(sum[:]) +} + +func fabricServiceChannelRebuildIncidentSeverityRank(item FabricServiceChannelRouteRebuildIncident) int { + if item.AlertResurfaced { + return 4 + } + if item.IncidentSource == "access_decision" && item.GuardStatus == "access_no_safe_recovery" { + return 4 + } + switch item.GuardSeverity { + case "bad": + return 3 + case "warn": + return 2 + case "good": + return 1 + default: + return 0 + } +} + +func fabricServiceChannelRebuildIncidentRecommendedAction(item FabricServiceChannelRouteRebuildIncident) string { + if item.AlertSilenced && !item.AlertResurfaced { + return "silenced_rebuild_incident_under_observation" + } + if item.AlertResurfaced { + return "open_deep_ledger_for_resurfaced_generation" + } + if item.IncidentSource == "access_decision" { + switch item.GuardStatus { + case "access_no_safe_recovery": + return "inspect_access_no_safe_recovery_route_pool_and_signed_policy" + case "access_recovery_selected": + return "watch_recovery_route_quality_and_confirm_post_recovery_traffic" + case "access_rebuild_applied": + return "confirm_applied_rebuild_runtime_traffic_stays_on_replacement" + case "access_replacement_selected": + return "watch_replacement_route_quality_until_applied_or_recovered" + } + } + if item.IncidentSource == "data_plane_contract" { + switch item.GuardStatus { + case "data_plane_contract_not_reported": + return "upgrade_or_restart_entry_node_until_data_plane_contract_is_reported" + case "data_plane_working_transport_violation", "data_plane_steady_state_transport_violation", "data_plane_logical_flow_violation": + return "inspect_signed_data_plane_contract_and_node_agent_runtime_path" + case "data_plane_disabled_backend_relay_observed": + return "stop_backend_relay_usage_and_restore_fabric_route_before_service_traffic" + case "data_plane_degraded_backend_relay_observed": + return 
"restore_fabric_route_and_treat_backend_relay_as_degraded_only" + case "backend_fallback_blocked_by_policy", "fabric_route_send_failed_backend_fallback_blocked", "data_plane_backend_fallback_blocked": + return "restore_fabric_route_or_change_signed_backend_relay_policy_before_retry" + case "data_plane_fabric_route_send_failed": + return "inspect_entry_route_runtime_and_restore_fabric_route_delivery" + } + } + switch item.GuardStatus { + case "missing_node_transition": + return "open_deep_ledger_check_reporter_heartbeats_and_route_manager_transition" + case "missing_route_generation": + return "open_deep_ledger_check_route_generation_apply_or_withdraw" + case "missing_post_rebuild_traffic": + return "open_deep_ledger_check_post_rebuild_traffic_and_selected_route" + case "unexpected_post_rebuild_route": + return "open_deep_ledger_check_selected_route_vs_replacement" + case "post_rebuild_degraded": + return "inspect_post_rebuild_drops_failures_and_route_quality" + case "ok": + return "no_operator_action_required" + default: + if item.GuardSeverity == "bad" || item.GuardSeverity == "warn" { + return "open_deep_ledger_for_rebuild_incident" + } + return "no_operator_action_required" + } +} + +func transitionMatchesRebuildAttempt(transition map[string]any, item FabricServiceChannelRouteRebuildAttempt) bool { + if len(transition) == 0 { + return false + } + generation := jsonString(transition, "generation") + if item.Generation != "" { + return generation != "" && generation == item.Generation + } + status := jsonString(transition, "status") + return (status == "applied_rebuild" && item.RebuildStatus == "applied") || + (status == "pending_degraded_fallback" && item.RebuildStatus == "pending_degraded_fallback") +} + +func routeGenerationDecisionForAttempt(report map[string]any, item FabricServiceChannelRouteRebuildAttempt) (map[string]any, bool) { + for _, key := range []string{"active_decisions", "withdrawn_decisions"} { + for _, raw := range jsonArray(report, key) { + 
decision, ok := raw.(map[string]any) + if !ok { + continue + } + if jsonString(decision, "route_id") != item.RouteID { + continue + } + generation := jsonString(decision, "generation") + if item.Generation == "" || generation == "" || generation == item.Generation { + return decision, true + } + } + } + return nil, false +} + +func jsonObject(raw json.RawMessage) map[string]any { + if len(raw) == 0 || !json.Valid(raw) { + return map[string]any{} + } + var out map[string]any + if err := json.Unmarshal(raw, &out); err != nil { + return map[string]any{} + } + return out +} + +func jsonMapPath(raw map[string]any, path ...string) map[string]any { + current := raw + for _, key := range path { + next, ok := current[key].(map[string]any) + if !ok { + return map[string]any{} + } + current = next + } + return current +} + +func jsonArray(raw map[string]any, key string) []any { + if raw == nil { + return nil + } + items, _ := raw[key].([]any) + return items +} + +func jsonString(raw map[string]any, key string) string { + if raw == nil { + return "" + } + value, _ := raw[key].(string) + return strings.TrimSpace(value) +} + +func jsonStringArray(raw map[string]any, key string) []string { + items := jsonArray(raw, key) + if len(items) == 0 { + return nil + } + out := make([]string, 0, len(items)) + for _, item := range items { + value, ok := item.(string) + if !ok { + continue + } + value = strings.TrimSpace(value) + if value != "" { + out = append(out, value) + } + } + return out +} + +func jsonInt(raw map[string]any, key string) int { + if raw == nil { + return 0 + } + switch value := raw[key].(type) { + case float64: + return int(value) + case int: + return value + case int64: + return int(value) + case json.Number: + parsed, _ := value.Int64() + return int(parsed) + default: + return 0 + } +} + +func jsonBool(raw map[string]any, key string) bool { + if raw == nil { + return false + } + value, _ := raw[key].(bool) + return value +} + +func jsonStringIntMap(raw map[string]any, 
key string) map[string]int { + if raw == nil { + return nil + } + values, ok := raw[key].(map[string]any) + if !ok || len(values) == 0 { + return nil + } + out := make(map[string]int, len(values)) + for name, value := range values { + name = strings.TrimSpace(name) + if name == "" { + continue + } + switch typed := value.(type) { + case float64: + out[name] = int(typed) + case int: + out[name] = typed + case int64: + out[name] = int(typed) + case json.Number: + parsed, _ := typed.Int64() + out[name] = int(parsed) + } + } + if len(out) == 0 { + return nil + } + return out +} + +func copyStringIntMap(values map[string]int) map[string]int { + if len(values) == 0 { + return nil + } + out := make(map[string]int, len(values)) + for key, value := range values { + out[key] = value + } + return out +} + +func mergeStringIntMap(target map[string]int, source map[string]int) { + if target == nil || len(source) == 0 { + return + } + for key, value := range source { + target[key] += value + } +} + +func mergeMinStringIntMap(target map[string]int, source map[string]int) { + if target == nil || len(source) == 0 { + return + } + for key, value := range source { + if strings.TrimSpace(key) == "" || value <= 0 { + continue + } + current, ok := target[key] + if !ok || value < current { + target[key] = value + } + } +} + +func jsonUint64(raw map[string]any, key string) uint64 { + if raw == nil { + return 0 + } + switch value := raw[key].(type) { + case float64: + if value > 0 { + return uint64(value) + } + case int: + if value > 0 { + return uint64(value) + } + case int64: + if value > 0 { + return uint64(value) + } + case uint64: + return value + } + return 0 +} + +func (s *Service) ExpireFabricServiceChannelRouteFeedback(ctx context.Context, input ExpireFabricServiceChannelRouteFeedbackInput) (ExpireFabricServiceChannelRouteFeedbackResult, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } 
+ if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ReporterNodeID = strings.TrimSpace(input.ReporterNodeID) + input.RouteID = strings.TrimSpace(input.RouteID) + input.ServiceClass = strings.TrimSpace(input.ServiceClass) + input.Reason = strings.TrimSpace(input.Reason) + if input.ClusterID == "" || input.RouteID == "" { + return ExpireFabricServiceChannelRouteFeedbackResult{}, ErrInvalidPayload + } + if input.Now.IsZero() { + input.Now = s.now() + } + result, err := s.store.ExpireFabricServiceChannelRouteFeedback(ctx, input) + if err != nil { + return ExpireFabricServiceChannelRouteFeedbackResult{}, err + } + payload, _ := json.Marshal(map[string]any{ + "reporter_node_id": input.ReporterNodeID, + "route_id": input.RouteID, + "service_class": input.ServiceClass, + "reason": input.Reason, + "expired_count": result.ExpiredCount, + "expired_at": result.ExpiredAt, + "cooldown_until": result.CooldownUntil, + }) + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "fabric.service_channel_route_feedback.expired", + TargetType: "fabric_service_channel_route", + TargetID: &input.RouteID, + Payload: payload, + CreatedAt: input.Now.UTC(), + }) + return result, nil +} + +func (s *Service) CreateReleaseVersion(ctx context.Context, input CreateReleaseVersionInput) (ReleaseVersion, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return ReleaseVersion{}, err + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return ReleaseVersion{}, err + } + input.Product = normalizeUpdateToken(input.Product) + input.Version = strings.TrimSpace(input.Version) + input.Channel = normalizeUpdateToken(firstNonEmptyString(input.Channel, "dev")) + input.Status = 
normalizeUpdateToken(firstNonEmptyString(input.Status, "active")) + if input.ClusterID == "" || input.Product == "" || input.Version == "" || len(input.Artifacts) == 0 { + return ReleaseVersion{}, ErrInvalidPayload + } + if input.Status != "active" && input.Status != "draft" && input.Status != "revoked" { + return ReleaseVersion{}, ErrInvalidPayload + } + input.Compatibility = defaultJSON(input.Compatibility, `{}`) + if !json.Valid(input.Compatibility) { + return ReleaseVersion{}, ErrInvalidPayload + } + for i := range input.Artifacts { + input.Artifacts[i].OS = normalizeUpdateToken(input.Artifacts[i].OS) + input.Artifacts[i].Arch = normalizeUpdateToken(input.Artifacts[i].Arch) + input.Artifacts[i].InstallType = normalizeUpdateToken(input.Artifacts[i].InstallType) + input.Artifacts[i].Kind = normalizeUpdateToken(input.Artifacts[i].Kind) + input.Artifacts[i].URL = strings.TrimSpace(input.Artifacts[i].URL) + input.Artifacts[i].SHA256 = strings.TrimSpace(input.Artifacts[i].SHA256) + input.Artifacts[i].Metadata = defaultJSON(input.Artifacts[i].Metadata, `{}`) + if input.Artifacts[i].OS == "" || input.Artifacts[i].Arch == "" || input.Artifacts[i].InstallType == "" || + input.Artifacts[i].Kind == "" || input.Artifacts[i].URL == "" || input.Artifacts[i].SHA256 == "" || + !json.Valid(input.Artifacts[i].Metadata) { + return ReleaseVersion{}, ErrInvalidPayload + } + } + item, err := s.store.CreateReleaseVersion(ctx, input) + if err != nil { + return ReleaseVersion{}, err + } + item, err = s.signReleaseVersion(ctx, item, &input.ActorUserID) + if err != nil { + return ReleaseVersion{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "release_version.created", + TargetType: "release_version", + TargetID: &item.ID, + Payload: json.RawMessage(`{"production_forwarding":false}`), + CreatedAt: s.now(), + }) + return item, nil +} + +func (s *Service) ListReleaseVersions(ctx context.Context, 
actorUserID, clusterID, product, channel string) ([]ReleaseVersion, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + return s.store.ListReleaseVersions(ctx, clusterID, normalizeUpdateToken(product), normalizeUpdateToken(channel)) +} + +func (s *Service) UpsertNodeUpdatePolicy(ctx context.Context, input UpsertNodeUpdatePolicyInput) (NodeUpdatePolicy, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return NodeUpdatePolicy{}, err + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return NodeUpdatePolicy{}, err + } + input.Product = normalizeUpdateToken(input.Product) + input.Channel = normalizeUpdateToken(firstNonEmptyString(input.Channel, "dev")) + input.Strategy = normalizeUpdateToken(firstNonEmptyString(input.Strategy, "manual")) + if input.ClusterID == "" || input.NodeID == "" || input.Product == "" { + return NodeUpdatePolicy{}, ErrInvalidPayload + } + switch input.Strategy { + case "manual", "canary", "rolling", "pinned": + default: + return NodeUpdatePolicy{}, ErrInvalidPayload + } + if input.HealthWindowSec <= 0 { + input.HealthWindowSec = 180 + } + if input.TargetVersion != nil { + trimmed := strings.TrimSpace(*input.TargetVersion) + input.TargetVersion = &trimmed + } + item, err := s.store.UpsertNodeUpdatePolicy(ctx, input) + if err != nil { + return NodeUpdatePolicy{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "node_update_policy.updated", + TargetType: "node", + TargetID: &input.NodeID, + Payload: json.RawMessage(`{"production_forwarding":false}`), + CreatedAt: s.now(), + }) + return item, nil +} + +func (s *Service) GetNodeUpdatePlan(ctx context.Context, input GetNodeUpdatePlanInput) (NodeUpdatePlan, error) { + input.Product = normalizeUpdateToken(firstNonEmptyString(input.Product, "rap-node-agent")) + input.Channel = 
normalizeUpdateToken(input.Channel) + input.OS = normalizeUpdateToken(input.OS) + input.Arch = normalizeUpdateToken(input.Arch) + input.InstallType = normalizeUpdateToken(input.InstallType) + input.CurrentVersion = strings.TrimSpace(input.CurrentVersion) + input.ArtifactOrigin = normalizeArtifactOrigin(input.ArtifactOrigin) + if input.ClusterID == "" || input.NodeID == "" || input.Product == "" || input.OS == "" || input.Arch == "" || input.InstallType == "" { + return NodeUpdatePlan{}, ErrInvalidPayload + } + policy, err := s.store.GetNodeUpdatePolicy(ctx, input.ClusterID, input.NodeID, input.Product) + if errors.Is(err, pgx.ErrNoRows) { + return s.signNodeUpdatePlan(ctx, NodeUpdatePlan{ + SchemaVersion: "rap.node_update_plan.v1", + ClusterID: input.ClusterID, + NodeID: input.NodeID, + Product: input.Product, + CurrentVersion: input.CurrentVersion, + Action: "none", + Reason: "no_update_policy", + ProductionForwarding: false, + }) + } + if err != nil { + return NodeUpdatePlan{}, err + } + if input.Channel == "" { + input.Channel = policy.Channel + } + base := NodeUpdatePlan{ + SchemaVersion: "rap.node_update_plan.v1", + ClusterID: input.ClusterID, + NodeID: input.NodeID, + Product: input.Product, + CurrentVersion: input.CurrentVersion, + Channel: input.Channel, + Strategy: policy.Strategy, + RollbackAllowed: policy.RollbackAllowed, + HealthWindowSec: policy.HealthWindowSec, + ProductionForwarding: false, + } + if !policy.Enabled { + base.Action = "none" + base.Reason = "policy_disabled" + return s.signNodeUpdatePlan(ctx, base) + } + if mismatch, err := s.hostAgentPlatformMismatch(ctx, input); err != nil { + return NodeUpdatePlan{}, err + } else if mismatch { + base.Action = "none" + base.Reason = "host_agent_artifact_platform_mismatch" + return s.signNodeUpdatePlan(ctx, base) + } + releases, err := s.store.ListReleaseVersions(ctx, input.ClusterID, input.Product, input.Channel) + if err != nil { + return NodeUpdatePlan{}, err + } + release, artifact, ok := 
selectReleaseArtifact(releases, input, policy) + if !ok { + base.Action = "none" + base.Reason = "no_matching_artifact" + return s.signNodeUpdatePlan(ctx, base) + } + base.TargetVersion = release.Version + artifact = absolutizeReleaseArtifact(artifact, input.ArtifactOrigin) + base.Artifact = &artifact + if strings.TrimSpace(input.CurrentVersion) == release.Version { + base.Action = "none" + base.Reason = "already_current" + return s.signNodeUpdatePlan(ctx, base) + } + base.Action = "update" + base.Reason = "matching_release_available" + return s.signNodeUpdatePlan(ctx, base) +} + +func (s *Service) ReportNodeUpdateStatus(ctx context.Context, input ReportNodeUpdateStatusInput) (NodeUpdateStatus, error) { + input.Product = normalizeUpdateToken(firstNonEmptyString(input.Product, "rap-node-agent")) + input.Phase = normalizeUpdateToken(input.Phase) + input.Status = normalizeUpdateToken(input.Status) + if input.ClusterID == "" || input.NodeID == "" || input.Product == "" || input.Phase == "" || input.Status == "" { + return NodeUpdateStatus{}, ErrInvalidPayload + } + input.Payload = defaultJSON(input.Payload, `{}`) + if !json.Valid(input.Payload) { + return NodeUpdateStatus{}, ErrInvalidPayload + } + if input.ObservedAt.IsZero() { + input.ObservedAt = s.now() + } + return s.store.ReportNodeUpdateStatus(ctx, input) +} + +func (s *Service) ListNodeUpdateStatuses(ctx context.Context, actorUserID, clusterID, nodeID string, limit int) ([]NodeUpdateStatus, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } + if clusterID == "" || nodeID == "" { + return nil, ErrInvalidPayload + } + return s.store.ListNodeUpdateStatuses(ctx, clusterID, nodeID, limit) +} + +func (s *Service) GetNodeUpdateHint(ctx context.Context, clusterID, nodeID string) NodeUpdateHint { + products := []string{"rap-node-agent", "rap-host-agent"} + parts := make([]string, 0, len(products)) + activeProducts := make([]string, 0, len(products)) + updateService := 
s.selectNodeUpdateService(ctx, clusterID, nodeID) + for _, product := range products { + policy, err := s.store.GetNodeUpdatePolicy(ctx, clusterID, nodeID, product) + if err != nil || !policy.Enabled { + continue + } + targetVersion := strings.TrimSpace(updateHintTargetVersion(ctx, s, clusterID, product, policy)) + if targetVersion == "" { + continue + } + activeProducts = append(activeProducts, product) + parts = append(parts, product+":"+targetVersion+":"+policy.UpdatedAt.UTC().Format(time.RFC3339Nano)) + } + if len(parts) == 0 { + return NodeUpdateHint{ + SchemaVersion: "rap.node_update_hint.v1", + CheckNow: false, + Reason: "no_enabled_update_policy", + DeliveryMode: "update_service_subscription", + SubscriptionStatus: "subscribed", + UpdateService: updateService, + FallbackPollSeconds: 21600, + } + } + sort.Strings(parts) + sort.Strings(activeProducts) + sum := sha256.Sum256([]byte(strings.Join(parts, "|"))) + return NodeUpdateHint{ + SchemaVersion: "rap.node_update_hint.v1", + Generation: hex.EncodeToString(sum[:])[:16], + CheckNow: true, + Products: activeProducts, + Reason: "enabled_update_policy", + DeliveryMode: "update_service_subscription", + SubscriptionStatus: "subscribed", + UpdateService: updateService, + FallbackPollSeconds: 21600, + } +} + +func (s *Service) selectNodeUpdateService(ctx context.Context, clusterID, nodeID string) *NodeUpdateServiceAssignment { + now := s.now() + assignment := &NodeUpdateServiceAssignment{ + SchemaVersion: "rap.node_update_service_assignment.v1", + Status: "control_plane_fallback", + Reason: "no_healthy_update_cache_service", + AssignedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + } + candidates, err := s.store.ListNodeUpdateServiceCandidates(ctx, clusterID) + if err != nil || len(candidates) == 0 { + return assignment + } + selected := candidates[0] + for _, candidate := range candidates { + if candidate.NodeID == nodeID { + selected = candidate + break + } + } + assignment.NodeID = selected.NodeID + 
assignment.NodeName = selected.NodeName + assignment.Endpoint = selected.Endpoint + assignment.Region = selected.Region + assignment.Status = "assigned" + assignment.Reason = "healthy_update_cache_service" + assignment.ExpiresAt = now.Add(5 * time.Minute) + return assignment +} + +func updateHintTargetVersion(ctx context.Context, s *Service, clusterID, product string, policy NodeUpdatePolicy) string { + if policy.TargetVersion != nil { + return strings.TrimSpace(*policy.TargetVersion) + } + releases, err := s.store.ListReleaseVersions(ctx, clusterID, product, policy.Channel) + if err != nil { + return "" + } + for _, release := range releases { + if release.Status == "active" && strings.TrimSpace(release.Version) != "" { + return strings.TrimSpace(release.Version) + } + } + return "" +} + +func (s *Service) signReleaseVersion(ctx context.Context, item ReleaseVersion, actorUserID *string) (ReleaseVersion, error) { + authorityKey, err := s.ensureClusterAuthority(ctx, item.ClusterID, actorUserID) + if err != nil { + return ReleaseVersion{}, err + } + payload := map[string]any{ + "schema_version": "rap.release_version_authority.v1", + "cluster_id": item.ClusterID, + "release_id": item.ID, + "product": item.Product, + "version": item.Version, + "channel": item.Channel, + "artifact_count": len(item.Artifacts), + "control_plane_only": true, + "production_forwarding": false, + } + rawPayload, signature, err := clusterauth.SignPayload(authorityKey.PrivateKey, payload, s.now()) + if err != nil { + return ReleaseVersion{}, err + } + item.AuthorityPayload = rawPayload + item.AuthoritySignature = &signature + return item, nil +} + +func (s *Service) signNodeUpdatePlan(ctx context.Context, plan NodeUpdatePlan) (NodeUpdatePlan, error) { + authorityKey, err := s.ensureClusterAuthority(ctx, plan.ClusterID, nil) + if err != nil { + return NodeUpdatePlan{}, err + } + payload := map[string]any{ + "schema_version": "rap.node_update_plan_authority.v1", + "cluster_id": plan.ClusterID, + 
"node_id": plan.NodeID, + "product": plan.Product, + "current_version": plan.CurrentVersion, + "action": plan.Action, + "target_version": plan.TargetVersion, + "artifact_sha256": "", + "control_plane_only": true, + "production_forwarding": false, + } + if plan.Artifact != nil { + payload["artifact_sha256"] = plan.Artifact.SHA256 + payload["artifact_url"] = plan.Artifact.URL + } + rawPayload, signature, err := clusterauth.SignPayload(authorityKey.PrivateKey, payload, s.now()) + if err != nil { + return NodeUpdatePlan{}, err + } + plan.AuthorityPayload = rawPayload + plan.AuthoritySignature = &signature + return plan, nil +} + func (s *Service) UpsertFabricTestingFlag(ctx context.Context, input UpsertFabricTestingFlagInput) (FabricTestingFlag, error) { if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { return FabricTestingFlag{}, err @@ -770,6 +3995,3205 @@ func (s *Service) GetEffectiveNodeTestingFlags(ctx context.Context, clusterID, n return s.store.GetEffectiveNodeTestingFlags(ctx, clusterID, nodeID) } +func (s *Service) IssueFabricServiceChannelLease(ctx context.Context, input IssueFabricServiceChannelLeaseInput) (FabricServiceChannelLease, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.OrganizationID = strings.TrimSpace(input.OrganizationID) + input.UserID = strings.TrimSpace(input.UserID) + input.ResourceID = strings.TrimSpace(input.ResourceID) + input.ServiceClass = normalizeFabricServiceClass(input.ServiceClass) + input.EntryNodeIDs = dedupeStrings(input.EntryNodeIDs) + input.ExitNodeIDs = dedupeStrings(input.ExitNodeIDs) + input.PreferredEntryNodeID = strings.TrimSpace(input.PreferredEntryNodeID) + input.PreferredExitNodeID = strings.TrimSpace(input.PreferredExitNodeID) + if input.ClusterID == "" || input.OrganizationID == "" || input.UserID == "" || input.ServiceClass == "" || len(input.EntryNodeIDs) == 0 || len(input.ExitNodeIDs) == 0 { + return FabricServiceChannelLease{}, ErrInvalidPayload + } + if 
!isAllowedFabricServiceClass(input.ServiceClass) { + return FabricServiceChannelLease{}, ErrInvalidPayload + } + ttl := input.TTL + if ttl <= 0 { + ttl = time.Minute + } + if ttl > 5*time.Minute { + ttl = 5 * time.Minute + } + now := s.now().UTC() + expiresAt := now.Add(ttl) + routeGeneration := "fsc-" + now.Format("20060102T150405.000000000Z") + allowedChannels := normalizeFabricServiceChannels(input.AllowedChannels, input.ServiceClass) + requiredRoles := normalizeFabricRequiredRoles(input.RequiredRoles, input.ServiceClass) + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelLease{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelLease{}, err + } + poolPolicy := fabricServiceChannelPoolPolicyFromCluster(cluster) + entryNodeIDs := fabricServiceChannelEffectivePool(input.EntryNodeIDs, poolPolicy.EntryPoolNodeIDs) + exitNodeIDs := fabricServiceChannelEffectivePool(input.ExitNodeIDs, poolPolicy.ExitPoolNodeIDs) + if len(entryNodeIDs) == 0 || len(exitNodeIDs) == 0 { + return FabricServiceChannelLease{}, ErrInvalidPayload + } + selectedEntry := selectFabricServiceChannelPreferredNode(entryNodeIDs, firstNonEmptyString(poolPolicy.PreferredEntryNodeID, input.PreferredEntryNodeID)) + selectedExit := selectFabricServiceChannelPreferredNode(exitNodeIDs, firstNonEmptyString(poolPolicy.PreferredExitNodeID, input.PreferredExitNodeID)) + if selectedEntry == "" || selectedExit == "" { + return FabricServiceChannelLease{}, ErrInvalidPayload + } + intents, err := s.store.ListRouteIntents(ctx, input.ClusterID) + if err != nil { + return FabricServiceChannelLease{}, err + } + recoveryPolicy := s.fabricServiceChannelRecoveryPolicy(ctx, input.ClusterID) + routeProvenance := fabricServiceChannelRouteProvenanceFromIntents(intents) + feedback, err := s.fabricServiceChannelRouteFeedback(ctx, input.ClusterID, entryNodeIDs, now, recoveryPolicy, routeProvenance) + if err != nil { + return 
FabricServiceChannelLease{}, err + } + routes := fabricServiceChannelRoutesFromIntents(intents, input.ServiceClass, entryNodeIDs, exitNodeIDs, allowedChannels, routeGeneration, now, expiresAt, feedback, recoveryPolicy) + primary, alternates := selectFabricServicePrimaryRoute(routes, selectedEntry, selectedExit) + if primary.RouteID != "" && containsString(entryNodeIDs, primary.SourceNodeID) { + selectedEntry = primary.SourceNodeID + } + if primary.RouteID != "" && containsString(exitNodeIDs, primary.DestinationNodeID) { + selectedExit = primary.DestinationNodeID + } + fallback := FabricServiceChannelFallback{ + Allowed: true, + Transport: "backend_relay", + BackendRelay: true, + Compatibility: true, + Reason: "compatibility_fallback_available", + } + fallback.Allowed = poolPolicy.BackendFallbackAllowed + fallback.BackendRelay = poolPolicy.BackendFallbackAllowed + status := FabricServiceChannelStatusReady + if primary.RouteID == "" { + if poolPolicy.BackendFallbackAllowed { + status = FabricServiceChannelStatusDegradedFallback + fallback.Active = true + fallback.Degraded = true + fallback.Reason = "no_authorized_fabric_route_for_selected_entry_exit" + } else { + status = "blocked_no_fabric_route" + fallback.Active = false + fallback.Degraded = true + fallback.Reason = "backend_fallback_disabled_by_pool_policy" + } + if fabricServiceRoutesFencedForSelectedPair(routes, selectedEntry, selectedExit) { + fallback.Reason = "fabric_route_rebuild_pending_backend_relay" + } else if fabricServiceRoutesFencedForPool(routes) { + fallback.Reason = "fabric_entry_exit_pool_rebuild_pending_backend_relay" + } + primary = FabricServiceChannelRoute{ + ClusterID: input.ClusterID, + ServiceClass: input.ServiceClass, + SourceNodeID: selectedEntry, + DestinationNodeID: selectedExit, + Hops: []string{selectedEntry, selectedExit}, + AllowedChannels: allowedChannels, + Generation: routeGeneration, + Status: "missing_route_intent", + RecoveryPolicy: 
fabricServiceChannelRecoveryPolicyRef(recoveryPolicy), + PathScore: 1, + ScoreReasons: []string{"fallback_until_fabric_route_exists"}, + ExpiresAt: expiresAt, + } + } else { + fallback.Active = false + fallback.Degraded = false + } + channelID := uuidLikeRandom() + if channelID == "" { + channelID = "fabric-channel-" + now.Format("20060102T150405.000000000Z") + } + token := uuidLikeRandom() + if token == "" { + token = channelID + } + lease := FabricServiceChannelLease{ + SchemaVersion: "rap.fabric_service_channel_lease.v1", + ChannelID: channelID, + ClusterID: input.ClusterID, + OrganizationID: input.OrganizationID, + UserID: input.UserID, + ResourceID: input.ResourceID, + ServiceClass: input.ServiceClass, + Status: status, + SelectedEntryNodeID: selectedEntry, + SelectedExitNodeID: selectedExit, + EntryPool: fabricServiceChannelNodePool(entryNodeIDs, "entry", selectedEntry), + ExitPool: fabricServiceChannelNodePool(exitNodeIDs, "exit", selectedExit), + RequiredRoles: requiredRoles, + AllowedChannels: allowedChannels, + PrimaryRoute: primary, + AlternateRoutes: alternates, + RecoveryPolicy: fabricServiceChannelRecoveryPolicyRef(recoveryPolicy), + PoolPolicy: fabricServiceChannelPoolPolicyRef(poolPolicy), + DataPlane: fabricServiceChannelDataPlaneContract(input.ServiceClass, poolPolicy, fallback), + QoS: defaultJSON(input.QoS, defaultFabricServiceQoS(input.ServiceClass)), + Failover: defaultJSON(input.Failover, fabricServiceFailoverFromPoolPolicy(poolPolicy)), + Fallback: fallback, + Token: FabricServiceChannelToken{ + Type: "control_plane_issued_bearer", + Token: "rap_fsc_" + strings.ReplaceAll(token, "-", ""), + TTLSeconds: int(ttl.Seconds()), + IntrospectionPath: "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/introspect", + }, + EntryHTTP: fabricServiceChannelHTTPIngress(input.ServiceClass), + RouteGeneration: routeGeneration, + FencingEpoch: now.UnixNano(), + IssuedAt: now, + ExpiresAt: expiresAt, + Metadata: defaultJSON(input.Metadata, 
`{}`), + } + if signed, err := s.signFabricServiceChannelLease(ctx, lease); err == nil { + lease = signed + } + s.rememberFabricServiceChannelLease(lease) + if _, err := s.store.StoreFabricServiceChannelLease(ctx, StoreFabricServiceChannelLeaseInput{ + Lease: lease, + TokenHash: fabricServiceChannelTokenHash(lease.Token.Token), + }); err != nil { + return FabricServiceChannelLease{}, err + } + return lease, nil +} + +func (s *Service) rememberFabricServiceChannelLease(lease FabricServiceChannelLease) { + if strings.TrimSpace(lease.ClusterID) == "" || strings.TrimSpace(lease.ChannelID) == "" || strings.TrimSpace(lease.Token.Token) == "" { + return + } + now := s.now() + if now.IsZero() { + now = time.Now().UTC() + } + s.fabricServiceChannelLeaseMu.Lock() + defer s.fabricServiceChannelLeaseMu.Unlock() + if s.fabricServiceChannelLeaseCache == nil { + s.fabricServiceChannelLeaseCache = map[string]FabricServiceChannelLease{} + } + for key, item := range s.fabricServiceChannelLeaseCache { + if !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(now) { + delete(s.fabricServiceChannelLeaseCache, key) + } + } + s.fabricServiceChannelLeaseCache[fabricServiceChannelLeaseCacheKey(lease.ClusterID, lease.ChannelID)] = lease +} + +func (s *Service) IntrospectFabricServiceChannelLease(ctx context.Context, input IntrospectFabricServiceChannelLeaseInput) (FabricServiceChannelLeaseIntrospection, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ChannelID = strings.TrimSpace(input.ChannelID) + input.ResourceID = strings.TrimSpace(input.ResourceID) + input.ServiceClass = normalizeFabricServiceClass(input.ServiceClass) + input.ChannelClass = strings.TrimSpace(strings.ToLower(input.ChannelClass)) + input.Token = strings.TrimSpace(input.Token) + input.EntryNodeID = strings.TrimSpace(input.EntryNodeID) + if input.ClusterID == "" || input.ChannelID == "" || input.Token == "" { + return FabricServiceChannelLeaseIntrospection{}, ErrInvalidPayload + } + now := s.now() + if 
now.IsZero() { + now = time.Now().UTC() + } + s.fabricServiceChannelLeaseMu.Lock() + lease, ok := s.fabricServiceChannelLeaseCache[fabricServiceChannelLeaseCacheKey(input.ClusterID, input.ChannelID)] + tokenHash := "" + if ok && !lease.ExpiresAt.IsZero() && !lease.ExpiresAt.After(now) { + delete(s.fabricServiceChannelLeaseCache, fabricServiceChannelLeaseCacheKey(input.ClusterID, input.ChannelID)) + ok = false + } + if ok { + tokenHash = fabricServiceChannelTokenHash(lease.Token.Token) + } + s.fabricServiceChannelLeaseMu.Unlock() + if !ok { + record, err := s.store.GetFabricServiceChannelLease(ctx, input.ClusterID, input.ChannelID) + if err != nil && !errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelLeaseIntrospection{}, err + } + if err == nil { + lease = record.Lease + tokenHash = strings.TrimSpace(record.TokenHash) + if !lease.ExpiresAt.IsZero() && !lease.ExpiresAt.After(now) { + ok = false + } else { + ok = true + s.rememberFabricServiceChannelLease(lease) + } + } + } + out := FabricServiceChannelLeaseIntrospection{ + SchemaVersion: "rap.fabric_service_channel_introspection.v1", + ClusterID: input.ClusterID, + ChannelID: input.ChannelID, + ResourceID: input.ResourceID, + ServiceClass: input.ServiceClass, + AcceptedBy: "introspection", + Status: "denied", + Reason: "lease_not_found", + } + if !ok { + return out, nil + } + out.ResourceID = lease.ResourceID + out.ServiceClass = lease.ServiceClass + out.SelectedEntryNodeID = lease.SelectedEntryNodeID + out.SelectedExitNodeID = lease.SelectedExitNodeID + out.AllowedChannels = append([]string{}, lease.AllowedChannels...) 
+ out.LeaseStatus = lease.Status + out.PrimaryRoute = lease.PrimaryRoute + out.DataPlane = lease.DataPlane + out.RouteGeneration = lease.RouteGeneration + out.FencingEpoch = lease.FencingEpoch + out.ExpiresAt = lease.ExpiresAt + if lease.ClusterID != input.ClusterID || + lease.ChannelID != input.ChannelID || + tokenHash == "" || + tokenHash != fabricServiceChannelTokenHash(input.Token) { + out.Reason = "lease_token_mismatch" + return out, nil + } + if lease.ResourceID != "" && input.ResourceID != "" && lease.ResourceID != input.ResourceID { + out.Reason = "resource_mismatch" + return out, nil + } + if input.ServiceClass != "" && lease.ServiceClass != input.ServiceClass { + out.Reason = "service_class_mismatch" + return out, nil + } + if input.ChannelClass != "" && !containsString(lease.AllowedChannels, input.ChannelClass) { + out.Reason = "channel_class_not_allowed" + return out, nil + } + if input.EntryNodeID != "" && lease.SelectedEntryNodeID != "" && lease.SelectedEntryNodeID != input.EntryNodeID { + out.Reason = "entry_node_mismatch" + return out, nil + } + out.Allowed = true + out.Status = "allowed" + out.Reason = "lease_introspection_allowed" + if lease.Status == FabricServiceChannelStatusDegradedFallback || lease.PrimaryRoute.Status == "missing_route_intent" { + out.ForceBackendFallback = true + } else { + out.PreferredRouteID = strings.TrimSpace(lease.PrimaryRoute.RouteID) + } + return out, nil +} + +func (s *Service) ListFabricServiceChannelLeases(ctx context.Context, actorUserID string, input ListFabricServiceChannelLeasesInput) (FabricServiceChannelLeaseMaintenance, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelLeaseMaintenance{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.ServiceClass = normalizeFabricServiceClass(input.ServiceClass) + input.EntryNodeID = strings.TrimSpace(input.EntryNodeID) + input.ResourceID = strings.TrimSpace(input.ResourceID) + if input.ClusterID 
== "" { + return FabricServiceChannelLeaseMaintenance{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 500 { + input.Limit = 100 + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + records, err := s.store.ListFabricServiceChannelLeases(ctx, input) + if err != nil { + return FabricServiceChannelLeaseMaintenance{}, err + } + out := FabricServiceChannelLeaseMaintenance{ + SchemaVersion: "rap.fabric_service_channel_lease_maintenance.v1", + ClusterID: input.ClusterID, + Status: "ready", + Reason: "lease_maintenance_ready", + ObservedAt: now.UTC(), + WindowLimit: input.Limit, + } + for _, record := range records { + summary := fabricServiceChannelLeaseSummaryFromRecord(record, now) + if summary.Expired { + out.ExpiredCount++ + } else { + out.ActiveCount++ + } + out.Leases = append(out.Leases, summary) + } + out.ScannedCount = len(out.Leases) + if out.ExpiredCount > 0 { + out.Status = "degraded" + out.Reason = "expired_leases_pending_cleanup" + out.RecommendedOperatorAction = "Run service-channel lease cleanup to remove expired compatibility lease records." 
+ } + return out, nil +} + +func (s *Service) CleanupFabricServiceChannelLeases(ctx context.Context, input CleanupFabricServiceChannelLeasesInput) (FabricServiceChannelLeaseMaintenance, error) { + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return FabricServiceChannelLeaseMaintenance{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelLeaseMaintenance{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 1000 { + input.Limit = 100 + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + deleted, err := s.store.CleanupExpiredFabricServiceChannelLeases(ctx, input.ClusterID, now.UTC(), input.Limit) + if err != nil { + return FabricServiceChannelLeaseMaintenance{}, err + } + out, err := s.ListFabricServiceChannelLeases(ctx, input.ActorUserID, ListFabricServiceChannelLeasesInput{ + ClusterID: input.ClusterID, + IncludeExpired: true, + Limit: input.Limit, + Now: now, + }) + if err != nil { + return FabricServiceChannelLeaseMaintenance{}, err + } + out.DeletedExpiredCount = deleted + out.Status = "ready" + out.Reason = "expired_leases_cleaned" + out.RecommendedOperatorAction = "" + if out.ExpiredCount > 0 { + out.Status = "degraded" + out.Reason = "expired_leases_remaining" + out.RecommendedOperatorAction = "Run cleanup again; expired leases remain beyond the bounded cleanup window." 
+ } + return out, nil +} + +func (s *Service) GetFabricServiceChannelAccessTelemetry(ctx context.Context, actorUserID string, input GetFabricServiceChannelAccessTelemetryInput) (FabricServiceChannelAccessTelemetry, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelAccessTelemetry{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelAccessTelemetry{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 200 { + input.Limit = 100 + } + now := input.Now + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + nodes, err := s.store.ListClusterNodes(ctx, input.ClusterID) + if err != nil { + return FabricServiceChannelAccessTelemetry{}, err + } + out := FabricServiceChannelAccessTelemetry{ + SchemaVersion: "rap.fabric_service_channel_access_telemetry.v1", + ClusterID: input.ClusterID, + Status: "ready", + Reason: "access_telemetry_ready", + ObservedAt: now.UTC(), + NodeCount: len(nodes), + TrafficClassCounts: map[string]int{}, + RecommendedParallelWindows: map[string]int{}, + } + for _, node := range nodes { + if len(out.Nodes) >= input.Limit { + break + } + items, err := s.store.ListNodeTelemetry(ctx, input.ClusterID, node.ID, 5) + if err != nil { + continue + } + report := map[string]any{} + var observedAt time.Time + for _, item := range items { + payload := jsonObject(item.Payload) + report = jsonMapPath(payload, "fabric_service_channel_access_report") + if len(report) > 0 { + observedAt = item.ObservedAt + break + } + } + if len(report) == 0 { + heartbeats, err := s.store.ListNodeHeartbeats(ctx, input.ClusterID, node.ID, 5) + if err == nil { + for _, heartbeat := range heartbeats { + payload := jsonObject(heartbeat.Metadata) + report = jsonMapPath(payload, "fabric_service_channel_access_report") + if len(report) > 0 { + observedAt = heartbeat.ObservedAt + break + } + } + } + } + if len(report) == 0 { + 
continue + } + nodeReport := FabricServiceChannelAccessTelemetryNode{ + NodeID: node.ID, + NodeName: node.Name, + ObservedAt: observedAt, + TotalAccepted: jsonInt(report, "total"), + SignedAccepted: jsonInt(report, "signed"), + IntrospectionAccepted: jsonInt(report, "introspection"), + LegacyUnsignedAccepted: jsonInt(report, "legacy_unsigned"), + BackendFallbackCount: jsonInt(report, "backend_fallback"), + BackendFallbackBlockedCount: jsonInt(report, "backend_fallback_blocked"), + FabricRouteSendFailureCount: jsonInt(report, "fabric_route_send_failure"), + DataPlaneContractCount: jsonInt(report, "data_plane_contract"), + LastDataPlaneMode: jsonString(report, "last_data_plane_mode"), + LastWorkingDataTransport: jsonString(report, "last_working_data_transport"), + LastSteadyStateTransport: jsonString(report, "last_steady_state_transport"), + LastBackendRelayPolicy: jsonString(report, "last_backend_relay_policy"), + LastLogicalFlowMode: jsonString(report, "last_logical_flow_mode"), + LastDataPlaneViolationStatus: jsonString(report, "last_data_plane_violation_status"), + LastDataPlaneViolationReason: jsonString(report, "last_data_plane_violation_reason"), + } + if nodeReport.SignedAccepted == 0 { + nodeReport.SignedAccepted = jsonInt(report, "accepted_by_signed") + } + if nodeReport.IntrospectionAccepted == 0 { + nodeReport.IntrospectionAccepted = jsonInt(report, "accepted_by_introspection") + } + if nodeReport.LegacyUnsignedAccepted == 0 { + nodeReport.LegacyUnsignedAccepted = jsonInt(report, "accepted_by_legacy_unsigned") + } + if value := jsonString(report, "last_accepted_at"); value != "" { + if parsed, err := time.Parse(time.RFC3339Nano, value); err == nil { + nodeReport.LastAcceptedAt = &parsed + if out.LatestAcceptedAt == nil || parsed.After(*out.LatestAcceptedAt) { + latest := parsed + out.LatestAcceptedAt = &latest + } + } + } + if heartbeats, err := s.store.ListNodeHeartbeats(ctx, input.ClusterID, node.ID, 1); err == nil && len(heartbeats) > 0 { + 
flowScheduler := fabricServiceChannelFlowSchedulerFromHeartbeat(heartbeats[0]) + nodeReport.TrafficClassCounts = jsonStringIntMap(flowScheduler, "traffic_class_counts") + nodeReport.FlowChannelCount = jsonInt(flowScheduler, "channel_count") + nodeReport.FlowDropped = jsonInt(flowScheduler, "dropped") + nodeReport.FlowHighWatermark = jsonInt(flowScheduler, "high_watermark") + nodeReport.FlowMaxInFlight = jsonInt(flowScheduler, "max_in_flight") + nodeReport.RecommendedParallelWindows = jsonStringIntMap(flowScheduler, "recommended_parallel_windows") + nodeReport.AdaptiveBackpressureActive = jsonBool(flowScheduler, "adaptive_backpressure_active") + nodeReport.AdaptiveBackpressureReason = jsonString(flowScheduler, "adaptive_backpressure_reason") + nodeReport.AdaptivePolicyFingerprint = jsonString(flowScheduler, "adaptive_policy_fingerprint") + } + nodeReport.FlowHealthStatus, nodeReport.FlowHealthReason, _ = fabricServiceChannelFlowHealth( + nodeReport.TrafficClassCounts, + nodeReport.FlowDropped, + nodeReport.FlowHighWatermark, + nodeReport.FlowMaxInFlight, + nodeReport.BackendFallbackCount, + 0, + 0, + 0, + 0, + ) + out.ReportingNodeCount++ + out.TotalAccepted += nodeReport.TotalAccepted + out.SignedAccepted += nodeReport.SignedAccepted + out.IntrospectionAccepted += nodeReport.IntrospectionAccepted + out.LegacyUnsignedAccepted += nodeReport.LegacyUnsignedAccepted + out.BackendFallbackCount += nodeReport.BackendFallbackCount + out.BackendFallbackBlockedCount += nodeReport.BackendFallbackBlockedCount + out.FabricRouteSendFailureCount += nodeReport.FabricRouteSendFailureCount + out.DataPlaneContractCount += nodeReport.DataPlaneContractCount + if out.LastDataPlaneMode == "" { + out.LastDataPlaneMode = nodeReport.LastDataPlaneMode + } + if out.LastWorkingDataTransport == "" { + out.LastWorkingDataTransport = nodeReport.LastWorkingDataTransport + } + if out.LastSteadyStateTransport == "" { + out.LastSteadyStateTransport = nodeReport.LastSteadyStateTransport + } + if 
out.LastBackendRelayPolicy == "" { + out.LastBackendRelayPolicy = nodeReport.LastBackendRelayPolicy + } + if out.LastLogicalFlowMode == "" { + out.LastLogicalFlowMode = nodeReport.LastLogicalFlowMode + } + if out.LastDataPlaneViolationStatus == "" { + out.LastDataPlaneViolationStatus = nodeReport.LastDataPlaneViolationStatus + } + if out.LastDataPlaneViolationReason == "" { + out.LastDataPlaneViolationReason = nodeReport.LastDataPlaneViolationReason + } + mergeStringIntMap(out.TrafficClassCounts, nodeReport.TrafficClassCounts) + mergeMinStringIntMap(out.RecommendedParallelWindows, nodeReport.RecommendedParallelWindows) + if nodeReport.AdaptiveBackpressureActive { + out.AdaptiveBackpressureActive = true + if out.AdaptiveBackpressureReason == "" { + out.AdaptiveBackpressureReason = nodeReport.AdaptiveBackpressureReason + } + } + if out.AdaptivePolicyFingerprint == "" { + out.AdaptivePolicyFingerprint = nodeReport.AdaptivePolicyFingerprint + } + out.FlowChannelCount += nodeReport.FlowChannelCount + out.FlowDropped += nodeReport.FlowDropped + if nodeReport.FlowHighWatermark > out.FlowHighWatermark { + out.FlowHighWatermark = nodeReport.FlowHighWatermark + } + if nodeReport.FlowMaxInFlight > out.FlowMaxInFlight { + out.FlowMaxInFlight = nodeReport.FlowMaxInFlight + } + out.Nodes = append(out.Nodes, nodeReport) + } + if len(out.TrafficClassCounts) == 0 { + out.TrafficClassCounts = nil + } + if len(out.RecommendedParallelWindows) == 0 { + out.RecommendedParallelWindows = nil + } + nodeReportsByID := map[string]FabricServiceChannelAccessTelemetryNode{} + for _, node := range out.Nodes { + nodeReportsByID[node.NodeID] = node + } + routeManagerByNodeID := map[string]map[string]any{} + routeManagerTransitionByNodeID := map[string]map[string]any{} + for _, node := range nodes { + heartbeats, err := s.store.ListNodeHeartbeats(ctx, input.ClusterID, node.ID, 1) + if err != nil || len(heartbeats) == 0 { + continue + } + metadata := jsonObject(heartbeats[0].Metadata) + runtime := 
jsonMapPath(metadata, "fabric_service_channel_runtime_report") + ingress := jsonMapPath(runtime, "ingress") + routeManager := jsonMapPath(ingress, "route_manager") + if len(routeManager) > 0 { + routeManagerByNodeID[node.ID] = routeManager + } + transition := jsonMapPath(ingress, "route_manager_transition") + if len(transition) > 0 { + routeManagerTransitionByNodeID[node.ID] = transition + } + } + feedbackItems, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: input.ClusterID, + ServiceClass: FabricServiceClassVPNPackets, + Now: now.UTC(), + IncludeExpired: false, + }) + if err != nil { + return FabricServiceChannelAccessTelemetry{}, err + } + feedbackByRouteID := map[string]FabricServiceChannelRouteFeedbackObservation{} + for _, item := range feedbackItems { + if strings.TrimSpace(item.RouteID) == "" { + continue + } + current, ok := feedbackByRouteID[item.RouteID] + if !ok || item.ObservedAt.After(current.ObservedAt) { + feedbackByRouteID[item.RouteID] = item + } + } + leaseRecords, err := s.store.ListFabricServiceChannelLeases(ctx, ListFabricServiceChannelLeasesInput{ + ClusterID: input.ClusterID, + IncludeExpired: false, + Limit: input.Limit, + Now: now.UTC(), + }) + if err != nil { + return FabricServiceChannelAccessTelemetry{}, err + } + for _, record := range leaseRecords { + summary := fabricServiceChannelLeaseSummaryFromRecord(record, now) + channel := FabricServiceChannelAccessTelemetryChannel{ + ChannelID: summary.ChannelID, + ResourceID: summary.ResourceID, + ServiceClass: summary.ServiceClass, + Status: summary.Status, + SelectedEntryNodeID: summary.SelectedEntryNodeID, + SelectedExitNodeID: summary.SelectedExitNodeID, + PrimaryRouteID: summary.PrimaryRouteID, + PrimaryRouteStatus: summary.PrimaryRouteStatus, + ForceBackendFallback: summary.ForceBackendFallback, + DataPlane: summary.DataPlane, + ExpiresAt: summary.ExpiresAt, + } + if record.Lease.PoolPolicy != nil { + 
channel.PoolPolicyFingerprint = record.Lease.PoolPolicy.Fingerprint + } + if entryReport, ok := nodeReportsByID[channel.SelectedEntryNodeID]; ok { + channel.EntryNodeTotalAccepted = entryReport.TotalAccepted + channel.EntryNodeIntrospectionAccepted = entryReport.IntrospectionAccepted + channel.EntryNodeBackendFallbackCount = entryReport.BackendFallbackCount + channel.EntryNodeBackendFallbackBlockedCount = entryReport.BackendFallbackBlockedCount + channel.EntryNodeFabricRouteSendFailureCount = entryReport.FabricRouteSendFailureCount + channel.EntryNodeDataPlaneContractCount = entryReport.DataPlaneContractCount + channel.EntryNodeLastDataPlaneMode = entryReport.LastDataPlaneMode + channel.EntryNodeLastWorkingDataTransport = entryReport.LastWorkingDataTransport + channel.EntryNodeLastSteadyStateTransport = entryReport.LastSteadyStateTransport + channel.EntryNodeLastBackendRelayPolicy = entryReport.LastBackendRelayPolicy + channel.EntryNodeLastLogicalFlowMode = entryReport.LastLogicalFlowMode + channel.EntryNodeLastDataPlaneViolationStatus = entryReport.LastDataPlaneViolationStatus + channel.EntryNodeLastDataPlaneViolationReason = entryReport.LastDataPlaneViolationReason + channel.EntryNodeTrafficClassCounts = copyStringIntMap(entryReport.TrafficClassCounts) + channel.EntryNodeFlowChannelCount = entryReport.FlowChannelCount + channel.EntryNodeFlowDropped = entryReport.FlowDropped + channel.EntryNodeFlowHighWatermark = entryReport.FlowHighWatermark + channel.EntryNodeFlowMaxInFlight = entryReport.FlowMaxInFlight + channel.EntryNodeFlowHealthStatus = entryReport.FlowHealthStatus + channel.EntryNodeFlowHealthReason = entryReport.FlowHealthReason + channel.EntryNodeRecommendedParallelWindows = copyStringIntMap(entryReport.RecommendedParallelWindows) + channel.EntryNodeAdaptiveBackpressureActive = entryReport.AdaptiveBackpressureActive + channel.EntryNodeAdaptiveBackpressureReason = entryReport.AdaptiveBackpressureReason + channel.EntryNodeAdaptivePolicyFingerprint = 
entryReport.AdaptivePolicyFingerprint + } + if feedback, ok := feedbackByRouteID[channel.PrimaryRouteID]; ok { + observedAt := feedback.ObservedAt + channel.RouteFeedbackStatus = feedback.FeedbackStatus + channel.RouteFeedbackObservedAt = &observedAt + channel.RouteFeedbackScoreAdjustment = feedback.ScoreAdjustment + channel.RouteFeedbackEffectiveScoreAdjustment = feedback.EffectiveScoreAdjustment + channel.RouteFeedbackReasons = append([]string{}, feedback.Reasons...) + channel.RouteQualityWindowSampleCount = fabricServiceChannelFeedbackPayloadInt(feedback.Payload, "quality_window_sample_count") + channel.RouteQualityWindowFailureCount = fabricServiceChannelFeedbackPayloadInt(feedback.Payload, "quality_window_failure_count") + channel.RouteQualityWindowDropCount = fabricServiceChannelFeedbackPayloadInt(feedback.Payload, "quality_window_drop_count") + channel.RouteQualityWindowSlowCount = fabricServiceChannelFeedbackPayloadInt(feedback.Payload, "quality_window_slow_count") + channel.LastSendDurationMs = feedback.LastSendDurationMs + channel.EntryNodeFlowHealthStatus, channel.EntryNodeFlowHealthReason, _ = fabricServiceChannelFlowHealth( + channel.EntryNodeTrafficClassCounts, + channel.EntryNodeFlowDropped, + channel.EntryNodeFlowHighWatermark, + channel.EntryNodeFlowMaxInFlight, + channel.EntryNodeBackendFallbackCount, + channel.LastSendDurationMs, + channel.RouteQualityWindowFailureCount, + channel.RouteQualityWindowDropCount, + channel.RouteQualityWindowSlowCount, + ) + out.CorrelatedRouteCount++ + if feedback.FeedbackStatus == "degraded" || feedback.FeedbackStatus == "fenced" || feedback.EffectiveScoreAdjustment < 0 || feedback.ScoreAdjustment < 0 { + out.DegradedRouteCount++ + } + } + channel = fabricServiceChannelAccessRemediation(channel, record.Lease, now) + channel = fabricServiceChannelAccessRouteDecisionTelemetry(channel, routeManagerByNodeID[channel.SelectedEntryNodeID], routeManagerTransitionByNodeID[channel.SelectedEntryNodeID]) + channel = 
fabricServiceChannelAccessRemediationExecution(channel, routeManagerByNodeID[channel.SelectedEntryNodeID], routeManagerTransitionByNodeID[channel.SelectedEntryNodeID], now) + channel = s.fabricServiceChannelAccessRemediationLedgerExecution(ctx, input.ClusterID, channel) + fabricServiceChannelAccumulateRouteDecisionTelemetry(&out, channel) + if channel.ForceBackendFallback { + out.DegradedFallbackChannelCount++ + } + out.ActiveChannels = append(out.ActiveChannels, channel) + } + out.ActiveChannelCount = len(out.ActiveChannels) + sort.Slice(out.Nodes, func(i, j int) bool { + if out.Nodes[i].TotalAccepted != out.Nodes[j].TotalAccepted { + return out.Nodes[i].TotalAccepted > out.Nodes[j].TotalAccepted + } + return out.Nodes[i].NodeName < out.Nodes[j].NodeName + }) + sort.Slice(out.ActiveChannels, func(i, j int) bool { + if out.ActiveChannels[i].ForceBackendFallback != out.ActiveChannels[j].ForceBackendFallback { + return out.ActiveChannels[i].ForceBackendFallback + } + if out.ActiveChannels[i].RouteFeedbackStatus != out.ActiveChannels[j].RouteFeedbackStatus { + return out.ActiveChannels[i].RouteFeedbackStatus > out.ActiveChannels[j].RouteFeedbackStatus + } + return out.ActiveChannels[i].ExpiresAt.Before(out.ActiveChannels[j].ExpiresAt) + }) + if out.NoSafeRecoveryDecisionCount > 0 { + out.Status = "degraded" + out.Reason = "active_channels_no_safe_recovery" + out.RecommendedOperatorAction = "Inspect active service-channel route decisions; at least one channel has no safe recovery route." + } else if out.ReportingNodeCount == 0 { + out.Status = "degraded" + out.Reason = "no_access_telemetry_reported" + out.RecommendedOperatorAction = "Wait for node telemetry or verify fabric_service_channel_access_telemetry capability on node-agent." 
+ } else if out.DegradedFallbackChannelCount > 0 || out.DegradedRouteCount > 0 { + out.Status = "degraded" + out.Reason = "active_channels_degraded" + out.RecommendedOperatorAction = "Inspect active service-channel routes with backend fallback or degraded route-quality feedback." + } + out.FlowHealthStatus, out.FlowHealthReason, _ = fabricServiceChannelFlowHealth( + out.TrafficClassCounts, + out.FlowDropped, + out.FlowHighWatermark, + out.FlowMaxInFlight, + out.BackendFallbackCount, + 0, + 0, + 0, + 0, + ) + for _, channel := range out.ActiveChannels { + out.FlowHealthStatus, out.FlowHealthReason = fabricServiceChannelWorseFlowHealth(out.FlowHealthStatus, out.FlowHealthReason, channel.EntryNodeFlowHealthStatus, channel.EntryNodeFlowHealthReason) + } + if out.FlowHealthStatus == "critical" || out.FlowHealthStatus == "degraded" { + out.Status = "degraded" + if out.Reason == "access_telemetry_ready" { + out.Reason = "flow_health_degraded" + } + if out.RecommendedOperatorAction == "" { + out.RecommendedOperatorAction = fabricServiceChannelFlowHealthAction(out.FlowHealthStatus, out.FlowHealthReason) + } + } else if out.FlowHealthStatus == "watch" && out.RecommendedOperatorAction == "" { + out.RecommendedOperatorAction = fabricServiceChannelFlowHealthAction(out.FlowHealthStatus, out.FlowHealthReason) + } + return out, nil +} + +func fabricServiceChannelFlowHealth(trafficClassCounts map[string]int, flowDropped, flowHighWatermark, flowMaxInFlight, backendFallbackCount int, lastSendDurationMs int64, routeFailureCount, routeDropCount, routeSlowCount int) (string, string, string) { + switch { + case flowDropped > 0: + return "critical", "flow_drops_reported", fabricServiceChannelFlowHealthAction("critical", "flow_drops_reported") + case routeDropCount > 0: + return "critical", "route_quality_window_drops_reported", fabricServiceChannelFlowHealthAction("critical", "route_quality_window_drops_reported") + case backendFallbackCount > 0: + return "degraded", 
"backend_fallback_observed", fabricServiceChannelFlowHealthAction("degraded", "backend_fallback_observed") + case routeFailureCount > 0: + return "degraded", "route_quality_window_failures_reported", fabricServiceChannelFlowHealthAction("degraded", "route_quality_window_failures_reported") + case routeSlowCount > 0: + return "degraded", "route_quality_window_slow_samples_reported", fabricServiceChannelFlowHealthAction("degraded", "route_quality_window_slow_samples_reported") + case lastSendDurationMs >= 1000: + return "degraded", "route_send_latency_high", fabricServiceChannelFlowHealthAction("degraded", "route_send_latency_high") + } + bulk := trafficClassCounts["bulk"] + interactive := trafficClassCounts["interactive"] + trafficClassCounts["control"] + switch { + case flowHighWatermark >= 64 || flowMaxInFlight >= 16: + return "degraded", "flow_queue_pressure_high", fabricServiceChannelFlowHealthAction("degraded", "flow_queue_pressure_high") + case bulk >= 16 && interactive > 0: + return "watch", "bulk_pressure_with_interactive_qos_observed", fabricServiceChannelFlowHealthAction("watch", "bulk_pressure_with_interactive_qos_observed") + case bulk >= 16: + return "watch", "bulk_pressure_observed", fabricServiceChannelFlowHealthAction("watch", "bulk_pressure_observed") + case flowHighWatermark >= 16 || flowMaxInFlight >= 4: + return "watch", "flow_queue_pressure_observed", fabricServiceChannelFlowHealthAction("watch", "flow_queue_pressure_observed") + default: + return "healthy", "flow_health_ready", fabricServiceChannelFlowHealthAction("healthy", "flow_health_ready") + } +} + +func fabricServiceChannelWorseFlowHealth(currentStatus, currentReason, candidateStatus, candidateReason string) (string, string) { + if candidateStatus == "" { + return currentStatus, currentReason + } + if fabricServiceChannelFlowHealthRank(candidateStatus) > fabricServiceChannelFlowHealthRank(currentStatus) { + return candidateStatus, candidateReason + } + return currentStatus, currentReason 
+} + +func fabricServiceChannelFlowHealthRank(status string) int { + switch status { + case "critical": + return 4 + case "degraded": + return 3 + case "watch": + return 2 + case "healthy": + return 1 + default: + return 0 + } +} + +func fabricServiceChannelFlowHealthAction(status, reason string) string { + switch status { + case "critical": + return "Reduce or reroute service-channel pressure immediately; inspect flow drops, route drops, and backend fallback before adding user traffic." + case "degraded": + return "Inspect service-channel route quality and active entry-node pressure; prefer alternate route or rebuild when degraded evidence persists." + case "watch": + if reason == "bulk_pressure_with_interactive_qos_observed" { + return "Bulk pressure is active while interactive/control remains observable; keep watching latency and drops before increasing load." + } + return "Bulk or queue pressure is visible; monitor interactive/control traffic before increasing production load." + default: + return "Flow health is within the current service-channel guard policy." + } +} + +func fabricServiceChannelAccessRemediation(channel FabricServiceChannelAccessTelemetryChannel, lease FabricServiceChannelLease, now time.Time) FabricServiceChannelAccessTelemetryChannel { + if channel.ForceBackendFallback { + channel.RemediationAction = "use_backend_fallback" + channel.RemediationReason = "explicit_backend_fallback_active" + channel.RecommendedOperatorAction = "Inspect missing/fenced fabric route and keep backend fallback visible until a normal route is available." 
+ channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel + } + degraded := channel.RouteFeedbackStatus == "degraded" || channel.RouteFeedbackStatus == "fenced" || + channel.RouteFeedbackScoreAdjustment < 0 || channel.RouteFeedbackEffectiveScoreAdjustment < 0 + if !degraded { + channel.RemediationAction = "none" + channel.RemediationReason = "active_route_quality_acceptable" + channel.RecommendedOperatorAction = "No route remediation required." + return channel + } + if containsString(channel.RouteFeedbackReasons, "service_channel_degraded_fallback_recommended") { + channel.RemediationAction = "use_backend_fallback" + channel.RemediationReason = "route_feedback_recommends_degraded_fallback" + channel.RecommendedOperatorAction = "Use explicit degraded backend fallback while route rebuild catches up." + channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel + } + if alternate, ok := fabricServiceChannelFirstAuthorizedAlternate(lease.AlternateRoutes, channel.PrimaryRouteID); ok { + guardStatus, guardReason := fabricServiceChannelRouteAllowedByLeasePool(lease, alternate) + if guardStatus != "allowed" { + channel.RemediationAction = "rebuild_route" + channel.RemediationReason = "alternate_route_rejected_by_pool_policy" + channel.RemediationRouteID = alternate.RouteID + channel.RemediationRouteStatus = alternate.Status + channel.RemediationGuardStatus = guardStatus + channel.RemediationGuardReason = guardReason + channel.RecommendedOperatorAction = "Reject the alternate route and rebuild within the signed entry/exit pool policy." 
+ channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel + } + channel.RemediationAction = "prefer_alternate_route" + channel.RemediationReason = "authorized_alternate_route_available" + channel.RemediationRouteID = alternate.RouteID + channel.RemediationRouteStatus = alternate.Status + channel.RemediationGuardStatus = guardStatus + channel.RemediationGuardReason = guardReason + channel.RecommendedOperatorAction = "Prefer the authorized alternate route for this active service channel." + channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel + } + if containsString(channel.RouteFeedbackReasons, "service_channel_route_rebuild_recommended") || channel.RouteFeedbackStatus == "fenced" { + channel.RemediationAction = "rebuild_route" + channel.RemediationReason = "route_feedback_recommends_rebuild" + channel.RecommendedOperatorAction = "Trigger or wait for route rebuild; keep this distinct from backend fallback." + channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel + } + channel.RemediationAction = "inspect_route_quality" + channel.RemediationReason = "degraded_route_quality_without_replacement" + channel.RecommendedOperatorAction = "Inspect rolling route quality counters and route feedback provenance." 
+ channel.RemediationCommand = fabricServiceChannelAccessRemediationCommand(channel, lease, now) + return channel +} + +func fabricServiceChannelAccessRemediationCommand(channel FabricServiceChannelAccessTelemetryChannel, lease FabricServiceChannelLease, now time.Time) *FabricServiceChannelAccessRemediationCommand { + action := strings.TrimSpace(channel.RemediationAction) + if action == "" || action == "none" { + return nil + } + if now.IsZero() { + now = time.Now().UTC() + } + issuedAt := now.UTC() + expiresAt := issuedAt.Add(60 * time.Second) + if !channel.ExpiresAt.IsZero() && channel.ExpiresAt.Before(expiresAt) { + expiresAt = channel.ExpiresAt.UTC() + } + routeComponent := firstNonEmptyString(channel.RemediationRouteID, channel.PrimaryRouteID, "no-route") + return &FabricServiceChannelAccessRemediationCommand{ + SchemaVersion: "rap.fabric_service_channel_access_remediation_command.v1", + CommandID: "fsc-remediation:" + channel.ChannelID + ":" + action + ":" + routeComponent, + Action: action, + ClusterID: lease.ClusterID, + ChannelID: channel.ChannelID, + ResourceID: channel.ResourceID, + ServiceClass: channel.ServiceClass, + EntryNodeID: channel.SelectedEntryNodeID, + ExitNodeID: channel.SelectedExitNodeID, + PrimaryRouteID: channel.PrimaryRouteID, + ReplacementRouteID: channel.RemediationRouteID, + ReplacementRouteStatus: channel.RemediationRouteStatus, + PoolPolicyFingerprint: channel.PoolPolicyFingerprint, + GuardStatus: firstNonEmptyString(channel.RemediationGuardStatus, "allowed"), + GuardReason: firstNonEmptyString(channel.RemediationGuardReason, "lease_pool_policy_allows_route"), + ExecutionStatus: channel.RemediationExecutionStatus, + ExecutionReason: channel.RemediationExecutionReason, + ExecutionGeneration: channel.RemediationExecutionGeneration, + ExecutionObservedAt: channel.RemediationExecutionObservedAt, + Reason: channel.RemediationReason, + OperatorAction: channel.RecommendedOperatorAction, + IssuedAt: issuedAt, + ExpiresAt: expiresAt, + } +} 
+ +func fabricServiceChannelAccessRemediationExecution(channel FabricServiceChannelAccessTelemetryChannel, routeManager map[string]any, transition map[string]any, now time.Time) FabricServiceChannelAccessTelemetryChannel { + if channel.RemediationCommand == nil { + return channel + } + if !channel.RemediationCommand.ExpiresAt.IsZero() && !now.IsZero() && !channel.RemediationCommand.ExpiresAt.After(now.UTC()) { + channel.RemediationExecutionStatus = "expired" + channel.RemediationExecutionReason = "remediation_command_ttl_expired" + return fabricServiceChannelSyncRemediationCommandExecution(channel) + } + if channel.RemediationGuardStatus == "rejected" || channel.RemediationCommand.GuardStatus == "rejected" { + channel.RemediationExecutionStatus = "rejected_by_policy_guard" + channel.RemediationExecutionReason = firstNonEmptyString(channel.RemediationGuardReason, channel.RemediationCommand.GuardReason, "remediation_guard_rejected") + return fabricServiceChannelSyncRemediationCommandExecution(channel) + } + switch channel.RemediationCommand.Action { + case "prefer_alternate_route": + if decision, ok := fabricServiceChannelRouteManagerDecisionForCommand(routeManager, *channel.RemediationCommand); ok { + channel.RemediationExecutionStatus = firstNonEmptyString(jsonString(decision, "rebuild_status"), "observed") + channel.RemediationExecutionReason = firstNonEmptyString(jsonString(decision, "rebuild_reason"), jsonString(decision, "decision_source"), "route_manager_decision_observed") + channel.RemediationExecutionGeneration = jsonString(decision, "generation") + channel.RemediationExecutionObservedAt = firstNonEmptyString(jsonString(routeManager, "last_applied_at"), jsonString(transition, "observed_at")) + return fabricServiceChannelSyncRemediationCommandExecution(channel) + } + channel.RemediationExecutionStatus = "waiting_node_apply" + channel.RemediationExecutionReason = "route_manager_has_not_reported_command" + channel.RemediationExecutionObservedAt = 
jsonString(transition, "observed_at") + case "rebuild_route": + if decision, ok := fabricServiceChannelRouteManagerDecisionForCommand(routeManager, *channel.RemediationCommand); ok { + channel.RemediationExecutionStatus = firstNonEmptyString(jsonString(decision, "rebuild_status"), "pending_rebuild_request") + channel.RemediationExecutionReason = firstNonEmptyString(jsonString(decision, "rebuild_reason"), jsonString(decision, "decision_source"), "route_manager_rebuild_decision_observed") + channel.RemediationExecutionGeneration = jsonString(decision, "generation") + channel.RemediationExecutionObservedAt = firstNonEmptyString(jsonString(routeManager, "last_applied_at"), jsonString(transition, "observed_at")) + return fabricServiceChannelSyncRemediationCommandExecution(channel) + } + channel.RemediationExecutionStatus = "pending_rebuild_request" + channel.RemediationExecutionReason = "bounded_rebuild_route_command_visible" + channel.RemediationExecutionObservedAt = jsonString(transition, "observed_at") + case "use_backend_fallback": + channel.RemediationExecutionStatus = "degraded_fallback_visible" + channel.RemediationExecutionReason = "backend_fallback_command_visible" + default: + channel.RemediationExecutionStatus = "visible" + channel.RemediationExecutionReason = "remediation_command_visible" + } + return fabricServiceChannelSyncRemediationCommandExecution(channel) +} + +func fabricServiceChannelAccessRouteDecisionTelemetry(channel FabricServiceChannelAccessTelemetryChannel, routeManager map[string]any, transition map[string]any) FabricServiceChannelAccessTelemetryChannel { + decision, ok := fabricServiceChannelRouteManagerDecisionForChannel(routeManager, channel) + if !ok { + return channel + } + channel.RouteDecisionSource = jsonString(decision, "decision_source") + channel.RouteDecisionRouteID = jsonString(decision, "route_id") + channel.RouteDecisionReplacementRouteID = jsonString(decision, "replacement_route_id") + channel.RouteDecisionRebuildStatus = 
jsonString(decision, "rebuild_status") + channel.RouteDecisionRebuildReason = jsonString(decision, "rebuild_reason") + channel.RouteDecisionGeneration = firstNonEmptyString(jsonString(decision, "generation"), jsonString(decision, "rebuild_request_id")) + channel.RouteDecisionScoreReasons = jsonStringArray(decision, "score_reasons") + if channel.RemediationExecutionObservedAt == "" { + channel.RemediationExecutionObservedAt = firstNonEmptyString(jsonString(routeManager, "last_applied_at"), jsonString(transition, "observed_at")) + } + if channel.RouteDecisionSource == "service_channel_feedback_no_alternate" || + channel.RouteDecisionRebuildStatus == "pending_degraded_fallback" || + containsString(channel.RouteDecisionScoreReasons, "no_unfenced_alternate_route") { + channel.RemediationAction = firstNonEmptyString(channel.RemediationAction, "use_backend_fallback") + if channel.RemediationAction == "none" { + channel.RemediationAction = "use_backend_fallback" + } + channel.RemediationReason = "route_decision_no_safe_recovery" + channel.RemediationExecutionStatus = "route_rebuild_no_safe_recovery" + channel.RemediationExecutionReason = firstNonEmptyString(channel.RouteDecisionRebuildReason, "no_unfenced_alternate_route") + channel.RemediationExecutionGeneration = channel.RouteDecisionGeneration + channel.RecommendedOperatorAction = "No safe recovery route is available; keep degraded fallback visible and rebuild the route pool." 
+ } + return channel +} + +func fabricServiceChannelRouteManagerDecisionForChannel(routeManager map[string]any, channel FabricServiceChannelAccessTelemetryChannel) (map[string]any, bool) { + decisionsRaw := jsonArray(routeManager, "decisions") + if len(decisionsRaw) == 0 { + return nil, false + } + var selected map[string]any + selectedRank := 0 + for _, raw := range decisionsRaw { + decision, ok := raw.(map[string]any) + if !ok || !fabricServiceChannelRouteManagerDecisionMatchesChannel(decision, channel) { + continue + } + rank := fabricServiceChannelRouteManagerDecisionTelemetryRank(decision) + if rank > selectedRank { + selected = decision + selectedRank = rank + } + } + if selected == nil { + return nil, false + } + return selected, true +} + +func fabricServiceChannelRouteManagerDecisionMatchesChannel(decision map[string]any, channel FabricServiceChannelAccessTelemetryChannel) bool { + routeID := jsonString(decision, "route_id") + replacementRouteID := jsonString(decision, "replacement_route_id") + if routeID != "" && routeID == channel.PrimaryRouteID { + return true + } + if replacementRouteID != "" && replacementRouteID == channel.PrimaryRouteID { + return true + } + sourceNodeID := jsonString(decision, "source_node_id") + destinationNodeID := jsonString(decision, "destination_node_id") + localNodeID := jsonString(decision, "local_node_id") + return sourceNodeID != "" && + destinationNodeID != "" && + sourceNodeID == channel.SelectedEntryNodeID && + destinationNodeID == channel.SelectedExitNodeID && + (localNodeID == "" || localNodeID == channel.SelectedEntryNodeID) +} + +func fabricServiceChannelRouteManagerDecisionTelemetryRank(decision map[string]any) int { + source := jsonString(decision, "decision_source") + status := jsonString(decision, "rebuild_status") + reasons := jsonStringArray(decision, "score_reasons") + switch { + case source == "service_channel_feedback_no_alternate" || + status == "pending_degraded_fallback" || + containsString(reasons, 
"no_unfenced_alternate_route"): + return 50 + case status == "applied" || containsString(reasons, "service_channel_rebuild_applied"): + return 40 + case strings.Contains(source, "replacement"): + return 30 + case status != "": + return 20 + default: + return 10 + } +} + +func fabricServiceChannelAccumulateRouteDecisionTelemetry(out *FabricServiceChannelAccessTelemetry, channel FabricServiceChannelAccessTelemetryChannel) { + if out == nil || channel.RouteDecisionSource == "" { + return + } + out.RouteDecisionChannelCount++ + if fabricServiceChannelRouteDecisionIsReplacement(channel) { + out.ReplacementDecisionCount++ + } + if channel.RouteDecisionRebuildStatus == "applied" || containsString(channel.RouteDecisionScoreReasons, "service_channel_rebuild_applied") { + out.AppliedRebuildDecisionCount++ + } + if fabricServiceChannelRouteDecisionIsRecovery(channel) { + out.RecoveryDecisionCount++ + } + if fabricServiceChannelRouteDecisionIsNoSafeRecovery(channel) { + out.NoSafeRecoveryDecisionCount++ + } +} + +func fabricServiceChannelRouteDecisionIsReplacement(channel FabricServiceChannelAccessTelemetryChannel) bool { + return strings.Contains(channel.RouteDecisionSource, "replacement") || + strings.TrimSpace(channel.RouteDecisionReplacementRouteID) != "" +} + +func fabricServiceChannelRouteDecisionIsRecovery(channel FabricServiceChannelAccessTelemetryChannel) bool { + return containsString(channel.RouteDecisionScoreReasons, "service_channel_recovery_promoted") || + containsString(channel.RouteDecisionScoreReasons, "service_channel_recovery_hysteresis") || + strings.Contains(channel.RouteDecisionRebuildReason, "recovery") +} + +func fabricServiceChannelRouteDecisionIsNoSafeRecovery(channel FabricServiceChannelAccessTelemetryChannel) bool { + return channel.RouteDecisionSource == "service_channel_feedback_no_alternate" || + channel.RouteDecisionRebuildStatus == "pending_degraded_fallback" || + containsString(channel.RouteDecisionScoreReasons, "no_unfenced_alternate_route") 
+} + +func fabricServiceChannelSyncRemediationCommandExecution(channel FabricServiceChannelAccessTelemetryChannel) FabricServiceChannelAccessTelemetryChannel { + if channel.RemediationCommand == nil { + return channel + } + channel.RemediationCommand.ExecutionStatus = channel.RemediationExecutionStatus + channel.RemediationCommand.ExecutionReason = channel.RemediationExecutionReason + channel.RemediationCommand.ExecutionGeneration = channel.RemediationExecutionGeneration + channel.RemediationCommand.ExecutionObservedAt = channel.RemediationExecutionObservedAt + return channel +} + +func fabricServiceChannelRouteManagerDecisionForCommand(routeManager map[string]any, command FabricServiceChannelAccessRemediationCommand) (map[string]any, bool) { + decisionsRaw, ok := routeManager["decisions"].([]any) + if !ok { + return nil, false + } + for _, raw := range decisionsRaw { + decision, ok := raw.(map[string]any) + if !ok { + continue + } + if command.CommandID != "" && jsonString(decision, "rebuild_request_id") == command.CommandID { + return decision, true + } + if jsonString(decision, "route_id") == command.PrimaryRouteID && + jsonString(decision, "replacement_route_id") == command.ReplacementRouteID && + jsonString(decision, "decision_source") == "service_channel_remediation_command" { + return decision, true + } + } + return nil, false +} + +func (s *Service) fabricServiceChannelAccessRemediationLedgerExecution(ctx context.Context, clusterID string, channel FabricServiceChannelAccessTelemetryChannel) FabricServiceChannelAccessTelemetryChannel { + if channel.RemediationCommand == nil || channel.RemediationCommand.Action != "rebuild_route" { + return channel + } + attempts, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: clusterID, + ReporterNodeID: channel.SelectedEntryNodeID, + RouteID: channel.PrimaryRouteID, + ServiceClass: channel.ServiceClass, + RebuildRequestID: 
channel.RemediationCommand.CommandID, + Limit: 1, + }) + if err != nil || len(attempts) == 0 { + return channel + } + attempt := attempts[0] + switch attempt.RebuildStatus { + case "requested": + if channel.RemediationExecutionStatus == "pending_degraded_fallback" { + channel.RemediationExecutionStatus = "rebuild_request_recorded_node_pending" + channel.RemediationExecutionReason = firstNonEmptyString(channel.RemediationExecutionReason, attempt.RebuildReason, "durable_rebuild_route_request_recorded_and_node_pending") + } else { + channel.RemediationExecutionStatus = "rebuild_request_recorded" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_request_recorded") + } + case "rejected": + channel.RemediationExecutionStatus = "rebuild_request_rejected" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_request_rejected") + case "applied": + channel.RemediationExecutionStatus = "rebuild_request_applied" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_request_applied") + case "no_alternate": + channel.RemediationExecutionStatus = "rebuild_request_no_alternate" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_no_alternate") + case "deferred_by_policy": + channel.RemediationExecutionStatus = "rebuild_request_deferred_by_policy" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_deferred_by_policy") + case "expired": + channel.RemediationExecutionStatus = "rebuild_request_expired" + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, "durable_rebuild_route_expired") + default: + channel.RemediationExecutionStatus = firstNonEmptyString(attempt.RebuildStatus, channel.RemediationExecutionStatus) + channel.RemediationExecutionReason = firstNonEmptyString(attempt.RebuildReason, 
channel.RemediationExecutionReason) + } + channel.RemediationExecutionGeneration = firstNonEmptyString(attempt.Generation, channel.RemediationExecutionGeneration) + if !attempt.UpdatedAt.IsZero() { + channel.RemediationExecutionObservedAt = attempt.UpdatedAt.UTC().Format(time.RFC3339Nano) + } + return fabricServiceChannelSyncRemediationCommandExecution(channel) +} + +func (s *Service) fabricServiceChannelRemediationCommandsForNode(ctx context.Context, clusterID string, nodeID string, feedback map[string]fabricServiceChannelRouteFeedback, now time.Time) ([]FabricServiceChannelAccessRemediationCommand, error) { + records, err := s.store.ListFabricServiceChannelLeases(ctx, ListFabricServiceChannelLeasesInput{ + ClusterID: clusterID, + EntryNodeID: nodeID, + ServiceClass: FabricServiceClassVPNPackets, + IncludeExpired: false, + Limit: 100, + Now: now.UTC(), + }) + if err != nil { + return nil, err + } + commands := make([]FabricServiceChannelAccessRemediationCommand, 0, len(records)) + for _, record := range records { + summary := fabricServiceChannelLeaseSummaryFromRecord(record, now) + if summary.Expired || strings.TrimSpace(summary.PrimaryRouteID) == "" { + continue + } + channel := FabricServiceChannelAccessTelemetryChannel{ + ChannelID: summary.ChannelID, + ResourceID: summary.ResourceID, + ServiceClass: summary.ServiceClass, + Status: summary.Status, + SelectedEntryNodeID: summary.SelectedEntryNodeID, + SelectedExitNodeID: summary.SelectedExitNodeID, + PrimaryRouteID: summary.PrimaryRouteID, + PrimaryRouteStatus: summary.PrimaryRouteStatus, + ForceBackendFallback: summary.ForceBackendFallback, + ExpiresAt: summary.ExpiresAt, + } + if record.Lease.PoolPolicy != nil { + channel.PoolPolicyFingerprint = record.Lease.PoolPolicy.Fingerprint + } + if item, ok := feedback[channel.PrimaryRouteID]; ok { + observedAt := item.ObservedAt + channel.RouteFeedbackObservedAt = &observedAt + if item.Fenced { + channel.RouteFeedbackStatus = "fenced" + } else if item.ScoreAdjustment 
< 0 { + channel.RouteFeedbackStatus = "degraded" + } else if item.RouteID != "" { + channel.RouteFeedbackStatus = "healthy" + } + channel.RouteFeedbackScoreAdjustment = item.ScoreAdjustment + channel.RouteFeedbackEffectiveScoreAdjustment = item.ScoreAdjustment + channel.RouteFeedbackReasons = append([]string{}, item.Reasons...) + channel.RouteQualityWindowSampleCount = item.QualityWindowSampleCount + channel.RouteQualityWindowFailureCount = item.QualityWindowFailureCount + channel.RouteQualityWindowDropCount = item.QualityWindowDropCount + channel.RouteQualityWindowSlowCount = item.QualityWindowSlowCount + channel.LastSendDurationMs = item.LastSendDurationMs + } + channel = fabricServiceChannelAccessRemediation(channel, record.Lease, now) + if channel.RemediationCommand != nil { + commands = append(commands, *channel.RemediationCommand) + } + } + sort.SliceStable(commands, func(i, j int) bool { + if commands[i].Action != commands[j].Action { + return commands[i].Action < commands[j].Action + } + return commands[i].CommandID < commands[j].CommandID + }) + return commands, nil +} + +func (s *Service) recordFabricServiceChannelRemediationRebuildIntents(ctx context.Context, clusterID string, nodeID string, commands []FabricServiceChannelAccessRemediationCommand, now time.Time) error { + if len(commands) == 0 { + return nil + } + if now.IsZero() { + now = time.Now().UTC() + } + for _, command := range commands { + if command.Action != "rebuild_route" || strings.TrimSpace(command.CommandID) == "" || strings.TrimSpace(command.PrimaryRouteID) == "" { + continue + } + rebuildStatus := "requested" + outcome := "rebuild_requested" + if command.GuardStatus == "rejected" { + rebuildStatus = "rejected" + outcome = "policy_guard_rejected" + } + payload := mustJSONRaw(map[string]any{ + "schema_version": "c18z75.service_channel_remediation_rebuild_intent.v1", + "command_id": command.CommandID, + "channel_id": command.ChannelID, + "resource_id": command.ResourceID, + 
"entry_node_id": command.EntryNodeID, + "exit_node_id": command.ExitNodeID, + "pool_policy_fingerprint": command.PoolPolicyFingerprint, + "guard_status": command.GuardStatus, + "guard_reason": command.GuardReason, + "command_expires_at": command.ExpiresAt.UTC().Format(time.RFC3339Nano), + "recorded_at": now.UTC().Format(time.RFC3339Nano), + }) + _, err := s.store.RecordFabricServiceChannelRouteRebuildAttempt(ctx, RecordFabricServiceChannelRouteRebuildAttemptInput{ + ClusterID: clusterID, + ReporterNodeID: nodeID, + ServiceClass: firstNonEmptyString(command.ServiceClass, FabricServiceClassVPNPackets), + RouteID: command.PrimaryRouteID, + ReplacementRouteID: command.ReplacementRouteID, + RebuildRequestID: command.CommandID, + RebuildStatus: rebuildStatus, + RebuildReason: firstNonEmptyString(command.Reason, command.GuardReason, "service_channel_remediation_rebuild_route_requested"), + DecisionSource: "service_channel_remediation_command", + Outcome: outcome, + Generation: command.ExecutionGeneration, + PolicyFingerprint: command.PoolPolicyFingerprint, + ObservedPolicyFingerprint: command.PoolPolicyFingerprint, + FeedbackReasons: []string{firstNonEmptyString(command.Reason, command.GuardReason, "service_channel_remediation_rebuild_route_requested")}, + OldHops: []string{}, + ReplacementHops: []string{}, + Payload: payload, + }) + if err != nil { + return err + } + } + return nil +} + +func (s *Service) resolveFabricServiceChannelRemediationRebuildIntents(ctx context.Context, input GetNodeSyntheticMeshConfigInput, commands []FabricServiceChannelAccessRemediationCommand, intents []MeshRouteIntent, feedback map[string]fabricServiceChannelRouteFeedback, generation string, now time.Time) ([]RoutePathDecision, error) { + if len(commands) == 0 { + return nil, nil + } + if now.IsZero() { + now = time.Now().UTC() + } + decisions := []RoutePathDecision{} + for _, command := range commands { + if command.Action != "rebuild_route" || strings.TrimSpace(command.CommandID) == "" || 
strings.TrimSpace(command.PrimaryRouteID) == "" { + continue + } + lease, leaseOK, err := s.fabricServiceChannelLeaseForRemediationCommand(ctx, input.ClusterID, input.NodeID, command, now) + if err != nil { + return nil, err + } + status := "no_alternate" + outcome := "no_alternate" + reason := "no_unfenced_alternate_route" + var primary SyntheticMeshRouteConfig + var replacement SyntheticMeshRouteConfig + if command.GuardStatus == "rejected" { + status = "deferred_by_policy" + outcome = "deferred_by_policy" + reason = firstNonEmptyString(command.GuardReason, "remediation_guard_rejected") + } else if !command.ExpiresAt.IsZero() && !command.ExpiresAt.After(now.UTC()) { + status = "expired" + outcome = "expired" + reason = "remediation_command_ttl_expired" + } else if !leaseOK { + status = "deferred_by_policy" + outcome = "deferred_by_policy" + reason = "active_lease_not_found_for_rebuild_resolution" + } else { + var ok bool + primary, ok = s.syntheticRouteByID(input, intents, command.PrimaryRouteID) + if !ok { + reason = "primary_route_not_available_for_rebuild" + } else if selected, _, ok := s.selectServiceChannelRouteReplacement(input, primary, intents, feedback); ok { + if guardStatus, guardReason := fabricServiceChannelRouteAllowedByLeasePool(lease, FabricServiceChannelRoute{ + RouteID: selected.RouteID, + ClusterID: selected.ClusterID, + ServiceClass: firstNonEmptyString(command.ServiceClass, FabricServiceClassVPNPackets), + SourceNodeID: selected.SourceNodeID, + DestinationNodeID: selected.DestinationNodeID, + Status: "authorized", + }); guardStatus != "allowed" { + status = "deferred_by_policy" + outcome = "deferred_by_policy" + reason = guardReason + } else { + replacement = selected + status = "applied" + outcome = "replacement_selected" + reason = "remediation_rebuild_applied_to_alternate" + } + } + } + feedbackItem := feedback[command.PrimaryRouteID] + feedbackStatus := "" + if feedbackItem.Fenced { + feedbackStatus = "fenced" + } else if 
feedbackItem.ScoreAdjustment < 0 { + feedbackStatus = "degraded" + } else if feedbackItem.RouteID != "" { + feedbackStatus = "healthy" + } + payload := mustJSONRaw(map[string]any{ + "schema_version": "c18z77.service_channel_remediation_rebuild_resolution.v1", + "command_id": command.CommandID, + "channel_id": command.ChannelID, + "resource_id": command.ResourceID, + "entry_node_id": command.EntryNodeID, + "exit_node_id": command.ExitNodeID, + "pool_policy_fingerprint": command.PoolPolicyFingerprint, + "guard_status": command.GuardStatus, + "guard_reason": command.GuardReason, + "resolution_status": status, + "resolution_outcome": outcome, + "resolution_reason": reason, + "resolved_at": now.UTC().Format(time.RFC3339Nano), + }) + _, err = s.store.RecordFabricServiceChannelRouteRebuildAttempt(ctx, RecordFabricServiceChannelRouteRebuildAttemptInput{ + ClusterID: input.ClusterID, + ReporterNodeID: input.NodeID, + ServiceClass: firstNonEmptyString(command.ServiceClass, FabricServiceClassVPNPackets), + RouteID: command.PrimaryRouteID, + ReplacementRouteID: replacement.RouteID, + RebuildRequestID: command.CommandID, + RebuildStatus: status, + RebuildReason: reason, + DecisionSource: "service_channel_remediation_command", + Outcome: outcome, + Generation: firstNonEmptyString(generation, command.ExecutionGeneration, command.CommandID), + PolicyFingerprint: command.PoolPolicyFingerprint, + ObservedPolicyFingerprint: command.PoolPolicyFingerprint, + FeedbackStatus: feedbackStatus, + FeedbackScoreAdjustment: feedbackItem.ScoreAdjustment, + FeedbackEffectiveScoreAdjustment: feedbackItem.ScoreAdjustment, + FeedbackReasons: append([]string{reason}, feedbackItem.Reasons...), + LastError: feedbackItem.LastError, + ConsecutiveFailures: feedbackItem.ConsecutiveFailures, + StallCount: feedbackItem.StallCount, + LastSendDurationMs: feedbackItem.LastSendDurationMs, + QualityWindowSampleCount: feedbackItem.QualityWindowSampleCount, + QualityWindowFailureCount: 
feedbackItem.QualityWindowFailureCount, + QualityWindowDropCount: feedbackItem.QualityWindowDropCount, + QualityWindowSlowCount: feedbackItem.QualityWindowSlowCount, + OldHops: append([]string{}, primary.Hops...), + ReplacementHops: append([]string{}, replacement.Hops...), + Payload: payload, + }) + if err != nil { + return nil, err + } + if status != "applied" { + continue + } + decision := RoutePathDecision{ + DecisionID: command.PrimaryRouteID + "-path-" + input.NodeID + "-service-channel-remediation", + RouteID: command.PrimaryRouteID, + ReplacementRouteID: replacement.RouteID, + RebuildRequestID: command.CommandID, + RebuildStatus: "applied", + RebuildReason: reason, + ClusterID: input.ClusterID, + LocalNodeID: input.NodeID, + SourceNodeID: primary.SourceNodeID, + DestinationNodeID: primary.DestinationNodeID, + OriginalHops: append([]string{}, primary.Hops...), + EffectiveHops: append([]string{}, replacement.Hops...), + DecisionSource: "service_channel_remediation_command", + Generation: firstNonEmptyString(generation, command.CommandID), + PathScore: serviceChannelReplacementRouteScore(replacement), + ScoreReasons: []string{"service_channel_remediation_rebuild_route", "selected_unfenced_alternate_route", "service_channel_rebuild_applied"}, + ControlPlaneOnly: true, + ProductionForwarding: false, + ExpiresAt: minNonZeroTime(primary.ExpiresAt, replacement.ExpiresAt, command.ExpiresAt, now.Add(60*time.Second)).UTC(), + } + decision.PreviousHopID, decision.NextHopID, decision.LocalRole = routePathLocalPosition(decision.EffectiveHops, input.NodeID, "", "") + decisions = append(decisions, decision) + } + return decisions, nil +} + +func (s *Service) fabricServiceChannelLeaseForRemediationCommand(ctx context.Context, clusterID string, nodeID string, command FabricServiceChannelAccessRemediationCommand, now time.Time) (FabricServiceChannelLease, bool, error) { + records, err := s.store.ListFabricServiceChannelLeases(ctx, ListFabricServiceChannelLeasesInput{ + 
ClusterID: clusterID, + ServiceClass: firstNonEmptyString(command.ServiceClass, FabricServiceClassVPNPackets), + EntryNodeID: nodeID, + ResourceID: command.ResourceID, + IncludeExpired: false, + Limit: 100, + Now: now.UTC(), + }) + if err != nil { + return FabricServiceChannelLease{}, false, err + } + for _, record := range records { + if strings.TrimSpace(record.ChannelID) == strings.TrimSpace(command.ChannelID) { + return record.Lease, true, nil + } + } + return FabricServiceChannelLease{}, false, nil +} + +func (s *Service) syntheticRouteByID(input GetNodeSyntheticMeshConfigInput, intents []MeshRouteIntent, routeID string) (SyntheticMeshRouteConfig, bool) { + routeID = strings.TrimSpace(routeID) + if routeID == "" { + return SyntheticMeshRouteConfig{}, false + } + for _, intent := range intents { + route, _, _, _, _, ok := s.syntheticRouteFromIntent(input, intent) + if ok && route.RouteID == routeID { + return route, true + } + } + return SyntheticMeshRouteConfig{}, false +} + +func minNonZeroTime(items ...time.Time) time.Time { + var out time.Time + for _, item := range items { + if item.IsZero() { + continue + } + if out.IsZero() || item.Before(out) { + out = item + } + } + return out +} + +func fabricServiceChannelFirstAuthorizedAlternate(routes []FabricServiceChannelRoute, primaryRouteID string) (FabricServiceChannelRoute, bool) { + for _, route := range routes { + if strings.TrimSpace(route.RouteID) == "" || route.RouteID == primaryRouteID { + continue + } + if route.Status == "authorized" { + return route, true + } + } + return FabricServiceChannelRoute{}, false +} + +func fabricServiceChannelRouteAllowedByLeasePool(lease FabricServiceChannelLease, route FabricServiceChannelRoute) (string, string) { + if strings.TrimSpace(route.RouteID) == "" { + return "rejected", "replacement_route_missing" + } + entryAllowed := len(lease.EntryPool) == 0 + for _, candidate := range lease.EntryPool { + if candidate.NodeID == route.SourceNodeID { + entryAllowed = true + 
break + } + } + if !entryAllowed { + return "rejected", "replacement_entry_outside_signed_pool_policy" + } + exitAllowed := len(lease.ExitPool) == 0 + for _, candidate := range lease.ExitPool { + if candidate.NodeID == route.DestinationNodeID { + exitAllowed = true + break + } + } + if !exitAllowed { + return "rejected", "replacement_exit_outside_signed_pool_policy" + } + return "allowed", "lease_pool_policy_allows_route" +} + +func fabricServiceChannelLeaseSummaryFromRecord(record FabricServiceChannelLeaseRecord, now time.Time) FabricServiceChannelLeaseSummary { + if now.IsZero() { + now = time.Now().UTC() + } + lease := record.Lease + summary := FabricServiceChannelLeaseSummary{ + ClusterID: record.ClusterID, + ChannelID: record.ChannelID, + ResourceID: firstNonEmptyString(record.ResourceID, lease.ResourceID), + ServiceClass: firstNonEmptyString(record.ServiceClass, lease.ServiceClass), + Status: lease.Status, + SelectedEntryNodeID: firstNonEmptyString(record.SelectedEntryNodeID, lease.SelectedEntryNodeID), + SelectedExitNodeID: lease.SelectedExitNodeID, + AllowedChannels: append([]string{}, lease.AllowedChannels...), + PrimaryRouteID: strings.TrimSpace(lease.PrimaryRoute.RouteID), + PrimaryRouteStatus: strings.TrimSpace(lease.PrimaryRoute.Status), + DataPlane: lease.DataPlane, + ForceBackendFallback: lease.Status == FabricServiceChannelStatusDegradedFallback || lease.PrimaryRoute.Status == "missing_route_intent", + IssuedAt: lease.IssuedAt, + ExpiresAt: record.ExpiresAt, + CreatedAt: record.CreatedAt, + UpdatedAt: record.UpdatedAt, + } + if summary.ExpiresAt.IsZero() { + summary.ExpiresAt = lease.ExpiresAt + } + summary.Expired = !summary.ExpiresAt.IsZero() && !summary.ExpiresAt.After(now.UTC()) + return summary +} + +func fabricServiceChannelLeaseCacheKey(clusterID string, channelID string) string { + return strings.TrimSpace(clusterID) + "/" + strings.TrimSpace(channelID) +} + +func (s *Service) signFabricServiceChannelLease(ctx context.Context, lease 
FabricServiceChannelLease) (FabricServiceChannelLease, error) { + authorityKey, err := s.ensureClusterAuthority(ctx, lease.ClusterID, nil) + if err != nil { + return lease, err + } + payload := FabricServiceChannelLeaseAuthorityPayload{ + SchemaVersion: "rap.fabric_service_channel_lease_authority.v1", + ChannelID: lease.ChannelID, + ClusterID: lease.ClusterID, + OrganizationID: lease.OrganizationID, + UserID: lease.UserID, + ResourceID: lease.ResourceID, + ServiceClass: lease.ServiceClass, + Status: lease.Status, + SelectedEntryNodeID: lease.SelectedEntryNodeID, + SelectedExitNodeID: lease.SelectedExitNodeID, + EntryPool: append([]FabricServiceChannelNodeCandidate{}, lease.EntryPool...), + ExitPool: append([]FabricServiceChannelNodeCandidate{}, lease.ExitPool...), + AllowedChannels: append([]string{}, lease.AllowedChannels...), + PrimaryRoute: lease.PrimaryRoute, + RecoveryPolicy: lease.RecoveryPolicy, + PoolPolicy: lease.PoolPolicy, + DataPlane: lease.DataPlane, + RouteGeneration: lease.RouteGeneration, + FencingEpoch: lease.FencingEpoch, + TokenHash: fabricServiceChannelTokenHash(lease.Token.Token), + IssuedAt: lease.IssuedAt, + ExpiresAt: lease.ExpiresAt, + } + rawPayload, signature, err := clusterauth.SignPayload(authorityKey.PrivateKey, payload, s.now()) + if err != nil { + return lease, err + } + lease.AuthorityPayload = rawPayload + lease.AuthoritySignature = &signature + return lease, nil +} + +func fabricServiceChannelTokenHash(token string) string { + sum := sha256.Sum256([]byte(strings.TrimSpace(token))) + return hex.EncodeToString(sum[:]) +} + +func normalizeFabricServiceClass(value string) string { + return strings.TrimSpace(strings.ToLower(value)) +} + +func isAllowedFabricServiceClass(value string) bool { + switch value { + case FabricServiceClassVPNPackets, + FabricServiceClassRemoteWorkspace, + FabricServiceClassFileTransfer, + FabricServiceClassVideo: + return true + default: + return false + } +} + +func normalizeFabricServiceChannels(channels 
[]string, serviceClass string) []string { + channels = dedupeStrings(channels) + if len(channels) > 0 { + return channels + } + switch serviceClass { + case FabricServiceClassVPNPackets: + return []string{FabricChannelControl, FabricChannelBulk, "vpn_packet"} + case FabricServiceClassRemoteWorkspace: + return []string{FabricChannelControl, FabricChannelInteractive, FabricChannelReliable, FabricChannelDroppable} + case FabricServiceClassVideo: + return []string{FabricChannelControl, FabricChannelInteractive, FabricChannelDroppable} + case FabricServiceClassFileTransfer: + return []string{FabricChannelControl, FabricChannelReliable, FabricChannelBulk} + default: + return []string{FabricChannelControl, FabricChannelReliable} + } +} + +func normalizeFabricRequiredRoles(roles []string, serviceClass string) []string { + roles = dedupeStrings(roles) + if len(roles) > 0 { + return roles + } + switch serviceClass { + case FabricServiceClassVPNPackets: + return []string{"entry-node", "vpn-exit"} + case FabricServiceClassRemoteWorkspace: + return []string{"entry-node", "rdp-worker"} + case FabricServiceClassVideo: + return []string{"entry-node", "video-relay"} + case FabricServiceClassFileTransfer: + return []string{"entry-node", "file-storage-cache"} + default: + return []string{"entry-node"} + } +} + +func selectFabricServiceChannelPreferredNode(nodeIDs []string, preferred string) string { + preferred = strings.TrimSpace(preferred) + if preferred != "" && containsString(nodeIDs, preferred) { + return preferred + } + if len(nodeIDs) == 0 { + return "" + } + return strings.TrimSpace(nodeIDs[0]) +} + +func fabricServiceChannelEffectivePool(requested []string, policy []string) []string { + requested = dedupeStrings(requested) + policy = dedupeStrings(policy) + if len(policy) == 0 { + return requested + } + if len(requested) == 0 { + return policy + } + out := []string{} + for _, nodeID := range requested { + if containsString(policy, nodeID) { + out = append(out, nodeID) + } + 
} + return dedupeStrings(out) +} + +func fabricServiceFailoverFromPoolPolicy(policy FabricServiceChannelPoolPolicy) string { + policy = normalizeFabricServiceChannelPoolPolicy(policy, defaultFabricServiceChannelPoolPolicy()) + raw, err := json.Marshal(map[string]any{ + "route_rebuild": policy.RouteRebuild, + "entry_failover": policy.EntryFailover, + "exit_failover": policy.ExitFailover, + "sticky_session": policy.StickySession, + "backend_fallback_allowed": policy.BackendFallbackAllowed, + "selection_strategy": policy.SelectionStrategy, + "pool_policy_fingerprint": policy.Fingerprint, + }) + if err != nil { + return defaultFabricServiceFailover() + } + return string(raw) +} + +func fabricServiceChannelNodePool(nodeIDs []string, role string, selected string) []FabricServiceChannelNodeCandidate { + out := make([]FabricServiceChannelNodeCandidate, 0, len(nodeIDs)) + for index, nodeID := range nodeIDs { + status := "candidate" + if nodeID == selected { + status = "selected" + } + out = append(out, FabricServiceChannelNodeCandidate{ + NodeID: nodeID, + Role: role, + Priority: index + 1, + Status: status, + Metadata: json.RawMessage(`{}`), + }) + } + return out +} + +type fabricServiceChannelRouteFeedback struct { + RouteID string + ObservationID string + Source string + ChannelID string + ResourceID string + ViolationStatus string + ViolationReason string + Fenced bool + ManualRetry bool + StalePolicy bool + StaleGeneration bool + ProvenanceMissing bool + StaleReason string + ScoreAdjustment int + Reasons []string + LastError string + ConsecutiveFailures int + StallCount int + LastSendDurationMs int64 + DegradedFallbackRecommended bool + RouteRebuildRecommended bool + QualityWindowSampleCount int + QualityWindowSuccessCount int + QualityWindowFailureCount int + QualityWindowSlowCount int + QualityWindowDropCount int + ObservedAt time.Time + ExpiresAt time.Time + RetryCooldownUntil *time.Time +} + +type fabricServiceChannelRouteProvenance struct { + RouteID string + 
RouteVersion string + PolicyVersion string + RouteGeneration string +} + +func fabricServiceChannelRouteProvenanceFromIntents(intents []MeshRouteIntent) map[string]fabricServiceChannelRouteProvenance { + out := map[string]fabricServiceChannelRouteProvenance{} + for _, intent := range intents { + if strings.TrimSpace(intent.ID) == "" { + continue + } + var policy syntheticRoutePolicy + _ = json.Unmarshal(intent.Policy, &policy) + routeVersion := strings.TrimSpace(policy.RouteVersion) + if routeVersion == "" { + routeVersion = intent.UpdatedAt.UTC().Format(time.RFC3339) + } + policyVersion := strings.TrimSpace(policy.PolicyVersion) + if policyVersion == "" { + policyVersion = routeVersion + } + out[intent.ID] = fabricServiceChannelRouteProvenance{ + RouteID: intent.ID, + RouteVersion: routeVersion, + PolicyVersion: policyVersion, + RouteGeneration: policyVersion, + } + } + return out +} + +func (s *Service) fabricServiceChannelRouteFeedback(ctx context.Context, clusterID string, entryNodeIDs []string, now time.Time, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) (map[string]fabricServiceChannelRouteFeedback, error) { + out := map[string]fabricServiceChannelRouteFeedback{} + policy = normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + for _, nodeID := range dedupeStrings(entryNodeIDs) { + if strings.TrimSpace(nodeID) == "" { + continue + } + observations, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: clusterID, + ReporterNodeID: nodeID, + ServiceClass: FabricServiceClassVPNPackets, + Now: now, + }) + if err != nil { + return nil, err + } + mergeFabricServiceChannelRouteFeedback(out, fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(observations, now, policy, routeProvenance)) + expiredObservations, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, 
ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: clusterID, + ReporterNodeID: nodeID, + ServiceClass: FabricServiceClassVPNPackets, + IncludeExpired: true, + Now: now, + }) + if err != nil { + return nil, err + } + mergeFabricServiceChannelRouteFeedback(out, fabricServiceChannelManualRetryFeedbackFromObservationsWithProvenance(expiredObservations, now, policy, routeProvenance)) + if len(observations) > 0 { + continue + } + heartbeats, err := s.store.ListNodeHeartbeats(ctx, clusterID, nodeID, 1) + if err != nil { + return nil, err + } + if len(heartbeats) == 0 || now.Sub(heartbeats[0].ObservedAt.UTC()) > fabricServiceChannelFeedbackMaxAge { + continue + } + mergeFabricServiceChannelRouteFeedback(out, fabricServiceChannelRouteFeedbackFromHeartbeatWithProvenance(heartbeats[0], now, policy, routeProvenance)) + } + return out, nil +} + +func (s *Service) fabricServiceChannelRecoveryPolicy(ctx context.Context, clusterID string) FabricServiceChannelRecoveryPolicy { + cluster, err := s.store.GetCluster(ctx, strings.TrimSpace(clusterID)) + if err != nil { + return defaultFabricServiceChannelRecoveryPolicy() + } + return fabricServiceChannelRecoveryPolicyFromCluster(cluster) +} + +func (s *Service) recordFabricServiceChannelRouteFeedback(ctx context.Context, heartbeat NodeHeartbeat) error { + if strings.TrimSpace(heartbeat.ClusterID) == "" || strings.TrimSpace(heartbeat.NodeID) == "" { + return nil + } + observedAt := heartbeat.ObservedAt.UTC() + if observedAt.IsZero() { + observedAt = s.now().UTC() + } + expiresAt := observedAt.Add(fabricServiceChannelFeedbackMaxAge) + for _, input := range fabricServiceChannelRouteFeedbackInputsFromHeartbeat(heartbeat, FabricServiceClassVPNPackets, expiresAt) { + if _, err := s.store.RecordFabricServiceChannelRouteFeedback(ctx, input); err != nil { + return err + } + } + for _, input := range s.fabricServiceChannelRouteFeedbackInputsFromAccessReport(ctx, heartbeat, FabricServiceClassVPNPackets, expiresAt) { + if _, err := 
s.store.RecordFabricServiceChannelRouteFeedback(ctx, input); err != nil { + return err + } + } + return nil +} + +func (s *Service) fabricServiceChannelRouteFeedbackInputsFromAccessReport(ctx context.Context, heartbeat NodeHeartbeat, serviceClass string, expiresAt time.Time) []RecordFabricServiceChannelRouteFeedbackInput { + if len(heartbeat.Metadata) == 0 || !json.Valid(heartbeat.Metadata) { + return nil + } + report := jsonMapPath(jsonObject(heartbeat.Metadata), "fabric_service_channel_access_report") + if len(report) == 0 { + return nil + } + if jsonInt(report, "fabric_route_send_failure") <= 0 { + return nil + } + status := jsonString(report, "last_data_plane_violation_status") + if status != "fabric_route_send_failed_backend_fallback_blocked" { + return nil + } + observedAt := heartbeat.ObservedAt.UTC() + if observedAt.IsZero() { + observedAt = s.now().UTC() + } + records, err := s.store.ListFabricServiceChannelLeases(ctx, ListFabricServiceChannelLeasesInput{ + ClusterID: heartbeat.ClusterID, + EntryNodeID: heartbeat.NodeID, + ServiceClass: serviceClass, + IncludeExpired: false, + Limit: 100, + Now: observedAt, + }) + if err != nil || len(records) == 0 { + return nil + } + reason := firstNonEmptyString(jsonString(report, "last_data_plane_violation_reason"), "fabric_route_send_failed_backend_fallback_blocked") + out := make([]RecordFabricServiceChannelRouteFeedbackInput, 0, len(records)) + for _, record := range records { + summary := fabricServiceChannelLeaseSummaryFromRecord(record, observedAt) + routeID := strings.TrimSpace(summary.PrimaryRouteID) + if summary.Expired || routeID == "" || summary.ForceBackendFallback { + continue + } + if s.fabricServiceChannelHasActiveAccessReportRouteFeedback(ctx, heartbeat.ClusterID, heartbeat.NodeID, routeID, serviceClass, observedAt) { + continue + } + out = append(out, RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: heartbeat.ClusterID, + ReporterNodeID: heartbeat.NodeID, + RouteID: routeID, + 
ServiceClass: serviceClass, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended", "data_plane_fabric_route_send_failed", "backend_fallback_blocked_by_policy"}, + LastError: reason, + ConsecutiveFailures: maxInt(1, jsonInt(report, "fabric_route_send_failure")), + Payload: mustJSONRaw(map[string]any{ + "source": "fabric_service_channel_access_report", + "channel_id": summary.ChannelID, + "resource_id": summary.ResourceID, + "last_data_plane_violation_status": status, + "last_data_plane_violation_reason": reason, + "backend_fallback_blocked": jsonInt(report, "backend_fallback_blocked"), + "fabric_route_send_failure": jsonInt(report, "fabric_route_send_failure"), + "last_backend_relay_policy": jsonString(report, "last_backend_relay_policy"), + "last_working_data_transport": jsonString(report, "last_working_data_transport"), + "last_steady_state_transport": jsonString(report, "last_steady_state_transport"), + }), + ObservedAt: observedAt, + ExpiresAt: expiresAt, + }) + } + return out +} + +func (s *Service) fabricServiceChannelHasActiveAccessReportRouteFeedback(ctx context.Context, clusterID, reporterNodeID, routeID, serviceClass string, observedAt time.Time) bool { + observations, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: clusterID, + ReporterNodeID: reporterNodeID, + RouteID: routeID, + ServiceClass: serviceClass, + IncludeExpired: false, + Now: observedAt, + }) + if err != nil { + return false + } + for _, observation := range observations { + if observation.FeedbackStatus != "fenced" && observation.FeedbackStatus != "degraded" { + continue + } + if containsString(observation.Reasons, "data_plane_fabric_route_send_failed") || + jsonString(jsonObject(observation.Payload), "source") == "fabric_service_channel_access_report" { + return true + } + } + return false +} + +type fabricServiceChannelRuntimeHeartbeat struct { + SchemaVersion 
string `json:"schema_version"` + ConfigVersion string `json:"config_version"` + Ingress struct { + FlowScheduler struct { + ChannelStats map[string]fabricServiceChannelRuntimeChannelStat `json:"channel_stats"` + } `json:"flow_scheduler"` + } `json:"ingress"` +} + +type fabricServiceChannelRuntimeChannelStat struct { + LastRouteID string `json:"last_route_id"` + RoutePolicyVersion string `json:"route_policy_version,omitempty"` + RouteGeneration string `json:"route_generation,omitempty"` + RecoveryPolicyFingerprint string `json:"recovery_policy_fingerprint,omitempty"` + LastFailedRouteID string `json:"last_failed_route_id"` + LastFailedRoutePolicyVersion string `json:"last_failed_route_policy_version,omitempty"` + LastFailedRouteGeneration string `json:"last_failed_route_generation,omitempty"` + LastError string `json:"last_error"` + ConsecutiveFailures int `json:"consecutive_failures"` + StallCount int `json:"stall_count"` + LastSendDurationMillis int64 `json:"last_send_duration_ms"` + RouteRebuildRecommended bool `json:"route_rebuild_recommended"` + DegradedFallbackRecommended bool `json:"degraded_fallback_recommended"` + QualityWindowSampleCount int `json:"quality_window_sample_count"` + QualityWindowSuccessCount int `json:"quality_window_success_count"` + QualityWindowFailureCount int `json:"quality_window_failure_count"` + QualityWindowSlowCount int `json:"quality_window_slow_count"` + QualityWindowDropCount int `json:"quality_window_drop_count"` + QualityWindowAvgLatencyMs int64 `json:"quality_window_avg_latency_ms"` + QualityWindowLastUpdatedAt string `json:"quality_window_last_updated_at"` +} + +func fabricServiceChannelRouteFeedbackFromHeartbeat(heartbeat NodeHeartbeat, now time.Time) map[string]fabricServiceChannelRouteFeedback { + return fabricServiceChannelRouteFeedbackFromHeartbeatWithProvenance(heartbeat, now, defaultFabricServiceChannelRecoveryPolicy(), nil) +} + +func fabricServiceChannelRouteFeedbackFromHeartbeatWithProvenance(heartbeat 
NodeHeartbeat, now time.Time, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) map[string]fabricServiceChannelRouteFeedback { + out := map[string]fabricServiceChannelRouteFeedback{} + for _, input := range fabricServiceChannelRouteFeedbackInputsFromHeartbeat(heartbeat, FabricServiceClassVPNPackets, now.Add(fabricServiceChannelFeedbackMaxAge)) { + observation := fabricServiceChannelAnnotateFeedbackProvenance(FabricServiceChannelRouteFeedbackObservation{ + ClusterID: input.ClusterID, + ReporterNodeID: input.ReporterNodeID, + RouteID: input.RouteID, + ServiceClass: input.ServiceClass, + FeedbackStatus: input.FeedbackStatus, + ScoreAdjustment: input.ScoreAdjustment, + Reasons: append([]string{}, input.Reasons...), + LastError: input.LastError, + ConsecutiveFailures: input.ConsecutiveFailures, + StallCount: input.StallCount, + LastSendDurationMs: input.LastSendDurationMs, + Payload: input.Payload, + ObservedAt: input.ObservedAt, + ExpiresAt: input.ExpiresAt, + }, policy, routeProvenance) + scoreAdjustment := input.ScoreAdjustment + fenced := input.FeedbackStatus == "fenced" + routeRebuildRecommended := containsString(input.Reasons, "service_channel_route_rebuild_recommended") + degradedFallbackRecommended := containsString(input.Reasons, "service_channel_degraded_fallback_recommended") + if observation.StalePolicy || observation.StaleGeneration { + scoreAdjustment = fabricServiceChannelConservativeStaleScore(scoreAdjustment) + fenced = false + routeRebuildRecommended = false + degradedFallbackRecommended = false + } + item := fabricServiceChannelRouteFeedback{ + RouteID: input.RouteID, + Fenced: fenced, + StalePolicy: observation.StalePolicy, + StaleGeneration: observation.StaleGeneration, + ProvenanceMissing: observation.ProvenanceMissing, + StaleReason: observation.StaleReason, + ScoreAdjustment: scoreAdjustment, + Reasons: observation.Reasons, + LastError: input.LastError, + ConsecutiveFailures: 
input.ConsecutiveFailures, + StallCount: input.StallCount, + LastSendDurationMs: input.LastSendDurationMs, + DegradedFallbackRecommended: degradedFallbackRecommended, + RouteRebuildRecommended: routeRebuildRecommended, + QualityWindowSampleCount: fabricServiceChannelFeedbackPayloadInt(input.Payload, "quality_window_sample_count"), + QualityWindowSuccessCount: fabricServiceChannelFeedbackPayloadInt(input.Payload, "quality_window_success_count"), + QualityWindowFailureCount: fabricServiceChannelFeedbackPayloadInt(input.Payload, "quality_window_failure_count"), + QualityWindowSlowCount: fabricServiceChannelFeedbackPayloadInt(input.Payload, "quality_window_slow_count"), + QualityWindowDropCount: fabricServiceChannelFeedbackPayloadInt(input.Payload, "quality_window_drop_count"), + ObservedAt: input.ObservedAt, + } + out[input.RouteID] = item + } + return out +} + +func fabricServiceChannelRouteFeedbackInputsFromHeartbeat(heartbeat NodeHeartbeat, serviceClass string, expiresAt time.Time) []RecordFabricServiceChannelRouteFeedbackInput { + if len(heartbeat.Metadata) == 0 || !json.Valid(heartbeat.Metadata) { + return nil + } + var metadata struct { + Report fabricServiceChannelRuntimeHeartbeat `json:"fabric_service_channel_runtime_report"` + } + if err := json.Unmarshal(heartbeat.Metadata, &metadata); err != nil { + return nil + } + if metadata.Report.SchemaVersion == "" || len(metadata.Report.Ingress.FlowScheduler.ChannelStats) == 0 { + return nil + } + observedAt := heartbeat.ObservedAt.UTC() + if observedAt.IsZero() { + observedAt = time.Now().UTC() + } + var out []RecordFabricServiceChannelRouteFeedbackInput + for _, stat := range metadata.Report.Ingress.FlowScheduler.ChannelStats { + failedRouteID := strings.TrimSpace(stat.LastFailedRouteID) + rollingFailureCount := fabricServiceChannelRollingFailureCount(stat) + rollingStallCount := fabricServiceChannelRollingStallCount(stat) + rollingLatencyMs := fabricServiceChannelRollingLatencyMs(stat) + rollingWindowActive := 
stat.QualityWindowSampleCount > 0 + freshFailureActive := failedRouteID != "" && (!rollingWindowActive || rollingFailureCount > 0) + if freshFailureActive { + scoreAdjustment := -30 + reasons := []string{"service_channel_recent_route_failure"} + if rollingWindowActive { + reasons = append(reasons, "service_channel_rolling_quality_window") + } + status := "degraded" + if stat.RouteRebuildRecommended || stat.DegradedFallbackRecommended || rollingFailureCount >= 2 { + status = "fenced" + scoreAdjustment -= 1000 + reasons = append(reasons, "service_channel_route_rebuild_recommended") + if stat.DegradedFallbackRecommended { + reasons = append(reasons, "service_channel_degraded_fallback_recommended") + } + } + out = append(out, RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: heartbeat.ClusterID, + ReporterNodeID: heartbeat.NodeID, + RouteID: failedRouteID, + ServiceClass: serviceClass, + FeedbackStatus: status, + ScoreAdjustment: scoreAdjustment, + Reasons: dedupeStrings(reasons), + LastError: strings.TrimSpace(stat.LastError), + ConsecutiveFailures: rollingFailureCount, + StallCount: rollingStallCount, + LastSendDurationMs: rollingLatencyMs, + Payload: fabricServiceChannelFeedbackPayload(stat, metadata.Report.ConfigVersion), + ObservedAt: observedAt, + ExpiresAt: expiresAt, + }) + } + successRouteID := strings.TrimSpace(stat.LastRouteID) + if successRouteID != "" && (!freshFailureActive || successRouteID != failedRouteID) && fabricServiceChannelStatHasFreshSuccess(stat) { + qualityAdjustment, qualityReasons := fabricServiceChannelRouteQualityScore(rollingLatencyMs, rollingFailureCount, rollingStallCount) + reasons := append([]string{"service_channel_recent_success"}, qualityReasons...) 
+ if rollingWindowActive { + reasons = append(reasons, "service_channel_rolling_quality_window") + } + out = append(out, RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: heartbeat.ClusterID, + ReporterNodeID: heartbeat.NodeID, + RouteID: successRouteID, + ServiceClass: serviceClass, + FeedbackStatus: "healthy", + ScoreAdjustment: 10 + qualityAdjustment, + Reasons: dedupeStrings(reasons), + ConsecutiveFailures: rollingFailureCount, + StallCount: rollingStallCount, + LastSendDurationMs: rollingLatencyMs, + Payload: fabricServiceChannelFeedbackPayload(stat, metadata.Report.ConfigVersion), + ObservedAt: observedAt, + ExpiresAt: expiresAt, + }) + } + } + return out +} + +func fabricServiceChannelFeedbackPayload(stat fabricServiceChannelRuntimeChannelStat, configVersion string) json.RawMessage { + payload := map[string]any{} + rawStat, err := json.Marshal(stat) + if err == nil { + _ = json.Unmarshal(rawStat, &payload) + } + if strings.TrimSpace(configVersion) != "" { + payload["observed_config_version"] = strings.TrimSpace(configVersion) + } + raw, err := json.Marshal(payload) + if err != nil { + return json.RawMessage(`{}`) + } + return raw +} + +func fabricServiceChannelStatHasFreshSuccess(stat fabricServiceChannelRuntimeChannelStat) bool { + if stat.QualityWindowSampleCount <= 0 { + return !stat.RouteRebuildRecommended && !stat.DegradedFallbackRecommended + } + return stat.QualityWindowSuccessCount > 0 && stat.QualityWindowFailureCount == 0 && stat.QualityWindowDropCount == 0 +} + +func fabricServiceChannelFlowSchedulerFromHeartbeat(heartbeat NodeHeartbeat) map[string]any { + if len(heartbeat.Metadata) == 0 || !json.Valid(heartbeat.Metadata) { + return map[string]any{} + } + metadata := jsonObject(heartbeat.Metadata) + return jsonMapPath(metadata, "fabric_service_channel_runtime_report", "ingress", "flow_scheduler") +} + +func fabricServiceChannelRollingFailureCount(stat fabricServiceChannelRuntimeChannelStat) int { + if stat.QualityWindowSampleCount <= 0 { 
+ return stat.ConsecutiveFailures + } + return stat.QualityWindowFailureCount + stat.QualityWindowDropCount +} + +func fabricServiceChannelRollingStallCount(stat fabricServiceChannelRuntimeChannelStat) int { + if stat.QualityWindowSampleCount <= 0 { + return stat.StallCount + } + return stat.QualityWindowSlowCount +} + +func fabricServiceChannelRollingLatencyMs(stat fabricServiceChannelRuntimeChannelStat) int64 { + if stat.QualityWindowSampleCount > 0 && stat.QualityWindowAvgLatencyMs > 0 { + return stat.QualityWindowAvgLatencyMs + } + return stat.LastSendDurationMillis +} + +func fabricServiceChannelRouteQualityScore(lastSendDurationMs int64, consecutiveFailures int, stallCount int) (int, []string) { + score := 0 + reasons := []string{} + switch { + case lastSendDurationMs <= 0: + case lastSendDurationMs <= 10: + score += 80 + reasons = append(reasons, "service_channel_quality_latency_le_10ms") + case lastSendDurationMs <= 25: + score += 60 + reasons = append(reasons, "service_channel_quality_latency_le_25ms") + case lastSendDurationMs <= 50: + score += 40 + reasons = append(reasons, "service_channel_quality_latency_le_50ms") + case lastSendDurationMs <= 100: + score += 20 + reasons = append(reasons, "service_channel_quality_latency_le_100ms") + case lastSendDurationMs <= 250: + score += 5 + reasons = append(reasons, "service_channel_quality_latency_le_250ms") + case lastSendDurationMs <= 500: + score -= 10 + reasons = append(reasons, "service_channel_quality_latency_slow") + case lastSendDurationMs <= 1000: + score -= 30 + reasons = append(reasons, "service_channel_quality_latency_very_slow") + default: + score -= 60 + reasons = append(reasons, "service_channel_quality_latency_unhealthy") + } + if consecutiveFailures > 0 { + penalty := consecutiveFailures * 20 + if penalty > 100 { + penalty = 100 + } + score -= penalty + reasons = append(reasons, "service_channel_quality_recent_failures") + } + if stallCount > 0 { + penalty := stallCount * 5 + if penalty > 50 { + 
penalty = 50 + } + score -= penalty + reasons = append(reasons, "service_channel_quality_recent_stalls") + } + return score, dedupeStrings(reasons) +} + +func fabricServiceChannelRetryCooldownUntil(payload json.RawMessage) *time.Time { + if len(payload) == 0 || !json.Valid(payload) { + return nil + } + var raw map[string]any + if err := json.Unmarshal(payload, &raw); err != nil { + return nil + } + value, ok := raw["operator_retry_cooldown_until"].(string) + if !ok || strings.TrimSpace(value) == "" { + return nil + } + parsed, err := time.Parse(time.RFC3339Nano, strings.TrimSpace(value)) + if err != nil { + return nil + } + parsed = parsed.UTC() + return &parsed +} + +func fabricServiceChannelFeedbackPayloadBool(payload json.RawMessage, key string) bool { + if len(payload) == 0 || !json.Valid(payload) { + return false + } + var raw map[string]any + if err := json.Unmarshal(payload, &raw); err != nil { + return false + } + value, ok := raw[key].(bool) + return ok && value +} + +func fabricServiceChannelFeedbackPayloadInt(payload json.RawMessage, key string) int { + if len(payload) == 0 || !json.Valid(payload) { + return 0 + } + var raw map[string]any + if err := json.Unmarshal(payload, &raw); err != nil { + return 0 + } + switch value := raw[key].(type) { + case float64: + return int(value) + case int: + return value + case json.Number: + parsed, _ := value.Int64() + return int(parsed) + default: + return 0 + } +} + +func fabricServiceChannelFeedbackPayloadString(payload json.RawMessage, keys ...string) string { + if len(payload) == 0 || !json.Valid(payload) { + return "" + } + var raw map[string]any + if err := json.Unmarshal(payload, &raw); err != nil { + return "" + } + for _, key := range keys { + if value, ok := raw[key].(string); ok && strings.TrimSpace(value) != "" { + return strings.TrimSpace(value) + } + } + if nested, ok := raw["recovery_policy"].(map[string]any); ok { + for _, key := range keys { + if value, ok := nested[key].(string); ok && 
strings.TrimSpace(value) != "" { + return strings.TrimSpace(value) + } + } + } + return "" +} + +func fabricServiceChannelAnnotateFeedbackProvenance(observation FabricServiceChannelRouteFeedbackObservation, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) FabricServiceChannelRouteFeedbackObservation { + policy = normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + observation.EffectivePolicyFingerprint = policy.Fingerprint + observation.ObservedPolicyFingerprint = fabricServiceChannelFeedbackPayloadString(observation.Payload, "recovery_policy_fingerprint", "policy_fingerprint", "fingerprint") + provenance := routeProvenance[observation.RouteID] + observation.EffectiveRouteGeneration = provenance.RouteGeneration + observation.ObservedRouteGeneration = fabricServiceChannelFeedbackPayloadString(observation.Payload, "route_generation", "route_policy_version", "policy_version") + missingPolicy := observation.ObservedPolicyFingerprint == "" + missingGeneration := observation.ObservedRouteGeneration == "" && observation.EffectiveRouteGeneration != "" + observation.ProvenanceMissing = missingPolicy || missingGeneration + if observation.ObservedPolicyFingerprint != "" && policy.Fingerprint != "" && observation.ObservedPolicyFingerprint != policy.Fingerprint { + observation.StalePolicy = true + } + if observation.ObservedRouteGeneration != "" && observation.EffectiveRouteGeneration != "" && observation.ObservedRouteGeneration != observation.EffectiveRouteGeneration { + observation.StaleGeneration = true + } + switch { + case observation.StalePolicy && observation.StaleGeneration: + observation.StaleReason = "service_channel_feedback_stale_policy_and_generation" + case observation.StalePolicy: + observation.StaleReason = "service_channel_feedback_stale_policy" + case observation.StaleGeneration: + observation.StaleReason = "service_channel_feedback_stale_generation" + case 
observation.ProvenanceMissing: + observation.StaleReason = "service_channel_feedback_provenance_missing" + } + if observation.StaleReason != "" { + observation.Reasons = dedupeStrings(append(observation.Reasons, observation.StaleReason)) + } + return observation +} + +func fabricServiceChannelConservativeStaleScore(score int) int { + if score > 0 { + return 0 + } + if score < -10 { + return -10 + } + return score +} + +func fabricServiceChannelFeedbackSuppressedByOperatorCooldown(input RecordFabricServiceChannelRouteFeedbackInput, cooldownUntil, observedAt time.Time) RecordFabricServiceChannelRouteFeedbackInput { + originalStatus := input.FeedbackStatus + originalScore := input.ScoreAdjustment + payload := map[string]any{} + if len(input.Payload) > 0 && json.Valid(input.Payload) { + _ = json.Unmarshal(input.Payload, &payload) + } + payload["operator_feedback_suppressed"] = true + payload["operator_suppressed_feedback_status"] = originalStatus + payload["operator_suppressed_score_adjustment"] = originalScore + payload["operator_retry_cooldown_until"] = cooldownUntil.UTC().Format(time.RFC3339Nano) + payload["operator_suppressed_at"] = observedAt.UTC().Format(time.RFC3339Nano) + raw, err := json.Marshal(payload) + if err != nil { + raw = []byte(`{}`) + } + input.FeedbackStatus = "operator_retry_cooldown" + input.ScoreAdjustment = 0 + input.Reasons = dedupeStrings(append(input.Reasons, "operator_expired_feedback_retry", "manual_feedback_expired_retry_cooldown", "service_channel_feedback_suppressed_by_operator_expire")) + input.Payload = raw + input.ExpiresAt = cooldownUntil.UTC() + return input +} + +func fabricServiceChannelRouteFeedbackFromObservations(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time) map[string]fabricServiceChannelRouteFeedback { + return fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(observations, now, defaultFabricServiceChannelRecoveryPolicy(), nil) +} + +func 
fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) map[string]fabricServiceChannelRouteFeedback { + out := map[string]fabricServiceChannelRouteFeedback{} + for _, observation := range observations { + observation = fabricServiceChannelAnnotateFeedbackProvenance(observation, policy, routeProvenance) + if strings.TrimSpace(observation.RouteID) == "" || + (!observation.ExpiresAt.IsZero() && !observation.ExpiresAt.After(now.UTC())) { + continue + } + item := out[observation.RouteID] + item.RouteID = observation.RouteID + stale := observation.StalePolicy || observation.StaleGeneration + item.StalePolicy = item.StalePolicy || observation.StalePolicy + item.StaleGeneration = item.StaleGeneration || observation.StaleGeneration + item.ProvenanceMissing = item.ProvenanceMissing || observation.ProvenanceMissing + if observation.StaleReason != "" { + item.StaleReason = observation.StaleReason + } + item.Fenced = item.Fenced || (!stale && observation.FeedbackStatus == "fenced") + if observation.RetryCooldownUntil != nil && observation.RetryCooldownUntil.After(now.UTC()) { + item.ManualRetry = true + } + scoreAdjustment, ageDecayReasons := fabricServiceChannelFeedbackScoreWithAgeDecay(observation, now) + if stale { + scoreAdjustment = fabricServiceChannelConservativeStaleScore(scoreAdjustment) + } + item.ScoreAdjustment += scoreAdjustment + item.Reasons = append(item.Reasons, observation.Reasons...) + item.Reasons = append(item.Reasons, ageDecayReasons...) 
+ if observation.LastSendDurationMs > 0 && (item.LastSendDurationMs == 0 || observation.LastSendDurationMs < item.LastSendDurationMs) { + item.LastSendDurationMs = observation.LastSendDurationMs + } + if observation.ConsecutiveFailures > item.ConsecutiveFailures { + item.ConsecutiveFailures = observation.ConsecutiveFailures + } + if observation.StallCount > item.StallCount { + item.StallCount = observation.StallCount + } + item.DegradedFallbackRecommended = item.DegradedFallbackRecommended || (!stale && + (containsString(observation.Reasons, "service_channel_degraded_fallback_recommended") || + fabricServiceChannelFeedbackPayloadBool(observation.Payload, "degraded_fallback_recommended"))) + item.RouteRebuildRecommended = item.RouteRebuildRecommended || (!stale && + (containsString(observation.Reasons, "service_channel_route_rebuild_recommended") || + fabricServiceChannelFeedbackPayloadBool(observation.Payload, "route_rebuild_recommended"))) + if sampleCount := fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_sample_count"); sampleCount > item.QualityWindowSampleCount { + item.QualityWindowSampleCount = sampleCount + } + if successCount := fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_success_count"); successCount > item.QualityWindowSuccessCount { + item.QualityWindowSuccessCount = successCount + } + if failureCount := fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_failure_count"); failureCount > item.QualityWindowFailureCount { + item.QualityWindowFailureCount = failureCount + } + if slowCount := fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_slow_count"); slowCount > item.QualityWindowSlowCount { + item.QualityWindowSlowCount = slowCount + } + if dropCount := fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_drop_count"); dropCount > item.QualityWindowDropCount { + item.QualityWindowDropCount = dropCount + } + if 
observation.LastError != "" { + item.LastError = observation.LastError + } + if observation.ObservedAt.After(item.ObservedAt) { + item.ObservedAt = observation.ObservedAt + item.ExpiresAt = observation.ExpiresAt + item.ObservationID = observation.ID + item.Source = jsonString(jsonObject(observation.Payload), "source") + item.ChannelID = jsonString(jsonObject(observation.Payload), "channel_id") + item.ResourceID = jsonString(jsonObject(observation.Payload), "resource_id") + item.ViolationStatus = jsonString(jsonObject(observation.Payload), "last_data_plane_violation_status") + item.ViolationReason = jsonString(jsonObject(observation.Payload), "last_data_plane_violation_reason") + } + if observation.RetryCooldownUntil != nil && (item.RetryCooldownUntil == nil || observation.RetryCooldownUntil.After(*item.RetryCooldownUntil)) { + cooldown := observation.RetryCooldownUntil.UTC() + item.RetryCooldownUntil = &cooldown + } + out[observation.RouteID] = item + } + for routeID, item := range out { + item.Reasons = dedupeStrings(item.Reasons) + out[routeID] = item + } + return out +} + +func fabricServiceChannelManualRetryFeedbackFromObservations(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time) map[string]fabricServiceChannelRouteFeedback { + return fabricServiceChannelManualRetryFeedbackFromObservationsWithProvenance(observations, now, defaultFabricServiceChannelRecoveryPolicy(), nil) +} + +func fabricServiceChannelManualRetryFeedbackFromObservationsWithProvenance(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) map[string]fabricServiceChannelRouteFeedback { + out := map[string]fabricServiceChannelRouteFeedback{} + now = now.UTC() + for _, observation := range observations { + observation = fabricServiceChannelAnnotateFeedbackProvenance(observation, policy, routeProvenance) + if strings.TrimSpace(observation.RouteID) 
== "" || observation.RetryCooldownUntil == nil || !observation.RetryCooldownUntil.After(now) { + continue + } + if observation.FeedbackStatus == "healthy" { + continue + } + item := out[observation.RouteID] + item.RouteID = observation.RouteID + item.ManualRetry = true + item.StalePolicy = item.StalePolicy || observation.StalePolicy + item.StaleGeneration = item.StaleGeneration || observation.StaleGeneration + item.ProvenanceMissing = item.ProvenanceMissing || observation.ProvenanceMissing + if observation.StaleReason != "" { + item.StaleReason = observation.StaleReason + } + item.ScoreAdjustment += 0 + item.Reasons = append(item.Reasons, "operator_expired_feedback_retry", "manual_feedback_expired_retry_cooldown") + if observation.LastError != "" { + item.LastError = observation.LastError + } + if observation.ObservedAt.After(item.ObservedAt) { + item.ObservedAt = observation.ObservedAt + } + cooldown := observation.RetryCooldownUntil.UTC() + if item.RetryCooldownUntil == nil || cooldown.After(*item.RetryCooldownUntil) { + item.RetryCooldownUntil = &cooldown + } + out[observation.RouteID] = item + } + for routeID, item := range out { + item.Reasons = dedupeStrings(item.Reasons) + out[routeID] = item + } + return out +} + +func fabricServiceChannelFeedbackScoreWithAgeDecay(observation FabricServiceChannelRouteFeedbackObservation, now time.Time) (int, []string) { + score := observation.ScoreAdjustment + if score <= 0 || observation.FeedbackStatus != "healthy" || observation.ObservedAt.IsZero() { + return score, nil + } + observedAt := observation.ObservedAt.UTC() + now = now.UTC() + if !now.After(observedAt) { + return score, nil + } + maxAge := fabricServiceChannelFeedbackMaxAge + if !observation.ExpiresAt.IsZero() && observation.ExpiresAt.After(observedAt) { + maxAge = observation.ExpiresAt.Sub(observedAt) + } + if maxAge <= 0 { + return 0, []string{"service_channel_feedback_age_decay_expired"} + } + age := now.Sub(observedAt) + if age <= 0 { + return score, nil + 
} + if age >= maxAge { + return 0, []string{"service_channel_feedback_age_decay_expired"} + } + remaining := maxAge - age + decayed := int((int64(score)*int64(remaining) + int64(maxAge) - 1) / int64(maxAge)) + if decayed < 1 { + decayed = 1 + } + if decayed == score { + return score, nil + } + return decayed, []string{"service_channel_feedback_age_decay"} +} + +func mergeFabricServiceChannelRouteFeedback(dst map[string]fabricServiceChannelRouteFeedback, src map[string]fabricServiceChannelRouteFeedback) { + for routeID, incoming := range src { + existing := dst[routeID] + existing.RouteID = routeID + existing.Fenced = existing.Fenced || incoming.Fenced + existing.ManualRetry = existing.ManualRetry || incoming.ManualRetry + existing.StalePolicy = existing.StalePolicy || incoming.StalePolicy + existing.StaleGeneration = existing.StaleGeneration || incoming.StaleGeneration + existing.ProvenanceMissing = existing.ProvenanceMissing || incoming.ProvenanceMissing + if incoming.StaleReason != "" { + existing.StaleReason = incoming.StaleReason + } + existing.ScoreAdjustment += incoming.ScoreAdjustment + existing.Reasons = dedupeStrings(append(existing.Reasons, incoming.Reasons...)) + if incoming.ConsecutiveFailures > existing.ConsecutiveFailures { + existing.ConsecutiveFailures = incoming.ConsecutiveFailures + } + if incoming.StallCount > existing.StallCount { + existing.StallCount = incoming.StallCount + } + if incoming.LastSendDurationMs > 0 && (existing.LastSendDurationMs == 0 || incoming.LastSendDurationMs < existing.LastSendDurationMs) { + existing.LastSendDurationMs = incoming.LastSendDurationMs + } + existing.DegradedFallbackRecommended = existing.DegradedFallbackRecommended || incoming.DegradedFallbackRecommended + existing.RouteRebuildRecommended = existing.RouteRebuildRecommended || incoming.RouteRebuildRecommended + if incoming.QualityWindowSampleCount > existing.QualityWindowSampleCount { + existing.QualityWindowSampleCount = incoming.QualityWindowSampleCount + } 
+ if incoming.QualityWindowSuccessCount > existing.QualityWindowSuccessCount { + existing.QualityWindowSuccessCount = incoming.QualityWindowSuccessCount + } + if incoming.QualityWindowFailureCount > existing.QualityWindowFailureCount { + existing.QualityWindowFailureCount = incoming.QualityWindowFailureCount + } + if incoming.QualityWindowSlowCount > existing.QualityWindowSlowCount { + existing.QualityWindowSlowCount = incoming.QualityWindowSlowCount + } + if incoming.QualityWindowDropCount > existing.QualityWindowDropCount { + existing.QualityWindowDropCount = incoming.QualityWindowDropCount + } + if incoming.LastError != "" { + existing.LastError = incoming.LastError + } + if incoming.ObservedAt.After(existing.ObservedAt) { + existing.ObservedAt = incoming.ObservedAt + } + if incoming.RetryCooldownUntil != nil && (existing.RetryCooldownUntil == nil || incoming.RetryCooldownUntil.After(*existing.RetryCooldownUntil)) { + cooldown := incoming.RetryCooldownUntil.UTC() + existing.RetryCooldownUntil = &cooldown + } + dst[routeID] = existing + } +} + +func serviceChannelRouteFeedbackReport(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time) *FabricServiceChannelRouteFeedbackReport { + return serviceChannelRouteFeedbackReportWithPolicy(observations, now, defaultFabricServiceChannelRecoveryPolicy()) +} + +func serviceChannelRouteFeedbackReportWithPolicy(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy) *FabricServiceChannelRouteFeedbackReport { + return serviceChannelRouteFeedbackReportWithPolicyAndProvenance(observations, now, policy, nil) +} + +func serviceChannelRouteFeedbackReportWithPolicyAndProvenance(observations []FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy, routeProvenance map[string]fabricServiceChannelRouteProvenance) *FabricServiceChannelRouteFeedbackReport { + policy = 
normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + reportObservations := make([]FabricServiceChannelRouteFeedbackObservation, 0, len(observations)) + for _, observation := range observations { + observation = fabricServiceChannelAnnotateFeedbackProvenance(observation, policy, routeProvenance) + effectiveScore, ageDecayReasons := fabricServiceChannelFeedbackScoreWithAgeDecay(observation, now) + if observation.StalePolicy || observation.StaleGeneration { + effectiveScore = fabricServiceChannelConservativeStaleScore(effectiveScore) + } + observation.EffectiveScoreAdjustment = effectiveScore + observation.Reasons = dedupeStrings(append(observation.Reasons, ageDecayReasons...)) + observation.RecoveryState = fabricServiceChannelFeedbackObservationRecoveryState(observation, now) + observation.RecoveryPromoted = fabricServiceChannelFeedbackObservationRecoveryPromoted(observation, now, policy) + if observation.RecoveryPromoted { + observation.RecoveryState = "healthy" + } + observation.RecoveryDemoted, observation.RecoveryReason = fabricServiceChannelFeedbackObservationRecoveryDemotion(observation, now, policy) + observation.RecoveryHysteresisActive = observation.RecoveryState == "recovered" + if observation.RecoveryHysteresisActive { + observation.RecoveryHysteresisPenalty = policy.HysteresisPenalty + } + reportObservations = append(reportObservations, observation) + } + report := &FabricServiceChannelRouteFeedbackReport{ + SchemaVersion: "rap.fabric_service_channel_route_feedback_report.v1", + GeneratedAt: now.UTC(), + FeedbackMaxAgeSeconds: int(fabricServiceChannelFeedbackMaxAge.Seconds()), + RecoveryPolicy: fabricServiceChannelRecoveryPolicyRef(policy), + ObservationCount: len(observations), + Observations: reportObservations, + } + for _, observation := range reportObservations { + switch strings.ToLower(strings.TrimSpace(observation.FeedbackStatus)) { + case "fenced": + report.FencedRouteCount++ + case "degraded": + 
report.DegradedRouteCount++ + case "healthy": + report.HealthyRouteCount++ + } + if observation.RecoveryState == "recovered" { + report.RecoveredRouteCount++ + } + if observation.RecoveryHysteresisActive { + report.RecoveryHysteresisCount++ + } + if observation.RecoveryPromoted { + report.RecoveryPromotedCount++ + } + if observation.RecoveryDemoted { + report.RecoveryDemotedCount++ + } + if observation.ProvenanceMissing { + report.MissingProvenanceCount++ + } + if observation.StalePolicy { + report.StalePolicyCount++ + } + if observation.StaleGeneration { + report.StaleGenerationCount++ + } + } + return report +} + +func fabricServiceChannelFeedbackObservationRecoveryState(observation FabricServiceChannelRouteFeedbackObservation, now time.Time) string { + switch strings.ToLower(strings.TrimSpace(observation.FeedbackStatus)) { + case "fenced": + return "fenced" + case "degraded": + return "degraded" + case "healthy": + if observation.RetryCooldownUntil != nil && + observation.RetryCooldownUntil.After(now.UTC()) && + containsString(observation.Reasons, "service_channel_rolling_quality_window") { + return "recovered" + } + return "healthy" + default: + if observation.RetryCooldownUntil != nil && observation.RetryCooldownUntil.After(now.UTC()) { + return "cooldown" + } + return "" + } +} + +func fabricServiceChannelFeedbackObservationRecoveryPromoted(observation FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy) bool { + if observation.RetryCooldownUntil == nil || !observation.RetryCooldownUntil.After(now.UTC()) { + return false + } + if strings.ToLower(strings.TrimSpace(observation.FeedbackStatus)) != "healthy" || + !containsString(observation.Reasons, "service_channel_rolling_quality_window") { + return false + } + return fabricServiceChannelFeedbackCleanRollingSamples( + fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_sample_count"), + 
fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_success_count"), + fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_failure_count"), + fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_slow_count"), + fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_drop_count"), + policy, + ) +} + +func fabricServiceChannelFeedbackObservationRecoveryDemotion(observation FabricServiceChannelRouteFeedbackObservation, now time.Time, policy FabricServiceChannelRecoveryPolicy) (bool, string) { + if observation.RetryCooldownUntil == nil || !observation.RetryCooldownUntil.After(now.UTC()) { + return false, "" + } + if observation.RecoveryPromoted { + return false, "" + } + if policy.DemotionFencedEnabled && strings.ToLower(strings.TrimSpace(observation.FeedbackStatus)) == "fenced" { + return true, "service_channel_recovery_demoted_fenced" + } + if policy.DemotionRebuildEnabled && (containsString(observation.Reasons, "service_channel_route_rebuild_recommended") || + fabricServiceChannelFeedbackPayloadBool(observation.Payload, "route_rebuild_recommended")) { + return true, "service_channel_recovery_demoted_rebuild" + } + if fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_failure_count") >= policy.DemotionFailureThreshold || + fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_drop_count") >= policy.DemotionDropThreshold { + return true, "service_channel_recovery_demoted_failure" + } + if fabricServiceChannelFeedbackPayloadInt(observation.Payload, "quality_window_slow_count") >= policy.DemotionSlowThreshold { + return true, "service_channel_recovery_demoted_slow" + } + if strings.ToLower(strings.TrimSpace(observation.FeedbackStatus)) == "degraded" { + return true, "service_channel_recovery_demoted_degraded" + } + return false, "" +} + +func fabricServiceChannelRoutesFromIntents(intents []MeshRouteIntent, serviceClass string, 
entryPool, exitPool, allowedChannels []string, generation string, now, defaultExpiresAt time.Time, feedback map[string]fabricServiceChannelRouteFeedback, policy FabricServiceChannelRecoveryPolicy) []FabricServiceChannelRoute { + policy = normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + routes := []FabricServiceChannelRoute{} + for _, intent := range intents { + route, ok := fabricServiceChannelRouteFromIntent(intent, serviceClass, entryPool, exitPool, allowedChannels, generation, now, defaultExpiresAt, feedback, policy) + if ok { + routes = append(routes, route) + } + } + sort.SliceStable(routes, func(i, j int) bool { + if routes[i].Status != routes[j].Status { + return routes[i].Status == "authorized" + } + if routes[i].PathScore != routes[j].PathScore { + return routes[i].PathScore > routes[j].PathScore + } + if len(routes[i].Hops) != len(routes[j].Hops) { + return len(routes[i].Hops) < len(routes[j].Hops) + } + return routes[i].RouteID < routes[j].RouteID + }) + return routes +} + +func fabricServiceChannelRouteFromIntent(intent MeshRouteIntent, serviceClass string, entryPool, exitPool, requestedChannels []string, generation string, now, defaultExpiresAt time.Time, feedback map[string]fabricServiceChannelRouteFeedback, recoveryPolicy FabricServiceChannelRecoveryPolicy) (FabricServiceChannelRoute, bool) { + recoveryPolicy = normalizeFabricServiceChannelRecoveryPolicy(recoveryPolicy, defaultFabricServiceChannelRecoveryPolicy()) + if intent.Status != "active" || strings.TrimSpace(intent.ServiceClass) != serviceClass { + return FabricServiceChannelRoute{}, false + } + var policy syntheticRoutePolicy + if err := json.Unmarshal(intent.Policy, &policy); err != nil { + return FabricServiceChannelRoute{}, false + } + if policy.ExpiresAt != nil && !policy.ExpiresAt.After(now.UTC()) { + return FabricServiceChannelRoute{}, false + } + var source nodeSelector + var destination nodeSelector + _ = 
json.Unmarshal(intent.SourceSelector, &source) + _ = json.Unmarshal(intent.DestinationSelector, &destination) + sourceNodeID := firstNodeID(source) + destinationNodeID := firstNodeID(destination) + hops := append([]string{}, policy.Hops...) + if len(hops) == 0 && sourceNodeID != "" && destinationNodeID != "" { + hops = []string{sourceNodeID, destinationNodeID} + } + if len(hops) < 2 { + return FabricServiceChannelRoute{}, false + } + if sourceNodeID == "" { + sourceNodeID = hops[0] + } + if destinationNodeID == "" { + destinationNodeID = hops[len(hops)-1] + } + if !containsString(entryPool, sourceNodeID) || !containsString(exitPool, destinationNodeID) { + return FabricServiceChannelRoute{}, false + } + allowedChannels := policy.AllowedChannels + if len(allowedChannels) == 0 { + allowedChannels = requestedChannels + } + if !fabricChannelsIntersect(allowedChannels, requestedChannels) { + return FabricServiceChannelRoute{}, false + } + expiresAt := defaultExpiresAt + if policy.ExpiresAt != nil { + expiresAt = policy.ExpiresAt.UTC() + } + routeVersion := policy.RouteVersion + if routeVersion == "" { + routeVersion = intent.UpdatedAt.UTC().Format(time.RFC3339) + } + policyVersion := policy.PolicyVersion + if policyVersion == "" { + policyVersion = routeVersion + } + score := 100 - len(hops)*5 + intent.Priority + if score < 1 { + score = 1 + } + status := "authorized" + recoveryState := "" + recoveryPenalty := 0 + recoveryPromoted := false + recoveryDemoted := false + recoveryReason := "" + scoreReasons := []string{"active_route_intent", "entry_exit_pool_match"} + if item, ok := feedback[intent.ID]; ok { + score += item.ScoreAdjustment + scoreReasons = append(scoreReasons, item.Reasons...) 
+ if item.StalePolicy || item.StaleGeneration { + recoveryReason = item.StaleReason + if recoveryReason == "" { + recoveryReason = "service_channel_feedback_stale" + } + scoreReasons = append(scoreReasons, "service_channel_feedback_stale", recoveryReason) + } + if fabricServiceChannelFeedbackRecoveryDemoted(item, recoveryPolicy) { + recoveryDemoted = true + recoveryReason = fabricServiceChannelFeedbackRecoveryDemotionReason(item, recoveryPolicy) + scoreReasons = append(scoreReasons, "service_channel_recovery_demoted", recoveryReason) + } + if item.Fenced { + status = "fenced_by_service_channel_feedback" + recoveryState = "fenced" + score = 0 + } else if score < 1 { + score = 1 + } + if status == "authorized" && fabricServiceChannelFeedbackRecoveryPromoted(item, recoveryPolicy) { + recoveryState = "healthy" + recoveryPromoted = true + scoreReasons = append(scoreReasons, "service_channel_recovery_promoted") + } else if status == "authorized" && fabricServiceChannelFeedbackRecoveryHysteresisActive(item, recoveryPolicy) { + recoveryState = "recovered" + recoveryPenalty = recoveryPolicy.HysteresisPenalty + score -= recoveryPenalty + if score < 1 { + score = 1 + } + scoreReasons = append(scoreReasons, "service_channel_recovery_hysteresis") + } else if status == "authorized" && item.ScoreAdjustment > 0 { + recoveryState = "healthy" + } + } + return FabricServiceChannelRoute{ + RouteID: intent.ID, + ClusterID: intent.ClusterID, + ServiceClass: serviceClass, + SourceNodeID: sourceNodeID, + DestinationNodeID: destinationNodeID, + Hops: hops, + AllowedChannels: allowedChannels, + RouteVersion: routeVersion, + PolicyVersion: policyVersion, + Generation: generation, + Status: status, + RecoveryState: recoveryState, + RecoveryPenalty: recoveryPenalty, + RecoveryPromoted: recoveryPromoted, + RecoveryDemoted: recoveryDemoted, + RecoveryReason: recoveryReason, + RecoveryPolicy: fabricServiceChannelRecoveryPolicyRef(recoveryPolicy), + PathScore: score, + ScoreReasons: 
dedupeStrings(scoreReasons), + ExpiresAt: expiresAt, + }, true +} + +const fabricServiceChannelRecoveryHysteresisPenalty = 150 +const fabricServiceChannelRecoveryPromotionMinSamples = 64 + +func fabricServiceChannelFeedbackRecoveryHysteresisActive(item fabricServiceChannelRouteFeedback, policy FabricServiceChannelRecoveryPolicy) bool { + if item.StalePolicy || item.StaleGeneration { + return false + } + return item.ManualRetry && !item.Fenced && item.ScoreAdjustment > 0 && + containsString(item.Reasons, "service_channel_rolling_quality_window") && + !fabricServiceChannelFeedbackRecoveryPromoted(item, policy) +} + +func fabricServiceChannelFeedbackRecoveryPromoted(item fabricServiceChannelRouteFeedback, policy FabricServiceChannelRecoveryPolicy) bool { + if item.StalePolicy || item.StaleGeneration { + return false + } + return item.ManualRetry && !item.Fenced && item.ScoreAdjustment > 0 && + containsString(item.Reasons, "service_channel_rolling_quality_window") && + fabricServiceChannelFeedbackCleanRollingSamples( + item.QualityWindowSampleCount, + item.QualityWindowSuccessCount, + item.QualityWindowFailureCount, + item.QualityWindowSlowCount, + item.QualityWindowDropCount, + policy, + ) +} + +func fabricServiceChannelFeedbackRecoveryDemoted(item fabricServiceChannelRouteFeedback, policy FabricServiceChannelRecoveryPolicy) bool { + if item.StalePolicy || item.StaleGeneration { + return false + } + return item.ManualRetry && !fabricServiceChannelFeedbackRecoveryPromoted(item, policy) && + ((policy.DemotionFencedEnabled && item.Fenced) || + (policy.DemotionRebuildEnabled && item.RouteRebuildRecommended) || + item.DegradedFallbackRecommended || + item.QualityWindowFailureCount >= policy.DemotionFailureThreshold || + item.QualityWindowDropCount >= policy.DemotionDropThreshold || + item.QualityWindowSlowCount >= policy.DemotionSlowThreshold || + item.ScoreAdjustment < 0) +} + +func fabricServiceChannelFeedbackRecoveryDemotionReason(item fabricServiceChannelRouteFeedback, 
policy FabricServiceChannelRecoveryPolicy) string { + if policy.DemotionFencedEnabled && item.Fenced { + return "service_channel_recovery_demoted_fenced" + } + if policy.DemotionRebuildEnabled && item.RouteRebuildRecommended { + return "service_channel_recovery_demoted_rebuild" + } + if item.QualityWindowFailureCount >= policy.DemotionFailureThreshold || item.QualityWindowDropCount >= policy.DemotionDropThreshold { + return "service_channel_recovery_demoted_failure" + } + if item.QualityWindowSlowCount >= policy.DemotionSlowThreshold { + return "service_channel_recovery_demoted_slow" + } + if item.DegradedFallbackRecommended { + return "service_channel_recovery_demoted_degraded_fallback" + } + if item.ScoreAdjustment < 0 { + return "service_channel_recovery_demoted_degraded" + } + return "service_channel_recovery_demoted" +} + +func fabricServiceChannelFeedbackCleanRollingSamples(sampleCount, successCount, failureCount, slowCount, dropCount int, policy FabricServiceChannelRecoveryPolicy) bool { + return sampleCount >= policy.PromotionMinSamples && + successCount >= policy.PromotionMinSamples && + failureCount == 0 && + slowCount == 0 && + dropCount == 0 +} + +func fabricChannelsIntersect(a, b []string) bool { + for _, left := range a { + if containsString(b, left) { + return true + } + } + return false +} + +func selectFabricServicePrimaryRoute(routes []FabricServiceChannelRoute, selectedEntry, selectedExit string) (FabricServiceChannelRoute, []FabricServiceChannelRoute) { + if len(routes) == 0 { + return FabricServiceChannelRoute{}, nil + } + alternates := make([]FabricServiceChannelRoute, 0, len(routes)-1) + for _, route := range routes { + if route.Status != "authorized" { + continue + } + if route.SourceNodeID == selectedEntry && route.DestinationNodeID == selectedExit { + for _, alternate := range routes { + if alternate.RouteID != route.RouteID && alternate.Status == "authorized" { + alternates = append(alternates, alternate) + } + } + return route, 
alternates + } + } + primary := FabricServiceChannelRoute{} + for _, route := range routes { + if route.Status != "authorized" { + continue + } + if primary.RouteID == "" { + primary = route + continue + } + alternates = append(alternates, route) + } + return primary, alternates +} + +type fabricServiceChannelRouteIntentReplacementScope struct { + EntryPoolKey string + ExitPoolKey string + ResourceKey string +} + +func fabricServiceChannelRouteIntentMetadataKey(intent MeshRouteIntent, keys []string) string { + if len(intent.Policy) == 0 || !json.Valid(intent.Policy) { + return "" + } + var policy syntheticRoutePolicy + if err := json.Unmarshal(intent.Policy, &policy); err != nil { + return "" + } + for _, key := range keys { + value, ok := policy.Metadata[key] + if !ok { + continue + } + switch typed := value.(type) { + case string: + if trimmed := strings.TrimSpace(typed); trimmed != "" { + return key + ":" + trimmed + } + case fmt.Stringer: + if trimmed := strings.TrimSpace(typed.String()); trimmed != "" { + return key + ":" + trimmed + } + } + } + return "" +} + +func fabricServiceChannelRouteIntentReplacementScopes(intents []MeshRouteIntent) map[string]fabricServiceChannelRouteIntentReplacementScope { + out := map[string]fabricServiceChannelRouteIntentReplacementScope{} + for _, intent := range intents { + if routeID := strings.TrimSpace(intent.ID); routeID != "" { + out[routeID] = fabricServiceChannelRouteIntentReplacementScope{ + EntryPoolKey: fabricServiceChannelRouteIntentMetadataKey(intent, []string{"entry_pool_id", "service_entry_pool_id", "fabric_entry_pool_id"}), + ExitPoolKey: fabricServiceChannelRouteIntentMetadataKey(intent, []string{"exit_pool_id", "service_exit_pool_id", "fabric_exit_pool_id"}), + ResourceKey: fabricServiceChannelRouteIntentMetadataKey(intent, []string{"service_resource_id", "resource_id", "fabric_service_resource_id"}), + } + } + } + return out +} + +func fabricServiceChannelRoutesShareReplacementScope(fencedRoute, candidateRoute 
SyntheticMeshRouteConfig, scopes map[string]fabricServiceChannelRouteIntentReplacementScope) bool { + if fencedRoute.SourceNodeID == candidateRoute.SourceNodeID && fencedRoute.DestinationNodeID == candidateRoute.DestinationNodeID { + return true + } + fencedScope := scopes[fencedRoute.RouteID] + candidateScope := scopes[candidateRoute.RouteID] + sameResource := strings.TrimSpace(fencedScope.ResourceKey) != "" && fencedScope.ResourceKey == strings.TrimSpace(candidateScope.ResourceKey) + if fencedRoute.SourceNodeID == candidateRoute.SourceNodeID { + return sameResource || (strings.TrimSpace(fencedScope.ExitPoolKey) != "" && fencedScope.ExitPoolKey == strings.TrimSpace(candidateScope.ExitPoolKey)) + } + if fencedRoute.DestinationNodeID == candidateRoute.DestinationNodeID { + return sameResource || (strings.TrimSpace(fencedScope.EntryPoolKey) != "" && fencedScope.EntryPoolKey == strings.TrimSpace(candidateScope.EntryPoolKey)) + } + if sameResource && + strings.TrimSpace(fencedScope.EntryPoolKey) != "" && + fencedScope.EntryPoolKey == strings.TrimSpace(candidateScope.EntryPoolKey) && + strings.TrimSpace(fencedScope.ExitPoolKey) != "" && + fencedScope.ExitPoolKey == strings.TrimSpace(candidateScope.ExitPoolKey) { + return true + } + return false +} + +func fabricServiceRoutesFencedForSelectedPair(routes []FabricServiceChannelRoute, selectedEntry, selectedExit string) bool { + for _, route := range routes { + if route.SourceNodeID == selectedEntry && + route.DestinationNodeID == selectedExit && + route.Status == "fenced_by_service_channel_feedback" { + return true + } + } + return false +} + +func fabricServiceRoutesFencedForPool(routes []FabricServiceChannelRoute) bool { + for _, route := range routes { + if route.Status == "fenced_by_service_channel_feedback" { + return true + } + } + return false +} + +func defaultFabricServiceQoS(serviceClass string) string { + switch serviceClass { + case FabricServiceClassVPNPackets: + return 
`{"priority":"bulk","interactive":false,"bulk_limit_mbps":0}` + case FabricServiceClassRemoteWorkspace: + return `{"priority":"interactive","interactive":true,"bulk_limit_mbps":0}` + case FabricServiceClassVideo: + return `{"priority":"interactive","interactive":true,"adaptive":true}` + default: + return `{"priority":"normal","interactive":false,"bulk_limit_mbps":0}` + } +} + +func fabricServiceChannelHTTPIngress(serviceClass string) FabricServiceChannelHTTPIngress { + ingress := FabricServiceChannelHTTPIngress{ + Type: "entry_direct_http_v1", + TokenHeader: "X-RAP-Service-Channel-Token", + ServiceClassHeader: "X-RAP-Service-Class", + ChannelClassHeader: "X-RAP-Channel-Class", + SupportedMethods: []string{"POST", "GET", "WEBSOCKET"}, + } + switch serviceClass { + case FabricServiceClassRemoteWorkspace: + ingress.PathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/remote-workspaces/{resource_id}/streams/{channel_class}" + ingress.WebSocketPathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/remote-workspaces/{resource_id}/streams/ws" + ingress.PacketBatchFormat = "application/vnd.rap.remote-workspace-frame-batch.v1" + case FabricServiceClassVideo: + ingress.PathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/video-sessions/{resource_id}/streams/{channel_class}" + ingress.WebSocketPathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/video-sessions/{resource_id}/streams/ws" + ingress.PacketBatchFormat = "application/vnd.rap.video-frame-batch.v1" + case FabricServiceClassFileTransfer: + ingress.PathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/file-transfers/{resource_id}/chunks" + ingress.WebSocketPathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/file-transfers/{resource_id}/chunks/ws" + ingress.PacketBatchFormat = "application/vnd.rap.file-transfer-chunk-batch.v1" + default: + 
ingress.PathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets" + ingress.WebSocketPathTemplate = "/api/v1/clusters/{cluster_id}/fabric/service-channels/{channel_id}/vpn-connections/{resource_id}/packets/ws" + ingress.PacketBatchFormat = "application/vnd.rap.vpn-packet-batch.v1" + } + return ingress +} + +func fabricServiceChannelDataPlaneContract(serviceClass string, poolPolicy FabricServiceChannelPoolPolicy, fallback FabricServiceChannelFallback) FabricServiceChannelDataPlaneContract { + backendRelayPolicy := "disabled" + if poolPolicy.BackendFallbackAllowed || fallback.Allowed || fallback.BackendRelay { + backendRelayPolicy = "degraded_fallback_only" + } + entryFailover := firstNonEmptyString(poolPolicy.EntryFailover, "automatic") + exitFailover := firstNonEmptyString(poolPolicy.ExitFailover, "automatic") + routeRebuild := firstNonEmptyString(poolPolicy.RouteRebuild, "automatic") + mode := "fabric_primary" + if fallback.Active { + mode = "degraded_backend_fallback" + } + return FabricServiceChannelDataPlaneContract{ + SchemaVersion: "rap.fabric_service_channel_data_plane.v1", + Mode: mode, + ControlPlaneTransport: "backend_api", + WorkingDataTransport: "fabric_service_channel", + SteadyStateTransport: "fabric_route", + BackendRelayPolicy: backendRelayPolicy, + ProductionForwardingRequired: true, + ServiceNeutral: true, + ProtocolAgnostic: true, + LogicalFlowMode: "multi_flow_isolated", + RequiredFlowIsolationClasses: fabricServiceChannelFlowIsolationClasses(serviceClass), + RouteSelectionStrategy: firstNonEmptyString(poolPolicy.SelectionStrategy, "fastest_healthy"), + EntryFailoverMode: entryFailover, + ExitFailoverMode: exitFailover, + RouteRebuildMode: routeRebuild, + FailureDetectionSource: "route_quality_feedback_and_runtime_heartbeats", + DegradedFallbackVisibility: "explicit_access_telemetry_and_rebuild_health", + StableContractForServiceClass: serviceClass, + } +} + +func 
fabricServiceChannelFlowIsolationClasses(serviceClass string) []string { + switch serviceClass { + case FabricServiceClassVPNPackets: + return []string{FabricChannelControl, FabricChannelInteractive, FabricChannelReliable, FabricChannelBulk, FabricChannelDroppable, "vpn_packet"} + case FabricServiceClassRemoteWorkspace: + return []string{FabricChannelControl, FabricChannelInteractive, FabricChannelReliable, FabricChannelBulk, FabricChannelDroppable} + case FabricServiceClassVideo: + return []string{FabricChannelControl, FabricChannelInteractive, FabricChannelDroppable} + case FabricServiceClassFileTransfer: + return []string{FabricChannelControl, FabricChannelReliable, FabricChannelBulk} + default: + return []string{FabricChannelControl, FabricChannelReliable} + } +} + +func defaultFabricServiceFailover() string { + return `{"route_rebuild":"automatic","exit_failover":"automatic","sticky_session":true}` +} + func (s *Service) GetNodeSyntheticMeshConfig(ctx context.Context, input GetNodeSyntheticMeshConfigInput) (NodeSyntheticMeshConfig, error) { input.ClusterID = strings.TrimSpace(input.ClusterID) input.NodeID = strings.TrimSpace(input.NodeID) @@ -793,6 +7217,14 @@ func (s *Service) GetNodeSyntheticMeshConfig(ctx context.Context, input GetNodeS Routes: []SyntheticMeshRouteConfig{}, ProductionForwarding: false, } + listenerConfig, err := s.nodeMeshListenerConfig(ctx, input) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } + cfg.MeshListener = listenerConfig + if listenerConfig != nil && listenerConfig.ProductionForwarding { + cfg.ProductionForwarding = true + } flags, err := s.store.GetEffectiveNodeTestingFlags(ctx, input.ClusterID, input.NodeID) if err != nil { return NodeSyntheticMeshConfig{}, err @@ -808,21 +7240,72 @@ func (s *Service) GetNodeSyntheticMeshConfig(ctx context.Context, input GetNodeS cfg.ConfigVersion = "c17z18-" + s.now().UTC().Format("20060102T150405Z") cfg.PeerDirectoryVersion = cfg.ConfigVersion cfg.PolicyVersion = 
cfg.ConfigVersion + if cfg.MeshListener != nil && cfg.MeshListener.ConfigVersion == "" { + cfg.MeshListener.ConfigVersion = cfg.ConfigVersion + } meshLinks, err := s.store.ListMeshLinks(ctx, input.ClusterID) if err != nil { return NodeSyntheticMeshConfig{}, err } relayPolicy := newRendezvousRelayPolicy(input.NodeID, meshLinks, s.now()) + recoveryPolicy := s.fabricServiceChannelRecoveryPolicy(ctx, input.ClusterID) + cluster, err := s.store.GetCluster(ctx, input.ClusterID) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } + adaptivePolicy := fabricServiceChannelAdaptivePolicyFromCluster(cluster) + cfg.ServiceChannelAdaptivePolicy = &adaptivePolicy + routeProvenance := fabricServiceChannelRouteProvenanceFromIntents(intents) + serviceChannelFeedbackItems, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: input.ClusterID, + ReporterNodeID: input.NodeID, + Now: s.now(), + }) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } + cfg.ServiceChannelFeedback = serviceChannelRouteFeedbackReportWithPolicyAndProvenance(serviceChannelFeedbackItems, s.now(), recoveryPolicy, routeProvenance) + serviceChannelFeedback := fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(serviceChannelFeedbackItems, s.now(), recoveryPolicy, routeProvenance) + cfg.ServiceChannelRemediationCommands, err = s.fabricServiceChannelRemediationCommandsForNode(ctx, input.ClusterID, input.NodeID, serviceChannelFeedback, s.now()) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } + if err := s.recordFabricServiceChannelRemediationRebuildIntents(ctx, input.ClusterID, input.NodeID, cfg.ServiceChannelRemediationCommands, s.now()); err != nil { + return NodeSyntheticMeshConfig{}, err + } + remediationRoutePathDecisions, err := s.resolveFabricServiceChannelRemediationRebuildIntents(ctx, input, cfg.ServiceChannelRemediationCommands, intents, serviceChannelFeedback, cfg.ConfigVersion, s.now()) + if err != 
nil { + return NodeSyntheticMeshConfig{}, err + } + serviceChannelExpiredFeedbackItems, err := s.store.ListFabricServiceChannelRouteFeedback(ctx, ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: input.ClusterID, + ReporterNodeID: input.NodeID, + IncludeExpired: true, + Now: s.now(), + }) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } + mergeFabricServiceChannelRouteFeedback(serviceChannelFeedback, fabricServiceChannelManualRetryFeedbackFromObservationsWithProvenance(serviceChannelExpiredFeedbackItems, s.now(), recoveryPolicy, routeProvenance)) + localPerspective, err := s.localEndpointPerspective(ctx, input.ClusterID, input.NodeID) + if err != nil { + return NodeSyntheticMeshConfig{}, err + } peerDirectory := map[string]*PeerDirectoryEntry{} recoverySeeds := map[string]PeerRecoverySeed{} rendezvousLeases := map[string]PeerRendezvousLease{} - routePathDecisions := []RoutePathDecision{} + routePathDecisions := append([]RoutePathDecision{}, remediationRoutePathDecisions...) 
for _, intent := range intents { route, peers, candidates, seeds, policyLeases, ok := s.syntheticRouteFromIntent(input, intent) if !ok { continue } - reportedPeers, reportedCandidates, err := s.reportedEndpointConfig(ctx, input.ClusterID, input.NodeID, route.Hops) + if feedback, ok := serviceChannelFeedback[route.RouteID]; ok && feedback.Fenced { + replacementDecision := s.serviceChannelRouteReplacementDecision(input, route, intents, serviceChannelFeedback, cfg.ConfigVersion) + routePathDecisions = append(routePathDecisions, replacementDecision) + continue + } + reportedPeers, reportedCandidates, err := s.reportedEndpointConfig(ctx, input.ClusterID, input.NodeID, route.Hops, localPerspective) if err != nil { return NodeSyntheticMeshConfig{}, err } @@ -847,7 +7330,7 @@ func (s *Service) GetNodeSyntheticMeshConfig(ctx context.Context, input GetNodeS routeLeases := scopedRendezvousLeases(policyLeases, route, input.NodeID, relayPolicy, s.now()) routeLeases = append(routeLeases, derivedRendezvousLeases(route, peers, candidates, input.NodeID, relayPolicy, s.now())...) 
cfg.Routes = append(cfg.Routes, route) - routePathDecisions = append(routePathDecisions, routePathDecisionForRoute(route, input.NodeID, routeLeases, relayPolicy, cfg.ConfigVersion)) + routePathDecisions = append(routePathDecisions, routePathDecisionForRoute(route, input.NodeID, routeLeases, relayPolicy, cfg.ConfigVersion, serviceChannelFeedback[route.RouteID])) mergePeerDirectoryRoute(peerDirectory, route, input.NodeID) for nodeID, endpoint := range peers { if strings.TrimSpace(nodeID) != "" && strings.TrimSpace(endpoint) != "" { @@ -865,16 +7348,466 @@ func (s *Service) GetNodeSyntheticMeshConfig(ctx context.Context, input GetNodeS mergeRecoverySeeds(recoverySeeds, seeds) mergeRendezvousLeases(rendezvousLeases, routeLeases) } + if err := s.addCoreMeshBootstrapPeers(ctx, input, &cfg, peerDirectory, recoverySeeds, rendezvousLeases, localPerspective); err != nil { + return NodeSyntheticMeshConfig{}, err + } cfg.RecoverySeeds = sortedRecoverySeeds(recoverySeeds, maxScopedRecoverySeeds) cfg.RendezvousLeases = sortedRendezvousLeases(rendezvousLeases, maxScopedRendezvousLeases) cfg.RendezvousRelayPolicy = relayPolicy.report() - cfg.RoutePathDecisions = routePathDecisionReport(cfg.ConfigVersion, routePathDecisions) + cfg.RoutePathDecisions = routePathDecisionReportWithRecoveryPolicy(cfg.ConfigVersion, routePathDecisions, recoveryPolicy) + _ = s.recordFabricServiceChannelRouteRebuildAttempts(ctx, input, cfg.RoutePathDecisions, cfg.ServiceChannelFeedback) markPeerDirectoryRecoverySeeds(peerDirectory, cfg.RecoverySeeds) markPeerDirectoryRendezvousLeases(peerDirectory, cfg.RendezvousLeases, input.NodeID) cfg.PeerDirectory = sortedPeerDirectory(peerDirectory) return s.signSyntheticMeshConfig(ctx, cfg) } +func (s *Service) recordFabricServiceChannelRouteRebuildAttempts(ctx context.Context, input GetNodeSyntheticMeshConfigInput, report *RoutePathDecisionReport, feedbackReport *FabricServiceChannelRouteFeedbackReport) error { + if report == nil || len(report.Decisions) == 0 { + 
return nil + } + feedbackByRoute := map[string]FabricServiceChannelRouteFeedbackObservation{} + if feedbackReport != nil { + for _, item := range feedbackReport.Observations { + if strings.TrimSpace(item.RouteID) != "" { + feedbackByRoute[item.RouteID] = item + } + } + } + for _, decision := range report.Decisions { + if strings.TrimSpace(decision.RebuildRequestID) == "" { + continue + } + feedback := feedbackByRoute[decision.RouteID] + serviceClass := firstNonEmptyString(feedback.ServiceClass, FabricServiceClassVPNPackets) + outcome := "degraded_fallback" + if strings.TrimSpace(decision.ReplacementRouteID) != "" { + outcome = "replacement_selected" + } else if decision.DecisionSource == "service_channel_feedback_no_alternate" { + outcome = "no_alternate" + } + payload := mustJSONRaw(map[string]any{ + "schema_version": "c18z98.route_rebuild_attempt_correlation.v1", + "decision_id": decision.DecisionID, + "score_reasons": decision.ScoreReasons, + "path_score": decision.PathScore, + "local_role": decision.LocalRole, + "previous_hop_id": decision.PreviousHopID, + "next_hop_id": decision.NextHopID, + "control_plane_only": decision.ControlPlaneOnly, + "production_forwarding": decision.ProductionForwarding, + "decision_expires_at": decision.ExpiresAt.UTC().Format(time.RFC3339Nano), + "feedback_observation_id": decision.FeedbackObservationID, + "feedback_source": decision.FeedbackSource, + "feedback_observed_at": formatOptionalTime(decision.FeedbackObservedAt), + "feedback_expires_at": formatOptionalTime(decision.FeedbackExpiresAt), + "feedback_channel_id": decision.FeedbackChannelID, + "feedback_resource_id": decision.FeedbackResourceID, + "feedback_violation_status": decision.FeedbackViolationStatus, + "feedback_violation_reason": decision.FeedbackViolationReason, + }) + _, err := s.store.RecordFabricServiceChannelRouteRebuildAttempt(ctx, RecordFabricServiceChannelRouteRebuildAttemptInput{ + ClusterID: input.ClusterID, + ReporterNodeID: input.NodeID, + ServiceClass: 
serviceClass, + RouteID: decision.RouteID, + ReplacementRouteID: decision.ReplacementRouteID, + RebuildRequestID: decision.RebuildRequestID, + RebuildStatus: decision.RebuildStatus, + RebuildReason: decision.RebuildReason, + RebuildAttempt: decision.RebuildAttempt, + DecisionSource: decision.DecisionSource, + Outcome: outcome, + Generation: decision.Generation, + PolicyFingerprint: feedback.EffectivePolicyFingerprint, + ObservedPolicyFingerprint: feedback.ObservedPolicyFingerprint, + ObservedRouteGeneration: feedback.ObservedRouteGeneration, + EffectiveRouteGeneration: feedback.EffectiveRouteGeneration, + FeedbackStatus: feedback.FeedbackStatus, + FeedbackObservationID: decision.FeedbackObservationID, + FeedbackSource: decision.FeedbackSource, + FeedbackObservedAt: decision.FeedbackObservedAt, + FeedbackExpiresAt: decision.FeedbackExpiresAt, + FeedbackChannelID: decision.FeedbackChannelID, + FeedbackResourceID: decision.FeedbackResourceID, + FeedbackViolationStatus: decision.FeedbackViolationStatus, + FeedbackViolationReason: decision.FeedbackViolationReason, + FeedbackScoreAdjustment: feedback.ScoreAdjustment, + FeedbackEffectiveScoreAdjustment: feedback.EffectiveScoreAdjustment, + FeedbackReasons: append([]string{}, feedback.Reasons...), + LastError: feedback.LastError, + ConsecutiveFailures: feedback.ConsecutiveFailures, + StallCount: feedback.StallCount, + LastSendDurationMs: feedback.LastSendDurationMs, + OldHops: append([]string{}, decision.OriginalHops...), + ReplacementHops: append([]string{}, decision.EffectiveHops...), + Payload: payload, + }) + if err != nil { + return err + } + } + return nil +} + +func (s *Service) autoWarmFabricServiceChannelRouteRebuildAttemptSnapshot(ctx context.Context, clusterID string, attempt FabricServiceChannelRouteRebuildAttempt, now time.Time) (bool, error) { + if fabricServiceChannelRouteRebuildHasCorrelationSnapshot(attempt) { + return false, nil + } + nodeID := strings.TrimSpace(attempt.ReporterNodeID) + if nodeID == "" { 
+ return false, ErrInvalidPayload + } + if now.IsZero() { + now = time.Now().UTC() + } + heartbeats, err := s.store.ListNodeHeartbeats(ctx, clusterID, nodeID, 120) + if err != nil { + return false, err + } + attempt = enrichFabricServiceChannelRouteRebuildAttempt(attempt, heartbeats, now) + if !attempt.NodeTransitionMatched && !attempt.NodeRouteGenerationMatched && attempt.PostRebuildSelectedRouteID == "" && attempt.PostRebuildSendPackets == 0 && attempt.PostRebuildSendFlowPackets == 0 { + return false, nil + } + attempt.CorrelationSnapshotAt = &now + if err := s.store.UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(ctx, fabricServiceChannelRouteRebuildCorrelationSnapshotInput(attempt, now)); err != nil { + return false, err + } + return true, nil +} + +func formatOptionalTime(value *time.Time) string { + if value == nil || value.IsZero() { + return "" + } + return value.UTC().Format(time.RFC3339Nano) +} + +func (s *Service) autoWarmFabricServiceChannelRouteRebuildSnapshotsAfterHeartbeat(ctx context.Context, heartbeat NodeHeartbeat) error { + clusterID := strings.TrimSpace(heartbeat.ClusterID) + nodeID := strings.TrimSpace(heartbeat.NodeID) + if clusterID == "" || nodeID == "" { + return nil + } + now := heartbeat.ObservedAt + if now.IsZero() { + now = s.now() + } + if now.IsZero() { + now = time.Now().UTC() + } + attempts, err := s.store.ListFabricServiceChannelRouteRebuildAttempts(ctx, ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: clusterID, + ReporterNodeID: nodeID, + Limit: 5, + }) + if err != nil { + return err + } + warmedCount := 0 + freshCount := 0 + errorCount := 0 + warmedAttemptIDs := []string{} + warmedRouteIDs := []string{} + warmedRebuildRequestIDs := []string{} + warmedGenerations := []string{} + for _, attempt := range attempts { + if fabricServiceChannelRouteRebuildHasCorrelationSnapshot(attempt) { + freshCount++ + continue + } + warmed, err := s.autoWarmFabricServiceChannelRouteRebuildAttemptSnapshot(ctx, clusterID, 
attempt, now) + if err != nil { + errorCount++ + continue + } + if warmed { + warmedCount++ + warmedAttemptIDs = append(warmedAttemptIDs, attempt.ID) + warmedRouteIDs = append(warmedRouteIDs, attempt.RouteID) + warmedRebuildRequestIDs = append(warmedRebuildRequestIDs, attempt.RebuildRequestID) + warmedGenerations = append(warmedGenerations, attempt.Generation) + } else { + freshCount++ + } + } + if warmedCount == 0 && errorCount == 0 { + return nil + } + targetID := nodeID + return s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_snapshot.auto_warmup", + TargetType: "fabric_service_channel_route_rebuild_snapshot", + TargetID: &targetID, + Payload: mustJSONRaw(map[string]any{ + "schema_version": "c18z45.rebuild_snapshot_auto_warmup.v1", + "trigger": "node_heartbeat", + "reporter_node_id": nodeID, + "heartbeat_id": heartbeat.ID, + "scanned_count": len(attempts), + "warmed_count": warmedCount, + "already_fresh_count": freshCount, + "error_count": errorCount, + "warmed_attempt_ids": warmedAttemptIDs, + "warmed_route_ids": warmedRouteIDs, + "warmed_rebuild_ids": warmedRebuildRequestIDs, + "warmed_generations": warmedGenerations, + }), + CreatedAt: now.UTC(), + }) +} + +func (s *Service) nodeMeshListenerConfig(ctx context.Context, input GetNodeSyntheticMeshConfigInput) (*NodeMeshListenerConfig, error) { + workloads, err := s.store.ListDesiredWorkloads(ctx, input.ClusterID, input.NodeID) + if err != nil { + return nil, err + } + for _, workload := range workloads { + if strings.TrimSpace(workload.ServiceType) != "mesh-listener" { + continue + } + cfg, err := nodeMeshListenerConfigFromDesired(workload) + if err != nil { + return nil, err + } + return cfg, nil + } + return nil, nil +} + +func (s *Service) desiredMeshListenerEndpointConfig(ctx context.Context, clusterID, nodeID string, priority int) (string, []PeerEndpointCandidate, error) { + listener, err := s.nodeMeshListenerConfig(ctx, 
GetNodeSyntheticMeshConfigInput{ClusterID: clusterID, NodeID: nodeID}) + if err != nil { + return "", nil, err + } + if listener == nil || + strings.TrimSpace(listener.DesiredState) != "enabled" || + strings.TrimSpace(listener.AdvertiseEndpoint) == "" { + return "", nil, nil + } + endpoint := strings.TrimRight(strings.TrimSpace(listener.AdvertiseEndpoint), "/") + if isUnusableLocalPeerEndpoint(endpoint) { + return "", nil, nil + } + transport := firstNonEmptyString(listener.AdvertiseTransport, "direct_http") + connectivityMode := firstNonEmptyString(listener.ConnectivityMode, "direct") + natType := firstNonEmptyString(listener.NATType, "unknown") + metadata, err := json.Marshal(map[string]any{ + "source": "desired_workload.mesh-listener", + "config_version": listener.ConfigVersion, + "listen_addr": listener.ListenAddr, + }) + if err != nil { + return "", nil, err + } + candidate := PeerEndpointCandidate{ + EndpointID: nodeID + "-desired-mesh-listener", + NodeID: nodeID, + Transport: transport, + Address: endpoint, + Reachability: reachabilityFromConnectivityMode(connectivityMode), + NATType: natType, + ConnectivityMode: connectivityMode, + Region: listener.Region, + Priority: priority, + PolicyTags: []string{"operator-configured", "desired-mesh-listener"}, + Metadata: metadata, + } + if err := validatePeerEndpointCandidates(map[string][]PeerEndpointCandidate{nodeID: []PeerEndpointCandidate{candidate}}, []string{nodeID}); err != nil { + return "", nil, err + } + return endpoint, []PeerEndpointCandidate{candidate}, nil +} + +func nodeMeshListenerConfigFromDesired(workload NodeWorkloadDesiredState) (*NodeMeshListenerConfig, error) { + var raw map[string]any + if len(workload.Config) > 0 { + if err := json.Unmarshal(workload.Config, &raw); err != nil { + return nil, ErrInvalidPayload + } + } + value := func(key string) string { + if raw == nil { + return "" + } + if text, ok := raw[key].(string); ok { + return strings.TrimSpace(text) + } + return "" + } + intValue := 
func(key string) int { + if raw == nil { + return 0 + } + switch v := raw[key].(type) { + case float64: + return int(v) + case int: + return v + } + return 0 + } + boolValue := func(key string) bool { + if raw == nil { + return false + } + switch v := raw[key].(type) { + case bool: + return v + case string: + switch strings.ToLower(strings.TrimSpace(v)) { + case "1", "true", "yes", "enabled": + return true + default: + return false + } + } + return false + } + mode := strings.ToLower(value("listen_port_mode")) + if workload.DesiredState != "enabled" { + mode = "disabled" + } + if mode == "" { + mode = "manual" + } + switch mode { + case "manual", "auto", "disabled": + default: + return nil, ErrInvalidPayload + } + listenAddr := value("listen_addr") + if listenAddr == "" && mode != "disabled" { + listenAddr = ":19131" + } + start := intValue("auto_port_start") + end := intValue("auto_port_end") + if start <= 0 { + start = 19131 + } + if end <= 0 { + end = 19231 + } + if start > end { + return nil, ErrInvalidPayload + } + productionForwarding := boolValue("production_forwarding") || boolValue("production_forwarding_enabled") + return &NodeMeshListenerConfig{ + SchemaVersion: "c17z23.mesh_listener_config.v1", + Source: "desired_workload.mesh-listener", + DesiredState: firstNonEmptyString(workload.DesiredState, "disabled"), + ListenAddr: listenAddr, + ListenPortMode: mode, + AutoPortStart: start, + AutoPortEnd: end, + AdvertiseEndpoint: strings.TrimRight(value("advertise_endpoint"), "/"), + AdvertiseTransport: value("advertise_transport"), + ConnectivityMode: value("connectivity_mode"), + NATType: value("nat_type"), + Region: value("region"), + ConfigVersion: stringPtrValue(workload.Version), + UpdatedByUserID: stringPtrValue(workload.UpdatedByUserID), + UpdatedAt: workload.UpdatedAt.UTC().Format(time.RFC3339Nano), + ControlPlaneOnly: !productionForwarding, + ProductionForwarding: productionForwarding, + }, nil +} + +func (s *Service) addCoreMeshBootstrapPeers(ctx 
context.Context, input GetNodeSyntheticMeshConfigInput, cfg *NodeSyntheticMeshConfig, peerDirectory map[string]*PeerDirectoryEntry, recoverySeeds map[string]PeerRecoverySeed, rendezvousLeases map[string]PeerRendezvousLease, localPerspective endpointPerspective) error { + roles, err := s.store.ListNodeRoleAssignments(ctx, input.ClusterID, input.NodeID) + if err != nil { + return err + } + if !hasActiveNodeRole(roles, "core-mesh") { + return nil + } + nodes, err := s.store.ListClusterNodes(ctx, input.ClusterID) + if err != nil { + return err + } + sort.SliceStable(nodes, func(i, j int) bool { + if nodes[i].HealthStatus != nodes[j].HealthStatus { + return nodes[i].HealthStatus == "healthy" + } + iSeen := nodeLastSeen(nodes[i]) + jSeen := nodeLastSeen(nodes[j]) + if !iSeen.Equal(jSeen) { + return iSeen.After(jSeen) + } + return nodes[i].CreatedAt.Before(nodes[j].CreatedAt) + }) + added := 0 + for _, node := range nodes { + if node.ID == input.NodeID || + node.ID == "" || + node.MembershipStatus != "active" || + node.RegistrationStatus != NodeRegistrationActive || + node.HealthStatus != "healthy" { + continue + } + desiredEndpoint, desiredCandidates, err := s.desiredMeshListenerEndpointConfig(ctx, input.ClusterID, node.ID, added) + if err != nil { + return err + } + if added >= defaultCoreMeshBootstrapPeerTarget && !hasDirectUsableEndpointCandidate(desiredCandidates) { + continue + } + heartbeats, err := s.store.ListNodeHeartbeats(ctx, input.ClusterID, node.ID, 1) + if err != nil { + return err + } + if len(heartbeats) == 0 && desiredEndpoint == "" && len(desiredCandidates) == 0 { + continue + } + endpoint := desiredEndpoint + candidates := append([]PeerEndpointCandidate{}, desiredCandidates...) + if len(heartbeats) > 0 { + reportedEndpoint, reportedCandidates, ok := endpointReportFromHeartbeat(heartbeats[0]) + if ok { + if endpoint == "" { + endpoint = reportedEndpoint + } + candidates = append(candidates, reportedCandidates...) 
+ } + } + endpoint, candidates = scopeEndpointReportForLocal(localPerspective, endpoint, candidates) + if endpoint != "" { + cfg.PeerEndpoints[node.ID] = endpoint + peerDirectoryEntry(peerDirectory, node.ID).EndpointCount++ + } + if len(candidates) > 0 { + cfg.PeerEndpointCandidates[node.ID] = append(cfg.PeerEndpointCandidates[node.ID], candidates...) + mergePeerDirectoryCandidates(peerDirectory, node.ID, candidates) + if lease, ok := controlPlaneBootstrapRendezvousLease(input.ClusterID, node.ID, candidates, localPerspective, s.now()); ok { + mergeRendezvousLeases(rendezvousLeases, []PeerRendezvousLease{lease}) + } + } + seed := recoverySeedFromEndpointReport(node.ID, endpoint, candidates, added) + if seed.NodeID != "" && !endpointCandidateRequiresRendezvous(PeerEndpointCandidate{ + Address: seed.Endpoint, + Transport: seed.Transport, + ConnectivityMode: seed.ConnectivityMode, + Reachability: reachabilityFromConnectivityMode(seed.ConnectivityMode), + }) { + mergeRecoverySeeds(recoverySeeds, []PeerRecoverySeed{seed}) + } + added++ + } + return nil +} + +func hasDirectUsableEndpointCandidate(candidates []PeerEndpointCandidate) bool { + for _, candidate := range candidates { + if strings.TrimSpace(candidate.Address) != "" && + !endpointCandidatePrivateForOffsite(candidate) && + !endpointCandidateRequiresRendezvous(candidate) { + return true + } + } + return false +} + func (s *Service) signSyntheticMeshConfig(ctx context.Context, cfg NodeSyntheticMeshConfig) (NodeSyntheticMeshConfig, error) { authorityKey, err := s.ensureClusterAuthority(ctx, cfg.ClusterID, nil) if err != nil { @@ -902,8 +7835,8 @@ func (s *Service) signSyntheticMeshConfig(ctx context.Context, cfg NodeSynthetic ConfigSHA256: configHash, IssuedAt: issuedAt, ExpiresAt: issuedAt.Add(5 * time.Minute), - ControlPlaneOnly: true, - ProductionForwarding: false, + ControlPlaneOnly: !cfg.ProductionForwarding, + ProductionForwarding: cfg.ProductionForwarding, } rawPayload, signature, err := 
clusterauth.SignPayload(authorityKey.PrivateKey, payload, issuedAt) if err != nil { @@ -974,8 +7907,11 @@ func (s *Service) SetDesiredWorkload(ctx context.Context, input SetDesiredWorklo } func (s *Service) ListDesiredWorkloads(ctx context.Context, actorUserID, clusterID, nodeID string) ([]NodeWorkloadDesiredState, error) { - if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { - return nil, err + actorUserID = strings.TrimSpace(actorUserID) + if actorUserID != "" { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return nil, err + } } if clusterID == "" || nodeID == "" { return nil, ErrInvalidPayload @@ -1052,6 +7988,7 @@ func (s *Service) CreateRouteIntent(ctx context.Context, input CreateRouteIntent if err != nil { return MeshRouteIntent{}, err } + item = routeIntentWithLifecycle(item, s.now()) _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ ClusterID: &input.ClusterID, ActorUserID: &input.ActorUserID, @@ -1068,7 +8005,109 @@ func (s *Service) ListRouteIntents(ctx context.Context, actorUserID, clusterID s if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { return nil, err } - return s.store.ListRouteIntents(ctx, clusterID) + items, err := s.store.ListRouteIntents(ctx, clusterID) + if err != nil { + return nil, err + } + return routeIntentsWithLifecycle(items, s.now()), nil +} + +func (s *Service) ExpireRouteIntent(ctx context.Context, input RouteIntentLifecycleInput) (MeshRouteIntent, error) { + input.ActorUserID = strings.TrimSpace(input.ActorUserID) + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.RouteIntentID = strings.TrimSpace(input.RouteIntentID) + input.Reason = strings.TrimSpace(input.Reason) + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return MeshRouteIntent{}, err + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return MeshRouteIntent{}, err + } + if input.ClusterID == "" || input.RouteIntentID == "" { + return 
MeshRouteIntent{}, ErrInvalidPayload + } + if input.Reason == "" { + input.Reason = "operator expired route intent" + } + expiresAt := s.now().UTC() + item, err := s.store.ExpireRouteIntent(ctx, input, expiresAt) + if err != nil { + return MeshRouteIntent{}, err + } + item = routeIntentWithLifecycle(item, s.now()) + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "mesh.route_intent.expired", + TargetType: "mesh_route_intent", + TargetID: &item.ID, + Payload: mustJSONRaw(map[string]any{"reason": input.Reason, "expires_at": expiresAt.Format(time.RFC3339Nano)}), + CreatedAt: s.now(), + }) + return item, nil +} + +func (s *Service) DisableRouteIntent(ctx context.Context, input RouteIntentLifecycleInput) (MeshRouteIntent, error) { + input.ActorUserID = strings.TrimSpace(input.ActorUserID) + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.RouteIntentID = strings.TrimSpace(input.RouteIntentID) + input.Reason = strings.TrimSpace(input.Reason) + if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { + return MeshRouteIntent{}, err + } + if err := s.ensureClusterMutable(ctx, input.ActorUserID, input.ClusterID); err != nil { + return MeshRouteIntent{}, err + } + if input.ClusterID == "" || input.RouteIntentID == "" { + return MeshRouteIntent{}, ErrInvalidPayload + } + if input.Reason == "" { + input.Reason = "operator disabled route intent" + } + item, err := s.store.DisableRouteIntent(ctx, input) + if err != nil { + return MeshRouteIntent{}, err + } + item = routeIntentWithLifecycle(item, s.now()) + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + ActorUserID: &input.ActorUserID, + EventType: "mesh.route_intent.disabled", + TargetType: "mesh_route_intent", + TargetID: &item.ID, + Payload: mustJSONRaw(map[string]any{"reason": input.Reason}), + CreatedAt: s.now(), + }) + return item, nil +} + +func routeIntentsWithLifecycle(items 
[]MeshRouteIntent, now time.Time) []MeshRouteIntent { + out := make([]MeshRouteIntent, 0, len(items)) + for _, item := range items { + out = append(out, routeIntentWithLifecycle(item, now)) + } + return out +} + +func routeIntentWithLifecycle(item MeshRouteIntent, now time.Time) MeshRouteIntent { + item.LifecycleStatus = strings.TrimSpace(item.Status) + var policy syntheticRoutePolicy + if err := json.Unmarshal(item.Policy, &policy); err == nil && policy.ExpiresAt != nil { + expiresAt := policy.ExpiresAt.UTC() + item.PolicyExpiresAt = &expiresAt + if !expiresAt.After(now.UTC()) { + item.IsExpired = true + } + } + switch { + case item.Status == "disabled": + item.LifecycleStatus = "disabled" + case item.IsExpired: + item.LifecycleStatus = "expired" + case item.LifecycleStatus == "": + item.LifecycleStatus = "active" + } + return item } func (s *Service) ListQoSPolicies(ctx context.Context, actorUserID, clusterID string) ([]MeshQoSPolicy, error) { @@ -1611,6 +8650,56 @@ func (s *Service) RenewVPNConnectionLease(ctx context.Context, input RenewVPNCon return item, err } +func (s *Service) RenewNodeVPNAssignmentLease(ctx context.Context, input RenewNodeVPNAssignmentLeaseInput) (VPNConnectionLease, error) { + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.VPNConnectionID = strings.TrimSpace(input.VPNConnectionID) + input.LeaseID = strings.TrimSpace(input.LeaseID) + input.OwnerNodeID = strings.TrimSpace(input.OwnerNodeID) + if input.ClusterID == "" || input.VPNConnectionID == "" || input.LeaseID == "" || input.OwnerNodeID == "" { + return VPNConnectionLease{}, ErrInvalidPayload + } + if input.TTL <= 0 { + input.TTL = 2 * time.Minute + } + if err := s.ensureVPNLeaseOwnerEligible(ctx, input.ClusterID, input.VPNConnectionID, input.OwnerNodeID); err != nil { + return VPNConnectionLease{}, err + } + assignments, err := s.store.ListNodeVPNAssignments(ctx, input.ClusterID, input.OwnerNodeID) + if err != nil { + return VPNConnectionLease{}, err + } + 
ownsVisibleLease := false + for _, assignment := range assignments { + if assignment.VPNConnectionID == input.VPNConnectionID && + assignment.AssignmentReason == "active_owner" && + assignment.ActiveLease != nil && + assignment.ActiveLease.LeaseID == input.LeaseID && + assignment.ActiveLease.OwnerNodeID == input.OwnerNodeID { + ownsVisibleLease = true + break + } + } + if !ownsVisibleLease { + return VPNConnectionLease{}, ErrVPNLeaseOwnerNotAllowed + } + item, err := s.store.RenewNodeVPNAssignmentLease(ctx, input, s.now().Add(input.TTL)) + if errors.Is(err, pgx.ErrNoRows) { + return VPNConnectionLease{}, ErrInvalidVPNLease + } + if err != nil { + return VPNConnectionLease{}, err + } + _ = s.store.RecordAudit(ctx, ClusterAuditEvent{ + ClusterID: &input.ClusterID, + EventType: "vpn_connection.lease_renewed_by_node", + TargetType: "vpn_connection", + TargetID: &input.VPNConnectionID, + Payload: json.RawMessage(`{"node_agent_runtime_executed":true}`), + CreatedAt: s.now(), + }) + return item, nil +} + func (s *Service) ReleaseVPNConnectionLease(ctx context.Context, input ReleaseVPNConnectionLeaseInput) (VPNConnectionLease, error) { if err := s.ensurePlatformAdmin(ctx, input.ActorUserID); err != nil { return VPNConnectionLease{}, err @@ -1765,11 +8854,692 @@ func (s *Service) ReportNodeVPNAssignmentStatus(ctx context.Context, input Repor return item, nil } -func (s *Service) ListAuditEvents(ctx context.Context, actorUserID, clusterID string, limit int) ([]ClusterAuditEvent, error) { +func (s *Service) GetVPNClientProfile( + ctx context.Context, + clusterID, organizationID, userID string, + preferredEntryNodeID ...string, +) (VPNClientProfile, error) { + clusterID = strings.TrimSpace(clusterID) + organizationID = strings.TrimSpace(organizationID) + userID = strings.TrimSpace(userID) + if clusterID == "" || organizationID == "" || userID == "" { + return VPNClientProfile{}, ErrInvalidPayload + } + preferredEntry := "" + if len(preferredEntryNodeID) > 0 { + preferredEntry 
= strings.TrimSpace(preferredEntryNodeID[0]) + } + + preferredExit := "" + if len(preferredEntryNodeID) > 1 { + preferredExit = strings.TrimSpace(preferredEntryNodeID[1]) + } + profile, err := s.store.GetVPNClientProfile(ctx, clusterID, organizationID, userID, preferredEntry, preferredExit, s.now().UTC()) + if err != nil { + return VPNClientProfile{}, err + } + if profile.ClusterID == "" { + profile.ClusterID = clusterID + } + if profile.OrganizationID == "" { + profile.OrganizationID = organizationID + } + if profile.UserID == "" { + profile.UserID = userID + } + profile = attachVPNDataplaneSessions(profile, s.now().UTC()) + if err := s.ensureVPNFabricRouteIntents(ctx, clusterID, profile); err != nil { + return VPNClientProfile{}, err + } + profile = s.attachVPNFabricServiceChannelLeases(ctx, profile) + return profile, nil +} + +func (s *Service) attachVPNFabricServiceChannelLeases(ctx context.Context, profile VPNClientProfile) VPNClientProfile { + for i := range profile.Connections { + connection := profile.Connections[i] + route := vpnFabricRouteFromClientConfig(connection.ClientConfig) + if route.Status != "planned" || route.SelectedEntryNodeID == "" || route.SelectedExitNodeID == "" { + continue + } + entryPool := dedupeStrings(append([]string{}, route.EntryPoolNodeIDs...)) + if len(entryPool) == 0 { + entryPool = dedupeStrings(append([]string{route.SelectedEntryNodeID}, connection.EntryNodeIDs...)) + } + exitPool := dedupeStrings(append([]string{}, route.ExitPoolNodeIDs...)) + if len(exitPool) == 0 { + exitPool = dedupeStrings(append([]string{route.SelectedExitNodeID, connection.ExitNodeID}, connection.AllowedNodeIDs...)) + } + lease, err := s.IssueFabricServiceChannelLease(ctx, IssueFabricServiceChannelLeaseInput{ + ClusterID: profile.ClusterID, + OrganizationID: profile.OrganizationID, + UserID: profile.UserID, + ResourceID: connection.ID, + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: entryPool, + ExitNodeIDs: exitPool, + 
PreferredEntryNodeID: route.SelectedEntryNodeID, + PreferredExitNodeID: route.SelectedExitNodeID, + AllowedChannels: []string{"vpn_packet", "fabric_control", FabricChannelBulk, FabricChannelControl}, + TTL: time.Minute, + }) + if err != nil { + profile.Connections[i].ClientConfig = attachVPNFabricServiceChannelError(connection.ClientConfig, err) + continue + } + profile.Connections[i].ClientConfig = attachVPNFabricServiceChannelLease(connection.ClientConfig, lease) + } + return profile +} + +func attachVPNFabricServiceChannelLease(raw json.RawMessage, lease FabricServiceChannelLease) json.RawMessage { + var cfg map[string]any + if err := json.Unmarshal(raw, &cfg); err != nil || cfg == nil { + cfg = map[string]any{} + } + cfg["fabric_service_channel_lease"] = lease + cfg["fabric_service_channel_status"] = lease.Status + out, err := json.Marshal(cfg) + if err != nil { + return raw + } + return out +} + +func attachVPNFabricServiceChannelError(raw json.RawMessage, err error) json.RawMessage { + var cfg map[string]any + if json.Unmarshal(raw, &cfg) != nil || cfg == nil { + cfg = map[string]any{} + } + cfg["fabric_service_channel_status"] = "error" + cfg["fabric_service_channel_error"] = err.Error() + out, marshalErr := json.Marshal(cfg) + if marshalErr != nil { + return raw + } + return out +} + +func attachVPNDataplaneSessions(profile VPNClientProfile, now time.Time) VPNClientProfile { + for i := range profile.Connections { + profile.Connections[i].ClientConfig = enrichVPNDataplaneSession(profile, profile.Connections[i], now) + } + return profile +} + +func enrichVPNDataplaneSession(profile VPNClientProfile, connection VPNClientConnection, now time.Time) json.RawMessage { + var cfg map[string]any + if err := json.Unmarshal(connection.ClientConfig, &cfg); err != nil || cfg == nil { + cfg = map[string]any{} + } + route := vpnFabricRouteFromClientConfig(connection.ClientConfig) + expiresAt := now.Add(time.Minute) + sessionID := uuidLikeRandom() + if sessionID == "" { + 
sessionID = "vpn-session-" + now.UTC().Format("20060102T150405.000000000Z") + } + entryCandidates := vpnDataplaneEntryCandidates(route, connection, cfg) + transportCandidates := vpnDataplaneTransportCandidates(route, entryCandidates) + status := "waiting_for_entry_endpoint" + if route.Status == "planned" && route.SelectedEntryNodeID != "" && route.SelectedExitNodeID != "" { + status = "ready_for_entry_listener" + } + cfg["vpn_dataplane_session"] = map[string]any{ + "schema_version": "rap.vpn_dataplane_session.v1", + "session_id": sessionID, + "status": status, + "issued_at": now, + "expires_at": expiresAt, + "cluster_id": profile.ClusterID, + "organization_id": profile.OrganizationID, + "user_id": profile.UserID, + "vpn_connection_id": connection.ID, + "entry_node_id": route.SelectedEntryNodeID, + "exit_node_id": route.SelectedExitNodeID, + "preferred_transport": "fabric_packet_quic_v1", + "fallback_transport": "backend_http_packet_relay", + "packet_contract": map[string]any{ + "tunnel_type": "universal_ip_packet", + "application_protocol_agnostic": true, + "all_ip_traffic": true, + "protocol_specific_routing": false, + }, + "auth": map[string]any{ + "type": "control_plane_issued_bearer", + "token": "rap_vpn_dps_" + sessionID, + "token_ttl_seconds": int(expiresAt.Sub(now).Seconds()), + "node_validation": "entry_node_calls_control_plane_introspection", + "introspection_path": "/api/v1/clusters/{cluster_id}/vpn/dataplane-sessions/{session_id}/introspect", + }, + "entry_candidates": entryCandidates, + "transport_candidates": transportCandidates, + } + out, err := json.Marshal(cfg) + if err != nil { + return connection.ClientConfig + } + return out +} + +func vpnDataplaneEntryCandidates(route vpnClientFabricRoute, connection VPNClientConnection, cfg map[string]any) []map[string]any { + concrete := vpnConcreteEntryCandidatesFromClientConfig(cfg) + ids := dedupeStrings(append([]string{route.SelectedEntryNodeID}, connection.EntryNodeIDs...)) + out := 
make([]map[string]any, 0, len(concrete)+len(ids)) + nodesWithConcrete := map[string]struct{}{} + for _, candidate := range concrete { + nodeID, _ := candidate["node_id"].(string) + if nodeID == "" { + continue + } + nodesWithConcrete[nodeID] = struct{}{} + enriched := make(map[string]any, len(candidate)+4) + for k, v := range candidate { + enriched[k] = v + } + status := "endpoint_reported" + if nodeID == route.SelectedEntryNodeID { + status = "selected_endpoint_reported" + } + reachability, _ := enriched["reachability"].(string) + if nodeID == route.SelectedEntryNodeID && strings.EqualFold(reachability, "public") { + status = "selected_endpoint_public" + } + enriched["status"] = status + enriched["endpoint_source"] = "node_latest_heartbeat.mesh_endpoint_report" + enriched["transports"] = []string{"entry_direct_http_v1", "fabric_packet_quic_v1", "fabric_packet_tcp_v1"} + out = append(out, enriched) + } + for _, nodeID := range ids { + if nodeID == "" { + continue + } + if _, ok := nodesWithConcrete[nodeID]; ok { + continue + } + status := "endpoint_pending" + if nodeID == route.SelectedEntryNodeID { + status = "selected_endpoint_pending" + } + out = append(out, map[string]any{ + "node_id": nodeID, + "status": status, + "transports": []string{"fabric_packet_quic_v1", "fabric_packet_tcp_v1"}, + "endpoint_source": "node_mesh_advertisement_pending", + }) + } + return out +} + +func vpnConcreteEntryCandidatesFromClientConfig(cfg map[string]any) []map[string]any { + raw, ok := cfg["vpn_entry_endpoint_candidates"] + if !ok { + return nil + } + payload, err := json.Marshal(raw) + if err != nil { + return nil + } + var out []map[string]any + if err := json.Unmarshal(payload, &out); err != nil { + return nil + } + return out +} + +func vpnDataplaneTransportCandidates(route vpnClientFabricRoute, entryCandidates []map[string]any) []map[string]any { + candidates := []map[string]any{ + { + "type": "fabric_packet_quic_v1", + "status": "contract_ready_listener_pending", + 
"entry_node_id": route.SelectedEntryNodeID, + "exit_node_id": route.SelectedExitNodeID, + "entry_candidates": entryCandidates, + "application_protocols": []string{"ip"}, + }, + } + if direct := vpnDirectHTTPEntryTransportCandidate(route, entryCandidates); direct != nil { + candidates = append(candidates, direct) + } + candidates = append(candidates, map[string]any{ + "type": "backend_http_packet_relay", + "status": "active_fallback", + "description": "current safe dataplane until entry listener is available", + }) + return candidates +} + +func vpnDirectHTTPEntryTransportCandidate(route vpnClientFabricRoute, entryCandidates []map[string]any) map[string]any { + var selected []map[string]any + hasPublic := false + hasHTTP := false + hasLocalGatewayShortcut := false + for _, candidate := range entryCandidates { + nodeID, _ := candidate["node_id"].(string) + if route.SelectedEntryNodeID != "" && nodeID != route.SelectedEntryNodeID { + continue + } + apiBaseURL, _ := candidate["api_base_url"].(string) + address, _ := candidate["address"].(string) + if apiBaseURL == "" && (strings.HasPrefix(address, "http://") || strings.HasPrefix(address, "https://")) { + apiBaseURL = strings.TrimRight(address, "/") + "/api/v1" + candidate["api_base_url"] = apiBaseURL + } + if apiBaseURL == "" { + continue + } + hasHTTP = true + reachability, _ := candidate["reachability"].(string) + if strings.EqualFold(reachability, "public") { + hasPublic = true + } + if value, ok := candidate["local_gateway_shortcut"].(bool); ok && value { + hasLocalGatewayShortcut = true + } + selected = append(selected, candidate) + } + if len(selected) == 0 { + return nil + } + status := "reported_private_or_unverified" + if hasPublic { + status = "available" + } else if hasHTTP { + status = "http_endpoint_reported_unverified" + } + safeClientSwitch := hasPublic + if route.SelectedEntryNodeID != "" && route.SelectedEntryNodeID == route.SelectedExitNodeID { + if hasPublic && hasLocalGatewayShortcut { + status = 
"available_local_gateway_shortcut" + safeClientSwitch = true + } else { + status = "available_local_gateway_shortcut_pending" + safeClientSwitch = false + } + } + return map[string]any{ + "type": "entry_direct_http_v1", + "status": status, + "entry_node_id": route.SelectedEntryNodeID, + "exit_node_id": route.SelectedExitNodeID, + "entry_candidates": selected, + "application_protocols": []string{"ip"}, + "safe_client_switch": safeClientSwitch, + } +} + +func uuidLikeRandom() string { + var raw [16]byte + if _, err := rand.Read(raw[:]); err != nil { + return "" + } + raw[6] = (raw[6] & 0x0f) | 0x40 + raw[8] = (raw[8] & 0x3f) | 0x80 + encoded := hex.EncodeToString(raw[:]) + return encoded[0:8] + "-" + encoded[8:12] + "-" + encoded[12:16] + "-" + encoded[16:20] + "-" + encoded[20:32] +} + +func (s *Service) ensureVPNFabricRouteIntents(ctx context.Context, clusterID string, profile VPNClientProfile) error { + intents, err := s.store.ListRouteIntents(ctx, clusterID) + if err != nil { + return err + } + existing := map[string]bool{} + for _, intent := range intents { + source, destination, ok := activeVPNPacketRouteIntent(intent, s.now()) + if !ok { + continue + } + existing[source+"->"+destination] = true + } + for _, connection := range profile.Connections { + route := vpnFabricRouteFromClientConfig(connection.ClientConfig) + if route.Status != "planned" || route.SelectedEntryNodeID == "" || route.SelectedExitNodeID == "" || route.SelectedEntryNodeID == route.SelectedExitNodeID { + continue + } + pairs := [][2]string{ + {route.SelectedEntryNodeID, route.SelectedExitNodeID}, + {route.SelectedExitNodeID, route.SelectedEntryNodeID}, + } + for _, pair := range pairs { + key := pair[0] + "->" + pair[1] + if existing[key] { + continue + } + if _, err := s.store.CreateRouteIntent(ctx, CreateRouteIntentInput{ + ClusterID: clusterID, + SourceSelector: mustJSONRaw(map[string]any{"node_id": pair[0]}), + DestinationSelector: mustJSONRaw(map[string]any{"node_id": pair[1]}), + 
ServiceClass: "vpn_packets", + Priority: 10, + Policy: mustJSONRaw(vpnFabricRouteIntentPolicy(pair[0], pair[1], s.now().UTC().Add(30*24*time.Hour))), + }); err != nil { + return err + } + existing[key] = true + } + } + return nil +} + +type vpnClientFabricRoute struct { + Status string `json:"status"` + SelectedEntryNodeID string `json:"selected_entry_node_id"` + SelectedExitNodeID string `json:"selected_exit_node_id"` + EntryPoolNodeIDs []string `json:"entry_pool_node_ids"` + ExitPoolNodeIDs []string `json:"exit_pool_node_ids"` +} + +func vpnFabricRouteFromClientConfig(raw json.RawMessage) vpnClientFabricRoute { + var cfg struct { + Route vpnClientFabricRoute `json:"vpn_fabric_route"` + } + if len(raw) == 0 { + return vpnClientFabricRoute{} + } + _ = json.Unmarshal(raw, &cfg) + return cfg.Route +} + +func activeVPNPacketRouteIntent(intent MeshRouteIntent, now time.Time) (string, string, bool) { + if intent.Status != "active" || intent.ServiceClass != "vpn_packets" { + return "", "", false + } + var policy syntheticRoutePolicy + if err := json.Unmarshal(intent.Policy, &policy); err != nil || !containsString(policy.AllowedChannels, "vpn_packet") { + return "", "", false + } + if policy.ExpiresAt != nil && !policy.ExpiresAt.After(now.UTC()) { + return "", "", false + } + var source nodeSelector + var destination nodeSelector + _ = json.Unmarshal(intent.SourceSelector, &source) + _ = json.Unmarshal(intent.DestinationSelector, &destination) + sourceNodeID := firstNodeID(source) + destinationNodeID := firstNodeID(destination) + if sourceNodeID == "" || destinationNodeID == "" { + return "", "", false + } + return sourceNodeID, destinationNodeID, true +} + +func vpnFabricRouteIntentPolicy(sourceNodeID, destinationNodeID string, expiresAt time.Time) map[string]any { + version := "vpn-fabric-" + expiresAt.UTC().Format("20060102T150405Z") + return map[string]any{ + "synthetic_enabled": true, + "hops": []string{sourceNodeID, destinationNodeID}, + "allowed_channels": 
[]string{"vpn_packet", "fabric_control"}, + "max_ttl": 8, + "max_hops": 8, + "expires_at": expiresAt.UTC().Format(time.RFC3339), + "route_version": version, + "policy_version": version, + "peer_directory_version": version, + "backend_relay_fallback": true, + "data_plane_preference": "fabric_mesh", + "route_owner": "vpn_client_profile", + "route_refresh_required": true, + "route_refresh_threshold": "24h", + } +} + +func mustJSONRaw(value any) json.RawMessage { + raw, err := json.Marshal(value) + if err != nil { + return json.RawMessage(`{}`) + } + return raw +} + +func (s *Service) ListAuditEvents(ctx context.Context, actorUserID string, input ListAuditEventsInput) ([]ClusterAuditEvent, error) { if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { return nil, err } - return s.store.ListAuditEvents(ctx, clusterID, limit) + input.ClusterID = strings.TrimSpace(input.ClusterID) + input.EventTypes = trimStringSlice(input.EventTypes) + input.TargetTypes = trimStringSlice(input.TargetTypes) + input.Correlation = strings.TrimSpace(input.Correlation) + events, err := s.store.ListAuditEvents(ctx, input) + if err != nil { + return nil, err + } + if input.Correlation == "fabric_diagnostics" { + events = s.withFabricDiagnosticsAuditCorrelation(ctx, actorUserID, input.ClusterID, events) + } + return events, nil +} + +func (s *Service) ListFabricServiceChannelRebuildInvestigationBreadcrumbs(ctx context.Context, actorUserID string, input ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput) (FabricServiceChannelRebuildInvestigationBreadcrumbs, error) { + if err := s.ensurePlatformAdmin(ctx, actorUserID); err != nil { + return FabricServiceChannelRebuildInvestigationBreadcrumbs{}, err + } + input.ClusterID = strings.TrimSpace(input.ClusterID) + if input.ClusterID == "" { + return FabricServiceChannelRebuildInvestigationBreadcrumbs{}, ErrInvalidPayload + } + if input.Limit <= 0 || input.Limit > 100 { + input.Limit = 20 + } + cluster, err := s.store.GetCluster(ctx, 
input.ClusterID) + if errors.Is(err, pgx.ErrNoRows) { + return FabricServiceChannelRebuildInvestigationBreadcrumbs{}, ErrInvalidCluster + } + if err != nil { + return FabricServiceChannelRebuildInvestigationBreadcrumbs{}, err + } + windowPolicy := fabricServiceChannelBreadcrumbWindowPolicyFromCluster(cluster) + if input.CurrentWindowSeconds <= 0 { + input.CurrentWindowSeconds = windowPolicy.CurrentWindowSeconds + } + if input.HistoryWindowSeconds <= 0 { + input.HistoryWindowSeconds = windowPolicy.HistoryWindowSeconds + } + if input.HistoryWindowSeconds < input.CurrentWindowSeconds { + input.HistoryWindowSeconds = input.CurrentWindowSeconds + } + events, err := s.ListAuditEvents(ctx, actorUserID, ListAuditEventsInput{ + ClusterID: input.ClusterID, + EventTypes: []string{ + "fabric.service_channel_rebuild_feedback_breakdown.investigation_opened", + "fabric.service_channel_rebuild_incident.investigation_opened", + }, + Correlation: "fabric_diagnostics", + Limit: input.Limit, + }) + if err != nil { + return FabricServiceChannelRebuildInvestigationBreadcrumbs{}, err + } + events = withFabricDiagnosticsBreadcrumbFreshness(events, s.now(), input.CurrentWindowSeconds, input.HistoryWindowSeconds) + summary := summarizeClusterAuditEvents(events) + return FabricServiceChannelRebuildInvestigationBreadcrumbs{ + ClusterID: input.ClusterID, + Events: events, + Summary: summary, + CurrentWindowSeconds: input.CurrentWindowSeconds, + HistoryWindowSeconds: input.HistoryWindowSeconds, + CurrentCount: summary.CountsByBreadcrumbStatus["current"], + StaleCount: summary.CountsByBreadcrumbStatus["stale"], + ExpiredCount: summary.CountsByBreadcrumbStatus["expired"], + }, nil +} + +func withFabricDiagnosticsBreadcrumbFreshness(events []ClusterAuditEvent, now time.Time, currentWindowSeconds, historyWindowSeconds int64) []ClusterAuditEvent { + if len(events) == 0 { + return events + } + if now.IsZero() { + now = time.Now().UTC() + } + for index := range events { + if 
events[index].CorrelationHints == nil { + events[index].CorrelationHints = &ClusterAuditCorrelationHints{Scope: "fabric_diagnostics"} + } + ageSeconds := int64(0) + if !events[index].CreatedAt.IsZero() { + ageSeconds = int64(now.Sub(events[index].CreatedAt).Seconds()) + if ageSeconds < 0 { + ageSeconds = 0 + } + } + status := "current" + if ageSeconds > historyWindowSeconds { + status = "expired" + } else if ageSeconds > currentWindowSeconds { + status = "stale" + } + events[index].CorrelationHints.BreadcrumbStatus = status + events[index].CorrelationHints.BreadcrumbAgeSeconds = ageSeconds + events[index].CorrelationHints.BreadcrumbCurrentWindow = currentWindowSeconds + events[index].CorrelationHints.BreadcrumbHistoryWindow = historyWindowSeconds + } + return events +} + +func (s *Service) withFabricDiagnosticsAuditCorrelation(ctx context.Context, actorUserID, clusterID string, events []ClusterAuditEvent) []ClusterAuditEvent { + if len(events) == 0 { + return events + } + health, healthErr := s.GetFabricServiceChannelRouteRebuildHealthSummary(ctx, actorUserID, GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: clusterID, + Limit: 5, + }) + incidents, incidentsErr := s.ListFabricServiceChannelRouteRebuildIncidents(ctx, actorUserID, ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: clusterID, + Limit: 20, + }) + for index := range events { + hints := ClusterAuditCorrelationHints{ + Scope: "fabric_diagnostics", + CurrentDiagnosticStatus: "not_visible", + } + if healthErr == nil { + if breakdown := fabricAuditMatchingFeedbackBreakdown(events[index], health.FeedbackBreakdowns); breakdown != nil { + hints.CurrentDiagnosticStatus = "breakdown_active" + hints.FeedbackBreakdown = breakdown + hints.RecommendedAction = "open_filtered_rebuild_ledger" + } + } + if hints.FeedbackBreakdown == nil && incidentsErr == nil { + if incident := fabricAuditMatchingRebuildIncident(events[index], incidents); incident != nil { + hints.CurrentDiagnosticStatus 
= "incident_visible" + hints.RebuildIncident = incident + hints.RecommendedAction = "open_deep_rebuild_ledger" + } + } + events[index].CorrelationHints = &hints + } + return events +} + +func fabricAuditMatchingFeedbackBreakdown(event ClusterAuditEvent, breakdowns []FabricServiceChannelRouteRebuildFeedbackHealthBreakdown) *FabricServiceChannelRouteRebuildFeedbackHealthBreakdown { + payload := jsonObject(event.Payload) + feedbackSource := jsonString(payload, "feedback_source") + feedbackChannelID := jsonString(payload, "feedback_channel_id") + feedbackViolationStatus := jsonString(payload, "feedback_violation_status") + reporterNodeID := jsonString(payload, "reporter_node_id") + routeID := jsonString(payload, "route_id") + if feedbackSource == "" && feedbackChannelID == "" && feedbackViolationStatus == "" { + return nil + } + for index := range breakdowns { + item := breakdowns[index] + if feedbackSource != "" && item.FeedbackSource != feedbackSource { + continue + } + if feedbackChannelID != "" && item.FeedbackChannelID != feedbackChannelID { + continue + } + if feedbackViolationStatus != "" && item.FeedbackViolationStatus != feedbackViolationStatus { + continue + } + if reporterNodeID != "" && !containsString(item.AffectedReporterNodeIDs, reporterNodeID) { + continue + } + if routeID != "" && !containsString(item.AffectedRouteIDs, routeID) { + continue + } + return &item + } + return nil +} + +func fabricAuditMatchingRebuildIncident(event ClusterAuditEvent, incidents []FabricServiceChannelRouteRebuildIncident) *FabricServiceChannelRouteRebuildIncident { + payload := jsonObject(event.Payload) + reporterNodeID := jsonString(payload, "reporter_node_id") + routeID := jsonString(payload, "route_id") + if routeID == "" && event.TargetType == "fabric_service_channel_route_rebuild_incident" && event.TargetID != nil { + routeID = *event.TargetID + } + serviceClass := jsonString(payload, "service_class") + generation := jsonString(payload, "generation") + guardStatus := 
jsonString(payload, "guard_status") + for index := range incidents { + item := incidents[index] + if reporterNodeID != "" && item.ReporterNodeID != reporterNodeID { + continue + } + if routeID != "" && item.RouteID != routeID { + continue + } + if serviceClass != "" && item.ServiceClass != serviceClass { + continue + } + if generation != "" && item.Generation != generation { + continue + } + if guardStatus != "" && item.GuardStatus != guardStatus { + continue + } + if reporterNodeID == "" && routeID == "" && serviceClass == "" && generation == "" && guardStatus == "" { + continue + } + return &item + } + return nil +} + +func summarizeClusterAuditEvents(events []ClusterAuditEvent) ClusterAuditSummary { + summary := ClusterAuditSummary{ + TotalCount: len(events), + CountsByEventType: map[string]int{}, + CountsByTargetType: map[string]int{}, + CountsByCurrentDiagnosticStatus: map[string]int{}, + CountsByFeedbackSource: map[string]int{}, + CountsByFeedbackViolationStatus: map[string]int{}, + CountsByBreadcrumbStatus: map[string]int{}, + } + for _, event := range events { + if event.EventType != "" { + summary.CountsByEventType[event.EventType]++ + } + if event.TargetType != "" { + summary.CountsByTargetType[event.TargetType]++ + } + if event.CreatedAt.After(summary.LatestAt) { + summary.LatestAt = event.CreatedAt.UTC() + } + payload := jsonObject(event.Payload) + if source := jsonString(payload, "feedback_source"); source != "" { + summary.CountsByFeedbackSource[source]++ + } + if status := jsonString(payload, "feedback_violation_status"); status != "" { + summary.CountsByFeedbackViolationStatus[status]++ + } + if event.CorrelationHints == nil { + continue + } + if breadcrumbStatus := strings.TrimSpace(event.CorrelationHints.BreadcrumbStatus); breadcrumbStatus != "" { + summary.CountsByBreadcrumbStatus[breadcrumbStatus]++ + } + status := firstNonEmptyString(event.CorrelationHints.CurrentDiagnosticStatus, "unknown") + summary.CountsByCurrentDiagnosticStatus[status]++ + 
if status == "not_visible" { + summary.NotVisibleCount++ + } else { + summary.CorrelatedCount++ + } + } + return summary } func (s *Service) ensurePlatformAdmin(ctx context.Context, userID string) error { @@ -1925,6 +9695,360 @@ type syntheticRoutePolicy struct { RouteVersion string `json:"route_version"` PolicyVersion string `json:"policy_version"` PeerDirectoryVersion string `json:"peer_directory_version"` + Metadata map[string]any `json:"metadata"` +} + +type dockerInstallProfileScope struct { + BackendURL string `json:"backend_url"` + ControlPlaneEndpoints []string `json:"control_plane_endpoints"` + ArtifactEndpoints []string `json:"artifact_endpoints"` + DockerImageArtifactURLs []string `json:"docker_image_artifact_urls"` + DockerImageArtifactSHA256 string `json:"docker_image_artifact_sha256"` + DockerImageArtifactSizeBytes int64 `json:"docker_image_artifact_size_bytes"` + NodeAgentArtifactURLs []string `json:"node_agent_artifact_urls"` + NodeAgentArtifactSHA256 string `json:"node_agent_artifact_sha256"` + NodeAgentArtifactSizeBytes int64 `json:"node_agent_artifact_size_bytes"` + Roles []string `json:"roles"` + NodeName string `json:"node_name"` + NodeGroupID string `json:"node_group_id"` + Image string `json:"image"` + ContainerName string `json:"container_name"` + StateDir string `json:"state_dir"` + InstallDir string `json:"install_dir"` + StartupMode string `json:"startup_mode"` + Network string `json:"network"` + RestartPolicy string `json:"restart_policy"` + PullImage *bool `json:"pull_image"` + Replace *bool `json:"replace"` + DockerVPNGatewayEnabled *bool `json:"docker_vpn_gateway_enabled"` + WorkloadSupervisionEnabled *bool `json:"workload_supervision_enabled"` + MeshSyntheticRuntimeEnabled *bool `json:"mesh_synthetic_runtime_enabled"` + MeshProductionForwardingEnabled *bool `json:"mesh_production_forwarding_enabled"` + MeshListenAddr string `json:"mesh_listen_addr"` + MeshListenPortMode string `json:"mesh_listen_port_mode"` + MeshListenAutoPortStart 
int `json:"mesh_listen_auto_port_start"` + MeshListenAutoPortEnd int `json:"mesh_listen_auto_port_end"` + MeshAdvertiseEndpoint string `json:"mesh_advertise_endpoint"` + MeshAdvertiseEndpointsJSON json.RawMessage `json:"mesh_advertise_endpoints_json"` + MeshAdvertiseTransport string `json:"mesh_advertise_transport"` + MeshConnectivityMode string `json:"mesh_connectivity_mode"` + MeshNATType string `json:"mesh_nat_type"` + MeshRegion string `json:"mesh_region"` + HeartbeatIntervalSeconds int `json:"heartbeat_interval_seconds"` + EnrollmentPollIntervalSeconds int `json:"enrollment_poll_interval_seconds"` + EnrollmentPollTimeoutSeconds int `json:"enrollment_poll_timeout_seconds"` + ProductionObservationSinkCapacity int `json:"production_observation_sink_capacity"` +} + +func dockerInstallProfileFromScope(input DockerInstallProfileRequest, scopeRaw json.RawMessage) (DockerInstallProfile, error) { + var scope dockerInstallProfileScope + if len(scopeRaw) > 0 { + if !json.Valid(scopeRaw) { + return DockerInstallProfile{}, ErrInvalidPayload + } + if err := json.Unmarshal(scopeRaw, &scope); err != nil { + return DockerInstallProfile{}, ErrInvalidPayload + } + } + nodeName := firstNonEmptyString(strings.TrimSpace(input.NodeName), scope.NodeName) + if nodeName == "" { + nodeName = "docker-node" + } + containerName := firstNonEmptyString(scope.ContainerName, "rap-node-agent-"+safeInstallProfileSlug(nodeName)) + roles := trimStringSlice(scope.Roles) + profile := DockerInstallProfile{ + SchemaVersion: "rap.docker_install_profile.v1", + BackendURL: strings.TrimRight(strings.TrimSpace(scope.BackendURL), "/"), + ControlPlaneEndpoints: trimStringSlice(scope.ControlPlaneEndpoints), + ArtifactEndpoints: trimEndpointSlice(scope.ArtifactEndpoints), + Roles: roles, + NodeName: nodeName, + Image: firstNonEmptyString(scope.Image, "rap-node-agent:latest"), + ContainerName: containerName, + StateDir: firstNonEmptyString(scope.StateDir, "/var/lib/rap/nodes/"+safeInstallProfileSlug(nodeName)), 
+ Network: firstNonEmptyString(scope.Network, "host"), + RestartPolicy: firstNonEmptyString(scope.RestartPolicy, "unless-stopped"), + PullImage: boolPtrValue(scope.PullImage, false), + Replace: boolPtrValue(scope.Replace, true), + DockerVPNGatewayEnabled: boolPtrValue(scope.DockerVPNGatewayEnabled, containsString(roles, "vpn-exit")), + WorkloadSupervisionEnabled: boolPtrValue(scope.WorkloadSupervisionEnabled, false), + MeshSyntheticRuntimeEnabled: boolPtrValue(scope.MeshSyntheticRuntimeEnabled, true), + MeshProductionForwardingEnabled: boolPtrValue(scope.MeshProductionForwardingEnabled, false), + MeshListenAddr: firstNonEmptyString(scope.MeshListenAddr, ":19131"), + MeshListenPortMode: firstNonEmptyString(strings.ToLower(strings.TrimSpace(scope.MeshListenPortMode)), "auto"), + MeshListenAutoPortStart: positiveOrDefault(scope.MeshListenAutoPortStart, 19131), + MeshListenAutoPortEnd: positiveOrDefault(scope.MeshListenAutoPortEnd, 19231), + MeshAdvertiseEndpoint: strings.TrimRight(strings.TrimSpace(scope.MeshAdvertiseEndpoint), "/"), + MeshAdvertiseEndpointsJSON: scope.MeshAdvertiseEndpointsJSON, + MeshAdvertiseTransport: strings.TrimSpace(scope.MeshAdvertiseTransport), + MeshConnectivityMode: strings.TrimSpace(scope.MeshConnectivityMode), + MeshNATType: strings.TrimSpace(scope.MeshNATType), + MeshRegion: strings.TrimSpace(scope.MeshRegion), + HeartbeatIntervalSeconds: positiveOrDefault(scope.HeartbeatIntervalSeconds, 15), + EnrollmentPollIntervalSeconds: positiveOrDefault(scope.EnrollmentPollIntervalSeconds, 5), + EnrollmentPollTimeoutSeconds: nonNegativeOrDefault(scope.EnrollmentPollTimeoutSeconds, 0), + ProductionObservationSinkCapacity: scope.ProductionObservationSinkCapacity, + } + profile.DockerImageArtifact = dockerImageArtifactFromScope(profile.Image, profile.ArtifactEndpoints, scope) + if profile.BackendURL == "" && len(profile.ControlPlaneEndpoints) > 0 { + profile.BackendURL = profile.ControlPlaneEndpoints[0] + } + if profile.BackendURL == "" { + return 
DockerInstallProfile{}, ErrInvalidPayload + } + if len(profile.ArtifactEndpoints) == 0 { + if endpoint := defaultArtifactEndpointFromBackendURL(profile.BackendURL); endpoint != "" { + profile.ArtifactEndpoints = []string{endpoint} + profile.DockerImageArtifact = dockerImageArtifactFromScope(profile.Image, profile.ArtifactEndpoints, scope) + } + } + if len(profile.MeshAdvertiseEndpointsJSON) > 0 && !json.Valid(profile.MeshAdvertiseEndpointsJSON) { + return DockerInstallProfile{}, ErrInvalidPayload + } + switch profile.MeshListenPortMode { + case "manual", "auto", "disabled": + default: + return DockerInstallProfile{}, ErrInvalidPayload + } + if profile.MeshListenAutoPortStart > profile.MeshListenAutoPortEnd { + return DockerInstallProfile{}, ErrInvalidPayload + } + return profile, nil +} + +func windowsInstallProfileFromScope(input DockerInstallProfileRequest, scopeRaw json.RawMessage) (WindowsInstallProfile, error) { + var scope dockerInstallProfileScope + if len(scopeRaw) > 0 { + if !json.Valid(scopeRaw) { + return WindowsInstallProfile{}, ErrInvalidPayload + } + if err := json.Unmarshal(scopeRaw, &scope); err != nil { + return WindowsInstallProfile{}, ErrInvalidPayload + } + } + nodeName := firstNonEmptyString(strings.TrimSpace(input.NodeName), scope.NodeName) + if nodeName == "" { + nodeName = "windows-node" + } + profile := WindowsInstallProfile{ + SchemaVersion: "rap.windows_install_profile.v1", + BackendURL: strings.TrimRight(strings.TrimSpace(scope.BackendURL), "/"), + ControlPlaneEndpoints: trimStringSlice(scope.ControlPlaneEndpoints), + ArtifactEndpoints: trimEndpointSlice(scope.ArtifactEndpoints), + Roles: trimStringSlice(scope.Roles), + NodeName: nodeName, + StateDir: firstNonEmptyString(scope.StateDir, `C:\ProgramData\RAP\nodes\`+safeInstallProfileSlug(nodeName)), + InstallDir: firstNonEmptyString(scope.InstallDir, `C:\Program Files\RAP\`+safeInstallProfileSlug(nodeName)), + StartupMode: 
firstNonEmptyString(strings.ToLower(strings.TrimSpace(scope.StartupMode)), "auto"), + WorkloadSupervisionEnabled: boolPtrValue(scope.WorkloadSupervisionEnabled, false), + MeshSyntheticRuntimeEnabled: boolPtrValue(scope.MeshSyntheticRuntimeEnabled, true), + MeshProductionForwardingEnabled: boolPtrValue(scope.MeshProductionForwardingEnabled, false), + MeshListenAddr: firstNonEmptyString(scope.MeshListenAddr, ":19131"), + MeshListenPortMode: firstNonEmptyString(strings.ToLower(strings.TrimSpace(scope.MeshListenPortMode)), "auto"), + MeshListenAutoPortStart: positiveOrDefault(scope.MeshListenAutoPortStart, 19131), + MeshListenAutoPortEnd: positiveOrDefault(scope.MeshListenAutoPortEnd, 19231), + MeshAdvertiseEndpoint: strings.TrimRight(strings.TrimSpace(scope.MeshAdvertiseEndpoint), "/"), + MeshAdvertiseEndpointsJSON: scope.MeshAdvertiseEndpointsJSON, + MeshAdvertiseTransport: strings.TrimSpace(scope.MeshAdvertiseTransport), + MeshConnectivityMode: firstNonEmptyString(strings.TrimSpace(scope.MeshConnectivityMode), "outbound_only"), + MeshNATType: firstNonEmptyString(strings.TrimSpace(scope.MeshNATType), "unknown"), + MeshRegion: firstNonEmptyString(strings.TrimSpace(scope.MeshRegion), "windows"), + HeartbeatIntervalSeconds: positiveOrDefault(scope.HeartbeatIntervalSeconds, 15), + EnrollmentPollIntervalSeconds: positiveOrDefault(scope.EnrollmentPollIntervalSeconds, 5), + EnrollmentPollTimeoutSeconds: nonNegativeOrDefault(scope.EnrollmentPollTimeoutSeconds, 0), + ProductionObservationSinkCapacity: scope.ProductionObservationSinkCapacity, + } + profile.NodeAgentArtifact = windowsNodeAgentArtifactFromScope(profile.ArtifactEndpoints, scope) + if profile.BackendURL == "" && len(profile.ControlPlaneEndpoints) > 0 { + profile.BackendURL = profile.ControlPlaneEndpoints[0] + } + if profile.BackendURL == "" { + return WindowsInstallProfile{}, ErrInvalidPayload + } + if len(profile.ArtifactEndpoints) == 0 { + if endpoint := defaultArtifactEndpointFromBackendURL(profile.BackendURL); 
endpoint != "" { + profile.ArtifactEndpoints = []string{endpoint} + profile.NodeAgentArtifact = windowsNodeAgentArtifactFromScope(profile.ArtifactEndpoints, scope) + } + } + if len(profile.MeshAdvertiseEndpointsJSON) > 0 && !json.Valid(profile.MeshAdvertiseEndpointsJSON) { + return WindowsInstallProfile{}, ErrInvalidPayload + } + switch profile.MeshListenPortMode { + case "manual", "auto", "disabled": + default: + return WindowsInstallProfile{}, ErrInvalidPayload + } + switch profile.StartupMode { + case "auto", "system-task", "user-task", "none": + default: + return WindowsInstallProfile{}, ErrInvalidPayload + } + if profile.MeshListenAutoPortStart > profile.MeshListenAutoPortEnd { + return WindowsInstallProfile{}, ErrInvalidPayload + } + return profile, nil +} + +func linuxInstallProfileFromScope(input DockerInstallProfileRequest, scopeRaw json.RawMessage) (LinuxInstallProfile, error) { + var scope dockerInstallProfileScope + if len(scopeRaw) > 0 { + if !json.Valid(scopeRaw) { + return LinuxInstallProfile{}, ErrInvalidPayload + } + if err := json.Unmarshal(scopeRaw, &scope); err != nil { + return LinuxInstallProfile{}, ErrInvalidPayload + } + } + nodeName := firstNonEmptyString(strings.TrimSpace(input.NodeName), scope.NodeName) + if nodeName == "" { + nodeName = "linux-node" + } + slug := safeInstallProfileSlug(nodeName) + profile := LinuxInstallProfile{ + SchemaVersion: "rap.linux_install_profile.v1", + BackendURL: strings.TrimRight(strings.TrimSpace(scope.BackendURL), "/"), + ControlPlaneEndpoints: trimStringSlice(scope.ControlPlaneEndpoints), + ArtifactEndpoints: trimEndpointSlice(scope.ArtifactEndpoints), + Roles: trimStringSlice(scope.Roles), + NodeName: nodeName, + StateDir: firstNonEmptyString(scope.StateDir, "/var/lib/rap/nodes/"+slug), + InstallDir: firstNonEmptyString(scope.InstallDir, "/opt/rap/"+slug), + StartupMode: firstNonEmptyString(strings.ToLower(strings.TrimSpace(scope.StartupMode)), "systemd"), + WorkloadSupervisionEnabled: 
boolPtrValue(scope.WorkloadSupervisionEnabled, false), + MeshSyntheticRuntimeEnabled: boolPtrValue(scope.MeshSyntheticRuntimeEnabled, true), + MeshProductionForwardingEnabled: boolPtrValue(scope.MeshProductionForwardingEnabled, false), + MeshListenAddr: firstNonEmptyString(scope.MeshListenAddr, ":19131"), + MeshListenPortMode: firstNonEmptyString(strings.ToLower(strings.TrimSpace(scope.MeshListenPortMode)), "auto"), + MeshListenAutoPortStart: positiveOrDefault(scope.MeshListenAutoPortStart, 19131), + MeshListenAutoPortEnd: positiveOrDefault(scope.MeshListenAutoPortEnd, 19231), + MeshAdvertiseEndpoint: strings.TrimRight(strings.TrimSpace(scope.MeshAdvertiseEndpoint), "/"), + MeshAdvertiseEndpointsJSON: scope.MeshAdvertiseEndpointsJSON, + MeshAdvertiseTransport: firstNonEmptyString(strings.TrimSpace(scope.MeshAdvertiseTransport), "direct_http"), + MeshConnectivityMode: firstNonEmptyString(strings.TrimSpace(scope.MeshConnectivityMode), "outbound_only"), + MeshNATType: firstNonEmptyString(strings.TrimSpace(scope.MeshNATType), "unknown"), + MeshRegion: firstNonEmptyString(strings.TrimSpace(scope.MeshRegion), "linux"), + HeartbeatIntervalSeconds: positiveOrDefault(scope.HeartbeatIntervalSeconds, 15), + EnrollmentPollIntervalSeconds: positiveOrDefault(scope.EnrollmentPollIntervalSeconds, 5), + EnrollmentPollTimeoutSeconds: nonNegativeOrDefault(scope.EnrollmentPollTimeoutSeconds, 0), + ProductionObservationSinkCapacity: scope.ProductionObservationSinkCapacity, + } + profile.NodeAgentArtifact = linuxNodeAgentArtifactFromScope(profile.ArtifactEndpoints, scope) + if profile.BackendURL == "" && len(profile.ControlPlaneEndpoints) > 0 { + profile.BackendURL = profile.ControlPlaneEndpoints[0] + } + if profile.BackendURL == "" { + return LinuxInstallProfile{}, ErrInvalidPayload + } + if len(profile.ArtifactEndpoints) == 0 { + if endpoint := defaultArtifactEndpointFromBackendURL(profile.BackendURL); endpoint != "" { + profile.ArtifactEndpoints = []string{endpoint} + 
profile.NodeAgentArtifact = linuxNodeAgentArtifactFromScope(profile.ArtifactEndpoints, scope) + } + } + if len(profile.MeshAdvertiseEndpointsJSON) > 0 && !json.Valid(profile.MeshAdvertiseEndpointsJSON) { + return LinuxInstallProfile{}, ErrInvalidPayload + } + switch profile.MeshListenPortMode { + case "manual", "auto", "disabled": + default: + return LinuxInstallProfile{}, ErrInvalidPayload + } + switch profile.StartupMode { + case "auto", "systemd", "none": + default: + return LinuxInstallProfile{}, ErrInvalidPayload + } + if profile.MeshListenAutoPortStart > profile.MeshListenAutoPortEnd { + return LinuxInstallProfile{}, ErrInvalidPayload + } + return profile, nil +} + +func linuxNodeAgentArtifactFromScope(artifactEndpoints []string, scope dockerInstallProfileScope) *DockerArtifact { + urls := trimEndpointSlice(scope.NodeAgentArtifactURLs) + if len(urls) == 0 { + for _, endpoint := range artifactEndpoints { + urls = append(urls, strings.TrimRight(endpoint, "/")+"/rap-node-agent-linux-amd64") + } + } + sha256 := strings.TrimSpace(scope.NodeAgentArtifactSHA256) + sizeBytes := scope.NodeAgentArtifactSizeBytes + if len(urls) == 0 && sha256 == "" { + return nil + } + return &DockerArtifact{ + Kind: "linux_binary", + MediaType: "application/octet-stream", + FileName: "rap-node-agent-linux-amd64", + URLs: urls, + SHA256: sha256, + SizeBytes: sizeBytes, + } +} + +func windowsNodeAgentArtifactFromScope(artifactEndpoints []string, scope dockerInstallProfileScope) *DockerArtifact { + urls := trimEndpointSlice(scope.NodeAgentArtifactURLs) + if len(urls) == 0 { + for _, endpoint := range artifactEndpoints { + urls = append(urls, strings.TrimRight(endpoint, "/")+"/rap-node-agent-windows-amd64.exe") + } + } + sha256 := strings.TrimSpace(scope.NodeAgentArtifactSHA256) + sizeBytes := scope.NodeAgentArtifactSizeBytes + if len(urls) == 0 && sha256 == "" { + return nil + } + return &DockerArtifact{ + Kind: "windows_exe", + MediaType: "application/vnd.microsoft.portable-executable", 
+ FileName: "rap-node-agent-windows-amd64.exe", + URLs: urls, + SHA256: sha256, + SizeBytes: sizeBytes, + } +} + +func dockerImageArtifactFromScope(image string, artifactEndpoints []string, scope dockerInstallProfileScope) *DockerArtifact { + image = strings.TrimSpace(image) + if image == "" { + return nil + } + fileName := safeArtifactFileName(image) + ".tar" + urls := trimEndpointSlice(scope.DockerImageArtifactURLs) + if len(urls) == 0 { + for _, endpoint := range artifactEndpoints { + urls = append(urls, strings.TrimRight(endpoint, "/")+"/"+fileName) + } + } + sha256 := strings.TrimSpace(scope.DockerImageArtifactSHA256) + sizeBytes := scope.DockerImageArtifactSizeBytes + if len(urls) == 0 && sha256 == "" { + return nil + } + return &DockerArtifact{ + Kind: "docker_image_tar", + Image: image, + MediaType: "application/vnd.docker.image.rootfs.diff.tar", + FileName: fileName, + URLs: urls, + SHA256: sha256, + SizeBytes: sizeBytes, + } +} + +func defaultArtifactEndpointFromBackendURL(backendURL string) string { + value := strings.TrimRight(strings.TrimSpace(backendURL), "/") + if value == "" { + return "" + } + for _, suffix := range []string{"/api/v1", "/api"} { + if strings.HasSuffix(value, suffix) { + value = strings.TrimSuffix(value, suffix) + break + } + } + return strings.TrimRight(value, "/") + "/downloads" } type heartbeatMeshEndpointReport struct { @@ -2011,9 +10135,10 @@ type rendezvousRelayPolicy struct { } const ( - maxScopedRecoverySeeds = 20 - maxScopedRendezvousLeases = 20 - rendezvousRelayFeedbackMaxAge = 2 * time.Minute + maxScopedRecoverySeeds = 20 + maxScopedRendezvousLeases = 20 + defaultCoreMeshBootstrapPeerTarget = 3 + rendezvousRelayFeedbackMaxAge = 2 * time.Minute ) type nodeSelector struct { @@ -2064,6 +10189,9 @@ func (s *Service) syntheticRouteFromIntent(input GetNodeSyntheticMeshConfigInput if policy.ExpiresAt != nil { expiresAt = policy.ExpiresAt.UTC() } + if !expiresAt.After(s.now().UTC()) { + return SyntheticMeshRouteConfig{}, nil, 
nil, nil, nil, false + } allowedChannels := policy.AllowedChannels if len(allowedChannels) == 0 { allowedChannels = []string{"fabric_control", "route_control"} @@ -2110,7 +10238,7 @@ func (s *Service) syntheticRouteFromIntent(input GetNodeSyntheticMeshConfigInput true } -func (s *Service) reportedEndpointConfig(ctx context.Context, clusterID string, localNodeID string, routePath []string) (map[string]string, map[string][]PeerEndpointCandidate, error) { +func (s *Service) reportedEndpointConfig(ctx context.Context, clusterID string, localNodeID string, routePath []string, localPerspective endpointPerspective) (map[string]string, map[string][]PeerEndpointCandidate, error) { peers := map[string]string{} candidates := map[string][]PeerEndpointCandidate{} for _, nodeID := range routePath { @@ -2118,17 +10246,29 @@ func (s *Service) reportedEndpointConfig(ctx context.Context, clusterID string, if nodeID == "" || nodeID == localNodeID { continue } + desiredEndpoint, desiredCandidates, err := s.desiredMeshListenerEndpointConfig(ctx, clusterID, nodeID, 0) + if err != nil { + return nil, nil, err + } heartbeats, err := s.store.ListNodeHeartbeats(ctx, clusterID, nodeID, 1) if err != nil { return nil, nil, err } - if len(heartbeats) == 0 { + if len(heartbeats) == 0 && desiredEndpoint == "" && len(desiredCandidates) == 0 { continue } - peerEndpoint, nodeCandidates, ok := endpointReportFromHeartbeat(heartbeats[0]) - if !ok { - continue + peerEndpoint := desiredEndpoint + nodeCandidates := append([]PeerEndpointCandidate{}, desiredCandidates...) + if len(heartbeats) > 0 { + reportedEndpoint, reportedCandidates, ok := endpointReportFromHeartbeat(heartbeats[0]) + if ok { + if peerEndpoint == "" { + peerEndpoint = reportedEndpoint + } + nodeCandidates = append(nodeCandidates, reportedCandidates...) 
+ } } + peerEndpoint, nodeCandidates = scopeEndpointReportForLocal(localPerspective, peerEndpoint, nodeCandidates) if peerEndpoint != "" { peers[nodeID] = peerEndpoint } @@ -2139,6 +10279,162 @@ func (s *Service) reportedEndpointConfig(ctx context.Context, clusterID string, return peers, candidates, nil } +type endpointPerspective struct { + OutboundOnly bool + Region string + ControlPlaneURL string + ControlPlaneRelayEndpoint string +} + +func (s *Service) localEndpointPerspective(ctx context.Context, clusterID, localNodeID string) (endpointPerspective, error) { + heartbeats, err := s.store.ListNodeHeartbeats(ctx, clusterID, localNodeID, 1) + if err != nil { + return endpointPerspective{}, err + } + if len(heartbeats) == 0 { + return endpointPerspective{}, nil + } + return endpointPerspectiveFromHeartbeat(heartbeats[0]), nil +} + +func endpointPerspectiveFromHeartbeat(heartbeat NodeHeartbeat) endpointPerspective { + var metadata struct { + MeshEndpointReport heartbeatMeshEndpointReport `json:"mesh_endpoint_report"` + MeshListenerReport struct { + InboundReachability string `json:"inbound_reachability"` + OneWayConnectivity bool `json:"one_way_connectivity"` + } `json:"mesh_listener_report"` + MeshOutboundSessionReport struct { + ControlPlaneURL string `json:"control_plane_url"` + Status string `json:"status"` + } `json:"mesh_outbound_session_report"` + } + if len(heartbeat.Metadata) == 0 || !json.Valid(heartbeat.Metadata) { + return endpointPerspective{} + } + if err := json.Unmarshal(heartbeat.Metadata, &metadata); err != nil { + return endpointPerspective{} + } + connectivity := strings.ToLower(strings.TrimSpace(metadata.MeshEndpointReport.ConnectivityMode)) + reachability := strings.ToLower(strings.TrimSpace(metadata.MeshListenerReport.InboundReachability)) + return endpointPerspective{ + OutboundOnly: connectivity == "outbound_only" || reachability == "outbound_only" || metadata.MeshListenerReport.OneWayConnectivity, + Region: 
strings.TrimSpace(metadata.MeshEndpointReport.Region), + ControlPlaneURL: strings.TrimSpace(metadata.MeshOutboundSessionReport.ControlPlaneURL), + ControlPlaneRelayEndpoint: controlPlaneRelayEndpointFromURL(metadata.MeshOutboundSessionReport.ControlPlaneURL), + } +} + +func controlPlaneRelayEndpointFromURL(raw string) string { + raw = strings.TrimRight(strings.TrimSpace(raw), "/") + if raw == "" { + return "" + } + parsed, err := url.Parse(raw) + if err != nil || parsed.Scheme == "" || parsed.Host == "" { + return "" + } + path := strings.TrimRight(parsed.Path, "/") + for _, suffix := range []string{"/api/v1", "/api"} { + if strings.HasSuffix(path, suffix) { + path = strings.TrimRight(strings.TrimSuffix(path, suffix), "/") + break + } + } + parsed.Path = path + parsed.RawPath = "" + parsed.RawQuery = "" + parsed.Fragment = "" + return strings.TrimRight(parsed.String(), "/") +} + +func controlPlaneBootstrapRendezvousLease(clusterID, peerNodeID string, candidates []PeerEndpointCandidate, local endpointPerspective, now time.Time) (PeerRendezvousLease, bool) { + if !local.OutboundOnly || local.ControlPlaneRelayEndpoint == "" { + return PeerRendezvousLease{}, false + } + requiresRendezvous := false + for _, candidate := range candidates { + if endpointCandidateRequiresRendezvous(candidate) { + requiresRendezvous = true + break + } + } + if !requiresRendezvous { + return PeerRendezvousLease{}, false + } + issuedAt := now.UTC() + return PeerRendezvousLease{ + LeaseID: "core-mesh-bootstrap-rv-" + peerNodeID + "-via-control-plane", + PeerNodeID: peerNodeID, + RelayNodeID: "control-plane-relay", + RelayEndpoint: local.ControlPlaneRelayEndpoint, + Transport: "relay_control", + ConnectivityMode: "relay_required", + RouteIDs: []string{"core-mesh-bootstrap"}, + AllowedChannels: []string{"fabric_control", "route_control"}, + Priority: 90, + ControlPlaneOnly: true, + IssuedAt: issuedAt, + ExpiresAt: issuedAt.Add(5 * time.Minute), + Reason: "control_plane_bootstrap_relay", + 
Metadata: json.RawMessage(`{ + "cluster_id": "` + clusterID + `", + "source": "control_plane_bootstrap", + "service_workload_traffic": false, + "production_forwarding": false + }`), + }, true +} + +func scopeEndpointReportForLocal(local endpointPerspective, endpoint string, candidates []PeerEndpointCandidate) (string, []PeerEndpointCandidate) { + if !local.OutboundOnly { + return endpoint, candidates + } + out := make([]PeerEndpointCandidate, 0, len(candidates)) + directUsable := false + for _, candidate := range candidates { + if endpointCandidatePrivateForOffsite(candidate) { + candidate = relayRequiredCandidateForOffsite(candidate) + } else if !endpointCandidateRequiresRendezvous(candidate) { + directUsable = true + } + if candidate.Metadata == nil { + candidate.Metadata = json.RawMessage(`{}`) + } + out = append(out, candidate) + } + if !directUsable && endpointPrivateForOffsite(endpoint) { + endpoint = "" + } + return endpoint, out +} + +func endpointCandidatePrivateForOffsite(candidate PeerEndpointCandidate) bool { + connectivity := strings.ToLower(strings.TrimSpace(candidate.ConnectivityMode)) + reachability := strings.ToLower(strings.TrimSpace(candidate.Reachability)) + return connectivity == "private_lan" || reachability == "private" || endpointPrivateForOffsite(candidate.Address) +} + +func endpointPrivateForOffsite(endpoint string) bool { + host := peerEndpointHost(endpoint) + if host == "" { + return false + } + ip := net.ParseIP(host) + return ip != nil && (ip.IsPrivate() || ip.IsLoopback() || ip.IsLinkLocalUnicast() || ip.IsUnspecified()) +} + +func relayRequiredCandidateForOffsite(candidate PeerEndpointCandidate) PeerEndpointCandidate { + candidate.Transport = "relay" + candidate.Reachability = "relay" + candidate.ConnectivityMode = "relay_required" + candidate.NATType = firstNonEmptyString(candidate.NATType, "unknown") + candidate.Priority += 200 + candidate.PolicyTags = appendMissingString(candidate.PolicyTags, "offsite-private-lan-blocked") + 
candidate.PolicyTags = appendMissingString(candidate.PolicyTags, "relay-required") + return candidate +} + func endpointReportFromHeartbeat(heartbeat NodeHeartbeat) (string, []PeerEndpointCandidate, bool) { var metadata struct { MeshEndpointReport heartbeatMeshEndpointReport `json:"mesh_endpoint_report"` @@ -2157,7 +10453,11 @@ func endpointReportFromHeartbeat(heartbeat NodeHeartbeat) (string, []PeerEndpoin return "", nil, false } nodeID := heartbeat.NodeID - peerEndpoint := strings.TrimSpace(report.PeerEndpoint) + rawPeerEndpoint := strings.TrimSpace(report.PeerEndpoint) + peerEndpoint := rawPeerEndpoint + if isUnusableLocalPeerEndpoint(peerEndpoint) { + peerEndpoint = "" + } out := make([]PeerEndpointCandidate, 0, len(report.EndpointCandidates)) for _, candidate := range report.EndpointCandidates { if candidate.NodeID == "" { @@ -2167,7 +10467,10 @@ func endpointReportFromHeartbeat(heartbeat NodeHeartbeat) (string, []PeerEndpoin candidate.EndpointID = nodeID + "-reported" } if candidate.Address == "" { - candidate.Address = peerEndpoint + candidate.Address = rawPeerEndpoint + } + if isUnusableLocalPeerEndpoint(candidate.Address) { + continue } if candidate.Transport == "" { candidate.Transport = report.Transport @@ -2200,6 +10503,283 @@ func endpointReportFromHeartbeat(heartbeat NodeHeartbeat) (string, []PeerEndpoin return peerEndpoint, out, peerEndpoint != "" || len(out) > 0 } +func hasActiveNodeRole(roles []NodeRoleAssignment, role string) bool { + for _, item := range roles { + if item.Role == role && item.Status == "active" { + return true + } + } + return false +} + +func nodeLastSeen(node ClusterNode) time.Time { + if node.LastSeenAt == nil { + return time.Time{} + } + return node.LastSeenAt.UTC() +} + +func recoverySeedFromEndpointReport(nodeID, endpoint string, candidates []PeerEndpointCandidate, index int) PeerRecoverySeed { + nodeID = strings.TrimSpace(nodeID) + endpoint = strings.TrimRight(strings.TrimSpace(endpoint), "/") + seed := PeerRecoverySeed{ + 
NodeID: nodeID, + Endpoint: endpoint, + Transport: "direct_http", + Priority: 10 + index, + Metadata: json.RawMessage(`{"source":"core_mesh_bootstrap"}`), + } + for _, candidate := range candidates { + if strings.TrimSpace(candidate.Address) == "" { + continue + } + seed.Endpoint = strings.TrimRight(strings.TrimSpace(candidate.Address), "/") + if strings.TrimSpace(candidate.Transport) != "" { + seed.Transport = candidate.Transport + } + seed.ConnectivityMode = candidate.ConnectivityMode + seed.Region = candidate.Region + if candidate.LastVerifiedAt != nil { + seed.LastVerifiedAt = candidate.LastVerifiedAt + } + break + } + if seed.NodeID == "" || seed.Endpoint == "" { + return PeerRecoverySeed{} + } + return seed +} + +func firstNonEmptyString(values ...string) string { + for _, value := range values { + if trimmed := strings.TrimSpace(value); trimmed != "" { + return trimmed + } + } + return "" +} + +func trimStringSlice(values []string) []string { + out := []string{} + for _, value := range values { + if trimmed := strings.TrimSpace(value); trimmed != "" && !containsString(out, trimmed) { + out = append(out, trimmed) + } + } + return out +} + +func trimEndpointSlice(values []string) []string { + out := []string{} + for _, value := range values { + trimmed := strings.TrimRight(strings.TrimSpace(value), "/") + if trimmed != "" && !containsString(out, trimmed) { + out = append(out, trimmed) + } + } + return out +} + +func normalizeUpdateToken(value string) string { + return strings.ToLower(strings.TrimSpace(value)) +} + +func selectReleaseArtifact(releases []ReleaseVersion, input GetNodeUpdatePlanInput, policy NodeUpdatePolicy) (ReleaseVersion, ReleaseArtifact, bool) { + targetVersion := "" + if policy.TargetVersion != nil { + targetVersion = strings.TrimSpace(*policy.TargetVersion) + } + for _, release := range releases { + if release.Status != "active" { + continue + } + if targetVersion != "" && release.Version != targetVersion { + continue + } + for _, artifact 
:= range release.Artifacts { + if normalizeUpdateToken(artifact.OS) == input.OS && + normalizeUpdateToken(artifact.Arch) == input.Arch && + normalizeUpdateToken(artifact.InstallType) == input.InstallType { + artifact.URLs = releaseArtifactURLs(artifact) + return release, artifact, true + } + } + } + return ReleaseVersion{}, ReleaseArtifact{}, false +} + +func releaseArtifactURLs(artifact ReleaseArtifact) []string { + out := trimEndpointSlice(append([]string{artifact.URL}, artifact.URLs...)) + if len(artifact.Metadata) > 0 && json.Valid(artifact.Metadata) { + var metadata struct { + URL string `json:"url"` + URLs []string `json:"urls"` + MirrorURLs []string `json:"mirror_urls"` + Mirrors []string `json:"mirrors"` + } + if err := json.Unmarshal(artifact.Metadata, &metadata); err == nil { + out = trimEndpointSlice(append(out, metadata.URL)) + out = trimEndpointSlice(append(out, metadata.URLs...)) + out = trimEndpointSlice(append(out, metadata.MirrorURLs...)) + out = trimEndpointSlice(append(out, metadata.Mirrors...)) + } + } + return out +} + +func normalizeArtifactOrigin(value string) string { + value = strings.TrimRight(strings.TrimSpace(value), "/") + if value == "" { + return "" + } + parsed, err := url.Parse(value) + if err != nil || parsed.Scheme == "" || parsed.Host == "" { + return "" + } + return parsed.Scheme + "://" + parsed.Host +} + +func absolutizeReleaseArtifact(artifact ReleaseArtifact, origin string) ReleaseArtifact { + if origin == "" { + return artifact + } + artifact.URL = absolutizeArtifactURL(artifact.URL, origin) + for i, raw := range artifact.URLs { + artifact.URLs[i] = absolutizeArtifactURL(raw, origin) + } + return artifact +} + +func absolutizeArtifactURL(raw, origin string) string { + raw = strings.TrimSpace(raw) + if raw == "" || origin == "" { + return raw + } + parsed, err := url.Parse(raw) + if err == nil && parsed.IsAbs() { + return raw + } + if strings.HasPrefix(raw, "/") { + return origin + raw + } + return raw +} + +func (s 
*Service) hostAgentPlatformMismatch(ctx context.Context, input GetNodeUpdatePlanInput) (bool, error) { + if input.Product != "rap-host-agent" { + return false, nil + } + if nodeUpdateRequestIsWindows(input) { + return false, nil + } + statuses, err := s.store.ListNodeUpdateStatuses(ctx, input.ClusterID, input.NodeID, 20) + if err != nil { + return false, err + } + for _, status := range statuses { + if status.Product != "rap-node-agent" || !nodeUpdateStatusLooksWindows(status) { + continue + } + return true, nil + } + return false, nil +} + +func nodeUpdateRequestIsWindows(input GetNodeUpdatePlanInput) bool { + return normalizeUpdateToken(input.OS) == "windows" || strings.Contains(normalizeUpdateToken(input.InstallType), "windows") +} + +func nodeUpdateStatusLooksWindows(status NodeUpdateStatus) bool { + var payload map[string]any + if len(status.Payload) == 0 || json.Unmarshal(status.Payload, &payload) != nil { + return false + } + for _, key := range []string{"os", "runtime_os", "goos"} { + if normalizeUpdateToken(stringFromAny(payload[key])) == "windows" { + return true + } + } + for _, key := range []string{"binary_path", "task", "windows_task_name"} { + value := strings.ToLower(strings.TrimSpace(stringFromAny(payload[key]))) + if strings.Contains(value, `:\`) || strings.Contains(value, `.exe`) || strings.Contains(value, "rap node agent ") { + return true + } + } + return false +} + +func stringFromAny(value any) string { + switch typed := value.(type) { + case string: + return typed + default: + return "" + } +} + +func boolPtrValue(value *bool, fallback bool) bool { + if value == nil { + return fallback + } + return *value +} + +func positiveOrDefault(value, fallback int) int { + if value > 0 { + return value + } + return fallback +} + +func nonNegativeOrDefault(value, fallback int) int { + if value >= 0 { + return value + } + return fallback +} + +func safeInstallProfileSlug(value string) string { + value = strings.ToLower(strings.TrimSpace(value)) + var b 
strings.Builder + lastDash := false + for _, r := range value { + ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') + if ok { + b.WriteRune(r) + lastDash = false + continue + } + if !lastDash { + b.WriteByte('-') + lastDash = true + } + } + return strings.Trim(b.String(), "-") +} + +func safeArtifactFileName(value string) string { + value = strings.ToLower(strings.TrimSpace(value)) + var b strings.Builder + lastDash := false + for _, r := range value { + ok := (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '.' || r == '_' || r == '-' + if ok { + b.WriteRune(r) + lastDash = false + continue + } + if !lastDash { + b.WriteByte('-') + lastDash = true + } + } + out := strings.Trim(b.String(), "-") + if out == "" { + return "rap-node-agent" + } + return out +} + func (s *Service) rendezvousRelayFeedback(ctx context.Context, clusterID string, routePath []string, now time.Time) ([]rendezvousRelayFeedbackEntry, error) { out := []rendezvousRelayFeedbackEntry{} seenNodes := map[string]struct{}{} @@ -2703,9 +11283,14 @@ func rendezvousRelayReplacementKey(routeID string, peerNodeID string, staleRelay } func routePathDecisionReport(generation string, decisions []RoutePathDecision) *RoutePathDecisionReport { + return routePathDecisionReportWithRecoveryPolicy(generation, decisions, defaultFabricServiceChannelRecoveryPolicy()) +} + +func routePathDecisionReportWithRecoveryPolicy(generation string, decisions []RoutePathDecision, policy FabricServiceChannelRecoveryPolicy) *RoutePathDecisionReport { if len(decisions) == 0 { return nil } + policy = normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) out := append([]RoutePathDecision{}, decisions...) 
sort.SliceStable(out, func(i, j int) bool { if out[i].RouteID != out[j].RouteID { @@ -2714,24 +11299,220 @@ func routePathDecisionReport(generation string, decisions []RoutePathDecision) * return out[i].DecisionID < out[j].DecisionID }) replacements := 0 + degraded := 0 + rebuildRequests := 0 + rebuildApplied := 0 + recoveryHysteresis := 0 + recoveryPromoted := 0 + recoveryDemoted := 0 for _, decision := range out { - if decision.DecisionSource == "stale_relay_replacement" { + if decision.DecisionSource == "stale_relay_replacement" || + decision.DecisionSource == "service_channel_feedback_replacement" || + decision.DecisionSource == "service_channel_feedback_exit_pool_replacement" || + decision.DecisionSource == "service_channel_feedback_entry_pool_replacement" || + decision.DecisionSource == "service_channel_feedback_entry_exit_pool_replacement" || + (decision.DecisionSource == "service_channel_remediation_command" && strings.TrimSpace(decision.ReplacementRouteID) != "") { replacements++ } + if containsString(decision.ScoreReasons, "service_channel_recovery_hysteresis") { + recoveryHysteresis++ + } + if containsString(decision.ScoreReasons, "service_channel_recovery_promoted") { + recoveryPromoted++ + } + if containsString(decision.ScoreReasons, "service_channel_recovery_demoted") { + recoveryDemoted++ + } + if decision.DecisionSource == "service_channel_feedback_no_alternate" || decision.RebuildStatus == "no_alternate" { + degraded++ + } + switch decision.RebuildStatus { + case "requested", "pending_degraded_fallback", "no_alternate", "deferred_by_policy", "expired": + rebuildRequests++ + case "applied": + rebuildRequests++ + rebuildApplied++ + } } return &RoutePathDecisionReport{ SchemaVersion: "c17z18.route_path_decisions.v1", - DecisionMode: "control_plane_effective_path_from_relay_policy", + DecisionMode: "control_plane_effective_path_from_relay_policy_and_service_channel_feedback", Generation: generation, + RecoveryPolicy: 
fabricServiceChannelRecoveryPolicyRef(policy), DecisionCount: len(out), ReplacementDecisionCount: replacements, + DegradedDecisionCount: degraded, + RebuildRequestCount: rebuildRequests, + RebuildAppliedCount: rebuildApplied, + RecoveryHysteresisCount: recoveryHysteresis, + RecoveryPromotedCount: recoveryPromoted, + RecoveryDemotedCount: recoveryDemoted, ControlPlaneOnly: true, ProductionForwarding: false, Decisions: out, } } -func routePathDecisionForRoute(route SyntheticMeshRouteConfig, localNodeID string, leases []PeerRendezvousLease, relayPolicy *rendezvousRelayPolicy, generation string) RoutePathDecision { +func serviceChannelFeedbackRequestsRebuild(item fabricServiceChannelRouteFeedback) bool { + if item.RouteID == "" || !item.Fenced || item.ManualRetry { + return false + } + return item.RouteRebuildRecommended || + item.DegradedFallbackRecommended || + item.ConsecutiveFailures >= 2 || + containsString(item.Reasons, "service_channel_route_rebuild_recommended") +} + +func serviceChannelRebuildRequestID(routeID, reporterNodeID, generation string) string { + base := strings.TrimSpace(routeID) + if base == "" { + base = "route" + } + if strings.TrimSpace(reporterNodeID) != "" { + base += "-" + strings.TrimSpace(reporterNodeID) + } + if strings.TrimSpace(generation) != "" { + base += "-" + strings.TrimSpace(generation) + } + return base + "-rebuild" +} + +func (s *Service) serviceChannelRouteReplacementDecision(input GetNodeSyntheticMeshConfigInput, fencedRoute SyntheticMeshRouteConfig, intents []MeshRouteIntent, feedback map[string]fabricServiceChannelRouteFeedback, generation string) RoutePathDecision { + routeFeedback := feedback[fencedRoute.RouteID] + decision := RoutePathDecision{ + DecisionID: fencedRoute.RouteID + "-path-" + input.NodeID + "-service-channel-feedback", + RouteID: fencedRoute.RouteID, + ClusterID: fencedRoute.ClusterID, + LocalNodeID: input.NodeID, + SourceNodeID: fencedRoute.SourceNodeID, + DestinationNodeID: fencedRoute.DestinationNodeID, + 
OriginalHops: append([]string{}, fencedRoute.Hops...), + EffectiveHops: []string{}, + DecisionSource: "service_channel_feedback_no_alternate", + Generation: generation, + PathScore: 0, + ScoreReasons: []string{"service_channel_fenced_route", "no_unfenced_alternate_route"}, + ControlPlaneOnly: true, + ProductionForwarding: false, + ExpiresAt: fencedRoute.ExpiresAt.UTC(), + } + applyServiceChannelFeedbackCorrelationToDecision(&decision, routeFeedback) + if serviceChannelFeedbackRequestsRebuild(routeFeedback) { + decision.RebuildRequestID = serviceChannelRebuildRequestID(fencedRoute.RouteID, input.NodeID, generation) + decision.RebuildStatus = "pending_degraded_fallback" + decision.RebuildReason = "service_channel_feedback_rebuild_requested" + decision.RebuildAttempt = routeFeedback.ConsecutiveFailures + decision.ScoreReasons = append(decision.ScoreReasons, "service_channel_rebuild_requested", "backend_relay_degraded_fallback_until_rebuild") + if routeFeedback.DegradedFallbackRecommended { + decision.ScoreReasons = append(decision.ScoreReasons, "service_channel_degraded_fallback_recommended") + } + } + replacement, replacementFeedback, ok := s.selectServiceChannelRouteReplacement(input, fencedRoute, intents, feedback) + if ok { + decision.ReplacementRouteID = replacement.RouteID + decision.EffectiveHops = append([]string{}, replacement.Hops...) 
+ decision.DecisionSource = "service_channel_feedback_replacement" + decision.PathScore = serviceChannelReplacementRouteScore(replacement) + decision.ScoreReasons = []string{"service_channel_fenced_route", "selected_unfenced_alternate_route"} + if replacement.SourceNodeID != fencedRoute.SourceNodeID { + decision.DecisionSource = "service_channel_feedback_entry_pool_replacement" + decision.ScoreReasons = append(decision.ScoreReasons, "selected_unfenced_entry_pool_route") + } + if replacement.DestinationNodeID != fencedRoute.DestinationNodeID { + decision.DecisionSource = "service_channel_feedback_exit_pool_replacement" + decision.ScoreReasons = append(decision.ScoreReasons, "selected_unfenced_exit_pool_route") + } + if replacement.SourceNodeID != fencedRoute.SourceNodeID && replacement.DestinationNodeID != fencedRoute.DestinationNodeID { + decision.DecisionSource = "service_channel_feedback_entry_exit_pool_replacement" + decision.ScoreReasons = append(decision.ScoreReasons, "selected_unfenced_entry_exit_pool_route") + } + if decision.RebuildRequestID != "" { + decision.RebuildStatus = "applied" + decision.RebuildReason = "service_channel_feedback_rebuild_applied_to_alternate" + decision.ScoreReasons = append(decision.ScoreReasons, "service_channel_rebuild_applied") + } + if replacementFeedback.RouteID != "" && !replacementFeedback.Fenced { + decision.PathScore += 10000 + decision.ScoreReasons = append(decision.ScoreReasons, "active_healthy_feedback_dampening_window") + decision.ScoreReasons = append(decision.ScoreReasons, replacementFeedback.Reasons...) 
+ } + decision.ScoreReasons = dedupeStrings(decision.ScoreReasons) + if replacement.ExpiresAt.Before(decision.ExpiresAt) { + decision.ExpiresAt = replacement.ExpiresAt.UTC() + } + } + decision.PreviousHopID, decision.NextHopID, decision.LocalRole = routePathLocalPosition(decision.EffectiveHops, input.NodeID, "", "") + return decision +} + +func applyServiceChannelFeedbackCorrelationToDecision(decision *RoutePathDecision, feedback fabricServiceChannelRouteFeedback) { + if decision == nil || feedback.RouteID == "" { + return + } + decision.FeedbackObservationID = feedback.ObservationID + decision.FeedbackSource = feedback.Source + if !feedback.ObservedAt.IsZero() { + observedAt := feedback.ObservedAt.UTC() + decision.FeedbackObservedAt = &observedAt + } + if !feedback.ExpiresAt.IsZero() { + expiresAt := feedback.ExpiresAt.UTC() + decision.FeedbackExpiresAt = &expiresAt + } + decision.FeedbackChannelID = feedback.ChannelID + decision.FeedbackResourceID = feedback.ResourceID + decision.FeedbackViolationStatus = feedback.ViolationStatus + decision.FeedbackViolationReason = feedback.ViolationReason +} + +func (s *Service) selectServiceChannelRouteReplacement(input GetNodeSyntheticMeshConfigInput, fencedRoute SyntheticMeshRouteConfig, intents []MeshRouteIntent, feedback map[string]fabricServiceChannelRouteFeedback) (SyntheticMeshRouteConfig, fabricServiceChannelRouteFeedback, bool) { + var selected SyntheticMeshRouteConfig + var selectedFeedback fabricServiceChannelRouteFeedback + selectedScore := -1 + scopes := fabricServiceChannelRouteIntentReplacementScopes(intents) + for _, intent := range intents { + route, _, _, _, _, ok := s.syntheticRouteFromIntent(input, intent) + if !ok || route.RouteID == fencedRoute.RouteID { + continue + } + if !fabricServiceChannelRoutesShareReplacementScope(fencedRoute, route, scopes) { + continue + } + if !fabricChannelsIntersect(route.AllowedChannels, fencedRoute.AllowedChannels) { + continue + } + if item, ok := feedback[route.RouteID]; 
ok && item.Fenced { + continue + } + routeFeedback := feedback[route.RouteID] + score := serviceChannelReplacementRouteScore(route) + intent.Priority + if routeFeedback.RouteID != "" { + score += 10000 + } + if route.DestinationNodeID != fencedRoute.DestinationNodeID { + score -= 5 + } + if route.SourceNodeID != fencedRoute.SourceNodeID { + score -= 10 + } + if score > selectedScore || (score == selectedScore && route.RouteID < selected.RouteID) { + selected = route + selectedFeedback = routeFeedback + selectedScore = score + } + } + return selected, selectedFeedback, selected.RouteID != "" +} + +func serviceChannelReplacementRouteScore(route SyntheticMeshRouteConfig) int { + score := 1000 - len(route.Hops)*10 + if score < 1 { + return 1 + } + return score +} + +func routePathDecisionForRoute(route SyntheticMeshRouteConfig, localNodeID string, leases []PeerRendezvousLease, relayPolicy *rendezvousRelayPolicy, generation string, serviceFeedback fabricServiceChannelRouteFeedback) RoutePathDecision { decision := RoutePathDecision{ DecisionID: route.RouteID + "-path-" + localNodeID, RouteID: route.RouteID, @@ -2749,6 +11530,14 @@ func routePathDecisionForRoute(route SyntheticMeshRouteConfig, localNodeID strin ProductionForwarding: false, ExpiresAt: route.ExpiresAt.UTC(), } + if serviceFeedback.ManualRetry { + decision.ScoreReasons = append(decision.ScoreReasons, "service_channel_route_retry_after_operator_expire") + decision.ScoreReasons = append(decision.ScoreReasons, serviceFeedback.Reasons...) 
+ decision.ScoreReasons = dedupeStrings(decision.ScoreReasons) + if serviceFeedback.RetryCooldownUntil != nil && serviceFeedback.RetryCooldownUntil.Before(decision.ExpiresAt) { + decision.ExpiresAt = serviceFeedback.RetryCooldownUntil.UTC() + } + } var replacementLease PeerRendezvousLease var replacementDecision RendezvousRelayPolicyDecision replacementFound := false @@ -2873,6 +11662,8 @@ func reachabilityFromConnectivityMode(connectivityMode string) string { return "outbound_only" case "relay_required": return "relay" + case "private_lan": + return "private" case "direct": return "public" default: @@ -3112,7 +11903,7 @@ func selectRendezvousRelay(route SyntheticMeshRouteConfig, peerNodeID string, lo } func relayControlEndpointForNode(nodeID string, peers map[string]string, candidates map[string][]PeerEndpointCandidate) (string, int, []string) { - if endpoint := strings.TrimRight(strings.TrimSpace(peers[nodeID]), "/"); isHTTPControlEndpoint(endpoint) { + if endpoint := strings.TrimRight(strings.TrimSpace(peers[nodeID]), "/"); isUsableHTTPControlEndpoint(endpoint) { return endpoint, 80, []string{"reported_peer_endpoint"} } items := append([]PeerEndpointCandidate{}, candidates[nodeID]...) 
@@ -3127,7 +11918,7 @@ func relayControlEndpointForNode(nodeID string, peers map[string]string, candida continue } endpoint := strings.TrimRight(strings.TrimSpace(candidate.Address), "/") - if isHTTPControlEndpoint(endpoint) { + if isUsableHTTPControlEndpoint(endpoint) { score := 70 reasons := []string{"endpoint_candidate"} if candidate.Priority > 0 { @@ -3559,7 +12350,8 @@ func validatePeerEndpointCandidates(candidates map[string][]PeerEndpointCandidat func scopedPeerEndpoints(peers map[string]string, routePath []string) map[string]string { out := map[string]string{} for nodeID, endpoint := range peers { - if containsString(routePath, nodeID) && strings.TrimSpace(endpoint) != "" { + endpoint = strings.TrimSpace(endpoint) + if containsString(routePath, nodeID) && endpoint != "" && !isUnusableLocalPeerEndpoint(endpoint) { out[nodeID] = endpoint } } @@ -3573,6 +12365,9 @@ func scopedPeerEndpointCandidates(candidates map[string][]PeerEndpointCandidate, continue } for _, candidate := range items { + if isUnusableLocalPeerEndpoint(candidate.Address) { + continue + } if candidate.Metadata == nil { candidate.Metadata = json.RawMessage(`{}`) } @@ -3584,7 +12379,7 @@ func scopedPeerEndpointCandidates(candidates map[string][]PeerEndpointCandidate, func isPeerEndpointTransport(value string) bool { switch value { - case "direct_tcp_tls", "wss", "relay", "outbound_reverse": + case "direct_http", "direct_tcp_tls", "wss", "relay", "outbound_reverse": return true default: return false @@ -3611,7 +12406,7 @@ func isPeerEndpointReachability(value string) bool { func isPeerEndpointConnectivityMode(value string) bool { switch value { - case "direct", "relay_required", "outbound_only", "unknown": + case "direct", "private_lan", "relay_required", "outbound_only", "unknown": return true default: return false @@ -3641,20 +12436,44 @@ func controlPlaneAllowedChannels(channels []string) []string { return out } -func firstNonEmptyStringSlice(values ...[]string) []string { - for _, value := 
range values { - if len(value) > 0 { - return value - } - } - return nil -} - func isHTTPControlEndpoint(endpoint string) bool { endpoint = strings.ToLower(strings.TrimSpace(endpoint)) return strings.HasPrefix(endpoint, "http://") || strings.HasPrefix(endpoint, "https://") } +func isUsableHTTPControlEndpoint(endpoint string) bool { + return isHTTPControlEndpoint(endpoint) && !isUnusableLocalPeerEndpoint(endpoint) +} + +func isUnusableLocalPeerEndpoint(endpoint string) bool { + host := peerEndpointHost(endpoint) + if host == "" { + return false + } + if strings.EqualFold(host, "localhost") { + return true + } + ip := net.ParseIP(host) + return ip != nil && (ip.IsLoopback() || ip.IsUnspecified()) +} + +func peerEndpointHost(endpoint string) string { + endpoint = strings.TrimRight(strings.TrimSpace(endpoint), "/") + if endpoint == "" { + return "" + } + if host, _, err := net.SplitHostPort(endpoint); err == nil { + return strings.Trim(host, "[]") + } + if parsed, err := url.Parse(endpoint); err == nil && parsed.Host != "" { + if host, _, err := net.SplitHostPort(parsed.Host); err == nil { + return strings.Trim(host, "[]") + } + return strings.Trim(parsed.Host, "[]") + } + return strings.Trim(endpoint, "[]") +} + func firstNodeID(selector nodeSelector) string { if strings.TrimSpace(selector.NodeID) != "" { return strings.TrimSpace(selector.NodeID) @@ -3691,6 +12510,13 @@ func containsString(values []string, needle string) bool { return false } +func appendMissingString(values []string, value string) []string { + if containsString(values, value) { + return values + } + return append(values, value) +} + func generateFencingToken() (string, error) { buf := make([]byte, 32) if _, err := rand.Read(buf); err != nil { diff --git a/backend/internal/modules/cluster/service_test.go b/backend/internal/modules/cluster/service_test.go index 1dd5bdd..9d0498a 100644 --- a/backend/internal/modules/cluster/service_test.go +++ b/backend/internal/modules/cluster/service_test.go @@ -4,6 
+4,9 @@ import ( "context" "encoding/json" "errors" + "reflect" + "sort" + "strconv" "strings" "testing" "time" @@ -60,6 +63,92 @@ func TestClusterAuthorityPrivateKeyEncodingUsesSecretEncryptor(t *testing.T) { } } +func TestNodeUpdateHintAssignsUpdateServiceSubscription(t *testing.T) { + targetVersion := "0.2.15" + now := time.Date(2026, 5, 2, 8, 0, 0, 0, time.UTC) + repo := &fakeRepository{ + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-node-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + Channel: "dev", + TargetVersion: &targetVersion, + Enabled: true, + UpdatedAt: now, + }, + }, + updateServiceCandidates: []NodeUpdateServiceCandidate{{ + NodeID: "update-1", + NodeName: "update-cache-1", + Endpoint: "http://10.0.0.5:19131", + Region: "office", + }}, + } + service := NewService(repo) + service.now = func() time.Time { return now } + + hint := service.GetNodeUpdateHint(context.Background(), "cluster-1", "node-1") + + if !hint.CheckNow || hint.Generation == "" { + t.Fatalf("expected update hint generation, got %+v", hint) + } + if hint.DeliveryMode != "update_service_subscription" || hint.SubscriptionStatus != "subscribed" { + t.Fatalf("unexpected subscription state: %+v", hint) + } + if hint.FallbackPollSeconds != 21600 { + t.Fatalf("fallback poll seconds = %d", hint.FallbackPollSeconds) + } + if hint.UpdateService == nil || hint.UpdateService.NodeID != "update-1" || hint.UpdateService.Status != "assigned" { + t.Fatalf("unexpected update service assignment: %+v", hint.UpdateService) + } +} + +func TestNodeUpdateHintFallsBackWhenNoUpdateServiceHealthy(t *testing.T) { + targetVersion := "0.2.15" + now := time.Date(2026, 5, 2, 8, 0, 0, 0, time.UTC) + repo := &fakeRepository{ + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-host-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-host-agent", + Channel: "dev", + TargetVersion: &targetVersion, + Enabled: true, + UpdatedAt: now, 
+ }, + }, + fabricRebuildAttempts: []FabricServiceChannelRouteRebuildAttempt{{ + ID: "fsc-rebuild-guard-1", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + ServiceClass: FabricServiceClassVPNPackets, + RouteID: "route-bad", + ReplacementRouteID: "route-outside-exit", + RebuildRequestID: "fsc-remediation:channel-guard:rebuild_route:route-outside-exit", + RebuildStatus: "rejected", + RebuildReason: "replacement_exit_outside_signed_pool_policy", + DecisionSource: "service_channel_remediation_command", + Outcome: "policy_guard_rejected", + PolicyFingerprint: "pool-fingerprint-1", + CreatedAt: now, + UpdatedAt: now, + }}, + } + service := NewService(repo) + service.now = func() time.Time { return now } + + hint := service.GetNodeUpdateHint(context.Background(), "cluster-1", "node-1") + + if !hint.CheckNow || hint.UpdateService == nil { + t.Fatalf("expected fallback hint with update service object, got %+v", hint) + } + if hint.UpdateService.Status != "control_plane_fallback" { + t.Fatalf("fallback status = %q", hint.UpdateService.Status) + } +} + func TestCreateJoinTokenRequiresPlatformAdmin(t *testing.T) { store := &fakeRepository{platformRole: "user"} service := NewService(store) @@ -308,6 +397,122 @@ func TestAttachExistingNodeUsesConcreteNodeAndRoles(t *testing.T) { } } +func TestGetDockerInstallProfileBuildsRuntimeProfileFromTokenScope(t *testing.T) { + rawToken := "rap_join_profile" + tokenHash, err := hashJoinToken(rawToken) + if err != nil { + t.Fatalf("hash token: %v", err) + } + store := &fakeRepository{ + validJoinToken: NodeJoinToken{ + ID: "token-1", + ClusterID: "cluster-1", + Status: "active", + Scope: json.RawMessage(`{ + "backend_url": "https://control.example.test/api/v1/", + "roles": ["core-mesh"], + "image": "registry.example.test/rap-node-agent:1", + "artifact_endpoints": ["https://cache-a.example.test/artifacts/"], + "docker_image_artifact_sha256": "abc123", + "mesh_connectivity_mode": "outbound_only", + "mesh_region": "customer-a", + 
"pull_image": true + }`), + }, + } + service := NewService(store) + + profile, err := service.GetDockerInstallProfile(context.Background(), DockerInstallProfileRequest{ + ClusterID: "cluster-1", + InstallToken: rawToken, + NodeName: "Customer Node 1", + }) + if err != nil { + t.Fatalf("profile: %v", err) + } + if store.lastLookupTokenHash != tokenHash { + t.Fatalf("token hash lookup = %q, want %q", store.lastLookupTokenHash, tokenHash) + } + if profile.BackendURL != "https://control.example.test/api/v1" || + profile.JoinToken != rawToken || + profile.NodeName != "Customer Node 1" || + profile.ContainerName != "rap-node-agent-customer-node-1" || + profile.MeshConnectivityMode != "outbound_only" || + profile.MeshRegion != "customer-a" || + len(profile.ArtifactEndpoints) != 1 || + profile.DockerImageArtifact == nil || + profile.DockerImageArtifact.FileName != "registry.example.test-rap-node-agent-1.tar" || + profile.DockerImageArtifact.URLs[0] != "https://cache-a.example.test/artifacts/registry.example.test-rap-node-agent-1.tar" || + profile.DockerImageArtifact.SHA256 != "abc123" || + profile.EnrollmentPollTimeoutSeconds != 0 || + !profile.PullImage || + !profile.MeshSyntheticRuntimeEnabled || + profile.MeshProductionForwardingEnabled { + t.Fatalf("unexpected profile: %+v", profile) + } +} + +func TestGetDockerInstallProfileDefaultsArtifactEndpointFromBackendURL(t *testing.T) { + rawToken := "rap_join_profile" + store := &fakeRepository{ + validJoinToken: NodeJoinToken{ + ID: "token-1", + ClusterID: "cluster-1", + Status: "active", + Scope: json.RawMessage(`{ + "backend_url": "https://control.example.test/api/v1", + "image": "rap-node-agent:dev" + }`), + }, + } + service := NewService(store) + + profile, err := service.GetDockerInstallProfile(context.Background(), DockerInstallProfileRequest{ + ClusterID: "cluster-1", + InstallToken: rawToken, + NodeName: "node-a", + }) + if err != nil { + t.Fatalf("profile: %v", err) + } + if len(profile.ArtifactEndpoints) != 1 || 
profile.ArtifactEndpoints[0] != "https://control.example.test/downloads" { + t.Fatalf("artifact endpoints = %#v", profile.ArtifactEndpoints) + } + if profile.DockerImageArtifact == nil || profile.DockerImageArtifact.URLs[0] != "https://control.example.test/downloads/rap-node-agent-dev.tar" { + t.Fatalf("unexpected artifact: %+v", profile.DockerImageArtifact) + } +} + +func TestGetDockerInstallProfileDoesNotPinFloatingDevArtifactMetadata(t *testing.T) { + rawToken := "rap_join_profile" + store := &fakeRepository{ + validJoinToken: NodeJoinToken{ + ID: "token-1", + ClusterID: "cluster-1", + Status: "active", + Scope: json.RawMessage(`{ + "backend_url": "https://control.example.test/api/v1", + "image": "rap-node-agent:dev-enrollment-bootstrap-smoke" + }`), + }, + } + service := NewService(store) + + profile, err := service.GetDockerInstallProfile(context.Background(), DockerInstallProfileRequest{ + ClusterID: "cluster-1", + InstallToken: rawToken, + NodeName: "node-a", + }) + if err != nil { + t.Fatalf("profile: %v", err) + } + if profile.DockerImageArtifact == nil || + profile.DockerImageArtifact.SHA256 != "" || + profile.DockerImageArtifact.SizeBytes != 0 { + t.Fatalf("unexpected artifact metadata: %+v", profile.DockerImageArtifact) + } +} + func TestCreateJoinRequestRejectsExpiredOrRevokedToken(t *testing.T) { store := &fakeRepository{validTokenErr: ErrInvalidJoinToken} service := NewService(store) @@ -412,6 +617,29 @@ func TestSetDesiredWorkloadRequiresPlatformAdmin(t *testing.T) { } } +func TestListDesiredWorkloadsAllowsNodeScopedAgentReadWithoutActor(t *testing.T) { + store := &fakeRepository{ + desiredWorkloads: []NodeWorkloadDesiredState{{ + ClusterID: "cluster-1", + NodeID: "node-1", + ServiceType: "synthetic.echo", + DesiredState: "enabled", + RuntimeMode: "native", + Config: json.RawMessage(`{}`), + Environment: json.RawMessage(`{}`), + }}, + } + service := NewService(store) + + items, err := service.ListDesiredWorkloads(context.Background(), "", 
"cluster-1", "node-1") + if err != nil { + t.Fatalf("list desired workloads: %v", err) + } + if len(items) != 1 || items[0].ServiceType != "synthetic.echo" { + t.Fatalf("unexpected desired workloads: %+v", items) + } +} + func TestReportWorkloadStatusDefaultsToSafeStubState(t *testing.T) { store := &fakeRepository{} service := NewService(store) @@ -461,6 +689,163 @@ func TestCreateRouteIntentRequiresPlatformAdmin(t *testing.T) { } } +func TestGetVPNClientProfileEnsuresFabricVPNPacketRouteIntents(t *testing.T) { + repo := &fakeRepository{ + vpnClientProfile: VPNClientProfile{ + SchemaVersion: "rap.vpn_client_profile.v1", + Connections: []VPNClientConnection{{ + ID: "vpn-1", + ClientConfig: json.RawMessage(`{ + "vpn_fabric_route": { + "status": "planned", + "selected_entry_node_id": "entry-1", + "selected_exit_node_id": "exit-1" + }, + "vpn_entry_endpoint_candidates": [{ + "node_id": "entry-1", + "endpoint_id": "public-http", + "transport": "direct_http", + "address": "http://entry.example.test:19131", + "api_base_url": "http://entry.example.test:19131/api/v1", + "reachability": "public", + "priority": 0 + }] + }`), + }}, + }, + } + service := NewService(repo) + service.now = func() time.Time { return time.Date(2026, 5, 3, 12, 0, 0, 0, time.UTC) } + + profile, err := service.GetVPNClientProfile(context.Background(), "cluster-1", "org-1", "user-1", "entry-1") + if err != nil { + t.Fatalf("GetVPNClientProfile: %v", err) + } + if len(profile.Connections) != 1 { + t.Fatalf("profile connections = %d, want 1", len(profile.Connections)) + } + var cfg map[string]any + if err := json.Unmarshal(profile.Connections[0].ClientConfig, &cfg); err != nil { + t.Fatalf("unmarshal client config: %v", err) + } + session, ok := cfg["vpn_dataplane_session"].(map[string]any) + if !ok { + t.Fatalf("missing vpn_dataplane_session in %#v", cfg) + } + if session["preferred_transport"] != "fabric_packet_quic_v1" || session["fallback_transport"] != "backend_http_packet_relay" { + 
t.Fatalf("unexpected dataplane session transports: %#v", session) + } + if session["entry_node_id"] != "entry-1" || session["exit_node_id"] != "exit-1" { + t.Fatalf("unexpected dataplane session route: %#v", session) + } + entryCandidates := session["entry_candidates"].([]any) + if len(entryCandidates) != 1 { + t.Fatalf("entry candidate count = %d, want 1", len(entryCandidates)) + } + entryCandidate := entryCandidates[0].(map[string]any) + if entryCandidate["api_base_url"] != "http://entry.example.test:19131/api/v1" || entryCandidate["status"] != "selected_endpoint_public" { + t.Fatalf("unexpected entry candidate: %#v", entryCandidate) + } + transportCandidates := session["transport_candidates"].([]any) + var foundDirect bool + for _, rawCandidate := range transportCandidates { + candidate := rawCandidate.(map[string]any) + if candidate["type"] == "entry_direct_http_v1" { + foundDirect = true + if candidate["status"] != "available" || candidate["safe_client_switch"] != true { + t.Fatalf("unexpected direct entry transport candidate: %#v", candidate) + } + } + } + if !foundDirect { + t.Fatalf("missing entry_direct_http_v1 in %#v", transportCandidates) + } + auth := session["auth"].(map[string]any) + if auth["type"] != "control_plane_issued_bearer" || auth["node_validation"] != "entry_node_calls_control_plane_introspection" { + t.Fatalf("unexpected dataplane session auth: %#v", auth) + } + + if got := len(repo.createdRouteIntents); got != 2 { + t.Fatalf("created route intents = %d, want 2", got) + } + for _, input := range repo.createdRouteIntents { + if input.ClusterID != "cluster-1" || input.ServiceClass != "vpn_packets" || input.Priority != 10 { + t.Fatalf("unexpected route intent input: %+v", input) + } + var policy syntheticRoutePolicy + if err := json.Unmarshal(input.Policy, &policy); err != nil { + t.Fatalf("unmarshal policy: %v", err) + } + if !policy.SyntheticEnabled || !containsString(policy.AllowedChannels, "vpn_packet") || 
!containsString(policy.AllowedChannels, "fabric_control") || len(policy.Hops) != 2 { + t.Fatalf("policy = %+v", policy) + } + } +} + +func TestGetVPNClientProfileForwardsPreferredExit(t *testing.T) { + repo := &fakeRepository{ + vpnClientProfile: VPNClientProfile{ + SchemaVersion: "rap.vpn_client_profile.v1", + Connections: []VPNClientConnection{{ + ID: "vpn-1", + ClientConfig: json.RawMessage(`{ + "vpn_fabric_route": { + "status": "planned", + "selected_entry_node_id": "entry-1", + "selected_exit_node_id": "exit-1" + } + }`), + }}, + }, + } + service := NewService(repo) + + if _, err := service.GetVPNClientProfile(context.Background(), "cluster-1", "org-1", "user-1", "entry-1", "exit-2"); err != nil { + t.Fatalf("GetVPNClientProfile: %v", err) + } + if repo.lastPreferredEntryNodeID != "entry-1" { + t.Fatalf("preferred entry = %q, want entry-1", repo.lastPreferredEntryNodeID) + } + if repo.lastPreferredExitNodeID != "exit-2" { + t.Fatalf("preferred exit = %q, want exit-2", repo.lastPreferredExitNodeID) + } +} + +func TestVPNDirectHTTPEntryTransportWaitsForLocalGatewayShortcutWhenEntryIsExit(t *testing.T) { + candidate := vpnDirectHTTPEntryTransportCandidate(vpnClientFabricRoute{ + SelectedEntryNodeID: "node-1", + SelectedExitNodeID: "node-1", + }, []map[string]any{{ + "node_id": "node-1", + "api_base_url": "http://node.example.test:19131/api/v1", + "reachability": "public", + }}) + if candidate == nil { + t.Fatal("candidate is nil") + } + if candidate["safe_client_switch"] != false || candidate["status"] != "available_local_gateway_shortcut_pending" { + t.Fatalf("unexpected local shortcut guard: %#v", candidate) + } +} + +func TestVPNDirectHTTPEntryTransportAllowsLocalGatewayShortcutWhenReported(t *testing.T) { + candidate := vpnDirectHTTPEntryTransportCandidate(vpnClientFabricRoute{ + SelectedEntryNodeID: "node-1", + SelectedExitNodeID: "node-1", + }, []map[string]any{{ + "node_id": "node-1", + "api_base_url": "http://node.example.test:19131/api/v1", + 
"reachability": "public", + "local_gateway_shortcut": true, + }}) + if candidate == nil { + t.Fatal("candidate is nil") + } + if candidate["safe_client_switch"] != true || candidate["status"] != "available_local_gateway_shortcut" { + t.Fatalf("unexpected local shortcut candidate: %#v", candidate) + } +} + func TestGetNodeSyntheticMeshConfigRequiresTestingFlag(t *testing.T) { service := NewService(&fakeRepository{}) @@ -479,6 +864,309 @@ func TestGetNodeSyntheticMeshConfigRequiresTestingFlag(t *testing.T) { } } +func TestNodeUpdatePlanSelectsMatchingReleaseArtifact(t *testing.T) { + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + releaseVersions: []ReleaseVersion{ + { + ID: "release-1", + ClusterID: "cluster-1", + Product: "rap-node-agent", + Version: "0.1.0-c17z26", + Channel: "dev", + Status: "active", + Artifacts: []ReleaseArtifact{ + {ID: "linux", ClusterID: "cluster-1", Product: "rap-node-agent", Version: "0.1.0-c17z26", OS: "linux", Arch: "amd64", InstallType: "service", Kind: "binary", URL: "https://cache/agent", SHA256: "linux-sha"}, + {ID: "docker", ClusterID: "cluster-1", Product: "rap-node-agent", Version: "0.1.0-c17z26", OS: "linux", Arch: "amd64", InstallType: "docker", Kind: "docker_image_tar", URL: "https://cache/agent.tar", SHA256: "docker-sha"}, + }, + }, + }, + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-node-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + Channel: "dev", + Strategy: "manual", + Enabled: true, + RollbackAllowed: true, + HealthWindowSec: 180, + }, + }, + } + service := NewService(store) + + plan, err := service.GetNodeUpdatePlan(context.Background(), GetNodeUpdatePlanInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + CurrentVersion: "0.1.0-c17z25", + OS: "linux", + Arch: "amd64", + InstallType: "docker", + }) + if err != nil { + t.Fatalf("update plan: %v", err) + } + if plan.Action != "update" || + plan.TargetVersion != 
"0.1.0-c17z26" || + plan.Artifact == nil || + plan.Artifact.ID != "docker" || + plan.ProductionForwarding { + t.Fatalf("unexpected update plan: %+v", plan) + } + if plan.AuthoritySignature == nil || len(plan.AuthorityPayload) == 0 { + t.Fatalf("update plan must be signed: %+v", plan) + } +} + +func TestNodeUpdatePlanAbsolutizesRelativeArtifactURLs(t *testing.T) { + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + releaseVersions: []ReleaseVersion{ + { + ID: "release-1", + ClusterID: "cluster-1", + Product: "rap-node-agent", + Version: "0.2.93", + Channel: "stable", + Status: "active", + Artifacts: []ReleaseArtifact{ + { + ID: "docker", + ClusterID: "cluster-1", + Product: "rap-node-agent", + Version: "0.2.93", + OS: "linux", + Arch: "amd64", + InstallType: "docker", + Kind: "docker_image_tar", + URL: "/downloads/rap-node-agent-0.2.93-docker-amd64.tar", + SHA256: "docker-sha", + Metadata: json.RawMessage(`{"urls":["/downloads/mirror.tar","https://cdn.example.test/agent.tar"]}`), + }, + }, + }, + }, + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-node-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + Channel: "stable", + Strategy: "rolling", + Enabled: true, + RollbackAllowed: true, + HealthWindowSec: 180, + }, + }, + } + service := NewService(store) + + plan, err := service.GetNodeUpdatePlan(context.Background(), GetNodeUpdatePlanInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + CurrentVersion: "0.2.92", + OS: "linux", + Arch: "amd64", + InstallType: "docker", + ArtifactOrigin: "http://vpn.cin.su:19191/api/v1", + }) + if err != nil { + t.Fatalf("update plan: %v", err) + } + if plan.Artifact == nil { + t.Fatal("expected artifact") + } + if plan.Artifact.URL != "http://vpn.cin.su:19191/downloads/rap-node-agent-0.2.93-docker-amd64.tar" { + t.Fatalf("artifact URL was not absolutized: %q", plan.Artifact.URL) + } + wantMirror := 
"http://vpn.cin.su:19191/downloads/mirror.tar" + if len(plan.Artifact.URLs) < 2 || plan.Artifact.URLs[1] != wantMirror || plan.Artifact.URLs[2] != "https://cdn.example.test/agent.tar" { + t.Fatalf("artifact URLs were not preserved/absolutized: %#v", plan.Artifact.URLs) + } +} + +func TestHostAgentUpdatePlanRejectsLinuxArtifactForObservedWindowsNode(t *testing.T) { + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + releaseVersions: []ReleaseVersion{ + { + ID: "release-host", + ClusterID: "cluster-1", + Product: "rap-host-agent", + Version: "0.2.95", + Channel: "stable", + Status: "active", + Artifacts: []ReleaseArtifact{ + {ID: "linux", ClusterID: "cluster-1", Product: "rap-host-agent", Version: "0.2.95", OS: "linux", Arch: "amd64", InstallType: "linux_binary", Kind: "binary", URL: "/downloads/rap-host-agent-linux", SHA256: "linux-sha"}, + }, + }, + }, + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-host-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-host-agent", + Channel: "stable", + Strategy: "rolling", + Enabled: true, + RollbackAllowed: true, + HealthWindowSec: 180, + }, + }, + updateStatuses: []NodeUpdateStatus{ + { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + Phase: "plan", + Status: "noop", + Payload: json.RawMessage(`{"binary_path":"C:\\Program Files\\RAP\\node\\rap-node-agent.exe","task":"RAP Node Agent node"}`), + }, + }, + } + service := NewService(store) + + plan, err := service.GetNodeUpdatePlan(context.Background(), GetNodeUpdatePlanInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-host-agent", + CurrentVersion: "0.2.92", + OS: "linux", + Arch: "amd64", + InstallType: "linux_binary", + }) + if err != nil { + t.Fatalf("update plan: %v", err) + } + if plan.Action != "none" || plan.Reason != "host_agent_artifact_platform_mismatch" || plan.Artifact != nil { + t.Fatalf("unexpected mismatch plan: %+v", plan) + } +} + +func 
TestHostAgentUpdatePlanAllowsWindowsArtifactForObservedWindowsNode(t *testing.T) { + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + releaseVersions: []ReleaseVersion{ + { + ID: "release-host", + ClusterID: "cluster-1", + Product: "rap-host-agent", + Version: "0.2.95", + Channel: "stable", + Status: "active", + Artifacts: []ReleaseArtifact{ + {ID: "windows", ClusterID: "cluster-1", Product: "rap-host-agent", Version: "0.2.95", OS: "windows", Arch: "amd64", InstallType: "windows_binary", Kind: "binary", URL: "/downloads/rap-host-agent.exe", SHA256: "windows-sha"}, + }, + }, + }, + nodeUpdatePolicies: map[string]NodeUpdatePolicy{ + "node-1|rap-host-agent": { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-host-agent", + Channel: "stable", + Strategy: "rolling", + Enabled: true, + RollbackAllowed: true, + HealthWindowSec: 180, + }, + }, + updateStatuses: []NodeUpdateStatus{ + { + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + Phase: "plan", + Status: "noop", + Payload: json.RawMessage(`{"binary_path":"C:\\Program Files\\RAP\\node\\rap-node-agent.exe"}`), + }, + }, + } + service := NewService(store) + + plan, err := service.GetNodeUpdatePlan(context.Background(), GetNodeUpdatePlanInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-host-agent", + CurrentVersion: "0.2.92", + OS: "windows", + Arch: "amd64", + InstallType: "windows_binary", + }) + if err != nil { + t.Fatalf("update plan: %v", err) + } + if plan.Action != "update" || plan.Artifact == nil || plan.Artifact.ID != "windows" { + t.Fatalf("unexpected windows plan: %+v", plan) + } +} + +func TestNodeUpdatePlanNoopsWhenPolicyMissing(t *testing.T) { + service := NewService(&fakeRepository{}) + + plan, err := service.GetNodeUpdatePlan(context.Background(), GetNodeUpdatePlanInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Product: "rap-node-agent", + CurrentVersion: "0.1.0-c17z25", + OS: "linux", + Arch: "amd64", + InstallType: "docker", + 
}) + if err != nil { + t.Fatalf("update plan: %v", err) + } + if plan.Action != "none" || plan.Reason != "no_update_policy" || plan.ProductionForwarding { + t.Fatalf("unexpected missing-policy plan: %+v", plan) + } +} + +func TestGetNodeSyntheticMeshConfigIncludesDesiredMeshListener(t *testing.T) { + now := time.Date(2026, 4, 30, 6, 0, 0, 0, time.UTC) + version := "listener-v1" + updatedBy := "admin-1" + service := NewService(&fakeRepository{ + desiredWorkloads: []NodeWorkloadDesiredState{ + { + ClusterID: "cluster-1", + NodeID: "node-a", + ServiceType: "mesh-listener", + DesiredState: "enabled", + Version: &version, + Config: json.RawMessage(`{"listen_addr":":19140","listen_port_mode":"manual","auto_port_start":19140,"auto_port_end":19149,"connectivity_mode":"private_lan","nat_type":"none","region":"site-a"}`), + UpdatedByUserID: &updatedBy, + UpdatedAt: now, + }, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-a", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if cfg.MeshListener == nil { + t.Fatal("expected mesh listener desired config") + } + if cfg.MeshListener.ListenAddr != ":19140" || + cfg.MeshListener.ListenPortMode != "manual" || + cfg.MeshListener.ConnectivityMode != "private_lan" || + cfg.MeshListener.ConfigVersion != "listener-v1" || + !cfg.MeshListener.ControlPlaneOnly || + cfg.MeshListener.ProductionForwarding { + t.Fatalf("unexpected listener config: %+v", cfg.MeshListener) + } + if cfg.AuthoritySignature == nil || len(cfg.AuthorityPayload) == 0 { + t.Fatal("listener-bearing synthetic config must remain signed") + } +} + func TestGetNodeSyntheticMeshConfigIsNodeScoped(t *testing.T) { now := time.Date(2026, 4, 27, 12, 0, 0, 0, time.UTC) service := NewService(&fakeRepository{ @@ -627,6 +1315,133 @@ func TestGetNodeSyntheticMeshConfigIsNodeScoped(t *testing.T) { } } +func 
TestGetNodeSyntheticMeshConfigSkipsExpiredRouteIntent(t *testing.T) { + now := time.Date(2026, 5, 7, 18, 20, 0, 0, time.UTC) + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "expired-route", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"node-b"}`), + ServiceClass: "vpn_packets", + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["node-a", "node-b"], + "allowed_channels": ["vpn_packet"], + "expires_at": "2026-05-07T18:19:00Z" + }`), + UpdatedAt: now.Add(-time.Minute), + }, + { + ID: "fresh-route", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"node-b"}`), + ServiceClass: "vpn_packets", + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["node-a", "node-b"], + "allowed_channels": ["vpn_packet"], + "expires_at": "2026-05-07T18:25:00Z" + }`), + UpdatedAt: now, + }, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-a", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if containsRouteID(cfg.Routes, "expired-route") { + t.Fatalf("expired route leaked into synthetic config: %+v", cfg.Routes) + } + if !containsRouteID(cfg.Routes, "fresh-route") { + t.Fatalf("fresh route missing from synthetic config: %+v", cfg.Routes) + } +} + +func TestRouteIntentLifecycleActionsMarkExpiredAndDisabled(t *testing.T) { + now := time.Date(2026, 5, 7, 18, 30, 0, 0, time.UTC) + repo := &fakeRepository{ + platformRole: PlatformRoleAdmin, + authorityState: ClusterAuthorityState{ + ClusterID: "cluster-1", + AuthorityState: "authoritative", 
+ MutationMode: "normal", + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"node-b"}`), + ServiceClass: "vpn_packets", + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true}`), + UpdatedAt: now, + }, + { + ID: "route-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"node-c"}`), + ServiceClass: "vpn_packets", + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true}`), + UpdatedAt: now, + }, + }, + } + service := NewService(repo) + service.now = func() time.Time { return now } + + expired, err := service.ExpireRouteIntent(context.Background(), RouteIntentLifecycleInput{ + ActorUserID: "admin", + ClusterID: "cluster-1", + RouteIntentID: "route-a", + Reason: "test cleanup", + }) + if err != nil { + t.Fatalf("expire route intent: %v", err) + } + if expired.LifecycleStatus != "expired" || !expired.IsExpired || expired.PolicyExpiresAt == nil { + t.Fatalf("expired lifecycle = %+v", expired) + } + + disabled, err := service.DisableRouteIntent(context.Background(), RouteIntentLifecycleInput{ + ActorUserID: "admin", + ClusterID: "cluster-1", + RouteIntentID: "route-b", + Reason: "test cleanup", + }) + if err != nil { + t.Fatalf("disable route intent: %v", err) + } + if disabled.Status != "disabled" || disabled.LifecycleStatus != "disabled" { + t.Fatalf("disabled lifecycle = %+v", disabled) + } + + items, err := service.ListRouteIntents(context.Background(), "admin", "cluster-1") + if err != nil { + t.Fatalf("list route intents: %v", err) + } + if len(items) != 2 || items[0].LifecycleStatus == "" || items[1].LifecycleStatus == "" { + t.Fatalf("list lifecycle enrichment missing: %+v", items) + } +} + func TestGetNodeSyntheticMeshConfigUsesReportedMeshEndpoint(t *testing.T) { now := time.Date(2026, 4, 28, 12, 
0, 0, 0, time.UTC) service := NewService(&fakeRepository{ @@ -704,6 +1519,498 @@ func TestGetNodeSyntheticMeshConfigUsesReportedMeshEndpoint(t *testing.T) { } } +func TestGetNodeSyntheticMeshConfigUsesDesiredMeshListenerAdvertiseEndpointForPeer(t *testing.T) { + now := time.Date(2026, 5, 1, 9, 0, 0, 0, time.UTC) + version := "home-1-external-19199" + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-a-home", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"home-1"}`), + ServiceClass: "synthetic", + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true,"hops":["node-a","home-1"]}`), + UpdatedAt: now, + }, + }, + desiredWorkloads: []NodeWorkloadDesiredState{ + { + ClusterID: "cluster-1", + NodeID: "home-1", + ServiceType: "mesh-listener", + DesiredState: "enabled", + Version: &version, + Config: json.RawMessage(`{ + "listen_addr":"0.0.0.0:19131", + "listen_port_mode":"manual", + "advertise_endpoint":"http://94.141.118.222:19199", + "advertise_transport":"direct_http", + "connectivity_mode":"direct", + "nat_type":"port_restricted", + "region":"home" + }`), + UpdatedAt: now, + }, + }, + heartbeats: map[string][]NodeHeartbeat{ + "home-1": { + { + ClusterID: "cluster-1", + NodeID: "home-1", + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "home-1", + "peer_endpoint": "http://192.168.200.85:19131", + "transport": "direct_http", + "connectivity_mode": "private_lan", + "nat_type": "none", + "endpoint_candidates": [ + { + "endpoint_id": "home-1-private", + "node_id": "home-1", + "transport": "direct_http", + "address": "http://192.168.200.85:19131", + "reachability": "private", + "connectivity_mode": "private_lan", + "nat_type": "none", + "priority": 35 + } + ] + } + }`), + ObservedAt: 
now, + }, + }, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-a", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if cfg.PeerEndpoints["home-1"] != "http://94.141.118.222:19199" { + t.Fatalf("desired advertise endpoint should win over private heartbeat endpoint: %+v", cfg.PeerEndpoints) + } + got := cfg.PeerEndpointCandidates["home-1"] + if len(got) != 2 { + t.Fatalf("expected desired and reported candidates: %+v", got) + } + if got[0].EndpointID != "home-1-desired-mesh-listener" || + got[0].Address != "http://94.141.118.222:19199" || + got[0].Reachability != "public" || + got[0].ConnectivityMode != "direct" || + got[0].NATType != "port_restricted" || + got[0].Priority != 0 { + t.Fatalf("unexpected desired candidate: %+v", got[0]) + } +} + +func TestGetNodeSyntheticMeshConfigKeepsOperatorPublicBootstrapPeerBeyondWarmPeerTarget(t *testing.T) { + now := time.Date(2026, 5, 1, 9, 30, 0, 0, time.UTC) + version := "home-1-external-19199" + privateHeartbeat := func(nodeID string, port string) []NodeHeartbeat { + return []NodeHeartbeat{{ + ClusterID: "cluster-1", + NodeID: nodeID, + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "` + nodeID + `", + "peer_endpoint": "http://192.168.200.61:` + port + `", + "transport": "direct_http", + "connectivity_mode": "private_lan", + "nat_type": "none", + "endpoint_candidates": [{ + "endpoint_id": "` + nodeID + `-private", + "node_id": "` + nodeID + `", + "transport": "direct_http", + "address": "http://192.168.200.61:` + port + `", + "reachability": "private", + "connectivity_mode": "private_lan", + "nat_type": "none", + "priority": 35 + }] + } + }`), + }} + } + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: 
true, + }, + clusterNodes: []ClusterNode{ + {ID: "remote", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-5 * time.Hour), LastSeenAt: ptrTime(now)}, + {ID: "test-1", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-4 * time.Hour), LastSeenAt: ptrTime(now.Add(-time.Second))}, + {ID: "test-2", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-3 * time.Hour), LastSeenAt: ptrTime(now.Add(-2 * time.Second))}, + {ID: "test-3", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-2 * time.Hour), LastSeenAt: ptrTime(now.Add(-3 * time.Second))}, + {ID: "home-1", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-time.Hour), LastSeenAt: ptrTime(now.Add(-4 * time.Second))}, + }, + nodeRoles: map[string][]NodeRoleAssignment{ + "remote": {{NodeID: "remote", Role: "core-mesh", Status: "active"}}, + }, + heartbeats: map[string][]NodeHeartbeat{ + "remote": {{ + ClusterID: "cluster-1", + NodeID: "remote", + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "remote", + "connectivity_mode": "outbound_only", + "region": "office" + }, + "mesh_listener_report": { + "inbound_reachability": "outbound_only", + "one_way_connectivity": true + }, + "mesh_outbound_session_report": { + "status": "ready", + "control_plane_url": "https://control.example.test/api/v1" + } + }`), + }}, + "test-1": privateHeartbeat("test-1", "19131"), + "test-2": privateHeartbeat("test-2", "19132"), + "test-3": privateHeartbeat("test-3", "19133"), + "home-1": privateHeartbeat("home-1", "19131"), + }, + desiredWorkloads: []NodeWorkloadDesiredState{{ + ClusterID: "cluster-1", + NodeID: "home-1", + ServiceType: 
"mesh-listener", + DesiredState: "enabled", + Version: &version, + Config: json.RawMessage(`{ + "listen_addr":"0.0.0.0:19131", + "listen_port_mode":"manual", + "advertise_endpoint":"http://94.141.118.222:19199", + "advertise_transport":"direct_http", + "connectivity_mode":"direct", + "nat_type":"port_restricted", + "region":"home" + }`), + UpdatedAt: now, + }}, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "remote", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if cfg.PeerEndpoints["home-1"] != "http://94.141.118.222:19199" { + t.Fatalf("operator public home peer should survive warm-peer target: %+v", cfg.PeerEndpoints) + } + homeCandidates := cfg.PeerEndpointCandidates["home-1"] + if len(homeCandidates) == 0 || homeCandidates[0].EndpointID != "home-1-desired-mesh-listener" { + t.Fatalf("home desired public candidate missing: %+v", homeCandidates) + } + if _, ok := findPeerDirectoryEntry(cfg.PeerDirectory, "home-1"); !ok { + t.Fatalf("home peer directory entry missing: %+v", cfg.PeerDirectory) + } +} + +func TestGetNodeSyntheticMeshConfigFiltersLoopbackReportedMeshEndpoint(t *testing.T) { + now := time.Date(2026, 4, 28, 12, 0, 0, 0, time.UTC) + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-a-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"node-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"node-b"}`), + ServiceClass: "synthetic", + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["node-a", "node-b"] + }`), + UpdatedAt: now, + }, + }, + heartbeats: map[string][]NodeHeartbeat{ + "node-b": { + { + ClusterID: "cluster-1", + NodeID: "node-b", + Metadata: json.RawMessage(`{ + 
"mesh_endpoint_report": { + "schema_version": "c17z25.mesh_endpoint_report.v1", + "cluster_id": "cluster-1", + "node_id": "node-b", + "peer_endpoint": "http://127.0.0.1:19131", + "transport": "direct_http", + "connectivity_mode": "private_lan", + "nat_type": "none", + "endpoint_candidates": [ + { + "endpoint_id": "node-b-loopback", + "node_id": "node-b", + "transport": "direct_http", + "address": "http://127.0.0.1:19131", + "reachability": "private", + "connectivity_mode": "private_lan", + "nat_type": "none", + "priority": 1 + }, + { + "endpoint_id": "node-b-lan", + "node_id": "node-b", + "transport": "direct_http", + "address": "http://192.168.10.20:19131", + "reachability": "private", + "connectivity_mode": "private_lan", + "nat_type": "none", + "priority": 2 + } + ] + } + }`), + ObservedAt: now, + }, + }, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-a", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if _, leaked := cfg.PeerEndpoints["node-b"]; leaked { + t.Fatalf("loopback peer endpoint leaked: %+v", cfg.PeerEndpoints) + } + if got := cfg.PeerEndpointCandidates["node-b"]; len(got) != 1 || got[0].EndpointID != "node-b-lan" || got[0].Address != "http://192.168.10.20:19131" { + t.Fatalf("loopback candidates not filtered correctly: %+v", cfg.PeerEndpointCandidates) + } + entry, ok := findPeerDirectoryEntry(cfg.PeerDirectory, "node-b") + if !ok || entry.EndpointCount != 0 || entry.CandidateCount != 1 { + t.Fatalf("peer directory should expose only usable candidate: %+v", cfg.PeerDirectory) + } +} + +func TestScopedPeerEndpointsFiltersLoopbackPolicyEndpoints(t *testing.T) { + got := scopedPeerEndpoints(map[string]string{ + "node-a": "http://127.0.0.1:19131", + "node-b": "http://0.0.0.0:19132", + "node-c": "http://192.168.10.20:19133", + "node-d": "http://localhost:19134", + }, 
[]string{"node-a", "node-b", "node-c", "node-d"}) + + if len(got) != 1 || got["node-c"] != "http://192.168.10.20:19133" { + t.Fatalf("loopback/wildcard policy endpoints leaked: %+v", got) + } +} + +func TestGetNodeSyntheticMeshConfigBootstrapsCoreMeshPeersFromHealthyNodes(t *testing.T) { + now := time.Date(2026, 4, 28, 12, 0, 0, 0, time.UTC) + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + clusterNodes: []ClusterNode{ + {ID: "node-a", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-2 * time.Hour), LastSeenAt: ptrTime(now)}, + {ID: "node-b", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-time.Hour), LastSeenAt: ptrTime(now.Add(-time.Second))}, + {ID: "node-c", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-30 * time.Minute), LastSeenAt: ptrTime(now.Add(-2 * time.Second))}, + }, + nodeRoles: map[string][]NodeRoleAssignment{ + "node-a": {{NodeID: "node-a", Role: "core-mesh", Status: "active"}}, + }, + heartbeats: map[string][]NodeHeartbeat{ + "node-b": {{ + ClusterID: "cluster-1", + NodeID: "node-b", + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "node-b", + "peer_endpoint": "http://10.0.0.2:19131", + "transport": "direct_http", + "connectivity_mode": "private_lan" + } + }`), + }}, + "node-c": {{ + ClusterID: "cluster-1", + NodeID: "node-c", + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "node-c", + "endpoint_candidates": [{ + "endpoint_id": "node-c-lan", + "node_id": "node-c", + "transport": "direct_http", + "address": "http://10.0.0.3:19131", + "reachability": "private", + "connectivity_mode": "private_lan", + "priority": 1 + }] 
+ } + }`), + }}, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-a", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if cfg.PeerEndpoints["node-b"] != "http://10.0.0.2:19131" { + t.Fatalf("reported peer endpoint not bootstrapped: %+v", cfg.PeerEndpoints) + } + if got := cfg.PeerEndpointCandidates["node-c"]; len(got) != 1 || got[0].EndpointID != "node-c-lan" { + t.Fatalf("reported peer candidates not bootstrapped: %+v", cfg.PeerEndpointCandidates) + } + if len(cfg.RecoverySeeds) != 2 { + t.Fatalf("RecoverySeeds = %+v, want two core mesh bootstrap seeds", cfg.RecoverySeeds) + } + if _, ok := findPeerDirectoryEntry(cfg.PeerDirectory, "node-b"); !ok { + t.Fatalf("peer directory missing node-b: %+v", cfg.PeerDirectory) + } + if _, ok := findPeerDirectoryEntry(cfg.PeerDirectory, "node-c"); !ok { + t.Fatalf("peer directory missing node-c: %+v", cfg.PeerDirectory) + } +} + +func TestGetNodeSyntheticMeshConfigScopesPrivateBootstrapPeersForOutboundOnlyNode(t *testing.T) { + now := time.Date(2026, 4, 28, 12, 0, 0, 0, time.UTC) + service := NewService(&fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + clusterNodes: []ClusterNode{ + {ID: "node-local", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-2 * time.Hour), LastSeenAt: ptrTime(now)}, + {ID: "node-peer", RegistrationStatus: NodeRegistrationActive, HealthStatus: "healthy", MembershipStatus: "active", CreatedAt: now.Add(-time.Hour), LastSeenAt: ptrTime(now.Add(-time.Second))}, + }, + nodeRoles: map[string][]NodeRoleAssignment{ + "node-local": {{NodeID: "node-local", Role: "core-mesh", Status: "active"}}, + }, + heartbeats: map[string][]NodeHeartbeat{ + "node-local": {{ + ClusterID: "cluster-1", + NodeID: 
"node-local", + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "node-local", + "connectivity_mode": "outbound_only", + "region": "office" + }, + "mesh_listener_report": { + "inbound_reachability": "outbound_only", + "one_way_connectivity": true + }, + "mesh_outbound_session_report": { + "status": "ready", + "control_plane_url": "https://control.example.test/api/v1" + } + }`), + }}, + "node-peer": {{ + ClusterID: "cluster-1", + NodeID: "node-peer", + ObservedAt: now, + Metadata: json.RawMessage(`{ + "mesh_endpoint_report": { + "cluster_id": "cluster-1", + "node_id": "node-peer", + "peer_endpoint": "http://192.168.200.61:19133", + "transport": "direct_http", + "connectivity_mode": "private_lan", + "endpoint_candidates": [{ + "endpoint_id": "node-peer-lan", + "node_id": "node-peer", + "transport": "direct_http", + "address": "http://192.168.200.61:19133", + "reachability": "private", + "connectivity_mode": "private_lan", + "priority": 1 + }] + } + }`), + }}, + }, + }) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "node-local", + }) + if err != nil { + t.Fatalf("get synthetic config: %v", err) + } + if endpoint := cfg.PeerEndpoints["node-peer"]; endpoint != "" { + t.Fatalf("private peer endpoint leaked to outbound-only node: %q", endpoint) + } + candidates := cfg.PeerEndpointCandidates["node-peer"] + if len(candidates) != 1 { + t.Fatalf("peer candidates = %+v, want relay-required candidate", cfg.PeerEndpointCandidates) + } + candidate := candidates[0] + if candidate.Transport != "relay" || candidate.Reachability != "relay" || candidate.ConnectivityMode != "relay_required" { + t.Fatalf("candidate not converted to relay required: %+v", candidate) + } + if !containsString(candidate.PolicyTags, "offsite-private-lan-blocked") || 
!containsString(candidate.PolicyTags, "relay-required") { + t.Fatalf("candidate missing offsite relay tags: %+v", candidate.PolicyTags) + } + for _, seed := range cfg.RecoverySeeds { + if seed.NodeID == "node-peer" { + t.Fatalf("private recovery seed leaked to outbound-only node: %+v", cfg.RecoverySeeds) + } + } + entry, ok := findPeerDirectoryEntry(cfg.PeerDirectory, "node-peer") + if !ok || entry.EndpointCount != 0 || entry.CandidateCount != 2 { + t.Fatalf("peer directory should show relay-required candidate and bootstrap lease: %+v", cfg.PeerDirectory) + } + if len(cfg.RendezvousLeases) != 1 { + t.Fatalf("rendezvous leases = %+v, want one control-plane bootstrap lease", cfg.RendezvousLeases) + } + lease := cfg.RendezvousLeases[0] + if lease.PeerNodeID != "node-peer" || + lease.RelayNodeID != "control-plane-relay" || + lease.RelayEndpoint != "https://control.example.test" || + lease.Transport != "relay_control" || + lease.Reason != "control_plane_bootstrap_relay" || + !lease.ControlPlaneOnly { + t.Fatalf("unexpected bootstrap rendezvous lease: %+v", lease) + } +} + func TestGetNodeSyntheticMeshConfigIssuesRendezvousRelayLeases(t *testing.T) { now := time.Date(2026, 4, 28, 12, 0, 0, 0, time.UTC) service := NewService(&fakeRepository{ @@ -1845,6 +3152,59 @@ func TestListNodeVPNAssignmentsDoesNotRequirePlatformAdmin(t *testing.T) { } } +func TestRenewNodeVPNAssignmentLeaseAllowsActiveOwnerWithoutPlatformAdmin(t *testing.T) { + store := &fakeRepository{ + platformRole: "user", + nodeVPNAssignments: []NodeVPNAssignment{ + { + VPNConnectionID: "vpn-1", + ClusterID: "cluster-1", + OrganizationID: "org-1", + AssignmentReason: "active_owner", + ActiveLease: &NodeVPNAssignmentLease{ + LeaseID: "lease-1", + OwnerNodeID: "node-1", + Status: VPNLeaseStatusActive, + }, + }, + }, + } + service := NewService(store) + + lease, err := service.RenewNodeVPNAssignmentLease(context.Background(), RenewNodeVPNAssignmentLeaseInput{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + 
LeaseID: "lease-1", + OwnerNodeID: "node-1", + TTL: time.Minute, + }) + if err != nil { + t.Fatalf("renew node vpn assignment lease: %v", err) + } + if lease.ID != "lease-1" { + t.Fatalf("lease.ID = %q, want lease-1", lease.ID) + } +} + +func TestRenewNodeVPNAssignmentLeaseRejectsNonOwner(t *testing.T) { + store := &fakeRepository{ + nodeVPNAssignments: []NodeVPNAssignment{ + {VPNConnectionID: "vpn-1", ClusterID: "cluster-1", OrganizationID: "org-1", AssignmentReason: "eligible_candidate"}, + }, + } + service := NewService(store) + + _, err := service.RenewNodeVPNAssignmentLease(context.Background(), RenewNodeVPNAssignmentLeaseInput{ + ClusterID: "cluster-1", + VPNConnectionID: "vpn-1", + LeaseID: "lease-1", + OwnerNodeID: "node-1", + }) + if !errors.Is(err, ErrVPNLeaseOwnerNotAllowed) { + t.Fatalf("err = %v, want ErrVPNLeaseOwnerNotAllowed", err) + } +} + func TestReportNodeVPNAssignmentStatusRejectsInvisibleAssignment(t *testing.T) { store := &fakeRepository{} service := NewService(store) @@ -1902,35 +3262,4188 @@ func TestReportNodeVPNAssignmentStatusRejectsInvalidStatus(t *testing.T) { } } +func TestIssueFabricServiceChannelLeaseSelectsAuthorizedRoute(t *testing.T) { + now := time.Date(2026, 5, 7, 12, 0, 0, 0, time.UTC) + service := NewService(&fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-usa-home", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"usa-los-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"home-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 20, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["usa-los-1", "relay-1", "home-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "route_version": "rv-1", + "policy_version": "pv-1" + }`), + UpdatedAt: now, + }, + { + ID: "route-home-home", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"home-1"}`), + DestinationSelector: 
json.RawMessage(`{"node_id":"home-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 5, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["home-1", "home-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + }) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"home-1", "usa-los-1"}, + ExitNodeIDs: []string{"home-1", "ifcm-1"}, + PreferredEntryNodeID: "usa-los-1", + PreferredExitNodeID: "home-1", + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.Status != FabricServiceChannelStatusReady { + t.Fatalf("lease.Status = %q, want ready", lease.Status) + } + if lease.SelectedEntryNodeID != "usa-los-1" || lease.SelectedExitNodeID != "home-1" { + t.Fatalf("selected nodes = %s -> %s", lease.SelectedEntryNodeID, lease.SelectedExitNodeID) + } + if lease.PrimaryRoute.RouteID != "route-usa-home" { + t.Fatalf("primary route = %q, want route-usa-home", lease.PrimaryRoute.RouteID) + } + if lease.RecoveryPolicy == nil || lease.RecoveryPolicy.HysteresisPenalty != fabricServiceChannelRecoveryHysteresisPenalty { + t.Fatalf("lease recovery policy provenance = %+v", lease.RecoveryPolicy) + } + if lease.PrimaryRoute.RecoveryPolicy == nil || lease.PrimaryRoute.RecoveryPolicy.PromotionMinSamples != fabricServiceChannelRecoveryPromotionMinSamples { + t.Fatalf("primary route recovery policy provenance = %+v", lease.PrimaryRoute.RecoveryPolicy) + } + if lease.Fallback.Active || lease.Fallback.Degraded { + t.Fatalf("fallback should be available but inactive: %+v", lease.Fallback) + } + if !containsString(lease.AllowedChannels, "vpn_packet") || 
!containsString(lease.RequiredRoles, "vpn-exit") { + t.Fatalf("unexpected channel/role defaults: channels=%v roles=%v", lease.AllowedChannels, lease.RequiredRoles) + } + if lease.Token.Token == "" || lease.Token.TTLSeconds != 90 { + t.Fatalf("unexpected token contract: %+v", lease.Token) + } + if lease.EntryHTTP.PathTemplate == "" || lease.EntryHTTP.WebSocketPathTemplate == "" { + t.Fatalf("entry http contract must include packet endpoints: %+v", lease.EntryHTTP) + } + if lease.DataPlane.SchemaVersion != "rap.fabric_service_channel_data_plane.v1" || + lease.DataPlane.Mode != "fabric_primary" || + lease.DataPlane.WorkingDataTransport != "fabric_service_channel" || + lease.DataPlane.SteadyStateTransport != "fabric_route" || + lease.DataPlane.BackendRelayPolicy != "degraded_fallback_only" || + !lease.DataPlane.ProductionForwardingRequired || + !lease.DataPlane.ServiceNeutral || + !lease.DataPlane.ProtocolAgnostic || + lease.DataPlane.LogicalFlowMode != "multi_flow_isolated" || + !containsString(lease.DataPlane.RequiredFlowIsolationClasses, "vpn_packet") { + t.Fatalf("unexpected data-plane contract: %+v", lease.DataPlane) + } + if lease.AuthoritySignature == nil || len(lease.AuthorityPayload) == 0 { + t.Fatalf("lease must be signed: payload=%s signature=%+v", string(lease.AuthorityPayload), lease.AuthoritySignature) + } + var signedPayload FabricServiceChannelLeaseAuthorityPayload + if err := json.Unmarshal(lease.AuthorityPayload, &signedPayload); err != nil { + t.Fatalf("unmarshal signed payload: %v", err) + } + if signedPayload.TokenHash != fabricServiceChannelTokenHash(lease.Token.Token) || signedPayload.ChannelID != lease.ChannelID { + t.Fatalf("signed payload does not bind token/channel: %+v", signedPayload) + } + if signedPayload.RecoveryPolicy == nil || signedPayload.RecoveryPolicy.Source != "defaults" { + t.Fatalf("signed payload recovery policy provenance = %+v", signedPayload.RecoveryPolicy) + } + if signedPayload.DataPlane.SchemaVersion != 
lease.DataPlane.SchemaVersion || + signedPayload.DataPlane.WorkingDataTransport != "fabric_service_channel" || + signedPayload.DataPlane.BackendRelayPolicy != "degraded_fallback_only" { + t.Fatalf("signed payload data-plane contract = %+v", signedPayload.DataPlane) + } + store := service.store.(*fakeRepository) + if err := clusterauth.VerifyRaw(store.clusterAuthority.PublicKey, lease.AuthorityPayload, *lease.AuthoritySignature); err != nil { + t.Fatalf("verify lease authority: %v", err) + } +} + +func TestFabricServiceChannelLeaseIntrospectionAllowsFreshToken(t *testing.T) { + store := &fakeRepository{} + service := NewService(store) + service.now = func() time.Time { return time.Date(2026, 5, 8, 14, 0, 0, 0, time.UTC) } + store.routeIntents = []MeshRouteIntent{{ + ID: "route-usa-home", + ClusterID: "cluster-1", + ServiceClass: FabricServiceClassVPNPackets, + Status: "active", + Policy: json.RawMessage(`{ + "schema_version":"rap.synthetic_route_policy.v1", + "source_node_id":"usa-1", + "destination_node_id":"home-1", + "hops":["usa-1","home-1"], + "allowed_channels":["vpn_packet","fabric_control"], + "synthetic_enabled":true + }`), + CreatedAt: service.now(), + UpdatedAt: service.now(), + }} + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + OrganizationID: "org-1", + UserID: "user-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"usa-1"}, + ExitNodeIDs: []string{"home-1"}, + AllowedChannels: []string{ + "vpn_packet", + FabricChannelControl, + }, + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + result, err := service.IntrospectFabricServiceChannelLease(context.Background(), IntrospectFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + ChannelID: lease.ChannelID, + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + ChannelClass: 
"vpn_packet", + Token: lease.Token.Token, + EntryNodeID: "usa-1", + }) + if err != nil { + t.Fatalf("introspect lease: %v", err) + } + if !result.Allowed || result.AcceptedBy != "introspection" || result.PreferredRouteID != "route-usa-home" || result.ForceBackendFallback { + t.Fatalf("unexpected introspection result: %+v", result) + } + if result.DataPlane.SchemaVersion != "rap.fabric_service_channel_data_plane.v1" || + result.DataPlane.WorkingDataTransport != "fabric_service_channel" || + result.DataPlane.SteadyStateTransport != "fabric_route" || + result.DataPlane.BackendRelayPolicy != "degraded_fallback_only" { + t.Fatalf("unexpected introspection data-plane contract: %+v", result.DataPlane) + } +} + +func TestFabricServiceChannelLeaseIntrospectionRejectsWrongToken(t *testing.T) { + store := &fakeRepository{} + service := NewService(store) + service.now = func() time.Time { return time.Date(2026, 5, 8, 14, 0, 0, 0, time.UTC) } + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + OrganizationID: "org-1", + UserID: "user-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"usa-1"}, + ExitNodeIDs: []string{"home-1"}, + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + result, err := service.IntrospectFabricServiceChannelLease(context.Background(), IntrospectFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + ChannelID: lease.ChannelID, + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + ChannelClass: "vpn_packet", + Token: "rap_fsc_wrong", + EntryNodeID: "usa-1", + }) + if err != nil { + t.Fatalf("introspect lease: %v", err) + } + if result.Allowed || result.Reason != "lease_token_mismatch" { + t.Fatalf("unexpected introspection result: %+v", result) + } +} + +func TestFabricServiceChannelLeaseIntrospectionSurvivesServiceRestart(t *testing.T) { 
+ store := &fakeRepository{} + now := time.Date(2026, 5, 8, 14, 30, 0, 0, time.UTC) + service := NewService(store) + service.now = func() time.Time { return now } + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + OrganizationID: "org-1", + UserID: "user-1", + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"usa-1"}, + ExitNodeIDs: []string{"home-1"}, + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + + restarted := NewService(store) + restarted.now = func() time.Time { return now.Add(5 * time.Second) } + result, err := restarted.IntrospectFabricServiceChannelLease(context.Background(), IntrospectFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + ChannelID: lease.ChannelID, + ResourceID: "vpn-1", + ServiceClass: FabricServiceClassVPNPackets, + ChannelClass: "vpn_packet", + Token: lease.Token.Token, + EntryNodeID: "usa-1", + }) + if err != nil { + t.Fatalf("introspect lease after restart: %v", err) + } + if !result.Allowed || result.Reason != "lease_introspection_allowed" { + t.Fatalf("unexpected introspection result: %+v", result) + } + if stored := store.fabricLeases[fabricServiceChannelLeaseCacheKey("cluster-1", lease.ChannelID)]; stored.Lease.Token.Token != "" { + t.Fatalf("stored durable lease must not include raw bearer token: %+v", stored.Lease.Token) + } +} + +func TestFabricServiceChannelLeaseMaintenanceListsAndCleansExpired(t *testing.T) { + store := &fakeRepository{platformRole: PlatformRoleAdmin} + now := time.Date(2026, 5, 8, 15, 0, 0, 0, time.UTC) + activeLease := FabricServiceChannelLease{ + ChannelID: "channel-active", + ClusterID: "cluster-1", + ResourceID: "vpn-active", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + AllowedChannels: 
[]string{"vpn_packet"}, + PrimaryRoute: FabricServiceChannelRoute{RouteID: "route-active", Status: "ready"}, + Token: FabricServiceChannelToken{Token: "rap_fsc_active"}, + IssuedAt: now.Add(-time.Minute), + ExpiresAt: now.Add(time.Minute), + } + expiredLease := activeLease + expiredLease.ChannelID = "channel-expired" + expiredLease.ResourceID = "vpn-expired" + expiredLease.Token = FabricServiceChannelToken{Token: "rap_fsc_expired"} + expiredLease.ExpiresAt = now.Add(-time.Second) + if _, err := store.StoreFabricServiceChannelLease(context.Background(), StoreFabricServiceChannelLeaseInput{Lease: activeLease, TokenHash: fabricServiceChannelTokenHash(activeLease.Token.Token)}); err != nil { + t.Fatalf("store active lease: %v", err) + } + if _, err := store.StoreFabricServiceChannelLease(context.Background(), StoreFabricServiceChannelLeaseInput{Lease: expiredLease, TokenHash: fabricServiceChannelTokenHash(expiredLease.Token.Token)}); err != nil { + t.Fatalf("store expired lease: %v", err) + } + service := NewService(store) + service.now = func() time.Time { return now } + health, err := service.ListFabricServiceChannelLeases(context.Background(), "admin-1", ListFabricServiceChannelLeasesInput{ + ClusterID: "cluster-1", + IncludeExpired: true, + Limit: 10, + }) + if err != nil { + t.Fatalf("list leases: %v", err) + } + if health.ActiveCount != 1 || health.ExpiredCount != 1 || health.Status != "degraded" { + t.Fatalf("unexpected lease maintenance health: %+v", health) + } + cleanup, err := service.CleanupFabricServiceChannelLeases(context.Background(), CleanupFabricServiceChannelLeasesInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("cleanup leases: %v", err) + } + if cleanup.DeletedExpiredCount != 1 || cleanup.ExpiredCount != 0 || cleanup.ActiveCount != 1 || cleanup.Status != "ready" { + t.Fatalf("unexpected cleanup result: %+v", cleanup) + } +} + +func 
TestFabricServiceChannelAccessTelemetryAggregatesNodeReports(t *testing.T) { + now := time.Date(2026, 5, 8, 15, 20, 0, 0, time.UTC) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + clusterNodes: []ClusterNode{ + {ID: "node-1", Name: "entry-1"}, + {ID: "node-2", Name: "entry-2"}, + }, + heartbeats: map[string][]NodeHeartbeat{ + "node-1": { + { + ClusterID: "cluster-1", + NodeID: "node-1", + ObservedAt: now.Add(-2 * time.Second), + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18z64.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + "traffic_class_counts": {"bulk": 32, "interactive": 12}, + "recommended_parallel_windows": {"bulk": 1, "interactive": 4, "control": 4, "reliable": 3, "droppable": 1}, + "adaptive_backpressure_active": true, + "adaptive_backpressure_reason": "bulk_window_reduced_to_protect_interactive", + "channel_count": 44, + "dropped": 0, + "high_watermark": 25, + "max_in_flight": 4, + "channel_stats": {} + } + } + } + }`), + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := store.RecordNodeTelemetry(context.Background(), RecordNodeTelemetryInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Payload: json.RawMessage(`{ + "fabric_service_channel_access_report": { + "schema_version": "c18z52.fabric_service_channel_access_report.v1", + "total": 7, + "signed": 3, + "introspection": 4, + "legacy_unsigned": 0, + "backend_fallback": 2, + "data_plane_contract": 5, + "last_data_plane_mode": "fabric_primary", + "last_working_data_transport": "fabric_service_channel", + "last_steady_state_transport": "fabric_route", + "last_backend_relay_policy": "degraded_fallback_only", + "last_logical_flow_mode": "multi_flow_isolated", + "last_accepted_at": "2026-05-08T15:19:59Z" + } + }`), + ObservedAt: now, + }) + if err != nil { + t.Fatalf("record telemetry: %v", err) + } + expiresAt := now.Add(5 * time.Minute) + 
store.fabricLeases = map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-1"): { + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "node-1", + ExpiresAt: expiresAt, + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "node-1", + SelectedExitNodeID: "node-2", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-1", + Status: "ready", + }, + ExpiresAt: expiresAt, + }, + }, + } + _, err = store.RecordFabricServiceChannelRouteFeedback(context.Background(), RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + ReporterNodeID: "node-1", + RouteID: "route-1", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 15, + LastSendDurationMs: 42, + Payload: json.RawMessage(`{"quality_window_sample_count":5,"quality_window_failure_count":0,"quality_window_drop_count":0,"quality_window_slow_count":1}`), + ObservedAt: now, + ExpiresAt: expiresAt, + }) + if err != nil { + t.Fatalf("record route feedback: %v", err) + } + report, err := service.GetFabricServiceChannelAccessTelemetry(context.Background(), "admin-1", GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("get access telemetry: %v", err) + } + if report.ReportingNodeCount != 1 || report.TotalAccepted != 7 || report.SignedAccepted != 3 || report.IntrospectionAccepted != 4 || report.BackendFallbackCount != 2 { + t.Fatalf("unexpected access telemetry: %+v", report) + } + if report.DataPlaneContractCount != 5 || report.LastDataPlaneMode != "fabric_primary" || report.LastWorkingDataTransport != "fabric_service_channel" || report.LastSteadyStateTransport != 
"fabric_route" || report.LastBackendRelayPolicy != "degraded_fallback_only" || report.LastLogicalFlowMode != "multi_flow_isolated" { + t.Fatalf("unexpected aggregate data-plane telemetry: %+v", report) + } + if report.Nodes[0].DataPlaneContractCount != 5 || report.Nodes[0].LastWorkingDataTransport != "fabric_service_channel" || report.Nodes[0].LastBackendRelayPolicy != "degraded_fallback_only" || report.Nodes[0].LastLogicalFlowMode != "multi_flow_isolated" { + t.Fatalf("unexpected node data-plane telemetry: %+v", report.Nodes[0]) + } + if got := report.Nodes[0].TrafficClassCounts["bulk"]; got != 32 { + t.Fatalf("bulk traffic class count = %d, want 32: %+v", got, report.Nodes[0]) + } + if report.TrafficClassCounts["bulk"] != 32 || report.TrafficClassCounts["interactive"] != 12 || report.FlowChannelCount != 44 || report.FlowMaxInFlight != 4 { + t.Fatalf("unexpected aggregate flow telemetry: %+v", report) + } + if report.FlowHealthStatus != "degraded" || report.FlowHealthReason != "backend_fallback_observed" { + t.Fatalf("unexpected aggregate flow health: %+v", report) + } + if !report.AdaptiveBackpressureActive || report.AdaptiveBackpressureReason != "bulk_window_reduced_to_protect_interactive" || report.RecommendedParallelWindows["bulk"] != 1 || report.RecommendedParallelWindows["interactive"] != 4 { + t.Fatalf("unexpected aggregate adaptive backpressure: %+v", report) + } + if report.Nodes[0].FlowChannelCount != 44 || report.Nodes[0].FlowHighWatermark != 25 || report.Nodes[0].FlowMaxInFlight != 4 { + t.Fatalf("unexpected flow telemetry on node: %+v", report.Nodes[0]) + } + if report.Nodes[0].FlowHealthStatus != "degraded" || report.Nodes[0].FlowHealthReason != "backend_fallback_observed" { + t.Fatalf("unexpected node flow health: %+v", report.Nodes[0]) + } + if !report.Nodes[0].AdaptiveBackpressureActive || report.Nodes[0].RecommendedParallelWindows["control"] != 4 || report.Nodes[0].RecommendedParallelWindows["droppable"] != 1 { + t.Fatalf("unexpected node 
adaptive backpressure: %+v", report.Nodes[0]) + } + if report.ActiveChannelCount != 1 || report.CorrelatedRouteCount != 1 || report.DegradedRouteCount != 0 { + t.Fatalf("unexpected channel correlation counters: %+v", report) + } + if len(report.ActiveChannels) != 1 { + t.Fatalf("expected one active channel, got %d", len(report.ActiveChannels)) + } + channel := report.ActiveChannels[0] + if channel.ChannelID != "channel-1" || channel.EntryNodeTotalAccepted != 7 || channel.RouteFeedbackStatus != "healthy" || channel.RouteQualityWindowSampleCount != 5 || channel.LastSendDurationMs != 42 { + t.Fatalf("unexpected active channel correlation: %+v", channel) + } + if channel.EntryNodeDataPlaneContractCount != 5 || channel.EntryNodeLastDataPlaneMode != "fabric_primary" || channel.EntryNodeLastWorkingDataTransport != "fabric_service_channel" || channel.EntryNodeLastSteadyStateTransport != "fabric_route" || channel.EntryNodeLastBackendRelayPolicy != "degraded_fallback_only" || channel.EntryNodeLastLogicalFlowMode != "multi_flow_isolated" { + t.Fatalf("unexpected active channel data-plane telemetry: %+v", channel) + } + if channel.EntryNodeTrafficClassCounts["interactive"] != 12 || channel.EntryNodeFlowChannelCount != 44 || channel.EntryNodeFlowMaxInFlight != 4 { + t.Fatalf("unexpected active channel flow telemetry: %+v", channel) + } + if channel.EntryNodeFlowHealthStatus != "degraded" || channel.EntryNodeFlowHealthReason != "backend_fallback_observed" { + t.Fatalf("unexpected channel flow health: %+v", channel) + } + if !channel.EntryNodeAdaptiveBackpressureActive || channel.EntryNodeAdaptiveBackpressureReason != "bulk_window_reduced_to_protect_interactive" || channel.EntryNodeRecommendedParallelWindows["bulk"] != 1 { + t.Fatalf("unexpected channel adaptive backpressure: %+v", channel) + } + if channel.RemediationAction != "none" { + t.Fatalf("healthy route should not need remediation: %+v", channel) + } + incidents, err := 
service.ListFabricServiceChannelRouteRebuildIncidents(context.Background(), "admin-1", ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("list rebuild incidents: %v", err) + } + if len(incidents) == 0 || + incidents[0].IncidentSource != "data_plane_contract" || + incidents[0].ChannelID != "channel-1" || + incidents[0].GuardStatus != "data_plane_degraded_backend_relay_observed" || + incidents[0].GuardSeverity != "warn" || + incidents[0].RecommendedOperatorAction != "restore_fabric_route_and_treat_backend_relay_as_degraded_only" { + t.Fatalf("unexpected data-plane incident projection: %+v", incidents) + } +} + +func TestFabricServiceChannelFlowHealthPolicyClassifiesPressure(t *testing.T) { + status, reason, action := fabricServiceChannelFlowHealth(map[string]int{"bulk": 32, "interactive": 12}, 0, 25, 4, 0, 132, 0, 0, 0) + if status != "watch" || reason != "bulk_pressure_with_interactive_qos_observed" || action == "" { + t.Fatalf("unexpected healthy pressure classification: status=%q reason=%q action=%q", status, reason, action) + } + status, reason, _ = fabricServiceChannelFlowHealth(map[string]int{"bulk": 32}, 1, 25, 4, 0, 0, 0, 0, 0) + if status != "critical" || reason != "flow_drops_reported" { + t.Fatalf("unexpected drop classification: status=%q reason=%q", status, reason) + } + status, reason, _ = fabricServiceChannelFlowHealth(map[string]int{"bulk": 2}, 0, 4, 1, 0, 1500, 0, 0, 0) + if status != "degraded" || reason != "route_send_latency_high" { + t.Fatalf("unexpected latency classification: status=%q reason=%q", status, reason) + } +} + +func TestFabricServiceChannelAccessTelemetryRecommendsAlternateForDegradedRoute(t *testing.T) { + now := time.Date(2026, 5, 8, 15, 55, 0, 0, time.UTC) + expiresAt := now.Add(5 * time.Minute) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + clusterNodes: []ClusterNode{ + {ID: "node-1", Name: "entry-1"}, + {ID: "node-2", Name: "exit-1"}, 
+ }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-1"): { + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "node-1", + ExpiresAt: expiresAt, + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "node-1", + SelectedExitNodeID: "node-2", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + Status: "authorized", + }, + AlternateRoutes: []FabricServiceChannelRoute{{ + RouteID: "route-alt", + Status: "authorized", + }}, + ExpiresAt: expiresAt, + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := store.RecordNodeTelemetry(context.Background(), RecordNodeTelemetryInput{ + ClusterID: "cluster-1", + NodeID: "node-1", + Payload: json.RawMessage(`{ + "fabric_service_channel_access_report": { + "total": 4, + "introspection": 4, + "backend_fallback": 0 + } + }`), + ObservedAt: now, + }) + if err != nil { + t.Fatalf("record telemetry: %v", err) + } + _, err = store.RecordFabricServiceChannelRouteFeedback(context.Background(), RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + ReporterNodeID: "node-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + LastSendDurationMs: 1200, + Payload: json.RawMessage(`{"quality_window_sample_count":7,"quality_window_failure_count":3,"quality_window_drop_count":1}`), + ObservedAt: now, + ExpiresAt: expiresAt, + }) + if err != nil { + t.Fatalf("record route feedback: %v", err) + } + _, err = store.RecordHeartbeat(context.Background(), RecordHeartbeatInput{ + ClusterID: 
"cluster-1", + NodeID: "node-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "ingress": { + "route_manager": { + "last_applied_at": "2026-05-08T15:55:01Z", + "decisions": [{ + "route_id": "route-bad", + "replacement_route_id": "route-alt", + "rebuild_request_id": "fsc-remediation:channel-1:prefer_alternate_route:route-alt", + "rebuild_status": "applied", + "rebuild_reason": "authorized_alternate_route_available", + "decision_source": "service_channel_remediation_command", + "generation": "config-c18z74" + }] + }, + "route_manager_transition": { + "status": "applied_rebuild", + "generation": "config-c18z74", + "observed_at": "2026-05-08T15:55:01Z" + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + report, err := service.GetFabricServiceChannelAccessTelemetry(context.Background(), "admin-1", GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("get access telemetry: %v", err) + } + if report.DegradedRouteCount != 1 || report.Status != "degraded" { + t.Fatalf("expected degraded route aggregate: %+v", report) + } + if len(report.ActiveChannels) != 1 { + t.Fatalf("expected one active channel, got %d", len(report.ActiveChannels)) + } + channel := report.ActiveChannels[0] + if channel.RemediationAction != "prefer_alternate_route" || channel.RemediationRouteID != "route-alt" { + t.Fatalf("expected alternate remediation, got %+v", channel) + } + if channel.RemediationCommand == nil { + t.Fatalf("expected bounded remediation command, got %+v", channel) + } + if channel.RemediationCommand.Action != "prefer_alternate_route" || + channel.RemediationCommand.ReplacementRouteID != "route-alt" || + channel.RemediationCommand.PrimaryRouteID != "route-bad" || + channel.RemediationCommand.ClusterID != "cluster-1" { + t.Fatalf("unexpected remediation command: %+v", channel.RemediationCommand) + } + if 
channel.RemediationExecutionStatus != "applied" || + channel.RemediationExecutionReason != "authorized_alternate_route_available" || + channel.RemediationExecutionGeneration != "config-c18z74" || + channel.RemediationCommand.ExecutionStatus != "applied" { + t.Fatalf("unexpected remediation execution: channel=%+v command=%+v", channel, channel.RemediationCommand) + } + if !channel.RemediationCommand.IssuedAt.Equal(now) || channel.RemediationCommand.ExpiresAt.After(expiresAt) || !channel.RemediationCommand.ExpiresAt.After(now) { + t.Fatalf("unexpected remediation command ttl: %+v", channel.RemediationCommand) + } +} + +func TestFabricServiceChannelAccessTelemetryRejectsAlternateOutsideSignedPoolPolicy(t *testing.T) { + now := time.Date(2026, 5, 8, 16, 15, 0, 0, time.UTC) + expiresAt := now.Add(5 * time.Minute) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + clusterNodes: []ClusterNode{ + {ID: "entry-1", Name: "entry-1"}, + {ID: "exit-1", Name: "exit-1"}, + {ID: "exit-2", Name: "exit-2"}, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-guard"): { + ClusterID: "cluster-1", + ChannelID: "channel-guard", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: expiresAt, + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-guard", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + EntryPool: []FabricServiceChannelNodeCandidate{{ + NodeID: "entry-1", + Status: "selected", + }}, + ExitPool: []FabricServiceChannelNodeCandidate{{ + NodeID: "exit-1", + Status: "selected", + }}, + PoolPolicy: &FabricServiceChannelPoolPolicy{ + SchemaVersion: "rap.fabric_service_channel_pool_policy.v1", + Fingerprint: "pool-fingerprint-1", + EntryPoolNodeIDs: []string{"entry-1"}, + 
ExitPoolNodeIDs: []string{"exit-1"}, + SelectionStrategy: "fastest_healthy", + RouteRebuild: "automatic", + EntryFailover: "automatic", + ExitFailover: "automatic", + BackendFallbackAllowed: true, + StickySession: true, + Source: "cluster_metadata", + }, + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + ClusterID: "cluster-1", + ServiceClass: FabricServiceClassVPNPackets, + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + Status: "authorized", + }, + AlternateRoutes: []FabricServiceChannelRoute{{ + RouteID: "route-outside-exit", + ClusterID: "cluster-1", + ServiceClass: FabricServiceClassVPNPackets, + SourceNodeID: "entry-1", + DestinationNodeID: "exit-2", + Status: "authorized", + }}, + ExpiresAt: expiresAt, + }, + }, + }, + fabricRebuildAttempts: []FabricServiceChannelRouteRebuildAttempt{{ + ID: "fsc-rebuild-guard-1", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + ServiceClass: FabricServiceClassVPNPackets, + RouteID: "route-bad", + ReplacementRouteID: "route-outside-exit", + RebuildRequestID: "fsc-remediation:channel-guard:rebuild_route:route-outside-exit", + RebuildStatus: "rejected", + RebuildReason: "replacement_exit_outside_signed_pool_policy", + DecisionSource: "service_channel_remediation_command", + Outcome: "policy_guard_rejected", + PolicyFingerprint: "pool-fingerprint-1", + CreatedAt: now, + UpdatedAt: now, + }}, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := store.RecordFabricServiceChannelRouteFeedback(context.Background(), RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + LastSendDurationMs: 1200, + ObservedAt: now, + ExpiresAt: expiresAt, + }) + if err != nil { + t.Fatalf("record route feedback: %v", err) + } + report, err := 
service.GetFabricServiceChannelAccessTelemetry(context.Background(), "admin-1", GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("get access telemetry: %v", err) + } + if len(report.ActiveChannels) != 1 { + t.Fatalf("expected one active channel, got %d", len(report.ActiveChannels)) + } + channel := report.ActiveChannels[0] + if channel.RemediationAction != "rebuild_route" || + channel.RemediationReason != "alternate_route_rejected_by_pool_policy" || + channel.RemediationRouteID != "route-outside-exit" || + channel.RemediationGuardStatus != "rejected" || + channel.RemediationGuardReason != "replacement_exit_outside_signed_pool_policy" || + channel.PoolPolicyFingerprint != "pool-fingerprint-1" { + t.Fatalf("expected guarded rebuild remediation, got %+v", channel) + } + if channel.RemediationCommand == nil { + t.Fatalf("expected guarded remediation command, got %+v", channel) + } + if channel.RemediationCommand.Action != "rebuild_route" || + channel.RemediationCommand.GuardStatus != "rejected" || + channel.RemediationCommand.GuardReason != "replacement_exit_outside_signed_pool_policy" || + channel.RemediationCommand.PoolPolicyFingerprint != "pool-fingerprint-1" || + channel.RemediationCommand.ExecutionStatus != "rebuild_request_rejected" { + t.Fatalf("unexpected guarded remediation command: %+v", channel.RemediationCommand) + } +} + +func TestFabricServiceChannelAccessTelemetryShowsRebuildRouteNodePending(t *testing.T) { + now := time.Date(2026, 5, 8, 16, 50, 0, 0, time.UTC) + expiresAt := now.Add(5 * time.Minute) + commandID := "fsc-remediation:channel-pending:rebuild_route:route-bad" + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + clusterNodes: []ClusterNode{ + {ID: "entry-1", Name: "entry-1"}, + {ID: "exit-1", Name: "exit-1"}, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-pending"): { + 
ClusterID: "cluster-1", + ChannelID: "channel-pending", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: expiresAt, + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-pending", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + ClusterID: "cluster-1", + ServiceClass: FabricServiceClassVPNPackets, + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + Status: "authorized", + }, + ExpiresAt: expiresAt, + }, + }, + }, + fabricRebuildAttempts: []FabricServiceChannelRouteRebuildAttempt{{ + ID: "fsc-rebuild-pending-1", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + ServiceClass: FabricServiceClassVPNPackets, + RouteID: "route-bad", + RebuildRequestID: commandID, + RebuildStatus: "requested", + RebuildReason: "route_feedback_recommends_rebuild", + DecisionSource: "service_channel_remediation_command", + Outcome: "rebuild_requested", + Generation: commandID, + CreatedAt: now, + UpdatedAt: now, + }}, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := store.RecordFabricServiceChannelRouteFeedback(context.Background(), RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ObservedAt: now, + ExpiresAt: expiresAt, + }) + if err != nil { + t.Fatalf("record route feedback: %v", err) + } + _, err = store.RecordHeartbeat(context.Background(), RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + 
"fabric_service_channel_runtime_report": { + "ingress": { + "route_manager": { + "last_applied_at": "2026-05-08T16:50:01Z", + "decisions": [{ + "route_id": "route-bad", + "rebuild_request_id": "fsc-remediation:channel-pending:rebuild_route:route-bad", + "rebuild_status": "pending_degraded_fallback", + "rebuild_reason": "route_feedback_recommends_rebuild", + "decision_source": "service_channel_remediation_command", + "generation": "fsc-remediation:channel-pending:rebuild_route:route-bad" + }] + }, + "route_manager_transition": { + "status": "pending_degraded_fallback", + "generation": "fsc-remediation:channel-pending:rebuild_route:route-bad", + "observed_at": "2026-05-08T16:50:01Z" + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + report, err := service.GetFabricServiceChannelAccessTelemetry(context.Background(), "admin-1", GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("get access telemetry: %v", err) + } + if len(report.ActiveChannels) != 1 { + t.Fatalf("expected one active channel, got %d", len(report.ActiveChannels)) + } + channel := report.ActiveChannels[0] + if channel.RemediationAction != "rebuild_route" || + channel.RemediationExecutionStatus != "rebuild_request_recorded_node_pending" || + channel.RemediationExecutionGeneration != commandID || + channel.RouteDecisionSource != "service_channel_remediation_command" || + channel.RouteDecisionRebuildStatus != "pending_degraded_fallback" || + channel.RemediationCommand == nil || + channel.RemediationCommand.ExecutionStatus != "rebuild_request_recorded_node_pending" { + t.Fatalf("unexpected rebuild route execution: channel=%+v command=%+v", channel, channel.RemediationCommand) + } +} + +func TestFabricServiceChannelAccessTelemetryProjectsNoSafeRecoveryDecision(t *testing.T) { + now := time.Date(2026, 5, 9, 3, 10, 0, 0, time.UTC) + expiresAt := now.Add(5 * time.Minute) + store := &fakeRepository{ 
+ platformRole: PlatformRoleAdmin, + clusterNodes: []ClusterNode{ + {ID: "entry-1", Name: "entry-1"}, + {ID: "exit-1", Name: "exit-1"}, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-no-safe"): { + ClusterID: "cluster-1", + ChannelID: "channel-no-safe", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: expiresAt, + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-no-safe", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-primary", + ClusterID: "cluster-1", + ServiceClass: FabricServiceClassVPNPackets, + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + Status: "authorized", + }, + ExpiresAt: expiresAt, + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := store.RecordHeartbeat(context.Background(), RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "ingress": { + "route_manager": { + "last_applied_at": "2026-05-09T03:10:01Z", + "decisions": [{ + "route_id": "route-replacement", + "source_node_id": "entry-1", + "destination_node_id": "exit-1", + "local_node_id": "entry-1", + "decision_source": "service_channel_feedback_no_alternate", + "rebuild_status": "pending_degraded_fallback", + "rebuild_reason": "service_channel_feedback_rebuild_requested", + "generation": "c18z82-generation", + "score_reasons": [ + "service_channel_fenced_route", + "no_unfenced_alternate_route", + "backend_relay_degraded_fallback_until_rebuild" + ] + }] + }, + "route_manager_transition": { + "status": "pending_degraded_fallback", + "generation": 
"c18z82-generation", + "observed_at": "2026-05-09T03:10:01Z" + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + report, err := service.GetFabricServiceChannelAccessTelemetry(context.Background(), "admin-1", GetFabricServiceChannelAccessTelemetryInput{ + ClusterID: "cluster-1", + Limit: 10, + Now: now, + }) + if err != nil { + t.Fatalf("get access telemetry: %v", err) + } + if len(report.ActiveChannels) != 1 { + t.Fatalf("expected one active channel, got %d", len(report.ActiveChannels)) + } + channel := report.ActiveChannels[0] + if channel.RouteDecisionSource != "service_channel_feedback_no_alternate" || + channel.RouteDecisionRouteID != "route-replacement" || + channel.RouteDecisionRebuildStatus != "pending_degraded_fallback" || + !containsString(channel.RouteDecisionScoreReasons, "no_unfenced_alternate_route") || + channel.RemediationAction != "use_backend_fallback" || + channel.RemediationExecutionStatus != "route_rebuild_no_safe_recovery" { + t.Fatalf("unexpected no-safe route decision projection: %+v", channel) + } + if report.RouteDecisionChannelCount != 1 || + report.NoSafeRecoveryDecisionCount != 1 || + report.ReplacementDecisionCount != 0 || + report.AppliedRebuildDecisionCount != 0 || + report.Status != "degraded" || + report.Reason != "active_channels_no_safe_recovery" { + t.Fatalf("unexpected no-safe route decision aggregate: %+v", report) + } + health, err := service.GetFabricServiceChannelRouteRebuildHealthSummary(context.Background(), "admin-1", GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("get rebuild health: %v", err) + } + if health.AccessRouteDecisionCount != 1 || + health.AccessNoSafeCount != 1 || + health.ActiveBadCount != 1 || + health.RecommendedOperatorAction != "inspect_access_no_safe_recovery_route_pool_and_signed_policy" { + t.Fatalf("unexpected rebuild health access decision projection: %+v", health) + } + incidents, 
err := service.ListFabricServiceChannelRouteRebuildIncidents(context.Background(), "admin-1", ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("list rebuild incidents: %v", err) + } + if len(incidents) == 0 || + incidents[0].IncidentSource != "access_decision" || + incidents[0].ChannelID != "channel-no-safe" || + incidents[0].GuardStatus != "access_no_safe_recovery" || + incidents[0].GuardSeverity != "bad" { + t.Fatalf("unexpected access decision incident projection: %+v", incidents) + } + silence, err := service.SilenceFabricServiceChannelRouteRebuildAlert(context.Background(), SilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + IncidentSource: "access_decision", + ChannelID: incidents[0].ChannelID, + ReporterNodeID: incidents[0].ReporterNodeID, + RouteID: incidents[0].RouteID, + GuardStatus: incidents[0].GuardStatus, + Generation: incidents[0].Generation, + Reason: "operator acknowledged access no-safe", + TTL: 6 * time.Hour, + Now: now, + }) + if err != nil { + t.Fatalf("silence access decision incident: %v", err) + } + health, err = service.GetFabricServiceChannelRouteRebuildHealthSummary(context.Background(), "admin-1", GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("get silenced rebuild health: %v", err) + } + if health.AccessNoSafeCount != 1 || health.ActiveBadCount != 0 || health.SilencedCount != 1 { + t.Fatalf("unexpected silenced access decision health: %+v", health) + } + incidents, err = service.ListFabricServiceChannelRouteRebuildIncidents(context.Background(), "admin-1", ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("list silenced rebuild incidents: %v", err) + } + if len(incidents) == 0 || !incidents[0].AlertSilenced { + t.Fatalf("expected silenced access decision 
incident: %+v", incidents) + } + silences, err := service.ListFabricServiceChannelRouteRebuildAlertSilences(context.Background(), "admin-1", "cluster-1", now) + if err != nil { + t.Fatalf("list rebuild alert silences: %v", err) + } + if len(silences) != 1 || + silences[0].ID != silence.ID || + silences[0].IncidentSource != "access_decision" || + silences[0].ChannelID != "channel-no-safe" || + silences[0].DisplayRouteID != "route-replacement" { + t.Fatalf("unexpected listed access decision silence: %+v", silences) + } + _, err = service.UnsilenceFabricServiceChannelRouteRebuildAlert(context.Background(), UnsilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + SilenceID: silence.ID, + Reason: "operator reopened access no-safe", + Now: now.Add(time.Minute), + }) + if err != nil { + t.Fatalf("unsilence access decision incident: %v", err) + } + health, err = service.GetFabricServiceChannelRouteRebuildHealthSummary(context.Background(), "admin-1", GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("get unsilenced rebuild health: %v", err) + } + if health.ActiveBadCount != 1 || health.SilencedCount != 0 { + t.Fatalf("unexpected unsilenced access decision health: %+v", health) + } + silence, err = service.SilenceFabricServiceChannelRouteRebuildAlert(context.Background(), SilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + IncidentSource: "access_decision", + ChannelID: incidents[0].ChannelID, + ReporterNodeID: incidents[0].ReporterNodeID, + RouteID: incidents[0].RouteID, + GuardStatus: incidents[0].GuardStatus, + Generation: incidents[0].Generation, + Reason: "operator acknowledged access no-safe again", + TTL: 6 * time.Hour, + Now: now, + }) + if err != nil { + t.Fatalf("resilence access decision incident: %v", err) + } + _, err = store.RecordHeartbeat(context.Background(), 
RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "ingress": { + "route_manager": { + "last_applied_at": "2026-05-09T03:11:01Z", + "decisions": [{ + "route_id": "route-replacement", + "source_node_id": "entry-1", + "destination_node_id": "exit-1", + "local_node_id": "entry-1", + "decision_source": "service_channel_feedback_no_alternate", + "rebuild_status": "pending_degraded_fallback", + "rebuild_reason": "service_channel_feedback_rebuild_requested", + "generation": "c18z82-generation-next", + "score_reasons": ["service_channel_fenced_route", "no_unfenced_alternate_route"] + }] + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record resurfaced heartbeat: %v", err) + } + incidents, err = service.ListFabricServiceChannelRouteRebuildIncidents(context.Background(), "admin-1", ListFabricServiceChannelRouteRebuildIncidentsInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("list resurfaced rebuild incidents: %v", err) + } + if len(incidents) == 0 || incidents[0].AlertSilenced || !incidents[0].AlertResurfaced || incidents[0].Generation != "c18z82-generation-next" || + incidents[0].AlertResurfacedCause != "generation_changed" || + incidents[0].AlertResurfacedPreviousGeneration != "c18z82-generation" || + incidents[0].AlertResurfacedPreviousRouteID != "route-replacement" || + incidents[0].AlertResurfacedPreviousChannelID != "channel-no-safe" { + t.Fatalf("expected resurfaced access decision incident on new generation: %+v", incidents) + } +} + +func TestRecordFabricServiceChannelRemediationRebuildIntentsPersistsRequestedAndRejected(t *testing.T) { + now := time.Date(2026, 5, 8, 16, 45, 0, 0, time.UTC) + store := &fakeRepository{} + service := NewService(store) + err := service.recordFabricServiceChannelRemediationRebuildIntents(context.Background(), "cluster-1", "entry-1", []FabricServiceChannelAccessRemediationCommand{ 
+ { + CommandID: "cmd-requested", + Action: "rebuild_route", + ChannelID: "channel-1", + ServiceClass: FabricServiceClassVPNPackets, + PrimaryRouteID: "route-a", + PoolPolicyFingerprint: "pool-fp-1", + GuardStatus: "allowed", + GuardReason: "lease_pool_policy_allows_route", + Reason: "route_feedback_recommends_rebuild", + ExpiresAt: now.Add(time.Minute), + }, + { + CommandID: "cmd-rejected", + Action: "rebuild_route", + ChannelID: "channel-2", + ServiceClass: FabricServiceClassVPNPackets, + PrimaryRouteID: "route-b", + ReplacementRouteID: "route-outside", + PoolPolicyFingerprint: "pool-fp-2", + GuardStatus: "rejected", + GuardReason: "replacement_exit_outside_signed_pool_policy", + Reason: "alternate_route_rejected_by_pool_policy", + ExpiresAt: now.Add(time.Minute), + }, + }, now) + if err != nil { + t.Fatalf("record rebuild intents: %v", err) + } + if len(store.fabricRebuildAttempts) != 2 { + t.Fatalf("rebuild attempts = %+v, want two", store.fabricRebuildAttempts) + } + first := store.fabricRebuildAttempts[0] + if first.RebuildRequestID != "cmd-requested" || + first.RebuildStatus != "requested" || + first.Outcome != "rebuild_requested" || + first.DecisionSource != "service_channel_remediation_command" || + first.PolicyFingerprint != "pool-fp-1" { + t.Fatalf("unexpected requested rebuild intent: %+v", first) + } + second := store.fabricRebuildAttempts[1] + if second.RebuildRequestID != "cmd-rejected" || + second.RebuildStatus != "rejected" || + second.Outcome != "policy_guard_rejected" || + second.ReplacementRouteID != "route-outside" || + second.PolicyFingerprint != "pool-fp-2" { + t.Fatalf("unexpected rejected rebuild intent: %+v", second) + } +} + +func TestResolveFabricServiceChannelRemediationRebuildIntentsRecordsNoAlternate(t *testing.T) { + now := time.Date(2026, 5, 9, 1, 10, 0, 0, time.UTC) + store := &fakeRepository{ + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-no-alt"): { + 
ClusterID: "cluster-1", + ChannelID: "channel-no-alt", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: now.Add(time.Minute), + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-no-alt", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + EntryPool: []FabricServiceChannelNodeCandidate{{NodeID: "entry-1", Status: "selected"}}, + ExitPool: []FabricServiceChannelNodeCandidate{{NodeID: "exit-1", Status: "selected"}}, + ExpiresAt: now.Add(time.Minute), + }, + }, + }, + } + service := NewService(store) + decisions, err := service.resolveFabricServiceChannelRemediationRebuildIntents(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }, []FabricServiceChannelAccessRemediationCommand{{ + CommandID: "cmd-no-alt", + Action: "rebuild_route", + ChannelID: "channel-no-alt", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + PrimaryRouteID: "route-bad", + GuardStatus: "allowed", + Reason: "route_feedback_recommends_rebuild", + ExpiresAt: now.Add(time.Minute), + }}, []MeshRouteIntent{{ + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }}, map[string]fabricServiceChannelRouteFeedback{ + "route-bad": { + RouteID: "route-bad", + Fenced: true, + RouteRebuildRecommended: true, + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ConsecutiveFailures: 3, + }, + }, 
"config-c18z77", now) + if err != nil { + t.Fatalf("resolve rebuild intents: %v", err) + } + if len(decisions) != 0 { + t.Fatalf("decisions = %+v, want none without alternate", decisions) + } + if len(store.fabricRebuildAttempts) != 1 { + t.Fatalf("rebuild attempts = %+v, want one", store.fabricRebuildAttempts) + } + attempt := store.fabricRebuildAttempts[0] + if attempt.RebuildStatus != "no_alternate" || + attempt.Outcome != "no_alternate" || + attempt.RebuildReason != "no_unfenced_alternate_route" || + attempt.ConsecutiveFailures != 3 { + t.Fatalf("unexpected no-alternate rebuild resolution: %+v", attempt) + } +} + +func TestResolveFabricServiceChannelRemediationRebuildIntentsAppliesAlternateDecision(t *testing.T) { + now := time.Date(2026, 5, 9, 1, 15, 0, 0, time.UTC) + store := &fakeRepository{ + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-apply"): { + ClusterID: "cluster-1", + ChannelID: "channel-apply", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: now.Add(time.Minute), + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-apply", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + EntryPool: []FabricServiceChannelNodeCandidate{{NodeID: "entry-1", Status: "selected"}}, + ExitPool: []FabricServiceChannelNodeCandidate{{NodeID: "exit-1", Status: "selected"}}, + ExpiresAt: now.Add(time.Minute), + }, + }, + }, + } + service := NewService(store) + intents := []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + 
"synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-good", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 90, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + } + decisions, err := service.resolveFabricServiceChannelRemediationRebuildIntents(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }, []FabricServiceChannelAccessRemediationCommand{{ + CommandID: "cmd-apply", + Action: "rebuild_route", + ChannelID: "channel-apply", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + PrimaryRouteID: "route-bad", + GuardStatus: "allowed", + Reason: "route_feedback_recommends_rebuild", + ExpiresAt: now.Add(time.Minute), + }}, intents, map[string]fabricServiceChannelRouteFeedback{ + "route-bad": { + RouteID: "route-bad", + Fenced: true, + RouteRebuildRecommended: true, + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + }, + "route-good": { + RouteID: "route-good", + ScoreAdjustment: 100, + Reasons: []string{"service_channel_recent_success"}, + }, + }, "config-c18z77", now) + if err != nil { + t.Fatalf("resolve rebuild intents: %v", err) + } + if len(decisions) != 1 { + t.Fatalf("decisions = %+v, want one applied alternate", decisions) + } + decision := decisions[0] + if decision.RebuildRequestID != "cmd-apply" || + decision.RebuildStatus != "applied" || + decision.ReplacementRouteID != "route-good" || + decision.DecisionSource != "service_channel_remediation_command" { + t.Fatalf("unexpected applied remediation decision: %+v", decision) + } + 
attempt := store.fabricRebuildAttempts[0] + if attempt.RebuildStatus != "applied" || + attempt.Outcome != "replacement_selected" || + attempt.ReplacementRouteID != "route-good" || + !reflect.DeepEqual(attempt.OldHops, []string{"entry-1", "exit-1"}) || + !reflect.DeepEqual(attempt.ReplacementHops, []string{"entry-1", "exit-1"}) { + t.Fatalf("unexpected applied rebuild resolution: %+v", attempt) + } +} + +func TestIssueFabricServiceChannelLeaseMarksBackendRelayAsDegradedFallbackWhenRouteMissing(t *testing.T) { + now := time.Date(2026, 5, 7, 12, 30, 0, 0, time.UTC) + service := NewService(&fakeRepository{}) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ServiceClass: FabricServiceClassRemoteWorkspace, + EntryNodeIDs: []string{"entry-a"}, + ExitNodeIDs: []string{"exit-b"}, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.Status != FabricServiceChannelStatusDegradedFallback { + t.Fatalf("lease.Status = %q, want degraded_fallback", lease.Status) + } + if lease.PrimaryRoute.Status != "missing_route_intent" || lease.PrimaryRoute.RouteID != "" { + t.Fatalf("unexpected primary route fallback: %+v", lease.PrimaryRoute) + } + if lease.PrimaryRoute.RecoveryPolicy == nil { + t.Fatalf("fallback primary route must include recovery policy provenance") + } + if !lease.Fallback.Active || !lease.Fallback.Degraded || !lease.Fallback.BackendRelay { + t.Fatalf("fallback should be active degraded backend relay: %+v", lease.Fallback) + } + if !containsString(lease.AllowedChannels, FabricChannelInteractive) || !containsString(lease.RequiredRoles, "rdp-worker") { + t.Fatalf("remote workspace defaults not applied: channels=%v roles=%v", lease.AllowedChannels, lease.RequiredRoles) + } + if strings.Contains(lease.EntryHTTP.PathTemplate, "vpn-connections") || + 
!strings.Contains(lease.EntryHTTP.PathTemplate, "remote-workspaces") || + lease.EntryHTTP.PacketBatchFormat != "application/vnd.rap.remote-workspace-frame-batch.v1" { + t.Fatalf("remote workspace ingress should not be vpn-specific: %+v", lease.EntryHTTP) + } + if lease.DataPlane.StableContractForServiceClass != FabricServiceClassRemoteWorkspace || + !lease.DataPlane.ServiceNeutral || + !lease.DataPlane.ProtocolAgnostic || + !containsString(lease.DataPlane.RequiredFlowIsolationClasses, FabricChannelInteractive) { + t.Fatalf("unexpected remote workspace data-plane contract: %+v", lease.DataPlane) + } +} + +func TestIssueFabricServiceChannelLeaseUsesServiceClassAwareIngressDescriptors(t *testing.T) { + now := time.Date(2026, 5, 12, 14, 10, 0, 0, time.UTC) + service := NewService(&fakeRepository{}) + service.now = func() time.Time { return now } + + tests := []struct { + name string + service string + pathNeedle string + packetMedia string + }{ + {name: "vpn", service: FabricServiceClassVPNPackets, pathNeedle: "vpn-connections", packetMedia: "application/vnd.rap.vpn-packet-batch.v1"}, + {name: "remote workspace", service: FabricServiceClassRemoteWorkspace, pathNeedle: "remote-workspaces", packetMedia: "application/vnd.rap.remote-workspace-frame-batch.v1"}, + {name: "file transfer", service: FabricServiceClassFileTransfer, pathNeedle: "file-transfers", packetMedia: "application/vnd.rap.file-transfer-chunk-batch.v1"}, + {name: "video", service: FabricServiceClassVideo, pathNeedle: "video-sessions", packetMedia: "application/vnd.rap.video-frame-batch.v1"}, + } + for _, tt := range tests { + t.Run(tt.name, func(t *testing.T) { + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "resource-1", + ServiceClass: tt.service, + EntryNodeIDs: []string{"entry-a"}, + ExitNodeIDs: []string{"exit-b"}, + }) + if err != nil { + 
t.Fatalf("issue lease: %v", err) + } + if !strings.Contains(lease.EntryHTTP.PathTemplate, tt.pathNeedle) { + t.Fatalf("PathTemplate = %q, want %q", lease.EntryHTTP.PathTemplate, tt.pathNeedle) + } + if lease.EntryHTTP.PacketBatchFormat != tt.packetMedia { + t.Fatalf("PacketBatchFormat = %q, want %q", lease.EntryHTTP.PacketBatchFormat, tt.packetMedia) + } + if lease.DataPlane.StableContractForServiceClass != tt.service { + t.Fatalf("StableContractForServiceClass = %q, want %q", lease.DataPlane.StableContractForServiceClass, tt.service) + } + }) + } +} + +func TestIssueFabricServiceChannelLeaseFencesRouteFromFlowFeedback(t *testing.T) { + now := time.Date(2026, 5, 7, 13, 0, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 50, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-bad", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-good", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 10, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-good", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + heartbeats: map[string][]NodeHeartbeat{ + "entry-1": { + { + ClusterID: "cluster-1", + NodeID: "entry-1", + ObservedAt: now.Add(-15 * time.Second), + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18l.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + "channel_stats": { + "flow-7": { + "last_failed_route_id": 
"route-bad", + "last_error": "forward peer unavailable", + "consecutive_failures": 2, + "route_rebuild_recommended": true, + "degraded_fallback_recommended": true + }, + "flow-9": { + "last_route_id": "route-good", + "last_next_hop": "relay-good", + "consecutive_failures": 0 + } + } + } + } + } + }`), + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-good" { + t.Fatalf("primary route = %q, want route-good after route-bad feedback fence", lease.PrimaryRoute.RouteID) + } + if !containsString(lease.PrimaryRoute.ScoreReasons, "service_channel_recent_success") { + t.Fatalf("primary route should include service-channel success feedback: %+v", lease.PrimaryRoute) + } + for _, alternate := range lease.AlternateRoutes { + if alternate.RouteID == "route-bad" { + t.Fatalf("fenced route must not be offered as alternate: %+v", lease.AlternateRoutes) + } + } + if lease.Fallback.Active || lease.Fallback.Degraded { + t.Fatalf("healthy alternate should avoid degraded fallback: %+v", lease.Fallback) + } +} + +func TestIssueFabricServiceChannelLeasePrefersFastHealthyRouteFeedback(t *testing.T) { + now := time.Date(2026, 5, 7, 16, 10, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-slow", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: 
FabricServiceClassVPNPackets, + Priority: 120, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-slow", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-fast", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 80, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-fast", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + heartbeats: map[string][]NodeHeartbeat{ + "entry-1": { + { + ClusterID: "cluster-1", + NodeID: "entry-1", + ObservedAt: now.Add(-10 * time.Second), + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18l.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + "channel_stats": { + "flow-fast": { + "last_route_id": "route-fast", + "last_next_hop": "relay-fast", + "last_send_duration_ms": 8, + "consecutive_failures": 0, + "stall_count": 0 + }, + "flow-slow": { + "last_route_id": "route-slow", + "last_next_hop": "relay-slow", + "last_send_duration_ms": 900, + "consecutive_failures": 0, + "stall_count": 0 + } + } + } + } + } + }`), + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-fast" { + 
t.Fatalf("primary route = %q, want route-fast from quality feedback; route=%+v alternates=%+v", lease.PrimaryRoute.RouteID, lease.PrimaryRoute, lease.AlternateRoutes) + } + if !containsString(lease.PrimaryRoute.ScoreReasons, "service_channel_quality_latency_le_10ms") { + t.Fatalf("fast route should include latency quality reason: %+v", lease.PrimaryRoute) + } + var slow FabricServiceChannelRoute + for _, route := range lease.AlternateRoutes { + if route.RouteID == "route-slow" { + slow = route + break + } + } + if slow.RouteID == "" || !containsString(slow.ScoreReasons, "service_channel_quality_latency_very_slow") { + t.Fatalf("slow alternate should retain quality penalty reason: %+v", lease.AlternateRoutes) + } +} + +func TestIssueFabricServiceChannelLeaseDecaysOlderHealthyRouteFeedback(t *testing.T) { + now := time.Date(2026, 5, 8, 9, 0, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-old-fast", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 80, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-fresh", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 80, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-fresh", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ID: "feedback-old", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-old-fast", + ServiceClass: FabricServiceClassVPNPackets, + 
FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_quality_latency_le_10ms"}, + LastSendDurationMs: 1, + ObservedAt: now.Add(-90 * time.Second), + ExpiresAt: now.Add(30 * time.Second), + }, + { + ID: "feedback-fresh", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-fresh", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 40, + Reasons: []string{"service_channel_recent_success", "service_channel_quality_latency_le_50ms"}, + LastSendDurationMs: 40, + ObservedAt: now.Add(-5 * time.Second), + ExpiresAt: now.Add(115 * time.Second), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + TTL: 90 * time.Second, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-fresh" { + t.Fatalf("primary route = %q, want fresher feedback route after age decay; route=%+v alternates=%+v", lease.PrimaryRoute.RouteID, lease.PrimaryRoute, lease.AlternateRoutes) + } + var oldRoute FabricServiceChannelRoute + for _, route := range lease.AlternateRoutes { + if route.RouteID == "route-old-fast" { + oldRoute = route + break + } + } + if oldRoute.RouteID == "" || !containsString(oldRoute.ScoreReasons, "service_channel_feedback_age_decay") { + t.Fatalf("old route should carry age decay reason: %+v", lease.AlternateRoutes) + } +} + +func TestServiceChannelRouteFeedbackReportIncludesEffectiveDecayedScore(t *testing.T) { + now := time.Date(2026, 5, 8, 9, 3, 0, 0, time.UTC) + 
report := serviceChannelRouteFeedbackReport([]FabricServiceChannelRouteFeedbackObservation{{ + ID: "feedback-old", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-old-fast", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success"}, + LastSendDurationMs: 1, + ObservedAt: now.Add(-90 * time.Second), + ExpiresAt: now.Add(30 * time.Second), + }}, now) + if report == nil || len(report.Observations) != 1 { + t.Fatalf("report observations = %+v, want one observation", report) + } + observation := report.Observations[0] + if observation.ScoreAdjustment != 90 || observation.EffectiveScoreAdjustment != 23 { + t.Fatalf("scores raw/effective = %d/%d, want 90/23", observation.ScoreAdjustment, observation.EffectiveScoreAdjustment) + } + if !containsString(observation.Reasons, "service_channel_feedback_age_decay") { + t.Fatalf("reasons = %+v, want age decay reason", observation.Reasons) + } +} + +func TestIssueFabricServiceChannelLeaseFallsBackWhenOnlyRouteFencedByFlowFeedback(t *testing.T) { + now := time.Date(2026, 5, 7, 13, 30, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 50, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-bad", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + heartbeats: map[string][]NodeHeartbeat{ + "entry-1": { + { + ClusterID: "cluster-1", + NodeID: "entry-1", + ObservedAt: now.Add(-10 * time.Second), + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18l.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + 
"channel_stats": { + "flow-7": { + "last_failed_route_id": "route-bad", + "consecutive_failures": 2, + "route_rebuild_recommended": true + } + } + } + } + } + }`), + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.Status != FabricServiceChannelStatusDegradedFallback || + lease.Fallback.Reason != "fabric_route_rebuild_pending_backend_relay" { + t.Fatalf("lease should degrade because the only route is fenced: status=%s fallback=%+v", lease.Status, lease.Fallback) + } +} + +func TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateExitFromPool(t *testing.T) { + now := time.Date(2026, 5, 7, 13, 45, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-entry-exit-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-a"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-a"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"exit_pool_id": "pool-home"} + }`), + UpdatedAt: now, + }, + { + ID: "route-entry-exit-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-b"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 30, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-b"], + 
"allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"exit_pool_id": "pool-home"} + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-entry-exit-a", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ConsecutiveFailures: 2, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-entry-exit-b", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 10, + Reasons: []string{"service_channel_recent_success"}, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-a", "exit-b"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-a", + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-entry-exit-b" || lease.SelectedExitNodeID != "exit-b" { + t.Fatalf("lease should select alternate exit from pool: selected_exit=%s primary=%+v", lease.SelectedExitNodeID, lease.PrimaryRoute) + } + for _, candidate := range lease.ExitPool { + if candidate.NodeID == "exit-b" && candidate.Status != "selected" { + t.Fatalf("alternate exit should be marked selected in exit pool: %+v", lease.ExitPool) + } + } + var signedPayload FabricServiceChannelLeaseAuthorityPayload + if err := json.Unmarshal(lease.AuthorityPayload, &signedPayload); err 
!= nil { + t.Fatalf("unmarshal signed payload: %v", err) + } + if signedPayload.SelectedExitNodeID != "exit-b" || len(signedPayload.ExitPool) != 2 { + t.Fatalf("signed payload must bind selected exit and authorized exit pool: %+v", signedPayload) + } + if lease.Fallback.Active || lease.Fallback.Degraded { + t.Fatalf("healthy exit-pool alternate should avoid degraded fallback: %+v", lease.Fallback) + } +} + +func TestIssueFabricServiceChannelLeaseSelectsHealthyAlternateEntryFromPool(t *testing.T) { + now := time.Date(2026, 5, 7, 14, 45, 0, 0, time.UTC) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-entry-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-a", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"entry_pool_id": "pool-edge"} + }`), + UpdatedAt: now, + }, + { + ID: "route-entry-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-b"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 30, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-b", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"entry_pool_id": "pool-edge"} + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-a", + RouteID: "route-entry-a", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_entry_unreachable"}, + ConsecutiveFailures: 3, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-b", + 
RouteID: "route-entry-b", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 10, + Reasons: []string{"service_channel_recent_success"}, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-a", "entry-b"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-a", + PreferredExitNodeID: "exit-1", + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-entry-b" || lease.SelectedEntryNodeID != "entry-b" { + t.Fatalf("lease should select alternate entry from pool: selected_entry=%s primary=%+v", lease.SelectedEntryNodeID, lease.PrimaryRoute) + } + for _, candidate := range lease.EntryPool { + if candidate.NodeID == "entry-b" && candidate.Status != "selected" { + t.Fatalf("alternate entry should be marked selected in entry pool: %+v", lease.EntryPool) + } + } + var signedPayload FabricServiceChannelLeaseAuthorityPayload + if err := json.Unmarshal(lease.AuthorityPayload, &signedPayload); err != nil { + t.Fatalf("unmarshal signed payload: %v", err) + } + if signedPayload.SelectedEntryNodeID != "entry-b" || len(signedPayload.EntryPool) != 2 { + t.Fatalf("signed payload must bind selected entry and authorized entry pool: %+v", signedPayload) + } + if lease.Fallback.Active || lease.Fallback.Degraded { + t.Fatalf("healthy entry-pool alternate should avoid degraded fallback: %+v", lease.Fallback) + } +} + +func TestIssueFabricServiceChannelLeaseAppliesClusterPoolPolicy(t *testing.T) { + now := time.Date(2026, 5, 8, 20, 10, 0, 0, time.UTC) + policy := defaultFabricServiceChannelPoolPolicy() + 
policy.Source = "cluster_metadata" + policy.EntryPoolNodeIDs = []string{"entry-b"} + policy.ExitPoolNodeIDs = []string{"exit-b"} + policy.PreferredEntryNodeID = "entry-b" + policy.PreferredExitNodeID = "exit-b" + policy.SelectionStrategy = "preferred_first" + policy.RouteRebuild = "automatic" + policy.EntryFailover = "automatic" + policy.ExitFailover = "automatic" + policy.BackendFallbackAllowed = true + policy.StickySession = true + policy = normalizeFabricServiceChannelPoolPolicy(policy, defaultFabricServiceChannelPoolPolicy()) + metadata, err := upsertFabricServiceChannelPoolPolicyMetadata(json.RawMessage(`{}`), policy) + if err != nil { + t.Fatalf("policy metadata: %v", err) + } + store := &fakeRepository{ + cluster: Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: metadata, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-a"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{"hops":["entry-a","exit-a"],"allowed_channels":["vpn_packet","fabric_control"]}`), + UpdatedAt: now, + }, + { + ID: "route-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-b"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-b"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 10, + Status: "active", + Policy: json.RawMessage(`{"hops":["entry-b","exit-b"],"allowed_channels":["vpn_packet","fabric_control"]}`), + UpdatedAt: now, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: 
FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-a", "entry-b"}, + ExitNodeIDs: []string{"exit-a", "exit-b"}, + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.SelectedEntryNodeID != "entry-b" || lease.SelectedExitNodeID != "exit-b" || lease.PrimaryRoute.RouteID != "route-b" { + t.Fatalf("lease did not apply pool policy: selected_entry=%s selected_exit=%s primary=%+v", lease.SelectedEntryNodeID, lease.SelectedExitNodeID, lease.PrimaryRoute) + } + if len(lease.EntryPool) != 1 || lease.EntryPool[0].NodeID != "entry-b" || len(lease.ExitPool) != 1 || lease.ExitPool[0].NodeID != "exit-b" { + t.Fatalf("lease pools should be constrained by pool policy: entry=%+v exit=%+v", lease.EntryPool, lease.ExitPool) + } + if lease.PoolPolicy == nil || lease.PoolPolicy.Fingerprint != policy.Fingerprint { + t.Fatalf("lease missing pool policy provenance: %+v want %s", lease.PoolPolicy, policy.Fingerprint) + } + var signedPayload FabricServiceChannelLeaseAuthorityPayload + if err := json.Unmarshal(lease.AuthorityPayload, &signedPayload); err != nil { + t.Fatalf("unmarshal signed payload: %v", err) + } + if signedPayload.PoolPolicy == nil || signedPayload.PoolPolicy.Fingerprint != policy.Fingerprint { + t.Fatalf("signed payload missing pool policy provenance: %+v want %s", signedPayload.PoolPolicy, policy.Fingerprint) + } +} + +func TestRecordHeartbeatPersistsServiceChannelRouteFeedbackForLaterLease(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-bad", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: 
"route-good", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 10, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "relay-good", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + _, err := service.RecordHeartbeat(context.Background(), RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18l.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + "channel_stats": { + "flow-1": { + "last_failed_route_id": "route-bad", + "consecutive_failures": 2, + "route_rebuild_recommended": true + } + } + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + if len(store.fabricRouteFeedback) != 1 || store.fabricRouteFeedback[0].RouteID != "route-bad" || + store.fabricRouteFeedback[0].FeedbackStatus != "fenced" { + t.Fatalf("service-channel route feedback was not persisted: %+v", store.fabricRouteFeedback) + } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-good" { + t.Fatalf("primary route = %q, want durable feedback to fence route-bad and select route-good", lease.PrimaryRoute.RouteID) + } +} + +func 
TestRecordHeartbeatTurnsBlockedFallbackSendFailureIntoRebuildFeedback(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true,"hops":["entry-1","exit-1"],"allowed_channels":["vpn_packet","fabric_control"]}`), + UpdatedAt: now, + }, + { + ID: "route-good", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 10, + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true,"hops":["entry-1","exit-1"],"allowed_channels":["vpn_packet","fabric_control"]}`), + UpdatedAt: now, + }, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-1"): { + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: now.Add(time.Minute), + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + Status: "authorized", + }, + AlternateRoutes: []FabricServiceChannelRoute{{ + RouteID: "route-good", + Status: "authorized", + }}, + ExpiresAt: now.Add(time.Minute), + }, + }, + }, 
+ } + service := NewService(store) + service.now = func() time.Time { return now } + + _, err := service.RecordHeartbeat(context.Background(), RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_access_report": { + "schema_version": "c18z52.fabric_service_channel_access_report.v1", + "total": 1, + "signed": 1, + "backend_fallback": 0, + "backend_fallback_blocked": 1, + "fabric_route_send_failure": 1, + "data_plane_contract": 1, + "last_backend_relay_policy": "disabled", + "last_working_data_transport": "fabric_service_channel", + "last_steady_state_transport": "fabric_route", + "last_data_plane_violation_status": "fabric_route_send_failed_backend_fallback_blocked", + "last_data_plane_violation_reason": "mesh synthetic route not found" + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + if len(store.fabricRouteFeedback) != 1 { + t.Fatalf("route feedback count = %d, want one blocked fallback feedback: %+v", len(store.fabricRouteFeedback), store.fabricRouteFeedback) + } + feedback := store.fabricRouteFeedback[0] + if feedback.RouteID != "route-bad" || + feedback.FeedbackStatus != "fenced" || + feedback.ScoreAdjustment != -1030 || + !containsString(feedback.Reasons, "data_plane_fabric_route_send_failed") || + !containsString(feedback.Reasons, "backend_fallback_blocked_by_policy") { + t.Fatalf("unexpected route feedback from blocked fallback: %+v", feedback) + } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if cfg.RoutePathDecisions == nil || cfg.RoutePathDecisions.ReplacementDecisionCount != 1 { + t.Fatalf("expected blocked fallback feedback to drive replacement decision: %+v", cfg.RoutePathDecisions) + } + decision := cfg.RoutePathDecisions.Decisions[0] + if 
decision.RouteID != "route-bad" || + decision.ReplacementRouteID != "route-good" || + decision.RebuildStatus != "applied" || + decision.FeedbackObservationID == "" || + decision.FeedbackSource != "fabric_service_channel_access_report" || + decision.FeedbackChannelID != "channel-1" || + decision.FeedbackResourceID != "vpn-home" || + decision.FeedbackViolationStatus != "fabric_route_send_failed_backend_fallback_blocked" || + !containsString(decision.ScoreReasons, "service_channel_fenced_route") || + !containsString(decision.ScoreReasons, "service_channel_rebuild_applied") { + t.Fatalf("unexpected replacement decision: %+v", decision) + } + if len(store.fabricRebuildAttempts) != 1 { + t.Fatalf("rebuild attempt count = %d, want one correlated attempt: %+v", len(store.fabricRebuildAttempts), store.fabricRebuildAttempts) + } + attempt := store.fabricRebuildAttempts[0] + if attempt.FeedbackObservationID != decision.FeedbackObservationID || + attempt.FeedbackSource != "fabric_service_channel_access_report" || + attempt.FeedbackChannelID != "channel-1" || + attempt.FeedbackResourceID != "vpn-home" || + attempt.FeedbackViolationStatus != "fabric_route_send_failed_backend_fallback_blocked" { + t.Fatalf("unexpected rebuild attempt feedback correlation: %+v", attempt) + } + if jsonString(jsonObject(attempt.Payload), "feedback_observation_id") != decision.FeedbackObservationID || + jsonString(jsonObject(attempt.Payload), "feedback_source") != "fabric_service_channel_access_report" { + t.Fatalf("rebuild attempt payload missing feedback correlation: %s", string(attempt.Payload)) + } + health, err := service.GetFabricServiceChannelRouteRebuildHealthSummary(context.Background(), "admin-1", GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: "cluster-1", + Limit: 10, + }) + if err != nil { + t.Fatalf("get rebuild health: %v", err) + } + if len(health.FeedbackBreakdowns) != 1 { + t.Fatalf("feedback breakdowns = %+v, want one access-report group", 
health.FeedbackBreakdowns) + } + breakdown := health.FeedbackBreakdowns[0] + if breakdown.FeedbackSource != "fabric_service_channel_access_report" || + breakdown.FeedbackChannelID != "channel-1" || + breakdown.FeedbackViolationStatus != "fabric_route_send_failed_backend_fallback_blocked" || + breakdown.TotalCount != 1 || + len(breakdown.AffectedReporterNodeIDs) != 1 || + breakdown.AffectedReporterNodeIDs[0] != "entry-1" || + len(breakdown.AffectedRouteIDs) != 1 || + breakdown.AffectedRouteIDs[0] != "route-bad" { + t.Fatalf("unexpected feedback breakdown: %+v", breakdown) + } +} + +func TestRecordHeartbeatDeduplicatesBlockedFallbackAccessFeedback(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{"synthetic_enabled":true,"hops":["entry-1","exit-1"],"allowed_channels":["vpn_packet","fabric_control"]}`), + UpdatedAt: now, + }, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-1"): { + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: now.Add(time.Minute), + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + Status: 
"authorized", + }, + ExpiresAt: now.Add(time.Minute), + }, + }, + }, + } + service := NewService(store) + + heartbeat := RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_access_report": { + "schema_version": "c18z52.fabric_service_channel_access_report.v1", + "total": 1, + "signed": 1, + "backend_fallback": 0, + "backend_fallback_blocked": 1, + "fabric_route_send_failure": 1, + "data_plane_contract": 1, + "last_backend_relay_policy": "disabled", + "last_working_data_transport": "fabric_service_channel", + "last_steady_state_transport": "fabric_route", + "last_data_plane_violation_status": "fabric_route_send_failed_backend_fallback_blocked", + "last_data_plane_violation_reason": "mesh synthetic route not found" + } + }`), + } + if _, err := service.RecordHeartbeat(context.Background(), heartbeat); err != nil { + t.Fatalf("record first heartbeat: %v", err) + } + if _, err := service.RecordHeartbeat(context.Background(), heartbeat); err != nil { + t.Fatalf("record duplicate heartbeat: %v", err) + } + if len(store.fabricRouteFeedback) != 1 { + t.Fatalf("route feedback count = %d, want duplicate access-report feedback suppressed: %+v", len(store.fabricRouteFeedback), store.fabricRouteFeedback) + } + feedback := store.fabricRouteFeedback[0] + if feedback.RouteID != "route-bad" || + feedback.FeedbackStatus != "fenced" || + !containsString(feedback.Reasons, "data_plane_fabric_route_send_failed") || + jsonString(jsonObject(feedback.Payload), "source") != "fabric_service_channel_access_report" { + t.Fatalf("unexpected deduplicated feedback: %+v", feedback) + } +} + +func TestRecordHeartbeatUsesRollingQualityWindowForRouteFeedback(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + store := &fakeRepository{} + service := NewService(store) + service.now = func() time.Time { return now } + + _, err := service.RecordHeartbeat(context.Background(), 
RecordHeartbeatInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + HealthStatus: "healthy", + Metadata: json.RawMessage(`{ + "fabric_service_channel_runtime_report": { + "schema_version": "c18z21.fabric_service_channel_runtime_report.v1", + "ingress": { + "flow_scheduler": { + "channel_stats": { + "vpn:vpn-1:flow-01": { + "last_route_id": "route-good", + "last_failed_route_id": "route-bad", + "last_error": "old failure", + "consecutive_failures": 2, + "stall_count": 2, + "last_send_duration_ms": 1500, + "route_rebuild_recommended": true, + "degraded_fallback_recommended": true, + "quality_window_sample_count": 32, + "quality_window_success_count": 32, + "quality_window_failure_count": 0, + "quality_window_slow_count": 0, + "quality_window_drop_count": 0, + "quality_window_avg_latency_ms": 1 + } + } + } + } + } + }`), + }) + if err != nil { + t.Fatalf("record heartbeat: %v", err) + } + if len(store.fabricRouteFeedback) != 1 { + t.Fatalf("route feedback count = %d, want one healthy fresh observation: %+v", len(store.fabricRouteFeedback), store.fabricRouteFeedback) + } + observation := store.fabricRouteFeedback[0] + if observation.RouteID != "route-good" || observation.FeedbackStatus != "healthy" { + t.Fatalf("route feedback = %+v, want rolling window to ignore old failed route", observation) + } + if observation.ConsecutiveFailures != 0 || observation.StallCount != 0 || observation.LastSendDurationMs != 1 { + t.Fatalf("rolling counters = failures:%d stalls:%d latency:%d, want fresh window values", observation.ConsecutiveFailures, observation.StallCount, observation.LastSendDurationMs) + } + if !containsString(observation.Reasons, "service_channel_rolling_quality_window") || !containsString(observation.Reasons, "service_channel_quality_latency_le_10ms") { + t.Fatalf("feedback reasons = %+v, want rolling window quality reasons", observation.Reasons) + } +} + +func TestGetNodeSyntheticMeshConfigSkipsFencedServiceChannelRoute(t *testing.T) { + now := 
time.Now().UTC().Truncate(time.Second) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-good", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 10, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-unproven", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 900, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ConsecutiveFailures: 2, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: 
"entry-1", + RouteID: "route-good", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 10, + Reasons: []string{"service_channel_recent_success"}, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + fabricLeases: map[string]FabricServiceChannelLeaseRecord{ + fabricServiceChannelLeaseCacheKey("cluster-1", "channel-1"): { + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + SelectedEntryNodeID: "entry-1", + ExpiresAt: now.Add(time.Minute), + Lease: FabricServiceChannelLease{ + ClusterID: "cluster-1", + ChannelID: "channel-1", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + Status: FabricServiceChannelStatusReady, + SelectedEntryNodeID: "entry-1", + SelectedExitNodeID: "exit-1", + PrimaryRoute: FabricServiceChannelRoute{ + RouteID: "route-bad", + Status: "authorized", + }, + AlternateRoutes: []FabricServiceChannelRoute{{ + RouteID: "route-good", + Status: "authorized", + }}, + ExpiresAt: now.Add(time.Minute), + }, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if len(cfg.Routes) != 2 || containsRouteID(cfg.Routes, "route-bad") || !containsRouteID(cfg.Routes, "route-good") { + t.Fatalf("routes = %+v, want route-bad excluded and route-good retained", cfg.Routes) + } + if cfg.ServiceChannelFeedback == nil || cfg.ServiceChannelFeedback.FencedRouteCount != 1 || cfg.ServiceChannelFeedback.HealthyRouteCount != 1 { + t.Fatalf("feedback report missing fenced count: %+v", cfg.ServiceChannelFeedback) + } + if cfg.RoutePathDecisions == nil || cfg.RoutePathDecisions.ReplacementDecisionCount != 1 { + t.Fatalf("expected one service-channel replacement decision: %+v", 
cfg.RoutePathDecisions) + } + if len(cfg.ServiceChannelRemediationCommands) != 1 { + t.Fatalf("remediation commands = %+v, want one", cfg.ServiceChannelRemediationCommands) + } + command := cfg.ServiceChannelRemediationCommands[0] + if command.Action != "prefer_alternate_route" || + command.PrimaryRouteID != "route-bad" || + command.ReplacementRouteID != "route-good" || + command.ChannelID != "channel-1" || + !command.ExpiresAt.After(now) { + t.Fatalf("unexpected remediation command: %+v", command) + } + var replacement RoutePathDecision + for _, decision := range cfg.RoutePathDecisions.Decisions { + if decision.DecisionSource == "service_channel_feedback_replacement" { + replacement = decision + break + } + } + if replacement.RouteID != "route-bad" || replacement.ReplacementRouteID != "route-good" || + replacement.RebuildStatus != "applied" || + replacement.RebuildRequestID == "" || + !containsString(replacement.ScoreReasons, "selected_unfenced_alternate_route") || + !containsString(replacement.ScoreReasons, "service_channel_rebuild_applied") || + !containsString(replacement.ScoreReasons, "active_healthy_feedback_dampening_window") { + t.Fatalf("unexpected replacement decision: %+v", replacement) + } + attempts, err := service.ListFabricServiceChannelRouteRebuildAttempts(context.Background(), "admin-1", ListFabricServiceChannelRouteRebuildAttemptsInput{ + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + }) + if err != nil { + t.Fatalf("list rebuild attempts: %v", err) + } + if len(attempts) != 1 { + t.Fatalf("rebuild attempts = %+v, want one", attempts) + } + attempt := attempts[0] + if attempt.RebuildStatus != "applied" || + attempt.Outcome != "replacement_selected" || + attempt.ReplacementRouteID != "route-good" || + attempt.RebuildRequestID != replacement.RebuildRequestID || + attempt.FeedbackStatus != "fenced" || + attempt.ConsecutiveFailures != 2 || + !containsString(attempt.FeedbackReasons, 
"service_channel_route_rebuild_recommended") || + !reflect.DeepEqual(attempt.OldHops, []string{"entry-1", "exit-1"}) || + !reflect.DeepEqual(attempt.ReplacementHops, []string{"entry-1", "exit-1"}) { + t.Fatalf("unexpected rebuild ledger attempt: %+v", attempt) + } +} + +func TestGetNodeSyntheticMeshConfigReportsRebuildPendingWhenNoAlternateExists(t *testing.T) { + now := time.Date(2026, 5, 7, 14, 0, 0, 0, time.UTC) + store := &fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-bad", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended", "service_channel_degraded_fallback_recommended"}, + ConsecutiveFailures: 3, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if containsRouteID(cfg.Routes, "route-bad") { + t.Fatalf("fenced route should be withheld while rebuild is pending: %+v", cfg.Routes) + } + if cfg.RoutePathDecisions == nil || cfg.RoutePathDecisions.RebuildRequestCount 
!= 1 || cfg.RoutePathDecisions.DegradedDecisionCount != 1 { + t.Fatalf("expected rebuild/degraded decision counts: %+v", cfg.RoutePathDecisions) + } + decision := cfg.RoutePathDecisions.Decisions[0] + if decision.DecisionSource != "service_channel_feedback_no_alternate" || + decision.RebuildStatus != "pending_degraded_fallback" || + decision.RebuildRequestID == "" || + decision.RebuildAttempt != 3 || + !containsString(decision.ScoreReasons, "backend_relay_degraded_fallback_until_rebuild") { + t.Fatalf("unexpected rebuild decision: %+v", decision) + } +} + +func TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossExitPool(t *testing.T) { + now := time.Date(2026, 5, 7, 14, 15, 0, 0, time.UTC) + store := &fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-exit-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-a"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-a"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"exit_pool_id": "pool-home"} + }`), + UpdatedAt: now, + }, + { + ID: "route-exit-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-b"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 20, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-b"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"exit_pool_id": "pool-home"} + }`), + UpdatedAt: now, + }, + { + ID: "route-other-pool", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: 
json.RawMessage(`{"node_id":"exit-c"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 900, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-c"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"exit_pool_id": "pool-other"} + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-exit-a", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ConsecutiveFailures: 2, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-exit-b", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 10, + Reasons: []string{"service_channel_recent_success"}, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if cfg.RoutePathDecisions == nil || cfg.RoutePathDecisions.ReplacementDecisionCount != 1 { + t.Fatalf("expected one exit-pool replacement decision: %+v", cfg.RoutePathDecisions) + } + var replacement RoutePathDecision + for _, decision := range cfg.RoutePathDecisions.Decisions { + if decision.RouteID == "route-exit-a" { + replacement = decision + break + } + } + if replacement.ReplacementRouteID != "route-exit-b" || + replacement.DecisionSource != "service_channel_feedback_exit_pool_replacement" || + replacement.RebuildStatus != "applied" || + !containsString(replacement.ScoreReasons, "selected_unfenced_exit_pool_route") { + 
t.Fatalf("unexpected exit-pool replacement decision: %+v", replacement) + } +} + +func TestGetNodeSyntheticMeshConfigReplacesFencedServiceChannelRouteAcrossEntryPool(t *testing.T) { + now := time.Date(2026, 5, 7, 15, 5, 0, 0, time.UTC) + store := &fakeRepository{ + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-entry-a", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-a"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-a", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"entry_pool_id": "pool-edge"} + }`), + UpdatedAt: now, + }, + { + ID: "route-entry-b", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-b"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 20, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-b", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"entry_pool_id": "pool-edge"} + }`), + UpdatedAt: now, + }, + { + ID: "route-other-pool", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-c"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 900, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-c", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"], + "metadata": {"entry_pool_id": "pool-other"} + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "exit-1", + RouteID: 
"route-entry-a", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ConsecutiveFailures: 2, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "exit-1", + RouteID: "route-entry-b", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 10, + Reasons: []string{"service_channel_recent_success"}, + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "exit-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if cfg.RoutePathDecisions == nil || cfg.RoutePathDecisions.ReplacementDecisionCount != 1 { + t.Fatalf("expected one entry-pool replacement decision: %+v", cfg.RoutePathDecisions) + } + var replacement RoutePathDecision + for _, decision := range cfg.RoutePathDecisions.Decisions { + if decision.RouteID == "route-entry-a" { + replacement = decision + break + } + } + if replacement.ReplacementRouteID != "route-entry-b" || + replacement.DecisionSource != "service_channel_feedback_entry_pool_replacement" || + replacement.RebuildStatus != "applied" || + !containsString(replacement.ScoreReasons, "selected_unfenced_entry_pool_route") { + t.Fatalf("unexpected entry-pool replacement decision: %+v", replacement) + } + if replacement.LocalRole != "exit" || replacement.PreviousHopID != "entry-b" { + t.Fatalf("entry-pool replacement should be visible from shared exit perspective: %+v", replacement) + } +} + +func TestExpireFabricServiceChannelRouteFeedbackRemovesActiveFeedback(t *testing.T) { + now := time.Date(2026, 5, 7, 12, 0, 0, 0, time.UTC) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + 
authorityState: ClusterAuthorityState{ + ClusterID: "cluster-1", + AuthorityState: "authoritative", + MutationMode: "normal", + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ObservedAt: now.Add(-time.Minute), + ExpiresAt: now.Add(time.Minute), + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-2", + RouteID: "route-other", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ObservedAt: now.Add(-time.Minute), + ExpiresAt: now.Add(time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + result, err := service.ExpireFabricServiceChannelRouteFeedback(context.Background(), ExpireFabricServiceChannelRouteFeedbackInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-bad", + ServiceClass: FabricServiceClassVPNPackets, + Reason: "operator verified route is healthy", + }) + if err != nil { + t.Fatalf("expire feedback: %v", err) + } + if result.ExpiredCount != 1 || !result.ExpiredAt.Equal(now) || !result.CooldownUntil.Equal(now.Add(fabricServiceChannelOperatorExpireCooldown)) { + t.Fatalf("unexpected expire result: %+v", result) + } + if len(store.auditEvents) == 0 || store.auditEvents[len(store.auditEvents)-1].EventType != "fabric.service_channel_route_feedback.expired" { + t.Fatalf("missing feedback expire audit event: %+v", store.auditEvents) + } + active, err := service.ListFabricServiceChannelRouteFeedback(context.Background(), "admin-1", ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + Now: now, + }) + if err != nil { + t.Fatalf("list feedback: %v", err) + } + if len(active) != 1 || active[0].RouteID != "route-other" { + t.Fatalf("active feedback = %+v, want only route-other", active) + } + expired, err := 
service.ListFabricServiceChannelRouteFeedback(context.Background(), "admin-1", ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + RouteID: "route-bad", + IncludeExpired: true, + Now: now, + }) + if err != nil { + t.Fatalf("list expired feedback: %v", err) + } + if len(expired) != 1 || !expired[0].ExpiresAt.Equal(now) { + t.Fatalf("expired feedback = %+v, want route-bad expired at now", expired) + } +} + +func TestRecordFabricServiceChannelRouteRebuildFeedbackBreakdownInvestigationAudit(t *testing.T) { + now := time.Date(2026, 5, 9, 13, 30, 0, 0, time.UTC) + store := &fakeRepository{platformRole: "platform_admin"} + service := NewService(store) + service.now = func() time.Time { return now } + + err := service.RecordFabricServiceChannelRouteRebuildInvestigation(context.Background(), RecordFabricServiceChannelRouteRebuildInvestigationInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + FeedbackSource: "fabric_service_channel_access_report", + FeedbackChannelID: "channel-1", + FeedbackViolationStatus: "fabric_route_send_failed_backend_fallback_blocked", + DrilldownSource: "rebuild_health_feedback_breakdown", + Reason: "operator opened rebuild-health feedback breakdown ledger", + }) + if err != nil { + t.Fatalf("record investigation: %v", err) + } + if len(store.auditEvents) != 1 { + t.Fatalf("audit events = %d, want 1", len(store.auditEvents)) + } + event := store.auditEvents[0] + if event.EventType != "fabric.service_channel_rebuild_feedback_breakdown.investigation_opened" { + t.Fatalf("event type = %q", event.EventType) + } + if event.TargetType != "fabric_service_channel_rebuild_feedback_breakdown" || event.TargetID == nil || *event.TargetID != "channel-1" { + t.Fatalf("unexpected target: type=%q id=%v", event.TargetType, event.TargetID) + } + payload := jsonObject(event.Payload) + if jsonString(payload, "feedback_source") != "fabric_service_channel_access_report" || + jsonString(payload, "feedback_channel_id") != "channel-1" || + 
jsonString(payload, "feedback_violation_status") != "fabric_route_send_failed_backend_fallback_blocked" || + jsonString(payload, "drilldown_source") != "rebuild_health_feedback_breakdown" { + t.Fatalf("unexpected audit payload: %s", string(event.Payload)) + } + if !event.CreatedAt.Equal(now) { + t.Fatalf("created_at = %s, want %s", event.CreatedAt, now) + } +} + +func TestListAuditEventsFiltersFabricInvestigationBreadcrumbs(t *testing.T) { + clusterID := "cluster-1" + otherClusterID := "cluster-other" + now := time.Date(2026, 5, 9, 14, 20, 0, 0, time.UTC) + store := &fakeRepository{ + platformRole: "platform_admin", + fabricRebuildAttempts: []FabricServiceChannelRouteRebuildAttempt{ + { + ID: "attempt-1", + ClusterID: clusterID, + ReporterNodeID: "entry-1", + RouteID: "route-1", + ServiceClass: FabricServiceClassVPNPackets, + RebuildStatus: "applied", + Outcome: "replacement_selected", + FeedbackSource: "fabric_service_channel_access_report", + FeedbackChannelID: "channel-1", + FeedbackViolationStatus: "fabric_route_send_failed_backend_fallback_blocked", + FeedbackObservedAt: &now, + UpdatedAt: now, + CreatedAt: now, + Payload: json.RawMessage(`{}`), + }, + }, + auditEvents: []ClusterAuditEvent{ + { + ID: "audit-1", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_feedback_breakdown.investigation_opened", + TargetType: "fabric_service_channel_rebuild_feedback_breakdown", + TargetID: stringPtr("channel-1"), + Payload: json.RawMessage(`{"feedback_source":"fabric_service_channel_access_report","feedback_channel_id":"channel-1","feedback_violation_status":"fabric_route_send_failed_backend_fallback_blocked"}`), + CreatedAt: now.Add(-5 * time.Minute), + }, + { + ID: "audit-2", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_incident.investigation_opened", + TargetType: "fabric_service_channel_route_rebuild_incident", + TargetID: stringPtr("route-1"), + Payload: json.RawMessage(`{}`), + CreatedAt: now.Add(-2 * time.Hour), + }, + 
{ + ID: "audit-3", + ClusterID: &clusterID, + EventType: "fabric.service_channel_route_feedback.expired", + TargetType: "fabric_service_channel_route_feedback", + TargetID: stringPtr("route-2"), + Payload: json.RawMessage(`{}`), + CreatedAt: now, + }, + { + ID: "audit-4", + ClusterID: &otherClusterID, + EventType: "fabric.service_channel_rebuild_feedback_breakdown.investigation_opened", + TargetType: "fabric_service_channel_rebuild_feedback_breakdown", + TargetID: stringPtr("channel-other"), + Payload: json.RawMessage(`{}`), + CreatedAt: now, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + events, err := service.ListAuditEvents(context.Background(), "admin-1", ListAuditEventsInput{ + ClusterID: clusterID, + EventTypes: []string{ + "fabric.service_channel_rebuild_feedback_breakdown.investigation_opened", + "fabric.service_channel_rebuild_incident.investigation_opened", + }, + Correlation: "fabric_diagnostics", + Limit: 10, + }) + if err != nil { + t.Fatalf("list audit events: %v", err) + } + if len(events) != 2 || events[0].ID != "audit-1" || events[1].ID != "audit-2" { + t.Fatalf("events = %+v, want only fabric investigation breadcrumbs", events) + } + if events[0].CorrelationHints == nil || events[0].CorrelationHints.CurrentDiagnosticStatus != "breakdown_active" || + events[0].CorrelationHints.FeedbackBreakdown == nil || events[0].CorrelationHints.FeedbackBreakdown.FeedbackChannelID != "channel-1" { + t.Fatalf("feedback breadcrumb correlation hints = %+v", events[0].CorrelationHints) + } + + breadcrumbs, err := service.ListFabricServiceChannelRebuildInvestigationBreadcrumbs(context.Background(), "admin-1", ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput{ + ClusterID: clusterID, + Limit: 10, + CurrentWindowSeconds: int64((30 * time.Minute).Seconds()), + HistoryWindowSeconds: int64((24 * time.Hour).Seconds()), + }) + if err != nil { + t.Fatalf("list breadcrumbs: %v", err) + } + if len(breadcrumbs.Events) 
!= 2 || breadcrumbs.Summary.TotalCount != 2 || breadcrumbs.Summary.CorrelatedCount != 2 { + t.Fatalf("breadcrumbs = %+v", breadcrumbs) + } + if breadcrumbs.Summary.CountsByCurrentDiagnosticStatus["breakdown_active"] != 1 || + breadcrumbs.Summary.CountsByCurrentDiagnosticStatus["incident_visible"] != 1 { + t.Fatalf("breadcrumb summary statuses = %+v", breadcrumbs.Summary.CountsByCurrentDiagnosticStatus) + } + if breadcrumbs.CurrentCount != 1 || breadcrumbs.StaleCount != 1 || breadcrumbs.ExpiredCount != 0 || + breadcrumbs.Summary.CountsByBreadcrumbStatus["current"] != 1 || + breadcrumbs.Summary.CountsByBreadcrumbStatus["stale"] != 1 { + t.Fatalf("breadcrumb freshness = %+v summary=%+v", breadcrumbs, breadcrumbs.Summary.CountsByBreadcrumbStatus) + } +} + +func TestRebuildHealthSilenceIsGenerationScoped(t *testing.T) { + now := time.Date(2026, 5, 8, 12, 0, 0, 0, time.UTC) + store := &fakeRepository{ + platformRole: "platform_admin", + fabricRebuildAttempts: []FabricServiceChannelRouteRebuildAttempt{ + { + ID: "attempt-old", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + ServiceClass: FabricServiceClassVPNPackets, + RouteID: "route-1", + RebuildStatus: "applied", + Generation: "gen-old", + UpdatedAt: now.Add(-5 * time.Minute), + CreatedAt: now.Add(-5 * time.Minute), + Payload: json.RawMessage(`{}`), + }, + { + ID: "attempt-new", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + ServiceClass: FabricServiceClassVPNPackets, + RouteID: "route-1", + RebuildStatus: "applied", + Generation: "gen-new", + UpdatedAt: now.Add(-4 * time.Minute), + CreatedAt: now.Add(-4 * time.Minute), + Payload: json.RawMessage(`{}`), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + _, err := service.SilenceFabricServiceChannelRouteRebuildAlert(context.Background(), SilenceFabricServiceChannelRouteRebuildAlertInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-1", + GuardStatus: 
"missing_node_transition", + Generation: "gen-old", + Reason: "known old test route", + TTL: time.Hour, + Now: now, + }) + if err != nil { + t.Fatalf("silence rebuild alert: %v", err) + } + summary, err := service.GetFabricServiceChannelRouteRebuildHealthSummary(context.Background(), "admin-1", GetFabricServiceChannelRouteRebuildHealthSummaryInput{ + ClusterID: "cluster-1", + Limit: 20, + }) + if err != nil { + t.Fatalf("get rebuild health: %v", err) + } + if summary.BadCount != 2 || summary.SilencedCount != 1 || summary.ActiveBadCount != 1 { + t.Fatalf("summary counts = %+v, want bad=2 silenced=1 active_bad=1", summary) + } + if len(summary.MostRecentBadAttempts) != 1 || summary.MostRecentBadAttempts[0].Generation != "gen-new" { + t.Fatalf("active bad attempts = %+v, want only fresh generation", summary.MostRecentBadAttempts) + } + if summary.ResurfacedCount != 1 || len(summary.ResurfacedAttempts) != 1 || summary.ResurfacedAttempts[0].AlertResurfacedPreviousGeneration != "gen-old" { + t.Fatalf("resurfaced attempts = %+v / count %d, want gen-new resurfaced from gen-old", summary.ResurfacedAttempts, summary.ResurfacedCount) + } + readiness, err := service.GetFabricServiceChannelReadiness(context.Background(), "admin-1", GetFabricServiceChannelReadinessInput{ + ClusterID: "cluster-1", + Limit: 20, + }) + if err != nil { + t.Fatalf("get readiness: %v", err) + } + if readiness.Status != "blocked" || readiness.Reason != "resurfaced_rebuild_alert" || readiness.ResurfacedCount != 1 { + t.Fatalf("readiness = %+v, want blocked by resurfaced alert", readiness) + } +} + +func TestOperatorExpiredFabricServiceChannelFeedbackAllowsRetryAndSuppressesImmediateChurn(t *testing.T) { + now := time.Date(2026, 5, 7, 12, 30, 0, 0, time.UTC) + cooldownUntil := now.Add(fabricServiceChannelOperatorExpireCooldown) + store := &fakeRepository{ + platformRole: PlatformRoleAdmin, + authorityState: ClusterAuthorityState{ + ClusterID: "cluster-1", + AuthorityState: "authoritative", + 
MutationMode: "normal", + }, + testingFlags: EffectiveNodeTestingFlags{ + Enabled: true, + SyntheticLinksEnabled: true, + }, + routeIntents: []MeshRouteIntent{ + { + ID: "route-retry", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "synthetic_enabled": true, + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-retry", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ObservedAt: now.Add(-time.Minute), + ExpiresAt: now, + RetryCooldownUntil: &cooldownUntil, + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + cfg, err := service.GetNodeSyntheticMeshConfig(context.Background(), GetNodeSyntheticMeshConfigInput{ + ClusterID: "cluster-1", + NodeID: "entry-1", + }) + if err != nil { + t.Fatalf("synthetic config: %v", err) + } + if !containsRouteID(cfg.Routes, "route-retry") { + t.Fatalf("route-retry should be retried during operator cooldown: %+v", cfg.Routes) + } + if cfg.RoutePathDecisions == nil || len(cfg.RoutePathDecisions.Decisions) != 1 || + !containsString(cfg.RoutePathDecisions.Decisions[0].ScoreReasons, "service_channel_route_retry_after_operator_expire") { + t.Fatalf("missing manual retry decision reason: %+v", cfg.RoutePathDecisions) + } + + _, err = store.RecordFabricServiceChannelRouteFeedback(context.Background(), RecordFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-retry", + ServiceClass: 
FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ObservedAt: now.Add(30 * time.Second), + ExpiresAt: now.Add(3 * time.Minute), + Payload: json.RawMessage(`{"last_error":"retry failed"}`), + }) + if err != nil { + t.Fatalf("record feedback: %v", err) + } + active, err := service.ListFabricServiceChannelRouteFeedback(context.Background(), "admin-1", ListFabricServiceChannelRouteFeedbackInput{ + ClusterID: "cluster-1", + RouteID: "route-retry", + Now: now.Add(30 * time.Second), + }) + if err != nil { + t.Fatalf("list feedback: %v", err) + } + if len(active) != 1 || active[0].FeedbackStatus != "operator_retry_cooldown" || active[0].ScoreAdjustment != 0 { + t.Fatalf("feedback not suppressed during cooldown: %+v", active) + } +} + +func TestIssueFabricServiceChannelLeaseDampensRecoveredRouteDuringRetryCooldown(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + cooldownUntil := now.Add(2 * time.Minute) + store := &fakeRepository{ + routeIntents: []MeshRouteIntent{ + { + ID: "route-recovered", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + { + ID: "route-steady", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 80, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, + }, + fabricRouteFeedback: []FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: 
"cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-recovered", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_route_rebuild_recommended"}, + ObservedAt: now.Add(-time.Minute), + ExpiresAt: now, + RetryCooldownUntil: &cooldownUntil, + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-recovered", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_quality_latency_le_10ms", "service_channel_rolling_quality_window"}, + LastSendDurationMs: 1, + Payload: json.RawMessage(`{"quality_window_sample_count":32,"quality_window_success_count":32,"quality_window_failure_count":0,"quality_window_drop_count":0,"quality_window_avg_latency_ms":1}`), + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + lease, err := service.IssueFabricServiceChannelLease(context.Background(), IssueFabricServiceChannelLeaseInput{ + ClusterID: "cluster-1", + OrganizationID: "org-home", + UserID: "user-m", + ResourceID: "vpn-home", + ServiceClass: FabricServiceClassVPNPackets, + EntryNodeIDs: []string{"entry-1"}, + ExitNodeIDs: []string{"exit-1"}, + PreferredEntryNodeID: "entry-1", + PreferredExitNodeID: "exit-1", + }) + if err != nil { + t.Fatalf("issue lease: %v", err) + } + if lease.PrimaryRoute.RouteID != "route-steady" { + t.Fatalf("primary route = %q, want steady route while recovered route is dampened", lease.PrimaryRoute.RouteID) + } + var recovered FabricServiceChannelRoute + for _, route := range append([]FabricServiceChannelRoute{lease.PrimaryRoute}, lease.AlternateRoutes...) 
{ + if route.RouteID == "route-recovered" { + recovered = route + break + } + } + if recovered.RouteID == "" || recovered.Status != "authorized" { + t.Fatalf("recovered route should be authorized alternate during hysteresis: primary=%+v alternates=%+v", lease.PrimaryRoute, lease.AlternateRoutes) + } + if recovered.RecoveryState != "recovered" || recovered.RecoveryPenalty != fabricServiceChannelRecoveryHysteresisPenalty { + t.Fatalf("recovered telemetry state=%q penalty=%d, want recovered penalty %d", recovered.RecoveryState, recovered.RecoveryPenalty, fabricServiceChannelRecoveryHysteresisPenalty) + } + if !containsString(recovered.ScoreReasons, "service_channel_recovery_hysteresis") || + !containsString(recovered.ScoreReasons, "service_channel_rolling_quality_window") || + !containsString(recovered.ScoreReasons, "manual_feedback_expired_retry_cooldown") { + t.Fatalf("recovered route score reasons = %+v, want hysteresis + rolling feedback reasons", recovered.ScoreReasons) + } + if recovered.PathScore >= lease.PrimaryRoute.PathScore { + t.Fatalf("recovered score = %d primary score = %d, want recovered route dampened below steady primary", recovered.PathScore, lease.PrimaryRoute.PathScore) + } +} + +func TestServiceChannelRouteFeedbackReportExposesRecoveryState(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + cooldownUntil := now.Add(2 * time.Minute) + report := serviceChannelRouteFeedbackReport([]FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-recovered", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_rolling_quality_window"}, + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + RetryCooldownUntil: &cooldownUntil, + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-promoted", + ServiceClass: 
FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_rolling_quality_window"}, + Payload: json.RawMessage(`{"quality_window_sample_count":96,"quality_window_success_count":96,"quality_window_failure_count":0,"quality_window_slow_count":0,"quality_window_drop_count":0}`), + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + RetryCooldownUntil: &cooldownUntil, + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-demoted", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "degraded", + ScoreAdjustment: -30, + Reasons: []string{"service_channel_recent_route_failure", "service_channel_rolling_quality_window"}, + Payload: json.RawMessage(`{"quality_window_sample_count":96,"quality_window_success_count":95,"quality_window_failure_count":1,"quality_window_slow_count":0,"quality_window_drop_count":0}`), + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + RetryCooldownUntil: &cooldownUntil, + }, + }, now) + if report.RecoveredRouteCount != 1 || report.RecoveryHysteresisCount != 1 || report.RecoveryPromotedCount != 1 || report.RecoveryDemotedCount != 1 { + t.Fatalf("recovery counters = recovered:%d hysteresis:%d promoted:%d demoted:%d, want 1/1/1/1", report.RecoveredRouteCount, report.RecoveryHysteresisCount, report.RecoveryPromotedCount, report.RecoveryDemotedCount) + } + if len(report.Observations) != 3 { + t.Fatalf("observations = %d, want 3", len(report.Observations)) + } + observation := report.Observations[0] + if observation.RecoveryState != "recovered" || !observation.RecoveryHysteresisActive || observation.RecoveryHysteresisPenalty != fabricServiceChannelRecoveryHysteresisPenalty { + t.Fatalf("observation recovery telemetry = state:%q active:%t penalty:%d", observation.RecoveryState, observation.RecoveryHysteresisActive, observation.RecoveryHysteresisPenalty) + } + promoted := report.Observations[1] + if 
promoted.RecoveryState != "healthy" || !promoted.RecoveryPromoted || promoted.RecoveryHysteresisActive { + t.Fatalf("promoted recovery telemetry = state:%q promoted:%t hysteresis:%t", promoted.RecoveryState, promoted.RecoveryPromoted, promoted.RecoveryHysteresisActive) + } + demoted := report.Observations[2] + if demoted.RecoveryState != "degraded" || !demoted.RecoveryDemoted || demoted.RecoveryReason != "service_channel_recovery_demoted_failure" { + t.Fatalf("demoted recovery telemetry = state:%q demoted:%t reason:%q", demoted.RecoveryState, demoted.RecoveryDemoted, demoted.RecoveryReason) + } +} + +func TestFabricServiceChannelRecoveryPromotionRemovesHysteresisPenalty(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + route, ok := fabricServiceChannelRouteFromIntent(MeshRouteIntent{ + ID: "route-promoted", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, FabricServiceClassVPNPackets, []string{"entry-1"}, []string{"exit-1"}, []string{"vpn_packet"}, "generation-1", now, now.Add(time.Minute), map[string]fabricServiceChannelRouteFeedback{ + "route-promoted": { + RouteID: "route-promoted", + ManualRetry: true, + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_rolling_quality_window", "manual_feedback_expired_retry_cooldown"}, + QualityWindowSampleCount: 96, + QualityWindowSuccessCount: 96, + }, + }, defaultFabricServiceChannelRecoveryPolicy()) + if !ok { + t.Fatal("route was not built") + } + if route.RecoveryState != "healthy" || !route.RecoveryPromoted || route.RecoveryPenalty != 0 { + t.Fatalf("promoted route recovery = state:%q promoted:%t penalty:%d", route.RecoveryState, 
route.RecoveryPromoted, route.RecoveryPenalty) + } + if containsString(route.ScoreReasons, "service_channel_recovery_hysteresis") || !containsString(route.ScoreReasons, "service_channel_recovery_promoted") { + t.Fatalf("promoted route reasons = %+v", route.ScoreReasons) + } +} + +func TestFabricServiceChannelRecoveryDemotionMarksRouteReason(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + route, ok := fabricServiceChannelRouteFromIntent(MeshRouteIntent{ + ID: "route-demoted", + ClusterID: "cluster-1", + SourceSelector: json.RawMessage(`{"node_id":"entry-1"}`), + DestinationSelector: json.RawMessage(`{"node_id":"exit-1"}`), + ServiceClass: FabricServiceClassVPNPackets, + Priority: 100, + Status: "active", + Policy: json.RawMessage(`{ + "hops": ["entry-1", "exit-1"], + "allowed_channels": ["vpn_packet", "fabric_control"] + }`), + UpdatedAt: now, + }, FabricServiceClassVPNPackets, []string{"entry-1"}, []string{"exit-1"}, []string{"vpn_packet"}, "generation-1", now, now.Add(time.Minute), map[string]fabricServiceChannelRouteFeedback{ + "route-demoted": { + RouteID: "route-demoted", + ManualRetry: true, + ScoreAdjustment: -30, + Reasons: []string{"service_channel_recent_route_failure", "service_channel_rolling_quality_window", "manual_feedback_expired_retry_cooldown"}, + QualityWindowSampleCount: 96, + QualityWindowSuccessCount: 95, + QualityWindowFailureCount: 1, + }, + }, defaultFabricServiceChannelRecoveryPolicy()) + if !ok { + t.Fatal("route was not built") + } + if !route.RecoveryDemoted || route.RecoveryReason != "service_channel_recovery_demoted_failure" || route.RecoveryPromoted { + t.Fatalf("demoted route recovery = demoted:%t reason:%q promoted:%t", route.RecoveryDemoted, route.RecoveryReason, route.RecoveryPromoted) + } + if !containsString(route.ScoreReasons, "service_channel_recovery_demoted") || !containsString(route.ScoreReasons, "service_channel_recovery_demoted_failure") { + t.Fatalf("demoted route reasons = %+v", route.ScoreReasons) + } 
+} + +func TestFabricServiceChannelRecoveryPolicyControlsPromotionAndPenalty(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + policy := defaultFabricServiceChannelRecoveryPolicy() + policy.HysteresisPenalty = 40 + policy.PromotionMinSamples = 4 + cooldownUntil := now.Add(2 * time.Minute) + report := serviceChannelRouteFeedbackReportWithPolicy([]FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-promoted", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_rolling_quality_window"}, + Payload: json.RawMessage(`{"quality_window_sample_count":4,"quality_window_success_count":4,"quality_window_failure_count":0,"quality_window_slow_count":0,"quality_window_drop_count":0}`), + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + RetryCooldownUntil: &cooldownUntil, + }, + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-recovered", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 90, + Reasons: []string{"service_channel_recent_success", "service_channel_rolling_quality_window"}, + ObservedAt: now, + ExpiresAt: now.Add(2 * time.Minute), + RetryCooldownUntil: &cooldownUntil, + }, + }, now, policy) + if report.RecoveryPromotedCount != 1 || report.RecoveryHysteresisCount != 1 { + t.Fatalf("policy counters promoted/hysteresis = %d/%d, want 1/1", report.RecoveryPromotedCount, report.RecoveryHysteresisCount) + } + if report.Observations[1].RecoveryHysteresisPenalty != 40 { + t.Fatalf("hysteresis penalty = %d, want policy penalty 40", report.Observations[1].RecoveryHysteresisPenalty) + } + if report.RecoveryPolicy == nil || report.RecoveryPolicy.HysteresisPenalty != 40 || report.RecoveryPolicy.PromotionMinSamples != 4 { + t.Fatalf("report recovery policy provenance = %+v", report.RecoveryPolicy) + } +} 
+ +func TestFabricServiceChannelFeedbackStalePolicyIsConservative(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + policy := defaultFabricServiceChannelRecoveryPolicy() + policy.HysteresisPenalty = 44 + policy = normalizeFabricServiceChannelRecoveryPolicy(policy, defaultFabricServiceChannelRecoveryPolicy()) + routeProvenance := map[string]fabricServiceChannelRouteProvenance{ + "route-1": {RouteID: "route-1", RouteGeneration: "policy-v2", PolicyVersion: "policy-v2", RouteVersion: "route-v2"}, + } + report := serviceChannelRouteFeedbackReportWithPolicyAndProvenance([]FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-1", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "fenced", + ScoreAdjustment: -1030, + Reasons: []string{"service_channel_recent_route_failure", "service_channel_route_rebuild_recommended"}, + Payload: json.RawMessage(`{"recovery_policy_fingerprint":"old-policy","route_policy_version":"policy-v1","quality_window_failure_count":3}`), + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + ConsecutiveFailures: 3, + LastSendDurationMs: 900, + }, + }, now, policy, routeProvenance) + if report == nil || report.StalePolicyCount != 1 || report.StaleGenerationCount != 1 { + t.Fatalf("stale counters = %+v, want policy/generation stale", report) + } + if report.Observations[0].EffectiveScoreAdjustment != -10 || !report.Observations[0].StalePolicy || !report.Observations[0].StaleGeneration { + t.Fatalf("stale observation = %+v", report.Observations[0]) + } + feedback := fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(report.Observations, now, policy, routeProvenance) + item := feedback["route-1"] + if item.Fenced || item.RouteRebuildRecommended || item.ScoreAdjustment != -10 { + t.Fatalf("stale feedback should not fence/rebuild current route: %+v", item) + } +} + +func TestFabricServiceChannelFeedbackMissingProvenanceIsVisibleButCompatible(t 
*testing.T) { + now := time.Now().UTC().Truncate(time.Second) + policy := defaultFabricServiceChannelRecoveryPolicy() + routeProvenance := map[string]fabricServiceChannelRouteProvenance{ + "route-1": {RouteID: "route-1", RouteGeneration: "policy-v2", PolicyVersion: "policy-v2", RouteVersion: "route-v2"}, + } + report := serviceChannelRouteFeedbackReportWithPolicyAndProvenance([]FabricServiceChannelRouteFeedbackObservation{ + { + ClusterID: "cluster-1", + ReporterNodeID: "entry-1", + RouteID: "route-1", + ServiceClass: FabricServiceClassVPNPackets, + FeedbackStatus: "healthy", + ScoreAdjustment: 42, + Reasons: []string{"service_channel_recent_success"}, + Payload: json.RawMessage(`{"quality_window_success_count":8}`), + ObservedAt: now, + ExpiresAt: now.Add(time.Minute), + }, + }, now, policy, routeProvenance) + if report == nil || report.MissingProvenanceCount != 1 || report.StalePolicyCount != 0 || report.StaleGenerationCount != 0 { + t.Fatalf("missing provenance counters = %+v", report) + } + feedback := fabricServiceChannelRouteFeedbackFromObservationsWithProvenance(report.Observations, now, policy, routeProvenance) + if feedback["route-1"].ScoreAdjustment != 42 || !feedback["route-1"].ProvenanceMissing { + t.Fatalf("missing provenance should stay compatible for old agents: %+v", feedback["route-1"]) + } +} + +func TestUpdateFabricServiceChannelRecoveryPolicyPersistsClusterMetadata(t *testing.T) { + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: json.RawMessage(`{"existing":true}`), + }, + } + service := NewService(store) + enabled := true + policy, err := service.UpdateFabricServiceChannelRecoveryPolicy(context.Background(), UpdateFabricServiceChannelRecoveryPolicyInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + HysteresisPenalty: 42, + PromotionMinSamples: 7, + DemotionFailureThreshold: 3, + DemotionDropThreshold: 
2, + DemotionSlowThreshold: 5, + DemotionRebuildEnabled: &enabled, + DemotionFencedEnabled: &enabled, + }) + if err != nil { + t.Fatalf("update recovery policy: %v", err) + } + if policy.HysteresisPenalty != 42 || policy.PromotionMinSamples != 7 || policy.DemotionFailureThreshold != 3 { + t.Fatalf("policy = %+v, want configured values", policy) + } + var metadata map[string]any + if err := json.Unmarshal(store.cluster.Metadata, &metadata); err != nil { + t.Fatalf("metadata json: %v", err) + } + if metadata["existing"] != true || metadata["fabric_service_channel_recovery_policy"] == nil { + t.Fatalf("metadata = %+v, want existing value plus policy", metadata) + } +} + +func TestUpdateFabricServiceChannelAdaptivePolicyPersistsClusterMetadata(t *testing.T) { + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: json.RawMessage(`{"existing":true}`), + }, + } + service := NewService(store) + policy, err := service.UpdateFabricServiceChannelAdaptivePolicy(context.Background(), UpdateFabricServiceChannelAdaptivePolicyInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + MaxParallelWindow: 6, + BulkPressureChannelThreshold: 8, + QueuePressureHighWatermark: 9, + QueuePressureMaxInFlight: 10, + ClassWindows: map[string]int{ + "control": 6, + "interactive": 6, + "reliable": 4, + "bulk": 2, + "droppable": 1, + }, + }) + if err != nil { + t.Fatalf("update adaptive policy: %v", err) + } + if policy.MaxParallelWindow != 6 || policy.ClassWindows["bulk"] != 2 || policy.QueuePressureHighWatermark != 9 { + t.Fatalf("policy = %+v, want configured values", policy) + } + if policy.Fingerprint == "" || policy.Source != "cluster_metadata" { + t.Fatalf("policy provenance = %+v", policy) + } + var metadata map[string]any + if err := json.Unmarshal(store.cluster.Metadata, &metadata); err != nil { + t.Fatalf("metadata json: %v", err) + } + if 
metadata["existing"] != true || metadata["fabric_service_channel_adaptive_policy"] == nil { + t.Fatalf("metadata = %+v, want existing value plus adaptive policy", metadata) + } +} + +func TestUpdateFabricServiceChannelPoolPolicyPersistsClusterMetadata(t *testing.T) { + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: json.RawMessage(`{"existing":true}`), + }, + } + service := NewService(store) + enabled := true + sticky := false + policy, err := service.UpdateFabricServiceChannelPoolPolicy(context.Background(), UpdateFabricServiceChannelPoolPolicyInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + EntryPoolNodeIDs: []string{"entry-a", "entry-b"}, + ExitPoolNodeIDs: []string{"exit-b"}, + PreferredEntryNodeID: "entry-b", + PreferredExitNodeID: "exit-b", + SelectionStrategy: "preferred_first", + RouteRebuild: "automatic", + EntryFailover: "automatic", + ExitFailover: "manual", + BackendFallbackAllowed: &enabled, + StickySession: &sticky, + }) + if err != nil { + t.Fatalf("update pool policy: %v", err) + } + if policy.PreferredEntryNodeID != "entry-b" || policy.PreferredExitNodeID != "exit-b" || policy.ExitFailover != "manual" || policy.StickySession { + t.Fatalf("policy = %+v, want configured values", policy) + } + if policy.Fingerprint == "" || policy.Source != "cluster_metadata" { + t.Fatalf("policy provenance = %+v", policy) + } + var metadata map[string]any + if err := json.Unmarshal(store.cluster.Metadata, &metadata); err != nil { + t.Fatalf("metadata json: %v", err) + } + if metadata["existing"] != true || metadata["fabric_service_channel_pool_policy"] == nil { + t.Fatalf("metadata = %+v, want existing value plus pool policy", metadata) + } +} + +func TestUpdateFabricServiceChannelBreadcrumbWindowPolicyPersistsClusterMetadata(t *testing.T) { + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: 
Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: json.RawMessage(`{"existing":true}`), + }, + } + service := NewService(store) + policy, err := service.UpdateFabricServiceChannelBreadcrumbWindowPolicy(context.Background(), UpdateFabricServiceChannelBreadcrumbWindowPolicyInput{ + ActorUserID: "admin-1", + ClusterID: "cluster-1", + CurrentWindowSeconds: 600, + HistoryWindowSeconds: 7200, + }) + if err != nil { + t.Fatalf("update breadcrumb window policy: %v", err) + } + if policy.CurrentWindowSeconds != 600 || policy.HistoryWindowSeconds != 7200 { + t.Fatalf("policy = %+v, want configured windows", policy) + } + if policy.Fingerprint == "" || policy.Source != "cluster_metadata" { + t.Fatalf("policy provenance = %+v", policy) + } + var metadata map[string]any + if err := json.Unmarshal(store.cluster.Metadata, &metadata); err != nil { + t.Fatalf("metadata json: %v", err) + } + if metadata["existing"] != true || metadata["fabric_service_channel_breadcrumb_window_policy"] == nil { + t.Fatalf("metadata = %+v, want existing value plus breadcrumb window policy", metadata) + } +} + +func TestListFabricBreadcrumbsUsesClusterDefaultWindowPolicy(t *testing.T) { + clusterID := "cluster-1" + now := time.Date(2026, 5, 9, 14, 20, 0, 0, time.UTC) + policy := defaultFabricServiceChannelBreadcrumbWindowPolicy() + policy.Source = "cluster_metadata" + policy.CurrentWindowSeconds = 600 + policy.HistoryWindowSeconds = 1800 + policy = normalizeFabricServiceChannelBreadcrumbWindowPolicy(policy, defaultFabricServiceChannelBreadcrumbWindowPolicy()) + metadata, err := upsertFabricServiceChannelBreadcrumbWindowPolicyMetadata(json.RawMessage(`{}`), policy) + if err != nil { + t.Fatalf("policy metadata: %v", err) + } + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: Cluster{ + ID: clusterID, + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: metadata, + }, + auditEvents: 
[]ClusterAuditEvent{ + { + ID: "audit-current", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_incident.investigation_opened", + TargetType: "fabric_service_channel_route_rebuild_incident", + Payload: json.RawMessage(`{}`), + CreatedAt: now.Add(-5 * time.Minute), + }, + { + ID: "audit-stale", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_incident.investigation_opened", + TargetType: "fabric_service_channel_route_rebuild_incident", + Payload: json.RawMessage(`{}`), + CreatedAt: now.Add(-20 * time.Minute), + }, + { + ID: "audit-expired", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_incident.investigation_opened", + TargetType: "fabric_service_channel_route_rebuild_incident", + Payload: json.RawMessage(`{}`), + CreatedAt: now.Add(-40 * time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + breadcrumbs, err := service.ListFabricServiceChannelRebuildInvestigationBreadcrumbs(context.Background(), "admin-1", ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput{ + ClusterID: clusterID, + Limit: 10, + }) + if err != nil { + t.Fatalf("list breadcrumbs: %v", err) + } + if breadcrumbs.CurrentWindowSeconds != 600 || breadcrumbs.HistoryWindowSeconds != 1800 { + t.Fatalf("breadcrumb windows = %d/%d, want cluster policy", breadcrumbs.CurrentWindowSeconds, breadcrumbs.HistoryWindowSeconds) + } + if breadcrumbs.CurrentCount != 1 || breadcrumbs.StaleCount != 1 || breadcrumbs.ExpiredCount != 1 { + t.Fatalf("breadcrumb freshness counts = current %d stale %d expired %d", breadcrumbs.CurrentCount, breadcrumbs.StaleCount, breadcrumbs.ExpiredCount) + } +} + +func TestListFabricBreadcrumbsKeepsQueryWindowOverrides(t *testing.T) { + clusterID := "cluster-1" + now := time.Date(2026, 5, 9, 14, 20, 0, 0, time.UTC) + policy := defaultFabricServiceChannelBreadcrumbWindowPolicy() + policy.Source = "cluster_metadata" + policy.CurrentWindowSeconds = 3600 + 
policy.HistoryWindowSeconds = 7200 + policy = normalizeFabricServiceChannelBreadcrumbWindowPolicy(policy, defaultFabricServiceChannelBreadcrumbWindowPolicy()) + metadata, err := upsertFabricServiceChannelBreadcrumbWindowPolicyMetadata(json.RawMessage(`{}`), policy) + if err != nil { + t.Fatalf("policy metadata: %v", err) + } + store := &fakeRepository{ + platformRole: "platform_admin", + cluster: Cluster{ + ID: clusterID, + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: metadata, + }, + auditEvents: []ClusterAuditEvent{ + { + ID: "audit-stale-by-override", + ClusterID: &clusterID, + EventType: "fabric.service_channel_rebuild_incident.investigation_opened", + TargetType: "fabric_service_channel_route_rebuild_incident", + Payload: json.RawMessage(`{}`), + CreatedAt: now.Add(-20 * time.Minute), + }, + }, + } + service := NewService(store) + service.now = func() time.Time { return now } + + breadcrumbs, err := service.ListFabricServiceChannelRebuildInvestigationBreadcrumbs(context.Background(), "admin-1", ListFabricServiceChannelRebuildInvestigationBreadcrumbsInput{ + ClusterID: clusterID, + Limit: 10, + CurrentWindowSeconds: 600, + HistoryWindowSeconds: 1800, + }) + if err != nil { + t.Fatalf("list breadcrumbs: %v", err) + } + if breadcrumbs.CurrentWindowSeconds != 600 || breadcrumbs.HistoryWindowSeconds != 1800 { + t.Fatalf("breadcrumb windows = %d/%d, want query override", breadcrumbs.CurrentWindowSeconds, breadcrumbs.HistoryWindowSeconds) + } + if breadcrumbs.CurrentCount != 0 || breadcrumbs.StaleCount != 1 || breadcrumbs.ExpiredCount != 0 { + t.Fatalf("breadcrumb override freshness counts = current %d stale %d expired %d", breadcrumbs.CurrentCount, breadcrumbs.StaleCount, breadcrumbs.ExpiredCount) + } +} + +func TestRoutePathDecisionReportCountsRecoveryHysteresis(t *testing.T) { + now := time.Now().UTC().Truncate(time.Second) + policy := defaultFabricServiceChannelRecoveryPolicy() + policy.Source = "cluster_metadata" + 
policy.HysteresisPenalty = 33 + report := routePathDecisionReportWithRecoveryPolicy("generation-1", []RoutePathDecision{ + { + DecisionID: "decision-1", + RouteID: "route-recovered", + ClusterID: "cluster-1", + LocalNodeID: "entry-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + OriginalHops: []string{"entry-1", "exit-1"}, + EffectiveHops: []string{"entry-1", "exit-1"}, + LocalRole: "entry", + DecisionSource: "service_channel_feedback_replacement", + Generation: "generation-1", + ScoreReasons: []string{"service_channel_recovery_hysteresis"}, + ControlPlaneOnly: true, + ProductionForwarding: false, + ExpiresAt: now.Add(time.Minute), + }, + { + DecisionID: "decision-2", + RouteID: "route-promoted", + ClusterID: "cluster-1", + LocalNodeID: "entry-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + OriginalHops: []string{"entry-1", "exit-1"}, + EffectiveHops: []string{"entry-1", "exit-1"}, + LocalRole: "entry", + DecisionSource: "service_channel_feedback_replacement", + Generation: "generation-1", + ScoreReasons: []string{"service_channel_recovery_promoted"}, + ControlPlaneOnly: true, + ProductionForwarding: false, + ExpiresAt: now.Add(time.Minute), + }, + { + DecisionID: "decision-3", + RouteID: "route-demoted", + ClusterID: "cluster-1", + LocalNodeID: "entry-1", + SourceNodeID: "entry-1", + DestinationNodeID: "exit-1", + OriginalHops: []string{"entry-1", "exit-1"}, + EffectiveHops: []string{"entry-1", "exit-1"}, + LocalRole: "entry", + DecisionSource: "service_channel_feedback_replacement", + Generation: "generation-1", + ScoreReasons: []string{"service_channel_recovery_demoted", "service_channel_recovery_demoted_failure"}, + ControlPlaneOnly: true, + ProductionForwarding: false, + ExpiresAt: now.Add(time.Minute), + }, + }, policy) + if report == nil || report.RecoveryHysteresisCount != 1 || report.RecoveryPromotedCount != 1 || report.RecoveryDemotedCount != 1 { + t.Fatalf("recovery counts = %+v, want hysteresis/promoted/demoted 1/1/1", 
report) + } + if report.RecoveryPolicy == nil || report.RecoveryPolicy.Source != "cluster_metadata" || report.RecoveryPolicy.HysteresisPenalty != 33 { + t.Fatalf("route path decision recovery policy provenance = %+v", report.RecoveryPolicy) + } +} + +func containsRouteID(routes []SyntheticMeshRouteConfig, routeID string) bool { + for _, route := range routes { + if route.RouteID == routeID { + return true + } + } + return false +} + +func ptrTime(value time.Time) *time.Time { + return &value +} + type fakeRepository struct { - platformRole string - lastTokenHash string - validTokenErr error - createJoinRequestID string - bootstrapJoinRequest NodeJoinRequest - clusterAuthority ClusterAuthorityKey - lastTokenAuthority json.RawMessage - lastApprovalAuthority json.RawMessage - authorityState ClusterAuthorityState - vpnConnection VPNConnection - lastVPNConnectionInput CreateVPNConnectionInput - lastAllowedNodesInput SetVPNConnectionAllowedNodesInput - lastAttachInput AttachExistingNodeInput - lastNodeGroupInput CreateNodeGroupInput - lastAssignGroupInput AssignNodeGroupInput - lastEntryPointInput CreateFabricEntryPointInput - lastEgressPoolInput CreateFabricEgressPoolInput - acquireVPNLeaseErr error - ownerEligibility VPNLeaseOwnerEligibility - ownerEligibilityErr error - renewVPNLeaseErr error - expiredVPNLeases []VPNConnectionLease - nodeVPNAssignments []NodeVPNAssignment - testingFlags EffectiveNodeTestingFlags - routeIntents []MeshRouteIntent - meshLinks []MeshLinkObservation - heartbeats map[string][]NodeHeartbeat - auditEvents []ClusterAuditEvent + platformRole string + lastTokenHash string + lastLookupTokenHash string + validJoinToken NodeJoinToken + validTokenErr error + createJoinRequestID string + bootstrapJoinRequest NodeJoinRequest + clusterAuthority ClusterAuthorityKey + lastTokenAuthority json.RawMessage + lastApprovalAuthority json.RawMessage + authorityState ClusterAuthorityState + vpnConnection VPNConnection + lastVPNConnectionInput 
CreateVPNConnectionInput + lastAllowedNodesInput SetVPNConnectionAllowedNodesInput + lastAttachInput AttachExistingNodeInput + lastNodeGroupInput CreateNodeGroupInput + lastAssignGroupInput AssignNodeGroupInput + lastEntryPointInput CreateFabricEntryPointInput + lastEgressPoolInput CreateFabricEgressPoolInput + acquireVPNLeaseErr error + ownerEligibility VPNLeaseOwnerEligibility + ownerEligibilityErr error + renewVPNLeaseErr error + expiredVPNLeases []VPNConnectionLease + nodeVPNAssignments []NodeVPNAssignment + vpnClientProfile VPNClientProfile + testingFlags EffectiveNodeTestingFlags + routeIntents []MeshRouteIntent + createdRouteIntents []CreateRouteIntentInput + clusterNodes []ClusterNode + nodeRoles map[string][]NodeRoleAssignment + releaseVersions []ReleaseVersion + updateServiceCandidates []NodeUpdateServiceCandidate + nodeUpdatePolicies map[string]NodeUpdatePolicy + updateStatuses []NodeUpdateStatus + meshLinks []MeshLinkObservation + fabricRouteFeedback []FabricServiceChannelRouteFeedbackObservation + fabricLeases map[string]FabricServiceChannelLeaseRecord + fabricRebuildAttempts []FabricServiceChannelRouteRebuildAttempt + fabricRebuildSilences []FabricServiceChannelRouteRebuildAlertSilence + heartbeats map[string][]NodeHeartbeat + nodeTelemetry map[string][]NodeTelemetryObservation + desiredWorkloads []NodeWorkloadDesiredState + auditEvents []ClusterAuditEvent + cluster Cluster + lastPreferredEntryNodeID string + lastPreferredExitNodeID string } func (f *fakeRepository) GetPlatformRole(context.Context, string) (string, error) { @@ -1942,7 +7455,16 @@ func (f *fakeRepository) ListClusters(context.Context) ([]Cluster, error) { } func (f *fakeRepository) GetCluster(context.Context, string) (Cluster, error) { - return Cluster{}, nil + if f.cluster.ID != "" { + return f.cluster, nil + } + return Cluster{ + ID: "cluster-1", + Slug: "cluster-1", + Name: "Cluster 1", + Status: ClusterStatusActive, + Metadata: json.RawMessage(`{}`), + }, nil } func (f 
*fakeRepository) CreateCluster(context.Context, CreateClusterInput) (Cluster, error) { @@ -1950,14 +7472,15 @@ func (f *fakeRepository) CreateCluster(context.Context, CreateClusterInput) (Clu } func (f *fakeRepository) UpdateCluster(_ context.Context, input UpdateClusterInput) (Cluster, error) { - return Cluster{ + f.cluster = Cluster{ ID: input.ClusterID, Slug: "cluster-1", Name: input.Name, Status: input.Status, Region: input.Region, Metadata: input.Metadata, - }, nil + } + return f.cluster, nil } func (f *fakeRepository) GetClusterAuthority(_ context.Context, clusterID string) (ClusterAuthorityKey, error) { @@ -1991,7 +7514,7 @@ func (f *fakeRepository) EnsureClusterAuthority(ctx context.Context, clusterID s } func (f *fakeRepository) ListClusterNodes(context.Context, string) ([]ClusterNode, error) { - return nil, nil + return f.clusterNodes, nil } func (f *fakeRepository) ListNodeGroups(context.Context, string) ([]ClusterNodeGroup, error) { @@ -2051,10 +7574,14 @@ func (f *fakeRepository) SetJoinTokenAuthority(_ context.Context, clusterID, tok }, nil } -func (f *fakeRepository) GetValidJoinTokenByHash(context.Context, string, string) (NodeJoinToken, error) { +func (f *fakeRepository) GetValidJoinTokenByHash(_ context.Context, _ string, tokenHash string) (NodeJoinToken, error) { + f.lastLookupTokenHash = tokenHash if f.validTokenErr != nil { return NodeJoinToken{}, f.validTokenErr } + if f.validJoinToken.ID != "" { + return f.validJoinToken, nil + } return NodeJoinToken{ID: "token-1", Status: "active", ExpiresAt: time.Now().Add(time.Hour), MaxUses: 1}, nil } @@ -2062,6 +7589,10 @@ func (f *fakeRepository) RevokeJoinToken(context.Context, RevokeJoinTokenInput) return NodeJoinToken{ID: "token-1", Status: "revoked"}, nil } +func (f *fakeRepository) ListJoinTokens(context.Context, string) ([]NodeJoinToken, error) { + return []NodeJoinToken{{ID: "token-1", Status: "active", ExpiresAt: time.Now().Add(time.Hour), MaxUses: 1}}, nil +} + func (f *fakeRepository) 
ExpireJoinTokens(context.Context, string) error { return nil } @@ -2125,8 +7656,8 @@ func (f *fakeRepository) AssignNodeRole(_ context.Context, input AssignNodeRoleI return NodeRoleAssignment{ClusterID: input.ClusterID, NodeID: input.NodeID, Role: input.Role}, nil } -func (f *fakeRepository) ListNodeRoleAssignments(context.Context, string, string) ([]NodeRoleAssignment, error) { - return nil, nil +func (f *fakeRepository) ListNodeRoleAssignments(_ context.Context, _ string, nodeID string) ([]NodeRoleAssignment, error) { + return f.nodeRoles[nodeID], nil } func (f *fakeRepository) AttachExistingNodeToCluster(_ context.Context, input AttachExistingNodeInput) (ClusterNode, error) { @@ -2140,14 +7671,152 @@ func (f *fakeRepository) AttachExistingNodeToCluster(_ context.Context, input At }, nil } -func (f *fakeRepository) RecordHeartbeat(context.Context, RecordHeartbeatInput) (NodeHeartbeat, error) { - return NodeHeartbeat{}, nil +func (f *fakeRepository) RecordHeartbeat(_ context.Context, input RecordHeartbeatInput) (NodeHeartbeat, error) { + now := time.Now().UTC() + item := NodeHeartbeat{ + ID: "heartbeat-" + input.NodeID, + ClusterID: input.ClusterID, + NodeID: input.NodeID, + HealthStatus: input.HealthStatus, + ReportedVersion: input.ReportedVersion, + Capabilities: input.Capabilities, + ServiceStates: input.ServiceStates, + Metadata: input.Metadata, + ObservedAt: now, + } + if f.heartbeats == nil { + f.heartbeats = map[string][]NodeHeartbeat{} + } + f.heartbeats[input.NodeID] = append([]NodeHeartbeat{item}, f.heartbeats[input.NodeID]...) 
+ return item, nil } func (f *fakeRepository) ListNodeHeartbeats(_ context.Context, _ string, nodeID string, _ int) ([]NodeHeartbeat, error) { return f.heartbeats[nodeID], nil } +func (f *fakeRepository) CreateReleaseVersion(_ context.Context, input CreateReleaseVersionInput) (ReleaseVersion, error) { + item := ReleaseVersion{ + ID: "release-" + input.Version, + ClusterID: input.ClusterID, + Product: input.Product, + Version: input.Version, + Channel: input.Channel, + Status: input.Status, + Compatibility: input.Compatibility, + Changelog: input.Changelog, + CreatedByUserID: &input.ActorUserID, + CreatedAt: time.Now().UTC(), + } + for i, artifact := range input.Artifacts { + item.Artifacts = append(item.Artifacts, ReleaseArtifact{ + ID: item.ID + "-artifact", + ReleaseID: item.ID, + ClusterID: input.ClusterID, + Product: input.Product, + Version: input.Version, + OS: artifact.OS, + Arch: artifact.Arch, + InstallType: artifact.InstallType, + Kind: artifact.Kind, + URL: artifact.URL, + SHA256: artifact.SHA256, + SizeBytes: artifact.SizeBytes, + Signature: artifact.Signature, + Metadata: artifact.Metadata, + CreatedAt: time.Now().UTC().Add(time.Duration(i) * time.Second), + }) + } + f.releaseVersions = append([]ReleaseVersion{item}, f.releaseVersions...) 
+ return item, nil +} + +func (f *fakeRepository) ListReleaseVersions(_ context.Context, clusterID, product, channel string) ([]ReleaseVersion, error) { + var out []ReleaseVersion + for _, item := range f.releaseVersions { + if item.ClusterID != clusterID { + continue + } + if product != "" && item.Product != product { + continue + } + if channel != "" && item.Channel != channel { + continue + } + out = append(out, item) + } + return out, nil +} + +func (f *fakeRepository) ListNodeUpdateServiceCandidates(context.Context, string) ([]NodeUpdateServiceCandidate, error) { + return f.updateServiceCandidates, nil +} + +func (f *fakeRepository) UpsertNodeUpdatePolicy(_ context.Context, input UpsertNodeUpdatePolicyInput) (NodeUpdatePolicy, error) { + item := NodeUpdatePolicy{ + ClusterID: input.ClusterID, + NodeID: input.NodeID, + Product: input.Product, + Channel: input.Channel, + TargetVersion: input.TargetVersion, + Strategy: input.Strategy, + Enabled: input.Enabled, + RollbackAllowed: input.RollbackAllowed, + HealthWindowSec: input.HealthWindowSec, + UpdatedByUserID: &input.ActorUserID, + UpdatedAt: time.Now().UTC(), + } + if f.nodeUpdatePolicies == nil { + f.nodeUpdatePolicies = map[string]NodeUpdatePolicy{} + } + f.nodeUpdatePolicies[input.NodeID+"|"+input.Product] = item + return item, nil +} + +func (f *fakeRepository) GetNodeUpdatePolicy(_ context.Context, _ string, nodeID, product string) (NodeUpdatePolicy, error) { + if f.nodeUpdatePolicies == nil { + return NodeUpdatePolicy{}, pgx.ErrNoRows + } + item, ok := f.nodeUpdatePolicies[nodeID+"|"+product] + if !ok { + return NodeUpdatePolicy{}, pgx.ErrNoRows + } + return item, nil +} + +func (f *fakeRepository) ReportNodeUpdateStatus(_ context.Context, input ReportNodeUpdateStatusInput) (NodeUpdateStatus, error) { + item := NodeUpdateStatus{ + ID: "status-1", + ClusterID: input.ClusterID, + NodeID: input.NodeID, + Product: input.Product, + CurrentVersion: input.CurrentVersion, + TargetVersion: input.TargetVersion, + 
Phase: input.Phase, + Status: input.Status, + AttemptID: input.AttemptID, + ErrorMessage: input.ErrorMessage, + RollbackVersion: input.RollbackVersion, + Payload: input.Payload, + ObservedAt: input.ObservedAt, + } + f.updateStatuses = append(f.updateStatuses, item) + return item, nil +} + +func (f *fakeRepository) ListNodeUpdateStatuses(_ context.Context, clusterID, nodeID string, limit int) ([]NodeUpdateStatus, error) { + out := []NodeUpdateStatus{} + for _, item := range f.updateStatuses { + if item.ClusterID == clusterID && item.NodeID == nodeID { + out = append(out, item) + } + } + if limit > 0 && len(out) > limit { + out = out[:limit] + } + return out, nil +} + func (f *fakeRepository) RevokeNodeIdentity(context.Context, RevokeNodeIdentityInput) error { return nil } @@ -2156,6 +7825,10 @@ func (f *fakeRepository) DisableClusterMembership(context.Context, DisableMember return nil } +func (f *fakeRepository) DeleteClusterNode(context.Context, DeleteClusterNodeInput) error { + return nil +} + func (f *fakeRepository) UpsertFabricTestingFlag(_ context.Context, input UpsertFabricTestingFlagInput) (FabricTestingFlag, error) { return FabricTestingFlag{ ScopeType: input.ScopeType, @@ -2178,15 +7851,28 @@ func (f *fakeRepository) GetEffectiveNodeTestingFlags(context.Context, string, s } func (f *fakeRepository) RecordNodeTelemetry(_ context.Context, input RecordNodeTelemetryInput) (NodeTelemetryObservation, error) { - return NodeTelemetryObservation{ - ClusterID: input.ClusterID, - NodeID: input.NodeID, - Payload: input.Payload, - }, nil + item := NodeTelemetryObservation{ + ClusterID: input.ClusterID, + NodeID: input.NodeID, + Payload: input.Payload, + ObservedAt: input.ObservedAt, + } + if item.ObservedAt.IsZero() { + item.ObservedAt = time.Now().UTC() + } + if f.nodeTelemetry == nil { + f.nodeTelemetry = map[string][]NodeTelemetryObservation{} + } + f.nodeTelemetry[input.NodeID] = append([]NodeTelemetryObservation{item}, f.nodeTelemetry[input.NodeID]...) 
+ return item, nil } -func (f *fakeRepository) ListNodeTelemetry(context.Context, string, string, int) ([]NodeTelemetryObservation, error) { - return nil, nil +func (f *fakeRepository) ListNodeTelemetry(_ context.Context, _ string, nodeID string, limit int) ([]NodeTelemetryObservation, error) { + items := append([]NodeTelemetryObservation{}, f.nodeTelemetry[nodeID]...) + if limit > 0 && len(items) > limit { + items = items[:limit] + } + return items, nil } func (f *fakeRepository) SetDesiredWorkload(_ context.Context, input SetDesiredWorkloadInput) (NodeWorkloadDesiredState, error) { @@ -2201,8 +7887,14 @@ func (f *fakeRepository) SetDesiredWorkload(_ context.Context, input SetDesiredW }, nil } -func (f *fakeRepository) ListDesiredWorkloads(context.Context, string, string) ([]NodeWorkloadDesiredState, error) { - return nil, nil +func (f *fakeRepository) ListDesiredWorkloads(_ context.Context, clusterID, nodeID string) ([]NodeWorkloadDesiredState, error) { + out := []NodeWorkloadDesiredState{} + for _, item := range f.desiredWorkloads { + if item.ClusterID == clusterID && item.NodeID == nodeID { + out = append(out, item) + } + } + return out, nil } func (f *fakeRepository) ReportWorkloadStatus(_ context.Context, input ReportWorkloadStatusInput) (NodeWorkloadStatus, error) { @@ -2235,20 +7927,472 @@ func (f *fakeRepository) ListMeshLinks(context.Context, string) ([]MeshLinkObser } func (f *fakeRepository) CreateRouteIntent(_ context.Context, input CreateRouteIntentInput) (MeshRouteIntent, error) { - return MeshRouteIntent{ + f.createdRouteIntents = append(f.createdRouteIntents, input) + item := MeshRouteIntent{ + ID: "route-intent-" + strconv.Itoa(len(f.createdRouteIntents)), ClusterID: input.ClusterID, SourceSelector: input.SourceSelector, DestinationSelector: input.DestinationSelector, ServiceClass: input.ServiceClass, Priority: input.Priority, + Status: "active", Policy: input.Policy, - }, nil + UpdatedAt: time.Now().UTC(), + } + f.routeIntents = 
append(f.routeIntents, item) + return item, nil } func (f *fakeRepository) ListRouteIntents(context.Context, string) ([]MeshRouteIntent, error) { return f.routeIntents, nil } +func (f *fakeRepository) ExpireRouteIntent(_ context.Context, input RouteIntentLifecycleInput, expiresAt time.Time) (MeshRouteIntent, error) { + for index, item := range f.routeIntents { + if item.ClusterID != input.ClusterID || item.ID != input.RouteIntentID { + continue + } + var policy map[string]any + _ = json.Unmarshal(item.Policy, &policy) + if policy == nil { + policy = map[string]any{} + } + policy["expires_at"] = expiresAt.UTC().Format(time.RFC3339Nano) + policy["operator_expire"] = map[string]any{ + "expired_at": expiresAt.UTC().Format(time.RFC3339Nano), + "reason": input.Reason, + } + item.Policy = mustJSONRaw(policy) + item.UpdatedAt = expiresAt.UTC() + f.routeIntents[index] = item + return item, nil + } + return MeshRouteIntent{}, pgx.ErrNoRows +} + +func (f *fakeRepository) DisableRouteIntent(_ context.Context, input RouteIntentLifecycleInput) (MeshRouteIntent, error) { + for index, item := range f.routeIntents { + if item.ClusterID != input.ClusterID || item.ID != input.RouteIntentID { + continue + } + var policy map[string]any + _ = json.Unmarshal(item.Policy, &policy) + if policy == nil { + policy = map[string]any{} + } + policy["operator_disable"] = map[string]any{ + "reason": input.Reason, + } + item.Status = "disabled" + item.Policy = mustJSONRaw(policy) + item.UpdatedAt = time.Now().UTC() + f.routeIntents[index] = item + return item, nil + } + return MeshRouteIntent{}, pgx.ErrNoRows +} + +func (f *fakeRepository) RecordFabricServiceChannelRouteFeedback(_ context.Context, input RecordFabricServiceChannelRouteFeedbackInput) (FabricServiceChannelRouteFeedbackObservation, error) { + observedAt := input.ObservedAt.UTC() + if observedAt.IsZero() { + observedAt = time.Now().UTC() + } + if input.FeedbackStatus != "healthy" { + for _, current := range f.fabricRouteFeedback { + if 
current.ClusterID != input.ClusterID || current.ReporterNodeID != input.ReporterNodeID || current.RouteID != input.RouteID { + continue + } + if current.RetryCooldownUntil == nil || !current.RetryCooldownUntil.After(observedAt) { + continue + } + input = fabricServiceChannelFeedbackSuppressedByOperatorCooldown(input, *current.RetryCooldownUntil, observedAt) + break + } + } + item := FabricServiceChannelRouteFeedbackObservation{ + ID: "fsc-feedback-" + strconv.Itoa(len(f.fabricRouteFeedback)+1), + ClusterID: input.ClusterID, + ReporterNodeID: input.ReporterNodeID, + RouteID: input.RouteID, + ServiceClass: input.ServiceClass, + FeedbackStatus: input.FeedbackStatus, + ScoreAdjustment: input.ScoreAdjustment, + Reasons: append([]string{}, input.Reasons...), + LastError: input.LastError, + ConsecutiveFailures: input.ConsecutiveFailures, + StallCount: input.StallCount, + LastSendDurationMs: input.LastSendDurationMs, + Payload: input.Payload, + ObservedAt: observedAt, + ExpiresAt: input.ExpiresAt, + RetryCooldownUntil: fabricServiceChannelRetryCooldownUntil(input.Payload), + } + f.fabricRouteFeedback = append(f.fabricRouteFeedback, item) + return item, nil +} + +func (f *fakeRepository) ListFabricServiceChannelRouteFeedback(_ context.Context, input ListFabricServiceChannelRouteFeedbackInput) ([]FabricServiceChannelRouteFeedbackObservation, error) { + now := input.Now.UTC() + out := []FabricServiceChannelRouteFeedbackObservation{} + for _, item := range f.fabricRouteFeedback { + if item.ClusterID != input.ClusterID { + continue + } + if input.ReporterNodeID != "" && item.ReporterNodeID != input.ReporterNodeID { + continue + } + if input.RouteID != "" && item.RouteID != input.RouteID { + continue + } + if input.ServiceClass != "" && item.ServiceClass != input.ServiceClass { + continue + } + if input.FeedbackStatus != "" && item.FeedbackStatus != input.FeedbackStatus { + continue + } + if !input.IncludeExpired && !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(now) { + 
continue + } + out = append(out, item) + } + return out, nil +} + +func (f *fakeRepository) ExpireFabricServiceChannelRouteFeedback(_ context.Context, input ExpireFabricServiceChannelRouteFeedbackInput) (ExpireFabricServiceChannelRouteFeedbackResult, error) { + now := input.Now.UTC() + if now.IsZero() { + now = time.Now().UTC() + } + cooldownUntil := now.Add(fabricServiceChannelOperatorExpireCooldown) + expired := 0 + for idx, item := range f.fabricRouteFeedback { + if item.ClusterID != input.ClusterID || item.RouteID != input.RouteID { + continue + } + if input.ReporterNodeID != "" && item.ReporterNodeID != input.ReporterNodeID { + continue + } + if input.ServiceClass != "" && item.ServiceClass != input.ServiceClass { + continue + } + if !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(now) { + continue + } + f.fabricRouteFeedback[idx].ExpiresAt = now + f.fabricRouteFeedback[idx].RetryCooldownUntil = &cooldownUntil + expired++ + } + return ExpireFabricServiceChannelRouteFeedbackResult{ + ClusterID: input.ClusterID, + ReporterNodeID: input.ReporterNodeID, + RouteID: input.RouteID, + ServiceClass: input.ServiceClass, + ExpiredCount: expired, + ExpiredAt: now, + CooldownUntil: cooldownUntil, + }, nil +} + +func (f *fakeRepository) StoreFabricServiceChannelLease(_ context.Context, input StoreFabricServiceChannelLeaseInput) (FabricServiceChannelLeaseRecord, error) { + lease := input.Lease + storedLease := lease + storedLease.Token.Token = "" + item := FabricServiceChannelLeaseRecord{ + ClusterID: lease.ClusterID, + ChannelID: lease.ChannelID, + TokenHash: input.TokenHash, + ResourceID: lease.ResourceID, + ServiceClass: lease.ServiceClass, + SelectedEntryNodeID: lease.SelectedEntryNodeID, + ExpiresAt: lease.ExpiresAt, + Lease: storedLease, + CreatedAt: lease.IssuedAt, + UpdatedAt: lease.IssuedAt, + } + if f.fabricLeases == nil { + f.fabricLeases = map[string]FabricServiceChannelLeaseRecord{} + } + f.fabricLeases[fabricServiceChannelLeaseCacheKey(lease.ClusterID, 
lease.ChannelID)] = item + return item, nil +} + +func (f *fakeRepository) GetFabricServiceChannelLease(_ context.Context, clusterID, channelID string) (FabricServiceChannelLeaseRecord, error) { + if f.fabricLeases == nil { + return FabricServiceChannelLeaseRecord{}, pgx.ErrNoRows + } + item, ok := f.fabricLeases[fabricServiceChannelLeaseCacheKey(clusterID, channelID)] + if !ok { + return FabricServiceChannelLeaseRecord{}, pgx.ErrNoRows + } + return item, nil +} + +func (f *fakeRepository) ListFabricServiceChannelLeases(_ context.Context, input ListFabricServiceChannelLeasesInput) ([]FabricServiceChannelLeaseRecord, error) { + now := input.Now + if now.IsZero() { + now = time.Now().UTC() + } + out := []FabricServiceChannelLeaseRecord{} + for _, item := range f.fabricLeases { + if item.ClusterID != input.ClusterID { + continue + } + if input.ServiceClass != "" && item.ServiceClass != input.ServiceClass { + continue + } + if input.EntryNodeID != "" && item.SelectedEntryNodeID != input.EntryNodeID { + continue + } + if input.ResourceID != "" && item.ResourceID != input.ResourceID { + continue + } + if !input.IncludeExpired && !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(now) { + continue + } + out = append(out, item) + } + sort.Slice(out, func(i, j int) bool { + return out[i].ExpiresAt.After(out[j].ExpiresAt) + }) + if input.Limit > 0 && len(out) > input.Limit { + out = out[:input.Limit] + } + return out, nil +} + +func (f *fakeRepository) CleanupExpiredFabricServiceChannelLeases(_ context.Context, clusterID string, now time.Time, limit int) (int, error) { + if f.fabricLeases == nil { + return 0, nil + } + if now.IsZero() { + now = time.Now().UTC() + } + if limit <= 0 { + limit = 100 + } + deleted := 0 + for key, item := range f.fabricLeases { + if deleted >= limit { + break + } + if item.ClusterID == clusterID && !item.ExpiresAt.IsZero() && !item.ExpiresAt.After(now) { + delete(f.fabricLeases, key) + deleted++ + } + } + return deleted, nil +} + +func (f 
*fakeRepository) RecordFabricServiceChannelRouteRebuildAttempt(_ context.Context, input RecordFabricServiceChannelRouteRebuildAttemptInput) (FabricServiceChannelRouteRebuildAttempt, error) { + item := FabricServiceChannelRouteRebuildAttempt{ + ID: "fsc-rebuild-" + strconv.Itoa(len(f.fabricRebuildAttempts)+1), + ClusterID: input.ClusterID, + ReporterNodeID: input.ReporterNodeID, + ServiceClass: input.ServiceClass, + RouteID: input.RouteID, + ReplacementRouteID: input.ReplacementRouteID, + RebuildRequestID: input.RebuildRequestID, + RebuildStatus: input.RebuildStatus, + RebuildReason: input.RebuildReason, + RebuildAttempt: input.RebuildAttempt, + DecisionSource: input.DecisionSource, + Outcome: input.Outcome, + Generation: input.Generation, + PolicyFingerprint: input.PolicyFingerprint, + ObservedPolicyFingerprint: input.ObservedPolicyFingerprint, + ObservedRouteGeneration: input.ObservedRouteGeneration, + EffectiveRouteGeneration: input.EffectiveRouteGeneration, + FeedbackStatus: input.FeedbackStatus, + FeedbackObservationID: input.FeedbackObservationID, + FeedbackSource: input.FeedbackSource, + FeedbackObservedAt: input.FeedbackObservedAt, + FeedbackExpiresAt: input.FeedbackExpiresAt, + FeedbackChannelID: input.FeedbackChannelID, + FeedbackResourceID: input.FeedbackResourceID, + FeedbackViolationStatus: input.FeedbackViolationStatus, + FeedbackViolationReason: input.FeedbackViolationReason, + FeedbackScoreAdjustment: input.FeedbackScoreAdjustment, + FeedbackEffectiveScoreAdjustment: input.FeedbackEffectiveScoreAdjustment, + FeedbackReasons: append([]string{}, input.FeedbackReasons...), + LastError: input.LastError, + ConsecutiveFailures: input.ConsecutiveFailures, + StallCount: input.StallCount, + LastSendDurationMs: input.LastSendDurationMs, + QualityWindowSampleCount: input.QualityWindowSampleCount, + QualityWindowFailureCount: input.QualityWindowFailureCount, + QualityWindowDropCount: input.QualityWindowDropCount, + QualityWindowSlowCount: 
input.QualityWindowSlowCount, + OldHops: append([]string{}, input.OldHops...), + ReplacementHops: append([]string{}, input.ReplacementHops...), + Payload: input.Payload, + CreatedAt: time.Now().UTC(), + UpdatedAt: time.Now().UTC(), + } + for idx, current := range f.fabricRebuildAttempts { + if current.ClusterID == item.ClusterID && + current.ReporterNodeID == item.ReporterNodeID && + current.ServiceClass == item.ServiceClass && + current.RouteID == item.RouteID && + current.RebuildRequestID == item.RebuildRequestID { + item.ID = current.ID + item.CreatedAt = current.CreatedAt + f.fabricRebuildAttempts[idx] = item + return item, nil + } + } + f.fabricRebuildAttempts = append(f.fabricRebuildAttempts, item) + return item, nil +} + +func (f *fakeRepository) ListFabricServiceChannelRouteRebuildAttempts(_ context.Context, input ListFabricServiceChannelRouteRebuildAttemptsInput) ([]FabricServiceChannelRouteRebuildAttempt, error) { + out := []FabricServiceChannelRouteRebuildAttempt{} + for _, item := range f.fabricRebuildAttempts { + if item.ClusterID != input.ClusterID { + continue + } + if input.ReporterNodeID != "" && item.ReporterNodeID != input.ReporterNodeID { + continue + } + if input.RouteID != "" && item.RouteID != input.RouteID { + continue + } + if input.ReplacementRouteID != "" && item.ReplacementRouteID != input.ReplacementRouteID { + continue + } + if input.ServiceClass != "" && item.ServiceClass != input.ServiceClass { + continue + } + if input.RebuildStatus != "" && item.RebuildStatus != input.RebuildStatus { + continue + } + if input.RebuildRequestID != "" && item.RebuildRequestID != input.RebuildRequestID { + continue + } + payload := jsonObject(item.Payload) + if input.FeedbackSource != "" && firstNonEmptyString(item.FeedbackSource, jsonString(payload, "feedback_source")) != input.FeedbackSource { + continue + } + if input.FeedbackChannelID != "" && firstNonEmptyString(item.FeedbackChannelID, jsonString(payload, "feedback_channel_id")) != 
input.FeedbackChannelID { + continue + } + if input.FeedbackViolationStatus != "" && firstNonEmptyString(item.FeedbackViolationStatus, jsonString(payload, "feedback_violation_status")) != input.FeedbackViolationStatus { + continue + } + out = append(out, item) + } + return out, nil +} + +func (f *fakeRepository) UpdateFabricServiceChannelRouteRebuildCorrelationSnapshot(_ context.Context, input UpdateFabricServiceChannelRouteRebuildCorrelationSnapshotInput) error { + for idx := range f.fabricRebuildAttempts { + if f.fabricRebuildAttempts[idx].ID != input.ID { + continue + } + f.fabricRebuildAttempts[idx].NodeTransitionStatus = input.NodeTransitionStatus + f.fabricRebuildAttempts[idx].NodeTransitionGeneration = input.NodeTransitionGeneration + f.fabricRebuildAttempts[idx].NodeTransitionObservedAt = input.NodeTransitionObservedAt + f.fabricRebuildAttempts[idx].NodeTransitionMatched = input.NodeTransitionMatched + f.fabricRebuildAttempts[idx].NodeRouteGenerationStatus = input.NodeRouteGenerationStatus + f.fabricRebuildAttempts[idx].NodeRouteGenerationAppliedAt = input.NodeRouteGenerationAppliedAt + f.fabricRebuildAttempts[idx].NodeRouteGenerationWithdrawnAt = input.NodeRouteGenerationWithdrawnAt + f.fabricRebuildAttempts[idx].NodeRouteGenerationMatched = input.NodeRouteGenerationMatched + f.fabricRebuildAttempts[idx].PostRebuildSelectedRouteID = input.PostRebuildSelectedRouteID + f.fabricRebuildAttempts[idx].PostRebuildSendPackets = input.PostRebuildSendPackets + f.fabricRebuildAttempts[idx].PostRebuildSendFailures = input.PostRebuildSendFailures + f.fabricRebuildAttempts[idx].PostRebuildSendFlowPackets = input.PostRebuildSendFlowPackets + f.fabricRebuildAttempts[idx].PostRebuildSendFlowDropped = input.PostRebuildSendFlowDropped + f.fabricRebuildAttempts[idx].GuardStatus = input.GuardStatus + f.fabricRebuildAttempts[idx].GuardSeverity = input.GuardSeverity + f.fabricRebuildAttempts[idx].GuardReason = input.GuardReason + 
f.fabricRebuildAttempts[idx].GuardTransitionDeadlineSeconds = input.GuardTransitionDeadlineSeconds + f.fabricRebuildAttempts[idx].GuardTrafficDeadlineSeconds = input.GuardTrafficDeadlineSeconds + f.fabricRebuildAttempts[idx].Timeline = append([]FabricServiceChannelRouteRebuildTimelineEvent{}, input.Timeline...) + snapshotAt := input.CorrelationSnapshotAt + f.fabricRebuildAttempts[idx].CorrelationSnapshotAt = &snapshotAt + return nil + } + return nil +} + +func (f *fakeRepository) GetFabricServiceChannelSchemaStatus(_ context.Context, input GetFabricServiceChannelSchemaStatusInput) (FabricServiceChannelSchemaStatus, error) { + return FabricServiceChannelSchemaStatus{ + ClusterID: input.ClusterID, + ObservedAt: time.Now().UTC(), + Status: "ready", + Reason: "schema_ready", + RequiredMigration: "000028_fabric_service_channel_rebuild_correlation_snapshot", + RequiredCheckCount: 1, + PassedCheckCount: 1, + RequiredChecks: []FabricServiceChannelSchemaCheck{{ + CheckID: "fabric_service_channel_route_rebuild_attempts", + RelationName: "fabric_service_channel_route_rebuild_attempts", + Status: "present", + RequiredBy: "000028_fabric_service_channel_rebuild_correlation_snapshot", + }}, + }, nil +} + +func (f *fakeRepository) UpsertFabricServiceChannelRouteRebuildAlertSilence(_ context.Context, input SilenceFabricServiceChannelRouteRebuildAlertInput, expiresAt time.Time) (FabricServiceChannelRouteRebuildAlertSilence, error) { + createdAt := input.Now + if createdAt.IsZero() { + createdAt = time.Now().UTC() + } + item := FabricServiceChannelRouteRebuildAlertSilence{ + ID: "fsc-rebuild-silence-" + strconv.Itoa(len(f.fabricRebuildSilences)+1), + ClusterID: input.ClusterID, + IncidentSource: input.IncidentSource, + ChannelID: input.ChannelID, + ReporterNodeID: input.ReporterNodeID, + RouteID: input.RouteID, + DisplayRouteID: input.RouteID, + GuardStatus: input.GuardStatus, + Generation: input.Generation, + Reason: input.Reason, + CreatedByUserID: &input.ActorUserID, + CreatedAt: 
createdAt, + ExpiresAt: expiresAt, + Payload: mustJSONRaw(map[string]any{ + "schema_version": "rap.fabric_service_channel_rebuild_alert_silence.v1", + "reason": input.Reason, + "incident_source": input.IncidentSource, + "channel_id": input.ChannelID, + }), + } + if channelID, routeID, ok := fabricServiceChannelParseAccessDecisionSilenceRouteID(input.RouteID); ok { + item.IncidentSource = firstNonEmptyString(item.IncidentSource, "access_decision") + item.ChannelID = firstNonEmptyString(item.ChannelID, channelID) + item.DisplayRouteID = routeID + } + for idx, current := range f.fabricRebuildSilences { + if current.ClusterID == item.ClusterID && current.ReporterNodeID == item.ReporterNodeID && current.RouteID == item.RouteID && current.GuardStatus == item.GuardStatus && current.Generation == item.Generation { + f.fabricRebuildSilences[idx] = item + return item, nil + } + } + f.fabricRebuildSilences = append(f.fabricRebuildSilences, item) + return item, nil +} + +func (f *fakeRepository) ListFabricServiceChannelRouteRebuildAlertSilences(_ context.Context, clusterID string, now time.Time) ([]FabricServiceChannelRouteRebuildAlertSilence, error) { + out := []FabricServiceChannelRouteRebuildAlertSilence{} + for _, item := range f.fabricRebuildSilences { + if item.ClusterID == clusterID && item.ExpiresAt.After(now) { + out = append(out, item) + } + } + return out, nil +} + +func (f *fakeRepository) DeleteFabricServiceChannelRouteRebuildAlertSilence(_ context.Context, input UnsilenceFabricServiceChannelRouteRebuildAlertInput) (FabricServiceChannelRouteRebuildAlertSilence, error) { + for idx, item := range f.fabricRebuildSilences { + if item.ClusterID == input.ClusterID && item.ID == input.SilenceID { + f.fabricRebuildSilences = append(f.fabricRebuildSilences[:idx], f.fabricRebuildSilences[idx+1:]...) 
+ return item, nil + } + } + return FabricServiceChannelRouteRebuildAlertSilence{}, pgx.ErrNoRows +} + func (f *fakeRepository) ListQoSPolicies(context.Context, string) ([]MeshQoSPolicy, error) { return nil, nil } @@ -2444,6 +8588,13 @@ func (f *fakeRepository) RenewVPNConnectionLease(_ context.Context, input RenewV return VPNConnectionLease{ID: input.LeaseID, VPNConnectionID: input.VPNConnectionID, ClusterID: input.ClusterID, OwnerNodeID: input.OwnerNodeID, FencingToken: input.FencingToken, Status: VPNLeaseStatusActive, ExpiresAt: expiresAt}, nil } +func (f *fakeRepository) RenewNodeVPNAssignmentLease(_ context.Context, input RenewNodeVPNAssignmentLeaseInput, expiresAt time.Time) (VPNConnectionLease, error) { + if f.renewVPNLeaseErr != nil { + return VPNConnectionLease{}, f.renewVPNLeaseErr + } + return VPNConnectionLease{ID: input.LeaseID, VPNConnectionID: input.VPNConnectionID, ClusterID: input.ClusterID, OwnerNodeID: input.OwnerNodeID, Status: VPNLeaseStatusActive, ExpiresAt: expiresAt}, nil +} + func (f *fakeRepository) ReleaseVPNConnectionLease(_ context.Context, input ReleaseVPNConnectionLeaseInput) (VPNConnectionLease, error) { return VPNConnectionLease{ID: input.LeaseID, VPNConnectionID: input.VPNConnectionID, ClusterID: input.ClusterID, OwnerNodeID: input.OwnerNodeID, FencingToken: input.FencingToken, Status: VPNLeaseStatusReleased}, nil } @@ -2495,11 +8646,67 @@ func (f *fakeRepository) ReportNodeVPNAssignmentStatus(_ context.Context, input }, nil } +func (f *fakeRepository) GetVPNClientProfile( + _ context.Context, + clusterID, organizationID, userID, preferredEntryNodeID, preferredExitNodeID string, + generatedAt time.Time, +) (VPNClientProfile, error) { + f.lastPreferredEntryNodeID = preferredEntryNodeID + f.lastPreferredExitNodeID = preferredExitNodeID + if f.vpnClientProfile.SchemaVersion != "" { + profile := f.vpnClientProfile + profile.ClusterID = clusterID + profile.OrganizationID = organizationID + profile.UserID = userID + profile.GeneratedAt = 
generatedAt + return profile, nil + } + return VPNClientProfile{ + SchemaVersion: "rap.vpn_client_profile.v1", + ClusterID: clusterID, + OrganizationID: organizationID, + UserID: userID, + GeneratedAt: generatedAt, + }, nil +} + func (f *fakeRepository) RecordAudit(_ context.Context, event ClusterAuditEvent) error { f.auditEvents = append(f.auditEvents, event) return nil } -func (f *fakeRepository) ListAuditEvents(context.Context, string, int) ([]ClusterAuditEvent, error) { - return nil, nil +func (f *fakeRepository) ListAuditEvents(_ context.Context, input ListAuditEventsInput) ([]ClusterAuditEvent, error) { + limit := input.Limit + if limit <= 0 || limit > 200 { + limit = 100 + } + eventTypes := map[string]bool{} + for _, eventType := range trimStringSlice(input.EventTypes) { + eventTypes[eventType] = true + } + targetTypes := map[string]bool{} + for _, targetType := range trimStringSlice(input.TargetTypes) { + targetTypes[targetType] = true + } + out := []ClusterAuditEvent{} + for _, event := range f.auditEvents { + if event.ClusterID != nil && input.ClusterID != "" && *event.ClusterID != input.ClusterID { + continue + } + if len(eventTypes) > 0 && !eventTypes[event.EventType] { + continue + } + if len(targetTypes) > 0 && !targetTypes[event.TargetType] { + continue + } + out = append(out, event) + if len(out) >= limit { + break + } + } + return out, nil +} + +func stringPtr(value string) *string { + return &value } diff --git a/backend/internal/modules/nodeagent/module.go b/backend/internal/modules/nodeagent/module.go index b9ca479..35994a1 100644 --- a/backend/internal/modules/nodeagent/module.go +++ b/backend/internal/modules/nodeagent/module.go @@ -40,6 +40,9 @@ func (m *Module) Name() string { func (m *Module) RegisterRoutes(router chi.Router) { router.Route("/node-agents", func(r chi.Router) { + r.Post("/docker-install-profile", m.dockerInstallProfile) + r.Post("/windows-install-profile", m.windowsInstallProfile) + r.Post("/linux-install-profile", 
m.linuxInstallProfile) r.Post("/enroll", m.enrollAgent) r.Post("/enrollments/{requestID}/bootstrap", m.bootstrapEnrollment) r.Post("/register", m.registerAgent) @@ -53,6 +56,48 @@ func (m *Module) RegisterRoutes(router chi.Router) { }) } +func (m *Module) linuxInstallProfile(w http.ResponseWriter, r *http.Request) { + var payload clustermodule.DockerInstallProfileRequest + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid linux install profile payload") + return + } + profile, err := m.cluster.GetLinuxInstallProfile(r.Context(), payload) + if err != nil { + httpx.WriteError(w, http.StatusBadRequest, err.Error()) + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"linux_install_profile": profile}) +} + +func (m *Module) windowsInstallProfile(w http.ResponseWriter, r *http.Request) { + var payload clustermodule.DockerInstallProfileRequest + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid windows install profile payload") + return + } + profile, err := m.cluster.GetWindowsInstallProfile(r.Context(), payload) + if err != nil { + httpx.WriteError(w, http.StatusBadRequest, err.Error()) + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"windows_install_profile": profile}) +} + +func (m *Module) dockerInstallProfile(w http.ResponseWriter, r *http.Request) { + var payload clustermodule.DockerInstallProfileRequest + if err := json.NewDecoder(r.Body).Decode(&payload); err != nil { + httpx.WriteError(w, http.StatusBadRequest, "invalid docker install profile payload") + return + } + profile, err := m.cluster.GetDockerInstallProfile(r.Context(), payload) + if err != nil { + httpx.WriteError(w, http.StatusBadRequest, err.Error()) + return + } + httpx.WriteJSON(w, http.StatusOK, map[string]any{"docker_install_profile": profile}) +} + func (m *Module) enrollAgent(w http.ResponseWriter, r *http.Request) { var payload 
struct { ClusterID string `json:"cluster_id"` diff --git a/backend/internal/modules/organization/module.go b/backend/internal/modules/organization/module.go index 9be55e3..d0750f9 100644 --- a/backend/internal/modules/organization/module.go +++ b/backend/internal/modules/organization/module.go @@ -242,6 +242,37 @@ func (m *Module) loadAdminSummary(ctx context.Context, orgID string) (AdminSumma return AdminSummary{}, err } + var vpnConnectionCount int64 + var vpnActiveLeaseCount int64 + var vpnForwardingCount int64 + if err := m.db.QueryRow(ctx, ` + SELECT COUNT(*) + FROM vpn_connections + WHERE organization_id = $1::uuid + AND desired_state = 'enabled' + `, orgID).Scan(&vpnConnectionCount); err != nil { + return AdminSummary{}, err + } + if err := m.db.QueryRow(ctx, ` + SELECT COUNT(*) + FROM vpn_connection_leases l + INNER JOIN vpn_connections vc ON vc.id = l.vpn_connection_id + WHERE vc.organization_id = $1::uuid + AND l.status = 'active' + AND l.expires_at > NOW() + `, orgID).Scan(&vpnActiveLeaseCount); err != nil { + return AdminSummary{}, err + } + if err := m.db.QueryRow(ctx, ` + SELECT COUNT(*) + FROM vpn_connection_assignment_latest_statuses s + INNER JOIN vpn_connections vc ON vc.id = s.vpn_connection_id + WHERE vc.organization_id = $1::uuid + AND COALESCE((s.status_payload->>'packet_forwarding')::boolean, false) + `, orgID).Scan(&vpnForwardingCount); err != nil { + return AdminSummary{}, err + } + auditRows, err := m.db.Query(ctx, ` SELECT ae.id::text, ae.event_type, ae.target_type, ae.target_id, ae.payload, ae.created_at FROM audit_events ae @@ -265,6 +296,12 @@ func (m *Module) loadAdminSummary(ctx context.Context, orgID string) (AdminSumma if err := auditRows.Err(); err != nil { return AdminSummary{}, err } + if services == nil { + services = []ServiceSummary{} + } + if audit == nil { + audit = []OrgAuditEvent{} + } return AdminSummary{ OrganizationID: orgID, @@ -272,14 +309,34 @@ func (m *Module) loadAdminSummary(ctx context.Context, orgID string) 
(AdminSumma ActiveSessionCount: activeSessionCount, ServiceEndpoints: services, ConnectorStatus: map[string]any{ - "vpn": "not_implemented", - "connector": "not_implemented", + "vpn": map[string]any{ + "enabled_connections": vpnConnectionCount, + "active_leases": vpnActiveLeaseCount, + "packet_forwarding": vpnForwardingCount, + "status": connectorStatus(vpnConnectionCount, vpnActiveLeaseCount, vpnForwardingCount), + }, + "rdp": map[string]any{ + "status": "resource_catalog_ready", + }, }, RecentAudit: audit, TopologyExposure: tenantSafeTopologyExposure(), }, nil } +func connectorStatus(enabledConnections, activeLeases, forwarding int64) string { + if enabledConnections == 0 { + return "not_configured" + } + if forwarding > 0 { + return "active" + } + if activeLeases > 0 { + return "gateway_blocked" + } + return "waiting_for_gateway" +} + func tenantSafeTopologyExposure() string { return "tenant_safe_no_core_mesh_topology" } diff --git a/backend/internal/modules/resource/module.go b/backend/internal/modules/resource/module.go index 4c10e70..0a70f07 100644 --- a/backend/internal/modules/resource/module.go +++ b/backend/internal/modules/resource/module.go @@ -49,6 +49,7 @@ type Resource struct { Address string `json:"address"` Protocol string `json:"protocol"` SecretRef *string `json:"secret_ref,omitempty"` + HasSecret bool `json:"has_secret"` CertificateVerificationMode string `json:"certificate_verification_mode"` RenderQualityProfile string `json:"render_quality_profile"` ClipboardMode string `json:"clipboard_mode"` @@ -116,6 +117,7 @@ func (m *Module) listResources(w http.ResponseWriter, r *http.Request) { query := ` SELECT r.id, r.organization_id, r.name, r.address, r.protocol, r.secret_ref, r.certificate_verification_mode, r.metadata, r.created_at, r.updated_at, + EXISTS (SELECT 1 FROM resource_secrets sec WHERE sec.resource_id = r.id) AS has_secret, COALESCE(rp.clipboard_mode, 'disabled') AS clipboard_mode, COALESCE(rp.file_transfer_mode, 'disabled') AS 
file_transfer_mode FROM resources r @@ -500,6 +502,7 @@ func (m *Module) getByID(ctx context.Context, resourceID string) (Resource, erro row := m.db.QueryRow(ctx, ` SELECT r.id, r.organization_id, r.name, r.address, r.protocol, r.secret_ref, r.certificate_verification_mode, r.metadata, r.created_at, r.updated_at, + EXISTS (SELECT 1 FROM resource_secrets sec WHERE sec.resource_id = r.id) AS has_secret, COALESCE(rp.clipboard_mode, 'disabled') AS clipboard_mode, COALESCE(rp.file_transfer_mode, 'disabled') AS file_transfer_mode FROM resources r @@ -555,6 +558,7 @@ func scanResource(row rowScanner) (Resource, error) { &resource.Metadata, &resource.CreatedAt, &resource.UpdatedAt, + &resource.HasSecret, &resource.ClipboardMode, &resource.FileTransferMode, ); err != nil { diff --git a/backend/internal/platform/authority/authority.go b/backend/internal/platform/authority/authority.go index 01a6582..cb3f369 100644 --- a/backend/internal/platform/authority/authority.go +++ b/backend/internal/platform/authority/authority.go @@ -214,9 +214,49 @@ END, prg.granted_at DESC if err := rows.Err(); err != nil { return "", fmt.Errorf("iterate platform role grants: %w", err) } + if bestRole == PlatformRoleUser { + if role, ok, err := strictBootstrappedOwnerFallback(ctx, db, verifier, userID, email); err != nil { + return "", err + } else if ok { + return role, nil + } + return legacyPlatformRole(ctx, db, userID) + } return bestRole, nil } +func strictBootstrappedOwnerFallback(ctx context.Context, db postgresplatform.DBTX, verifier *Verifier, userID, email string) (string, bool, error) { + var role string + var bootstrappedOwnerEmail *string + var authorityState string + var rootFingerprint string + err := db.QueryRow(ctx, ` +SELECT u.platform_role, ia.bootstrapped_owner_email, ia.authority_state, ia.product_root_key_fingerprint +FROM users u +CROSS JOIN installation_authority ia +WHERE u.id = $1::uuid + AND ia.id = 1 +`, userID).Scan(&role, &bootstrappedOwnerEmail, &authorityState, 
&rootFingerprint) + if err != nil { + if errors.Is(err, pgx.ErrNoRows) { + return PlatformRoleUser, false, nil + } + return "", false, fmt.Errorf("query strict bootstrapped owner fallback: %w", err) + } + if bootstrappedOwnerEmail == nil || + !strings.EqualFold(*bootstrappedOwnerEmail, email) || + authorityState != "active" || + rootFingerprint != verifier.RootFingerprint() { + return PlatformRoleUser, false, nil + } + switch role { + case PlatformRoleAdmin, PlatformRoleRecoveryAdmin: + return role, true, nil + default: + return PlatformRoleUser, false, nil + } +} + func legacyPlatformRole(ctx context.Context, db postgresplatform.DBTX, userID string) (string, error) { var role string if err := db.QueryRow(ctx, `SELECT platform_role FROM users WHERE id = $1::uuid`, userID).Scan(&role); err != nil { diff --git a/backend/internal/platform/runtime/app.go b/backend/internal/platform/runtime/app.go index 9327a50..6120eca 100644 --- a/backend/internal/platform/runtime/app.go +++ b/backend/internal/platform/runtime/app.go @@ -2,6 +2,7 @@ package runtime import ( "context" + "encoding/json" "errors" "fmt" "log/slog" @@ -208,6 +209,7 @@ func buildRouter(logger *slog.Logger, modules ...module.Module) http.Handler { w.WriteHeader(http.StatusOK) _, _ = w.Write([]byte("ready")) }) + router.Post("/mesh/v1/health", controlPlaneMeshHealth) router.Route("/api/v1", func(r chi.Router) { for _, mod := range modules { @@ -218,3 +220,34 @@ func buildRouter(logger *slog.Logger, modules ...module.Module) http.Handler { return router } + +func controlPlaneMeshHealth(w http.ResponseWriter, r *http.Request) { + var message struct { + ProtocolVersion string `json:"protocol_version"` + From struct { + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + } `json:"from"` + To struct { + ClusterID string `json:"cluster_id"` + NodeID string `json:"node_id"` + } `json:"to"` + } + if err := json.NewDecoder(r.Body).Decode(&message); err != nil { + http.Error(w, "invalid mesh health 
message", http.StatusBadRequest) + return + } + if message.ProtocolVersion != "mesh-control-v1" || message.From.ClusterID == "" || message.From.NodeID == "" { + http.Error(w, "invalid mesh health message", http.StatusBadRequest) + return + } + w.Header().Set("Content-Type", "application/json") + _ = json.NewEncoder(w).Encode(map[string]any{ + "protocol_version": "mesh-control-v1", + "accepted": true, + "by": map[string]string{ + "cluster_id": message.From.ClusterID, + "node_id": "control-plane-relay", + }, + }) +} diff --git a/backend/migrations/000023_node_update_control_plane.down.sql b/backend/migrations/000023_node_update_control_plane.down.sql new file mode 100644 index 0000000..80c15c4 --- /dev/null +++ b/backend/migrations/000023_node_update_control_plane.down.sql @@ -0,0 +1,8 @@ +DROP INDEX IF EXISTS node_update_status_reports_latest_idx; +DROP INDEX IF EXISTS release_artifacts_match_idx; +DROP INDEX IF EXISTS release_versions_lookup_idx; + +DROP TABLE IF EXISTS node_update_status_reports; +DROP TABLE IF EXISTS node_update_desired_policies; +DROP TABLE IF EXISTS release_artifacts; +DROP TABLE IF EXISTS release_versions; diff --git a/backend/migrations/000023_node_update_control_plane.up.sql b/backend/migrations/000023_node_update_control_plane.up.sql new file mode 100644 index 0000000..fbc5d68 --- /dev/null +++ b/backend/migrations/000023_node_update_control_plane.up.sql @@ -0,0 +1,74 @@ +CREATE TABLE IF NOT EXISTS release_versions ( + id UUID PRIMARY KEY, + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + product TEXT NOT NULL, + version TEXT NOT NULL, + channel TEXT NOT NULL DEFAULT 'dev', + status TEXT NOT NULL DEFAULT 'active', + compatibility JSONB NOT NULL DEFAULT '{}'::jsonb, + changelog TEXT, + created_by_user_id UUID REFERENCES users(id) ON DELETE SET NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + authority_payload JSONB NOT NULL DEFAULT '{}'::jsonb, + authority_signature JSONB NOT NULL DEFAULT '{}'::jsonb, + UNIQUE 
(cluster_id, product, version, channel) +); + +CREATE TABLE IF NOT EXISTS release_artifacts ( + id UUID PRIMARY KEY, + release_id UUID NOT NULL REFERENCES release_versions(id) ON DELETE CASCADE, + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + product TEXT NOT NULL, + version TEXT NOT NULL, + os TEXT NOT NULL, + arch TEXT NOT NULL, + install_type TEXT NOT NULL, + kind TEXT NOT NULL, + url TEXT NOT NULL, + sha256 TEXT NOT NULL, + size_bytes BIGINT NOT NULL DEFAULT 0, + signature TEXT, + metadata JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + UNIQUE (release_id, os, arch, install_type, kind) +); + +CREATE TABLE IF NOT EXISTS node_update_desired_policies ( + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + product TEXT NOT NULL, + channel TEXT NOT NULL DEFAULT 'dev', + target_version TEXT, + strategy TEXT NOT NULL DEFAULT 'manual', + enabled BOOLEAN NOT NULL DEFAULT false, + rollback_allowed BOOLEAN NOT NULL DEFAULT true, + health_window_seconds INTEGER NOT NULL DEFAULT 180, + updated_by_user_id UUID REFERENCES users(id) ON DELETE SET NULL, + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + PRIMARY KEY (cluster_id, node_id, product) +); + +CREATE TABLE IF NOT EXISTS node_update_status_reports ( + id UUID PRIMARY KEY, + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + product TEXT NOT NULL, + current_version TEXT NOT NULL DEFAULT '', + target_version TEXT NOT NULL DEFAULT '', + phase TEXT NOT NULL, + status TEXT NOT NULL, + attempt_id TEXT NOT NULL DEFAULT '', + error_message TEXT, + rollback_version TEXT, + payload JSONB NOT NULL DEFAULT '{}'::jsonb, + observed_at TIMESTAMPTZ NOT NULL DEFAULT NOW() +); + +CREATE INDEX IF NOT EXISTS release_versions_lookup_idx + ON release_versions (cluster_id, product, channel, status, created_at DESC); + 
+CREATE INDEX IF NOT EXISTS release_artifacts_match_idx + ON release_artifacts (release_id, os, arch, install_type); + +CREATE INDEX IF NOT EXISTS node_update_status_reports_latest_idx + ON node_update_status_reports (cluster_id, node_id, product, observed_at DESC); diff --git a/backend/migrations/000024_stale_node_health_view.down.sql b/backend/migrations/000024_stale_node_health_view.down.sql new file mode 100644 index 0000000..89b97e0 --- /dev/null +++ b/backend/migrations/000024_stale_node_health_view.down.sql @@ -0,0 +1,26 @@ +DROP VIEW IF EXISTS cluster_admin_summaries; + +CREATE VIEW cluster_admin_summaries AS +SELECT + c.id AS cluster_id, + c.slug, + c.name, + c.status, + c.region, + COALESCE(cas.authority_state, 'authoritative') AS authority_state, + COALESCE(cas.mutation_mode, 'normal') AS mutation_mode, + ca.key_algorithm AS cluster_key_algorithm, + ca.public_key_fingerprint AS cluster_key_fingerprint, + COUNT(DISTINCT cm.node_id) AS node_count, + COUNT(DISTINCT CASE WHEN n.health_status = 'healthy' THEN n.id END) AS healthy_node_count, + COUNT(DISTINCT CASE WHEN njr.status = 'pending' THEN njr.id END) AS pending_join_count, + COUNT(DISTINCT nra.id) AS active_role_assignment_count, + MAX(n.last_seen_at) AS last_node_seen_at +FROM clusters c +LEFT JOIN cluster_authority_states cas ON cas.cluster_id = c.id +LEFT JOIN cluster_authorities ca ON ca.cluster_id = c.id +LEFT JOIN cluster_memberships cm ON cm.cluster_id = c.id +LEFT JOIN nodes n ON n.id = cm.node_id +LEFT JOIN node_join_requests njr ON njr.cluster_id = c.id +LEFT JOIN node_role_assignments nra ON nra.cluster_id = c.id AND nra.status = 'active' +GROUP BY c.id, c.slug, c.name, c.status, c.region, cas.authority_state, cas.mutation_mode, ca.key_algorithm, ca.public_key_fingerprint; diff --git a/backend/migrations/000024_stale_node_health_view.up.sql b/backend/migrations/000024_stale_node_health_view.up.sql new file mode 100644 index 0000000..1f5c813 --- /dev/null +++ 
b/backend/migrations/000024_stale_node_health_view.up.sql @@ -0,0 +1,29 @@ +DROP VIEW IF EXISTS cluster_admin_summaries; + +CREATE VIEW cluster_admin_summaries AS +SELECT + c.id AS cluster_id, + c.slug, + c.name, + c.status, + c.region, + COALESCE(cas.authority_state, 'authoritative') AS authority_state, + COALESCE(cas.mutation_mode, 'normal') AS mutation_mode, + ca.key_algorithm AS cluster_key_algorithm, + ca.public_key_fingerprint AS cluster_key_fingerprint, + COUNT(DISTINCT cm.node_id) AS node_count, + COUNT(DISTINCT CASE + WHEN n.health_status = 'healthy' + AND n.last_seen_at >= NOW() - '1 minute'::interval THEN n.id + END) AS healthy_node_count, + COUNT(DISTINCT CASE WHEN njr.status = 'pending' THEN njr.id END) AS pending_join_count, + COUNT(DISTINCT nra.id) AS active_role_assignment_count, + MAX(n.last_seen_at) AS last_node_seen_at +FROM clusters c +LEFT JOIN cluster_authority_states cas ON cas.cluster_id = c.id +LEFT JOIN cluster_authorities ca ON ca.cluster_id = c.id +LEFT JOIN cluster_memberships cm ON cm.cluster_id = c.id +LEFT JOIN nodes n ON n.id = cm.node_id +LEFT JOIN node_join_requests njr ON njr.cluster_id = c.id +LEFT JOIN node_role_assignments nra ON nra.cluster_id = c.id AND nra.status = 'active' +GROUP BY c.id, c.slug, c.name, c.status, c.region, cas.authority_state, cas.mutation_mode, ca.key_algorithm, ca.public_key_fingerprint; diff --git a/backend/migrations/000025_fabric_service_channel_route_feedback.down.sql b/backend/migrations/000025_fabric_service_channel_route_feedback.down.sql new file mode 100644 index 0000000..1a771df --- /dev/null +++ b/backend/migrations/000025_fabric_service_channel_route_feedback.down.sql @@ -0,0 +1,2 @@ +DROP TABLE IF EXISTS fabric_service_channel_route_feedback_latest; +DROP TABLE IF EXISTS fabric_service_channel_route_feedback_observations; diff --git a/backend/migrations/000025_fabric_service_channel_route_feedback.up.sql b/backend/migrations/000025_fabric_service_channel_route_feedback.up.sql new file mode 
100644 index 0000000..a6f1531 --- /dev/null +++ b/backend/migrations/000025_fabric_service_channel_route_feedback.up.sql @@ -0,0 +1,45 @@ +CREATE TABLE IF NOT EXISTS fabric_service_channel_route_feedback_observations ( + id UUID PRIMARY KEY, + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + reporter_node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + route_id TEXT NOT NULL, + service_class TEXT NOT NULL, + feedback_status TEXT NOT NULL, + score_adjustment INTEGER NOT NULL DEFAULT 0, + reasons TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[], + last_error TEXT NOT NULL DEFAULT '', + consecutive_failures INTEGER NOT NULL DEFAULT 0, + stall_count INTEGER NOT NULL DEFAULT 0, + last_send_duration_ms BIGINT NOT NULL DEFAULT 0, + payload JSONB NOT NULL DEFAULT '{}'::JSONB, + observed_at TIMESTAMPTZ NOT NULL, + expires_at TIMESTAMPTZ NOT NULL +); + +CREATE INDEX IF NOT EXISTS idx_fsc_route_feedback_observed + ON fabric_service_channel_route_feedback_observations (cluster_id, reporter_node_id, service_class, observed_at DESC); + +CREATE INDEX IF NOT EXISTS idx_fsc_route_feedback_route + ON fabric_service_channel_route_feedback_observations (cluster_id, route_id, observed_at DESC); + +CREATE TABLE IF NOT EXISTS fabric_service_channel_route_feedback_latest ( + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + reporter_node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + route_id TEXT NOT NULL, + observation_id UUID NOT NULL REFERENCES fabric_service_channel_route_feedback_observations(id) ON DELETE CASCADE, + service_class TEXT NOT NULL, + feedback_status TEXT NOT NULL, + score_adjustment INTEGER NOT NULL DEFAULT 0, + reasons TEXT[] NOT NULL DEFAULT ARRAY[]::TEXT[], + last_error TEXT NOT NULL DEFAULT '', + consecutive_failures INTEGER NOT NULL DEFAULT 0, + stall_count INTEGER NOT NULL DEFAULT 0, + last_send_duration_ms BIGINT NOT NULL DEFAULT 0, + payload JSONB NOT NULL DEFAULT '{}'::JSONB, + observed_at TIMESTAMPTZ NOT 
NULL, + expires_at TIMESTAMPTZ NOT NULL, + PRIMARY KEY (cluster_id, reporter_node_id, route_id) +); + +CREATE INDEX IF NOT EXISTS idx_fsc_route_feedback_latest_active + ON fabric_service_channel_route_feedback_latest (cluster_id, reporter_node_id, service_class, expires_at DESC); diff --git a/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.down.sql b/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.down.sql new file mode 100644 index 0000000..79493ba --- /dev/null +++ b/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.down.sql @@ -0,0 +1 @@ +DROP TABLE IF EXISTS fabric_service_channel_route_rebuild_attempts; diff --git a/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.up.sql b/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.up.sql new file mode 100644 index 0000000..a52f710 --- /dev/null +++ b/backend/migrations/000026_fabric_service_channel_route_rebuild_attempts.up.sql @@ -0,0 +1,43 @@ +CREATE TABLE IF NOT EXISTS fabric_service_channel_route_rebuild_attempts ( + id UUID PRIMARY KEY, + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + reporter_node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + service_class TEXT NOT NULL, + route_id TEXT NOT NULL, + replacement_route_id TEXT NOT NULL DEFAULT '', + rebuild_request_id TEXT NOT NULL, + rebuild_status TEXT NOT NULL, + rebuild_reason TEXT NOT NULL DEFAULT '', + rebuild_attempt INTEGER NOT NULL DEFAULT 0, + decision_source TEXT NOT NULL, + outcome TEXT NOT NULL, + generation TEXT NOT NULL DEFAULT '', + policy_fingerprint TEXT NOT NULL DEFAULT '', + observed_policy_fingerprint TEXT NOT NULL DEFAULT '', + observed_route_generation TEXT NOT NULL DEFAULT '', + effective_route_generation TEXT NOT NULL DEFAULT '', + feedback_status TEXT NOT NULL DEFAULT '', + feedback_score_adjustment INTEGER NOT NULL DEFAULT 0, + feedback_effective_score_adjustment INTEGER NOT NULL DEFAULT 0, 
+ feedback_reasons TEXT[] NOT NULL DEFAULT '{}', + last_error TEXT NOT NULL DEFAULT '', + consecutive_failures INTEGER NOT NULL DEFAULT 0, + stall_count INTEGER NOT NULL DEFAULT 0, + last_send_duration_ms BIGINT NOT NULL DEFAULT 0, + quality_window_sample_count INTEGER NOT NULL DEFAULT 0, + quality_window_failure_count INTEGER NOT NULL DEFAULT 0, + quality_window_drop_count INTEGER NOT NULL DEFAULT 0, + quality_window_slow_count INTEGER NOT NULL DEFAULT 0, + old_hops TEXT[] NOT NULL DEFAULT '{}', + replacement_hops TEXT[] NOT NULL DEFAULT '{}', + payload JSONB NOT NULL DEFAULT '{}'::jsonb, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + UNIQUE (cluster_id, reporter_node_id, service_class, route_id, rebuild_request_id) +); + +CREATE INDEX IF NOT EXISTS idx_fsc_rebuild_attempts_cluster_reporter_updated + ON fabric_service_channel_route_rebuild_attempts (cluster_id, reporter_node_id, updated_at DESC); + +CREATE INDEX IF NOT EXISTS idx_fsc_rebuild_attempts_cluster_route_updated + ON fabric_service_channel_route_rebuild_attempts (cluster_id, route_id, updated_at DESC); diff --git a/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.down.sql b/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.down.sql new file mode 100644 index 0000000..9def5d4 --- /dev/null +++ b/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.down.sql @@ -0,0 +1 @@ +DROP TABLE IF EXISTS fabric_service_channel_rebuild_alert_silences; diff --git a/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.up.sql b/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.up.sql new file mode 100644 index 0000000..13728b3 --- /dev/null +++ b/backend/migrations/000027_fabric_service_channel_rebuild_alert_silences.up.sql @@ -0,0 +1,17 @@ +CREATE TABLE IF NOT EXISTS fabric_service_channel_rebuild_alert_silences ( + id UUID PRIMARY KEY DEFAULT gen_random_uuid(), + 
cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + reporter_node_id UUID NOT NULL REFERENCES nodes(id) ON DELETE CASCADE, + route_id TEXT NOT NULL, + guard_status TEXT NOT NULL, + generation TEXT NOT NULL DEFAULT '', + reason TEXT NOT NULL DEFAULT '', + created_by_user_id UUID REFERENCES users(id) ON DELETE SET NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + expires_at TIMESTAMPTZ NOT NULL, + payload JSONB NOT NULL DEFAULT '{}'::jsonb, + UNIQUE (cluster_id, reporter_node_id, route_id, guard_status, generation) +); + +CREATE INDEX IF NOT EXISTS idx_fsc_rebuild_alert_silences_active + ON fabric_service_channel_rebuild_alert_silences (cluster_id, expires_at DESC); diff --git a/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.down.sql b/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.down.sql new file mode 100644 index 0000000..e9059e9 --- /dev/null +++ b/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.down.sql @@ -0,0 +1,23 @@ +DROP INDEX IF EXISTS idx_fsc_rebuild_attempts_cluster_guard_updated; + +ALTER TABLE fabric_service_channel_route_rebuild_attempts + DROP COLUMN IF EXISTS correlation_snapshot_at, + DROP COLUMN IF EXISTS correlation_timeline, + DROP COLUMN IF EXISTS guard_traffic_deadline_seconds, + DROP COLUMN IF EXISTS guard_transition_deadline_seconds, + DROP COLUMN IF EXISTS guard_reason, + DROP COLUMN IF EXISTS guard_severity, + DROP COLUMN IF EXISTS guard_status, + DROP COLUMN IF EXISTS post_rebuild_send_flow_dropped, + DROP COLUMN IF EXISTS post_rebuild_send_flow_packets, + DROP COLUMN IF EXISTS post_rebuild_send_failures, + DROP COLUMN IF EXISTS post_rebuild_send_packets, + DROP COLUMN IF EXISTS post_rebuild_selected_route_id, + DROP COLUMN IF EXISTS node_route_generation_matched, + DROP COLUMN IF EXISTS node_route_generation_withdrawn_at, + DROP COLUMN IF EXISTS node_route_generation_applied_at, + DROP COLUMN IF EXISTS 
node_route_generation_status, + DROP COLUMN IF EXISTS node_transition_matched, + DROP COLUMN IF EXISTS node_transition_observed_at, + DROP COLUMN IF EXISTS node_transition_generation, + DROP COLUMN IF EXISTS node_transition_status; diff --git a/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.up.sql b/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.up.sql new file mode 100644 index 0000000..1a44d5a --- /dev/null +++ b/backend/migrations/000028_fabric_service_channel_rebuild_correlation_snapshot.up.sql @@ -0,0 +1,24 @@ +ALTER TABLE fabric_service_channel_route_rebuild_attempts + ADD COLUMN IF NOT EXISTS node_transition_status TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_transition_generation TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_transition_observed_at TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_transition_matched BOOLEAN NOT NULL DEFAULT FALSE, + ADD COLUMN IF NOT EXISTS node_route_generation_status TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_route_generation_applied_at TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_route_generation_withdrawn_at TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS node_route_generation_matched BOOLEAN NOT NULL DEFAULT FALSE, + ADD COLUMN IF NOT EXISTS post_rebuild_selected_route_id TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS post_rebuild_send_packets BIGINT NOT NULL DEFAULT 0, + ADD COLUMN IF NOT EXISTS post_rebuild_send_failures BIGINT NOT NULL DEFAULT 0, + ADD COLUMN IF NOT EXISTS post_rebuild_send_flow_packets BIGINT NOT NULL DEFAULT 0, + ADD COLUMN IF NOT EXISTS post_rebuild_send_flow_dropped BIGINT NOT NULL DEFAULT 0, + ADD COLUMN IF NOT EXISTS guard_status TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS guard_severity TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS guard_reason TEXT NOT NULL DEFAULT '', + ADD COLUMN IF NOT EXISTS guard_transition_deadline_seconds BIGINT NOT NULL 
DEFAULT 0, + ADD COLUMN IF NOT EXISTS guard_traffic_deadline_seconds BIGINT NOT NULL DEFAULT 0, + ADD COLUMN IF NOT EXISTS correlation_timeline JSONB NOT NULL DEFAULT '[]'::jsonb, + ADD COLUMN IF NOT EXISTS correlation_snapshot_at TIMESTAMPTZ; + +CREATE INDEX IF NOT EXISTS idx_fsc_rebuild_attempts_cluster_guard_updated + ON fabric_service_channel_route_rebuild_attempts (cluster_id, guard_severity, guard_status, updated_at DESC); diff --git a/backend/migrations/000029_fabric_service_channel_leases.down.sql b/backend/migrations/000029_fabric_service_channel_leases.down.sql new file mode 100644 index 0000000..4c05fb8 --- /dev/null +++ b/backend/migrations/000029_fabric_service_channel_leases.down.sql @@ -0,0 +1 @@ +DROP TABLE IF EXISTS fabric_service_channel_leases; diff --git a/backend/migrations/000029_fabric_service_channel_leases.up.sql b/backend/migrations/000029_fabric_service_channel_leases.up.sql new file mode 100644 index 0000000..dda0eff --- /dev/null +++ b/backend/migrations/000029_fabric_service_channel_leases.up.sql @@ -0,0 +1,19 @@ +CREATE TABLE IF NOT EXISTS fabric_service_channel_leases ( + cluster_id UUID NOT NULL REFERENCES clusters(id) ON DELETE CASCADE, + channel_id UUID NOT NULL, + token_hash TEXT NOT NULL, + resource_id TEXT NOT NULL DEFAULT '', + service_class TEXT NOT NULL, + selected_entry_node_id UUID NULL REFERENCES nodes(id) ON DELETE SET NULL, + expires_at TIMESTAMPTZ NOT NULL, + lease JSONB NOT NULL, + created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), + PRIMARY KEY (cluster_id, channel_id) +); + +CREATE INDEX IF NOT EXISTS fabric_service_channel_leases_cluster_expires_idx + ON fabric_service_channel_leases(cluster_id, expires_at); + +CREATE INDEX IF NOT EXISTS fabric_service_channel_leases_entry_idx + ON fabric_service_channel_leases(cluster_id, selected_entry_node_id, expires_at); diff --git a/backend/migrations/000030_mesh_route_service_channel_classes.down.sql 
b/backend/migrations/000030_mesh_route_service_channel_classes.down.sql new file mode 100644 index 0000000..c63ef28 --- /dev/null +++ b/backend/migrations/000030_mesh_route_service_channel_classes.down.sql @@ -0,0 +1,17 @@ +DELETE FROM mesh_qos_policies +WHERE service_class IN ('remote_workspace', 'video') + AND metadata->>'fabric_service_channel' = 'true'; + +ALTER TABLE mesh_route_intents + DROP CONSTRAINT IF EXISTS mesh_route_intents_service_class_check; + +ALTER TABLE mesh_route_intents + ADD CONSTRAINT mesh_route_intents_service_class_check + CHECK (service_class IN ('input', 'control', 'synthetic', 'render', 'clipboard', 'file_transfer', 'vpn_packets', 'telemetry')); + +ALTER TABLE mesh_qos_policies + DROP CONSTRAINT IF EXISTS mesh_qos_policies_service_class_check; + +ALTER TABLE mesh_qos_policies + ADD CONSTRAINT mesh_qos_policies_service_class_check + CHECK (service_class IN ('input', 'control', 'synthetic', 'render', 'clipboard', 'file_transfer', 'vpn_packets', 'telemetry')); diff --git a/backend/migrations/000030_mesh_route_service_channel_classes.up.sql b/backend/migrations/000030_mesh_route_service_channel_classes.up.sql new file mode 100644 index 0000000..a80ae17 --- /dev/null +++ b/backend/migrations/000030_mesh_route_service_channel_classes.up.sql @@ -0,0 +1,29 @@ +ALTER TABLE mesh_route_intents + DROP CONSTRAINT IF EXISTS mesh_route_intents_service_class_check; + +ALTER TABLE mesh_route_intents + ADD CONSTRAINT mesh_route_intents_service_class_check + CHECK (service_class IN ('input', 'control', 'synthetic', 'render', 'clipboard', 'file_transfer', 'vpn_packets', 'remote_workspace', 'video', 'telemetry')); + +ALTER TABLE mesh_qos_policies + DROP CONSTRAINT IF EXISTS mesh_qos_policies_service_class_check; + +ALTER TABLE mesh_qos_policies + ADD CONSTRAINT mesh_qos_policies_service_class_check + CHECK (service_class IN ('input', 'control', 'synthetic', 'render', 'clipboard', 'file_transfer', 'vpn_packets', 'remote_workspace', 'video', 'telemetry')); + 
+INSERT INTO mesh_qos_policies ( + cluster_id, service_class, priority, reliability_mode, drop_policy, bandwidth_policy, metadata +) +SELECT c.id, 'remote_workspace', 20, 'adaptive', 'adaptive', '{}'::jsonb, + '{"default":true,"fabric_service_channel":true,"interactive":true}'::jsonb +FROM clusters c +ON CONFLICT (cluster_id, service_class) DO NOTHING; + +INSERT INTO mesh_qos_policies ( + cluster_id, service_class, priority, reliability_mode, drop_policy, bandwidth_policy, metadata +) +SELECT c.id, 'video', 40, 'adaptive', 'adaptive', '{}'::jsonb, + '{"default":true,"fabric_service_channel":true,"adaptive":true}'::jsonb +FROM clusters c +ON CONFLICT (cluster_id, service_class) DO NOTHING; diff --git a/clients/android/.gradle/9.5.0/checksums/checksums.lock b/clients/android/.gradle/9.5.0/checksums/checksums.lock new file mode 100644 index 0000000..512ecf3 Binary files /dev/null and b/clients/android/.gradle/9.5.0/checksums/checksums.lock differ diff --git a/clients/android/.gradle/9.5.0/checksums/md5-checksums.bin b/clients/android/.gradle/9.5.0/checksums/md5-checksums.bin new file mode 100644 index 0000000..86f9a49 Binary files /dev/null and b/clients/android/.gradle/9.5.0/checksums/md5-checksums.bin differ diff --git a/clients/android/.gradle/9.5.0/checksums/sha1-checksums.bin b/clients/android/.gradle/9.5.0/checksums/sha1-checksums.bin new file mode 100644 index 0000000..a6ffb8c Binary files /dev/null and b/clients/android/.gradle/9.5.0/checksums/sha1-checksums.bin differ diff --git a/clients/android/.gradle/9.5.0/executionHistory/executionHistory.bin b/clients/android/.gradle/9.5.0/executionHistory/executionHistory.bin new file mode 100644 index 0000000..5f6654a Binary files /dev/null and b/clients/android/.gradle/9.5.0/executionHistory/executionHistory.bin differ diff --git a/clients/android/.gradle/9.5.0/executionHistory/executionHistory.lock b/clients/android/.gradle/9.5.0/executionHistory/executionHistory.lock new file mode 100644 index 0000000..892f515 
Binary files /dev/null and b/clients/android/.gradle/9.5.0/executionHistory/executionHistory.lock differ diff --git a/clients/android/.gradle/9.5.0/fileChanges/last-build.bin b/clients/android/.gradle/9.5.0/fileChanges/last-build.bin new file mode 100644 index 0000000..f76dd23 Binary files /dev/null and b/clients/android/.gradle/9.5.0/fileChanges/last-build.bin differ diff --git a/clients/android/.gradle/9.5.0/fileHashes/fileHashes.bin b/clients/android/.gradle/9.5.0/fileHashes/fileHashes.bin new file mode 100644 index 0000000..5482412 Binary files /dev/null and b/clients/android/.gradle/9.5.0/fileHashes/fileHashes.bin differ diff --git a/clients/android/.gradle/9.5.0/fileHashes/fileHashes.lock b/clients/android/.gradle/9.5.0/fileHashes/fileHashes.lock new file mode 100644 index 0000000..ec4b7f5 Binary files /dev/null and b/clients/android/.gradle/9.5.0/fileHashes/fileHashes.lock differ diff --git a/clients/android/.gradle/9.5.0/fileHashes/resourceHashesCache.bin b/clients/android/.gradle/9.5.0/fileHashes/resourceHashesCache.bin new file mode 100644 index 0000000..1dded69 Binary files /dev/null and b/clients/android/.gradle/9.5.0/fileHashes/resourceHashesCache.bin differ diff --git a/clients/android/.gradle/9.5.0/gc.properties b/clients/android/.gradle/9.5.0/gc.properties new file mode 100644 index 0000000..e69de29 diff --git a/clients/android/.gradle/buildOutputCleanup/buildOutputCleanup.lock b/clients/android/.gradle/buildOutputCleanup/buildOutputCleanup.lock new file mode 100644 index 0000000..c6dcc88 Binary files /dev/null and b/clients/android/.gradle/buildOutputCleanup/buildOutputCleanup.lock differ diff --git a/clients/android/.gradle/buildOutputCleanup/cache.properties b/clients/android/.gradle/buildOutputCleanup/cache.properties new file mode 100644 index 0000000..971d8fa --- /dev/null +++ b/clients/android/.gradle/buildOutputCleanup/cache.properties @@ -0,0 +1,2 @@ +#Fri May 01 13:10:35 MSK 2026 +gradle.version=9.5.0 diff --git 
a/clients/android/.gradle/buildOutputCleanup/outputFiles.bin b/clients/android/.gradle/buildOutputCleanup/outputFiles.bin new file mode 100644 index 0000000..f0bab02 Binary files /dev/null and b/clients/android/.gradle/buildOutputCleanup/outputFiles.bin differ diff --git a/clients/android/.gradle/vcs-1/gc.properties b/clients/android/.gradle/vcs-1/gc.properties new file mode 100644 index 0000000..e69de29 diff --git a/clients/android/README.md b/clients/android/README.md new file mode 100644 index 0000000..13ee83e --- /dev/null +++ b/clients/android/README.md @@ -0,0 +1,41 @@ +# RAP Android VPN + +This is the Android client for the experimental RAP VPN service. + +Implemented now: + +- login through `/auth/login`; +- trusted-device reconnect through `/auth/refresh` without retyping the password + while the device session is valid; +- load organization-scoped VPN client profile from `/clusters/{clusterID}/vpn/client-profile`; +- request Android VPN permission and create a `VpnService` TUN interface; +- relay TUN packets through the Control Plane HTTP packet relay to the active + `home-1` gateway lease; +- user-facing HOME-first screen: connect/disconnect is primary, while backend, + cluster, organization, login, and password are kept in the settings dialog; +- saved connection settings in app preferences so repeat connects do not require + retyping the profile; +- encrypted refresh-token storage through Android Keystore. If the trusted + device session is revoked or expires, the app asks for the password once and + then rotates the device keys/profile again. + +This is still a lab runtime, not a production WireGuard/IPsec implementation. +The active Linux gateway node must be able to create `/dev/net/tun`, run `ip`, +`sysctl`, and `iptables`, and enable NAT for `10.77.0.0/24`. 
+ +Build from this repository on Windows: + +```powershell +$env:ANDROID_HOME="C:\Android\Sdk" +$env:ANDROID_SDK_ROOT="C:\Android\Sdk" +pwsh -ExecutionPolicy Bypass -File ..\..\scripts\android\build-android-apk.ps1 +adb install -r app/build/outputs/apk/debug/app-debug.apk +``` + +Or run directly from the project: + +```powershell +$env:ANDROID_HOME="C:\Android\Sdk" +$env:ANDROID_SDK_ROOT="C:\Android\Sdk" +gradle assembleDebug +``` diff --git a/clients/android/app/build.gradle b/clients/android/app/build.gradle new file mode 100644 index 0000000..8994908 --- /dev/null +++ b/clients/android/app/build.gradle @@ -0,0 +1,49 @@ +plugins { + id "com.android.application" +} + +android { + namespace "su.cin.rapvpn" + compileSdk 35 + + signingConfigs { + release { + // For the test environment, use the debug certificate as a fallback so the APK can always be installed. + // Once a separate keystore exists for prod/release, replace it in this block. + initWith signingConfigs.debug + } + } + + buildFeatures { + buildConfig true + } + + def normalizeGradleString = { value -> + return (value == null ? 
"" : value.toString()).replace("\\", "\\\\").replace("\"", "\\\"") + } + + def defaultBackendUrl = project.findProperty("RAP_ANDROID_DEFAULT_BACKEND_URL") ?: "http://vpn.cin.su:19191/api/v1" + def defaultClusterId = project.findProperty("RAP_ANDROID_DEFAULT_CLUSTER_ID") ?: "cfc0743d-d960-49fb-9de8-96e063d5e4aa" + def defaultOrganizationId = project.findProperty("RAP_ANDROID_DEFAULT_ORGANIZATION_ID") ?: "125ff8b2-5ac1-4406-9bbb-ebbe18f7c7ed" + + defaultConfig { + applicationId "su.cin.rapvpn" + minSdk 26 + targetSdk 35 + versionCode 159 + versionName "0.2.159" + buildConfigField "String", "DEFAULT_BACKEND_URL", "\"${normalizeGradleString(defaultBackendUrl)}\"" + buildConfigField "String", "DEFAULT_CLUSTER_ID", "\"${normalizeGradleString(defaultClusterId)}\"" + buildConfigField "String", "DEFAULT_ORGANIZATION_ID", "\"${normalizeGradleString(defaultOrganizationId)}\"" + } + + buildTypes { + release { + signingConfig signingConfigs.release + } + } +} + +dependencies { + implementation "com.squareup.okhttp3:okhttp:5.3.2" +} diff --git a/clients/android/app/src/main/AndroidManifest.xml b/clients/android/app/src/main/AndroidManifest.xml new file mode 100644 index 0000000..7a7d9c1 --- /dev/null +++ b/clients/android/app/src/main/AndroidManifest.xml @@ -0,0 +1,69 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/FabricServiceChannel.java b/clients/android/app/src/main/java/su/cin/rapvpn/FabricServiceChannel.java new file mode 100644 index 0000000..933be35 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/FabricServiceChannel.java @@ -0,0 +1,140 @@ +package su.cin.rapvpn; + +import android.util.Base64; + +import org.json.JSONObject; + +import java.net.URI; +import java.nio.charset.StandardCharsets; + +import okhttp3.Request; + +final class FabricServiceChannel { + final boolean enabled; + final String channelId; + final String token; + final String pathTemplate; + 
final String webSocketPathTemplate; + final String authorityPayloadHeader; + final String authoritySignatureHeader; + final String serviceClass; + final String channelClass; + + FabricServiceChannel() { + this(false, "", "", "", "", "", "", "", ""); + } + + private FabricServiceChannel( + boolean enabled, + String channelId, + String token, + String pathTemplate, + String webSocketPathTemplate, + String authorityPayloadHeader, + String authoritySignatureHeader, + String serviceClass, + String channelClass) { + this.enabled = enabled; + this.channelId = safe(channelId); + this.token = safe(token); + this.pathTemplate = safe(pathTemplate); + this.webSocketPathTemplate = safe(webSocketPathTemplate); + this.authorityPayloadHeader = safe(authorityPayloadHeader); + this.authoritySignatureHeader = safe(authoritySignatureHeader); + this.serviceClass = safe(serviceClass); + this.channelClass = safe(channelClass); + } + + static FabricServiceChannel fromLease(JSONObject lease) { + if (lease == null) { + return new FabricServiceChannel(); + } + JSONObject tokenObject = lease.optJSONObject("token"); + JSONObject entryHttp = lease.optJSONObject("entry_http"); + String channelId = lease.optString("channel_id", ""); + String token = tokenObject == null ? "" : tokenObject.optString("token", ""); + String pathTemplate = entryHttp == null ? "" : entryHttp.optString("path_template", ""); + String wsTemplate = entryHttp == null ? "" : entryHttp.optString("websocket_path_template", ""); + String serviceClass = lease.optString("service_class", "vpn_packets"); + String channelClass = "vpn_packet"; + JSONObject authoritySignature = lease.optJSONObject("authority_signature"); + JSONObject authorityPayload = lease.optJSONObject("authority_payload"); + String payloadHeader = authorityPayload == null ? "" : encodeHeader(authorityPayload.toString()); + String signatureHeader = authoritySignature == null ? 
"" : encodeHeader(authoritySignature.toString()); + boolean enabled = !channelId.isEmpty() && token.startsWith("rap_fsc_") && !pathTemplate.isEmpty(); + return new FabricServiceChannel(enabled, channelId, token, pathTemplate, wsTemplate, payloadHeader, signatureHeader, serviceClass, channelClass); + } + + String packetPath(String clusterId, String vpnConnectionId, boolean webSocket) { + return packetPathForBase("", clusterId, vpnConnectionId, webSocket); + } + + String packetPathForBase(String baseUrl, String clusterId, String vpnConnectionId, boolean webSocket) { + String template = webSocket && !webSocketPathTemplate.isEmpty() ? webSocketPathTemplate : pathTemplate; + if (!enabled || template.isEmpty()) { + return ""; + } + String path = template + .replace("{cluster_id}", safe(clusterId)) + .replace("{clusterID}", safe(clusterId)) + .replace("{channel_id}", channelId) + .replace("{channelID}", channelId) + .replace("{resource_id}", safe(vpnConnectionId)) + .replace("{resourceID}", safe(vpnConnectionId)) + .replace("{vpn_connection_id}", safe(vpnConnectionId)) + .replace("{vpnConnectionID}", safe(vpnConnectionId)); + path = path.startsWith("/") ? path : "/" + path; + String basePath = ""; + try { + URI uri = URI.create(baseUrl == null ? "" : baseUrl); + basePath = uri.getRawPath() == null ? 
"" : trimRight(uri.getRawPath()); + } catch (Exception ignored) { + } + if (basePath.endsWith("/api/v1") && path.startsWith("/api/v1/")) { + path = path.substring("/api/v1".length()); + } + return path; + } + + Request.Builder applyHeaders(Request.Builder builder) { + if (!enabled || builder == null) { + return builder; + } + builder.header("X-RAP-Service-Channel-Token", token); + builder.header("X-RAP-Fabric-Channel-ID", channelId); + if (!serviceClass.isEmpty()) { + builder.header("X-RAP-Service-Class", serviceClass); + } + if (!channelClass.isEmpty()) { + builder.header("X-RAP-Channel-Class", channelClass); + } + if (!authorityPayloadHeader.isEmpty()) { + builder.header("X-RAP-Service-Channel-Authority-Payload", authorityPayloadHeader); + } + if (!authoritySignatureHeader.isEmpty()) { + builder.header("X-RAP-Service-Channel-Authority-Signature", authoritySignatureHeader); + } + return builder; + } + + private static String encodeHeader(String value) { + if (value == null || value.isEmpty()) { + return ""; + } + return Base64.encodeToString(value.getBytes(StandardCharsets.UTF_8), Base64.URL_SAFE | Base64.NO_WRAP | Base64.NO_PADDING); + } + + private static String safe(String value) { + return value == null ? 
"" : value.trim(); + } + + private static String trimRight(String value) { + if (value == null) { + return ""; + } + while (value.endsWith("/")) { + value = value.substring(0, value.length() - 1); + } + return value; + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/MainActivity.java b/clients/android/app/src/main/java/su/cin/rapvpn/MainActivity.java new file mode 100644 index 0000000..9d0fef6 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/MainActivity.java @@ -0,0 +1,946 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.app.AlertDialog; +import android.content.SharedPreferences; +import android.content.Intent; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.VpnService; +import android.os.Bundle; +import android.text.InputType; +import android.widget.Button; +import android.widget.CheckBox; +import android.widget.EditText; +import android.widget.LinearLayout; +import android.widget.TextView; + +import org.json.JSONArray; +import org.json.JSONObject; + +import java.util.Locale; + + +public class MainActivity extends Activity { + private static final String APP_VERSION = BuildConfig.VERSION_NAME; + private static final String DEFAULT_BACKEND_URL = BuildConfig.DEFAULT_BACKEND_URL; + private static final String DEFAULT_CLUSTER_ID = BuildConfig.DEFAULT_CLUSTER_ID; + private static final String DEFAULT_ORGANIZATION_ID = BuildConfig.DEFAULT_ORGANIZATION_ID; + private static final String PREF_SELECTED_EXIT_NODE_ID = "selected_exit_node_id"; + private static final int VPN_PREPARE_REQUEST = 42; + private static final String PREFS = "rap-vpn"; + private static final String PREF_DEVICE_FINGERPRINT = "device_fingerprint"; + private static final String PREF_REFRESH_TOKEN = "refresh_token"; + private static final String PREF_REFRESH_EXPIRES_AT = "refresh_expires_at"; + private static final String PREF_USER_ID = "user_id"; + private 
static final String PREF_DEVICE_ID = "device_id"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + static final String PREF_FORCE_FULL_TUNNEL = "force_full_tunnel"; + private EditText backendUrl; + private EditText clusterId; + private EditText organizationId; + private EditText email; + private EditText password; + private TextView status; + private TextView profileSummary; + private TextView serverDirectory; + private TextView runtimeStatus; + private String profileJson = ""; + private String vpnConnectionId = ""; + private JSONArray lastResources = new JSONArray(); + private RapApiClient.AuthContext authContext = null; + private SharedPreferences prefs; + private SharedPreferences runtimePrefs; + private SecureTokenStore secureTokens; + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + runtimePrefs = getSharedPreferences("rap-vpn-runtime", MODE_PRIVATE); + secureTokens = new SecureTokenStore(this); + LinearLayout root = new LinearLayout(this); + root.setOrientation(LinearLayout.VERTICAL); + root.setBackgroundColor(0xff101820); + int pad = dp(20); + root.setPadding(pad, pad, pad, pad); + + backendUrl = field("Backend URL", preferredBackendUrl()); + clusterId = field("Cluster ID", prefs.getString("cluster_id", DEFAULT_CLUSTER_ID)); + organizationId = field("Organization ID", prefs.getString("organization_id", DEFAULT_ORGANIZATION_ID)); + email = field("Email", prefs.getString("email", "m")); + password = field("Password", ""); + password.setInputType(InputType.TYPE_CLASS_TEXT | InputType.TYPE_TEXT_VARIATION_PASSWORD); + normalizeAndPersistDefaults(); + if (!prefs.contains(PREF_FORCE_FULL_TUNNEL)) { + prefs.edit().putBoolean(PREF_FORCE_FULL_TUNNEL, true).apply(); + } + profileJson = prefs.getString(PREF_PROFILE_JSON, ""); + vpnConnectionId = 
prefs.getString(PREF_VPN_CONNECTION_ID, ""); + restoreAuthContext(); + + TextView title = new TextView(this); + title.setText("RAP HOME VPN " + APP_VERSION); + title.setTextColor(0xffffffff); + title.setTextSize(26); + title.setPadding(0, 0, 0, dp(8)); + + profileSummary = new TextView(this); + profileSummary.setTextColor(0xffc8d6df); + profileSummary.setTextSize(14); + profileSummary.setText(summaryText()); + + serverDirectory = new TextView(this); + serverDirectory.setTextColor(0xffe8eef2); + serverDirectory.setTextSize(15); + serverDirectory.setPadding(0, dp(14), 0, dp(14)); + serverDirectory.setText(""); + + status = new TextView(this); + status.setTextColor(0xffd8eadf); + status.setPadding(0, dp(14), 0, dp(14)); + status.setText("Готово. Версия " + APP_VERSION + "."); + + runtimeStatus = new TextView(this); + runtimeStatus.setTextColor(0xff9fb6c2); + runtimeStatus.setTextSize(13); + runtimeStatus.setPadding(0, 0, 0, dp(10)); + runtimeStatus.setText(runtimeStatusText()); + + Button load = new Button(this); + load.setText("Войти / обновить профиль"); + load.setOnClickListener(v -> loadProfile(false)); + + Button start = new Button(this); + start.setText("Включить HOME VPN"); + start.setOnClickListener(v -> prepareVpn()); + + Button stop = new Button(this); + stop.setText("Отключить VPN"); + stop.setOnClickListener(v -> { + Intent stopIntent = new Intent(this, RapVpnService.class); + stopIntent.setAction(RapVpnService.ACTION_STOP); + try { + startService(stopIntent); + } catch (Exception ignored) { + } + runtimePrefs.edit() + .putString("state", "stopped") + .putString("message", "stop requested from app") + .putLong("updated_at", System.currentTimeMillis()) + .apply(); + status.setText("Отключаю VPN..."); + runtimeStatus.postDelayed(() -> { + runtimeStatus.setText(runtimeStatusText()); + status.setText(isSystemVpnActive() ? "VPN еще активен в Android. Повторяю остановку..." 
: "VPN отключен."); + if (isSystemVpnActive()) { + try { + startService(stopIntent); + } catch (Exception ignored) { + } + runtimeStatus.postDelayed(() -> { + if (isSystemVpnActive()) { + runtimePrefs.edit() + .putString("state", "stopped") + .putString("message", "force stop app process after VPN stop request") + .putLong("updated_at", System.currentTimeMillis()) + .apply(); + android.os.Process.killProcess(android.os.Process.myPid()); + } + }, 1800); + } + }, 1200); + }); + + Button settings = new Button(this); + settings.setText("Настройки"); + settings.setOnClickListener(v -> showSettingsDialog()); + + Button servers = new Button(this); + servers.setText("Открыть удаленный сервер"); + servers.setOnClickListener(v -> showServerPicker()); + + root.addView(title); + root.addView(profileSummary); + root.addView(load); + root.addView(servers); + root.addView(start); + root.addView(stop); + root.addView(settings); + root.addView(status); + root.addView(runtimeStatus); + setContentView(root); + scheduleRuntimeStatusRefresh(); + if (authContext != null && !authContext.deviceId.isEmpty()) { + startDiagnosticChannel(); + } + } + + @Override + protected void onDestroy() { + super.onDestroy(); + } + + private EditText field(String hint, String value) { + EditText input = new EditText(this); + input.setHint(hint); + input.setText(value); + input.setSingleLine(true); + return input; + } + + private void loadProfile() { + loadProfile(false); + } + + private void loadProfile(boolean startAfterLoad) { + status.setText("Загрузка..."); + saveSettings(); + new Thread(() -> { + try { + RapApiClient client = new RapApiClient(backendUrl.getText().toString(), this); + authContext = authenticate(client); + String activeOrganizationId = resolveOrganizationId(client, authContext.userId); + String requestedExitNodeId = selectedExitNodeId(); + profileJson = client.vpnClientProfile( + clusterId.getText().toString(), + activeOrganizationId, + authContext.userId, + requestedExitNodeId + ); + 
vpnConnectionId = firstConnectionId(profileJson); + saveProfileState(); + JSONObject resourcePayload = client.resources(activeOrganizationId, authContext.userId); + lastResources = resourcePayload.optJSONArray("resources"); + if (lastResources == null) { + lastResources = new JSONArray(); + } + String resourcesText = resourcesText(resourcePayload); + runOnUiThread(() -> { + profileSummary.setText(summaryText()); + serverDirectory.setText(resourcesText); + status.setText(startAfterLoad ? "Профиль обновлен. Запускаю VPN..." : "Профиль и ключи устройства обновлены."); + startDiagnosticChannel(); + if (startAfterLoad) { + requestVpnPermission(); + } + }); + } catch (Exception ex) { + runOnUiThread(() -> { + String message = friendlyError(ex); + boolean canUseSavedProfile = startAfterLoad && !profileJson.isEmpty() && !vpnConnectionId.isEmpty(); + if (canUseSavedProfile) { + status.setText("Профиль сейчас не обновился: " + message + ". Запускаю VPN с сохраненным рабочим профилем."); + startDiagnosticChannel(); + requestVpnPermission(); + return; + } + status.setText("Ошибка профиля: " + message); + if (message.contains("логин") || message.contains("пароль") || message.contains("Сессия устройства")) { + clearSavedAuth(false); + showSettingsDialog(); + } + }); + } + }).start(); + } + + private void prepareVpn() { + loadProfile(true); + status.setText("Обновляю сессию устройства и VPN-профиль..."); + } + + private void requestVpnPermission() { + if (profileJson.isEmpty()) { + status.setText("VPN-профиль не загружен."); + return; + } + Intent prepare = VpnService.prepare(this); + if (prepare != null) { + startActivityForResult(prepare, VPN_PREPARE_REQUEST); + return; + } + startVpn(); + } + + @Override + protected void onActivityResult(int requestCode, int resultCode, Intent data) { + super.onActivityResult(requestCode, resultCode, data); + if (requestCode == VPN_PREPARE_REQUEST && resultCode == RESULT_OK) { + startVpn(); + } + } + + private void startVpn() { + Intent intent 
= new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, backendUrl.getText().toString()); + intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, clusterId.getText().toString()); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + startForegroundService(intent); + status.setText("VPN запускается. Версия " + APP_VERSION + ". Backend: " + backendUrl.getText() + ". Connection: " + vpnConnectionId + ". Ожидаю статус подключения."); + runtimeStatus.setText("Запрашиваю статус... " + runtimeStatusText()); + runtimeStatus.postDelayed(() -> { + String state = runtimePrefs.getString("state", ""); + boolean runtimeActive = isVpnRuntimeActive(); + if (!isSystemVpnActive()) { + if (runtimeActive) { + status.setText("VPN runtime активен, рабочий канал поднят. Android еще обновляет системный статус."); + } else if ("stopped".equals(state) || "revoked".equals(state) || "error".equals(state)) { + status.setText("VPN не включился: " + runtimePrefs.getString("message", "Android остановил VPN-сервис") + "."); + } else if ("starting".equals(state) || "tunnel".equals(state) || "relay_selected".equals(state) || "relay".equals(state) || "relay_reset".equals(state)) { + status.setText("VPN запускается. Android еще применяет туннель, ожидаю рабочий канал."); + } else { + status.setText("VPN еще не активен в Android. Проверьте системный запрос разрешения VPN."); + } + } else { + status.setText("VPN включен Android. 
Версия " + APP_VERSION + "."); + } + runtimeStatus.setText(runtimeStatusText()); + }, 2500); + } + + private void scheduleRuntimeStatusRefresh() { + runtimeStatus.postDelayed(() -> { + runtimeStatus.setText(runtimeStatusText()); + scheduleRuntimeStatusRefresh(); + }, 1500); + } + + private String runtimeStatusText() { + String state = runtimePrefs.getString("state", "нет данных"); + String message = runtimePrefs.getString("message", ""); + long updatedAt = runtimePrefs.getLong("updated_at", 0); + long read = runtimePrefs.getLong("uplink_read", 0); + long sent = runtimePrefs.getLong("uplink_sent", 0); + long down = runtimePrefs.getLong("downlink_received", 0); + long errors = runtimePrefs.getLong("errors", 0); + long readBytes = runtimePrefs.getLong("uplink_read_bytes", 0); + long sentBytes = runtimePrefs.getLong("uplink_sent_bytes", 0); + long downBytes = runtimePrefs.getLong("downlink_received_bytes", 0); + long droppedRead = runtimePrefs.getLong("uplink_dropped_packets", 0); + long droppedDown = runtimePrefs.getLong("downlink_dropped_packets", 0); + long bypassControl = runtimePrefs.getLong("uplink_bypassed_control_packets", 0); + long sourceMismatch = runtimePrefs.getLong("uplink_source_mismatch_packets", 0); + long destinationMismatch = runtimePrefs.getLong("downlink_destination_mismatch_packets", 0); + float uplinkReadMbps = runtimePrefs.getFloat("uplink_read_mbps", 0f); + float uplinkSentMbps = runtimePrefs.getFloat("uplink_sent_mbps", 0f); + float downlinkMbps = runtimePrefs.getFloat("downlink_received_mbps", 0f); + float uplinkReadPps = runtimePrefs.getFloat("uplink_read_pps", 0f); + float uplinkSentPps = runtimePrefs.getFloat("uplink_sent_pps", 0f); + float downlinkPps = runtimePrefs.getFloat("downlink_received_pps", 0f); + int workerCount = runtimePrefs.getInt("uplink_worker_count", 0); + int queueDepthTotal = runtimePrefs.getInt("uplink_queue_depth_total", 0); + int queueDepthMax = runtimePrefs.getInt("uplink_queue_depth_max", 0); + String queueDepths = 
runtimePrefs.getString("uplink_queue_depths", ""); + long queue0Drops = runtimePrefs.getLong("uplink_queue_0_drops", 0); + long queue1Drops = runtimePrefs.getLong("uplink_queue_1_drops", 0); + long queue2Drops = runtimePrefs.getLong("uplink_queue_2_drops", 0); + long queue3Drops = runtimePrefs.getLong("uplink_queue_3_drops", 0); + long queue0Offers = runtimePrefs.getLong("uplink_queue_0_offers", 0); + long queue1Offers = runtimePrefs.getLong("uplink_queue_1_offers", 0); + long queue2Offers = runtimePrefs.getLong("uplink_queue_2_offers", 0); + long queue3Offers = runtimePrefs.getLong("uplink_queue_3_offers", 0); + long sender0Packets = runtimePrefs.getLong("uplink_sender_worker_packets_0", 0); + long sender1Packets = runtimePrefs.getLong("uplink_sender_worker_packets_1", 0); + long sender2Packets = runtimePrefs.getLong("uplink_sender_worker_packets_2", 0); + long sender3Packets = runtimePrefs.getLong("uplink_sender_worker_packets_3", 0); + long sender0Errors = runtimePrefs.getLong("uplink_sender_worker_errors_0", 0); + long sender1Errors = runtimePrefs.getLong("uplink_sender_worker_errors_1", 0); + long sender2Errors = runtimePrefs.getLong("uplink_sender_worker_errors_2", 0); + long sender3Errors = runtimePrefs.getLong("uplink_sender_worker_errors_3", 0); + String age = updatedAt <= 0 ? 
"никогда" : ((System.currentTimeMillis() - updatedAt) / 1000) + " сек назад"; + boolean osVpnActive = isSystemVpnActive(); + String routes = runtimePrefs.getString("routes", ""); + String dnsServers = runtimePrefs.getString("dns_servers", ""); + String profileRelayUrl = runtimePrefs.getString("packet_relay_profile_base_url", ""); + String activeRelayUrl = runtimePrefs.getString("packet_relay_active_base_url", ""); + String relayCandidates = runtimePrefs.getString("packet_relay_candidate_urls", ""); + boolean forceFullTunnelRuntime = false; + boolean fastPathEnabled = false; + try { + forceFullTunnelRuntime = runtimePrefs.getBoolean("force_full_tunnel", false); + } catch (Exception ignored) { + } + try { + fastPathEnabled = runtimePrefs.getBoolean("fast_path_enabled", forceFullTunnelRuntime); + } catch (Exception ignored) { + } + boolean staleState = updatedAt > 0 && (System.currentTimeMillis() - updatedAt) > 12_000; + boolean runtimeActive = isVpnRuntimeActive(); + if (!osVpnActive && !runtimeActive && ("running".equals(state) || "tunnel".equals(state) || "relay".equals(state) || "relay_reset".equals(state))) { + state = "stale_no_os_vpn"; + message = "Сервис говорит об активном состоянии, но Android VPN-интерфейс не активен. Проверьте разрешения/ручной запуск."; + staleState = false; + } + return "Диагностика: " + state + + "\n" + message + + "\nOS VPN: " + (osVpnActive ? "активен" : (runtimeActive ? "runtime активен" : "неактивен")) + + "\n" + (staleState ? 
"статус устарел" : "статус актуален") + + "\nread/sent/down: " + read + "/" + sent + "/" + down + + "\nerrors/drops: " + errors + "/" + (droppedRead + droppedDown) + + "\ncontrol bypass: " + bypassControl + + "\naddress mismatch (up/down): " + sourceMismatch + " / " + destinationMismatch + + "\nthroughput Mbps: up " + String.format(Locale.US, "%.2f", uplinkSentMbps) + + " / down " + String.format(Locale.US, "%.2f", downlinkMbps) + + "\npps: up " + String.format(Locale.US, "%.1f", uplinkSentPps) + + " / down " + String.format(Locale.US, "%.1f", downlinkPps) + + "\nDNS выхода: " + (dnsServers.isEmpty() ? "-" : dnsServers) + + "\nroutes: " + (routes.isEmpty() ? "-" : routes) + + "\nrelay active: " + (activeRelayUrl.isEmpty() ? "-" : activeRelayUrl) + + "\nrelay profile: " + (profileRelayUrl.isEmpty() ? "-" : profileRelayUrl) + + "\nrelay candidates: " + (relayCandidates.isEmpty() ? "-" : relayCandidates) + + "\nforced_full_tunnel: " + (forceFullTunnelRuntime ? "да" : "нет") + + "\nfast_path_mode: " + (fastPathEnabled ? "включен" : "выключен") + + "\nbytes read/sent/down: " + readBytes + "/" + sentBytes + "/" + downBytes + + "\nworkers: " + workerCount + + "\nqueue depth total/max: " + queueDepthTotal + " / " + queueDepthMax + + "\nqueue depths: " + (queueDepths.isEmpty() ? 
"-" : queueDepths) + + "\nqueue0 q/s: " + queue0Offers + "/" + queue0Drops + + " q1 " + queue1Offers + "/" + queue1Drops + + " q2 " + queue2Offers + "/" + queue2Drops + + " q3 " + queue3Offers + "/" + queue3Drops + + "\nsender pkt/err: w0 " + sender0Packets + "/" + sender0Errors + + " w1 " + sender1Packets + "/" + sender1Errors + + " w2 " + sender2Packets + "/" + sender2Errors + + " w3 " + sender3Packets + "/" + sender3Errors + + "\nобновлено: " + age; + } + + private void startDiagnosticChannel() { + if (authContext == null || authContext.deviceId.isEmpty()) { + return; + } + RapDiagnosticService.start(this); + } + + private boolean isSystemVpnActive() { + try { + ConnectivityManager connectivityManager = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivityManager == null) { + return false; + } + Network[] networks = connectivityManager.getAllNetworks(); + if (networks != null) { + for (Network network : networks) { + NetworkCapabilities capabilities = connectivityManager.getNetworkCapabilities(network); + if (capabilities != null && capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + return true; + } + } + } + return false; + } catch (Exception ignored) { + return false; + } + } + + private boolean isVpnRuntimeActive() { + String state = runtimePrefs.getString("state", ""); + if ("stopped".equals(state) || "revoked".equals(state) || "error".equals(state)) { + return false; + } + long updatedAt = runtimePrefs.getLong("updated_at", 0); + if (updatedAt <= 0 || (System.currentTimeMillis() - updatedAt) > 15_000) { + return false; + } + String relay = runtimePrefs.getString("packet_relay_active_base_url", ""); + long read = runtimePrefs.getLong("uplink_read_total", 0); + long sent = runtimePrefs.getLong("uplink_sent_total", 0); + long down = runtimePrefs.getLong("downlink_received_total", 0); + return !relay.isEmpty() && ("running".equals(state) + || "relay".equals(state) + || "relay_reset".equals(state) + || 
"downlink".equals(state) + || "downlink_idle".equals(state) + || "uplink_sent".equals(state) + || read > 0 || sent > 0 || down > 0); + } + + private String firstConnectionId(String profile) throws Exception { + JSONObject root = new JSONObject(profile); + JSONObject vpnProfile = root.getJSONObject("vpn_client_profile"); + JSONArray connections = vpnProfile.getJSONArray("connections"); + if (connections.length() == 0) { + throw new IllegalStateException("VPN profile has no connections"); + } + String fallback = null; + String waiting = null; + for (int i = 0; i < connections.length(); i++) { + JSONObject connection = connections.optJSONObject(i); + if (connection == null) { + continue; + } + String id = connection.optString("id", "").trim(); + if (id.isEmpty()) { + continue; + } + if (fallback == null) { + fallback = id; + } + JSONObject clientConfig = connection.optJSONObject("client_config"); + if (clientConfig == null) { + continue; + } + JSONObject fabricRoute = clientConfig.optJSONObject("vpn_fabric_route"); + if (fabricRoute == null) { + continue; + } + String status = fabricRoute.optString("status", "").trim().toLowerCase(); + if ("planned".equals(status)) { + String entry = fabricRoute.optString("selected_entry_node_id", "").trim(); + String exit = fabricRoute.optString("selected_exit_node_id", "").trim(); + if (!entry.isEmpty() && !exit.isEmpty()) { + return id; + } + } + if (("connecting".equals(status) || "active".equals(status) || "assigned".equals(status)) && waiting == null) { + waiting = id; + } + } + if (waiting != null) { + return waiting; + } + if (fallback != null) { + return fallback; + } + return connections.getJSONObject(0).getString("id"); + } + + private String resourcesText(JSONObject payload) throws Exception { + JSONArray resources = payload.optJSONArray("resources"); + if (resources == null || resources.length() == 0) { + return "Серверы: доступных ресурсов нет."; + } + StringBuilder text = new StringBuilder("Серверы:\n"); + int limit = 
Math.min(resources.length(), 6); + for (int i = 0; i < limit; i++) { + JSONObject resource = resources.getJSONObject(i); + text.append("• ") + .append(resource.optString("name", "server")) + .append(" ") + .append(resource.optString("protocol", "rdp")) + .append(" ") + .append(resource.optString("address", "")) + .append('\n'); + } + if (resources.length() > limit) { + text.append("и еще ").append(resources.length() - limit).append("..."); + } + return text.toString().trim(); + } + + private int dp(int value) { + return (int) (value * getResources().getDisplayMetrics().density); + } + + private String summaryText() { + String deviceId = prefs == null ? "" : prefs.getString(PREF_DEVICE_ID, ""); + String connectionId = vpnConnectionId == null || vpnConnectionId.isEmpty() + ? (prefs == null ? "" : prefs.getString(PREF_VPN_CONNECTION_ID, "")) + : vpnConnectionId; + String backendText = backendUrl == null ? "" : backendUrl.getText().toString().trim(); + String clusterText = clusterId == null ? "" : clusterId.getText().toString().trim(); + String organizationText = organizationId == null ? "" : organizationId.getText().toString().trim(); + String exitNode = selectedExitNodeId(); + String profileDNS = profileDNSServersText(); + return "Версия: " + APP_VERSION + + "\nКластер: " + (clusterText.isEmpty() ? "не задан" : clusterText) + + "\nОрганизация: " + (organizationText.isEmpty() ? "не задана" : organizationText) + + "\nТочка входа: автоматическая (из настроек кластера)" + + "\nТочка выхода: " + (exitNode.isEmpty() ? "не выбрана (по умолчанию)" : exitNode) + + "\nDNS выхода: " + (profileDNS.isEmpty() ? "будет получен из профиля" : profileDNS) + + "\nBackend: " + (backendText.isEmpty() ? "не задан" : backendText) + + "\nТрафик: " + (prefs.getBoolean(PREF_FORCE_FULL_TUNNEL, true) ? "весь через VPN" : "по профилю") + + "\nDevice: " + (deviceId.isEmpty() ? "нет" : deviceId) + + "\nConnection: " + (connectionId.isEmpty() ? 
"нет" : connectionId); + } + + private String profileDNSServersText() { + if (profileJson == null || profileJson.trim().isEmpty()) { + return runtimePrefs == null ? "" : runtimePrefs.getString("dns_servers", ""); + } + try { + JSONObject root = new JSONObject(profileJson); + JSONObject vpnProfile = root.optJSONObject("vpn_client_profile"); + JSONArray connections = vpnProfile == null ? null : vpnProfile.optJSONArray("connections"); + if (connections == null || connections.length() == 0) { + return ""; + } + String preferredConnection = vpnConnectionId == null || vpnConnectionId.isEmpty() + ? (prefs == null ? "" : prefs.getString(PREF_VPN_CONNECTION_ID, "")) + : vpnConnectionId; + JSONObject selected = null; + for (int i = 0; i < connections.length(); i++) { + JSONObject candidate = connections.optJSONObject(i); + if (candidate == null) { + continue; + } + if (!preferredConnection.isEmpty() && preferredConnection.equals(candidate.optString("id", ""))) { + selected = candidate; + break; + } + if (selected == null) { + selected = candidate; + } + } + JSONObject clientConfig = selected == null ? null : selected.optJSONObject("client_config"); + JSONArray dns = clientConfig == null ? null : clientConfig.optJSONArray("dns_servers"); + return joinJSONArray(dns); + } catch (Exception ignored) { + return ""; + } + } + + private String joinJSONArray(JSONArray values) { + if (values == null || values.length() == 0) { + return ""; + } + StringBuilder out = new StringBuilder(); + for (int i = 0; i < values.length(); i++) { + String value = values.optString(i, "").trim(); + if (value.isEmpty()) { + continue; + } + if (out.length() > 0) { + out.append(","); + } + out.append(value); + } + return out.toString(); + } + + private String preferredBackendUrl() { + String saved = prefs.getString("backend_url", DEFAULT_BACKEND_URL); + String normalized = normalizeBackendUrl(saved); + if (!normalized.equals(saved == null ? 
"" : saved.trim())) { + prefs.edit().putString("backend_url", normalized).apply(); + } + return normalized; + } + + private void saveSettings() { + String normalizedBackend = normalizeBackendUrl(backendUrl.getText().toString()); + if (!normalizedBackend.equals(backendUrl.getText().toString().trim())) { + backendUrl.setText(normalizedBackend); + } + normalizeAndPersistDefaults(); + if (clusterId.getText().toString().trim().isEmpty()) { + clusterId.setText(DEFAULT_CLUSTER_ID); + } + if (organizationId.getText().toString().trim().isEmpty()) { + organizationId.setText(DEFAULT_ORGANIZATION_ID); + } + prefs.edit() + .putString("backend_url", normalizedBackend) + .putString("cluster_id", clusterId.getText().toString()) + .putString("organization_id", organizationId.getText().toString()) + .putString("email", email.getText().toString()) + .apply(); + } + + private void normalizeAndPersistDefaults() { + String normalizedBackend = normalizeBackendUrl(backendUrl.getText().toString()); + if (normalizedBackend.isEmpty()) { + backendUrl.setText(DEFAULT_BACKEND_URL); + } + if (clusterId.getText().toString().trim().isEmpty()) { + clusterId.setText(DEFAULT_CLUSTER_ID); + } + if (organizationId.getText().toString().trim().isEmpty()) { + organizationId.setText(DEFAULT_ORGANIZATION_ID); + } + } + + private String normalizeBackendUrl(String value) { + String candidate = value == null ? "" : value.trim().replaceAll("/+$", ""); + if (candidate.isEmpty()) { + return DEFAULT_BACKEND_URL; + } + return candidate; + } + + private String selectedExitNodeId() { + String configured = prefs == null ? "" : prefs.getString(PREF_SELECTED_EXIT_NODE_ID, ""); + return normalizeSelectedExitNodeId(configured); + } + + private String normalizeSelectedExitNodeId(String value) { + String candidate = value == null ? 
"" : value.trim(); + if (candidate.isEmpty()) { + return ""; + } + if (candidate.matches("^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$")) { + return candidate; + } + if (candidate.matches("^[A-Za-z0-9][A-Za-z0-9._-]{2,63}$")) { + return candidate; + } + return ""; + } + + private RapApiClient.AuthContext authenticate(RapApiClient client) throws Exception { + String savedRefresh = savedRefreshToken(); + if (!savedRefresh.isEmpty()) { + try { + RapApiClient.AuthContext refreshed = client.refresh(savedRefresh); + saveAuthContext(refreshed); + return refreshed; + } catch (Exception ignored) { + clearSavedAuth(false); + } + } + String passwordValue = password.getText().toString().trim(); + if (passwordValue.isEmpty()) { + throw new IllegalStateException("Сессия устройства истекла или отозвана. Введите пароль один раз, дальше ключи обновятся автоматически."); + } + RapApiClient.AuthContext loggedIn = client.login(email.getText().toString().trim(), passwordValue, deviceFingerprint()); + saveAuthContext(loggedIn); + return loggedIn; + } + + private String resolveOrganizationId(RapApiClient client, String userId) throws Exception { + JSONObject payload = client.organizations(userId); + JSONArray organizations = payload.optJSONArray("organizations"); + if (organizations == null || organizations.length() == 0) { + throw new IllegalStateException("У пользователя нет активной организации."); + } + String configured = organizationId.getText().toString().trim(); + JSONObject fallback = null; + for (int i = 0; i < organizations.length(); i++) { + JSONObject item = organizations.optJSONObject(i); + if (item == null) { + continue; + } + String id = item.optString("id", ""); + String name = item.optString("name", ""); + String slug = item.optString("slug", ""); + if (!configured.isEmpty() && configured.equals(id)) { + return configured; + } + if (fallback == null || "HOME".equalsIgnoreCase(name) || "home".equalsIgnoreCase(slug)) { + fallback = item; + 
} + } + String selected = fallback != null ? fallback.optString("id", "") : ""; + if (selected.isEmpty()) { + throw new IllegalStateException("Не удалось выбрать организацию пользователя."); + } + runOnUiThread(() -> { + organizationId.setText(selected); + saveSettings(); + }); + return selected; + } + + private void saveAuthContext(RapApiClient.AuthContext context) throws Exception { + secureTokens.put(PREF_REFRESH_TOKEN, context.refreshToken); + prefs.edit() + .putString(PREF_USER_ID, context.userId) + .putString(PREF_DEVICE_ID, context.deviceId) + .putString(PREF_REFRESH_EXPIRES_AT, context.refreshTokenExpiresAt) + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, vpnConnectionId) + .apply(); + } + + private void saveProfileState() { + prefs.edit() + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, vpnConnectionId) + .apply(); + } + + private void restoreAuthContext() { + String userId = prefs.getString(PREF_USER_ID, ""); + String deviceId = prefs.getString(PREF_DEVICE_ID, ""); + if (!userId.isEmpty() && !deviceId.isEmpty()) { + authContext = new RapApiClient.AuthContext( + userId, + deviceId, + "", + "", + secureTokens.get(PREF_REFRESH_TOKEN), + prefs.getString(PREF_REFRESH_EXPIRES_AT, "")); + } + } + + private void clearSavedAuth(boolean clearProfile) { + secureTokens.remove(PREF_REFRESH_TOKEN); + SharedPreferences.Editor editor = prefs.edit() + .remove(PREF_REFRESH_EXPIRES_AT) + .remove(PREF_USER_ID) + .remove(PREF_DEVICE_ID); + if (clearProfile) { + editor.remove(PREF_PROFILE_JSON).remove(PREF_VPN_CONNECTION_ID); + profileJson = ""; + vpnConnectionId = ""; + } + editor.apply(); + authContext = null; + } + + private String savedRefreshToken() { + String token = secureTokens.get(PREF_REFRESH_TOKEN); + if (!token.isEmpty()) { + return token; + } + String legacyToken = prefs.getString(PREF_REFRESH_TOKEN, ""); + if (!legacyToken.isEmpty()) { + try { + secureTokens.put(PREF_REFRESH_TOKEN, legacyToken); + 
prefs.edit().remove(PREF_REFRESH_TOKEN).apply(); + } catch (Exception ignored) { + } + } + return legacyToken; + } + + private String deviceFingerprint() { + String existing = prefs.getString(PREF_DEVICE_FINGERPRINT, ""); + if (!existing.isEmpty()) { + return existing; + } + String generated = "android-" + java.util.UUID.randomUUID(); + prefs.edit().putString(PREF_DEVICE_FINGERPRINT, generated).apply(); + return generated; + } + + private void showSettingsDialog() { + LinearLayout form = new LinearLayout(this); + form.setOrientation(LinearLayout.VERTICAL); + int pad = dp(12); + form.setPadding(pad, pad, pad, pad); + EditText backendDraft = field("Backend URL", backendUrl.getText().toString()); + EditText clusterDraft = field("Cluster ID", clusterId.getText().toString()); + EditText organizationDraft = field("Organization ID", organizationId.getText().toString()); + EditText emailDraft = field("Email", email.getText().toString()); + EditText passwordDraft = field("Password", password.getText().toString()); + passwordDraft.setInputType(InputType.TYPE_CLASS_TEXT | InputType.TYPE_TEXT_VARIATION_PASSWORD); + passwordDraft.setHint("Password (не сохраняется)"); + EditText selectedExitDraft = field( + "Точка выхода (Node ID, например ifcm)", + prefs.getString(PREF_SELECTED_EXIT_NODE_ID, "")); + CheckBox showPassword = new CheckBox(this); + showPassword.setText("Показать пароль"); + showPassword.setTextColor(0xff111111); + showPassword.setOnCheckedChangeListener((buttonView, isChecked) -> { + passwordDraft.setInputType(InputType.TYPE_CLASS_TEXT | (isChecked + ? 
InputType.TYPE_TEXT_VARIATION_VISIBLE_PASSWORD + : InputType.TYPE_TEXT_VARIATION_PASSWORD)); + passwordDraft.setSelection(passwordDraft.getText().length()); + }); + CheckBox forceFullTunnel = new CheckBox(this); + forceFullTunnel.setText("Полный маршрут через VPN"); + forceFullTunnel.setChecked(prefs.getBoolean(PREF_FORCE_FULL_TUNNEL, true)); + forceFullTunnel.setTextColor(0xff111111); + form.addView(backendDraft); + form.addView(clusterDraft); + form.addView(organizationDraft); + form.addView(emailDraft); + form.addView(passwordDraft); + form.addView(selectedExitDraft); + form.addView(showPassword); + form.addView(forceFullTunnel); + new AlertDialog.Builder(this) + .setTitle("Настройки подключения") + .setView(form) + .setPositiveButton("Сохранить", (dialog, which) -> { + backendUrl.setText(backendDraft.getText().toString()); + clusterId.setText(clusterDraft.getText().toString()); + organizationId.setText(organizationDraft.getText().toString()); + email.setText(emailDraft.getText().toString()); + password.setText(passwordDraft.getText().toString()); + String normalizedExit = normalizeSelectedExitNodeId(selectedExitDraft.getText().toString()); + prefs.edit() + .putString(PREF_SELECTED_EXIT_NODE_ID, normalizedExit) + .apply(); + if (!normalizedExit.equals(selectedExitDraft.getText().toString().trim())) { + status.setText("Точка выхода очищена: значение было не похоже на Node ID/alias."); + } + prefs.edit().putBoolean(PREF_FORCE_FULL_TUNNEL, forceFullTunnel.isChecked()).apply(); + saveSettings(); + profileSummary.setText(summaryText()); + }) + .setNeutralButton("Забыть устройство", (dialog, which) -> { + clearSavedAuth(true); + status.setText("Устройство забыто. 
Для следующего входа нужен пароль."); + }) + .setNegativeButton("Отмена", null) + .show(); + } + + private String friendlyError(Exception ex) { + String message = ex.getMessage(); + if (message == null || message.trim().isEmpty()) { + return "неизвестная ошибка"; + } + if (message.contains("auth.invalid_credentials") || message.contains("Неверный логин")) { + int passwordLength = password.getText() == null ? 0 : password.getText().toString().length(); + return "Неверный логин или пароль. Проверьте раскладку и спецсимволы. Длина введенного пароля: " + passwordLength + "."; + } + if (message.contains("auth.invalid_refresh_token") || message.contains("invalid refresh token")) { + return "Сессия устройства истекла. Введите пароль один раз, дальше ключи обновятся автоматически."; + } + return message; + } + + private void showServerPicker() { + if (lastResources.length() == 0) { + loadProfile(); + status.setText("Загружаю список серверов..."); + return; + } + String[] labels = new String[lastResources.length()]; + for (int i = 0; i < lastResources.length(); i++) { + JSONObject resource = lastResources.optJSONObject(i); + labels[i] = resource == null + ? "server" + : resource.optString("name", "server") + " " + resource.optString("address", ""); + } + new AlertDialog.Builder(this) + .setTitle("Удаленный сервер") + .setItems(labels, (dialog, which) -> startRemoteDesktop(which)) + .show(); + } + + private void startRemoteDesktop(int index) { + JSONObject resource = lastResources.optJSONObject(index); + if (resource == null) { + return; + } + if (authContext == null || authContext.userId.isEmpty() || authContext.deviceId.isEmpty()) { + loadProfile(); + status.setText("Профиль обновляется. 
Повторите открытие сервера."); + return; + } + status.setText("Открываю " + resource.optString("name", "сервер") + "..."); + new Thread(() -> { + try { + RapApiClient client = new RapApiClient(backendUrl.getText().toString(), this); + JSONObject result = client.startSession(resource.getString("id"), authContext.userId, authContext.deviceId); + Intent intent = new Intent(this, RdpActivity.class); + intent.putExtra(RdpActivity.EXTRA_SESSION_RESULT, result.toString()); + intent.putExtra(RdpActivity.EXTRA_GATEWAY_URL, gatewayUrl()); + intent.putExtra(RdpActivity.EXTRA_RESOURCE_NAME, resource.optString("name", "Remote Desktop")); + runOnUiThread(() -> { + status.setText("Сессия создана."); + startActivity(intent); + }); + } catch (Exception ex) { + runOnUiThread(() -> status.setText("Ошибка RDP: " + ex.getMessage())); + } + }).start(); + } + + private String gatewayUrl() { + String api = backendUrl.getText().toString().trim(); + String gateway = api.replace("https://", "wss://").replace("http://", "ws://"); + if (gateway.endsWith("/")) { + gateway = gateway.substring(0, gateway.length() - 1); + } + return gateway + "/gateway/ws"; + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/RapApiClient.java b/clients/android/app/src/main/java/su/cin/rapvpn/RapApiClient.java new file mode 100644 index 0000000..3144211 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/RapApiClient.java @@ -0,0 +1,630 @@ +package su.cin.rapvpn; + +import android.content.Context; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.VpnService; + +import okhttp3.MediaType; +import okhttp3.OkHttpClient; +import okhttp3.Dispatcher; +import okhttp3.ConnectionPool; +import okhttp3.Dns; +import okhttp3.Request; +import okhttp3.RequestBody; +import okhttp3.Response; +import okhttp3.ResponseBody; + +import org.json.JSONObject; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; 
+import java.io.InterruptedIOException; +import java.net.InetAddress; +import java.net.InetSocketAddress; +import java.net.Socket; +import java.net.URI; +import java.net.UnknownHostException; +import java.nio.charset.StandardCharsets; +import java.util.ArrayList; +import java.util.List; +import java.util.Collections; +import java.util.concurrent.TimeUnit; + +import javax.net.SocketFactory; + +final class RapApiClient { + private static final MediaType JSON = MediaType.get("application/json; charset=utf-8"); + private static final MediaType OCTET_STREAM = MediaType.get("application/octet-stream"); + private static final int MAX_PACKET_BATCH_PACKETS = 512; + private static final int MAX_PACKET_BATCH_BYTES = 512 * 1024; + private static final int MAX_SINGLE_PACKET_BYTES = 65535; + private static final int MAX_BATCH_HEADER_BYTES = 4; + private final String baseUrl; + private final OkHttpClient httpClient; + private final String networkMode; + private final FabricServiceChannel fabricServiceChannel; + + RapApiClient(String baseUrl) { + this(baseUrl, (Context) null); + } + + RapApiClient(String baseUrl, Context context) { + this.baseUrl = trimRight(baseUrl); + this.fabricServiceChannel = new FabricServiceChannel(); + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + // Regular app and diagnostic requests should use Android's default + // routing. Some devices reject binding app sockets to a specific + // Network with EACCES, which must not block login/profile refresh. + this.networkMode = context == null ? 
"default_network" : "default_network_context"; + builder.dns(new BackendPinnedDns(baseUrl)); + builder.connectTimeout(5, TimeUnit.SECONDS); + builder.writeTimeout(12, TimeUnit.SECONDS); + builder.readTimeout(12, TimeUnit.SECONDS); + builder.callTimeout(15, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + RapApiClient(String baseUrl, Context context, boolean preferUnderlyingNetwork) { + this.baseUrl = trimRight(baseUrl); + this.fabricServiceChannel = new FabricServiceChannel(); + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + String mode = context == null ? "default_network" : "default_network_context"; + if (preferUnderlyingNetwork && context != null) { + SocketFactory socketFactory = underlyingSocketFactory(context); + if (socketFactory != null) { + builder.socketFactory(socketFactory); + mode = "underlying_network_context"; + } + } + this.networkMode = mode; + builder.dns(new BackendPinnedDns(baseUrl)); + builder.connectTimeout(3, TimeUnit.SECONDS); + builder.writeTimeout(6, TimeUnit.SECONDS); + builder.readTimeout(6, TimeUnit.SECONDS); + builder.callTimeout(8, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + RapApiClient(String baseUrl, VpnService vpnService) { + this(baseUrl, vpnService, new FabricServiceChannel()); + } + + RapApiClient(String baseUrl, VpnService vpnService, FabricServiceChannel fabricServiceChannel) { + this.baseUrl = trimRight(baseUrl); + this.fabricServiceChannel = 
fabricServiceChannel == null ? new FabricServiceChannel() : fabricServiceChannel; + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + if (vpnService != null) { + builder.socketFactory(new ProtectedSocketFactory(vpnService)); + builder.dns(new BackendPinnedDns(baseUrl)); + this.networkMode = "protected_socket"; + } else { + this.networkMode = "default_network"; + } + builder.connectTimeout(3, TimeUnit.SECONDS); + builder.writeTimeout(8, TimeUnit.SECONDS); + builder.readTimeout(8, TimeUnit.SECONDS); + builder.callTimeout(10, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(false); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + RapApiClient(String baseUrl, Network network) { + this.baseUrl = trimRight(baseUrl); + this.fabricServiceChannel = new FabricServiceChannel(); + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + if (network != null) { + builder.socketFactory(network.getSocketFactory()); + builder.dns(hostname -> { + InetAddress[] addresses = network.getAllByName(hostname); + if (addresses == null || addresses.length == 0) { + throw new UnknownHostException(hostname); + } + List<InetAddress> out = new ArrayList<>(); + Collections.addAll(out, addresses); + return out; + }); + this.networkMode = "vpn_network"; + } else { + builder.dns(new BackendPinnedDns(baseUrl)); + this.networkMode = "default_network"; + } + builder.connectTimeout(5, TimeUnit.SECONDS); + builder.writeTimeout(12, TimeUnit.SECONDS); + builder.readTimeout(12, TimeUnit.SECONDS); + builder.callTimeout(15, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(64); + dispatcher.setMaxRequestsPerHost(32); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(16, 5, 
TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + String networkMode() { + return networkMode; + } + + static final class BackendPinnedDns implements Dns { + private static final String VPN_PUBLIC_HOST = "vpn.cin.su"; + private static final String VPN_PUBLIC_IPV4 = "94.141.118.222"; + private final String backendHost; + + BackendPinnedDns(String baseUrl) { + String parsedHost = ""; + try { + parsedHost = URI.create(baseUrl == null ? "" : baseUrl).getHost(); + } catch (Exception ignored) { + } + backendHost = parsedHost == null ? "" : parsedHost.trim().toLowerCase(); + } + + @Override + public List<InetAddress> lookup(String hostname) throws UnknownHostException { + String host = hostname == null ? "" : hostname.trim().toLowerCase(); + if (!backendHost.isEmpty() && host.equals(backendHost) && VPN_PUBLIC_HOST.equals(host)) { + return Collections.singletonList(InetAddress.getByName(VPN_PUBLIC_IPV4)); + } + return Dns.SYSTEM.lookup(hostname); + } + } + + private SocketFactory underlyingSocketFactory(Context context) { + ConnectivityManager connectivity = (ConnectivityManager) context.getSystemService(Context.CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities == null) { + continue; + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + continue; + } + if (!capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)) { + continue; + } + return network.getSocketFactory(); + } + return null; + } + + AuthContext login(String email, String password, String deviceFingerprint) throws Exception { + JSONObject body = new JSONObject(); + body.put("email", email); + body.put("password", password); + body.put("device_fingerprint", deviceFingerprint); + body.put("device_label", "RAP Android VPN"); + body.put("trust_device", true); + JSONObject response = 
post("/auth/login", body); + return parseAuthContext(response); + } + + AuthContext refresh(String refreshToken) throws Exception { + JSONObject body = new JSONObject(); + body.put("refresh_token", refreshToken); + return parseAuthContext(post("/auth/refresh", body)); + } + + String vpnClientProfile(String clusterId, String organizationId, String userId, String exitNodeId) throws Exception { + String path = "/clusters/" + clusterId + "/vpn/client-profile?organization_id=" + organizationId + "&user_id=" + userId; + if (exitNodeId != null && !exitNodeId.trim().isEmpty()) { + path += "&exit_node_id=" + exitNodeId.trim(); + } + return get(path).toString(); + } + + JSONObject organizations(String userId) throws Exception { + return get("/organizations/?user_id=" + userId); + } + + JSONObject resources(String organizationId, String userId) throws Exception { + String path = "/resources/?organization_id=" + organizationId + "&user_id=" + userId; + return get(path); + } + + JSONObject startSession(String resourceId, String userId, String deviceId) throws Exception { + JSONObject body = new JSONObject(); + body.put("resource_id", resourceId); + body.put("user_id", userId); + body.put("device_id", deviceId); + return post("/sessions/", body); + } + + JSONObject reportVPNDiagnosticStatus(String clusterId, String deviceId, JSONObject payload) throws Exception { + return post("/clusters/" + clusterId + "/vpn/client-diagnostics/" + deviceId + "/status", payload); + } + + JSONObject nextVPNDiagnosticCommand(String clusterId, String deviceId, int timeoutMs) throws Exception { + byte[] payload = getBytes("/clusters/" + clusterId + "/vpn/client-diagnostics/" + deviceId + "/commands?timeout_ms=" + timeoutMs); + if (payload.length == 0) { + return null; + } + return new JSONObject(new String(payload, StandardCharsets.UTF_8)); + } + + JSONObject vpnPacketStats(String clusterId, String vpnConnectionId) throws Exception { + return get("/clusters/" + clusterId + "/vpn-connections/" + 
vpnConnectionId + "/tunnel/stats"); + } + + JSONObject resetVPNPacketQueues(String clusterId, String vpnConnectionId) throws Exception { + return post("/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/reset", new JSONObject()); + } + + void sendClientPacket(String clusterId, String vpnConnectionId, byte[] packet, int length) throws Exception { + postBytes(clientPacketPath(clusterId, vpnConnectionId, ""), packet, length); + } + + void sendClientPacketBatch(String clusterId, String vpnConnectionId, List<byte[]> packets) throws Exception { + if (packets == null || packets.isEmpty()) { + return; + } + List<List<byte[]>> chunks = chunkPacketsForBatch(packets); + if (chunks.isEmpty()) { + return; + } + for (List<byte[]> chunk : chunks) { + postBytes(clientPacketPath(clusterId, vpnConnectionId, "?batch=true"), encodePacketBatch(chunk)); + } + } + + byte[] receiveClientPacket(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + try { + return getBytes(clientPacketPath(clusterId, vpnConnectionId, "?timeout_ms=" + timeoutMs)); + } catch (InterruptedIOException e) { + return new byte[0]; + } catch (IOException e) { + if (e.getMessage() != null && e.getMessage().toLowerCase().contains("timeout")) { + return new byte[0]; + } + throw e; + } catch (IllegalStateException e) { + String message = e.getMessage(); + if (message != null && (message.contains("HTTP 502") || message.contains("HTTP 503") || message.contains("HTTP 504"))) { + return new byte[0]; + } + throw e; + } + } + + List<byte[]> receiveClientPacketBatch(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + byte[] payload; + try { + payload = getBytes(clientPacketPath(clusterId, vpnConnectionId, "?batch=true&timeout_ms=" + timeoutMs)); + if (payload == null || payload.length == 0) { + return new ArrayList<>(); + } + if (!isLikelyPacketBatch(payload)) { + return receiveSinglePacketAsBatch(clusterId, vpnConnectionId, timeoutMs); + } + return decodePacketBatch(payload); + } catch 
(InterruptedIOException e) { + return new ArrayList<>(); + } catch (IOException e) { + if (e.getMessage() != null && e.getMessage().toLowerCase().contains("timeout")) { + return new ArrayList<>(); + } + throw e; + } catch (IllegalStateException e) { + String message = e.getMessage(); + if (message != null && (message.contains("HTTP 502") || message.contains("HTTP 503") || message.contains("HTTP 504"))) { + return new ArrayList<>(); + } + throw e; + } + } + + private JSONObject get(String path) throws Exception { + Request request = new Request.Builder().url(baseUrl + path).get().build(); + return read(request); + } + + private JSONObject post(String path, JSONObject body) throws Exception { + Request request = new Request.Builder() + .url(baseUrl + path) + .post(RequestBody.create(body.toString().getBytes(StandardCharsets.UTF_8), JSON)) + .build(); + return read(request); + } + + private byte[] getBytes(String path) throws Exception { + Request.Builder builder = new Request.Builder().url(baseUrl + path).get(); + applyFabricHeadersIfNeeded(builder, path); + Request request = builder.build(); + try (Response response = httpClient.newCall(request).execute()) { + if (response.code() == 204) { + return new byte[0]; + } + if (!response.isSuccessful()) { + throw new IllegalStateException("HTTP " + response.code()); + } + ResponseBody body = response.body(); + return body == null ? 
new byte[0] : body.bytes(); + } + } + + private void postBytes(String path, byte[] packet, int length) throws Exception { + byte[] bodyBytes = new byte[length]; + System.arraycopy(packet, 0, bodyBytes, 0, length); + postBytes(path, bodyBytes); + } + + private void postBytes(String path, byte[] bodyBytes) throws Exception { + Request.Builder builder = new Request.Builder() + .url(baseUrl + path) + .post(RequestBody.create(bodyBytes, OCTET_STREAM)); + applyFabricHeadersIfNeeded(builder, path); + Request request = builder.build(); + try (Response response = httpClient.newCall(request).execute()) { + if (!response.isSuccessful()) { + throw new IllegalStateException("HTTP " + response.code()); + } + } + } + + private String clientPacketPath(String clusterId, String vpnConnectionId, String suffix) { + String path = fabricServiceChannel.packetPathForBase(baseUrl, clusterId, vpnConnectionId, false); + if (path.isEmpty()) { + path = "/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets"; + } + return path + (suffix == null ? 
"" : suffix); + } + + private void applyFabricHeadersIfNeeded(Request.Builder builder, String path) { + if (path != null && path.contains("/fabric/service-channels/")) { + fabricServiceChannel.applyHeaders(builder); + } + } + + private byte[] encodePacketBatch(List<byte[]> packets) { + int total = 0; + for (byte[] packet : packets) { + if (packet != null && packet.length > 0) { + total += 4 + packet.length; + } + } + byte[] out = new byte[total]; + int offset = 0; + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + int length = packet.length; + out[offset] = (byte) ((length >> 24) & 0xff); + out[offset + 1] = (byte) ((length >> 16) & 0xff); + out[offset + 2] = (byte) ((length >> 8) & 0xff); + out[offset + 3] = (byte) (length & 0xff); + offset += 4; + System.arraycopy(packet, 0, out, offset, length); + offset += length; + } + return out; + } + + private JSONObject read(Request request) throws Exception { + try (Response response = httpClient.newCall(request).execute()) { + ResponseBody body = response.body(); + String text = body == null ? "" : body.string(); + if (!response.isSuccessful()) { + if (response.code() == 401 && text.contains("auth.invalid_credentials")) { + throw new IllegalStateException("Неверный логин или пароль."); + } + if (response.code() == 401 && text.contains("auth.invalid_refresh_token")) { + throw new IllegalStateException("Сессия устройства истекла. 
Введите пароль один раз."); + } + throw new IllegalStateException("HTTP " + response.code() + ": " + text); + } + return new JSONObject(text); + } + } + + private List<byte[]> decodePacketBatch(byte[] payload) { + List<byte[]> packets = new ArrayList<>(); + int offset = 0; + while (payload != null && offset + 4 <= payload.length) { + int length = ((payload[offset] & 0xff) << 24) + | ((payload[offset + 1] & 0xff) << 16) + | ((payload[offset + 2] & 0xff) << 8) + | (payload[offset + 3] & 0xff); + offset += 4; + if (length <= 0 || offset + length > payload.length) { + break; + } + byte[] packet = new byte[length]; + System.arraycopy(payload, offset, packet, 0, length); + packets.add(packet); + offset += length; + } + return packets; + } + + private List<List<byte[]>> chunkPacketsForBatch(List<byte[]> packets) { + List<List<byte[]>> chunks = new ArrayList<>(); + List<byte[]> current = new ArrayList<>(); + int currentBytes = 0; + boolean hasData = false; + for (byte[] packet : packets) { + if (packet == null || packet.length == 0) { + continue; + } + if (packet.length > MAX_SINGLE_PACKET_BYTES) { + continue; + } + hasData = true; + + int projected = currentBytes + MAX_BATCH_HEADER_BYTES + packet.length; + if (!current.isEmpty() && (current.size() >= MAX_PACKET_BATCH_PACKETS || projected > MAX_PACKET_BATCH_BYTES)) { + chunks.add(current); + current = new ArrayList<>(); + currentBytes = 0; + } + current.add(packet); + currentBytes += MAX_BATCH_HEADER_BYTES + packet.length; + } + if (!hasData) { + return chunks; + } + if (!current.isEmpty()) { + chunks.add(current); + } + return chunks; + } + + private boolean isLikelyPacketBatch(byte[] payload) { + if (payload == null || payload.length < MAX_BATCH_HEADER_BYTES) { + return false; + } + int offset = 0; + int consumed = 0; + while (offset + MAX_BATCH_HEADER_BYTES <= payload.length) { + int length = ((payload[offset] & 0xff) << 24) + | ((payload[offset + 1] & 0xff) << 16) + | ((payload[offset + 2] & 0xff) << 8) + | (payload[offset + 3] & 0xff); + offset += MAX_BATCH_HEADER_BYTES; + if (length <= 0 || length > 
MAX_SINGLE_PACKET_BYTES) { + return false; + } + if (offset + length > payload.length) { + return false; + } + offset += length; + consumed++; + if (consumed > MAX_PACKET_BATCH_PACKETS) { + return false; + } + } + return offset == payload.length && consumed > 0; + } + + private List<byte[]> receiveSinglePacketAsBatch(String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + byte[] payload = receiveClientPacket(clusterId, vpnConnectionId, timeoutMs); + if (payload == null || payload.length == 0) { + return new ArrayList<>(); + } + return new ArrayList<>(Collections.singletonList(payload)); + } + + private AuthContext parseAuthContext(JSONObject response) throws Exception { + JSONObject user = response.getJSONObject("user"); + String userId = user.optString("id", ""); + if (userId.isEmpty()) { + userId = user.optString("ID", ""); + } + JSONObject device = response.optJSONObject("device"); + String deviceId = device != null ? device.optString("id", "") : ""; + if (deviceId.isEmpty() && device != null) { + deviceId = device.optString("ID", ""); + } + JSONObject tokens = response.optJSONObject("tokens"); + String accessToken = tokens != null ? tokens.optString("access_token", "") : ""; + String accessExpiresAt = tokens != null ? tokens.optString("access_token_expires_at", "") : ""; + String refreshToken = tokens != null ? tokens.optString("refresh_token", "") : ""; + String refreshExpiresAt = tokens != null ? 
tokens.optString("refresh_token_expires_at", "") : ""; + return new AuthContext(userId, deviceId, accessToken, accessExpiresAt, refreshToken, refreshExpiresAt); + } + + private String trimRight(String value) { + while (value.endsWith("/")) { + value = value.substring(0, value.length() - 1); + } + return value; + } + + static final class ProtectedSocketFactory extends SocketFactory { + private final SocketFactory delegate = SocketFactory.getDefault(); + private final VpnService vpnService; + + ProtectedSocketFactory(VpnService vpnService) { + this.vpnService = vpnService; + } + + @Override + public Socket createSocket() throws IOException { + Socket socket = delegate.createSocket(); + socket.bind(null); + return protect(socket); + } + + @Override + public Socket createSocket(String host, int port) throws IOException { + Socket socket = createSocket(); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(String host, int port, InetAddress localHost, int localPort) throws IOException { + Socket socket = delegate.createSocket(); + socket.bind(new InetSocketAddress(localHost, localPort)); + protect(socket); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(InetAddress host, int port) throws IOException { + Socket socket = createSocket(); + socket.connect(new InetSocketAddress(host, port)); + return socket; + } + + @Override + public Socket createSocket(InetAddress address, int port, InetAddress localAddress, int localPort) throws IOException { + Socket socket = delegate.createSocket(); + socket.bind(new InetSocketAddress(localAddress, localPort)); + protect(socket); + socket.connect(new InetSocketAddress(address, port)); + return socket; + } + + private Socket protect(Socket socket) throws IOException { + if (!vpnService.protect(socket)) { + try { + socket.close(); + } catch (IOException ignored) { + } + throw new IOException("protect 
control-plane socket failed"); + } + return socket; + } + } + + static final class AuthContext { + final String userId; + final String deviceId; + final String accessToken; + final String accessTokenExpiresAt; + final String refreshToken; + final String refreshTokenExpiresAt; + + AuthContext(String userId, String deviceId, String accessToken, String accessTokenExpiresAt, String refreshToken, String refreshTokenExpiresAt) { + this.userId = userId; + this.deviceId = deviceId; + this.accessToken = accessToken; + this.accessTokenExpiresAt = accessTokenExpiresAt; + this.refreshToken = refreshToken; + this.refreshTokenExpiresAt = refreshTokenExpiresAt; + } + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/RapAutostartReceiver.java b/clients/android/app/src/main/java/su/cin/rapvpn/RapAutostartReceiver.java new file mode 100644 index 0000000..326b2a7 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/RapAutostartReceiver.java @@ -0,0 +1,54 @@ +package su.cin.rapvpn; + +import android.content.BroadcastReceiver; +import android.content.Context; +import android.content.Intent; +import android.content.SharedPreferences; +import android.net.VpnService; +import android.os.Build; + +public final class RapAutostartReceiver extends BroadcastReceiver { + private static final String PREFS = "rap-vpn"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_BACKEND_URL = "backend_url"; + private static final String PREF_CLUSTER_ID = "cluster_id"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + private static final String PREF_MANUAL_STOPPED = "manual_stopped"; + + @Override + public void onReceive(Context context, Intent intent) { + if (context == null || intent == null) { + return; + } + String action = intent.getAction(); + if (!Intent.ACTION_MY_PACKAGE_REPLACED.equals(action) + && !Intent.ACTION_BOOT_COMPLETED.equals(action)) { + return; + } + 
RapDiagnosticService.start(context); + SharedPreferences prefs = context.getSharedPreferences(PREFS, Context.MODE_PRIVATE); + if (prefs.getBoolean(PREF_MANUAL_STOPPED, false)) { + return; + } + String profile = prefs.getString(PREF_PROFILE_JSON, ""); + String backendUrl = prefs.getString(PREF_BACKEND_URL, ""); + String clusterId = prefs.getString(PREF_CLUSTER_ID, ""); + String vpnConnectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + if (profile.isEmpty() || backendUrl.isEmpty() || clusterId.isEmpty() || vpnConnectionId.isEmpty()) { + return; + } + if (VpnService.prepare(context) != null) { + return; + } + Intent service = new Intent(context, RapVpnService.class); + service.putExtra("profile_json", profile); + service.putExtra("backend_url", backendUrl); + service.putExtra("cluster_id", clusterId); + service.putExtra("vpn_connection_id", vpnConnectionId); + if (Build.VERSION.SDK_INT >= 26) { + context.startForegroundService(service); + } else { + context.startService(service); + } + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java b/clients/android/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java new file mode 100644 index 0000000..7257c05 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/RapDiagnosticService.java @@ -0,0 +1,1674 @@ +package su.cin.rapvpn; + +import android.app.Notification; +import android.app.NotificationChannel; +import android.app.NotificationManager; +import android.app.Service; +import android.content.Intent; +import android.content.SharedPreferences; +import android.content.pm.PackageManager; +import android.content.pm.ResolveInfo; +import android.net.ConnectivityManager; +import android.net.LinkProperties; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.Uri; +import android.net.VpnService; +import android.os.Build; +import android.os.Handler; +import android.os.IBinder; +import android.os.Looper; +import 
android.widget.Toast; + +import org.json.JSONObject; + +import java.net.DatagramPacket; +import java.net.DatagramSocket; +import java.net.HttpURLConnection; +import java.net.InetAddress; +import java.net.InetSocketAddress; +import java.net.SocketTimeoutException; +import java.net.Socket; +import java.net.URI; +import java.net.URL; +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Random; +import java.util.Set; +import java.text.SimpleDateFormat; +import java.util.Date; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.regex.Matcher; +import java.util.regex.Pattern; + +public class RapDiagnosticService extends Service { + static final String ACTION_START = "su.cin.rapvpn.DIAGNOSTIC_START"; + static final String ACTION_STOP = "su.cin.rapvpn.DIAGNOSTIC_STOP"; + static final String ACTION_RESTART = "su.cin.rapvpn.DIAGNOSTIC_RESTART"; + private static final String CHANNEL_ID = "rap-vpn-diagnostics"; + private static final String APP_VERSION = BuildConfig.VERSION_NAME; + private static final String DEFAULT_BACKEND_URL = BuildConfig.DEFAULT_BACKEND_URL; + private static final String INTERNAL_BACKEND_URL = "http://192.168.200.61:18080/api/v1"; + private static final String DEFAULT_CLUSTER_ID = BuildConfig.DEFAULT_CLUSTER_ID; + private static final String DEFAULT_ORGANIZATION_ID = BuildConfig.DEFAULT_ORGANIZATION_ID; + private static final String PREF_SELECTED_EXIT_NODE_ID = "selected_exit_node_id"; + private static final String PREFS = "rap-vpn"; + private static final String RUNTIME_PREFS = "rap-vpn-runtime"; + private static final String PREF_REFRESH_TOKEN = "refresh_token"; + private static final String PREF_USER_ID = "user_id"; + private static final String PREF_DEVICE_ID = "device_id"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + 
private volatile boolean running; + private Thread worker; + private Thread supervisor; + private String serviceState = ""; + private String lastCommandType = ""; + private String lastCommandResult = ""; + private long lastCommandAt = 0; + private long lastHeartbeatAt = 0; + private long lastCommandPollAt = 0; + private volatile long lastWorkerProgressAt = 0; + private volatile long heartbeatStartedAt = 0; + private volatile long commandPollStartedAt = 0; + private volatile long commandStartedAt = 0; + private String controlNetworkMode = ""; + private final AtomicBoolean heartbeatInProgress = new AtomicBoolean(false); + private final AtomicBoolean commandPollInProgress = new AtomicBoolean(false); + private final AtomicBoolean commandInProgress = new AtomicBoolean(false); + + @Override + public int onStartCommand(Intent intent, int flags, int startId) { + if (intent != null && ACTION_STOP.equals(intent.getAction())) { + running = false; + if (worker != null) { + worker.interrupt(); + } + if (supervisor != null) { + supervisor.interrupt(); + } + stopForeground(true); + stopSelfResult(startId); + return START_NOT_STICKY; + } + if (intent != null && ACTION_RESTART.equals(intent.getAction())) { + restartWorker(); + } + startForeground(1002, notification()); + startWorker(); + return START_STICKY; + } + + @Override + public void onDestroy() { + running = false; + if (worker != null) { + worker.interrupt(); + } + if (supervisor != null) { + supervisor.interrupt(); + } + super.onDestroy(); + } + + @Override + public IBinder onBind(Intent intent) { + return null; + } + + static void start(android.content.Context context) { + Intent intent = new Intent(context, RapDiagnosticService.class); + intent.setAction(ACTION_START); + if (Build.VERSION.SDK_INT >= 26) { + context.startForegroundService(intent); + } else { + context.startService(intent); + } + } + + private void startWorker() { + if (worker != null && worker.isAlive()) { + long age = System.currentTimeMillis() - 
lastWorkerProgressAt; + if (age > 45000) { + restartWorker(); + } else { + startSupervisor(); + return; + } + } + running = true; + lastWorkerProgressAt = System.currentTimeMillis(); + worker = new Thread(this::runLoop, "rap-vpn-diagnostic-service"); + worker.start(); + startSupervisor(); + } + + private void restartWorker() { + running = false; + Thread oldWorker = worker; + Thread oldSupervisor = supervisor; + worker = null; + supervisor = null; + if (oldWorker != null) { + oldWorker.interrupt(); + } + if (oldSupervisor != null) { + oldSupervisor.interrupt(); + } + heartbeatInProgress.set(false); + commandPollInProgress.set(false); + commandInProgress.set(false); + heartbeatStartedAt = 0; + commandPollStartedAt = 0; + commandStartedAt = 0; + serviceState = "diagnostic worker restarting"; + lastWorkerProgressAt = System.currentTimeMillis(); + running = true; + } + + private void startSupervisor() { + if (supervisor != null && supervisor.isAlive()) { + return; + } + supervisor = new Thread(() -> { + while (running) { + try { + Thread.sleep(10000); + long age = System.currentTimeMillis() - lastWorkerProgressAt; + releaseStaleBackgroundOperations(System.currentTimeMillis()); + if (age < 60000) { + continue; + } + Thread stale = worker; + if (stale != null) { + stale.interrupt(); + } + serviceState = "restarting stale diagnostic worker age_ms=" + age; + worker = null; + startWorker(); + } catch (InterruptedException e) { + return; + } catch (Exception ignored) { + } + } + }, "rap-vpn-diagnostic-supervisor"); + supervisor.start(); + } + + private void runLoop() { + while (running) { + try { + lastWorkerProgressAt = System.currentTimeMillis(); + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + String backendUrl = normalizeBackendUrl(prefs.getString("backend_url", DEFAULT_BACKEND_URL)); + if (!backendUrl.equals(prefs.getString("backend_url", ""))) { + prefs.edit().putString("backend_url", backendUrl).apply(); + } + String clusterId = 
prefs.getString("cluster_id", DEFAULT_CLUSTER_ID); + if (clusterId == null || clusterId.trim().isEmpty()) { + clusterId = DEFAULT_CLUSTER_ID; + } + String deviceId = prefs.getString(PREF_DEVICE_ID, ""); + if (backendUrl.isEmpty() || clusterId.isEmpty() || deviceId.isEmpty()) { + Thread.sleep(3000); + continue; + } + releaseStaleBackgroundOperations(System.currentTimeMillis()); + lastHeartbeatAt = System.currentTimeMillis(); + serviceState = "online " + new SimpleDateFormat("HH:mm:ss").format(new Date()); + writeLocalDiagnosticHeartbeat(); + startHeartbeatWorker(backendUrl, clusterId, deviceId, prefs); + if (!commandInProgress.get()) { + startCommandPollWorker(backendUrl, clusterId, deviceId); + } + Thread.sleep(1000); + } catch (InterruptedException ignored) { + return; + } catch (Exception e) { + lastWorkerProgressAt = System.currentTimeMillis(); + serviceState = "error: " + e.getMessage(); + try { + Thread.sleep(3000); + } catch (InterruptedException interrupted) { + return; + } + } + } + } + + private void writeLocalDiagnosticHeartbeat() { + getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE) + .edit() + .putLong("diagnostic_local_heartbeat_at", System.currentTimeMillis()) + .putString("diagnostic_local_state", serviceState) + .putString("diagnostic_local_app_version", APP_VERSION) + .apply(); + } + + private void releaseStaleBackgroundOperations(long now) { + if (heartbeatInProgress.get() && heartbeatStartedAt > 0 && now - heartbeatStartedAt > 30000) { + heartbeatInProgress.set(false); + heartbeatStartedAt = 0; + serviceState = "diagnostic heartbeat watchdog released stale heartbeat"; + } + if (commandPollInProgress.get() && commandPollStartedAt > 0 && now - commandPollStartedAt > 30000) { + commandPollInProgress.set(false); + commandPollStartedAt = 0; + serviceState = "diagnostic poll watchdog released stale poll"; + } + long commandAge = commandInProgress.get() && commandStartedAt > 0 ? 
now - commandStartedAt : 0; + if (commandAge > 120000) { + commandInProgress.set(false); + commandStartedAt = 0; + serviceState = "diagnostic command watchdog released stale command age_ms=" + commandAge; + } + } + + private void startHeartbeatWorker(String backendUrl, String clusterId, String deviceId, SharedPreferences prefs) { + if (!heartbeatInProgress.compareAndSet(false, true)) { + return; + } + Thread heartbeatWorker = new Thread(() -> { + try { + heartbeatStartedAt = System.currentTimeMillis(); + RapApiClient client = controlClient(backendUrl); + controlNetworkMode = client.networkMode(); + try { + maybeRestartVPNAfterAppUpgrade(client, clusterId, prefs); + } catch (Exception e) { + serviceState = "upgrade restart check warning: " + e.getMessage(); + } + lastWorkerProgressAt = System.currentTimeMillis(); + reportStatusWithFallback(backendUrl, clusterId, deviceId, statusPayload("heartbeat")); + lastWorkerProgressAt = System.currentTimeMillis(); + } catch (Exception e) { + serviceState = "heartbeat error: " + e.getMessage(); + lastWorkerProgressAt = System.currentTimeMillis(); + } finally { + heartbeatInProgress.set(false); + heartbeatStartedAt = 0; + } + }, "rap-vpn-diagnostic-heartbeat"); + heartbeatWorker.start(); + } + + private void startCommandPollWorker(String backendUrl, String clusterId, String deviceId) { + if (!commandPollInProgress.compareAndSet(false, true)) { + return; + } + Thread pollWorker = new Thread(() -> { + try { + commandPollStartedAt = System.currentTimeMillis(); + RapApiClient client = controlClient(backendUrl); + controlNetworkMode = client.networkMode(); + lastCommandPollAt = System.currentTimeMillis(); + JSONObject commandEnvelope = nextCommandWithFallback(backendUrl, clusterId, deviceId); + lastWorkerProgressAt = System.currentTimeMillis(); + if (commandEnvelope != null) { + startCommandWorker(backendUrl, clusterId, deviceId, commandEnvelope); + } + } catch (Exception e) { + serviceState = "command poll error: " + e.getMessage(); 
+ lastWorkerProgressAt = System.currentTimeMillis(); + } finally { + commandPollInProgress.set(false); + commandPollStartedAt = 0; + } + }, "rap-vpn-diagnostic-poll"); + pollWorker.start(); + } + + private void startCommandWorker(String backendUrl, String clusterId, String deviceId, JSONObject commandEnvelope) { + if (!commandInProgress.compareAndSet(false, true)) { + return; + } + Thread commandWorker = new Thread(() -> { + try { + commandStartedAt = System.currentTimeMillis(); + lastWorkerProgressAt = System.currentTimeMillis(); + RapApiClient commandClient = controlClient(backendUrl); + controlNetworkMode = commandClient.networkMode(); + handleCommand(backendUrl, commandClient, clusterId, deviceId, commandEnvelope); + lastWorkerProgressAt = System.currentTimeMillis(); + } catch (Exception e) { + lastWorkerProgressAt = System.currentTimeMillis(); + lastCommandType = "command_worker_error"; + lastCommandResult = e.getClass().getSimpleName() + ": " + e.getMessage(); + lastCommandAt = System.currentTimeMillis(); + serviceState = "command error: " + e.getMessage(); + try { + reportStatusWithFallback(backendUrl, clusterId, deviceId, statusPayload("command_worker_error")); + } catch (Exception ignored) { + } + } finally { + commandInProgress.set(false); + commandStartedAt = 0; + } + }, "rap-vpn-diagnostic-command"); + commandWorker.start(); + } + + private void handleCommand(String backendUrl, RapApiClient client, String clusterId, String deviceId, JSONObject envelope) throws Exception { + JSONObject command = envelope.optJSONObject("vpn_client_diagnostic_command"); + JSONObject payload = command == null ? 
envelope.optJSONObject("payload") : command.optJSONObject("payload"); + if (payload == null) { + return; + } + String type = payload.optString("type", ""); + JSONObject params = payload.optJSONObject("payload"); + if (params == null) { + params = payload; + } + String result; + if ("start_vpn".equals(type)) { + result = startVPNFromSavedProfile(); + } else if ("stop_vpn".equals(type)) { + Intent stopIntent = new Intent(this, RapVpnService.class); + stopIntent.setAction(RapVpnService.ACTION_STOP); + startService(stopIntent); + result = "stop_vpn accepted"; + } else if ("http_get".equals(type)) { + result = runHttpGet(params.optString("url", "http://192.168.200.61:18080/")); + } else if ("vpn_http_get".equals(type)) { + result = runVPNHttpGet(params.optString("url", "http://192.168.200.61:18080/")); + } else if ("vpn_page_probe".equals(type)) { + result = runVPNPageProbe(params); + } else if ("vpn_tcp_connect".equals(type) || "vpn_rdp_probe".equals(type)) { + result = runVPNTCPConnect(params.optString("host", "192.168.200.95"), params.optInt("port", 3389), params.optInt("timeout_ms", 7000)); + } else if ("vpn_dns_lookup".equals(type)) { + result = runVPNDNSLookup(params.optString("host", "2ip.ru")); + } else if ("open_url".equals(type)) { + String url = params.optString("url", "http://2ip.ru/"); + Intent open = new Intent(Intent.ACTION_VIEW, Uri.parse(url)); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(open); + result = "open_url accepted " + url; + } else if ("open_chrome_test".equals(type) || "open_external_browser_test".equals(type)) { + result = openExternalBrowserTest( + params.optString("url", "https://speedtest.rt.ru/"), + params.optString("package", "")); + } else if ("open_webview_test".equals(type) || "browser_page_test".equals(type)) { + result = openWebViewTest(params.optString("url", "https://speedtest.rt.ru/")); + } else if ("vpn_stats".equals(type)) { + result = collectVPNStats(client, clusterId); + } else if 
("device_network_snapshot".equals(type)) { + result = deviceNetworkSnapshot(); + } else if ("vpn_deep_test".equals(type)) { + result = runVPNDeepTest(client, clusterId, params); + } else if ("vpn_download_test".equals(type)) { + result = runVPNDownloadTest(params.optString("url", "http://192.168.200.61:18080/downloads/rap-android-rdp-vpn-build.json")); + } else if ("launch_telegram".equals(type)) { + result = openExternalURL(params.optString("url", "tg://resolve?domain=telegram")); + } else if ("remote_assist_start".equals(type)) { + showVisibleMessage("Необходим полный доступ к VPN-диагностике. Сеанс удаленной диагностики начат."); + result = "remote_assist_start accepted: scoped vpn diagnostics only"; + } else if ("remote_assist_end".equals(type)) { + String message = params.optString("message", "Сеанс удаленной диагностики завершен."); + showVisibleMessage(message); + result = "remote_assist_end accepted"; + } else if ("full_vpn_test".equals(type)) { + result = runFullVPNTest(client, clusterId, params); + } else if ("refresh_profile".equals(type)) { + result = refreshProfile(); + } else { + result = "unknown command " + type; + } + if (isRecoverableVPNProbe(type) && looksLikeVPNStall(result)) { + String firstResult = result; + String recovery = controlledRestartVPNRuntime(client, clusterId); + Thread.sleep(4000); + result = firstResult + " | recovery=" + recovery + " | retry=" + runVPNProbeCommand(type, params); + } + lastCommandType = type; + lastCommandResult = result; + lastCommandAt = System.currentTimeMillis(); + JSONObject report = statusPayload("command_result"); + report.put("command_type", type); + report.put("command_result", result); + try { + reportStatusWithFallback(backendUrl, clusterId, deviceId, report); + } catch (Exception e) { + serviceState = "command result report failed: " + e.getMessage(); + } + } + + private RapApiClient controlClient(String backendUrl) { + return new RapApiClient(backendUrl, this, true); + } + + private JSONObject 
nextCommandWithFallback(String backendUrl, String clusterId, String deviceId) throws Exception { + Exception last = null; + for (ControlEndpoint endpoint : controlEndpoints(backendUrl)) { + try { + RapApiClient client = endpoint.client(this); + controlNetworkMode = client.networkMode() + " " + endpoint.url; + return client.nextVPNDiagnosticCommand(clusterId, deviceId, 0); + } catch (Exception e) { + last = e; + } + } + if (last != null) { + throw last; + } + return null; + } + + private void reportStatusWithFallback(String backendUrl, String clusterId, String deviceId, JSONObject payload) throws Exception { + Exception last = null; + for (ControlEndpoint endpoint : controlEndpoints(backendUrl)) { + try { + RapApiClient client = endpoint.client(this); + controlNetworkMode = client.networkMode() + " " + endpoint.url; + client.reportVPNDiagnosticStatus(clusterId, deviceId, payload); + return; + } catch (Exception e) { + last = e; + } + } + if (last != null) { + throw last; + } + } + + private List<ControlEndpoint> controlEndpoints(String primary) { + List<ControlEndpoint> endpoints = new ArrayList<>(); + LinkedHashSet<String> seen = new LinkedHashSet<>(); + addControlEndpoint(endpoints, seen, DEFAULT_BACKEND_URL, false); + addControlEndpoint(endpoints, seen, primary, false); + boolean vpnRuntimeActive = isVpnRuntimeLikelyActive() || vpnNetwork() != null; + if (vpnRuntimeActive) { + addControlEndpoint(endpoints, seen, INTERNAL_BACKEND_URL, true); + } + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + addControlEndpoint(endpoints, seen, runtime.getString("packet_relay_active_base_url", ""), false); + String candidates = runtime.getString("packet_relay_candidate_urls", ""); + for (String candidate : candidates.split(",")) { + addControlEndpoint(endpoints, seen, candidate, false); + } + return endpoints; + } + + private void addControlEndpoint(List<ControlEndpoint> endpoints, LinkedHashSet<String> seen, String value, boolean viaVPN) { + String normalized = normalizeBackendUrl(value == null ? 
"" : value.trim()); + String key = (viaVPN ? "vpn|" : "direct|") + normalized; + if (!normalized.isEmpty() && seen.add(key)) { + endpoints.add(new ControlEndpoint(normalized, viaVPN)); + } + } + + private boolean isVpnRuntimeLikelyActive() { + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + long updatedAt = runtime.getLong("updated_at", 0); + if (updatedAt <= 0 || System.currentTimeMillis() - updatedAt > 30000) { + return false; + } + String relay = runtime.getString("packet_relay_active_base_url", ""); + String state = runtime.getString("state", ""); + if (!relay.isEmpty() && ("running".equals(state) + || "ready".equals(state) + || "warming".equals(state) + || "tunnel".equals(state) + || "relay".equals(state) + || "downlink".equals(state) + || "downlink_idle".equals(state) + || "runtime_recovery".equals(state))) { + return true; + } + long sent = runtime.getLong("uplink_sent_total", 0); + long down = runtime.getLong("downlink_received_total", 0); + return !relay.isEmpty() && (sent > 0 || down > 0); + } + + private static final class ControlEndpoint { + final String url; + final boolean viaVPN; + + ControlEndpoint(String url, boolean viaVPN) { + this.url = url; + this.viaVPN = viaVPN; + } + + RapApiClient client(RapDiagnosticService service) { + if (!viaVPN) { + return new RapApiClient(url, service, true); + } + Network vpn = service.vpnNetwork(); + if (vpn != null) { + return new RapApiClient(url, vpn); + } + return new RapApiClient(url); + } + } + + private String startVPNFromSavedProfile() { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + String profileJson = prefs.getString(PREF_PROFILE_JSON, ""); + String backendUrl = prefs.getString("backend_url", ""); + String clusterId = prefs.getString("cluster_id", ""); + String vpnConnectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + if (profileJson.isEmpty() || backendUrl.isEmpty() || clusterId.isEmpty() || vpnConnectionId.isEmpty()) { + return "start_vpn 
skipped: profile/backend/cluster/connection missing"; + } + if (VpnService.prepare(this) != null) { + Intent launcher = new Intent(this, TestVpnActivity.class); + launcher.putExtra(TestVpnActivity.EXTRA_PROFILE_JSON, profileJson); + launcher.putExtra(TestVpnActivity.EXTRA_BACKEND_URL, backendUrl); + launcher.putExtra(TestVpnActivity.EXTRA_CLUSTER_ID, clusterId); + launcher.putExtra(TestVpnActivity.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + launcher.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(launcher); + return "start_vpn permission required: opened vpn launcher " + vpnConnectionId; + } + Intent intent = new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, backendUrl); + intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, clusterId); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, vpnConnectionId); + if (Build.VERSION.SDK_INT >= 26) { + startForegroundService(intent); + } else { + startService(intent); + } + return "start_vpn accepted " + vpnConnectionId; + } + + private boolean isRecoverableVPNProbe(String type) { + return "vpn_http_get".equals(type) + || "vpn_page_probe".equals(type) + || "vpn_tcp_connect".equals(type) + || "vpn_rdp_probe".equals(type) + || "vpn_download_test".equals(type); + } + + private boolean looksLikeVPNStall(String result) { + if (result == null) { + return false; + } + return result.contains("SocketTimeoutException") + || result.contains("failed to connect") + || result.contains("Read timed out") + || result.contains("Connection reset") + || result.contains("No route to host"); + } + + private String runVPNProbeCommand(String type, JSONObject payload) { + if ("vpn_http_get".equals(type)) { + return runVPNHttpGet(payload.optString("url", "http://192.168.200.61:18080/")); + } + if ("vpn_page_probe".equals(type)) { + return runVPNPageProbe(payload); + } + if ("vpn_tcp_connect".equals(type) || 
"vpn_rdp_probe".equals(type)) { + return runVPNTCPConnect(payload.optString("host", "192.168.200.95"), payload.optInt("port", 3389), payload.optInt("timeout_ms", 7000)); + } + if ("vpn_download_test".equals(type)) { + return runVPNDownloadTest(payload.optString("url", "http://192.168.200.61:18080/downloads/rap-android-rdp-vpn-build.json")); + } + return "retry skipped: unsupported probe " + type; + } + + private String controlledRestartVPNRuntime(RapApiClient client, String clusterId) { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + String connectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + if (connectionId.isEmpty()) { + return "restart skipped: connection missing"; + } + try { + try { + client.resetVPNPacketQueues(clusterId, connectionId); + } catch (Exception e) { + return "restart failed: queue reset failed: " + e.getMessage(); + } + Thread.sleep(300); + return startVPNFromSavedProfile(); + } catch (Exception e) { + return "restart failed: " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private void maybeRestartVPNAfterAppUpgrade(RapApiClient client, String clusterId, SharedPreferences prefs) { + String lastVersion = prefs.getString("vpn_runtime_app_version", ""); + if (APP_VERSION.equals(lastVersion)) { + return; + } + String connectionId = prefs.getString(PREF_VPN_CONNECTION_ID, ""); + String profileJson = prefs.getString(PREF_PROFILE_JSON, ""); + if (connectionId.isEmpty() || profileJson.isEmpty()) { + prefs.edit().putString("vpn_runtime_app_version", APP_VERSION).apply(); + return; + } + serviceState = "upgrade restart " + lastVersion + " -> " + APP_VERSION; + try { + try { + client.resetVPNPacketQueues(clusterId, connectionId); + } catch (Exception ignored) { + } + Thread.sleep(300); + startVPNFromSavedProfile(); + prefs.edit().putString("vpn_runtime_app_version", APP_VERSION).apply(); + lastCommandType = "auto_upgrade_restart"; + lastCommandResult = "vpn runtime reinitialized after app upgrade " + 
lastVersion + " -> " + APP_VERSION; + lastCommandAt = System.currentTimeMillis(); + } catch (Exception e) { + lastCommandType = "auto_upgrade_restart"; + lastCommandResult = "vpn runtime upgrade restart failed: " + e.getClass().getSimpleName() + ": " + e.getMessage(); + lastCommandAt = System.currentTimeMillis(); + } + } + + private String refreshProfile() { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + try { + String refreshToken = new SecureTokenStore(this).get(PREF_REFRESH_TOKEN); + if (refreshToken.isEmpty()) { + return "refresh_profile skipped: refresh token missing"; + } + RapApiClient client = new RapApiClient(normalizeBackendUrl(prefs.getString("backend_url", "")), this, true); + RapApiClient.AuthContext auth = client.refresh(refreshToken); + String organizationId = prefs.getString("organization_id", DEFAULT_ORGANIZATION_ID); + String clusterId = prefs.getString("cluster_id", DEFAULT_CLUSTER_ID); + if (clusterId == null || clusterId.trim().isEmpty()) { + clusterId = DEFAULT_CLUSTER_ID; + } + if (organizationId == null || organizationId.trim().isEmpty()) { + organizationId = DEFAULT_ORGANIZATION_ID; + } + String exitNodeId = prefs.getString(PREF_SELECTED_EXIT_NODE_ID, ""); + String profileJson = client.vpnClientProfile(clusterId, organizationId, auth.userId, exitNodeId); + JSONObject root = new JSONObject(profileJson); + JSONObject profile = root.getJSONObject("vpn_client_profile"); + String connectionId = profile.getJSONArray("connections").getJSONObject(0).getString("id"); + prefs.edit() + .putString(PREF_USER_ID, auth.userId) + .putString(PREF_DEVICE_ID, auth.deviceId) + .putString(PREF_PROFILE_JSON, profileJson) + .putString(PREF_VPN_CONNECTION_ID, connectionId) + .apply(); + new SecureTokenStore(this).put(PREF_REFRESH_TOKEN, auth.refreshToken); + return "refresh_profile ok " + connectionId; + } catch (Exception e) { + return "refresh_profile failed: " + e.getMessage(); + } + } + + private JSONObject statusPayload(String event) 
throws Exception { + SharedPreferences prefs = getSharedPreferences(PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("event", event); + payload.put("app_version", APP_VERSION); + payload.put("service", "diagnostic"); + payload.put("user_id", prefs.getString(PREF_USER_ID, "")); + payload.put("device_id", prefs.getString(PREF_DEVICE_ID, "")); + payload.put("organization_id", prefs.getString("organization_id", "")); + payload.put("vpn_connection_id", prefs.getString(PREF_VPN_CONNECTION_ID, "")); + payload.put("backend_url", prefs.getString("backend_url", "")); + payload.put("control_network_mode", controlNetworkMode); + payload.put("profile_loaded", !prefs.getString(PREF_PROFILE_JSON, "").isEmpty()); + payload.put("runtime", runtimeSnapshot()); + payload.put("vpn_config", vpnConfigSnapshot()); + payload.put("service_state", serviceState); + payload.put("last_result", lastCommandResult); + payload.put("last_command_type", lastCommandType); + payload.put("last_command_result", lastCommandResult); + payload.put("last_command_at", lastCommandAt); + payload.put("last_heartbeat_at", lastHeartbeatAt); + payload.put("last_command_poll_at", lastCommandPollAt); + payload.put("browser_test", browserTestSnapshot()); + return payload; + } + + private String normalizeBackendUrl(String value) { + String candidate = value == null ? 
"" : value.trim().replaceAll("/+$", ""); + if (candidate.isEmpty()) { + return DEFAULT_BACKEND_URL; + } + return candidate; + } + + private String collectVPNStats(RapApiClient client, String clusterId) { + String connectionId = getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_VPN_CONNECTION_ID, ""); + if (connectionId.isEmpty()) { + return "vpn_stats skipped: connection missing"; + } + try { + JSONObject stats = client.vpnPacketStats(clusterId, connectionId); + return "vpn_stats " + compact(stats.toString(), 900); + } catch (Exception e) { + return "vpn_stats failed: " + e.getMessage(); + } + } + + private JSONObject browserTestSnapshot() throws Exception { + SharedPreferences prefs = getSharedPreferences(TestTrafficActivity.PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("state", prefs.getString("state", "")); + payload.put("message", prefs.getString("message", "")); + payload.put("progress", prefs.getInt("progress", 0)); + payload.put("url", prefs.getString("url", "")); + payload.put("target_url", prefs.getString("target_url", "")); + payload.put("error_type", prefs.getString("error_type", "")); + payload.put("asset_error_count", prefs.getInt("asset_error_count", 0)); + payload.put("main_error_count", prefs.getInt("main_error_count", 0)); + payload.put("http_error_count", prefs.getInt("http_error_count", 0)); + payload.put("updated_at", prefs.getLong("updated_at", 0)); + payload.put("http_probe", prefs.getString("http_probe", "")); + payload.put("http_probe_at", prefs.getLong("http_probe_at", 0)); + payload.put("dom_probe", prefs.getString("dom_probe", "")); + payload.put("dom_probe_at", prefs.getLong("dom_probe_at", 0)); + return payload; + } + + private String runFullVPNTest(RapApiClient client, String clusterId, JSONObject payload) { + String url = payload.optString("url", "http://2ip.ru/"); + int watchSeconds = payload.optInt("watch_seconds", 30); + if (watchSeconds < 5) { + watchSeconds = 5; + } + if (watchSeconds > 120) 
{ + watchSeconds = 120; + } + String connectionId = getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_VPN_CONNECTION_ID, ""); + StringBuilder result = new StringBuilder(); + try { + result.append(refreshProfile()).append(" | "); + if (!connectionId.isEmpty()) { + result.append("reset=").append(compact(client.resetVPNPacketQueues(clusterId, connectionId).toString(), 240)).append(" | "); + } + result.append(startVPNFromSavedProfile()).append(" | "); + Thread.sleep(3000); + Intent open = new Intent(Intent.ACTION_VIEW, Uri.parse(url)); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(open); + result.append("open_url=").append(url); + long deadline = System.currentTimeMillis() + watchSeconds * 1000L; + while (running && System.currentTimeMillis() < deadline) { + Thread.sleep(5000); + JSONObject report = statusPayload("full_vpn_test_watch"); + report.put("test_url", url); + if (!connectionId.isEmpty()) { + report.put("packet_stats", client.vpnPacketStats(clusterId, connectionId)); + } + client.reportVPNDiagnosticStatus(clusterId, getSharedPreferences(PREFS, MODE_PRIVATE).getString(PREF_DEVICE_ID, ""), report); + } + if (!connectionId.isEmpty()) { + result.append(" | stats=").append(compact(client.vpnPacketStats(clusterId, connectionId).toString(), 900)); + } + } catch (Exception e) { + result.append(" | full_vpn_test failed: ").append(e.getClass().getSimpleName()).append(": ").append(e.getMessage()); + } + return compact(result.toString(), 1200); + } + + private String runVPNDeepTest(RapApiClient client, String clusterId, JSONObject payload) { + String url = payload.optString("url", "http://2ip.ru/"); + String host = payload.optString("host", "2ip.ru"); + String localUrl = payload.optString("local_url", "http://192.168.200.61:18080/"); + StringBuilder result = new StringBuilder(); + result.append("network={").append(deviceNetworkSnapshot()).append("}"); + result.append(" | stats=").append(collectVPNStats(client, clusterId)); + result.append(" | 
dns=").append(runVPNDNSLookup(host)); + result.append(" | vpn_http=").append(runVPNHttpGet(url)); + result.append(" | vpn_local_http=").append(runVPNHttpGet(localUrl)); + result.append(" | download=").append(runVPNDownloadTest(payload.optString("download_url", "http://192.168.200.61:18080/downloads/rap-android-rdp-vpn-build.json"))); + return compact(result.toString(), 2500); + } + + private String runVPNDownloadTest(String target) { + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_download_test " + target + " -> vpn network not found"; + } + URL url = new URL(target); + HttpURLConnection connection = (HttpURLConnection) vpn.openConnection(url); + connection.setConnectTimeout(15000); + connection.setReadTimeout(20000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + int bytes = 0; + byte[] buffer = new byte[8192]; + try (java.io.InputStream input = connection.getInputStream()) { + while (bytes < 1024 * 1024) { + int n = input.read(buffer, 0, Math.min(buffer.length, 1024 * 1024 - bytes)); + if (n < 0) { + break; + } + bytes += n; + } + } + connection.disconnect(); + return "vpn_download_test " + target + " -> HTTP " + code + " bytes=" + bytes; + } catch (Exception e) { + return "vpn_download_test " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String openExternalURL(String target) { + try { + Intent open = new Intent(Intent.ACTION_VIEW, Uri.parse(target)); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK); + startActivity(open); + return "open_external_url accepted " + target; + } catch (Exception e) { + return "open_external_url " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String openWebViewTest(String target) { + try { + if (target == null || target.trim().isEmpty()) { + target = "https://speedtest.rt.ru/"; + } + Intent open = new Intent(this, TestTrafficActivity.class); + 
open.putExtra(TestTrafficActivity.EXTRA_URL, target); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK | Intent.FLAG_ACTIVITY_CLEAR_TOP | Intent.FLAG_ACTIVITY_SINGLE_TOP); + startActivity(open); + return "open_webview_test accepted " + target; + } catch (Exception e) { + return "open_webview_test " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String openExternalBrowserTest(String target, String requestedPackage) { + try { + if (target == null || target.trim().isEmpty()) { + target = "https://speedtest.rt.ru/"; + } + Uri uri = Uri.parse(target); + Intent open = new Intent(Intent.ACTION_VIEW, uri); + open.addCategory(Intent.CATEGORY_BROWSABLE); + open.addFlags(Intent.FLAG_ACTIVITY_NEW_TASK | Intent.FLAG_ACTIVITY_CLEAR_TOP | Intent.FLAG_ACTIVITY_SINGLE_TOP); + open.putExtra("com.android.browser.application_id", getPackageName()); + open.putExtra("create_new_tab", true); + + String packageName = selectBrowserPackage(open, requestedPackage); + if (!packageName.isEmpty()) { + open.setPackage(packageName); + } + startActivity(open); + return "open_external_browser_test accepted " + target + " package=" + (packageName.isEmpty() ? 
"default" : packageName); + } catch (Exception e) { + return "open_external_browser_test " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String selectBrowserPackage(Intent probe, String requestedPackage) { + PackageManager packageManager = getPackageManager(); + if (packageManager == null) { + return ""; + } + if (requestedPackage != null && !requestedPackage.trim().isEmpty() && isPackageAvailable(packageManager, requestedPackage.trim())) { + return requestedPackage.trim(); + } + List<String> preferred = Arrays.asList( + "com.android.chrome", + "com.chrome.beta", + "com.chrome.dev", + "com.google.android.apps.chrome", + "org.mozilla.firefox", + "com.sec.android.app.sbrowser", + "com.android.browser"); + for (String candidate : preferred) { + if (isPackageAvailable(packageManager, candidate)) { + return candidate; + } + } + try { + List<ResolveInfo> infos = packageManager.queryIntentActivities(probe, 0); + if (infos != null && !infos.isEmpty() && infos.get(0).activityInfo != null) { + return infos.get(0).activityInfo.packageName == null ? "" : infos.get(0).activityInfo.packageName; + } + } catch (Exception ignored) { + } + return ""; + } + + private boolean isPackageAvailable(PackageManager packageManager, String packageName) { + try { + packageManager.getPackageInfo(packageName, 0); + return true; + } catch (Exception ignored) { + return false; + } + } + + private void showVisibleMessage(String message) { + new Handler(Looper.getMainLooper()).post(() -> Toast.makeText(this, message, Toast.LENGTH_LONG).show()); + } + + private String deviceNetworkSnapshot() { + try { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return "connectivity unavailable"; + } + Network active = connectivity.getActiveNetwork(); + StringBuilder out = new StringBuilder(); + out.append("active=").append(active == null ? 
"none" : active.toString()); + Network[] networks = connectivity.getAllNetworks(); + out.append(" networks=").append(networks == null ? 0 : networks.length); + if (networks != null) { + for (Network network : networks) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + LinkProperties link = connectivity.getLinkProperties(network); + out.append(" [").append(network); + if (capabilities != null) { + out.append(" transports="); + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + out.append("VPN,"); + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_WIFI)) { + out.append("WIFI,"); + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_CELLULAR)) { + out.append("CELL,"); + } + out.append("internet=").append(capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)); + } + if (link != null) { + out.append(" dns="); + List<InetAddress> dns = link.getDnsServers(); + for (int i = 0; i < dns.size(); i++) { + if (i > 0) { + out.append(","); + } + out.append(dns.get(i).getHostAddress()); + } + out.append(" routes=").append(link.getRoutes().size()); + } + out.append("]"); + } + } + return compact(out.toString(), 1800); + } catch (Exception e) { + return "device_network_snapshot failed: " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String compact(String value, int maxLength) { + if (value == null) { + return ""; + } + String compacted = value.replace('\n', ' ').replace('\r', ' '); + if (compacted.length() <= maxLength) { + return compacted; + } + return compacted.substring(0, Math.max(0, maxLength - 3)) + "..."; + } + + private JSONObject runtimeSnapshot() throws Exception { + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("state", runtime.getString("state", "")); + payload.put("message", runtime.getString("message", "")); + payload.put("updated_at", runtime.getLong("updated_at", 
0)); + payload.put("runtime_started_at", runtime.getLong("runtime_started_at", 0)); + payload.put("diagnostic_local_heartbeat_at", runtime.getLong("diagnostic_local_heartbeat_at", 0)); + payload.put("diagnostic_local_state", runtime.getString("diagnostic_local_state", "")); + payload.put("diagnostic_local_app_version", runtime.getString("diagnostic_local_app_version", "")); + payload.put("uplink_read", runtime.getLong("uplink_read", 0)); + payload.put("uplink_sent", runtime.getLong("uplink_sent", 0)); + payload.put("downlink_received", runtime.getLong("downlink_received", 0)); + payload.put("uplink_read_total", runtime.getLong("uplink_read_total", 0)); + payload.put("uplink_read_bytes", runtime.getLong("uplink_read_bytes", 0)); + payload.put("uplink_sent_total", runtime.getLong("uplink_sent_total", 0)); + payload.put("uplink_sent_bytes", runtime.getLong("uplink_sent_bytes", 0)); + payload.put("downlink_received_total", runtime.getLong("downlink_received_total", 0)); + payload.put("downlink_received_bytes", runtime.getLong("downlink_received_bytes", 0)); + payload.put("uplink_read_mbps", runtime.getFloat("uplink_read_mbps", 0f)); + payload.put("uplink_sent_mbps", runtime.getFloat("uplink_sent_mbps", 0f)); + payload.put("downlink_received_mbps", runtime.getFloat("downlink_received_mbps", 0f)); + payload.put("uplink_read_pps", runtime.getFloat("uplink_read_pps", 0f)); + payload.put("uplink_sent_pps", runtime.getFloat("uplink_sent_pps", 0f)); + payload.put("downlink_received_pps", runtime.getFloat("downlink_received_pps", 0f)); + payload.put("uplink_dropped_packets", runtime.getLong("uplink_dropped_packets", 0)); + payload.put("uplink_dropped_bytes", runtime.getLong("uplink_dropped_bytes", 0)); + payload.put("uplink_filtered_packets", runtime.getLong("uplink_filtered_packets", 0)); + payload.put("uplink_filtered_bytes", runtime.getLong("uplink_filtered_bytes", 0)); + payload.put("uplink_bypassed_control_packets", runtime.getLong("uplink_bypassed_control_packets", 0)); 
+ payload.put("uplink_bypassed_control_bytes", runtime.getLong("uplink_bypassed_control_bytes", 0)); + payload.put("downlink_dropped_packets", runtime.getLong("downlink_dropped_packets", 0)); + payload.put("downlink_dropped_bytes", runtime.getLong("downlink_dropped_bytes", 0)); + payload.put("downlink_transport_checksum_repairs", runtime.getLong("downlink_transport_checksum_repairs", 0)); + payload.put("local_dns_queries", runtime.getLong("local_dns_queries", 0)); + payload.put("local_dns_replies", runtime.getLong("local_dns_replies", 0)); + payload.put("local_dns_errors", runtime.getLong("local_dns_errors", 0)); + payload.put("runtime_watchdog_recoveries", runtime.getLong("runtime_watchdog_recoveries", 0)); + payload.put("tcp_handshake_stalls", runtime.getLong("tcp_handshake_stalls", 0)); + payload.put("runtime_watchdog_hard_restarts", runtime.getLong("runtime_watchdog_hard_restarts", 0)); + payload.put("uplink_source_mismatch_packets", runtime.getLong("uplink_source_mismatch_packets", 0)); + payload.put("downlink_destination_mismatch_packets", runtime.getLong("downlink_destination_mismatch_packets", 0)); + payload.put("errors", runtime.getLong("errors", 0)); + payload.put("uplink", runtimePrefix(runtime, "uplink")); + payload.put("uplink_sender", runtimePrefix(runtime, "uplink_sender")); + payload.put("downlink", runtimePrefix(runtime, "downlink")); + payload.put("downlink_writer", runtimePrefix(runtime, "downlink_writer")); + payload.put("relay", runtimePrefix(runtime, "relay")); + payload.put("uplink_worker_count", runtime.getInt("uplink_worker_count", 0)); + payload.put("uplink_queue_depth_total", runtime.getInt("uplink_queue_depth_total", 0)); + payload.put("uplink_queue_depth_max", runtime.getInt("uplink_queue_depth_max", 0)); + payload.put("uplink_queue_depths", runtime.getString("uplink_queue_depths", "")); + payload.put("uplink_queue_0_offers", runtime.getLong("uplink_queue_0_offers", 0)); + payload.put("uplink_queue_1_offers", 
runtime.getLong("uplink_queue_1_offers", 0)); + payload.put("uplink_queue_2_offers", runtime.getLong("uplink_queue_2_offers", 0)); + payload.put("uplink_queue_3_offers", runtime.getLong("uplink_queue_3_offers", 0)); + payload.put("uplink_queue_0_drops", runtime.getLong("uplink_queue_0_drops", 0)); + payload.put("uplink_queue_1_drops", runtime.getLong("uplink_queue_1_drops", 0)); + payload.put("uplink_queue_2_drops", runtime.getLong("uplink_queue_2_drops", 0)); + payload.put("uplink_queue_3_drops", runtime.getLong("uplink_queue_3_drops", 0)); + payload.put("uplink_sender_worker_packets_0", runtime.getLong("uplink_sender_worker_packets_0", 0)); + payload.put("uplink_sender_worker_packets_1", runtime.getLong("uplink_sender_worker_packets_1", 0)); + payload.put("uplink_sender_worker_packets_2", runtime.getLong("uplink_sender_worker_packets_2", 0)); + payload.put("uplink_sender_worker_packets_3", runtime.getLong("uplink_sender_worker_packets_3", 0)); + payload.put("uplink_sender_worker_errors_0", runtime.getLong("uplink_sender_worker_errors_0", 0)); + payload.put("uplink_sender_worker_errors_1", runtime.getLong("uplink_sender_worker_errors_1", 0)); + payload.put("uplink_sender_worker_errors_2", runtime.getLong("uplink_sender_worker_errors_2", 0)); + payload.put("uplink_sender_worker_errors_3", runtime.getLong("uplink_sender_worker_errors_3", 0)); + payload.put("uplink_queue_depth", runtime.getInt("uplink_queue_depth", 0)); + payload.put("downlink_queue_depth", runtime.getInt("downlink_queue_depth", 0)); + payload.put("downlink_flow_queue_count", runtime.getInt("downlink_flow_queue_count", 0)); + payload.put("downlink_queue_depths", runtime.getString("downlink_queue_depths", "")); + payload.put("downlink_queue_depth_total", runtime.getInt("downlink_queue_depth_total", 0)); + payload.put("downlink_queue_depth_max", runtime.getInt("downlink_queue_depth_max", 0)); + payload.put("downlink_queued_packets", runtime.getLong("downlink_queued_packets", 0)); + 
payload.put("downlink_queue_waits", runtime.getLong("downlink_queue_waits", 0)); + payload.put("downlink_restarts", runtime.getLong("downlink_restarts", 0)); + payload.put("flow_isolation_mode", runtime.getString("flow_isolation_mode", "")); + return payload; + } + + private JSONObject vpnConfigSnapshot() throws Exception { + SharedPreferences runtime = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE); + JSONObject payload = new JSONObject(); + payload.put("vpn_address", runtime.getString("vpn_address", "")); + payload.put("dns_servers", runtime.getString("dns_servers", "")); + payload.put("routes", runtime.getString("routes", "")); + payload.put("full_tunnel", runtime.getBoolean("full_tunnel", false)); + payload.put("dataplane_session_status", runtime.getString("dataplane_session_status", "")); + payload.put("dataplane_preferred_transport", runtime.getString("dataplane_preferred_transport", "")); + payload.put("dataplane_fallback_transport", runtime.getString("dataplane_fallback_transport", "")); + payload.put("dataplane_entry_node_id", runtime.getString("dataplane_entry_node_id", "")); + payload.put("dataplane_exit_node_id", runtime.getString("dataplane_exit_node_id", "")); + payload.put("dataplane_selected_transport", runtime.getString("dataplane_selected_transport", "")); + payload.put("packet_relay_profile_base_url", runtime.getString("packet_relay_profile_base_url", "")); + payload.put("packet_relay_active_base_url", runtime.getString("packet_relay_active_base_url", "")); + payload.put("packet_relay_base_url", runtime.getString("packet_relay_base_url", "")); + payload.put("packet_relay_candidate_urls", runtime.getString("packet_relay_candidate_urls", "")); + payload.put("dataplane_transport_candidate_count", runtime.getInt("dataplane_transport_candidate_count", 0)); + payload.put("dataplane_entry_candidate_count", runtime.getInt("dataplane_entry_candidate_count", 0)); + return payload; + } + + private JSONObject runtimePrefix(SharedPreferences runtime, 
String prefix) throws Exception { + JSONObject payload = new JSONObject(); + payload.put("state", runtime.getString(prefix + "_state", "")); + payload.put("message", runtime.getString(prefix + "_message", "")); + payload.put("updated_at", runtime.getLong(prefix + "_updated_at", 0)); + payload.put("packets", runtime.getLong(prefix + "_packets", 0)); + payload.put("bytes", runtime.getLong(prefix + "_bytes", 0)); + payload.put("errors", runtime.getLong(prefix + "_errors", 0)); + payload.put("error_type", runtime.getString(prefix + "_error_type", "")); + payload.put("thread_alive", runtime.getBoolean(prefix + "_thread_alive", false)); + payload.put("rate_mbps", runtime.getFloat(prefix + "_rate_mbps", 0f)); + payload.put("rate_pps", runtime.getFloat(prefix + "_rate_pps", 0f)); + return payload; + } + + private String runHttpGet(String target) { + try { + HttpURLConnection connection = (HttpURLConnection) new URL(target).openConnection(); + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "http_get " + target + " -> HTTP " + code; + } catch (Exception e) { + return "http_get " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String runVPNHttpGet(String target) { + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_http_get " + target + " -> vpn network not found"; + } + URL url = new URL(target); + HttpURLConnection connection; + String resolved = ""; + if ("http".equalsIgnoreCase(url.getProtocol()) && !isIPv4Literal(url.getHost())) { + resolved = firstManualVPNAddress(vpn, url.getHost()); + } + if (!resolved.isEmpty()) { + URL resolvedURL = new URL(url.getProtocol(), resolved, url.getPort(), url.getFile()); + connection = (HttpURLConnection) vpn.openConnection(resolvedURL); + connection.setRequestProperty("Host", hostHeader(url)); + } else { + connection = 
(HttpURLConnection) vpn.openConnection(url); + } + try { + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "vpn_http_get " + target + " -> HTTP " + code; + } catch (UnknownHostException e) { + String fallbackResolved = firstManualVPNAddress(vpn, url.getHost()); + if (fallbackResolved.isEmpty() || !"http".equalsIgnoreCase(url.getProtocol())) { + throw e; + } + URL resolvedURL = new URL(url.getProtocol(), fallbackResolved, url.getPort(), url.getFile()); + connection = (HttpURLConnection) vpn.openConnection(resolvedURL); + connection.setRequestProperty("Host", hostHeader(url)); + connection.setConnectTimeout(15000); + connection.setReadTimeout(15000); + connection.setInstanceFollowRedirects(false); + int code = connection.getResponseCode(); + connection.disconnect(); + return "vpn_http_get " + target + " -> HTTP " + code; + } + } catch (Exception e) { + return "vpn_http_get " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String runVPNPageProbe(JSONObject payload) { + String target = payload.optString("url", "https://speedtest.rt.ru/"); + int maxAssets = payload.optInt("max_assets", 16); + int totalTimeoutMs = payload.optInt("timeout_ms", 70000); + if (maxAssets < 1) { + maxAssets = 1; + } + if (maxAssets > 40) { + maxAssets = 40; + } + if (totalTimeoutMs < 10000) { + totalTimeoutMs = 10000; + } + if (totalTimeoutMs > 120000) { + totalTimeoutMs = 120000; + } + long deadline = System.currentTimeMillis() + totalTimeoutMs; + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_page_probe " + target + " -> vpn network not found"; + } + FetchResult main = fetchVPNURL(vpn, new URL(target), Math.min(15000, totalTimeoutMs / 2), Math.min(20000, totalTimeoutMs / 2), 512 * 1024); + StringBuilder result = new StringBuilder(); + result.append("vpn_page_probe 
").append(target) + .append(" -> main HTTP ").append(main.code) + .append(" bytes=").append(main.bytes) + .append(" ms=").append(main.elapsedMs) + .append(" type=").append(main.contentType); + Set<String> assets = extractAssetURLs(new URL(target), main.body, maxAssets); + int ok = 0; + int failed = 0; + int skipped = 0; + long totalBytes = 0; + long maxMs = 0; + StringBuilder failures = new StringBuilder(); + int index = 0; + for (String asset : assets) { + index++; + long remaining = deadline - System.currentTimeMillis(); + if (remaining < 3000) { + skipped = assets.size() - index + 1; + appendFailure(failures, "#" + index + " deadline_reached skipped=" + skipped); + break; + } + try { + int connectTimeout = (int) Math.max(2000, Math.min(8000, remaining / 3)); + int readTimeout = (int) Math.max(3000, Math.min(10000, remaining / 2)); + FetchResult assetResult = fetchVPNURL(vpn, new URL(asset), connectTimeout, readTimeout, 256 * 1024); + totalBytes += assetResult.bytes; + maxMs = Math.max(maxMs, assetResult.elapsedMs); + if (assetResult.code >= 200 && assetResult.code < 400) { + ok++; + } else { + failed++; + appendFailure(failures, "#" + index + " HTTP " + assetResult.code + " " + compact(asset, 120)); + } + } catch (Exception e) { + failed++; + appendFailure(failures, "#" + index + " " + e.getClass().getSimpleName() + ":" + e.getMessage() + " " + compact(asset, 120)); + } + } + result.append(" | assets=").append(assets.size()) + .append(" ok=").append(ok) + .append(" failed=").append(failed) + .append(" skipped=").append(skipped) + .append(" asset_bytes=").append(totalBytes) + .append(" max_asset_ms=").append(maxMs); + if (failures.length() > 0) { + result.append(" | failures=").append(failures); + } + return compact(result.toString(), 2500); + } catch (Exception e) { + return "vpn_page_probe " + target + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String runVPNTCPConnect(String host, int port, int timeoutMs) { + if (port <= 0 || 
port > 65535) { + return "vpn_tcp_connect " + host + ":" + port + " -> invalid port"; + } + if (timeoutMs < 1000) { + timeoutMs = 1000; + } + if (timeoutMs > 30000) { + timeoutMs = 30000; + } + long started = System.currentTimeMillis(); + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_tcp_connect " + host + ":" + port + " -> vpn network not found"; + } + InetAddress address = resolveForVPN(vpn, host); + try (Socket socket = new Socket()) { + vpn.bindSocket(socket); + socket.connect(new InetSocketAddress(address, port), timeoutMs); + long elapsed = System.currentTimeMillis() - started; + return "vpn_tcp_connect " + host + ":" + port + " -> connected address=" + address.getHostAddress() + " ms=" + elapsed; + } + } catch (Exception e) { + long elapsed = System.currentTimeMillis() - started; + return "vpn_tcp_connect " + host + ":" + port + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage() + " ms=" + elapsed; + } + } + + private FetchResult fetchVPNURL(Network vpn, URL url, int connectTimeoutMs, int readTimeoutMs, int maxBytes) throws Exception { + long started = System.currentTimeMillis(); + HttpURLConnection connection = (HttpURLConnection) vpn.openConnection(url); + connection.setConnectTimeout(connectTimeoutMs); + connection.setReadTimeout(readTimeoutMs); + connection.setInstanceFollowRedirects(true); + connection.setRequestProperty("User-Agent", "RAP-VPN-Diagnostic/" + APP_VERSION); + int code = connection.getResponseCode(); + String contentType = connection.getContentType(); + int bytes = 0; + StringBuilder body = new StringBuilder(Math.min(maxBytes, 65536)); + try (java.io.InputStream input = code >= 400 ? 
connection.getErrorStream() : connection.getInputStream()) { + if (input != null) { + byte[] buffer = new byte[8192]; + while (bytes < maxBytes) { + int want = Math.min(buffer.length, maxBytes - bytes); + int read = input.read(buffer, 0, want); + if (read < 0) { + break; + } + if (body.length() < 131072) { + body.append(new String(buffer, 0, read, java.nio.charset.StandardCharsets.UTF_8)); + } + bytes += read; + } + } + } finally { + connection.disconnect(); + } + return new FetchResult(code, bytes, System.currentTimeMillis() - started, contentType == null ? "" : contentType, body.toString()); + } + + private Set<String> extractAssetURLs(URL base, String html, int maxAssets) { + Set<String> out = new LinkedHashSet<>(); + if (html == null || html.isEmpty()) { + return out; + } + Pattern pattern = Pattern.compile("(?i)(?:src|href)\\s*=\\s*[\"']([^\"'#]+)[\"']"); + Matcher matcher = pattern.matcher(html); + while (matcher.find() && out.size() < maxAssets) { + String raw = matcher.group(1).trim(); + if (raw.isEmpty() || raw.startsWith("data:") || raw.startsWith("mailto:") || raw.startsWith("tel:")) { + continue; + } + try { + URL resolved = new URL(base, raw); + String protocol = resolved.getProtocol(); + if ("http".equalsIgnoreCase(protocol) || "https".equalsIgnoreCase(protocol)) { + out.add(resolved.toString()); + } + } catch (Exception ignored) { + } + } + return out; + } + + private void appendFailure(StringBuilder failures, String item) { + if (failures.length() > 0) { + failures.append("; "); + } + failures.append(item); + } + + private InetAddress resolveForVPN(Network vpn, String host) throws Exception { + if (isIPv4Literal(host)) { + return InetAddress.getByName(host); + } + String preferredExitAddress = firstManualVPNAddress(vpn, host); + if (!preferredExitAddress.isEmpty()) { + return InetAddress.getByName(preferredExitAddress); + } + Exception resolverError = null; + try { + InetAddress[] addresses = vpn.getAllByName(host); + if (addresses != null && addresses.length > 0) 
{ + return addresses[0]; + } + } catch (Exception e) { + resolverError = e; + } + if (resolverError != null) { + String fallbackAddress = firstUnderlyingDNSAddress(host); + if (!fallbackAddress.isEmpty()) { + return InetAddress.getByName(fallbackAddress); + } + throw resolverError; + } + throw new UnknownHostException(host); + } + + private static class FetchResult { + final int code; + final int bytes; + final long elapsedMs; + final String contentType; + final String body; + + FetchResult(int code, int bytes, long elapsedMs, String contentType, String body) { + this.code = code; + this.bytes = bytes; + this.elapsedMs = elapsedMs; + this.contentType = contentType; + this.body = body; + } + } + + private boolean isIPv4Literal(String host) { + if (host == null) { + return false; + } + String[] parts = host.split("\\."); + if (parts.length != 4) { + return false; + } + try { + for (String part : parts) { + int value = Integer.parseInt(part); + if (value < 0 || value > 255) { + return false; + } + } + return true; + } catch (NumberFormatException e) { + return false; + } + } + + private String runVPNDNSLookup(String host) { + try { + Network vpn = vpnNetwork(); + if (vpn == null) { + return "vpn_dns_lookup " + host + " -> vpn network not found"; + } + StringBuilder result = new StringBuilder(); + try { + InetAddress[] system = vpn.getAllByName(host); + result.append("system="); + appendAddresses(result, system); + } catch (Exception e) { + result.append("system=").append(e.getClass().getSimpleName()).append(":").append(e.getMessage()); + } + String manual = manualVPNDNSLookup(vpn, host); + result.append(" manual=").append(manual); + return "vpn_dns_lookup " + host + " -> " + result; + } catch (Exception e) { + return "vpn_dns_lookup " + host + " -> " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + } + + private String firstManualVPNAddress(Network vpn, String host) { + String result = manualVPNDNSLookup(vpn, host); + if (result.startsWith("ok:")) { + 
String addresses = result.substring(3); + int comma = addresses.indexOf(','); + return comma >= 0 ? addresses.substring(0, comma) : addresses; + } + return ""; + } + + private String firstUnderlyingDNSAddress(String host) { + for (String server : Arrays.asList("1.1.1.1", "8.8.8.8", "9.9.9.9")) { + String result = manualDNSLookupOnUnderlyingNetwork(server, host); + if (result.startsWith("ok:")) { + String addresses = result.substring(3); + int comma = addresses.indexOf(','); + return comma >= 0 ? addresses.substring(0, comma) : addresses; + } + } + return ""; + } + + private String manualDNSLookupOnUnderlyingNetwork(String dnsServer, String host) { + try (DatagramSocket socket = new DatagramSocket()) { + Network network = underlyingNetwork(); + if (network != null) { + network.bindSocket(socket); + } + socket.setSoTimeout(2500); + byte[] query = buildDNSQuery(host); + DatagramPacket packet = new DatagramPacket(query, query.length, InetAddress.getByName(dnsServer), 53); + socket.send(packet); + byte[] response = new byte[512]; + DatagramPacket answer = new DatagramPacket(response, response.length); + socket.receive(answer); + List<String> addresses = parseDNSAResponse(response, answer.getLength()); + if (addresses.isEmpty()) { + return "empty:" + dnsServer; + } + return "ok:" + String.join(",", addresses); + } catch (Exception e) { + return e.getClass().getSimpleName() + ":" + e.getMessage(); + } + } + + private Network underlyingNetwork() { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities == null) { + continue; + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + continue; + } + if (!capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)) { + continue; + } + return network; + } + 
return null; + } + + private String manualVPNDNSLookup(Network vpn, String host) { + String dnsServers = getSharedPreferences(RUNTIME_PREFS, MODE_PRIVATE).getString("dns_servers", ""); + if (dnsServers.isEmpty()) { + return "skipped:no_dns_servers"; + } + String dnsServer = dnsServers.split(",", 2)[0].trim(); + if (dnsServer.isEmpty()) { + return "skipped:no_dns_servers"; + } + try (DatagramSocket socket = new DatagramSocket()) { + vpn.bindSocket(socket); + socket.setSoTimeout(5000); + byte[] query = buildDNSQuery(host); + DatagramPacket packet = new DatagramPacket(query, query.length, InetAddress.getByName(dnsServer), 53); + socket.send(packet); + byte[] response = new byte[512]; + DatagramPacket answer = new DatagramPacket(response, response.length); + socket.receive(answer); + List<String> addresses = parseDNSAResponse(response, answer.getLength()); + if (addresses.isEmpty()) { + return "empty:" + dnsServer; + } + return "ok:" + String.join(",", addresses); + } catch (SocketTimeoutException e) { + return "timeout:" + dnsServer; + } catch (Exception e) { + return e.getClass().getSimpleName() + ":" + e.getMessage(); + } + } + + private byte[] buildDNSQuery(String host) throws Exception { + byte[] out = new byte[512]; + int id = new Random().nextInt(0xffff); + out[0] = (byte) ((id >> 8) & 0xff); + out[1] = (byte) (id & 0xff); + out[2] = 0x01; + out[5] = 0x01; + int offset = 12; + for (String label : host.split("\\.")) { + byte[] bytes = label.getBytes("UTF-8"); + out[offset++] = (byte) bytes.length; + System.arraycopy(bytes, 0, out, offset, bytes.length); + offset += bytes.length; + } + out[offset++] = 0; + out[offset++] = 0; + out[offset++] = 1; + out[offset++] = 0; + out[offset++] = 1; + byte[] query = new byte[offset]; + System.arraycopy(out, 0, query, 0, offset); + return query; + } + + private List<String> parseDNSAResponse(byte[] packet, int length) { + List<String> addresses = new ArrayList<>(); + if (length < 12) { + return addresses; + } + int qd = u16(packet, 4); + int an = 
u16(packet, 6); + int offset = 12; + for (int i = 0; i < qd; i++) { + offset = skipDNSName(packet, length, offset); + offset += 4; + if (offset > length) { + return addresses; + } + } + for (int i = 0; i < an && offset < length; i++) { + offset = skipDNSName(packet, length, offset); + if (offset + 10 > length) { + return addresses; + } + int type = u16(packet, offset); + int cls = u16(packet, offset + 2); + int rdLen = u16(packet, offset + 8); + offset += 10; + if (type == 1 && cls == 1 && rdLen == 4 && offset + 4 <= length) { + addresses.add((packet[offset] & 0xff) + "." + (packet[offset + 1] & 0xff) + "." + (packet[offset + 2] & 0xff) + "." + (packet[offset + 3] & 0xff)); + } + offset += rdLen; + } + return addresses; + } + + private int skipDNSName(byte[] packet, int length, int offset) { + while (offset < length) { + int value = packet[offset] & 0xff; + offset++; + if (value == 0) { + break; + } + if ((value & 0xc0) == 0xc0) { + offset++; + break; + } + offset += value; + } + return offset; + } + + private int u16(byte[] packet, int offset) { + if (packet == null || offset + 1 >= packet.length) { + return 0; + } + return ((packet[offset] & 0xff) << 8) | (packet[offset + 1] & 0xff); + } + + private void appendAddresses(StringBuilder result, InetAddress[] addresses) { + if (addresses == null || addresses.length == 0) { + result.append("empty"); + return; + } + for (int i = 0; i < addresses.length; i++) { + if (i > 0) { + result.append(","); + } + result.append(addresses[i].getHostAddress()); + } + } + + private String hostHeader(URL url) { + if (url.getPort() > 0) { + return url.getHost() + ":" + url.getPort(); + } + return url.getHost(); + } + + private Network vpnNetwork() { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + long deadline = System.currentTimeMillis() + 3000; + do { + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities 
capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities != null && capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + return network; + } + } + try { + Thread.sleep(200); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + return null; + } + } while (System.currentTimeMillis() < deadline); + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities != null && capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + return network; + } + } + return null; + } + + private Notification notification() { + if (Build.VERSION.SDK_INT >= 26) { + NotificationChannel channel = new NotificationChannel(CHANNEL_ID, "RAP VPN diagnostics", NotificationManager.IMPORTANCE_LOW); + NotificationManager manager = getSystemService(NotificationManager.class); + if (manager != null) { + manager.createNotificationChannel(channel); + } + } + Notification.Builder builder = Build.VERSION.SDK_INT >= 26 ? 
new Notification.Builder(this, CHANNEL_ID) : new Notification.Builder(this); + return builder + .setContentTitle("RAP VPN diagnostics") + .setContentText("Diagnostic channel is active") + .setSmallIcon(android.R.drawable.stat_sys_upload_done) + .build(); + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/RapVpnService.java b/clients/android/app/src/main/java/su/cin/rapvpn/RapVpnService.java new file mode 100644 index 0000000..3fde64d --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/RapVpnService.java @@ -0,0 +1,3530 @@ +package su.cin.rapvpn; + +import android.app.Notification; +import android.app.NotificationChannel; +import android.app.NotificationManager; +import android.content.SharedPreferences; +import android.content.Intent; +import android.net.ConnectivityManager; +import android.net.Network; +import android.net.NetworkCapabilities; +import android.net.VpnService; +import android.os.Build; +import android.os.ParcelFileDescriptor; +import android.os.StrictMode; +import android.system.ErrnoException; +import android.system.Os; +import android.system.OsConstants; +import android.util.Log; + +import org.json.JSONArray; +import org.json.JSONObject; + +import java.io.FileDescriptor; +import java.io.FileInputStream; +import java.net.DatagramPacket; +import java.net.DatagramSocket; +import java.net.Inet4Address; +import java.net.InetAddress; +import java.net.URI; +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.LinkedHashSet; +import java.util.List; +import java.util.Map; +import java.util.Set; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.TimeUnit; +import java.util.concurrent.atomic.AtomicBoolean; +import java.util.concurrent.atomic.AtomicLong; + +public class RapVpnService extends VpnService { + static final String EXTRA_PROFILE_JSON = "profile_json"; + static final String EXTRA_BACKEND_URL = "backend_url"; + static final 
String EXTRA_CLUSTER_ID = "cluster_id"; + static final String EXTRA_VPN_CONNECTION_ID = "vpn_connection_id"; + static final String ACTION_STOP = "su.cin.rapvpn.STOP"; + private static final String CHANNEL_ID = "rap-vpn"; + private static final String TAG = "RapVpnService"; + private static final String PREFS = "rap-vpn-runtime"; + private static final int DEFAULT_VPN_MTU = 1420; + private static final int VPN_BATCH_MAX_PACKETS = 512; + private static final int VPN_BATCH_MAX_BYTES = 1024 * 1024; + private static final int UPLINK_WORKER_MAX_COUNT = 4; + private static final int UPLINK_QUEUE_CAPACITY = 32768; + private static final int PRIORITY_QUEUE_CAPACITY = 4096; + private static final int UPLINK_SEND_RETRY_COUNT = 2; + private static final int UPLINK_SEND_RETRY_SLEEP_MS = 80; + private static final int DOWNLINK_QUEUE_CAPACITY = 8192; + private static final int DOWNLINK_FLOW_QUEUE_MAX_COUNT = 8; + private static final int DOWNLINK_QUEUE_OFFER_MS = 50; + private static final int DOWNLINK_WRITER_IDLE_WAIT_MS = 20; + private static final int DOWNLINK_POLL_MS_MIN = 2; + private static final int DOWNLINK_POLL_MS_MAX = 40; + private static final int DOWNLINK_POLL_MS_STEP = 4; + private static final int UPLINK_BATCH_GATHER_MS = 2; + private static final int TUN_WRITE_MAX_RETRIES = 1000; + private static final int TUN_EAGAIN_SLEEP_MS = 1; + private static final int RUNTIME_DETAIL_INTERVAL_MS = 250; + private static final int RUNTIME_STATUS_INTERVAL_MS = 500; + private static final int RUNTIME_WATCHDOG_INTERVAL_MS = 2000; + private static final int RUNTIME_WATCHDOG_STALE_SYNACK_MS = 15000; + private static final int RUNTIME_WATCHDOG_RECOVERY_COOLDOWN_MS = 60000; + private static final int RUNTIME_WATCHDOG_HARD_RESTART_COOLDOWN_MS = 180000; + private static final int VPN_START_WARMUP_TIMEOUT_MS = 6000; + private static final String[] DEFAULT_DNS_PROBE_DOMAINS = new String[]{ + "speedtest.rt.ru", + "2ip.ru", + "mail.ru", + "qms.ru", + "linenew-host.qms.ru", + "tula.qms.ru", 
+ "vladimir.qms.ru", + "moscow.qms.ru", + "moscow1.qms.ru", + "spb.qms.ru", + "kazan.qms.ru", + "samara.qms.ru", + "ufa.qms.ru", + "voronezh.qms.ru", + "krasnodar.qms.ru", + "rostov-na-donu1.qms.ru", + "nizhnynovgorod1.qms.ru", + "ekat10.qms.ru", + "t2msk1-host.qms.ru", + "t2msk3-host.qms.ru", + "rline-host.qms.ru", + "timernet-host.qms.ru" + }; + private static final String PREF_NAME = "rap-vpn"; + private static final String PREF_PROFILE_JSON = "profile_json"; + private static final String PREF_BACKEND_URL = "backend_url"; + private static final String PREF_CLUSTER_ID = "cluster_id"; + private static final String PREF_VPN_CONNECTION_ID = "vpn_connection_id"; + private static final String PREF_MANUAL_STOPPED = "manual_stopped"; + private ParcelFileDescriptor tunnel; + private volatile FileDescriptor uplinkTunFd; + private volatile FileInputStream uplinkTunInput; + private volatile FileDescriptor downlinkTunFd; + private Thread uplinkThread; + private Thread[] uplinkSenderThreads; + private Thread downlinkThread; + private Thread downlinkWriterThread; + private Thread runtimeWatchdogThread; + private BlockingQueue[] uplinkQueues; + private BlockingQueue[] downlinkQueues; + private BlockingQueue uplinkPriorityQueue; + private BlockingQueue downlinkPriorityQueue; + private volatile AtomicLong[] uplinkQueueOffersByWorker; + private volatile AtomicLong[] uplinkQueueDropsByWorker; + private volatile AtomicLong[] uplinkSenderPacketsByWorker; + private volatile AtomicLong[] uplinkSenderErrorsByWorker; + private volatile AtomicLong[] downlinkQueueOffersByFlow; + private volatile AtomicLong[] downlinkQueueDropsByFlow; + private volatile AtomicLong[] downlinkWriterPacketsByFlow; + private volatile int uplinkWorkerCount; + private volatile int downlinkFlowQueueCount; + private volatile boolean running; + private volatile String vpnAddressIPv4 = "10.77.0.2"; + private volatile byte[] vpnAddressIPv4Bytes = new byte[]{10, 77, 0, 2}; + private volatile byte[][] backendBypassIPv4s 
= new byte[0][]; + private volatile int backendBypassPort; + private volatile boolean fastPathModeEnabled; + private volatile long downlinkRestarts; + private volatile long lastRuntimeDetailAt; + private volatile long lastRuntimeStatusAt; + private volatile long runtimeStartedAt; + private volatile long lastThroughputCalcAt; + private volatile long lastRateUplinkReadBytes; + private volatile long lastRateUplinkSentBytes; + private volatile long lastRateDownlinkReceivedBytes; + private volatile float uplinkReadMbps; + private volatile float uplinkSentMbps; + private volatile float downlinkReceivedMbps; + private volatile float uplinkReadPps; + private volatile float uplinkSentPps; + private volatile float downlinkReceivedPps; + private final AtomicLong uplinkReadPackets = new AtomicLong(); + private final AtomicLong uplinkReadBytes = new AtomicLong(); + private final AtomicLong uplinkSentPackets = new AtomicLong(); + private final AtomicLong uplinkSentBytes = new AtomicLong(); + private final AtomicLong downlinkReceivedPackets = new AtomicLong(); + private final AtomicLong downlinkReceivedBytes = new AtomicLong(); + private final AtomicLong uplinkDroppedPackets = new AtomicLong(); + private final AtomicLong uplinkDroppedBytes = new AtomicLong(); + private final AtomicLong uplinkFilteredPackets = new AtomicLong(); + private final AtomicLong uplinkFilteredBytes = new AtomicLong(); + private final AtomicLong downlinkDroppedPackets = new AtomicLong(); + private final AtomicLong downlinkDroppedBytes = new AtomicLong(); + private final AtomicLong downlinkTransportChecksumRepairs = new AtomicLong(); + private final AtomicLong uplinkBypassedControlPackets = new AtomicLong(); + private final AtomicLong uplinkBypassedControlBytes = new AtomicLong(); + private final AtomicLong uplinkSourceMismatchPackets = new AtomicLong(); + private final AtomicLong downlinkDestinationMismatchPackets = new AtomicLong(); + private final AtomicLong uplinkQueuedPackets = new AtomicLong(); + 
private final AtomicLong downlinkQueuedPackets = new AtomicLong(); + private final AtomicLong downlinkQueueWaits = new AtomicLong(); + private final AtomicLong localDnsQueries = new AtomicLong(); + private final AtomicLong localDnsReplies = new AtomicLong(); + private final AtomicLong localDnsErrors = new AtomicLong(); + private final AtomicLong runtimeWatchdogRecoveries = new AtomicLong(); + private final AtomicLong tcpHandshakeStalls = new AtomicLong(); + private final AtomicLong runtimeWatchdogHardRestarts = new AtomicLong(); + private final AtomicBoolean hardRuntimeRestartInProgress = new AtomicBoolean(); + private volatile boolean relaxedUplinkSourceValidation; + private volatile boolean relaxedDownlinkDestinationValidation; + private volatile String activeConnectionIdByProfile; + private volatile String activePacketRelayUrlByProfile; + private volatile List activePacketRelayUrlsByProfile = new ArrayList<>(); + private volatile int activePacketRelayIndex; + private volatile VpnPacketWebSocketRelay packetWebSocketRelay; + private volatile FabricServiceChannel activeFabricServiceChannel = new FabricServiceChannel(); + private final Object packetRelaySwitchLock = new Object(); + private final Map clientSourceNat = new LinkedHashMap(4096, 0.75f, true) { + @Override + protected boolean removeEldestEntry(Map.Entry eldest) { + return size() > 4096; + } + }; + private final Map pendingTcpHandshakes = new LinkedHashMap(4096, 0.75f, true) { + @Override + protected boolean removeEldestEntry(Map.Entry eldest) { + return size() > 4096; + } + }; + private volatile String shutdownReason = "service destroyed"; + private volatile long lastRuntimeWatchdogRecoveryAt; + private volatile long lastRuntimeWatchdogHardRestartAt; + private volatile long lastDiagnosticEnsureAt; + private volatile long lastDiagnosticStatusEnsureAt; + private volatile boolean nextDiagnosticEnsureMayRestart; + private static final int ADDRESS_MISMATCH_TOLERANCE_PACKETS = 16; + + @Override + public int 
onStartCommand(Intent intent, int flags, int startId) { + if (intent != null && ACTION_STOP.equals(intent.getAction())) { + shutdownReason = "stop requested"; + writeRuntimeStatus("stopping", shutdownReason, 0, 0, 0, 0); + markManualStopped(true); + shutdown(); + stopSelfResult(startId); + return START_NOT_STICKY; + } + ensureDiagnosticServiceRunning(); + startForeground(1001, notification()); + writeRuntimeStatus("starting", "starting vpn service", 0, 0, 0, 0); + SharedPreferences prefs = getSharedPreferences(PREF_NAME, MODE_PRIVATE); + boolean explicitStart = intent != null && (intent.hasExtra(EXTRA_PROFILE_JSON) + || intent.hasExtra(EXTRA_BACKEND_URL) + || intent.hasExtra(EXTRA_CLUSTER_ID) + || intent.hasExtra(EXTRA_VPN_CONNECTION_ID)); + if (!explicitStart && prefs.getBoolean(PREF_MANUAL_STOPPED, false)) { + shutdownReason = "manual stop preserved"; + writeRuntimeStatus("stopped", shutdownReason, 0, 0, 0, 0); + shutdown(); + stopSelf(); + return START_NOT_STICKY; + } + markManualStopped(false); + String profile = extraOrSaved(intent, EXTRA_PROFILE_JSON, PREF_PROFILE_JSON, prefs); + String backendUrl = extraOrSaved(intent, EXTRA_BACKEND_URL, PREF_BACKEND_URL, prefs); + startSafeInterface(profile == null ? 
"" : profile, backendUrl); + if (tunnel == null) { + shutdownReason = "vpn start failed: tunnel is null"; + writeRuntimeStatus("error", "vpn start failed: tunnel is null", 0, 0, 0, 0); + stopSelf(); + return START_NOT_STICKY; + } + String clusterId = extraOrSaved(intent, EXTRA_CLUSTER_ID, PREF_CLUSTER_ID, prefs); + String vpnConnectionId = extraOrSaved(intent, EXTRA_VPN_CONNECTION_ID, PREF_VPN_CONNECTION_ID, prefs); + if ((vpnConnectionId == null || vpnConnectionId.isEmpty()) && activeConnectionIdByProfile != null && !activeConnectionIdByProfile.isEmpty()) { + vpnConnectionId = activeConnectionIdByProfile; + } + if ((vpnConnectionId == null || vpnConnectionId.isEmpty()) && profile != null && !profile.isEmpty() && activeConnectionIdByProfile == null) { + vpnConnectionId = getLastVpnConnectionId(); + } + if ((vpnConnectionId == null || vpnConnectionId.isEmpty())) { + vpnConnectionId = getLastVpnConnectionId(); + } + if (vpnConnectionId == null || vpnConnectionId.isEmpty()) { + shutdownReason = "missing vpn connection id"; + writeRuntimeStatus("error", "missing vpn connection id; retry profile load", 0, 0, 0, 0); + stopSelf(); + return START_NOT_STICKY; + } + persistStartConfig(profile, backendUrl, clusterId, vpnConnectionId); + List packetRelayUrls = activePacketRelayUrlsByProfile == null || activePacketRelayUrlsByProfile.isEmpty() + ? 
singletonUrl(activePacketRelayUrlByProfile) + : new ArrayList<>(activePacketRelayUrlsByProfile); + if (packetRelayUrls.isEmpty()) { + packetRelayUrls.add(backendUrl); + } + startPacketRelay(backendUrl, packetRelayUrls, clusterId, vpnConnectionId); + if (!running) { + shutdownReason = "relay not running"; + writeRuntimeStatus("error", "vpn not started: relay not running", 0, 0, 0, 0); + stopSelf(); + return START_NOT_STICKY; + } + if (tunnel == null || backendUrl == null || backendUrl.isEmpty() + || clusterId == null || clusterId.isEmpty() + || vpnConnectionId == null || vpnConnectionId.isEmpty()) { + shutdownReason = "invalid runtime"; + writeRuntimeStatus("error", "vpn not started: invalid runtime", 0, 0, 0, 0); + stopSelf(); + return START_NOT_STICKY; + } + writeRuntimeStatus("running", "vpn service active " + vpnAddressIPv4, 0, 0, downlinkReceivedPackets.get(), 0); + startVPNReadinessWarmup(configuredDnsServers(), configuredDnsProbeDomains(), vpnConnectionId); + shutdownReason = "running"; + return START_NOT_STICKY; + } + + private void ensureDiagnosticServiceRunning() { + try { + RapDiagnosticService.start(this); + writeRuntimeDetail("diagnostic_start", "diagnostic service start requested by vpn runtime", "control", 0, 0, "", -1); + } catch (Exception e) { + Log.w(TAG, "diagnostic service start failed", e); + writeRuntimeDetail("diagnostic_start_failed", e.getMessage(), "control", 0, 1, e.getClass().getSimpleName(), -1); + } + } + + private void ensureDiagnosticServiceHealthy() { + try { + SharedPreferences runtime = getSharedPreferences(PREFS, MODE_PRIVATE); + long lastLocalHeartbeat = runtime.getLong("diagnostic_local_heartbeat_at", 0); + long age = lastLocalHeartbeat <= 0 ? Long.MAX_VALUE : System.currentTimeMillis() - lastLocalHeartbeat; + boolean restart = nextDiagnosticEnsureMayRestart && age > 45000; + Intent intent = new Intent(this, RapDiagnosticService.class); + intent.setAction(restart ? 
RapDiagnosticService.ACTION_RESTART : RapDiagnosticService.ACTION_START); + if (Build.VERSION.SDK_INT >= 26) { + startForegroundService(intent); + } else { + startService(intent); + } + nextDiagnosticEnsureMayRestart = true; + writeRuntimeDetail( + restart ? "diagnostic_restart" : "diagnostic_start", + (restart ? "diagnostic service restart requested age_ms=" : "diagnostic service start requested age_ms=") + age, + "control", + 0, + 0, + "", + -1); + } catch (Exception e) { + Log.w(TAG, "diagnostic service health ensure failed", e); + writeRuntimeDetail("diagnostic_start_failed", e.getMessage(), "control", 0, 1, e.getClass().getSimpleName(), -1); + } + } + + private void ensureDiagnosticFromRuntimeStatus(long now) { + if (now - lastDiagnosticStatusEnsureAt < 10000) { + return; + } + lastDiagnosticStatusEnsureAt = now; + try { + long lastLocalHeartbeat = getSharedPreferences(PREFS, MODE_PRIVATE).getLong("diagnostic_local_heartbeat_at", 0); + long age = lastLocalHeartbeat <= 0 ? Long.MAX_VALUE : now - lastLocalHeartbeat; + if (age > 45000) { + ensureDiagnosticServiceHealthy(); + } + } catch (Exception ignored) { + } + } + + @Override + public void onDestroy() { + writeRuntimeStatus("stopped", shutdownReason == null || shutdownReason.isEmpty() ? "service destroyed" : shutdownReason, 0, 0, 0, 0); + shutdown(); + super.onDestroy(); + } + + @Override + public void onRevoke() { + shutdownReason = "vpn permission revoked by Android"; + writeRuntimeStatus("revoked", shutdownReason, 0, 0, 0, 0); + shutdown(); + stopSelf(); + super.onRevoke(); + } + + private String extraOrSaved(Intent intent, String extraName, String prefName, SharedPreferences prefs) { + String value = intent != null ? 
intent.getStringExtra(extraName) : ""; + if (value != null && !value.isEmpty()) { + return value; + } + return prefs.getString(prefName, ""); + } + + private void persistStartConfig(String profile, String backendUrl, String clusterId, String vpnConnectionId) { + try { + getSharedPreferences(PREF_NAME, MODE_PRIVATE).edit() + .putString(PREF_PROFILE_JSON, profile == null ? "" : profile) + .putString(PREF_BACKEND_URL, backendUrl == null ? "" : backendUrl) + .putString(PREF_CLUSTER_ID, clusterId == null ? "" : clusterId) + .putString(PREF_VPN_CONNECTION_ID, vpnConnectionId == null ? "" : vpnConnectionId) + .apply(); + } catch (Exception ignored) { + } + } + + private void shutdown() { + try { + running = false; + closeTunHandles(); + closeTunnelQuietly(); + stopPacketRelay(); + } catch (Exception ignored) { + } + writeRuntimeStatus("stopped", shutdownReason == null || shutdownReason.isEmpty() ? "vpn stopped" : shutdownReason, 0, 0, 0, 0); + if (Build.VERSION.SDK_INT >= 24) { + stopForeground(STOP_FOREGROUND_REMOVE); + } else { + stopForeground(true); + } + } + + private void markManualStopped(boolean stopped) { + try { + getSharedPreferences(PREF_NAME, MODE_PRIVATE).edit() + .putBoolean(PREF_MANUAL_STOPPED, stopped) + .apply(); + } catch (Exception ignored) { + } + } + + private void startSafeInterface(String profileJson, String backendUrl) { + try { + running = false; + closeTunHandles(); + closeTunnelQuietly(); + stopPacketRelay(); + resetRuntimeMetrics(); + activePacketRelayUrlByProfile = ""; + activePacketRelayUrlsByProfile = new ArrayList<>(); + activeFabricServiceChannel = new FabricServiceChannel(); + VpnClientConfig config = parseClientConfig(profileJson, backendUrl); + SharedPreferences prefs = getSharedPreferences(PREF_NAME, MODE_PRIVATE); + boolean forceFullTunnel = prefs.getBoolean(MainActivity.PREF_FORCE_FULL_TUNNEL, true); + fastPathModeEnabled = forceFullTunnel || config.fullTunnel; + if (forceFullTunnel && !config.fullTunnel) { + config.fullTunnel = 
true; + config.configNotes.add("Mobile setting: forced full tunnel"); + } + if (forceFullTunnel) { + config.configNotes.add("Runtime setting: full tunnel mandatory"); + } + writeRuntimeConfig(config, forceFullTunnel, fastPathModeEnabled); + Builder builder = new Builder() + .setSession("RAP HOME VPN") + .setMtu(config.mtu) + .setBlocking(true); + try { + builder.allowFamily(android.system.OsConstants.AF_INET); + } catch (Throwable ignore) { + } + vpnAddressIPv4 = cidrHost(config.vpnAddress); + vpnAddressIPv4Bytes = ipv4Bytes(vpnAddressIPv4); + if (vpnAddressIPv4Bytes == null || vpnAddressIPv4Bytes.length != 4) { + vpnAddressIPv4 = "10.77.0.2"; + vpnAddressIPv4Bytes = new byte[]{10, 77, 0, 2}; + } + addCIDRAddress(builder, config.vpnAddress); + for (String dnsServer : config.dnsServers) { + builder.addDnsServer(dnsServer); + addCIDRRoute(builder, dnsServer + "/32"); + } + int routeCount = 0; + for (String route : config.effectiveRoutes()) { + if (addCIDRRoute(builder, route)) { + routeCount++; + } + } + if (!config.fullTunnel && routeCount == 0) { + config.fullTunnel = true; + config.configNotes.add("No usable split routes received; fallback to full tunnel"); + addCIDRRoute(builder, "0.0.0.0/0"); + } + writeRuntimeConfig(config); + setUnderlyingNetworks(builder); + tunnel = builder.establish(); + if (tunnel == null) { + Log.e(TAG, "vpn tunnel establish returned null"); + writeRuntimeStatus("error", "tunnel establish returned null", 0, 0, 0, 0); + } else { + writeRuntimeStatus("tunnel", "tunnel established " + config.vpnAddress, 0, 0, 0, 0); + } + activeConnectionIdByProfile = config.selectedConnectionId; + if (activeConnectionIdByProfile != null && !activeConnectionIdByProfile.isEmpty()) { + persistVpnConnectionId(activeConnectionIdByProfile); + } + activePacketRelayUrlByProfile = config.packetRelayBaseUrl; + activePacketRelayUrlsByProfile = new ArrayList<>(config.packetRelayBaseUrls); + activeFabricServiceChannel = config.fabricServiceChannel; + } catch (Exception 
e) { + Log.e(TAG, "vpn tunnel establish failed", e); + writeRuntimeStatus("error", "tunnel failed: " + e.getMessage(), 0, 0, 0, 0); + activeConnectionIdByProfile = null; + } + } + + private VpnClientConfig parseClientConfig(String profileJson, String backendUrl) { + VpnClientConfig config = new VpnClientConfig(); + config.vpnAddress = "10.77.0.2/32"; + try { + JSONObject root = new JSONObject(profileJson == null ? "" : profileJson); + JSONObject profile = root.optJSONObject("vpn_client_profile"); + if (profile == null) { + profile = root; + } + JSONArray connections = profile != null ? profile.optJSONArray("connections") : null; + JSONObject connection = null; + String selectedConnectionId = null; + String waitingConnectionId = null; + if (connections != null && connections.length() > 0) { + for (int i = 0; i < connections.length(); i++) { + JSONObject candidate = connections.optJSONObject(i); + if (candidate == null) { + continue; + } + String candidateId = candidate.optString("id", "").trim(); + if (selectedConnectionId == null && !candidateId.isEmpty()) { + selectedConnectionId = candidateId; + } + JSONObject candidateClientConfig = candidate.optJSONObject("client_config"); + JSONObject candidateRoute = candidateClientConfig != null ? candidateClientConfig.optJSONObject("vpn_fabric_route") : null; + String status = candidateRoute != null ? 
candidateRoute.optString("status", "").trim().toLowerCase() : ""; + if ("planned".equals(status) && connection == null) { + String entry = candidateRoute.optString("selected_entry_node_id", "").trim(); + String exit = candidateRoute.optString("selected_exit_node_id", "").trim(); + if (!entry.isEmpty() && !exit.isEmpty()) { + connection = candidate; + selectedConnectionId = candidateId; + break; + } + } + if (("connecting".equals(status) || "active".equals(status) || "assigned".equals(status)) + && waitingConnectionId == null && !candidateId.isEmpty()) { + waitingConnectionId = candidateId; + } + } + } + if (connection == null && connections != null && connections.length() > 0) { + connection = connections.optJSONObject(0); + } + if (connection == null) { + config.selectedConnectionId = waitingConnectionId != null ? waitingConnectionId : selectedConnectionId; + return config; + } + JSONObject clientConfig = connection.optJSONObject("client_config"); + if (clientConfig != null) { + String vpnAddress = clientConfig.optString("vpn_address", ""); + if (!vpnAddress.isEmpty()) { + config.vpnAddress = vpnAddress; + } + config.mtu = parseMtu(clientConfig.optInt("mtu", config.mtu)); + if (clientConfig.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + readStringArray(clientConfig.optJSONArray("dns_servers"), config.dnsServers, true); + readStringArray(clientConfig.optJSONArray("dns_probe_domains"), config.dnsProbeDomains, false); + readStringArray(clientConfig.optJSONArray("routes"), config.splitRoutes, false); + JSONObject dataplaneSession = clientConfig.optJSONObject("vpn_dataplane_session"); + if (dataplaneSession != null) { + config.dataplaneSessionStatus = dataplaneSession.optString("status", ""); + config.dataplanePreferredTransport = dataplaneSession.optString("preferred_transport", ""); + config.dataplaneFallbackTransport = dataplaneSession.optString("fallback_transport", ""); + config.dataplaneEntryNodeId = 
dataplaneSession.optString("entry_node_id", ""); + config.dataplaneExitNodeId = dataplaneSession.optString("exit_node_id", ""); + JSONArray transportCandidates = dataplaneSession.optJSONArray("transport_candidates"); + config.dataplaneTransportCandidateCount = transportCandidates == null ? 0 : transportCandidates.length(); + JSONArray entryCandidates = dataplaneSession.optJSONArray("entry_candidates"); + config.dataplaneEntryCandidateCount = entryCandidates == null ? 0 : entryCandidates.length(); + config.dataplaneSelectedTransport = selectDataplanePacketTransport(dataplaneSession); + config.packetRelayBaseUrls.addAll(selectDataplanePacketRelayBaseUrls(dataplaneSession, backendUrl)); + if (!config.packetRelayBaseUrls.isEmpty()) { + config.packetRelayBaseUrl = config.packetRelayBaseUrls.get(0); + } + } + JSONObject serviceChannelLease = clientConfig.optJSONObject("fabric_service_channel_lease"); + if (serviceChannelLease != null) { + FabricServiceChannel channel = FabricServiceChannel.fromLease(serviceChannelLease); + if (channel.enabled) { + config.fabricServiceChannel = channel; + config.configNotes.add("Fabric service channel enabled: " + channel.channelId); + } + } + } + JSONObject routePolicy = connection.optJSONObject("route_policy"); + if (routePolicy != null) { + if (routePolicy.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + readStringArray(routePolicy.optJSONArray("allowed_cidrs"), config.splitRoutes, false); + if (config.dnsServers.isEmpty()) { + readStringArray(routePolicy.optJSONArray("dns_servers"), config.dnsServers, true); + } + readStringArray(routePolicy.optJSONArray("dns_probe_domains"), config.dnsProbeDomains, false); + } + JSONArray routePolicies = connection.optJSONArray("route_policies"); + if (routePolicies != null) { + for (int i = 0; i < routePolicies.length(); i++) { + JSONObject item = routePolicies.optJSONObject(i); + if (item != null && "allow".equals(item.optString("action"))) { + String destination = 
item.optString("destination", ""); + if (!destination.isEmpty()) { + config.splitRoutes.add(destination); + } + JSONObject policy = item.optJSONObject("policy"); + if (policy != null && policy.optBoolean("full_tunnel", false)) { + config.fullTunnel = true; + } + } + } + } + if (selectedConnectionId == null) { + selectedConnectionId = connection.optString("id", "").trim(); + } + if ((selectedConnectionId == null || selectedConnectionId.isEmpty()) && waitingConnectionId != null) { + selectedConnectionId = waitingConnectionId; + } + config.selectedConnectionId = selectedConnectionId; + } catch (Exception ignored) { + config.configNotes.add("Failed parsing profile: using defaults"); + } + return config; + } + + private int parseMtu(int mtu) { + if (mtu <= 0) { + return DEFAULT_VPN_MTU; + } + if (mtu < 576) { + return 576; + } + if (mtu > 1500) { + return 1500; + } + return mtu; + } + + private void readStringArray(JSONArray array, Set<String> target, boolean replace) { + if (array == null) { + return; + } + if (replace) { + target.clear(); + } + for (int i = 0; i < array.length(); i++) { + String value = array.optString(i, ""); + if (!value.isEmpty()) { + target.add(value); + } + } + } + + private String selectDataplanePacketTransport(JSONObject dataplaneSession) { + JSONObject candidate = selectSafeEntryDirectHTTPCandidate(dataplaneSession); + return candidate == null ? 
"" : "entry_direct_http_v1"; + } + + private List<String> selectDataplanePacketRelayBaseUrls(JSONObject dataplaneSession, String backendUrl) { + List<String> urls = new ArrayList<>(); + JSONObject candidate = selectSafeEntryDirectHTTPCandidate(dataplaneSession); + if (candidate == null) { + return urls; + } + JSONArray entryCandidates = candidate.optJSONArray("entry_candidates"); + if (entryCandidates == null) { + return urls; + } + List<JSONObject> ordered = orderEntryCandidatesForNetwork(entryCandidates, backendUrl); + for (int i = 0; i < ordered.size(); i++) { + JSONObject entry = ordered.get(i); + if (entry == null) { + continue; + } + String apiBaseUrl = normalizeHTTPBaseUrl(entry.optString("api_base_url", "")); + if (!apiBaseUrl.isEmpty()) { + addUniqueUrl(urls, apiBaseUrl); + continue; + } + String address = normalizeHTTPBaseUrl(entry.optString("address", "")); + if (!address.isEmpty()) { + addUniqueUrl(urls, address + "/api/v1"); + } + } + return urls; + } + + private List<JSONObject> orderEntryCandidatesForNetwork(JSONArray entryCandidates, String backendUrl) { + List<JSONObject> publicEntries = new ArrayList<>(); + List<JSONObject> privateEntries = new ArrayList<>(); + List<JSONObject> otherEntries = new ArrayList<>(); + for (int i = 0; i < entryCandidates.length(); i++) { + JSONObject entry = entryCandidates.optJSONObject(i); + if (entry == null) { + continue; + } + String reachability = entry.optString("reachability", ""); + String url = entry.optString("api_base_url", ""); + if (url.isEmpty()) { + url = entry.optString("address", ""); + } + boolean privateAddress = isPrivateURLHost(url); + if ("public".equalsIgnoreCase(reachability) && !privateAddress) { + publicEntries.add(entry); + } else if ("private".equalsIgnoreCase(reachability) || privateAddress) { + privateEntries.add(entry); + } else { + otherEntries.add(entry); + } + } + List<JSONObject> ordered = new ArrayList<>(); + if (isPrivateURLHost(backendUrl)) { + ordered.addAll(privateEntries); + ordered.addAll(publicEntries); + } else { + ordered.addAll(publicEntries); + 
ordered.addAll(privateEntries); + } + ordered.addAll(otherEntries); + return ordered; + } + + private JSONObject selectSafeEntryDirectHTTPCandidate(JSONObject dataplaneSession) { + if (dataplaneSession == null) { + return null; + } + JSONArray transportCandidates = dataplaneSession.optJSONArray("transport_candidates"); + if (transportCandidates == null) { + return null; + } + for (int i = 0; i < transportCandidates.length(); i++) { + JSONObject candidate = transportCandidates.optJSONObject(i); + if (candidate == null) { + continue; + } + if (!"entry_direct_http_v1".equals(candidate.optString("type", ""))) { + continue; + } + if (!candidate.optBoolean("safe_client_switch", false)) { + continue; + } + String status = candidate.optString("status", "").trim().toLowerCase(); + if (!"available".equals(status) && !status.startsWith("available_")) { + continue; + } + return candidate; + } + return null; + } + + private String normalizeHTTPBaseUrl(String value) { + if (value == null) { + return ""; + } + value = value.trim(); + while (value.endsWith("/")) { + value = value.substring(0, value.length() - 1); + } + if (value.isEmpty()) { + return ""; + } + try { + URI uri = URI.create(value); + String scheme = uri.getScheme(); + String host = uri.getHost(); + if (host == null || host.isEmpty()) { + return ""; + } + if (!"http".equalsIgnoreCase(scheme) && !"https".equalsIgnoreCase(scheme)) { + return ""; + } + return value; + } catch (Exception ignored) { + return ""; + } + } + + private boolean isPrivateURLHost(String value) { + if (value == null || value.trim().isEmpty()) { + return false; + } + try { + URI uri = URI.create(value.trim()); + String host = uri.getHost(); + if (host == null || host.isEmpty()) { + return false; + } + byte[] bytes = ipv4Bytes(host); + if (bytes == null || bytes.length != 4) { + return "localhost".equalsIgnoreCase(host); + } + int first = bytes[0] & 0xff; + int second = bytes[1] & 0xff; + return first == 10 + || first == 127 + || (first == 172 && 
second >= 16 && second <= 31) + || (first == 192 && second == 168) + || (first == 169 && second == 254); + } catch (Exception ignored) { + return false; + } + } + + private void writeRuntimeConfig(VpnClientConfig config, boolean forceFullTunnel, boolean fastPathMode) { + try { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putBoolean("force_full_tunnel", forceFullTunnel) + .putBoolean("fast_path_enabled", fastPathMode) + .apply(); + } catch (Exception ignored) { + } + } + + private void writeRuntimeConfig(VpnClientConfig config) { + try { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("vpn_address", config.vpnAddress) + .putInt("vpn_mtu", config.mtu) + .putString("dns_servers", join(config.dnsServers)) + .putString("dns_probe_domains", join(config.effectiveDnsProbeDomains())) + .putString("routes", join(config.effectiveRoutes())) + .putBoolean("full_tunnel", config.fullTunnel) + .putString("config_notes", configNotes(config)) + .putString("dataplane_session_status", config.dataplaneSessionStatus) + .putString("dataplane_preferred_transport", config.dataplanePreferredTransport) + .putString("dataplane_fallback_transport", config.dataplaneFallbackTransport) + .putString("dataplane_entry_node_id", config.dataplaneEntryNodeId) + .putString("dataplane_exit_node_id", config.dataplaneExitNodeId) + .putString("dataplane_selected_transport", config.dataplaneSelectedTransport) + .putString("packet_relay_profile_base_url", config.packetRelayBaseUrl) + .putString("packet_relay_active_base_url", "") + .putString("packet_relay_base_url", config.packetRelayBaseUrl) + .putString("packet_relay_candidate_urls", joinList(config.packetRelayBaseUrls)) + .putInt("dataplane_transport_candidate_count", config.dataplaneTransportCandidateCount) + .putInt("dataplane_entry_candidate_count", config.dataplaneEntryCandidateCount) + .commit(); + } catch (Exception ignored) { + } + } + + private String join(Set<String> values) { + StringBuilder out = new StringBuilder(); + for 
(String value : values) { + if (out.length() > 0) { + out.append(","); + } + out.append(value); + } + return out.toString(); + } + + private void addCIDRAddress(Builder builder, String cidr) { + int[] parsed = parseCidr(cidr); + if (parsed == null) { + configLog("Invalid VPN address, fallback to default: " + cidr); + return; + } + try { + builder.addAddress(ipv4String(parsed), parsed[4]); + } catch (Exception e) { + configLog("Failed to set VPN address " + cidr + ": " + e.getMessage()); + } + } + + private boolean addCIDRRoute(Builder builder, String cidr) { + int[] parsed = parseCidr(cidr); + if (parsed == null) { + configLog("Skip invalid route: " + cidr); + return false; + } + try { + builder.addRoute(ipv4String(parsed), parsed[4]); + return true; + } catch (Exception e) { + configLog("Failed adding route " + cidr + ": " + e.getMessage()); + return false; + } + } + + private int[] parseCidr(String value) { + if (value == null) { + return null; + } + String[] parts = value.trim().split("/", 2); + if (parts.length == 0 || parts[0].isEmpty()) { + return null; + } + int prefix = 32; + if (parts.length == 2) { + try { + prefix = Integer.parseInt(parts[1].trim()); + } catch (NumberFormatException e) { + return null; + } + if (prefix < 0 || prefix > 32) { + return null; + } + } + byte[] bytes = ipv4Bytes(parts[0].trim()); + if (bytes == null || bytes.length != 4) { + return null; + } + return new int[]{bytes[0] & 0xff, bytes[1] & 0xff, bytes[2] & 0xff, bytes[3] & 0xff, prefix}; + } + + private String ipv4String(int[] bytes) { + if (bytes == null || bytes.length < 4) { + return "0.0.0.0"; + } + return bytes[0] + "." + bytes[1] + "." + bytes[2] + "." 
+ bytes[3]; + } + + private void configLog(String message) { + Log.w(TAG, message); + writeRuntimeConfigAppend("Config: " + message); + } + + private void writeRuntimeConfigAppend(String message) { + if (message == null || message.isEmpty()) { + return; + } + try { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("last_config_note", message) + .apply(); + } catch (Exception ignored) { + } + } + + private String configNotes(VpnClientConfig config) { + StringBuilder notes = new StringBuilder(); + if (config == null) { + return ""; + } + for (String note : config.configNotes) { + if (notes.length() > 0) { + notes.append(" | "); + } + notes.append(note); + } + return notes.toString(); + } + + private String cidrHost(String cidr) { + if (cidr == null || cidr.isEmpty()) { + return "10.77.0.2"; + } + String[] parts = cidr.split("/", 2); + return parts.length > 0 && !parts[0].isEmpty() ? parts[0] : "10.77.0.2"; + } + + private void startPacketRelay(String backendUrl, List<String> candidateUrls, String clusterId, String vpnConnectionId) { + if (tunnel == null || backendUrl == null || backendUrl.isEmpty() || clusterId == null || clusterId.isEmpty() || vpnConnectionId == null || vpnConnectionId.isEmpty()) { + Log.e(TAG, "packet relay not started: tunnel=" + (tunnel != null) + + " backend=" + present(backendUrl) + + " cluster=" + present(clusterId) + + " vpn_connection=" + present(vpnConnectionId)); + writeRuntimeStatus("error", "relay not started: tunnel=" + (tunnel != null) + + " backend=" + present(backendUrl) + + " cluster=" + present(clusterId) + + " connection=" + present(vpnConnectionId), 0, 0, 0, 0); + return; + } + List<String> relayUrls = dedupeRelayUrls(candidateUrls, backendUrl); + String selectedRelayUrl = relayUrls.isEmpty() ? 
"" : relayUrls.get(0); + if (selectedRelayUrl == null || selectedRelayUrl.isEmpty()) { + selectedRelayUrl = backendUrl; + } + activePacketRelayUrlByProfile = selectedRelayUrl; + activePacketRelayUrlsByProfile = new ArrayList<>(relayUrls); + activePacketRelayIndex = Math.max(0, relayUrls.indexOf(selectedRelayUrl)); + writeActivePacketRelayConfig(selectedRelayUrl, relayUrls); + stopPacketRelay(); + running = true; + runtimeStartedAt = System.currentTimeMillis(); + uplinkWorkerCount = Math.max(1, Math.min(UPLINK_WORKER_MAX_COUNT, Math.max(1, Runtime.getRuntime().availableProcessors() - 1))); + if (uplinkWorkerCount < 2) { + uplinkWorkerCount = 1; + } + uplinkQueueOffersByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkQueueDropsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkSenderPacketsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkSenderErrorsByWorker = createAtomicCounters(uplinkWorkerCount); + uplinkPriorityQueue = new ArrayBlockingQueue<>(PRIORITY_QUEUE_CAPACITY); + uplinkQueues = new ArrayBlockingQueue[uplinkWorkerCount]; + for (int i = 0; i < uplinkWorkerCount; i++) { + uplinkQueues[i] = new ArrayBlockingQueue<>(UPLINK_QUEUE_CAPACITY); + } + downlinkFlowQueueCount = Math.max(1, Math.min(DOWNLINK_FLOW_QUEUE_MAX_COUNT, Math.max(1, Runtime.getRuntime().availableProcessors()))); + downlinkQueueOffersByFlow = createAtomicCounters(downlinkFlowQueueCount); + downlinkQueueDropsByFlow = createAtomicCounters(downlinkFlowQueueCount); + downlinkWriterPacketsByFlow = createAtomicCounters(downlinkFlowQueueCount); + downlinkPriorityQueue = new ArrayBlockingQueue<>(PRIORITY_QUEUE_CAPACITY); + downlinkQueues = new ArrayBlockingQueue[downlinkFlowQueueCount]; + int downlinkPerFlowCapacity = Math.max(512, DOWNLINK_QUEUE_CAPACITY / downlinkFlowQueueCount); + for (int i = 0; i < downlinkFlowQueueCount; i++) { + downlinkQueues[i] = new ArrayBlockingQueue<>(downlinkPerFlowCapacity); + } + configureBackendBypass(selectedRelayUrl); + 
startPacketWebSocketRelay(selectedRelayUrl, clusterId, vpnConnectionId); + Log.i(TAG, "packet relay starting: backend=" + selectedRelayUrl + " cluster=" + clusterId + " vpn_connection=" + vpnConnectionId); + writeRuntimeStatus("relay", "relay starting " + vpnConnectionId, 0, 0, 0, 0); + writeRuntimeDetail("running", "packet relay active", "relay", 0, 0, ""); + final String resetRelayUrl = selectedRelayUrl; + Thread resetThread = new Thread(() -> { + try { + RapApiClient uplinkClient = packetRelayClientForUrl(resetRelayUrl); + JSONObject reset = uplinkClient.resetVPNPacketQueues(clusterId, vpnConnectionId); + Log.i(TAG, "packet relay queues reset: " + reset.toString()); + writeRuntimeStatus("relay_reset", reset.toString(), 0, 0, 0, 0); + } catch (Exception e) { + Log.w(TAG, "vpn relay queue reset failed; continuing", e); + writeRuntimeStatus("relay_reset_warning", "queue reset failed: " + e.getMessage(), 0, 0, 0, 1); + } + }, "rap-vpn-relay-reset"); + uplinkThread = new Thread(() -> pumpTunToRelay(clusterId, vpnConnectionId), "rap-vpn-uplink"); + uplinkSenderThreads = new Thread[uplinkWorkerCount]; + for (int i = 0; i < uplinkWorkerCount; i++) { + final int workerIndex = i; + uplinkSenderThreads[i] = new Thread(() -> pumpUplinkQueueToRelay(workerIndex, clusterId, vpnConnectionId), "rap-vpn-uplink-sender-" + workerIndex); + } + downlinkThread = new Thread(() -> runDownlinkWithRestart(clusterId, vpnConnectionId), "rap-vpn-downlink-receiver"); + downlinkWriterThread = new Thread(this::pumpDownlinkQueueToTun, "rap-vpn-downlink-writer"); + runtimeWatchdogThread = new Thread(() -> runRuntimeWatchdog(clusterId, vpnConnectionId), "rap-vpn-runtime-watchdog"); + resetThread.start(); + uplinkThread.start(); + for (Thread senderThread : uplinkSenderThreads) { + senderThread.start(); + } + downlinkThread.start(); + downlinkWriterThread.start(); + runtimeWatchdogThread.start(); + } + + private List<String> singletonUrl(String value) { + List<String> out = new ArrayList<>(); + addUniqueUrl(out, 
value); + return out; + } + + private void writeActivePacketRelayConfig(String selectedRelayUrl, List<String> relayUrls) { + try { + String activeUrl = selectedRelayUrl == null ? "" : selectedRelayUrl; + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("packet_relay_active_base_url", activeUrl) + .putString("packet_relay_base_url", activeUrl) + .putString("packet_relay_candidate_urls", joinList(relayUrls)) + .commit(); + } catch (Exception ignored) { + } + } + + private String joinList(List<String> values) { + if (values == null || values.isEmpty()) { + return ""; + } + StringBuilder out = new StringBuilder(); + for (String value : values) { + if (value == null || value.isEmpty()) { + continue; + } + if (out.length() > 0) { + out.append(","); + } + out.append(value); + } + return out.toString(); + } + + private List<String> dedupeRelayUrls(List<String> candidateUrls, String backendUrl) { + List<String> out = new ArrayList<>(); + boolean backendIsPrivate = isPrivateURLHost(backendUrl); + if (candidateUrls != null) { + for (String url : candidateUrls) { + String normalized = normalizeHTTPBaseUrl(url); + if (!backendIsPrivate && isPrivateURLHost(normalized)) { + continue; + } + addUniqueUrl(out, normalized); + } + } + addUniqueUrl(out, normalizeHTTPBaseUrl(backendUrl)); + return out; + } + + private void addUniqueUrl(List<String> urls, String value) { + if (urls == null || value == null) { + return; + } + String normalized = normalizeHTTPBaseUrl(value); + if (normalized.isEmpty()) { + return; + } + for (String existing : urls) { + if (normalized.equals(existing)) { + return; + } + } + urls.add(normalized); + } + + private RapApiClient packetRelayClientForUrl(String url) { + return new RapApiClient(url, this, activeFabricServiceChannel); + } + + private void startPacketWebSocketRelay(String relayUrl, String clusterId, String vpnConnectionId) { + try { + VpnPacketWebSocketRelay old = packetWebSocketRelay; + if (old != null && !relayUrl.equals(old.baseUrl())) { + old.close(); + packetWebSocketRelay = 
null; + } + VpnPacketWebSocketRelay relay = packetWebSocketRelay; + if (relay == null) { + relay = new VpnPacketWebSocketRelay(relayUrl, this, activeFabricServiceChannel); + packetWebSocketRelay = relay; + } + relay.connect(clusterId, vpnConnectionId); + writeRuntimeDetail("websocket_connect", "packet websocket connect requested " + relayUrl, "relay", 0, 0, "", -1); + } catch (Exception e) { + writeRuntimeDetail("websocket_connect_failed", "packet websocket connect failed: " + e.getMessage(), "relay", 0, 1, e.getClass().getSimpleName(), -1); + } + } + + private String currentPacketRelayUrl() { + String active = activePacketRelayUrlByProfile; + if (active != null && !active.isEmpty()) { + return active; + } + List<String> urls = activePacketRelayUrlsByProfile; + if (urls != null && !urls.isEmpty()) { + int index = activePacketRelayIndex; + if (index < 0 || index >= urls.size()) { + index = 0; + } + return urls.get(index); + } + return ""; + } + + private boolean switchPacketRelayUrl(String failedUrl, String reason) { + synchronized (packetRelaySwitchLock) { + List<String> urls = activePacketRelayUrlsByProfile; + if (urls == null || urls.size() <= 1) { + return false; + } + String normalizedFailed = normalizeHTTPBaseUrl(failedUrl); + int start = activePacketRelayIndex; + if (!normalizedFailed.isEmpty()) { + int failedIndex = urls.indexOf(normalizedFailed); + if (failedIndex >= 0) { + start = failedIndex; + } + } + for (int offset = 1; offset <= urls.size(); offset++) { + int nextIndex = (start + offset) % urls.size(); + String next = urls.get(nextIndex); + if (next == null || next.isEmpty() || next.equals(normalizedFailed)) { + continue; + } + activePacketRelayIndex = nextIndex; + activePacketRelayUrlByProfile = next; + configureBackendBypass(next); + startPacketWebSocketRelay(next, getSharedPreferences(PREF_NAME, MODE_PRIVATE).getString(PREF_CLUSTER_ID, ""), getSharedPreferences(PREF_NAME, MODE_PRIVATE).getString(PREF_VPN_CONNECTION_ID, "")); + writeActivePacketRelayConfig(next, 
urls); + writeRuntimeStatus("relay_switch", "relay switched to " + next + " after " + reason, 0, 0, 0, 0); + writeRuntimeDetail("relay_switch", "relay switched from " + normalizedFailed + " to " + next + " reason=" + reason, "relay", 0, 0, ""); + return true; + } + return false; + } + } + + private String selectReachablePacketRelayUrl(List<String> relayUrls, String clusterId, String vpnConnectionId) { + if (relayUrls == null || relayUrls.isEmpty()) { + return ""; + } + StrictMode.ThreadPolicy previousPolicy = StrictMode.getThreadPolicy(); + StrictMode.setThreadPolicy(new StrictMode.ThreadPolicy.Builder(previousPolicy).permitNetwork().build()); + try { + for (String url : relayUrls) { + if (url == null || url.isEmpty()) { + continue; + } + try { + RapApiClient probe = new RapApiClient(url, this); + JSONObject reset = probe.resetVPNPacketQueues(clusterId, vpnConnectionId); + Log.i(TAG, "packet relay selected: " + url + " reset=" + reset.toString()); + writeRuntimeStatus("relay_selected", "relay selected " + url, 0, 0, 0, 0); + return url; + } catch (Exception e) { + Log.w(TAG, "packet relay candidate failed: " + url, e); + writeRuntimeStatus("relay_candidate_failed", url + ": " + e.getMessage(), 0, 0, 0, 1); + } + } + } finally { + StrictMode.setThreadPolicy(previousPolicy); + } + return ""; + } + + private void stopPacketRelay() { + running = false; + VpnPacketWebSocketRelay relay = packetWebSocketRelay; + packetWebSocketRelay = null; + if (relay != null) { + relay.close(); + } + closeTunHandles(); + interruptAndJoin(uplinkThread); + if (uplinkSenderThreads != null) { + for (Thread senderThread : uplinkSenderThreads) { + interruptAndJoin(senderThread); + } + } + interruptAndJoin(downlinkThread); + interruptAndJoin(downlinkWriterThread); + interruptAndJoin(runtimeWatchdogThread); + uplinkThread = null; + uplinkSenderThreads = null; + downlinkThread = null; + downlinkWriterThread = null; + runtimeWatchdogThread = null; + uplinkWorkerCount = 0; + downlinkFlowQueueCount = 0; + 
uplinkQueues = null; + downlinkQueues = null; + uplinkPriorityQueue = null; + downlinkPriorityQueue = null; + uplinkQueueOffersByWorker = null; + uplinkQueueDropsByWorker = null; + uplinkSenderPacketsByWorker = null; + uplinkSenderErrorsByWorker = null; + downlinkQueueOffersByFlow = null; + downlinkQueueDropsByFlow = null; + downlinkWriterPacketsByFlow = null; + } + + private void resetRuntimeMetrics() { + uplinkReadPackets.set(0); + uplinkReadBytes.set(0); + uplinkSentPackets.set(0); + uplinkSentBytes.set(0); + downlinkReceivedPackets.set(0); + downlinkReceivedBytes.set(0); + uplinkDroppedPackets.set(0); + uplinkDroppedBytes.set(0); + uplinkFilteredPackets.set(0); + uplinkFilteredBytes.set(0); + uplinkBypassedControlPackets.set(0); + uplinkBypassedControlBytes.set(0); + uplinkQueuedPackets.set(0); + downlinkQueuedPackets.set(0); + downlinkQueueWaits.set(0); + localDnsQueries.set(0); + localDnsReplies.set(0); + localDnsErrors.set(0); + runtimeWatchdogRecoveries.set(0); + tcpHandshakeStalls.set(0); + runtimeWatchdogHardRestarts.set(0); + hardRuntimeRestartInProgress.set(false); + downlinkDroppedPackets.set(0); + downlinkDroppedBytes.set(0); + downlinkTransportChecksumRepairs.set(0); + uplinkSourceMismatchPackets.set(0); + downlinkDestinationMismatchPackets.set(0); + synchronized (pendingTcpHandshakes) { + pendingTcpHandshakes.clear(); + } + fastPathModeEnabled = false; + relaxedUplinkSourceValidation = false; + relaxedDownlinkDestinationValidation = false; + uplinkWorkerCount = 0; + downlinkFlowQueueCount = 0; + runtimeStartedAt = System.currentTimeMillis(); + lastThroughputCalcAt = runtimeStartedAt; + lastRateUplinkReadBytes = 0; + lastRateUplinkSentBytes = 0; + lastRateDownlinkReceivedBytes = 0; + uplinkReadMbps = 0f; + uplinkSentMbps = 0f; + downlinkReceivedMbps = 0f; + uplinkReadPps = 0f; + uplinkSentPps = 0f; + downlinkReceivedPps = 0f; + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("state", "resetting") + .putString("message", "runtime 
counters reset") + .putLong("updated_at", runtimeStartedAt) + .putLong("uplink_read", 0) + .putLong("uplink_sent", 0) + .putLong("downlink_received", 0) + .putLong("errors", 0) + .putFloat("uplink_read_mbps", 0f) + .putFloat("uplink_sent_mbps", 0f) + .putFloat("downlink_received_mbps", 0f) + .putFloat("uplink_read_pps", 0f) + .putFloat("uplink_sent_pps", 0f) + .putFloat("downlink_received_pps", 0f) + .putLong("runtime_started_at", runtimeStartedAt) + .putLong("uplink_read_total", 0) + .putLong("uplink_read_bytes", 0) + .putLong("uplink_sent_total", 0) + .putLong("uplink_sent_bytes", 0) + .putLong("downlink_received_total", 0) + .putLong("downlink_received_bytes", 0) + .putLong("uplink_dropped_packets", 0) + .putLong("uplink_dropped_bytes", 0) + .putLong("uplink_filtered_packets", 0) + .putLong("uplink_filtered_bytes", 0) + .putLong("uplink_bypassed_control_packets", 0) + .putLong("uplink_bypassed_control_bytes", 0) + .putLong("downlink_dropped_packets", 0) + .putLong("downlink_dropped_bytes", 0) + .putLong("downlink_transport_checksum_repairs", 0) + .putLong("local_dns_queries", 0) + .putLong("local_dns_replies", 0) + .putLong("local_dns_errors", 0) + .putLong("runtime_watchdog_recoveries", 0) + .putLong("tcp_handshake_stalls", 0) + .putInt("uplink_worker_count", 0) + .putString("uplink_queue_depths", "") + .putInt("uplink_queue_depth_max", 0) + .putInt("uplink_queue_depth_total", 0) + .putInt("downlink_queue_depth", 0) + .putInt("downlink_flow_queue_count", 0) + .putString("downlink_queue_depths", "") + .putInt("downlink_queue_depth_max", 0) + .putInt("downlink_queue_depth_total", 0) + .putLong("downlink_queued_packets", 0) + .putLong("downlink_queue_waits", 0) + .apply(); + } + + private static AtomicLong[] createAtomicCounters(int count) { + AtomicLong[] values = new AtomicLong[count]; + for (int i = 0; i < count; i++) { + values[i] = new AtomicLong(); + } + return values; + } + + private void recordUplinkRead(int bytes) { + uplinkReadPackets.incrementAndGet(); 
+ if (bytes > 0) { + uplinkReadBytes.addAndGet(bytes); + } + } + + private void recordUplinkDrop(int bytes) { + uplinkDroppedPackets.incrementAndGet(); + if (bytes > 0) { + uplinkDroppedBytes.addAndGet(bytes); + } + } + + private void recordUplinkFiltered(int bytes) { + uplinkFilteredPackets.incrementAndGet(); + if (bytes > 0) { + uplinkFilteredBytes.addAndGet(bytes); + } + } + + private void recordUplinkBypassControl(int bytes) { + uplinkBypassedControlPackets.incrementAndGet(); + if (bytes > 0) { + uplinkBypassedControlBytes.addAndGet(bytes); + } + } + + private void recordUplinkSent(int packets, int bytes) { + if (packets > 0) { + uplinkSentPackets.addAndGet(packets); + } + if (bytes > 0) { + uplinkSentBytes.addAndGet(bytes); + } + } + + private void recordDownlinkReceived(int bytes) { + downlinkReceivedPackets.incrementAndGet(); + if (bytes > 0) { + downlinkReceivedBytes.addAndGet(bytes); + } + } + + private void recordDownlinkDrop(int bytes) { + downlinkDroppedPackets.incrementAndGet(); + if (bytes > 0) { + downlinkDroppedBytes.addAndGet(bytes); + } + } + + private void interruptAndJoin(Thread thread) { + if (thread == null) { + return; + } + thread.interrupt(); + if (thread == Thread.currentThread()) { + return; + } + try { + thread.join(750); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + } + } + + private void runDownlinkWithRestart(String clusterId, String vpnConnectionId) { + long restarts = 0; + while (running) { + downlinkRestarts = restarts; + writeRuntimeDetail("starting", "downlink loop starting", "downlink", 0, restarts, ""); + pumpRelayToTun(clusterId, vpnConnectionId); + if (!running) { + return; + } + restarts++; + downlinkRestarts = restarts; + writeRuntimeStatus("downlink_restart", "restarting downlink count=" + restarts, 0, 0, 0, restarts); + writeRuntimeDetail("restart", "downlink loop restarting", "downlink", 0, restarts, ""); + try { + Thread.sleep(100); + } catch (InterruptedException e) { + if (!running) { + 
return; + } + } + } + } + + private void runRuntimeWatchdog(String clusterId, String vpnConnectionId) { + while (running) { + try { + Thread.sleep(RUNTIME_WATCHDOG_INTERVAL_MS); + if (!running) { + return; + } + long now = System.currentTimeMillis(); + if (now - lastDiagnosticEnsureAt >= 10000) { + lastDiagnosticEnsureAt = now; + ensureDiagnosticServiceHealthy(); + } + int stale = staleTCPHandshakeCount(); + if (stale <= 0) { + continue; + } + if (now - lastRuntimeWatchdogRecoveryAt < RUNTIME_WATCHDOG_RECOVERY_COOLDOWN_MS) { + continue; + } + tcpHandshakeStalls.addAndGet(stale); + runtimeWatchdogRecoveries.incrementAndGet(); + lastRuntimeWatchdogRecoveryAt = now; + if (shouldHardRestartRuntime(now)) { + scheduleHardRuntimeRestart(clusterId, vpnConnectionId, "tcp_handshake_stall stale=" + stale); + } else { + recoverPacketRelayRuntime(clusterId, vpnConnectionId, "tcp_handshake_stall stale=" + stale); + } + } catch (InterruptedException e) { + if (!running) { + return; + } + } catch (Exception e) { + writeRuntimeDetail("watchdog_error", e.getMessage(), "watchdog", runtimeWatchdogRecoveries.get(), tcpHandshakeStalls.get(), e.getClass().getSimpleName(), -1); + } + } + } + + private boolean shouldHardRestartRuntime(long now) { + if (runtimeWatchdogRecoveries.get() < 2) { + return false; + } + return now - lastRuntimeWatchdogHardRestartAt >= RUNTIME_WATCHDOG_HARD_RESTART_COOLDOWN_MS; + } + + private void scheduleHardRuntimeRestart(String clusterId, String vpnConnectionId, String reason) { + if (!hardRuntimeRestartInProgress.compareAndSet(false, true)) { + writeRuntimeDetail("hard_restart_skipped", "hard restart already in progress: " + reason, "watchdog", runtimeWatchdogRecoveries.get(), tcpHandshakeStalls.get(), "", -1); + return; + } + long now = System.currentTimeMillis(); + lastRuntimeWatchdogHardRestartAt = now; + runtimeWatchdogHardRestarts.incrementAndGet(); + writeRuntimeStatus("runtime_hard_restart", reason, 0, uplinkSentPackets.get(), 
downlinkReceivedPackets.get(), runtimeWatchdogHardRestarts.get()); + writeRuntimeDetail("runtime_hard_restart", reason, "watchdog", runtimeWatchdogRecoveries.get(), runtimeWatchdogHardRestarts.get(), "", -1); + Thread restartThread = new Thread(() -> { + try { + SharedPreferences prefs = getSharedPreferences(PREF_NAME, MODE_PRIVATE); + String profile = prefs.getString(PREF_PROFILE_JSON, ""); + String backendUrl = prefs.getString(PREF_BACKEND_URL, ""); + String savedClusterId = prefs.getString(PREF_CLUSTER_ID, clusterId == null ? "" : clusterId); + String savedVpnConnectionId = prefs.getString(PREF_VPN_CONNECTION_ID, vpnConnectionId == null ? "" : vpnConnectionId); + if (profile.isEmpty() || backendUrl.isEmpty() || savedClusterId.isEmpty() || savedVpnConnectionId.isEmpty()) { + writeRuntimeDetail("hard_restart_failed", "saved runtime config missing", "watchdog", runtimeWatchdogRecoveries.get(), runtimeWatchdogHardRestarts.get(), "CONFIG_MISSING", -1); + return; + } + try { + String relayUrl = currentPacketRelayUrl(); + if (relayUrl != null && !relayUrl.isEmpty()) { + packetRelayClientForUrl(relayUrl).resetVPNPacketQueues(savedClusterId, savedVpnConnectionId); + } + } catch (Exception e) { + writeRuntimeDetail("hard_restart_queue_reset_warning", e.getMessage(), "watchdog", runtimeWatchdogRecoveries.get(), runtimeWatchdogHardRestarts.get(), e.getClass().getSimpleName(), -1); + } + Intent startIntent = new Intent(this, RapVpnService.class); + startIntent.putExtra(EXTRA_PROFILE_JSON, profile); + startIntent.putExtra(EXTRA_BACKEND_URL, backendUrl); + startIntent.putExtra(EXTRA_CLUSTER_ID, savedClusterId); + startIntent.putExtra(EXTRA_VPN_CONNECTION_ID, savedVpnConnectionId); + if (Build.VERSION.SDK_INT >= 26) { + startForegroundService(startIntent); + } else { + startService(startIntent); + } + writeRuntimeDetail("hard_restart_started", "vpn runtime restart requested", "watchdog", runtimeWatchdogRecoveries.get(), runtimeWatchdogHardRestarts.get(), "", -1); + } catch 
(Exception e) { + writeRuntimeDetail("hard_restart_failed", e.getMessage(), "watchdog", runtimeWatchdogRecoveries.get(), runtimeWatchdogHardRestarts.get(), e.getClass().getSimpleName(), -1); + } finally { + hardRuntimeRestartInProgress.set(false); + } + }, "rap-vpn-runtime-hard-restart"); + restartThread.start(); + } + + private void recoverPacketRelayRuntime(String clusterId, String vpnConnectionId, String reason) { + String relayUrl = currentPacketRelayUrl(); + writeRuntimeStatus("runtime_recovery", reason, 0, uplinkSentPackets.get(), downlinkReceivedPackets.get(), runtimeWatchdogRecoveries.get()); + writeRuntimeDetail("runtime_recovery", reason, "watchdog", runtimeWatchdogRecoveries.get(), tcpHandshakeStalls.get(), "", -1); + synchronized (pendingTcpHandshakes) { + pendingTcpHandshakes.clear(); + } + try { + VpnPacketWebSocketRelay relay = packetWebSocketRelay; + if (relay != null) { + relay.close(); + packetWebSocketRelay = null; + } + if (relayUrl != null && !relayUrl.isEmpty()) { + startPacketWebSocketRelay(relayUrl, clusterId, vpnConnectionId); + } + } catch (Exception e) { + writeRuntimeDetail("websocket_recover_failed", e.getMessage(), "watchdog", runtimeWatchdogRecoveries.get(), tcpHandshakeStalls.get(), e.getClass().getSimpleName(), -1); + } + } + + private void clearPacketQueues() { + BlockingQueue[] uplink = uplinkQueues; + if (uplink != null) { + for (BlockingQueue queue : uplink) { + if (queue != null) { + queue.clear(); + } + } + } + BlockingQueue[] downlink = downlinkQueues; + if (downlink != null) { + for (BlockingQueue queue : downlink) { + if (queue != null) { + queue.clear(); + } + } + } + BlockingQueue uplinkPriority = uplinkPriorityQueue; + if (uplinkPriority != null) { + uplinkPriority.clear(); + } + BlockingQueue downlinkPriority = downlinkPriorityQueue; + if (downlinkPriority != null) { + downlinkPriority.clear(); + } + } + + private int staleTCPHandshakeCount() { + long now = System.currentTimeMillis(); + int stale = 0; + synchronized 
(pendingTcpHandshakes) { + List remove = new ArrayList<>(); + for (Map.Entry entry : pendingTcpHandshakes.entrySet()) { + long value = entry.getValue(); + long at = Math.abs(value); + long age = now - at; + if (age > 30000) { + remove.add(entry.getKey()); + continue; + } + if (value < 0 && age >= RUNTIME_WATCHDOG_STALE_SYNACK_MS) { + stale++; + } + } + for (String key : remove) { + pendingTcpHandshakes.remove(key); + } + } + return stale; + } + + private String present(String value) { + return value == null || value.isEmpty() ? "missing" : "present"; + } + + private void pumpTunToRelay(String clusterId, String vpnConnectionId) { + byte[] packet = new byte[32767]; + long readPackets = 0; + FileDescriptor fd = null; + FileInputStream input = null; + try { + fd = Os.dup(tunnel.getFileDescriptor()); + input = new FileInputStream(fd); + uplinkTunFd = fd; + uplinkTunInput = input; + while (running) { + int n = input.read(packet); + if (n > 0) { + readPackets++; + recordUplinkRead(n); + if (readPackets == 1) { + Log.i(TAG, "vpn uplink first packet: " + packetSummary(packet, n)); + } + if (readPackets == 1 || readPackets % 25 == 0) { + writeRuntimeStatus("uplink_read", packetSummary(packet, n), readPackets, 0, 0, 0); + writeRuntimeDetail("read", packetSummary(packet, n), "uplink", readPackets, 0, ""); + } + queueUplinkPacket(packet, n); + } else if (n < 0) { + break; + } + } + } catch (Exception e) { + if (running) { + Log.e(TAG, "vpn uplink stopped", e); + writeRuntimeStatus("error", "uplink stopped: " + e.getMessage(), readPackets, 0, 0, 0); + writeRuntimeDetail("stopped", "uplink stopped: " + e.getMessage(), "uplink", readPackets, 0, e.getClass().getSimpleName()); + } + } finally { + if (input != null) { + try { + input.close(); + } catch (Exception ignored) { + } + } else { + closeFdQuietly(fd); + } + if (uplinkTunInput == input) { + uplinkTunInput = null; + } + if (uplinkTunFd == fd) { + uplinkTunFd = null; + } + } + } + + private void queueUplinkPacket(byte[] packet, 
int length) { + if (isUplinkBackendBypassPacket(packet, length)) { + recordUplinkBypassControl(length); + return; + } + if (!shouldForwardUplinkPacket(packet, length)) { + recordUplinkFiltered(length); + return; + } + byte[] copy = new byte[length]; + System.arraycopy(packet, 0, copy, 0, length); + if (!hasIPv4Source(copy, length)) { + long mismatch = uplinkSourceMismatchPackets.incrementAndGet(); + String natKey = natKeyForOutboundReturn(copy, length); + if (natKey.isEmpty() || !rewriteIPv4SourceToVPN(copy, length, natKey)) { + Log.w(TAG, "vpn uplink source is not vpn address; dropping " + packetSummary(copy, length)); + writeRuntimeDetail("source_drop", packetSummary(copy, length), "uplink", -1, mismatch, "SOURCE_MISMATCH"); + recordUplinkDrop(length); + return; + } + writeRuntimeDetail("source_nat", packetSummary(copy, length), "uplink", -1, mismatch, "SOURCE_NAT"); + } + recordOutboundTCPHandshake(copy, length); + if (handleLocalDnsQuery(copy, length)) { + recordUplinkFiltered(length); + return; + } + if (isTCPPriorityPacket(copy, length)) { + BlockingQueue priority = uplinkPriorityQueue; + if (priority != null && priority.offer(copy)) { + uplinkQueuedPackets.incrementAndGet(); + return; + } + } + int queueIndex = shardForUplinkPacket(copy, length); + queueIndex = normalizeQueueIndex(queueIndex); + BlockingQueue queue = queueForUplinkPacket(copy, length, queueIndex); + if (queue != null) { + long queued = uplinkQueuedPackets.incrementAndGet(); + if (queued <= 5) { + Log.i(TAG, "vpn uplink queued packet worker=" + queueIndex + " " + packetSummary(copy, length)); + } + AtomicLong[] offers = uplinkQueueOffersByWorker; + if (offers != null && queueIndex >= 0 && queueIndex < offers.length) { + offers[queueIndex].incrementAndGet(); + } + } + if (queue != null && !queue.offer(copy)) { + Log.w(TAG, "vpn uplink queue full; dropping packet"); + recordUplinkDrop(length); + AtomicLong[] drops = uplinkQueueDropsByWorker; + if (drops != null && queueIndex >= 0 && queueIndex 
< drops.length) { + drops[queueIndex].incrementAndGet(); + } + writeRuntimeDetail("queue_full", packetSummary(copy, length), "uplink", -1, -1, "QUEUE_FULL"); + } + } + + private boolean handleLocalDnsQuery(byte[] packet, int length) { + if (!isUdpDnsQuery(packet, length)) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + int udpOffset = ihl; + int udpLength = u16(packet, udpOffset + 4); + int dnsOffset = udpOffset + 8; + int dnsLength = udpLength - 8; + if (dnsLength <= 0 || dnsOffset + dnsLength > length) { + localDnsErrors.incrementAndGet(); + return false; + } + byte[] query = new byte[dnsLength]; + System.arraycopy(packet, dnsOffset, query, 0, dnsLength); + String questionName = dnsQuestionName(query, query.length); + if (shouldPassDnsQueryToExit(questionName)) { + return false; + } + localDnsQueries.incrementAndGet(); + byte[] answer = resolveDnsOutsideVpn(query); + if (answer == null || answer.length == 0) { + localDnsErrors.incrementAndGet(); + writeRuntimeDetail("local_dns_miss", packetSummary(packet, length), "dns", localDnsQueries.get(), localDnsErrors.get(), "DNS_UPSTREAM"); + return false; + } + byte[] responsePacket = buildUdpResponsePacket(packet, length, answer); + if (responsePacket == null) { + localDnsErrors.incrementAndGet(); + return false; + } + try { + if (offerDownlinkPacket(responsePacket, responsePacket.length)) { + long replies = localDnsReplies.incrementAndGet(); + writeRuntimeDetail("local_dns_reply", packetSummary(responsePacket, responsePacket.length), "dns", replies, localDnsErrors.get(), ""); + return true; + } + localDnsErrors.incrementAndGet(); + } catch (InterruptedException e) { + Thread.currentThread().interrupt(); + localDnsErrors.incrementAndGet(); + } + return false; + } + + private boolean shouldPassDnsQueryToExit(String questionName) { + if (questionName == null || questionName.isEmpty()) { + return true; + } + String name = questionName.toLowerCase(); + return name.endsWith(".local") + || name.endsWith(".lan") + 
|| name.endsWith(".home") + || name.endsWith(".internal") + || name.endsWith(".corp") + || name.indexOf('.') < 0; + } + + private boolean isUdpDnsQuery(byte[] packet, int length) { + if (packet == null || length < 28) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl + 8) { + return false; + } + int proto = packet[9] & 0xff; + if (proto != 17) { + return false; + } + int dstPort = u16(packet, ihl + 2); + if (dstPort != 53) { + return false; + } + int udpLength = u16(packet, ihl + 4); + return udpLength >= 8 && ihl + udpLength <= length; + } + + private byte[] resolveDnsOutsideVpn(byte[] query) { + List servers = new ArrayList<>(configuredDnsServers()); + if (servers.isEmpty()) { + servers.add("1.1.1.1"); + servers.add("8.8.8.8"); + } + servers.add("1.1.1.1"); + servers.add("8.8.8.8"); + servers.add("9.9.9.9"); + servers = dedupeDnsServers(servers); + for (String server : servers) { + if (server == null || server.trim().isEmpty()) { + continue; + } + byte[] serverAddress = ipv4Bytes(server.trim()); + if (serverAddress != null && isPrivateIPv4(serverAddress)) { + continue; + } + try (DatagramSocket socket = new DatagramSocket()) { + protect(socket); + socket.setSoTimeout(1800); + DatagramPacket outbound = new DatagramPacket(query, query.length, InetAddress.getByName(server.trim()), 53); + socket.send(outbound); + byte[] buffer = new byte[1500]; + DatagramPacket inbound = new DatagramPacket(buffer, buffer.length); + socket.receive(inbound); + byte[] answer = new byte[inbound.getLength()]; + System.arraycopy(buffer, 0, answer, 0, answer.length); + return answer; + } catch (Exception e) { + writeRuntimeDetail("local_dns_upstream_failed", server + ": " + e.getClass().getSimpleName() + ": " + e.getMessage(), "dns", localDnsQueries.get(), localDnsErrors.get(), e.getClass().getSimpleName()); + } + } + return null; + } + + private List 
dedupeDnsServers(List servers) { + List out = new ArrayList<>(); + if (servers == null) { + return out; + } + for (String server : servers) { + String value = server == null ? "" : server.trim(); + if (value.isEmpty()) { + continue; + } + boolean exists = false; + for (String current : out) { + if (current.equals(value)) { + exists = true; + break; + } + } + if (!exists) { + out.add(value); + } + } + return out; + } + + private String dnsQuestionName(byte[] query, int length) { + if (query == null || length < 13) { + return ""; + } + StringBuilder out = new StringBuilder(); + int offset = 12; + while (offset < length) { + int labelLength = query[offset] & 0xff; + offset++; + if (labelLength == 0) { + break; + } + if ((labelLength & 0xc0) != 0 || offset + labelLength > length) { + return ""; + } + if (out.length() > 0) { + out.append('.'); + } + for (int i = 0; i < labelLength; i++) { + int ch = query[offset + i] & 0xff; + if ((ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') || (ch >= '0' && ch <= '9') || ch == '-') { + out.append((char) ch); + } + } + offset += labelLength; + } + return out.toString(); + } + + private boolean isPrivateIPv4(byte[] address) { + if (address == null || address.length != 4) { + return false; + } + int a = address[0] & 0xff; + int b = address[1] & 0xff; + return a == 10 + || (a == 172 && b >= 16 && b <= 31) + || (a == 192 && b == 168) + || (a == 169 && b == 254) + || a == 127; + } + + private byte[] buildUdpResponsePacket(byte[] queryPacket, int queryLength, byte[] udpPayload) { + if (queryPacket == null || udpPayload == null || udpPayload.length <= 0 || queryLength < 28) { + return null; + } + int ihl = (queryPacket[0] & 0x0f) * 4; + if (ihl < 20 || queryLength < ihl + 8) { + return null; + } + int totalLength = 20 + 8 + udpPayload.length; + if (totalLength > 65535) { + return null; + } + byte[] out = new byte[totalLength]; + out[0] = 0x45; + out[1] = 0; + putU16(out, 2, totalLength); + putU16(out, 4, u16(queryPacket, 4)); + 
putU16(out, 6, 0x4000); + out[8] = 64; + out[9] = 17; + out[12] = queryPacket[16]; + out[13] = queryPacket[17]; + out[14] = queryPacket[18]; + out[15] = queryPacket[19]; + out[16] = queryPacket[12]; + out[17] = queryPacket[13]; + out[18] = queryPacket[14]; + out[19] = queryPacket[15]; + int udpOffset = 20; + putU16(out, udpOffset, u16(queryPacket, ihl + 2)); + putU16(out, udpOffset + 2, u16(queryPacket, ihl)); + putU16(out, udpOffset + 4, 8 + udpPayload.length); + System.arraycopy(udpPayload, 0, out, udpOffset + 8, udpPayload.length); + return normalizeIPv4PacketChecksums(out, out.length) ? out : null; + } + + private BlockingQueue queueForUplinkPacket(byte[] packet, int length, int queueIndex) { + BlockingQueue[] queues = uplinkQueues; + if (queues == null || queues.length == 0) { + return null; + } + queueIndex = normalizeQueueIndex(queueIndex); + return queues[queueIndex]; + } + + private int normalizeQueueIndex(int queueIndex) { + BlockingQueue[] queues = uplinkQueues; + if (queues == null || queues.length == 0) { + return 0; + } + if (queueIndex >= 0 && queueIndex < queues.length) { + return queueIndex; + } + return Math.abs(queueIndex) % queues.length; + } + + private int shardForUplinkPacket(byte[] packet, int length) { + if (packet == null || length < 20 || uplinkQueues == null || uplinkQueues.length == 0) { + return 0; + } + return shardForIPv4Packet(packet, length, uplinkQueues.length); + } + + private int shardForDownlinkPacket(byte[] packet, int length) { + BlockingQueue[] queues = downlinkQueues; + if (packet == null || length < 20 || queues == null || queues.length == 0) { + return 0; + } + return shardForIPv4Packet(packet, length, queues.length); + } + + private int shardForIPv4Packet(byte[] packet, int length, int shardCount) { + if (packet == null || length < 20 || shardCount <= 0) { + return 0; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return Math.abs(length) % shardCount; + } + int ihl = (packet[0] & 0x0f) * 4; + if 
(ihl < 20 || length < ihl + 4) { + return Math.abs(length) % shardCount; + } + int proto = packet[9] & 0xff; + int srcPort = 0; + int dstPort = 0; + if (proto == 6 || proto == 17) { + srcPort = u16(packet, ihl); + dstPort = u16(packet, ihl + 2); + } + int srcIp = ((packet[12] & 0xff) << 24) | ((packet[13] & 0xff) << 16) | ((packet[14] & 0xff) << 8) | (packet[15] & 0xff); + int dstIp = ((packet[16] & 0xff) << 24) | ((packet[17] & 0xff) << 16) | ((packet[18] & 0xff) << 8) | (packet[19] & 0xff); + int hash = srcIp ^ Integer.rotateLeft(dstIp, 8) ^ (proto << 24) ^ (srcPort << 8) ^ dstPort; + return (hash & 0x7fffffff) % shardCount; + } + + private void recordOutboundTCPHandshake(byte[] packet, int length) { + TCPFlow flow = tcpFlow(packet, length); + if (flow == null) { + return; + } + boolean syn = (flow.flags & 0x02) != 0; + boolean ack = (flow.flags & 0x10) != 0; + String key = tcpFlowKey(flow.dstIp, flow.dstPort, flow.srcPort); + synchronized (pendingTcpHandshakes) { + if (syn && !ack) { + pendingTcpHandshakes.put(key, System.currentTimeMillis()); + } else if (ack) { + pendingTcpHandshakes.remove(key); + } + } + } + + private void recordInboundTCPHandshake(byte[] packet, int length) { + TCPFlow flow = tcpFlow(packet, length); + if (flow == null) { + return; + } + boolean syn = (flow.flags & 0x02) != 0; + boolean ack = (flow.flags & 0x10) != 0; + if (!syn || !ack) { + return; + } + String key = tcpFlowKey(flow.srcIp, flow.srcPort, flow.dstPort); + synchronized (pendingTcpHandshakes) { + if (pendingTcpHandshakes.containsKey(key)) { + pendingTcpHandshakes.put(key, -System.currentTimeMillis()); + } + } + } + + private TCPFlow tcpFlow(byte[] packet, int length) { + if (packet == null || length < 40 || ((packet[0] >> 4) & 0x0f) != 4) { + return null; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl + 20 || (packet[9] & 0xff) != 6) { + return null; + } + TCPFlow flow = new TCPFlow(); + flow.srcIp = ipv4String(packet, 12); + flow.dstIp = 
ipv4String(packet, 16); + flow.srcPort = u16(packet, ihl); + flow.dstPort = u16(packet, ihl + 2); + flow.flags = packet[ihl + 13] & 0xff; + return flow; + } + + private boolean isTCPPriorityPacket(byte[] packet, int length) { + if (packet == null || length < 40 || ((packet[0] >> 4) & 0x0f) != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl + 20 || (packet[9] & 0xff) != 6) { + return false; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0 || totalLength > length) { + totalLength = length; + } + int tcpHeaderLength = ((packet[ihl + 12] >> 4) & 0x0f) * 4; + if (tcpHeaderLength < 20 || ihl + tcpHeaderLength > totalLength) { + return false; + } + int flags = packet[ihl + 13] & 0xff; + boolean syn = (flags & 0x02) != 0; + boolean fin = (flags & 0x01) != 0; + boolean rst = (flags & 0x04) != 0; + boolean ack = (flags & 0x10) != 0; + int payloadLength = totalLength - ihl - tcpHeaderLength; + return syn || fin || rst || (ack && payloadLength == 0); + } + + private String tcpFlowKey(String remoteIp, int remotePort, int localPort) { + return remoteIp + "|" + remotePort + "|" + localPort; + } + + private static class TCPFlow { + String srcIp; + String dstIp; + int srcPort; + int dstPort; + int flags; + } + + private boolean hasIPv4Source(byte[] packet, int length) { + byte[] address = vpnAddressIPv4Bytes; + if (address == null || length < 20) { + return false; + } + return (packet[12] & 0xff) == (address[0] & 0xff) + && (packet[13] & 0xff) == (address[1] & 0xff) + && (packet[14] & 0xff) == (address[2] & 0xff) + && (packet[15] & 0xff) == (address[3] & 0xff); + } + + private boolean hasIPv4Destination(byte[] packet, int length) { + byte[] address = vpnAddressIPv4Bytes; + if (address == null || length < 20) { + return false; + } + return (packet[16] & 0xff) == (address[0] & 0xff) + && (packet[17] & 0xff) == (address[1] & 0xff) + && (packet[18] & 0xff) == (address[2] & 0xff) + && (packet[19] & 0xff) == (address[3] & 0xff); + 
} + + private boolean rewriteIPv4SourceToVPN(byte[] packet, int length, String natKey) { + byte[] vpn = vpnAddressIPv4Bytes; + if (vpn == null || vpn.length != 4 || packet == null || length < 20) { + return false; + } + byte[] original = new byte[]{packet[12], packet[13], packet[14], packet[15]}; + synchronized (clientSourceNat) { + clientSourceNat.put(natKey, original); + } + packet[12] = vpn[0]; + packet[13] = vpn[1]; + packet[14] = vpn[2]; + packet[15] = vpn[3]; + return normalizeIPv4PacketChecksums(packet, length); + } + + private boolean restoreClientSourceNATDestination(byte[] packet, int length) { + if (packet == null || length < 20 || !hasIPv4Destination(packet, length)) { + return false; + } + String natKey = natKeyForInbound(packet, length); + if (natKey.isEmpty()) { + return false; + } + byte[] original; + synchronized (clientSourceNat) { + original = clientSourceNat.get(natKey); + } + if (original == null || original.length != 4) { + return false; + } + packet[16] = original[0]; + packet[17] = original[1]; + packet[18] = original[2]; + packet[19] = original[3]; + return true; + } + + private String natKeyForOutboundReturn(byte[] packet, int length) { + if (packet == null || length < 20) { + return ""; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return ""; + } + int proto = packet[9] & 0xff; + int localPort = 0; + int remotePort = 0; + if ((proto == 6 || proto == 17) && length >= ihl + 4) { + localPort = u16(packet, ihl); + remotePort = u16(packet, ihl + 2); + } else if (proto != 1) { + return ""; + } + return proto + "|" + ipv4String(packet, 16) + "|" + remotePort + "|" + localPort; + } + + private String natKeyForInbound(byte[] packet, int length) { + if (packet == null || length < 20) { + return ""; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return ""; + } + int proto = packet[9] & 0xff; + int remotePort = 0; + int localPort = 0; + if ((proto == 6 || proto == 17) && length >= ihl + 4) { 
+ remotePort = u16(packet, ihl); + localPort = u16(packet, ihl + 2); + } else if (proto != 1) { + return ""; + } + return proto + "|" + ipv4String(packet, 12) + "|" + remotePort + "|" + localPort; + } + + private String ipv4String(byte[] packet, int offset) { + if (packet == null || offset < 0 || offset + 3 >= packet.length) { + return ""; + } + return (packet[offset] & 0xff) + "." + (packet[offset + 1] & 0xff) + "." + (packet[offset + 2] & 0xff) + "." + (packet[offset + 3] & 0xff); + } + + private byte[] ipv4Bytes(String value) { + if (value == null) { + return null; + } + String[] parts = value.split("\\."); + if (parts.length != 4) { + return null; + } + byte[] out = new byte[4]; + try { + for (int i = 0; i < 4; i++) { + int parsed = Integer.parseInt(parts[i]); + if (parsed < 0 || parsed > 255) { + return null; + } + out[i] = (byte) parsed; + } + return out; + } catch (NumberFormatException e) { + return null; + } + } + + private void pumpUplinkQueueToRelay(int workerIndex, String clusterId, String vpnConnectionId) { + long sentPackets = 0; + long errors = 0; + List batch = new ArrayList<>(VPN_BATCH_MAX_PACKETS); + while (running) { + try { + BlockingQueue[] queues = uplinkQueues; + if (queues == null || workerIndex < 0 || workerIndex >= queues.length) { + Thread.sleep(25); + continue; + } + BlockingQueue queue = queues[workerIndex]; + if (queue == null) { + Thread.sleep(25); + continue; + } + batch.clear(); + BlockingQueue priority = uplinkPriorityQueue; + byte[] first = priority == null ? null : priority.poll(); + if (first == null) { + first = queue.poll(25, TimeUnit.MILLISECONDS); + } + if (first == null) { + continue; + } + batch.add(first); + int batchBytes = first.length + 4; + long gatherUntil = System.currentTimeMillis() + UPLINK_BATCH_GATHER_MS; + while (batch.size() < VPN_BATCH_MAX_PACKETS) { + long waitMs = gatherUntil - System.currentTimeMillis(); + if (waitMs <= 0) { + break; + } + byte[] next = priority == null ? 
null : priority.poll(); + if (next == null) { + next = queue.poll(waitMs, TimeUnit.MILLISECONDS); + } + if (next == null) { + break; + } + if (next.length <= 0) { + continue; + } + int projectedBytes = batchBytes + 4 + next.length; + if (projectedBytes > VPN_BATCH_MAX_BYTES) { + BlockingQueue targetQueue = isTCPPriorityPacket(next, next.length) && priority != null ? priority : queue; + if (!targetQueue.offer(next)) { + Log.w(TAG, "vpn uplink queue reinsert failed; dropping packet"); + } + break; + } + batch.add(next); + batchBytes = projectedBytes; + } + + if (sentPackets < 5) { + Log.i(TAG, "vpn uplink sending batch worker=" + workerIndex + " packets=" + batch.size() + " bytes=" + batchBytes); + } + if (!sendUplinkBatchWithRetry(clusterId, vpnConnectionId, batch, workerIndex)) { + errors++; + AtomicLong[] senderErrors = uplinkSenderErrorsByWorker; + if (senderErrors != null && workerIndex >= 0 && workerIndex < senderErrors.length) { + senderErrors[workerIndex].incrementAndGet(); + } + recordUplinkDrop(Math.max(0, batchBytes - 4)); + writeRuntimeStatus("degraded", "uplink send failed after retry; continuing", 0, sentPackets, 0, errors); + writeRuntimeDetail("error", "uplink send failed after retry batch=" + batch.size(), "uplink_sender", sentPackets, errors, "SEND_RETRY_EXHAUSTED", workerIndex); + continue; + } + sentPackets += batch.size(); + if (sentPackets <= 5) { + Log.i(TAG, "vpn uplink sent batch worker=" + workerIndex + " sent_total=" + sentPackets); + } + recordUplinkSent(batch.size(), Math.max(0, batchBytes - 4)); + AtomicLong[] senderPackets = uplinkSenderPacketsByWorker; + if (senderPackets != null && workerIndex >= 0 && workerIndex < senderPackets.length) { + senderPackets[workerIndex].addAndGet(batch.size()); + } + writeRuntimeStatus("uplink_sent", "sent batch=" + batch.size(), 0, sentPackets, 0, errors); + writeRuntimeDetail("sent", "worker=" + workerIndex + " sent batch=" + batch.size(), "uplink_sender", sentPackets, errors, "", workerIndex); + } 
catch (InterruptedException e) { + if (!running) { + return; + } + writeRuntimeDetail("read_wait", "uplink queue wait interrupted", "uplink_sender", sentPackets, errors, e.getClass().getSimpleName(), workerIndex); + } catch (Exception e) { + if (running) { + Log.w(TAG, "vpn uplink batch send failed; continuing", e); + errors++; + AtomicLong[] senderErrors = uplinkSenderErrorsByWorker; + if (senderErrors != null && workerIndex >= 0 && workerIndex < senderErrors.length) { + senderErrors[workerIndex].incrementAndGet(); + } + writeRuntimeStatus("error", "uplink send failed: " + e.getMessage(), 0, sentPackets, 0, errors); + writeRuntimeDetail("error", "uplink send failed: " + e.getMessage(), "uplink_sender", sentPackets, errors, e.getClass().getSimpleName(), workerIndex); + try { + Thread.sleep(100); + } catch (InterruptedException interrupted) { + if (!running) { + return; + } + } + } + } + } + } + + private boolean sendUplinkBatchWithRetry(String clusterId, String vpnConnectionId, List batch, int workerIndex) { + Exception lastError = null; + int relayAttempts = Math.max(1, activePacketRelayUrlsByProfile == null ? 
1 : activePacketRelayUrlsByProfile.size()); + for (int relayAttempt = 0; relayAttempt < relayAttempts && running; relayAttempt++) { + String relayUrl = currentPacketRelayUrl(); + if (relayUrl == null || relayUrl.isEmpty()) { + return false; + } + if (sendUplinkBatchOverWebSocket(relayUrl, clusterId, vpnConnectionId, batch, workerIndex)) { + return true; + } + RapApiClient client = packetRelayClientForUrl(relayUrl); + for (int attempt = 0; attempt <= UPLINK_SEND_RETRY_COUNT && running; attempt++) { + try { + client.sendClientPacketBatch(clusterId, vpnConnectionId, batch); + if (attempt > 0) { + writeRuntimeDetail("retry_ok", "uplink retry ok worker=" + workerIndex + " relay=" + relayUrl + " attempt=" + attempt + " batch=" + batch.size(), "uplink_sender", -1, -1, "", workerIndex); + } + return true; + } catch (Exception e) { + lastError = e; + writeRuntimeDetail("retry", "uplink send retry worker=" + workerIndex + " relay=" + relayUrl + " attempt=" + attempt + " error=" + e.getClass().getSimpleName(), "uplink_sender", -1, -1, e.getClass().getSimpleName(), workerIndex); + sleepQuietly(UPLINK_SEND_RETRY_SLEEP_MS * (attempt + 1L)); + } + } + if (!switchPacketRelayUrl(relayUrl, lastError == null ? 
"send_failed" : lastError.getClass().getSimpleName())) { + break; + } + } + if (lastError != null) { + Log.w(TAG, "vpn uplink batch send failed after retry", lastError); + } + return false; + } + + private boolean sendUplinkBatchOverWebSocket(String relayUrl, String clusterId, String vpnConnectionId, List batch, int workerIndex) { + VpnPacketWebSocketRelay relay = packetWebSocketRelay; + if (relay == null || relayUrl == null || !relayUrl.equals(relay.baseUrl())) { + return false; + } + try { + if (relay.sendClientPacketBatch(clusterId, vpnConnectionId, batch)) { + writeRuntimeDetail("websocket_sent", "worker=" + workerIndex + " sent batch=" + batch.size(), "uplink_sender", -1, -1, "", workerIndex); + return true; + } + writeRuntimeDetail("websocket_send_fallback", "websocket send fallback " + relay.lastError(), "uplink_sender", -1, -1, "WEBSOCKET_SEND", workerIndex); + } catch (Exception e) { + writeRuntimeDetail("websocket_send_fallback", "websocket send failed: " + e.getMessage(), "uplink_sender", -1, -1, e.getClass().getSimpleName(), workerIndex); + } + return false; + } + + private boolean shouldForwardUplinkPacket(byte[] packet, int length) { + if (packet == null || length < 20) { + return false; + } + if (fastPathModeEnabled) { + int versionFast = (packet[0] >> 4) & 0x0f; + if (versionFast != 4) { + return false; + } + int ihlFast = (packet[0] & 0x0f) * 4; + if (ihlFast < 20 || length < ihlFast) { + return false; + } + if (isBroadcastOrMulticastIPv4(packet)) { + return false; + } + return !isBackendControlPlanePacket(packet, length, ihlFast); + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + if (isBroadcastOrMulticastIPv4(packet)) { + return false; + } + return !isBackendControlPlanePacket(packet, length, ihl); + } + + private boolean isUplinkBackendBypassPacket(byte[] packet, int length) { + if (packet == null || length < 20) { + 
return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + return ihl >= 20 && length >= ihl && isBackendControlPlanePacket(packet, length, ihl); + } + + private void configureBackendBypass(String backendUrl) { + backendBypassIPv4s = new byte[0][]; + backendBypassPort = 0; + try { + URI uri = URI.create(backendUrl == null ? "" : backendUrl); + byte[][] hosts = resolveBackendBypassIPv4(uri.getHost()); + if (hosts == null || hosts.length == 0) { + return; + } + int port = uri.getPort(); + if (port <= 0) { + port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80; + } + backendBypassIPv4s = hosts; + backendBypassPort = port; + } catch (Exception ignored) { + } + } + + private byte[][] resolveBackendBypassIPv4(String host) { + byte[] direct = ipv4Bytes(host); + if (direct != null) { + return new byte[][]{direct}; + } + if (host == null || host.trim().isEmpty()) { + return null; + } + try { + InetAddress[] addresses = InetAddress.getAllByName(host); + List result = new ArrayList<>(); + for (InetAddress address : addresses) { + if (address instanceof Inet4Address) { + result.add(address.getAddress()); + } + } + if (!result.isEmpty()) { + return result.toArray(new byte[0][]); + } + } catch (Exception ignored) { + } + return null; + } + + private boolean isBroadcastOrMulticastIPv4(byte[] packet) { + int first = packet[16] & 0xff; + return first >= 224 || first == 255; + } + + private boolean isBackendControlPlanePacket(byte[] packet, int length, int ihl) { + int port = backendBypassPort; + byte[][] hosts = backendBypassIPv4s; + if (hosts == null || hosts.length == 0 || port <= 0 || length < ihl + 4) { + return false; + } + if (!matchesBackendBypassAddress(packet)) { + return false; + } + int proto = packet[9] & 0xff; + if (proto != 6 && proto != 17) { + return false; + } + int dstPort = u16(packet, ihl + 2); + return dstPort == port; + } + + private boolean 
matchesBackendBypassAddress(byte[] packet) { + byte[][] hosts = backendBypassIPv4s; + if (hosts == null || hosts.length == 0) { + return false; + } + for (byte[] host : hosts) { + if (host == null || host.length != 4) { + continue; + } + if ((packet[16] & 0xff) == (host[0] & 0xff) + && (packet[17] & 0xff) == (host[1] & 0xff) + && (packet[18] & 0xff) == (host[2] & 0xff) + && (packet[19] & 0xff) == (host[3] & 0xff)) { + return true; + } + } + return false; + } + + private void setUnderlyingNetworks(Builder builder) { + if (Build.VERSION.SDK_INT < 22) { + return; + } + try { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return; + } + List networks = new ArrayList<>(); + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities == null) { + continue; + } + if (capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + continue; + } + if (!capabilities.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET)) { + continue; + } + networks.add(network); + } + if (!networks.isEmpty()) { + builder.setUnderlyingNetworks(networks.toArray(new Network[0])); + } + } catch (Exception e) { + Log.w(TAG, "vpn underlying networks not set", e); + } + } + + private Set configuredDnsServers() { + Set out = new LinkedHashSet<>(); + try { + String raw = getSharedPreferences(PREFS, MODE_PRIVATE).getString("dns_servers", ""); + if (raw == null || raw.trim().isEmpty()) { + return out; + } + String[] parts = raw.split(","); + for (String part : parts) { + String value = part == null ? 
"" : part.trim(); + if (!value.isEmpty()) { + out.add(value); + } + } + } catch (Exception ignored) { + } + return out; + } + + private Set configuredDnsProbeDomains() { + Set out = new LinkedHashSet<>(); + try { + String raw = getSharedPreferences(PREFS, MODE_PRIVATE).getString("dns_probe_domains", ""); + if (raw == null || raw.trim().isEmpty()) { + return out; + } + String[] parts = raw.split(","); + for (String part : parts) { + String value = part == null ? "" : part.trim(); + if (!value.isEmpty()) { + out.add(value); + } + } + } catch (Exception ignored) { + } + return out; + } + + private void startVPNReadinessWarmup(Set dnsServers, Set probeDomains, String vpnConnectionId) { + Thread thread = new Thread(() -> { + long deadline = System.currentTimeMillis() + VPN_START_WARMUP_TIMEOUT_MS; + writeRuntimeStatus("warming", "warming vpn dns and relay", 0, 0, downlinkReceivedPackets.get(), 0); + Network vpn = null; + while (running && System.currentTimeMillis() < deadline) { + vpn = vpnNetwork(); + if (vpn != null) { + break; + } + sleepQuietly(150); + } + String dnsInfo = dnsServers == null || dnsServers.isEmpty() ? "dns=none" : "dns=" + join(dnsServers); + LinkedHashSet domains = new LinkedHashSet<>(); + if (probeDomains != null) { + domains.addAll(probeDomains); + } + addDefaultDnsProbeDomains(domains); + int resolved = 0; + String last = ""; + long warmUntil = System.currentTimeMillis() + 30000; + int pass = 0; + while (running && (pass == 0 || System.currentTimeMillis() < warmUntil)) { + pass++; + int passResolved = 0; + for (String host : domains) { + if (!running) { + return; + } + try { + InetAddress[] addresses = vpn != null ? 
vpn.getAllByName(host) : InetAddress.getAllByName(host); + if (addresses != null && addresses.length > 0) { + resolved++; + passResolved++; + last = host + "=" + addresses[0].getHostAddress(); + writeRuntimeDetail("dns_warmup", "pass=" + pass + " " + last, "readiness", resolved, 0, "", -1); + } + } catch (Exception e) { + last = host + "=" + e.getClass().getSimpleName(); + writeRuntimeDetail("dns_warmup_failed", "pass=" + pass + " " + last, "readiness", resolved, 1, e.getClass().getSimpleName(), -1); + } + sleepQuietly(120); + } + if (passResolved >= Math.min(3, domains.size()) && pass >= 2) { + break; + } + sleepQuietly(750); + } + if (!running) { + return; + } + if (resolved > 0) { + writeRuntimeStatus("ready", "vpn ready; dns warmup ok " + resolved + " " + dnsInfo + " " + last, 0, 0, downlinkReceivedPackets.get(), 0); + } else { + writeRuntimeStatus("warming", "vpn started; dns warmup pending " + dnsInfo + " " + last, 0, 0, downlinkReceivedPackets.get(), 1); + } + Log.i(TAG, "vpn readiness warmup complete: connection=" + vpnConnectionId + " resolved=" + resolved + " " + dnsInfo + " " + last); + }, "rap-vpn-readiness-warmup"); + thread.start(); + } + + private static void addDefaultDnsProbeDomains(Set domains) { + if (domains == null) { + return; + } + for (String domain : DEFAULT_DNS_PROBE_DOMAINS) { + domains.add(domain); + } + } + + private Network vpnNetwork() { + try { + ConnectivityManager connectivity = (ConnectivityManager) getSystemService(CONNECTIVITY_SERVICE); + if (connectivity == null) { + return null; + } + for (Network network : connectivity.getAllNetworks()) { + NetworkCapabilities capabilities = connectivity.getNetworkCapabilities(network); + if (capabilities != null && capabilities.hasTransport(NetworkCapabilities.TRANSPORT_VPN)) { + return network; + } + } + } catch (Exception ignored) { + } + return null; + } + + private void pumpRelayToTun(String clusterId, String vpnConnectionId) { + long fetchedPackets = 0; + long errors = 0; + int 
downlinkPollMs = DOWNLINK_POLL_MS_MIN; + String relayUrl = ""; + RapApiClient client = null; + try { + while (running) { + try { + String currentRelayUrl = currentPacketRelayUrl(); + if (currentRelayUrl == null || currentRelayUrl.isEmpty()) { + Thread.sleep(100); + continue; + } + if (client == null || !currentRelayUrl.equals(relayUrl)) { + relayUrl = currentRelayUrl; + client = packetRelayClientForUrl(relayUrl); + } + List packets = receiveDownlinkBatch(relayUrl, client, clusterId, vpnConnectionId, downlinkPollMs); + for (byte[] packet : packets) { + if (!isIPv4Packet(packet)) { + recordDownlinkDrop(packet == null ? 0 : packet.length); + continue; + } + int length = effectiveIPv4Length(packet, packet.length); + if (length <= 0) { + errors++; + recordDownlinkDrop(packet.length); + writeRuntimeDetail("length_drop", packetSummary(packet, packet.length), "downlink", fetchedPackets, errors, "LENGTH"); + continue; + } + boolean restoredClientNAT = restoreClientSourceNATDestination(packet, length); + if (!fastPathModeEnabled && !hasIPv4Destination(packet, length)) { + long mismatch = downlinkDestinationMismatchPackets.incrementAndGet(); + if (mismatch > ADDRESS_MISMATCH_TOLERANCE_PACKETS) { + relaxedDownlinkDestinationValidation = true; + writeRuntimeStatus("recover", "downlink destination validation relaxed; packet checks will continue", 0, 0, 0, 0); + } else { + recordDownlinkDrop(length); + writeRuntimeDetail("length_drop", packetSummary(packet, packet.length), "downlink", fetchedPackets, errors, "DEST_MISMATCH"); + } + if (!relaxedDownlinkDestinationValidation) { + continue; + } + } + boolean transportChecksumWasValid = hasValidIPv4TransportChecksum(packet, length); + boolean normalized = normalizeIPv4PacketChecksums(packet, length); + if (normalized && (!transportChecksumWasValid || restoredClientNAT)) { + downlinkTransportChecksumRepairs.incrementAndGet(); + } + if (!normalized) { + errors++; + recordDownlinkDrop(length); + writeRuntimeDetail("normalize_drop", 
packetSummary(packet, length), "downlink", fetchedPackets, errors, "CHECKSUM_NORMALIZE"); + continue; + } + recordInboundTCPHandshake(packet, length); + if (offerDownlinkPacket(packet, length)) { + fetchedPackets++; + } else if (running) { + errors++; + recordDownlinkDrop(length); + writeRuntimeDetail("queue_drop", packetSummary(packet, length), "downlink", fetchedPackets, errors, "QUEUE_FULL"); + } + } + if (!packets.isEmpty()) { + downlinkPollMs = Math.max(DOWNLINK_POLL_MS_MIN, downlinkPollMs - DOWNLINK_POLL_MS_STEP); + writeRuntimeStatus("downlink", "queued batch=" + packets.size(), 0, 0, downlinkReceivedPackets.get(), errors); + } else if (fetchedPackets > 0) { + downlinkPollMs = Math.min(DOWNLINK_POLL_MS_MAX, downlinkPollMs + DOWNLINK_POLL_MS_STEP); + writeRuntimeStatus("downlink_idle", "waiting for gateway packets", 0, 0, downlinkReceivedPackets.get(), errors); + } + } catch (Exception e) { + if (running) { + Log.w(TAG, "vpn downlink receive failed; continuing", e); + errors++; + writeRuntimeStatus("error", "downlink failed: " + e.getMessage(), 0, 0, downlinkReceivedPackets.get(), errors); + writeRuntimeDetail("error", "downlink failed: " + e.getMessage(), "downlink", fetchedPackets, errors, e.getClass().getSimpleName()); + if (errors % 3 == 0) { + switchPacketRelayUrl(relayUrl, "downlink_" + e.getClass().getSimpleName()); + } + try { + Thread.sleep(100); + } catch (InterruptedException interrupted) { + if (!running) { + return; + } + } + } + } + } + } catch (Exception e) { + if (running) { + Log.e(TAG, "vpn downlink stopped", e); + writeRuntimeStatus("error", "downlink stopped: " + e.getMessage(), 0, 0, downlinkReceivedPackets.get(), errors); + writeRuntimeDetail("stopped", "downlink stopped: " + e.getMessage(), "downlink", fetchedPackets, errors, e.getClass().getSimpleName()); + } + } + } + + private List receiveDownlinkBatch(String relayUrl, RapApiClient client, String clusterId, String vpnConnectionId, int timeoutMs) throws Exception { + 
VpnPacketWebSocketRelay relay = packetWebSocketRelay; + if (relay != null && relayUrl != null && relayUrl.equals(relay.baseUrl())) { + List packets = relay.receiveClientPacketBatch(clusterId, vpnConnectionId, timeoutMs); + if (!packets.isEmpty()) { + writeRuntimeDetail("websocket_received", "received batch=" + packets.size(), "downlink", -1, -1, "", -1); + return packets; + } + if (relay.isOpen()) { + return packets; + } + writeRuntimeDetail("websocket_receive_fallback", "websocket receive fallback " + relay.lastError(), "downlink", -1, -1, "WEBSOCKET_RECEIVE", -1); + } + return client.receiveClientPacketBatch(clusterId, vpnConnectionId, timeoutMs); + } + + private boolean offerDownlinkPacket(byte[] packet, int length) throws InterruptedException { + BlockingQueue[] queues = downlinkQueues; + if (queues == null || queues.length == 0 || length <= 0 || packet == null || length > packet.length) { + return false; + } + byte[] copy = new byte[length]; + System.arraycopy(packet, 0, copy, 0, length); + if (isTCPPriorityPacket(copy, length)) { + BlockingQueue priority = downlinkPriorityQueue; + if (priority != null && priority.offer(copy)) { + downlinkQueuedPackets.incrementAndGet(); + return true; + } + } + int queueIndex = shardForDownlinkPacket(copy, length); + if (queueIndex < 0 || queueIndex >= queues.length) { + queueIndex = Math.abs(queueIndex) % queues.length; + } + BlockingQueue queue = queues[queueIndex]; + if (queue == null) { + return false; + } + AtomicLong[] offers = downlinkQueueOffersByFlow; + if (offers != null && queueIndex < offers.length) { + offers[queueIndex].incrementAndGet(); + } + for (int attempt = 0; running && attempt < 2; attempt++) { + if (queue.offer(copy, DOWNLINK_QUEUE_OFFER_MS, TimeUnit.MILLISECONDS)) { + downlinkQueuedPackets.incrementAndGet(); + return true; + } + downlinkQueueWaits.incrementAndGet(); + writeRuntimeDetail("queue_wait", "downlink flow queue backpressure index=" + queueIndex + " depth=" + queue.size(), "downlink", 
downlinkReceivedPackets.get(), 0, "QUEUE_WAIT"); + } + AtomicLong[] drops = downlinkQueueDropsByFlow; + if (drops != null && queueIndex < drops.length) { + drops[queueIndex].incrementAndGet(); + } + return false; + } + + private void pumpDownlinkQueueToTun() { + long writtenPackets = 0; + long errors = 0; + FileDescriptor fd = null; + int nextQueueIndex = 0; + try { + fd = Os.dup(tunnel.getFileDescriptor()); + downlinkTunFd = fd; + while (running) { + BlockingQueue[] queues = downlinkQueues; + if (queues == null || queues.length == 0) { + return; + } + QueueTakeResult take = pollDownlinkFlowQueues(queues, nextQueueIndex, DOWNLINK_WRITER_IDLE_WAIT_MS); + byte[] packet = take.packet; + if (packet == null) { + if (writtenPackets > 0) { + writeRuntimeDetail("writer_idle", "waiting for queued packets", "downlink_writer", writtenPackets, errors, ""); + } + continue; + } + nextQueueIndex = (take.queueIndex + 1) % queues.length; + if (writePacketToTun(fd, packet, packet.length)) { + writtenPackets++; + recordDownlinkReceived(packet.length); + AtomicLong[] writerPackets = downlinkWriterPacketsByFlow; + if (writerPackets != null && take.queueIndex >= 0 && take.queueIndex < writerPackets.length) { + writerPackets[take.queueIndex].incrementAndGet(); + } + writeRuntimeDetail("write", packetSummary(packet, packet.length), "downlink_writer", writtenPackets, errors, ""); + } else { + errors++; + recordDownlinkDrop(packet.length); + writeRuntimeDetail("write_drop", packetSummary(packet, packet.length), "downlink_writer", writtenPackets, errors, "EAGAIN"); + } + } + } catch (Exception e) { + if (running) { + Log.e(TAG, "vpn downlink writer stopped", e); + writeRuntimeStatus("error", "downlink writer stopped: " + e.getMessage(), 0, 0, writtenPackets, errors); + writeRuntimeDetail("stopped", "downlink writer stopped: " + e.getMessage(), "downlink_writer", writtenPackets, errors, e.getClass().getSimpleName()); + } + } finally { + closeFdQuietly(fd); + if (downlinkTunFd == fd) { + 
downlinkTunFd = null; + } + } + } + + private QueueTakeResult pollDownlinkFlowQueues(BlockingQueue[] queues, int startIndex, long waitMs) throws InterruptedException { + BlockingQueue priority = downlinkPriorityQueue; + if (priority != null) { + byte[] packet = priority.poll(); + if (packet != null) { + return new QueueTakeResult(packet, startIndex < 0 ? 0 : startIndex); + } + } + if (queues == null || queues.length == 0) { + return new QueueTakeResult(null, 0); + } + int start = startIndex; + if (start < 0 || start >= queues.length) { + start = 0; + } + for (int offset = 0; offset < queues.length; offset++) { + int index = (start + offset) % queues.length; + BlockingQueue queue = queues[index]; + if (queue == null) { + continue; + } + byte[] packet = queue.poll(); + if (packet != null) { + return new QueueTakeResult(packet, index); + } + } + BlockingQueue waitQueue = queues[start]; + if (waitQueue == null) { + Thread.sleep(Math.max(1, waitMs)); + return new QueueTakeResult(null, start); + } + long deadline = System.currentTimeMillis() + Math.max(1, waitMs); + while (running) { + if (priority != null) { + byte[] packet = priority.poll(); + if (packet != null) { + return new QueueTakeResult(packet, start); + } + } + byte[] packet = waitQueue.poll(2, TimeUnit.MILLISECONDS); + if (packet != null) { + return new QueueTakeResult(packet, start); + } + if (System.currentTimeMillis() >= deadline) { + return new QueueTakeResult(null, start); + } + } + return new QueueTakeResult(null, start); + } + + private static class QueueTakeResult { + final byte[] packet; + final int queueIndex; + + QueueTakeResult(byte[] packet, int queueIndex) { + this.packet = packet; + this.queueIndex = queueIndex; + } + } + + private boolean writePacketToTun(FileDescriptor fd, byte[] packet, int packetLength) throws Exception { + int offset = 0; + int attempts = 0; + if (packetLength < 20 || packetLength > packet.length) { + return false; + } + while (running && offset < packetLength) { + try { + 
int written = Os.write(fd, packet, offset, packetLength - offset); + if (written > 0) { + offset += written; + attempts = 0; + continue; + } + } catch (ErrnoException e) { + if (e.errno != OsConstants.EAGAIN) { + throw e; + } + } + attempts++; + if (attempts > TUN_WRITE_MAX_RETRIES) { + return false; + } + sleepQuietly(TUN_EAGAIN_SLEEP_MS); + } + return offset == packetLength; + } + + private int effectiveIPv4Length(byte[] packet, int maxLength) { + if (packet == null || maxLength < 20) { + return -1; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return -1; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || maxLength < ihl) { + return -1; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0) { + return maxLength; + } + if (totalLength < ihl || totalLength > maxLength) { + return -1; + } + return totalLength; + } + + private void sleepQuietly(long millis) { + try { + Thread.sleep(millis); + } catch (InterruptedException e) { + if (!running) { + Thread.currentThread().interrupt(); + } + } + } + + private void closeFdQuietly(FileDescriptor fd) { + if (fd == null) { + return; + } + try { + Os.close(fd); + } catch (Exception ignored) { + } + } + + private void closeTunHandles() { + FileInputStream input = uplinkTunInput; + if (input != null) { + try { + input.close(); + } catch (Exception ignored) { + } + } + closeFdQuietly(uplinkTunFd); + closeFdQuietly(downlinkTunFd); + uplinkTunInput = null; + uplinkTunFd = null; + downlinkTunFd = null; + } + + private void closeTunnelQuietly() { + if (tunnel == null) { + return; + } + try { + tunnel.close(); + } catch (Exception ignored) { + } finally { + tunnel = null; + } + } + + private void refreshRuntimeRates(long now) { + if (now <= 0) { + now = System.currentTimeMillis(); + } + if (lastThroughputCalcAt <= 0) { + lastThroughputCalcAt = now; + lastRateUplinkReadBytes = uplinkReadBytes.get(); + lastRateUplinkSentBytes = uplinkSentBytes.get(); + lastRateDownlinkReceivedBytes = 
downlinkReceivedBytes.get(); + return; + } + long elapsed = now - lastThroughputCalcAt; + if (elapsed < 250) { + return; + } + long readDelta = Math.max(0, uplinkReadBytes.get() - lastRateUplinkReadBytes); + long sentDelta = Math.max(0, uplinkSentBytes.get() - lastRateUplinkSentBytes); + long downDelta = Math.max(0, downlinkReceivedBytes.get() - lastRateDownlinkReceivedBytes); + float seconds = Math.max(0.001f, elapsed / 1000f); + uplinkReadMbps = (float) (readDelta * 8.0d / 1_000_000d / seconds); + uplinkSentMbps = (float) (sentDelta * 8.0d / 1_000_000d / seconds); + downlinkReceivedMbps = (float) (downDelta * 8.0d / 1_000_000d / seconds); + /* NOTE: readDelta, sentDelta and downDelta are byte deltas, so the Pps fields below currently hold bytes per second, not packets per second; true packet rates would need deltas of the packet counters. */ + uplinkReadPps = (float) (readDelta / seconds); + uplinkSentPps = (float) (sentDelta / seconds); + downlinkReceivedPps = (float) (downDelta / seconds); + lastThroughputCalcAt = now; + lastRateUplinkReadBytes = uplinkReadBytes.get(); + lastRateUplinkSentBytes = uplinkSentBytes.get(); + lastRateDownlinkReceivedBytes = downlinkReceivedBytes.get(); + } + + private void writeRuntimeStatus(String state, String message, long readPackets, long sentPackets, long receivedPackets, long errors) { + long now = System.currentTimeMillis(); + boolean important = "error".equals(state) + || "stopped".equals(state) + || "relay".equals(state) + || "relay_reset_warning".equals(state) + || "tunnel".equals(state) + || "relay_reset".equals(state) + || "runtime_recovery".equals(state) + || "downlink_restart".equals(state); + if (!important && now - lastRuntimeStatusAt < RUNTIME_STATUS_INTERVAL_MS) { + return; + } + if (!important) { + lastRuntimeStatusAt = now; + } + refreshRuntimeRates(now); + try { + SharedPreferences.Editor editor = getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("state", state) + .putString("message", message == null ? 
"" : message) + .putLong("updated_at", now) + .putLong("runtime_started_at", runtimeStartedAt) + .putLong("uplink_read_total", uplinkReadPackets.get()) + .putLong("uplink_read_bytes", uplinkReadBytes.get()) + .putLong("uplink_sent_total", uplinkSentPackets.get()) + .putLong("uplink_sent_bytes", uplinkSentBytes.get()) + .putLong("downlink_received_total", downlinkReceivedPackets.get()) + .putLong("downlink_received_bytes", downlinkReceivedBytes.get()) + .putLong("downlink_queued_packets", downlinkQueuedPackets.get()) + .putLong("downlink_queue_waits", downlinkQueueWaits.get()) + .putLong("uplink_dropped_packets", uplinkDroppedPackets.get()) + .putLong("uplink_dropped_bytes", uplinkDroppedBytes.get()) + .putLong("uplink_filtered_packets", uplinkFilteredPackets.get()) + .putLong("uplink_filtered_bytes", uplinkFilteredBytes.get()) + .putLong("uplink_bypassed_control_packets", uplinkBypassedControlPackets.get()) + .putLong("uplink_bypassed_control_bytes", uplinkBypassedControlBytes.get()) + .putLong("downlink_dropped_packets", downlinkDroppedPackets.get()) + .putLong("downlink_dropped_bytes", downlinkDroppedBytes.get()) + .putLong("downlink_transport_checksum_repairs", downlinkTransportChecksumRepairs.get()) + .putLong("local_dns_queries", localDnsQueries.get()) + .putLong("local_dns_replies", localDnsReplies.get()) + .putLong("local_dns_errors", localDnsErrors.get()) + .putLong("runtime_watchdog_recoveries", runtimeWatchdogRecoveries.get()) + .putLong("tcp_handshake_stalls", tcpHandshakeStalls.get()) + .putLong("runtime_watchdog_hard_restarts", runtimeWatchdogHardRestarts.get()) + .putLong("uplink_source_mismatch_packets", uplinkSourceMismatchPackets.get()) + .putLong("downlink_destination_mismatch_packets", downlinkDestinationMismatchPackets.get()) + .putFloat("uplink_read_mbps", uplinkReadMbps) + .putFloat("uplink_sent_mbps", uplinkSentMbps) + .putFloat("downlink_received_mbps", downlinkReceivedMbps) + .putFloat("uplink_read_pps", uplinkReadPps) + 
.putFloat("uplink_sent_pps", uplinkSentPps) + .putFloat("downlink_received_pps", downlinkReceivedPps); + if (readPackets > 0) { + editor.putLong("uplink_read", readPackets); + } + if (sentPackets > 0) { + editor.putLong("uplink_sent", sentPackets); + } + if (receivedPackets > 0) { + editor.putLong("downlink_received", receivedPackets); + } + if (errors > 0) { + editor.putLong("errors", errors); + } + writeDownlinkQueueStats(editor); + editor.apply(); + ensureDiagnosticFromRuntimeStatus(now); + } catch (Exception ignored) { + } + } + + private void writeRuntimeDetail(String state, String message, String prefix, long packets, long errors, String errorType) { + writeRuntimeDetail(state, message, prefix, packets, errors, errorType, -1); + } + + private void writeRuntimeDetail(String state, String message, String prefix, long packets, long errors, String errorType, int workerIndex) { + long now = System.currentTimeMillis(); + boolean important = "error".equals(state) + || "stopped".equals(state) + || "write_drop".equals(state) + || "source_drop".equals(state) + || "normalize_drop".equals(state) + || "length_drop".equals(state) + || ("downlink".equals(prefix) && !"batch".equals(state) && !"restart".equals(state) && !"running".equals(state)) + || ("downlink_writer".equals(prefix) && !"write".equals(state) && !"writer_idle".equals(state)); + if (!important && now - lastRuntimeDetailAt < RUNTIME_DETAIL_INTERVAL_MS) { + return; + } + lastRuntimeDetailAt = now; + try { + SharedPreferences.Editor editor = getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString(prefix + "_state", state == null ? "" : state) + .putString(prefix + "_message", message == null ? 
"" : message) + .putLong(prefix + "_updated_at", System.currentTimeMillis()) + .putBoolean(prefix + "_thread_alive", Thread.currentThread().isAlive()); + if (packets >= 0) { + editor.putLong(prefix + "_packets", packets); + } + if (errors >= 0) { + editor.putLong(prefix + "_errors", errors); + } + if ("uplink".equals(prefix)) { + editor.putLong(prefix + "_bytes", uplinkReadBytes.get()); + editor.putFloat(prefix + "_rate_mbps", uplinkReadMbps); + editor.putFloat(prefix + "_rate_pps", uplinkReadPps); + } else if ("uplink_sender".equals(prefix)) { + editor.putLong(prefix + "_bytes", uplinkSentBytes.get()); + editor.putFloat(prefix + "_rate_mbps", uplinkSentMbps); + editor.putFloat(prefix + "_rate_pps", uplinkSentPps); + if (workerIndex >= 0) { + AtomicLong[] senderPackets = uplinkSenderPacketsByWorker; + AtomicLong[] senderErrors = uplinkSenderErrorsByWorker; + if (senderPackets != null && workerIndex < senderPackets.length) { + editor.putLong(prefix + "_worker_packets_" + workerIndex, senderPackets[workerIndex].get()); + } + if (senderErrors != null && workerIndex < senderErrors.length) { + editor.putLong(prefix + "_worker_errors_" + workerIndex, senderErrors[workerIndex].get()); + } + } + } else if ("downlink".equals(prefix)) { + editor.putLong(prefix + "_bytes", downlinkReceivedBytes.get()); + editor.putFloat(prefix + "_rate_mbps", downlinkReceivedMbps); + editor.putFloat(prefix + "_rate_pps", downlinkReceivedPps); + } else if ("downlink_writer".equals(prefix)) { + editor.putLong(prefix + "_bytes", downlinkReceivedBytes.get()); + editor.putFloat(prefix + "_rate_mbps", downlinkReceivedMbps); + editor.putFloat(prefix + "_rate_pps", downlinkReceivedPps); + } else { + editor.putLong(prefix + "_bytes", 0); + editor.putFloat(prefix + "_rate_mbps", 0f); + editor.putFloat(prefix + "_rate_pps", 0f); + } + editor.putString(prefix + "_error_type", errorType == null ? 
"" : errorType); + if ("downlink".equals(prefix)) { + editor.putLong("downlink_restarts", downlinkRestarts); + } + writeDownlinkQueueStats(editor); + editor.putLong("downlink_queued_packets", downlinkQueuedPackets.get()); + editor.putLong("downlink_queue_waits", downlinkQueueWaits.get()); + AtomicLong[] queueOffers = uplinkQueueOffersByWorker; + AtomicLong[] queueDrops = uplinkQueueDropsByWorker; + BlockingQueue[] queues = uplinkQueues; + int workerCount = queues != null ? queues.length : 0; + int depth = 0; + if (queues != null) { + for (BlockingQueue queue : queues) { + if (queue != null) { + depth += queue.size(); + } + } + } + editor.putInt("uplink_worker_count", workerCount); + if (queues != null && queues.length > 0) { + int[] queueDepths = new int[queues.length]; + int maxDepth = 0; + for (int i = 0; i < queues.length; i++) { + BlockingQueue queue = queues[i]; + int queueDepth = queue == null ? 0 : queue.size(); + queueDepths[i] = queueDepth; + maxDepth = Math.max(maxDepth, queueDepth); + if (queueOffers != null && i < queueOffers.length) { + editor.putLong("uplink_queue_" + i + "_offers", queueOffers[i].get()); + } + if (queueDrops != null && i < queueDrops.length) { + editor.putLong("uplink_queue_" + i + "_drops", queueDrops[i].get()); + } + } + editor.putString("uplink_queue_depths", encodeIntArray(queueDepths)); + editor.putInt("uplink_queue_depth_max", maxDepth); + editor.putInt("uplink_queue_depth_total", depth); + editor.putInt("uplink_queue_depth", depth); + } else { + editor.putInt("uplink_queue_depth", 0); + } + editor.apply(); + } catch (Exception ignored) { + } + } + + private void writeDownlinkQueueStats(SharedPreferences.Editor editor) { + BlockingQueue[] queues = downlinkQueues; + int depth = 0; + int maxDepth = 0; + int[] queueDepths = queues == null ? 
new int[0] : new int[queues.length]; + AtomicLong[] offers = downlinkQueueOffersByFlow; + AtomicLong[] drops = downlinkQueueDropsByFlow; + AtomicLong[] written = downlinkWriterPacketsByFlow; + if (queues != null) { + for (int i = 0; i < queues.length; i++) { + BlockingQueue queue = queues[i]; + int queueDepth = queue == null ? 0 : queue.size(); + queueDepths[i] = queueDepth; + depth += queueDepth; + maxDepth = Math.max(maxDepth, queueDepth); + if (offers != null && i < offers.length) { + editor.putLong("downlink_queue_" + i + "_offers", offers[i].get()); + } + if (drops != null && i < drops.length) { + editor.putLong("downlink_queue_" + i + "_drops", drops[i].get()); + } + if (written != null && i < written.length) { + editor.putLong("downlink_writer_flow_packets_" + i, written[i].get()); + } + } + } + editor.putString("flow_isolation_mode", "uplink_hash_workers_downlink_flow_round_robin"); + editor.putInt("downlink_flow_queue_count", queues == null ? 0 : queues.length); + editor.putString("downlink_queue_depths", encodeIntArray(queueDepths)); + editor.putInt("downlink_queue_depth_max", maxDepth); + editor.putInt("downlink_queue_depth_total", depth); + editor.putInt("downlink_queue_depth", depth); + } + + private boolean isIPv4Packet(byte[] packet) { + return packet != null && packet.length >= 20 && ((packet[0] >> 4) & 0x0f) == 4; + } + + private String getLastVpnConnectionId() { + try { + return getSharedPreferences(PREFS, MODE_PRIVATE).getString("vpn_connection_id", ""); + } catch (Exception ignored) { + return ""; + } + } + + private void persistVpnConnectionId(String vpnConnectionId) { + if (vpnConnectionId == null || vpnConnectionId.isEmpty()) { + return; + } + try { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("vpn_connection_id", vpnConnectionId) + .apply(); + } catch (Exception ignored) { + } + } + + private String encodeIntArray(int[] values) { + if (values == null || values.length == 0) { + return ""; + } + StringBuilder out = new 
StringBuilder(); + for (int i = 0; i < values.length; i++) { + if (i > 0) { + out.append(","); + } + out.append(values[i]); + } + return out.toString(); + } + + private boolean normalizeIPv4PacketChecksums(byte[] packet, int length) { + if (packet == null || length < 20) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0 || totalLength > length) { + totalLength = length; + } + if (totalLength < ihl) { + return false; + } + packet[10] = 0; + packet[11] = 0; + putU16(packet, 10, checksum(packet, 0, ihl)); + + int proto = packet[9] & 0xff; + int fragFlags = u16(packet, 6); + boolean transportHeaderPresent = (fragFlags & 0x1fff) == 0; + int payloadOffset = ihl; + int payloadLength = totalLength - ihl; + if (transportHeaderPresent && proto == 6 && payloadLength >= 20) { + packet[payloadOffset + 16] = 0; + packet[payloadOffset + 17] = 0; + putU16(packet, payloadOffset + 16, transportChecksum(packet, payloadOffset, payloadLength, proto)); + } else if (transportHeaderPresent && proto == 17 && payloadLength >= 8) { + packet[payloadOffset + 6] = 0; + packet[payloadOffset + 7] = 0; + int sum = transportChecksum(packet, payloadOffset, payloadLength, proto); + putU16(packet, payloadOffset + 6, sum == 0 ? 
0xffff : sum); + } else if (transportHeaderPresent && proto == 1 && payloadLength >= 4) { + packet[payloadOffset + 2] = 0; + packet[payloadOffset + 3] = 0; + putU16(packet, payloadOffset + 2, checksum(packet, payloadOffset, payloadLength)); + } + return true; + } + + private boolean normalizeIPv4HeaderChecksum(byte[] packet, int length) { + if (packet == null || length < 20) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0 || totalLength > length) { + totalLength = length; + } + if (totalLength < ihl) { + return false; + } + packet[10] = 0; + packet[11] = 0; + putU16(packet, 10, checksum(packet, 0, ihl)); + return true; + } + + private boolean hasValidIPv4TransportChecksum(byte[] packet, int length) { + if (packet == null || length < 20) { + return false; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return false; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return false; + } + int totalLength = u16(packet, 2); + if (totalLength <= 0 || totalLength > length) { + totalLength = length; + } + if (totalLength < ihl) { + return false; + } + + int proto = packet[9] & 0xff; + int fragFlags = u16(packet, 6); + boolean firstFragment = (fragFlags & 0x1fff) == 0; + if (!firstFragment) { + return true; + } + + int payloadOffset = ihl; + int payloadLength = totalLength - ihl; + if (proto == 6 && payloadLength >= 20) { + return transportChecksum(packet, payloadOffset, payloadLength, proto) == 0; + } + if (proto == 17 && payloadLength >= 8) { + int current = u16(packet, payloadOffset + 6); + return current == 0 || transportChecksum(packet, payloadOffset, payloadLength, proto) == 0; + } + if (proto == 1 && payloadLength >= 4) { + return checksum(packet, payloadOffset, payloadLength) == 0; + } + return true; + } + + private int transportChecksum(byte[] 
packet, int payloadOffset, int payloadLength, int proto) { + long sum = 0; + sum += u16(packet, 12); + sum += u16(packet, 14); + sum += u16(packet, 16); + sum += u16(packet, 18); + sum += proto & 0xff; + sum += payloadLength & 0xffff; + sum += checksumSum(packet, payloadOffset, payloadLength); + return finishChecksum(sum); + } + + private int checksum(byte[] packet, int offset, int length) { + return finishChecksum(checksumSum(packet, offset, length)); + } + + private long checksumSum(byte[] packet, int offset, int length) { + long sum = 0; + int end = offset + length; + int i = offset; + while (i + 1 < end) { + sum += u16(packet, i); + i += 2; + } + if (i < end) { + sum += (packet[i] & 0xff) << 8; + } + return sum; + } + + private int finishChecksum(long sum) { + while ((sum >> 16) != 0) { + sum = (sum & 0xffff) + (sum >> 16); + } + return (int) (~sum) & 0xffff; + } + + private Notification notification() { + if (Build.VERSION.SDK_INT >= 26) { + NotificationChannel channel = new NotificationChannel(CHANNEL_ID, "RAP VPN", NotificationManager.IMPORTANCE_LOW); + getSystemService(NotificationManager.class).createNotificationChannel(channel); + } + Notification.Builder builder = Build.VERSION.SDK_INT >= 26 ? 
new Notification.Builder(this, CHANNEL_ID) : new Notification.Builder(this); + return builder + .setContentTitle("RAP VPN") + .setContentText("VPN tunnel is active") + .setSmallIcon(android.R.drawable.stat_sys_upload_done) + .build(); + } + + private String packetSummary(byte[] packet, int length) { + if (packet == null || length < 20) { + return "size=" + length; + } + int version = (packet[0] >> 4) & 0x0f; + if (version != 4) { + return "size=" + length + " ip_version=" + version; + } + int ihl = (packet[0] & 0x0f) * 4; + if (ihl < 20 || length < ihl) { + return "size=" + length + " ipv4=truncated"; + } + int proto = packet[9] & 0xff; + String base = "size=" + length + + " " + ipv4(packet, 12) + + " -> " + ipv4(packet, 16) + + " proto=" + proto; + if ((proto == 6 || proto == 17) && length >= ihl + 4) { + int srcPort = u16(packet, ihl); + int dstPort = u16(packet, ihl + 2); + base += " " + srcPort + "->" + dstPort; + if (proto == 6 && length >= ihl + 14) { + base += " flags=" + tcpFlags(packet[ihl + 13] & 0xff); + } + } else if (proto == 1 && length >= ihl + 2) { + base += " icmp_type=" + (packet[ihl] & 0xff) + " icmp_code=" + (packet[ihl + 1] & 0xff); + } + return base; + } + + private String ipv4(byte[] packet, int offset) { + return (packet[offset] & 0xff) + "." + + (packet[offset + 1] & 0xff) + "." + + (packet[offset + 2] & 0xff) + "." 
+ + (packet[offset + 3] & 0xff); + } + + private int u16(byte[] packet, int offset) { + return ((packet[offset] & 0xff) << 8) | (packet[offset + 1] & 0xff); + } + + private void putU16(byte[] packet, int offset, int value) { + packet[offset] = (byte) ((value >> 8) & 0xff); + packet[offset + 1] = (byte) (value & 0xff); + } + + private String tcpFlags(int flags) { + StringBuilder out = new StringBuilder(); + if ((flags & 0x02) != 0) out.append("S"); + if ((flags & 0x10) != 0) out.append("A"); + if ((flags & 0x01) != 0) out.append("F"); + if ((flags & 0x04) != 0) out.append("R"); + if ((flags & 0x08) != 0) out.append("P"); + return out.length() == 0 ? String.valueOf(flags) : out.toString(); + } + + private static class VpnClientConfig { + String vpnAddress; + String selectedConnectionId; + boolean fullTunnel = true; + int mtu = DEFAULT_VPN_MTU; + String dataplaneSessionStatus = ""; + String dataplanePreferredTransport = ""; + String dataplaneFallbackTransport = ""; + String dataplaneEntryNodeId = ""; + String dataplaneExitNodeId = ""; + String dataplaneSelectedTransport = ""; + String packetRelayBaseUrl = ""; + final List packetRelayBaseUrls = new ArrayList<>(); + FabricServiceChannel fabricServiceChannel = new FabricServiceChannel(); + int dataplaneTransportCandidateCount = 0; + int dataplaneEntryCandidateCount = 0; + final Set configNotes = new LinkedHashSet<>(); + final Set dnsServers = new LinkedHashSet<>(); + final Set dnsProbeDomains = new LinkedHashSet<>(); + final Set splitRoutes = new LinkedHashSet<>(); + + Set effectiveRoutes() { + LinkedHashSet routes = new LinkedHashSet<>(); + if (fullTunnel) { + routes.add("0.0.0.0/0"); + return routes; + } + routes.addAll(splitRoutes); + return routes; + } + + Set effectiveDnsProbeDomains() { + LinkedHashSet domains = new LinkedHashSet<>(); + domains.addAll(dnsProbeDomains); + addDefaultDnsProbeDomains(domains); + return domains; + } + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/RdpActivity.java 
b/clients/android/app/src/main/java/su/cin/rapvpn/RdpActivity.java new file mode 100644 index 0000000..fd6b3d3 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/RdpActivity.java @@ -0,0 +1,209 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.graphics.Bitmap; +import android.graphics.BitmapFactory; +import android.os.Bundle; +import android.util.Base64; +import android.view.MotionEvent; +import android.view.View; +import android.widget.FrameLayout; +import android.widget.ImageView; +import android.widget.TextView; + +import org.json.JSONObject; + +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.charset.StandardCharsets; +import java.util.UUID; + +import okhttp3.OkHttpClient; +import okhttp3.Request; +import okhttp3.Response; +import okhttp3.WebSocket; +import okhttp3.WebSocketListener; + +public class RdpActivity extends Activity { + static final String EXTRA_SESSION_RESULT = "session_result"; + static final String EXTRA_GATEWAY_URL = "gateway_url"; + static final String EXTRA_RESOURCE_NAME = "resource_name"; + + private final OkHttpClient http = new OkHttpClient(); + private ImageView desktop; + private TextView overlay; + private WebSocket webSocket; + private int desktopWidth = 1; + private int desktopHeight = 1; + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + getWindow().getDecorView().setSystemUiVisibility( + View.SYSTEM_UI_FLAG_FULLSCREEN + | View.SYSTEM_UI_FLAG_HIDE_NAVIGATION + | View.SYSTEM_UI_FLAG_IMMERSIVE_STICKY + | View.SYSTEM_UI_FLAG_LAYOUT_FULLSCREEN + | View.SYSTEM_UI_FLAG_LAYOUT_HIDE_NAVIGATION + | View.SYSTEM_UI_FLAG_LAYOUT_STABLE); + + FrameLayout root = new FrameLayout(this); + root.setBackgroundColor(0xff05090c); + desktop = new ImageView(this); + desktop.setScaleType(ImageView.ScaleType.FIT_CENTER); + desktop.setBackgroundColor(0xff05090c); + desktop.setOnTouchListener((view, event) -> { + sendTouch(event); + return 
true; + }); + overlay = new TextView(this); + overlay.setTextColor(0xffffffff); + overlay.setTextSize(14); + overlay.setBackgroundColor(0x66000000); + overlay.setPadding(14, 10, 14, 10); + overlay.setText("Подключение..."); + root.addView(desktop, new FrameLayout.LayoutParams(-1, -1)); + root.addView(overlay, new FrameLayout.LayoutParams(-2, -2)); + setContentView(root); + connect(); + } + + @Override + protected void onDestroy() { + if (webSocket != null) { + webSocket.close(1000, "activity closed"); + } + super.onDestroy(); + } + + private void connect() { + try { + JSONObject result = new JSONObject(getIntent().getStringExtra(EXTRA_SESSION_RESULT)); + JSONObject token = result.getJSONObject("attach_token"); + String attachToken = token.getString("token"); + String gatewayUrl = getIntent().getStringExtra(EXTRA_GATEWAY_URL); + String url = gatewayUrl + "?attach_token=" + attachToken; + runOnUiThread(() -> overlay.setText(getIntent().getStringExtra(EXTRA_RESOURCE_NAME))); + Request request = new Request.Builder().url(url).build(); + webSocket = http.newWebSocket(request, new WebSocketListener() { + @Override + public void onOpen(WebSocket webSocket, Response response) { + runOnUiThread(() -> overlay.setText("Подключено")); + } + + @Override + public void onMessage(WebSocket webSocket, String text) { + handleEnvelope(text); + } + + @Override + public void onFailure(WebSocket webSocket, Throwable t, Response response) { + runOnUiThread(() -> overlay.setText("Ошибка: " + t.getMessage())); + } + + @Override + public void onClosed(WebSocket webSocket, int code, String reason) { + runOnUiThread(() -> overlay.setText("Отключено")); + } + }); + } catch (Exception ex) { + overlay.setText("Ошибка запуска: " + ex.getMessage()); + } + } + + private void handleEnvelope(String text) { + try { + JSONObject envelope = new JSONObject(text); + String type = envelope.optString("type"); + if ("session.state".equals(type)) { + JSONObject payload = envelope.optJSONObject("payload"); + 
String state = payload == null ? "" : payload.optString("state", ""); + if (!state.isEmpty() && !"active".equals(state)) { + runOnUiThread(() -> overlay.setText("Сессия: " + state)); + } + return; + } + if (!"session.frame".equals(type)) { + return; + } + JSONObject payload = envelope.optJSONObject("payload"); + if (payload == null) { + return; + } + String frameData = payload.optString("frame_data", ""); + int width = payload.optInt("frame_width", payload.optInt("desktop_width", 0)); + int height = payload.optInt("frame_height", payload.optInt("desktop_height", 0)); + byte[] bytes = Base64.decode(frameData, Base64.DEFAULT); + Bitmap bitmap = decodeFrame(bytes, width, height, payload.optString("frame_format", "")); + if (bitmap != null) { + desktopWidth = Math.max(1, width); + desktopHeight = Math.max(1, height); + runOnUiThread(() -> { + desktop.setImageBitmap(bitmap); + overlay.setText(""); + }); + } + } catch (Exception ex) { + runOnUiThread(() -> overlay.setText("Кадр: " + ex.getMessage())); + } + } + + private Bitmap decodeFrame(byte[] bytes, int width, int height, String format) { + Bitmap compressed = BitmapFactory.decodeByteArray(bytes, 0, bytes.length); + if (compressed != null) { + return compressed; + } + if (width <= 0 || height <= 0 || bytes.length < width * height * 4) { + return null; + } + int[] colors = new int[width * height]; + ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN); + for (int i = 0; i < colors.length; i++) { + int b = buffer.get() & 0xff; + int g = buffer.get() & 0xff; + int r = buffer.get() & 0xff; + int a = buffer.get() & 0xff; + colors[i] = (a << 24) | (r << 16) | (g << 8) | b; + } + return Bitmap.createBitmap(colors, width, height, Bitmap.Config.ARGB_8888); + } + + private void sendTouch(MotionEvent event) { + if (webSocket == null || desktop.getWidth() <= 0 || desktop.getHeight() <= 0) { + return; + } + String action; + switch (event.getActionMasked()) { + case MotionEvent.ACTION_DOWN: + action = "down"; 
+ break; + case MotionEvent.ACTION_UP: + action = "up"; + break; + case MotionEvent.ACTION_MOVE: + action = "move"; + break; + default: + return; + } + double x = Math.max(0, Math.min(1, event.getX() / Math.max(1f, desktop.getWidth()))); + double y = Math.max(0, Math.min(1, event.getY() / Math.max(1f, desktop.getHeight()))); + try { + JSONObject payload = new JSONObject(); + payload.put("correlation_id", UUID.randomUUID().toString()); + payload.put("client_captured_at", java.time.Instant.now().toString()); + payload.put("kind", "mouse"); + payload.put("action", action); + payload.put("button", "left"); + payload.put("normalized_x", x); + payload.put("normalized_y", y); + payload.put("surface_width", desktopWidth); + payload.put("surface_height", desktopHeight); + JSONObject envelope = new JSONObject(); + envelope.put("type", "input"); + envelope.put("payload", payload); + webSocket.send(envelope.toString()); + } catch (Exception ignored) { + } + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java b/clients/android/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java new file mode 100644 index 0000000..a8a88f0 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/SecureTokenStore.java @@ -0,0 +1,90 @@ +package su.cin.rapvpn; + +import android.content.Context; +import android.content.SharedPreferences; +import android.security.keystore.KeyGenParameterSpec; +import android.security.keystore.KeyProperties; +import android.util.Base64; + +import java.nio.charset.StandardCharsets; +import java.security.KeyStore; +import java.util.Arrays; + +import javax.crypto.Cipher; +import javax.crypto.KeyGenerator; +import javax.crypto.SecretKey; +import javax.crypto.spec.GCMParameterSpec; + +final class SecureTokenStore { + private static final String PREFS = "rap-vpn-secure"; + private static final String KEY_ALIAS = "rap-vpn-refresh-token"; + private static 
final String ANDROID_KEYSTORE = "AndroidKeyStore"; + private static final int IV_LENGTH = 12; + private static final int TAG_LENGTH_BITS = 128; + + private final SharedPreferences prefs; + + SecureTokenStore(Context context) { + prefs = context.getSharedPreferences(PREFS, Context.MODE_PRIVATE); + } + + void put(String name, String value) throws Exception { + if (value == null || value.isEmpty()) { + remove(name); + return; + } + Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding"); + cipher.init(Cipher.ENCRYPT_MODE, key()); + byte[] ciphertext = cipher.doFinal(value.getBytes(StandardCharsets.UTF_8)); + byte[] iv = cipher.getIV(); + if (iv == null || iv.length == 0) { + throw new IllegalStateException("Android Keystore did not provide encryption IV"); + } + byte[] payload = new byte[iv.length + ciphertext.length]; + System.arraycopy(iv, 0, payload, 0, iv.length); + System.arraycopy(ciphertext, 0, payload, iv.length, ciphertext.length); + prefs.edit().putString(name, Base64.encodeToString(payload, Base64.NO_WRAP)).apply(); + } + + String get(String name) { + String encoded = prefs.getString(name, ""); + if (encoded.isEmpty()) { + return ""; + } + try { + byte[] payload = Base64.decode(encoded, Base64.NO_WRAP); + if (payload.length <= IV_LENGTH) { + return ""; + } + byte[] iv = Arrays.copyOfRange(payload, 0, IV_LENGTH); + byte[] ciphertext = Arrays.copyOfRange(payload, IV_LENGTH, payload.length); + Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding"); + cipher.init(Cipher.DECRYPT_MODE, key(), new GCMParameterSpec(TAG_LENGTH_BITS, iv)); + return new String(cipher.doFinal(ciphertext), StandardCharsets.UTF_8); + } catch (Exception ignored) { + return ""; + } + } + + void remove(String name) { + prefs.edit().remove(name).apply(); + } + + private SecretKey key() throws Exception { + KeyStore keyStore = KeyStore.getInstance(ANDROID_KEYSTORE); + keyStore.load(null); + KeyStore.Entry entry = keyStore.getEntry(KEY_ALIAS, null); + if (entry instanceof 
KeyStore.SecretKeyEntry) { + return ((KeyStore.SecretKeyEntry) entry).getSecretKey(); + } + KeyGenerator generator = KeyGenerator.getInstance(KeyProperties.KEY_ALGORITHM_AES, ANDROID_KEYSTORE); + generator.init(new KeyGenParameterSpec.Builder( + KEY_ALIAS, + KeyProperties.PURPOSE_ENCRYPT | KeyProperties.PURPOSE_DECRYPT) + .setBlockModes(KeyProperties.BLOCK_MODE_GCM) + .setEncryptionPaddings(KeyProperties.ENCRYPTION_PADDING_NONE) + .setRandomizedEncryptionRequired(true) + .build()); + return generator.generateKey(); + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java b/clients/android/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java new file mode 100644 index 0000000..19b9fd1 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/TestTrafficActivity.java @@ -0,0 +1,233 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.content.SharedPreferences; +import android.graphics.Color; +import android.os.Bundle; +import android.os.Handler; +import android.os.Looper; +import android.view.Gravity; +import android.webkit.WebChromeClient; +import android.webkit.WebResourceError; +import android.webkit.WebResourceRequest; +import android.webkit.WebResourceResponse; +import android.webkit.WebSettings; +import android.webkit.WebView; +import android.webkit.WebViewClient; +import android.widget.LinearLayout; +import android.widget.TextView; + +import java.net.HttpURLConnection; +import java.net.URL; + +public class TestTrafficActivity extends Activity { + static final String PREFS = "rap-vpn-browser-test"; + static final String EXTRA_URL = "url"; + + private TextView status; + private WebView webView; + private String target; + private int assetErrorCount; + private int mainErrorCount; + private int httpErrorCount; + private final Handler handler = new Handler(Looper.getMainLooper()); + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + 
LinearLayout layout = new LinearLayout(this); + layout.setOrientation(LinearLayout.VERTICAL); + layout.setBackgroundColor(Color.WHITE); + status = new TextView(this); + status.setTextColor(Color.rgb(20, 30, 40)); + status.setTextSize(14); + status.setGravity(Gravity.START); + status.setPadding(18, 14, 18, 14); + layout.addView(status, new LinearLayout.LayoutParams( + LinearLayout.LayoutParams.MATCH_PARENT, + LinearLayout.LayoutParams.WRAP_CONTENT)); + webView = new WebView(this); + layout.addView(webView, new LinearLayout.LayoutParams( + LinearLayout.LayoutParams.MATCH_PARENT, + 0, + 1f)); + setContentView(layout); + String url = getIntent().getStringExtra(EXTRA_URL); + if (url == null || url.isEmpty()) { + url = "http://192.168.200.61:18080/"; + } + target = url; + assetErrorCount = 0; + mainErrorCount = 0; + httpErrorCount = 0; + configureWebView(); + saveStatus("starting", "open " + target, 0, target, ""); + status.setText("Web test starting: " + target); + webView.loadUrl(target); + new Thread(() -> runRequest(target), "rap-test-traffic-http").start(); + } + + @Override + protected void onDestroy() { + try { + saveStatus("destroyed", "activity destroyed", webView == null ? 
0 : webView.getProgress(), target, ""); + if (webView != null) { + webView.stopLoading(); + webView.destroy(); + } + } catch (Exception ignored) { + } + super.onDestroy(); + } + + private void configureWebView() { + WebSettings settings = webView.getSettings(); + settings.setJavaScriptEnabled(true); + settings.setDomStorageEnabled(true); + settings.setDatabaseEnabled(true); + settings.setLoadsImagesAutomatically(true); + settings.setBlockNetworkLoads(false); + settings.setCacheMode(WebSettings.LOAD_NO_CACHE); + if (android.os.Build.VERSION.SDK_INT >= 21) { + settings.setMixedContentMode(WebSettings.MIXED_CONTENT_COMPATIBILITY_MODE); + } + webView.setWebChromeClient(new WebChromeClient() { + @Override + public void onProgressChanged(WebView view, int newProgress) { + String url = view == null ? target : view.getUrl(); + String title = view == null ? "" : view.getTitle(); + String message = "progress=" + newProgress + " title=" + safe(title); + status.setText(message + "\n" + safe(url)); + saveStatus(newProgress >= 100 ? "progress_complete" : "loading", message, newProgress, url, ""); + if (newProgress >= 100) { + scheduleDomProbe(1200); + scheduleDomProbe(5000); + scheduleDomProbe(12000); + } + } + }); + webView.setWebViewClient(new WebViewClient() { + @Override + public void onPageStarted(WebView view, String url, android.graphics.Bitmap favicon) { + assetErrorCount = 0; + mainErrorCount = 0; + httpErrorCount = 0; + status.setText("started\n" + safe(url)); + saveStatus("started", "page started", 0, url, ""); + } + + @Override + public void onPageFinished(WebView view, String url) { + int progress = Math.max(100, view == null ? 100 : view.getProgress()); + String title = view == null ? 
"" : view.getTitle(); + status.setText("finished progress=" + progress + "\n" + safe(title) + "\n" + safe(url)); + saveStatus("finished", "page finished title=" + safe(title), progress, url, ""); + scheduleDomProbe(1000); + scheduleDomProbe(5000); + scheduleDomProbe(12000); + } + + @Override + public void onReceivedError(WebView view, WebResourceRequest request, WebResourceError error) { + String url = request == null || request.getUrl() == null ? "" : request.getUrl().toString(); + String description = error == null ? "unknown" : String.valueOf(error.getDescription()); + boolean mainFrame = request != null && request.isForMainFrame(); + if (mainFrame) { + mainErrorCount++; + } else { + assetErrorCount++; + } + status.setText("error main=" + mainFrame + "\n" + description + "\n" + url); + saveStatus(mainFrame ? "main_error" : "asset_error", description, view == null ? 0 : view.getProgress(), url, mainFrame ? "MAIN" : "ASSET"); + } + + @Override + public void onReceivedHttpError(WebView view, WebResourceRequest request, WebResourceResponse errorResponse) { + String url = request == null || request.getUrl() == null ? "" : request.getUrl().toString(); + int code = errorResponse == null ? 0 : errorResponse.getStatusCode(); + boolean mainFrame = request != null && request.isForMainFrame(); + httpErrorCount++; + saveStatus(mainFrame ? "main_http_error" : "asset_http_error", "HTTP " + code, view == null ? 0 : view.getProgress(), url, mainFrame ? 
"HTTP_MAIN" : "HTTP_ASSET"); + } + }); + } + + private void runRequest(String target) { + String result; + try { + HttpURLConnection connection = (HttpURLConnection) new URL(target).openConnection(); + connection.setConnectTimeout(30000); + connection.setReadTimeout(30000); + connection.setInstanceFollowRedirects(false); + result = "HTTP " + connection.getResponseCode(); + connection.disconnect(); + } catch (Exception e) { + result = e.getClass().getSimpleName() + ": " + e.getMessage(); + } + String finalResult = result; + saveHttpProbe(finalResult); + runOnUiThread(() -> status.setText(status.getText() + "\nhttp_probe=" + finalResult)); + } + + private void saveStatus(String state, String message, int progress, String url, String errorType) { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("state", safe(state)) + .putString("message", safe(message)) + .putInt("progress", progress) + .putString("url", safe(url)) + .putString("target_url", safe(target)) + .putString("error_type", safe(errorType)) + .putInt("asset_error_count", assetErrorCount) + .putInt("main_error_count", mainErrorCount) + .putInt("http_error_count", httpErrorCount) + .putLong("updated_at", System.currentTimeMillis()) + .apply(); + } + + private void saveHttpProbe(String result) { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("http_probe", safe(result)) + .putLong("http_probe_at", System.currentTimeMillis()) + .apply(); + } + + private void scheduleDomProbe(long delayMs) { + handler.postDelayed(this::runDomProbe, Math.max(0, delayMs)); + } + + private void runDomProbe() { + if (webView == null) { + return; + } + String script = "(function(){" + + "function txt(e){return ((e.innerText||e.textContent||e.value||e.getAttribute('aria-label')||'')+'').replace(/\\s+/g,' ').trim();}" + + "function vis(e){var r=e.getBoundingClientRect();var s=getComputedStyle(e);return r.width>0&&r.height>0&&s.visibility!=='hidden'&&s.display!=='none';}" + + "var 
nodes=Array.prototype.slice.call(document.querySelectorAll('button,[role=button],input[type=button],input[type=submit],a'));" + + "var buttons=nodes.map(function(e){return {text:txt(e).slice(0,48),disabled:!!e.disabled||e.getAttribute('aria-disabled')==='true'||e.classList.contains('disabled'),visible:vis(e),tag:e.tagName,cls:(e.className||'').toString().slice(0,64)};}).filter(function(x){return x.text||/button/i.test(x.tag);}).slice(0,40);" + + "var start=buttons.filter(function(x){return /старт|start/i.test(x.text)||/start/i.test(x.cls);});" + + "var qms=(document.documentElement.innerHTML.match(/qms\\.ru/g)||[]).length;" + + "var out={readyState:document.readyState,title:document.title,scripts:document.scripts.length,buttons:buttons.length,start:start,qms:qms,url:location.href};" + + "return JSON.stringify(out);" + + "})()"; + try { + webView.evaluateJavascript(script, value -> { + String probe = safe(value); + saveDomProbe(probe); + status.setText(status.getText() + "\ndom_probe=" + probe); + }); + } catch (Exception e) { + saveDomProbe(e.getClass().getSimpleName() + ": " + e.getMessage()); + } + } + + private void saveDomProbe(String result) { + getSharedPreferences(PREFS, MODE_PRIVATE).edit() + .putString("dom_probe", safe(result)) + .putLong("dom_probe_at", System.currentTimeMillis()) + .apply(); + } + + private String safe(String value) { + return value == null ? 
"" : value; + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java b/clients/android/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java new file mode 100644 index 0000000..a97dbf1 --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/TestVpnActivity.java @@ -0,0 +1,70 @@ +package su.cin.rapvpn; + +import android.app.Activity; +import android.content.Intent; +import android.net.VpnService; +import android.os.Bundle; +import android.util.Base64; +import android.widget.TextView; +import java.nio.charset.StandardCharsets; + +public class TestVpnActivity extends Activity { + public static final String EXTRA_PROFILE_JSON = "profile_json"; + public static final String EXTRA_PROFILE_BASE64 = "profile_base64"; + public static final String EXTRA_BACKEND_URL = "backend_url"; + public static final String EXTRA_CLUSTER_ID = "cluster_id"; + public static final String EXTRA_VPN_CONNECTION_ID = "vpn_connection_id"; + private static final int VPN_PREPARE_REQUEST = 77; + + private Intent serviceIntent; + + @Override + protected void onCreate(Bundle savedInstanceState) { + super.onCreate(savedInstanceState); + TextView text = new TextView(this); + text.setText("RAP VPN test launcher"); + setContentView(text); + serviceIntent = buildServiceIntent(getIntent()); + Intent prepare = VpnService.prepare(this); + if (prepare != null) { + startActivityForResult(prepare, VPN_PREPARE_REQUEST); + return; + } + startVpn(); + } + + @Override + protected void onActivityResult(int requestCode, int resultCode, Intent data) { + super.onActivityResult(requestCode, resultCode, data); + if (requestCode == VPN_PREPARE_REQUEST && resultCode == RESULT_OK) { + startVpn(); + } + } + + private Intent buildServiceIntent(Intent source) { + Intent intent = new Intent(this, RapVpnService.class); + intent.putExtra(RapVpnService.EXTRA_PROFILE_JSON, profileJson(source)); + intent.putExtra(RapVpnService.EXTRA_BACKEND_URL, source.getStringExtra(EXTRA_BACKEND_URL)); + 
intent.putExtra(RapVpnService.EXTRA_CLUSTER_ID, source.getStringExtra(EXTRA_CLUSTER_ID)); + intent.putExtra(RapVpnService.EXTRA_VPN_CONNECTION_ID, source.getStringExtra(EXTRA_VPN_CONNECTION_ID)); + return intent; + } + + private String profileJson(Intent source) { + String direct = source.getStringExtra(EXTRA_PROFILE_JSON); + if (direct != null && !direct.isEmpty()) { + return direct; + } + String encoded = source.getStringExtra(EXTRA_PROFILE_BASE64); + if (encoded == null || encoded.isEmpty()) { + return ""; + } + byte[] raw = Base64.decode(encoded, Base64.DEFAULT); + return new String(raw, StandardCharsets.UTF_8); + } + + private void startVpn() { + startForegroundService(serviceIntent); + finish(); + } +} diff --git a/clients/android/app/src/main/java/su/cin/rapvpn/VpnPacketWebSocketRelay.java b/clients/android/app/src/main/java/su/cin/rapvpn/VpnPacketWebSocketRelay.java new file mode 100644 index 0000000..a6868ba --- /dev/null +++ b/clients/android/app/src/main/java/su/cin/rapvpn/VpnPacketWebSocketRelay.java @@ -0,0 +1,288 @@ +package su.cin.rapvpn; + +import android.net.VpnService; +import android.util.Log; + +import java.net.URI; +import java.util.ArrayList; +import java.util.List; +import java.util.concurrent.ArrayBlockingQueue; +import java.util.concurrent.BlockingQueue; +import java.util.concurrent.TimeUnit; + +import okhttp3.ConnectionPool; +import okhttp3.Dispatcher; +import okhttp3.OkHttpClient; +import okhttp3.Request; +import okhttp3.Response; +import okhttp3.WebSocket; +import okhttp3.WebSocketListener; +import okio.ByteString; + +final class VpnPacketWebSocketRelay { + private static final String TAG = "RapVpnWebSocketRelay"; + private static final int MAX_PACKET_BATCH_PACKETS = 512; + private static final int MAX_PACKET_BATCH_BYTES = 1024 * 1024; + private static final int MAX_SINGLE_PACKET_BYTES = 65535; + + private final String baseUrl; + private final VpnService vpnService; + private final OkHttpClient httpClient; + private final 
FabricServiceChannel fabricServiceChannel; + private final BlockingQueue<List<byte[]>> incoming = new ArrayBlockingQueue<>(2048); + private final Object lock = new Object(); + + private WebSocket webSocket; + private String connectedClusterId = ""; + private String connectedVpnConnectionId = ""; + private volatile boolean open; + private volatile boolean connecting; + private volatile long reconnectAfterMs; + private volatile String lastError = ""; + + VpnPacketWebSocketRelay(String baseUrl, VpnService vpnService) { + this(baseUrl, vpnService, new FabricServiceChannel()); + } + + VpnPacketWebSocketRelay(String baseUrl, VpnService vpnService, FabricServiceChannel fabricServiceChannel) { + this.baseUrl = trimRight(baseUrl); + this.vpnService = vpnService; + this.fabricServiceChannel = fabricServiceChannel == null ? new FabricServiceChannel() : fabricServiceChannel; + OkHttpClient.Builder builder = new OkHttpClient.Builder(); + if (vpnService != null) { + builder.socketFactory(new RapApiClient.ProtectedSocketFactory(vpnService)); + } + builder.dns(new RapApiClient.BackendPinnedDns(baseUrl)); + builder.connectTimeout(5, TimeUnit.SECONDS); + builder.writeTimeout(10, TimeUnit.SECONDS); + builder.readTimeout(0, TimeUnit.SECONDS); + builder.retryOnConnectionFailure(true); + Dispatcher dispatcher = new Dispatcher(); + dispatcher.setMaxRequests(16); + dispatcher.setMaxRequestsPerHost(8); + builder.dispatcher(dispatcher); + builder.connectionPool(new ConnectionPool(8, 5, TimeUnit.MINUTES)); + this.httpClient = builder.build(); + } + + String baseUrl() { + return baseUrl; + } + + boolean isOpen() { + return open; + } + + String lastError() { + return lastError == null ? 
"" : lastError; + } + + void connect(String clusterId, String vpnConnectionId) { + if (clusterId == null || clusterId.isEmpty() || vpnConnectionId == null || vpnConnectionId.isEmpty()) { + return; + } + long now = System.currentTimeMillis(); + synchronized (lock) { + if (open && clusterId.equals(connectedClusterId) && vpnConnectionId.equals(connectedVpnConnectionId)) { + return; + } + if (connecting && clusterId.equals(connectedClusterId) && vpnConnectionId.equals(connectedVpnConnectionId)) { + return; + } + if (now < reconnectAfterMs) { + return; + } + closeLocked(); + String wsUrl = webSocketUrl(clusterId, vpnConnectionId); + if (wsUrl.isEmpty()) { + lastError = "invalid websocket url"; + reconnectAfterMs = now + 5000; + return; + } + connectedClusterId = clusterId; + connectedVpnConnectionId = vpnConnectionId; + connecting = true; + Request.Builder requestBuilder = new Request.Builder().url(wsUrl); + this.fabricServiceChannel.applyHeaders(requestBuilder); + Request request = requestBuilder.build(); + webSocket = httpClient.newWebSocket(request, new Listener()); + } + } + + boolean sendClientPacketBatch(String clusterId, String vpnConnectionId, List packets) { + packets = cleanPacketBatch(packets); + if (packets.isEmpty()) { + return true; + } + connect(clusterId, vpnConnectionId); + WebSocket socket = webSocket; + if (socket == null || !open) { + return false; + } + byte[] payload = encodePacketBatch(packets); + if (payload.length == 0) { + return true; + } + boolean queued = socket.send(ByteString.of(payload)); + if (!queued) { + lastError = "websocket send queue rejected batch"; + } + return queued; + } + + List receiveClientPacketBatch(String clusterId, String vpnConnectionId, int timeoutMs) throws InterruptedException { + connect(clusterId, vpnConnectionId); + int waitMs = Math.max(1, timeoutMs); + List packets = incoming.poll(waitMs, TimeUnit.MILLISECONDS); + return packets == null ? 
new ArrayList<>() : packets; + } + + void close() { + synchronized (lock) { + closeLocked(); + } + } + + private void closeLocked() { + open = false; + connecting = false; + incoming.clear(); + if (webSocket != null) { + try { + webSocket.close(1000, "relay switch"); + } catch (Exception ignored) { + } + } + webSocket = null; + } + + private String webSocketUrl(String clusterId, String vpnConnectionId) { + try { + URI uri = URI.create(baseUrl); + String scheme = "https".equalsIgnoreCase(uri.getScheme()) ? "wss" : "ws"; + String path = uri.getRawPath() == null || uri.getRawPath().isEmpty() ? "" : trimRight(uri.getRawPath()); + String fabricPath = fabricServiceChannel.packetPathForBase(baseUrl, clusterId, vpnConnectionId, true); + if (!fabricPath.isEmpty()) { + path += fabricPath; + } else { + path += "/clusters/" + clusterId + "/vpn-connections/" + vpnConnectionId + "/tunnel/client/packets/ws"; + } + URI ws = new URI(scheme, uri.getRawUserInfo(), uri.getHost(), uri.getPort(), path, null, null); + return ws.toString(); + } catch (Exception e) { + lastError = e.getClass().getSimpleName() + ": " + e.getMessage(); + return ""; + } + } + + private final class Listener extends WebSocketListener { + @Override + public void onOpen(WebSocket webSocket, Response response) { + open = true; + connecting = false; + reconnectAfterMs = 0; + lastError = ""; + Log.i(TAG, "vpn packet websocket opened " + baseUrl); + } + + @Override + public void onMessage(WebSocket webSocket, ByteString bytes) { + List packets = decodePacketBatch(bytes.toByteArray()); + if (packets.isEmpty()) { + return; + } + if (!incoming.offer(packets)) { + incoming.poll(); + incoming.offer(packets); + } + } + + @Override + public void onClosed(WebSocket webSocket, int code, String reason) { + open = false; + connecting = false; + reconnectAfterMs = System.currentTimeMillis() + 1000; + lastError = "closed " + code + " " + reason; + } + + @Override + public void onFailure(WebSocket webSocket, Throwable t, Response 
response) { + open = false; + connecting = false; + reconnectAfterMs = System.currentTimeMillis() + 3000; + lastError = t == null ? "websocket failure" : t.getClass().getSimpleName() + ": " + t.getMessage(); + Log.w(TAG, "vpn packet websocket failed " + baseUrl + ": " + lastError); + } + } + + private static List cleanPacketBatch(List packets) { + List cleaned = new ArrayList<>(); + int bytes = 0; + if (packets == null) { + return cleaned; + } + for (byte[] packet : packets) { + if (packet == null || packet.length <= 0 || packet.length > MAX_SINGLE_PACKET_BYTES) { + continue; + } + int projected = bytes + 4 + packet.length; + if (cleaned.size() >= MAX_PACKET_BATCH_PACKETS || projected > MAX_PACKET_BATCH_BYTES) { + break; + } + cleaned.add(packet); + bytes = projected; + } + return cleaned; + } + + private static byte[] encodePacketBatch(List packets) { + packets = cleanPacketBatch(packets); + int total = 0; + for (byte[] packet : packets) { + total += 4 + packet.length; + } + byte[] out = new byte[total]; + int offset = 0; + for (byte[] packet : packets) { + int length = packet.length; + out[offset] = (byte) ((length >> 24) & 0xff); + out[offset + 1] = (byte) ((length >> 16) & 0xff); + out[offset + 2] = (byte) ((length >> 8) & 0xff); + out[offset + 3] = (byte) (length & 0xff); + offset += 4; + System.arraycopy(packet, 0, out, offset, length); + offset += length; + } + return out; + } + + private static List decodePacketBatch(byte[] payload) { + List packets = new ArrayList<>(); + int offset = 0; + while (payload != null && offset + 4 <= payload.length && packets.size() < MAX_PACKET_BATCH_PACKETS) { + int length = ((payload[offset] & 0xff) << 24) + | ((payload[offset + 1] & 0xff) << 16) + | ((payload[offset + 2] & 0xff) << 8) + | (payload[offset + 3] & 0xff); + offset += 4; + if (length <= 0 || length > MAX_SINGLE_PACKET_BYTES || offset + length > payload.length) { + break; + } + byte[] packet = new byte[length]; + System.arraycopy(payload, offset, packet, 0, 
length); + packets.add(packet); + offset += length; + } + return packets; + } + + private static String trimRight(String value) { + if (value == null) { + return ""; + } + while (value.endsWith("/")) { + value = value.substring(0, value.length() - 1); + } + return value; + } +} diff --git a/clients/android/app/src/main/res/values/styles.xml b/clients/android/app/src/main/res/values/styles.xml new file mode 100644 index 0000000..59653b8 --- /dev/null +++ b/clients/android/app/src/main/res/values/styles.xml @@ -0,0 +1,7 @@ + + + diff --git a/clients/android/build.gradle b/clients/android/build.gradle new file mode 100644 index 0000000..e4bb369 --- /dev/null +++ b/clients/android/build.gradle @@ -0,0 +1,3 @@ +plugins { + id "com.android.application" version "9.2.0" apply false +} diff --git a/clients/android/local.properties b/clients/android/local.properties new file mode 100644 index 0000000..dac3e14 --- /dev/null +++ b/clients/android/local.properties @@ -0,0 +1 @@ +sdk.dir=C:/Android/sdk \ No newline at end of file diff --git a/clients/android/settings.gradle b/clients/android/settings.gradle new file mode 100644 index 0000000..902a398 --- /dev/null +++ b/clients/android/settings.gradle @@ -0,0 +1,18 @@ +pluginManagement { + repositories { + google() + mavenCentral() + gradlePluginPortal() + } +} + +dependencyResolutionManagement { + repositoriesMode.set(RepositoriesMode.FAIL_ON_PROJECT_REPOS) + repositories { + google() + mavenCentral() + } +} + +rootProject.name = "RapAndroidVpn" +include ":app" diff --git a/clients/windows/README.md b/clients/windows/README.md index 81d2e7a..3078f50 100644 --- a/clients/windows/README.md +++ b/clients/windows/README.md @@ -10,6 +10,9 @@ real RDP session window with direct worker data-plane support. 
- application services for auth, organizations, resources, sessions, and gateway attach flows - secure local token storage via DPAPI for MVP - organization selection persisted locally +- remote-desktop-first shell: the selected server/session surface is primary, + while organization, server catalog, and active sessions stay in compact + controls or collapsible side panels - resource list, active session list, and session window - direct worker WSS data-plane integration with backend gateway fallback - binary render receive path for direct worker WSS diff --git a/clients/windows/src/RemoteAccessPlatform.Windows.App/MainWindow.xaml b/clients/windows/src/RemoteAccessPlatform.Windows.App/MainWindow.xaml index 6a3a898..4ea14d8 100644 --- a/clients/windows/src/RemoteAccessPlatform.Windows.App/MainWindow.xaml +++ b/clients/windows/src/RemoteAccessPlatform.Windows.App/MainWindow.xaml @@ -21,10 +21,10 @@ + Margin="0,0,0,12" + Padding="14" + Background="#FF10262F" + CornerRadius="8"> @@ -32,7 +32,7 @@ - + +
+
+
+

Secure Access Fabric

+

{activeOrganization?.name || "Personal account"}

+

{session.email}

+
+ + +
+ + {error &&
{error}
} + {notice &&
{notice}
} + +
+ + + +
+ +
+
+
+
+

Installations

+

+ {androidClientVersion ? `Current Android version: ${androidClientVersion}` : "Download current clients only from here to avoid picking up an outdated build."} +

+
+ latest +
+ +
+ +
+

Profile

+ + + + +
+ +
+
+
+

Available servers

+

List of resources already granted to the user through the organization.

+
+
+ [ + resource.name, + resource.address, + resource.protocol, + resource.has_secret ? "configured" : "no", + statusLabel(resource.file_transfer_mode || "disabled"), + ])} + /> +
+ +
+

Services

+ [protocol, String(count)])} /> +
+ +
+

What is coming here next

+
+ Devices and trusted sign-ins + Active VPN/RDP sessions + VPN profile updates without manual keys + Self-service password change +
+
+
+
+ + ); + } + return (