Refactor RDP proxy handling and update related tests
@@ -88,6 +88,16 @@ Native host process responsible for node identity, enrollment, certificates, hea
Service Workload:
A workload executed on a node. It may be native or containerized. Examples: `rdp-worker`, `vnc-worker`, `entry-node`, `relay-node`, `file-storage-cache`.

Public/Admin HTTPS Ingress:
A service-edge role that listens on TCP `80`/`443` for browser/API HTTPS and
forwards accepted requests into the QUIC-only fabric service channel. It is not
an authority service and does not imply permission to manage the cluster.

Admin UI Runtime:
A scoped admin service runtime. Global admin runtime may run only on
platform-owner trusted nodes; cluster, organization, and user portal runtimes
receive only their scoped projections.

Capability:
What a node can technically do. Example: `can_run_rdp_worker`.

@@ -162,6 +172,13 @@ policy, approvals, and audit.
20. Node-agent is the local supervisor for health, restart, update, and rollback
    of node services, but Control Plane owns rollout policy and durable schema
    migration orchestration.
21. HTTP/HTTPS is an external service edge only. Fabric node-to-node transport
    remains QUIC-only.
22. A node that accepts `443` does not own management authority. Admin authority
    belongs to signed roles, scoped claims, policy, and trusted runtime nodes.
23. Global admin runtime, policy authority, and audit sink must run only on
    platform-owner controlled nodes. Organization and cluster portals must not
    expose unrelated tenants, clusters, or internal mesh topology.

## Existing Node Management Semantics

@@ -0,0 +1,96 @@
# Distributed Authority Audit 2026-05-16

Status: target architecture is distributed, but the live test cluster still has
bootstrap central authority pieces that must be removed before production trust.

## Fixed Requirements

- No single management/API/storage/update service is allowed to own cluster
  truth.
- Control, storage, update, route authority, observer, and update-cache are node
  roles in the fabric.
- A service endpoint can serve signed state, but cannot create trusted state by
  itself.
- Node identity is cryptographic. IP addresses, DNS names, and NAT addresses are
  endpoint candidates only.
- Nodes must publish real signed candidates for reachable interfaces,
  STUN/ICE-reflexive addresses, passive reverse channels, and relay fallback.
- Nodes must verify signed control data locally before applying it.

## Live Cluster Findings

- The live cluster has one active `cluster_authorities` row:
  `rap-ca-ed25519-09877466aa9b6b58b0f312b0b313ea33`.
- Its metadata says `storage=database_signer` and
  `production_target=external_cluster_signer_or_hsm`.
- Release metadata for recent node-agent versions is signed, but signed by the
  same database-backed authority.
- Synthetic mesh configs are signed and node-agent verifies them against the
  pinned cluster authority.
- Node enrollment pins cluster authority into `identity.json`.
- Before this audit, host-agent update plans were carried with signatures but
  host-agent did not locally reject unsigned plans when a pinned authority was
  present.

## Changes Made In This Audit

- The fabric docs now declare distributed authority and quorum as mandatory.
- Node/fabric endpoints must be explicit `host:port`; DNS-only service names are
  rejected as fabric endpoints.
- `home-1` no longer advertises `smoke.cin.su` as a fabric endpoint. It now
  advertises its real interface candidate `quic://192.168.200.85:18080`.
- Host-agent now verifies `node_update_plan` authority signatures when
  `identity.json` contains a pinned cluster authority public key.
- Unsigned update plans are rejected in that pinned-authority mode.
- Added `rap.cluster_authority.quorum.v1` and
  `rap.cluster_authority.quorum_envelope.v1` contracts to both agent and
  backend authority packages.
- Host-agent can now verify quorum-signed update plans when `identity.json`
  contains a pinned quorum descriptor.
- Backend update plans now include an `authority_quorum` envelope when the
  cluster authority metadata contains a quorum descriptor. If that configured
  quorum cannot be satisfied, the update plan is not issued.
- Node bootstrap now carries `cluster_authority_quorum`; the approval authority
  payload signs the quorum descriptor hash, and node-agent persists the
  descriptor into `identity.json` after verifying the signed hash.
- Published `rap-node-agent` and `rap-host-agent` release
  `0.2.284-quorumauthority`.
- Canaried `home-1` to `rap-node-agent 0.2.284-quorumauthority` and
  `rap-host-agent 0.2.284-quorumauthority`; both reported healthy/noop after
  update.
- Published `rap-node-agent` and `rap-host-agent` release
  `0.2.285-quorumbootstrap`.
- Canaried `home-1` to `rap-node-agent 0.2.285-quorumbootstrap` and
  `rap-host-agent 0.2.285-quorumbootstrap`; both reported current=target/noop.
  `ifcm-rufms-s-mo1cr` was intentionally not updated because it is behind NAT
  and still needs fabric/update-cache artifact reachability before further
  rollout.
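The pinned-quorum check on update plans can be pictured with a small Python sketch. It is an illustration under stated assumptions, not the host-agent code: the descriptor fields `threshold` and `signers`, the `signatures` list inside the envelope, and the `payload` key are hypothetical names, and signature verification is injected as a callable because the real key handling is not shown in this audit.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical shapes: the real `rap.cluster_authority.quorum.v1` schema is not
# shown in this document, so `threshold`, `signers`, and `signatures` are assumed names.
Verify = Callable[[str, bytes, bytes], bool]  # (public_key, payload, signature) -> valid?


def quorum_satisfied(descriptor: Dict, payload: bytes,
                     signatures: Iterable[Tuple[str, bytes]],
                     verify: Verify) -> bool:
    """Return True when at least `threshold` distinct pinned signers signed `payload`."""
    pinned = descriptor["signers"]           # signer_id -> public key
    valid = set()
    for signer_id, signature in signatures:
        key = pinned.get(signer_id)
        if key is not None and verify(key, payload, signature):
            valid.add(signer_id)             # a set counts each signer once
    return len(valid) >= descriptor["threshold"]


def accept_update_plan(identity: Dict, plan: Dict, verify: Verify) -> bool:
    """Reject unsigned or under-signed plans whenever a quorum descriptor is pinned."""
    descriptor: Optional[Dict] = identity.get("cluster_authority_quorum")
    if descriptor is None:
        return False                         # this sketch only models pinned-quorum mode
    envelope = plan.get("authority_quorum")
    if not envelope:
        return False                         # unsigned plan
    return quorum_satisfied(descriptor, plan["payload"], envelope["signatures"], verify)
```

Counting distinct signer IDs rather than raw signatures keeps one key from satisfying an M-of-N threshold by signing the same plan several times.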

## Remaining Production Blockers

- Replace `database_signer` with quorum authority:
  M-of-N signatures from nodes or hardware/offline keys with
  `control-authority` / `update-authority` roles.
- Store authority descriptors and role certificates as replicated signed state,
  not only database rows.
- Require quorum envelopes for the remaining high-risk mutations: role
  mutation, release creation, update policy mutation, route lease issuance,
  relay/rendezvous lease issuance, storage placement, and authority rotation.
  Node update plans and bootstrap quorum pinning now have the first contract
  hooks, but production still needs real M-of-N signers.
- Add node-side verification of release metadata in addition to update-plan
  verification; update-plan verification is now enforced by host-agent when a
  pinned authority or pinned quorum descriptor exists.
- Add update-cache mirror selection through fabric endpoint candidates instead
  of a single HTTP origin.
- Add signed endpoint-candidate epochs so peer directory gossip can survive API
  replica loss.
- Add revocation/fencing epochs for compromised authority keys, nodes, and
  update artifacts.

## Acceptance Rule

The cluster is not production-trust-ready while a single `database_signer` can
create authoritative cluster mutations. It may remain as a development bootstrap
signer only when every signed payload clearly identifies it as bootstrap and
nodes can be configured to reject it in production mode.
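The acceptance rule amounts to a small decision table. The Python sketch below shows one way a node could apply it; the `signer_kind` field and its values `bootstrap` and `quorum` are hypothetical labels chosen for illustration, and the signature check itself is assumed to have happened elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignedPayload:
    signer_kind: str        # hypothetical label: "bootstrap" or "quorum"
    signature_valid: bool   # outcome of a signature check performed elsewhere


def accept_payload(payload: SignedPayload, production_mode: bool) -> bool:
    """Apply the acceptance rule: bootstrap signers are development-only."""
    if not payload.signature_valid:
        return False
    if payload.signer_kind == "quorum":
        return True
    if payload.signer_kind == "bootstrap":
        return not production_mode   # tolerated only outside production mode
    return False                     # unlabeled signers are never trusted
```

Rejecting any payload without a recognized signer label means an unmarked `database_signer` payload cannot pass as quorum-signed by omission.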
@@ -62,6 +62,88 @@ route and stream semantics.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
   primary high-speed data plane.
9. There must be no single management service that can seize the fabric. Control,
   storage, update distribution, route authority, and certificate authority are
   fabric roles assigned to eligible nodes and protected by quorum signatures.
   A web/API endpoint is only an access replica for a signed state log, not the
   owner of cluster truth.
10. IP addresses and DNS names are never authority. Nodes announce signed
    endpoint candidates for every usable interface, public/reflexive address,
    local segment address, reverse channel, and relay fallback. Neighbors select
    the usable candidate locally by policy, reachability, latency, load, and
    trust.

## Distributed Control And Trust

The target fabric behaves like a distributed network, not a client/server
management product. The cluster has a replicated signed state log and many
service replicas. Any node with the right role can serve API, storage, update,
or route-coordinator duties, but no single replica can mutate cluster authority
alone.

Required trust model:

- Every node has a long-lived node identity key and short-lived role
  certificates. The node identity is cryptographic; the current IP, hostname,
  NAT address, or container name is only an endpoint candidate.
- Cluster authority is threshold-based. Root or high-risk changes require M-of-N
  signatures from authorized control-authority nodes or hardware/offline
  operator keys.
- Role certificates are scoped by action, organization/tenant, service,
  partition, validity window, and allowed delegation depth.
- Update releases, route leases, peer-directory epochs, storage shard placement,
  node approvals, role changes, and authority rotations are signed records in
  the state log.
- A node accepts control data only when it can verify signatures, epoch/fencing,
  expiry, target cluster, target node or role scope, and monotonic generation.
- A compromised API replica can withhold or delay data, but cannot forge updates,
  route authority, new certificates, node roles, or cluster ownership.
- Bootstrap may use a temporary centralized signer for development, but
  production mode must mark that signer as non-authoritative unless quorum
  signatures are present.
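The local acceptance conditions in the trust model (signature, fencing epoch, expiry, target cluster, node or role scope, monotonic generation) can be sketched as one ordered check. This Python example is illustrative only: the record and node-state field names are assumptions, and signature/quorum verification is represented by a precomputed boolean.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ControlRecord:            # hypothetical record shape
    cluster_id: str
    target_node: Optional[str]  # None when the record is role-scoped
    target_role: Optional[str]  # None when the record is node-scoped
    fencing_epoch: int
    generation: int
    expires_at: float           # Unix seconds
    signature_ok: bool          # signature/quorum verification done elsewhere


@dataclass(frozen=True)
class NodeState:                # hypothetical local state
    cluster_id: str
    node_id: str
    roles: FrozenSet[str]
    min_fencing_epoch: int
    last_generation: int


def rejection_reason(rec: ControlRecord, node: NodeState, now: float) -> Optional[str]:
    """Return None when the record may be applied, else the first failed check."""
    if not rec.signature_ok:
        return "bad_signature"
    if rec.cluster_id != node.cluster_id:
        return "wrong_cluster"
    if rec.fencing_epoch < node.min_fencing_epoch:
        return "fenced_epoch"
    if rec.expires_at <= now:
        return "expired"
    in_scope = rec.target_node == node.node_id or rec.target_role in node.roles
    if not in_scope:
        return "out_of_scope"
    if rec.generation <= node.last_generation:
        return "stale_generation"   # replayed or reordered record
    return None
```

Returning a reason string instead of a bare boolean keeps rejections reportable, which matters when a replica is withholding or replaying data.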

Authority levels:

- `root-authority`: rotates cluster root and quorum membership. Offline or
  hardware-backed where possible. Rarely online.
- `control-authority`: approves node join, role changes, policy epochs, and
  route-authority membership through quorum.
- `route-authority`: signs short-lived route leases and relay/rendezvous
  assignments for a shard or partition.
- `update-authority`: signs release metadata, compatibility, artifact hashes,
  rollback windows, and staged rollout policy.
- `storage-authority`: signs storage shard manifests, replication factors,
  retention policy, and recovery epochs.
- `observer-authority`: can sign telemetry observations only; it cannot mutate
  routing, roles, updates, or secrets.

Required anti-takeover controls:

- No bearer admin token may grant fabric-wide mutation without a signed authority
  envelope.
- No node may accept unsigned update metadata or an artifact whose hash is not
  signed by update-authority quorum.
- No node may accept unsigned route changes for production channels.
- No node may promote itself into control, storage, update, relay, or route
  authority roles without a quorum-signed role certificate.
- Authority and role certificates must have short validity, explicit scopes, and
  revocation/fencing epochs.
- Nodes must pin the cluster root/quorum descriptor and reject unexpected root
  changes unless the old quorum signs the transition or an offline recovery
  policy is invoked.

Endpoint state is also distributed:

- Nodes publish signed endpoint-candidate sets containing local interfaces,
  public/reflexive STUN/ICE candidates, NAT group/local segment identifiers,
  relay fallback, and passive reverse-channel availability.
- Endpoint candidates expire quickly. When a node changes IP, it reconnects
  passively to any reachable fabric peer or API replica and publishes a new
  signed candidate epoch.
- Peers keep using cached valid candidates and route leases while refreshing
  from any reachable replica or neighbor gossip path.
- Neighbor selection is local and latency/load-aware; the state log announces
  facts and policy, not a forced single next hop.
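Local, latency/load-aware selection over short-lived candidates can be sketched as follows. The scoring formula (RTT plus a per-channel load penalty) and the field names are assumptions made for this example; the source names the inputs (policy, reachability, latency, load, trust) but not how they are combined.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:                # hypothetical candidate shape
    endpoint: str               # e.g. "quic://192.168.200.85:18080"
    mode: str                   # e.g. "lan_quic"
    expires_at: float           # Unix seconds; candidates expire quickly
    rtt_ms: float               # last measured round-trip time
    active_channels: int        # local view of load on this candidate


def pick_candidate(candidates: Iterable[Candidate], now: float,
                   load_penalty_ms: float = 1.0) -> Optional[Candidate]:
    """Pick the fresh candidate with the lowest latency-plus-load score."""
    fresh = [c for c in candidates if c.expires_at > now]
    if not fresh:
        return None             # caller should refresh from a replica or gossip
    return min(fresh, key=lambda c: c.rtt_ms + load_penalty_ms * c.active_channels)
```

Because the choice is made from locally cached, unexpired candidates, a peer can keep routing while API replicas are unreachable and only falls back to a refresh when every cached candidate has expired.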

## Node Roles

@@ -0,0 +1,845 @@
# Fabric-First Transport And Stress Plan

Status: fabric-first implementation baseline is active. QUIC-only transport,
route planning, runtime reroute/failover, pressure accounting, shared-host
stress gates, 1000-channel load, failure/degradation gates, and a 30-minute
real-byte soak are implemented and verified. Remaining work is wider real
topology coverage as the cluster grows.

This project is now fabric-first. Work on service payloads, service adapter
expansion, and Android VPN transport is paused until the fabric transport layer
is complete and proven under real load.

## Goal

The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes
may have public addresses, sit behind NAT, or represent a whole local segment
behind one NAT. The fabric must expose a single logical transport layer where
nodes can reach each other directly, through local segment paths, through
passive outbound tunnels, or through relay hops without changing the data-plane
protocol.

QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and
ICE-selected paths are route modes inside the same QUIC fabric, not alternative
transports.

The fabric must not depend on one management service for authority. API,
storage, update-cache, route-coordinator, observer, and authority duties are
roles inside the mesh. A reachable API endpoint can distribute signed state, but
it cannot be the source of truth by itself. Nodes accept control data,
configuration, route leases, update plans, and role changes only when the
signatures, quorum rules, scopes, epochs, and expiry windows verify locally.

## Required Fabric Behavior

- Address channels by `node_id`, `pool_id`, or service target, not by raw IP.
- Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC,
  relay QUIC, and future ICE-derived QUIC paths.
- Treat DNS names such as web/admin/API domains as service endpoints only, not
  node identity or fabric authority.
- Require node-published endpoint candidates to include explicit `host:port`,
  reachability, connectivity mode, NAT/local-segment metadata, source, and
  freshness.
- Prefer local segment paths for nodes that share a NAT/local network.
- Keep outbound passive QUIC control/data adjacencies from NATed nodes to
  reachable public or relay nodes.
- Build logical channels over shared QUIC adjacencies instead of opening one
  physical QUIC connection per channel.
- Maintain primary, warm standby, and fallback route sets per channel.
- Rebuild a channel when an intermediate hop fails.
- Switch to another pool member when the target is a pool and the current
  endpoint fails.
- Reroute slow channels when a faster path exists and the reroute will not harm
  aggregate fabric throughput.
- Spread channels across available routes so the shortest path is not saturated
  while other nodes are idle.
- Isolate channels with per-channel flow control, traffic classes, backpressure,
  quotas, and fairness scheduling.
- Report per-node, per-link, per-route, and per-channel load and failure causes.
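The explicit `host:port` and QUIC-only requirements for endpoint candidates lend themselves to a small validation function. The sketch below is an assumption-laden illustration, not the node-agent validator: the function name and the reason strings are hypothetical, while the mode labels are the `*_quic` labels this plan uses for candidate scoring.

```python
from typing import Optional
from urllib.parse import urlsplit

# QUIC-mode labels named in this plan; anything else is treated as legacy.
QUIC_MODES = {"direct_quic", "lan_quic", "ice_quic", "reverse_quic", "relay_quic"}


def candidate_problem(endpoint: str, mode: str) -> Optional[str]:
    """Return None for an acceptable candidate, else a rejection reason."""
    if mode not in QUIC_MODES:
        return "legacy_or_unknown_transport_label"
    parts = urlsplit(endpoint)
    if parts.scheme != "quic":
        return "non_quic_scheme"
    try:
        port = parts.port           # raises ValueError on a malformed port
    except ValueError:
        return "invalid_port"
    if not parts.hostname or port is None:
        return "missing_host_or_port"   # DNS-only names are not fabric endpoints
    return None
```

For example, `candidate_problem("quic://192.168.200.85:18080", "lan_quic")` passes, while a bare `quic://smoke.cin.su` is rejected because it carries no port, and a `relay` label is rejected even on a `quic://` endpoint.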

## Service Channel Boundary

The fabric is the only component that builds and maintains transport channels.
VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications
above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a
transport concern, open fallback transports, or implement route repair.

Every service starts by submitting a fabric service channel request:

```json
{
  "schema_version": "rap.fabric_service_channel_request.v1",
  "channel_id": "vpn-session-or-service-session-id",
  "source_role": "vpn-client | rdp-client | service-adapter",
  "service_class": "vpn_packets | rdp | ssh | file_transfer | web",
  "target": {
    "kind": "pool",
    "pool_ids": ["home-ipv4"],
    "service_role": "ipv4-egress"
  },
  "traffic": {
    "mode": "duplex",
    "application_protocol_agnostic": true,
    "flow_distribution": "latency_and_load_aware"
  },
  "resilience": {
    "min_active_paths": 1,
    "warm_standby_paths": 1,
    "failover": "pool_member_or_next_authorized_pool",
    "reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
  }
}
```

The fabric responds with a signed route bundle containing a short-lived
`rap.fabric_route_lease.v1`. The lease names the target pool, primary path,
warm standby paths, multipath hints, and rebuild policy. Physical endpoint
candidates are visible only to the fabric runtime as lease material; service
adapters do not rank, pin, or fail over endpoints themselves. A service adapter
receives only a duplex channel handle and service metadata:

- Android VPN: TUN packet reader/writer only.
- `ipv4-egress`: NAT/ordinary IPv4 exit only.
- RDP: protocol/session adapter only; server address, protocol, credentials,
  rendering, and clipboard are RDP service metadata, not fabric routing.

Temporary compatibility fields such as `exit_candidates` may exist only inside
the fabric route bundle consumed by the fabric runtime. Service code must treat
them as opaque and must not schedule routes from them.

The VPN client runtime accepts only `fabric_service_channel_request` plus
`fabric_route_bundle.route_lease`. The Android service may keep a deprecated
diagnostic endpoint cache, but packet routing must come from the lease. If a
path fails, slows down, or its target pool member dies, the fabric lease/rebuild
policy is the authority; the VPN service continues writing packets to the
channel and does not switch protocols.
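A service adapter's side of this boundary can be sketched as building the request shown earlier in this section and checking that it carries no routing material. The helper names, the list of required keys, and the forbidden-key rule are assumptions for illustration; only the field names and the schema version string come from the request example.

```python
import json
from typing import Dict, List

REQUIRED_KEYS = ("schema_version", "channel_id", "source_role",
                 "service_class", "target", "traffic", "resilience")
# Hypothetical rule for this sketch: adapters must not submit raw endpoints.
FORBIDDEN_KEYS = ("endpoint", "endpoints", "exit_candidates")
SCHEMA = "rap.fabric_service_channel_request.v1"


def build_channel_request(channel_id: str, source_role: str, service_class: str,
                          pool_ids: List[str], service_role: str) -> Dict:
    """Assemble a pool-targeted request with the defaults from the example."""
    return {
        "schema_version": SCHEMA,
        "channel_id": channel_id,
        "source_role": source_role,
        "service_class": service_class,
        "target": {"kind": "pool", "pool_ids": list(pool_ids),
                   "service_role": service_role},
        "traffic": {"mode": "duplex", "application_protocol_agnostic": True,
                    "flow_distribution": "latency_and_load_aware"},
        "resilience": {"min_active_paths": 1, "warm_standby_paths": 1,
                       "failover": "pool_member_or_next_authorized_pool",
                       "reroute_on": ["route_failure", "latency_regression",
                                      "loss_regression", "backpressure"]},
    }


def request_problems(request: Dict) -> List[str]:
    """List structural problems; an empty list means the request looks well-formed."""
    problems = [f"missing:{key}" for key in REQUIRED_KEYS if key not in request]
    problems += [f"forbidden:{key}" for key in FORBIDDEN_KEYS if key in request]
    if request.get("schema_version") != SCHEMA:
        problems.append("unexpected_schema_version")
    return problems


if __name__ == "__main__":
    req = build_channel_request("vpn-session-1", "vpn-client", "vpn_packets",
                                ["home-ipv4"], "ipv4-egress")
    print(json.dumps(req, indent=2))
```

The adapter names a pool and a service role and nothing more; which node and which path serve the channel stays a lease decision.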

## Distributed Authority Requirements

- No single control-plane/API/storage/update node can mutate the cluster alone.
- Cluster root and high-risk role changes require threshold signatures from
  authorized control-authority keys.
- Update releases require signed metadata, signed artifact hashes, compatibility
  constraints, rollout scope, and rollback windows; mirrors may serve bytes but
  cannot change what is trusted.
- Route leases, relay leases, rendezvous assignments, peer-directory epochs, and
  endpoint candidate epochs are signed and short-lived.
- Nodes cache the last valid signed state and continue routing through peers,
  relay fallbacks, and passive reverse channels when API replicas are down.
- A compromised replica may delay or omit data, but must not be able to forge
  role assignment, route authority, update authority, storage placement, or node
  ownership.
- Development `database_signer` mode is not production authority. Production
  acceptance requires quorum-signed envelopes for node join, role mutation,
  mesh config, route leases, update plans, and release metadata.

## Implementation Layers

1. Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse
   tunnels, relay candidates.
2. Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity,
   health, and pressure metrics.
3. Routing layer: latency-aware and load-aware route sets with relay fallback
   and pool failover.
4. Channel layer: millions of logical channels with independent lifecycle,
   flow control, and statistics.

## Stress Requirements

Ping tests do not qualify the fabric for acceptance. It must pass real
byte-transfer load:

- 1000 concurrent streams from different source nodes to different destination
  nodes.
- Mixed long-lived and short-lived channels.
- Aggressive create/delete churn.
- Many-to-one, one-to-many, and many-to-many traffic.
- Direct, LAN, relay, multi-hop, and reverse tunnel paths.
- Endpoint pool failover under load.
- Intermediate relay/node failure and route rebuild under load.
- Induced latency, packet loss, bandwidth caps, and route saturation.
- Control/interactive traffic surviving bulk traffic.
- No sustained overload of one path when alternatives exist.
- No goroutine, memory, stream, or file descriptor leak after churn.

## Required Stress Report

Every stress run must produce machine-readable JSON with:

- topology and scenario profile;
- channel setup/teardown counts and latency;
- total and per-channel throughput;
- per-node and per-route capacity pressure;
- p50/p95/p99 latency where measured;
- backpressure, rejection, and queue-depth counters;
- route switch and failover events;
- target pool failover events;
- QUIC connection and logical channel counts;
- final pass/fail verdict against SLO thresholds.
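The latency percentiles and the final verdict can be derived from raw samples as in the Python sketch below. It uses the nearest-rank percentile definition and a hypothetical report layout; the field names other than `verdict_reasons` and the `pass` verdict value are assumptions, and the SLO threshold is a made-up parameter rather than a value from the harness.

```python
import json
from typing import Dict, List, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100    # integer ceil(p * n / 100)
    return ordered[max(rank, 1) - 1]


def build_report(latencies_ms: Sequence[float], failed_channels: int,
                 slo_p99_ms: float) -> Dict:
    """Summarize one run and judge it against a single p99 latency SLO."""
    p99 = percentile(latencies_ms, 99)
    reasons: List[str] = []
    if p99 > slo_p99_ms:
        reasons.append("p99_latency_above_slo")
    if failed_channels > 0:
        reasons.append("channels_failed")
    return {
        "latency_ms": {"p50": percentile(latencies_ms, 50),
                       "p95": percentile(latencies_ms, 95),
                       "p99": p99},
        "failed_channels": failed_channels,
        "verdict": "pass" if not reasons else "fail",
        "verdict_reasons": reasons,
    }


if __name__ == "__main__":
    print(json.dumps(build_report([4.0, 5.5, 6.1, 40.2], 0, 50.0), indent=2))
```

Carrying the failed checks as a list alongside the verdict lets a wrapper script surface the exact cause when a gate fails.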

The first executable harness is `agents/rap-node-agent/cmd/fabric-loadtest`.
It supports in-process multi-node QUIC targets, short logical channel churn,
pool failover, target failure injection, and JSON reports.

Example local pool-failover run:

```powershell
go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s
```

The local harness is not a replacement for distributed host testing. It is the
first acceptance gate for protocol limits, channel lifecycle churn, pool
failover semantics, and reporting shape before running the same workload across
the shared test Docker host.

Distributed shared-host smoke:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms
```

The distributed smoke builds/runs separate server and client containers on the
shared Docker host, sends real QUIC fabric frames across the Docker network,
kills one target node during load, and expects all channels assigned to that
target to fail over to the remaining pool.

The smoke summary includes the strict loadtest verdict plus `route_pressure`
and `transport_snapshot`; the script fails when the client verdict is not
`pass` and carries `verdict_reasons` into the thrown error.

`-TuneUdpBuffers` applies runtime host sysctls through a privileged one-shot
container before the run and records the observed values in the summary:
`net.core.rmem_max`, `net.core.wmem_max`, `net.core.rmem_default`, and
`net.core.wmem_default`.

Degraded-target and latency-aware admission run:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80
```

This applies `tc netem` to one target, probes every target before mass channel
placement, excludes targets above the RTT threshold, and reports per-target
setup/duration percentiles. This is the first executable gate for
latency-aware placement; live channel migration after mid-stream degradation is
the next routing-layer gate.
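The probe-then-place step of this gate can be sketched in Python as below. The function names, the round-robin placement, and the treatment of a failed probe as an exclusion are assumptions for illustration; the source states only that targets above the RTT threshold are excluded before mass channel placement.

```python
from typing import Dict, List, Optional, Tuple


def admit_targets(probe_rtt_ms: Dict[str, Optional[float]],
                  max_target_rtt_ms: float) -> Tuple[List[str], List[str]]:
    """Split probed targets into admitted and excluded lists (sorted by name)."""
    admitted: List[str] = []
    excluded: List[str] = []
    for target in sorted(probe_rtt_ms):
        rtt = probe_rtt_ms[target]
        # A probe that produced no RTT (None) is treated like a slow target.
        if rtt is not None and rtt <= max_target_rtt_ms:
            admitted.append(target)
        else:
            excluded.append(target)
    return admitted, excluded


def place_channels(channel_count: int, admitted: List[str]) -> Dict[str, int]:
    """Spread channels round-robin across the admitted targets."""
    if not admitted:
        raise ValueError("no target passed the RTT probe")
    placement = {target: 0 for target in admitted}
    for index in range(channel_count):
        placement[admitted[index % len(admitted)]] += 1
    return placement
```

With a 100 ms impairment on one target and an 80 ms threshold, that target would land in the excluded list and the channels would be divided among the rest.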

Mid-stream migration gate:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30
```

This starts channels normally, applies `tc netem` after traffic is already in
flight, and expects slow logical streams to continue their remaining bytes on a
different target. The report exposes `migration_events`, `max_ack_ms`,
`ack_p95_ms`, `ack_p99_ms`, `route_attempts_total`, `reroute_causes`, and
per-target stats.
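The migration behavior this gate expects can be modeled with a small simulation: a stream sends fixed-size payloads, observes ACK latency per payload, and moves its remaining bytes to another target once an ACK exceeds the threshold. Everything here is a simplified assumption (the `ack_ms` callback, the next-target choice, the event fields other than the `migration_events` name) and it omits the reroute hysteresis a real router would need.

```python
from typing import Callable, Dict, List


def send_with_migration(total_bytes: int, payload_size: int, targets: List[str],
                        ack_ms: Callable[[str, int], float],
                        max_ack_ms: float) -> Dict:
    """Simulate one logical stream; `ack_ms(target, offset)` is a stand-in probe."""
    current = 0
    offset = 0
    events: List[Dict] = []
    while offset < total_bytes:
        chunk = min(payload_size, total_bytes - offset)
        latency = ack_ms(targets[current], offset)
        offset += chunk          # the payload was acknowledged, only slowly
        slow = latency > max_ack_ms
        if slow and len(targets) > 1 and offset < total_bytes:
            nxt = (current + 1) % len(targets)
            events.append({"from": targets[current], "to": targets[nxt],
                           "at_offset": offset, "ack_ms": latency})
            current = nxt        # remaining bytes continue on the new target
    return {"bytes_sent": offset, "migration_events": events,
            "final_target": targets[current]}
```

The stream is not restarted on migration: bytes already acknowledged stay counted, and only the remainder moves, which is what distinguishes this gate from pool failover at connect time.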

Production fabric-core migration boundary:

- `FabricChannelRouter` opens channels on the best route from a `FabricRouteSet`.
- Live `FabricChannelObservation` values update counters and trigger reroute on
  route failure, ACK latency threshold, or capacity pressure.
- Reroutes switch route binding and pool target where applicable, increment
  `RerouteCount`, and emit `FabricChannelRouteEvent`.
- `MinRerouteInterval` provides hysteresis so a noisy path does not cause route
  flapping.
- `FabricChannelRuntime` binds the router to live QUIC fabric sessions for
  reliable byte payloads: it opens the logical stream, sends frames, measures
  ACK latency, reports observations to the router, and continues remaining
  payloads on a rerouted QUIC route after connect failure or slow ACKs.
- QUIC logical session close cancels the stream read side before closing the
  write side, so high-churn short sessions release reader goroutines promptly
  instead of waiting for stream read deadlines.
- Server-side QUIC stream handlers close their write side when the handler
  exits. This returns QUIC stream credit promptly during high-churn short
  sessions and prevents the last worker window from stalling on stream open.
- Production request/response forwarding now builds a `FabricRouteSet` from all
  QUIC endpoint candidates for the next hop, sends the envelope over the chosen
  QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
  failure or response timeout.
- The legacy HTTP production forward carrier has been removed from the mesh
  runtime API. Production forwarding now exposes a single QUIC transport
  implementation; HTTP handlers remain only as node-local API surfaces and test
  harness entry points.
- Production route choice includes live per-route active-channel pressure, so
  concurrent forwarding requests can spread across equivalent QUIC candidates
  instead of concentrating on the first/shortest route until it is saturated.
- Production forwarding also keeps per-route health quarantine. A QUIC route
  that fails connect or response is marked unhealthy for a bounded retry window,
  skipped by subsequent channel scheduling, exposed in route-health snapshots,
  and restored automatically after the retry window or a successful send.
- `FabricRoutePressureTracker` provides shared active-channel accounting for
  both production request/response forwarding and bulk `FabricChannelRuntime`
  traffic, so different traffic surfaces can make route decisions against the
  same live load signal.
- Route pressure is observable through `FabricRoutePressureSnapshot`, including
  current active channels, max active channels, total acquire/release counts,
  and last acquired/released route IDs. Bulk runtime results and production
  QUIC forwarding snapshots expose this data for stress reports.
- `fabric-loadtest` reports route IDs per stream attempt, global
  `route_pressure`, and per-target `max_active_channels`, so stress runs can
  verify channel distribution and release accounting after churn.
- `FabricRouteSetForPeerEndpointCandidates` converts QUIC endpoint candidates
  into production route sets for direct, LAN, ICE/STUN-derived, reverse
  outbound, and relay fallback modes. Non-QUIC candidates are rejected instead
  of becoming alternate transports.
|
||||
- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
|
||||
instead of collapsing to one address: operator/public QUIC, listener QUIC,
|
||||
LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
|
||||
`relay_quic` fallback. Candidate metadata carries `local_segment_id`,
|
||||
`nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
|
||||
`relay_endpoint` when configured.
|
||||
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
|
||||
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
|
||||
health observations, latency, reliability, region, policy tags, and live
|
||||
capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
|
||||
candidates rather than alternate transports.
|
||||
- `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
|
||||
selection can return only `QUICFabricTransport`; unsupported labels fail with
|
||||
a QUIC-required error.
|
||||
- Explicit transport labels are authoritative. A legacy label such as `relay`
|
||||
or `outbound_reverse` is rejected even when the endpoint string uses a
|
||||
`quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
|
||||
- Node-agent config loading rejects legacy advertised transport labels and
|
||||
HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
|
||||
fabric endpoints. Bad endpoint posture fails before heartbeat publication.
|
||||
- Host-agent install/runtime validation rejects legacy mesh advertise transport
|
||||
labels and HTTP/WebSocket advertise endpoints before they can be passed into a
|
||||
node-agent Docker runtime.
|
||||
- JSON-advertised endpoint candidates and scoped synthetic config route
|
||||
recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
|
||||
seeds, and rendezvous leases reject legacy transport labels and
|
||||
HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
|
||||
entries.
|
||||
- Rendezvous relay leases and peer-connection intents now use `relay_quic` as
|
||||
the transport label. `relay_control` remains only a telemetry/control-state
|
||||
name for rendezvous admission counters, not a data-plane transport option.
|
||||
- Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint
|
||||
candidates are probed with QUIC session setup, pinned certificate metadata is
|
||||
honored, and HTTP/WebSocket endpoint schemes are rejected instead of being
|
||||
used as peer health transport.
|
||||
- Node-agent synthetic runtime no longer installs an HTTP peer transport as an
inter-node carrier, and the shared mesh runtime package no longer exports an
HTTP peer transport implementation. Any HTTP synthetic motion is confined to
explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
gates.
- Control-plane and debug JSON mesh config loading is validated after
conversion into runtime structures. Peer endpoint candidates, recovery seeds,
rendezvous leases, and selected relay endpoints in route decisions must use
QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
rendezvous leases.
- The old fabric-session WebSocket endpoint is no longer exposed by
`FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
and is not part of the node-agent fabric transport surface.
- The planner treats the same local segment or the same NAT group as a LAN route,
so a whole cluster piece behind one NAT can prefer private addresses between
its own nodes while still maintaining outbound/relay visibility to the rest
of the fabric.
- Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
totals by transport label, route pressure, QUIC listener state, goroutines,
heap usage, and the next recommended soak gate.
- `FabricOverlayTransport` is the generic service-neutral send facade over
route sets, `FabricChannelRuntime`, shared route pressure, and QUIC sessions.
New traffic classes should enter the fabric through this layer or an
equivalent runtime integration, not through HTTP/WebSocket fallbacks.
- `FabricChannelRuntime` uses the same route health quarantine as production
forwarding. Connect failures, stream send failures, and missing ACKs mark a
route unhealthy for a bounded retry window, so later channels for any traffic
class avoid that route until it recovers.
- `FabricOverlayTransport` exposes route pressure and route health snapshots,
and node heartbeat runtime metadata reports production route health plus the
current quarantined route count.
- Scheduler resource guardrails include `HardMaxRoutePressure`: when enabled,
a route whose projected active-channel pressure exceeds the threshold is not
admitted. This makes overload prevention enforceable in route choice rather
than only observable after the fact.
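As a rough sketch of such an admission check: the `route` type and `admit` function are hypothetical, and both "projected pressure = current active channels + 1" and "zero disables the guardrail" are assumptions about how the threshold is applied.

```go
package main

import "fmt"

// route is a hypothetical view of a candidate route.
type route struct {
	ID     string
	Active int // channels currently active on this route
}

// admit keeps only routes whose projected pressure stays within hardMax.
// Assumption: admitting one more channel raises pressure by exactly one.
// Assumption: hardMax <= 0 means the guardrail is disabled.
func admit(candidates []route, hardMax int) []route {
	if hardMax <= 0 {
		return candidates
	}
	var admitted []route
	for _, r := range candidates {
		if r.Active+1 <= hardMax {
			admitted = append(admitted, r)
		}
	}
	return admitted
}

func main() {
	candidates := []route{{"lan-1", 31}, {"relay-1", 32}, {"ice-1", 4}}
	for _, r := range admit(candidates, 32) {
		fmt.Println(r.ID)
	}
}
```

With a threshold of 32, `relay-1` is filtered out before selection, so the overload is prevented in route choice instead of being reported afterwards.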
- The loadtest verdict fails on route-pressure leaks, acquire/release mismatch,
missing acquire accounting, active channels above configured concurrency, or
target distribution collapse/skew when multiple targets are healthy.
- Continuous soak aggregation is bounded: `fabric-loadtest` keeps exact
counters, per-target totals, route-mode counts, error/reroute totals, and
bounded latency samples, while `stream_samples` is capped to diagnostic
examples. Long 30-120 minute runs should not retain one result object per
logical channel.
- `fabric-loadtest` also keeps bounded `error_samples`, so high-volume churn
reports preserve representative failed logical channels even when the first
retained `stream_samples` are all successful.
- Mixed topology verdicts require route-mode coverage when at least four
healthy targets are present. A `mixed-public-nat-lan-relay` or
`nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`.
- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
`outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
`wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
violation even if payload delivery succeeds.
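The two route-mode rules (required coverage for mixed profiles, and no legacy labels) can be expressed as one pass over observed route-mode counts. This is a sketch under assumptions: `routeModeReasons` and the reason strings `legacy_route_mode=...` and `missing_route_mode=...` are invented for illustration and are not taken from `fabric-loadtest`.

```go
package main

import (
	"fmt"
	"sort"
)

// Route modes a mixed-topology run must exercise, as listed in this document.
var requiredModes = []string{"lan_quic", "ice_quic", "reverse_quic", "relay_quic"}

// Legacy labels that count as a transport-layer violation.
var legacyModes = map[string]bool{
	"relay": true, "outbound_reverse": true, "direct_http": true,
	"direct_https": true, "direct_tcp_tls": true, "ws": true,
	"wss": true, "websocket": true,
}

// routeModeReasons returns hypothetical verdict failure reasons.
// Coverage is enforced only for mixed profiles with at least four healthy targets.
func routeModeReasons(observed map[string]int, healthyTargets int, mixedProfile bool) []string {
	var reasons []string
	for mode, count := range observed {
		if count > 0 && legacyModes[mode] {
			reasons = append(reasons, "legacy_route_mode="+mode)
		}
	}
	if mixedProfile && healthyTargets >= 4 {
		for _, mode := range requiredModes {
			if observed[mode] == 0 {
				reasons = append(reasons, "missing_route_mode="+mode)
			}
		}
	}
	sort.Strings(reasons) // stable output regardless of map iteration order
	return reasons
}

func main() {
	observed := map[string]int{"lan_quic": 100, "ice_quic": 100, "relay": 100, "relay_quic": 100}
	fmt.Println(routeModeReasons(observed, 4, true))
	// prints [legacy_route_mode=relay missing_route_mode=reverse_quic]
}
```

Note that the legacy check does not look at delivery results at all, which matches the rule that a legacy label is a violation even when payload delivery succeeds.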
- Healthy multi-target verdicts check both stream distribution and byte
distribution. This prevents a run from passing with equal channel counts but
most bulk bytes concentrated on one target or route.
- Healthy multi-target verdicts also check route-pressure distribution through
per-route `max_active` values. A run fails if live concurrent channel load
collapses onto one target/route while alternatives are healthy.
- Successful logical channels must receive one ACK per transmitted data frame.
`fabric-loadtest` reports `ack_mismatched_streams`, per-target
`acks_received`, and fails verdict when any stream is marked successful with
fewer ACKs than sent frames.
- ACK payloads carry the SHA-256 checksum of the received data-frame payload.
`fabric-loadtest` validates the checksum for every ACK and fails verdict with
`ack_integrity_errors` when the acknowledged bytes do not match the sent
payload.
- Failover accounting separates `abandoned_frames` from true ACK mismatch. A
frame sent on a route that dies before ACK is counted as abandoned and the
unacknowledged byte range is retransmitted on the next pool member; verdict
still fails when non-abandoned frames are missing ACKs.
- Loadtest data frames use deterministic per-frame payloads derived from stream
index, logical stream ID, sequence, and byte offset. This makes checksum ACKs
validate each frame identity instead of repeatedly validating one shared
buffer pattern.
- Mixed bulk/control stress is supported with `-control-every`,
`-control-bytes-per-stream`, and `-max-control-ack-p95-ms`. Reports include
`control_streams`, `bulk_streams`, `control_ack_p95_ms`, and
`bulk_ack_p95_ms`; verdict fails when control ACK p95 exceeds the configured
SLO.
- Verified shared-host mixed smoke:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
The run produced 40/40 successful streams, 8 control streams,
`control_ack_p95_ms=1`, `bulk_ack_p95_ms=2`,
`route_pressure.active_total=0`, and matching acquire/release counts.
- Verified shared-host mixed failover stress:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
Latest run `fabric-loadtest-20260516-160751` produced 1000/1000 successful
streams, 250 failover events after the planned target kill, 50 control
streams, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=6`, `ack_p95_ms=6`,
`ack_p99_ms=8`, `route_attempts_total=1250`,
`route_pressure.active_total=0`, `max_active_total=128`, and matching
acquire/release counts. Full JSON artifacts are written under
`artifacts/fabric-loadtest`.
- Verified shared-host mixed degradation/migration stress:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
The run produced 200/200 successful streams, 9 migration events,
20 control streams, `control_ack_p95_ms=2`, `bulk_ack_p95_ms=7`,
`route_pressure.active_total=0`, `max_active_total=32`, and matching
acquire/release counts.
- Latest shared-host degradation/migration gate:
`fabric-loadtest-20260516-160710` with 160 streams, 32 concurrency, 4 MiB
bulk streams, and 180 ms delay plus 0.5% loss induced after 50 ms produced 160/160
successful streams, 12 slow-ACK migrations, degraded-target quarantine,
`control_ack_p95_ms=3`, `bulk_ack_p95_ms=180`,
`route_pressure.active_total=0`, `max_active_total=32`, and matching
acquire/release counts.
- Short shared-host soak gate:
`fabric-loadtest-20260516-160943` with `-Duration 45s`, 1200 streams,
96 concurrency, four healthy targets, and mixed control/bulk traffic produced
1200/1200 successful streams, even 300/300/300/300 target distribution,
`channel_opens=1200`, `channel_closes=1200`, `channel_leaks=0`,
`control_ack_p95_ms=4`, `ack_p95_ms=5`, `ack_p99_ms=8`,
`route_pressure.active_total=0`, `max_active_total=96`, and matching
acquire/release counts.
- Continuous soak mode is now explicit: add `-Soak -Duration 30m` or
`-Soak -Duration 120m` to the Docker runner. In soak mode workers keep
creating and closing logical channels until the duration expires, instead of
stopping after a fixed stream list. This is the required gate for memory,
goroutine, file descriptor, QUIC stream, and route-pressure stability.
- Soak duration stops new logical channel creation but does not cancel channels
already in flight. In-flight channels drain under their per-channel
`-StreamTimeout`; the outer `-ClientTimeout` remains the hard scenario
guardrail. This prevents the final active window from being counted as
failed streams just because the soak timer expired.
- Recommended real-topology soak command:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
- Soak reports include `resource_samples` and `resource_summary` with
goroutine start/end/max/delta, heap allocation start/end/max/delta, heap
objects, open file descriptor start/end/max/delta, GC delta, max active QUIC
streams, and max active route load.
Optional verdict gates `-MaxGoroutineDelta` and `-MaxHeapDeltaMB` fail the
run if resource drift exceeds the configured budget.
- Optional file descriptor verdict gates `-MaxOpenFDDelta` and `-MaxOpenFDs`
are passed through the Docker runner to `fabric-loadtest` as
`-max-open-fd-delta` and `-max-open-fds`. On Linux containers these read
`/proc/self/fd` and fail the run if descriptor count drifts or peaks beyond
the configured budget.
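A minimal sketch of the descriptor count and the two gates, assuming the delta gate compares end against start and the absolute gate compares the observed peak. `countOpenFDs`, `fdGateReasons`, the reason strings, and "zero disables a gate" are all assumptions, not `fabric-loadtest` behavior.

```go
package main

import (
	"fmt"
	"os"
)

// countOpenFDs counts entries in /proc/self/fd (Linux only).
// The count includes the descriptor briefly opened to read the directory.
func countOpenFDs() (int, error) {
	entries, err := os.ReadDir("/proc/self/fd")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

// fdGateReasons returns hypothetical verdict reasons for descriptor drift
// (end-start above maxDelta) and peak usage (peak above maxOpen).
// Assumption: a limit of 0 disables that gate.
func fdGateReasons(start, end, peak, maxDelta, maxOpen int) []string {
	var reasons []string
	if maxDelta > 0 && end-start > maxDelta {
		reasons = append(reasons, fmt.Sprintf("open_fds_delta=%d>%d", end-start, maxDelta))
	}
	if maxOpen > 0 && peak > maxOpen {
		reasons = append(reasons, fmt.Sprintf("open_fds_max=%d>%d", peak, maxOpen))
	}
	return reasons
}

func main() {
	if n, err := countOpenFDs(); err == nil {
		fmt.Println("open fds now:", n)
	}
	fmt.Println(fdGateReasons(15, 9, 19, 8, 128))  // within budget: no reasons
	fmt.Println(fdGateReasons(15, 40, 45, 8, 128)) // drift of 25 exceeds 8
}
```

Sampling start, end, and peak separately lets the delta gate catch slow leaks while the absolute gate catches short spikes that have already drained by the end of the run.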
- Optional throughput SLO gate `-MinThroughputMbps` is passed through the Docker
runner to `fabric-loadtest` as `-min-throughput-mbps`. It fails verdict when
useful data-plane throughput falls below the configured Mbps floor.
- Optional short-session churn SLO gate `-MinChannelChurnPerSec` is passed
through the Docker runner to `fabric-loadtest` as
`-min-channel-churn-per-sec`. It fails verdict when logical channel
open/close throughput falls below the configured channel-per-second floor.
- Each logical channel has a per-channel timeout through `-StreamTimeout`
in the Docker runner and `-stream-timeout` in `fabric-loadtest`. This keeps a
wedged channel from holding a worker slot until the whole client run times
out, preserving channel isolation under churn.
- Each data frame has an ACK timeout through `-AckTimeout` in the Docker runner
and `-ack-timeout` in `fabric-loadtest`. A missing ACK triggers reroute/pool
retry without waiting for the full channel timeout.
- Optional overall ACK latency gates `-MaxAckP95Ms` and `-MaxAckP99Ms` are
passed through the Docker runner to `fabric-loadtest` as
`-max-ack-p95-ms` and `-max-ack-p99-ms`. They fail healthy runs when
aggregate data-plane ACK latency exceeds the configured SLO, independently
of slow-route migration thresholds.
- Optional per-target ACK latency gate `-MaxTargetAckMs` is passed through the
Docker runner to `fabric-loadtest` as `-max-target-ack-ms`. It fails healthy
runs when any target route reports a `target_stats[*].max_ack_ms` above the
configured SLO.
- Optional channel setup latency gates `-MaxSetupP95Ms` and `-MaxSetupP99Ms`
are passed through the Docker runner to `fabric-loadtest` as
`-max-setup-p95-ms` and `-max-setup-p99-ms`. They fail healthy runs when
logical channel open/setup latency exceeds the configured SLO before payload
transfer starts.
- Optional reroute latency gates `-MaxRerouteP95Ms` and `-MaxRerouteP99Ms`
are passed through the Docker runner to `fabric-loadtest` as
`-max-reroute-p95-ms` and `-max-reroute-p99-ms`. They measure repeat channel
setup latency after pool failover or slow-route migration and fail the run
when route rebuild exceeds the configured SLO.
- Docker shared-host summaries also include `container_stats` from
`docker stats --no-stream` for each fabric server/client container that is
still running at the end of the scenario. This records CPU percent, memory
usage, memory percent, network IO, block IO, and PID count per node before
cleanup.
- Long soak runs can add `-ContainerStatsSampleInterval 10s` to collect
periodic Docker container stats while traffic is in flight. The runner writes
samples to `container_stats_samples_path`, includes
`container_stats_samples_count` and `container_stats_sample_summary`, and
records per-container memory/PID start, end, max, and delta values.
- Optional container resource verdict gates `-MaxContainerMemoryMiB` and
`-MaxContainerPids` fail the Docker scenario when any running fabric
container exceeds the configured memory or PID budget at the final snapshot
or at any periodic sample peak.
- Verified short continuous soak:
`fabric-loadtest-20260516-163206` used `-Soak -Duration 20s`,
mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk
traffic. It produced 4000/4000 successful logical channels,
`channel_opens=4035`, `channel_closes=4035`, `channel_leaks=0`,
`route_pressure.active_total=0`, `max_active_total=32`,
`control_ack_p95_ms=2`, `ack_p95_ms=4`, resource sample count 12,
goroutine delta -18, max active streams 32, max active route load 32, and
matching acquire/release counts.
- Verified 60-second high-churn continuous soak with graceful drain:
`fabric-loadtest-20260516-174505` rebuilt the Docker image after changing
soak duration to stop generation and let in-flight channels drain. The
4-node mixed-topology run used 128 concurrency, `-Duration 60s`,
`-StreamTimeout 15s`, periodic resource/container sampling, mixed
control/bulk traffic, throughput and churn SLOs. It produced 438740/438740
successful logical channels, `channel_churn_per_sec=7310`,
`throughput_bps=3473632858`, `ack_p95_ms=5`, `ack_p99_ms=6`,
`control_ack_p95_ms=3`, `channel_opens=438740`,
`channel_closes=438740`, `channel_leaks=0`, `open_failures=0`,
`goroutines_delta=-1`, `open_fds_delta=4`, all four route modes, clean
route-pressure accounting, and verdict `pass`.
- Verified pool failover soak with ACK timeout and abandoned-frame accounting:
`fabric-loadtest-20260516-175622` rebuilt the Docker image with ACK timeout,
target quarantine, and abandoned-frame accounting, then killed target 0 after
3 seconds during a 30-second mixed-topology soak. It produced 136194/136194
successful logical channels, `failed_streams=0`, `failover_events=82`,
`abandoned_frames=75`, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, `channel_churn_per_sec=4543`,
`throughput_bps=2156155314`, `reroute_latency_p99_ms=9`,
`channel_leaks=0`, clean route-pressure accounting, and verdict `pass`.
- Verified container stats gate:
`fabric-loadtest-20260516-163854` produced a passing 2-node mixed-topology
smoke with `-MaxContainerMemoryMiB 128 -MaxContainerPids 64` and included
`container_stats` for both fabric server containers, with memory usage around
4-6 MiB per server and server PID counts 7-9. A negative control run with
`-MaxContainerMemoryMiB 1` failed as expected with
`container_memory_mib=...>1` verdict reasons.
- Verified periodic container stats sampling:
`fabric-loadtest-20260516-164259` used `-Soak -Duration 8s`,
`-ContainerStatsSampleInterval 2s`, mixed public/NAT/LAN/relay profile, and
`-MaxContainerMemoryMiB 128 -MaxContainerPids 64`. It produced 2000/2000
successful logical channels, `channel_opens=2009`, `channel_closes=2009`,
`channel_leaks=0`, even 1000/1000 target distribution, 400 control streams,
`ack_p95_ms=1`, `route_pressure.active_total=0`, matching acquire/release
counts, final server memory around 12-13 MiB, and periodic sample peaks for
the client and both servers in
`fabric-loadtest-20260516-164259-container-stats-samples.json`.
- Verified high-churn goroutine drain after QUIC close cancellation:
`fabric-loadtest-20260516-164502` rebuilt the Docker image and repeated the
2-node mixed-topology continuous soak with `-MaxGoroutineDelta 64`,
`-MaxHeapDeltaMB 128`, `-ContainerStatsSampleInterval 2s`,
`-MaxContainerMemoryMiB 128`, and `-MaxContainerPids 64`. It produced
2000/2000 successful logical channels, `channel_opens=2009`,
`channel_closes=2009`, `channel_leaks=0`, even 1000/1000 target
distribution, `control_ack_p95_ms=1`, `ack_p95_ms=1`,
`route_pressure.active_total=0`, matching acquire/release counts, and
`goroutines_delta=-2`.
- Verified file descriptor gate:
`fabric-loadtest-20260516-164725` rebuilt the Docker image and repeated the
2-node mixed-topology continuous soak with `-MaxOpenFDDelta 8` and
`-MaxOpenFDs 128` in addition to goroutine, heap, container memory, and PID
gates. It produced 2000/2000 successful logical channels,
`channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, `open_fds_start=15`, `open_fds_end=9`,
`open_fds_max=19`, and `open_fds_delta=-6`.
- Verified bounded soak aggregation:
`fabric-loadtest-20260516-165051` rebuilt the Docker image after changing
soak result storage to an aggregate collector. The 2-node mixed-topology soak
produced 2000/2000 successful logical channels, even 1000/1000 target
distribution, `channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, `goroutines_delta=0`, `open_fds_delta=1`, verdict
`pass`, and only 25 retained `stream_samples` in the full report.
- Verified mixed route-mode coverage gate:
`fabric-loadtest-20260516-165308` rebuilt the Docker image with the route
coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000
successful logical channels, even 1000/1000/1000/1000 target distribution,
`channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, and observed all required route modes:
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`.
- Verified ACK integrity gate:
`fabric-loadtest-20260516-165544` rebuilt the Docker image with the ACK
mismatch verdict and repeated the 4-node mixed-topology soak. It produced
4000/4000 successful logical channels, `ack_mismatched_streams=0`, per-target
`frames_sent=6600` and `acks_received=6600`, all four route modes, clean
channel/route pressure accounting, and verdict `pass`.
- Verified ACK checksum integrity gate:
`fabric-loadtest-20260516-165926` rebuilt the Docker image with ACK payload
checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000
successful logical channels, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, 26400 total data frames, 26400 ACKs, all four route
modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified unique per-frame payload integrity:
`fabric-loadtest-20260516-170150` rebuilt the Docker image after switching
loadtest traffic from a shared payload buffer to deterministic per-frame
payloads. The 4-node mixed-topology soak produced 4000/4000 successful
logical channels, `ack_mismatched_streams=0`, `ack_integrity_errors=0`, 26400
data frames, 26400 ACKs, all four route modes, clean channel/route pressure
accounting, and verdict `pass`.
- Verified throughput SLO gate:
`fabric-loadtest-20260516-170512` rebuilt the Docker image with
`-MinThroughputMbps 100` and repeated the 4-node mixed-topology soak. It
produced 4000/4000 successful logical channels, `throughput_bps=212479668`,
`ack_mismatched_streams=0`, `ack_integrity_errors=0`, all four route modes,
clean channel/route pressure accounting, and verdict `pass`.
- Verified short-session churn SLO gate:
`fabric-loadtest-20260516-173320` rebuilt the Docker image with
`-MinChannelChurnPerSec 200`, then ran a 4-node mixed-topology high-churn
short-session smoke with 1000 one-frame logical channels. It produced
1000/1000 successful logical channels, `channel_churn_per_sec=9478`,
`channel_opens=1000`, `channel_closes=1000`, `channel_leaks=0`, even target
stream distribution, all four route modes, clean route-pressure accounting,
and verdict `pass`.
- Verified high-churn QUIC stream-credit regression gate:
`fabric-loadtest-20260516-174046` rebuilt the Docker image after closing the
server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst
of 5000 one-frame short logical channels at 128 concurrency with
`-MinChannelChurnPerSec 300` and `-StreamTimeout 15s`. It produced 5000/5000
successful logical channels, `channel_churn_per_sec=21124`,
`channel_opens=5000`, `channel_closes=5000`, `channel_leaks=0`,
`open_failures=0`, `ack_mismatched_streams=0`, `ack_integrity_errors=0`,
even 1250/1250/1250/1250 target distribution, all four route modes, clean
route-pressure accounting, and verdict `pass`.
- Verified target byte distribution gate:
`fabric-loadtest-20260516-170731` rebuilt the Docker image with byte
distribution verdicts and repeated the 4-node mixed-topology soak. It
produced 4000/4000 successful logical channels, even 1000/1000/1000/1000
stream distribution, exactly 53,248,000 bytes per target,
`throughput_bps=212488911`, all four route modes, clean channel/route
pressure accounting, and verdict `pass`.
- Verified overall ACK latency SLO gate:
`fabric-loadtest-20260516-171001` rebuilt the Docker image with
`-MaxAckP95Ms 20` and `-MaxAckP99Ms 50` and repeated the 4-node
mixed-topology soak. It produced 4000/4000 successful logical channels,
`ack_p95_ms=2`, `ack_p99_ms=3`, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, all four route modes, clean channel/route pressure
accounting, and verdict `pass`.
- Verified route-pressure distribution gate:
`fabric-loadtest-20260516-171216` rebuilt the Docker image with
route-pressure distribution verdicts and repeated the 4-node mixed-topology
soak. It produced 4000/4000 successful logical channels, even target stream
and byte distribution, per-route `max_active` values of 13/12/13/13,
`route_pressure.active_total=0`, matching acquire/release counts, and
verdict `pass`.
- Verified per-target ACK latency gate:
`fabric-loadtest-20260516-171454` rebuilt the Docker image with
`-MaxTargetAckMs 20` and repeated the 4-node mixed-topology soak. It produced
4000/4000 successful logical channels, per-target `max_ack_ms` values of
6/5/7/9, `ack_p95_ms=3`, `ack_p99_ms=5`, all four route modes, clean
channel/route pressure accounting, and verdict `pass`.
- Verified channel setup latency SLO gate:
`fabric-loadtest-20260516-171937` rebuilt the Docker image with
`-MaxSetupP95Ms 20` and `-MaxSetupP99Ms 50`, then repeated the 4-node
mixed-topology soak with ACK, throughput, FD, goroutine, heap, container
memory, and PID gates enabled. It produced 4000/4000 successful logical
channels, `setup_latency_p95_ms=0`, `ack_p95_ms=3`, `ack_p99_ms=3`,
`throughput_bps=212572631`, even target stream/byte distribution, all four
route modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified reroute latency SLO gate:
`fabric-loadtest-20260516-172652` rebuilt the Docker image with
`-MaxRerouteP95Ms 100` and `-MaxRerouteP99Ms 200`, then ran a 4-node
mixed-topology pool-failover stress with target 0 killed during load. It
produced 400/400 successful logical channels, 100 pool failover events,
`reroute_latency_p95_ms=1`, `reroute_latency_p99_ms=2`,
`route_attempts_total=500`, `ack_p95_ms=6`, `ack_p99_ms=8`,
`throughput_bps=3863633075`, clean channel/route pressure accounting, and
verdict `pass`.
- Mixed topology profile gate:
`fabric-loadtest-20260516-162037` used
`-TopologyProfile mixed-public-nat-lan-relay` with 400 streams, 64
concurrency, four targets, and mixed control/bulk traffic. It produced
400/400 successful streams, 100 streams per target, route-mode reporting for
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`,
`control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
`route_pressure.active_total=0`, and matching acquire/release counts.
- Verified strict QUIC route-mode gate:
`fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
route-mode verdicts and ran the 4-node mixed topology profile. It produced
400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, `channel_leaks=0`, clean route-pressure accounting,
and verdict `pass`.
- `fabric-loadtest` now also treats the configured target list as part of the
acceptance surface: every target must be `quic://...`. Empty targets, bare
`host:port`, HTTP(S), and WS/WSS targets produce a failing
`non_quic_targets=...` verdict reason. Client mode also rejects those targets
before dialing, so a bad stress command cannot accidentally exercise a
non-QUIC path and only discover it after the run.
- The shared Docker runner `scripts/fabric/fabric-loadtest-docker-smoke.ps1`
now has matching guardrails: it refuses local Docker Desktop contexts such as
`default`/`desktop-linux` and validates generated targets before launch so the
real-load smoke remains tied to the shared test Docker host and QUIC-only
endpoints.
- Shared Docker validation after those guardrails:
`fabric-loadtest-20260516-190049` rebuilt the Docker image on `test-docker`
and ran 4 QUIC targets with 120 streams. It produced 120/120 successful
logical channels, `ack_p95_ms=3`, `setup_latency_p95_ms=21`, clean
open/close and route-pressure accounting, QUIC-only targets, and verdict
`pass`.
- Shared Docker mixed-topology failover validation:
`fabric-loadtest-20260516-190137` reused the image on `test-docker`, killed
target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay
profile. It produced 400/400 successful logical channels, 100 pool failover
events, `route_attempts_total=500`, route modes `ice_quic`,
`reverse_quic`, and `relay_quic` after the failed target was removed,
`ack_p95_ms=8`, `setup_latency_p95_ms=51`, clean channel/route-pressure
accounting, and verdict `pass`.
- Shared Docker mixed-topology route coverage validation:
`fabric-loadtest-20260516-190207` ran the same 4-target mixed profile without
target failure. It produced 400/400 successful logical channels, exactly 100
streams per target, observed `lan_quic`, `ice_quic`, `reverse_quic`, and
`relay_quic`, kept `ack_integrity_errors=0`, `channel_leaks=0`,
`route_pressure.active_total=0`, and verdict `pass`.
- Load balancing under pool failover is now an acceptance gate. The first
stricter shared-host rebuild, `fabric-loadtest-20260516-190704`, intentionally
failed because all failed-target retries moved to the nearest live target,
producing `target_byte_distribution_skew` and
`route_pressure_distribution_skew`. The retry selector was then changed to
spread failed-slot retries across the currently usable target set instead of
selecting the next target in ring order.
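One way to spread retries is to pick the usable target with the fewest active channels instead of the ring successor. The sketch below is an assumption about how such a selector could look, not the actual retry selector: `target`, `nextInRing`, and `pickRetryTarget` are hypothetical, and "least active" is only one possible spreading policy.

```go
package main

import "fmt"

// target is a hypothetical view of a pool member.
type target struct {
	Name    string
	Healthy bool
	Active  int // channels currently placed on this target
}

// nextInRing is the old behavior: every retry for a failed slot lands on the
// first healthy successor, which concentrates load on one neighbor.
func nextInRing(targets []target, failed int) (int, bool) {
	for step := 1; step < len(targets); step++ {
		i := (failed + step) % len(targets)
		if targets[i].Healthy {
			return i, true
		}
	}
	return -1, false
}

// pickRetryTarget spreads retries: it returns the healthy target, other than
// the failed one, with the lowest active count (first wins on ties).
func pickRetryTarget(targets []target, failed int) (int, bool) {
	best := -1
	for i, t := range targets {
		if i == failed || !t.Healthy {
			continue
		}
		if best == -1 || t.Active < targets[best].Active {
			best = i
		}
	}
	return best, best != -1
}

func main() {
	targets := []target{{"t0", false, 0}, {"t1", true, 0}, {"t2", true, 0}, {"t3", true, 0}}
	for retry := 0; retry < 99; retry++ {
		i, ok := pickRetryTarget(targets, 0)
		if !ok {
			break
		}
		targets[i].Active++
	}
	fmt.Println(targets[1].Active, targets[2].Active, targets[3].Active) // 33 33 33
}
```

Running the same loop with `nextInRing` would put all 99 retries on `t1`, which is the kind of skew the distribution gates are meant to catch.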
- Verified load-aware retry routing after the fix:
`fabric-loadtest-20260516-191028` rebuilt on `test-docker`, killed target 0
after 100ms, and repeated the 4-target mixed profile. It produced 400/400
successful logical channels, 100 pool failover events, surviving-target stream
distribution of 134/133/133, surviving route-pressure max-active values of
30/25/27, `ack_p95_ms=4`, `reroute_latency_p95_ms=1`, clean acquire/release
accounting, and verdict `pass`.
- Verified 1000-channel mixed-topology stress:
`fabric-loadtest-20260516-193414` ran 1000 logical channels on `test-docker`
with 128 concurrency, mixed control/bulk traffic, and the
`mixed-public-nat-lan-relay` profile. It produced 1000/1000 successful
logical channels, exact 250/250/250/250 target distribution, observed all four
QUIC route modes (`lan_quic`, `ice_quic`, `reverse_quic`, `relay_quic`),
`throughput_bps=3629522849`, `channel_churn_per_sec=1919`,
`ack_p95_ms=6`, clean channel/route-pressure accounting, and verdict `pass`.
- Verified 1000-channel pool-failover stress:
`fabric-loadtest-20260516-193444` killed target 0 after 100ms and ran 1000
logical channels with 128 concurrency. It produced 1000/1000 successful
logical channels, 250 pool failover events, surviving-target distribution of
334/333/333, `route_attempts_total=1250`, `ack_p95_ms=7`, clean
acquire/release accounting, and verdict `pass`.
- Verified latency-degradation migration:
`fabric-loadtest-20260516-193515` applied `tc netem delay 80ms` to target 1,
enabled slow-stream migration with `-MaxAckMs 20`, and ran 400 mixed-profile
channels. It observed the impaired target in `degraded_targets`, produced
64 slow-ACK migrations, moved completed streams onto healthy targets with
distribution 134/133/133, kept `channel_leaks=0`, `ack_integrity_errors=0`,
clean route-pressure accounting, and verdict `pass`.
- Shared Docker runner resource-sample fallback was verified with
`fabric-loadtest-20260516-190325`: short runs now still persist
`container_stats_samples_path` and a minimal per-container sample summary
from final Docker stats when the background sampler has no time to emit
samples.
- Added `scripts/fabric/fabric-acceptance-summary.ps1` to aggregate recent
`*-summary.json` artifacts into an acceptance report. It captures verdicts,
target distribution, route modes, churn, failover/migration counts, latency
SLOs, and resource evidence, and it keeps intentionally failed runs visible as
regression evidence for gates such as route-pressure skew detection.
|
||||
- The first 30-minute soak attempt (`fabric-loadtest-20260516-193558`) exposed
|
||||
a runner defect instead of a fabric defect: server containers were still
|
||||
started with a fixed `-timeout 10m`, so the three surviving servers exited
|
||||
around minute 10 while the client expected a 30-minute run. The Docker runner
|
||||
now exposes `-ServerTimeout` and defaults it to `-ClientTimeout`, so long soak
|
||||
server lifetimes match the client run.
|
||||
- The next soak attempt (`fabric-loadtest-20260516-194816`) passed the 10-minute
|
||||
server-timeout boundary but exposed another long-run behavior: a healthy
|
||||
surviving target could stay out of placement after a transient degradation
|
||||
mark. `fabric-loadtest` now uses a bounded `target_quarantine_ttl` for
|
||||
placement while still preserving historical `degraded_targets` observations
|
||||
in the report. The Docker runner exposes this as `-TargetQuarantineTTL`.
|
||||
- `fabric-loadtest-20260516-200241` then exposed a soak-loop issue: it reported
|
||||
`pass` with 432869/432869 logical channels and clean accounting, but finished
|
||||
after about 95 seconds despite `config.duration=30m`. The cause was worker
|
||||
shutdown on per-stream `context deadline exceeded`; soak workers now only exit
|
||||
on the parent run context or the configured soak stop time, not on one
|
||||
channel's timeout.
|
||||
- `fabric-loadtest-20260516-200939` and `fabric-loadtest-20260516-201331`
|
||||
confirmed the soak loop fix by running full 3-minute preflights, but they
|
||||
failed the zero-failed-stream gate under target-kill injection. The issue was
|
||||
policy: the known killed target re-entered placement too quickly via the
|
||||
short transient quarantine TTL, causing some channels to spend their stream
|
||||
budget on a hard-dead endpoint. `fabric-loadtest` now separates transient
|
||||
`target_quarantine_ttl` from `failure_quarantine_ttl`, and the Docker runner
|
||||
exposes `-FailureQuarantineTTL`.
|
||||
- Verified 30-minute long-duration soak:
|
||||
`fabric-loadtest-20260516-202532` ran on `test-docker` for 1800.010 seconds
|
||||
with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per
|
||||
logical channel, 10-second resource and container samples, and the
|
||||
`mixed-public-nat-lan-relay` profile. It produced 15,074,556/15,074,556
|
||||
successful logical channels, 895,308,005,376 bytes, `throughput_bps=3979124146`,
|
||||
`channel_churn_per_sec=8374`, exact 3,768,639 streams per target, all four
|
||||
QUIC route modes, `ack_p95_ms=5`, `ack_p99_ms=6`, `channel_leaks=0`,
|
||||
matching 15,074,556 channel opens/closes, `route_pressure.active_total=0`,
|
||||
458 container-stat samples, bounded memory/PID use, and verdict `pass`.
|
||||
- Verified real-node host-to-host QUIC smoke:
|
||||
`home-1` ran the standalone `fabric-loadtest` client against a temporary
|
||||
QUIC server on `test-docker` at `quic://docker-test.cin.su:19443`. The run
|
||||
created 1000 short logical channels at 128 concurrency, mixed control and
|
||||
bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced
|
||||
`throughput_bps=1177445403`, `channel_churn_per_sec=2478`,
|
||||
`ack_p95_ms=12`, `ack_p99_ms=21`, `setup_latency_p95_ms=118`, zero failed
|
||||
streams, zero channel leaks, and verdict `pass`. The report is saved as
|
||||
`artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json`.
|
||||
- Published and registered node-agent release `0.2.280-fabricsession` with
|
||||
linux binary/native and Docker image artifacts. The release is intentionally
|
||||
not assigned to live node update policies yet because current live node
|
||||
workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
|
||||
mesh endpoints. Before rollout, node configs must be migrated to
|
||||
`quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
|
||||
such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
|
||||
`RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`.
|
||||
- Loadtest degraded-target quarantine is observable through `degraded_targets`.
|
||||
When `-impair-target` and slow-stream migration are enabled, verdict fails if
|
||||
no degraded target is observed or if degraded targets do not produce migration
|
||||
events. A shared-host validation run with 120 streams reported
|
||||
`degraded_targets = { impaired_target: "slow_ack" }`, 5 migration events,
|
||||
`control_ack_p95_ms=3`, and clean acquire/release accounting.
|
||||
- Channel lifecycle accounting is explicit in `fabric-loadtest` through
|
||||
`channel_opens`, `channel_closes`, and `channel_leaks`. Verdict fails on
|
||||
open/close mismatch, active stream leaks, or mismatch between route-pressure
|
||||
acquire counts and QUIC stream opens.
|
||||
- The next validation step is broader real mixed public/NAT/LAN topology across
|
||||
separate physical or VM hosts. The shared Docker host has verified the route
|
||||
model, stress gates, 30-minute stability, memory, goroutine, file descriptor,
|
||||
container resource, and route-pressure accounting. A true external NAT lab
|
||||
should now validate the same gates with independent NAT devices, public nodes,
|
||||
and local NAT-side cluster segments.
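The split between a short transient `target_quarantine_ttl` and a longer `failure_quarantine_ttl`, described in the soak notes above, can be sketched as follows. This is a minimal illustration and not the `fabric-loadtest` implementation: the class `TargetQuarantine`, its method names, the TTL values, and the `connect_failed` reason string are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TargetQuarantine:
    """Tracks when each target may re-enter placement (hypothetical sketch)."""
    target_quarantine_ttl: float   # seconds, transient degradation such as slow ACKs
    failure_quarantine_ttl: float  # seconds, hard failures such as a killed target
    _until: dict = field(default_factory=dict)            # target -> time it is eligible again
    degraded_targets: dict = field(default_factory=dict)  # historical observations, never expired

    def mark(self, target: str, reason: str, now: float, hard_failure: bool = False) -> None:
        ttl = self.failure_quarantine_ttl if hard_failure else self.target_quarantine_ttl
        # Never shorten an existing quarantine: a hard failure outlives a later slow-ACK mark.
        self._until[target] = max(self._until.get(target, 0.0), now + ttl)
        self.degraded_targets[target] = reason

    def eligible(self, targets: list, now: float) -> list:
        """Targets whose quarantine has expired, in their original order."""
        return [t for t in targets if self._until.get(t, 0.0) <= now]


q = TargetQuarantine(target_quarantine_ttl=5.0, failure_quarantine_ttl=120.0)
targets = ["t0", "t1", "t2", "t3"]
q.mark("t0", "connect_failed", now=0.0, hard_failure=True)  # killed target
q.mark("t1", "slow_ack", now=0.0)                           # transient degradation
print(q.eligible(targets, now=1.0))
print(q.eligible(targets, now=10.0))
```

In this sketch the slow target returns to placement once the short TTL expires, while the killed target stays out for the longer failure TTL. Both remain recorded in `degraded_targets` for the report.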
|
||||
|
||||
Initial SLO examples:
- `channel_setup_p95_ms < 200`
- `reroute_p95_ms < 1000`
- `control_latency_p99_ms < 100` under bulk load
- `packet_loss_after_recovery < 0.1%`
- `no_route_pressure_over_90_percent_when_alternatives_exist`
- `no_channel_table_growth_after_churn`
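
The numeric SLO examples above could be checked against a run report with a small gate like the one below. It is a sketch under stated assumptions: the function `evaluate_slos`, the report layout, and the sample values are hypothetical and do not come from a real run. The two non-numeric SLOs are left out.

```python
# Upper bounds taken from the SLO list above; packet loss is a fraction (0.1% = 0.001).
SLO_LIMITS = {
    "channel_setup_p95_ms": 200,
    "reroute_p95_ms": 1000,
    "control_latency_p99_ms": 100,
    "packet_loss_after_recovery": 0.001,
}


def evaluate_slos(report: dict) -> list:
    """Return human-readable violations; an empty list means every SLO holds."""
    violations = []
    for name, limit in SLO_LIMITS.items():
        if name not in report:
            violations.append(f"{name}: missing from report")
        elif not report[name] < limit:
            violations.append(f"{name}: {report[name]} is not < {limit}")
    return violations


# Illustrative values only, not taken from a real run.
sample_report = {
    "channel_setup_p95_ms": 118,
    "reroute_p95_ms": 1,
    "control_latency_p99_ms": 140,
    "packet_loss_after_recovery": 0.0,
}
for line in evaluate_slos(sample_report):
    print(line)
```

Treating a missing metric as a violation keeps a report with absent measurements from passing silently.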
|
||||
@@ -204,6 +204,8 @@ Examples:
|
||||
- `vnc-worker` wraps a future VNC client/runtime.
|
||||
- `vpn-exit` handles exit routing.
|
||||
- `vpn-connector` handles private network reachability.
|
||||
- `vpn-client` runs on an end-user device, including Android, as a normal farm node.
|
||||
- `ipv4-egress` marks a node/service that can send authorized VPN packet traffic to ordinary IPv4 networks.
|
||||
- `video-relay` handles media-optimized paths.
|
||||
|
||||
Rules:
|
||||
@@ -293,6 +295,41 @@ Responsibilities:
|
||||
- applies route, DNS, and egress restrictions
|
||||
- reports traffic and health telemetry
|
||||
|
||||
### `ipv4-egress`
|
||||
|
||||
Fabric-only IPv4 exit service. It is assigned to nodes that may forward authorized VPN packet channels from the mesh to ordinary IPv4 networks.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- accepts VPN packet channels only through the fabric service channel
|
||||
- advertises exit pool membership, region, route policy, and health
|
||||
- enforces user, organization, cluster, and owner visibility policy before accepting traffic
|
||||
- participates in latency-aware and load-aware exit selection
|
||||
- supports failover between nodes in the same exit pool without changing the Android client protocol
|
||||
- does not expose legacy VPN protocols as the steady-state data plane
|
||||
|
||||
### `vpn-client`
|
||||
|
||||
Client-side VPN node role. On Android, the installed application is a node-agent/runtime with this role; the VPN client service is then started locally and joins the farm like any other node.
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- joins the mesh using the current QUIC fabric transport
|
||||
- requests the list of visible IPv4 exit pools and nodes according to the current user's access level
|
||||
- creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
|
||||
- switches to another authorized exit when the selected exit fails or becomes slow
|
||||
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
|
||||
- exposes its local IPv4 ingress as service configuration: on Android this is the
|
||||
`VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
|
||||
listen ports that are mapped into VPN packet channels.
|
||||
|
||||
Rules:
|
||||
|
||||
- A VPN client does not use a dedicated entry node. It is itself a mesh node.
|
||||
- The farm builds the route from the client node to an authorized exit pool.
|
||||
- Exits are addressed as pools. A pool may contain one node, but that is a degraded redundancy posture and should be visible as a risk.
|
||||
- The control plane may issue policy and signed route authority, but it must not become the packet entry point for the VPN client.
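
The latency-aware and load-aware exit selection with failover described above can be sketched like this. It is only an illustration: `ExitNode`, `pick_exit`, `pool_risks`, the scoring formula, and the thresholds are assumptions, not the fabric's actual selection logic.

```python
from dataclasses import dataclass


@dataclass
class ExitNode:
    node_id: str
    latency_ms: float
    load: float          # 0.0 (idle) .. 1.0 (saturated)
    healthy: bool = True


def pick_exit(pool: list, max_latency_ms: float = 250.0, load_weight_ms: float = 100.0):
    """Pick the healthy exit with the lowest latency-plus-load score, or None."""
    candidates = [n for n in pool if n.healthy and n.latency_ms <= max_latency_ms]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.latency_ms + load_weight_ms * n.load)


def pool_risks(pool: list) -> list:
    """Report degraded redundancy when fewer than two healthy exits remain."""
    healthy = [n for n in pool if n.healthy]
    return ["degraded_redundancy"] if len(healthy) < 2 else []


pool = [ExitNode("exit-a", 20.0, 0.9), ExitNode("exit-b", 35.0, 0.1)]
first = pick_exit(pool)      # exit-b: lower combined score despite higher latency
first.healthy = False        # simulate the selected exit failing
second = pick_exit(pool)     # the client switches to another exit in the same pool
print(first.node_id, second.node_id, pool_risks(pool))
```

Because the client addresses the pool and not a single node, failover here is just a second call to `pick_exit`, with no change to the client protocol.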
|
||||
|
||||
### `vpn-connector`
|
||||
|
||||
Connector to private networks.
|
||||
|
||||
@@ -1,13 +1,13 @@
|
||||
# Web Ingress and Admin UI Model
|
||||
|
||||
Status: target architecture clarification. Documentation only.
|
||||
Status: target architecture and implementation contract.
|
||||
|
||||
This document defines how HTTP/HTTPS web entry, Admin UI, dynamic page
|
||||
composition, and cluster configuration responsibilities are separated in the
|
||||
Secure Access Fabric.
|
||||
|
||||
It does not implement code, APIs, UI pages, mesh runtime, VPN runtime, or RDP
|
||||
changes.
|
||||
The fabric node-to-node transport remains QUIC-only. HTTP/HTTPS is allowed only
|
||||
as an external client-facing service edge.
|
||||
|
||||
## Purpose
|
||||
|
||||
@@ -16,33 +16,41 @@ The platform needs a clear distinction between:
|
||||
- Web Service as the HTTP/HTTPS entry layer
|
||||
- Control Plane as the owner of cluster configuration and policy
|
||||
- Admin UI as a safe, scoped user interface over Control Plane APIs
|
||||
- Fabric Transport as the internal QUIC-only node-to-node substrate
|
||||
|
||||
The Web layer must never become the owner of cluster state, policy, topology,
|
||||
secrets, node identity, or routing authority.
|
||||
|
||||
## Layer Ownership
|
||||
|
||||
### Web Service / Web Ingress
|
||||
### Public HTTPS Ingress
|
||||
|
||||
Web Service is an edge service.
|
||||
Public HTTPS Ingress is an edge service. It may run on a public Internet node,
|
||||
including a small/slow node intended only to accept browser traffic and pass it
|
||||
into the fabric.
|
||||
|
||||
Suggested role names:
|
||||
Role names:
|
||||
|
||||
- `web-ingress`
|
||||
- `admin-web-entry`
|
||||
- `admin-web-shell`
|
||||
- `public-ingress`
|
||||
- `admin-ingress`
|
||||
|
||||
Responsibilities:
|
||||
|
||||
- accept HTTP/HTTPS
|
||||
- listen on TCP `80` only for ACME challenges, health checks, and HTTPS
|
||||
redirects
|
||||
- listen on TCP `443` for browser/API HTTPS
|
||||
- terminate TLS or sit behind the approved TLS terminator
|
||||
- serve Admin UI shell/static assets
|
||||
- proxy browser/API traffic to Control API
|
||||
- serve only approved static UI shells and safe public metadata
|
||||
- validate SNI/Host, request size, rate limits, and edge policy
|
||||
- map the request to an allowed platform, cluster, organization, or user portal
|
||||
scope
|
||||
- forward accepted traffic into the fabric through an authorized fabric service
|
||||
channel
|
||||
- apply edge controls such as headers, rate limits, request size limits, and
|
||||
future WAF rules
|
||||
- expose only approved public/admin endpoints
|
||||
|
||||
Web Service must not:
|
||||
Public HTTPS Ingress must not:
|
||||
|
||||
- own cluster configuration
|
||||
- directly mutate PostgreSQL
|
||||
@@ -51,6 +59,39 @@ Web Service must not:
|
||||
- store node identity or certificates as source of truth
|
||||
- expose internal mesh topology to browser clients
|
||||
- execute cluster decisions locally
|
||||
- hold platform/global admin authority keys
|
||||
- infer authorization from the fact that it accepted TCP `443`
|
||||
- become a general relay for arbitrary HTTP inside the fabric
|
||||
|
||||
The node that accepts HTTPS is not the node that automatically owns or executes
|
||||
admin logic. It is only a service edge.
|
||||
|
||||
### Fabric Transport
|
||||
|
||||
Fabric Transport is the internal node-to-node layer.
|
||||
|
||||
Rules:
|
||||
|
||||
- node-to-node traffic uses QUIC only
|
||||
- no HTTP fallback between fabric nodes
|
||||
- STUN/ICE/rendezvous/relay are fabric transport mechanisms, not browser/API
|
||||
protocols
|
||||
- any service traffic accepted on `443` is converted into a scoped fabric
|
||||
service channel before it crosses the mesh
|
||||
- direct links, relay links, and route-health observations must remain separate
|
||||
in diagnostics
|
||||
- a fabric route proves reachability, not administrative authority
|
||||
|
||||
If a public ingress receives a request for an admin surface, the request flow is:
|
||||
|
||||
```text
Browser HTTPS
-> public/admin ingress on 443
-> tenant/cluster/platform scope selection
-> signed fabric service channel over QUIC
-> authorized admin/runtime service node
-> Control Plane authorization and policy
```
|
||||
|
||||
### Control Plane
|
||||
|
||||
@@ -77,9 +118,23 @@ only.
|
||||
Cluster configuration is changed only through Control Plane services and APIs.
|
||||
The Web layer is a presentation and ingress layer over those APIs.
|
||||
|
||||
### Admin UI
|
||||
### Admin UI Runtime
|
||||
|
||||
Admin UI is a client application served through Web Ingress.
|
||||
Admin UI Runtime is the service that serves and executes the admin surface. It
|
||||
may run on any node explicitly assigned the matching runtime role.
|
||||
|
||||
Role names:
|
||||
|
||||
- `global-admin-runtime`
|
||||
- `cluster-admin-runtime`
|
||||
- `organization-portal-runtime`
|
||||
- `user-portal-runtime`
|
||||
- `identity-runtime`
|
||||
- `policy-authority`
|
||||
- `audit-sink`
|
||||
|
||||
Admin UI is a client application served through Public HTTPS Ingress or Admin UI
|
||||
Runtime according to deployment policy.
|
||||
|
||||
It renders safe Control Plane projections and submits user actions to Control
|
||||
Plane APIs.
|
||||
@@ -95,7 +150,7 @@ Admin UI must not:
|
||||
viewer
|
||||
- contain executable cluster logic
|
||||
|
||||
## Admin Endpoint Placement
|
||||
## Admin Endpoint Placement And Trust
|
||||
|
||||
Admin UI endpoint placement is explicit and must not be inferred from storage.
|
||||
|
||||
@@ -110,6 +165,8 @@ Scopes:
|
||||
- Organization Admin Panel: tenant-safe projection for one organization. It
|
||||
must expose only allowed resources, service endpoints, sessions, policies,
|
||||
and safe status.
|
||||
- User Portal: personal/account scope. It must expose only the authenticated
|
||||
user's resources, sessions, devices, and profile actions.
|
||||
|
||||
Rules:
|
||||
|
||||
@@ -118,19 +175,29 @@ Rules:
|
||||
- Storage nodes distribute/cache scoped configuration and snapshots only.
|
||||
- Admin/web ingress is a separate service role and requires explicit Control
|
||||
Plane assignment.
|
||||
- Public Internet ingress is not enough to run a global panel.
|
||||
- `global-admin-runtime`, `policy-authority`, and `audit-sink` may run only on
|
||||
platform-owner trusted nodes.
|
||||
- `cluster-admin-runtime` may run only on nodes authorized for that cluster.
|
||||
- `organization-portal-runtime` and `user-portal-runtime` may run on broader
|
||||
infrastructure, but they receive only scoped projections.
|
||||
- Cluster-local admin endpoints require valid TLS/cert policy, signed scoped
|
||||
snapshots, current node health, and sufficient role coverage.
|
||||
- Platform Owner Console remains the owner-level view even when cluster-local
|
||||
admin endpoints exist.
|
||||
- Organization Admin Panel must never expose intermediate mesh topology,
|
||||
storage shards, peer caches, route caches, or unrelated cluster data.
|
||||
- A request entering through an organization-bound ingress must be rejected if it
|
||||
asks for another organization, another cluster outside its contract, global
|
||||
topology, or platform-owner data.
|
||||
|
||||
## Request Flow
|
||||
|
||||
```text
|
||||
Admin Browser
|
||||
-> Web Ingress / Admin Web Shell
|
||||
-> Control API
|
||||
-> Public/Admin HTTPS Ingress
|
||||
-> Fabric Service Channel over QUIC
|
||||
-> Admin UI Runtime / Control API
|
||||
-> PostgreSQL source of truth
|
||||
-> signed scoped snapshots / config distribution
|
||||
-> rap-node-agent
|
||||
@@ -266,6 +333,18 @@ Organization admin must not see:
|
||||
- secrets
|
||||
- unrelated cluster internals
|
||||
|
||||
Ingress-bound projections:
|
||||
|
||||
- A platform-owner ingress may expose platform navigation only after platform
|
||||
authorization, MFA/step-up, and policy checks.
|
||||
- A cluster-bound ingress may expose only that cluster's admin surface and
|
||||
cluster-scoped safe diagnostics.
|
||||
- An organization-bound ingress may expose only the organization projection and
|
||||
organization-safe service endpoints.
|
||||
- A user portal ingress may expose only the user's personal/account projection.
|
||||
- Host/SNI alone is not authorization; it only selects the maximum possible
|
||||
projection before server-side authorization narrows it further.
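
The last rule, that Host/SNI only sets a ceiling which server-side authorization then narrows, can be illustrated with a small sketch. The linear scope ordering and the function names are simplifying assumptions. A real check would also have to match the concrete organization, cluster, or user identity, which this example omits.

```python
# Scopes ordered from narrowest to widest (assumed ordering for illustration).
SCOPE_ORDER = ["user", "organization", "cluster", "platform"]


def effective_scope(ingress_scope: str, authorized_scope: str) -> str:
    """Return the narrower of the ingress ceiling and the authorized scope."""
    return min(ingress_scope, authorized_scope, key=SCOPE_ORDER.index)


def allow_request(ingress_scope: str, authorized_scope: str, requested_scope: str) -> bool:
    """Allow a request only if it asks for no more than the effective scope."""
    ceiling = effective_scope(ingress_scope, authorized_scope)
    return SCOPE_ORDER.index(requested_scope) <= SCOPE_ORDER.index(ceiling)


# A platform owner entering through an organization-bound ingress is still
# limited to the organization projection.
print(effective_scope("organization", "platform"))
print(allow_request("organization", "platform", "cluster"))
```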
|
||||
|
||||
## Service Adapter UI Extensions
|
||||
|
||||
Service adapters may need configuration UI.
|
||||
@@ -361,22 +440,258 @@ High-risk actions include:
|
||||
|
||||
## Deployment Model
|
||||
|
||||
### Current Test Entry
|
||||
|
||||
The current shared Docker test stand exposes the Platform Owner Control Panel at
|
||||
`http://docker-test.cin.su:18080/` (`http://192.168.200.61:18080/`). This is a
|
||||
temporary lab HTTP edge served by `rap_web_admin` from
|
||||
`/tmp/rap-web-admin/html` on `test-docker`.
|
||||
|
||||
This entry is not the production authority model. It is allowed only for the
|
||||
shared test stand while the HTTPS admin-ingress runtime is being completed. The
|
||||
target production entry is:
|
||||
|
||||
```text
Browser HTTPS on 443
-> node with explicit admin-ingress/public-ingress role
-> signed web-ingress envelope
-> QUIC fabric service channel
-> authorized admin/portal runtime node
-> Control API projection/authorization
```
|
||||
|
||||
The browser-facing ingress may be a small public node, but it must not become
|
||||
the management authority. Platform/global admin runtime remains limited to
|
||||
platform-owner trusted nodes. Cluster, organization, and user panels receive
|
||||
only their scoped projections.
|
||||
|
||||
The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
|
||||
zones` is retired for the transport-layer view. The Fabric panel must show
|
||||
actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
|
||||
state, relay/route-health annotations, and web-ingress runtime readiness. It
|
||||
must not render old entry/egress zone columns as if they were transport
|
||||
topology.
|
||||
|
||||
Possible deployment modes:
|
||||
|
||||
- Web Ingress and Control API in the same deployment for small/test installs
|
||||
- Public/Admin HTTPS Ingress and Control API in the same deployment for
|
||||
small/test installs
|
||||
- Web Ingress separated from Control API for production
|
||||
- multiple Web Ingress nodes for regional/admin access
|
||||
- Web Ingress behind Caddy/Nginx/enterprise ingress
|
||||
- Admin UI shell served from Web Ingress while APIs remain on Control API
|
||||
- Internet ingress on a low-capacity node that forwards scoped channels to a
|
||||
trusted admin runtime elsewhere in the fabric
|
||||
- global admin runtime only on platform-owner controlled nodes
|
||||
- cluster admin runtime on cluster-authorized nodes
|
||||
- organization/user portal runtime on tenant-safe nodes with scoped data
|
||||
|
||||
Even when deployed together, ownership remains separate:
|
||||
|
||||
- Web Ingress is entry/presentation
|
||||
- Public/Admin HTTPS Ingress is entry/presentation
|
||||
- Fabric Transport is QUIC-only service-channel delivery
|
||||
- Control API is authorization/domain logic
|
||||
- PostgreSQL is source of truth
|
||||
- Fabric Storage/Config Storage is scoped distribution/cache
|
||||
- node-agent consumes scoped desired state
|
||||
|
||||
## Required Roles
|
||||
|
||||
The platform recognizes these web/admin placement roles:
|
||||
|
||||
| Role | Scope | Purpose |
| --- | --- | --- |
| `public-ingress` | cluster or organization | Listen on 80/443, terminate/validate HTTPS, forward scoped service channels. |
| `admin-ingress` | platform or cluster | HTTPS edge for admin surfaces. It does not own authority. |
| `global-admin-runtime` | platform trusted nodes only | Platform-owner console/runtime. |
| `cluster-admin-runtime` | cluster | Cluster admin console/runtime for one cluster. |
| `organization-portal-runtime` | organization | Tenant-safe organization administration. |
| `user-portal-runtime` | user/organization | Personal account/resource portal. |
| `identity-runtime` | platform/cluster | Authentication, session, MFA, step-up and token issuance. |
| `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
| `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |
|
||||
|
||||
Legacy `entry-node` remains a generic client ingress/service edge role for
|
||||
non-admin product services. It must not imply admin authority.
|
||||
|
||||
## Fabric Service Classes
|
||||
|
||||
Admin and portal traffic uses explicit fabric service classes. This prevents
|
||||
admin traffic from being disguised as VPN/RDP/file/video traffic and gives the
|
||||
routing layer clear QoS, role, and audit semantics.
|
||||
|
||||
| Service class | Required runtime roles | Projection |
| --- | --- | --- |
| `platform_admin` | `admin-ingress`, `global-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | Platform-owner console. |
| `cluster_admin` | `admin-ingress`, `cluster-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One cluster. |
| `organization_portal` | `public-ingress`, `organization-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One organization. |
| `user_portal` | `public-ingress`, `user-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One authenticated user/account scope. |
|
||||
|
||||
Default channels for these classes are `control`, `interactive`, and
|
||||
`reliable`. They are latency-sensitive control-plane/service traffic, not bulk
|
||||
data transfer.
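
The role requirements in the service-class table, combined with the rule that `global-admin-runtime`, `policy-authority`, and `audit-sink` count only on platform-owner trusted nodes, could be checked with a coverage function like this sketch. The node representation (`roles`, `platform_trusted`) and the function `missing_roles` are hypothetical.

```python
REQUIRED_ROLES = {
    "platform_admin": {"admin-ingress", "global-admin-runtime",
                       "identity-runtime", "policy-authority", "audit-sink"},
    "cluster_admin": {"admin-ingress", "cluster-admin-runtime",
                      "identity-runtime", "policy-authority", "audit-sink"},
    "organization_portal": {"public-ingress", "organization-portal-runtime",
                            "identity-runtime", "policy-authority", "audit-sink"},
    "user_portal": {"public-ingress", "user-portal-runtime",
                    "identity-runtime", "policy-authority", "audit-sink"},
}
# Roles that may run only on platform-owner trusted nodes.
PLATFORM_TRUSTED_ONLY = {"global-admin-runtime", "policy-authority", "audit-sink"}


def missing_roles(service_class: str, nodes: list) -> list:
    """Return required roles that no suitably trusted node currently provides."""
    covered = set()
    for node in nodes:
        for role in node["roles"]:
            if role in PLATFORM_TRUSTED_ONLY and not node["platform_trusted"]:
                continue  # a restricted role on an untrusted node does not count
            covered.add(role)
    return sorted(REQUIRED_ROLES[service_class] - covered)


nodes = [
    {"roles": {"admin-ingress"}, "platform_trusted": False},
    {"roles": {"global-admin-runtime", "identity-runtime", "policy-authority"},
     "platform_trusted": True},
    {"roles": {"audit-sink"}, "platform_trusted": False},  # misplaced restricted role
]
print(missing_roles("platform_admin", nodes))
```

In this example the ingress node is allowed to be untrusted, but the `audit-sink` placed on an untrusted node is ignored, so the class is reported as incomplete.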
|
||||
|
||||
## Desired Workload Contract
|
||||
|
||||
Ingress nodes are configured through normal node desired workloads. The first
|
||||
runtime stage is a contract probe: node-agent validates the policy and reports a
|
||||
workload status, but it does not open `80`/`443` until the real ingress runtime
|
||||
stage is enabled.
|
||||
|
||||
Example platform/cluster admin ingress workload:
|
||||
|
||||
```json
{
  "service_type": "admin-ingress",
  "desired_state": "enabled",
  "runtime_mode": "native",
  "config": {
    "listen_http_port": 80,
    "listen_https_port": 443,
    "tls_mode": "terminate",
    "scope": "platform",
    "service_classes": ["platform_admin", "cluster_admin"]
  }
}
```
|
||||
|
||||
Example organization/user public ingress workload:
|
||||
|
||||
```json
{
  "service_type": "public-ingress",
  "desired_state": "enabled",
  "runtime_mode": "native",
  "config": {
    "listen_http_port": 80,
    "listen_https_port": 443,
    "tls_mode": "terminate",
    "scope": "organization",
    "service_classes": ["organization_portal", "user_portal"]
  }
}
```
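
A contract probe over workload objects like the two above could be sketched as follows. This is not the node-agent code. The mapping from `service_type` to allowed service classes is inferred from the service-class table, and the `ready` status value, the reason strings, and the function `probe_status` are assumptions. The fixed flags mirror the contract-probe status requirements in this section.

```python
# Inferred from the service-class table: which classes each ingress type may carry.
ALLOWED_CLASSES = {
    "admin-ingress": {"platform_admin", "cluster_admin"},
    "public-ingress": {"organization_portal", "user_portal"},
}


def probe_status(workload: dict) -> dict:
    """Validate an ingress workload without opening any port."""
    cfg = workload.get("config", {})
    classes = cfg.get("service_classes") or []
    reasons = []
    allowed = ALLOWED_CLASSES.get(workload.get("service_type"))
    if allowed is None:
        reasons.append("unknown_service_type")        # hypothetical reason string
    elif not classes or not set(classes) <= allowed:
        reasons.append("invalid_service_class")       # hypothetical reason string
    if cfg.get("listen_http_port") != 80 or cfg.get("listen_https_port") != 443:
        reasons.append("invalid_port")                # hypothetical reason string
    return {
        "status": "degraded" if reasons else "ready",
        "reasons": reasons,
        "fabric_transport": "quic_only",
        "http_between_fabric_nodes": False,
        "authority_service": False,
        "fabric_service_channel_required": True,
        "ports_opened_by_stub": False,
    }


workload = {
    "service_type": "public-ingress",
    "config": {"listen_http_port": 80, "listen_https_port": 8443,
               "service_classes": ["organization_portal", "platform_admin"]},
}
print(probe_status(workload)["status"], probe_status(workload)["reasons"])
```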
|
||||
|
||||
Contract-probe status requirements:
|
||||
|
||||
- `fabric_transport` is `quic_only`
|
||||
- `http_between_fabric_nodes` is `false`
|
||||
- `authority_service` is `false`
|
||||
- `fabric_service_channel_required` is `true`
|
||||
- `ports_opened_by_stub` is `false`
|
||||
- invalid service classes or non-80/443 ports report `degraded`
|
||||
- real listener startup requires both workload config
|
||||
`real_listener_enabled=true` and node-agent process gate
|
||||
`RAP_WEB_INGRESS_RUNTIME_ENABLED=true`
|
||||
- without the process gate, a real-listener request reports
|
||||
`web_ingress_real_listener_gate_disabled`
|
||||
- the first handler stage returns schema
|
||||
`rap.web_ingress.runtime_response.v1`; it redirects HTTP to HTTPS, exposes
|
||||
health, validates service class/scope, and blocks payload forwarding with
|
||||
`fabric_service_channel_binding_not_implemented` until the QUIC service
|
||||
channel binding is implemented
|
||||
- node-agent owns a web-ingress listener lifecycle manager. When the real
|
||||
listener gate is enabled, it starts the HTTP redirect listener and starts
|
||||
HTTPS only when `tls_cert_file` and `tls_key_file` are present in workload
|
||||
config. Without TLS files the listener status is `partial` and service
|
||||
payload remains blocked.
|
||||
- HTTPS handler has a `FabricBinder` boundary. Valid requests become
|
||||
`rap.web_ingress.fabric_request.v1` records with method, path, query, host,
|
||||
derived scope, service class, safe headers, bounded body, and observed
|
||||
timestamp. Runtime derives fabric scope from service class
|
||||
(`platform_admin` -> `platform`, `cluster_admin` -> `cluster`,
|
||||
`organization_portal` -> `organization`, `user_portal` -> `user`) before
|
||||
signing/forwarding the request.
|
||||
Dangerous browser headers such as `Authorization`, `Cookie`, `Set-Cookie`,
|
||||
and service-channel tokens are not forwarded as ordinary proxy headers.
|
||||
The binder must convert the request into a signed/scoped fabric service
|
||||
channel envelope; if no binder is present, ingress returns
|
||||
`fabric_service_channel_binding_not_implemented`.
|
||||
- The first concrete binder emits
|
||||
`rap.web_ingress.fabric_service_channel_envelope.v1`. The envelope contains
|
||||
the safe request projection, base64-encoded body, scope, service class,
|
||||
observed timestamp, and envelope timestamp. It is serialized as canonical JSON
|
||||
for signing, then passed to an `EnvelopeSigner` and `EnvelopeSender`.
|
||||
`EnvelopeSigner` owns node/service-channel signature policy. `EnvelopeSender`
|
||||
owns delivery into the QUIC fabric service channel and route selection. This
|
||||
keeps HTTP edge handling separated from mesh internals while making the
|
||||
security boundary explicit and testable.
|
||||
- The initial signer implementation is Ed25519 over the canonical envelope
|
||||
bytes. The signer can derive `key_id` from the public key fingerprint or use
|
||||
an explicitly configured key id. Production deployment must bind this key to
|
||||
the node identity/service-channel authority policy before enabling real
|
||||
browser traffic.
|
||||
- The initial mesh sender adapter can submit the signed envelope through the
|
||||
existing reliable fabric channel runtime using `control` traffic class and a
|
||||
configured route set to an admin/portal runtime node or pool. At this stage it
|
||||
returns a delivery-accepted response with route/channel metrics. Full
|
||||
request/response admin API streaming remains a later runtime step and must
|
||||
stay on the same QUIC fabric channel model.
|
||||
- The fabric channel runtime now also has a request/response path for web
|
||||
ingress: it opens a QUIC stream, sends the signed envelope as `FrameData`, and
|
||||
waits for a `FrameData` response on the same stream and sequence. Route
|
||||
failures or response timeouts use the same latency-aware reroute path as
|
||||
reliable delivery. Runtime HTTP responses use
|
||||
`rap.web_ingress.fabric_runtime_response.v1` with status code, safe headers,
|
||||
and body/body_b64. If a runtime response is not in that schema, ingress
|
||||
reports delivery-accepted metrics instead of treating arbitrary payload as an
|
||||
HTTP response.
|
||||
- QUIC fabric server reserves `WebIngressForwardQUICStreamID` for web ingress
|
||||
request/response forwarding. The server invokes a web-ingress forward handler
|
||||
with the signed envelope payload and returns a wrapper containing either
|
||||
runtime payload or an error on the same stream/sequence.
|
||||
- Admin/portal runtime nodes have a signed-envelope receiver contract. The
|
||||
receiver verifies `rap.web_ingress.signed_fabric_service_channel_envelope.v1`,
|
||||
Ed25519 signature, trusted key id, scope, service class, and timestamp skew
|
||||
before calling the local runtime handler. The local handler returns
|
||||
`rap.web_ingress.fabric_runtime_response.v1`; unsafe response headers are
|
||||
filtered before the payload is returned to the ingress edge.
|
||||
- Node-agent exposes explicit runtime key policy inputs while the final signed
|
||||
config-snapshot distribution is being wired:
|
||||
`RAP_WEB_INGRESS_SIGNING_PRIVATE_KEY`,
|
||||
`RAP_WEB_INGRESS_SIGNING_KEY_ID`, and
|
||||
`RAP_WEB_INGRESS_TRUSTED_KEYS_JSON`. Trusted keys JSON may be either
|
||||
`{"key_id":"public_key_b64"}` or an array of
|
||||
`{"key_id":"...","public_key":"..."}` objects. Without trusted keys the
|
||||
web-ingress receiver handler is not installed. Runtime receiver placement can
|
||||
be narrowed with `RAP_WEB_INGRESS_RUNTIME_SERVICE_CLASSES`, a comma-separated
|
||||
allow-list of `platform_admin`, `cluster_admin`, `organization_portal`, and
|
||||
`user_portal`; this is a temporary explicit node-local policy until signed
|
||||
role snapshots drive receiver placement.
|
||||
- Heartbeat metadata includes `web_ingress_runtime_receiver_report` when QUIC
|
||||
fabric or web-ingress key policy is configured. The report exposes the
|
||||
signed-envelope schema, QUIC stream id, trusted key count, receiver
|
||||
service-class allow-list, handler installation state, status/reason
|
||||
(`ready`, `degraded`, or `blocked`), and QUIC endpoint readiness so the
|
||||
fabric panel can show whether a node can currently receive admin/portal
runtime traffic and, if not, why.
|
||||
- QUIC listener/reverse-transport handler configuration is sensitive to the
|
||||
web-ingress trusted key policy and runtime service-class allow-list. If either
|
||||
policy changes, node-agent restarts or refreshes the QUIC fabric handler
|
||||
binding so stale key trust or stale receiver placement is not kept in memory.
|
||||
- The first local admin runtime dispatcher is intentionally read-only. It
|
||||
handles `/healthz`, `/readyz`, and `*/ui-manifest` requests after signed
|
||||
envelope verification. It returns `rap.web_ingress.admin_runtime_response.v1`
|
||||
with a safe `rap.web_ingress.ui_manifest.v1` projection that lists sections
|
||||
and read-only actions for the requested service class. It rejects invalid
|
||||
`scope`/`service_class` pairs before using either the local fallback or the
|
||||
Control API projection client. Mutations return
|
||||
`control_api_mutation_binding_not_implemented`; unknown read projections
|
||||
return `control_api_projection_binding_not_implemented` until the dispatcher
|
||||
is wired to the real Control API authorization/projection layer.
|
||||
- The dispatcher now has a `ControlAPIProjectionClient` boundary. When bound,
|
||||
read-only GET/HEAD requests are sent to the Control API projection endpoint
|
||||
and returned as `rap.web_ingress.control_api_projection_response.v1`.
|
||||
Backend exposes the first read-only projection endpoint at
|
||||
`/api/v1/clusters/{cluster_id}/nodes/{node_id}/admin-runtime/projection`.
|
||||
It returns safe manifest/projection payloads, marks audit as required, and
|
||||
rejects mutation methods and invalid `scope`/`service_class` combinations.
|
||||
Requests must use schema
|
||||
`rap.web_ingress.control_api_projection_request.v1`; agent accepts responses
|
||||
only with schema `rap.web_ingress.control_api_projection_response.v1`.
|
||||
This is the first Control API binding slice; it is not yet a full
|
||||
authorization/session/audit implementation.
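
The binder steps described above (derive scope from service class, drop dangerous headers, base64-encode the body, and serialize canonically before signing) can be sketched as follows. The envelope field names, the deny-list, and the canonical form (sorted keys, compact separators, UTF-8) are assumptions made for illustration; only the schema name and the class-to-scope mapping come from the text. Signing and sending are left to the `EnvelopeSigner` and `EnvelopeSender` boundaries and are not shown.

```python
import base64
import json

SCOPE_BY_CLASS = {
    "platform_admin": "platform",
    "cluster_admin": "cluster",
    "organization_portal": "organization",
    "user_portal": "user",
}
# Lower-cased headers that must not travel as ordinary proxy headers.
# A production binder would more likely use an allow-list of safe headers.
BLOCKED_HEADERS = {"authorization", "cookie", "set-cookie"}


def build_envelope(method: str, path: str, query: str, host: str, service_class: str,
                   headers: dict, body: bytes, observed_at: str, envelope_at: str) -> bytes:
    """Return canonical envelope bytes ready to be handed to a signer."""
    scope = SCOPE_BY_CLASS[service_class]  # raises KeyError for unknown classes
    safe_headers = {k.lower(): v for k, v in headers.items()
                    if k.lower() not in BLOCKED_HEADERS}
    envelope = {
        "schema": "rap.web_ingress.fabric_service_channel_envelope.v1",
        "request": {"method": method, "path": path, "query": query,
                    "host": host, "headers": safe_headers},
        "body_b64": base64.b64encode(body).decode("ascii"),
        "scope": scope,
        "service_class": service_class,
        "observed_at": observed_at,
        "envelope_at": envelope_at,
    }
    # Assumed canonical form: sorted keys, no insignificant whitespace, UTF-8.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


raw = build_envelope("GET", "/healthz", "", "admin.example.test", "platform_admin",
                     {"Accept": "application/json", "Cookie": "sid=1"}, b"",
                     "2026-05-16T00:00:00Z", "2026-05-16T00:00:01Z")
print(json.loads(raw)["request"]["headers"])
```

Sorting keys makes the serialized bytes independent of dictionary insertion order, which is what lets a receiver re-serialize the envelope and verify the signature over the same bytes.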
|
||||
|
||||
## Future Stages
|
||||
|
||||
Suggested staged work:
|
||||
@@ -417,8 +732,9 @@ This document does not authorize:
|
||||
## Result / Decision
|
||||
|
||||
WEB is an ingress and presentation layer, not a cluster configuration owner.
|
||||
Cluster configuration belongs to the Control Plane and is persisted in
|
||||
PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
|
||||
Fabric remains QUIC-only internally; HTTP/HTTPS exists only at the external
|
||||
client edge. Cluster configuration belongs to the Control Plane and is persisted
|
||||
in PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
|
||||
schema-driven projections over Control Plane APIs. They must not embed secrets,
|
||||
internal topology, peer caches, route caches, or arbitrary executable code.
|
||||
|
||||
|
||||