Refactor RDP proxy handling and update related tests
# Fabric-First Transport And Stress Plan

Status: fabric-first implementation baseline is active. QUIC-only transport,
route planning, runtime reroute/failover, pressure accounting, shared-host
stress gates, 1000-channel load, failure/degradation gates, and a 30-minute
real-byte soak are implemented and verified. Remaining work is wider real
topology coverage as the cluster grows.

This project is now fabric-first. Work on service payloads, service adapter
expansion, and Android VPN transport is paused until the fabric transport layer
is complete and proven under real load.

## Goal

The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes
may have public addresses, sit behind NAT, or represent a whole local segment
behind one NAT. The fabric must expose a single logical transport layer where
nodes can reach each other directly, through local segment paths, through
passive outbound tunnels, or through relay hops without changing the data-plane
protocol.

QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and
ICE-selected paths are route modes inside the same QUIC fabric, not alternative
transports.

The fabric must not depend on one management service for authority. API,
storage, update-cache, route-coordinator, observer, and authority duties are
roles inside the mesh. A reachable API endpoint can distribute signed state, but
it cannot be the source of truth by itself. Nodes accept control data,
configuration, route leases, update plans, and role changes only when the
signatures, quorum rules, scopes, epochs, and expiry windows verify locally.

## Required Fabric Behavior

- Address channels by `node_id`, `pool_id`, or service target, not by raw IP.
- Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC,
  relay QUIC, and future ICE-derived QUIC paths.
- Treat DNS names such as web/admin/API domains as service endpoints only, not
  node identity or fabric authority.
- Require node-published endpoint candidates to include explicit `host:port`,
  reachability, connectivity mode, NAT/local-segment metadata, source, and
  freshness.
- Prefer local segment paths for nodes that share a NAT/local network.
- Keep outbound passive QUIC control/data adjacencies from NATed nodes to
  reachable public or relay nodes.
- Build logical channels over shared QUIC adjacencies instead of opening one
  physical QUIC connection per channel.
- Maintain primary, warm standby, and fallback route sets per channel.
- Rebuild a channel when an intermediate hop fails.
- Switch to another pool member when the target is a pool and the current
  endpoint fails.
- Reroute slow channels when a faster path exists and the reroute will not harm
  aggregate fabric throughput.
- Spread channels across available routes so the shortest path is not saturated
  while other nodes are idle.
- Isolate channels with per-channel flow control, traffic classes, backpressure,
  quotas, and fairness scheduling.
- Report per-node, per-link, per-route, and per-channel load and failure causes.

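The load-spreading requirement in the list above can be illustrated with a small Go sketch. It is not the fabric's scheduler: the `Route` type, its fields, and the cost formula (RTT divided by the free capacity fraction) are assumptions chosen only to show how a short but nearly full path can lose to a longer idle one.

```go
package main

import "fmt"

// Route is a hypothetical view of one candidate path; the field names are
// illustrative and not taken from the fabric code.
type Route struct {
	ID       string
	RTTMs    float64 // measured round-trip time in milliseconds
	Active   int     // logical channels currently bound to the route
	Capacity int     // assumed soft limit on concurrent channels
	Healthy  bool
}

// PickRoute returns the healthy route with the lowest load-adjusted cost.
// Cost is RTT divided by the remaining capacity fraction, so a short path
// becomes less attractive as it fills up. The formula is an assumption.
func PickRoute(routes []Route) (Route, bool) {
	var best Route
	bestCost := 0.0
	found := false
	for _, r := range routes {
		if !r.Healthy || r.Capacity <= 0 || r.Active >= r.Capacity {
			continue // skip failed or saturated routes
		}
		free := 1 - float64(r.Active)/float64(r.Capacity)
		cost := r.RTTMs / free
		if !found || cost < bestCost {
			best, bestCost, found = r, cost, true
		}
	}
	return best, found
}

func main() {
	routes := []Route{
		{ID: "lan", RTTMs: 2, Active: 95, Capacity: 100, Healthy: true},
		{ID: "direct", RTTMs: 20, Active: 10, Capacity: 100, Healthy: true},
		{ID: "relay", RTTMs: 45, Active: 0, Capacity: 100, Healthy: true},
	}
	if r, ok := PickRoute(routes); ok {
		fmt.Println("selected route:", r.ID)
	}
}
```

In this example the LAN route has the lowest RTT but is almost full, so the lightly loaded direct route is selected instead.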
## Service Channel Boundary

The fabric is the only component that builds and maintains transport channels.
VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications
above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a
transport concern, open fallback transports, or implement route repair.

Every service starts by submitting a fabric service channel request:

```json
{
  "schema_version": "rap.fabric_service_channel_request.v1",
  "channel_id": "vpn-session-or-service-session-id",
  "source_role": "vpn-client | rdp-client | service-adapter",
  "service_class": "vpn_packets | rdp | ssh | file_transfer | web",
  "target": {
    "kind": "pool",
    "pool_ids": ["home-ipv4"],
    "service_role": "ipv4-egress"
  },
  "traffic": {
    "mode": "duplex",
    "application_protocol_agnostic": true,
    "flow_distribution": "latency_and_load_aware"
  },
  "resilience": {
    "min_active_paths": 1,
    "warm_standby_paths": 1,
    "failover": "pool_member_or_next_authorized_pool",
    "reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
  }
}
```

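As a rough illustration of how a receiver might decode and sanity-check the request above, here is a minimal Go sketch. The JSON field names come from the example; the Go type `ServiceChannelRequest`, the function `ParseServiceChannelRequest`, and the specific validation rules (for instance, requiring `min_active_paths` to be at least 1) are assumptions, not the project's actual code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ServiceChannelRequest mirrors a subset of the JSON request shown above.
// The Go type and field names are illustrative assumptions.
type ServiceChannelRequest struct {
	SchemaVersion string `json:"schema_version"`
	ChannelID     string `json:"channel_id"`
	ServiceClass  string `json:"service_class"`
	Target        struct {
		Kind        string   `json:"kind"`
		PoolIDs     []string `json:"pool_ids"`
		ServiceRole string   `json:"service_role"`
	} `json:"target"`
	Resilience struct {
		MinActivePaths   int      `json:"min_active_paths"`
		WarmStandbyPaths int      `json:"warm_standby_paths"`
		Failover         string   `json:"failover"`
		RerouteOn        []string `json:"reroute_on"`
	} `json:"resilience"`
}

const requestSchemaV1 = "rap.fabric_service_channel_request.v1"

// ParseServiceChannelRequest decodes the request and applies a few assumed
// sanity checks before it would be handed to route planning.
func ParseServiceChannelRequest(raw []byte) (*ServiceChannelRequest, error) {
	var req ServiceChannelRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	switch {
	case req.SchemaVersion != requestSchemaV1:
		return nil, fmt.Errorf("unsupported schema_version %q", req.SchemaVersion)
	case req.ChannelID == "":
		return nil, errors.New("channel_id is required")
	case req.Target.Kind == "pool" && len(req.Target.PoolIDs) == 0:
		return nil, errors.New("pool target needs at least one pool_id")
	case req.Resilience.MinActivePaths < 1:
		return nil, errors.New("min_active_paths must be at least 1")
	}
	return &req, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "rap.fabric_service_channel_request.v1",
	  "channel_id": "vpn-session-1",
	  "service_class": "vpn_packets",
	  "target": {"kind": "pool", "pool_ids": ["home-ipv4"], "service_role": "ipv4-egress"},
	  "resilience": {"min_active_paths": 1, "warm_standby_paths": 1}
	}`)
	req, err := ParseServiceChannelRequest(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted channel", req.ChannelID, "for pools", req.Target.PoolIDs)
}
```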
The fabric responds with a signed route bundle containing a short-lived
`rap.fabric_route_lease.v1`. The lease names the target pool, primary path,
warm standby paths, multipath hints, and rebuild policy. Physical endpoint
candidates are visible only to the fabric runtime as lease material; service
adapters do not rank, pin, or fail over endpoints themselves. A service adapter
receives only a duplex channel handle and service metadata:

- Android VPN: TUN packet reader/writer only.
- `ipv4-egress`: NAT/ordinary IPv4 exit only.
- RDP: protocol/session adapter only; server address, protocol, credentials,
  rendering, and clipboard are RDP service metadata, not fabric routing.

Temporary compatibility fields such as `exit_candidates` may exist only inside
the fabric route bundle consumed by the fabric runtime. Service code must treat
them as opaque and must not schedule routes from them.

The VPN client runtime accepts only `fabric_service_channel_request` plus
`fabric_route_bundle.route_lease`. The Android service may keep a deprecated
diagnostic endpoint cache, but packet routing must come from the lease. If a
path fails, slows down, or its target pool member dies, the fabric lease/rebuild
policy is the authority; the VPN service continues writing packets to the
channel and does not switch protocols.

## Distributed Authority Requirements

- No single control-plane/API/storage/update node can mutate the cluster alone.
- Cluster root and high-risk role changes require threshold signatures from
  authorized control-authority keys.
- Update releases require signed metadata, signed artifact hashes, compatibility
  constraints, rollout scope, and rollback windows; mirrors may serve bytes but
  cannot change what is trusted.
- Route leases, relay leases, rendezvous assignments, peer-directory epochs, and
  endpoint candidate epochs are signed and short-lived.
- Nodes cache the last valid signed state and continue routing through peers,
  relay fallbacks, and passive reverse channels when API replicas are down.
- A compromised replica may delay or omit data, but must not be able to forge
  role assignment, route authority, update authority, storage placement, or node
  ownership.
- Development `database_signer` mode is not production authority. Production
  acceptance requires quorum-signed envelopes for node join, role mutation,
  mesh config, route leases, update plans, and release metadata.

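The local verification rule in the list above (threshold signatures, epochs, and expiry checked on the node itself) can be sketched in Go as follows. This is a hedged illustration only: the `Envelope` and `Authority` types, the byte layout produced by `signedBytes`, the choice of Ed25519, and the 2-of-3 threshold are all assumptions and do not describe the project's real envelope format.

```go
package main

import (
	"crypto/ed25519"
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// Envelope is a hypothetical signed control-plane message.
type Envelope struct {
	Payload    []byte
	Epoch      uint64
	NotAfter   time.Time
	Signatures map[string][]byte // authority key ID -> signature
}

// Authority is the locally pinned trust configuration (names are assumptions).
type Authority struct {
	Keys      map[string]ed25519.PublicKey
	Threshold int
	MinEpoch  uint64
}

// signedBytes binds epoch and expiry to the payload so neither can be altered
// without invalidating the signatures.
func signedBytes(e Envelope) []byte {
	buf := make([]byte, 16, 16+len(e.Payload))
	binary.BigEndian.PutUint64(buf[0:8], e.Epoch)
	binary.BigEndian.PutUint64(buf[8:16], uint64(e.NotAfter.Unix()))
	return append(buf, e.Payload...)
}

// Verify accepts the envelope only if it is fresh and carries valid
// signatures from at least Threshold distinct known keys.
func (a Authority) Verify(e Envelope, now time.Time) error {
	if e.Epoch < a.MinEpoch {
		return errors.New("stale epoch")
	}
	if !now.Before(e.NotAfter) {
		return errors.New("envelope expired")
	}
	msg := signedBytes(e)
	valid := 0
	for id, sig := range e.Signatures {
		if pub, ok := a.Keys[id]; ok && ed25519.Verify(pub, msg, sig) {
			valid++
		}
	}
	if valid < a.Threshold {
		return fmt.Errorf("quorum not met: %d of %d signatures", valid, a.Threshold)
	}
	return nil
}

// demoVerify builds three authority keys, signs a lease with the first
// `signers` of them, and verifies it under a 2-of-3 threshold.
func demoVerify(signers int, expired bool) error {
	now := time.Now()
	env := Envelope{Payload: []byte("route-lease"), Epoch: 7,
		NotAfter: now.Add(time.Minute), Signatures: map[string][]byte{}}
	if expired {
		env.NotAfter = now.Add(-time.Minute)
	}
	auth := Authority{Keys: map[string]ed25519.PublicKey{}, Threshold: 2, MinEpoch: 7}
	for i := 0; i < 3; i++ {
		pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
		if err != nil {
			return err
		}
		id := fmt.Sprintf("authority-%d", i)
		auth.Keys[id] = pub
		if i < signers {
			env.Signatures[id] = ed25519.Sign(priv, signedBytes(env))
		}
	}
	return auth.Verify(env, now)
}

func main() {
	fmt.Println("2 of 3 signers:", demoVerify(2, false))
	fmt.Println("1 of 3 signers:", demoVerify(1, false))
	fmt.Println("expired lease: ", demoVerify(3, true))
}
```

Keying the signature map by authority key ID means one key cannot be counted twice toward the threshold, which is the property a single compromised replica must not be able to bypass.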
## Implementation Layers

1. Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse
   tunnels, relay candidates.
2. Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity,
   health, and pressure metrics.
3. Routing layer: latency-aware and load-aware route sets with relay fallback
   and pool failover.
4. Channel layer: millions of logical channels with independent lifecycle,
   flow control, and statistics.

## Stress Requirements

The fabric is not accepted on the basis of ping tests. It must pass real
byte-transfer load:

- 1000 concurrent streams from different source nodes to different destination
  nodes.
- Mixed long-lived and short-lived channels.
- Aggressive create/delete churn.
- Many-to-one, one-to-many, and many-to-many traffic.
- Direct, LAN, relay, multi-hop, and reverse tunnel paths.
- Endpoint pool failover under load.
- Intermediate relay/node failure and route rebuild under load.
- Induced latency, packet loss, bandwidth caps, and route saturation.
- Control/interactive traffic surviving bulk traffic.
- No sustained overload of one path when alternatives exist.
- No goroutine, memory, stream, or file descriptor leak after churn.

## Required Stress Report

Every stress run must produce machine-readable JSON with:

- topology and scenario profile;
- channel setup/teardown counts and latency;
- total and per-channel throughput;
- per-node and per-route capacity pressure;
- p50/p95/p99 latency where measured;
- backpressure, rejection, and queue-depth counters;
- route switch and failover events;
- target pool failover events;
- QUIC connection and logical channel counts;
- final pass/fail verdict against SLO thresholds.

The first executable harness is `agents/rap-node-agent/cmd/fabric-loadtest`.
It supports in-process multi-node QUIC targets, short logical channel churn,
pool failover, target failure injection, and JSON reports.

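To make the shape of such a report concrete, the following Go sketch models a small subset of it and derives a verdict. It is an assumption-laden illustration, not the harness's schema: `route_pressure`, `active_total`, `ack_p95_ms`, `channel_leaks`, and `verdict_reasons` are names that appear in this document, while every field marked "placeholder" and the `Judge` method are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RoutePressure holds route accounting totals for the whole run.
type RoutePressure struct {
	ActiveTotal   int `json:"active_total"`
	AcquiredTotal int `json:"acquired_total"` // placeholder name
	ReleasedTotal int `json:"released_total"` // placeholder name
}

// Report is a reduced, hypothetical stress report.
type Report struct {
	StreamsTotal   int           `json:"streams_total"`      // placeholder name
	StreamsOK      int           `json:"streams_successful"` // placeholder name
	ChannelLeaks   int           `json:"channel_leaks"`
	AckP95Ms       float64       `json:"ack_p95_ms"`
	RoutePressure  RoutePressure `json:"route_pressure"`
	Verdict        string        `json:"verdict"` // placeholder name
	VerdictReasons []string      `json:"verdict_reasons,omitempty"`
}

// Judge fills Verdict and VerdictReasons from the counters and one ACK
// latency SLO. The specific checks are a simplified assumption.
func (r *Report) Judge(maxAckP95Ms float64) {
	r.VerdictReasons = nil
	if r.StreamsOK != r.StreamsTotal {
		r.VerdictReasons = append(r.VerdictReasons, "failed streams")
	}
	if r.ChannelLeaks != 0 {
		r.VerdictReasons = append(r.VerdictReasons, "channel leak")
	}
	if r.RoutePressure.ActiveTotal != 0 {
		r.VerdictReasons = append(r.VerdictReasons, "route pressure leak")
	}
	if r.RoutePressure.AcquiredTotal != r.RoutePressure.ReleasedTotal {
		r.VerdictReasons = append(r.VerdictReasons, "acquire/release mismatch")
	}
	if r.AckP95Ms > maxAckP95Ms {
		r.VerdictReasons = append(r.VerdictReasons, "ack p95 above SLO")
	}
	r.Verdict = "pass"
	if len(r.VerdictReasons) > 0 {
		r.Verdict = "fail"
	}
}

func main() {
	rep := Report{
		StreamsTotal: 400, StreamsOK: 400, AckP95Ms: 6,
		RoutePressure: RoutePressure{ActiveTotal: 0, AcquiredTotal: 500, ReleasedTotal: 500},
	}
	rep.Judge(100)
	out, err := json.MarshalIndent(rep, "", "  ")
	if err != nil {
		fmt.Println("encode report:", err)
		return
	}
	fmt.Println(string(out))
}
```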
Example local pool-failover run:

```powershell
go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s
```

The local harness is not a replacement for distributed host testing. It is the
first acceptance gate for protocol limits, channel lifecycle churn, pool
failover semantics, and reporting shape before running the same workload across
the shared test Docker host.

Distributed shared-host smoke:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms
```

The distributed smoke builds/runs separate server and client containers on the
shared Docker host, sends real QUIC fabric frames across the Docker network,
kills one target node during load, and expects all channels assigned to that
target to fail over to the remaining pool.

The smoke summary includes the strict loadtest verdict plus `route_pressure`
and `transport_snapshot`; the script fails when the client verdict is not
`pass` and carries `verdict_reasons` into the thrown error.

`-TuneUdpBuffers` applies runtime host sysctls through a privileged one-shot
container before the run and records the observed values in the summary:
`net.core.rmem_max`, `net.core.wmem_max`, `net.core.rmem_default`, and
`net.core.wmem_default`.

Degraded-target and latency-aware admission run:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80
```

This applies `tc netem` to one target, probes every target before mass channel
placement, excludes targets above the RTT threshold, and reports per-target
setup/duration percentiles. This is the first executable gate for
latency-aware placement; live channel migration after mid-stream degradation is
the next routing-layer gate.

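The probe-then-exclude step described above amounts to a simple filter over per-target RTT measurements. The Go sketch below shows that filter in isolation; the function `AdmitTargets`, its signature, and the sample RTT values are hypothetical, and the real harness may probe and admit targets differently.

```go
package main

import (
	"fmt"
	"sort"
)

// AdmitTargets splits probed targets into admitted and excluded sets.
// rttMs maps a target index to its probed RTT in milliseconds; a target is
// excluded when its RTT is above maxRTTMs. Names are illustrative.
func AdmitTargets(rttMs map[int]float64, maxRTTMs float64) (admitted, excluded []int) {
	for target, rtt := range rttMs {
		if rtt > maxRTTMs {
			excluded = append(excluded, target)
		} else {
			admitted = append(admitted, target)
		}
	}
	// Map iteration order is random, so sort for stable reporting.
	sort.Ints(admitted)
	sort.Ints(excluded)
	return admitted, excluded
}

func main() {
	// Made-up probe results: target 0 carries a 100 ms induced delay,
	// the other targets are unimpaired.
	probes := map[int]float64{0: 101.4, 1: 0.9, 2: 1.2, 3: 0.8}
	admitted, excluded := AdmitTargets(probes, 80)
	fmt.Println("admitted:", admitted, "excluded:", excluded)
}
```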
Mid-stream migration gate:

```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30
```

This starts channels normally, applies `tc netem` after traffic is already in
flight, and expects slow logical streams to continue their remaining bytes on a
different target. The report exposes `migration_events`, `max_ack_ms`,
`ack_p95_ms`, `ack_p99_ms`, `route_attempts_total`, `reroute_causes`, and
per-target stats.

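The "continue the remaining bytes on a different target" behavior can be pictured with a toy Go simulation. Everything here is hypothetical: `AckFunc` stands in for a real send-and-wait-for-ACK call, `SendWithMigration` is an invented name, migration simply moves to the next target index, and reroute hysteresis is left out to keep the sketch short.

```go
package main

import "fmt"

// AckFunc returns the ACK latency in milliseconds observed for one frame on
// one target. It stands in for a real send-and-wait-for-ACK call.
type AckFunc func(target, frame int) int

// SendWithMigration sends `frames` frames starting on target 0. When an ACK is
// slower than maxAckMs, the remaining frames continue on the next target.
// It returns the number of migrations and the frames acknowledged per target.
func SendWithMigration(frames, targets, maxAckMs int, ack AckFunc) (int, []int) {
	if targets < 1 {
		return 0, nil
	}
	perTarget := make([]int, targets)
	migrations, current := 0, 0
	for f := 0; f < frames; f++ {
		latency := ack(current, f)
		perTarget[current]++
		if latency > maxAckMs && targets > 1 {
			current = (current + 1) % targets // move remaining frames elsewhere
			migrations++
		}
	}
	return migrations, perTarget
}

func main() {
	// Target 0 degrades from frame 3 onward, as if an impairment were
	// applied while the stream was already in flight.
	ack := func(target, frame int) int {
		if target == 0 && frame >= 3 {
			return 200
		}
		return 2
	}
	migrations, perTarget := SendWithMigration(10, 4, 30, ack)
	fmt.Println("migrations:", migrations, "frames per target:", perTarget)
}
```

In this run the stream sends four frames on target 0 (the fourth ACK is the slow one), migrates once, and finishes the remaining six frames on target 1.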
Production fabric-core migration boundary:

- `FabricChannelRouter` opens channels on the best route from a `FabricRouteSet`.
- Live `FabricChannelObservation` values update counters and trigger reroute on
  route failure, ACK latency threshold, or capacity pressure.
- Reroutes switch route binding and pool target where applicable, increment
  `RerouteCount`, and emit `FabricChannelRouteEvent`.
- `MinRerouteInterval` provides hysteresis so a noisy path does not cause route
  flapping.
- `FabricChannelRuntime` binds the router to live QUIC fabric sessions for
  reliable byte payloads: it opens the logical stream, sends frames, measures
  ACK latency, reports observations to the router, and continues remaining
  payloads on a rerouted QUIC route after connect failure or slow ACKs.
- QUIC logical session close cancels the stream read side before closing the
  write side, so high-churn short sessions release reader goroutines promptly
  instead of waiting for stream read deadlines.
- Server-side QUIC stream handlers close their write side when the handler
  exits. This returns QUIC stream credit promptly during high-churn short
  sessions and prevents the last worker window from stalling on stream open.
- Production request/response forwarding now builds a `FabricRouteSet` from all
  QUIC endpoint candidates for the next hop, sends the envelope over the chosen
  QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
  failure or response timeout.
- The legacy HTTP production forward carrier has been removed from the mesh
  runtime API. Production forwarding now exposes a single QUIC transport
  implementation; HTTP handlers remain only as node-local API surfaces and test
  harness entry points.
- Production route choice includes live per-route active-channel pressure, so
  concurrent forwarding requests can spread across equivalent QUIC candidates
  instead of concentrating on the first/shortest route until it is saturated.
- Production forwarding also keeps per-route health quarantine. A QUIC route
  that fails connect or response is marked unhealthy for a bounded retry window,
  skipped by subsequent channel scheduling, exposed in route-health snapshots,
  and restored automatically after the retry window or a successful send.
- `FabricRoutePressureTracker` provides shared active-channel accounting for
  both production request/response forwarding and bulk `FabricChannelRuntime`
  traffic, so different traffic surfaces can make route decisions against the
  same live load signal.
- Route pressure is observable through `FabricRoutePressureSnapshot`, including
  current active channels, max active channels, total acquire/release counts,
  and last acquired/released route IDs. Bulk runtime results and production
  QUIC forwarding snapshots expose this data for stress reports.
- `fabric-loadtest` reports route IDs per stream attempt, global
  `route_pressure`, and per-target `max_active_channels`, so stress runs can
  verify channel distribution and release accounting after churn.
- `FabricRouteSetForPeerEndpointCandidates` converts QUIC endpoint candidates
  into production route sets for direct, LAN, ICE/STUN-derived, reverse
  outbound, and relay fallback modes. Non-QUIC candidates are rejected instead
  of becoming alternate transports.
- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
  instead of collapsing to one address: operator/public QUIC, listener QUIC,
  LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
  `relay_quic` fallback. Candidate metadata carries `local_segment_id`,
  `nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
  `relay_endpoint` when configured.
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
  health observations, latency, reliability, region, policy tags, and live
  capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
  candidates rather than alternate transports.
- `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
  selection can return only `QUICFabricTransport`; unsupported labels fail with
  a QUIC-required error.
- Explicit transport labels are authoritative. A legacy label such as `relay`
  or `outbound_reverse` is rejected even when the endpoint string uses a
  `quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
- Node-agent config loading rejects legacy advertised transport labels and
  HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
  fabric endpoints. Bad endpoint posture fails before heartbeat publication.
- Host-agent install/runtime validation rejects legacy mesh advertise transport
  labels and HTTP/WebSocket advertise endpoints before they can be passed into a
  node-agent Docker runtime.
- JSON-advertised endpoint candidates and scoped synthetic config route
  recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
  seeds, and rendezvous leases reject legacy transport labels and
  HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
  entries.
- Rendezvous relay leases and peer-connection intents now use `relay_quic` as
  the transport label. `relay_control` remains only a telemetry/control-state
  name for rendezvous admission counters, not a data-plane transport option.
- Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint
  candidates are probed with QUIC session setup, pinned certificate metadata is
  honored, and HTTP/WebSocket endpoint schemes are rejected instead of being
  used as peer health transport.
- Node-agent synthetic runtime no longer installs an HTTP peer transport as an
  inter-node carrier, and the shared mesh runtime package no longer exports an
  HTTP peer transport implementation. Any HTTP synthetic motion is confined to
  explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest
  gates.
- Control-plane and debug JSON mesh config loading is validated after
  conversion into runtime structures. Peer endpoint candidates, recovery seeds,
  rendezvous leases, and selected relay endpoints in route decisions must use
  QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
  in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
  rendezvous leases.
- The old fabric-session WebSocket endpoint is no longer exposed by
  `FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
  and is not part of the node-agent fabric transport surface.
- Same local segment or same NAT group is treated as a LAN route by the planner,
  so a whole cluster piece behind one NAT can prefer private addresses between
  its own nodes while still maintaining outbound/relay visibility to the rest
  of the fabric.
- Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
  route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
  totals by transport label, route pressure, QUIC listener state, goroutines,
  heap usage, and the next recommended soak gate.
- `FabricOverlayTransport` is the generic service-neutral send facade over
  route sets, `FabricChannelRuntime`, shared route pressure, and QUIC sessions.
  New traffic classes should enter the fabric through this layer or an
  equivalent runtime integration, not through HTTP/WebSocket fallbacks.
- `FabricChannelRuntime` uses the same route health quarantine as production
  forwarding. Connect failures, stream send failures, and missing ACKs mark a
  route unhealthy for a bounded retry window, so later channels for any traffic
  class avoid that route until it recovers.
- `FabricOverlayTransport` exposes route pressure and route health snapshots,
  and node heartbeat runtime metadata reports production route health plus the
  current quarantined route count.
- Scheduler resource guardrails include `HardMaxRoutePressure`: when enabled,
  a route whose projected active-channel pressure exceeds the threshold is not
  admitted. This makes overload prevention enforceable in route choice rather
  than only observable after the fact.
- The loadtest verdict fails on route-pressure leaks, acquire/release mismatch,
  missing acquire accounting, active channels above configured concurrency, or
  target distribution collapse/skew when multiple targets are healthy.
- Continuous soak aggregation is bounded: `fabric-loadtest` keeps exact
  counters, per-target totals, route-mode counts, error/reroute totals, and
  bounded latency samples, while `stream_samples` is capped to diagnostic
  examples. Long 30-120 minute runs should not retain one result object per
  logical channel.
- `fabric-loadtest` also keeps bounded `error_samples`, so high-volume churn
  reports preserve representative failed logical channels even when the first
  retained `stream_samples` are all successful.
- Mixed topology verdicts require route-mode coverage when at least four
  healthy targets are present. A `mixed-public-nat-lan-relay` or
  `nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
  `reverse_quic`, and `relay_quic`.
- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
  `outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
  `wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
  violation even if payload delivery succeeds.
- Healthy multi-target verdicts check both stream distribution and byte
  distribution. This prevents a run from passing with equal channel counts but
  most bulk bytes concentrated on one target or route.
- Healthy multi-target verdicts also check route-pressure distribution through
  per-route `max_active` values. A run fails if live concurrent channel load
  collapses onto one target/route while alternatives are healthy.
- Successful logical channels must receive one ACK per transmitted data frame.
  `fabric-loadtest` reports `ack_mismatched_streams` and per-target
  `acks_received`, and fails the verdict when any stream is marked successful
  with fewer ACKs than sent frames.
- ACK payloads carry the SHA-256 checksum of the received data-frame payload.
  `fabric-loadtest` validates the checksum for every ACK and fails the verdict
  with `ack_integrity_errors` when the acknowledged bytes do not match the sent
  payload.
- Failover accounting separates `abandoned_frames` from true ACK mismatch. A
  frame sent on a route that dies before ACK is counted as abandoned and the
  unacknowledged byte range is retransmitted on the next pool member; the
  verdict still fails when non-abandoned frames are missing ACKs.
- Loadtest data frames use deterministic per-frame payloads derived from stream
  index, logical stream ID, sequence, and byte offset. This makes checksum ACKs
  validate each frame identity instead of repeatedly validating one shared
  buffer pattern.
- Mixed bulk/control stress is supported with `-control-every`,
  `-control-bytes-per-stream`, and `-max-control-ack-p95-ms`. Reports include
  `control_streams`, `bulk_streams`, `control_ack_p95_ms`, and
  `bulk_ack_p95_ms`; the verdict fails when control ACK p95 exceeds the
  configured SLO.
- Verified shared-host mixed smoke:
  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
  The run produced 40/40 successful streams, 8 control streams,
  `control_ack_p95_ms=1`, `bulk_ack_p95_ms=2`,
  `route_pressure.active_total=0`, and matching acquire/release counts.
- Verified shared-host mixed failover stress:
  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
  Latest run `fabric-loadtest-20260516-160751` produced 1000/1000 successful
  streams, 250 failover events after the planned target kill, 50 control
  streams, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=6`, `ack_p95_ms=6`,
  `ack_p99_ms=8`, `route_attempts_total=1250`,
  `route_pressure.active_total=0`, `max_active_total=128`, and matching
  acquire/release counts. Full JSON artifacts are written under
  `artifacts/fabric-loadtest`.
- Verified shared-host mixed degradation/migration stress:
  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
  The run produced 200/200 successful streams, 9 migration events,
  20 control streams, `control_ack_p95_ms=2`, `bulk_ack_p95_ms=7`,
  `route_pressure.active_total=0`, `max_active_total=32`, and matching
  acquire/release counts.
- Latest shared-host degradation/migration gate:
  `fabric-loadtest-20260516-160710` with 160 streams, 32 concurrency, 4 MiB
  bulk streams, and 180 ms delay plus 0.5% loss induced after 50 ms produced
  160/160 successful streams, 12 slow-ACK migrations, degraded-target
  quarantine, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=180`,
  `route_pressure.active_total=0`, `max_active_total=32`, and matching
  acquire/release counts.
- Short shared-host soak gate:
  `fabric-loadtest-20260516-160943` with `-Duration 45s`, 1200 streams,
  96 concurrency, four healthy targets, and mixed control/bulk traffic produced
  1200/1200 successful streams, even 300/300/300/300 target distribution,
  `channel_opens=1200`, `channel_closes=1200`, `channel_leaks=0`,
  `control_ack_p95_ms=4`, `ack_p95_ms=5`, `ack_p99_ms=8`,
  `route_pressure.active_total=0`, `max_active_total=96`, and matching
  acquire/release counts.
- Continuous soak mode is now explicit: add `-Soak -Duration 30m` or
  `-Soak -Duration 120m` to the Docker runner. In soak mode workers keep
  creating and closing logical channels until the duration expires, instead of
  stopping after a fixed stream list. This is the required gate for memory,
  goroutine, file descriptor, QUIC stream, and route-pressure stability.
- Soak duration stops new logical channel creation but does not cancel channels
  already in flight. In-flight channels drain under their per-channel
  `-StreamTimeout`; the outer `-ClientTimeout` remains the hard scenario
  guardrail. This prevents the final active window from being counted as
  failed streams just because the soak timer expired.
- Recommended real-topology soak command:
  `powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
- Soak reports include `resource_samples` and `resource_summary` with
  goroutine start/end/max/delta, heap allocation start/end/max/delta, heap
  objects, open file descriptor start/end/max/delta, GC delta, max active QUIC
  streams, and max active route load.
  Optional verdict gates `-MaxGoroutineDelta` and `-MaxHeapDeltaMB` fail the
  run if resource drift exceeds the configured budget.
- Optional file descriptor verdict gates `-MaxOpenFDDelta` and `-MaxOpenFDs`
  are passed through the Docker runner to `fabric-loadtest` as
  `-max-open-fd-delta` and `-max-open-fds`. On Linux containers these read
  `/proc/self/fd` and fail the run if descriptor count drifts or peaks beyond
  the configured budget.
- Optional throughput SLO gate `-MinThroughputMbps` is passed through the Docker
  runner to `fabric-loadtest` as `-min-throughput-mbps`. It fails the verdict
  when useful data-plane throughput falls below the configured Mbps floor.
- Optional short-session churn SLO gate `-MinChannelChurnPerSec` is passed
  through the Docker runner to `fabric-loadtest` as
  `-min-channel-churn-per-sec`. It fails the verdict when logical channel
  open/close throughput falls below the configured channel-per-second floor.
- Each logical channel has a per-channel timeout through `-StreamTimeout`
  in the Docker runner and `-stream-timeout` in `fabric-loadtest`. This keeps a
  wedged channel from holding a worker slot until the whole client run times
  out, preserving channel isolation under churn.
- Each data frame has an ACK timeout through `-AckTimeout` in the Docker runner
  and `-ack-timeout` in `fabric-loadtest`. A missing ACK triggers reroute/pool
  retry without waiting for the full channel timeout.
- Optional overall ACK latency gates `-MaxAckP95Ms` and `-MaxAckP99Ms` are
  passed through the Docker runner to `fabric-loadtest` as
  `-max-ack-p95-ms` and `-max-ack-p99-ms`. They fail healthy runs when
  aggregate data-plane ACK latency exceeds the configured SLO, independently
  of slow-route migration thresholds.
- Optional per-target ACK latency gate `-MaxTargetAckMs` is passed through the
|
||||
Docker runner to `fabric-loadtest` as `-max-target-ack-ms`. It fails healthy
|
||||
runs when any target route reports a `target_stats[*].max_ack_ms` above the
|
||||
configured SLO.
|
||||
- Optional channel setup latency gates `-MaxSetupP95Ms` and `-MaxSetupP99Ms`
|
||||
are passed through the Docker runner to `fabric-loadtest` as
|
||||
`-max-setup-p95-ms` and `-max-setup-p99-ms`. They fail healthy runs when
|
||||
logical channel open/setup latency exceeds the configured SLO before payload
|
||||
transfer starts.
|
||||
- Optional reroute latency gates `-MaxRerouteP95Ms` and `-MaxRerouteP99Ms`
|
||||
are passed through the Docker runner to `fabric-loadtest` as
|
||||
`-max-reroute-p95-ms` and `-max-reroute-p99-ms`. They measure repeat channel
|
||||
setup latency after pool failover or slow-route migration and fail the run
|
||||
when route rebuild exceeds the configured SLO.
|
||||
- Docker shared-host summaries also include `container_stats` from
  `docker stats --no-stream` for each fabric server/client container that is
  still running at the end of the scenario. This records CPU percent, memory
  usage, memory percent, network IO, block IO, and PID count per node before
  cleanup.
- Long soak runs can add `-ContainerStatsSampleInterval 10s` to collect
  periodic Docker container stats while traffic is in flight. The runner writes
  samples to `container_stats_samples_path`, includes
  `container_stats_samples_count` and `container_stats_sample_summary`, and
  records per-container memory/PID start, end, max, and delta values.
- Optional container resource verdict gates `-MaxContainerMemoryMiB` and
  `-MaxContainerPids` fail the Docker scenario when any running fabric
  container exceeds the configured memory or PID budget at the final snapshot
  or at any periodic sample peak.
- Verified short continuous soak:
  `fabric-loadtest-20260516-163206` used `-Soak -Duration 20s`,
  mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk
  traffic. It produced 4000/4000 successful logical channels,
  `channel_opens=4035`, `channel_closes=4035`, `channel_leaks=0`,
  `route_pressure.active_total=0`, `max_active_total=32`,
  `control_ack_p95_ms=2`, `ack_p95_ms=4`, resource sample count 12,
  goroutine delta -18, max active streams 32, max active route load 32, and
  matching acquire/release counts.
- Verified 60-second high-churn continuous soak with graceful drain:
  `fabric-loadtest-20260516-174505` rebuilt the Docker image after changing
  soak duration to stop generation and let in-flight channels drain. The
  4-node mixed-topology run used 128 concurrency, `-Duration 60s`,
  `-StreamTimeout 15s`, periodic resource/container sampling, mixed
  control/bulk traffic, throughput and churn SLOs. It produced 438740/438740
  successful logical channels, `channel_churn_per_sec=7310`,
  `throughput_bps=3473632858`, `ack_p95_ms=5`, `ack_p99_ms=6`,
  `control_ack_p95_ms=3`, `channel_opens=438740`,
  `channel_closes=438740`, `channel_leaks=0`, `open_failures=0`,
  `goroutines_delta=-1`, `open_fds_delta=4`, all four route modes, clean
  route-pressure accounting, and verdict `pass`.
- Verified pool failover soak with ACK timeout and abandoned-frame accounting:
  `fabric-loadtest-20260516-175622` rebuilt the Docker image with ACK timeout,
  target quarantine, and abandoned-frame accounting, then killed target 0 after
  3 seconds during a 30-second mixed-topology soak. It produced 136194/136194
  successful logical channels, `failed_streams=0`, `failover_events=82`,
  `abandoned_frames=75`, `ack_mismatched_streams=0`,
  `ack_integrity_errors=0`, `channel_churn_per_sec=4543`,
  `throughput_bps=2156155314`, `reroute_latency_p99_ms=9`,
  `channel_leaks=0`, clean route-pressure accounting, and verdict `pass`.
- Verified container stats gate:
  `fabric-loadtest-20260516-163854` produced a passing 2-node mixed-topology
  smoke with `-MaxContainerMemoryMiB 128 -MaxContainerPids 64` and included
  `container_stats` for both fabric server containers, with memory usage around
  4-6 MiB per server and server PID counts 7-9. A negative control run with
  `-MaxContainerMemoryMiB 1` failed as expected with
  `container_memory_mib=...>1` verdict reasons.
- Verified periodic container stats sampling:
  `fabric-loadtest-20260516-164259` used `-Soak -Duration 8s`,
  `-ContainerStatsSampleInterval 2s`, mixed public/NAT/LAN/relay profile, and
  `-MaxContainerMemoryMiB 128 -MaxContainerPids 64`. It produced 2000/2000
  successful logical channels, `channel_opens=2009`, `channel_closes=2009`,
  `channel_leaks=0`, even 1000/1000 target distribution, 400 control streams,
  `ack_p95_ms=1`, `route_pressure.active_total=0`, matching acquire/release
  counts, final server memory around 12-13 MiB, and periodic sample peaks for
  the client and both servers in
  `fabric-loadtest-20260516-164259-container-stats-samples.json`.
- Verified high-churn goroutine drain after QUIC close cancellation:
  `fabric-loadtest-20260516-164502` rebuilt the Docker image and repeated the
  2-node mixed-topology continuous soak with `-MaxGoroutineDelta 64`,
  `-MaxHeapDeltaMB 128`, `-ContainerStatsSampleInterval 2s`,
  `-MaxContainerMemoryMiB 128`, and `-MaxContainerPids 64`. It produced
  2000/2000 successful logical channels, `channel_opens=2009`,
  `channel_closes=2009`, `channel_leaks=0`, even 1000/1000 target
  distribution, `control_ack_p95_ms=1`, `ack_p95_ms=1`,
  `route_pressure.active_total=0`, matching acquire/release counts, and
  `goroutines_delta=-2`.
- Verified file descriptor gate:
  `fabric-loadtest-20260516-164725` rebuilt the Docker image and repeated the
  2-node mixed-topology continuous soak with `-MaxOpenFDDelta 8` and
  `-MaxOpenFDs 128` in addition to goroutine, heap, container memory, and PID
  gates. It produced 2000/2000 successful logical channels,
  `channel_leaks=0`, `route_pressure.active_total=0`, matching
  acquire/release counts, `open_fds_start=15`, `open_fds_end=9`,
  `open_fds_max=19`, and `open_fds_delta=-6`.
- Verified bounded soak aggregation:
  `fabric-loadtest-20260516-165051` rebuilt the Docker image after changing
  soak result storage to an aggregate collector. The 2-node mixed-topology soak
  produced 2000/2000 successful logical channels, even 1000/1000 target
  distribution, `channel_leaks=0`, `route_pressure.active_total=0`, matching
  acquire/release counts, `goroutines_delta=0`, `open_fds_delta=1`, verdict
  `pass`, and only 25 retained `stream_samples` in the full report.
- Verified mixed route-mode coverage gate:
  `fabric-loadtest-20260516-165308` rebuilt the Docker image with the route
  coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000
  successful logical channels, even 1000/1000/1000/1000 target distribution,
  `channel_leaks=0`, `route_pressure.active_total=0`, matching
  acquire/release counts, and observed all required route modes:
  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`.
- Verified ACK integrity gate:
  `fabric-loadtest-20260516-165544` rebuilt the Docker image with the ACK
  mismatch verdict and repeated the 4-node mixed-topology soak. It produced
  4000/4000 successful logical channels, `ack_mismatched_streams=0`, per-target
  `frames_sent=6600` and `acks_received=6600`, all four route modes, clean
  channel/route pressure accounting, and verdict `pass`.
- Verified ACK checksum integrity gate:
  `fabric-loadtest-20260516-165926` rebuilt the Docker image with ACK payload
  checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000
  successful logical channels, `ack_mismatched_streams=0`,
  `ack_integrity_errors=0`, 26400 total data frames, 26400 ACKs, all four route
  modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified unique per-frame payload integrity:
  `fabric-loadtest-20260516-170150` rebuilt the Docker image after switching
  loadtest traffic from a shared payload buffer to deterministic per-frame
  payloads. The 4-node mixed-topology soak produced 4000/4000 successful
  logical channels, `ack_mismatched_streams=0`, `ack_integrity_errors=0`, 26400
  data frames, 26400 ACKs, all four route modes, clean channel/route pressure
  accounting, and verdict `pass`.
- Verified throughput SLO gate:
  `fabric-loadtest-20260516-170512` rebuilt the Docker image with
  `-MinThroughputMbps 100` and repeated the 4-node mixed-topology soak. It
  produced 4000/4000 successful logical channels, `throughput_bps=212479668`,
  `ack_mismatched_streams=0`, `ack_integrity_errors=0`, all four route modes,
  clean channel/route pressure accounting, and verdict `pass`.
- Verified short-session churn SLO gate:
  `fabric-loadtest-20260516-173320` rebuilt the Docker image with
  `-MinChannelChurnPerSec 200`, then ran a 4-node mixed-topology high-churn
  short-session smoke with 1000 one-frame logical channels. It produced
  1000/1000 successful logical channels, `channel_churn_per_sec=9478`,
  `channel_opens=1000`, `channel_closes=1000`, `channel_leaks=0`, even target
  stream distribution, all four route modes, clean route-pressure accounting,
  and verdict `pass`.
- Verified high-churn QUIC stream-credit regression gate:
  `fabric-loadtest-20260516-174046` rebuilt the Docker image after closing the
  server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst
  of 5000 one-frame short logical channels at 128 concurrency with
  `-MinChannelChurnPerSec 300` and `-StreamTimeout 15s`. It produced 5000/5000
  successful logical channels, `channel_churn_per_sec=21124`,
  `channel_opens=5000`, `channel_closes=5000`, `channel_leaks=0`,
  `open_failures=0`, `ack_mismatched_streams=0`, `ack_integrity_errors=0`,
  even 1250/1250/1250/1250 target distribution, all four route modes, clean
  route-pressure accounting, and verdict `pass`.
- Verified target byte distribution gate:
  `fabric-loadtest-20260516-170731` rebuilt the Docker image with byte
  distribution verdicts and repeated the 4-node mixed-topology soak. It
  produced 4000/4000 successful logical channels, even 1000/1000/1000/1000
  stream distribution, exactly 53,248,000 bytes per target,
  `throughput_bps=212488911`, all four route modes, clean channel/route
  pressure accounting, and verdict `pass`.
- Verified overall ACK latency SLO gate:
  `fabric-loadtest-20260516-171001` rebuilt the Docker image with
  `-MaxAckP95Ms 20` and `-MaxAckP99Ms 50` and repeated the 4-node
  mixed-topology soak. It produced 4000/4000 successful logical channels,
  `ack_p95_ms=2`, `ack_p99_ms=3`, `ack_mismatched_streams=0`,
  `ack_integrity_errors=0`, all four route modes, clean channel/route pressure
  accounting, and verdict `pass`.
- Verified route-pressure distribution gate:
  `fabric-loadtest-20260516-171216` rebuilt the Docker image with
  route-pressure distribution verdicts and repeated the 4-node mixed-topology
  soak. It produced 4000/4000 successful logical channels, even target stream
  and byte distribution, per-route `max_active` values of 13/12/13/13,
  `route_pressure.active_total=0`, matching acquire/release counts, and
  verdict `pass`.
- Verified per-target ACK latency gate:
  `fabric-loadtest-20260516-171454` rebuilt the Docker image with
  `-MaxTargetAckMs 20` and repeated the 4-node mixed-topology soak. It produced
  4000/4000 successful logical channels, per-target `max_ack_ms` values of
  6/5/7/9, `ack_p95_ms=3`, `ack_p99_ms=5`, all four route modes, clean
  channel/route pressure accounting, and verdict `pass`.
- Verified channel setup latency SLO gate:
  `fabric-loadtest-20260516-171937` rebuilt the Docker image with
  `-MaxSetupP95Ms 20` and `-MaxSetupP99Ms 50`, then repeated the 4-node
  mixed-topology soak with ACK, throughput, FD, goroutine, heap, container
  memory, and PID gates enabled. It produced 4000/4000 successful logical
  channels, `setup_latency_p95_ms=0`, `ack_p95_ms=3`, `ack_p99_ms=3`,
  `throughput_bps=212572631`, even target stream/byte distribution, all four
  route modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified reroute latency SLO gate:
  `fabric-loadtest-20260516-172652` rebuilt the Docker image with
  `-MaxRerouteP95Ms 100` and `-MaxRerouteP99Ms 200`, then ran a 4-node
  mixed-topology pool-failover stress with target 0 killed during load. It
  produced 400/400 successful logical channels, 100 pool failover events,
  `reroute_latency_p95_ms=1`, `reroute_latency_p99_ms=2`,
  `route_attempts_total=500`, `ack_p95_ms=6`, `ack_p99_ms=8`,
  `throughput_bps=3863633075`, clean channel/route pressure accounting, and
  verdict `pass`.
- Mixed topology profile gate:
  `fabric-loadtest-20260516-162037` used
  `-TopologyProfile mixed-public-nat-lan-relay` with 400 streams, 64
  concurrency, four targets, and mixed control/bulk traffic. It produced
  400/400 successful streams, 100 streams per target, route-mode reporting for
  `lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`,
  `control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
  `route_pressure.active_total=0`, and matching acquire/release counts.
- Verified strict QUIC route-mode gate:
  `fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
  route-mode verdicts and ran the 4-node mixed topology profile. It produced
  400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
  `reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
  `ack_integrity_errors=0`, `channel_leaks=0`, clean route-pressure accounting,
  and verdict `pass`.
- `fabric-loadtest` now also treats the configured target list as part of the
  acceptance surface: every target must be `quic://...`. Empty targets, bare
  `host:port`, HTTP(S), and WS/WSS targets produce a failing
  `non_quic_targets=...` verdict reason. Client mode also rejects those targets
  before dialing, so a bad stress command cannot accidentally exercise a
  non-QUIC path that is discovered only after the run.
- The shared Docker runner `scripts/fabric/fabric-loadtest-docker-smoke.ps1`
  now has matching guardrails: it refuses local Docker Desktop contexts such as
  `default`/`desktop-linux` and validates generated targets before launch so the
  real-load smoke remains tied to the shared test Docker host and QUIC-only
  endpoints.
- Shared Docker validation after those guardrails:
  `fabric-loadtest-20260516-190049` rebuilt the Docker image on `test-docker`
  and ran 4 QUIC targets with 120 streams. It produced 120/120 successful
  logical channels, `ack_p95_ms=3`, `setup_latency_p95_ms=21`, clean
  open/close and route-pressure accounting, QUIC-only targets, and verdict
  `pass`.
- Shared Docker mixed-topology failover validation:
  `fabric-loadtest-20260516-190137` reused the image on `test-docker`, killed
  target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay
  profile. It produced 400/400 successful logical channels, 100 pool failover
  events, `route_attempts_total=500`, route modes `ice_quic`,
  `reverse_quic`, and `relay_quic` after the failed target was removed,
  `ack_p95_ms=8`, `setup_latency_p95_ms=51`, clean channel/route-pressure
  accounting, and verdict `pass`.
- Shared Docker mixed-topology route coverage validation:
  `fabric-loadtest-20260516-190207` ran the same 4-target mixed profile without
  target failure. It produced 400/400 successful logical channels, exactly 100
  streams per target, observed `lan_quic`, `ice_quic`, `reverse_quic`, and
  `relay_quic`, kept `ack_integrity_errors=0`, `channel_leaks=0`,
  `route_pressure.active_total=0`, and verdict `pass`.
- Load balancing under pool failover is now an acceptance gate. The first
  stricter shared-host rebuild, `fabric-loadtest-20260516-190704`, intentionally
  failed because all failed-target retries moved to the nearest live target,
  producing `target_byte_distribution_skew` and
  `route_pressure_distribution_skew`. The retry selector was then changed to
  spread failed-slot retries across the currently usable target set instead of
  selecting the next target in ring order.
- Verified load-aware retry routing after the fix:
  `fabric-loadtest-20260516-191028` rebuilt on `test-docker`, killed target 0
  after 100ms, and repeated the 4-target mixed profile. It produced 400/400
  successful logical channels, 100 pool failover events, surviving-target stream
  distribution of 134/133/133, surviving route-pressure max-active values of
  30/25/27, `ack_p95_ms=4`, `reroute_latency_p95_ms=1`, clean acquire/release
  accounting, and verdict `pass`.
- Verified 1000-channel mixed-topology stress:
  `fabric-loadtest-20260516-193414` ran 1000 logical channels on `test-docker`
  with 128 concurrency, mixed control/bulk traffic, and the
  `mixed-public-nat-lan-relay` profile. It produced 1000/1000 successful
  logical channels, exact 250/250/250/250 target distribution, observed all four
  QUIC route modes (`lan_quic`, `ice_quic`, `reverse_quic`, `relay_quic`),
  `throughput_bps=3629522849`, `channel_churn_per_sec=1919`,
  `ack_p95_ms=6`, clean channel/route-pressure accounting, and verdict `pass`.
- Verified 1000-channel pool-failover stress:
  `fabric-loadtest-20260516-193444` killed target 0 after 100ms and ran 1000
  logical channels with 128 concurrency. It produced 1000/1000 successful
  logical channels, 250 pool failover events, surviving-target distribution of
  334/333/333, `route_attempts_total=1250`, `ack_p95_ms=7`, clean
  acquire/release accounting, and verdict `pass`.
- Verified latency-degradation migration:
  `fabric-loadtest-20260516-193515` applied `tc netem delay 80ms` to target 1,
  enabled slow-stream migration with `-MaxAckMs 20`, and ran 400 mixed-profile
  channels. It observed the impaired target in `degraded_targets`, produced
  64 slow-ACK migrations, moved completed streams onto healthy targets with
  distribution 134/133/133, kept `channel_leaks=0`, `ack_integrity_errors=0`,
  clean route-pressure accounting, and verdict `pass`.
- Shared Docker runner resource-sample fallback was verified with
  `fabric-loadtest-20260516-190325`: short runs now still persist
  `container_stats_samples_path` and a minimal per-container sample summary
  from final Docker stats when the background sampler has no time to emit
  samples.
- Added `scripts/fabric/fabric-acceptance-summary.ps1` to aggregate recent
  `*-summary.json` artifacts into an acceptance report. It captures verdicts,
  target distribution, route modes, churn, failover/migration counts, latency
  SLOs, and resource evidence, and it keeps intentional failed runs visible as
  regression evidence for gates such as route-pressure skew detection.
- The first 30-minute soak attempt (`fabric-loadtest-20260516-193558`) exposed
  a runner defect instead of a fabric defect: server containers were still
  started with a fixed `-timeout 10m`, so the three surviving servers exited
  around minute 10 while the client expected a 30-minute run. The Docker runner
  now exposes `-ServerTimeout` and defaults it to `-ClientTimeout`, so long soak
  server lifetimes match the client run.
- The next soak attempt (`fabric-loadtest-20260516-194816`) passed the 10-minute
  server-timeout boundary but exposed another long-run behavior: a healthy
  surviving target could stay out of placement after a transient degradation
  mark. `fabric-loadtest` now uses a bounded `target_quarantine_ttl` for
  placement while still preserving historical `degraded_targets` observations
  in the report. The Docker runner exposes this as `-TargetQuarantineTTL`.
- `fabric-loadtest-20260516-200241` then exposed a soak-loop issue: it reported
  `pass` with 432869/432869 logical channels and clean accounting, but finished
  after about 95 seconds despite `config.duration=30m`. The cause was worker
  shutdown on per-stream `context deadline exceeded`; soak workers now only exit
  on the parent run context or the configured soak stop time, not on one
  channel's timeout.
- `fabric-loadtest-20260516-200939` and `fabric-loadtest-20260516-201331`
  confirmed the soak loop fix by running full 3-minute preflights, but they
  failed the zero-failed-stream gate under target-kill injection. The issue was
  policy: the known killed target re-entered placement too quickly via the
  short transient quarantine TTL, causing some channels to spend their stream
  budget on a hard-dead endpoint. `fabric-loadtest` now separates transient
  `target_quarantine_ttl` from `failure_quarantine_ttl`, and the Docker runner
  exposes `-FailureQuarantineTTL`.
- Verified 30-minute long-duration soak:
  `fabric-loadtest-20260516-202532` ran on `test-docker` for 1800.010 seconds
  with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per
  logical channel, 10-second resource and container samples, and the
  `mixed-public-nat-lan-relay` profile. It produced 15,074,556/15,074,556
  successful logical channels, 895,308,005,376 bytes, `throughput_bps=3979124146`,
  `channel_churn_per_sec=8374`, exactly 3,768,639 streams per target, all four
  QUIC route modes, `ack_p95_ms=5`, `ack_p99_ms=6`, `channel_leaks=0`,
  matching 15,074,556 channel opens/closes, `route_pressure.active_total=0`,
  458 container-stat samples, bounded memory/PID use, and verdict `pass`.
- Verified real-node host-to-host QUIC smoke:
  `home-1` ran the standalone `fabric-loadtest` client against a temporary
  QUIC server on `test-docker` at `quic://docker-test.cin.su:19443`. The run
  created 1000 short logical channels at 128 concurrency, mixed control and
  bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced
  `throughput_bps=1177445403`, `channel_churn_per_sec=2478`,
  `ack_p95_ms=12`, `ack_p99_ms=21`, `setup_latency_p95_ms=118`, zero failed
  streams, zero channel leaks, and verdict `pass`. The report is saved as
  `artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json`.
- Published and registered node-agent release `0.2.280-fabricsession` with
  Linux binary/native and Docker image artifacts. The release is intentionally
  not assigned to live node update policies yet because current live node
  workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
  mesh endpoints. Before rollout, node configs must be migrated to
  `quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
  such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
  `RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`.
- Loadtest degraded-target quarantine is observable through `degraded_targets`.
  When `-impair-target` and slow-stream migration are enabled, the verdict fails
  if no degraded target is observed or if degraded targets do not produce
  migration events. A shared-host validation run with 120 streams reported
  `degraded_targets = { impaired_target: "slow_ack" }`, 5 migration events,
  `control_ack_p95_ms=3`, and clean acquire/release accounting.
- Channel lifecycle accounting is explicit in `fabric-loadtest` through
  `channel_opens`, `channel_closes`, and `channel_leaks`. The verdict fails on
  open/close mismatch, active stream leaks, or mismatch between route-pressure
  acquire counts and QUIC stream opens.
- The next validation step is broader real mixed public/NAT/LAN topology across
  separate physical or VM hosts. The shared Docker host has verified the route
  model, stress gates, 30-minute stability, memory, goroutine, file descriptor,
  container resource, and route-pressure accounting. A true external NAT lab
  should now validate the same gates with independent NAT devices, public nodes,
  and local NAT-side cluster segments.
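
The retry-selector change noted in the list above (spreading failed-slot retries across the usable target set instead of taking the next target in ring order) can be illustrated with a small Go sketch. It assumes a least-active-channels policy with a rotating scan offset to break ties; the actual selector may use a different policy, and the `target` type and both function names are hypothetical.

```go
package main

import "fmt"

// target is a hypothetical placement candidate.
type target struct {
	name   string
	alive  bool
	active int // channels currently placed on this target
}

// nextInRing models the pre-fix behavior: the first live target after the
// failed slot, in ring order. It returns -1 if no target is alive.
func nextInRing(ts []target, failed int) int {
	for i := 1; i < len(ts); i++ {
		idx := (failed + i) % len(ts)
		if ts[idx].alive {
			return idx
		}
	}
	return -1
}

// leastLoaded picks the live target with the fewest active channels. The scan
// starts at a rotating offset so ties do not always land on the same target.
// It returns -1 if no target is alive.
func leastLoaded(ts []target, offset int) int {
	best := -1
	for i := 0; i < len(ts); i++ {
		idx := (offset + i) % len(ts)
		if !ts[idx].alive {
			continue
		}
		if best == -1 || ts[idx].active < ts[best].active {
			best = idx
		}
	}
	return best
}

func main() {
	ring := []target{
		{name: "t0", alive: false},
		{name: "t1", alive: true},
		{name: "t2", alive: true},
		{name: "t3", alive: true},
	}
	spread := append([]target(nil), ring...)

	// Re-place 100 channels whose original slot was the dead target t0.
	for retry := 0; retry < 100; retry++ {
		ring[nextInRing(ring, 0)].active++
		spread[leastLoaded(spread, retry)].active++
	}
	for i := range ring {
		fmt.Printf("%s ring=%d spread=%d\n", ring[i].name, ring[i].active, spread[i].active)
	}
}
```

With ring order every retry lands on the first live neighbor, which is the kind of skew the distribution gates flagged; the least-loaded variant keeps the surviving targets within one channel of each other.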

Initial SLO examples:

- `channel_setup_p95_ms < 200`
- `reroute_p95_ms < 1000`
- `control_latency_p99_ms < 100` under bulk load
- `packet_loss_after_recovery < 0.1%`
- `no_route_pressure_over_90_percent_when_alternatives_exist`
- `no_channel_table_growth_after_churn`
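
As a rough illustration of how the latency SLOs above could be checked against raw samples, the following Go sketch computes nearest-rank percentiles and collects verdict reasons. It is a sketch under stated assumptions, not the `fabric-loadtest` verdict code: the `report` type, the `percentile` and `evaluate` functions, the reason string format, and the sample values are hypothetical, and the nearest-rank method is an assumed choice of percentile definition.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1..100) of samples in
// milliseconds. It returns 0 for an empty sample set, so a missing metric
// passes in this sketch.
func percentile(samples []int, p int) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	// Integer form of ceil(p/100 * n), which avoids floating-point rounding.
	rank := (p*len(sorted) + 99) / 100
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// report holds hypothetical raw latency samples in milliseconds.
type report struct {
	setupMs   []int
	rerouteMs []int
	controlMs []int // control-channel latency measured under bulk load
}

// evaluate returns one reason per violated SLO; an empty result means pass.
func evaluate(r report) []string {
	var reasons []string
	check := func(name string, got, limit int) {
		if got >= limit { // the SLOs are strict "<" bounds
			reasons = append(reasons, fmt.Sprintf("%s=%d>=%d", name, got, limit))
		}
	}
	check("channel_setup_p95_ms", percentile(r.setupMs, 95), 200)
	check("reroute_p95_ms", percentile(r.rerouteMs, 95), 1000)
	check("control_latency_p99_ms", percentile(r.controlMs, 99), 100)
	return reasons
}

func main() {
	r := report{
		setupMs:   []int{12, 15, 21, 51, 118},
		rerouteMs: []int{1, 2, 2, 9},
		controlMs: []int{2, 3, 3, 140},
	}
	reasons := evaluate(r)
	if len(reasons) == 0 {
		fmt.Println("verdict: pass")
		return
	}
	fmt.Println("verdict: fail", reasons)
}
```

The non-latency examples (packet loss after recovery, route pressure ceilings, channel table growth) would need their own counters and are left out of this sketch.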