Fabric-First Transport And Stress Plan
Status: fabric-first implementation baseline is active. QUIC-only transport, route planning, runtime reroute/failover, pressure accounting, shared-host stress gates, 1000-channel load, failure/degradation gates, and a 30-minute real-byte soak are implemented and verified. Remaining work is wider real topology coverage as the cluster grows.
This project is now fabric-first. Work on service payloads, service adapter expansion, and Android VPN transport is paused until the fabric transport layer is complete and proven under real load.
Goal
The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes may have public addresses, sit behind NAT, or represent a whole local segment behind one NAT. The fabric must expose a single logical transport layer where nodes can reach each other directly, through local segment paths, through passive outbound tunnels, or through relay hops without changing the data-plane protocol.
QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and ICE-selected paths are route modes inside the same QUIC fabric, not alternative transports.
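The "route modes, not alternative transports" rule can be pictured as a closed set of labels with everything else rejected. The sketch below is illustrative only: it uses the QUIC route-mode labels and compat labels that this document names in its later sections, while `ValidateRouteMode` and `ErrQUICRequired` are hypothetical names, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// quicRouteModes holds the QUIC route-mode labels named in this document.
var quicRouteModes = map[string]bool{
	"direct_quic":  true,
	"lan_quic":     true,
	"ice_quic":     true,
	"reverse_quic": true,
	"relay_quic":   true,
}

// ErrQUICRequired is a hypothetical sentinel for non-QUIC labels.
var ErrQUICRequired = errors.New("fabric transport requires a QUIC route mode")

// ValidateRouteMode accepts only explicit QUIC route-mode labels.
// The label alone decides: a compat label such as "relay" is rejected
// no matter what scheme the endpoint string uses.
func ValidateRouteMode(label string) error {
	if quicRouteModes[label] {
		return nil
	}
	return fmt.Errorf("%w: got %q", ErrQUICRequired, label)
}

func main() {
	for _, label := range []string{"relay_quic", "relay", "wss"} {
		fmt.Printf("%-12s -> %v\n", label, ValidateRouteMode(label))
	}
}
```

Keeping the accepted set closed means a new path type has to be added as a QUIC route mode rather than slipping in as a second transport.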
The fabric must not depend on one management service for authority. API, storage, update-cache, route-coordinator, observer, and authority duties are roles inside the mesh. A reachable API endpoint can distribute signed state, but it cannot be the source of truth by itself. Nodes accept control data, configuration, route leases, update plans, and role changes only when the signatures, quorum rules, scopes, epochs, and expiry windows verify locally.
Required Fabric Behavior
- Address channels by node_id, pool_id, or service target, not by raw IP.
- Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC, relay QUIC, and future ICE-derived QUIC paths.
- Treat DNS names such as web/admin/API domains as service endpoints only, not node identity or fabric authority.
- Require node-published endpoint candidates to include explicit host:port, reachability, connectivity mode, NAT/local-segment metadata, source, and freshness.
- Prefer local segment paths for nodes that share a NAT/local network.
- Keep outbound passive QUIC control/data adjacencies from NATed nodes to reachable public or relay nodes.
- Build logical channels over shared QUIC adjacencies instead of opening one physical QUIC connection per channel.
- Maintain primary, warm standby, and fallback route sets per channel.
- Rebuild a channel when an intermediate hop fails.
- Switch to another pool member when the target is a pool and the current endpoint fails.
- Reroute slow channels when a faster path exists and the reroute will not harm aggregate fabric throughput.
- Spread channels across available routes so the shortest path is not saturated while other nodes are idle.
- Isolate channels with per-channel flow control, traffic classes, backpressure, quotas, and fairness scheduling.
- Report per-node, per-link, per-route, and per-channel load and failure causes.
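One way to read the "spread channels" and "reroute without harming aggregate throughput" requirements is as a cost function that mixes latency with current load. The following is a minimal sketch under that assumption; the `Route` fields, the cost formula, and `PickRoute` are hypothetical and are not taken from the fabric implementation.

```go
package main

import "fmt"

// Route is a hypothetical view of one candidate path in a channel's route set.
type Route struct {
	ID             string
	RTTMillis      float64 // recent measured round-trip time
	ActiveChannels int     // logical channels currently bound to this route
	Capacity       int     // assumed soft limit on concurrent channels
	Healthy        bool
}

// cost scales latency by utilization so a short path that is filling up
// eventually loses to a slightly longer path that is idle.
func cost(r Route) float64 {
	utilization := float64(r.ActiveChannels) / float64(r.Capacity)
	return r.RTTMillis * (1 + utilization)
}

// PickRoute returns the index of the healthy route with the lowest cost,
// or -1 when no healthy route has spare capacity.
func PickRoute(routes []Route) int {
	best := -1
	for i, r := range routes {
		if !r.Healthy || r.Capacity <= 0 || r.ActiveChannels >= r.Capacity {
			continue
		}
		if best == -1 || cost(r) < cost(routes[best]) {
			best = i
		}
	}
	return best
}

func main() {
	routes := []Route{
		{ID: "direct", RTTMillis: 10, Capacity: 4, Healthy: true},
		{ID: "relay", RTTMillis: 14, Capacity: 4, Healthy: true},
	}
	// Open six channels and watch them spread across both routes.
	for ch := 0; ch < 6; ch++ {
		i := PickRoute(routes)
		if i < 0 {
			fmt.Printf("channel %d rejected: no capacity\n", ch)
			continue
		}
		routes[i].ActiveChannels++
		fmt.Printf("channel %d -> %s\n", ch, routes[i].ID)
	}
}
```

A production scheduler would also weigh traffic class, quotas, and fairness; the point here is only that the shortest path stops winning once it carries enough load.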
Service Channel Boundary
The fabric is the only component that builds and maintains transport channels. VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a transport concern, open fallback transports, or implement route repair.
Every service starts by submitting a fabric service channel request:
{
  "schema_version": "rap.fabric_service_channel_request.v1",
  "channel_id": "vpn-session-or-service-session-id",
  "source_role": "vpn-client | rdp-client | service-adapter",
  "service_class": "vpn_packets | rdp | ssh | file_transfer | web",
  "target": {
    "kind": "pool",
    "pool_ids": ["home-ipv4"],
    "service_role": "ipv4-egress"
  },
  "traffic": {
    "mode": "duplex",
    "application_protocol_agnostic": true,
    "flow_distribution": "latency_and_load_aware"
  },
  "resilience": {
    "min_active_paths": 1,
    "warm_standby_paths": 1,
    "failover": "pool_member_or_next_authorized_pool",
    "reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
  }
}
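For adapter authors, the request above maps naturally onto typed Go structs. The sketch below mirrors the JSON field names exactly; the Go type names, `ParseRequest`, and the sanity checks inside it are illustrative assumptions, since this document does not specify the fabric's real acceptance rules.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Field tags mirror the JSON request; the type names are illustrative.
type Target struct {
	Kind        string   `json:"kind"`
	PoolIDs     []string `json:"pool_ids"`
	ServiceRole string   `json:"service_role"`
}

type Traffic struct {
	Mode                        string `json:"mode"`
	ApplicationProtocolAgnostic bool   `json:"application_protocol_agnostic"`
	FlowDistribution            string `json:"flow_distribution"`
}

type Resilience struct {
	MinActivePaths   int      `json:"min_active_paths"`
	WarmStandbyPaths int      `json:"warm_standby_paths"`
	Failover         string   `json:"failover"`
	RerouteOn        []string `json:"reroute_on"`
}

type ServiceChannelRequest struct {
	SchemaVersion string     `json:"schema_version"`
	ChannelID     string     `json:"channel_id"`
	SourceRole    string     `json:"source_role"`
	ServiceClass  string     `json:"service_class"`
	Target        Target     `json:"target"`
	Traffic       Traffic    `json:"traffic"`
	Resilience    Resilience `json:"resilience"`
}

const requestSchemaV1 = "rap.fabric_service_channel_request.v1"

// ParseRequest decodes a request and applies a few assumed sanity checks.
func ParseRequest(data []byte) (ServiceChannelRequest, error) {
	var req ServiceChannelRequest
	if err := json.Unmarshal(data, &req); err != nil {
		return req, err
	}
	switch {
	case req.SchemaVersion != requestSchemaV1:
		return req, fmt.Errorf("unsupported schema_version %q", req.SchemaVersion)
	case req.ChannelID == "":
		return req, errors.New("channel_id is required")
	case req.Target.Kind == "pool" && len(req.Target.PoolIDs) == 0:
		return req, errors.New("pool target needs at least one pool_id")
	case req.Resilience.MinActivePaths < 1:
		return req, errors.New("min_active_paths must be at least 1")
	}
	return req, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "rap.fabric_service_channel_request.v1",
	  "channel_id": "vpn-session-1",
	  "source_role": "vpn-client",
	  "service_class": "vpn_packets",
	  "target": {"kind": "pool", "pool_ids": ["home-ipv4"], "service_role": "ipv4-egress"},
	  "traffic": {"mode": "duplex", "application_protocol_agnostic": true},
	  "resilience": {"min_active_paths": 1, "warm_standby_paths": 1, "reroute_on": ["route_failure"]}
	}`)
	req, err := ParseRequest(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("channel %s -> pools %v, reroute on %v\n",
		req.ChannelID, req.Target.PoolIDs, req.Resilience.RerouteOn)
}
```

Note that the request carries no endpoint, address, or transport field, which keeps route selection out of the service's hands.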
The fabric responds with a signed route bundle containing a short-lived
rap.fabric_route_lease.v1. The lease names the target pool, primary path,
warm standby paths, multipath hints, and rebuild policy. Physical endpoint
candidates are visible only to the fabric runtime as lease material; service
adapters do not rank, pin, or fail over endpoints themselves. A service adapter
receives only a duplex channel handle and service metadata:
- Android VPN: TUN packet reader/writer only.
- ipv4-egress: NAT/ordinary IPv4 exit only.
- RDP: protocol/session adapter only; server address, protocol, credentials, rendering, and clipboard are RDP service metadata, not fabric routing.
Temporary compatibility fields such as exit_candidates may exist only inside
the fabric route bundle consumed by the fabric runtime. Service code must treat
them as opaque and must not schedule routes from them.
The VPN client runtime accepts only fabric_service_channel_request plus
fabric_route_bundle.route_lease. The Android service may keep a deprecated
diagnostic endpoint cache, but packet routing must come from the lease. If a
path fails, slows down, or its target pool member dies, the fabric lease/rebuild
policy is the authority; the VPN service continues writing packets to the
channel and does not switch protocols.
Distributed Authority Requirements
- No single control-plane/API/storage/update node can mutate the cluster alone.
- Cluster root and high-risk role changes require threshold signatures from authorized control-authority keys.
- Update releases require signed metadata, signed artifact hashes, compatibility constraints, rollout scope, and rollback windows; mirrors may serve bytes but cannot change what is trusted.
- Route leases, relay leases, rendezvous assignments, peer-directory epochs, and endpoint candidate epochs are signed and short-lived.
- Nodes cache the last valid signed state and continue routing through peers, relay fallbacks, and passive reverse channels when API replicas are down.
- A compromised replica may delay or omit data, but must not be able to forge role assignment, route authority, update authority, storage placement, or node ownership.
- Development database_signer mode is not production authority. Production acceptance requires quorum-signed envelopes for node join, role mutation, mesh config, route leases, update plans, and release metadata.
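The local acceptance rule (signatures, quorum, epoch, expiry) can be sketched with Ed25519 from the Go standard library. Everything here is an assumption made for illustration: the `Envelope` layout, the byte encoding that gets signed, the choice of Ed25519, and the rule that an equal epoch is still acceptable. The document does not specify the real envelope format or signature scheme.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"time"
)

// Envelope is a hypothetical signed-state container.
type Envelope struct {
	Payload    []byte
	Epoch      uint64
	NotAfter   time.Time
	Signatures map[string][]byte // authority key ID -> signature
}

// signedBytes binds epoch and expiry to the payload so neither can be
// changed without invalidating the signatures.
func signedBytes(e Envelope) []byte {
	header := fmt.Sprintf("%d|%d|", e.Epoch, e.NotAfter.Unix())
	return append([]byte(header), e.Payload...)
}

// Verify accepts the envelope only if it is not stale, not expired, and
// carries at least `threshold` valid signatures from known authorities.
func Verify(e Envelope, authorities map[string]ed25519.PublicKey,
	threshold int, lastEpoch uint64, now time.Time) error {
	if e.Epoch < lastEpoch {
		return errors.New("stale epoch")
	}
	if !now.Before(e.NotAfter) {
		return errors.New("envelope expired")
	}
	msg := signedBytes(e)
	valid := 0
	for keyID, sig := range e.Signatures { // map keys: one vote per authority
		pub, known := authorities[keyID]
		if known && len(pub) == ed25519.PublicKeySize && ed25519.Verify(pub, msg, sig) {
			valid++
		}
	}
	if valid < threshold {
		return fmt.Errorf("quorum not met: %d valid signatures, need %d", valid, threshold)
	}
	return nil
}

// demoSetup creates three throwaway authorities and an envelope signed
// by the first `signers` of them.
func demoSetup(signers int) (Envelope, map[string]ed25519.PublicKey, time.Time) {
	now := time.Now()
	env := Envelope{
		Payload:    []byte(`{"role":"relay"}`),
		Epoch:      7,
		NotAfter:   now.Add(time.Minute),
		Signatures: map[string][]byte{},
	}
	authorities := map[string]ed25519.PublicKey{}
	for i := 0; i < 3; i++ {
		pub, priv, err := ed25519.GenerateKey(rand.Reader)
		if err != nil {
			panic(err)
		}
		id := fmt.Sprintf("authority-%d", i)
		authorities[id] = pub
		if i < signers {
			env.Signatures[id] = ed25519.Sign(priv, signedBytes(env))
		}
	}
	return env, authorities, now
}

func main() {
	env, authorities, now := demoSetup(2)
	fmt.Println("2-of-3 quorum:", Verify(env, authorities, 2, 0, now))
	fmt.Println("3-of-3 quorum:", Verify(env, authorities, 3, 0, now))
}
```

Because the check needs only the cached authority keys and the local clock, a node can keep rejecting forged state while every API replica is unreachable.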
Implementation Layers
- Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse tunnels, relay candidates.
- Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity, health, and pressure metrics.
- Routing layer: latency-aware and load-aware route sets with relay fallback and pool failover.
- Channel layer: millions of logical channels with independent lifecycle, flow control, and statistics.
Stress Requirements
Ping tests are not sufficient for acceptance. The fabric must pass real byte-transfer load:
- 1000 concurrent streams from different source nodes to different destination nodes.
- Mixed long-lived and short-lived channels.
- Aggressive create/delete churn.
- Many-to-one, one-to-many, and many-to-many traffic.
- Direct, LAN, relay, multi-hop, and reverse tunnel paths.
- Endpoint pool failover under load.
- Intermediate relay/node failure and route rebuild under load.
- Induced latency, packet loss, bandwidth caps, and route saturation.
- Control/interactive traffic surviving bulk traffic.
- No sustained overload of one path when alternatives exist.
- No goroutine, memory, stream, or file descriptor leak after churn.
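The leak requirement in the last item can be checked by sampling process resources before and after churn and comparing the drift against a budget. This is a minimal sketch using the Go runtime package; `ResourceSample`, `Budget`, and `DriftReasons` are hypothetical names, and only goroutines and heap are covered here.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// ResourceSample is a hypothetical snapshot of process resources.
type ResourceSample struct {
	Goroutines     int
	HeapAllocBytes uint64
}

// TakeSample forces a GC first so heap numbers reflect live objects only.
func TakeSample() ResourceSample {
	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return ResourceSample{Goroutines: runtime.NumGoroutine(), HeapAllocBytes: m.HeapAlloc}
}

// Budget holds the allowed growth between the first and last sample.
type Budget struct {
	MaxGoroutineDelta int
	MaxHeapDeltaBytes int64
}

// DriftReasons returns one reason per exceeded budget; empty means pass.
func DriftReasons(start, end ResourceSample, b Budget) []string {
	var reasons []string
	if d := end.Goroutines - start.Goroutines; d > b.MaxGoroutineDelta {
		reasons = append(reasons, fmt.Sprintf("goroutines_delta=%d>%d", d, b.MaxGoroutineDelta))
	}
	if d := int64(end.HeapAllocBytes) - int64(start.HeapAllocBytes); d > b.MaxHeapDeltaBytes {
		reasons = append(reasons, fmt.Sprintf("heap_delta_bytes=%d>%d", d, b.MaxHeapDeltaBytes))
	}
	return reasons
}

func main() {
	start := TakeSample()

	// Simulated churn: many short-lived workers that all exit.
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = make([]byte, 4096)
		}()
	}
	wg.Wait()

	end := TakeSample()
	reasons := DriftReasons(start, end, Budget{MaxGoroutineDelta: 64, MaxHeapDeltaBytes: 64 << 20})
	fmt.Printf("goroutines %d -> %d, drift reasons: %v\n", start.Goroutines, end.Goroutines, reasons)
}
```

A negative delta is treated as fine, since resources can legitimately shrink after warm-up.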
Required Stress Report
Every stress run must produce machine-readable JSON with:
- topology and scenario profile;
- channel setup/teardown counts and latency;
- total and per-channel throughput;
- per-node and per-route capacity pressure;
- p50/p95/p99 latency where measured;
- backpressure, rejection, and queue-depth counters;
- route switch and failover events;
- target pool failover events;
- QUIC connection and logical channel counts;
- final pass/fail verdict against SLO thresholds.
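As a rough illustration of the last two kinds of item (latency percentiles and an SLO verdict), the sketch below computes nearest-rank percentiles and derives a pass/fail result with reasons. The key names failed_streams, channel_leaks, ack_p95_ms, ack_p99_ms, and verdict_reasons appear elsewhere in this document; the `verdict` key, the `Report` layout, the `SLO` struct, and the nearest-rank method are assumptions, not the harness's actual schema.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Percentile returns the nearest-rank percentile (1..100) of samples.
func Percentile(samples []int, p int) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// Report is an assumed subset of a stress report.
type Report struct {
	FailedStreams  int      `json:"failed_streams"`
	ChannelLeaks   int      `json:"channel_leaks"`
	AckP95Ms       int      `json:"ack_p95_ms"`
	AckP99Ms       int      `json:"ack_p99_ms"`
	Verdict        string   `json:"verdict"`
	VerdictReasons []string `json:"verdict_reasons,omitempty"`
}

// SLO thresholds; zero disables a latency gate.
type SLO struct {
	MaxAckP95Ms int
	MaxAckP99Ms int
}

// BuildReport fills the percentiles and evaluates the verdict.
func BuildReport(ackMs []int, failedStreams, channelLeaks int, slo SLO) Report {
	r := Report{
		FailedStreams: failedStreams,
		ChannelLeaks:  channelLeaks,
		AckP95Ms:      Percentile(ackMs, 95),
		AckP99Ms:      Percentile(ackMs, 99),
	}
	if r.FailedStreams > 0 {
		r.VerdictReasons = append(r.VerdictReasons, fmt.Sprintf("failed_streams=%d", r.FailedStreams))
	}
	if r.ChannelLeaks != 0 {
		r.VerdictReasons = append(r.VerdictReasons, fmt.Sprintf("channel_leaks=%d", r.ChannelLeaks))
	}
	if slo.MaxAckP95Ms > 0 && r.AckP95Ms > slo.MaxAckP95Ms {
		r.VerdictReasons = append(r.VerdictReasons, fmt.Sprintf("ack_p95_ms=%d>%d", r.AckP95Ms, slo.MaxAckP95Ms))
	}
	if slo.MaxAckP99Ms > 0 && r.AckP99Ms > slo.MaxAckP99Ms {
		r.VerdictReasons = append(r.VerdictReasons, fmt.Sprintf("ack_p99_ms=%d>%d", r.AckP99Ms, slo.MaxAckP99Ms))
	}
	r.Verdict = "pass"
	if len(r.VerdictReasons) > 0 {
		r.Verdict = "fail"
	}
	return r
}

func main() {
	ack := []int{1, 2, 2, 3, 3, 4, 5, 6, 8, 40}
	r := BuildReport(ack, 0, 0, SLO{MaxAckP95Ms: 30, MaxAckP99Ms: 50})
	out, err := json.MarshalIndent(r, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Carrying the reasons alongside the verdict lets a failing run explain itself without a second pass over the raw samples.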
The first executable harness is agents/rap-node-agent/cmd/fabric-loadtest.
It supports in-process multi-node QUIC targets, short logical channel churn,
pool failover, target failure injection, and JSON reports.
Example local pool-failover run:
go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s
The local harness is not a replacement for distributed host testing. It is the first acceptance gate for protocol limits, channel lifecycle churn, pool failover semantics, and reporting shape before running the same workload across the shared test Docker host.
Distributed shared-host smoke:
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms
The distributed smoke builds/runs separate server and client containers on the shared Docker host, sends real QUIC fabric frames across the Docker network, kills one target node during load, and expects all channels assigned to that target to fail over to the remaining pool.
The smoke summary includes the strict loadtest verdict plus route_pressure
and transport_snapshot; the script fails when the client verdict is not
pass and carries verdict_reasons into the thrown error.
-TuneUdpBuffers applies runtime host sysctls through a privileged one-shot
container before the run and records the observed values in the summary:
net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, and
net.core.wmem_default.
Degraded-target and latency-aware admission run:
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80
This applies tc netem to one target, probes every target before mass channel
placement, excludes targets above the RTT threshold, and reports per-target
setup/duration percentiles. This is the first executable gate for
latency-aware placement; live channel migration after mid-stream degradation is
the next routing-layer gate.
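The probe-then-place step described above can be sketched as a filter followed by a simple distribution. The types and function names below are hypothetical, and round-robin placement is an assumption chosen for brevity; the harness may distribute channels differently.

```go
package main

import "fmt"

// Probe is a hypothetical result of probing one target before placement.
type Probe struct {
	Target    string
	Reachable bool
	RTTMs     int
}

// AdmitTargets splits targets into those at or below the RTT threshold
// and those excluded (unreachable or too slow).
func AdmitTargets(probes []Probe, maxRTTMs int) (admitted, excluded []string) {
	for _, p := range probes {
		if p.Reachable && p.RTTMs <= maxRTTMs {
			admitted = append(admitted, p.Target)
		} else {
			excluded = append(excluded, p.Target)
		}
	}
	return admitted, excluded
}

// Place spreads streams round-robin over admitted targets and returns
// the per-target stream count. No admitted targets means no placement.
func Place(streams int, admitted []string) map[string]int {
	counts := make(map[string]int)
	if len(admitted) == 0 {
		return counts
	}
	for i := 0; i < streams; i++ {
		counts[admitted[i%len(admitted)]]++
	}
	return counts
}

func main() {
	// Target 0 is impaired with roughly 100 ms of added delay.
	probes := []Probe{
		{Target: "target-0", Reachable: true, RTTMs: 104},
		{Target: "target-1", Reachable: true, RTTMs: 2},
		{Target: "target-2", Reachable: true, RTTMs: 3},
		{Target: "target-3", Reachable: true, RTTMs: 2},
	}
	admitted, excluded := AdmitTargets(probes, 80)
	fmt.Println("admitted:", admitted, "excluded:", excluded)
	fmt.Println("placement:", Place(300, admitted))
}
```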
Mid-stream migration gate:
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30
This starts channels normally, applies tc netem after traffic is already in
flight, and expects slow logical streams to continue their remaining bytes on a
different target. The report exposes migration_events, max_ack_ms,
ack_p95_ms, ack_p99_ms, route_attempts_total, reroute_causes, and
per-target stats.
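The expected behavior, a stream finishing its remaining bytes on another target once ACKs turn slow, can be simulated without any network. In this sketch `AckFunc` stands in for a real send-and-wait-for-ACK call, and the "move to the next target" policy is a placeholder; all names are hypothetical and the real runtime picks the next route from its route set.

```go
package main

import "fmt"

// AckFunc simulates sending one frame to a target and returns the
// observed ACK latency in milliseconds.
type AckFunc func(target, frame int) int

// Migration records one slow-ACK switch.
type Migration struct {
	Frame int
	From  int
	To    int
	AckMs int
}

// SendWithMigration sends totalBytes in frames. When an ACK exceeds
// maxAckMs, the frame still counts as delivered (it was acknowledged),
// but the remaining bytes continue on the next target.
func SendWithMigration(totalBytes, frameSize, targets, maxAckMs int, ack AckFunc) ([]Migration, []int) {
	if frameSize <= 0 || targets <= 0 {
		return nil, nil
	}
	bytesPerTarget := make([]int, targets)
	var events []Migration
	current, frame := 0, 0
	for offset := 0; offset < totalBytes; frame++ {
		n := frameSize
		if remaining := totalBytes - offset; remaining < n {
			n = remaining
		}
		ackMs := ack(current, frame)
		bytesPerTarget[current] += n
		offset += n
		if ackMs > maxAckMs && targets > 1 {
			next := (current + 1) % targets
			events = append(events, Migration{Frame: frame, From: current, To: next, AckMs: ackMs})
			current = next
		}
	}
	return events, bytesPerTarget
}

func main() {
	// Target 0 becomes slow after its second frame, as if netem were
	// applied while traffic is already in flight.
	ack := func(target, frame int) int {
		if target == 0 && frame >= 2 {
			return 200
		}
		return 5
	}
	events, bytes := SendWithMigration(10*65536, 65536, 4, 30, ack)
	fmt.Printf("migrations: %+v\nbytes per target: %v\n", events, bytes)
}
```

A real runtime would add hysteresis so a single slow ACK on a noisy path does not cause repeated switching.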
Production fabric-core migration boundary:
- FabricChannelRouter opens channels on the best route from a FabricRouteSet.
- Live FabricChannelObservation values update counters and trigger reroute on route failure, ACK latency threshold, or capacity pressure.
- Reroutes switch route binding and pool target where applicable, increment RerouteCount, and emit FabricChannelRouteEvent.
- MinRerouteInterval provides hysteresis so a noisy path does not cause route flapping.
- FabricChannelRuntime binds the router to live QUIC fabric sessions for reliable byte payloads: it opens the logical stream, sends frames, measures ACK latency, reports observations to the router, and continues remaining payloads on a rerouted QUIC route after connect failure or slow ACKs.
- QUIC logical session close cancels the stream read side before closing the write side, so high-churn short sessions release reader goroutines promptly instead of waiting for stream read deadlines.
- Server-side QUIC stream handlers close their write side when the handler exits. This returns QUIC stream credit promptly during high-churn short sessions and prevents the last worker window from stalling on stream open.
- Production request/response forwarding now builds a FabricRouteSet from all QUIC endpoint candidates for the next hop, sends the envelope over the chosen QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect failure or response timeout.
- The compat HTTP production forward carrier has been removed from the mesh runtime API. Production forwarding now exposes a single QUIC transport implementation; HTTP handlers remain only as node-local API surfaces and test harness entry points.
- Production route choice includes live per-route active-channel pressure, so concurrent forwarding requests can spread across equivalent QUIC candidates instead of concentrating on the first/shortest route until it is saturated.
- Production forwarding also keeps per-route health quarantine. A QUIC route that fails connect or response is marked unhealthy for a bounded retry window, skipped by subsequent channel scheduling, exposed in route-health snapshots, and restored automatically after the retry window or a successful send.
- FabricRoutePressureTracker provides shared active-channel accounting for both production request/response forwarding and bulk FabricChannelRuntime traffic, so different traffic surfaces can make route decisions against the same live load signal.
- Route pressure is observable through FabricRoutePressureSnapshot, including current active channels, max active channels, total acquire/release counts, and last acquired/released route IDs. Bulk runtime results and production QUIC forwarding snapshots expose this data for stress reports.
- fabric-loadtest reports route IDs per stream attempt, global route_pressure, and per-target max_active_channels, so stress runs can verify channel distribution and release accounting after churn.
- FabricRouteSetForPeerEndpointCandidates converts QUIC endpoint candidates into production route sets for direct, LAN, ICE/STUN-derived, reverse outbound, and relay fallback modes. Non-QUIC candidates are rejected instead of becoming alternate transports.
- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat instead of collapsing to one address: operator/public QUIC, listener QUIC, LAN/interface QUIC, STUN reflexive ice_quic, reverse/outbound-only, and relay_quic fallback. Candidate metadata carries locality_group_id, nat_group_id, stun_server, ice_foundation, relay_node_id, and relay_endpoint when configured. When a relay endpoint is the first physical QUIC hop, its advertised certificate fingerprint must survive route planning so public-IP relay paths can verify the relay node by pin instead of falling back to hostname/IP SAN matching.
- Endpoint candidate scoring is QUIC-mode only. It ranks direct_quic, lan_quic, ice_quic, reverse_quic, and relay_quic using freshness, health observations, latency, reliability, region, policy tags, and live capacity pressure; HTTP/WebSocket labels are treated as rejected compat candidates rather than alternate transports.
- FabricTransportForTarget no longer accepts a WebSocket carrier. Transport selection can return only QUICFabricTransport; unsupported labels fail with a QUIC-required error.
- Explicit transport labels are authoritative. A compat label such as relay or outbound_reverse is rejected even when the endpoint string uses a quic:// scheme; configs must use relay_quic and reverse_quic.
- Node-agent config loading rejects compat advertised transport labels and HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay fabric endpoints. Bad endpoint posture fails before heartbeat publication.
- Host-agent install/runtime validation rejects compat mesh advertise transport labels and HTTP/WebSocket advertise endpoints before they can be passed into a node-agent Docker runtime.
- JSON-advertised endpoint candidates and scoped synthetic config route recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery seeds, and rendezvous leases reject compat transport labels and HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping entries.
- Rendezvous relay leases and peer-connection intents now use relay_quic as the transport label. relay_control remains only a telemetry/control-state name for rendezvous admission counters, not a data-plane transport option.
- Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint candidates are probed with QUIC session setup, pinned certificate metadata is honored, and HTTP/WebSocket endpoint schemes are rejected instead of being used as peer health transport.
- Node-agent synthetic runtime no longer installs an HTTP peer transport as an inter-node carrier, and the shared mesh runtime package no longer exports an HTTP peer transport implementation. Any HTTP synthetic motion is confined to explicit compat smoke harness code while fabric acceptance uses QUIC loadtest gates.
- Control-plane and debug JSON mesh config loading is validated after conversion into runtime structures. Peer endpoint candidates, recovery seeds, rendezvous leases, and selected relay endpoints in route decisions must use QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject compat peer_endpoints directly, in addition to QUIC-only checks for endpoint candidates, recovery seeds, and rendezvous leases.
- The old fabric-session WebSocket endpoint is no longer exposed by FabricSessionEnabled alone. It requires an explicit compat test harness flag and is not part of the node-agent fabric transport surface.
- Same local segment or same NAT group is treated as a LAN route by the planner, so a whole cluster piece behind one NAT can prefer private addresses between its own nodes while still maintaining outbound/relay visibility to the rest of the fabric.
- Heartbeat telemetry includes fabric_runtime_report with QUIC-only status, route-set counts, QUIC candidate totals, rejected compat/non-QUIC candidate totals by transport label, route pressure, QUIC listener state, goroutines, heap usage, and the next recommended soak gate.
- FabricOverlayTransport is the generic service-neutral send facade over route sets, FabricChannelRuntime, shared route pressure, and QUIC sessions. New traffic classes should enter the fabric through this layer or an equivalent runtime integration, not through HTTP/WebSocket fallbacks.
- FabricChannelRuntime uses the same route health quarantine as production forwarding. Connect failures, stream send failures, and missing ACKs mark a route unhealthy for a bounded retry window, so later channels for any traffic class avoid that route until it recovers.
- FabricOverlayTransport exposes route pressure and route health snapshots, and node heartbeat runtime metadata reports production route health plus the current quarantined route count.
- Scheduler resource guardrails include HardMaxRoutePressure: when enabled, a route whose projected active-channel pressure exceeds the threshold is not admitted. This makes overload prevention enforceable in route choice rather than only observable after the fact.
- The loadtest verdict fails on route-pressure leaks, acquire/release mismatch, missing acquire accounting, active channels above configured concurrency, or target distribution collapse/skew when multiple targets are healthy.
- Continuous soak aggregation is bounded: fabric-loadtest keeps exact counters, per-target totals, route-mode counts, error/reroute totals, and bounded latency samples, while stream_samples is capped to diagnostic examples. Long 30-120 minute runs should not retain one result object per logical channel.
- fabric-loadtest also keeps bounded error_samples, so high-volume churn reports preserve representative failed logical channels even when the first retained stream_samples are all successful.
- Mixed topology verdicts require route-mode coverage when at least four healthy targets are present. A mixed-public-nat-lan-relay or nat-lan-relay run fails if it does not exercise lan_quic, ice_quic, reverse_quic, and relay_quic.
- Loadtest verdicts also fail on compat route-mode labels. Seeing relay, outbound_reverse, direct_http, direct_https, direct_tcp_tls, ws, wss, or websocket in route-mode telemetry is treated as a transport-layer violation even if payload delivery succeeds.
- Healthy multi-target verdicts check both stream distribution and byte distribution. This prevents a run from passing with equal channel counts but most bulk bytes concentrated on one target or route.
- Healthy multi-target verdicts also check route-pressure distribution through per-route max_active values. A run fails if live concurrent channel load collapses onto one target/route while alternatives are healthy.
- Successful logical channels must receive one ACK per transmitted data frame. fabric-loadtest reports ack_mismatched_streams, per-target acks_received, and fails verdict when any stream is marked successful with fewer ACKs than sent frames.
- ACK payloads carry the SHA-256 checksum of the received data-frame payload. fabric-loadtest validates the checksum for every ACK and fails verdict with ack_integrity_errors when the acknowledged bytes do not match the sent payload.
- Failover accounting separates abandoned_frames from true ACK mismatch. A frame sent on a route that dies before ACK is counted as abandoned and the unacknowledged byte range is retransmitted on the next pool member; verdict still fails when non-abandoned frames are missing ACKs.
- Loadtest data frames use deterministic per-frame payloads derived from stream index, logical stream ID, sequence, and byte offset. This makes checksum ACKs validate each frame identity instead of repeatedly validating one shared buffer pattern.
- Mixed bulk/control stress is supported with -control-every, -control-bytes-per-stream, and -max-control-ack-p95-ms. Reports include control_streams, bulk_streams, control_ack_p95_ms, and bulk_ack_p95_ms; verdict fails when control ACK p95 exceeds the configured SLO.
- Verified shared-host mixed smoke: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. The run produced 40/40 successful streams, 8 control streams, control_ack_p95_ms=1, bulk_ack_p95_ms=2, route_pressure.active_total=0, and matching acquire/release counts.
- Verified shared-host mixed failover stress: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. Latest run fabric-loadtest-20260516-160751 produced 1000/1000 successful streams, 250 failover events after the planned target kill, 50 control streams, control_ack_p95_ms=3, bulk_ack_p95_ms=6, ack_p95_ms=6, ack_p99_ms=8, route_attempts_total=1250, route_pressure.active_total=0, max_active_total=128, and matching acquire/release counts. Full JSON artifacts are written under artifacts/fabric-loadtest.
- Verified shared-host mixed degradation/migration stress: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. The run produced 200/200 successful streams, 9 migration events, 20 control streams, control_ack_p95_ms=2, bulk_ack_p95_ms=7, route_pressure.active_total=0, max_active_total=32, and matching acquire/release counts.
- Latest shared-host degradation/migration gate:
fabric-loadtest-20260516-160710 with 160 streams, 32 concurrency, 4 MiB bulk streams, 180 ms + 0.5% induced impairment after 50 ms produced 160/160 successful streams, 12 slow-ACK migrations, degraded-target quarantine, control_ack_p95_ms=3, bulk_ack_p95_ms=180, route_pressure.active_total=0, max_active_total=32, and matching acquire/release counts.
- Short shared-host soak gate: fabric-loadtest-20260516-160943 with -Duration 45s, 1200 streams, 96 concurrency, four healthy targets, and mixed control/bulk traffic produced 1200/1200 successful streams, even 300/300/300/300 target distribution, channel_opens=1200, channel_closes=1200, channel_leaks=0, control_ack_p95_ms=4, ack_p95_ms=5, ack_p99_ms=8, route_pressure.active_total=0, max_active_total=96, and matching acquire/release counts.
- Continuous soak mode is now explicit: add -Soak -Duration 30m or -Soak -Duration 120m to the Docker runner. In soak mode workers keep creating and closing logical channels until the duration expires, instead of stopping after a fixed stream list. This is the required gate for memory, goroutine, file descriptor, QUIC stream, and route-pressure stability.
- Soak duration stops new logical channel creation but does not cancel channels already in flight. In-flight channels drain under their per-channel -StreamTimeout; the outer -ClientTimeout remains the hard scenario guardrail. This prevents the final active window from being counted as failed streams just because the soak timer expired.
- Recommended real-topology soak command: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100.
- Soak reports include
resource_samples and resource_summary with goroutine start/end/max/delta, heap allocation start/end/max/delta, heap objects, open file descriptor start/end/max/delta, GC delta, max active QUIC streams, and max active route load. Optional verdict gates -MaxGoroutineDelta and -MaxHeapDeltaMB fail the run if resource drift exceeds the configured budget.
- Optional file descriptor verdict gates -MaxOpenFDDelta and -MaxOpenFDs are passed through the Docker runner to fabric-loadtest as -max-open-fd-delta and -max-open-fds. On Linux containers these read /proc/self/fd and fail the run if descriptor count drifts or peaks beyond the configured budget.
- Optional throughput SLO gate -MinThroughputMbps is passed through the Docker runner to fabric-loadtest as -min-throughput-mbps. It fails verdict when useful data-plane throughput falls below the configured Mbps floor.
- Optional short-session churn SLO gate -MinChannelChurnPerSec is passed through the Docker runner to fabric-loadtest as -min-channel-churn-per-sec. It fails verdict when logical channel open/close throughput falls below the configured channel-per-second floor.
- Each logical channel has a per-channel timeout through -StreamTimeout in the Docker runner and -stream-timeout in fabric-loadtest. This keeps a wedged channel from holding a worker slot until the whole client run times out, preserving channel isolation under churn.
- Each data frame has an ACK timeout through -AckTimeout in the Docker runner and -ack-timeout in fabric-loadtest. A missing ACK triggers reroute/pool retry without waiting for the full channel timeout.
- Optional overall ACK latency gates
-MaxAckP95Ms and -MaxAckP99Ms are passed through the Docker runner to fabric-loadtest as -max-ack-p95-ms and -max-ack-p99-ms. They fail healthy runs when aggregate data-plane ACK latency exceeds the configured SLO, independently of slow-route migration thresholds.
- Optional per-target ACK latency gate -MaxTargetAckMs is passed through the Docker runner to fabric-loadtest as -max-target-ack-ms. It fails healthy runs when any target route reports a target_stats[*].max_ack_ms above the configured SLO.
- Optional channel setup latency gates -MaxSetupP95Ms and -MaxSetupP99Ms are passed through the Docker runner to fabric-loadtest as -max-setup-p95-ms and -max-setup-p99-ms. They fail healthy runs when logical channel open/setup latency exceeds the configured SLO before payload transfer starts.
- Optional reroute latency gates -MaxRerouteP95Ms and -MaxRerouteP99Ms are passed through the Docker runner to fabric-loadtest as -max-reroute-p95-ms and -max-reroute-p99-ms. They measure repeat channel setup latency after pool failover or slow-route migration and fail the run when route rebuild exceeds the configured SLO.
- Docker shared-host summaries also include container_stats from docker stats --no-stream for each fabric server/client container that is still running at the end of the scenario. This records CPU percent, memory usage, memory percent, network IO, block IO, and PID count per node before cleanup.
- Long soak runs can add -ContainerStatsSampleInterval 10s to collect periodic Docker container stats while traffic is in flight. The runner writes samples to container_stats_samples_path, includes container_stats_samples_count and container_stats_sample_summary, and records per-container memory/PID start, end, max, and delta values.
- Optional container resource verdict gates -MaxContainerMemoryMiB and -MaxContainerPids fail the Docker scenario when any running fabric container exceeds the configured memory or PID budget at the final snapshot or at any periodic sample peak.
- Verified short continuous soak:
fabric-loadtest-20260516-163206 used -Soak -Duration 20s, mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk traffic. It produced 4000/4000 successful logical channels, channel_opens=4035, channel_closes=4035, channel_leaks=0, route_pressure.active_total=0, max_active_total=32, control_ack_p95_ms=2, ack_p95_ms=4, resource sample count 12, goroutine delta -18, max active streams 32, max active route load 32, and matching acquire/release counts.
- Verified 60-second high-churn continuous soak with graceful drain: fabric-loadtest-20260516-174505 rebuilt the Docker image after changing soak duration to stop generation and let in-flight channels drain. The 4-node mixed-topology run used 128 concurrency, -Duration 60s, -StreamTimeout 15s, periodic resource/container sampling, mixed control/bulk traffic, throughput and churn SLOs. It produced 438740/438740 successful logical channels, channel_churn_per_sec=7310, throughput_bps=3473632858, ack_p95_ms=5, ack_p99_ms=6, control_ack_p95_ms=3, channel_opens=438740, channel_closes=438740, channel_leaks=0, open_failures=0, goroutines_delta=-1, open_fds_delta=4, all four route modes, clean route-pressure accounting, and verdict pass.
- Verified pool failover soak with ACK timeout and abandoned-frame accounting: fabric-loadtest-20260516-175622 rebuilt the Docker image with ACK timeout, target quarantine, and abandoned-frame accounting, then killed target 0 after 3 seconds during a 30-second mixed-topology soak. It produced 136194/136194 successful logical channels, failed_streams=0, failover_events=82, abandoned_frames=75, ack_mismatched_streams=0, ack_integrity_errors=0, channel_churn_per_sec=4543, throughput_bps=2156155314, reroute_latency_p99_ms=9, channel_leaks=0, clean route-pressure accounting, and verdict pass.
- Verified container stats gate: fabric-loadtest-20260516-163854 produced a passing 2-node mixed-topology smoke with -MaxContainerMemoryMiB 128 -MaxContainerPids 64 and included container_stats for both fabric server containers, with memory usage around 4-6 MiB per server and server PID counts 7-9. A negative control run with -MaxContainerMemoryMiB 1 failed as expected with container_memory_mib=...>1 verdict reasons.
- Verified periodic container stats sampling: fabric-loadtest-20260516-164259 used -Soak -Duration 8s, -ContainerStatsSampleInterval 2s, mixed public/NAT/LAN/relay profile, and -MaxContainerMemoryMiB 128 -MaxContainerPids 64. It produced 2000/2000 successful logical channels, channel_opens=2009, channel_closes=2009, channel_leaks=0, even 1000/1000 target distribution, 400 control streams, ack_p95_ms=1, route_pressure.active_total=0, matching acquire/release counts, final server memory around 12-13 MiB, and periodic sample peaks for the client and both servers in fabric-loadtest-20260516-164259-container-stats-samples.json.
- Verified high-churn goroutine drain after QUIC close cancellation:
fabric-loadtest-20260516-164502 rebuilt the Docker image and repeated the 2-node mixed-topology continuous soak with -MaxGoroutineDelta 64, -MaxHeapDeltaMB 128, -ContainerStatsSampleInterval 2s, -MaxContainerMemoryMiB 128, and -MaxContainerPids 64. It produced 2000/2000 successful logical channels, channel_opens=2009, channel_closes=2009, channel_leaks=0, even 1000/1000 target distribution, control_ack_p95_ms=1, ack_p95_ms=1, route_pressure.active_total=0, matching acquire/release counts, and goroutines_delta=-2.
- Verified file descriptor gate:
fabric-loadtest-20260516-164725 rebuilt the Docker image and repeated the 2-node mixed-topology continuous soak with -MaxOpenFDDelta 8 and -MaxOpenFDs 128 in addition to goroutine, heap, container memory, and PID gates. It produced 2000/2000 successful logical channels, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, open_fds_start=15, open_fds_end=9, open_fds_max=19, and open_fds_delta=-6.
- Verified bounded soak aggregation:
fabric-loadtest-20260516-165051 rebuilt the Docker image after changing soak result storage to an aggregate collector. The 2-node mixed-topology soak produced 2000/2000 successful logical channels, even 1000/1000 target distribution, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, goroutines_delta=0, open_fds_delta=1, verdict pass, and only 25 retained stream_samples in the full report.
- Verified mixed route-mode coverage gate:
fabric-loadtest-20260516-165308 rebuilt the Docker image with the route coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even 1000/1000/1000/1000 target distribution, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, and observed all required route modes: lan_quic, ice_quic, reverse_quic, and relay_quic.
- Verified ACK integrity gate:
fabric-loadtest-20260516-165544 rebuilt the Docker image with the ACK mismatch verdict and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_mismatched_streams=0, per-target frames_sent=6600 and acks_received=6600, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified ACK checksum integrity gate:
fabric-loadtest-20260516-165926 rebuilt the Docker image with ACK payload checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_mismatched_streams=0, ack_integrity_errors=0, 26400 total data frames, 26400 ACKs, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified unique per-frame payload integrity:
fabric-loadtest-20260516-170150 rebuilt the Docker image after switching loadtest traffic from a shared payload buffer to deterministic per-frame payloads. The 4-node mixed-topology soak produced 4000/4000 successful logical channels, ack_mismatched_streams=0, ack_integrity_errors=0, 26400 data frames, 26400 ACKs, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified throughput SLO gate:
fabric-loadtest-20260516-170512 rebuilt the Docker image with -MinThroughputMbps 100 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, throughput_bps=212479668, ack_mismatched_streams=0, ack_integrity_errors=0, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified short-session churn SLO gate:
fabric-loadtest-20260516-173320 rebuilt the Docker image with -MinChannelChurnPerSec 200, then ran a 4-node mixed-topology high-churn short-session smoke with 1000 one-frame logical channels. It produced 1000/1000 successful logical channels, channel_churn_per_sec=9478, channel_opens=1000, channel_closes=1000, channel_leaks=0, even target stream distribution, all four route modes, clean route-pressure accounting, and verdict pass.
- Verified high-churn QUIC stream-credit regression gate:
fabric-loadtest-20260516-174046 rebuilt the Docker image after closing the server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst of 5000 one-frame short logical channels at 128 concurrency with -MinChannelChurnPerSec 300 and -StreamTimeout 15s. It produced 5000/5000 successful logical channels, channel_churn_per_sec=21124, channel_opens=5000, channel_closes=5000, channel_leaks=0, open_failures=0, ack_mismatched_streams=0, ack_integrity_errors=0, even 1250/1250/1250/1250 target distribution, all four route modes, clean route-pressure accounting, and verdict pass.
- Verified target byte distribution gate:
fabric-loadtest-20260516-170731 rebuilt the Docker image with byte distribution verdicts and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even 1000/1000/1000/1000 stream distribution, exactly 53,248,000 bytes per target, throughput_bps=212488911, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified overall ACK latency SLO gate:
fabric-loadtest-20260516-171001 rebuilt the Docker image with -MaxAckP95Ms 20 and -MaxAckP99Ms 50 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_p95_ms=2, ack_p99_ms=3, ack_mismatched_streams=0, ack_integrity_errors=0, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified route-pressure distribution gate:
fabric-loadtest-20260516-171216 rebuilt the Docker image with route-pressure distribution verdicts and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even target stream and byte distribution, per-route max_active values of 13/12/13/13, route_pressure.active_total=0, matching acquire/release counts, and verdict pass.
- Verified per-target ACK latency gate:
fabric-loadtest-20260516-171454 rebuilt the Docker image with -MaxTargetAckMs 20 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, per-target max_ack_ms values of 6/5/7/9, ack_p95_ms=3, ack_p99_ms=5, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified channel setup latency SLO gate:
fabric-loadtest-20260516-171937 rebuilt the Docker image with -MaxSetupP95Ms 20 and -MaxSetupP99Ms 50, then repeated the 4-node mixed-topology soak with ACK, throughput, FD, goroutine, heap, container memory, and PID gates enabled. It produced 4000/4000 successful logical channels, setup_latency_p95_ms=0, ack_p95_ms=3, ack_p99_ms=3, throughput_bps=212572631, even target stream/byte distribution, all four route modes, clean channel/route pressure accounting, and verdict pass.
- Verified reroute latency SLO gate:
fabric-loadtest-20260516-172652 rebuilt the Docker image with -MaxRerouteP95Ms 100 and -MaxRerouteP99Ms 200, then ran a 4-node mixed-topology pool-failover stress with target 0 killed during load. It produced 400/400 successful logical channels, 100 pool failover events, reroute_latency_p95_ms=1, reroute_latency_p99_ms=2, route_attempts_total=500, ack_p95_ms=6, ack_p99_ms=8, throughput_bps=3863633075, clean channel/route pressure accounting, and verdict pass.
- Mixed topology profile gate:
fabric-loadtest-20260516-162037 used -TopologyProfile mixed-public-nat-lan-relay with 400 streams, 64 concurrency, four targets, and mixed control/bulk traffic. It produced 400/400 successful streams, 100 streams per target, route-mode reporting for lan_quic, ice_quic, reverse_quic, and relay_quic, control_ack_p95_ms=2, ack_p95_ms=7, channel_leaks=0, route_pressure.active_total=0, and matching acquire/release counts.
- Verified strict QUIC route-mode gate:
fabric-loadtest-20260516-182550 rebuilt the loadtest image with compat route-mode verdicts and ran the 4-node mixed topology profile. It produced 400/400 successful logical channels, observed only lan_quic, ice_quic, reverse_quic, and relay_quic, kept ack_mismatched_streams=0, ack_integrity_errors=0, channel_leaks=0, clean route-pressure accounting, and verdict pass.
- fabric-loadtest now also treats the configured target list as part of the acceptance surface: every target must be quic://.... Empty targets, bare host:port, HTTP(S), and WS/WSS targets produce a failing non_quic_targets=... verdict reason. Client mode also rejects those targets before dialing, so a bad stress command cannot accidentally exercise a non-QUIC path and only discover it after the run.
- The shared Docker runner
scripts/fabric/fabric-loadtest-docker-smoke.ps1 now has matching guardrails: it refuses local Docker Desktop contexts such as default/desktop-linux and validates generated targets before launch so the real-load smoke remains tied to the shared test Docker host and QUIC-only endpoints.
- Shared Docker validation after those guardrails:
fabric-loadtest-20260516-190049 rebuilt the Docker image on test-docker and ran 4 QUIC targets with 120 streams. It produced 120/120 successful logical channels, ack_p95_ms=3, setup_latency_p95_ms=21, clean open/close and route-pressure accounting, QUIC-only targets, and verdict pass.
- Shared Docker mixed-topology failover validation:
fabric-loadtest-20260516-190137 reused the image on test-docker, killed target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay profile. It produced 400/400 successful logical channels, 100 pool failover events, route_attempts_total=500, route modes ice_quic, reverse_quic, and relay_quic after the failed target was removed, ack_p95_ms=8, setup_latency_p95_ms=51, clean channel/route-pressure accounting, and verdict pass.
- Shared Docker mixed-topology route coverage validation:
fabric-loadtest-20260516-190207 ran the same 4-target mixed profile without target failure. It produced 400/400 successful logical channels, exactly 100 streams per target, observed lan_quic, ice_quic, reverse_quic, and relay_quic, kept ack_integrity_errors=0, channel_leaks=0, route_pressure.active_total=0, and verdict pass.
- Load balancing under pool failover is now an acceptance gate. The first
stricter shared-host rebuild,
fabric-loadtest-20260516-190704, intentionally failed because all failed-target retries moved to the nearest live target, producing target_byte_distribution_skew and route_pressure_distribution_skew. The retry selector was then changed to spread failed-slot retries across the currently usable target set instead of selecting the next target in ring order.
- Verified load-aware retry routing after the fix:
fabric-loadtest-20260516-191028 rebuilt on test-docker, killed target 0 after 100ms, and repeated the 4-target mixed profile. It produced 400/400 successful logical channels, 100 pool failover events, surviving-target stream distribution of 134/133/133, surviving route-pressure max-active values of 30/25/27, ack_p95_ms=4, reroute_latency_p95_ms=1, clean acquire/release accounting, and verdict pass.
- Verified 1000-channel mixed-topology stress:
fabric-loadtest-20260516-193414 ran 1000 logical channels on test-docker with 128 concurrency, mixed control/bulk traffic, and the mixed-public-nat-lan-relay profile. It produced 1000/1000 successful logical channels, exact 250/250/250/250 target distribution, observed all four QUIC route modes (lan_quic, ice_quic, reverse_quic, relay_quic), throughput_bps=3629522849, channel_churn_per_sec=1919, ack_p95_ms=6, clean channel/route-pressure accounting, and verdict pass.
- Verified 1000-channel pool-failover stress:
fabric-loadtest-20260516-193444 killed target 0 after 100ms and ran 1000 logical channels with 128 concurrency. It produced 1000/1000 successful logical channels, 250 pool failover events, surviving-target distribution of 334/333/333, route_attempts_total=1250, ack_p95_ms=7, clean acquire/release accounting, and verdict pass.
- Verified latency-degradation migration:
fabric-loadtest-20260516-193515 applied tc netem delay 80ms to target 1, enabled slow-stream migration with -MaxAckMs 20, and ran 400 mixed-profile channels. It observed the impaired target in degraded_targets, produced 64 slow-ACK migrations, moved completed streams onto healthy targets with distribution 134/133/133, kept channel_leaks=0, ack_integrity_errors=0, clean route-pressure accounting, and verdict pass.
- Shared Docker runner resource-sample fallback was verified with
fabric-loadtest-20260516-190325: short runs now still persist container_stats_samples_path and a minimal per-container sample summary from final Docker stats when the background sampler has no time to emit samples.
- Added
scripts/fabric/fabric-acceptance-summary.ps1 to aggregate recent *-summary.json artifacts into an acceptance report. It captures verdicts, target distribution, route modes, churn, failover/migration counts, latency SLOs, resource evidence, and keeps intentional failed runs visible as regression evidence for gates such as route-pressure skew detection.
- The first 30-minute soak attempt (
fabric-loadtest-20260516-193558) exposed a runner defect instead of a fabric defect: server containers were still started with a fixed -timeout 10m, so the three surviving servers exited around minute 10 while the client expected a 30-minute run. The Docker runner now exposes -ServerTimeout and defaults it to -ClientTimeout, so long soak server lifetimes match the client run.
- The next soak attempt (
fabric-loadtest-20260516-194816) passed the 10-minute server-timeout boundary but exposed another long-run behavior: a healthy surviving target could stay out of placement after a transient degradation mark. fabric-loadtest now uses a bounded target_quarantine_ttl for placement while still preserving historical degraded_targets observations in the report. The Docker runner exposes this as -TargetQuarantineTTL.
- fabric-loadtest-20260516-200241 then exposed a soak-loop issue: it reported pass with 432869/432869 logical channels and clean accounting, but finished after about 95 seconds despite config.duration=30m. The cause was worker shutdown on per-stream context deadline exceeded; soak workers now only exit on the parent run context or the configured soak stop time, not on one channel's timeout.
- fabric-loadtest-20260516-200939 and fabric-loadtest-20260516-201331 confirmed the soak loop fix by running full 3-minute preflights, but they failed the zero-failed-stream gate under target-kill injection. The issue was policy: the known killed target re-entered placement too quickly via the short transient quarantine TTL, causing some channels to spend their stream budget on a hard-dead endpoint. fabric-loadtest now separates transient target_quarantine_ttl from failure_quarantine_ttl, and the Docker runner exposes -FailureQuarantineTTL.
- Verified 30-minute long-duration soak:
fabric-loadtest-20260516-202532 ran on test-docker for 1800.010 seconds with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per logical channel, 10-second resource and container samples, and the mixed-public-nat-lan-relay profile. It produced 15,074,556/15,074,556 successful logical channels, 895,308,005,376 bytes, throughput_bps=3979124146, channel_churn_per_sec=8374, exact 3,768,639 streams per target, all four QUIC route modes, ack_p95_ms=5, ack_p99_ms=6, channel_leaks=0, matching 15,074,556 channel opens/closes, route_pressure.active_total=0, 458 container-stat samples, bounded memory/PID use, and verdict pass.
- Verified real-node host-to-host QUIC smoke:
home-1 ran the standalone fabric-loadtest client against a temporary QUIC server on test-docker at quic://docker-test.cin.su:19443. The run created 1000 short logical channels at 128 concurrency, mixed control and bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced throughput_bps=1177445403, channel_churn_per_sec=2478, ack_p95_ms=12, ack_p99_ms=21, setup_latency_p95_ms=118, zero failed streams, zero channel leaks, and verdict pass. The report is saved as artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json.
- Published and registered node-agent release
0.2.280-fabricsession with linux binary/native and Docker image artifacts. The release is intentionally not assigned to live node update policies yet because current live node workload/env posture still advertises compat direct_http and HTTP/HTTPS mesh endpoints. Before rollout, node configs must be migrated to quic://... endpoints, QUIC advertise labels, and enabled QUIC listener env such as RAP_MESH_QUIC_FABRIC_ENABLED=true plus RAP_MESH_QUIC_FABRIC_LISTEN_ADDR.
- Loadtest degraded-target quarantine is observable through
degraded_targets. When -impair-target and slow-stream migration are enabled, verdict fails if no degraded target is observed or if degraded targets do not produce migration events. A shared-host validation run with 120 streams reported degraded_targets = { impaired_target: "slow_ack" }, 5 migration events, control_ack_p95_ms=3, and clean acquire/release accounting.
- Channel lifecycle accounting is explicit in
fabric-loadtest through channel_opens, channel_closes, and channel_leaks. Verdict fails on open/close mismatch, active stream leaks, or mismatch between route-pressure acquire counts and QUIC stream opens.
- The next validation step is broader real mixed public/NAT/LAN topology across separate physical or VM hosts. The shared Docker host has verified the route model, stress gates, 30-minute stability, memory, goroutine, file descriptor, container resource, and route-pressure accounting. A true external NAT lab should now validate the same gates with independent NAT devices, public nodes, and local NAT-side cluster segments.
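
The channel lifecycle verdict rules listed above (open/close mismatch, active stream leaks, and route-pressure acquires not matching QUIC stream opens) can be sketched in Go as follows. This is a minimal illustration, not the fabric-loadtest implementation: the ChannelAccounting struct, the lifecycleReasons function, and the reason strings are all hypothetical, and the counter values are made up.

```go
package main

import "fmt"

// ChannelAccounting is a hypothetical holder for the counters the lifecycle
// verdict needs. Field comments name the report keys they are assumed to mirror.
type ChannelAccounting struct {
	ChannelOpens    int64 // channel_opens
	ChannelCloses   int64 // channel_closes
	ChannelLeaks    int64 // channel_leaks: streams still active at the end
	RouteAcquires   int64 // route-pressure acquire count
	QUICStreamOpens int64 // QUIC streams actually opened
}

// lifecycleReasons returns one reason per violated rule.
// An empty result means the lifecycle part of the verdict passes.
// The reason strings are illustrative, not the tool's real output.
func lifecycleReasons(a ChannelAccounting) []string {
	var reasons []string
	if a.ChannelOpens != a.ChannelCloses {
		reasons = append(reasons,
			fmt.Sprintf("channel_open_close_mismatch=%d/%d", a.ChannelOpens, a.ChannelCloses))
	}
	if a.ChannelLeaks != 0 {
		reasons = append(reasons, fmt.Sprintf("channel_leaks=%d", a.ChannelLeaks))
	}
	if a.RouteAcquires != a.QUICStreamOpens {
		reasons = append(reasons,
			fmt.Sprintf("route_acquire_stream_open_mismatch=%d/%d", a.RouteAcquires, a.QUICStreamOpens))
	}
	return reasons
}

func main() {
	// Illustrative counters only; not taken from any run above.
	clean := ChannelAccounting{ChannelOpens: 100, ChannelCloses: 100, RouteAcquires: 100, QUICStreamOpens: 100}
	leaky := ChannelAccounting{ChannelOpens: 100, ChannelCloses: 98, ChannelLeaks: 2, RouteAcquires: 100, QUICStreamOpens: 100}
	fmt.Println("clean:", lifecycleReasons(clean))
	fmt.Println("leaky:", lifecycleReasons(leaky))
}
```

Returning a list of reasons rather than a single boolean keeps every violated rule visible in the report, which matches how the runs above cite specific verdict reasons.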
Initial SLO examples:
- channel_setup_p95_ms < 200
- reroute_p95_ms < 1000
- control_latency_p99_ms < 100 under bulk load
- packet_loss_after_recovery < 0.1%
- no_route_pressure_over_90_percent_when_alternatives_exist
- no_channel_table_growth_after_churn
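
One way to turn the example SLOs above into a machine-checkable gate is sketched below in Go. The Measured struct, its field names, and the sloViolations function are assumptions made for illustration. In particular, how route pressure and channel-table size are measured is not defined above, so the last two checks are only one plausible reading of those SLO names.

```go
package main

import "fmt"

// Measured is a hypothetical snapshot of the values the example SLOs refer to.
type Measured struct {
	ChannelSetupP95Ms       float64
	RerouteP95Ms            float64
	ControlLatencyP99Ms     float64 // measured while bulk load is running
	PacketLossAfterRecovery float64 // percent
	MaxRoutePressurePercent float64 // highest per-route utilization (assumed metric)
	AlternativesExist       bool    // another usable route was available
	ChannelTableBefore      int     // channel table entries before churn
	ChannelTableAfter       int     // channel table entries after churn drains
}

// sloViolations returns the names of the example SLOs that m violates.
func sloViolations(m Measured) []string {
	var v []string
	if m.ChannelSetupP95Ms >= 200 {
		v = append(v, "channel_setup_p95_ms")
	}
	if m.RerouteP95Ms >= 1000 {
		v = append(v, "reroute_p95_ms")
	}
	if m.ControlLatencyP99Ms >= 100 {
		v = append(v, "control_latency_p99_ms")
	}
	if m.PacketLossAfterRecovery >= 0.1 {
		v = append(v, "packet_loss_after_recovery")
	}
	// High pressure only counts as a violation when traffic could have moved.
	if m.AlternativesExist && m.MaxRoutePressurePercent > 90 {
		v = append(v, "no_route_pressure_over_90_percent_when_alternatives_exist")
	}
	if m.ChannelTableAfter > m.ChannelTableBefore {
		v = append(v, "no_channel_table_growth_after_churn")
	}
	return v
}

func main() {
	// Illustrative values only; not measurements from any run.
	m := Measured{
		ChannelSetupP95Ms:       250,
		RerouteP95Ms:            40,
		ControlLatencyP99Ms:     12,
		PacketLossAfterRecovery: 0.0,
		MaxRoutePressurePercent: 95,
		AlternativesExist:       false,
		ChannelTableBefore:      10,
		ChannelTableAfter:       10,
	}
	fmt.Println(sloViolations(m))
}
```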