Fabric-First Transport And Stress Plan

Status: fabric-first implementation baseline is active. QUIC-only transport, route planning, runtime reroute/failover, pressure accounting, shared-host stress gates, 1000-channel load, failure/degradation gates, and a 30-minute real-byte soak are implemented and verified. Remaining work is wider real topology coverage as the cluster grows.

This project is now fabric-first. Work on service payloads, service adapter expansion, and Android VPN transport is paused until the fabric transport layer is complete and proven under real load.

Goal

The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes may have public addresses, sit behind NAT, or represent a whole local segment behind one NAT. The fabric must expose a single logical transport layer where nodes can reach each other directly, through local segment paths, through passive outbound tunnels, or through relay hops without changing the data-plane protocol.

QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and ICE-selected paths are route modes inside the same QUIC fabric, not alternative transports.

The fabric must not depend on one management service for authority. API, storage, update-cache, route-coordinator, observer, and authority duties are roles inside the mesh. A reachable API endpoint can distribute signed state, but it cannot be the source of truth by itself. Nodes accept control data, configuration, route leases, update plans, and role changes only when the signatures, quorum rules, scopes, epochs, and expiry windows verify locally.

Required Fabric Behavior

  • Address channels by node_id, pool_id, or service target, not by raw IP.
  • Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC, relay QUIC, and future ICE-derived QUIC paths.
  • Treat DNS names such as web/admin/API domains as service endpoints only, not node identity or fabric authority.
  • Require node-published endpoint candidates to include explicit host:port, reachability, connectivity mode, NAT/local-segment metadata, source, and freshness.
  • Prefer local segment paths for nodes that share a NAT/local network.
  • Keep outbound passive QUIC control/data adjacencies from NATed nodes to reachable public or relay nodes.
  • Build logical channels over shared QUIC adjacencies instead of opening one physical QUIC connection per channel.
  • Maintain primary, warm standby, and fallback route sets per channel.
  • Rebuild a channel when an intermediate hop fails.
  • Switch to another pool member when the target is a pool and the current endpoint fails.
  • Reroute slow channels when a faster path exists and the reroute will not harm aggregate fabric throughput.
  • Spread channels across available routes so the shortest path is not saturated while other nodes are idle.
  • Isolate channels with per-channel flow control, traffic classes, backpressure, quotas, and fairness scheduling.
  • Report per-node, per-link, per-route, and per-channel load and failure causes.

Service Channel Boundary

The fabric is the only component that builds and maintains transport channels. VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a transport concern, open fallback transports, or implement route repair.

Every service starts by submitting a fabric service channel request:

{
  "schema_version": "rap.fabric_service_channel_request.v1",
  "channel_id": "vpn-session-or-service-session-id",
  "source_role": "vpn-client | rdp-client | service-adapter",
  "service_class": "vpn_packets | rdp | ssh | file_transfer | web",
  "target": {
    "kind": "pool",
    "pool_ids": ["home-ipv4"],
    "service_role": "ipv4-egress"
  },
  "traffic": {
    "mode": "duplex",
    "application_protocol_agnostic": true,
    "flow_distribution": "latency_and_load_aware"
  },
  "resilience": {
    "min_active_paths": 1,
    "warm_standby_paths": 1,
    "failover": "pool_member_or_next_authorized_pool",
    "reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
  }
}
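
A receiving component might decode and sanity-check such a request before any route planning happens. The sketch below is illustrative only: the Go names (ServiceChannelRequest, ParseServiceChannelRequest) and the specific validation rules are assumptions, not the fabric's actual API. Only the JSON field names come from the example above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const requestSchema = "rap.fabric_service_channel_request.v1"

// ServiceChannelRequest mirrors a subset of the JSON example. The "traffic"
// object is omitted; encoding/json ignores unknown fields by default.
type ServiceChannelRequest struct {
	SchemaVersion string `json:"schema_version"`
	ChannelID     string `json:"channel_id"`
	SourceRole    string `json:"source_role"`
	ServiceClass  string `json:"service_class"`
	Target        struct {
		Kind        string   `json:"kind"`
		PoolIDs     []string `json:"pool_ids"`
		ServiceRole string   `json:"service_role"`
	} `json:"target"`
	Resilience struct {
		MinActivePaths   int      `json:"min_active_paths"`
		WarmStandbyPaths int      `json:"warm_standby_paths"`
		Failover         string   `json:"failover"`
		RerouteOn        []string `json:"reroute_on"`
	} `json:"resilience"`
}

// ParseServiceChannelRequest decodes the request and applies a few
// hypothetical sanity checks. Note that it never looks at raw endpoints:
// the target is a pool, not an IP address.
func ParseServiceChannelRequest(raw []byte) (*ServiceChannelRequest, error) {
	var req ServiceChannelRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return nil, fmt.Errorf("decode request: %w", err)
	}
	if req.SchemaVersion != requestSchema {
		return nil, fmt.Errorf("unsupported schema_version %q", req.SchemaVersion)
	}
	if req.ChannelID == "" {
		return nil, errors.New("channel_id is required")
	}
	if req.Target.Kind == "pool" && len(req.Target.PoolIDs) == 0 {
		return nil, errors.New("pool target needs at least one pool_id")
	}
	if req.Resilience.MinActivePaths < 1 {
		return nil, errors.New("min_active_paths must be at least 1")
	}
	return &req, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "rap.fabric_service_channel_request.v1",
	  "channel_id": "vpn-session-1",
	  "service_class": "vpn_packets",
	  "target": {"kind": "pool", "pool_ids": ["home-ipv4"], "service_role": "ipv4-egress"},
	  "resilience": {"min_active_paths": 1, "warm_standby_paths": 1}
	}`)
	req, err := ParseServiceChannelRequest(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("accepted channel", req.ChannelID, "for pools", req.Target.PoolIDs)
}
```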

The fabric responds with a signed route bundle containing a short-lived rap.fabric_route_lease.v1. The lease names the target pool, primary path, warm standby paths, multipath hints, and rebuild policy. Physical endpoint candidates are visible only to the fabric runtime as lease material; service adapters do not rank, pin, or fail over endpoints themselves. A service adapter receives only a duplex channel handle and service metadata:

  • Android VPN: TUN packet reader/writer only.
  • ipv4-egress: NAT/ordinary IPv4 exit only.
  • RDP: protocol/session adapter only; server address, protocol, credentials, rendering, and clipboard are RDP service metadata, not fabric routing.

Temporary compatibility fields such as exit_candidates may exist only inside the fabric route bundle consumed by the fabric runtime. Service code must treat them as opaque and must not schedule routes from them.

The VPN client runtime accepts only fabric_service_channel_request plus fabric_route_bundle.route_lease. The Android service may keep a deprecated diagnostic endpoint cache, but packet routing must come from the lease. If a path fails, slows down, or its target pool member dies, the fabric lease/rebuild policy is the authority; the VPN service continues writing packets to the channel and does not switch protocols.
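
Inside the fabric runtime, path selection from a lease could look roughly like the sketch below. This is an assumption-laden illustration: RouteLease, its field names, and PickPath are hypothetical and do not describe the real rap.fabric_route_lease.v1 layout. Signature verification is assumed to have already happened and is not shown. The point is that failover stays inside the lease's authorized paths and stops when the lease expires.

```go
package main

import (
	"errors"
	"fmt"
)

// RouteLease is a hypothetical in-memory view of a verified route lease.
type RouteLease struct {
	TargetPool    string
	PrimaryPath   string
	WarmStandby   []string
	ExpiresAtUnix int64 // lease expiry, Unix seconds
}

var (
	ErrLeaseExpired = errors.New("route lease expired")
	ErrNoPathLeft   = errors.New("no usable path left in lease")
)

// PickPath returns the first path named by the lease that is not marked
// failed, trying the primary path before the warm standby paths. An expired
// lease yields no path at all, so the runtime has to obtain a fresh lease.
func (l RouteLease) PickPath(nowUnix int64, failed map[string]bool) (string, error) {
	if nowUnix >= l.ExpiresAtUnix {
		return "", ErrLeaseExpired
	}
	paths := append([]string{l.PrimaryPath}, l.WarmStandby...)
	for _, p := range paths {
		if !failed[p] {
			return p, nil
		}
	}
	return "", ErrNoPathLeft
}

func main() {
	lease := RouteLease{
		TargetPool:    "home-ipv4",
		PrimaryPath:   "route-a",
		WarmStandby:   []string{"route-b"},
		ExpiresAtUnix: 1000,
	}
	failed := map[string]bool{}

	p, _ := lease.PickPath(900, failed)
	fmt.Println("initial path:", p)

	failed[p] = true // the primary path died under load
	p, _ = lease.PickPath(910, failed)
	fmt.Println("after failure:", p)

	_, err := lease.PickPath(1000, failed)
	fmt.Println("at expiry:", err)
}
```

A service adapter never calls anything like PickPath itself; it keeps writing to its channel handle while the runtime rebinds the path underneath.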

Distributed Authority Requirements

  • No single control-plane/API/storage/update node can mutate the cluster alone.
  • Cluster root and high-risk role changes require threshold signatures from authorized control-authority keys.
  • Update releases require signed metadata, signed artifact hashes, compatibility constraints, rollout scope, and rollback windows; mirrors may serve bytes but cannot change what is trusted.
  • Route leases, relay leases, rendezvous assignments, peer-directory epochs, and endpoint candidate epochs are signed and short-lived.
  • Nodes cache the last valid signed state and continue routing through peers, relay fallbacks, and passive reverse channels when API replicas are down.
  • A compromised replica may delay or omit data, but must not be able to forge role assignment, route authority, update authority, storage placement, or node ownership.
  • Development database_signer mode is not production authority. Production acceptance requires quorum-signed envelopes for node join, role mutation, mesh config, route leases, update plans, and release metadata.
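
The local acceptance rule for quorum-signed state can be illustrated with a small threshold check. The sketch below assumes Ed25519 keys and a made-up envelope layout. SignedEnvelope, AuthoritySet, VerifyQuorum, and the byte format that binds epoch and expiry to the payload are all hypothetical, and scope checks are left out for brevity.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// AuthoritySet maps a key ID to an authorized control-authority public key.
type AuthoritySet map[string]ed25519.PublicKey

// SignedEnvelope is a hypothetical container for signed control data.
type SignedEnvelope struct {
	Payload       []byte
	Epoch         uint64
	ExpiresAtUnix int64
	Signatures    map[string][]byte // key ID -> signature
}

var (
	ErrStaleEpoch = errors.New("envelope epoch is older than cached state")
	ErrExpired    = errors.New("envelope expired")
	ErrQuorum     = errors.New("quorum not reached")
)

// message binds epoch and expiry to the payload so neither can be altered
// without invalidating the signatures.
func (e *SignedEnvelope) message() []byte {
	header := fmt.Sprintf("%d|%d|", e.Epoch, e.ExpiresAtUnix)
	return append([]byte(header), e.Payload...)
}

// SignWith adds one authority signature (used here only to build the demo).
func (e *SignedEnvelope) SignWith(keyID string, priv ed25519.PrivateKey) {
	if e.Signatures == nil {
		e.Signatures = make(map[string][]byte)
	}
	e.Signatures[keyID] = ed25519.Sign(priv, e.message())
}

// NewAuthorityKey generates a demo key pair using crypto/rand.
func NewAuthorityKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// VerifyQuorum accepts the envelope only if it is fresh and at least
// threshold distinct, known authorities produced valid signatures.
func VerifyQuorum(e *SignedEnvelope, authorities AuthoritySet, threshold int, lastEpoch uint64, nowUnix int64) error {
	if e.Epoch < lastEpoch {
		return ErrStaleEpoch
	}
	if nowUnix >= e.ExpiresAtUnix {
		return ErrExpired
	}
	msg := e.message()
	valid := 0
	for keyID, sig := range e.Signatures {
		pub, known := authorities[keyID]
		if known && ed25519.Verify(pub, msg, sig) {
			valid++ // map keys are unique, so each authority counts once
		}
	}
	if valid < threshold {
		return fmt.Errorf("%w: %d valid, %d required", ErrQuorum, valid, threshold)
	}
	return nil
}

func main() {
	pubA, privA := NewAuthorityKey()
	pubB, privB := NewAuthorityKey()
	pubC, _ := NewAuthorityKey()
	authorities := AuthoritySet{"auth-a": pubA, "auth-b": pubB, "auth-c": pubC}

	env := &SignedEnvelope{Payload: []byte(`{"role":"relay"}`), Epoch: 12, ExpiresAtUnix: 2000}
	env.SignWith("auth-a", privA)
	fmt.Println("one signature: ", VerifyQuorum(env, authorities, 2, 12, 1000))
	env.SignWith("auth-b", privB)
	fmt.Println("two signatures:", VerifyQuorum(env, authorities, 2, 12, 1000))
}
```

Because the check needs only the cached authority set and the envelope itself, a node can run it while every API replica is unreachable, and a compromised replica that relays the envelope cannot alter it without breaking the signatures.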

Implementation Layers

  1. Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse tunnels, relay candidates.
  2. Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity, health, and pressure metrics.
  3. Routing layer: latency-aware and load-aware route sets with relay fallback and pool failover.
  4. Channel layer: millions of logical channels with independent lifecycle, flow control, and statistics.

Stress Requirements

Ping tests are not an acceptance criterion. The fabric must pass real byte-transfer load:

  • 1000 concurrent streams from different source nodes to different destination nodes.
  • Mixed long-lived and short-lived channels.
  • Aggressive create/delete churn.
  • Many-to-one, one-to-many, and many-to-many traffic.
  • Direct, LAN, relay, multi-hop, and reverse tunnel paths.
  • Endpoint pool failover under load.
  • Intermediate relay/node failure and route rebuild under load.
  • Induced latency, packet loss, bandwidth caps, and route saturation.
  • Control/interactive traffic surviving bulk traffic.
  • No sustained overload of one path when alternatives exist.
  • No goroutine, memory, stream, or file descriptor leak after churn.

Required Stress Report

Every stress run must produce machine-readable JSON with:

  • topology and scenario profile;
  • channel setup/teardown counts and latency;
  • total and per-channel throughput;
  • per-node and per-route capacity pressure;
  • p50/p95/p99 latency where measured;
  • backpressure, rejection, and queue-depth counters;
  • route switch and failover events;
  • target pool failover events;
  • QUIC connection and logical channel counts;
  • final pass/fail verdict against SLO thresholds.
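
As a rough illustration of a machine-readable report with a verdict, the sketch below models a small subset of such a report. The struct shape and the SLO type are assumptions. Field names such as channel_opens, control_ack_p95_ms, route_pressure, and verdict_reasons follow names used elsewhere in this document, while acquire_total and release_total are hypothetical placeholders for the acquire/release counters. The numbers in main are made-up sample values, not measured results.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type RoutePressure struct {
	ActiveTotal    int `json:"active_total"`
	MaxActiveTotal int `json:"max_active_total"`
	AcquireTotal   int `json:"acquire_total"` // hypothetical field name
	ReleaseTotal   int `json:"release_total"` // hypothetical field name
}

type StressReport struct {
	ChannelOpens    int           `json:"channel_opens"`
	ChannelCloses   int           `json:"channel_closes"`
	FailoverEvents  int           `json:"failover_events"`
	ControlAckP95Ms int           `json:"control_ack_p95_ms"`
	RoutePressure   RoutePressure `json:"route_pressure"`
	Verdict         string        `json:"verdict"`
	VerdictReasons  []string      `json:"verdict_reasons"`
}

// SLO holds the thresholds the verdict is checked against (illustrative).
type SLO struct {
	MaxControlAckP95Ms int
	MaxConcurrency     int
}

// Evaluate fills in Verdict and VerdictReasons from the counters.
func (r *StressReport) Evaluate(slo SLO) {
	reasons := []string{}
	if r.ChannelOpens != r.ChannelCloses {
		reasons = append(reasons, fmt.Sprintf("channel_opens=%d != channel_closes=%d", r.ChannelOpens, r.ChannelCloses))
	}
	if r.RoutePressure.ActiveTotal != 0 {
		reasons = append(reasons, fmt.Sprintf("route_pressure.active_total=%d after run", r.RoutePressure.ActiveTotal))
	}
	if r.RoutePressure.AcquireTotal != r.RoutePressure.ReleaseTotal {
		reasons = append(reasons, "route pressure acquire/release mismatch")
	}
	if r.RoutePressure.MaxActiveTotal > slo.MaxConcurrency {
		reasons = append(reasons, fmt.Sprintf("max_active_total=%d > concurrency=%d", r.RoutePressure.MaxActiveTotal, slo.MaxConcurrency))
	}
	if r.ControlAckP95Ms > slo.MaxControlAckP95Ms {
		reasons = append(reasons, fmt.Sprintf("control_ack_p95_ms=%d > %d", r.ControlAckP95Ms, slo.MaxControlAckP95Ms))
	}
	r.VerdictReasons = reasons
	r.Verdict = "pass"
	if len(reasons) > 0 {
		r.Verdict = "fail"
	}
}

func main() {
	// Made-up sample values for illustration only.
	report := StressReport{
		ChannelOpens:    40,
		ChannelCloses:   40,
		ControlAckP95Ms: 1,
		RoutePressure:   RoutePressure{MaxActiveTotal: 8, AcquireTotal: 40, ReleaseTotal: 40},
	}
	report.Evaluate(SLO{MaxControlAckP95Ms: 100, MaxConcurrency: 8})

	out, err := json.MarshalIndent(report, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```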

The first executable harness is agents/rap-node-agent/cmd/fabric-loadtest. It supports in-process multi-node QUIC targets, short logical channel churn, pool failover, target failure injection, and JSON reports.

Example local pool-failover run:

go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s

The local harness is not a replacement for distributed host testing. It is the first acceptance gate for protocol limits, channel lifecycle churn, pool failover semantics, and reporting shape before running the same workload across the shared test Docker host.

Distributed shared-host smoke:

powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms

The distributed smoke builds/runs separate server and client containers on the shared Docker host, sends real QUIC fabric frames across the Docker network, kills one target node during load, and expects all channels assigned to that target to fail over to the remaining pool.

The smoke summary includes the strict loadtest verdict plus route_pressure and transport_snapshot; the script fails when the client verdict is not pass and carries verdict_reasons into the thrown error.

-TuneUdpBuffers applies runtime host sysctls through a privileged one-shot container before the run and records the observed values in the summary: net.core.rmem_max, net.core.wmem_max, net.core.rmem_default, and net.core.wmem_default.

Degraded-target and latency-aware admission run:

powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80

This applies tc netem to one target, probes every target before mass channel placement, excludes targets above the RTT threshold, and reports per-target setup/duration percentiles. This is the first executable gate for latency-aware placement; live channel migration after mid-stream degradation is the next routing-layer gate.
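
The probe-then-exclude step can be pictured as a simple filter over probe results. The sketch below is a hypothetical illustration, not the harness code: TargetProbe and AdmitTargets are invented names, and ordering the admitted targets by RTT is an assumption about how placement might prefer faster targets.

```go
package main

import (
	"fmt"
	"sort"
)

// TargetProbe is a hypothetical result of probing one target before
// mass channel placement.
type TargetProbe struct {
	TargetID  string
	RTTMs     float64
	Reachable bool
}

// AdmitTargets splits targets into those eligible for placement and those
// excluded because they are unreachable or above the RTT threshold.
// Admitted targets are returned fastest first.
func AdmitTargets(probes []TargetProbe, maxRTTMs float64) (admitted, excluded []string) {
	ok := make([]TargetProbe, 0, len(probes))
	for _, p := range probes {
		if p.Reachable && p.RTTMs <= maxRTTMs {
			ok = append(ok, p)
		} else {
			excluded = append(excluded, p.TargetID)
		}
	}
	sort.SliceStable(ok, func(i, j int) bool { return ok[i].RTTMs < ok[j].RTTMs })
	for _, p := range ok {
		admitted = append(admitted, p.TargetID)
	}
	return admitted, excluded
}

func main() {
	// target-0 stands in for the impaired target; values are made up.
	probes := []TargetProbe{
		{TargetID: "target-0", RTTMs: 102.4, Reachable: true},
		{TargetID: "target-1", RTTMs: 1.2, Reachable: true},
		{TargetID: "target-2", RTTMs: 0.8, Reachable: true},
		{TargetID: "target-3", Reachable: false},
	}
	admitted, excluded := AdmitTargets(probes, 80)
	fmt.Println("admitted:", admitted)
	fmt.Println("excluded:", excluded)
}
```

If every target is excluded, a caller following this approach would presumably fail the placement rather than fall back to a slow target, but that policy is not specified here.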

Mid-stream migration gate:

powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30

This starts channels normally, applies tc netem after traffic is already in flight, and expects slow logical streams to continue their remaining bytes on a different target. The report exposes migration_events, max_ack_ms, ack_p95_ms, ack_p99_ms, route_attempts_total, reroute_causes, and per-target stats.
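
The migration decision behind this gate can be sketched as a threshold check on ACK latency plus a minimum interval between reroutes. Everything named below (MigrationPolicy, ChannelState, OnAck) is hypothetical. The ACK threshold corresponds loosely to -MaxAckMs, and the minimum interval is an assumed anti-flapping guard rather than a documented loadtest flag.

```go
package main

import "fmt"

// ChannelState tracks which target a logical channel is currently bound to.
type ChannelState struct {
	Target        int
	RerouteCount  int
	LastRerouteMs int64 // timestamp of the last reroute, in milliseconds
}

// MigrationPolicy holds the slow-ACK threshold and the hysteresis window.
type MigrationPolicy struct {
	MaxAckMs             int64
	MinRerouteIntervalMs int64
}

// OnAck inspects one ACK latency sample. If the sample exceeds MaxAckMs and
// the channel has not been rerouted too recently, the channel is rebound to
// the first healthy target other than the current one. It reports whether a
// migration happened. The healthy slice is expected to exclude targets the
// caller already considers degraded.
func (p MigrationPolicy) OnAck(ch *ChannelState, ackMs, nowMs int64, healthy []int) bool {
	if ackMs <= p.MaxAckMs {
		return false
	}
	if ch.RerouteCount > 0 && nowMs-ch.LastRerouteMs < p.MinRerouteIntervalMs {
		return false // too soon after the previous reroute
	}
	for _, t := range healthy {
		if t != ch.Target {
			ch.Target = t
			ch.RerouteCount++
			ch.LastRerouteMs = nowMs
			return true
		}
	}
	return false // nowhere else to go
}

func main() {
	policy := MigrationPolicy{MaxAckMs: 30, MinRerouteIntervalMs: 500}
	ch := &ChannelState{Target: 0}

	fmt.Println("fast ACK migrates:", policy.OnAck(ch, 5, 100, []int{0, 1, 2, 3}))
	fmt.Println("slow ACK migrates:", policy.OnAck(ch, 200, 150, []int{1, 2, 3}))
	fmt.Println("now on target:", ch.Target)
	fmt.Println("slow again, too soon:", policy.OnAck(ch, 200, 300, []int{2, 3}))
}
```

The remaining bytes of the stream would then continue on the new target; retransmission of unacknowledged ranges is outside the scope of this sketch.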

Production fabric-core migration boundary:

  • FabricChannelRouter opens channels on the best route from a FabricRouteSet.
  • Live FabricChannelObservation values update counters and trigger reroute on route failure, ACK latency threshold, or capacity pressure.
  • Reroutes switch route binding and pool target where applicable, increment RerouteCount, and emit FabricChannelRouteEvent.
  • MinRerouteInterval provides hysteresis so a noisy path does not cause route flapping.
  • FabricChannelRuntime binds the router to live QUIC fabric sessions for reliable byte payloads: it opens the logical stream, sends frames, measures ACK latency, reports observations to the router, and continues remaining payloads on a rerouted QUIC route after connect failure or slow ACKs.
  • QUIC logical session close cancels the stream read side before closing the write side, so high-churn short sessions release reader goroutines promptly instead of waiting for stream read deadlines.
  • Server-side QUIC stream handlers close their write side when the handler exits. This returns QUIC stream credit promptly during high-churn short sessions and prevents the last worker window from stalling on stream open.
  • Production request/response forwarding now builds a FabricRouteSet from all QUIC endpoint candidates for the next hop, sends the envelope over the chosen QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect failure or response timeout.
  • The legacy HTTP production forward carrier has been removed from the mesh runtime API. Production forwarding now exposes a single QUIC transport implementation; HTTP handlers remain only as node-local API surfaces and test harness entry points.
  • Production route choice includes live per-route active-channel pressure, so concurrent forwarding requests can spread across equivalent QUIC candidates instead of concentrating on the first/shortest route until it is saturated.
  • Production forwarding also keeps per-route health quarantine. A QUIC route that fails connect or response is marked unhealthy for a bounded retry window, skipped by subsequent channel scheduling, exposed in route-health snapshots, and restored automatically after the retry window or a successful send.
  • FabricRoutePressureTracker provides shared active-channel accounting for both production request/response forwarding and bulk FabricChannelRuntime traffic, so different traffic surfaces can make route decisions against the same live load signal.
  • Route pressure is observable through FabricRoutePressureSnapshot, including current active channels, max active channels, total acquire/release counts, and last acquired/released route IDs. Bulk runtime results and production QUIC forwarding snapshots expose this data for stress reports.
  • fabric-loadtest reports route IDs per stream attempt, global route_pressure, and per-target max_active_channels, so stress runs can verify channel distribution and release accounting after churn.
  • FabricRouteSetForPeerEndpointCandidates converts QUIC endpoint candidates into production route sets for direct, LAN, ICE/STUN-derived, reverse outbound, and relay fallback modes. Non-QUIC candidates are rejected instead of becoming alternate transports.
  • Node-agent discovery now advertises multiple QUIC candidates in one heartbeat instead of collapsing to one address: operator/public QUIC, listener QUIC, LAN/interface QUIC, STUN reflexive ice_quic, reverse/outbound-only, and relay_quic fallback. Candidate metadata carries local_segment_id, nat_group_id, stun_server, ice_foundation, relay_node_id, and relay_endpoint when configured.
  • Endpoint candidate scoring is QUIC-mode only. It ranks direct_quic, lan_quic, ice_quic, reverse_quic, and relay_quic using freshness, health observations, latency, reliability, region, policy tags, and live capacity pressure; HTTP/WebSocket labels are treated as rejected legacy candidates rather than alternate transports.
  • FabricTransportForTarget no longer accepts a WebSocket carrier. Transport selection can return only QUICFabricTransport; unsupported labels fail with a QUIC-required error.
  • Explicit transport labels are authoritative. A legacy label such as relay or outbound_reverse is rejected even when the endpoint string uses a quic:// scheme; configs must use relay_quic and reverse_quic.
  • Node-agent config loading rejects legacy advertised transport labels and HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay fabric endpoints. Bad endpoint posture fails before heartbeat publication.
  • Host-agent install/runtime validation rejects legacy mesh advertise transport labels and HTTP/WebSocket advertise endpoints before they can be passed into a node-agent Docker runtime.
  • JSON-advertised endpoint candidates and scoped synthetic config route recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery seeds, and rendezvous leases reject legacy transport labels and HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping entries.
  • Rendezvous relay leases and peer-connection intents now use relay_quic as the transport label. relay_control remains only a telemetry/control-state name for rendezvous admission counters, not a data-plane transport option.
  • Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint candidates are probed with QUIC session setup, pinned certificate metadata is honored, and HTTP/WebSocket endpoint schemes are rejected instead of being used as peer health transport.
  • Node-agent synthetic runtime no longer installs an HTTP peer transport as an inter-node carrier, and the shared mesh runtime package no longer exports an HTTP peer transport implementation. Any HTTP synthetic motion is confined to explicit legacy smoke harness code while fabric acceptance uses QUIC loadtest gates.
  • Control-plane and debug JSON mesh config loading is validated after conversion into runtime structures. Peer endpoint candidates, recovery seeds, rendezvous leases, and selected relay endpoints in route decisions must use QUIC labels/endpoints before they can update node runtime state.
  • Scoped synthetic mesh configs also reject legacy peer_endpoints directly, in addition to QUIC-only checks for endpoint candidates, recovery seeds, and rendezvous leases.
  • The old fabric-session WebSocket endpoint is no longer exposed by FabricSessionEnabled alone. It requires an explicit legacy test harness flag and is not part of the node-agent fabric transport surface.
  • Same local segment or same NAT group is treated as a LAN route by the planner, so a whole cluster piece behind one NAT can prefer private addresses between its own nodes while still maintaining outbound/relay visibility to the rest of the fabric.
  • Heartbeat telemetry includes fabric_runtime_report with QUIC-only status, route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate totals by transport label, route pressure, QUIC listener state, goroutines, heap usage, and the next recommended soak gate.
  • FabricOverlayTransport is the generic service-neutral send facade over route sets, FabricChannelRuntime, shared route pressure, and QUIC sessions. New traffic classes should enter the fabric through this layer or an equivalent runtime integration, not through HTTP/WebSocket fallbacks.
  • FabricChannelRuntime uses the same route health quarantine as production forwarding. Connect failures, stream send failures, and missing ACKs mark a route unhealthy for a bounded retry window, so later channels for any traffic class avoid that route until it recovers.
  • FabricOverlayTransport exposes route pressure and route health snapshots, and node heartbeat runtime metadata reports production route health plus the current quarantined route count.
  • Scheduler resource guardrails include HardMaxRoutePressure: when enabled, a route whose projected active-channel pressure exceeds the threshold is not admitted. This makes overload prevention enforceable in route choice rather than only observable after the fact.
  • The loadtest verdict fails on route-pressure leaks, acquire/release mismatch, missing acquire accounting, active channels above configured concurrency, or target distribution collapse/skew when multiple targets are healthy.
  • Continuous soak aggregation is bounded: fabric-loadtest keeps exact counters, per-target totals, route-mode counts, error/reroute totals, and bounded latency samples, while stream_samples is capped to diagnostic examples. Long 30-120 minute runs should not retain one result object per logical channel.
  • fabric-loadtest also keeps bounded error_samples, so high-volume churn reports preserve representative failed logical channels even when the first retained stream_samples are all successful.
  • Mixed topology verdicts require route-mode coverage when at least four healthy targets are present. A mixed-public-nat-lan-relay or nat-lan-relay run fails if it does not exercise lan_quic, ice_quic, reverse_quic, and relay_quic.
  • Loadtest verdicts also fail on legacy route-mode labels. Seeing relay, outbound_reverse, direct_http, direct_https, direct_tcp_tls, ws, wss, or websocket in route-mode telemetry is treated as a transport-layer violation even if payload delivery succeeds.
  • Healthy multi-target verdicts check both stream distribution and byte distribution. This prevents a run from passing with equal channel counts but most bulk bytes concentrated on one target or route.
  • Healthy multi-target verdicts also check route-pressure distribution through per-route max_active values. A run fails if live concurrent channel load collapses onto one target/route while alternatives are healthy.
  • Successful logical channels must receive one ACK per transmitted data frame. fabric-loadtest reports ack_mismatched_streams, per-target acks_received, and fails verdict when any stream is marked successful with fewer ACKs than sent frames.
  • ACK payloads carry the SHA-256 checksum of the received data-frame payload. fabric-loadtest validates the checksum for every ACK and fails verdict with ack_integrity_errors when the acknowledged bytes do not match the sent payload.
  • Failover accounting separates abandoned_frames from true ACK mismatch. A frame sent on a route that dies before ACK is counted as abandoned and the unacknowledged byte range is retransmitted on the next pool member; verdict still fails when non-abandoned frames are missing ACKs.
  • Loadtest data frames use deterministic per-frame payloads derived from stream index, logical stream ID, sequence, and byte offset. This makes checksum ACKs validate each frame identity instead of repeatedly validating one shared buffer pattern.
  • Mixed bulk/control stress is supported with -control-every, -control-bytes-per-stream, and -max-control-ack-p95-ms. Reports include control_streams, bulk_streams, control_ack_p95_ms, and bulk_ack_p95_ms; verdict fails when control ACK p95 exceeds the configured SLO.
  • Verified shared-host mixed smoke: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. The run produced 40/40 successful streams, 8 control streams, control_ack_p95_ms=1, bulk_ack_p95_ms=2, route_pressure.active_total=0, and matching acquire/release counts.
  • Verified shared-host mixed failover stress: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. Latest run fabric-loadtest-20260516-160751 produced 1000/1000 successful streams, 250 failover events after the planned target kill, 50 control streams, control_ack_p95_ms=3, bulk_ack_p95_ms=6, ack_p95_ms=6, ack_p99_ms=8, route_attempts_total=1250, route_pressure.active_total=0, max_active_total=128, and matching acquire/release counts. Full JSON artifacts are written under artifacts/fabric-loadtest.
  • Verified shared-host mixed degradation/migration stress: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100. The run produced 200/200 successful streams, 9 migration events, 20 control streams, control_ack_p95_ms=2, bulk_ack_p95_ms=7, route_pressure.active_total=0, max_active_total=32, and matching acquire/release counts.
  • Latest shared-host degradation/migration gate: fabric-loadtest-20260516-160710 ran 160 streams at 32 concurrency with 4 MiB bulk streams, inducing 180 ms of delay plus 0.5% loss after 50 ms. It produced 160/160 successful streams, 12 slow-ACK migrations, degraded-target quarantine, control_ack_p95_ms=3, bulk_ack_p95_ms=180, route_pressure.active_total=0, max_active_total=32, and matching acquire/release counts.
  • Short shared-host soak gate: fabric-loadtest-20260516-160943 with -Duration 45s, 1200 streams, 96 concurrency, four healthy targets, and mixed control/bulk traffic produced 1200/1200 successful streams, even 300/300/300/300 target distribution, channel_opens=1200, channel_closes=1200, channel_leaks=0, control_ack_p95_ms=4, ack_p95_ms=5, ack_p99_ms=8, route_pressure.active_total=0, max_active_total=96, and matching acquire/release counts.
  • Continuous soak mode is now explicit: add -Soak -Duration 30m or -Soak -Duration 120m to the Docker runner. In soak mode workers keep creating and closing logical channels until the duration expires, instead of stopping after a fixed stream list. This is the required gate for memory, goroutine, file descriptor, QUIC stream, and route-pressure stability.
  • Soak duration stops new logical channel creation but does not cancel channels already in flight. In-flight channels drain under their per-channel -StreamTimeout; the outer -ClientTimeout remains the hard scenario guardrail. This prevents the final active window from being counted as failed streams just because the soak timer expired.
  • Recommended real-topology soak command: powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100.
  • Soak reports include resource_samples and resource_summary with goroutine start/end/max/delta, heap allocation start/end/max/delta, heap objects, open file descriptor start/end/max/delta, GC delta, max active QUIC streams, and max active route load. Optional verdict gates -MaxGoroutineDelta and -MaxHeapDeltaMB fail the run if resource drift exceeds the configured budget.
  • Optional file descriptor verdict gates -MaxOpenFDDelta and -MaxOpenFDs are passed through the Docker runner to fabric-loadtest as -max-open-fd-delta and -max-open-fds. On Linux containers these read /proc/self/fd and fail the run if descriptor count drifts or peaks beyond the configured budget.
  • Optional throughput SLO gate -MinThroughputMbps is passed through the Docker runner to fabric-loadtest as -min-throughput-mbps. It fails verdict when useful data-plane throughput falls below the configured Mbps floor.
  • Optional short-session churn SLO gate -MinChannelChurnPerSec is passed through the Docker runner to fabric-loadtest as -min-channel-churn-per-sec. It fails verdict when logical channel open/close throughput falls below the configured channel-per-second floor.
  • Each logical channel has a per-channel timeout through -StreamTimeout in the Docker runner and -stream-timeout in fabric-loadtest. This keeps a wedged channel from holding a worker slot until the whole client run times out, preserving channel isolation under churn.
  • Each data frame has an ACK timeout through -AckTimeout in the Docker runner and -ack-timeout in fabric-loadtest. A missing ACK triggers reroute/pool retry without waiting for the full channel timeout.
  • Optional overall ACK latency gates -MaxAckP95Ms and -MaxAckP99Ms are passed through the Docker runner to fabric-loadtest as -max-ack-p95-ms and -max-ack-p99-ms. They fail healthy runs when aggregate data-plane ACK latency exceeds the configured SLO, independently of slow-route migration thresholds.
  • Optional per-target ACK latency gate -MaxTargetAckMs is passed through the Docker runner to fabric-loadtest as -max-target-ack-ms. It fails healthy runs when any target route reports a target_stats[*].max_ack_ms above the configured SLO.
  • Optional channel setup latency gates -MaxSetupP95Ms and -MaxSetupP99Ms are passed through the Docker runner to fabric-loadtest as -max-setup-p95-ms and -max-setup-p99-ms. They fail healthy runs when logical channel open/setup latency exceeds the configured SLO before payload transfer starts.
  • Optional reroute latency gates -MaxRerouteP95Ms and -MaxRerouteP99Ms are passed through the Docker runner to fabric-loadtest as -max-reroute-p95-ms and -max-reroute-p99-ms. They measure repeat channel setup latency after pool failover or slow-route migration and fail the run when route rebuild exceeds the configured SLO.
  • Docker shared-host summaries also include container_stats from docker stats --no-stream for each fabric server/client container that is still running at the end of the scenario. This records CPU percent, memory usage, memory percent, network IO, block IO, and PID count per node before cleanup.
  • Long soak runs can add -ContainerStatsSampleInterval 10s to collect periodic Docker container stats while traffic is in flight. The runner writes samples to container_stats_samples_path, includes container_stats_samples_count and container_stats_sample_summary, and records per-container memory/PID start, end, max, and delta values.
  • Optional container resource verdict gates -MaxContainerMemoryMiB and -MaxContainerPids fail the Docker scenario when any running fabric container exceeds the configured memory or PID budget at the final snapshot or at any periodic sample peak.
  • Verified short continuous soak: fabric-loadtest-20260516-163206 used -Soak -Duration 20s, mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk traffic. It produced 4000/4000 successful logical channels, channel_opens=4035, channel_closes=4035, channel_leaks=0, route_pressure.active_total=0, max_active_total=32, control_ack_p95_ms=2, ack_p95_ms=4, resource sample count 12, goroutine delta -18, max active streams 32, max active route load 32, and matching acquire/release counts.
  • Verified 60-second high-churn continuous soak with graceful drain: fabric-loadtest-20260516-174505 rebuilt the Docker image after changing soak duration to stop generation and let in-flight channels drain. The 4-node mixed-topology run used 128 concurrency, -Duration 60s, -StreamTimeout 15s, periodic resource/container sampling, mixed control/bulk traffic, throughput and churn SLOs. It produced 438740/438740 successful logical channels, channel_churn_per_sec=7310, throughput_bps=3473632858, ack_p95_ms=5, ack_p99_ms=6, control_ack_p95_ms=3, channel_opens=438740, channel_closes=438740, channel_leaks=0, open_failures=0, goroutines_delta=-1, open_fds_delta=4, all four route modes, clean route-pressure accounting, and verdict pass.
  • Verified pool failover soak with ACK timeout and abandoned-frame accounting: fabric-loadtest-20260516-175622 rebuilt the Docker image with ACK timeout, target quarantine, and abandoned-frame accounting, then killed target 0 after 3 seconds during a 30-second mixed-topology soak. It produced 136194/136194 successful logical channels, failed_streams=0, failover_events=82, abandoned_frames=75, ack_mismatched_streams=0, ack_integrity_errors=0, channel_churn_per_sec=4543, throughput_bps=2156155314, reroute_latency_p99_ms=9, channel_leaks=0, clean route-pressure accounting, and verdict pass.
  • Verified container stats gate: fabric-loadtest-20260516-163854 produced a passing 2-node mixed-topology smoke with -MaxContainerMemoryMiB 128 -MaxContainerPids 64 and included container_stats for both fabric server containers, with memory usage around 4-6 MiB per server and server PID counts 7-9. A negative control run with -MaxContainerMemoryMiB 1 failed as expected with container_memory_mib=...>1 verdict reasons.
  • Verified periodic container stats sampling: fabric-loadtest-20260516-164259 used -Soak -Duration 8s, -ContainerStatsSampleInterval 2s, mixed public/NAT/LAN/relay profile, and -MaxContainerMemoryMiB 128 -MaxContainerPids 64. It produced 2000/2000 successful logical channels, channel_opens=2009, channel_closes=2009, channel_leaks=0, even 1000/1000 target distribution, 400 control streams, ack_p95_ms=1, route_pressure.active_total=0, matching acquire/release counts, final server memory around 12-13 MiB, and periodic sample peaks for the client and both servers in fabric-loadtest-20260516-164259-container-stats-samples.json.
  • Verified high-churn goroutine drain after QUIC close cancellation: fabric-loadtest-20260516-164502 rebuilt the Docker image and repeated the 2-node mixed-topology continuous soak with -MaxGoroutineDelta 64, -MaxHeapDeltaMB 128, -ContainerStatsSampleInterval 2s, -MaxContainerMemoryMiB 128, and -MaxContainerPids 64. It produced 2000/2000 successful logical channels, channel_opens=2009, channel_closes=2009, channel_leaks=0, even 1000/1000 target distribution, control_ack_p95_ms=1, ack_p95_ms=1, route_pressure.active_total=0, matching acquire/release counts, and goroutines_delta=-2.
  • Verified file descriptor gate: fabric-loadtest-20260516-164725 rebuilt the Docker image and repeated the 2-node mixed-topology continuous soak with -MaxOpenFDDelta 8 and -MaxOpenFDs 128 in addition to goroutine, heap, container memory, and PID gates. It produced 2000/2000 successful logical channels, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, open_fds_start=15, open_fds_end=9, open_fds_max=19, and open_fds_delta=-6.
  • Verified bounded soak aggregation: fabric-loadtest-20260516-165051 rebuilt the Docker image after changing soak result storage to an aggregate collector. The 2-node mixed-topology soak produced 2000/2000 successful logical channels, even 1000/1000 target distribution, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, goroutines_delta=0, open_fds_delta=1, verdict pass, and only 25 retained stream_samples in the full report.
  • Verified mixed route-mode coverage gate: fabric-loadtest-20260516-165308 rebuilt the Docker image with the route coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even 1000/1000/1000/1000 target distribution, channel_leaks=0, route_pressure.active_total=0, matching acquire/release counts, and observed all required route modes: lan_quic, ice_quic, reverse_quic, and relay_quic.
  • Verified ACK integrity gate: fabric-loadtest-20260516-165544 rebuilt the Docker image with the ACK mismatch verdict and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_mismatched_streams=0, per-target frames_sent=6600 and acks_received=6600, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified ACK checksum integrity gate: fabric-loadtest-20260516-165926 rebuilt the Docker image with ACK payload checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_mismatched_streams=0, ack_integrity_errors=0, 26400 total data frames, 26400 ACKs, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified unique per-frame payload integrity: fabric-loadtest-20260516-170150 rebuilt the Docker image after switching loadtest traffic from a shared payload buffer to deterministic per-frame payloads. The 4-node mixed-topology soak produced 4000/4000 successful logical channels, ack_mismatched_streams=0, ack_integrity_errors=0, 26400 data frames, 26400 ACKs, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified throughput SLO gate: fabric-loadtest-20260516-170512 rebuilt the Docker image with -MinThroughputMbps 100 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, throughput_bps=212479668, ack_mismatched_streams=0, ack_integrity_errors=0, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified short-session churn SLO gate: fabric-loadtest-20260516-173320 rebuilt the Docker image with -MinChannelChurnPerSec 200, then ran a 4-node mixed-topology high-churn short-session smoke with 1000 one-frame logical channels. It produced 1000/1000 successful logical channels, channel_churn_per_sec=9478, channel_opens=1000, channel_closes=1000, channel_leaks=0, even target stream distribution, all four route modes, clean route-pressure accounting, and verdict pass.
  • Verified high-churn QUIC stream-credit regression gate: fabric-loadtest-20260516-174046 rebuilt the Docker image after closing the server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst of 5000 one-frame short logical channels at 128 concurrency with -MinChannelChurnPerSec 300 and -StreamTimeout 15s. It produced 5000/5000 successful logical channels, channel_churn_per_sec=21124, channel_opens=5000, channel_closes=5000, channel_leaks=0, open_failures=0, ack_mismatched_streams=0, ack_integrity_errors=0, even 1250/1250/1250/1250 target distribution, all four route modes, clean route-pressure accounting, and verdict pass.
  • Verified target byte distribution gate: fabric-loadtest-20260516-170731 rebuilt the Docker image with byte distribution verdicts and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even 1000/1000/1000/1000 stream distribution, exactly 53,248,000 bytes per target, throughput_bps=212488911, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified overall ACK latency SLO gate: fabric-loadtest-20260516-171001 rebuilt the Docker image with -MaxAckP95Ms 20 and -MaxAckP99Ms 50 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, ack_p95_ms=2, ack_p99_ms=3, ack_mismatched_streams=0, ack_integrity_errors=0, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified route-pressure distribution gate: fabric-loadtest-20260516-171216 rebuilt the Docker image with route-pressure distribution verdicts and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, even target stream and byte distribution, per-route max_active values of 13/12/13/13, route_pressure.active_total=0, matching acquire/release counts, and verdict pass.
  • Verified per-target ACK latency gate: fabric-loadtest-20260516-171454 rebuilt the Docker image with -MaxTargetAckMs 20 and repeated the 4-node mixed-topology soak. It produced 4000/4000 successful logical channels, per-target max_ack_ms values of 6/5/7/9, ack_p95_ms=3, ack_p99_ms=5, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified channel setup latency SLO gate: fabric-loadtest-20260516-171937 rebuilt the Docker image with -MaxSetupP95Ms 20 and -MaxSetupP99Ms 50, then repeated the 4-node mixed-topology soak with ACK, throughput, FD, goroutine, heap, container memory, and PID gates enabled. It produced 4000/4000 successful logical channels, setup_latency_p95_ms=0, ack_p95_ms=3, ack_p99_ms=3, throughput_bps=212572631, even target stream/byte distribution, all four route modes, clean channel/route pressure accounting, and verdict pass.
  • Verified reroute latency SLO gate: fabric-loadtest-20260516-172652 rebuilt the Docker image with -MaxRerouteP95Ms 100 and -MaxRerouteP99Ms 200, then ran a 4-node mixed-topology pool-failover stress with target 0 killed during load. It produced 400/400 successful logical channels, 100 pool failover events, reroute_latency_p95_ms=1, reroute_latency_p99_ms=2, route_attempts_total=500, ack_p95_ms=6, ack_p99_ms=8, throughput_bps=3863633075, clean channel/route pressure accounting, and verdict pass.
  • Mixed topology profile gate: fabric-loadtest-20260516-162037 used -TopologyProfile mixed-public-nat-lan-relay with 400 streams, 64 concurrency, four targets, and mixed control/bulk traffic. It produced 400/400 successful streams, 100 streams per target, route-mode reporting for lan_quic, ice_quic, reverse_quic, and relay_quic, control_ack_p95_ms=2, ack_p95_ms=7, channel_leaks=0, route_pressure.active_total=0, and matching acquire/release counts.
  • Verified strict QUIC route-mode gate: fabric-loadtest-20260516-182550 rebuilt the loadtest image with verdicts that flag legacy route modes and ran the 4-node mixed topology profile. It produced 400/400 successful logical channels, observed only lan_quic, ice_quic, reverse_quic, and relay_quic, kept ack_mismatched_streams=0, ack_integrity_errors=0, channel_leaks=0, clean route-pressure accounting, and verdict pass.
  • fabric-loadtest now also treats the configured target list as part of the acceptance surface: every target must be quic://.... Empty targets, bare host:port, HTTP(S), and WS/WSS targets produce a failing non_quic_targets=... verdict reason. Client mode also rejects those targets before dialing, so a bad stress command cannot accidentally exercise a non-QUIC path and only discover it after the run.
  • The shared Docker runner scripts/fabric/fabric-loadtest-docker-smoke.ps1 now has matching guardrails: it refuses local Docker Desktop contexts such as default/desktop-linux and validates generated targets before launch so the real-load smoke remains tied to the shared test Docker host and QUIC-only endpoints.
  • Shared Docker validation after those guardrails: fabric-loadtest-20260516-190049 rebuilt the Docker image on test-docker and ran 4 QUIC targets with 120 streams. It produced 120/120 successful logical channels, ack_p95_ms=3, setup_latency_p95_ms=21, clean open/close and route-pressure accounting, QUIC-only targets, and verdict pass.
  • Shared Docker mixed-topology failover validation: fabric-loadtest-20260516-190137 reused the image on test-docker, killed target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay profile. It produced 400/400 successful logical channels, 100 pool failover events, route_attempts_total=500, route modes ice_quic, reverse_quic, and relay_quic after the failed target was removed, ack_p95_ms=8, setup_latency_p95_ms=51, clean channel/route-pressure accounting, and verdict pass.
  • Shared Docker mixed-topology route coverage validation: fabric-loadtest-20260516-190207 ran the same 4-target mixed profile without target failure. It produced 400/400 successful logical channels, exactly 100 streams per target, observed lan_quic, ice_quic, reverse_quic, and relay_quic, kept ack_integrity_errors=0, channel_leaks=0, route_pressure.active_total=0, and verdict pass.
  • Load balancing under pool failover is now an acceptance gate. The first stricter shared-host rebuild, fabric-loadtest-20260516-190704, intentionally failed because all failed-target retries moved to the nearest live target, producing target_byte_distribution_skew and route_pressure_distribution_skew. The retry selector was then changed to spread failed-slot retries across the currently usable target set instead of selecting the next target in ring order.
  • Verified load-aware retry routing after the fix: fabric-loadtest-20260516-191028 rebuilt on test-docker, killed target 0 after 100ms, and repeated the 4-target mixed profile. It produced 400/400 successful logical channels, 100 pool failover events, surviving-target stream distribution of 134/133/133, surviving route-pressure max-active values of 30/25/27, ack_p95_ms=4, reroute_latency_p95_ms=1, clean acquire/release accounting, and verdict pass.
  • Verified 1000-channel mixed-topology stress: fabric-loadtest-20260516-193414 ran 1000 logical channels on test-docker with 128 concurrency, mixed control/bulk traffic, and the mixed-public-nat-lan-relay profile. It produced 1000/1000 successful logical channels, exact 250/250/250/250 target distribution, observed all four QUIC route modes (lan_quic, ice_quic, reverse_quic, relay_quic), throughput_bps=3629522849, channel_churn_per_sec=1919, ack_p95_ms=6, clean channel/route-pressure accounting, and verdict pass.
  • Verified 1000-channel pool-failover stress: fabric-loadtest-20260516-193444 killed target 0 after 100ms and ran 1000 logical channels with 128 concurrency. It produced 1000/1000 successful logical channels, 250 pool failover events, surviving-target distribution of 334/333/333, route_attempts_total=1250, ack_p95_ms=7, clean acquire/release accounting, and verdict pass.
  • Verified latency-degradation migration: fabric-loadtest-20260516-193515 applied tc netem delay 80ms to target 1, enabled slow-stream migration with -MaxAckMs 20, and ran 400 mixed-profile channels. It observed the impaired target in degraded_targets, produced 64 slow-ACK migrations, moved completed streams onto healthy targets with distribution 134/133/133, kept channel_leaks=0, ack_integrity_errors=0, clean route-pressure accounting, and verdict pass.
  • Shared Docker runner resource-sample fallback was verified with fabric-loadtest-20260516-190325: short runs now still persist container_stats_samples_path and a minimal per-container sample summary from final Docker stats when the background sampler has no time to emit samples.
  • Added scripts/fabric/fabric-acceptance-summary.ps1 to aggregate recent *-summary.json artifacts into an acceptance report. It captures verdicts, target distribution, route modes, churn, failover/migration counts, latency SLOs, and resource evidence, and it keeps intentionally failed runs visible as regression evidence for gates such as route-pressure skew detection.
  • The first 30-minute soak attempt (fabric-loadtest-20260516-193558) exposed a runner defect instead of a fabric defect: server containers were still started with a fixed -timeout 10m, so the three surviving servers exited around minute 10 while the client expected a 30-minute run. The Docker runner now exposes -ServerTimeout and defaults it to -ClientTimeout, so long soak server lifetimes match the client run.
  • The next soak attempt (fabric-loadtest-20260516-194816) passed the 10-minute server-timeout boundary but exposed another long-run behavior: a healthy surviving target could stay out of placement after a transient degradation mark. fabric-loadtest now uses a bounded target_quarantine_ttl for placement while still preserving historical degraded_targets observations in the report. The Docker runner exposes this as -TargetQuarantineTTL.
  • fabric-loadtest-20260516-200241 then exposed a soak-loop issue: it reported pass with 432869/432869 logical channels and clean accounting, but finished after about 95 seconds despite config.duration=30m. The cause was workers shutting down when a single stream hit its per-stream "context deadline exceeded" timeout; soak workers now exit only on the parent run context or the configured soak stop time, not on one channel's timeout.
  • fabric-loadtest-20260516-200939 and fabric-loadtest-20260516-201331 confirmed the soak loop fix by running full 3-minute preflights, but they failed the zero-failed-stream gate under target-kill injection. The issue was policy: the known killed target re-entered placement too quickly via the short transient quarantine TTL, causing some channels to spend their stream budget on a hard-dead endpoint. fabric-loadtest now separates transient target_quarantine_ttl from failure_quarantine_ttl, and the Docker runner exposes -FailureQuarantineTTL.
  • Verified 30-minute long-duration soak: fabric-loadtest-20260516-202532 ran on test-docker for 1800.010 seconds with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per logical channel, 10-second resource and container samples, and the mixed-public-nat-lan-relay profile. It produced 15,074,556/15,074,556 successful logical channels, 895,308,005,376 bytes, throughput_bps=3979124146, channel_churn_per_sec=8374, exact 3,768,639 streams per target, all four QUIC route modes, ack_p95_ms=5, ack_p99_ms=6, channel_leaks=0, matching 15,074,556 channel opens/closes, route_pressure.active_total=0, 458 container-stat samples, bounded memory/PID use, and verdict pass.
  • Verified real-node host-to-host QUIC smoke: home-1 ran the standalone fabric-loadtest client against a temporary QUIC server on test-docker at quic://docker-test.cin.su:19443. The run created 1000 short logical channels at 128 concurrency, mixed control and bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced throughput_bps=1177445403, channel_churn_per_sec=2478, ack_p95_ms=12, ack_p99_ms=21, setup_latency_p95_ms=118, zero failed streams, zero channel leaks, and verdict pass. The report is saved as artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json.
  • Published and registered node-agent release 0.2.280-fabricsession with Linux binary/native and Docker image artifacts. The release is intentionally not assigned to live node update policies yet because the current live node workload/env posture still advertises legacy direct_http and HTTP/HTTPS mesh endpoints. Before rollout, node configs must be migrated to quic://... endpoints and QUIC advertise labels, and the QUIC listener env must be enabled, for example RAP_MESH_QUIC_FABRIC_ENABLED=true plus RAP_MESH_QUIC_FABRIC_LISTEN_ADDR.
  • Loadtest degraded-target quarantine is observable through degraded_targets. When -impair-target and slow-stream migration are enabled, the verdict fails if no degraded target is observed or if degraded targets do not produce migration events. A shared-host validation run with 120 streams reported degraded_targets = { impaired_target: "slow_ack" }, 5 migration events, control_ack_p95_ms=3, and clean acquire/release accounting.
  • Channel lifecycle accounting is explicit in fabric-loadtest through channel_opens, channel_closes, and channel_leaks. The verdict fails on an open/close mismatch, on active stream leaks, or on a mismatch between route-pressure acquire counts and QUIC stream opens.
  • The next validation step is broader real mixed public/NAT/LAN topology across separate physical or VM hosts. The shared Docker host has verified the route model, stress gates, 30-minute stability, memory, goroutine, file descriptor, container resource, and route-pressure accounting. A true external NAT lab should now validate the same gates with independent NAT devices, public nodes, and local NAT-side cluster segments.

Initial SLO examples:

  • channel_setup_p95_ms < 200
  • reroute_p95_ms < 1000
  • control_latency_p99_ms < 100 under bulk load
  • packet_loss_after_recovery < 0.1%
  • no_route_pressure_over_90_percent_when_alternatives_exist
  • no_channel_table_growth_after_churn