Refactor RDP proxy handling and update related tests

This commit is contained in:
2026-05-17 20:38:35 +03:00
parent 8e9402580f
commit d551e57fd5
172 changed files with 22117 additions and 2509 deletions
@@ -88,6 +88,16 @@ Native host process responsible for node identity, enrollment, certificates, hea
Service Workload:
A workload executed on a node. It may be native or containerized. Examples: `rdp-worker`, `vnc-worker`, `entry-node`, `relay-node`, `file-storage-cache`.
Public/Admin HTTPS Ingress:
A service-edge role that listens on TCP `80`/`443` for browser/API HTTP/HTTPS and
forwards accepted requests into the QUIC-only fabric service channel. It is not
an authority service and does not imply permission to manage the cluster.
Admin UI Runtime:
A scoped admin service runtime. The global admin runtime may run only on
platform-owner trusted nodes; cluster, organization, and user portal runtimes
receive only their scoped projections.
Capability:
What a node can technically do. Example: `can_run_rdp_worker`.
@@ -162,6 +172,13 @@ policy, approvals, and audit.
20. Node-agent is the local supervisor for health, restart, update, and rollback
of node services, but Control Plane owns rollout policy and durable schema
migration orchestration.
21. HTTP/HTTPS is an external service edge only. Fabric node-to-node transport
remains QUIC-only.
22. A node that accepts `443` does not own management authority. Admin authority
belongs to signed roles, scoped claims, policy, and trusted runtime nodes.
23. Global admin runtime, policy authority, and audit sink must run only on
platform-owner controlled nodes. Organization and cluster portals must not
expose unrelated tenants, clusters, or internal mesh topology.
## Existing Node Management Semantics
@@ -0,0 +1,96 @@
# Distributed Authority Audit 2026-05-16
Status: the target architecture is distributed, but the live test cluster still
has centralized bootstrap authority components that must be removed before it
can be trusted in production.
## Fixed Requirements
- No single management/API/storage/update service is allowed to own cluster
truth.
- Control, storage, update, route authority, observer, and update-cache are node
roles in the fabric.
- A service endpoint can serve signed state, but cannot create trusted state by
itself.
- Node identity is cryptographic. IP addresses, DNS names, and NAT addresses are
endpoint candidates only.
- Nodes must publish real signed candidates for reachable interfaces,
STUN/ICE-reflexive addresses, passive reverse channels, and relay fallback.
- Nodes must verify signed control data locally before applying it.
## Live Cluster Findings
- The live cluster has one active `cluster_authorities` row:
`rap-ca-ed25519-09877466aa9b6b58b0f312b0b313ea33`.
- Its metadata says `storage=database_signer` and
`production_target=external_cluster_signer_or_hsm`.
- Release metadata for recent node-agent versions is signed, but signed by the
same database-backed authority.
- Synthetic mesh configs are signed and node-agent verifies them against the
pinned cluster authority.
- Node enrollment pins cluster authority into `identity.json`.
- Before this audit, host-agent update plans were carried with signatures but
host-agent did not locally reject unsigned plans when a pinned authority was
present.
## Changes Made In This Audit
- The fabric docs now declare distributed authority and quorum as mandatory.
- Node/fabric endpoints must be explicit `host:port`; DNS-only service names are
rejected as fabric endpoints.
- `home-1` no longer advertises `smoke.cin.su` as a fabric endpoint. It now
advertises its real interface candidate `quic://192.168.200.85:18080`.
- Host-agent now verifies `node_update_plan` authority signatures when
`identity.json` contains a pinned cluster authority public key.
- Unsigned update plans are rejected in that pinned-authority mode.
- Added `rap.cluster_authority.quorum.v1` and
`rap.cluster_authority.quorum_envelope.v1` contracts to both agent and
backend authority packages.
- Host-agent can now verify quorum-signed update plans when `identity.json`
contains a pinned quorum descriptor.
- Backend update plans now include an `authority_quorum` envelope when the
cluster authority metadata contains a quorum descriptor. If that configured
quorum cannot be satisfied, the update plan is not issued.
- Node bootstrap now carries `cluster_authority_quorum`; the approval authority
payload signs the quorum descriptor hash, and node-agent persists the
descriptor into `identity.json` after verifying the signed hash.
- Published `rap-node-agent` and `rap-host-agent` release
`0.2.284-quorumauthority`.
- Canaried `home-1` to `rap-node-agent 0.2.284-quorumauthority` and
`rap-host-agent 0.2.284-quorumauthority`; both reported healthy/noop after
update.
- Published `rap-node-agent` and `rap-host-agent` release
`0.2.285-quorumbootstrap`.
- Canaried `home-1` to `rap-node-agent 0.2.285-quorumbootstrap` and
`rap-host-agent 0.2.285-quorumbootstrap`; both reported current=target/noop.
`ifcm-rufms-s-mo1cr` was intentionally not updated because it is behind NAT
and still needs fabric/update-cache artifact reachability before further
rollout.
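The pinned-authority rule for update plans can be illustrated with a minimal Go sketch. It assumes a detached Ed25519 signature over the raw plan bytes; `SignedPlan`, `VerifyPlan`, and the error names are hypothetical and do not mirror the actual host-agent code or the `identity.json` layout.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
)

// SignedPlan is a hypothetical envelope: raw plan bytes plus an optional
// detached Ed25519 signature from the cluster authority.
type SignedPlan struct {
	Payload   []byte
	Signature []byte // empty when the plan is unsigned
}

var (
	ErrUnsignedPlan = errors.New("unsigned update plan rejected: authority is pinned")
	ErrBadSignature = errors.New("update plan signature does not match pinned authority")
	ErrBadPinnedKey = errors.New("pinned authority key has the wrong length")
)

// VerifyPlan applies the pinned-authority rule. A nil key models an
// identity.json with no pinned authority, in which case nothing is checked.
func VerifyPlan(pinned ed25519.PublicKey, plan SignedPlan) error {
	if pinned == nil {
		return nil
	}
	if len(pinned) != ed25519.PublicKeySize {
		return ErrBadPinnedKey
	}
	if len(plan.Signature) == 0 {
		return ErrUnsignedPlan
	}
	if !ed25519.Verify(pinned, plan.Payload, plan.Signature) {
		return ErrBadSignature
	}
	return nil
}

// demoResults signs a sample plan with a throwaway key and returns the outcome
// for a valid plan, an unsigned plan, and a plan whose payload was altered.
func demoResults() (valid, unsigned, tampered error) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	payload := []byte(`{"target":"0.2.285-quorumbootstrap"}`)
	sig := ed25519.Sign(priv, payload)
	valid = VerifyPlan(pub, SignedPlan{Payload: payload, Signature: sig})
	unsigned = VerifyPlan(pub, SignedPlan{Payload: payload})
	tampered = VerifyPlan(pub, SignedPlan{Payload: []byte(`{"target":"other"}`), Signature: sig})
	return
}

func main() {
	valid, unsigned, tampered := demoResults()
	fmt.Println("valid:", valid)
	fmt.Println("unsigned:", unsigned)
	fmt.Println("tampered:", tampered)
}
```

The sketch rejects an unsigned plan before any signature check, which matches the pinned-authority mode described above; quorum-signed plans would need an additional M-of-N check.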
## Remaining Production Blockers
- Replace `database_signer` with quorum authority:
M-of-N signatures from nodes or hardware/offline keys with
`control-authority` / `update-authority` roles.
- Store authority descriptors and role certificates as replicated signed state,
not only database rows.
- Require quorum envelopes for the remaining high-risk mutations: role
mutation, release creation, update policy mutation, route lease issuance,
relay/rendezvous lease issuance, storage placement, and authority rotation.
Node update plans and bootstrap quorum pinning now have the first contract
hooks, but production still needs real M-of-N signers.
- Add node-side verification of release metadata in addition to update-plan
verification; update-plan verification is now enforced by host-agent when a
pinned authority or pinned quorum descriptor exists.
- Add update-cache mirror selection through fabric endpoint candidates instead
of a single HTTP origin.
- Add signed endpoint-candidate epochs so peer directory gossip can survive API
replica loss.
- Add revocation/fencing epochs for compromised authority keys, nodes, and
update artifacts.
## Acceptance Rule
The cluster is not production-trust-ready while a single `database_signer` can
create authoritative cluster mutations. It may remain as a development bootstrap
signer only when every signed payload clearly identifies it as bootstrap and
nodes can be configured to reject it in production mode.
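A minimal sketch of the acceptance rule, assuming each signed envelope carries a signer-kind label and that signature verification has already happened elsewhere. The `Envelope` type, the `Accept` function, and the `quorum` label are hypothetical; only `database_signer` comes from the cluster metadata described above.

```go
package main

import "fmt"

// SignerKind labels who produced a signed payload.
type SignerKind string

const (
	SignerBootstrap SignerKind = "database_signer" // development bootstrap signer
	SignerQuorum    SignerKind = "quorum"          // hypothetical label for M-of-N authority
)

// Envelope is a hypothetical signed payload that identifies its signer kind.
type Envelope struct {
	Signer  SignerKind
	Payload []byte
}

// Accept decides whether a node may apply an already-verified envelope.
// In production mode the bootstrap signer is never authoritative.
func Accept(productionMode bool, env Envelope) error {
	switch env.Signer {
	case SignerQuorum:
		return nil
	case SignerBootstrap:
		if productionMode {
			return fmt.Errorf("signer %q is bootstrap-only and rejected in production mode", env.Signer)
		}
		return nil
	default:
		return fmt.Errorf("unknown signer kind %q", env.Signer)
	}
}

func main() {
	env := Envelope{Signer: SignerBootstrap}
	fmt.Println("development:", Accept(false, env))
	fmt.Println("production:", Accept(true, env))
}
```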
@@ -62,6 +62,88 @@ route and stream semantics.
7. Mobile nodes are first-class nodes with stricter capability scoring.
8. HTTP forwarding remains a compatibility and emergency fallback, not the
primary high-speed data plane.
9. There must be no single management service that can seize the fabric. Control,
storage, update distribution, route authority, and certificate authority are
fabric roles assigned to eligible nodes and protected by quorum signatures.
A web/API endpoint is only an access replica for a signed state log, not the
owner of cluster truth.
10. IP addresses and DNS names are never authority. Nodes announce signed
endpoint candidates for every usable interface, public/reflexive address,
local segment address, reverse channel, and relay fallback. Neighbors select
the usable candidate locally by policy, reachability, latency, load, and
trust.
## Distributed Control And Trust
The target fabric behaves like a distributed network, not a client/server
management product. The cluster has a replicated signed state log and many
service replicas. Any node with the right role can serve API, storage, update,
or route-coordinator duties, but no single replica can mutate cluster authority
alone.
Required trust model:
- Every node has a long-lived node identity key and short-lived role
certificates. The node identity is cryptographic; the current IP, hostname,
NAT address, or container name is only an endpoint candidate.
- Cluster authority is threshold-based. Root or high-risk changes require M-of-N
signatures from authorized control-authority nodes or hardware/offline
operator keys.
- Role certificates are scoped by action, organization/tenant, service,
partition, validity window, and allowed delegation depth.
- Update releases, route leases, peer-directory epochs, storage shard placement,
node approvals, role changes, and authority rotations are signed records in
the state log.
- A node accepts control data only when it can verify signatures, epoch/fencing,
expiry, target cluster, target node or role scope, and monotonic generation.
- A compromised API replica can withhold or delay data, but cannot forge updates,
route authority, new certificates, node roles, or cluster ownership.
- Bootstrap may use a temporary centralized signer for development, but
production mode must mark that signer as non-authoritative unless quorum
signatures are present.
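The threshold rule can be sketched as an M-of-N check over Ed25519 signatures. This is an illustration only: `QuorumDescriptor`, `QuorumSignature`, and `VerifyQuorum` are hypothetical names and do not reflect the `rap.cluster_authority.quorum.v1` contract, and scope, epoch, and expiry checks are left out.

```go
package main

import (
	"crypto/ed25519"
	"fmt"
)

// QuorumDescriptor is a hypothetical pinned descriptor: which keys may sign
// and how many distinct valid signatures are required.
type QuorumDescriptor struct {
	Threshold int
	Members   map[string]ed25519.PublicKey // key ID -> public key
}

// QuorumSignature is one member's signature over the payload.
type QuorumSignature struct {
	KeyID string
	Sig   []byte
}

// VerifyQuorum counts valid signatures from distinct known members and
// fails unless at least Threshold of them verify.
func VerifyQuorum(d QuorumDescriptor, payload []byte, sigs []QuorumSignature) error {
	if d.Threshold < 1 || d.Threshold > len(d.Members) {
		return fmt.Errorf("invalid threshold %d for %d members", d.Threshold, len(d.Members))
	}
	seen := make(map[string]bool)
	valid := 0
	for _, s := range sigs {
		pub, known := d.Members[s.KeyID]
		if !known || seen[s.KeyID] || len(pub) != ed25519.PublicKeySize {
			continue // unknown signer, repeated signer, or malformed key
		}
		if ed25519.Verify(pub, payload, s.Sig) {
			seen[s.KeyID] = true
			valid++
		}
	}
	if valid < d.Threshold {
		return fmt.Errorf("quorum not met: %d valid, %d required", valid, d.Threshold)
	}
	return nil
}

// demoQuorum builds a 2-of-3 quorum with throwaway keys and checks three cases.
func demoQuorum() (twoOfThree, oneOfThree, duplicated error) {
	d := QuorumDescriptor{Threshold: 2, Members: map[string]ed25519.PublicKey{}}
	privs := map[string]ed25519.PrivateKey{}
	for _, id := range []string{"a", "b", "c"} {
		pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
		if err != nil {
			panic(err)
		}
		d.Members[id], privs[id] = pub, priv
	}
	payload := []byte("role-change:node-7:relay")
	sign := func(id string) QuorumSignature {
		return QuorumSignature{KeyID: id, Sig: ed25519.Sign(privs[id], payload)}
	}
	twoOfThree = VerifyQuorum(d, payload, []QuorumSignature{sign("a"), sign("c")})
	oneOfThree = VerifyQuorum(d, payload, []QuorumSignature{sign("b")})
	duplicated = VerifyQuorum(d, payload, []QuorumSignature{sign("a"), sign("a")})
	return
}

func main() {
	ok, short, dup := demoQuorum()
	fmt.Println("two of three:", ok)
	fmt.Println("one of three:", short)
	fmt.Println("same signer twice:", dup)
}
```

Counting each key ID at most once matters: without it, a single compromised signer could satisfy the threshold by repeating its own signature.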
Authority levels:
- `root-authority`: rotates cluster root and quorum membership. Offline or
hardware-backed where possible. Rarely online.
- `control-authority`: approves node join, role changes, policy epochs, and
route-authority membership through quorum.
- `route-authority`: signs short-lived route leases and relay/rendezvous
assignments for a shard or partition.
- `update-authority`: signs release metadata, compatibility, artifact hashes,
rollback windows, and staged rollout policy.
- `storage-authority`: signs storage shard manifests, replication factors,
retention policy, and recovery epochs.
- `observer-authority`: can sign telemetry observations only; it cannot mutate
routing, roles, updates, or secrets.
Required anti-takeover controls:
- No bearer admin token may grant fabric-wide mutation without a signed authority
envelope.
- No node may accept unsigned update metadata or an artifact whose hash is not
signed by update-authority quorum.
- No node may accept unsigned route changes for production channels.
- No node may promote itself into control, storage, update, relay, or route
authority roles without a quorum-signed role certificate.
- Authority and role certificates must have short validity, explicit scopes, and
revocation/fencing epochs.
- Nodes must pin the cluster root/quorum descriptor and reject unexpected root
changes unless the old quorum signs the transition or an offline recovery
policy is invoked.
Endpoint state is also distributed:
- Nodes publish signed endpoint-candidate sets containing local interfaces,
public/reflexive STUN/ICE candidates, NAT group/local segment identifiers,
relay fallback, and passive reverse-channel availability.
- Endpoint candidates expire quickly. When a node changes IP, it reconnects
passively to any reachable fabric peer or API replica and publishes a new
signed candidate epoch.
- Peers keep using cached valid candidates and route leases while refreshing
from any reachable replica or neighbor gossip path.
- Neighbor selection is local and latency/load-aware; the state log announces
facts and policy, not a forced single next hop.
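Local candidate selection might look like the following sketch. It models only expiry, latency, and load; policy, trust, and signature verification are omitted, and the `Candidate` type and `SelectCandidate` function are hypothetical. The addresses other than `192.168.200.85:18080` are documentation-range placeholders.

```go
package main

import (
	"fmt"
	"time"
)

// Candidate is a hypothetical, already-verified endpoint candidate.
type Candidate struct {
	Endpoint       string
	Mode           string // for example "lan_quic" or "relay_quic"
	ExpiresAt      time.Time
	RTT            time.Duration
	ActiveChannels int
}

// SelectCandidate skips expired candidates and returns the one with the
// lowest RTT, breaking ties by the lower active-channel count.
func SelectCandidate(now time.Time, cands []Candidate) (Candidate, bool) {
	var best Candidate
	found := false
	for _, c := range cands {
		if !now.Before(c.ExpiresAt) {
			continue // candidate has expired
		}
		if !found || better(c, best) {
			best, found = c, true
		}
	}
	return best, found
}

func better(a, b Candidate) bool {
	if a.RTT != b.RTT {
		return a.RTT < b.RTT
	}
	return a.ActiveChannels < b.ActiveChannels
}

// demoSelect runs the selection over one expired and two fresh candidates.
func demoSelect() (string, bool) {
	now := time.Unix(1000, 0)
	cands := []Candidate{
		{Endpoint: "quic://203.0.113.10:18080", Mode: "direct_quic",
			ExpiresAt: now.Add(-time.Second), RTT: time.Millisecond},
		{Endpoint: "quic://192.168.200.85:18080", Mode: "lan_quic",
			ExpiresAt: now.Add(30 * time.Second), RTT: 2 * time.Millisecond},
		{Endpoint: "quic://198.51.100.7:18080", Mode: "relay_quic",
			ExpiresAt: now.Add(30 * time.Second), RTT: 40 * time.Millisecond},
	}
	c, ok := SelectCandidate(now, cands)
	return c.Endpoint, ok
}

func main() {
	endpoint, ok := demoSelect()
	fmt.Println(endpoint, ok)
}
```

In the demo the fastest candidate has already expired, so the LAN candidate is chosen even though its RTT is higher.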
## Node Roles
@@ -0,0 +1,845 @@
# Fabric-First Transport And Stress Plan
Status: fabric-first implementation baseline is active. QUIC-only transport,
route planning, runtime reroute/failover, pressure accounting, shared-host
stress gates, 1000-channel load, failure/degradation gates, and a 30-minute
real-byte soak are implemented and verified. The remaining work is broader
coverage of real topologies as the cluster grows.
This project is now fabric-first. Work on service payloads, service adapter
expansion, and Android VPN transport is paused until the fabric transport layer
is complete and proven under real load.
## Goal
The fabric is a distributed QUIC overlay over the current IPv4 network. Nodes
may have public addresses, sit behind NAT, or represent a whole local segment
behind one NAT. The fabric must expose a single logical transport layer where
nodes can reach each other directly, through local segment paths, through
passive outbound tunnels, or through relay hops without changing the data-plane
protocol.
QUIC is the only data-plane protocol. Direct, LAN, relay, reverse, and
ICE-selected paths are route modes inside the same QUIC fabric, not alternative
transports.
The fabric must not depend on one management service for authority. API,
storage, update-cache, route-coordinator, observer, and authority duties are
roles inside the mesh. A reachable API endpoint can distribute signed state, but
it cannot be the source of truth by itself. Nodes accept control data,
configuration, route leases, update plans, and role changes only when the
signatures, quorum rules, scopes, epochs, and expiry windows verify locally.
## Required Fabric Behavior
- Address channels by `node_id`, `pool_id`, or service target, not by raw IP.
- Keep endpoint candidates for public QUIC, LAN QUIC, reverse/passive QUIC,
relay QUIC, and future ICE-derived QUIC paths.
- Treat DNS names such as web/admin/API domains as service endpoints only, not
node identity or fabric authority.
- Require node-published endpoint candidates to include explicit `host:port`,
reachability, connectivity mode, NAT/local-segment metadata, source, and
freshness.
- Prefer local segment paths for nodes that share a NAT/local network.
- Keep outbound passive QUIC control/data adjacencies from NATed nodes to
reachable public or relay nodes.
- Build logical channels over shared QUIC adjacencies instead of opening one
physical QUIC connection per channel.
- Maintain primary, warm standby, and fallback route sets per channel.
- Rebuild a channel when an intermediate hop fails.
- Switch to another pool member when the target is a pool and the current
endpoint fails.
- Reroute slow channels when a faster path exists and the reroute will not harm
aggregate fabric throughput.
- Spread channels across available routes so the shortest path is not saturated
while other nodes are idle.
- Isolate channels with per-channel flow control, traffic classes, backpressure,
quotas, and fairness scheduling.
- Report per-node, per-link, per-route, and per-channel load and failure causes.
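The spreading requirement can be sketched as least-loaded route admission with acquire/release accounting. This is a simplified illustration, not the repository's tracker: `PressureTracker` and its methods are hypothetical, and health, latency, and traffic classes are not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// PressureTracker counts live channels per route so new channels can be
// spread instead of piling onto the first or shortest route.
type PressureTracker struct {
	mu     sync.Mutex
	routes []string       // preference order, for example shortest path first
	active map[string]int // live channel count per route
}

func NewPressureTracker(routes ...string) *PressureTracker {
	return &PressureTracker{routes: routes, active: make(map[string]int)}
}

// Acquire binds a new channel to the least-loaded route. Ties go to the more
// preferred route. It reports false when every route is at maxPerRoute.
func (t *PressureTracker) Acquire(maxPerRoute int) (string, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	best := -1
	for i, r := range t.routes {
		if t.active[r] >= maxPerRoute {
			continue // route is full, do not admit here
		}
		if best < 0 || t.active[r] < t.active[t.routes[best]] {
			best = i
		}
	}
	if best < 0 {
		return "", false
	}
	t.active[t.routes[best]]++
	return t.routes[best], true
}

// Release returns one channel slot on the given route.
func (t *PressureTracker) Release(route string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.active[route] > 0 {
		t.active[route]--
	}
}

func main() {
	tr := NewPressureTracker("direct", "relay")
	for i := 0; i < 5; i++ {
		route, ok := tr.Acquire(2)
		fmt.Println(i, route, ok)
	}
	tr.Release("direct")
	fmt.Println(tr.Acquire(2))
}
```

With two routes and a per-route cap of two, the first four channels alternate between the routes and the fifth is refused until a slot is released.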
## Service Channel Boundary
The fabric is the only component that builds and maintains transport channels.
VPN, RDP, SSH, web ingress, file transfer, and future adapters are applications
above the fabric. They must not select raw QUIC endpoints, pin exit nodes as a
transport concern, open fallback transports, or implement route repair.
Every service starts by submitting a fabric service channel request:
```json
{
"schema_version": "rap.fabric_service_channel_request.v1",
"channel_id": "vpn-session-or-service-session-id",
"source_role": "vpn-client | rdp-client | service-adapter",
"service_class": "vpn_packets | rdp | ssh | file_transfer | web",
"target": {
"kind": "pool",
"pool_ids": ["home-ipv4"],
"service_role": "ipv4-egress"
},
"traffic": {
"mode": "duplex",
"application_protocol_agnostic": true,
"flow_distribution": "latency_and_load_aware"
},
"resilience": {
"min_active_paths": 1,
"warm_standby_paths": 1,
"failover": "pool_member_or_next_authorized_pool",
"reroute_on": ["route_failure", "latency_regression", "loss_regression", "backpressure"]
}
}
```
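For illustration, the request above could be decoded in Go roughly as follows. The JSON field names come from the example; the Go type names, the `ParseRequest` helper, and the strict `schema_version` check are assumptions rather than the project's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const RequestSchema = "rap.fabric_service_channel_request.v1"

type ChannelRequest struct {
	SchemaVersion string     `json:"schema_version"`
	ChannelID     string     `json:"channel_id"`
	SourceRole    string     `json:"source_role"`
	ServiceClass  string     `json:"service_class"`
	Target        Target     `json:"target"`
	Traffic       Traffic    `json:"traffic"`
	Resilience    Resilience `json:"resilience"`
}

type Target struct {
	Kind        string   `json:"kind"`
	PoolIDs     []string `json:"pool_ids"`
	ServiceRole string   `json:"service_role"`
}

type Traffic struct {
	Mode                        string `json:"mode"`
	ApplicationProtocolAgnostic bool   `json:"application_protocol_agnostic"`
	FlowDistribution            string `json:"flow_distribution"`
}

type Resilience struct {
	MinActivePaths   int      `json:"min_active_paths"`
	WarmStandbyPaths int      `json:"warm_standby_paths"`
	Failover         string   `json:"failover"`
	RerouteOn        []string `json:"reroute_on"`
}

// ParseRequest decodes a request and rejects unknown schema versions.
func ParseRequest(raw []byte) (ChannelRequest, error) {
	var req ChannelRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return ChannelRequest{}, fmt.Errorf("decode request: %w", err)
	}
	if req.SchemaVersion != RequestSchema {
		return ChannelRequest{}, fmt.Errorf("unsupported schema_version %q", req.SchemaVersion)
	}
	return req, nil
}

func main() {
	raw := []byte(`{
	  "schema_version": "rap.fabric_service_channel_request.v1",
	  "channel_id": "vpn-session-1",
	  "source_role": "vpn-client",
	  "service_class": "vpn_packets",
	  "target": {"kind": "pool", "pool_ids": ["home-ipv4"], "service_role": "ipv4-egress"},
	  "traffic": {"mode": "duplex", "application_protocol_agnostic": true,
	              "flow_distribution": "latency_and_load_aware"},
	  "resilience": {"min_active_paths": 1, "warm_standby_paths": 1,
	                 "failover": "pool_member_or_next_authorized_pool",
	                 "reroute_on": ["route_failure", "backpressure"]}
	}`)
	req, err := ParseRequest(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Target.PoolIDs, req.Resilience.RerouteOn)
}
```

Note that the request names a pool and a service role but no endpoint, which keeps endpoint choice inside the fabric.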
The fabric responds with a signed route bundle containing a short-lived
`rap.fabric_route_lease.v1`. The lease names the target pool, primary path,
warm standby paths, multipath hints, and rebuild policy. Physical endpoint
candidates are visible only to the fabric runtime as lease material; service
adapters do not rank, pin, or fail over endpoints themselves. A service adapter
receives only a duplex channel handle and service metadata:
- Android VPN: TUN packet reader/writer only.
- `ipv4-egress`: NAT/ordinary IPv4 exit only.
- RDP: protocol/session adapter only; server address, protocol, credentials,
rendering, and clipboard are RDP service metadata, not fabric routing.
Temporary compatibility fields such as `exit_candidates` may exist only inside
the fabric route bundle consumed by the fabric runtime. Service code must treat
them as opaque and must not schedule routes from them.
The VPN client runtime accepts only `fabric_service_channel_request` plus
`fabric_route_bundle.route_lease`. The Android service may keep a deprecated
diagnostic endpoint cache, but packet routing must come from the lease. If a
path fails, slows down, or its target pool member dies, the fabric lease/rebuild
policy is the authority; the VPN service continues writing packets to the
channel and does not switch protocols.
## Distributed Authority Requirements
- No single control-plane/API/storage/update node can mutate the cluster alone.
- Cluster root and high-risk role changes require threshold signatures from
authorized control-authority keys.
- Update releases require signed metadata, signed artifact hashes, compatibility
constraints, rollout scope, and rollback windows; mirrors may serve bytes but
cannot change what is trusted.
- Route leases, relay leases, rendezvous assignments, peer-directory epochs, and
endpoint candidate epochs are signed and short-lived.
- Nodes cache the last valid signed state and continue routing through peers,
relay fallbacks, and passive reverse channels when API replicas are down.
- A compromised replica may delay or omit data, but must not be able to forge
role assignment, route authority, update authority, storage placement, or node
ownership.
- Development `database_signer` mode is not production authority. Production
acceptance requires quorum-signed envelopes for node join, role mutation,
mesh config, route leases, update plans, and release metadata.
## Implementation Layers
1. Discovery layer: STUN/ICE, LAN candidates, public candidates, reverse
tunnels, relay candidates.
2. Fabric adjacency layer: long-lived QUIC neighbor sessions with capacity,
health, and pressure metrics.
3. Routing layer: latency-aware and load-aware route sets with relay fallback
and pool failover.
4. Channel layer: millions of logical channels with independent lifecycle,
flow control, and statistics.
## Stress Requirements
Ping tests are not sufficient for fabric acceptance. The fabric must pass real byte-transfer load:
- 1000 concurrent streams from different source nodes to different destination
nodes.
- Mixed long-lived and short-lived channels.
- Aggressive create/delete churn.
- Many-to-one, one-to-many, and many-to-many traffic.
- Direct, LAN, relay, multi-hop, and reverse tunnel paths.
- Endpoint pool failover under load.
- Intermediate relay/node failure and route rebuild under load.
- Induced latency, packet loss, bandwidth caps, and route saturation.
- Control/interactive traffic surviving bulk traffic.
- No sustained overload of one path when alternatives exist.
- No goroutine, memory, stream, or file descriptor leak after churn.
## Required Stress Report
Every stress run must produce machine-readable JSON with:
- topology and scenario profile;
- channel setup/teardown counts and latency;
- total and per-channel throughput;
- per-node and per-route capacity pressure;
- p50/p95/p99 latency where measured;
- backpressure, rejection, and queue-depth counters;
- route switch and failover events;
- target pool failover events;
- QUIC connection and logical channel counts;
- final pass/fail verdict against SLO thresholds.
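As an illustration of the latency and verdict fields, the sketch below computes nearest-rank percentiles and a pass/fail verdict against a single p95 threshold. The `LatencyReport` shape and its JSON keys are assumptions, not the actual `fabric-loadtest` report schema; a real report would carry the other items listed above as well.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// LatencyReport is a hypothetical fragment of a machine-readable stress report.
type LatencyReport struct {
	P50Ms          int      `json:"p50_ms"`
	P95Ms          int      `json:"p95_ms"`
	P99Ms          int      `json:"p99_ms"`
	Verdict        string   `json:"verdict"`
	VerdictReasons []string `json:"verdict_reasons,omitempty"`
}

// Percentile returns the nearest-rank percentile (p in 1..100) of the samples.
// The input slice is copied, so the caller's data is left unsorted.
func Percentile(samplesMs []int, p int) int {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]int(nil), samplesMs...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // integer ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// BuildReport fills the percentiles and fails the verdict when p95 is above
// the SLO threshold.
func BuildReport(samplesMs []int, maxP95Ms int) LatencyReport {
	r := LatencyReport{
		P50Ms:   Percentile(samplesMs, 50),
		P95Ms:   Percentile(samplesMs, 95),
		P99Ms:   Percentile(samplesMs, 99),
		Verdict: "pass",
	}
	if r.P95Ms > maxP95Ms {
		r.Verdict = "fail"
		r.VerdictReasons = append(r.VerdictReasons,
			fmt.Sprintf("p95 %dms exceeds SLO %dms", r.P95Ms, maxP95Ms))
	}
	return r
}

func main() {
	out, err := json.MarshalIndent(BuildReport([]int{4, 1, 3, 2, 120}, 100), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```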
The first executable harness is `agents/rap-node-agent/cmd/fabric-loadtest`.
It supports in-process multi-node QUIC targets, short logical channel churn,
pool failover, target failure injection, and JSON reports.
Example local pool-failover run:
```powershell
go run ./cmd/fabric-loadtest -mode all -nodes 4 -streams 400 -concurrency 64 -bytes-per-stream 262144 -payload-size 16384 -fail-target 0 -fail-after 1ms -timeout 60s
```
The local harness is not a replacement for distributed host testing. It is the
first acceptance gate for protocol limits, channel lifecycle churn, pool
failover semantics, and reporting shape before running the same workload across
the shared test Docker host.
Distributed shared-host smoke:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 400 -Concurrency 64 -BytesPerStream 262144 -PayloadSize 16384 -FailTarget 0 -FailAfter 100ms
```
The distributed smoke builds/runs separate server and client containers on the
shared Docker host, sends real QUIC fabric frames across the Docker network,
kills one target node during load, and expects all channels assigned to that
target to fail over to the remaining pool.
The smoke summary includes the strict loadtest verdict plus `route_pressure`
and `transport_snapshot`; the script fails when the client verdict is not
`pass` and carries `verdict_reasons` into the thrown error.
`-TuneUdpBuffers` applies runtime host sysctls through a privileged one-shot
container before the run and records the observed values in the summary:
`net.core.rmem_max`, `net.core.wmem_max`, `net.core.rmem_default`, and
`net.core.wmem_default`.
Degraded-target and latency-aware admission run:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 300 -Concurrency 64 -BytesPerStream 131072 -PayloadSize 16384 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 100ms -ImpairLoss 0.5% -ProbeTargets -MaxTargetRttMs 80
```
This applies `tc netem` to one target, probes every target before mass channel
placement, excludes targets above the RTT threshold, and reports per-target
setup/duration percentiles. This is the first executable gate for
latency-aware placement; live channel migration after mid-stream degradation is
the next routing-layer gate.
Mid-stream migration gate:
```powershell
powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 80 -Concurrency 16 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30
```
This starts channels normally, applies `tc netem` after traffic is already in
flight, and expects slow logical streams to continue their remaining bytes on a
different target. The report exposes `migration_events`, `max_ack_ms`,
`ack_p95_ms`, `ack_p99_ms`, `route_attempts_total`, `reroute_causes`, and
per-target stats.
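The migration behavior can be pictured with a small simulation. It is a sketch under strong simplifications: targets are simulated functions rather than QUIC sessions, migration happens after the first slow ACK, and `Target`, `Result`, and `Transfer` are hypothetical names unrelated to the real runtime types.

```go
package main

import (
	"fmt"
	"time"
)

// Target is a simulated pool member. AckLatency returns the ACK delay for the
// n-th frame (0-based) that this target receives.
type Target struct {
	Name       string
	AckLatency func(n int) time.Duration
}

// Result summarizes where frames went and how often the stream migrated.
type Result struct {
	FramesPerTarget map[string]int
	Migrations      int
	MaxAck          time.Duration
}

// Transfer sends the given number of frames, starting on targets[0]. When an
// ACK is slower than maxAck and another target remains, the remaining frames
// continue on the next target.
func Transfer(targets []Target, frames int, maxAck time.Duration) Result {
	res := Result{FramesPerTarget: map[string]int{}}
	if len(targets) == 0 {
		return res
	}
	cur := 0
	for i := 0; i < frames; i++ {
		t := targets[cur]
		ack := t.AckLatency(res.FramesPerTarget[t.Name])
		res.FramesPerTarget[t.Name]++
		if ack > res.MaxAck {
			res.MaxAck = ack
		}
		if ack > maxAck && cur+1 < len(targets) {
			cur++ // migrate the rest of the stream
			res.Migrations++
		}
	}
	return res
}

// demoTransfer degrades the first target after three frames are in flight.
func demoTransfer() Result {
	degrading := Target{Name: "target-0", AckLatency: func(n int) time.Duration {
		if n >= 3 {
			return 200 * time.Millisecond // impairment applied mid-stream
		}
		return 2 * time.Millisecond
	}}
	healthy := Target{Name: "target-1", AckLatency: func(int) time.Duration {
		return 2 * time.Millisecond
	}}
	return Transfer([]Target{degrading, healthy}, 10, 30*time.Millisecond)
}

func main() {
	r := demoTransfer()
	fmt.Println(r.FramesPerTarget, r.Migrations, r.MaxAck)
}
```

In the demo the fourth frame sees a 200 ms ACK, so the stream records one migration and the remaining six frames go to the healthy target.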
Production fabric-core migration boundary:
- `FabricChannelRouter` opens channels on the best route from a `FabricRouteSet`.
- Live `FabricChannelObservation` values update counters and trigger reroute on
route failure, ACK latency threshold, or capacity pressure.
- Reroutes switch route binding and pool target where applicable, increment
`RerouteCount`, and emit `FabricChannelRouteEvent`.
- `MinRerouteInterval` provides hysteresis so a noisy path does not cause route
flapping.
- `FabricChannelRuntime` binds the router to live QUIC fabric sessions for
reliable byte payloads: it opens the logical stream, sends frames, measures
ACK latency, reports observations to the router, and continues remaining
payloads on a rerouted QUIC route after connect failure or slow ACKs.
- QUIC logical session close cancels the stream read side before closing the
write side, so high-churn short sessions release reader goroutines promptly
instead of waiting for stream read deadlines.
- Server-side QUIC stream handlers close their write side when the handler
exits. This returns QUIC stream credit promptly during high-churn short
sessions and prevents the last worker window from stalling on stream open.
- Production request/response forwarding now builds a `FabricRouteSet` from all
QUIC endpoint candidates for the next hop, sends the envelope over the chosen
QUIC route, and reroutes to warm standby/fallback QUIC candidates on connect
failure or response timeout.
- The legacy HTTP production forward carrier has been removed from the mesh
runtime API. Production forwarding now exposes a single QUIC transport
implementation; HTTP handlers remain only as node-local API surfaces and test
harness entry points.
- Production route choice includes live per-route active-channel pressure, so
concurrent forwarding requests can spread across equivalent QUIC candidates
instead of concentrating on the first/shortest route until it is saturated.
- Production forwarding also keeps per-route health quarantine. A QUIC route
that fails connect or response is marked unhealthy for a bounded retry window,
skipped by subsequent channel scheduling, exposed in route-health snapshots,
and restored automatically after the retry window or a successful send.
- `FabricRoutePressureTracker` provides shared active-channel accounting for
both production request/response forwarding and bulk `FabricChannelRuntime`
traffic, so different traffic surfaces can make route decisions against the
same live load signal.
- Route pressure is observable through `FabricRoutePressureSnapshot`, including
current active channels, max active channels, total acquire/release counts,
and last acquired/released route IDs. Bulk runtime results and production
QUIC forwarding snapshots expose this data for stress reports.
- `fabric-loadtest` reports route IDs per stream attempt, global
`route_pressure`, and per-target `max_active_channels`, so stress runs can
verify channel distribution and release accounting after churn.
- `FabricRouteSetForPeerEndpointCandidates` converts QUIC endpoint candidates
into production route sets for direct, LAN, ICE/STUN-derived, reverse
outbound, and relay fallback modes. Non-QUIC candidates are rejected instead
of becoming alternate transports.
- Node-agent discovery now advertises multiple QUIC candidates in one heartbeat
instead of collapsing to one address: operator/public QUIC, listener QUIC,
LAN/interface QUIC, STUN reflexive `ice_quic`, reverse/outbound-only, and
`relay_quic` fallback. Candidate metadata carries `local_segment_id`,
`nat_group_id`, `stun_server`, `ice_foundation`, `relay_node_id`, and
`relay_endpoint` when configured.
- Endpoint candidate scoring is QUIC-mode only. It ranks `direct_quic`,
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic` using freshness,
health observations, latency, reliability, region, policy tags, and live
capacity pressure; HTTP/WebSocket labels are treated as rejected legacy
candidates rather than alternate transports.
- `FabricTransportForTarget` no longer accepts a WebSocket carrier. Transport
selection can return only `QUICFabricTransport`; unsupported labels fail with
a QUIC-required error.
- Explicit transport labels are authoritative. A legacy label such as `relay`
or `outbound_reverse` is rejected even when the endpoint string uses a
`quic://` scheme; configs must use `relay_quic` and `reverse_quic`.
- Node-agent config loading rejects legacy advertised transport labels and
HTTP/WebSocket advertised endpoint schemes for mesh, STUN-reflexive, and relay
fabric endpoints. Bad endpoint posture fails before heartbeat publication.
- Host-agent install/runtime validation rejects legacy mesh advertise transport
labels and HTTP/WebSocket advertise endpoints before they can be passed into a
node-agent Docker runtime.
- JSON-advertised endpoint candidates and scoped synthetic config route
recovery surfaces are hard-fail QUIC-only: endpoint candidates, recovery
seeds, and rendezvous leases reject legacy transport labels and
HTTP/WebSocket endpoint schemes instead of silently downgrading or dropping
entries.
- Rendezvous relay leases and peer-connection intents now use `relay_quic` as
the transport label. `relay_control` remains only a telemetry/control-state
name for rendezvous admission counters, not a data-plane transport option.
- Peer connection health probing is aligned with the QUIC fabric: QUIC endpoint
candidates are probed with QUIC session setup, pinned certificate metadata is
honored, and HTTP/WebSocket endpoint schemes are rejected instead of being
used as peer health transport.
- Node-agent synthetic runtime no longer installs an HTTP peer transport as an
inter-node carrier, and the shared mesh runtime package no longer exports an
HTTP peer transport implementation. Any HTTP synthetic traffic is confined to
explicit legacy smoke harness code, while fabric acceptance uses QUIC loadtest
gates.
- Control-plane and debug JSON mesh config loading is validated after
conversion into runtime structures. Peer endpoint candidates, recovery seeds,
rendezvous leases, and selected relay endpoints in route decisions must use
QUIC labels/endpoints before they can update node runtime state.
- Scoped synthetic mesh configs also reject legacy `peer_endpoints` directly,
in addition to QUIC-only checks for endpoint candidates, recovery seeds, and
rendezvous leases.
- The old fabric-session WebSocket endpoint is no longer exposed by
`FabricSessionEnabled` alone. It requires an explicit legacy test harness flag
and is not part of the node-agent fabric transport surface.
- Nodes in the same local segment or NAT group are treated as LAN routes by the
planner, so an entire cluster segment behind one NAT can prefer private
addresses between its own nodes while still maintaining outbound/relay
visibility to the rest of the fabric.
- Heartbeat telemetry includes `fabric_runtime_report` with QUIC-only status,
route-set counts, QUIC candidate totals, rejected legacy/non-QUIC candidate
totals by transport label, route pressure, QUIC listener state, goroutines,
heap usage, and the next recommended soak gate.
- `FabricOverlayTransport` is the generic service-neutral send facade over
route sets, `FabricChannelRuntime`, shared route pressure, and QUIC sessions.
New traffic classes should enter the fabric through this layer or an
equivalent runtime integration, not through HTTP/WebSocket fallbacks.
- `FabricChannelRuntime` uses the same route health quarantine as production
forwarding. Connect failures, stream send failures, and missing ACKs mark a
route unhealthy for a bounded retry window, so later channels for any traffic
class avoid that route until it recovers.
- `FabricOverlayTransport` exposes route pressure and route health snapshots,
and node heartbeat runtime metadata reports production route health plus the
current quarantined route count.
- Scheduler resource guardrails include `HardMaxRoutePressure`: when enabled,
a route whose projected active-channel pressure exceeds the threshold is not
admitted. This makes overload prevention enforceable in route choice rather
than only observable after the fact.
- The loadtest verdict fails on route-pressure leaks, acquire/release mismatch,
missing acquire accounting, active channels above configured concurrency, or
target distribution collapse/skew when multiple targets are healthy.
- Continuous soak aggregation is bounded: `fabric-loadtest` keeps exact
counters, per-target totals, route-mode counts, error/reroute totals, and
bounded latency samples, while `stream_samples` is capped to diagnostic
examples. Long 30-120 minute runs should not retain one result object per
logical channel.
- `fabric-loadtest` also keeps bounded `error_samples`, so high-volume churn
reports preserve representative failed logical channels even when the first
retained `stream_samples` are all successful.
- Mixed topology verdicts require route-mode coverage when at least four
healthy targets are present. A `mixed-public-nat-lan-relay` or
`nat-lan-relay` run fails if it does not exercise `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`.
- Loadtest verdicts also fail on legacy route-mode labels. Seeing `relay`,
`outbound_reverse`, `direct_http`, `direct_https`, `direct_tcp_tls`, `ws`,
`wss`, or `websocket` in route-mode telemetry is treated as a transport-layer
violation even if payload delivery succeeds.
- Healthy multi-target verdicts check both stream distribution and byte
distribution. This prevents a run from passing with equal channel counts but
most bulk bytes concentrated on one target or route.
- Healthy multi-target verdicts also check route-pressure distribution through
per-route `max_active` values. A run fails if live concurrent channel load
collapses onto one target/route while alternatives are healthy.
- Successful logical channels must receive one ACK per transmitted data frame.
`fabric-loadtest` reports `ack_mismatched_streams`, per-target
`acks_received`, and fails verdict when any stream is marked successful with
fewer ACKs than sent frames.
- ACK payloads carry the SHA-256 checksum of the received data-frame payload.
`fabric-loadtest` validates the checksum for every ACK and fails verdict with
`ack_integrity_errors` when the acknowledged bytes do not match the sent
payload.
- Failover accounting separates `abandoned_frames` from true ACK mismatch. A
frame sent on a route that dies before ACK is counted as abandoned and the
unacknowledged byte range is retransmitted on the next pool member; verdict
still fails when non-abandoned frames are missing ACKs.
- Loadtest data frames use deterministic per-frame payloads derived from stream
index, logical stream ID, sequence, and byte offset. This makes checksum ACKs
validate each frame identity instead of repeatedly validating one shared
buffer pattern.
- Mixed bulk/control stress is supported with `-control-every`,
`-control-bytes-per-stream`, and `-max-control-ack-p95-ms`. Reports include
`control_streams`, `bulk_streams`, `control_ack_p95_ms`, and
`bulk_ack_p95_ms`; verdict fails when control ACK p95 exceeds the configured
SLO.
- Verified shared-host mixed smoke:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -Nodes 2 -Streams 40 -Concurrency 8 -BytesPerStream 65536 -PayloadSize 8192 -FailTarget -1 -ControlEvery 5 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
The run produced 40/40 successful streams, 8 control streams,
`control_ack_p95_ms=1`, `bulk_ack_p95_ms=2`,
`route_pressure.active_total=0`, and matching acquire/release counts.
- Verified shared-host mixed failover stress:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -FailTarget 0 -FailAfter 100ms -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
Latest run `fabric-loadtest-20260516-160751` produced 1000/1000 successful
streams, 250 failover events after the planned target kill, 50 control
streams, `control_ack_p95_ms=3`, `bulk_ack_p95_ms=6`, `ack_p95_ms=6`,
`ack_p99_ms=8`, `route_attempts_total=1250`,
`route_pressure.active_total=0`, `max_active_total=128`, and matching
acquire/release counts. Full JSON artifacts are written under
`artifacts/fabric-loadtest`.
- Verified shared-host mixed degradation/migration stress:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -TuneUdpBuffers -Nodes 4 -Streams 200 -Concurrency 32 -BytesPerStream 8388608 -PayloadSize 65536 -FailTarget -1 -ImpairTarget 0 -ImpairDelay 200ms -ImpairLoss 0.5% -ImpairAfter 50ms -MigrateSlowStreams -MaxAckMs 30 -ControlEvery 10 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
The run produced 200/200 successful streams, 9 migration events,
20 control streams, `control_ack_p95_ms=2`, `bulk_ack_p95_ms=7`,
`route_pressure.active_total=0`, `max_active_total=32`, and matching
acquire/release counts.
- Latest shared-host degradation/migration gate:
`fabric-loadtest-20260516-160710` ran 160 streams at 32 concurrency with 4 MiB
bulk streams and an induced impairment of 180 ms delay plus 0.5% loss after
50 ms. It produced 160/160 successful streams, 12 slow-ACK migrations,
degraded-target quarantine,
`control_ack_p95_ms=3`, `bulk_ack_p95_ms=180`,
`route_pressure.active_total=0`, `max_active_total=32`, and matching
acquire/release counts.
- Short shared-host soak gate:
`fabric-loadtest-20260516-160943` with `-Duration 45s`, 1200 streams,
96 concurrency, four healthy targets, and mixed control/bulk traffic produced
1200/1200 successful streams, even 300/300/300/300 target distribution,
`channel_opens=1200`, `channel_closes=1200`, `channel_leaks=0`,
`control_ack_p95_ms=4`, `ack_p95_ms=5`, `ack_p99_ms=8`,
`route_pressure.active_total=0`, `max_active_total=96`, and matching
acquire/release counts.
- Continuous soak mode is now explicit: add `-Soak -Duration 30m` or
`-Soak -Duration 120m` to the Docker runner. In soak mode workers keep
creating and closing logical channels until the duration expires, instead of
stopping after a fixed stream list. This is the required gate for memory,
goroutine, file descriptor, QUIC stream, and route-pressure stability.
- Soak duration stops new logical channel creation but does not cancel channels
already in flight. In-flight channels drain under their per-channel
`-StreamTimeout`; the outer `-ClientTimeout` remains the hard scenario
guardrail. This prevents the final active window from being counted as
failed streams just because the soak timer expired.
- Recommended real-topology soak command:
`powershell -ExecutionPolicy Bypass -File scripts/fabric/fabric-loadtest-docker-smoke.ps1 -DockerContext test-docker -SkipBuild -TuneUdpBuffers -Nodes 4 -Streams 1000 -Concurrency 128 -BytesPerStream 1048576 -PayloadSize 65536 -TopologyProfile mixed-public-nat-lan-relay -Soak -Duration 30m -ResourceSampleInterval 10s -MaxGoroutineDelta 64 -MaxHeapDeltaMB 512 -FailTarget -1 -ControlEvery 20 -ControlBytesPerStream 4096 -MaxControlAckP95Ms 100`.
- Soak reports include `resource_samples` and `resource_summary` with
goroutine start/end/max/delta, heap allocation start/end/max/delta, heap
objects, open file descriptor start/end/max/delta, GC delta, max active QUIC
streams, and max active route load.
Optional verdict gates `-MaxGoroutineDelta` and `-MaxHeapDeltaMB` fail the
run if resource drift exceeds the configured budget.
- Optional file descriptor verdict gates `-MaxOpenFDDelta` and `-MaxOpenFDs`
are passed through the Docker runner to `fabric-loadtest` as
`-max-open-fd-delta` and `-max-open-fds`. On Linux containers these read
`/proc/self/fd` and fail the run if descriptor count drifts or peaks beyond
the configured budget.
- Optional throughput SLO gate `-MinThroughputMbps` is passed through the Docker
runner to `fabric-loadtest` as `-min-throughput-mbps`. It fails verdict when
useful data-plane throughput falls below the configured Mbps floor.
- Optional short-session churn SLO gate `-MinChannelChurnPerSec` is passed
through the Docker runner to `fabric-loadtest` as
`-min-channel-churn-per-sec`. It fails verdict when logical channel
open/close throughput falls below the configured channel-per-second floor.
- Each logical channel has a per-channel timeout through `-StreamTimeout`
in the Docker runner and `-stream-timeout` in `fabric-loadtest`. This keeps a
wedged channel from holding a worker slot until the whole client run times
out, preserving channel isolation under churn.
- Each data frame has an ACK timeout through `-AckTimeout` in the Docker runner
and `-ack-timeout` in `fabric-loadtest`. A missing ACK triggers reroute/pool
retry without waiting for the full channel timeout.
- Optional overall ACK latency gates `-MaxAckP95Ms` and `-MaxAckP99Ms` are
passed through the Docker runner to `fabric-loadtest` as
`-max-ack-p95-ms` and `-max-ack-p99-ms`. They fail healthy runs when
aggregate data-plane ACK latency exceeds the configured SLO, independently
of slow-route migration thresholds.
- Optional per-target ACK latency gate `-MaxTargetAckMs` is passed through the
Docker runner to `fabric-loadtest` as `-max-target-ack-ms`. It fails healthy
runs when any target route reports a `target_stats[*].max_ack_ms` above the
configured SLO.
- Optional channel setup latency gates `-MaxSetupP95Ms` and `-MaxSetupP99Ms`
are passed through the Docker runner to `fabric-loadtest` as
`-max-setup-p95-ms` and `-max-setup-p99-ms`. They fail healthy runs when
logical channel open/setup latency exceeds the configured SLO before payload
transfer starts.
- Optional reroute latency gates `-MaxRerouteP95Ms` and `-MaxRerouteP99Ms`
are passed through the Docker runner to `fabric-loadtest` as
`-max-reroute-p95-ms` and `-max-reroute-p99-ms`. They measure repeat channel
setup latency after pool failover or slow-route migration and fail the run
when route rebuild exceeds the configured SLO.
- Docker shared-host summaries also include `container_stats` from
`docker stats --no-stream` for each fabric server/client container that is
still running at the end of the scenario. This records CPU percent, memory
usage, memory percent, network IO, block IO, and PID count per node before
cleanup.
- Long soak runs can add `-ContainerStatsSampleInterval 10s` to collect
periodic Docker container stats while traffic is in flight. The runner writes
samples to `container_stats_samples_path`, includes
`container_stats_samples_count` and `container_stats_sample_summary`, and
records per-container memory/PID start, end, max, and delta values.
- Optional container resource verdict gates `-MaxContainerMemoryMiB` and
`-MaxContainerPids` fail the Docker scenario when any running fabric
container exceeds the configured memory or PID budget at the final snapshot
or at any periodic sample peak.
- Verified short continuous soak:
`fabric-loadtest-20260516-163206` used `-Soak -Duration 20s`,
mixed public/NAT/LAN/relay profile, 32 concurrency, and mixed control/bulk
traffic. It produced 4000/4000 successful logical channels,
`channel_opens=4035`, `channel_closes=4035`, `channel_leaks=0`,
`route_pressure.active_total=0`, `max_active_total=32`,
`control_ack_p95_ms=2`, `ack_p95_ms=4`, resource sample count 12,
goroutine delta -18, max active streams 32, max active route load 32, and
matching acquire/release counts.
- Verified 60-second high-churn continuous soak with graceful drain:
`fabric-loadtest-20260516-174505` rebuilt the Docker image after changing
soak duration to stop generation and let in-flight channels drain. The
4-node mixed-topology run used 128 concurrency, `-Duration 60s`,
`-StreamTimeout 15s`, periodic resource/container sampling, mixed
control/bulk traffic, and throughput and churn SLOs. It produced 438740/438740
successful logical channels, `channel_churn_per_sec=7310`,
`throughput_bps=3473632858`, `ack_p95_ms=5`, `ack_p99_ms=6`,
`control_ack_p95_ms=3`, `channel_opens=438740`,
`channel_closes=438740`, `channel_leaks=0`, `open_failures=0`,
`goroutines_delta=-1`, `open_fds_delta=4`, all four route modes, clean
route-pressure accounting, and verdict `pass`.
- Verified pool failover soak with ACK timeout and abandoned-frame accounting:
`fabric-loadtest-20260516-175622` rebuilt the Docker image with ACK timeout,
target quarantine, and abandoned-frame accounting, then killed target 0 after
3 seconds during a 30-second mixed-topology soak. It produced 136194/136194
successful logical channels, `failed_streams=0`, `failover_events=82`,
`abandoned_frames=75`, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, `channel_churn_per_sec=4543`,
`throughput_bps=2156155314`, `reroute_latency_p99_ms=9`,
`channel_leaks=0`, clean route-pressure accounting, and verdict `pass`.
- Verified container stats gate:
`fabric-loadtest-20260516-163854` produced a passing 2-node mixed-topology
smoke with `-MaxContainerMemoryMiB 128 -MaxContainerPids 64` and included
`container_stats` for both fabric server containers, with memory usage around
4-6 MiB per server and server PID counts of 7-9. A negative control run with
`-MaxContainerMemoryMiB 1` failed as expected with
`container_memory_mib=...>1` verdict reasons.
- Verified periodic container stats sampling:
`fabric-loadtest-20260516-164259` used `-Soak -Duration 8s`,
`-ContainerStatsSampleInterval 2s`, mixed public/NAT/LAN/relay profile, and
`-MaxContainerMemoryMiB 128 -MaxContainerPids 64`. It produced 2000/2000
successful logical channels, `channel_opens=2009`, `channel_closes=2009`,
`channel_leaks=0`, even 1000/1000 target distribution, 400 control streams,
`ack_p95_ms=1`, `route_pressure.active_total=0`, matching acquire/release
counts, final server memory around 12-13 MiB, and periodic sample peaks for
the client and both servers in
`fabric-loadtest-20260516-164259-container-stats-samples.json`.
- Verified high-churn goroutine drain after QUIC close cancellation:
`fabric-loadtest-20260516-164502` rebuilt the Docker image and repeated the
2-node mixed-topology continuous soak with `-MaxGoroutineDelta 64`,
`-MaxHeapDeltaMB 128`, `-ContainerStatsSampleInterval 2s`,
`-MaxContainerMemoryMiB 128`, and `-MaxContainerPids 64`. It produced
2000/2000 successful logical channels, `channel_opens=2009`,
`channel_closes=2009`, `channel_leaks=0`, even 1000/1000 target
distribution, `control_ack_p95_ms=1`, `ack_p95_ms=1`,
`route_pressure.active_total=0`, matching acquire/release counts, and
`goroutines_delta=-2`.
- Verified file descriptor gate:
`fabric-loadtest-20260516-164725` rebuilt the Docker image and repeated the
2-node mixed-topology continuous soak with `-MaxOpenFDDelta 8` and
`-MaxOpenFDs 128` in addition to goroutine, heap, container memory, and PID
gates. It produced 2000/2000 successful logical channels,
`channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, `open_fds_start=15`, `open_fds_end=9`,
`open_fds_max=19`, and `open_fds_delta=-6`.
- Verified bounded soak aggregation:
`fabric-loadtest-20260516-165051` rebuilt the Docker image after changing
soak result storage to an aggregate collector. The 2-node mixed-topology soak
produced 2000/2000 successful logical channels, even 1000/1000 target
distribution, `channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, `goroutines_delta=0`, `open_fds_delta=1`, verdict
`pass`, and only 25 retained `stream_samples` in the full report.
- Verified mixed route-mode coverage gate:
`fabric-loadtest-20260516-165308` rebuilt the Docker image with the route
coverage verdict and ran a 4-node mixed-topology soak. It produced 4000/4000
successful logical channels, even 1000/1000/1000/1000 target distribution,
`channel_leaks=0`, `route_pressure.active_total=0`, matching
acquire/release counts, and observed all required route modes:
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`.
- Verified ACK integrity gate:
`fabric-loadtest-20260516-165544` rebuilt the Docker image with the ACK
mismatch verdict and repeated the 4-node mixed-topology soak. It produced
4000/4000 successful logical channels, `ack_mismatched_streams=0`, per-target
`frames_sent=6600` and `acks_received=6600`, all four route modes, clean
channel/route pressure accounting, and verdict `pass`.
- Verified ACK checksum integrity gate:
`fabric-loadtest-20260516-165926` rebuilt the Docker image with ACK payload
checksums and repeated the 4-node mixed-topology soak. It produced 4000/4000
successful logical channels, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, 26400 total data frames, 26400 ACKs, all four route
modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified unique per-frame payload integrity:
`fabric-loadtest-20260516-170150` rebuilt the Docker image after switching
loadtest traffic from a shared payload buffer to deterministic per-frame
payloads. The 4-node mixed-topology soak produced 4000/4000 successful
logical channels, `ack_mismatched_streams=0`, `ack_integrity_errors=0`, 26400
data frames, 26400 ACKs, all four route modes, clean channel/route pressure
accounting, and verdict `pass`.
- Verified throughput SLO gate:
`fabric-loadtest-20260516-170512` rebuilt the Docker image with
`-MinThroughputMbps 100` and repeated the 4-node mixed-topology soak. It
produced 4000/4000 successful logical channels, `throughput_bps=212479668`,
`ack_mismatched_streams=0`, `ack_integrity_errors=0`, all four route modes,
clean channel/route pressure accounting, and verdict `pass`.
- Verified short-session churn SLO gate:
`fabric-loadtest-20260516-173320` rebuilt the Docker image with
`-MinChannelChurnPerSec 200`, then ran a 4-node mixed-topology high-churn
short-session smoke with 1000 one-frame logical channels. It produced
1000/1000 successful logical channels, `channel_churn_per_sec=9478`,
`channel_opens=1000`, `channel_closes=1000`, `channel_leaks=0`, even target
stream distribution, all four route modes, clean route-pressure accounting,
and verdict `pass`.
- Verified high-churn QUIC stream-credit regression gate:
`fabric-loadtest-20260516-174046` rebuilt the Docker image after closing the
server-side QUIC stream on handler exit and ran a 4-node mixed-topology burst
of 5000 one-frame short logical channels at 128 concurrency with
`-MinChannelChurnPerSec 300` and `-StreamTimeout 15s`. It produced 5000/5000
successful logical channels, `channel_churn_per_sec=21124`,
`channel_opens=5000`, `channel_closes=5000`, `channel_leaks=0`,
`open_failures=0`, `ack_mismatched_streams=0`, `ack_integrity_errors=0`,
even 1250/1250/1250/1250 target distribution, all four route modes, clean
route-pressure accounting, and verdict `pass`.
- Verified target byte distribution gate:
`fabric-loadtest-20260516-170731` rebuilt the Docker image with byte
distribution verdicts and repeated the 4-node mixed-topology soak. It
produced 4000/4000 successful logical channels, even 1000/1000/1000/1000
stream distribution, exactly 53,248,000 bytes per target,
`throughput_bps=212488911`, all four route modes, clean channel/route
pressure accounting, and verdict `pass`.
- Verified overall ACK latency SLO gate:
`fabric-loadtest-20260516-171001` rebuilt the Docker image with
`-MaxAckP95Ms 20` and `-MaxAckP99Ms 50` and repeated the 4-node
mixed-topology soak. It produced 4000/4000 successful logical channels,
`ack_p95_ms=2`, `ack_p99_ms=3`, `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, all four route modes, clean channel/route pressure
accounting, and verdict `pass`.
- Verified route-pressure distribution gate:
`fabric-loadtest-20260516-171216` rebuilt the Docker image with
route-pressure distribution verdicts and repeated the 4-node mixed-topology
soak. It produced 4000/4000 successful logical channels, even target stream
and byte distribution, per-route `max_active` values of 13/12/13/13,
`route_pressure.active_total=0`, matching acquire/release counts, and
verdict `pass`.
- Verified per-target ACK latency gate:
`fabric-loadtest-20260516-171454` rebuilt the Docker image with
`-MaxTargetAckMs 20` and repeated the 4-node mixed-topology soak. It produced
4000/4000 successful logical channels, per-target `max_ack_ms` values of
6/5/7/9, `ack_p95_ms=3`, `ack_p99_ms=5`, all four route modes, clean
channel/route pressure accounting, and verdict `pass`.
- Verified channel setup latency SLO gate:
`fabric-loadtest-20260516-171937` rebuilt the Docker image with
`-MaxSetupP95Ms 20` and `-MaxSetupP99Ms 50`, then repeated the 4-node
mixed-topology soak with ACK, throughput, FD, goroutine, heap, container
memory, and PID gates enabled. It produced 4000/4000 successful logical
channels, `setup_latency_p95_ms=0`, `ack_p95_ms=3`, `ack_p99_ms=3`,
`throughput_bps=212572631`, even target stream/byte distribution, all four
route modes, clean channel/route pressure accounting, and verdict `pass`.
- Verified reroute latency SLO gate:
`fabric-loadtest-20260516-172652` rebuilt the Docker image with
`-MaxRerouteP95Ms 100` and `-MaxRerouteP99Ms 200`, then ran a 4-node
mixed-topology pool-failover stress with target 0 killed during load. It
produced 400/400 successful logical channels, 100 pool failover events,
`reroute_latency_p95_ms=1`, `reroute_latency_p99_ms=2`,
`route_attempts_total=500`, `ack_p95_ms=6`, `ack_p99_ms=8`,
`throughput_bps=3863633075`, clean channel/route pressure accounting, and
verdict `pass`.
- Mixed topology profile gate:
`fabric-loadtest-20260516-162037` used
`-TopologyProfile mixed-public-nat-lan-relay` with 400 streams, 64
concurrency, four targets, and mixed control/bulk traffic. It produced
400/400 successful streams, 100 streams per target, route-mode reporting for
`lan_quic`, `ice_quic`, `reverse_quic`, and `relay_quic`,
`control_ack_p95_ms=2`, `ack_p95_ms=7`, `channel_leaks=0`,
`route_pressure.active_total=0`, and matching acquire/release counts.
- Verified strict QUIC route-mode gate:
`fabric-loadtest-20260516-182550` rebuilt the loadtest image with legacy
route-mode verdicts and ran the 4-node mixed topology profile. It produced
400/400 successful logical channels, observed only `lan_quic`, `ice_quic`,
`reverse_quic`, and `relay_quic`, kept `ack_mismatched_streams=0`,
`ack_integrity_errors=0`, `channel_leaks=0`, clean route-pressure accounting,
and verdict `pass`.
- `fabric-loadtest` now also treats the configured target list as part of the
acceptance surface: every target must be `quic://...`. Empty targets, bare
`host:port`, HTTP(S), and WS/WSS targets produce a failing
`non_quic_targets=...` verdict reason. Client mode also rejects those targets
before dialing, so a bad stress command cannot accidentally exercise a
non-QUIC path and only discover it after the run.
- The shared Docker runner `scripts/fabric/fabric-loadtest-docker-smoke.ps1`
now has matching guardrails: it refuses local Docker Desktop contexts such as
`default`/`desktop-linux` and validates generated targets before launch so the
real-load smoke remains tied to the shared test Docker host and QUIC-only
endpoints.
- Shared Docker validation after those guardrails:
`fabric-loadtest-20260516-190049` rebuilt the Docker image on `test-docker`
and ran 4 QUIC targets with 120 streams. It produced 120/120 successful
logical channels, `ack_p95_ms=3`, `setup_latency_p95_ms=21`, clean
open/close and route-pressure accounting, QUIC-only targets, and verdict
`pass`.
- Shared Docker mixed-topology failover validation:
`fabric-loadtest-20260516-190137` reused the image on `test-docker`, killed
target 0 after 100ms, and ran 400 streams over the mixed public/NAT/LAN/relay
profile. It produced 400/400 successful logical channels, 100 pool failover
events, `route_attempts_total=500`, route modes `ice_quic`,
`reverse_quic`, and `relay_quic` after the failed target was removed,
`ack_p95_ms=8`, `setup_latency_p95_ms=51`, clean channel/route-pressure
accounting, and verdict `pass`.
- Shared Docker mixed-topology route coverage validation:
`fabric-loadtest-20260516-190207` ran the same 4-target mixed profile without
target failure. It produced 400/400 successful logical channels, exactly 100
streams per target, observed `lan_quic`, `ice_quic`, `reverse_quic`, and
`relay_quic`, kept `ack_integrity_errors=0`, `channel_leaks=0`,
`route_pressure.active_total=0`, and verdict `pass`.
- Load balancing under pool failover is now an acceptance gate. The first
stricter shared-host rebuild, `fabric-loadtest-20260516-190704`, intentionally
failed because all failed-target retries moved to the nearest live target,
producing `target_byte_distribution_skew` and
`route_pressure_distribution_skew`. The retry selector was then changed to
spread failed-slot retries across the currently usable target set instead of
selecting the next target in ring order.
- Verified load-aware retry routing after the fix:
`fabric-loadtest-20260516-191028` rebuilt on `test-docker`, killed target 0
after 100ms, and repeated the 4-target mixed profile. It produced 400/400
successful logical channels, 100 pool failover events, surviving-target stream
distribution of 134/133/133, surviving route-pressure max-active values of
30/25/27, `ack_p95_ms=4`, `reroute_latency_p95_ms=1`, clean acquire/release
accounting, and verdict `pass`.
- Verified 1000-channel mixed-topology stress:
`fabric-loadtest-20260516-193414` ran 1000 logical channels on `test-docker`
with 128 concurrency, mixed control/bulk traffic, and the
`mixed-public-nat-lan-relay` profile. It produced 1000/1000 successful
logical channels, exact 250/250/250/250 target distribution, observed all four
QUIC route modes (`lan_quic`, `ice_quic`, `reverse_quic`, `relay_quic`),
`throughput_bps=3629522849`, `channel_churn_per_sec=1919`,
`ack_p95_ms=6`, clean channel/route-pressure accounting, and verdict `pass`.
- Verified 1000-channel pool-failover stress:
`fabric-loadtest-20260516-193444` killed target 0 after 100ms and ran 1000
logical channels with 128 concurrency. It produced 1000/1000 successful
logical channels, 250 pool failover events, surviving-target distribution of
334/333/333, `route_attempts_total=1250`, `ack_p95_ms=7`, clean
acquire/release accounting, and verdict `pass`.
- Verified latency-degradation migration:
`fabric-loadtest-20260516-193515` applied `tc netem delay 80ms` to target 1,
enabled slow-stream migration with `-MaxAckMs 20`, and ran 400 mixed-profile
channels. It observed the impaired target in `degraded_targets`, produced
64 slow-ACK migrations, moved completed streams onto healthy targets with
distribution 134/133/133, kept `channel_leaks=0`, `ack_integrity_errors=0`,
clean route-pressure accounting, and verdict `pass`.
- Shared Docker runner resource-sample fallback was verified with
`fabric-loadtest-20260516-190325`: short runs now still persist
`container_stats_samples_path` and a minimal per-container sample summary
from final Docker stats when the background sampler has no time to emit
samples.
- Added `scripts/fabric/fabric-acceptance-summary.ps1` to aggregate recent
`*-summary.json` artifacts into an acceptance report. It captures verdicts,
target distribution, route modes, churn, failover/migration counts, latency
SLOs, and resource evidence, and it keeps intentionally failed runs visible as
regression evidence for gates such as route-pressure skew detection.
- The first 30-minute soak attempt (`fabric-loadtest-20260516-193558`) exposed
a runner defect instead of a fabric defect: server containers were still
started with a fixed `-timeout 10m`, so the three surviving servers exited
around minute 10 while the client expected a 30-minute run. The Docker runner
now exposes `-ServerTimeout` and defaults it to `-ClientTimeout`, so long soak
server lifetimes match the client run.
- The next soak attempt (`fabric-loadtest-20260516-194816`) passed the 10-minute
server-timeout boundary but exposed another long-run behavior: a healthy
surviving target could stay out of placement after a transient degradation
mark. `fabric-loadtest` now uses a bounded `target_quarantine_ttl` for
placement while still preserving historical `degraded_targets` observations
in the report. The Docker runner exposes this as `-TargetQuarantineTTL`.
- `fabric-loadtest-20260516-200241` then exposed a soak-loop issue: it reported
`pass` with 432869/432869 logical channels and clean accounting, but finished
after about 95 seconds despite `config.duration=30m`. The cause was worker
shutdown on per-stream `context deadline exceeded`; soak workers now only exit
on the parent run context or the configured soak stop time, not on one
channel's timeout.
- `fabric-loadtest-20260516-200939` and `fabric-loadtest-20260516-201331`
confirmed the soak loop fix by running full 3-minute preflights, but they
failed the zero-failed-stream gate under target-kill injection. The issue was
policy: the known killed target re-entered placement too quickly via the
short transient quarantine TTL, causing some channels to spend their stream
budget on a hard-dead endpoint. `fabric-loadtest` now separates transient
`target_quarantine_ttl` from `failure_quarantine_ttl`, and the Docker runner
exposes `-FailureQuarantineTTL`.
- Verified 30-minute long-duration soak:
`fabric-loadtest-20260516-202532` ran on `test-docker` for 1800.010 seconds
with 4 QUIC targets, 128 concurrency, mixed control/bulk traffic, 64 KiB per
logical channel, 10-second resource and container samples, and the
`mixed-public-nat-lan-relay` profile. It produced 15,074,556/15,074,556
successful logical channels, 895,308,005,376 bytes, `throughput_bps=3979124146`,
`channel_churn_per_sec=8374`, exact 3,768,639 streams per target, all four
QUIC route modes, `ack_p95_ms=5`, `ack_p99_ms=6`, `channel_leaks=0`,
matching 15,074,556 channel opens/closes, `route_pressure.active_total=0`,
458 container-stat samples, bounded memory/PID use, and verdict `pass`.
- Verified real-node host-to-host QUIC smoke:
`home-1` ran the standalone `fabric-loadtest` client against a temporary
QUIC server on `test-docker` at `quic://docker-test.cin.su:19443`. The run
created 1000 short logical channels at 128 concurrency, mixed control and
bulk traffic, sent 59,392,000 bytes, received 3700/3700 ACKs, produced
`throughput_bps=1177445403`, `channel_churn_per_sec=2478`,
`ack_p95_ms=12`, `ack_p99_ms=21`, `setup_latency_p95_ms=118`, zero failed
streams, zero channel leaks, and verdict `pass`. The report is saved as
`artifacts/fabric-real-nodes/home-1-to-test-docker-20260516-181649.json`.
- Published and registered node-agent release `0.2.280-fabricsession` with
Linux binary/native and Docker image artifacts. The release is intentionally
not assigned to live node update policies yet because current live node
workload/env posture still advertises legacy `direct_http` and HTTP/HTTPS
mesh endpoints. Before rollout, node configs must be migrated to
`quic://...` endpoints, QUIC advertise labels, and enabled QUIC listener env
such as `RAP_MESH_QUIC_FABRIC_ENABLED=true` plus
`RAP_MESH_QUIC_FABRIC_LISTEN_ADDR`.
- Loadtest degraded-target quarantine is observable through `degraded_targets`.
When `-impair-target` and slow-stream migration are enabled, verdict fails if
no degraded target is observed or if degraded targets do not produce migration
events. A shared-host validation run with 120 streams reported
`degraded_targets = { impaired_target: "slow_ack" }`, 5 migration events,
`control_ack_p95_ms=3`, and clean acquire/release accounting.
- Channel lifecycle accounting is explicit in `fabric-loadtest` through
`channel_opens`, `channel_closes`, and `channel_leaks`. Verdict fails on
open/close mismatch, active stream leaks, or mismatch between route-pressure
acquire counts and QUIC stream opens.
- The next validation step is broader real mixed public/NAT/LAN topology across
separate physical or VM hosts. The shared Docker host has verified the route
model, stress gates, 30-minute stability, memory, goroutine, file descriptor,
container resource, and route-pressure accounting. A true external NAT lab
should now validate the same gates with independent NAT devices, public nodes,
and local NAT-side cluster segments.
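The checksum-ACK and deterministic per-frame payload items in the list above can be illustrated with a minimal sketch. It is not the `fabric-loadtest` implementation: the function names, the seed layout, and the counter-based expansion are assumptions made for illustration. Only the inputs (stream index, logical stream ID, sequence, byte offset) and the use of SHA-256 for the ACK checksum come from the notes.

```python
import hashlib
import struct


def frame_payload(stream_index: int, stream_id: str, sequence: int,
                  offset: int, size: int) -> bytes:
    """Build a deterministic payload that is unique to one frame identity."""
    # Seed from the four identity fields named in the loadtest notes.
    # The exact byte layout here is hypothetical.
    seed = hashlib.sha256(
        struct.pack(">QQQ", stream_index, sequence, offset)
        + stream_id.encode("utf-8")
    ).digest()
    # Expand the seed to the requested size with a counter (illustrative only).
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(seed + struct.pack(">Q", counter)).digest()
        counter += 1
    return bytes(out[:size])


def ack_checksum(received_payload: bytes) -> bytes:
    """Receiver side: the ACK carries the SHA-256 of the bytes it received."""
    return hashlib.sha256(received_payload).digest()


def ack_is_valid(sent_payload: bytes, ack: bytes) -> bool:
    """Sender side: an ACK is valid only if it matches the sent payload."""
    return hashlib.sha256(sent_payload).digest() == ack


# Two frames of the same stream get different payloads, so an ACK for one
# frame cannot validate the other.
first = frame_payload(0, "stream-a", 0, 0, 8192)
second = frame_payload(0, "stream-a", 1, 8192, 8192)
print(ack_is_valid(first, ack_checksum(first)))
print(ack_is_valid(first, ack_checksum(second)))
```

Because every frame's bytes depend on its identity, a misrouted or replayed ACK fails the checksum comparison. With a single shared buffer pattern, that same ACK would pass.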
Initial SLO examples:
- `channel_setup_p95_ms < 200`
- `reroute_p95_ms < 1000`
- `control_latency_p99_ms < 100 under bulk load`
- `packet_loss_after_recovery < 0.1%`
- `no_route_pressure_over_90_percent_when_alternatives_exist`
- `no_channel_table_growth_after_churn`
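As a hedged illustration of how the numeric SLO examples above could become verdict reasons, the sketch below checks a report dictionary against the listed thresholds. The report keys, the `_pct` suffix, and the reason format are assumptions and may not match the real `fabric-loadtest` report schema. The two invariant-style SLOs (route pressure and channel table growth) are omitted because they are not simple numeric thresholds.

```python
from typing import Dict, List

# Thresholds taken from the SLO list above. Key names are hypothetical.
SLO_LIMITS: Dict[str, float] = {
    "channel_setup_p95_ms": 200.0,
    "reroute_p95_ms": 1000.0,
    "control_latency_p99_ms": 100.0,
    "packet_loss_after_recovery_pct": 0.1,
}


def slo_violations(report: Dict[str, float]) -> List[str]:
    """Return one reason string per SLO that is missing or not strictly met."""
    reasons = []
    for name, limit in SLO_LIMITS.items():
        value = report.get(name)
        if value is None:
            # A missing metric is treated as a failure rather than a pass.
            reasons.append(f"{name}=missing")
        elif value >= limit:
            # The SLOs are written as strict "<", so equality also fails.
            reasons.append(f"{name}={value}>={limit}")
    return reasons
```

Treating a missing metric as a violation follows the same spirit as the loadtest verdict failing on missing acquire accounting: absent evidence should not read as a pass.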
@@ -204,6 +204,8 @@ Examples:
- `vnc-worker` wraps a future VNC client/runtime.
- `vpn-exit` handles exit routing.
- `vpn-connector` handles private network reachability.
- `vpn-client` runs on an end-user device, including Android, as a normal farm node.
- `ipv4-egress` marks a node/service that can send authorized VPN packet traffic to ordinary IPv4 networks.
- `video-relay` handles media-optimized paths.
Rules:
@@ -293,6 +295,41 @@ Responsibilities:
- applies route, DNS, and egress restrictions
- reports traffic and health telemetry
### `ipv4-egress`
Fabric-only IPv4 exit service. It is assigned to nodes that may forward authorized VPN packet channels from the mesh to ordinary IPv4 networks.
Responsibilities:
- accepts VPN packet channels only through the fabric service channel
- advertises exit pool membership, region, route policy, and health
- enforces user, organization, cluster, and owner visibility policy before accepting traffic
- participates in latency-aware and load-aware exit selection
- supports failover between nodes in the same exit pool without changing the Android client protocol
- does not expose legacy VPN protocols as the steady-state data plane
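The latency-aware and load-aware selection with in-pool failover described above might look roughly like the following sketch. The `ExitNode` fields, the scoring formula, and the `exclude` parameter are all assumptions for illustration; the source does not specify how exits are scored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class ExitNode:
    node_id: str
    latency_ms: float  # measured latency from the client to this exit
    load: float        # 0.0 (idle) to 1.0 (saturated); hypothetical metric
    healthy: bool


def exit_score(node: ExitNode) -> float:
    """Hypothetical score: latency inflated by current load (lower is better)."""
    return node.latency_ms * (1.0 + node.load)


def select_exit(pool: Iterable[ExitNode],
                exclude: FrozenSet[str] = frozenset()) -> Optional[ExitNode]:
    """Pick the best healthy exit in the pool, skipping excluded node IDs."""
    candidates = [n for n in pool if n.healthy and n.node_id not in exclude]
    if not candidates:
        return None
    # Tie-break on node_id so the choice is deterministic.
    return min(candidates, key=lambda n: (exit_score(n), n.node_id))
```

Failover within the same pool is then a second call with the failed node's ID in `exclude`; the client protocol does not change, only the selected pool member.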
### `vpn-client`
Client-side VPN node role. On Android, the installed application is a node-agent/runtime with this role; the VPN client service is then started locally and joins the farm like any other node.
Responsibilities:
- joins the mesh using the current QUIC fabric transport
- requests the list of visible IPv4 exit pools and nodes according to the current user's access level
- creates VPN packet channels to the selected `ipv4-egress`/`vpn-exit` pool
- switches to another authorized exit when the selected exit fails or becomes slow
- keeps old protocol compatibility out of the runtime data plane; old nodes may only use legacy download/update paths long enough to fetch the new agent
- exposes its local IPv4 ingress as service configuration: on Android this is the
`VpnService` TUN, and on Linux/Docker it may also include explicit TCP/UDP
listen ports that are mapped into VPN packet channels.
Rules:
- A VPN client does not use a dedicated entry node. It is itself a mesh node.
- The farm builds the route from the client node to an authorized exit pool.
- Exits are addressed as pools. A pool may contain one node, but that is a degraded redundancy posture and should be visible as a risk.
- The control plane may issue policy and signed route authority, but it must not become the packet entry point for the VPN client.
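
The pool-based failover described above can be sketched as follows. This is only an illustration: the `ExitNode` fields, the `latency + weight * load` score, and the rule that fewer than two healthy exits counts as degraded are assumptions, not the fabric's actual selection algorithm.

```python
from dataclasses import dataclass


@dataclass
class ExitNode:
    node_id: str
    healthy: bool
    latency_ms: float
    load: float  # 0.0 (idle) .. 1.0 (saturated); hypothetical metric


def select_exit(pool, load_weight_ms=100.0, exclude=()):
    """Pick the healthy exit with the lowest latency-plus-load score, or None."""
    candidates = [n for n in pool if n.healthy and n.node_id not in exclude]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.latency_ms + load_weight_ms * n.load)


def pool_is_degraded(pool):
    """Assumption: a pool with fewer than two healthy exits has no failover target."""
    return sum(1 for n in pool if n.healthy) < 2
```

On failure or slowness of the current exit, a client could call `select_exit` again with the failed node id in `exclude`; the Android client protocol is unchanged because only the pool member changes.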
### `vpn-connector`
Connector to private networks.
@@ -1,13 +1,13 @@
# Web Ingress and Admin UI Model
Status: target architecture clarification. Documentation only.
Status: target architecture and implementation contract.
This document defines how HTTP/HTTPS web entry, Admin UI, dynamic page
composition, and cluster configuration responsibilities are separated in the
Secure Access Fabric.
It does not implement code, APIs, UI pages, mesh runtime, VPN runtime, or RDP
changes.
The fabric node-to-node transport remains QUIC-only. HTTP/HTTPS is allowed only
as an external client-facing service edge.
## Purpose
@@ -16,33 +16,41 @@ The platform needs a clear distinction between:
- Web Service as the HTTP/HTTPS entry layer
- Control Plane as the owner of cluster configuration and policy
- Admin UI as a safe, scoped user interface over Control Plane APIs
- Fabric Transport as the internal QUIC-only node-to-node substrate
The Web layer must never become the owner of cluster state, policy, topology,
secrets, node identity, or routing authority.
## Layer Ownership
### Web Service / Web Ingress
### Public HTTPS Ingress
Web Service is an edge service.
Public HTTPS Ingress is an edge service. It may run on a public Internet node,
including a small/slow node intended only to accept browser traffic and pass it
into the fabric.
Suggested role names:
Role names:
- `web-ingress`
- `admin-web-entry`
- `admin-web-shell`
- `public-ingress`
- `admin-ingress`
Responsibilities:
- accept HTTP/HTTPS
- listen on TCP `80` only for ACME challenges, health checks, and HTTPS
redirects
- listen on TCP `443` for browser/API HTTPS
- terminate TLS or sit behind the approved TLS terminator
- serve Admin UI shell/static assets
- proxy browser/API traffic to Control API
- serve only approved static UI shells and safe public metadata
- validate SNI/Host, request size, rate limits, and edge policy
- map the request to an allowed platform, cluster, organization, or user portal
scope
- forward accepted traffic into the fabric through an authorized fabric service
channel
- apply edge controls such as headers, rate limits, request size limits, and
future WAF rules
- expose only approved public/admin endpoints
Web Service must not:
Public HTTPS Ingress must not:
- own cluster configuration
- directly mutate PostgreSQL
@@ -51,6 +59,39 @@ Web Service must not:
- store node identity or certificates as source of truth
- expose internal mesh topology to browser clients
- execute cluster decisions locally
- hold platform/global admin authority keys
- infer authorization from the fact that it accepted TCP `443`
- become a general relay for arbitrary HTTP inside the fabric
The node that accepts HTTPS is not the node that automatically owns or executes
admin logic. It is only a service edge.
### Fabric Transport
Fabric Transport is the internal node-to-node layer.
Rules:
- node-to-node traffic uses QUIC only
- no HTTP fallback between fabric nodes
- STUN/ICE/rendezvous/relay are fabric transport mechanisms, not browser/API
protocols
- any service traffic accepted on `443` is converted into a scoped fabric
service channel before it crosses the mesh
- direct links, relay links, and route-health observations must remain separate
in diagnostics
- a fabric route proves reachability, not administrative authority
If a public ingress receives a request for an admin surface, the request flow is:
```text
Browser HTTPS
-> public/admin ingress on 443
-> tenant/cluster/platform scope selection
-> signed fabric service channel over QUIC
-> authorized admin/runtime service node
-> Control Plane authorization and policy
```
### Control Plane
@@ -77,9 +118,23 @@ only.
Cluster configuration is changed only through Control Plane services and APIs.
The Web layer is a presentation and ingress layer over those APIs.
### Admin UI
### Admin UI Runtime
Admin UI is a client application served through Web Ingress.
Admin UI Runtime is the service that serves and executes the admin surface. It
may run on any node explicitly assigned the matching runtime role.
Role names:
- `global-admin-runtime`
- `cluster-admin-runtime`
- `organization-portal-runtime`
- `user-portal-runtime`
- `identity-runtime`
- `policy-authority`
- `audit-sink`
Admin UI is a client application served through Public HTTPS Ingress or Admin UI
Runtime according to deployment policy.
It renders safe Control Plane projections and submits user actions to Control
Plane APIs.
@@ -95,7 +150,7 @@ Admin UI must not:
viewer
- contain executable cluster logic
## Admin Endpoint Placement
## Admin Endpoint Placement And Trust
Admin UI endpoint placement is explicit and must not be inferred from storage.
@@ -110,6 +165,8 @@ Scopes:
- Organization Admin Panel: tenant-safe projection for one organization. It
must expose only allowed resources, service endpoints, sessions, policies,
and safe status.
- User Portal: personal/account scope. It must expose only the authenticated
user's resources, sessions, devices, and profile actions.
Rules:
@@ -118,19 +175,29 @@ Rules:
- Storage nodes distribute/cache scoped configuration and snapshots only.
- Admin/web ingress is a separate service role and requires explicit Control
Plane assignment.
- Public Internet ingress is not enough to run a global panel.
- `global-admin-runtime`, `policy-authority`, and `audit-sink` may run only on
platform-owner trusted nodes.
- `cluster-admin-runtime` may run only on nodes authorized for that cluster.
- `organization-portal-runtime` and `user-portal-runtime` may run on broader
infrastructure, but they receive only scoped projections.
- Cluster-local admin endpoints require valid TLS/cert policy, signed scoped
snapshots, current node health, and sufficient role coverage.
- Platform Owner Console remains the owner-level view even when cluster-local
admin endpoints exist.
- Organization Admin Panel must never expose intermediate mesh topology,
storage shards, peer caches, route caches, or unrelated cluster data.
- A request entering through an organization-bound ingress must be rejected if it
asks for another organization, another cluster outside its contract, global
topology, or platform-owner data.
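
A minimal sketch of the placement rules above, written as an eligibility check. The node fields `platform_owner_trusted` and `authorized_clusters` are hypothetical names; explicit Control Plane assignment is still required on top of this check.

```python
# Roles that the rules above restrict to platform-owner trusted nodes.
PLATFORM_TRUSTED_ONLY = {"global-admin-runtime", "policy-authority", "audit-sink"}
BROAD_PORTAL_ROLES = {"organization-portal-runtime", "user-portal-runtime"}


def may_host_role(role, node, cluster_id=None):
    """Return True if the node is eligible for the role.

    `node` is a dict with hypothetical fields `platform_owner_trusted` (bool)
    and `authorized_clusters` (set of cluster ids).
    """
    if role in PLATFORM_TRUSTED_ONLY:
        return bool(node.get("platform_owner_trusted"))
    if role == "cluster-admin-runtime":
        return cluster_id is not None and cluster_id in node.get("authorized_clusters", set())
    if role in BROAD_PORTAL_ROLES:
        return True  # broader infrastructure is allowed; projections are scoped elsewhere
    return False  # unknown roles are never eligible
```

Being publicly reachable plays no part in the check, which mirrors the rule that Internet ingress alone is not enough to run a global panel.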
## Request Flow
```text
Admin Browser
-> Web Ingress / Admin Web Shell
-> Control API
-> Public/Admin HTTPS Ingress
-> Fabric Service Channel over QUIC
-> Admin UI Runtime / Control API
-> PostgreSQL source of truth
-> signed scoped snapshots / config distribution
-> rap-node-agent
@@ -266,6 +333,18 @@ Organization admin must not see:
- secrets
- unrelated cluster internals
Ingress-bound projections:
- A platform-owner ingress may expose platform navigation only after platform
authorization, MFA/step-up, and policy checks.
- A cluster-bound ingress may expose only that cluster's admin surface and
cluster-scoped safe diagnostics.
- An organization-bound ingress may expose only the organization projection and
organization-safe service endpoints.
- A user portal ingress may expose only the user's personal/account projection.
- Host/SNI alone is not authorization; it only selects the maximum possible
projection before server-side authorization narrows it further.
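
The last rule can be illustrated with a small sketch in which Host/SNI only sets a ceiling and server-side authorization narrows it. Treating scopes as a single linear order is a simplifying assumption made for this example; the real policy model may not be strictly ordered.

```python
# Assumed ordering from narrowest to broadest scope.
SCOPE_ORDER = ["user", "organization", "cluster", "platform"]


def effective_projection(ingress_max_scope, authorized_scopes):
    """Return the broadest authorized scope within the ingress ceiling, or None."""
    ceiling = SCOPE_ORDER.index(ingress_max_scope)
    allowed = [s for s in SCOPE_ORDER[: ceiling + 1] if s in authorized_scopes]
    return allowed[-1] if allowed else None
```

A platform owner who enters through an organization-bound host still receives at most the organization projection, and a caller with no authorized scope under the ceiling receives nothing.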
## Service Adapter UI Extensions
Service adapters may need configuration UI.
@@ -361,22 +440,258 @@ High-risk actions include:
## Deployment Model
### Current Test Entry
The current shared Docker test stand exposes the Platform Owner Control Panel at
`http://docker-test.cin.su:18080/` (`http://192.168.200.61:18080/`). This is a
temporary lab HTTP edge served by `rap_web_admin` from
`/tmp/rap-web-admin/html` on `test-docker`.
This entry is not the production authority model. It is allowed only for the
shared test stand while the HTTPS admin-ingress runtime is being completed. The
target production entry is:
```text
Browser HTTPS on 443
-> node with explicit admin-ingress/public-ingress role
-> signed web-ingress envelope
-> QUIC fabric service channel
-> authorized admin/portal runtime node
-> Control API projection/authorization
```
The browser-facing ingress may be a small public node, but it must not become
the management authority. Platform/global admin runtime remains limited to
platform-owner trusted nodes. Cluster, organization, and user panels receive
only their scoped projections.
The legacy Fabric map with separate `inputs`, `cluster nodes`, and `egress
zones` is retired for the transport-layer view. The Fabric panel must show
actual direct/fresh QUIC neighbor links, one-way/passive direction, stale/problem
state, relay/route-health annotations, and web-ingress runtime readiness. It
must not render old entry/egress zone columns as if they were transport
topology.
Possible deployment modes:
- Web Ingress and Control API in the same deployment for small/test installs
- Public/Admin HTTPS Ingress and Control API in the same deployment for
small/test installs
- Web Ingress separated from Control API for production
- multiple Web Ingress nodes for regional/admin access
- Web Ingress behind Caddy/Nginx/enterprise ingress
- Admin UI shell served from Web Ingress while APIs remain on Control API
- Internet ingress on a low-capacity node that forwards scoped channels to a
trusted admin runtime elsewhere in the fabric
- global admin runtime only on platform-owner controlled nodes
- cluster admin runtime on cluster-authorized nodes
- organization/user portal runtime on tenant-safe nodes with scoped data
Even when deployed together, ownership remains separate:
- Web Ingress is entry/presentation
- Public/Admin HTTPS Ingress is entry/presentation
- Fabric Transport is QUIC-only service-channel delivery
- Control API is authorization/domain logic
- PostgreSQL is source of truth
- Fabric Storage/Config Storage is scoped distribution/cache
- node-agent consumes scoped desired state
## Required Roles
The platform recognizes these web/admin placement roles:
| Role | Scope | Purpose |
| --- | --- | --- |
| `public-ingress` | cluster or organization | Listen on 80/443, terminate/validate HTTPS, forward scoped service channels. |
| `admin-ingress` | platform or cluster | HTTPS edge for admin surfaces. It does not own authority. |
| `global-admin-runtime` | platform trusted nodes only | Platform-owner console/runtime. |
| `cluster-admin-runtime` | cluster | Cluster admin console/runtime for one cluster. |
| `organization-portal-runtime` | organization | Tenant-safe organization administration. |
| `user-portal-runtime` | user/organization | Personal account/resource portal. |
| `identity-runtime` | platform/cluster | Authentication, session, MFA, step-up, and token issuance. |
| `policy-authority` | platform trusted nodes only | Authorization/policy decisions and signed claims. |
| `audit-sink` | platform trusted nodes only | Durable mutation/security audit ingestion. |
Legacy `entry-node` remains a generic client ingress/service edge role for
non-admin product services. It must not imply admin authority.
## Fabric Service Classes
Admin and portal traffic uses explicit fabric service classes. This prevents
admin traffic from being disguised as VPN/RDP/file/video traffic and gives the
routing layer clear QoS, role, and audit semantics.
| Service class | Required runtime roles | Projection |
| --- | --- | --- |
| `platform_admin` | `admin-ingress`, `global-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | Platform-owner console. |
| `cluster_admin` | `admin-ingress`, `cluster-admin-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One cluster. |
| `organization_portal` | `public-ingress`, `organization-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One organization. |
| `user_portal` | `public-ingress`, `user-portal-runtime`, `identity-runtime`, `policy-authority`, `audit-sink` | One authenticated user/account scope. |
Default channels for these classes are `control`, `interactive`, and
`reliable`. They are latency-sensitive control-plane/service traffic, not bulk
data transfer.
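
A short sketch of how the service-class table could be used to check role coverage before a class is offered. The role sets are copied from the table; the `missing_roles` helper is a hypothetical name.

```python
# Runtime roles shared by every admin/portal service class in the table.
COMMON_ROLES = {"identity-runtime", "policy-authority", "audit-sink"}

REQUIRED_ROLES = {
    "platform_admin": {"admin-ingress", "global-admin-runtime"} | COMMON_ROLES,
    "cluster_admin": {"admin-ingress", "cluster-admin-runtime"} | COMMON_ROLES,
    "organization_portal": {"public-ingress", "organization-portal-runtime"} | COMMON_ROLES,
    "user_portal": {"public-ingress", "user-portal-runtime"} | COMMON_ROLES,
}


def missing_roles(service_class, available_roles):
    """Return the sorted runtime roles still needed to serve the class."""
    return sorted(REQUIRED_ROLES[service_class] - set(available_roles))
```

An empty result means every required role is present somewhere in the reachable fabric; it says nothing about whether those nodes are healthy or trusted.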
## Desired Workload Contract
Ingress nodes are configured through normal node desired workloads. The first
runtime stage is a contract probe: node-agent validates the policy and reports a
workload status, but it does not open `80`/`443` until the real ingress runtime
stage is enabled.
Example platform/cluster admin ingress workload:
```json
{
  "service_type": "admin-ingress",
  "desired_state": "enabled",
  "runtime_mode": "native",
  "config": {
    "listen_http_port": 80,
    "listen_https_port": 443,
    "tls_mode": "terminate",
    "scope": "platform",
    "service_classes": ["platform_admin", "cluster_admin"]
  }
}
```
Example organization/user public ingress workload:
```json
{
  "service_type": "public-ingress",
  "desired_state": "enabled",
  "runtime_mode": "native",
  "config": {
    "listen_http_port": 80,
    "listen_https_port": 443,
    "tls_mode": "terminate",
    "scope": "organization",
    "service_classes": ["organization_portal", "user_portal"]
  }
}
```
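
A hedged sketch of what a contract probe might report for workloads shaped like the two examples. The status field names follow this document; the `ready` status value and the `problems` codes are assumptions made for illustration, and the probe never opens a port.

```python
VALID_SERVICE_CLASSES = {
    "platform_admin", "cluster_admin", "organization_portal", "user_portal",
}


def contract_probe(workload):
    """Validate an ingress workload dict and return a status report (no sockets opened)."""
    config = workload.get("config", {})
    problems = []  # hypothetical problem codes
    if config.get("listen_http_port") != 80:
        problems.append("invalid_http_port")
    if config.get("listen_https_port") != 443:
        problems.append("invalid_https_port")
    unknown = sorted(set(config.get("service_classes", [])) - VALID_SERVICE_CLASSES)
    problems.extend("unknown_service_class:" + name for name in unknown)
    return {
        "status": "degraded" if problems else "ready",  # "ready" is an assumed name
        "problems": problems,
        "fabric_transport": "quic_only",
        "http_between_fabric_nodes": False,
        "authority_service": False,
        "fabric_service_channel_required": True,
        "ports_opened_by_stub": False,
    }
```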
Contract-probe status requirements:
- `fabric_transport` is `quic_only`
- `http_between_fabric_nodes` is `false`
- `authority_service` is `false`
- `fabric_service_channel_required` is `true`
- `ports_opened_by_stub` is `false`
- invalid service classes or non-80/443 ports report `degraded`
- real listener startup requires both workload config
`real_listener_enabled=true` and node-agent process gate
`RAP_WEB_INGRESS_RUNTIME_ENABLED=true`
- without the process gate, a real-listener request reports
`web_ingress_real_listener_gate_disabled`
- the first handler stage returns schema
`rap.web_ingress.runtime_response.v1`; it redirects HTTP to HTTPS, exposes
health, validates service class/scope, and blocks payload forwarding with
`fabric_service_channel_binding_not_implemented` until the QUIC service
channel binding is implemented
- node-agent owns a web-ingress listener lifecycle manager. When the real
listener gate is enabled, it starts the HTTP redirect listener and starts
HTTPS only when `tls_cert_file` and `tls_key_file` are present in workload
config. Without TLS files the listener status is `partial` and service
payload remains blocked.
- HTTPS handler has a `FabricBinder` boundary. Valid requests become
`rap.web_ingress.fabric_request.v1` records with method, path, query, host,
derived scope, service class, safe headers, bounded body, and observed
timestamp. Runtime derives fabric scope from service class
(`platform_admin` -> `platform`, `cluster_admin` -> `cluster`,
`organization_portal` -> `organization`, `user_portal` -> `user`) before
signing/forwarding the request.
Dangerous browser headers such as `Authorization`, `Cookie`, `Set-Cookie`,
and service-channel tokens are not forwarded as ordinary proxy headers.
The binder must convert the request into a signed/scoped fabric service
channel envelope; if no binder is present, ingress returns
`fabric_service_channel_binding_not_implemented`.
- The first concrete binder emits
`rap.web_ingress.fabric_service_channel_envelope.v1`. The envelope contains
the safe request projection, base64-encoded body, scope, service class,
observed timestamp, and envelope timestamp. It is serialized as canonical JSON
for signing, then passed to an `EnvelopeSigner` and `EnvelopeSender`.
`EnvelopeSigner` owns node/service-channel signature policy. `EnvelopeSender`
owns delivery into the QUIC fabric service channel and route selection. This
keeps HTTP edge handling separated from mesh internals while making the
security boundary explicit and testable.
- The initial signer implementation is Ed25519 over the canonical envelope
bytes. The signer can derive `key_id` from the public key fingerprint or use
an explicitly configured key id. Production deployment must bind this key to
the node identity/service-channel authority policy before enabling real
browser traffic.
- The initial mesh sender adapter can submit the signed envelope through the
existing reliable fabric channel runtime using `control` traffic class and a
configured route set to an admin/portal runtime node or pool. At this stage it
returns a delivery-accepted response with route/channel metrics. Full
request/response admin API streaming remains a later runtime step and must
stay on the same QUIC fabric channel model.
- The fabric channel runtime now also has a request/response path for web
ingress: it opens a QUIC stream, sends the signed envelope as `FrameData`, and
waits for a `FrameData` response on the same stream and sequence. Route
failures or response timeouts use the same latency-aware reroute path as
reliable delivery. Runtime HTTP responses use
`rap.web_ingress.fabric_runtime_response.v1` with status code, safe headers,
and body/body_b64. If a runtime response is not in that schema, ingress
reports delivery-accepted metrics instead of treating an arbitrary payload as
an HTTP response.
- QUIC fabric server reserves `WebIngressForwardQUICStreamID` for web ingress
request/response forwarding. The server invokes a web-ingress forward handler
with the signed envelope payload and returns a wrapper containing either
runtime payload or an error on the same stream/sequence.
- Admin/portal runtime nodes have a signed-envelope receiver contract. The
receiver verifies `rap.web_ingress.signed_fabric_service_channel_envelope.v1`,
Ed25519 signature, trusted key id, scope, service class, and timestamp skew
before calling the local runtime handler. The local handler returns
`rap.web_ingress.fabric_runtime_response.v1`; unsafe response headers are
filtered before the payload is returned to the ingress edge.
- Node-agent exposes explicit runtime key policy inputs while the final signed
config-snapshot distribution is being wired:
`RAP_WEB_INGRESS_SIGNING_PRIVATE_KEY`,
`RAP_WEB_INGRESS_SIGNING_KEY_ID`, and
`RAP_WEB_INGRESS_TRUSTED_KEYS_JSON`. Trusted keys JSON may be either
`{"key_id":"public_key_b64"}` or an array of
`{"key_id":"...","public_key":"..."}` objects. Without trusted keys the
web-ingress receiver handler is not installed. Runtime receiver placement can
be narrowed with `RAP_WEB_INGRESS_RUNTIME_SERVICE_CLASSES`, a comma-separated
allow-list of `platform_admin`, `cluster_admin`, `organization_portal`, and
`user_portal`; this is a temporary explicit node-local policy until signed
role snapshots drive receiver placement.
- Heartbeat metadata includes `web_ingress_runtime_receiver_report` when QUIC
fabric or web-ingress key policy is configured. The report exposes the
signed-envelope schema, QUIC stream id, trusted key count, receiver
service-class allow-list, handler installation state, status/reason
(`ready`, `degraded`, or `blocked`), and QUIC endpoint readiness so the
fabric panel can show whether a node can currently receive admin/portal
runtime traffic and, if it cannot, why.
- QUIC listener/reverse-transport handler configuration is sensitive to the
web-ingress trusted key policy and runtime service-class allow-list. If either
policy changes, node-agent restarts or refreshes the QUIC fabric handler
binding so stale key trust or stale receiver placement is not kept in memory.
- The first local admin runtime dispatcher is intentionally read-only. It
handles `/healthz`, `/readyz`, and `*/ui-manifest` requests after signed
envelope verification. It returns `rap.web_ingress.admin_runtime_response.v1`
with a safe `rap.web_ingress.ui_manifest.v1` projection that lists sections
and read-only actions for the requested service class. It rejects invalid
`scope`/`service_class` pairs before using either the local fallback or the
Control API projection client. Mutations return
`control_api_mutation_binding_not_implemented`; unknown read projections
return `control_api_projection_binding_not_implemented` until the dispatcher
is wired to the real Control API authorization/projection layer.
- The dispatcher now has a `ControlAPIProjectionClient` boundary. When bound,
read-only GET/HEAD requests are sent to the Control API projection endpoint
and returned as `rap.web_ingress.control_api_projection_response.v1`.
The backend exposes the first read-only projection endpoint at
`/api/v1/clusters/{cluster_id}/nodes/{node_id}/admin-runtime/projection`.
It returns safe manifest/projection payloads, marks audit as required, and
rejects mutation methods and invalid `scope`/`service_class` combinations.
Requests must use schema
`rap.web_ingress.control_api_projection_request.v1`; agent accepts responses
only with schema `rap.web_ingress.control_api_projection_response.v1`.
This is the first Control API binding slice; it is not yet a full
authorization/session/audit implementation.
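
Several of the steps above (scope derivation, header filtering, canonical serialization, and the two accepted trusted-key JSON shapes) can be sketched together. Everything below is illustrative: the record field names, the `x-rap-service-channel-token` header name, and the use of sorted-key compact JSON as the canonical form are assumptions, and Ed25519 signing is omitted because it needs a non-stdlib library.

```python
import base64
import json

SCOPE_BY_SERVICE_CLASS = {
    "platform_admin": "platform",
    "cluster_admin": "cluster",
    "organization_portal": "organization",
    "user_portal": "user",
}
# Lowercased names; the service-channel token header name is hypothetical.
BLOCKED_HEADERS = {"authorization", "cookie", "set-cookie", "x-rap-service-channel-token"}


def build_fabric_request(method, path, query, host, service_class, headers, body, observed_at):
    """Build a fabric request record; raises KeyError for an unknown service class."""
    scope = SCOPE_BY_SERVICE_CLASS[service_class]
    safe = {k.lower(): v for k, v in headers.items() if k.lower() not in BLOCKED_HEADERS}
    return {
        "schema": "rap.web_ingress.fabric_request.v1",
        "method": method, "path": path, "query": query, "host": host,
        "scope": scope, "service_class": service_class,
        "headers": safe,
        "body_b64": base64.b64encode(body).decode("ascii"),
        "observed_at": observed_at,
    }


def canonical_bytes(record):
    """Assumed canonical form: UTF-8 JSON with sorted keys and no whitespace."""
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def parse_trusted_keys(raw):
    """Accept either {"key_id": "public_key_b64"} or a list of key objects."""
    data = json.loads(raw)
    if isinstance(data, dict):
        return dict(data)
    return {item["key_id"]: item["public_key"] for item in data}
```

Deriving scope from the service class, rather than reading it from the request, means a browser cannot ask for a broader scope than the class allows. The bytes returned by `canonical_bytes` are what a signer would sign and a receiver would verify.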
## Future Stages
Suggested staged work:
@@ -417,8 +732,9 @@ This document does not authorize:
## Result / Decision
WEB is an ingress and presentation layer, not a cluster configuration owner.
Cluster configuration belongs to the Control Plane and is persisted in
PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
Fabric remains QUIC-only internally; HTTP/HTTPS exists only at the external
client edge. Cluster configuration belongs to the Control Plane and is persisted
in PostgreSQL. Dynamic admin pages are allowed only as safe, scoped,
schema-driven projections over Control Plane APIs. They must not embed secrets,
internal topology, peer caches, route caches, or arbitrary executable code.