Record project continuation changes
@@ -1016,6 +1016,240 @@ Status: implemented and verified. Report: `artifacts/c5-service-workload-supervi
Goal:
Node-agent can start, stop, and monitor service workloads based on role assignment.

C19A adds the first bounded live service-supervision runtime proof on top of
that contract: node-agent can read node-scoped desired workloads without an
operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
report native built-in `synthetic.echo` as running, and keep unsupported
production workloads degraded instead of pretending that their adapters exist.
The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.

C19B adds the Remote Workspace/RDP adapter-contract bridge without enabling RDP
payload traffic. A native `rdp-worker` desired workload with
`adapter_contract_probe=true` reports the remote-workspace channel map,
requires Fabric Service Channel, and marks backend relay as not steady-state.
The live smoke is
`scripts/fabric/c19b-remote-workspace-adapter-contract-smoke.ps1`.

C19C wires Remote Workspace into service-channel lease issuance without
starting RDP traffic: route intents now accept `remote_workspace`, the lease
entry descriptor uses remote-workspace stream paths and frame-batch media type
instead of VPN packet paths, and the signed data-plane contract is present in
lease, authority payload, introspection, and lease maintenance. The live smoke
is `scripts/fabric/c19c-remote-workspace-service-channel-lease-smoke.ps1`.

C19D adds the Remote Workspace entry-node ingress skeleton. The node-agent
accepts a signed/introspected `remote_workspace` service-channel lease on
`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
class, channel class, selected entry node, and data-plane flow isolation, and
reports access telemetry. It intentionally returns a probe contract with
`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
the Fabric ingress contract without forwarding desktop frames yet. The live
smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.

C19E adds the first Remote Workspace frame-batch contract probe across the
adapter/entry boundary. The `rdp-worker` adapter probe reports
`rap.remote_workspace_frame_batch.v1`; entry-node accepts only
`probe_only=true` frame batches, validates logical adapter channels and
directions, and returns `payload_flow=validated_probe_only`. Real desktop frame
delivery remains intentionally disabled until the service adapter runtime stage.
The live smoke is
`scripts/fabric/c19e-remote-workspace-frame-batch-contract-smoke.ps1`.

C19F adds the first local adapter-sink proof for that frame-batch contract.
Node-agent now keeps an in-memory `node_agent_rdp_worker_contract_probe` sink
for Remote Workspace frame probes and preserves it across mesh config refresh.
Entry-node delivers validated `probe_only=true` frame batches to that sink and
returns a `rap.remote_workspace_frame_batch_delivery.v1` receipt with
`payload_flow=delivered_probe_only`. This still does not enable production RDP
frame forwarding. The live smoke is
`scripts/fabric/c19f-remote-workspace-adapter-sink-smoke.ps1`.

C19G exposes the adapter-sink delivery proof through existing node-agent
visibility channels. The `rdp-worker` workload status payload now includes
`remote_workspace_adapter_sink`, and node telemetry includes
`remote_workspace_adapter_sink_report`, both carrying delivery count, latest
delivery sequence, channel class, frame count, and the probe-only/no-payload
boundary. The live smoke is
`scripts/fabric/c19g-remote-workspace-adapter-sink-telemetry-smoke.ps1`.

C19H locks down the Remote Workspace frame-batch guardrails before real adapter
runtime work begins. Unit and live smoke coverage now proves that entry-node
rejects `probe_only=false`, unknown logical channels, invalid channel
directions, service-class mismatch, channel-class mismatch, and unsupported
payload encoding, and that rejected batches do not produce adapter delivery.
The live smoke is
`scripts/fabric/c19h-remote-workspace-frame-guardrails-smoke.ps1`.

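The C19H rejection cases can be read as one validation pass that runs before any adapter delivery. The sketch below is illustrative only and is not the node-agent implementation: the batch and lease dictionary shapes, the channel map, the direction names, the `none` encoding, and the reason strings are all assumptions made for the example.

```python
from typing import Optional

# Hypothetical channel map: logical channel -> the only direction it allows.
ALLOWED_CHANNELS = {"display": "worker_to_client", "input": "client_to_worker"}
SUPPORTED_ENCODINGS = {"none"}  # assumed encoding name for probe batches


def reject_reason(batch: dict, lease: dict) -> Optional[str]:
    """Return a rejection reason, or None when the probe batch is acceptable."""
    if batch.get("probe_only") is not True:
        return "probe_only_required"
    if batch.get("service_class") != lease["service_class"]:
        return "service_class_mismatch"
    if batch.get("channel_class") != lease["channel_class"]:
        return "channel_class_mismatch"
    if batch.get("payload_encoding") not in SUPPORTED_ENCODINGS:
        return "unsupported_payload_encoding"
    for frame in batch.get("frames", []):
        expected = ALLOWED_CHANNELS.get(frame.get("channel"))
        if expected is None:
            return "unknown_logical_channel"
        if frame.get("direction") != expected:
            return "invalid_channel_direction"
    return None
```

A caller would hand the batch to the adapter sink only when `reject_reason` returns `None`, which keeps the "rejected batches do not produce adapter delivery" property in one place.
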
C19I adds the first bounded adapter handoff queue/ack proof for the same
probe-only path. The local `node_agent_rdp_worker_contract_probe` sink reports
queue capacity/depth plus accepted, dropped, and acked frame counts: with
capacity `8`, droppable display overflow accepts/acks `8` frames and drops `3`,
while reliable input overflow is rejected with backpressure and no delivery
receipt. The boundary still carries `payload_traffic=none`; this is queue
semantics for the future adapter runtime, not real RDP payload forwarding. The
live smoke is
`scripts/fabric/c19i-remote-workspace-adapter-queue-smoke.ps1`.

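The C19I numbers can be reproduced with a small counter model. This is a sketch under stated assumptions rather than the sink's real code: the `ProbeQueue` class, its method names, and the choice to reject a reliable batch whole are hypothetical, and the sketch does not say which individual frames are dropped.

```python
from dataclasses import dataclass


@dataclass
class ProbeQueue:
    capacity: int = 8
    depth: int = 0
    accepted: int = 0
    dropped: int = 0
    acked: int = 0
    backpressure_count: int = 0

    def offer(self, frame_count: int, reliable: bool) -> bool:
        """Offer a batch; return True when a delivery receipt would be issued."""
        free = self.capacity - self.depth
        if reliable and frame_count > free:
            # Reliable traffic is never partially dropped: reject the whole batch.
            self.backpressure_count += 1
            return False
        taken = min(frame_count, free)
        self.depth += taken
        self.accepted += taken
        self.dropped += frame_count - taken  # droppable overflow is discarded
        return True

    def ack_all(self) -> None:
        """Acknowledge every queued frame and free the queue."""
        self.acked += self.depth
        self.depth = 0
```

Offering 11 droppable display frames to an empty queue of capacity 8 accepts 8 and drops 3; offering 9 reliable input frames is refused and only increments the backpressure counter.
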
C19J makes those queue/backpressure signals operationally visible. The
`remote_workspace_adapter_sink` workload status payload and
`remote_workspace_adapter_sink_report` telemetry now include current queue
capacity/depth, cumulative accepted/dropped/acked frame counters,
`backpressure_count`, and the latest rejected batch metadata/reason. The live
smoke first produces the C19I droppable overflow plus reliable backpressure,
then waits until both workload status and telemetry show the delivery, dropped
total, and backpressure increment. The live smoke is
`scripts/fabric/c19j-remote-workspace-adapter-queue-telemetry-smoke.ps1`.

C19K introduces the probe-only adapter session boundary. Entry-node derives a
stable `adapter_session_id` from the service-channel lease/resource/route
context and passes it to the local `rdp-worker` adapter probe sink. Delivery
receipts, workload status, and telemetry now include `adapter_session_id`,
`adapter_runtime_id=node_agent_rdp_worker_contract_probe`, and
`session_state=probe_bound`, and rejected/backpressured batches retain the same
session identity. This is still not real RDP payload forwarding; it binds the
queue/ack/backpressure model to the future per-session adapter runtime. The
live smoke is
`scripts/fabric/c19k-remote-workspace-adapter-session-boundary-smoke.ps1`.

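One way to get a stable identifier from lease/resource/route context is to hash the fields in a fixed order. The source does not state how `adapter_session_id` is actually derived, so the function below is only an assumed scheme: the SHA-256 choice, the length-prefixed encoding, the `rwas_` prefix, and the parameter names are all hypothetical.

```python
import hashlib


def derive_adapter_session_id(lease_id: str, resource_id: str, route_id: str) -> str:
    """Derive a deterministic session id from the channel context (assumed scheme)."""
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    material = b"".join(
        len(part.encode()).to_bytes(4, "big") + part.encode()
        for part in (lease_id, resource_id, route_id)
    )
    return "rwas_" + hashlib.sha256(material).hexdigest()[:32]
```

Because the id depends only on its inputs, accepted, rejected, and backpressured batches on the same lease, resource, and route all resolve to the same session identity.
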
C19L adds the first lifecycle model to that probe-only adapter session. The
node-agent sink now tracks active sessions in memory with created/bound totals,
last activity timestamps, per-session delivery/backpressure/frame counters,
`current_session_lifecycle_state`, and idle expiry counters. A successful
droppable overflow binds the session as `probe_bound`; a reliable overflow keeps
the same `adapter_session_id` and moves the lifecycle state to `backpressure`
for diagnosis. Receipts expose session created/bound/last-activity timestamps
and per-session counters while `payload_traffic=none` remains enforced. The
live smoke is
`scripts/fabric/c19l-remote-workspace-adapter-session-lifecycle-smoke.ps1`.

C19M adds explicit probe-only adapter-session control. Node-agent exposes
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
with `close`, `expire`, and `reset` actions, returning
`rap.remote_workspace_adapter_session_control.v1`. Workload status and telemetry
now include `session_control_total`, `session_closed_total`,
`session_reset_total`, and the latest control action/session/state, so sessions
can be ended deliberately instead of only by idle TTL. The live smoke creates a
Remote Workspace adapter session, closes it through the mesh control endpoint,
and waits until workload status and telemetry expose the close. The live smoke
is
`scripts/fabric/c19m-remote-workspace-adapter-session-control-smoke.ps1`.

C19N locks down the adapter-session control guardrails. Control requests now
reject unsupported actions, invalid `adapter_session_id` values, malformed JSON,
unknown active/terminal sessions, and overlong reasons without creating hidden
session state. Repeating `close` against an already closed terminal session is
idempotent: it reports `previous_state=closed` and does not increment
`session_closed_total` again, while still counting the control observation. The
live smoke verifies the negative cases plus first/repeated close visibility in
workload status and telemetry. The live smoke is
`scripts/fabric/c19n-remote-workspace-adapter-session-control-guardrails-smoke.ps1`.

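The idempotent-close rule is easiest to see with the two counters side by side. The following is a minimal sketch, assuming an in-memory state table; the `SessionControl` class, the `bind` helper, and the response field set are hypothetical, and only the `close` action is modelled.

```python
class SessionControl:
    def __init__(self) -> None:
        self.states = {}  # adapter_session_id -> lifecycle state
        self.session_control_total = 0
        self.session_closed_total = 0

    def bind(self, session_id: str) -> None:
        """Register a probe-bound session (stand-in for a first delivery)."""
        self.states[session_id] = "probe_bound"

    def close(self, session_id: str) -> dict:
        if session_id not in self.states:
            # Unknown sessions are rejected without creating hidden state.
            raise KeyError("unknown adapter session")
        previous = self.states[session_id]
        self.session_control_total += 1  # every accepted control request is observed
        if previous != "closed":
            self.states[session_id] = "closed"
            self.session_closed_total += 1  # only the first close transitions state
        return {"action": "close", "previous_state": previous, "state": "closed"}
```

A second `close` on the same session returns `previous_state` of `closed`, leaves `session_closed_total` at 1, and raises `session_control_total` to 2.
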
C19O adds an immediate read-only adapter-session snapshot endpoint:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`.
It returns `rap.remote_workspace_adapter_session_snapshot.v1` with active
sessions, terminal sessions when requested, per-session lifecycle state,
activity/backpressure timestamps, frame counters, and runtime identity. This
lets operators inspect adapter-session state directly from node-agent without
waiting for heartbeat, workload status, or telemetry propagation. The live smoke
checks active-session visibility, close transition into terminal snapshot, and
invalid snapshot limit rejection. The live smoke is
`scripts/fabric/c19o-remote-workspace-adapter-session-snapshot-smoke.ps1`.

C19P adds the first adapter-runtime handoff mailbox contract. Each active
probe-only adapter session now owns a bounded in-memory mailbox that receives
`frame_batch_probe_delivered` and `backpressure` events with frame counts,
channel/resource/route context, and sequence numbers. Node-agent exposes
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
with optional `drain=true`, and session snapshots/workload reports expose
mailbox depth/enqueued/drained/dropped counters. This is the handoff surface a
real `rdp-worker` runtime can consume next; payload forwarding is still disabled.
The live smoke verifies read, drain, post-drain empty state, and snapshot
counters. The live smoke is
`scripts/fabric/c19p-remote-workspace-adapter-runtime-mailbox-smoke.ps1`.

C19Q hardens the mailbox handoff. Invalid IDs, unknown sessions, and invalid
limits are rejected before state mutation, and bounded `drain=true&limit=N`
reads remove only the returned event slice while preserving remaining depth for
the next poll. The bounded mailbox drops oldest events once capacity is reached,
and a closed adapter session no longer exposes an active runtime mailbox. The
live smoke verifies negative cases, drop-oldest pressure, partial drain, and
closed-session rejection. The live smoke is
`scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`.

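Drop-oldest pressure and partial drain can be modelled with a double-ended queue. This sketch is an assumption-laden illustration, not the node-agent mailbox: the `Mailbox` class, its method names, and the event dictionary shape are hypothetical.

```python
from collections import deque


class Mailbox:
    def __init__(self, capacity: int) -> None:
        self.events = deque()
        self.capacity = capacity
        self.next_sequence = 1
        self.dropped = 0

    def enqueue(self, kind: str) -> None:
        if len(self.events) == self.capacity:
            self.events.popleft()  # drop the oldest event under pressure
            self.dropped += 1
        self.events.append({"sequence": self.next_sequence, "type": kind})
        self.next_sequence += 1

    def read(self, limit: int, drain: bool = False) -> list:
        if limit < 1:
            raise ValueError("invalid limit")  # rejected before any mutation
        batch = list(self.events)[:limit]
        if drain:
            for _ in batch:
                self.events.popleft()  # remove only the returned slice
        return batch
```

With capacity 3 and five enqueued events, sequences 1 and 2 are dropped; a `drain=True` read with limit 2 returns sequences 3 and 4 and leaves sequence 5 for the next poll.
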
C19R adds bounded long-poll ergonomics to the same node-local mailbox endpoint.
`wait_ms` lets an adapter runtime wait briefly for the next event without hot
polling, and responses make empty/timeout state explicit with `empty`,
`waited`, `wait_timeout`, and `wait_ms`. The live smoke proves empty timeout and
wake-on-delayed-event behavior while keeping the path probe-only. The live smoke
is `scripts/fabric/c19r-remote-workspace-mailbox-long-poll-smoke.ps1`.

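The long-poll behavior maps naturally onto a condition variable: a reader blocks for at most `wait_ms` and is woken as soon as an event arrives. The sketch below assumes an in-process mailbox and is not the HTTP handler itself; the `WaitableMailbox` class is hypothetical, and only the four response flags come from the text above.

```python
import threading
from collections import deque


class WaitableMailbox:
    def __init__(self) -> None:
        self._events = deque()
        self._cond = threading.Condition()

    def enqueue(self, event: dict) -> None:
        with self._cond:
            self._events.append(event)
            self._cond.notify_all()  # wake any reader blocked in read()

    def read(self, wait_ms: int = 0) -> dict:
        """Non-draining read that waits up to wait_ms when the mailbox is empty."""
        with self._cond:
            waited = False
            if not self._events and wait_ms > 0:
                waited = True
                # wait_for returns early once the predicate becomes true.
                self._cond.wait_for(lambda: bool(self._events), timeout=wait_ms / 1000)
            events = list(self._events)
            return {
                "events": events,
                "empty": not events,
                "waited": waited,
                "wait_timeout": waited and not events,
                "wait_ms": wait_ms,
            }
```

An empty mailbox read with `wait_ms=20` comes back with `empty`, `waited`, and `wait_timeout` all true; a read that is woken by a delayed event reports `waited` true and `wait_timeout` false.
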
C19S makes mailbox consumer behavior visible in diagnostics. Workload status and
node telemetry now expose `mailbox_read_total`, `mailbox_wait_total`,
`mailbox_wait_timeout_total`, `mailbox_empty_read_total`, and last mailbox read
metadata; active session snapshots carry the same per-session counters while a
session remains active. The live smoke proves C19R traffic is reflected in both
workload status and telemetry. The live smoke is
`scripts/fabric/c19s-remote-workspace-mailbox-telemetry-smoke.ps1`.

C19T adds the node-local consumer cursor contract for that mailbox. Consumers
can pass `consumer_id` plus optional `ack_sequence` to receive explicit
checkpoint, ack, lag, read, and ack counters without draining mailbox state.
The probe sink stores bounded per-session consumer state and reports aggregate
and current-session consumer telemetry through workload status and heartbeat
telemetry. The live smoke is
`scripts/fabric/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke.ps1`.

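A consumer cursor can be kept as a small record per `consumer_id`. The sketch below is hypothetical throughout: the source does not define how lag is computed or when the checkpoint advances, so this example assumes the checkpoint follows the newest sequence seen on a read and that lag is the distance between that sequence and the last ack. The bound on the number of consumers is omitted for brevity.

```python
def record_read(cursors: dict, consumer_id: str, latest_sequence: int,
                ack_sequence=None) -> dict:
    """Record one non-draining read and return the consumer's cursor view."""
    # Validate first so a bad request leaves no consumer state behind.
    if ack_sequence is not None and not 0 <= ack_sequence <= latest_sequence:
        raise ValueError("ack_sequence out of range")
    cursor = cursors.setdefault(consumer_id, {
        "checkpoint_sequence": 0, "ack_sequence": 0,
        "read_total": 0, "ack_total": 0,
    })
    cursor["read_total"] += 1
    cursor["checkpoint_sequence"] = latest_sequence
    if ack_sequence is not None and ack_sequence > cursor["ack_sequence"]:
        cursor["ack_sequence"] = ack_sequence
        cursor["ack_total"] += 1
    # Assumed definition: lag is how far the ack trails the newest event.
    return {**cursor, "lag": latest_sequence - cursor["ack_sequence"]}
```

Nothing here touches the mailbox contents, which mirrors the "without draining mailbox state" property of the cursor contract.
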
C19U adds lifecycle visibility and reset guardrails to the same cursor state.
Mailbox consumers can pass `reset_consumer=true` with a valid `consumer_id` to
clear their checkpoint/ack state before the current read is recorded. Mailbox
responses now expose consumer count/capacity, created/reset/evicted flags, and
consumer timestamps, while diagnostics add reset and eviction counters. The
live smoke is
`scripts/fabric/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke.ps1`.

C19V adds read-only inspection for active mailbox consumer cursors. The
node-local
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
endpoint returns bounded cursor snapshots with consumer ids, checkpoint and ack
sequences, lag, totals, and timestamps. It is verified as read-only: inspection
does not increment mailbox reads, ack totals, reset counters, or drain mailbox
events. The live smoke is
`scripts/fabric/c19v-remote-workspace-mailbox-consumer-snapshot-smoke.ps1`.

C19W adds cursor-aware resume reads to the mailbox endpoint. Consumers can pass
`after_sequence` to receive only mailbox events newer than their checkpoint;
responses include `skipped_count` and `returned_count`, and long-poll waits for
newer-than-checkpoint events. The endpoint rejects `after_sequence` with
`drain=true`, preserving the non-destructive resume contract. The live smoke is
`scripts/fabric/c19w-remote-workspace-mailbox-after-sequence-smoke.ps1`.

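The `after_sequence` read is a filter over the mailbox that reports how much it skipped. A minimal sketch, assuming events are dictionaries with a `sequence` key; the `read_after` function name and the error text are hypothetical.

```python
def read_after(events: list, after_sequence: int, drain: bool = False) -> dict:
    """Return only events newer than after_sequence, without mutating the mailbox."""
    if drain:
        # Resume reads are non-destructive, so the combination is refused.
        raise ValueError("after_sequence cannot be combined with drain=true")
    newer = [e for e in events if e["sequence"] > after_sequence]
    return {
        "events": newer,
        "returned_count": len(newer),
        "skipped_count": len(events) - len(newer),
    }
```

For a mailbox holding sequences 3, 4, and 5, a read with `after_sequence=4` returns one event and reports two skipped.
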
C19X adds consumer-aware resume convenience. Mailbox reads with `consumer_id`
can pass `resume_from=ack` or `resume_from=checkpoint`; the node-agent resolves
the stored cursor to `after_sequence` before reading and returns
`resume_from`/`resume_sequence` in the response. The guardrails reject mixing
resume with manual `after_sequence`, drain, reset, missing consumers, or invalid
cursor names. The live smoke is
`scripts/fabric/c19x-remote-workspace-mailbox-consumer-resume-smoke.ps1`.

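Resolving `resume_from` amounts to picking one stored sequence and refusing ambiguous combinations. The sketch below is illustrative; the `resolve_resume` function, the cursor field names `ack_sequence` and `checkpoint_sequence`, and the exception types are assumptions rather than the node-agent API.

```python
from typing import Optional


def resolve_resume(cursor: Optional[dict], resume_from: str,
                   after_sequence: Optional[int] = None,
                   drain: bool = False, reset_consumer: bool = False) -> dict:
    """Translate resume_from=ack|checkpoint into an after_sequence value."""
    if resume_from not in ("ack", "checkpoint"):
        raise ValueError("invalid cursor name")
    if after_sequence is not None or drain or reset_consumer:
        raise ValueError("resume_from cannot be mixed with after_sequence, drain, or reset")
    if cursor is None:
        raise LookupError("unknown consumer")
    key = "ack_sequence" if resume_from == "ack" else "checkpoint_sequence"
    sequence = cursor[key]
    return {"resume_from": resume_from, "resume_sequence": sequence,
            "after_sequence": sequence}
```

The resolved `after_sequence` can then feed the same non-destructive read path that a manual `after_sequence` request would use.
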
C19Y adds resume telemetry to workload status and heartbeat reports. Operators
can now see resume read totals, after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots also expose per-session resume counters. The
live smoke is
`scripts/fabric/c19y-remote-workspace-mailbox-resume-telemetry-smoke.ps1`.

C19Z adds adapter-runtime readiness diagnostics. Sink reports now include
`adapter_runtime_readiness`, a compact probe-only object with ready status,
diagnostic state, session lifecycle, mailbox depth, consumer cursor, resume
cursor, lag, and returned/skipped counts. The live smoke is
`scripts/fabric/c19z-remote-workspace-adapter-readiness-smoke.ps1`.

C19Z1 adds read-only handoff preflight for mailbox consumers. The endpoint
`/mailbox/preflight` accepts `consumer_id` and `resume_from=ack|checkpoint`,
then reports the expected next event window without mailbox reads, drains, acks,
or consumer cursor mutation. The live smoke is
`scripts/fabric/c19z1-remote-workspace-mailbox-preflight-smoke.ps1`.

Includes:

- container/native workload contract

@@ -131,6 +131,43 @@ Data Plane
The backend/control plane must not become a production VPN packet relay.

## Universal Packet Dataplane Principle

The VPN service carries IP packets. The product must not be classified as a web
proxy, an RDP helper, or an HTTP-only accelerator. HTTP, DNS, RDP, SSH, VNC,
messengers, audio calls, file transfer, application sync, and future mobile or
desktop traffic are all just packets flowing through the same tunnel contract.

Implementation rules:

- packet forwarding must not branch on application protocol for correctness
- performance work must optimize the shared packet path, not a specific site or
  port
- batching, backpressure, retries, and route failover are dataplane mechanics
  and must apply to all traffic
- diagnostics may summarize protocol/ports for operators, but diagnostics must
  not decide whether traffic is allowed to flow
- a transient transport error must not permanently downgrade the tunnel to a
  per-packet request mode
- the control plane chooses entry, exit, route, lease, and policy; packet flow
  should use the fastest available fabric path

The temporary backend HTTP packet relay is a lab compatibility path. The
production target is:

```text
client device
-> selected entry node
-> fabric route / alternate route set
-> selected exit node
-> target private network or Internet gateway
```

When the cluster grows, route choice must consider latency, loss, queue depth,
node health, role eligibility, lease freshness, and regional/network locality.
If a node or link degrades, the fabric should switch to an alternate route
without requiring the client to understand mesh topology.

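The route-choice inputs listed above can be combined into a filter plus a cost function. This is only a sketch to make the idea concrete: the `Route` fields, the weights, and the locality penalty are invented for illustration and do not reflect the project's actual scoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Route:
    route_id: str
    latency_ms: float
    loss_ratio: float
    queue_depth: int
    healthy: bool
    role_eligible: bool
    lease_fresh: bool
    same_region: bool


def route_cost(route: Route) -> float:
    """Lower is better. All weights here are hypothetical."""
    cost = route.latency_ms + 1000.0 * route.loss_ratio + 2.0 * route.queue_depth
    return cost if route.same_region else cost + 20.0


def choose_route(routes: list) -> Optional[Route]:
    """Pick the cheapest usable route, or None when no route qualifies."""
    usable = [r for r in routes if r.healthy and r.role_eligible and r.lease_fresh]
    return min(usable, key=route_cost, default=None)
```

Health, role eligibility, and lease freshness act as hard gates, while latency, loss, queue depth, and locality only rank the routes that pass. When the preferred route stops being healthy, the same call returns the next-best alternate without the client needing any topology knowledge.
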
## Control Plane Responsibilities

The control plane owns:

+605
-118
@@ -1,123 +1,610 @@
C17Z20 is complete.

Current product decision:

Stop treating VPN, Remote Server/Desktop Access, video, file transfer, and
future services as separate transport implementations. The next implementation
work should focus on the shared Fabric Service Channel runtime described in
`docs/architecture/FABRIC_SERVICE_CHANNEL_RUNTIME.md`.

The immediate engineering target is:

- backend service-channel lease/route-generation contract
- node-agent entry runtime for client/service live connections
- service-neutral channel scheduling, bounded queues, route health, and
  failover
- VPN packet flow as the proving service over that common channel
- backend relay only as explicit degraded fallback

Backend service-channel lease/route-generation contract is now started:

- `POST /clusters/{clusterID}/fabric/service-channels/leases` issues
  `rap.fabric_service_channel_lease.v1`
- VPN client profiles embed `fabric_service_channel_lease`
- tests cover ready route and degraded backend-relay fallback behavior
- leases include entry HTTP/WebSocket endpoint templates for the selected
  service channel
- leases include cluster-authority-signed
  `rap.fabric_service_channel_lease_authority.v1` payloads that bind token
  hash, selected route, generation, fencing epoch, and expiry

Node-agent entry runtime is now started:

- `rap-node-agent` accepts VPN packet batches through
  `/api/v1/clusters/{clusterID}/fabric/service-channels/{channelID}/vpn-connections/{vpnConnectionID}/packets`
  and `/packets/ws`
- entry runtime requires a `rap_fsc_*` service-channel token and maps packet
  batches to the existing production `vpn_packet` fabric route
- route failure falls back to the canonical backend relay endpoint so degraded
  compatibility remains explicit

Next narrow runtime layer:

- persist cluster-level default window policy for Fabric diagnostics
  investigation breadcrumbs and expose a small admin control for it
- keep this in the shared Fabric Service Channel runtime contract and telemetry
- do not add Android/RDP protocol work in this slice

C17Z20 is complete.

Installation Authority foundation is also complete:

- production config requires strict authority mode with Product Root public key
- first-owner bootstrap requires a signed activation manifest in strict mode
- `installation_authority` and signed `platform_role_grants` are persisted
- strict platform-admin checks ignore direct `users.platform_role` edits unless
  a valid signed grant exists
- web-admin shows installation status and first-owner bootstrap
- `scripts/installation/product-root-tool.go` can generate Ed25519 Product Root
  keys and sign activation manifests; private keys must stay outside the repo

Cluster Authority foundation is now also complete:

- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
  a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
  and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
  `authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
  trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
  approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
  bootstrap, cluster creation, signed join token, real node-agent enrollment,
  owner approval, automatic signed bootstrap polling, authority pin
  persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
  `RAP_WORKLOAD_SUPERVISION_ENABLED=false` by default while service runtime
  supervision remains a stub

Node enrollment bootstrap polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
  before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls approval, verifies
  the signed bootstrap contract, then persists `node_id`, `identity_status`,
  and cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`

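The interval/timeout pair describes a bounded polling loop. The sketch below shows only that loop shape and is not the `rap-node-agent` code: `fetch_status` is a hypothetical callable standing in for the bootstrap request, the `status` field and `approved` value are assumed, and signature verification of the bootstrap contract is deliberately left out.

```python
import time


def poll_bootstrap(fetch_status, interval_seconds: float, timeout_seconds: float,
                   sleep=time.sleep, now=time.monotonic):
    """Poll until approval; return the bootstrap payload, or None on timeout."""
    deadline = now() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("status") == "approved":
            return status
        if now() + interval_seconds > deadline:
            return None  # the next attempt would land past the deadline
        sleep(interval_seconds)
```

In a real agent the two numeric arguments would come from `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`. The injectable `sleep` and `now` parameters exist so the loop can be exercised with a fake clock.
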
Current state:

- C17Z12 added rendezvous/relay control-plane leases for peers that would
  otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
  for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
  relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
  route-health route config only. The synthetic `fabric.route_health` runtime
  now probes the selected effective path, including replacement relay paths,
  and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
  scoring. Drift/unreachable/failure feedback marks the exact selected relay
  stale and can trigger replacement; healthy low-latency route-health boosts
  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
  service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
  synthetic route-health drift/unreachable/failure, it performs a bounded
  node-scoped synthetic-config refresh, applies returned replacement route
  decisions to route-health config immediately, and reports
  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
  type/route, so `synthetic_route_health` is not overwritten by
  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
  `000021_cluster_authority_keys`; the migration now drops/recreates
  `cluster_admin_summaries` in up/down paths.

Runtime report:

- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- `artifacts/c18w-service-channel-route-manager-smoke-result.json`
- `artifacts/c18x-service-channel-logical-channel-smoke-result.json`
- `artifacts/c18y-route-intent-lifecycle-smoke-result.json`
- `artifacts/c18z-service-channel-load-smoke-result.json`
- `artifacts/c18z1-live-service-channel-ingress-smoke-result.json`
- `artifacts/c18z2-live-service-channel-soak-smoke-result.json`
- `artifacts/c18z3-live-service-channel-entry-ws-fallback-smoke-result.json`
- `artifacts/c18z4-live-service-channel-session-pressure-smoke-result.json`
- `artifacts/c18z5-live-service-channel-exit-restart-smoke-result.json`
- `artifacts/c18z6-live-service-channel-active-rebuild-smoke-result.json`
- `artifacts/c18z7-live-service-channel-concurrent-isolation-smoke-result.json`
- `artifacts/c18z8-live-service-channel-backpressure-isolation-smoke-result.json`
- `artifacts/c18z9-live-service-channel-route-pool-smoke-result.json`
- `artifacts/c18z10-live-service-channel-exit-pool-smoke-result.json`
- `artifacts/c18z11-live-service-channel-entry-pool-smoke-result.json`
- `artifacts/c18z12-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z13-live-service-channel-route-quality-smoke-result.json`
- `artifacts/c18z14-live-service-channel-active-quality-shift-smoke-result.json`
- `artifacts/c18z15-live-service-channel-effective-quality-smoke-result.json`
- `artifacts/c18z16-live-service-channel-quality-fairness-smoke-result.json`
- `artifacts/c18z17-live-service-channel-quality-cleanup-smoke-result.json`
- `artifacts/c18z18-service-channel-session-scoped-fairness-smoke-result.json`
- `artifacts/c18z19-service-channel-parallel-flow-window-smoke-result.json`
- `artifacts/c18z20-service-channel-adaptive-window-telemetry-smoke-result.json`
- Docker-test smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:18080/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:

- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Current active next layer after the 2026-05-09 C18Z109 breadcrumb freshness
window proof:

C18Z48/C18Z49/C18Z50/C18Z51/C18Z52/C18Z53/C18Z54/C18Z55/C18Z56/C18Z57/C18Z58/C18Z59/C18Z60/C18Z61/C18Z62/C18Z63/C18Z64/C18Z65/C18Z66/C18Z67/C18Z68/C18Z69/C18Z70/C18Z71/C18Z72/C18Z73/C18Z74/C18Z75/C18Z76/C18Z77/C18Z78/C18Z79/C18Z80/C18Z81/C18Z82/C18Z83/C18Z84/C18Z85/C18Z86/C18Z87/C18Z88/C18Z89/C18Z90/C18Z91/C18Z92/C18Z93/C18Z94/C18Z95/C18Z96/C18Z97/C18Z98/C18Z99/C18Z100/C18Z101/C18Z102/C18Z103/C18Z104/C18Z105/C18Z106/C18Z107/C18Z108/C18Z109 are complete. Backend is deployed on docker-test as
|
||||
`rap-backend:fabric-service-channel-0.2.281-c18z109`; migration
|
||||
`000029_fabric_service_channel_leases` is applied on the shared test database.
|
||||
Node-agent image `rap-node-agent:0.2.270-c18z95` is built and deployed on
|
||||
`test-1/2/3`; web-admin is rebuilt and deployed to `rap_web_admin`.
|
||||
All three test nodes run the C18Z92 image and are healthy and current after the
policy update. Node-agent still requires signed service-channel lease authority
when cluster authority is pinned, but if legacy clients cannot send signed lease
headers it now calls backend introspection before accepting the unsigned token.
Accepted ingress is visible as `accepted_by=signed|introspection|legacy_unsigned`
in structured node logs and via `X-RAP-Service-Channel-Accepted-By` on HTTP
packet ingress. Durable introspection stores only `token_hash` plus a scrubbed
lease payload, so backend restarts no longer break compatibility clients. Live
lease maintenance now lists active/expired durable compatibility leases and runs
bounded cleanup through the admin API/panel. Durable access telemetry now
aggregates node-reported accepted ingress counters by signed/introspection/
legacy path, with heartbeat metadata fallback and admin-panel visibility.
Access telemetry now also correlates active durable service-channel leases with
entry/exit nodes, primary route status, backend fallback, and latest
route-quality feedback when a route exists. Normal-route access diagnostics are
smoke-proven with a temporary direct `vpn_packets` route and healthy rolling
quality window. Degraded normal-route diagnostics are also smoke-proven: the
active channel stays on a normal primary route with `force_backend_fallback=false`
while route feedback becomes `fenced` and rolling failure/drop/slow counters are
visible. Active-channel remediation diagnostics now expose
`remediation_action`, reason, optional alternate route id/status, and operator
hint, with unit coverage for healthy/noop, rebuild, backend fallback, and
authorized alternate decisions. The alternate-route remediation branch is now
live-smoke-proven: a selected primary route is degraded after lease issuance and
access telemetry recommends `prefer_alternate_route` while keeping
`force_backend_fallback=false`. C18Z57 turns that recommendation into a bounded
machine-readable `remediation_command` on the active channel row, including the
primary route, replacement route, issued time, and command TTL capped to the
lease lifetime. C18Z58 projects those commands into node-scoped synthetic mesh
config and node-agent consumes `prefer_alternate_route` as an explicit
route-manager `applied` decision with source
`service_channel_remediation_command`. C18Z59 proves active traffic follows the
replacement route after remediation: runtime heartbeat evidence shows
`last_selected_route_id` and flow-scheduler `last_route_id` on the replacement
route, with no local/backend fallback and no route send failures. C18Z60 proves
the same replacement path under multiple independent VPN flow channels: a
twelve-packet batch is classified across multiple flow-scheduler channels, all
observed replacement-route sends avoid local/backend fallback, flow drops, and
route failures. C18Z61 raises that to a pressure batch of 128 IPv4/TCP-like
packets; runtime evidence shows 32 replacement-route flow stats, scheduler
high-watermark 5, max-in-flight 4, no fallback, no drops, and no route failures.
C18Z62 adds neutral traffic-class QoS wiring at the service-channel HTTP
ingress: `X-RAP-Traffic-Class` can mark `control`, `interactive`, `reliable`,
`bulk`, or `droppable`; default traffic remains backward-compatible bulk.
Unit tests prove scheduler priority order, and live smoke proves a bulk
128-packet pressure batch plus an interactive packet both move over the
replacement route with separate traffic-class flow stats and no fallback,
drops, or route failures. C18Z63 adds a controlled concurrent runtime proof: a
bulk traffic-class send is held in-flight while an independent interactive
traffic-class packet is sent through the same ingress, and interactive completes
before bulk release with `MaxInFlight >= 2`, no drops, and no failures.
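
The `X-RAP-Traffic-Class` handling described for C18Z62 could look roughly like the sketch below. The header name, the five class names, and the bulk default come from the text above; the function names, the case-insensitive matching, and the choice to treat unrecognized values as bulk are assumptions for illustration, not the node-agent's actual code.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

const trafficClassHeader = "X-RAP-Traffic-Class"

// knownTrafficClasses lists the class names mentioned for C18Z62.
var knownTrafficClasses = map[string]bool{
	"control":     true,
	"interactive": true,
	"reliable":    true,
	"bulk":        true,
	"droppable":   true,
}

// classify normalizes a raw header value. Missing values fall back to
// "bulk"; unrecognized values also fall back to "bulk" (assumption).
func classify(raw string) string {
	v := strings.ToLower(strings.TrimSpace(raw))
	if knownTrafficClasses[v] {
		return v
	}
	return "bulk"
}

// trafficClassFromHeader reads the class from the request headers.
func trafficClassFromHeader(h http.Header) string {
	return classify(h.Get(trafficClassHeader))
}

func main() {
	h := http.Header{}
	fmt.Println(trafficClassFromHeader(h)) // no header set, so the default applies

	h.Set(trafficClassHeader, "interactive")
	fmt.Println(trafficClassFromHeader(h))
}
```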
C18Z64 adds compact runtime telemetry: `rap.fabric_flow_scheduler.v1` snapshots
include `traffic_class_counts`, so backend/admin/diagnostics can show active
flow-channel counts per traffic class without scanning each channel stat. It is
live-proven on `rap-node-agent:0.2.239-c18z64`; latest test-1 snapshot showed
`bulk=32`, `interactive=12`, drops 0. C18Z65/C18Z66 project those counts and
flow pressure fields into backend access telemetry at node, active-channel, and
cluster aggregate levels, and web-admin shows cluster/node/channel `flow QoS`
visibility. Live aggregate API result showed `bulk=32`, `interactive=12`,
`flow_channel_count=44`, `flow_max_in_flight=4`. C18Z67 adds a live HTTP
concurrent QoS proof: six parallel bulk service-channel requests ran while an
interactive traffic-class request was injected on the same entry path after
remediation; the interactive request completed in 132 ms, all 6 bulk requests
were accepted, 3072 post-remediation packets moved over the replacement route,
32 bulk and 12 interactive replacement-route flow stats were observed, and
fallback, route failures, flow drops, and scheduler drops stayed at 0. C18Z68
adds backend/admin flow-health guard diagnostics over that telemetry:
`flow_health_status` and `flow_health_reason` are projected at cluster, node,
and active-channel levels from traffic-class pressure, queue pressure, flow
drops, backend fallback, route-quality failures/drops/slow samples, and route
send latency. Web-admin now shows flow-health chips beside flow QoS.
C18Z69 adds node-side adaptive response: heartbeat flow-scheduler snapshots now
report per-class `recommended_parallel_windows` plus
`adaptive_backpressure_active/reason`, and the ingress send path uses the
traffic-class-specific window. Under pressure, bulk/droppable are reduced first,
reliable is reduced moderately, and control/interactive keep their full window
unless their own class degrades. Live smoke verified `bulk=1`, `droppable=1`,
`reliable=3`, `interactive=4`, `control=4`, no drops, and
`bulk_window_reduced_to_protect_interactive`. C18Z70 projects those adaptive
runtime fields into backend/admin access telemetry at cluster, node, and
active-channel levels. Cluster windows are aggregated by minimum non-zero
per-class recommendation, and web-admin shows adaptive window chips beside flow
health/QoS. Live API artifact shows `adaptive=true`,
`bulk_window_reduced_to_protect_interactive`, and windows `bulk=1`,
`droppable=1`, `reliable=3`, `interactive=4`, `control=4`. C18Z71 adds the
cluster-level adaptive policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/adaptive-policy`.
The policy stores audited thresholds and class windows in cluster metadata,
projects the effective fingerprint into signed node-scoped synthetic config,
and node-agent heartbeat/runtime telemetry reports `adaptive_policy_fingerprint`.
The node scheduler consumes the policy at runtime; default policy preserves
bulk=1/droppable=1/reliable=3/control=4/interactive=4, while the C18Z71 smoke
proved an operator policy with max window 6 and `bulk=2` changes the live
recommended windows without breaking interactive/control. A signed-config hash
mismatch found during the smoke was fixed by preserving all signed adaptive
policy provenance fields in the node-agent client model. C18Z72 adds the
cluster-level pool/failover policy contract:
`GET/PUT /clusters/{clusterID}/fabric/service-channels/pool-policy`. Lease
issuance now applies the effective entry/exit pool constraints and preferred
entry/exit before route selection, stores the effective policy on the lease,
and signs it into `rap.fabric_service_channel_lease_authority.v1`. Live smoke
proved a policy-constrained lease selects only the policy entry/exit from a
wider requested pool and carries matching signed `pool_policy` provenance.
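
The C18Z70 rule that cluster windows are aggregated by the minimum non-zero per-class recommendation can be sketched as follows. The function name and the map-based data shape are hypothetical, and treating zero or negative values as "no recommendation" is an assumption; the real backend model is not shown in this document.

```go
package main

import "fmt"

// aggregateWindows returns, for each traffic class, the smallest non-zero
// window recommended by any node. Zero or negative values are skipped,
// on the assumption that they mean "no recommendation reported".
func aggregateWindows(nodes []map[string]int) map[string]int {
	out := map[string]int{}
	for _, node := range nodes {
		for class, window := range node {
			if window <= 0 {
				continue
			}
			if current, ok := out[class]; !ok || window < current {
				out[class] = window
			}
		}
	}
	return out
}

func main() {
	// Two illustrative node reports; the second has no droppable recommendation.
	nodes := []map[string]int{
		{"bulk": 1, "droppable": 1, "reliable": 3, "interactive": 4, "control": 4},
		{"bulk": 2, "droppable": 0, "reliable": 4, "interactive": 4, "control": 4},
	}
	fmt.Println(aggregateWindows(nodes))
}
```

Taking the minimum means the cluster view reflects the most constrained node for each class, which is the conservative reading of the rule.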
C18Z73 projects that signed pool-policy fingerprint into active access
telemetry and guards remediation commands against routes outside the signed
lease pools. C18Z74 correlates active remediation commands with entry-node
route-manager heartbeats and reports execution states such as
`waiting_node_apply`, `applied`, `rejected_by_policy_guard`,
`pending_rebuild_request`, and `expired`. C18Z75 records `rebuild_route`
remediation as durable rebuild ledger intent rows when node-scoped synthetic
config is fetched, and access telemetry reports `rebuild_request_recorded` or
`rebuild_request_rejected`. C18Z76 makes the allowed `rebuild_route` command
visible from the node side: node-agent consumes it as a route-manager
`pending_degraded_fallback` decision with source
`service_channel_remediation_command`, and backend access telemetry correlates
that with the durable ledger as `rebuild_request_recorded_node_pending`.
C18Z77 resolves durable remediation rebuild requests inside the shared Control
Plane planner: signed-pool-valid alternates become `applied` /
`replacement_selected` and are projected as route-manager decisions with the
same command id, missing safe alternates become `no_alternate`, lease/policy
blocks become `deferred_by_policy`, and stale commands become `expired`.
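
The four planner outcomes listed for C18Z77 can be read as a small decision function. The sketch below uses the status strings from the text; the struct, its field names, and the order in which the conditions are checked are assumptions, not the planner's actual implementation.

```go
package main

import "fmt"

// rebuildRequest is a hypothetical, simplified view of a durable rebuild request.
type rebuildRequest struct {
	CommandExpired  bool            // the remediation command is stale
	BlockedByPolicy bool            // the lease or policy currently blocks a rebuild
	Alternates      []string        // candidate replacement route ids
	SignedPool      map[string]bool // route ids allowed by the signed lease pools
}

// resolveRebuild returns the planner status and, when applied, the selected route.
// The precedence (expired, then policy, then alternates) is an assumption.
func resolveRebuild(r rebuildRequest) (status string, routeID string) {
	if r.CommandExpired {
		return "expired", ""
	}
	if r.BlockedByPolicy {
		return "deferred_by_policy", ""
	}
	for _, id := range r.Alternates {
		if r.SignedPool[id] {
			return "applied", id
		}
	}
	return "no_alternate", ""
}

func main() {
	status, route := resolveRebuild(rebuildRequest{
		Alternates: []string{"route-b", "route-c"},
		SignedPool: map[string]bool{"route-c": true},
	})
	fmt.Println(status, route)
}
```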
C18Z78 adds operator-visible planner outcome chips in web-admin and proves the
`applied` branch live by adding an alternate route after lease issuance and
verifying the existing rebuild command resolves to `rebuild_request_applied`.
C18Z79 closes the planner-to-runtime proof loop for that branch: after planner
resolution, the entry node reports a route-manager decision with the same
`rebuild_request_id`, the transition is `applied_rebuild`, and live
service-channel packet traffic selects the replacement route without
local/backend fallback, route failures, or flow drops. C18Z80 hardens that
same path under sustained pressure: after planner-applied rebuild, five
post-rebuild bursts of mixed `interactive`, `bulk`, and `reliable` VPN packet
batches stay on the replacement route, the stale primary is not reselected, and
fallback/route-failure/drop deltas stay zero from the pre-pressure baseline.
C18Z81 adds the negative/rollback proof: after the initial replacement is
applied and used, a generation-valid fenced feedback report for that
replacement causes the Control Plane to select a new safe recovery route; live
traffic then moves to the recovery route, the degraded replacement is not
reselected, and fallback/failure/drop deltas stay zero for the recovery send.
The C18Z81 work also tightened older smoke checks to use per-run counter deltas
instead of absolute cumulative runtime counters.
C18Z82 closes the no-safe-recovery branch: after the replacement route reports
generation-valid fenced feedback and no new safe recovery route is created,
node-scoped synthetic config surfaces `service_channel_feedback_no_alternate`
with `pending_degraded_fallback`, `no_unfenced_alternate_route`, and
`backend_relay_degraded_fallback_until_rebuild`, proving the Control Plane
exposes a degraded/no-alternate state instead of silently sticking to a bad
replacement.
C18Z83 projects those route-manager decisions into active access telemetry and
web-admin: active channels now expose route-decision source, route id,
replacement route id, rebuild status/reason/generation, and score reasons.
The live smoke proves the no-safe state is visible through access telemetry as
`service_channel_feedback_no_alternate` /
`pending_degraded_fallback`, with operator execution state remaining compatible
with durable ledger `rebuild_request_no_alternate`.
C18Z84 aggregates those per-channel decisions at the access-telemetry summary
level: route-decision channel count, replacement decision count, applied
rebuild count, recovery decision count, and no-safe recovery count are exposed
to the API and web-admin summary chips. The no-safe branch now prioritizes the
aggregate status reason `active_channels_no_safe_recovery` over generic missing
access-report noise.
C18Z85 projects access-decision aggregates into rebuild health and incident
diagnostics. Health summary now carries access decision counts and prioritizes
`inspect_access_no_safe_recovery_route_pool_and_signed_policy` when no-safe is
active. Rebuild incidents now include `incident_source=access_decision` rows
for active channel decisions such as `access_no_safe_recovery`, with bad
severity and channel id, so operators see route-decision failures beside ledger
incidents.
C18Z86 adds silence/acknowledgement behavior for those
`incident_source=access_decision` incidents. Silence requests now carry
`incident_source` and `channel_id`; access-decision no-safe silences are stored
with a channel-scoped route key, applied back into rebuild health/incidents,
and exact current-generation incidents stop contributing to active bad count.
Generation-changing access-decision resurfacing is unit-tested; the live smoke
proves the operator silence path on docker-test.
C18Z87 exposes active rebuild/access-decision silences to operators and adds
unsilence. The API now lists active rebuild alert silences, returns
access-decision `incident_source`, `channel_id`, and display route id, and
allows deleting a silence by id. Web-admin shows an `Active rebuild silences`
table with an unsilence action. The live smoke proves list -> silence ->
unsilence and verifies the access no-safe incident becomes active again.
C18Z88 makes access-decision resurfacing operator-visible in live runtime.
Access-decision incidents now expose the silence id they resurfaced from, the
previous acknowledged generation, and the silence expiry. The live smoke
proves: access no-safe incident -> silence current generation -> wait for a new
route-decision generation -> incident returns as `alert_resurfaced=true`, active
bad count is restored, and previous generation metadata is preserved.
C18Z89 closes the resurfaced-incident operator action loop for generation
changes. Resurfaced access-decision incidents now expose
`alert_resurfaced_cause`, previous route id, and previous channel id; web-admin
shows the cause beside resurfaced incidents. The live smoke proves the operator
can re-acknowledge the resurfaced generation, the active-channel decision
context matches the incident route/generation, and the current generation
returns to a silenced state.
C18Z90 introduces the explicit signed production data-plane contract on
service-channel leases. `data_plane` is now part of the lease, authority
payload, introspection response, and lease-maintenance/admin list. It declares
that control-plane traffic uses backend API, working data uses the fabric
service channel over fabric routes, backend relay is degraded fallback only,
production forwarding is required, and logical flows are service-neutral,
protocol-agnostic, and isolated. Web-admin shows this contract in the
service-channel lease table.
C18Z91 makes node-agent consume that signed/introspected data-plane contract.
Service-channel packet ingress validates the contract, applies the preferred
fabric route, emits data-plane mode/transport/fallback/logical-flow fields in
access logs, and reports contract adoption in heartbeat access telemetry.
C18Z92 enforces disabled backend fallback policy at node-agent runtime: when a
signed lease says `backend_relay_policy=disabled`, route failure or missing
fabric route returns a visible 503 instead of silently proxying working data
through backend relay.
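
A minimal sketch of the C18Z92 rule, assuming a simplified decision function: the `disabled` policy value and the 503 status come from the text, while the function name, the `degraded_fallback` policy value, and the 200 success status are hypothetical stand-ins.

```go
package main

import (
	"fmt"
	"net/http"
)

// ingressDecision returns the HTTP status for a packet ingress attempt and
// whether the working data would be sent through backend relay.
// fabricRouteOK is false for both a missing route and a failed route send.
func ingressDecision(backendRelayPolicy string, fabricRouteOK bool) (status int, viaBackendRelay bool) {
	if fabricRouteOK {
		return http.StatusOK, false // assumed success status
	}
	if backendRelayPolicy == "disabled" {
		// Fail visibly instead of silently proxying through backend relay.
		return http.StatusServiceUnavailable, false
	}
	// Any other policy value is treated as allowing degraded fallback (assumption).
	return http.StatusOK, true
}

func main() {
	status, relay := ingressDecision("disabled", false)
	fmt.Println(status, relay)

	status, relay = ingressDecision("degraded_fallback", false)
	fmt.Println(status, relay)
}
```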
C18Z93 promotes that data-plane contract telemetry into backend access
telemetry and web-admin active-channel diagnostics: cluster, node, and
active-channel rows now show contract adoption count, last working transport,
steady-state transport, backend relay policy, data-plane mode, and logical
flow mode.
C18Z94 turns those data-plane/fallback signals into operator incidents.
`data_plane_contract` incident rows are now emitted for missing data-plane
contract reports after accepted service-channel traffic, wrong working or
steady-state transport, wrong logical flow mode, disabled backend relay
observed, and degraded backend relay usage. The incident list/readiness path
can now surface a recommended action such as restoring the fabric route instead
of treating backend relay as normal service traffic.
C18Z95 adds node-agent blocked-fallback telemetry. When a signed data-plane
contract disables backend relay and the entry runtime cannot use a fabric
route, node-agent reports `backend_fallback_blocked`, the last data-plane
violation status/reason, and backend/admin project those fields to cluster,
node, channel, and `data_plane_contract` incident diagnostics. Disabled-policy
refusal is now separate from real backend relay usage.
C18Z96 wires normal-route send failure with disabled backend relay into the
existing route feedback and rebuild planner path. When heartbeat access
telemetry reports `fabric_route_send_failed_backend_fallback_blocked`, backend
correlates the entry node's active service-channel leases, records fenced
`fabric_service_channel_route_feedback` for the selected primary route, and the
existing planner can select an alternate/replacement route. This keeps blocked
fallback from becoming a dead-end operator alert.
C18Z97 adds bounded deduplication for those access-report-derived route
feedback records. Repeated blocked-fallback send-failure heartbeats no longer
rewrite the same active fenced feedback or churn planner rebuild attempts while
the first access-report feedback is still active. Runtime feedback from the
flow scheduler remains independent.
C18Z98 carries that feedback identity into the replacement decision and
rebuild-attempt ledger: decision and ledger rows now expose
`feedback_observation_id`, `feedback_source`, feedback observed/expiry time,
channel/resource ids, and data-plane violation status/reason. Web-admin shows
that correlation in Route decisions and Rebuild ledger.
C18Z99 adds rebuild ledger filters for those correlation fields. The backend
`/fabric/service-channels/rebuild-attempts` API accepts `feedback_source`,
`feedback_channel_id`, and `feedback_violation_status`, and web-admin exposes
the same filters in the rebuild ledger form. The live smoke proves source,
channel, violation, combined filters, and wrong-channel exclusion.
C18Z100 adds rebuild-health feedback breakdown aggregation for the same
correlation fields. The backend rebuild-health summary now returns
`feedback_breakdowns` grouped by feedback source, feedback channel id, and
feedback violation status, including total/good/warn/bad/unknown counts,
active warn/bad counts, silenced count, latest observation time, and affected
reporter nodes/routes. Web-admin shows this breakdown in the Rebuild health
panel so operators can see which access-report-derived failure classes dominate
active warn/bad rebuild state.
C18Z101 wires that breakdown into operator workflow in web-admin. Each
feedback-breakdown row now shows related incident context by channel/reporter/
route overlap and has an `open ledger` action that switches to the deep rebuild
ledger with `feedback_source`, `feedback_channel_id`, and
`feedback_violation_status` prefilled from the breakdown row.
C18Z102 adds backend audit breadcrumbs for that workflow. The existing rebuild
investigation endpoint now accepts feedback source/channel/violation drilldown
payloads, records
`fabric.service_channel_rebuild_feedback_breakdown.investigation_opened`
cluster audit events, and web-admin records one before opening the filtered
deep ledger from a rebuild-health feedback breakdown row.
C18Z103 surfaces those breadcrumbs directly in the Fabric diagnostics panel.
Web-admin now filters the loaded cluster audit list for rebuild incident and
feedback-breakdown investigation events and shows recent drilldowns with time,
source, feedback filters, target reporter/route, actor, and reason beside
rebuild incidents and silences.
C18Z104 adds focused audit loading for that panel. The cluster audit API now
accepts `event_type` and `target_type` filters, including repeated or
comma-separated values, and web-admin loads recent fabric investigation
breadcrumbs with a dedicated filtered request instead of depending on the
generic latest-100 cluster audit list.
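
Accepting both repeated and comma-separated `event_type` / `target_type` values, as described for C18Z104, could be handled with a small normalizer like the one below. The function name, the whitespace trimming, and the de-duplication are assumptions for the example; the query string in `main` uses made-up filter values.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// splitFilterValues flattens repeated and comma-separated query values into
// one ordered list, dropping empty entries and duplicates (assumption).
func splitFilterValues(raw []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, item := range raw {
		for _, part := range strings.Split(item, ",") {
			part = strings.TrimSpace(part)
			if part == "" || seen[part] {
				continue
			}
			seen[part] = true
			out = append(out, part)
		}
	}
	return out
}

func main() {
	// Illustrative query only; the values are placeholders, not real event types.
	q, err := url.ParseQuery("event_type=type.a,type.b&event_type=type.c&target_type=kind.x")
	if err != nil {
		panic(err)
	}
	fmt.Println(splitFilterValues(q["event_type"]))
	fmt.Println(splitFilterValues(q["target_type"]))
}
```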
C18Z105 correlates those focused audit breadcrumbs back to currently visible
diagnostics in web-admin. Recent investigation rows now show whether the
breadcrumb still matches an active rebuild-health feedback breakdown or visible
rebuild incident, and provide an `open` action to jump back into the matching
filtered ledger path.
C18Z106 moves that correlation into the backend/API. `GET /audit` with
`correlation=fabric_diagnostics` now returns `correlation_hints` for focused
fabric investigation breadcrumbs, including current diagnostic status
(`breakdown_active`, `incident_visible`, or `not_visible`) and the matching
breakdown/incident object when present. Web-admin consumes those hints and keeps
its previous local matching as fallback. During verification the noisy test
history exposed that rebuild-health feedback breakdowns were capped too tightly;
the backend now returns up to 100 breakdown groups so fresh failure classes are
not pushed out by older smoke history.
C18Z107 adds a compact backend-provided `audit_summary` beside `audit_events`.
For focused Fabric diagnostics audit reads, the summary includes total count,
counts by event/target type, counts by current diagnostic status, counts by
feedback source/violation status, correlated count, not-visible count, and
latest time. Web-admin shows these as Recent investigations chips and short
source/violation lines without recalculating the aggregate in the browser.
C18Z108 moves Fabric diagnostics investigation breadcrumbs off the generic
cluster audit read path. Backend now exposes
`GET /clusters/{clusterID}/fabric/service-channels/rebuild-investigations/breadcrumbs`
with a dedicated `rebuild_investigation_breadcrumbs` contract containing
events plus summary. Web-admin uses this endpoint for Recent investigations
and keeps generic audit semantics separate from Fabric diagnostics workflow
state.
C18Z109 adds freshness windows to the dedicated breadcrumb contract. The
endpoint accepts `current_window_seconds` and `history_window_seconds`, annotates
each breadcrumb with `correlation_hints.breadcrumb_status` (`current`, `stale`,
or `expired`) plus age/window seconds, returns current/stale/expired totals, and
adds `counts_by_breadcrumb_status` to the summary. Web-admin shows freshness
chips and an age column in Recent investigations, so operators can separate live
workflow hints from stale history without deleting audit records.
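
One plausible reading of the C18Z109 freshness windows is sketched below. The status names and the two window parameters come from the text; the exact boundary rule (age within the current window is `current`, within the history window is `stale`, otherwise `expired`) and both function names are assumptions, since the document does not spell out the comparison.

```go
package main

import "fmt"

// breadcrumbStatus classifies a breadcrumb by its age in seconds.
// The inclusive boundaries are an assumption made for this sketch.
func breadcrumbStatus(ageSeconds, currentWindowSeconds, historyWindowSeconds int64) string {
	switch {
	case ageSeconds <= currentWindowSeconds:
		return "current"
	case ageSeconds <= historyWindowSeconds:
		return "stale"
	default:
		return "expired"
	}
}

// countsByBreadcrumbStatus builds a per-status tally for a list of ages.
func countsByBreadcrumbStatus(ages []int64, currentWindowSeconds, historyWindowSeconds int64) map[string]int {
	counts := map[string]int{}
	for _, age := range ages {
		counts[breadcrumbStatus(age, currentWindowSeconds, historyWindowSeconds)]++
	}
	return counts
}

func main() {
	// Illustrative ages and windows only.
	ages := []int64{30, 900, 90000}
	fmt.Println(countsByBreadcrumbStatus(ages, 300, 86400))
}
```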

Live verification passed:

- `scripts/fabric/c18z48-service-channel-introspection-smoke.ps1`
- `scripts/fabric/c18z50-service-channel-durable-introspection-smoke.ps1`
- `scripts/fabric/c18z51-service-channel-lease-maintenance-smoke.ps1`
- `scripts/fabric/c18z52-service-channel-access-telemetry-smoke.ps1`
- `scripts/fabric/c18z53-service-channel-access-correlation-smoke.ps1`
- `scripts/fabric/c18z54-service-channel-normal-route-access-smoke.ps1`
- `scripts/fabric/c18z55-service-channel-degraded-route-access-smoke.ps1`
- `scripts/fabric/c18z56-service-channel-alternate-remediation-smoke.ps1`
- `scripts/fabric/c18z57-service-channel-remediation-command-smoke.ps1`
- `scripts/fabric/c18z58-service-channel-remediation-apply-smoke.ps1`
- `scripts/fabric/c18z59-service-channel-remediation-traffic-smoke.ps1`
- `scripts/fabric/c18z60-service-channel-remediation-multiflow-smoke.ps1`
- `scripts/fabric/c18z61-service-channel-remediation-pressure-smoke.ps1`
- `scripts/fabric/c18z62-service-channel-remediation-qos-smoke.ps1`
- `scripts/fabric/c18z67-service-channel-concurrent-qos-live-smoke.ps1`
- `scripts/fabric/c18z69-service-channel-adaptive-backpressure-smoke.ps1`
- `scripts/fabric/c18z71-service-channel-adaptive-policy-smoke.ps1`
- `scripts/fabric/c18z72-service-channel-pool-policy-smoke.ps1`

Artifacts:

- `artifacts/c18z48-service-channel-introspection-smoke-result.json`
- `artifacts/c18z50-service-channel-durable-introspection-smoke-result.json`
- `artifacts/c18z51-service-channel-lease-maintenance-smoke-result.json`
- `artifacts/c18z52-service-channel-access-telemetry-smoke-result.json`
- `artifacts/c18z53-service-channel-access-correlation-smoke-result.json`
- `artifacts/c18z54-service-channel-normal-route-access-smoke-result.json`
- `artifacts/c18z55-service-channel-degraded-route-access-smoke-result.json`
- `artifacts/c18z56-service-channel-alternate-remediation-smoke-result.json`
- `artifacts/c18z57-service-channel-remediation-command-smoke-result.json`
- `artifacts/c18z58-service-channel-remediation-apply-smoke-result.json`
- `artifacts/c18z59-service-channel-remediation-traffic-smoke-result.json`
- `artifacts/c18z60-service-channel-remediation-multiflow-smoke-result.json`
- `artifacts/c18z61-service-channel-remediation-pressure-smoke-result.json`
- `artifacts/c18z62-service-channel-remediation-qos-smoke-result.json`
- `artifacts/c18z63-service-channel-concurrent-qos-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-go-test.jsonl`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-smoke-result.json`
- `artifacts/c18z64-service-channel-traffic-class-telemetry-live-snapshot.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-api-result.json`
- `artifacts/c18z65-service-channel-access-qos-telemetry-smoke-result.json`
- `artifacts/c18z66-service-channel-access-qos-aggregate-api-result.json`
- `artifacts/c18z67-service-channel-concurrent-qos-live-smoke-result.json`
- `artifacts/c18z68-service-channel-flow-health-api-result.json`
- `artifacts/c18z69-service-channel-adaptive-backpressure-smoke-result.json`
- `artifacts/c18z70-service-channel-adaptive-telemetry-api-result.json`
- `artifacts/c18z71-service-channel-adaptive-policy-smoke-result.json`
- `artifacts/c18z72-service-channel-pool-policy-smoke-result.json`
- `artifacts/c18z73-service-channel-pool-policy-remediation-guard-smoke-result.json`
- `artifacts/c18z74-service-channel-remediation-execution-smoke-result.json`
- `artifacts/c18z75-service-channel-rebuild-intent-smoke-result.json`
- `artifacts/c18z76-service-channel-rebuild-node-pending-smoke-result.json`
- `artifacts/c18z77-service-channel-rebuild-planner-resolution-smoke-result.json`
- `artifacts/c18z78-service-channel-rebuild-planner-applied-smoke-result.json`
- `artifacts/c18z79-service-channel-planner-runtime-loop-smoke-result.json`
- `artifacts/c18z80-service-channel-post-rebuild-pressure-smoke-result.json`
- `artifacts/c18z81-service-channel-replacement-degradation-recovery-smoke-result.json`
- `artifacts/c18z82-service-channel-no-safe-recovery-smoke-result.json`
- `artifacts/c18z83-service-channel-access-telemetry-no-safe-smoke-result.json`
- `artifacts/c18z84-service-channel-access-decision-aggregate-smoke-result.json`
- `artifacts/c18z85-service-channel-access-decision-incident-smoke-result.json`
- `artifacts/c18z86-service-channel-access-decision-silence-smoke-result.json`
- `artifacts/c18z87-service-channel-access-decision-unsilence-smoke-result.json`
- `artifacts/c18z88-service-channel-access-decision-resurface-smoke-result.json`
- `artifacts/c18z89-service-channel-access-decision-resurface-action-smoke-result.json`
- `artifacts/c18z90-service-channel-data-plane-contract-smoke-result.json`
- `artifacts/c18z91-node-agent-data-plane-contract-enforcement-smoke-result.json`
- `artifacts/c18z92-node-agent-disabled-backend-fallback-smoke-result.json`
- `artifacts/c18z93-access-telemetry-data-plane-contract-smoke-result.json`
- `artifacts/c18z94-data-plane-contract-incident-smoke-result.json`
- `artifacts/c18z95-node-agent-blocked-fallback-telemetry-smoke-result.json`
- `artifacts/c18z96-blocked-fallback-rebuild-feedback-smoke-result.json`
- `artifacts/c18z97-blocked-fallback-feedback-dedup-smoke-result.json`
- `artifacts/c18z98-blocked-fallback-rebuild-correlation-smoke-result.json`
- `artifacts/c18z99-rebuild-correlation-filter-smoke-result.json`
- `artifacts/c18z100-rebuild-health-feedback-breakdown-smoke-result.json`
- `artifacts/c18z102-rebuild-health-feedback-drilldown-audit-smoke-result.json`
- `artifacts/c18z104-focused-fabric-audit-smoke-result.json`
- `artifacts/c18z106-audit-correlation-hints-smoke-result.json`
- `artifacts/c18z107-audit-correlation-summary-smoke-result.json`
- `artifacts/c18z108-dedicated-breadcrumbs-smoke-result.json`
- `artifacts/c18z109-breadcrumb-freshness-window-smoke-result.json`
Cluster Authority foundation is now also complete:
Current active continuation after C19Z1:

- every newly created cluster gets an Ed25519 `cluster_authorities` key record
- cluster authority private keys are encrypted at rest when
  `SECRET_ENCRYPTION_KEY_B64`/file is configured; production already requires
  a secret encryption key
- legacy/default clusters are backfilled lazily through `EnsureClusterAuthority`
- backend signs join-token scope material, node approval/bootstrap material,
  and node-scoped synthetic mesh config snapshots
- node-agent verifies signed Control Plane synthetic config when
  `authority_required=true` or signature fields are present
- node-agent can pin `RAP_CLUSTER_AUTHORITY_PUBLIC_KEY` and
  `RAP_CLUSTER_AUTHORITY_FINGERPRINT`, and identity state can store the same
  trust anchor after approval
- web-admin shows cluster key fingerprints on summaries, join-token output,
  approval rows, and synthetic config visibility
- docker-test lifecycle smoke is complete: fresh dev install, first-owner
  bootstrap, cluster creation, signed join token, real node-agent enrollment,
  owner approval, automatic signed bootstrap polling, authority pin
  persistence, heartbeat, and signed synthetic config verification all passed
- `rap-node-agent` desired-workload polling/status reporting is gated by
  `RAP_WORKLOAD_SUPERVISION_ENABLED`, which defaults to `false` while service
  runtime supervision remains a stub

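
The verification rule in the list above (verify when `authority_required=true`
or when signature fields are present) can be sketched as a small predicate.
This is an illustration only, not the node-agent code: the signature field
names below are hypothetical, and only `authority_required` comes from these
notes.

```python
from typing import Any, Mapping

# Hypothetical field names; the notes only say "signature fields".
SIGNATURE_FIELDS = ("signature", "signature_key_fingerprint")


def must_verify_signature(config: Mapping[str, Any]) -> bool:
    """Return True when a synthetic config snapshot has to be signature-checked."""
    if config.get("authority_required") is True:
        return True
    # A populated signature field also forces verification, so a signed
    # snapshot is never accepted without a check.
    return any(config.get(name) for name in SIGNATURE_FIELDS)
```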
C19Z1 is implemented and runtime-smoke-proven. Remote Workspace adapter sessions
now expose read-only mailbox handoff preflight:
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/preflight?consumer_id=...&resume_from=ack|checkpoint`.
The response validates the consumer cursor and reports the expected next event
window (`after_sequence`, available/returned/skipped counts, first/last expected
sequence) without reading, draining, acking, or mutating consumer state.
Node-agent image `rap-node-agent:codex-service-supervisor-20260512z2` is
deployed on `test-1/2/3`. Verification artifacts:
`artifacts/c19z1-remote-workspace-mailbox-preflight-smoke-result.json`, C19X
source
`artifacts/c19z1-remote-workspace-mailbox-preflight-source-result.json`, and
C19Z regression
`artifacts/c19z1-remote-workspace-adapter-readiness-smoke-result.json`.

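
A minimal sketch of how such a read-only preflight window could be computed.
The data shapes, the `limit` parameter, and the meaning of "skipped" (events
that are available but fall outside the returned window) are assumptions for
illustration; the notes only name the reported fields. The function reads its
inputs and returns a new value, so no consumer state is touched.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ConsumerCursor:
    ack_sequence: int         # last sequence the consumer acknowledged
    checkpoint_sequence: int  # last sequence the consumer checkpointed


@dataclass(frozen=True)
class PreflightWindow:
    after_sequence: int
    available: int
    returned: int
    skipped: int
    first_expected_sequence: Optional[int]
    last_expected_sequence: Optional[int]


def preflight(event_sequences: Sequence[int], cursor: ConsumerCursor,
              resume_from: str, limit: int) -> PreflightWindow:
    """Describe the next event window without reading or acking anything."""
    if resume_from == "ack":
        after = cursor.ack_sequence
    elif resume_from == "checkpoint":
        after = cursor.checkpoint_sequence
    else:
        raise ValueError(f"unsupported resume_from: {resume_from!r}")
    if limit < 0:
        raise ValueError("limit must not be negative")

    pending = sorted(s for s in event_sequences if s > after)
    window = pending[:limit]
    return PreflightWindow(
        after_sequence=after,
        available=len(pending),
        returned=len(window),
        skipped=len(pending) - len(window),
        first_expected_sequence=window[0] if window else None,
        last_expected_sequence=window[-1] if window else None,
    )
```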
Node enrollment bootstrap polling is also complete:

- backend exposes `/node-agents/enrollments/{requestID}/bootstrap`
- pending agents prove `cluster_id`, `node_fingerprint`, and `public_key`
  before receiving status/bootstrap material
- `rap-node-agent` stores `pending_join_request_id`, polls for approval,
  verifies the signed bootstrap contract, then persists `node_id`,
  `identity_status`, and the cluster authority pin into `identity.json`
- polling is controlled by `RAP_ENROLLMENT_POLL_INTERVAL_SECONDS` and
  `RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS`

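
The polling behavior above can be sketched as a bounded loop driven by the two
environment variables. This is an assumption-laden illustration: the default
values, the `fetch` callback, and the function names are hypothetical, and
signature verification of the returned material is left out.

```python
import os
import time
from typing import Any, Callable, Optional, Tuple


def _env_seconds(name: str, default: float) -> float:
    """Read a positive number of seconds from the environment, else the default."""
    try:
        value = float(os.environ.get(name, ""))
    except ValueError:
        return default
    return value if value > 0 else default


def poll_settings() -> Tuple[float, float]:
    # Default values are placeholders, not documented agent defaults.
    interval = _env_seconds("RAP_ENROLLMENT_POLL_INTERVAL_SECONDS", 5.0)
    timeout = _env_seconds("RAP_ENROLLMENT_POLL_TIMEOUT_SECONDS", 300.0)
    return interval, timeout


def poll_for_bootstrap(fetch: Callable[[], Optional[Any]],
                       interval: float, timeout: float,
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[Any]:
    """Call fetch() until it returns material or the timeout would be exceeded."""
    deadline = clock() + timeout
    while True:
        material = fetch()
        if material is not None:
            return material
        if clock() + interval > deadline:
            return None  # still pending; the caller decides what happens next
        sleep(interval)
```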
Current state:

- C17Z12 added rendezvous/relay control-plane leases for peers that would
  otherwise stay in `waiting_rendezvous`.
- C17Z13-C17Z14 added lease telemetry and node-scoped synthetic-config refresh
  for renewal/stale relay recovery.
- C17Z15 added backend stale-relay replacement/withdrawal policy and alternate
  relay-pool scoring.
- C17Z16 added Control Plane `route_path_decisions`.
- C17Z17 added node-side route generation apply/withdraw tracking.
- C17Z18 applies Control Plane `route_path_decisions` to synthetic
  route-health route config only. The synthetic `fabric.route_health` runtime
  now probes the selected effective path, including replacement relay paths,
  and reports expected/observed hops plus drift state.
- C17Z19 consumes those synthetic route-health observations in backend relay
  scoring. Drift/unreachable/failure feedback marks the exact selected relay
  stale and can trigger replacement; healthy low-latency route-health boosts
  alternate relay score reasons. Migration `000022` adds the `synthetic` mesh
  service class, and web-admin marks relay policy `rh feedback`.
- C17Z20 closes the node-side feedback loop. After node-agent reports
  synthetic route-health drift/unreachable/failure, it performs a bounded
  node-scoped synthetic-config refresh, applies returned replacement route
  decisions to route-health config immediately, and reports
  `c17z20.mesh_route_health_feedback_refresh_report.v1`.
- Backend `mesh_latest_links` now keeps latest observations per observation
  type/route, so `synthetic_route_health` is not overwritten by
  `peer_connection_manager`.
- Web-admin Fabric links now show observation type, selected relay, and
  route-health effective/observed path.
- All of this remains control-plane/synthetic route-health only. It does not
  forward RDP/VPN/service payloads, does not start VPN runtime, and does not
  implement arbitrary relay packet forwarding.
- Cluster Authority and node enrollment bootstrap are docker-test
  lifecycle-smoke verified in run `dev-bootstrap-20260428-201430`.
- Fresh migration replay found and fixed a PostgreSQL view replacement issue in
  `000021_cluster_authority_keys`; the migration now drops/recreates
  `cluster_admin_summaries` in up/down paths.

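
The `mesh_latest_links` change in the list above amounts to widening the
"latest row" key so that different observation types stop overwriting each
other. A rough in-memory sketch, assuming a key of source node, peer node,
observation type, and route id (the real table layout is not described in
these notes):

```python
from typing import Any, Dict, Optional, Tuple

# Key shape is an assumption: (source node, peer node, observation type, route id).
LinkKey = Tuple[str, str, str, str]


class LatestLinks:
    """In-memory stand-in for a latest-observation table keyed per type and route."""

    def __init__(self) -> None:
        self._rows: Dict[LinkKey, Dict[str, Any]] = {}

    def record(self, source: str, peer: str, observation_type: str,
               route_id: str, observed_at: float, data: Any) -> None:
        key = (source, peer, observation_type, route_id)
        current = self._rows.get(key)
        # Only a newer observation of the same type/route replaces the stored row.
        if current is None or observed_at >= current["observed_at"]:
            self._rows[key] = {"observed_at": observed_at, "data": data}

    def latest(self, source: str, peer: str, observation_type: str,
               route_id: str) -> Optional[Any]:
        row = self._rows.get((source, peer, observation_type, route_id))
        return None if row is None else row["data"]
```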
Runtime report:

- `artifacts/c17z18-route-health-effective-path-report.md`
- `artifacts/c17z19-route-health-feedback-report.md`
- `artifacts/c17z19-route-health-feedback-smoke-result.json`
- `artifacts/c17z20-route-health-feedback-refresh-report.md`
- `artifacts/dev-cluster-enrollment-bootstrap-smoke-report.md`
- Docker-test smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\c17z12-rendezvous-relay-smoke-ssh.ps1 -KeepRunning`
- Dev lifecycle smoke command:
  `pwsh -NoProfile -ExecutionPolicy Bypass -File scripts\fabric\dev-cluster-enrollment-bootstrap-smoke-ssh.ps1 -KeepRunning`
- Last proven runtime run: `c17z18-20260428-221601` (legacy smoke script name,
  current C17Z20 node-agent code)
- Last proven dev lifecycle run: `dev-bootstrap-20260428-201430`
- Admin: `http://192.168.200.61:5174/`
- C17Z20 multi-agent API: `http://192.168.200.61:18120/api/v1`
- C17Z19 backend-only API: `http://192.168.200.61:18122/api/v1`
- Dev lifecycle API: `http://192.168.200.61:18121/api/v1`

Do not automatically continue into:

- RDP/VNC/SSH/file/video/service workload traffic over mesh
- VPN/IP tunnel runtime implementation
- arbitrary relay packet forwarding
- production payload forwarding for relay paths
- QUIC/WebRTC or STUN/TURN/ICE
- TUN/TAP, host route, DNS, or firewall manipulation
- backend/session lifecycle changes
- Windows client changes

Next narrow layer, if approved:

C17Z21 should tighten route-health feedback refresh dampening: if an immediate
feedback refresh returns the same config version or no replacement change, keep
a per-route/relay no-change cooldown before retrying. Keep the boundary
synthetic/control-plane only and keep RDP/VPN/service payload forwarding
untouched.

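
One way the proposed C17Z21 dampening could look, as a sketch only. The class,
method names, and the cooldown length are hypothetical; the rule itself (cool
down per route/relay when a refresh returns the same config version or no
replacement change) is taken from the paragraph above.

```python
import time
from typing import Callable, Dict, Tuple


class RefreshDampener:
    """Per-route/relay no-change cooldown for feedback-triggered refreshes."""

    def __init__(self, cooldown_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._blocked_until: Dict[Tuple[str, str], float] = {}

    def may_refresh(self, route_id: str, relay_id: str) -> bool:
        until = self._blocked_until.get((route_id, relay_id))
        return until is None or self._clock() >= until

    def record_result(self, route_id: str, relay_id: str, previous_version: int,
                      returned_version: int, replacement_changed: bool) -> None:
        key = (route_id, relay_id)
        if returned_version == previous_version or not replacement_changed:
            # The refresh changed nothing useful: back off before retrying.
            self._blocked_until[key] = self._clock() + self._cooldown
        else:
            # A real change arrived: clear any earlier cooldown for this pair.
            self._blocked_until.pop(key, None)
```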
Next narrow Remote Workspace layer should stay probe-only and node-local. A good
C19Z2 candidate is handoff preflight telemetry: add counters/last-preflight
fields for the read-only preflight endpoint in workload status/heartbeat reports,
so operators can distinguish handoff checks from mailbox reads. Do not add
desktop frame transport, Android work, backend relay semantics, or production
adapter payload forwarding in this slice.

@@ -0,0 +1,79 @@
# VPN baseline 0.2.87

Date: 2026-05-05

This document freezes the current near-working VPN state. Treat it as the
rollback and comparison point before changing the Android VPN dataplane,
gateway assignment, mesh route intents, or packet relay behavior.

## Baseline components

- Android client: `0.2.87` / version code `87`
- APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-latest-release.apk`
- Known-good APK path: `web-admin/deploy/html/downloads/rap-android-rdp-vpn-known-good-0.2.87.apk`
- Versioned APK path: `web-admin/deploy/html/downloads/releases/0.2.87/rap-android-rdp-vpn-0.2.87-release.apk`
- APK sha256: `bc44304658df7cd0ad627660c9e7b37af68910cdb13b310314ab99a049ff3b8d`
- APK size: `1187103`
- Backend image: `rap-backend:vpn-dataplane-contract-0.2.86`
- Node/host agents: `0.2.86`
- Cluster: `cfc0743d-d960-49fb-9de8-96e063d5e4aa`
- VPN connection: `7cc94b0d-9cc2-4492-956a-cb0913b887e2` (`home-full-tunnel`)
- Entry node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
- Exit node: `home-1` (`8ad04829-cd30-4290-913d-1ce5c7ef7bb3`)
- DNS from exit side: `192.168.200.210`
- Client tunnel: full tunnel, `0.0.0.0/0`, VPN address `10.77.0.2/24`
- Active gateway lease: home-1, generation `8`
- Active relay transport: `backend_http_packet_relay`

## Current working behavior

- General web traffic passes through the VPN.
- External sites open through the configured home exit.
- Telegram can connect, but initial connection may be delayed.
- RDP can connect through the tunnel, but long-lived sessions can still drop.
- Speed is the best observed so far, but speed-test pages may delay loading
  their plugin/script parts.

## Observed diagnostics

Latest phone diagnostics for device `37574bd4-b944-440f-bbd5-87f2980d22c4`
reported Android app version `0.2.87`.

Packet relay counters showed both directions are active:

- `client_to_gateway`: no queue drops observed, queue depth returned to `0`
- `gateway_to_client`: queue depth was observed at `48-55`
- `gateway_to_client`: `246` dropped packets were observed
- Android side recorded downlink traffic, uplink traffic, and several uplink
  sender errors
- Android source validation dropped packets whose source was not the VPN
  address; keep this guard enabled

Interpretation: the active path is real and carries traffic, but downlink
backpressure or Android TUN drain stalls can still interrupt long-lived TCP
flows. This explains delayed Telegram startup, speed-test plugin loading
delays, and RDP sessions that connect and later drop.

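
For reference, the source-validation guard mentioned in the counters above can
be pictured as a check on the IPv4 header source field. This is a Python
illustration of the idea, not the Android client code; the function name is
hypothetical and IPv6 is ignored here.

```python
import ipaddress


def source_is_vpn_address(packet: bytes, vpn_address: str) -> bool:
    """True only for IPv4 packets whose source equals the tunnel address."""
    # A minimal IPv4 header is 20 bytes; the version sits in the top nibble.
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return False
    source = ipaddress.IPv4Address(packet[12:16])  # bytes 12..15 hold the source
    # Accepts "10.77.0.2/24" style values and compares the host part only.
    return source == ipaddress.ip_interface(vpn_address).ip
```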
## Guardrails

- Do not reduce Android `TUN_WRITE_MAX_RETRIES` below `1000` without a
  controlled regression test.
- Do not relax Android VPN source-address validation.
- Do not re-enable the home-1 `vpn_packets` fabric mesh route intent for this
  connection until the Android client can intentionally use the fabric entry
  path. The current working baseline relies on `backend_http_packet_relay`.
- Do not change the active entry/exit away from home-1 without saving packet
  counters before and after.
- Do not change DNS away from `192.168.200.210` without checking full-tunnel
  DNS and direct-IP traffic separately.
- Keep the 0.2.87 APK available as a known-good rollback artifact.

## Next safe work

1. Stabilize `gateway_to_client` downlink queue draining and Android TUN write
   backpressure.
2. Add clearer per-flow counters for long-lived TCP flows such as RDP.
3. Add a small repeatable smoke test: DNS, direct IP HTTP, 2ip.ru, Telegram-like
   long connection, and RDP port reachability.
4. Only after this baseline is stable, move Android entry traffic from backend
   relay to fabric mesh.