2026-05-18 21:33:39 +03:00

Cluster, Node, Mesh, and Admin Foundation

Status: target foundation plan plus staged implementation tracker. This document defines the next platform-core direction and implementation order. It is not an instruction to rewrite the current RDP MVP.

The current RDP work is paused by product decision. The proven RDP access core remains valuable and must not be broken. The next platform step is to build the cluster, node enrollment, mesh preparation, and platform administration foundation that the Secure Access Fabric will depend on.

Purpose

The platform is a Secure Access Fabric, not only an RDP proxy. RDP is the first managed service adapter, but the long-term platform must support additional adapters and network services such as VNC, SSH, VPN/IP tunnel, internal web apps, file services, video/audio, update cache, and relay/entry nodes.

The lower platform foundation is the RAP Fabric Core: a distributed runtime layer above the host OS and below service adapters. It is not a real operating system. It is implemented through native rap-node-agent, control-plane contracts, signed scoped cluster snapshots, node-local state, and service supervision boundaries.

Before implementing mesh traffic, VPN runtime, or organization administration UI, the platform needs a clear foundation for:

clusters
host-level node identity
node enrollment
Fabric Core terminology and contracts
scoped configuration distribution
node-local state
role assignment
service workload placement
platform-owner administration
future organization administration
safe multi-cluster operations

Current Foundation Inventory

The backend already contains an early platform-core foundation:

nodes
node_capabilities
node_services
node_update_policies
node_partition_states
node_agent_update_runs
node management routes under /nodes
node-agent routes under /node-agents

This is useful, but it is not sufficient as the final node and cluster model. Current gaps:

no explicit first-class clusters
no explicit cluster membership model
no join request approval workflow
no short-lived join token model
no node certificate lifecycle model
no clear separation between technical capability and authorized role assignment
no platform admin console model
no multi-cluster administration model
no organization-safe visibility model for entry/egress/service endpoints
node-agent registration is too direct for production trust boundaries

Vocabulary

Platform: The whole Secure Access Fabric installation operated by the platform owner.

Cluster: A bounded control and trust domain containing nodes, policies, service placements, certificates, health state, and routing state. A platform may operate multiple clusters.

Organization: A tenant. Organizations own or are granted access to resources, policies, sessions, services, and selected entry/egress points. Organizations must not see full internal mesh topology.

Fabric Core: The lower OS-like distributed runtime layer of the Secure Access Fabric. It runs above the host OS and is owned by native rap-node-agent plus control-plane contracts. It owns node identity, enrollment, local state, capability reporting, role assignment consumption, signed scoped configuration snapshots, update trust, and service supervision boundaries.

Node: A host-level identity. A node is not a Docker container. A node is managed by native rap-node-agent.

Node Agent: Native host process responsible for node identity, enrollment, certificates, health, role execution, service workload supervision, updates, recovery, and reporting.

Service Workload: A workload executed on a node. It may be native or containerized. Examples: rdp-worker, vnc-worker, entry-node, relay-node, file-storage-cache.

Public/Admin HTTPS Ingress: A service-edge role that listens on TCP 80/443 for browser/API HTTPS and forwards accepted requests into the QUIC-only fabric service channel. It is not an authority service and does not imply permission to manage the cluster.

Admin UI Runtime: A scoped admin service runtime. Global admin runtime may run only on platform-owner trusted nodes; cluster, organization, and user portal runtimes receive only their scoped projections.

Capability: What a node can technically do. Example: can_run_rdp_worker.

Role Assignment: What a node is authorized to do in a cluster and, where relevant, for an organization. Capabilities are not permissions.

Join Request: A request from a node-agent to join a cluster, usually created with a short-lived join token and approved by a platform admin.

Entry Node: Accepts client connections and maps them into platform logical channels.

Core Mesh Node: Maintains secure overlay routing and QoS. It should remain protocol-neutral.

Egress / Service Node: Hosts service adapters or network exits that connect to target resources.

Service Adapter: Protocol-specific edge function, for example RDP Adapter, VNC Adapter, SSH Adapter, HTTP/internal app adapter, or video/audio adapter.

Remote Server/Desktop Access: Product-level managed access service consumed by Access Client. RDP, VNC, and SSH are internal protocol adapters selected from the organization resource protocol; they are not separate organization-facing cluster services.

Egress Pool: Organization-visible logical exit such as "Office Moscow" or "Datacenter Kazan". It may be backed by multiple internal nodes. Organizations reference egress pools, not concrete nodes or mesh topology.

Fabric Routing Engine: Logical fabric-layer component responsible for peer selection, route discovery, route scoring, failover, shortcut decisions, route cache management, topology hiding, and channel-aware routing. It is not a runtime implementation in the current phase.

Fabric Storage Service / Config Storage Service: Future scoped storage/distribution service for signed cluster snapshots, peer directories, policy snapshots, update metadata, and node-scoped configuration. PostgreSQL remains authoritative.

Version Storage / Update Repository: Logical artifact repository for signed platform, node-agent, and service workload releases. It stores release manifests, OS/architecture artifacts, stable/current/candidate channel metadata, hashes, signatures, compatibility metadata, and migration bundles. PostgreSQL remains authoritative for rollout policy, approvals, and audit.

Golden Rules

Node identity belongs to the native rap-node-agent, not to a container.
Containers are packaging and isolation boundaries only.
Capabilities are not permissions.
Role assignment must be explicit per cluster and, when needed, per organization.
Platform admins operate cluster/node trust and placement.
Organization admins manage only organization-visible resources, policies, sessions, and allowed service endpoints.
Organizations must not see intermediate core mesh topology.
Mesh routing must not be implemented before enrollment, identity, and role assignment are trustworthy.
Disconnected minority partitions must not add nodes or change policy.
PostgreSQL remains the source of truth; Redis/live state must be reconstructable.
RDP MVP behavior must remain preserved while platform-core work progresses.
Fabric Core comes before mesh runtime.
Service adapters must not own mesh routing logic.
Nodes must not store or learn more cluster, organization, secret, or route data than their role requires.
Config/storage services distribute scoped snapshots and caches; they are not a second source of truth.
Redis must not store durable topology, durable configuration, node identity, policy, or organization data.
Fabric routing decisions must not depend on live backend availability.
Fabric Storage / Config Storage must not become a general-purpose distributed database.
Version Storage stores signed artifacts and release manifests; it does not decide rollout policy and must not serve unsigned or unapproved artifacts.
Node-agent is the local supervisor for health, restart, update, and rollback of node services, but Control Plane owns rollout policy and durable schema migration orchestration.
HTTP/HTTPS is an external service edge only. Fabric node-to-node transport remains QUIC-only.
A node that accepts connections on TCP 443 does not own management authority. Admin authority belongs to signed roles, scoped claims, policy, and trusted runtime nodes.
Global admin runtime, policy authority, and audit sink must run only on platform-owner controlled nodes. Organization and cluster portals must not expose unrelated tenants, clusters, or internal mesh topology.

Existing Node Management Semantics

The Platform Owner Console must treat a node as a concrete host-level identity, not as an abstract request for any free capacity. A cluster membership always connects a specific node_id to a specific cluster_id.

The same physical node may participate in multiple clusters only through isolated cluster memberships, certificates, tokens, storage namespaces, and policies. Role assignments and desired service workloads are cluster-scoped. This means a node may run entry-node for one cluster, have no ingress role in another cluster, and still report the same underlying host capabilities.

Node capabilities are reported by rap-node-agent as the technical upper bound of what the host can do. Capabilities are not permissions. The Control Plane must expose to each cluster only the roles and service workloads that are explicitly assigned for that cluster and, where relevant, for a specific organization.
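
The rule that capabilities are an upper bound and not a permission can be sketched as a check that needs both a reported capability and an explicit, active, cluster-scoped role assignment. This is a minimal illustration, not the Control Plane implementation: the ROLE_REQUIRES mapping, the RoleAssignment class, and the may_run function are hypothetical names, and only the role and capability strings come from this document.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set

# Hypothetical mapping from a role to the capability it technically requires.
ROLE_REQUIRES = {
    "rdp-worker": "can_run_rdp_worker",
    "vnc-worker": "can_run_vnc_worker",
    "update-cache": "can_run_update_cache",
}


@dataclass(frozen=True)
class RoleAssignment:
    cluster_id: str
    node_id: str
    role: str
    organization_id: Optional[str] = None  # None means cluster-wide (assumption)
    status: str = "active"


def may_run(node_id: str, cluster_id: str, role: str,
            capabilities: Set[str],
            assignments: Iterable[RoleAssignment],
            organization_id: Optional[str] = None) -> bool:
    """Return True only if the node is both capable and explicitly assigned."""
    required = ROLE_REQUIRES.get(role)
    if required is None or required not in capabilities:
        return False  # capability missing: the host cannot do this at all
    for a in assignments:
        if (a.node_id, a.cluster_id, a.role, a.status) != (node_id, cluster_id, role, "active"):
            continue
        # An organization-scoped assignment only authorizes that organization.
        if a.organization_id is None or a.organization_id == organization_id:
            return True
    return False  # capable but not authorized
```

A node that reports can_run_rdp_worker and holds an rdp-worker assignment in one cluster is still rejected for any other cluster, which matches the isolated-membership rule above.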

Current owner-console operations for existing nodes are:

view active-cluster nodes
view all physical nodes visible to the platform owner
inspect cluster membership state per node
organize active-cluster nodes into hierarchical node groups
inspect heartbeat, telemetry, reported services, and desired services
assign cluster-scoped roles
enable or disable cluster-scoped desired service workloads
manage node participation in product-level functions such as Remote Server/Desktop Access, without exposing protocol adapters as separate organization-facing services
enable node/cluster/organization test telemetry flags
disable a node membership in the active cluster
revoke a node identity as a high-risk platform-owner action

Creating a brand-new node is not a direct database insert. The future production flow remains installation of native rap-node-agent on the host, enrollment with a short-lived join token, platform-owner approval, then explicit role and service assignment. This enrollment/create-node UX is intentionally separate from existing-node management.

Node groups are cluster-scoped inventory folders, not node identities and not permissions. A group may contain child groups, and a node membership may be assigned to one group in that cluster. The same physical node may be in a different group, or no group, in another cluster. Groups are intended for large-scale operator usability: data centers, sites, customers, regions, environments, ownership boundaries, or maintenance rings. Authorization still comes from cluster membership, role assignment, organization policy, and service desired state, not from the visual group tree.

Remote Server/Desktop Access is managed as one product service. A node may be allowed to run this access service, while the concrete protocol adapter is chosen from the organization resource (rdp, vnc, ssh). Cluster admins should not configure separate organization-facing RDP/VNC/SSH services; those adapters are internal runtime implementations.

Fabric Core Foundation

Layer order:

Host OS
RAP Fabric Core
Secure Fabric Network
Service Runtime / Service Adapters
Access Clients / Admin UI

Fabric Core responsibilities:

node identity
node enrollment
local node state
capability reporting
role assignment consumption
signed scoped cluster snapshots
update trust
service workload supervision boundary

Fabric Core does not implement service protocols. RDP, VNC, SSH, VPN, video, and file transfer are services above the fabric. They consume node identity, role assignment, local state, routing, and policy from the Fabric Core.

Fabric Core must be trustworthy before mesh runtime traffic exists. Mesh routing, VPN/IP tunnel runtime, relay packet routing, and service workload execution must not be implemented before node identity, enrollment, trust, role assignment, and scoped configuration distribution are solid.

Peer Discovery and Routing Foundation

Nodes must not maintain active connections to all other nodes.

Each node maintains:

active peers
warm candidate peers
cold / bootstrap peers

Recommended defaults:

normal node: 3-5 active peers
relay / core node: 8-20 active peers
thin / mobile node: 1-3 active peers

Peer selection must be score-based, not latency-only.

Score inputs:

latency
packet loss
reliability
region distance
node load
bandwidth availability
role suitability
policy constraints
trust level
recent failure history
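
A score-based selection over a subset of these inputs might look like the sketch below. The weights, the normalization formulas, and the PeerCandidate, score, and select_active_peers names are assumptions made for illustration; the document does not define a scoring formula. Policy is modeled as a hard constraint rather than a weighted term.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PeerCandidate:
    node_id: str
    latency_ms: float
    packet_loss: float      # fraction, 0.0 .. 1.0
    reliability: float      # fraction, 0.0 .. 1.0
    load: float             # fraction, 0.0 .. 1.0
    recent_failures: int
    policy_allowed: bool = True


def score(p: PeerCandidate) -> Optional[float]:
    """Higher is better. None means the peer is excluded by policy."""
    if not p.policy_allowed:
        return None
    latency_term = 1.0 / (1.0 + p.latency_ms / 100.0)
    failure_term = 1.0 / (1.0 + p.recent_failures)
    # Illustrative weights only; latency is one input among several.
    return (0.30 * latency_term
            + 0.25 * (1.0 - p.packet_loss)
            + 0.25 * p.reliability
            + 0.10 * (1.0 - p.load)
            + 0.10 * failure_term)


def select_active_peers(candidates: List[PeerCandidate], limit: int) -> List[str]:
    """Pick the top `limit` eligible peers by score."""
    eligible = [(s, p) for p in candidates if (s := score(p)) is not None]
    eligible.sort(key=lambda sp: sp[0], reverse=True)
    return [p.node_id for _, p in eligible[:limit]]
```

With these weights, a low-latency peer with high packet loss and recent failures ranks below a slower but reliable peer, which is the behavior a latency-only selector would miss.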

Service adapters request a destination node, resource target, egress node, or egress pool. The Fabric Routing Engine chooses the path. RDP/VNC/SSH/VPN/video adapters must not implement topology discovery, multi-hop route selection, shortcut creation, or cross-cluster routing policy.

Fabric routing uses node-local state, signed scoped snapshots, peer cache, and route cache. It must respect policy, organization scope, cluster boundaries, and partition/authority state. Service adapters must not select routes, discover peers, manage mesh connections, implement failover, implement shortcut logic, implement partition recovery, or implement cross-cluster routing policy.

Node-Local State and Scoped Configuration

rap-node-agent local state should include:

node identity
cluster membership
signed scoped cluster snapshot
peer cache
route cache
service assignment cache
local health state
partition / degraded state
last applied config version
pending update metadata

A node must not store full cluster topology unless its role requires it.

Configuration distribution is need-to-know:

core mesh nodes receive neighbor/peer data, route policy, QoS policy, and cluster version
ingress nodes receive allowed client entry policies, token validation config, and route entry data
egress/service nodes receive assigned service configs, needed resource refs, connector refs, and service policy
storage services receive assigned shard/scope data and replication metadata

Secrets are delivered only through approved resolvers and only when required at runtime. RDP credentials, organization user lists, unrelated storage shards, and full topology must not be distributed to nodes that do not need them.
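
The need-to-know rule can be illustrated as an allow-list projection from authoritative state to a per-node snapshot. The role keys, field names, and the scoped_snapshot function below are hypothetical; they only mirror the categories listed above and do not describe an existing snapshot format. Signing is out of scope for this sketch.

```python
from typing import Dict, Iterable

# Hypothetical allow-list: which snapshot sections each role may receive.
ROLE_SCOPE = {
    "core-mesh": {"peers", "route_policy", "qos_policy", "cluster_version"},
    "ingress": {"client_entry_policies", "token_validation_config", "route_entry_data"},
    "service": {"service_configs", "resource_refs", "connector_refs", "service_policy"},
    "storage": {"shard_scope", "replication_metadata"},
}


def scoped_snapshot(full_state: Dict[str, object], roles: Iterable[str]) -> Dict[str, object]:
    """Project full cluster state down to the sections the node's roles need.

    Anything not explicitly allow-listed (secrets, full topology, user lists)
    is dropped, so a new field is private by default.
    """
    allowed = set()
    for role in roles:
        allowed |= ROLE_SCOPE.get(role, set())
    return {key: full_state[key] for key in sorted(allowed) if key in full_state}
```

Using an allow-list rather than a deny-list means that data added to the authoritative state later is not distributed to any node until a role scope names it.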

Fabric Storage / Config Storage Foundation

Fabric Storage Service / Config Storage Service is a future foundation service, not a replacement for PostgreSQL.

Purpose:

store and replicate scoped cluster configuration
distribute signed snapshots
keep frequently used data near services
support local high-speed reads
preserve configuration availability when some nodes disappear

Rules:

not every node stores all data
replication factor is policy-driven
critical data is replicated across failure domains
hot data is placed near services that use it
organization and cluster isolation are mandatory
PostgreSQL remains authoritative
storage/config services are distribution/cache layers only
storage/config services must not accept direct node writes as authoritative state
storage/config services must not expose arbitrary query capabilities
storage/config services must not store full cluster or organization data on every node

Data classes:

platform global data
cluster state
organization data
service config data
realtime / lease state
audit / event data
artifacts / update data

Version Storage and Node Update Foundation

The platform includes a logical Version Storage / Update Repository service.

It stores approved release artifacts and metadata for:

native rap-node-agent
service workloads such as rdp-worker, entry-node, relay-node, file-storage-cache, update-cache, future VNC/SSH/VPN services
OS / architecture-specific builds
migration bundles for data structure changes

Release channels:

stable: known-good version used for rollback
current: normal intended version for the cluster/ring
candidate: version staged for test/canary rollout

Rules:

PostgreSQL is authoritative for rollout policy, channel assignment, approvals, target nodes/rings, and audit.
Version Storage stores immutable signed artifacts and manifests.
update-cache mirrors approved artifacts close to nodes but does not approve or invent versions.
Node-agent downloads from authorized update-cache or Version Storage, verifies signatures and hashes, stages updates, runs health checks, and reports state.
Node-agent may restart workloads and roll back to last stable / last known good when health checks fail.
Node-agent self-update requires stronger safeguards than normal service workloads: A/B slots or equivalent staged replacement, watchdog/recovery, and rollback that survives process crash.
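
The hash half of the verification step can be sketched with the Python standard library. The manifest layout (a dict with a sha256 key) and the artifact_matches_manifest name are assumptions, not the real manifest schema. Signature verification of the manifest itself needs an asymmetric signature library and is deliberately not shown here.

```python
import hashlib
import hmac


def artifact_matches_manifest(artifact: bytes, manifest: dict) -> bool:
    """Check a downloaded artifact against the digest declared in its manifest.

    Assumes the manifest has already passed signature verification; a digest
    taken from an unsigned manifest proves nothing.
    """
    expected = manifest.get("sha256", "")
    if not isinstance(expected, str) or not expected.isascii():
        return False
    actual = hashlib.sha256(artifact).hexdigest()
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(actual, expected.lower())
```

A node-agent following the rules above would refuse to stage an artifact for which this check fails, regardless of which update-cache served it.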

Data structure migrations:

code updates and data structure updates are separate concerns, both declared in the same signed release manifest
migration bundles must declare source version, target version, compatibility, preflight checks, backup/snapshot requirements, rollback behavior, and whether downgrade is possible
Control Plane/PostgreSQL schema migrations are orchestrated by the Control Plane release process
service-local, cache-local, and node-local state migrations may be executed by node-agent only when included in an approved signed manifest
failed migrations must leave the node rolled back, degraded, or fenced, never silently half-updated

Suggested update states:

idle
checking
downloading
verifying
staging
migrating
starting
health_checking
promoting
running
rollback_required
rolling_back
rolled_back
failed
fenced
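
One way to keep a node from ending up silently half-updated is to treat these states as a state machine with an explicit transition table. The state names below are the suggested ones; the transition table itself is an assumption made for illustration, since this document lists states but not the allowed edges between them.

```python
from enum import Enum


class UpdateState(str, Enum):
    IDLE = "idle"
    CHECKING = "checking"
    DOWNLOADING = "downloading"
    VERIFYING = "verifying"
    STAGING = "staging"
    MIGRATING = "migrating"
    STARTING = "starting"
    HEALTH_CHECKING = "health_checking"
    PROMOTING = "promoting"
    RUNNING = "running"
    ROLLBACK_REQUIRED = "rollback_required"
    ROLLING_BACK = "rolling_back"
    ROLLED_BACK = "rolled_back"
    FAILED = "failed"
    FENCED = "fenced"


S = UpdateState
# Hypothetical edges. FENCED has no outgoing edge: leaving it needs an operator.
ALLOWED = {
    S.IDLE: {S.CHECKING},
    S.CHECKING: {S.DOWNLOADING, S.IDLE},
    S.DOWNLOADING: {S.VERIFYING, S.FAILED},
    S.VERIFYING: {S.STAGING, S.FAILED},
    S.STAGING: {S.MIGRATING, S.STARTING, S.FAILED},
    S.MIGRATING: {S.STARTING, S.ROLLBACK_REQUIRED},
    S.STARTING: {S.HEALTH_CHECKING, S.ROLLBACK_REQUIRED},
    S.HEALTH_CHECKING: {S.PROMOTING, S.ROLLBACK_REQUIRED},
    S.PROMOTING: {S.RUNNING, S.ROLLBACK_REQUIRED},
    S.RUNNING: {S.IDLE},
    S.ROLLBACK_REQUIRED: {S.ROLLING_BACK},
    S.ROLLING_BACK: {S.ROLLED_BACK, S.FENCED},
    S.ROLLED_BACK: {S.IDLE},
    S.FAILED: {S.IDLE},
    S.FENCED: set(),
}


def can_advance(current: UpdateState, target: UpdateState) -> bool:
    return target in ALLOWED[current]


def advance(current: UpdateState, target: UpdateState) -> UpdateState:
    """Return the new state, or raise if the transition is not allowed."""
    if not can_advance(current, target):
        raise ValueError(f"illegal update transition: {current.value} -> {target.value}")
    return target
```

In this sketch an update cannot jump from downloading to running, and any failure after staging has to pass through rollback_required before the node is considered rolled back or fenced.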

Target Data Model

The following entities should become the explicit foundation. Names are target names and may be adjusted during implementation if the existing schema requires compatibility.

Cluster

clusters

Purpose: First-class cluster records.

Key fields:

id
name
slug
status
region
metadata
created_at
updated_at

Cluster Membership

cluster_memberships

Purpose: Links nodes to clusters with membership state.

Key fields:

cluster_id
node_id
membership_status
joined_at
last_seen_at
metadata
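
The relationship between clusters and cluster_memberships can be sketched as a schema in which a membership is keyed by the (cluster_id, node_id) pair. The example uses Python's sqlite3 module purely as a runnable stand-in: PostgreSQL remains the authoritative store, the column types are assumptions, and the metadata and timestamp columns of clusters are omitted for brevity. The open_db, add_cluster, and add_membership helpers are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE clusters (
    id     TEXT PRIMARY KEY,
    name   TEXT NOT NULL,
    slug   TEXT NOT NULL UNIQUE,
    status TEXT NOT NULL,
    region TEXT
);
CREATE TABLE cluster_memberships (
    cluster_id        TEXT NOT NULL REFERENCES clusters(id),
    node_id           TEXT NOT NULL,
    membership_status TEXT NOT NULL,
    joined_at         TEXT,
    last_seen_at      TEXT,
    PRIMARY KEY (cluster_id, node_id)  -- one membership per node per cluster
);
"""


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    db.executescript(SCHEMA)
    return db


def add_cluster(db: sqlite3.Connection, cluster_id: str, name: str, slug: str) -> None:
    with db:
        db.execute(
            "INSERT INTO clusters (id, name, slug, status) VALUES (?, ?, ?, 'active')",
            (cluster_id, name, slug),
        )


def add_membership(db: sqlite3.Connection, cluster_id: str, node_id: str) -> bool:
    """Insert a membership; return False on a duplicate or unknown cluster."""
    try:
        with db:
            db.execute(
                "INSERT INTO cluster_memberships (cluster_id, node_id, membership_status) "
                "VALUES (?, ?, 'active')",
                (cluster_id, node_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The composite key lets the same node_id appear once in each cluster while preventing a second membership row for the same pair.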

Node

nodes

Purpose: Host-level identity record.

Existing nodes should be preserved and migrated safely. A default cluster backfill is acceptable for existing rows.

Target additions or refinements:

explicit cluster membership through cluster_memberships
enrollment status separated from runtime health
identity/certificate state
ownership and organization association only where applicable

Node Identity

node_identities

Purpose: Stores node public identity material and trust state.

Key fields:

node_id
public_key
certificate_serial
certificate_not_before
certificate_not_after
identity_status
rotated_at
revoked_at

Private keys must never be stored in the control plane.

Node Join Token

node_join_tokens

Purpose: Short-lived token created by platform admin to allow enrollment into a cluster.

Key fields:

id
cluster_id
token_hash
scope
expires_at
max_uses
used_count
created_by_user_id
revoked_at

Raw join tokens must not be stored.
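
A join token that is stored only as a hash and validated for scope, expiry, and usage limit could be handled as in the sketch below. The JoinToken class, the issue_token and redeem functions, the 15-minute default lifetime, and the choice of unsalted SHA-256 over a high-entropy random token are all assumptions for illustration; the token scope is reduced here to the cluster id.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass
class JoinToken:
    cluster_id: str
    token_hash: str            # only the hash is persisted, never the raw token
    expires_at: datetime
    max_uses: int
    used_count: int = 0
    revoked_at: Optional[datetime] = None


def _digest(raw: str) -> str:
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def issue_token(cluster_id: str, ttl: timedelta = timedelta(minutes=15),
                max_uses: int = 1, now: Optional[datetime] = None) -> Tuple[str, JoinToken]:
    """Return the raw token (shown once to the admin) and the stored record."""
    now = now or datetime.now(timezone.utc)
    raw = secrets.token_urlsafe(32)
    return raw, JoinToken(cluster_id, _digest(raw), now + ttl, max_uses)


def redeem(record: JoinToken, raw: str, cluster_id: str,
           now: Optional[datetime] = None) -> bool:
    """Validate hash, scope, revocation, expiry, and usage limit, then count the use."""
    now = now or datetime.now(timezone.utc)
    if not hmac.compare_digest(_digest(raw), record.token_hash):
        return False
    if record.cluster_id != cluster_id or record.revoked_at is not None:
        return False
    if now >= record.expires_at or record.used_count >= record.max_uses:
        return False
    record.used_count += 1
    return True
```

A successful redeem in this sketch would lead only to a pending join request, not to an enrolled node; approval remains a separate step.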

Node Join Request

node_join_requests

Purpose: Approval workflow for node enrollment.

Key fields:

id
cluster_id
node_name
node_fingerprint
public_key
reported_capabilities
reported_facts
requested_roles
status
reviewed_by_user_id
reviewed_at
approved_node_id
rejection_reason

Default behavior should be manual approval.

Node Capability

node_capabilities

Purpose: Technical capabilities reported by node-agent and verified by platform policy.

Capability examples:

can_accept_client_ingress
can_accept_node_ingress
can_route_mesh
can_egress_internet
can_egress_private_network
can_run_rdp_worker
can_run_vnc_worker
can_run_vpn_exit
can_run_vpn_connector
can_run_file_cache
can_run_update_cache
can_run_video_relay

Node Role Assignment

node_role_assignments

Purpose: Explicit authorization for a node to execute roles in a cluster and optionally for specific organizations.

Key fields:

cluster_id
node_id
role
organization_id
status
assigned_by_user_id
assigned_at
policy

Examples:

entry-node
relay-node
rdp-worker
vnc-worker
vpn-exit
vpn-connector
file-storage-cache
update-cache
video-relay

Node Service Instance

node_service_instances

Purpose: Tracks running service workloads supervised by node-agent.

Key fields:

cluster_id
node_id
service_type
service_instance_id
desired_state
reported_state
version
last_heartbeat_at
metadata

Node Heartbeat

node_heartbeats

Purpose: Durable heartbeat observations and latest state snapshots.

This can be split into an append-only heartbeat log and a compact latest-state table if needed.

Cluster Audit Event

cluster_audit_events

Purpose: Audit cluster, node, enrollment, role, trust, and service-placement changes.

Events should include:

join token created/revoked
join request received/approved/rejected
node certificate issued/rotated/revoked
role assigned/removed
service desired state changed
node health changed
cluster partition detected/resolved

Future Mesh Entities

These are not Stage C1 implementation requirements, but the data model should not block them:

mesh_links
mesh_routes
cluster_route_intents
cluster_partitions
cluster_authority_terms

Node Enrollment Flow

Default enrollment must be explicit and reviewable.

Platform admin chooses or creates a cluster.
Platform admin creates a short-lived join token scoped to that cluster.
Operator installs native rap-node-agent on the host.
Node-agent generates a local keypair.
Node-agent sends a join request with token, public key, node facts, and capabilities.
Control plane validates token hash, expiry, scope, and usage limit.
Control plane creates node_join_request in pending state.
Platform admin reviews node facts and approves or rejects the request.
On approval, control plane creates or activates the node identity.
Control plane issues cluster-scoped node certificate or certificate metadata.
Node-agent stores identity material locally.
Node-agent starts heartbeats.
Platform admin assigns roles.
Node-agent starts only authorized service workloads.

Auto-approval may exist later only for tightly scoped platform-managed bootstrap tokens. It must not be the default.
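
The review step of this flow can be sketched as a one-way transition out of the pending state that only a platform admin may trigger. The JoinRequest fields follow node_join_requests; the review function, its boolean return convention, and the use of a random UUID for approved_node_id are assumptions.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinRequest:
    cluster_id: str
    node_name: str
    public_key: str
    status: str = "pending"
    reviewed_by_user_id: Optional[str] = None
    approved_node_id: Optional[str] = None
    rejection_reason: Optional[str] = None


def review(req: JoinRequest, reviewer_id: str, is_platform_admin: bool,
           approve: bool, reason: Optional[str] = None) -> bool:
    """Apply a manual review decision. Returns False if nothing was changed."""
    if not is_platform_admin:
        return False          # only platform admins approve or reject
    if req.status != "pending":
        return False          # a decision is final; no re-review in this sketch
    req.reviewed_by_user_id = reviewer_id
    if approve:
        req.status = "approved"
        req.approved_node_id = str(uuid.uuid4())  # hypothetical id scheme
    else:
        req.status = "rejected"
        req.rejection_reason = reason or "unspecified"
    return True
```

Role assignment is intentionally absent from this function: approval activates an identity, and roles are granted afterwards as a separate audited action.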

Node Runtime Expectations

rap-node-agent should run natively on the host.

The agent owns:

host identity
node certificates
enrollment
heartbeat
desired role polling
service workload supervision
update trust and rollback
local recovery

Containerized workloads are preferred for:

rdp-worker
vnc-worker
relay-node
entry-node
file-storage-cache
update-cache
video-relay

Native workloads are preferred for:

vpn-exit
vpn-connector
host route manager
firewall/QoS manager
Windows virtual adapter service
Android VpnService client

Realtime container workloads should avoid Docker bridge/NAT hot paths where low latency matters. Prefer host networking or native mode for latency-critical paths.

Privileged containers are discouraged. If a workload needs NET_ADMIN, /dev/net/tun, host firewall control, or broad host privileges, native mode should be preferred unless explicitly approved.

Admin Console Model

Admin UI is served through Web Ingress, but cluster configuration belongs to the Control Plane.

The Web/Admin boundary is defined in docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md.

Rules:

Web Ingress is an HTTP/HTTPS entry and presentation layer only
Web Ingress must not own cluster state, node trust, policy, or secrets
Admin pages may be dynamically composed from safe ui_manifest / page definitions
dynamic page definitions must be schema-driven and must not contain internal topology, peer caches, route caches, secrets, raw credentials, or arbitrary executable code
every cluster mutation still goes through Control Plane authorization, PostgreSQL source-of-truth mutation, and audit
Fabric Storage / Config Storage role does not imply admin/web ingress role
adding a storage node to a cluster must not move or create the cluster panel automatically

Admin Endpoint Placement

Admin UI has three distinct scopes:

Platform Owner Console: global platform-owner scope. It may aggregate visibility across clusters according to platform policy and audit.
Cluster Admin Endpoint: cluster-local admin/web ingress service for one cluster. It is served only by nodes explicitly assigned an approved admin/web ingress role.
Organization Admin Panel: tenant-safe projection over allowed organization resources, endpoints, policies, sessions, and safe status.

Storage nodes are configuration distribution/cache nodes. They are not browser admin entry points and they do not own platform or cluster administration.

A cluster-local admin endpoint may become available only after:

the cluster has healthy authority state
scoped configuration snapshots are signed and current
admin/web ingress role is explicitly assigned to authorized node(s)
TLS/certificate policy is valid
node health and heartbeat are current
required role coverage exists for the intended cluster operations
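
These preconditions can be read as a conjunction that must hold in full before the endpoint opens. The sketch below models them as boolean fields and reports which ones are unmet; the AdminEndpointReadiness class and its field names are hypothetical, and how each flag is computed is out of scope.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class AdminEndpointReadiness:
    authority_state_healthy: bool
    snapshots_signed_and_current: bool
    admin_ingress_role_assigned: bool
    tls_policy_valid: bool
    node_health_current: bool
    role_coverage_sufficient: bool


def unmet_conditions(readiness: AdminEndpointReadiness) -> List[str]:
    """List the preconditions that currently block the endpoint."""
    return [name for name, ok in asdict(readiness).items() if not ok]


def endpoint_may_open(readiness: AdminEndpointReadiness) -> bool:
    return not unmet_conditions(readiness)
```

Returning the list of unmet conditions, rather than a bare boolean, gives the owner console something concrete to display while a new cluster is still being seeded.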

During cluster split/seed workflows, a temporary node may participate in more than one cluster only through isolated memberships, identities, certificates, tokens, storage namespaces, and policies. Removing that seed node from the new cluster is safe only after the new cluster has its own healthy node set, snapshots, role coverage, and admin ingress.

Platform Admin Console MVP

The first admin panel should be for platform owner/admin only.

Required pages:

Overview
Clusters
Nodes
Join Requests
Node Detail
Role Assignments
Service Workloads
Health and Alerts
Audit
Trust and Certificates

Overview should show:

cluster count
node count
nodes by health
pending join requests
role coverage gaps
service workload health
risky states

Cluster page should show:

cluster status
node membership
enabled services
health state
partition warnings
role coverage

Node page should show:

identity
membership
health
capabilities
assigned roles
running services
last heartbeat
version
update state
audit timeline

Join Requests page should allow:

review node facts
compare requested capabilities
approve
reject
revoke token

Role Assignments page should allow:

assign role to node
restrict role to organization if applicable
disable role
audit role changes

Multi-Cluster Admin

Multi-cluster administration should be platform-owner only.

It should support:

list clusters
compare cluster health
inspect node distribution
inspect service placement
inspect cluster partitions
inspect capacity and load
detect missing role coverage
manage cluster-scoped trust

Important: No organization should see the full multi-cluster topology.

Multi-cluster trust boundaries:

platform may operate multiple clusters
clusters do not automatically trust each other
clusters do not form one shared mesh by default
cross-cluster routing requires explicit trust and policy
a node may participate in multiple clusters only through isolated memberships
cluster-scoped identities, certificates, tokens, storage namespaces, and policies are required
platform owner may aggregate visibility across clusters through audited control-plane views
organization admins see only authorized clusters, resources, services, and safe status

Platform Admin Access Security

Platform owner/admin access must be risk-based.

Trusted or accredited devices may receive lower-friction access, but unknown devices require stronger checks.

Required controls:

MFA / 2FA policy
device trust state
session risk controls
step-up authentication for high-risk actions
audit for all high-risk actions

High-risk actions:

cluster trust changes
node approval
role assignment
partition promotion
cross-cluster trust
secrets access
update policy changes
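
A risk-based decision over these controls might be sketched as follows. The action identifiers are hypothetical renderings of the list above, and the specific rules (MFA is always required, high-risk actions and untrusted devices need a step-up no older than five minutes) are assumptions chosen to make the example concrete, not policy defined by this document.

```python
from dataclasses import dataclass
from typing import Optional

HIGH_RISK_ACTIONS = frozenset({
    "cluster_trust_change", "node_approval", "role_assignment",
    "partition_promotion", "cross_cluster_trust", "secrets_access",
    "update_policy_change",
})


@dataclass(frozen=True)
class AdminSession:
    user_id: str
    mfa_verified: bool
    device_trusted: bool
    last_step_up_at: Optional[float] = None  # seconds, same clock as `now`


def decide(action: str, session: AdminSession, now: float,
           step_up_ttl: float = 300.0) -> str:
    """Return 'allow', 'step_up_required', or 'deny'."""
    if not session.mfa_verified:
        return "deny"
    needs_step_up = action in HIGH_RISK_ACTIONS or not session.device_trusted
    if not needs_step_up:
        return "allow"
    fresh = (session.last_step_up_at is not None
             and 0 <= now - session.last_step_up_at <= step_up_ttl)
    return "allow" if fresh else "step_up_required"
```

Every decision, including allowed ones for high-risk actions, would still be written to the audit trail; that part is not shown.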

Future Organization Admin Panel

Organization admin UI should come later.

It should show only safe tenant-scoped objects:

organization resources
organization users
organization policies
active sessions
allowed entry points
allowed egress/service endpoints
safe VPN/connector status
audit events for that organization

It must not show:

full mesh topology
other organizations
internal core routing state
platform-level node trust material
unrelated cluster internals

Service Visibility Rules

Organizations see only what policy allows:

resources
enabled services
allowed ingress endpoints
allowed egress/service endpoints
safe status

Organizations do not see:

intermediate core mesh nodes
platform-owned private topology
other tenants
cluster-wide trust configuration

Different services may expose different ingress and egress points.
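
The visibility rules can be illustrated with egress pools: an organization receives the pool name and a safe status, never the backing nodes. The input dictionary layout, the two status strings, and the organization_view function are assumptions for this sketch.

```python
from typing import Dict, List


def organization_view(pools: List[dict], organization_id: str) -> List[Dict[str, str]]:
    """Project internal egress pools to a tenant-safe list.

    Only pools the organization is allowed to use are returned, and each
    entry carries a name and a coarse status. Backing node ids, counts, and
    per-node health are never copied into the output.
    """
    view = []
    for pool in pools:
        if organization_id not in pool["allowed_organizations"]:
            continue
        healthy = any(node["healthy"] for node in pool["backing_nodes"])
        view.append({
            "name": pool["name"],
            "status": "available" if healthy else "unavailable",
        })
    return view
```

Because the output is built field by field rather than by deleting keys from the internal record, topology details added to the internal model later do not leak by default.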

Partition and Split-Brain Behavior

Disconnected minority segments must not:

approve join requests
issue node certificates
assign roles
change cluster policy
rotate trust roots

Disconnected segments may continue already-running services only when safe and policy allows it.

Only platform owner authority should promote, merge, split, or recover partitions.

This document does not define final consensus mechanics. It only establishes that cluster authority and partition behavior must be modeled before production mesh changes.
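
Purely to illustrate the shape of such a guard, the sketch below blocks the authority actions listed above when a segment cannot see a strict majority of voters. The majority rule, the notion of "voters", and the may_perform function are assumptions; the final consensus mechanics are explicitly undecided.

```python
AUTHORITY_ACTIONS = frozenset({
    "approve_join_request", "issue_node_certificate", "assign_role",
    "change_cluster_policy", "rotate_trust_root",
})


def holds_majority(reachable_voters: int, total_voters: int) -> bool:
    """Strict majority: an even split gives authority to neither side."""
    return total_voters > 0 and reachable_voters * 2 > total_voters


def may_perform(action: str, reachable_voters: int, total_voters: int,
                policy_allows_degraded: bool = False) -> bool:
    if holds_majority(reachable_voters, total_voters):
        return True
    if action in AUTHORITY_ACTIONS:
        return False  # a minority segment never changes trust, roles, or policy
    # Non-authority work (for example, serving an existing session) continues
    # in a minority segment only when policy explicitly allows it.
    return policy_allows_degraded
```

With an even split neither side passes the check, so recovery falls back to the platform owner authority described above.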

Implementation Stages

Stage C1: Backend Cluster and Node Model Foundation

Status: implemented and verified. Report: artifacts/c1-cluster-node-foundation-report.md.

Goal: Create the explicit backend data model and service boundaries for clusters, join requests, role assignments, node identity, heartbeats, and audit.

Allowed:

migrations
repositories
service interfaces
platform-admin-only API skeletons
tests
docs

Not allowed:

mesh runtime
VPN runtime
node-agent runtime rewrite
admin UI implementation
RDP behavior changes

Stage C2: Node Enrollment API

Status: implemented and verified. Report: artifacts/c2-node-enrollment-hardening-report.md.

Goal: Implement production enrollment semantics.

Includes:

join token creation
join request creation
approval/rejection
certificate metadata boundary
node activation/revocation
audit

Stage C3: Native Node-Agent MVP

Status: implemented and verified. Report: artifacts/c3-rap-node-agent-mvp-report.md.

Goal: Create native rap-node-agent MVP.

Includes:

enroll
heartbeat
capability report
desired role polling
service status report

No mesh packet routing yet.

Stage C4: Platform Admin Console MVP

Status: implemented and build-verified. Report: artifacts/c4-platform-admin-console-report.md.

Goal: Admin UI for platform operators.

Includes:

clusters
nodes
join requests
roles
service health
audit

Stage C5: Service Workload Supervision Contract

Status: implemented and verified. Report: artifacts/c5-service-workload-supervision-contract-report.md.

Goal: Node-agent can start, stop, and monitor service workloads based on role assignment.

C19A adds the first bounded live service-supervision runtime proof on top of that contract: node-agent can read node-scoped desired workloads without an operator actor id, report built-in core-mesh and mesh-listener as running, report native built-in synthetic.echo as running, and keep unsupported production workloads degraded instead of pretending that their adapters exist. The live smoke is scripts/fabric/c19a-service-workload-supervision-smoke.ps1.

C19B adds the Remote Workspace/RDP adapter-contract bridge without enabling RDP payload traffic. A native rdp-worker desired workload with adapter_contract_probe=true reports the remote-workspace channel map, requires Fabric Service Channel, and marks backend relay as not steady-state. The live smoke is scripts/fabric/c19b-remote-workspace-adapter-contract-smoke.ps1.

C19C wires Remote Workspace into service-channel lease issuance without starting RDP traffic: route intents now accept remote_workspace, the lease entry descriptor uses remote-workspace stream paths and frame-batch media type instead of VPN packet paths, and the signed data-plane contract is present in lease, authority payload, introspection, and lease maintenance. The live smoke is scripts/fabric/c19c-remote-workspace-service-channel-lease-smoke.ps1.

C19D adds the Remote Workspace entry-node ingress skeleton. The node-agent accepts a signed/introspected remote_workspace service-channel lease on remote-workspaces/{resource_id}/streams/{channel_class}, validates service class, channel class, selected entry node, and data-plane flow isolation, and reports access telemetry. It intentionally returns a probe contract with payload_flow=validated_only for empty control probes; non-empty RDP payloads are rejected with a "probe_only required" error. This stage proves the Fabric ingress contract without forwarding desktop frames yet. The live smoke is scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1.

C19E adds the first Remote Workspace frame-batch contract probe across the adapter/entry boundary. The rdp-worker adapter probe reports rap.remote_workspace_frame_batch.v1; entry-node accepts only probe_only=true frame batches, validates logical adapter channels and directions, and returns payload_flow=validated_probe_only. Real desktop frame delivery remains intentionally disabled until the service adapter runtime stage. The live smoke is scripts/fabric/c19e-remote-workspace-frame-batch-contract-smoke.ps1.

C19F adds the first local adapter-sink proof for that frame-batch contract. Node-agent now keeps an in-memory node_agent_rdp_worker_contract_probe sink for Remote Workspace frame probes and preserves it across mesh config refresh. Entry-node delivers validated probe_only=true frame batches to that sink and returns a rap.remote_workspace_frame_batch_delivery.v1 receipt with payload_flow=delivered_probe_only. This still does not enable production RDP frame forwarding. The live smoke is scripts/fabric/c19f-remote-workspace-adapter-sink-smoke.ps1.

C19G exposes the adapter-sink delivery proof through existing node-agent visibility channels. The rdp-worker workload status payload now includes remote_workspace_adapter_sink, and node telemetry includes remote_workspace_adapter_sink_report, both carrying delivery count, latest delivery sequence, channel class, frame count, and the probe-only/no-payload boundary. The live smoke is scripts/fabric/c19g-remote-workspace-adapter-sink-telemetry-smoke.ps1.

C19H locks down the Remote Workspace frame-batch guardrails before real adapter runtime work begins. Unit and live smoke coverage now proves that entry-node rejects probe_only=false, unknown logical channels, invalid channel directions, service-class mismatch, channel-class mismatch, and unsupported payload encoding, and that rejected batches do not produce adapter delivery. The live smoke is scripts/fabric/c19h-remote-workspace-frame-guardrails-smoke.ps1.
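
The guardrail checks listed for C19H can be sketched as a pure validation function. This is a minimal sketch, not the node-agent implementation: the FrameBatch fields, the channel map, the encoding name, and the rejection reason strings are hypothetical stand-ins for whatever the real frame-batch contract defines.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical channel map: logical channel -> the only direction it may flow.
CHANNEL_DIRECTIONS = {"display": "worker_to_client", "input": "client_to_worker"}
SUPPORTED_ENCODINGS = {"none"}  # hypothetical; probe batches carry no payload


@dataclass(frozen=True)
class FrameBatch:
    service_class: str
    channel_class: str
    logical_channel: str
    direction: str
    payload_encoding: str
    probe_only: bool


def validate_frame_batch(
    batch: FrameBatch, lease_service_class: str, lease_channel_class: str
) -> Optional[str]:
    """Return a rejection reason, or None when the probe batch is acceptable."""
    if batch.probe_only is not True:
        return "probe_only_required"
    if batch.service_class != lease_service_class:
        return "service_class_mismatch"
    if batch.channel_class != lease_channel_class:
        return "channel_class_mismatch"
    if batch.logical_channel not in CHANNEL_DIRECTIONS:
        return "unknown_logical_channel"
    if batch.direction != CHANNEL_DIRECTIONS[batch.logical_channel]:
        return "invalid_channel_direction"
    if batch.payload_encoding not in SUPPORTED_ENCODINGS:
        return "unsupported_payload_encoding"
    return None


probe = FrameBatch("remote_workspace", "interactive", "display",
                   "worker_to_client", "none", True)
print(validate_frame_batch(probe, "remote_workspace", "interactive"))
print(validate_frame_batch(replace(probe, probe_only=False),
                           "remote_workspace", "interactive"))
```

Because the function returns before any sink is touched, a rejected batch cannot produce an adapter delivery in this sketch.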

C19I adds the first bounded adapter handoff queue/ack proof for the same probe-only path. The local node_agent_rdp_worker_contract_probe sink reports queue capacity/depth plus accepted, dropped, and acked frame counts: with capacity 8, droppable display overflow accepts/acks 8 frames and drops 3, while reliable input overflow is rejected with backpressure and no delivery receipt. The boundary still carries payload_traffic=none; this is queue semantics for the future adapter runtime, not real RDP payload forwarding. The live smoke is scripts/fabric/c19i-remote-workspace-adapter-queue-smoke.ps1.
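
The overflow arithmetic in C19I (capacity 8, 11 droppable frames offered, 8 accepted and acked, 3 dropped) can be illustrated with a small counter-only sink. This is a sketch under stated assumptions: the class name, method name, and receipt fields are hypothetical, and the sink is assumed to ack accepted frames immediately so each batch is measured against an empty queue.

```python
class ProbeQueueSink:
    """Probe-only sink: counts frames, never stores payload bytes."""

    def __init__(self, capacity: int = 8) -> None:
        self.capacity = capacity
        self.accepted_total = 0
        self.dropped_total = 0
        self.acked_total = 0
        self.backpressure_count = 0

    def offer(self, frame_count: int, droppable: bool) -> dict:
        overflow = max(0, frame_count - self.capacity)
        if overflow and not droppable:
            # Reliable traffic is never silently dropped: reject the whole
            # batch and signal backpressure instead of issuing a receipt.
            self.backpressure_count += 1
            return {"delivered": False, "reason": "backpressure",
                    "payload_traffic": "none"}
        accepted = frame_count - overflow
        self.accepted_total += accepted
        self.dropped_total += overflow
        self.acked_total += accepted  # assumption: probe sink acks at once
        return {"delivered": True, "accepted": accepted, "dropped": overflow,
                "acked": accepted, "payload_traffic": "none"}


sink = ProbeQueueSink(capacity=8)
print(sink.offer(11, droppable=True))   # display-style droppable overflow
print(sink.offer(11, droppable=False))  # input-style reliable overflow
```

The design point is that the two traffic classes fail differently: droppable display frames degrade by loss, while reliable input frames degrade by rejection so the sender can retry.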

C19J makes those queue/backpressure signals operationally visible. The remote_workspace_adapter_sink workload status payload and remote_workspace_adapter_sink_report telemetry now include current queue capacity/depth, cumulative accepted/dropped/acked frame counters, backpressure_count, and the latest rejected batch metadata/reason. The live smoke first produces the C19I droppable overflow plus reliable backpressure, then waits until both workload status and telemetry show the delivery, dropped total, and backpressure increment. The live smoke is scripts/fabric/c19j-remote-workspace-adapter-queue-telemetry-smoke.ps1.

C19K introduces the probe-only adapter session boundary. Entry-node derives a stable adapter_session_id from the service-channel lease/resource/route context and passes it to the local rdp-worker adapter probe sink. Delivery receipts, workload status, and telemetry now include adapter_session_id, adapter_runtime_id=node_agent_rdp_worker_contract_probe, and session_state=probe_bound, and rejected/backpressured batches retain the same session identity. This is still not real RDP payload forwarding; it binds the queue/ack/backpressure model to the future per-session adapter runtime. The live smoke is scripts/fabric/c19k-remote-workspace-adapter-session-boundary-smoke.ps1.
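
One way to obtain a stable adapter_session_id from lease/resource/route context is a deterministic hash over those identifiers. The sketch below is an assumption for illustration only: the source does not say how the id is derived, and the function name, input names, separator, and "rwas-" prefix are hypothetical.

```python
import hashlib


def derive_adapter_session_id(lease_id: str, resource_id: str, route_id: str) -> str:
    # Join with a separator that cannot appear in ordinary identifiers so
    # ("ab", "c") and ("a", "bc") never hash to the same material.
    material = "\x1f".join((lease_id, resource_id, route_id)).encode("utf-8")
    return "rwas-" + hashlib.sha256(material).hexdigest()[:32]


first = derive_adapter_session_id("lease-1", "resource-9", "route-3")
again = derive_adapter_session_id("lease-1", "resource-9", "route-3")
print(first == again)  # same context always yields the same session id
```

A deterministic derivation like this would let accepted, rejected, and backpressured batches for the same lease context report the same session identity without a lookup table.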

C19L adds the first lifecycle model to that probe-only adapter session. The node-agent sink now tracks active sessions in memory with created/bound totals, last activity timestamps, per-session delivery/backpressure/frame counters, current_session_lifecycle_state, and idle expiry counters. A successful droppable overflow binds the session as probe_bound; a reliable overflow keeps the same adapter_session_id and moves the lifecycle state to backpressure for diagnosis. Receipts expose session created/bound/last-activity timestamps and per-session counters while payload_traffic=none remains enforced. The live smoke is scripts/fabric/c19l-remote-workspace-adapter-session-lifecycle-smoke.ps1.

C19M adds explicit probe-only adapter-session control. Node-agent exposes POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control with close, expire, and reset actions, returning rap.remote_workspace_adapter_session_control.v1. Workload status and telemetry now include session_control_total, session_closed_total, session_reset_total, and the latest control action/session/state, so sessions can be ended deliberately instead of only by idle TTL. The live smoke creates a Remote Workspace adapter session, closes it through the mesh control endpoint, and waits until workload status and telemetry expose the close. The live smoke is scripts/fabric/c19m-remote-workspace-adapter-session-control-smoke.ps1.

C19N locks down the adapter-session control guardrails. Control requests now reject unsupported actions, invalid adapter_session_id values, malformed JSON, unknown active/terminal sessions, and overlong reasons without creating hidden session state. Repeating close against an already closed terminal session is idempotent: it reports previous_state=closed and does not increment session_closed_total again, while still counting the control observation. The live smoke verifies the negative cases plus first/repeated close visibility in workload status and telemetry. The live smoke is scripts/fabric/c19n-remote-workspace-adapter-session-control-guardrails-smoke.ps1.
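
The idempotent-close rule can be sketched as follows. This is a minimal in-memory sketch, not the node-agent handler: the class name, error strings, reason limit, and the "changed" field are hypothetical, and it assumes rejected requests are not counted as control observations and that a session already in a terminal state is never moved to another one.

```python
class AdapterSessionControl:
    # Action -> terminal state; the three actions come from the text above.
    ACTIONS = {"close": "closed", "expire": "expired", "reset": "reset"}
    MAX_REASON_LENGTH = 256  # hypothetical limit

    def __init__(self) -> None:
        self.states = {}  # adapter_session_id -> lifecycle state
        self.session_control_total = 0
        self.session_closed_total = 0

    def control(self, adapter_session_id: str, action: str, reason: str = "") -> dict:
        # Every rejection returns before any state is read or written,
        # so invalid requests cannot create hidden session entries.
        if action not in self.ACTIONS:
            return {"ok": False, "error": "unsupported_action"}
        if len(reason) > self.MAX_REASON_LENGTH:
            return {"ok": False, "error": "reason_too_long"}
        if adapter_session_id not in self.states:
            return {"ok": False, "error": "unknown_session"}

        previous = self.states[adapter_session_id]
        self.session_control_total += 1  # the observation is always counted
        if previous in self.ACTIONS.values():
            # Already terminal: report it, change nothing, count no new close.
            return {"ok": True, "previous_state": previous,
                    "state": previous, "changed": False}

        self.states[adapter_session_id] = self.ACTIONS[action]
        if action == "close":
            self.session_closed_total += 1
        return {"ok": True, "previous_state": previous,
                "state": self.ACTIONS[action], "changed": True}


ctl = AdapterSessionControl()
ctl.states["session-1"] = "probe_bound"
print(ctl.control("session-1", "close"))
print(ctl.control("session-1", "close"))  # repeat: previous_state is closed
```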

C19O adds an immediate read-only adapter-session snapshot endpoint: GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N. It returns rap.remote_workspace_adapter_session_snapshot.v1 with active sessions, terminal sessions when requested, per-session lifecycle state, activity/backpressure timestamps, frame counters, and runtime identity. This lets operators inspect adapter-session state directly from node-agent without waiting for heartbeat, workload status, or telemetry propagation. The live smoke checks active-session visibility, close transition into terminal snapshot, and invalid snapshot limit rejection. The live smoke is scripts/fabric/c19o-remote-workspace-adapter-session-snapshot-smoke.ps1.

C19P adds the first adapter-runtime handoff mailbox contract. Each active probe-only adapter session now owns a bounded in-memory mailbox that receives frame_batch_probe_delivered and backpressure events with frame counts, channel/resource/route context, and sequence numbers. Node-agent exposes GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox with optional drain=true, and session snapshots/workload reports expose mailbox depth/enqueued/drained/dropped counters. This is the handoff surface a real rdp-worker runtime can consume next; payload forwarding is still disabled. The live smoke verifies read, drain, post-drain empty state, and snapshot counters. The live smoke is scripts/fabric/c19p-remote-workspace-adapter-runtime-mailbox-smoke.ps1.

C19Q hardens the mailbox handoff. Invalid IDs, unknown sessions, and invalid limits are rejected before state mutation, and bounded drain=true&limit=N reads remove only the returned event slice while preserving remaining depth for the next poll. The bounded mailbox drops oldest events once capacity is reached, and a closed adapter session no longer exposes an active runtime mailbox. The live smoke verifies negative cases, drop-oldest pressure, partial drain, and closed-session rejection. The live smoke is scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1.
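
The drop-oldest and partial-drain behavior can be sketched with a bounded deque. This is an illustrative sketch only: the class, method, and counter names are hypothetical, and the event shape is reduced to a sequence number and a kind.

```python
from collections import deque
from itertools import islice


class ProbeMailbox:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.events = deque()
        self.next_sequence = 1
        self.enqueued_total = 0
        self.drained_total = 0
        self.dropped_total = 0

    def enqueue(self, kind: str) -> None:
        if len(self.events) >= self.capacity:
            self.events.popleft()  # drop the oldest event to make room
            self.dropped_total += 1
        self.events.append({"sequence": self.next_sequence, "kind": kind})
        self.next_sequence += 1
        self.enqueued_total += 1

    def read(self, limit: int, drain: bool = False) -> list:
        if limit < 1:
            # Reject before touching state, as the guardrail requires.
            raise ValueError("limit must be a positive integer")
        batch = list(islice(self.events, limit))
        if drain:
            # Remove only the slice that was returned; the rest stays queued.
            for _ in batch:
                self.events.popleft()
            self.drained_total += len(batch)
        return batch


box = ProbeMailbox(capacity=3)
for _ in range(5):
    box.enqueue("frame_batch_probe_delivered")
print([e["sequence"] for e in box.read(limit=2, drain=True)])
print(len(box.events), box.dropped_total)
```

With capacity 3 and five events enqueued, sequences 1 and 2 are dropped, the drain returns sequences 3 and 4, and one event remains for the next poll.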

C19R adds bounded long-poll ergonomics to the same node-local mailbox endpoint. wait_ms lets an adapter runtime wait briefly for the next event without hot polling, and responses make empty/timeout state explicit with empty, waited, wait_timeout, and wait_ms. The live smoke proves empty timeout and wake-on-delayed-event behavior while keeping the path probe-only. The live smoke is scripts/fabric/c19r-remote-workspace-mailbox-long-poll-smoke.ps1.

C19S makes mailbox consumer behavior visible in diagnostics. Workload status and node telemetry now expose mailbox_read_total, mailbox_wait_total, mailbox_wait_timeout_total, mailbox_empty_read_total, and last mailbox read metadata; active session snapshots carry the same per-session counters while a session remains active. The live smoke proves C19R traffic is reflected in both workload status and telemetry. The live smoke is scripts/fabric/c19s-remote-workspace-mailbox-telemetry-smoke.ps1.

C19T adds the node-local consumer cursor contract for that mailbox. Consumers can pass consumer_id plus optional ack_sequence to receive explicit checkpoint, ack, lag, read, and ack counters without draining mailbox state. The probe sink stores bounded per-session consumer state and reports aggregate and current-session consumer telemetry through workload status and heartbeat telemetry. The live smoke is scripts/fabric/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke.ps1.

C19U adds lifecycle visibility and reset guardrails to the same cursor state. Mailbox consumers can pass reset_consumer=true with a valid consumer_id to clear their checkpoint/ack state before the current read is recorded. Mailbox responses now expose consumer count/capacity, created/reset/evicted flags, and consumer timestamps, while diagnostics add reset and eviction counters. The live smoke is scripts/fabric/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke.ps1.

C19V adds read-only inspection for active mailbox consumer cursors. The node-local GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers endpoint returns bounded cursor snapshots with consumer ids, checkpoint and ack sequences, lag, totals, and timestamps. It is verified as read-only: inspection does not increment mailbox reads, ack totals, reset counters, or drain mailbox events. The live smoke is scripts/fabric/c19v-remote-workspace-mailbox-consumer-snapshot-smoke.ps1.

C19W adds cursor-aware resume reads to the mailbox endpoint. Consumers can pass after_sequence to receive only mailbox events newer than their checkpoint; responses include skipped_count and returned_count, and long-poll waits for newer-than-checkpoint events. The endpoint rejects after_sequence with drain=true, preserving the non-destructive resume contract. The live smoke is scripts/fabric/c19w-remote-workspace-mailbox-after-sequence-smoke.ps1.

C19X adds consumer-aware resume convenience. Mailbox reads with consumer_id can pass resume_from=ack or resume_from=checkpoint; the node-agent resolves the stored cursor to after_sequence before reading and returns resume_from/resume_sequence in the response. The guardrails reject mixing resume with manual after_sequence, drain, reset, missing consumers, or invalid cursor names. The live smoke is scripts/fabric/c19x-remote-workspace-mailbox-consumer-resume-smoke.ps1.
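
The after_sequence and resume_from rules from C19W and C19X can be sketched as two small functions. This is a sketch under assumptions: the function names and the consumer-state layout are hypothetical, and skipped_count is assumed to mean the number of retained events at or below the cursor.

```python
def resolve_after_sequence(consumers, consumer_id=None, resume_from=None,
                           after_sequence=None, drain=False, reset_consumer=False):
    """Turn request parameters into one effective after_sequence (or None)."""
    if resume_from is None:
        if after_sequence is not None and drain:
            raise ValueError("after_sequence cannot be combined with drain")
        return after_sequence
    if resume_from not in ("ack", "checkpoint"):
        raise ValueError("resume_from must be ack or checkpoint")
    if after_sequence is not None or drain or reset_consumer:
        raise ValueError("resume_from cannot be mixed with after_sequence, drain, or reset")
    if consumer_id not in consumers:
        raise ValueError("unknown consumer")
    return consumers[consumer_id][resume_from]


def read_after(events, after_sequence):
    """Non-destructive read of events newer than the cursor."""
    returned = [e for e in events
                if after_sequence is None or e["sequence"] > after_sequence]
    return {"events": returned,
            "returned_count": len(returned),
            "skipped_count": len(events) - len(returned)}


consumers = {"worker-1": {"ack": 3, "checkpoint": 5}}  # hypothetical stored cursors
events = [{"sequence": n} for n in range(1, 8)]
cursor = resolve_after_sequence(consumers, consumer_id="worker-1", resume_from="ack")
result = read_after(events, cursor)
print(cursor, result["returned_count"], result["skipped_count"])
```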

C19Y adds resume telemetry to workload status and heartbeat reports. Operators can now see resume read totals, after-sequence read totals, returned/skipped totals, and the last resume cursor, sequence, consumer, returned count, and skipped count. Session snapshots also expose per-session resume counters. The live smoke is scripts/fabric/c19y-remote-workspace-mailbox-resume-telemetry-smoke.ps1.

C19Z adds adapter-runtime readiness diagnostics. Sink reports now include adapter_runtime_readiness, a compact probe-only object with ready status, diagnostic state, session lifecycle, mailbox depth, consumer cursor, resume cursor, lag, and returned/skipped counts. The live smoke is scripts/fabric/c19z-remote-workspace-adapter-readiness-smoke.ps1.

C19Z1 adds read-only handoff preflight for mailbox consumers. The endpoint /mailbox/preflight accepts consumer_id and resume_from=ack|checkpoint, then reports the expected next event window without mailbox reads, drains, acks, or consumer cursor mutation. The live smoke is scripts/fabric/c19z1-remote-workspace-mailbox-preflight-smoke.ps1.

C19Z2 adds telemetry for mailbox preflight checks. Workload status and heartbeat reports now expose preflight totals, ack/checkpoint split counters, and the last preflight cursor/window fields so diagnostics can distinguish handoff checks from mailbox reads. The live smoke is scripts/fabric/c19z2-remote-workspace-mailbox-preflight-telemetry-smoke.ps1.

C19Z3 adds stale-cursor diagnostics to mailbox preflight. If a consumer cursor falls behind retained mailbox events after bounded-mailbox drops, preflight reports retained sequence bounds, stale_cursor, diagnostic_state, and missing_dropped_count; the latest stale state is also visible in telemetry and readiness diagnostics. The live smoke is scripts/fabric/c19z3-remote-workspace-mailbox-stale-preflight-smoke.ps1.

C19Z4 adds action hints to mailbox preflight diagnostics. Stale cursor gaps now return recommended_action=reset_consumer_and_resync plus hints to reset the consumer cursor, request full adapter resync, and resume from checkpoint after resync. The live smoke is scripts/fabric/c19z4-remote-workspace-mailbox-preflight-action-hints-smoke.ps1.
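
The stale-cursor check and action hint from C19Z3 and C19Z4 reduce to a comparison between the consumer cursor and the retained sequence bounds. The sketch below is illustrative: the function name, the "ready" and "caught_up" state names, and the expected_count field are hypothetical, it assumes a non-empty retained window, and only recommended_action=reset_consumer_and_resync is taken from the text.

```python
def mailbox_preflight(cursor: int, first_retained: int, last_retained: int) -> dict:
    """Read-only check: describe the next event window without mutating anything."""
    if cursor >= last_retained:
        return {"diagnostic_state": "caught_up", "stale_cursor": False,
                "missing_dropped_count": 0, "expected_count": 0,
                "recommended_action": None}
    if cursor + 1 < first_retained:
        # Events between the cursor and the oldest retained event were
        # dropped by the bounded mailbox and can no longer be replayed.
        return {"diagnostic_state": "stale_cursor", "stale_cursor": True,
                "missing_dropped_count": first_retained - (cursor + 1),
                "expected_count": last_retained - first_retained + 1,
                "recommended_action": "reset_consumer_and_resync"}
    return {"diagnostic_state": "ready", "stale_cursor": False,
            "missing_dropped_count": 0,
            "expected_count": last_retained - cursor,
            "recommended_action": None}


# Cursor at 2, mailbox retains 6..9: sequences 3, 4, and 5 are gone.
print(mailbox_preflight(cursor=2, first_retained=6, last_retained=9))
```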

C19Z5 adds provenance for the selected preflight action. Responses, telemetry, and readiness diagnostics include action_reason and structured action_context with cursor, retained sequence bounds, dropped/missing counts, and expected window values. The live smoke is scripts/fabric/c19z5-remote-workspace-mailbox-preflight-provenance-smoke.ps1.

C19Z6 adds the operator-facing summary for mailbox preflight. Responses, telemetry, and readiness diagnostics include operator_summary plus compact operator_summary_fields for the diagnostic state, selected action, reason, cursor, retained bounds, and expected window counters. The live smoke is scripts/fabric/c19z6-remote-workspace-mailbox-preflight-summary-smoke.ps1.

C19Z7 adds machine-sortable operator severity for mailbox preflight. Responses, telemetry, readiness diagnostics, and summary fields expose operator_status and operator_severity, classifying ready windows, caught-up cursors, and stale cursor gaps without parsing summary text. The live smoke is scripts/fabric/c19z7-remote-workspace-mailbox-preflight-severity-smoke.ps1.

C19Z8 adds the grouped readiness rollup for mailbox preflight. The readiness diagnostic keeps the flat fields and adds last_preflight with observed time, cursor, counts, diagnostic state, action hints/provenance, operator summary, status, severity, and summary fields. The live smoke is scripts/fabric/c19z8-remote-workspace-mailbox-preflight-rollup-smoke.ps1.

C19Z9 adds retained-window detail to that preflight rollup. The grouped last_preflight readiness object includes first/last retained sequence and mailbox dropped total so stale cursor explanations are visible without opening the raw preflight response. The live smoke is scripts/fabric/c19z9-remote-workspace-mailbox-preflight-retained-window-smoke.ps1.

C19Z10 adds a structured remediation checklist to that rollup. The grouped last_preflight.remediation_checklist entries expose required/satisfied operator steps derived from action hints, including cursor reset, full adapter resync, and resume after resync for stale cursor gaps. The live smoke is scripts/fabric/c19z10-remote-workspace-mailbox-preflight-checklist-smoke.ps1.

C19Z11 adds checklist status and counts to that rollup. The grouped last_preflight readiness object exposes remediation_checklist_status and total/required/satisfied/pending counts for admin UI summaries. The live smoke is scripts/fabric/c19z11-remote-workspace-mailbox-preflight-checklist-status-smoke.ps1.
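
The checklist rollup amounts to counting entries. The following sketch assumes each checklist entry carries boolean required and satisfied fields; the step names, the count field names, and the "complete"/"pending" status values are hypothetical, and satisfied is assumed to count only required steps.

```python
def summarize_checklist(entries: list) -> dict:
    required = [e for e in entries if e["required"]]
    satisfied = [e for e in required if e["satisfied"]]
    pending = len(required) - len(satisfied)
    return {
        "remediation_checklist_status": "complete" if pending == 0 else "pending",
        "total_count": len(entries),
        "required_count": len(required),
        "satisfied_count": len(satisfied),
        "pending_count": pending,
    }


# Stale-cursor example: three required steps, one already done.
checklist = [
    {"step": "reset_consumer_cursor", "required": True, "satisfied": True},
    {"step": "request_full_adapter_resync", "required": True, "satisfied": False},
    {"step": "resume_from_checkpoint", "required": True, "satisfied": False},
]
print(summarize_checklist(checklist))
```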

C19Z12 adds session-level preflight operator status/severity counters. Readiness exposes status and severity count maps, mirrored in last_preflight, so repeated resync-required/warn preflights are visible without retaining a history log. The live smoke is scripts/fabric/c19z12-remote-workspace-mailbox-preflight-status-counts-smoke.ps1.

C19Z13 adds compact preflight attention status on top of those counters. Readiness and last_preflight expose preflight_attention_status so admin UI can sort clean, attention-needed, and repeated-resync sessions without interpreting count maps. The live smoke is scripts/fabric/c19z13-remote-workspace-mailbox-preflight-attention-smoke.ps1.

C19Z14 proves the repeated-resync attention branch. Unit and live smoke coverage perform multiple stale preflights on one active adapter session and verify preflight_attention_status=repeated_resync_required with repeated resync-required/warn counters. The live smoke is scripts/fabric/c19z14-remote-workspace-mailbox-preflight-repeated-attention-smoke.ps1.

C19Z15 adds the preflight attention reason. Readiness and last_preflight expose preflight_attention_reason beside the attention status, explaining clean, attention-needed, and repeated-resync states without UI-side counter parsing. The live smoke is scripts/fabric/c19z15-remote-workspace-mailbox-preflight-attention-reason-smoke.ps1.

C19Z16 completes focused proof coverage for those attention reasons. Unit tests cover clean, single-resync, repeated-resync, and no-preflight mappings; live smoke proves the single stale-preflight reason. The live smoke is scripts/fabric/c19z16-remote-workspace-mailbox-preflight-attention-reason-coverage-smoke.ps1.
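
The four mappings named above (clean, single-resync, repeated-resync, and no-preflight) can be expressed as one function over the status count map. This is a hedged sketch: only unknown, no_preflight_observed, and repeated_resync_required appear in this document; the other status and reason strings, the resync_required count key, and the function name are hypothetical.

```python
def preflight_attention(status_counts: dict, preflight_total: int) -> tuple:
    """Return (preflight_attention_status, preflight_attention_reason)."""
    if preflight_total == 0:
        return ("unknown", "no_preflight_observed")
    resync = status_counts.get("resync_required", 0)
    if resync == 0:
        return ("clean", "no_resync_required_observed")
    if resync == 1:
        return ("attention_needed", "single_resync_required_observed")
    return ("repeated_resync_required", "repeated_resync_required_observed")


print(preflight_attention({}, 0))
print(preflight_attention({"ready": 2}, 2))
print(preflight_attention({"resync_required": 1}, 1))
print(preflight_attention({"resync_required": 3}, 3))
```

Deriving status and reason in one place keeps the admin UI free of counter parsing, which is the stated purpose of these fields.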

C19Z17 adds the preflight diagnostics contract marker. The readiness last_preflight rollup includes diagnostics_schema_version and diagnostics_contract entries for retained-window, remediation-checklist, attention, and operator-count fields, allowing UI rendering to be gated safely. The live smoke is scripts/fabric/c19z17-remote-workspace-mailbox-preflight-contract-smoke.ps1.

C19Z18 adds boolean diagnostics feature flags to the same preflight rollup. last_preflight.diagnostics_features now exposes retained-window, remediation-checklist, attention, and operator-count support directly, so admin UI and automation can gate each diagnostics group without scanning the contract list. The live smoke is scripts/fabric/c19z18-remote-workspace-mailbox-preflight-feature-flags-smoke.ps1.

C19Z19 proves compatibility between the two diagnostics contract forms. Unit coverage and live smoke verify that workload and telemetry reports expose both the string diagnostics_contract entries and matching boolean diagnostics_features flags for every preflight diagnostics group. The live smoke is scripts/fabric/c19z19-remote-workspace-mailbox-preflight-contract-compatibility-smoke.ps1.
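
The compatibility property is that the string contract list and the boolean feature flags must describe the same set of diagnostics groups. A minimal sketch of that check follows; the group identifiers and function names are hypothetical spellings of the four groups named above.

```python
# Hypothetical identifiers for the four preflight diagnostics groups.
DIAGNOSTIC_GROUPS = ("retained_window", "remediation_checklist",
                     "attention", "operator_counts")


def features_from_contract(contract: list) -> dict:
    present = set(contract)
    return {group: group in present for group in DIAGNOSTIC_GROUPS}


def contract_forms_agree(contract: list, features: dict) -> bool:
    # A missing flag is treated as False so older reports compare safely.
    normalized = {g: bool(features.get(g, False)) for g in DIAGNOSTIC_GROUPS}
    return features_from_contract(contract) == normalized


contract = ["retained_window", "remediation_checklist", "attention", "operator_counts"]
features = {group: True for group in DIAGNOSTIC_GROUPS}
print(contract_forms_agree(contract, features))
```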

C19Z20 proves the no-preflight readiness shape. Before any mailbox preflight is observed, active adapter sessions expose preflight_attention_status=unknown, preflight_attention_reason=no_preflight_observed, zero session preflight count, and no grouped last_preflight rollup. The live smoke is scripts/fabric/c19z20-remote-workspace-mailbox-preflight-absence-smoke.ps1.

C19Z21 proves the no-active-session readiness shape. After closing the active adapter session, readiness exposes idle/not-ready state, zero active sessions, no active adapter_session_id, no grouped last_preflight, and terminal last_session_state=closed from the terminal-session ledger. The live smoke is scripts/fabric/c19z21-remote-workspace-no-active-session-readiness-smoke.ps1.

C19Z22 proves terminal-state readiness for expire and reset controls. The same no-active-session readiness shape now reports last_session_state=expired or last_session_state=reset from the terminal-session ledger. The live smoke is scripts/fabric/c19z22-remote-workspace-terminal-state-readiness-smoke.ps1.

C19Z23 adds grouped terminal-session summary metadata to no-active-session readiness. terminal_session_summary carries adapter session id, terminal state, reason, and control timestamp so admin UI can render the terminal cause without stitching flat fields. The live smoke is scripts/fabric/c19z23-remote-workspace-terminal-session-summary-smoke.ps1.

C19Z24 adds the terminal-session summary contract marker. The grouped summary now carries schema version rap.remote_workspace_adapter_terminal_session_summary.v1 and a summary-contract field list for explicit UI gating. The live smoke is scripts/fabric/c19z24-remote-workspace-terminal-summary-contract-smoke.ps1.

C19Z25 adds boolean summary_features to the same grouped terminal-session summary, covering adapter session id, state, reason, and control timestamp. The live smoke is scripts/fabric/c19z25-remote-workspace-terminal-summary-features-smoke.ps1.

C19Z26 proves compatibility between summary_contract and summary_features for the grouped terminal-session summary in workload and telemetry reports. The live smoke is scripts/fabric/c19z26-remote-workspace-terminal-summary-compatibility-smoke.ps1.

C19Z27 proves the absence shape for terminal-session summary. Before any adapter session or terminal history exists, readiness reports waiting_for_session and does not include terminal_session_summary. The live smoke is scripts/fabric/c19z27-remote-workspace-terminal-summary-absence-smoke.ps1.

The Stage C5 contract includes:

container/native workload contract
service instance status
version reporting
logs/health hooks

Stage C6: Mesh Control-Plane Preparation

Status: implemented and verified. Report: artifacts/c6-mesh-control-plane-preparation-report.md.

Goal: Prepare mesh routing state without carrying production traffic.

Includes:

link inventory
route intent
node reachability
QoS policy model

Stage C7: Mesh MVP

Status: skeleton implemented and verified. Report: artifacts/c7-mesh-mvp-skeleton-report.md.

Goal: First secure node-to-node overlay path.

Includes:

mTLS node-to-node
basic relay skeleton
route health
no VPN yet unless separately approved

Stage C8: Multi-Cluster Hardening

Status: implemented and verified. Report: artifacts/c8-multi-cluster-hardening-report.md.

Goal: Safe multi-cluster operations.

Includes:

cluster authority state
partition visibility
cross-cluster admin views
trust separation

Stage C9: Organization Admin Console

Status: foundation implemented and verified. Report: artifacts/c9-organization-admin-foundation-report.md.

Goal: Tenant-safe admin UI.

Includes:

resources
policies
sessions
allowed service endpoints
organization audit

Stage C10: Fabric Core Documentation and Config Distribution Design

Status: completed as documentation/planning stage. Report: artifacts/c10-fabric-core-config-distribution-design-report.md.

Goal: Consolidate Fabric Core terminology and prepare scoped cluster configuration distribution design.

Scope:

define signed scoped cluster snapshot model boundaries
define node-local state boundaries
define peer directory/cache boundaries
define Fabric Storage / Config Storage role and non-goals
define PostgreSQL source-of-truth versus distribution/cache boundaries
define multi-cluster isolation boundaries
define future implementation stages C11-C19

C10 prepares C11, C12, and C13. It did not implement mesh routing, VPN/IP tunnel runtime, relay packet forwarding, RDP work, code, API changes, or service workload execution.

Stage C11: Signed Scoped Cluster Snapshot Model

Status: completed as documentation/planning stage. Report: artifacts/c11-signed-scoped-cluster-snapshot-model-report.md.

Goal: Define signed, versioned, scoped snapshots for node-local operation and degraded-mode recovery.

Stage C12: Node Local State Store

Status: completed as documentation/planning stage. Report: artifacts/c12-node-local-state-store-report.md.

Goal: Define bounded local state storage for identity, cluster membership, peer cache, route cache, service assignment cache, health, partition/degraded state, config versions, and pending update metadata.

Stage C13: Config / Storage Service Foundation

Status: completed as documentation/planning stage. Report: artifacts/c13-fabric-storage-config-service-report.md.

Goal: Define Fabric Storage Service / Config Storage Service boundaries, replication scope, failure-domain rules, and source-of-truth relationship to PostgreSQL.

Stage C14: Peer Directory and Cache Model

Status: completed as documentation/planning stage. Report: artifacts/c14-peer-directory-cache-model-report.md.

Goal: Define peer directory shape, node-local peer cache, refresh cadence, score-based peer selection inputs, and degraded recovery order.

Stage C15: Fabric Routing Engine Skeleton

Status: completed as documentation/planning stage. Report: artifacts/c15-fabric-routing-engine-skeleton-report.md.

Goal: Introduce a safe compile/runtime boundary for future route selection without carrying production mesh traffic.

Stage C16: Secure Node-to-Node Channel Lifecycle

Status: completed as documentation/planning stage. Report: artifacts/c16-secure-node-to-node-channel-lifecycle-report.md.

Goal: Define authenticated node-to-node channel lifecycle, certificate validation, health, reconnection, and channel authorization.

Stage C17: Mesh Routing Runtime

Status: planning completed. Report: artifacts/c17-mesh-routing-runtime-implementation-plan-report.md.

C17A synthetic runtime skeleton is implemented and test-proven. Report: artifacts/c17a-synthetic-mesh-runtime-skeleton-report.md.
C17B route health and failover probe skeleton is implemented and test-proven. Report: artifacts/c17b-route-health-failover-probes-report.md.
C17C relay semantic hardening skeleton is implemented and test-proven. Report: artifacts/c17c-relay-semantic-hardening-report.md.
C17D non-production test-service path experiment is implemented and test-proven. Report: artifacts/c17d-non-production-test-service-path-report.md.
C17E live node-to-node synthetic HTTP transport skeleton is implemented and smoke-proven. Report: artifacts/c17e-live-node-to-node-synthetic-transport-report.md.
C17F scoped synthetic peer/route config loading and route-health reporting is implemented and smoke-proven. Report: artifacts/c17f-scoped-synthetic-route-config-report.md.
C17G Control Plane scoped synthetic config read boundary is implemented and test-proven. Report: artifacts/c17g-control-plane-scoped-synthetic-config-report.md.

Goal: Define the mesh routing runtime implementation plan only after identity, enrollment, scoped config, local state, storage, peer cache, routing skeleton, and secure node-to-node channel lifecycle stages are accepted.

The first implementation step was C17A: synthetic fabric messages only, feature-flagged, test topology only, no RDP/VPN/service traffic, and no topology exposure to organizations.

C17D proved a bounded synthetic.echo test-service path over direct, single-relay, and forced fallback routes. It does not authorize production service traffic.

C17E proved the same synthetic-only route model over real local HTTP node endpoints using a disabled-by-default rap-node-agent endpoint and a mesh-live-smoke harness. It still does not authorize production mesh traffic, RDP/VPN traffic, service workload traffic, or organization-visible topology.

C17F moved the C17E synthetic route model from manual debug JSON toward scoped configuration by adding a node-local scoped config file boundary and synthetic route-health reporting to the Control Plane mesh link endpoint. It still does not authorize production mesh traffic.

C17G added the Control Plane read endpoint for node-scoped synthetic mesh config and node-agent consumption of that config. It still does not authorize production mesh traffic.

Stage C18: VPN / IP Tunnel Service

Status: completed as documentation/planning stage. Report: artifacts/c18-vpn-ip-tunnel-service-target-design-report.md.

Goal: Define VPN/IP tunnel service as a cluster-managed service above Fabric Core and Fabric Routing, not as node-local ad hoc configuration.

Result:

target design: docs/architecture/VPN_IP_TUNNEL_SERVICE_TARGET.md
defines vpn_connection as logical desired state
defines single-active lease/fencing model
defines control-plane versus data-plane boundary
defines routing policy, QoS, security, credential distribution, audit, and failure behavior
defines future C18A-C18I stages

C18 did not implement VPN/IP tunnel runtime, TUN/TAP devices, packet routing, mesh production traffic, RDP work, API changes, migrations, or service workload execution.

Stage C18A: VPN / IP Tunnel Control-Plane Data Model

Status: implemented and backend-test-proven. Report: artifacts/c18a-vpn-control-plane-data-model-report.md.

Goal: Add the durable control-plane source-of-truth model for future VPN/IP tunnel service desired state without implementing VPN runtime.

Completed:

vpn_connections
vpn_connection_allowed_nodes
vpn_connection_route_policies
vpn_connection_leases
backend models, repository methods, service methods, and platform-admin API skeleton
single-active lease boundary with PostgreSQL unique active lease protection
recovery-admin-only lease fencing boundary
tests for authorization, scope, desired-state gating, duplicate active lease rejection, route policy validation, and allowed-node normalization

C18A did not implement VPN/IP tunnel runtime, TUN/TAP devices, packet routing, host route/firewall manipulation, node-agent execution, production mesh traffic, RDP work, Windows client changes, or data-plane behavior changes.

Stage C18B: VPN / IP Tunnel Lease and Fencing Hardening

Status: implemented and backend-test-proven. Report: artifacts/c18b-vpn-lease-fencing-hardening-report.md.

Goal: Harden the future VPN/IP tunnel single-active control-plane ownership model before any node-agent executes VPN work.

Completed:

owner eligibility validation against active cluster membership
owner validation against vpn_connection allowed-node policy
owner validation against active vpn-exit or vpn-connector role assignment
same-owner acquire idempotency
different-owner active lease rejection
idempotent release/fence behavior
stale lease cleanup/reclamation service boundary
audit events for lease renewal and stale lease expiration
tests for wrong cluster, unauthorized owner, missing role, disabled connection, expired lease renewal, duplicate active lease, and stale lease audit behavior
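
The single-active ownership rules above can be illustrated with an in-memory lease table. This is a sketch only: the durable protection described in this plan is the PostgreSQL unique active lease constraint, and the class names, return strings, and TTL handling here are hypothetical. Owner eligibility checks (cluster membership, allowed-node policy, role assignment) are deliberately left out.

```python
class LeaseConflict(Exception):
    """Raised when a different owner already holds the active lease."""


class LeaseTable:
    def __init__(self, ttl_seconds: float = 30.0) -> None:
        self.ttl = ttl_seconds
        self.active = {}  # connection_id -> (owner_node_id, expires_at)

    def acquire(self, connection_id: str, owner: str, now: float) -> str:
        current = self.active.get(connection_id)
        if current is not None and current[1] <= now:
            del self.active[connection_id]  # stale lease is reclaimed
            current = None
        if current is None:
            self.active[connection_id] = (owner, now + self.ttl)
            return "acquired"
        if current[0] == owner:
            return "already_held"  # same-owner acquire is idempotent
        raise LeaseConflict(f"{connection_id} is held by {current[0]}")

    def release(self, connection_id: str, owner: str) -> str:
        current = self.active.get(connection_id)
        if current is not None and current[0] != owner:
            return "not_owner"  # never release someone else's lease
        self.active.pop(connection_id, None)
        return "released"  # idempotent: releasing twice is not an error


table = LeaseTable(ttl_seconds=30.0)
print(table.acquire("vpn-1", "node-a", now=0.0))
print(table.acquire("vpn-1", "node-a", now=10.0))
print(table.acquire("vpn-1", "node-b", now=31.0))  # node-a lease expired at 30.0
```

Passing the clock in as a parameter keeps the sketch deterministic and makes expiry behavior easy to test.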

C18B did not implement VPN/IP tunnel runtime, TUN/TAP devices, host route/firewall manipulation, node-agent VPN execution, production mesh traffic, RDP work, Windows client changes, or data-plane behavior changes.

Stage C18C: VPN / IP Tunnel Node-Agent Desired-State Consumption

Status: implemented and backend-test-proven. Report: artifacts/c18c-vpn-node-agent-desired-state-report.md.

Goal: Expose scoped VPN/IP tunnel desired assignments to eligible node-agents and allow node-agents to report observed assignment status without executing VPN runtime.

Completed:

node-agent desired-state API for scoped VPN assignments
node-agent assignment status report API
explicit observed status values: not_started, assigned, lease_required, blocked, unknown
PostgreSQL assignment status report and latest-status tables
backend service/repository boundaries for scoped assignment visibility
assignment visibility limited to eligible candidates or current active owner
credential_ref withheld from node-agent assignment payloads; only has_credential_ref is exposed for future resolver integration
tests for unauthorized/invisible assignment rejection, allowed status values, invalid status rejection, and non-platform-admin node-agent read path

C18C did not implement VPN/IP tunnel runtime, TUN/TAP devices, host route/firewall/DNS/QoS manipulation, credential delivery, node-agent VPN execution, production mesh traffic, RDP work, Windows client changes, or data-plane behavior changes.
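
A minimal sketch of the C18C visibility and payload rules follows, assuming a simplified assignment record. The observed status values and the `credential_ref` / `has_credential_ref` names come from the list above; `VpnAssignment`, its other fields, and the function names are hypothetical and only illustrate the shape of the contract.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Observed status values named in the C18C stage description.
ALLOWED_OBSERVED_STATUSES = frozenset(
    {"not_started", "assigned", "lease_required", "blocked", "unknown"}
)


@dataclass(frozen=True)
class VpnAssignment:
    """Hypothetical backend-side view of a scoped VPN assignment."""
    connection_id: str
    candidate_node_ids: FrozenSet[str]
    active_owner_node_id: Optional[str]
    credential_ref: Optional[str]


def is_visible_to(assignment: VpnAssignment, node_id: str) -> bool:
    # Visibility is limited to eligible candidates or the current active owner.
    return (
        node_id in assignment.candidate_node_ids
        or node_id == assignment.active_owner_node_id
    )


def project_for_node_agent(assignment: VpnAssignment, node_id: str) -> dict:
    """Build the node-agent payload without ever including credential_ref."""
    if not is_visible_to(assignment, node_id):
        raise PermissionError("assignment is not visible to this node")
    return {
        "connection_id": assignment.connection_id,
        "has_credential_ref": assignment.credential_ref is not None,
    }


def validate_observed_status(status: str) -> str:
    """Accept only the explicit observed status values."""
    if status not in ALLOWED_OBSERVED_STATUSES:
        raise ValueError(f"invalid observed status: {status!r}")
    return status
```

Keeping the projection in one function makes it straightforward to test that the credential reference never reaches a node-agent payload.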

Stage C19: Version Storage and Node Update / Rollback Foundation

Status: future stage. Architecture direction is documented in this file and docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md. No updater runtime is implemented by this documentation.

Goal: Define the durable release, artifact, node-agent update, health supervision, rollback, and data-structure migration model before any production update runtime is enabled.

Scope:

define version-storage / update-repository as the signed release artifact source for node-agent and service workloads
define release channels: stable, current, and candidate
define OS / architecture artifact variants under a single signed release manifest
define last-known-good stable release tracking and rollback eligibility
define update-cache mirroring of approved artifacts without becoming a source of truth
define node-agent download, verification, staging, restart, health-check, promote, rollback, and failure-reporting states
define migration bundle metadata for PostgreSQL schema, service-local data, node-local state, config/cache shards, and protocol/config schema changes
define rollout policy boundaries for canary, staged rollout, pause, resume, rollback, blocklist, and emergency pinning
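
The node-agent update states named in the scope can be pictured as a small state machine. The state names below come from the scope list; the transition table is an assumption made for illustration (C19 has not defined it yet), and `UpdateState`, `TRANSITIONS`, `next_state`, and `walk` are hypothetical names.

```python
from enum import Enum
from typing import FrozenSet, List


class UpdateState(Enum):
    DOWNLOAD = "download"
    VERIFICATION = "verification"
    STAGING = "staging"
    RESTART = "restart"
    HEALTH_CHECK = "health-check"
    PROMOTE = "promote"
    ROLLBACK = "rollback"
    FAILURE_REPORTING = "failure-reporting"


S = UpdateState

# Assumed transitions: state -> (next state on success, next state on failure).
# PROMOTE and FAILURE_REPORTING are terminal in this sketch.
TRANSITIONS = {
    S.DOWNLOAD: (S.VERIFICATION, S.FAILURE_REPORTING),
    S.VERIFICATION: (S.STAGING, S.FAILURE_REPORTING),
    S.STAGING: (S.RESTART, S.FAILURE_REPORTING),
    S.RESTART: (S.HEALTH_CHECK, S.ROLLBACK),
    S.HEALTH_CHECK: (S.PROMOTE, S.ROLLBACK),
    S.ROLLBACK: (S.FAILURE_REPORTING, S.FAILURE_REPORTING),
}


def next_state(state: UpdateState, ok: bool) -> UpdateState:
    pair = TRANSITIONS.get(state)
    if pair is None:
        raise ValueError(f"{state.value} is terminal")
    return pair[0] if ok else pair[1]


def walk(failing: FrozenSet[UpdateState] = frozenset()) -> List[UpdateState]:
    """Simulate one update attempt; states listed in `failing` report failure."""
    state = S.DOWNLOAD
    path = [state]
    while state in TRANSITIONS:
        state = next_state(state, ok=state not in failing)
        path.append(state)
    return path
```

Under these assumed transitions, a failure before restart only needs reporting, while a failure after the new version has started goes through rollback first.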

Non-goals:

no production updater runtime
no automatic schema migration execution by node-agent for PostgreSQL
no unsigned or unapproved artifacts
no arbitrary code download
no mesh, VPN/IP tunnel, RDP, client, or data-plane behavior changes

Rules:

PostgreSQL remains authoritative for rollout policy, approvals, desired versions, channel assignment, migration approval, and audit.
Version Storage stores immutable signed artifacts, release manifests, compatibility metadata, hashes, signatures, provenance, and migration bundles.
Update-cache nodes only mirror approved artifacts and may be invalidated or rebuilt from Version Storage.
Node-agent executes only signed, approved, scoped work assigned by policy.
Node-agent must stop, restart, roll back, or fence local workloads according to health checks and lease/policy state.
Node-agent self-update requires stricter safeguards than workload updates: staged replacement, watchdog, crash-safe rollback, and preservation of control-plane/update-source reachability unless an approved break-glass procedure exists.
PostgreSQL schema migrations are orchestrated by the Control Plane release process, not independently invented by a node.
Service-local and node-local migrations may be executed by node-agent only when declared in a signed release manifest and scoped to that node/workload.
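
As a rough illustration of how a node might pick its OS/architecture variant from a single release manifest and check the artifact hash before staging, consider the sketch below. The manifest layout (`channel`, `artifacts`, `os`, `arch`, `sha256`) and the function names are hypothetical. Signature verification of the manifest itself is deliberately not shown and is assumed to have happened before these functions are called.

```python
import hashlib
import hmac

# Release channels named in the C19 scope.
CHANNELS = ("stable", "current", "candidate")


def select_artifact(manifest: dict, os_name: str, arch: str) -> dict:
    """Return the artifact variant matching this node's OS and architecture."""
    if manifest["channel"] not in CHANNELS:
        raise ValueError(f"unknown channel: {manifest['channel']!r}")
    for artifact in manifest["artifacts"]:
        if artifact["os"] == os_name and artifact["arch"] == arch:
            return artifact
    raise LookupError(f"no artifact variant for {os_name}/{arch}")


def verify_artifact(artifact: dict, payload: bytes) -> bool:
    """Compare the downloaded bytes against the hash recorded in the manifest."""
    digest = hashlib.sha256(payload).hexdigest()
    # Constant-time comparison of the two hex strings.
    return hmac.compare_digest(digest, artifact["sha256"])


# Hypothetical manifest; assumed to be signature-checked already.
payload = b"example artifact bytes"
manifest = {
    "version": "1.2.3",
    "channel": "candidate",
    "artifacts": [
        {"os": "linux", "arch": "amd64", "sha256": hashlib.sha256(payload).hexdigest()},
        {"os": "windows", "arch": "amd64", "sha256": "0" * 64},
    ],
}
```

A node would stage the artifact only when both the manifest signature check (not shown) and `verify_artifact` succeed.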

Historical Stage C1 Prompt

This prompt is retained for provenance only. Stages C1-C9 are now listed above as implemented/verified or build-verified according to their reports.

Proceed with Stage C1 only.

Goal: Implement backend cluster and node model foundation for the Secure Access Fabric.

Strict rules:

do NOT implement mesh runtime
do NOT implement VPN runtime
do NOT change RDP runtime behavior
do NOT build admin UI yet
do NOT rewrite existing node-agent runtime
do NOT expose full topology to organizations
keep PostgreSQL as source of truth
keep Redis for live coordination only

Scope:

Add explicit cluster model.
Add cluster membership model.
Add node join token model with hashed tokens only.
Add node join request model.
Add node identity/certificate metadata model.
Add node role assignment model.
Add node heartbeat/latest health model.
Add cluster audit events.
Preserve existing node tables with safe migration/backfill into a default cluster.
Add repository interfaces and PostgreSQL implementations.
Add service/usecase boundaries for platform-admin cluster and node management.
Add platform-admin-only API skeletons for clusters, nodes, join requests, and role assignments.
Add tests for cluster scoping, role authorization, join token hashing, and organization topology isolation.
Update documentation.

Deliver:

migrations
backend models/repositories/services
platform-admin API skeleton
tests
docs update
verification report

Historical note: C2 was not to start until C1 was accepted. C1-C9 are now recorded above with their current accepted statuses.
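
The "hashed tokens only" requirement in the C1 scope means the plaintext join token is shown once at issuance and only its hash is persisted. A minimal sketch, assuming SHA-256 over a random URL-safe token (the actual hash scheme and the function names here are assumptions, not taken from the C1 implementation):

```python
import hashlib
import hmac
import secrets
from typing import Tuple


def hash_join_token(token: str) -> str:
    """Hash a join token for storage; only this value would be persisted."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_join_token() -> Tuple[str, str]:
    """Return (plaintext token for the operator, hash for the database)."""
    token = secrets.token_urlsafe(32)
    return token, hash_join_token(token)


def token_matches(presented: str, stored_hash: str) -> bool:
    """Check a presented token against the stored hash in constant time."""
    return hmac.compare_digest(hash_join_token(presented), stored_hash)
```

Because the token is high-entropy random data, a plain cryptographic hash is a reasonable choice in this sketch; a password-style KDF would matter more for low-entropy secrets.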

Result / Decision

This document defines Fabric Core as the foundation beneath service adapters and before mesh runtime. It clarifies peer discovery, score-based routing, scoped configuration distribution, node-local state, Fabric Storage/Config Storage, multi-cluster trust boundaries, and risk-based platform admin access. It also hardens the boundaries that Redis is live-only, Fabric routing must not depend on live backend availability, and Fabric Storage/Config Storage is not a general-purpose distributed database or second source of truth.

Stages C10 through C18 are complete as documentation/planning. C17A, C17B, C17C, C17D, C17E, C17F, and C17G are implemented and test-proven with synthetic traffic only. C18 defines the VPN/IP tunnel service target model but does not authorize VPN/IP tunnel runtime. C18A adds the VPN/IP tunnel control-plane data model and platform-admin skeleton only. C18B hardens single-active lease/fencing semantics. C18C adds node-agent desired-state/status reporting for scoped VPN assignments only.

C19 Remote Workspace adapter probe layers are still node-local and probe-only. Through C19Z30, fresh no-session runtime readiness exposes a grouped no_session_summary contract plus summary_features, with compatibility proof across workload and telemetry, while terminal-history readiness exposes terminal_session_summary and omits no_session_summary. Summary exclusivity is proven across fresh, active, and terminal readiness states, and a compact readiness state matrix artifact exists for admin/runtime handoff. C19Z34 records the explicit probe-to-runtime gates and confirms Remote Workspace still has no production payload traffic. C19Z35 adds the disabled-by-default real-adapter supervision status scaffold without enabling real adapter execution. C19Z36 proves that scaffold's env/status/guardrail compatibility. C19Z37 adds sanitized config projection for the future real adapter while still refusing activation and payload traffic. C19Z38 proves that projection for both default/empty and requested config shapes. C19Z39 adds an explicit blocked activation decision contract with required/missing gates. C19Z40 adds a compact handoff report proving scaffold/projection/decision alignment for requested and default node config. C19Z41 adds explicit feature flags for those real-adapter supervision fields. C19Z42 folds those feature flags into the compact handoff report for admin/runtime handoff. 
C19Z43 proves contract-probe precedence when real-adapter supervision is also requested in desired workload config. C19Z44 proves the real-adapter-only desired workload path remains degraded and blocked. C19Z45 adds a compact desired-workload mode matrix for probe-only, real-adapter-only, and combined requested modes. C19Z46 adds compatibility proof for the mode matrix row contract. C19Z47 adds a disabled process-supervisor preconditions contract for future external RDP worker supervision. C19Z48 proves that contract across requested/default config shapes. C19Z49 folds process-supervisor preconditions into the compact handoff report. C19Z50 folds process-supervisor preconditions into the desired-workload mode matrix. C19Z51 proves the mode matrix v2 row contract. C19Z52 adds a disabled process-health-probe contract for future external RDP worker supervision. C19Z53 proves that process-health-probe contract across requested/default status forms. C19Z54 folds process-health-probe visibility into the compact real-adapter handoff report. C19Z55 folds process-health-probe visibility into the desired-workload mode matrix. C19Z56 proves the mode matrix v3 row contract. C19Z57 adds a compact disabled real-adapter readiness/handoff checklist. C19Z58 proves the readiness/handoff summary and checklist contract. C19Z59 adds a disabled real-adapter operator action map. C19Z60 proves the disabled real-adapter operator action map contract. C19Z61 adds a compact disabled real-adapter admin handoff bundle. C19Z62 proves the disabled real-adapter admin handoff bundle contract. C19Z63 adds compact disabled real-adapter admin handoff digest rows. C19Z64 proves the disabled real-adapter admin handoff digest row contract. C19Z65 adds a disabled real-adapter admin handoff digest rollup. C19Z66 proves the disabled real-adapter admin handoff digest rollup contract. C19Z67 adds a disabled real-adapter admin handoff full-chain summary. 
C19Z68 proves the disabled real-adapter admin handoff full-chain summary contract. C19Z69 adds a disabled real-adapter admin handoff release marker. C19Z70 proves the disabled real-adapter admin handoff release marker contract. C19Z71 adds a final contract-only package index for the disabled real-adapter admin handoff chain. C19Z72 proves the final package index contract for the disabled real-adapter admin handoff chain. C19Z73 adds a contract-only runtime gate phase boundary for the next disabled real-adapter preflight phase. C19Z74 proves the runtime gate phase boundary contract. C19Z75 adds a disabled real-adapter runtime gate preflight checklist with all items still blocking runtime. C19Z76 proves the disabled real-adapter runtime gate preflight checklist contract. C19Z77 adds a disabled real-adapter runtime gate preflight status summary. C19Z78 proves the disabled real-adapter runtime gate preflight status summary contract. C19Z79 adds disabled real-adapter runtime gate preflight action hints. C19Z80 proves the disabled real-adapter runtime gate preflight action hints contract. C19Z81 adds a disabled real-adapter runtime gate preflight operator handoff bundle. C19Z82 proves the disabled real-adapter runtime gate preflight operator handoff bundle contract. C19Z83 adds a disabled real-adapter runtime gate preflight release marker. C19Z84 proves the disabled real-adapter runtime gate preflight release marker contract. C19Z85 adds a disabled real-adapter runtime gate preflight package index. C19Z86 proves the disabled real-adapter runtime gate preflight package index contract. C19Z87 adds a disabled real-adapter runtime gate preflight closeout summary. C19Z88 proves the disabled real-adapter runtime gate preflight closeout summary contract. C19Z89 starts the explicit real-adapter runtime gate enablement phase with a contract-only request that remains blocked pending validation. C19Z90 proves the explicit real-adapter runtime gate enablement request contract. 
C19Z91 adds contract-only operator confirmation validation while keeping the runtime gate blocked pending remaining validations. C19Z92 proves the operator confirmation validation contract. C19Z93 adds contract-only binary validation while keeping the runtime gate blocked pending remaining validations. C19Z94 proves the binary validation contract. C19Z95 adds contract-only permission validation while keeping the runtime gate blocked pending remaining validations. C19Z96 proves the permission validation contract. C19Z97 adds contract-only supervisor validation while keeping the runtime gate blocked pending remaining validations. C19Z98 proves the supervisor validation contract. C19Z99 adds contract-only health probe validation while keeping the runtime gate blocked pending payload gate validation. C19Z100 proves the health probe validation contract. C19Z101 adds contract-only payload gate validation with no remaining required validations while keeping runtime not enabled. C19Z102 proves the payload gate validation contract. C19Z103 adds the runtime gate validation closeout while keeping explicit operator enablement required. C19Z104 proves the runtime gate validation closeout contract. C19Z105 adds an operator enablement readiness package while keeping runtime disabled by default. C19Z106 proves the operator enablement readiness package contract. C19Z107 adds an operator enablement readiness release marker while keeping runtime disabled by default. C19Z108 proves the operator enablement readiness release marker contract. C19Z109 adds an operator enablement readiness package index while keeping runtime disabled by default. C19Z110 proves the operator enablement readiness package index contract. C19Z111 adds an operator readiness closeout summary while keeping runtime disabled by default. C19Z112 proves the operator readiness closeout summary contract. C19Z113 adds an operator review decision request while keeping runtime disabled by default. 
C19Z114 proves the operator review decision request contract. C19Z115 adds an operator decision status summary while keeping runtime disabled by default. C19Z116 proves the operator decision status summary contract. C19Z117 adds an operator approval/rejection outcome contract with the outcome not approved and runtime disabled by default. C19Z118 proves the operator approval/rejection outcome contract. C19Z119 adds an operator outcome closeout/reopen boundary while keeping runtime disabled by default. C19Z120 proves the operator outcome closeout/reopen boundary contract. C19Z121 adds a not-approved outcome release marker while keeping runtime disabled by default. C19Z122 proves the not-approved outcome release marker contract. C19Z123 adds a not-approved outcome package index while keeping runtime disabled by default. C19Z124 proves the not-approved outcome package index contract. C19Z125 adds a not-approved outcome closeout summary while keeping runtime disabled by default. C19Z126 proves the not-approved outcome closeout summary contract. C19Z127 adds a final not-approved outcome release marker while keeping runtime disabled by default. C19Z128 proves the final not-approved outcome release marker contract. C19Z129 adds a final not-approved outcome package index/archive marker while keeping runtime disabled by default. C19Z130 proves the final not-approved outcome package index/archive marker contract. C19Z131 adds a not-approved outcome archive closeout manifest while keeping runtime disabled by default. C19Z132 proves the not-approved outcome archive closeout manifest contract. C19Z133 adds a stopped-branch sentinel for the not-approved outcome while keeping runtime disabled by default. C19Z134 proves the not-approved outcome stopped-branch sentinel contract. C19Z135 adds a no-continuation guard for the stopped not-approved outcome while keeping runtime disabled by default. C19Z136 proves the not-approved outcome no-continuation guard contract. 
C19Z137 adds continuation block enforcement for the stopped not-approved outcome while keeping runtime disabled by default. C19Z138 proves the not-approved outcome continuation block enforcement contract. C19Z139 adds a continuation block audit record for the stopped not-approved outcome while keeping runtime disabled by default. C19Z140 proves the not-approved outcome continuation block audit record contract. C19Z141 adds a continuation block audit rollup for the stopped not-approved outcome while keeping runtime disabled by default. C19Z142 proves the not-approved outcome continuation block audit rollup contract. C19Z143 adds an operator stop summary for the stopped not-approved outcome while keeping runtime disabled by default. C19Z144 proves the not-approved outcome operator stop summary contract. C19Z145 adds an operator stop handoff for the stopped not-approved outcome while keeping runtime disabled by default. C19Z146 proves the not-approved outcome operator stop handoff contract. C19Z147 adds an operator stop handoff digest for the stopped not-approved outcome while keeping runtime disabled by default. C19Z148 proves the not-approved outcome operator stop handoff digest contract. C19Z149 adds an operator stop status snapshot for the stopped not-approved outcome while keeping runtime disabled by default. C19Z150 proves the not-approved outcome operator stop status snapshot contract. C19Z151 adds an operator stop status snapshot index for the stopped not-approved outcome while keeping runtime disabled by default. C19Z152 proves the not-approved outcome operator stop status snapshot index contract. C19Z153 adds an operator stop status catalog for the stopped not-approved outcome while keeping runtime disabled by default. C19Z154 proves the not-approved outcome operator stop status catalog contract. C19Z155 adds an operator stop status catalog release marker for the stopped not-approved outcome while keeping runtime disabled by default. 
C19Z156 proves the not-approved outcome operator stop status catalog release marker contract. C19Z157 adds an operator stop status catalog package index for the stopped not-approved outcome while keeping runtime disabled by default. C19Z158 proves the not-approved outcome operator stop status catalog package index contract. C19Z159 adds an operator stop status catalog closeout summary for the stopped not-approved outcome while keeping runtime disabled by default. C19Z160 proves the not-approved outcome operator stop status catalog closeout summary contract. C19Z161 adds an operator stop status final archive marker for the stopped not-approved outcome while keeping runtime disabled by default. C19Z162 proves the not-approved outcome operator stop status final archive marker contract. C19Z163 adds an operator stop status final archive manifest for the stopped not-approved outcome while keeping runtime disabled by default. C19Z164 proves the not-approved outcome operator stop status final archive manifest contract. C19Z165 adds a terminal-complete marker for the stopped not-approved outcome factory while keeping runtime disabled by default. C19Z166 proves the not-approved outcome factory terminal-complete contract. C20Z1 opens a new explicit real-adapter enablement request while keeping runtime disabled by default. C20Z2 proves the new explicit real-adapter enablement request contract. C20Z3 adds the operator validation intake for the new explicit request while keeping runtime disabled by default. C20Z4 completes the operator validation checklist contract while keeping runtime disabled by default. C20Z5 closes the operator validation chain contract while keeping runtime disabled by default. C20Z6 proves the C20 stage terminal-complete contract. Version Storage/Update Repository and node-agent update/rollback foundation are not implemented by this document. 
No RDP, data-plane, VPN runtime, production relay, production mesh service traffic, node-agent VPN execution, host networking, service workload runtime, or production updater behavior is implied by this document.
