# Cluster, Node, Mesh, and Admin Foundation

Status: target foundation plan plus staged implementation tracker. This document defines the next platform-core direction and implementation order. It is not an instruction to rewrite the current RDP MVP.

The current RDP work is paused by product decision. The proven RDP access core remains valuable and must not be broken. The next platform step is to build the cluster, node enrollment, mesh preparation, and platform administration foundation that the Secure Access Fabric will depend on.

Related documents:

- `docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`
- `docs/architecture/DATA_PLANE_V1.md`
- `docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md`
- `docs/codex/CURRENT_STATUS.md`

## Purpose

The platform is a Secure Access Fabric, not only an RDP proxy. RDP is the first managed service adapter, but the long-term platform must support additional adapters and network services such as VNC, SSH, VPN/IP tunnel, internal web apps, file services, video/audio, update cache, and relay/entry nodes.

The lower platform foundation is the RAP Fabric Core: a distributed runtime
layer above the host OS and below service adapters. It is not a real operating
system. It is implemented through native `rap-node-agent`, control-plane
contracts, signed scoped cluster snapshots, node-local state, and service
supervision boundaries.

Before implementing mesh traffic, VPN runtime, or organization administration UI, the platform needs a clear foundation for:

- clusters
- host-level node identity
- node enrollment
- Fabric Core terminology and contracts
- scoped configuration distribution
- node-local state
- role assignment
- service workload placement
- platform-owner administration
- future organization administration
- safe multi-cluster operations

## Current Foundation Inventory

The backend already contains an early platform-core foundation:

- `nodes`
- `node_capabilities`
- `node_services`
- `node_update_policies`
- `node_partition_states`
- `node_agent_update_runs`
- node management routes under `/nodes`
- node-agent routes under `/node-agents`

This is useful, but it is not sufficient as the final node and cluster model. Current gaps:

- no explicit first-class `clusters`
- no explicit cluster membership model
- no join request approval workflow
- no short-lived join token model
- no node certificate lifecycle model
- no clear separation between technical capability and authorized role assignment
- no platform admin console model
- no multi-cluster administration model
- no organization-safe visibility model for entry/egress/service endpoints
- node-agent registration is too direct for production trust boundaries

## Vocabulary

Platform:
The whole Secure Access Fabric installation operated by the platform owner.

Cluster:
A bounded control and trust domain containing nodes, policies, service placements, certificates, health state, and routing state. A platform may operate multiple clusters.

Organization:
A tenant. Organizations own or are granted access to resources, policies, sessions, services, and selected entry/egress points. Organizations must not see full internal mesh topology.

Fabric Core:
The lower OS-like distributed runtime layer of the Secure Access Fabric. It runs
above the host OS and is owned by native `rap-node-agent` plus control-plane
contracts. It owns node identity, enrollment, local state, capability reporting,
role assignment consumption, signed scoped configuration snapshots, update
trust, and service supervision boundaries.

Node:
A host-level identity. A node is not a Docker container. A node is managed by native `rap-node-agent`.

Node Agent:
Native host process responsible for node identity, enrollment, certificates, health, role execution, service workload supervision, updates, recovery, and reporting.

Service Workload:
A workload executed on a node. It may be native or containerized. Examples: `rdp-worker`, `vnc-worker`, `entry-node`, `relay-node`, `file-storage-cache`.

Capability:
What a node can technically do. Example: `can_run_rdp_worker`.

Role Assignment:
What a node is authorized to do in a cluster and, where relevant, for an organization. Capabilities are not permissions.

Join Request:
A request from a node-agent to join a cluster, usually created with a short-lived join token and approved by a platform admin.

Entry Node:
Accepts client connections and maps them into platform logical channels.

Core Mesh Node:
Maintains secure overlay routing and QoS. It should remain protocol-neutral.

Egress / Service Node:
Hosts service adapters or network exits that connect to target resources.

Service Adapter:
Protocol-specific edge function, for example RDP Adapter, VNC Adapter, SSH Adapter, HTTP/internal app adapter, or video/audio adapter.

Remote Server/Desktop Access:
Product-level managed access service consumed by Access Client. RDP, VNC, and
SSH are internal protocol adapters selected from the organization resource
protocol; they are not separate organization-facing cluster services.

Egress Pool:
Organization-visible logical exit such as "Office Moscow" or "Datacenter
Kazan". It may be backed by multiple internal nodes. Organizations reference
egress pools, not concrete nodes or mesh topology.

Fabric Routing Engine:
Logical fabric-layer component responsible for peer selection, route discovery,
route scoring, failover, shortcut decisions, route cache management, topology
hiding, and channel-aware routing. It is not a runtime implementation in the
current phase.

Fabric Storage Service / Config Storage Service:
Future scoped storage/distribution service for signed cluster snapshots, peer
directories, policy snapshots, update metadata, and node-scoped configuration.
PostgreSQL remains authoritative.

Version Storage / Update Repository:
Logical artifact repository for signed platform, node-agent, and service
workload releases. It stores release manifests, OS/architecture artifacts,
stable/current/candidate channel metadata, hashes, signatures, compatibility
metadata, and migration bundles. PostgreSQL remains authoritative for rollout
policy, approvals, and audit.

## Golden Rules

1. Node identity belongs to the native `rap-node-agent`, not to a container.
2. Containers are packaging and isolation boundaries only.
3. Capabilities are not permissions.
4. Role assignment must be explicit per cluster and, when needed, per organization.
5. Platform admins operate cluster/node trust and placement.
6. Organization admins manage only organization-visible resources, policies, sessions, and allowed service endpoints.
7. Organizations must not see intermediate core mesh topology.
8. Mesh routing must not be implemented before enrollment, identity, and role assignment are trustworthy.
9. Disconnected minority partitions must not add nodes or change policy.
10. PostgreSQL remains the source of truth; Redis/live state must be reconstructable.
11. RDP MVP behavior must remain preserved while platform-core work progresses.
12. Fabric Core comes before mesh runtime.
13. Service adapters must not own mesh routing logic.
14. Nodes must not store or learn more cluster, organization, secret, or route data than their role requires.
15. Config/storage services distribute scoped snapshots and caches; they are not a second source of truth.
16. Redis must not store durable topology, durable configuration, node identity, policy, or organization data.
17. Fabric routing decisions must not depend on live backend availability.
18. Fabric Storage / Config Storage must not become a general-purpose distributed database.
19. Version Storage stores signed artifacts and release manifests; it does not
    decide rollout policy and must not serve unsigned or unapproved artifacts.
20. Node-agent is the local supervisor for health, restart, update, and rollback
    of node services, but Control Plane owns rollout policy and durable schema
    migration orchestration.

## Existing Node Management Semantics

The Platform Owner Console must treat a node as a concrete host-level identity,
not as an abstract request for any free capacity. A cluster membership always
connects a specific `node_id` to a specific `cluster_id`.

The same physical node may participate in multiple clusters only through
isolated cluster memberships, certificates, tokens, storage namespaces, and
policies. Role assignments and desired service workloads are cluster-scoped.
This means a node may run `entry-node` for one cluster, have no ingress role in
another cluster, and still report the same underlying host capabilities.

Node capabilities are reported by `rap-node-agent` as the technical upper bound
of what the host can do. Capabilities are not permissions. The Control Plane
must expose to each cluster only the roles and service workloads that are
explicitly assigned for that cluster and, where relevant, for a specific
organization.

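As a minimal sketch of the "capabilities are an upper bound, not permissions" rule, the following Python shows one way the roles a node may run in a given cluster could be computed. The role-to-capability mapping, the `RoleAssignment` type, and `effective_roles` are hypothetical names for illustration; they are not part of the existing schema or API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a role to the capability it needs (assumption).
REQUIRED_CAPABILITY = {
    "entry-node": "can_accept_client_ingress",
    "relay-node": "can_route_mesh",
    "rdp-worker": "can_run_rdp_worker",
}


@dataclass(frozen=True)
class RoleAssignment:
    cluster_id: str
    node_id: str
    role: str
    organization_id: Optional[str] = None
    status: str = "active"


def effective_roles(node_id, cluster_id, capabilities, assignments):
    """Roles a node may run in one cluster: explicitly assigned AND technically possible."""
    roles = set()
    for a in assignments:
        if a.node_id != node_id or a.cluster_id != cluster_id or a.status != "active":
            continue
        required = REQUIRED_CAPABILITY.get(a.role)
        # Roles without a known capability mapping are denied (fail closed).
        if required is not None and required in capabilities:
            roles.add(a.role)
    return roles
```

A reported capability without an assignment yields nothing, and an assignment without the matching capability is also ignored, so the same host can legitimately resolve to different role sets in different clusters.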
Current owner-console operations for existing nodes are:

- view active-cluster nodes
- view all physical nodes visible to the platform owner
- inspect cluster membership state per node
- organize active-cluster nodes into hierarchical node groups
- inspect heartbeat, telemetry, reported services, and desired services
- assign cluster-scoped roles
- enable or disable cluster-scoped desired service workloads
- manage node participation in product-level functions such as Remote
  Server/Desktop Access, without exposing protocol adapters as separate
  organization-facing services
- enable node/cluster/organization test telemetry flags
- disable a node membership in the active cluster
- revoke a node identity as a high-risk platform-owner action

Creating a brand-new node is not a direct database insert. The future production
flow remains installation of native `rap-node-agent` on the host, enrollment
with a short-lived join token, platform-owner approval, then explicit role and
service assignment. This enrollment/create-node UX is intentionally separate
from existing-node management.

Node groups are cluster-scoped inventory folders, not node identities and not
permissions. A group may contain child groups, and a node membership may be
assigned to one group in that cluster. The same physical node may be in a
different group, or no group, in another cluster. Groups are intended for
large-scale operator usability: data centers, sites, customers, regions,
environments, ownership boundaries, or maintenance rings. Authorization still
comes from cluster membership, role assignment, organization policy, and
service desired state, not from the visual group tree.

Remote Server/Desktop Access is managed as one product service. A node may be
allowed to run this access service, while the concrete protocol adapter is chosen
from the organization resource (`rdp`, `vnc`, `ssh`). Cluster admins should not
configure separate organization-facing RDP/VNC/SSH services; those adapters are
internal runtime implementations.

## Fabric Core Foundation

Layer order:

1. Host OS
2. RAP Fabric Core
3. Secure Fabric Network
4. Service Runtime / Service Adapters
5. Access Clients / Admin UI

Fabric Core responsibilities:

- node identity
- node enrollment
- local node state
- capability reporting
- role assignment consumption
- signed scoped cluster snapshots
- update trust
- service workload supervision boundary

Fabric Core does not implement service protocols. RDP, VNC, SSH, VPN, video,
and file transfer are services above the fabric. They consume node identity,
role assignment, local state, routing, and policy from the Fabric Core.

Fabric Core must be trustworthy before mesh runtime traffic exists. Mesh
routing, VPN/IP tunnel runtime, relay packet routing, and service workload
execution must not be implemented before node identity, enrollment, trust,
role assignment, and scoped configuration distribution are solid.

## Peer Discovery and Routing Foundation

Nodes must not maintain active connections to all other nodes.

Each node maintains:

- active peers
- warm candidate peers
- cold / bootstrap peers

Recommended defaults:

- normal node: 3-5 active peers
- relay / core node: 8-20 active peers
- thin / mobile node: 1-3 active peers

Peer selection must be score-based, not latency-only.

Score inputs:

- latency
- packet loss
- reliability
- region distance
- node load
- bandwidth availability
- role suitability
- policy constraints
- trust level
- recent failure history

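The score-based selection described above can be illustrated with a short Python sketch. It uses only a subset of the listed inputs, and the `PeerCandidate` type, the weights, and the function names are assumptions chosen for illustration, not tuned or agreed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeerCandidate:
    node_id: str
    latency_ms: float
    packet_loss: float      # fraction in 0.0..1.0
    load: float             # fraction in 0.0..1.0
    recent_failures: int
    policy_allowed: bool    # hard constraint, not a score term


def peer_score(p):
    """Lower is better. Weights are illustrative assumptions."""
    return (
        p.latency_ms
        + 1000.0 * p.packet_loss
        + 50.0 * p.load
        + 25.0 * p.recent_failures
    )


def select_active_peers(candidates, max_active):
    """Drop policy-forbidden peers, then keep the best-scoring ones."""
    eligible = [p for p in candidates if p.policy_allowed]
    return sorted(eligible, key=peer_score)[:max_active]
```

In this sketch policy is a filter rather than a weight, so a fast but disallowed peer can never win, and a low-latency peer with packet loss can rank below a slower but cleaner one.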
Service adapters request a destination node, resource target, egress node, or
egress pool. The Fabric Routing Engine chooses the path. RDP/VNC/SSH/VPN/video
adapters must not implement topology discovery, multi-hop route selection,
shortcut creation, or cross-cluster routing policy.

Fabric routing uses node-local state, signed scoped snapshots, peer cache, and
route cache. It must respect policy, organization scope, cluster boundaries,
and partition/authority state. Service adapters must not select routes,
discover peers, manage mesh connections, implement failover, implement shortcut
logic, implement partition recovery, or implement cross-cluster routing policy.

## Node-Local State and Scoped Configuration

`rap-node-agent` local state should include:

- node identity
- cluster membership
- signed scoped cluster snapshot
- peer cache
- route cache
- service assignment cache
- local health state
- partition / degraded state
- last applied config version
- pending update metadata

A node must not store full cluster topology unless its role requires it.

Configuration distribution is need-to-know:

- core mesh nodes receive neighbor/peer data, route policy, QoS policy, and cluster version
- ingress nodes receive allowed client entry policies, token validation config, and route entry data
- egress/service nodes receive assigned service configs, needed resource refs, connector refs, and service policy
- storage services receive assigned shard/scope data and replication metadata

Secrets are delivered only through approved resolvers and only when required at
runtime. RDP credentials, organization user lists, unrelated storage shards, and
full topology must not be distributed to nodes that do not need them.

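One way to picture need-to-know distribution is an allow-list of snapshot sections per role. The sketch below is only an illustration: the role keys, section names, and `scoped_snapshot` function are hypothetical, and snapshot signing is deliberately left out.

```python
# Hypothetical allow-list of snapshot sections per role, paraphrasing the
# need-to-know list in this section. Names are assumptions, not a real schema.
SECTIONS_BY_ROLE = {
    "core-mesh": {"peers", "route_policy", "qos_policy", "cluster_version"},
    "ingress": {"entry_policies", "token_validation", "route_entry"},
    "egress-service": {
        "service_configs", "resource_refs", "connector_refs", "service_policy",
    },
    "storage": {"shard_scope", "replication_metadata"},
}


def scoped_snapshot(full_config, roles):
    """Return only the sections that at least one assigned role may receive."""
    allowed = set()
    for role in roles:
        # Unknown roles contribute nothing (fail closed).
        allowed |= SECTIONS_BY_ROLE.get(role, set())
    return {key: value for key, value in full_config.items() if key in allowed}
```

Because the filter is an allow-list, a section that no role claims (for example an organization user list) is never emitted, even for a node that holds every role.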
## Fabric Storage / Config Storage Foundation

Fabric Storage Service / Config Storage Service is a future foundation service,
not a replacement for PostgreSQL.

Purpose:

- store and replicate scoped cluster configuration
- distribute signed snapshots
- keep frequently used data near services
- support local high-speed reads
- preserve configuration availability when some nodes disappear

Rules:

- not every node stores all data
- replication factor is policy-driven
- critical data is replicated across failure domains
- hot data is placed near services that use it
- organization and cluster isolation are mandatory
- PostgreSQL remains authoritative
- storage/config services are distribution/cache layers only
- storage/config services must not accept direct node writes as authoritative state
- storage/config services must not expose arbitrary query capabilities
- storage/config services must not store full cluster or organization data on every node

Data classes:

- platform global data
- cluster state
- organization data
- service config data
- realtime / lease state
- audit / event data
- artifacts / update data

## Version Storage and Node Update Foundation

The platform includes a logical Version Storage / Update Repository service.

It stores approved release artifacts and metadata for:

- native `rap-node-agent`
- service workloads such as `rdp-worker`, `entry-node`, `relay-node`,
  `file-storage-cache`, `update-cache`, future VNC/SSH/VPN services
- OS / architecture-specific builds
- migration bundles for data structure changes

Release channels:

- `stable`: known-good version used for rollback
- `current`: normal intended version for the cluster/ring
- `candidate`: version staged for test/canary rollout

Rules:

- PostgreSQL is authoritative for rollout policy, channel assignment, approvals,
  target nodes/rings, and audit.
- Version Storage stores immutable signed artifacts and manifests.
- `update-cache` mirrors approved artifacts close to nodes but does not approve
  or invent versions.
- Node-agent downloads from authorized update-cache or Version Storage, verifies
  signatures and hashes, stages updates, runs health checks, and reports state.
- Node-agent may restart workloads and roll back to last stable / last known
  good when health checks fail.
- Node-agent self-update requires stronger safeguards than normal service
  workloads: A/B slots or equivalent staged replacement, watchdog/recovery, and
  rollback that survives process crash.

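The hash half of the "verifies signatures and hashes" rule can be sketched with the Python standard library. The manifest entry shape (`sha256`, `approved`) and the function names are assumptions; signature verification against the platform trust roots is intentionally omitted, since it depends on key material and a crypto library this document does not specify.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(artifact: bytes, manifest_entry: dict) -> bool:
    """Accept only an approved artifact whose hash matches the manifest.

    Assumes the manifest itself was already signature-checked elsewhere.
    """
    if not manifest_entry.get("approved", False):
        return False
    expected = manifest_entry.get("sha256", "")
    # Constant-time comparison of the two hex digests.
    return hmac.compare_digest(sha256_hex(artifact), expected)
```

A mirror such as `update-cache` can serve the bytes, but under this sketch it cannot make a modified or unapproved artifact pass, because approval and the expected hash come from the manifest rather than from the download source.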
Data structure migrations:

- code updates and data structure updates are separate concerns in one signed
  release manifest
- migration bundles must declare source version, target version, compatibility,
  preflight checks, backup/snapshot requirements, rollback behavior, and whether
  downgrade is possible
- Control Plane/PostgreSQL schema migrations are orchestrated by the Control
  Plane release process
- service-local, cache-local, and node-local state migrations may be executed by
  node-agent only when included in an approved signed manifest
- failed migrations must leave the node rolled back, degraded, or fenced, never
  silently half-updated

Suggested update states:

- `idle`
- `checking`
- `downloading`
- `verifying`
- `staging`
- `migrating`
- `starting`
- `health_checking`
- `promoting`
- `running`
- `rollback_required`
- `rolling_back`
- `rolled_back`
- `failed`
- `fenced`

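The suggested states could be enforced as an explicit transition table so that an update can never jump to `running` without passing verification and health checks. The document lists the states but not the transitions, so every edge in the table below is an assumption made for this sketch, as are the names `ALLOWED_TRANSITIONS` and `advance`.

```python
# Assumed transition table; only the state names come from the list above.
ALLOWED_TRANSITIONS = {
    "idle": {"checking"},
    "checking": {"idle", "downloading"},
    "downloading": {"verifying", "failed"},
    "verifying": {"staging", "failed"},
    "staging": {"migrating", "starting", "failed"},
    "migrating": {"starting", "rollback_required"},
    "starting": {"health_checking", "rollback_required"},
    "health_checking": {"promoting", "rollback_required"},
    "promoting": {"running"},
    "running": {"checking"},
    "rollback_required": {"rolling_back"},
    "rolling_back": {"rolled_back", "fenced"},
    "rolled_back": {"idle"},
    "failed": {"idle"},
    "fenced": set(),  # assumed to need operator action to leave
}


def advance(current, nxt):
    """Return the next state, or raise if the transition is not allowed."""
    if nxt not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal update transition: {current} -> {nxt}")
    return nxt
```

Making `fenced` terminal in the table mirrors the rule that a failed migration must never leave a node silently half-updated: once fenced, nothing moves it forward automatically.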
## Target Data Model

The following entities should become the explicit foundation. Names are target names and may be adjusted during implementation if the existing schema requires compatibility.

### Cluster

`clusters`

Purpose:
First-class cluster records.

Key fields:

- `id`
- `name`
- `slug`
- `status`
- `region`
- `metadata`
- `created_at`
- `updated_at`

### Cluster Membership

`cluster_memberships`

Purpose:
Links nodes to clusters with membership state.

Key fields:

- `cluster_id`
- `node_id`
- `membership_status`
- `joined_at`
- `last_seen_at`
- `metadata`

### Node

`nodes`

Purpose:
Host-level identity record.

Existing `nodes` should be preserved and migrated safely. A default cluster backfill is acceptable for existing rows.

Target additions or refinements:

- explicit cluster membership through `cluster_memberships`
- enrollment status separated from runtime health
- identity/certificate state
- ownership and organization association only where applicable

### Node Identity

`node_identities`

Purpose:
Stores node public identity material and trust state.

Key fields:

- `node_id`
- `public_key`
- `certificate_serial`
- `certificate_not_before`
- `certificate_not_after`
- `identity_status`
- `rotated_at`
- `revoked_at`

Private keys must never be stored in the control plane.

### Node Join Token

`node_join_tokens`

Purpose:
Short-lived token created by platform admin to allow enrollment into a cluster.

Key fields:

- `id`
- `cluster_id`
- `token_hash`
- `scope`
- `expires_at`
- `max_uses`
- `used_count`
- `created_by_user_id`
- `revoked_at`

Raw join tokens must not be stored.

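A minimal sketch of the "store only the hash" rule, using field names from the `node_join_tokens` list. The 15-minute default lifetime, the use of SHA-256 over a random URL-safe token, and the function names are assumptions for illustration rather than decided design.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone


def hash_token(raw: str) -> str:
    """Hash of the raw token; only this value would be persisted."""
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def create_join_token(cluster_id, ttl_minutes=15, max_uses=1):
    """Return (raw_token, record). The raw token is shown once, never stored."""
    raw = secrets.token_urlsafe(32)
    record = {
        "cluster_id": cluster_id,
        "token_hash": hash_token(raw),
        "expires_at": datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes),
        "max_uses": max_uses,
        "used_count": 0,
        "revoked_at": None,
    }
    return raw, record


def token_is_valid(record, raw, cluster_id, now=None):
    """Check hash, cluster scope, revocation, expiry, and usage limit."""
    now = now or datetime.now(timezone.utc)
    return (
        secrets.compare_digest(record["token_hash"], hash_token(raw))
        and record["cluster_id"] == cluster_id
        and record["revoked_at"] is None
        and now < record["expires_at"]
        and record["used_count"] < record["max_uses"]
    )
```

The record returned here contains nothing that lets a database reader reconstruct the token, while validation still covers every field the control plane is expected to check.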
### Node Join Request

`node_join_requests`

Purpose:
Approval workflow for node enrollment.

Key fields:

- `id`
- `cluster_id`
- `node_name`
- `node_fingerprint`
- `public_key`
- `reported_capabilities`
- `reported_facts`
- `requested_roles`
- `status`
- `reviewed_by_user_id`
- `reviewed_at`
- `approved_node_id`
- `rejection_reason`

Default behavior should be manual approval.

### Node Capability

`node_capabilities`

Purpose:
Technical capabilities reported by node-agent and verified by platform policy.

Capability examples:

- `can_accept_client_ingress`
- `can_accept_node_ingress`
- `can_route_mesh`
- `can_egress_internet`
- `can_egress_private_network`
- `can_run_rdp_worker`
- `can_run_vnc_worker`
- `can_run_vpn_exit`
- `can_run_vpn_connector`
- `can_run_file_cache`
- `can_run_update_cache`
- `can_run_video_relay`

### Node Role Assignment

`node_role_assignments`

Purpose:
Explicit authorization for a node to execute roles in a cluster and optionally for specific organizations.

Key fields:

- `cluster_id`
- `node_id`
- `role`
- `organization_id`
- `status`
- `assigned_by_user_id`
- `assigned_at`
- `policy`

Examples:

- `entry-node`
- `relay-node`
- `rdp-worker`
- `vnc-worker`
- `vpn-exit`
- `vpn-connector`
- `file-storage-cache`
- `update-cache`
- `video-relay`

### Node Service Instance

`node_service_instances`

Purpose:
Tracks running service workloads supervised by node-agent.

Key fields:

- `cluster_id`
- `node_id`
- `service_type`
- `service_instance_id`
- `desired_state`
- `reported_state`
- `version`
- `last_heartbeat_at`
- `metadata`

### Node Heartbeat

`node_heartbeats`

Purpose:
Durable heartbeat observations and latest state snapshots.

This can be split into an append-only heartbeat log and a compact latest-state table if needed.

### Cluster Audit Event

`cluster_audit_events`

Purpose:
Audit cluster, node, enrollment, role, trust, and service-placement changes.

Events should include:

- join token created/revoked
- join request received/approved/rejected
- node certificate issued/rotated/revoked
- role assigned/removed
- service desired state changed
- node health changed
- cluster partition detected/resolved

### Future Mesh Entities

These are not Stage C1 implementation requirements, but the data model should not block them:

- `mesh_links`
- `mesh_routes`
- `cluster_route_intents`
- `cluster_partitions`
- `cluster_authority_terms`

## Node Enrollment Flow

Default enrollment must be explicit and reviewable.

1. Platform admin chooses or creates a cluster.
2. Platform admin creates a short-lived join token scoped to that cluster.
3. Operator installs native `rap-node-agent` on the host.
4. Node-agent generates a local keypair.
5. Node-agent sends a join request with token, public key, node facts, and capabilities.
6. Control plane validates token hash, expiry, scope, and usage limit.
7. Control plane creates `node_join_request` in `pending` state.
8. Platform admin reviews node facts and approves or rejects the request.
9. On approval, control plane creates or activates the node identity.
10. Control plane issues cluster-scoped node certificate or certificate metadata.
11. Node-agent stores identity material locally.
12. Node-agent starts heartbeats.
13. Platform admin assigns roles.
14. Node-agent starts only authorized service workloads.

Auto-approval may exist later only for tightly scoped platform-managed bootstrap tokens. It must not be the default.

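The review part of the enrollment flow (a `pending` request that a platform admin approves or rejects) can be sketched as a small in-memory model. The `JoinRequest` class mirrors a subset of the `node_join_requests` fields; the `review` function, its parameters, and the `approved`/`rejected` status strings are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JoinRequest:
    cluster_id: str
    node_name: str
    public_key: str
    reported_capabilities: list
    status: str = "pending"  # requests always start pending (manual approval)
    reviewed_by_user_id: Optional[str] = None
    approved_node_id: Optional[str] = None
    rejection_reason: Optional[str] = None


def review(request, reviewer_id, approve, node_id=None, reason=None):
    """Apply a platform-admin decision to a pending join request."""
    if request.status != "pending":
        raise ValueError("only pending requests can be reviewed")
    if approve and node_id is None:
        raise ValueError("approval must reference the created node identity")
    request.reviewed_by_user_id = reviewer_id
    if approve:
        request.status = "approved"
        request.approved_node_id = node_id
    else:
        request.status = "rejected"
        request.rejection_reason = reason
    return request
```

Validation happens before any field is changed, so a rejected call leaves the request untouched, and a request can be decided only once.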
## Node Runtime Expectations

`rap-node-agent` should run natively on the host.

The agent owns:

- host identity
- node certificates
- enrollment
- heartbeat
- desired role polling
- service workload supervision
- update trust and rollback
- local recovery

Containerized workloads are preferred for:

- `rdp-worker`
- `vnc-worker`
- `relay-node`
- `entry-node`
- `file-storage-cache`
- `update-cache`
- `video-relay`

Native workloads are preferred for:

- `vpn-exit`
- `vpn-connector`
- host route manager
- firewall/QoS manager
- Windows virtual adapter service
- Android VpnService client

Realtime container workloads should avoid Docker bridge/NAT hot paths where low latency matters. Prefer host networking or native mode for latency-critical paths.

Privileged containers are discouraged. If a workload needs `NET_ADMIN`, `/dev/net/tun`, host firewall control, or broad host privileges, native mode should be preferred unless explicitly approved.

## Admin Console Model

Admin UI is served through Web Ingress, but cluster configuration belongs to
the Control Plane.

The Web/Admin boundary is defined in
`docs/architecture/WEB_INGRESS_AND_ADMIN_UI_MODEL.md`.

Rules:

- Web Ingress is an HTTP/HTTPS entry and presentation layer only
- Web Ingress must not own cluster state, node trust, policy, or secrets
- Admin pages may be dynamically composed from safe `ui_manifest` / page
  definitions
- dynamic page definitions must be schema-driven and must not contain internal
  topology, peer caches, route caches, secrets, raw credentials, or arbitrary
  executable code
- every cluster mutation still goes through Control Plane authorization,
  PostgreSQL source-of-truth mutation, and audit
- Fabric Storage / Config Storage role does not imply admin/web ingress role
- adding a storage node to a cluster must not move or create the cluster panel
  automatically

### Admin Endpoint Placement

Admin UI has three distinct scopes:

- Platform Owner Console: global platform-owner scope. It may aggregate
  visibility across clusters according to platform policy and audit.
- Cluster Admin Endpoint: cluster-local admin/web ingress service for one
  cluster. It is served only by nodes explicitly assigned an approved
  admin/web ingress role.
- Organization Admin Panel: tenant-safe projection over allowed organization
  resources, endpoints, policies, sessions, and safe status.

Storage nodes are configuration distribution/cache nodes. They are not browser
admin entry points and they do not own platform or cluster administration.

A cluster-local admin endpoint may become available only after:

- the cluster has healthy authority state
- scoped configuration snapshots are signed and current
- admin/web ingress role is explicitly assigned to authorized node(s)
- TLS/certificate policy is valid
- node health and heartbeat are current
- required role coverage exists for the intended cluster operations

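The availability preconditions for a cluster-local admin endpoint read naturally as an all-must-hold check. The sketch below models them as boolean flags; the `ClusterAdminReadiness` type and both function names are hypothetical, and how each flag is actually computed is out of scope here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClusterAdminReadiness:
    authority_healthy: bool
    snapshots_signed_and_current: bool
    admin_ingress_role_assigned: bool
    tls_policy_valid: bool
    node_health_current: bool
    required_role_coverage: bool


def unmet_preconditions(readiness):
    """Names of the preconditions that currently fail, in declaration order."""
    return [name for name, ok in asdict(readiness).items() if not ok]


def admin_endpoint_may_open(readiness):
    """The endpoint may become available only when nothing is unmet."""
    return not unmet_preconditions(readiness)
```

Returning the list of unmet items rather than a bare boolean gives an operator-facing page something concrete to display when the endpoint stays closed.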
During cluster split/seed workflows, a temporary node may participate in more
than one cluster only through isolated memberships, identities, certificates,
tokens, storage namespaces, and policies. Removing that seed node from the new
cluster is safe only after the new cluster has its own healthy node set,
snapshots, role coverage, and admin ingress.

### Platform Admin Console MVP

The first admin panel should be for platform owner/admin only.

Required pages:

- Overview
- Clusters
- Nodes
- Join Requests
- Node Detail
- Role Assignments
- Service Workloads
- Health and Alerts
- Audit
- Trust and Certificates

Overview should show:

- cluster count
- node count
- nodes by health
- pending join requests
- role coverage gaps
- service workload health
- risky states

Cluster page should show:

- cluster status
- node membership
- enabled services
- health state
- partition warnings
- role coverage

Node page should show:

- identity
- membership
- health
- capabilities
- assigned roles
- running services
- last heartbeat
- version
- update state
- audit timeline

Join Requests page should allow:

- review node facts
- compare requested capabilities
- approve
- reject
- revoke token

Role Assignments page should allow:

- assign role to node
- restrict role to organization if applicable
- disable role
- audit role changes

### Multi-Cluster Admin

Multi-cluster administration should be platform-owner only.

It should support:

- list clusters
- compare cluster health
- inspect node distribution
- inspect service placement
- inspect cluster partitions
- inspect capacity and load
- detect missing role coverage
- manage cluster-scoped trust

Important:
No organization should see the full multi-cluster topology.

Multi-cluster trust boundaries:

- platform may operate multiple clusters
- clusters do not automatically trust each other
- clusters do not form one shared mesh by default
- cross-cluster routing requires explicit trust and policy
- a node may participate in multiple clusters only through isolated memberships
- cluster-scoped identities, certificates, tokens, storage namespaces, and policies are required
- platform owner may aggregate visibility across clusters through audited control-plane views
- organization admins see only authorized clusters, resources, services, and safe status

### Platform Admin Access Security

Platform owner/admin access must be risk-based.

Trusted or accredited devices may receive lower-friction access, but unknown
devices require stronger checks.

Required controls:

- MFA / 2FA policy
- device trust state
- session risk controls
- step-up authentication for high-risk actions
- audit for all high-risk actions

High-risk actions:

- cluster trust changes
- node approval
- role assignment
- partition promotion
- cross-cluster trust
- secrets access
- update policy changes

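A rough sketch of how the risk-based controls could combine into one decision. The action identifiers, the function name, and the specific policy (MFA always required, step-up required for high-risk actions and for untrusted devices) are assumptions that interpret the lists above; they are not a specified policy.

```python
# Hypothetical identifiers for the high-risk actions listed above.
HIGH_RISK_ACTIONS = frozenset({
    "cluster_trust_change",
    "node_approval",
    "role_assignment",
    "partition_promotion",
    "cross_cluster_trust",
    "secrets_access",
    "update_policy_change",
})


def evaluate_admin_action(action, mfa_passed, device_trusted, step_up_passed):
    """Return (allowed, must_audit) for a platform-admin action."""
    high_risk = action in HIGH_RISK_ACTIONS
    if not mfa_passed:
        return False, high_risk
    # Assumed policy: step-up for high-risk actions and for unknown devices.
    needs_step_up = high_risk or not device_trusted
    allowed = step_up_passed or not needs_step_up
    return allowed, high_risk
```

In this sketch a denied high-risk attempt is still flagged for audit, so the audit trail covers attempts as well as completed actions.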
### Future Organization Admin Panel

Organization admin UI should come later.

It should show only safe tenant-scoped objects:

- organization resources
- organization users
- organization policies
- active sessions
- allowed entry points
- allowed egress/service endpoints
- safe VPN/connector status
- audit events for that organization

It must not show:

- full mesh topology
- other organizations
- internal core routing state
- platform-level node trust material
- unrelated cluster internals

## Service Visibility Rules
|
|
|
|
Organizations see only what policy allows:
|
|
|
|
- resources
|
|
- enabled services
|
|
- allowed ingress endpoints
|
|
- allowed egress/service endpoints
|
|
- safe status
|
|
|
|
Organizations do not see:
|
|
|
|
- intermediate core mesh nodes
|
|
- platform-owned private topology
|
|
- other tenants
|
|
- cluster-wide trust configuration
|
|
|
|
Different services may expose different ingress and egress points.
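
The visibility rules above amount to an allowlist projection: a tenant view copies only named safe fields and never filters topology out after the fact. The following sketch assumes a hypothetical in-memory `cluster_state` shape and hypothetical field names.

```python
# Hypothetical field names for the tenant-safe objects listed above.
TENANT_SAFE_FIELDS = {
    "resources", "enabled_services", "ingress_endpoints",
    "egress_endpoints", "safe_status",
}


def tenant_view(cluster_state: dict, organization_id: str) -> dict:
    """Project cluster state down to what one organization may see."""
    org_state = cluster_state.get("organizations", {}).get(organization_id, {})
    # Allowlist projection: anything not named above is never copied,
    # so new internal fields stay hidden by default.
    return {k: v for k, v in org_state.items() if k in TENANT_SAFE_FIELDS}


state = {
    "mesh_nodes": ["core-1", "core-2"],
    "organizations": {
        "org-a": {"resources": ["desktop-1"], "safe_status": "ok",
                  "relay_path": ["core-1"]},
        "org-b": {"resources": ["desktop-9"]},
    },
}
```

With this shape, `tenant_view(state, "org-a")` returns only `resources` and `safe_status`; the relay path, the mesh node list, and `org-b` never appear.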

## Partition and Split-Brain Behavior

Disconnected minority segments must not:

- approve join requests
- issue node certificates
- assign roles
- change cluster policy
- rotate trust roots

Disconnected segments may continue already-running services only when safe and policy allows it.

Only platform owner authority should promote, merge, split, or recover partitions.

This document does not define final consensus mechanics. It only establishes that cluster authority and partition behavior must be modeled before production mesh changes.
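
The partition rules can be expressed as a guard that runs before any authority-bearing operation. How a segment learns whether it holds cluster authority is deliberately left open by this document, so the sketch below takes it as an input; the operation names and the `policy_allows_degraded` flag are hypothetical.

```python
# Hypothetical names for the operations a disconnected minority must not perform.
AUTHORITY_OPERATIONS = {
    "approve_join_request", "issue_node_certificate", "assign_role",
    "change_cluster_policy", "rotate_trust_root",
}


def is_allowed(operation: str, has_cluster_authority: bool,
               policy_allows_degraded: bool = False) -> bool:
    """Decide whether a segment may perform an operation right now."""
    if has_cluster_authority:
        return True
    if operation in AUTHORITY_OPERATIONS:
        # A segment without authority never makes trust or policy decisions.
        return False
    if operation == "continue_running_service":
        # Already-running services may continue only when policy permits it.
        return policy_allows_degraded
    # Anything not explicitly permitted is denied while partitioned.
    return False
```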

## Implementation Stages

### Stage C1: Backend Cluster and Node Model Foundation

Status: implemented and verified. Report: `artifacts/c1-cluster-node-foundation-report.md`.

Goal:
Create the explicit backend data model and service boundaries for clusters, join requests, role assignments, node identity, heartbeats, and audit.

Allowed:

- migrations
- repositories
- service interfaces
- platform-admin-only API skeletons
- tests
- docs

Not allowed:

- mesh runtime
- VPN runtime
- node-agent runtime rewrite
- admin UI implementation
- RDP behavior changes

### Stage C2: Node Enrollment API

Status: implemented and verified. Report: `artifacts/c2-node-enrollment-hardening-report.md`.

Goal:
Implement production enrollment semantics.

Includes:

- join token creation
- join request creation
- approval/rejection
- certificate metadata boundary
- node activation/revocation
- audit

### Stage C3: Native Node-Agent MVP

Status: implemented and verified. Report: `artifacts/c3-rap-node-agent-mvp-report.md`.

Goal:
Create native `rap-node-agent` MVP.

Includes:

- enroll
- heartbeat
- capability report
- desired role polling
- service status report

No mesh packet routing yet.

### Stage C4: Platform Admin Console MVP

Status: implemented and build-verified. Report: `artifacts/c4-platform-admin-console-report.md`.

Goal:
Admin UI for platform operators.

Includes:

- clusters
- nodes
- join requests
- roles
- service health
- audit

### Stage C5: Service Workload Supervision Contract

Status: implemented and verified. Report: `artifacts/c5-service-workload-supervision-contract-report.md`.

Goal:
Node-agent can start, stop, and monitor service workloads based on role assignment.

C19A adds the first bounded live service-supervision runtime proof on top of
that contract: node-agent can read node-scoped desired workloads without an
operator actor id, report built-in `core-mesh` and `mesh-listener` as running,
report native built-in `synthetic.echo` as running, and keep unsupported
production workloads degraded instead of pretending that their adapters exist.
The live smoke is `scripts/fabric/c19a-service-workload-supervision-smoke.ps1`.

C19B adds the Remote Workspace/RDP adapter-contract bridge without enabling RDP
payload traffic. A native `rdp-worker` desired workload with
`adapter_contract_probe=true` reports the remote-workspace channel map,
requires Fabric Service Channel, and marks backend relay as not steady-state.
The live smoke is
`scripts/fabric/c19b-remote-workspace-adapter-contract-smoke.ps1`.

C19C wires Remote Workspace into service-channel lease issuance without
starting RDP traffic: route intents now accept `remote_workspace`, the lease
entry descriptor uses remote-workspace stream paths and frame-batch media type
instead of VPN packet paths, and the signed data-plane contract is present in
lease, authority payload, introspection, and lease maintenance. The live smoke
is `scripts/fabric/c19c-remote-workspace-service-channel-lease-smoke.ps1`.

C19D adds the Remote Workspace entry-node ingress skeleton. The node-agent
accepts a signed/introspected `remote_workspace` service-channel lease on
`remote-workspaces/{resource_id}/streams/{channel_class}`, validates service
class, channel class, selected entry node, and data-plane flow isolation, and
reports access telemetry. It intentionally returns a probe contract with
`payload_flow=not_implemented` for non-empty RDP payloads; this stage proves
the Fabric ingress contract without forwarding desktop frames yet. The live
smoke is `scripts/fabric/c19d-remote-workspace-entry-ingress-smoke.ps1`.

C19E adds the first Remote Workspace frame-batch contract probe across the
adapter/entry boundary. The `rdp-worker` adapter probe reports
`rap.remote_workspace_frame_batch.v1`; entry-node accepts only
`probe_only=true` frame batches, validates logical adapter channels and
directions, and returns `payload_flow=validated_probe_only`. Real desktop frame
delivery remains intentionally disabled until the service adapter runtime stage.
The live smoke is
`scripts/fabric/c19e-remote-workspace-frame-batch-contract-smoke.ps1`.

C19F adds the first local adapter-sink proof for that frame-batch contract.
Node-agent now keeps an in-memory `node_agent_rdp_worker_contract_probe` sink
for Remote Workspace frame probes and preserves it across mesh config refresh.
Entry-node delivers validated `probe_only=true` frame batches to that sink and
returns a `rap.remote_workspace_frame_batch_delivery.v1` receipt with
`payload_flow=delivered_probe_only`. This still does not enable production RDP
frame forwarding. The live smoke is
`scripts/fabric/c19f-remote-workspace-adapter-sink-smoke.ps1`.

C19G exposes the adapter-sink delivery proof through existing node-agent
visibility channels. The `rdp-worker` workload status payload now includes
`remote_workspace_adapter_sink`, and node telemetry includes
`remote_workspace_adapter_sink_report`, both carrying delivery count, latest
delivery sequence, channel class, frame count, and the probe-only/no-payload
boundary. The live smoke is
`scripts/fabric/c19g-remote-workspace-adapter-sink-telemetry-smoke.ps1`.

C19H locks down the Remote Workspace frame-batch guardrails before real adapter
runtime work begins. Unit and live smoke coverage now proves that entry-node
rejects `probe_only=false`, unknown logical channels, invalid channel
directions, service-class mismatch, channel-class mismatch, and unsupported
payload encoding, and that rejected batches do not produce adapter delivery.
The live smoke is
`scripts/fabric/c19h-remote-workspace-frame-guardrails-smoke.ps1`.
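
The C19H rejection cases can be pictured as a single validation pass that returns the first violated rule. This is a sketch under stated assumptions: the batch and lease field names, the channel map, the supported encoding, and the reason strings are all hypothetical and are not taken from the node-agent.

```python
from typing import Optional

# Hypothetical channel map: logical channel -> the only direction it may flow.
CHANNEL_DIRECTIONS = {"display": "adapter_to_client", "input": "client_to_adapter"}
# Hypothetical set of payload encodings accepted in the probe-only stage.
SUPPORTED_ENCODINGS = {"probe"}


def rejection_reason(batch: dict, lease: dict) -> Optional[str]:
    """Return why a frame batch is rejected, or None when it may be delivered."""
    if batch.get("probe_only") is not True:
        return "probe_only_required"
    if batch.get("service_class") != lease["service_class"]:
        return "service_class_mismatch"
    if batch.get("channel_class") != lease["channel_class"]:
        return "channel_class_mismatch"
    if batch.get("payload_encoding") not in SUPPORTED_ENCODINGS:
        return "unsupported_payload_encoding"
    for frame in batch.get("frames", []):
        expected = CHANNEL_DIRECTIONS.get(frame.get("channel"))
        if expected is None:
            return "unknown_logical_channel"
        if frame.get("direction") != expected:
            return "invalid_channel_direction"
    return None
```

The caller would hand the batch to the adapter sink only when the result is `None`, which keeps rejected batches from producing any delivery.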

C19I adds the first bounded adapter handoff queue/ack proof for the same
probe-only path. The local `node_agent_rdp_worker_contract_probe` sink reports
queue capacity/depth plus accepted, dropped, and acked frame counts: with
capacity `8`, droppable display overflow accepts/acks `8` frames and drops `3`,
while reliable input overflow is rejected with backpressure and no delivery
receipt. The boundary still carries `payload_traffic=none`; this is queue
semantics for the future adapter runtime, not real RDP payload forwarding. The
live smoke is
`scripts/fabric/c19i-remote-workspace-adapter-queue-smoke.ps1`.
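
The two overflow behaviours described for C19I can be reproduced with a small in-memory model. It assumes an 11-frame droppable batch (the 8 accepted plus 3 dropped frames reported above) and assumes the excess frames at the tail of the batch are the ones dropped; the class and field names are hypothetical.

```python
from collections import deque


class AdapterQueue:
    """Hypothetical bounded handoff queue with droppable and reliable batches."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.frames = deque()
        self.accepted = 0
        self.dropped = 0
        self.acked = 0
        self.backpressure_count = 0

    def offer(self, frames: list, reliable: bool) -> dict:
        free = self.capacity - len(self.frames)
        if reliable and len(frames) > free:
            # Reliable traffic is never partially accepted: reject the whole
            # batch with backpressure and produce no delivery receipt.
            self.backpressure_count += 1
            return {"delivered": False, "reason": "backpressure"}
        taken = frames[:free]
        overflow = len(frames) - len(taken)
        self.frames.extend(taken)
        self.accepted += len(taken)
        self.dropped += overflow  # droppable overflow is counted, not queued
        return {"delivered": True, "accepted": len(taken), "dropped": overflow}

    def ack_all(self) -> int:
        count = len(self.frames)
        self.frames.clear()
        self.acked += count
        return count
```

Offering 11 droppable frames to an empty queue of capacity 8 yields 8 accepted and 3 dropped; after acking, a reliable batch of 9 frames is refused and only `backpressure_count` changes.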

C19J makes those queue/backpressure signals operationally visible. The
`remote_workspace_adapter_sink` workload status payload and
`remote_workspace_adapter_sink_report` telemetry now include current queue
capacity/depth, cumulative accepted/dropped/acked frame counters,
`backpressure_count`, and the latest rejected batch metadata/reason. The live
smoke first produces the C19I droppable overflow plus reliable backpressure,
then waits until both workload status and telemetry show the delivery, dropped
total, and backpressure increment. The live smoke is
`scripts/fabric/c19j-remote-workspace-adapter-queue-telemetry-smoke.ps1`.

C19K introduces the probe-only adapter session boundary. Entry-node derives a
stable `adapter_session_id` from the service-channel lease/resource/route
context and passes it to the local `rdp-worker` adapter probe sink. Delivery
receipts, workload status, and telemetry now include `adapter_session_id`,
`adapter_runtime_id=node_agent_rdp_worker_contract_probe`, and
`session_state=probe_bound`, and rejected/backpressured batches retain the same
session identity. This is still not real RDP payload forwarding; it binds the
queue/ack/backpressure model to the future per-session adapter runtime. The
live smoke is
`scripts/fabric/c19k-remote-workspace-adapter-session-boundary-smoke.ps1`.
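
A stable identifier of this kind is commonly produced by hashing the context fields in a fixed order. The sketch below shows that general technique only; the actual derivation used by entry-node is not specified in this document, and the `ras-` prefix, the separator, the truncation length, and the parameter names are assumptions.

```python
import hashlib


def derive_adapter_session_id(lease_id: str, resource_id: str, route_id: str) -> str:
    """Derive a deterministic session id from lease/resource/route context.

    Hypothetical scheme: the same inputs always produce the same id, so
    accepted and rejected batches for one lease share a session identity.
    """
    # A separator that cannot appear in normal ids avoids ambiguous joins
    # such as ("ab", "c") versus ("a", "bc").
    material = "\x1f".join((lease_id, resource_id, route_id)).encode("utf-8")
    return "ras-" + hashlib.sha256(material).hexdigest()[:32]
```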

C19L adds the first lifecycle model to that probe-only adapter session. The
node-agent sink now tracks active sessions in memory with created/bound totals,
last activity timestamps, per-session delivery/backpressure/frame counters,
`current_session_lifecycle_state`, and idle expiry counters. A successful
droppable overflow binds the session as `probe_bound`; a reliable overflow keeps
the same `adapter_session_id` and moves the lifecycle state to `backpressure`
for diagnosis. Receipts expose session created/bound/last-activity timestamps
and per-session counters while `payload_traffic=none` remains enforced. The
live smoke is
`scripts/fabric/c19l-remote-workspace-adapter-session-lifecycle-smoke.ps1`.

C19M adds explicit probe-only adapter-session control. Node-agent exposes
`POST /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/control`
with `close`, `expire`, and `reset` actions, returning
`rap.remote_workspace_adapter_session_control.v1`. Workload status and telemetry
now include `session_control_total`, `session_closed_total`,
`session_reset_total`, and the latest control action/session/state, so sessions
can be ended deliberately instead of only by idle TTL. The live smoke creates a
Remote Workspace adapter session, closes it through the mesh control endpoint,
and waits until workload status and telemetry expose the close. The live smoke
is
`scripts/fabric/c19m-remote-workspace-adapter-session-control-smoke.ps1`.

C19N locks down the adapter-session control guardrails. Control requests now
reject unsupported actions, invalid `adapter_session_id` values, malformed JSON,
unknown active/terminal sessions, and overlong reasons without creating hidden
session state. Repeating `close` against an already closed terminal session is
idempotent: it reports `previous_state=closed` and does not increment
`session_closed_total` again, while still counting the control observation. The
live smoke verifies the negative cases plus first/repeated close visibility in
workload status and telemetry. The live smoke is
`scripts/fabric/c19n-remote-workspace-adapter-session-control-guardrails-smoke.ps1`.
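
The idempotent `close` behaviour described for C19N can be sketched as follows. The counter names match the ones listed above; the class, the response shape, and the choice not to count rejected requests in `session_control_total` are assumptions made for illustration.

```python
class SessionControl:
    """Hypothetical in-memory model of probe-only adapter-session control."""

    def __init__(self):
        self.states = {}  # adapter_session_id -> "probe_bound" or "closed"
        self.session_control_total = 0
        self.session_closed_total = 0

    def close(self, adapter_session_id: str) -> dict:
        if adapter_session_id not in self.states:
            # Unknown sessions are rejected without creating hidden state.
            return {"ok": False, "error": "unknown_session"}
        previous = self.states[adapter_session_id]
        # Every accepted control request is observed, including repeats.
        self.session_control_total += 1
        if previous != "closed":
            self.states[adapter_session_id] = "closed"
            # Only the first transition into "closed" counts as a close.
            self.session_closed_total += 1
        return {"ok": True, "previous_state": previous, "state": "closed"}
```

Closing the same session twice reports `previous_state` as `probe_bound` and then `closed`, with `session_closed_total` staying at 1 and `session_control_total` reaching 2.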

C19O adds an immediate read-only adapter-session snapshot endpoint:
`GET /mesh/v1/remote-workspace/adapter-sessions?include_terminal=true&limit=N`.
It returns `rap.remote_workspace_adapter_session_snapshot.v1` with active
sessions, terminal sessions when requested, per-session lifecycle state,
activity/backpressure timestamps, frame counters, and runtime identity. This
lets operators inspect adapter-session state directly from node-agent without
waiting for heartbeat, workload status, or telemetry propagation. The live smoke
checks active-session visibility, close transition into terminal snapshot, and
invalid snapshot limit rejection. The live smoke is
`scripts/fabric/c19o-remote-workspace-adapter-session-snapshot-smoke.ps1`.

C19P adds the first adapter-runtime handoff mailbox contract. Each active
probe-only adapter session now owns a bounded in-memory mailbox that receives
`frame_batch_probe_delivered` and `backpressure` events with frame counts,
channel/resource/route context, and sequence numbers. Node-agent exposes
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox`
with optional `drain=true`, and session snapshots/workload reports expose
mailbox depth/enqueued/drained/dropped counters. This is the handoff surface a
real `rdp-worker` runtime can consume next; payload forwarding is still disabled.
The live smoke verifies read, drain, post-drain empty state, and snapshot
counters. The live smoke is
`scripts/fabric/c19p-remote-workspace-adapter-runtime-mailbox-smoke.ps1`.

C19Q hardens the mailbox handoff. Invalid IDs, unknown sessions, and invalid
limits are rejected before state mutation, and bounded `drain=true&limit=N`
reads remove only the returned event slice while preserving remaining depth for
the next poll. The bounded mailbox drops oldest events once capacity is reached,
and a closed adapter session no longer exposes an active runtime mailbox. The
live smoke verifies negative cases, drop-oldest pressure, partial drain, and
closed-session rejection. The live smoke is
`scripts/fabric/c19q-remote-workspace-adapter-mailbox-guardrails-smoke.ps1`.
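
Drop-oldest pressure and partial drain, as described for C19Q, can be modeled with a bounded deque. This is an illustrative sketch: the `Mailbox` class, its counters, and the event shape are hypothetical stand-ins for the node-agent's in-memory mailbox.

```python
from collections import deque


class Mailbox:
    """Hypothetical bounded per-session mailbox with drop-oldest semantics."""

    def __init__(self, capacity: int):
        self.events = deque(maxlen=capacity)
        self.next_sequence = 1
        self.enqueued = 0
        self.dropped = 0
        self.drained = 0

    def enqueue(self, event_type: str) -> None:
        if len(self.events) == self.events.maxlen:
            # deque(maxlen=...) discards the oldest entry on append.
            self.dropped += 1
        self.events.append({"sequence": self.next_sequence, "type": event_type})
        self.next_sequence += 1
        self.enqueued += 1

    def read(self, limit: int, drain: bool = False) -> list:
        if limit < 1:
            # Invalid limits are rejected before any state is touched.
            raise ValueError("limit must be >= 1")
        batch = list(self.events)[:limit]
        if drain:
            # Remove only the returned slice; the rest waits for the next poll.
            for _ in batch:
                self.events.popleft()
            self.drained += len(batch)
        return batch
```

With capacity 3 and five enqueued events, the mailbox holds sequences 3 to 5; a `drain` read with `limit=2` returns 3 and 4 and leaves 5 for the next poll.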

C19R adds bounded long-poll ergonomics to the same node-local mailbox endpoint.
`wait_ms` lets an adapter runtime wait briefly for the next event without hot
polling, and responses make empty/timeout state explicit with `empty`,
`waited`, `wait_timeout`, and `wait_ms`. The live smoke proves empty timeout and
wake-on-delayed-event behavior while keeping the path probe-only. The live smoke
is `scripts/fabric/c19r-remote-workspace-mailbox-long-poll-smoke.ps1`.

C19S makes mailbox consumer behavior visible in diagnostics. Workload status and
node telemetry now expose `mailbox_read_total`, `mailbox_wait_total`,
`mailbox_wait_timeout_total`, `mailbox_empty_read_total`, and last mailbox read
metadata; active session snapshots carry the same per-session counters while a
session remains active. The live smoke proves C19R traffic is reflected in both
workload status and telemetry. The live smoke is
`scripts/fabric/c19s-remote-workspace-mailbox-telemetry-smoke.ps1`.

C19T adds the node-local consumer cursor contract for that mailbox. Consumers
can pass `consumer_id` plus optional `ack_sequence` to receive explicit
checkpoint, ack, lag, read, and ack counters without draining mailbox state.
The probe sink stores bounded per-session consumer state and reports aggregate
and current-session consumer telemetry through workload status and heartbeat
telemetry. The live smoke is
`scripts/fabric/c19t-remote-workspace-mailbox-consumer-checkpoint-smoke.ps1`.

C19U adds lifecycle visibility and reset guardrails to the same cursor state.
Mailbox consumers can pass `reset_consumer=true` with a valid `consumer_id` to
clear their checkpoint/ack state before the current read is recorded. Mailbox
responses now expose consumer count/capacity, created/reset/evicted flags, and
consumer timestamps, while diagnostics add reset and eviction counters. The
live smoke is
`scripts/fabric/c19u-remote-workspace-mailbox-consumer-lifecycle-smoke.ps1`.

C19V adds read-only inspection for active mailbox consumer cursors. The
node-local
`GET /mesh/v1/remote-workspace/adapter-sessions/{adapter_session_id}/mailbox/consumers`
endpoint returns bounded cursor snapshots with consumer ids, checkpoint and ack
sequences, lag, totals, and timestamps. It is verified as read-only: inspection
does not increment mailbox reads, ack totals, reset counters, or drain mailbox
events. The live smoke is
`scripts/fabric/c19v-remote-workspace-mailbox-consumer-snapshot-smoke.ps1`.

C19W adds cursor-aware resume reads to the mailbox endpoint. Consumers can pass
`after_sequence` to receive only mailbox events newer than their checkpoint;
responses include `skipped_count` and `returned_count`, and long-poll waits for
newer-than-checkpoint events. The endpoint rejects `after_sequence` with
`drain=true`, preserving the non-destructive resume contract. The live smoke is
`scripts/fabric/c19w-remote-workspace-mailbox-after-sequence-smoke.ps1`.

C19X adds consumer-aware resume convenience. Mailbox reads with `consumer_id`
can pass `resume_from=ack` or `resume_from=checkpoint`; the node-agent resolves
the stored cursor to `after_sequence` before reading and returns
`resume_from`/`resume_sequence` in the response. The guardrails reject mixing
resume with manual `after_sequence`, drain, reset, missing consumers, or invalid
cursor names. The live smoke is
`scripts/fabric/c19x-remote-workspace-mailbox-consumer-resume-smoke.ps1`.
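
The C19W and C19X resume reads reduce to two steps: resolve a stored cursor to a sequence number, then return only events newer than it. The sketch below assumes `skipped_count` counts the mailbox events at or below the cursor, and uses hypothetical consumer fields `ack_sequence` and `checkpoint_sequence`.

```python
def read_after(events: list, after_sequence: int, limit: int) -> dict:
    """Non-destructive read of events newer than a cursor."""
    newer = [e for e in events if e["sequence"] > after_sequence]
    returned = newer[:limit]
    return {
        "events": returned,
        "skipped_count": len(events) - len(newer),
        "returned_count": len(returned),
    }


def resolve_resume(consumer, resume_from, *, after_sequence=None,
                   drain=False, reset_consumer=False) -> int:
    """Turn resume_from=ack|checkpoint into an after_sequence value."""
    if resume_from not in ("ack", "checkpoint"):
        raise ValueError("invalid cursor name")
    if after_sequence is not None or drain or reset_consumer:
        # Resume is exclusive with manual cursors and mutating options.
        raise ValueError("resume_from cannot be combined with other cursor options")
    if consumer is None:
        raise LookupError("unknown consumer")
    key = "ack_sequence" if resume_from == "ack" else "checkpoint_sequence"
    return consumer[key]
```

A consumer with `ack_sequence=2` and `checkpoint_sequence=4` that resumes from `ack` against events 1 to 5 would skip two events and receive three.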

C19Y adds resume telemetry to workload status and heartbeat reports. Operators
can now see resume read totals, after-sequence read totals, returned/skipped
totals, and the last resume cursor, sequence, consumer, returned count, and
skipped count. Session snapshots also expose per-session resume counters. The
live smoke is
`scripts/fabric/c19y-remote-workspace-mailbox-resume-telemetry-smoke.ps1`.

C19Z adds adapter-runtime readiness diagnostics. Sink reports now include
`adapter_runtime_readiness`, a compact probe-only object with ready status,
diagnostic state, session lifecycle, mailbox depth, consumer cursor, resume
cursor, lag, and returned/skipped counts. The live smoke is
`scripts/fabric/c19z-remote-workspace-adapter-readiness-smoke.ps1`.

C19Z1 adds read-only handoff preflight for mailbox consumers. The endpoint
`/mailbox/preflight` accepts `consumer_id` and `resume_from=ack|checkpoint`,
then reports the expected next event window without mailbox reads, drains, acks,
or consumer cursor mutation. The live smoke is
`scripts/fabric/c19z1-remote-workspace-mailbox-preflight-smoke.ps1`.

Includes:

- container/native workload contract
- service instance status
- version reporting
- logs/health hooks

### Stage C6: Mesh Control-Plane Preparation

Status: implemented and verified. Report: `artifacts/c6-mesh-control-plane-preparation-report.md`.

Goal:
Prepare mesh routing state without carrying production traffic.

Includes:

- link inventory
- route intent
- node reachability
- QoS policy model

### Stage C7: Mesh MVP

Status: skeleton implemented and verified. Report: `artifacts/c7-mesh-mvp-skeleton-report.md`.

Goal:
First secure node-to-node overlay path.

Includes:

- mTLS node-to-node
- basic relay skeleton
- route health
- no VPN yet unless separately approved

### Stage C8: Multi-Cluster Hardening

Status: implemented and verified. Report: `artifacts/c8-multi-cluster-hardening-report.md`.

Goal:
Safe multi-cluster operations.

Includes:

- cluster authority state
- partition visibility
- cross-cluster admin views
- trust separation

### Stage C9: Organization Admin Console

Status: foundation implemented and verified. Report: `artifacts/c9-organization-admin-foundation-report.md`.

Goal:
Tenant-safe admin UI.

Includes:

- resources
- policies
- sessions
- allowed service endpoints
- organization audit

### Stage C10: Fabric Core Documentation and Config Distribution Design

Status: completed as documentation/planning stage. Report:
`artifacts/c10-fabric-core-config-distribution-design-report.md`.

Goal:
Consolidate Fabric Core terminology and prepare scoped cluster configuration
distribution design.

Scope:

- define signed scoped cluster snapshot model boundaries
- define node-local state boundaries
- define peer directory/cache boundaries
- define Fabric Storage / Config Storage role and non-goals
- define PostgreSQL source-of-truth versus distribution/cache boundaries
- define multi-cluster isolation boundaries
- define future implementation stages C11-C19

C10 prepares C11, C12, and C13. It did not implement mesh routing, VPN/IP
tunnel runtime, relay packet forwarding, RDP work, code, API changes, or
service workload execution.

### Stage C11: Signed Scoped Cluster Snapshot Model

Status: completed as documentation/planning stage. Report:
`artifacts/c11-signed-scoped-cluster-snapshot-model-report.md`.

Goal:
Define signed, versioned, scoped snapshots for node-local operation and
degraded-mode recovery.

### Stage C12: Node Local State Store

Status: completed as documentation/planning stage. Report:
`artifacts/c12-node-local-state-store-report.md`.

Goal:
Define bounded local state storage for identity, cluster membership, peer
cache, route cache, service assignment cache, health, partition/degraded state,
config versions, and pending update metadata.

### Stage C13: Config / Storage Service Foundation

Status: completed as documentation/planning stage. Report:
`artifacts/c13-fabric-storage-config-service-report.md`.

Goal:
Define Fabric Storage Service / Config Storage Service boundaries, replication
scope, failure-domain rules, and source-of-truth relationship to PostgreSQL.

### Stage C14: Peer Directory and Cache Model

Status: completed as documentation/planning stage. Report:
`artifacts/c14-peer-directory-cache-model-report.md`.

Goal:
Define peer directory shape, node-local peer cache, refresh cadence,
score-based peer selection inputs, and degraded recovery order.

### Stage C15: Fabric Routing Engine Skeleton

Status: completed as documentation/planning stage. Report:
`artifacts/c15-fabric-routing-engine-skeleton-report.md`.

Goal:
Introduce a safe compile/runtime boundary for future route selection without
carrying production mesh traffic.

### Stage C16: Secure Node-to-Node Channel Lifecycle

Status: completed as documentation/planning stage. Report:
`artifacts/c16-secure-node-to-node-channel-lifecycle-report.md`.

Goal:
Define authenticated node-to-node channel lifecycle, certificate validation,
health, reconnection, and channel authorization.

### Stage C17: Mesh Routing Runtime

Status: planning completed. Report:
`artifacts/c17-mesh-routing-runtime-implementation-plan-report.md`.

- C17A synthetic runtime skeleton is implemented and test-proven. Report:
  `artifacts/c17a-synthetic-mesh-runtime-skeleton-report.md`.
- C17B route health and failover probe skeleton is implemented and test-proven.
  Report: `artifacts/c17b-route-health-failover-probes-report.md`.
- C17C relay semantic hardening skeleton is implemented and test-proven. Report:
  `artifacts/c17c-relay-semantic-hardening-report.md`.
- C17D non-production test-service path experiment is implemented and
  test-proven. Report:
  `artifacts/c17d-non-production-test-service-path-report.md`.
- C17E live node-to-node synthetic HTTP transport skeleton is implemented and
  smoke-proven. Report:
  `artifacts/c17e-live-node-to-node-synthetic-transport-report.md`.
- C17F scoped synthetic peer/route config loading and route-health reporting is
  implemented and smoke-proven. Report:
  `artifacts/c17f-scoped-synthetic-route-config-report.md`.
- C17G Control Plane scoped synthetic config read boundary is implemented and
  test-proven. Report:
  `artifacts/c17g-control-plane-scoped-synthetic-config-report.md`.

Goal:
Define the mesh routing runtime implementation plan only after identity,
enrollment, scoped config, local state, storage, peer cache, routing skeleton,
and secure node-to-node channel lifecycle stages are accepted.

First implementation step was C17A: synthetic fabric messages only,
feature-flagged, test topology only, no RDP/VPN/service traffic, and no
topology exposure to organizations.

C17D proved a bounded `synthetic.echo` test-service path over direct,
single-relay, and forced fallback routes. It does not authorize production
service traffic.

C17E proved the same synthetic-only route model over real local HTTP node
endpoints using a disabled-by-default `rap-node-agent` endpoint and a
`mesh-live-smoke` harness. It still does not authorize production mesh traffic,
RDP/VPN traffic, service workload traffic, or organization-visible topology.

C17F moved the C17E synthetic route model from manual debug JSON toward scoped
configuration by adding a node-local scoped config file boundary and synthetic
route-health reporting to the Control Plane mesh link endpoint. It still does
not authorize production mesh traffic.

C17G added the Control Plane read endpoint for node-scoped synthetic mesh config
and node-agent consumption of that config. It still does not authorize
production mesh traffic.

### Stage C18: VPN / IP Tunnel Service

Status: completed as documentation/planning stage. Report:
`artifacts/c18-vpn-ip-tunnel-service-target-design-report.md`.

Goal:
Define VPN/IP tunnel service as a cluster-managed service above Fabric Core and
Fabric Routing, not as node-local ad hoc configuration.

Result:

- target design: `docs/architecture/VPN_IP_TUNNEL_SERVICE_TARGET.md`
- defines `vpn_connection` as logical desired state
- defines single-active lease/fencing model
- defines control-plane versus data-plane boundary
- defines routing policy, QoS, security, credential distribution, audit, and
  failure behavior
- defines future C18A-C18I stages

C18 did not implement VPN/IP tunnel runtime, TUN/TAP devices, packet routing,
mesh production traffic, RDP work, API changes, migrations, or service workload
execution.

### Stage C18A: VPN / IP Tunnel Control-Plane Data Model

Status: implemented and backend-test-proven. Report:
`artifacts/c18a-vpn-control-plane-data-model-report.md`.

Goal:
Add the durable control-plane source-of-truth model for future VPN/IP tunnel
service desired state without implementing VPN runtime.

Completed:

- `vpn_connections`
- `vpn_connection_allowed_nodes`
- `vpn_connection_route_policies`
- `vpn_connection_leases`
- backend models, repository methods, service methods, and platform-admin API
  skeleton
- single-active lease boundary with PostgreSQL unique active lease protection
- recovery-admin-only lease fencing boundary
- tests for authorization, scope, desired-state gating, duplicate active lease
  rejection, route policy validation, and allowed-node normalization

C18A did not implement VPN/IP tunnel runtime, TUN/TAP devices, packet routing,
host route/firewall manipulation, node-agent execution, production mesh
traffic, RDP work, Windows client changes, or data-plane behavior changes.

### Stage C18B: VPN / IP Tunnel Lease and Fencing Hardening

Status: implemented and backend-test-proven. Report:
`artifacts/c18b-vpn-lease-fencing-hardening-report.md`.

Goal:
Harden the future VPN/IP tunnel single-active control-plane ownership model
before any node-agent executes VPN work.

Completed:

- owner eligibility validation against active cluster membership
- owner validation against `vpn_connection` allowed-node policy
- owner validation against active `vpn-exit` or `vpn-connector` role assignment
- same-owner acquire idempotency
- different-owner active lease rejection
- idempotent release/fence behavior
- stale lease cleanup/reclamation service boundary
- audit events for lease renewal and stale lease expiration
- tests for wrong cluster, unauthorized owner, missing role, disabled
  connection, expired lease renewal, duplicate active lease, and stale lease
  audit behavior
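
The single-active ownership rules in the list above can be sketched as an in-memory model. The real boundary is enforced in PostgreSQL according to C18A; the classes, record shapes, and result strings below are hypothetical and only illustrate eligibility checks, same-owner idempotency, different-owner rejection, and idempotent release.

```python
VPN_ROLES = {"vpn-exit", "vpn-connector"}


def is_eligible(node: dict, connection: dict) -> bool:
    """Owner eligibility: active membership, allowed-node policy, VPN role."""
    return (
        node["membership"] == "active"
        and node["cluster_id"] == connection["cluster_id"]
        and node["id"] in connection["allowed_nodes"]
        and bool(VPN_ROLES & set(node["roles"]))
    )


class LeaseTable:
    """Hypothetical in-memory stand-in for the active-lease table."""

    def __init__(self):
        self.active = {}  # connection id -> owning node id

    def acquire(self, connection: dict, node: dict) -> str:
        if not is_eligible(node, connection):
            return "rejected_ineligible"
        owner = self.active.get(connection["id"])
        if owner is None:
            self.active[connection["id"]] = node["id"]
            return "acquired"
        if owner == node["id"]:
            return "already_held"  # same-owner acquire is idempotent
        return "rejected_active_lease"  # at most one active owner

    def release(self, connection: dict, node: dict) -> str:
        if self.active.get(connection["id"]) == node["id"]:
            del self.active[connection["id"]]
            return "released"
        return "noop"  # releasing a lease that is not held changes nothing
```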

C18B did not implement VPN/IP tunnel runtime, TUN/TAP devices, host
route/firewall manipulation, node-agent VPN execution, production mesh traffic,
RDP work, Windows client changes, or data-plane behavior changes.

### Stage C18C: VPN / IP Tunnel Node-Agent Desired-State Consumption

Status: implemented and backend-test-proven. Report:
`artifacts/c18c-vpn-node-agent-desired-state-report.md`.

Goal:
Expose scoped VPN/IP tunnel desired assignments to eligible node-agents and
allow node-agents to report observed assignment status without executing VPN
runtime.

Completed:

- node-agent desired-state API for scoped VPN assignments
- node-agent assignment status report API
- explicit observed status values: `not_started`, `assigned`,
  `lease_required`, `blocked`, `unknown`
- PostgreSQL assignment status report and latest-status tables
- backend service/repository boundaries for scoped assignment visibility
- assignment visibility limited to eligible candidates or current active owner
- `credential_ref` withheld from node-agent assignment payloads; only
  `has_credential_ref` is exposed for future resolver integration
- tests for unauthorized/invisible assignment rejection, allowed status values,
  invalid status rejection, and non-platform-admin node-agent read path
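
Two of the completed items above lend themselves to a short sketch: withholding `credential_ref` from the node-agent payload and validating observed status values. The function names and the connection record shape are hypothetical; `credential_ref`, `has_credential_ref`, and the status values are taken from the list.

```python
ALLOWED_STATUSES = {"not_started", "assigned", "lease_required", "blocked", "unknown"}


def node_assignment_payload(connection: dict) -> dict:
    """Build the node-agent view of an assignment without the credential."""
    payload = {k: v for k, v in connection.items() if k != "credential_ref"}
    # The node learns only that a credential exists, never its reference.
    payload["has_credential_ref"] = bool(connection.get("credential_ref"))
    return payload


def validate_observed_status(status: str) -> str:
    """Accept only the explicit observed status values."""
    if status not in ALLOWED_STATUSES:
        raise ValueError("invalid observed status: " + repr(status))
    return status
```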
|
|
|
|
C18C did not implement VPN/IP tunnel runtime, TUN/TAP devices, host
|
|
route/firewall/DNS/QoS manipulation, credential delivery, node-agent VPN
|
|
execution, production mesh traffic, RDP work, Windows client changes, or
|
|
data-plane behavior changes.
|
|
|
|
### Stage C19: Version Storage and Node Update / Rollback Foundation
|
|
|
|
Status: future stage. Architecture direction is documented in this file and
|
|
`docs/architecture/SECURE_ACCESS_FABRIC_TARGET.md`. No updater runtime is
|
|
implemented by this documentation.
|
|
|
|
Goal:
|
|
Define the durable release, artifact, node-agent update, health supervision,
|
|
rollback, and data-structure migration model before any production update
|
|
runtime is enabled.
|
|
|
|
Scope:

- define `version-storage` / `update-repository` as the signed release
  artifact source for node-agent and service workloads
- define release channels: `stable`, `current`, and `candidate`
- define OS / architecture artifact variants under a single signed release
  manifest
- define last-known-good stable release tracking and rollback eligibility
- define update-cache mirroring of approved artifacts without becoming a
  source of truth
- define node-agent download, verification, staging, restart, health-check,
  promote, rollback, and failure-reporting states
- define migration bundle metadata for PostgreSQL schema, service-local data,
  node-local state, config/cache shards, and protocol/config schema changes
- define rollout policy boundaries for canary, staged rollout, pause, resume,
  rollback, blocklist, and emergency pinning

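One possible reading of the node-agent update states named in the scope above is a small state machine. The sketch below is illustrative only: the state names follow the scope item, but the transition tables (including the assumption that a completed rollback still ends in a failure report) and the `next_state` / `run` helpers are hypothetical, since C19 has not yet defined them.

```python
from enum import Enum


class UpdateState(Enum):
    DOWNLOAD = "download"
    VERIFY = "verification"
    STAGE = "staging"
    RESTART = "restart"
    HEALTH_CHECK = "health-check"
    PROMOTE = "promote"
    ROLLBACK = "rollback"
    REPORT_FAILURE = "failure-reporting"


S = UpdateState

# Assumed transitions: failures before the restart only need a report,
# failures after the restart must roll back to the last-known-good release.
ON_SUCCESS = {
    S.DOWNLOAD: S.VERIFY,
    S.VERIFY: S.STAGE,
    S.STAGE: S.RESTART,
    S.RESTART: S.HEALTH_CHECK,
    S.HEALTH_CHECK: S.PROMOTE,
    S.ROLLBACK: S.REPORT_FAILURE,
}
ON_FAILURE = {
    S.DOWNLOAD: S.REPORT_FAILURE,
    S.VERIFY: S.REPORT_FAILURE,
    S.STAGE: S.REPORT_FAILURE,
    S.RESTART: S.ROLLBACK,
    S.HEALTH_CHECK: S.ROLLBACK,
    S.ROLLBACK: S.REPORT_FAILURE,
}
TERMINAL = {S.PROMOTE, S.REPORT_FAILURE}


def next_state(current: UpdateState, ok: bool) -> UpdateState:
    """Return the state that follows `current` given the step outcome."""
    if current in TERMINAL:
        raise ValueError(f"{current.value} is terminal")
    return (ON_SUCCESS if ok else ON_FAILURE)[current]


def run(outcomes):
    """Replay a sequence of step outcomes and return the visited states."""
    state = S.DOWNLOAD
    trace = [state]
    for ok in outcomes:
        if state in TERMINAL:
            break
        state = next_state(state, ok)
        trace.append(state)
    return trace
```

For example, `run([True, True, True, True, False, True])` passes through download, verification, staging, and restart, fails the health check, rolls back, and ends in failure reporting.
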
Non-goals:

- no production updater runtime
- no automatic schema migration execution by node-agent for PostgreSQL
- no unsigned or unapproved artifacts
- no arbitrary code download
- no mesh, VPN/IP tunnel, RDP, client, or data-plane behavior changes

Rules:

- PostgreSQL remains authoritative for rollout policy, approvals, desired
  versions, channel assignment, migration approval, and audit.
- Version Storage stores immutable signed artifacts, release manifests,
  compatibility metadata, hashes, signatures, provenance, and migration
  bundles.
- Update-cache nodes only mirror approved artifacts and may be invalidated or
  rebuilt from Version Storage.
- Node-agent executes only signed, approved, scoped work assigned by policy.
- Node-agent must stop, restart, rollback, or fence local workloads according
  to health checks and lease/policy state.
- Node-agent self-update requires stricter safeguards than workload updates:
  staged replacement, watchdog, crash-safe rollback, and preservation of
  control-plane/update-source reachability unless an approved break-glass
  procedure exists.
- PostgreSQL schema migrations are orchestrated by the Control Plane release
  process, not independently invented by a node.
- Service-local and node-local migrations may be executed by node-agent only
  when declared in a signed release manifest and scoped to that node/workload.

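The "signed, approved, scoped" rule above implies a verification gate before any artifact is staged. The sketch below shows one way such a gate could be ordered, under loud assumptions: the manifest layout (`version`, `artifacts`, `sha256` keys), the `os/arch` variant key format, and the helper names are hypothetical, and HMAC-SHA256 with a shared key stands in for the real release signature scheme, which this document does not specify.

```python
import hashlib
import hmac
import json


class VerificationError(Exception):
    """Raised when an artifact must not be staged."""


def canonical_bytes(manifest: dict) -> bytes:
    # Deterministic serialization so signer and verifier hash the same bytes.
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_manifest(manifest: dict, key: bytes) -> str:
    # Stand-in only: a production design would likely use an asymmetric signature.
    return hmac.new(key, canonical_bytes(manifest), hashlib.sha256).hexdigest()


def verify_artifact(manifest, signature, key, approved_versions, variant, artifact):
    """Check signature, approval, variant, and hash, in that order."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        raise VerificationError("manifest signature mismatch")
    if manifest["version"] not in approved_versions:
        raise VerificationError("version not approved by policy")
    entry = manifest["artifacts"].get(variant)
    if entry is None:
        raise VerificationError(f"no artifact variant for {variant}")
    digest = hashlib.sha256(artifact).hexdigest()
    if not hmac.compare_digest(digest, entry["sha256"]):
        raise VerificationError("artifact hash mismatch")
```

Here `approved_versions` represents the approval state that the rules keep authoritative in PostgreSQL; an update-cache mirror can serve the bytes, but it never supplies the approval or the expected hash.
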
## Historical Stage C1 Prompt

This prompt is retained for provenance only. Stages C1-C9 are now listed above
as implemented/verified or build-verified according to their reports.

Proceed with Stage C1 only.

Goal:
Implement backend cluster and node model foundation for the Secure Access Fabric.

Strict rules:

- do NOT implement mesh runtime
- do NOT implement VPN runtime
- do NOT change RDP runtime behavior
- do NOT build admin UI yet
- do NOT rewrite existing node-agent runtime
- do NOT expose full topology to organizations
- keep PostgreSQL as source of truth
- keep Redis for live coordination only

Scope:

1. Add explicit cluster model.
2. Add cluster membership model.
3. Add node join token model with hashed tokens only.
4. Add node join request model.
5. Add node identity/certificate metadata model.
6. Add node role assignment model.
7. Add node heartbeat/latest health model.
8. Add cluster audit events.
9. Preserve existing node tables with safe migration/backfill into a default cluster.
10. Add repository interfaces and PostgreSQL implementations.
11. Add service/usecase boundaries for platform-admin cluster and node management.
12. Add platform-admin-only API skeletons for clusters, nodes, join requests, and role assignments.
13. Add tests for cluster scoping, role authorization, join token hashing, and organization topology isolation.
14. Update documentation.

Deliver:

- migrations
- backend models/repositories/services
- platform-admin API skeleton
- tests
- docs update
- verification report

Historical note:
C2 was not to start until C1 was accepted. C1-C9 are now recorded above with
their current accepted statuses.

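The "hashed tokens only" requirement in scope item 3 of the historical prompt can be illustrated with a short sketch. It is not the C1 implementation: the function names are hypothetical, and plain SHA-256 is used on the assumption that join tokens are long random values rather than user-chosen secrets.

```python
import hashlib
import hmac
import secrets


def hash_join_token(token: str) -> str:
    """Return the hex digest that would be persisted instead of the token."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_join_token():
    """Create a token to show once to the operator, plus the hash to store."""
    token = secrets.token_urlsafe(32)
    return token, hash_join_token(token)


def verify_join_token(presented: str, stored_hash: str) -> bool:
    """Compare in constant time so timing does not leak hash prefixes."""
    return hmac.compare_digest(hash_join_token(presented), stored_hash)
```

With this shape, only the digest needs to reach the database, so a leaked table does not directly yield usable join tokens.
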
## Result / Decision

This document defines Fabric Core as the foundation beneath service adapters
and before mesh runtime. It clarifies peer discovery, score-based routing,
scoped configuration distribution, node-local state, Fabric Storage/Config
Storage, multi-cluster trust boundaries, and risk-based platform admin access.
It also hardens the boundaries that Redis is live-only, Fabric routing must not
depend on live backend availability, and Fabric Storage/Config Storage is not a
general-purpose distributed database or second source of truth.

Stage C10 through Stage C18 planning is complete as documentation/planning.
C17A, C17B, C17C, C17D, C17E, C17F, and C17G are implemented and test-proven
with synthetic traffic only. C18 defines the VPN/IP tunnel service target
model but does not authorize VPN/IP tunnel runtime. C18A adds the VPN/IP tunnel
control-plane data model and platform-admin skeleton only. C18B hardens
single-active lease/fencing semantics. C18C adds node-agent
desired-state/status reporting for scoped VPN assignments only. C19 is now
reserved for the Version Storage/Update Repository and node-agent
update/rollback foundation; it is not implemented by this document. No RDP,
data-plane, VPN runtime, production relay, production mesh service traffic,
node-agent VPN execution, host networking, service workload runtime, or
production updater behavior is implied by this document.