m/rdp-proxy

m 8ba0561f4f Initial project snapshot

2026-04-28 22:29:50 +03:00

Secure Node-to-Node Channel Lifecycle

Status: Stage C16 result. Documentation and architecture only.

This document defines the secure node-to-node channel lifecycle for the Secure Access Fabric. It introduces no code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

1. Purpose

Secure node-to-node channels are the future authenticated transport foundation for Fabric routes. They must exist as a trust and lifecycle model before any production mesh routing runtime carries traffic.

C16 defines:

mTLS identity validation
connection establishment
channel authorization
lifecycle state
heartbeat and liveness
reconnect/backoff
draining
revalidation
trust rotation
revocation handling
failure observability

2. Non-Goals

C16 does not:

implement packet forwarding
implement mesh routing runtime
implement relay node behavior
implement VPN/IP tunnel traffic
implement QUIC/WebRTC
implement service workloads
change RDP runtime
change backend session lifecycle
change Windows client behavior

It defines the node-channel lifecycle boundary only.

3. Trust Foundation

Every node-to-node channel must be authenticated.

Required identity inputs:

cluster id
local node id
remote node id
local node certificate
remote node certificate
cluster trust roots
revocation metadata
role assignment snapshot
allowed peer relationship
route/channel authorization policy

Private keys remain local to the node. The control plane must never store node private keys.

4. mTLS Certificate Requirements

Node certificates must be cluster-scoped.

Certificate identity should bind:

node id
cluster id
certificate serial
validity period
key usage for node-to-node transport
optional role/service constraints where practical

Validation must check:

certificate chain
cluster trust root
certificate validity time
node id binding
cluster id binding
expected remote node id
revocation status
key usage
policy scope

A valid TLS certificate is necessary but not sufficient. The channel must also pass role, peer, route, and channel authorization.
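
As an illustration of the identity checks listed above, the following is a minimal sketch, not an implementation. It assumes the certificate has already been parsed and chain-verified by the TLS layer; `PeerCertificate`, `validate_peer_identity`, and all field names are hypothetical. The returned reason strings reuse the failure names from section 16. Policy scope is left to channel authorization and is not checked here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PeerCertificate:
    """Hypothetical, already-parsed view of a remote node certificate."""
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    chain_verified: bool         # chain was verified to a cluster trust root
    node_transport_usage: bool   # key usage permits node-to-node transport


def validate_peer_identity(cert, *, expected_cluster_id, expected_node_id,
                           revoked_serials, now):
    """Return failure reasons in check order; an empty list means success."""
    failures = []
    if not cert.chain_verified:
        failures.append("certificate_invalid")
    if not (cert.not_before <= now <= cert.not_after):
        failures.append("certificate_invalid")
    if not cert.node_transport_usage:
        failures.append("certificate_invalid")
    if cert.serial in revoked_serials:
        failures.append("certificate_revoked")
    if cert.cluster_id != expected_cluster_id:
        failures.append("wrong_cluster")
    if cert.node_id != expected_node_id:
        failures.append("wrong_node")
    # Drop duplicate reasons while keeping first-seen order.
    return list(dict.fromkeys(failures))


# Example inputs (all values are made up for illustration).
example_now = datetime(2026, 1, 1, tzinfo=timezone.utc)
example_cert = PeerCertificate(
    node_id="node-b", cluster_id="cluster-1", serial="01",
    not_before=datetime(2025, 1, 1, tzinfo=timezone.utc),
    not_after=datetime(2027, 1, 1, tzinfo=timezone.utc),
    chain_verified=True, node_transport_usage=True,
)
```

An empty result only means the identity checks passed; the channel would still need role, peer, route, and channel-class authorization.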

5. Channel Establishment Flow

Proposed logical flow:

Routing Engine or node-agent selects an allowed peer candidate.
Local node checks peer directory and local policy.
Local node opens authenticated transport.
Both sides perform mTLS handshake.
Both sides validate certificate identity and cluster scope.
Both sides exchange channel hello metadata.
Both sides validate role, channel classes, and policy version.
Channel enters established state.
Heartbeat/liveness begins.
Channel is registered in local channel table with expiry/revalidation deadline.

Channel hello metadata should include:

protocol version
cluster id
node id
supported channel classes
supported transport features
local config version
peer directory version
trust bundle version
route epoch
draining status
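
The hello metadata could be modeled as a small immutable record. The sketch below is one possible shape, under these assumptions: versions and epochs are monotonically increasing integers, a peer with an older trust bundle or route epoch is rejected, and the granted classes are the intersection of what was requested and what both sides support. `ChannelHello` and `check_hello` are hypothetical names; the reason strings come from section 16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelHello:
    protocol_version: int
    cluster_id: str
    node_id: str
    channel_classes: frozenset
    transport_features: frozenset
    config_version: int
    peer_directory_version: int
    trust_bundle_version: int
    route_epoch: int
    draining: bool


def check_hello(local, remote, requested_classes):
    """Return (granted_classes, reason); reason is None on success."""
    if remote.cluster_id != local.cluster_id:
        return frozenset(), "wrong_cluster"
    if remote.draining:
        return frozenset(), "peer_draining"
    # Assumption: a peer behind on trust bundle or route epoch is refused.
    if remote.trust_bundle_version < local.trust_bundle_version:
        return frozenset(), "trust_bundle_stale"
    if remote.route_epoch < local.route_epoch:
        return frozenset(), "route_epoch_stale"
    granted = (frozenset(requested_classes)
               & local.channel_classes & remote.channel_classes)
    if not granted:
        return frozenset(), "channel_class_denied"
    return granted, None
```

Whether a stale peer should be rejected outright or allowed to catch up first is a policy decision this sketch does not settle.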

6. Channel States

Initial state machine:

idle
connecting
handshaking
authenticating
authorizing
established
revalidating
degraded
draining
closing
closed
failed

State rules:

no traffic before established
control/liveness may continue in revalidating
new non-essential traffic should stop in draining
channel must close on failed authentication
channel must close or degrade on failed reauthorization according to policy
terminal closed/failed channels must not be reused
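
The states and rules above can be sketched as a small state machine. This document lists the states but not the edges between them, so the transition table below is an assumption, as is the choice to let existing data traffic continue in degraded and draining. `Channel`, `ALLOWED`, `DATA_STATES`, and `CONTROL_STATES` are hypothetical names.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState
# Assumed transition table; terminal states have no outgoing edges.
ALLOWED = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED},
    S.CLOSED: set(),
    S.FAILED: set(),
}
# No traffic before established; control/liveness may continue in revalidating.
DATA_STATES = {S.ESTABLISHED, S.DEGRADED, S.DRAINING}
CONTROL_STATES = DATA_STATES | {S.REVALIDATING}


class Channel:
    def __init__(self):
        self.state = S.IDLE

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state

    def may_carry(self, is_control):
        allowed = CONTROL_STATES if is_control else DATA_STATES
        return self.state in allowed
```

Because `closed` and `failed` have no outgoing transitions, a terminal channel cannot be reused; a new `Channel` must be created instead.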

7. Channel Classes

Allowed channel classes map to Fabric routing classes, not service-specific protocol internals.

Initial channel classes:

fabric_control
route_control
health
telemetry
render
input
clipboard
file_transfer
storage_fetch
update_fetch
vpn_packet

Authorization is per channel class.

Rules:

input and fabric_control require high-priority scheduling.
render and video-like traffic may be droppable/latest-only.
file_transfer, storage_fetch, and update_fetch must not starve input or control.
vpn_packet must be QoS-limited so bulk traffic cannot starve interactive channels.
A channel may carry only classes authorized by local policy and route result.
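
A minimal sketch of per-class authorization and scheduling order follows. Only the top tier for `input` and `fabric_control` and the low tier for bulk classes follow from the rules above; the remaining tier numbers, and the names `CLASS_PRIORITY`, `effective_classes`, and `schedule_order`, are assumptions made for illustration.

```python
# Lower number = scheduled first. Middle tiers are assumed, not specified.
CLASS_PRIORITY = {
    "fabric_control": 0, "input": 0,
    "route_control": 1, "health": 1,
    "render": 2, "clipboard": 2, "telemetry": 2,
    "file_transfer": 3, "storage_fetch": 3,
    "update_fetch": 3, "vpn_packet": 3,
}


def effective_classes(requested, local_policy, route_result):
    """A channel may carry only classes allowed by policy AND route result."""
    unknown = set(requested) - CLASS_PRIORITY.keys()
    if unknown:
        raise ValueError(f"unknown channel classes: {sorted(unknown)}")
    return set(requested) & set(local_policy) & set(route_result)


def schedule_order(classes):
    """Order classes so interactive and control traffic is served first."""
    return sorted(classes, key=lambda c: (CLASS_PRIORITY[c], c))
```

Strict ordering like this keeps bulk classes from starving `input` or control, but a real scheduler would likely also need rate limits so that bulk traffic still makes progress.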

8. Channel Authorization

Authorization checks:

local node is allowed to connect to remote node
remote node is allowed to accept the connection
cluster id matches
roles are compatible
route result or peer policy permits the relationship
requested channel classes are allowed
organization/service scope is allowed where applicable
partition/degraded state permits the channel
remote node is not revoked, disabled, or disallowed
certificate is not expired or revoked

Authorization must be repeated when:

trust bundle changes
revocation list changes
role assignment changes
route policy changes
route epoch changes
channel has been open longer than the revalidation interval
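
One way to detect these triggers is to snapshot the versioned inputs at authorization time and compare them with the current values. The sketch below assumes each input is tracked as an integer version; `AuthzInputs` and `reauthorization_triggers` are hypothetical names.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AuthzInputs:
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def reauthorization_triggers(at_authz, current, authorized_at, now, interval_s):
    """Return the names of inputs that changed since authorization."""
    changed = [f.name for f in fields(AuthzInputs)
               if getattr(at_authz, f.name) != getattr(current, f.name)]
    # Long-lived channels also revalidate once the interval has elapsed.
    if now - authorized_at >= interval_s:
        changed.append("revalidation_interval")
    return changed
```

An empty list means no reauthorization is due yet; any non-empty result would move the channel into `revalidating`.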

9. Heartbeat and Liveness

Heartbeats prove liveness, not authority.

Heartbeat metadata:

channel id
local node id
remote node id
timestamp
sequence
observed latency
packet loss/jitter summary where available
local health hint
draining flag
config version
route epoch

Recommended heartbeat cadence:

active control channels: 5-15 seconds
high-priority realtime channels: 2-10 seconds where needed
low-priority/storage channels: 15-60 seconds

Missing heartbeats should trigger:

suspicion state
bounded retry
route failover consideration
channel close/failure
health report
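
The escalation from suspicion to failure can be sketched as a counter of missed heartbeat intervals. The thresholds (`suspect_after`, `fail_after`) and the class name `LivenessMonitor` are assumptions; this document does not fix how many missed heartbeats trigger each step. Timestamps are plain seconds supplied by the caller.

```python
class LivenessMonitor:
    def __init__(self, interval_s, suspect_after=2, fail_after=4,
                 started_at=0.0):
        self.interval_s = interval_s
        self.suspect_after = suspect_after  # assumed threshold
        self.fail_after = fail_after        # assumed threshold
        self.last_seen = started_at
        self.last_sequence = -1

    def on_heartbeat(self, sequence, now):
        """Record a heartbeat; ignore stale or replayed sequence numbers."""
        if sequence <= self.last_sequence:
            return False
        self.last_sequence = sequence
        self.last_seen = now
        return True

    def status(self, now):
        """Classify liveness by the number of whole intervals missed."""
        missed = int((now - self.last_seen) // self.interval_s)
        if missed >= self.fail_after:
            return "failed"    # close channel, report heartbeat_timeout
        if missed >= self.suspect_after:
            return "suspect"   # bounded retry, consider route failover
        return "alive"
```

A heartbeat that arrives on time still proves only liveness; it does not replace revalidation or authorization.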

10. Reconnect and Backoff

Reconnect must be bounded and policy-aware.

Rules:

use exponential backoff with jitter
do not stampede bootstrap peers
prefer warm candidates after active peer failure
stop reconnect when peer is revoked or policy disallows it
report repeated failures
preserve route stickiness only while healthy and authorized
avoid reconnect loops during draining or shutdown

Reconnect should use current peer cache and route policy, not stale hardcoded endpoints.
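
A bounded reconnect loop with exponential backoff and jitter might look like the sketch below. It uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling), which is one common choice and not something this document prescribes. The defaults and the callback names `try_connect`, `is_allowed`, and `sleep` are hypothetical.

```python
import random


def backoff_delays(base_s=0.5, cap_s=60.0, max_attempts=8, rng=None):
    """Yield one jittered delay per attempt, capped at cap_s."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def reconnect(try_connect, is_allowed, sleep, **backoff):
    """Retry until connected, policy forbids it, or attempts run out."""
    for delay in backoff_delays(**backoff):
        # Stop when the peer is revoked or policy disallows reconnecting.
        if not is_allowed():
            return "policy_denied"
        if try_connect():
            return "connected"
        sleep(delay)
    return "backoff_exhausted"
```

In this sketch `try_connect` is expected to resolve the endpoint from the current peer cache on every call, so a retry never reuses a stale address.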

11. Revalidation

Long-lived channels must revalidate periodically.

Revalidation checks:

certificate still valid
revocation status is fresh enough under policy
cluster trust root still valid
peer relationship still allowed
channel classes still allowed
route epoch/policy version still acceptable
role assignments still active

If revalidation fails:

stop accepting new traffic
drain or close according to policy
report reason
trigger route failover where applicable

12. Draining and Graceful Shutdown

Draining supports maintenance and safe role removal.

Draining flow:

node enters draining state
node advertises draining in heartbeat/channel metadata
routing stops placing new flows on the node
existing flows continue until TTL or policy deadline
new non-essential channels are rejected
channel closes after active work drains or deadline expires
node reports drained status

Draining must not silently drop critical control messages.

If graceful drain fails, policy decides whether to force-close and failover.
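
The draining flow can be sketched as a small controller that rejects new non-essential channels and reports when work has drained or the deadline has passed. `DrainController`, its method names, and the returned status strings are hypothetical; what happens on `deadline_expired` is left to policy, as stated above.

```python
class DrainController:
    def __init__(self, deadline_s):
        self.deadline_s = deadline_s
        self.started_at = None

    @property
    def draining(self):
        return self.started_at is not None

    def start(self, now):
        """Enter draining; repeated calls keep the original start time."""
        if self.started_at is None:
            self.started_at = now

    def accept_new_channel(self, essential):
        # New non-essential channels are rejected while draining.
        return essential or not self.draining

    def poll(self, active_flows, now):
        if not self.draining:
            return "serving"
        if active_flows == 0:
            return "drained"            # node can report drained status
        if now - self.started_at >= self.deadline_s:
            return "deadline_expired"   # policy decides on force-close
        return "draining"
```

Essential channels are still accepted during draining in this sketch so that critical control messages are not silently dropped.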

13. Trust Rotation

Trust rotation must avoid split trust windows.

Recommended flow:

new trust bundle is signed by current trusted key
nodes fetch and verify new trust bundle
dual validation period begins where required
new certificates are issued/accepted
old certificates expire or are revoked
old trust root is retired after rollout threshold

Channels should revalidate after trust bundle changes.
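
To illustrate the dual validation period, the sketch below tracks a current and a staged trust root and accepts certificates issued under either until the old root is retired. Root identifiers are plain strings standing in for verified signing keys; real signature verification is out of scope here. `TrustStore`, its methods, and the `threshold` default are assumptions.

```python
class TrustStore:
    def __init__(self, current_root):
        self.current_root = current_root
        self.next_root = None

    def stage(self, new_root, signed_by):
        """Stage a new root only if the current trusted key signed it."""
        if signed_by != self.current_root:
            raise ValueError(
                "new trust bundle must be signed by the current trusted key")
        self.next_root = new_root

    def accepts(self, issuer_root):
        """During dual validation, either root is acceptable."""
        if issuer_root == self.current_root:
            return True
        return self.next_root is not None and issuer_root == self.next_root

    def retire_old(self, adopted_fraction, threshold=1.0):
        """Promote the staged root once rollout reaches the threshold."""
        if self.next_root is None or adopted_fraction < threshold:
            return False
        self.current_root, self.next_root = self.next_root, None
        return True
```

After `retire_old` succeeds, certificates issued under the old root are no longer accepted, which is the point at which active channels would need to have revalidated.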

14. Revocation Handling

Revocation must affect active channels.

Revocation inputs:

signed revocation list
trust bundle update
control-plane status after reconnect
emergency revocation policy

On revocation of remote node/certificate/key:

stop new channels
mark existing channels as having failed revalidation
close or drain according to policy severity
remove peer from eligible active/warm candidates
report and audit event

High-severity revocation should close immediately.
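
The steps above can be sketched as a pure function over the local channel table. Mapping `"high"` severity to an immediate close follows the text; treating every other severity as a drain is an assumption. `apply_revocation`, the severity strings, and the audit event shape are hypothetical.

```python
def apply_revocation(channels, candidates, revoked_node, severity):
    """channels maps channel id -> remote node id; candidates is a set of ids.

    Returns (per-channel actions, remaining candidates, audit event).
    """
    # High severity closes immediately; otherwise drain (assumed default).
    action = "close" if severity == "high" else "drain"
    actions = {cid: action
               for cid, peer in channels.items() if peer == revoked_node}
    # The revoked peer is no longer an eligible active/warm candidate.
    remaining = set(candidates) - {revoked_node}
    audit = {
        "event": "peer_revoked",
        "node_id": revoked_node,
        "severity": severity,
        "channels": sorted(actions),
    }
    return actions, remaining, audit
```

Channels to other peers are left untouched, and the caller is expected to refuse new channels to the revoked node from this point on.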

15. Partition and Degraded Behavior

In degraded mode, channels may continue only if:

current signed snapshot permits it
certificates remain valid
no known revocation state rejects the peer
route/channel policy permits degraded continuation
TTL has not expired

Degraded mode must not authorize:

new node enrollment
new trust roots
role changes
cross-cluster trust changes
partition promotion
new high-risk channels without policy
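
Both lists above reduce to simple gates. The sketch below is illustrative only: the action identifiers in `FORBIDDEN_IN_DEGRADED` are hypothetical names for the forbidden mutations, and the high-risk channel case is omitted because it depends on a policy lookup this document does not define.

```python
# Hypothetical action identifiers for mutations degraded mode must refuse.
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "trust_root_add",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})


def degraded_permits_action(action):
    return action not in FORBIDDEN_IN_DEGRADED


def may_continue_in_degraded(*, snapshot_permits, certificate_valid,
                             peer_known_revoked, policy_permits_degraded,
                             ttl_expires_at, now):
    """All conditions must hold for a channel to continue while degraded."""
    return (snapshot_permits
            and certificate_valid
            and not peer_known_revoked
            and policy_permits_degraded
            and now < ttl_expires_at)
```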

16. Failure Classification

Failure reasons:

tls_handshake_failed
certificate_invalid
certificate_revoked
wrong_cluster
wrong_node
policy_denied
channel_class_denied
route_epoch_stale
heartbeat_timeout
peer_draining
peer_disabled
trust_bundle_stale
network_unreachable
backoff_exhausted

Failures should be structured and safe to log.
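
A structured failure could be represented as an enum of the reasons above plus a record that carries only identifiers. The record shape and the helper names `failure_record` and `to_log_line` are assumptions; the point of the sketch is that no certificate bodies, keys, or payload data enter the log line.

```python
import json
from enum import Enum


class FailureReason(Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"


def failure_record(reason, channel_id, remote_node_id, at):
    """Build a log-safe record: identifiers and a reason code only."""
    return {
        "reason": reason.value,
        "channel_id": channel_id,
        "remote_node_id": remote_node_id,
        "at": at,
    }


def to_log_line(record):
    # Sorted keys keep log lines stable and easy to compare.
    return json.dumps(record, sort_keys=True)
```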

17. Observability

Node-agent should report:

channel state
active channel count
channel classes in use
handshake failures
authorization failures
heartbeat latency
reconnect count
backoff state
draining state
revocation actions
revalidation failures
route epoch/policy version

Tenant views must not expose internal topology. Platform owner views may show full channel diagnostics according to audited policy.
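
Topology hiding in tenant views can be approached with an allowlist rather than a blocklist, so that newly added diagnostic fields stay hidden by default. Which fields are actually tenant-safe is not decided in this document; the set below, and the names `TENANT_SAFE_FIELDS` and `tenant_view`, are assumptions for illustration.

```python
# Assumed allowlist; anything not listed is treated as internal topology.
TENANT_SAFE_FIELDS = frozenset({
    "channel_state",
    "channel_classes",
    "draining_state",
})


def tenant_view(report):
    """Return only the allowlisted fields of a channel report."""
    return {k: v for k, v in report.items() if k in TENANT_SAFE_FIELDS}
```

A platform owner view would return the full report instead, subject to the audited policy mentioned above.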

18. Security Requirements

Required:

mTLS for node-to-node channels
cluster-scoped node certificates
certificate revocation support
policy-scoped channel authorization
no unauthenticated peer enumeration
no channel use before authorization
channel class separation
QoS-aware scheduling expectations
structured audit for high-risk channel changes

Compromised node blast radius must be limited by:

scoped certificates
scoped snapshots
role assignment
peer directory scope
channel authorization
revocation
topology hiding

19. Future Validation Tests

Future implementation tests must prove:

valid node-to-node mTLS succeeds
wrong cluster certificate rejected
wrong node id rejected
expired certificate rejected
revoked certificate closes active channel
unauthorized channel class rejected
channel cannot carry traffic before authorization
heartbeat timeout triggers failure
draining stops new channels
trust rotation revalidates channels
degraded mode honors TTL and forbidden actions
tenant-safe views hide topology

20. C17 Preparation

C17 may plan mesh routing runtime only after C10-C16 are accepted.

C17 must use:

signed snapshots
node-local state store
Fabric Storage / Config Storage
peer directory/cache
Fabric Routing Engine route results
secure node-to-node channels

C17 must not jump directly to broad production mesh. It should first define a minimal runtime implementation plan, test topology, rollback path, and go/no-go criteria.

21. Result / Decision

Stage C16 defines secure node-to-node channels as authenticated, policy-authorized, lifecycle-managed connections.

Decisions:

mTLS is required for node-to-node channels.
Certificate validity is necessary but not sufficient; channel policy must authorize role, peer relationship, route, and channel classes.
Active channels must revalidate on trust, revocation, role, and route policy changes.
Draining is a first-class lifecycle state.
Revocation affects active channels.
Degraded mode is bounded and cannot authorize high-risk mutations.
C17 must plan mesh routing runtime using C10-C16 foundations.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C16.
