Secure Node-to-Node Channel Lifecycle
Status: Stage C16 result. Documentation and architecture only.
This document defines the secure node-to-node channel lifecycle for the Secure Access Fabric. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.
1. Purpose
Secure node-to-node channels are the future authenticated transport foundation for Fabric routes. They must exist as a trust and lifecycle model before any production mesh routing runtime carries traffic.
C16 defines:
- mTLS identity validation
- connection establishment
- channel authorization
- lifecycle state
- heartbeat and liveness
- reconnect/backoff
- draining
- revalidation
- trust rotation
- revocation handling
- failure observability
2. Non-Goals
C16 does not:
- implement packet forwarding
- implement mesh routing runtime
- implement relay node behavior
- implement VPN/IP tunnel traffic
- implement QUIC/WebRTC
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior
It defines the node-channel lifecycle boundary only.
3. Trust Foundation
Every node-to-node channel must be authenticated.
Required identity inputs:
- cluster id
- local node id
- remote node id
- local node certificate
- remote node certificate
- cluster trust roots
- revocation metadata
- role assignment snapshot
- allowed peer relationship
- route/channel authorization policy
Private keys remain local to the node. The control plane must never store node private keys.
4. mTLS Certificate Requirements
Node certificates must be cluster-scoped.
Certificate identity should bind:
- node id
- cluster id
- certificate serial
- validity period
- key usage for node-to-node transport
- optional role/service constraints where practical
Validation must check:
- certificate chain
- cluster trust root
- certificate validity time
- node id binding
- cluster id binding
- expected remote node id
- revocation status
- key usage
- policy scope
A valid TLS certificate is necessary but not sufficient. The channel must also pass role, peer, route, and channel authorization.
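The validation checks above could be sketched roughly as follows. This is a minimal illustration, not an implementation: `NodeCertInfo` and `validate_peer_cert` are hypothetical names, chain verification and key-usage parsing are assumed to be done by a real X.509/TLS library and appear here only as boolean fields, and the policy-scope check is left out. The returned strings reuse the failure reasons from section 16.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Set


@dataclass(frozen=True)
class NodeCertInfo:
    """Hypothetical, already-parsed view of a peer certificate."""
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    chain_ok: bool               # assumed: chain verified to a cluster trust root
    node_transport_usage: bool   # assumed: key usage allows node-to-node transport


def validate_peer_cert(cert: NodeCertInfo, expected_cluster: str,
                       expected_node: str, revoked_serials: Set[str],
                       now: datetime) -> Optional[str]:
    """Return None when all checks pass, otherwise a failure reason."""
    if not cert.chain_ok or not cert.node_transport_usage:
        return "certificate_invalid"
    if not (cert.not_before <= now <= cert.not_after):
        return "certificate_invalid"
    if cert.serial in revoked_serials:
        return "certificate_revoked"
    if cert.cluster_id != expected_cluster:
        return "wrong_cluster"
    if cert.node_id != expected_node:
        return "wrong_node"
    return None


# Example with made-up identifiers.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
cert = NodeCertInfo("node-b", "cluster-1", "0A", now - timedelta(days=1),
                    now + timedelta(days=30), True, True)
result = validate_peer_cert(cert, "cluster-1", "node-b", set(), now)
```

A `None` result here only means the certificate is acceptable; role, peer, route, and channel authorization would still have to pass before the channel is usable.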
5. Channel Establishment Flow
Proposed logical flow:
- Routing Engine or node-agent selects an allowed peer candidate.
- Local node checks peer directory and local policy.
- Local node opens authenticated transport.
- Both sides perform mTLS handshake.
- Both sides validate certificate identity and cluster scope.
- Both sides exchange channel hello metadata.
- Both sides validate role, channel classes, and policy version.
- Channel enters established state.
- Heartbeat/liveness begins.
- Channel is registered in local channel table with expiry/revalidation deadline.
Channel hello metadata should include:
- protocol version
- cluster id
- node id
- supported channel classes
- supported transport features
- local config version
- peer directory version
- trust bundle version
- route epoch
- draining status
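One possible shape for the hello metadata is sketched below. The field names and types are assumptions derived from the list above, and `negotiate` is a hypothetical helper; requiring an exact protocol-version match is also an assumption, since the document does not define version compatibility rules.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet


@dataclass(frozen=True)
class ChannelHello:
    """Illustrative hello record; field names mirror the list above."""
    protocol_version: int
    cluster_id: str
    node_id: str
    channel_classes: FrozenSet[str]
    transport_features: FrozenSet[str]
    config_version: int
    peer_directory_version: int
    trust_bundle_version: int
    route_epoch: int
    draining: bool


def negotiate(local: ChannelHello, remote: ChannelHello) -> FrozenSet[str]:
    """Return channel classes supported by both sides, or raise on mismatch."""
    if local.protocol_version != remote.protocol_version:
        raise ValueError("protocol version mismatch")  # assumed strict match
    if local.cluster_id != remote.cluster_id:
        raise ValueError("wrong_cluster")
    return local.channel_classes & remote.channel_classes


local_hello = ChannelHello(1, "cluster-1", "node-a",
                           frozenset({"health", "input"}), frozenset(),
                           7, 3, 2, 11, False)
remote_hello = replace(local_hello, node_id="node-b",
                       channel_classes=frozenset({"health", "render"}))
common = negotiate(local_hello, remote_hello)
```

The intersection is only what both peers support; which of those classes the channel may actually carry is still decided by channel authorization.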
6. Channel States
Initial state machine:
- idle
- connecting
- handshaking
- authenticating
- authorizing
- established
- revalidating
- degraded
- draining
- closing
- closed
- failed
State rules:
- no traffic before established
- control/liveness may continue in revalidating
- new non-essential traffic should stop in draining
- channel must close on failed authentication
- channel must close or degrade on failed reauthorization according to policy
- terminal closed/failed channels must not be reused
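The states and rules above could be modeled as a small transition table. The document names the states but does not specify the edges between them, so the `ALLOWED` table below is an assumption, as is the conservative treatment of the degraded and revalidating states in `may_accept_new_traffic`. All names are illustrative.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState
# Assumed edges; only the state names come from the document.
ALLOWED = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),  # terminal: must not be reused
    S.FAILED: set(),  # terminal: must not be reused
}


def transition(current: ChannelState, target: ChannelState) -> ChannelState:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def may_accept_new_traffic(state: ChannelState, essential: bool) -> bool:
    """No traffic before established; only essential traffic while
    revalidating or draining (conservative assumption for revalidating)."""
    if state is S.ESTABLISHED:
        return True
    if state in (S.REVALIDATING, S.DRAINING):
        return essential
    return False
```

Because the terminal states have no outgoing edges, a reconnect has to create a new channel object starting at idle rather than reviving a closed or failed one.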
7. Channel Classes
Allowed channel classes map to Fabric routing classes, not service-specific protocol internals.
Initial channel classes:
- fabric_control
- route_control
- health
- telemetry
- render
- input
- clipboard
- file_transfer
- storage_fetch
- update_fetch
- vpn_packet
Authorization is per channel class.
Rules:
- input and fabric_control require high-priority scheduling.
- render and video-like traffic may be droppable/latest-only.
- file_transfer, storage_fetch, and update_fetch must not starve input or control.
- vpn_packet must be QoS-limited so bulk traffic cannot starve interactive channels.
- A channel may carry only classes authorized by local policy and route result.
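As a rough illustration of the last rule and the scheduling expectations, the sketch below intersects the requested classes with local policy and the route result, then orders pending work by an assumed three-tier priority. The tier numbers and the grouping of vpn_packet with bulk classes are assumptions; strict ordering is a simplification, and real QoS limiting would also need rate control.

```python
from typing import Iterable, List, Set

HIGH_PRIORITY = {"input", "fabric_control"}
BULK = {"file_transfer", "storage_fetch", "update_fetch", "vpn_packet"}


def authorized_classes(requested: Set[str], local_policy: Set[str],
                       route_result: Set[str]) -> Set[str]:
    """A class is carried only if policy and the route result both allow it."""
    return requested & local_policy & route_result


def tier(channel_class: str) -> int:
    """Assumed tiers: 0 = high priority, 1 = normal, 2 = bulk."""
    if channel_class in HIGH_PRIORITY:
        return 0
    if channel_class in BULK:
        return 2
    return 1


def schedule_order(pending: Iterable[str]) -> List[str]:
    """Stable sort so high-priority classes are served before bulk ones."""
    return sorted(pending, key=tier)


allowed = authorized_classes({"input", "render", "vpn_packet"},
                             {"input", "render", "health"},
                             {"input", "render", "vpn_packet"})
order = schedule_order(["file_transfer", "render", "input"])
```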
8. Channel Authorization
Authorization checks:
- local node is allowed to connect to remote node
- remote node is allowed to accept the connection
- cluster id matches
- roles are compatible
- route result or peer policy permits the relationship
- requested channel classes are allowed
- organization/service scope is allowed where applicable
- partition/degraded state permits the channel
- remote node is not revoked, disabled, or disallowed
- certificate is not expired or revoked
Authorization must be repeated when:
- trust bundle changes
- revocation list changes
- role assignment changes
- route policy changes
- route epoch changes
- a long-lived channel exceeds its revalidation interval
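The reauthorization triggers can be sketched as a comparison between the versions recorded when the channel was authorized and the versions currently known to the node. `AuthSnapshot` and `reauthorization_reasons` are hypothetical names, and representing each input as an integer version is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AuthSnapshot:
    """Assumed version counters for the inputs an authorization depends on."""
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def reauthorization_reasons(at_auth: AuthSnapshot, current: AuthSnapshot,
                            authorized_at: float, now: float,
                            revalidation_interval: float) -> List[str]:
    """Return the triggers that fired; an empty list means no reauthorization."""
    reasons = [
        name for name in at_auth.__dataclass_fields__
        if getattr(at_auth, name) != getattr(current, name)
    ]
    if now - authorized_at >= revalidation_interval:
        reasons.append("revalidation_interval")
    return reasons


base = AuthSnapshot(1, 1, 1, 1, 10)
unchanged = reauthorization_reasons(base, base, 0.0, 60.0, 300.0)
epoch_moved = reauthorization_reasons(base, replace(base, route_epoch=11),
                                      0.0, 60.0, 300.0)
```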
9. Heartbeat and Liveness
Heartbeats prove liveness, not authority.
Heartbeat metadata:
- channel id
- local node id
- remote node id
- timestamp
- sequence
- observed latency
- packet loss/jitter summary where available
- local health hint
- draining flag
- config version
- route epoch
Recommended heartbeat cadence:
- active control channels: 5-15 seconds
- high-priority realtime channels: 2-10 seconds where needed
- low-priority/storage channels: 15-60 seconds
Missing heartbeats should trigger:
- suspicion state
- bounded retry
- route failover consideration
- channel close/failure
- health report
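A minimal sketch of how missed heartbeats might escalate from suspicion to failure is shown below. The function name and the thresholds (suspect after two missed intervals, fail after four) are assumptions chosen for illustration; the document does not fix these values.

```python
def liveness(last_heartbeat: float, now: float, interval: float,
             suspect_after: int = 2, fail_after: int = 4) -> str:
    """Classify a channel by the number of whole heartbeat intervals missed."""
    missed = int((now - last_heartbeat) // interval)
    if missed >= fail_after:
        return "failed"    # close the channel and report heartbeat_timeout
    if missed >= suspect_after:
        return "suspect"   # bounded retry, consider route failover
    return "alive"


status_recent = liveness(last_heartbeat=0.0, now=5.0, interval=5.0)
status_late = liveness(last_heartbeat=0.0, now=12.0, interval=5.0)
status_dead = liveness(last_heartbeat=0.0, now=20.0, interval=5.0)
```

A result of alive says nothing about authority; authorization and revalidation remain separate checks.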
10. Reconnect and Backoff
Reconnect must be bounded and policy-aware.
Rules:
- use exponential backoff with jitter
- do not stampede bootstrap peers
- prefer warm candidates after active peer failure
- stop reconnect when peer is revoked or policy disallows it
- report repeated failures
- preserve route stickiness only while healthy and authorized
- avoid reconnect loops during draining or shutdown
Reconnect should use current peer cache and route policy, not stale hardcoded endpoints.
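The backoff rules could look roughly like this. The sketch uses exponential backoff with full jitter (a uniformly random delay between zero and the exponential ceiling), which is one common variant; the base, cap, and attempt limit are assumed values, and both function names are illustrative.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Exponential backoff with full jitter, bounded by `cap` seconds."""
    source = random if rng is None else rng
    exponent = min(attempt, 32)  # keep the power small for large attempt counts
    ceiling = min(cap, base * (2 ** exponent))
    return source.uniform(0.0, ceiling)


def should_reconnect(attempt: int, max_attempts: int, peer_revoked: bool,
                     policy_allows: bool, draining: bool) -> bool:
    """Stop on revocation, policy denial, draining, or exhausted attempts."""
    if peer_revoked or not policy_allows or draining:
        return False
    return attempt < max_attempts


rng = random.Random(0)  # seeded only to make the example repeatable
delays = [backoff_delay(n, rng=rng) for n in range(8)]
```

Randomizing the whole delay rather than adding a small offset spreads reconnects from many nodes more evenly, which helps avoid stampeding bootstrap peers.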
11. Revalidation
Long-lived channels must revalidate periodically.
Revalidation checks:
- certificate still valid
- revocation status is sufficiently current
- cluster trust root still valid
- peer relationship still allowed
- channel classes still allowed
- route epoch/policy version still acceptable
- role assignments still active
If revalidation fails:
- stop accepting new traffic
- drain or close according to policy
- report reason
- trigger route failover where applicable
12. Draining and Graceful Shutdown
Draining supports maintenance and safe role removal.
Draining flow:
- node enters draining state
- node advertises draining in heartbeat/channel metadata
- routing stops placing new flows on the node
- existing flows continue until TTL or policy deadline
- new non-essential channels are rejected
- channel closes after active work drains or deadline expires
- node reports drained status
Draining must not silently drop critical control messages.
If graceful drain fails, policy decides whether to force-close and failover.
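The two decisions a draining node makes repeatedly, whether to accept a new channel and whether to close an existing one, might be sketched as follows. `DrainState` and both helpers are hypothetical, and the use of a single numeric deadline is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DrainState:
    """Illustrative per-channel drain bookkeeping."""
    deadline: float      # assumed: TTL or policy deadline as a timestamp
    active_flows: int


def accept_new_channel(draining: bool, essential: bool) -> bool:
    """While draining, only essential (for example control) channels are accepted."""
    return essential or not draining


def should_close(state: DrainState, now: float) -> bool:
    """Close once active work has drained or the deadline has expired."""
    return state.active_flows == 0 or now >= state.deadline


busy = DrainState(deadline=100.0, active_flows=2)
idle = DrainState(deadline=100.0, active_flows=0)
```

Whether a channel that reaches its deadline with flows still active is force-closed or failed over remains a policy decision, as stated above.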
13. Trust Rotation
Trust rotation must avoid split trust windows.
Recommended flow:
- new trust bundle is signed by current trusted key
- nodes fetch and verify new trust bundle
- dual validation period begins where required
- new certificates are issued/accepted
- old certificates expire or are revoked
- old trust root is retired after rollout threshold
Channels should revalidate after trust bundle changes.
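One way to reason about the dual validation period is as a set of accepted trust roots that changes by phase. The phase names below ("before", "dual", "retired") and the function are assumptions for illustration; signature verification of the new bundle is out of scope for this sketch.

```python
from typing import Set


def accepted_roots(phase: str, old_root: str, new_root: str) -> Set[str]:
    """Return the trust roots a node accepts in each assumed rotation phase."""
    if phase == "before":
        return {old_root}
    if phase == "dual":
        return {old_root, new_root}  # both roots validate during rollout
    if phase == "retired":
        return {new_root}
    raise ValueError(f"unknown rotation phase: {phase}")


during_rollout = accepted_roots("dual", "root-2023", "root-2024")
```

Keeping both roots valid during rollout is what prevents a window in which two healthy nodes cannot validate each other.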
14. Revocation Handling
Revocation must affect active channels.
Revocation inputs:
- signed revocation list
- trust bundle update
- control-plane status after reconnect
- emergency revocation policy
On revocation of remote node/certificate/key:
- stop new channels
- mark existing channels as revalidation failed
- close or drain according to policy severity
- remove peer from eligible active/warm candidates
- report and audit event
High-severity revocation should close immediately.
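The effect of a revocation on active channels could be sketched like this. The `Channel` record, the boolean severity flag, and the event dictionary layout are assumptions; only the close-versus-drain distinction and the failure reason name come from the document.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Channel:
    """Illustrative channel table entry."""
    channel_id: str
    remote_node_id: str
    state: str = "established"


def apply_revocation(channels: List[Channel], revoked_node_id: str,
                     high_severity: bool) -> List[Dict[str, str]]:
    """Close (high severity) or drain every channel to the revoked node
    and return one audit event per affected channel."""
    events = []
    for channel in channels:
        if channel.remote_node_id != revoked_node_id:
            continue
        channel.state = "closing" if high_severity else "draining"
        events.append({
            "channel_id": channel.channel_id,
            "reason": "certificate_revoked",
            "action": channel.state,
        })
    return events


table = [Channel("ch-1", "node-b"), Channel("ch-2", "node-c")]
audit = apply_revocation(table, "node-b", high_severity=True)
```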
15. Partition and Degraded Behavior
In degraded mode, channels may continue only if:
- current signed snapshot permits it
- certificates remain valid
- no known revocation applies to the peer
- route/channel policy permits degraded continuation
- TTL has not expired
Degraded mode must not authorize:
- new node enrollment
- new trust roots
- role changes
- cross-cluster trust changes
- partition promotion
- new high-risk channels without policy
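The degraded-mode conditions can be expressed as a single conjunction, with the forbidden actions as a fixed deny list. The parameter names and action identifiers are assumptions; the conditional rule about new high-risk channels depends on policy and is not modeled here.

```python
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "trust_root_change",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})


def degraded_channel_allowed(snapshot_permits: bool, cert_valid: bool,
                             peer_known_revoked: bool,
                             policy_permits_degraded: bool,
                             ttl_deadline: float, now: float) -> bool:
    """Every listed condition must hold for a channel to continue."""
    return (snapshot_permits and cert_valid and not peer_known_revoked
            and policy_permits_degraded and now < ttl_deadline)


def action_allowed_in_degraded(action: str) -> bool:
    """Degraded mode never authorizes the listed high-risk mutations."""
    return action not in FORBIDDEN_IN_DEGRADED


still_ok = degraded_channel_allowed(True, True, False, True,
                                    ttl_deadline=100.0, now=50.0)
```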
16. Failure Classification
Failure reasons:
- tls_handshake_failed
- certificate_invalid
- certificate_revoked
- wrong_cluster
- wrong_node
- policy_denied
- channel_class_denied
- route_epoch_stale
- heartbeat_timeout
- peer_draining
- peer_disabled
- trust_bundle_stale
- network_unreachable
- backoff_exhausted
Failures should be structured and safe to log.
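A structured, log-safe failure record might look like the sketch below. The enum values are the failure reasons listed in this section; the record layout and the `failure_record` helper are assumptions. Only identifiers and the reason code are emitted, never certificate contents or key material.

```python
import json
from enum import Enum


class FailureReason(str, Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"


def failure_record(reason: FailureReason, channel_id: str,
                   remote_node_id: str) -> str:
    """Serialize a failure as one JSON line with a fixed set of fields."""
    return json.dumps({
        "event": "channel_failure",
        "reason": reason.value,
        "channel_id": channel_id,
        "remote_node_id": remote_node_id,
    }, sort_keys=True)


line = failure_record(FailureReason.WRONG_CLUSTER, "ch-1", "node-b")
```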
17. Observability
Node-agent should report:
- channel state
- active channel count
- channel classes in use
- handshake failures
- authorization failures
- heartbeat latency
- reconnect count
- backoff state
- draining state
- revocation actions
- revalidation failures
- route epoch/policy version
Tenant views must not expose internal topology. Platform owner views may show full channel diagnostics according to audited policy.
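The tenant/platform-owner split could be enforced with an allow list applied to each report before it leaves the platform boundary. Which fields count as tenant-safe is an assumption made for this sketch, as are the field names themselves.

```python
from typing import Any, Dict

# Assumed allow list: fields that do not reveal peers or topology.
TENANT_SAFE_FIELDS = frozenset({
    "channel_state",
    "channel_classes",
    "heartbeat_latency_ms",
})


def tenant_view(report: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only allow-listed fields; everything else is dropped."""
    return {key: value for key, value in report.items()
            if key in TENANT_SAFE_FIELDS}


full_report = {
    "channel_state": "established",
    "channel_classes": ["input", "render"],
    "heartbeat_latency_ms": 12,
    "remote_node_id": "node-b",   # topology detail, hidden from tenants
    "route_epoch": 11,
}
safe_report = tenant_view(full_report)
```

An allow list fails closed: a field added to the report later stays hidden from tenants until someone deliberately marks it safe.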
18. Security Requirements
Required:
- mTLS for node-to-node channels
- cluster-scoped node certificates
- certificate revocation support
- policy-scoped channel authorization
- no unauthenticated peer enumeration
- no channel use before authorization
- channel class separation
- QoS-aware scheduling expectations
- structured audit for high-risk channel changes
Compromised node blast radius must be limited by:
- scoped certificates
- scoped snapshots
- role assignment
- peer directory scope
- channel authorization
- revocation
- topology hiding
19. Future Validation Tests
Future implementation tests must prove:
- valid node-to-node mTLS succeeds
- wrong cluster certificate rejected
- wrong node id rejected
- expired certificate rejected
- revoked certificate closes active channel
- unauthorized channel class rejected
- channel cannot carry traffic before authorization
- heartbeat timeout triggers failure
- draining stops new channels
- trust rotation revalidates channels
- degraded mode honors TTL and forbidden actions
- tenant-safe views hide topology
20. C17 Preparation
C17 may plan mesh routing runtime only after C10-C16 are accepted.
C17 must use:
- signed snapshots
- node-local state store
- Fabric Storage / Config Storage
- peer directory/cache
- Fabric Routing Engine route results
- secure node-to-node channels
C17 must not jump directly to broad production mesh. It should first define a minimal runtime implementation plan, test topology, rollback path, and go/no-go criteria.
21. Result / Decision
Stage C16 defines secure node-to-node channels as authenticated, policy-authorized, lifecycle-managed connections.
Decisions:
- mTLS is required for node-to-node channels.
- Certificate validity is necessary but not sufficient; channel policy must authorize role, peer relationship, route, and channel classes.
- Active channels must revalidate on trust, revocation, role, and route policy changes.
- Draining is a first-class lifecycle state.
- Revocation affects active channels.
- Degraded mode is bounded and cannot authorize high-risk mutations.
- C17 must plan mesh routing runtime using C10-C16 foundations.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C16.