Initial project snapshot
@@ -0,0 +1,464 @@
# Secure Node-to-Node Channel Lifecycle

Status: Stage C16 result. Documentation and architecture only.

This document defines the secure node-to-node channel lifecycle for the Secure
Access Fabric. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.

## 1. Purpose

Secure node-to-node channels are the future authenticated transport foundation
for Fabric routes. They must exist as a trust and lifecycle model before any
production mesh routing runtime carries traffic.

C16 defines:

- mTLS identity validation
- connection establishment
- channel authorization
- lifecycle state
- heartbeat and liveness
- reconnect/backoff
- draining
- revalidation
- trust rotation
- revocation handling
- failure observability

## 2. Non-Goals

C16 does not:

- implement packet forwarding
- implement mesh routing runtime
- implement relay node behavior
- implement VPN/IP tunnel traffic
- implement QUIC/WebRTC
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior

It defines the node-channel lifecycle boundary only.

## 3. Trust Foundation

Every node-to-node channel must be authenticated.

Required identity inputs:

- cluster id
- local node id
- remote node id
- local node certificate
- remote node certificate
- cluster trust roots
- revocation metadata
- role assignment snapshot
- allowed peer relationship
- route/channel authorization policy

Private keys remain local to the node. The control plane must never store node
private keys.

## 4. mTLS Certificate Requirements

Node certificates must be cluster-scoped.

Certificate identity should bind:

- node id
- cluster id
- certificate serial
- validity period
- key usage for node-to-node transport
- optional role/service constraints where practical

Validation must check:

- certificate chain
- cluster trust root
- certificate validity time
- node id binding
- cluster id binding
- expected remote node id
- revocation status
- key usage
- policy scope

A valid TLS certificate is necessary but not sufficient. The channel must also
pass role, peer, route, and channel authorization.
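
The validation checks above could be sketched as follows. This is a minimal
illustration, not a design decision: `NodeCertIdentity`,
`validate_peer_certificate`, and the `node_transport` key-usage label are
hypothetical names, and chain verification is represented by a precomputed flag
rather than real X.509 processing. Policy scope is left to channel
authorization and is not modeled here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class NodeCertIdentity:
    """Hypothetical, already-parsed view of a node certificate."""
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    key_usages: FrozenSet[str]
    chain_trusted: bool  # assumed result of chain check against cluster trust roots


def validate_peer_certificate(
    cert: NodeCertIdentity,
    expected_cluster_id: str,
    expected_node_id: str,
    revoked_serials: FrozenSet[str],
    now: datetime,
) -> Optional[str]:
    """Return None when every check passes, else a failure reason string."""
    if not cert.chain_trusted:
        return "certificate_invalid"
    if not (cert.not_before <= now <= cert.not_after):
        return "certificate_invalid"
    if "node_transport" not in cert.key_usages:  # assumed usage label
        return "certificate_invalid"
    if cert.cluster_id != expected_cluster_id:
        return "wrong_cluster"
    if cert.node_id != expected_node_id:
        return "wrong_node"
    if cert.serial in revoked_serials:
        return "certificate_revoked"
    return None
```

A `None` result only means the certificate is acceptable. Role, peer, route, and
channel authorization still have to pass before the channel is usable.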

## 5. Channel Establishment Flow

Proposed logical flow:

1. Routing Engine or node-agent selects an allowed peer candidate.
2. Local node checks peer directory and local policy.
3. Local node opens authenticated transport.
4. Both sides perform mTLS handshake.
5. Both sides validate certificate identity and cluster scope.
6. Both sides exchange channel hello metadata.
7. Both sides validate role, channel classes, and policy version.
8. Channel enters `established` state.
9. Heartbeat/liveness begins.
10. Channel is registered in local channel table with expiry/revalidation
    deadline.

Channel hello metadata should include:

- protocol version
- cluster id
- node id
- supported channel classes
- supported transport features
- local config version
- peer directory version
- trust bundle version
- route epoch
- draining status
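
One possible shape for the hello metadata and the step 6-7 checks is sketched
below. The field names, the `check_hello` helper, the
`protocol_version_mismatch` label, and the rule that an older trust bundle
version or route epoch counts as stale are all assumptions made for
illustration; C16 does not fix a wire format or comparison rules.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List


@dataclass(frozen=True)
class ChannelHello:
    """Hypothetical in-memory form of the channel hello metadata."""
    protocol_version: int
    cluster_id: str
    node_id: str
    channel_classes: FrozenSet[str]
    transport_features: FrozenSet[str]
    config_version: int
    peer_directory_version: int
    trust_bundle_version: int
    route_epoch: int
    draining: bool


def check_hello(local: ChannelHello, remote: ChannelHello,
                expected_remote_node_id: str) -> List[str]:
    """Return the list of problems found in the remote hello (empty means OK)."""
    problems: List[str] = []
    if remote.protocol_version != local.protocol_version:
        problems.append("protocol_version_mismatch")  # assumed label
    if remote.cluster_id != local.cluster_id:
        problems.append("wrong_cluster")
    if remote.node_id != expected_remote_node_id:
        problems.append("wrong_node")
    if remote.trust_bundle_version < local.trust_bundle_version:
        problems.append("trust_bundle_stale")  # assumed: older version is stale
    if remote.route_epoch < local.route_epoch:
        problems.append("route_epoch_stale")  # assumed: older epoch is stale
    if remote.draining:
        problems.append("peer_draining")
    if not (local.channel_classes & remote.channel_classes):
        problems.append("channel_class_denied")  # no class in common
    return problems
```

The hello values are claims made by the peer. They should be checked against
the identity already proven by the mTLS handshake, not trusted in its place.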

## 6. Channel States

Initial state machine:

- `idle`
- `connecting`
- `handshaking`
- `authenticating`
- `authorizing`
- `established`
- `revalidating`
- `degraded`
- `draining`
- `closing`
- `closed`
- `failed`

State rules:

- no traffic before `established`
- control/liveness may continue in `revalidating`
- new non-essential traffic should stop in `draining`
- channel must close on failed authentication
- channel must close or degrade on failed reauthorization according to policy
- terminal `closed`/`failed` channels must not be reused
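
The states and rules above could be encoded as a small transition table plus a
traffic gate. C16 lists the states but not the edges, so the transition table
here is an assumption, as is the choice to allow only control traffic in
`revalidating` and `degraded`. `can_transition` and `may_start_traffic` are
hypothetical helper names.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState

# Assumed edges; terminal states have no outgoing transitions, so they
# cannot be reused.
TRANSITIONS = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),
    S.FAILED: set(),
}


def can_transition(current: ChannelState, target: ChannelState) -> bool:
    return target in TRANSITIONS[current]


def may_start_traffic(state: ChannelState, is_control: bool,
                      is_essential: bool = False) -> bool:
    """Gate for starting new traffic on a channel in the given state."""
    if state is S.ESTABLISHED:
        return True
    if state in (S.REVALIDATING, S.DEGRADED):
        return is_control  # assumption: data waits until revalidation completes
    if state is S.DRAINING:
        return is_control or is_essential
    return False  # nothing before `established`, nothing after closing starts
```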

## 7. Channel Classes

Allowed channel classes map to Fabric routing classes, not service-specific
protocol internals.

Initial channel classes:

- `fabric_control`
- `route_control`
- `health`
- `telemetry`
- `render`
- `input`
- `clipboard`
- `file_transfer`
- `storage_fetch`
- `update_fetch`
- `vpn_packet`

Authorization is per channel class.

Rules:

- `input` and `fabric_control` require high-priority scheduling.
- `render` and video-like traffic may be droppable/latest-only.
- `file_transfer`, `storage_fetch`, and `update_fetch` must not starve
  `input` or control.
- `vpn_packet` must be QoS-limited so bulk traffic cannot starve interactive
  channels.
- A channel may carry only classes authorized by local policy and route result.
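
As a rough illustration of these rules, the sketch below combines per-class
authorization, strict priority ordering, and latest-only handling for `render`.
The numeric priorities, the `ClassScheduler` name, and the use of strict
priority are assumptions. Strict priority alone can starve bulk classes, and
the rate limiting required for `vpn_packet` is not shown.

```python
import heapq
import itertools

# Assumed priority tiers: lower number is served first.
PRIORITY = {
    "fabric_control": 0, "input": 0,
    "route_control": 1, "health": 1,
    "telemetry": 2, "render": 2, "clipboard": 2,
    "file_transfer": 3, "storage_fetch": 3, "update_fetch": 3, "vpn_packet": 3,
}
LATEST_ONLY = frozenset({"render"})


class ClassScheduler:
    """Hypothetical per-channel send queue."""

    def __init__(self, authorized_classes):
        self._authorized = frozenset(authorized_classes)
        self._heap = []
        self._seq = itertools.count()
        self._latest = {}  # class -> sequence number of its newest item

    def enqueue(self, channel_class, payload):
        if channel_class not in self._authorized:
            raise PermissionError("channel_class_denied")
        seq = next(self._seq)
        if channel_class in LATEST_ONLY:
            self._latest[channel_class] = seq
        # seq is unique, so tuple comparison never reaches the payload.
        heapq.heappush(self._heap, (PRIORITY[channel_class], seq, channel_class, payload))

    def dequeue(self):
        """Return (class, payload) for the next item, or None when empty."""
        while self._heap:
            _, seq, channel_class, payload = heapq.heappop(self._heap)
            if channel_class in LATEST_ONLY and self._latest.get(channel_class) != seq:
                continue  # superseded by a newer item of the same class
            return channel_class, payload
        return None
```

With `input`, `render`, and `file_transfer` authorized, an `input` item queued
last is still sent first, and only the newest `render` item is sent.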

## 8. Channel Authorization

Authorization checks:

- local node is allowed to connect to remote node
- remote node is allowed to accept the connection
- cluster id matches
- roles are compatible
- route result or peer policy permits the relationship
- requested channel classes are allowed
- organization/service scope is allowed where applicable
- partition/degraded state permits the channel
- remote node is not revoked, disabled, or disallowed
- certificate is not expired or revoked

Authorization must be repeated when:

- trust bundle changes
- revocation list changes
- role assignment changes
- route policy changes
- route epoch changes
- channel is long-lived past revalidation interval
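
The reauthorization triggers above can be detected by comparing the inputs
recorded at the last authorization with the current ones. This is a sketch
under assumptions: `AuthzInputs`, its version counters, and
`reauthorization_reasons` are hypothetical, and C16 does not say that these
inputs are versioned as integers.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class AuthzInputs:
    """Hypothetical versions of the inputs used by the last authorization."""
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def reauthorization_reasons(
    at_last_check: AuthzInputs,
    current: AuthzInputs,
    seconds_since_last_check: float,
    revalidation_interval_s: float,
) -> List[str]:
    """Return why authorization must be repeated (empty list means not yet)."""
    reasons = [
        f.name
        for f in fields(AuthzInputs)
        if getattr(at_last_check, f.name) != getattr(current, f.name)
    ]
    if seconds_since_last_check >= revalidation_interval_s:
        reasons.append("revalidation_interval_elapsed")
    return reasons
```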

## 9. Heartbeat and Liveness

Heartbeats prove liveness, not authority.

Heartbeat metadata:

- channel id
- local node id
- remote node id
- timestamp
- sequence
- observed latency
- packet loss/jitter summary where available
- local health hint
- draining flag
- config version
- route epoch

Recommended heartbeat cadence:

- active control channels: 5-15 seconds
- high-priority realtime channels: 2-10 seconds where needed
- low-priority/storage channels: 15-60 seconds

Missing heartbeats should trigger:

1. suspicion state
2. bounded retry
3. route failover consideration
4. channel close/failure
5. health report
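
A minimal liveness tracker for the escalation above might look like this. The
`LivenessTracker` name and the thresholds (suspect after 2 missed intervals,
failed after 4) are assumptions chosen for the example, not values set by C16.
Retry, failover consideration, and health reporting would hang off the status
changes and are not shown.

```python
class LivenessTracker:
    """Hypothetical per-channel heartbeat tracker using caller-supplied clocks."""

    def __init__(self, interval_s: float, started_at: float,
                 suspect_after: int = 2, fail_after: int = 4):
        self.interval_s = interval_s
        self.last_seen = started_at
        self.suspect_after = suspect_after  # assumed threshold
        self.fail_after = fail_after        # assumed threshold

    def on_heartbeat(self, now: float) -> None:
        self.last_seen = now

    def missed_intervals(self, now: float) -> int:
        return int((now - self.last_seen) // self.interval_s)

    def status(self, now: float) -> str:
        missed = self.missed_intervals(now)
        if missed >= self.fail_after:
            return "failed"   # close the channel, reason `heartbeat_timeout`
        if missed >= self.suspect_after:
            return "suspect"  # begin bounded retry, consider route failover
        return "alive"
```

A fresh heartbeat only resets liveness. It does not replace revalidation,
because heartbeats prove liveness, not authority.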

## 10. Reconnect and Backoff

Reconnect must be bounded and policy-aware.

Rules:

- use exponential backoff with jitter
- do not stampede bootstrap peers
- prefer warm candidates after active peer failure
- stop reconnect when peer is revoked or policy disallows it
- report repeated failures
- preserve route stickiness only while healthy and authorized
- avoid reconnect loops during draining or shutdown

Reconnect should use current peer cache and route policy, not stale hardcoded
endpoints.
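
Bounded exponential backoff with jitter can be sketched as below. The base
delay, cap, attempt limit, and the "full jitter" variant (a uniform draw
between zero and the exponential ceiling) are assumptions; C16 only requires
that backoff be exponential, jittered, and bounded. `backoff_delays` and
`should_reconnect` are hypothetical helper names.

```python
import random
from typing import Iterator, Optional


def backoff_delays(base_s: float = 1.0, cap_s: float = 60.0,
                   max_attempts: int = 6,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one delay per reconnect attempt, then stop (bounded retry)."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter spreads retries so peers do not reconnect in lockstep.
        yield rng.uniform(0.0, ceiling)


def should_reconnect(peer_revoked: bool, policy_allows: bool,
                     draining_or_shutdown: bool) -> bool:
    """Policy gate evaluated before every attempt, not only the first."""
    return policy_allows and not peer_revoked and not draining_or_shutdown
```

When the generator is exhausted, the caller would report `backoff_exhausted`
and move on to a warm candidate rather than loop on the same peer.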

## 11. Revalidation

Long-lived channels must revalidate periodically.

Revalidation checks:

- certificate still valid
- revocation status current enough
- cluster trust root still valid
- peer relationship still allowed
- channel classes still allowed
- route epoch/policy version still acceptable
- role assignments still active

If revalidation fails:

- stop accepting new traffic
- drain or close according to policy
- report reason
- trigger route failover where applicable
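
The failure handling above could be expressed as a pure decision function. The
`RevalidationOutcome` type, the check names passed in by the caller, and the
`close_on_failure` policy switch are hypothetical. How each individual check is
evaluated is outside this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RevalidationOutcome:
    accept_new_traffic: bool
    next_state: str
    reasons: List[str]        # reported failure reasons
    trigger_failover: bool


def revalidate(checks: Dict[str, bool], close_on_failure: bool,
               has_alternate_route: bool) -> RevalidationOutcome:
    """Map named check results to the action the channel should take."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if not failed:
        return RevalidationOutcome(True, "established", [], False)
    return RevalidationOutcome(
        accept_new_traffic=False,
        next_state="closing" if close_on_failure else "draining",
        reasons=failed,
        trigger_failover=has_alternate_route,
    )
```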

## 12. Draining and Graceful Shutdown

Draining supports maintenance and safe role removal.

Draining flow:

1. node enters draining state
2. node advertises draining in heartbeat/channel metadata
3. routing stops placing new flows on the node
4. existing flows continue until TTL or policy deadline
5. new non-essential channels are rejected
6. channel closes after active work drains or deadline expires
7. node reports drained status

Draining must not silently drop critical control messages.

If graceful drain fails, policy decides whether to force-close and failover.
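
Steps 4 to 6 of the draining flow amount to a deadline plus a flow counter. The
sketch below assumes a hypothetical `DrainController` with caller-supplied
clock values. Advertising the draining flag and reporting drained status are
left out.

```python
from typing import Optional


class DrainController:
    """Hypothetical tracker for one draining node or channel."""

    def __init__(self, active_flows: int):
        self.active_flows = active_flows
        self.draining = False
        self.deadline: Optional[float] = None

    def begin(self, now: float, grace_s: float) -> None:
        self.draining = True
        self.deadline = now + grace_s  # policy deadline for existing flows

    def accepts_new_channel(self, essential: bool) -> bool:
        # New non-essential channels are rejected once draining starts.
        return essential or not self.draining

    def flow_finished(self) -> None:
        self.active_flows = max(0, self.active_flows - 1)

    def should_close(self, now: float) -> bool:
        if not self.draining or self.deadline is None:
            return False
        return self.active_flows == 0 or now >= self.deadline
```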

## 13. Trust Rotation

Trust rotation must avoid split trust windows.

Recommended flow:

1. new trust bundle is signed by current trusted key
2. nodes fetch and verify new trust bundle
3. dual validation period begins where required
4. new certificates are issued/accepted
5. old certificates expire or are revoked
6. old trust root is retired after rollout threshold

Channels should revalidate after trust bundle changes.
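
The dual validation period can be modeled as a trust store that temporarily
holds both roots. This sketch uses root identifiers in place of real keys and
performs no signature verification; `TrustStore` and its methods are
hypothetical names used only for illustration.

```python
class TrustStore:
    """Hypothetical node-local view of the cluster trust roots."""

    def __init__(self, root_id: str):
        self.active = {root_id}
        self.bundle_version = 1

    def stage_new_root(self, new_root_id: str, signed_by: str) -> None:
        # Step 1: the new bundle must be vouched for by a currently trusted root.
        if signed_by not in self.active:
            raise ValueError("bundle not signed by a trusted root")
        self.active.add(new_root_id)  # dual validation period begins
        self.bundle_version += 1

    def retire(self, old_root_id: str) -> None:
        # Step 6: never retire the only remaining root.
        if old_root_id not in self.active or len(self.active) <= 1:
            raise ValueError("cannot retire this root")
        self.active.discard(old_root_id)
        self.bundle_version += 1

    def trusts(self, root_id: str) -> bool:
        return root_id in self.active
```

Each change to `bundle_version` is the signal for established channels to
revalidate.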

## 14. Revocation Handling

Revocation must affect active channels.

Revocation inputs:

- signed revocation list
- trust bundle update
- control-plane status after reconnect
- emergency revocation policy

On revocation of remote node/certificate/key:

- stop new channels
- mark existing channels as revalidation failed
- close or drain according to policy severity
- remove peer from eligible active/warm candidates
- report and audit event

High-severity revocation should close immediately.
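
Applied to a local channel table, the revocation steps could look like the
sketch below. The `Channel` record, the candidate set, and the audit event
shape are assumptions, and verification of the signed revocation list itself
is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Channel:
    """Hypothetical entry in the local channel table."""
    channel_id: str
    remote_node_id: str
    state: str = "established"


def apply_revocation(channels: List[Channel], candidates: Set[str],
                     revoked_node_id: str, high_severity: bool) -> List[Dict[str, str]]:
    """Update channels and candidates in place; return audit events."""
    candidates.discard(revoked_node_id)  # no new channels to this peer
    events: List[Dict[str, str]] = []
    for ch in channels:
        if ch.remote_node_id != revoked_node_id:
            continue
        if ch.state in ("closed", "failed"):
            continue  # terminal channels need no action
        ch.state = "closing" if high_severity else "draining"
        events.append({
            "event": "revocation_applied",
            "channel_id": ch.channel_id,
            "remote_node_id": ch.remote_node_id,
            "action": ch.state,
        })
    return events
```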

## 15. Partition and Degraded Behavior

In degraded mode, channels may continue only if:

- current signed snapshot permits it
- certificates remain valid
- revocation state is not known to reject the peer
- route/channel policy permits degraded continuation
- TTL has not expired

Degraded mode must not authorize:

- new node enrollment
- new trust roots
- role changes
- cross-cluster trust changes
- partition promotion
- new high-risk channels without policy
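
Both lists reduce to simple predicates. In this sketch the action labels and
function names are hypothetical, and the last forbidden item (new high-risk
channels without policy) is not modeled because it depends on policy content
that C16 does not define.

```python
# Assumed labels for the mutations that degraded mode must never authorize.
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "new_trust_root",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})


def may_continue_in_degraded(snapshot_permits: bool, certificates_valid: bool,
                             peer_known_revoked: bool, policy_permits: bool,
                             ttl_expires_at: float, now: float) -> bool:
    """All five continuation conditions must hold at the same time."""
    return (
        snapshot_permits
        and certificates_valid
        and not peer_known_revoked
        and policy_permits
        and now < ttl_expires_at
    )


def action_allowed(action: str, degraded: bool) -> bool:
    """Reject high-risk mutations whenever the node is in degraded mode."""
    return not (degraded and action in FORBIDDEN_IN_DEGRADED)
```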

## 16. Failure Classification

Failure reasons:

- `tls_handshake_failed`
- `certificate_invalid`
- `certificate_revoked`
- `wrong_cluster`
- `wrong_node`
- `policy_denied`
- `channel_class_denied`
- `route_epoch_stale`
- `heartbeat_timeout`
- `peer_draining`
- `peer_disabled`
- `trust_bundle_stale`
- `network_unreachable`
- `backoff_exhausted`

Failures should be structured and safe to log.
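
One way to keep failures structured and safe to log is a closed enum of reasons
plus a record that carries identifiers only. The enum values are the reasons
listed above; the `failure_event` helper and the record fields are assumptions.

```python
import json
from enum import Enum


class FailureReason(str, Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"


def failure_event(reason: FailureReason, channel_id: str,
                  remote_node_id: str) -> str:
    """Serialize a failure as one JSON line with a fixed set of fields.

    Only identifiers and the reason code are included, so no key material,
    certificate bytes, or payload data can reach the log.
    """
    record = {
        "event": "channel_failure",
        "reason": reason.value,
        "channel_id": channel_id,
        "remote_node_id": remote_node_id,
    }
    return json.dumps(record, sort_keys=True)
```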

## 17. Observability

Node-agent should report:

- channel state
- active channel count
- channel classes in use
- handshake failures
- authorization failures
- heartbeat latency
- reconnect count
- backoff state
- draining state
- revocation actions
- revalidation failures
- route epoch/policy version

Tenant views must not expose internal topology. Platform owner views may show
full channel diagnostics according to audited policy.
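
The tenant/platform split can be enforced with an allow-list applied when a
view is built. Which fields count as tenant-safe is an assumption made for this
sketch, as are the viewer labels and the `channel_view` helper. The audit of
platform owner access is not shown.

```python
from typing import Any, Dict

# Assumed allow-list: fields that reveal no peer identity or topology.
TENANT_SAFE_FIELDS = frozenset({"channel_state", "channel_classes"})


def channel_view(diagnostics: Dict[str, Any], viewer: str) -> Dict[str, Any]:
    """Return the diagnostics a given viewer may see.

    Unknown viewers get the tenant view, so the default is the safe one.
    """
    if viewer == "platform_owner":
        return dict(diagnostics)
    return {k: v for k, v in diagnostics.items() if k in TENANT_SAFE_FIELDS}
```

An allow-list is used instead of a deny-list so that diagnostic fields added
later stay hidden from tenants until someone explicitly marks them safe.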

## 18. Security Requirements

Required:

- mTLS for node-to-node channels
- cluster-scoped node certificates
- certificate revocation support
- policy-scoped channel authorization
- no unauthenticated peer enumeration
- no channel use before authorization
- channel class separation
- QoS-aware scheduling expectations
- structured audit for high-risk channel changes

Compromised node blast radius must be limited by:

- scoped certificates
- scoped snapshots
- role assignment
- peer directory scope
- channel authorization
- revocation
- topology hiding

## 19. Future Validation Tests

Future implementation tests must prove:

- valid node-to-node mTLS succeeds
- wrong cluster certificate rejected
- wrong node id rejected
- expired certificate rejected
- revoked certificate closes active channel
- unauthorized channel class rejected
- channel cannot carry traffic before authorization
- heartbeat timeout triggers failure
- draining stops new channels
- trust rotation revalidates channels
- degraded mode honors TTL and forbidden actions
- tenant-safe views hide topology

## 20. C17 Preparation

C17 may plan mesh routing runtime only after C10-C16 are accepted.

C17 must use:

- signed snapshots
- node-local state store
- Fabric Storage / Config Storage
- peer directory/cache
- Fabric Routing Engine route results
- secure node-to-node channels

C17 must not jump directly to broad production mesh. It should first define a
minimal runtime implementation plan, test topology, rollback path, and go/no-go
criteria.

## 21. Result / Decision

Stage C16 defines secure node-to-node channels as authenticated,
policy-authorized, lifecycle-managed connections.

Decisions:

- mTLS is required for node-to-node channels.
- Certificate validity is necessary but not sufficient; channel policy must
  authorize role, peer relationship, route, and channel classes.
- Active channels must revalidate on trust, revocation, role, and route policy
  changes.
- Draining is a first-class lifecycle state.
- Revocation affects active channels.
- Degraded mode is bounded and cannot authorize high-risk mutations.
- C17 must plan mesh routing runtime using C10-C16 foundations.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C16.