Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
# Secure Node-to-Node Channel Lifecycle
Status: Stage C16 result. Documentation and architecture only.
This document defines the secure node-to-node channel lifecycle for the Secure
Access Fabric. It does not implement code, migrations, APIs, mesh runtime
traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
workload execution.
## 1. Purpose
Secure node-to-node channels are the future authenticated transport foundation
for Fabric routes. They must exist as a trust and lifecycle model before any
production mesh routing runtime carries traffic.
C16 defines:
- mTLS identity validation
- connection establishment
- channel authorization
- lifecycle state
- heartbeat and liveness
- reconnect/backoff
- draining
- revalidation
- trust rotation
- revocation handling
- failure observability
## 2. Non-Goals
C16 does not:
- implement packet forwarding
- implement mesh routing runtime
- implement relay node behavior
- implement VPN/IP tunnel traffic
- implement QUIC/WebRTC
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior
It defines the node-channel lifecycle boundary only.
## 3. Trust Foundation
Every node-to-node channel must be authenticated.
Required identity inputs:
- cluster id
- local node id
- remote node id
- local node certificate
- remote node certificate
- cluster trust roots
- revocation metadata
- role assignment snapshot
- allowed peer relationship
- route/channel authorization policy
Private keys remain local to the node. The control plane must never store node
private keys.
## 4. mTLS Certificate Requirements
Node certificates must be cluster-scoped.
Certificate identity should bind:
- node id
- cluster id
- certificate serial
- validity period
- key usage for node-to-node transport
- optional role/service constraints where practical
Validation must check:
- certificate chain
- cluster trust root
- certificate validity time
- node id binding
- cluster id binding
- expected remote node id
- revocation status
- key usage
- policy scope
A valid TLS certificate is necessary but not sufficient. The channel must also
pass role, peer, route, and channel authorization.
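A minimal sketch of the identity checks listed above, assuming the TLS layer has already verified the chain against the cluster trust root. `NodeCertInfo`, `validate_identity`, and the `node_transport` usage label are hypothetical names for illustration; the returned reasons reuse the failure names from section 16.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class NodeCertInfo:
    """Hypothetical parsed view of a node certificate (not a real X.509 API)."""
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    key_usages: frozenset


def validate_identity(cert, expected_node_id, expected_cluster_id,
                      revoked_serials, now) -> Optional[str]:
    """Return None if all identity checks pass, else a failure reason."""
    if not (cert.not_before <= now <= cert.not_after):
        return "certificate_invalid"
    if cert.serial in revoked_serials:
        return "certificate_revoked"
    if cert.cluster_id != expected_cluster_id:
        return "wrong_cluster"
    if cert.node_id != expected_node_id:
        return "wrong_node"
    if "node_transport" not in cert.key_usages:  # assumed usage label
        return "certificate_invalid"
    return None
```

A `None` result only means the certificate identity is acceptable; role, peer, route, and channel authorization still have to pass separately.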
## 5. Channel Establishment Flow
Proposed logical flow:
1. Routing Engine or node-agent selects an allowed peer candidate.
2. Local node checks peer directory and local policy.
3. Local node opens authenticated transport.
4. Both sides perform mTLS handshake.
5. Both sides validate certificate identity and cluster scope.
6. Both sides exchange channel hello metadata.
7. Both sides validate role, channel classes, and policy version.
8. Channel enters `established` state.
9. Heartbeat/liveness begins.
10. Channel is registered in local channel table with expiry/revalidation
deadline.
Channel hello metadata should include:
- protocol version
- cluster id
- node id
- supported channel classes
- supported transport features
- local config version
- peer directory version
- trust bundle version
- route epoch
- draining status
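The hello metadata could be modeled roughly as follows. This is a sketch only: the field names, the types, and the rule that protocol versions must match exactly are assumptions, not decisions made by C16.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelHello:
    """Hypothetical shape of the channel hello message."""
    protocol_version: int
    cluster_id: str
    node_id: str
    channel_classes: frozenset
    transport_features: frozenset
    config_version: int
    peer_directory_version: int
    trust_bundle_version: int
    route_epoch: int
    draining: bool = False


def check_hello(local, remote, expected_remote_node_id):
    """Return the channel classes both sides support, or raise ValueError."""
    if remote.protocol_version != local.protocol_version:  # assumed rule
        raise ValueError("protocol version mismatch")
    if remote.cluster_id != local.cluster_id:
        raise ValueError("wrong_cluster")
    if remote.node_id != expected_remote_node_id:
        raise ValueError("wrong_node")
    return local.channel_classes & remote.channel_classes
```

The returned set is only the overlap in supported classes; policy still decides which of those classes the channel may actually carry.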
## 6. Channel States
Initial state machine:
- `idle`
- `connecting`
- `handshaking`
- `authenticating`
- `authorizing`
- `established`
- `revalidating`
- `degraded`
- `draining`
- `closing`
- `closed`
- `failed`
State rules:
- no traffic before `established`
- control/liveness may continue in `revalidating`
- new non-essential traffic should stop in `draining`
- channel must close on failed authentication
- channel must close or degrade on failed reauthorization according to policy
- terminal `closed`/`failed` channels must not be reused
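One possible encoding of the state machine is sketched below. The state names come from the list above, but the transition table itself is an assumption made for illustration; C16 does not fix the exact edges.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState
# Assumed transition table; terminal states have no outgoing edges.
TRANSITIONS = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),
    S.FAILED: set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError for an illegal move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def may_carry_traffic(state):
    """No traffic before `established` or in closing/terminal states.

    Per-class limits (for example control-only while revalidating) are not modeled.
    """
    return state in {S.ESTABLISHED, S.REVALIDATING, S.DEGRADED, S.DRAINING}
```

Because `closed` and `failed` have empty transition sets, a terminal channel cannot be reused; a new channel has to start again from `idle`.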
## 7. Channel Classes
Allowed channel classes map to Fabric routing classes, not service-specific
protocol internals.
Initial channel classes:
- `fabric_control`
- `route_control`
- `health`
- `telemetry`
- `render`
- `input`
- `clipboard`
- `file_transfer`
- `storage_fetch`
- `update_fetch`
- `vpn_packet`
Authorization is per channel class.
Rules:
- `input` and `fabric_control` require high-priority scheduling.
- `render` and video-like traffic may be droppable/latest-only.
- `file_transfer`, `storage_fetch`, and `update_fetch` must not starve
`input` or control.
- `vpn_packet` must be QoS-limited so bulk traffic cannot starve interactive
channels.
- A channel may carry only classes authorized by local policy and route result.
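The last rule can be read as a set intersection, and the priority rules as a tier map. The sketch below assumes three tiers; only the high tier (`input`, `fabric_control`) and the bulk tier follow from the rules above, and the placement of the remaining classes is a guess.

```python
HIGH, NORMAL, BULK = 0, 1, 2

# Assumed tiers. Middle-tier placement is illustrative only.
CLASS_PRIORITY = {
    "fabric_control": HIGH, "input": HIGH,
    "route_control": NORMAL, "health": NORMAL, "telemetry": NORMAL,
    "render": NORMAL, "clipboard": NORMAL,
    "file_transfer": BULK, "storage_fetch": BULK, "update_fetch": BULK,
    "vpn_packet": BULK,
}


def authorized_classes(requested, local_policy, route_result):
    """Classes a channel may carry: requested AND policy AND route result."""
    unknown = set(requested) - set(CLASS_PRIORITY)
    if unknown:
        raise ValueError(f"unknown channel classes: {sorted(unknown)}")
    return set(requested) & set(local_policy) & set(route_result)


def send_order(classes):
    """Order classes so higher-priority tiers are served first."""
    return sorted(classes, key=lambda c: (CLASS_PRIORITY[c], c))
```

A real scheduler would also need rate limits or weighted sharing so that lower tiers still make progress; strict ordering is used here only to keep the example short.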
## 8. Channel Authorization
Authorization checks:
- local node is allowed to connect to remote node
- remote node is allowed to accept the connection
- cluster id matches
- roles are compatible
- route result or peer policy permits the relationship
- requested channel classes are allowed
- organization/service scope is allowed where applicable
- partition/degraded state permits the channel
- remote node is not revoked, disabled, or disallowed
- certificate is not expired or revoked
Authorization must be repeated when:
- trust bundle changes
- revocation list changes
- role assignment changes
- route policy changes
- route epoch changes
- channel has stayed open past its revalidation interval
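The triggers above could be detected by comparing the versions recorded at authorization time with the current ones. This is a sketch under the assumption that each input carries a monotonically changing version number; `AuthInputs` and its field names are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AuthInputs:
    """Hypothetical versions of the inputs an authorization decision used."""
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def reauthorization_reasons(authorized_with, current, age_seconds,
                            revalidation_interval_seconds):
    """List why a channel must be reauthorized; an empty list means no trigger."""
    reasons = [
        f.name for f in fields(AuthInputs)
        if getattr(authorized_with, f.name) != getattr(current, f.name)
    ]
    if age_seconds >= revalidation_interval_seconds:
        reasons.append("revalidation_interval_elapsed")
    return reasons
```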
## 9. Heartbeat and Liveness
Heartbeats prove liveness, not authority.
Heartbeat metadata:
- channel id
- local node id
- remote node id
- timestamp
- sequence
- observed latency
- packet loss/jitter summary where available
- local health hint
- draining flag
- config version
- route epoch
Recommended heartbeat cadence:
- active control channels: 5-15 seconds
- high-priority realtime channels: 2-10 seconds where needed
- low-priority/storage channels: 15-60 seconds
Missing heartbeats should trigger:
1. suspicion state
2. bounded retry
3. route failover consideration
4. channel close/failure
5. health report
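As an illustration of the escalation above, missed heartbeats can be counted in whole intervals and mapped to a status. The thresholds (`suspect_after=2`, `fail_after=4`) and the function name are assumptions; C16 does not set them.

```python
def liveness_status(last_heartbeat_at, now, interval,
                    suspect_after=2, fail_after=4):
    """Classify a channel by how many whole heartbeat intervals were missed.

    Times are seconds on a shared monotonic clock; thresholds are assumed.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    missed = int((now - last_heartbeat_at) // interval)
    if missed < suspect_after:
        return "healthy"
    if missed < fail_after:
        return "suspect"  # bounded retry and failover consideration happen here
    return "heartbeat_timeout"  # close/fail the channel and report health
```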
## 10. Reconnect and Backoff
Reconnect must be bounded and policy-aware.
Rules:
- use exponential backoff with jitter
- do not stampede bootstrap peers
- prefer warm candidates after active peer failure
- stop reconnect when peer is revoked or policy disallows it
- report repeated failures
- preserve route stickiness only while healthy and authorized
- avoid reconnect loops during draining or shutdown
Reconnect should use current peer cache and route policy, not stale hardcoded
endpoints.
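A minimal sketch of bounded exponential backoff with full jitter, plus the stop conditions from the rules above. The base, cap, and attempt limit are placeholder values, and the function names are hypothetical.

```python
import random


def backoff_delays(base=1.0, cap=60.0, max_attempts=6, rng=None):
    """Yield one randomized delay (seconds) per reconnect attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    which spreads retries out so peers are not hit in lockstep.
    """
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


def should_reconnect(peer_revoked, policy_allows, draining_or_shutdown):
    """Stop reconnecting when the peer is revoked, disallowed, or we are draining."""
    return policy_allows and not peer_revoked and not draining_or_shutdown
```

When the generator is exhausted, the caller would report `backoff_exhausted` rather than retry indefinitely.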
## 11. Revalidation
Long-lived channels must revalidate periodically.
Revalidation checks:
- certificate still valid
- revocation data recent enough to trust under policy
- cluster trust root still valid
- peer relationship still allowed
- channel classes still allowed
- route epoch/policy version still acceptable
- role assignments still active
If revalidation fails:
- stop accepting new traffic
- drain or close according to policy
- report reason
- trigger route failover where applicable
## 12. Draining and Graceful Shutdown
Draining supports maintenance and safe role removal.
Draining flow:
1. node enters draining state
2. node advertises draining in heartbeat/channel metadata
3. routing stops placing new flows on the node
4. existing flows continue until TTL or policy deadline
5. new non-essential channels are rejected
6. channel closes after active work drains or deadline expires
7. node reports drained status
Draining must not silently drop critical control messages.
If graceful drain fails, policy decides whether to force-close and fail over.
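The drain decision can be sketched as two small checks. Which classes count as essential is an assumption here (`fabric_control` and `health`), as are the function names and return values.

```python
# Assumed set of classes still accepted while draining.
ESSENTIAL_CLASSES = frozenset({"fabric_control", "health"})


def accepts_new_channel(node_draining, channel_class):
    """While draining, reject new channels unless the class is essential."""
    return (not node_draining) or channel_class in ESSENTIAL_CLASSES


def drain_decision(active_flows, now, deadline):
    """Decide the next drain step for one channel."""
    if active_flows == 0:
        return "close"
    if now >= deadline:
        return "deadline_expired"  # policy decides: force-close and fail over, or not
    return "keep_draining"
```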
## 13. Trust Rotation
Trust rotation must avoid split trust windows.
Recommended flow:
1. new trust bundle is signed by current trusted key
2. nodes fetch and verify new trust bundle
3. dual validation period begins where required
4. new certificates are issued/accepted
5. old certificates expire or are revoked
6. old trust root is retired after rollout threshold
Channels should revalidate after trust bundle changes.
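The dual validation window can be pictured as a trust state that temporarily accepts two roots. In this sketch, plain string identifiers stand in for real keys and signatures, and `TrustState` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrustState:
    """Tracks the current root and, during rotation, the incoming root."""
    current_root: str
    next_root: Optional[str] = None

    def begin_dual_validation(self, new_root, signed_by):
        # The new bundle must be signed by the currently trusted key.
        if signed_by != self.current_root:
            raise ValueError("new trust bundle not signed by current trusted key")
        self.next_root = new_root

    def accepted_roots(self):
        roots = {self.current_root}
        if self.next_root is not None:
            roots.add(self.next_root)
        return roots

    def retire_old_root(self):
        # Called only after the rollout threshold is reached.
        if self.next_root is None:
            raise ValueError("no rotation in progress")
        self.current_root, self.next_root = self.next_root, None
```

Each of the three state changes is a natural point to trigger channel revalidation.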
## 14. Revocation Handling
Revocation must affect active channels.
Revocation inputs:
- signed revocation list
- trust bundle update
- control-plane status after reconnect
- emergency revocation policy
On revocation of remote node/certificate/key:
- stop new channels
- mark existing channels as revalidation failed
- close or drain according to policy severity
- remove peer from eligible active/warm candidates
- report and audit event
High-severity revocation should close immediately.
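A rough sketch of applying a revocation to local state follows. The table shapes, the two-level severity, and the choice to drain on anything below `high` are assumptions for illustration.

```python
def apply_revocation(channels, candidates, revoked_node_id, severity):
    """Apply a revocation to the local channel table and candidate set.

    channels:   {channel_id: {"remote": node_id, "state": state_name}}
    candidates: set of node ids eligible as active/warm peers
    Returns a list of audit events (hypothetical shape).
    """
    events = []
    candidates.discard(revoked_node_id)  # no new channels to this peer
    for channel_id, channel in channels.items():
        if channel["remote"] != revoked_node_id:
            continue
        # Assumed policy: high severity closes at once, anything else drains.
        channel["state"] = "closing" if severity == "high" else "draining"
        events.append({
            "event": "revocation_applied",
            "channel_id": channel_id,
            "reason": "certificate_revoked",
            "action": channel["state"],
        })
    return events
```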
## 15. Partition and Degraded Behavior
In degraded mode, channels may continue only if:
- current signed snapshot permits it
- certificates remain valid
- no known revocation applies to the peer
- route/channel policy permits degraded continuation
- TTL has not expired
Degraded mode must not authorize:
- new node enrollment
- new trust roots
- role changes
- cross-cluster trust changes
- partition promotion
- new high-risk channels without policy
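Both lists translate into simple predicates. In this sketch the action names and function signatures are hypothetical; the conditions mirror the bullets above.

```python
# Actions degraded mode must never authorize (names are illustrative).
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "new_trust_root",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})


def degraded_allows_action(action):
    """Return False for mutations that degraded mode cannot authorize."""
    return action not in FORBIDDEN_IN_DEGRADED


def may_continue_channel(snapshot_permits, certs_valid, peer_known_revoked,
                         policy_permits_degraded, now, ttl_expires_at):
    """A channel continues in degraded mode only if every condition holds."""
    return (snapshot_permits and certs_valid and not peer_known_revoked
            and policy_permits_degraded and now < ttl_expires_at)
```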
## 16. Failure Classification
Failure reasons:
- `tls_handshake_failed`
- `certificate_invalid`
- `certificate_revoked`
- `wrong_cluster`
- `wrong_node`
- `policy_denied`
- `channel_class_denied`
- `route_epoch_stale`
- `heartbeat_timeout`
- `peer_draining`
- `peer_disabled`
- `trust_bundle_stale`
- `network_unreachable`
- `backoff_exhausted`
Failures should be structured and safe to log.
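One way to keep failures structured and safe to log is an enum of reasons plus a record limited to non-secret identifiers. The `ChannelFailure` shape and the JSON output are assumptions; the reason values are the ones listed above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class FailureReason(str, Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"


@dataclass(frozen=True)
class ChannelFailure:
    """Hypothetical failure record: identifiers only, no key or payload data."""
    reason: FailureReason
    channel_id: str
    remote_node_id: str

    def to_log_line(self):
        return json.dumps({
            "reason": self.reason.value,
            "channel_id": self.channel_id,
            "remote_node_id": self.remote_node_id,
        }, sort_keys=True)
```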
## 17. Observability
Node-agent should report:
- channel state
- active channel count
- channel classes in use
- handshake failures
- authorization failures
- heartbeat latency
- reconnect count
- backoff state
- draining state
- revocation actions
- revalidation failures
- route epoch/policy version
Tenant views must not expose internal topology. Platform owner views may show
full channel diagnostics according to audited policy.
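Topology hiding for tenant views could use an allow-list, so newly added fields stay hidden by default. Which fields are tenant-safe is an assumption in this sketch, as are the field names.

```python
# Assumed allow-list; anything not listed is treated as internal.
TENANT_SAFE_FIELDS = frozenset({
    "channel_state",
    "channel_classes",
    "heartbeat_latency_ms",
})


def tenant_view(report):
    """Return a copy of a channel report with only tenant-safe fields."""
    return {k: v for k, v in report.items() if k in TENANT_SAFE_FIELDS}
```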
## 18. Security Requirements
Required:
- mTLS for node-to-node channels
- cluster-scoped node certificates
- certificate revocation support
- policy-scoped channel authorization
- no unauthenticated peer enumeration
- no channel use before authorization
- channel class separation
- QoS-aware scheduling expectations
- structured audit for high-risk channel changes
Compromised node blast radius must be limited by:
- scoped certificates
- scoped snapshots
- role assignment
- peer directory scope
- channel authorization
- revocation
- topology hiding
## 19. Future Validation Tests
Future implementation tests must prove:
- valid node-to-node mTLS succeeds
- wrong-cluster certificate is rejected
- wrong node id is rejected
- expired certificate is rejected
- revoked certificate closes active channel
- unauthorized channel class is rejected
- channel cannot carry traffic before authorization
- heartbeat timeout triggers failure
- draining stops new channels
- trust rotation revalidates channels
- degraded mode honors TTL and forbidden actions
- tenant-safe views hide topology
## 20. C17 Preparation
C17 may plan mesh routing runtime only after C10-C16 are accepted.
C17 must use:
- signed snapshots
- node-local state store
- Fabric Storage / Config Storage
- peer directory/cache
- Fabric Routing Engine route results
- secure node-to-node channels
C17 must not jump directly to broad production mesh. It should first define a
minimal runtime implementation plan, test topology, rollback path, and go/no-go
criteria.
## 21. Result / Decision
Stage C16 defines secure node-to-node channels as authenticated,
policy-authorized, lifecycle-managed connections.
Decisions:
- mTLS is required for node-to-node channels.
- Certificate validity is necessary but not sufficient; channel policy must
authorize role, peer relationship, route, and channel classes.
- Active channels must revalidate on trust, revocation, role, and route policy
changes.
- Draining is a first-class lifecycle state.
- Revocation affects active channels.
- Degraded mode is bounded and cannot authorize high-risk mutations.
- C17 must plan mesh routing runtime using C10-C16 foundations.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C16.