Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,464 @@
+# Secure Node-to-Node Channel Lifecycle
+
+Status: Stage C16 result. Documentation and architecture only.
+
+This document defines the secure node-to-node channel lifecycle for the Secure
+Access Fabric. It does not implement code, migrations, APIs, mesh runtime
+traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service
+workload execution.
+
+## 1. Purpose
+
+Secure node-to-node channels are the future authenticated transport foundation
+for Fabric routes. They must exist as a trust and lifecycle model before any
+production mesh routing runtime carries traffic.
+
+C16 defines:
+
+- mTLS identity validation
+- connection establishment
+- channel authorization
+- lifecycle state
+- heartbeat and liveness
+- reconnect/backoff
+- draining
+- revalidation
+- trust rotation
+- revocation handling
+- failure observability
+
+## 2. Non-Goals
+
+C16 does not:
+
+- implement packet forwarding
+- implement mesh routing runtime
+- implement relay node behavior
+- implement VPN/IP tunnel traffic
+- implement QUIC/WebRTC
+- implement service workloads
+- change RDP runtime
+- change backend session lifecycle
+- change Windows client behavior
+
+It defines the node-channel lifecycle boundary only.
+
+## 3. Trust Foundation
+
+Every node-to-node channel must be authenticated.
+
+Required identity inputs:
+
+- cluster id
+- local node id
+- remote node id
+- local node certificate
+- remote node certificate
+- cluster trust roots
+- revocation metadata
+- role assignment snapshot
+- allowed peer relationship
+- route/channel authorization policy
+
+Private keys remain local to the node. The control plane must never store node
+private keys.
+
+## 4. mTLS Certificate Requirements
+
+Node certificates must be cluster-scoped.
+
+Certificate identity should bind:
+
+- node id
+- cluster id
+- certificate serial
+- validity period
+- key usage for node-to-node transport
+- optional role/service constraints where practical
+
+Validation must check:
+
+- certificate chain
+- cluster trust root
+- certificate validity time
+- node id binding
+- cluster id binding
+- expected remote node id
+- revocation status
+- key usage
+- policy scope
+
+A valid TLS certificate is necessary but not sufficient. The channel must also
+pass role, peer, route, and channel authorization.
+
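The validation checklist above can be sketched as a single function. This is an illustrative sketch only, not part of C16: `NodeCertificate` is a hypothetical, already-parsed view of a certificate, chain and key-usage verification are assumed to have been done by the TLS layer and passed in as booleans, and the returned strings reuse the failure reasons listed in section 16. Policy scope is left out here because it belongs to channel authorization.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Set


@dataclass(frozen=True)
class NodeCertificate:
    # Hypothetical parsed certificate view; field names are assumptions.
    node_id: str
    cluster_id: str
    serial: str
    not_before: datetime
    not_after: datetime
    chain_trusted: bool         # chain verified up to a cluster trust root
    node_transport_usage: bool  # key usage permits node-to-node transport


def validate_peer_certificate(cert: NodeCertificate, *, cluster_id: str,
                              expected_node_id: str,
                              revoked_serials: Set[str],
                              now: datetime) -> Optional[str]:
    """Return None when every check passes, otherwise a failure reason."""
    if not cert.chain_trusted or not cert.node_transport_usage:
        return "certificate_invalid"
    if not (cert.not_before <= now <= cert.not_after):
        return "certificate_invalid"
    if cert.serial in revoked_serials:
        return "certificate_revoked"
    if cert.cluster_id != cluster_id:
        return "wrong_cluster"
    if cert.node_id != expected_node_id:
        return "wrong_node"
    return None
```

A `None` result means only that the certificate is acceptable; the channel must still pass role, peer, route, and channel authorization before it is used.
+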
+## 5. Channel Establishment Flow
+
+Proposed logical flow:
+
+1. Routing Engine or node-agent selects an allowed peer candidate.
+2. Local node checks peer directory and local policy.
+3. Local node opens authenticated transport.
+4. Both sides perform mTLS handshake.
+5. Both sides validate certificate identity and cluster scope.
+6. Both sides exchange channel hello metadata.
+7. Both sides validate role, channel classes, and policy version.
+8. Channel enters `established` state.
+9. Heartbeat/liveness begins.
+10. Channel is registered in local channel table with expiry/revalidation
+    deadline.
+
+Channel hello metadata should include:
+
+- protocol version
+- cluster id
+- node id
+- supported channel classes
+- supported transport features
+- local config version
+- peer directory version
+- trust bundle version
+- route epoch
+- draining status
+
+## 6. Channel States
+
+Initial state machine:
+
+- `idle`
+- `connecting`
+- `handshaking`
+- `authenticating`
+- `authorizing`
+- `established`
+- `revalidating`
+- `degraded`
+- `draining`
+- `closing`
+- `closed`
+- `failed`
+
+State rules:
+
+- no traffic before `established`
+- control/liveness may continue in `revalidating`
+- new non-essential traffic should stop in `draining`
+- channel must close on failed authentication
+- channel must close or degrade on failed reauthorization according to policy
+- terminal `closed`/`failed` channels must not be reused
+
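One way to make the state rules above checkable is an explicit transition table. The states come from the list above, but the table itself is an assumption: C16 names the states and the rules, not the allowed edges. Treating `degraded` as carrying no new traffic is likewise a conservative placeholder for what the document leaves to policy.

```python
from enum import Enum


class ChannelState(Enum):
    IDLE = "idle"
    CONNECTING = "connecting"
    HANDSHAKING = "handshaking"
    AUTHENTICATING = "authenticating"
    AUTHORIZING = "authorizing"
    ESTABLISHED = "established"
    REVALIDATING = "revalidating"
    DEGRADED = "degraded"
    DRAINING = "draining"
    CLOSING = "closing"
    CLOSED = "closed"
    FAILED = "failed"


S = ChannelState

# Assumed edges; terminal states have no outgoing transitions.
ALLOWED = {
    S.IDLE: {S.CONNECTING},
    S.CONNECTING: {S.HANDSHAKING, S.FAILED},
    S.HANDSHAKING: {S.AUTHENTICATING, S.FAILED},
    S.AUTHENTICATING: {S.AUTHORIZING, S.FAILED},
    S.AUTHORIZING: {S.ESTABLISHED, S.FAILED},
    S.ESTABLISHED: {S.REVALIDATING, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.REVALIDATING: {S.ESTABLISHED, S.DEGRADED, S.DRAINING, S.CLOSING, S.FAILED},
    S.DEGRADED: {S.ESTABLISHED, S.REVALIDATING, S.DRAINING, S.CLOSING, S.FAILED},
    S.DRAINING: {S.CLOSING, S.FAILED},
    S.CLOSING: {S.CLOSED, S.FAILED},
    S.CLOSED: set(),
    S.FAILED: set(),
}


def transition(current: ChannelState, target: ChannelState) -> ChannelState:
    """Return the new state, or raise if the edge is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def may_accept_new_traffic(state: ChannelState, *, essential: bool) -> bool:
    """No traffic before established; only essential traffic while
    revalidating or draining; nothing in any other state."""
    if state is S.ESTABLISHED:
        return True
    if state in (S.REVALIDATING, S.DRAINING):
        return essential
    return False
```

Because `closed` and `failed` map to empty sets, any attempt to reuse a terminal channel raises instead of silently reconnecting.
+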
+## 7. Channel Classes
+
+Allowed channel classes map to Fabric routing classes, not service-specific
+protocol internals.
+
+Initial channel classes:
+
+- `fabric_control`
+- `route_control`
+- `health`
+- `telemetry`
+- `render`
+- `input`
+- `clipboard`
+- `file_transfer`
+- `storage_fetch`
+- `update_fetch`
+- `vpn_packet`
+
+Authorization is per channel class.
+
+Rules:
+
+- `input` and `fabric_control` require high-priority scheduling.
+- `render` and video-like traffic may be droppable/latest-only.
+- `file_transfer`, `storage_fetch`, and `update_fetch` must not starve
+  `input` or control.
+- `vpn_packet` must be QoS-limited so bulk traffic cannot starve interactive
+  channels.
+- A channel may carry only classes authorized by local policy and route result.
+
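The last rule above amounts to a set intersection, and the scheduling rules amount to a small class-to-tier mapping. The sketch below assumes both; the tier names (`high`, `latest_only`, `bulk_limited`, `normal`) and the function names are hypothetical, and classes the rules do not mention are placed in `normal` as a default.

```python
CHANNEL_CLASSES = frozenset({
    "fabric_control", "route_control", "health", "telemetry", "render",
    "input", "clipboard", "file_transfer", "storage_fetch", "update_fetch",
    "vpn_packet",
})

HIGH_PRIORITY = frozenset({"input", "fabric_control"})
DROPPABLE = frozenset({"render"})
BULK = frozenset({"file_transfer", "storage_fetch", "update_fetch", "vpn_packet"})


def authorized_classes(requested, local_policy, route_result):
    """Classes a channel may carry: requested AND allowed by local policy
    AND allowed by the route result."""
    requested = frozenset(requested)
    unknown = requested - CHANNEL_CLASSES
    if unknown:
        raise ValueError(f"unknown channel classes: {sorted(unknown)}")
    return requested & frozenset(local_policy) & frozenset(route_result)


def scheduling_tier(channel_class: str) -> str:
    """Map a channel class to an assumed scheduling tier."""
    if channel_class in HIGH_PRIORITY:
        return "high"
    if channel_class in DROPPABLE:
        return "latest_only"
    if channel_class in BULK:
        return "bulk_limited"
    return "normal"
```

A class that local policy allows but the route result omits is dropped from the result, so neither input can widen the other.
+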
+## 8. Channel Authorization
+
+Authorization checks:
+
+- local node is allowed to connect to remote node
+- remote node is allowed to accept the connection
+- cluster id matches
+- roles are compatible
+- route result or peer policy permits the relationship
+- requested channel classes are allowed
+- organization/service scope is allowed where applicable
+- partition/degraded state permits the channel
+- remote node is not revoked, disabled, or disallowed
+- certificate is not expired or revoked
+
+Authorization must be repeated when:
+
+- trust bundle changes
+- revocation list changes
+- role assignment changes
+- route policy changes
+- route epoch changes
+- channel has been open longer than the revalidation interval
+
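The reauthorization triggers above can be detected by comparing the versions a channel was authorized with against the current ones, plus an age check. This is a sketch under assumptions: the `AuthorizationInputs` fields and the integer version scheme are hypothetical, and times are plain seconds from a monotonic clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthorizationInputs:
    # Hypothetical version counters for each input that can change.
    trust_bundle_version: int
    revocation_list_version: int
    role_assignment_version: int
    route_policy_version: int
    route_epoch: int


def needs_reauthorization(authorized_with: AuthorizationInputs,
                          current: AuthorizationInputs, *,
                          authorized_at: float, now: float,
                          revalidation_interval_s: float) -> bool:
    """True when any input changed or the channel outlived the interval."""
    if authorized_with != current:
        return True
    return (now - authorized_at) >= revalidation_interval_s
```

Comparing whole snapshots rather than individual fields means a newly added input is covered without touching the comparison.
+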
+## 9. Heartbeat and Liveness
+
+Heartbeats prove liveness, not authority.
+
+Heartbeat metadata:
+
+- channel id
+- local node id
+- remote node id
+- timestamp
+- sequence
+- observed latency
+- packet loss/jitter summary where available
+- local health hint
+- draining flag
+- config version
+- route epoch
+
+Recommended heartbeat cadence:
+
+- active control channels: 5-15 seconds
+- high-priority realtime channels: 2-10 seconds where needed
+- low-priority/storage channels: 15-60 seconds
+
+Missing heartbeats should trigger:
+
+1. suspicion state
+2. bounded retry
+3. route failover consideration
+4. channel close/failure
+5. health report
+
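The escalation from suspicion to failure can be sketched by counting whole heartbeat intervals that have elapsed without a heartbeat. The `max_missed` threshold of 3 and the `alive`/`suspect` labels are assumptions; only `heartbeat_timeout` comes from the failure reasons in section 16. Retry, failover, and reporting are left to the caller.

```python
def liveness_status(last_heartbeat_at: float, now: float,
                    interval_s: float, max_missed: int = 3) -> str:
    """Classify a channel by how many heartbeat intervals were missed.

    Times are seconds from a monotonic clock.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    missed = int((now - last_heartbeat_at) // interval_s)
    if missed < 1:
        return "alive"
    if missed < max_missed:
        return "suspect"  # bounded retry and failover consideration
    return "heartbeat_timeout"
```

With a 5-second interval, a channel becomes `suspect` after 5 seconds of silence and times out after 15.
+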
+## 10. Reconnect and Backoff
+
+Reconnect must be bounded and policy-aware.
+
+Rules:
+
+- use exponential backoff with jitter
+- do not stampede bootstrap peers
+- prefer warm candidates after active peer failure
+- stop reconnect when peer is revoked or policy disallows it
+- report repeated failures
+- preserve route stickiness only while healthy and authorized
+- avoid reconnect loops during draining or shutdown
+
+Reconnect should use current peer cache and route policy, not stale hardcoded
+endpoints.
+
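A minimal sketch of bounded exponential backoff with jitter, together with the stop conditions listed above. The base delay, cap, and attempt limit are illustrative values, not figures from C16, and "full jitter" (a uniform draw between zero and the current ceiling) is one common variant chosen here as an assumption.

```python
import random


def backoff_delays(base_s: float = 1.0, cap_s: float = 60.0,
                   max_attempts: int = 8, rng=None):
    """Yield one jittered delay per attempt, then stop (bounded)."""
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Full jitter spreads reconnecting nodes out so they do not
        # stampede the same bootstrap peers at the same moment.
        yield rng.uniform(0.0, ceiling)


def should_reconnect(*, peer_revoked: bool, policy_allows: bool,
                     draining: bool, attempt: int,
                     max_attempts: int) -> bool:
    """Stop on revocation, policy denial, draining, or exhausted attempts."""
    if peer_revoked or not policy_allows or draining:
        return False
    return attempt < max_attempts
```

When the generator is exhausted, the caller would report `backoff_exhausted` rather than start a new loop.
+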
+## 11. Revalidation
+
+Long-lived channels must revalidate periodically.
+
+Revalidation checks:
+
+- certificate still valid
+- revocation status current enough
+- cluster trust root still valid
+- peer relationship still allowed
+- channel classes still allowed
+- route epoch/policy version still acceptable
+- role assignments still active
+
+If revalidation fails:
+
+- stop accepting new traffic
+- drain or close according to policy
+- report reason
+- trigger route failover where applicable
+
+## 12. Draining and Graceful Shutdown
+
+Draining supports maintenance and safe role removal.
+
+Draining flow:
+
+1. node enters draining state
+2. node advertises draining in heartbeat/channel metadata
+3. routing stops placing new flows on the node
+4. existing flows continue until TTL or policy deadline
+5. new non-essential channels are rejected
+6. channel closes after active work drains or deadline expires
+7. node reports drained status
+
+Draining must not silently drop critical control messages.
+
+If graceful drain fails, policy decides whether to force-close and failover.
+
+## 13. Trust Rotation
+
+Trust rotation must avoid split trust windows, in which some nodes accept only
+the old trust root and others accept only the new one.
+
+Recommended flow:
+
+1. new trust bundle is signed by current trusted key
+2. nodes fetch and verify new trust bundle
+3. dual validation period begins where required
+4. new certificates are issued/accepted
+5. old certificates expire or are revoked
+6. old trust root is retired after rollout threshold
+
+Channels should revalidate after trust bundle changes.
+
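The dual validation period in the flow above can be pictured as a phase-dependent set of accepted roots. The phase names and the function are hypothetical; the sketch shows only that both roots are accepted during the overlap, so a node that has already rotated and one that has not can still validate each other.

```python
def accepted_trust_roots(old_root: str, new_root: str, phase: str) -> frozenset:
    """Return the trust roots a node accepts in a given rotation phase."""
    if phase == "pre_rotation":
        return frozenset({old_root})
    if phase == "dual_validation":
        return frozenset({old_root, new_root})
    if phase == "old_root_retired":
        return frozenset({new_root})
    raise ValueError(f"unknown rotation phase: {phase}")
```

Moving from `dual_validation` to `old_root_retired` shrinks the accepted set, which is why established channels need to revalidate after each trust bundle change.
+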
+## 14. Revocation Handling
+
+Revocation must affect active channels.
+
+Revocation inputs:
+
+- signed revocation list
+- trust bundle update
+- control-plane status after reconnect
+- emergency revocation policy
+
+On revocation of remote node/certificate/key:
+
+- stop new channels
+- mark existing channels as revalidation failed
+- close or drain according to policy severity
+- remove peer from eligible active/warm candidates
+- report and audit event
+
+High-severity revocation should close immediately.
+
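The revocation steps above can be sketched as a pure planning function that returns what should happen and leaves execution to the caller. The `Channel` type, the dictionary keys, and the two-level severity are assumptions; the document states only that high-severity revocation closes immediately and that other cases close or drain according to policy, so draining is used here as a placeholder for the non-high case.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Channel:
    # Hypothetical minimal channel record.
    channel_id: str
    remote_node_id: str


def plan_revocation(channels: Iterable[Channel], revoked_node_id: str,
                    severity: str) -> dict:
    """Describe the actions to take when a remote node is revoked."""
    action = "close" if severity == "high" else "drain"
    return {
        "block_new_channels_to": revoked_node_id,
        "remove_from_candidates": revoked_node_id,
        "channel_actions": {
            c.channel_id: action
            for c in channels
            if c.remote_node_id == revoked_node_id
        },
        "audit_reason": "certificate_revoked",
    }
```

Channels to other peers do not appear in `channel_actions`, so a single revocation never disturbs unrelated connections.
+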
+## 15. Partition and Degraded Behavior
+
+In degraded mode, channels may continue only if:
+
+- current signed snapshot permits it
+- certificates remain valid
+- the peer is not known to be revoked
+- route/channel policy permits degraded continuation
+- TTL has not expired
+
+Degraded mode must not authorize:
+
+- new node enrollment
+- new trust roots
+- role changes
+- cross-cluster trust changes
+- partition promotion
+- new high-risk channels without explicit policy authorization
+
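Both lists above reduce to simple predicates. In this sketch the parameter names and the action identifiers are hypothetical. The conditional item about high-risk channels is omitted from the forbidden set because it depends on policy rather than being an absolute ban.

```python
FORBIDDEN_IN_DEGRADED = frozenset({
    "node_enrollment",
    "add_trust_root",
    "role_change",
    "cross_cluster_trust_change",
    "partition_promotion",
})


def may_continue_degraded(*, snapshot_permits: bool, certificate_valid: bool,
                          peer_known_revoked: bool,
                          policy_permits_degraded: bool,
                          ttl_expires_at: float, now: float) -> bool:
    """All five conditions must hold for a channel to stay up."""
    return (snapshot_permits
            and certificate_valid
            and not peer_known_revoked
            and policy_permits_degraded
            and now < ttl_expires_at)


def degraded_action_allowed(action: str) -> bool:
    """Reject mutations that degraded mode must never authorize."""
    return action not in FORBIDDEN_IN_DEGRADED
```

The TTL check makes degraded continuation self-limiting: once `now` reaches `ttl_expires_at`, the channel stops qualifying even if nothing else changes.
+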
+## 16. Failure Classification
+
+Failure reasons:
+
+- `tls_handshake_failed`
+- `certificate_invalid`
+- `certificate_revoked`
+- `wrong_cluster`
+- `wrong_node`
+- `policy_denied`
+- `channel_class_denied`
+- `route_epoch_stale`
+- `heartbeat_timeout`
+- `peer_draining`
+- `peer_disabled`
+- `trust_bundle_stale`
+- `network_unreachable`
+- `backoff_exhausted`
+
+Failures should be structured and safe to log.
+
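One possible reading of "structured and safe to log" is an enumerated reason plus an allow-list of context fields, so that anything not explicitly permitted never reaches the log. The allow-list contents, the `event` key, and the function name are assumptions made for this sketch.

```python
import json
from enum import Enum


class FailureReason(str, Enum):
    TLS_HANDSHAKE_FAILED = "tls_handshake_failed"
    CERTIFICATE_INVALID = "certificate_invalid"
    CERTIFICATE_REVOKED = "certificate_revoked"
    WRONG_CLUSTER = "wrong_cluster"
    WRONG_NODE = "wrong_node"
    POLICY_DENIED = "policy_denied"
    CHANNEL_CLASS_DENIED = "channel_class_denied"
    ROUTE_EPOCH_STALE = "route_epoch_stale"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    PEER_DRAINING = "peer_draining"
    PEER_DISABLED = "peer_disabled"
    TRUST_BUNDLE_STALE = "trust_bundle_stale"
    NETWORK_UNREACHABLE = "network_unreachable"
    BACKOFF_EXHAUSTED = "backoff_exhausted"


# Assumed allow-list: only these context keys are ever logged.
SAFE_FIELDS = ("channel_id", "local_node_id", "remote_node_id", "route_epoch")


def failure_record(reason, **context) -> str:
    """Build a JSON log line; unknown reasons raise ValueError."""
    record = {"event": "channel_failure", "reason": FailureReason(reason).value}
    record.update({k: context[k] for k in SAFE_FIELDS if k in context})
    return json.dumps(record, sort_keys=True)
```

Passing an extra keyword such as key material has no effect on the output, since only allow-listed fields are copied into the record.
+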
+## 17. Observability
+
+Node-agent should report:
+
+- channel state
+- active channel count
+- channel classes in use
+- handshake failures
+- authorization failures
+- heartbeat latency
+- reconnect count
+- backoff state
+- draining state
+- revocation actions
+- revalidation failures
+- route epoch/policy version
+
+Tenant views must not expose internal topology. Platform owner views may show
+full channel diagnostics according to audited policy.
+
+## 18. Security Requirements
+
+Required:
+
+- mTLS for node-to-node channels
+- cluster-scoped node certificates
+- certificate revocation support
+- policy-scoped channel authorization
+- no unauthenticated peer enumeration
+- no channel use before authorization
+- channel class separation
+- QoS-aware scheduling expectations
+- structured audit for high-risk channel changes
+
+Compromised node blast radius must be limited by:
+
+- scoped certificates
+- scoped snapshots
+- role assignment
+- peer directory scope
+- channel authorization
+- revocation
+- topology hiding
+
+## 19. Future Validation Tests
+
+Future implementation tests must prove:
+
+- valid node-to-node mTLS succeeds
+- a certificate from the wrong cluster is rejected
+- a wrong node id is rejected
+- an expired certificate is rejected
+- a revoked certificate closes the active channel
+- an unauthorized channel class is rejected
+- a channel cannot carry traffic before authorization
+- a heartbeat timeout triggers failure
+- draining stops new channels
+- trust rotation revalidates channels
+- degraded mode honors TTL and forbidden actions
+- tenant-safe views hide topology
+
+## 20. C17 Preparation
+
+C17 may plan mesh routing runtime only after C10-C16 are accepted.
+
+C17 must use:
+
+- signed snapshots
+- node-local state store
+- Fabric Storage / Config Storage
+- peer directory/cache
+- Fabric Routing Engine route results
+- secure node-to-node channels
+
+C17 must not jump directly to broad production mesh. It should first define a
+minimal runtime implementation plan, test topology, rollback path, and go/no-go
+criteria.
+
+## 21. Result / Decision
+
+Stage C16 defines secure node-to-node channels as authenticated,
+policy-authorized, lifecycle-managed connections.
+
+Decisions:
+
+- mTLS is required for node-to-node channels.
+- Certificate validity is necessary but not sufficient; channel policy must
+  authorize role, peer relationship, route, and channel classes.
+- Active channels must revalidate on trust, revocation, role, and route policy
+  changes.
+- Draining is a first-class lifecycle state.
+- Revocation affects active channels.
+- Degraded mode is bounded and cannot authorize high-risk mutations.
+- C17 must plan mesh routing runtime using C10-C16 foundations.
+
+No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
+workload behavior is changed by C16.