Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,518 @@
+# Fabric Routing Engine Skeleton
+
+Status: Stage C15 result. Documentation and architecture only.
+
+This document defines the Fabric Routing Engine skeleton boundary. It does not
+implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
+relay packet routing, RDP work, or service workload execution.
+
+## 1. Purpose
+
+The Fabric Routing Engine is the logical Fabric layer responsible for choosing
+authorized paths between ingress, core, egress, service, storage, and future
+VPN/IP-tunnel components.
+
+C15 defines the route decision boundary before runtime mesh routing exists.
+
+The purpose is to ensure that future routing:
+
+- is policy-aware
+- is QoS-aware
+- is channel-aware
+- respects cluster and organization boundaries
+- uses scoped local state and peer cache
+- does not depend on live backend availability for realtime decisions
+- is not implemented independently by Service Adapters
+
+## 2. Non-Goals
+
+C15 does not:
+
+- carry production mesh traffic
+- implement node-to-node transport
+- implement relay forwarding
+- implement VPN/IP tunnel packets
+- implement QUIC/WebRTC
+- implement route execution
+- implement service workloads
+- change RDP runtime
+- change backend session lifecycle
+- change Windows client behavior
+
+It defines contracts and responsibilities only.
+
+## 3. Routing Engine Responsibilities
+
+The Fabric Routing Engine owns:
+
+- route request validation
+- peer candidate filtering
+- route scoring
+- channel-aware path selection
+- QoS class selection
+- route cache lookup/update policy
+- failover decision boundaries
+- shortcut recommendation boundaries
+- topology hiding
+- policy and cluster-boundary enforcement
+- service adapter routing integration boundary
+
+The Routing Engine does not own:
+
+- PostgreSQL source-of-truth mutation
+- service protocol translation
+- RDP/VNC/SSH/VPN implementation details
+- raw packet forwarding
+- direct secret resolution
+- organization admin visibility
+- node enrollment authority
+
+## 4. Inputs
+
+Routing decisions may consume:
+
+- signed scoped cluster snapshot
+- node-local peer cache
+- route cache
+- peer directory
+- route policy
+- QoS policy
+- service assignment cache
+- cluster membership
+- organization scope
+- service/resource scope
+- channel class
+- current health/degraded state
+- partition/authority state
+- failure history
+- load and latency observations
+
+Routing decisions must not require a live backend call in the realtime path.
+
+## 5. Route Request Contract
+
+A route request is a logical request for a path. It is not a packet.
+
+Required fields:
+
+- `request_id`
+- `cluster_id`
+- `organization_id` where applicable
+- `source_node_id`
+- `source_role`
+- `destination_kind`
+- `destination_ref`
+- `service_type`
+- `channel_class`
+- `priority_class`
+- `policy_refs`
+- `requested_at`
+
+Destination kinds:
+
+- `node`
+- `egress_pool`
+- `service_instance`
+- `resource_target`
+- `vpn_connection`
+- `storage_scope`
+- `control_plane_endpoint`
+
+Optional fields:
+
+- `session_id`
+- `attachment_id`
+- `resource_id`
+- `user_id`
+- `device_id`
+- `region_preference`
+- `required_capabilities`
+- `forbidden_nodes`
+- `preferred_nodes`
+- `max_latency_ms`
+- `min_bandwidth_hint`
+- `stickiness_key`
+- `previous_route_id`
+- `failure_context`
+
+Service adapters may create route requests through an adapter-facing boundary,
+but they must not select peers or paths themselves.
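
For illustration only (C15 itself defines no code), the request contract above could be modelled as an immutable record plus a validator. This is a minimal sketch: the class name `RouteRequest`, the helper `validate_request`, the field types, and the rule that a node may not be both preferred and forbidden are assumptions, and only a few of the optional fields are shown.

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Optional

# Destination kinds listed in the contract above.
DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})


@dataclass(frozen=True)
class RouteRequest:
    """Hypothetical in-memory shape of a route request (a logical request, not a packet)."""
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: tuple[str, ...]
    requested_at: datetime
    organization_id: Optional[str] = None  # "where applicable"
    # A few optional fields; the others would follow the same pattern.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    forbidden_nodes: frozenset[str] = frozenset()
    preferred_nodes: frozenset[str] = frozenset()


def validate_request(req: RouteRequest) -> list[str]:
    """Return validation errors; an empty list means the request is well-formed."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "channel_class",
                 "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind}")
    # Assumed rule: the two node sets must not overlap.
    if req.forbidden_nodes & req.preferred_nodes:
        errors.append("node listed as both preferred and forbidden")
    return errors


# Example values are made up for the sketch.
req = RouteRequest(
    request_id="req-1", cluster_id="cluster-a", source_node_id="node-1",
    source_role="ingress", destination_kind="egress_pool",
    destination_ref="pool-a", service_type="rdp", channel_class="input",
    priority_class="interactive", policy_refs=("policy-1",),
    requested_at=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
bad = replace(req, destination_kind="packet")  # not a valid destination kind
```

Note that the record carries no peer or path selection fields, which matches the rule that adapters describe what they need and leave selection to the Routing Engine.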
+
+## 6. Route Result Contract
+
+A route result is a signed or locally verifiable decision artifact that is valid
+for a bounded time.
+
+Required fields:
+
+- `route_id`
+- `request_id`
+- `cluster_id`
+- `organization_id` where applicable
+- `route_class`
+- `channel_class`
+- `selected_path`
+- `selected_qos_class`
+- `score`
+- `valid_from`
+- `expires_at`
+- `route_epoch`
+- `policy_version`
+- `decision_reason`
+
+Selected path contains ordered logical hops:
+
+- source node
+- optional ingress node
+- zero or more core/relay nodes
+- optional egress/service node
+- target/service endpoint
+
+Optional fields:
+
+- `fallback_paths`
+- `shortcut_candidate`
+- `stickiness_key`
+- `drain_after`
+- `degraded_mode`
+- `constraints_applied`
+- `rejection_reason`
+
+Route results must be bounded by expiry, policy version, route epoch, and
+cluster authority state.
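
The four bounds named above (expiry, policy version, route epoch, cluster authority state) can be read as a single usability predicate. The following is a sketch under assumptions: `RouteResult` here holds only a subset of the contract fields, `route_epoch` and `policy_version` are assumed to be integers, and `authority_ok` stands in for whatever the cluster authority check turns out to be.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RouteResult:
    """Hypothetical subset of the route result fields listed above."""
    route_id: str
    request_id: str
    route_class: str
    channel_class: str
    selected_path: tuple[str, ...]  # ordered logical hops
    valid_from: datetime
    expires_at: datetime
    route_epoch: int
    policy_version: int


def is_usable(result: RouteResult, now: datetime, current_epoch: int,
              current_policy_version: int, authority_ok: bool) -> bool:
    """A result is usable only while it is inside every bound at once."""
    return (
        authority_ok
        and result.route_class != "unavailable"
        and result.valid_from <= now < result.expires_at
        and result.route_epoch == current_epoch
        and result.policy_version == current_policy_version
    )


# Example: a 30-second route issued at a fixed instant (values are made up).
issued = datetime(2026, 1, 1, tzinfo=timezone.utc)
example = RouteResult(
    route_id="route-1", request_id="req-1", route_class="single_relay",
    channel_class="input", selected_path=("node-1", "relay-3", "egress-2"),
    valid_from=issued, expires_at=issued + timedelta(seconds=30),
    route_epoch=7, policy_version=3,
)
```

Any single bound failing is enough to stop using the result; a consumer would then request a fresh decision rather than extend the old one.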
+
+## 7. Channel Classes
+
+Routing is channel-aware.
+
+Initial channel classes:
+
+- `control`
+- `input`
+- `render`
+- `cursor`
+- `clipboard`
+- `file_transfer`
+- `telemetry`
+- `vpn_packet`
+- `storage_fetch`
+- `update_fetch`
+
+Rules:
+
+- `input` and critical `control` prefer lowest latency and lowest jitter.
+- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
+- `cursor` is latest-only and should use low-latency paths.
+- `clipboard` is reliable and bounded.
+- `file_transfer` prefers throughput but must not starve input/control/render.
+- `telemetry` is low priority and may be sampled or dropped.
+- `vpn_packet` uses adaptive QoS and bulk protection.
+- `storage_fetch` and `update_fetch` should not consume interactive reserves.
+
+## 8. Route Classes
+
+Initial route classes:
+
+- `direct`
+- `single_relay`
+- `multi_hop`
+- `storage_local`
+- `storage_remote`
+- `vpn_chained`
+- `degraded_existing`
+- `unavailable`
+
+`direct`:
+
+- selected when source can safely reach destination directly
+- trust and policy must allow it
+
+`single_relay`:
+
+- selected when one relay improves connectivity or policy requires relay
+
+`multi_hop`:
+
+- selected when neither a direct nor a single-relay path is available, or when
+  policy or region requires it
+
+`storage_local` / `storage_remote`:
+
+- used for config/snapshot/artifact fetch decisions
+
+`vpn_chained`:
+
+- used when a managed service or IP tunnel depends on a logical
+  `vpn_connection`
+
+`degraded_existing`:
+
+- keeps an already-authorized existing path alive while policy permits
+
+`unavailable`:
+
+- explicit denial or no valid route
+
+## 9. Hard Policy Checks
+
+Hard checks run before scoring.
+
+Reject the route when:
+
+- source node is not trusted
+- source node is not a member of the cluster
+- destination is outside cluster scope
+- cross-cluster trust is missing
+- organization scope does not match
+- role assignment does not permit the route
+- peer certificate is invalid or revoked
+- required channel is not authorized
+- partition/authority state forbids new route
+- destination node is draining or disabled and policy forbids placement
+- route would leak topology or tenant data
+
+No score can override hard policy rejection.
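
The ordering rule (hard checks first, scoring second) could look like the sketch below. It is illustrative only: `Candidate`, `hard_rejection`, `select`, and the reason strings are hypothetical names, and only five of the hard checks listed above are modelled as pre-evaluated booleans.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical pre-evaluated facts about one candidate path."""
    path_id: str
    source_trusted: bool
    source_in_cluster: bool
    organization_matches: bool
    peer_cert_valid: bool
    channel_authorized: bool
    soft_score: float  # higher is better


def hard_rejection(c: Candidate) -> Optional[str]:
    """Return the first failed hard check, or None when all of them pass."""
    checks = (
        (c.source_trusted, "source_not_trusted"),
        (c.source_in_cluster, "source_not_cluster_member"),
        (c.organization_matches, "organization_scope_mismatch"),
        (c.peer_cert_valid, "peer_certificate_invalid_or_revoked"),
        (c.channel_authorized, "channel_not_authorized"),
    )
    for passed, reason in checks:
        if not passed:
            return reason
    return None


def select(candidates: Iterable[Candidate]) -> Optional[Candidate]:
    """Filter by hard checks first; only survivors are compared by score."""
    allowed = [c for c in candidates if hard_rejection(c) is None]
    return max(allowed, key=lambda c: c.soft_score, default=None)


# A revoked peer with an excellent score still loses to a valid, slower one.
revoked = Candidate("fast-but-revoked", True, True, True, False, True, soft_score=0.99)
valid = Candidate("slower-valid", True, True, True, True, True, soft_score=0.60)
```

Because rejected candidates never reach the `max` call, no weighting scheme can bring them back, which is the property the section asks for.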
+
+## 10. Scoring Inputs
+
+Soft scoring inputs:
+
+- latency
+- jitter
+- packet loss
+- reliability
+- recent failure history
+- region distance
+- load
+- available bandwidth
+- role suitability
+- route length
+- service co-location
+- stickiness preference
+- cost preference
+- policy preference
+- health score
+
+Scoring weights are policy-driven and may differ by channel class.
+
+Examples:
+
+- input/control heavily weight latency and jitter
+- file transfer heavily weights throughput and reliability
+- VPN bulk considers QoS impact on interactive routes
+- storage fetch considers locality and replica freshness
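
One way to picture channel-dependent weighting is a weighted sum over normalized metrics. The weights, metric names, and the normalization convention (each metric in [0, 1], where 1 is best) are assumptions made for this sketch; real weights would come from policy.

```python
# Hypothetical per-channel weights; each row sums to 1.0.
WEIGHTS = {
    "input": {
        "latency": 0.5, "jitter": 0.4, "throughput": 0.0, "reliability": 0.1,
    },
    "file_transfer": {
        "latency": 0.1, "jitter": 0.0, "throughput": 0.5, "reliability": 0.4,
    },
}


def score(channel_class: str, metrics: dict[str, float]) -> float:
    """Weighted sum of normalized metrics for the given channel class."""
    weights = WEIGHTS[channel_class]
    return sum(weight * metrics[name] for name, weight in weights.items())


def best_path(channel_class: str, paths: dict[str, dict[str, float]]) -> str:
    """Return the path id with the highest score for this channel class."""
    return max(paths, key=lambda path_id: score(channel_class, paths[path_id]))


# Two made-up paths: one quick and thin, one slow and wide.
PATHS = {
    "low_latency": {"latency": 0.9, "jitter": 0.9, "throughput": 0.2, "reliability": 0.8},
    "high_bandwidth": {"latency": 0.3, "jitter": 0.4, "throughput": 0.95, "reliability": 0.9},
}
```

With these sample numbers, `best_path("input", PATHS)` picks the low-latency path and `best_path("file_transfer", PATHS)` picks the high-bandwidth one, so the same candidates rank differently per channel class.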
+
+## 11. Route Cache Relationship
+
+Route cache is local and bounded.
+
+Cache key inputs:
+
+- cluster id
+- organization id
+- source node
+- destination kind/ref
+- service type
+- channel class
+- policy version
+- route epoch
+- stickiness key
+
+Cache entries contain:
+
+- route result
+- expiry
+- score
+- last success/failure
+- backoff state
+- fallback candidates
+
+Cache invalidation triggers:
+
+- policy version change
+- peer directory version change
+- trust/revocation update
+- route epoch change
+- health state change
+- repeated route failure
+- expiry
+
+Route cache is a performance aid, not route authority.
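
A minimal sketch of such a cache follows, assuming integer versions, float timestamps, and oldest-first eviction (none of which C15 specifies). Because `policy_version` and `route_epoch` are part of the key, a version or epoch change makes old entries unreachable without an explicit purge; the other triggers are handled by `invalidate_where`.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class CacheKey:
    """Hypothetical key built from the cache key inputs listed above."""
    cluster_id: str
    organization_id: Optional[str]
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str] = None


class RouteCache:
    """Bounded, local, non-authoritative cache of route ids."""

    def __init__(self, max_entries: int = 1024) -> None:
        self._max = max_entries
        # dict preserves insertion order, so the first key is the oldest.
        self._entries: dict[CacheKey, tuple[str, float]] = {}

    def __len__(self) -> int:
        return len(self._entries)

    def put(self, key: CacheKey, route_id: str, expires_at: float) -> None:
        self._entries.pop(key, None)
        if len(self._entries) >= self._max:
            self._entries.pop(next(iter(self._entries)))  # evict oldest
        self._entries[key] = (route_id, expires_at)

    def get(self, key: CacheKey, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        route_id, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expiry is an invalidation trigger
            return None
        return route_id

    def invalidate_where(self, predicate: Callable[[CacheKey], bool]) -> int:
        """Drop matching entries, e.g. after a trust/revocation update."""
        stale = [key for key in self._entries if predicate(key)]
        for key in stale:
            del self._entries[key]
        return len(stale)


key_v1 = CacheKey("cluster-a", "org-1", "node-1", "node", "node-9",
                  "rdp", "input", policy_version=1, route_epoch=5)
key_v2 = replace(key_v1, policy_version=2)  # same route, newer policy
```

A miss only means the engine must decide again from its inputs; a hit never substitutes for the hard policy checks.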
+
+## 12. Failover Boundaries
+
+Failover decisions may:
+
+- switch from failed active path to fallback path
+- promote warm peer path
+- retry through bootstrap route for recovery
+- mark route unavailable
+- request control-plane/config refresh when reachable
+- keep degraded existing path alive if policy permits
+
+Failover decisions must not:
+
+- create new cluster authority
+- bypass policy
+- add nodes
+- approve role changes
+- cross cluster boundaries without explicit trust
+- expose topology to organizations
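
The may/must-not split above can be sketched as a selector that only ever chooses among paths it was handed. The names `PathState` and `choose_failover` and the boolean fields are hypothetical; the sketch covers fallback selection and the cross-cluster rule only.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PathState:
    """Hypothetical view of a path at failover time."""
    path_id: str
    cluster_id: str
    authorized: bool  # already passed hard policy checks
    healthy: bool


def choose_failover(
    active: PathState,
    fallbacks: Sequence[PathState],
    trusted_clusters: frozenset[str] = frozenset(),
) -> Optional[PathState]:
    """Return the first safe fallback, or None so the caller marks the route unavailable.

    The function never creates paths or authority: it picks only among
    fallbacks that are already authorized, healthy, and either in the same
    cluster as the failed path or in an explicitly trusted cluster.
    """
    for path in fallbacks:
        in_scope = (path.cluster_id == active.cluster_id
                    or path.cluster_id in trusted_clusters)
        if path.authorized and path.healthy and in_scope:
            return path
    return None


# Made-up states: the active path has failed; three fallbacks are known.
failed = PathState("p-active", "cluster-a", authorized=True, healthy=False)
known_fallbacks = [
    PathState("p-unauthorized", "cluster-a", authorized=False, healthy=True),
    PathState("p-foreign", "cluster-b", authorized=True, healthy=True),
    PathState("p-ok", "cluster-a", authorized=True, healthy=True),
]
```

Returning `None` instead of improvising a path keeps failover inside the policy that was already evaluated.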
+
+## 13. Shortcut Decision Boundary
+
+Shortcut connections are optional optimization recommendations.
+
+A shortcut may be recommended when:
+
+- long-lived flow exists
+- current path latency/jitter is high
+- direct connectivity appears possible
+- trust validation succeeds
+- policy allows shortcut
+- shortcut improves latency, jitter, or bandwidth
+- fallback path remains available
+
+Shortcut recommendation output:
+
+- source node
+- destination node
+- channel classes affected
+- expected improvement
+- required validation
+- expiry
+- fallback route id
+
+C15 does not implement shortcut connections. It only defines when a future
+Routing Engine may recommend them.
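
As a rough sketch of the recommendation conditions, the predicate below requires every condition to hold at once. The field names and both thresholds (`min_flow_age_s`, `min_gain_ms`) are invented for illustration, and the sketch measures improvement by latency only, although the text also allows jitter or bandwidth gains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowObservation:
    """Hypothetical observations about one long-lived flow."""
    flow_age_s: float
    current_latency_ms: float
    estimated_direct_latency_ms: float
    direct_reachable: bool
    trust_validated: bool
    policy_allows_shortcut: bool
    fallback_route_id: Optional[str]


def should_recommend_shortcut(obs: FlowObservation,
                              min_flow_age_s: float = 60.0,
                              min_gain_ms: float = 10.0) -> bool:
    """True only when all recommendation conditions hold together."""
    gain_ms = obs.current_latency_ms - obs.estimated_direct_latency_ms
    return (
        obs.flow_age_s >= min_flow_age_s        # long-lived flow exists
        and obs.direct_reachable                # direct connectivity appears possible
        and obs.trust_validated                 # trust validation succeeds
        and obs.policy_allows_shortcut          # policy allows shortcut
        and gain_ms >= min_gain_ms              # shortcut improves latency
        and obs.fallback_route_id is not None   # fallback path remains available
    )


# Made-up observation of a five-minute flow on a slow relayed path.
candidate_flow = FlowObservation(
    flow_age_s=300.0, current_latency_ms=80.0, estimated_direct_latency_ms=30.0,
    direct_reachable=True, trust_validated=True, policy_allows_shortcut=True,
    fallback_route_id="route-1",
)
```

The result is a recommendation input only; establishing the shortcut stays outside C15.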
+
+## 14. Service Adapter Integration
+
+Service Adapters may ask for routes using service-neutral metadata.
+
+Examples:
+
+- RDP Adapter requests route to RDP service/egress node or resource target.
+- VNC Adapter requests route to VNC target zone.
+- SSH Adapter requests route to SSH target.
+- VPN/IP tunnel service requests route through `vpn_connection`.
+- Storage fetch requests route to config/storage scope.
+
+Service Adapters must not:
+
+- enumerate peers
+- select mesh paths
+- create relay chains
+- create shortcuts
+- implement failover policy
+- implement partition recovery
+- implement cross-cluster routing trust
+
+The adapter consumes a route result and sends/receives through the approved
+data-plane boundary when runtime exists.
+
+## 15. Topology Hiding
+
+Organizations see:
+
+- allowed service endpoints
+- safe ingress/egress status
+- safe session/resource status
+- policy-visible route dependency names where allowed
+
+Organizations must not see:
+
+- intermediate core mesh nodes
+- full peer directory
+- route cache
+- shortcut candidates
+- other organizations' route data
+- storage shard placement
+
+Platform owners may inspect routing internals according to audited platform
+policy.
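
Topology hiding is easiest to keep safe with an allow-list projection, so that any field added later stays hidden by default. The field names below are hypothetical stand-ins for the organization-visible items listed above.

```python
# Hypothetical tenant-safe field names; everything else is hidden by default.
TENANT_SAFE_FIELDS = frozenset({
    "service_endpoint",
    "ingress_status",
    "egress_status",
    "session_status",
})


def tenant_view(route: dict) -> dict:
    """Project an internal route record onto tenant-safe fields only."""
    return {key: value for key, value in route.items()
            if key in TENANT_SAFE_FIELDS}


# Made-up internal record, including fields organizations must not see.
internal_record = {
    "service_endpoint": "rdp-pool-a",
    "ingress_status": "healthy",
    "egress_status": "healthy",
    "session_status": "active",
    "selected_path": ["node-1", "core-7", "core-9", "egress-2"],
    "shortcut_candidate": "node-1->egress-2",
    "fallback_paths": [["node-1", "core-4", "egress-2"]],
}
```

An allow-list is chosen here over a deny-list because a deny-list would leak any internal field nobody remembered to add to it.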
+
+## 16. Degraded and Partition Behavior
+
+In degraded mode, Routing Engine may:
+
+- keep existing authorized routes alive until their TTL expires
+- use last signed snapshot for recovery
+- select fallback among already-authorized peers
+- mark route unavailable when safety cannot be proven
+
+In degraded mode, Routing Engine must not:
+
+- authorize new high-risk routes
+- mutate cluster trust
+- approve nodes
+- assign roles
+- promote partition authority automatically
+- create cross-cluster trust
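
The degraded-mode rules can be sketched as a decision that never widens authorization. The function name, the decision labels, and the use of float timestamps are assumptions for this sketch; the use of the last signed snapshot is not modelled.

```python
from typing import Optional, Sequence


def decide_in_degraded_mode(
    now: float,
    existing_expires_at: Optional[float],
    candidate_peers: Sequence[str],
    authorized_peers: frozenset[str],
) -> tuple[str, Optional[str]]:
    """Return a (decision, peer) pair using only pre-authorized options.

    Order of preference:
    1. keep the existing route while its TTL has not expired;
    2. fall back to the first candidate that is already authorized;
    3. otherwise report the route as unavailable.
    """
    if existing_expires_at is not None and now < existing_expires_at:
        return ("keep_existing", None)
    for peer in candidate_peers:
        if peer in authorized_peers:
            return ("use_fallback", peer)
    return ("mark_unavailable", None)


# Made-up peer ids; only node-2 was authorized before the partition.
ALREADY_AUTHORIZED = frozenset({"node-2"})
```

There is no branch that adds a peer to `authorized_peers`, so the sketch cannot approve nodes, assign roles, or create trust while degraded.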
+
+## 17. Observability
+
+Routing decisions should emit safe telemetry:
+
+- route selected
+- route rejected
+- rejection reason
+- route class
+- channel class
+- score bucket
+- latency/jitter/packet loss summary
+- failover count
+- fallback used
+- shortcut recommended
+- policy version
+- peer directory version
+- route epoch
+
+Tenant-visible telemetry must hide topology.
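
A possible shape for a safe telemetry event is sketched below. The bucket thresholds, event names, and dictionary keys are assumptions; the point is that the event carries a coarse score bucket and version markers but no hop list or peer identifiers.

```python
from typing import Optional


def score_bucket(score: float) -> str:
    """Map a score in [0, 1] to a coarse bucket (thresholds are assumed)."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def routing_event(selected: bool, route_class: str, channel_class: str,
                  score: float, policy_version: int, route_epoch: int,
                  rejection_reason: Optional[str] = None) -> dict:
    """Build a telemetry record that deliberately omits the selected path."""
    event = {
        "event": "route_selected" if selected else "route_rejected",
        "route_class": route_class,
        "channel_class": channel_class,
        "score_bucket": score_bucket(score),
        "policy_version": policy_version,
        "route_epoch": route_epoch,
    }
    if not selected:
        event["rejection_reason"] = rejection_reason or "unspecified"
    return event


# Made-up rejection caused by a failed hard policy check.
sample_event = routing_event(
    selected=False, route_class="unavailable", channel_class="input",
    score=0.0, policy_version=3, route_epoch=7,
    rejection_reason="organization_scope_mismatch",
)
```

Reporting a bucket instead of the raw score is one way to keep fine-grained scoring inputs out of tenant-visible data.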
+
+## 18. Future Validation Tests
+
+Future implementation tests must prove:
+
+- route request rejects wrong cluster
+- route request rejects wrong organization
+- revoked peer is not selected
+- unavailable route returns explicit result
+- cache invalidates on policy version change
+- cache invalidates on peer directory version change
+- input route prefers latency over throughput
+- file transfer route does not starve input class
+- service adapter cannot bypass routing engine
+- shortcut recommendation requires fallback path
+- degraded mode does not authorize new forbidden routes
+
+## 19. C16 Preparation
+
+C16 must define the secure node-to-node channel lifecycle that can later carry
+route-selected traffic.
+
+C16 must preserve:
+
+- routing results are bounded and policy-scoped
+- channels are authenticated and authorized
+- trust/revocation affects active channels
+- Service Adapters remain above Fabric routing
+- no mesh packet routing starts before explicit C17
+
+## 20. Result / Decision
+
+Stage C15 defines Fabric Routing Engine as a skeleton boundary for route
+requests, route results, scoring, cache relationship, failover, shortcut
+recommendations, topology hiding, and Service Adapter integration.
+
+Decisions:
+
+- Routing belongs to Fabric, not Service Adapters.
+- Route requests/results are logical contracts, not packet forwarding.
+- Hard policy checks precede scoring.
+- Route cache is local, bounded, and non-authoritative.
+- Routing is channel-aware and QoS-aware.
+- Shortcut connections are future optional recommendations, not C15 runtime.
+- C16 must define secure node-to-node channels before mesh routing runtime.
+
+No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
+workload behavior is changed by C15.