Initial project snapshot

2026-04-28 22:29:50 +03:00
commit 8ba0561f4f
365 changed files with 91832 additions and 0 deletions
@@ -0,0 +1,518 @@
# Fabric Routing Engine Skeleton
Status: Stage C15 result. Documentation and architecture only.
This document defines the Fabric Routing Engine skeleton boundary. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.
## 1. Purpose
The Fabric Routing Engine is the logical Fabric layer responsible for choosing
authorized paths between ingress, core, egress, service, storage, and future
VPN/IP-tunnel components.
C15 defines the route decision boundary before runtime mesh routing exists.
The purpose is to ensure that future routing:
- is policy-aware
- is QoS-aware
- is channel-aware
- respects cluster and organization boundaries
- uses scoped local state and peer cache
- does not depend on live backend availability for realtime decisions
- is not implemented independently by Service Adapters
## 2. Non-Goals
C15 does not:
- carry production mesh traffic
- implement node-to-node transport
- implement relay forwarding
- implement VPN/IP tunnel packets
- implement QUIC/WebRTC
- implement route execution
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior
It defines contracts and responsibilities only.
## 3. Routing Engine Responsibilities
The Fabric Routing Engine owns:
- route request validation
- peer candidate filtering
- route scoring
- channel-aware path selection
- QoS class selection
- route cache lookup/update policy
- failover decision boundaries
- shortcut recommendation boundaries
- topology hiding
- policy and cluster-boundary enforcement
- service adapter routing integration boundary
The Routing Engine does not own:
- PostgreSQL source-of-truth mutation
- service protocol translation
- RDP/VNC/SSH/VPN implementation details
- raw packet forwarding
- direct secret resolution
- organization admin visibility
- node enrollment authority
## 4. Inputs
Routing decisions may consume:
- signed scoped cluster snapshot
- node-local peer cache
- route cache
- peer directory
- route policy
- QoS policy
- service assignment cache
- cluster membership
- organization scope
- service/resource scope
- channel class
- current health/degraded state
- partition/authority state
- failure history
- load and latency observations
Routing decisions must not require a live backend call in the realtime path.
## 5. Route Request Contract
A route request is a logical request for a path. It is not a packet.
Required fields:
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `source_node_id`
- `source_role`
- `destination_kind`
- `destination_ref`
- `service_type`
- `channel_class`
- `priority_class`
- `policy_refs`
- `requested_at`
Destination kinds:
- `node`
- `egress_pool`
- `service_instance`
- `resource_target`
- `vpn_connection`
- `storage_scope`
- `control_plane_endpoint`
Optional fields:
- `session_id`
- `attachment_id`
- `resource_id`
- `user_id`
- `device_id`
- `region_preference`
- `required_capabilities`
- `forbidden_nodes`
- `preferred_nodes`
- `max_latency_ms`
- `min_bandwidth_hint`
- `stickiness_key`
- `previous_route_id`
- `failure_context`
Service adapters may create route requests through an adapter-facing boundary,
but they must not select peers or paths themselves.
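As an illustration only, the request contract above could be modeled as an immutable record with a structural validator. This is a minimal sketch, not part of C15: the class name `RouteRequest`, the function `validate_request`, the field types, and the rule that a node may not be both forbidden and preferred are all assumptions. Only a few optional fields are shown.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Destination kinds as listed in the contract above.
DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})

# Required string fields that must be non-empty (hypothetical rule).
REQUIRED_TEXT_FIELDS = (
    "request_id", "cluster_id", "source_node_id", "source_role",
    "destination_ref", "service_type", "channel_class", "priority_class",
)


@dataclass(frozen=True)
class RouteRequest:
    """A logical request for a path. It is not a packet."""
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: tuple
    requested_at: datetime
    organization_id: Optional[str] = None  # "where applicable"
    # A subset of the optional fields; the rest are omitted for brevity.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    forbidden_nodes: frozenset = frozenset()
    preferred_nodes: frozenset = frozenset()


def validate_request(req: RouteRequest) -> list:
    """Return a list of structural problems; an empty list means valid."""
    problems = [f"missing {name}" for name in REQUIRED_TEXT_FIELDS
                if not getattr(req, name)]
    if req.destination_kind not in DESTINATION_KINDS:
        problems.append(f"unknown destination_kind: {req.destination_kind}")
    # Assumption: a node cannot be both forbidden and preferred.
    if req.forbidden_nodes & req.preferred_nodes:
        problems.append("node listed as both forbidden and preferred")
    return problems
```

Structural validation of this kind would run before the hard policy checks; it says nothing about whether the route is authorized.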
## 6. Route Result Contract
A route result is a signed or locally verifiable decision artifact that is
valid only for a bounded time.
Required fields:
- `route_id`
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `route_class`
- `channel_class`
- `selected_path`
- `selected_qos_class`
- `score`
- `valid_from`
- `expires_at`
- `route_epoch`
- `policy_version`
- `decision_reason`
Selected path contains ordered logical hops:
- source node
- optional ingress node
- zero or more core/relay nodes
- optional egress/service node
- target/service endpoint
Optional fields:
- `fallback_paths`
- `shortcut_candidate`
- `stickiness_key`
- `drain_after`
- `degraded_mode`
- `constraints_applied`
- `rejection_reason`
Route results must be bounded by expiry, policy version, route epoch, and
cluster authority state.
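The bounding rule above can be sketched as a single usability check. This is a hedged illustration: `RouteResult`, `is_usable`, the integer types for `route_epoch` and `policy_version`, and the boolean `authority_ok` flag are assumptions, not C15 definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RouteResult:
    """Decision artifact with the required fields from the contract above."""
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: tuple       # ordered logical hops
    selected_qos_class: str
    score: float
    valid_from: datetime
    expires_at: datetime
    route_epoch: int           # assumed to be an integer counter
    policy_version: int        # assumed to be an integer counter
    decision_reason: str
    organization_id: Optional[str] = None  # "where applicable"


def is_usable(result: RouteResult, now: datetime, current_epoch: int,
              current_policy_version: int, authority_ok: bool) -> bool:
    """True only while every bound named in the contract still holds."""
    if not authority_ok:                      # cluster authority state
        return False
    if result.route_class == "unavailable":   # explicit denial is never usable
        return False
    if not (result.valid_from <= now < result.expires_at):  # expiry
        return False
    return (result.route_epoch == current_epoch
            and result.policy_version == current_policy_version)
```

A consumer would call `is_usable` each time it is about to rely on a stored result, rather than once at creation time.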
## 7. Channel Classes
Routing is channel-aware.
Initial channel classes:
- `control`
- `input`
- `render`
- `cursor`
- `clipboard`
- `file_transfer`
- `telemetry`
- `vpn_packet`
- `storage_fetch`
- `update_fetch`
Rules:
- `input` and critical `control` prefer lowest latency and lowest jitter.
- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
- `cursor` is latest-only and should use low-latency paths.
- `clipboard` is reliable and bounded.
- `file_transfer` prefers throughput but must not starve input/control/render.
- `telemetry` is low priority and may be sampled or dropped.
- `vpn_packet` uses adaptive QoS and bulk protection.
- `storage_fetch` and `update_fetch` should not consume interactive reserves.
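To make a few of these rules concrete, the sketch below encodes only the properties stated above: which channels may be dropped, which are latest-only, and which must stay out of interactive reserves. The set names and the `coalesce_latest_only` helper are hypothetical, and the queue model is an assumption used purely to show what "latest-only" means.

```python
# Properties taken from the rules above; the set names are hypothetical.
MAY_DROP = frozenset({"render", "telemetry"})            # stale render, sampled telemetry
LATEST_ONLY = frozenset({"cursor"})                      # only the newest item matters
NO_INTERACTIVE_RESERVE = frozenset({"storage_fetch", "update_fetch"})


def may_use_interactive_reserve(channel_class: str) -> bool:
    """Bulk fetch channels should not consume interactive reserves."""
    return channel_class not in NO_INTERACTIVE_RESERVE


def coalesce_latest_only(pending):
    """Keep every message except superseded ones on latest-only channels.

    `pending` is an ordered list of (channel_class, payload) tuples.
    """
    last_index = {}
    for i, (channel, _payload) in enumerate(pending):
        if channel in LATEST_ONLY:
            last_index[channel] = i
    return [item for i, item in enumerate(pending)
            if item[0] not in LATEST_ONLY or last_index[item[0]] == i]
```

For example, three queued `cursor` updates interleaved with `input` and `clipboard` messages collapse to the single newest `cursor` update, while the other messages keep their order.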
## 8. Route Classes
Initial route classes:
- `direct`
- `single_relay`
- `multi_hop`
- `storage_local`
- `storage_remote`
- `vpn_chained`
- `degraded_existing`
- `unavailable`
`direct`:
- selected when source can safely reach destination directly
- trust and policy must allow it
`single_relay`:
- selected when one relay improves connectivity or policy requires relay
`multi_hop`:
- selected when direct/single relay is unavailable or policy/region requires it
`storage_local` / `storage_remote`:
- used for config/snapshot/artifact fetch decisions
`vpn_chained`:
- used when a managed service or IP tunnel depends on a logical
`vpn_connection`
`degraded_existing`:
- keeps an already-authorized existing path alive while policy permits
`unavailable`:
- explicit denial or no valid route
## 9. Hard Policy Checks
Hard checks run before scoring.
Reject route when:
- source node is not trusted
- source node is not a member of the cluster
- destination is outside cluster scope
- cross-cluster trust is missing
- organization scope does not match
- role assignment does not permit the route
- peer certificate is invalid or revoked
- required channel is not authorized
- partition/authority state forbids new route
- destination node is draining or disabled and policy forbids placement
- route would leak topology or tenant data
No score can override hard policy rejection.
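The ordering rule (hard checks first, scoring second) can be sketched as follows. This is an assumption-laden illustration: `CheckContext`, the reason strings, and `decide` are hypothetical names, only some of the checks above are modeled, and a real engine would evaluate checks per candidate peer rather than once per request.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class CheckContext:
    """Pre-computed facts about the request; all flags are hypothetical."""
    source_trusted: bool = True
    source_in_cluster: bool = True
    destination_in_scope: bool = True
    organization_matches: bool = True
    role_permits_route: bool = True
    peer_cert_valid: bool = True
    channel_authorized: bool = True
    authority_allows_new_route: bool = True


# (flag name, rejection reason) pairs, evaluated in order.
HARD_CHECKS = (
    ("source_trusted", "source_not_trusted"),
    ("source_in_cluster", "source_not_cluster_member"),
    ("destination_in_scope", "destination_outside_cluster_scope"),
    ("organization_matches", "organization_scope_mismatch"),
    ("role_permits_route", "role_not_permitted"),
    ("peer_cert_valid", "peer_cert_invalid"),
    ("channel_authorized", "channel_not_authorized"),
    ("authority_allows_new_route", "authority_state_forbids_route"),
)


def hard_check(ctx: CheckContext) -> Optional[str]:
    """Return the first rejection reason, or None when all checks pass."""
    for flag, reason in HARD_CHECKS:
        if not getattr(ctx, flag):
            return reason
    return None


def decide(ctx: CheckContext, scores: Dict[str, float]) -> Tuple[str, str]:
    """Scores are consulted only after every hard check has passed."""
    reason = hard_check(ctx)
    if reason is not None:
        return ("unavailable", reason)
    if not scores:
        return ("unavailable", "no_valid_route")
    return (max(scores, key=scores.get), "best_score")
```

Because `decide` returns before it ever looks at `scores`, a candidate with a near-perfect score is still rejected when any hard check fails.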
## 10. Scoring Inputs
Soft scoring inputs:
- latency
- jitter
- packet loss
- reliability
- recent failure history
- region distance
- load
- available bandwidth
- role suitability
- route length
- service co-location
- stickiness preference
- cost preference
- policy preference
- health score
Scoring weights are policy-driven and may differ by channel class.
Example:
- input/control heavily weight latency and jitter
- file transfer heavily weights throughput and reliability
- VPN bulk considers QoS impact on interactive routes
- storage fetch considers locality and replica freshness
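A weighted sum is one simple way to express channel-dependent scoring. The sketch below is illustrative only: the weight values, the metric names, and the assumption that every metric is pre-normalized to the range 0 to 1 with higher meaning better are not defined by C15.

```python
# Hypothetical policy-driven weights per channel class. Each row sums to 1.
WEIGHTS = {
    "input": {"latency": 0.5, "jitter": 0.4, "throughput": 0.05, "reliability": 0.05},
    "file_transfer": {"latency": 0.05, "jitter": 0.05, "throughput": 0.6, "reliability": 0.3},
}

# Two example candidates with normalized metrics (1.0 is best).
CANDIDATES = {
    "low_latency": {"latency": 0.9, "jitter": 0.9, "throughput": 0.2, "reliability": 0.8},
    "high_bandwidth": {"latency": 0.3, "jitter": 0.5, "throughput": 0.9, "reliability": 0.9},
}


def score(metrics: dict, channel_class: str) -> float:
    """Weighted sum of normalized metrics for one channel class."""
    weights = WEIGHTS[channel_class]
    return sum(weights[name] * metrics[name] for name in weights)


def best_path(candidates: dict, channel_class: str) -> str:
    """Name of the highest-scoring candidate for the channel class."""
    return max(candidates, key=lambda name: score(candidates[name], channel_class))
```

With these example numbers, the same two candidates rank differently by channel: `input` selects `low_latency` and `file_transfer` selects `high_bandwidth`.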
## 11. Route Cache Relationship
Route cache is local and bounded.
Cache key inputs:
- cluster id
- organization id
- source node
- destination kind/ref
- service type
- channel class
- policy version
- route epoch
- stickiness key
Cache entries contain:
- route result
- expiry
- score
- last success/failure
- backoff state
- fallback candidates
Cache invalidation triggers:
- policy version change
- peer directory version change
- trust/revocation update
- route epoch change
- health state change
- repeated route failure
- expiry
Route cache is a performance aid, not route authority.
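One possible shape for such a cache is sketched below. The names `CacheKey` and `RouteCache`, the use of float timestamps, and the oldest-first eviction rule are assumptions. Because policy version and route epoch are part of the key, a version bump turns old entries into misses without any explicit purge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    """Key built from the cache key inputs listed above."""
    cluster_id: str
    organization_id: Optional[str]
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str] = None


class RouteCache:
    """Local, bounded, non-authoritative cache of route results."""

    def __init__(self, max_entries: int = 1024):
        self._max = max_entries
        self._entries = {}  # CacheKey -> (route_result, expires_at)

    def put(self, key: CacheKey, route_result, expires_at: float) -> None:
        self._entries.pop(key, None)
        if len(self._entries) >= self._max:
            # Assumed eviction rule: drop the oldest inserted entry.
            del self._entries[next(iter(self._entries))]
        self._entries[key] = (route_result, expires_at)

    def get(self, key: CacheKey, now: float):
        entry = self._entries.get(key)
        if entry is None:
            return None
        route_result, expires_at = entry
        if now >= expires_at:  # expiry is an invalidation trigger
            del self._entries[key]
            return None
        return route_result

    def invalidate_all(self) -> None:
        """Coarse invalidation, e.g. on a trust/revocation update."""
        self._entries.clear()

    def __len__(self) -> int:
        return len(self._entries)
```

A hit from this cache is still only a hint: the caller is expected to re-verify the result against current bounds before using it.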
## 12. Failover Boundaries
Failover decisions may:
- switch from failed active path to fallback path
- promote warm peer path
- retry through bootstrap route for recovery
- mark route unavailable
- request control-plane/config refresh when reachable
- keep degraded existing path alive if policy permits
Failover decisions must not:
- create new cluster authority
- bypass policy
- add nodes
- approve role changes
- cross cluster boundaries without explicit trust
- expose topology to organizations
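A minimal sketch of the first and fourth permitted actions, under the assumption that fallback paths were already authorized when the route result was issued. The function name `fail_over` and the tuple-of-node-ids path representation are hypothetical. The function only chooses among paths it is given, so it cannot add nodes or create authority.

```python
from typing import Iterable, Optional, Sequence, Set, Tuple

Path = Tuple[str, ...]


def fail_over(active_path: Path, fallback_paths: Sequence[Path],
              healthy_nodes: Set[str]) -> Tuple[str, Optional[Path]]:
    """Pick the first pre-authorized fallback whose hops are all healthy.

    Returns ("fallback", path) or ("unavailable", None). No path outside
    `fallback_paths` is ever constructed or returned.
    """
    for path in fallback_paths:
        if path != active_path and all(node in healthy_nodes for node in path):
            return ("fallback", path)
    return ("unavailable", None)
```

Returning an explicit `unavailable` outcome, instead of raising or retrying silently, keeps the failure visible to the caller.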
## 13. Shortcut Decision Boundary
Shortcut connections are optional optimization recommendations.
A shortcut may be recommended when:
- long-lived flow exists
- current path latency/jitter is high
- direct connectivity appears possible
- trust validation succeeds
- policy allows shortcut
- shortcut improves latency, jitter, or bandwidth
- fallback path remains available
Shortcut recommendation output:
- source node
- destination node
- channel classes affected
- expected improvement
- required validation
- expiry
- fallback route id
C15 does not implement shortcut connections. It only defines when a future
Routing Engine may recommend them.
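For illustration, the recommendation conditions can be read as a conjunction. That reading is an assumption, as are the name `ShortcutSignals`, the thresholds, and the use of latency as the only improvement metric. The sketch produces a recommendation decision only and opens no connection.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ShortcutSignals:
    """Observations about one long-lived flow; all fields are hypothetical."""
    flow_age_s: float
    current_latency_ms: float
    direct_latency_ms: Optional[float]  # None: direct connectivity not apparent
    trust_validated: bool
    policy_allows: bool
    fallback_route_id: Optional[str]    # None: no fallback path remains


def should_recommend(sig: ShortcutSignals, min_flow_age_s: float = 60.0,
                     high_latency_ms: float = 80.0) -> bool:
    """True only when every condition holds (assumed AND semantics)."""
    if sig.fallback_route_id is None:
        return False                      # fallback path must remain available
    if not (sig.trust_validated and sig.policy_allows):
        return False
    if sig.direct_latency_ms is None:
        return False
    if sig.flow_age_s < min_flow_age_s:   # flow is not long-lived yet
        return False
    if sig.current_latency_ms < high_latency_ms:
        return False                      # current path is already good enough
    return sig.direct_latency_ms < sig.current_latency_ms
```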
## 14. Service Adapter Integration
Service Adapters may ask for routes using service-neutral metadata.
Examples:
- RDP Adapter requests route to RDP service/egress node or resource target.
- VNC Adapter requests route to VNC target zone.
- SSH Adapter requests route to SSH target.
- VPN/IP tunnel service requests route through `vpn_connection`.
- Storage fetch requests route to config/storage scope.
Service Adapters must not:
- enumerate peers
- select mesh paths
- create relay chains
- create shortcuts
- implement failover policy
- implement partition recovery
- implement cross-cluster routing trust
The adapter consumes a route result and sends/receives through the approved
data-plane boundary when runtime exists.
## 15. Topology Hiding
Organizations see:
- allowed service endpoints
- safe ingress/egress status
- safe session/resource status
- policy-visible route dependency names where allowed
Organizations must not see:
- intermediate core mesh nodes
- full peer directory
- route cache
- shortcut candidates
- other organizations' route data
- storage shard placement
Platform owners may inspect routing internals according to audited platform
policy.
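One way to enforce this split is an allowlist projection applied before any route data leaves the platform boundary. The sketch below is an assumption throughout: the hop dictionary shape, the role names, the choice of tenant-visible fields, and the function `tenant_view` are hypothetical.

```python
# Allowlists (hypothetical): anything not named here is hidden by default.
TENANT_FIELDS = ("request_id", "channel_class", "expires_at")
VISIBLE_ROLES = frozenset({"ingress", "egress", "service"})


def tenant_view(route_result: dict) -> dict:
    """Project a full route result onto its organization-visible subset."""
    view = {name: route_result[name] for name in TENANT_FIELDS
            if name in route_result}
    view["endpoints"] = [
        hop["name"] for hop in route_result.get("selected_path", [])
        if hop["role"] in VISIBLE_ROLES
    ]
    return view
```

An allowlist is used instead of a denylist so that fields added later, such as new cache or shortcut metadata, stay hidden unless someone explicitly exposes them.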
## 16. Degraded and Partition Behavior
In degraded mode, the Routing Engine may:
- keep existing authorized routes alive until TTL
- use last signed snapshot for recovery
- select fallback among already-authorized peers
- mark route unavailable when safety cannot be proven
In degraded mode, the Routing Engine must not:
- authorize new high-risk routes
- mutate cluster trust
- approve nodes
- assign roles
- promote partition authority automatically
- create cross-cluster trust
## 17. Observability
Routing decisions should emit safe telemetry:
- route selected
- route rejected
- rejection reason
- route class
- channel class
- score bucket
- latency/jitter/packet loss summary
- failover count
- fallback used
- shortcut recommended
- policy version
- peer directory version
- route epoch
Tenant-visible telemetry must hide topology.
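A telemetry event carrying a score bucket, not a raw score or path, might look like the sketch below. The bucket thresholds, the field names, and the function names are assumptions. The event deliberately has no hop or peer fields.

```python
from typing import Optional


def score_bucket(score: float) -> str:
    """Coarsen a 0..1 score into a bucket (thresholds are hypothetical)."""
    if score >= 0.75:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


def routing_event(outcome: str, route_class: str, channel_class: str,
                  score: float, policy_version: int, route_epoch: int,
                  rejection_reason: Optional[str] = None) -> dict:
    """Build a topology-free telemetry record for one routing decision."""
    event = {
        "outcome": outcome,                  # e.g. "route_selected"
        "route_class": route_class,
        "channel_class": channel_class,
        "score_bucket": score_bucket(score),
        "policy_version": policy_version,
        "route_epoch": route_epoch,
    }
    if rejection_reason is not None:
        event["rejection_reason"] = rejection_reason
    return event
```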
## 18. Future Validation Tests
Future implementation tests must prove:
- route request rejects wrong cluster
- route request rejects wrong organization
- revoked peer is not selected
- unavailable route returns explicit result
- cache invalidates on policy version change
- cache invalidates on peer directory version change
- input route prefers latency over throughput
- file transfer route does not starve input class
- service adapter cannot bypass routing engine
- shortcut recommendation requires fallback path
- degraded mode does not authorize new forbidden routes
## 19. C16 Preparation
C16 must define the secure node-to-node channel lifecycle that can later carry
route-selected traffic.
C16 must preserve:
- routing results are bounded and policy-scoped
- channels are authenticated and authorized
- trust/revocation affects active channels
- Service Adapters remain above Fabric routing
- no mesh packet routing starts before explicit C17
## 20. Result / Decision
Stage C15 defines the Fabric Routing Engine as a skeleton boundary for route
requests, route results, scoring, cache relationship, failover, shortcut
recommendations, topology hiding, and Service Adapter integration.
Decisions:
- Routing belongs to Fabric, not Service Adapters.
- Route requests/results are logical contracts, not packet forwarding.
- Hard policy checks precede scoring.
- Route cache is local, bounded, and non-authoritative.
- Routing is channel-aware and QoS-aware.
- Shortcut connections are future optional recommendations, not C15 runtime.
- C16 must define secure node-to-node channels before mesh routing runtime.
No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C15.