# Fabric Routing Engine Skeleton

Status: Stage C15 result. Documentation and architecture only.

This document defines the Fabric Routing Engine skeleton boundary. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The Fabric Routing Engine is the logical Fabric layer responsible for choosing
authorized paths between ingress, core, egress, service, storage, and future
VPN/IP-tunnel components.

C15 defines the route decision boundary before runtime mesh routing exists.

The purpose is to ensure that future routing:

- is policy-aware
- is QoS-aware
- is channel-aware
- respects cluster and organization boundaries
- uses scoped local state and peer cache
- does not depend on live backend availability for realtime decisions
- is not implemented independently by Service Adapters

## 2. Non-Goals

C15 does not:

- carry production mesh traffic
- implement node-to-node transport
- implement relay forwarding
- implement VPN/IP tunnel packets
- implement QUIC/WebRTC
- implement route execution
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior

It defines contracts and responsibilities only.

## 3. Routing Engine Responsibilities

The Fabric Routing Engine owns:

- route request validation
- peer candidate filtering
- route scoring
- channel-aware path selection
- QoS class selection
- route cache lookup/update policy
- failover decision boundaries
- shortcut recommendation boundaries
- topology hiding
- policy and cluster-boundary enforcement
- service adapter routing integration boundary

The Routing Engine does not own:

- PostgreSQL source-of-truth mutation
- service protocol translation
- RDP/VNC/SSH/VPN implementation details
- raw packet forwarding
- direct secret resolution
- organization admin visibility
- node enrollment authority

## 4. Inputs

Routing decisions may consume:

- signed scoped cluster snapshot
- node-local peer cache
- route cache
- peer directory
- route policy
- QoS policy
- service assignment cache
- cluster membership
- organization scope
- service/resource scope
- channel class
- current health/degraded state
- partition/authority state
- failure history
- load and latency observations

Routing decisions must not require a live backend call in the realtime path.

## 5. Route Request Contract

A route request is a logical request for a path. It is not a packet.

Required fields:

- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `source_node_id`
- `source_role`
- `destination_kind`
- `destination_ref`
- `service_type`
- `channel_class`
- `priority_class`
- `policy_refs`
- `requested_at`

Destination kinds:

- `node`
- `egress_pool`
- `service_instance`
- `resource_target`
- `vpn_connection`
- `storage_scope`
- `control_plane_endpoint`

Optional fields:

- `session_id`
- `attachment_id`
- `resource_id`
- `user_id`
- `device_id`
- `region_preference`
- `required_capabilities`
- `forbidden_nodes`
- `preferred_nodes`
- `max_latency_ms`
- `min_bandwidth_hint`
- `stickiness_key`
- `previous_route_id`
- `failure_context`

Service adapters may create route requests through an adapter-facing boundary,
but they must not select peers or paths themselves.

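As an illustration only (C15 itself ships no code), the route request contract
could be modeled as an immutable record with a shape check. The field types,
the Unix-seconds timestamp, the subset of optional fields, and the
`validate_route_request` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

# Destination kinds listed in the route request contract.
DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})


@dataclass(frozen=True)
class RouteRequest:
    # Required fields from the contract.
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: tuple[str, ...]
    requested_at: float  # assumption: Unix timestamp in seconds
    # Required "where applicable", so modeled as optional here.
    organization_id: Optional[str] = None
    # A few optional fields; the others are omitted for brevity.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    forbidden_nodes: frozenset[str] = frozenset()
    preferred_nodes: frozenset[str] = frozenset()


def validate_route_request(req: RouteRequest) -> list[str]:
    """Return validation errors; an empty list means the shape is acceptable."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "channel_class",
                 "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing required field: {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind}")
    if req.max_latency_ms is not None and req.max_latency_ms <= 0:
        errors.append("max_latency_ms must be positive")
    if req.forbidden_nodes & req.preferred_nodes:
        errors.append("a node cannot be both preferred and forbidden")
    return errors
```

The request names a destination and constraints but never a peer or a path,
which keeps path selection on the Routing Engine side of the boundary.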
## 6. Route Result Contract

A route result is a signed or locally verifiable decision artifact for a
bounded time.

Required fields:

- `route_id`
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `route_class`
- `channel_class`
- `selected_path`
- `selected_qos_class`
- `score`
- `valid_from`
- `expires_at`
- `route_epoch`
- `policy_version`
- `decision_reason`

Selected path contains ordered logical hops:

- source node
- optional ingress node
- zero or more core/relay nodes
- optional egress/service node
- target/service endpoint

Optional fields:

- `fallback_paths`
- `shortcut_candidate`
- `stickiness_key`
- `drain_after`
- `degraded_mode`
- `constraints_applied`
- `rejection_reason`

Route results must be bounded by expiry, policy version, route epoch, and
cluster authority state.

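The four bounds on a route result can be expressed as a single check that a
consumer runs before using the result. This is a sketch under assumptions:
integer `route_epoch` and `policy_version`, Unix-seconds timestamps, and a
boolean `authority_allows_routes` that summarizes cluster authority state are
not defined by C15. `organization_id` and the optional fields are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteResult:
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: tuple[str, ...]  # ordered logical hops, source first
    selected_qos_class: str
    score: float
    valid_from: float    # assumption: Unix seconds
    expires_at: float    # assumption: Unix seconds
    route_epoch: int     # assumption: monotonically increasing integer
    policy_version: int  # assumption: integer version
    decision_reason: str


def is_still_usable(result: RouteResult, now: float, current_epoch: int,
                    current_policy_version: int,
                    authority_allows_routes: bool) -> bool:
    """True only while every bound on the result still holds."""
    return (
        result.route_class != "unavailable"
        and result.valid_from <= now < result.expires_at
        and result.route_epoch == current_epoch
        and result.policy_version == current_policy_version
        and authority_allows_routes
    )
```

Any single failed bound makes the result unusable; the caller would then ask
for a fresh decision instead of extending the old one.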
## 7. Channel Classes

Routing is channel-aware.

Initial channel classes:

- `control`
- `input`
- `render`
- `cursor`
- `clipboard`
- `file_transfer`
- `telemetry`
- `vpn_packet`
- `storage_fetch`
- `update_fetch`

Rules:

- `input` and critical `control` prefer lowest latency and lowest jitter.
- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
- `cursor` is latest-only and should use low-latency paths.
- `clipboard` is reliable and bounded.
- `file_transfer` prefers throughput but must not starve input/control/render.
- `telemetry` is low priority and may be sampled or dropped.
- `vpn_packet` uses adaptive QoS and bulk protection.
- `storage_fetch` and `update_fetch` should not consume interactive reserves.

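One way to make the channel rules machine-readable is a small lookup table.
The sketch below is an interpretation, not a defined schema: the metric names,
the reading of "latest-only" as droppable, and the assumption that
"interactive reserves" means the `input`, `control`, and `render` classes are
all choices made for illustration.

```python
# Preferred path metrics per channel class, read from the channel rules.
# An empty tuple means the rules state no metric preference for that class.
PREFERRED_METRICS = {
    "control": ("latency", "jitter"),
    "input": ("latency", "jitter"),
    "render": ("bandwidth", "jitter"),
    "cursor": ("latency",),
    "clipboard": ("reliability",),
    "file_transfer": ("throughput",),
    "telemetry": (),
    "vpn_packet": ("adaptive",),  # placeholder: the rules give no fixed metric
    "storage_fetch": (),
    "update_fetch": (),
}

# Classes whose stale, superseded, or sampled data may be discarded.
DROPPABLE = frozenset({"render", "cursor", "telemetry"})

# Assumption: these are the classes protected by interactive reserves.
INTERACTIVE = frozenset({"input", "control", "render"})

# Classes with an explicit rule against starving or consuming that reserve.
MUST_YIELD = frozenset({"file_transfer", "storage_fetch", "update_fetch"})


def protected_from(channel_class: str) -> frozenset:
    """Return the classes that `channel_class` must not starve.

    An empty set means the rules state no such constraint for the class.
    """
    if channel_class not in PREFERRED_METRICS:
        raise ValueError(f"unknown channel class: {channel_class}")
    return INTERACTIVE if channel_class in MUST_YIELD else frozenset()
```

For example, `protected_from("file_transfer")` returns the interactive set,
while `protected_from("input")` returns an empty set.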
## 8. Route Classes

Initial route classes:

- `direct`
- `single_relay`
- `multi_hop`
- `storage_local`
- `storage_remote`
- `vpn_chained`
- `degraded_existing`
- `unavailable`

`direct`:

- selected when source can safely reach destination directly
- trust and policy must allow it

`single_relay`:

- selected when one relay improves connectivity or policy requires relay

`multi_hop`:

- selected when direct/single relay is unavailable or policy/region requires it

`storage_local` / `storage_remote`:

- used for config/snapshot/artifact fetch decisions

`vpn_chained`:

- used when a managed service or IP tunnel depends on a logical
  `vpn_connection`

`degraded_existing`:

- keeps an already-authorized existing path alive while policy permits

`unavailable`:

- explicit denial or no valid route

## 9. Hard Policy Checks

Hard checks run before scoring.

Reject route when:

- source node is not trusted
- source node is not a member of the cluster
- destination is outside cluster scope
- cross-cluster trust is missing
- organization scope does not match
- role assignment does not permit the route
- peer certificate is invalid or revoked
- required channel is not authorized
- partition/authority state forbids new route
- destination node is draining or disabled and policy forbids placement
- route would leak topology or tenant data

No score can override hard policy rejection.

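The ordering rule (filter on hard checks first, score only what survives) can
be sketched as follows. The check names, the `Candidate` type, and the
fail-closed treatment of missing check outcomes are assumptions for this
illustration.

```python
from dataclasses import dataclass, field

# One hypothetical name per rejection condition in the hard-check list.
HARD_CHECKS = (
    "source_not_trusted",
    "source_not_cluster_member",
    "destination_outside_cluster_scope",
    "cross_cluster_trust_missing",
    "organization_scope_mismatch",
    "role_forbids_route",
    "peer_certificate_invalid_or_revoked",
    "channel_not_authorized",
    "authority_state_forbids_new_route",
    "destination_draining_or_disabled",
    "leaks_topology_or_tenant_data",
)


@dataclass(frozen=True)
class Candidate:
    path_id: str
    score: float
    # Check name -> True when the check failed for this candidate.
    violations: dict = field(default_factory=dict)


def hard_rejections(candidate: Candidate) -> list:
    """Fail closed: a check with no recorded outcome counts as a violation."""
    return [name for name in HARD_CHECKS
            if candidate.violations.get(name, True)]


def select(candidates):
    """Drop every candidate with a hard rejection, then pick the best score."""
    allowed = [c for c in candidates if not hard_rejections(c)]
    return max(allowed, key=lambda c: c.score) if allowed else None
```

Because filtering happens before `max`, a candidate with a revoked peer
certificate loses to any valid candidate regardless of its score.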
## 10. Scoring Inputs

Soft scoring inputs:

- latency
- jitter
- packet loss
- reliability
- recent failure history
- region distance
- load
- available bandwidth
- role suitability
- route length
- service co-location
- stickiness preference
- cost preference
- policy preference
- health score

Scoring weights are policy-driven and may differ by channel class.

Example:

- input/control heavily weight latency and jitter
- file transfer heavily weights throughput and reliability
- VPN bulk considers QoS impact on interactive routes
- storage fetch considers locality and replica freshness

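A weighted sum is one simple way to realize channel-specific scoring. The
weights below are invented for illustration; in the design they would come
from route policy. Each metric is assumed to be pre-normalized to the range
0.0 to 1.0, where 1.0 is best (so a low-latency path has a high `latency`
value).

```python
# Hypothetical per-channel weights; real values would be policy-driven.
WEIGHTS = {
    "input": {"latency": 0.5, "jitter": 0.4, "throughput": 0.1},
    "file_transfer": {"latency": 0.1, "throughput": 0.5, "reliability": 0.4},
}


def score_path(channel_class: str, metrics: dict) -> float:
    """Weighted sum of normalized metrics; a missing metric contributes 0."""
    weights = WEIGHTS[channel_class]
    return sum(weight * metrics.get(name, 0.0)
               for name, weight in weights.items())


# Two hypothetical candidate paths with normalized observations.
low_latency_path = {"latency": 0.9, "jitter": 0.9,
                    "throughput": 0.2, "reliability": 0.5}
high_throughput_path = {"latency": 0.3, "jitter": 0.4,
                        "throughput": 0.95, "reliability": 0.9}
```

With these numbers the `input` class ranks `low_latency_path` higher, while
`file_transfer` ranks `high_throughput_path` higher, even though the
observations are identical for both classes.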
## 11. Route Cache Relationship

Route cache is local and bounded.

Cache key inputs:

- cluster id
- organization id
- source node
- destination kind/ref
- service type
- channel class
- policy version
- route epoch
- stickiness key

Cache entries contain:

- route result
- expiry
- score
- last success/failure
- backoff state
- fallback candidates

Cache invalidation triggers:

- policy version change
- peer directory version change
- trust/revocation update
- route epoch change
- health state change
- repeated route failure
- expiry

Route cache is a performance aid, not route authority.

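A minimal sketch of a local, bounded cache keyed on the inputs above. The
`RouteCache` class, the oldest-first eviction, and storing only a route id
and expiry per entry are simplifications assumed for illustration; score,
backoff state, and fallback candidates are left out.

```python
from collections import OrderedDict
from typing import NamedTuple, Optional


class CacheKey(NamedTuple):
    cluster_id: str
    organization_id: Optional[str]
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str] = None


class RouteCache:
    def __init__(self, max_entries: int = 1024):
        self._entries = OrderedDict()  # CacheKey -> (route_id, expires_at)
        self._max_entries = max_entries

    def put(self, key, route_id, expires_at):
        self._entries[key] = (route_id, expires_at)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the oldest entry

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            return None
        route_id, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # expiry is an invalidation trigger
            return None
        return route_id

    def invalidate(self, predicate):
        """Remove entries whose key matches; returns how many were removed."""
        stale = [key for key in self._entries if predicate(key)]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Because `policy_version` and `route_epoch` are part of the key, a version or
epoch change produces a cache miss without any explicit purge. Triggers that
are not part of the key, such as a trust/revocation update, would go through
`invalidate`.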
## 12. Failover Boundaries

Failover decisions may:

- switch from failed active path to fallback path
- promote warm peer path
- retry through bootstrap route for recovery
- mark route unavailable
- request control-plane/config refresh when reachable
- keep degraded existing path alive if policy permits

Failover decisions must not:

- create new cluster authority
- bypass policy
- add nodes
- approve role changes
- cross cluster boundaries without explicit trust
- expose topology to organizations

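The permitted failover outcomes can be sketched as a pure decision function
that only ever returns paths it was handed as already authorized. The state
names, the tuple return shape, and the omission of warm-peer promotion and
bootstrap retry are assumptions made to keep the example short.

```python
def decide_failover(active_state, fallback_paths, authorized_paths,
                    policy_permits_degraded):
    """Pick a failover action without creating authority or bypassing policy.

    active_state: "failed" or "degraded" (hypothetical state names).
    fallback_paths: ordered candidate paths, best first.
    authorized_paths: set of paths that policy has already authorized.
    """
    if active_state not in ("failed", "degraded"):
        raise ValueError(f"unexpected active_state: {active_state}")
    # Only a fallback that is already authorized may be selected.
    for path in fallback_paths:
        if path in authorized_paths:
            return ("switch_to_fallback", path)
    # A degraded (not failed) path may be kept alive if policy permits.
    if active_state == "degraded" and policy_permits_degraded:
        return ("keep_degraded_existing", None)
    return ("mark_unavailable", None)
```

The function has no way to add nodes, widen the authorized set, or cross a
cluster boundary, so the "must not" list holds by construction in this sketch.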
## 13. Shortcut Decision Boundary

Shortcut connections are optional optimization recommendations.

A shortcut may be recommended when:

- long-lived flow exists
- current path latency/jitter is high
- direct connectivity appears possible
- trust validation succeeds
- policy allows shortcut
- shortcut improves latency, jitter, or bandwidth
- fallback path remains available

Shortcut recommendation output:

- source node
- destination node
- channel classes affected
- expected improvement
- required validation
- expiry
- fallback route id

C15 does not implement shortcut connections. It only defines when a future
Routing Engine may recommend them.

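A future recommendation check might look like the sketch below. It assumes
that all listed conditions must hold together, models improvement by latency
only, and uses invented thresholds (`MIN_FLOW_AGE_S`, `HIGH_LATENCY_MS`) and
an invented `FlowObservation` type; none of these are defined by C15.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowObservation:
    source_node: str
    destination_node: str
    channel_classes: tuple[str, ...]
    flow_age_s: float
    current_latency_ms: float
    direct_latency_ms: Optional[float]  # None: direct path looks impossible
    trust_validated: bool
    policy_allows_shortcut: bool
    fallback_route_id: Optional[str]    # None: no fallback path remains


# Hypothetical thresholds; real values would come from route policy.
MIN_FLOW_AGE_S = 60.0
HIGH_LATENCY_MS = 80.0


def recommend_shortcut(obs: FlowObservation, now: float,
                       ttl_s: float = 300.0) -> Optional[dict]:
    """Return a recommendation, or None when any condition fails."""
    if obs.flow_age_s < MIN_FLOW_AGE_S:
        return None
    if obs.current_latency_ms < HIGH_LATENCY_MS:
        return None
    if (obs.direct_latency_ms is None
            or obs.direct_latency_ms >= obs.current_latency_ms):
        return None
    if not (obs.trust_validated and obs.policy_allows_shortcut):
        return None
    if obs.fallback_route_id is None:
        return None
    return {
        "source_node": obs.source_node,
        "destination_node": obs.destination_node,
        "channel_classes": obs.channel_classes,
        "expected_improvement_ms":
            obs.current_latency_ms - obs.direct_latency_ms,
        "required_validation": "peer_trust",
        "expires_at": now + ttl_s,
        "fallback_route_id": obs.fallback_route_id,
    }
```

The result is a recommendation only: nothing in the function opens a
connection, and a missing fallback route always suppresses it.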
## 14. Service Adapter Integration

Service Adapters may ask for routes using service-neutral metadata.

Examples:

- RDP Adapter requests route to RDP service/egress node or resource target.
- VNC Adapter requests route to VNC target zone.
- SSH Adapter requests route to SSH target.
- VPN/IP tunnel service requests route through `vpn_connection`.
- Storage fetch requests route to config/storage scope.

Service Adapters must not:

- enumerate peers
- select mesh paths
- create relay chains
- create shortcuts
- implement failover policy
- implement partition recovery
- implement cross-cluster routing trust

The adapter consumes a route result and sends/receives through the approved
data-plane boundary when runtime exists.

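The adapter-facing boundary could be as narrow as a single call. In this
sketch the `RoutingBoundary` protocol, its `request_route` method, and the
`RdpAdapterSketch` class are hypothetical names; the dictionary shapes are
simplified stand-ins for the request and result contracts.

```python
from typing import Protocol


class RoutingBoundary(Protocol):
    """The only routing surface an adapter is given (hypothetical)."""

    def request_route(self, metadata: dict) -> dict: ...


class RdpAdapterSketch:
    def __init__(self, routing: RoutingBoundary):
        # The adapter holds no peer directory, route cache, or mesh handle.
        self._routing = routing

    def open_session(self, cluster_id: str, resource_id: str) -> str:
        # Service-neutral metadata: a destination, never a peer or a path.
        result = self._routing.request_route({
            "cluster_id": cluster_id,
            "destination_kind": "resource_target",
            "destination_ref": resource_id,
            "service_type": "rdp",
            "channel_class": "control",
        })
        if result["route_class"] == "unavailable":
            raise ConnectionError(
                result.get("rejection_reason", "no valid route"))
        # The adapter keeps only the opaque route id for later data-plane use.
        return result["route_id"]
```

Since the adapter only ever sees `request_route`, it has nothing to enumerate
peers with and no place to implement its own failover or relay logic.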
## 15. Topology Hiding

Organizations see:

- allowed service endpoints
- safe ingress/egress status
- safe session/resource status
- policy-visible route dependency names where allowed

Organizations must not see:

- intermediate core mesh nodes
- full peer directory
- route cache
- shortcut candidates
- other organizations' route data
- storage shard placement

Platform owners may inspect routing internals according to audited platform
policy.

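Topology hiding is easiest to enforce with an allow-list: build the
organization-facing view from named fields instead of deleting sensitive ones.
Which fields are safe to expose is an assumption here, as is treating the last
hop of `selected_path` as the service endpoint.

```python
def tenant_view(route_result: dict) -> dict:
    """Build an organization-facing view from an explicit allow-list.

    Anything not copied here (intermediate hops, fallback paths, shortcut
    candidates, scores) stays invisible by default.
    """
    path = route_result.get("selected_path") or []
    return {
        "request_id": route_result["request_id"],
        "channel_class": route_result["channel_class"],
        "expires_at": route_result["expires_at"],
        # Assumption: the final hop is the allowed service endpoint.
        "service_endpoint": path[-1] if path else None,
        "available": route_result["route_class"] != "unavailable",
    }
```

A new internal field added to the route result later would not leak through
this view unless someone deliberately adds it to the allow-list.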
## 16. Degraded and Partition Behavior

In degraded mode, Routing Engine may:

- keep existing authorized routes alive until TTL
- use last signed snapshot for recovery
- select fallback among already-authorized peers
- mark route unavailable when safety cannot be proven

In degraded mode, Routing Engine must not:

- authorize new high-risk routes
- mutate cluster trust
- approve nodes
- assign roles
- promote partition authority automatically
- create cross-cluster trust

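The degraded-mode rules reduce to a conservative decision: keep what was
already authorized, fall back only within the authorized peer set, and
otherwise refuse. The function below is a sketch; its parameter names, the
`high_risk` flag, and the string outcomes are assumptions.

```python
def decide_in_degraded_mode(*, is_existing_route: bool, expires_at: float,
                            now: float, path_peers, authorized_peers,
                            high_risk: bool) -> str:
    """Return "keep", "fallback", or "unavailable" for a degraded node.

    path_peers: peers the candidate path would traverse.
    authorized_peers: peers authorized before the degradation began.
    """
    if is_existing_route:
        # Existing authorized routes may live until their TTL, not beyond.
        return "keep" if now < expires_at else "unavailable"
    if high_risk:
        # New high-risk routes are never authorized while degraded.
        return "unavailable"
    if path_peers and all(peer in authorized_peers for peer in path_peers):
        return "fallback"
    # Safety cannot be proven, so the route is marked unavailable.
    return "unavailable"
```

The function never changes `authorized_peers`, which mirrors the rule that
degraded mode must not mutate trust, approve nodes, or assign roles.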
## 17. Observability

Routing decisions should emit safe telemetry:

- route selected
- route rejected
- rejection reason
- route class
- channel class
- score bucket
- latency/jitter/packet loss summary
- failover count
- fallback used
- shortcut recommended
- policy version
- peer directory version
- route epoch

Tenant-visible telemetry must hide topology.

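A telemetry event might report a coarse score bucket instead of the raw
score, and add topology-derived detail only for platform consumers. The
bucket thresholds, the 0.0 to 1.0 score range, the event field names, and the
`hop_count` field are assumptions for this sketch.

```python
def score_bucket(score: float) -> str:
    """Coarsen a score assumed to lie in [0.0, 1.0]."""
    if score >= 0.8:
        return "high"
    if score >= 0.5:
        return "medium"
    return "low"


def routing_event(result: dict, tenant_visible: bool) -> dict:
    """Build a telemetry event from a simplified route result dict."""
    rejected = result["route_class"] == "unavailable"
    event = {
        "event": "route_rejected" if rejected else "route_selected",
        "route_class": result["route_class"],
        "channel_class": result["channel_class"],
        "policy_version": result["policy_version"],
        "route_epoch": result["route_epoch"],
    }
    if rejected:
        event["rejection_reason"] = result.get("rejection_reason", "unknown")
    else:
        event["score_bucket"] = score_bucket(result["score"])
    if not tenant_visible:
        # Platform-only detail; node identities are still not included.
        event["hop_count"] = len(result.get("selected_path", []))
    return event
```

Tenant-visible events in this sketch carry no path-derived fields at all,
not even the hop count.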
## 18. Future Validation Tests

Future implementation tests must prove:

- route request rejects wrong cluster
- route request rejects wrong organization
- revoked peer is not selected
- unavailable route returns explicit result
- cache invalidates on policy version change
- cache invalidates on peer directory version change
- input route prefers latency over throughput
- file transfer route does not starve input class
- service adapter cannot bypass routing engine
- shortcut recommendation requires fallback path
- degraded mode does not authorize new forbidden routes

## 19. C16 Preparation

C16 must define the secure node-to-node channel lifecycle that can later carry
route-selected traffic.

C16 must preserve:

- routing results are bounded and policy-scoped
- channels are authenticated and authorized
- trust/revocation affects active channels
- Service Adapters remain above Fabric routing
- no mesh packet routing starts before explicit C17

## 20. Result / Decision

Stage C15 defines Fabric Routing Engine as a skeleton boundary for route
requests, route results, scoring, cache relationship, failover, shortcut
recommendations, topology hiding, and Service Adapter integration.

Decisions:

- Routing belongs to Fabric, not Service Adapters.
- Route requests/results are logical contracts, not packet forwarding.
- Hard policy checks precede scoring.
- Route cache is local, bounded, and non-authoritative.
- Routing is channel-aware and QoS-aware.
- Shortcut connections are future optional recommendations, not C15 runtime.
- C16 must define secure node-to-node channels before mesh routing runtime.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C15.