Initial project snapshot
@@ -0,0 +1,518 @@
# Fabric Routing Engine Skeleton

Status: Stage C15 result. Documentation and architecture only.

This document defines the Fabric Routing Engine skeleton boundary. It does not
implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime,
relay packet routing, RDP work, or service workload execution.

## 1. Purpose

The Fabric Routing Engine is the logical Fabric layer responsible for choosing
authorized paths between ingress, core, egress, service, storage, and future
VPN/IP-tunnel components.

C15 defines the route decision boundary before runtime mesh routing exists.

The purpose is to ensure that future routing:

- is policy-aware
- is QoS-aware
- is channel-aware
- respects cluster and organization boundaries
- uses scoped local state and peer cache
- does not depend on live backend availability for realtime decisions
- is not implemented independently by Service Adapters

## 2. Non-Goals

C15 does not:

- carry production mesh traffic
- implement node-to-node transport
- implement relay forwarding
- implement VPN/IP tunnel packets
- implement QUIC/WebRTC
- implement route execution
- implement service workloads
- change RDP runtime
- change backend session lifecycle
- change Windows client behavior

It defines contracts and responsibilities only.

## 3. Routing Engine Responsibilities

The Fabric Routing Engine owns:

- route request validation
- peer candidate filtering
- route scoring
- channel-aware path selection
- QoS class selection
- route cache lookup/update policy
- failover decision boundaries
- shortcut recommendation boundaries
- topology hiding
- policy and cluster-boundary enforcement
- service adapter routing integration boundary

The Routing Engine does not own:

- PostgreSQL source-of-truth mutation
- service protocol translation
- RDP/VNC/SSH/VPN implementation details
- raw packet forwarding
- direct secret resolution
- organization admin visibility
- node enrollment authority

## 4. Inputs

Routing decisions may consume:

- signed scoped cluster snapshot
- node-local peer cache
- route cache
- peer directory
- route policy
- QoS policy
- service assignment cache
- cluster membership
- organization scope
- service/resource scope
- channel class
- current health/degraded state
- partition/authority state
- failure history
- load and latency observations

Routing decisions must not require a live backend call in the realtime path.

## 5. Route Request Contract

A route request is a logical request for a path. It is not a packet.

Required fields:

- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `source_node_id`
- `source_role`
- `destination_kind`
- `destination_ref`
- `service_type`
- `channel_class`
- `priority_class`
- `policy_refs`
- `requested_at`

Destination kinds:

- `node`
- `egress_pool`
- `service_instance`
- `resource_target`
- `vpn_connection`
- `storage_scope`
- `control_plane_endpoint`

Optional fields:

- `session_id`
- `attachment_id`
- `resource_id`
- `user_id`
- `device_id`
- `region_preference`
- `required_capabilities`
- `forbidden_nodes`
- `preferred_nodes`
- `max_latency_ms`
- `min_bandwidth_hint`
- `stickiness_key`
- `previous_route_id`
- `failure_context`

Service adapters may create route requests through an adapter-facing boundary,
but they must not select peers or paths themselves.
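
The contract above could be modelled as an immutable record with a small
validation step. The following is a minimal sketch, not part of C15: the field
types, the subset of optional fields shown, the `validate_route_request` helper,
and the rules that `policy_refs` is non-empty and that a node cannot be both
preferred and forbidden are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, List, Optional, Tuple

# Destination kinds listed in this section.
DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})


@dataclass(frozen=True)
class RouteRequest:
    # Required fields from the contract (types are assumed).
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: Tuple[str, ...]
    requested_at: datetime
    # "Where applicable", so modelled as optional in this sketch.
    organization_id: Optional[str] = None
    # A few of the optional fields; the rest would follow the same pattern.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    forbidden_nodes: FrozenSet[str] = frozenset()
    preferred_nodes: FrozenSet[str] = frozenset()


def validate_route_request(req: RouteRequest) -> List[str]:
    """Return validation errors; an empty list means the request is well-formed."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "channel_class",
                 "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing required field: {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind!r}")
    if not req.policy_refs:  # assumed rule: at least one policy reference
        errors.append("policy_refs must not be empty")
    if req.max_latency_ms is not None and req.max_latency_ms <= 0:
        errors.append("max_latency_ms must be positive")
    if req.forbidden_nodes & req.preferred_nodes:  # assumed consistency rule
        errors.append("a node cannot be both preferred and forbidden")
    return errors
```

The record carries no peer or path information, which keeps the sketch
consistent with the rule that adapters describe what they need and leave
selection to the Routing Engine.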

## 6. Route Result Contract

A route result is a signed or locally verifiable decision artifact for a
bounded time.

Required fields:

- `route_id`
- `request_id`
- `cluster_id`
- `organization_id` where applicable
- `route_class`
- `channel_class`
- `selected_path`
- `selected_qos_class`
- `score`
- `valid_from`
- `expires_at`
- `route_epoch`
- `policy_version`
- `decision_reason`

Selected path contains ordered logical hops:

- source node
- optional ingress node
- zero or more core/relay nodes
- optional egress/service node
- target/service endpoint

Optional fields:

- `fallback_paths`
- `shortcut_candidate`
- `stickiness_key`
- `drain_after`
- `degraded_mode`
- `constraints_applied`
- `rejection_reason`

Route results must be bounded by expiry, policy version, route epoch, and
cluster authority state.
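
The four bounds named above can be checked together before a cached or received
result is used. This is a sketch under assumptions: integer `route_epoch` and
`policy_version` values, exact-match comparison, and the `LocalAuthorityView`
type are hypothetical and not defined by C15.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class RouteResult:
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: Tuple[str, ...]  # ordered logical hop identifiers
    selected_qos_class: str
    score: float
    valid_from: datetime
    expires_at: datetime
    route_epoch: int       # assumed to be an integer counter
    policy_version: int    # assumed to be an integer counter
    decision_reason: str
    organization_id: Optional[str] = None


@dataclass(frozen=True)
class LocalAuthorityView:
    """Hypothetical snapshot of the node's current view of cluster state."""
    route_epoch: int
    policy_version: int
    authority_ok: bool  # False when partition/authority state forbids use


def is_usable(result: RouteResult, view: LocalAuthorityView, now: datetime) -> bool:
    """A result is usable only while every bound holds at the same time."""
    return (
        result.valid_from <= now < result.expires_at
        and result.route_epoch == view.route_epoch
        and result.policy_version == view.policy_version
        and view.authority_ok
    )
```

Treating any mismatch as "not usable" means a stale result falls back to a
fresh decision and is never silently extended.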

## 7. Channel Classes

Routing is channel-aware.

Initial channel classes:

- `control`
- `input`
- `render`
- `cursor`
- `clipboard`
- `file_transfer`
- `telemetry`
- `vpn_packet`
- `storage_fetch`
- `update_fetch`

Rules:

- `input` and critical `control` prefer lowest latency and lowest jitter.
- `render` prefers bandwidth and bounded jitter; stale render may be dropped.
- `cursor` is latest-only and should use low-latency paths.
- `clipboard` is reliable and bounded.
- `file_transfer` prefers throughput but must not starve input/control/render.
- `telemetry` is low priority and may be sampled or dropped.
- `vpn_packet` uses adaptive QoS and bulk protection.
- `storage_fetch` and `update_fetch` should not consume interactive reserves.
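
One way to make these rules machine-readable is a per-class profile table. The
sketch below is illustrative only: the `ChannelProfile` fields, the `prefers`
labels, and the decision to treat `control`, `input`, `render`, and `cursor` as
the interactive group are assumptions layered on the rules above.

```python
from dataclasses import dataclass
from enum import Enum


class ChannelClass(str, Enum):
    CONTROL = "control"
    INPUT = "input"
    RENDER = "render"
    CURSOR = "cursor"
    CLIPBOARD = "clipboard"
    FILE_TRANSFER = "file_transfer"
    TELEMETRY = "telemetry"
    VPN_PACKET = "vpn_packet"
    STORAGE_FETCH = "storage_fetch"
    UPDATE_FETCH = "update_fetch"


@dataclass(frozen=True)
class ChannelProfile:
    prefers: str       # dominant path preference (assumed label)
    may_drop: bool     # stale, superseded, or sampled data may be discarded
    interactive: bool  # may draw on interactive capacity reserves (assumed)


PROFILES = {
    ChannelClass.CONTROL: ChannelProfile("latency", False, True),
    ChannelClass.INPUT: ChannelProfile("latency", False, True),
    ChannelClass.RENDER: ChannelProfile("bandwidth", True, True),
    ChannelClass.CURSOR: ChannelProfile("latency", True, True),  # latest-only
    ChannelClass.CLIPBOARD: ChannelProfile("reliability", False, False),
    ChannelClass.FILE_TRANSFER: ChannelProfile("throughput", False, False),
    ChannelClass.TELEMETRY: ChannelProfile("none", True, False),
    ChannelClass.VPN_PACKET: ChannelProfile("adaptive", False, False),
    ChannelClass.STORAGE_FETCH: ChannelProfile("locality", False, False),
    ChannelClass.UPDATE_FETCH: ChannelProfile("locality", False, False),
}


def may_use_interactive_reserve(channel: ChannelClass) -> bool:
    """Bulk and background classes must not consume interactive reserves."""
    return PROFILES[channel].interactive
```

Keeping the rules in one table would let a future scorer and QoS selector read
the same source and avoid encoding channel behavior twice.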

## 8. Route Classes

Initial route classes:

- `direct`
- `single_relay`
- `multi_hop`
- `storage_local`
- `storage_remote`
- `vpn_chained`
- `degraded_existing`
- `unavailable`

`direct`:

- selected when source can safely reach destination directly
- trust and policy must allow it

`single_relay`:

- selected when one relay improves connectivity or policy requires relay

`multi_hop`:

- selected when direct/single relay is unavailable or policy/region requires it

`storage_local` / `storage_remote`:

- used for config/snapshot/artifact fetch decisions

`vpn_chained`:

- used when a managed service or IP tunnel depends on a logical
  `vpn_connection`

`degraded_existing`:

- keeps an already-authorized existing path alive while policy permits

`unavailable`:

- explicit denial or no valid route
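
For the three mesh classes, the label can be read from the number of relay hops
on the chosen path. The sketch below assumes a hypothetical
`classify_mesh_path` helper and uses `None` to mean that no valid path was
found; the storage, VPN, and degraded classes depend on context that this
example does not model.

```python
from enum import Enum
from typing import Optional, Sequence


class RouteClass(str, Enum):
    DIRECT = "direct"
    SINGLE_RELAY = "single_relay"
    MULTI_HOP = "multi_hop"
    STORAGE_LOCAL = "storage_local"
    STORAGE_REMOTE = "storage_remote"
    VPN_CHAINED = "vpn_chained"
    DEGRADED_EXISTING = "degraded_existing"
    UNAVAILABLE = "unavailable"


def classify_mesh_path(relay_hops: Optional[Sequence[str]]) -> RouteClass:
    """Label a mesh path by relay count; None means no valid path exists."""
    if relay_hops is None:
        return RouteClass.UNAVAILABLE
    if len(relay_hops) == 0:
        return RouteClass.DIRECT
    if len(relay_hops) == 1:
        return RouteClass.SINGLE_RELAY
    return RouteClass.MULTI_HOP
```

Returning `unavailable` as an ordinary value, not as an exception, matches the
requirement that a missing route is an explicit result.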

## 9. Hard Policy Checks

Hard checks run before scoring.

Reject route when:

- source node is not trusted
- source node is not a member of the cluster
- destination is outside cluster scope
- cross-cluster trust is missing
- organization scope does not match
- role assignment does not permit the route
- peer certificate is invalid or revoked
- required channel is not authorized
- partition/authority state forbids new route
- destination node is draining or disabled and policy forbids placement
- route would leak topology or tenant data

No score can override hard policy rejection.
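
The ordering rule can be enforced structurally by never calling the scorer for
a rejected candidate. The following sketch covers a subset of the checks above;
`RouteContext`, the reason strings, and the `decide` helper are hypothetical
names, and the boolean facts are assumed to be resolved from local state
beforehand.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple, Union


@dataclass(frozen=True)
class RouteContext:
    """Hypothetical pre-resolved facts about one candidate route."""
    source_trusted: bool
    source_in_cluster: bool
    destination_in_scope: bool
    organization_matches: bool
    peer_cert_valid: bool
    channel_authorized: bool
    authority_allows_new_route: bool


# Each entry pairs a rejection reason with a predicate that must hold.
HARD_CHECKS: List[Tuple[str, Callable[[RouteContext], bool]]] = [
    ("source_not_trusted", lambda c: c.source_trusted),
    ("source_not_in_cluster", lambda c: c.source_in_cluster),
    ("destination_out_of_scope", lambda c: c.destination_in_scope),
    ("organization_mismatch", lambda c: c.organization_matches),
    ("peer_cert_invalid_or_revoked", lambda c: c.peer_cert_valid),
    ("channel_not_authorized", lambda c: c.channel_authorized),
    ("authority_forbids_new_route", lambda c: c.authority_allows_new_route),
]


def first_rejection(ctx: RouteContext) -> Optional[str]:
    """Return the first failing hard check, or None if all checks pass."""
    for reason, holds in HARD_CHECKS:
        if not holds(ctx):
            return reason
    return None


def decide(ctx: RouteContext,
           score_fn: Callable[[RouteContext], float]) -> Tuple[str, Union[str, float]]:
    """Score only candidates that pass every hard check."""
    reason = first_rejection(ctx)
    if reason is not None:
        return ("unavailable", reason)  # the scorer is never invoked
    return ("candidate", score_fn(ctx))
```

Because `decide` returns before `score_fn` runs, a high score has no code path
through which it could override a rejection.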

## 10. Scoring Inputs

Soft scoring inputs:

- latency
- jitter
- packet loss
- reliability
- recent failure history
- region distance
- load
- available bandwidth
- role suitability
- route length
- service co-location
- stickiness preference
- cost preference
- policy preference
- health score

Scoring weights are policy-driven and may differ by channel class.

Example:

- input/control heavily weight latency and jitter
- file transfer heavily weights throughput and reliability
- VPN bulk considers QoS impact on interactive routes
- storage fetch considers locality and replica freshness
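
A weighted sum is one simple way to express channel-specific weighting. The
sketch below is an assumption, not the C15 scoring function: the weight values
are invented, only four of the inputs are used, and every metric is assumed to
be pre-normalized to the range 0 to 1 with 1 as best.

```python
from typing import Dict, Mapping

# Hypothetical weights; in the design they would come from route policy.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "input": {"latency": 0.5, "jitter": 0.3, "reliability": 0.1, "throughput": 0.1},
    "file_transfer": {"latency": 0.1, "jitter": 0.0, "reliability": 0.4, "throughput": 0.5},
}


def score(channel_class: str, metrics: Mapping[str, float]) -> float:
    """Weighted sum of normalized metrics; a missing metric counts as 0."""
    weights = WEIGHTS[channel_class]
    return sum(w * metrics.get(name, 0.0) for name, w in weights.items())


def best(channel_class: str, candidates: Mapping[str, Mapping[str, float]]) -> str:
    """Return the name of the highest-scoring candidate path."""
    return max(candidates, key=lambda name: score(channel_class, candidates[name]))


# Two illustrative candidates that have already passed hard policy checks.
CANDIDATES = {
    "low_latency_path": {"latency": 0.9, "jitter": 0.9, "reliability": 0.8, "throughput": 0.2},
    "high_bandwidth_path": {"latency": 0.3, "jitter": 0.4, "reliability": 0.9, "throughput": 0.95},
}
```

With these invented numbers, `best("input", CANDIDATES)` selects
`low_latency_path` while `best("file_transfer", CANDIDATES)` selects
`high_bandwidth_path`, so the same candidate set yields different routes per
channel class.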

## 11. Route Cache Relationship

Route cache is local and bounded.

Cache key inputs:

- cluster id
- organization id
- source node
- destination kind/ref
- service type
- channel class
- policy version
- route epoch
- stickiness key

Cache entries contain:

- route result
- expiry
- score
- last success/failure
- backoff state
- fallback candidates

Cache invalidation triggers:

- policy version change
- peer directory version change
- trust/revocation update
- route epoch change
- health state change
- repeated route failure
- expiry

Route cache is a performance aid, not route authority.
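
Because policy version and route epoch are part of the key, a change in either
produces a cache miss without an explicit purge. The sketch below illustrates
that idea together with expiry, peer directory version, and repeated-failure
checks. The `RouteCache` class, its size limit, the failure threshold, and the
reduced entry shape are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple, Optional


class CacheKey(NamedTuple):
    cluster_id: str
    organization_id: Optional[str]
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str]


@dataclass
class CacheEntry:
    route_id: str                # stands in for the full route result
    expires_at: float            # seconds on a caller-supplied clock
    peer_directory_version: int
    consecutive_failures: int = 0


class RouteCache:
    def __init__(self, max_entries: int = 1024, max_failures: int = 3) -> None:
        self._entries: Dict[CacheKey, CacheEntry] = {}
        self._max_entries = max_entries
        self._max_failures = max_failures

    def put(self, key: CacheKey, entry: CacheEntry) -> None:
        if key not in self._entries and len(self._entries) >= self._max_entries:
            # Evict the oldest insertion so the cache stays bounded.
            del self._entries[next(iter(self._entries))]
        self._entries[key] = entry

    def get(self, key: CacheKey, now: float,
            peer_directory_version: int) -> Optional[CacheEntry]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stale = (
            now >= entry.expires_at
            or entry.peer_directory_version != peer_directory_version
            or entry.consecutive_failures >= self._max_failures
        )
        if stale:
            del self._entries[key]  # drop the entry; the caller must re-decide
            return None
        return entry
```

A miss only means "ask the Routing Engine again", so the cache never acts as
route authority in this sketch.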

## 12. Failover Boundaries

Failover decisions may:

- switch from failed active path to fallback path
- promote warm peer path
- retry through bootstrap route for recovery
- mark route unavailable
- request control-plane/config refresh when reachable
- keep degraded existing path alive if policy permits

Failover decisions must not:

- create new cluster authority
- bypass policy
- add nodes
- approve role changes
- cross cluster boundaries without explicit trust
- expose topology to organizations
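
A failover decision stays inside these boundaries if it only chooses among
paths that were already authorized. The sketch below covers three of the
permitted outcomes; `FailoverAction`, `decide_failover`, and the string path
identifiers are hypothetical.

```python
from enum import Enum
from typing import AbstractSet, Optional, Sequence, Tuple


class FailoverAction(str, Enum):
    SWITCH_TO_FALLBACK = "switch_to_fallback"
    KEEP_DEGRADED_EXISTING = "keep_degraded_existing"
    MARK_UNAVAILABLE = "mark_unavailable"


def decide_failover(
    active_path_usable: bool,
    fallback_paths: Sequence[str],
    authorized_paths: AbstractSet[str],
    policy_allows_degraded: bool,
) -> Tuple[FailoverAction, Optional[str]]:
    """Pick a failover action without widening what is already authorized."""
    # Fallbacks are tried in order; unauthorized ones are skipped, never approved.
    for path in fallback_paths:
        if path in authorized_paths:
            return (FailoverAction.SWITCH_TO_FALLBACK, path)
    # No eligible fallback: keep a degraded path only if policy permits it.
    if active_path_usable and policy_allows_degraded:
        return (FailoverAction.KEEP_DEGRADED_EXISTING, None)
    return (FailoverAction.MARK_UNAVAILABLE, None)
```

The function has no input through which it could add nodes, change roles, or
create trust, which mirrors the "must not" list above.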

## 13. Shortcut Decision Boundary

Shortcut connections are optional optimization recommendations.

A shortcut may be recommended when:

- long-lived flow exists
- current path latency/jitter is high
- direct connectivity appears possible
- trust validation succeeds
- policy allows shortcut
- shortcut improves latency, jitter, or bandwidth
- fallback path remains available

Shortcut recommendation output:

- source node
- destination node
- channel classes affected
- expected improvement
- required validation
- expiry
- fallback route id

C15 does not implement shortcut connections. It only defines when a future
Routing Engine may recommend them.
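
If the conditions above are read as all being required, the recommendation
reduces to a single conjunction. That reading is an assumption, as are the
`ShortcutSignals` type, the thresholds, and the use of latency alone as the
improvement measure in this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class ShortcutSignals:
    """Hypothetical observations about one long-lived flow."""
    flow_age_s: float
    current_latency_ms: float
    expected_latency_ms: float     # estimated latency over the shortcut
    direct_reachable: bool
    trust_validated: bool
    policy_allows_shortcut: bool
    fallback_route_id: Optional[str]


def recommend_shortcut(
    s: ShortcutSignals,
    min_flow_age_s: float = 60.0,  # assumed threshold for "long-lived"
    min_gain_ms: float = 10.0,     # assumed minimum worthwhile improvement
) -> Optional[Dict[str, Union[str, float]]]:
    """Return a recommendation only when every condition holds, else None."""
    gain_ms = s.current_latency_ms - s.expected_latency_ms
    eligible = (
        s.flow_age_s >= min_flow_age_s
        and s.direct_reachable
        and s.trust_validated
        and s.policy_allows_shortcut
        and s.fallback_route_id is not None
        and gain_ms >= min_gain_ms
    )
    if not eligible:
        return None
    return {
        "expected_improvement_ms": gain_ms,
        "fallback_route_id": s.fallback_route_id,
    }
```

The result is advice only; nothing in the sketch opens a connection, which
keeps it on the recommendation side of the boundary.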

## 14. Service Adapter Integration

Service Adapters may ask for routes using service-neutral metadata.

Examples:

- RDP Adapter requests route to RDP service/egress node or resource target.
- VNC Adapter requests route to VNC target zone.
- SSH Adapter requests route to SSH target.
- VPN/IP tunnel service requests route through `vpn_connection`.
- Storage fetch requests route to config/storage scope.

Service Adapters must not:

- enumerate peers
- select mesh paths
- create relay chains
- create shortcuts
- implement failover policy
- implement partition recovery
- implement cross-cluster routing trust

The adapter consumes a route result and sends/receives through the approved
data-plane boundary when runtime exists.

## 15. Topology Hiding

Organizations see:

- allowed service endpoints
- safe ingress/egress status
- safe session/resource status
- policy-visible route dependency names where allowed

Organizations must not see:

- intermediate core mesh nodes
- full peer directory
- route cache
- shortcut candidates
- other organizations' route data
- storage shard placement

Platform owners may inspect routing internals according to audited platform
policy.
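
An allowlist projection is one way to keep internal fields from reaching an
organization-facing view: anything not named is dropped by default. The field
names in `TENANT_VISIBLE_FIELDS` and the shape of the internal record are
hypothetical.

```python
from typing import Any, Dict, Mapping

# Hypothetical allowlist of fields that are safe to show to an organization.
TENANT_VISIBLE_FIELDS = (
    "service_endpoint",
    "ingress_status",
    "egress_status",
    "session_status",
)


def tenant_view(internal_record: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy only allowlisted fields; unknown or new fields stay hidden."""
    return {
        name: internal_record[name]
        for name in TENANT_VISIBLE_FIELDS
        if name in internal_record
    }


# Illustrative internal record; every value here is made up.
INTERNAL = {
    "service_endpoint": "rdp-pool-a",
    "ingress_status": "healthy",
    "egress_status": "degraded",
    "session_status": "active",
    "selected_path": ["n1", "core-7", "core-9", "egress-2"],
    "shortcut_candidate": "n1->egress-2",
    "storage_shard": "shard-14",
}
```

An allowlist is used in place of a denylist so that a field added to the
internal record later stays hidden until someone explicitly exposes it.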

## 16. Degraded and Partition Behavior

In degraded mode, the Routing Engine may:

- keep existing authorized routes alive until TTL
- use last signed snapshot for recovery
- select fallback among already-authorized peers
- mark route unavailable when safety cannot be proven

In degraded mode, the Routing Engine must not:

- authorize new high-risk routes
- mutate cluster trust
- approve nodes
- assign roles
- promote partition authority automatically
- create cross-cluster trust
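
The two lists can be expressed as a default-deny gate: only the operations
explicitly permitted in degraded mode pass. The `Operation` names below are
hypothetical labels for the items in this section.

```python
from enum import Enum


class Operation(str, Enum):
    KEEP_EXISTING_ROUTE = "keep_existing_route"
    USE_LAST_SIGNED_SNAPSHOT = "use_last_signed_snapshot"
    SELECT_AUTHORIZED_FALLBACK = "select_authorized_fallback"
    MARK_ROUTE_UNAVAILABLE = "mark_route_unavailable"
    AUTHORIZE_HIGH_RISK_ROUTE = "authorize_high_risk_route"
    MUTATE_CLUSTER_TRUST = "mutate_cluster_trust"
    APPROVE_NODE = "approve_node"
    ASSIGN_ROLE = "assign_role"
    PROMOTE_PARTITION_AUTHORITY = "promote_partition_authority"
    CREATE_CROSS_CLUSTER_TRUST = "create_cross_cluster_trust"


# Only operations from the "may" list are allowed; everything else is denied.
DEGRADED_ALLOWED = frozenset({
    Operation.KEEP_EXISTING_ROUTE,
    Operation.USE_LAST_SIGNED_SNAPSHOT,
    Operation.SELECT_AUTHORIZED_FALLBACK,
    Operation.MARK_ROUTE_UNAVAILABLE,
})


def permitted_in_degraded_mode(op: Operation) -> bool:
    """Default-deny: an operation passes only if it is explicitly allowed."""
    return op in DEGRADED_ALLOWED
```

With a default-deny gate, an operation that is added later and not classified
is refused in degraded mode.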

## 17. Observability

Routing decisions should emit safe telemetry:

- route selected
- route rejected
- rejection reason
- route class
- channel class
- score bucket
- latency/jitter/packet loss summary
- failover count
- fallback used
- shortcut recommended
- policy version
- peer directory version
- route epoch

Tenant-visible telemetry must hide topology.
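
Emitting a score bucket in place of the raw score, and leaving hop identifiers
out of the event entirely, is one way to keep telemetry safe. The bucket count,
the assumption that scores fall between 0 and 1, and the event shape below are
illustrative choices.

```python
from typing import Any, Dict, Optional


def score_bucket(score: float, buckets: int = 5) -> int:
    """Map a score in [0, 1] to a coarse bucket index in [0, buckets - 1]."""
    clamped = min(max(score, 0.0), 1.0)
    return min(int(clamped * buckets), buckets - 1)


def build_event(
    route_class: str,
    channel_class: str,
    score: float,
    policy_version: int,
    route_epoch: int,
    rejection_reason: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a telemetry event that carries no node or path identifiers."""
    event: Dict[str, Any] = {
        "outcome": "rejected" if rejection_reason else "selected",
        "route_class": route_class,
        "channel_class": channel_class,
        "score_bucket": score_bucket(score),
        "policy_version": policy_version,
        "route_epoch": route_epoch,
    }
    if rejection_reason:
        event["rejection_reason"] = rejection_reason
    return event
```

Since the builder takes no path argument, topology cannot leak through this
event even by mistake.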

## 18. Future Validation Tests

Future implementation tests must prove:

- route request rejects wrong cluster
- route request rejects wrong organization
- revoked peer is not selected
- unavailable route returns explicit result
- cache invalidates on policy version change
- cache invalidates on peer directory version change
- input route prefers latency over throughput
- file transfer route does not starve input class
- service adapter cannot bypass routing engine
- shortcut recommendation requires fallback path
- degraded mode does not authorize new forbidden routes

## 19. C16 Preparation

C16 must define the secure node-to-node channel lifecycle that can later carry
route-selected traffic.

C16 must preserve:

- routing results are bounded and policy-scoped
- channels are authenticated and authorized
- trust/revocation affects active channels
- Service Adapters remain above Fabric routing
- no mesh packet routing starts before explicit C17

## 20. Result / Decision

Stage C15 defines Fabric Routing Engine as a skeleton boundary for route
requests, route results, scoring, cache relationship, failover, shortcut
recommendations, topology hiding, and Service Adapter integration.

Decisions:

- Routing belongs to Fabric, not Service Adapters.
- Route requests/results are logical contracts, not packet forwarding.
- Hard policy checks precede scoring.
- Route cache is local, bounded, and non-authoritative.
- Routing is channel-aware and QoS-aware.
- Shortcut connections are future optional recommendations, not C15 runtime.
- C16 must define secure node-to-node channels before mesh routing runtime.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service
workload behavior is changed by C15.