rdp-proxy/docs/architecture/FABRIC_ROUTING_ENGINE_SKELETON.md
2026-04-28 22:29:50 +03:00

Fabric Routing Engine Skeleton

Status: Stage C15 result. Documentation and architecture only.

This document defines the Fabric Routing Engine skeleton boundary. It does not implement code, migrations, APIs, mesh runtime traffic, VPN/IP tunnel runtime, relay packet routing, RDP work, or service workload execution.

1. Purpose

The Fabric Routing Engine is the logical Fabric layer responsible for choosing authorized paths between ingress, core, egress, service, storage, and future VPN/IP-tunnel components.

C15 defines the route decision boundary before runtime mesh routing exists.

The purpose is to ensure that future routing:

  • is policy-aware
  • is QoS-aware
  • is channel-aware
  • respects cluster and organization boundaries
  • uses scoped local state and peer cache
  • does not depend on live backend availability for realtime decisions
  • is not implemented independently by Service Adapters

2. Non-Goals

C15 does not:

  • carry production mesh traffic
  • implement node-to-node transport
  • implement relay forwarding
  • implement VPN/IP tunnel packets
  • implement QUIC/WebRTC
  • implement route execution
  • implement service workloads
  • change RDP runtime
  • change backend session lifecycle
  • change Windows client behavior

It defines contracts and responsibilities only.

3. Routing Engine Responsibilities

The Fabric Routing Engine owns:

  • route request validation
  • peer candidate filtering
  • route scoring
  • channel-aware path selection
  • QoS class selection
  • route cache lookup/update policy
  • failover decision boundaries
  • shortcut recommendation boundaries
  • topology hiding
  • policy and cluster-boundary enforcement
  • service adapter routing integration boundary

The Routing Engine does not own:

  • PostgreSQL source-of-truth mutation
  • service protocol translation
  • RDP/VNC/SSH/VPN implementation details
  • raw packet forwarding
  • direct secret resolution
  • organization admin visibility
  • node enrollment authority

4. Inputs

Routing decisions may consume:

  • signed scoped cluster snapshot
  • node-local peer cache
  • route cache
  • peer directory
  • route policy
  • QoS policy
  • service assignment cache
  • cluster membership
  • organization scope
  • service/resource scope
  • channel class
  • current health/degraded state
  • partition/authority state
  • failure history
  • load and latency observations

Routing decisions must not require a live backend call in the realtime path.

5. Route Request Contract

A route request is a logical request for a path. It is not a packet.

Required fields:

  • request_id
  • cluster_id
  • organization_id where applicable
  • source_node_id
  • source_role
  • destination_kind
  • destination_ref
  • service_type
  • channel_class
  • priority_class
  • policy_refs
  • requested_at

Destination kinds:

  • node
  • egress_pool
  • service_instance
  • resource_target
  • vpn_connection
  • storage_scope
  • control_plane_endpoint

Optional fields:

  • session_id
  • attachment_id
  • resource_id
  • user_id
  • device_id
  • region_preference
  • required_capabilities
  • forbidden_nodes
  • preferred_nodes
  • max_latency_ms
  • min_bandwidth_hint
  • stickiness_key
  • previous_route_id
  • failure_context

Service adapters may create route requests through an adapter-facing boundary, but they must not select peers or paths themselves.
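The request contract above could be modeled as an immutable record with a structural validator. This is a minimal sketch, not an implementation defined by C15: the class name `RouteRequest`, the function `validate_route_request`, the Python types, and the forbidden/preferred overlap rule are assumptions for illustration, and only a few of the optional fields are shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Destination kinds as listed in the contract.
DESTINATION_KINDS = frozenset({
    "node", "egress_pool", "service_instance", "resource_target",
    "vpn_connection", "storage_scope", "control_plane_endpoint",
})


@dataclass(frozen=True)
class RouteRequest:
    # Required fields (types are assumed, the contract only names the fields).
    request_id: str
    cluster_id: str
    source_node_id: str
    source_role: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    priority_class: str
    policy_refs: tuple[str, ...]
    requested_at: datetime
    organization_id: Optional[str] = None  # required "where applicable"
    # A subset of the optional fields; the others would follow the same pattern.
    session_id: Optional[str] = None
    max_latency_ms: Optional[int] = None
    forbidden_nodes: frozenset[str] = frozenset()
    preferred_nodes: frozenset[str] = frozenset()


def validate_route_request(req: RouteRequest) -> list[str]:
    """Return structural validation errors; an empty list means well-formed."""
    errors = []
    for name in ("request_id", "cluster_id", "source_node_id", "source_role",
                 "destination_ref", "service_type", "channel_class",
                 "priority_class"):
        if not getattr(req, name):
            errors.append(f"missing {name}")
    if req.destination_kind not in DESTINATION_KINDS:
        errors.append(f"unknown destination_kind: {req.destination_kind}")
    # Assumed rule: a node cannot be both forbidden and preferred.
    if req.forbidden_nodes & req.preferred_nodes:
        errors.append("node listed as both forbidden and preferred")
    return errors
```

The record carries no peer or path selection fields, which matches the rule that adapters describe what they need and leave path choice to the Routing Engine.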

6. Route Result Contract

A route result is a signed or locally verifiable decision artifact that is valid only for a bounded time.

Required fields:

  • route_id
  • request_id
  • cluster_id
  • organization_id where applicable
  • route_class
  • channel_class
  • selected_path
  • selected_qos_class
  • score
  • valid_from
  • expires_at
  • route_epoch
  • policy_version
  • decision_reason

Selected path contains ordered logical hops:

  • source node
  • optional ingress node
  • zero or more core/relay nodes
  • optional egress/service node
  • target/service endpoint

Optional fields:

  • fallback_paths
  • shortcut_candidate
  • stickiness_key
  • drain_after
  • degraded_mode
  • constraints_applied
  • rejection_reason

Route results must be bounded by expiry, policy version, route epoch, and cluster authority state.
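The bounding rule above can be read as a single usability check that a consumer runs before acting on a route result. The sketch below is illustrative only: `RouteResult` lists the required fields with assumed types, and `LocalAuthorityView` and `is_usable` are hypothetical names for the node-local view of current epoch, policy version, and authority state.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RouteResult:
    # Required fields from the contract; types are assumptions.
    route_id: str
    request_id: str
    cluster_id: str
    route_class: str
    channel_class: str
    selected_path: tuple[str, ...]  # ordered logical hops
    selected_qos_class: str
    score: float
    valid_from: datetime
    expires_at: datetime
    route_epoch: int
    policy_version: int
    decision_reason: str
    organization_id: Optional[str] = None  # required "where applicable"


@dataclass(frozen=True)
class LocalAuthorityView:
    # Hypothetical node-local view of the state a result is bounded by.
    route_epoch: int
    policy_version: int
    authority_ok: bool  # False when partition/authority state forbids use


def is_usable(result: RouteResult, view: LocalAuthorityView, now: datetime) -> bool:
    """A result is usable only while all four bounds hold at once."""
    return (
        view.authority_ok
        and result.route_epoch == view.route_epoch
        and result.policy_version == view.policy_version
        and result.valid_from <= now < result.expires_at
    )
```

Treating any single failed bound as "not usable" keeps the check conservative; whether a stale result may still back a degraded_existing route is a separate policy decision.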

7. Channel Classes

Routing is channel-aware.

Initial channel classes:

  • control
  • input
  • render
  • cursor
  • clipboard
  • file_transfer
  • telemetry
  • vpn_packet
  • storage_fetch
  • update_fetch

Rules:

  • input and critical control prefer lowest latency and lowest jitter.
  • render prefers bandwidth and bounded jitter; stale render may be dropped.
  • cursor is latest-only and should use low-latency paths.
  • clipboard is reliable and bounded.
  • file_transfer prefers throughput but must not starve input/control/render.
  • telemetry is low priority and may be sampled or dropped.
  • vpn_packet uses adaptive QoS and bulk protection.
  • storage_fetch and update_fetch should not consume interactive reserves.
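One way to make the channel rules above machine-checkable is a small trait table keyed by channel class. The trait names (`low_latency`, `may_drop`, and so on) and the `has_trait` helper are assumptions chosen for this sketch; the table encodes only what the rules state and leaves everything else unset.

```python
from enum import Enum


class ChannelClass(Enum):
    CONTROL = "control"
    INPUT = "input"
    RENDER = "render"
    CURSOR = "cursor"
    CLIPBOARD = "clipboard"
    FILE_TRANSFER = "file_transfer"
    TELEMETRY = "telemetry"
    VPN_PACKET = "vpn_packet"
    STORAGE_FETCH = "storage_fetch"
    UPDATE_FETCH = "update_fetch"


# Hypothetical trait tags derived from the rules in this section.
CHANNEL_TRAITS = {
    ChannelClass.CONTROL: {"low_latency"},
    ChannelClass.INPUT: {"low_latency"},
    ChannelClass.RENDER: {"bandwidth", "may_drop"},
    ChannelClass.CURSOR: {"low_latency", "latest_only"},
    ChannelClass.CLIPBOARD: {"reliable", "bounded"},
    ChannelClass.FILE_TRANSFER: {"throughput", "yields_to_interactive"},
    ChannelClass.TELEMETRY: {"low_priority", "may_drop"},
    ChannelClass.VPN_PACKET: {"adaptive_qos"},
    ChannelClass.STORAGE_FETCH: {"no_interactive_reserve"},
    ChannelClass.UPDATE_FETCH: {"no_interactive_reserve"},
}


def has_trait(channel: ChannelClass, trait: str) -> bool:
    """Look up whether a channel class carries a given trait tag."""
    return trait in CHANNEL_TRAITS[channel]
```

A table like this lets later stages test QoS behavior per channel without hard-coding channel names in the scoring or scheduling logic.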

8. Route Classes

Initial route classes:

  • direct
  • single_relay
  • multi_hop
  • storage_local
  • storage_remote
  • vpn_chained
  • degraded_existing
  • unavailable

direct:

  • selected when source can safely reach destination directly
  • trust and policy must allow it

single_relay:

  • selected when one relay improves connectivity or policy requires relay

multi_hop:

  • selected when neither a direct nor a single-relay path is available, or when policy or region requires it

storage_local / storage_remote:

  • used for config/snapshot/artifact fetch decisions

vpn_chained:

  • used when a managed service or IP tunnel depends on a logical vpn_connection

degraded_existing:

  • keeps an already-authorized existing path alive while policy permits

unavailable:

  • returned on explicit denial or when no valid route exists

9. Hard Policy Checks

Hard checks run before scoring.

Reject the route when any of the following holds:

  • source node is not trusted
  • source node is not a member of the cluster
  • destination is outside cluster scope
  • cross-cluster trust is missing
  • organization scope does not match
  • role assignment does not permit the route
  • peer certificate is invalid or revoked
  • required channel is not authorized
  • partition/authority state forbids new route
  • destination node is draining or disabled and policy forbids placement
  • route would leak topology or tenant data

No score can override hard policy rejection.
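The ordering rule (hard checks first, scoring second) can be sketched as a filter that runs before any score is compared. Only three of the checks listed above are modeled, and `Candidate`, `HARD_CHECKS`, `first_rejection`, and `select_candidate` are hypothetical names for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    node_id: str
    trusted: bool
    in_cluster: bool
    cert_revoked: bool
    score: float  # soft score, assumed to be precomputed


# Each hard check returns a rejection reason, or None when the check passes.
HardCheck = Callable[[Candidate], Optional[str]]

HARD_CHECKS: Sequence[HardCheck] = (
    lambda c: None if c.trusted else "node_not_trusted",
    lambda c: None if c.in_cluster else "node_not_in_cluster",
    lambda c: "certificate_revoked" if c.cert_revoked else None,
)


def first_rejection(candidate: Candidate) -> Optional[str]:
    """Return the first hard-check rejection reason, or None if all pass."""
    for check in HARD_CHECKS:
        reason = check(candidate)
        if reason is not None:
            return reason
    return None


def select_candidate(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Score is only compared among candidates that passed every hard check."""
    allowed = [c for c in candidates if first_rejection(c) is None]
    return max(allowed, key=lambda c: c.score, default=None)
```

Because rejected candidates never reach the `max` call, a high score cannot rescue a candidate that failed a hard check.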

10. Scoring Inputs

Soft scoring inputs:

  • latency
  • jitter
  • packet loss
  • reliability
  • recent failure history
  • region distance
  • load
  • available bandwidth
  • role suitability
  • route length
  • service co-location
  • stickiness preference
  • cost preference
  • policy preference
  • health score

Scoring weights are policy-driven and may differ by channel class.

Examples:

  • input/control heavily weight latency and jitter
  • file transfer heavily weights throughput and reliability
  • VPN bulk considers QoS impact on interactive routes
  • storage fetch considers locality and replica freshness
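Channel-dependent weighting can be illustrated with a weighted sum over normalized metrics. The weight values and the `WEIGHTS` and `score_path` names below are invented for the example; the document only states that weights are policy-driven and vary by channel class. Metrics are assumed to be normalized to the range 0 to 1, where higher is better.

```python
# Hypothetical per-channel weights; real weights would come from policy.
WEIGHTS = {
    "input": {"latency": 0.5, "jitter": 0.3, "reliability": 0.15, "bandwidth": 0.05},
    "file_transfer": {"latency": 0.05, "jitter": 0.05, "reliability": 0.4, "bandwidth": 0.5},
}


def score_path(channel_class: str, metrics: dict) -> float:
    """Weighted sum of normalized metrics (0..1, higher is better)."""
    weights = WEIGHTS[channel_class]
    return sum(weight * metrics.get(name, 0.0) for name, weight in weights.items())


# Two hypothetical candidate paths with different strengths.
low_latency_path = {"latency": 0.9, "jitter": 0.9, "reliability": 0.8, "bandwidth": 0.2}
high_bandwidth_path = {"latency": 0.3, "jitter": 0.4, "reliability": 0.8, "bandwidth": 0.95}

best_for_input = max(
    (low_latency_path, high_bandwidth_path),
    key=lambda m: score_path("input", m),
)
best_for_file_transfer = max(
    (low_latency_path, high_bandwidth_path),
    key=lambda m: score_path("file_transfer", m),
)
```

With these example weights the same two paths rank in opposite order for input and for file transfer, which is the behavior the examples above describe.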

11. Route Cache Relationship

Route cache is local and bounded.

Cache key inputs:

  • cluster id
  • organization id
  • source node
  • destination kind/ref
  • service type
  • channel class
  • policy version
  • route epoch
  • stickiness key

Cache entries contain:

  • route result
  • expiry
  • score
  • last success/failure
  • backoff state
  • fallback candidates

Cache invalidation triggers:

  • policy version change
  • peer directory version change
  • trust/revocation update
  • route epoch change
  • health state change
  • repeated route failure
  • expiry

Route cache is a performance aid, not route authority.
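A bounded local cache keyed on the inputs above might look like the following sketch. `CacheKey`, `CacheEntry`, `RouteCache`, the eviction rule, and the use of float timestamps are assumptions; only two invalidation triggers are enforced on lookup here (expiry and peer directory version), while policy version and route epoch invalidate implicitly because they are part of the key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheKey:
    cluster_id: str
    organization_id: Optional[str]
    source_node: str
    destination_kind: str
    destination_ref: str
    service_type: str
    channel_class: str
    policy_version: int
    route_epoch: int
    stickiness_key: Optional[str] = None


@dataclass
class CacheEntry:
    route_id: str
    expires_at: float  # seconds on a caller-supplied clock
    peer_directory_version: int


class RouteCache:
    def __init__(self, max_entries: int = 1024) -> None:
        self._entries: dict = {}
        self._max_entries = max_entries

    def put(self, key: CacheKey, entry: CacheEntry) -> None:
        if key not in self._entries and len(self._entries) >= self._max_entries:
            # Assumed eviction rule: drop the entry closest to expiry.
            victim = min(self._entries, key=lambda k: self._entries[k].expires_at)
            del self._entries[victim]
        self._entries[key] = entry

    def get(self, key: CacheKey, now: float, peer_directory_version: int) -> Optional[CacheEntry]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stale = now >= entry.expires_at or entry.peer_directory_version != peer_directory_version
        if stale:
            del self._entries[key]  # invalidate and force a fresh decision
            return None
        return entry

    def __len__(self) -> int:
        return len(self._entries)
```

A miss always falls back to a fresh routing decision, so the cache never acts as route authority.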

12. Failover Boundaries

Failover decisions may:

  • switch from failed active path to fallback path
  • promote warm peer path
  • retry through bootstrap route for recovery
  • mark route unavailable
  • request control-plane/config refresh when reachable
  • keep degraded existing path alive if policy permits

Failover decisions must not:

  • create new cluster authority
  • bypass policy
  • add nodes
  • approve role changes
  • cross cluster boundaries without explicit trust
  • expose topology to organizations
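The first permitted action, switching to a fallback path, can be sketched as a selection over paths that were already authorized in the route result. The function name `next_path` and the representation of a path as a tuple of node ids are assumptions for this example.

```python
from typing import Optional, Sequence

Path = tuple  # ordered node ids, for example ("node-a", "relay-1", "egress-1")


def next_path(fallback_paths: Sequence[Path], failed_nodes: set) -> Optional[Path]:
    """Pick the first pre-authorized fallback that avoids every failed node.

    Returns None when no fallback qualifies, which the caller should report
    as an unavailable route rather than improvising a new path.
    """
    for path in fallback_paths:
        if not failed_nodes.intersection(path):
            return path
    return None
```

The function only chooses among paths it is given, so it cannot add nodes, cross cluster boundaries, or bypass policy on its own.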

13. Shortcut Decision Boundary

Shortcut connections are optional optimization recommendations.

A shortcut may be recommended when:

  • long-lived flow exists
  • current path latency/jitter is high
  • direct connectivity appears possible
  • trust validation succeeds
  • policy allows shortcut
  • shortcut improves latency, jitter, or bandwidth
  • fallback path remains available

Shortcut recommendation output:

  • source node
  • destination node
  • channel classes affected
  • expected improvement
  • required validation
  • expiry
  • fallback route id

C15 does not implement shortcut connections. It only defines when a future Routing Engine may recommend them.
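The recommendation conditions in this section combine with a logical AND, which a future engine could express as a single predicate. The sketch below simplifies "improves latency, jitter, or bandwidth" to latency only; `FlowObservation`, `should_recommend_shortcut`, and both threshold defaults are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlowObservation:
    flow_age_s: float
    current_latency_ms: float
    expected_shortcut_latency_ms: float
    direct_reachable: bool
    trust_validated: bool
    policy_allows_shortcut: bool
    fallback_route_id: Optional[str]  # None means no fallback is available


def should_recommend_shortcut(
    obs: FlowObservation,
    min_flow_age_s: float = 60.0,        # assumed threshold for "long-lived"
    latency_threshold_ms: float = 80.0,  # assumed threshold for "high latency"
) -> bool:
    """Every condition must hold; a missing fallback alone blocks the shortcut."""
    return (
        obs.flow_age_s >= min_flow_age_s
        and obs.current_latency_ms >= latency_threshold_ms
        and obs.direct_reachable
        and obs.trust_validated
        and obs.policy_allows_shortcut
        and obs.expected_shortcut_latency_ms < obs.current_latency_ms
        and obs.fallback_route_id is not None
    )
```

The predicate produces a recommendation only; establishing the shortcut connection remains outside C15.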

14. Service Adapter Integration

Service Adapters may ask for routes using service-neutral metadata.

Examples:

  • RDP Adapter requests route to RDP service/egress node or resource target.
  • VNC Adapter requests route to VNC target zone.
  • SSH Adapter requests route to SSH target.
  • VPN/IP tunnel service requests route through vpn_connection.
  • Storage fetch requests route to config/storage scope.

Service Adapters must not:

  • enumerate peers
  • select mesh paths
  • create relay chains
  • create shortcuts
  • implement failover policy
  • implement partition recovery
  • implement cross-cluster routing trust

The adapter consumes a route result and sends/receives through the approved data-plane boundary when runtime exists.

15. Topology Hiding

Organizations see:

  • allowed service endpoints
  • safe ingress/egress status
  • safe session/resource status
  • policy-visible route dependency names where allowed

Organizations must not see:

  • intermediate core mesh nodes
  • full peer directory
  • route cache
  • shortcut candidates
  • other organizations' route data
  • storage shard placement

Platform owners may inspect routing internals according to audited platform policy.
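A tenant-facing projection of a selected path could drop intermediate hops and node identifiers before anything leaves the platform boundary. In this sketch the `Hop` type, the role names, and the choice to expose roles without node ids are assumptions; the visibility rules themselves are the ones listed above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hop:
    node_id: str
    role: str  # assumed role names: "ingress", "core", "relay", "egress", "service"


# Roles an organization may see; core and relay hops stay hidden.
TENANT_VISIBLE_ROLES = frozenset({"ingress", "egress", "service"})


def tenant_view(path: Sequence[Hop]) -> list:
    """Project a path to tenant-safe role names, without node ids."""
    return [hop.role for hop in path if hop.role in TENANT_VISIBLE_ROLES]


def platform_view(path: Sequence[Hop]) -> list:
    """Full hop list, intended only for audited platform-owner inspection."""
    return [(hop.node_id, hop.role) for hop in path]
```

Keeping the two projections as separate functions makes it harder for tenant-facing code to receive the full hop list by accident.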

16. Degraded and Partition Behavior

In degraded mode, the Routing Engine may:

  • keep existing authorized routes alive until their TTL expires
  • use last signed snapshot for recovery
  • select fallback among already-authorized peers
  • mark route unavailable when safety cannot be proven

In degraded mode, the Routing Engine must not:

  • authorize new high-risk routes
  • mutate cluster trust
  • approve nodes
  • assign roles
  • promote partition authority automatically
  • create cross-cluster trust
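The degraded-mode rules can be sketched as a decision that only ever extends routes that were already authorized. This example is deliberately stricter than the text: it refuses every new route, not only high-risk ones. The function name `decide_in_degraded_mode`, the mapping of route ids to expiry timestamps, and the returned strings (which reuse the route class names degraded_existing and unavailable) are assumptions.

```python
def decide_in_degraded_mode(route_id: str, authorized_expiry: dict, now: float) -> str:
    """Return a route class for a route while the node is in degraded mode.

    authorized_expiry maps already-authorized route ids to their expiry time.
    """
    expires_at = authorized_expiry.get(route_id)
    if expires_at is None:
        # Never authorized before: safety cannot be proven locally.
        return "unavailable"
    if now >= expires_at:
        # TTL has passed: the existing authorization no longer applies.
        return "unavailable"
    return "degraded_existing"
```

Nothing in the function mutates trust, roles, or authority state, which keeps degraded handling read-only with respect to the cluster.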

17. Observability

Routing decisions should emit safe telemetry:

  • route selected
  • route rejected
  • rejection reason
  • route class
  • channel class
  • score bucket
  • latency/jitter/packet loss summary
  • failover count
  • fallback used
  • shortcut recommended
  • policy version
  • peer directory version
  • route epoch

Tenant-visible telemetry must hide topology.

18. Future Validation Tests

Future implementation tests must prove:

  • route request rejects wrong cluster
  • route request rejects wrong organization
  • revoked peer is not selected
  • unavailable route returns explicit result
  • cache invalidates on policy version change
  • cache invalidates on peer directory version change
  • input route prefers latency over throughput
  • file transfer route does not starve input class
  • service adapter cannot bypass routing engine
  • shortcut recommendation requires fallback path
  • degraded mode does not authorize new forbidden routes

19. C16 Preparation

C16 must define the secure node-to-node channel lifecycle that can later carry route-selected traffic.

C16 must preserve:

  • routing results are bounded and policy-scoped
  • channels are authenticated and authorized
  • trust/revocation affects active channels
  • Service Adapters remain above Fabric routing
  • no mesh packet routing starts before explicit C17

20. Result / Decision

Stage C15 defines the Fabric Routing Engine as a skeleton boundary for route requests, route results, scoring, cache relationship, failover, shortcut recommendations, topology hiding, and Service Adapter integration.

Decisions:

  • Routing belongs to Fabric, not Service Adapters.
  • Route requests/results are logical contracts, not packet forwarding.
  • Hard policy checks precede scoring.
  • Route cache is local, bounded, and non-authoritative.
  • Routing is channel-aware and QoS-aware.
  • Shortcut connections are future optional recommendations, not C15 runtime.
  • C16 must define secure node-to-node channels before mesh routing runtime.

No code, migration, API, runtime, RDP, data-plane, mesh, VPN, relay, or service workload behavior is changed by C15.