DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Monitor LLM routing with the Kubernetes Inference Extension

If you serve LLMs on Kubernetes without inference-aware routing, your load balancer is likely wasting inference capacity. Generic HTTP traffic management routes requests blindly, assuming the backends in your cluster are interchangeable. But your model-serving backends are stateful and unevenly prepared to handle any given request. As a result, requests often land on a backend that is not the one best suited to respond.

Migrating to Gateway API gives you a more capable foundation for traffic management and opens the door to inference-aware routing. The Kubernetes Gateway API’s Inference Extension routes requests based on backend serving state, which tends to make better use of cluster capacity and reduce request latency.
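
To make the idea of routing on backend serving state concrete, here is a minimal sketch in Python. It is not the Inference Extension's actual scoring logic: the `Backend` fields, the `score` function, and its weights are all hypothetical, chosen only to show how a router could prefer a less loaded model server instead of treating backends as interchangeable.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical snapshot of one model server's serving state."""
    name: str
    queue_depth: int      # requests waiting on this server (assumed signal)
    kv_cache_util: float  # fraction of KV cache in use, 0.0 to 1.0 (assumed signal)


def score(backend: Backend, queue_weight: float = 1.0, cache_weight: float = 10.0) -> float:
    """Lower is better. The weights are illustrative assumptions, not tuned values."""
    return queue_weight * backend.queue_depth + cache_weight * backend.kv_cache_util


def pick_backend(backends: list[Backend]) -> Backend:
    """Choose the backend with the lowest load score."""
    return min(backends, key=score)


backends = [
    Backend("pod-a", queue_depth=8, kv_cache_util=0.90),
    Backend("pod-b", queue_depth=1, kv_cache_util=0.35),
    Backend("pod-c", queue_depth=3, kv_cache_util=0.60),
]

chosen = pick_backend(backends)
print(chosen.name)  # pod-b: shortest queue and most free KV cache
```

A round-robin balancer would send a third of the traffic to `pod-a` regardless of its backlog; a state-aware picker like this one avoids it until its queue drains.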

In this post, we’ll look at how the Inference Extension works, the routing strategies it enables, and the signals you can use to monitor whether inference-aware routing is behaving as intended in production.


https://www.datadoghq.com/blog/llm-routing-kubernetes-inference-extension/
In incidents, swarming is a feature, not a bug

Spontaneous swarming of responders might seem like a nuisance that breaks our tidy mental models of incident response, but it's actually very powerful.


https://greatcircle.com/blog/2026/03/24/swarming-is-a-feature