Machine Learning World
The best of Machine Learning World
@devs_world - the best materials for developers

Our fund's Instagram for helping homeless animals: https://www.instagram.com/ukraineanimalhelp/

Contacts: @anikishaev | creotiv@gmail.com
Forwarded from Devs World
Modern generative #AI and large language model (#LLM) services create unique traffic-routing challenges on #Kubernetes. Unlike typical short-lived, stateless web requests, LLM inference sessions are often long-running, resource-intensive, and partially stateful. For example, a single #GPU-backed model server may keep multiple inference sessions active and maintain in-memory token caches.

Traditional load balancers that route by HTTP path or round-robin lack the specialized capabilities these workloads need. They also don’t account for model identity or request criticality (e.g., interactive chat vs. batch jobs). Organizations often patch together ad-hoc solutions, but a standardized approach has been missing.
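
To make the gap concrete, here is a toy Python sketch contrasting plain round-robin with a picker that looks at model identity, queue depth, and criticality. It is an illustration only, not the extension's actual API or algorithm: `Endpoint`, `Request`, `pick_endpoint`, and the `max_queue_for_batch` threshold are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    models: frozenset    # models this server currently has loaded (assumed known)
    queue_depth: int     # pending requests on this server (assumed metric)


@dataclass(frozen=True)
class Request:
    model: str
    critical: bool       # True for interactive chat, False for batch jobs


def pick_endpoint(req: Request, endpoints: list,
                  max_queue_for_batch: int = 4) -> Optional[Endpoint]:
    """Pick the least-loaded endpoint that serves the requested model.

    Non-critical requests are shed (None) when even the best candidate
    is at or above the hypothetical queue threshold.
    """
    candidates = [e for e in endpoints if req.model in e.models]
    if not candidates:
        return None
    best = min(candidates, key=lambda e: e.queue_depth)
    if not req.critical and best.queue_depth >= max_queue_for_batch:
        return None
    return best


ENDPOINTS = [
    Endpoint("gpu-a", frozenset({"chat"}), queue_depth=5),
    Endpoint("gpu-b", frozenset({"chat"}), queue_depth=1),
    Endpoint("gpu-c", frozenset({"summarize"}), queue_depth=6),
]

if __name__ == "__main__":
    # Round-robin ignores both the model and the load: first pick is gpu-a.
    round_robin = cycle(ENDPOINTS)
    print("round-robin:", next(round_robin).name)

    # The aware picker sends interactive chat to the least-loaded chat server.
    print("aware:", pick_endpoint(Request("chat", critical=True), ENDPOINTS).name)
```

With these made-up numbers, round-robin would send a chat request to the busy `gpu-a`, while the aware picker chooses `gpu-b` and declines a batch `summarize` request because its only server is already backed up.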

Enter the new #Gateway API Inference Extension for #K8S:

https://kubernetes.io/blog/2025/06/05/introducing-gateway-api-inference-extension/