DevOps&SRE Library

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Imagine this: you click play on Netflix on a Friday night, and behind the scenes hundreds of containers spring into action within a few seconds to answer your call. At Netflix, scaling containers efficiently is critical to delivering a seamless streaming experience to millions of members worldwide. To stay responsive at this scale, we modernized our container runtime, only to hit a surprising bottleneck: the CPU architecture itself.

Let us walk you through the story of how we diagnosed the problem and what we learned about scaling containers at the hardware level.

https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac


From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

How a complex, large-scale migration to an in-house observability platform led to superior tooling, consistent data, and a fundamental reset of the developer experience.

https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3


5 Ways That Resilience Can’t Be Automated

The most dangerous thing I’ve seen in engineering isn’t a failed system. It’s a team that thinks their system can’t fail.

It’s not just about adding and adapting tooling. The leader who believes a new $30pp automation tool will resolve deep systemic issues is overlooking the most valuable resource already sitting inside their organisation: their people.

At Uptime Labs, we come back to the same principle repeatedly – the true source of resilience is people. Not because it’s a neat slogan, but because the evidence keeps pointing there. Below are five reasons why resilience can’t be automated away from people entirely – hope you enjoy.

https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated


JSON in a database often ends up as a compromise: convenient to store, but hard to read and index quickly.

Without an understanding of JSONB and its operators, queries start to slow down and the data structure starts to sprawl.

If you work with dynamic data and want to do so without losing performance, join us.

In this open lesson we will cover:
- how JSONB works inside PostgreSQL
- which indexes actually speed up queries
- how to write SQL that performs on large data volumes
- practical scenarios: configs, events, and generating JSON responses directly in the database

📌 We meet on May 5 at 20:00 MSK; registration is open: https://vk.cc/cXd6ae

The lesson is held ahead of the start of the course "PostgreSQL for Database Administrators and Developers". There is a 15% discount for early booking of the course; ask a manager for details.

Advertisement. OTUS Online Education LLC (ООО «Отус онлайн‑образование»), OGRN 1177746618576, erid: 2Vtzqwgfv6j


pgque

PgQue brings back PgQ — one of the longest-running Postgres queue architectures in production — in a form that runs on any Postgres platform, managed providers included.

PgQ was designed at Skype to run messaging for hundreds of millions of users, and it ran on large self-managed Postgres deployments for over a decade. Standard PgQ depends on a C extension (pgq) and an external daemon (pgqd), neither of which runs on most managed Postgres providers.

PgQue rebuilds that battle-tested engine in pure PL/pgSQL, so the zero-bloat queue pattern works anywhere you can run SQL — without adding another distributed system to your stack.

The anti-extension. Pure SQL + PL/pgSQL on any Postgres 14+ — including RDS, Aurora, Cloud SQL, AlloyDB, Supabase, Neon, and most other managed providers. No C extension, no shared_preload_libraries, no provider approval, no restart.

https://github.com/NikolayS/pgque


Hidden Infrastructure Challenges in Distributed LLM Inference on Kubernetes

Chapter 1: A networking story

https://substack.com/home/post/p-188586336


Solve DevOps, SRE, and FinOps tasks with a cloud AI assistant 💬

A major update from Cloud.ru. What's new:

1. Several VMs in different configurations at once

The AI assistant in the cloud can now create several virtual machines and then manage them on command, for example adding or removing disks, changing configurations, and performing other routine operations.

2. Three new scenarios

▶ DevOps agent: can deploy and maintain PostgreSQL, Kafka, WordPress, GitLab, and other popular services from a text prompt.

▶ SRE agent: sets up monitoring and alerting and helps investigate incidents.

▶ FinOps agent: finds forgotten or unused VMs and suggests deleting them to eliminate pointless spending. It can also show the most expensive resources and lets you compare spending across periods.

👉 Try it


Simplifying Model Serving with Kubernetes and Ray: Inside DoubleVerify’s ML Platform

https://medium.com/doubleverify-engineering/simplifying-model-serving-with-kubernetes-and-ray-inside-doubleverifys-ml-platform-78b33faa9e91


chainplane

A Kubernetes operator for deploying and managing blockchain full nodes. Supports 102 chains with built-in health monitoring, snapshot bootstrapping, and automatic recovery.

https://github.com/tazhate/chainplane


Lazy-Pulling Container Images: A Deep Dive Into OCI Seekability

From DEFLATE dependency chains to FUSE mounts: how a few competing approaches make container layers randomly accessible, and what they all require you to change on every node.

https://blog.zmalik.dev/p/lazy-pulling-container-images-a-deep


Building eBPF-Based Bandwidth Limiting in AWS Network Policy Agent — Why Vibe Coding Isn’t Enough

https://medium.com/@jayanthvn_55441/building-ebpf-based-bandwidth-limiting-in-aws-network-policy-agent-why-vibe-coding-isnt-enough-f8c6681aa278


Hardware-Backed TLS Certificates with cert-manager and YubiHSM 2

Your cert-manager CA key is one kubectl get secret away from being stolen. It's a base64-encoded blob sitting in etcd, and anyone with the right RBAC can read it, copy it, and use it to sign certificates for any service in your cluster.

https://charles.dev/blog/yubihsm-cert-manager
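
The point made in the post above, that a Secret's key material is only base64-encoded, can be illustrated with a short Python sketch. It assumes the JSON shape of a `kubernetes.io/tls` Secret as returned by `kubectl get secret -o json`; the secret name `demo-ca` and the key contents are made-up placeholders, not taken from the article.

```python
import base64
import json

# Hypothetical stand-in for the output of `kubectl get secret demo-ca -o json`.
# The name "demo-ca" and the key contents are placeholders for illustration.
fake_pem = b"-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----\n"
secret_json = json.dumps({
    "apiVersion": "v1",
    "kind": "Secret",
    "type": "kubernetes.io/tls",
    "metadata": {"name": "demo-ca"},
    "data": {"tls.key": base64.b64encode(fake_pem).decode("ascii")},
})


def extract_private_key(raw_secret: str) -> bytes:
    """Return the decoded `tls.key` field of a TLS Secret given as JSON text."""
    secret = json.loads(raw_secret)
    # Values under `data` are base64-encoded, not encrypted: decoding needs no key.
    return base64.b64decode(secret["data"]["tls.key"])


recovered = extract_private_key(secret_json)
print(recovered.decode("ascii").splitlines()[0])
```

Anyone who can read the Secret can run the equivalent of `extract_private_key`, which is why keeping the key in hardware changes the threat model rather than just adding a layer of encoding.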


Mastering KEDA on GKE: A Deep Dive into Event-Driven Autoscaling

Event Driven Scaling and How to Fix It When It Breaks

https://saeed.hashnode.dev/keda-on-gke


ing-switch: Migrate from Ingress NGINX to Traefik or Gateway API in Minutes, Not Days

https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days


warden

The open-source egress gateway for AI agents — every API call is authenticated, authorized, and audited. No credentials ever reach the agent.

https://github.com/stephnangue/warden


aibrix

Cost-efficient and pluggable infrastructure components for GenAI inference

https://github.com/vllm-project/aibrix


kloudlite

Kloudlite provides cloud-based development workspaces with live service connectivity. Think Telepresence meets cloud IDEs — but with per-developer environment ownership, instant environment switching, and cross-team collaboration built in.

https://github.com/kloudlite/kloudlite


cpg

Cilium Policy Generator -- because writing CiliumNetworkPolicies by hand in a default-deny cluster is nobody's idea of a good Friday night.

https://github.com/SoulKyu/cpg


x509-certificate-exporter

A Prometheus exporter for certificates focusing on expiration monitoring, written in Go. Designed to monitor Kubernetes clusters from inside, it can also be used as a standalone exporter.

https://github.com/enix/x509-certificate-exporter
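
At its core, expiration monitoring is arithmetic on a certificate's notAfter timestamp. The Python sketch below is not the exporter's code (the exporter is written in Go); the function names and the 30-day warning threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(not_after: datetime, now: datetime) -> float:
    """Seconds remaining before the certificate expires (negative once expired)."""
    return (not_after - now).total_seconds()


def needs_renewal(not_after: datetime, now: datetime,
                  threshold: timedelta = timedelta(days=30)) -> bool:
    """True when the certificate expires within the (assumed) warning threshold."""
    return seconds_until_expiry(not_after, now) <= threshold.total_seconds()


# Fixed, made-up timestamps so the example is deterministic.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2025, 1, 21, tzinfo=timezone.utc)

print(seconds_until_expiry(not_after, now) / 86400)  # days remaining
print(needs_renewal(not_after, now))
```

In a Prometheus setup the same comparison would typically live in an alerting rule rather than in application code; the sketch only shows the calculation itself.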


sish

Open source SSH tunneling for HTTP(S), WS(S), TCP, aliases, and SNI.

If you like the simplicity of serveo/ngrok-style sharing but want to use plain SSH and run your own infrastructure, sish is built for that.

https://github.com/antoniomika/sish
