DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN (Roskomnadzor) registration: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Imagine this — you click play on Netflix on a Friday night and behind the scenes hundreds of containers spring to action in a few seconds to answer your call. At Netflix, scaling containers efficiently is critical to delivering a seamless streaming experience to millions of members worldwide. To keep up with responsiveness at this scale, we modernized our container runtime, only to hit a surprising bottleneck: the CPU architecture itself.

Let us walk you through the story of how we diagnosed the problem and what we learned about scaling containers at the hardware level.


https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac
From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

How a complex, large-scale migration to an in-house observability platform led to superior tooling, consistent data, and a fundamental reset of the developer experience.


https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3
5 Ways That Resilience Can’t Be Automated

The most dangerous thing I’ve seen in engineering isn’t a failed system. It’s a team that thinks their system can’t fail.

It’s not just about adding and adapting tooling. The leader who believes a new $30pp automation tool will resolve deep systemic issues is overlooking the most valuable resource already sitting inside their organisation: their people.

At Uptime Labs, we come back to the same principle repeatedly – the true source of resilience is people. Not because it’s a neat slogan, but because the evidence keeps pointing there. Below are five reasons why resilience can’t be automated away from people entirely – hope you enjoy.


https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated
JSON in a database often ends up as a compromise: convenient to store, but hard to read quickly and to index.

Without an understanding of JSONB and its operators, queries start to slow down and the data structure starts to sprawl.

If you work with dynamic data and want to do so without losing performance, join us.

In the open lesson we will cover:
- how JSONB works inside PostgreSQL
- which indexes actually speed up queries
- how to write SQL that performs well on large data volumes
- practical scenarios: configs, events, and generating JSON responses directly in the database

📌 We meet on May 5 at 20:00 MSK; registration is open: https://vk.cc/cXd6ae

The lesson is held ahead of the start of the course "PostgreSQL for Database Administrators and Developers". There is a 15% discount for early booking of the course; ask a manager for details.

Advertisement. ООО «Отус онлайн‑образование», OGRN 1177746618576, erid: 2Vtzqwgfv6j
pgque

PgQue brings back PgQ — one of the longest-running Postgres queue architectures in production — in a form that runs on any Postgres platform, managed providers included.

PgQ was designed at Skype to run messaging for hundreds of millions of users, and it ran on large self-managed Postgres deployments for over a decade. Standard PgQ depends on a C extension (pgq) and an external daemon (pgqd), neither of which runs on most managed Postgres providers.

PgQue rebuilds that battle-tested engine in pure PL/pgSQL, so the zero-bloat queue pattern works anywhere you can run SQL — without adding another distributed system to your stack.

The anti-extension. Pure SQL + PL/pgSQL on any Postgres 14+ — including RDS, Aurora, Cloud SQL, AlloyDB, Supabase, Neon, and most other managed providers. No C extension, no shared_preload_libraries, no provider approval, no restart.


https://github.com/NikolayS/pgque
Hidden Infrastructure Challenges in Distributed LLM Inference on Kubernetes

Chapter 1: A networking story


https://substack.com/home/post/p-188586336
Solve DevOps, SRE, and FinOps tasks with a cloud AI assistant 💬

A major update from Cloud.ru. What's new:

1⃣ Several VMs in different configurations at once
The in-cloud AI assistant can now create multiple virtual machines and then manage them on command: for example, add or remove disks, change configurations, and perform other routine operations.

2⃣ Three new scenarios

DevOps agent: deploys and maintains PostgreSQL, Kafka, WordPress, GitLab, and other popular services from a text prompt.

SRE agent: sets up monitoring and alerting, and helps investigate incidents.

FinOps agent: finds forgotten or unused VMs and suggests deleting them to eliminate wasted spend. It can also show the most expensive resources and compare spending across periods.

👉 Try it
chainplane

A Kubernetes operator for deploying and managing blockchain full nodes. Supports 102 chains with built-in health monitoring, snapshot bootstrapping, and automatic recovery.


https://github.com/tazhate/chainplane
Lazy-Pulling Container Images: A Deep Dive Into OCI Seekability

From DEFLATE dependency chains to FUSE mounts: how a few competing approaches make container layers randomly accessible, and what they all require you to change on every node.


https://blog.zmalik.dev/p/lazy-pulling-container-images-a-deep
Building eBPF-Based Bandwidth Limiting in AWS Network Policy Agent — Why Vibe Coding Isn’t Enough

https://medium.com/@jayanthvn_55441/building-ebpf-based-bandwidth-limiting-in-aws-network-policy-agent-why-vibe-coding-isnt-enough-f8c6681aa278
Hardware-Backed TLS Certificates with cert-manager and YubiHSM 2

Your cert-manager CA key is one kubectl get secret away from being stolen. It's a base64-encoded blob sitting in etcd, and anyone with the right RBAC can read it, copy it, and use it to sign certificates for any service in your cluster.


https://charles.dev/blog/yubihsm-cert-manager
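
To illustrate why a base64-encoded Secret offers no protection once it can be read, here is a minimal Python sketch. It assumes a manifest shaped like the JSON that `kubectl get secret <name> -o json` prints; the Secret name, the namespace, and the key material are hypothetical placeholders, and `decode_secret_field` is an illustrative helper, not part of cert-manager or Kubernetes.

```python
import base64
import json

# Placeholder key material: not a real private key.
PLACEHOLDER_KEY = (
    b"-----BEGIN PRIVATE KEY-----\n"
    b"(placeholder, not a real key)\n"
    b"-----END PRIVATE KEY-----\n"
)

# Hypothetical manifest, shaped like `kubectl get secret <name> -o json` output.
# The name and namespace below are assumptions made for this example only.
secret_json = json.dumps({
    "apiVersion": "v1",
    "kind": "Secret",
    "type": "kubernetes.io/tls",
    "metadata": {"name": "example-ca-key-pair", "namespace": "cert-manager"},
    "data": {"tls.key": base64.b64encode(PLACEHOLDER_KEY).decode("ascii")},
})


def decode_secret_field(manifest_json: str, field: str) -> bytes:
    """Return the raw bytes stored under `field` in a Secret's `data` map."""
    manifest = json.loads(manifest_json)
    # base64 is an encoding, not encryption: no key is needed to reverse it.
    return base64.b64decode(manifest["data"][field])


recovered = decode_secret_field(secret_json, "tls.key")
print(recovered.decode("ascii"))
```

Anyone whose RBAC permissions allow reading the Secret can perform the same decoding step, which is the exposure the article proposes to remove by keeping the key in hardware.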
Mastering KEDA on GKE: A Deep Dive into Event-Driven Autoscaling

Event Driven Scaling and How to Fix It When It Breaks


https://saeed.hashnode.dev/keda-on-gke
ing-switch: Migrate from Ingress NGINX to Traefik or Gateway API in Minutes, Not Days

https://blog.kubesimplify.com/ing-switch-migrate-from-ingress-nginx-to-traefik-or-gateway-api-in-minutes-not-days
warden

The open-source egress gateway for AI agents — every API call is authenticated, authorized, and audited. No credentials ever reach the agent.


https://github.com/stephnangue/warden
aibrix

Cost-efficient and pluggable Infrastructure components for GenAI inference


https://github.com/vllm-project/aibrix
kloudlite

Kloudlite provides cloud-based development workspaces with live service connectivity. Think Telepresence meets cloud IDEs — but with per-developer environment ownership, instant environment switching, and cross-team collaboration built in.


https://github.com/kloudlite/kloudlite
cpg

Cilium Policy Generator -- because writing CiliumNetworkPolicies by hand in a default-deny cluster is nobody's idea of a good Friday night.


https://github.com/SoulKyu/cpg
x509-certificate-exporter

A Prometheus exporter for certificates focusing on expiration monitoring, written in Go. Designed to monitor Kubernetes clusters from inside, it can also be used as a standalone exporter.


https://github.com/enix/x509-certificate-exporter
sish

Open source SSH tunneling for HTTP(S), WS(S), TCP, aliases, and SNI.

If you like the simplicity of serveo/ngrok-style sharing but want to use plain SSH and run your own infrastructure, sish is built for that.


https://github.com/antoniomika/sish