DevOps&SRE Library

Inside Adobe's OpenTelemetry pipeline: simplicity at scale

As part of an ongoing series, the Developer Experience SIG interviews organizations about their real-world OpenTelemetry Collector deployments to share practical lessons with the broader community. This post features Adobe, a global software company whose observability team has built an OpenTelemetry-based telemetry pipeline designed for simplicity at massive scale, with thousands of collectors running per signal type across the company’s infrastructure.

https://opentelemetry.io/blog/2026/devex-adobe

2.99K views · 15:01

traceway

Traceway is a self-hosted observability platform that ingests OpenTelemetry traces and metrics, groups exceptions automatically, and gives you endpoint performance, distributed tracing, and alerts — all in a single binary. No OTel Collector or separate time-series database required.

https://github.com/tracewayapp/traceway

2.91K views · 07:04

lynxdb

Log analytics in a single binary. No dependencies. Lynx Flow query language.

https://github.com/lynxbase/lynxdb

2.66K views · 08:02

🎥 Webinar: "Ansible: Quick Start"

What we'll cover:
- How Ansible works: architecture, principles, and core components.
- Setting up Ansible and running basic playbooks to automate routine tasks.
- Fundamentals of writing YAML playbooks: commands, tasks, modules, and variables.
- Practical ways to automate server configuration and application deployment.
- Best practices for change management in DevOps processes.

What you'll get:
- Learn the basic capabilities of Ansible and start using it confidently in your work.
- Try automating routine tasks, reducing errors, and improving productivity.
- Use tools for getting automation up and running quickly.

👉 To take part, register here: https://vk.cc/cWZiNv

🎁 All webinar participants will receive special terms for the full course "DevOps Practices and Tools"

Advertisement. Otus Online Education LLC, OGRN 1177746618576, www.otus.ru, erid: 2VtzqvKvo9H

3.23K views · 09:02

cardamon

Cardamon is a metric auditor for Prometheus. It identifies metrics that exist in your TSDB but are never actually queried by dashboards, alerting rules, recording rules, or any other consumer. You can then generate Prometheus drop rules to remove them and reduce storage needs.
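The idea behind this kind of audit can be illustrated with a minimal Python sketch. This is not Cardamon's implementation or API: the function names and the sample metric lists are hypothetical, and the only real-world format used is Prometheus's `metric_relabel_configs` entry with `action: drop`. It assumes you already have the set of stored metric names and the set of names referenced by consumers.

```python
def unused_metrics(stored, queried):
    """Return metric names that are stored but referenced by no consumer."""
    return sorted(set(stored) - set(queried))


def drop_rule(metric_names):
    """Render a metric_relabel_configs entry that drops the given metrics.

    Prometheus metric names contain only letters, digits, '_' and ':',
    so they can be joined into a regex alternation without escaping.
    """
    if not metric_names:
        return ""
    regex = "|".join(metric_names)
    return (
        "metric_relabel_configs:\n"
        "  - source_labels: [__name__]\n"
        f"    regex: '{regex}'\n"
        "    action: drop\n"
    )


if __name__ == "__main__":
    # Hypothetical inputs: names found in the TSDB and names used by
    # dashboards, alerting rules, and recording rules.
    stored = ["http_requests_total", "go_gc_duration_seconds", "build_info"]
    queried = {"http_requests_total"}
    print(drop_rule(unused_metrics(stored, queried)))
```

In a real setup the rendered entry would go under a scrape job in the Prometheus configuration. Review any generated list before applying it, since dropped series cannot be queried later.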

https://github.com/dominikhei/cardamon

3.31K views · 15:02

versitygw

Versity Gateway is a simple-to-use tool for seamless inline translation between AWS S3 object commands and storage systems. It bridges the gap between S3-reliant applications and other storage systems, enabling enhanced compatibility and integration while offering exceptional scalability.

https://github.com/versity/versitygw

3.23K views · 07:03

Terragrunt 1.0 Released!

After nearly a decade of development, over 900 releases, and tens of millions of infrastructure deployments by platform teams, today we're happy to announce that Terragrunt 1.0 is officially here.

https://www.gruntwork.io/blog/terragrunt-1-0-released

3.39K views · 15:04

Little Snitch for Linux

Every time an application on your computer opens a network connection, it does so quietly, without asking. Little Snitch for Linux makes that activity visible and gives you the option to do something about it. You can see exactly which applications are talking to which servers, block the ones you didn't invite, and keep an eye on traffic history and data volumes over time.

https://obdev.at/products/littlesnitch-linux/index.html

3.28K views · 07:02

kumo

A lightweight AWS service emulator written in Go. Works as both a CI/CD testing tool and a local development server with optional data persistence.

https://github.com/sivchari/kumo

3.06K views · 15:00

Mount Mayhem at Netflix: Scaling Containers on Modern CPUs

Imagine this — you click play on Netflix on a Friday night and behind the scenes hundreds of containers spring to action in a few seconds to answer your call. At Netflix, scaling containers efficiently is critical to delivering a seamless streaming experience to millions of members worldwide. To keep up with responsiveness at this scale, we modernized our container runtime, only to hit a surprising bottleneck: the CPU architecture itself.

Let us walk you through the story of how we diagnosed the problem and what we learned about scaling containers at the hardware level.

https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac

3.08K views · 07:04

From vendors to vanguard: Airbnb’s hard-won lessons in observability ownership

How a complex, large-scale migration to an in-house observability platform led to superior tooling, consistent data, and a fundamental reset of the developer experience.

https://medium.com/airbnb-engineering/from-vendors-to-vanguard-airbnbs-hard-won-lessons-in-observability-ownership-3811bf6c1ac3

2.83K views · 15:02

5 Ways That Resilience Can’t Be Automated

The most dangerous thing I’ve seen in engineering isn’t a failed system. It’s a team that thinks their system can’t fail.

It’s not just about adding and adapting tooling. The leader who believes a new $30pp automation tool will resolve deep systemic issues is overlooking the most valuable resource already sitting inside their organisation: their people.

At Uptime Labs, we come back to the same principle repeatedly – the true source of resilience is people. Not because it’s a neat slogan, but because the evidence keeps pointing there. Below are five reasons why resilience can’t be automated away from people entirely – hope you enjoy.

https://uptimelabs.io/articles/5-ways-that-resilience-cant-be-automated

2.59K views · 07:03

JSON in a database often ends up as a compromise: convenient to store, but hard to read and index quickly.

Without an understanding of JSONB and its operators, queries start to slow down and the data structure starts to sprawl.

If you work with dynamic data and want to do so without losing performance, join us.

In this open lesson we'll cover:
- how JSONB works inside PostgreSQL
- which indexes actually speed up queries
- how to write SQL that performs on large data volumes
- practical scenarios: configs, events, and generating JSON responses directly in the database

📌 We meet on May 5 at 20:00 MSK; registration is open: https://vk.cc/cXd6ae

The lesson takes place ahead of the start of the course "PostgreSQL for Database Administrators and Developers". Early booking of the course comes with a 15% discount; ask a manager for details.

Advertisement. Otus Online Education LLC, OGRN 1177746618576, erid: 2Vtzqwgfv6j

2.79K views · 09:04

pgque

PgQue brings back PgQ — one of the longest-running Postgres queue architectures in production — in a form that runs on any Postgres platform, managed providers included.

PgQ was designed at Skype to run messaging for hundreds of millions of users, and it ran on large self-managed Postgres deployments for over a decade. Standard PgQ depends on a C extension (pgq) and an external daemon (pgqd), neither of which runs on most managed Postgres providers.

PgQue rebuilds that battle-tested engine in pure PL/pgSQL, so the zero-bloat queue pattern works anywhere you can run SQL — without adding another distributed system to your stack.

The anti-extension. Pure SQL + PL/pgSQL on any Postgres 14+ — including RDS, Aurora, Cloud SQL, AlloyDB, Supabase, Neon, and most other managed providers. No C extension, no shared_preload_libraries, no provider approval, no restart.

https://github.com/NikolayS/pgque

3.01K views · 15:01

Hidden Infrastructure Challenges in Distributed LLM Inference on Kubernetes

Chapter 1: A networking story

https://substack.com/home/post/p-188586336

2.59K views · 07:03

Solve DevOps, SRE, and FinOps tasks with a cloud AI assistant 💬

A major update from Cloud.ru. What's new:

1⃣ Several VMs in different configurations at once

The AI assistant in the cloud can now create several virtual machines and then manage them on command: for example, adding or removing disks, changing configurations, and performing other day-to-day operations.

2⃣ Three new scenarios

▶ DevOps agent: deploys and maintains PostgreSQL, Kafka, WordPress, GitLab, and other popular services from a text prompt.

▶ SRE agent: sets up monitoring and alerting and helps investigate incidents.

▶ FinOps agent: finds forgotten or unused VMs and suggests deleting them to eliminate wasted spend. It can also show the most expensive resources and lets you compare spending across different periods.

👉 Try it

3.3K views · 09:01

Simplifying Model Serving with Kubernetes and Ray: Inside DoubleVerify’s ML Platform

https://medium.com/doubleverify-engineering/simplifying-model-serving-with-kubernetes-and-ray-inside-doubleverifys-ml-platform-78b33faa9e91

3.24K views · 15:02

chainplane

A Kubernetes operator for deploying and managing blockchain full nodes. Supports 102 chains with built-in health monitoring, snapshot bootstrapping, and automatic recovery.

https://github.com/tazhate/chainplane

3.23K views · 16:02

Lazy-Pulling Container Images: A Deep Dive Into OCI Seekability

From DEFLATE dependency chains to FUSE mounts: how few competing approaches make container layers randomly accessible, and what they all require you to change on every node.

https://blog.zmalik.dev/p/lazy-pulling-container-images-a-deep

3.28K views · 07:03

Building eBPF-Based Bandwidth Limiting in AWS Network Policy Agent — Why Vibe Coding Isn’t Enough

https://medium.com/@jayanthvn_55441/building-ebpf-based-bandwidth-limiting-in-aws-network-policy-agent-why-vibe-coding-isnt-enough-f8c6681aa278

3.39K views · 15:05
