DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

Roskomnadzor (RKN) registration: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
renovate

Renovate is an automated dependency update tool. It helps to update dependencies in your code without needing to do it manually. When Renovate runs on your repo, it looks for references to dependencies (both public and private) and, if there are newer versions available, Renovate can create pull requests to update your versions automatically.
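The core loop described here (find dependency references, check for newer versions, propose an update) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Renovate's implementation: it reads a pip-style `name==version` manifest, and the `LATEST` table stands in for the package registries Renovate actually queries. The package names and version numbers are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stand-in for a package registry: name -> latest published version.
LATEST: Dict[str, str] = {"libfoo": "1.5.0", "libbar": "2.0.1", "libbaz": "0.10.0"}

# Matches pinned lines such as "libfoo==1.4.0"; comments and unpinned names are ignored.
PIN = re.compile(r"^\s*([A-Za-z0-9_.-]+)\s*==\s*([0-9]+(?:\.[0-9]+)*)\s*$")


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '0.10.0' into (0, 10, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def find_updates(manifest: str, latest: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, current, newer) for every pinned dependency that is out of date."""
    updates = []
    for line in manifest.splitlines():
        match = PIN.match(line)
        if not match:
            continue  # blank line, comment, or unpinned requirement
        name, current = match.group(1), match.group(2)
        newest = latest.get(name.lower())
        if newest and parse_version(newest) > parse_version(current):
            updates.append((name, current, newest))
    return updates


if __name__ == "__main__":
    manifest = "libfoo==1.4.0\nlibbar==2.0.1\n# dev tools\nlibbaz==0.9.2\n"
    for name, cur, new in find_updates(manifest, LATEST):
        # A bot like Renovate would open a pull request for each such update.
        print(f"update {name} from {cur} to {new}")
```

Comparing parsed tuples matters: as plain strings, "0.10.0" sorts before "0.9.2", so a naive check would miss that update.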


https://github.com/renovatebot/renovate
walrus

Walrus is a distributed message streaming platform built on a high-performance log storage engine. It provides fault-tolerant streaming with automatic leadership rotation, segment-based partitioning, and Raft consensus for metadata coordination.
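"Segment-based partitioning" generally means an append-only log is split into fixed-size segments, so a record's offset tells you which segment holds it. The sketch below illustrates only that idea, with a hypothetical `SegmentedLog` class kept in memory; it does not reflect Walrus's storage engine, and it leaves out leadership rotation and Raft entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentedLog:
    """Hypothetical in-memory append-only log split into fixed-capacity segments."""

    segment_size: int = 3  # records per segment; tiny so rollover is easy to see
    segments: List[List[bytes]] = field(default_factory=lambda: [[]])

    def __post_init__(self) -> None:
        if self.segment_size < 1:
            raise ValueError("segment_size must be at least 1")

    def append(self, record: bytes) -> int:
        """Append a record and return its global offset."""
        if len(self.segments[-1]) >= self.segment_size:
            self.segments.append([])  # roll over: the full segment is sealed, a new one opens
        self.segments[-1].append(record)
        # Every segment except the last is full, so the offset follows directly.
        return (len(self.segments) - 1) * self.segment_size + len(self.segments[-1]) - 1

    def read(self, offset: int) -> bytes:
        """Locate a record: segment index = offset // size, slot = offset % size."""
        segment, slot = divmod(offset, self.segment_size)
        return self.segments[segment][slot]


if __name__ == "__main__":
    log = SegmentedLog()
    offsets = [log.append(f"msg-{i}".encode()) for i in range(7)]
    print(offsets, [len(s) for s in log.segments], log.read(4))
```

Because sealed segments never change, they are a convenient unit for tasks such as retention or moving data between nodes; how Walrus uses them is described in its repository.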


https://github.com/nubskr/walrus
lazygit

Simple terminal UI for git commands


https://github.com/jesseduffield/lazygit
What to read or watch over lunch so you spend the time usefully without overloading yourself?

No joke, the AvitoTech team's channel is our lifesaver. See for yourself. There they:
- discuss complex technical topics in plain language;
- share expertise and real work cases from Avito;
- make useful connections at meetups and webinars (with plenty of announcements for future interns, too);
- compile reading lists of useful literature for managers.

Subscribe via the link and drop by at lunchtime.
trow

Image management and caching for Kubernetes.

We're building a small registry to make image management in Kubernetes easy. The Trow Registry runs inside the cluster using very few resources and is simple to set up so that it caches every image.


https://github.com/Trow-Registry/trow
How to cut Managed K8S costs in Yandex Cloud almost in half

Kirill Mukhin, a DevOps engineer at the sports media outlet Sports.ru, explained how the company moved part of its Yandex Cloud cluster to preemptible VMs and wrote its own k8s operator that carefully drains nodes without breaking production.

In the article:
- why preemptible VMs are dangerous out of the box
- how to automate their safe use
- which services this suits and which it does not
- what the financial impact turned out to be

📚 Read the case study on Habr and take the idea for optimizing your cloud

Advertisement. Sports.ru LLC (ООО «Спортс.ру»), INN: 7705933383, erid: 2Vtzqw5FGt2
runme

Runme is a tool that makes runbooks actually runnable, making it easier to follow step-by-step instructions. Shell/Bash, Python, Ruby, JavaScript/TypeScript, Lua, PHP, Perl, and many other runtimes are supported via Runme's shebang feature. Runme allows users to execute instructions, check intermediate results, and ensure the desired outputs are achieved. This makes it an excellent solution for runbooks, playbooks, and documentation that require users to complete runnable steps incrementally, making operational docs reliable and much less susceptible to bitrot.

Runme achieves this by literally running markdown. More specifically, Runme runs your commands (shell, bash, zsh) or code inside your fenced code blocks. It's 100% compatible with your programming language's task definitions (Makefile, Gradle, Grunt, NPM scripts, Pipfile or Deno tasks, etc.) and markdown-native. Much like a terminal session, environment variables are retained across execution, and it is possible to pipe previous cells' output into successive cells. Runme persists your runbooks in markdown, which your docs are likely already using.
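To make "literally running markdown" concrete, here is a minimal sketch of the idea, not Runme's code. It extracts fenced code blocks from a markdown string in document order and runs the shell cells in a single bash process, so a variable exported in one cell is still set in the next. The function names are hypothetical. Concatenating cells into one script is a simplification: Runme keeps a live session and lets you run cells one at a time.

```python
import re
import subprocess
from typing import List, Tuple

# A fence is three backticks, an optional info string (the language), the body,
# then a closing line of three backticks. `{3} avoids a literal fence in this source.
FENCE = re.compile(r"^`{3}[ \t]*([\w+-]*)[^\n]*\n(.*?)^`{3}[ \t]*$", re.MULTILINE | re.DOTALL)

SHELL_LANGS = {"sh", "bash", "shell"}


def extract_cells(markdown: str) -> List[Tuple[str, str]]:
    """Return (language, code) for each fenced block; untagged blocks default to 'sh'."""
    return [(m.group(1) or "sh", m.group(2)) for m in FENCE.finditer(markdown)]


def run_shell_cells(cells: List[Tuple[str, str]]) -> str:
    """Naive runner: one bash process executes every shell cell in order,
    so environment variables persist across cells. Non-shell cells are skipped."""
    script = "\n".join(code for lang, code in cells if lang in SHELL_LANGS)
    result = subprocess.run(["bash", "-c", script], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    fence = "`" * 3
    doc = (
        f"# Setup\n\n{fence}sh\nexport NAME=world\n{fence}\n\n"
        f"Then greet:\n\n{fence}bash\necho \"hello $NAME\"\n{fence}\n"
    )
    print(run_shell_cells(extract_cells(doc)), end="")
```

With bash available, the example prints `hello world`, because the second cell sees the variable exported by the first.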


https://github.com/runmedev/runme
What does a DevOps engineer wish for at New Year?

- that the cluster updates without nighttime alerts
- that the network runs stably and predictably
- that a cluster upgrade doesn't turn into an evening spent with release notes

The Managed Kubernetes developers at MWS Cloud Platform know all your secret wishes and are ready to simplify your DevOps routine.

With Managed Kubernetes you get:
- a ready-to-use cluster in a few minutes without complex setup
- cluster and node lifecycle management
- autoscaling to match load
- native networking and storage integration via CCM / CSI
- centralized access management via IAM


🎄
🎁 Try it with a grant of up to 10,000 ₽

sqlit

The lazygit of SQL databases. Connect to Postgres, MySQL, SQL Server, SQLite, Supabase, Turso, and more from your terminal in seconds.


https://github.com/Maxteabag/sqlit
doh

Simple DNS over HTTPS CLI client for Cloudflare
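A DNS-over-HTTPS lookup is just an HTTPS request to a resolver. The sketch below uses Cloudflare's JSON flavor of DoH (GET `https://cloudflare-dns.com/dns-query?name=...&type=...` with `Accept: application/dns-json`). It illustrates the protocol only; it is not how the `doh` tool above is implemented, which may use the binary wire format instead. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import List

# Cloudflare's public resolver also answers JSON-formatted DoH queries here.
ENDPOINT = "https://cloudflare-dns.com/dns-query"


def build_request(name: str, rtype: str = "A") -> urllib.request.Request:
    """Build a GET request for one DNS question, asking for a JSON answer."""
    query = urllib.parse.urlencode({"name": name, "type": rtype})
    return urllib.request.Request(f"{ENDPOINT}?{query}", headers={"Accept": "application/dns-json"})


def parse_answers(body: str) -> List[str]:
    """Return the 'data' field of every answer record (may include CNAME targets)."""
    doc = json.loads(body)
    if doc.get("Status") != 0:  # 0 is NOERROR; e.g. 3 is NXDOMAIN
        raise RuntimeError(f"DNS query failed with status {doc.get('Status')}")
    return [record["data"] for record in doc.get("Answer", [])]


def resolve(name: str, rtype: str = "A", timeout: float = 5.0) -> List[str]:
    """Perform the lookup over the network."""
    with urllib.request.urlopen(build_request(name, rtype), timeout=timeout) as resp:
        return parse_answers(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(resolve("example.com"))
```

Splitting request building and response parsing from the network call keeps both testable offline.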


https://github.com/mxssl/doh
💥 IaC: Infrastructure testing, or how to adopt engineering practices and stop fearing changes

🔥 January 14 at 20:00 Moscow time: a free open OTUS webinar

Do you make changes to Terraform/Ansible and pray that nothing breaks in production? It's time to stop being afraid: infrastructure should be tested as rigorously as product code.

At the webinar we'll cover how to build a complete IaC testing system and turn infrastructure changes from a lottery into a predictable process.

📌 What's covered:
- Myths and real pain points: why infrastructure without tests breaks at the worst possible moment
- The IaC testing pyramid
- How to build checks into GitLab CI and make pipelines predictable
- Survival practices: standards, project structure, code style, documentation

🎯 After the webinar you will be able to:
- Automatically check infrastructure code before it is applied
- Catch regressions and "magic" configs at the MR stage
- Add tests to your current processes painlessly
- Reduce the number of outages and late-night calls
- Raise your team's engineering culture to a mature level

👉 Registration is open: https://vk.cc/cSE5uW

The webinar marks the launch of the course "DevOps Engineer: Practices and Tools": over 5 months you will build a fully tested, automated, and fault-tolerant infrastructure on a real production stack.

Advertisement. OTUS Online Education LLC (ООО «Отус онлайн-образование»), OGRN 1177746618576, erid: 2Vtzqx3pEhY
Shifting left at enterprise scale: how we manage Cloudflare with Infrastructure as Code

The Cloudflare platform is a critical system for Cloudflare itself. We are our own Customer Zero – using our products to secure and optimize our own services.

Within our security division, a dedicated Customer Zero team uses its unique position to provide a constant, high-fidelity feedback loop to product and engineering that drives continuous improvement of our products. And we do this at global scale, where a single misconfiguration can propagate across our edge in seconds and lead to unintended consequences. If you've ever hesitated before pushing a change to production, sweating because you know one small mistake could lock every employee out of a critical application or take down a production service, you know the feeling. The risk of unintended consequences is real, and it keeps us up at night.

This presents an interesting challenge: How do we ensure hundreds of internal production Cloudflare accounts are secured consistently while minimizing human error?

While the Cloudflare dashboard is excellent for observability and analytics, manually clicking through hundreds of accounts to ensure security settings are identical is a recipe for mistakes. To keep our sanity and our security intact, we stopped treating our configurations as manual point-and-click tasks and started treating them like code. We adopted “shift left” principles to move security checks to the earliest stages of development.

This wasn't an abstract corporate goal for us. It was a survival mechanism to catch errors before they caused an incident, and it required a fundamental change in our governance architecture.
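The post's goal, identical security settings across hundreds of accounts checked before changes ship, can be illustrated with a small pre-merge check. This is a hypothetical sketch, not Cloudflare's tooling: the `BASELINE` keys and values and the account names are assumptions chosen for the example, and a real pipeline would read them from the IaC definitions under review.

```python
from typing import Dict, List

# Hypothetical security baseline every account must match exactly.
BASELINE: Dict[str, object] = {"always_use_https": "on", "min_tls_version": "1.2"}


def violations(accounts: Dict[str, Dict[str, object]],
               baseline: Dict[str, object] = BASELINE) -> List[str]:
    """List every account setting that differs from the baseline (missing counts as different)."""
    problems = []
    for account, settings in sorted(accounts.items()):
        for key, required in baseline.items():
            actual = settings.get(key)
            if actual != required:
                problems.append(f"{account}: {key} is {actual!r}, expected {required!r}")
    return problems


if __name__ == "__main__":
    proposed = {
        "acct-a": {"always_use_https": "on", "min_tls_version": "1.2"},
        "acct-b": {"always_use_https": "off", "min_tls_version": "1.2"},
    }
    for problem in violations(proposed):
        print(problem)
```

Run in CI on every merge request, a non-empty result would block the change before it reaches production, which is the "shift left" idea in miniature.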


https://blog.cloudflare.com/shift-left-enterprise-scale
How We Scaled Code Repository Management at DNSimple

Managing a handful of GitHub repositories is straightforward. Managing hundreds of them consistently is a challenge. Over the years at DNSimple, we've evolved from manual configuration to a fully automated Infrastructure as Code (IaC) approach. This is the story of that evolution, the lessons we learned, and how we built a system that now manages all our GitHub resources through pull requests and CI/CD pipelines.

At DNSimple, we've managed our internal infrastructure as code since day one, primarily using Chef for configuration management. Infrastructure as Code wasn't new to us; it was the foundation of how we operated. The challenge was applying these same principles to externally managed resources like GitHub repositories, which required a different approach than our traditional internal infrastructure management.


https://blog.dnsimple.com/2025/11/managing-repositories-terraform-github
The stacking workflow

Stacked PRs. Stacked diffs. Stacked changes.
A better workflow to manage pull requests.


https://www.stacking.dev
Monitoring & Observability: Using Logs, Metrics, Traces, and Alerts to Understand System Failures

When your application ships to production, it becomes partly opaque. You own the code, but the runtime, network, and platform behaviors often fall outside your direct line of sight. That’s where Monitoring and Observability come in.

Monitoring warns you when predefined thresholds break. Observability lets you explore unknowns, asking new questions in real time and getting meaningful answers without redeploying.
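Threshold-based monitoring is mechanical enough to sketch. The function below evaluates a series of metric samples against a fixed threshold and only reports "firing" once the breach has lasted a given number of consecutive samples, similar in spirit to the `pending`/`firing` states and `for:` duration of Prometheus alerting rules. It is a hypothetical illustration, not Prometheus code; the threshold and sample values are made up.

```python
from typing import Iterable, List


def evaluate_threshold(samples: Iterable[float], threshold: float, for_samples: int) -> List[str]:
    """Return an alert state per sample: 'inactive' (no breach), 'pending'
    (breached, but not yet for long enough), or 'firing'."""
    states: List[str] = []
    streak = 0  # consecutive samples above the threshold
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak == 0:
            states.append("inactive")
        elif streak < for_samples:
            states.append("pending")
        else:
            states.append("firing")
    return states


if __name__ == "__main__":
    cpu = [0.2, 0.9, 0.95, 0.97, 0.4]
    print(evaluate_threshold(cpu, threshold=0.8, for_samples=3))
```

The sustain window keeps a single spike from paging anyone. It also shows the limit described above: the rule only answers the question it was written for.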

For engineers running software in production, observability rests on three pillars: logs, metrics, and traces. Each offers a different lens into system behavior. Understanding where each excels and where it doesn’t is essential for building a practical, scalable visibility strategy.


https://blog.railway.com/p/using-logs-metrics-traces-and-alerts-to-understand-system-failures
KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

Every Infrastructure as Code tutorial starts the same way: provision a single S3 bucket, create one EC2 instance, deploy a basic load balancer. The examples are clean, simple, and elegant. You follow along, everything works, and you feel like you understand Terraform.

Then you get to your actual production environment, and everything changes.

You’re not starting from scratch with a blank AWS account. You’ve got existing resources that were manually created two years ago by someone who left the company. There’s brownfield infrastructure everywhere with no clear documentation. You need to import existing state, figure out what’s actually running, and somehow wrangle it all into code without breaking production. On top of that, you need to manage 200 instances across dev, staging, and production environments. Multiple AWS accounts with different configurations and permissions. Three regions for disaster recovery. Azure for the legacy workloads that nobody wants to touch. GCP running your GKE clusters for the containerized applications.

Suddenly that elegant tutorial code becomes a nightmare of orchestration, state management, environment-specific configurations, and brownfield complexity. You’re not just writing infrastructure code anymore. You’re trying to organize, orchestrate, and maintain it at scale while dealing with the reality that infrastructure is messy, evolving, and full of historical baggage.

This is the scale gap, and it’s where the KISS vs DRY debate stops being theoretical and starts costing real time, money, and engineering effort.


https://rosesecurity.dev/2025/11/14/kiss-versus-dry-iac.html
pg_textsearch

PostgreSQL extension for BM25 relevance-ranked full-text search. Postgres OSS licensed.
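BM25 scores a document by summing, over query terms, an IDF weight times a saturated term frequency that is normalized by document length: `idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len / avgdl))`. The sketch below computes it in plain Python with whitespace tokenization, the common defaults `k1 = 1.2` and `b = 0.75`, and the Lucene-style IDF `ln((N - df + 0.5) / (df + 0.5) + 1)`. These choices are assumptions for illustration; pg_textsearch's exact variant, tokenization, and parameters are documented in its repository.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25 (naive tokenization)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # absent terms contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "postgres full text search",
        "postgres replication and backups",
        "bm25 ranking for full text search in postgres",
    ]
    print(bm25_scores("text search", docs))
```

Both matching documents contain each query term once, so the shorter one ranks higher. That is the effect of the `b` length normalization.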


https://github.com/timescale/pg_textsearch
pgedge-postgres-mcp

The pgEdge Postgres Model Context Protocol (MCP) server enables SQL queries against PostgreSQL databases through MCP-compatible clients like Claude Desktop. The Natural Language Agent provides supporting functionality that allows you to use natural language to form SQL queries.


https://github.com/pgEdge/pgedge-postgres-mcp
arcane

Modern Docker Management, Designed for Everyone


https://github.com/getarcaneapp/arcane
ente

Ente is a service that provides a fully open source, end-to-end encrypted platform for you to store your data in the cloud without needing to trust the service provider. On top of this platform, we have built two apps so far: Ente Photos (an alternative to Apple and Google Photos) and Ente Auth (a 2FA alternative to the deprecated Authy).


https://github.com/ente-io/ente