DevOps&SRE Library

difftastic

a structural diff that understands syntax


We Cut Our Kubernetes Pods by 60% and Doubled Traffic Capacity

This case study explains how JVM tuning, a smaller Hikari pool, and faster HPA scale-up doubled traffic capacity while reducing baseline pods.

https://medium.com/@feridquluzade2002/we-cut-our-kubernetes-pods-by-60-and-doubled-traffic-capacity-b1cfb6850fca


Hidden Kubernetes Bad Practices Learned the Hard Way During Incidents

This article distills incident-driven lessons on troubleshooting, configuration mistakes, and operational habits that make Kubernetes outages worse.

https://hackernoon.com/hidden-kubernetes-bad-practices-learned-the-hard-way-during-incidents


From Chaos to 99.9% Uptime: Rebuilding a Kubernetes Platform for GPU Workloads

This article covers rebuilding a Kubernetes platform for GPU workloads to reach 99.9% uptime after operational instability.

https://medium.com/@mateenali66/from-chaos-to-99-9-uptime-rebuilding-a-kubernetes-platform-for-gpu-workloads-4fadb1067a0b


Benchmarking Kubernetes Log Collectors: vlagent, Vector, Fluent Bit, OpenTelemetry Collector, and more

At VictoriaMetrics, we built vlagent as a high-performance log collector for VictoriaLogs. To validate its performance and correctness under a real production-like load, we developed a benchmark suite and ran it against 8 popular log collectors. This post covers the methodology, throughput results, resource usage, and delivery correctness.

https://victoriametrics.com/blog/log-collectors-benchmark-2026/index.html


Making and scaling a game server in Kubernetes using Agones

This tutorial walks through building a Go game server with Agones, matchmaking, Fleet allocation, and autoscaling on Kubernetes.

https://noe-t.dev/posts/making-and-scaling-a-game-server-in-k8s-using-agones


PostgreSQL migration with CloudNativePG Logical Replication on Kubernetes - Zero-Downtime

This tutorial shows how to migrate PostgreSQL to CloudNativePG on Kubernetes with logical replication and no downtime.

https://kndoni.medium.com/postgresql-migration-with-cloudnativepg-logical-replication-on-kubernetes-zero-downtime-aef1c33a3a53


Gateway API setup on GKE with NGINX Gateway Fabric

This tutorial shows how to deploy NGINX Gateway Fabric on GKE with Terraform, split traffic paths, and automate TLS certificates.

https://medium.com/@henrikamirbekyan/gateway-api-setup-on-gke-with-nginx-gateway-fabric-1b0d0ec3bbf3


How to run microservices in Managed Kubernetes

Deploying a microservice application is not enough: you need rules for launching, updating, scaling, and isolating it. Those rules are what make operations predictable and the infrastructure ready for growing load.

In a webinar on March 26 at 11:00, Cloud.ru experts will cover how to turn Managed Kubernetes into a convenient and reliable platform for running microservices.

On the agenda, you will:

1⃣ work out which projects truly need microservices and how to quickly launch a ready-made, scalable solution in the cloud without unnecessary complexity;

2⃣ look at the basic Kubernetes structure for microservices: what is needed right away and what can be deferred;

3⃣ discuss how to organize deployments, updates, and rollbacks so that releases stay manageable;

4⃣ set up scaling with native Kubernetes tools;

5⃣ connect the platform to an artifact registry;

6⃣ learn how to monitor application metrics and logs.

👉 Register 👈


Migrating Kubernetes Off Big Cloud

This interview compares the cost and operational tradeoffs of moving a Kubernetes workload from GKE Autopilot to Hetzner with Edka.

https://kube.fm/migrating-kubernetes-off-big-cloud-fernando


GoKubeDownscaler

A horizontal autoscaler for Kubernetes workloads that saves cloud costs by scaling workloads down after hours. It is a Go port and successor of the popular (py-)kube-downscaler, with improvements and quality-of-life changes.

https://github.com/caas-team/GoKubeDownscaler


Karpenter Optimizer: cost optimization

This tool analyzes Karpenter NodePool usage and offers AI-powered recommendations to reduce AWS EC2 costs while maintaining performance.

https://github.com/kaskol10/karpenter-optimizer


cek

Explore OCI container images without running them.

https://github.com/bschaatsbergen/cek


Which tools speed up product launches and simplify development?

Find out at GoCloud 2026

On April 9, the Cloud.ru team is holding a major IT conference about cloud and AI.

This time, a dedicated track covers development and the tools that reduce the load on the team:

▶ automation in the age of AI

▶ DevOps tools in the cloud

▶ efficient environments for development, CI/CD, and learning

▶ DevOps and SRE agents

▶ protecting cloud native applications

▶ and other talks

There will also be separate tracks on AI, cloud infrastructure, and working with data. Best of all, there are hands-on workshops: bring a laptop and solve practical problems with guidance from Cloud.ru experts.

Where and when:
April 9, in Moscow and online

👉 Don't miss it 👈


linnix

eBPF-powered Linux observability with AI incident detection.

https://github.com/linnix-os/linnix


Yandex B2B Tech has launched Stackland, a container platform for deploying and scaling applications in an isolated on-prem environment. In essence, it is ready-made infrastructure that lets you deploy Yandex Cloud managed services out of the box: S3 storage, databases, and containerization tools. It can be deployed in a few hours on your own dedicated or rented servers, or on virtualization, instead of spending weeks or months assembling the base stack needed to develop and support applications.

The core idea is to cut the time teams spend on infrastructure, especially in scenarios where data cannot be moved to a public cloud and teams have to operate in a hybrid model. The company claims development speedups and cost reductions of roughly 1.5x. Users are already testing the platform as ready-made AI infrastructure, as a foundation for building analytics systems, and for developing microservice applications.

The platform makes it possible to quickly bring Yandex Cloud services into an internal corporate environment without additional integration. SpeechSense and DataLens are available now, and AI Studio will be added soon.

You can request a platform demo or book a one-on-one consultation with the platform's architects via the link.


radar

Visualize your cluster topology, browse resources, stream logs, exec into pods, inspect container image filesystems, manage Helm releases, monitor GitOps workflows (FluxCD & ArgoCD), and forward ports — all from a single binary with zero cluster-side installation.

https://github.com/skyhook-io/radar


onecli

OneCLI is an open-source gateway that sits between your AI agents and the services they call. Instead of baking API keys into every agent, you store credentials once in OneCLI and the gateway injects them transparently. Agents never see the secrets.

Why we built it: AI agents need to call dozens of APIs, but giving each agent raw credentials is a security risk. OneCLI solves this with a single gateway that handles auth, so you get one place to manage access, rotate keys, and see what every agent is doing.

How it works: You store your real API credentials in OneCLI and give your agents placeholder keys (e.g. FAKE_KEY). When an agent makes an HTTP call through the gateway, the OneCLI gateway matches the request to the right credentials, decrypts them, and swaps the FAKE_KEY for the REAL_KEY in the outbound request. The agent never touches the real secrets. It just makes normal HTTP calls and the gateway handles the swap.
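
The placeholder swap described above can be illustrated with a minimal Python sketch. This is not OneCLI's actual API or implementation: the `CredentialGateway` class, its `store` and `inject` methods, the host `api.example.com`, and the use of an `Authorization` header are all assumptions made for illustration, and secrets are kept as plain strings where a real gateway would store them encrypted.

```python
from dataclasses import dataclass, field


@dataclass
class CredentialGateway:
    """Toy stand-in for a credential-injecting gateway (hypothetical, not OneCLI's API)."""

    # Maps (host, placeholder key) -> real secret. A real gateway would keep
    # these encrypted at rest; plain strings keep the sketch short.
    vault: dict[tuple[str, str], str] = field(default_factory=dict)

    def store(self, host: str, placeholder: str, real_key: str) -> None:
        """Register the real credential behind a placeholder for one host."""
        self.vault[(host, placeholder)] = real_key

    def inject(self, host: str, headers: dict[str, str]) -> dict[str, str]:
        """Return a copy of the headers with the placeholder swapped for the real key."""
        outbound = dict(headers)  # never mutate what the agent sent
        scheme, _, token = outbound.get("Authorization", "").partition(" ")
        real_key = self.vault.get((host, token))
        if real_key is None:
            raise PermissionError(f"no credential stored for {host}")
        outbound["Authorization"] = f"{scheme} {real_key}"
        return outbound


gateway = CredentialGateway()
gateway.store("api.example.com", "FAKE_KEY", "REAL_KEY")

# The agent only ever holds the placeholder.
agent_headers = {"Authorization": "Bearer FAKE_KEY", "Accept": "application/json"}
outbound_headers = gateway.inject("api.example.com", agent_headers)
print(outbound_headers["Authorization"])
```

Keying the lookup on both host and placeholder means a placeholder issued for one service cannot be used to pull credentials for another, which is one plausible way to scope access per agent.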

https://github.com/onecli/onecli


How I Dropped Our Production Database and Now Pay 10% More for AWS

https://alexeyondata.substack.com/p/how-i-dropped-our-production-database


Is Infrastructure as Code the Next Abstraction to Fall?

I’ve been staring at a Terraform module for the last ten minutes, and I can’t stop thinking about a question that would have been absurd two years ago: why am I writing this?

Not “why am I provisioning this infrastructure.” That part makes sense. But why am I writing HCL, a domain-specific language that exists to describe infrastructure in a way that humans can read, when I have an AI agent sitting in my terminal that can call the AWS API directly?

It’s the kind of question that sounds naive until you realise the same logic is playing out across every layer of the stack. And the more I look at it, the more I think we’re watching the early stages of a fundamental shift in how we interact with machines.

https://sjramblings.io/is-infrastructure-as-code-the-next-abstraction-to-fall
