DevOps&SRE Library

Practical Considerations for AI Incident Reviews

The post argues AI-written incident reviews fail without rich cross-system data and human engagement because incident reviews are socio-technical learning work, not just document generation.

https://fgj.codes/posts/ai-incident-reviews

10 Real-World Status Page Examples: And What You Can Learn From Them

The post walks through ten status page examples and highlights clear communication, simple layouts, and expectation-setting details that help users during incidents.

https://uptimerobot.com/blog/10-real-status-page-examples

Disappointing People Early

The post argues teams should make reliability targets, support limits, and roadmap uncertainty explicit early so customers and stakeholders do not build riskier implicit expectations.

https://log.andvari.net/disappointing-people-early.html

5 Suggestions to Upgrade your OpenTofu/Terraform & AWS Development Experience

Five practical DX improvements for daily OpenTofu/Terraform + AWS work: use `tenv` for seamless version switching, a `grep` alias to summarize plans quickly, `tflint` with cloud provider plugins for linting, `awsp` for fast AWS profile switching, and a customized shell prompt showing the current branch/workspace/profile at a glance to prevent costly wrong-context mistakes.

https://www.uturndata.com/insights/5-suggestions-upgrade-opentofu-terraform-aws-development-experience
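
The plan-summary idea from the entry above can be sketched in a few lines. This is not the post's `grep` alias, only a rough Python equivalent under stated assumptions: the plan was produced with `-no-color`, and the function name `summarize_plan` and the sample text are hypothetical. It keeps the per-resource action headers and the final `Plan:` line and drops the attribute-level diff.

```python
import re

# Keep resource action headers such as "  # aws_x.y will be created"
# plus the closing "Plan: ..." or "No changes." line.
# Assumes plain-text output (terraform plan -no-color).
_KEEP = re.compile(r"^\s*# .+ (will be|must be) |^Plan:|^No changes\.")


def summarize_plan(plan_text: str) -> list[str]:
    """Return only the lines that say which resources change and how."""
    return [line.strip() for line in plan_text.splitlines() if _KEEP.search(line)]


# Hypothetical sample, shaped like human-readable plan output.
SAMPLE_PLAN = """\
Terraform will perform the following actions:

  # aws_s3_bucket.logs will be created
  + resource "aws_s3_bucket" "logs" {
      + bucket = "example-logs"
    }

  # aws_instance.web must be replaced
-/+ resource "aws_instance" "web" {
      ~ ami = "ami-1" -> "ami-2" # forces replacement
    }

Plan: 2 to add, 0 to change, 1 to destroy.
"""

if __name__ == "__main__":
    for line in summarize_plan(SAMPLE_PLAN):
        print(line)
```

Piping real plan output through a filter like this makes a destroy or replace stand out before anyone types `apply`, which is the same wrong-context protection the prompt customization aims at.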

Terraform Drift Detection Powered by GitHub Actions

A zero-cost drift detection pipeline built entirely on GitHub Actions: it auto-discovers root modules, runs daily parallel plans with Terraform's native `-detailed-exitcode` flag, and opens GitHub Issues when drift is detected. It needs no external tools or paid services and uses OIDC for keyless AWS auth.

https://rosesecurity.dev/2025/12/11/terraform-drift-detection-with-github-actions.html
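
As a rough illustration of the mechanism in the entry above: `terraform plan -detailed-exitcode` exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The sketch below is not the post's workflow. The discovery rule (any directory with a `.tf` file, skipping `modules/` and `.terraform/`) and the names `find_root_modules`, `classify_exit`, and `check_drift` are assumptions made for the example, and it assumes each module has already been initialized with `terraform init`.

```python
import subprocess
from pathlib import Path

# Documented exit codes of `terraform plan -detailed-exitcode`.
EXIT_MEANING = {0: "clean", 1: "error", 2: "drift"}


def classify_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes count as errors."""
    return EXIT_MEANING.get(code, "error")


def find_root_modules(repo: Path) -> list[Path]:
    """Hypothetical discovery rule: every directory that holds a .tf file,
    except anything under a `modules` or `.terraform` directory."""
    roots = set()
    for tf_file in repo.rglob("*.tf"):
        parts = tf_file.relative_to(repo).parts
        if ".terraform" in parts or "modules" in parts:
            continue
        roots.add(tf_file.parent)
    return sorted(roots)


def check_drift(module_dir: Path) -> str:
    """Run a read-only plan in module_dir and report clean, drift, or error.
    Assumes the terraform binary is on PATH and the module is initialized."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode",
         "-input=false", "-lock=false", "-no-color"],
        cwd=module_dir,
        capture_output=True,
        text=True,
    )
    return classify_exit(result.returncode)


if __name__ == "__main__":
    for module in find_root_modules(Path(".")):
        print(f"{module}: {check_drift(module)}")
```

In a scheduled CI job, a `drift` status would be the trigger for opening an issue, while `error` should be surfaced separately so a broken plan is not mistaken for a clean one.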

InfraKitchen

An open-source platform from Electrolux that lets platform teams define reusable Terraform templates while enabling developers to self-serve multi-cloud infrastructure (AWS, Azure, GCP) via pull-request-driven continuous delivery, with audit logging and an MCP server for AI agent integration.

https://opensource.electrolux.one/infrakitchen

nono

AI agents get filesystem access, run shell commands, and are wide open to prompt injections. The standard response is guardrails and policies. The problem is that policies can be bypassed, and an agent can be talked out of its guardrails.

With nono, you don't have to rely on either. nono wraps your agent in a kernel-isolated sandbox in seconds, with API key protection, destructive-action guardrails, and full snapshot/rollback built in. No hypervisor to configure. No container volume mounts; instead, fine-grained capability control down to the file level. Zero latency overhead.

https://github.com/always-further/nono

nanobrew

A fast package manager for macOS and Linux. Written in Zig. Uses Homebrew's bottles and formulas under the hood, plus native .deb support for Docker containers.

https://github.com/justrach/nanobrew

Automating RDS Postgres to Aurora Postgres Migration

In 2024, the Online Data Stores team at Netflix conducted a comprehensive review of the relational database technologies used across the company. This evaluation examined functionality, performance, and total cost of ownership across our database ecosystem. Based on this analysis, we decided to standardize on Amazon Aurora PostgreSQL as the primary relational database offering for Netflix teams.

https://netflixtechblog.com/automating-rds-postgres-to-aurora-postgres-migration-261ca045447f

Safeguarding dynamic configuration changes at scale

How Airbnb ships dynamic config changes safely and reliably.

https://medium.com/airbnb-engineering/safeguarding-dynamic-configuration-changes-at-scale-5aca5222ed68

How to cut your Docker build time by 95%, Buildx, Caching & Layer Optimization

Docker builds taking forever? I cut mine from 8 min to 24 sec. Here's how using Buildx and caching.

https://arcnet.am/post/70

Terraform Parallelism: How It Works, Tuning & Best Practices

In this blog post, we will explore Terraform parallelism: what it is, how to manage it, and best practices for configuring parallelism in Terraform.

https://spacelift.io/blog/terraform-parallelism

4 ways to use Argo CD and Terraform together

Terraform is the most popular solution for implementing Infrastructure As Code (IaC). The Terraform provider registry contains a very large collection of providers/integrations for all the major cloud providers and at the same time offers a wealth of integration for databases, networking components, Continuous Integration platforms etc.

Argo CD is the leading solution for GitOps deployments on Kubernetes. In the last CNCF survey we found out that 60% of respondents use Argo CD in production.

Although several guides currently exist that explain how to use each tool individually, there is limited information on how they can be combined. A lot of existing Terraform users adopt Argo CD and wonder:

1. What is the best way to pass variables from Terraform to Helm charts deployed with Terraform?
2. How to get secrets in Kubernetes applications that are generated/retrieved from Terraform?
3. When should the Terraform Helm and Kubernetes providers come into play if Argo CD already supports Kubernetes deployments on its own?
4. For which Kubernetes resources should Terraform be responsible and for which Argo CD?
5. What is the proper boundary between the two tools so that operators can use them to the maximum benefit?

In this guide, we will answer all these questions and actually show you four different approaches for how Terraform and Argo CD can work together. Note that everything we say about Terraform also applies to OpenTofu.

https://octopus.com/blog/argocd-terraform-together

Migrating Etsy's database sharding to Vitess

This database cluster contains most of Etsy's online data and is made up of ~1,000 tables distributed across ~1,000 shards.

https://www.etsy.com/codeascraft/migrating-etsyas-database-sharding-to-vitess

We Automated Everything Except Knowing What's Going On

AI collapsed the cost of building software, but the systems underneath are buckling.

https://eversole.dev/blog/we-automated-everything

Why our Kafka consumers survived the day but died every night

It took us 4–5 incidents over several weeks to even recognise the pattern.

https://medium.com/@lokeshsoni/why-our-kafka-consumers-survived-the-day-but-died-every-night-8c9eb6ae528f

Reliability Engineering for Air-Gapped Systems

All those systems were air-gapped, meaning the team that builds the software has no access to metrics, logs or runtime.

https://blog.alexewerlof.com/p/reliability-engineering-for-air-gapped

How I Dragged Phantom Tide Out of an OOM Kill Loop

From the inside, it was a systems failure spread across FastAPI, uvicorn, Redis, ClickHouse, APScheduler, Docker memory limits, and a startup sequence that had quietly become a deterministic self-attack.

https://github.com/tg12/phantomtide/blob/main/docs/oom-postmortem.md

Shell Tricks That Actually Make Life Easier (And Save Your Sanity)

There is a distinct, visceral kind of pain in watching an otherwise brilliant engineer hold down the Backspace key for six continuous seconds to fix a typo at the beginning of a line.

We’ve all been there. We learn ls, cd, and grep, and then we sort of… stop. The terminal becomes a place we live in, but we rarely bother to arrange the furniture. We accept that certain tasks take forty keystrokes, completely unaware that the shell authors solved our exact frustration sometime in 1989.

Here are some tricks that aren’t exactly secret, but aren’t always taught either. To keep the peace in our extended Unix family, I’ve split these into two camps: the universal tricks that work on almost any POSIX-ish shell (like sh on FreeBSD or ksh on OpenBSD), and the quality-of-life additions specific to interactive shells like Bash or Zsh.

https://blog.hofstede.it/shell-tricks-that-actually-make-life-easier-and-save-your-sanity

drpc-agent-skills

Blockchain RPC skills for AI coding agents

https://github.com/drpcorg/drpc-agent-skills
