DevOps&SRE Library

Managing Terraform at Scale: A Deep Dive into Terragrunt Configuration Hierarchy

How I manage 100+ infrastructure components across multiple products, environments, and regions without configuration duplication

https://devineer.medium.com/managing-terraform-at-scale-a-deep-dive-into-terragrunt-configuration-hierarchy-54f1f16e7c1f

Futureproofing Tines: Partitioning a 17TB table in PostgreSQL

At Tines, we recently faced a significant engineering challenge: our output_payloads table in PostgreSQL was rapidly approaching 17TB on our largest cloud cluster, with no signs of slowing down. Once a table reaches PostgreSQL’s 32TB table size limit, it will stop accepting writes. This table holds event data, in the form of arbitrary JSON, which is critical to powering Tines workflows. Given the criticality of the data, we couldn’t risk any disruptions to it.

As our monitoring showed the table's growth, we began experiencing warning signs. Cleanup jobs on the table had begun to time out. The table was causing increased I/O pressure on our infrastructure, leading us to use more expensive hardware. The arbitrary JSON shape of the data meant massive autovacuum jobs on its TOAST table. When these autovacuums ran, they displaced other tables from the buffer cache, forcing disk reads in critical areas. As a band-aid, we modified the autovacuum parameters of the table so that the autovacuums would run more frequently, but have fewer tuples to process. With performance slowly degrading, and 32TB looming on the horizon, we knew we needed to act decisively.
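To make the autovacuum band-aid concrete, here is a rough sketch of the base trigger formula from the PostgreSQL documentation: a table is vacuumed once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * tuples`. The row count and the tuned values below are hypothetical, chosen only for illustration; the post does not say which values Tines used, and the sketch ignores any extra caps that newer PostgreSQL releases may apply.

```python
def autovacuum_trigger_point(
    live_tuples: int,
    base_threshold: int = 50,    # PostgreSQL default for autovacuum_vacuum_threshold
    scale_factor: float = 0.2,   # PostgreSQL default for autovacuum_vacuum_scale_factor
) -> int:
    """Return the number of dead tuples that must accumulate before autovacuum runs."""
    return base_threshold + round(scale_factor * live_tuples)


# Hypothetical table size, not taken from the article.
ROWS = 1_000_000_000

default_point = autovacuum_trigger_point(ROWS)
# Hypothetical per-table override: a lower scale factor means more frequent, smaller vacuums.
tuned_point = autovacuum_trigger_point(ROWS, base_threshold=1000, scale_factor=0.01)

print(f"default settings: vacuum after {default_point:,} dead tuples")
print(f"tuned settings:   vacuum after {tuned_point:,} dead tuples")
```

With the defaults, a very large table accumulates a huge backlog before each run, which is why lowering the scale factor per table trades one massive vacuum for many smaller ones.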

https://www.tines.com/blog/futureproofing-tines-partitioning-a-17tb-table-in-postgresql

Chasing Boring at Just the Right Speed

https://log.andvari.net/no-mttr.html

fluid.sh

AI agents are ready to do infrastructure work, but they can't touch prod:

- Agents can install packages, configure services, write scripts—autonomously
- But one mistake on production and you're getting paged at 3 AM to fix it
- So we limit agents to chatbots instead of letting them do the work

https://github.com/aspectrr/fluid.sh

graft

Graft is a CLI tool that brings the Overlay Pattern (similar to Kustomize) to Terraform. It acts as a JIT (Just-In-Time) Compiler, allowing you to apply declarative patches to third-party modules at build time.

With Graft, you can treat upstream modules (e.g., from the Public Registry) as immutable base layers and inject your own logic on top—without the maintenance nightmare of forking.

https://github.com/ms-henglu/graft

Owning a $5M data center

These days it seems you need a trillion fake dollars, or lunch with politicians to get your own data center. They may help, but they’re not required. At comma we’ve been running our own data center for years. All of our model training, metrics, and data live in our own data center in our own office. Having your own data center is cool, and in this blog post I will describe how ours works, so you can be inspired to have your own data center too.

https://blog.comma.ai/datacenter

whosthere

Local Area Network discovery tool with a modern Terminal User Interface (TUI) written in Go. Discover, explore, and understand your LAN in an intuitive way.

Whosthere performs unprivileged, concurrent scans using mDNS and SSDP scanners. Additionally, it sweeps the local subnet by attempting TCP/UDP connections to trigger ARP resolution, then reads the ARP cache to identify devices on your Local Area Network. This technique populates the ARP cache without requiring elevated privileges. All discovered devices are enhanced with OUI lookups to display manufacturers when available.

Whosthere provides a friendly, intuitive way to answer the question every network administrator asks: "Who's there on my network?"
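The ARP-cache technique described above can be sketched roughly as follows. This is an illustration only, not whosthere's code (the tool itself is written in Go): it assumes a Linux host where the ARP cache is exposed in the `/proc/net/arp` text format, and the names `nudge` and `parse_arp_table` are made up for this example.

```python
import socket


def nudge(ip: str, port: int = 80, timeout: float = 0.2) -> None:
    """Attempt a TCP connection so the kernel tries to resolve the host's MAC via ARP."""
    try:
        socket.create_connection((ip, port), timeout=timeout).close()
    except OSError:
        pass  # Refused or timed out; we only care about the ARP side effect.


def parse_arp_table(text: str) -> dict[str, str]:
    """Parse /proc/net/arp-style text into {ip: mac}, skipping incomplete entries."""
    devices: dict[str, str] = {}
    for line in text.splitlines()[1:]:  # The first line is the column header.
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac = fields[:4]
        # Flag bit 0x2 marks a completed entry; unresolved hosts show an all-zero MAC.
        if int(flags, 16) & 0x2 and mac != "00:00:00:00:00:00":
            devices[ip] = mac.lower()
    return devices


SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         AA:BB:CC:DD:EE:FF     *        eth0
192.168.1.7      0x1         0x0         00:00:00:00:00:00     *        eth0
"""

print(parse_arp_table(SAMPLE))
```

On a real Linux machine, one would call `nudge` for each address in the subnet and then pass the contents of `/proc/net/arp` to `parse_arp_table`; neither step needs root, which is the point of the technique.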

https://github.com/ramonvermeulen/whosthere

zerobrew

zerobrew applies uv's model to Mac packages. Packages live in a content-addressable store (by sha256), so reinstalls are instant. Downloads, extraction, and linking run in parallel with aggressive HTTP caching. It pulls from Homebrew's CDN, so you can swap brew for zb with your existing commands.

This leads to dramatic speedups, up to 5x cold and 20x warm.
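As a rough illustration of why a content-addressable store makes reinstalls cheap, the sketch below keys each blob by its sha256 digest and skips the write when that digest is already present. It is not zerobrew's implementation; the `ContentStore` class and its on-disk layout are assumptions made up for this example.

```python
import hashlib
import tempfile
from pathlib import Path


class ContentStore:
    """Toy content-addressable store: each blob is saved under its sha256 hex digest."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, data: bytes) -> tuple[str, bool]:
        """Store data and return (digest, written); written is False on a cache hit."""
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if path.exists():
            return digest, False  # Same content already stored: nothing to do.
        path.write_bytes(data)
        return digest, True

    def get(self, digest: str) -> bytes:
        """Read a blob back by its digest."""
        return (self.root / digest).read_bytes()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = ContentStore(Path(tmp) / "store")
        first = store.put(b"example package contents")
        second = store.put(b"example package contents")
        print(first[1], second[1])  # The second put is a cache hit.
```

Because the key is derived from the content itself, an identical package always maps to the same path, so a "reinstall" reduces to an existence check plus linking.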

https://github.com/lucasgelfond/zerobrew

Hi! My good friend is looking for a colleague to join their team.

You can check the details of the position and apply here: https://jobs.ashbyhq.com/perplexity/7bce0fcf-eef6-41aa-9243-896f07a0316e

If you have additional questions about the position, you can send them to alena@perplexity.ai.

prek

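pre-commit is a framework to run hooks written in many languages, and it manages the language toolchain and dependencies for running the hooks. prek is a reimagined version of pre-commit, built in Rust and intended as a faster, dependency-free, drop-in alternative.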

https://github.com/j178/prek

The future of software engineering is SRE

When code gets cheap, operational excellence wins. Anyone can build a greenfield demo, but it takes engineering to run a service.

https://swizec.com/blog/the-future-of-software-engineering-is-sre

10 Elasticsearch Production Issues (and How Postgres Avoids Them)

Elasticsearch may work great in initial testing and development, but production is a different story. This blog is about what happens after you ship: the JVM tuning, the shard math, the 3 AM pages, the sync pipelines that break silently. The stuff your ops team lives with.

After years of teams running Elasticsearch in production, certain patterns keep emerging. The same issues show up in blog posts, Stack Overflow questions, and incident reports. We've compiled ten of the most common ones below, with references to the engineers who've documented them. We’ve also added images to make it easy to skim the list and compare each challenge against Postgres.

TLDR: With great power comes great operational complexity.

https://www.tigerdata.com/blog/10-elasticsearch-production-issues-how-postgres-avoids-them

How OpenAI Scales Postgres to Power 800 Million ChatGPT Users

For years, PostgreSQL has been one of the most critical, under-the-hood data systems powering core products like ChatGPT and OpenAI’s API. As our user base grows rapidly, the demands on our databases have increased exponentially, too. Over the past year, our PostgreSQL load has grown by more than 10x, and it continues to rise quickly.

https://openai.com/index/scaling-postgresql

Introduction to Buffers in PostgreSQL

The work around RegreSQL led me to focus a lot on buffers. If you are a casual PostgreSQL user, you have probably heard about adjusting shared_buffers and followed the good old advice to set it to 1/4 of available RAM. But after we got a little too enthusiastic about them on a recent Postgres FM episode, I was asked what that's all about.

Buffers are one of those topics that easily get forgotten. And while they are a foundational block of PostgreSQL's performance architecture, most of us treat them as a black box. This article is going to attempt to change that.

https://boringsql.com/posts/introduction-to-buffers

Why Your HA Architecture is a Lie (And That's Okay)

https://mydbanotebook.org/posts/why-your-ha-architecture-is-a-lie-and-thats-okay

Is the future of MySQL PostgreSQL (Or MariaDB, or TiDB, or ...)?

https://stokerpostgresql.blogspot.com/2026/01/is-future-of-mysql-postgresql-or.html

“You Had One Job”: Why Twenty Years of DevOps Has Failed to Do it

I think the entire DevOps movement was a mighty, twenty-year battle to achieve one thing: a single feedback loop connecting devs with prod. On those grounds, it failed.

https://www.honeycomb.io/blog/you-had-one-job-why-twenty-years-of-devops-has-failed-to-do-it

OpenTelemetry Collector vs agent: How to choose the right telemetry approach

https://www.cncf.io/blog/2026/02/02/opentelemetry-collector-vs-agent-how-to-choose-the-right-telemetry-approach

Unconventional PostgreSQL Optimizations

Creative ideas for speeding up queries in PostgreSQL

https://hakibenita.com/postgresql-unconventional-optimizations

Scaling Terraform Across Many Teams: A Native Framework for Platform Engineering

This write-up presents a pure Terraform framework where 50+ teams deploy infrastructure using simple tfvars while platform teams maintain reusable building blocks. It highlights native lookup patterns, automated PR updates, and significant boilerplate reduction without adding preprocessing layers.

https://dev.to/jverhoeks/-scaling-terraform-across-many-teams-a-native-framework-for-platform-engineering-3n0b
