DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
qmd

An on-device search engine for everything you need to remember. Index your markdown notes, meeting transcripts, documentation, and knowledge bases. Search with keywords or natural language. Ideal for your agentic flows.


https://github.com/tobi/qmd
zedis

Zedis is a next-generation Redis GUI client designed for developers who demand speed.

Unlike Electron-based clients that can feel sluggish with large datasets, Zedis is built on GPUI (the same rendering engine powering the Zed Editor). This ensures a native, 60 FPS experience with minimal memory footprint, even when browsing millions of keys.


https://github.com/vicanso/zedis
ansible-collection-hardening

This Ansible collection provides battle-tested hardening for Linux, SSH, nginx, and MySQL.


https://github.com/dev-sec/ansible-collection-hardening
Introduction to PostgreSQL Indexes

This text is for developers who have an intuitive knowledge of what database indexes are but don’t necessarily know how they work internally, what tradeoffs they carry, which index types Postgres provides, and how to use some of its more advanced options to tune them for your use case.


https://dlt.github.io/blog/posts/introduction-to-postgresql-indexes
Quiet Influence: A Guide to Nemawashi in Engineering

Being an excellent engineer helps you advance through the ranks to become a Staff Engineer; “quiet influence” keeps you there.

I’ve learned the hard way that my architectural proposals didn’t fail on technical merits (mostly 😅); they failed because of the social strategy (or lack thereof) I had employed behind them. I’d have a vision on how things were going to be improved, but struggled to recruit others to get behind the idea.

After a few painful misses, I started building a toolkit of approaches that actually get big changes through. In this post, I’ll share a technique I use: Nemawashi.


https://hodgkins.io/blog/quiet-influence-a-guide-to-nemawashi-in-engineering
The Art of Command Line

Master the command line, in one page


https://github.com/jlevy/the-art-of-command-line
The only Terraform pipeline you will ever need: GitHub Actions for Multi-Environment Deployments

https://medium.com/zencore/the-only-terraform-pipeline-you-will-ever-need-github-actions-for-multi-environment-deployments-a2cb25d72473
Managing Terraform at Scale: A Deep Dive into Terragrunt Configuration Hierarchy

How I manage 100+ infrastructure components across multiple products, environments, and regions without configuration duplication


https://devineer.medium.com/managing-terraform-at-scale-a-deep-dive-into-terragrunt-configuration-hierarchy-54f1f16e7c1f
Futureproofing Tines: Partitioning a 17TB table in PostgreSQL

At Tines, we recently faced a significant engineering challenge: our output_payloads table in PostgreSQL was rapidly approaching 17TB on our largest cloud cluster, with no signs of slowing down. Once a table reaches PostgreSQL’s 32TB table size limit, it will stop accepting writes. This table holds event data, in the form of arbitrary JSON, which is critical to powering Tines workflows. Given the criticality of the data, we couldn’t risk any disruptions to it.
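
The 32TB figure matches PostgreSQL's documented maximum relation size at the default block size. A minimal sketch of that arithmetic, assuming the limit follows from a 32-bit block number and the default 8 kB block size, and treating the post's 17TB as binary terabytes; the function names are illustrative and not from the post:

```python
# Assumptions: block numbers are 32-bit and the block size is PostgreSQL's
# default of 8192 bytes. Names here are illustrative.
DEFAULT_BLOCK_SIZE = 8192
MAX_BLOCKS = 2**32
TIB = 2**40


def max_table_bytes(block_size: int = DEFAULT_BLOCK_SIZE) -> int:
    """Approximate upper bound on the size of a single table, in bytes."""
    return MAX_BLOCKS * block_size


def headroom_fraction(current_bytes: int, block_size: int = DEFAULT_BLOCK_SIZE) -> float:
    """Fraction of the per-table limit that is still unused."""
    return 1 - current_bytes / max_table_bytes(block_size)


print(max_table_bytes() // TIB)               # 32
print(round(headroom_fraction(17 * TIB), 2))  # 0.47
```

At 17TB the table has already used a little over half of the available space, which is why partitioning (each partition being its own table with its own limit) is the natural fix.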

As our monitoring showed the table's growth, we began experiencing warning signs. Cleanup jobs on the table had begun to time out. The table was causing increased I/O pressure on our infrastructure, leading us to use more expensive hardware. The arbitrary JSON shape of the data meant massive autovacuum jobs on its TOAST table. When these autovacuums ran, they displaced other tables from the buffer cache, forcing disk reads in critical areas. As a band-aid, we modified the autovacuum parameters of the table so that the autovacuums would run more frequently but have fewer tuples to process. With performance slowly degrading, and 32TB looming on the horizon, we knew we needed to act decisively.
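
The post does not say which autovacuum parameters were changed. As a rough illustration of why lowering them makes autovacuum run more often on smaller batches, here is a simplified sketch of PostgreSQL's documented trigger formula (base threshold plus scale factor times the table's tuple count). The defaults of 50 and 0.2 are PostgreSQL's; the row count and the tuned values are hypothetical:

```python
def autovacuum_trigger(live_tuples: int, threshold: int = 50,
                       scale_factor: float = 0.2) -> float:
    """Dead-tuple count above which autovacuum will vacuum the table.

    Simplified: threshold + scale_factor * tuple count, using the defaults
    for autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor.
    """
    return threshold + scale_factor * live_tuples


rows = 1_000_000_000  # hypothetical row count, not a figure from the post

default_trigger = autovacuum_trigger(rows)
# Hypothetical per-table override, not the values Tines used.
tuned_trigger = autovacuum_trigger(rows, threshold=10_000, scale_factor=0.01)

print(f"default: vacuum after ~{default_trigger:,.0f} dead tuples")
print(f"tuned:   vacuum after ~{tuned_trigger:,.0f} dead tuples")
```

With the default scale factor, a very large table accumulates a huge number of dead tuples before each run; a smaller scale factor trades that for more frequent, lighter passes.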


https://www.tines.com/blog/futureproofing-tines-partitioning-a-17tb-table-in-postgresql
Chasing Boring at Just the Right Speed

https://log.andvari.net/no-mttr.html
fluid.sh

AI agents are ready to do infrastructure work, but they can't touch prod:

- Agents can install packages, configure services, write scripts—autonomously
- But one mistake on production and you're getting paged at 3 AM to fix it
- So we limit agents to chatbots instead of letting them do the work


https://github.com/aspectrr/fluid.sh
graft

Graft is a CLI tool that brings the Overlay Pattern (similar to Kustomize) to Terraform. It acts as a JIT (Just-In-Time) Compiler, allowing you to apply declarative patches to third-party modules at build time.

With Graft, you can treat upstream modules (e.g., from the Public Registry) as immutable base layers and inject your own logic on top—without the maintenance nightmare of forking.
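
To make the overlay idea concrete, here is a minimal Python sketch of layering a patch over an untouched base. It only illustrates the general pattern; it is not Graft's implementation or patch syntax, and the settings and the `apply_overlay` helper are hypothetical:

```python
from copy import deepcopy


def apply_overlay(base: dict, patch: dict) -> dict:
    """Return a new dict with patch merged over base; base is left untouched."""
    result = deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            # Both sides are mappings: merge recursively instead of replacing.
            result[key] = apply_overlay(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


# Hypothetical upstream module settings, treated as an immutable base layer.
upstream = {"bucket": {"versioning": False, "tags": {"team": "platform"}}}
# Hypothetical local patch applied on top at build time.
patch = {"bucket": {"versioning": True, "tags": {"env": "prod"}}}

merged = apply_overlay(upstream, patch)
print(merged)
```

Because the base is never modified, upstream updates can be pulled in and the same patch reapplied, which is the property that avoids maintaining a fork.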


https://github.com/ms-henglu/graft
Owning a $5M data center

These days it seems you need a trillion fake dollars, or lunch with politicians to get your own data center. They may help, but they’re not required. At comma we’ve been running our own data center for years. All of our model training, metrics, and data live in our own data center in our own office. Having your own data center is cool, and in this blog post I will describe how ours works, so you can be inspired to have your own data center too.


https://blog.comma.ai/datacenter
whosthere

Local Area Network discovery tool with a modern Terminal User Interface (TUI) written in Go. Discover, explore, and understand your LAN in an intuitive way.

Whosthere performs unprivileged, concurrent scans using mDNS and SSDP scanners. Additionally, it sweeps the local subnet by attempting TCP/UDP connections to trigger ARP resolution, then reads the ARP cache to identify devices on your Local Area Network. This technique populates the ARP cache without requiring elevated privileges. All discovered devices are enhanced with OUI lookups to display manufacturers when available.

Whosthere provides a friendly, intuitive way to answer the question every network administrator asks: "Who's there on my network?"
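
The ARP-cache technique described above can be sketched in a few lines. This is a simplified, Linux-only illustration in Python, not whosthere's Go code: it assumes the kernel exposes its ARP table at /proc/net/arp, and the helper names and sample addresses are hypothetical.

```python
import socket
from pathlib import Path


def poke(host: str, port: int = 80, timeout: float = 0.2) -> None:
    """Attempt a TCP connection. The outcome is irrelevant: the attempt alone
    makes the kernel resolve the host's MAC address via ARP."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        pass  # refused or timed out; ARP resolution may still have happened


def parse_arp_table(text: str) -> dict[str, str]:
    """Map IP -> MAC for resolved entries in /proc/net/arp-style text."""
    devices = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue
        ip, _hw_type, flags, mac = fields[:4]
        if int(flags, 16) & 0x2:  # 0x2 marks a completed (resolved) entry
            devices[ip] = mac
    return devices


def discover(hosts: list[str]) -> dict[str, str]:
    """Poke each host, then read the ARP cache. No elevated privileges needed."""
    for host in hosts:
        poke(host)
    return parse_arp_table(Path("/proc/net/arp").read_text())


# Hypothetical sample in the /proc/net/arp layout, so the parser runs anywhere.
SAMPLE = """IP address       HW type     Flags       HW address            Mask     Device
192.168.1.1      0x1         0x2         aa:bb:cc:dd:ee:ff     *        eth0
192.168.1.7      0x1         0x0         00:00:00:00:00:00     *        eth0
"""
found = parse_arp_table(SAMPLE)
print(found)  # {'192.168.1.1': 'aa:bb:cc:dd:ee:ff'}
```

This sketch pokes hosts one at a time; a real sweep would run the connection attempts concurrently, as the description says whosthere does.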


https://github.com/ramonvermeulen/whosthere
zerobrew

zerobrew applies uv's model to Mac packages. Packages live in a content-addressable store (by sha256), so reinstalls are instant. Downloads, extraction, and linking run in parallel with aggressive HTTP caching. It pulls from Homebrew's CDN, so you can swap brew for zb with your existing commands.

This leads to dramatic speedups, up to 5x cold and 20x warm.
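
A content-addressable store is what makes repeat installs cheap: a blob's name is the hash of its bytes, so identical content is only ever written once. A minimal Python sketch of the idea, assuming a flat directory keyed by sha256; the `ContentStore` class and the payload are hypothetical, and this is not zerobrew's code:

```python
import hashlib
import tempfile
from pathlib import Path


class ContentStore:
    """Stores blobs under the hex sha256 of their contents."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)
        self.writes = 0  # counts how often data actually hit the disk

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        path = self.root / digest
        if not path.exists():  # already stored: nothing to do
            path.write_bytes(data)
            self.writes += 1
        return digest

    def get(self, digest: str) -> bytes:
        return (self.root / digest).read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    store = ContentStore(Path(tmp) / "store")
    payload = b"hypothetical package archive"
    first = store.put(payload)
    second = store.put(payload)  # "reinstall": same digest, no second write
    restored = store.get(first)
    print(first == second, store.writes)  # True 1
```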


https://github.com/lucasgelfond/zerobrew
Hi! My good friend is looking for a colleague to join their team.

You can check the details of the position and apply here: https://jobs.ashbyhq.com/perplexity/7bce0fcf-eef6-41aa-9243-896f07a0316e

If you have additional questions about the position, you can send them to alena@perplexity.ai.
prek

pre-commit is a framework to run hooks written in many languages, and it manages the language toolchain and dependencies for running the hooks. prek is a reimplementation of pre-commit in Rust, intended as a faster drop-in alternative.


https://github.com/j178/prek
The future of software engineering is SRE

When code gets cheap, operational excellence wins. Anyone can build a greenfield demo, but it takes engineering to run a service.


https://swizec.com/blog/the-future-of-software-engineering-is-sre
10 Elasticsearch Production Issues (and How Postgres Avoids Them)

Elasticsearch may work great in initial testing and development, but production is a different story. This blog is about what happens after you ship: the JVM tuning, the shard math, the 3 AM pages, the sync pipelines that break silently. The stuff your ops team lives with.

After years of teams running Elasticsearch in production, certain patterns keep emerging. The same issues show up in blog posts, Stack Overflow questions, and incident reports. We've compiled ten of the most common ones below, with references to the engineers who've documented them. We’ve also added images to make it easy to quickly skim through it and compare the challenges against Postgres.

TLDR: With great power comes great operational complexity.


https://www.tigerdata.com/blog/10-elasticsearch-production-issues-how-postgres-avoids-them