DevOps&SRE Library

EKS Auto Mode: Simplify Kubernetes with Terraform Setup

Instead of managing node groups, installing Karpenter, configuring the VPC CNI plugin, deploying the AWS Load Balancer Controller, setting up the EBS CSI driver, and keeping all of those components updated and compatible with each other, you enable a single flag and AWS handles all of it.

https://darryl-ruggles.cloud/a-complete-terraform-setup-for-eks-auto-mode-is-it-right-for-you

Terraform Tips from the IaC Trenches

After a few years of writing open-source Terraform modules, I've picked up a few syntax tricks that make code safer, cleaner, and easier to maintain.

https://rosesecurity.dev/2025/12/04/terraform-tips-and-tricks.html

How We Manage Domain and DNS Management with Infrastructure as Code

After successfully adopting Terraform for GitHub repository management, the next step in our Infrastructure as Code (IaC) journey was clear: dogfood our own product and manage our domains and DNS zones using the DNSimple Terraform provider.

https://blog.dnsimple.com/2025/11/managing-domains-terraform-dnsimple

DriftHound

DriftHound is a Rails web app that receives Terraform drift reports via API and provides visibility into infrastructure drift across your projects.

https://github.com/drifthoundhq/drifthound

otel-front

A lightweight, single-binary OpenTelemetry viewer for local development. Visualize traces, logs, and metrics from your instrumented applications — no Docker, no databases, no complex setup.

https://github.com/mesaglio/otel-front

scion

Run multiple agents in parallel — each in its own container, with its own workspace, collaborating on your code or project files simultaneously.

https://github.com/GoogleCloudPlatform/scion

atomic

A personal knowledge base that turns markdown notes into a semantically-connected, AI-augmented knowledge graph.

Atomic stores knowledge as atoms — markdown notes that are automatically chunked, embedded, tagged, and linked by semantic similarity. Your atoms can be synthesized into wiki articles, explored on a spatial canvas, and queried through an agentic chat interface.

https://github.com/kenforthewin/atomic

Keeping a Postgres queue healthy

https://planetscale.com/blog/keeping-a-postgres-queue-healthy

Inside a Self-Hosted AI Coding Assistant: Architecture, Kubernetes Deployment, and llama.cpp Parallelism

More and more enterprises want the benefits of AI-assisted coding (automatic completions, suggestions, and inline generation) without sending their source code to external APIs.

This has naturally increased interest in self-hosted coding assistants, where all inference runs on internal hardware and all models stay inside a controlled environment.

We built a complete prototype of such a system. In this article, we walk through its architecture, explain how Kubernetes is used to deploy it, and show how different system parameters interact to determine real-world performance. In a separate post, we study how the llama.cpp inference server behaves under increasing load.

https://medium.com/@ferraricorneloup.teo/inside-a-self-hosted-ai-coding-assistant-architecture-kubernetes-deployment-and-llama-cpp-158330a12441

How My Client Hit Linux Kernel Network Limits on AWS EKS

This is a story about a tricky issue I resolved recently.

https://dev.to/datton94/how-my-client-hit-linux-kernel-network-limits-on-aws-eks-3am5

Startup CPU Boost in Kubernetes with In-Place Pod Resize

This article explains how to use the In-Place Pod Resize feature in Kubernetes, combined with Kube Startup CPU Boost, to speed up Java application startup.

https://piotrminkowski.com/2025/12/22/startup-cpu-boost-in-kubernetes-with-in-place-pod-resize/

dynamo

The open-source, datacenter-scale inference stack. Dynamo is the orchestration layer above inference engines — it doesn't replace SGLang, TensorRT-LLM, or vLLM, it turns them into a coordinated multi-node inference system. Disaggregated serving, intelligent routing, multi-tier KV caching, and automatic scaling work together to maximize throughput and minimize latency for LLM, reasoning, multimodal, and video generation workloads.

https://github.com/ai-dynamo/dynamo

helm-exporter

Exports Helm release, chart, and version statistics in the Prometheus format.

https://github.com/sstarcher/helm-exporter

Inside Adobe's OpenTelemetry pipeline: simplicity at scale

As part of an ongoing series, the Developer Experience SIG interviews organizations about their real-world OpenTelemetry Collector deployments to share practical lessons with the broader community. This post features Adobe, a global software company whose observability team has built an OpenTelemetry-based telemetry pipeline designed for simplicity at massive scale, with thousands of collectors running per signal type across the company’s infrastructure.

https://opentelemetry.io/blog/2026/devex-adobe

traceway

Traceway is a self-hosted observability platform that ingests OpenTelemetry traces and metrics, groups exceptions automatically, and gives you endpoint performance, distributed tracing, and alerts — all in a single binary. No OTel Collector or separate time-series database required.

https://github.com/tracewayapp/traceway

lynxdb

Log analytics in a single binary. No dependencies. Lynx Flow query language.

https://github.com/lynxbase/lynxdb

cardamon

Cardamon is a metric auditor for Prometheus. It identifies metrics that exist in your TSDB but are never actually queried by dashboards, alerting rules, recording rules, or any other consumer. You can then generate Prometheus drop rules to remove them and reduce storage requirements.

https://github.com/dominikhei/cardamon

versitygw

Versity Gateway is a simple-to-use tool for seamless inline translation between AWS S3 object commands and other storage systems. It bridges the gap between S3-reliant applications and those storage systems, enabling enhanced compatibility and integration while offering exceptional scalability.

https://github.com/versity/versitygw

Terragrunt 1.0 Released!

After nearly a decade of development, over 900 releases, and tens of millions of infrastructure deployments by platform teams, today we're happy to announce that Terragrunt 1.0 is officially here.

https://www.gruntwork.io/blog/terragrunt-1-0-released

Little Snitch for Linux

Every time an application on your computer opens a network connection, it does so quietly, without asking. Little Snitch for Linux makes that activity visible and gives you the option to do something about it. You can see exactly which applications are talking to which servers, block the ones you didn't invite, and keep an eye on traffic history and data volumes over time.

https://obdev.at/products/littlesnitch-linux/index.html
