DevOps&SRE Library

Monitoring & Observability: Using Logs, Metrics, Traces, and Alerts to Understand System Failures

When your application ships to production, it becomes partly opaque. You own the code, but the runtime, network, and platform behaviors often fall outside your direct line of sight. That’s where monitoring and observability come in.

Monitoring warns you when predefined thresholds break. Observability lets you explore unknowns, asking new questions in real time and getting meaningful answers without redeploying.

For engineers running software in production, observability rests on three pillars: logs, metrics, and traces. Each offers a different lens into system behavior. Understanding where each excels and where it doesn’t is essential for building a practical, scalable visibility strategy.
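The distinction between the three pillars, and between them and a threshold alert, can be sketched in a few lines of standard-library Python. This is an illustrative toy, not code from the linked article: the names `handle_request`, `span`, `log`, and `error_rate_alert`, the metric names, and the 0.25 threshold are all assumptions made for the example.

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager

metrics = Counter()  # metrics: aggregated counts, cheap to store and alert on
spans = []           # traces: timed units of work tied to one request
log_lines = []       # logs: discrete, detailed events


def log(event, **fields):
    """Record one structured (JSON) log line."""
    log_lines.append(json.dumps({"event": event, **fields}))


@contextmanager
def span(name, trace_id):
    """Time a block of work and record it under the request's trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request(fail=False):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    with span("handle_request", trace_id):
        if fail:
            metrics["errors_total"] += 1
            log("request_failed", trace_id=trace_id, reason="simulated failure")


def error_rate_alert(threshold=0.25):
    """Monitoring-style check: fire when the error ratio crosses a fixed threshold."""
    total = metrics["requests_total"]
    return total > 0 and metrics["errors_total"] / total > threshold


# Simulate four requests, the last two failing.
for i in range(4):
    handle_request(fail=(i >= 2))
```

The alert only says that the error ratio is too high. The shared `trace_id` is what lets you move from that signal to the specific log lines and spans behind it.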

https://blog.railway.com/p/using-logs-metrics-traces-and-alerts-to-understand-system-failures

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

Every Infrastructure as Code tutorial starts the same way: provision a single S3 bucket, create one EC2 instance, deploy a basic load balancer. The examples are clean, simple, and elegant. You follow along, everything works, and you feel like you understand Terraform.

Then you get to your actual production environment, and everything changes.

You’re not starting from scratch with a blank AWS account. You’ve got existing resources that were manually created two years ago by someone who left the company. There’s brownfield infrastructure everywhere with no clear documentation. You need to import existing state, figure out what’s actually running, and somehow wrangle it all into code without breaking production. On top of that, you need to manage 200 instances across dev, staging, and production environments. Multiple AWS accounts with different configurations and permissions. Three regions for disaster recovery. Azure for the legacy workloads that nobody wants to touch. GCP running your GKE clusters for the containerized applications.

Suddenly that elegant tutorial code becomes a nightmare of orchestration, state management, environment-specific configurations, and brownfield complexity. You’re not just writing infrastructure code anymore. You’re trying to organize, orchestrate, and maintain it at scale while dealing with the reality that infrastructure is messy, evolving, and full of historical baggage.

This is the scale gap, and it’s where the KISS vs DRY debate stops being theoretical and starts costing real time, money, and engineering effort.

https://rosesecurity.dev/2025/11/14/kiss-versus-dry-iac.html

pg_textsearch

PostgreSQL extension for BM25 relevance-ranked full-text search. Postgres OSS licensed.
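For readers unfamiliar with BM25, the following is a minimal sketch of the textbook Okapi BM25 scoring formula in plain Python. It is not the extension's implementation and says nothing about its SQL interface; the whitespace tokenizer, the function name `bm25_scores`, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document in `docs` against `query` with Okapi BM25.

    k1 controls term-frequency saturation; b controls length normalization.
    Both defaults are common choices, not values taken from pg_textsearch.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "postgres full text search with ranking",
    "docker container management dashboard",
    "backup postgres databases to s3",
]
scores = bm25_scores("postgres search", docs)
```

The first document matches both query terms and ranks highest, the third matches only the more common term "postgres", and the second matches nothing and scores zero.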

https://github.com/timescale/pg_textsearch

pgedge-postgres-mcp

The pgEdge Postgres Model Context Protocol (MCP) server enables SQL queries against PostgreSQL databases through MCP-compatible clients like Claude Desktop. The Natural Language Agent provides supporting functionality that allows you to use natural language to form SQL queries.

https://github.com/pgEdge/pgedge-postgres-mcp

arcane

Modern Docker Management, Designed for Everyone

https://github.com/getarcaneapp/arcane

ente

Ente is a service that provides a fully open source, end-to-end encrypted platform for you to store your data in the cloud without needing to trust the service provider. On top of this platform, we have built two apps so far: Ente Photos (an alternative to Apple and Google Photos) and Ente Auth (a 2FA alternative to the deprecated Authy).

https://github.com/ente-io/ente

databasus

Databasus is a free, open source, self-hosted tool for backing up databases. It can store backups in different storage backends (S3, Google Drive, FTP, etc.) and send progress notifications (Slack, Discord, Telegram, etc.).

https://github.com/databasus/databasus

notifuse

The open-source alternative to Mailchimp, Brevo, Mailjet, Listmonk, Mailerlite, Klaviyo, Loop.so, etc.

Notifuse is a modern, self-hosted emailing platform that allows you to send newsletters and transactional emails at a fraction of the cost. Built with Go and React, it provides enterprise-grade features with the flexibility of open-source software.

https://github.com/Notifuse/notifuse

Pulse

Pulse is a modern, unified dashboard for monitoring your infrastructure across Proxmox, Docker, and Kubernetes. It consolidates metrics, alerts, and AI-powered insights from all your systems into a single, beautiful interface.

https://github.com/rcourtman/Pulse

How we deploy the largest GitLab instance 12 times daily

Take a deep technical dive into GitLab.com's deployment pipeline, including progressive rollouts, canary strategies, database migrations, and multi-version compatibility.

https://about.gitlab.com/blog/continuously-deploying-the-largest-gitlab-instance

It works on my cluster: a tale of two troubleshooters

https://octopus.com/blog/verifying-and-troubleshooting-kubernetes-deployments

Karpenter at Beekeeper by LumApps: Fun Stories

At the beginning of this year, we (Beekeeper by LumApps Engineering) decided to adopt Karpenter for our EKS (Kubernetes/K8s) workloads, replacing our previous node autoscaling setup that used cluster-autoscaler with a managed autoscaling group (ASG). We made this decision before the release and hype of EKS Auto Mode, which is why we chose to implement a self-managed Karpenter solution.

https://medium.com/beekeeper-technology-blog/karpenter-at-beekeeper-by-lumapps-fun-stories-7c55656f02b8
