DevOps&SRE Library

tailspin

A log file highlighter

The costs of microservices

The microservices architecture adds more moving parts to the overall system, and this doesn’t come for free. The cost of fully embracing microservices is only worth paying if it can be amortized across dozens of development teams.

https://robertovitillo.com/costs-of-microservices

Retries, Backoff and Jitter

In distributed systems, failures and latency issues are inevitable. Services can fail due to overloaded servers, network issues, bugs, and various other factors. As engineers building distributed systems, we need strategies to make our services robust and resilient in the face of such failures. One useful technique is using retries.

https://www.codereliant.io/retries-backoff-jitter
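As a rough illustration of the retry technique named in this post, here is a minimal sketch of capped exponential backoff with full jitter. It is not taken from the linked article: the function name `retry` and the parameters `attempts`, `base`, `cap`, and `sleep` are hypothetical and chosen only for this example.

```python
import random
import time


def retry(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or `attempts` calls have failed.

    Hypothetical helper for illustration only. The wait before each retry is
    drawn uniformly from [0, min(cap, base * 2**attempt)], so concurrent
    clients spread their retries out instead of retrying in lockstep.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == attempts - 1:
                raise
            ceiling = min(cap, base * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

A real service would usually catch only errors known to be transient (timeouts, connection resets) rather than every `Exception`, and would retry only operations that are safe to repeat.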

Prometheus and centralized storage: When you need it, how it works, and what Mimir is

https://blog.palark.com/prometheus-centralized-storage-mimir

A guide to post-mortem meetings and how we run them at incident.io

https://incident.io/hubs/post-mortem/a-guide-to-post-mortem-meetings

A Comprehensive Guide to Testing in Terraform: Keep your tests, validations, checks, and policies in order

This post discusses testing and validation for infrastructure-as-code (IaC) with HashiCorp Terraform. The insights and ideas presented here can surely be extended to IaC in general.

https://mattias.engineer/posts/terraform-testing-and-validation

Elevating CloudWatch Logs: Smart Alerts with Chatbot, SNS, and Lambda

https://medium.com/@louis-fiori/cloudwatch-logs-enhanced-alerts-a50ea08d0845

From AI to sustainability, why our latest data centers use 400G networking

To meet the bandwidth requirements of new and future AI workloads, and to stay committed to our sustainability goals, the Dropbox networking team recently designed and launched our first data center architecture using highly efficient, cutting-edge 400 gigabit per second (400G) Ethernet technology.

https://dropbox.tech/infrastructure/from-ai-to-sustainability-why-our-latest-data-centers-use-400g-networking

gitness

Gitness is an open source development platform packed with the power of code hosting and automated DevOps pipelines.

https://github.com/harness/gitness

checkov

Checkov is a static code analysis tool for infrastructure as code (IaC) and also a software composition analysis (SCA) tool for images and open source packages.

It scans cloud infrastructure provisioned using Terraform, Terraform plan, CloudFormation, AWS SAM, Kubernetes, Helm charts, Kustomize, Dockerfile, Serverless, Bicep, OpenAPI or ARM Templates and detects security and compliance misconfigurations using graph-based scanning.

https://github.com/bridgecrewio/checkov

gh-copilot

GitHub Copilot in the CLI is an extension for GitHub CLI which provides a chat-like interface in the terminal that allows you to ask questions about the command line. You can ask Copilot in the CLI to suggest a command for your use case with gh copilot suggest, or to explain a command you're curious about with gh copilot explain.

https://github.com/github/gh-copilot

Load Shedding for High Traffic Systems

https://www.codereliant.io/load-shedding

Handling a Regional Outage: Comparing the Response From AWS, Azure and GCP

https://blog.pragmaticengineer.com/aws-azure-and-gcp-regional-outages

Architecture Patterns: The Circuit-Breaker

https://lab.scub.net/architecture-patterns-the-circuit-breaker-8f79280771f1

How to be on-call

I have been on-call for most of my career and have led teams with on-call rotations, so I have a lot of experience with the negative impact of on-call on my personal life and the lives of my colleagues. I’ve missed Christmas dinner (years later my Mom still brings it up), worked through weekends and nights, missed many kids’ events, and once juggled a fussy baby and an incident call at the same time. My goal is to make being on-call as sane as possible, balancing what the business needs with our collective personal lives.

https://hart-michael.medium.com/how-to-be-on-call-034e3a202729

Use of HTTPS Resource Records

Good news, everybody -- we have new DNS resource records! Well, not new new, but, you know, newish. You've probably heard of them, or even seen them actively in use, even though they moved from internet draft to formal adoption as RFC 9460 literally while I was working on this blog post during the last few weeks: the SVCB and HTTPS resource records.

https://www.netmeister.org/blog/https-rrs.html

teks

tEKS is a set of Terraform / Terragrunt modules designed to give you everything you need to run a production EKS cluster on AWS. It ships with sensible defaults and adds many common addons with configurations that work out of the box.

https://github.com/particuleio/teks

Terraform AWS Drift Checks

https://pd.shipmonk.com/terraform-aws-drift-checks

Where Did All The Terraform Testing Go?

https://landadevopsjob.com/blog/where-did-all-the-terraform-testing-go

Terraform documentation made easy with terraform-docs