The costs of microservices
https://robertovitillo.com/costs-of-microservices
The microservices architecture adds more moving parts to the overall system, and this doesn’t come for free. The cost of fully embracing microservices is only worth paying if it can be amortized across dozens of development teams.
Retries, Backoff and Jitter
https://www.codereliant.io/retries-backoff-jitter
In distributed systems, failures and latency issues are inevitable. Services can fail due to overloaded servers, network issues, bugs, and various other factors. As engineers building distributed systems, we need strategies to make our services robust and resilient in the face of such failures. One useful technique is using retries.
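As a rough illustration of the retry technique named in the title, here is a minimal Python sketch of retries with capped exponential backoff and "full jitter" (each wait is drawn uniformly between zero and the current backoff ceiling). It is not taken from the linked article: the names `backoff_delays` and `retry`, the default values, and the choice of which exceptions count as retryable are all assumptions made for this example.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one jittered delay per retry: rng() * min(cap, base * 2**n)."""
    for n in range(attempts):
        # The ceiling doubles on every retry but never exceeds `cap`.
        yield rng() * min(cap, base * 2 ** n)


def retry(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep,
          retry_on=(ConnectionError, TimeoutError)):
    """Call operation() up to `attempts` times, sleeping between failures.

    Only the exception types in `retry_on` (an assumption for this sketch)
    trigger a retry; anything else propagates immediately.
    """
    delays = backoff_delays(base, cap, attempts - 1)
    while True:
        try:
            return operation()
        except retry_on:
            delay = next(delays, None)
            if delay is None:  # retry budget exhausted: surface the last error
                raise
            sleep(delay)
```

A call such as `retry(lambda: fetch(url))`, where `fetch` stands for any hypothetical function that may raise `ConnectionError`, would make at most five attempts. The randomized delay spreads out clients that failed at the same moment, so they do not all retry in lockstep against a service that is already struggling.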
Prometheus and centralized storage: When you need it, how it works, and what Mimir is
https://blog.palark.com/prometheus-centralized-storage-mimir
A guide to post-mortem meetings and how we run them at incident.io
https://incident.io/hubs/post-mortem/a-guide-to-post-mortem-meetings
A Comprehensive Guide to Testing in Terraform: Keep your tests, validations, checks, and policies in order
https://mattias.engineer/posts/terraform-testing-and-validation
This post discusses testing and validation for infrastructure-as-code (IaC) with HashiCorp Terraform. The insights and ideas presented here can surely be extended to IaC in general.
Elevating CloudWatch Logs: Smart Alerts with Chatbot, SNS, and Lambda
https://medium.com/@louis-fiori/cloudwatch-logs-enhanced-alerts-a50ea08d0845
From AI to sustainability, why our latest data centers use 400G networking
https://dropbox.tech/infrastructure/from-ai-to-sustainability-why-our-latest-data-centers-use-400g-networking
To meet the bandwidth requirements of new and future AI workloads, and stay committed to our sustainability goals, the Dropbox networking team recently designed and launched our first data center architecture using highly efficient, cutting-edge 400 gigabit per second (400G) Ethernet technology.
gitness
https://github.com/harness/gitness
Gitness is an open source development platform packed with the power of code hosting and automated DevOps pipelines.
checkov
https://github.com/bridgecrewio/checkov
Checkov is a static code analysis tool for infrastructure as code (IaC) and also a software composition analysis (SCA) tool for images and open source packages.
It scans cloud infrastructure provisioned using Terraform, Terraform plan, CloudFormation, AWS SAM, Kubernetes, Helm charts, Kustomize, Dockerfile, Serverless, Bicep, OpenAPI or ARM Templates and detects security and compliance misconfigurations using graph-based scanning.
gh-copilot
https://github.com/github/gh-copilot
GitHub Copilot in the CLI is an extension for GitHub CLI which provides a chat-like interface in the terminal that allows you to ask questions about the command line. You can ask Copilot in the CLI to suggest a command for your use case with gh copilot suggest, or to explain a command you're curious about with gh copilot explain.
Handling a Regional Outage: Comparing the Response From AWS, Azure and GCP
https://blog.pragmaticengineer.com/aws-azure-and-gcp-regional-outages
Architecture Patterns: The Circuit-Breaker
https://lab.scub.net/architecture-patterns-the-circuit-breaker-8f79280771f1
How to be on-call
https://hart-michael.medium.com/how-to-be-on-call-034e3a202729
I have been on-call for most of my career and led teams with on-call rotations, and have a lot of experience with the negative impact of on-call on my personal life and the lives of my colleagues. I’ve missed Christmas dinner (years later my Mom still brings it up), worked through weekends and nights, missed many kids’ events, and once juggled a fussy baby and an incident call at the same time. My goal is to make being on-call as sane as possible, balancing what the business needs with our collective personal lives.
Use of HTTPS Resource Records
https://www.netmeister.org/blog/https-rrs.html
Good news, everybody -- we have new DNS resource records! Well, not new new, but, you know, newish. You've probably heard of them, or even seen them actively in use, even though they moved from internet draft to formal RFC9460 adoption literally while I was working on this blog post during the last few weeks: the SVCB and HTTPS resource records.
teks
https://github.com/particuleio/teks
tEKS is a set of Terraform / Terragrunt modules designed to get you everything you need to run a production EKS cluster on AWS. It ships with sensible defaults and adds a lot of common addons with configurations that work out of the box.
Where Did All The Terraform Testing Go?
https://landadevopsjob.com/blog/where-did-all-the-terraform-testing-go
Terraform documentation made easy with terraform-docs
https://medium.com/@akhilesh-mishra/terraform-documentation-made-easy-with-terraform-docs-096014b00ecf
A complete guide to Terraform documentation with terraform-docs