DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN registration: https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Retries, Backoff and Jitter

In distributed systems, failures and latency issues are inevitable. Services can fail due to overloaded servers, network issues, bugs, and various other factors. As engineers building distributed systems, we need strategies to make our services robust and resilient in the face of such failures. One useful technique is using retries.
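
As a rough illustration of the technique named in the title, here is a minimal Python sketch of retries with capped exponential backoff and "full" jitter (each wait is drawn uniformly between zero and the current backoff ceiling). It is not taken from the linked post; the names backoff_delays and call_with_retries, and the base, cap, and attempts defaults, are hypothetical choices made for this example.

```python
import random
import time


def backoff_delays(base=0.1, cap=5.0, attempts=5, rng=random.random):
    """Yield one wait time per retry: capped exponential backoff with full jitter.

    All parameter names and defaults are assumptions made for this sketch.
    """
    for attempt in range(attempts):
        # The ceiling doubles on every retry but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random point between 0 and the ceiling so that
        # many clients retrying at once do not all wake up together.
        yield rng() * ceiling


def call_with_retries(operation, attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on an exception, wait and retry up to `attempts` times."""
    delays = backoff_delays(base=base, cap=cap, attempts=attempts)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                # Retry budget exhausted: re-raise the last failure.
                raise
            sleep(delay)
```

In real use the except clause would normally be narrowed to errors known to be transient (timeouts, connection resets), and retries should only wrap operations that are safe to repeat.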


https://www.codereliant.io/retries-backoff-jitter
Prometheus and centralized storage: When you need it, how it works, and what Mimir is

https://blog.palark.com/prometheus-centralized-storage-mimir
A guide to post-mortem meetings and how we run them at incident.io

https://incident.io/hubs/post-mortem/a-guide-to-post-mortem-meetings
A Comprehensive Guide to Testing in Terraform: Keep your tests, validations, checks, and policies in order

This post discusses testing and validation for infrastructure-as-code (IaC) with HashiCorp Terraform. The insights and ideas presented here can surely be extended to IaC in general.


https://mattias.engineer/posts/terraform-testing-and-validation
Elevating CloudWatch Logs: Smart Alerts with Chatbot, SNS, and Lambda

https://medium.com/@louis-fiori/cloudwatch-logs-enhanced-alerts-a50ea08d0845
From AI to sustainability, why our latest data centers use 400G networking

To meet the bandwidth requirements of new and future AI workloads—and stay committed to our sustainability goals—the Dropbox networking team recently designed and launched our first data center architecture using highly efficient, cutting edge 400 gigabit per second (400G) ethernet technology.


https://dropbox.tech/infrastructure/from-ai-to-sustainability-why-our-latest-data-centers-use-400g-networking
gitness

Gitness is an open source development platform packed with the power of code hosting and automated DevOps pipelines.


https://github.com/harness/gitness
checkov

Checkov is a static code analysis tool for infrastructure as code (IaC) and also a software composition analysis (SCA) tool for images and open source packages.

It scans cloud infrastructure provisioned using Terraform, Terraform plan, CloudFormation, AWS SAM, Kubernetes, Helm charts, Kustomize, Dockerfile, Serverless, Bicep, OpenAPI or ARM Templates and detects security and compliance misconfigurations using graph-based scanning.


https://github.com/bridgecrewio/checkov
gh-copilot

GitHub Copilot in the CLI is an extension for GitHub CLI which provides a chat-like interface in the terminal that allows you to ask questions about the command line. You can ask Copilot in the CLI to suggest a command for your use case with gh copilot suggest, or to explain a command you're curious about with gh copilot explain.


https://github.com/github/gh-copilot
Load Shedding for High Traffic Systems

https://www.codereliant.io/load-shedding
Handling a Regional Outage: Comparing the Response From AWS, Azure and GCP

https://blog.pragmaticengineer.com/aws-azure-and-gcp-regional-outages
How to be on-call

I have been on-call for most of my career and led teams with on-call rotations, and have a lot of experience with the negative impact of on-call on my personal life and the lives of my colleagues. I’ve missed Christmas dinner (years later my Mom still brings it up), worked through weekends and nights, missed many kids’ events, and once juggled a fussy baby and an incident call at the same time. My goal is to make being on-call as sane as possible, balancing what the business needs with our collective personal lives.


https://hart-michael.medium.com/how-to-be-on-call-034e3a202729
Use of HTTPS Resource Records

Good news, everybody -- we have new DNS resource records! Well, not new new, but, you know, newish. You've probably heard of them, or even seen them actively in use, even though they moved from internet draft to formal RFC 9460 adoption literally while I was working on this blog post during the last few weeks: the SVCB and HTTPS resource records.


https://www.netmeister.org/blog/https-rrs.html
teks

tEKS is a set of Terraform / Terragrunt modules designed to get you everything you need to run a production EKS cluster on AWS. It ships with sensible defaults and adds a lot of common addons with configurations that work out of the box.


https://github.com/particuleio/teks
Terraform documentation made easy with terraform-docs

A complete guide to Terraform documentation with terraform-docs


https://medium.com/@akhilesh-mishra/terraform-documentation-made-easy-with-terraform-docs-096014b00ecf
tfprovidercheck

CLI to prevent malicious Terraform Providers from being executed. You can define an allow list of Terraform Providers and their versions, and check that disallowed providers aren't used.


https://github.com/suzuki-shunsuke/tfprovidercheck
terraform-local

Terraform CLI wrapper to deploy your Terraform applications directly to LocalStack


https://github.com/localstack/terraform-local