Lessons Learned from Twenty Years of Site Reliability Engineering
Or, Eleven things we have learned as Site Reliability Engineers at Google
https://sre.google/resources/practices-and-processes/twenty-years-of-sre-lessons-learned
1. The riskiness of a mitigation should scale with the severity of the outage
2. Recovery mechanisms should be fully tested before an emergency
3. Canary all changes
4. Have a "Big Red Button"
5. Unit tests alone are not enough - integration testing is also needed
6. COMMUNICATION CHANNELS! AND BACKUP CHANNELS!! AND BACKUPS FOR THOSE BACKUP CHANNELS!!!
7. Intentionally degrade performance modes
8. Test for Disaster resilience
9. Automate your mitigations
10. Reduce the time between rollouts, to decrease the likelihood of the rollout going wrong
11. A single global hardware version is a single point of failure
How DoorDash Migrated from StatsD to Prometheus
https://doordash.engineering/2023/08/01/how-doordash-migrated-from-statsd-to-prometheus
How to use Terraform test
Terraform v1.6.0 introduces a test framework named “Terraform test”. Here’s how to use it.
https://blog.captaincy.io/how-to-use-terraform-test
Terraform project structure with reusable modules
https://erudinsky.com/2023/10/20/structuring-terraform-projects
cluster.dev
Cluster.dev is an open-source tool designed to manage cloud native infrastructures with simple declarative manifests called infrastructure templates. The templates can be based on Terraform modules, Kubernetes manifests, Shell scripts, Helm charts, Kustomize and ArgoCD/Flux applications, OPA policies, etc. Cluster.dev sticks those components together so that you can deploy, test and distribute a whole set of components with pinned versions.
https://github.com/shalb/cluster.dev
Prometheus and its storage: Architecture, challenges, and solutions
This two-article series is about monitoring. Part One covers accumulating a multitude of different metrics in a single place, handling permissions for different aspects of those metrics, and storing large amounts of data. Part Two focuses on choosing monitoring systems, using the brief example of a fictional company’s “journey” as it struggles to keep expanding its monitoring system while its infrastructure grows.
https://blog.palark.com/prometheus-architecture-tsdb
What is a Memory Leak?
Memory leaks are a common and frustrating problem in software development. These issues arise when a program fails to free up memory that is no longer being used, leading to a gradual loss of available memory over time.
https://www.codereliant.io/what-is-a-memory-leak
Rescue Struggling Pods from Scratch
https://www.honeycomb.io/blog/rescue-struggling-pods-from-scratch
Solving Metrics at scale with VictoriaMetrics
https://sarthak-acoustic.medium.com/solving-metrics-at-scale-with-victoriametrics-ac9c306826c3
A Guide to Service Discovery with Prometheus Operator — How to use Pod Monitor, Service Monitor and Scrape Config
https://medium.com/@helia.barroso/a-guide-to-service-discovery-with-prometheus-operator-how-to-use-pod-monitor-service-monitor-6a7e4e27b303
Profiling: Flame Chart vs. Flame Graph
Flame Charts and Flame Graphs clearly explained
https://medium.com/performance-engineering-for-the-ordinary-barbie/profiling-flame-chart-vs-flame-graph-7b212ddf3a83
Reduce cross-AZ traffic costs on EKS using topology aware hints
https://blog.ratnopamc.com/reduce-cross-az-traffic-costs-on-eks-using-topology-aware-hints
Advanced Secret Management on Kubernetes With Pulumi and GitOps: Sealed Secrets Controller
https://blog.ediri.io/advanced-secret-management-on-kubernetes-with-pulumi-and-gitops-sealed-secrets-controller
12 Scanners to Find Security Vulnerabilities and Misconfigurations in Kubernetes
https://towardsdev.com/12-scanners-to-find-security-vulnerabilities-and-misconfigurations-in-kubernetes-332a738d076d
Kubernetes API Server Discovery
A little excursion into the Kubernetes API server
https://medium.com/cp-massive-programming/kubernetes-api-server-discovery-ac3b358e878e
Step by Step Guide: How to create a Dynamic Service Endpoint via K8S API
This article will help bring clarity to some internal components of the K8S cluster, demonstrating how to interact with them using command line tools.
https://medium.com/lightricks-tech-blog/step-by-step-guide-how-to-create-a-dynamic-service-endpoint-via-k8s-api-1024309cb226
Ingress in Google Kubernetes Products
Here is my attempt to summarise and disambiguate terms often used in technical discussions around arranging network ingress traffic into [single] Kubernetes clusters running in Google Cloud (GKE) or on-premise (Anthos on Bare Metal, Anthos on VMware).
https://medium.com/google-cloud/ingress-in-google-kubernetes-products-f22ded21f4ed
Partial Helm values encryption using AWS KMS with ArgoCD
How to encrypt only specific YAML fields in values.yaml, and how to configure ArgoCD to decrypt these secrets before installing a chart.
https://medium.com/@samuelbagattin/partial-helm-values-encryption-using-aws-kms-with-argocd-aca1c0d36323
Create temporary environment from Pull Request with ArgoCD ApplicationSet
Deploying an app to Kubernetes and creating a new environment for each pull request.
https://medium.com/@jerome.decoster/create-temporary-environment-from-pull-request-with-argocd-applicationset-1cef9803223a
How to Use Cluster API to Programmatically Configure and Deploy Kubernetes Clusters
https://www.mirantis.com/blog/how-to-use-cluster-api