Cloud Cost Handbook
The Cloud Cost Handbook is a free, open-source, community-supported set of guides meant to help explain the often complex pricing of public cloud infrastructure and service providers in easy-to-understand terms.
https://handbook.vantage.sh
A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic
https://danielmiessler.com/study/tcpdump
4 Key Observability Metrics for Distributed Applications
https://hackernoon.com/4-key-observability-metrics-for-distributed-applications-z11337yh
Unpacking Observability: Understanding Logs, Events, Traces, and Spans
https://medium.com/dzerolabs/observability-journey-understanding-logs-events-traces-and-spans-836524d63172
ON THE EVILNESS OF FEATURE BRANCHING - A TALE OF TWO TEAMS
On the experience of working with two totally different teams: one novice team practising trunk-based development, the other very experienced and used to GitFlow.
https://thinkinglabs.io/articles/2021/07/14/on-the-evilness-of-feature-branching-a-tale-of-two-teams.html
Behind the scenes, AWS Lambda
Writing code and deploying it to AWS Lambda is as easy as baking a cake (depending on the type of cake). Lambda performs the heavy lifting for you, from provisioning to scaling. But where does the magic happen, and how does it actually work under the hood? Let's find out together!
https://www.bschaatsbergen.com/behind-the-scenes-lambda
Automatic Remediation of Kubernetes Nodes
https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes
schemahero
A Kubernetes operator for declarative database schema management (GitOps for database schemas)
https://github.com/schemahero/schemahero
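As a rough sketch of what GitOps for database schemas looks like here, a table is declared as a Kubernetes custom resource and applied like any other manifest. The database name, table, and columns below are illustrative assumptions, and a SchemaHero Database object is presumed to already exist:

kubectl apply -f - <<'EOF'
apiVersion: schemas.schemahero.io/v1alpha4
kind: Table
metadata:
  name: users
spec:
  database: my-database   # hypothetical: must match a deployed Database object
  name: users
  schema:
    postgres:
      primaryKey: [id]
      columns:
        - name: id
          type: integer
        - name: email
          type: varchar(255)
EOF

SchemaHero diffs the declared schema against the live database and plans the migration, which is what makes schema changes reviewable in git like any other manifest.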
ortelius
Ortelius simplifies the implementation of microservices. It provides a central catalog of services with their deployment specs, so application teams can easily consume and deploy services across clusters. Ortelius tracks application versions based on service updates and maps their service dependencies, eliminating confusion and guesswork.
https://github.com/ortelius/ortelius
Common Kubernetes Errors Made by Beginners [2021]
https://medium.com/nerd-for-tech/common-kubernetes-errors-made-by-beginners-274b50e18a01
Thoughts on HTTP instrumentation with OpenTelemetry
https://neskazu.medium.com/thoughts-on-http-instrumentation-with-opentelemetry-9fc22fa35bc7
Unpacking Observability
https://adri-v.medium.com/unpacking-observability-a-beginners-guide-833258a0591f
My Dev Lessons From 2020
Kubernetes is to Borg what Frankenstein is to the Dalai Lama.
http://www.gophersre.com/2021/02/21/my-dev-lessons-from-2020
When I left Google, I was sold on the whole containerized way of running things. Borg is lightyears ahead of every other cluster orchestration project.
Borg doesn't let you do everything. It is designed to run specifically built applications that are containerized. You don't get Docker images with whatever OS stuff you feel like running that day. The OS is always Google's internal OS. You don't get access to whatever binaries you want to install. You don't get to use whatever security you want. Your RPC system is always going to be Stubby (gRPC internal to Google). Your cluster file system is going to be the only one allowed. Period.
Those limits are freeing. You simply need to have resources to run your jobs and deploy them. Your binaries are packaged up and you just need to say what is going to get run.
So naturally, I used Kubernetes after I left.
Everything about Borg I liked is gone in Kubernetes. It is trying to solve everyone's problem and solves no one's problem.
It is easy to kill your jobs. It's hard to do things like update a single instance. Service meshes???? Really????
Helm? Great, I can kill all my cluster MySQL databases at the flick of my helm config.
Security, what security? Oh, right, the bring-your-own model that is just crazy hard.
Need it to work with special cloud sidecars (like special identity services)? Well, that's going to be a fun thing.
Upgrades that change the config language so that your jobs won't run anymore. Perfect.....
And btw, love YAML over the Borg config language, NOT!
Linux Performance Checklists for SREs
Linux Perf Analysis in 60s (https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55)
1. uptime ⟶ load averages
2. dmesg -T | tail ⟶ kernel errors
3. vmstat 1 ⟶ overall stats by time
4. mpstat -P ALL 1 ⟶ CPU balance
5. pidstat 1 ⟶ process usage
6. iostat -xz 1 ⟶ disk I/O
7. free -m ⟶ memory usage
8. sar -n DEV 1 ⟶ network I/O
9. sar -n TCP,ETCP 1 ⟶ TCP stats
10. top ⟶ check overview
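To run the whole 60-second drill in one pass, a minimal shell wrapper like the sketch below works; it assumes the sysstat package provides mpstat, pidstat, iostat, and sar, and uses coreutils timeout to bound each sampling command:

#!/bin/sh
# Linux perf analysis in roughly 60 seconds: each check runs in order, time-boxed.
uptime                        # load averages
dmesg -T | tail               # recent kernel errors
timeout 5 vmstat 1            # overall stats per second
timeout 5 mpstat -P ALL 1     # per-CPU balance
timeout 5 pidstat 1           # per-process usage
timeout 5 iostat -xz 1        # disk I/O
free -m                       # memory usage
timeout 5 sar -n DEV 1        # network interface I/O
timeout 5 sar -n TCP,ETCP 1   # TCP opens, accepts, retransmits
top -b -n 1 | head -20        # snapshot overview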
Linux Disk Checklist
1. iostat -xz 1 ⟶ any disk I/O? if not, stop looking
2. vmstat 1 ⟶ is this swapping? or, high sys time?
3. df -h ⟶ are file systems nearly full?
4. ext4slower 10 ⟶ (zfs*, xfs*, etc.) slow file system I/O?
5. bioslower 10 ⟶ if so, check disks
6. ext4dist 1 ⟶ check distribution and rate
7. biolatency 1 ⟶ if interesting, check disks
8. cat /sys/devices/…/ioerr_cnt ⟶ (if available) errors
9. smartctl -l error /dev/sda1 ⟶ (if available) errors
* Another short checklist. Won't solve everything. ext4slower/dist and bioslower/latency are from the bcc/BPF tools.
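Note that on many distributions the bcc tools are not on PATH; a sketch of steps 4 and 7, assuming the upstream install location /usr/share/bcc/tools (Ubuntu's bpfcc-tools package suffixes the names instead, e.g. ext4slower-bpfcc):

sudo /usr/share/bcc/tools/ext4slower 10   # trace ext4 operations slower than 10 ms
sudo /usr/share/bcc/tools/biolatency 1    # block I/O latency histogram, 1s intervals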
Linux Network Checklist
1. sar -n DEV,EDEV 1 ⟶ at interface limits? or use nicstat
2. sar -n TCP,ETCP 1 ⟶ active/passive load, retransmit rate
3. cat /etc/resolv.conf ⟶ it's always DNS
4. mpstat -P ALL 1 ⟶ high kernel time? single hot CPU?
5. tcpretrans ⟶ what are the retransmits? state?
6. tcpconnect ⟶ connecting to anything unexpected?
7. tcpaccept ⟶ unexpected workload?
8. netstat -rnv ⟶ any inefficient routes?
9. check firewall config ⟶ anything blocking/throttling?
10. netstat -s ⟶ play 252 metric pickup
* The tcp* tools are from the bcc/BPF tools.
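To put a number on step 2's retransmit rate, a rough sketch: compare retransmitted segments against segments sent, either live via sar or cumulatively via netstat counters:

sar -n TCP,ETCP 1 5                                   # opens/accepts and retrans/s, 5 samples
netstat -s | grep -i -e 'segments sent' -e retrans    # cumulative counters since boot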
Linux CPU Checklist
1. uptime ⟶ load averages
2. vmstat 1 ⟶ system-wide utilization, run queue length
3. mpstat -P ALL 1 ⟶ CPU balance
4. pidstat 1 ⟶ per-process CPU
5. CPU flame graph ⟶ CPU profiling
6. CPU subsecond offset heat map ⟶ look for gaps
7. perf stat -a -- sleep 10 ⟶ IPC, LLC hit ratio
* htop can do 1-4. I'm tempted to add execsnoop for short-lived processes (it's in perf-tools or bcc/BPF tools).
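Step 5 has no single command; a minimal sketch using perf plus Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph), assuming the repo is cloned into the working directory:

sudo perf record -F 99 -a -g -- sleep 30    # sample all CPU stacks at 99 Hz for 30s
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg   # fold stacks, render SVG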
https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
Troubleshooting Elasticsearch ILM: Common issues and fixes
https://www.elastic.co/blog/troubleshooting-elasticsearch-ilm-common-issues-and-fixes
How to pick the best observability solution for your organisation
There is a wealth of monitoring solutions available for engineers and developers to choose from, so how do you select the one most appropriate for you?
https://medium.com/contino-engineering/how-to-pick-the-best-observability-solution-for-your-organisation-e956f0bffb8e
[ALERTING] When are critical alerts needed?
https://medium.com/nerd-for-tech/alerting-when-are-critical-alerts-needed-8144f092a48