Cloud Cost Handbook
The Cloud Cost Handbook is a free, open-source, community-supported set of guides meant to help explain the often complex pricing of public cloud infrastructure and service providers in easy-to-understand terms.
https://handbook.vantage.sh
A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic
https://danielmiessler.com/study/tcpdump
4 Key Observability Metrics for Distributed Applications
https://hackernoon.com/4-key-observability-metrics-for-distributed-applications-z11337yh
Unpacking Observability: Understanding Logs, Events, Traces, and Spans
https://medium.com/dzerolabs/observability-journey-understanding-logs-events-traces-and-spans-836524d63172
ON THE EVILNESS OF FEATURE BRANCHING - A TALE OF TWO TEAMS
On the experience of working with two totally different teams: one novice team practising trunk-based development, the other very experienced and used to GitFlow.
https://thinkinglabs.io/articles/2021/07/14/on-the-evilness-of-feature-branching-a-tale-of-two-teams.html
Behind the scenes, AWS Lambda
Writing code and deploying it to AWS Lambda is as easy as baking a cake (depending on the type of cake). Lambda performs the heavy lifting for you, from provisioning to scaling. But where does the magic happen, and how does it actually work under the hood? Let's find out together!
https://www.bschaatsbergen.com/behind-the-scenes-lambda
Automatic Remediation of Kubernetes Nodes
https://blog.cloudflare.com/automatic-remediation-of-kubernetes-nodes
schemahero
A Kubernetes operator for declarative database schema management (GitOps for database schemas)
https://github.com/schemahero/schemahero
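As a rough sketch of what GitOps for database schemas looks like here, a table is declared as a Kubernetes custom resource and applied like any other manifest. The database name, table, and columns below are illustrative assumptions, and a SchemaHero Database object is presumed to already exist:

kubectl apply -f - <<'EOF'
apiVersion: schemas.schemahero.io/v1alpha4
kind: Table
metadata:
  name: users
spec:
  database: my-database   # hypothetical: must match a deployed Database object
  name: users
  schema:
    postgres:
      primaryKey: [id]
      columns:
        - name: id
          type: integer
        - name: email
          type: varchar(255)
EOF

SchemaHero diffs the declared schema against the live database and plans the migration, which is what makes schema changes reviewable in git like any other manifest.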
ortelius
Ortelius simplifies the implementation of microservices. It provides a central catalog of services with their deployment specs, so application teams can easily consume and deploy services across clusters. Ortelius tracks application versions based on service updates and maps their service dependencies, eliminating confusion and guesswork.
https://github.com/ortelius/ortelius
Common Kubernetes Errors Made by Beginners [2021]
https://medium.com/nerd-for-tech/common-kubernetes-errors-made-by-beginners-274b50e18a01
Thoughts on HTTP instrumentation with OpenTelemetry
https://neskazu.medium.com/thoughts-on-http-instrumentation-with-opentelemetry-9fc22fa35bc7
Unpacking Observability
https://adri-v.medium.com/unpacking-observability-a-beginners-guide-833258a0591f
My Dev Lessons From 2020
Kubernetes is to Borg what Frankenstein is to the Dalai Lama.
http://www.gophersre.com/2021/02/21/my-dev-lessons-from-2020
When I left Google, I was sold on the whole containerized way of running things. Borg is lightyears ahead of every other cluster orchestration project.
Borg doesn't let you do everything. It is designed to run specifically built applications that are containerized. You don't get Docker images with whatever OS stuff you feel like running that day. The OS is always Google's internal OS. You don't get access to whatever binaries you want to install. You don't get to use whatever security you want. Your RPC system is always going to be Stubby (gRPC internal to Google). Your cluster file system is going to be the only one allowed. Period.
Those limits are freeing. You simply need to have resources to run your jobs and deploy them. Your binaries are packaged up and you just need to say what is going to get run.
So naturally, I used Kubernetes after I left.
Everything about Borg I liked is gone in Kubernetes. It is trying to solve everyone's problem and solves no one's problem.
It is easy to kill your jobs. It's hard to do things like update a single instance. Service meshes???? Really????
Helm? Great, I can kill all my cluster MySQL databases at the flick of my helm config.
Security, what security? Oh, right, the bring-your-own model that is just crazy hard.
Need it to work with special cloud sidecars (like special identity services)? Well, that's going to be a fun thing.
Upgrades that change the config language so that your jobs won't run anymore. Perfect.....
And btw, love YAML over the Borg config language, NOT!
Linux Performance Checklists for SREs
Linux Perf Analysis in 60s (https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55)
1. uptime ⟶ load averages
2. dmesg -T | tail ⟶ kernel errors
3. vmstat 1 ⟶ overall stats by time
4. mpstat -P ALL 1 ⟶ CPU balance
5. pidstat 1 ⟶ process usage
6. iostat -xz 1 ⟶ disk I/O
7. free -m ⟶ memory usage
8. sar -n DEV 1 ⟶ network I/O
9. sar -n TCP,ETCP 1 ⟶ TCP stats
10. top ⟶ check overview
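To run the whole 60-second drill in one pass, a minimal shell wrapper like the sketch below works; it assumes the sysstat package provides mpstat, pidstat, iostat, and sar, and uses coreutils timeout to bound each sampling command:

#!/bin/sh
# Linux perf analysis in roughly 60 seconds: each check runs in order, time-boxed.
uptime                        # load averages
dmesg -T | tail               # recent kernel errors
timeout 5 vmstat 1            # overall stats per second
timeout 5 mpstat -P ALL 1     # per-CPU balance
timeout 5 pidstat 1           # per-process usage
timeout 5 iostat -xz 1        # disk I/O
free -m                       # memory usage
timeout 5 sar -n DEV 1        # network interface I/O
timeout 5 sar -n TCP,ETCP 1   # TCP opens, accepts, retransmits
top -b -n 1 | head -20        # snapshot overview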
Linux Disk Checklist
1. iostat -xz 1 ⟶ any disk I/O? if not, stop looking
2. vmstat 1 ⟶ is this swapping? or, high sys time?
3. df -h ⟶ are file systems nearly full?
4. ext4slower 10 ⟶ (zfs*, xfs*, etc.) slow file system I/O?
5. bioslower 10 ⟶ if so, check disks
6. ext4dist 1 ⟶ check distribution and rate
7. biolatency 1 ⟶ if interesting, check disks
8. cat /sys/devices/…/ioerr_cnt ⟶ (if available) errors
9. smartctl -l error /dev/sda1 ⟶ (if available) errors
* Another short checklist. Won't solve everything. ext4slower/dist and bioslower/latency are from the bcc/BPF tools.
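Note that on many distributions the bcc tools are not on PATH; a sketch of steps 4 and 7, assuming the upstream install location /usr/share/bcc/tools (Ubuntu's bpfcc-tools package suffixes the names instead, e.g. ext4slower-bpfcc):

sudo /usr/share/bcc/tools/ext4slower 10   # trace ext4 operations slower than 10 ms
sudo /usr/share/bcc/tools/biolatency 1    # block I/O latency histogram, 1s intervals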
Linux Network Checklist
1. sar -n DEV,EDEV 1 ⟶ at interface limits? or use nicstat
2. sar -n TCP,ETCP 1 ⟶ active/passive load, retransmit rate
3. cat /etc/resolv.conf ⟶ it's always DNS
4. mpstat -P ALL 1 ⟶ high kernel time? single hot CPU?
5. tcpretrans ⟶ what are the retransmits? state?
6. tcpconnect ⟶ connecting to anything unexpected?
7. tcpaccept ⟶ unexpected workload?
8. netstat -rnv ⟶ any inefficient routes?
9. check firewall config ⟶ anything blocking/throttling?
10. netstat -s ⟶ play 252 metric pickup
* The tcp* tools are from the bcc/BPF tools.
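To put a number on step 2's retransmit rate, a rough sketch: compare retransmitted segments against segments sent, either live via sar or cumulatively via netstat counters:

sar -n TCP,ETCP 1 5                                   # opens/accepts and retrans/s, 5 samples
netstat -s | grep -i -e 'segments sent' -e retrans    # cumulative counters since boot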
Linux CPU Checklist
1. uptime ⟶ load averages
2. vmstat 1 ⟶ system-wide utilization, run queue length
3. mpstat -P ALL 1 ⟶ CPU balance
4. pidstat 1 ⟶ per-process CPU
5. CPU flame graph ⟶ CPU profiling
6. CPU subsecond offset heat map ⟶ look for gaps
7. perf stat -a -- sleep 10 ⟶ IPC, LLC hit ratio
* htop can do 1-4. I'm tempted to add execsnoop for short-lived processes (it's in perf-tools or bcc/BPF tools).
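Step 5 has no single command; a minimal sketch using perf plus Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph), assuming the repo is cloned into the working directory:

sudo perf record -F 99 -a -g -- sleep 30    # sample all CPU stacks at 99 Hz for 30s
sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu.svg   # fold stacks, render SVG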
https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
Troubleshooting Elasticsearch ILM: Common issues and fixes
https://www.elastic.co/blog/troubleshooting-elasticsearch-ilm-common-issues-and-fixes
How to pick the best observability solution for your organisation
There is a wealth of monitoring solutions available for engineers and developers to choose from, so how do you select the one most appropriate for you?
https://medium.com/contino-engineering/how-to-pick-the-best-observability-solution-for-your-organisation-e956f0bffb8e
[ALERTING] When are critical alerts needed?
https://medium.com/nerd-for-tech/alerting-when-are-critical-alerts-needed-8144f092a48