DevOps&SRE Library
A library of articles on DevOps and SRE.

Advertising: @ostinostin
Content: @mxssl

RKN (Roskomnadzor): https://www.gosuslugi.ru/snet/67704b536aa9672b963777b3
Amazon Web Services In Plain English

https://expeditedsecurity.com/aws-in-plain-english
Cloud Cost Handbook

The Cloud Cost Handbook is a free, open-source, community-supported set of guides that explain the often complex pricing of public cloud infrastructure and service providers in easy-to-understand terms.

https://handbook.vantage.sh
A tcpdump Tutorial with Examples — 50 Ways to Isolate Traffic

https://danielmiessler.com/study/tcpdump
ON THE EVILNESS OF FEATURE BRANCHING - A TALE OF TWO TEAMS

On the experience of working with two totally different teams: one novice, practising trunk-based development; the other very experienced and used to GitFlow.

https://thinkinglabs.io/articles/2021/07/14/on-the-evilness-of-feature-branching-a-tale-of-two-teams.html
Behind the scenes, AWS Lambda

Writing code and deploying it to AWS Lambda is as easy as baking a cake (depending on the type of cake). Lambda performs the heavy lifting for you, from provisioning to scaling. But where is the magic happening, and how does it actually work under the hood? Let's find out together!

https://www.bschaatsbergen.com/behind-the-scenes-lambda
schemahero

A Kubernetes operator for declarative database schema management (gitops for database schemas)

https://github.com/schemahero/schemahero
ortelius

Ortelius simplifies the implementation of microservices. By providing a central catalog of services with their deployment specs, it lets application teams easily consume and deploy services across clusters. Ortelius tracks application versions based on service updates and maps their service dependencies, eliminating confusion and guesswork.

https://github.com/ortelius/ortelius
My Dev Lessons From 2020

Kubernetes is to Borg what Frankenstein is to the Dalai Lama

When I left Google, I was sold on the whole containerized way of running things. Borg is lightyears ahead of every other cluster orchestration project.

Borg doesn't let you do everything. It is designed to run specifically built applications that are containerized. You don't get Docker images with whatever OS stuff you feel like running that day. The OS is always Google's internal OS. You don't get access to whatever binaries you want to install. You don't get to use whatever security you want. Your RPC system is always going to be Stubby (gRPC internal to Google). Your cluster file system is going to be the only one allowed. Period.

Those limits are freeing. You simply need to have resources to run your jobs and deploy them. Your binaries are packaged up and you just need to say what is going to get run.

So naturally, I've used Kubernetes after I left.

Everything about Borg I liked is gone in Kubernetes. It is trying to solve everyone's problem and solves no one's problem.

It is easy to kill your jobs. It's hard to do things like update a single instance. Service meshes???? Really????

Helm? Great, I can kill all my cluster MySQL databases at the flick of my Helm config.

Security, what security? Oh, right, the bring-my-own model, which is just crazy hard.

Need it to work with special cloud sidecars (like special identity services)? Well, that's going to be a fun thing.

Upgrades that change the config language so that your jobs won't run anymore. Perfect.....

And btw, love YAML over the Borg config language, NOT!

http://www.gophersre.com/2021/02/21/my-dev-lessons-from-2020
Linux Performance Checklists for SREs

Linux Perf Analysis in 60s
(https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55)

1. uptime ⟶ load averages
2. dmesg -T | tail ⟶ kernel errors
3. vmstat 1 ⟶ overall stats by time
4. mpstat -P ALL 1 ⟶ CPU balance
5. pidstat 1 ⟶ process usage
6. iostat -xz 1 ⟶ disk I/O
7. free -m ⟶ memory usage
8. sar -n DEV 1 ⟶ network I/O
9. sar -n TCP,ETCP 1 ⟶ TCP stats
10. top ⟶ check overview
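
As a rough illustration of step 1, the load averages that uptime prints can also be read from /proc/loadavg. The sketch below assumes the standard Linux /proc/loadavg layout (three averages followed by other fields). The function names and the sample line are hypothetical and are not part of the linked article.

```python
def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("expected at least three load average fields")
    return tuple(float(field) for field in fields[:3])


def load_trend(one_min, fifteen_min):
    """Compare the 1-minute average with the 15-minute one to see the direction."""
    if one_min > fifteen_min:
        return "rising"
    if one_min < fifteen_min:
        return "falling"
    return "steady"


# Hypothetical sample line; on a real host, read it from /proc/loadavg instead.
SAMPLE_LOADAVG = "0.20 0.18 0.12 1/80 11206"

one, five, fifteen = parse_loadavg(SAMPLE_LOADAVG)
print(one, five, fifteen, load_trend(one, fifteen))
```

Comparing the 1-minute value against the 15-minute value is only a quick hint at whether load is growing or draining; the remaining steps in the list are still needed to find the cause.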

Linux Disk Checklist

1. iostat -xz 1 ⟶ any disk I/O? if not, stop looking
2. vmstat 1 ⟶ is this swapping? or, high sys time?
3. df -h ⟶ are file systems nearly full?
4. ext4slower 10 ⟶ (zfs*, xfs*, etc.) slow file system I/O?
5. bioslower 10 ⟶ if so, check disks
6. ext4dist 1 ⟶ check distribution and rate
7. biolatency 1 ⟶ if interesting, check disks
8. cat /sys/devices/…/ioerr_cnt ⟶ (if available) errors
9. smartctl -l error /dev/sda1 ⟶ (if available) errors

* Another short checklist. Won't solve everything. ext4slower, ext4dist, bioslower, and biolatency are from bcc/BPF tools.
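
Step 3 (are file systems nearly full?) can be approximated in a script. This is a minimal sketch using Python's shutil.disk_usage; the function names and the 90% threshold are assumptions chosen for illustration. Note that df computes its Use% column slightly differently (it accounts for reserved blocks), so the numbers may not match exactly.

```python
import shutil


def percent_used(total_bytes, used_bytes):
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def nearly_full(path, threshold=90.0):
    """Return True if the file system holding `path` is at or above the threshold.

    The 90% default is an arbitrary illustrative value, not a recommendation.
    """
    usage = shutil.disk_usage(path)
    return percent_used(usage.total, usage.used) >= threshold


print(nearly_full("."))
```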

Linux Network Checklist

1. sar -n DEV,EDEV 1 ⟶ at interface limits? or use nicstat
2. sar -n TCP,ETCP 1 ⟶ active/passive load, retransmit rate
3. cat /etc/resolv.conf ⟶ it's always DNS
4. mpstat -P ALL 1 ⟶ high kernel time? single hot CPU?
5. tcpretrans ⟶ what are the retransmits? state?
6. tcpconnect ⟶ connecting to anything unexpected?
7. tcpaccept ⟶ unexpected workload?
8. netstat -rnv ⟶ any inefficient routes?
9. check firewall config ⟶ anything blocking/throttling?
10. netstat -s ⟶ play 252 metric pickup

* tcp*, are from bcc/BPF tools.
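
For step 2, a coarse retransmit figure can be derived from the kernel's TCP counters. The sketch below assumes the usual /proc/net/snmp layout, where a "Tcp:" header line of field names is followed by a "Tcp:" line of values. The function names and the sample numbers are hypothetical. Unlike sar, which reports per-second rates over an interval, this gives a cumulative ratio since boot.

```python
def parse_tcp_counters(snmp_text):
    """Map TCP counter names to integer values from /proc/net/snmp text."""
    rows = [
        line.split()[1:]
        for line in snmp_text.splitlines()
        if line.startswith("Tcp:")
    ]
    if len(rows) < 2:
        raise ValueError("expected a Tcp header line and a Tcp value line")
    names, values = rows[0], rows[1]
    return {name: int(value) for name, value in zip(names, values)}


def retransmit_percent(counters):
    """Return retransmitted segments as a percentage of segments sent."""
    sent = counters["OutSegs"]
    return 100.0 * counters["RetransSegs"] / sent if sent else 0.0


# Hypothetical sample; on a real host, read the text from /proc/net/snmp.
SAMPLE_SNMP = (
    "Tcp: RtoAlgorithm RtoMin RtoMax MaxConn ActiveOpens PassiveOpens "
    "AttemptFails EstabResets CurrEstab InSegs OutSegs RetransSegs "
    "InErrs OutRsts InCsumErrors\n"
    "Tcp: 1 200 120000 -1 100 50 3 2 10 10000 8000 40 0 5 0\n"
)

print(retransmit_percent(parse_tcp_counters(SAMPLE_SNMP)))
```

To get a rate comparable to the sar output, one would sample the counters twice and divide the difference by the interval.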

Linux CPU Checklist

1. uptime ⟶ load averages
2. vmstat 1 ⟶ system-wide utilization, run queue length
3. mpstat -P ALL 1 ⟶ CPU balance
4. pidstat 1 ⟶ per-process CPU
5. CPU flame graph ⟶ CPU profiling
6. CPU subsecond offset heat map ⟶ look for gaps
7. perf stat -a -- sleep 10 ⟶ IPC, LLC hit ratio

* htop can do 1-4. I'm tempted to add execsnoop for short-lived processes (it's in perf-tools or bcc/BPF tools).
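
The two figures named in step 7 are simple ratios of hardware counter values. The sketch below shows the arithmetic only: IPC is instructions divided by cycles, and the LLC hit ratio is one minus misses over references. The function names and the input numbers are made up for illustration; real values would come from perf stat output on a host that exposes these counters.

```python
def instructions_per_cycle(instructions, cycles):
    """Return IPC: retired instructions divided by CPU cycles."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def llc_hit_ratio(references, misses):
    """Return the fraction of last-level cache references that hit."""
    if references <= 0:
        raise ValueError("references must be positive")
    return 1.0 - misses / references


# Hypothetical counter values, not measurements.
print(instructions_per_cycle(3_000_000, 2_000_000))
print(llc_hit_ratio(1000, 250))
```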

https://www.brendangregg.com/blog/2016-05-04/srecon2016-perf-checklists-for-sres.html
Troubleshooting Elasticsearch ILM: Common issues and fixes

https://www.elastic.co/blog/troubleshooting-elasticsearch-ilm-common-issues-and-fixes
How to pick the best observability solution for your organisation

There is a wealth of monitoring solutions available for engineers and developers to choose from, so how do you select the one most appropriate for you?

https://medium.com/contino-engineering/how-to-pick-the-best-observability-solution-for-your-organisation-e956f0bffb8e