Dev0ps

Many people know about readiness/liveness probes in Kubernetes, but often don't understand what exactly should go into them. This article covers the topic of health checks in reasonable depth: https://medium.com/@copyconstruct/health-checks-in-distributed-systems-aa8a0e8c1672 I strongly recommend also going through the links at the end; there is plenty of good material there too.
#sre #healtcheck

Medium

Health Checks and Graceful Degradation in Distributed Systems

Thanks, as always, to Fred Hebert and Sargun Dhillon for reading a draft of this post and offering some invaluable suggestions.

12:48

Dev0ps

Forwarded from Пятничный деплой

A simple, clear explanation of capacity planning: https://hackernoon.com/why-capacity-planning-needs-queueing-theory-without-the-hard-math-342a851e215c #sre #capacity #planning

Hackernoon

Why Capacity Planning Needs Queueing Theory (without the hard math) | HackerNoon

Using Queueing Theory simulations to model capacity planning allows for a deeper understanding of system performance and client experience when compared to a strictly rate based approach. This article details why queueing theory is essential for modeling…

15:01

Dev0ps

Forwarded from Пятничный деплой

A huge collection of materials and articles for SREs: https://github.com/lorin/resilience-engineering/blob/master/README.md #sre

GitHub

resilience-engineering/README.md at master · lorin/resilience-engineering

Resilience engineering papers. Contribute to lorin/resilience-engineering development by creating an account on GitHub.

10:57

Dev0ps

Forwarded from Пятничный деплой

An interesting article offering one view on how to organize knowledge for incident resolution: https://medium.com/dm03514-tech-blog/sre-knowledge-graphs-increased-context-in-human-involved-incident-response-ir-301fd831070c #sre #knowledge

Medium

SRE: Knowledge Graphs: Increased Context in Human Involved Incident Response(IR)

Incident response involving human responders requires context of systems and services that are encountering issues. Getting this context…

13:58

Dev0ps

Forwarded from Пятничный деплой

An article about timeouts: https://vorpus.org/blog/timeouts-and-cancellation-for-humans/ #sre #timeouts

vorpus.org

Timeouts and cancellation for humans — njs blog

16:02

Dev0ps

Forwarded from Пятничный деплой

Another article on chaos engineering, this time focused more on practice #chaos #sre https://blog.acolyer.org/2019/07/05/automating-chaos-experiments-in-production/

20:29

Dev0ps

Forwarded from Пятничный деплой

The second article in a series on fault-tolerant architectures: https://medium.com/@adhorn/patterns-for-resilient-architecture-part-2-9b51a7e2f10f #architecture #sre

Medium

Patterns for Resilient Architecture — Part 2

The art of avoiding cascading failures

12:40

Dev0ps

Forwarded from Пятничный деплой

Let's finally start writing everything bulletproof! #golang #sre #reliability
https://medium.com/free-code-camp/how-to-write-bulletproof-code-in-go-a-workflow-for-servers-that-cant-fail-10a14a765f22

Medium

How to write bulletproof code in Go: a workflow for servers that can’t fail

From time to time you may find yourself facing a daunting task: building a server that really isn’t allowed to fail, a project where the…

18:39

Dev0ps

Forwarded from Пятничный деплой

A neat investigation into anomalies in an application's CPU usage: https://medium.com/synthesio-engineering/a-journey-into-scaling-a-prometheus-deployment-76c9e1b4db6f #perfomance #prometheus #sre

Medium

A Journey into Scaling a Prometheus Deployment

Written by Aurélien Rougement and Romain Baugue.

21:29

Dev0ps

Forwarded from Пятничный деплой

Among the pile of assorted as-a-service offerings I found a new one: Failure-as-a-Service, with the amusing name gremlin.com. It will come in handy if you practice #chaos engineering or are thinking of starting. Here is a usage example in which they shake up the Kubernetes we all love: https://medium.com/better-practices/chaos-d3ef238ec328 #k8s #sre #grafana #gremlin

Medium

Learn how your Kubernetes clusters respond to failure using Gremlin and Grafana

Building resilient APIs with chaos engineering

07:40
