Forwarded from Записки админа
☁️ Another TUI utility, this one for working with Amazon EC2 instances: https://github.com/dutchcoders/cloudman
#tui #amazon #фидбечат
Forwarded from DevOps&SRE Library
Service Status Monitoring Using WhatsApp, Notion, and Python
https://www.twilio.com/blog/service-status-monitoring-whatsapp-notion-python
Forwarded from DevOps&SRE Library
How we’re building a production readiness review process at Grafana Labs
https://grafana.com/blog/2021/10/13/how-were-building-a-production-readiness-review-process-at-grafana-labs
Forwarded from DevOps&SRE Library
ottr
Ottr is a serverless framework for Public Key Infrastructure (PKI) that aims to provide a robust and scalable method to manage end-to-end certificate rotations using an agentless approach.
https://github.com/airbnb/ottr
Forwarded from DevOps&SRE Library
apiclarity
Reconstruct Open API Specifications from real-time workload traffic seamlessly.
https://github.com/apiclarity/apiclarity
Forwarded from DevOps&SRE Library
The road to world-class monitoring at Azimo
https://medium.com/azimolabs/the-road-to-world-class-monitoring-at-azimo-bb7dfd358441
Forwarded from DevOps&SRE Library
Federating Prometheus Effectively
Federation allows a Prometheus server to scrape selected time series from another Prometheus server. Prometheus federation can be used to scale to hundreds of clusters or to pull related metrics from one service’s Prometheus into another.
https://levelup.gitconnected.com/federating-prometheus-effectively-4ccd51b2767b
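As a rough illustration of "scraping selected time series": a federating server reads the source server's `/federate` endpoint and passes one `match[]` selector per series set it wants. The sketch below only builds such a URL; the host name and the selectors are made-up placeholders, not taken from the linked article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per selector."""
    query = urlencode([("match[]", m) for m in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical source server and selectors, for illustration only.
url = federate_url("http://prometheus.example:9090", ['{job="api"}', "up"])

# Round-trip check: the selectors survive URL encoding unchanged.
selectors = parse_qs(urlsplit(url).query)["match[]"]
```

In a real setup the same selectors would normally go under `params: match[]` in the federating server's scrape config rather than being assembled by hand.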
Forwarded from DevOps&SRE Library
A different and (often) better way to downsample your Prometheus metrics
https://blog.timescale.com/blog/a-different-and-often-better-way-to-downsample-your-prometheus-metrics
Forwarded from DevOps&SRE Library
Five-P factors for root cause analysis
https://cloudpundit.com/2021/10/28/five-p-factors-for-root-cause-analysis
Forwarded from Мониторим ИТ
PostgreSQL Monitoring for App Developers: Alerts & Troubleshooting
If you choose only one thing to alert on in your PostgreSQL cluster (and as I hope this article makes clear, you should alert on multiple things), it should be availability. If your application is unable to connect to or transact with your database, you're probably in for a bad day. Read more.
Crunchy Data
PostgreSQL Monitoring for App Developers: Alerts & Troubleshooting
When should you be alerted about issues in your PostgreSQL clusters? How do you troubleshoot them? What are some typical solutions?
Forwarded from Записки админа
📟 Save your engineers' sleep: best practices for on-call processes. The title says it all: useful tips for setting up a healthy on-call process.
#напочитать #support #oncall
Forwarded from Грефневая Кафка (pro.kafka)
From time to time people ask how to build applications so that they don't go down when Kafka goes down. That reminded me of an article by Jakub Korab in which he examines various approaches to this problem.
https://www.confluent.io/blog/how-to-survive-a-kafka-outage/
Confluent
Apache Kafka® Broker Failures & Other Outages
Learn common causes of Apache Kafka® broker failures, as well as how to recover from outages and ensure high availability and resilience in your Kafka cluster.
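One possible way to keep an application running while the broker is unreachable is to buffer outgoing messages in bounded memory and retry later. The sketch below shows only that idea and is not necessarily one of the approaches the linked article recommends. `BufferingSender`, the injected `send` callable, and the use of `ConnectionError` to signal an outage are all assumptions made for the example, not part of any Kafka client API.

```python
from collections import deque


class BufferingSender:
    """Keep messages in a bounded in-memory buffer while sending fails."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # hypothetical callable that raises ConnectionError on outage
        self._buffer = deque(maxlen=max_buffered)  # oldest message is dropped when full

    def publish(self, message):
        """Queue the message and try to deliver everything queued so far."""
        self._buffer.append(message)
        self.flush()

    def flush(self):
        """Send buffered messages in order; return False if the broker is still down."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False
            self._buffer.popleft()
        return True
```

The trade-off is explicit here: the application stays up, but once the buffer is full the oldest messages are lost, and everything buffered is lost if the process restarts.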
Forwarded from Updates rtfm.co.ua 🇺🇦 (rtfmcoua)
Prometheus: Recording Rules and tags - splitting alerts in Slack
We have been using Opsgenie since 2018; it receives alerts from Prometheus, CloudWatch, and Uptrends and then forwards them to us in Slack through its Slack integration. The Slack integrations currently look like this: Each of them has a filter by severity level, for example the integration P1, P2 > Slack #devops-alarms-warning: But there is a problem: since the channels are shared, all alerts…
https://rtfm.co.ua/prometheus-recording-rules-i-tegi-razdelyaem-alerty-v-slack/
RTFM: Linux, DevOps, and system administration | DevOps engineering and system administration. Cases from practice.
Prometheus: Recording Rules and tags - splitting alerts in Slack
Using Prometheus Recording Rules and Tags to select a Slack channel via Opsgenie
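The idea of choosing a Slack channel from alert tags can be pictured as a small lookup. This is a plain-Python sketch of the routing logic only, not Opsgenie or Prometheus configuration and not the article's actual setup. The tag names and the per-team channels are hypothetical; only `#devops-alarms-warning` appears in the post.

```python
def pick_channel(tags, routes, default_channel):
    """Return the channel of the first route whose tag is present on the alert."""
    for tag, channel in routes:
        if tag in tags:
            return channel
    return default_channel


# Hypothetical tag-to-channel routes; the shared channel acts as the fallback.
ROUTES = [("backend", "#backend-alarms"), ("data", "#data-alarms")]
DEFAULT_CHANNEL = "#devops-alarms-warning"

channel = pick_channel({"P1", "backend"}, ROUTES, DEFAULT_CHANNEL)
```

Routes are checked in order, so an alert carrying several team tags goes to the first matching channel and anything untagged still lands in the shared one.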