DevOps&SRE Library

sshs

Terminal user interface for SSH.

https://github.com/quantumsheep/sshs

SRE Is Anti-Transactional

If you ask 10 SRE engineers to define SRE, you'll get 11 definitions.

https://queue.acm.org/detail.cfm?ref=rss&id=3773094

Resilience vs. Fault tolerance

In this post, I will discuss whether there is a difference between resilience and fault tolerance in IT systems.

https://www.ufried.com/blog/resilience_vs_fault_tolerance

Datadog, Thank You for Blocking Us

Datadog cut off our observability overnight. We migrated to an open Grafana stack in 48 hours. Here’s why vendor lock-in is fading in an AI-native world.

https://www.deductive.ai/blogs/datadog-thank-you-for-blocking-us

You Can’t Debug a System by Blaming a Person

“I understand why we need to be blameless, but I have this person in my team who is often reckless. How can I not blame them when their actions continuously make things worse?”

Someone asked me this at the SRE meetup, right after my talk on incidents. Since then I’ve been thinking about it, because it surfaces a concern many people might have.

https://humansinsystems.com/blog/you-cant-debug-a-systems-by-blaming-a-person

Eliminate sensitive values from Terraform state using write-only attributes

https://skundunotes.com/2025/12/22/eliminate-sensitive-values-from-terraform-state-using-write-only-attributes

How We Moved a 2M RPM WebSocket Service to EKS and Fixed a Critical Bottleneck

Lessons in systems because AWS deprecated OpsWorks

https://medium.com/freshworks-engineering-blog/two-million-websockets-90f63e760cfd

Scaling Dagster on Kubernetes: Best Practices for 50+ Code Locations

https://u11d.com/blog/scaling-dagster-kubernetes-multi-code-locations

Investigating and fixing "StopPodSandbox from runtime service failed" Kubelet errors

https://marcusnoble.co.uk/2025-09-28-investigating-and-fixing-stoppodsandbox-from-runtime-service-failed-kubelet-errors

Transforming Kubernetes Secret Management Best Practices

https://medium.com/@nitinyadav745/transforming-kubernetes-secret-management-best-practices-804e993a22d9

HOWTO: Use SimKube for Cost Forecasting

Recently, I’ve had a number of folks ask for some more details about how SimKube can be used to predict or forecast your Kubernetes expenditures, and I realized that I’ve said you can do this several times, but I’ve never actually gone through the details! So this post will show you how.

https://blog.appliedcomputing.io/p/howto-use-simkube-for-cost-forecasting

kanidm

Kanidm is a simple and secure identity management platform, allowing other applications and services to offload the challenge of authenticating and storing identities to Kanidm.

The goal of this project is to be a complete identity provider, covering the broadest possible set of requirements and integrations. You should not need any other components (like Keycloak) when you use Kanidm - we already have everything you need!

To achieve this we rely heavily on strict defaults, simple configuration, and self-healing components. This allows Kanidm to support small home labs, families, small businesses, and all the way to the largest enterprise needs.

If you want to host your own authentication service, then Kanidm is for you!

https://github.com/kanidm/kanidm

kubernetes-nmstate

Declarative node network configuration driven through Kubernetes API.

https://github.com/nmstate/kubernetes-nmstate

kide

OpenObserve Kide is a lightweight and fast Kubernetes IDE.

https://github.com/openobserve/kide

zot

zot: a production-ready vendor-neutral OCI image registry - images stored in OCI image format, distribution specification on-the-wire, that's it!

https://github.com/project-zot/zot

juicefs

JuiceFS is a high-performance POSIX file system released under Apache License 2.0, designed for cloud-native environments. Data stored via JuiceFS is persisted in object storage (e.g. Amazon S3), and the corresponding metadata can be persisted in various compatible database engines such as Redis, MySQL, and TiKV, depending on the scenario and requirements.

With JuiceFS, massive cloud storage can be directly connected to big data, machine learning, artificial intelligence, and various application platforms in production environments. Without modifying code, the massive cloud storage can be used as efficiently as local storage.

https://github.com/juicedata/juicefs

kubernetes-autoscaling-mixin

A set of Grafana dashboards and Prometheus alerts for Kubernetes Autoscaling using the metrics from Kube-state-metrics, Karpenter, and Cluster-autoscaler.

https://github.com/adinhodovic/kubernetes-autoscaling-mixin

phoenix