You Can’t Debug a System by Blaming a Person
https://humansinsystems.com/blog/you-cant-debug-a-systems-by-blaming-a-person
“I understand why we need to be blameless, but I have this person in my team who is often reckless. How can I not blame them when their actions continuously make things worse?”
Someone asked me this at the SRE meetup, right after my talk on incidents. Since then I’ve been thinking about it, because it surfaces a concern many people might have.
Eliminate sensitive values from Terraform state using write-only attributes
https://skundunotes.com/2025/12/22/eliminate-sensitive-values-from-terraform-state-using-write-only-attributes
How We Moved a 2M RPM WebSocket Service to EKS and Fixed a Critical Bottleneck
https://medium.com/freshworks-engineering-blog/two-million-websockets-90f63e760cfd
Lessons in systems because AWS deprecated OpsWorks
Scaling Dagster on Kubernetes: Best Practices for 50+ Code Locations
https://u11d.com/blog/scaling-dagster-kubernetes-multi-code-locations
Investigating and fixing "StopPodSandbox from runtime service failed" Kubelet errors
https://marcusnoble.co.uk/2025-09-28-investigating-and-fixing-stoppodsandbox-from-runtime-service-failed-kubelet-errors
Transforming Kubernetes Secret Management Best Practices
https://medium.com/@nitinyadav745/transforming-kubernetes-secret-management-best-practices-804e993a22d9
HOWTO: Use SimKube for Cost Forecasting
https://blog.appliedcomputing.io/p/howto-use-simkube-for-cost-forecasting
Recently, I’ve had a number of folks ask for some more details about how SimKube can be used to predict or forecast your Kubernetes expenditures, and I realized that I’ve said you can do this several times, but I’ve never actually gone through the details! So this post will show you how.
kanidm
https://github.com/kanidm/kanidm
Kanidm is a simple and secure identity management platform, allowing other applications and services to offload the challenge of authenticating and storing identities to Kanidm.
The goal of this project is to be a complete identity provider, covering the broadest possible set of requirements and integrations. You should not need any other components (like Keycloak) when you use Kanidm - we already have everything you need!
To achieve this we rely heavily on strict defaults, simple configuration, and self-healing components. This allows Kanidm to support small home labs, families, small businesses, and all the way to the largest enterprise needs.
If you want to host your own authentication service, then Kanidm is for you!
kubernetes-nmstate
https://github.com/nmstate/kubernetes-nmstate
Declarative node network configuration driven through Kubernetes API.
zot
https://github.com/project-zot/zot
zot: a production-ready vendor-neutral OCI image registry - images stored in OCI image format, distribution specification on-the-wire, that's it!
juicefs
https://github.com/juicedata/juicefs
JuiceFS is a high-performance POSIX file system released under Apache License 2.0, particularly designed for the cloud-native environment. The data, stored via JuiceFS, will be persisted in Object Storage (e.g. Amazon S3), and the corresponding metadata can be persisted in various compatible database engines such as Redis, MySQL, and TiKV based on the scenarios and requirements.
With JuiceFS, massive cloud storage can be directly connected to big data, machine learning, artificial intelligence, and various application platforms in production environments. Without modifying code, the massive cloud storage can be used as efficiently as local storage.
kubernetes-autoscaling-mixin
https://github.com/adinhodovic/kubernetes-autoscaling-mixin
A set of Grafana dashboards and Prometheus alerts for Kubernetes Autoscaling using the metrics from Kube-state-metrics, Karpenter, and Cluster-autoscaler.
phoenix
https://github.com/Arize-ai/phoenix
Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting.
NVIDIA HGX Containers as a Service
https://topofmind.dev/blog/2025/10/21/gpu-based-containers-as-a-service
Bifrost’s journey from Nginx to Envoy gateway for intelligent rate limiting
https://medium.com/the-heimdall-platform/bifrosts-journey-from-nginx-to-envoy-gateway-for-intelligent-rate-limiting-3215d19bc315
Building Production-Ready Multi-Agent Systems on Kubernetes: Real Lessons from Deploying 11 Specialized AI Agents
https://aws.plainenglish.io/building-production-ready-multi-agent-systems-on-kubernetes-real-lessons-from-deploying-11-b01976cd4236
Kubernetes security fundamentals: Networking
https://securitylabs.datadoghq.com/articles/kubernetes-security-fundamentals-part-6
Debugging the One-in-a-Million Failure: Migrating Pinterest’s Search Infrastructure to Kubernetes
https://medium.com/pinterest-engineering/debugging-the-one-in-a-million-failure-migrating-pinterests-search-infrastructure-to-kubernetes-bef9af9dabf4
Stretching a Layer 2 network over multiple KubeVirt clusters
https://kubevirt.io/2025/Stretched-layer2-network-between-clusters.html