DevOps&SRE Library

Troubleshooting Conan ZFS GitHub ARC Container Initialization slowness

https://daversomethingsomething.medium.com/troubleshooting-conan-zfs-github-arc-container-initialization-slowness-ba3ee7be6fb0

Developing on Raspberry Pi

https://medium.com/@sean.ankenbruck_96245/developing-on-raspberry-pi-9be59b135d23

Hosting and scaling EKS hybrid nodes with KubeVirt and Kube-OVN CNI

https://itnext.io/hosting-and-scaling-eks-hybrid-nodes-with-kubevirt-and-kube-ovn-cni-a9305d1290f8

Mastering GKE Multi-Tenancy: The Power of Namespaces, RBAC, and Quotas

https://immrbhattarai.medium.com/mastering-gke-multi-tenancy-the-power-of-namespaces-rbac-and-quotas-0a01d69dca87

Moving Logic Out of Pods: Extending the Argo Workflows Controller

In this article, I'll show how the Argo Workflows Executor Plugin lets you extend the Argo Workflows controller without maintaining your own fork—simply by implementing a small HTTP server in any language. As a bonus, this same mechanism reduces the number of extra pods in your DAGs and lightens the load on the Kubernetes scheduler. If you're new to Argo, I'll briefly cover the architecture and where plugins fit in. We'll finish with practical examples and key configuration details.

https://hackernoon.com/moving-logic-out-of-pods-extending-the-argo-workflows-controller
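
To make the "small HTTP server" idea concrete, here is a minimal sketch of what such a plugin could look like. It is not taken from the linked article. The request path, the reply shape, and port 4355 are assumptions based on commonly published executor plugin examples and should be checked against the Argo Workflows documentation. The `hello` plugin key and the `execute_template` helper are hypothetical names chosen for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: the controller-side agent POSTs template executions to this path.
EXECUTE_PATH = "/api/v1/template.execute"


def execute_template(request: dict) -> dict:
    """Build the reply for one template execution request."""
    plugin = request.get("template", {}).get("plugin", {})
    # "hello" is a hypothetical plugin key used only in this sketch.
    if "hello" not in plugin:
        # Assumption: an empty reply means "this plugin does not handle it".
        return {}
    name = plugin["hello"].get("name", "world")
    return {"node": {"phase": "Succeeded", "message": f"Hello, {name}!"}}


class PluginHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != EXECUTE_PATH:
            self.send_response(404)
            self.end_headers()
            return
        # Read the JSON body, compute the reply, and send it back as JSON.
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(execute_template(request)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Assumption: 4355 is the port used in typical plugin examples.
    HTTPServer(("", 4355), PluginHandler).serve_forever()
```

Because the work happens inside a request handler rather than in a pod per step, a workflow step served this way would not need its own pod, which is the scheduling saving the blurb refers to.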

k8squest

K8sQuest is a local, game-based Kubernetes training platform with an interactive GUI-like terminal interface. Each mission breaks something in Kubernetes. Your job is to fix it.

https://github.com/Manoj-engineer/k8squest

kimspect

kimspect is a Kubernetes container image inspection tool that provides visibility into the container images running inside your cluster. It can report image information by pod, namespace, and node. Built for performance and reliability, kimspect exposes these insights through a simple command-line interface.

https://github.com/koithos/kimspect

kaos

KAOS is a Kubernetes-native framework for deploying and orchestrating AI agents with tool access, multi-agent coordination, and seamless LLM integration.

https://github.com/axsaucedo/kaos

flux9s

A K9s-inspired terminal UI for monitoring Flux GitOps resources in real-time.

https://github.com/dgunzy/flux9s

nix-csi

Mount /nix into Kubernetes pods using the CSI Ephemeral Volume feature. Volumes share lifetime with Pods and are embedded into the Podspec.

https://github.com/lillecarl/nix-csi

Every layer of review makes you 10x slower

https://apenwarr.ca/log/20260316

cartography

Cartography is a Python tool that maps infrastructure assets and their relationships into a Neo4j-backed graph view.

https://github.com/cartography-cncf/cartography

Stairway to GitOps: Scaling Flux at Morgan Stanley

Morgan Stanley explains how it scaled Flux across 500+ clusters over five years, including security, performance, and observability lessons.

https://fluxcd.io/blog/2026/03/stairway-to-gitops-morgan-stanley

The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

Every container image you pull from registry.k8s.io got there through kpromo, the Kubernetes image promoter. It copies images from staging registries to production, signs them with cosign, replicates signatures across more than 20 regional mirrors, and generates SLSA provenance attestations. If this tool breaks, no Kubernetes release ships. Over the past few weeks, we rewrote its core from scratch, deleted 20% of the codebase, made it dramatically faster, and nobody noticed. That was the whole point.

https://kubernetes.io/blog/2026/03/17/image-promoter-rewrite

Securing Production Debugging in Kubernetes

The post covers safer Kubernetes debugging with least-privilege RBAC, short-lived identity-bound credentials, and audited SSH-style access paths.

https://kubernetes.io/blog/2026/03/18/securing-production-debugging-in-kubernetes

Running Agents on Kubernetes with Agent Sandbox

Agent Sandbox adds a declarative Kubernetes API for isolated, stateful AI agents with strong execution boundaries and stable network identities.

https://kubernetes.io/blog/2026/03/20/running-agents-on-kubernetes-with-agent-sandbox

How Mastodon Runs OpenTelemetry Collectors in Production

At the beginning of 2025, the OpenTelemetry Developer Experience SIG published the results of its first community survey. One of the strongest themes was clear: teams want more real-world examples of how the OpenTelemetry SDKs and the OpenTelemetry Collector are actually used in production.

To help close that gap, the SIG began collecting stories directly from end users across industries, architectures, and company sizes. This post kicks off a new series focused specifically on organizations’ real-world stories, starting with a small but uniquely challenging case.

This first story features Mastodon, a non-profit organization operating at global scale with a remarkably small team.

https://opentelemetry.io/blog/2026/devex-mastodon

Practical Considerations for AI Incident Reviews

The post argues AI-written incident reviews fail without rich cross-system data and human engagement because incident reviews are socio-technical learning work, not just document generation.

https://fgj.codes/posts/ai-incident-reviews

10 Real-World Status Page Examples: And What You Can Learn From Them

The post walks through ten status page examples and highlights clear communication, simple layouts, and expectation-setting details that help users during incidents.

https://uptimerobot.com/blog/10-real-status-page-examples

Disappointing People Early

The post argues teams should make reliability targets, support limits, and roadmap uncertainty explicit early so customers and stakeholders do not build riskier implicit expectations.

https://log.andvari.net/disappointing-people-early.html
