Job VC
Kubernetes Health / Drift / Rollout Engineer
Description
We are looking for an experienced Kubernetes Health / Drift / Rollout Engineer.
This role owns the operational automation that makes the Kubernetes platform trustworthy in production.
What This Person Will Build
Kubernetes profile drift detection against the expected cluster configuration
Node and cluster health monitoring with degraded/failed states
Staged rollout automation for OS image and Kubernetes upgrades
Canary, percentage-based rollout, halt, rollback, and recovery logic
Node replacement flows and capacity updates after failures
Audit events and operational diagnostics for all lifecycle actions
Must-Have Background
Strong Go and Kubernetes client-go experience
Experience building Kubernetes controllers, operators, reconcilers, health monitors, rollout systems, or upgrade automation
Deep understanding of node lifecycle, cordon/drain, PodDisruptionBudgets, device plugins, cluster upgrades, and failure recovery
Production operations mindset around rollback, blast radius, staged deployment, and observability
Experience debugging real Kubernetes incidents
Nice to Have
Experience with Argo Rollouts, Flux, Cluster API, Rancher, OpenShift, Kured, node remediation, edge clusters, or multi-site cluster operations.