Health Management
Enterprise Kubernetes deployments require robust health management strategies to ensure reliability, performance, and availability. This guide covers advanced techniques for maintaining healthy Kubernetes clusters at scale.
Real-Life Health Management Strategies
- Multi-cluster Health Dashboards: Implement centralized observability platforms (Grafana/Prometheus) that aggregate health metrics across all clusters in your fleet.
- Capacity Forecasting: Use historical resource consumption data to predict future capacity needs and automate scaling operations before constraints impact performance.
- Kubernetes Control Plane Monitoring: Implement dedicated monitoring for API server, etcd, scheduler, and controller-manager components with automated alerting.
- Failure Domain Isolation: Design clusters to withstand the failure of entire regions, availability zones, or control plane components.
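As one possible illustration of the capacity forecasting idea above, the following sketch uses a Prometheus Operator `PrometheusRule` with the PromQL function `predict_linear` to warn before a node filesystem fills up. The rule name, the 6-hour lookback, and the 24-hour horizon are assumptions chosen for illustration, and the `node_filesystem_avail_bytes` metric assumes node_exporter is deployed. Your Prometheus instance may also require specific labels on the rule object before it loads it.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-forecast            # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: capacity.rules
      rules:
        - alert: NodeDiskPredictedFull
          # Fit a linear trend to the last 6h of free space and project it 24h ahead.
          expr: predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 24 * 3600) < 0
          for: 1h
          labels:
            severity: warning
          annotations:
            summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24 hours"
```

The same pattern can be applied to other slowly growing resources, such as persistent volume usage, by swapping the metric and adjusting the horizon.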
Advanced Monitoring Setup
- Comprehensive Metric Collection:

      apiVersion: monitoring.coreos.com/v1
      kind: PodMonitor
      metadata:
        name: app-metrics
        namespace: monitoring
      spec:
        selector:
          matchLabels:
            app.kubernetes.io/component: backend
        podMetricsEndpoints:
        - port: metrics
          interval: 15s
          scrapeTimeout: 10s
        namespaceSelector:
          matchNames:
          - production
          - staging

- Control Plane Health Checks:

      # Monitor etcd health
      kubectl -n kube-system exec etcd-master -- etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        endpoint health
      # Check API server health (/healthz is deprecated in favor of /livez and /readyz)
      kubectl get --raw='/livez'
      kubectl get --raw='/readyz?verbose'
      # Check all component statuses (deprecated since Kubernetes v1.19)
      kubectl get componentstatuses

- Extended Node Problem Detection:

      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: node-problem-detector
        namespace: kube-system
      spec:
        selector:
          matchLabels:
            app: node-problem-detector
        template:
          metadata:
            labels:
              app: node-problem-detector
          spec:
            containers:
            - name: node-problem-detector
              # k8s.gcr.io is frozen; images are served from registry.k8s.io
              image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.7
              securityContext:
                privileged: true
              volumeMounts:
              - name: log
                mountPath: /var/log
                readOnly: true
            volumes:
            - name: log
              hostPath:
                path: /var/log
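The manual control plane checks above can be complemented by automated alerting. The sketch below is one way to express that as a `PrometheusRule`. It assumes the Prometheus Operator is installed and that the control plane components are scraped under the job names `apiserver`, `kube-scheduler`, and `kube-controller-manager`; those job names and the rule name are assumptions that depend on your scrape configuration.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: control-plane-health         # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: control-plane.rules
      rules:
        - alert: APIServerDown
          # Fires when no API server target reports as up.
          expr: absent(up{job="apiserver"} == 1)
          for: 5m
          labels:
            severity: critical
        - alert: EtcdNoLeader
          # etcd_server_has_leader is 1 when the member sees a leader, 0 otherwise.
          expr: etcd_server_has_leader == 0
          for: 1m
          labels:
            severity: critical
        - alert: SchedulerDown
          expr: absent(up{job="kube-scheduler"} == 1)
          for: 5m
          labels:
            severity: critical
        - alert: ControllerManagerDown
          expr: absent(up{job="kube-controller-manager"} == 1)
          for: 5m
          labels:
            severity: critical
```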
Proactive Health Maintenance
- Regular etcd Defragmentation:

      # Run etcd defragmentation to reclaim space.
      # Defragment one member at a time: the member blocks reads and writes while it runs.
      kubectl -n kube-system exec etcd-master -- etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        defrag

- Automated Certificate Rotation:

      # Check certificate expiration
      kubeadm certs check-expiration
      # Renew all certificates, then restart the control plane static Pods so they load the new ones
      kubeadm certs renew all

- Cluster Upgrade Validation:

      # Pre-upgrade validation
      kubeadm upgrade plan
      # Apply the upgrade in a controlled manner (replace v1.27.x with the target patch version)
      kubeadm upgrade apply v1.27.x
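To decide when defragmentation is worth running, rather than doing it on a fixed schedule only, one option is to alert on the gap between the allocated and the in-use etcd database size. The sketch below assumes etcd metrics are scraped by Prometheus and that the Prometheus Operator is installed; the rule name and the 50% threshold are illustrative assumptions, not recommended values.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-maintenance             # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: etcd-maintenance.rules
      rules:
        - alert: EtcdDatabaseFragmented
          # Share of the allocated database file that holds no live data.
          expr: (etcd_mvcc_db_total_size_in_bytes - etcd_mvcc_db_total_size_in_use_in_bytes) / etcd_mvcc_db_total_size_in_bytes > 0.5
          for: 30m
          labels:
            severity: warning
          annotations:
            summary: "More than half of the etcd database on {{ $labels.instance }} is reclaimable; consider running defrag"
```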
Cluster Recovery Procedures
- API Server Recovery:

      # Check kubelet logs
      journalctl -u kubelet -f
      # Restart kubelet
      systemctl restart kubelet
      # Check API server pod
      kubectl -n kube-system get pod kube-apiserver-master -o yaml
      # If the API server does not respond to kubectl, inspect its container on the node instead
      crictl ps -a --name kube-apiserver

- etcd Backup and Restore:

      # Create etcd snapshot
      ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
        --cacert=/etc/kubernetes/pki/etcd/ca.crt \
        --cert=/etc/kubernetes/pki/etcd/server.crt \
        --key=/etc/kubernetes/pki/etcd/server.key \
        snapshot save /backup/etcd-snapshot-$(date +%Y-%m-%d).db
      # Restore from snapshot. Restore works on the local file, so no endpoint or TLS flags are needed.
      # --data-dir is an example path; point etcd at the restored directory afterwards.
      ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
        --data-dir /var/lib/etcd-restore

- Node Draining and Recovery:

      # Mark node as unschedulable (optional: drain also cordons the node)
      kubectl cordon node-1
      # Drain a node for maintenance
      kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
      # Re-enable scheduling after maintenance
      kubectl uncordon node-1
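The snapshot command above can be scheduled so that a recent backup always exists before a recovery is needed. The following `CronJob` is a minimal sketch under several assumptions: a kubeadm-style cluster with etcd certificates under `/etc/kubernetes/pki/etcd`, a control plane node label of `node-role.kubernetes.io/control-plane`, and an etcd image tag that matches the etcd version your cluster runs. The object name, schedule, image tag, and fixed snapshot filename are illustrative; the file is overwritten on each run, so copy it off the node if you need history.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot                # hypothetical name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # every six hours; adjust to your recovery point objective
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          hostNetwork: true          # lets the job reach etcd on 127.0.0.1:2379
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule
          containers:
            - name: snapshot
              image: registry.k8s.io/etcd:3.5.9-0   # assumption: use the tag matching your etcd version
              command:
                - etcdctl
                - --endpoints=https://127.0.0.1:2379
                - --cacert=/etc/kubernetes/pki/etcd/ca.crt
                - --cert=/etc/kubernetes/pki/etcd/server.crt
                - --key=/etc/kubernetes/pki/etcd/server.key
                - snapshot
                - save
                - /backup/etcd-snapshot.db
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup
              hostPath:
                path: /backup
                type: DirectoryOrCreate
```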
Advanced Autoscaling
- Multi-dimensional Pod Autoscaling:

      apiVersion: autoscaling/v2
      kind: HorizontalPodAutoscaler
      metadata:
        name: advanced-hpa
        namespace: production
      spec:
        scaleTargetRef:
          apiVersion: apps/v1
          kind: Deployment
          name: web-app
        minReplicas: 3
        maxReplicas: 100
        metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
        - type: External
          external:
            metric:
              name: queue_messages_ready
              selector:
                matchLabels:
                  queue: "worker"
            target:
              type: AverageValue
              averageValue: 30

- Cluster Autoscaler with Node Affinity:

      apiVersion: cluster.k8s.io/v1
      kind: MachineDeployment
      metadata:
        name: gpu-workers
        namespace: kube-system
      spec:
        replicas: 1
        selector:
          matchLabels:
            node-pool: gpu-accelerated
        template:
          spec:
            providerSpec:
              value:
                machineType: g4dn.xlarge
                diskSizeGb: 100
            labels:
              node-pool: gpu-accelerated
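The node pool above carries the label `node-pool: gpu-accelerated`, but a workload still has to ask for it. The sketch below shows how a Deployment might use node affinity to require that label, so that pending replicas can only be placed on, and can trigger scale-up of, the GPU pool. The Deployment name, the image, and the `nvidia.com/gpu` resource (which assumes the NVIDIA device plugin is installed) are hypothetical and not taken from the original guide.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference                # hypothetical workload
  namespace: production
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-inference
  template:
    metadata:
      labels:
        app: gpu-inference
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: schedule only onto nodes labeled node-pool=gpu-accelerated.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-pool
                    operator: In
                    values:
                      - gpu-accelerated
      containers:
        - name: worker
          image: example.com/gpu-inference:1.0   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1                  # assumes the NVIDIA device plugin exposes this resource
```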
Best Practices
- Implement Pod Disruption Budgets for all critical workloads to maintain availability during node maintenance.
- Use multiple Prometheus instances with hierarchical federation for large clusters.
- Run the monitoring stack on dedicated infrastructure so that it stays available when the clusters it monitors have problems.
- Use custom or external metrics for application-specific scaling decisions.
- Implement regular cluster audits for security, resource allocation, and configuration drift.
- Run chaos experiments to validate resilience and recovery procedures.
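For the first practice in this list, a Pod Disruption Budget for the `web-app` Deployment targeted by the autoscaler earlier in this guide might look like the sketch below. The `app: web-app` Pod label and the `minAvailable` value are assumptions; choose a value below the workload's minimum replica count so that node drains can still make progress.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb                  # hypothetical name
  namespace: production
spec:
  # Voluntary evictions (for example kubectl drain) are refused
  # if they would leave fewer than 2 matching Pods available.
  minAvailable: 2
  selector:
    matchLabels:
      app: web-app                   # assumed Pod label
```

With `minReplicas: 3` on the autoscaler, this budget allows one Pod to be evicted at a time at minimum scale.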
Cross-Cloud Health Management
- Unified Monitoring Plane: Implement tools like Thanos or Cortex for cross-cluster, cross-cloud Prometheus federation.
- Standard Health Metrics: Develop organization-wide standard health metrics and SLIs across all clusters.
- Automated Recovery Playbooks: Create recovery procedures that follow one standard structure across the organization, with cloud-specific steps only where providers differ.
- Cross-Cluster Service Discovery: Implement mechanisms for service discovery across multiple clusters.
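A unified monitoring plane generally depends on every Prometheus instance identifying where its data comes from. The fragment of `prometheus.yml` below is a minimal sketch of that idea: `external_labels` attaches cluster and cloud identity to everything the instance exports (Thanos expects these to be unique per instance), and `remote_write` forwards samples to a central Cortex-style endpoint. The label names and values and the URL are placeholders, and a Thanos sidecar setup would not need the `remote_write` section.

```yaml
global:
  scrape_interval: 30s
  external_labels:
    cluster: prod-eu-1               # placeholder: unique per cluster
    cloud: aws                       # placeholder: provider identifier
    region: eu-west-1                # placeholder

remote_write:
  # Placeholder endpoint for a central Cortex installation.
  - url: https://cortex.example.com/api/v1/push
```

Using the same label names in every cluster keeps organization-wide dashboards and SLI queries identical across clouds, with only the label values changing.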