Running Kubernetes in Production

Learn best practices for running Kubernetes in production environments. Understand high availability, disaster recovery, backup strategies, and production-ready configurations.

1. Production Readiness

Before deploying to production, ensure:

  • High availability configuration
  • Backup and disaster recovery plans
  • Monitoring and alerting in place
  • Security hardening applied
  • Resource limits and quotas configured
  • Documentation complete
  • Runbooks prepared
  • Disaster recovery tested

1.1 Production Checklist

  • ✓ Multiple control plane nodes for HA
  • ✓ etcd cluster with backups
  • ✓ Resource quotas and limits
  • ✓ Network policies configured
  • ✓ RBAC properly configured
  • ✓ Monitoring and logging
  • ✓ Health checks configured
  • ✓ Pod disruption budgets
  • ✓ Backup procedures tested
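
As an illustration of the "Resource quotas and limits" item, a namespace can carry both a ResourceQuota and a LimitRange. This is a minimal sketch: the namespace `production`, the object names, and every numeric value are assumptions chosen for illustration, not recommendations.

```yaml
# Caps the total resources that all Pods in the namespace may request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # hypothetical name
  namespace: production       # assumed namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
# Supplies default requests and limits to containers that declare none
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults    # hypothetical name
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
```

Pairing the two matters because a quota on requests and limits rejects Pods that do not declare them, and the LimitRange fills in those values for containers that omit them.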

2. High Availability

2.1 Control Plane HA

Run multiple control plane nodes:

  • At least 3 control plane nodes (an odd number, so etcd keeps quorum when one node fails)
  • Distribute them across availability zones
  • Put a load balancer in front of the API Server
  • Run an etcd cluster with 3 or 5 members
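
If the cluster is built with kubeadm (an assumption, since the setup tool is not specified here), the load balancer address is declared once as the control plane endpoint. The sketch below uses a hypothetical DNS name and an example Kubernetes version.

```yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.29.0                               # example version, adjust to yours
# Address of the load balancer in front of all API Servers (hypothetical name)
controlPlaneEndpoint: "k8s-api.example.internal:6443"
etcd:
  local:
    dataDir: /var/lib/etcd                               # stacked etcd on each control plane node
```

With a file like this, the first node would typically be initialized with `kubeadm init --config <file> --upload-certs`, and the remaining control plane nodes joined with `kubeadm join ... --control-plane`.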

2.2 Application HA

Run several replicas, spread them across Nodes with Pod anti-affinity, and give each container liveness and readiness probes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
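  replicas: 3  # Multiple replicas so one Pod failure does not cause an outage
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Start one extra Pod before removing an old one
      maxUnavailable: 0  # Never drop below the desired replica count during a rollout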
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: myapp:1.0
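        # Restart the container if it stops answering on /health
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        # Send Service traffic only to Pods that answer on /ready
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5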
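
The anti-affinity rule above spreads replicas across Nodes but not across availability zones. One possible addition, sketched here under the assumption that Nodes carry the standard `topology.kubernetes.io/zone` label, is a topology spread constraint in the same Pod template:

```yaml
# Fragment of spec.template.spec in the Deployment above, not a complete manifest
topologySpreadConstraints:
- maxSkew: 1                                  # zones may differ by at most one replica
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway           # soft rule; DoNotSchedule would make it strict
  labelSelector:
    matchLabels:
      app: myapp
```

`ScheduleAnyway` keeps the constraint a preference, matching the soft anti-affinity rule, so a zone outage does not leave replacement Pods unschedulable.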

2.3 Pod Disruption Budget

A PodDisruptionBudget limits how many Pods a voluntary disruption, such as a Node drain, can take down at the same time:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Keep at least 2 Pods available
  selector:
    matchLabels:
      app: myapp

3. Disaster Recovery

3.1 DR Strategy

Plan for different disaster scenarios:

  • RTO (Recovery Time Objective): Maximum acceptable downtime
  • RPO (Recovery Point Objective): Maximum acceptable data loss
  • Multi-region deployment
  • Regular backup testing
  • Documented recovery procedures

3.2 Multi-Region Setup

  • Deploy clusters in multiple regions
  • Use DNS-based failover (health-checked records with short TTLs)
  • Replicate data across regions
  • Test failover procedures regularly

3.3 Application-Level DR

  • Design applications to run in multiple regions
  • Tolerate eventual consistency where strict consistency is not required
  • Make operations idempotent so retries after a failover are safe
  • Handle network partitions gracefully

4. Backup Strategies

4.1 etcd Backups

Regular etcd backups are critical:

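# Manual etcd backup (run on a control plane node; adjust the certificate paths to your cluster)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/etcd.crt \
  --key=/etc/etcd/etcd.key

# Restore from backup into a new data directory, then point etcd at that directory.
# On etcd 3.5 and later, "etcdutl snapshot restore" is the preferred form of this command.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored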
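
To make the backup regular rather than manual, the same snapshot command can run on a schedule. The CronJob below is a sketch under several assumptions: etcd runs on the control plane nodes and listens on 127.0.0.1:2379, the certificates live under /etc/etcd as in the command above, Nodes carry the `node-role.kubernetes.io/control-plane` label and taint, and the image tag matches your etcd version. The object name and schedule are hypothetical.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot                # hypothetical name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # every 6 hours; choose an interval that meets your RPO
  concurrencyPolicy: Forbid          # never run two snapshots at once
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          hostNetwork: true          # needed to reach etcd on the node's loopback address
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.12-0   # assumed tag; match your etcd version
            command:
            - etcdctl
            - snapshot
            - save
            - /backup/etcd-snapshot.db
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/etcd/ca.crt
            - --cert=/etc/etcd/etcd.crt
            - --key=/etc/etcd/etcd.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/etcd
          - name: backup
            hostPath:
              path: /backup
              type: DirectoryOrCreate
```

Each run overwrites the same file on the Node, so a real setup would also copy the snapshot to storage outside the cluster.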

4.2 Velero

Velero backs up and restores Kubernetes resources and, with a suitable plugin, persistent volumes:

# Install Velero (AWS example; pin a plugin version that matches your Velero release)
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket my-backup-bucket \
  --secret-file ./credentials-velero

# Create an on-demand backup
velero backup create my-backup

# Schedule a daily backup at 02:00
velero schedule create daily-backup --schedule="0 2 * * *"

# Restore from a backup
velero restore create --from-backup my-backup
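
The schedule can also be kept in version control as a Velero `Schedule` resource instead of being created from the CLI. This is a sketch: the `velero` namespace is the usual install default, while the included namespace `production` and the 30-day retention are assumptions for illustration.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero          # namespace where Velero is installed
spec:
  schedule: "0 2 * * *"      # same cron expression as the CLI example
  template:
    includedNamespaces:
    - production             # assumed namespace; omit the field to back up all namespaces
    ttl: 720h0m0s            # keep each backup for 30 days (assumed retention)
```

Declaring the schedule as a manifest lets it go through the same review and rollout process as the rest of the cluster configuration.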

4.3 Application Data Backups

  • Backup databases regularly
  • Use database-specific backup tools
  • Test restore procedures
  • Store backups in separate location
  • Encrypt backups

5. Monitoring and Alerting

5.1 Key Metrics

Monitor these critical metrics:

  • Cluster health (Node status, API Server availability)
  • Resource utilization (CPU, memory, disk)
  • Application metrics (latency, error rates, throughput)
  • Pod restarts and failures
  • Network connectivity
  • Storage usage

5.2 Alerting Rules

Example Prometheus alerting rules:

groups:
- name: critical
  rules:
  - alert: ClusterDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes API Server is down"
  
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Node memory usage is above 90%"
  
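  - alert: PodCrashLooping
    # More than 3 restarts in 15 minutes, so a single restart does not raise a critical alert
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"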
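
The rules above leave Node status and disk usage from the key metrics list uncovered. The following group is a sketch of how they might be added, assuming kube-state-metrics and node_exporter are being scraped; the thresholds and durations are illustrative.

```yaml
groups:
- name: cluster-health
  rules:
  - alert: NodeNotReady
    # kube-state-metrics reports 1 for the condition status that currently holds
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has not been Ready for 10 minutes"

  - alert: NodeDiskAlmostFull
    # Fraction of free space per filesystem, ignoring in-memory and overlay mounts
    expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.10
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% space left"
```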

5.3 Logging

  • Centralize logs (ELK, Loki)
  • Set appropriate retention
  • Index by namespace and application
  • Monitor log volume
  • Set up log-based alerts

6. Security Hardening

6.1 Cluster Security

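  • Enable RBAC
  • Apply the principle of least privilege
  • Enforce Pod Security Standards with Pod Security Admission
  • Implement Network Policies
  • Enable encryption at rest for Secrets stored in etcd
  • Use mTLS for service-to-service communication (for example, through a service mesh)
  • Run regular security audits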
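
Two of these items can be expressed per namespace with built-in resources. The sketch below assumes a namespace named `production`: it enforces the `restricted` Pod Security Standard through namespace labels and adds a default-deny ingress NetworkPolicy. The policy only takes effect when the cluster's CNI plugin implements NetworkPolicy.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production                                   # assumed namespace
  labels:
    # Reject Pods that violate the restricted profile
    pod-security.kubernetes.io/enforce: restricted
    # Also warn the client when a manifest would violate it
    pod-security.kubernetes.io/warn: restricted
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress                         # hypothetical name
  namespace: production
spec:
  podSelector: {}        # empty selector matches every Pod in the namespace
  policyTypes:
  - Ingress              # no ingress rules listed, so all inbound traffic is denied
```

Starting from default deny means each additional NetworkPolicy documents one flow that is explicitly allowed.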

6.2 Node Security

  • Keep Nodes patched and up to date
  • Audit Nodes against the CIS Kubernetes Benchmark
  • Minimize direct access to Nodes
  • Use a secure container runtime
  • Enable the Node authorizer and the NodeRestriction admission plugin

6.3 Image Security

  • Scan images for vulnerabilities
  • Use minimal base images
  • Sign images cryptographically
  • Enforce image policies
  • Keep images updated

7. Performance Optimization

7.1 Resource Management

  • Set appropriate resource requests and limits
  • Use the Horizontal Pod Autoscaler (HPA) to scale Pods automatically
  • Run the Cluster Autoscaler to add and remove Nodes
  • Monitor and optimize resource usage
  • Use resource quotas
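
As a sketch of the autoscaling item, the HorizontalPodAutoscaler below targets the `app-deployment` Deployment from section 2.2. The replica bounds and the 70% CPU target are assumptions, and the example presumes a metrics source such as metrics-server is installed.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa                  # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 3                 # never scale below the HA baseline
  maxReplicas: 10                # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the CPU request, averaged over all Pods
```

Utilization is computed against each container's CPU request, so the target Deployment's containers need `resources.requests.cpu` set for this to work.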

7.2 Network Optimization

  • Use appropriate CNI plugin
  • Optimize Service endpoints
  • Consider IPVS mode for kube-proxy in clusters with many Services
  • Implement network policies efficiently
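
The kube-proxy mode is set in its configuration file. A minimal sketch, assuming the Nodes have the IPVS kernel modules available and that round-robin scheduling is acceptable:

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"          # the default on Linux is iptables
ipvs:
  scheduler: "rr"     # round-robin across Service endpoints
```

On kubeadm-based clusters this configuration usually lives in the `kube-proxy` ConfigMap in `kube-system`, and the kube-proxy Pods must be restarted to pick up a change.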

7.3 Storage Optimization

  • Use appropriate storage classes
  • Enable volume expansion
  • Monitor storage usage
  • Implement storage quotas
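
Volume expansion and storage quotas can both be declared as manifests. The sketch below assumes the AWS EBS CSI driver as provisioner, in line with the AWS-based Velero example; the names, namespace, and sizes are illustrative.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-ssd                      # hypothetical name
provisioner: ebs.csi.aws.com                # assumed CSI driver
parameters:
  type: gp3
allowVolumeExpansion: true                  # allows growing a PVC by editing its requested size
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer     # provision the volume in the zone where the Pod lands
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota                       # hypothetical name
  namespace: production                     # assumed namespace
spec:
  hard:
    requests.storage: 500Gi                 # total capacity all PVCs in the namespace may request
    persistentvolumeclaims: "20"            # maximum number of PVCs
```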

8. Operational Practices

8.1 Change Management

  • Use GitOps for configuration
  • Review changes before applying
  • Test in staging first
  • Use canary deployments
  • Maintain rollback capability

8.2 Documentation

  • Document architecture
  • Maintain runbooks
  • Document procedures
  • Keep diagrams updated
  • Document dependencies

8.3 Incident Response

  • Define incident response procedures
  • Set up on-call rotation
  • Use incident management tools
  • Conduct post-mortems
  • Learn from incidents

8.4 Regular Maintenance

  • Update Kubernetes regularly
  • Update container images
  • Review and optimize configurations
  • Clean up unused resources
  • Review security policies
  • Test disaster recovery

8.5 Capacity Planning

  • Monitor resource trends
  • Plan for growth
  • Set up autoscaling
  • Review capacity regularly
  • Plan for peak loads

Summary: Running Kubernetes in production requires high availability, disaster recovery planning, comprehensive backups, monitoring and alerting, security hardening, and disciplined operational practices. Use multiple control plane nodes, implement Pod disruption budgets, back up etcd and cluster resources regularly (for example, with Velero), monitor critical metrics, and follow change management procedures.
