Running Kubernetes in Production

Learn best practices for running Kubernetes in production environments. Understand high availability, disaster recovery, backup strategies, and production-ready configurations.

1. Production Readiness
2. High Availability
3. Disaster Recovery
4. Backup Strategies
5. Monitoring and Alerting
6. Security Hardening
7. Performance Optimization
8. Operational Practices

1. Production Readiness

Before deploying to production, ensure:

High availability configuration
Backup and disaster recovery plans
Monitoring and alerting in place
Security hardening applied
Resource limits and quotas configured
Documentation complete
Runbooks prepared
Disaster recovery tested

1.1 Production Checklist

✓ Multiple control plane nodes for HA
✓ etcd cluster with backups
✓ Resource quotas and limits
✓ Network policies configured
✓ RBAC properly configured
✓ Monitoring and logging
✓ Health checks configured
✓ Pod disruption budgets
✓ Backup procedures tested
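
The "resource quotas and limits" item can be sketched as a namespace-level ResourceQuota paired with a LimitRange that supplies defaults for containers that omit them. The namespace name `production` and every quantity below are illustrative assumptions, not recommendations from the checklist.

```yaml
# Sketch only: the namespace and all quantities are assumed values.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:   # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:          # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```

Once a quota covers CPU and memory, Pods that declare neither requests nor limits are rejected in that namespace, which is why the LimitRange defaults are included alongside it.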

2. High Availability

2.1 Control Plane HA

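Run multiple control plane nodes:

Minimum 3 control plane nodes (an odd number, so etcd keeps quorum if one node fails)
Distribute them across availability zones
Put a load balancer in front of the API Server
etcd cluster with 3 or 5 members (tolerates 1 or 2 member failures respectively)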
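
For clusters bootstrapped with kubeadm (an assumption; the text does not name an installer), the load-balanced API endpoint is usually fixed at init time. The hostname `k8s-api.example.internal` is a placeholder for the load balancer address.

```yaml
# kubeadm-config.yaml
# Sketch only: assumes kubeadm; the endpoint hostname is hypothetical.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
# Address of the load balancer that fronts every API Server instance.
# Nodes and clients talk to this name, never to a single control plane node.
controlPlaneEndpoint: "k8s-api.example.internal:6443"
```

The first node would then be initialized with `kubeadm init --config kubeadm-config.yaml --upload-certs`, and the remaining control plane nodes joined with `kubeadm join` and its `--control-plane` flag.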

2.2 Application HA

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-deployment
spec:
  replicas: 3  # Multiple replicas
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # Never drop below the desired replica count during a rollout
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - myapp
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app
        image: myapp:1.0
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
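
Anti-affinity on `kubernetes.io/hostname` spreads replicas across nodes but not across zones. One possible addition, assuming nodes carry the well-known `topology.kubernetes.io/zone` label (cloud providers usually set it), is a topology spread constraint placed under `spec.template.spec` of the Deployment above:

```yaml
# Fragment: merge under spec.template.spec of the Deployment.
# Assumes nodes are labeled with topology.kubernetes.io/zone.
topologySpreadConstraints:
- maxSkew: 1                            # zones may differ by at most one replica
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but do not block scheduling
  labelSelector:
    matchLabels:
      app: myapp
```

`ScheduleAnyway` keeps this a soft preference, matching the `preferredDuringScheduling` anti-affinity already in the manifest; `DoNotSchedule` would make it a hard requirement.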

2.3 Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2  # Keep at least 2 Pods available
  selector:
    matchLabels:
      app: myapp

3. Disaster Recovery

3.1 DR Strategy

Plan for different disaster scenarios:

RTO (Recovery Time Objective): Maximum acceptable downtime
RPO (Recovery Point Objective): Maximum acceptable data loss
Multi-region deployment
Regular backup testing
Documented recovery procedures

3.2 Multi-Region Setup

Deploy clusters in multiple regions
Use DNS for failover
Replicate data across regions
Test failover procedures regularly

3.3 Application-Level DR

Design applications for multi-region
Accept eventual consistency where the data model tolerates it
Implement idempotent operations
Handle network partitions gracefully

4. Backup Strategies

4.1 etcd Backups

etcd holds all cluster state, so regular etcd backups are critical:

# Manual etcd backup
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snapshot.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/etcd.crt \
  --key=/etc/etcd/etcd.key

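# Restore from backup into a fresh data directory
# (etcd 3.5 deprecates this subcommand in favor of `etcdutl snapshot restore`)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir=/var/lib/etcd-restored
# Then point etcd's --data-dir at /var/lib/etcd-restored and restart it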
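
To make the snapshot regular rather than manual, one option is a CronJob pinned to a control plane node. This is a sketch under several assumptions: etcd runs on the control plane nodes, the certificate and backup paths match the manual command above, the image tag matches your etcd version, and `etcdctl` is on the image's PATH.

```yaml
# Sketch: nightly etcd snapshot taken from a control plane node.
# Assumed: image tag, node label, and host paths (copied from the manual command).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 1 * * *"          # every day at 01:00
  concurrencyPolicy: Forbid      # never run two snapshots at once
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true      # reach etcd on 127.0.0.1:2379
          restartPolicy: OnFailure
          nodeSelector:
            node-role.kubernetes.io/control-plane: ""
          tolerations:
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: etcd-backup
            image: registry.k8s.io/etcd:3.5.9-0   # assumed tag
            command:
            - etcdctl
            - snapshot
            - save
            - /backup/etcd-snapshot.db
            - --endpoints=https://127.0.0.1:2379
            - --cacert=/etc/etcd/ca.crt
            - --cert=/etc/etcd/etcd.crt
            - --key=/etc/etcd/etcd.key
            volumeMounts:
            - name: etcd-certs
              mountPath: /etc/etcd
              readOnly: true
            - name: backup
              mountPath: /backup
          volumes:
          - name: etcd-certs
            hostPath:
              path: /etc/etcd
          - name: backup
            hostPath:
              path: /backup
```

The snapshot lands on the node's own disk under a fixed file name, so a follow-up step that copies it to separate storage is still needed.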

4.2 Velero

Velero backs up and restores Kubernetes resources and persistent volumes:

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket my-backup-bucket \
  --secret-file ./credentials-velero

# Create backup
velero backup create my-backup

# Schedule backups
velero schedule create daily-backup --schedule="0 2 * * *"

# Restore from backup
velero restore create --from-backup my-backup
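
The schedule command above creates a Velero `Schedule` resource, which can also be kept in version control as a manifest. A minimal sketch, assuming Velero is installed in its default `velero` namespace; the `ttl` and namespace selection are illustrative choices:

```yaml
# Declarative counterpart of the daily-backup schedule.
# Assumed: Velero runs in the "velero" namespace; ttl is an example value.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"      # same cron expression as the CLI example
  template:
    includedNamespaces:
    - "*"                    # back up every namespace
    ttl: 720h0m0s            # keep each backup for 30 days
```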

4.3 Application Data Backups

Backup databases regularly
Use database-specific backup tools
Test restore procedures
Store backups in separate location
Encrypt backups
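
As one illustration of a database-specific tool, the sketch below runs `pg_dump` on a schedule. Everything specific in it is an assumption: a PostgreSQL database reachable as `app-db`, a Secret named `app-db-credentials` with `username` and `password` keys, and a PersistentVolumeClaim named `db-backups`.

```yaml
# Sketch: nightly logical backup of a PostgreSQL database.
# All names (host, Secret, PVC, database) are hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: app-db-backup
spec:
  schedule: "30 2 * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: pg-dump
            image: postgres:16
            command:
            - sh
            - -c
            # Custom-format dump, one file per day; restorable with pg_restore.
            - pg_dump --format=custom --file=/backups/app-`date +%Y%m%d`.dump
            env:
            - name: PGHOST
              value: app-db
            - name: PGDATABASE
              value: app
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: app-db-credentials
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: app-db-credentials
                  key: password
            volumeMounts:
            - name: backups
              mountPath: /backups
          volumes:
          - name: backups
            persistentVolumeClaim:
              claimName: db-backups
```

This covers only the "backup regularly" item; the dump files would still need to be encrypted and copied to a separate location to satisfy the rest of the list.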

5. Monitoring and Alerting

5.1 Key Metrics

Monitor these critical metrics:

Cluster health (Node status, API Server availability)
Resource utilization (CPU, memory, disk)
Application metrics (latency, error rates, throughput)
Pod restarts and failures
Network connectivity
Storage usage

5.2 Alerting Rules

groups:
- name: critical
  rules:
  # The job label must match your Prometheus scrape configuration.
  - alert: APIServerDown
    expr: up{job="kubernetes-apiservers"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Kubernetes API Server {{ $labels.instance }} is down"

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage on node {{ $labels.instance }} is above 90%"

  # More than 3 restarts in 15 minutes; a single restart should not page anyone.
  - alert: PodCrashLooping
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
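
Section 5.1 also lists node status and disk usage, which the rules above do not cover. A possible extension, assuming kube-state-metrics and node_exporter are being scraped (the thresholds are illustrative):

```yaml
# Sketch: additional rules for node status and disk usage.
# Assumes kube-state-metrics and node_exporter metrics are available.
groups:
- name: cluster-health
  rules:
  - alert: NodeNotReady
    expr: kube_node_status_condition{condition="Ready",status="true"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node {{ $labels.node }} has not been Ready for 5 minutes"

  - alert: NodeDiskAlmostFull
    # Fraction of free space, ignoring in-memory and overlay filesystems.
    expr: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"} < 0.1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} has less than 10% free"
```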

5.3 Logging

Centralize logs (ELK, Loki)
Set appropriate retention
Index by namespace and application
Monitor log volume
Set up log-based alerts

6. Security Hardening

6.1 Cluster Security

Enable RBAC
Use least privilege principle
Enable Pod Security Standards
Implement Network Policies
Enable encryption at rest
Use mTLS for service communication
Regular security audits

6.2 Node Security

Keep Nodes updated
Audit Nodes against the CIS Kubernetes Benchmark
Minimize Node access
Use a secure container runtime
Enable the Node authorizer

6.3 Image Security

Scan images for vulnerabilities
Use minimal base images
Sign images cryptographically
Enforce image policies
Keep images updated
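
"Enforce image policies" can take several forms. One sketch, assuming Kubernetes 1.30 or later (where ValidatingAdmissionPolicy is generally available) and a hypothetical registry `registry.example.com`, rejects Pods whose containers pull from anywhere else:

```yaml
# Sketch: allow only images from one registry in the "production" namespace.
# Assumed: Kubernetes 1.30+, registry hostname, and namespace name.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-image-registry
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["pods"]
  validations:
  # CEL expression: every regular container must use the approved registry.
  - expression: "object.spec.containers.all(c, c.image.startsWith('registry.example.com/'))"
    message: "Container images must come from registry.example.com"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: restrict-image-registry-binding
spec:
  policyName: restrict-image-registry
  validationActions: ["Deny"]
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: production
```

The expression checks only `spec.containers`; init and ephemeral containers would need their own validations. Signature verification is outside what this mechanism does and needs a dedicated admission controller.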

7. Performance Optimization

7.1 Resource Management

Set appropriate resource requests and limits
Use HPA for automatic scaling
Implement Cluster Autoscaler
Monitor and optimize resource usage
Use resource quotas
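
For the HPA item, a minimal sketch that targets the `app-deployment` Deployment from section 2.2. The replica bounds and the 70% CPU target are assumed values.

```yaml
# Sketch: scale app-deployment on average CPU utilization.
# Assumed: replica bounds and the utilization target.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-deployment
  minReplicas: 3             # never fewer than the HA baseline
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # percent of the containers' CPU requests
```

Utilization is measured against CPU requests, so the containers must declare them, and a metrics source such as metrics-server has to be running.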

7.2 Network Optimization

Use appropriate CNI plugin
Optimize Service endpoints
Consider IPVS mode for kube-proxy in clusters with many Services
Implement network policies efficiently
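
Switching kube-proxy to IPVS is a configuration change. A sketch of the relevant part of its configuration file; the round-robin scheduler is an assumed choice:

```yaml
# Sketch: kube-proxy configuration selecting IPVS mode.
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "rr"    # round-robin; assumed choice
```

IPVS mode depends on the IPVS kernel modules being available on every Node.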

7.3 Storage Optimization

Use appropriate storage classes
Enable volume expansion
Monitor storage usage
Implement storage quotas
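
Volume expansion and storage quotas are both declared in manifests. The sketch below assumes the AWS EBS CSI driver (the document's Velero example also targets AWS), a class named `fast-expandable`, and a `production` namespace; the sizes are illustrative.

```yaml
# Sketch: expandable StorageClass plus a per-namespace storage quota.
# Assumed: AWS EBS CSI driver, class name, namespace, and all sizes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-expandable
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
allowVolumeExpansion: true               # PVCs of this class can be resized
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer  # provision in the zone where the Pod lands
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
  namespace: production
spec:
  hard:
    requests.storage: 500Gi              # total across all classes
    persistentvolumeclaims: "20"
    # Cap for this class only.
    fast-expandable.storageclass.storage.k8s.io/requests.storage: 200Gi
```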

8. Operational Practices

8.1 Change Management

Use GitOps for configuration
Review changes before applying
Test in staging first
Use canary deployments
Maintain rollback capability
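
The text does not name a GitOps tool. As one illustration, assuming Argo CD is installed in the `argocd` namespace, an Application resource ties a Git path to a cluster namespace. The repository URL and paths are hypothetical.

```yaml
# Sketch: GitOps with Argo CD (an assumed tool choice).
# Repository URL, path, and namespaces are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/k8s-config.git
    targetRevision: main
    path: apps/myapp/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual changes made in the cluster
```

With this model, review happens on the pull request and rollback is a Git revert.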

8.2 Documentation

Document architecture
Maintain runbooks
Document procedures
Keep diagrams updated
Document dependencies

8.3 Incident Response

Define incident response procedures
Set up on-call rotation
Use incident management tools
Conduct post-mortems
Learn from incidents

8.4 Regular Maintenance

Update Kubernetes regularly
Update container images
Review and optimize configurations
Clean up unused resources
Review security policies
Test disaster recovery

8.5 Capacity Planning

Monitor resource trends
Plan for growth
Set up autoscaling
Review capacity regularly
Plan for peak loads

Summary: Running Kubernetes in production requires high availability, disaster recovery planning, comprehensive backups, monitoring and alerting, security hardening, and operational best practices. Use multiple control plane nodes, implement Pod disruption budgets, set up regular etcd and Velero backups, monitor critical metrics, and follow change management procedures.