Deploying Kubernetes in production requires careful planning, robust architecture, and adherence to best practices to ensure reliability, security, and performance. This guide covers essential practices for production workloads and comprehensive backup and disaster recovery strategies.
Best Practices for Production Workloads
Production environments demand higher standards of reliability, security, and maintainability. These best practices will help you operate Kubernetes clusters effectively in production scenarios.
Cluster Architecture and Design
High Availability Setup
```yaml
# Example of a highly available control plane
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
apiServer:
  extraArgs:
    advertise-address: 192.168.1.100
  certSANs:
  - "k8s-api.example.com"
  - "192.168.1.100"
  - "192.168.1.101"
  - "192.168.1.102"
controllerManager:
  extraArgs:
    node-monitor-grace-period: "40s"
    node-monitor-period: "5s"
    pod-eviction-timeout: "2m"
etcd:
  external:
    endpoints:
    - "https://192.168.1.100:2379"
    - "https://192.168.1.101:2379"
    - "https://192.168.1.102:2379"
    caFile: "/etc/kubernetes/pki/etcd/ca.crt"
    certFile: "/etc/kubernetes/pki/etcd/server.crt"
    keyFile: "/etc/kubernetes/pki/etcd/server.key"
```
Node Design
- Use separate node pools for different workload types
- Implement dedicated nodes for system components (a taint/toleration sketch follows this list)
- Consider spot instances for fault-tolerant workloads
- Use node auto-scaling groups for dynamic scaling
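To make the dedicated-node pattern concrete, here is a minimal sketch. It assumes a node pool labeled `pool=system` that has been tainted with `dedicated=system:NoSchedule`; the workload name and image are purely illustrative.

```yaml
# Sketch: dedicate a node pool to system components (names are illustrative).
# Taint the pool first, e.g.:
#   kubectl taint nodes -l pool=system dedicated=system:NoSchedule
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: metrics-collector
  template:
    metadata:
      labels:
        app: metrics-collector
    spec:
      nodeSelector:
        pool: system              # schedule only onto the dedicated pool
      tolerations:
      - key: dedicated
        operator: Equal
        value: system
        effect: NoSchedule        # tolerate the taint that keeps other workloads off
      containers:
      - name: collector
        image: metrics-collector:1.0.0   # placeholder image
```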
Security Best Practices
Pod Security Standards
```yaml
# Pod Security Admission configuration
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      enforce-version: "latest"
      audit: "baseline"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      usernames: []
      runtimeClasses: []
      namespaces: ["kube-system"]
```
Network Policies
```yaml
# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow only necessary communications
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-communication
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
```
RBAC Configuration
```yaml
# Least privilege service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rolebinding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: production
roleRef:
  kind: Role
  name: my-app-role
  apiGroup: rbac.authorization.k8s.io
```
Application Deployment Practices
Resource Management
```yaml
# Pod with resource requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
```
Deployment Strategies
```yaml
# Rolling update deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0.0
        ports:
        - containerPort: 8080
```
Pod Disruption Budget
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
```
Monitoring and Observability
Comprehensive Monitoring Stack
```yaml
# Prometheus monitoring setup
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: k8s
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  ruleSelector: {}
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```
Application Metrics
```yaml
# ServiceMonitor for application metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
    interval: 30s
    path: /metrics
```
Automation and GitOps
GitOps with ArgoCD
```yaml
# Application definition for GitOps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  source:
    repoURL: https://github.com/my-org/gitops-repo.git
    targetRevision: HEAD
    path: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
```
Backup and Disaster Recovery
A comprehensive backup and disaster recovery strategy is essential for production Kubernetes environments. This includes backing up both cluster state and persistent data.
Cluster State Backup
ETCD Backup
```bash
#!/bin/bash
# Script for etcd backup

# Variables
DATE=$(date +%Y%m%d-%H%M%S)
ETCD_ENDPOINTS="https://192.168.1.100:2379,https://192.168.1.101:2379,https://192.168.1.102:2379"
BACKUP_DIR="/backup/etcd"
CERT_DIR="/etc/kubernetes/pki/etcd"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Perform etcd backup
ETCDCTL_API=3 etcdctl \
  --endpoints=${ETCD_ENDPOINTS} \
  --cacert=${CERT_DIR}/ca.crt \
  --cert=${CERT_DIR}/server.crt \
  --key=${CERT_DIR}/server.key \
  snapshot save ${BACKUP_DIR}/snapshot-${DATE}.db

# Verify backup
ETCDCTL_API=3 etcdctl \
  --write-out=table \
  snapshot status ${BACKUP_DIR}/snapshot-${DATE}.db

# Clean up old backups (keep 30 days)
find ${BACKUP_DIR} -name "*.db" -mtime +30 -delete
```
Velero for Kubernetes Backup
```bash
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero

# Schedule daily backup
velero schedule create daily-backup \
  --schedule="0 3 * * *" \
  --include-namespaces production,development \
  --ttl 720h

# One-time backup
velero backup create manual-backup-$(date +%Y%m%d) \
  --include-namespaces production \
  --wait
```
Application Data Backup
Volume Snapshot Configuration
```yaml
# VolumeSnapshotClass for AWS EBS
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: aws-ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
```
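A VolumeSnapshot then references this class to snapshot an individual PVC. In this sketch the namespace and PVC name (`postgres-data`) are placeholders.

```yaml
# Sketch: snapshot a PVC using the class defined above (names are illustrative)
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-data-snapshot
  namespace: production
spec:
  volumeSnapshotClassName: aws-ebs-snapshot-class
  source:
    persistentVolumeClaimName: postgres-data
```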
Application-Consistent Backups
```yaml
# Pre-backup hook for database quiescing: Velero reads these annotations from
# the pod template (not the workload object) and runs pg_dump inside the named
# container before the volume is backed up.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      annotations:
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "pg_dump -Fc mydb > /backup/mydb.dump"]'
    spec:
      containers:
      - name: postgres
        image: postgres:13
        volumeMounts:
        - name: backup-volume
          mountPath: /backup
      volumes:
      - name: backup-volume
        persistentVolumeClaim:
          claimName: backup-pvc
```
Disaster Recovery Plan
Recovery Objectives
- RTO (Recovery Time Objective): 4 hours for critical applications
- RPO (Recovery Point Objective): 15 minutes for critical data (a backup schedule sized to this RPO follows this list)
- Recovery Priority: Tier 1 applications first, then Tier 2, etc.
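As a sketch of how the 15-minute RPO could translate into tooling, the Velero CLI used earlier accepts cron-style schedules; the schedule name and Tier 1 namespaces below are illustrative assumptions.

```bash
# Sketch: back up Tier 1 namespaces every 15 minutes to match the 15-minute RPO
# (schedule name and namespaces are illustrative)
velero schedule create tier1-rpo-backup \
  --schedule="*/15 * * * *" \
  --include-namespaces payments,orders \
  --ttl 72h
```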
Cluster Recovery Procedure
```bash
#!/bin/bash
# Cluster recovery script

# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/snapshot-latest.db \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://192.168.1.100:2380 \
  --name control-plane-1 \
  --initial-cluster control-plane-1=https://192.168.1.100:2380,control-plane-2=https://192.168.1.101:2380,control-plane-3=https://192.168.1.102:2380

# Restore Kubernetes resources with Velero (name the restore so it can be inspected later)
velero restore create production-restore \
  --from-backup daily-backup-20231001 \
  --include-namespaces production \
  --wait

# Verify restoration
kubectl get all -n production
velero restore describe production-restore
```
DR Drills and Testing
```bash
# Test restore in isolated environment
velero restore create --from-backup test-backup \
  --namespace-mappings production:production-test \
  --wait

# Validate application functionality
kubectl get pods -n production-test
kubectl exec -it -n production-test deployment/my-app -- curl localhost:8080/health

# Clean up test environment
kubectl delete namespace production-test
```
Multi-Region Disaster Recovery
Cross-Region Backup
```bash
# Velero with an additional backup location in the DR region
velero backup-location create backup-secondary \
  --provider aws \
  --bucket my-velero-backups-dr \
  --config region=us-east-1

# Schedule a separate daily backup that targets the DR location
velero schedule create daily-backup-dr \
  --schedule="0 4 * * *" \
  --include-namespaces production \
  --storage-location backup-secondary \
  --ttl 720h
```
DNS Failover Configuration
```yaml
# Route53 failover configuration
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-failover
spec:
  endpoints:
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    targets:
    - "192.168.1.100"
    - "192.168.1.101"
    labels:
      aws/weight: "100"
    providerSpecific:
    - name: weight
      value: "100"
    - name: failover
      value: "PRIMARY"
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    targets:
    - "192.168.2.100"
    - "192.168.2.101"
    labels:
      aws/weight: "0"
    providerSpecific:
    - name: weight
      value: "0"
    - name: failover
      value: "SECONDARY"
```
Operational Excellence
Incident Management
- Implement alerting and on-call rotation (an example alert rule follows this list)
- Create runbooks for common failure scenarios
- Establish escalation procedures
- Conduct regular post-mortem analyses
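For the alerting point above, here is a minimal sketch of an alert rule for the Prometheus Operator stack shown earlier; the metric name, threshold, and runbook URL are assumptions for illustration.

```yaml
# Sketch: alert when my-app's 5xx error rate stays above 5% for 10 minutes
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: monitoring
  labels:
    prometheus: k8s
spec:
  groups:
  - name: my-app.rules
    rules:
    - alert: MyAppHighErrorRate
      expr: |
        sum(rate(http_requests_total{app="my-app",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{app="my-app"}[5m])) > 0.05
      for: 10m
      labels:
        severity: page
      annotations:
        summary: "my-app error rate above 5% for 10 minutes"
        runbook_url: "https://runbooks.example.com/my-app/high-error-rate"
```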
Capacity Planning
- Monitor resource utilization trends
- Plan for seasonal traffic patterns
- Implement cluster autoscaling (a workload-level autoscaling sketch follows this list)
- Regularly review and adjust resource requests/limits
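Cluster autoscaling pairs naturally with workload-level autoscaling: the HorizontalPodAutoscaler adds replicas, and the cluster autoscaler adds nodes when those replicas no longer fit. A minimal sketch for the `my-app` Deployment used throughout this guide (thresholds are illustrative):

```yaml
# Sketch: scale my-app between 3 and 20 replicas based on average CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```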
Change Management
- Use GitOps for all configuration changes
- Implement canary deployments and feature flags (a canary rollout sketch follows this list)
- Maintain change windows and maintenance schedules
- Test changes in non-production environments first
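One way to implement canary deployments is Argo Rollouts, which is not otherwise covered in this guide; assuming its controller is installed, a canary strategy might look like the sketch below (weights and pause durations are illustrative).

```yaml
# Sketch: progressive canary rollout of a new my-app version with Argo Rollouts
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
  namespace: production
spec:
  replicas: 5
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.1.0
  strategy:
    canary:
      steps:
      - setWeight: 20            # shift 20% of traffic to the new version
      - pause: {duration: 10m}   # observe metrics before continuing
      - setWeight: 50
      - pause: {duration: 10m}
```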
Running Kubernetes in production requires a comprehensive approach that encompasses architecture design, security, monitoring, and robust disaster recovery practices. By implementing these best practices, organizations can achieve the reliability, security, and performance needed for production workloads while maintaining the flexibility to respond to incidents and scale as needed.