Running Kubernetes in Production

Deploying Kubernetes in production requires careful planning, robust architecture, and adherence to best practices to ensure reliability, security, and performance. This guide covers essential practices for production workloads and comprehensive backup and disaster recovery strategies.

Best Practices for Production Workloads

Production environments demand higher standards of reliability, security, and maintainability. These best practices will help you operate Kubernetes clusters effectively in production scenarios.

Cluster Architecture and Design

High Availability Setup

# Example of a highly available control plane
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
apiServer:
  # per-node advertise addresses belong in InitConfiguration/JoinConfiguration
  # (localAPIEndpoint), not in cluster-wide extraArgs
  certSANs:
  certSANs:
  - "k8s-api.example.com"
  - "192.168.1.100"
  - "192.168.1.101"
  - "192.168.1.102"
controllerManager:
  extraArgs:
    node-monitor-grace-period: "40s"
    node-monitor-period: "5s"
    # note: pod-eviction-timeout is a no-op with taint-based evictions
    # (the default since v1.18); tune per-pod tolerationSeconds instead
etcd:
  external:
    endpoints:
    - "https://192.168.1.100:2379"
    - "https://192.168.1.101:2379"
    - "https://192.168.1.102:2379"
    caFile: "/etc/kubernetes/pki/etcd/ca.crt"
    certFile: "/etc/kubernetes/pki/etcd/server.crt"
    keyFile: "/etc/kubernetes/pki/etcd/server.key"

Node Design

  • Use separate node pools for different workload types
  • Implement dedicated nodes for system components (see the taint/toleration sketch below)
  • Consider spot instances for fault-tolerant workloads
  • Use node auto-scaling groups for dynamic scaling
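
For example, system components can be pinned to dedicated nodes by tainting those nodes and giving the workload a matching toleration and node selector. A minimal sketch (the node name and label are illustrative):

# Dedicate a node, then schedule a system component onto it
# kubectl taint nodes infra-node-1 dedicated=system:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: system-component
spec:
  nodeSelector:
    node-role.example.com/system: "true"  # illustrative label
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "system"
    effect: "NoSchedule"
  containers:
  - name: app
    image: system-component:1.0.0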

Security Best Practices

Pod Security Standards

# Pod Security Admission configuration
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      enforce-version: "latest"
      audit: "baseline"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      usernames: []
      runtimeClasses: []
      namespaces: [kube-system]
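
Pod Security Standards can also be applied per namespace with labels, which is useful for tightening one namespace at a time. For example:

# Per-namespace Pod Security Admission labels
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted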

Network Policies

# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress

---
# Allow only necessary communications
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-communication
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432

RBAC Configuration

# Least privilege service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rolebinding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: production
roleRef:
  kind: Role
  name: my-app-role
  apiGroup: rbac.authorization.k8s.io

Application Deployment Practices

Resource Management

# Pod with resource requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5

Deployment Strategies

# Rolling update deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0.0
        ports:
        - containerPort: 8080

Pod Disruption Budget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

Monitoring and Observability

Comprehensive Monitoring Stack

# Prometheus monitoring setup
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: k8s
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  ruleSelector: {}
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi

Application Metrics

# ServiceMonitor for application metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
    interval: 30s
    path: /metrics

Automation and GitOps

GitOps with ArgoCD

# Application definition for GitOps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  source:
    repoURL: https://github.com/my-org/gitops-repo.git
    targetRevision: HEAD
    path: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas

Backup and Disaster Recovery

A comprehensive backup and disaster recovery strategy is essential for production Kubernetes environments. This includes backing up both cluster state and persistent data.

Cluster State Backup

ETCD Backup

#!/bin/bash
# etcd backup script

# Variables
DATE=$(date +%Y%m%d-%H%M%S)
# snapshot save must be pointed at a single etcd member
ETCD_ENDPOINT="https://192.168.1.100:2379"
BACKUP_DIR="/backup/etcd"
CERT_DIR="/etc/kubernetes/pki/etcd"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Perform etcd backup
ETCDCTL_API=3 etcdctl \
  --endpoints=${ETCD_ENDPOINT} \
  --cacert=${CERT_DIR}/ca.crt \
  --cert=${CERT_DIR}/server.crt \
  --key=${CERT_DIR}/server.key \
  snapshot save ${BACKUP_DIR}/snapshot-${DATE}.db

# Verify backup
ETCDCTL_API=3 etcdctl \
  --write-out=table \
  snapshot status ${BACKUP_DIR}/snapshot-${DATE}.db

# Clean up old backups (keep 30 days)
find ${BACKUP_DIR} -name "*.db" -mtime +30 -delete

Velero for Kubernetes Backup

# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero

# Schedule daily backup
velero schedule create daily-backup \
  --schedule="0 3 * * *" \
  --include-namespaces production,development \
  --ttl 720h

# One-time backup
velero backup create manual-backup-$(date +%Y%m%d) \
  --include-namespaces production \
  --wait

Application Data Backup

Volume Snapshot Configuration

# VolumeSnapshotClass for AWS EBS
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: aws-ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
parameters:
  type: gp3
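
With the class in place, a point-in-time snapshot of a PVC can be requested declaratively. A sketch (the PVC name is illustrative):

# VolumeSnapshot of an application data volume
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-app-data-snapshot
  namespace: production
spec:
  volumeSnapshotClassName: aws-ebs-snapshot-class
  source:
    persistentVolumeClaimName: my-app-data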

Application-Consistent Backups

# Pre-backup hook for an application-consistent database dump; Velero
# hook annotations belong on the pod template of the workload holding
# the data, so the hook runs in the live pod when the backup starts
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
      annotations:
        pre.hook.backup.velero.io/container: postgres
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "pg_dump -Fc mydb > /backup/mydb.dump"]'
    spec:
      containers:
      - name: postgres
        image: postgres:13
        volumeMounts:
        - name: backup-volume
          mountPath: /backup
      volumes:
      - name: backup-volume
        persistentVolumeClaim:
          claimName: backup-pvc

Disaster Recovery Plan

Recovery Objectives

  • RTO (Recovery Time Objective): 4 hours for critical applications
  • RPO (Recovery Point Objective): 15 minutes for critical data (see the backup schedule sketch below)
  • Recovery Priority: Tier 1 applications first, then Tier 2, etc.
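
A 15-minute RPO implies backups (or replication) at least that frequent. As a sketch, a Velero Schedule for Tier 1 namespaces could look like this (assuming Velero runs in the velero namespace; names are illustrative):

# Velero Schedule aligned to a 15-minute RPO for Tier 1 data
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: tier1-backup
  namespace: velero
spec:
  schedule: "*/15 * * * *"
  template:
    includedNamespaces:
    - production
    ttl: 72h0m0s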

Cluster Recovery Procedure

#!/bin/bash
# Cluster recovery script

# Restore etcd from snapshot (run on each member with its own name and peer URL)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/snapshot-latest.db \
  --data-dir /var/lib/etcd \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://192.168.1.100:2380 \
  --name control-plane-1 \
  --initial-cluster control-plane-1=https://192.168.1.100:2380,control-plane-2=https://192.168.1.101:2380,control-plane-3=https://192.168.1.102:2380

# Restore Kubernetes resources with Velero (name the restore so it can
# be inspected afterwards)
velero restore create production-restore \
  --from-backup daily-backup-20231001 \
  --include-namespaces production \
  --wait

# Verify restoration
kubectl get all -n production
velero restore describe production-restore

DR Drills and Testing

# Test restore in isolated environment
velero restore create --from-backup test-backup \
  --namespace-mappings production:production-test \
  --wait

# Validate application functionality
kubectl get pods -n production-test
kubectl exec -it -n production-test deployment/my-app -- curl localhost:8080/health

# Clean up test environment
kubectl delete namespace production-test

Multi-Region Disaster Recovery

Cross-Region Backup

# Velero with multiple backup locations
velero backup-location create \
  backup-secondary \
  --provider aws \
  --bucket my-velero-backups-dr \
  --config region=us-east-1

# Schedule a second daily backup to the DR region (Velero writes a
# separate backup per storage location rather than replicating)
velero schedule create daily-backup-dr \
  --schedule="0 4 * * *" \
  --include-namespaces production \
  --storage-location backup-secondary \
  --ttl 720h

DNS Failover Configuration

# Route53 failover configuration
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-failover
spec:
  endpoints:
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    setIdentifier: primary
    targets:
    - "192.168.1.100"
    - "192.168.1.101"
    providerSpecific:
    - name: aws/failover
      value: PRIMARY
    # attach a Route53 health check (aws/health-check-id) so failover can trigger
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    setIdentifier: secondary
    targets:
    - "192.168.2.100"
    - "192.168.2.101"
    providerSpecific:
    - name: aws/failover
      value: SECONDARY

Operational Excellence

Incident Management

  • Implement alerting and on-call rotation (see the alert rule sketch below)
  • Create runbooks for common failure scenarios
  • Establish escalation procedures
  • Conduct regular post-mortem analyses
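
Since the monitoring stack above uses the Prometheus Operator, alerting can start from a PrometheusRule. A minimal sketch (the threshold, labels, and runbook URL are illustrative):

# PrometheusRule for a basic availability alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    prometheus: k8s
spec:
  groups:
  - name: my-app.rules
    rules:
    - alert: MyAppDown
      expr: sum(up{job="my-app"}) < 2
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Fewer than 2 my-app instances are up"
        runbook_url: "https://runbooks.example.com/my-app-down"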

Capacity Planning

  • Monitor resource utilization trends
  • Plan for seasonal traffic patterns
  • Implement cluster autoscaling alongside pod autoscaling (see the HPA sketch below)
  • Regularly review and adjust resource requests/limits
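
Cluster autoscaling pairs naturally with pod-level autoscaling: the HorizontalPodAutoscaler adjusts replica counts, which in turn drives node scaling. A sketch for the my-app deployment above (targets are illustrative):

# HorizontalPodAutoscaler for the my-app deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70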

Change Management

  • Use GitOps for all configuration changes
  • Implement canary deployments and feature flags (a canary sketch follows this list)
  • Maintain change windows and maintenance schedules
  • Test changes in non-production environments first
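
One way to implement canary deployments is Argo Rollouts, which fits the ArgoCD-based GitOps setup above. A sketch, assuming the Argo Rollouts controller is installed (the steps and image tag are illustrative):

# Canary rollout: shift 20%, then 50% of replicas with pauses in between
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  strategy:
    canary:
      steps:
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0.1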

Running Kubernetes in production requires a comprehensive approach that encompasses architecture design, security, monitoring, and robust disaster recovery practices. By implementing these best practices, organizations can achieve the reliability, security, and performance needed for production workloads while maintaining the flexibility to respond to incidents and scale as needed.
