Deploying Kubernetes in production requires careful planning, robust architecture, and adherence to best practices to ensure reliability, security, and performance. This guide covers essential practices for production workloads and comprehensive backup and disaster recovery strategies.
Best Practices for Production Workloads
Production environments demand higher standards of reliability, security, and maintainability. These best practices will help you operate Kubernetes clusters effectively in production scenarios.
Cluster Architecture and Design
High Availability Setup
# Example of a highly available control plane
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.25.0
controlPlaneEndpoint: "k8s-api.example.com:6443"
apiServer:
  extraArgs:
    advertise-address: 192.168.1.100
  certSANs:
  - "k8s-api.example.com"
  - "192.168.1.100"
  - "192.168.1.101"
  - "192.168.1.102"
controllerManager:
  extraArgs:
    node-monitor-grace-period: "40s"
    node-monitor-period: "5s"
    pod-eviction-timeout: "2m"
etcd:
  external:
    endpoints:
    - "https://192.168.1.100:2379"
    - "https://192.168.1.101:2379"
    - "https://192.168.1.102:2379"
    caFile: "/etc/kubernetes/pki/etcd/ca.crt"
    certFile: "/etc/kubernetes/pki/etcd/server.crt"
    keyFile: "/etc/kubernetes/pki/etcd/server.key"
Node Design
- Use separate node pools for different workload types
- Implement dedicated nodes for system components (see the taint/toleration sketch after this list)
- Consider spot instances for fault-tolerant workloads
- Use node auto-scaling groups for dynamic scaling
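A minimal sketch of the dedicated-node pattern, assuming a hypothetical node pool labeled node-role: system and tainted with dedicated=system:NoSchedule:
# Taint and label the dedicated nodes so only tolerating workloads schedule there (node name is illustrative)
kubectl taint nodes system-node-1 dedicated=system:NoSchedule
kubectl label nodes system-node-1 node-role=system
# Pod snippet that targets and tolerates the dedicated pool
apiVersion: v1
kind: Pod
metadata:
  name: system-component
spec:
  nodeSelector:
    node-role: system
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "system"
    effect: "NoSchedule"
  containers:
  - name: app
    image: system-component:1.0.0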
Security Best Practices
Pod Security Standards
# Pod Security Admission configuration
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
- name: PodSecurity
  configuration:
    apiVersion: pod-security.admission.config.k8s.io/v1
    kind: PodSecurityConfiguration
    defaults:
      enforce: "restricted"
      enforce-version: "latest"
      audit: "baseline"
      audit-version: "latest"
      warn: "restricted"
      warn-version: "latest"
    exemptions:
      usernames: []
      runtimeClasses: []
      namespaces: [kube-system]
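The same Pod Security levels can also be applied per namespace with labels, which is often easier to roll out incrementally than a cluster-wide admission configuration:
# Namespace-level Pod Security Admission labels
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/audit: baseline
    pod-security.kubernetes.io/warn: restricted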
Network Policies
# Default deny all ingress traffic
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
---
# Allow only necessary communications
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-communication
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
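A default-deny ingress policy is usually paired with an egress counterpart, and a common pitfall is that denying all egress also blocks DNS lookups. A minimal sketch, assuming the cluster DNS pods carry the conventional k8s-app: kube-dns label:
# Deny all egress except DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-allow-dns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53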
RBAC Configuration
# Least-privilege service account
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: production
  name: my-app-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: my-app-rolebinding
  namespace: production
subjects:
- kind: ServiceAccount
  name: my-app
  namespace: production
roleRef:
  kind: Role
  name: my-app-role
  apiGroup: rbac.authorization.k8s.io
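Once applied, the effective permissions can be verified by impersonating the service account with kubectl auth can-i:
# Should print "yes"
kubectl auth can-i list pods --as=system:serviceaccount:production:my-app -n production
# Should print "no" (the role grants no delete verbs)
kubectl auth can-i delete deployments --as=system:serviceaccount:production:my-app -n production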
Application Deployment Practices
Resource Management
# Pod with resource requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: my-app:1.0.0
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 30
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
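Namespace-level defaults and caps complement per-container settings, so workloads that omit requests or limits still receive sane values. A sketch with illustrative numbers:
# Default requests/limits plus an overall namespace quota
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    default:
      cpu: "500m"
      memory: "512Mi"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: production-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi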
Deployment Strategies
# Rolling update deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: app
        image: my-app:1.0.0
        ports:
        - containerPort: 8080
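With maxUnavailable: 0, the rollout only proceeds as new pods become ready; progress can be watched and, if necessary, reverted with kubectl:
# Watch the rollout and roll back if it misbehaves
kubectl rollout status deployment/my-app
kubectl rollout undo deployment/my-app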
Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
Monitoring and Observability
Comprehensive Monitoring Stack
# Prometheus monitoring setup
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  labels:
    prometheus: k8s
spec:
  replicas: 2
  serviceAccountName: prometheus
  serviceMonitorSelector: {}
  podMonitorSelector: {}
  ruleSelector: {}
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager
      port: web
  resources:
    requests:
      memory: 400Mi
    limits:
      memory: 2Gi
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: fast-ssd
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
Application Metrics
# ServiceMonitor for application metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
  - port: web
    interval: 30s
    path: /metrics
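The endpoint's port: web refers to a named port on the Services matched by the selector, so the application's Service must expose a port with that name:
# Service exposing a named "web" port for the ServiceMonitor to scrape
apiVersion: v1
kind: Service
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
  - name: web
    port: 8080
    targetPort: 8080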
Automation and GitOps
GitOps with ArgoCD
# Application definition for GitOps
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: production-apps
  namespace: argocd
spec:
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  source:
    repoURL: https://github.com/my-org/gitops-repo.git
    targetRevision: HEAD
    path: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas
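With automated sync enabled, Argo CD keeps the cluster converged on the repository; status checks and manual syncs remain available through the CLI (assuming the argocd CLI is installed and logged in):
argocd app get production-apps     # sync and health status
argocd app diff production-apps    # drift between Git and the cluster
argocd app sync production-apps    # trigger a manual sync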
Backup and Disaster Recovery
A comprehensive backup and disaster recovery strategy is essential for production Kubernetes environments. This includes backing up both cluster state and persistent data.
Cluster State Backup
ETCD Backup
#!/bin/bash
# Script for etcd backup

# Variables
DATE=$(date +%Y%m%d-%H%M%S)
ETCD_ENDPOINTS="https://192.168.1.100:2379,https://192.168.1.101:2379,https://192.168.1.102:2379"
BACKUP_DIR="/backup/etcd"
CERT_DIR="/etc/kubernetes/pki/etcd"

# Create backup directory
mkdir -p ${BACKUP_DIR}

# Perform etcd backup
ETCDCTL_API=3 etcdctl \
  --endpoints=${ETCD_ENDPOINTS} \
  --cacert=${CERT_DIR}/ca.crt \
  --cert=${CERT_DIR}/server.crt \
  --key=${CERT_DIR}/server.key \
  snapshot save ${BACKUP_DIR}/snapshot-${DATE}.db

# Verify backup
ETCDCTL_API=3 etcdctl \
  --write-out=table \
  snapshot status ${BACKUP_DIR}/snapshot-${DATE}.db

# Clean up old backups (keep 30 days)
find ${BACKUP_DIR} -name "*.db" -mtime +30 -delete
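To run this automatically, the script can be wired into cron on a control-plane node; the path and interval below are illustrative:
# /etc/cron.d/etcd-backup -- run the backup script every 6 hours
0 */6 * * * root /usr/local/bin/etcd-backup.sh >> /var/log/etcd-backup.log 2>&1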
Velero for Kubernetes Backup
# Install Velero
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.5.0 \
  --bucket my-velero-backups \
  --backup-location-config region=us-west-2 \
  --snapshot-location-config region=us-west-2 \
  --secret-file ./credentials-velero

# Schedule a daily backup
velero schedule create daily-backup \
  --schedule="0 3 * * *" \
  --include-namespaces production,development \
  --ttl 720h

# One-time backup
velero backup create manual-backup-$(date +%Y%m%d) \
  --include-namespaces production \
  --wait
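Existing backups and schedules can be inspected at any time:
# List and inspect Velero backups and schedules
velero backup get
velero schedule get
velero backup describe <backup-name> --details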
Application Data Backup
Volume Snapshot Configuration
# VolumeSnapshotClass for AWS EBS
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: aws-ebs-snapshot-class
driver: ebs.csi.aws.com
deletionPolicy: Delete
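A snapshot of an individual PVC can then reference this class; the PVC name below is hypothetical:
# On-demand snapshot of an application PVC
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-app-data-snapshot
  namespace: production
spec:
  volumeSnapshotClassName: aws-ebs-snapshot-class
  source:
    persistentVolumeClaimName: my-app-data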
Application-Consistent Backups
# Application-consistent backup: a Velero pre-backup hook dumps the database before volumes are backed up.
# Hook annotations apply to pods, so they belong in the pod template metadata.
apiVersion: batch/v1
kind: Job
metadata:
  name: pre-backup-hook
spec:
  template:
    metadata:
      annotations:
        pre.hook.backup.velero.io/container: fsfreeze
        pre.hook.backup.velero.io/command: '["/bin/bash", "-c", "pg_dump -Fc mydb > /backup/mydb.dump"]'
    spec:
      containers:
      - name: fsfreeze
        image: postgres:13
        command: ["/bin/bash", "-c"]
        args: ["pg_dump -Fc mydb > /backup/mydb.dump"]
        volumeMounts:
        - name: backup-volume
          mountPath: /backup
      restartPolicy: Never
      volumes:
      - name: backup-volume
        persistentVolumeClaim:
          claimName: backup-pvc
Disaster Recovery Plan
Recovery Objectives
- RTO (Recovery Time Objective): 4 hours for critical applications
- RPO (Recovery Point Objective): 15 minutes for critical data (see the backup-schedule sketch after this list)
- Recovery Priority: Tier 1 applications first, then Tier 2, etc.
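A 15-minute RPO implies backing up critical data at least every 15 minutes. A hedged sketch with Velero, where the schedule name is illustrative and the scope is kept narrow so frequent backups stay cheap:
# Frequent backups for the most critical namespace to meet the RPO
velero schedule create critical-rpo-backup \
  --schedule="*/15 * * * *" \
  --include-namespaces production \
  --ttl 72h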
Cluster Recovery Procedure
#!/bin/bash
# Cluster recovery script

# Restore etcd from snapshot
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd/snapshot-latest.db \
  --name control-plane-1 \
  --initial-cluster control-plane-1=https://192.168.1.100:2380,control-plane-2=https://192.168.1.101:2380,control-plane-3=https://192.168.1.102:2380 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-advertise-peer-urls https://192.168.1.100:2380

# Restore Kubernetes resources with Velero
velero restore create production-restore \
  --from-backup daily-backup-20231001 \
  --include-namespaces production \
  --wait

# Verify restoration
kubectl get all -n production
velero restore describe production-restore
DR Drills and Testing
# Test restore in an isolated environment
velero restore create --from-backup test-backup \
  --namespace-mappings production:production-test \
  --wait

# Validate application functionality
kubectl get pods -n production-test
kubectl exec -it -n production-test deployment/my-app -- curl localhost:8080/health

# Clean up test environment
kubectl delete namespace production-test
Multi-Region Disaster Recovery
Cross-Region Backup
# Velero with an additional backup location in another region
velero backup-location create backup-secondary \
  --provider aws \
  --bucket my-velero-backups-dr \
  --config region=us-east-1

# Schedule a daily backup to the secondary (cross-region) location
velero schedule create daily-backup-dr \
  --schedule="0 4 * * *" \
  --include-namespaces production \
  --storage-location backup-secondary \
  --ttl 720h
DNS Failover Configuration
# Route53 failover records managed by external-dns
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-failover
spec:
  endpoints:
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    setIdentifier: primary
    targets:
    - "192.168.1.100"
    - "192.168.1.101"
    providerSpecific:
    - name: aws/failover
      value: "PRIMARY"
  - dnsName: "app.example.com"
    recordType: A
    recordTTL: 60
    setIdentifier: secondary
    targets:
    - "192.168.2.100"
    - "192.168.2.101"
    providerSpecific:
    - name: aws/failover
      value: "SECONDARY"
Operational Excellence
Incident Management
- Implement alerting and on-call rotation
- Create runbooks for common failure scenarios
- Establish escalation procedures
- Conduct regular post-mortem analyses
Capacity Planning
- Monitor resource utilization trends
- Plan for seasonal traffic patterns
- Implement cluster autoscaling alongside pod-level autoscaling (see the HPA sketch after this list)
- Regularly review and adjust resource requests/limits
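Cluster autoscaling adds and removes nodes, while a HorizontalPodAutoscaler adjusts replica counts within the cluster; a minimal sketch for the example app:
# Scale my-app between 3 and 10 replicas based on CPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70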
Change Management
- Use GitOps for all configuration changes
- Implement canary deployments and feature flags
- Maintain change windows and maintenance schedules
- Test changes in non-production environments first
Running Kubernetes in production requires a comprehensive approach that encompasses architecture design, security, monitoring, and robust disaster recovery practices. By implementing these best practices, organizations can achieve the reliability, security, and performance needed for production workloads while maintaining the flexibility to respond to incidents and scale as needed.