Learn how to troubleshoot common Kubernetes issues. Master debugging techniques, understand error messages, and use diagnostic tools to resolve problems quickly.
1. Troubleshooting Approach
Follow a systematic approach to troubleshoot Kubernetes issues:
- Identify the problem: What's not working?
- Gather information: Check logs, events, and status
- Check resources: Verify Pods, Services, Deployments
- Examine events: Look for error messages
- Test connectivity: Verify network and DNS
- Check configuration: Review YAML files
- Verify resources: CPU, memory, storage
1.1 Basic Commands
# Get overview
kubectl get all
# Describe resource
kubectl describe pod <pod-name>
# View logs
kubectl logs <pod-name>
# View events
kubectl get events --sort-by='.lastTimestamp'
# Check resource status
kubectl top nodes
kubectl top pods
2. Pod Issues
2.1 Pod Not Starting
Common causes:
- Image pull errors
- Insufficient resources
- Configuration errors
- Node issues
# Check Pod status
kubectl get pods
# Describe Pod for details
kubectl describe pod <pod-name>
# Check Pod logs
kubectl logs <pod-name>
# Check previous container logs
kubectl logs <pod-name> --previous
# Check events
kubectl get events --field-selector involvedObject.name=<pod-name>
2.2 Common Pod States
- Pending: Waiting to be scheduled
- ContainerCreating: Pulling image or creating container
- Running: Container is running
- CrashLoopBackOff: Container keeps crashing
- ImagePullBackOff: Cannot pull image
- ErrImagePull: Image pull error
- Error: Container failed to start
2.3 Image Pull Errors
# Check image pull secrets
kubectl get secrets
# Verify image exists
docker pull <image-name>
# Check image pull policy
kubectl get pod <pod-name> -o yaml | grep imagePullPolicy
# Generate a test Pod manifest (client-side dry-run does not actually pull the image)
kubectl run test-pod --image=<image-name> --dry-run=client -o yaml
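When the image lives in a private registry, the Pod must reference an image pull secret. A minimal sketch (the secret name regcred and the registry/image path are placeholders, not from this guide):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  imagePullSecrets:
    - name: regcred          # hypothetical secret, created with: kubectl create secret docker-registry regcred ...
  containers:
    - name: app
      image: registry.example.com/team/app:1.0   # placeholder private image
      imagePullPolicy: IfNotPresent
```

If the secret is missing or in the wrong namespace, the Pod typically lands in ImagePullBackOff.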
2.4 CrashLoopBackOff
# Check logs
kubectl logs <pod-name>
# Check previous instance logs
kubectl logs <pod-name> --previous
# Describe Pod
kubectl describe pod <pod-name>
# Common fixes:
# - Check application logs for errors
# - Verify environment variables
# - Check resource limits
# - Verify configuration files
# - Check health probes
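Two of the fixes above (resource limits and health probes) are often adjusted together, because a memory limit that is too low triggers OOMKilled restarts and a liveness probe that fires too early kills slow-starting containers. A sketch of a more forgiving spec (the image and numbers are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-tuned-pod
spec:
  containers:
    - name: app
      image: nginx:1.25           # placeholder image
      resources:
        requests:
          cpu: 100m
          memory: 128Mi
        limits:
          memory: 256Mi           # too-low limits show up as OOMKilled in describe output
      livenessProbe:
        httpGet:
          path: /
          port: 80
        initialDelaySeconds: 15   # give the app time to start before the first probe
        periodSeconds: 10
        failureThreshold: 3       # allow transient failures before restarting
```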
3. Deployment Issues
3.1 Deployment Not Updating
# Check Deployment status
kubectl get deployment <deployment-name>
# Describe Deployment
kubectl describe deployment <deployment-name>
# Check ReplicaSet
kubectl get replicaset
# Check rollout status
kubectl rollout status deployment/<deployment-name>
# View rollout history
kubectl rollout history deployment/<deployment-name>
3.2 Deployment Stuck
# Check why rollout is stuck
kubectl describe deployment <deployment-name>
# Check Pods
kubectl get pods -l app=<app-label>
# Check events
kubectl get events --sort-by='.lastTimestamp'
# Pause and resume
kubectl rollout pause deployment/<deployment-name>
kubectl rollout resume deployment/<deployment-name>
# Rollback if needed
kubectl rollout undo deployment/<deployment-name>
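A rollout can also appear stuck simply because no deadline is set. A sketch of Deployment fields that bound rollout time and control update pacing (names, image, and values are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deploy         # placeholder name
spec:
  replicas: 3
  progressDeadlineSeconds: 300 # report ProgressDeadlineExceeded after 5 minutes
  selector:
    matchLabels:
      app: example
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1              # at most one extra Pod during the update
      maxUnavailable: 0        # never drop below desired capacity
  template:
    metadata:
      labels:
        app: example           # must match spec.selector
    spec:
      containers:
        - name: app
          image: nginx:1.25    # placeholder image
```

With a deadline set, kubectl rollout status exits with an error instead of waiting forever.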
3.3 ReplicaSet Issues
# Check ReplicaSet
kubectl get replicaset
# Describe ReplicaSet
kubectl describe replicaset <replicaset-name>
# Check why Pods aren't being created
kubectl describe replicaset <replicaset-name> | grep -A 10 Events
4. Network Issues
4.1 Service Not Accessible
# Check Service
kubectl get svc <service-name>
# Describe Service
kubectl describe svc <service-name>
# Check endpoints
kubectl get endpoints <service-name>
# Verify selector matches Pod labels
kubectl get pods --show-labels
kubectl get svc <service-name> -o yaml | grep selector
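The most common cause of empty endpoints is a selector/label mismatch or a wrong targetPort. A sketch of a correctly wired pair (names, labels, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc                # placeholder name
spec:
  selector:
    app: web                   # must match the Pod labels below exactly
  ports:
    - port: 80
      targetPort: 8080         # must match the port the container listens on
---
apiVersion: v1
kind: Pod
metadata:
  name: web-pod
  labels:
    app: web                   # matches the Service selector
spec:
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      ports:
        - containerPort: 8080
```

If the selector matches no Pods, kubectl get endpoints shows <none>.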
4.2 DNS Issues
# Test DNS from Pod
kubectl run -it --rm debug --image=busybox --restart=Never -- nslookup <service-name>
# Check CoreDNS
kubectl get pods -n kube-system | grep coredns
# Check CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Verify DNS configuration
kubectl get configmap coredns -n kube-system -o yaml
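When testing resolution, it helps to use the full Service DNS name, since short names depend on the Pod's namespace and search path. A throwaway test Pod (busybox tag is an assumption; kubernetes.default is used only because it exists in every cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-test
spec:
  restartPolicy: Never
  containers:
    - name: debug
      image: busybox:1.36
      # Service DNS names follow <service>.<namespace>.svc.cluster.local
      command: ["nslookup", "kubernetes.default.svc.cluster.local"]
```

Check the result with kubectl logs dns-test, then delete the Pod.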
4.3 Network Policy Issues
# List Network Policies
kubectl get networkpolicies
# Describe Network Policy
kubectl describe networkpolicy <policy-name>
# Check if CNI supports Network Policies
kubectl get pods -n kube-system
# Test connectivity
kubectl run -it --rm debug --image=busybox --restart=Never -- wget -O- <service-url>
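Remember that once any NetworkPolicy selects a Pod, all traffic not explicitly allowed is dropped. A minimal sketch of a policy that admits only one client app (names, labels, and the port are placeholders):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend        # placeholder name
spec:
  podSelector:
    matchLabels:
      app: backend            # the policy applies to these Pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only Pods with this label may connect
      ports:
        - protocol: TCP
          port: 8080
```

If connectivity tests fail only from some Pods, compare their labels against the policy's from selectors.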
5. Storage Issues
5.1 PVC Not Binding
# Check PVC status
kubectl get pvc
# Describe PVC
kubectl describe pvc <pvc-name>
# Check available PVs
kubectl get pv
# Check StorageClass
kubectl get storageclass
# Verify PVC requirements match available PVs
kubectl describe pvc <pvc-name> | grep -A 5 "Capacity\|Access Modes"
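A PVC stays Pending until a PV (or dynamic provisioner) satisfies its storage class, access mode, and capacity. A sketch of a matching static pair (names, paths, and sizes are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv             # placeholder name
spec:
  capacity:
    storage: 5Gi               # must be >= the PVC request
  accessModes:
    - ReadWriteOnce            # must include the PVC's requested mode
  storageClassName: manual     # must match the PVC's storageClassName
  hostPath:
    path: /data/example        # placeholder path, single-node clusters only
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: manual
  resources:
    requests:
      storage: 5Gi
```

A mismatch in any of the three fields leaves the PVC unbound.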
5.2 Volume Mount Issues
# Check Pod volume mounts
kubectl describe pod <pod-name> | grep -A 10 "Volumes\|Mounts"
# Verify volume exists
kubectl get pv
# Check volume attachment
kubectl get volumeattachment
# Test volume in Pod
kubectl exec -it <pod-name> -- ls -la /mount/path
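A frequent mount failure is a volumeMounts entry whose name does not match any volumes entry, or a claimName pointing at an unbound PVC. A sketch of correct wiring (names and paths are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-demo            # placeholder name
spec:
  volumes:
    - name: data               # referenced by volumeMounts below
      persistentVolumeClaim:
        claimName: example-pvc # hypothetical PVC, must be Bound
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
      volumeMounts:
        - name: data           # must match a volumes[].name exactly
          mountPath: /mount/path
```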
6. Node Issues
6.1 Node Not Ready
# Check Node status
kubectl get nodes
# Describe Node
kubectl describe node <node-name>
# Check Node conditions
kubectl get node <node-name> -o yaml | grep -A 5 conditions
# Check kubelet status
systemctl status kubelet
# Check kubelet logs
journalctl -u kubelet -f
6.2 Resource Pressure
# Check Node resources
kubectl top nodes
# Check Node capacity
kubectl describe node <node-name> | grep -A 5 "Capacity\|Allocatable"
# Check disk space
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, disk: .status.conditions[] | select(.type=="DiskPressure")}'
# Check memory pressure
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, memory: .status.conditions[] | select(.type=="MemoryPressure")}'
6.3 Pod Scheduling Issues
# Check why Pod isn't scheduled
kubectl describe pod <pod-name> | grep -A 10 Events
# Check Node resources
kubectl describe node <node-name>
# Check taints
kubectl describe node <node-name> | grep Taint
# Check affinity rules
kubectl get pod <pod-name> -o yaml | grep -A 10 affinity
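If describe shows the Pod rejected by taints or affinity, the spec needs a matching toleration or relaxed rules. A sketch (the taint key/value dedicated=gpu and the affinity expression are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: scheduled-pod          # placeholder name
spec:
  tolerations:
    - key: dedicated           # tolerates a hypothetical taint dedicated=gpu:NoSchedule
      operator: Equal
      value: gpu
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: [linux]
  containers:
    - name: app
      image: nginx:1.25        # placeholder image
```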
7. Control Plane Issues
7.1 API Server Issues
# Check API Server status
kubectl get pods -n kube-system | grep kube-apiserver
# Check API Server logs
kubectl logs -n kube-system <api-server-pod-name>
# Test API Server connectivity
kubectl cluster-info
# Check API Server health
curl -k https://localhost:6443/healthz
7.2 etcd Issues
# Check etcd status
kubectl get pods -n kube-system | grep etcd
# Check etcd logs
kubectl logs -n kube-system <etcd-pod-name>
# Check etcd health (from etcd pod)
kubectl exec -n kube-system <etcd-pod-name> -- etcdctl endpoint health
7.3 Scheduler Issues
# Check Scheduler status
kubectl get pods -n kube-system | grep kube-scheduler
# Check Scheduler logs
kubectl logs -n kube-system <scheduler-pod-name>
# Check for scheduling errors
kubectl get events --sort-by='.lastTimestamp' | grep -i schedule
8. Diagnostic Tools
8.1 kubectl debug
# Create ephemeral debug container
kubectl debug <pod-name> -it --image=busybox
# Debug node
kubectl debug node/<node-name> -it --image=busybox
8.2 kubectl get with wide output
# Wide output shows more details
kubectl get pods -o wide
# Custom columns
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName
8.3 kubectl explain
# Explain resource fields
kubectl explain pod
kubectl explain pod.spec
kubectl explain pod.spec.containers
8.4 Debugging Tips
- Always check events first: kubectl get events
- Use describe for detailed information
- Check logs from all containers
- Verify resource availability
- Test connectivity between Pods
- Check RBAC permissions
- Verify configuration with dry-run
Summary: Troubleshooting Kubernetes requires a systematic approach. Start by gathering information with kubectl get, describe, and logs commands. Check events for error messages. Verify resource availability, network connectivity, and configuration. Use diagnostic tools like kubectl debug and explain to investigate issues.