Troubleshooting Kubernetes

Kubernetes is a complex system with many moving parts, and issues can arise at various levels of the stack. Effective troubleshooting requires a systematic approach and knowledge of the right tools and techniques. This guide covers common issues and debugging techniques for Kubernetes environments.

Common Issues and Debugging Techniques

Troubleshooting in Kubernetes typically involves investigating issues at different layers: application, configuration, networking, storage, and cluster components.

Pod Issues

Pod Stays in Pending State

When a pod cannot be scheduled, check the following:

Resource availability (CPU, memory)
Node selector and affinity rules
Taints and tolerations
Persistent volume claims
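
The checks above usually start from the scheduler's own explanation, which it records as FailedScheduling events on the pod. The following is a minimal sketch, assuming kubectl is already configured for the cluster; the function name why_pending is hypothetical and the namespace defaults to default.

```shell
# Print the scheduler's recorded reasons for leaving a pod Pending.
# Usage: why_pending <pod-name> [namespace]   (illustrative helper, not part of kubectl)
why_pending() {
  kubectl get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,reason=FailedScheduling" \
    -o jsonpath='{range .items[*]}{.message}{"\n"}{end}'
}
```

The messages typically name the failing constraint (insufficient CPU or memory, an untolerated taint, an unbound claim), which points to the matching item in the list above.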

Pod Stays in ContainerCreating State

Common causes include:

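Image pulls that are slow or failing (private registry, authentication); failures usually surface as ErrImagePull or ImagePullBackOff
Missing ConfigMaps or Secrets referenced by the pod
CNI plugin failures while setting up the pod network
Storage provisioning or volume mount problems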

Pod Crashes or Restarts Continuously

Investigate application issues:

Application crashes (check logs)
Resource limits (OOMKilled)
Liveness/readiness probe failures
Configuration errors
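
To tell an out-of-memory kill apart from an application crash, it can help to read the last termination reason that the kubelet stores in the pod status. The sketch below assumes kubectl is configured; both function names are hypothetical.

```shell
# Print name, restart count, and last termination reason for each container.
# Usage: last_termination <pod-name> [namespace]
last_termination() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\n"}{end}'
}

# Print only the containers whose last termination reason was OOMKilled.
oom_killed_containers() {
  last_termination "$@" | awk -F'\t' '$3 == "OOMKilled" { print $1 }'
}
```

A container listed by oom_killed_containers likely needs a higher memory limit or a fix for a memory leak; other reasons (such as Error) point back to the logs and probe configuration.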

Service and Networking Issues

Service Not Accessible

Troubleshooting steps:

Verify service endpoints exist and are correct
Check that kube-proxy is running on nodes
Verify network policies allow traffic
Check CNI plugin status
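
The first step above can be sketched as a small helper that reports whether a service has any ready endpoints and, if not, prints the selector so it can be compared with pod labels. This assumes kubectl is configured; the function name check_service_endpoints is hypothetical.

```shell
# Report ready endpoint IPs for a service, or show its selector when there are none.
# Usage: check_service_endpoints <service-name> [namespace]
check_service_endpoints() {
  svc="$1"
  ns="${2:-default}"
  ips=$(kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$ips" ]; then
    echo "no ready endpoints for $svc; service selector is:"
    kubectl get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}'
    echo
    return 1
  fi
  echo "ready endpoints: $ips"
}
```

An empty endpoint list most often means the selector matches no pods or the matching pods are failing their readiness probes.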

DNS Resolution Problems

Common issues:

CoreDNS pods not running
Network policies blocking DNS traffic
Incorrect DNS configuration
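
A quick way to check the first item is to count how many CoreDNS pods are Running. The sketch below assumes CoreDNS runs in kube-system with the label k8s-app=kube-dns, which is common (for example in kubeadm clusters) but not guaranteed; the function name is hypothetical.

```shell
# Summarize how many CoreDNS pods are in the Running state.
# Assumes the kube-system namespace and the k8s-app=kube-dns label.
coredns_status() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers \
    | awk '{ total++; if ($3 == "Running") running++ }
           END { printf "%d/%d CoreDNS pods Running\n", running, total }'
}
```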

Storage Issues

Persistent Volume Claims Not Bound

Check the following:

Storage class exists and is configured correctly
Persistent volumes are available
Access modes match between PVC and PV
Storage provider issues
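
Before working through the checks above, it is useful to list which claims are actually unbound. This is a minimal sketch, assuming kubectl is configured; the function name unbound_pvcs is hypothetical.

```shell
# List every PVC whose phase is not Bound, as <namespace>/<name><TAB><phase>.
unbound_pvcs() {
  kubectl get pvc --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
    | awk -F'\t' '$2 != "Bound"'
}
```

Running kubectl describe pvc on each claim it reports then shows the provisioning events that explain why the claim is stuck.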

Cluster Component Issues

Node Not Ready

Investigate node problems:

Kubelet service status
Container runtime issues
Resource exhaustion (disk space, memory)
Network connectivity problems
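
To find which nodes need the investigation above, the Ready condition can be filtered directly. The sketch below assumes kubectl is configured; the function name not_ready_nodes is hypothetical.

```shell
# Print the names of nodes whose Ready condition is not "True"
# (that is, "False" or "Unknown").
not_ready_nodes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk -F'\t' '$2 != "True" { print $1 }'
}
```

For each node it prints, kubectl describe node shows the remaining conditions (such as DiskPressure and MemoryPressure) and recent kubelet events.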

Kubectl Commands for Troubleshooting

Kubectl provides numerous commands and options that are essential for troubleshooting Kubernetes clusters and applications.

Basic Inspection Commands

Cluster Information

# Get cluster information
kubectl cluster-info

# Get cluster events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

# Check control plane component statuses (the ComponentStatus API is deprecated since v1.19 and may return incomplete data)
kubectl get componentstatuses

Node Information

# List all nodes with details
kubectl get nodes -o wide

# Describe a specific node
kubectl describe node <node-name>

# Get node resource usage
kubectl top node

# Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

Pod Troubleshooting Commands

Pod Inspection

# List pods in all namespaces
kubectl get pods --all-namespaces

# List pods with node information
kubectl get pods -o wide

# Describe a pod for detailed information
kubectl describe pod <pod-name>

# Get pod logs
kubectl logs <pod-name>

# Get logs from previous container instance
kubectl logs <pod-name> --previous

# Stream logs in real-time
kubectl logs -f <pod-name>

# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Execute commands in a running pod
kubectl exec -it <pod-name> -- /bin/sh

# Copy files to/from a pod
kubectl cp <pod-name>:/path/to/file /local/path
kubectl cp /local/path <pod-name>:/path/to/file

Pod Debugging

# Create a debugging pod with busybox
kubectl run debug --rm -i --tty --image=busybox -- /bin/sh

# Run a temporary debugging pod in a specific namespace
kubectl run -n <namespace> debug --rm -i --tty --image=nicolaka/netshoot -- /bin/bash

# Check DNS resolution from a pod
kubectl run -i --tty --rm debug --image=busybox -- nslookup <service-name>

Service and Networking Commands

Service Inspection

# List all services
kubectl get services --all-namespaces

# Describe a service
kubectl describe service <service-name>

# Get service endpoints
kubectl get endpoints <service-name>

# Check if a service has endpoints
kubectl get endpoints <service-name> -o jsonpath='{.subsets[*].addresses[*].ip}'

Network Diagnostics

# Check network connectivity between pods
kubectl run network-test --rm -it --image=alpine -- sh
# Then run: ping <pod-ip>

# Check DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check network policies
kubectl get networkpolicies --all-namespaces

# Describe a network policy
kubectl describe networkpolicy

Storage Commands

Storage Inspection

# List persistent volumes
kubectl get pv

# List persistent volume claims
kubectl get pvc --all-namespaces

# Describe a persistent volume claim
kubectl describe pvc <pvc-name>

# List storage classes
kubectl get storageclass

# Check volume attachments
kubectl get volumeattachment

Deployment and Configuration Commands

Deployment Inspection

# List deployments
kubectl get deployments --all-namespaces

# Describe a deployment
kubectl describe deployment <deployment-name>

# Check deployment rollout status
kubectl rollout status deployment/<deployment-name>

# View deployment history
kubectl rollout history deployment/<deployment-name>

# Rollback a deployment
kubectl rollout undo deployment/<deployment-name>

# Scale a deployment
kubectl scale deployment/<deployment-name> --replicas=3

Configuration Debugging

# Dry-run to see what would be created
kubectl apply -f config.yaml --dry-run=client

# Validate configuration files against the API server without applying them
kubectl apply -f config.yaml --dry-run=server --validate=true

# View the effective configuration of a resource
kubectl get <resource-type> <resource-name> -o yaml

# Diff current configuration with proposed changes
kubectl diff -f config.yaml

Advanced Debugging Commands

API Resources and Access

# List all API resources
kubectl api-resources

# Check API versions
kubectl api-versions

# Check if you can perform an action
kubectl auth can-i create pods
kubectl auth can-i delete pods --as=system:serviceaccount:default:default

# Show the identity you are authenticated as, then list what it is allowed to do
kubectl auth whoami
kubectl auth can-i --list

Resource Usage and Limits

# Get resource usage for pods
kubectl top pods --all-namespaces

# Get resource usage for nodes
kubectl top nodes

# Check resource quotas
kubectl get resourcequotas --all-namespaces

# Check limit ranges
kubectl get limitranges --all-namespaces

Debugging with JSONPath

# Extract specific information using JSONPath
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

# Get all pod IP addresses
kubectl get pods -o jsonpath='{.items[*].status.podIP}'

# Get node internal IPs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# Custom columns output
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
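
JSONPath output also combines well with standard text tools. As an illustration, the sketch below lists pods whose first container has restarted more than a given number of times. It assumes kubectl is configured; the function name restarting_pods and the default threshold of 5 are hypothetical, and only the first container of each pod is considered.

```shell
# List pods whose first container has restarted more than <threshold> times.
# Usage: restarting_pods [threshold]   (default threshold: 5)
restarting_pods() {
  kubectl get pods --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
    | awk -F'\t' -v t="${1:-5}" '$2 + 0 > t + 0 { print $1 "\t" $2 }'
}
```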

Troubleshooting Methodology

Systematic Approach

Follow this structured approach to troubleshooting:

Reproduce the issue: Confirm the problem exists
Gather information: Use kubectl commands to collect data
Isolate the component: Determine where the issue occurs
Form hypotheses: Develop theories about the root cause
Test hypotheses: Experiment to confirm or reject theories
Implement fix: Apply the solution
Verify resolution: Confirm the issue is resolved
Document: Record the problem and solution for future reference
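
The "Gather information" step is easier to repeat and to document when the output is saved to files. The following is a minimal sketch, assuming kubectl is configured; the function name collect_pod_diagnostics, the file names, and the default output directory are all hypothetical.

```shell
# Save describe output, logs, and events for one pod into a directory.
# Usage: collect_pod_diagnostics <pod-name> [namespace] [output-dir]
collect_pod_diagnostics() {
  pod="$1"
  ns="${2:-default}"
  out="${3:-./diag-$pod}"
  mkdir -p "$out" || return 1
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt" 2>&1
  kubectl logs "$pod" -n "$ns" --all-containers > "$out/logs.txt" 2>&1
  # Previous-instance logs only exist after a restart, so a failure here is not fatal.
  kubectl logs "$pod" -n "$ns" --all-containers --previous > "$out/logs-previous.txt" 2>&1 || true
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  echo "diagnostics written to $out"
}
```

The resulting directory can be attached to an incident record, which also covers part of the final "Document" step.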

Common Troubleshooting Scenarios

Scenario 1: Pod Not Starting

# Step 1: Check pod status
kubectl get pods

# Step 2: Describe the pod for details
kubectl describe pod <pod-name>

# Step 3: Check events in the namespace
kubectl get events --field-selector involvedObject.name=<pod-name>

# Step 4: Check resource quotas
kubectl describe resourcequota

# Step 5: Check node capacity
kubectl describe node

Scenario 2: Service Not Accessible

# Step 1: Check service details
kubectl describe service <service-name>

# Step 2: Verify endpoints
kubectl get endpoints <service-name>

# Step 3: Check pod labels match service selector
kubectl get pods --show-labels

# Step 4: Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>

# Step 5: Check network policies
kubectl get networkpolicies --all-namespaces

Scenario 3: Application Error

# Step 1: Check application logs
kubectl logs <pod-name>

# Step 2: Check container status
kubectl describe pod <pod-name>

# Step 3: Exec into container for debugging
kubectl exec -it <pod-name> -- /bin/sh

# Step 4: Check resource usage
kubectl top pod <pod-name>

# Step 5: Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -A 5 -B 5 "liveness\|readiness"

Useful Tools for Troubleshooting

Kubernetes-native Tools

kubectl debug: Debug running pods by adding ephemeral containers
kubectl events: List cluster events, or stream them in real time with --watch
kubectl alpha: Experimental features for advanced debugging
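
As an illustration of kubectl debug, the sketch below attaches an ephemeral container to a running pod and targets an existing container so the two share a process namespace. It assumes a cluster and container runtime that support ephemeral containers and process namespace targeting; the function name debug_container and the default busybox image are hypothetical choices.

```shell
# Attach an interactive ephemeral debug container to a running pod.
# Usage: debug_container <pod-name> <target-container> [namespace] [image]
debug_container() {
  kubectl debug -it "$1" -n "${3:-default}" --image="${4:-busybox}" --target="$2"
}
```

This is mainly useful when the application image has no shell, so kubectl exec cannot be used.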

Third-party Tools

Lens: Kubernetes IDE with powerful debugging capabilities
K9s: Terminal-based UI to interact with Kubernetes clusters
Stern: Multi-pod and container log tailing
Kube-score: Static analysis of Kubernetes object definitions
Popeye: Kubernetes cluster sanitizer

Debugging Images

busybox: Lightweight utilities for basic debugging
nicolaka/netshoot: Network troubleshooting tools
ubuntu: Full-featured Linux distribution for complex debugging

Preventive Practices

Monitoring and Alerting

Implement comprehensive monitoring with Prometheus and Grafana
Set up alerts for cluster health indicators
Monitor resource usage and set up autoscaling

Regular Maintenance

Keep Kubernetes components updated
Regularly clean up unused resources
Perform health checks and disaster recovery drills

Documentation and Runbooks

Maintain documentation of cluster architecture
Create runbooks for common issues and procedures
Establish escalation procedures for complex problems

Effective troubleshooting in Kubernetes requires a combination of deep knowledge about Kubernetes concepts, familiarity with kubectl commands, and a systematic approach to problem-solving. By mastering these techniques and tools, you can quickly identify and resolve issues in your Kubernetes environments.