Kubernetes is a complex system with many moving parts, and issues can arise at various levels of the stack. Effective troubleshooting requires a systematic approach and knowledge of the right tools and techniques. This guide covers common issues and debugging techniques for Kubernetes environments.
Common Issues and Debugging Techniques
Troubleshooting in Kubernetes typically involves investigating issues at different layers: application, configuration, networking, storage, and cluster components.
Pod Issues
Pod Stays in Pending State
When a pod cannot be scheduled, check the following:
- Resource availability (CPU, memory)
- Node selector and affinity rules
- Taints and tolerations
- Persistent volume claims
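When a pod is stuck in Pending, the scheduler usually records its reason on the pod's PodScheduled condition, which narrows down which item in the list above applies. A minimal sketch, assuming a hypothetical helper name `pending_message` and the `default` namespace as a fallback:

```shell
# Hypothetical helper: print the scheduler's message for an unschedulable pod.
# Usage: pending_message <pod-name> [namespace]
pending_message() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].message}'
}
```

The message typically names the unmet constraint, for example insufficient CPU, an untolerated taint, or an unbound persistent volume claim. Empty output usually means the pod has already been scheduled.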
Pod Stays in ContainerCreating State
Common causes include:
- Image pull issues (private registry, authentication)
- Missing ConfigMaps or Secrets mounted as volumes
- CNI plugin failures while setting up the pod network
- Storage provisioning problems
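To see why containers have not started, the pod's container statuses report a waiting reason for each container. The sketch below uses two hypothetical helper names, `waiting_reasons` and `image_pull_failures`; the reason strings it matches are ones the kubelet commonly reports for image pull problems, not an exhaustive list:

```shell
# Hypothetical helper: list each container with the reason it is waiting, if any.
# Usage: waiting_reasons <pod-name> [namespace]
waiting_reasons() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state.waiting.reason}{"\n"}{end}'
}

# Filter: keep only containers whose waiting reason points at an image pull problem.
# Reads "name<TAB>reason" lines on stdin and prints the container names.
image_pull_failures() {
  awk -F '\t' '$2 == "ErrImagePull" || $2 == "ImagePullBackOff" || $2 == "InvalidImageName" { print $1 }'
}

# Example pipeline (not run here):
#   waiting_reasons <pod-name> <namespace> | image_pull_failures
```

If the pipeline prints nothing, the events at the bottom of `kubectl describe pod` are the next place to look for mount or network setup failures.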
Pod Crashes or Restarts Continuously
Investigate application issues:
- Application crashes (check logs)
- Resource limits (OOMKilled)
- Liveness/readiness probe failures
- Configuration errors
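For a pod that keeps restarting, the last termination state of each container often separates an out-of-memory kill from an application crash. A sketch under the assumption that the helper names `last_termination` and `oom_killed` are free to use (both are hypothetical):

```shell
# Hypothetical helper: per container, print name, restart count, and the reason
# and exit code of the last termination (empty if it has never terminated).
# Usage: last_termination <pod-name> [namespace]
last_termination() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Filter: keep containers whose last termination was an out-of-memory kill.
# Reads the tab-separated lines produced by last_termination on stdin.
oom_killed() {
  awk -F '\t' '$3 == "OOMKilled" { print $1 " (restarts: " $2 ")" }'
}

# Example pipeline (not run here):
#   last_termination <pod-name> <namespace> | oom_killed
```

A reason of `Error` with a non-zero exit code points back at the application, in which case the logs from the previous container instance are the next thing to read.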
Service and Networking Issues
Service Not Accessible
Troubleshooting steps:
- Verify service endpoints exist and are correct
- Check kube-proxy is running on nodes
- Verify network policies allow traffic
- Check CNI plugin status
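An empty endpoints list most often means the service selector matches no ready pods. The following sketch turns a service's selector into a label query and lists the pods it matches; `service_selector` and `service_pods` are hypothetical helper names, and the Go template is just one way to flatten the selector map:

```shell
# Hypothetical helper: print a service's selector as a label query,
# for example "app=web,tier=frontend".
# Usage: service_selector <service-name> [namespace]
service_selector() {
  kubectl get service "$1" -n "${2:-default}" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}' | sed 's/,$//'
}

# Hypothetical helper: list the pods that the service's selector matches.
# Usage: service_pods <service-name> [namespace]
service_pods() {
  selector="$(service_selector "$1" "$2")"
  if [ -z "$selector" ]; then
    # An empty label query would match every pod, so stop here instead.
    echo "service $1 has no selector" >&2
    return 1
  fi
  kubectl get pods -n "${2:-default}" -l "$selector" -o wide
}
```

If `service_pods` returns no pods, compare the selector with the labels on the pods you expected to match; if it returns pods that are not Ready, the readiness probe is the likelier culprit.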
DNS Resolution Problems
Common issues:
- CoreDNS pods not running
- Network policies blocking DNS traffic
- Incorrect DNS configuration
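A quick first check is whether the cluster DNS pods are running and ready. The sketch below assumes CoreDNS runs in `kube-system` with the label `k8s-app=kube-dns`, as in kubeadm-based clusters; other distributions may use different names. `not_ready_pods` is a hypothetical filter name:

```shell
# Filter: read `kubectl get pods --no-headers` output (NAME READY STATUS ...)
# and print pods that are not Running or not fully ready.
not_ready_pods() {
  awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1, $2, $3 }'
}

# Example pipeline (not run here; label and namespace are assumptions):
#   kubectl get pods -n kube-system -l k8s-app=kube-dns --no-headers | not_ready_pods
```

No output means every DNS pod reports ready, so the next suspects are network policies on port 53 and the pod's own DNS configuration.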
Storage Issues
Persistent Volume Claims Not Bound
Check the following:
- Storage class exists and is configured correctly
- Persistent volumes are available
- Access modes match between PVC and PV
- Storage provider issues
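Listing the fields that must line up (storage class, access modes, requested size) next to each claim's phase makes mismatches easier to spot. A sketch, with `pvc_summary` and `unbound_pvcs` as hypothetical helper names and a column layout chosen for illustration:

```shell
# Hypothetical helper: one line per claim with the fields relevant to binding.
pvc_summary() {
  kubectl get pvc --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATUS:.status.phase,CLASS:.spec.storageClassName,MODES:.spec.accessModes,REQUEST:.spec.resources.requests.storage'
}

# Filter: keep claims whose phase (third column) is anything other than Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Example pipeline (not run here):
#   pvc_summary | unbound_pvcs
```

For each claim it reports, the events shown by `kubectl describe pvc` usually state why provisioning or binding is stuck.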
Cluster Component Issues
Node Not Ready
Investigate node problems:
- Kubelet service status
- Container runtime issues
- Resource exhaustion (disk space, memory)
- Network connectivity problems
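Node conditions summarize most of these problems in one place. The sketch below prints the conditions that deviate from a healthy node, assuming only the built-in condition types, where Ready should be True and every pressure or unavailability condition should be False. Both function names are hypothetical:

```shell
# Hypothetical helper: print every condition of a node as "type<TAB>status".
# Usage: node_conditions <node-name>
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
}

# Filter: print conditions that indicate a problem.
# Ready must be True; any other condition is expected to be False.
unhealthy_conditions() {
  awk -F '\t' '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $1 != "" && $2 != "False") { print $1 "=" $2 }'
}

# Example pipeline (not run here):
#   node_conditions <node-name> | unhealthy_conditions
```

When Ready is False or Unknown, the kubelet itself is the next thing to inspect; on systemd-based nodes that usually means `systemctl status kubelet` and `journalctl -u kubelet` on the node.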
Kubectl Commands for Troubleshooting
Kubectl provides numerous commands and options that are essential for troubleshooting Kubernetes clusters and applications.
Basic Inspection Commands
Cluster Information
# Get cluster information
kubectl cluster-info
# Get cluster events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# Check component statuses (deprecated since Kubernetes v1.19; output may be incomplete)
kubectl get componentstatuses
Node Information
# List all nodes with details
kubectl get nodes -o wide
# Describe a specific node
kubectl describe node <node-name>
# Get node resource usage (requires the metrics API, e.g. metrics-server)
kubectl top node
# Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
Pod Troubleshooting Commands
Pod Inspection
# List pods in all namespaces
kubectl get pods --all-namespaces
# List pods with node information
kubectl get pods -o wide
# Describe a pod for detailed information
kubectl describe pod <pod-name>
# Get pod logs
kubectl logs <pod-name>
# Get logs from previous container instance
kubectl logs <pod-name> --previous
# Stream logs in real-time
kubectl logs -f <pod-name>
# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>
# Execute commands in a running pod
kubectl exec -it <pod-name> -- /bin/sh
# Copy files to/from a pod
kubectl cp <pod-name>:/path/to/file /local/path
kubectl cp /local/path <pod-name>:/path/to/file
Pod Debugging
# Create a debugging pod with busybox
kubectl run debug --rm -i --tty --image=busybox -- /bin/sh
# Run a temporary debugging pod in a specific namespace
kubectl run -n <namespace> debug --rm -i --tty --image=nicolaka/netshoot -- /bin/bash
# Check DNS resolution from a pod
kubectl run -i --tty --rm debug --image=busybox -- nslookup <service-name>
Service and Networking Commands
Service Inspection
# List all services
kubectl get services --all-namespaces
# Describe a service
kubectl describe service <service-name>
# Get service endpoints
kubectl get endpoints <service-name>
# Check if a service has endpoints
kubectl get endpoints <service-name> -o jsonpath='{.subsets[*].addresses[*].ip}'
Network Diagnostics
# Check network connectivity between pods
kubectl run network-test --rm -it --image=alpine -- sh
# Then run: ping <pod-ip>
# Check DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>.<namespace>.svc.cluster.local
# Check network policies
kubectl get networkpolicies --all-namespaces
# Describe a network policy
kubectl describe networkpolicy <policy-name>
Storage Commands
Storage Inspection
# List persistent volumes
kubectl get pv
# List persistent volume claims
kubectl get pvc --all-namespaces
# Describe a persistent volume claim
kubectl describe pvc <pvc-name>
# List storage classes
kubectl get storageclass
# Check volume attachments
kubectl get volumeattachment
Deployment and Configuration Commands
Deployment Inspection
# List deployments
kubectl get deployments --all-namespaces
# Describe a deployment
kubectl describe deployment <deployment-name>
# Check deployment rollout status
kubectl rollout status deployment/<deployment-name>
# View deployment history
kubectl rollout history deployment/<deployment-name>
# Rollback a deployment
kubectl rollout undo deployment/<deployment-name>
# Scale a deployment
kubectl scale deployment/<deployment-name> --replicas=3
Configuration Debugging
# Dry-run to see what would be created
kubectl apply -f config.yaml --dry-run=client
# Validate configuration files against the API server without persisting changes
kubectl apply -f config.yaml --dry-run=server --validate=true
# View the effective configuration of a resource
kubectl get <resource-type> <resource-name> -o yaml
# Diff current configuration with proposed changes
kubectl diff -f config.yaml
Advanced Debugging Commands
API Resources and Access
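# List all API resources
kubectl api-resources
# Check API versions
kubectl api-versions
# Check if you can perform an action
kubectl auth can-i create pods
kubectl auth can-i delete pods --as=system:serviceaccount:default:default
# List the actions you are allowed to perform in the current namespace
kubectl auth can-i --list
# Show the identity you are authenticated as (available in recent kubectl versions)
kubectl auth whoami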
Resource Usage and Limits
# Get resource usage for pods
kubectl top pods --all-namespaces
# Get resource usage for nodes
kubectl top nodes
# Check resource quotas
kubectl get resourcequotas --all-namespaces
# Check limit ranges
kubectl get limitranges --all-namespaces
Debugging with JSONPath
# Extract specific information using JSONPath
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
# Get all pod IP addresses
kubectl get pods -o jsonpath='{.items[*].status.podIP}'
# Get node internal IPs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
# Custom columns output
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
Troubleshooting Methodology
Systematic Approach
Follow this structured approach to troubleshooting:
- Reproduce the issue: Confirm the problem exists
- Gather information: Use kubectl commands to collect data
- Isolate the component: Determine where the issue occurs
- Form hypotheses: Develop theories about the root cause
- Test hypotheses: Experiment to confirm or reject theories
- Implement fix: Apply the solution
- Verify resolution: Confirm the issue is resolved
- Document: Record the problem and solution for future reference
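The "gather information" step is easier to repeat, and to attach to the final write-up, when the raw output is saved to files. A minimal sketch, assuming a hypothetical helper name `collect_pod_diagnostics` and an output layout chosen purely for illustration:

```shell
# Hypothetical helper: save describe output, current and previous logs, and
# related events for one pod into a directory, then print that directory.
# Usage: collect_pod_diagnostics <pod-name> [namespace] [output-dir]
collect_pod_diagnostics() {
  pod="$1"
  ns="${2:-default}"
  out="${3:-./diag-$pod}"
  mkdir -p "$out" || return 1
  kubectl describe pod "$pod" -n "$ns" > "$out/describe.txt" 2>&1
  kubectl logs "$pod" -n "$ns" --all-containers > "$out/logs.txt" 2>&1
  # Fails harmlessly if no container has restarted; the error text is kept in the file.
  kubectl logs "$pod" -n "$ns" --all-containers --previous > "$out/logs-previous.txt" 2>&1
  kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp > "$out/events.txt" 2>&1
  echo "$out"
}
```

Running it once before and once after applying a fix gives two snapshots to compare during the verification step.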
Common Troubleshooting Scenarios
Scenario 1: Pod Not Starting
# Step 1: Check pod status
kubectl get pods
# Step 2: Describe the pod for details
kubectl describe pod <pod-name>
# Step 3: Check events in the namespace
kubectl get events --field-selector involvedObject.name=<pod-name>
# Step 4: Check resource quotas
kubectl describe resourcequota
# Step 5: Check node capacity
kubectl describe node <node-name>
Scenario 2: Service Not Accessible
# Step 1: Check service details
kubectl describe service <service-name>
# Step 2: Verify endpoints
kubectl get endpoints <service-name>
# Step 3: Check pod labels match service selector
kubectl get pods --show-labels
# Step 4: Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>
# Step 5: Check network policies
kubectl get networkpolicies --all-namespaces
Scenario 3: Application Error
# Step 1: Check application logs
kubectl logs <pod-name>
# Step 2: Check container status
kubectl describe pod <pod-name>
# Step 3: Exec into container for debugging
kubectl exec -it <pod-name> -- /bin/sh
# Step 4: Check resource usage
kubectl top pod <pod-name>
# Step 5: Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -E -A 5 -B 5 "liveness|readiness"
Useful Tools for Troubleshooting
Kubernetes-native Tools
- kubectl debug: Debug running pods by adding ephemeral containers
- kubectl events: View cluster events in real-time
- kubectl alpha: Experimental features for advanced debugging
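As an illustration of the first two tools, the sketch below wraps `kubectl debug` and `kubectl events` in small functions. The function names are hypothetical, the busybox image is only one possible choice, and both subcommands depend on a reasonably recent kubectl and cluster version:

```shell
# Hypothetical wrapper: attach an ephemeral busybox container to a running pod,
# targeting the process namespace of an existing container.
# Usage: debug_pod <pod-name> <container-name> [namespace]
debug_pod() {
  kubectl debug -it "$1" -n "${3:-default}" --image=busybox --target="$2" -- sh
}

# Hypothetical wrapper: stream events that concern a single pod.
# Usage: watch_pod_events <pod-name> [namespace]
watch_pod_events() {
  kubectl events -n "${2:-default}" --for "pod/$1" --watch
}
```

An ephemeral container is useful when the application image has no shell, since the debugging tools come from the added container rather than from the image under investigation.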
Third-party Tools
- Lens: Kubernetes IDE with powerful debugging capabilities
- K9s: Terminal-based UI to interact with Kubernetes clusters
- Stern: Multi-pod and container log tailing
- Kube-score: Static analysis of Kubernetes object definitions
- Popeye: Kubernetes cluster sanitizer
Debugging Images
- busybox: Lightweight utilities for basic debugging
- nicolaka/netshoot: Network troubleshooting tools
- ubuntu: Full-featured Linux distribution for complex debugging
Preventive Practices
Monitoring and Alerting
- Implement comprehensive monitoring with Prometheus and Grafana
- Set up alerts for cluster health indicators
- Monitor resource usage and set up autoscaling
Regular Maintenance
- Keep Kubernetes components updated
- Regularly clean up unused resources
- Perform health checks and disaster recovery drills
Documentation and Runbooks
- Maintain documentation of cluster architecture
- Create runbooks for common issues and procedures
- Establish escalation procedures for complex problems
Effective troubleshooting in Kubernetes requires a combination of deep knowledge about Kubernetes concepts, familiarity with kubectl commands, and a systematic approach to problem-solving. By mastering these techniques and tools, you can quickly identify and resolve issues in your Kubernetes environments.