Troubleshooting Kubernetes

Kubernetes is a complex system with many moving parts, and issues can arise at various levels of the stack. Effective troubleshooting requires a systematic approach and knowledge of the right tools and techniques. This guide covers common issues and debugging techniques for Kubernetes environments.

Common Issues and Debugging Techniques

Troubleshooting in Kubernetes typically involves investigating issues at different layers: application, configuration, networking, storage, and cluster components.

Pod Issues

Pod Stays in Pending State

When a pod cannot be scheduled, check the following:

  • Resource availability (CPU, memory)
  • Node selector and affinity rules
  • Taints and tolerations
  • Persistent volume claims
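
The checks above can often be narrowed down by reading the scheduler's own explanation. The following is a minimal sketch, assuming the pod's events are still retained by the cluster; `pending_report` and its namespace argument are hypothetical names, not part of kubectl.

```shell
# pending_report [namespace]
# Lists pods in the Pending phase and prints the FailedScheduling event
# messages for each one. The function name and argument are illustrative.
pending_report() {
  namespace="${1:-default}"
  kubectl get pods -n "$namespace" --field-selector=status.phase=Pending \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' |
  while read -r pod; do
    [ -n "$pod" ] || continue
    echo "== $pod"
    # The message typically names the failed check, for example insufficient
    # CPU, an untolerated taint, or an unbound PersistentVolumeClaim.
    kubectl get events -n "$namespace" \
      --field-selector "involvedObject.name=$pod,reason=FailedScheduling" \
      -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' </dev/null
  done
}
```

Calling `pending_report my-namespace` prints each pending pod followed by the scheduler messages recorded for it; an empty section usually means the events have expired or the pod has not been evaluated yet.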

Pod Stays in ContainerCreating State

Common causes include:

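  • Image pull issues (private registry, authentication), which eventually surface as ErrImagePull or ImagePullBackOff
  • Missing ConfigMaps or Secrets referenced by the pod
  • CNI plugin failures while setting up the pod network
  • Storage provisioning or volume mount problems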

Pod Crashes or Restarts Continuously

Investigate application issues:

  • Application crashes (check logs)
  • Resource limits (OOMKilled)
  • Liveness/readiness probe failures
  • Configuration errors
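
To tell these causes apart, it helps to look at how each container last terminated. The sketch below reads that from the pod status; the function names `last_terminations` and `oom_killed` are hypothetical helpers introduced only for illustration.

```shell
# last_terminations <pod> [namespace]
# Prints one tab-separated line per container:
#   name, restart count, last termination reason, last exit code
last_terminations() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# oom_killed <pod> [namespace]
# Prints only the names of containers whose last termination was OOMKilled.
oom_killed() {
  last_terminations "$@" | awk -F'\t' '$3 == "OOMKilled" { print $1 }'
}
```

A reason of OOMKilled (usually with exit code 137) points at the memory limit, while a reason of Error with an application-specific exit code points back to the logs of the previous container instance.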

Service and Networking Issues

Service Not Accessible

Troubleshooting steps:

  • Verify service endpoints exist and are correct
  • Check kube-proxy is running on nodes
  • Verify network policies allow traffic
  • Check CNI plugin status
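
A frequent reason for missing endpoints is a service selector that matches no pods. The following sketch, under the assumption that the service uses a label selector, prints the selector and the pods it matches; `service_backends` is a hypothetical helper name.

```shell
# service_backends <service> [namespace]
# Prints the service's label selector and the pods that selector matches.
service_backends() {
  svc="$1"
  namespace="${2:-default}"
  # Render the selector map as key=value pairs joined by commas.
  selector="$(kubectl get service "$svc" -n "$namespace" \
    -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  selector="${selector%,}"   # drop the trailing comma
  if [ -z "$selector" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi
  echo "selector: $selector"
  kubectl get pods -n "$namespace" -l "$selector" -o wide
}
```

If the pod list is empty, the labels and the selector disagree. If pods are listed but the service still has no endpoints, check whether those pods are Ready, since only ready pods are normally added as endpoints.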

DNS Resolution Problems

Common issues:

  • CoreDNS pods not running
  • Network policies blocking DNS traffic
  • Incorrect DNS configuration
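
These three checks can be run together. The sketch below assumes CoreDNS runs in the kube-system namespace with the label k8s-app=kube-dns, which is common but not guaranteed on every distribution; `dns_check` is a hypothetical helper name.

```shell
# dns_check <pod> [namespace]
# Shows CoreDNS pod status, the DNS configuration seen by the given pod,
# and recent CoreDNS log lines.
# Assumption: CoreDNS pods carry the label k8s-app=kube-dns in kube-system.
dns_check() {
  pod="$1"
  namespace="${2:-default}"
  echo "== CoreDNS pods"
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  echo "== resolv.conf in $pod"
  kubectl exec "$pod" -n "$namespace" -- cat /etc/resolv.conf
  echo "== recent CoreDNS logs"
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail=20
}
```

The nameserver in the pod's resolv.conf should normally be the cluster IP of the DNS service; a mismatch suggests an incorrect DNS configuration on the pod or kubelet.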

Storage Issues

Persistent Volume Claims Not Bound

Check the following:

  • Storage class exists and is configured correctly
  • Persistent volumes are available
  • Access modes match between PVC and PV
  • Storage provider issues
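
As a starting point, the following sketch lists every claim that is not yet bound, together with the storage class and access modes it requests. `unbound_pvcs` is a hypothetical helper name.

```shell
# unbound_pvcs
# Prints tab-separated namespace, name, phase, storage class, and access
# modes for every PersistentVolumeClaim whose phase is not Bound.
unbound_pvcs() {
  kubectl get pvc --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.spec.storageClassName}{"\t"}{.spec.accessModes}{"\n"}{end}' |
  awk -F'\t' 'NF && $3 != "Bound"'
}
```

Each listed claim can then be compared against the output of `kubectl get storageclass` and `kubectl get pv` to see whether the requested class exists and whether an available volume offers matching access modes.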

Cluster Component Issues

Node Not Ready

Investigate node problems:

  • Kubelet service status
  • Container runtime issues
  • Resource exhaustion (disk space, memory)
  • Network connectivity problems
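
The node's reported conditions usually indicate which of these areas to look at first. The sketch below prints only the conditions that are in an unhealthy state; `node_conditions` and `unhealthy_conditions` are hypothetical helper names.

```shell
# node_conditions <node>
# Prints tab-separated condition type, status, and message for a node.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
}

# unhealthy_conditions <node>
# Ready is healthy when True; the pressure and network conditions are
# healthy when False. Anything else is printed.
unhealthy_conditions() {
  node_conditions "$1" | awk -F'\t' '
    ($1 == "Ready" && $2 != "True") ||
    ($1 != "Ready" && NF > 0 && $2 != "False")'
}
```

A Ready status of Unknown typically means the kubelet has stopped reporting, in which case the kubelet service and container runtime on the node itself are the next things to inspect.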

Kubectl Commands for Troubleshooting

Kubectl provides numerous commands and options that are essential for troubleshooting Kubernetes clusters and applications.

Basic Inspection Commands

Cluster Information

# Get cluster information
kubectl cluster-info

# Get cluster events
kubectl get events --all-namespaces --sort-by='.lastTimestamp'

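# Check component statuses (deprecated since Kubernetes v1.19)
kubectl get componentstatuses

# Check API server readiness (an alternative to componentstatuses)
kubectl get --raw='/readyz?verbose'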

Node Information

# List all nodes with details
kubectl get nodes -o wide

# Describe a specific node
kubectl describe node <node-name>

# Get node resource usage
kubectl top node

# Check node conditions
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
    

Pod Troubleshooting Commands

Pod Inspection

# List pods in all namespaces
kubectl get pods --all-namespaces

# List pods with node information
kubectl get pods -o wide

# Describe a pod for detailed information
kubectl describe pod <pod-name>

# Get pod logs
kubectl logs <pod-name>

# Get logs from previous container instance
kubectl logs <pod-name> --previous

# Stream logs in real-time
kubectl logs -f <pod-name>

# Get logs from a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name>

# Execute commands in a running pod
kubectl exec -it <pod-name> -- /bin/sh

# Copy files to/from a pod
kubectl cp <pod-name>:/path/to/file /local/path
kubectl cp /local/path <pod-name>:/path/to/file

Pod Debugging

# Create a debugging pod with busybox
kubectl run debug --rm -i --tty --image=busybox -- /bin/sh

# Run a temporary debugging pod in a specific namespace
kubectl run -n <namespace> debug --rm -i --tty --image=nicolaka/netshoot -- /bin/bash

# Check DNS resolution from a pod
kubectl run -i --tty --rm debug --image=busybox -- nslookup <service-name>

Service and Networking Commands

Service Inspection

# List all services
kubectl get services --all-namespaces

# Describe a service
kubectl describe service <service-name>

# Get service endpoints
kubectl get endpoints <service-name>

# Check if a service has endpoints
kubectl get endpoints <service-name> -o jsonpath='{.subsets[*].addresses[*].ip}'

Network Diagnostics

# Check network connectivity between pods
kubectl run network-test --rm -it --image=alpine -- sh
# Then run: ping <pod-ip>

# Check DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check network policies
kubectl get networkpolicies --all-namespaces

# Describe a network policy
kubectl describe networkpolicy <policy-name>

Storage Commands

Storage Inspection

# List persistent volumes
kubectl get pv

# List persistent volume claims
kubectl get pvc --all-namespaces

# Describe a persistent volume claim
kubectl describe pvc <pvc-name>

# List storage classes
kubectl get storageclass

# Check volume attachments
kubectl get volumeattachment
    

Deployment and Configuration Commands

Deployment Inspection

# List deployments
kubectl get deployments --all-namespaces

# Describe a deployment
kubectl describe deployment <deployment-name>

# Check deployment rollout status
kubectl rollout status deployment/<deployment-name>

# View deployment history
kubectl rollout history deployment/<deployment-name>

# Roll back a deployment to the previous revision
kubectl rollout undo deployment/<deployment-name>

# Scale a deployment
kubectl scale deployment/<deployment-name> --replicas=3

Configuration Debugging

# Dry-run to see what would be created
kubectl apply -f config.yaml --dry-run=client

# Validate configuration files
kubectl apply -f config.yaml --validate=true

# View the effective configuration of a resource
kubectl get <resource-type> <resource-name> -o yaml

# Diff current configuration with proposed changes
kubectl diff -f config.yaml
    

Advanced Debugging Commands

API Resources and Access

# List all API resources
kubectl api-resources

# Check API versions
kubectl api-versions

# Check if you can perform an action
kubectl auth can-i create pods
kubectl auth can-i delete pods --as=system:serviceaccount:default:default

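# Show the identity you are authenticated as
kubectl auth whoami

# List the actions you are allowed to perform in the current namespace
kubectl auth can-i --list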

Resource Usage and Limits

# Get resource usage for pods
kubectl top pods --all-namespaces

# Get resource usage for nodes
kubectl top nodes

# Check resource quotas
kubectl get resourcequotas --all-namespaces

# Check limit ranges
kubectl get limitranges --all-namespaces
    

Debugging with JSONPath

# Extract specific information using JSONPath
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'

# Get all pod IP addresses
kubectl get pods -o jsonpath='{.items[*].status.podIP}'

# Get node internal IPs
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'

# Custom columns output
kubectl get pods -o custom-columns="NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName"
    

Troubleshooting Methodology

Systematic Approach

Follow this structured approach to troubleshooting:

  1. Reproduce the issue: Confirm the problem exists
  2. Gather information: Use kubectl commands to collect data
  3. Isolate the component: Determine where the issue occurs
  4. Form hypotheses: Develop theories about the root cause
  5. Test hypotheses: Experiment to confirm or reject theories
  6. Implement fix: Apply the solution
  7. Verify resolution: Confirm the issue is resolved
  8. Document: Record the problem and solution for future reference
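
Step 2 benefits from capturing evidence before anything is changed, since restarts and event expiry can erase it. The following is a minimal sketch of such a snapshot for a single pod; `collect_pod_evidence` and the default output directory `./evidence` are hypothetical choices.

```shell
# collect_pod_evidence <pod> [namespace] [output-dir]
# Saves the pod description, current and previous container logs, and
# related events into text files for later comparison and documentation.
collect_pod_evidence() {
  pod="$1"
  namespace="${2:-default}"
  outdir="${3:-./evidence}"
  mkdir -p "$outdir"
  kubectl describe pod "$pod" -n "$namespace" > "$outdir/describe.txt"
  kubectl logs "$pod" -n "$namespace" --all-containers > "$outdir/logs.txt"
  # Previous logs exist only if a container has restarted; ignore failures.
  kubectl logs "$pod" -n "$namespace" --all-containers --previous \
    > "$outdir/logs-previous.txt" 2>/dev/null || true
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp > "$outdir/events.txt"
}
```

The resulting files also serve step 8, because they record what the cluster reported at the time of the incident.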

Common Troubleshooting Scenarios

Scenario 1: Pod Not Starting

# Step 1: Check pod status
kubectl get pods

# Step 2: Describe the pod for details
kubectl describe pod <pod-name>

# Step 3: Check events for the pod
kubectl get events --field-selector involvedObject.name=<pod-name>

# Step 4: Check resource quotas
kubectl describe resourcequota

# Step 5: Check node capacity
kubectl describe node <node-name>

Scenario 2: Service Not Accessible

# Step 1: Check service details
kubectl describe service <service-name>

# Step 2: Verify endpoints
kubectl get endpoints <service-name>

# Step 3: Check pod labels match service selector
kubectl get pods --show-labels

# Step 4: Test DNS resolution
kubectl run dns-test --rm -it --image=busybox -- nslookup <service-name>

# Step 5: Check network policies
kubectl get networkpolicies --all-namespaces
    

Scenario 3: Application Error

# Step 1: Check application logs
kubectl logs <pod-name>

# Step 2: Check container status
kubectl describe pod <pod-name>

# Step 3: Exec into container for debugging
kubectl exec -it <pod-name> -- /bin/sh

# Step 4: Check resource usage
kubectl top pod <pod-name>

# Step 5: Check liveness/readiness probes
kubectl get pod <pod-name> -o yaml | grep -E -A 5 -B 5 "livenessProbe|readinessProbe"

Useful Tools for Troubleshooting

Kubernetes-native Tools

  • kubectl debug: Debug running pods by adding ephemeral containers
  • kubectl events: View cluster events; add --watch to stream them in real time
  • kubectl alpha: Experimental features for advanced debugging
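
As an illustration of kubectl debug, the sketch below wraps its two most common forms. The helper names `debug_pod` and `debug_node` and the choice of the busybox image are assumptions made for this example.

```shell
# debug_pod <pod> <container> [namespace]
# Attaches an ephemeral busybox container to a running pod. --target asks
# for the process namespace of the named container to be shared, which
# depends on container runtime support.
debug_pod() {
  kubectl debug -it "$1" -n "${3:-default}" --image=busybox --target="$2"
}

# debug_node <node>
# Starts a debugging pod on the given node with an interactive shell.
debug_node() {
  kubectl debug "node/$1" -it --image=busybox
}
```

Ephemeral containers are useful when the application image has no shell, since the debugging tools come from the added container rather than from the application image.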

Third-party Tools

  • Lens: Kubernetes IDE with powerful debugging capabilities
  • K9s: Terminal-based UI to interact with Kubernetes clusters
  • Stern: Multi-pod and container log tailing
  • Kube-score: Static analysis of Kubernetes object definitions
  • Popeye: Kubernetes cluster sanitizer

Debugging Images

  • busybox: Lightweight utilities for basic debugging
  • nicolaka/netshoot: Network troubleshooting tools
  • ubuntu: Full-featured Linux distribution for complex debugging

Preventive Practices

Monitoring and Alerting

  • Implement comprehensive monitoring with Prometheus and Grafana
  • Set up alerts for cluster health indicators
  • Monitor resource usage and set up autoscaling

Regular Maintenance

  • Keep Kubernetes components updated
  • Regularly clean up unused resources
  • Perform health checks and disaster recovery drills

Documentation and Runbooks

  • Maintain documentation of cluster architecture
  • Create runbooks for common issues and procedures
  • Establish escalation procedures for complex problems

Effective troubleshooting in Kubernetes requires a combination of deep knowledge about Kubernetes concepts, familiarity with kubectl commands, and a systematic approach to problem-solving. By mastering these techniques and tools, you can quickly identify and resolve issues in your Kubernetes environments.
