Learn how to monitor and log Kubernetes clusters using Prometheus, Grafana, and the ELK Stack. Understand metrics collection, alerting, and log aggregation for production clusters.
1. Monitoring Overview
Effective monitoring in Kubernetes requires tracking multiple layers:
- Cluster Metrics: Node CPU, memory, disk usage
- Pod Metrics: Container CPU, memory, network
- Application Metrics: Custom application metrics
- Control Plane Metrics: API Server, etcd, scheduler performance
1.1 Monitoring Stack Components
- Metrics Collection: Prometheus, Metrics Server
- Visualization: Grafana
- Alerting: Alertmanager
- Logging: ELK Stack, Loki, Fluentd
1.2 Key Metrics to Monitor
- Pod CPU and memory usage
- Node resource utilization
- Pod restart counts
- Network traffic
- Storage usage
- API Server latency
- Error rates
2. Prometheus
Prometheus is an open-source monitoring and alerting toolkit. It collects metrics by scraping HTTP endpoints and stores them in a time-series database.
2.1 Prometheus Architecture
- Prometheus Server: Scrapes and stores metrics
- Exporters: Expose metrics endpoints
- Service Discovery: Automatically discovers targets
- Alertmanager: Handles alerts
2.2 Installing Prometheus
Install Prometheus using Helm:
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the kube-prometheus-stack (Prometheus Operator, Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
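Once the chart is installed, confirm the stack's Pods are running and reach the Prometheus UI (the Service name below is the default for a release named prometheus and may differ with other release names or chart versions):
# List the Pods created by the chart
kubectl get pods -n monitoring
# Port-forward the Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090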
2.3 Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
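With the relabel rule above, only Pods that opt in via annotations are scraped. A minimal sketch of a Deployment whose Pod template carries those annotations (the myapp name, image, and port are illustrative; the port and path annotations only take effect if matching relabel rules are configured):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
          ports:
            - containerPort: 8080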
2.4 PromQL Queries
Prometheus Query Language (PromQL) examples:
# Per-container CPU usage (as a percentage of one core)
rate(container_cpu_usage_seconds_total[5m]) * 100
# Memory usage
container_memory_usage_bytes
# Pod restart count
kube_pod_container_status_restarts_total
# Request rate
rate(http_requests_total[5m])
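Per-container series are usually aggregated before graphing; for example, summing CPU and memory by Pod (standard PromQL over the cAdvisor metrics above):
# CPU cores used per Pod
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
# Working-set memory per Pod
sum by (namespace, pod) (container_memory_working_set_bytes)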
2.5 ServiceMonitor
Use ServiceMonitor to configure Prometheus scraping:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
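A ServiceMonitor selects a Service rather than Pods, so the target Service needs the matching label and a named metrics port; a minimal sketch (the port number is illustrative, and depending on chart values the ServiceMonitor may also need the chart's release label so the operator picks it up):
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080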
2.6 Alerting Rules
groups:
  - name: kubernetes
    rules:
      - alert: HighCPUUsage
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
3. Grafana
Grafana is an open-source analytics and visualization platform. It connects to Prometheus and other data sources to create dashboards.
3.1 Grafana Features
- Rich visualization options
- Dashboard templating
- Alerting
- Multiple data source support
- User management and permissions
3.2 Accessing Grafana
When installed via Helm, access Grafana:
# Get Grafana admin password
kubectl get secret --namespace monitoring prometheus-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode
# Port forward to access Grafana
kubectl port-forward --namespace monitoring svc/prometheus-grafana 3000:80
Access Grafana at http://localhost:3000 (default username: admin)
3.3 Pre-built Dashboards
Import pre-built dashboards from Grafana Labs:
- Kubernetes Cluster Monitoring (ID: 7249)
- Kubernetes Pod Monitoring (ID: 6417)
- Node Exporter Full (ID: 1860)
- Kubernetes Deployment Statefulset Daemonset metrics (ID: 8588)
3.4 Creating Custom Dashboards
Create dashboards to visualize:
- Pod resource usage
- Application-specific metrics
- Error rates and latency
- Business metrics
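For error-rate and latency panels, queries like the following are typical, assuming the application exposes http_requests_total and an http_request_duration_seconds histogram (metric names vary by instrumentation library):
# Share of 5xx responses over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 95th percentile request latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))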
4. Metrics Server
Metrics Server collects resource usage data from kubelets and exposes it via the Kubernetes Metrics API. It's required for the Horizontal Pod Autoscaler (HPA) and the kubectl top commands.
4.1 Installing Metrics Server
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify installation
kubectl get deployment metrics-server -n kube-system
4.2 Using Metrics API
# View Node metrics
kubectl top nodes
# View Pod metrics
kubectl top pods
# View Pod metrics in namespace
kubectl top pods -n default
# View specific Pod
kubectl top pod <pod-name>
4.3 Metrics Server vs Prometheus
- Metrics Server: Lightweight, for HPA and kubectl top
- Prometheus: Full-featured monitoring, long-term storage, alerting
- Both can run together
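As an illustration of why Metrics Server matters, a resource-based HorizontalPodAutoscaler depends on the Metrics API it serves; a minimal sketch targeting a hypothetical myapp Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70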
5. Logging Overview
Kubernetes logging involves collecting logs from multiple sources:
- Container Logs: Application logs from containers
- Node Logs: kubelet, kube-proxy logs
- Control Plane Logs: API Server, etcd, scheduler logs
- Audit Logs: API access logs
5.1 Log Collection Approaches
- Node-level logging: Agent on each Node collects logs
- Sidecar pattern: Logging container alongside the application (see the sketch after this list)
- Application-level: Application sends logs directly to logging backend
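A minimal sketch of the sidecar pattern, assuming the application writes to a file on a shared emptyDir volume and a streaming sidecar tails it to stdout (image names and the log path are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-streamer
      image: busybox
      command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}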
5.2 Viewing Logs with kubectl
# View Pod logs
kubectl logs <pod-name>
# View logs from specific container
kubectl logs <pod-name> -c <container-name>
# Follow logs
kubectl logs -f <pod-name>
# View logs from previous container instance
kubectl logs <pod-name> --previous
# View logs from all Pods with label
kubectl logs -l app=nginx
6. ELK Stack
The ELK Stack consists of Elasticsearch, Logstash, and Kibana for log aggregation and analysis.
6.1 ELK Components
- Elasticsearch: Search and analytics engine
- Logstash: Log processing pipeline
- Kibana: Visualization and exploration
- Beats/Fluentd: Log shippers
6.2 Installing ELK Stack
Install using Helm:
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update
# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch
# Install Kibana
helm install kibana elastic/kibana
# Install Filebeat (log shipper)
helm install filebeat elastic/filebeat
6.3 Using Fluentd
Fluentd is a popular log collector for Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
    </match>
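The ConfigMap alone doesn't run anything; Fluentd is typically deployed as a DaemonSet that mounts the ConfigMap together with the node's log directory. A trimmed sketch (the image tag, namespace, and mount paths are illustrative and depend on the Fluentd image used):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluentd-config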
6.4 Log Aggregation Best Practices
- Use structured logging (JSON)
- Include correlation IDs
- Set appropriate log retention policies
- Index logs by namespace and application
- Monitor log volume
7. Loki
Loki is a horizontally scalable, highly available log aggregation system inspired by Prometheus. It's designed to work with Grafana.
7.1 Loki Advantages
- Lightweight and cost-effective
- Integrates seamlessly with Grafana
- Uses labels like Prometheus
- Doesn't index log content, only labels (see the LogQL examples after this list)
- Often a simpler, lower-cost fit for Kubernetes log aggregation than the full ELK Stack
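Because Loki indexes labels rather than log content, a LogQL query selects a label set first and then filters or aggregates the log lines; two examples (label names depend on how Promtail relabels them):
# Error lines from a labelled application
{namespace="default", app="myapp"} |= "error"
# Rate of error lines over the last 5 minutes
rate({app="myapp"} |= "error" [5m])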
7.2 Installing Loki
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki Stack (Loki + Promtail + Grafana)
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace
7.3 Promtail Configuration
Promtail is the log shipper for Loki:
server:
  http_listen_port: 3101
  grpc_listen_port: 9096
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: __host__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
8. Best Practices
8.1 Monitor Key Metrics
Focus on metrics that indicate problems: CPU, memory, error rates, latency, and Pod restarts.
8.2 Set Up Alerting
Configure alerts for critical issues. Use Alertmanager to route alerts to appropriate channels (Slack, PagerDuty, email).
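A minimal sketch of an Alertmanager route that sends critical alerts to a Slack channel (the webhook URL and channel name are placeholders):
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: '#alerts-critical'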
8.3 Use Structured Logging
Use JSON format for logs to enable better parsing and searching. Include correlation IDs for tracing requests.
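For example, a structured log line might look like this (field names are illustrative):
{"timestamp": "2024-01-15T10:23:45Z", "level": "error", "msg": "payment failed", "correlation_id": "req-7f3a", "service": "checkout", "status": 502}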
8.4 Centralize Logs
Aggregate logs from all Pods to a central location for easier searching and analysis.
8.5 Set Retention Policies
Configure log retention based on compliance requirements and storage costs. Archive old logs if needed.
8.6 Monitor Control Plane
Don't forget to monitor the control plane components: API Server, etcd, scheduler, and controller manager.
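With kube-prometheus-stack, most control plane components are scraped out of the box (etcd may need extra configuration on managed clusters), so API Server latency can be graphed with a query such as the following (metric names follow upstream Kubernetes instrumentation):
# 99th percentile API Server request latency by verb
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))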
8.7 Use Dashboards
Create comprehensive dashboards for different audiences: operations, developers, and management.
8.8 Test Alerting
Regularly test your alerting setup to ensure alerts are received and actionable.
Summary: Effective monitoring and logging in Kubernetes requires Prometheus for metrics collection, Grafana for visualization, and a logging solution like ELK Stack or Loki. Monitor cluster, Pod, and application metrics. Set up alerting for critical issues. Use structured logging and centralize logs for easier analysis.