Learn how to monitor and log Kubernetes clusters using Prometheus, Grafana, and the ELK Stack. Understand metrics collection, alerting, and log aggregation for production clusters.
1. Monitoring Overview
Effective monitoring in Kubernetes requires tracking multiple layers:
- Cluster Metrics: Node CPU, memory, disk usage
- Pod Metrics: Container CPU, memory, network
- Application Metrics: Custom application metrics
- Control Plane Metrics: API Server, etcd, scheduler performance
1.1 Monitoring Stack Components
- Metrics Collection: Prometheus, Metrics Server
- Visualization: Grafana
- Alerting: Alertmanager
- Logging: ELK Stack, Loki, Fluentd
1.2 Key Metrics to Monitor
- Pod CPU and memory usage
- Node resource utilization
- Pod restart counts
- Network traffic
- Storage usage
- API Server latency
- Error rates
2. Prometheus
Prometheus is an open-source monitoring and alerting toolkit. It collects metrics by scraping HTTP endpoints and stores them in a time-series database.
2.1 Prometheus Architecture
- Prometheus Server: Scrapes and stores metrics
- Exporters: Expose metrics endpoints
- Service Discovery: Automatically discovers targets
- Alertmanager: Handles alerts
2.2 Installing Prometheus
Install Prometheus using Helm:
# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install the kube-prometheus-stack (Prometheus Operator, Prometheus, Grafana, Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
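Once the chart is installed, confirm the stack's Pods are running and reach the Prometheus UI (the Service name below is the default for a release named prometheus and may differ with other release names or chart versions):
# List the Pods created by the chart
kubectl get pods -n monitoring
# Port-forward the Prometheus UI
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090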
2.3 Prometheus Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
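With the relabel rule above, only Pods that opt in via annotations are scraped. A minimal sketch of a Deployment whose Pod template carries those annotations (the myapp name, image, and port are illustrative; the port and path annotations only take effect if matching relabel rules are configured):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: myapp
          image: myapp:1.0
          ports:
            - containerPort: 8080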
2.4 PromQL Queries
Prometheus Query Language (PromQL) examples:
# Per-container CPU usage (as a percentage of one core)
rate(container_cpu_usage_seconds_total[5m]) * 100
# Memory usage
container_memory_usage_bytes
# Pod restart count
kube_pod_container_status_restarts_total
# Request rate
rate(http_requests_total[5m])
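Per-container series are usually aggregated before graphing; for example, summing CPU and memory by Pod (standard PromQL over the cAdvisor metrics above):
# CPU cores used per Pod
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
# Working-set memory per Pod
sum by (namespace, pod) (container_memory_working_set_bytes)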
2.5 ServiceMonitor
Use ServiceMonitor to configure Prometheus scraping:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
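A ServiceMonitor selects a Service rather than Pods, so the target Service needs the matching label and a named metrics port; a minimal sketch (the port number is illustrative, and depending on chart values the ServiceMonitor may also need the chart's release label so the operator picks it up):
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080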
2.6 Alerting Rules
groups:
  - name: kubernetes
    rules:
      - alert: HighCPUUsage
        expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod is crash looping"
3. Grafana
Grafana is an open-source analytics and visualization platform. It connects to Prometheus and other data sources to create dashboards.
3.1 Grafana Features
- Rich visualization options
- Dashboard templating
- Alerting
- Multiple data source support
- User management and permissions
3.2 Accessing Grafana
When installed via Helm, access Grafana:
# Get Grafana admin password
kubectl get secret --namespace monitoring prometheus-grafana \
-o jsonpath="{.data.admin-password}" | base64 --decode
# Port forward to access Grafana
kubectl port-forward --namespace monitoring svc/prometheus-grafana 3000:80
Access Grafana at http://localhost:3000 (default username: admin)
3.3 Pre-built Dashboards
Import pre-built dashboards from Grafana Labs:
- Kubernetes Cluster Monitoring (ID: 7249)
- Kubernetes Pod Monitoring (ID: 6417)
- Node Exporter Full (ID: 1860)
- Kubernetes Deployment Statefulset Daemonset metrics (ID: 8588)
3.4 Creating Custom Dashboards
Create dashboards to visualize:
- Pod resource usage
- Application-specific metrics
- Error rates and latency
- Business metrics
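For error-rate and latency panels, queries like the following are typical, assuming the application exposes http_requests_total and an http_request_duration_seconds histogram (metric names vary by instrumentation library):
# Share of 5xx responses over the last 5 minutes
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# 95th percentile request latency
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))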
4. Metrics Server
Metrics Server collects resource usage data from kubelets and exposes it via the Kubernetes Metrics API. It's required for the Horizontal Pod Autoscaler (HPA) and the kubectl top commands.
4.1 Installing Metrics Server
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify installation
kubectl get deployment metrics-server -n kube-system
4.2 Using Metrics API
# View Node metrics
kubectl top nodes
# View Pod metrics
kubectl top pods
# View Pod metrics in namespace
kubectl top pods -n default
# View specific Pod
kubectl top pod <pod-name>
4.3 Metrics Server vs Prometheus
- Metrics Server: Lightweight, for HPA and kubectl top
- Prometheus: Full-featured monitoring, long-term storage, alerting
- Both can run together
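As an illustration of why Metrics Server matters, a resource-based HorizontalPodAutoscaler depends on the Metrics API it serves; a minimal sketch targeting a hypothetical myapp Deployment:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70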
5. Logging Overview
Kubernetes logging involves collecting logs from multiple sources:
- Container Logs: Application logs from containers
- Node Logs: kubelet, kube-proxy logs
- Control Plane Logs: API Server, etcd, scheduler logs
- Audit Logs: API access logs
5.1 Log Collection Approaches
- Node-level logging: Agent on each Node collects logs
- Sidecar pattern: Logging container alongside the application (see the sketch after this list)
- Application-level: Application sends logs directly to logging backend
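A minimal sketch of the sidecar pattern, assuming the application writes to a file on a shared emptyDir volume and a streaming sidecar tails it to stdout (image names and the log path are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
    - name: log-streamer
      image: busybox
      command: ["sh", "-c", "tail -n+1 -F /var/log/app/app.log"]
      volumeMounts:
        - name: logs
          mountPath: /var/log/app
  volumes:
    - name: logs
      emptyDir: {}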
5.2 Viewing Logs with kubectl
# View Pod logs
kubectl logs <pod-name>
# View logs from specific container
kubectl logs <pod-name> -c <container-name>
# Follow logs
kubectl logs -f <pod-name>
# View logs from previous container instance
kubectl logs <pod-name> --previous
# View logs from all Pods with label
kubectl logs -l app=nginx
6. ELK Stack
The ELK Stack consists of Elasticsearch, Logstash, and Kibana for log aggregation and analysis.
6.1 ELK Components
- Elasticsearch: Search and analytics engine
- Logstash: Log processing pipeline
- Kibana: Visualization and exploration
- Beats/Fluentd: Log shippers
6.2 Installing ELK Stack
Install using Helm:
# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update
# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch
# Install Kibana
helm install kibana elastic/kibana
# Install Filebeat (log shipper)
helm install filebeat elastic/filebeat
6.3 Using Fluentd
Fluentd is a popular log collector for Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
    </match>
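The ConfigMap alone doesn't run anything; Fluentd is typically deployed as a DaemonSet that mounts the ConfigMap together with the node's log directory. A trimmed sketch (the image tag, namespace, and mount paths are illustrative and depend on the Fluentd image used):
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluentd-config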
6.4 Log Aggregation Best Practices
- Use structured logging (JSON)
- Include correlation IDs
- Set appropriate log retention policies
- Index logs by namespace and application
- Monitor log volume
7. Loki
Loki is a horizontally scalable, highly available log aggregation system inspired by Prometheus. It's designed to work with Grafana.
7.1 Loki Advantages
- Lightweight and cost-effective
- Integrates seamlessly with Grafana
- Uses labels like Prometheus
- Doesn't index log content, only labels (see the LogQL examples after this list)
- Often a simpler, lower-cost fit for Kubernetes log aggregation than the full ELK Stack
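Because Loki indexes labels rather than log content, a LogQL query selects a label set first and then filters or aggregates the log lines; two examples (label names depend on how Promtail relabels them):
# Error lines from a labelled application
{namespace="default", app="myapp"} |= "error"
# Rate of error lines over the last 5 minutes
rate({app="myapp"} |= "error" [5m])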
7.2 Installing Loki
# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
# Install Loki Stack (Loki + Promtail + Grafana)
helm install loki grafana/loki-stack \
--namespace logging \
--create-namespace
7.3 Promtail Configuration
Promtail is the log shipper for Loki:
server:
  http_listen_port: 3101
  grpc_listen_port: 9096
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: __host__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
8. Best Practices
8.1 Monitor Key Metrics
Focus on metrics that indicate problems: CPU, memory, error rates, latency, and Pod restarts.
8.2 Set Up Alerting
Configure alerts for critical issues. Use Alertmanager to route alerts to appropriate channels (Slack, PagerDuty, email).
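A minimal sketch of an Alertmanager route that sends critical alerts to a Slack channel (the webhook URL and channel name are placeholders):
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        channel: '#alerts-critical'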
8.3 Use Structured Logging
Use JSON format for logs to enable better parsing and searching. Include correlation IDs for tracing requests.
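For example, a structured log line might look like this (field names are illustrative):
{"timestamp": "2024-01-15T10:23:45Z", "level": "error", "msg": "payment failed", "correlation_id": "req-7f3a", "service": "checkout", "status": 502}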
8.4 Centralize Logs
Aggregate logs from all Pods to a central location for easier searching and analysis.
8.5 Set Retention Policies
Configure log retention based on compliance requirements and storage costs. Archive old logs if needed.
8.6 Monitor Control Plane
Don't forget to monitor the control plane components: API Server, etcd, scheduler, and controller manager.
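With kube-prometheus-stack, most control plane components are scraped out of the box (etcd may need extra configuration on managed clusters), so API Server latency can be graphed with a query such as the following (metric names follow upstream Kubernetes instrumentation):
# 99th percentile API Server request latency by verb
histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))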
8.7 Use Dashboards
Create comprehensive dashboards for different audiences: operations, developers, and management.
8.8 Test Alerting
Regularly test your alerting setup to ensure alerts are received and actionable.
Summary: Effective monitoring and logging in Kubernetes requires Prometheus for metrics collection, Grafana for visualization, and a logging solution like ELK Stack or Loki. Monitor cluster, Pod, and application metrics. Set up alerting for critical issues. Use structured logging and centralize logs for easier analysis.