Monitoring and Logging in Kubernetes

Learn how to monitor and log Kubernetes clusters using Prometheus, Grafana, the ELK Stack, and Loki. Understand metrics collection, alerting, and log aggregation for production clusters.

1. Monitoring Overview

Effective monitoring in Kubernetes requires tracking multiple layers:

  • Cluster Metrics: Node CPU, memory, disk usage
  • Pod Metrics: Container CPU, memory, network
  • Application Metrics: Custom application metrics
  • Control Plane Metrics: API Server, etcd, scheduler performance

1.1 Monitoring Stack Components

  • Metrics Collection: Prometheus, Metrics Server
  • Visualization: Grafana
  • Alerting: Alertmanager
  • Logging: ELK Stack, Loki, Fluentd

1.2 Key Metrics to Monitor

  • Pod CPU and memory usage
  • Node resource utilization
  • Pod restart counts
  • Network traffic
  • Storage usage
  • API Server latency
  • Error rates

2. Prometheus

Prometheus is an open-source monitoring and alerting toolkit. It collects metrics by scraping HTTP endpoints and stores them in a time-series database.

2.1 Prometheus Architecture

  • Prometheus Server: Scrapes and stores metrics
  • Exporters: Expose metrics endpoints
  • Service Discovery: Automatically discovers targets
  • Alertmanager: Handles alerts

2.2 Installing Prometheus

Install Prometheus using Helm:

# Add Prometheus Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install Prometheus Operator
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
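
Once the chart is installed, verify that the monitoring components are running:

# Check Pods in the monitoring namespace
kubectl get pods -n monitoring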

2.3 Prometheus Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    
    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
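
The relabeling rule above keeps only Pods annotated with prometheus.io/scrape: "true". A minimal Pod sketch showing that annotation (the port and path annotations follow the common prometheus.io convention but would need their own relabel rules to take effect):

apiVersion: v1
kind: Pod
metadata:
  name: annotated-app                # illustrative name
  annotations:
    prometheus.io/scrape: "true"     # matched by the relabel rule above
    prometheus.io/port: "8080"       # conventional; requires an extra relabel rule
    prometheus.io/path: "/metrics"   # conventional; requires an extra relabel rule
spec:
  containers:
  - name: app
    image: myapp:latest              # placeholder image exposing /metrics on 8080
    ports:
    - containerPort: 8080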

2.4 PromQL Queries

Prometheus Query Language (PromQL) examples:

# CPU usage as a percentage of one core
rate(container_cpu_usage_seconds_total[5m]) * 100

# Memory usage
container_memory_usage_bytes

# Pod restart count
kube_pod_container_status_restarts_total

# Request rate
rate(http_requests_total[5m])

2.5 ServiceMonitor

Use ServiceMonitor to configure Prometheus scraping:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
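
A ServiceMonitor selects Services (not Pods) by label, so a matching Service must exist and expose a port named metrics. A minimal sketch of such a Service (names and port numbers are illustrative; with the kube-prometheus-stack the ServiceMonitor may also need the release label of the Helm installation to be picked up):

apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp          # matched by the ServiceMonitor selector above
spec:
  selector:
    app: myapp
  ports:
  - name: metrics       # must match the endpoint port name in the ServiceMonitor
    port: 8080
    targetPort: 8080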

2.6 Alerting Rules

groups:
- name: kubernetes
  rules:
  - alert: HighCPUUsage
    expr: rate(container_cpu_usage_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
  
  - alert: PodCrashLooping
    expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod is crash looping"

3. Grafana

Grafana is an open-source analytics and visualization platform. It connects to Prometheus and other data sources to create dashboards.

3.1 Grafana Features

  • Rich visualization options
  • Dashboard templating
  • Alerting
  • Multiple data source support
  • User management and permissions

3.2 Accessing Grafana

When Grafana is installed as part of the kube-prometheus-stack chart above, access it as follows:

# Get Grafana admin password
kubectl get secret --namespace monitoring prometheus-grafana \
  -o jsonpath="{.data.admin-password}" | base64 --decode

# Port forward to access Grafana
kubectl port-forward --namespace monitoring svc/prometheus-grafana 3000:80

Access Grafana at http://localhost:3000 (default username: admin)

3.3 Pre-built Dashboards

Import pre-built dashboards from Grafana Labs:

  • Kubernetes Cluster Monitoring (ID: 7249)
  • Kubernetes Pod Monitoring (ID: 6417)
  • Node Exporter Full (ID: 1860)
  • Kubernetes Deployment Statefulset Daemonset metrics (ID: 8588)

3.4 Creating Custom Dashboards

Create dashboards to visualize:

  • Pod resource usage
  • Application-specific metrics
  • Error rates and latency (see the example queries below)
  • Business metrics
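
For example, error-rate and latency panels can be driven by PromQL queries like the following (the http_requests_total and http_request_duration_seconds_bucket metric names assume an application instrumented with standard Prometheus HTTP metrics):

# Error rate: share of requests returning 5xx (assumes a "status" label)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# 95th percentile request latency (assumes a Prometheus histogram)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))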

4. Metrics Server

Metrics Server collects resource usage data from kubelets and exposes it via the Kubernetes Metrics API. It is required for the Horizontal Pod Autoscaler (HPA) and the kubectl top commands.

4.1 Installing Metrics Server

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify installation
kubectl get deployment metrics-server -n kube-system

4.2 Using Metrics API

# View Node metrics
kubectl top nodes

# View Pod metrics
kubectl top pods

# View Pod metrics in namespace
kubectl top pods -n default

# View specific Pod
kubectl top pod <pod-name>
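
kubectl top is backed by the Metrics API, which can also be queried directly:

# Query the Metrics API directly
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods"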

4.3 Metrics Server vs Prometheus

  • Metrics Server: Lightweight, for HPA and kubectl top
  • Prometheus: Full-featured monitoring, long-term storage, alerting
  • Both can run together

5. Logging Overview

Kubernetes logging involves collecting logs from multiple sources:

  • Container Logs: Application logs from containers
  • Node Logs: kubelet, kube-proxy logs
  • Control Plane Logs: API Server, etcd, scheduler logs
  • Audit Logs: API access logs

5.1 Log Collection Approaches

  • Node-level logging: Agent on each Node collects logs
  • Sidecar pattern: Logging container runs alongside the application (sketched below)
  • Application-level: Application sends logs directly to logging backend
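
As a sketch of the sidecar pattern, the Pod below runs a second container that streams a log file written by the application to stdout, where kubectl logs or a node-level agent can pick it up (container names, image, and log path are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar
spec:
  containers:
  - name: app
    image: myapp:latest             # placeholder; assumed to write /var/log/app/app.log
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-tailer                # sidecar streams the file to stdout
    image: busybox:1.36
    args: [/bin/sh, -c, "tail -n+1 -F /var/log/app/app.log"]
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs
    emptyDir: {}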

5.2 Viewing Logs with kubectl

# View Pod logs
kubectl logs <pod-name>

# View logs from specific container
kubectl logs <pod-name> -c <container-name>

# Follow logs
kubectl logs -f <pod-name>

# View logs from previous container instance
kubectl logs <pod-name> --previous

# View logs from all Pods with label
kubectl logs -l app=nginx

6. ELK Stack

The ELK Stack consists of Elasticsearch, Logstash, and Kibana for log aggregation and analysis.

6.1 ELK Components

  • Elasticsearch: Search and analytics engine
  • Logstash: Log processing pipeline
  • Kibana: Visualization and exploration
  • Beats/Fluentd: Log shippers

6.2 Installing ELK Stack

Install using Helm:

# Add Elastic Helm repository
helm repo add elastic https://helm.elastic.co
helm repo update

# Install Elasticsearch
helm install elasticsearch elastic/elasticsearch

# Install Kibana
helm install kibana elastic/kibana

# Install Filebeat (log shipper)
helm install filebeat elastic/filebeat

6.3 Using Fluentd

Fluentd is a popular log collector for Kubernetes:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>
    
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
    </match>
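
This configuration is typically mounted into a Fluentd DaemonSet so that one collector runs on every Node and can read the container log files from the host. A minimal sketch (image tag and namespace are illustrative):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1.16-debian-elasticsearch7-1  # illustrative tag
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: config
          mountPath: /fluentd/etc   # overrides the image's default fluent.conf
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: config
        configMap:
          name: fluentd-config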

6.4 Log Aggregation Best Practices

  • Use structured logging (JSON)
  • Include correlation IDs
  • Set appropriate log retention policies
  • Index logs by namespace and application
  • Monitor log volume

7. Loki

Loki is a horizontally-scalable, highly-available log aggregation system inspired by Prometheus. It's designed to work with Grafana.

7.1 Loki Advantages

  • Lightweight and cost-effective
  • Integrates seamlessly with Grafana
  • Uses labels like Prometheus
  • Doesn't index log content, only labels
  • Often a lighter-weight alternative to the ELK Stack for Kubernetes log volumes

7.2 Installing Loki

# Add Grafana Helm repository
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki Stack (Loki + Promtail + Grafana)
helm install loki grafana/loki-stack \
  --namespace logging \
  --create-namespace
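
If you use the Grafana instance from the kube-prometheus-stack rather than the one bundled with loki-stack, Loki can be added as a data source, for example via a Grafana provisioning file (the service DNS name assumes the installation above):

apiVersion: 1
datasources:
- name: Loki
  type: loki
  access: proxy
  url: http://loki.logging.svc.cluster.local:3100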

7.3 Promtail Configuration

Promtail is the log shipper for Loki:

server:
  http_listen_port: 3101
  grpc_listen_port: 9096

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: __host__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
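
Once logs are flowing into Loki, they are queried from Grafana with LogQL, which uses the same label-matching syntax as PromQL (the app label assumes the Pod carries an app label, which the labelmap rule above turns into a Loki label):

# All logs from Pods labelled app=myapp
{app="myapp"}

# Lines containing "error", counted per minute
sum(count_over_time({app="myapp"} |= "error" [1m]))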

8. Best Practices

8.1 Monitor Key Metrics

Focus on metrics that indicate problems: CPU, memory, error rates, latency, and Pod restarts.

8.2 Set Up Alerting

Configure alerts for critical issues. Use Alertmanager to route alerts to appropriate channels (Slack, PagerDuty, email).
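
A minimal Alertmanager routing sketch that sends critical alerts to Slack (the webhook URL and channel are placeholders; with the kube-prometheus-stack this would typically be supplied via the chart's Alertmanager configuration values):

route:
  receiver: default
  routes:
  - matchers:
    - severity = "critical"
    receiver: slack-critical

receivers:
- name: default
- name: slack-critical
  slack_configs:
  - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ   # placeholder webhook URL
    channel: '#alerts'
    send_resolved: true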

8.3 Use Structured Logging

Use JSON format for logs to enable better parsing and searching. Include correlation IDs for tracing requests.
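
For example, a single structured log line might look like this (field names are illustrative):

{"timestamp":"2024-01-15T10:23:45Z","level":"error","service":"checkout","correlation_id":"a1b2c3d4","message":"payment gateway timeout"}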

8.4 Centralize Logs

Aggregate logs from all Pods to a central location for easier searching and analysis.

8.5 Set Retention Policies

Configure log retention based on compliance requirements and storage costs. Archive old logs if needed.

8.6 Monitor Control Plane

Don't forget to monitor the control plane components: API Server, etcd, scheduler, and controller manager.

8.7 Use Dashboards

Create comprehensive dashboards for different audiences: operations, developers, and management.

8.8 Test Alerting

Regularly test your alerting setup to ensure alerts are received and actionable.

Summary: Effective monitoring and logging in Kubernetes requires Prometheus for metrics collection, Grafana for visualization, and a logging solution like ELK Stack or Loki. Monitor cluster, Pod, and application metrics. Set up alerting for critical issues. Use structured logging and centralize logs for easier analysis.
