Effective monitoring and logging are essential for maintaining the health, performance, and security of Kubernetes clusters and applications. Kubernetes provides various tools and patterns to collect metrics, monitor resources, and aggregate logs from containers, pods, and cluster components.
Monitoring with Prometheus and Grafana
Prometheus has become the de facto standard for monitoring Kubernetes clusters, while Grafana provides powerful visualization capabilities for the collected metrics.
Prometheus Architecture
Prometheus is a pull-based monitoring system that collects metrics from configured targets:
- Prometheus Server: Scrapes and stores time series data
- Exporters: Expose metrics in Prometheus format (Node Exporter, cAdvisor, etc.)
- Pushgateway: Handles metrics from short-lived jobs
- Alertmanager: Handles alerts sent by Prometheus Server
- Service Discovery: Automatically discovers monitoring targets in Kubernetes
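To make the service-discovery piece concrete, below is a minimal, hand-written prometheus.yml fragment that discovers pods and keeps only those annotated with prometheus.io/scrape: "true". This is only a sketch: when you use the Prometheus Operator (covered below), ServiceMonitor and PodMonitor resources generate this kind of configuration for you, and the job name and annotation convention here are illustrative assumptions.

# Hand-written prometheus.yml fragment; the Prometheus Operator generates
# equivalent scrape configuration from ServiceMonitor/PodMonitor resources.
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                      # discover every pod in the cluster
    relabel_configs:
      # Keep only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Carry the pod's namespace and name into the scraped series as labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod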
Key Kubernetes Metrics to Monitor
Cluster-level Metrics
- Node CPU and memory utilization
- Disk space and I/O
- Network bandwidth
- API server latency and error rates
Workload-level Metrics
- Pod CPU and memory usage
- Container restarts
- Application-specific metrics
- Request latency and error rates
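A few example PromQL queries for the metrics listed above. These assume the exporters that ship with kube-prometheus-stack (node_exporter, cAdvisor via the kubelet, and kube-state-metrics); metric names can vary between exporter versions.

# Node CPU utilization in percent, from node_exporter
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory working set per pod, from cAdvisor
sum(container_memory_working_set_bytes{container!="",container!="POD"}) by (pod)

# Container restarts over the last hour, from kube-state-metrics
increase(kube_pod_container_status_restarts_total[1h])

# API server 99th percentile request latency, from the API server's own metrics
histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le, verb))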
Setting Up Prometheus in Kubernetes
Using the Prometheus Operator
The Prometheus Operator simplifies Prometheus setup and management in Kubernetes:
# Install the kube-prometheus-stack chart (Prometheus Operator plus Prometheus,
# Alertmanager, Grafana, node_exporter, and kube-state-metrics) using Helm
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack
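After the install completes, a quick sanity check is to list the pods and port-forward the Prometheus service. The operator always creates a service named prometheus-operated in front of the Prometheus pods; other resource names are generated from the Helm release name.

# See what the chart installed (Prometheus, Alertmanager, Grafana, exporters, the operator)
kubectl get pods

# Port-forward the operator-managed service and open http://localhost:9090
# to browse targets and metrics
kubectl port-forward svc/prometheus-operated 9090:9090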
ServiceMonitor Resource
A ServiceMonitor selects Kubernetes Services by label and tells Prometheus which port and path to scrape on them:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: web
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - my-namespace
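For this ServiceMonitor to find anything, a Service must exist in my-namespace whose labels match the selector and which exposes a port named web. A minimal sketch is shown below; the app name and port number 8080 are illustrative assumptions. With the kube-prometheus-stack chart's default values, the release: prometheus label on the ServiceMonitor is also what allows the operator's Prometheus instance to select it.

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-namespace
  labels:
    app: my-app            # matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app: my-app            # pods backing the service
  ports:
    - name: web            # must match the ServiceMonitor endpoint's "port"
      port: 8080
      targetPort: 8080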
PodMonitor Resource
A PodMonitor selects Pods by label and scrapes them directly, without requiring a Service in front of them:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-pod-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
      path: /metrics
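PodMonitors are handy for workloads with no Service, such as batch jobs. The only requirement on the workload side is that the metrics container port is named, so the PodMonitor can reference it. A sketch of a matching Deployment, where the image and port number are illustrative assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app          # matched by the PodMonitor's spec.selector
    spec:
      containers:
        - name: my-app
          image: ghcr.io/example/my-app:latest   # illustrative image
          ports:
            - name: metrics                      # referenced by podMetricsEndpoints.port
              containerPort: 9100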
Grafana Dashboards
Grafana connects to Prometheus as a data source and visualizes the collected metrics through customizable dashboards. The example below provisions a dashboard declaratively using the Grafana Operator's GrafanaDashboard custom resource:
Example Dashboard Configuration
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: kubernetes-cluster-monitoring
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "Kubernetes Cluster Monitoring",
      "tags": ["kubernetes", "prometheus"],
      "timezone": "browser",
      "panels": [
        {
          "title": "CPU Usage",
          "type": "graph",
          "targets": [
            {
              "expr": "sum(rate(container_cpu_usage_seconds_total{container!=\"POD\",container!=\"\"}[5m])) by (pod)",
              "legendFormat": "{{pod}}"
            }
          ]
        }
      ]
    }
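The GrafanaDashboard resource requires the Grafana Operator (the integreatly.org API group). If Grafana was installed as part of kube-prometheus-stack instead, its dashboard sidecar can load dashboards from ConfigMaps carrying the grafana_dashboard label (the sidecar's default; check your chart values). A minimal sketch with a placeholder dashboard:

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-app-dashboard
  labels:
    grafana_dashboard: "1"        # default label watched by the Grafana dashboard sidecar
data:
  my-app-dashboard.json: |
    {
      "title": "My App",
      "tags": ["my-app"],
      "panels": []
    }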
Alerting with Prometheus
PrometheusRule resources define alerts based on metric conditions:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  labels:
    release: prometheus
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: HighMemoryUsage
          expr: (container_memory_working_set_bytes{container!="",container!="POD"} / container_spec_memory_limit_bytes) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory usage in pod {{ $labels.pod }}"
            description: "Pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of its memory limit"
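Firing alerts still need somewhere to go. With the Prometheus Operator, routing can be declared with an AlertmanagerConfig resource; the sketch below routes everything to a Slack webhook stored in a Secret. The receiver name, channel, and Secret are assumptions, and whether Alertmanager picks the resource up depends on the alertmanagerConfigSelector and namespace settings of your installation.

apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: my-app-routing
  labels:
    release: prometheus
spec:
  route:
    receiver: slack-notifications
    groupBy: ["alertname", "namespace"]
    repeatInterval: 4h
  receivers:
    - name: slack-notifications
      slackConfigs:
        - channel: "#alerts"             # hypothetical channel
          apiURL:
            name: slack-webhook          # Secret holding the webhook URL
            key: url
          sendResolved: true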
Log Management with Fluentd, Elasticsearch, and Kibana (EFK Stack)
The EFK stack is a popular solution for collecting, storing, and analyzing logs in Kubernetes:
Fluentd Architecture
Fluentd collects, processes, and forwards logs from various sources:
- Input Plugins: Collect logs from sources (files, systemd, etc.)
- Parser Plugins: Parse logs into structured data
- Filter Plugins: Process and modify log records
- Output Plugins: Send logs to destinations (Elasticsearch, S3, etc.)
- Buffer: Temporarily stores logs during processing
Setting Up Fluentd in Kubernetes
Fluentd DaemonSet Configuration
Fluentd typically runs as a DaemonSet to collect logs from each node:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: logging
  labels:
    app: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: "elasticsearch.logging.svc.cluster.local"
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"
            - name: FLUENT_ELASTICSEARCH_SCHEME
              value: "http"
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
            - name: fluentd-config
              mountPath: /fluentd/etc
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
        - name: fluentd-config
          configMap:
            name: fluentd-config
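The DaemonSet references a fluentd service account, which the kubernetes_metadata filter needs in order to look up pod and namespace metadata from the API server. A minimal RBAC setup for it might look like this:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fluentd
  namespace: logging
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fluentd
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]   # read-only access for metadata enrichment
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fluentd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: fluentd
subjects:
  - kind: ServiceAccount
    name: fluentd
    namespace: logging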
Fluentd Configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <filter kubernetes.**>
      @type kubernetes_metadata
    </filter>

    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      logstash_format true
      logstash_prefix fluentd
      include_tag_key true
      type_name fluentd
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.system.buffer
        flush_mode interval
        retry_type exponential_backoff
        flush_thread_count 2
        flush_interval 5s
        retry_forever true
        retry_max_interval 30
        chunk_limit_size 2M
        queue_limit_length 8
        overflow_action block
      </buffer>
    </match>
Elasticsearch Configuration
Elasticsearch stores and indexes the log data collected by Fluentd. The manifest below runs a single-node instance that is suitable for development and testing; production clusters need multiple nodes, persistent volumes, and security enabled:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 1                 # single-node discovery below; scaling out requires multi-node discovery settings
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:7.10.2
          env:
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: "-Xms512m -Xmx512m"
            - name: xpack.security.enabled
              value: "false"
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
          resources:
            limits:
              memory: 1Gi
            requests:
              cpu: 500m
              memory: 1Gi
      volumes:
        - name: data
          emptyDir: {}        # for production, use volumeClaimTemplates with persistent storage
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch
  namespace: logging
spec:
  selector:
    app: elasticsearch
  ports:
    - port: 9200
      name: http
    - port: 9300
      name: transport
  clusterIP: None
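Once Fluentd and Elasticsearch are running, a quick way to confirm that logs are flowing is to port-forward an Elasticsearch pod and check cluster health and the fluentd-* indices (the index name follows the logstash_prefix configured above):

kubectl -n logging port-forward elasticsearch-0 9200:9200 &

# Cluster health should report green (or yellow on a single node when indices have replicas configured)
curl -s http://localhost:9200/_cluster/health?pretty

# Daily indices named fluentd-YYYY.MM.DD should start appearing
curl -s 'http://localhost:9200/_cat/indices/fluentd-*?v'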
Kibana Configuration
Kibana provides a web interface for searching, analyzing, and visualizing log data:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:7.10.2
          env:
            - name: ELASTICSEARCH_HOSTS
              value: "http://elasticsearch.logging.svc.cluster.local:9200"
          ports:
            - containerPort: 5601
          resources:
            requests:
              cpu: 100m
              memory: 500Mi
            limits:
              memory: 1Gi
---
apiVersion: v1
kind: Service
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    app: kibana
  ports:
    - port: 5601
      targetPort: 5601
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kibana
  namespace: logging
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: kibana.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: kibana
                port:
                  number: 5601
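If no Ingress controller (or DNS for kibana.example.com) is in place yet, Kibana can also be reached with a port-forward; once it is up, create an index pattern matching fluentd-* to start browsing logs:

kubectl -n logging port-forward deploy/kibana 5601:5601
# then open http://localhost:5601 and add an index pattern for "fluentd-*"
# (Stack Management -> Index Patterns in Kibana 7.x)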
Best Practices for Monitoring and Logging
Monitoring Best Practices
- Monitor at multiple levels: cluster, node, pod, and container
- Set up meaningful alerts with appropriate thresholds
- Use histograms for latency measurements instead of averages (see the example query after this list)
- Regularly review and update your monitoring dashboards
- Monitor resource utilization and plan for capacity
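To illustrate the histogram recommendation above: if an application exposes a Prometheus histogram such as http_request_duration_seconds (the metric name here is an assumption), the 95th-percentile latency can be computed with histogram_quantile rather than averaging:

# 95th percentile request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))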
Logging Best Practices
- Implement structured logging in your applications
- Include correlation IDs for tracing requests across services
- Set appropriate log retention policies
- Secure access to your logging infrastructure
- Regularly archive old logs to cold storage
Performance Considerations
- Limit the cardinality of your metrics to prevent Prometheus overload
- Use sampling for high-volume logs
- Configure appropriate buffer sizes for Fluentd
- Monitor the monitoring system itself
- Consider using Thanos or Cortex for long-term metric storage
Implementing a comprehensive monitoring and logging solution is crucial for maintaining the reliability and performance of your Kubernetes clusters and applications. The combination of Prometheus for metrics and the EFK stack for logs provides a powerful observability platform that can scale with your needs.