Prometheus and Grafana: Metrics and Visualisation
Building monitoring infrastructure with Prometheus and Grafana. PromQL queries, alerting rules, dashboard design, and Kubernetes integration.
Prometheus and Grafana form the foundation of modern monitoring infrastructure. Prometheus collects and stores metrics, while Grafana provides powerful visualisation and alerting. Together, they enable comprehensive observability for cloud-native applications.
Prometheus Architecture
Core Components
Prometheus Architecture:

```text
┌─────────────────────────────────────────────────────────────┐
│                      Prometheus Server                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │    Retrieval    │  │      TSDB       │  │    HTTP     │  │
│  │   (Scraping)    │──│  (Time Series   │──│   Server    │  │
│  │                 │  │    Database)    │  │  (PromQL)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└───────────────┬─────────────────┬─────────────────┬─────────┘
                │                 │                 │
        ┌───────▼───────┐  ┌──────▼──────┐  ┌───────▼───────┐
        │    Targets    │  │ Alertmanager│  │    Grafana    │
        │  (Apps, K8s)  │  │             │  │               │
        └───────────────┘  └─────────────┘  └───────────────┘
```

Metric Types
| Type | Description | Example |
|---|---|---|
| Counter | Monotonically increasing | http_requests_total |
| Gauge | Can go up or down | temperature_celsius |
| Histogram | Bucketed observations | request_duration_seconds |
| Summary | Like a histogram, but quantiles are pre-computed client-side and cannot be aggregated | request_latency |
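These types map directly onto client-library primitives. The sketch below uses the Python prometheus_client package to expose one of each on a /metrics endpoint; the metric names, port, and simulated workload are illustrative, not part of any configuration shown later.

```python
# pip install prometheus_client
import random
import time

from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only ever increases (resets when the process restarts)
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status_code"])
# Gauge: can go up or down
IN_FLIGHT = Gauge("http_requests_in_flight", "Requests currently being served")
# Histogram: observations bucketed server-side, aggregatable with histogram_quantile()
LATENCY = Histogram("http_request_duration_seconds", "Request latency",
                    buckets=[0.05, 0.1, 0.25, 0.5, 1, 2, 5])
# Summary: client-side count/sum (the Python client does not export quantiles)
PAYLOAD = Summary("request_payload_bytes", "Request payload size")

def handle_request():
    IN_FLIGHT.inc()
    with LATENCY.time():                 # observes elapsed seconds into the buckets
        time.sleep(random.uniform(0.01, 0.3))
    PAYLOAD.observe(random.randint(200, 5000))
    REQUESTS.labels(status_code="200").inc()
    IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8080)              # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```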
Prometheus Configuration
Basic Setup
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    environment: prod

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Application endpoints
  - job_name: api-service
    static_configs:
      - targets: ['api-service:8080']
    metrics_path: /metrics
    scheme: http

  # Kubernetes service discovery
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
  # Node metrics, scraped through the Kubernetes API server proxy
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    # Proxying via kubernetes.default.svc:443 requires HTTPS and service-account auth
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
```
Recording Rules

```yaml
# recording-rules.yml
groups:
  - name: api-service-rules
    interval: 30s
    rules:
      # Request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Error rate
      - record: job:http_requests:error_rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P99 latency
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      # Apdex score (T = 0.5s, tolerable up to 2s)
      - record: job:apdex_score
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (job)
            + sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) by (job)
          ) / 2
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (job)
```
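The Apdex expression works because histogram buckets are cumulative: the le="2.0" bucket already contains every satisfied request, so (satisfied + tolerable) / 2 equals satisfied + tolerating/2. A small arithmetic sketch with made-up bucket counts:

```python
# Hypothetical per-window bucket counts (cumulative, as Prometheus exports them)
le_0_5 = 900   # requests faster than 0.5s -> "satisfied"
le_2_0 = 980   # requests faster than 2.0s -> satisfied + "tolerating"
total = 1000   # http_request_duration_seconds_count

tolerating = le_2_0 - le_0_5                            # 80 requests between 0.5s and 2s
apdex_definition = (le_0_5 + tolerating / 2) / total    # classic Apdex formula
apdex_rule = ((le_0_5 + le_2_0) / 2) / total            # the recording-rule shorthand

assert apdex_definition == apdex_rule == 0.94
print(apdex_rule)   # 0.94
```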
Alerting Rules

```yaml
# alerting-rules.yml
groups:
  - name: api-service-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_requests:error_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: job:http_request_duration:p99 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.job }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Service is down
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes / container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Container memory usage is high
          description: "{{ $labels.container }} is using {{ $value | humanizePercentage }} of its memory limit"

  - name: slo-alerts
    rules:
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: SLO burn rate critical
          description: "Error budget is being consumed at 14.4x the sustainable rate"
```
PromQL Queries

Common Patterns
```promql
# Request rate per second
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
  * 100

# P50, P90, P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency
rate(http_request_duration_seconds_sum[5m])
  /
rate(http_request_duration_seconds_count[5m])

# Requests by status code
sum by (status_code) (rate(http_requests_total[5m]))

# Top 5 endpoints by request rate
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# Memory usage percentage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# CPU usage (percent of one core)
rate(container_cpu_usage_seconds_total[5m]) * 100

# Network I/O
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
```
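Any of these expressions can also be run programmatically against Prometheus's HTTP API (GET /api/v1/query), which is useful for ad hoc scripts and CI checks. A minimal sketch, assuming Prometheus is reachable at http://localhost:9090:

```python
import requests

PROM_URL = "http://localhost:9090"   # assumed address; adjust for your environment

def instant_query(expr: str):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Top 5 endpoints by request rate
for series in instant_query('topk(5, sum by (endpoint) (rate(http_requests_total[5m])))'):
    labels, (ts, value) = series["metric"], series["value"]
    print(f'{labels.get("endpoint", "<none>")}: {float(value):.2f} req/s')
```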
Grafana Dashboards

Dashboard JSON Model
```json
{
  "dashboard": {
    "title": "API Service Dashboard",
    "tags": ["api", "production"],
    "timezone": "browser",
    "refresh": "30s",
    "panels": [
      {
        "title": "Request Rate",
        "type": "timeseries",
        "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (status_code)",
            "legendFormat": "{{status_code}}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps"
          }
        }
      },
      {
        "title": "Latency Distribution",
        "type": "heatmap",
        "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 },
        "targets": [
          {
            "expr": "sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le)",
            "format": "heatmap"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "stat",
        "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "sum(rate(http_requests_total{job=\"api-service\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 1 },
                { "color": "red", "value": 5 }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "stat",
        "gridPos": { "x": 6, "y": 8, "w": 6, "h": 4 },
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 0.5 },
                { "color": "red", "value": 1 }
              ]
            }
          }
        }
      }
    ]
  }
}
```
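Dashboards defined as JSON can be provisioned from files, managed with Terraform (next section), or pushed directly to Grafana's HTTP API. A rough sketch of the API route, assuming a service-account token in GRAFANA_TOKEN, Grafana at http://localhost:3000, and a hypothetical local file path:

```python
import json
import os

import requests

GRAFANA_URL = "http://localhost:3000"           # assumed address
TOKEN = os.environ["GRAFANA_TOKEN"]             # assumed service-account token

with open("dashboards/api-service.json") as f:  # hypothetical path to the JSON model above
    model = json.load(f)

payload = {
    "dashboard": model["dashboard"],  # the inner dashboard model
    "overwrite": True,                # replace an existing dashboard with the same title/UID
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print(resp.json())   # contains the dashboard uid, url and version on success
```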
Terraform Grafana Configuration

```hcl
# grafana.tf
resource "grafana_dashboard" "api_service" {
  config_json = templatefile("${path.module}/dashboards/api-service.json", {
    datasource = grafana_data_source.prometheus.uid
  })
  folder = grafana_folder.services.id
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    httpMethod   = "POST"
    timeInterval = "15s"
  })
}

resource "grafana_folder" "services" {
  title = "Services"
}

resource "grafana_alert_rule_group" "api_service" {
  name             = "api-service-alerts"
  folder_uid       = grafana_folder.services.uid
  interval_seconds = 60

  rule {
    name      = "High Error Rate"
    condition = "C"

    data {
      ref_id = "A"
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = grafana_data_source.prometheus.uid
      model = jsonencode({
        expr = "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
      })
    }

    data {
      ref_id = "C"
      relative_time_range {
        from = 0
        to   = 0
      }
      datasource_uid = "__expr__"
      model = jsonencode({
        type = "threshold"
        conditions = [{
          evaluator = {
            type   = "gt"
            params = [0.05]
          }
        }]
      })
    }

    annotations = {
      summary     = "High error rate on API service"
      description = "Error rate has exceeded 5% threshold"
    }
    labels = {
      severity = "critical"
    }
  }
}
```
Kubernetes Deployment

Prometheus Stack
```yaml
# prometheus-stack.yaml (Helm values)
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: custom-services
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"

grafana:
  # Injected at deploy time (e.g. via envsubst or --set); avoid committing secrets
  adminPassword: ${GRAFANA_ADMIN_PASSWORD}
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          folder: ''
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: slack
      group_by: ['alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: slack
        slack_configs:
          # Also requires a webhook: global.slack_api_url or api_url per config
          - channel: '#alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
```
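Once the stack is running, it is worth confirming that routing to Slack works before a real incident tests it for you. One option (an illustrative sketch, not part of the chart) is to post a synthetic alert straight to Alertmanager's v2 API:

```python
from datetime import datetime, timedelta, timezone

import requests

# Assumes Alertmanager has been port-forwarded locally, e.g.:
#   kubectl port-forward svc/alertmanager-operated 9093
ALERTMANAGER_URL = "http://localhost:9093"

now = datetime.now(timezone.utc)
test_alert = [{
    "labels": {
        "alertname": "RoutingTest",
        "severity": "critical",      # matches the default route -> slack receiver
    },
    "annotations": {
        "summary": "Test alert to verify Slack routing",
        "description": "Safe to ignore; fired manually.",
    },
    "startsAt": now.isoformat(),
    "endsAt": (now + timedelta(minutes=5)).isoformat(),  # auto-resolves after 5 minutes
}]

resp = requests.post(f"{ALERTMANAGER_URL}/api/v2/alerts", json=test_alert, timeout=10)
resp.raise_for_status()
print("alert accepted:", resp.status_code)
```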
Key Takeaways

- Recording rules: Pre-compute frequently used queries
- Alerting strategy: Use multi-window, multi-burn-rate alerts
- Label cardinality: Keep label cardinality under control
- Retention planning: Size storage based on retention needs
- Dashboard design: Group related metrics, use consistent units
- Service discovery: Leverage Kubernetes SD for dynamic targets
- Histogram buckets: Choose buckets based on the expected latency distribution
- Grafana as code: Version-control dashboard definitions
Prometheus and Grafana provide powerful, flexible monitoring. Proper configuration of recording rules and alerts enables proactive incident detection.