5 July 2025 · 14 min read

Prometheus and Grafana: Metrics and Visualisation

Prometheus · Grafana · Monitoring · Observability

Building monitoring infrastructure with Prometheus and Grafana. PromQL queries, alerting rules, dashboard design, and Kubernetes integration.



Prometheus and Grafana form the foundation of modern monitoring infrastructure. Prometheus collects and stores metrics, while Grafana provides powerful visualisation and alerting. Together, they enable comprehensive observability for cloud-native applications.

Prometheus Architecture

Core Components

Prometheus Architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Prometheus Server                        │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐  │
│  │    Retrieval    │  │      TSDB       │  │   HTTP      │  │
│  │   (Scraping)    │──│  (Time Series   │──│   Server    │  │
│  │                 │  │    Database)    │  │  (PromQL)   │  │
│  └─────────────────┘  └─────────────────┘  └─────────────┘  │
└───────────────┬─────────────────┬─────────────────┬─────────┘
                │                 │                 │
        ┌───────▼───────┐ ┌──────▼──────┐  ┌───────▼───────┐
        │  Targets      │ │ Alertmanager│  │   Grafana     │
        │  (Apps, K8s)  │ │             │  │               │
        └───────────────┘ └─────────────┘  └───────────────┘
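For experimenting locally, the whole picture above can be stood up with a handful of containers. The following is a minimal sketch, assuming the public prom/prometheus, prom/alertmanager, and grafana/grafana images and a prometheus.yml like the one shown later in this post; the ports, file paths, and admin password are illustrative.

# docker-compose.yml -- minimal local stack (sketch; paths and ports illustrative)
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  alertmanager:
    image: prom/alertmanager
    ports:
      - "9093:9093"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # demo-only password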

Metric Types

Type        Description                                      Example
Counter     Monotonically increasing value                   http_requests_total
Gauge       Value that can go up or down                     temperature_celsius
Histogram   Observations counted into configurable buckets   request_duration_seconds
Summary     Quantiles calculated on the client side          request_latency
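As a rough sketch of how these types are produced by an application, here is what instrumentation might look like in Go with the official client_golang library. The metric names mirror the table; the labels, buckets, objectives, and port are illustrative choices rather than part of any particular service.

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter: only ever goes up (resets on process restart).
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests.",
	}, []string{"status_code", "endpoint"})

	// Gauge: can go up or down.
	temperature = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "temperature_celsius",
		Help: "Current temperature.",
	})

	// Histogram: observations counted into buckets; pick buckets to match
	// the latency distribution you expect.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "Request duration in seconds.",
		Buckets: []float64{0.1, 0.25, 0.5, 1, 2, 5},
	})

	// Summary: quantiles calculated on the client.
	requestLatency = promauto.NewSummary(prometheus.SummaryOpts{
		Name:       "request_latency",
		Help:       "Request latency with client-side quantiles.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
)

func main() {
	// Expose all registered metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}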

Prometheus Configuration

Basic Setup

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production
    environment: prod

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  # Prometheus self-monitoring
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  # Application endpoints
  - job_name: api-service
    static_configs:
      - targets: ['api-service:8080']
    metrics_path: /metrics
    scheme: http

  # Kubernetes service discovery
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Node Exporter (scraped through the API server proxy, which requires HTTPS
  # and the in-cluster service account credentials)
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
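The kubernetes-pods job only keeps pods that opt in through annotations. A workload exposing metrics would carry something like the following in its pod template; the annotation names are exactly the ones the relabel rules above read, while the image and port are placeholders.

# Pod template annotations matching the relabel_configs above (sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
spec:
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
    spec:
      containers:
        - name: api
          image: example/api-service:latest   # placeholder image
          ports:
            - containerPort: 8080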

Recording Rules

# recording-rules.yml
groups:
  - name: api-service-rules
    interval: 30s
    rules:
      # Request rate
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Error rate
      - record: job:http_requests:error_rate5m
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (job)
          /
          sum(rate(http_requests_total[5m])) by (job)

      # P99 latency
      - record: job:http_request_duration:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)
          )

      # Apdex score (T=0.5s)
      - record: job:apdex_score
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (job)
            +
            sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m])) by (job)
          ) / 2
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (job)
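Once evaluated, recorded series are queried by their rule name like any other metric, which is what keeps dashboards and alert expressions cheap. For example (the job label value is illustrative):

# Use the pre-computed series directly in panels and alert expressions
job:http_requests:rate5m{job="api-service"}
job:http_request_duration:p99{job="api-service"} > 1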

Alerting Rules

# alerting-rules.yml
groups:
  - name: api-service-alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_requests:error_rate5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: job:http_request_duration:p99 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: "P99 latency is {{ $value | humanizeDuration }} for {{ $labels.job }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Service is down
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighMemoryUsage
        expr: |
          container_memory_usage_bytes
          /
          container_spec_memory_limit_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Container memory usage is high
          description: "{{ $labels.container }} is using {{ $value | humanizePercentage }} of memory limit"

  - name: slo-alerts
    rules:
      - alert: SLOBurnRateCritical
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: SLO burn rate critical
          description: "Error budget is being consumed at 14.4x normal rate"
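SLOBurnRateCritical catches fast burns; a multi-window, multi-burn-rate setup usually pairs it with a slower, lower-severity window. A sketch of such a companion rule, to sit under the same slo-alerts group, is below. The 6x-over-6h threshold is the conventional "ticket" burn rate from SRE practice, not something defined elsewhere in this post.

      - alert: SLOBurnRateHigh
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels:
          severity: warning
          slo: availability
        annotations:
          summary: SLO burn rate elevated
          description: "Error budget is being consumed at 6x the normal rate"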

PromQL Queries

Common Patterns

# Request rate per second
rate(http_requests_total[5m])

# Error rate percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# P50, P90, P99 latency
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Average latency
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])

# Requests by status code
sum by (status_code) (rate(http_requests_total[5m]))

# Top 5 endpoints by request rate
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

# Memory usage percentage
container_memory_usage_bytes / container_spec_memory_limit_bytes * 100

# CPU usage
rate(container_cpu_usage_seconds_total[5m]) * 100

# Network I/O
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)

# Disk I/O
rate(node_disk_read_bytes_total[5m])
rate(node_disk_written_bytes_total[5m])
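The same queries can be run programmatically against the Prometheus HTTP API. Here is a minimal sketch in Go using the client_golang API package; the server address and the query are illustrative.

package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Point the client at the Prometheus server (address is a placeholder).
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query: request rate per job over the last 5 minutes.
	result, warnings, err := promAPI.Query(ctx, `sum(rate(http_requests_total[5m])) by (job)`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}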

Grafana Dashboards

Dashboard JSON Model

{ "dashboard": { "title": "API Service Dashboard", "tags": ["api", "production"], "timezone": "browser", "refresh": "30s", "panels": [ { "title": "Request Rate", "type": "timeseries", "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }, "targets": [ { "expr": "sum(rate(http_requests_total{job=\"api-service\"}[5m])) by (status_code)", "legendFormat": "{{status_code}}" } ], "fieldConfig": { "defaults": { "unit": "reqps" } } }, { "title": "Latency Distribution", "type": "heatmap", "gridPos": { "x": 12, "y": 0, "w": 12, "h": 8 }, "targets": [ { "expr": "sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le)", "format": "heatmap" } ] }, { "title": "Error Rate", "type": "stat", "gridPos": { "x": 0, "y": 8, "w": 6, "h": 4 }, "targets": [ { "expr": "sum(rate(http_requests_total{job=\"api-service\",status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{job=\"api-service\"}[5m])) * 100" } ], "fieldConfig": { "defaults": { "unit": "percent", "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 1 }, { "color": "red", "value": 5 } ] } } } }, { "title": "P99 Latency", "type": "stat", "gridPos": { "x": 6, "y": 8, "w": 6, "h": 4 }, "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job=\"api-service\"}[5m])) by (le))" } ], "fieldConfig": { "defaults": { "unit": "s", "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 0.5 }, { "color": "red", "value": 1 } ] } } } } ] } }

Terraform Grafana Configuration

# grafana.tf
resource "grafana_dashboard" "api_service" {
  config_json = templatefile("${path.module}/dashboards/api-service.json", {
    datasource = grafana_data_source.prometheus.uid
  })
  folder = grafana_folder.services.id
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = "http://prometheus:9090"

  json_data_encoded = jsonencode({
    httpMethod   = "POST"
    timeInterval = "15s"
  })
}

resource "grafana_folder" "services" {
  title = "Services"
}

resource "grafana_alert_rule_group" "api_service" {
  name             = "api-service-alerts"
  folder_uid       = grafana_folder.services.uid
  interval_seconds = 60

  rule {
    name      = "High Error Rate"
    condition = "C"

    data {
      ref_id = "A"
      relative_time_range {
        from = 300
        to   = 0
      }
      datasource_uid = grafana_data_source.prometheus.uid
      model = jsonencode({
        expr = "sum(rate(http_requests_total{status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))"
      })
    }

    data {
      ref_id = "C"
      relative_time_range {
        from = 0
        to   = 0
      }
      datasource_uid = "__expr__"
      model = jsonencode({
        type = "threshold"
        conditions = [{
          evaluator = {
            type   = "gt"
            params = [0.05]
          }
        }]
      })
    }

    annotations = {
      summary     = "High error rate on API service"
      description = "Error rate has exceeded 5% threshold"
    }

    labels = {
      severity = "critical"
    }
  }
}
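These resources assume the Grafana Terraform provider is already configured. A minimal sketch is below; the URL and token variable are placeholders, and the version constraint is indicative rather than prescriptive.

terraform {
  required_providers {
    grafana = {
      source  = "grafana/grafana"
      version = "~> 2.0"
    }
  }
}

provider "grafana" {
  url  = "https://grafana.example.com"   # placeholder URL
  auth = var.grafana_api_token           # API key or service account token
}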

Kubernetes Deployment

Prometheus Stack

# prometheus-stack.yaml (Helm values)
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: 2000m
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    additionalScrapeConfigs:
      - job_name: custom-services
        kubernetes_sd_configs:
          - role: service
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true

grafana:
  adminPassword: ${GRAFANA_ADMIN_PASSWORD}
  persistence:
    enabled: true
    size: 10Gi
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          folder: ''
          type: file
          disableDeletion: false
          options:
            path: /var/lib/grafana/dashboards

alertmanager:
  config:
    route:
      receiver: slack
      group_by: ['alertname', 'severity']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: slack
        slack_configs:
          - channel: '#alerts'
            send_resolved: true
            title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
            text: '{{ .CommonAnnotations.description }}'
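These values follow the layout of the prometheus-community kube-prometheus-stack chart. If that is the chart in use, installing them looks roughly like this; the release and namespace names are illustrative.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f prometheus-stack.yaml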

Key Takeaways

  1. Recording rules: Pre-compute frequently used queries

  2. Alerting strategy: Use multi-window, multi-burn-rate alerts

  3. Label cardinality: Avoid unbounded label values (user IDs, request IDs) that explode the number of series

  4. Retention planning: Size storage from the retention period and expected ingestion rate

  5. Dashboard design: Group related metrics, use consistent units

  6. Service discovery: Leverage Kubernetes SD for dynamic targets

  7. Histogram buckets: Choose buckets based on expected latency distribution

  8. Grafana as code: Version control dashboard definitions

Prometheus and Grafana provide powerful, flexible monitoring. Proper configuration of recording rules and alerts enables proactive incident detection.
