Smart Metering at Scale: Data Architecture for 2.5M+ Customers
How we built a data platform to process smart meter readings for millions of energy customers at Vattenfall. Time-series data, aggregation strategies, and analytics pipelines.
At Vattenfall, we built a data platform to capture, aggregate, and analyze smart meter readings for over 2.5 million energy customers. This system enabled predictive usage analysis, accurate billing estimations, and real-time operational dashboards.
The Smart Metering Challenge
Smart meters generate data at a scale far beyond traditional monthly meter readings:
Data Volume Analysis
Per meter, per day:
- 96 readings (15-minute intervals)
- Multiple data points per reading (consumption, voltage, power factor)
- Metadata (meter status, communication quality)
For 2.5 million meters:
- 240 million readings per day
- 87.6 billion readings per year
- Multi-year retention requirements for billing disputes and analysis
Business Requirements
- Billing accuracy: Meter data must be complete and validated before billing cycles
- Customer portals: Real-time usage visibility for 2.5M+ registered customers
- Predictive analytics: Usage forecasting for capacity planning and customer engagement
- Regulatory compliance: Data retention, audit trails, and reporting requirements
Data Architecture Overview
Our architecture separated concerns across specialized data stores:
Ingestion Layer
Apache Kafka served as the central nervous system:
- Received raw meter data from collection systems
- Buffered during downstream outages
- Enabled multiple consumers with different processing needs
- Provided replay capability for reprocessing
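To make the consuming side concrete, here is a minimal sketch of one consumer group reading normalized readings, assuming kafka-python; the topic, group, and broker names are illustrative, not our actual configuration:

```python
# Minimal sketch: one consumer group reading normalized meter readings.
# Topic, group, and broker names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "meter-readings-normalized",            # hypothetical topic name
    bootstrap_servers=["kafka-1:9092"],
    group_id="validation-pipeline",         # each consumer group keeps its own offsets
    auto_offset_reset="earliest",           # allows full replay for reprocessing
    enable_auto_commit=False,               # commit only after successful processing
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    reading = message.value                 # e.g. {"meter_id": "...", "ts": "...", "kwh": 0.42}
    # ... hand off to the validation and enrichment stages described below ...
    consumer.commit()                       # at-least-once: commit after processing
```

Because each consumer group tracks its own offsets, validation, archiving, and dashboard feeds could all read the same topic independently and be replayed from the beginning when reprocessing was needed.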
Raw Data Storage
Apache Cassandra stored raw meter readings:
- Optimized for time-series write patterns
- Linear scalability for growing meter population
- Tunable consistency (eventual for raw data)
- Time-based data expiration (TTL)
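A sketch of the write path with the DataStax Python driver; the keyspace, table, columns, and the three-year TTL are illustrative rather than the production schema:

```python
# Sketch of the raw-reading write path with a per-write TTL for expiration.
# Keyspace, table, columns, and retention period are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("metering")

insert_stmt = session.prepare(
    "INSERT INTO raw_readings (meter_id, day, ts, kwh, voltage, quality) "
    "VALUES (?, ?, ?, ?, ?, ?) USING TTL ?"     # time-based expiration per write
)

def store_reading(r):
    session.execute(insert_stmt, (
        r["meter_id"], r["day"], r["ts"],
        r["kwh"], r["voltage"], r["quality"],
        3 * 365 * 24 * 60 * 60,                 # e.g. three-year retention, in seconds
    ))
```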
Aggregated Data Storage
PostgreSQL housed aggregated and validated data:
- Daily, weekly, monthly rollups
- Complex queries for billing and reporting
- ACID compliance for financial calculations
- Integration with existing business systems
Analytics Layer
ClickHouse powered analytics and dashboards:
- Columnar storage for analytical queries
- Real-time aggregations across dimensions
- Sub-second response for complex queries
- Efficient compression for historical data
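As an example of the kind of query this layer answers, a regional hourly breakdown over the last week might look like the following with clickhouse-driver; the table and column names are illustrative:

```python
# Illustrative dashboard query: hourly consumption per region, last 7 days.
from clickhouse_driver import Client

client = Client("clickhouse-1")

rows = client.execute("""
    SELECT region,
           toStartOfHour(ts) AS hour,
           sum(kwh)          AS total_kwh
    FROM enriched_readings
    WHERE ts >= now() - INTERVAL 7 DAY
    GROUP BY region, hour
    ORDER BY region, hour
""")
```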
Ingestion Pipeline Deep Dive
Getting data from meters to storage involved multiple processing stages:
Stage 1: Collection
Meters communicate via various protocols (DLMS/COSEM, PRIME, OSGP). Collection systems normalize these into a common format before publishing to Kafka.
Stage 2: Validation
Before storage, every reading passed through validation:
Technical validation:
- Timestamp within expected range
- Values within physical limits (no negative consumption)
- No gaps in sequence numbers
Business validation:
- Consumption within historical bounds (detect meter tampering)
- Meter registered and active in customer database
- Communication quality above threshold
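The shape of that validation stage, sketched in Python; the field names, limits, and thresholds here are illustrative rather than the production rules, and sequence-number checks are omitted for brevity:

```python
# Sketch of the validation stage. Field names, limits, and thresholds are
# illustrative assumptions, not the production rules.
from datetime import datetime, timedelta, timezone

MAX_KWH_PER_INTERVAL = 50.0        # hypothetical physical upper bound per 15-minute reading
MIN_COMM_QUALITY = 0.8             # hypothetical link-quality threshold

def validate(reading, active_meters, historical_max):
    errors = []

    # Technical validation (reading["ts"] assumed timezone-aware)
    now = datetime.now(timezone.utc)
    if not (now - timedelta(days=7) <= reading["ts"] <= now + timedelta(minutes=5)):
        errors.append("timestamp outside expected range")
    if reading["kwh"] < 0 or reading["kwh"] > MAX_KWH_PER_INTERVAL:
        errors.append("value outside physical limits")

    # Business validation
    if reading["meter_id"] not in active_meters:
        errors.append("meter not registered or not active")
    if reading["kwh"] > 3 * historical_max.get(reading["meter_id"], MAX_KWH_PER_INTERVAL):
        errors.append("consumption outside historical bounds")  # possible tampering
    if reading["quality"] < MIN_COMM_QUALITY:
        errors.append("communication quality below threshold")

    return errors                   # an empty list means the reading passes
```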
Stage 3: Enrichment
Raw readings were enriched with:
- Customer account information
- Tariff structure for cost calculation
- Geographic data for regional analysis
- Historical baseline for comparison
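A simplified sketch of the enrichment step; the lookup structures and field names are illustrative:

```python
# Sketch of enrichment: attach customer, tariff, and baseline context to a
# validated reading. Lookup structures and field names are illustrative.
def enrich(reading, customers, tariffs, baselines):
    customer = customers[reading["meter_id"]]             # account and region lookup
    tariff = tariffs[customer["tariff_id"]]               # pricing for this contract

    return {
        **reading,
        "account_id": customer["account_id"],
        "region": customer["region"],                     # enables regional aggregates
        "cost": reading["kwh"] * tariff["price_per_kwh"],
        "baseline_kwh": baselines.get(reading["meter_id"]),  # historical comparison value
    }
```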
Stage 4: Storage
Validated, enriched data flowed to multiple destinations:
- Cassandra for raw storage
- Kafka topics for downstream consumers
- Direct path to real-time dashboards
Aggregation Strategy
Raw data alone doesn't serve business needs. Aggregation makes data useful.
Time-Based Rollups
- Hourly aggregates: Sum of 15-minute readings, computed in near-real-time
- Daily aggregates: Computed overnight, validated before customer visibility
- Monthly aggregates: Official billing data, reconciled with customer accounts
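As a sketch, the overnight daily rollup into PostgreSQL looked roughly like this; the table and column names are illustrative, the target table is assumed to have a (meter_id, day) primary key, and the real job also checked validation status before publishing:

```python
# Sketch of the overnight daily rollup. Table and column names are illustrative;
# daily_consumption is assumed to have a (meter_id, day) primary key.
import psycopg2

ROLLUP_SQL = """
    INSERT INTO daily_consumption (meter_id, day, total_kwh, reading_count)
    SELECT meter_id,
           reading_ts::date AS day,
           SUM(kwh)         AS total_kwh,
           COUNT(*)         AS reading_count
    FROM hourly_consumption
    WHERE reading_ts::date = %s
    GROUP BY meter_id, reading_ts::date
    ON CONFLICT (meter_id, day) DO UPDATE
        SET total_kwh = EXCLUDED.total_kwh,
            reading_count = EXCLUDED.reading_count;
"""

def run_daily_rollup(conn, day):
    with conn.cursor() as cur:
        cur.execute(ROLLUP_SQL, (day,))     # idempotent: reruns overwrite the same day
    conn.commit()
```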
Dimension-Based Aggregates
- By geography: Regional consumption for capacity planning
- By customer segment: Residential vs. commercial patterns
- By tariff type: Usage patterns across pricing structures
Aggregation Implementation
We used two approaches:
- Real-time aggregation: Kafka Streams computed running totals for dashboards
- Batch aggregation: Scheduled Spark jobs computed validated aggregates for billing
The key insight: real-time aggregates are approximate; batch aggregates are authoritative. Customers see real-time data with a "provisional" label until batch validation completes.
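The real-time path ran as Kafka Streams topologies on the JVM, which we won't reproduce here; the batch path is easier to sketch. A minimal PySpark version of the daily batch aggregation, with illustrative paths, columns, and output location:

```python
# Sketch of the authoritative daily batch aggregation as a PySpark job.
# Source path, schema, and output location are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-billing-aggregates").getOrCreate()

readings = spark.read.parquet("s3://metering/validated/date=2024-01-15/")   # hypothetical layout

daily = (
    readings
    .withColumn("day", F.to_date("ts"))
    .groupBy("meter_id", "day")
    .agg(
        F.sum("kwh").alias("total_kwh"),
        F.count("*").alias("reading_count"),                                # completeness input
        F.sum(F.when(F.col("estimated"), 1).otherwise(0)).alias("estimated_count"),
    )
)

daily.write.mode("overwrite").parquet("s3://metering/aggregates/daily/")    # then loaded into PostgreSQL
```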
Handling Data Quality Issues
Smart metering data is messy. Our pipeline handled common issues:
Missing Data
Meters go offline. Communication fails. Data gaps are inevitable.
- Detection: Hourly jobs identified missing readings
- Estimation: Interpolation from adjacent readings or historical patterns
- Flagging: Estimated data marked separately from actual readings
- Remediation: Backfill when communication restored
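A sketch of the estimation step for a single meter using pandas; the real pipeline also fell back to historical patterns for longer gaps:

```python
# Sketch of gap estimation by time-based interpolation, with estimated
# values flagged so they never masquerade as actual readings.
import pandas as pd

def estimate_gaps(readings: pd.DataFrame) -> pd.DataFrame:
    """readings: one meter's 'kwh' values indexed by timestamp on a 15-minute grid."""
    full_index = pd.date_range(readings.index.min(), readings.index.max(), freq="15min")
    reindexed = readings.reindex(full_index)

    reindexed["estimated"] = reindexed["kwh"].isna()          # flag gaps before filling
    reindexed["kwh"] = reindexed["kwh"].interpolate(method="time")
    return reindexed
```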
Late-Arriving Data
Data sometimes arrived days after the reading timestamp.
- Handling: Accepted late data up to a configurable threshold
- Reprocessing: Triggered aggregate recalculation for affected periods
- Notification: Alerted billing systems if late data affected invoiced periods
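A sketch of that gatekeeping logic; the 30-day threshold and the two downstream hooks are illustrative placeholders:

```python
# Sketch of late-data gatekeeping. The threshold and the two callbacks
# (reaggregation trigger, billing alert) are illustrative placeholders.
from datetime import datetime, timedelta, timezone

LATE_THRESHOLD = timedelta(days=30)     # hypothetical configurable cutoff

def handle_late_reading(reading, invoiced_through, reaggregate, notify_billing):
    """reaggregate and notify_billing are callbacks into downstream systems;
    reading["ts"] is assumed to be a timezone-aware datetime."""
    age = datetime.now(timezone.utc) - reading["ts"]
    if age > LATE_THRESHOLD:
        return "rejected"                                    # beyond the accepted window
    reaggregate(reading["meter_id"], reading["ts"].date())   # recompute affected periods
    if reading["ts"].date() <= invoiced_through:
        notify_billing(reading)                              # invoiced period affected
    return "accepted"
```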
Incorrect Data
Faulty meters, data corruption, and human error caused incorrect readings.
- Manual corrections: Workflow for customer service to adjust readings
- Audit trail: Complete history of changes with reasons
- Downstream updates: Automated propagation to affected aggregates
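A sketch of what applying a correction looks like, with the audit record written in the same PostgreSQL transaction; the table names are illustrative, and propagation to affected aggregates ran as a separate step:

```python
# Sketch of a manual correction with an audit record, applied atomically.
# Table and column names are illustrative.
def apply_correction(conn, meter_id, day, corrected_kwh, reason, agent_id):
    with conn:                                   # one transaction: commit or roll back together
        with conn.cursor() as cur:
            cur.execute(
                "SELECT total_kwh FROM daily_consumption WHERE meter_id = %s AND day = %s",
                (meter_id, day),
            )
            (old_kwh,) = cur.fetchone()          # previous value for the audit trail
            cur.execute(
                "UPDATE daily_consumption SET total_kwh = %s, corrected = TRUE "
                "WHERE meter_id = %s AND day = %s",
                (corrected_kwh, meter_id, day),
            )
            cur.execute(
                "INSERT INTO reading_corrections "
                "(meter_id, day, old_kwh, new_kwh, reason, agent_id, corrected_at) "
                "VALUES (%s, %s, %s, %s, %s, %s, now())",
                (meter_id, day, old_kwh, corrected_kwh, reason, agent_id),
            )
```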
Performance Optimization
Scale demanded careful optimization at every layer.
Cassandra Optimization
- Partition design: Time-bucketed partitions (meter_id + day)
- Compaction strategy: TimeWindowCompactionStrategy for time-series
- Read path: Bloom filters and partition key caching
- Write path: Batched writes, tuned memtable settings
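Put together, a table definition reflecting these choices might look like this; the keyspace, table, and columns are illustrative:

```python
# Illustrative CQL reflecting the partitioning and compaction choices above.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("metering")

session.execute("""
    CREATE TABLE IF NOT EXISTS raw_readings (
        meter_id text,
        day      date,
        ts       timestamp,
        kwh      double,
        voltage  double,
        quality  double,
        PRIMARY KEY ((meter_id, day), ts)   -- time-bucketed partition: meter + day
    ) WITH CLUSTERING ORDER BY (ts ASC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': 1
      }
""")
```

Bucketing partitions by meter and day keeps individual partitions small and lets TimeWindowCompactionStrategy compact and expire whole time windows together.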
Query Optimization
- Pre-aggregation: Most common queries served from pre-computed tables
- Materialized views: ClickHouse materialized views for dashboard queries
- Caching: Redis cache for frequently accessed customer data
- Query routing: Separate read replicas for reporting workloads
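For example, a materialized view keeping hourly regional totals up to date could be declared like this; the names are illustrative, and the production views covered more dimensions:

```python
# Illustrative ClickHouse materialized view backing dashboard queries.
from clickhouse_driver import Client

client = Client("clickhouse-1")

client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS hourly_by_region
    ENGINE = SummingMergeTree()
    ORDER BY (region, hour)
    POPULATE
    AS SELECT
        region,
        toStartOfHour(ts) AS hour,
        sum(kwh)          AS total_kwh
    FROM enriched_readings
    GROUP BY region, hour
""")
```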
Resource Management
- Data tiering: Hot data on SSD, warm data on HDD, cold data in object storage
- Auto-scaling: Kafka consumers scaled based on lag metrics
- Cost optimization: Regular review of data retention and compression
Lessons Learned
1. Time-Series Databases Have Trade-offs
Cassandra excelled at writes but struggled with ad-hoc queries. ClickHouse excelled at analytics but wasn't designed for point queries. The hybrid approach served different access patterns optimally.
2. Aggregation is Key to Query Performance
Nobody queries 87.6 billion rows. Pre-aggregated data at multiple granularities enabled interactive dashboards and reports.
3. Data Quality Pipelines Are Essential
Garbage in, garbage out. Investing in validation, estimation, and correction workflows paid dividends in billing accuracy and customer trust.
4. Plan for Data Corrections and Reprocessing
Requirements change. Bugs happen. Design systems that can reprocess historical data without downtime.
5. Monitoring Is Non-Negotiable
With billions of readings, problems hide in the noise. Comprehensive monitoring caught issues before they impacted customers.
Results Achieved
After implementation:
- Sub-second query response for customer portal usage displays
- 99.8% data completeness through validation and estimation pipelines
- Predictive accuracy within 5% for usage forecasting
- Billing disputes reduced 40% through improved data quality
- Analytics dashboards used daily by operations and customer service