Smart Metering at Scale: Data Architecture for 2.5M+ Customers
How we built a data platform to process smart meter readings for millions of energy customers at Vattenfall. Time-series data, aggregation strategies, and analytics pipelines.
At Vattenfall, we built a data platform to capture, aggregate, and analyze smart meter readings for over 2.5 million energy customers. This system enabled predictive usage analysis, accurate billing estimations, and real-time operational dashboards.
The Smart Metering Challenge
Smart meters generate data at a scale far beyond traditional monthly meter readings:
Data Volume Analysis
Per meter, per day:
- 96 readings (15-minute intervals)
- Multiple data points per reading (consumption, voltage, power factor)
- Metadata (meter status, communication quality)
For 2.5 million meters:
- 240 million readings per day
- 87.6 billion readings per year
- Multi-year retention requirements for billing disputes and analysis
Business Requirements
- Billing accuracy: Meter data must be complete and validated before billing cycles
- Customer portals: Real-time usage visibility for 2.5M+ registered customers
- Predictive analytics: Usage forecasting for capacity planning and customer engagement
- Regulatory compliance: Data retention, audit trails, and reporting requirements
Data Architecture Overview
Our architecture separated concerns across specialized data stores:
Ingestion Layer
Apache Kafka served as the central nervous system:
- Received raw meter data from collection systems
- Buffered during downstream outages
- Enabled multiple consumers with different processing needs
- Provided replay capability for reprocessing
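To make the consuming side concrete, here is a minimal sketch of one consumer group reading normalized readings, assuming kafka-python; the topic, group, and broker names are illustrative, not our actual configuration:

```python
# Minimal sketch: one consumer group reading normalized meter readings.
# Topic, group, and broker names are illustrative.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "meter-readings-normalized",            # hypothetical topic name
    bootstrap_servers=["kafka-1:9092"],
    group_id="validation-pipeline",         # each consumer group keeps its own offsets
    auto_offset_reset="earliest",           # allows full replay for reprocessing
    enable_auto_commit=False,               # commit only after successful processing
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    reading = message.value                 # e.g. {"meter_id": "...", "ts": "...", "kwh": 0.42}
    # ... hand off to the validation and enrichment stages described below ...
    consumer.commit()                       # at-least-once: commit after processing
```

Because each consumer group tracks its own offsets, validation, archiving, and dashboard feeds could all read the same topic independently and be replayed from the beginning when reprocessing was needed.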
Raw Data Storage
Apache Cassandra stored raw meter readings:
- Optimized for time-series write patterns
- Linear scalability for growing meter population
- Tunable consistency (eventual for raw data)
- Time-based data expiration (TTL)
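A sketch of the write path with the DataStax Python driver; the keyspace, table, columns, and the three-year TTL are illustrative rather than the production schema:

```python
# Sketch of the raw-reading write path with a per-write TTL for expiration.
# Keyspace, table, columns, and retention period are illustrative.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("metering")

insert_stmt = session.prepare(
    "INSERT INTO raw_readings (meter_id, day, ts, kwh, voltage, quality) "
    "VALUES (?, ?, ?, ?, ?, ?) USING TTL ?"     # time-based expiration per write
)

def store_reading(r):
    session.execute(insert_stmt, (
        r["meter_id"], r["day"], r["ts"],
        r["kwh"], r["voltage"], r["quality"],
        3 * 365 * 24 * 60 * 60,                 # e.g. three-year retention, in seconds
    ))
```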
Aggregated Data Storage
PostgreSQL housed aggregated and validated data:
- Daily, weekly, monthly rollups
- Complex queries for billing and reporting
- ACID compliance for financial calculations
- Integration with existing business systems
Analytics Layer
ClickHouse powered analytics and dashboards:
- Columnar storage for analytical queries
- Real-time aggregations across dimensions
- Sub-second response for complex queries
- Efficient compression for historical data
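As an example of the kind of query this layer answers, a regional hourly breakdown over the last week might look like the following with clickhouse-driver; the table and column names are illustrative:

```python
# Illustrative dashboard query: hourly consumption per region, last 7 days.
from clickhouse_driver import Client

client = Client("clickhouse-1")

rows = client.execute("""
    SELECT region,
           toStartOfHour(ts) AS hour,
           sum(kwh)          AS total_kwh
    FROM enriched_readings
    WHERE ts >= now() - INTERVAL 7 DAY
    GROUP BY region, hour
    ORDER BY region, hour
""")
```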
Ingestion Pipeline Deep Dive
Getting data from meters to storage involved multiple processing stages:
Stage 1: Collection
Meters communicate via various protocols (DLMS/COSEM, PRIME, OSGP). Collection systems normalize these into a common format before publishing to Kafka.
Stage 2: Validation
Before storage, every reading passed through validation:
Technical validation:
- Timestamp within expected range
- Values within physical limits (no negative consumption)
- No gaps in sequence numbers
Business validation:
- Consumption within historical bounds (detect meter tampering)
- Meter registered and active in customer database
- Communication quality above threshold
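The shape of that validation stage, sketched in Python; the field names, limits, and thresholds here are illustrative rather than the production rules, and sequence-number checks are omitted for brevity:

```python
# Sketch of the validation stage. Field names, limits, and thresholds are
# illustrative assumptions, not the production rules.
from datetime import datetime, timedelta, timezone

MAX_KWH_PER_INTERVAL = 50.0        # hypothetical physical upper bound per 15-minute reading
MIN_COMM_QUALITY = 0.8             # hypothetical link-quality threshold

def validate(reading, active_meters, historical_max):
    errors = []

    # Technical validation (reading["ts"] assumed timezone-aware)
    now = datetime.now(timezone.utc)
    if not (now - timedelta(days=7) <= reading["ts"] <= now + timedelta(minutes=5)):
        errors.append("timestamp outside expected range")
    if reading["kwh"] < 0 or reading["kwh"] > MAX_KWH_PER_INTERVAL:
        errors.append("value outside physical limits")

    # Business validation
    if reading["meter_id"] not in active_meters:
        errors.append("meter not registered or not active")
    if reading["kwh"] > 3 * historical_max.get(reading["meter_id"], MAX_KWH_PER_INTERVAL):
        errors.append("consumption outside historical bounds")  # possible tampering
    if reading["quality"] < MIN_COMM_QUALITY:
        errors.append("communication quality below threshold")

    return errors                   # an empty list means the reading passes
```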
Stage 3: Enrichment
Raw readings were enriched with:
- Customer account information
- Tariff structure for cost calculation
- Geographic data for regional analysis
- Historical baseline for comparison
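A simplified sketch of the enrichment step; the lookup structures and field names are illustrative:

```python
# Sketch of enrichment: attach customer, tariff, and baseline context to a
# validated reading. Lookup structures and field names are illustrative.
def enrich(reading, customers, tariffs, baselines):
    customer = customers[reading["meter_id"]]             # account and region lookup
    tariff = tariffs[customer["tariff_id"]]               # pricing for this contract

    return {
        **reading,
        "account_id": customer["account_id"],
        "region": customer["region"],                     # enables regional aggregates
        "cost": reading["kwh"] * tariff["price_per_kwh"],
        "baseline_kwh": baselines.get(reading["meter_id"]),  # historical comparison value
    }
```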
Stage 4: Storage
Validated, enriched data flowed to multiple destinations:
- Cassandra for raw storage
- Kafka topics for downstream consumers
- Direct path to real-time dashboards
Aggregation Strategy
Raw data alone doesn't serve business needs. Aggregation makes data useful.
Time-Based Rollups
- Hourly aggregates: Sum of 15-minute readings, computed in near-real-time
- Daily aggregates: Computed overnight, validated before customer visibility
- Monthly aggregates: Official billing data, reconciled with customer accounts
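As a sketch, the overnight daily rollup into PostgreSQL looked roughly like this; the table and column names are illustrative, the target table is assumed to have a (meter_id, day) primary key, and the real job also checked validation status before publishing:

```python
# Sketch of the overnight daily rollup. Table and column names are illustrative;
# daily_consumption is assumed to have a (meter_id, day) primary key.
import psycopg2

ROLLUP_SQL = """
    INSERT INTO daily_consumption (meter_id, day, total_kwh, reading_count)
    SELECT meter_id,
           reading_ts::date AS day,
           SUM(kwh)         AS total_kwh,
           COUNT(*)         AS reading_count
    FROM hourly_consumption
    WHERE reading_ts::date = %s
    GROUP BY meter_id, reading_ts::date
    ON CONFLICT (meter_id, day) DO UPDATE
        SET total_kwh = EXCLUDED.total_kwh,
            reading_count = EXCLUDED.reading_count;
"""

def run_daily_rollup(conn, day):
    with conn.cursor() as cur:
        cur.execute(ROLLUP_SQL, (day,))     # idempotent: reruns overwrite the same day
    conn.commit()
```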
Dimension-Based Aggregates
- By geography: Regional consumption for capacity planning
- By customer segment: Residential vs. commercial patterns
- By tariff type: Usage patterns across pricing structures
Aggregation Implementation
We used two approaches:
- Real-time aggregation: Kafka Streams computed running totals for dashboards
- Batch aggregation: Scheduled Spark jobs computed validated aggregates for billing
The key insight: real-time aggregates are approximate; batch aggregates are authoritative. Customers see real-time data with a "provisional" label until batch validation completes.
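The real-time path ran as Kafka Streams topologies on the JVM, which we won't reproduce here; the batch path is easier to sketch. A minimal PySpark version of the daily batch aggregation, with illustrative paths, columns, and output location:

```python
# Sketch of the authoritative daily batch aggregation as a PySpark job.
# Source path, schema, and output location are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-billing-aggregates").getOrCreate()

readings = spark.read.parquet("s3://metering/validated/date=2024-01-15/")   # hypothetical layout

daily = (
    readings
    .withColumn("day", F.to_date("ts"))
    .groupBy("meter_id", "day")
    .agg(
        F.sum("kwh").alias("total_kwh"),
        F.count("*").alias("reading_count"),                                # completeness input
        F.sum(F.when(F.col("estimated"), 1).otherwise(0)).alias("estimated_count"),
    )
)

daily.write.mode("overwrite").parquet("s3://metering/aggregates/daily/")    # then loaded into PostgreSQL
```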
Handling Data Quality Issues
Smart metering data is messy. Our pipeline handled common issues:
Missing Data
Meters go offline. Communication fails. Data gaps are inevitable.
- Detection: Hourly jobs identified missing readings
- Estimation: Interpolation from adjacent readings or historical patterns
- Flagging: Estimated data marked separately from actual readings
- Remediation: Backfill when communication restored
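A sketch of the estimation step for a single meter using pandas; the real pipeline also fell back to historical patterns for longer gaps:

```python
# Sketch of gap estimation by time-based interpolation, with estimated
# values flagged so they never masquerade as actual readings.
import pandas as pd

def estimate_gaps(readings: pd.DataFrame) -> pd.DataFrame:
    """readings: one meter's 'kwh' values indexed by timestamp on a 15-minute grid."""
    full_index = pd.date_range(readings.index.min(), readings.index.max(), freq="15min")
    reindexed = readings.reindex(full_index)

    reindexed["estimated"] = reindexed["kwh"].isna()          # flag gaps before filling
    reindexed["kwh"] = reindexed["kwh"].interpolate(method="time")
    return reindexed
```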
Late-Arriving Data
Data sometimes arrived days after the reading timestamp.
- Handling: Accepted late data up to a configurable threshold
- Reprocessing: Triggered aggregate recalculation for affected periods
- Notification: Alerted billing systems if late data affected invoiced periods
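A sketch of that gatekeeping logic; the 30-day threshold and the two downstream hooks are illustrative placeholders:

```python
# Sketch of late-data gatekeeping. The threshold and the two callbacks
# (reaggregation trigger, billing alert) are illustrative placeholders.
from datetime import datetime, timedelta, timezone

LATE_THRESHOLD = timedelta(days=30)     # hypothetical configurable cutoff

def handle_late_reading(reading, invoiced_through, reaggregate, notify_billing):
    """reaggregate and notify_billing are callbacks into downstream systems;
    reading["ts"] is assumed to be a timezone-aware datetime."""
    age = datetime.now(timezone.utc) - reading["ts"]
    if age > LATE_THRESHOLD:
        return "rejected"                                    # beyond the accepted window
    reaggregate(reading["meter_id"], reading["ts"].date())   # recompute affected periods
    if reading["ts"].date() <= invoiced_through:
        notify_billing(reading)                              # invoiced period affected
    return "accepted"
```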
Incorrect Data
Faulty meters, data corruption, and human error caused incorrect readings.
- Manual corrections: Workflow for customer service to adjust readings
- Audit trail: Complete history of changes with reasons
- Downstream updates: Automated propagation to affected aggregates
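A sketch of what applying a correction looks like, with the audit record written in the same PostgreSQL transaction; the table names are illustrative, and propagation to affected aggregates ran as a separate step:

```python
# Sketch of a manual correction with an audit record, applied atomically.
# Table and column names are illustrative.
def apply_correction(conn, meter_id, day, corrected_kwh, reason, agent_id):
    with conn:                                   # one transaction: commit or roll back together
        with conn.cursor() as cur:
            cur.execute(
                "SELECT total_kwh FROM daily_consumption WHERE meter_id = %s AND day = %s",
                (meter_id, day),
            )
            (old_kwh,) = cur.fetchone()          # previous value for the audit trail
            cur.execute(
                "UPDATE daily_consumption SET total_kwh = %s, corrected = TRUE "
                "WHERE meter_id = %s AND day = %s",
                (corrected_kwh, meter_id, day),
            )
            cur.execute(
                "INSERT INTO reading_corrections "
                "(meter_id, day, old_kwh, new_kwh, reason, agent_id, corrected_at) "
                "VALUES (%s, %s, %s, %s, %s, %s, now())",
                (meter_id, day, old_kwh, corrected_kwh, reason, agent_id),
            )
```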
Performance Optimization
Scale demanded careful optimization at every layer.
Cassandra Optimization
- Partition design: Time-bucketed partitions (meter_id + day)
- Compaction strategy: TimeWindowCompactionStrategy for time-series
- Read path: Bloom filters and partition key caching
- Write path: Batched writes, tuned memtable settings
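Put together, a table definition reflecting these choices might look like this; the keyspace, table, and columns are illustrative:

```python
# Illustrative CQL reflecting the partitioning and compaction choices above.
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("metering")

session.execute("""
    CREATE TABLE IF NOT EXISTS raw_readings (
        meter_id text,
        day      date,
        ts       timestamp,
        kwh      double,
        voltage  double,
        quality  double,
        PRIMARY KEY ((meter_id, day), ts)   -- time-bucketed partition: meter + day
    ) WITH CLUSTERING ORDER BY (ts ASC)
      AND compaction = {
          'class': 'TimeWindowCompactionStrategy',
          'compaction_window_unit': 'DAYS',
          'compaction_window_size': 1
      }
""")
```

Bucketing partitions by meter and day keeps individual partitions small and lets TimeWindowCompactionStrategy compact and expire whole time windows together.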
Query Optimization
- Pre-aggregation: Most common queries served from pre-computed tables
- Materialized views: ClickHouse materialized views for dashboard queries
- Caching: Redis cache for frequently accessed customer data
- Query routing: Separate read replicas for reporting workloads
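For example, a materialized view keeping hourly regional totals up to date could be declared like this; the names are illustrative, and the production views covered more dimensions:

```python
# Illustrative ClickHouse materialized view backing dashboard queries.
from clickhouse_driver import Client

client = Client("clickhouse-1")

client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS hourly_by_region
    ENGINE = SummingMergeTree()
    ORDER BY (region, hour)
    POPULATE
    AS SELECT
        region,
        toStartOfHour(ts) AS hour,
        sum(kwh)          AS total_kwh
    FROM enriched_readings
    GROUP BY region, hour
""")
```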
Resource Management
- Data tiering: Hot data on SSD, warm data on HDD, cold data in object storage
- Auto-scaling: Kafka consumers scaled based on lag metrics
- Cost optimization: Regular review of data retention and compression
Lessons Learned
1. Time-Series Databases Have Trade-offs
Cassandra excelled at writes but struggled with ad-hoc queries. ClickHouse excelled at analytics but wasn't designed for point queries. The hybrid approach served different access patterns optimally.
2. Aggregation is Key to Query Performance
Nobody queries 87.6 billion rows. Pre-aggregated data at multiple granularities enabled interactive dashboards and reports.
3. Data Quality Pipelines Are Essential
Garbage in, garbage out. Investing in validation, estimation, and correction workflows paid dividends in billing accuracy and customer trust.
4. Plan for Data Corrections and Reprocessing
Requirements change. Bugs happen. Design systems that can reprocess historical data without downtime.
5. Monitoring Is Non-Negotiable
With billions of readings, problems hide in the noise. Comprehensive monitoring caught issues before they impacted customers.
Results Achieved
After implementation:
- Sub-second query response for customer portal usage displays
- 99.8% data completeness through validation and estimation pipelines
- Predictive accuracy within 5% for usage forecasting
- Billing disputes reduced 40% through improved data quality
- Analytics dashboards used daily by operations and customer service