15 May 2024 · 14 min read
E-Commerce Scalability: Handling 10x Traffic Spikes
E-Commerce · Scalability · Retail · Architecture
Lessons from building retail platforms that handle holiday traffic surges. Caching strategies, database optimization, and capacity planning.
Retail systems face unique scalability challenges—traffic can spike 10x or more during sales events, holidays, and flash promotions. At Interflora, Valentine's Day and Mother's Day meant preparing for traffic surges that dwarfed our baseline. Here's how we achieved 99.95% uptime during peak shopping periods.
Understanding Retail Traffic Patterns
The Reality of Spikes
| Event | Traffic Multiplier | Duration |
|---|---|---|
| Flash sale announcement | 5-10x | 30-60 minutes |
| Holiday (Valentine's, Mother's Day) | 8-15x | 2-3 days |
| Black Friday/Cyber Monday | 10-20x | 4-5 days |
| TV advertisement | 3-5x | 15-30 minutes |
The Cascade Effect
When one component slows, everything suffers:
Normal: User → CDN → App → DB → Response (200ms)
Under load:
User → CDN → App (waiting) → DB (saturated) → Timeout
↓
Connection pool exhausted
↓
New requests queued
↓
Cascade failure
Capacity Planning
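One practical defense against this cascade is failing fast rather than queueing: bound how long any request may wait on a saturated downstream resource. A minimal, driver-agnostic sketch (the `withTimeout` helper and the 250 ms deadline are illustrative, not from our stack):

```typescript
// Reject a slow acquire instead of queueing indefinitely, so a
// saturated database produces a quick error rather than a pile-up.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

// Usage sketch: wrap a pool acquire so callers fail fast under load.
// const client = await withTimeout(pool.connect(), 250);
```

Wrapping the acquire this way converts a pile-up into a quick, retryable error that upstream layers can handle instead of an exhausted connection pool.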
Baseline Measurement
Before you can plan for 10x, you need to know your 1x:
Key baseline metrics:
- Average requests per second (RPS)
- Peak RPS (daily, weekly patterns)
- Database queries per request
- Cache hit ratio
- Average response time by endpoint
- Error rate baseline
Capacity Model
Peak Planning Formula:
Required capacity = Baseline peak × Expected multiplier × Safety margin
Example:
- Normal peak: 500 RPS
- Black Friday multiplier: 15x
- Safety margin: 1.5x
- Required capacity: 500 × 15 × 1.5 = 11,250 RPS
Load Testing Strategy
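The stage targets used in load tests come straight from this formula; a tiny helper (a sketch, with illustrative names) keeps the arithmetic consistent across capacity reviews:

```typescript
// Capacity planning helper (sketch): baseline peak RPS times the
// expected event multiplier, padded by a safety margin.
function requiredCapacity(
  baselinePeakRps: number,
  eventMultiplier: number,
  safetyMargin = 1.5
): number {
  return Math.ceil(baselinePeakRps * eventMultiplier * safetyMargin);
}

// Black Friday example from above: 500 × 15 × 1.5
const target = requiredCapacity(500, 15); // 11250 RPS
```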
// k6 load test script example (run with: k6 run loadtest.js)
import http from 'k6/http';

export const options = {
  stages: [
    { duration: '2m', target: 100 },   // Warm up
    { duration: '5m', target: 500 },   // Normal load
    { duration: '2m', target: 2500 },  // Ramp to 5x
    { duration: '5m', target: 2500 },  // Hold at 5x
    { duration: '2m', target: 5000 },  // Ramp to 10x
    { duration: '10m', target: 5000 }, // Hold at 10x
    { duration: '2m', target: 7500 },  // Push to 15x
    { duration: '5m', target: 7500 },  // Breaking point test
  ],
};

export default function () {
  http.get('https://shop.example.com/');
}
Caching Architecture
Multi-Layer Caching
Layer 1: CDN (Cloudflare/CloudFront)
├── Static assets (images, CSS, JS)
├── Product images
└── API responses (with proper cache headers)
Layer 2: Application Cache (Redis)
├── Session data
├── User cart state
├── Product catalog
└── Inventory counts (with short TTL)
Layer 3: Database Query Cache
├── Prepared statement cache
└── Query result cache
Cache-First Architecture
async function getProduct(productId: string): Promise<Product | null> {
  // Layer 1: Memory cache (hot items)
  const memCached = memoryCache.get(productId);
  if (memCached) return memCached;

  // Layer 2: Redis
  const redisCached = await redis.get(`product:${productId}`);
  if (redisCached) {
    const product = JSON.parse(redisCached);
    memoryCache.set(productId, product, 60); // 60 second local cache
    return product;
  }

  // Layer 3: Database (with cache population)
  const product = await db.products.findById(productId);
  if (product) {
    await redis.setex(`product:${productId}`, 300, JSON.stringify(product));
    memoryCache.set(productId, product, 60);
  }
  return product;
}
Cache Invalidation for E-Commerce
// Inventory updates need careful invalidation
async function updateInventory(productId: string, delta: number): Promise<void> {
  // Look up the product first; we need its category to purge listing caches
  const product = await db.products.findById(productId);

  // Update database
  await db.inventory.decrement(productId, delta);

  // Invalidate product cache
  await redis.del(`product:${productId}`);

  // Publish event for CDN purge
  await events.publish('inventory-change', {
    productId,
    requiresCdnPurge: true
  });

  // For flash sales: invalidate listing caches
  await redis.del('featured-products');
  await redis.del(`category:${product.categoryId}:products`);
}
Database Optimization
Read Replica Strategy
// Route reads to replicas, writes to primary
const readPool = new Pool({
  host: 'replica.db.example.com',
  max: 100,
  idleTimeoutMillis: 30000
});

const writePool = new Pool({
  host: 'primary.db.example.com',
  max: 20,
  idleTimeoutMillis: 30000
});

async function getProducts(categoryId: string): Promise<Product[]> {
  // Read from replica; pg returns a result object, so unwrap .rows
  const { rows } = await readPool.query(
    'SELECT * FROM products WHERE category_id = $1',
    [categoryId]
  );
  return rows;
}

async function createOrder(order: Order): Promise<Order> {
  // Write to primary; RETURNING * gives us the inserted row
  const { rows } = await writePool.query(
    'INSERT INTO orders (user_id, items, total) VALUES ($1, $2, $3) RETURNING *',
    [order.userId, order.items, order.total]
  );
  return rows[0];
}
Connection Pool Tuning
Connection pool sizing:
- Too small: Requests wait for connections
- Too large: Database overwhelmed
Formula:
connections = (core_count * 2) + effective_spindle_count
For cloud databases:
- Start with 20 connections per application instance
- Monitor wait time and adjust
- Consider PgBouncer for connection pooling at scale
Query Optimization for Spikes
-- BEFORE: Full table scan during traffic spike
SELECT * FROM products
WHERE category_id = $1
ORDER BY created_at DESC
LIMIT 20;
-- AFTER: Indexed query with covering index
CREATE INDEX idx_products_category_created
ON products (category_id, created_at DESC)
INCLUDE (name, price, image_url);
-- Result: Query time 200ms → 2ms
Auto-Scaling Configuration
Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ecommerce-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ecommerce-api
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
Pre-Scaling for Known Events
# Scale up before Valentine's Day traffic
kubectl scale deployment ecommerce-api --replicas=50
# Or use scheduled scaling (Kubernetes has no built-in cron HPA; this
# needs a third-party controller such as kubernetes-cronhpa-controller)
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
spec:
  scaleTargetRef:
    name: ecommerce-api
  jobs:
    - name: valentines-prescale
      schedule: "0 0 6 14 2 *" # 6 AM on Feb 14 (seconds-first cron)
      targetSize: 100
Graceful Degradation
Feature Flags for Load Shedding
const loadSheddingConfig = {
  level0: { // Normal
    recommendations: true,
    reviews: true,
    relatedProducts: true,
    searchSuggestions: true
  },
  level1: { // High load
    recommendations: true,
    reviews: true,
    relatedProducts: false, // Disable
    searchSuggestions: true
  },
  level2: { // Very high load
    recommendations: false, // Disable
    reviews: false, // Disable
    relatedProducts: false,
    searchSuggestions: false
  },
  level3: { // Critical
    // Essential checkout flow only
    recommendations: false,
    reviews: false,
    relatedProducts: false,
    searchSuggestions: false,
    guestCheckout: true, // Force guest checkout
    paymentMethods: ['card'] // Reduce payment options
  }
};
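Whatever logic maps metrics to a level, clamp the result before indexing into the config so an unexpected value can never select an undefined level. A small generic helper (a sketch; `featuresForLevel` and `configs` are illustrative names, not part of the original stack):

```typescript
// Clamp a computed load level into the valid range before selecting
// a shedding config, so out-of-range values degrade safely.
function featuresForLevel<T>(configs: T[], level: number): T {
  const clamped = Math.min(Math.max(Math.trunc(level), 0), configs.length - 1);
  return configs[clamped];
}

// Toy two-level config: level 0 keeps reviews, level 1 sheds them.
const toyLevels = [{ reviews: true }, { reviews: false }];
const active = featuresForLevel(toyLevels, 5); // clamps to the last level
```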
async function getLoadLevel(): Promise<number> {
  const metrics = await getSystemMetrics();
  if (metrics.errorRate > 5 || metrics.p99Latency > 5000) return 3;
  if (metrics.errorRate > 2 || metrics.p99Latency > 2000) return 2;
  if (metrics.cpuUsage > 80 || metrics.p99Latency > 1000) return 1;
  return 0;
}
Queue-Based Checkout
During extreme load, queue checkout requests:
async function initiateCheckout(cart: Cart): Promise<CheckoutResponse> {
  if (await isSystemOverloaded()) {
    // Queue the checkout
    const ticketId = await checkoutQueue.add(cart);
    return {
      status: 'queued',
      ticketId,
      estimatedWaitSeconds: await checkoutQueue.getEstimatedWait(),
      message: 'High demand! Your order is queued and will be processed shortly.'
    };
  }
  // Normal checkout flow
  return processCheckout(cart);
}
Static Fallback Pages
// Serve cached product pages when database is overwhelmed
app.get('/products/:id', async (req, res, next) => {
  try {
    const product = await getProduct(req.params.id);
    res.json(product);
  } catch (error) {
    if (error.code === 'ECONNREFUSED' || error.code === 'ETIMEDOUT') {
      // Serve static fallback
      const fallback = await cdn.get(`/static/products/${req.params.id}.json`);
      if (fallback) {
        res.set('X-Served-From', 'fallback');
        return res.json(fallback);
      }
    }
    next(error);
  }
});
Monitoring During Spikes
Real-Time Dashboard Metrics
Critical metrics during peak:
├── Requests per second (by endpoint)
├── Error rate (4xx, 5xx)
├── P50, P95, P99 latency
├── Database connections (active, waiting)
├── Redis memory and hit rate
├── Pod count and CPU utilization
└── Cart and checkout conversion rate
Automated Alerting
# Prometheus alerting rules
groups:
  - name: ecommerce-peak
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx to total requests, so the threshold really is 1%
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1m]))
            / sum(rate(http_requests_total[1m])) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Error rate above 1%
      - alert: CheckoutLatency
        expr: histogram_quantile(0.99, rate(checkout_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Checkout P99 latency above 10 seconds
Key Takeaways
- Know your baseline: You can't plan for 10x if you don't know 1x
- Cache aggressively: Multi-layer caching dramatically reduces database load
- Read replicas scale reads: Most e-commerce traffic is read-heavy
- Pre-scale for known events: Auto-scaling alone isn't fast enough for flash sales
- Plan graceful degradation: Know which features to disable and in what order
- Queue, don't reject: A queued checkout is better than a failed one
- Monitor in real-time: Have dashboards ready and teams on standby during peaks
Retail scalability isn't about handling average load—it's about surviving the moments that make or break your year. Prepare for the spike, not the baseline.