Observability for Modern Applications: Beyond Monitoring
Building comprehensive observability with metrics, logs, and traces. OpenTelemetry adoption, dashboard design, and alert fatigue prevention.
Monitoring tells you something is wrong. Observability helps you understand why. In distributed systems, the difference between spending 10 minutes or 10 hours debugging an incident often comes down to observability quality.
Monitoring vs Observability
Traditional Monitoring
Ask predefined questions:
- Is CPU above 80%?
- Is error rate above 1%?
- Is disk usage below threshold?
Monitoring works when you know what to ask in advance.
Observability
Explore unknown questions:
- Why did latency spike at 3:47 PM?
- Which users are affected by this error?
- What changed between the working and broken states?
Observability lets you investigate novel problems without prior instrumentation.
The Three Pillars
Metrics
Numeric measurements over time:
What metrics capture:
- Request rate (requests per second)
- Error rate (percentage failing)
- Duration (latency distribution)
- Saturation (resource utilization)
Characteristics:
- Highly aggregated
- Cheap to store long-term
- Good for alerting and trends
- Limited context per data point
Logs
Discrete event records:
What logs capture:
- Request details and parameters
- Error messages and stack traces
- Business events and state changes
- Audit and compliance information
Characteristics:
- Rich context per event
- Expensive at scale
- Good for debugging specific issues
- Require structure for effective querying
Traces
Request paths through distributed systems:
What traces capture:
- Full request journey across services
- Timing for each operation
- Relationships between operations
- Errors at specific steps
Characteristics:
- Show causation, not just correlation
- Essential for distributed debugging
- Moderate storage requirements (with sampling)
- Require instrumentation across all services
OpenTelemetry Adoption
OpenTelemetry (OTel) provides vendor-neutral observability instrumentation.
Why OpenTelemetry
Vendor neutrality: Change backends without code changes
Comprehensive: Metrics, logs, and traces in one framework
Wide adoption: Supported by all major vendors
Future-proof: CNCF standard with broad industry backing
Implementation Strategy
Phase 1: Auto-instrumentation
Start with automatic instrumentation for immediate value:
- HTTP server and client calls
- Database queries
- gRPC calls
- Common libraries
Most languages have auto-instrumentation agents that require minimal code changes.
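As a minimal sketch for a Node.js service (assuming the @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and @opentelemetry/exporter-trace-otlp-http packages; the service name and collector endpoint are placeholders):
// instrumentation.js: OpenTelemetry bootstrap, loaded before the application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'order-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder collector endpoint
  }),
  // Auto-instruments HTTP servers and clients, gRPC, and common database libraries
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Loading this file before the application (for example with node --require ./instrumentation.js app.js) produces traces for inbound and outbound calls without touching application code.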
Phase 2: Custom spans
Add business-relevant context:
// Obtain a tracer from the OpenTelemetry API (the tracer name is an example)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');

const span = tracer.startSpan('process-order');
span.setAttribute('order.id', orderId);
span.setAttribute('order.value', orderValue);
span.setAttribute('customer.tier', customerTier);
try {
  await processOrder(order);
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
  throw error;
} finally {
  span.end(); // always end the span, even when the operation fails
}
Phase 3: Custom metrics
Add business metrics:
// Obtain a meter from the OpenTelemetry metrics API (the meter name is an example)
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('order-service');

const orderCounter = meter.createCounter('orders.processed', {
  description: 'Number of orders processed',
});
const orderValueHistogram = meter.createHistogram('orders.value', {
  description: 'Distribution of order values',
});

// Attributes allow slicing by status, region, currency, and so on
orderCounter.add(1, { status: 'completed', region: 'eu-west' });
orderValueHistogram.record(orderValue, { currency: 'EUR' });
Sampling Strategies
Not every trace needs to be stored:
Head-based sampling: Decide at trace start (simple but misses interesting traces)
Tail-based sampling: Decide after trace completes (captures errors and slow requests)
Adaptive sampling: Adjust rate based on traffic and error rates
For production systems, sample 1-10% of normal traffic but 100% of errors.
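As a sketch of head-based sampling with the OpenTelemetry JS SDK (the 10% ratio is illustrative), a parent-based sampler keeps the keep-or-drop decision consistent across services:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // Sample roughly 10% of new traces; follow the parent's decision for propagated ones
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
Tail-based sampling is usually implemented in the OpenTelemetry Collector rather than in application code, because the decision needs the complete trace.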
Dashboard Design
Start with SLOs
Service Level Objectives define what "good" looks like:
Example SLOs:
- 99.9% of requests complete successfully
- 99% of requests complete in under 500ms
- 99.5% availability measured monthly
A dashboard's primary focus should be SLO health.
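It helps to translate each SLO into an error budget. The sketch below (hypothetical helper names) shows the arithmetic for the example SLOs above:
// Hypothetical helpers: turn an SLO target into a concrete error budget
function requestErrorBudget(sloTarget, totalRequests) {
  return Math.round(totalRequests * (1 - sloTarget));
}

function monthlyDowntimeBudgetMinutes(availabilityTarget) {
  const minutesPerMonth = 30 * 24 * 60;
  return Math.round(minutesPerMonth * (1 - availabilityTarget));
}

console.log(requestErrorBudget(0.999, 10_000_000));   // 10000 failed requests allowed
console.log(monthlyDowntimeBudgetMinutes(0.995));     // 216 minutes of downtime per month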
RED Metrics
For each service, display:
Rate: Request throughput (requests/second)
Errors: Error rate (percentage or count)
Duration: Latency distribution (p50, p95, p99)
RED gives a quick health overview for any service.
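A sketch of capturing RED for an Express-style service with the OpenTelemetry metrics API (meter and metric names are illustrative):
const express = require('express');
const { metrics } = require('@opentelemetry/api');

const app = express();
const meter = metrics.getMeter('http-server');
const requestCounter = meter.createCounter('http.server.requests');      // Rate
const errorCounter = meter.createCounter('http.server.errors');          // Errors
const durationHistogram = meter.createHistogram('http.server.duration'); // Duration (ms)

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const attrs = { method: req.method, route: req.path, status: res.statusCode };
    requestCounter.add(1, attrs);
    if (res.statusCode >= 500) errorCounter.add(1, attrs);
    durationHistogram.record(Date.now() - start, attrs);
  });
  next();
});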
USE Metrics
For infrastructure resources:
Utilization: Percentage of resource capacity used
Saturation: Work queued waiting for resources
Errors: Resource-level error counts
Audience-Specific Design
Operations dashboard: System health, alerts, SLO status
Development dashboard: Error details, latency breakdowns, dependency health
Business dashboard: Transaction volumes, conversion rates, revenue metrics
One dashboard doesn't serve all audiences effectively.
Avoiding Dashboard Sprawl
Symptoms of sprawl:
- Nobody knows which dashboard to check
- Dashboards have overlapping information
- Dashboards go months without views
Prevention:
- Establish dashboard ownership
- Regular audits of dashboard usage
- Template-based creation for consistency
- Clear naming conventions
Alert Fatigue Prevention
Alert fatigue is real and dangerous. When teams ignore alerts, real incidents go unnoticed.
Alert on Symptoms, Not Causes
Bad alert: CPU usage above 80%
Good alert: Error rate above SLO threshold
Users experience symptoms (errors, latency), not causes (CPU, memory). Alert on what affects users.
Set Meaningful Thresholds
Use historical data to set thresholds:
- What's the normal range for this metric?
- What level actually indicates a problem?
- What can the on-call engineer actually do about it?
Alerts that never fire or always fire are both useless.
Implement Alert Hierarchy
Page (immediate response):
- Production is down
- Data loss imminent
- Security breach detected
Ticket (business hours):
- Elevated error rate (but within SLO)
- Approaching capacity limits
- Performance degradation
Log (no notification):
- Informational events
- Debugging data
- Audit information
Regular Alert Review
Monthly alert hygiene:
- Which alerts fired? Were they actionable?
- Which alerts never fire? Are thresholds too high?
- What incidents were missed? What alerts would have caught them?
Correlation and Context
Connecting the Pillars
Make it easy to pivot between data types:
- From alert → related metrics → relevant logs → traces
- From trace → service metrics → recent deployments
- From logs → aggregated metrics → similar errors
Correlation Keys
Use consistent identifiers across all telemetry:
// Include in all telemetry emitted while handling this request.
// currentTraceId() stands in for something like trace.getActiveSpan()?.spanContext().traceId
const context = {
  traceId: currentTraceId(),
  requestId: req.headers['x-request-id'],
  userId: req.user?.id,
  deployment: process.env.DEPLOYMENT_VERSION,
};
Linking to Code
Connect telemetry to source:
- Include commit SHA in deployment metadata
- Link errors to code locations
- Connect traces to source repositories
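One way to do this (a sketch; exact resource APIs vary by SDK version, and GIT_COMMIT_SHA is assumed to be injected by the deploy pipeline) is to attach version metadata as resource attributes so every metric, log, and trace carries it:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { Resource } = require('@opentelemetry/resources');

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'order-service',                    // placeholder name
    'service.version': process.env.DEPLOYMENT_VERSION,  // release tag
    'vcs.commit.sha': process.env.GIT_COMMIT_SHA,       // illustrative attribute key
  }),
});
With the commit SHA stamped on every signal, an error in a log or trace can be opened at the exact revision that produced it.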
Key Takeaways
- Observability enables understanding: Not just detecting, but diagnosing
- Three pillars work together: Metrics for alerts, logs for details, traces for flow
- OpenTelemetry is the standard: Invest in vendor-neutral instrumentation
- Design dashboards for audiences: Different users need different views
- Prevent alert fatigue: Alert on symptoms, set meaningful thresholds
- Connect everything: Correlation turns data into insight
- Evolve continuously: Observability needs grow with system complexity