Observability for Modern Applications: Beyond Monitoring
Building comprehensive observability with metrics, logs, and traces. OpenTelemetry adoption, dashboard design, and alert fatigue prevention.
Monitoring tells you something is wrong. Observability helps you understand why. In distributed systems, the difference between spending 10 minutes or 10 hours debugging an incident often comes down to observability quality.
Monitoring vs Observability
Traditional Monitoring
Ask predefined questions:
- Is CPU above 80%?
- Is error rate above 1%?
- Is disk usage below threshold?
Monitoring works when you know what to ask in advance.
Observability
Explore unknown questions:
- Why did latency spike at 3:47 PM?
- Which users are affected by this error?
- What changed between the working and broken states?
Observability lets you investigate novel problems without prior instrumentation.
The Three Pillars
Metrics
Numeric measurements over time:
What metrics capture:
- Request rate (requests per second)
- Error rate (percentage failing)
- Duration (latency distribution)
- Saturation (resource utilization)
Characteristics:
- Highly aggregated
- Cheap to store long-term
- Good for alerting and trends
- Limited context per data point
Logs
Discrete event records:
What logs capture:
- Request details and parameters
- Error messages and stack traces
- Business events and state changes
- Audit and compliance information
Characteristics:
- Rich context per event
- Expensive at scale
- Good for debugging specific issues
- Require structure for effective querying
Traces
Request paths through distributed systems:
What traces capture:
- Full request journey across services
- Timing for each operation
- Relationships between operations
- Errors at specific steps
Characteristics:
- Show causation, not just correlation
- Essential for distributed debugging
- Moderate storage requirements (with sampling)
- Require instrumentation across all services
OpenTelemetry Adoption
OpenTelemetry (OTel) provides vendor-neutral observability instrumentation.
Why OpenTelemetry
Vendor neutrality: Change backends without code changes
Comprehensive: Metrics, logs, and traces in one framework
Wide adoption: Supported by all major vendors
Future-proof: CNCF standard with broad industry backing
Implementation Strategy
Phase 1: Auto-instrumentation
Start with automatic instrumentation for immediate value:
- HTTP server and client calls
- Database queries
- gRPC calls
- Common libraries
Most languages have auto-instrumentation agents that require minimal code changes.
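As a minimal sketch for a Node.js service (assuming the @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and @opentelemetry/exporter-trace-otlp-http packages; the service name and collector endpoint are placeholders):
// instrumentation.js: OpenTelemetry bootstrap, loaded before the application code
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  serviceName: 'order-service', // placeholder service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder collector endpoint
  }),
  // Auto-instruments HTTP servers and clients, gRPC, and common database libraries
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
Loading this file before the application (for example with node --require ./instrumentation.js app.js) produces traces for inbound and outbound calls without touching application code.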
Phase 2: Custom spans
Add business-relevant context:
// Obtain a tracer from the OpenTelemetry API (the tracer name is an example)
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('order-service');

const span = tracer.startSpan('process-order');
span.setAttribute('order.id', orderId);
span.setAttribute('order.value', orderValue);
span.setAttribute('customer.tier', customerTier);
try {
  await processOrder(order);
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
  throw error;
} finally {
  span.end(); // always end the span, even when the operation fails
}
Phase 3: Custom metrics
Add business metrics:
// Obtain a meter from the OpenTelemetry metrics API (the meter name is an example)
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('order-service');

const orderCounter = meter.createCounter('orders.processed', {
  description: 'Number of orders processed',
});
const orderValueHistogram = meter.createHistogram('orders.value', {
  description: 'Distribution of order values',
});

// Attributes allow slicing by status, region, currency, and so on
orderCounter.add(1, { status: 'completed', region: 'eu-west' });
orderValueHistogram.record(orderValue, { currency: 'EUR' });
Sampling Strategies
Not every trace needs to be stored:
Head-based sampling: Decide at trace start (simple but misses interesting traces)
Tail-based sampling: Decide after trace completes (captures errors and slow requests)
Adaptive sampling: Adjust rate based on traffic and error rates
For production systems, sample 1-10% of normal traffic but 100% of errors.
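As a sketch of head-based sampling with the OpenTelemetry JS SDK (the 10% ratio is illustrative), a parent-based sampler keeps the keep-or-drop decision consistent across services:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ParentBasedSampler, TraceIdRatioBasedSampler } = require('@opentelemetry/sdk-trace-base');

const sdk = new NodeSDK({
  // Sample roughly 10% of new traces; follow the parent's decision for propagated ones
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
});
Tail-based sampling is usually implemented in the OpenTelemetry Collector rather than in application code, because the decision needs the complete trace.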
Dashboard Design
Start with SLOs
Service Level Objectives define what "good" looks like:
Example SLOs:
- 99.9% of requests complete successfully
- 99% of requests complete in under 500ms
- 99.5% availability measured monthly
A dashboard's primary focus should be SLO health.
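It helps to translate each SLO into an error budget. The sketch below (hypothetical helper names) shows the arithmetic for the example SLOs above:
// Hypothetical helpers: turn an SLO target into a concrete error budget
function requestErrorBudget(sloTarget, totalRequests) {
  return Math.round(totalRequests * (1 - sloTarget));
}

function monthlyDowntimeBudgetMinutes(availabilityTarget) {
  const minutesPerMonth = 30 * 24 * 60;
  return Math.round(minutesPerMonth * (1 - availabilityTarget));
}

console.log(requestErrorBudget(0.999, 10_000_000));   // 10000 failed requests allowed
console.log(monthlyDowntimeBudgetMinutes(0.995));     // 216 minutes of downtime per month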
RED Metrics
For each service, display:
Rate: Request throughput (requests/second)
Errors: Error rate (percentage or count)
Duration: Latency distribution (p50, p95, p99)
RED gives a quick health overview for any service.
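A sketch of capturing RED for an Express-style service with the OpenTelemetry metrics API (meter and metric names are illustrative):
const express = require('express');
const { metrics } = require('@opentelemetry/api');

const app = express();
const meter = metrics.getMeter('http-server');
const requestCounter = meter.createCounter('http.server.requests');      // Rate
const errorCounter = meter.createCounter('http.server.errors');          // Errors
const durationHistogram = meter.createHistogram('http.server.duration'); // Duration (ms)

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const attrs = { method: req.method, route: req.path, status: res.statusCode };
    requestCounter.add(1, attrs);
    if (res.statusCode >= 500) errorCounter.add(1, attrs);
    durationHistogram.record(Date.now() - start, attrs);
  });
  next();
});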
USE Metrics
For infrastructure resources:
Utilization: Percentage of resource capacity used
Saturation: Work queued waiting for resources
Errors: Resource-level error counts
Audience-Specific Design
Operations dashboard: System health, alerts, SLO status
Development dashboard: Error details, latency breakdowns, dependency health
Business dashboard: Transaction volumes, conversion rates, revenue metrics
One dashboard doesn't serve all audiences effectively.
Avoiding Dashboard Sprawl
Symptoms of sprawl:
- Nobody knows which dashboard to check
- Dashboards have overlapping information
- Dashboards go months without views
Prevention:
- Establish dashboard ownership
- Regular audits of dashboard usage
- Template-based creation for consistency
- Clear naming conventions
Alert Fatigue Prevention
Alert fatigue is real and dangerous. When teams ignore alerts, real incidents go unnoticed.
Alert on Symptoms, Not Causes
Bad alert: CPU usage above 80%
Good alert: Error rate above SLO threshold
Users experience symptoms (errors, latency), not causes (CPU, memory). Alert on what affects users.
Set Meaningful Thresholds
Use historical data to set thresholds:
- What's the normal range for this metric?
- What level actually indicates a problem?
- What can the on-call engineer actually do about it?
Alerts that never fire or always fire are both useless.
Implement Alert Hierarchy
Page (immediate response):
- Production is down
- Data loss imminent
- Security breach detected
Ticket (business hours):
- Elevated error rate (but within SLO)
- Approaching capacity limits
- Performance degradation
Log (no notification):
- Informational events
- Debugging data
- Audit information
Regular Alert Review
Monthly alert hygiene:
- Which alerts fired? Were they actionable?
- Which alerts never fire? Are thresholds too high?
- What incidents were missed? What alerts would have caught them?
Correlation and Context
Connecting the Pillars
Make it easy to pivot between data types:
- From alert → related metrics → relevant logs → traces
- From trace → service metrics → recent deployments
- From logs → aggregated metrics → similar errors
Correlation Keys
Use consistent identifiers across all telemetry:
// Include in all telemetry emitted while handling this request.
// currentTraceId() stands in for something like trace.getActiveSpan()?.spanContext().traceId
const context = {
  traceId: currentTraceId(),
  requestId: req.headers['x-request-id'],
  userId: req.user?.id,
  deployment: process.env.DEPLOYMENT_VERSION,
};
Linking to Code
Connect telemetry to source:
- Include commit SHA in deployment metadata
- Link errors to code locations
- Connect traces to source repositories
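One way to do this (a sketch; exact resource APIs vary by SDK version, and GIT_COMMIT_SHA is assumed to be injected by the deploy pipeline) is to attach version metadata as resource attributes so every metric, log, and trace carries it:
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { Resource } = require('@opentelemetry/resources');

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'order-service',                    // placeholder name
    'service.version': process.env.DEPLOYMENT_VERSION,  // release tag
    'vcs.commit.sha': process.env.GIT_COMMIT_SHA,       // illustrative attribute key
  }),
});
With the commit SHA stamped on every signal, an error in a log or trace can be opened at the exact revision that produced it.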
Key Takeaways
- Observability enables understanding: Not just detecting, but diagnosing
- Three pillars work together: Metrics for alerts, logs for details, traces for flow
- OpenTelemetry is the standard: Invest in vendor-neutral instrumentation
- Design dashboards for audiences: Different users need different views
- Prevent alert fatigue: Alert on symptoms, set meaningful thresholds
- Connect everything: Correlation turns data into insight
- Evolve continuously: Observability needs grow with system complexity