25 July 2024 · 16 min read

Observability for Modern Applications: Beyond Monitoring

Observability · Monitoring · OpenTelemetry · DevOps

Building comprehensive observability with metrics, logs, and traces. OpenTelemetry adoption, dashboard design, and alert fatigue prevention.



Monitoring tells you something is wrong. Observability helps you understand why. In distributed systems, the difference between spending 10 minutes or 10 hours debugging an incident often comes down to observability quality.

Monitoring vs Observability

Traditional Monitoring

Ask predefined questions:

  • Is CPU above 80%?
  • Is error rate above 1%?
  • Is disk usage below threshold?

Monitoring works when you know what to ask in advance.

Observability

Explore unknown questions:

  • Why did latency spike at 3:47 PM?
  • Which users are affected by this error?
  • What changed between the working and broken states?

Observability lets you investigate novel problems without prior instrumentation.

The Three Pillars

Metrics

Numeric measurements over time:

What metrics capture:

  • Request rate (requests per second)
  • Error rate (percentage failing)
  • Duration (latency distribution)
  • Saturation (resource utilization)

Characteristics:

  • Highly aggregated
  • Cheap to store long-term
  • Good for alerting and trends
  • Limited context per data point

Logs

Discrete event records:

What logs capture:

  • Request details and parameters
  • Error messages and stack traces
  • Business events and state changes
  • Audit and compliance information

Characteristics:

  • Rich context per event
  • Expensive at scale
  • Good for debugging specific issues
  • Require structure for effective querying
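
Structured (typically JSON) logs are what make effective querying possible. A minimal sketch in Node.js, assuming the pino logging library and an active OpenTelemetry span; the event name and field values are illustrative:

import pino from 'pino';
import { trace } from '@opentelemetry/api';

// Every field becomes a queryable key in the log backend,
// instead of free text that has to be parsed with regexes.
const logger = pino({ level: 'info' });

logger.info(
  {
    event: 'order.processed',                                   // illustrative event name
    orderId: 'ord_123',                                         // illustrative value
    traceId: trace.getActiveSpan()?.spanContext().traceId,      // ties the log line to a trace
    durationMs: 42,
  },
  'order processed successfully'
);

With this shape, the backend can filter on event, aggregate on durationMs, or join on traceId rather than scraping message strings.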

Traces

Request paths through distributed systems:

What traces capture:

  • Full request journey across services
  • Timing for each operation
  • Relationships between operations
  • Errors at specific steps

Characteristics:

  • Show causation, not just correlation
  • Essential for distributed debugging
  • Moderate storage requirements (with sampling)
  • Require instrumentation across all services

OpenTelemetry Adoption

OpenTelemetry (OTel) provides vendor-neutral observability instrumentation.

Why OpenTelemetry

Vendor neutrality: Change backends without code changes
Comprehensive: Metrics, logs, and traces in one framework
Wide adoption: Supported by all major vendors
Future-proof: CNCF standard with broad industry backing

Implementation Strategy

Phase 1: Auto-instrumentation

Start with automatic instrumentation for immediate value:

  • HTTP server and client calls
  • Database queries
  • gRPC calls
  • Common libraries

Most languages have auto-instrumentation agents that require minimal code changes.
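
As a sketch of what Phase 1 can look like in Node.js (assuming the @opentelemetry/sdk-node, @opentelemetry/auto-instrumentations-node, and OTLP exporter packages), a single bootstrap file loaded before the application is usually enough:

// tracing.ts — load before the app (e.g. via node --require or --import)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  // Sends traces to a local OTLP endpoint (collector or agent) by default.
  traceExporter: new OTLPTraceExporter(),
  // Instruments HTTP, gRPC, popular database clients, and other common libraries.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

No application code changes are needed; spans for inbound and outbound calls start appearing as soon as the SDK is loaded.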

Phase 2: Custom spans

Add business-relevant context:

import { trace, SpanStatusCode } from '@opentelemetry/api';

// Tracer name is illustrative; use your service or module name.
const tracer = trace.getTracer('order-service');

const span = tracer.startSpan('process-order');
span.setAttribute('order.id', orderId);
span.setAttribute('order.value', orderValue);
span.setAttribute('customer.tier', customerTier);

try {
  await processOrder(order);
  span.setStatus({ code: SpanStatusCode.OK });
} catch (error) {
  span.setStatus({ code: SpanStatusCode.ERROR });
  span.recordException(error);
  throw error;
} finally {
  span.end();
}

Phase 3: Custom metrics

Add business metrics:

import { metrics } from '@opentelemetry/api';

// Meter name is illustrative; use your service or module name.
const meter = metrics.getMeter('order-service');

const orderCounter = meter.createCounter('orders.processed', {
  description: 'Number of orders processed',
});

const orderValueHistogram = meter.createHistogram('orders.value', {
  description: 'Distribution of order values',
});

// Attributes (labels) let you slice the metric by dimension.
orderCounter.add(1, { status: 'completed', region: 'eu-west' });
orderValueHistogram.record(orderValue, { currency: 'EUR' });

Sampling Strategies

Not every trace needs to be stored:

Head-based sampling: Decide at trace start (simple but misses interesting traces)
Tail-based sampling: Decide after trace completes (captures errors and slow requests)
Adaptive sampling: Adjust rate based on traffic and error rates

For production systems, sample 1-10% of normal traffic but 100% of errors.
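
Head-based sampling is the easiest to start with because it lives in the SDK. A sketch assuming the OpenTelemetry Node SDK: sample a fixed fraction of traces that start in this service, and follow the parent's decision otherwise.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Keep ~5% of traces rooted in this service; for requests arriving from
  // upstream services, honour the sampling decision already propagated.
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.05),
  }),
});

sdk.start();

Tail-based sampling (keeping all errors and slow requests) is normally configured in the OpenTelemetry Collector rather than in application code, since the decision needs the complete trace.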

Dashboard Design

Start with SLOs

Service Level Objectives define what "good" looks like:

Example SLOs:

  • 99.9% of requests complete successfully
  • 99% of requests complete in under 500ms
  • 99.5% availability measured monthly

A dashboard's primary focus should be SLO health.
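
Translating an SLO target into an error budget makes the numbers tangible. A small illustrative helper (not part of any library):

// Error budget: how much "bad" the SLO still allows within the window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const windowMinutes = windowDays * 24 * 60;
  return (1 - sloTarget) * windowMinutes;
}

// 99.5% availability over a 30-day window allows ~216 minutes of downtime.
console.log(errorBudgetMinutes(0.995, 30)); // 216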

RED Metrics

For each service, display:

Rate: Request throughput (requests/second)
Errors: Error rate (percentage or count)
Duration: Latency distribution (p50, p95, p99)

RED gives a quick health overview for any service.

USE Metrics

For infrastructure resources:

Utilization: Percentage of resource capacity used
Saturation: Work queued waiting for resources
Errors: Resource-level error counts

Audience-Specific Design

Operations dashboard: System health, alerts, SLO status
Development dashboard: Error details, latency breakdowns, dependency health
Business dashboard: Transaction volumes, conversion rates, revenue metrics

One dashboard doesn't serve all audiences effectively.

Avoiding Dashboard Sprawl

Symptoms of sprawl:

  • Nobody knows which dashboard to check
  • Dashboards have overlapping information
  • Dashboards go months without views

Prevention:

  • Establish dashboard ownership
  • Regular audits of dashboard usage
  • Template-based creation for consistency
  • Clear naming conventions

Alert Fatigue Prevention

Alert fatigue is real and dangerous. When teams ignore alerts, real incidents go unnoticed.

Alert on Symptoms, Not Causes

Bad alert: CPU usage above 80%
Good alert: Error rate above SLO threshold

Users experience symptoms (errors, latency), not causes (CPU, memory). Alert on what affects users.

Set Meaningful Thresholds

Use historical data to set thresholds:

  • What's the normal range for this metric?
  • What level actually indicates a problem?
  • What can the on-call engineer actually do about it?

Alerts that never fire or always fire are both useless.

Implement Alert Hierarchy

Page (immediate response):

  • Production is down
  • Data loss imminent
  • Security breach detected

Ticket (business hours):

  • Elevated error rate (but within SLO)
  • Approaching capacity limits
  • Performance degradation

Log (no notification):

  • Informational events
  • Debugging data
  • Audit information

Regular Alert Review

Monthly alert hygiene:

  • Which alerts fired? Were they actionable?
  • Which alerts never fire? Are thresholds too high?
  • What incidents were missed? What alerts would have caught them?

Correlation and Context

Connecting the Pillars

Make it easy to pivot between data types:

  • From alert → related metrics → relevant logs → traces
  • From trace → service metrics → recent deployments
  • From logs → aggregated metrics → similar errors

Correlation Keys

Use consistent identifiers across all telemetry:

// Include in all telemetry
const context = {
  traceId: currentTraceId(),
  requestId: req.headers['x-request-id'],
  userId: req.user?.id,
  deployment: process.env.DEPLOYMENT_VERSION,
};

Linking to Code

Connect telemetry to source:

  • Include commit SHA in deployment metadata
  • Link errors to code locations
  • Connect traces to source repositories
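
With OpenTelemetry, one way to do this (a sketch; the commit attribute name and environment variables are illustrative) is to stamp deployment metadata onto the SDK resource, so every span, metric, and log exported by the process carries it:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { Resource } from '@opentelemetry/resources';

const sdk = new NodeSDK({
  resource: new Resource({
    'service.name': 'order-service',                                  // illustrative
    'service.version': process.env.DEPLOYMENT_VERSION ?? 'unknown',
    'vcs.commit.sha': process.env.GIT_COMMIT_SHA ?? 'unknown',        // custom attribute
  }),
});

sdk.start();

A dashboard or trace view can then group by service.version or vcs.commit.sha, which turns "what changed between the working and broken states?" into a query instead of an archaeology exercise.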

Key Takeaways

  1. Observability enables understanding: Not just detecting, but diagnosing
  2. Three pillars work together: Metrics for alerts, logs for details, traces for flow
  3. OpenTelemetry is the standard: Invest in vendor-neutral instrumentation
  4. Design dashboards for audiences: Different users need different views
  5. Prevent alert fatigue: Alert on symptoms, set meaningful thresholds
  6. Connect everything: Correlation turns data into insight
  7. Evolve continuously: Observability needs grow with system complexity
