OpenTelemetry: The Observability Standard
Implementing distributed tracing and metrics with OpenTelemetry. Instrumentation patterns, collector configuration, and integration with observability backends.
OpenTelemetry (OTel) is the industry standard for collecting telemetry data—traces, metrics, and logs—from applications. It provides vendor-neutral instrumentation, enabling portability across observability backends like Jaeger, Prometheus, Datadog, and more.
OpenTelemetry Architecture
Core Components
Application
├── SDK
│   ├── TracerProvider
│   ├── MeterProvider
│   └── LoggerProvider
├── API (vendor-neutral)
└── Instrumentation Libraries
        │  OTLP (OpenTelemetry Protocol)
        ▼
OpenTelemetry Collector
├── Receivers (OTLP, Jaeger, Prometheus, etc.)
├── Processors (batch, memory_limiter, attributes)
└── Exporters (Jaeger, Prometheus, OTLP, etc.)
        │
        ▼
Observability Backends
├── Jaeger (traces)
├── Prometheus (metrics)
├── Elasticsearch (logs)
└── Commercial (Datadog, New Relic, etc.)
Signal Types
| Signal | Purpose | Example |
|---|---|---|
| Traces | Request flow across services | HTTP request → DB query → Response |
| Metrics | Numerical measurements | Request count, latency percentiles |
| Logs | Discrete events | Error messages, audit events |
| Baggage | Context propagation | User ID, tenant ID |
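Baggage is the least familiar entry in the table. The sketch below uses only the @opentelemetry/api package to show how key-value pairs attach to the active context; the entry names are illustrative.
// baggage-example.ts (sketch, illustrative entry names)
import { context, propagation } from '@opentelemetry/api';
// Attach baggage entries to the active context
const baggage = propagation.createBaggage({
  'user.id': { value: '12345' },
  'tenant.id': { value: 'acme' },
});
context.with(propagation.setBaggage(context.active(), baggage), () => {
  // Anywhere inside this scope (and in downstream services, once the baggage
  // propagator injects it into outgoing headers) the entries can be read back:
  const tenant = propagation.getBaggage(context.active())?.getEntry('tenant.id')?.value;
  console.log(`tenant: ${tenant}`);
});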
Application Instrumentation
Node.js Setup
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'api-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
});
const metricExporter = new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
});
const sdk = new NodeSDK({
resource,
traceExporter,
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 10000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// Ignore health checks
return req.url === '/health' || req.url === '/ready';
},
},
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('SDK shut down successfully'))
.catch((error) => console.error('Error shutting down SDK', error))
.finally(() => process.exit(0));
});
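The setup file has to run before any instrumented module loads, otherwise auto-instrumentation cannot patch those modules. A minimal sketch of an entry point (file and route names are assumptions):
// index.ts -- load the tracing setup before anything else
import './tracing';
import express from 'express';

const app = express();
app.get('/orders/:id', (req, res) => res.json({ id: req.params.id }));
app.listen(3000);
For CommonJS builds, the compiled setup file can also be preloaded with node --require ./tracing.js.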
Manual Instrumentation
// manual-tracing.ts
import { trace, SpanKind, SpanStatusCode, metrics } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
// Order, database and validateAndProcess are application-specific and not shown here
const tracer = trace.getTracer('my-service', '1.0.0');
const meter = metrics.getMeter('my-service', '1.0.0');
// Create custom metrics
const requestCounter = meter.createCounter('http_requests_total', {
description: 'Total number of HTTP requests',
});
const requestDuration = meter.createHistogram('http_request_duration_ms', {
description: 'HTTP request duration in milliseconds',
unit: 'ms',
});
// Manual span creation
export const processOrder = async (orderId: string): Promise<Order> => {
return tracer.startActiveSpan('processOrder', {
kind: SpanKind.INTERNAL,
attributes: {
'order.id': orderId,
},
}, async (span) => {
try {
// Child span for database operation
const order = await tracer.startActiveSpan('fetchOrder', async (childSpan) => {
childSpan.setAttribute('db.system', 'postgresql');
childSpan.setAttribute('db.operation', 'SELECT');
try {
const result = await database.findOrder(orderId);
childSpan.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
childSpan.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
childSpan.recordException(error as Error);
throw error;
} finally {
childSpan.end();
}
});
// Process the order
const processedOrder = await validateAndProcess(order);
span.setAttribute('order.status', processedOrder.status);
span.setStatus({ code: SpanStatusCode.OK });
return processedOrder;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
};
// Express middleware with metrics
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
const startTime = Date.now();
res.on('finish', () => {
const duration = Date.now() - startTime;
const labels = {
method: req.method,
path: req.route?.path || req.path,
status_code: res.statusCode.toString(),
};
requestCounter.add(1, labels);
requestDuration.record(duration, labels);
});
next();
};
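Wiring these pieces into an Express application is straightforward; the route and port below are illustrative:
// app.ts (sketch: wiring the middleware and the traced helper)
import express from 'express';
import { metricsMiddleware, processOrder } from './manual-tracing';

const app = express();
app.use(metricsMiddleware);

app.get('/orders/:id', async (req, res) => {
  // processOrder creates its own spans; the HTTP auto-instrumentation
  // supplies the parent span for this request.
  const order = await processOrder(req.params.id);
  res.json(order);
});

app.listen(3000);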
OpenTelemetry Collector
Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  attributes:
    actions:
      - key: environment
        value: ${ENVIRONMENT}
        action: upsert
  resource:
    attributes:
      - key: k8s.cluster.name
        value: ${CLUSTER_NAME}
        action: upsert
  filter:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true
  elasticsearch:
    endpoints: [http://elasticsearch:9200]
    logs_index: otel-logs
    traces_index: otel-traces
  debug:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes, tail_sampling]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [elasticsearch]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
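Tail sampling in the Collector pairs well with head sampling in the SDK: the SDK cheaply drops a share of traces at the source, and the Collector then keeps errors and slow requests from what remains. A sketch of the SDK side, assuming the samplers from @opentelemetry/sdk-trace-base (the 10% ratio is arbitrary):
// sampling.ts (sketch: head sampling in the SDK)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Respect the parent's sampling decision; sample 10% of new root traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...resource, exporters and instrumentations as in the earlier setup
});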
Kubernetes Deployment
# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/conf/otel-collector-config.yaml
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # Metrics
            - containerPort: 13133  # Health check
          env:
            - name: ENVIRONMENT
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CLUSTER_NAME
              value: production-cluster
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /
              port: 13133
          readinessProbe:
            httpGet:
              path: /
              port: 13133
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
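With the Service in place, application pods reach the Collector through cluster DNS. A sketch of the exporter endpoint, using the Service and namespace names from the manifests above:
// exporter endpoint inside the cluster (sketch)
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const traceExporter = new OTLPTraceExporter({
  // otel-collector Service in the observability namespace, OTLP gRPC port
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ?? 'http://otel-collector.observability.svc.cluster.local:4317',
});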
Context Propagation
Cross-Service Tracing
// context-propagation.ts
import { context, propagation } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import type { Request, Response, NextFunction } from 'express';
// Register the W3C trace-context propagator globally
// (the NodeSDK configures this by default; shown here for clarity)
propagation.setGlobalPropagator(new W3CTraceContextPropagator());
// Inject context into outgoing request
export const callService = async (url: string, data: any): Promise<any> => {
const headers: Record<string, string> = {};
// Inject current context into headers
propagation.inject(context.active(), headers);
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...headers, // traceparent, tracestate
},
body: JSON.stringify(data),
});
return response.json();
};
// Extract context from incoming request (middleware)
export const extractContext = (req: Request, res: Response, next: NextFunction) => {
const extractedContext = propagation.extract(context.active(), req.headers);
context.with(extractedContext, () => {
next();
});
};
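The same inject/extract pattern carries context across transports without HTTP headers, such as message queues. The sketch below assumes a hypothetical message shape and handler; only the @opentelemetry/api calls are real:
// queue-propagation.ts (sketch: carrying context through a message queue)
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

interface QueueMessage {
  headers: Record<string, string>; // carrier for traceparent/tracestate
  body: string;
}

const tracer = trace.getTracer('queue-consumer');

// Producer side: inject the active context into the message headers
export const buildMessage = (body: string): QueueMessage => {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return { headers, body };
};

// Consumer side: extract the context and start a CONSUMER span under it
export const handleMessage = (msg: QueueMessage) => {
  const parentCtx = propagation.extract(context.active(), msg.headers);
  const span = tracer.startSpan('process message', { kind: SpanKind.CONSUMER }, parentCtx);
  context.with(trace.setSpan(parentCtx, span), () => {
    try {
      // ...business logic...
    } finally {
      span.end();
    }
  });
};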
Key Takeaways
- Vendor neutral: OTel works with any observability backend
- Auto-instrumentation: Start quickly with automatic instrumentation
- Manual spans: Add custom spans for business-critical operations
- Collector deployment: Use the Collector for processing and routing
- Tail sampling: Sample intelligently based on trace characteristics
- Context propagation: Ensure trace context flows across service boundaries
- Resource attributes: Add metadata like service name, version, environment
- Start with traces: Distributed tracing provides the most insight initially
OpenTelemetry provides a unified approach to observability. Invest in proper instrumentation to gain visibility into distributed systems.