OpenTelemetry: The Observability Standard
Implementing distributed tracing and metrics with OpenTelemetry. Instrumentation patterns, collector configuration, and integration with observability backends.
OpenTelemetry (OTel) is the industry standard for collecting telemetry data—traces, metrics, and logs—from applications. It provides vendor-neutral instrumentation, enabling portability across observability backends like Jaeger, Prometheus, Datadog, and more.
OpenTelemetry Architecture
Core Components
Application
├── SDK
│   ├── TracerProvider
│   ├── MeterProvider
│   └── LoggerProvider
├── API (vendor-neutral)
└── Instrumentation Libraries
        │  OTLP (OpenTelemetry Protocol)
        ▼
OpenTelemetry Collector
├── Receivers (OTLP, Jaeger, Prometheus, etc.)
├── Processors (batch, memory_limiter, attributes)
└── Exporters (Jaeger, Prometheus, OTLP, etc.)
        │
        ▼
Observability Backends
├── Jaeger (traces)
├── Prometheus (metrics)
├── Elasticsearch (logs)
└── Commercial (Datadog, New Relic, etc.)
Signal Types
| Signal | Purpose | Example |
|---|---|---|
| Traces | Request flow across services | HTTP request → DB query → Response |
| Metrics | Numerical measurements | Request count, latency percentiles |
| Logs | Discrete events | Error messages, audit events |
| Baggage | Context propagation | User ID, tenant ID |
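Baggage is the least familiar entry in the table. The sketch below uses only the @opentelemetry/api package to show how key-value pairs attach to the active context; the entry names are illustrative.
// baggage-example.ts (sketch, illustrative entry names)
import { context, propagation } from '@opentelemetry/api';
// Attach baggage entries to the active context
const baggage = propagation.createBaggage({
  'user.id': { value: '12345' },
  'tenant.id': { value: 'acme' },
});
context.with(propagation.setBaggage(context.active(), baggage), () => {
  // Anywhere inside this scope (and in downstream services, once the baggage
  // propagator injects it into outgoing headers) the entries can be read back:
  const tenant = propagation.getBaggage(context.active())?.getEntry('tenant.id')?.value;
  console.log(`tenant: ${tenant}`);
});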
Application Instrumentation
Node.js Setup
// tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const resource = new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'api-service',
[SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
[SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
});
const traceExporter = new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
});
const metricExporter = new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4317',
});
const sdk = new NodeSDK({
resource,
traceExporter,
metricReader: new PeriodicExportingMetricReader({
exporter: metricExporter,
exportIntervalMillis: 10000,
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-fs': { enabled: false },
'@opentelemetry/instrumentation-http': {
ignoreIncomingRequestHook: (req) => {
// Ignore health checks
return req.url === '/health' || req.url === '/ready';
},
},
}),
],
});
sdk.start();
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('SDK shut down successfully'))
.catch((error) => console.error('Error shutting down SDK', error))
.finally(() => process.exit(0));
});
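The setup file has to run before any instrumented module loads, otherwise auto-instrumentation cannot patch those modules. A minimal sketch of an entry point (file and route names are assumptions):
// index.ts -- load the tracing setup before anything else
import './tracing';
import express from 'express';

const app = express();
app.get('/orders/:id', (req, res) => res.json({ id: req.params.id }));
app.listen(3000);
For CommonJS builds, the compiled setup file can also be preloaded with node --require ./tracing.js.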
Manual Instrumentation
// manual-tracing.ts
import { trace, SpanKind, SpanStatusCode, metrics } from '@opentelemetry/api';
import type { Request, Response, NextFunction } from 'express';
// Order, database and validateAndProcess are application-specific and not shown here
const tracer = trace.getTracer('my-service', '1.0.0');
const meter = metrics.getMeter('my-service', '1.0.0');
// Create custom metrics
const requestCounter = meter.createCounter('http_requests_total', {
description: 'Total number of HTTP requests',
});
const requestDuration = meter.createHistogram('http_request_duration_ms', {
description: 'HTTP request duration in milliseconds',
unit: 'ms',
});
// Manual span creation
export const processOrder = async (orderId: string): Promise<Order> => {
return tracer.startActiveSpan('processOrder', {
kind: SpanKind.INTERNAL,
attributes: {
'order.id': orderId,
},
}, async (span) => {
try {
// Child span for database operation
const order = await tracer.startActiveSpan('fetchOrder', async (childSpan) => {
childSpan.setAttribute('db.system', 'postgresql');
childSpan.setAttribute('db.operation', 'SELECT');
try {
const result = await database.findOrder(orderId);
childSpan.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
childSpan.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
childSpan.recordException(error as Error);
throw error;
} finally {
childSpan.end();
}
});
// Process the order
const processedOrder = await validateAndProcess(order);
span.setAttribute('order.status', processedOrder.status);
span.setStatus({ code: SpanStatusCode.OK });
return processedOrder;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: (error as Error).message,
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
};
// Express middleware with metrics
export const metricsMiddleware = (req: Request, res: Response, next: NextFunction) => {
const startTime = Date.now();
res.on('finish', () => {
const duration = Date.now() - startTime;
const labels = {
method: req.method,
path: req.route?.path || req.path,
status_code: res.statusCode.toString(),
};
requestCounter.add(1, labels);
requestDuration.record(duration, labels);
});
next();
};
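Wiring these pieces into an Express application is straightforward; the route and port below are illustrative:
// app.ts (sketch: wiring the middleware and the traced helper)
import express from 'express';
import { metricsMiddleware, processOrder } from './manual-tracing';

const app = express();
app.use(metricsMiddleware);

app.get('/orders/:id', async (req, res) => {
  // processOrder creates its own spans; the HTTP auto-instrumentation
  // supplies the parent span for this request.
  const order = await processOrder(req.params.id);
  res.json(order);
});

app.listen(3000);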
OpenTelemetry Collector
Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  attributes:
    actions:
      - key: environment
        value: ${ENVIRONMENT}
        action: upsert
  resource:
    attributes:
      - key: k8s.cluster.name
        value: ${CLUSTER_NAME}
        action: upsert
  filter:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'attributes["http.route"] == "/ready"'
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
    tls:
      insecure: true
  elasticsearch:
    endpoints: [http://elasticsearch:9200]
    logs_index: otel-logs
    traces_index: otel-traces
  debug:
    verbosity: detailed

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  zpages:
    endpoint: 0.0.0.0:55679

service:
  extensions: [health_check, zpages]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes, tail_sampling]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch, attributes]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [elasticsearch]
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
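Tail sampling in the Collector pairs well with head sampling in the SDK: the SDK cheaply drops a share of traces at the source, and the Collector then keeps errors and slow requests from what remains. A sketch of the SDK side, assuming the samplers from @opentelemetry/sdk-trace-base (the 10% ratio is arbitrary):
// sampling.ts (sketch: head sampling in the SDK)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  // Respect the parent's sampling decision; sample 10% of new root traces
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...resource, exporters and instrumentations as in the earlier setup
});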
Kubernetes Deployment
# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 2
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/conf/otel-collector-config.yaml
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
            - containerPort: 8888   # Metrics
            - containerPort: 13133  # Health check
          env:
            - name: ENVIRONMENT
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: CLUSTER_NAME
              value: production-cluster
          volumeMounts:
            - name: config
              mountPath: /conf
          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /
              port: 13133
          readinessProbe:
            httpGet:
              path: /
              port: 13133
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    app: otel-collector
  ports:
    - name: otlp-grpc
      port: 4317
    - name: otlp-http
      port: 4318
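With the Service in place, application pods reach the Collector through cluster DNS. A sketch of the exporter endpoint, using the Service and namespace names from the manifests above:
// exporter endpoint inside the cluster (sketch)
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';

const traceExporter = new OTLPTraceExporter({
  // otel-collector Service in the observability namespace, OTLP gRPC port
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT
    ?? 'http://otel-collector.observability.svc.cluster.local:4317',
});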
Context Propagation
Cross-Service Tracing
// context-propagation.ts
import { context, propagation } from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import type { Request, Response, NextFunction } from 'express';
// Register the W3C trace-context propagator globally
// (the NodeSDK configures this by default; shown here for clarity)
propagation.setGlobalPropagator(new W3CTraceContextPropagator());
// Inject context into outgoing request
export const callService = async (url: string, data: any): Promise<any> => {
const headers: Record<string, string> = {};
// Inject current context into headers
propagation.inject(context.active(), headers);
const response = await fetch(url, {
method: 'POST',
headers: {
'Content-Type': 'application/json',
...headers, // traceparent, tracestate
},
body: JSON.stringify(data),
});
return response.json();
};
// Extract context from incoming request (middleware)
export const extractContext = (req: Request, res: Response, next: NextFunction) => {
const extractedContext = propagation.extract(context.active(), req.headers);
context.with(extractedContext, () => {
next();
});
};
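The same inject/extract pattern carries context across transports without HTTP headers, such as message queues. The sketch below assumes a hypothetical message shape and handler; only the @opentelemetry/api calls are real:
// queue-propagation.ts (sketch: carrying context through a message queue)
import { context, propagation, trace, SpanKind } from '@opentelemetry/api';

interface QueueMessage {
  headers: Record<string, string>; // carrier for traceparent/tracestate
  body: string;
}

const tracer = trace.getTracer('queue-consumer');

// Producer side: inject the active context into the message headers
export const buildMessage = (body: string): QueueMessage => {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return { headers, body };
};

// Consumer side: extract the context and start a CONSUMER span under it
export const handleMessage = (msg: QueueMessage) => {
  const parentCtx = propagation.extract(context.active(), msg.headers);
  const span = tracer.startSpan('process message', { kind: SpanKind.CONSUMER }, parentCtx);
  context.with(trace.setSpan(parentCtx, span), () => {
    try {
      // ...business logic...
    } finally {
      span.end();
    }
  });
};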
Key Takeaways
- Vendor neutral: OTel works with any observability backend
- Auto-instrumentation: Start quickly with automatic instrumentation
- Manual spans: Add custom spans for business-critical operations
- Collector deployment: Use the Collector for processing and routing
- Tail sampling: Sample intelligently based on trace characteristics
- Context propagation: Ensure trace context flows across service boundaries
- Resource attributes: Add metadata like service name, version, environment
- Start with traces: Distributed tracing provides the most insight initially
OpenTelemetry provides a unified approach to observability. Invest in proper instrumentation to gain visibility into distributed systems.