Kubernetes in Production: Hard-Won Lessons from Enterprise Deployments
Practical insights from deploying and operating Kubernetes clusters in enterprise environments. Covering security, networking, observability, and disaster recovery.
Kubernetes is powerful but complex. After deploying and operating clusters at Vattenfall, Vitrifi, and other enterprises, I've accumulated hard-won lessons about what it takes to run Kubernetes reliably in production.
Security First
Security in Kubernetes is multi-layered. A breach at any layer can compromise your entire infrastructure.
Pod Security Standards
The old Pod Security Policies (PSPs) are deprecated. Pod Security Standards (PSS) are the replacement:
- Privileged: Unrestricted policy for trusted workloads
- Baseline: Minimally restrictive, prevents known privilege escalations
- Restricted: Heavily restricted, follows security best practices
Apply at namespace level:
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
Network Policies Are Not Optional
By default, all pods can communicate with all other pods. This is unacceptable in production.
Implement network policies that:
- Default deny all ingress and egress
- Explicitly allow only required communication paths
- Separate concerns by namespace (frontend, backend, databases)
- Log denied connections for security monitoring
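As a concrete starting point, the default-deny baseline is a NetworkPolicy with an empty pod selector and no allow rules; additional policies then whitelist only the required paths. A minimal sketch (the namespace name is illustrative):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}          # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
  # no ingress or egress rules are listed, so all traffic is denied until
  # further policies explicitly allow specific paths

Once egress is denied by default, remember to add an explicit allow for DNS (CoreDNS/kube-dns), or in-cluster service discovery will break.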
RBAC Requires Careful Planning
Role-Based Access Control determines who can do what in your cluster:
- Principle of least privilege: Grant only the permissions needed
- Service accounts per workload: Don't share service accounts across deployments
- Audit regularly: Review who has cluster-admin and why
- Use namespaced roles where possible: ClusterRoles should be rare
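As a minimal sketch of a namespaced, least-privilege grant (names and the resource list are illustrative), a Role bound to a dedicated service account looks like this:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-reader
  namespace: production
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]    # read-only; no create/update/delete
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-deployment-reader
  namespace: production
subjects:
  - kind: ServiceAccount
    name: ci-pipeline                  # dedicated service account for this workload
    namespace: production
roleRef:
  kind: Role
  name: deployment-reader
  apiGroup: rbac.authorization.k8s.io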
Secrets Management Needs a Strategy
Kubernetes Secrets are only base64-encoded, not encrypted; unless you enable encryption at rest, they sit in etcd readable by anyone with etcd or sufficient API access. This is insufficient for sensitive data.
Options for proper secrets management:
- External Secrets Operator: Sync secrets from HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault
- Sealed Secrets: Encrypt secrets that can only be decrypted by the cluster
- SOPS: Encrypt secret files in Git with automatic decryption during deployment
We used External Secrets Operator with Azure Key Vault at Lloyds Banking Group, enabling centralized secret rotation and audit trails.
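As an illustration of that pattern, an ExternalSecret resource tells the operator which vault entry to sync into a native Secret. The sketch below assumes the External Secrets Operator is installed and a ClusterSecretStore named azure-key-vault has already been configured; all names are illustrative:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db-credentials
  namespace: production
spec:
  refreshInterval: 1h                # re-sync so rotations in Key Vault propagate
  secretStoreRef:
    name: azure-key-vault            # assumed pre-configured ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: payments-db-credentials    # Kubernetes Secret the operator creates and keeps in sync
  data:
    - secretKey: password            # key inside the generated Secret
      remoteRef:
        key: payments-db-password    # secret name in Azure Key Vault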
Observability Stack
Without observability, you're operating blind. Kubernetes generates massive amounts of data—the challenge is making it useful.
Metrics with Prometheus
Prometheus is the de facto standard for Kubernetes metrics:
- Cluster metrics: Node CPU, memory, disk, network
- Kubernetes metrics: Pod status, replica counts, resource usage
- Application metrics: Custom metrics exposed via /metrics endpoints
Key practices:
- Use ServiceMonitor resources for automatic scraping
- Implement alerting rules for critical conditions
- Retain metrics long enough for capacity planning
- Consider long-term storage with Thanos or Cortex
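For example, with the Prometheus Operator a ServiceMonitor declares what to scrape and a PrometheusRule carries alerting rules. The sketch below assumes the operator is installed and that the application's Service exposes a port named http-metrics; names and thresholds are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: checkout-api
  namespace: monitoring
spec:
  namespaceSelector:
    matchNames: ["production"]
  selector:
    matchLabels:
      app: checkout-api              # must match labels on the target Service
  endpoints:
    - port: http-metrics             # named port on the Service
      path: /metrics
      interval: 30s
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: checkout-api-alerts
  namespace: monitoring
spec:
  groups:
    - name: checkout-api
      rules:
        - alert: CheckoutHighErrorRate
          expr: sum(rate(http_requests_total{job="checkout-api", code=~"5.."}[5m])) > 1
          for: 10m
          labels:
            severity: critical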
Logging with Loki or Elasticsearch
Centralized logging is essential for debugging:
- Loki: Prometheus-native log aggregation, cost-effective, label-based queries
- Elasticsearch: Full-text search, more powerful queries, higher operational overhead
Regardless of choice:
- Include structured metadata (namespace, pod, container)
- Implement log retention policies
- Create dashboards for common debugging scenarios
Distributed Tracing
Traces show how requests flow through your services:
- Jaeger: Popular, Kubernetes-native, good Prometheus integration
- OpenTelemetry: Vendor-neutral standard, supports metrics, logs, and traces
Tracing reveals:
- Which services are slow
- Where errors originate
- How load distributes across services
- Dependency relationships between services
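A common pattern is to run the OpenTelemetry Collector as the single ingest point and forward traces to Jaeger over OTLP. A minimal collector pipeline sketch (the Jaeger endpoint and namespace are assumptions):

# OpenTelemetry Collector config: receive OTLP from instrumented services,
# batch, and forward to Jaeger's OTLP gRPC endpoint.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}
processors:
  batch: {}
exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc.cluster.local:4317   # assumed in-cluster Jaeger OTLP port
    tls:
      insecure: true               # acceptable inside the cluster; use TLS/mTLS across trust boundaries
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]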
Unified Observability
Connect your observability tools:
- Link from alerts to relevant dashboards
- Correlate logs with traces using trace IDs
- Enable drill-down from high-level metrics to specific pods
Resource Management
Kubernetes resource management is crucial for stability and cost optimization.
Requests and Limits
Every container should specify resource requests and limits:
- Requests: Guaranteed resources; the scheduler uses these for placement decisions
- Limits: Maximum resources; exceeding limits causes throttling (CPU) or OOMKill (memory)
Common mistakes:
- Not setting any limits (resource contention)
- Setting limits too high (wasted capacity)
- Setting requests equal to limits (no burst capacity)
Start with requests at expected usage and limits at 2x requests, then tune based on actual behavior.
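In a container spec, that starting point looks like the following sketch (image and values are illustrative; requests below limits gives the Burstable QoS class described later):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: api
          image: registry.example.com/checkout-api:1.4.2   # illustrative image
          resources:
            requests:
              cpu: 250m          # expected steady-state usage
              memory: 256Mi
            limits:
              cpu: 500m          # 2x requests leaves burst headroom
              memory: 512Mi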
Vertical Pod Autoscaler
VPA automatically adjusts resource requests based on actual usage:
Benefits:
- Right-sized resources without manual tuning
- Reduced waste from over-provisioning
- Better bin-packing efficiency
Limitations:
- Requires pod restarts to apply changes
- May conflict with Horizontal Pod Autoscaler
- Needs enough historical data to make good recommendations
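A common way to start is recommendation-only mode, applying the suggestions manually, which sidesteps the restart limitation. A sketch, assuming the VPA components are installed (the target name is illustrative):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  updatePolicy:
    updateMode: "Off"            # recommend only; "Auto" lets VPA evict and resize pods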
Horizontal Pod Autoscaler
HPA scales the number of pods based on metrics:
- CPU-based scaling: Simple but often insufficient
- Custom metrics: Scale on queue depth, request latency, or business metrics
- External metrics: Scale based on external systems (message queue length, database connections)
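A sketch of an HPA that combines CPU utilization with a custom per-pod metric (the custom metric assumes a metrics adapter such as the Prometheus Adapter is serving it; names and targets are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70           # scale out above 70% average CPU
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served by a custom metrics adapter
        target:
          type: AverageValue
          averageValue: "100"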
Quality of Service Classes
Kubernetes assigns QoS classes based on resource specifications:
- Guaranteed: Requests equal limits for all containers; highest priority, last to be evicted
- Burstable: Requests less than limits; medium priority
- BestEffort: No requests or limits; lowest priority, first to be evicted
Critical workloads should be Guaranteed; development workloads can be Burstable or BestEffort.
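For a Guaranteed workload, every container sets requests equal to limits; a minimal sketch (names and values are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: payments-worker
  namespace: production
spec:
  containers:
    - name: worker
      image: registry.example.com/payments-worker:2.1.0   # illustrative image
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "1"             # requests == limits for every resource gives Guaranteed QoS
          memory: 1Gi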
Networking Deep Dive
Kubernetes networking is notoriously complex.
CNI Selection
The Container Network Interface (CNI) determines how pods communicate:
- Calico: Feature-rich, strong network policy support, BGP routing
- Cilium: eBPF-based, excellent performance, advanced observability
- Flannel: Simple overlay, limited features, easy to understand
For enterprise deployments, Calico or Cilium provide the security features you'll need.
Service Mesh Considerations
Service meshes (Istio, Linkerd) add capabilities but also complexity:
When to use:
- You need mutual TLS between all services
- Complex traffic management (canary releases, traffic splitting)
- Advanced observability requirements
When to avoid:
- Simple architectures with few services
- Teams without dedicated platform engineers
- When the overhead isn't justified by benefits
We successfully operated at Vitrifi without a service mesh by using NATS for service communication and implementing application-level resilience patterns.
Disaster Recovery
Hope is not a strategy. Plan for failure and test your recovery procedures.
etcd Backups
etcd stores all cluster state. Losing etcd means losing your cluster.
Backup strategies:
- Automated snapshots with etcdctl
- Velero for cluster-wide backup including persistent volumes
- Off-cluster storage for backup resilience
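With Velero, scheduled backups are themselves Kubernetes resources. The sketch below assumes Velero is installed with an off-cluster object storage location already configured; the schedule and retention values are illustrative:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"          # daily at 02:00
  template:
    includedNamespaces:
      - "*"                      # back up every namespace
    snapshotVolumes: true        # include persistent volume snapshots
    ttl: 720h0m0s                # keep backups for 30 days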
Test restores regularly. Backups you haven't tested are backups you can't trust.
Multi-Cluster Strategies
For critical workloads, consider multi-cluster architectures:
- Active-Passive: Secondary cluster ready to take over on failure
- Active-Active: Workloads spread across clusters with load balancing
- Federation: Unified control plane across clusters (complex, maturing)
Tested Runbooks
Document procedures for common failure scenarios:
- Node failure and replacement
- etcd recovery from backup
- Network partition handling
- Storage failure recovery
Run game days to practice these procedures. The first time you execute a disaster recovery should not be during an actual disaster.
Key Takeaways
- Security is foundational: Retrofit security is expensive; build it in from the start
- Observability enables operations: You can't fix what you can't see
- Resource management affects stability: Uncontrolled resource usage causes cascading failures
- Networking complexity is real: Invest in understanding your CNI and network policies
- Disaster recovery requires practice: Untested recovery plans will fail when you need them
- Start simpler than you think: Not every organization needs a service mesh or multi-cluster federation