Fix: nginx Upstream Load Balancing Not Working — All Traffic Hitting One Server

Q: How do I fix "nginx Upstream Load Balancing Not Working — All Traffic Hitting One Server"?

How to fix nginx load balancing issues — upstream block configuration, health checks, least_conn vs round-robin, sticky sessions, upstream timeouts, and SSL termination.

The Problem

nginx is configured for load balancing but all requests go to the same upstream server:

upstream backend {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}

# Requests only reach 10.0.0.1 — other servers get no traffic

Or one server gets most of the traffic due to a misconfigured weight:

upstream backend {
    server 10.0.0.1:3000 weight=10;   # Gets 10x the traffic
    server 10.0.0.2:3000;             # Default weight=1
}

Or nginx marks a healthy upstream server as down and stops sending traffic to it:

# Error log shows:
# upstream timed out (110: Connection timed out) while reading response header
# from upstream, client: ..., upstream: "http://10.0.0.2:3000/"
# no live upstreams while connecting to upstream

Or all connections go to one server because sticky sessions are configured incorrectly.

Why This Happens

nginx’s upstream load balancing has several configuration pitfalls that cause uneven or absent traffic distribution. The root cause is usually one of three things: the balancing algorithm does not match the traffic pattern, the failure detection marks healthy servers as down, or the connection is never actually reaching the upstream block at all.

Default round-robin behavior is per-worker. nginx’s master process forks multiple worker processes, and each worker independently distributes requests across the upstream servers. With low traffic volume, a single worker handles most requests sequentially, making it appear that all traffic goes to one server. At higher volumes, distribution evens out. This is not a bug, but it surprises developers testing with curl in a loop.
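
One way to see which upstream answered each request while testing is to expose the address in a response header. This is a minimal sketch, assuming the backend upstream from the examples above; the X-Upstream header name is an arbitrary choice for illustration, and the header is meant for debugging rather than permanent use:

```nginx
# Inside the http block. Assumes the "backend" upstream shown earlier.
server {
    listen 80;

    location / {
        proxy_pass http://backend;

        # Debug only: report which upstream address served this response.
        # "X-Upstream" is a made-up header name, not an nginx convention.
        add_header X-Upstream $upstream_addr always;
    }
}
```

With this in place, each response carries the address of the server that produced it, so a burst of test requests can be tallied by header value instead of by guessing from application logs.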

ip_hash makes every request from the same client IP sticky. All requests from one IP always go to the same upstream server. When testing from a single machine or from behind a corporate NAT (where thousands of users share one IP), it looks like load balancing is completely broken. The same effect happens with hash $request_uri consistent when testing with the same URL repeatedly.
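
When many users share one IP, hashing on a per-session value instead of the client address can restore an even spread while keeping sessions sticky. The sketch below assumes the application sets a session cookie; the cookie name sessionid and the variable name $sticky_key are hypothetical:

```nginx
# Inside the http block. "sessionid" is a hypothetical cookie name.
map $cookie_sessionid $sticky_key {
    ""      $request_id;        # no cookie yet: spread requests pseudo-randomly
    default $cookie_sessionid;  # cookie present: pin the session to one server
}

upstream backend {
    hash $sticky_key consistent;
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}
```

The fallback to $request_id (available since nginx 1.11.0) is a design choice: without it, every request that lacks the cookie would hash to the same empty key and land on one server.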

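Aggressive failure detection. The defaults max_fails=1 and fail_timeout=10s mean that a single failed attempt takes a server out of rotation for 10 seconds. What counts as a failure is governed by proxy_next_upstream: connection errors and timeouts always count, while responses such as 502 or 503 count only when they are listed there. In environments where upstream servers briefly refuse connections or time out during deployments or GC pauses, this can remove healthy servers one after another until only one remains.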

Platform and Environment Differences

Load balancing behavior differs significantly between nginx editions, container orchestrators, and cloud load balancers. Understanding where nginx sits in the stack determines which features are available and which problems are caused by a layer above or below nginx.

nginx OSS vs nginx Plus. The open-source nginx includes round-robin, weighted round-robin, ip_hash, hash, least_conn, and random algorithms. Active health checks (probing upstream servers on a schedule, independent of client traffic) are only available in nginx Plus or via third-party modules like ngx_upstream_check_module. OSS nginx only has passive health checks — it marks a server as down after actual client requests fail. The zone directive for shared memory (allowing all workers to share upstream state, including failure counts) is available in OSS but the upstream status API (/api/) for monitoring is Plus-only. If you need active health checks without Plus, compile nginx with the ngx_upstream_check_module from Tengine or use a sidecar health checker.

Docker Swarm DNS round-robin. When nginx runs in Docker Swarm and upstream servers are Swarm services, Docker’s internal DNS resolves the service name to a virtual IP (VIP) that performs its own layer-4 round-robin. nginx sees a single IP address in the upstream block, so all its load-balancing logic is bypassed. You effectively get Docker’s L4 load balancing, not nginx’s L7 balancing. To use nginx’s load balancing, make the service name resolve to individual container IPs with the endpoint_mode: dnsrr setting in the Compose file. To pick up container IP changes without a reload, set resolver 127.0.0.11 valid=5s and reference the service through a variable in proxy_pass; a hostname written directly in a server directive is normally resolved only at startup or reload.
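
As a rough sketch of the runtime-resolution setup described above: the service name app and the variable name $app_upstream are assumptions for illustration, and the service is assumed to be deployed with endpoint_mode: dnsrr.

```nginx
# Inside the http block. "app" is a hypothetical Swarm service name
# deployed with endpoint_mode: dnsrr; port 3000 matches the examples above.
server {
    listen 80;

    # Docker's embedded DNS server; cache answers for at most 5 seconds.
    resolver 127.0.0.11 valid=5s;

    location / {
        # Using a variable makes nginx resolve the name at request time
        # instead of once at startup.
        set $app_upstream "http://app:3000";
        proxy_pass $app_upstream;
    }
}
```

Note the trade-off: this form bypasses the upstream block, so directives such as least_conn, max_fails, or keepalive do not apply, and nginx simply picks among the addresses the resolver returns.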

Kubernetes Service (iptables vs IPVS). In Kubernetes, a Service of type ClusterIP load-balances traffic before it reaches nginx. If nginx is the upstream (behind a Service), the Service’s kube-proxy distributes traffic using iptables rules (random selection) or IPVS (round-robin, least connections, or source hashing). If nginx is the load balancer (in front of application pods), you typically use a headless Service (clusterIP: None) so nginx receives individual pod IPs. Without a headless Service, nginx resolves the Service name to a single ClusterIP and all traffic goes through Kubernetes’ own load balancer — nginx’s upstream block has one entry and cannot distribute traffic itself. Use resolver kube-dns.kube-system.svc.cluster.local valid=5s in nginx to re-resolve pod IPs as pods scale up or down.
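
On recent nginx versions the headless-Service approach can keep a real upstream block. The sketch below assumes nginx 1.27.3 or later (earlier open-source releases do not accept the resolve parameter, which was previously Plus-only); the Service name app-headless.default.svc.cluster.local is hypothetical.

```nginx
# Inside the http block. Requires nginx 1.27.3+ (or nginx Plus) for "resolve".
# "app-headless.default.svc.cluster.local" is a hypothetical headless Service.
resolver kube-dns.kube-system.svc.cluster.local valid=5s;

upstream backend {
    zone backend_zone 64k;   # "resolve" needs the group in shared memory
    least_conn;
    server app-headless.default.svc.cluster.local:3000 resolve;
}

server {
    listen 80;

    location / {
        proxy_pass http://backend;
    }
}
```

Because the headless Service returns one A record per pod, nginx ends up with one upstream peer per pod and can apply least_conn across them as pods come and go.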

Envoy and xDS. In service mesh environments built on Envoy (such as Istio), Envoy typically sits as a sidecar proxy and handles load balancing before traffic reaches nginx. Other meshes behave similarly with their own sidecars (Linkerd, for example, uses linkerd2-proxy rather than Envoy). If you see unexpected routing behavior, check whether the sidecar is intercepting traffic on the same port. Envoy uses xDS (discovery service) for dynamic upstream configuration, while nginx requires a reload or Plus API call to update upstreams. In an Istio mesh, adding nginx as a load balancer in front of application pods creates a double-proxy situation (Envoy -> nginx -> Envoy -> app) that adds latency and complicates debugging.

Behind AWS ALB/NLB. When nginx sits behind an AWS Application Load Balancer (ALB), the ALB has already distributed traffic across nginx instances. If each nginx instance has its own upstream block pointing to the same set of backend servers, the combination of ALB and nginx can cause uneven distribution. ALB uses round-robin across target group members, and each nginx instance independently round-robins across upstreams. With 3 ALB targets and 3 upstreams, some backend servers can receive disproportionate traffic depending on ALB connection reuse. AWS NLB (layer 4) passes TCP connections directly to nginx without HTTP-level balancing, which preserves nginx’s upstream distribution. Whether nginx sees the real client IP depends on the NLB target configuration; when it does not, enable Proxy Protocol v2 on the target group and add proxy_protocol to the nginx listen directive.
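
A minimal sketch of the proxy_protocol setup for the NLB case, assuming Proxy Protocol v2 has been enabled on the target group and that the backend upstream from the earlier examples exists:

```nginx
# Inside the http block. Assumes Proxy Protocol v2 is enabled on the
# NLB target group; without it, requests on this listener will fail.
server {
    listen 80 proxy_protocol;

    location / {
        proxy_pass http://backend;
        proxy_set_header Host $host;

        # $proxy_protocol_addr holds the client address sent by the NLB.
        proxy_set_header X-Real-IP $proxy_protocol_addr;
        proxy_set_header X-Forwarded-For $proxy_protocol_addr;
    }
}
```

Once proxy_protocol is on the listen directive, every connection to that port must start with a PROXY header, so health checks and direct test requests that skip the NLB need a separate listener.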

Fix 1: Verify the upstream Block Configuration

The upstream block must be at the http level:

# nginx.conf — CORRECT structure
http {
    # upstream must be in http block
    upstream backend {
        server 10.0.0.1:3000;
        server 10.0.0.2:3000;
        server 10.0.0.3:3000;
    }

    server {
        listen 80;
        server_name example.com;

        location / {
            proxy_pass http://backend;   # Use upstream name
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}

# WRONG: upstream inside a server block (nginx refuses to load this)
server {
    upstream backend { ... }   # nginx -t fails: "upstream" directive is not allowed here
}

Verify nginx config is valid and loaded:

# Test configuration
nginx -t

# Reload without downtime
nginx -s reload

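# Dump the full parsed configuration from disk, including all include files
# (this reflects the files, not necessarily what running workers have loaded)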
nginx -T | grep -E "upstream|server"

# View current nginx version and compiled modules
nginx -V

Fix 2: Choose the Right Load Balancing Algorithm

nginx supports several algorithms:

# Round-robin (default) — requests distributed sequentially
upstream backend {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
    # Request 1 → server 1, Request 2 → server 2, Request 3 → server 3, repeat
}

# Weighted round-robin — proportional distribution
upstream backend {
    server 10.0.0.1:3000 weight=3;   # Gets 3/5 of traffic
    server 10.0.0.2:3000 weight=2;   # Gets 2/5 of traffic
    # Use when servers have different capacities
}

# least_conn — send to server with fewest active connections
# Better for requests with varying processing times (API calls, DB queries)
upstream backend {
    least_conn;
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}

# ip_hash — sticky sessions: same client always goes to same server
# WARNING: This causes apparent "load imbalance" when testing from one IP
upstream backend {
    ip_hash;
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
}

# hash — route by custom key (URL, cookie, header)
upstream backend {
    hash $request_uri consistent;   # Same URL always goes to same server
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
}

# random — pick randomly (nginx 1.15.1+)
upstream backend {
    random two least_conn;   # Pick 2 random servers, forward to one with least connections
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;
}

Fix 3: Configure Health Checks and Failure Handling

Control how nginx handles failed upstream servers:

upstream backend {
    server 10.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:3000 backup;   # Only used when all others are down
}

# Parameters:
# max_fails=3      - mark server unavailable after 3 failed attempts within fail_timeout (default: 1)
# fail_timeout=30s - window for counting failures, and how long the server then stays out of rotation (default: 10s)
# backup           - only receives traffic when all primary servers are unavailable
# down             - permanently marks server as down (manual removal)
# weight=N         - relative weight for round-robin (default: 1)

Active health checks (nginx Plus or ngx_upstream_check_module):

# nginx Open Source — passive health checks only (based on failed requests)
# nginx Plus — active health checks available

# For open source nginx with upstream_check module:
upstream backend {
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;

    check interval=3000 rise=2 fall=3 timeout=1000 type=http;
    check_http_send "GET /health HTTP/1.0\r\n\r\n";
    check_http_expect_alive http_2xx http_3xx;
}

Configure proxy timeouts to match your upstream:

location / {
    proxy_pass http://backend;

    # How long to wait for upstream to accept the connection
    proxy_connect_timeout 5s;

    # Max time between two successive reads from upstream (not the total response time)
    proxy_read_timeout 60s;

    # Max time between two successive writes to upstream (not the total upload time)
    proxy_send_timeout 60s;

    # Retry on failure — try next upstream server
    proxy_next_upstream error timeout http_500 http_502 http_503;
    proxy_next_upstream_tries 3;      # Max attempts, including the first one
    proxy_next_upstream_timeout 10s;  # Total time limit for retries
}

Fix 4: Debug Traffic Distribution

Verify traffic is actually being distributed:

# Check nginx access log — look at upstream addresses
tail -f /var/log/nginx/access.log

# Add upstream address to access log format
# nginx.conf:
log_format upstream_log '$remote_addr - $upstream_addr - $request - $status - $upstream_response_time';
access_log /var/log/nginx/access.log upstream_log;

# Now each log line shows which upstream server handled the request:
# 203.0.113.1 - 10.0.0.2:3000 - GET /api/data HTTP/1.1 - 200 - 0.045
# 203.0.113.1 - 10.0.0.1:3000 - GET /api/data HTTP/1.1 - 200 - 0.032

# Count requests per upstream server
grep "10.0.0" /var/log/nginx/access.log | \
  grep -oP '\d+\.\d+\.\d+\.\d+:\d+' | \
  sort | uniq -c | sort -rn

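Check overall connection activity (stub_status reports server-wide counters only; per-upstream state needs the nginx Plus API or a third-party module):

# Enable the stub_status module for basic connection counters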
server {
    listen 8080;
    server_name localhost;
    location /nginx_status {
        stub_status;
        allow 127.0.0.1;
        deny all;
    }
}

curl http://localhost:8080/nginx_status
# Active connections: 15
# server accepts handled requests
#  1234 1234 5678
# Reading: 0 Writing: 3 Waiting: 12

Fix 5: Handle WebSocket Load Balancing

WebSocket connections require special upstream configuration:

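upstream websocket_backend {
    # Optional. An established WebSocket stays on one upstream for its whole
    # lifetime; ip_hash only matters when reconnects or HTTP fallback requests
    # must reach the same server.
    ip_hash;
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    # Idle connection pool for ordinary HTTP requests; upgraded WebSocket
    # connections are not returned to it
    keepalive 64;
}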

server {
    listen 80;

    location /ws/ {
        proxy_pass http://websocket_backend;
        proxy_http_version 1.1;

        # Required for WebSocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;

        # WebSocket connections are long-lived — extend timeout
        proxy_read_timeout 3600s;   # 1 hour
        proxy_send_timeout 3600s;
    }
}
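
The hard-coded Connection "upgrade" header above is sent even for requests that are not WebSocket handshakes. A common alternative, sketched here, derives the header from the client request with a map; $connection_upgrade is a variable defined by this map, not an nginx built-in:

```nginx
# Inside the http block. $connection_upgrade is defined here,
# it is not a built-in nginx variable.
map $http_upgrade $connection_upgrade {
    default upgrade;   # client asked for an upgrade: pass it on
    ""      close;     # ordinary request: no upgrade header
}

server {
    listen 80;

    location /ws/ {
        proxy_pass http://websocket_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection $connection_upgrade;
        proxy_set_header Host $host;
    }
}
```

This keeps one location usable for both WebSocket handshakes and plain HTTP requests that happen to share the /ws/ prefix.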

Fix 6: SSL Termination at nginx

Terminate SSL at nginx and load balance over plain HTTP to upstream:

upstream backend {
    least_conn;
    server 10.0.0.1:3000;
    server 10.0.0.2:3000;
    server 10.0.0.3:3000;

    keepalive 32;   # Reuse connections — reduces overhead
}

server {
    listen 443 ssl;
    server_name api.example.com;

    ssl_certificate /etc/nginx/ssl/cert.pem;
    ssl_certificate_key /etc/nginx/ssl/key.pem;
    ssl_protocols TLSv1.2 TLSv1.3;

    location / {
        proxy_pass http://backend;   # Plain HTTP to upstream (SSL terminated)
        proxy_http_version 1.1;
        proxy_set_header Connection "";   # Required for keepalive
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;   # Tell upstream it was HTTPS
    }
}

# Redirect HTTP to HTTPS
server {
    listen 80;
    server_name api.example.com;
    return 301 https://$host$request_uri;
}

Fix 7: Rate Limiting in Front of Upstreams

Apply rate limits before traffic reaches upstreams:

http {
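    # Define rate limit zone. 1 MB holds about 16,000 64-byte states or about
    # 8,000 128-byte states (64-bit platforms), so 10m tracks roughly 80,000-160,000 IPs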
    limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/m;

    upstream backend {
        least_conn;
        server 10.0.0.1:3000;
        server 10.0.0.2:3000;
    }

    server {
        listen 80;

        location /api/ {
            # Apply rate limit
            limit_req zone=api_limit burst=20 nodelay;
            limit_req_status 429;

            proxy_pass http://backend;
            proxy_http_version 1.1;
        }

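        # Smaller burst for auth endpoints. This reuses api_limit, so the
        # rate is still 100r/m; a lower rate needs its own limit_req_zone.
        location /api/auth/ {
            limit_req zone=api_limit burst=5;
            limit_req_status 429;
            proxy_pass http://backend;
        }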
    }
}
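
Because a limit_req zone carries exactly one rate, a genuinely lower rate for the auth endpoints needs a second zone. A sketch under assumed values: the auth_limit zone name and the 10r/m rate are illustrative, not recommendations.

```nginx
# Inside the http block. Zone names and rates are illustrative assumptions.
limit_req_zone $binary_remote_addr zone=api_limit:10m  rate=100r/m;
limit_req_zone $binary_remote_addr zone=auth_limit:10m rate=10r/m;

server {
    listen 80;
    limit_req_status 429;   # inherited by both locations below

    location /api/ {
        limit_req zone=api_limit burst=20 nodelay;
        proxy_pass http://backend;
    }

    location /api/auth/ {
        limit_req zone=auth_limit burst=5;
        proxy_pass http://backend;
    }
}
```

Requests to /api/auth/ match the longer prefix, so they are counted only against auth_limit and not against api_limit.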

Still Not Working?

upstream DNS resolution — nginx resolves upstream hostnames at startup. If you use container names or dynamic hosts, set resolver and use variables in proxy_pass for runtime DNS resolution:

resolver 127.0.0.53 valid=30s;

location / {
    set $upstream "http://backend-service:3000";
    proxy_pass $upstream;   # Resolved at request time; answers are cached for 30s (valid=30s)
}

proxy_cache serving stale responses — if proxy caching is enabled, all clients may get the same cached response regardless of which upstream server processed it. This isn’t a load balancing issue but looks like one.
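
To tell cached responses apart from proxied ones, the cache status can be exposed per response. This fragment assumes proxy_cache is already configured for the location; the X-Cache-Status header name is a common convention rather than anything nginx requires.

```nginx
# Inside a server block that already has proxy_cache configured.
location / {
    proxy_pass http://backend;

    # Reports values such as HIT, MISS, EXPIRED, or BYPASS.
    add_header X-Cache-Status $upstream_cache_status always;
}
```

A response marked HIT never reached an upstream server, so it should be excluded when judging how evenly traffic is distributed.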

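Keepalive connections and worker processes: nginx uses multiple worker processes, and the keepalive connection cache is per-worker, not global. The keepalive value is the maximum number of idle upstream connections each worker keeps open, so with worker_processes 4 and keepalive 16 there can be up to 64 idle keepalive connections in total. It does not cap the total number of upstream connections.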

zone directive missing for shared failure state — without zone, each nginx worker process tracks upstream failure counts independently. One worker may mark a server as down while other workers continue sending traffic to it. Add zone backend_zone 64k; to the upstream block to share state across all workers.
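
A minimal sketch of an upstream block with shared state, reusing the server addresses from the earlier examples; the zone name backend_zone and the 64k size follow the text above:

```nginx
upstream backend {
    zone backend_zone 64k;   # shared memory: all workers see the same peer state
    server 10.0.0.1:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:3000 max_fails=3 fail_timeout=30s;
    server 10.0.0.3:3000 max_fails=3 fail_timeout=30s;
}
```

With the zone in place, a failure observed by one worker counts toward max_fails for every worker, so a server is taken out of rotation consistently.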

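gRPC load balancing not distributing: gRPC uses HTTP/2, which multiplexes many requests over a single long-lived TCP connection. If nginx forwards that traffic at layer 4 (the stream module), it balances connections, not requests, so a single gRPC connection stays on one upstream for its entire lifetime. Proxy gRPC at layer 7 with grpc_pass instead, so that each call is balanced as a separate request, and prefer least_conn because long-lived streaming calls still stay on one server until they finish. A low keepalive_requests value (e.g., 100) makes nginx close client connections after that many requests, which helps mainly when a connection-level balancer sits in front of several nginx instances.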
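
A minimal sketch of gRPC proxying with grpc_pass and least_conn. The upstream name grpc_backend, port 50051, and the server addresses are assumptions for illustration, and the listener accepts cleartext HTTP/2 only:

```nginx
# Inside the http block. Upstream name, port 50051, and server addresses
# are assumptions for illustration.
upstream grpc_backend {
    least_conn;
    server 10.0.0.1:50051;
    server 10.0.0.2:50051;
}

server {
    # On nginx 1.25.1+, "listen 50051; http2 on;" is the preferred spelling.
    listen 50051 http2;

    location / {
        grpc_pass grpc://grpc_backend;
    }
}
```

For TLS between nginx and the backends, the scheme in grpc_pass would be grpcs:// instead of grpc://.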

Upstream servers behind a firewall dropping health probes — if nginx and upstream servers are in different network segments, firewall rules or security groups may silently drop nginx’s connection attempts after a timeout. nginx interprets the timeout as a failure and marks the server as down. Check firewall rules and ensure the upstream port is open from nginx’s network.