Fix: Docker HEALTHCHECK Failing — Container Marked Unhealthy Despite Running

How to fix Docker HEALTHCHECK failures — command syntax, curl vs wget availability, start period, interval tuning, health check in docker-compose, and debugging unhealthy containers.

The Problem

A Docker container starts and runs correctly, but its status shows unhealthy:

docker ps

CONTAINER ID   IMAGE      STATUS
a1b2c3d4e5f6   myapp     Up 2 minutes (unhealthy)
# Container is running, but health check is failing

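Or the container keeps restarting because an orchestrator replaces unhealthy containers. Docker Swarm does this based on HEALTHCHECK; Kubernetes ignores the Dockerfile HEALTHCHECK and applies its own liveness probes, which fail in the same ways: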

docker inspect myapp --format='{{.State.Health.Status}}'
# unhealthy

docker inspect myapp --format='{{json .State.Health.Log}}'
# [{"Start":"...","End":"...","ExitCode":1,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused"}]

Or the start_period isn’t long enough, causing the container to be marked unhealthy before the application finishes starting:

# Container starts, health check runs immediately, app not ready yet
# Health check fails 3 times → container marked unhealthy
# But app would have been healthy 10 seconds later

In orchestrated environments, this triggers a restart loop: the container starts, fails its health check before the app is ready, gets killed, a new container is scheduled, and the cycle repeats indefinitely.

Why This Happens

Docker’s HEALTHCHECK instruction runs a command inside the container on a schedule. The container is marked unhealthy after a specified number of consecutive failures. Common causes:

curl or wget not in the image: a typical health check command is curl http://localhost:8080/health, but minimal images often lack curl (Alpine ships only BusyBox wget; distroless ships neither).
Wrong port or path — the health check targets a port or path that doesn’t exist or isn’t reachable from inside the container.
Loopback binding: an app listening on 0.0.0.0 is reachable at 127.0.0.1, but an app bound to one specific interface address is not, so a check against localhost is refused.
start_period too short: the default start period is 0s, so failures count from the first check. Slow-starting applications (JVM, large Node.js apps) aren’t ready that early.
Exit code not zero — the health check command must exit 0 to be healthy. If the HTTP request succeeds but the command returns a non-zero exit code for other reasons, Docker marks it as failed.
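Shell unavailable: the exec form, HEALTHCHECK CMD ["curl", "-f", "..."], runs the command directly without a shell, so shell features (&&, ||, pipes) don’t work. The shell form, HEALTHCHECK CMD curl ... || exit 1, runs through /bin/sh -c, which fails on images that have no shell.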

The blast radius depends on the deployment context. In a standalone docker run scenario, you see (unhealthy) in docker ps and nothing else happens: the container keeps running. In Docker Compose with depends_on and condition: service_healthy, downstream services never start. In Swarm, the orchestrator actively kills and replaces unhealthy containers, and Kubernetes does the same when its own liveness probe fails, which means a misconfigured health check can take down an otherwise functional service. The worst case is a restart loop where the container never finishes its startup sequence before the health check declares it dead, cycling indefinitely and producing zero uptime.

Fix 1: Install curl or Use Alternatives

Minimal images often lack curl. Either install it or use an alternative:

# Alpine — install curl (adds ~1MB)
FROM node:20-alpine

RUN apk add --no-cache curl

HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

# Alternatively — use wget (included in busybox/Alpine)
HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1

# Or use nc (netcat) to check if port is open (no HTTP check)
HEALTHCHECK CMD nc -z localhost 3000 || exit 1

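Distroless images have no shell and no curl, so use a compiled health check binary in exec form:

# Stage 1: build a static probe binary in a multi-stage build.
# grpc-health-probe speaks the gRPC health protocol, so this assumes the app
# exposes a gRPC health service; for a plain HTTP endpoint, substitute your own probe.
# Pin a release instead of @latest in real builds.
FROM golang:1.22-alpine AS healthcheck-builder
RUN CGO_ENABLED=0 GOBIN=/out go install github.com/grpc-ecosystem/grpc-health-probe@latest

# Stage 2: distroless runtime plus the probe binary
FROM gcr.io/distroless/nodejs20-debian12
COPY --from=healthcheck-builder /out/grpc-health-probe /healthcheck

# Exec form is required here: there is no /bin/sh to run the shell form
HEALTHCHECK --interval=30s --timeout=10s \
  CMD ["/healthcheck", "-addr=:3000"]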

Node.js — use a JavaScript health check script:

FROM node:20-alpine

# health-check.js included in the image
COPY health-check.js .

HEALTHCHECK --interval=30s --timeout=10s --start-period=30s \
  CMD node health-check.js

// health-check.js
const http = require('http');

const options = {
  host: 'localhost',
  port: process.env.PORT || 3000,
  path: '/health',
  timeout: 5000,
};

const req = http.request(options, (res) => {
  process.exit(res.statusCode === 200 ? 0 : 1);
});

req.on('error', () => process.exit(1));
// req.abort() is deprecated; destroy() closes the socket before exiting
req.on('timeout', () => { req.destroy(); process.exit(1); });
req.end();

Fix 2: Set Correct Timing Parameters

The default HEALTHCHECK timing often marks healthy but slow-starting applications as unhealthy:

# Default values (if not specified):
# --interval=30s     Check every 30 seconds
# --timeout=30s      Fail if check takes longer than 30 seconds
# --start-period=0s  Start counting failures immediately
# --retries=3        Mark unhealthy after 3 consecutive failures

# Tuned for a typical web application.
# Dockerfile comments cannot follow a trailing backslash, so they go above:
#   interval 10s      check frequently during development
#   timeout 5s        fail fast if unresponsive
#   start-period 60s  ignore failures for the first 60s (startup time)
#   retries 5         5 consecutive failures before marking unhealthy
HEALTHCHECK \
  --interval=10s \
  --timeout=5s \
  --start-period=60s \
  --retries=5 \
  CMD curl -f http://localhost:3000/health || exit 1

# For a JVM/Spring Boot application (slow startup; JVM apps often take 30-90s to start):
HEALTHCHECK \
  --interval=20s \
  --timeout=10s \
  --start-period=120s \
  --retries=5 \
  CMD curl -f http://localhost:8080/actuator/health || exit 1

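start-period explained:

The Dockerfile flag is --start-period; the Compose key is start_period. During this window, failed checks don’t count toward retries, because the container is still considered to be starting. A check that succeeds inside the window marks the container healthy right away, and every failure after that counts. Once the window elapses, all failures count. This prevents a false unhealthy status during legitimate startup.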
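The counting rules can be easier to follow as a small simulation. The sketch below is a simplified model for illustration, not Docker's actual implementation; the function name simulateHealth and the shape of the checks array are hypothetical.

```javascript
// Simplified model of the health status rules described above.
// `checks` is a hypothetical list of probe results:
//   { at: secondsSinceContainerStart, ok: boolean }
function simulateHealth(checks, { startPeriod = 0, retries = 3 } = {}) {
  let status = 'starting';
  let failingStreak = 0;
  let started = false; // true once any check has succeeded

  for (const { at, ok } of checks) {
    if (ok) {
      started = true;
      failingStreak = 0;
      status = 'healthy';
      continue;
    }
    // Failures inside the start period are ignored until the first success.
    if (!started && at < startPeriod) continue;
    failingStreak += 1;
    if (failingStreak >= retries) status = 'unhealthy';
  }
  return { status, failingStreak };
}
```

With four failed checks at 10, 20, 30, and 40 seconds, the model reports unhealthy when startPeriod is 0 and still reports starting when startPeriod is 60.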

# Check the health status timeline
# docker inspect prints a JSON array, so index the first element
docker inspect myapp | jq '.[0].State.Health.Log[-5:]'
# Look at Start timestamps to see when checks began
# Compare against the container's start time in .State.StartedAt

Pro Tip: In production, set start_period to at least 1.5x your application’s measured cold-start time. If your app takes 40 seconds to start under load, set start_period to 60 seconds. Measure cold-start time under realistic conditions (database migrations, cache warming, connection pool initialization) rather than local dev, where startup is typically much faster.
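As a rough sketch of that arithmetic, the helper below takes several measured cold-start times and applies the 1.5x factor to the slowest one. The function name and the rounding to 5-second steps are assumptions made for illustration.

```javascript
// Suggest a start period (seconds) from measured cold-start times (seconds).
// The 1.5x factor follows the tip above; rounding up to a multiple of
// `roundTo` is an arbitrary convenience.
function recommendStartPeriod(coldStartSeconds, factor = 1.5, roundTo = 5) {
  if (coldStartSeconds.length === 0) {
    throw new Error('need at least one measurement');
  }
  const slowest = Math.max(...coldStartSeconds);
  return Math.ceil((slowest * factor) / roundTo) * roundTo;
}
```

For measurements of 32, 40, and 37 seconds the slowest run is 40 seconds, so the suggestion is 60 seconds.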

Fix 3: Fix Health Check Command Syntax

The CMD instruction in HEALTHCHECK has two forms with different behavior:

# Exec form (JSON array): runs directly, no shell, shell features not available
HEALTHCHECK CMD ["curl", "-f", "http://localhost:3000/health"]
# Equivalent to: docker exec container curl -f http://localhost:3000/health

# Shell form (plain string): runs through /bin/sh -c
HEALTHCHECK CMD curl -f http://localhost:3000/health || exit 1
# Equivalent to: docker exec container /bin/sh -c "curl -f ... || exit 1"
# The || exit 1 requires a shell, so it only works in shell form

# Exec form with an explicit shell (same effect as the shell form)
HEALTHCHECK CMD ["/bin/sh", "-c", "curl -f http://localhost:3000/health || exit 1"]
# Note: ["CMD-SHELL", "..."] is docker-compose syntax for the test: key.
# In a Dockerfile it would try to execute a program named CMD-SHELL.

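curl -f flag: -f (--fail) makes curl return exit code 22 for HTTP error responses (status 400 and above). Without it, curl exits 0 even on 404 or 500 responses. Docker documents only exit codes 0 (healthy) and 1 (unhealthy), with 2 reserved, so appending || exit 1, as the examples in Fix 1 do, normalizes curl’s code 22: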

# WRONG — exits 0 even on 500 Internal Server Error
HEALTHCHECK CMD curl http://localhost:3000/health

# CORRECT — exits non-zero on 4xx/5xx HTTP responses
HEALTHCHECK CMD curl -f http://localhost:3000/health
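The mapping from HTTP status to health result can be modeled in a few lines. This is a sketch of the behavior described above, not curl's or Docker's code; it assumes the connection itself succeeded, and both function names are hypothetical.

```javascript
// Model of curl's exit status for a completed HTTP request.
// With -f/--fail, responses with status >= 400 produce exit code 22;
// without it, curl exits 0 regardless of the HTTP status.
function curlExitCode(httpStatus, { fail = false } = {}) {
  return fail && httpStatus >= 400 ? 22 : 0;
}

// Equivalent of appending `|| exit 1` in shell form:
// any non-zero command status becomes Docker's "unhealthy" code 1.
function healthcheckExitCode(commandExitCode) {
  return commandExitCode === 0 ? 0 : 1;
}
```

Under this model a 500 response without -f yields 0 (reported healthy), while the same response with -f yields 22, which || exit 1 turns into 1.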

Test the health check command manually:

# Run the exact command inside the container
docker exec myapp curl -f http://localhost:3000/health
echo "Exit code: $?"  # Should be 0 for healthy

# Run as root to rule out permission issues
docker exec -u root myapp curl -f http://localhost:3000/health

# Check if curl is available
docker exec myapp which curl || echo "curl not found"

Fix 4: Configure Health Checks in docker-compose

docker-compose.yml supports overriding or adding health checks:

# docker-compose.yml
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s

  # Service that waits for api to be healthy
  worker:
    image: myworker:latest
    depends_on:
      api:
        condition: service_healthy  # Wait for api to pass health check
    # Without this, worker starts immediately — api may not be ready

  # Database with built-in health check
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      retries: 3

Disable health check for a service (override Dockerfile’s HEALTHCHECK):

services:
  myservice:
    image: myapp:latest
    healthcheck:
      disable: true   # Ignore the HEALTHCHECK from the Dockerfile
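The test key accepts a plain string or a list whose first item is NONE, CMD, or CMD-SHELL. The sketch below shows how each variant resolves to the command that runs inside the container. It is an illustration of the rules, not Compose's own code; resolveTest is a hypothetical name, and /bin/sh -c assumes a Linux container with the default shell.

```javascript
// Resolve a Compose `healthcheck.test` value into the argv that gets executed.
// Returns null when the check is disabled with ["NONE"].
function resolveTest(test) {
  // A plain string behaves like CMD-SHELL: it runs through the shell.
  if (typeof test === 'string') return ['/bin/sh', '-c', test];
  if (!Array.isArray(test) || test.length === 0) {
    throw new TypeError('test must be a string or a non-empty list');
  }
  const [kind, ...args] = test;
  switch (kind) {
    case 'NONE':
      return null;
    case 'CMD':
      return args; // exec form: no shell involved
    case 'CMD-SHELL':
      return ['/bin/sh', '-c', args[0]]; // single command string
    default:
      throw new Error(`first item must be NONE, CMD or CMD-SHELL, got ${kind}`);
  }
}
```

This is why ["CMD", "redis-cli", "ping"] works on an image without a shell, while the pg_isready example using CMD-SHELL needs /bin/sh to exist.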

Fix 5: Implement a Proper Health Endpoint

The health check’s value depends on what /health actually checks. A proper health endpoint verifies that the application can serve requests:

// Express.js — comprehensive health endpoint
app.get('/health', async (req, res) => {
  const health = {
    status: 'ok',
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  };

  // Check database connection
  try {
    await db.query('SELECT 1');
    health.database = 'ok';
  } catch (err) {
    health.database = 'error';
    health.status = 'degraded';
  }

  // Check Redis connection
  try {
    await redis.ping();
    health.cache = 'ok';
  } catch (err) {
    health.cache = 'error';
    health.status = 'degraded';
  }

  const statusCode = health.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(health);
});
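Because Docker fails any check that runs longer than --timeout, a dependency probe that hangs makes the whole endpoint fail even when the other checks pass. One possible way to bound each probe is sketched below; withTimeout, runChecks, and the shape of the checks object are hypothetical helpers, not part of Express or Docker.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'check') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so it never keeps the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Run named checks in parallel and fold the results into one report.
// `checks` maps a name to a function returning a promise.
async function runChecks(checks, ms = 2000) {
  const names = Object.keys(checks);
  const results = await Promise.allSettled(
    names.map((name) => withTimeout(checks[name](), ms, name))
  );
  const report = { status: 'ok' };
  results.forEach((result, i) => {
    report[names[i]] = result.status === 'fulfilled' ? 'ok' : 'error';
    if (result.status === 'rejected') report.status = 'degraded';
  });
  return report;
}
```

Inside the route handler this could be called as runChecks({ database: () => db.query('SELECT 1'), cache: () => redis.ping() }), keeping the per-probe limit below the --timeout value.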

Separate liveness vs readiness (Kubernetes pattern, useful in Docker too):

// Liveness — "is the process alive?" (simple, rarely fails)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});

// Readiness — "can it serve traffic?" (checks dependencies)
app.get('/health/ready', async (req, res) => {
  try {
    await Promise.all([
      db.query('SELECT 1'),
      redis.ping(),
    ]);
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});

# Use the liveness check for Docker HEALTHCHECK (avoid killing due to DB outage)
HEALTHCHECK CMD curl -f http://localhost:3000/health/live || exit 1

Common Mistake: A health endpoint that checks database connectivity sounds correct, but it means a temporary database outage causes container restarts. Those restarts don’t fix the database and add connection pressure to an already-overloaded DB. Use a liveness check (process is alive) for HEALTHCHECK, and a readiness check (dependencies available) for load-balancer routing decisions.

Fix 6: Debug an Unhealthy Container

When a container is unhealthy, inspect the recent health check history:

# View last 5 health check results
docker inspect myapp --format='{{json .State.Health.Log}}' | \
  python3 -m json.tool | head -50

# Each log entry contains:
# Start: when the check began
# End: when it finished
# ExitCode: 0=healthy, non-zero=unhealthy
# Output: stdout/stderr from the check command

# Watch health status in real time
watch -n 2 'docker inspect myapp --format="Status: {{.State.Health.Status}}"'

# Get full container inspect output
# (docker inspect prints a JSON array, so index the first element)
docker inspect myapp | jq '.[0].State.Health'
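Where jq is not available, the same information can be pulled out with a short Node.js script. The sketch below assumes the JSON text printed by docker inspect for a single container; summarizeHealth is a hypothetical helper name.

```javascript
// Summarize the Health section of `docker inspect <container>` output.
// `inspectJson` is the raw JSON text; docker inspect prints an array.
function summarizeHealth(inspectJson) {
  const [container] = JSON.parse(inspectJson);
  const health = container?.State?.Health;
  // Containers without a HEALTHCHECK have no Health section.
  if (!health) return { status: 'none', failingStreak: 0, lastFailure: null };

  const failures = (health.Log || []).filter((entry) => entry.ExitCode !== 0);
  const last = failures[failures.length - 1];
  return {
    status: health.Status,
    failingStreak: health.FailingStreak,
    lastFailure: last
      ? { exitCode: last.ExitCode, output: (last.Output || '').trim() }
      : null,
  };
}
```

To try it, pipe docker inspect myapp into a script that reads standard input with fs.readFileSync(0, 'utf8') and passes the text to summarizeHealth.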

Common output messages and their meanings:

"curl: (7) Failed to connect to localhost port 3000: Connection refused"
→ App not listening on port 3000 yet, or crashed
→ Fix: Increase start-period or check app startup

"curl: (22) The requested URL returned error: 500"
→ App is running but returning 500 error
→ Fix: Debug the health endpoint — check app logs

"curl: (6) Could not resolve host: localhost"
→ Unusual — network configuration issue
→ Fix: Use 127.0.0.1 instead of localhost

"OCI runtime exec failed: exec: 'curl': executable file not found"
→ curl not in the image
→ Fix: Install curl or use alternative
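The list above can be turned into a small lookup for scripts that triage health check logs. This is only a sketch covering the four messages listed; the patterns, names, and wording are illustrative assumptions.

```javascript
// Map health check output to a likely cause, following the list above.
const KNOWN_FAILURES = [
  {
    pattern: /curl: \(7\)/,
    cause: 'app not listening yet, or crashed',
    fix: 'increase start-period or check app startup',
  },
  {
    pattern: /curl: \(22\)/,
    cause: 'health endpoint returned an HTTP error',
    fix: 'debug the health endpoint and check app logs',
  },
  {
    pattern: /curl: \(6\)/,
    cause: 'hostname did not resolve',
    fix: 'use 127.0.0.1 instead of localhost',
  },
  {
    pattern: /executable file not found/,
    cause: 'command missing from the image',
    fix: 'install it or use an alternative',
  },
];

// Returns { cause, fix } for a recognized message, or null otherwise.
function diagnose(output) {
  const match = KNOWN_FAILURES.find(({ pattern }) => pattern.test(output));
  return match ? { cause: match.cause, fix: match.fix } : null;
}
```

Feeding it the Output field of the most recent failed log entry gives a first hint before digging into application logs.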

Fix 7: Use Health Checks in Production Orchestration

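In production, the orchestrator acts on health status to recover automatically. Docker Swarm uses the container’s HEALTHCHECK; Kubernetes uses the probes defined in the pod spec instead: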

Docker Swarm — restart unhealthy replicas:

# docker-compose.yml (Swarm mode)
services:
  api:
    image: myapp:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s

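Health check during rolling updates: Swarm treats an updated task as running only once it passes its health check. With order: start-first under update_config, the old task is stopped only after the new one is healthy, which enables zero-downtime deployments; the default order, stop-first, stops the old task before starting the new one.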

Fix 8: Diagnose Restart Loop Incidents

A restart loop is the most disruptive failure mode caused by health check misconfiguration. The container starts, the health check runs before the app is ready, the orchestrator kills it, and the cycle begins again. Service availability drops to zero even though the application code is perfectly functional.

Identify a restart loop:

# Check restart count — if it's climbing fast, you have a loop
docker inspect myapp --format='{{.RestartCount}}'

# In Kubernetes, look for CrashLoopBackOff
kubectl get pods
# NAME        READY   STATUS             RESTARTS   AGE
# myapp-xyz   0/1     CrashLoopBackOff   14         12m

# View events to see the timeline
kubectl describe pod myapp-xyz | grep -A 20 "Events:"
# Liveness probe failed: connection refused
# Container killed
# Started container
# Liveness probe failed: connection refused (repeats)

Break the loop:

Increase start_period to give the app time to start. If you don’t know the app’s cold-start time, set it to 180 seconds and tune down after observing actual startup.
Temporarily raise retries (for example, to 10) so the app gets more chances before it is killed.
Check if the health check endpoint itself is the problem. An endpoint that queries the database during startup can fail because the connection pool isn’t initialized yet.
Use a staged health check: return 200 from the health endpoint as soon as the process is up (before dependencies are ready), and report readiness in the response body so the orchestrator doesn’t kill the container while it initializes.

// Staged health check — always alive, but reports readiness separately
let ready = false;

app.get('/health', (req, res) => {
  // Always return 200 for liveness (don't let the orchestrator kill us)
  res.status(200).json({ alive: true, ready });
});

// Set ready=true once all dependencies are initialized
async function bootstrap() {
  await db.connect();
  await cache.connect();
  ready = true;
  console.log('Application ready');
}

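// Exit on startup failure instead of leaving an unhandled rejection
bootstrap().catch((err) => {
  console.error('Startup failed:', err);
  process.exit(1);
});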

Monitor restart loops in production:

Set up alerts on container restart count. If a container restarts more than 3 times in 5 minutes, the health check configuration likely needs adjustment, not the application code. The metric to track is restart_count per container (or kube_pod_container_status_restarts_total in Kubernetes).
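The alert rule above (more than 3 restarts in 5 minutes) reduces to a sliding-window count. The sketch below assumes restart timestamps in milliseconds collected by your own monitoring; isRestartLoop and its option names are hypothetical.

```javascript
// True when more than `maxRestarts` restarts fall inside the trailing window.
// `restartTimes` and `now` are epoch timestamps in milliseconds.
function isRestartLoop(
  restartTimes,
  now,
  { maxRestarts = 3, windowMs = 5 * 60 * 1000 } = {}
) {
  const recent = restartTimes.filter((t) => now - t >= 0 && now - t <= windowMs);
  return recent.length > maxRestarts;
}
```

Four restarts within the last five minutes trip the rule; three restarts plus one older than the window do not.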

Still Not Working?

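Different user inside the container: the health check runs as the container’s user (often non-root). Connecting to a port below 1024 never requires root; only binding to one can, so a non-root app may fail to listen on a privileged port in the first place. Check the app logs for a bind error, and prefer ports above 1024. Also confirm that the non-root user can execute the health check binary or script.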

IPv6 vs IPv4 binding — if the app binds to ::1 (IPv6 localhost) but curl tries 127.0.0.1 (IPv4), the connection fails. Try using [::1] in the curl URL or bind the app to 0.0.0.0.
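To see which loopback family the app actually accepts connections on, a quick probe from inside the container can help. The sketch below uses only Node's built-in net module; canConnect and probeLoopback are hypothetical helper names, and the one-second timeout is an arbitrary choice.

```javascript
const net = require('node:net');

// Resolve true if a TCP connection to host:port succeeds within timeoutMs.
function canConnect(host, port, timeoutMs = 1000) {
  return new Promise((resolve) => {
    const socket = net.connect({ host, port });
    const finish = (ok) => {
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.once('connect', () => finish(true));
    socket.once('error', () => finish(false));
  });
}

// Probe both loopback families to see which one the app is bound to.
async function probeLoopback(port) {
  const [ipv4, ipv6] = await Promise.all([
    canConnect('127.0.0.1', port),
    canConnect('::1', port),
  ]);
  return { ipv4, ipv6 };
}
```

If only ipv6 comes back true, point the health check at [::1]; if only ipv4 does, use 127.0.0.1 rather than localhost.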

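Health check timing with depends_on: in Docker Compose, condition: service_healthy only works if the dependency defines a healthcheck (in its image or in the Compose file). Without one, Compose has nothing to wait on; recent versions typically fail the dependent service with an error about the missing healthcheck rather than starting it.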

Misleading exit codes — some commands return non-zero for reasons unrelated to the actual health. Test the command manually inside the container with docker exec to verify the exit code and output before trusting it in a HEALTHCHECK.

Health check passes locally but fails in CI/CD — CI runners often have lower memory and CPU limits than local dev machines. The application takes longer to start under resource pressure, so the start_period that works locally is too short in CI. Measure startup time in your CI environment separately and set a corresponding start_period.

Network namespace differences between host and container — if the app binds to a specific network interface name (e.g., eth0), the interface name inside the container may differ. Use 0.0.0.0 to bind to all interfaces, or verify interface names with docker exec myapp ip addr.

Health check logs are empty — if docker inspect shows health check log entries with empty Output and ExitCode: 0, the health check is passing. The unhealthy status you see may be from a previous run. Restart the container and watch the health status transition from starting to healthy.