Fix: Docker HEALTHCHECK Failing — Container Marked Unhealthy Despite Running
Quick Answer
How to fix Docker HEALTHCHECK failures — command syntax, curl vs wget availability, start period, interval tuning, health check in docker-compose, and debugging unhealthy containers.
The Problem
A Docker container starts and runs correctly, but its status shows unhealthy:
docker ps
CONTAINER ID IMAGE STATUS
a1b2c3d4e5f6 myapp Up 2 minutes (unhealthy)
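# Container is running, but health check is failing
Or the container keeps restarting because an orchestrator such as Docker Swarm replaces unhealthy containers (Kubernetes ignores the Dockerfile HEALTHCHECK, but behaves the same way through its own liveness probes):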
docker inspect myapp --format='{{.State.Health.Status}}'
# unhealthy
docker inspect myapp --format='{{json .State.Health.Log}}'
# [{"Start":"...","End":"...","ExitCode":1,"Output":"curl: (7) Failed to connect to localhost port 8080: Connection refused"}]
Or the start_period isn’t long enough, causing the container to be marked unhealthy before the application finishes starting:
# Container starts, health check runs immediately, app not ready yet
# Health check fails 3 times → container marked unhealthy
# But app would have been healthy 10 seconds later
In orchestrated environments, this triggers a restart loop: the container starts, fails its health check before the app is ready, gets killed, a new container is scheduled, and the cycle repeats indefinitely.
Why This Happens
Docker’s HEALTHCHECK instruction runs a command inside the container on a schedule. The container is marked unhealthy after a specified number of consecutive failures. Common causes:
- curl or wget not in the image — the health check command is often curl http://localhost:8080/health, but minimal images (Alpine, distroless) don’t include curl.
- Wrong port or path — the health check targets a port or path that doesn’t exist or isn’t reachable from inside the container.
- Localhost vs 0.0.0.0 — the app listens on 0.0.0.0 and the health check tries 127.0.0.1. This normally works, but some configurations bind only to a specific interface.
- start_period too short — by default the start period is 0s, so Docker starts counting failures immediately. Slow-starting applications (JVM, large Node.js apps) aren’t ready in time.
- Exit code not zero — the health check command must exit 0 to be healthy. If the HTTP request succeeds but the command returns a non-zero exit code for other reasons, Docker marks it as failed.
- Shell unavailable — the exec (JSON array) form, HEALTHCHECK CMD ["..."], runs the command directly without a shell, so shell features (&&, ||, pipes) don’t work. Use the shell form (CMD-SHELL in Compose) when you need them.
The blast radius depends on the deployment context. In a standalone docker run scenario, you see (unhealthy) in docker ps and nothing else happens — the container keeps running. In Docker Compose with depends_on: service_healthy, downstream services never start. In Swarm, the orchestrator actively kills and replaces unhealthy containers, and Kubernetes does the same when an equivalent liveness probe fails, which means a misconfigured health check can take down an otherwise functional service. The worst case is a restart loop where the container never finishes its startup sequence before the health check declares it dead, cycling indefinitely and producing zero uptime.
Fix 1: Install curl or Use Alternatives
Minimal images often lack curl. Either install it or use an alternative:
# Alpine — install curl (adds ~1MB)
FROM node:20-alpine
RUN apk add --no-cache curl
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s --retries=3 \
CMD curl -f http://localhost:3000/health || exit 1
# Alternatively — use wget (included in busybox/Alpine)
HEALTHCHECK CMD wget -qO- http://localhost:3000/health || exit 1
# Or use nc (netcat) to check if port is open (no HTTP check)
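HEALTHCHECK CMD nc -z localhost 3000 || exit 1
Distroless images — copy in a compiled health check binary:
# Distroless has no shell and no curl, so use a compiled health check binary
# built in an earlier stage of a multi-stage build.
# Note: grpc-health-probe speaks the gRPC health checking protocol, so this only
# works if the app exposes that service; for a plain HTTP /health endpoint,
# substitute a small static HTTP client binary.
FROM golang:1.22-alpine AS healthcheck-builder
# CGO_ENABLED=0 yields a static binary that runs on the Debian-based final image.
# Pin a release that your Go toolchain can build instead of @latest.
RUN CGO_ENABLED=0 go install github.com/grpc-ecosystem/grpc-health-probe@latest
FROM gcr.io/distroless/nodejs20-debian12
COPY --from=healthcheck-builder /go/bin/grpc-health-probe /healthcheck
HEALTHCHECK --interval=30s --timeout=10s \
CMD ["/healthcheck", "-addr=:3000"]
Node.js — use a JavaScript health check script: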
FROM node:20-alpine
# health-check.js included in the image
COPY health-check.js .
HEALTHCHECK --interval=30s --timeout=10s --start-period=30s \
CMD node health-check.js
// health-check.js
const http = require('http');
const options = {
  host: 'localhost',
  port: process.env.PORT || 3000,
  path: '/health',
  timeout: 5000,
};
const req = http.request(options, (res) => {
  process.exit(res.statusCode === 200 ? 0 : 1);
});
req.on('error', () => process.exit(1));
// req.abort() is deprecated; destroy() tears down the socket the same way
req.on('timeout', () => { req.destroy(); process.exit(1); });
req.end();
Fix 2: Set Correct Timing Parameters
The default HEALTHCHECK timing often produces false unhealthy results for real-world applications:
# Default values (if not specified):
# --interval=30s Check every 30 seconds
# --timeout=30s Fail if check takes longer than 30 seconds
# --start-period=0s Start counting failures immediately
# --retries=3 Mark unhealthy after 3 consecutive failures
# Tuned for a typical web application:
#   --interval=10s      check frequently
#   --timeout=5s        fail fast if unresponsive
#   --start-period=60s  wait 60s before counting failures (startup time)
#   --retries=5         5 consecutive failures before marking unhealthy
# (Text after a line-continuation backslash breaks the instruction, so the notes live up here.)
HEALTHCHECK \
--interval=10s \
--timeout=5s \
--start-period=60s \
--retries=5 \
CMD curl -f http://localhost:3000/health || exit 1
# For a JVM/Spring Boot application (slow startup; JVM apps often take 30-90s to start):
HEALTHCHECK \
--interval=20s \
--timeout=10s \
--start-period=120s \
--retries=5 \
CMD curl -f http://localhost:8080/actuator/health || exit 1
start-period explained:
During start_period, failed health checks don’t count toward retries, and the container reports the status starting. If a check succeeds inside this window, the container is considered started and any later failure counts right away. Once start_period elapses, every failure counts. This prevents a false unhealthy status during legitimate startup.
# Check the health status timeline
docker inspect myapp | jq '.[0].State.Health.Log[-5:]'
# Look at Start timestamps to see when checks began
# Compare against the container's start time in .State.StartedAt
Pro Tip: In production, set start_period to at least 1.5x your application’s measured cold-start time. If your app takes 40 seconds to start under load, set start_period to 60 seconds. Measure cold-start time under realistic conditions (database migrations, cache warming, connection pool initialization) rather than local dev, where startup is typically much faster.
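The start-period rule is easier to see with a toy model. The sketch below is a simplified simulation of how a health status could be derived from a series of check results. It is not Docker's implementation, and the function names, the `{ atSec, ok }` input shape, and the 10-second check spacing are all assumptions made for illustration.

```javascript
// Simplified model of health status transitions (illustrative, not Docker's code).
// Each check is { atSec, ok }, with atSec measured from container start.
function simulateHealth(checks, { startPeriodSec = 0, retries = 3 } = {}) {
  let status = 'starting';
  let failingStreak = 0;
  const timeline = [];

  for (const { atSec, ok } of checks) {
    if (ok) {
      // Any success resets the streak and marks the container healthy.
      failingStreak = 0;
      status = 'healthy';
    } else {
      // Failures are ignored only while still "starting" inside the start period.
      const inGracePeriod = status === 'starting' && atSec < startPeriodSec;
      if (!inGracePeriod) {
        failingStreak += 1;
        if (failingStreak >= retries) status = 'unhealthy';
      }
    }
    timeline.push({ atSec, status });
  }
  return timeline;
}

// The 1.5x rule from the tip above, rounded up to a whole second.
function recommendStartPeriod(coldStartSec) {
  return Math.ceil(coldStartSec * 1.5);
}

// Hypothetical app that becomes ready after about 40s, checked every 10s.
const checks = [10, 20, 30, 40].map((atSec) => ({ atSec, ok: atSec >= 40 }));

const noGrace = simulateHealth(checks, { startPeriodSec: 0 });
const withGrace = simulateHealth(checks, { startPeriodSec: recommendStartPeriod(40) });

console.log(noGrace.map((c) => c.status));   // marked unhealthy at 30s, before the app is ready
console.log(withGrace.map((c) => c.status)); // stays "starting" until the first success
```

With no start period the model reports unhealthy at the third failed check, which is the point where an orchestrator would replace the container. With a 60-second start period the same failures are ignored and the first successful check moves the status straight to healthy.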
Fix 3: Fix Health Check Command Syntax
The CMD in HEALTHCHECK has two forms with different behavior:
# Exec form (JSON array) — runs directly, no shell, shell features not available
HEALTHCHECK CMD ["curl", "-f", "http://localhost:3000/health"]
# Equivalent to: docker exec container curl -f http://localhost:3000/health
# Shell form (plain string) — runs through /bin/sh -c
HEALTHCHECK CMD curl -f http://localhost:3000/health || exit 1
# Equivalent to: docker exec container /bin/sh -c "curl -f ... || exit 1"
# The || exit 1 requires a shell, so it only works in the shell form
# CMD-SHELL is how the shell form is spelled in docker-compose.yml and in
# docker inspect output. Don't write it in a Dockerfile, where
# HEALTHCHECK CMD ["CMD-SHELL", "..."] would try to execute a program named CMD-SHELL.
# Compose equivalent:
# test: ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"]
curl -f flag — -f (fail) makes curl return exit code 22 for HTTP error responses (4xx, 5xx). Without it, curl exits 0 even on 404 or 500 responses:
# WRONG — exits 0 even on 500 Internal Server Error
HEALTHCHECK CMD curl http://localhost:3000/health
# CORRECT — exits non-zero on 4xx/5xx HTTP responses
HEALTHCHECK CMD curl -f http://localhost:3000/health
Test the health check command manually:
# Run the exact command inside the container
docker exec myapp curl -f http://localhost:3000/health
echo "Exit code: $?" # Should be 0 for healthy
# Run as root to rule out permission issues
docker exec -u root myapp curl -f http://localhost:3000/health
# Check if curl is available
docker exec myapp which curl || echo "curl not found"
Fix 4: Configure Health Checks in docker-compose
docker-compose.yml supports overriding or adding health checks:
# docker-compose.yml
services:
  api:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 5
      start_period: 30s
  # Service that waits for api to be healthy
  worker:
    image: myworker:latest
    depends_on:
      api:
        condition: service_healthy # Wait for api to pass health check
        # Without this, worker starts immediately — api may not be ready
  # Database with built-in health check
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      retries: 3
Disable health check for a service (override Dockerfile’s HEALTHCHECK):
services:
  myservice:
    image: myapp:latest
    healthcheck:
      disable: true # Ignore the HEALTHCHECK from the Dockerfile
Fix 5: Implement a Proper Health Endpoint
The health check’s value depends on what /health actually checks. A proper health endpoint verifies that the application can serve requests:
// Express.js — comprehensive health endpoint
app.get('/health', async (req, res) => {
  const health: Record<string, unknown> = {
    status: 'ok',
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  };
  // Check database connection
  try {
    await db.query('SELECT 1');
    health.database = 'ok';
  } catch (err) {
    health.database = 'error';
    health.status = 'degraded';
  }
  // Check Redis connection
  try {
    await redis.ping();
    health.cache = 'ok';
  } catch (err) {
    health.cache = 'error';
    health.status = 'degraded';
  }
  const statusCode = health.status === 'ok' ? 200 : 503;
  res.status(statusCode).json(health);
});
Separate liveness vs readiness (Kubernetes pattern, useful in Docker too):
// Liveness — "is the process alive?" (simple, rarely fails)
app.get('/health/live', (req, res) => {
  res.status(200).json({ status: 'alive' });
});
// Readiness — "can it serve traffic?" (checks dependencies)
app.get('/health/ready', async (req, res) => {
  try {
    await Promise.all([
      db.query('SELECT 1'),
      redis.ping(),
    ]);
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not ready', error: err.message });
  }
});
# Use the liveness check for Docker HEALTHCHECK (avoid killing due to DB outage)
HEALTHCHECK CMD curl -f http://localhost:3000/health/live || exit 1
Common Mistake: A health endpoint that checks database connectivity sounds correct, but it means a temporary database outage causes container restarts. Those restarts don’t fix the database and add connection pressure to an already-overloaded DB. Use a liveness check (process is alive) for HEALTHCHECK, and a readiness check (dependencies available) for load-balancer routing decisions.
Fix 6: Debug an Unhealthy Container
When a container is unhealthy, inspect the recent health check history:
# View last 5 health check results
docker inspect myapp --format='{{json .State.Health.Log}}' | \
python3 -m json.tool | head -50
# Each log entry contains:
# Start: when the check began
# End: when it finished
# ExitCode: 0=healthy, non-zero=unhealthy
# Output: stdout/stderr from the check command
# Watch health status in real time
watch -n 2 'docker inspect myapp --format="Status: {{.State.Health.Status}}"'
# Get the full health state (docker inspect prints a JSON array, hence .[0])
docker inspect myapp | jq '.[0].State.Health'
Common output messages and their meanings:
"curl: (7) Failed to connect to localhost port 3000: Connection refused"
→ App not listening on port 3000 yet, or crashed
→ Fix: Increase start-period or check app startup
"curl: (22) The requested URL returned error: 500"
→ App is running but returning 500 error
→ Fix: Debug the health endpoint — check app logs
"curl: (6) Could not resolve host: localhost"
→ Unusual — network configuration issue
→ Fix: Use 127.0.0.1 instead of localhost
"OCI runtime exec failed: exec: 'curl': executable file not found"
→ curl not in the image
→ Fix: Install curl or use alternative
Fix 7: Use Health Checks in Production Orchestration
In production with Docker Swarm or Kubernetes, health checks drive automatic recovery:
Docker Swarm — restart unhealthy replicas:
# docker-compose.yml (Swarm mode)
services:
  api:
    image: myapp:latest
    deploy:
      replicas: 3
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: rollback
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s
Health check during rolling updates — when update_config sets order: start-first, Swarm waits for the new container to pass its health check before removing the old one, enabling zero-downtime deployments. The default order is stop-first, which stops the old task before the new one starts.
Fix 8: Diagnose Restart Loop Incidents
A restart loop is the most disruptive failure mode caused by health check misconfiguration. The container starts, the health check runs before the app is ready, the orchestrator kills it, and the cycle begins again. Service availability drops to zero even though the application code is perfectly functional.
Identify a restart loop:
# Check restart count — if it's climbing fast, you have a loop
docker inspect myapp --format='{{.RestartCount}}'
# In Kubernetes, look for CrashLoopBackOff
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# myapp-xyz 0/1 CrashLoopBackOff 14 12m
# View events to see the timeline
kubectl describe pod myapp-xyz | grep -A 20 "Events:"
# Liveness probe failed: connection refused
# Container killed
# Started container
# Liveness probe failed: connection refused (repeats)
Break the loop:
- Increase start_period to give the app time to start. If you don’t know the app’s cold-start time, set it to 180 seconds and tune down after observing actual startup.
- Reduce retries pressure by temporarily increasing retries to 10 so the app has more chances before being killed.
- Check if the health check endpoint itself is the problem. An endpoint that queries the database during startup can fail because the connection pool isn’t initialized yet.
- Use a staged health check: return 200 from the health endpoint immediately on startup (before dependencies are ready), and fail only after the app has had time to initialize everything.
// Staged health check — always alive, but reports readiness separately
let ready = false;
app.get('/health', (req, res) => {
  // Always return 200 for liveness (don't let the orchestrator kill us)
  res.status(200).json({ alive: true, ready });
});
// Set ready=true once all dependencies are initialized
async function bootstrap() {
  await db.connect();
  await cache.connect();
  ready = true;
  console.log('Application ready');
}
bootstrap();
Monitor restart loops in production:
Set up alerts on container restart count. If a container restarts more than 3 times in 5 minutes, the health check configuration likely needs adjustment, not the application code. The metric to track is restart_count per container (or kube_pod_container_status_restarts_total in Kubernetes).
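As a rough illustration of the "more than 3 restarts in 5 minutes" rule, the sketch below checks a list of restart timestamps against a sliding window. The function name, its parameters, and the sample timestamps are hypothetical; in practice the timestamps would come from whatever monitoring system records container restarts.

```javascript
// Hypothetical helper: flag a probable restart loop from observed restart times.
// restartTimesMs holds restart timestamps in milliseconds since the epoch.
function isRestartLooping(
  restartTimesMs,
  nowMs,
  { maxRestarts = 3, windowMs = 5 * 60 * 1000 } = {}
) {
  // Keep only restarts that fall inside the window ending at nowMs.
  const recent = restartTimesMs.filter((t) => t > nowMs - windowMs && t <= nowMs);
  // "More than 3 times in 5 minutes" means strictly greater than maxRestarts.
  return recent.length > maxRestarts;
}

// Made-up timestamps: four restarts within the last two minutes.
const now = Date.UTC(2024, 0, 1, 12, 0, 0);
const restarts = [now - 110000, now - 80000, now - 50000, now - 20000];

console.log(isRestartLooping(restarts, now));             // four restarts in the window
console.log(isRestartLooping(restarts.slice(0, 3), now)); // three restarts in the window
```

Both the threshold and the window are parameters so the rule can be loosened for services that legitimately restart during deployments.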
Still Not Working?
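Different user inside the container — the health check runs as the container’s user (often non-root). Connecting to a port never requires root, but binding to a port below 1024 (privileged) can require extra privileges on some setups. If the app runs as non-root and fails to bind, nothing is listening and every check reports connection refused. Use a port above 1024 for the app, or confirm in the app logs that it actually bound the port.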
IPv6 vs IPv4 binding — if the app binds to ::1 (IPv6 localhost) but curl tries 127.0.0.1 (IPv4), the connection fails. Try using [::1] in the curl URL or bind the app to 0.0.0.0.
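Health check timing with depends_on — in Docker Compose, depends_on: service_healthy only works if the dependency has a healthcheck, either in the Compose file or in its image. If it has none, Compose cannot evaluate the condition: depending on the Compose version, the condition is ignored or the dependent service fails to start with an error about the missing healthcheck.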
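One way to catch this before deploying is to scan the resolved Compose configuration for service_healthy conditions that point at a service without a healthcheck. The sketch below assumes the input is the parsed JSON from `docker compose config --format json`. The function name and the sample services are hypothetical, and the check cannot see a HEALTHCHECK baked into the image, so treat a hit as a prompt to verify rather than as proof of a problem.

```javascript
// Sketch: list depends_on entries that wait on service_healthy for a service
// whose Compose definition has no usable healthcheck.
function findUnsatisfiableHealthDeps(compose) {
  const services = compose.services || {};
  const problems = [];

  for (const [name, svc] of Object.entries(services)) {
    const deps = svc.depends_on;
    // The short list form (depends_on: [api]) carries no condition, so skip it.
    if (!deps || Array.isArray(deps)) continue;

    for (const [depName, dep] of Object.entries(deps)) {
      if (!dep || dep.condition !== 'service_healthy') continue;

      const hc = (services[depName] || {}).healthcheck;
      // disable: true and test: ["NONE"] both turn the health check off.
      const disabled =
        hc && (hc.disable === true || (Array.isArray(hc.test) && hc.test[0] === 'NONE'));
      if (!hc || disabled) problems.push({ service: name, dependsOn: depName });
    }
  }
  return problems;
}

// Hypothetical configuration: worker waits on api (has a check) and cache (has none).
const exampleConfig = {
  services: {
    api: { healthcheck: { test: ['CMD', 'curl', '-f', 'http://localhost:3000/health'] } },
    cache: { image: 'redis:7-alpine' },
    worker: {
      depends_on: {
        api: { condition: 'service_healthy' },
        cache: { condition: 'service_healthy' },
      },
    },
  },
};

console.log(findUnsatisfiableHealthDeps(exampleConfig));
```

For the sample configuration the only reported pair is worker waiting on cache, because api defines a healthcheck in the Compose file and cache does not.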
Misleading exit codes — some commands return non-zero for reasons unrelated to the actual health. Test the command manually inside the container with docker exec to verify the exit code and output before trusting it in a HEALTHCHECK.
Health check passes locally but fails in CI/CD — CI runners often have lower memory and CPU limits than local dev machines. The application takes longer to start under resource pressure, so the start_period that works locally is too short in CI. Measure startup time in your CI environment separately and set a corresponding start_period.
Network namespace differences between host and container — if the app binds to a specific network interface name (e.g., eth0), the interface name inside the container may differ. Use 0.0.0.0 to bind to all interfaces, or verify interface names with docker exec myapp ip addr.
Health check logs are empty — if docker inspect shows health check log entries with empty Output and ExitCode: 0, the health check is passing. The unhealthy status you see may be from a previous run. Restart the container and watch the health status transition from starting to healthy.
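To read a health log without eyeballing raw JSON, the entries can be reduced to a short summary. The sketch below assumes the input is the array printed by `docker inspect --format '{{json .State.Health.Log}}'`, with the Start, End, ExitCode, and Output fields described earlier. The function names and the sample entries are made up for illustration.

```javascript
// Docker timestamps carry nanoseconds; trim to milliseconds before parsing.
function toMillis(timestamp) {
  return new Date(timestamp.replace(/(\.\d{3})\d+/, '$1')).getTime();
}

// Sketch: summarize health log entries (Start, End, ExitCode, Output).
function summarizeHealthLog(log) {
  const entries = log.map((e) => ({
    start: e.Start,
    durationMs: toMillis(e.End) - toMillis(e.Start),
    ok: e.ExitCode === 0,
    output: (e.Output || '').trim(),
  }));

  // Count failures at the end of the log: the current streak, as far as the log shows.
  let trailingFailures = 0;
  for (let i = entries.length - 1; i >= 0 && !entries[i].ok; i--) {
    trailingFailures += 1;
  }

  const last = entries[entries.length - 1];
  return { entries, trailingFailures, lastOutput: last ? last.output : '' };
}

// Made-up log: one passing check followed by two failures.
const sampleLog = [
  { Start: '2024-05-01T10:00:00.000000000Z', End: '2024-05-01T10:00:00.120000000Z', ExitCode: 0, Output: '' },
  { Start: '2024-05-01T10:00:30.000000000Z', End: '2024-05-01T10:00:30.045000000Z', ExitCode: 1,
    Output: 'curl: (7) Failed to connect to localhost port 3000: Connection refused\n' },
  { Start: '2024-05-01T10:01:00.000000000Z', End: '2024-05-01T10:01:00.045000000Z', ExitCode: 1,
    Output: 'curl: (7) Failed to connect to localhost port 3000: Connection refused\n' },
];

const summary = summarizeHealthLog(sampleLog);
console.log(summary.trailingFailures, summary.lastOutput);
```

A trailing failure count of zero together with an unhealthy status is the situation described above: the log shows passing checks, so the status is likely left over from an earlier run.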
For related Docker issues, see Fix: Docker Build Cache Invalidated, Fix: Docker Multi-Stage Build Failed, Fix: Kubernetes CrashLoopBackOff, and Fix: Docker Compose depends_on Not Working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.