Fix: AWS ECS Task Failed to Start
Quick Answer
How to fix ECS tasks that fail to start — port binding errors, missing IAM permissions, Secrets Manager access, essential container exit codes, and health check failures.
The Error
An ECS task fails to start and the service shows no running tasks:
CannotPullContainerError: ref pull has been retried 1 time(s): failed to pull
and unpack image "123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:latest":
failed to resolve reference: unexpected status code 403 Forbidden

Or the task starts and immediately stops:
Essential container in task exited
Exit Code: 1
Reason: Essential container in task exited

Or resource constraints prevent scheduling:

RESOURCE:MEMORY

Or port binding fails:

CannotStartContainerError: bind for 0.0.0.0:8080 failed: port is already allocated

Or Secrets Manager access is denied:
ResourceInitializationError: unable to pull secrets or registry auth:
execution resource retrieval failed: unable to retrieve secret from asm:
service call has been retried 1 time(s): AccessDeniedException

Why This Happens
ECS task failures have several distinct causes:
- ECR pull failure — the task execution role lacks ecr:GetAuthorizationToken or ecr:BatchGetImage permissions, or the image URI is wrong.
- Application crash on startup — the container starts but the application exits immediately due to a missing env var, failed database connection, or configuration error.
- Secrets Manager / Parameter Store access denied — the task execution role needs explicit permission to retrieve secrets referenced in the task definition.
- Port already allocated — using host network mode, a previous task didn’t release the port before the new one started.
- Insufficient memory or CPU — the task’s memory/CPU settings don’t match what’s available on the instance (for EC2 launch type).
- Health check failing — the container starts but the load balancer health check fails, causing the task to be deregistered and replaced in a loop.
- Missing taskRoleArn — the task execution role and the task role are different. The execution role is for ECS infrastructure (pulling images, writing logs). The task role is for your application code (S3, DynamoDB access).
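The causes above can be triaged mechanically from the stop reason ECS reports. A minimal sketch of that mapping (the function and its categories are ours, not an AWS API):

```python
def classify_stop_reason(stopped_reason: str, exit_code=None) -> str:
    """Map an ECS stoppedReason string to a likely root cause."""
    r = stopped_reason
    if "CannotPullContainerError" in r:
        return "ECR pull failure: check execution role ECR permissions and the image URI"
    if "ResourceInitializationError" in r and "secret" in r:
        return "Secrets access denied: grant the execution role Secrets Manager / SSM access"
    if "CannotStartContainerError" in r and "port is already allocated" in r:
        return "Port conflict: a previous task still holds the host port"
    if "Essential container in task exited" in r:
        if exit_code not in (0, None):
            return "Application crashed on startup: check CloudWatch Logs"
        return "Essential container exited cleanly: check CloudWatch Logs"
    return "Unknown: inspect describe-tasks output and service events"

print(classify_stop_reason("Essential container in task exited", exit_code=1))
```

Feed it the `stoppedReason` and container `exitCode` from `describe-tasks` (Fix 1 below shows where those fields live).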
Fix 1: Check Stopped Task Error Details
ECS stores the stop reason for recent tasks. This is the first place to look:
# List recent stopped tasks in a service
aws ecs list-tasks \
--cluster my-cluster \
--service-name my-service \
--desired-status STOPPED
# Get detailed stop reason for a specific task
aws ecs describe-tasks \
--cluster my-cluster \
--tasks arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123def456
# Look for:
# - stoppedReason: "Essential container in task exited"
# - containers[].exitCode: 1
# - containers[].reason: "CannotPullContainerError..."

In the ECS console:
Navigate to your cluster → Service → Tasks tab → filter by “Stopped” → click the task → expand the container to see the exit code and stop reason.
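The same fields can be read programmatically from the describe-tasks JSON. A sketch over a hand-written sample (the dict shape mirrors the real API response; the values are illustrative):

```python
# Shape mirrors the `aws ecs describe-tasks` response; this sample is hand-written.
response = {
    "tasks": [{
        "taskArn": "arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123def456",
        "lastStatus": "STOPPED",
        "stoppedReason": "Essential container in task exited",
        "containers": [
            {"name": "my-container", "exitCode": 1},
        ],
    }]
}

for task in response["tasks"]:
    print(task["taskArn"], "-", task.get("stoppedReason", "unknown"))
    for c in task["containers"]:
        # exitCode is absent when the container never started (e.g. a pull failure);
        # in that case `reason` carries the CannotPullContainerError text instead.
        print(f"  {c['name']}: exitCode={c.get('exitCode')} reason={c.get('reason', '')}")
```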
Check CloudWatch Logs for the application error:
# Get logs from the last task run
aws logs get-log-events \
--log-group-name /ecs/my-service \
--log-stream-name ecs/my-container/abc123def456 \
--limit 100 \
--start-from-head

Fix 2: Fix ECR Image Pull Failures
The ECS task execution role needs ECR permissions to pull the image:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage"
],
"Resource": "*"
}
]
}

Attach the managed policy (easiest):
aws iam attach-role-policy \
--role-name ecsTaskExecutionRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Verify the image URI is correct:
# List images in ECR repository
aws ecr list-images \
--repository-name my-app \
--region us-east-1
# The task definition image should match exactly:
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

For private registries (not ECR):
Add registry credentials to Secrets Manager and reference them in the task definition:
{
"containerDefinitions": [{
"name": "my-container",
"image": "registry.example.com/my-app:latest",
"repositoryCredentials": {
"credentialsParameter": "arn:aws:secretsmanager:us-east-1:123456789012:secret:registry-credentials"
}
}]
}

Fix 3: Fix Secrets Manager Access
When task definitions reference secrets from Secrets Manager or Parameter Store, the task execution role (not the task role) needs access:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"secretsmanager:GetSecretValue"
],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/*"
},
{
"Effect": "Allow",
"Action": [
"ssm:GetParameters",
"ssm:GetParameter"
],
"Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/*"
},
{
"Effect": "Allow",
"Action": [
"kms:Decrypt"
],
"Resource": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
}
]
}

Task definition with secrets:
{
"containerDefinitions": [{
"name": "my-app",
"image": "my-image:latest",
"secrets": [
{
"name": "DATABASE_URL",
"valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/database-url"
},
{
"name": "API_KEY",
"valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/api-key"
}
]
}]
}

Note: The secrets block in a task definition is resolved by ECS before the container starts. If ECS can’t retrieve a secret, the task fails with ResourceInitializationError — the application code never runs.
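If you generate task definitions in code, building the secrets block from a single name-to-ARN mapping avoids typos in the ARNs. A sketch (the helper is ours, not an AWS API):

```python
import json

def build_secrets_block(secret_arns: dict) -> list:
    """Turn {ENV_VAR_NAME: secret ARN} into the `secrets` list ECS expects."""
    return [{"name": name, "valueFrom": arn} for name, arn in sorted(secret_arns.items())]

secrets = build_secrets_block({
    "DATABASE_URL": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/database-url",
    "API_KEY": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/api-key",
})
print(json.dumps(secrets, indent=2))
```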
Fix 4: Fix Application Startup Crashes (Exit Code 1)
If the container starts and immediately exits with code 1 (or any non-zero code), the application is crashing before it can serve requests:
# Check the application logs immediately after the crash
aws logs filter-log-events \
--log-group-name /ecs/my-service \
--filter-pattern "ERROR" \
--start-time $(date -d '30 minutes ago' +%s000)

Common startup crash causes:
Missing required environment variable:
# Application logs show:
# Error: Required environment variable DATABASE_URL is not set
# Process exited with code 1

Fix: add the missing variable to the task definition’s environment or secrets block.
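A pattern that makes this failure obvious in CloudWatch Logs is to validate required variables at startup and exit non-zero with a clear message. A minimal sketch (the variable names are examples):

```python
import os
import sys

REQUIRED = ["DATABASE_URL", "API_KEY"]  # example names; list what your app actually needs

def check_env(required=REQUIRED, env=os.environ):
    """Exit with a clear message if any required variable is missing or empty."""
    missing = [name for name in required if not env.get(name)]
    if missing:
        print("Error: required environment variables not set: " + ", ".join(missing),
              file=sys.stderr)
        sys.exit(1)  # non-zero exit is what ECS reports as "Essential container in task exited"

# Call check_env() first thing in your entrypoint, before connecting to anything.
```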
Database connection failure at startup:
# Application logs show:
# FATAL: could not connect to server: Connection refused
# Process exited with code 1

Fix: check security group rules — ECS tasks need outbound access to the database’s port. Also check that the database hostname is reachable from within the VPC.
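A plain TCP connect from the same network context (the container, or an ECS Exec shell) tells you whether the security group and DNS path work before the app even tries. A sketch (host and port are placeholders):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or DNS failure
        return False

# e.g. tcp_reachable("mydb.example.us-east-1.rds.amazonaws.com", 5432)
```

If this returns False from inside the task but True from elsewhere, suspect the task’s security group or subnet routing rather than the database itself.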
Wrong port configuration:
# Application listens on port 3000 but task definition maps port 8080
# Health check hits port 8080 → no response → task marked unhealthy

Fix: ensure the containerPort in the task definition matches the port your application listens on:
{
"portMappings": [{
"containerPort": 3000, // Must match what the application binds to
"hostPort": 0, // 0 = dynamic host port (bridge mode; in awsvpc mode hostPort must equal containerPort or be omitted)
"protocol": "tcp"
}]
}

Test the Docker image locally before deploying:
# Simulate the ECS environment locally
docker run --rm \
-e DATABASE_URL=postgres://... \
-e API_KEY=test \
-p 3000:3000 \
123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

Fix 5: Fix Resource Constraints
If ECS can’t place a task because of insufficient memory or CPU, tasks stay in PENDING state:
# Check service events for placement failures
aws ecs describe-services \
--cluster my-cluster \
--services my-service \
--query 'services[0].events[:10]'
# Look for:
# "service my-service was unable to place a task because no container
# instance met all of its requirements. The closest matching instance
# had insufficient memory available."

For EC2 launch type — check instance resources:
# List container instances and their available resources
aws ecs list-container-instances --cluster my-cluster
aws ecs describe-container-instances \
--cluster my-cluster \
--container-instances $(aws ecs list-container-instances --cluster my-cluster --query 'containerInstanceArns[]' --output text) \
--query 'containerInstances[*].{id:ec2InstanceId, cpu:remainingResources[?name==`CPU`].integerValue|[0], mem:remainingResources[?name==`MEMORY`].integerValue|[0]}'

Fix: reduce task memory/CPU or scale up the cluster:
{
"cpu": "256", // 0.25 vCPU — reduce if tasks are competing for resources
"memory": "512", // 512 MB — reduce or scale up instances
"requiresCompatibilities": ["FARGATE"]
}

For Fargate — ensure the cpu/memory combination is valid. Fargate only supports specific combinations:
| CPU | Valid Memory values |
|---|---|
| 256 (.25 vCPU) | 512, 1024, 2048 |
| 512 (.5 vCPU) | 1024–4096 (in 1024 increments) |
| 1024 (1 vCPU) | 2048–8192 (in 1024 increments) |
| 2048 (2 vCPU) | 4096–16384 (in 1024 increments) |
| 4096 (4 vCPU) | 8192–30720 (in 1024 increments) |
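The table above can be checked in code before registering a task definition. A validator sketch covering only the CPU sizes listed (larger Fargate sizes exist and are omitted here):

```python
# Valid memory (MiB) per CPU-unit value, mirroring the table above.
FARGATE_COMBOS = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4097, 1024)),
    1024: list(range(2048, 8193, 1024)),
    2048: list(range(4096, 16385, 1024)),
    4096: list(range(8192, 30721, 1024)),
}

def is_valid_fargate_combo(cpu: int, memory: int) -> bool:
    """True if this cpu/memory pair is in the table above."""
    return memory in FARGATE_COMBOS.get(cpu, [])

print(is_valid_fargate_combo(256, 512))   # True
print(is_valid_fargate_combo(256, 768))   # False: 768 MiB is not valid at 0.25 vCPU
```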
Fix 6: Fix Health Check Failures
A task that starts successfully but fails load balancer health checks is repeatedly stopped and replaced:
# Check target group health in the ECS console or CLI
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:...
# Look for:
# "State": "unhealthy",
# "Reason": "Target.FailedHealthChecks"

Common health check fixes:
// Task definition health check
{
"healthCheck": {
"command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],
"interval": 30,
"timeout": 5,
"retries": 3,
"startPeriod": 60 // Give the app time to start before health checks count
}
}

ALB target group health check settings:
aws elbv2 modify-target-group \
--target-group-arn arn:aws:elasticloadbalancing:... \
--health-check-path /health \
--health-check-interval-seconds 30 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--health-check-timeout-seconds 10

Real-world scenario: A Node.js application takes 45 seconds to warm up (loading models, establishing DB connections). The default health check starts after 0 seconds with a 3-failure threshold. The app fails 3 checks before it’s ready, and ECS kills it. Setting startPeriod: 60 in the task definition health check gives the app time to initialize before failures count against it.
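An application-side complement to startPeriod is a health endpoint that responds immediately but returns 503 until warm-up completes, so "still starting" is distinguishable from "dead". A minimal sketch in Python as a stand-in for the Node.js app (port, path, and class names are illustrative; in a container you would bind 0.0.0.0):

```python
import http.server
import threading

class Health(http.server.BaseHTTPRequestHandler):
    ready = False  # flip to True once warm-up (DB connections, model loading) completes

    def do_GET(self):
        if self.path == "/health":
            code = 200 if Health.ready else 503  # 503 while warming up
            self.send_response(code)
            self.end_headers()
            self.wfile.write(b"ok" if code == 200 else b"warming up")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep health-check noise out of the application logs

def serve(port=3000):
    """Start the health server on a background thread and return it."""
    srv = http.server.HTTPServer(("127.0.0.1", port), Health)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

Set Health.ready = True at the end of your initialization path; until then, health checks fail with an explicit 503 rather than a connection error.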
Still Not Working?
Enable ECS Exec to get a shell in a running container:
# Enable ECS Exec on the service
aws ecs update-service \
--cluster my-cluster \
--service my-service \
--enable-execute-command
# Connect to a running task
aws ecs execute-command \
--cluster my-cluster \
--task <task-id> \
--container my-container \
--interactive \
--command "/bin/sh"

Check the task execution role trust policy — the execution role must trust ecs-tasks.amazonaws.com:
{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Action": "sts:AssumeRole"
}]
}

Check VPC networking for Fargate tasks — Fargate tasks need either a public IP or a NAT Gateway to pull images from ECR and reach the internet:
# Fargate task in a private subnet needs NAT Gateway
# Or use VPC endpoints for ECR, S3, and CloudWatch Logs
# Required VPC endpoints for fully private Fargate:
# - com.amazonaws.<region>.ecr.api
# - com.amazonaws.<region>.ecr.dkr
# - com.amazonaws.<region>.s3 (Gateway endpoint)
# - com.amazonaws.<region>.logs
# - com.amazonaws.<region>.secretsmanager (if using Secrets Manager)

For related AWS issues, see Fix: AWS ECR Authentication Failed and Fix: AWS CloudWatch Logs Not Appearing.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.