Fix: AWS ECS Task Failed to Start

FixDevs

Quick Answer

How to fix ECS tasks that fail to start — port binding errors, missing IAM permissions, Secrets Manager access, essential container exit codes, and health check failures.

The Error

An ECS task fails to start and the service shows no running tasks:

CannotPullContainerError: ref pull has been retried 1 time(s): failed to pull
and unpack image "123456789.dkr.ecr.us-east-1.amazonaws.com/myapp:latest":
failed to resolve reference: unexpected status code 403 Forbidden

Or the task starts and immediately stops:

Essential container in task exited
Exit Code: 1
Reason: Essential container in task exited

Or resource constraints prevent scheduling:

RESOURCE:MEMORY

Or port binding fails:

CannotStartContainerError: bind for 0.0.0.0:8080 failed: port is already allocated

Or Secrets Manager access is denied:

ResourceInitializationError: unable to pull secrets or registry auth:
execution resource retrieval failed: unable to retrieve secret from asm:
service call has been retried 1 time(s): AccessDeniedException

Why This Happens

ECS task failures have several distinct causes:

  • ECR pull failure — the task execution role lacks ecr:GetAuthorizationToken or ecr:BatchGetImage permissions, or the image URI is wrong.
  • Application crash on startup — the container starts but the application exits immediately due to a missing env var, failed database connection, or configuration error.
  • Secrets Manager / Parameter Store access denied — the task execution role needs explicit permission to retrieve secrets referenced in the task definition.
  • Port already allocated — with host network mode (or a static hostPort in bridge mode), a previous task didn’t release the port before the new one started.
  • Insufficient memory or CPU — the task’s memory/CPU settings don’t match what’s available on the instance (for EC2 launch type).
  • Health check failing — the container starts but the load balancer health check fails, causing the task to be deregistered and replaced in a loop.
  • Missing taskRoleArn — the task execution role and the task role are different. The execution role is for ECS infrastructure (pulling images, writing logs). The task role is for your application code (S3, DynamoDB access).
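
The execution-role/task-role distinction trips people up constantly. You can see which role is which on an existing task definition (assuming a task definition family named my-service, as in the examples below):

```shell
# executionRole = ECS infrastructure (pull images, write logs)
# taskRole      = your application code (S3, DynamoDB, etc.)
aws ecs describe-task-definition \
  --task-definition my-service \
  --query 'taskDefinition.{executionRole: executionRoleArn, taskRole: taskRoleArn}'
```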

Fix 1: Check Stopped Task Error Details

ECS stores the stop reason for recent tasks. This is the first place to look:

# List recent stopped tasks in a service
aws ecs list-tasks \
  --cluster my-cluster \
  --service-name my-service \
  --desired-status STOPPED

# Get detailed stop reason for a specific task
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks arn:aws:ecs:us-east-1:123456789:task/my-cluster/abc123def456

# Look for:
# - stoppedReason: "Essential container in task exited"
# - containers[].exitCode: 1
# - containers[].reason: "CannotPullContainerError..."

In the ECS console:

Navigate to your cluster → Service → Tasks tab → filter by “Stopped” → click the task → expand the container to see the exit code and stop reason.

Check CloudWatch Logs for the application error:

# Get logs from the last task run
aws logs get-log-events \
  --log-group-name /ecs/my-service \
  --log-stream-name ecs/my-container/abc123def456 \
  --limit 100 \
  --start-from-head
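
On AWS CLI v2, `aws logs tail` is usually quicker than `get-log-events` for this (log group name assumed to match the example above):

```shell
# Tail the last 30 minutes of logs for the service's log group
aws logs tail /ecs/my-service --since 30m

# Or follow live while reproducing the failure
aws logs tail /ecs/my-service --follow
```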

Fix 2: Fix ECR Image Pull Failures

The ECS task execution role needs ECR permissions to pull the image:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}

Attach the managed policy (easiest):

aws iam attach-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
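
To confirm the policy actually landed on the role, and that the task definition points at that role (role and family names as used above):

```shell
# Should list AmazonECSTaskExecutionRolePolicy
aws iam list-attached-role-policies --role-name ecsTaskExecutionRole

# Confirm the task definition references this execution role
aws ecs describe-task-definition \
  --task-definition my-service \
  --query 'taskDefinition.executionRoleArn'
```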

Verify the image URI is correct:

# List images in ECR repository
aws ecr list-images \
  --repository-name my-app \
  --region us-east-1

# The task definition image should match exactly:
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

For private registries (not ECR):

Add registry credentials to Secrets Manager and reference them in the task definition:

{
  "containerDefinitions": [{
    "name": "my-container",
    "image": "registry.example.com/my-app:latest",
    "repositoryCredentials": {
      "credentialsParameter": "arn:aws:secretsmanager:us-east-1:123456789012:secret:registry-credentials"
    }
  }]
}

Fix 3: Fix Secrets Manager Access

When task definitions reference secrets from Secrets Manager or Parameter Store, the task execution role (not the task role) needs access:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetSecretValue"
      ],
      "Resource": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "ssm:GetParameter"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": "arn:aws:kms:us-east-1:123456789012:key/your-kms-key-id"
    }
  ]
}
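
One way to attach this is as an inline policy on the execution role (the policy name is arbitrary; assumes the JSON above is saved as secrets-policy.json):

```shell
aws iam put-role-policy \
  --role-name ecsTaskExecutionRole \
  --policy-name ecs-secrets-access \
  --policy-document file://secrets-policy.json
```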

Task definition with secrets:

{
  "containerDefinitions": [{
    "name": "my-app",
    "image": "my-image:latest",
    "secrets": [
      {
        "name": "DATABASE_URL",
        "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-app/database-url"
      },
      {
        "name": "API_KEY",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/my-app/api-key"
      }
    ]
  }]
}

Note: The secrets block in a task definition is resolved by ECS before the container starts. If ECS can’t retrieve a secret, the task fails with ResourceInitializationError — the application code never runs.
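
Before redeploying, it's worth confirming that the secret referenced in the task definition actually exists and is readable (secret name from the example above):

```shell
# Returns the full ARN if the secret exists and your credentials can see it
aws secretsmanager describe-secret \
  --secret-id my-app/database-url \
  --query 'ARN'
```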

Fix 4: Fix Application Startup Crashes (Exit Code 1)

If the container starts and immediately exits with code 1 (or any non-zero code), the application is crashing before it can serve requests:

# Check the application logs immediately after the crash
aws logs filter-log-events \
  --log-group-name /ecs/my-service \
  --filter-pattern "ERROR" \
  --start-time $(date -d '30 minutes ago' +%s000)

Common startup crash causes:

Missing required environment variable:

# Application logs show:
# Error: Required environment variable DATABASE_URL is not set
# Process exited with code 1

Fix: add the missing variable to the task definition’s environment or secrets block.

Database connection failure at startup:

# Application logs show:
# FATAL: could not connect to server: Connection refused
# Process exited with code 1

Fix: check security group rules — ECS tasks need outbound access to the database’s port. Also check that the database hostname is reachable from within the VPC.
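
A quick way to check what the task's security group allows outbound (the group ID shown is a placeholder — use the one from your service's network configuration):

```shell
# Find the security groups the service launches tasks with
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].networkConfiguration.awsvpcConfiguration.securityGroups'

# Then check that egress allows the database port (e.g. 5432)
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissionsEgress'
```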

Wrong port configuration:

# Application listens on port 3000 but task definition maps port 8080
# Health check hits port 8080 → no response → task marked unhealthy

Fix: ensure the containerPort in the task definition matches the port your application listens on:

{
  "portMappings": [{
    "containerPort": 3000,   // Must match the port the application binds to
    "hostPort": 0,           // 0 = dynamic host port (bridge mode only; in awsvpc mode, omit or set equal to containerPort)
    "protocol": "tcp"
  }]
}

Test the Docker image locally before deploying:

# Simulate the ECS environment locally
docker run --rm \
  -e DATABASE_URL=postgres://... \
  -e API_KEY=test \
  -p 3000:3000 \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
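
If the container exits right away, capture the exit code and last logs locally (container name is arbitrary):

```shell
docker run --name myapp-test \
  -e DATABASE_URL=postgres://... \
  123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest

docker inspect --format '{{.State.ExitCode}}' myapp-test
docker logs --tail 50 myapp-test
docker rm myapp-test
```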

Fix 5: Fix Resource Constraints

If ECS can’t place a task because of insufficient memory or CPU, tasks stay in PENDING state:

# Check service events for placement failures
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].events[:10]'

# Look for:
# "service my-service was unable to place a task because no container
#  instance met all of its requirements. The closest matching instance
#  had insufficient memory available."

For EC2 launch type — check instance resources:

# List container instances and their available resources
aws ecs list-container-instances --cluster my-cluster

aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances $(aws ecs list-container-instances --cluster my-cluster --query 'containerInstanceArns[]' --output text) \
  --query 'containerInstances[*].{id:ec2InstanceId, cpu:remainingResources[?name==`CPU`].integerValue|[0], mem:remainingResources[?name==`MEMORY`].integerValue|[0]}'

Fix: reduce task memory/CPU or scale up the cluster:

{
  "cpu": "256",      // 0.25 vCPU — reduce if tasks are competing for resources
  "memory": "512",   // 512 MB — reduce or scale up instances
  "requiresCompatibilities": ["FARGATE"]
}
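
After changing cpu/memory, register a new revision and roll the service onto it (family and service names as above; assumes the task definition JSON is saved as task-def.json):

```shell
aws ecs register-task-definition --cli-input-json file://task-def.json

aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --task-definition my-service \
  --force-new-deployment
```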

For Fargate — ensure the cpu/memory combination is valid. Fargate only supports specific combinations:

CPU (units)        Valid memory values (MB)
256 (0.25 vCPU)    512, 1024, 2048
512 (0.5 vCPU)     1024–4096 (in 1024 increments)
1024 (1 vCPU)      2048–8192 (in 1024 increments)
2048 (2 vCPU)      4096–16384 (in 1024 increments)
4096 (4 vCPU)      8192–30720 (in 1024 increments)
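
As a quick local sanity check, the table can be encoded as a small shell function (a sketch covering only the combinations listed above):

```shell
# Returns success (0) if the Fargate CPU/memory pair is valid.
# cpu in CPU units, mem in MB, per the table above.
fargate_mem_valid() {
  local cpu="$1" mem="$2"
  case "$cpu" in
    256)  [ "$mem" = 512 ] || [ "$mem" = 1024 ] || [ "$mem" = 2048 ] ;;
    512)  [ "$mem" -ge 1024 ] && [ "$mem" -le 4096 ]  && [ $((mem % 1024)) -eq 0 ] ;;
    1024) [ "$mem" -ge 2048 ] && [ "$mem" -le 8192 ]  && [ $((mem % 1024)) -eq 0 ] ;;
    2048) [ "$mem" -ge 4096 ] && [ "$mem" -le 16384 ] && [ $((mem % 1024)) -eq 0 ] ;;
    4096) [ "$mem" -ge 8192 ] && [ "$mem" -le 30720 ] && [ $((mem % 1024)) -eq 0 ] ;;
    *)    false ;;
  esac
}
```

For example, `fargate_mem_valid 256 4096` fails — 4096 MB is out of range for 0.25 vCPU, which is exactly the kind of combination ECS rejects at registration time.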

Fix 6: Fix Health Check Failures

A task that starts successfully but fails load balancer health checks is repeatedly stopped and replaced:

# Check target group health in the ECS console or CLI
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:...

# Look for:
# "State": "unhealthy",
# "Reason": "Target.FailedHealthChecks"

Common health check fixes:

// Task definition health check
{
  "healthCheck": {
    "command": ["CMD-SHELL", "curl -f http://localhost:3000/health || exit 1"],   // curl must be installed in the image
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60    // Give the app time to start before health checks count
  }
}

ALB target group health check settings:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --health-check-timeout-seconds 10

Real-world scenario: A Node.js application takes 45 seconds to warm up (loading models, establishing DB connections). The default health check starts after 0 seconds with a 3-failure threshold. The app fails 3 checks before it’s ready, and ECS kills it. Setting startPeriod: 60 in the task definition health check gives the app time to initialize before failures count against it.

Still Not Working?

Enable ECS Exec to get a shell in a running container:

# Enable ECS Exec on the service
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --enable-execute-command

# Connect to a running task
aws ecs execute-command \
  --cluster my-cluster \
  --task <task-id> \
  --container my-container \
  --interactive \
  --command "/bin/sh"
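
Note that ECS Exec only applies to tasks started after the flag is enabled — existing tasks must be replaced first. To confirm a task has it on (prints true when enabled):

```shell
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks <task-id> \
  --query 'tasks[0].enableExecuteCommand'
```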

Check the task execution role trust policy — the execution role must trust ecs-tasks.amazonaws.com:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "ecs-tasks.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}

Check VPC networking for Fargate tasks — Fargate tasks need either a public IP or a NAT Gateway to pull images from ECR and reach the internet:

# Fargate task in a private subnet needs NAT Gateway
# Or use VPC endpoints for ECR, S3, and CloudWatch Logs

# Required VPC endpoints for fully private Fargate:
# - com.amazonaws.<region>.ecr.api
# - com.amazonaws.<region>.ecr.dkr
# - com.amazonaws.<region>.s3 (Gateway endpoint)
# - com.amazonaws.<region>.logs
# - com.amazonaws.<region>.secretsmanager (if using Secrets Manager)
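
A sketch of creating one of the interface endpoints (all IDs are placeholders for your own VPC, subnets, and security group):

```shell
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ecr.dkr \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled
```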

For related AWS issues, see Fix: AWS ECR Authentication Failed and Fix: AWS CloudWatch Logs Not Appearing.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
