Fix: AWS CloudFormation stack in ROLLBACK_COMPLETE or CREATE_FAILED state

Q: How do I fix "AWS CloudFormation stack in ROLLBACK_COMPLETE or CREATE_FAILED state"?

How to fix AWS CloudFormation ROLLBACK_COMPLETE and CREATE_FAILED errors caused by IAM permissions, resource limits, invalid parameters, and dependency failures.

The Error

Your CloudFormation stack creation fails and enters:

Status: ROLLBACK_COMPLETE
Status reason: The following resource(s) failed to create: [MyEC2Instance, MySecurityGroup].

Or variations:

CREATE_FAILED - Resource handler returned message: "Access Denied (Service: S3, Status Code: 403)"

ROLLBACK_IN_PROGRESS - The following resource(s) failed to create: [MyLambdaFunction].
Rollback requested by user.

UPDATE_ROLLBACK_COMPLETE - Parameter validation failed

DELETE_FAILED - Cannot delete stack: it has nested stacks that are in DELETE_FAILED state

CloudFormation tried to create or update resources, one or more failed, and the stack rolled back to its previous state (or failed entirely for new stacks). A stack in ROLLBACK_COMPLETE cannot be updated — it must be deleted and recreated.

Why This Happens

CloudFormation orchestrates resource creation as a directed acyclic graph (DAG). It walks the template, figures out dependencies (explicit ones via DependsOn, implicit ones via !Ref and !GetAtt), then creates resources in topological order. When any resource creation fails, CloudFormation halts the forward pass and begins a backward pass: deleting everything it already created in reverse order. If the stack is new (no prior successful version), it ends in ROLLBACK_COMPLETE. If it is an update, it tries to restore the previous state and ends in UPDATE_ROLLBACK_COMPLETE.

The ROLLBACK_COMPLETE state is a one-way trap: you cannot update or retry a stack in this state. The only way out is to delete the stack and create it again. (continue-update-rollback does not apply here; it recovers a different state, UPDATE_ROLLBACK_FAILED, covered in Fix 8.) This is a deliberate AWS design choice: the create never succeeded and its resources have been removed, so there is no known-good configuration for an update to build on.
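As a rough sketch of how the states discussed in this article map to a recovery step, the helper below turns a stack status into a short label. The function name and the action labels are hypothetical shorthand for the fixes described later, not AWS CLI commands.

```shell
# Map a CloudFormation stack status to the recovery step described in this article.
# The labels printed here are illustrative shorthand, not real CLI subcommands.
next_action() {
    case "$1" in
        ROLLBACK_COMPLETE)               echo "delete-and-recreate" ;;
        CREATE_FAILED)                   echo "inspect-then-delete" ;;
        UPDATE_ROLLBACK_COMPLETE)        echo "fix-and-update" ;;
        UPDATE_ROLLBACK_FAILED)          echo "continue-update-rollback" ;;
        DELETE_FAILED)                   echo "delete-with-retain-resources" ;;
        *_IN_PROGRESS)                   echo "wait" ;;
        CREATE_COMPLETE|UPDATE_COMPLETE) echo "none" ;;
        *)                               echo "check-events" ;;
    esac
}

# Example (requires the AWS CLI and credentials):
# next_action "$(aws cloudformation describe-stacks --stack-name my-stack \
#     --query 'Stacks[0].StackStatus' --output text)"
```

Anything the helper does not recognize falls through to "check-events", which is the safe default: read the stack events before acting.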

The “first failure is the real cause” rule holds in most cases. Later failures are usually fallout: once one resource fails, CloudFormation cancels the resources that are still in progress (they show up as CREATE_FAILED with a reason like “Resource creation cancelled”) and never starts the resources that depend on the failed one. Skip to the chronologically first CREATE_FAILED event and fix that.

Common causes:

IAM permissions. The CloudFormation role does not have permission to create the resources.
Resource limits. AWS account limits (VPCs, Elastic IPs, etc.) are exceeded.
Invalid parameters. Wrong AMI ID, non-existent subnet, or invalid instance type.
Name conflicts. A resource with the same name already exists.
Dependency failures. A resource that another depends on failed to create.
Template errors. Invalid YAML/JSON, missing required properties, or wrong resource types.
Eventual consistency lag. IAM roles created moments earlier are not yet visible to the service trying to assume them.

Platform and Environment Differences

CloudFormation is a regional service, and what it can deploy and how it behaves changes by region, partition, account type, and the tool wrapping it. The same template that deploys cleanly in us-east-1 can fail in me-south-1 because a service is not available, or in GovCloud because ARNs and the available IAM actions differ.

Regional service availability. Not every AWS service exists in every region. Newer services (Bedrock, Q Developer, EventBridge Pipes) launch in a handful of regions first and roll out over months. Older mature services (EC2, S3, Lambda) are universal. A template that uses a newer resource type such as AWS::Bedrock::Agent can deploy in us-east-1 or us-west-2 yet fail in a region where the service has not launched, typically with an error saying the resource type is unrecognized or unsupported. Check the AWS Regional Services List before assuming a template is portable.

Resource quotas per region. Quotas (formerly “limits”) are per-region. Your account may have 5 VPCs in us-east-1 and 5 in eu-west-1 — independent budgets. Hitting the cap in one region does not affect others, but aws service-quotas calls must specify a region or default to your CLI config. Quota increases take time and are not always granted in newer or restricted regions.

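IAM eventual consistency. IAM is global but its propagation is eventual. A role created in step 1 of a CloudFormation stack may not be assumable in step 2 if the assuming service has not yet seen the new role. The classic symptom is a Lambda function that is created successfully, followed immediately by an error such as “The role defined for the function cannot be assumed by Lambda.” CloudFormation usually retries internally, but tight templates that immediately invoke Lambda from a custom resource can race. If you see this, add an explicit DependsOn on the role and its policies; if the race persists, insert a 30-60 second delay, for example a custom resource that sleeps. An AWS::CloudFormation::WaitCondition waits for a signal and fails when it times out, so it does not act as a delay timer without something to signal it.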

AWS GovCloud (US) vs commercial. GovCloud has separate ARN partitions (arn:aws-us-gov:...), separate IAM policies (some commercial actions do not exist), and a smaller service catalog. Templates that hardcode arn:aws:... fail in GovCloud. Use intrinsic functions like !Sub "arn:${AWS::Partition}:..." to stay portable.

AWS China (Beijing and Ningxia). Completely separate partition (arn:aws-cn:...), separate account system, fewer services, longer feature lag. Templates must use the partition prefix and may reference different AMI IDs. Most third-party CloudFormation registry resources are unavailable.
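Inside a template, ${AWS::Partition} handles this. For CLI arguments such as --role-arn, a small helper can do the same job. The sketch below assumes only the three partitions named above (commercial, GovCloud, China); the function names are hypothetical, and other partitions would need their own cases.

```shell
# Derive the ARN partition from a region name.
# Covers only the commercial, GovCloud (US), and China partitions.
partition_for_region() {
    case "$1" in
        cn-*)     echo "aws-cn" ;;
        us-gov-*) echo "aws-us-gov" ;;
        *)        echo "aws" ;;
    esac
}

# Build an IAM role ARN: $1 = region, $2 = account ID, $3 = role name.
role_arn() {
    echo "arn:$(partition_for_region "$1"):iam::$2:role/$3"
}

# Example:
# aws cloudformation create-stack --stack-name my-stack \
#     --template-body file://template.yaml \
#     --role-arn "$(role_arn us-gov-west-1 123456789012 CloudFormationRole)"
```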

CDK vs SAM vs raw CloudFormation rollback behavior. All three eventually produce CloudFormation stacks, but the rollback experience differs.

CDK synthesizes a CloudFormation template under cdk.out/. cdk deploy watches the stack and prints events live. On failure, it prints the first CREATE_FAILED event and exits. The synth step catches schema errors before deploy.
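SAM uses a CloudFormation transform (AWS::Serverless::Function expands into AWS::Lambda::Function plus an IAM role, plus API Gateway resources when the function has API events). sam deploy shows the same events. Transform errors are reported at the stack or change-set level rather than on an individual resource, so a search for resource-level CREATE_FAILED events can come up empty. Read the full aws cloudformation describe-stack-events output (and the change set's status reason) for the actual transform error.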
Raw CloudFormation via aws cloudformation create-stack returns immediately and you poll with wait stack-create-complete. No live event stream; you must call describe-stack-events yourself.
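For the raw CLI case, the create, wait, and inspect steps can be wrapped in one function. This is a minimal sketch, assuming a template with no parameters or IAM capabilities; the function name is hypothetical.

```shell
# Create a stack, wait for it, and print failure reasons if creation fails.
# $1 = stack name, $2 = path to the template file.
create_and_wait() {
    aws cloudformation create-stack \
        --stack-name "$1" \
        --template-body "file://$2" >/dev/null || return 1

    if aws cloudformation wait stack-create-complete --stack-name "$1"; then
        echo "created"
    else
        # The waiter only reports that creation failed; the reasons are in the events.
        aws cloudformation describe-stack-events \
            --stack-name "$1" \
            --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
            --output text
        return 1
    fi
}

# Example: create_and_wait my-stack template.yaml
```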

LocalStack for local testing. LocalStack mocks CloudFormation locally, but the mock is incomplete. Some resource types are unsupported (AWS::AppSync::*, AWS::Bedrock::*), some are stubbed (return success without actually creating), and IAM permission checks are loose. A stack that deploys to LocalStack may fail in real AWS due to missing IAM permissions, and vice versa. Use LocalStack for fast iteration on Lambda/S3/DynamoDB and validate critical changes in a real dev account.

Terraform vs CloudFormation. Both manage AWS resources but Terraform tracks state in its own file (terraform.tfstate) rather than in AWS. A failed Terraform apply leaves a partial state file but no ROLLBACK_COMPLETE — you fix and re-apply. CloudFormation’s strict rollback is safer for compliance but slower for iteration.

AWS Outposts and Local Zones. These are extensions of a parent region rather than regions themselves, and they expose only a subset of services. CloudFormation works, but many resource types are unavailable. Check the Outposts feature matrix before deploying.

Fix 1: Find the Root Cause in Events

Always check the stack events first:

aws cloudformation describe-stack-events \
    --stack-name my-stack \
    --query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
    --output table

In the AWS Console: CloudFormation → Your Stack → Events tab. Both the console and the CLI list events newest first, so the root cause is the CREATE_FAILED event with the earliest timestamp, which sits at the bottom of the list. Later failures are usually cascading.

Common error messages and fixes:

Error	Cause	Fix
`Access Denied`	Missing IAM permissions	Add permissions to the CloudFormation role
`Limit Exceeded`	AWS service quota reached	Request a quota increase
`Resource already exists`	Name conflict	Use unique names or import existing resources
`Parameter validation failed`	Invalid template parameter	Fix the parameter value
`Template format error`	Invalid YAML/JSON	Validate the template

Pro Tip: The first CREATE_FAILED event in the timeline is the root cause. All other failures after that are typically cascading failures caused by the first one. Fix the first error, and the rest usually resolve.
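Picking out the earliest real failure can be automated. The sketch below assumes tab-separated event rows on stdin (timestamp, logical ID, status, reason), which is what --output text produces for the query shown in the usage comment. It sorts by timestamp and skips "cancelled" rows, since those are fallout rather than causes. The function name is hypothetical.

```shell
# Print the earliest failed event that is not a cancellation.
# stdin rows: Timestamp <TAB> LogicalResourceId <TAB> ResourceStatus <TAB> ResourceStatusReason
first_failure() {
    # ISO 8601 timestamps in the same format sort correctly as plain text.
    sort | awk -F '\t' '
        $3 ~ /_FAILED$/ && $4 !~ /cancelled/ { print $2 ": " $4; exit }
    '
}

# Example (requires the AWS CLI and credentials):
# aws cloudformation describe-stack-events --stack-name my-stack \
#     --query 'StackEvents[].[Timestamp,LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
#     --output text | first_failure
```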

Fix 2: Delete ROLLBACK_COMPLETE Stacks

A stack in ROLLBACK_COMPLETE cannot be updated. You must delete and recreate it:

# Delete the failed stack
aws cloudformation delete-stack --stack-name my-stack

# Wait for deletion
aws cloudformation wait stack-delete-complete --stack-name my-stack

# Recreate
aws cloudformation create-stack \
    --stack-name my-stack \
    --template-body file://template.yaml \
    --parameters file://params.json \
    --capabilities CAPABILITY_NAMED_IAM

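If the delete fails and the stack lands in DELETE_FAILED (some resources cannot be cleaned up), retry the delete and retain those resources. The --retain-resources option only applies to stacks already in DELETE_FAILED:

# Retry the delete, retaining the resources that blocked it
aws cloudformation delete-stack \
    --stack-name my-stack \
    --retain-resources MyRDSInstance MyS3Bucket

This deletes the stack but leaves the specified resources in place. Clean them up manually later.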
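In scripts and CI jobs it is safer to delete only when the stack really is in ROLLBACK_COMPLETE, so that a healthy stack with the same name is never removed by accident. A minimal sketch of that guard follows; the function name is hypothetical.

```shell
# Delete a stack only if it is in ROLLBACK_COMPLETE. $1 = stack name.
delete_if_rollback_complete() {
    status=$(aws cloudformation describe-stacks \
        --stack-name "$1" \
        --query 'Stacks[0].StackStatus' \
        --output text) || return 1

    if [ "$status" = "ROLLBACK_COMPLETE" ]; then
        aws cloudformation delete-stack --stack-name "$1" &&
            aws cloudformation wait stack-delete-complete --stack-name "$1" &&
            echo "deleted"
    else
        # Leave any other state alone and report what was found.
        echo "skipped: $status"
    fi
}

# Example: delete_if_rollback_complete my-stack
```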

Fix 3: Fix IAM Permission Errors

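CloudFormation needs permission to create, and on rollback delete, every resource in the template. It uses the service role passed with --role-arn, or the caller's own credentials if no role is passed.

A broad policy like the following unblocks a sandbox quickly but is too permissive for production: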

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:*",
                "s3:*",
                "lambda:*",
                "iam:*",
                "logs:*"
            ],
            "Resource": "*"
        }
    ]
}

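For least privilege (recommended), grant only the actions your resource types need, including the delete actions CloudFormation uses during rollback:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSecurityGroup",
                "ec2:DeleteSecurityGroup",
                "ec2:DescribeSecurityGroups",
                "ec2:AuthorizeSecurityGroupIngress",
                "ec2:RunInstances",
                "ec2:DescribeInstances",
                "ec2:TerminateInstances"
            ],
            "Resource": "*"
        }
    ]
}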

Pass a role to CloudFormation:

aws cloudformation create-stack \
    --stack-name my-stack \
    --template-body file://template.yaml \
    --role-arn arn:aws:iam::123456789012:role/CloudFormationRole \
    --capabilities CAPABILITY_NAMED_IAM

Required capabilities for IAM resources:

# For IAM resources with custom names
--capabilities CAPABILITY_NAMED_IAM

# For IAM resources without custom names
--capabilities CAPABILITY_IAM

# For templates that contain macros or transforms (such as AWS::Serverless), including inside nested stacks
--capabilities CAPABILITY_AUTO_EXPAND

Common Mistake: Forgetting --capabilities CAPABILITY_NAMED_IAM. CloudFormation refuses to create IAM resources without explicit acknowledgment. The call fails with InsufficientCapabilitiesException before any resources are created.
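A quick text scan of the template can suggest which capabilities to pass. The sketch below is a grep-based heuristic, not a parser: it assumes a YAML template and errs toward CAPABILITY_NAMED_IAM whenever a name-like property appears alongside any IAM resource. The function name is hypothetical.

```shell
# Suggest --capabilities values for a YAML template. $1 = template path.
# Heuristic only: it matches text patterns and does not parse the template.
needed_capabilities() {
    caps=""
    if grep -q 'AWS::IAM::' "$1"; then
        # Custom names on IAM resources require the stronger acknowledgment.
        if grep -Eq '^[[:space:]]*(RoleName|UserName|GroupName|ManagedPolicyName|InstanceProfileName):' "$1"; then
            caps="CAPABILITY_NAMED_IAM"
        else
            caps="CAPABILITY_IAM"
        fi
    fi
    # Macros and transforms need CAPABILITY_AUTO_EXPAND when creating a stack directly.
    if grep -Eq '^Transform:|Fn::Transform' "$1"; then
        caps="${caps:+$caps }CAPABILITY_AUTO_EXPAND"
    fi
    echo "$caps"
}

# Example: needed_capabilities template.yaml
```

Because GroupName is also a property of AWS::EC2::SecurityGroup, the scan can over-report CAPABILITY_NAMED_IAM. That is harmless, since the named capability also covers unnamed IAM resources.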

Fix 4: Validate the Template Before Deploying

Catch errors before they cause a rollback:

# Validate template syntax
aws cloudformation validate-template --template-body file://template.yaml

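# Create a change set (dry run: nothing is deployed until you execute it)
# For a stack that does not exist yet, add: --change-set-type CREATE
# Add --capabilities as well if the template creates IAM resources
aws cloudformation create-change-set \
    --stack-name my-stack \
    --template-body file://template.yaml \
    --change-set-name preview \
    --parameters file://params.json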

# Review the change set
aws cloudformation describe-change-set \
    --stack-name my-stack \
    --change-set-name preview
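A change set against a stack that does not exist yet needs --change-set-type CREATE, while an existing stack uses UPDATE. One way to pick the right value in a script is sketched below; the function name is hypothetical. It treats any describe-stacks failure as "stack does not exist", which also covers credential or network errors, so treat the result with care.

```shell
# Print CREATE if the stack does not exist, UPDATE if it does. $1 = stack name.
# Assumption: any describe-stacks failure means the stack is missing.
change_set_type() {
    if aws cloudformation describe-stacks --stack-name "$1" >/dev/null 2>&1; then
        echo "UPDATE"
    else
        echo "CREATE"
    fi
}

# Example:
# aws cloudformation create-change-set \
#     --stack-name my-stack \
#     --template-body file://template.yaml \
#     --change-set-name preview \
#     --change-set-type "$(change_set_type my-stack)"
```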

Use cfn-lint for deeper validation:

pip install cfn-lint
cfn-lint template.yaml

cfn-lint knows the schema for every resource type, validates !Ref and !GetAtt targets, and checks regional resource availability via the --regions flag. It catches most issues that validate-template misses.

Common template errors:

# Wrong: missing required property
Resources:
  MySecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      VpcId: "vpc-12345"  # GroupDescription is required but missing

# Wrong: placeholder values that do not exist in the target account or region
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: "t2.micro"
      SubnetId: "subnet-12345"  # Must be a valid subnet ID in your VPC
      ImageId: "ami-12345"      # Must be a valid AMI in your region

# Wrong: referencing a non-existent resource
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      SecurityGroupIds:
        - !Ref NonExistentSG  # This resource doesn't exist in the template

Fix 5: Fix Resource Limit Errors

AWS has default limits on many resources:

# Check your current limits
aws service-quotas list-service-quotas --service-code ec2

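# Request an increase
# (L-1216C47A should be the vCPU quota for running On-Demand Standard instances;
#  confirm the quota code in the list output above before submitting)
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 10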

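Common default limits (these change over time; confirm the current values in Service Quotas):

Resource	Default Limit
VPCs per region	5
Elastic IPs per region	5
EC2 instances (on-demand)	varies by instance family (counted in vCPUs)
S3 buckets per account	10,000 (was 100 until late 2024)
Lambda concurrent executions per region	1,000
CloudFormation stacks per region	2,000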

Quota increase requests can take hours (mature regions, common quotas) to days (newer regions, vCPU increases). Plan ahead.
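Before deploying, a simple arithmetic check shows whether a quota has room for what the template will create. The helper below is a sketch with a hypothetical name. The quota code in the usage comment (L-F678F1CE for VPCs per region) is an assumption to verify with list-service-quotas.

```shell
# Succeed if current usage plus the new resources still fits under the quota.
# $1 = current count, $2 = quota value (may be a decimal such as 5.0), $3 = count to add.
has_headroom() {
    awk -v c="$1" -v l="$2" -v n="$3" 'BEGIN { exit !(c + n <= l) }'
}

# Example (requires the AWS CLI and credentials; quota code is an assumption):
# used=$(aws ec2 describe-vpcs --query 'length(Vpcs)' --output text)
# limit=$(aws service-quotas get-service-quota --service-code vpc \
#     --quota-code L-F678F1CE --query 'Quota.Value' --output text)
# has_headroom "$used" "$limit" 1 || echo "VPC quota would be exceeded"
```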

Fix 6: Disable Rollback for Debugging

Disable rollback to keep failed resources for inspection:

aws cloudformation create-stack \
    --stack-name my-stack \
    --template-body file://template.yaml \
    --disable-rollback

With rollback disabled, the stack stays in CREATE_FAILED state with the successfully created resources still running. You can inspect them to understand the failure.

Delete when done debugging:

aws cloudformation delete-stack --stack-name my-stack

Fix 7: Fix Circular Dependencies

CloudFormation cannot resolve circular references:

Broken:

Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      SecurityGroupIngress:
        - SourceSecurityGroupId: !Ref SecurityGroup  # References itself!

Fixed — use a separate ingress rule:

Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: My SG

  IngressRule:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref SecurityGroup
      SourceSecurityGroupId: !Ref SecurityGroup
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443

Fix 8: Fix UPDATE_ROLLBACK_FAILED

The worst state — the stack cannot complete its rollback:

# Continue the rollback, skipping problematic resources
aws cloudformation continue-update-rollback \
    --stack-name my-stack \
    --resources-to-skip MyLambdaFunction MyCustomResource

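This tells CloudFormation to skip the resources that cannot be rolled back and finish the rollback process. Skipped resources are marked UPDATE_COMPLETE even though their real state may no longer match the template, so reconcile them (fix or recreate the underlying resource) before the next update. After this, you can attempt the update again or delete the stack.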
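The resources worth considering for --resources-to-skip are the ones in UPDATE_FAILED. The sketch below, with a hypothetical function name, extracts their logical IDs from tab-separated describe-stack-resources output. Review the list before passing it on, since skipping a resource leaves it out of sync with the template.

```shell
# Print the logical IDs of resources in UPDATE_FAILED, space-separated.
# stdin rows: LogicalResourceId <TAB> ResourceStatus
failed_update_resources() {
    awk -F '\t' '
        $2 == "UPDATE_FAILED" { printf "%s%s", sep, $1; sep = " " }
        END { if (sep) print "" }
    '
}

# Example (requires the AWS CLI and credentials):
# aws cloudformation describe-stack-resources --stack-name my-stack \
#     --query 'StackResources[].[LogicalResourceId,ResourceStatus]' \
#     --output text | failed_update_resources
```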

Still Not Working?

Check CloudTrail for detailed API errors:

aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=ResourceName,AttributeValue=my-stack \
    --max-items 10

Check for nested stack failures. If your template uses nested stacks, check the events of the nested stack for the actual error.

Check for custom resource failures. Lambda-backed custom resources might fail silently. Check the Lambda function logs:

aws logs filter-log-events \
    --log-group-name /aws/lambda/my-custom-resource \
    --start-time $(date -d '1 hour ago' +%s000)
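The date -d '1 hour ago' form is GNU-specific and fails on macOS and other BSD systems. A variant that relies only on date +%s and shell arithmetic is sketched below; the function name is hypothetical.

```shell
# Print the epoch time, in milliseconds, from N seconds ago. $1 = seconds.
ms_ago() {
    echo $(( ($(date +%s) - $1) * 1000 ))
}

# Example (requires the AWS CLI and credentials):
# aws logs filter-log-events \
#     --log-group-name /aws/lambda/my-custom-resource \
#     --start-time "$(ms_ago 3600)"
```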

Use aws cloudformation describe-stack-resources to see which resources were created before the failure.

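Check for IAM eventual consistency. A role created moments earlier may not be visible when another resource tries to assume it. Wait 10–30 seconds and retry, or add an explicit DependsOn between the IAM role (and its policies) and the resource that uses it. If the race persists, put a short delay between them, for example a custom resource that sleeps. A WaitCondition waits for a signal rather than a fixed time, so it does not work as a timer without something to signal it. This is more common in fresh accounts and newer regions.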
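When the operation that races with IAM propagation is a CLI call of your own, a generic retry wrapper covers the "wait and retry" advice. This is a minimal sketch with a fixed delay; the function name and the retried command in the usage comment are illustrative.

```shell
# Run a command until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = delay in seconds between attempts, rest = command.
retry() {
    max=$1
    delay=$2
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            return 1
        fi
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example: retry a Lambda invocation up to 5 times, 10 seconds apart.
# retry 5 10 aws lambda invoke --function-name my-function /dev/null
```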

Check for cfn-response signaling in custom resources. Lambda-backed custom resources must send a response to CloudFormation, typically through the cfn-response (Node.js) or cfnresponse (Python) helper module, to signal success or failure. If the Lambda crashes before signaling, CloudFormation waits for the custom resource timeout (1 hour by default) before marking the resource as failed. Wrap the handler body in try/except (try/catch in Node.js) and always send a response.

Check for drift. aws cloudformation detect-stack-drift starts a drift detection run, and aws cloudformation describe-stack-resource-drifts then lists the resources that were modified outside CloudFormation (someone clicked “edit” in the console). Drift can cause update failures even when the template itself is fine.

Check service control policies (SCPs). Organizations with SCPs may deny actions that CloudFormation needs. The resulting error can look the same as a missing IAM permission, so check both the IAM permissions of the user or role and the SCPs applied to the account.

For AWS IAM permission errors, see Fix: AWS IAM AccessDeniedException. For S3 access issues, see Fix: AWS S3 Access Denied. For Lambda timeout issues, see Fix: AWS Lambda timeout. For credential resolution problems that masquerade as permission errors, see Fix: AWS Unable to locate credentials.