Fix: AWS CloudFormation stack in ROLLBACK_COMPLETE or CREATE_FAILED state
Quick Answer
How to fix AWS CloudFormation ROLLBACK_COMPLETE and CREATE_FAILED errors caused by IAM permissions, resource limits, invalid parameters, and dependency failures.
The Error
Your CloudFormation stack creation fails and enters:
Status: ROLLBACK_COMPLETE
Status reason: The following resource(s) failed to create: [MyEC2Instance, MySecurityGroup].
Or variations:
CREATE_FAILED - Resource handler returned message: "Access Denied (Service: S3, Status Code: 403)"
ROLLBACK_IN_PROGRESS - The following resource(s) failed to create: [MyLambdaFunction]. Rollback requested by user.
UPDATE_ROLLBACK_COMPLETE - Parameter validation failed
DELETE_FAILED - Cannot delete stack: it has nested stacks that are in DELETE_FAILED state
CloudFormation tried to create or update resources, one or more failed, and the stack rolled back to its previous state (or failed entirely for new stacks). A stack in ROLLBACK_COMPLETE cannot be updated; it must be deleted and recreated.
Why This Happens
CloudFormation orchestrates resource creation as a directed acyclic graph (DAG). It walks the template, figures out dependencies (explicit ones via DependsOn, implicit ones via !Ref and !GetAtt), then creates resources in topological order. When any resource creation fails, CloudFormation halts the forward pass and begins a backward pass: deleting everything it already created in reverse order. If the stack is new (no prior successful version), it ends in ROLLBACK_COMPLETE. If it is an update, it tries to restore the previous state and ends in UPDATE_ROLLBACK_COMPLETE.
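The ordering described above can be sketched in a few lines of Python. This is a simplified illustration, not CloudFormation's actual engine: it assumes a JSON-style template dict, follows only Ref, Fn::GetAtt and DependsOn (not Fn::Sub), and the names find_refs, creation_order, rollback_order and EXAMPLE_RESOURCES are hypothetical.

```python
from graphlib import TopologicalSorter


def find_refs(node):
    """Collect logical IDs referenced through Ref or Fn::GetAtt."""
    refs = set()
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Ref" and isinstance(value, str):
                refs.add(value)
            elif key == "Fn::GetAtt":
                # Fn::GetAtt is either ["LogicalId", "Attr"] or "LogicalId.Attr"
                target = value[0] if isinstance(value, list) else str(value).split(".")[0]
                refs.add(target)
            else:
                refs |= find_refs(value)
    elif isinstance(node, list):
        for item in node:
            refs |= find_refs(item)
    return refs


def creation_order(resources):
    """Return logical IDs in an order where dependencies come first."""
    graph = {}
    for logical_id, body in resources.items():
        deps = find_refs(body.get("Properties", {}))
        depends_on = body.get("DependsOn", [])
        deps |= {depends_on} if isinstance(depends_on, str) else set(depends_on)
        # Keep only references to other resources (drops parameters)
        graph[logical_id] = {d for d in deps if d in resources}
    # static_order() raises graphlib.CycleError on circular dependencies
    return list(TopologicalSorter(graph).static_order())


def rollback_order(resources):
    """Rollback deletes in the reverse of the creation order."""
    return list(reversed(creation_order(resources)))


# Hypothetical three-resource template: Route -> Instance -> SecurityGroup
EXAMPLE_RESOURCES = {
    "Route": {"Type": "AWS::EC2::Route", "Properties": {"InstanceId": {"Ref": "Instance"}}},
    "Instance": {
        "Type": "AWS::EC2::Instance",
        "Properties": {"SecurityGroupIds": [{"Fn::GetAtt": ["SecurityGroup", "GroupId"]}]},
    },
    "SecurityGroup": {
        "Type": "AWS::EC2::SecurityGroup",
        "Properties": {"VpcId": {"Ref": "VpcIdParam"}},
    },
}

print(creation_order(EXAMPLE_RESOURCES))
print(rollback_order(EXAMPLE_RESOURCES))
```

Because the example is a simple chain, the creation order is SecurityGroup, Instance, Route, and the rollback order is the reverse.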
The ROLLBACK_COMPLETE state is a one-way trap: you cannot update or retry a stack in this state, and the only operation CloudFormation accepts is a delete, after which you recreate the stack. (continue-update-rollback does not apply here; it recovers an update that is stuck in UPDATE_ROLLBACK_FAILED, covered in Fix 8.) This is a deliberate AWS design choice: the failed create left no previously deployed configuration to return to, so there is nothing for an update to build on.
The “first failure is the real cause” rule holds in the large majority of cases. Subsequent failures usually cascade: SecurityGroup creation fails → Instance referencing it fails to find it → RouteTable referencing the instance fails. Go to the earliest CREATE_FAILED event in the timeline and fix that.
Common causes:
- IAM permissions. The CloudFormation role does not have permission to create the resources.
- Resource limits. AWS account limits (VPCs, Elastic IPs, etc.) are exceeded.
- Invalid parameters. Wrong AMI ID, non-existent subnet, or invalid instance type.
- Name conflicts. A resource with the same name already exists.
- Dependency failures. A resource that another depends on failed to create.
- Template errors. Invalid YAML/JSON, missing required properties, or wrong resource types.
- Eventual consistency lag. IAM roles created moments earlier are not yet visible to the service trying to assume them.
Platform and Environment Differences
CloudFormation is a regional service: every stack lives in one region, and what it can deploy and how it behaves changes by region, account type, and the tool wrapping it. The same template that deploys cleanly in us-east-1 can fail in me-south-1 because a service is not available, or in GovCloud because the ARN partition and the available IAM actions differ.
Regional service availability. Not every AWS service exists in every region. Newer services (Bedrock, Q Developer, EventBridge Pipes) launch in a handful of regions first and roll out over months. Older mature services (EC2, S3, Lambda) are universal. A template that uses AWS::Bedrock::Agent deploys in us-east-1 and us-west-2 but fails with an unsupported or unrecognized resource type error in any region where the service has not launched yet. Check the AWS Regional Services List before assuming a template is portable.
Resource quotas per region. Quotas (formerly “limits”) are per-region. Your account may have 5 VPCs in us-east-1 and 5 in eu-west-1 — independent budgets. Hitting the cap in one region does not affect others, but aws service-quotas calls must specify a region or default to your CLI config. Quota increases take time and are not always granted in newer or restricted regions.
IAM eventual consistency. IAM is global but its propagation is eventual. A role created in step 1 of a CloudFormation stack may not be assumable in step 2 if the assuming service caches role metadata. The classic symptom is “Lambda function created successfully” followed immediately by “Function returned error: cannot assume execution role.” CloudFormation usually retries internally, but tight templates that immediately invoke Lambda from a custom resource can race. If you see this, make the ordering explicit with DependsOn and allow 30-60 seconds before the first invocation. Note that AWS::CloudFormation::WaitCondition waits for a signal rather than sleeping for a fixed time, so it only helps if something sends that signal once the role is usable.
AWS GovCloud (US) vs commercial. GovCloud has separate ARN partitions (arn:aws-us-gov:...), separate IAM policies (some commercial actions do not exist), and a smaller service catalog. Templates that hardcode arn:aws:... fail in GovCloud. Use intrinsic functions like !Sub "arn:${AWS::Partition}:..." to stay portable.
AWS China (Beijing and Ningxia). Completely separate partition (arn:aws-cn:...), separate account system, fewer services, longer feature lag. Templates must use the partition prefix and may reference different AMI IDs. Most third-party CloudFormation registry resources are unavailable.
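The same partition logic matters in scripts that build ARNs outside a template. The sketch below is a minimal illustration, assuming only the three partitions discussed here (isolated partitions are not covered); partition_for_region and build_arn are hypothetical helper names, and the account ID and resource are placeholders.

```python
def partition_for_region(region):
    """Map a region name to its ARN partition (commercial, China, GovCloud only)."""
    if region.startswith("cn-"):
        return "aws-cn"
    if region.startswith("us-gov-"):
        return "aws-us-gov"
    return "aws"


def build_arn(region, service, account_id, resource):
    """Build an ARN without hardcoding the arn:aws: prefix."""
    partition = partition_for_region(region)
    return f"arn:{partition}:{service}:{region}:{account_id}:{resource}"


# Placeholder account ID and function name, for illustration only
print(build_arn("us-gov-west-1", "lambda", "123456789012", "function:demo"))
print(build_arn("cn-north-1", "lambda", "123456789012", "function:demo"))
```

Inside a template, the AWS::Partition pseudo parameter does this job, so a helper like this is only relevant for tooling around the template.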
CDK vs SAM vs raw CloudFormation rollback behavior. All three eventually produce CloudFormation stacks, but the rollback experience differs.
- CDK synthesizes a CloudFormation template under cdk.out/. cdk deploy watches the stack and prints events live. On failure, it prints the first CREATE_FAILED event and exits. The synth step catches schema errors before deploy.
- SAM uses CloudFormation transforms (AWS::Serverless::Function becomes AWS::Lambda::Function plus IAM role plus API Gateway). sam deploy shows the same events. Transform errors appear as ROLLBACK_COMPLETE with a misleading “no failures in events” status; check aws cloudformation describe-stack-events for the actual transform error.
- Raw CloudFormation via aws cloudformation create-stack returns immediately and you poll with wait stack-create-complete. No live event stream; you must call describe-stack-events yourself.
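For the raw CloudFormation case, the polling you have to do yourself can be sketched as below. This is an illustration under stated assumptions: wait_for_stack and is_terminal are hypothetical names, and get_status is a caller-supplied function (with boto3 it would typically read StackStatus from describe_stacks). The sleep function is injectable so the loop can be exercised without waiting.

```python
import time


def is_terminal(status):
    """A stack status is terminal once no operation is in progress."""
    return not status.endswith("_IN_PROGRESS")


def wait_for_stack(get_status, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll get_status() until the stack settles, then return the final status."""
    for _ in range(max_polls):
        status = get_status()
        if is_terminal(status):
            return status
        sleep(poll_seconds)
    raise TimeoutError("stack did not reach a terminal status")


# Simulated status sequence for a failed create, so no AWS call is needed
statuses = iter(["CREATE_IN_PROGRESS", "ROLLBACK_IN_PROGRESS", "ROLLBACK_COMPLETE"])
final = wait_for_stack(lambda: next(statuses), sleep=lambda seconds: None)
print(final)
```

Returning the final status instead of raising on failure lets the caller decide what to do with ROLLBACK_COMPLETE, for example fetching the stack events.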
LocalStack for local testing. LocalStack mocks CloudFormation locally, but the mock is incomplete. Some resource types are unsupported (AWS::AppSync::*, AWS::Bedrock::*), some are stubbed (return success without actually creating), and IAM permission checks are loose. A stack that deploys to LocalStack may fail in real AWS due to missing IAM permissions, and vice versa. Use LocalStack for fast iteration on Lambda/S3/DynamoDB and validate critical changes in a real dev account.
Terraform vs CloudFormation. Both manage AWS resources but Terraform tracks state in its own file (terraform.tfstate) rather than in AWS. A failed Terraform apply leaves a partial state file but no ROLLBACK_COMPLETE — you fix and re-apply. CloudFormation’s strict rollback is safer for compliance but slower for iteration.
AWS Outposts and Local Zones. These are extensions of a parent region and expose only a subset of its services. CloudFormation works but many resource types are unavailable. Check the Outposts feature matrix before deploying.
Fix 1: Find the Root Cause in Events
Always check the stack events first:
aws cloudformation describe-stack-events \
--stack-name my-stack \
--query "StackEvents[?ResourceStatus=='CREATE_FAILED'].[LogicalResourceId,ResourceStatusReason]" \
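--output table
The CLI returns events newest first, so the earliest failure is the last matching row. In the AWS Console: CloudFormation → Your Stack → Events tab. Look for the first CREATE_FAILED event in time order; that is the root cause. Subsequent failures are usually cascading.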
Common error messages and fixes:
| Error | Cause | Fix |
|---|---|---|
| Access Denied | Missing IAM permissions | Add permissions to the CloudFormation role |
| Limit Exceeded | AWS service quota reached | Request a quota increase |
| Resource already exists | Name conflict | Use unique names or import existing resources |
| Parameter validation failed | Invalid template parameter | Fix the parameter value |
| Template format error | Invalid YAML/JSON | Validate the template |
Pro Tip: The first CREATE_FAILED event in the timeline is the root cause. All other failures after that are typically cascading failures caused by the first one. Fix the first error, and the rest usually resolve.
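If you process events in a script rather than reading the table, the "earliest failure wins" rule can be sketched as follows. This is a hedged illustration: root_cause and SAMPLE_EVENTS are hypothetical, the event dicts use the field names returned by describe-stack-events, and skipping reasons that contain "cancelled" is a heuristic for cascading failures, not an official rule.

```python
from datetime import datetime, timezone

FAILED_STATUSES = {"CREATE_FAILED", "UPDATE_FAILED"}


def root_cause(events):
    """Return the earliest failed event that is not a cancellation, or None."""
    candidates = [
        e for e in events
        if e["ResourceStatus"] in FAILED_STATUSES
        and "cancelled" not in e.get("ResourceStatusReason", "").lower()
    ]
    return min(candidates, key=lambda e: e["Timestamp"], default=None)


def _at(second):
    # Helper for building sample timestamps
    return datetime(2024, 1, 1, 12, 0, second, tzinfo=timezone.utc)


# Made-up events, newest first, mirroring the API ordering
SAMPLE_EVENTS = [
    {"LogicalResourceId": "MyRole", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Limit exceeded", "Timestamp": _at(6)},
    {"LogicalResourceId": "MyEC2Instance", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled", "Timestamp": _at(5)},
    {"LogicalResourceId": "MySecurityGroup", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Access Denied", "Timestamp": _at(4)},
    {"LogicalResourceId": "MyBucket", "ResourceStatus": "CREATE_COMPLETE",
     "Timestamp": _at(3)},
]

first = root_cause(SAMPLE_EVENTS)
print(first["LogicalResourceId"], "-", first["ResourceStatusReason"])
```

In this sample the function picks MySecurityGroup, because it is the earliest failure whose reason is not a cancellation.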
Fix 2: Delete ROLLBACK_COMPLETE Stacks
A stack in ROLLBACK_COMPLETE cannot be updated. You must delete and recreate it:
# Delete the failed stack
aws cloudformation delete-stack --stack-name my-stack
# Wait for deletion
aws cloudformation wait stack-delete-complete --stack-name my-stack
# Recreate
aws cloudformation create-stack \
--stack-name my-stack \
--template-body file://template.yaml \
--parameters file://params.json \
--capabilities CAPABILITY_NAMED_IAM
If delete fails (resources cannot be cleaned up):
# Retry the delete, keeping the resources that block it
# (--retain-resources is only accepted once the stack is in DELETE_FAILED)
aws cloudformation delete-stack \
--stack-name my-stack \
--retain-resources MyRDSInstance MyS3Bucket
This deletes the stack but leaves the specified resources in place. Clean them up manually later.
Fix 3: Fix IAM Permission Errors
CloudFormation needs permissions to create every resource in the template:
Add permissions to the CloudFormation role:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:*",
"s3:*",
"lambda:*",
"iam:*",
"logs:*"
],
"Resource": "*"
}
]
}
For least privilege (recommended):
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ec2:CreateSecurityGroup",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:RunInstances",
"ec2:DescribeInstances",
"ec2:TerminateInstances"
],
"Resource": "*"
}
]
}
Pass a role to CloudFormation:
aws cloudformation create-stack \
--stack-name my-stack \
--template-body file://template.yaml \
--role-arn arn:aws:iam::123456789012:role/CloudFormationRole \
--capabilities CAPABILITY_NAMED_IAM
Required capabilities for IAM resources:
# For IAM resources with custom names
--capabilities CAPABILITY_NAMED_IAM
# For IAM resources without custom names
--capabilities CAPABILITY_IAM
# For templates that contain macros or transforms (including in nested stacks)
--capabilities CAPABILITY_AUTO_EXPAND
Common Mistake: Forgetting --capabilities CAPABILITY_NAMED_IAM. CloudFormation refuses to create IAM resources without explicit acknowledgment. The request is rejected with InsufficientCapabilitiesException before any resources are created.
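To avoid guessing which flag a template needs, a pre-deploy script can inspect the template. The sketch below is a rough heuristic, not the check CloudFormation itself performs: required_capabilities and IAM_NAME_PROPERTIES are hypothetical names, it treats every AWS::IAM::* resource as needing acknowledgment, and it assumes any top-level Transform needs CAPABILITY_AUTO_EXPAND.

```python
# Name properties that indicate a custom-named IAM resource (partial list)
IAM_NAME_PROPERTIES = {
    "AWS::IAM::Role": "RoleName",
    "AWS::IAM::User": "UserName",
    "AWS::IAM::Group": "GroupName",
    "AWS::IAM::ManagedPolicy": "ManagedPolicyName",
    "AWS::IAM::InstanceProfile": "InstanceProfileName",
}


def required_capabilities(template):
    """Guess which --capabilities values a JSON-style template dict needs."""
    caps = set()
    for resource in template.get("Resources", {}).values():
        rtype = resource.get("Type", "")
        if not rtype.startswith("AWS::IAM::"):
            continue
        name_prop = IAM_NAME_PROPERTIES.get(rtype)
        if name_prop and name_prop in resource.get("Properties", {}):
            caps.add("CAPABILITY_NAMED_IAM")
        else:
            caps.add("CAPABILITY_IAM")
    if "CAPABILITY_NAMED_IAM" in caps:
        # NAMED_IAM also acknowledges IAM resources without custom names
        caps.discard("CAPABILITY_IAM")
    if "Transform" in template:
        caps.add("CAPABILITY_AUTO_EXPAND")
    return sorted(caps)


# Hypothetical template with one custom-named role
example = {
    "Resources": {
        "AppRole": {"Type": "AWS::IAM::Role", "Properties": {"RoleName": "app-role"}},
        "Logs": {"Type": "AWS::S3::Bucket"},
    }
}
print(required_capabilities(example))
```

Erring on the side of requesting a capability is harmless for the deploy itself, but the point of the acknowledgment is review, so treat the output as a prompt to look at the IAM resources rather than a flag to pass blindly.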
Fix 4: Validate the Template Before Deploying
Catch errors before they cause a rollback:
# Validate template syntax
aws cloudformation validate-template --template-body file://template.yaml
# Create a change set (dry run); for a stack that does not exist yet, add --change-set-type CREATE
aws cloudformation create-change-set \
--stack-name my-stack \
--template-body file://template.yaml \
--change-set-name preview \
--parameters file://params.json
# Review the change set
aws cloudformation describe-change-set \
--stack-name my-stack \
--change-set-name preview
Use cfn-lint for deeper validation:
pip install cfn-lint
cfn-lint template.yaml
cfn-lint knows the schema for every resource type, validates !Ref and !GetAtt targets, and checks regional resource availability via the --regions flag. It catches most issues that validate-template misses.
Common template errors:
# Wrong: missing required property
Resources:
  MyBucket:
    Type: AWS::S3::Bucket
    Properties: {} # Valid here because BucketName is optional, but most resource types have required properties
# Wrong: invalid property values
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: "t2.micro"
      SubnetId: "subnet-12345" # Must be a valid subnet ID in your VPC
      ImageId: "ami-12345" # Must be a valid AMI in your region
# Wrong: referencing a non-existent resource
Resources:
  MyInstance:
    Type: AWS::EC2::Instance
    Properties:
      SecurityGroupIds:
        - !Ref NonExistentSG # This resource doesn't exist in the template
Fix 5: Fix Resource Limit Errors
AWS has default limits on many resources:
# Check your current limits
aws service-quotas list-service-quotas --service-code ec2
# Request an increase
aws service-quotas request-service-quota-increase \
--service-code ec2 \
--quota-code L-1216C47A \
--desired-value 10
Common limits:
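| Resource | Default Limit |
|---|---|
| VPCs per region | 5 |
| Elastic IPs per region | 5 |
| EC2 instances (on-demand) | varies by type |
| S3 buckets per account | 10,000 (100 before the late-2024 increase) |
| Lambda concurrent executions per region | 1,000 |
| CloudFormation stacks per region | 2,000 |
Defaults change over time and can differ per account; confirm the current values with aws service-quotas before relying on this table.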
Quota increase requests can take hours (mature regions, common quotas) to days (newer regions, vCPU increases). Plan ahead.
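Since increases are slow, it can help to compare a template against your quotas before deploying. The sketch below is a simplified illustration: quota_shortfalls is a hypothetical helper, and the in_use and quotas mappings are assumed to be collected by the caller (for example from describe calls and Service Quotas), keyed by CloudFormation resource type.

```python
from collections import Counter


def quota_shortfalls(template, in_use, quotas):
    """Return {resource_type: missing_count} for types the stack would push over quota."""
    wanted = Counter(r["Type"] for r in template.get("Resources", {}).values())
    shortfalls = {}
    for rtype, limit in quotas.items():
        needed = in_use.get(rtype, 0) + wanted.get(rtype, 0)
        if needed > limit:
            shortfalls[rtype] = needed - limit
    return shortfalls


# Hypothetical numbers: 4 VPCs already exist, the template adds 2, quota is 5
template = {
    "Resources": {
        "VpcA": {"Type": "AWS::EC2::VPC"},
        "VpcB": {"Type": "AWS::EC2::VPC"},
        "Addr": {"Type": "AWS::EC2::EIP"},
    }
}
result = quota_shortfalls(
    template,
    in_use={"AWS::EC2::VPC": 4, "AWS::EC2::EIP": 1},
    quotas={"AWS::EC2::VPC": 5, "AWS::EC2::EIP": 5},
)
print(result)
```

An empty result means the counted resource types fit; it says nothing about quotas that are not expressed as a resource count, such as vCPU limits.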
Fix 6: Disable Rollback for Debugging
Disable rollback to keep failed resources for inspection:
aws cloudformation create-stack \
--stack-name my-stack \
--template-body file://template.yaml \
--disable-rollback
With rollback disabled, the stack stays in CREATE_FAILED state with the successfully created resources still running. You can inspect them to understand the failure.
Delete when done debugging:
aws cloudformation delete-stack --stack-name my-stack
Fix 7: Fix Circular Dependencies
CloudFormation cannot resolve circular references:
Broken:
Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      SecurityGroupIngress:
        - SourceSecurityGroupId: !Ref SecurityGroup # References itself!
Fixed (use a separate ingress rule):
Resources:
  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: My SG
  IngressRule:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref SecurityGroup
      SourceSecurityGroupId: !Ref SecurityGroup
      IpProtocol: tcp
      FromPort: 443
      ToPort: 443
Fix 8: Fix UPDATE_ROLLBACK_FAILED
The worst state — the stack cannot complete its rollback:
# Continue the rollback, skipping problematic resources
aws cloudformation continue-update-rollback \
--stack-name my-stack \
--resources-to-skip MyLambdaFunction MyCustomResource
This tells CloudFormation to skip the resources that cannot be rolled back and finish the rollback process. After this, you can attempt the update again or delete the stack.
Still Not Working?
Check CloudTrail for detailed API errors:
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=ResourceName,AttributeValue=my-stack \
--max-items 10
Check for nested stack failures. If your template uses nested stacks, check the events of the nested stack for the actual error.
Check for custom resource failures. Lambda-backed custom resources might fail silently. Check the Lambda function logs:
aws logs filter-log-events \
--log-group-name /aws/lambda/my-custom-resource \
--start-time $(date -d '1 hour ago' +%s000) # GNU date; on macOS use: date -v-1H +%s000
Use aws cloudformation describe-stack-resources to see which resources were created before the failure.
Check for IAM eventual consistency. A role created moments earlier may not be visible when another resource tries to assume it. Wait 10-30 seconds and retry, or add an explicit DependsOn between the IAM role and the resource that uses it (a WaitCondition only helps if something signals it after the delay). This is more common in fresh accounts and newer regions.
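In code that calls a service right after creating a role, the wait-and-retry advice can be sketched as a small backoff loop. This is a generic illustration, not an AWS API: retry, is_retryable and the simulated error are hypothetical, and the default delays (2, 4, 8, 16 seconds) are an assumption chosen to land in the 10-30 second range.

```python
import time


def retry(operation, is_retryable, attempts=5, base_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying with exponential backoff on retryable errors."""
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that will not heal with time
            if attempt == attempts - 1 or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)


# Simulated call that fails twice while the role propagates, then succeeds
calls = {"count": 0}


def assume_new_role():
    calls["count"] += 1
    if calls["count"] < 3:
        raise PermissionError("cannot assume execution role")
    return "ok"


delays = []
outcome = retry(
    assume_new_role,
    is_retryable=lambda exc: isinstance(exc, PermissionError),
    sleep=delays.append,
)
print(outcome, delays)
```

Restricting retries through is_retryable matters here: a genuinely missing permission looks the same as a propagation delay on the first attempt, and the attempt cap keeps that case from retrying forever.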
Check for cfn-response signaling in custom resources. Lambda-backed custom resources must call cfn-response (Node.js) or cfnresponse (Python) to signal success or failure. If the Lambda crashes before signaling, CloudFormation waits the full 1-hour timeout before marking the resource as failed. Wrap the handler body in try/except and always send a response.
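The "always send a response" rule can be sketched as a wrapper around the handler. This is an illustration under assumptions: make_handler and work are hypothetical names, send stands in for a function such as cfnresponse.send (reduced here to four arguments), and treating Delete as a no-op is a simplification that only suits resources with nothing to clean up.

```python
def make_handler(work, send):
    """Wrap work(event) so that a response is sent on every code path."""
    def handler(event, context):
        status, data = "SUCCESS", {}
        try:
            if event["RequestType"] != "Delete":
                data = work(event)
        except Exception as exc:
            # Report the failure instead of letting CloudFormation wait for a timeout
            status, data = "FAILED", {"Error": str(exc)}
        send(event, context, status, data)
    return handler


# Fake sender that records responses, so the sketch runs outside Lambda
sent = []


def fake_send(event, context, status, data):
    sent.append((event["RequestType"], status, data))


def failing_work(event):
    raise RuntimeError("boom")


make_handler(lambda event: {"Value": 42}, fake_send)({"RequestType": "Create"}, None)
make_handler(failing_work, fake_send)({"RequestType": "Update"}, None)
make_handler(failing_work, fake_send)({"RequestType": "Delete"}, None)
print(sent)
```

Sending the response outside the try block keeps a single exit path, so a failure in work can never skip the signal.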
Check for drift. aws cloudformation detect-stack-drift starts a drift detection run, and aws cloudformation describe-stack-resource-drifts then lists resources that were modified outside CloudFormation (someone clicked “edit” in the console). Drift can cause update failures even when the template itself is fine.
Check service control policies (SCPs). Organizations with SCPs may deny actions that CloudFormation needs. The error message is identical to a missing IAM permission. Check both the user’s IAM permissions and the SCP applied to the account.
For AWS IAM permission errors, see Fix: AWS IAM AccessDeniedException. For S3 access issues, see Fix: AWS S3 Access Denied. For Lambda timeout issues, see Fix: AWS Lambda timeout. For credential resolution problems that masquerade as permission errors, see Fix: AWS Unable to locate credentials.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.