Fix: AWS Step Functions Not Working — ASL Syntax, Map State, Error Handling, and IAM
Quick Answer
How to fix AWS Step Functions errors — Amazon States Language syntax, Standard vs Express workflows, Distributed Map for large datasets, Retry/Catch error handling, Lambda invoke optimization, and IAM execution role permissions.
The Error
You start an execution and a state fails because a JSONPath doesn’t match the input:
States.Runtime: An error occurred while executing the state.
The JSONPath '$.user.id' specified for the field 'InputPath' could
not be found in the input
Or a Lambda task throws and the workflow doesn’t catch it:
ExecutionFailed: States.TaskFailed in state 'CallLambda'
Function returned an error, but no Catch was defined.
Or a Map state iterating over 100,000 items fails:
States.MapStateFailed: Map state execution exceeded the maximum
number of concurrent iterations.
Or IAM denies access to a downstream service:
States.Runtime: AccessDenied - User: arn:aws:iam::123456789012:role/StepFunctions-Role
is not authorized to perform: lambda:InvokeFunction
Why This Happens
Step Functions orchestrate AWS services via a JSON-based DSL called Amazon States Language (ASL). Most issues come from:
- ASL is strict. Missing End or Next on a state fails validation. JSONPath paths that don’t exist throw at runtime.
- Standard vs Express. Standard workflows last up to a year, are billed per state transition, and store full history. Express workflows are limited to 5 minutes, are billed per execution plus duration, and don’t persist history by default. They’re not interchangeable.
- Map state has two modes. Inline Map (the default) runs inside the parent execution and is limited to 40 concurrent iterations and a 256 KB payload. Distributed Map handles millions of items, typically reading its input from S3 and writing its results back to S3.
- IAM is layered. The state machine has an execution role. That role needs permission to invoke each service the state machine touches (Lambda, SQS, DynamoDB).
Fix 1: Write Valid ASL
{
  "Comment": "Process an order",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
        "Payload.$": "$"
      },
      "ResultPath": "$.validation",
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
        "Payload": {
          "orderId.$": "$.orderId",
          "amount.$": "$.validation.Payload.amount"
        }
      },
      "ResultPath": "$.payment",
      "End": true
    }
  }
}
Three required pieces per state:
- Type: Task, Choice, Wait, Map, Parallel, Pass, Succeed, Fail.
- Next or End: what comes after. Every non-terminal state needs "Next": "...".
- Resource: for Task, the ARN of the service integration.
Common JSONPath patterns:
- "Payload.$": "$" passes the entire input.
- "orderId.$": "$.orderId" pulls a specific field. The .$ key suffix is required for JSONPath references.
- "ResultPath": "$.payment" sets where the state’s result is merged into the state input.
- "OutputPath": "$.payment" discards everything else and emits just this path.
Pro Tip: Use the visual editor in the AWS Console to design the state machine, then export the ASL. The validator catches missing transitions and syntax errors before you deploy.
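The structural rules above can also be checked locally before deploying. The following is a minimal Python sketch, not the official validator: `lint_asl` is a hypothetical helper that only checks top-level states for a Type, a Resource on Task states, a Next or End, and transitions that point to real states. It does not descend into Map or Parallel branches or inspect Choice rules.

```python
TERMINAL_TYPES = {"Succeed", "Fail"}

def lint_asl(definition):
    """Return a list of structural problems in an ASL definition (empty if none)."""
    problems = []
    states = definition.get("States", {})
    start = definition.get("StartAt")
    if start not in states:
        problems.append(f"StartAt {start!r} is not a defined state")
    for name, state in states.items():
        stype = state.get("Type")
        if stype is None:
            problems.append(f"{name}: missing Type")
            continue
        if stype == "Task" and "Resource" not in state:
            problems.append(f"{name}: Task state needs a Resource")
        # Choice states route through their rules; Succeed and Fail are terminal.
        if stype not in TERMINAL_TYPES and stype != "Choice":
            if "Next" not in state and not state.get("End"):
                problems.append(f"{name}: needs Next or End")
        target = state.get("Next")
        if target is not None and target not in states:
            problems.append(f"{name}: Next points to unknown state {target!r}")
    return problems

# A Task state with no transition, the most common validation failure.
broken = {
    "StartAt": "A",
    "States": {"A": {"Type": "Task", "Resource": "arn:aws:states:::lambda:invoke"}},
}
print(lint_asl(broken))  # ['A: needs Next or End']
```

Running a check like this in CI catches the cheap mistakes early; the service-side validator remains the source of truth.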
Fix 2: Standard vs Express Workflows
Pick based on duration and history needs:
- Standard — up to 1 year, full execution history, $0.025 per 1K state transitions. For long-running orchestration, human approvals, audit trails.
- Express — up to 5 minutes, billed per execution + duration, no history (logs to CloudWatch if enabled). For high-throughput API backends, event processing.
Set the type at create time:
# SAM template:
Resources:
  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: state-machine.asl.json
      Type: STANDARD # or EXPRESS
      Role: !GetAtt MyRole.Arn
You can’t convert between Standard and Express in place; create a new state machine.
For Express workflows that need history:
LoggingConfiguration:
  Level: ALL
  IncludeExecutionData: true
  Destinations:
    - CloudWatchLogsLogGroup:
        LogGroupArn: !GetAtt MyLogGroup.Arn
This sends Express execution logs to CloudWatch, which stands in for the missing in-product history.
Common Mistake: Using Express for workflows with human approval steps. Express has a 5-minute hard limit; humans take longer than 5 minutes. Use Standard.
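The decision above can be written down as a rough rule of thumb. This is only a sketch under the limits and price quoted in this article (5-minute Express limit, $0.025 per 1K Standard transitions); `choose_workflow_type` and `standard_cost_usd` are hypothetical helpers, and current AWS pricing should be checked before relying on the numbers.

```python
EXPRESS_MAX_SECONDS = 5 * 60                   # Express duration limit cited above
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000   # $0.025 per 1K state transitions

def choose_workflow_type(max_duration_seconds, needs_history=False, waits_for_human=False):
    """Pick STANDARD when the workflow is long, audited, or waits on a person."""
    if max_duration_seconds > EXPRESS_MAX_SECONDS or needs_history or waits_for_human:
        return "STANDARD"
    return "EXPRESS"

def standard_cost_usd(executions, transitions_per_execution):
    """Estimate the Standard workflow bill from state transitions alone."""
    return executions * transitions_per_execution * STANDARD_PRICE_PER_TRANSITION

print(choose_workflow_type(30))                         # short, no audit trail
print(choose_workflow_type(30, waits_for_human=True))   # approval step
print(standard_cost_usd(1_000_000, 10))                 # one million runs, 10 transitions each
```

The cost function makes the billing model concrete: a loop that adds transitions multiplies the Standard bill, which is why high-volume sub-workflows are usually pushed to Express.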
Fix 3: Map State — Inline vs Distributed
Inline Map (the default) runs at most 40 iterations concurrently, and its input and output must fit within the 256 KB payload limit:
{
  "MyMap": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "ResultPath": "$.results",
    "End": true
  }
}
For larger datasets, use Distributed Map. It reads input from S3 and can iterate over millions of items:
{
  "MyMap": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "items/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "EXPRESS"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "MaxConcurrency": 1000,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "results/"
      }
    },
    "End": true
  }
}
Distributed Map:
- Mode: "DISTRIBUTED" in ProcessorConfig switches from inline.
- ExecutionType: "EXPRESS" is recommended for high throughput; iteration sub-workflows run as Express.
- ItemReader can be an S3 object listing (listObjectsV2) or the contents of a single S3 object such as a CSV or JSON Lines file (getObject). To iterate over DynamoDB items, read them in an earlier state and pass the array in through ItemsPath.
- ResultWriter persists results to S3, which avoids accumulating every result in the parent state.
Pro Tip: For batch jobs over thousands of items, always use Distributed Map. Inline Map caps concurrency at 40 and accumulates every result in the parent execution’s state, which is slow and can exceed the 256 KB payload limit.
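Whether a dataset fits Inline Map can be estimated before the workflow runs. The sketch below assumes the 256 KB payload limit quoted above applies to both the serialized input array and the combined results; `pick_map_mode` and its `per_item_result_bytes` parameter are hypothetical names, and the size estimate is approximate.

```python
import json

INLINE_MAX_PAYLOAD_BYTES = 256 * 1024  # payload limit cited above

def pick_map_mode(items, per_item_result_bytes=0):
    """Suggest INLINE or DISTRIBUTED for a list of JSON-serializable items."""
    input_bytes = len(json.dumps(items).encode("utf-8"))
    # Inline Map collects every iteration's result in the parent state.
    result_bytes = per_item_result_bytes * len(items)
    if max(input_bytes, result_bytes) > INLINE_MAX_PAYLOAD_BYTES:
        return "DISTRIBUTED"
    return "INLINE"

small_batch = [{"id": i} for i in range(10)]
large_batch = [{"id": i} for i in range(100_000)]
print(pick_map_mode(small_batch))
print(pick_map_mode(large_batch))
```

Note that a small input can still overflow if each iteration returns a large result, which is why the result size is checked separately.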
Fix 4: Error Handling With Retry and Catch
For transient errors, retry:
{
  "MyTask": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2,
        "JitterStrategy": "FULL"
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "Next": "HandleFailure",
        "ResultPath": "$.error"
      }
    ],
    "Next": "Success"
  }
}
Retry:
- ErrorEquals: list of error names to retry. States.ALL matches any known error name.
- IntervalSeconds: delay before the first retry.
- MaxAttempts: maximum number of retries (not counting the initial attempt).
- BackoffRate: multiplier applied to the delay after each retry (2 doubles it each time).
- JitterStrategy: "FULL" randomizes each delay (recommended when many executions retry at once).
Catch:
- Runs when retries are exhausted (or the error doesn’t match Retry.ErrorEquals).
- ResultPath: "$.error" stores the error info at this path for the next state to inspect.
- Next: the failure-handling state.
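To see what the Retry fields add up to, the delays can be computed by hand. This Python sketch assumes each delay is IntervalSeconds multiplied by BackoffRate once per previous retry, and that FULL jitter picks a random value between zero and that delay; `retry_delays` is a hypothetical helper, not part of any AWS SDK.

```python
import random

def retry_delays(interval_seconds, max_attempts, backoff_rate, jitter=False, rng=random):
    """Return the wait (in seconds) before each retry attempt."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * (backoff_rate ** attempt)
        if jitter:
            # Assumed FULL jitter behavior: uniform between 0 and the computed delay.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays

# The retrier from the example above: 2s, then 4s, then 8s.
print(retry_delays(2, 3, 2))  # [2, 4, 8]
```

With those settings the task is attempted four times in total, and the Catch only runs after roughly 14 seconds of waiting, which matters when the state also has a timeout.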
In the failure handler, you can read the error:
{
  "HandleFailure": {
    "Type": "Pass",
    "Parameters": {
      "errorType.$": "$.error.Error",
      "errorMessage.$": "$.error.Cause"
    },
    "Next": "Notify"
  }
}
Common Mistake: Catching States.ALL without inspecting the error. You end up handling errors you shouldn’t (like States.Timeout, which should bubble up). Be specific about what you catch.
Fix 5: Choice State Syntax
{
  "RouteByStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.order.status",
        "StringEquals": "pending",
        "Next": "ProcessPending"
      },
      {
        "Variable": "$.order.status",
        "StringEquals": "confirmed",
        "Next": "ProcessConfirmed"
      },
      {
        "And": [
          { "Variable": "$.order.amount", "NumericGreaterThan": 100 },
          { "Variable": "$.order.region", "StringEquals": "US" }
        ],
        "Next": "HighValueUSFlow"
      }
    ],
    "Default": "ProcessUnknown"
  }
}
Operators:
- StringEquals, StringMatches, StringGreaterThan, etc.: string comparisons.
- NumericEquals, NumericGreaterThan, etc.: number comparisons.
- BooleanEquals: boolean.
- TimestampLessThan, TimestampLessThanEqualsPath, etc.: timestamp.
- IsPresent, IsString, IsNumeric, IsBoolean, IsNull, IsTimestamp: type checks.
Combinations:
- And: [...] means all must match.
- Or: [...] means any may match.
- Not: { ... } negates a rule.
Provide Default whenever it’s possible that no rule matches; without it, the execution fails with States.NoChoiceMatched.
For dynamic comparisons (compare two paths):
{
  "Variable": "$.requested",
  "NumericGreaterThanPath": "$.available"
}
The Path suffix on the operator means “compare against another JSONPath.”
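Choice rules are evaluated top to bottom, and the first match wins. The Python sketch below mimics that evaluation for a handful of operators so routing can be unit-tested offline; it is an illustration, not the Step Functions engine. `get_path`, `rule_matches`, and `next_state` are hypothetical helpers, and only simple dotted paths are supported.

```python
def get_path(doc, path):
    """Resolve a simple dotted JSONPath such as '$.order.status'."""
    value = doc
    for key in path.lstrip("$").strip(".").split("."):
        if key:
            value = value[key]
    return value

def rule_matches(rule, doc):
    """Evaluate one Choice rule (a small subset of the real operators)."""
    if "And" in rule:
        return all(rule_matches(r, doc) for r in rule["And"])
    if "Or" in rule:
        return any(rule_matches(r, doc) for r in rule["Or"])
    if "Not" in rule:
        return not rule_matches(rule["Not"], doc)
    value = get_path(doc, rule["Variable"])
    if "StringEquals" in rule:
        return value == rule["StringEquals"]
    if "NumericGreaterThan" in rule:
        return value > rule["NumericGreaterThan"]
    if "NumericGreaterThanPath" in rule:
        return value > get_path(doc, rule["NumericGreaterThanPath"])
    raise ValueError(f"unsupported operator in rule: {rule}")

def next_state(choice_state, doc):
    """Return the first matching rule's Next, else Default, else fail."""
    for rule in choice_state["Choices"]:
        if rule_matches(rule, doc):
            return rule["Next"]
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("States.NoChoiceMatched")
```

Feeding the RouteByStatus state from above into `next_state` with a few sample orders is a quick way to confirm that rule order and the Default branch behave as intended.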
Fix 6: IAM Execution Role
The state machine’s execution role needs to invoke every service it touches:
StepFunctionsRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: states.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: InvokeLambdas
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            # Invoke only the functions this workflow calls
            - Effect: Allow
              Action: lambda:InvokeFunction
              Resource:
                - !GetAtt ValidateLambda.Arn
                - !GetAtt ProcessLambda.Arn
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
              Resource: !GetAtt MyTable.Arn
            # CloudWatch Logs delivery, needed when logging is enabled
            - Effect: Allow
              Action:
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: "*"
For X-Ray tracing:
- Effect: Allow
  Action:
    - xray:PutTraceSegments
    - xray:PutTelemetryRecords
  Resource: "*"
For Distributed Map (needs to start child executions):
- Effect: Allow
  Action: states:StartExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${StateMachineName}"
- Effect: Allow
  Action:
    - states:DescribeExecution
    - states:StopExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${StateMachineName}:*"
Common Mistake: Granting lambda:InvokeFunction on *. Scope to specific Lambda ARNs. A leaky role lets the state machine invoke functions it shouldn’t.
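One way to keep the role in step with the definition is to derive the needed actions from the Task resources. The sketch below is illustrative only: `INTEGRATION_ACTIONS` is a hand-written, deliberately incomplete lookup table, `required_actions` is a hypothetical helper, and it only inspects top-level states (not Map or Parallel branches). Anything it does not recognize is reported rather than guessed.

```python
# Hypothetical lookup table covering a few integrations; extend as needed.
INTEGRATION_ACTIONS = {
    "arn:aws:states:::lambda:invoke": "lambda:InvokeFunction",
    "arn:aws:states:::dynamodb:getItem": "dynamodb:GetItem",
    "arn:aws:states:::dynamodb:putItem": "dynamodb:PutItem",
    "arn:aws:states:::sqs:sendMessage": "sqs:SendMessage",
}

def required_actions(definition):
    """Return (actions, unknown_resources) for the top-level Task states."""
    actions, unknown = set(), set()
    for state in definition.get("States", {}).values():
        if state.get("Type") != "Task":
            continue
        resource = state["Resource"]
        # Drop suffixes such as .waitForTaskToken before the lookup.
        base = resource.split(".")[0]
        if base.startswith("arn:aws:lambda:"):
            actions.add("lambda:InvokeFunction")  # direct function ARN
        elif base in INTEGRATION_ACTIONS:
            actions.add(INTEGRATION_ACTIONS[base])
        else:
            unknown.add(resource)
    return actions, unknown
```

Comparing the returned set against the policy in a test makes a missing permission show up at build time instead of as an AccessDenied at 3 a.m.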
Fix 7: Lambda Integration — Optimized vs Standard
For Lambda invocation, there are two integration types:
Optimized integration (arn:aws:states:::lambda:invoke):
{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "...",
    "Payload.$": "$"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException"
      ],
      "IntervalSeconds": 1,
      "MaxAttempts": 3
    }
  ]
}
The output has Lambda metadata wrapping:
{
  "ExecutedVersion": "$LATEST",
  "Payload": { ...your actual response... },
  "SdkHttpMetadata": { ... }
}
Read your data via $.Payload.
Standard ARN integration (arn:aws:lambda:...:function:MyFunction):
{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
  "InputPath": "$",
  "ResultPath": "$.result"
}
The output is just your Lambda’s response, with no wrapping. But error retries are less granular.
Pro Tip: Always use optimized integration (arn:aws:states:::lambda:invoke) for production, and pair it with a Retry on Lambda.ServiceException and Lambda.TooManyRequestsException (as in the example above) so transient AWS issues are retried automatically.
Fix 8: Wait States and Heartbeats
Wait for a duration:
{
  "WaitFiveSeconds": {
    "Type": "Wait",
    "Seconds": 5,
    "Next": "Continue"
  }
}
Wait until a specific time:
{
  "WaitUntil2027": {
    "Type": "Wait",
    "Timestamp": "2027-01-01T00:00:00Z",
    "Next": "Continue"
  }
}
Wait for a path-resolved value:
{
  "WaitUntilUserSchedule": {
    "Type": "Wait",
    "TimestampPath": "$.scheduledAt",
    "Next": "Continue"
  }
}
For human-in-the-loop with TaskToken:
{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "SendApprovalRequest",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "userId.$": "$.userId"
      }
    },
    "Next": "Continue"
  }
}
The Lambda sends an approval link to the user, embedding taskToken. When the user clicks “Approve,” your backend calls SendTaskSuccess with the token, unblocking the state machine.
aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"approved": true}'
This pattern supports human approvals, external system callbacks, and async work that doesn’t fit in a single Lambda timeout.
Common Mistake: Forgetting .waitForTaskToken on the Resource ARN. Without it, the task completes immediately on Lambda return, instead of waiting for the explicit SendTaskSuccess call.
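The callback side of the task-token pattern might look like the Python sketch below. It assumes a boto3 Step Functions client is passed in (`send_task_success` and `send_task_failure` are its real methods); the `complete_approval` helper and the `ApprovalRejected` error name are hypothetical and chosen only for illustration.

```python
import json

def complete_approval(sfn_client, task_token, approved, reason=""):
    """Resume (or fail) an execution paused on .waitForTaskToken."""
    if approved:
        return sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True}),  # becomes the task's result
        )
    return sfn_client.send_task_failure(
        taskToken=task_token,
        error="ApprovalRejected",  # hypothetical error name, matchable in Catch
        cause=reason or "Rejected by approver",
    )

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   complete_approval(boto3.client("stepfunctions"), token, approved=True)
```

Taking the client as a parameter keeps the function testable with a stub, and reporting a rejection through SendTaskFailure lets the state machine route it with a Catch instead of timing out.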
Still Not Working?
A few less-obvious failures:
- State machine fails to start. Check the execution role’s trust policy: it must include states.amazonaws.com as Principal.
- States.Runtime with cryptic JSONPath. Your input doesn’t have the field you expect. Add a Pass state at the start to surface the input, then read it in the execution history or CloudWatch logs.
- Step Functions billed more than expected. Each state transition costs money (Standard). Loops with many iterations add up. Use Express for high-volume sub-workflows.
- Map state fails on large inputs in inline mode. Switch to Distributed Map with S3 reader/writer.
- Retry doesn’t retry the error you saw. Match the exact error type. Errors from Lambda have specific names; States.TaskFailed is the generic wrapper.
- Pass state doesn’t change input. By default, Pass passes input through. Use Parameters to transform, or ResultPath to merge a literal Result.
- Choice state has no default and execution fails. Always provide Default. Even "Default": "FailExplicitly" (a Fail state) is better than no default.
- Time zones in Timestamp field. Always UTC. Convert in your code if your data is in local time.
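For the JSONPath failure in particular, it helps to check the input before starting an execution. This is a minimal sketch that handles only simple dotted paths such as $.user.id (no array indexes or filters); `missing_paths` is a hypothetical helper.

```python
def missing_paths(doc, paths):
    """Return the simple dotted paths (e.g. '$.user.id') that are absent from doc."""
    missing = []
    for path in paths:
        node = doc
        keys = path[2:] if path.startswith("$.") else path
        for key in keys.split("."):
            if not isinstance(node, dict) or key not in node:
                missing.append(path)
                break
            node = node[key]
    return missing

execution_input = {"user": {"name": "Aiko"}, "orderId": "o-1"}
print(missing_paths(execution_input, ["$.user.id", "$.orderId"]))  # ['$.user.id']
```

Calling this with every path the definition references, just before StartExecution, turns a mid-workflow States.Runtime failure into an immediate, readable error in your own code.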
For related AWS orchestration and serverless issues, see AWS Lambda timeout, AWS Lambda cold start timeout, AWS IAM permission denied, and AWS SQS not working.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.