Fix: AWS Step Functions Not Working — ASL Syntax, Map State, Error Handling, and IAM

FixDevs

Quick Answer

How to fix AWS Step Functions errors — Amazon States Language syntax, Standard vs Express workflows, Distributed Map for large datasets, Retry/Catch error handling, Lambda invoke optimization, and IAM execution role permissions.

The Error

You start an execution and it fails at runtime because a JSONPath does not resolve:

States.Runtime: An error occurred while executing the state.
The JSONPath '$.user.id' specified for the field 'InputPath' could 
not be found in the input

Or a Lambda task throws and the workflow doesn’t catch it:

ExecutionFailed: States.TaskFailed in state 'CallLambda'
Function returned an error, but no Catch was defined.

Or a Map state iterating over 100,000 items fails:

States.MapStateFailed: Map state execution exceeded the maximum 
number of concurrent iterations.

Or IAM denies access to a downstream service:

States.Runtime: AccessDenied — User: arn:aws:iam::123456789012:role/StepFunctions-Role 
is not authorized to perform: lambda:InvokeFunction

Why This Happens

Step Functions orchestrates AWS services using a JSON-based DSL called Amazon States Language (ASL). Most issues come down to the following:

  • ASL is strict. Missing End or Next on a state fails validation. JSONPath paths that don’t exist throw at runtime.
  • Standard vs Express. Standard workflows last up to a year, are billed per state transition, and store full history. Express are limited to 5 minutes, billed per execution + duration, and don’t persist history by default. They’re not interchangeable.
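  • Map state has two modes. Inline Map (default) runs inside the parent execution and is limited to 40 concurrent iterations and a 256 KB payload. Distributed Map runs each iteration as a child execution, scales to millions of items, and typically reads its input from S3 and writes its results back to S3.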
  • IAM is layered. The state machine has an execution role. That role needs permission to invoke each service the state machine touches (Lambda, SQS, DynamoDB).

Fix 1: Write Valid ASL

{
  "Comment": "Process an order",
  "StartAt": "ValidateOrder",
  "States": {
    "ValidateOrder": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder",
        "Payload.$": "$"
      },
      "ResultPath": "$.validation",
      "Next": "ProcessPayment"
    },
    "ProcessPayment": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "arn:aws:lambda:us-east-1:123456789012:function:ProcessPayment",
        "Payload": {
          "orderId.$": "$.orderId",
          "amount.$": "$.validation.Payload.amount"
        }
      },
      "ResultPath": "$.payment",
      "End": true
    }
  }
}

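Required pieces per state:

  • Type: Task, Choice, Wait, Map, Parallel, Pass, Succeed, or Fail.
  • Next or End: what comes after. Every state except Choice, Succeed, and Fail needs exactly one of "Next": "..." or "End": true. Choice states put Next inside each rule instead.
  • Resource: for Task states, the ARN of the service integration or Lambda function.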
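The transition rules above can be checked before deploying. The following is a minimal sketch, not a replacement for the official validator: `lint_transitions` is a hypothetical helper name, and it only looks at top-level states (it ignores Catch targets and nested Map or Parallel branches).

```python
TERMINAL_TYPES = {"Succeed", "Fail"}  # these end the execution by themselves


def lint_transitions(definition: dict) -> list[str]:
    """Return human-readable problems found in an ASL definition (empty if none)."""
    states = definition.get("States", {})
    problems = []
    if definition.get("StartAt") not in states:
        problems.append(f"StartAt points at unknown state {definition.get('StartAt')!r}")
    for name, state in states.items():
        state_type = state.get("Type")
        if state_type is None:
            problems.append(f"{name}: missing Type")
            continue
        if state_type in TERMINAL_TYPES:
            continue
        if state_type == "Choice":
            # Choice states carry Next inside each rule, plus an optional Default.
            targets = [rule.get("Next") for rule in state.get("Choices", [])]
            if "Default" in state:
                targets.append(state["Default"])
        else:
            has_next = "Next" in state
            has_end = state.get("End") is True
            if has_next == has_end:
                problems.append(f"{name}: needs exactly one of Next or End")
            targets = [state["Next"]] if has_next else []
        for target in targets:
            if target not in states:
                problems.append(f"{name}: transition to unknown state {target!r}")
    return problems
```

Load the definition with `json.load` and call `lint_transitions(definition)` in a pre-deploy check; an empty list means this particular check found nothing.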

Common JSONPath patterns:

  • "Payload.$": "$" passes the entire input.
  • "orderId.$": "$.orderId" pulls a specific field. The .$ key suffix is required for JSONPath references.
  • ResultPath: "$.payment" sets where the state's result is merged into the state input.
  • OutputPath: "$.payment" discards everything else and emits just this path.
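To make the difference between ResultPath and OutputPath concrete, here is a small Python model of the two steps. It is a sketch under simplifying assumptions: it only understands plain dotted paths such as `$.a.b`, and the function names are illustrative, not part of any AWS SDK.

```python
import copy


def _split(path: str) -> list[str]:
    """Turn "$.a.b" into ["a", "b"] and "$" into []."""
    if path != "$" and not path.startswith("$."):
        raise ValueError(f"unsupported path: {path}")
    return [] if path == "$" else path[2:].split(".")


def apply_result_path(state_input: dict, result, result_path: str = "$"):
    """Merge a task result into the state input at result_path."""
    keys = _split(result_path)
    if not keys:
        return result  # "$" (the default) replaces the input with the result
    merged = copy.deepcopy(state_input)
    node = merged
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = result
    return merged


def apply_output_path(document, output_path: str = "$"):
    """Keep only the part of the document selected by output_path."""
    node = document
    for key in _split(output_path):
        node = node[key]
    return node
```

With an input of `{"orderId": "o-1"}` and a task result of `{"status": "paid"}`, a ResultPath of `$.payment` keeps `orderId` and adds `payment`, while following it with an OutputPath of `$.payment` passes only the payment object to the next state.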

Pro Tip: Use the visual editor in the AWS Console to design the state machine, then export the ASL. The validator catches missing transitions and syntax errors before you deploy.

Fix 2: Standard vs Express Workflows

Pick based on duration and history needs:

  • Standard — up to 1 year, full execution history, $0.025 per 1K state transitions. For long-running orchestration, human approvals, audit trails.
  • Express — up to 5 minutes, billed per execution + duration, no history (logs to CloudWatch if enabled). For high-throughput API backends, event processing.
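Because Standard workflows are billed per state transition, a rough cost estimate is simple arithmetic. The sketch below uses only the $0.025 per 1K transitions figure quoted above; it ignores any free tier and regional price differences, and `standard_cost` is a hypothetical helper.

```python
# Price quoted above: $0.025 per 1,000 state transitions (Standard workflows).
STANDARD_PRICE_PER_TRANSITION = 0.025 / 1000


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Estimated USD cost of Standard workflow state transitions."""
    return executions * transitions_per_execution * STANDARD_PRICE_PER_TRANSITION


# One million executions with ten transitions each.
monthly_estimate = standard_cost(1_000_000, 10)
```

Loops multiply `transitions_per_execution`, so a polling loop that runs 100 times with three states per pass counts as roughly 300 transitions for that execution.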

Set the type at create time:

# SAM template:
Resources:
  MyStateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      DefinitionUri: state-machine.asl.json
      Type: STANDARD   # or EXPRESS
      Role: !GetAtt MyRole.Arn

You can’t convert between Standard and Express in place — create a new state machine.

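For Express workflows that need history, enable logging (this is LoggingConfiguration on AWS::StepFunctions::StateMachine; the SAM resource names the same property Logging):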

LoggingConfiguration:
  Level: ALL
  IncludeExecutionData: true
  Destinations:
    - CloudWatchLogsLogGroup:
        LogGroupArn: !GetAtt MyLogGroup.Arn

This sends Express execution events to CloudWatch Logs, which stands in for the missing in-product history.

Common Mistake: Using Express for workflows with human approval steps. Express has a 5-minute hard limit; humans take longer than 5 minutes. Use Standard.

Fix 3: Map State — Inline vs Distributed

Inline Map (default) runs up to 40 iterations concurrently and is subject to the 256 KB payload limit:

{
  "MyMap": {
    "Type": "Map",
    "ItemsPath": "$.items",
    "MaxConcurrency": 10,
    "Iterator": {
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "ResultPath": "$.results",
    "End": true
  }
}

For larger datasets, use Distributed Map. It reads input from S3 and can iterate over millions of items:

{
  "MyMap": {
    "Type": "Map",
    "ItemReader": {
      "Resource": "arn:aws:states:::s3:listObjectsV2",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "items/"
      }
    },
    "ItemProcessor": {
      "ProcessorConfig": {
        "Mode": "DISTRIBUTED",
        "ExecutionType": "EXPRESS"
      },
      "StartAt": "ProcessItem",
      "States": {
        "ProcessItem": {
          "Type": "Task",
          "Resource": "arn:aws:lambda:us-east-1:123456789012:function:ProcessItem",
          "End": true
        }
      }
    },
    "MaxConcurrency": 1000,
    "ResultWriter": {
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "my-bucket",
        "Prefix": "results/"
      }
    },
    "End": true
  }
}

Distributed Map:

  • Mode: "DISTRIBUTED" in ProcessorConfig switches from inline.
  • ExecutionType: "EXPRESS" is recommended for high-throughput; iteration sub-workflows run as Express.
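  • ItemReader reads from S3: either a listing of objects (s3:listObjectsV2) or the contents of a single object such as a CSV or JSONL file (s3:getObject). Data held in DynamoDB has to be exported to S3 first.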
  • ResultWriter persists results to S3 — avoids the inline result accumulating into the parent state.

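Pro Tip: For batch jobs over thousands of items, use Distributed Map. Inline Map caps concurrency at 40 and collects every iteration's result into the parent state's output, which is subject to the 256 KB payload limit, so large batches run slowly and can fail with States.DataLimitExceeded.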
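One way to decide between the two modes is to measure the serialized size of the items before starting the execution. This is a rough sketch: `suggest_map_mode` is a hypothetical helper, and the 50% headroom is an assumption that leaves room for the iteration results that also accumulate in the parent state.

```python
import json

INLINE_PAYLOAD_LIMIT = 256 * 1024  # bytes; the 256 KB limit mentioned above


def suggest_map_mode(items: list, headroom: float = 0.5) -> str:
    """Suggest "INLINE" or "DISTRIBUTED" from the serialized size of the items.

    headroom is the fraction of the payload limit the input may use
    (an assumption, tune it for your own result sizes).
    """
    size = len(json.dumps(items).encode("utf-8"))
    return "INLINE" if size <= INLINE_PAYLOAD_LIMIT * headroom else "DISTRIBUTED"
```

A list of ten small order records comes back as `INLINE`; a few hundred 1 KB records already crosses the threshold and comes back as `DISTRIBUTED`.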

Fix 4: Error Handling With Retry and Catch

For transient errors, retry:

{
  "MyTask": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
    "Retry": [
      {
        "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
        "IntervalSeconds": 2,
        "MaxAttempts": 3,
        "BackoffRate": 2,
        "JitterStrategy": "FULL"
      }
    ],
    "Catch": [
      {
        "ErrorEquals": ["States.TaskFailed"],
        "Next": "HandleFailure",
        "ResultPath": "$.error"
      }
    ],
    "Next": "Success"
  }
}

Retry:

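  • ErrorEquals: list of error names to retry. States.ALL matches almost everything (not States.Runtime or States.DataLimitExceeded) and must appear alone, in the last retrier.
  • IntervalSeconds: delay before the first retry (default 1).
  • MaxAttempts: number of retries, not counting the initial attempt (default 3).
  • BackoffRate: multiplier applied to the delay after each retry (2 doubles it each time).
  • JitterStrategy: "FULL" randomizes each delay, which helps when many executions retry at once.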
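The retry fields combine into a simple schedule: the n-th retry waits IntervalSeconds multiplied by BackoffRate to the power n-1. The sketch below computes the delays without jitter; `retry_delays` is an illustrative helper, and the optional cap models the MaxDelaySeconds field.

```python
def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0,
                 max_delay_seconds=None):
    """Return the wait (in seconds) before each retry, ignoring jitter."""
    delays = []
    for attempt in range(max_attempts):
        delay = interval_seconds * backoff_rate ** attempt
        if max_delay_seconds is not None:
            delay = min(delay, max_delay_seconds)  # cap, as MaxDelaySeconds does
        delays.append(delay)
    return delays


# The Retry block above: IntervalSeconds 2, MaxAttempts 3, BackoffRate 2.
schedule = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2)
```

For the configuration shown earlier the waits are 2, 4, and 8 seconds, so the task gives up about 14 seconds after the first failure plus the time each attempt takes. With full jitter, each actual wait is a random value up to the computed delay.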

Catch:

  • Runs when retries are exhausted (or the error doesn’t match Retry.ErrorEquals).
  • ResultPath: "$.error" — stores the error info at this path for the next state to inspect.
  • Next — the failure-handling state.

In the failure handler, you can read the error:

{
  "HandleFailure": {
    "Type": "Pass",
    "Parameters": {
      "errorType.$": "$.error.Error",
      "errorMessage.$": "$.error.Cause"
    },
    "Next": "Notify"
  }
}

Common Mistake: Catching States.ALL without inspecting the error. You handle exceptions you shouldn’t (like States.Timeout that should bubble up). Be specific about what you catch.
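Being specific is easier when the Lambda raises named exceptions. As far as I know, for Python functions the error name that Step Functions matches against ErrorEquals is the exception class name, so a custom class gives you something precise to catch. The handler and exception below are hypothetical examples, not code from a real service.

```python
class PaymentDeclinedError(Exception):
    """Business failure: retrying will not help, so route it via Catch."""


def handler(event, context):
    """Hypothetical Lambda handler that fails with a named error."""
    if event.get("amount", 0) <= 0:
        raise PaymentDeclinedError("amount must be positive")
    return {"orderId": event["orderId"], "status": "charged"}


def asl_error_name(exc: BaseException) -> str:
    """The name to list in ErrorEquals for this exception (its class name)."""
    return type(exc).__name__
```

A Catch entry with `"ErrorEquals": ["PaymentDeclinedError"]` then handles only this case, and anything unexpected still fails the execution visibly.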

Fix 5: Choice State Syntax

{
  "RouteByStatus": {
    "Type": "Choice",
    "Choices": [
      {
        "Variable": "$.order.status",
        "StringEquals": "pending",
        "Next": "ProcessPending"
      },
      {
        "Variable": "$.order.status",
        "StringEquals": "confirmed",
        "Next": "ProcessConfirmed"
      },
      {
        "And": [
          { "Variable": "$.order.amount", "NumericGreaterThan": 100 },
          { "Variable": "$.order.region", "StringEquals": "US" }
        ],
        "Next": "HighValueUSFlow"
      }
    ],
    "Default": "ProcessUnknown"
  }
}

Operators:

  • StringEquals, StringMatches, StringGreaterThan, etc. — string comparisons.
  • NumericEquals, NumericGreaterThan, etc. — number comparisons.
  • BooleanEquals — boolean.
  • TimestampLessThan, TimestampLessThanEqualsPath, etc. — timestamp.
  • IsPresent, IsString, IsNumeric, IsBoolean, IsNull, IsTimestamp — type checks.

Combinations:

  • And: [...] — all must match.
  • Or: [...] — any.
  • Not: { ... } — negation.

Default is optional, but if no rule matches and there is no Default, the execution fails with States.NoChoiceMatched.

For dynamic comparisons (compare two paths):

{
  "Variable": "$.requested",
  "NumericGreaterThanPath": "$.available"
}

The Path suffix on the operator means “compare against another JSONPath.”
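The evaluation order can be modeled in a few lines: rules are tried top to bottom, the first match wins, and Default is the fallback. This is a simplified sketch covering only the operators used in this section; the function names are illustrative, and a missing Variable path raises KeyError here.

```python
def _get(doc: dict, path: str):
    """Resolve a plain dotted path such as "$.order.status"."""
    node = doc
    for key in path[2:].split("."):
        node = node[key]
    return node


def matches(rule: dict, doc: dict) -> bool:
    """Evaluate one Choice rule (subset of operators) against the input."""
    if "And" in rule:
        return all(matches(r, doc) for r in rule["And"])
    if "Or" in rule:
        return any(matches(r, doc) for r in rule["Or"])
    if "Not" in rule:
        return not matches(rule["Not"], doc)
    value = _get(doc, rule["Variable"])
    if "StringEquals" in rule:
        return isinstance(value, str) and value == rule["StringEquals"]
    if "NumericGreaterThan" in rule:
        return isinstance(value, (int, float)) and value > rule["NumericGreaterThan"]
    if "NumericGreaterThanPath" in rule:
        return value > _get(doc, rule["NumericGreaterThanPath"])
    raise NotImplementedError(f"operator not modeled: {sorted(rule)}")


def next_state(choice_state: dict, doc: dict) -> str:
    """Return the state the Choice would transition to."""
    for rule in choice_state["Choices"]:
        if matches(rule, doc):
            return rule["Next"]  # first matching rule wins
    if "Default" in choice_state:
        return choice_state["Default"]
    raise RuntimeError("States.NoChoiceMatched")
```

Running the RouteByStatus definition above against an order with status "shipped", amount 250, and region "US" skips the two status rules and lands on HighValueUSFlow; the same order with amount 50 falls through to ProcessUnknown.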

Fix 6: IAM Execution Role

The state machine’s execution role needs permission for every API call the workflow makes on your behalf:

StepFunctionsRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: states.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: InvokeLambdas
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: lambda:InvokeFunction
              Resource:
                - !GetAtt ValidateLambda.Arn
                - !GetAtt ProcessLambda.Arn
            - Effect: Allow
              Action:
                - dynamodb:GetItem
                - dynamodb:PutItem
              Resource: !GetAtt MyTable.Arn
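            # Needed when the state machine logs to CloudWatch Logs
            - Effect: Allow
              Action:
                - logs:CreateLogDelivery
                - logs:GetLogDelivery
                - logs:UpdateLogDelivery
                - logs:DeleteLogDelivery
                - logs:ListLogDeliveries
                - logs:PutResourcePolicy
                - logs:DescribeResourcePolicies
                - logs:DescribeLogGroups
              Resource: "*"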

For X-Ray tracing:

- Effect: Allow
  Action:
    - xray:PutTraceSegments
    - xray:PutTelemetryRecords
    - xray:GetSamplingRules
    - xray:GetSamplingTargets
  Resource: "*"

For Distributed Map (needs to start child executions):

- Effect: Allow
  Action: states:StartExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${StateMachineName}"
- Effect: Allow
  Action:
    - states:DescribeExecution
    - states:StopExecution
  Resource: !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:execution:${StateMachineName}:*"

Common Mistake: Granting lambda:InvokeFunction on *. Scope to specific Lambda ARNs. A leaky role lets the state machine invoke functions it shouldn’t.
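To scope the policy, it helps to list exactly which functions a definition invokes. The following sketch walks an ASL definition and collects Lambda targets from both integration styles. `lambda_targets` is a hypothetical helper, and it does not resolve dynamic names supplied through "FunctionName.$".

```python
def lambda_targets(definition: dict) -> set[str]:
    """Collect Lambda function names or ARNs referenced by Task states."""
    found = set()

    def walk(node):
        if isinstance(node, dict):
            if node.get("Type") == "Task":
                resource = node.get("Resource", "")
                if resource.startswith("arn:aws:states:::lambda:invoke"):
                    # Service integration: the target is in Parameters.FunctionName.
                    name = node.get("Parameters", {}).get("FunctionName")
                    if name:
                        found.add(name)
                elif resource.startswith("arn:aws:lambda:"):
                    # Direct ARN form: the Resource is the function itself.
                    found.add(resource)
            for value in node.values():
                walk(value)  # descends into Map processors and Parallel branches
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(definition)
    return found
```

Compare the returned set with the Resource list under lambda:InvokeFunction; anything in the policy that is not in the set is a candidate for removal.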

Fix 7: Lambda Integration — Optimized vs Standard

For Lambda invocation, there are two integration types:

Optimized integration (arn:aws:states:::lambda:invoke):

{
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "...",
    "Payload.$": "$"
  },
  "Retry": [
    {
      "ErrorEquals": [
        "Lambda.ServiceException",
        "Lambda.AWSLambdaException",
        "Lambda.SdkClientException",
        "Lambda.TooManyRequestsException"
      ],
      "IntervalSeconds": 1,
      "MaxAttempts": 3
    }
  ]
}

The output has Lambda metadata wrapping:

{
  "ExecutedVersion": "$LATEST",
  "Payload": { ...your actual response... },
  "SdkHttpMetadata": { ... }
}

Read your data via $.Payload.

Standard ARN integration (arn:aws:lambda:...:function:MyFunction):

{
  "Type": "Task",
  "Resource": "arn:aws:lambda:us-east-1:123456789012:function:MyFunction",
  "InputPath": "$",
  "ResultPath": "$.result"
}

The output is just your Lambda’s response — no wrapping. But error retries are less granular.

Pro Tip: Prefer the optimized integration (arn:aws:states:::lambda:invoke) for production, and declare a Retry for Lambda.ServiceException and Lambda.TooManyRequestsException as in the example above. Step Functions does not retry a failed task unless the state defines Retry.

Fix 8: Wait States and Heartbeats

Wait for a duration:

{
  "WaitFiveSeconds": {
    "Type": "Wait",
    "Seconds": 5,
    "Next": "Continue"
  }
}

Wait until a specific time:

{
  "WaitUntil2027": {
    "Type": "Wait",
    "Timestamp": "2027-01-01T00:00:00Z",
    "Next": "Continue"
  }
}

Wait for a path-resolved value:

{
  "WaitUntilUserSchedule": {
    "Type": "Wait",
    "TimestampPath": "$.scheduledAt",
    "Next": "Continue"
  }
}

For human-in-the-loop with TaskToken:

{
  "WaitForApproval": {
    "Type": "Task",
    "Resource": "arn:aws:states:::lambda:invoke.waitForTaskToken",
    "Parameters": {
      "FunctionName": "SendApprovalRequest",
      "Payload": {
        "taskToken.$": "$$.Task.Token",
        "userId.$": "$.userId"
      }
    },
    "Next": "Continue"
  }
}

The Lambda sends an approval link to the user, embedding taskToken. When the user clicks “Approve,” your backend calls SendTaskSuccess with the token, unblocking the state machine.

aws stepfunctions send-task-success \
  --task-token "$TASK_TOKEN" \
  --task-output '{"approved": true}'

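This pattern supports human approvals, external system callbacks, and async work that doesn’t fit in a single Lambda timeout. Set TimeoutSeconds or HeartbeatSeconds on the task so that a callback that never arrives fails the state instead of leaving the execution waiting indefinitely.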

Common Mistake: Forgetting .waitForTaskToken on the Resource ARN. Without it, the task completes immediately on Lambda return, instead of waiting for the explicit SendTaskSuccess call.
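On the backend side, the callback is a single API call. Below is a minimal sketch of an approval endpoint's core logic using the boto3 Step Functions client methods `send_task_success` and `send_task_failure`. The function name, the "ApprovalRejected" error name, and the output shape are assumptions for illustration; the client is passed in so the code can be exercised without AWS credentials.

```python
import json


def complete_approval(sfn_client, task_token: str, approved: bool, reason: str = "") -> None:
    """Resume the waiting task with either a success or a failure.

    sfn_client is expected to be boto3.client("stepfunctions") or a test double.
    """
    if approved:
        sfn_client.send_task_success(
            taskToken=task_token,
            output=json.dumps({"approved": True}),  # must be a JSON string
        )
    else:
        sfn_client.send_task_failure(
            taskToken=task_token,
            error="ApprovalRejected",  # hypothetical error name, matchable in Catch
            cause=reason or "Rejected by approver",
        )
```

Sending a failure for rejections lets the state machine route them through a Catch on "ApprovalRejected" rather than inspecting the output in a Choice state; either design works.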

Still Not Working?

A few less-obvious failures:

  • State machine fails to start. Check the execution role’s trust policy — must include states.amazonaws.com as Principal.
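  • States.Runtime with cryptic JSONPath. Your input doesn’t have the field you expect. Inspect the failing state’s input in the execution history (Standard), or in CloudWatch Logs with IncludeExecutionData enabled (Express), and compare it with the path in your definition.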
  • Step Functions billed more than expected. Each state transition costs (Standard). Loops with many iterations add up. Use Express for high-volume sub-workflows.
  • Map state exceeds the payload limit in inline mode. Switch to Distributed Map with an S3 reader and writer.
  • Retry doesn’t retry the error you saw. Match the exact error type. Errors from Lambda have specific names; States.TaskFailed is the generic wrapper.
  • Pass state doesn’t change input. By default, Pass passes input through. Use Parameters to transform, or ResultPath to merge a literal Result.
  • Choice state has no default and execution fails. Always provide Default. Even "Default": "FailExplicitly" (a Fail state) is better than no default.
  • Time zones in Timestamp field. Always UTC. Convert in your code if your data is in local time.

For related AWS orchestration and serverless issues, see AWS Lambda timeout, AWS Lambda cold start timeout, AWS IAM permission denied, and AWS SQS not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
