Fix: AWS Lambda SnapStart Not Working — Version vs Alias, Restore Hooks, and Uniqueness Bugs

FixDevs ·

Quick Answer

How to fix Lambda SnapStart errors — feature requires published version, $LATEST not supported, restore hook for stale connections, UUID collisions after snapshot, time-based state staleness, and pricing surprises.

The Error

You enable SnapStart on a Lambda function and the change doesn’t apply:

The function policy doesn't support SnapStart. Publish a new version first.

Or cold starts still take hundreds of milliseconds despite SnapStart being on:

Init Duration: 0.92 ms   (good — restored from snapshot)
Restore Duration: 410 ms (still significant)
Duration: 50 ms
Billed Duration: 460 ms

Or two requests get the same “random” UUID:

[invocation A] id=550e8400-e29b-41d4-a716-446655440000
[invocation B] id=550e8400-e29b-41d4-a716-446655440000  # Same!

Or DB connections fail with “connection closed” right after restore:

[invocation] Restored from snapshot.
Error: Connection terminated unexpectedly
    at /var/task/node_modules/pg/lib/...

Why This Happens

SnapStart initializes your Lambda function once when you publish a version, snapshots the initialized execution environment, and resumes new execution environments from that snapshot instead of running init again. Cold start typically drops from 1-10 seconds (Java) or 200-800ms (Node/Python) to roughly 50-200ms.

Four core constraints:

  • SnapStart requires a published version. It snapshots versions, not $LATEST. You must publish a version and invoke that version, either directly or through an alias. Invocations that resolve to $LATEST never use a snapshot.
  • Restore happens after init. State captured at snapshot time (open connections, file handles, timestamps) is reused across all restores from that snapshot. Stale state is your problem to detect and refresh.
  • Uniqueness sources are snapshotted. Any value generated during init (an ID, a seed, a nonce) and any user-space PRNG state, such as a java.util.Random instance or Math.random()’s internal state, is frozen into the snapshot. Without regeneration or reseeding after restore, multiple execution environments produce identical or correlated values.
  • Cached time is frozen at snapshot time. Anything that cached “now” at startup keeps reading the snapshot’s time, not the actual invocation time.
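The uniqueness constraint can be reproduced without AWS. The sketch below is a simulation only: it uses Python's random.Random.getstate() and setstate() as a stand-in for the memory snapshot, and restore_and_generate is a hypothetical helper name, not part of any Lambda API.

```python
import random

# "Init phase": a generator is created and its state captured, standing in
# for what a SnapStart snapshot would freeze.
init_rng = random.Random()
snapshot_state = init_rng.getstate()


def restore_and_generate(reseed: bool) -> int:
    """Simulate one restored environment producing its first random value."""
    rng = random.Random()
    rng.setstate(snapshot_state)  # every restore starts from identical state
    if reseed:
        rng.seed()  # what an after-restore hook would do: reseed from OS entropy
    return rng.getrandbits(64)


without_reseed = [restore_and_generate(reseed=False) for _ in range(3)]
with_reseed = [restore_and_generate(reseed=True) for _ in range(3)]

print(len(set(without_reseed)))  # 1: all three simulated environments collide
print(len(set(with_reseed)))     # distinct values, with overwhelming probability
```

The three simulated environments that skip reseeding all emit the same value, which is the collision shown in the error output at the top of this article.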

Fix 1: Enable SnapStart on a Published Version

In the Lambda console: Function → Configuration → SnapStart → Apply → “Published versions” → Save.

Or via CLI:

aws lambda update-function-configuration \
  --function-name my-app \
  --snap-start ApplyOn=PublishedVersions

# Then publish a version:
aws lambda publish-version --function-name my-app
# Returns: { "Version": "5", ... }

# Point an alias at it:
aws lambda update-alias \
  --function-name my-app \
  --name prod \
  --function-version 5

API Gateway / function URLs / event sources must point at the alias (my-app:prod) or at a specific version, not the unqualified function or $LATEST. Otherwise SnapStart doesn’t activate.

# AWS SAM template:
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      AutoPublishAlias: prod
      SnapStart:
        ApplyOn: PublishedVersions

AutoPublishAlias: prod makes SAM publish a new version and update the prod alias on each deploy. SnapStart picks it up automatically.
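A quick way to audit integration targets is to check whether each configured function ARN carries a qualifier. This is a minimal sketch that assumes the standard Lambda ARN layout; snapstart_qualifier_problem is a hypothetical helper name, and the account ID in the usage line is a placeholder.

```python
from typing import Optional


def snapstart_qualifier_problem(arn: str) -> Optional[str]:
    """Return a description of the problem, or None if the ARN targets a version or alias."""
    parts = arn.split(":")
    # Unqualified: arn:aws:lambda:<region>:<account>:function:<name>              (7 parts)
    # Qualified:   arn:aws:lambda:<region>:<account>:function:<name>:<qualifier>  (8 parts)
    if len(parts) < 7 or parts[2] != "lambda" or parts[5] != "function":
        return "not a Lambda function ARN"
    if len(parts) == 7:
        return "unqualified ARN resolves to $LATEST"
    if parts[7] == "$LATEST":
        return "explicitly targets $LATEST"
    return None


print(snapstart_qualifier_problem(
    "arn:aws:lambda:us-east-1:123456789012:function:my-app"))
```

Running it over the ARNs in your API Gateway integrations or event source mappings flags any that would bypass the snapshot.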

Pro Tip: For non-prod environments, use SnapStart too. Cold-start differences make perf testing meaningless if dev doesn’t use SnapStart.

Fix 2: Use Restore Hooks for Stale Connections

DB connections, HTTP keep-alive sockets, file handles — all become stale after a snapshot restore. You need to refresh them in a restore hook.

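Java (with the org.crac library):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import org.crac.Context;
import org.crac.Core;
import org.crac.Resource;

public class App implements Resource {
    private Connection dbConn;

    public App() {
        // Register this instance so the runtime invokes the hooks below.
        Core.getGlobalContext().register(this);
    }

    @Override
    public void beforeCheckpoint(Context<? extends Resource> context) throws Exception {
        // Close the connection so a dead socket is not captured in the snapshot.
        if (dbConn != null) dbConn.close();
        dbConn = null;
    }

    @Override
    public void afterRestore(Context<? extends Resource> context) throws Exception {
        // Open a fresh connection in the restored execution environment.
        dbConn = createConnection();
    }

    // Assumes a JDBC URL in the DATABASE_URL environment variable.
    private Connection createConnection() throws SQLException {
        return DriverManager.getConnection(System.getenv("DATABASE_URL"));
    }

    Connection getDbConn() {
        return dbConn;
    }
}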

The org.crac package (Coordinated Restore at Checkpoint) is Java’s hook API. SnapStart calls beforeCheckpoint before snapshotting and afterRestore after restoring.

Python:

import os

import psycopg2
# Hook decorators from AWS's snapshot-restore-py library. The runtime hook API
# surface and import path have changed over time, so confirm the package and
# decorator names against the current AWS Lambda Python runtime docs.
from snapshot_restore_py import register_after_restore, register_before_snapshot

connection = None

def init_connection():
    global connection
    connection = psycopg2.connect(os.environ["DATABASE_URL"])

@register_before_snapshot
def close_connection():
    # Runs before the snapshot is taken: drop the socket so it is not captured.
    global connection
    if connection:
        connection.close()
        connection = None

@register_after_restore
def reopen_connection():
    # Runs in each restored environment: open a fresh connection.
    init_connection()

init_connection()  # Runs at startup; closed again by the before-snapshot hook

Node.js:

let dbClient;

async function init() {
  dbClient = await createPgClient();
}

// Register lifecycle hooks via the current AWS Lambda Node.js runtime API
// (the API name and import path are still evolving — check the docs).
// Conceptually:
//   - beforeSnapshot: close stale connections, dbClient = null.
//   - afterRestore: re-create dbClient by calling init().

await init();  // Initial setup, captured in snapshot.

For Python and Node.js, the hook API is newer than Java’s org.crac and the import paths have moved across releases. Always check the AWS Lambda runtime docs for the current names, and confirm that SnapStart supports your runtime version at all.

Common Mistake: Initializing a DB connection at module load and assuming it survives the snapshot. It doesn’t — TCP sockets are dead after restore. Always re-establish in afterRestore.
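As a fallback for code paths where a restore hook was missed, the connection can also be validated lazily before use. The sketch below uses sqlite3 purely as a stand-in for a real network driver so that it runs anywhere; get_connection and _open_connection are hypothetical names, and with a real database the liveness probe costs one round trip per call.

```python
import sqlite3
from typing import Optional

_conn: Optional[sqlite3.Connection] = None


def _open_connection() -> sqlite3.Connection:
    # Stand-in for the real driver call, e.g. psycopg2.connect(...).
    return sqlite3.connect(":memory:")


def get_connection() -> sqlite3.Connection:
    """Return a live connection, reopening it if the cached one is unusable."""
    global _conn
    if _conn is not None:
        try:
            _conn.execute("SELECT 1")  # cheap liveness probe
            return _conn
        except sqlite3.Error:
            _conn = None  # cached handle is dead; fall through and reopen
    _conn = _open_connection()
    return _conn
```

This complements the restore hook rather than replacing it: the hook keeps the first request after restore fast, and the probe catches anything the hook did not cover.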

Fix 3: Reseed Random Number Generators

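java.util.Random and other user-space PRNGs are stateful, and that state is part of the snapshot. Without reseeding, every environment restored from the snapshot generates the same sequence. At the time of writing, AWS’s SnapStart documentation lists java.security.SecureRandom as safe on the managed Java runtimes, so prefer it, and replace or reseed anything else in afterRestore:

private static Random rng = new Random();

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    // Replace the snapshotted generator with one seeded from the OS CSPRNG.
    rng = new Random(new SecureRandom().nextLong());
}

For UUID v4 generation, UUID.randomUUID() draws from SecureRandom internally, so calls made after restore are not the problem. An ID generated once during init and cached is, because every restored environment inherits it:

// instanceId is a hypothetical field holding an ID created during init.
private static String instanceId = UUID.randomUUID().toString();

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    // Regenerate any identifier that was created before the snapshot.
    instanceId = UUID.randomUUID().toString();
}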

Python’s random module is also stateful:

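import random
import secrets

# Decorator from AWS's snapshot-restore-py library; confirm the import path
# against the current AWS Lambda Python runtime docs.
from snapshot_restore_py import register_after_restore

@register_after_restore
def reseed_random():
    random.seed()  # Reseeds from OS entropy (os.urandom)

secrets (a CSPRNG interface that reads from the OS on every call) is unaffected by snapshots. Prefer it over random for any value that must be unique across invocations.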

Node.js’s Math.random() and crypto.randomUUID():

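  • crypto.randomUUID() uses the OS’s CSPRNG, but by default Node.js caches a batch of random bytes for it in process memory. Pass { disableEntropyCache: true } so no pre-generated entropy can be captured in a snapshot.
  • Math.random() is V8 internal state, so it is affected by snapshots, but the practical impact is small for most apps.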

For anything security-sensitive, use crypto.randomUUID() or crypto.getRandomValues() — never Math.random().

Pro Tip: Audit your code for any “random” that you depend on being globally unique. If it uses pre-restore RNG state, fix it.
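The audit usually comes down to one distinction: values generated at module load versus values generated inside the handler. A minimal Python sketch of that split follows; instance_id and refresh_instance_id are hypothetical names, and refresh_instance_id stands for the body of whatever after-restore hook your runtime provides.

```python
import secrets
import uuid

# Generated once during init: this value would be frozen into the snapshot
# and shared by every environment restored from it.
instance_id = uuid.uuid4().hex


def refresh_instance_id() -> None:
    """Body of a hypothetical after-restore hook: regenerate init-time IDs."""
    global instance_id
    instance_id = uuid.uuid4().hex


def handler(event, context):
    # Per-request values are generated inside the handler, after restore,
    # from OS-backed sources (uuid.uuid4 and secrets both read os.urandom).
    return {
        "instance_id": instance_id,
        "request_id": str(uuid.uuid4()),
        "token": secrets.token_hex(16),
    }
```

Anything in the first category needs a hook; anything in the second category is safe as long as it draws from an OS-backed source.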

Fix 4: Refresh Cached Time

If you cache System.currentTimeMillis() or Date.now() at init for “when this Lambda started,” that value is the snapshot time, not the current invocation:

private static final long STARTUP_TIME = System.currentTimeMillis();
// At snapshot: 2026-01-01 00:00:00
// At every restore: still 2026-01-01 00:00:00 (snapshot time)
// Don't use this for cache TTLs, log timestamps, etc.

Fix in afterRestore:

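private static long restoreTime;

@Override
public void afterRestore(org.crac.Context<? extends Resource> context) {
    restoreTime = System.currentTimeMillis();
}

Now restoreTime is when this execution environment was restored. It is still shared by every invocation that environment serves afterward, so read the clock inside the handler for per-invocation timestamps.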

For cache that should expire:

private static long CACHE_VALID_UNTIL = -1;
private static Result CACHED_RESULT;

public Result get() {
    if (System.currentTimeMillis() < CACHE_VALID_UNTIL) {
        return CACHED_RESULT;
    }
    CACHED_RESULT = fetch();
    CACHE_VALID_UNTIL = System.currentTimeMillis() + 60_000;
    return CACHED_RESULT;
}

This reads “now” at each call. The cache TTL is measured from the last fetch, not from snapshot — safe.
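The same pattern in Python, written with an injectable clock so the snapshot gap can be simulated in a test. This is a sketch under assumptions: TtlCache is a hypothetical class name, and the clock parameter exists only to make the behavior testable.

```python
import time
from typing import Callable, Optional


class TtlCache:
    """Caches one value and re-fetches it once ttl_seconds have elapsed."""

    def __init__(self, fetch: Callable[[], object], ttl_seconds: float,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # called on every get(), never cached
        self._value: Optional[object] = None
        self._valid_until = float("-inf")

    def get(self) -> object:
        now = self._clock()
        if now < self._valid_until:
            return self._value
        self._value = self._fetch()
        self._valid_until = now + self._ttl
        return self._value
```

Because the clock is read inside get(), a value cached before the snapshot expires correctly on the first call after a restore that happens hours later.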

Fix 5: Reduce Snapshot Size (for Faster Restore)

Restore Duration covers restoring the snapshot, loading the runtime, and running any after-restore hooks. Larger snapshots and slower hooks both lengthen it. To reduce:

  • Trim init. Lazy-load packages used by < 50% of invocations.
  • Avoid eager JIT in Java. Class Data Sharing (CDS) helps, but heavy class loading in static blocks increases snapshot size.
  • Skip pre-warming caches that don’t survive restore anyway. Pre-warming a DB connection just to throw it away in beforeCheckpoint wastes init time.

For Java specifically, pass JVM options via the standard JAVA_TOOL_OPTIONS Lambda env var:

JAVA_TOOL_OPTIONS=-XX:TieredStopAtLevel=1 -XX:+UseSerialGC

These keep JIT compilation light and use a simpler garbage collector — faster init, smaller heap, smaller snapshot.

Pro Tip: Profile with aws lambda invoke --log-type Tail to see Init Duration, Restore Duration, Duration. The goal: Restore Duration < 200ms. Above that, your init is too heavy.
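With --log-type Tail, the log tail comes back base64-encoded in the LogResult field of the response. Once decoded, the timings can be pulled out of the REPORT line. The sketch below assumes the tab-separated "Name: value ms" layout of REPORT lines; parse_report is a hypothetical helper and the sample line is illustrative, not captured from a real function.

```python
def parse_report(line: str) -> dict:
    """Extract millisecond timings from a Lambda REPORT log line."""
    timings = {}
    for field in line.strip().split("\t"):
        name, sep, value = field.partition(": ")
        # Keep only fields expressed in milliseconds (skips memory fields).
        if sep and value.endswith(" ms"):
            timings[name] = float(value[: -len(" ms")])
    return timings


# Illustrative sample line, not output from a real function.
sample = (
    "REPORT RequestId: 00000000-0000-0000-0000-000000000000\t"
    "Duration: 50.00 ms\tBilled Duration: 460 ms\tMemory Size: 512 MB\t"
    "Max Memory Used: 120 MB\tRestore Duration: 410.00 ms\t"
    "Billed Restore Duration: 410 ms"
)
timings = parse_report(sample)
print(timings["Restore Duration"] < 200)  # False: above the 200 ms goal
```

Splitting on tabs rather than searching for "Duration" avoids confusing Restore Duration with Billed Restore Duration.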

Fix 6: Priming Code at Init

Code paths that run at handler time (first invocation) aren’t part of the snapshot — they cold-start. Move common logic to init so it’s captured:

public class App {
    private static final Database DB;
    private static final HttpClient HTTP;
    
    static {
        // Runs once at init, captured in snapshot
        DB = new Database();
        HTTP = HttpClient.newHttpClient();
    }

    public Response handleRequest(Request req) {
        // Fast because DB and HTTP are already constructed
        return DB.query(...);
    }
}

Same pattern in Python:

# Module-level — runs at init:
db = create_db_pool()
http_client = httpx.Client()

def handler(event, context):
    # Uses the pre-created pool
    return db.query(...)

Anything done in static {} (Java) / module scope (Python/Node) is part of the snapshot — restored fast. Anything in the handler function is per-invocation — adds to Duration.

Common Mistake: Initializing the DB inside the handler. Each invocation pays the connection cost. Move to init, refresh in restore hook.

Fix 7: Local Testing With SnapStart Behavior

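SnapStart can’t be reproduced locally, because snapshots are only taken in the AWS environment. You can exercise the restore path against a deployed version instead:

# The snapshot is created when the version is published, not on first invoke.
# Confirm the version behind the alias is ready before testing:
aws lambda get-function-configuration \
  --function-name my-app --qualifier prod \
  --query '[State, SnapStart.OptimizationStatus]'

# Invoke in parallel so Lambda has to restore several environments
# (sequential calls mostly reuse one warm environment).
# --cli-binary-format applies to AWS CLI v2; omit it on v1.
for i in {1..10}; do
  aws lambda invoke --function-name my-app:prod \
    --cli-binary-format raw-in-base64-out \
    --payload "{\"i\":$i}" /tmp/out-$i.json &
done
wait

# Compare timings via CloudWatch Logs.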

For unit testing restore hooks, mock the snapshot lifecycle:

@Test
public void afterRestore_reconnects() throws Exception {
    var app = new App();
    app.beforeCheckpoint(null);  // the hooks ignore the context argument
    assertNull(app.getDbConn());
    app.afterRestore(null);
    assertNotNull(app.getDbConn());
}

Test that connections are dropped at checkpoint and re-established at restore.
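The same lifecycle test can be written for Python hooks by calling the hook functions directly with a fake connection. This is a sketch: ConnectionManager and FakeConnection are hypothetical names, and in a real function before_snapshot and after_restore would be the functions registered with the runtime's hook API.

```python
class FakeConnection:
    """Minimal stand-in for a database connection."""

    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class ConnectionManager:
    def __init__(self, factory):
        self._factory = factory
        self.conn = factory()  # created during init

    def before_snapshot(self):
        # Drop the connection so it is not captured in the snapshot.
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def after_restore(self):
        # Open a fresh connection in the restored environment.
        self.conn = self._factory()


def test_lifecycle():
    manager = ConnectionManager(FakeConnection)
    original = manager.conn
    manager.before_snapshot()
    assert manager.conn is None and not original.open
    manager.after_restore()
    assert manager.conn is not None and manager.conn.open
    assert manager.conn is not original
```

Injecting the factory keeps the test free of any real database, so it runs in CI without network access.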

Fix 8: Pricing and Quotas

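SnapStart can add cost, depending on the runtime:

  • Snapshot caching and restore charges. At the time of writing, AWS does not charge extra for SnapStart on the Java managed runtimes. On the other supported runtimes it bills for caching the snapshot while the version is active and for each restore. Check the Lambda pricing page for current rates.
  • Restore time billed. Restore Duration is part of your billed time.
  • Publishing is slower. The snapshot is created when you publish a version, so the version stays in the Pending state until the snapshot is ready.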

Monitor:

aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Duration \
  --dimensions Name=FunctionName,Value=my-app Name=Resource,Value=my-app:5 \
  --start-time 2026-05-20T00:00:00Z \
  --end-time 2026-05-20T23:59:59Z \
  --period 3600 \
  --statistics Average,Maximum

Compare versions with and without SnapStart. If SnapStart adds more cost than it saves in latency, it’s not worth it for that function (rare — usually a clear win for Java).

Pro Tip: For functions invoked < 1/minute, SnapStart’s snapshot storage may cost more than you save. For high-traffic functions, it’s almost always cheaper.

Still Not Working?

A few less-obvious failures:

  • Restore time is huge (>2 seconds). Snapshot is too big. Likely heavy class-loading in Java; check the Init Duration of the pre-SnapStart version, which is a rough proxy for how much state init builds up.
  • Function doesn’t honor SnapStart after update. Each new version requires a new snapshot. Confirm that aws lambda get-function-configuration, run with --qualifier set to the version, shows SnapStart.OptimizationStatus as On.
  • Java cold-start time still bad. Verify you’re calling the alias, not $LATEST. $LATEST always cold-starts.
  • Python/Node SnapStart features differ from Java. Some hooks are Java-only as of writing. Check AWS docs for current support matrix per runtime.
  • DynamoDB / RDS connections hang. Connection pool’s TCP keep-alive doesn’t survive the snapshot. Always close + reopen in restore hooks.
  • Provisioned Concurrency vs SnapStart. They’re different mechanisms and cannot be combined on the same function version. SnapStart is cheaper and broader; Provisioned Concurrency is closer to “always-on” but expensive. Compare both for your workload.
  • Logs show snapshot age. Some snapshots can be reused across deploys (rare). If you suspect stale snapshots, force a new publish.
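  • EFS, large ephemeral storage, and layer changes. SnapStart does not support Amazon EFS or ephemeral storage above 512 MB. Versions are immutable, so a layer or code change only takes effect in a newly published version, which gets its own snapshot.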
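Several of these checks can be run against the output of aws lambda get-function-configuration. The sketch below inspects the parsed JSON response as a plain dict, using the Version, State, and SnapStart fields that response contains; snapstart_problems is a hypothetical helper name.

```python
from typing import List


def snapstart_problems(config: dict) -> List[str]:
    """List reasons a function configuration would not use a SnapStart snapshot."""
    problems = []
    snap = config.get("SnapStart") or {}
    if config.get("Version") == "$LATEST":
        problems.append("configuration is for $LATEST; query a published version")
    if snap.get("ApplyOn") != "PublishedVersions":
        problems.append("SnapStart.ApplyOn is not PublishedVersions")
    if snap.get("OptimizationStatus") != "On":
        problems.append("SnapStart.OptimizationStatus is not On")
    if config.get("State") != "Active":
        problems.append(f"version state is {config.get('State')!r}, not Active")
    return problems


healthy = {
    "Version": "5",
    "State": "Active",
    "SnapStart": {"ApplyOn": "PublishedVersions", "OptimizationStatus": "On"},
}
print(snapstart_problems(healthy))  # []
```

An empty list means the version itself is configured correctly, which narrows the problem down to how it is being invoked or to the function code.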

For related AWS Lambda and serverless performance issues, see AWS Lambda cold start timeout, AWS Lambda timeout, AWS Lambda layer not working, and AWS Lambda import module error.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
