Fix: DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads

Q: How do I fix "DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads"?

How to fix DuckDB errors — IOException database is locked, OutOfMemoryException on large queries, httpfs extension not loaded, secret manager for S3, Pandas/Polars zero-copy, and concurrent writer limits.

The Error

You connect to a DuckDB file from two processes and get this:

duckdb.duckdb.IOException: IO Error: Could not set lock on file "analytics.duckdb":
Conflicting lock is held in /path/to/other/process

Or your aggregation blows up partway through:

duckdb.duckdb.OutOfMemoryException: Out of Memory Error:
could not allocate block of size 256MB (15.2GB/15.0GB used)

Or read_parquet from S3 fails because the extension isn’t installed:

duckdb.duckdb.IOException: HTTP HEAD error on 's3://...'
(HTTP 403). The httpfs extension is not loaded.

Or pandas integration silently copies a 50 GB DataFrame:

result = duckdb.sql("SELECT * FROM big_df").df()
# 30 minutes later, OOM.

Why This Happens

DuckDB is an embedded OLAP database. Most footguns come from three places:

Single-writer concurrency. A .duckdb file is locked exclusively by the connecting process for writes. Two Python processes both trying to open the same file in read-write mode will conflict, even if they’re just doing reads — DuckDB defaults to read-write.
In-memory by design. DuckDB processes data in memory by default, spilling to disk only when configured. Big aggregations, sorts, or joins on tables larger than RAM will OOM unless you set memory_limit and temp_directory.
Extensions are lazy. httpfs, parquet, json, iceberg, delta — all extensions that aren’t built into the core binary. You must INSTALL and LOAD them, and on some platforms install requires network access at first use.

The pandas zero-copy issue is a doc-reading problem: duckdb.sql(...).df() materializes the full result set into memory. For large results, use .fetch_arrow_table() or write to Parquet directly.

A subtler reason these errors feel inconsistent across machines is that DuckDB’s defaults depend on whether you opened a file or stayed in-memory. The in-memory path (duckdb.connect() with no argument) never takes a file lock, so concurrent processes “work” in development and then fail the moment you point them at a real .duckdb file in production. Similarly, the in-memory path uses your full RAM without complaining until the OS kills the process, while the file-backed path can fall back to disk spill if you remembered to set temp_directory. If you only ever tested with :memory:, half of this article will be new to you the day you move to disk.

The extension story has its own quirk: DuckDB ships a small core binary and downloads extensions on first use to keep installs lean. That works great on a laptop and breaks in three places — Lambda/Cloud Run cold starts (no network or a TLS proxy in the way), offline CI runners, and pinned production images that block outbound HTTP. The official advice in those environments is to bake the extension into the image at build time by running INSTALL once and shipping the ~/.duckdb/extensions/ directory. The error message you see when this is missing is not “extension not installed” — it is a generic IO/HTTP error from inside the extension loader, which is why teams chase the wrong fix first.

Fix 1: Open Read-Only for Concurrent Readers

If you have multiple processes reading the same .duckdb file (one writer, many readers — or all readers), open the readers in read-only mode:

import duckdb

# Reader (no lock conflict with the writer):
ro = duckdb.connect("analytics.duckdb", read_only=True)
ro.sql("SELECT count(*) FROM events").show()

# Writer (exclusive):
rw = duckdb.connect("analytics.duckdb")  # read_only=False is the default
rw.sql("INSERT INTO events VALUES (...)")

Read-only connections don’t conflict with each other or with read-only access from other readers. They do conflict with an active writer if the writer holds the file lock during transactions, but DuckDB releases the lock between operations.

Pro Tip: For analytics pipelines with one writer (your ETL) and many readers (dashboards, notebooks), make read-only the default in all reader code paths. Mistakes default to “can’t connect” rather than “silently broke the ETL.”

Fix 2: Cap Memory and Set a Temp Directory

DuckDB will spill intermediate results to disk if you give it a temp directory and let it know your memory budget:

con = duckdb.connect("analytics.duckdb")
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")
con.execute("SET max_temp_directory_size = '50GB'")

memory_limit caps in-memory state. When queries exceed it, DuckDB streams the overflow to temp_directory. Without these settings, a query that doesn’t fit in RAM either OOMs or fails hard depending on the platform.

For long-running ETLs:

con.execute("SET preserve_insertion_order = false")   # Faster, more memory-efficient
con.execute("SET threads = 8")                        # Cap CPU usage

Common Mistake: Setting memory_limit higher than physical RAM “to be safe.” DuckDB will happily allocate that much before spilling; you want memory_limit below RAM so it spills before the OS kills the process.

Fix 3: Install and Load Extensions Once

For Parquet on local disk, no extension needed — it’s built in. For Parquet on S3, you need httpfs:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Now S3 reads work:
con.sql("SELECT * FROM 's3://my-bucket/events.parquet' LIMIT 10").show()

INSTALL downloads the extension once and caches it. LOAD activates it for the current connection. Subsequent connections still need LOAD but skip the download.

For air-gapped environments, pre-download extensions:

# On a machine with network access:
duckdb -c "INSTALL httpfs; INSTALL parquet; INSTALL json"
# Then ship the cache directory (~/.duckdb/extensions) to the target.

Other commonly-needed extensions:

INSTALL postgres_scanner;  -- Read directly from Postgres
INSTALL iceberg;           -- Apache Iceberg tables
INSTALL delta;             -- Delta Lake tables
INSTALL spatial;           -- GEOS spatial functions
INSTALL aws;               -- AWS credentials chain

Fix 4: S3 Credentials via Secret Manager

The old SET s3_access_key_id = ... pattern still works but is being replaced by the secret manager. Use CREATE SECRET:

CREATE SECRET s3_creds (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'wJalrXUtn...',
    REGION 'us-east-1'
);

-- Or use the AWS credential chain (env vars, ~/.aws, IAM role):
CREATE SECRET s3_creds (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

SELECT * FROM 's3://my-bucket/data.parquet' LIMIT 10;

The CREDENTIAL_CHAIN provider walks the standard AWS credentials chain — env vars first, then ~/.aws/credentials, then EC2/ECS instance metadata. This is what you want in production: no secrets in SQL, credentials rotate with your AWS setup.

Note: Without INSTALL aws; LOAD aws;, PROVIDER CREDENTIAL_CHAIN doesn’t work — it depends on the aws extension.

Fix 5: Pandas / Polars Without Copies

DuckDB can read pandas DataFrames and Polars LazyFrames by name. The trick is to avoid .df() for big result sets — it materializes everything.

For pandas:

import duckdb
import pandas as pd

big_df = pd.read_parquet("events.parquet")

# DuckDB reads big_df in place — no copy.
con = duckdb.connect()
result = con.sql("SELECT user_id, count(*) FROM big_df GROUP BY user_id")

# Stream the result, don't materialize:
for batch in result.fetch_record_batch(rows_per_batch=10_000):
    process(batch)

For Polars (Arrow-native, fastest):

import polars as pl

# Query Parquet directly — no Python materialization at all:
result_arrow = con.sql("""
    SELECT * FROM read_parquet('events.parquet')
    WHERE country = 'JP'
""").arrow()

result_lf = pl.from_arrow(result_arrow)

# Or register an eager Polars DataFrame for ad-hoc queries:
df = pl.read_parquet("events.parquet")
con.register("big", df)
result_arrow = con.sql("SELECT * FROM big WHERE country = 'JP'").arrow()

con.sql(...).arrow() returns a pyarrow.Table without copying — both DuckDB and Polars use Arrow as the in-memory format, so the handoff is free.

Pro Tip: For very large results, prefer writing directly to Parquet and reading back:

con.sql("COPY (SELECT * FROM big_df WHERE ...) TO 'result.parquet' (FORMAT PARQUET)")

This streams without materializing in Python at all.

Fix 6: Handle Type Mismatch on `read_csv` and `read_parquet`

DuckDB infers types by sampling rows. On big CSVs with mixed types, the sample misses edge cases:

Conversion Error: Could not convert string 'N/A' to INT64 at row 12345

Fix the inference with explicit types or expand the sample:

SELECT * FROM read_csv('events.csv',
  columns = {'user_id': 'BIGINT', 'amount': 'DOUBLE', 'status': 'VARCHAR'},
  null_padding = true,
  ignore_errors = true,
  sample_size = -1  -- Sample the whole file
);

sample_size = -1 makes inference accurate but slow on big files. For repeated reads, infer once and save the schema:

schema = con.sql("DESCRIBE SELECT * FROM read_csv('events.csv', sample_size = -1)").df()
# Persist schema, reuse with explicit columns.

For Parquet, types are in the file metadata — mismatches usually mean the file was written with a non-standard logical type. Inspect with:

duckdb -c "SELECT * FROM parquet_schema('events.parquet')"

Fix 7: Long-Running Queries and Cancellation

A query that runs for minutes can be cancelled from another thread:

import duckdb
import threading

con = duckdb.connect()

def run():
    try:
        con.sql("SELECT count(*) FROM huge_table WHERE complex_filter(...)").show()
    except duckdb.duckdb.InterruptException:
        print("cancelled")

t = threading.Thread(target=run)
t.start()
# ... change of plans ...
con.interrupt()
t.join()

con.interrupt() signals the running query to abort at the next safe point. For long aggregations this is usually within seconds.

For pipelines that need a hard timeout, wrap the query in threading.Timer:

import threading

timer = threading.Timer(60.0, con.interrupt)
timer.start()
try:
    result = con.sql("...").df()
finally:
    timer.cancel()

Fix 8: Connecting From Multiple Languages

DuckDB has bindings for Python, R, Node, Java, Go, Rust, and a C/C++ API. They all use the same file format — you can write from Python and read from Node:

# producer.py
import duckdb
con = duckdb.connect("shared.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS metrics(t TIMESTAMP, v DOUBLE)")
con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
con.close()  # Releases the lock.

// consumer.js
import { Database } from "duckdb-async";

const db = await Database.create("shared.duckdb", { read_only: true });
const rows = await db.all("SELECT * FROM metrics");
console.log(rows);

Common Mistake: Leaving connections open and assuming garbage collection will close them. DuckDB connections hold OS-level file locks — close them explicitly with .close() or use a context manager:

with duckdb.connect("shared.duckdb") as con:
    con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
# Connection closed, lock released.

Version History: How DuckDB Changed Under You

DuckDB ships fast. Pinning a version that matches your stack saves entire afternoons.

0.7 (Feb 2023). Introduced the built-in web UI and the first stable Python API that returned Arrow tables without an intermediate copy. If you have notebooks from this era, the .df() shortcut originated here.
0.8 (May 2023). Added the prepared_statements runtime and a first iteration of the secret manager. CSV inference improved but still had the “small sample misses edge case” bug fixed later.
0.9 (Sep 2023). Stabilized CREATE SECRET as the recommended way to hold S3, GCS, and Azure credentials. The older SET s3_access_key_id pragmas still work and are still common in legacy code; you can ignore deprecation warnings short-term but should migrate before 1.x line goes EOL.
0.10 (Feb 2024). Major ART (Adaptive Radix Tree) index improvements. If your application created indexes pre-0.10 and now sees slow updates after upgrading, rebuild the indexes — the format changed and old indexes do not benefit from the new layout.
1.0 GA (Jun 2024). First officially stable release. Storage format was declared frozen in a forward-compatible way: files written by 1.x can be read by future 1.x; files written by 0.x need migration via EXPORT DATABASE / IMPORT DATABASE. Spilling to temp_directory became the default for large hash joins.
1.1 (late 2024). Added vector similarity search via the vss extension, native HuggingFace dataset reads, and read_json improvements. The S3 secret manager picked up KEY_ID from environment variables when not passed explicitly.
1.2+ line. Iterative improvements to the optimizer (PRAGMA explain_output = 'all' now shows physical and logical plans), better Iceberg and Delta support, and refinements to the multi-threaded executor.

A practical rule: if you are on 0.x, the upgrade to 1.x is worth it for memory-spill alone. If you are on 1.0 and considering 1.1+, treat extensions as the upgrade trigger — many community extensions only target the 1.x line. Always run EXPORT DATABASE on critical data before crossing a major boundary; reading a 0.x file in a 1.x build works in most cases but is not guaranteed by the on-disk format contract.

The Python package follows the C++ release closely but lags by a day or two. pip install duckdb==1.1.0 may briefly fail if you upgrade on release day; wait for the wheels to appear on PyPI. The same versioning applies to the Node, R, and Java bindings — they are all separate publishes that pin to the same core engine.

Still Not Working?

A few less-obvious failures:

“Could not find catalog with name ‘memory’”. You ran a query before attaching a database. Either pass a file path to connect() or explicitly attach: ATTACH 'analytics.duckdb' AS analytics; USE analytics;.
Slow first query after start. DuckDB JIT-compiles query plans. The first query of a kind is slow; subsequent identical queries are fast. Benchmark with warmup.
Out of Memory only in production, not dev. Production uses more cores → DuckDB parallelizes more → each thread takes its share of memory_limit. Cap threads if your prod box has fewer cores than expected.
COPY TO ... PARQUET writes uncompressed. Pass COMPRESSION ZSTD (or SNAPPY) explicitly. Default behavior varies by DuckDB version.
SELECT * FROM 'gs://...' fails. Use httpfs with the GCS HTTP endpoint, or install the gcs extension if available for your version.
Locale-dependent number parsing. DuckDB parses 1.234 as 1.234 (one point two three four). If your CSV uses German conventions (1,234 = 1.234), set decimal_separator = ',' in read_csv.
Connection per request in a web app. Each duckdb.connect() is cheap but not free, and the lock contention adds up. Use a singleton connection per process and let DuckDB’s MVCC handle concurrent reads.
Window functions slower than expected. Set PRAGMA enable_progress_bar to see where time goes. Large OVER (PARTITION BY x ORDER BY y) benefits from indexes on x and a sorted load order.
Lambda / serverless containers fail on extension load. The default extension directory is under ~/.duckdb/extensions/, which is not writable on most managed runtimes. Pre-install during the container build and set extension_directory to a path baked into the image.
Newer .duckdb file cannot be opened. DuckDB does not auto-downgrade. If a colleague wrote the file with 1.1 and you are on 1.0, upgrade your binary or ask them to EXPORT DATABASE to Parquet/CSV instead.
Aggregations correct in DuckDB but Pandas roundtrip rounds wrong. Pandas defaults to float64; DuckDB returns DECIMAL when the schema declares it. The .df() cast loses precision. Use .fetch_arrow_table() and then to_pandas(types_mapper=pd.ArrowDtype) if exact decimals matter.

For related analytics and storage issues, see Pandas settingwithcopywarning, Polars not working, Postgres slow query, and SQLite database is locked.

Fix: DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads

The Error

Why This Happens

Fix 1: Open Read-Only for Concurrent Readers

Fix 2: Cap Memory and Set a Temp Directory

Fix 3: Install and Load Extensions Once

Fix 4: S3 Credentials via Secret Manager

Fix 5: Pandas / Polars Without Copies

Fix 6: Handle Type Mismatch on `read_csv` and `read_parquet`

Fix 7: Long-Running Queries and Cancellation

Fix 8: Connecting From Multiple Languages

Version History: How DuckDB Changed Under You

Still Not Working?

Related Articles

Fix: dbt Not Working — ref() Not Found, Schema Mismatch, and Compilation Errors

Fix: BigQuery Not Working — Partitioning, Slots, Streaming Inserts, and Cost Surprises

Fix: ClickHouse Not Working — MergeTree ORDER BY, INSERT Batching, LowCardinality, and Replication

Fix: Snowflake Not Working — Warehouse Suspend, COPY INTO, Roles, Time Travel, and Snowpipe

The Error

Why This Happens

Fix 1: Open Read-Only for Concurrent Readers

Fix 2: Cap Memory and Set a Temp Directory

Fix 3: Install and Load Extensions Once

Fix 4: S3 Credentials via Secret Manager

Fix 5: Pandas / Polars Without Copies

Fix 6: Handle Type Mismatch on read_csv and read_parquet

Fix 7: Long-Running Queries and Cancellation

Fix 8: Connecting From Multiple Languages

Version History: How DuckDB Changed Under You

Still Not Working?

Related Articles

Fix: dbt Not Working — ref() Not Found, Schema Mismatch, and Compilation Errors

Fix: BigQuery Not Working — Partitioning, Slots, Streaming Inserts, and Cost Surprises

Fix: ClickHouse Not Working — MergeTree ORDER BY, INSERT Batching, LowCardinality, and Replication

Fix: Snowflake Not Working — Warehouse Suspend, COPY INTO, Roles, Time Travel, and Snowpipe

Fix 6: Handle Type Mismatch on `read_csv` and `read_parquet`