Fix: DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads
Quick Answer
How to fix DuckDB errors — IOException database is locked, OutOfMemoryException on large queries, httpfs extension not loaded, secret manager for S3, Pandas/Polars zero-copy, and concurrent writer limits.
The Error
You connect to a DuckDB file from two processes and get this:
duckdb.duckdb.IOException: IO Error: Could not set lock on file "analytics.duckdb":
Conflicting lock is held in /path/to/other/processOr your aggregation blows up partway through:
duckdb.duckdb.OutOfMemoryException: Out of Memory Error:
could not allocate block of size 256MB (15.2GB/15.0GB used)Or read_parquet from S3 fails because the extension isn’t installed:
duckdb.duckdb.IOException: HTTP HEAD error on 's3://...'
(HTTP 403). The httpfs extension is not loaded.Or pandas integration silently copies a 50 GB DataFrame:
result = duckdb.sql("SELECT * FROM big_df").df()
# 30 minutes later, OOM.Why This Happens
DuckDB is an embedded OLAP database. Most footguns come from three places:
- Single-writer concurrency. A
.duckdbfile is locked exclusively by the connecting process for writes. Two Python processes both trying to open the same file in read-write mode will conflict, even if they’re just doing reads — DuckDB defaults to read-write. - In-memory by design. DuckDB processes data in memory by default, spilling to disk only when configured. Big aggregations, sorts, or joins on tables larger than RAM will OOM unless you set
memory_limitandtemp_directory. - Extensions are lazy.
httpfs,parquet,json,iceberg,delta— all extensions that aren’t built into the core binary. You mustINSTALLandLOADthem, and on some platforms install requires network access at first use.
The pandas zero-copy issue is a doc-reading problem: duckdb.sql(...).df() materializes the full result set into memory. For large results, use .fetch_arrow_table() or write to Parquet directly.
Fix 1: Open Read-Only for Concurrent Readers
If you have multiple processes reading the same .duckdb file (one writer, many readers — or all readers), open the readers in read-only mode:
import duckdb
# Reader (no lock conflict with the writer):
ro = duckdb.connect("analytics.duckdb", read_only=True)
ro.sql("SELECT count(*) FROM events").show()# Writer (exclusive):
rw = duckdb.connect("analytics.duckdb") # read_only=False is the default
rw.sql("INSERT INTO events VALUES (...)")Read-only connections don’t conflict with each other or with read-only access from other readers. They do conflict with an active writer if the writer holds the file lock during transactions, but DuckDB releases the lock between operations.
Pro Tip: For analytics pipelines with one writer (your ETL) and many readers (dashboards, notebooks), make read-only the default in all reader code paths. Mistakes default to “can’t connect” rather than “silently broke the ETL.”
Fix 2: Cap Memory and Set a Temp Directory
DuckDB will spill intermediate results to disk if you give it a temp directory and let it know your memory budget:
con = duckdb.connect("analytics.duckdb")
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")
con.execute("SET max_temp_directory_size = '50GB'")memory_limit caps in-memory state. When queries exceed it, DuckDB streams the overflow to temp_directory. Without these settings, a query that doesn’t fit in RAM either OOMs or fails hard depending on the platform.
For long-running ETLs:
con.execute("SET preserve_insertion_order = false") # Faster, more memory-efficient
con.execute("SET threads = 8") # Cap CPU usageCommon Mistake: Setting memory_limit higher than physical RAM “to be safe.” DuckDB will happily allocate that much before spilling; you want memory_limit below RAM so it spills before the OS kills the process.
Fix 3: Install and Load Extensions Once
For Parquet on local disk, no extension needed — it’s built in. For Parquet on S3, you need httpfs:
import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
# Now S3 reads work:
con.sql("SELECT * FROM 's3://my-bucket/events.parquet' LIMIT 10").show()INSTALL downloads the extension once and caches it. LOAD activates it for the current connection. Subsequent connections still need LOAD but skip the download.
For air-gapped environments, pre-download extensions:
# On a machine with network access:
duckdb -c "INSTALL httpfs; INSTALL parquet; INSTALL json"
# Then ship the cache directory (~/.duckdb/extensions) to the target.Other commonly-needed extensions:
INSTALL postgres_scanner; -- Read directly from Postgres
INSTALL iceberg; -- Apache Iceberg tables
INSTALL delta; -- Delta Lake tables
INSTALL spatial; -- GEOS spatial functions
INSTALL aws; -- AWS credentials chainFix 4: S3 Credentials via Secret Manager
The old SET s3_access_key_id = ... pattern still works but is being replaced by the secret manager. Use CREATE SECRET:
CREATE SECRET s3_creds (
TYPE S3,
KEY_ID 'AKIA...',
SECRET 'wJalrXUtn...',
REGION 'us-east-1'
);
-- Or use the AWS credential chain (env vars, ~/.aws, IAM role):
CREATE SECRET s3_creds (
TYPE S3,
PROVIDER CREDENTIAL_CHAIN
);
SELECT * FROM 's3://my-bucket/data.parquet' LIMIT 10;The CREDENTIAL_CHAIN provider walks the standard AWS credentials chain — env vars first, then ~/.aws/credentials, then EC2/ECS instance metadata. This is what you want in production: no secrets in SQL, credentials rotate with your AWS setup.
Note: Without INSTALL aws; LOAD aws;, PROVIDER CREDENTIAL_CHAIN doesn’t work — it depends on the aws extension.
Fix 5: Pandas / Polars Without Copies
DuckDB can read pandas DataFrames and Polars LazyFrames by name. The trick is to avoid .df() for big result sets — it materializes everything.
For pandas:
import duckdb
import pandas as pd
big_df = pd.read_parquet("events.parquet")
# DuckDB reads big_df in place — no copy.
con = duckdb.connect()
result = con.sql("SELECT user_id, count(*) FROM big_df GROUP BY user_id")
# Stream the result, don't materialize:
for batch in result.fetch_record_batch(rows_per_batch=10_000):
process(batch)For Polars (Arrow-native, fastest):
import polars as pl
# Query Parquet directly — no Python materialization at all:
result_arrow = con.sql("""
SELECT * FROM read_parquet('events.parquet')
WHERE country = 'JP'
""").arrow()
result_lf = pl.from_arrow(result_arrow)
# Or register an eager Polars DataFrame for ad-hoc queries:
df = pl.read_parquet("events.parquet")
con.register("big", df)
result_arrow = con.sql("SELECT * FROM big WHERE country = 'JP'").arrow()con.sql(...).arrow() returns a pyarrow.Table without copying — both DuckDB and Polars use Arrow as the in-memory format, so the handoff is free.
Pro Tip: For very large results, prefer writing directly to Parquet and reading back:
con.sql("COPY (SELECT * FROM big_df WHERE ...) TO 'result.parquet' (FORMAT PARQUET)")This streams without materializing in Python at all.
Fix 6: Handle Type Mismatch on read_csv and read_parquet
DuckDB infers types by sampling rows. On big CSVs with mixed types, the sample misses edge cases:
Conversion Error: Could not convert string 'N/A' to INT64 at row 12345Fix the inference with explicit types or expand the sample:
SELECT * FROM read_csv('events.csv',
columns = {'user_id': 'BIGINT', 'amount': 'DOUBLE', 'status': 'VARCHAR'},
null_padding = true,
ignore_errors = true,
sample_size = -1 -- Sample the whole file
);sample_size = -1 makes inference accurate but slow on big files. For repeated reads, infer once and save the schema:
schema = con.sql("DESCRIBE SELECT * FROM read_csv('events.csv', sample_size = -1)").df()
# Persist schema, reuse with explicit columns.For Parquet, types are in the file metadata — mismatches usually mean the file was written with a non-standard logical type. Inspect with:
duckdb -c "SELECT * FROM parquet_schema('events.parquet')"Fix 7: Long-Running Queries and Cancellation
A query that runs for minutes can be cancelled from another thread:
import duckdb
import threading
con = duckdb.connect()
def run():
try:
con.sql("SELECT count(*) FROM huge_table WHERE complex_filter(...)").show()
except duckdb.duckdb.InterruptException:
print("cancelled")
t = threading.Thread(target=run)
t.start()
# ... change of plans ...
con.interrupt()
t.join()con.interrupt() signals the running query to abort at the next safe point. For long aggregations this is usually within seconds.
For pipelines that need a hard timeout, wrap the query in threading.Timer:
import threading
timer = threading.Timer(60.0, con.interrupt)
timer.start()
try:
result = con.sql("...").df()
finally:
timer.cancel()Fix 8: Connecting From Multiple Languages
DuckDB has bindings for Python, R, Node, Java, Go, Rust, and a C/C++ API. They all use the same file format — you can write from Python and read from Node:
# producer.py
import duckdb
con = duckdb.connect("shared.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS metrics(t TIMESTAMP, v DOUBLE)")
con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
con.close() # Releases the lock.// consumer.js
import { Database } from "duckdb-async";
const db = await Database.create("shared.duckdb", { read_only: true });
const rows = await db.all("SELECT * FROM metrics");
console.log(rows);Common Mistake: Leaving connections open and assuming garbage collection will close them. DuckDB connections hold OS-level file locks — close them explicitly with .close() or use a context manager:
with duckdb.connect("shared.duckdb") as con:
con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
# Connection closed, lock released.Still Not Working?
A few less-obvious failures:
- “Could not find catalog with name ‘memory’”. You ran a query before attaching a database. Either pass a file path to
connect()or explicitly attach:ATTACH 'analytics.duckdb' AS analytics; USE analytics;. - Slow first query after start. DuckDB JIT-compiles query plans. The first query of a kind is slow; subsequent identical queries are fast. Benchmark with warmup.
Out of Memoryonly in production, not dev. Production uses more cores → DuckDB parallelizes more → each thread takes its share ofmemory_limit. Capthreadsif your prod box has fewer cores than expected.COPY TO ... PARQUETwrites uncompressed. PassCOMPRESSION ZSTD(orSNAPPY) explicitly. Default behavior varies by DuckDB version.SELECT * FROM 'gs://...'fails. Usehttpfswith the GCS HTTP endpoint, or install thegcsextension if available for your version.- Locale-dependent number parsing. DuckDB parses
1.234as 1.234 (one point two three four). If your CSV uses German conventions (1,234= 1.234), setdecimal_separator = ','inread_csv. - Connection per request in a web app. Each
duckdb.connect()is cheap but not free, and the lock contention adds up. Use a singleton connection per process and let DuckDB’s MVCC handle concurrent reads. - Window functions slower than expected. Set
PRAGMA enable_progress_barto see where time goes. LargeOVER (PARTITION BY x ORDER BY y)benefits from indexes onxand a sorted load order.
For related analytics and storage issues, see Pandas settingwithcopywarning, Polars not working, Postgres slow query, and SQLite database is locked.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
Was this article helpful?
Related Articles
Fix: dbt Not Working — ref() Not Found, Schema Mismatch, and Compilation Errors
How to fix dbt errors — ref() model not found, profile not found, database relation does not exist, incremental model schema mismatch requiring full-refresh, dbt deps failure, Jinja compilation errors, and test failures.
Fix: BigQuery Not Working — Partitioning, Slots, Streaming Inserts, and Cost Surprises
How to fix BigQuery errors — query scans full table without partition filter, slot contention with on-demand pricing, streaming insert quota exceeded, DML row limits, service account auth, and INFORMATION_SCHEMA for monitoring.
Fix: ClickHouse Not Working — MergeTree ORDER BY, INSERT Batching, LowCardinality, and Replication
How to fix ClickHouse errors — table engine choice, ORDER BY for primary key, INSERT too many small parts, LowCardinality types, Nullable performance, ZooKeeper for Replicated tables, and HTTP vs Native client.
Fix: Snowflake Not Working — Warehouse Suspend, COPY INTO, Roles, Time Travel, and Snowpipe
How to fix Snowflake errors — warehouse auto-suspend pricing, account URL format, role-vs-grant confusion, COPY INTO S3/Azure errors, Snowpipe lag, Time Travel and Fail-safe, and connection string for snowflake-connector-python.