Skip to content

Fix: DuckDB Not Working — File Lock Conflicts, Out of Memory, Extensions, and Parquet/S3 Reads

FixDevs ·

Quick Answer

How to fix DuckDB errors — IOException database is locked, OutOfMemoryException on large queries, httpfs extension not loaded, secret manager for S3, Pandas/Polars zero-copy, and concurrent writer limits.

The Error

You connect to a DuckDB file from two processes and get this:

duckdb.duckdb.IOException: IO Error: Could not set lock on file "analytics.duckdb":
Conflicting lock is held in /path/to/other/process

Or your aggregation blows up partway through:

duckdb.duckdb.OutOfMemoryException: Out of Memory Error:
could not allocate block of size 256MB (15.2GB/15.0GB used)

Or read_parquet from S3 fails because the extension isn’t installed:

duckdb.duckdb.IOException: HTTP HEAD error on 's3://...'
(HTTP 403). The httpfs extension is not loaded.

Or pandas integration silently copies a 50 GB DataFrame:

result = duckdb.sql("SELECT * FROM big_df").df()
# 30 minutes later, OOM.

Why This Happens

DuckDB is an embedded OLAP database. Most footguns come from three places:

  • Single-writer concurrency. A .duckdb file is locked exclusively by the connecting process for writes. Two Python processes both trying to open the same file in read-write mode will conflict, even if they’re just doing reads — DuckDB defaults to read-write.
  • In-memory by design. DuckDB processes data in memory by default, spilling to disk only when configured. Big aggregations, sorts, or joins on tables larger than RAM will OOM unless you set memory_limit and temp_directory.
  • Extensions are lazy. httpfs, parquet, json, iceberg, delta — all extensions that aren’t built into the core binary. You must INSTALL and LOAD them, and on some platforms install requires network access at first use.

The pandas zero-copy issue is a doc-reading problem: duckdb.sql(...).df() materializes the full result set into memory. For large results, use .fetch_arrow_table() or write to Parquet directly.

Fix 1: Open Read-Only for Concurrent Readers

If you have multiple processes reading the same .duckdb file (one writer, many readers — or all readers), open the readers in read-only mode:

import duckdb

# Reader (no lock conflict with the writer):
ro = duckdb.connect("analytics.duckdb", read_only=True)
ro.sql("SELECT count(*) FROM events").show()
# Writer (exclusive):
rw = duckdb.connect("analytics.duckdb")  # read_only=False is the default
rw.sql("INSERT INTO events VALUES (...)")

Read-only connections don’t conflict with each other or with read-only access from other readers. They do conflict with an active writer if the writer holds the file lock during transactions, but DuckDB releases the lock between operations.

Pro Tip: For analytics pipelines with one writer (your ETL) and many readers (dashboards, notebooks), make read-only the default in all reader code paths. Mistakes default to “can’t connect” rather than “silently broke the ETL.”

Fix 2: Cap Memory and Set a Temp Directory

DuckDB will spill intermediate results to disk if you give it a temp directory and let it know your memory budget:

con = duckdb.connect("analytics.duckdb")
con.execute("SET memory_limit = '8GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")
con.execute("SET max_temp_directory_size = '50GB'")

memory_limit caps in-memory state. When queries exceed it, DuckDB streams the overflow to temp_directory. Without these settings, a query that doesn’t fit in RAM either OOMs or fails hard depending on the platform.

For long-running ETLs:

con.execute("SET preserve_insertion_order = false")   # Faster, more memory-efficient
con.execute("SET threads = 8")                        # Cap CPU usage

Common Mistake: Setting memory_limit higher than physical RAM “to be safe.” DuckDB will happily allocate that much before spilling; you want memory_limit below RAM so it spills before the OS kills the process.

Fix 3: Install and Load Extensions Once

For Parquet on local disk, no extension needed — it’s built in. For Parquet on S3, you need httpfs:

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Now S3 reads work:
con.sql("SELECT * FROM 's3://my-bucket/events.parquet' LIMIT 10").show()

INSTALL downloads the extension once and caches it. LOAD activates it for the current connection. Subsequent connections still need LOAD but skip the download.

For air-gapped environments, pre-download extensions:

# On a machine with network access:
duckdb -c "INSTALL httpfs; INSTALL parquet; INSTALL json"
# Then ship the cache directory (~/.duckdb/extensions) to the target.

Other commonly-needed extensions:

INSTALL postgres_scanner;  -- Read directly from Postgres
INSTALL iceberg;           -- Apache Iceberg tables
INSTALL delta;             -- Delta Lake tables
INSTALL spatial;           -- GEOS spatial functions
INSTALL aws;               -- AWS credentials chain

Fix 4: S3 Credentials via Secret Manager

The old SET s3_access_key_id = ... pattern still works but is being replaced by the secret manager. Use CREATE SECRET:

CREATE SECRET s3_creds (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET 'wJalrXUtn...',
    REGION 'us-east-1'
);

-- Or use the AWS credential chain (env vars, ~/.aws, IAM role):
CREATE SECRET s3_creds (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

SELECT * FROM 's3://my-bucket/data.parquet' LIMIT 10;

The CREDENTIAL_CHAIN provider walks the standard AWS credentials chain — env vars first, then ~/.aws/credentials, then EC2/ECS instance metadata. This is what you want in production: no secrets in SQL, credentials rotate with your AWS setup.

Note: Without INSTALL aws; LOAD aws;, PROVIDER CREDENTIAL_CHAIN doesn’t work — it depends on the aws extension.

Fix 5: Pandas / Polars Without Copies

DuckDB can read pandas DataFrames and Polars LazyFrames by name. The trick is to avoid .df() for big result sets — it materializes everything.

For pandas:

import duckdb
import pandas as pd

big_df = pd.read_parquet("events.parquet")

# DuckDB reads big_df in place — no copy.
con = duckdb.connect()
result = con.sql("SELECT user_id, count(*) FROM big_df GROUP BY user_id")

# Stream the result, don't materialize:
for batch in result.fetch_record_batch(rows_per_batch=10_000):
    process(batch)

For Polars (Arrow-native, fastest):

import polars as pl

# Query Parquet directly — no Python materialization at all:
result_arrow = con.sql("""
    SELECT * FROM read_parquet('events.parquet')
    WHERE country = 'JP'
""").arrow()

result_lf = pl.from_arrow(result_arrow)

# Or register an eager Polars DataFrame for ad-hoc queries:
df = pl.read_parquet("events.parquet")
con.register("big", df)
result_arrow = con.sql("SELECT * FROM big WHERE country = 'JP'").arrow()

con.sql(...).arrow() returns a pyarrow.Table without copying — both DuckDB and Polars use Arrow as the in-memory format, so the handoff is free.

Pro Tip: For very large results, prefer writing directly to Parquet and reading back:

con.sql("COPY (SELECT * FROM big_df WHERE ...) TO 'result.parquet' (FORMAT PARQUET)")

This streams without materializing in Python at all.

Fix 6: Handle Type Mismatch on read_csv and read_parquet

DuckDB infers types by sampling rows. On big CSVs with mixed types, the sample misses edge cases:

Conversion Error: Could not convert string 'N/A' to INT64 at row 12345

Fix the inference with explicit types or expand the sample:

SELECT * FROM read_csv('events.csv',
  columns = {'user_id': 'BIGINT', 'amount': 'DOUBLE', 'status': 'VARCHAR'},
  null_padding = true,
  ignore_errors = true,
  sample_size = -1  -- Sample the whole file
);

sample_size = -1 makes inference accurate but slow on big files. For repeated reads, infer once and save the schema:

schema = con.sql("DESCRIBE SELECT * FROM read_csv('events.csv', sample_size = -1)").df()
# Persist schema, reuse with explicit columns.

For Parquet, types are in the file metadata — mismatches usually mean the file was written with a non-standard logical type. Inspect with:

duckdb -c "SELECT * FROM parquet_schema('events.parquet')"

Fix 7: Long-Running Queries and Cancellation

A query that runs for minutes can be cancelled from another thread:

import duckdb
import threading

con = duckdb.connect()

def run():
    try:
        con.sql("SELECT count(*) FROM huge_table WHERE complex_filter(...)").show()
    except duckdb.duckdb.InterruptException:
        print("cancelled")

t = threading.Thread(target=run)
t.start()
# ... change of plans ...
con.interrupt()
t.join()

con.interrupt() signals the running query to abort at the next safe point. For long aggregations this is usually within seconds.

For pipelines that need a hard timeout, wrap the query in threading.Timer:

import threading

timer = threading.Timer(60.0, con.interrupt)
timer.start()
try:
    result = con.sql("...").df()
finally:
    timer.cancel()

Fix 8: Connecting From Multiple Languages

DuckDB has bindings for Python, R, Node, Java, Go, Rust, and a C/C++ API. They all use the same file format — you can write from Python and read from Node:

# producer.py
import duckdb
con = duckdb.connect("shared.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS metrics(t TIMESTAMP, v DOUBLE)")
con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
con.close()  # Releases the lock.
// consumer.js
import { Database } from "duckdb-async";

const db = await Database.create("shared.duckdb", { read_only: true });
const rows = await db.all("SELECT * FROM metrics");
console.log(rows);

Common Mistake: Leaving connections open and assuming garbage collection will close them. DuckDB connections hold OS-level file locks — close them explicitly with .close() or use a context manager:

with duckdb.connect("shared.duckdb") as con:
    con.execute("INSERT INTO metrics VALUES (NOW(), 42.0)")
# Connection closed, lock released.

Still Not Working?

A few less-obvious failures:

  • “Could not find catalog with name ‘memory’”. You ran a query before attaching a database. Either pass a file path to connect() or explicitly attach: ATTACH 'analytics.duckdb' AS analytics; USE analytics;.
  • Slow first query after start. DuckDB JIT-compiles query plans. The first query of a kind is slow; subsequent identical queries are fast. Benchmark with warmup.
  • Out of Memory only in production, not dev. Production uses more cores → DuckDB parallelizes more → each thread takes its share of memory_limit. Cap threads if your prod box has fewer cores than expected.
  • COPY TO ... PARQUET writes uncompressed. Pass COMPRESSION ZSTD (or SNAPPY) explicitly. Default behavior varies by DuckDB version.
  • SELECT * FROM 'gs://...' fails. Use httpfs with the GCS HTTP endpoint, or install the gcs extension if available for your version.
  • Locale-dependent number parsing. DuckDB parses 1.234 as 1.234 (one point two three four). If your CSV uses German conventions (1,234 = 1.234), set decimal_separator = ',' in read_csv.
  • Connection per request in a web app. Each duckdb.connect() is cheap but not free, and the lock contention adds up. Use a singleton connection per process and let DuckDB’s MVCC handle concurrent reads.
  • Window functions slower than expected. Set PRAGMA enable_progress_bar to see where time goes. Large OVER (PARTITION BY x ORDER BY y) benefits from indexes on x and a sorted load order.

For related analytics and storage issues, see Pandas settingwithcopywarning, Polars not working, Postgres slow query, and SQLite database is locked.

F

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

Was this article helpful?

Related Articles