
Fix: ONNX Not Working — Conversion Errors, Runtime Provider Issues, and Dynamic Shape Problems

FixDevs ·

Quick Answer

How to fix ONNX errors — torch.onnx.export unsupported operator, ONNX Runtime CUDA provider not found, InvalidArgument input shape mismatch, dynamic axes not working, IR version mismatch, and opset version conflicts.

The Error

You export a PyTorch model to ONNX and the converter chokes:

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::custom_op'
to ONNX opset version 17 is not supported.

Or ONNX Runtime starts but ignores your GPU:

session = ort.InferenceSession("model.onnx")
print(session.get_providers())
# ['CPUExecutionProvider']   # Even though you installed onnxruntime-gpu

Or inference fails with a shape mismatch:

InvalidArgument: Got invalid dimensions for input: input_ids.
Expected: {1, 512}, received: {1, 768}

Or the exported model requires a fixed batch size and you need variable batch:

RuntimeError: Input 'input' got shape [3, 224, 224, 3] but expected [1, 224, 224, 3]

Or the model loads but outputs garbage:

# PyTorch output: [0.87, 0.12, 0.01]
# ONNX output:    [0.34, 0.33, 0.33]   # Same input, wrong result

ONNX is the standard interchange format for ML models — trained in PyTorch/TensorFlow, deployed with ONNX Runtime, TensorRT, or CoreML. The conversion process introduces subtle bugs: unsupported operators, wrong opset versions, incorrect dynamic axes, and precision mismatches. This guide covers each failure mode.

Why This Happens

ONNX defines a fixed set of operators per opset version. PyTorch/TensorFlow have thousands of operations — most map cleanly to ONNX, but custom ops, certain complex indexing patterns, and newer operators don’t. The exporter picks an opset and silently fails on unsupported ops, often with misleading error messages.
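The gating can be sketched as a lookup: every ONNX operator has a minimum opset version, and any op whose minimum exceeds your export target fails. The table below is a tiny illustrative subset (the full registry is in the ONNX operator docs); the helper name is made up for this sketch:

```python
# Minimum opset for a few real ONNX operators (illustrative subset)
MIN_OPSET = {"GridSample": 16, "LayerNormalization": 17, "Gelu": 20}

def unsupported_ops(ops_used, target_opset):
    """Return the ops that the chosen opset cannot express."""
    return [op for op in ops_used if MIN_OPSET.get(op, 1) > target_opset]

print(unsupported_ops(["Conv", "GridSample", "Gelu"], 17))   # ['Gelu']
```

This is why "raise the opset version" is usually the first fix: at opset 17 the export above only trips on `Gelu`, and opset 20 would clear it.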

ONNX Runtime uses “execution providers” (CPU, CUDA, TensorRT, OpenVINO). Each provider has its own installation and runtime requirements. Installing onnxruntime-gpu doesn’t automatically enable CUDA — the provider must be listed when creating the session, and the matching CUDA toolkit version must be present.

Fix 1: Exporting a PyTorch Model to ONNX

import torch
import torch.onnx

model = MyModel()
model.eval()   # Export mode — disables dropout, batchnorm updates

dummy_input = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=17,              # ONNX operator set version
    do_constant_folding=True,      # Optimize constants
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},   # Allow variable batch at dim 0
        "output": {0: "batch_size"},
    },
)

Opset version table:

Opset   PyTorch compatibility
11      Legacy; minimum for ORT 1.2+
13      Good default for older PyTorch
14-16   Current production sweet spot
17      PyTorch 1.13+
18-20   PyTorch 2.x
21+     PyTorch 2.3+ (newest ops)

Use higher opsets if you need newer operators; older opsets for compatibility with legacy runtimes:

torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)

For PyTorch 2.x, try the newer dynamo-based exporter, which captures more models than the legacy TorchScript tracer (from PyTorch 2.5 it is also exposed as torch.onnx.export(..., dynamo=True)):

import torch

# New dynamo-based exporter (PyTorch 2.1+)
torch.onnx.dynamo_export(model, dummy_input).save("model.onnx")

# Or with options
onnx_program = torch.onnx.dynamo_export(
    model,
    dummy_input,
    export_options=torch.onnx.ExportOptions(dynamic_shapes=True),
)

Unsupported operators:

UnsupportedOperatorError: Exporting the operator 'aten::grid_sampler'
to ONNX opset version 11 is not supported.

Solutions in order:

  1. Raise opset version — newer opsets support more operators
  2. Use a supported alternative — e.g., replace custom indexing with gather or scatter
  3. Register a custom ONNX function:
from torch.onnx import register_custom_op_symbolic

def custom_op_export(g, *args):
    return g.op("custom::MyOp", *args)

register_custom_op_symbolic("mylib::my_op", custom_op_export, opset_version=17)

Common Mistake: Exporting a model still in training mode (with dropout, batchnorm in update mode). This produces incorrect ONNX output because randomness and running statistics don’t match evaluation. Always call model.eval() before export.

Fix 2: ONNX Runtime Provider Setup

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
print(session.get_providers())   # ['CPUExecutionProvider'] even with GPU installed

The provider must be explicitly requested, and the correct package must be installed.

Install the right package:

# CPU only
pip install onnxruntime

# NVIDIA GPU
pip install onnxruntime-gpu

# Can't have both in the same env
pip uninstall onnxruntime   # Remove CPU before installing GPU
pip install onnxruntime-gpu

Specify providers when creating the session:

import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {
        "device_id": 0,
        "arena_extend_strategy": "kNextPowerOfTwo",
        "gpu_mem_limit": 4 * 1024 * 1024 * 1024,   # 4 GB
    }),
    "CPUExecutionProvider",   # Fallback if CUDA fails
]

session = ort.InferenceSession("model.onnx", providers=providers)
print(session.get_providers())   # ['CUDAExecutionProvider', 'CPUExecutionProvider']

Available providers (require matching install):

Provider                    Install                      Use case
CPUExecutionProvider        onnxruntime                  Default everywhere
CUDAExecutionProvider       onnxruntime-gpu              NVIDIA GPUs
TensorrtExecutionProvider   onnxruntime-gpu + TensorRT   NVIDIA, higher perf than CUDA
OpenVINOExecutionProvider   onnxruntime-openvino         Intel CPUs/GPUs
CoreMLExecutionProvider     onnxruntime (macOS build)    Apple Silicon
DmlExecutionProvider        onnxruntime-directml         Windows DirectML (any GPU)

Verify provider is actually used:

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Check provider list
print(session.get_providers())

# If it fell back to CPU silently, CUDA setup is broken
assert "CUDAExecutionProvider" in session.get_providers(), "GPU not available!"

CUDA version requirements:

onnxruntime-gpu   CUDA           cuDNN
1.20              12.4           9.1
1.18              12.2           8.9
1.17              11.8 or 12.2   8.9
1.16              11.8           8.9

Use nvidia-smi to check your driver/CUDA version.

Pro Tip: Always include CPUExecutionProvider as a fallback in your providers list. If CUDA fails to initialize (wrong version, missing library, out of memory), the session falls back to CPU instead of crashing. Production systems should degrade gracefully when GPU resources are unavailable.

Fix 3: Dynamic Shapes and Batch Sizes

A model exported with dummy_input shape (1, 3, 224, 224) rejects any other batch size without dynamic_axes:

# WRONG — fixed shape
torch.onnx.export(model, dummy_input, "model.onnx")

# Later
session.run(None, {"input": batch_of_8_images})   # InvalidArgument: expected batch 1

Fix — declare dynamic axes:

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch_size"},           # Batch dimension is dynamic
        "output": {0: "batch_size"},
    },
    opset_version=17,
)

Multiple dynamic dimensions:

dynamic_axes={
    "input": {
        0: "batch_size",
        2: "height",
        3: "width",
    },
    "output": {0: "batch_size"},
}

Verify shapes in the ONNX model:

import onnx

model = onnx.load("model.onnx")
for inp in model.graph.input:
    print(f"{inp.name}: {[dim.dim_value or dim.dim_param for dim in inp.type.tensor_type.shape.dim]}")
# input: ['batch_size', 3, 224, 224]

If dimensions show integers (like 1), they’re fixed. If they show strings (like 'batch_size'), they’re dynamic.
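A tiny pure-Python helper (the name is made up for this sketch) makes that check explicit for the shape lists you get back from onnx or session.get_inputs():

```python
def describe_dims(shape):
    """Label each ONNX dimension: ints are fixed, strings are dynamic."""
    return [
        f"{d} (dynamic)" if isinstance(d, str) else f"{d} (fixed)"
        for d in shape
    ]

print(describe_dims(["batch_size", 3, 224, 224]))
# ['batch_size (dynamic)', '3 (fixed)', '224 (fixed)', '224 (fixed)']
```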

Fix a model with wrong dynamic axes using onnx tools:

import onnx
from onnx.tools import update_model_dims

model = onnx.load("model.onnx")
updated_model = update_model_dims.update_inputs_outputs_dims(
    model,
    input_dims={"input": ["batch", 3, 224, 224]},   # Set batch dimension to dynamic
    output_dims={"output": ["batch", 1000]},
)
onnx.save(updated_model, "model_dynamic.onnx")

Fix 4: Input/Output Shape Mismatch at Runtime

InvalidArgument: Got invalid dimensions for input: input.
Expected: {1, 3, 224, 224}, received: {1, 224, 224, 3}

This is usually a layout (channels-first vs channels-last) mismatch.

import numpy as np

# PyTorch / ONNX convention: NCHW (batch, channels, height, width)
# TensorFlow convention:     NHWC (batch, height, width, channels)
img_nhwc = np.random.rand(1, 224, 224, 3).astype(np.float32)

# Convert NHWC to NCHW before feeding the ONNX model:
img_nchw = np.transpose(img_nhwc, (0, 3, 1, 2))   # Shape: (1, 3, 224, 224)
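If inputs may arrive in either layout, a guard that only transposes when needed avoids double-converting. This is a sketch (hypothetical helper) that assumes 3-channel images, which is what disambiguates the two layouts:

```python
import numpy as np

def to_nchw(img):
    """Return a channels-first view of a 4-D image batch; no-op if already NCHW.

    Assumes 3-channel images; other channel counts would be ambiguous."""
    if img.ndim == 4 and img.shape[-1] == 3 and img.shape[1] != 3:
        return np.ascontiguousarray(np.transpose(img, (0, 3, 1, 2)))
    return img

x = np.zeros((2, 224, 224, 3), dtype=np.float32)   # NHWC batch
print(to_nchw(x).shape)   # (2, 3, 224, 224)
```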

Check expected input types and shapes:

import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

for inp in session.get_inputs():
    print(f"Name: {inp.name}")
    print(f"Shape: {inp.shape}")
    print(f"Type: {inp.type}")

# Name: input
# Shape: ['batch_size', 3, 224, 224]
# Type: tensor(float)

Type matching: tensor(float) means float32, not float64:

# WRONG
img = np.random.rand(1, 3, 224, 224)   # Default float64

# CORRECT
img = np.random.rand(1, 3, 224, 224).astype(np.float32)
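A pre-flight check against the spec from session.get_inputs() turns ORT's terse InvalidArgument into a readable message. This helper is hypothetical (not part of onnxruntime) and only covers the two most common tensor types:

```python
import numpy as np

DTYPE_MAP = {"tensor(float)": np.float32, "tensor(int64)": np.int64}

def check_input(arr, expected_shape, expected_type="tensor(float)"):
    """Validate dtype and shape before session.run; string dims are dynamic."""
    want = DTYPE_MAP[expected_type]
    if arr.dtype != want:
        raise TypeError(f"dtype {arr.dtype}, expected {np.dtype(want).name}")
    if arr.ndim != len(expected_shape):
        raise ValueError(f"rank {arr.ndim}, expected {len(expected_shape)}")
    for i, (got, want_dim) in enumerate(zip(arr.shape, expected_shape)):
        if isinstance(want_dim, int) and got != want_dim:
            raise ValueError(f"dim {i}: got {got}, expected {want_dim}")

check_input(np.zeros((8, 3, 224, 224), np.float32), ["batch_size", 3, 224, 224])
print("input OK")
```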

Run inference:

outputs = session.run(
    None,                          # None = all outputs
    {"input": img},                # Dict: input_name → numpy array
)

# outputs is a list, one per output
prediction = outputs[0]

Specify which outputs to compute (faster for multi-output models):

outputs = session.run(
    ["logits", "attention_weights"],   # Only these two
    {"input": img},
)

Fix 5: Verifying Export Correctness

After export, always verify the ONNX model produces the same output as the original:

import torch
import onnxruntime as ort
import numpy as np

model = MyModel().eval()
dummy_input = torch.randn(1, 3, 224, 224)

# PyTorch inference
with torch.no_grad():
    pytorch_output = model(dummy_input).numpy()

# ONNX inference
session = ort.InferenceSession("model.onnx")
onnx_output = session.run(None, {"input": dummy_input.numpy()})[0]

# Compare
diff = np.abs(pytorch_output - onnx_output).max()
print(f"Max diff: {diff:.6f}")
assert diff < 1e-4, "ONNX output diverges from PyTorch!"

Common Mistake: Exporting and deploying without verifying. Small differences (1e-6) are fine — floating-point ops aren’t identical across frameworks. Large differences (>1e-3) indicate bugs like wrong training/eval mode, custom op miscompilation, or unsupported operator fallbacks producing wrong results silently.
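When triaging a divergence, looking at more than the max helps: a uniformly large mean diff suggests a mode or weight problem, while a few large outliers point at a single misbehaving op. A small numpy sketch (hypothetical helper):

```python
import numpy as np

def diff_report(a, b):
    """Summarize elementwise divergence between two model outputs."""
    d = np.abs(np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64))
    return {
        "max": float(d.max()),
        "mean": float(d.mean()),
        "frac_over_1e-3": float((d > 1e-3).mean()),   # share of "bad" elements
    }

print(diff_report([0.87, 0.12, 0.01], [0.8701, 0.1199, 0.01]))
```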

Validate ONNX model structure:

import onnx

model = onnx.load("model.onnx")
onnx.checker.check_model(model)   # Raises if model is malformed

# IR version, opset version
print(f"IR version: {model.ir_version}")
print(f"Producer: {model.producer_name} {model.producer_version}")
for opset in model.opset_import:
    print(f"Opset: {opset.domain} v{opset.version}")

Fix 6: Performance and Optimization

Graph optimization levels:

import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Options: ORT_DISABLE_ALL, ORT_ENABLE_BASIC, ORT_ENABLE_EXTENDED, ORT_ENABLE_ALL

session = ort.InferenceSession("model.onnx", sess_options=sess_options)

Thread tuning for CPU:

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 4   # Parallelism within a single op
sess_options.inter_op_num_threads = 1   # Parallelism across ops (keep low)
sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
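A common starting point for intra_op_num_threads is one thread per physical core. The heuristic below is a sketch that assumes 2-way SMT (hyperthreading), which os.cpu_count cannot detect by itself; benchmark on your own hardware:

```python
import os

def pick_intra_op_threads():
    """Heuristic: one thread per physical core, assuming 2-way SMT."""
    logical = os.cpu_count() or 1
    return max(1, logical // 2)

# sess_options.intra_op_num_threads = pick_intra_op_threads()
print(pick_intra_op_threads())
```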

Save optimized model for faster future loads:

sess_options.optimized_model_filepath = "model_optimized.onnx"
session = ort.InferenceSession("model.onnx", sess_options=sess_options)
# Optimized version saved to disk; use it directly next time

Quantization for smaller models and faster CPU inference:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)

# 4x smaller, often 2-4x faster on CPU, slight accuracy drop

IO binding — avoid Python overhead for high-throughput inference:

import numpy as np

io_binding = session.io_binding()

# Bind input (can be on GPU directly for zero-copy)
# input_buffer_ptr is the address of a pre-allocated CUDA buffer
io_binding.bind_input(
    name="input",
    device_type="cuda",
    device_id=0,
    element_type=np.float32,
    shape=(1, 3, 224, 224),
    buffer_ptr=input_buffer_ptr,
)

# Bind output
io_binding.bind_output(name="output", device_type="cuda", device_id=0)

# Run
session.run_with_iobinding(io_binding)

Fix 7: Converting from Other Frameworks

TensorFlow/Keras to ONNX with tf2onnx:

pip install tf2onnx
# SavedModel format
python -m tf2onnx.convert --saved-model my_model/ --output model.onnx --opset 17

# Keras H5 format
python -m tf2onnx.convert --keras model.h5 --output model.onnx --opset 17

For TensorFlow-specific issues that come up during conversion, see TensorFlow not working.

scikit-learn to ONNX with skl2onnx:

from skl2onnx import to_onnx
import numpy as np

model = train_sklearn_model()
onnx_model = to_onnx(
    model,
    X_train[:1].astype(np.float32),
    target_opset=17,
)
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

For scikit-learn pipeline patterns that export to ONNX, see scikit-learn not working.

HuggingFace Transformers to ONNX:

pip install optimum[onnxruntime]
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Export and load in one step
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    export=True,
)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

inputs = tokenizer("Hello world", return_tensors="pt")
outputs = model(**inputs)

For HuggingFace-specific patterns, see HuggingFace Transformers not working.

Fix 8: Debugging Common ONNX Issues

Inspect model graph:

# Command-line tool
pip install onnx
python -c "import onnx; m = onnx.load('model.onnx'); print(onnx.helper.printable_graph(m.graph))"

# Or use Netron (visual inspection)
pip install netron
netron model.onnx   # Opens browser with graph visualization

Check for unsupported ops before export:

import torch

# Dry-run export to surface unsupported operators early
try:
    torch.onnx.export(model, dummy_input, "test.onnx", verbose=True)
except Exception as e:
    print(f"Export failed: {e}")

Trace vs script export — affects model capture fidelity:

# Tracing: runs the model once with dummy input, captures operations
# May miss conditional branches that depend on input values
torch.onnx.export(model, dummy_input, "model.onnx")

# Scripting: statically analyzes model code, handles control flow
scripted = torch.jit.script(model)
torch.onnx.export(scripted, dummy_input, "model.onnx")

Use scripting when your model has if statements or loops that depend on input values.
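A framework-free toy shows why: tracing records only the branch taken for the example input, so the untaken branch never reaches the ONNX graph, while scripting keeps both:

```python
def forward(x):
    # Data-dependent control flow: a trace captured with a positive
    # example records only the first branch; scripting keeps both.
    if x > 0:
        return "branch A"
    return "branch B"

print(forward(1.0))    # branch A (what a trace would capture)
print(forward(-1.0))   # branch B (lost if the model was traced with x > 0)
```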

Still Not Working?

ONNX vs TensorRT vs CoreML

  • ONNX Runtime — Cross-platform, decent performance. Default for portable deployment.
  • TensorRT — NVIDIA only, highest performance via fused kernels and INT8 quantization. Use for production NVIDIA inference.
  • CoreML — Apple devices only, Metal-optimized, smallest deployment size on iOS.
  • OpenVINO — Intel CPUs and GPUs, strong for edge/embedded deployment.

Start with ONNX Runtime for broad compatibility, then specialize if you need maximum performance on a specific platform.

Model Size and Memory

ONNX files are protobuf messages, which cap out at 2GB per file, so large models (>2GB) may not load in ONNX Runtime without the external data format:

import onnx

# Save with external data — weights in separate files
onnx.save(
    model,
    "model.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model.weights",
)

PyTorch Model Export Issues

For PyTorch-specific problems during export (training mode, custom layers, gradient checkpointing conflicts), see PyTorch not working.

vLLM and ONNX

For LLM inference, vLLM typically outperforms ONNX Runtime for transformer models due to paged attention and continuous batching. ONNX is better for smaller non-LLM models or when you need cross-platform deployment. For vLLM setup, see vLLM not working.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.