Fix: DVC Not Working — Remote Push Errors, Pipeline DAG Issues, and Git Integration

FixDevs ·

Quick Answer

How to fix DVC errors — dvc push authentication failed, dvc pull file missing, pipeline stage not reproducing, cache out of disk space, dvc add vs dvc stage, conflict with git LFS, and S3/GCS remote setup.

The Error

You set up DVC and try to push to a remote — authentication fails:

$ dvc push
ERROR: failed to push data to the cloud - Authentication failed
unable to access S3

Or you clone a repo with DVC files and the data is missing:

$ dvc pull
WARNING: No remote provided
ERROR: failed to pull data from the cloud

Or a pipeline stage refuses to reproduce despite changed code:

$ dvc repro
Stage 'train' didn't change, skipping
# But you just edited train.py!

Or the cache fills up the disk:

$ dvc add data/large_file.csv
ERROR: unable to write file to cache: [Errno 28] No space left on device

Or dvc add and git add interact badly:

$ dvc add data/large_file.csv
$ git add data/large_file.csv   # Mistake — adds the large file to git
$ git push
remote: error: File large_file.csv exceeds GitHub's file size limit

DVC (Data Version Control) brings Git-like workflows to large data files and ML pipelines. Files live in a cache and remote storage; Git tracks small .dvc metadata files instead of the data itself. The model is powerful for ML reproducibility but creates specific failure modes around remote configuration, pipeline DAGs, and the dual Git/DVC interaction. This guide covers each.

Why This Happens

DVC stores actual file contents in a content-addressed cache (.dvc/cache/). Git tracks tiny .dvc pointer files (a few hundred bytes each) that contain hashes. The cache pushes to remote storage (S3, GCS, Azure, SSH). When you dvc pull, DVC reads the hashes from .dvc files and downloads matching content from the remote.
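To make the content-addressed idea concrete, here is a minimal Python sketch of how a file's MD5 hash could map to a cache path. The `files/md5/<2 hex chars>/<remaining chars>` layout is an assumption based on the DVC 3.x cache; older versions and directory entries differ. `cache_path_for` is a hypothetical helper, not part of DVC.

```python
import hashlib
from pathlib import Path


def cache_path_for(content: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return the path a content-addressed cache would use for these bytes.

    Assumed layout (DVC 3.x style): <cache>/files/md5/<2 hex chars>/<30 hex chars>.
    """
    digest = hashlib.md5(content).hexdigest()
    return Path(cache_dir) / "files" / "md5" / digest[:2] / digest[2:]


if __name__ == "__main__":
    # Identical content always maps to the same path, so it is stored only once.
    print(cache_path_for(b"hello"))
```

Because the path depends only on the bytes, two copies of the same dataset under different names occupy one cache entry.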

The pipeline DAG is defined in dvc.yaml. DVC tracks each stage’s dependencies (code files, data inputs, parameters) and caches outputs. A stage is “fresh” if all dependencies are unchanged — DVC computes hashes of every dep, not just timestamps. Confusion happens when a dep change isn’t picked up because it’s not declared in dvc.yaml.
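The freshness check can be sketched roughly as follows. This is an illustration of the idea, not DVC's implementation: `recorded` stands in for the hashes a lock file would hold, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash the file contents; timestamps play no role."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def changed_deps(deps: list[str], recorded: dict[str, str]) -> list[str]:
    """Return the declared deps whose current hash differs from the recorded one.

    A file that is not listed in `deps` is never inspected, which is why
    edits to undeclared dependencies go unnoticed.
    """
    return [d for d in deps if file_md5(Path(d)) != recorded.get(d)]
```

An empty result corresponds to "didn't change, skipping"; any entry in the list would make the stage stale.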

Fix 1: Adding Data Files

# Add a single file
dvc add data/raw/dataset.csv

# DVC creates:
# - data/raw/dataset.csv.dvc  (small metadata file — commit this to git)
# - .gitignore entry for data/raw/dataset.csv  (auto-added)
# - File moved to cache, symlinked/copied back

git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Track dataset with DVC"

Add an entire directory:

dvc add data/raw/
# Creates data/raw.dvc instead of one file per element

Common Mistake: Running git add data/raw/dataset.csv after dvc add. DVC adds the file to .gitignore automatically, but if you’ve previously committed the file or use git add -f, you bypass the ignore and push the actual data to git. Always check git status after dvc add — only the .dvc file and .gitignore change should appear staged.
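The check after dvc add can be automated with a small script. The sketch below assumes the commit should contain only .dvc pointers and .gitignore files, which holds right after dvc add but not for commits that also include code. The helper names are made up for illustration.

```python
import subprocess


def unexpected_staged(paths: list[str]) -> list[str]:
    """Return staged paths that are neither .dvc pointers nor .gitignore files."""
    return [
        p for p in paths
        if not (p.endswith(".dvc") or p.rsplit("/", 1)[-1] == ".gitignore")
    ]


def staged_paths() -> list[str]:
    """List the paths currently staged in the Git index."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for path in unexpected_staged(staged_paths()):
        print(f"staged but not a DVC pointer: {path}")
```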

Verify what’s tracked:

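dvc status                  # Show tracked data and stages that differ from their .dvc files / dvc.lock
dvc list . -R --dvc-only    # List DVC-tracked files (without --dvc-only, Git-tracked files are listed too)
git status                  # Should show only .dvc and .gitignore changes (no large files)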

Updating tracked data:

# Edit the file
echo "new row" >> data/raw/dataset.csv

# Re-add — updates the .dvc file with new hash
dvc add data/raw/dataset.csv

# Commit the updated .dvc file
git add data/raw/dataset.csv.dvc
git commit -m "Update dataset"

Fix 2: Remote Storage Setup

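# Pick ONE of the following; each defines a default remote named myremote
# (re-adding an existing name fails unless you pass -f)

# Configure S3 remote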
dvc remote add -d myremote s3://my-bucket/dvc-storage

# Configure GCS remote
dvc remote add -d myremote gs://my-bucket/dvc-storage

# Configure Azure
dvc remote add -d myremote azure://my-container/dvc-storage

# SSH remote (your own server)
dvc remote add -d myremote ssh://USER@HOST/path/to/storage   # USER and HOST are placeholders

# Local network share
dvc remote add -d myremote /mnt/shared/dvc-storage

# Google Drive (free, slow)
dvc remote add -d myremote gdrive://YOUR_FOLDER_ID

-d makes it the default remote.

Authentication for cloud remotes:

# AWS S3 — uses standard AWS credentials
# (~/.aws/credentials, AWS_ACCESS_KEY_ID env vars, IAM instance profile)
aws configure   # If not done

# Or set explicitly per remote
dvc remote modify myremote access_key_id YOUR_KEY --local
dvc remote modify myremote secret_access_key YOUR_SECRET --local
# --local stores both values in .dvc/config.local (gitignored), not in the committed .dvc/config

GCS — uses Google Cloud SDK auth:

gcloud auth application-default login

# Or service account
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json

Test connectivity:

dvc remote list
dvc status -c        # Contacts the remote and shows what differs from the local cache, without transferring data

Pro Tip: Use --local flag for credentials that shouldn’t be committed:

dvc remote modify myremote access_key_id YOUR_KEY --local

This writes to .dvc/config.local instead of .dvc/config. The .local file is gitignored by default — credentials never reach git.
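A quick guard against committing credentials is to scan the shared config for the option names used above. This is a minimal sketch: the key list covers only the two S3 options mentioned in this guide, and `committed_secrets` is a hypothetical helper.

```python
SENSITIVE_KEYS = {"access_key_id", "secret_access_key"}


def committed_secrets(config_text: str) -> list[str]:
    """Return sensitive option names found in the text of a shared DVC config."""
    found = []
    for line in config_text.splitlines():
        key, sep, _value = line.partition("=")
        if sep and key.strip() in SENSITIVE_KEYS:
            found.append(key.strip())
    return found


if __name__ == "__main__":
    # .dvc/config is the committed file; .dvc/config.local is where secrets belong.
    with open(".dvc/config") as f:
        for name in committed_secrets(f.read()):
            print(f"{name} is set in .dvc/config; move it with --local")
```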

Fix 3: Push, Pull, and Fetch

# Push tracked files to remote (after dvc add)
dvc push

# Pull tracked files from remote (after git clone)
dvc pull

# Fetch (download into the local cache without updating the workspace)
dvc fetch

# Check status
dvc status -c   # Compare local cache to remote
dvc status      # Compare workspace to local cache

Common errors:

ERROR: failed to push data to the cloud - 403 Forbidden

The credentials work for reading but not writing. Check IAM permissions on the bucket.

ERROR: Unable to find DVC file with output 'data/raw/dataset.csv'

Run dvc add first to create the .dvc file.

WARNING: No data found in cache for path 'data/raw/dataset.csv'

The cache was cleared (or you’re on a fresh clone). Run dvc pull to download from remote.

Push specific files:

dvc push data/raw/dataset.csv
dvc push -r myremote data/processed/

Pull specific files:

dvc pull data/raw/dataset.csv
dvc pull -R data/   # Only DVC-tracked targets found under data/

Fix 4: Pipeline Stages with dvc.yaml

DVC pipelines define reproducible workflows. Each stage has commands, dependencies, and outputs:

# dvc.yaml
stages:
  prepare:
    cmd: python src/prepare.py data/raw/dataset.csv data/prepared/
    deps:
      - src/prepare.py
      - data/raw/dataset.csv
    outs:
      - data/prepared/

  train:
    cmd: python src/train.py data/prepared/ models/model.pkl
    deps:
      - src/train.py
      - data/prepared/
    params:
      - train.learning_rate
      - train.epochs
    outs:
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false

  evaluate:
    cmd: python src/evaluate.py models/model.pkl data/test/ evaluation.json
    deps:
      - src/evaluate.py
      - models/model.pkl
      - data/test/
    metrics:
      - evaluation.json:
          cache: false

Run the pipeline:

dvc repro         # Run only stages whose deps have changed
dvc repro -f      # Force re-run all stages
dvc repro train   # Reproduce 'train' plus any changed stages it depends on (add --downstream to go the other way)
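The stage graph is not written down anywhere; it follows from matching one stage's outs to another stage's deps. The sketch below shows that derivation on a trimmed copy of the stages above. It matches paths exactly and assumes the graph has no cycles, both simplifications compared with DVC itself, and the function names are illustrative.

```python
def producers(stages: dict) -> dict:
    """Map each declared output path to the stage that produces it."""
    return {out: name for name, s in stages.items() for out in s.get("outs", [])}


def upstream(stages: dict, target: str) -> list[str]:
    """Return `target` plus every stage it depends on, in execution order."""
    made_by = producers(stages)
    order: list[str] = []

    def visit(name: str) -> None:
        if name in order:
            return
        for dep in stages[name].get("deps", []):
            if dep in made_by:  # this dep is another stage's output
                visit(made_by[dep])
        order.append(name)

    visit(target)
    return order


# Trimmed copy of the pipeline: only deps and outs matter for the graph.
STAGES = {
    "prepare": {"deps": ["src/prepare.py", "data/raw/dataset.csv"], "outs": ["data/prepared/"]},
    "train": {"deps": ["src/train.py", "data/prepared/"], "outs": ["models/model.pkl"]},
    "evaluate": {"deps": ["src/evaluate.py", "models/model.pkl", "data/test/"], "outs": []},
}

if __name__ == "__main__":
    print(upstream(STAGES, "train"))
```

Asking for train pulls in prepare because train consumes data/prepared/; evaluate is not included because nothing train needs comes from it.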

Stage doesn’t re-run when expected:

$ dvc repro
Stage 'train' didn't change, skipping

This means DVC computed the same hash for all dependencies. Common causes:

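  1. Dependency missing from dvc.yaml: you changed a file DVC isn't tracking as a dep
  2. The edit wasn't to cmd or a declared dep: DVC records each stage's cmd in dvc.lock and re-runs when it changes, so check whether the change landed somewhere DVC doesn't watch (an imported module, an environment variable)
  3. Cached output is reused: even if you delete the output, DVC restores it from cache instead of re-running the stage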

Force re-run with -f:

dvc repro -f train   # Re-run regardless of cache state

Common Mistake: Forgetting to add a code file as a dependency. If prepare.py imports utils.py and you change utils.py, DVC won’t notice — utils.py isn’t in deps. Add it explicitly:

prepare:
  cmd: python src/prepare.py data/raw/ data/prepared/
  deps:
    - src/prepare.py
    - src/utils.py           # Add this
    - data/raw/
  outs:
    - data/prepared/

Or use deps: - src/ to track the whole directory (broader but safer).
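Undeclared imports can be spotted mechanically. The following sketch parses a script with Python's ast module and reports sibling .py files it imports that are missing from a stage's deps. It only handles plain absolute imports of single-file modules next to the script (no packages, no relative imports), and the helper names are hypothetical.

```python
import ast
from pathlib import Path


def local_imports(script: Path) -> set[Path]:
    """Return sibling .py files that `script` imports by plain module name."""
    tree = ast.parse(script.read_text())
    names: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    candidates = (script.parent / f"{name}.py" for name in names)
    return {path for path in candidates if path.exists()}


def undeclared(script: Path, deps: list[str]) -> set[Path]:
    """Return locally imported files that are not covered by the stage's deps."""
    declared = {Path(d) for d in deps}
    return {
        path for path in local_imports(script)
        if path not in declared
        and not any(parent in declared for parent in path.parents)
    }
```

Calling `undeclared(Path("src/prepare.py"), ["src/prepare.py", "data/raw/"])` from the repository root would flag src/utils.py in the scenario above, while listing src/ as a dep covers it.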

Fix 5: Parameters and Experiments

# params.yaml
train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 32

prepare:
  test_split: 0.2
  random_seed: 42

Reference in dvc.yaml:

stages:
  train:
    cmd: python src/train.py
    deps: [src/train.py, data/prepared/]
    params: [train.learning_rate, train.epochs, train.batch_size]
    outs: [models/model.pkl]

Reading params in Python:

# src/train.py
import yaml

with open("params.yaml") as f:
    params = yaml.safe_load(f)["train"]

lr = params["learning_rate"]
epochs = params["epochs"]

Run experiments with different params:

# Override a param for one experiment
dvc exp run -S train.learning_rate=0.01 -S train.epochs=100

# Run a sweep: queue one experiment per value, then execute the queue
dvc exp run --queue -S 'train.learning_rate=1e-4,1e-3,1e-2'
dvc exp run --run-all

# List experiments
dvc exp show

# Promote the best one to a git branch
dvc exp branch exp-abc123 best-lr

dvc exp show displays a table comparing experiments by metric — your model selection dashboard.
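Conceptually, each -S flag rewrites one entry of params.yaml before the pipeline runs. The sketch below mimics that on an in-memory dict. It is only an approximation: DVC uses its own override grammar, whereas this version parses values with `ast.literal_eval` and falls back to plain strings. `apply_override` is a hypothetical name.

```python
import ast
import copy


def parse_value(raw: str):
    """Interpret numbers and other Python literals; keep anything else as text."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw


def apply_override(params: dict, assignment: str) -> dict:
    """Return a copy of `params` with one `section.key=value` override applied."""
    dotted, raw = assignment.split("=", 1)
    *parents, leaf = dotted.split(".")
    updated = copy.deepcopy(params)
    node = updated
    for key in parents:
        node = node[key]  # raises KeyError if the section does not exist
    node[leaf] = parse_value(raw)
    return updated


if __name__ == "__main__":
    base = {"train": {"learning_rate": 0.001, "epochs": 50, "batch_size": 32}}
    print(apply_override(base, "train.learning_rate=0.01"))
```

The original dict is left untouched, which mirrors how an experiment run does not alter the parameters committed on your branch.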

Fix 6: Avoiding Cache Bloat

$ dvc add data/large_file.csv
ERROR: unable to write file to cache: No space left on device

DVC’s cache grows with every version of every file. Large datasets multiply quickly.

Clean unused cache entries:

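dvc gc -w                        # Keep only data referenced by the current workspace
dvc gc -w -a -T                  # Also keep data referenced by any branch or tag
dvc gc -w -a -T -c -r myremote   # Apply the same cleanup to remote storage as well

-w = current workspace; -a = all branches; -T = all tags; -c = also clean remote storage (-r selects which remote).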
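Before and after garbage collection it helps to know how much space the cache actually takes. Here is a small standard-library sketch that sums file sizes under the default cache location; adjust the path if you have moved the cache.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under `root` (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


if __name__ == "__main__":
    size_gib = dir_size_bytes(".dvc/cache") / 2**30
    print(f".dvc/cache uses {size_gib:.2f} GiB")
```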

Configure cache location if disk is full:

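# Point DVC at a cache directory on a larger disk
# (this only changes the configured location; existing cache contents are not moved for you)
dvc cache dir /mnt/large-disk/dvc-cache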

Shared cache for teams on the same machine:

dvc cache dir --global /shared/dvc-cache
dvc config cache.shared group --global

Sets the cache to a shared directory with group-writeable permissions.

Configure symlinks instead of copies (saves disk for large files):

dvc config cache.type symlink,hardlink,copy

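DVC tries symlink first, then hardlink, then copy. Both link types avoid duplicating data, but they fail differently: hardlinks only work when the cache and workspace are on the same filesystem, and symlinks break if the cache directory is moved or deleted. Linked files are read-only, so run dvc unprotect on a file before editing it in place.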
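The difference between the three link types is easy to see with plain operating-system calls. This sketch is not DVC code; it only illustrates what each strategy does on disk, using hypothetical helper names. Symlink creation may need extra privileges on Windows.

```python
import os
import shutil


def link_from_cache(cached: str, workspace: str, kind: str) -> None:
    """Place `cached` into the workspace as a symlink, hardlink, or copy."""
    if kind == "symlink":
        os.symlink(os.path.abspath(cached), workspace)
    elif kind == "hardlink":
        os.link(cached, workspace)  # raises OSError across filesystems
    elif kind == "copy":
        shutil.copyfile(cached, workspace)
    else:
        raise ValueError(f"unknown link type: {kind}")


def shares_storage(cached: str, workspace: str) -> bool:
    """True when both paths resolve to the same file on disk (no extra space used)."""
    return os.path.samefile(cached, workspace)
```

Only the copy strategy produces a second, independent file, which is why it is the safe fallback and also the one that doubles disk usage.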

Fix 7: Git Integration and .dvc Files

A .dvc file looks like this:

# data/raw/dataset.csv.dvc
outs:
- md5: a1b2c3d4e5f6789abc...
  size: 1234567
  path: dataset.csv

Small enough to commit to Git. The hash tells DVC which content to fetch from the cache or remote.

.gitignore rules DVC creates:

# .gitignore (created automatically by dvc add)
/data/raw/dataset.csv

Common Mistake: Editing the .dvc file by hand. The hash must match the file content exactly. If you edit .dvc to point at a different hash, DVC can’t find the content and dvc checkout fails. Always re-run dvc add to update the hash.
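To check whether a pointer and its data still agree, you can compare hashes yourself. The sketch below assumes the recorded value is the plain MD5 of the file's bytes, which is an assumption about single-file entries in recent DVC versions; directory entries are recorded differently and are not handled. The function names are illustrative.

```python
import hashlib
import re
from pathlib import Path


def recorded_md5(dvc_file_text: str) -> str:
    """Extract the first `md5:` value from the text of a .dvc file."""
    match = re.search(r"^\s*-?\s*md5:\s*([0-9a-f]+)", dvc_file_text, re.MULTILINE)
    if match is None:
        raise ValueError("no md5 entry found")
    return match.group(1)


def matches_pointer(data_path: Path, dvc_file_text: str) -> bool:
    """True when the file's current MD5 equals the hash stored in the pointer."""
    actual = hashlib.md5(data_path.read_bytes()).hexdigest()
    return actual == recorded_md5(dvc_file_text)
```

A False result means either the data changed without a re-run of dvc add, or the pointer was edited by hand.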

Restoring an old version:

# Check out an old git commit
git checkout abc123

# DVC files now point at old data — sync workspace
dvc checkout
# Or pull from remote if not in local cache
dvc pull

Common workflow for data updates:

# 1. Update the data
python prepare_new_data.py > data/raw/dataset.csv

# 2. Re-add to DVC (updates the .dvc hash)
dvc add data/raw/dataset.csv

# 3. Commit the .dvc file
git add data/raw/dataset.csv.dvc
git commit -m "Update dataset to v2"

# 4. Push to DVC remote
dvc push

# 5. Push to Git
git push

Fix 8: CI/CD with DVC

# .github/workflows/train.yml
name: Train Model

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - run: pip install dvc[s3]
      - run: pip install -r requirements.txt

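      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Run pipeline
        run: dvc repro

      - name: Push artifacts
        if: success()
        # Step-level env is not inherited from earlier steps, so repeat the credentials here
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc push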

      - name: Show metrics
        run: dvc metrics show

dvc[s3] — install the S3 storage backend. Use dvc[gs] for GCS, dvc[azure] for Azure, dvc[ssh] for SSH.

Cache the DVC cache between runs:

- uses: actions/cache@v4
  with:
    path: .dvc/cache
    key: dvc-${{ hashFiles('**/*.dvc', 'dvc.lock') }}

Saves time on repeated runs that share data.

Still Not Working?

DVC vs Git LFS

  • DVC — Designed for ML workflows: pipelines, experiments, metrics, parameters. Works with any storage (S3, GCS, SSH).
  • Git LFS — Generic large file storage in Git. Simpler model, fewer features. Tied to your Git hosting provider.

For ML projects, DVC’s pipeline/experiment tracking is the key differentiator. For binary asset versioning in a non-ML project, Git LFS is simpler.

dvc.lock File

DVC has used dvc.lock (similar to package-lock.json in npm) since version 1.0 to record the exact hashes of each stage's dependencies and outputs. Commit this file:

git add dvc.lock dvc.yaml
git commit -m "Update pipeline"

Without dvc.lock, collaborators may get different pipeline results from the same code. Always commit it.

Integration with MLflow / Weights & Biases

DVC handles data and pipelines; MLflow/W&B handle experiment metadata. They complement:

  • Use DVC for data versioning and reproducible pipelines
  • Use MLflow for model registry and experiment metric tracking
  • Use both — log to MLflow inside DVC pipeline stages

For MLflow-specific patterns, see MLflow not working. For W&B integration, see Weights & Biases not working.

Studio and CML for Collaboration

DVC Studio (web UI) and CML (continuous machine learning for CI/CD comments) are separate paid/free products from Iterative. Useful for team workflows but not required for basic DVC use.

Working with Pandas / Polars

DVC stores file contents, but reading them depends on your code. For pandas DataFrame operations on DVC-tracked files, see pandas SettingWithCopyWarning. For Polars patterns that work cleanly with DVC pipelines, see Polars not working.

Git Performance with Many .dvc Files

A repo with thousands of .dvc files has the same git performance characteristics as a repo with thousands of text files. Use dvc add on directories instead of individual files when possible to reduce .dvc file count.

For git-specific issues that affect DVC’s workflow, see git fatal not a git repository.

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
