Skip to content

Fix: MySQL Replication Not Working — Replica Lag, Stopped Replication, or GTID Errors

FixDevs · (Updated: )

Part of:  Docker, DevOps & Infrastructure

Quick Answer

How to fix MySQL replication issues — SHOW REPLICA STATUS errors, relay log corruption, GTID configuration, replication lag, skipping errors, and replica promotion.

The Problem

MySQL replication has stopped with an error:

SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: No
-- Last_SQL_Error: Error 'Duplicate entry '42' for key 'PRIMARY'' on query.
--   Default database: 'myapp'. Query: 'INSERT INTO users ...'

Or the replica is lagging behind the primary:

SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 3842  -- Over an hour behind

Or GTID replication fails after restoring a backup:

Got fatal error 1236 from source when reading data from binary log:
'Cannot replicate because the source purged required binary logs.
Replicate the missing transactions from elsewhere, or provision a new replica
from backup. Consider increasing the source's binary log expiration period.'

Why This Happens

MySQL replication works by replaying binary log events from the primary on the replica. It breaks when:

  • Replication error on the replica — the most common cause: a statement that succeeded on the primary fails on the replica (duplicate key, row not found, constraint violation). The SQL thread stops until the error is resolved.
  • Replica lag — the replica can’t apply binary log events as fast as the primary generates them. Typically caused by slow queries, resource constraints, or a single-threaded replica applying multi-threaded primary writes.
  • GTID gaps — when using GTID replication, the replica tracks which transactions it has applied. If GTIDs are missing (from a restore that didn’t preserve GTID state), replication can’t figure out where to resume.
  • Binary log expired — if the primary purges binary logs before the replica reads them (due to replica downtime or binlog_expire_logs_seconds too low), the replica loses its position.

Replication is asynchronous by default, which means committed transactions on the primary are not guaranteed to be on any replica at the moment the client gets its OK response. Two threads on the replica do the work: the IO thread connects to the primary, reads binary log events, and writes them to the local relay log; the SQL thread (or workers, under parallel replication) reads from the relay log and applies the events to local tables. The two threads can fail independently. An IO thread failure usually means a network problem, an authentication issue, or a missing binary log on the source. An SQL thread failure means the event was received but cannot be applied — almost always because the data on the replica diverged from what the primary expected.

The diverged-data problem is so common that experienced operators treat any replica as potentially inconsistent until verified. Direct writes to a replica (because read_only was off), a botched SKIP_COUNTER, or a pre-existing row created by a restore script are all enough to produce a duplicate-key error two days later when the matching INSERT finally arrives over replication. The fix is rarely “skip and forget” — it is “skip, then run pt-table-checksum and pt-table-sync to confirm the rest of the table still matches.” Otherwise you accumulate silent drift that a future query will reveal as a hard-to-debug logical inconsistency.

How Other Tools Handle This

MySQL is not the only database that ships replication, and the way each engine handles ordering, durability, and failover shapes how you debug a stuck replica.

MySQL replication vs PostgreSQL streaming replication. PostgreSQL uses physical WAL streaming by default: the primary writes Write-Ahead Log records and ships them byte-for-byte to standbys. There is no “SQL thread on the replica” — the standby simply applies block-level changes, so duplicate-key errors of the MySQL type cannot happen. The downside is that physical replication requires identical major versions and identical binaries. PostgreSQL 10+ added logical replication, which is closer in spirit to MySQL’s row-based binlog: it ships a stream of inserts, updates, and deletes per table, allows cross-version replication, and can also fail with conflicts if the subscriber’s data diverges.

MySQL GTID vs PostgreSQL LSN. GTIDs (Global Transaction IDs) name each transaction by source UUID and sequence. After a failover, a replica can resume from any primary by asking “what transactions do you have that I don’t?” PostgreSQL identifies positions by Log Sequence Number (LSN), a monotonically increasing byte offset into the WAL. LSNs are simpler but tied to a specific timeline — promoting a new primary creates a new timeline, and old replicas must be re-cloned or rewound with pg_rewind to follow it. The mental model is different: GTIDs are “what” was applied; LSNs are “where” we are in the log.

MongoDB replica sets vs MySQL primary-replica. MongoDB uses a single OpLog (operations log) per replica set member, with automatic primary election via a Raft-like protocol. There is no manual CHANGE MASTER TO after a failover — the cluster elects a new primary automatically when the old one is unreachable, and clients with a replica-set-aware driver discover the new primary through the cluster’s hello/isMaster response. MongoDB’s equivalent of “replica lag” is replSetGetStatus.members[].optimeDate, and like MySQL it can stall on conflicts (OplogTruncation, RollbackId mismatches).

Cassandra eventual consistency vs MySQL replication. Cassandra abandons the primary-replica model entirely. Every node accepts writes, and consistency is tunable per query (ONE, QUORUM, ALL). Replication is via Hinted Handoff (a node holds writes for an unreachable peer and replays them when it returns) plus anti-entropy repair (nodetool repair). There is no “replication stopped” alert — instead you alert on hint backlog size, read repair latency, and the time since the last successful repair. Operationally, Cassandra trades MySQL’s deterministic stop-on-error model for a system that always accepts writes but may return stale reads.

Failover patterns. MySQL relies on external orchestrators (MHA, Orchestrator, MaxScale, ProxySQL, Vitess, or cloud-managed RDS/Aurora) to detect primary failure, pick a replica with the most advanced GTID set, and reconfigure routing. PostgreSQL has Patroni, repmgr, and Stolon for the same role. MongoDB and Cassandra handle this in-cluster. The MySQL-specific gotcha is that without semi-synchronous replication (rpl_semi_sync_source_enabled), a failover can lose committed transactions that never replicated, so always confirm whether your setup uses semi-sync, lossless semi-sync (MySQL 5.7.2+), or Group Replication before assuming a failover is safe.

Fix 1: Diagnose and Resolve SQL Thread Errors

Start with SHOW REPLICA STATUS\G to identify the error:

-- Run on the replica
SHOW REPLICA STATUS\G

-- Key fields to check:
-- Replica_IO_Running: Yes/No  — is the IO thread receiving binlogs?
-- Replica_SQL_Running: Yes/No — is the SQL thread applying them?
-- Last_IO_Error: error message from IO thread
-- Last_SQL_Error: error message from SQL thread
-- Seconds_Behind_Source: replication lag in seconds
-- Exec_Source_Log_File / Exec_Source_Log_Pos: current position being applied

Fix duplicate key errors (most common):

-- Option 1: Skip the offending transaction (use with caution)
-- For GTID replication:
SET GTID_NEXT = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
BEGIN; COMMIT;  -- Empty transaction to "skip" the GTID
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;

-- For non-GTID replication (legacy):
SET GLOBAL SQL_REPLICA_SKIP_COUNTER = 1;  -- Skip 1 event
START REPLICA;

-- Option 2: Fix the data on the replica, then restart
-- If the row is missing on replica but exists on primary:
INSERT INTO users (id, email) VALUES (42, '[email protected]');
START REPLICA;

-- Option 3: Use pt-slave-repair (Percona Toolkit) for data discrepancies
pt-table-sync --replicate myapp.checksums h=replica,u=root,p=pass \
  --execute --print

Warning: Skipping transactions can cause data drift between primary and replica. After fixing, run pt-table-checksum to verify data consistency.

Fix 2: Reduce Replication Lag

Enable parallel replication (MySQL 5.7+):

-- On the replica — enable multi-threaded replication
-- Check current setting:
SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';

-- Enable parallel workers (use 4-8, matching primary CPU cores)
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
-- LOGICAL_CLOCK: applies transactions that overlapped on primary in parallel
-- DATABASE: applies transactions for different databases in parallel (simpler, less effective)
# my.cnf — persist the settings
[mysqld]
replica_parallel_workers = 8
replica_parallel_type = LOGICAL_CLOCK
replica_preserve_commit_order = ON  # Ensures replica commits in same order as primary

Enable binary log group commit on the primary (reduces sync overhead):

# my.cnf on primary
[mysqld]
binlog_group_commit_sync_delay = 1000    # Wait 1ms to group more transactions
binlog_group_commit_sync_no_delay_count = 100  # Or until 100 transactions ready

Check what’s causing lag:

-- See what the SQL thread is applying
SHOW PROCESSLIST;
-- Look for "System lock" or slow queries in the replica SQL thread

-- Check replica SQL thread wait states
SELECT * FROM performance_schema.events_waits_current
WHERE THREAD_ID = (
  SELECT THREAD_ID FROM performance_schema.threads
  WHERE NAME = 'thread/sql/replica_sql'
);

Fix 3: Fix GTID Replication Issues

Check GTID state:

-- On primary
SHOW GLOBAL VARIABLES LIKE 'gtid_mode';
SHOW MASTER STATUS;
-- Shows Executed_Gtid_Set: source UUID:1-12345

-- On replica
SHOW REPLICA STATUS\G
-- Executed_Gtid_Set: what replica has applied
-- Retrieved_Gtid_Set: what replica has received but not yet applied
-- Auto_Position: 1 (using GTID auto-positioning)

Fix GTID gaps after a restore:

-- If replica is missing transactions that primary has already purged:

-- Option 1: Provision a new replica from a fresh backup
-- This is the safest approach

-- Option 2: Inject empty transactions to fill the gaps
-- Find the missing GTIDs:
-- Primary has: source_uuid:1-1000
-- Replica has: source_uuid:1-800, source_uuid:850-1000
-- Missing: source_uuid:801-849

-- Inject empty transactions on the replica:
STOP REPLICA;
SET GTID_NEXT = 'source_uuid:801';
BEGIN; COMMIT;
SET GTID_NEXT = 'source_uuid:802';
BEGIN; COMMIT;
-- ... repeat for each missing GTID
-- Or use a loop in a shell script for large ranges:
-- for i in $(seq 801 849); do
--   mysql -e "SET GTID_NEXT='source_uuid:$i'; BEGIN; COMMIT;"
-- done
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;

Set up GTID replication correctly:

# my.cnf — both primary and replica
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_replica_updates = ON  # Required: replica also writes to its own binlog
-- On the replica — connect using GTID auto-positioning
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'replication_user',
  SOURCE_PASSWORD = 'replication_password',
  SOURCE_AUTO_POSITION = 1;  -- GTID auto-positioning
START REPLICA;

Fix 4: Rebuild a Replica from Scratch

When replication is too far out of sync, rebuilding is faster than catching up:

# Method 1: Using mysqldump with GTID state
# On primary:
mysqldump \
  --single-transaction \
  --master-data=2 \
  --set-gtid-purged=ON \
  --all-databases \
  -u root -p > full_backup.sql

# Transfer to replica and restore:
mysql -u root -p < full_backup.sql

# On replica — check GTID_PURGED was set:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid_purged';"

# Connect to primary and start replication:
mysql -e "
  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST='primary-host',
    SOURCE_USER='repl',
    SOURCE_PASSWORD='password',
    SOURCE_AUTO_POSITION=1;
  START REPLICA;
"
# Method 2: Using Percona XtraBackup (faster for large databases)
# On primary:
xtrabackup --backup --target-dir=/backup/full \
  --user=root --password=pass

xtrabackup --prepare --target-dir=/backup/full

# Transfer to replica:
rsync -avz /backup/full/ replica:/var/lib/mysql/

# On replica:
xtrabackup --move-back --target-dir=/backup/full
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql

# Get the GTID position from backup:
cat /backup/full/xtrabackup_binlog_info
# mysql-bin.000042  1234567  source_uuid:1-54321

mysql -e "
  RESET REPLICA ALL;
  SET GLOBAL gtid_purged='source_uuid:1-54321';
  CHANGE REPLICATION SOURCE TO SOURCE_HOST='primary', SOURCE_AUTO_POSITION=1;
  START REPLICA;
"

Fix 5: Monitor Replication Health

Set up proactive monitoring to catch issues before they become critical:

-- Check replication health (run regularly)
SELECT
  SERVICE_STATE AS io_state,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE,
  LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

SELECT
  SERVICE_STATE AS sql_state,
  LAST_ERROR_NUMBER,
  LAST_ERROR_MESSAGE,
  LAST_ERROR_TIMESTAMP,
  LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;

-- Lag monitoring
SELECT
  CHANNEL_NAME,
  SERVICE_STATE,
  LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
  LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
  TIMESTAMPDIFF(
    SECOND,
    LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
    NOW()
  ) AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker;

Shell script for monitoring:

#!/bin/bash
# replication_check.sh — alert if replication is broken or lagging

MYSQL="mysql -u monitor -ppassword -h replica-host"
MAX_LAG=30  # Alert if lag > 30 seconds

STATUS=$($MYSQL -e "SHOW REPLICA STATUS\G" 2>/dev/null)

IO_RUNNING=$(echo "$STATUS" | grep "Replica_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$STATUS" | grep "Replica_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$STATUS" | grep "Seconds_Behind_Source:" | awk '{print $2}')
ERROR=$(echo "$STATUS" | grep "Last_SQL_Error:" | cut -d: -f2-)

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
  echo "CRITICAL: Replication stopped! Error:$ERROR"
  # Send alert here
elif [ "$LAG" != "0" ] && [ "$LAG" -gt "$MAX_LAG" ]; then
  echo "WARNING: Replication lag ${LAG}s exceeds threshold of ${MAX_LAG}s"
fi

Fix 6: Replication User and Permissions

A misconfigured replication user is a common setup mistake:

-- On primary — create replication user
CREATE USER 'repl'@'replica-ip' IDENTIFIED WITH mysql_native_password BY 'strong-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica-ip';
FLUSH PRIVILEGES;

-- Verify the grant
SHOW GRANTS FOR 'repl'@'replica-ip';

-- On replica — connect
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '10.0.0.1',
  SOURCE_PORT = 3306,
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'strong-password',
  SOURCE_AUTO_POSITION = 1;

-- For SSL/TLS replication (recommended for production)
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary-host',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = 'password',
  SOURCE_AUTO_POSITION = 1,
  SOURCE_SSL = 1,
  SOURCE_SSL_CA = '/etc/mysql/ssl/ca.pem',
  SOURCE_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
  SOURCE_SSL_KEY = '/etc/mysql/ssl/client-key.pem';

Still Not Working?

Binary log format mismatch — if the primary uses binlog_format=STATEMENT and the replica has issues executing non-deterministic queries (NOW(), RAND()), switch the primary to ROW format. ROW format replicates the actual row changes, not the SQL statements, and is more reliable:

-- Check current format
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Change to ROW (requires replica restart to take effect)
SET GLOBAL binlog_format = 'ROW';

read_only not set on replica — if the replica doesn’t have read_only = ON, application writes directly to the replica cause data drift and replication conflicts. Always set read_only = ON and super_read_only = ON on replicas:

[mysqld]
read_only = ON
super_read_only = ON  # Prevents even SUPER privilege users from writing

Semi-synchronous replication timeouts — if rpl_semi_sync_source_timeout is reached and the primary falls back to asynchronous replication, the replica may have missed committed transactions. Check Rpl_semi_sync_source_no_tx counter.

Replica clock drift breaks Seconds_Behind_Source reporting — that field is computed by subtracting the replica’s local time from the timestamp baked into the binlog event. If the replica’s clock is behind the primary’s, Seconds_Behind_Source reads as a large positive number even when the replica is current. If it is ahead, the field can show negative values or zero on a lagging replica. Run NTP/chrony on both hosts and prefer the LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP columns in performance_schema.replication_applier_status_by_worker for accurate lag.

auto.cnf UUID collision after VM cloning — if you cloned the data directory from another MySQL instance, both servers have the same server_uuid in /var/lib/mysql/auto.cnf. Replication refuses to start with Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs. Stop MySQL on the clone, delete auto.cnf, and restart — a fresh UUID is generated.

relay_log_recovery left off after a crash — if the replica’s relay logs are corrupted (after a power loss, for example), restarting with relay_log_recovery=ON (set in my.cnf) tells MySQL to discard the relay logs and re-fetch from the primary on startup. Without it, the SQL thread errors out on the corrupted event and refuses to advance.

For related MySQL issues, see Fix: MySQL Deadlock Detected, Fix: MySQL Too Many Connections, Fix: MySQL Slow Query Optimization, and Fix: MySQL Lock Wait Timeout Exceeded.

F

FixDevs

Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.

Was this article helpful?

Related Articles