Fix: MySQL Replication Not Working — Replica Lag, Stopped Replication, or GTID Errors
Part of: Docker, DevOps & Infrastructure
Quick Answer
How to fix MySQL replication issues — SHOW REPLICA STATUS errors, relay log corruption, GTID configuration, replication lag, skipping errors, and replica promotion.
The Problem
MySQL replication has stopped with an error:
SHOW REPLICA STATUS\G
-- Replica_IO_Running: Yes
-- Replica_SQL_Running: No
-- Last_SQL_Error: Error 'Duplicate entry '42' for key 'PRIMARY'' on query.
-- Default database: 'myapp'. Query: 'INSERT INTO users ...'Or the replica is lagging behind the primary:
SHOW REPLICA STATUS\G
-- Seconds_Behind_Source: 3842 -- Over an hour behindOr GTID replication fails after restoring a backup:
Got fatal error 1236 from source when reading data from binary log:
'Cannot replicate because the source purged required binary logs.
Replicate the missing transactions from elsewhere, or provision a new replica
from backup. Consider increasing the source's binary log expiration period.'Why This Happens
MySQL replication works by replaying binary log events from the primary on the replica. It breaks when:
- Replication error on the replica — the most common cause: a statement that succeeded on the primary fails on the replica (duplicate key, row not found, constraint violation). The SQL thread stops until the error is resolved.
- Replica lag — the replica can’t apply binary log events as fast as the primary generates them. Typically caused by slow queries, resource constraints, or a single-threaded replica applying multi-threaded primary writes.
- GTID gaps — when using GTID replication, the replica tracks which transactions it has applied. If GTIDs are missing (from a restore that didn’t preserve GTID state), replication can’t figure out where to resume.
- Binary log expired — if the primary purges binary logs before the replica reads them (due to replica downtime or
binlog_expire_logs_secondstoo low), the replica loses its position.
Replication is asynchronous by default, which means committed transactions on the primary are not guaranteed to be on any replica at the moment the client gets its OK response. Two threads on the replica do the work: the IO thread connects to the primary, reads binary log events, and writes them to the local relay log; the SQL thread (or workers, under parallel replication) reads from the relay log and applies the events to local tables. The two threads can fail independently. An IO thread failure usually means a network problem, an authentication issue, or a missing binary log on the source. An SQL thread failure means the event was received but cannot be applied — almost always because the data on the replica diverged from what the primary expected.
The diverged-data problem is so common that experienced operators treat any replica as potentially inconsistent until verified. Direct writes to a replica (because read_only was off), a botched SKIP_COUNTER, or a pre-existing row created by a restore script are all enough to produce a duplicate-key error two days later when the matching INSERT finally arrives over replication. The fix is rarely “skip and forget” — it is “skip, then run pt-table-checksum and pt-table-sync to confirm the rest of the table still matches.” Otherwise you accumulate silent drift that a future query will reveal as a hard-to-debug logical inconsistency.
How Other Tools Handle This
MySQL is not the only database that ships replication, and the way each engine handles ordering, durability, and failover shapes how you debug a stuck replica.
MySQL replication vs PostgreSQL streaming replication. PostgreSQL uses physical WAL streaming by default: the primary writes Write-Ahead Log records and ships them byte-for-byte to standbys. There is no “SQL thread on the replica” — the standby simply applies block-level changes, so duplicate-key errors of the MySQL type cannot happen. The downside is that physical replication requires identical major versions and identical binaries. PostgreSQL 10+ added logical replication, which is closer in spirit to MySQL’s row-based binlog: it ships a stream of inserts, updates, and deletes per table, allows cross-version replication, and can also fail with conflicts if the subscriber’s data diverges.
MySQL GTID vs PostgreSQL LSN. GTIDs (Global Transaction IDs) name each transaction by source UUID and sequence. After a failover, a replica can resume from any primary by asking “what transactions do you have that I don’t?” PostgreSQL identifies positions by Log Sequence Number (LSN), a monotonically increasing byte offset into the WAL. LSNs are simpler but tied to a specific timeline — promoting a new primary creates a new timeline, and old replicas must be re-cloned or rewound with pg_rewind to follow it. The mental model is different: GTIDs are “what” was applied; LSNs are “where” we are in the log.
MongoDB replica sets vs MySQL primary-replica. MongoDB uses a single OpLog (operations log) per replica set member, with automatic primary election via a Raft-like protocol. There is no manual CHANGE MASTER TO after a failover — the cluster elects a new primary automatically when the old one is unreachable, and clients with a replica-set-aware driver discover the new primary through the cluster’s hello/isMaster response. MongoDB’s equivalent of “replica lag” is replSetGetStatus.members[].optimeDate, and like MySQL it can stall on conflicts (OplogTruncation, RollbackId mismatches).
Cassandra eventual consistency vs MySQL replication. Cassandra abandons the primary-replica model entirely. Every node accepts writes, and consistency is tunable per query (ONE, QUORUM, ALL). Replication is via Hinted Handoff (a node holds writes for an unreachable peer and replays them when it returns) plus anti-entropy repair (nodetool repair). There is no “replication stopped” alert — instead you alert on hint backlog size, read repair latency, and the time since the last successful repair. Operationally, Cassandra trades MySQL’s deterministic stop-on-error model for a system that always accepts writes but may return stale reads.
Failover patterns. MySQL relies on external orchestrators (MHA, Orchestrator, MaxScale, ProxySQL, Vitess, or cloud-managed RDS/Aurora) to detect primary failure, pick a replica with the most advanced GTID set, and reconfigure routing. PostgreSQL has Patroni, repmgr, and Stolon for the same role. MongoDB and Cassandra handle this in-cluster. The MySQL-specific gotcha is that without semi-synchronous replication (rpl_semi_sync_source_enabled), a failover can lose committed transactions that never replicated, so always confirm whether your setup uses semi-sync, lossless semi-sync (MySQL 5.7.2+), or Group Replication before assuming a failover is safe.
Fix 1: Diagnose and Resolve SQL Thread Errors
Start with SHOW REPLICA STATUS\G to identify the error:
-- Run on the replica
SHOW REPLICA STATUS\G
-- Key fields to check:
-- Replica_IO_Running: Yes/No — is the IO thread receiving binlogs?
-- Replica_SQL_Running: Yes/No — is the SQL thread applying them?
-- Last_IO_Error: error message from IO thread
-- Last_SQL_Error: error message from SQL thread
-- Seconds_Behind_Source: replication lag in seconds
-- Exec_Source_Log_File / Exec_Source_Log_Pos: current position being appliedFix duplicate key errors (most common):
-- Option 1: Skip the offending transaction (use with caution)
-- For GTID replication:
SET GTID_NEXT = 'xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx:N';
BEGIN; COMMIT; -- Empty transaction to "skip" the GTID
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;
-- For non-GTID replication (legacy):
SET GLOBAL SQL_REPLICA_SKIP_COUNTER = 1; -- Skip 1 event
START REPLICA;
-- Option 2: Fix the data on the replica, then restart
-- If the row is missing on replica but exists on primary:
INSERT INTO users (id, email) VALUES (42, '[email protected]');
START REPLICA;
-- Option 3: Use pt-slave-repair (Percona Toolkit) for data discrepancies
pt-table-sync --replicate myapp.checksums h=replica,u=root,p=pass \
--execute --printWarning: Skipping transactions can cause data drift between primary and replica. After fixing, run
pt-table-checksumto verify data consistency.
Fix 2: Reduce Replication Lag
Enable parallel replication (MySQL 5.7+):
-- On the replica — enable multi-threaded replication
-- Check current setting:
SHOW GLOBAL VARIABLES LIKE 'replica_parallel_workers';
-- Enable parallel workers (use 4-8, matching primary CPU cores)
SET GLOBAL replica_parallel_workers = 8;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
-- LOGICAL_CLOCK: applies transactions that overlapped on primary in parallel
-- DATABASE: applies transactions for different databases in parallel (simpler, less effective)# my.cnf — persist the settings
[mysqld]
replica_parallel_workers = 8
replica_parallel_type = LOGICAL_CLOCK
replica_preserve_commit_order = ON # Ensures replica commits in same order as primaryEnable binary log group commit on the primary (reduces sync overhead):
# my.cnf on primary
[mysqld]
binlog_group_commit_sync_delay = 1000 # Wait 1ms to group more transactions
binlog_group_commit_sync_no_delay_count = 100 # Or until 100 transactions readyCheck what’s causing lag:
-- See what the SQL thread is applying
SHOW PROCESSLIST;
-- Look for "System lock" or slow queries in the replica SQL thread
-- Check replica SQL thread wait states
SELECT * FROM performance_schema.events_waits_current
WHERE THREAD_ID = (
SELECT THREAD_ID FROM performance_schema.threads
WHERE NAME = 'thread/sql/replica_sql'
);Fix 3: Fix GTID Replication Issues
Check GTID state:
-- On primary
SHOW GLOBAL VARIABLES LIKE 'gtid_mode';
SHOW MASTER STATUS;
-- Shows Executed_Gtid_Set: source UUID:1-12345
-- On replica
SHOW REPLICA STATUS\G
-- Executed_Gtid_Set: what replica has applied
-- Retrieved_Gtid_Set: what replica has received but not yet applied
-- Auto_Position: 1 (using GTID auto-positioning)Fix GTID gaps after a restore:
-- If replica is missing transactions that primary has already purged:
-- Option 1: Provision a new replica from a fresh backup
-- This is the safest approach
-- Option 2: Inject empty transactions to fill the gaps
-- Find the missing GTIDs:
-- Primary has: source_uuid:1-1000
-- Replica has: source_uuid:1-800, source_uuid:850-1000
-- Missing: source_uuid:801-849
-- Inject empty transactions on the replica:
STOP REPLICA;
SET GTID_NEXT = 'source_uuid:801';
BEGIN; COMMIT;
SET GTID_NEXT = 'source_uuid:802';
BEGIN; COMMIT;
-- ... repeat for each missing GTID
-- Or use a loop in a shell script for large ranges:
-- for i in $(seq 801 849); do
-- mysql -e "SET GTID_NEXT='source_uuid:$i'; BEGIN; COMMIT;"
-- done
SET GTID_NEXT = 'AUTOMATIC';
START REPLICA;Set up GTID replication correctly:
# my.cnf — both primary and replica
[mysqld]
gtid_mode = ON
enforce_gtid_consistency = ON
log_bin = mysql-bin
log_replica_updates = ON # Required: replica also writes to its own binlog-- On the replica — connect using GTID auto-positioning
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'primary-host',
SOURCE_USER = 'replication_user',
SOURCE_PASSWORD = 'replication_password',
SOURCE_AUTO_POSITION = 1; -- GTID auto-positioning
START REPLICA;Fix 4: Rebuild a Replica from Scratch
When replication is too far out of sync, rebuilding is faster than catching up:
# Method 1: Using mysqldump with GTID state
# On primary:
mysqldump \
--single-transaction \
--master-data=2 \
--set-gtid-purged=ON \
--all-databases \
-u root -p > full_backup.sql
# Transfer to replica and restore:
mysql -u root -p < full_backup.sql
# On replica — check GTID_PURGED was set:
mysql -e "SHOW GLOBAL VARIABLES LIKE 'gtid_purged';"
# Connect to primary and start replication:
mysql -e "
CHANGE REPLICATION SOURCE TO
SOURCE_HOST='primary-host',
SOURCE_USER='repl',
SOURCE_PASSWORD='password',
SOURCE_AUTO_POSITION=1;
START REPLICA;
"# Method 2: Using Percona XtraBackup (faster for large databases)
# On primary:
xtrabackup --backup --target-dir=/backup/full \
--user=root --password=pass
xtrabackup --prepare --target-dir=/backup/full
# Transfer to replica:
rsync -avz /backup/full/ replica:/var/lib/mysql/
# On replica:
xtrabackup --move-back --target-dir=/backup/full
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql
# Get the GTID position from backup:
cat /backup/full/xtrabackup_binlog_info
# mysql-bin.000042 1234567 source_uuid:1-54321
mysql -e "
RESET REPLICA ALL;
SET GLOBAL gtid_purged='source_uuid:1-54321';
CHANGE REPLICATION SOURCE TO SOURCE_HOST='primary', SOURCE_AUTO_POSITION=1;
START REPLICA;
"Fix 5: Monitor Replication Health
Set up proactive monitoring to catch issues before they become critical:
-- Check replication health (run regularly)
SELECT
SERVICE_STATE AS io_state,
LAST_ERROR_NUMBER,
LAST_ERROR_MESSAGE,
LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;
SELECT
SERVICE_STATE AS sql_state,
LAST_ERROR_NUMBER,
LAST_ERROR_MESSAGE,
LAST_ERROR_TIMESTAMP,
LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;
-- Lag monitoring
SELECT
CHANNEL_NAME,
SERVICE_STATE,
LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
TIMESTAMPDIFF(
SECOND,
LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,
NOW()
) AS lag_seconds
FROM performance_schema.replication_applier_status_by_worker;Shell script for monitoring:
#!/bin/bash
# replication_check.sh — alert if replication is broken or lagging
MYSQL="mysql -u monitor -ppassword -h replica-host"
MAX_LAG=30 # Alert if lag > 30 seconds
STATUS=$($MYSQL -e "SHOW REPLICA STATUS\G" 2>/dev/null)
IO_RUNNING=$(echo "$STATUS" | grep "Replica_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$STATUS" | grep "Replica_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$STATUS" | grep "Seconds_Behind_Source:" | awk '{print $2}')
ERROR=$(echo "$STATUS" | grep "Last_SQL_Error:" | cut -d: -f2-)
if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
echo "CRITICAL: Replication stopped! Error:$ERROR"
# Send alert here
elif [ "$LAG" != "0" ] && [ "$LAG" -gt "$MAX_LAG" ]; then
echo "WARNING: Replication lag ${LAG}s exceeds threshold of ${MAX_LAG}s"
fiFix 6: Replication User and Permissions
A misconfigured replication user is a common setup mistake:
-- On primary — create replication user
CREATE USER 'repl'@'replica-ip' IDENTIFIED WITH mysql_native_password BY 'strong-password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'replica-ip';
FLUSH PRIVILEGES;
-- Verify the grant
SHOW GRANTS FOR 'repl'@'replica-ip';
-- On replica — connect
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = '10.0.0.1',
SOURCE_PORT = 3306,
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'strong-password',
SOURCE_AUTO_POSITION = 1;
-- For SSL/TLS replication (recommended for production)
CHANGE REPLICATION SOURCE TO
SOURCE_HOST = 'primary-host',
SOURCE_USER = 'repl',
SOURCE_PASSWORD = 'password',
SOURCE_AUTO_POSITION = 1,
SOURCE_SSL = 1,
SOURCE_SSL_CA = '/etc/mysql/ssl/ca.pem',
SOURCE_SSL_CERT = '/etc/mysql/ssl/client-cert.pem',
SOURCE_SSL_KEY = '/etc/mysql/ssl/client-key.pem';Still Not Working?
Binary log format mismatch — if the primary uses binlog_format=STATEMENT and the replica has issues executing non-deterministic queries (NOW(), RAND()), switch the primary to ROW format. ROW format replicates the actual row changes, not the SQL statements, and is more reliable:
-- Check current format
SHOW GLOBAL VARIABLES LIKE 'binlog_format';
-- Change to ROW (requires replica restart to take effect)
SET GLOBAL binlog_format = 'ROW';read_only not set on replica — if the replica doesn’t have read_only = ON, application writes directly to the replica cause data drift and replication conflicts. Always set read_only = ON and super_read_only = ON on replicas:
[mysqld]
read_only = ON
super_read_only = ON # Prevents even SUPER privilege users from writingSemi-synchronous replication timeouts — if rpl_semi_sync_source_timeout is reached and the primary falls back to asynchronous replication, the replica may have missed committed transactions. Check Rpl_semi_sync_source_no_tx counter.
Replica clock drift breaks Seconds_Behind_Source reporting — that field is computed by subtracting the replica’s local time from the timestamp baked into the binlog event. If the replica’s clock is behind the primary’s, Seconds_Behind_Source reads as a large positive number even when the replica is current. If it is ahead, the field can show negative values or zero on a lagging replica. Run NTP/chrony on both hosts and prefer the LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP columns in performance_schema.replication_applier_status_by_worker for accurate lag.
auto.cnf UUID collision after VM cloning — if you cloned the data directory from another MySQL instance, both servers have the same server_uuid in /var/lib/mysql/auto.cnf. Replication refuses to start with Fatal error: The slave I/O thread stops because master and slave have equal MySQL server UUIDs. Stop MySQL on the clone, delete auto.cnf, and restart — a fresh UUID is generated.
relay_log_recovery left off after a crash — if the replica’s relay logs are corrupted (after a power loss, for example), restarting with relay_log_recovery=ON (set in my.cnf) tells MySQL to discard the relay logs and re-fetch from the primary on startup. Without it, the SQL thread errors out on the corrupted event and refuses to advance.
For related MySQL issues, see Fix: MySQL Deadlock Detected, Fix: MySQL Too Many Connections, Fix: MySQL Slow Query Optimization, and Fix: MySQL Lock Wait Timeout Exceeded.
Solo developer based in Japan. Every solution is cross-referenced with official documentation and tested before publishing.
Was this article helpful?
Related Articles
Fix: Redis Cluster Not Working — MOVED, CROSSSLOT, or Connection Errors
How to fix Redis Cluster errors — MOVED redirects, CROSSSLOT multi-key operations, cluster-aware client setup, hash tags for key grouping, and failover handling.
Fix: MySQL Full-Text Search Not Working — MATCH AGAINST Returns No Results
How to fix MySQL full-text search issues — FULLTEXT index creation, minimum word length, stopwords, boolean mode vs natural language mode, InnoDB vs MyISAM, and LIKE fallback.
Fix: MySQL Index Not Being Used — Query Optimizer Skipping Indexes
How to fix MySQL indexes not being used by the query optimizer — EXPLAIN output, implicit conversions, function on columns, composite index order, cardinality issues, and forcing indexes.
Fix: MySQL Slow Query — Diagnosis and Optimization with EXPLAIN
How to diagnose and fix slow MySQL queries — enabling the slow query log, reading EXPLAIN output, adding indexes, fixing N+1 queries, and optimizing JOINs and ORDER BY.