
Error Detection with MySQL Replication

Khanh Do Ba Stephen Tu Daniel Peek

Problem

[Diagram: Master replicating to Slave via statement-based replication, with a cosmic ray shown as a source of corruption]

Problem

[Diagram: Master and Slave]

Target scenario
What kind of errors?
How to find errors
Results on production systems

Target Scenario
[Diagram: Master and Slave]

Don't interfere with workload
Minimize communication
Detect when master ≠ slave
Deal with replication lag
Use vanilla MySQL

Easy Errors
Table does not exist
Different schema
Database offline

Kinds of Errors

Wrong Data
Slave Missing Row
Slave Extra Row

First thoughts
[Diagram: DB1 and DB2 each ship their full DB contents to a Compare step]

Second thoughts
[Diagram: DB1 and DB2 each ship only fingerprints to a Compare step]

A New Plan

1. Fast pass to narrow search to blocks
2. CM-fingerprint narrows search to rows
3. Third pass gives definite answers

First Pass: Checksum


[Diagram: a table of records, with one checksum (cs) computed per group of records]

Block Boundaries
Rows 1 to 20, one checksum (cs) per block:

Rows start - 4: cs
Rows 4 - 7: cs
Rows 7 - 10: cs
Rows 10 - 13: cs
Rows 13 - 16: cs
Rows 16 - end: cs
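
The first pass can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes integer keys, half-open block ranges (so "Rows 4 - 7" holds keys 4, 5, and 6), rows already serialized to bytes, and CRC-32 as the checksum. The names `BOUNDARIES`, `block_checksums`, and `suspect_blocks` are hypothetical.

```python
import bisect
import zlib

# Assumption: boundaries are half-open, so block "Rows 4 - 7" holds keys 4, 5, 6.
BOUNDARIES = [4, 7, 10, 13, 16]


def block_checksums(rows, boundaries=BOUNDARIES):
    """Return one checksum per block; rows maps an integer key to row bytes."""
    sums = [0] * (len(boundaries) + 1)
    for key in sorted(rows):
        # Blocks are defined by key range, not by row position.
        block = bisect.bisect_right(boundaries, key)
        # Chain CRC-32 over the block's rows in key order.
        sums[block] = zlib.crc32(rows[key], sums[block])
    return sums


def suspect_blocks(master_rows, slave_rows, boundaries=BOUNDARIES):
    """Return indices of blocks whose checksums differ between the two sides."""
    m = block_checksums(master_rows, boundaries)
    s = block_checksums(slave_rows, boundaries)
    return [b for b, (cm, cs) in enumerate(zip(m, s)) if cm != cs]
```

Because blocks are defined by key range rather than by row position, a row missing on one side changes only the checksum of its own block instead of shifting every later block.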

Now What?
We know which blocks may have inconsistencies
Which rows in those blocks have inconsistencies?

Second Pass: CM-Fingerprint


[Diagram: DB1 and DB2 each ship CM-fingerprints to a Compare step]

CM-fingerprinting: encoding

Bad block: rows 0 to 7, with row fingerprints fp0 to fp7

[Diagram: the row fingerprints of the bad block are placed into columns labeled 0 and 1 (fp0 and fp1 shown)]

CM-fingerprinting: decoding

x00 = ⊕ {fp_i : binary(i) = 0**} = fp0 ⊕ fp1 ⊕ fp2 ⊕ fp3
x01 = ⊕ {fp_i : binary(i) = 1**} = fp4 ⊕ fp5 ⊕ fp6 ⊕ fp7
x10 = ⊕ {fp_i : binary(i) = *0*} = fp0 ⊕ fp1 ⊕ fp4 ⊕ fp5
x11 = ⊕ {fp_i : binary(i) = *1*} = fp2 ⊕ fp3 ⊕ fp6 ⊕ fp7
x20 = ⊕ {fp_i : binary(i) = **0} = fp0 ⊕ fp2 ⊕ fp4 ⊕ fp6
x21 = ⊕ {fp_i : binary(i) = **1} = fp1 ⊕ fp3 ⊕ fp5 ⊕ fp7

One database computes x00 x01 x10 x11 x20 x21; the other computes y00 y01 y10 y11 y20 y21.
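
The encoding defined by these sums can be sketched in a few lines. This is a hedged illustration that assumes the row fingerprints are combined with bitwise XOR and that bit position 0 is the most significant bit of the row index, matching the 0**, *0*, **0 patterns. The function name `cm_fingerprint` is hypothetical.

```python
def cm_fingerprint(fps, nbits):
    """Fold row fingerprints into 2 * nbits accumulators.

    sketch[b][v] combines the fingerprints of all rows whose index has
    value v at bit position b (position 0 = most significant bit).
    Assumption: the combining operator is bitwise XOR.
    """
    sketch = [[0, 0] for _ in range(nbits)]
    for i, fp in enumerate(fps):
        for b in range(nbits):
            bit = (i >> (nbits - 1 - b)) & 1
            sketch[b][bit] ^= fp
    return sketch
```

For a block of 8 rows, `cm_fingerprint(fps, 3)` yields `[[x00, x01], [x10, x11], [x20, x21]]`.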

CM-fingerprinting: decoding

Case 1: All rows agree

x00 x01 x10 x11 x20 x21
y00 y01 y10 y11 y20 y21
x vs. y: 0 0 0 0 0 0

CM-fingerprinting: decoding

Case 2: 1 row disagrees (e.g., row 3 = binary 011)

x00 x01 x10 x11 x20 x21
y00 y01 y10 y11 y20 y21
x vs. y: z 0 0 z 0 z

CM-fingerprinting: decoding

Case 3: >1 rows disagree

x00 x01 x10 x11 x20 x21
y00 y01 y10 y11 y20 y21
x vs. y: ? ? ? ? ? ?
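
The three decoding cases can be sketched as a single function. This is an illustrative sketch under two assumptions: the two sides' accumulators are compared with bitwise XOR, and bit position 0 is the most significant bit of the row index. The name `decode` and the result labels are hypothetical.

```python
def decode(x, y):
    """Classify a pair of CM-fingerprints.

    x and y are lists of [slot0, slot1] accumulators, one pair per bit
    position of the row index. Returns ("agree", None), ("single", row),
    or ("multiple", None).
    """
    diffs = [(xb[0] ^ yb[0], xb[1] ^ yb[1]) for xb, yb in zip(x, y)]
    # Case 1: every accumulator matches.
    if all(d0 == 0 and d1 == 0 for d0, d1 in diffs):
        return ("agree", None)
    row = 0
    z = None
    for d0, d1 in diffs:
        # Case 2 needs exactly one nonzero slot per bit position.
        if (d0 == 0) == (d1 == 0):
            return ("multiple", None)
        nonzero = d1 if d1 != 0 else d0
        if z is None:
            z = nonzero
        elif nonzero != z:
            # Different values across positions: more than one row differs.
            return ("multiple", None)
        # The slot that differs gives this bit of the row index.
        row = (row << 1) | (1 if d1 != 0 else 0)
    return ("single", row)
```

When the result is `("multiple", None)`, the fingerprint cannot name the rows, so the whole block has to be treated as suspect.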

CM-fingerprinting: analysis

[Diagram: rows 0, 1, 2, 3, 4, 5, ..., n are indexed with log2 n bits]

Blocks of 1000 rows require CM-fingerprints of size
2 ⌈log2 1000⌉ × 32 bits = 640 bits
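
The size arithmetic can be checked with a short helper. It assumes the index width is rounded up to a whole number of bits and that each accumulator is 32 bits wide, as on the slide. The function name `cm_fingerprint_bits` is hypothetical.

```python
def cm_fingerprint_bits(block_rows, fp_bits=32):
    """Size in bits of a CM-fingerprint for a block of block_rows rows."""
    # ceil(log2(block_rows)) computed exactly with integer arithmetic.
    index_bits = (block_rows - 1).bit_length()
    # Two accumulators (bit = 0 and bit = 1) per bit position.
    return 2 * index_bits * fp_bits
```

`cm_fingerprint_bits(1000)` evaluates to 2 × 10 × 32 = 640.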

Pass 3: Consistent snapshot


Copy bad blocks and rows into a side table
Use statement-based replication for consistency

[Diagram: a Snapshot on the Master and a Snapshot on the Slave]
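
One way to picture the snapshot step: a single `INSERT ... SELECT` statement is issued on the master, and statement-based replication replays the same statement on the slave, so each side copies its own version of the suspect rows at the same point in the statement stream. The sketch below only builds such a statement. The table names, the key column, and the function `snapshot_statement` are hypothetical, and the half-open key range is an assumption.

```python
def snapshot_statement(table, side_table, key_column, lo, hi):
    """Build a statement that copies one suspect key range into a side table.

    Assumption: identifiers come from trusted configuration, since they are
    interpolated directly; the bounds are coerced to integers.
    """
    return (
        f"INSERT INTO {side_table} SELECT * FROM {table} "
        f"WHERE {key_column} >= {int(lo)} AND {key_column} < {int(hi)}"
    )


# Hypothetical names, for illustration only.
print(snapshot_statement("users", "users_snapshot", "id", 7, 10))
```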

Comparing Rows
[Diagram: Master Snapshot compared with Slave Snapshot]

Easy with unique keys
If no unique key, order by md5(row)
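
Both comparison strategies can be sketched as follows. This is a minimal illustration, assuming each snapshot has been fetched into memory with rows serialized to bytes. The function names and the report keys are hypothetical.

```python
import hashlib


def compare_keyed(master, slave):
    """Compare snapshots that have a unique key; each maps key -> row bytes."""
    report = {"wrong_data": [], "slave_missing": [], "slave_extra": []}
    for key in sorted(master.keys() | slave.keys()):
        if key not in slave:
            report["slave_missing"].append(key)
        elif key not in master:
            report["slave_extra"].append(key)
        elif master[key] != slave[key]:
            report["wrong_data"].append(key)
    return report


def compare_unkeyed(master_rows, slave_rows):
    """Compare snapshots without a unique key by ordering rows by md5(row)."""
    m = sorted((hashlib.md5(r).digest(), r) for r in master_rows)
    s = sorted((hashlib.md5(r).digest(), r) for r in slave_rows)
    only_master, only_slave = [], []
    i = j = 0
    # Merge the two sorted lists, pairing off rows with equal digests.
    while i < len(m) and j < len(s):
        if m[i][0] == s[j][0]:
            i += 1
            j += 1
        elif m[i][0] < s[j][0]:
            only_master.append(m[i][1])
            i += 1
        else:
            only_slave.append(s[j][1])
            j += 1
    only_master.extend(r for _, r in m[i:])
    only_slave.extend(r for _, r in s[j:])
    return only_master, only_slave
```

Without a unique key the comparison can only say which rows appear on one side and not the other, so a row with wrong data shows up as one master-only row plus one slave-only row.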

Final picture: Phase 1


Narrow search to blocks
[Diagram: both databases send a Fingerprint to a Compare step]

Final picture: Phase 2


Narrow search to rows
[Diagram: both databases send a CM-fingerprint to a Decode step]

Final picture: Phase 3


Definitive Answers
[Diagram: both databases send a Snapshot to a Compare step]

Results
On Facebook's User Databases
Rate of inconsistency: 0.0056%
Minus strange tables, rate of inconsistency: 0.0027%
What did we find, at what cost?

Finding Inconsistencies
[Chart, log scale: share of data still under suspicion after each pass]

Start: 100%
After Pass 1 (Checksum): 1.12%
After Pass 2 (CM-fingerprint): 0.014%
After Pass 3 (Consistent Snapshot): 0.0027%

How inconsistent are blocks?


1 inconsistency: 27.8%
2 inconsistencies: 13.7%
3 inconsistencies: 2.9%
>4 inconsistencies: 55.6%

CM-Fingerprint saves if 1 inconsistency
Use smaller blocksize?

What kind of inconsistencies?


Of the inconsistent 0.0027%:
Different data: 99.54%
Slave missing a row: 0.41%
Slave has extra row: 0.05%

What kind of wrong data?


Tool can't tell us about causes
1 column is off: 98.5%
Off by one: 0.04%
Bad timestamp: 97.6%
Still unexplained: 2.4%
Serious(?) inconsistency rate: 0.000066%

Future Work
Master-master mode
No consistent snapshot!
Measure growth rate
Evaluate blocksize vs. data traffic
