
Data Sharding

Michał Gruchała, michal@gruchala.info, WebClusters 2011

TODO
  Background
  Theory
  Practice
  Summary

Background
Microblogging site
  user messages (blog)
  cockpit/wall

Classic architecture
  database
  web server(s)
  load balancer(s)

Background
Web servers, load balancers
  one server ... 1000 servers: not a problem
Database
  one database
  two databases (master -> slave)
  two databases (master <-> master)
  n databases (slave(s) <- master <-> master -> slave(s))
  a lot of replication ;)

Background
Replication
  increases read performance (like RAID 1)
  increases data safety (like RAID 1)
  does not increase the system's capacity (GBs)

Background
Scalability
  stateless elements scale well
  stateful elements
    quite easy to scale if we want more reads (cache, replication)
    hard to scale
      if we want more writes
      if we want more capacity

Background
Sharding ;)
  [Diagram: one database holding ABCD EFGH IJKL is split into six shards: AB, CD, EF, GH, IJ, KL]

Theory

Theory
Scaling
  Scale Back
    delete, archive unused data
  Scale Up (vertical)
    more power, more disks
  Scale Out (horizontal)
    add machines
    functional partitioning
    replication
    sharding

Theory
Sharding
  split one big database into many smaller databases
    spread rows
    spread them across many servers
  shared-nothing partitioning
  not replication

Theory
Sharding key
  shard by a key
  all data with that key will be on the same shard
  e.g. shard by user: all information connected to a user is on one shard (user info, messages, friends list)
    user 1 -> shard 1
    user 2 -> shard 2
    user 3 -> shard 1
    user 4 -> shard 2
  choosing the right key is very important!

Theory
Sharding function
  maps keys to shards
    where to find the data
    where to store the data
  shard number = sf(key)

Theory
Sharding function
  Dynamic
    mapping in a database table
  Fixed
    Modulo
      shard number = id % shards_count
    Hash + Modulo
      shard number = md5(email) % shards_count
    Consistent hashing
      http://en.wikipedia.org/wiki/Consistent_hashing
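The three simplest variants above can be sketched in a few lines of Python. This is only an illustration: the function names and the `SHARD_MAP` dictionary (standing in for the mapping table a dynamic scheme would keep in a database) are hypothetical and not part of the talk.

```python
import hashlib

def shard_by_modulo(key_id: int, shards_count: int) -> int:
    """Fixed sharding: shard number = id % shards_count."""
    return key_id % shards_count

def shard_by_hash(email: str, shards_count: int) -> int:
    """Hash + modulo: shard number = md5(email) % shards_count."""
    digest = hashlib.md5(email.encode("utf-8")).hexdigest()
    return int(digest, 16) % shards_count

# Dynamic sharding: the key -> shard mapping is stored explicitly.
# A dict stands in for the database table here (hypothetical content,
# mirroring the user -> shard example from the sharding key slide).
SHARD_MAP = {1: 1, 2: 2, 3: 1, 4: 2}

def shard_by_lookup(key_id: int) -> int:
    """Dynamic sharding: look the shard number up in the mapping."""
    return SHARD_MAP[key_id]
```

The fixed variants need no extra lookup but tie the result to `shards_count`; the dynamic variant costs one lookup but lets any key be moved by editing its row.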

Theory
Advantages
  Linear write/read performance scalability (like RAID 0)
  Capacity increase (like RAID 0)
  Smaller databases are easier to manage
    alter
    backup/restore
    truncate ;)
  Smaller databases are faster
    as they may fit into memory
  Cost effective
    80 cores, 20 HDs, 80 GB RAM vs. 10 x (8 cores, 2 HDs, 8 GB RAM)

Theory
Challenges
  Globally unique IDs
    unique across all shards
      auto_increment_increment, auto_increment_offset
      global IDs table
    not unique across shards
      IDs in dbs: not unique
      shard_number: unique
      globally unique ID = shard_number + db ID
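Both approaches can be illustrated with a small sketch. The first function only imitates the sequence a shard would produce with MySQL's `auto_increment_offset` and `auto_increment_increment` settings; it does not reproduce MySQL's edge cases. The second pair shows one possible way to combine `shard_number` and a local ID into a single integer; the 48-bit split and all function names are assumptions made for this example.

```python
from itertools import count, islice

def interleaved_ids(offset: int, increment: int):
    """IDs a shard would hand out with auto_increment_offset=offset
    and auto_increment_increment=increment (simplified model)."""
    return count(start=offset, step=increment)

LOCAL_ID_BITS = 48  # assumption: local (per-database) IDs fit in 48 bits

def make_global_id(shard_number: int, local_id: int) -> int:
    """Globally unique ID = shard_number combined with the db-local ID."""
    if not 0 <= local_id < (1 << LOCAL_ID_BITS):
        raise ValueError("local_id out of range")
    return (shard_number << LOCAL_ID_BITS) | local_id

def split_global_id(global_id: int) -> tuple:
    """Recover (shard_number, local_id) from a global ID."""
    return global_id >> LOCAL_ID_BITS, global_id & ((1 << LOCAL_ID_BITS) - 1)

# Three shards, each with its own offset, never collide:
shard_a = list(islice(interleaved_ids(1, 3), 3))  # [1, 4, 7]
shard_b = list(islice(interleaved_ids(2, 3), 3))  # [2, 5, 8]
```

A side effect of the composite form is that the shard can be read back from the ID itself, so no lookup is needed to find a row.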

Challenges
Re-sharding
  [Diagram: IDs on three shards (1,4,7 | 2,5,8 | 3,6,9) are redistributed over more shards (1,6 | 2,7 | 3,8 | 4,9)]
  consistent hashing
  or more shards than machines/nodes (e.g. 100 shards on 10 machines)
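To see why re-sharding hurts with plain modulo, and why "more shards than machines" helps, here is a rough sketch. The new shard count of 5, the round-robin placement, and all names are assumptions for illustration only.

```python
VIRTUAL_SHARDS = 100  # fixed for the lifetime of the system

def moved_keys(keys, old_count, new_count):
    """Keys whose shard number changes when the modulo base changes."""
    return [k for k in keys if k % old_count != k % new_count]

def virtual_shard(key):
    """Key -> shard mapping that never changes."""
    return key % VIRTUAL_SHARDS

def build_placement(machines):
    """Initial shard -> machine placement (round-robin, an assumption)."""
    return {shard: shard % machines for shard in range(VIRTUAL_SHARDS)}

def move_shards(placement, shards, target_machine):
    """Return a new placement with whole shards moved to another machine."""
    updated = dict(placement)
    for shard in shards:
        updated[shard] = target_machine
    return updated

def machine_for(key, placement):
    """Find the machine holding a key under the given placement."""
    return placement[virtual_shard(key)]

# Changing the modulo base from 3 to 5 relocates 7 of the 9 keys.
relocated = moved_keys(range(1, 10), 3, 5)

# With 100 shards on 10 machines, adding machine 10 only moves whole shards.
placement = build_placement(10)
grown = move_shards(placement, [23], target_machine=10)
```

In the second scheme a key's shard number stays the same forever; only the small shard-to-machine table changes, and data moves one whole shard at a time.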

Challenges
Cross-shard queries
  sent to many shards
    collect the results from each one
    avoidable (better sharding key, more sharding keys)
  joins
    send query to many shards
    join results in the application
    sometimes unavoidable
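A cross-shard query boils down to scatter and gather: send the same query to every shard, then merge the partial results in the application. The sketch below assumes two in-memory lists standing in for shards and invented timestamps; `query_shard` is a hypothetical stand-in for a per-shard "newest first, limit n" query.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical shards: (timestamp, message) rows, each sorted newest first.
SHARDS = [
    [(50, "M5"), (30, "M3"), (10, "M1")],
    [(40, "M4"), (20, "M2")],
]

def query_shard(shard, limit):
    """Stand-in for a per-shard ORDER BY timestamp DESC LIMIT n query."""
    return shard[:limit]

def latest_messages(limit):
    """Scatter the query to all shards in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = list(pool.map(lambda s: query_shard(s, limit), SHARDS))
    # Each partial result is already sorted newest first, so a k-way merge
    # is enough; no full re-sort is needed.
    merged = heapq.merge(*partials, key=lambda row: row[0], reverse=True)
    return [message for _, message in islice(merged, limit)]
```

Every shard has to return up to `limit` rows even though most are thrown away after the merge, which is part of why such queries are worth avoiding where possible.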

Challenges
Network
  more machines, more smaller streams
  full mesh between web servers and shards
  pconnect vs. connect
Complexity
  usually sharding is done in application logic

Practice

Practice
Microblogging site
  see users' messages
  see stream/wall

Classic architecture
  database
  web server(s)
  load balancer(s)

Practice
Data

User
  id  login
  1   John
  2   Bob
  3   Andy
  4   Claire
  5   Megan

Message
  id  owner  message
  1   2      M1
  2   1      M2
  3   2      M3
  4   3      M4
  5   2      M5

Follow
  who  whose
  1    2
  3    4
  3    2
  1    3
  5    2
  2    1
  1    5
  4    3
  4    1

John's messages? John's follows?

Practice
User
  no need for sharding
Message
  sharded by user (owner field)
  shard_number = owner % 2
Follow
  sharded by user (who field)
  shard_number = who % 2
2 shards, 3 machines
  [Diagram: User on its own machine; shard0 and shard1 each hold Message and Follow]
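Applying the two rules to the rows from the Data slide can be sketched as follows. The `partition` helper and the tuple layout are assumptions made for this example; the rows themselves come from the slides.

```python
SHARDS_COUNT = 2

# Rows from the Data slide.
MESSAGES = [  # (id, owner, message)
    (1, 2, "M1"), (2, 1, "M2"), (3, 2, "M3"), (4, 3, "M4"), (5, 2, "M5"),
]
FOLLOWS = [  # (who, whose)
    (1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (2, 1), (1, 5), (4, 3), (4, 1),
]

def partition(rows, key_of):
    """Distribute rows over shards by key_of(row) % SHARDS_COUNT."""
    shards = {number: [] for number in range(SHARDS_COUNT)}
    for row in rows:
        shards[key_of(row) % SHARDS_COUNT].append(row)
    return shards

# Message is sharded by owner, Follow by who.
message_shards = partition(MESSAGES, key_of=lambda row: row[1])
follow_shards = partition(FOLLOWS, key_of=lambda row: row[0])
```

Because both tables are keyed by the same user ID, everything a user wrote and everyone that user follows ends up on one shard.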

Practice
User (not sharded)
  id  login
  1   John
  2   Bob
  3   Andy
  4   Claire
  5   Megan

shard0
  Message
    id  owner  message
    1   2      M1
    3   2      M3
    5   2      M5
  Follow
    who  whose
    2    1
    4    3
    4    1

shard1
  Message
    id  owner  message
    2   1      M2
    4   3      M4
  Follow
    who  whose
    1    2
    3    4
    3    2
    1    3
    5    2
    1    5

mapping?

Practice
Bob's blog
  Bob's messages
    find Bob's id in User table (id = 2)
    find Bob's shard (2 % 2 = 0, shard0)
    fetch Messages (shard0) where owner = 2
  People Bob follows
    find Bob's id in User table (id = 2)
    find Bob's shard (2 % 2 = 0, shard0)
    fetch whose ids from Follow table (shard0)
    fetch people info from User table
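The two lookups above can be traced in code. In this sketch plain Python lists stand in for the tables on shard0 and shard1 (rows taken from the slides); the function names are hypothetical.

```python
SHARDS_COUNT = 2

USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}

MESSAGES = [  # index = shard number, rows = (id, owner, message)
    [(1, 2, "M1"), (3, 2, "M3"), (5, 2, "M5")],
    [(2, 1, "M2"), (4, 3, "M4")],
]
FOLLOWS = [  # index = shard number, rows = (who, whose)
    [(2, 1), (4, 3), (4, 1)],
    [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
]

def user_id(login):
    """Step 1: find the user's id in the (unsharded) User table."""
    return next(uid for uid, name in USERS.items() if name == login)

def messages_of(login):
    """Steps 2-3: pick the shard by owner % 2, fetch that user's messages."""
    uid = user_id(login)
    shard = uid % SHARDS_COUNT
    return [message for _, owner, message in MESSAGES[shard] if owner == uid]

def followed_by(login):
    """Pick the shard by who % 2, fetch whose ids, resolve them via User."""
    uid = user_id(login)
    shard = uid % SHARDS_COUNT
    return [USERS[whose] for who, whose in FOLLOWS[shard] if who == uid]
```

Both functions touch exactly one shard, which is the payoff of sharding Message and Follow by the same user key.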

Practice
Who follows Andy?
  find Andy's id in User table (id = 3)
  find Andy's shard (3 % 2 = 1, shard1)
  hmmm

Practice
Cross-shard query!
  Follow rows with whose = 3 sit on both shards: (who 4, whose 3) on shard0 and (who 1, whose 3) on shard1
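Since Follow is sharded by who, Andy's shard number says nothing about where his followers' rows live, so every shard has to be asked. A minimal sketch with the slide data (list-based shards and the function name are assumptions):

```python
USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}

FOLLOWS = [  # index = shard number, rows = (who, whose), sharded by who % 2
    [(2, 1), (4, 3), (4, 1)],
    [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
]

def followers_of(login):
    """Ask every shard for rows with whose = user id, then join with User."""
    uid = next(user for user, name in USERS.items() if name == login)
    follower_ids = []
    for shard_rows in FOLLOWS:  # one query per shard
        follower_ids.extend(who for who, whose in shard_rows if whose == uid)
    return sorted(USERS[who] for who in follower_ids)
```

For Andy this returns Claire (found on shard0) and John (found on shard1), so neither shard alone could have answered the question.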

Practice
Ideas?
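One possible answer, following the "more sharding keys" hint from the Challenges slides, is to keep a second copy of Follow that is sharded by whose. The sketch below is only one reading of that hint, not necessarily what the talk proposed; the variable and function names are hypothetical.

```python
SHARDS_COUNT = 2

FOLLOW_ROWS = [  # (who, whose), from the Data slide
    (1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (2, 1), (1, 5), (4, 3), (4, 1),
]

# Second copy of Follow, keyed by the followed user instead of the follower.
FOLLOW_BY_WHOSE = [
    [row for row in FOLLOW_ROWS if row[1] % SHARDS_COUNT == number]
    for number in range(SHARDS_COUNT)
]

def follower_ids(uid):
    """Single-shard lookup: the shard is chosen by whose % 2."""
    shard = uid % SHARDS_COUNT
    return sorted(who for who, whose in FOLLOW_BY_WHOSE[shard] if whose == uid)
```

The price is that every follow or unfollow now writes to two places, possibly on two different shards, so the copies can drift apart if one write fails.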

Summary

Summary
To shard or not to shard
  many reads, few writes? - don't
  many writes and no capacity problems? - don't (use SSDs)
  capacity problems? - yes
  many writes and capacity problems? - yes
  scale-up is affordable? - don't shard

As you see... it depends!

Summary
If you have to shard
  always use sharding + replication = RAID 10
    sharding reduces high availability (like RAID 0)
  more shards than you need
    e.g. 4 machines, 100 shards
  or dynamic allocation
  think of network capacity (full mesh)
  load sharding (google it ;))
  sharding key: important!
  cross-shard queries

Wake Up!
