Michał Gruchała - Data Sharding

Data Sharding
Michał Gruchała michal@gruchala.info WebClusters 2011
TODO
Background Theory Practice Summary
Background
Microblogging site user messages (blog) cockpit/wall
Classic architecture database web server(s) loadbalancer(s)
Background
Web servers, load balancers
- one server ... 1000 servers: not a problem
Database
- one database
- two databases (master -> slave)
- two databases (master <-> master)
- n databases (slave(s) <- master <-> master -> slave(s))
- a lot of replication ;)
Background
Replication
- increases read performance (like RAID 1)
- increases data safety (like RAID 1)
- does not increase the system's capacity (GBs)
Background
Scalability
- stateless elements scale well
- stateful elements
  - quite easy to scale if we want more reads (cache, replication)
  - hard to scale if we want more writes
  - hard to scale if we want more capacity
Background
Sharding ;)
[Figure: blocks labelled ABCD, EFGH, IJKL and AB, CD, EF, GH, IJ, KL]
Theory
Theory
Scaling
- Scale Back: delete, archive unused data
- Scale Up (vertical): more power, more disks
- Scale Out (horizontal): add machines
  - functional partitioning
  - replication
  - sharding
Theory
Sharding
- split one big database into many smaller databases
- spread rows across many servers
- shared-nothing partitioning
- not replication
Theory
Sharding key
- shard by a key
- all data with that key will be on the same shard
- e.g. shard by user: all information connected to a user is on one shard (user info, messages, friends list)
  user 1 -> shard 1
  user 2 -> shard 2
  user 3 -> shard 1
  user 4 -> shard 2
- choosing the right key is very important!
Theory
Sharding function
- maps keys to shards
  - where to find the data
  - where to store the data
- shard number = sf(key)
Theory
Sharding function
- Dynamic
  - mapping in a database table
- Fixed
  - Modulo: shard number = id % shards_count
  - Hash + Modulo: shard number = md5(email) % shards_count
  - Consistent hashing: http://en.wikipedia.org/wiki/Consistent_hashing
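The three simplest variants above can be sketched in a few lines of Python. This is only an illustration: `SHARDS_COUNT = 4`, the function names and the in-memory `SHARD_MAP` dictionary (standing in for the mapping table in a database) are assumptions, not part of the talk. The map contents reuse the user -> shard example from the sharding key slide.

```python
import hashlib

SHARDS_COUNT = 4  # assumed value, for illustration only


def shard_by_modulo(user_id: int, shards_count: int = SHARDS_COUNT) -> int:
    """Fixed mapping: shard number = id % shards_count."""
    return user_id % shards_count


def shard_by_hash(email: str, shards_count: int = SHARDS_COUNT) -> int:
    """Fixed mapping for non-numeric keys: md5(email) % shards_count."""
    digest = hashlib.md5(email.encode("utf-8")).hexdigest()
    return int(digest, 16) % shards_count


# Dynamic mapping: a lookup table (in practice a database table, here a dict).
SHARD_MAP = {1: 1, 2: 2, 3: 1, 4: 2}


def shard_by_lookup(user_id: int) -> int:
    """Dynamic mapping: the shard is whatever the mapping table says."""
    return SHARD_MAP[user_id]
```

The fixed variants need no extra lookup but tie every key to `shards_count`; the dynamic variant costs one lookup but lets a single key be moved by updating one row.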
Theory
Advantages
- Linear write/read performance scalability (like RAID 0)
- Capacity increase (like RAID 0)
- Smaller databases are easier to manage
  - alter
  - backup/restore
  - truncate ;)
- Smaller databases are faster, as they may fit into memory
- Cost effective
  - 80 cores, 20 HDs, 80 GB RAM vs 10 x (8 cores, 2 HDs, 8 GB RAM)
Theory
Challenges
Globally unique IDs
- unique across all shards
  - auto_increment_increment, auto_increment_offset
  - global IDs table
- not unique across shards
  - IDs in DBs: not unique
  - shard_number: unique
  - globally unique ID = shard_number + DB ID
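Both ID strategies can be sketched as follows. `interleaved_ids` mimics what the `auto_increment_offset` / `auto_increment_increment` settings achieve (each shard counts in its own arithmetic progression). The second pair of functions is one possible reading of "shard_number + DB ID": packing the shard number into the high bits of an integer. The 48-bit width and all function names are assumptions made for this example.

```python
from itertools import count

LOCAL_ID_BITS = 48  # assumed width reserved for the per-shard ID


def interleaved_ids(offset: int, increment: int):
    """IDs unique across shards: offset, offset + increment, offset + 2*increment, ..."""
    return count(start=offset, step=increment)


def make_global_id(shard_number: int, local_id: int) -> int:
    """Combine a unique shard number with a per-shard (non-unique) ID."""
    if not 0 <= local_id < (1 << LOCAL_ID_BITS):
        raise ValueError("local_id does not fit into LOCAL_ID_BITS")
    return (shard_number << LOCAL_ID_BITS) | local_id


def split_global_id(global_id: int) -> tuple[int, int]:
    """Recover (shard_number, local_id) from a combined ID."""
    return global_id >> LOCAL_ID_BITS, global_id & ((1 << LOCAL_ID_BITS) - 1)
```

A side effect of the combined form is that the ID itself says which shard holds the row, so no sharding function is needed to find it again.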
Challenges
Re-sharding
[Figure: IDs on three shards (1,4,7 | 2,5,8 | 3,6,9) and, after re-sharding, spread over more shards (1,6 | 2,7 | 3,8 | 4,9)]
- consistent hashing
- or more shards than machines/nodes (e.g. 100 shards on 10 machines)
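The cost of re-sharding with a plain modulo, and the "more shards than machines" alternative, can be compared with a small sketch. Everything here is illustrative: the 3 -> 4 shard change, the round-robin initial placement and the `add_machine` policy (take one virtual shard from each donor in turn) are assumptions, not the talk's method.

```python
VIRTUAL_SHARDS = 100  # fixed forever; only the shard -> machine table changes


def moved_keys(keys, old_count, new_count):
    """Keys whose shard changes when plain modulo goes from old_count to new_count."""
    return [k for k in keys if k % old_count != k % new_count]


def initial_assignment(machines):
    """Spread the virtual shards round-robin over the machines."""
    return {shard: shard % machines for shard in range(VIRTUAL_SHARDS)}


def add_machine(assignment, new_machine):
    """Hand a fair share of virtual shards to new_machine, one per donor in turn."""
    donors = sorted(set(assignment.values()))
    share = len(assignment) // (len(donors) + 1)
    updated = dict(assignment)
    for i in range(share):
        donor = donors[i % len(donors)]
        # Move the donor's lowest-numbered shard to the new machine.
        shard = min(s for s, m in updated.items() if m == donor)
        updated[shard] = new_machine
    return updated


before = initial_assignment(10)
after = add_machine(before, new_machine=10)
relocated = [s for s in before if before[s] != after[s]]
```

With IDs 1 to 9, `moved_keys(range(1, 10), 3, 4)` reports 7 of the 9 IDs changing shard. With virtual shards, adding an eleventh machine relocates 9 of the 100 shards, and the key -> shard function never changes.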
Challenges
Cross-shard queries
- sent to many shards
- collect the results into one
- avoidable (better sharding key, more sharding keys)
- joins
  - send query to many shards
  - join results in an application
  - sometimes unavoidable
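A minimal sketch of "send query to many shards, join results in an application", with shards modelled as in-memory lists of dictionaries. The function names and the sample rows in the usage note are hypothetical; a real system would issue SQL to each shard, possibly in parallel.

```python
def scatter_gather(shards, predicate):
    """Send the same filter to every shard and concatenate the partial results."""
    results = []
    for shard in shards:
        results.extend(row for row in shard if predicate(row))
    return results


def join_in_application(left_rows, right_rows, left_key, right_key):
    """Hash join done by the application instead of by a single database."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[left_key], [])
    ]
```

For example, with `message_shards = [[{"owner": 2, "message": "M1"}], [{"owner": 1, "message": "M2"}]]` and `users = [{"id": 1, "login": "John"}, {"id": 2, "login": "Bob"}]`, calling `join_in_application(scatter_gather(message_shards, lambda r: True), users, "owner", "id")` attaches a login to every message, at the price of one query per shard.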
Challenges
Network
- more machines, more smaller streams
- full mesh between web servers and shards
- pconnect vs. connect
Complexity
- usually sharding is done in application logic
Practice
Practice
Microblogging site
- see a user's messages
- see stream/wall
Classic architecture database web server(s) loadbalancer(s)
Practice
Data
User
id  login
1   John
2   Bob
3   Andy
4   Claire
5   Megan
Message
id  owner  message
1   2      M1
2   1      M2
3   2      M3
4   3      M4
5   2      M5
Follow
who  whose
1    2
3    4
3    2
1    3
5    2
2    1
1    5
4    3
4    1
John's messages? John's follows?
Practice
User
- no need for sharding
Message
- sharded by user (owner field)
- shard_number = owner % 2
Follow
- sharded by user (who field)
- shard_number = who % 2
2 shards, 3 machines
[Figure: three machines: User; shard0 (Message, Follow); shard1 (Message, Follow)]
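Applying the two rules above (`owner % 2` and `who % 2`) to the example rows can be sketched like this. The tuples are the Message and Follow rows from the deck's example data, as far as they can be recovered from the flattened slide; the `split` helper and the tuple layout are assumptions of this sketch.

```python
SHARDS_COUNT = 2

# (id, owner, message)
MESSAGES = [(1, 2, "M1"), (2, 1, "M2"), (3, 2, "M3"), (4, 3, "M4"), (5, 2, "M5")]
# (who, whose)
FOLLOWS = [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (2, 1), (1, 5), (4, 3), (4, 1)]


def split(rows, key_index):
    """Place each row on shard number row[key_index] % SHARDS_COUNT."""
    shards = [[] for _ in range(SHARDS_COUNT)]
    for row in rows:
        shards[row[key_index] % SHARDS_COUNT].append(row)
    return shards


message_shards = split(MESSAGES, key_index=1)  # sharded by owner
follow_shards = split(FOLLOWS, key_index=0)    # sharded by who
```

Because both tables use the same user id as the key, everything a given user wrote and everyone that user follows end up on the same shard.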
Practice
User (not sharded)
id  login
1   John
2   Bob
3   Andy
4   Claire
5   Megan
shard0
Message
id  owner  message
1   2      M1
3   2      M3
5   2      M5
Follow
who  whose
2    1
4    3
4    1
shard1
Message
id  owner  message
2   1      M2
4   3      M4
Follow
who  whose
1    2
3    4
3    2
1    3
5    2
1    5
mapping?
Practice
Bob's blog
Bob's messages
- find Bob's id in User table (id = 2)
- find Bob's shard (2 % 2 = 0, shard0)
- fetch Messages (shard0) where owner = 2
People Bob follows
- find Bob's id in User table (id = 2)
- find Bob's shard (2 % 2 = 0, shard0)
- fetch whose ids from Follow table (shard0)
- fetch people info from User table
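The two lookups can be traced in code against the example data. The dictionaries below hold the shard contents from the slides; the helper names are made up for this sketch, and a real application would run one SQL query per step instead of list comprehensions.

```python
USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}  # not sharded

# shard number -> rows of (id, owner, message)
MESSAGE_SHARDS = {
    0: [(1, 2, "M1"), (3, 2, "M3"), (5, 2, "M5")],
    1: [(2, 1, "M2"), (4, 3, "M4")],
}
# shard number -> rows of (who, whose)
FOLLOW_SHARDS = {
    0: [(2, 1), (4, 3), (4, 1)],
    1: [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
}


def user_id(login):
    """Step 1: find the user's id in the User table."""
    return next(uid for uid, name in USERS.items() if name == login)


def shard_of(uid):
    """Step 2: find the user's shard (id % 2)."""
    return uid % 2


def messages_of(login):
    """Step 3: fetch messages from that one shard where owner = id."""
    uid = user_id(login)
    return [msg for _, owner, msg in MESSAGE_SHARDS[shard_of(uid)] if owner == uid]


def followed_by(login):
    """Fetch whose ids from the user's shard, then people info from User."""
    uid = user_id(login)
    whose_ids = [whose for who, whose in FOLLOW_SHARDS[shard_of(uid)] if who == uid]
    return [USERS[w] for w in whose_ids]
```

Both functions touch exactly one shard, which is the payoff of sharding Message and Follow by the same user key.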
Practice
Who follows Andy?
- find Andy's id in User table (id = 3)
- find Andy's shard (3 % 2 = 1, shard1)
- hmmm
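The reason for the "hmmm": Follow is sharded by `who`, so rows with `whose = 3` can sit on any shard. A sketch of the resulting query over every shard is below, followed by one possible way around it (a second copy of Follow sharded by `whose`, in the spirit of "more sharding keys"). The second approach is an illustration chosen here, not something the slides prescribe, and all function names are hypothetical.

```python
USERS = {1: "John", 2: "Bob", 3: "Andy", 4: "Claire", 5: "Megan"}

# shard number -> rows of (who, whose), sharded by who % 2
FOLLOW_SHARDS = {
    0: [(2, 1), (4, 3), (4, 1)],
    1: [(1, 2), (3, 4), (3, 2), (1, 3), (5, 2), (1, 5)],
}


def followers_scatter_gather(login):
    """Ask every shard for rows with whose = id, then merge the answers."""
    uid = next(i for i, name in USERS.items() if name == login)
    who_ids = [
        who
        for shard in FOLLOW_SHARDS.values()
        for who, whose in shard
        if whose == uid
    ]
    return sorted(USERS[w] for w in who_ids)


def build_reverse_shards(shards_count=2):
    """Second copy of Follow, this time sharded by whose % shards_count."""
    reverse = {n: [] for n in range(shards_count)}
    for shard in FOLLOW_SHARDS.values():
        for who, whose in shard:
            reverse[whose % shards_count].append((who, whose))
    return reverse


def followers_single_shard(login, reverse):
    """With the reverse copy, only the shard whose % n needs to be queried."""
    uid = next(i for i, name in USERS.items() if name == login)
    rows = reverse[uid % len(reverse)]
    return sorted(USERS[who] for who, whose in rows if whose == uid)
```

The reverse copy turns the read into a single-shard query but doubles the writes for every follow, and the two copies have to be kept in sync.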
Practice
Cross-shard query!
Practice
Ideas?
Summary
Summary
Shard or not to shard
- many reads, few writes? - don't
- many writes and no capacity problems? - don't (use SSD)
- capacity problems? - yes
- many writes and capacity problems? - yes
- scale-up is affordable? - don't shard
As you see... it depends!
Summary
If you have to shard
- always use sharding + replication = RAID 10
  - sharding reduces high availability (like RAID 0)
- more shards than you need
  - e.g. 4 machines, 100 shards
  - or dynamic allocation
- think of network capacity (full mesh)
- load sharding (google it ;))
- sharding key: important!
- cross-shard queries
Wake Up!
