
Practical Postgres Sharding: Scaling to the Horizon

Author: Casey Duncan <casey dot duncan at gmail dot com>
Date: November 2, 2010

Pandora?
A little music service you might have heard of

How Big is it?


- In one day last week (Oct 2010):
  - >6M listeners listened to >200M tracks
  - 100k new listeners signed up
- 1.2 billion stations have been created
- >7 billion thumbs up/down

Where's the Postgres?


- Read-only music databases for music metadata
- Radio databases for listener data
- This talk is about the latter

Scaling challenges
- Data size
- Concurrency
- Write load

How Can We Scale?


- Vertical: bigger, badder db boxes
  - Monolithic database server
  - Non-linear price to power
  - Practical limits on just how big you can go
  - All or nothing failure modes
- Horizontal: add more moderately powerful commodity machines
  - Zerg!
  - Linear price per power
  - Keeps on scaling so long as...

Sharding?
- Who can tell me what sharding is?
- Think of it as increasing the "surface area" of your database
  - Divides data
  - Divides concurrent load
  - Divides writes
- Multi-dimensional performance gain: less data and fewer queries per node as you add hardware

What's the downside?


- Complexity: operational, architectural, and in software
- Increases the chance of failure, but decreases its scope
- Reporting complexity
  - Though with some work the cluster can be leveraged
- Keeps sysadmins busy; generally uses more rack space and power

Gotchas?
- Schema inconsistency
- Data inconsistency
- Single points of failure
- Load hot spots
- Focus points
- Load balancing
- Cross-shard joins
- Re-sharding to increase capacity

Practical basic architecture


- Shared nothing; only "static" data repeats across shards
- Shards do not talk to each other
- Schema is kept in sync by discipline and process
  - All schema upgrades are automated to prevent drift
  - All schema versions are numbered
  - A script to check schema consistency across shards is very useful (see the sketch after this list)
- Postgres version and config are the same across shards
- Hardware is the same for each shard (within practical limits)
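A minimal sketch of such a consistency check, assuming psycopg2 and a hypothetical list of shard DSNs; this is just one way to detect schema drift, not the script used in production:

```python
# Compare a fingerprint of each shard's schema; any mismatch means drift.
import hashlib

import psycopg2

# Hypothetical shard connection strings.
SHARDS = [
    "host=shard1 dbname=radio user=check",
    "host=shard2 dbname=radio user=check",
]

SCHEMA_QUERY = """
    SELECT table_name, column_name, data_type, is_nullable
    FROM information_schema.columns
    WHERE table_schema = 'public'
    ORDER BY table_name, ordinal_position
"""

def schema_fingerprint(dsn):
    """Return a stable hash of the shard's table/column definitions."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(SCHEMA_QUERY)
        rows = cur.fetchall()
    conn.close()
    return hashlib.sha1(repr(rows).encode("utf-8")).hexdigest()

fingerprints = {dsn: schema_fingerprint(dsn) for dsn in SHARDS}
if len(set(fingerprints.values())) > 1:
    for dsn, fp in sorted(fingerprints.items()):
        print(fp, dsn)
    raise SystemExit("Schema drift detected across shards")
print("All shards agree")
```

Run from cron alongside the numbered, automated schema upgrades so drift is caught quickly.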

Divide and Conquer


- Shard by a natural base key (e.g., user id) that keeps related data together in a shard as much as possible
  - Multiple base keys are sometimes useful/necessary
  - Allow a shard to be accessed by base key or a related foreign key
- Composite key structure
  - Allows foreign keys to be locally determined on a shard given a base key
  - Key -> shard mapping (hash function) lives outside of the shards themselves (see the sketch after this list)
  - Generated constraints keep data from landing on the wrong shard
- Non-shardable data resides on a "default" node
  - A remnant of the monolith
  - Same schema as the shards, though many tables will be empty
  - Contains shard configuration
    - Shard id => shard host, db name, port
    - Apps only need to know the default node's host, db name, and port
    - Facilitates "one node clusters" more easily for dev
  - Apps talk here to look up or generate base keys
    - Username/password -> user id base key
  - Ideally data size here is very small
  - Ideally only pkey/ukey lookups are permitted here
  - Beware: this can easily be a single point of failure or hot spot
- New users are balanced across shards
  - New users == heavy users
  - Makes re-sharding painful, but reduces hot spots in return
- Data stays evenly divided (mostly) with no movement at run-time
  - Moving data between shards at runtime is hazardous to your health
  - Re-sharding is very rare, so the trade is worth it
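A minimal sketch of the key -> shard mapping, the config lookup on the default node, and a generated constraint. The shard count, the shard_map table, and the listener table/column names are assumptions for illustration, not the production schema:

```python
import psycopg2

NUM_SHARDS = 16  # assumed number of logical shards

def shard_for(base_key):
    """Hash a base key (e.g., user id) to a shard id, outside the shards themselves."""
    return base_key % NUM_SHARDS

def shard_dsn(default_conn, shard_id):
    """Look up a shard's connection info from the config kept on the default node."""
    with default_conn.cursor() as cur:
        cur.execute(
            "SELECT host, dbname, port FROM shard_map WHERE shard_id = %s",
            (shard_id,),
        )
        host, dbname, port = cur.fetchone()
    return "host=%s dbname=%s port=%s" % (host, dbname, port)

def constraint_sql(shard_id):
    """Generate a CHECK constraint that rejects rows belonging to another shard."""
    template = (
        "ALTER TABLE listener ADD CONSTRAINT listener_shard_check "
        "CHECK (listener_id % {n} = {i})"
    )
    return template.format(n=NUM_SHARDS, i=shard_id)

default_conn = psycopg2.connect("host=default-db dbname=radio user=app_user")
print(shard_dsn(default_conn, shard_for(123456789)))
print(constraint_sql(3))
```

Because the mapping lives outside the shards, an app (or the default node) decides where a row belongs before any shard is contacted, and the per-shard constraint catches any row routed to the wrong place.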

Connection Management
- Well-connected network: many apps to many databases
- Per-app connection pooling is not ideal
  - Many wasted idle connections
  - Per-app connection pool shrinks as you add apps
  - Load imbalance at apps can cause pool contention; faster apps need bigger pools
- Server-side connection pooling is better: pgpool and pgBouncer (see the sketch below)
  - Looks like postgres and works with any client (language independent)
  - Can aggressively manage connections
    - Prevent long "idle in transaction" queries
    - Funnel multiple clients into one connection (pgBouncer)
  - Has some downsides
    - Complexity
    - Overhead (performance and memory)
    - Additional point of failure
    - "Blame pgpool" syndrome
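A minimal sketch of why server-side pooling is transparent to clients: the app connects exactly as it would to Postgres, just to the pooler's host/port. Host names, credentials, and the port in front of the shard are assumptions for illustration:

```python
import psycopg2

# Connecting through pgBouncer looks just like connecting to Postgres;
# only the host/port differ, so any client library works unchanged.
conn = psycopg2.connect(
    host="shard1-pool",  # pooler host, not the database host
    port=6432,           # pgBouncer's default listen port
    dbname="radio",
    user="app_user",
    password="secret",
)

with conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
conn.close()
```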

Redundancy
- Each shard needs to be replicated
  - Provides HA and real-time backup
  - Facilitates upgrades and maintenance
    - Clustering
    - PG upgrades
    - Re-sharding
- Use Slony or PG 9.x streaming replication
  - Secondaries are read-only
  - Data lags between primary and secondary (see the lag-check sketch below)
- Deployment options
  - Dedicated secondaries
  - Chained primary/secondary
- Slony puts some extra load on the primary
- Slony complicates schema upgrades
- Streaming replication is preferred if you can use it
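A minimal sketch of measuring primary-to-secondary data lag with a heartbeat row, which works for both Slony and streaming replication. The one-row heartbeat table and the DSNs are assumptions, and it presumes the two hosts' clocks are in sync:

```python
import psycopg2

PRIMARY_DSN = "host=shard1 dbname=radio user=monitor"
SECONDARY_DSN = "host=shard1-replica dbname=radio user=monitor"

def touch_heartbeat():
    """On the primary, record the current time in a one-row heartbeat table."""
    conn = psycopg2.connect(PRIMARY_DSN)
    with conn.cursor() as cur:
        cur.execute("UPDATE heartbeat SET touched_at = now() WHERE id = 1")
    conn.commit()
    conn.close()

def replication_lag():
    """On the secondary, see how stale the replicated heartbeat is."""
    conn = psycopg2.connect(SECONDARY_DSN)
    with conn.cursor() as cur:
        cur.execute("SELECT now() - touched_at FROM heartbeat WHERE id = 1")
        (lag,) = cur.fetchone()
    conn.close()
    return lag

touch_heartbeat()
print("replication lag:", replication_lag())
```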

Failover
- Failover to secondaries should be automatic
  - Counting timeouts connecting to the server is useful (see the sketch below)
- Manual failover is still needed for maintenance
- When failed over, service becomes read-only
  - Apps need to cope with this and still allow as much functionality as possible
  - Test this scenario continually!
- Pgpool provides transparent failover
  - Must run pgpool on separate hardware from the primaries
  - Pgpool becomes a single point of failure
  - Fails over everything at once, which is good and bad
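A minimal sketch of counting connection timeouts before failing over to a read-only secondary. The thresholds, DSNs, and the read-only flag are illustrative assumptions, not a drop-in implementation:

```python
import psycopg2

PRIMARY_DSN = "host=shard1 dbname=radio user=app_user connect_timeout=2"
SECONDARY_DSN = "host=shard1-replica dbname=radio user=app_user connect_timeout=2"
MAX_TIMEOUTS = 3  # consecutive failures before failing over

failures = 0

def get_connection():
    """Return (connection, read_only), preferring the primary."""
    global failures
    if failures < MAX_TIMEOUTS:
        try:
            conn = psycopg2.connect(PRIMARY_DSN)
            failures = 0
            return conn, False
        except psycopg2.OperationalError:
            failures += 1
    # Primary looks down: serve reads from the secondary and degrade
    # gracefully (no writes) until a human switches back.
    return psycopg2.connect(SECONDARY_DSN), True

conn, read_only = get_connection()
print("read-only mode" if read_only else "connected to primary")
```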

Failover Cont'd
- App-side failover
  - Flexible: the failover rules are your own
  - More dev work
  - Each client fails over individually, which is good and bad
  - Allows partial failover more easily, which is useful for app maintenance or emergencies
  - Allows redundant server-side pools
- Monitoring (see the sketch below)
  - Monitor connections by certain users to the secondaries
    - Use a specific db user for the apps
  - Also monitor the apps directly as a secondary warning
- Switching back
  - Auto-switchback is dangerous
  - Really needs to be signaled by a human
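A minimal sketch of detecting a failover by watching for app connections on a secondary, assuming the apps connect as a dedicated "app_user" role; the DSN and alerting are placeholders:

```python
import psycopg2

SECONDARY_DSN = "host=shard1-replica dbname=radio user=monitor"

def app_connections_on_secondary():
    """Count app connections on the secondary; any at all suggests a failover."""
    conn = psycopg2.connect(SECONDARY_DSN)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT count(*) FROM pg_stat_activity WHERE usename = %s",
            ("app_user",),
        )
        (count,) = cur.fetchone()
    conn.close()
    return count

count = app_connections_on_secondary()
if count > 0:
    print("ALERT: %d app connections on the secondary (failed over?)" % count)
```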

Multi-node Operations
- A necessary evil
- Looking up "friends"
  - Some join work must happen client-side
  - Friend graphs do not scale well with sharding
- Creating new accounts
  - Get the id from the "default" node
  - Write to the default node and the shard
  - Use two-phase commit so it's atomic (see the sketch below)
  - Monitor pg_prepared_xacts for orphaned transactions
  - Always finalize 2PC in the same order for easy admin: default first, then shards in order by number
- Minimize multi-node operations as much as possible; they break horizontal scalability
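A minimal sketch of creating an account with two-phase commit across the default node and one shard, using psycopg2's tpc_* API; the table names, DSNs, and transaction id are assumptions for illustration:

```python
import psycopg2

default_conn = psycopg2.connect("host=default-db dbname=radio user=app_user")
shard_conn = psycopg2.connect("host=shard3 dbname=radio user=app_user")

xid_default = default_conn.xid(0, "create-user-123456789", "default")
xid_shard = shard_conn.xid(0, "create-user-123456789", "shard3")

default_conn.tpc_begin(xid_default)
shard_conn.tpc_begin(xid_shard)
try:
    with default_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO account (user_id, username) VALUES (%s, %s)",
            (123456789, "newuser"),
        )
    with shard_conn.cursor() as cur:
        cur.execute("INSERT INTO listener (user_id) VALUES (%s)", (123456789,))

    # Prepare on both nodes, then commit in a fixed order (default first,
    # then shards by number) so leftovers in pg_prepared_xacts are easy to read.
    default_conn.tpc_prepare()
    shard_conn.tpc_prepare()
    default_conn.tpc_commit()
    shard_conn.tpc_commit()
except Exception:
    default_conn.tpc_rollback()
    shard_conn.tpc_rollback()
    raise

# Orphaned prepared transactions show up in pg_prepared_xacts:
#   SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
```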

Reporting
- Use secondaries if they are on dedicated hardware
  - But automatically abort on failover (watch for app connections)
- Query in parallel (see the sketch below)
  - Leverage the shard cluster hardware
  - Aggregate across shards
  - Requires client-side code
  - Some operations are more difficult than with a monolith
- Replicate to a monolithic reporting box
  - Complex, operationally difficult
    - Standard Slony/streaming replication does not support this
  - Prod failures cascade to reports
  - Pain grows as data, traffic, and cluster size grow
  - Queries are easier
  - Can be hard to get good performance
- Consider a map/reduce solution, e.g. Hadoop
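A minimal sketch of running the same aggregate on every shard in parallel and combining the results client-side; the shard DSNs, table, and query are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import psycopg2

SHARD_DSNS = [
    "host=shard1-replica dbname=radio user=report",
    "host=shard2-replica dbname=radio user=report",
]

QUERY = "SELECT count(*) FROM listener WHERE last_listened >= current_date - 7"

def count_on_shard(dsn):
    """Run the per-shard piece of the report and return its partial count."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute(QUERY)
        (count,) = cur.fetchone()
    conn.close()
    return count

# Fan out across the cluster, then reduce locally.
with ThreadPoolExecutor(max_workers=len(SHARD_DSNS)) as pool:
    total = sum(pool.map(count_on_shard, SHARD_DSNS))
print("active listeners in the last week:", total)
```

This leverages the shard hardware for the heavy scanning and keeps only the final aggregation on the client.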
