Author: Casey Duncan <casey dot duncan at gmail dot com>
Date: November 2, 2010
Pandora?
A little music service you might have heard of
Scaling challenges
- Data size
- Concurrency
- Write load
Sharding?
- Who can tell me what sharding is?
- Think of it as increasing the "surface area" of your database
- Divides data
- Divides concurrent load
- Divides writes
- Multi-dimensional performance gain: less data and fewer queries per node, spread across more hardware
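The "divides" bullets can be sketched as a simple key-to-shard mapping. This is a hypothetical modulo scheme for illustration; the shard count and function names are assumptions, not Pandora's actual implementation:

```python
# Minimal sketch of shard routing: map a stable base key to a shard id.
# NUM_SHARDS and the modulo scheme are hypothetical, for illustration only.

NUM_SHARDS = 16  # assumed cluster size

def shard_for(base_key: int) -> int:
    """Map a user's base key to the shard that owns all of that user's rows."""
    return base_key % NUM_SHARDS

# Every query for a given user touches only that user's shard, so each
# node sees less data and fewer queries as more hardware is added.
```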
Gotchas?
- Schema inconsistency
- Data inconsistency
- Single points of failure
- Load hot spots

Focus points:
- Load balancing
- Cross-shard joins
- Re-sharding to increase capacity
- shard id => shard host, db name, port
- Apps only need to know the default host, db name, and port
- Facilitates "one node clusters" more easily for dev

- Apps talk to the "default" database to look up or generate base keys
- Username/password -> user id base key
- Ideally the data here is very small
- Ideally only pkey/ukey lookups are permitted here
- Beware: this can easily be a single point of failure or a hot spot

- New users are balanced across shards
- New users == heavy users
- Makes re-sharding painful, but reduces hot spots in return
- Data stays evenly divided (mostly) with no movement at run-time
- Moving data between shards at runtime is hazardous to your health
- Re-sharding is very rare, so the trade-off is worth it
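A sketch of the shard map and base-key lookup described above. The hostnames, the number of shard bits, and the idea of packing the shard id into the key's low bits are all assumptions for illustration, not details from the talk:

```python
# Hedged sketch: shard id maps to (host, db name, port), and a base key
# carries its own shard id so any app can route without extra lookups.

SHARD_MAP = {
    0: ("db0.example.com", "shard0", 5432),  # hypothetical hosts
    1: ("db1.example.com", "shard1", 5432),
}

SHARD_BITS = 8  # assumed key layout: low bits hold the shard id

def make_base_key(seq: int, shard_id: int) -> int:
    """Generated on the default db: pack the shard id into a sequence value."""
    return (seq << SHARD_BITS) | shard_id

def shard_of(base_key: int) -> int:
    """Recover the owning shard from any base key."""
    return base_key & ((1 << SHARD_BITS) - 1)

def connect_info(base_key: int):
    """Where to connect for this key's data."""
    return SHARD_MAP[shard_of(base_key)]
```

For dev, pointing every shard id at one host in `SHARD_MAP` gives the "one node cluster" the slide mentions.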
Connection Management
- Well-connected network: many apps talking to many databases
- Per-app connection pooling is not ideal
  - Many wasted idle connections
  - Per-app connection pools shrink as you add apps
  - Load imbalance at the apps can cause pool contention
  - Faster apps need bigger pools
- Server-side connection pooling is better
  - pgpool and pgBouncer
  - Looks like Postgres and works with any client (language independent)
  - Can aggressively manage connections
    - Prevent long "idle in transaction" queries
    - Funnel multiple clients into one connection (pgBouncer)
  - Has some downsides
    - Complexity
    - Overhead (performance and memory)
    - Additional point of failure
    - "Blame pgpool" syndrome
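A hedged pgBouncer configuration sketch showing transaction pooling, the mode that lets it funnel multiple client connections into one server connection. Hostnames, paths, and pool sizes here are placeholder values, not recommendations from the talk:

```ini
; /etc/pgbouncer/pgbouncer.ini -- illustrative sketch only
[databases]
; one entry per shard; host/db names are placeholders
shard0 = host=db0.example.com port=5432 dbname=shard0
shard1 = host=db1.example.com port=5432 dbname=shard1

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling multiplexes many clients onto few server conns
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
; reap idle server connections aggressively
server_idle_timeout = 60
```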
Redundancy
- Each shard needs to be replicated
  - Provides HA and real-time backup
  - Facilitates upgrades and maintenance: clustering, PG upgrades, re-sharding
- Use Slony or PG 9.x streaming replication
  - Secondaries are read-only
  - Data lags between primary and secondary
- Deployment options
  - Dedicated secondaries
  - Chained primary/secondary
- Slony puts some extra load on the primary
- Slony complicates schema upgrades
- Streaming replication is preferred if you can use it
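For the streaming-replication option, a PG 9.x setup sketch per shard. Hostnames, user names, and retention values are placeholders, and file locations vary by install:

```ini
; --- primary: postgresql.conf (illustrative values) ---
wal_level = hot_standby
max_wal_senders = 3
wal_keep_segments = 128

; --- primary: pg_hba.conf (allow the standby to stream WAL) ---
; host  replication  replica  192.168.0.0/24  md5

; --- standby: postgresql.conf ---
hot_standby = on          ; allow read-only queries on the secondary

; --- standby: recovery.conf ---
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replica'
```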
Failover
- Failover to secondaries should be automatic
  - Counting timeouts when connecting to a server is a useful trigger
- Manual failover is still needed for maintenance
- When failed over, the service becomes read-only
  - Apps need to cope with this and still allow as much functionality as possible
  - Test this scenario continually!
- pgpool provides transparent failover
  - Must run pgpool on separate hardware from the primaries
  - pgpool becomes a single point of failure
  - Fails over everything at once, which is both good and bad
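The timeout-counting trigger can be sketched app-side like this. The threshold, the class, and the stand-in connect function are hypothetical, not from the talk:

```python
# Sketch of app-side failover driven by counted connection timeouts.
# try_connect(host) stands in for a real database connect call that
# raises TimeoutError on failure; the threshold is an assumed policy.

FAILURE_THRESHOLD = 3

class ShardConnector:
    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.timeouts = 0
        self.read_only = False  # set once we fail over to the secondary

    def connect(self, try_connect):
        if not self.read_only:
            try:
                conn = try_connect(self.primary)
                self.timeouts = 0  # healthy again; reset the counter
                return conn
            except TimeoutError:
                self.timeouts += 1
                if self.timeouts >= FAILURE_THRESHOLD:
                    self.read_only = True  # secondaries are read-only
        return try_connect(self.secondary)

# Note: switching back to the primary is deliberately NOT automatic;
# per the talk, a human should signal the switchback.
```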
Failover Cont'd
- App-side failover
  - Flexible: the failover rules are your own
  - More dev work
  - Each client fails over individually, which is both good and bad
  - Allows partial failover more easily, useful for app maintenance or emergencies
  - Allows redundant server-side pools
- Monitoring
  - Monitor connections to the secondaries by specific users
  - Use a dedicated db user for the apps
  - Also monitor the apps directly as a secondary warning
- Switching back
  - Auto-switchback is dangerous
  - It really needs to be signaled by a human
Multi-node Operations
- A necessary evil
- Looking up "friends"
  - Some join work must happen client-side
  - Friend graphs do not scale well with sharding
- Creating new accounts
  - Get the id from the "default" database
  - Write to both the default and the shard
  - Use two-phase commit so the write is atomic
  - Monitor pg_prepared_xacts for orphaned transactions
  - Always finalize 2PC in the same order for easy admin: default first, then shards in order by number
- Minimize multi-node operations as much as possible
  - They break horizontal scalability
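The account-creation flow can be sketched with PostgreSQL's two-phase commit statements and the "default first, then shards by number" finalize order. The node names and the `execute` stand-in are hypothetical; only the SQL statements and the ordering rule come from the slide:

```python
# Sketch of new-account creation across the "default" db and a shard
# using PostgreSQL two-phase commit. execute(node, sql) stands in for
# running SQL on that node after the account rows have been written.

def order_key(node):
    # Canonical finalize order: "default" first, then shards by number.
    if node == "default":
        return -1
    return int(node.removeprefix("shard"))

def create_account(nodes, execute, gid="acct-42"):
    # Phase 1: each node's open transaction is parked durably.
    for node in nodes:
        execute(node, f"PREPARE TRANSACTION '{gid}'")
    # Phase 2: finalize in the canonical order so any orphans left in
    # pg_prepared_xacts after a crash are easy to reason about.
    ordered = sorted(nodes, key=order_key)
    for node in ordered:
        execute(node, f"COMMIT PREPARED '{gid}'")
    return ordered
```

A crash between the phases leaves prepared transactions behind; they show up in `pg_prepared_xacts` and must be committed or rolled back by hand or by a janitor process.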
Reporting
- Use secondaries if they are on dedicated hardware
  - But automatically abort on failover (watch for app connections)
- Query in parallel
  - Leverage the shard cluster hardware
  - Aggregate across shards
  - Requires client-side code
  - Some operations are more difficult than with a monolithic database
- Replicate to a monolithic reporting box
  - Complex and operationally difficult
  - Standard Slony/streaming replication does not support this
  - Prod failures cascade to reports
  - Pain grows as data, traffic, and cluster size grow
  - Queries are easier, but it can be hard to get good performance
- Consider a map/reduce solution, e.g. Hadoop
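The parallel fan-out-and-aggregate pattern can be sketched as follows; `run_query` and the thread-pool sizing are illustrative assumptions standing in for real per-shard queries:

```python
# Sketch of parallel fan-out reporting: run the same aggregate on every
# shard concurrently, then combine the results client-side.

from concurrent.futures import ThreadPoolExecutor

def aggregate_count(shards, run_query):
    """Sum a per-shard COUNT(*)-style result across the whole cluster."""
    with ThreadPoolExecutor(max_workers=max(len(shards), 1)) as pool:
        return sum(pool.map(run_query, shards))
```

For sums and counts the combine step is trivial; averages, distinct counts, and cross-shard joins need more client-side work, which is where a map/reduce tool like Hadoop starts to pay off.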