
Scaling the World’s Largest Django App

Jason Yan David Cramer


@jasonyan @zeeg

1
What is DISQUS?


dis·cuss • dĭ-skŭs'

We are a comment system with an emphasis on connecting communities

http://disqus.com/about/

3
What is Scale?

[Chart: Number of Visitors, axis from 50M to 300M]

Our traffic at a glance


17,000 requests/second peak
450,000 websites
15 million profiles
75 million comments
250 million visitors (August 2010)

4
Our Challenges

• We can’t predict when things will happen
• Random celebrity gossip
• Natural disasters
• Discussions never expire
• We can’t keep those millions of articles from 2008 in the cache
• You don’t know in advance (generally) where the traffic will be
• Especially with dynamic paging, realtime, sorting, personal prefs, etc.

5
Our Challenges (cont’d)

• High availability
• Not a destination site
• Difficult to schedule maintenance

6
Server Architecture

7
Server Architecture - Load Balancing
• Load Balancing
• Software, HAProxy
• High performance, intelligent server availability checking
• Bonus: Nice statistics reporting
• High Availability
• heartbeat

Image Source: http://haproxy.1wt.eu/


8
Server Architecture

• ~100 Servers
• 30% Web Servers (Apache + mod_wsgi)
• 10% Databases (PostgreSQL)
• 25% Cache Servers (memcached)
• 20% Load Balancing / High Availability
(HAProxy + heartbeat)
• 15% Utility Servers (Python scripts)

9
Server Architecture - Web Servers

• Apache 2.2
• mod_wsgi
• Using `maximum-requests` to plug memory leaks.

• Performance Monitoring
• Custom middleware (PerformanceLogMiddleware)
• Ships performance statistics (DB queries, external calls, template rendering, etc.) through syslog
• Collected and graphed through Ganglia
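
A rough sketch of the idea behind a middleware like PerformanceLogMiddleware, written as plain WSGI middleware so it runs without Django. The class name `TimingMiddleware`, the logger name and the log format are assumptions for illustration. It only times the request; counting DB queries and external calls, as the real middleware does, is not attempted here.

```python
import logging
import time

logger = logging.getLogger("perf")  # hypothetical logger name


class TimingMiddleware(object):
    """Wrap a WSGI app and log how long each call to it took."""

    def __init__(self, app, log=logger):
        self.app = app
        self.log = log

    def __call__(self, environ, start_response):
        started = time.time()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed_ms = (time.time() - started) * 1000.0
            # In production a SysLogHandler attached to this logger would
            # ship the line to syslog for collection and graphing.
            self.log.info("path=%s elapsed_ms=%.2f",
                          environ.get("PATH_INFO", ""), elapsed_ms)


def hello_app(environ, start_response):
    """Tiny demo application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


app = TimingMiddleware(hello_app)
```

Logging one compact line per request keeps the overhead low and leaves aggregation to whatever reads the log stream.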

10
Server Architecture - Database

• PostgreSQL
• Slony-I for Replication
• Trigger-based
• Read slaves for extra read capacity
• Failover master database for high
availability

11
Server Architecture - Database

• Make sure indexes fit in memory and measure I/O
• High I/O generally means slow queries due to missing indexes or indexes not in buffer cache
• Log Slow Queries
• syslog-ng + pgFouine + cron to automate slow query logging

12
Server Architecture - Database

• Use connection pooling
• Django doesn’t do this for you
• We use pgbouncer
• Limits the maximum number of connections your database needs to handle
• Saves on costly opening and tearing down of new database connections
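
One way this might look from the Django side: the application connects to pgbouncer instead of PostgreSQL, and pgbouncer keeps the real connections open. The host and database name below are assumed values; 6432 is pgbouncer's default listen port.

```python
# settings.py fragment (assumed values, for illustration only)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql_psycopg2",
        "NAME": "disqus",
        "HOST": "127.0.0.1",  # pgbouncer, not PostgreSQL itself
        "PORT": "6432",       # pgbouncer's default listen port
    },
}
```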

13
Our Data Model

14
Partitioning

• Fairly easy to implement, quick wins


• Done at the application level
• Data is replayed by Slony
• Two methods of data separation

15
Vertical Partitioning
Vertical partitioning involves creating tables with fewer columns
and using additional tables to store the remaining columns.

Forums Posts Users Sentry

http://en.wikipedia.org/wiki/Partition_(database)

16
Pythonic Joins

Allows us to separate datasets

posts = Post.objects.all()[0:25]

# store users in a dictionary based on primary key
users = dict(
    (u.pk, u) for u in
    User.objects.filter(pk__in=set(p.user_id for p in posts))
)

# map users to their posts
for p in posts:
    p._user_cache = users.get(p.user_id)

17
Pythonic Joins (cont’d)

• Slower than at database level


• But not enough that you should care
• Trading performance for scale
• Allows us to separate data
• Easy vertical partitioning
• More efficient caching
• get_many, object-per-row cache
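
A minimal sketch of what an object-per-row cache with `get_many` could look like. `DictCache` is an in-memory stand-in for a memcached client, and `fetch_by_pk`, the key format and `load_from_db` are hypothetical names, not part of the original code.

```python
class DictCache(object):
    """In-memory stand-in for a memcached client (illustration only)."""

    def __init__(self):
        self.data = {}

    def get_many(self, keys):
        return dict((k, self.data[k]) for k in keys if k in self.data)

    def set_many(self, mapping):
        self.data.update(mapping)


def fetch_by_pk(cache, load_from_db, prefix, pks):
    """Return {pk: obj}, hitting the database only for cache misses."""
    keys = dict(("%s:%s" % (prefix, pk), pk) for pk in pks)
    found = cache.get_many(list(keys))
    result = dict((keys[k], obj) for k, obj in found.items())

    missing = [pk for pk in pks if pk not in result]
    if missing:
        loaded = load_from_db(missing)  # one query for all misses
        cache.set_many(dict(("%s:%s" % (prefix, pk), obj)
                            for pk, obj in loaded.items()))
        result.update(loaded)
    return result
```

Because each row is cached under its own key, a user fetched for one page of posts is already warm when it shows up on another.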

18
Designating Masters

• Alleviates some of the write load on your primary application master
• Masters exist under specific conditions:
• application use case
• partitioned data
• Database routers make this (fairly) easy

19
Routing by Application

class ApplicationRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None

        app_label = instance._meta.app_label

        return get_application_alias(app_label)
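
The slides do not show `get_application_alias`. A plausible minimal version, assuming a static mapping from app label to database alias (the mapping contents are invented for illustration):

```python
# Hypothetical mapping from Django app label to database alias.
APPLICATION_DATABASES = {
    "sentry": "sentry",
    "users": "users",
}


def get_application_alias(app_label, default="default"):
    """Return the database alias that owns the given app's tables."""
    return APPLICATION_DATABASES.get(app_label, default)
```

The router itself is enabled by listing its dotted path in Django's `DATABASE_ROUTERS` setting.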

20
Horizontal Partitioning
Horizontal partitioning (also known as sharding) involves splitting
one set of data into different tables.

Disqus Your Blog CNN Telegraph

http://en.wikipedia.org/wiki/Partition_(database)

21
Horizontal Partitions

• Some forums have very large datasets


• Partners need high availability
• Helps scale the write load on the master
• We rely more on vertical partitions

22
Routing by Partition

class ForumPartitionRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None

        forum_id = getattr(instance, 'forum_id', None)
        if not forum_id:
            return None

        return get_forum_alias(forum_id)

# What we used to do
Post.objects.filter(forum=forum)

# Now, making sure hints are available
forum.post_set.all()
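
`get_forum_alias` is likewise not shown. One possible shape, assuming only a handful of large or partner forums get their own database and everything else stays on the default (forum ids and alias names are made up):

```python
# Hypothetical: forums that have been moved to a dedicated database.
DEDICATED_FORUMS = {
    1001: "forum_cnn",
    1002: "forum_telegraph",
}


def get_forum_alias(forum_id, default="default"):
    """Return the database alias holding this forum's partition."""
    return DEDICATED_FORUMS.get(forum_id, default)
```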

23
Optimizing QuerySets

• We really dislike raw SQL


• It creates more work when dealing with
partitions
• Built-in cache allows sub-slicing
• But isn’t always needed
• We removed this cache

24
Removing the Cache

• Django internally caches the results of your QuerySet


• This adds additional memory overhead

# 1 query
qs = Model.objects.all()[0:100]

# 0 queries (we don’t need this behavior)


qs = qs[0:10]

# 1 query
qs = qs.filter(foo=bar)

• Many times you only need to view a result set once


• So we built SkinnyQuerySet

25
Removing the Cache (cont’d)

Optimizing memory usage by removing the cache


from django.db.models.query import QuerySet

class QuerySetDoubleIteration(Exception):
    "A SkinnyQuerySet was iterated more than once."

class SkinnyQuerySet(QuerySet):
    def __iter__(self):
        if self._result_cache is not None:
            # __len__ must have been run
            return iter(self._result_cache)

        has_run = getattr(self, 'has_run', False)
        if has_run:
            raise QuerySetDoubleIteration("...")
        self.has_run = True
        # We wanted .iterator() as the default
        return self.iterator()

http://gist.github.com/550438

26
Atomic Updates

• Keeps your data consistent


• save() isn’t thread-safe
• use update() instead
• Great for things like counters
• But should be considered for all write
operations

27
Atomic Updates (cont’d)

Thread safety is impossible with .save()


Request 1

post = Post(pk=1)
# a moderator approves
post.approved = True
post.save()

Request 2

post = Post(pk=1)
# the author adjusts their message
post.message = 'Hello!'
post.save()

28
Atomic Updates (cont’d)

So we need atomic updates


Request 1

post = Post(pk=1)
# a moderator approves
Post.objects.filter(pk=post.pk).update(approved=True)

Request 2

post = Post(pk=1)
# the author adjusts their message
Post.objects.filter(pk=post.pk).update(message='Hello!')
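
To make the difference concrete, here is a toy simulation with no Django involved. A dict stands in for the database row, `save` mimics writing every field from a stale in-memory copy, and `update` mimics a column-level UPDATE. All names are illustrative.

```python
# A dict stands in for the database row; each "request" holds its own copy.
row = {"approved": False, "message": "Hi"}


def save(instance):
    """Like save() in the slides: writes every field from the copy."""
    row.update(instance)


def update(**fields):
    """Like QuerySet.update(): writes only the named columns."""
    row.update(fields)


# Both requests load the row before either one writes.
request_1 = dict(row)
request_2 = dict(row)

request_1["approved"] = True
save(request_1)
request_2["message"] = "Hello!"
save(request_2)  # stale approved=False overwrites the approval
lost = dict(row)

# Same interleaving with column-level updates.
row.update({"approved": False, "message": "Hi"})
update(approved=True)
update(message="Hello!")
kept = dict(row)
```

After the first interleaving the moderator's approval is gone; after the second both changes survive.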

29
Atomic Updates (cont’d)

A better way to approach updates


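# ExpressionNode lives here in the Django 1.x releases of this era
from django.db.models.expressions import ExpressionNode

def update(obj, using=None, **kwargs):
    """
    Updates specified attributes on the current instance.
    """
    assert obj.pk, "Instance has not yet been created."
    obj.__class__._base_manager.using(using)\
        .filter(pk=obj.pk)\
        .update(**kwargs)
    for k, v in kwargs.items():
        if isinstance(v, ExpressionNode):
            # F() expressions are evaluated by the database, so the
            # in-memory value is left untouched (not implemented)
            continue
        setattr(obj, k, v)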

http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py

30
Delayed Signals

• Queueing low priority tasks


• even if they’re fast
• Asynchronous (Delayed) signals
• very friendly to the developer
• ..but not as friendly as real signals

31
Delayed Signals (cont’d)

We send a specific serialized version of the model for delayed signals

from disqus.common.signals import delayed_save

def my_func(data, sender, created, **kwargs):
    # data is the serialized model, not a model instance
    print(data['id'])

delayed_save.connect(my_func, sender=Post)

This is all handled through our Queue
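
The queue itself is not shown in the slides. The sketch below is one guess at the general mechanism: at save time a serialized payload is enqueued, and a worker later delivers it to the connected receivers. The in-process `deque`, the JSON format and every function name here are assumptions, not the Disqus implementation.

```python
import json
from collections import deque

queue = deque()   # stand-in for the real job queue
receivers = {}    # sender name -> list of callables


def connect(func, sender):
    """Register a receiver for delayed signals from the given sender."""
    receivers.setdefault(sender, []).append(func)


def send_delayed(sender, instance_data, created):
    """Called at save time: enqueue a serialized payload and return."""
    queue.append(json.dumps(
        {"sender": sender, "data": instance_data, "created": created}))


def run_worker():
    """Called later by a worker process: deliver queued signals."""
    while queue:
        job = json.loads(queue.popleft())
        for func in receivers.get(job["sender"], []):
            func(data=job["data"], sender=job["sender"],
                 created=job["created"])
```

Receivers get plain data rather than a live model instance, which is why the handler above reads `data['id']`.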

32
Caching

• Memcached
• Use pylibmc (newer libMemcached-based)
• Ticket #11675 (add pylibmc support)
• Third party applications:
• django-newcache, django-pylibmc

33
Caching (cont’d)

• libMemcached / pylibmc is configurable with “behaviors”.
• Memcached “single point of failure”
• Distributed system, but we must take precautions.
• Connection timeout to memcached can stall requests.
• Use `_auto_eject_hosts` and `_retry_timeout` behaviors to prevent reconnecting to dead caches.

34
Caching (cont’d)

• Default (naive) hashing behavior
• The hashed cache key, modulo the number of servers, gives the index into the server list.
• Removing a server causes the majority of cache keys to be remapped to new servers.

CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']

key = 'my_cache_key'
cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
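
A small experiment showing how badly modulo hashing behaves when a server disappears. It uses MD5 rather than the built-in `hash()` so the result is stable across processes; the server addresses and key names are made up.

```python
import hashlib


def stable_hash(key):
    """Deterministic hash (the built-in hash() can vary between runs)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def pick_server(key, servers):
    return servers[stable_hash(key) % len(servers)]


def fraction_moved(keys, before, after):
    """Share of keys whose server changes between two server lists."""
    moved = sum(1 for k in keys
                if pick_server(k, before) != pick_server(k, after))
    return moved / float(len(keys))


keys = ["key:%d" % i for i in range(10000)]
four = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
three = four[:3]  # the last server is removed
moved = fraction_moved(keys, four, three)
```

Going from four servers to three, a key keeps its server only when its hash gives the same remainder modulo 4 and modulo 3, so roughly three quarters of the keys move.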

35
Caching (cont’d)

• Better approach: consistent hashing
• libMemcached (pylibmc) uses libketama (http://tinyurl.com/lastfm-libketama)
• Addition / removal of a cache server remaps only about K/n cache keys (where K = number of keys and n = number of servers)

Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
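
A toy hash ring illustrating the consistent hashing idea. This is not libketama's exact algorithm; the number of points per server, the MD5 hash and the server addresses are arbitrary choices for the sketch.

```python
import bisect
import hashlib


def stable_hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing(object):
    """Toy consistent-hash ring with several points per server."""

    def __init__(self, servers, points_per_server=100):
        self.ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server))
        self.hashes = [h for h, _ in self.ring]

    def get(self, key):
        # First point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.hashes, stable_hash(key)) % len(self.ring)
        return self.ring[idx][1]


keys = ["key:%d" % i for i in range(10000)]
before = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"])
after = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
moved_keys = [k for k in keys if before.get(k) != after.get(k)]
moved = len(moved_keys) / float(len(keys))
```

Removing a server only relocates the keys that lived on it, about a quarter of them here, while every other key stays where it was.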

36
Caching (cont’d)

• Thundering herd (stampede) problem


• Invalidating a heavily accessed cache key causes many clients to refill the cache at once.
• Everyone refetching from the data store or reprocessing data to fill the cache can make things even slower.
• Most of the time, it’s ideal to return the previously invalidated cache value and let a single client refill the cache.
• django-newcache or MintCache (http://djangosnippets.org/snippets/793/) will do this for you.
• Refilling the cache on invalidation, instead of deleting the key, also helps prevent the thundering herd problem.
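
The "serve stale, let one client refill" idea can be sketched as below. This is a toy in-process cache written for illustration, not the code of MintCache or django-newcache; the class name, the `grace` parameter and the `(value, must_refill)` return shape are assumptions.

```python
import time


class StaleWhileRefillCache(object):
    """Toy cache: soft expiry plus a grace window for a single refiller."""

    def __init__(self, grace=30):
        self.store = {}  # key -> (value, soft_expiry)
        self.grace = grace

    def set(self, key, value, timeout, now=None):
        now = time.time() if now is None else now
        self.store[key] = (value, now + timeout)

    def get(self, key, now=None):
        """Return (value, must_refill)."""
        now = time.time() if now is None else now
        if key not in self.store:
            return None, True
        value, soft_expiry = self.store[key]
        if now < soft_expiry:
            return value, False
        # Soft-expired: push the expiry out so later callers keep getting
        # the stale value, and tell only this caller to refill.
        self.store[key] = (value, now + self.grace)
        return value, True
```

With a real memcached backend the entry would be stored with a hard timeout longer than the soft expiry, so the stale value is still available during the refill.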

37
Transactions

• TransactionMiddleware got us started, but down the road became a burden
• For postgresql_psycopg2, there’s a database option, OPTIONS['autocommit']
• Each query is in its own transaction. This means each request won’t start in a transaction.
• But sometimes we want transactions (e.g., saving multiple objects and rolling back on error)

38
Transactions (cont’d)

• Tips:
• Use autocommit for read slave databases.
• Isolate slow functions (e.g., external calls,
template rendering) from transactions.
• Selective autocommit
• Most read-only views don’t need to be
in transactions.
• Start in autocommit and switch to a
transaction on write.

39
Scaling the Team

• Small team of engineers


• Monthly users / developers = 40m
• Which means writing tests..
• ..and having a dead simple workflow

40
Keeping it Simple

• A developer can be up and running in a few minutes
• assuming postgres and other server applications are already installed
• pip, virtualenv
• settings.py

41
Setting Up Local

1. createdb -E UTF-8 disqus


2. git clone git://repo
3. mkvirtualenv disqus
4. pip install -U -r requirements.txt
5. ./manage.py syncdb && ./manage.py migrate

42
Sane Defaults

settings.py
from disqus.conf.settings.default import *

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Can't find 'local_settings.py'\n")
    sys.stderr.write("\nThe exception was:\n\n")
    traceback.print_exc()

local_settings.py
from disqus.conf.settings.dev import *

43
Continuous Integration

• Daily deploys with Fabric


• several times an hour on some days
• Hudson keeps our builds going
• combined with Selenium
• Post-commit hooks for quick testing
• like Pyflakes
• Reverting to a previous version is a matter of
seconds

44
Continuous Integration (cont’d)

Hudson makes integration easy

45
Testing

• It’s not fun breaking things when you’re the new guy
• Our testing process is fairly heavy
• 70k (Python) LOC, 73% coverage, 20 min suite
• Custom Test Runner (unittest)
• We needed XML, Selenium, Query Counts
• Database proxies (for read-slave testing)
• Integration with our Queue

46
Testing (cont’d)

Query Counts
# failures yield a dump of queries
def test_read_slave(self):
    Model.objects.using('read_slave').count()
    self.assertQueryCount(1, 'read_slave')

Selenium
def test_button(self):
    self.selenium.click('//a[@class="dsq-button"]')

Queue Integration
class WorkerTest(DisqusTest):
    workers = ['fire_signal']

    def test_delayed_signal(self):
        ...

47
Bug Tracking

• Switched from Trac to Redmine


• We wanted Subtasks
• Emailing exceptions is a bad idea
• Even if it’s localhost
• Previously using django-db-log to aggregate
errors to a single point
• We’ve overhauled db log and are releasing
Sentry

48
django-sentry

Groups messages intelligently

http://github.com/dcramer/django-sentry

49
django-sentry (cont’d)

Similar feel to Django’s debugger

http://github.com/dcramer/django-sentry

50
Feature Switches

• We needed a safety in case a feature wasn’t performing well at peak
• it had to respond without delay, globally, and without writing to disk
• Allows us to work out of trunk (mostly)
• Easy to release new features to a portion of
your audience
• Also nice for “Labs” type projects
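
A minimal sketch of a percentage-based switch of the kind described above. The switch names, the in-module `SWITCHES` table and the CRC32 bucketing are all assumptions; the actual Disqus implementation is linked in the references and is not reproduced here.

```python
import zlib

# Hypothetical switch table: name -> percentage of users who see it.
# In production this would live in a shared cache so it can change
# globally without a deploy or a disk write.
SWITCHES = {"realtime": 100, "new_sorting": 10, "labs_widget": 0}


def is_active(name, user_id):
    """Return True if the switch is on for this user."""
    percent = SWITCHES.get(name, 0)  # unknown switches are off
    # CRC32 gives each (switch, user) pair a stable bucket from 0 to 99.
    token = ("%s:%s" % (name, user_id)).encode("utf-8")
    bucket = zlib.crc32(token) % 100
    return bucket < percent
```

Bucketing on a hash of the user id keeps the experience stable: a given user stays in or out of a partial rollout across requests.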

51
Feature Switches (cont’d)

52
Final Thoughts

• The language (usually) isn’t your problem


• We like Django
• But we maintain local patches
• Some tickets don’t have enough of a following
• Patches, like #17, completely change Django..
• ..arguably in a good way
• Others don’t have champions
Ticket #17 describes making the ORM an identity mapper

53
Housekeeping

Birds of a Feather
Want to learn from others about
performance and scaling problems?
Or play some StarCraft 2?

We’re Hiring!

DISQUS is looking for amazing engineers

54
Questions

55
References

django-sentry
http://github.com/dcramer/django-sentry

Our Feature Switches


http://cl.ly/2FYt

Andy McCurdy’s update()


http://github.com/andymccurdy/django-tips-and-tricks

Our PyFlakes Fork


http://github.com/dcramer/pyflakes

SkinnyQuerySet
http://gist.github.com/550438

django-newcache
http://github.com/ericflo/django-newcache

attach_foreignkey (Pythonic Joins)


http://gist.github.com/567356

56
