DjangoCon 2010 Scaling Disqus

Scaling the World’s Largest Django App
Jason Yan David Cramer

@jasonyan @zeeg
1
What is DISQUS?
2
What is DISQUS?
dis·cuss • dĭ-skŭs'
We are a comment system with an emphasis on connecting communities
http://disqus.com/about/
3
What is Scale?
[Chart: Number of Visitors, y-axis from 50M to 300M]
Our traffic at a glance

17,000 requests/second peak
450,000 websites
15 million profiles
75 million comments
250 million visitors (August 2010)
4
Our Challenges
• We can’t predict when things will happen

• Random celebrity gossip
• Natural disasters
• Discussions never expire
• We can’t keep those millions of articles from
2008 in the cache
• You don’t know in advance (generally) where the
traffic will be
• Especially with dynamic paging, realtime, sorting,
personal prefs, etc.
5
Our Challenges (cont’d)
• High availability
• Not a destination site
• Difficult to schedule maintenance
6
Server Architecture
7
Server Architecture - Load Balancing
• Load Balancing
• Software, HAProxy
• High performance, intelligent server availability checking
• Bonus: Nice statistics reporting
• High Availability
• heartbeat
Image Source: http://haproxy.1wt.eu/

8
Server Architecture
• ~100 Servers
• 30% Web Servers (Apache + mod_wsgi)
• 10% Databases (PostgreSQL)
• 25% Cache Servers (memcached)
• 20% Load Balancing / High Availability
(HAProxy + heartbeat)
• 15% Utility Servers (Python scripts)
9
Server Architecture - Web Servers
• Apache 2.2
• mod_wsgi
• Using `maximum-requests` to
plug memory leaks.
• Performance Monitoring
• Custom middleware
(PerformanceLogMiddleware)
• Ships performance statistics
(DB queries, external calls,
template rendering, etc) through
syslog
• Collected and graphed through
Ganglia
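A minimal sketch of the timing idea behind such a middleware, written as a plain WSGI wrapper rather than Django middleware. The class name `TimingMiddleware`, the logger name `perf`, and the log line format are assumptions for illustration; the real PerformanceLogMiddleware also records DB queries, external calls, and template rendering, which this sketch does not.

```python
import logging
import time

# Hypothetical logger name. To ship records to syslog, attach
# logging.handlers.SysLogHandler to this logger in your logging setup.
logger = logging.getLogger("perf")


class TimingMiddleware:
    """WSGI wrapper that logs how long the wrapped app took per request."""

    def __init__(self, app, log=logger):
        self.app = app
        self.log = log

    def __call__(self, environ, start_response):
        started = time.perf_counter()
        try:
            return self.app(environ, start_response)
        finally:
            # Measures the time until the app returns its response iterable,
            # not the time spent streaming the body to the client.
            elapsed_ms = (time.perf_counter() - started) * 1000.0
            self.log.info("path=%s total_ms=%.1f",
                          environ.get("PATH_INFO", ""), elapsed_ms)
```

Logging one line per request keeps the request path cheap; aggregation and graphing happen downstream.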
10
Server Architecture - Database
• PostgreSQL
• Slony-I for Replication
• Trigger-based
• Read slaves for extra read capacity
• Failover master database for high
availability
11
• Make sure indexes fit in memory and measure I/O
• High I/O generally means slow queries
due to missing indexes or indexes not in
buffer cache
• Log Slow Queries
• syslog-ng + pgFouine + cron to automate
slow query logging
12
• Use connection pooling

• Django doesn’t do this for you
• We use pgbouncer
• Limits the maximum number of
connections your database needs to
handle
• Save on costly opening and tearing down
of new database connections
13
Our Data Model
14
Partitioning
• Fairly easy to implement, quick wins

• Done at the application level
• Data is replayed by Slony
• Two methods of data separation
15
Vertical Partitioning
Vertical partitioning involves creating tables with fewer columns
and using additional tables to store the remaining columns.
Forums Posts Users Sentry
http://en.wikipedia.org/wiki/Partition_(database)
16
Pythonic Joins
Allows us to separate datasets
posts = Post.objects.all()[0:25]

# store users in a dictionary based on primary key
users = dict(
    (u.pk, u) for u in
    User.objects.filter(pk__in=set(p.user_id for p in posts))
)

# map users to their posts
for p in posts:
    p._user_cache = users.get(p.user_id)
17
Pythonic Joins (cont’d)
• Slower than at database level

• But not enough that you should care
• Trading performance for scale
• Allows us to separate data
• Easy vertical partitioning
• More efficient caching
• get_many, object-per-row cache
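A Django-free sketch of how a Pythonic join pairs with an object-per-row cache. The `User` and `Post` dataclasses, the dict used as a cache, and the `fetch_users` callable are all stand-ins invented for illustration; with memcached the dict lookup would be a single `get_many` call and `fetch_users` a single `pk__in` query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    pk: int
    name: str


@dataclass
class Post:
    pk: int
    user_id: int
    user: Optional[User] = None


def attach_users(posts, cache, fetch_users):
    """Attach a User to each post: one cache lookup, at most one query."""
    wanted = {p.user_id for p in posts}
    # get_many-style lookup: every needed key is requested together
    found = {pk: cache[pk] for pk in wanted if pk in cache}
    missing = wanted - set(found)
    if missing:
        # one "WHERE pk IN (...)" style fetch covers all cache misses
        for user in fetch_users(missing):
            found[user.pk] = user
            cache[user.pk] = user  # one cache entry per row
    for p in posts:
        p.user = found.get(p.user_id)
    return posts
```

Because users are cached one row per key, the same entries serve every page that shows those users, whichever posts are listed.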
18
Designating Masters
• Alleviates some of the write load on your primary application master
• Masters exist under specific conditions:
• application use case
• partitioned data
• Database routers make this (fairly) easy
19
Routing by Application
class ApplicationRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        app_label = instance._meta.app_label
        return get_application_alias(app_label)
20
Horizontal Partitioning
Horizontal partitioning (also known as sharding) involves splitting
one set of data into different tables.
Disqus Your Blog CNN Telegraph
http://en.wikipedia.org/wiki/Partition_(database)
21
Horizontal Partitions
• Some forums have very large datasets

• Partners need high availability
• Helps scale the write load on the master
• We rely more on vertical partitions
22
Routing by Partition
class ForumPartitionRouter(object):
    def db_for_read(self, model, **hints):
        instance = hints.get('instance')
        if not instance:
            return None
        forum_id = getattr(instance, 'forum_id', None)
        if not forum_id:
            return None
        return get_forum_alias(forum_id)

# What we used to do
Post.objects.filter(forum=forum)

# Now, making sure hints are available
forum.post_set.all()
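A sketch of a partition router that answers for both reads and writes. The `FORUM_PARTITIONS` mapping, the alias names, and the class name are hypothetical; `db_for_read` and `db_for_write` are the standard Django router hooks, and returning `None` tells Django to ask the next router.

```python
# Hypothetical mapping of forum ids to database aliases. Forums that are
# not listed fall through to the next router (and finally to "default").
FORUM_PARTITIONS = {42: "forum_cnn", 77: "forum_telegraph"}


class SketchPartitionRouter:
    def _alias_for(self, hints):
        instance = hints.get("instance")
        forum_id = getattr(instance, "forum_id", None)
        if forum_id is None:
            return None  # no opinion without a hint
        return FORUM_PARTITIONS.get(forum_id)

    def db_for_read(self, model, **hints):
        return self._alias_for(hints)

    def db_for_write(self, model, **hints):
        return self._alias_for(hints)
```

Routing only works when a hint is present, which is why queries need to start from an instance rather than from the bare manager.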
23
Optimizing QuerySets
• We really dislike raw SQL

• It creates more work when dealing with
partitions
• Built-in cache allows sub-slicing
• But isn’t always needed
• We removed this cache
24
Removing the Cache
• Django internally caches the results of your QuerySet

• This adds additional memory overhead
# 1 query
qs = Model.objects.all()[0:100]
# 0 queries (we don’t need this behavior)

qs = qs[0:10]
# 1 query
qs = qs.filter(foo=bar)
• Many times you only need to view a result set once

• So we built SkinnyQuerySet
25
Removing the Cache (cont’d)
Optimizing memory usage by removing the cache

class SkinnyQuerySet(QuerySet):
    def __iter__(self):
        if self._result_cache is not None:
            # __len__ must have been run
            return iter(self._result_cache)
        has_run = getattr(self, 'has_run', False)
        if has_run:
            raise QuerySetDoubleIteration("...")
        self.has_run = True
        # We wanted .iterator() as the default
        return self.iterator()
http://gist.github.com/550438
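The same single-use idea can be sketched without Django. `SkinnyResults`, `DoubleIteration`, and the `fetch_rows` callable are illustrative names, not part of the gist: rows are streamed once and never stored, and a second iteration fails loudly instead of silently re-running the query.

```python
class DoubleIteration(Exception):
    """Raised when a single-use result set is iterated a second time."""


class SkinnyResults:
    def __init__(self, fetch_rows):
        self._fetch_rows = fetch_rows  # callable returning a fresh row iterator
        self._has_run = False

    def __iter__(self):
        if self._has_run:
            raise DoubleIteration("results already consumed; re-run the query")
        self._has_run = True
        # stream rows instead of materialising them in a list
        return iter(self._fetch_rows())
```

Raising on the second pass turns an accidental double query into an error that shows up in tests.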
26
Atomic Updates
• Keeps your data consistent

• save() isn’t thread-safe
• use update() instead
• Great for things like counters
• But should be considered for all write
operations
27
Atomic Updates (cont’d)
Thread safety is impossible with .save()

Request 1
post = Post(pk=1)
# a moderator approves
post.approved = True
post.save()
Request 2
post = Post(pk=1)
# the author adjusts their message
post.message = 'Hello!'
post.save()
28
So we need atomic updates

Request 1
post = Post(pk=1)
# a moderator approves
Post.objects.filter(pk=post.pk)\
.update(approved=True)
Request 2
post = Post(pk=1)
# the author adjusts their message
Post.objects.filter(pk=post.pk)\
    .update(message='Hello!')
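The lost update can be reproduced with the standard library's sqlite3 module standing in for PostgreSQL and the ORM. The `load`, `save`, and `update` helpers are hypothetical: `save` writes every column from a possibly stale in-memory copy, while `update` writes only the columns it is given.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post (id INTEGER PRIMARY KEY, approved INTEGER, message TEXT)")
conn.execute("INSERT INTO post VALUES (1, 0, 'Hi')")


def load(pk):
    row = conn.execute(
        "SELECT id, approved, message FROM post WHERE id = ?", (pk,)).fetchone()
    return {"id": row[0], "approved": row[1], "message": row[2]}


def save(obj):
    # like Model.save(): writes every column from the in-memory copy
    conn.execute("UPDATE post SET approved = ?, message = ? WHERE id = ?",
                 (obj["approved"], obj["message"], obj["id"]))


def update(pk, **fields):
    # like QuerySet.update(): writes only the named columns
    # (column names come from trusted code here, never from user input)
    assignments = ", ".join("%s = ?" % name for name in fields)
    conn.execute("UPDATE post SET %s WHERE id = ?" % assignments,
                 list(fields.values()) + [pk])


# Both requests load the row before either one writes.
moderator_copy, author_copy = load(1), load(1)
moderator_copy["approved"] = 1
save(moderator_copy)
author_copy["message"] = "Hello!"
save(author_copy)            # writes back the stale approved=0
lost = load(1)

update(1, approved=1)        # column-targeted writes do not clobber each other
update(1, message="Hello again!")
kept = load(1)
```

After the two `save` calls the approval is gone; after the two `update` calls both changes survive.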
29
A better way to approach updates

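from django.db.models.expressions import ExpressionNode

def update(obj, using=None, **kwargs):
    """
    Updates specified attributes on the current instance.
    """
    assert obj.pk, "Instance has not yet been created."
    # issue a single UPDATE that touches only the given columns
    obj.__class__._base_manager.using(using)\
        .filter(pk=obj.pk)\
        .update(**kwargs)
    # mirror the new values on the in-memory instance
    for k, v in kwargs.items():
        if isinstance(v, ExpressionNode):
            # NotImplemented: expression results are only known to the database
            continue
        setattr(obj, k, v)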
http://github.com/andymccurdy/django-tips-and-tricks/blob/master/model_update.py
30
Delayed Signals
• Queueing low priority tasks

• even if they’re fast
• Asynchronous (Delayed) signals
• very friendly to the developer
• ..but not as friendly as real signals
31
Delayed Signals (cont’d)
We send a specific serialized version of the model for delayed signals

from disqus.common.signals import delayed_save

def my_func(data, sender, created, **kwargs):
    print(data['id'])

delayed_save.connect(my_func, sender=Post)
This is all handled through our Queue
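A minimal sketch of the mechanism, assuming an in-process `queue.Queue` in place of a real message queue. The function names (`connect`, `send_delayed_save`, `run_worker`) and the JSON message layout are invented for illustration and are not the Disqus API.

```python
import json
import queue

task_queue = queue.Queue()   # stand-in for a real message queue
receivers = []               # functions connected to the delayed signal


def connect(func):
    receivers.append(func)


def send_delayed_save(sender_name, instance_data, created):
    # Only a JSON-serialisable snapshot of the row crosses the queue,
    # which is why receivers get `data` rather than a model instance.
    task_queue.put(json.dumps(
        {"sender": sender_name, "data": instance_data, "created": created}))


def run_worker():
    """Drain the queue and call every receiver; runs outside the request."""
    while not task_queue.empty():
        message = json.loads(task_queue.get())
        for func in receivers:
            func(data=message["data"], sender=message["sender"],
                 created=message["created"])
```

The request only pays for the enqueue; the receivers run later in the worker.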
32
Caching
• Memcached
• Use pylibmc (newer libMemcached-based)
• Ticket #11675 (add pylibmc support)
• Third party applications:
• django-newcache, django-pylibmc
33
Caching (cont’d)
• libMemcached / pylibmc is configurable with “behaviors”.
• Memcached “single point of failure”
• Distributed system, but we must take
precautions.
• Connection timeout to memcached can stall
requests.
• Use `_auto_eject_hosts` and
`_retry_timeout` behaviors to prevent
reconnecting to dead caches.
34
Caching (cont’d)
• Default (naive) hashing behavior
• Hash the cache key, then take it modulo the number of
servers to get an index into the server list.
• Removal of a server causes the majority of
cache keys to be remapped to new
servers.

CACHE_SERVERS = ['10.0.0.1', '10.0.0.2']
key = 'my_cache_key'
cache_server = CACHE_SERVERS[hash(key) % len(CACHE_SERVERS)]
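The cost of modulo hashing can be measured directly. This sketch uses made-up server addresses and keys, and swaps Python's `hash()` for md5 so the result is the same in every process. Going from four servers to three, a key keeps its server only when its hash gives the same remainder for both 4 and 3, which holds for about a quarter of the keys, so roughly three quarters are remapped.

```python
import hashlib


def stable_hash(key):
    # md5 gives the same value in every process, unlike the built-in hash()
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def pick_server(key, servers):
    return servers[stable_hash(key) % len(servers)]


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
keys = ["key:%d" % i for i in range(10000)]

before = {k: pick_server(k, servers) for k in keys}
after = {k: pick_server(k, servers[:-1]) for k in keys}   # one server removed

moved = sum(1 for k in keys if before[k] != after[k])
moved_fraction = moved / float(len(keys))
print("remapped: %.0f%%" % (100.0 * moved_fraction))
```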
35
Caching (cont’d)
• Better approach: consistent hashing

• libMemcached (pylibmc) uses libketama
(http://tinyurl.com/lastfm-libketama)
• Addition / removal of a cache server remaps only about K/n cache keys
(where K=number of keys and n=number of servers)
Image Source: http://sourceforge.net/apps/mediawiki/kai/index.php?title=Introduction
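A simplified hash ring shows why consistent hashing remaps so few keys. This is a sketch of the general technique, not the libketama algorithm itself; the `HashRing` class, the 100 points per server, and the addresses are assumptions for illustration.

```python
import bisect
import hashlib


def stable_hash(value):
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, servers, points_per_server=100):
        # each server owns many points on the ring to even out the load
        self._ring = sorted(
            (stable_hash("%s#%d" % (server, i)), server)
            for server in servers
            for i in range(points_per_server)
        )
        self._hashes = [h for h, _ in self._ring]

    def pick_server(self, key):
        # first point clockwise from the key's hash, wrapping at the end
        index = bisect.bisect(self._hashes, stable_hash(key)) % len(self._ring)
        return self._ring[index][1]


servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
keys = ["key:%d" % i for i in range(10000)]

full = HashRing(servers)
reduced = HashRing(servers[:-1])          # 10.0.0.4 removed
moved = [k for k in keys if full.pick_server(k) != reduced.pick_server(k)]
```

Every key in `moved` was stored on the removed server; keys on the surviving servers stay where they were.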
36
Caching (cont’d)
• Thundering herd (stampede) problem

• Invalidating a heavily accessed cache key causes many
clients to refill cache.
• But everyone refetching to fill the cache from the data
store or reprocessing data can cause things to get even
slower.
• Most times, it’s ideal to return the previously invalidated
cache value and let a single client refill the cache.
• django-newcache or MintCache
(http://djangosnippets.org/snippets/793/) will do this for you.
• Filling the cache on invalidation, instead of deleting
from it, also helps to prevent the thundering herd
problem.
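The stale-while-refilling idea can be sketched with a dict in place of memcached. This is in the spirit of MintCache, not a copy of it; the class name, the `refill_grace` parameter, and the injectable `clock` are assumptions for illustration, and the sketch is not thread-safe.

```python
import time


class StaleWhileRefillCache:
    """Dict-backed sketch: stale values are served while one caller refills."""

    def __init__(self, refill_grace=30, clock=time.time):
        self._store = {}             # key -> (value, soft_expiry)
        self._grace = refill_grace   # seconds the stale value stays visible
        self._clock = clock

    def set(self, key, value, timeout):
        self._store[key] = (value, self._clock() + timeout)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, soft_expiry = entry
        if self._clock() >= soft_expiry:
            # Push the expiry forward so later callers keep getting the
            # stale value, and report a miss to this caller only so that
            # it alone recomputes and calls set().
            self._store[key] = (value, self._clock() + self._grace)
            return None
        return value
```

Only the first caller after expiry sees a miss; everyone else is served the old value until the refill lands.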
37
Transactions
• TransactionMiddleware got us started, but down the road became a burden
• For postgresql_psycopg2, there’s a database
option, OPTIONS['autocommit']
• Each query is in its own transaction. This
means each request won’t start in a
transaction.
• But sometimes we want transactions
(e.g., saving multiple objects and rolling
back on error)
38
Transactions (cont’d)
• Tips:
• Use autocommit for read slave databases.
• Isolate slow functions (e.g., external calls,
template rendering) from transactions.
• Selective autocommit
• Most read-only views don’t need to be
in transactions.
• Start in autocommit and switch to a
transaction on write.
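Selective autocommit can be sketched with the standard library's sqlite3 module standing in for PostgreSQL. The `transaction` helper and the table are hypothetical: reads run in autocommit, and a transaction is opened only around a group of writes that must succeed or fail together.

```python
import sqlite3
from contextlib import contextmanager

# isolation_level=None puts the sqlite3 module in autocommit mode
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, message TEXT NOT NULL)")


@contextmanager
def transaction(connection):
    """Group several writes; roll all of them back if any one fails."""
    connection.execute("BEGIN")
    try:
        yield connection
    except Exception:
        connection.execute("ROLLBACK")
        raise
    else:
        connection.execute("COMMIT")


def count_posts():
    # read-only work stays in autocommit: no transaction is opened
    return conn.execute("SELECT COUNT(*) FROM post").fetchone()[0]


try:
    with transaction(conn):
        conn.execute("INSERT INTO post (message) VALUES ('first')")
        conn.execute("INSERT INTO post (message) VALUES (NULL)")  # violates NOT NULL
except sqlite3.IntegrityError:
    pass
after_failure = count_posts()   # the first insert was rolled back too

with transaction(conn):
    conn.execute("INSERT INTO post (message) VALUES ('first')")
    conn.execute("INSERT INTO post (message) VALUES ('second')")
after_success = count_posts()
```

Keeping slow work such as external calls outside the `with` block keeps the transaction short.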
39
Scaling the Team
• Small team of engineers

• Monthly users / developers = 40m
• Which means writing tests..
• ..and having a dead simple workflow
40
Keeping it Simple
• A developer can be up and running in a few minutes
• assuming postgres and other server
applications are already installed
• pip, virtualenv
• settings.py
41
Setting Up Local
1. createdb -E UTF-8 disqus

2. git clone git://repo
3. mkvirtualenv disqus
4. pip install -U -r requirements.txt
5. ./manage.py syncdb && ./manage.py migrate
42
Sane Defaults
settings.py

from disqus.conf.settings.default import *

try:
    from local_settings import *
except ImportError:
    import sys, traceback
    sys.stderr.write("Can't find 'local_settings.py'\n")
    sys.stderr.write("\nThe exception was:\n\n")
    traceback.print_exc()

local_settings.py

from disqus.conf.settings.dev import *
43
Continuous Integration
• Daily deploys with Fabric

• several times an hour on some days
• Hudson keeps our builds going
• combined with Selenium
• Post-commit hooks for quick testing
• like Pyflakes
• Reverting to a previous version is a matter of
seconds
44
Continuous Integration (cont’d)
Hudson makes integration easy
45
Testing
• It’s not fun breaking things when you’re the new guy
• Our testing process is fairly heavy
• 70k (Python) LOC, 73% coverage, 20 min suite
• Custom Test Runner (unittest)
• We needed XML, Selenium, Query Counts
• Database proxies (for read-slave testing)
• Integration with our Queue
46
Testing (cont’d)
Query Counts

# failures yield a dump of queries
def test_read_slave(self):
    Model.objects.using('read_slave').count()
    self.assertQueryCount(1, 'read_slave')

Selenium

def test_button(self):
    self.selenium.click('//a[@class="dsq-button"]')

Queue Integration

class WorkerTest(DisqusTest):
    workers = ['fire_signal']

    def test_delayed_signal(self):
        ...
47
Bug Tracking
• Switched from Trac to Redmine

• We wanted Subtasks
• Emailing exceptions is a bad idea
• Even if it’s localhost
• Previously using django-db-log to aggregate
errors to a single point
• We’ve overhauled db log and are releasing
Sentry
48
django-sentry
Groups messages intelligently
http://github.com/dcramer/django-sentry
49
django-sentry (cont’d)
Similar feel to Django’s debugger
50
Feature Switches
• We needed a safety in case a feature wasn’t performing well at peak
• it had to respond without delay, globally,
and without writing to disk
• Allows us to work out of trunk (mostly)
• Easy to release new features to a portion of
your audience
• Also nice for “Labs” type projects
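A minimal sketch of a percentage-based switch. The `SWITCHES` table, the switch names, and `is_enabled` are hypothetical and not the Disqus implementation; in production the table would live in a shared store so that a change takes effect globally without a deploy.

```python
import hashlib

# Hypothetical switch table: name -> percentage of users who get the feature.
SWITCHES = {"realtime": 100, "new_sorting": 10, "labs_widget": 0}


def is_enabled(name, user_id):
    percent = SWITCHES.get(name, 0)   # unknown switches are off
    if percent <= 0:
        return False
    if percent >= 100:
        return True
    # Hash the switch name with the user id so each user gets a stable
    # answer, and different switches select different slices of users.
    digest = hashlib.md5(("%s:%s" % (name, user_id)).encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent
```

Setting a percentage to 0 acts as the kill switch; raising it gradually releases a feature to a growing portion of the audience.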
51
Feature Switches (cont’d)
52
Final Thoughts
• The language (usually) isn’t your problem

• We like Django
• But we maintain local patches
• Some tickets don’t have enough of a following
• Patches, like #17, completely change
Django..
• ..arguably in a good way
• Others don’t have champions
Ticket #17 describes making the ORM an identity mapper
53
Housekeeping
Birds of a Feather
Want to learn from others about
performance and scaling problems?
Or play some StarCraft 2?
We’re Hiring!
DISQUS is looking for amazing engineers
54
Questions
55
References
django-sentry
http://github.com/dcramer/django-sentry
Our Feature Switches
http://cl.ly/2FYt
Andy McCurdy’s update()
http://github.com/andymccurdy/django-tips-and-tricks
Our PyFlakes Fork
http://github.com/dcramer/pyflakes
SkinnyQuerySet
http://gist.github.com/550438
django-newcache
http://github.com/ericflo/django-newcache
attach_foreignkey (Pythonic Joins)
56
