
WIKIPEDIA ARCHITECTURE

Matej Ferenc

WIKIPEDIA
One of the top ten websites
Currently about 400 million unique visitors a month
Over 100,000 hits per second
Performance, caching, optimization
Supported by a non-profit organization
No expensive features
Balance between performance and features

HISTORY

Phase I: UseModWiki
Experiment, launched in January 2001
Written in Perl
Pages stored in individual text files
No history of changes
New feature: free links, which link pages with special syntax instead of automatic links
Influenced today's Wikipedia markup language
Wikipedia hosted on a single server
Performance issues

HISTORY

Phase II: the PHP script
Magnus Manske, January 2002
PHP + MySQL
Namespaces to organize content
Special pages: maintenance reports, contributions list, user watchlist
Frequent difficulties

Phase III: MediaWiki
Lee Daniel Crocker, July 2002
"wasn't much time to sit down and properly architect and develop a solution"
Tracking down slow functions

HISTORY

New server (still single)
Performance issues
Temporarily disabled view count and site statistics
Occasionally switched to read-only mode
In 2003, a decision: re-architect the software from scratch or continue to improve the existing code
Database server added
Caching of rendered (ready-to-output) pages
New features (automatically generated table of contents, editing of page sections)

MEDIAWIKI

Not like a generic CMS
No regular CMS features like publication workflow or ACLs
Very specific purpose
Variety of tools to handle spam and vandalism
Open-source software from the beginning
Solid external user base
PHP
MediaWiki uses unprefixed class names
Namespace class renamed to MWNamespace to be compatible with PHP 5.3
Not the best choice for performance

Developers are interested in feature development; architecture is left behind

SECURITY

Wrappers around HTML output and database queries
Removes "magic quotes" slashes
Strips illegal input characters
Normalizes Unicode sequences
Cross-site request forgery (CSRF) avoided by using tokens

Cross-site scripting (XSS) avoided by validating inputs and escaping outputs
Database functions prevent SQL injection
Unregistered users' IP addresses are logged when editing

Blocking IP addresses for vandalism
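
The three defenses named on this slide (CSRF tokens, output escaping, query wrappers) can be illustrated with a minimal Python sketch. This is not MediaWiki's PHP code: the function names, the `wiki_user` table and the `SECRET_KEY` are hypothetical, and SQLite stands in for MySQL.

```python
import hashlib
import hmac
import html
import secrets
import sqlite3

# Hypothetical per-site secret; a real site would load it from configuration.
SECRET_KEY = secrets.token_bytes(32)


def make_edit_token(session_id: str) -> str:
    """Derive a CSRF token bound to the user's session."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def check_edit_token(session_id: str, token: str) -> bool:
    """Compare the submitted token in constant time."""
    return hmac.compare_digest(make_edit_token(session_id), token)


def render_comment(user_text: str) -> str:
    """Escape user input before placing it into HTML output (XSS defense)."""
    return "<p>" + html.escape(user_text) + "</p>"


def find_user(conn: sqlite3.Connection, name: str):
    """Parameterized query: the value is never spliced into the SQL string."""
    return conn.execute(
        "SELECT id, name FROM wiki_user WHERE name = ?", (name,)
    ).fetchone()
```

An edit form would embed `make_edit_token(session_id)` as a hidden field, and the server would reject any submission for which `check_edit_token` returns `False`.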

DATABASE

MySQL
Dozens of tables: content, users, media files, caching
Indices and summary tables used extensively
SQL queries that scan huge numbers of rows can be very expensive

Unindexed queries are discouraged
1.4 model: content stored in two tables, cur and old; deleted pages in archive
1.5 model: content stored in three tables: page, revision and text
The text table maps IDs to text blobs, each containing a few dozen revisions
The first revision is stored in full; the following ones are stored as diffs
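
A rough Python sketch of the storage idea described above: the first revision is kept in full, each later one as a delta against its predecessor, and the whole blob is gzip-compressed. The delta format (copy/insert instructions) and the JSON container are assumptions made for this illustration, not MediaWiki's actual blob format.

```python
import difflib
import gzip
import json


def make_delta(old, new):
    """Encode `new` as copy/insert instructions relative to `old` (lists of lines)."""
    delta = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(["copy", i1, i2])
        elif tag in ("replace", "insert"):
            delta.append(["insert", new[j1:j2]])
        # "delete": nothing to record, the old lines are simply not copied
    return delta


def apply_delta(old, delta):
    """Rebuild the newer revision from the older one plus its delta."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


def pack_revisions(revisions):
    """Store the first revision in full, each later one as a delta, then gzip."""
    deltas = [make_delta(prev, cur) for prev, cur in zip(revisions, revisions[1:])]
    payload = {"base": revisions[0], "deltas": deltas}
    return gzip.compress(json.dumps(payload).encode("utf-8"))


def unpack_revisions(blob):
    """Inverse of pack_revisions: replay the deltas in order."""
    payload = json.loads(gzip.decompress(blob).decode("utf-8"))
    revisions = [payload["base"]]
    for delta in payload["deltas"]:
        revisions.append(apply_delta(revisions[-1], delta))
    return revisions
```

Reading the latest revision means replaying every delta in the blob, which is one reason to keep only a few dozen revisions per blob.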

DATABASE

Revisions are grouped per page and tend to be similar, so diffs are small and gzip works well (compression ratio ~98%)
Load balancing
One master database server, any number of slaves
All writes sent to the master
Reads sent to the slaves
Each slave has some replication lag. If it exceeds 30 s, the slave receives no read queries until it catches up
If all slaves are lagged by more than 30 s, the system puts itself in read-only mode
Chronology protector
A write query stores the master's position in the user's session
On the user's next read request, the load balancer reads from a slave that has caught up to that replication position
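
The replica selection rules above can be sketched in Python as follows. The 30-second threshold comes from the slide; the `Replica` class, the integer replication position and the `ReadOnlyMode` exception are hypothetical names for this illustration, and returning `None` when no replica has caught up (so the caller can wait and retry) is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_LAG_SECONDS = 30  # threshold stated on the slide


class ReadOnlyMode(Exception):
    """Raised when every replica is lagged beyond the threshold."""


@dataclass
class Replica:
    name: str
    lag_seconds: float  # how far behind the master this replica is
    position: int       # replication position applied so far (simplified to an int)


def pick_replica(replicas: List[Replica], session_position: int = 0) -> Optional[Replica]:
    """Choose a replica for a read query.

    session_position is the master position stored in the user's session
    after their last write (the "chronology protector").
    """
    usable = [r for r in replicas if r.lag_seconds <= MAX_LAG_SECONDS]
    if not usable:
        raise ReadOnlyMode("all replicas are lagged by more than 30 s")
    caught_up = [r for r in usable if r.position >= session_position]
    if not caught_up:
        return None  # assumption: the caller waits for a replica to catch up
    return min(caught_up, key=lambda r: r.lag_seconds)
```

With this rule a user who has just saved an edit is never served from a replica that has not yet replayed that edit.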

CACHING

Most requests handled by caching proxies (Squids)


Contain static versions of rendered pages, served for simple reads to users who are not logged in
Logged-in users and other requests are forwarded to the web server

Second level caching

Used when assembling the page from multiple objects
Many objects can be cached to minimize future calls
Examples: the page's interface (sidebar, menus, UI text)
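
A toy Python sketch of the two cache levels. The dictionaries stand in for the Squid layer and the object cache; every name here is hypothetical, and cache invalidation on edit is deliberately left out.

```python
page_cache = {}    # stands in for the caching proxies (whole rendered pages)
object_cache = {}  # stands in for the second-level cache (page fragments)
stats = {"renders": 0, "sidebar_builds": 0}


def build_sidebar(lang):
    """Pretend this is expensive to build; it is the same for every page in a language."""
    stats["sidebar_builds"] += 1
    return f"<nav>sidebar ({lang})</nav>"


def cached(key, build):
    """Second-level cache: build an object once, then reuse it."""
    if key not in object_cache:
        object_cache[key] = build()
    return object_cache[key]


def render_page(title, lang):
    """Assemble a page from cached fragments plus the page body."""
    stats["renders"] += 1
    sidebar = cached(f"sidebar:{lang}", lambda: build_sidebar(lang))
    return sidebar + f"<article>{title}</article>"


def handle_request(title, lang="en", logged_in=False):
    """Anonymous reads are served from the page cache; logged-in users reach the web server."""
    if logged_in:
        return render_page(title, lang)
    key = f"{lang}:{title}"
    if key not in page_cache:
        page_cache[key] = render_page(title, lang)
    return page_cache[key]
```

Two anonymous requests for the same page trigger a single render. A logged-in request renders again but still reuses the cached sidebar.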

LANGUAGES

Wikipedia is available in more than 280 languages

English less than 20%

Provide localized interface
Localization and internationalization (l10n and i18n)

Central component of MediaWiki
Pervasive: impacts many parts of the software

UTF-8 (since 2005)
Bidirectional text


LTR and RTL text on the same page

Interface messages

Key-value pairs
Language switches ({{GENDER:}}, {{PLURAL:}}, {{GRAMMAR:}})
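
A minimal Python sketch of key-value interface messages with a `{{PLURAL:}}` switch. The message keys, the sample texts, the fallback to English and the two-form plural rule are assumptions for illustration. Real plural rules differ per language, and this is not MediaWiki's message API.

```python
import re

# Hypothetical message store: language code -> message key -> text.
MESSAGES = {
    "en": {
        "edits": "{{PLURAL:$1|one edit|$1 edits}}",
        "welcome": "Welcome, $1!",
    },
    "de": {
        "welcome": "Willkommen, $1!",
    },
}

# Matches a two-form plural switch on the first parameter.
PLURAL_RE = re.compile(r"\{\{PLURAL:\$1\|([^|}]*)\|([^|}]*)\}\}")


def get_message(key, lang, arg):
    """Look up a message, fall back to English, then expand PLURAL and $1."""
    template = MESSAGES.get(lang, {}).get(key) or MESSAGES["en"][key]

    def choose(match):
        # English-style rule: singular form only for exactly 1.
        return match.group(1) if arg == 1 else match.group(2)

    text = PLURAL_RE.sub(choose, template)
    return text.replace("$1", str(arg))
```

For example, `get_message("edits", "en", 5)` picks the plural form and substitutes the number, while a key missing from the German table falls back to the English text.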

LANGUAGES

Localizing messages

Localized interface messages in MessagesXx.php


Where Xx is the ISO 639 code of the language
Message files include language-dependent information such as date formats, number formats and grammar conventions

Special MediaWiki website: translatewiki.net


Translators can localize interface messages by editing a wiki page
MessagesXx.php files are updated in the MediaWiki code repository
Documentation for every message

MEDIA FILES

Files stored on the file system

Thumbnails in dedicated thumb directory

Supports uncommon file types

Like SVG vector images
Rendered as PNG files, so they can be thumbnailed and displayed inline

File is assigned a page with information entered by the uploader


Includes copyright information (author, license)
Describes or classifies the content of the file

CONTENT PROCESSING

User-generated content isn't in HTML, but in a markup language specific to MediaWiki


"wikitext"
Formatting changes (bold, italic using quotes)
Links, templates, context-dependent content (date or signature)
An incredible number of other magical things

Content needs to be parsed, assembled from all the external or dynamic pieces and converted to HTML
The parser is one of the most essential parts of MediaWiki

Difficult to change or improve Has to remain extremely stable
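
To give a feel for the wikitext-to-HTML step, here is a deliberately tiny Python sketch that handles only quote-based bold and italic and simple `[[links]]`. It is a regex toy under stated assumptions (the `/wiki/` URL prefix, no nesting, no templates), not the MediaWiki parser, which covers far more cases.

```python
import html
import re

# [[Target]] or [[Target|label]]; targets containing quotes are left untouched.
LINK_RE = re.compile(r"\[\[([^\[\]|\"]+)(?:\|([^\[\]]+))?\]\]")


def _link(match):
    """Turn one wiki link into an anchor tag."""
    target = match.group(1).strip()
    label = (match.group(2) or target).strip()
    href = "/wiki/" + target.replace(" ", "_")  # assumed URL scheme
    return f'<a href="{href}">{label}</a>'


def render_inline(wikitext):
    """Convert a small subset of inline wikitext to HTML."""
    # Escape raw HTML first; quote=False keeps the apostrophes used as markup.
    text = html.escape(wikitext, quote=False)
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)  # '''bold'''
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)    # ''italic''
    return LINK_RE.sub(_link, text)
```

Even this subset shows why the language resists a formal grammar: the same apostrophe character opens and closes both bold and italic spans, and the order in which rules are applied changes the result.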

MARKUP LANGUAGE

No formal specification from the beginning


Started out based on UseModWiki's markup
Morphed and evolved as needs demanded
Complex language; cannot be represented as a formal grammar

MODIFIABILITY
Administrators can install extensions, configure separate helper programs (image thumbnailing and TeX rendering) and change global settings
MediaWiki used to over-depend on global variables

Slowly moving context out of global variables into objects

Storing context in objects allows flexible reuse of objects

Some parts can be modified (database schema)

While others cannot (Parser)
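
The move from global variables to context objects, mentioned earlier on this slide, can be illustrated with a short Python sketch. The `PageContext` class and its fields are hypothetical names chosen for the example, not MediaWiki's classes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PageContext:
    """Everything a renderer needs, passed explicitly instead of read from globals."""
    user: str
    language: str
    title: str


GREETINGS = {"en": "Hello", "de": "Hallo"}  # sample data for the sketch


def page_heading(ctx: PageContext) -> str:
    """Depends only on its argument, so it can be reused for any context."""
    greeting = GREETINGS.get(ctx.language, GREETINGS["en"])
    return f"{greeting}, {ctx.user}: {ctx.title}"


# Two contexts can coexist in one process, which shared globals would not allow.
english = PageContext(user="Alice", language="en", title="Main Page")
german = replace(english, language="de")
```

Because `page_heading` reads nothing from global state, rendering the same page for a different user or language only requires passing a different context object.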

SYSTEM ARCHITECTURE

DNS servers run PowerDNS

Geographical DNS distributes requests between two main sites (US and Europe) depending on the location of the client

Load balancing across servers uses LVS
For HTML: Squid caching proxy servers in front of Apache
For image files: Squid in front of Sun Java System Web Server
Servers run Ubuntu Linux

Image file storage servers run Solaris

The main web application is MediaWiki, written in PHP
Structured data is stored in MySQL
Wikis are grouped into clusters, each served by several MySQL servers replicated in a single-master configuration
