Matej Ferenc
WIKIPEDIA
One of the top ten websites
Currently about 400 million unique visitors a month
Over 100,000 hits per second
Performance, caching, optimization
Supported by a non-profit organization
No expensive features
Balance between performance and features
HISTORY
Phase I: UseModWiki
Experiment, launched in January 2001
Written in Perl
Pages in individual text files
No history of changes
New feature: free links, linking pages with special syntax instead of automatic links
Influenced today's Wikipedia markup language
Wikipedia hosted on a single server
Performance issues
Phase II: the PHP script
Magnus Manske, January 2002
PHP + MySQL
Namespaces to organize content
Special pages: maintenance reports, contributions list, user watchlist
Frequent difficulties
Phase III: MediaWiki
Lee Daniel Crocker, July 2002
"wasn't much time to sit down and properly architect and develop a solution"
Tracking down slow functions
New server (still single)
Performance issues: view count and site statistics temporarily disabled, occasional switches to read-only mode
In 2003, a decision: re-architect the software from scratch or continue to improve the existing code
Database server added
Caching of rendered (ready-to-output) pages
New features (automatically generated table of contents, editing page sections)
MEDIAWIKI
Not like a generic CMS
No regular features like publication workflow or ACLs
Very specific purpose
Variety of tools to handle spam and vandalism
Open-source software from the beginning
Solid external user base
PHP
MediaWiki uses unprefixed class names
Namespace class renamed to MWNamespace to be compatible with PHP 5.3
Not the best choice for performance
SECURITY
Wrappers around HTML output and database queries
Removes "magic quotes" slashes
Strips illegal input characters
Normalizes Unicode sequences
Cross-site request forgery (CSRF) avoided by using tokens
Cross-site scripting (XSS) avoided by validating inputs and escaping outputs
Database functions prevent SQL injection
Unregistered users' IP addresses are logged when editing
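The three defenses listed above (CSRF tokens, output escaping, database functions that prevent SQL injection) can be illustrated with a minimal Python sketch. This is not MediaWiki's PHP code: the function names, the `wiki_user` table and the use of SQLite are assumptions chosen only to keep the example self-contained.

```python
import hashlib
import hmac
import html
import secrets
import sqlite3


def make_csrf_token(session_secret: bytes, action: str) -> str:
    """Derive a token bound to the session secret and the requested action."""
    return hmac.new(session_secret, action.encode("utf-8"), hashlib.sha256).hexdigest()


def check_csrf_token(session_secret: bytes, action: str, submitted: str) -> bool:
    """Accept the request only if the submitted token matches (constant-time compare)."""
    expected = make_csrf_token(session_secret, action)
    return hmac.compare_digest(expected.encode("utf-8"), submitted.encode("utf-8"))


def render_comment(user_text: str) -> str:
    """Escape user input on output so that markup in it is shown, not executed."""
    return "<p>" + html.escape(user_text) + "</p>"


def find_user(conn: sqlite3.Connection, name: str):
    """Pass values as bound parameters instead of pasting them into the SQL string."""
    cur = conn.execute("SELECT id, name FROM wiki_user WHERE name = ?", (name,))
    return cur.fetchone()


# Hypothetical setup: one session secret and an in-memory database.
session_secret = secrets.token_bytes(32)
token = make_csrf_token(session_secret, "edit")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wiki_user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO wiki_user (name) VALUES (?)", ("Alice",))
```

A token issued for one action is rejected for another, and an injection attempt such as `x' OR '1'='1` is treated as an ordinary (non-matching) user name.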
DATABASE
MySQL
Dozens of tables: content, users, media files, caching
Indices and summary tables used extensively
SQL queries that scan huge numbers of rows can be very expensive
Unindexed queries are discouraged
1.4 model: content stored in two tables, cur and old; deleted pages in archive
1.5 model: content stored in three tables, page, revision and text
The text table maps IDs to text blobs, each containing a few dozen revisions
The first revision is stored in full, the following ones as diffs
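The "first revision in full, following ones as diffs" idea can be sketched in Python as follows. The diff format here (copy/insert operations derived from `difflib.SequenceMatcher`) and the function names are assumptions for illustration; the slides do not say which diff format MediaWiki uses.

```python
from difflib import SequenceMatcher


def make_diff(old: str, new: str) -> list:
    """Describe `new` as copies of ranges of `old` plus inserted text."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        else:
            # replace / insert / delete: keep only the new text (empty for delete)
            ops.append(("insert", new[j1:j2]))
    return ops


def apply_diff(old: str, ops: list) -> str:
    """Rebuild the newer revision from the older one and its diff."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def pack_revisions(revisions: list) -> tuple:
    """Store the first revision in full and each later one as a diff to its predecessor."""
    diffs = [make_diff(a, b) for a, b in zip(revisions, revisions[1:])]
    return revisions[0], diffs


def unpack_revision(base: str, diffs: list, index: int) -> str:
    """Replay the first `index` diffs on top of the base revision."""
    text = base
    for ops in diffs[:index]:
        text = apply_diff(text, ops)
    return text


revisions = [
    "Wikipedia is an encyclopedia.",
    "Wikipedia is a free encyclopedia.",
    "Wikipedia is a free online encyclopedia.",
]
base, diffs = pack_revisions(revisions)
```

Reading a late revision means replaying every diff before it, which is one reason to keep the number of revisions per blob small.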
Revisions are grouped per page and tend to be similar, so diffs are small and gzip works well: compression ratio ~98%
Load balancing
One master database server, any number of slaves
All writes sent to the master
Reads sent to the slaves
Each slave has a replication lag. If it exceeds 30 s, the slave receives no read queries until it catches up
If all slaves lag by more than 30 s, the system puts itself in read-only mode
Chronology protector
A write query stores the master's position in the user's session
When the user then makes a read request, the load balancer reads from a slave that has caught up to that replication position
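The lag rule and the chronology protector described above can be sketched in Python. All names here (`Replica`, `pick_replica`, `ReadOnlyMode`) are hypothetical, replication positions are simplified to integers, and returning `None` when no server has caught up is an assumption; the slides do not say what happens in that case.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_LAG_SECONDS = 30  # threshold from the slides


@dataclass
class Replica:
    """A read server (a "slave" in the slides' terminology)."""
    name: str
    lag_seconds: float
    position: int  # replication position it has applied so far


class ReadOnlyMode(Exception):
    """Raised when every replica lags too much to serve reads."""


def pick_replica(replicas: List[Replica],
                 session_position: Optional[int] = None) -> Optional[Replica]:
    """Choose a replica for a read query.

    Replicas lagging more than MAX_LAG_SECONDS are skipped so they can catch up.
    If the session recorded a master position after a write, only replicas that
    have reached that position are eligible (chronology protector).
    """
    usable = [r for r in replicas if r.lag_seconds <= MAX_LAG_SECONDS]
    if not usable:
        raise ReadOnlyMode("all replicas lag more than %d s" % MAX_LAG_SECONDS)
    if session_position is not None:
        usable = [r for r in usable if r.position >= session_position]
        if not usable:
            return None  # assumption: the caller waits or retries
    # Prefer the least-lagged eligible replica.
    return min(usable, key=lambda r: r.lag_seconds)


replicas = [
    Replica("db1", lag_seconds=2, position=100),
    Replica("db2", lag_seconds=45, position=90),
    Replica("db3", lag_seconds=5, position=120),
]
```

With these example values, a plain read goes to `db1`, while a user whose last write left the master at position 110 is routed to `db3`, the only eligible server that has replicated that write.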
CACHING
Caching proxies contain static versions of rendered pages, served for simple reads to users who are not logged in
Logged-in users and other requests are forwarded to the web server
When assembling a page from multiple objects, many of them can be cached to minimize future calls
Page's interface: sidebar, menus, UI text
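The split between cached reads and forwarded requests can be sketched in Python. The `PageCache` class, its counters and the `purge` method are hypothetical; in particular, purging on edit is an assumption added so that the cache does not serve stale pages.

```python
from typing import Callable, Dict


class PageCache:
    """Toy front-end cache: stored pages go only to users who are not logged in."""

    def __init__(self, render: Callable[[str, bool], str]) -> None:
        self.render = render  # stands in for the web server behind the cache
        self.pages: Dict[str, str] = {}
        self.hits = 0
        self.renders = 0

    def handle(self, title: str, logged_in: bool) -> str:
        if logged_in:
            # Personalized output: always forwarded, never stored.
            self.renders += 1
            return self.render(title, True)
        if title in self.pages:
            self.hits += 1
            return self.pages[title]
        self.renders += 1
        page = self.render(title, False)
        self.pages[title] = page
        return page

    def purge(self, title: str) -> None:
        """Drop a cached page, e.g. after an edit (assumption)."""
        self.pages.pop(title, None)


def render_page(title: str, logged_in: bool) -> str:
    """Hypothetical renderer used only for this example."""
    greeting = "Hello, user" if logged_in else "Hello, reader"
    return "<h1>%s</h1><p>%s</p>" % (title, greeting)


cache = PageCache(render_page)
first = cache.handle("Main Page", logged_in=False)   # rendered and stored
second = cache.handle("Main Page", logged_in=False)  # served from the cache
personal = cache.handle("Main Page", logged_in=True)  # forwarded
```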
LANGUAGES
Interface messages
Localizing messages
Translators can localize interface messages by editing a wiki page
MessagesXx.php files are updated in the MediaWiki code repository
Documentation for every message
MEDIA FILES
Like SVG vector images
Rendered as PNG files, can be thumbnailed and displayed inline
Includes copyright information (author, license)
Describes or classifies the content of the file
CONTENT PROCESSING
"Wikitext"
Formatting changes (bold, italic using quotes)
Links, templates, context-dependent content (date or signature)
Incredible number of other magical things
Content needs to be parsed, assembled from all the external or dynamic pieces, and converted to HTML
The parser is one of the most essential parts of MediaWiki
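To give a feel for the conversion from wikitext to HTML, here is a toy Python sketch that handles only quote-based bold and italic and simple double-bracket links. It is not the MediaWiki parser; the `/wiki/` URL prefix and the function name `render_inline` are assumptions, and templates, signatures and everything else are ignored.

```python
import html
import re


def render_inline(wikitext: str) -> str:
    """Convert a tiny subset of wikitext to HTML."""
    # Escape HTML first so user-supplied markup is not passed through.
    # quote=False keeps the apostrophes that the wikitext syntax relies on.
    text = html.escape(wikitext, quote=False)

    # Three apostrophes mean bold; handle them before the two-apostrophe italic.
    text = re.sub(r"'''(.+?)'''", r"<b>\1</b>", text)
    text = re.sub(r"''(.+?)''", r"<i>\1</i>", text)

    def link(match: "re.Match[str]") -> str:
        target = match.group(1).strip()
        label = (match.group(2) or target).strip()
        # Assumed URL scheme: spaces become underscores under /wiki/.
        href = "/wiki/" + target.replace(" ", "_").replace('"', "&quot;")
        return '<a href="%s">%s</a>' % (href, label)

    # [[Target]] or [[Target|label]]
    text = re.sub(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]", link, text)
    return text


example = render_inline("'''Bold''' and ''italic'', see [[Main Page|home]]")
```

Even this small subset depends on the order in which rules are applied, which hints at why the full language is hard to describe with a formal grammar.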
MARKUP LANGUAGE
Started out based on UseModWiki's markup
Morphed and evolved as needs demanded
Complex language that cannot be represented as a formal grammar
MODIFIABILITY
Administrators can install extensions, configure separate helper programs (image thumbnailing and TeX rendering) and change global settings
MediaWiki used to over-depend on global variables
SYSTEM ARCHITECTURE
Geographical DNS distributes requests between two main sites (US and Europe) depending on the location of the client
Load balancing across servers uses LVS
For HTML: Squid caching proxy servers in front of Apache
For image files: Squid in front of Sun Java System Web Server
Servers run Ubuntu Linux
The main web application is MediaWiki, written in PHP
Structured data is stored in MySQL
Wikis are grouped into clusters, each served by several MySQL servers replicated in a single-master configuration