Diploma Thesis
at the Institute of Entrepreneurship & Innovation
Vienna University of Economics and Business Administration
Degree Program: Business Administration
Submitted by:
Roman Pickl
Degree Program Identification No.: J151
Student Enrolment No.: h0451691
Online communities have been thriving in recent years and have not only drawn the attention
of researchers and professionals but also influence our daily lives. Their success factors,
however, are still rather unclear. This paper sheds light on this topic by analyzing the
relationships between the characteristics of online communities and their performance. To
this end, 5000 communities around individual articles in Wikipedia were analyzed. The results
demonstrate that community characteristics are significantly linked to the quantity and
quality of the output created by online communities. The number of users is by far the most
influential force that drives content creation. When it comes to the quality of the output,
however, characteristics of community members and how they collaborate are as important as
the sheer number of contributors. These findings can be utilized by community operators who
want to foster the development of their online communities and by firms that look for
promising communities to scan for innovative ideas and users.
Acknowledgments
I would like to express my gratitude to my advisor Univ. Prof. Dr. Nikolaus Franke and my
assistant advisor Dr. Philipp Türtscher for their support, encouragement and time to listen to
and discuss little problems and roadblocks. Special thanks also go to Matthias Pickl and
Daniel Winzer who provided thoughtful comments on this thesis. Furthermore, I want to
thank the IT Department of the Vienna University of Economics and Business Administra-
tion, especially Franz Schäfer, for helping with the means to handle the huge amount of data
and thus making this research project possible. Finally, I thank my family and my girlfriend
Julia for their continuous support and encouragement.
1 Introduction
1.1 Objective
1.2 Structure
2 Literature Review and Hypotheses Development
2.1 Online Communities
2.2 Performance of Online Communities
2.2.1 Information-Quantity
2.2.2 Article-Quality
2.3 Links between Community Characteristics and Performance
2.3.1 Community-Centered Perspective
2.3.2 User-Centered Perspective
2.3.3 Collaboration-Centered Perspective
3 Research Method
3.1 Research Site
3.2 Study Design
3.3 Data Collection and Cleansing
3.4 Measures
3.4.1 Operationalization of Performance Indicators
3.4.2 Operationalization of Community Characteristics
4 Results
4.1 Descriptive Statistics
4.2 Inferential Statistics
4.2.1 Results related to Information-Quantity
4.2.2 Results related to Article-Quality
5 Discussion and Implications
5.1 Implications for Theory
5.2 Implications for Methods
5.3 Implications for Practice
5.4 Limitations
5.5 Directions for Future Research
6 References
7 Appendix
7.1 Figures and Examples
7.2 Data Collection
7.2.1 Setting up the Database and Drawing the Sample
7.2.2 Parsing the Data and Calculating all Variables
7.3 References used in the Appendix
With the increasing ubiquity of the Internet, recent years have seen a surge in the number of
online communities1 (Kozinets 1999, p. 253; Rashid et al. 2006, p. 955). These communities
of interest are known for attracting innovative users with a high level of domain-specific
knowledge and are hence an important source of innovation (Füller, Jawecki & Mühlbacher
2007, p. 60). Even though open source software projects like Linux and Apache are among
the first examples that come to mind when thinking about successful online communities, this
phenomenon is not limited to the software sector. In fact, communities have shown
astonishing performance in numerous diverse industries (Lakhani & Panetta 2007, p. 98).
Wikipedia, the online encyclopedia that allows everyone to edit or create new articles, is a
case in point. Initiated in 2001, it has grown to more than 11 million articles in more than 260
languages written by more than 14 million users (Wikimedia 2008d). As of today, it is one of
the most visited sites on the Internet (Alexa.com 2008) and, even though often scrutinized
due to its open source principle, is generally known for its notably high quality (Giles 2005,
p. 900).
The success of Wikipedia and various other online communities has not only drawn the
attention of researchers and professionals but has also influenced how society functions in
terms of, among other things, production, learning, communication and commerce (Cothrel &
Williams 1999, p. 54; Tapscott & Williams 2008, p. 20; Wanga & Fesenmaier 2004, p. 709).
Consequently, many firms across industries have tried to embrace online communities, in
particular to harness their creative potential for developing new products and services
(Füller, Matzler & Hoppe 2008, p. 609; Nambisan 2002, p. 393). Many of those communities,
however, fail (Cothrel & Williams 1999, p. 54), and their success factors remain rather
unclear (Leimeister, Sidiras & Krcmar 2006, p. 281).
1 The terms “online community”, “virtual community”, “computer-mediated community”, “cyber
community”, “net community” or “e-community” are often used interchangeably (Döring 2001;
Wanga & Fesenmaier 2004, p. 709).
Even though online communities have been studied from a variety of perspectives and several
authors have derived recommendations for operating online communities, characteristics of
successful communities have hardly been substantiated empirically (Leimeister, Sidiras &
Krcmar 2006, p. 281). This thesis aims to close this research gap by introducing a framework
that combines several of these perspectives and by empirically analyzing the link between
characteristics of online communities and their performance.
Businesses that want to involve online communities in their innovation process generally have
two distinct options: they can either try to find and utilize already existing communities or
attempt to build their own (Franke 2005, p. 708). Analyzing the link between community
characteristics and performance yields valuable insights for both operators who want to
proactively encourage the development of successful online communities and firms which look
for promising communities to scan for innovative ideas and users. Given the rise of a new
paradigm, often called “Web 2.0” (O'Reilly 2005), which has fueled growing interest in
user-generated content and user participation (Tapscott & Williams 2008, p. 38), it is even
more important to shed light on this topic.
Given that Wikipedia “can be viewed as a massive experiment in collective action” (Viégas et
al. 2007, p. 2), observing communities in this online environment allows the examination of
numerous communities with different characteristics. In this study, a random sample of 5000
articles, created in 2007, is retrieved from Wikipedia and the characteristics of the community
of users collaborating on each article are analyzed. These characteristics are then linked to the
quantity and quality of the output created by these diverse online communities to answer the
following research question:
This thesis is structured as follows: chapter 2 provides a summary of the results of an
in-depth literature review and deals with the current state of research. Furthermore, the
research question is formulated in more detail and several hypotheses are developed. While
chapter 3 provides a detailed explanation of the methodology used in this thesis, results are
presented in chapter 4. Chapter 5 concludes with a discussion of the results, implications for
theory, methods and practice as well as an outlook on promising future research directions.
Additionally, several examples along with an in-depth explanation of the data collection and
analysis method used in this study can be found in the appendix.
This chapter defines key terms and provides an insight into the theoretical background of this
study. After introducing the concept of online communities and factors indicating community
performance, the relationships between the characteristics of online communities and their
performance are explored from various perspectives.
The idea of geographically separated people meeting online to talk about common areas of
interest and to build online communities is older than the Internet itself (Licklider & Taylor
1968, p. 38). In fact, one of the first services of the Arpanet, file transfer, was soon diverted
from its intended use and employed for sending messages (Barabási 2003, p. 149). As a
result, huge mailing lists emerged and, with the rise of the Arpanet, other networks and
eventually the Internet, more and more people were able to gather online and exchange their
thoughts on various topics (Cothrel & Williams 1999, p. 54; Koch 2002, p. 327). As a result
of his positive experiences of support and friendship on the bulletin board system “The
WELL”, Rheingold coined the term “Virtual Community” to describe this social phenomenon
(Döring 2001; Rheingold 1993). Consequently, a scholarly discussion started as to whether the
term “community” should be used in this context at all, as virtual communities lack
face-to-face contact and various other characteristics of traditional communities (Döring
2001; Jones 1997; Preece & Maloney-Krichmar 2005).
Even though scientists nowadays generally agree that virtual communities are “real”
communities (Döring 2001) and the strict distinction between online and offline activities is
losing importance (Preece, Maloney-Krichmar & Abras 2003, p. 8), online communities are still
a vague concept with no widely accepted definition (Leimeister, Sidiras & Krcmar 2006, p.
278; Preece 2001, p. 347). It is, however, not the intent of this paper to comprehensively
review and analyze every single argument. Rather, online communities are understood as a
category with fuzzy boundaries (Bruckman 2006, p. 618) and are discussed in a broad context.
This paper aims to shed light on this topic by examining the relationships between community
characteristics and performance. To this end, communities of users collaborating on individual
articles in Wikipedia are investigated. One may argue that observing the community around
an article amounts to taking it out of context, similar to merely tracking changes in a single
file of an open source project without considering dependencies. Mateos Garcia &
Steinmueller (2003), however, call attention to an important difference between open source
software systems like Linux and open content collections like Wikipedia. While contributions
to open source software (e.g. a new software module) entail a high need for integration due to
their “cumulative dependency”, articles in a collection are far less dependent on each other (p.
17) and valuable as standalone items (Wales 2005a, 1:30). In line with these findings, Voss
highlights that “most likely one can determine subcommunities” in Wikipedia (2005, p. 8).
Because there are numerous kinds of online communities, it is of utmost importance to define
clear and measurable objectives to assess their performance (Cothrel 2000, pp. 17-18). What
is more, online communities can be examined from different perspectives and consequently
various performance indicators are discussed in the literature (Leimeister, Sidiras & Krcmar
2006, p. 279; Preece 2001, p. 354). While many of them are very general measures, Stvilia et
al. suggest a specific set of metrics to assess Wikipedia articles which seems especially
relevant in the context of this paper (2005, p. 3). Applicable and relevant factors of these
studies were adapted to the research setting and enriched with a number of additional metrics.
Thus, this paper applies a direct approach when measuring the performance of different
communities in Wikipedia by analyzing their output, whereas previous research often focused
on participation as a proxy of value creation (Cothrel & Williams 1999, p. 55; Preece 2001,
pp. 350-351).
Cothrel & Williams point out that a successful online community is “one that achieves its
purpose” (1999, p. 55). Asked about the purpose of Wikipedia, co-founder Jimmy Wales once
remarked:
“Wikipedia is first and foremost an effort to create and distribute a free encyclopedia
of the highest possible quality to every single person on the planet in their own
language. […] the entire purpose of the community is precisely this goal” (2005b).
Even though the success of online communities is often a question of perspective (Leimeister,
Sidiras & Krcmar 2006, p. 279; Preece 2001, p. 354), non-commercial operators, due to their
intrinsic motivation, tend to agree on success factors with community members (Leimeister,
Sidiras & Krcmar 2006, p. 292). Thus, in the case of the non-profit project Wikipedia, the
purpose for members and operators is crystal clear: create the largest high-quality reference
work for free. Consequently, the performance of communities in Wikipedia needs to be
evaluated in terms of the quantity of information and the quality of the articles produced.
When thinking about quantifying the information included in Wikipedia articles, the length of
each article is the first measure that comes to mind. However, if counting the number of
words were the only measure applied, articles that repeat the same information over and over
again would score far too high. Therefore, to increase its validity, the analysis was
complemented with an examination of the vocabulary used in each article, a proxy of the
information content contained. Furthermore, the number of links was assessed to take the
amount of external information referred to into account. The following paragraphs provide
additional information on these measures.
Number of words: The length of articles in Wikipedia and hence the quantity of information
produced by different communities varies sharply (Stvilia et al. 2005, p. 7). To assess the
length of every article, the total number of words was counted. This measure was preferred to
the mere number of characters because different topics come with vocabularies of varying
word length.
Vocabulary: This measure captures the number of unique words and hence the word pool used
in the community effort. A short test revealed that this metric, which is easy to understand
and easy to calculate, correlates very well (0.985) with the zipped file size of the articles,
which is known to be a good proxy of entropy (Voss 2005, p. 9).
Number of links: Buriol et al. found that the average number of outgoing links per Wikipedia
article increased over the last few years (2006, p. 5). This measure not only indicates the
amount of external information incorporated, but also the number of additional references
provided and is hence considered as an important dimension of information quantity.
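
The three quantity measures described above can be sketched in a few lines of Python. This is
an illustrative sketch only, not the parser used in the study: the function name
`quantity_metrics`, the word tokenization, and the rule for what counts as a link (wiki-markup
links in double square brackets plus bare URLs) are assumptions, and `zlib` merely stands in for
whatever compression tool produced the zipped file sizes mentioned above.

```python
import re
import zlib

# Assumption: a "word" is any run of letters, digits or underscores.
WORD_RE = re.compile(r"\w+", re.UNICODE)
# Assumption: links are wiki-markup links [[...]] plus bare external URLs.
INTERNAL_LINK_RE = re.compile(r"\[\[[^\[\]]+\]\]")
EXTERNAL_LINK_RE = re.compile(r"https?://\S+")


def quantity_metrics(text: str) -> dict:
    """Return word count, vocabulary size, link count and compressed size."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    links = INTERNAL_LINK_RE.findall(text) + EXTERNAL_LINK_RE.findall(text)
    return {
        "words": len(words),                # total number of words
        "vocabulary": len(set(words)),      # number of unique words
        "links": len(links),                # number of outgoing links
        # Compressed size in bytes, a rough proxy of information content.
        "compressed_bytes": len(zlib.compress(text.encode("utf-8"))),
    }


if __name__ == "__main__":
    sample = "Wien ist die Hauptstadt. Wien ist eine [[Stadt]]."
    print(quantity_metrics(sample))
```

Note that this simple tokenizer also counts words that appear inside link markup; a production
parser would probably strip the markup first.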
The results of an expert-led investigation carried out by Nature indicate that Wikipedia
articles are of similar quality to those in the Encyclopaedia Britannica (Giles 2005, p. 900).
This study has attracted much attention and has been criticized for the selection and
comparison of articles (Encyclopædia Britannica 2006). Because only a small number of articles
(42) was reviewed, it additionally lacks validity (den Besten, Loubser & Dalle 2008, p. 1). To
overcome these weaknesses, several researchers have tried to automatically assess the quality
of Wikipedia articles (p. 8). One approach is to calculate readability scores as a measure of
quality (p. 8), a metric also applied in this study. To take the level of integration of each
article into account, this analysis was complemented with an examination of the number of
categories each article is placed into. Further details are provided in the following paragraphs.
Readability: Readability metrics have been used by several researchers to assess the quality
of Wikipedia articles (den Besten, Loubser & Dalle 2008, p. 8; Stvilia et al. 2005, p. 9).
Stvilia et al., for example, found that featured articles, i.e. a selection of the best articles
determined by Wikipedia’s editors (Wikipedia 2009e), show higher Flesch readability scores
than articles in a random set (2005, p. 7). The Flesch readability formula is a very popular
function of the number of words per sentence and the number of syllables per word used in a
text, which yields a number between 0 (very difficult) and 100 (very easy) to assess its
readability (den Besten, Loubser & Dalle 2008, p. 10). While very easy to compute, it allows
the readability of texts to be assessed with considerable accuracy (p. 11).
While several researchers have examined the relationships between particular characteristics
of online communities and their performance, this thesis aims to combine these attempts into
a holistic framework to analyze online communities. Therefore, in the following sections
characteristics of online communities are examined from different perspectives (see figure 1).
The community-centered perspective, which deals with the effects of size and heterogeneity
of the communities on the created output, is followed by the application of a more
user-specific view based on the activity and focus of the average community member. Last but
not least, the characteristics and effects of the users’ collaboration are reviewed in terms of
interactivity and dynamics.
With an increasing number of members, communities generally have access to more resources
and better information (Butler 2001, p. 348). However, the size of a community is not the
only factor influencing its capability to perform well. Another important factor often
discussed in the literature is a group’s composition and hence the heterogeneity of its members
(Horwitz & Horwitz 2007, p. 988). The community-centered perspective applied in this section
therefore examines the links between the sizes of the analyzed communities, the heterogeneity
of their members and the quantity as well as quality of their outputs.
Size: Counting the number of members is the first thing that comes to mind when thinking
about the characteristics of an online community. As there is still a lack of research on the
relatively new field of collaborative content creation platforms like Wikipedia, findings in the
free and open source software movement can be of help, as these two fields share considerably
similar philosophies (Ortega, Gonzalez-Barahona & Robles 2007, p. 47; Stvilia et al.
2005, p. 1). Thus Wikipedia, similarly to other communities, takes advantage of the collective
knowledge of its users (Stvilia et al. 2005, p. 6), thereby obeying Linus’s Law: “Given
enough eyeballs all bugs are shallow” (Raymond 2000). The number of eyeballs, however, is
hard to estimate, as the number of lurkers, i.e. users who read but do not participate, is hard to
grasp (Nonnecke & Preece 1999, p. 123). In their analyses Stvilia et al. focus on people who
bother to make a change, noting that this number is “obviously much smaller and probably
more interesting and maybe correlating with the real number of eyeballs” (2005, p. 6). Strictly
speaking, only people who identify themselves with the community and as a result participate
actively are understood as members of the communities analyzed in this thesis.
Fernandez-Ramil, Izquierdo-Cortazar & Mens show that the number of unique contributors to
open source projects has a positive impact on a software’s total lines of code (2008, p. 4). A
similar relationship could be found by Ortega, Gonzalez-Barahona & Robles in Wikipedia
between the number of unique authors and an article’s size (2007, p. 52). Consequently, the
following hypothesis can be derived from these findings:
Large groups often come up with better solutions than individual experts (Surowiecki 2005, p.
XVII; Tapscott & Williams 2008, p. 41). Similarly, open source projects benefit from peer
reviews by a large base of users (Senyard & Michlmayr 2004, p. 2). As Butler puts it, “in
larger social structures it is more likely that there is a member who knows the needed
information” (2001, p. 348). In the context of Wikipedia, Lih reasons that “with more editors,
there are more voices and different points of view for a given subject” (2004, p. 8) and
Wilkinson & Huberman (2007, p. 4) point out that there is a strong link between the number of
editors, edits and article quality. While large groups traditionally often failed to take
advantage of this fact due to logistical problems and various other adverse effects, today the
utilization of computer-mediated communication systems has the potential to significantly
reduce if not eradicate these problems (Butler 2001, pp. 349-350; Surowiecki 2005, pp. 275-277):
Hypothesis H1B: The quality of articles produced by an online community increases with
To sum up, even though the performance of large groups traditionally suffered due to
problems inherent in their structure, modern information and communication technology helps
to harness their wealth of resources.
Heterogeneity: As online communities often gather around shared interests (Cosley, Ludford
& Terveen 2003, p. 8) and even bear the risk of “balkanization”, i.e. an ongoing separation
into special interest groups (Van Alystyne & Brynjolfsson 2005, p. 851), it is interesting to
analyze how the heterogeneity of community members influences the outcome of their joint
effort. In a meta-analysis of effects of team diversity on team performance, Horwitz &
Horwitz found a significant positive influence on both quantity and quality of output (2007, p.
1000). Surowiecki highlights the importance of diversity and notes that it is especially
important in small groups as they are very prone to groupthink (2005, pp. 29, 36). He concludes that
This section takes a closer look at the characteristics of the individual members of the
analyzed communities. Several researchers have examined the activity of community members to
measure how engaged they are with the community (Cothrel & Williams 1999, p. 56; Preece
2001, pp. 350-351). This paper not only sheds light on the often analyzed activity and
participation of community members but also examines how focused their effort on the analyzed
community is.
Kittur et al. note that even though novice users tend to delete more words than they add, they
may still increase the quality of the output (2007, p. 6). Indeed, Anthony, Smith &
Williamson found that when it comes to the quality of contributions, low-edit anonymous users
(“Good Samaritans”) play an equally important role as committed registered Wikipedians
(“Zealots”) (2007, p. 15). While the quality of edits by anonymous users decreases with the
number of their overall edits, the quality of contributions by registered users points in the
opposite direction (p. 16). As this study does not distinguish between registered and anonymous
editors, it is expected that these effects cancel each other out and there is no significant
relationship between the quality of articles produced and the activity of users.
Hypothesis H4B: The quality of articles produced by an online community increases with
It has been noted that collaboration and social interaction between users is an important
requisite for the success of user communities (Füller, Jawecki & Mühlbacher 2007, p. 61) and
“not an issue that can be ignored” (Kollock 1996, 23rd paragraph). Similarly, Tapscott &
Williams point out that the success of Wikipedia “is built on the premise that collaboration
among users will improve content over time, in the way that the open source community steadily
improved Linus Torvalds’s first version of Linux” (2008, p. 71). Due to the comprehensive
record of activities in Wikipedia it is not only possible to analyze the “behaviour of
information producers” (Almeida, Mozafari & Cho 2007, p. 1) but also its effects. To address
the topic of collaboration, this section deals with the interactions between contributors and
the dynamics of contributions.
Interactivity: The development of ideas and innovations is often no solitary process but
benefits from the assistance of other community members. Franke & Shah for example found
that members of user communities do not innovate in isolation but rather receive crucial ad-
The impact of interactivity on the quantity of information in Wikipedia articles may be diluted
by “edit wars”, i.e. “interactions where two people or groups alternate between versions of the
page”, which are not restricted to controversial topics (Viégas, Wattenberg & Dave 2004, p.
579). However, the number of edit wars has dropped significantly in the last few years
(Viégas et al. 2007, p. 3). As a result, it is anticipated that the positive effects highlighted
above prevail and that not only the quantity but also the quality of the produced articles
increases with the level of interactivity:
Hypothesis H5B: The quality of articles produced by an online community increases with
Dynamics: In order to thrive, communities have to be dynamic (Mynatt et al. 1998, p. 128).
This study examines the dynamics within online communities by analyzing the distribution of
contributions over time. Members of the analyzed communities can either contribute
occasionally or collaborate intensively on the Wikipedia article within a short period of time.
Hypothesis H6B: The quality of articles produced by an online community increases with
This chapter explains the research method applied in this study in greater detail. A short
introduction of the research site is followed by sections dealing with the study design and the
data collection process. The chapter concludes with information on which measures were used
to operationalize each variable.
Wikipedia, “the free encyclopedia that anyone can edit” (Wikipedia 2009l), is one of the most
successful examples of massive collaborative content development (Ortega,
Gonzalez-Barahona & Robles 2008, p. 304) and the largest encyclopedia in the world (Tapscott &
Williams 2008, p. 71). It applies the “wiki” concept, invented by Cunningham, to allow users to
easily edit articles, while saving all changes and revisions in its database (Holloway,
Bozicevic & Börner 2007, p. 30). The history of each page provides a “design trace” of how the
article evolved (Garud, Jain & Tuertscher 2008, p. 361) and offers valuable information on
the editor, the time of the edit, and the changes committed (see figure 2 for an example).
As the aim of this study is to analyze the relationship between characteristics of online
communities and the quantity and quality of output they create, this paper utilizes Wikipedia
as a natural experiment to analyze a large number of communities with diverse characteristics.
Owing to Wikipedia’s increasing popularity, its article base has grown significantly over the
last few years (Viégas et al. 2007, p. 5) and consequently complete dumps of the English
Wikipedia have not only reached enormous file sizes that make them hard to analyze (den
Besten, Loubser & Dalle 2008, p. 8) but have even failed or have been corrupted recently
(Wikimedia 2008b, 2008c). Due to these disturbances and its more manageable size, this paper
focuses on the German-language Wikipedia, which is, following the English version, the
second biggest of all language editions (Wikimedia 2008d). Given enough computing time,
however, all the analyses conducted can easily be performed on the English version of
Wikipedia as well as on an even larger sample.
To minimize biases due to changes in Wikipedia’s popularity and user base, only revisions of
articles created in 2007 were analyzed. What is more, all articles edited by only one user were
excluded as they do not qualify as a community effort. Of the more than 160,000 remaining
articles, redirects to other articles were removed and a random sample of 5000 articles was
drawn.
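
The selection steps described above can be sketched as follows. This is a minimal
illustration, not the procedure documented in the appendix: the `ArticleMeta` record, its field
names, and the fixed random seed are hypothetical, and the sketch assumes that the per-article
metadata has already been extracted from the database.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ArticleMeta:
    """Hypothetical per-article metadata record."""
    title: str
    created_year: int
    distinct_editors: int
    is_redirect: bool


def draw_sample(articles, size=5000, seed=0):
    """Filter articles as described in the text and draw a random sample."""
    eligible = [
        a for a in articles
        if a.created_year == 2007       # only articles created in 2007
        and a.distinct_editors > 1      # single-user articles are excluded
        and not a.is_redirect           # redirects are removed
    ]
    rng = random.Random(seed)           # fixed seed for reproducibility (assumption)
    return rng.sample(eligible, min(size, len(eligible)))


if __name__ == "__main__":
    demo = [
        ArticleMeta("A", 2007, 3, False),
        ArticleMeta("B", 2006, 5, False),
        ArticleMeta("C", 2007, 1, False),
        ArticleMeta("D", 2007, 4, True),
    ]
    print([a.title for a in draw_sample(demo, size=2)])
```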
Complete database dumps of Wikipedia and its sister projects are provided online by the
Wikimedia Foundation Inc. (Wikimedia 2009). Even though dumps including all pages with
complete revision history are available, given the huge amount of data the
“stub-meta-history.xml.gz” dump was used, which does not include any page text but does
contain complete revision metadata. The dump from June 7th of the German Wikipedia
(Wikimedia 2008a) was downloaded in August 2008 and imported into a MySQL database using
the MWDumper tool (MediaWiki.org 2009).
Almeida et al. mention that Wikipedia dumps are often incomplete due to errors that occurred
during their generation (2007, p. 2). Similar problems were found in the dump analyzed in this
paper, where the table containing all pages was out of sync with the pages included in the
table storing all revisions. Consequently, distinct pages in the revision table were used as a
basis and, where necessary, missing values were queried from the Wikipedia API (Wikipedia
2009g).
To examine the output of each community in greater detail, the last revision of 2007 of each
article was downloaded using a Python script (Gude 2008) which was adapted to the German
version of Wikipedia. The resulting XML files include, among other things, information about
the article as well as the author, time and text (including wiki markup; see Wikipedia 2009v)
of each revision (examples can be found in the appendix).
Due to the large number of articles under study, a parser was developed in the Python
programming language to automatically obtain and analyze the files discussed above (an in-depth
explanation of this method can be found in the appendix). In this process, edits by bots, i.e.
“automated or semi-automated tools that carry out repetitive and mundane tasks” (Wikipedia
2009c), were identified on the basis of a recent user-group assignment list (Wikipedia
2009b) and omitted from the analyses. It is important to note, however, that these assignments
are not static and may have changed since 2007, so that some bots may not have been
recognized correctly by the parser.
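The bot filter described above amounts to a simple membership test. The following sketch assumes that each edit is a dictionary with a hypothetical `user` key and that the bot list has already been read into a collection of user names; neither detail is taken from the original parser.

```python
def drop_bot_edits(edits, bot_names):
    """Return only those edits whose author is not on the bot list."""
    bots = set(bot_names)  # set lookup keeps the filter fast for long lists
    return [edit for edit in edits if edit["user"] not in bots]
```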
Vandalism is another topic which needs to be addressed in this context. Due to the low entry
barrier Wikipedia is quite vulnerable to vandalism. However, due to the fact that all revisions
are stored in the database, malicious edits can be fixed easily and fast. Indeed Wikipedians do
a very good job as flawed articles are often amended within minutes (Viégas, Wattenberg &
Dave 2004, p. 579). Vandalism can occur in various forms and is often hard to detect auto-
matically as there is no crystal clear definition of vandalism in Wikipedia (pp. 578-579). To
reduce the number of false positives only two often unambiguous cases were marked as van-
dalism in this analysis:
More than 90% of content was deleted, the remaining text has less than 500 characters
and no meaningful comment was created (Wikipedia 2009w)
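The deletion rule listed above can be expressed as a small predicate. This is a sketch under stated assumptions: text length is measured in characters, and an empty edit comment stands in for "no meaningful comment" because the exact criterion is not spelled out here. The function name `is_mass_deletion` is hypothetical.

```python
def is_mass_deletion(prev_text, new_text, comment):
    """Flag an edit that deletes more than 90% of the previous content,
    leaves fewer than 500 characters and carries no edit comment."""
    if not prev_text:
        return False  # nothing could have been deleted
    deleted_share = 1 - len(new_text) / len(prev_text)
    has_comment = bool(comment.strip())  # assumption: any non-empty comment counts
    return deleted_share > 0.9 and len(new_text) < 500 and not has_comment
```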
3.4 Measures
This section deals with the way the discussed factors of performance and community charac-
teristics were operationalized and explains the applied metrics.
3.4.1.1 Information-Quantity
The following three variables were standardized and averaged to build the
Information-Quantity construct:
Number of words: To quantify the length of an article, the words included in the article were
counted. To this end, the markup was stripped from the HTML version of each article to yield
a plain text version. This text was split into individual words at every white-space character
with the help of regular expressions.
Vocabulary: The number of unique words was calculated accordingly. For simplicity,
no stemming was conducted and stop words were not removed.
Number of links: Regular expressions were used to determine the outgoing links on every
analyzed article page. Duplicate links and page-internal links were omitted.
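The construction of the Information-Quantity score can be sketched as follows. The sketch assumes plain-text articles and precomputed link counts as input, treats words as case-sensitive and uses the population standard deviation for standardization; these details, like the function names, are assumptions made for illustration only.

```python
import re
from statistics import mean, pstdev


def tokens(text):
    """Split plain text into words at every run of white-space characters."""
    return [w for w in re.split(r"\s+", text.strip()) if w]


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def information_quantity(texts, link_counts):
    """Average the standardized word count, vocabulary size and link count."""
    n_words = [len(tokens(t)) for t in texts]
    vocabulary = [len(set(tokens(t))) for t in texts]  # unique words, no stemming
    columns = [zscores(n_words), zscores(vocabulary), zscores(link_counts)]
    return [mean(row) for row in zip(*columns)]
```

Because every component is standardized first, an article with an average number of words, unique words and links receives a score close to zero.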
The Article-Quality construct was created by standardizing and averaging the following two
variables:
Readability: The Flesch reading ease is a function of the average sentence length (ASL;
words per sentence) and the average number of syllables per word (ASW) (den Besten, Loubser
& Dalle 2008, p. 10):
\[ \text{Reading ease} = 206.835 - 1.015 \cdot \text{ASL} - 84.6 \cdot \text{ASW} \]
This formula yields a number between 0 (very difficult) and 100 (very easy), with standard
English texts usually scoring a number between 60 and 70 (den Besten, Loubser & Dalle
2008, p. 11). Flesch readability scores were calculated for the plain text versions of each arti-
cle’s last revision in 2007 using an online tool (stilversprechend.de 2009a) which applies an
adapted version of the formula for the German language (stilversprechend.de 2009b):
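The readability computation can be sketched as follows. The first function uses the standard Flesch coefficients for English. The second uses Amstad's adaptation, which is commonly cited for German texts; that the online tool applies exactly these coefficients is an assumption. Counting sentences and syllables is left to the caller, since a reliable syllable counter is beyond the scope of this sketch.

```python
def flesch_english(n_words, n_sentences, n_syllables):
    """Standard Flesch reading ease for English text."""
    asl = n_words / n_sentences   # average sentence length
    asw = n_syllables / n_words   # average syllables per word
    return 206.835 - 1.015 * asl - 84.6 * asw


def flesch_german_amstad(n_words, n_sentences, n_syllables):
    """Amstad's German adaptation (assumed here, not confirmed by the source)."""
    asl = n_words / n_sentences
    asw = n_syllables / n_words
    return 180.0 - asl - 58.5 * asw
```

For very short or list-like texts the raw result can fall outside the range from 0 to 100, so a caller may want to clip it.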
Number of categories: In Wikipedia an article can be placed into a category by adding a spe-
cific category tag (“[[Kategorie:Category name]]“ in the German language version, Wikipedia
2009f) to the page. Occurrences of these tags were counted in the last revision of 2007 to cal-
culate the number of categories for each analyzed article.
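Counting category tags in the wiki markup can be done with a single regular expression. The pattern below is an illustrative assumption rather than the expression used in this study; it accepts an optional sort key after a pipe character and ignores plain links to category pages, which start with a colon.

```python
import re

# Matches [[Kategorie:Name]] and [[Kategorie:Name|Sortkey]], but not [[:Kategorie:Name]]
CATEGORY_TAG = re.compile(
    r"\[\[\s*Kategorie\s*:[^\]|]+(?:\|[^\]]*)?\]\]", re.IGNORECASE
)


def count_categories(wikitext):
    """Return the number of category tags found in the wiki markup."""
    return len(CATEGORY_TAG.findall(wikitext))
```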
Size: Users who want to contribute to an article in Wikipedia have two options: they can ei-
ther sign up to Wikipedia or choose to remain anonymous. Whereas in the former case their
username is associated with their revisions, their IP address is stored in the latter case. It has
Heterogeneity: On average, users in the analyzed sample edited 265 articles in 2007.
Consequently, the 265 most important articles (i.e. the articles most members of the community
contributed to during the year 2007) were queried from the created tables with standard SQL
statements when analyzing the heterogeneity of the members of a community. In the
next step a vector was created for each community user depicting his or her editing pattern in
those articles: the number of edits for each article he or she co-authored and “0” for each
article he or she did not edit. These
vectors of edited articles can be understood as areas of “common interest” (Korfiatis, Poulos
& Bokos 2006, p. 256), “interest profiles” (Cosley, Ludford & Terveen 2003, p. 2) or
“knowledge profiles” (Van Alstyne & Brynjolfsson 2005, p. 854). The similarity of each
user to every other user was then computed by calculating the cosine of each pair of knowledge
profiles (Manning & Schütze 2003, p. 300):
\[ \cos(x, y) = \frac{x \cdot y}{|x| \, |y|} \]
This measure, often called cosine similarity, is the cosine of the angle between two vectors and
has already been used in other studies to analyze the similarity of community members (for
example in: Cosley, Ludford & Terveen 2003, p. 4; Van Alstyne & Brynjolfsson 2005, p.
854). Similar to Van Alstyne & Brynjolfsson’s approach, groups of users are compared in
this paper by the average similarity of their profiles (2005, p. 854). The heterogeneity was
then calculated by subtracting a community’s average similarity (a number between 0 and 1)
from 1:
\[ \text{Heterogeneity} = 1 - \overline{\cos(x, y)} \]
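The heterogeneity measure described above can be sketched in a few lines. The sketch assumes that each knowledge profile is a list of edit counts of equal length and that the average is taken over all unordered pairs of users. Treating the similarity to an all-zero profile as 0 is an additional assumption, as this case is not discussed in the text.

```python
from itertools import combinations
from math import sqrt


def cosine(x, y):
    """Cosine of the angle between two edit-count vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0  # assumption: zero profile -> similarity 0


def heterogeneity(profiles):
    """One minus the average pairwise cosine similarity of a community."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("a community needs at least two members")
    average_similarity = sum(cosine(x, y) for x, y in pairs) / len(pairs)
    return 1 - average_similarity
```

Two users who edited entirely different articles yield a heterogeneity of 1, whereas users with proportional editing patterns yield a value close to 0.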
Activity: To measure the general activity of community members, the number of edits in all
articles in Wikipedia in the year 2007 per user was queried from the database and averaged
for each community. Suppose a community consists of two contributors: A made 100
edits in Wikipedia articles in 2007, whereas B contributed 200 times. The activity of users in
this community hence amounts to (100 + 200) / 2 = 150 edits.
Focus: To assess the level of commitment in each community, the average proportion of its
members’ activity that took place in the analyzed community was calculated. To continue the
previous example: A and B each made 10 edits in the analyzed article. The focus of users in
this community hence amounts to 0.075, the mean of A’s share (10/100 = 0.1) and B’s share
(10/200 = 0.05).
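The two user-centered measures can be reproduced with the numbers from the example. The dictionaries and function names below are hypothetical; they only illustrate the averaging described in the text.

```python
from statistics import mean


def activity(total_edits):
    """Average number of Wikipedia-wide edits per community member."""
    return mean(list(total_edits.values()))


def focus(total_edits, article_edits):
    """Average share of each member's edits made in the analyzed article."""
    return mean([article_edits[user] / total_edits[user] for user in article_edits])


total_edits = {"A": 100, "B": 200}    # edits in all articles in 2007
article_edits = {"A": 10, "B": 10}    # edits in the analyzed article
```

With these inputs, `activity(total_edits)` gives 150 and `focus(total_edits, article_edits)` gives approximately 0.075, matching the worked example.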
Interactivity: All edits in 2007 of each article in the sample were analyzed in this paper. In a
first step the number of interactive edits was counted, i.e. the first edit and all edits that were
preceded by an edit of another community member. The level of interactivity was then
calculated as the ratio between the number of interactive edits minus the number of distinct
authors and the total number of edits:
\[ \text{Interactivity} = \frac{\text{interactive edits} - \text{distinct authors}}{\text{total edits}} \]
Let us suppose that four users (A, B, C, D) created an article and the revision history reveals
the following eight edits: A B C D A B C D. The interactivity level of this example amounts
to 0.5 as all edits by those four contributors are interactive edits ([8-4]/8).
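The interactivity level can be computed directly from the chronological list of edit authors. This is a minimal sketch of the definition given above; the function name is hypothetical.

```python
def interactivity(authors):
    """Interactivity level of an article, given its edit authors in order."""
    if not authors:
        raise ValueError("at least one edit is required")
    # The first edit counts as interactive, as does every edit that
    # follows an edit by a different author.
    interactive = 1 + sum(
        1 for prev, cur in zip(authors, authors[1:]) if prev != cur
    )
    return (interactive - len(set(authors))) / len(authors)
```

For the revision history A B C D A B C D this returns 0.5, as in the example. A history such as A A B B, in which each author works in a single block, returns 0.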
Dynamics: To assess the dynamics within communities the median time between edits was
calculated for each article. This metric allows analyzing whether community members inten-
sively edited the article in a short period of time or if their efforts were distributed over the
whole year 2007. As less dynamic communities exhibit higher median times between edits the
yielded figure was multiplied by -1 to ease interpretation.
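The dynamics measure can be sketched as follows. The sketch assumes that edit timestamps are available as `datetime` objects and expresses the gaps in seconds; the unit used in this study is not stated here, so the choice of seconds is an assumption.

```python
from datetime import datetime
from statistics import median


def dynamics(timestamps):
    """Negative median time between consecutive edits, in seconds."""
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        raise ValueError("at least two edits are required")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return -median(gaps)  # sign flipped so that higher means more dynamic
```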
The following chapter presents the results of this study in two sections. While the first sec-
tion, Descriptive Statistics, provides an in-depth descriptive analysis of the articles and com-
munities in the sample, the second section, Inferential Statistics, contains the results of two
ordinary least squares (OLS) regressions used to statistically test the developed hypotheses on
the relationships between characteristics of online communities and their performance.
The random sample drawn from the Wikipedia database consists of 5000 articles that were
created in 2007 and edited by more than one user. Due to these sampling criteria it is no sur-
prise that the average article age is slightly skewed towards older articles that had more time
to attract enough contributors and amounts to 195.81 days (standard deviation: 105.55) with a
minimum of 0.43 days and a maximum of 364.94 days. Furthermore, due to the fact that
Wikipedia is the largest encyclopedia in the world (Tapscott & Williams 2008, p. 71) and is
still growing (Wikipedia 2009i), it is plausible that articles created in 2007, as evident from
the sample, often cover very specific, niche topics or recent events.
Information-Quantity: The average article in the sample consists of 382.77 words (s.d.:
543.64). There is, however, also an article without any words in the sample, as its last revi-
sion of the year 2007 did not contain any content. The longest article deals with an in-depth
description of the course of the NHL season 2007 (11784 words). What is more, 220.06
unique words (s.d.: 216.78) are used on average in each article. While the minimum is again
0, stemming from the empty article discussed above, the article with the most unique words
contains a table on Chinese Unicode characters (3738 words). The average number of unique
links amounts to 33.2 (s.d.: 37.23) per article. The minimum number of links is once more 0
due to the empty article, while the article with the highest number of unique links lists all fe-
male Olympic medalists in athletics (789 links).
Article-Quality: The average Flesch readability score of articles in the sample amounts to
56.97 (s.d.: 11.42) depicting a reasonable readability (stilversprechend.de 2009b). For six arti-
cles, however, no valid readability values could be determined due to insufficient length of
the articles’ content. What is more, looking at the most extreme outliers (min: 5; max: 100)
reveals that articles consisting of mere tables and lists cannot be assessed well with the help of
the Flesch readability function. On average articles in the sample are placed into 3.05 catego-
ries (s.d.: 2.15). While there are several articles in the sample which are not part of any cate-
gory, an article dealing with the achievements of a German silviculture scientist shows the
highest number of categories (18).
The hypotheses developed in chapter 2 were statistically tested with the help of two OLS-
regressions. While the first regression deals with the influences of community characteristics
on the quantity of information produced, the second regression examines their effects on the
articles’ quality. Table 1 summarizes the test results derived from these two regressions:
[Table 1: Results of the two OLS regressions. Dependent variables: Information-Quantity and
Article-Quality; predictors grouped into community-centered, user-centered and
collaboration-centered characteristics.]
Notes: † p < .10, * p < .05, ** p < .01, *** p < .001 (all two-tailed tests); articles: n = 5000;
values are standardized coefficients (β-values); predictors were standardized before entry.
The fit indices for the OLS-regression on Information-Quantity indicate a good fit of the
model with an adjusted R-square of 0.103 (F-Value: 74.175; p < 0.001).
The test of H1A revealed that the quantity of information produced by an online community,
as predicted, increases with the number of contributors (β= 0.290***).
Furthermore, the coefficient for the quadratic term of heterogeneity shows the expected nega-
tive sign, which is indicative of an inverted U-shaped relationship between Information-
Quantity and heterogeneity (Aiken, West & Reno 1991, p. 65). Using differential calculus, the
maximum point of the inverted U can easily be calculated (Aiken, West & Reno 1991, p. 65;
Eisinga, Scheepers & van Snippenburg 1991, p. 113) and is located at a heterogeneity level of
0.58. In order to be able to compare the effect of heterogeneity with the standardized regres-
sion coefficients of other predictors, the method outlined in Eisinga, Scheepers & van Snip-
penburg (1991, p. 109) was used to obtain a composite effect of the linear and quadratic term.
The standardized regression coefficient of this combined effect amounts to 0.064 and is
highly significant (p < 0.001). Note that even though the sign of this coefficient is a technical
artifact (Eisinga, Scheepers & van Snippenburg 1991, p. 110) and is hence not related to the
sign of the relationship between independent and dependent variable, its size allows
investigating the relative importance of heterogeneity for the explanation of the dependent variable
Information-Quantity. These results support H2A.
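The turning point mentioned above follows from setting the first derivative of the fitted quadratic to zero. As a sketch, let $b_1$ and $b_2$ denote the coefficients of the linear and quadratic heterogeneity terms and $h$ the heterogeneity level, with all other predictors held constant (the symbols are chosen here for illustration):

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 h + b_2 h^2 + \dots \\
\frac{\partial \hat{y}}{\partial h} &= b_1 + 2 b_2 h = 0
  \quad\Longrightarrow\quad h^{*} = -\frac{b_1}{2 b_2} \\
\frac{\partial^2 \hat{y}}{\partial h^2} &= 2 b_2 < 0
  \quad \text{(a maximum, since } b_2 < 0\text{)}
\end{align*}
```

If the coefficients refer to standardized predictors, $h^{*}$ is obtained on the standardized scale and has to be transformed back to the original scale before it can be read as a heterogeneity level between 0 and 1.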
Even though positive impacts of the activity and focus of community members on the
quantity of output were predicted, the analysis revealed negative relationships (activity:
β= -0.027†; focus: β= -0.054***). Consequently, H3A and H4A had to be rejected.
The analysis provides support for H5A, as the level of interactivity within the examined
communities has a highly significant positive influence on the information quantity (β=
0.089***).
H6A, however, which predicted a positive influence of collaboration dynamics did not find
empirical support as the relationship was not significant (β= 0.021; p= 0.139).
The control variable article age exerts a negative influence on the quantity of information (β=
-0.059***).
The fit indices for the OLS-regression on Article-Quality indicate a low fit of the model with
an adjusted R-square of 0.023 (F-Value: 15.967; p < 0.001).
The analysis reveals that the quality of the articles produced, as predicted, increases with the
number of contributors (β= 0.080***). Hence, H1B was supported by the data.
Again, the coefficient for the quadratic term of heterogeneity shows the expected negative
sign, suggesting an inverted U-shaped relationship between Article-Quality and heterogeneity.
The maximum point of the inverted U is located at a heterogeneity level of 0.66. What is
more, the standardized regression coefficient of the combined effect amounts to 0.062 and is
highly significant (p < 0.001). These results support H2B.
H4B, which predicted a positive influence of focus of community members, had to be rejected
as the impact turned out to be negative (β= -0.102***).
H5B, positing a positive influence of interactivity, did not find empirical support in the data.
Even though the sign is as expected, the effect is not significant (β= 0.020; p= 0.153).
The expected positive influence of the collaboration dynamic (H6B) found support in the data
(β= 0.076***).
Article age was controlled for but showed no significant impact on the quality of produced
articles (β= -0.008; p= 0.601).
The aim of this study was to investigate the relationship between the characteristics of online
communities and their performance. Therefore the output and characteristics of 5000 commu-
nities gathering around Wikipedia articles were analyzed from different perspectives. The
results demonstrate that the number of users is by far the most influential force that drives
content creation. When it comes to the quality of the created output, however, characteristics
of community members and how they collaborate are as important as the sheer number of
contributors. The following paragraphs review and discuss these and other findings in greater
detail.
Amongst others, the study revealed that the quantity of information created by an online com-
munity is related to a number of community characteristics. Table 2 summarizes those find-
ings:
Both hypotheses regarding community-centered characteristics found support in the data. The
analysis showed that the size of the community has by far the biggest influence among all
factors, with larger communities tending to create more output. Furthermore, the output of
When it comes to the hypotheses concerning user-centered characteristics, neither the
expected positive influence of activity nor the posited positive influence of focus was supported
by the data as both influences turned out to be negative. In their analyses Kittur et al. found
that more active and experienced users tend to add more content than novice users (2007, p.
6). They, however, calculated these numbers over the whole time Wikipedia had been in exis-
tence and this reported trend may have shifted over recent years, especially in newly created
articles that, as already discussed above, nowadays often cover very specific, niche topics.
What is more, as the activity of community members in this study was measured as the num-
ber of contributions in 2007, this finding may be diluted by the fact that experienced users
that were very active in previous years and curbed their activity in 2007 were counted as oc-
casional contributors. Regarding the focus of community members on specific articles it was
expected that specialization leads to an increase in the output created. However, it turned out
that online communities benefit if their members are not too focused on a task. It seems as if
not only experts in a field but also novice users can contribute considerably to an open-
content project. Nevertheless, further research is needed to clarify the impact of activity and
focus on the quantity of content created.
What is more, whereas evidence of the positive influence of interactivity on the quantity of
created content was found, the impact of dynamics was positive but not significant. Thus, it
could be shown that the output increases as community members do not work in solitary con-
ditions but assist each other and collaborate interactively.
If at all, a positive impact of the control variable article age was expected. The negative influ-
ence found, however, points out that even young communities can be very productive. Some
of the articles examined may not only have grown over time, but also might have been short-
ened again. These shrinkages can happen if text is deleted or more dramatically if an article is
split and large sections of it are moved to a more specific page (Viégas, Wattenberg & Dave
2004, p. 580).
The analysis provided support for both hypotheses regarding the impact of community-
centered characteristics. Larger communities tend to create output of higher quality. In con-
trast to the results on Information-Quantity, however, community size is not the most impor-
tant factor. Again, evidence for an inverted U-shaped curvilinear relationship between the
heterogeneity of community members and the quality of their output was found. This finding
highlights the importance of a moderate level of heterogeneity in an online community.
Regarding the user-centered characteristics, as predicted, the average activity had no signifi-
cant influence on the quality of the communities’ output. Furthermore, it became evident that
the posited positive influence of focus is in fact negative and the most important of all factors
influencing an article’s quality. Concerning the influence of activity additional research is
needed to test whether a distinction between user-groups can replicate the findings of An-
thony, Smith & Williamson who found that the quality of edits by anonymous users decreases
with the number of their overall edits while the quality of contributions by registered users
points in the opposite direction (2007, p. 16). Similarly to the findings on Information-
“It turns out that in some ways, analytic skills and neutrality often play a greater role
than specialisation; editors who have worked for a time on a variety of articles usually
become quite capable of making good quality editorial decisions regarding specialist
material, even on unfamiliar technical subjects” (Wikipedia 2009m).
However, every article needs some experts that watch for and correct errors (Wikipedia
2009m). In line with these findings, Williams & Cothrel (2000, p. 90) stress the importance of
maintaining a balance between experts and novice users. Nevertheless, further research is
needed to clarify these connections in greater detail.
Finally, the effects of both collaboration-centered characteristics, interactivity and dynamics,
show the expected positive trend. The effect of interactivity, however, is not significant and
hence needs further clarification. The importance of intensive collaboration in online
communities is further stressed by the evident positive impact of its dynamics on the quality of
the produced output.
Following this discussion of the link between characteristics of online communities and their
performance, it is of interest to examine which communities generate both extensive and
high-quality content. Therefore, the significant effects of community characteristics on both
performance indicators, Information-Quantity and Article-Quality, found in this study are
summarized in table 4.
                          Information-Quantity   Article-Quality
Community-Centered
  # Users                          +                    +
  Heterogeneity                    ∩                    ∩
User-Centered
  Activity                         -
  Focus                            -                    -
Collaboration-Centered
  Interactivity                    +
  Dynamics                                              +
Looking at this table it becomes evident that when taking both quantity and quality into ac-
count, those communities perform best that consist of a large number of users with a moderate
level of heterogeneity and a fair share of occasional and novice contributors who operate in a
variety of fields and collaborate interactively and dynamically.
While previous approaches often examined particular aspects of online communities, this
study introduced a framework to combine several of these perspectives to analyze the link
between community characteristics and performance. Furthermore, it extended the current
literature on online communities by utilizing Wikipedia as a massive experiment to analyze
5000 diverse communities and thereby empirically testing and substantiating these relation-
ships.
Even though additional research is needed to further clarify certain findings, it was shown that
not only general characteristics of online communities but also user specific characteristics
Future research may build on these findings and the approach applied in this study to develop
an even more detailed framework for analyzing the link between characteristics of online
communities and their performance.
In this study a scalable approach was introduced to automatically analyze diverse
communities in online environments. Thanks to the availability of and easy access to its database,
Wikipedia is a unique source of data that proved to be a good research site for natural
experiments and yielded considerable insights into the link between characteristics and performance
of online communities.
The most accurate analysis of communities in Wikipedia could be gained from analyzing all
the available data. However, the databases of all popular language editions have grown to
enormous sizes and hence working with a sample seems to be the best way to proceed. Due to
limited computing resources and the large size of the English Wikipedia database, it was
decided in this study to analyze a sample of 5000 communities in the German version of
Wikipedia. Given more computing time, however, the efficient and scalable approach applied
in this study makes it easy to extend the analyses to a bigger sample or to other language
editions in order to compare and validate the results. What is more, the boundaries
of the analyzed communities can be enlarged to examine not only communities gathering around
individual articles but also larger communities, e.g. in WikiProjects, which are collections of
articles that deal with specific topics (Wikipedia 2009t).
As this study analyzed the output of communities based on the latest version of the year 2007,
advancing the introduced method to see how the output changes and evolves over time and in
years to come may yield additional insights. This information can be easily extracted from the
collected data by the developed parser.
The results of this study show that the performance of online communities depends not only on
general community characteristics like size and heterogeneity but also on more user-specific
characteristics such as activity and focus. What is more, how these users collaborate plays an
important role in influencing content quantity and quality. These findings suggest that
community operators can proactively influence the performance of online communities by
providing favorable conditions.
As already mentioned before, Wikipedia is one of the most successful examples of mass-
collaboration, most likely due to the favorable conditions provided by its operators and the
software used. While the low entry barriers for contributors, for example, allow novice users to
contribute without going through a lengthy sign-up process often found in other online
environments, the applied “wiki”-concept ensures that they cannot do any real harm. What is
more, several tools like watch lists and revision histories support contributors in collaborating
interactively and dynamically.
Community operators can learn from the presented results and Wikipedia, as a best-practice
example, to apply appropriate strategies and tools in their effort to influence the performance
of communities.
When involving online communities in their innovation process, businesses generally have
two distinct options: they can either try to find and harness an already existing community or
attempt to build their own (Franke 2005, p. 708). Either way, results of this study imply that
they should aim for the following characteristics of online communities to foster the creation
of content which is both extensive and of good quality:
Low entry barriers that allow both novice users and experts to collaborate on various
tasks and topics
An environment which not only supports but fosters interactive and dynamic collabo-
ration
5.4 Limitations
The methods employed in this study have a number of inherent limitations and involve a
number of assumptions that are challenged and discussed in the following paragraphs.
This study used cross-sectional data to examine the link between several community charac-
teristics and the performance of online communities. Even if most of the developed hypothe-
ses were supported by the data and several meaningful correlations could be found, it could
still be that this study mixed up cause and effect. Longer articles for example may attract
more contributors than shorter articles and not the other way around. Longitudinal analyses
may allow stronger causal claims than the approach applied.
Due to the fact that this analysis was not a controlled experiment in a laboratory setting but
rather a natural experiment, not all variables could be controlled. External influences can
hence not be ruled out and unmeasured variables may have had a significant impact on the
results. Especially the low fit of the OLS-regression on Article-Quality highlights that some
important variables may have been omitted.
To keep the research design concise and easy to understand only the main effects of inde-
pendent variables were analyzed in this study. However, during the analyses it became evi-
dent that there could be significant interactions between several community characteristics
discussed in this paper. Including interactions between these variables in the analyses may
yield a more complex, yet more comprehensive model and increase its fit with the data.
Social scientists often complain about the difficult access to data for research. For
Wikipedia, quite the opposite is true. Even though full dumps of Wikipedia and its sister
projects and hence comprehensive records of collaboration are available, the enormous
amount of data is quite hard to handle. The scalable approach introduced in this study can be
enhanced and applied to several interesting research questions.
Wikipedians have recently started a project to assess every article in Wikipedia (Wikipedia
2009k). While this scheme has not yet been adopted in the German language version
(Wikipedia 2009a) and could hence not be used in this study, future studies can draw upon
this valuable resource to better quantify the performance of online communities.
Furthermore, more and more Wikipedians gather around WikiProjects, collections of articles
that deal with specific topics (Wikipedia 2009t). Analyzing these large communities of inter-
est in combination with the widely used article assessments may yield additional insights.
Regarding the used measures, future research could dig deeper e.g. by analyzing the access
levels of contributing users (Wikipedia 2009r), the number of barn stars (Wikipedia 2009o)
they have received and the type of comments on their user pages (Wikipedia 2009s) to de-
scribe the characteristics of community users in greater detail.
What is more, each article in Wikipedia has a talk page that is used for editorial coordination
(Wikipedia 2009q). These talk pages are another valuable resource to analyze collaboration
characteristics. Further research may relate discussions on talk pages to the creation of content
in the article to gain valuable insights.
Aiken, L.S., West, S.G. & Reno, R.R. 1991, Multiple regression: testing and interpreting
interactions, SAGE Publications Newbury Park, CA.
Almeida, R.B., Mozafari, B. & Cho, J. 2007, 'On the Evolution of Wikipedia', International
Conference on Weblogs and Social Media, Boulder, Colorado, USA,
<http://www.icwsm.org/papers/2--Almeida-Mozafari-Cho.pdf>.
Anthony, D., Smith, S.W. & Williamson, T. 2007, The Quality of Open Source Production:
Zealots and Good Samaritans in the Case of Wikipedia,
<http://www.cs.dartmouth.edu/reports/TR2007-606.pdf>.
Bruckman, A. 2006, 'A New Perspective on “Community” and its Implications for
Computer-Mediated Communication Systems', paper presented to the CHI 2006, Montréal,
Québec, Canada,
<http://www.cc.gatech.edu/~asb/papers/bruckman-community-chi06.pdf>.
Buriol, L.S., Castillo, C., Donato, D., Leonardi, S. & Millozzi, S. 2006, 'Temporal Analysis of
the Wikigraph', paper presented to the 2006 IEEE/WIC/ACM International Conference
on Web Intelligence, Hong Kong,
<http://www.inf.ufrgs.br/~buriol/papers/buriol_2006_temporal_analysis_wikigraph.pdf>.
Butler, B.S. 2001, 'Membership Size, Communication Activity, and Sustainability: A
Resource-Based Model of Online Social Structures', Information Systems Research, vol.
12, no. 4, pp. 346-362.
Cosley, D., Ludford, P. & Terveen, L. 2003, 'Studying the Effect of Similarity in Online
Task-Focused Interactions', 2003 international ACM SIGGROUP conference on
Supporting group work, Sanibel Island, Florida, USA, pp. 321-329,
<http://www.grouplens.org/papers/pdf/simex-group2003.pdf>.
Cothrel, J.P. 2000, 'Measuring the success of an online community', Strategy & Leadership,
vol. 28, no. 2, pp. 17-21.
den Besten, M., Loubser, M. & Dalle, J.-M. 2008, Wikipedia as a Distributed
Problem-Solving Network,
<http://www.oii.ox.ac.uk/downloads/index.cfm?File=research/dpsn/Wikipedia_full.pdf>.
Eisinga, R., Scheepers, P. & van Snippenburg, L. 1991, 'The standardized effect of a
compound of dummy variables or polynomial terms', Quality & Quantity, vol. 25,
pp. 103-114.
Franke, N. 2005, 'Open Source & Co.: Innovative User-Netzwerke', in S. Albers & O.
Gassmann (eds), Handbuch Technologie- und Innovationsmanagement, Gabler,
Wiesbaden, pp. 695-712.
Franke, N. & Shah, S. 2003, 'How communities support innovative activities: an exploration
of assistance and sharing among end-users', Research Policy, vol. 32, no. 1,
pp. 157-178.
Füller, J., Jawecki, G. & Mühlbacher, H. 2007, 'Innovation creation by online basketball
communities', Journal of Business Research, vol. 60, no. 1, pp. 60-71.
Füller, J., Matzler, K. & Hoppe, M. 2008, 'Brand Community Members as a Source of
Innovation', Journal of Product Innovation Management, vol. 25, no. 6, pp. 609-619.
Garud, R., Jain, S. & Tuertscher, P. 2008, 'Incomplete by Design and Designing for Incom-
pleteness', Organization Studies, vol. 29, pp. 351-371.
Giles, J. 2005, 'Internet encyclopaedias go head to head', Nature, vol. 438, pp. 900-901.
Horwitz, S.K. & Horwitz, I.B. 2007, 'The Effects of Team Diversity on Team Outcomes: A
Meta-Analytic Review of Team Demography', Journal of Management, vol. 33, no. 6,
pp. 987-1015.
Kittur, A., Chi, E., Pendleton, B.A., Suh, B. & Mytkowicz, T. 2007, 'Power of the Few vs.
Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie', CHI 2007, San
Jose, CA, <http://www.parc.com/research/publications/files/5904.pdf>.
Kollock, P. 1996, 'Design Principles for Online Communities', First International Harvard
Conference on the Internet and Society, Boston, USA,
<http://www.sscnet.ucla.edu/soc/faculty/kollock/papers/design.htm>.
Korfiatis, N.T., Poulos, M. & Bokos, G. 2006, 'Evaluating authoritative sources using social
networks: an insight from Wikipedia', Online Information Review, vol. 30, no. 3, pp.
252-262.
Kozinets, R.V. 1999, 'E-Tribalized Marketing?: The Strategic Implications of Virtual Com-
munities of Consumption', European Management Journal, vol. 17, no. 3, pp. 252–
264.
Lakhani, K.R. & Panetta, J.A. 2007, 'The Principles of Distributed Innovation', Innovations,
vol. 2, no. 3, pp. 97-112.
Licklider, J.C.R. & Taylor, R.W. 1968, 'The Computer as a Communication Device', Science
and Technology, pp. 21-41.
Lih, A. 2004, 'Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluat-
ing collaborative media as a news resource', paper presented to the 5th International
Symposium on Online Journalism, University of Texas at Austin, USA, April 16-17,
2004.
Ludford, P.J., Cosley, D., Frankowski, D. & Terveen, L. 2004, 'Think different: increasing
online community participation using uniqueness and group dissimilarity', SIGCHI
conference on Human factors in computing systems, ACM, Vienna, Austria, pp. 631-
638, <http://grouplens.org/papers/pdf/thinkdifferent-chi2004.pdf>.
Manning, C.D. & Schütze, H. 2003, Foundations of Statistical Natural Language Processing,
MIT Press, Cambridge, MA.
Mateos Garcia, J. & Steinmueller, W.E. 2003, 'Applying the open source development model
to knowledge work.' INK Open Source Research Working Paper No. 2,
<http://www.sussex.ac.uk/Units/spru/publications/imprint/sewps/sewp94/sewp94.pdf>
Mockus, A., Fielding, R.T. & Herbsleb, J. 2000, 'A Case Study of Open Source Software
Development: The Apache Server', The 22nd International Conference on Software
Engineering, Limerick, Ireland, <http://mockus.us/papers/apache.pdf>.
Mynatt, E.D., O'Day, V.L., Adler, A. & Ito, M. 1998, 'Network Communities: Something
Old, Something New, Something Borrowed . . .' Computer Supported Cooperative
Work (CSCW), vol. 7, no. 1-2, pp. 123-156.
Nambisan, S. 2002, 'Designing Virtual Customer Environments for New Product Develop-
ment: Toward a Theory', Academy of Management Review, vol. 27, no. 3, pp. 392-
413.
Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2007, 'The Top Ten Wikipedias: A
Quantitative Analysis Using WikiXRay', ICSOFT, Barcelona, Spain, pp. 46-53,
<http://libresoft.es/oldsite/downloads/C4_159_Ortega.pdf>.
Ortega, F., Gonzalez-Barahona, J.M. & Robles, G. 2008, 'On the Inequality of Contributions
to Wikipedia', 41st Annual Hawaii International Conference on System Sciences,
Honolulu, Hawaii, p. 304, <http://libresoft.es/downloads/Ineq_Wikipedia.pdf>.
Preece, J. 2001, 'Sociability and usability in online communities: determining and measuring
success', Behaviour & Information Technology, vol. 20, no. 5, pp. 347-356.
Preece, J. & Maloney-Krichmar, D. 2005, 'Online Communities: Design, Theory, and Prac-
tice', Journal of Computer-Mediated Communication, vol. 10, no. 4, p. article 1.
Preece, J., Maloney-Krichmar, D. & Abras, C. 2003, History and emergence of online com-
munities, Berkshire Publishing Group, Sage,
<http://www.ifsm.umbc.edu/~preece/paper/6%20Final%20Enc%20preece%20et%20al.pdf>.
Rashid, A.M., Ling, K., Tassone, R.D., Resnick, P., Kraut, R. & Riedl, J. 2006, 'Motivating
Participation by Displaying the Value of Contribution', CHI 2006, ACM, Montréal,
Québec, Canada, pp. 955-958,
<http://www.si.umich.edu/~presnick/papers/CHI06/rashidAl.pdf>.
Raymond, E.S. 2000, The Cathedral and the Bazaar (Electronic Version), viewed 07.12.2008,
<http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/ar01s04.html>.
Schoberth, T., Preece, J. & Heinzl, A. 2003, 'Online Communities: A Longitudinal Analysis
of Communication Activities', 36th Annual Hawaii International Conference on Sys-
tem Sciences, Big Island, Hawaii,
<http://www.ifsm.umbc.edu/~preece/paper/9%20HICSSNOCD06v2.pdf>.
Senyard, A. & Michlmayr, M. 2004, 'How to Have a Successful Free Software Project', 11th
Asia-Pacific Software Engineering Conference (APSEC’04), Busan, Korea,
<http://kb.cospa-project.org/retrieve/2450/senyardmichlmay.pdf>.
Stvilia, B., Twidale, M.B., Smith, L.C. & Gasser, L. 2005, 'Assessing information quality of a
community-based encyclopedia', International Conference on Information Quality,
Cambridge, England, pp. 442-454,
<http://www.isrl.uiuc.edu/~stvilia/papers/quantWiki.pdf>.
Tapscott, D. & Williams, A.D. 2008, Wikinomics: How Mass Collaboration Changes Every-
thing, Penguin Group, New York.
Van Alstyne, M. & Brynjolfsson, E. 2005, 'Global Village or Cyber-Balkans? Modeling and
Measuring the Integration of Electronic Communities', Management Science, vol. 51,
no. 6, pp. 851-868.
Viégas, F.B., Wattenberg, M. & Dave, K. 2004, 'Studying Cooperation and Conflict between
Authors with history flow Visualizations', SIGCHI conference on Human factors in
computing systems, vol. 6, ACM, Vienna, Austria, pp. 575-582,
<http://alumni.media.mit.edu/~fviegas/papers/history_flow.pdf>.
Viégas, F.B., Wattenberg, M., Kriss, J. & Ham, F.v. 2007, 'Talk Before You Type: Coordina-
tion in Wikipedia', 40th Hawaii International Conference on System Sciences, Hono-
von Hippel, E. 2001, 'Innovation by User Communities: Learning from Open-Source Soft-
ware', MIT Sloan Management Review, vol. 42, no. 4, pp. 82-86.
von Krogh, G., Spaeth, S. & Lakhani, K.R. 2003, 'Community, joining, and specialization in
open source software innovation: a case study', Research Policy, vol. 32, pp. 1217-
1241.
Voss, J. 2005, 'Measuring Wikipedia', paper presented to the International Conference of the
International Society for Scientometrics and Informetrics : 10th, Stockholm (Sweden),
24-28 July 2005, <http://eprints.rclis.org/3610/1/MeasuringWikipedia2005.pdf>.
Wang, Y. & Fesenmaier, D.R. 2004, 'Towards understanding members’ general participation
in and active contribution to an online travel community', Tourism Management, vol.
25, pp. 709–722.
Wilkinson, D.M. & Huberman, B.A. 2007, 'Assessing the value of cooperation in Wikipedia',
First Monday, vol. 12, no. 4.
Williams, R.L. & Cothrel, J. 2000, 'Four Smart Ways to Run Online Communities', Sloan
Management Review, vol. 41, no. 4, pp. 81-91.
Internet Sources:
Alexa.com 2008, Traffic Details - wikipedia.org, viewed 10.11.2008
<http://www.alexa.com/data/details/traffic_details/wikipedia.org>.
Encyclopædia Britannica, I. 2006, Fatally Flawed - Refuting the recent study on encyclopedic
accuracy by the journal Nature, viewed 16.04.2009
<http://corporate.britannica.com/britannica_nature_response.pdf>.
Wales, J. 2005a, The Intelligence of Wikipedia, Oxford Internet Institute, viewed 27.03.2009
<http://webcast.oii.ox.ac.uk/?ID=20050711_76&view=Webcast>.
Wales, J. 2005c, Wikipedia, Emergence, and The Wisdom of Crowds, viewed 27.03.2009
<http://lists.wikimedia.org/pipermail/wikipedia-l/2005-May/021764.html>.
Instead of installing the whole MediaWiki software, it was decided to set up the required
database schema using the tables.sql file, which can be found in the MediaWiki repository [3].
The XML dump of the German Wikipedia from June 7th was downloaded [4] and converted
into an SQL file using the MWDumper tool [5]:
Two tables in the dump seemed especially important for this study: the page table, which
includes all pages in Wikipedia, and the revision table, which includes every single revision
of each page. As these tables were out of sync (the revision table included more distinct
pages than the page table), it was decided to use the revision table as a basis for the analysis.
As this study aimed to analyze articles created in 2007, all revisions in 2007 were extracted
from the revision table in a first step and indices were added to speed up queries from this
table.
create table revision2007 as select * from revision where extract(year from rev_timestamp)=2007;
alter table revision2007 add index (rev_page);
alter table revision2007 add index (rev_user_text);
In a next step a table was created to depict which pages were edited by which user in 2007
and how often. Again, several indices were created to improve query performance:
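A minimal sketch of such a user-page aggregation, written in Python 3 on in-memory data rather than in the SQL used for this study. The function name and the sample revisions are hypothetical and only illustrate the idea of counting edits per (page, user) pair:

```python
from collections import Counter


def count_user_page_edits(revisions):
    """Count how often each user edited each page.

    `revisions` is an iterable of (rev_page, rev_user_text) pairs, one pair
    per revision, mirroring two columns of the revision2007 table.
    """
    return Counter((page, user) for page, user in revisions)


# Hypothetical sample: three revisions of page 1 and one revision of page 2.
sample = [(1, "Alice"), (1, "Bob"), (1, "Alice"), (2, "Bob")]
counts = count_user_page_edits(sample)
for (page, user), number_of_edits in sorted(counts.items()):
    print(page, user, number_of_edits)
```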
…an up-to-date user-group assignment list received from [6] and updated with the help of the
following short Python script:
import urllib
import re
Because of special characters in its name, one bot had to be flagged by hand. In addition, all
other bot values were set to 0 and a ‘bot’ table including all bots was created:
After that, columns for the page title and page namespace were added to the table…
Missing values were queried from the Wikipedia API and missing pages flagged with the fol-
lowing Python script:
import urllib
import re
import xml.etree.cElementTree as cElementTree
import time
import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
AppURLopenerinstance=AppURLopener()
numberofmissingvalues=100
#it is only possible to query 50 values from the api without bot status at once
for i in range (0,numberofmissingvalues/50):
    #create batches of 50
    cursor.execute ("select distinct(rev_page) from userpages where page_title is NULL limit 50")
    resultpage = cursor.fetchall()
    print resultpage
    print
    stringliste=[] #create query string for api
    for page in resultpage:
        stringliste.append(str(page['rev_page']))
    fertigerstring="|".join(stringliste)
    print fertigerstring
    #get api results
    liste=AppURLopenerinstance.open("http://de.wikipedia.org/w/api.php?action=query&pageids="+fertigerstring+"&format=xml")
    # …
    if elem.attrib.has_key('missing'):
        print "missing"
        #flag missing pages
        cursor.execute ("update userpages set page_title=%s, page_namespace=%s where rev_page=%s",("!missing","999",elem.attrib['pageid']))
    else:
        print "---"
        print elem.attrib['pageid']
        print elem.attrib['ns']
        print elem.attrib['title']
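The batching and the handling of missing pages in the script above can be sketched in a self-contained form as follows. This is an illustrative Python 3 sketch, not the script used in the study: the helper names, the element name `page` and the hand-written sample response are assumptions, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_URL = "http://de.wikipedia.org/w/api.php"  # endpoint used in the script above
BATCH_SIZE = 50  # limit per request without bot status, as noted above


def batches(items, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query_url(page_ids):
    """Build one API query URL for a batch of page ids (joined with '|')."""
    params = {
        "action": "query",
        "pageids": "|".join(str(page_id) for page_id in page_ids),
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)


def parse_pages(xml_text):
    """Return {pageid: (namespace, title)}; missing pages map to None."""
    result = {}
    for elem in ET.fromstring(xml_text).iter("page"):
        pageid = int(elem.attrib["pageid"])
        if "missing" in elem.attrib:
            result[pageid] = None
        else:
            result[pageid] = (int(elem.attrib["ns"]), elem.attrib["title"])
    return result


# Hand-written response carrying the attributes read by the script above
# (illustrative only, not actual API output).
SAMPLE = """<api><query><pages>
<page pageid="12" ns="0" title="Beispiel" />
<page pageid="99" missing="" />
</pages></query></api>"""

print(build_query_url([12, 99]))
print(parse_pages(SAMPLE))
```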
Then a table was created containing all articles in the main namespace (namespace 0; see [7]
for more details) that were edited in 2007 and not edited exclusively by bots:
It turned out that an easier way of removing all pages not in namespace 0 would probably
have been to use the --filter option of the MWDumper tool (see [5] for more details).
Next, a column depicting the date of creation of each article was created…
import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
Subsequently a column for the number of contributors in 2007 was added and…
import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
A random sample of 5000 articles with more than 1 user created in 2007 was drawn from this
table and stored in the samplepagescreated2007 table.
create table samplepagescreated2007 as select * from pagelist2007 where extract(year from creationdate)=2007
and user2007>1 order by rand() limit 5000;
Finally, columns for each variable and a column indicating whether the article had already
been analyzed were added to the table:
alter table samplepagescreated2007 add column words int(10) unsigned;# for nr. of words
alter table samplepagescreated2007 add column wordpool int(10) unsigned;# for vocabulary
alter table samplepagescreated2007 add column uniquesumlinks int(10) unsigned;#for unique total links
alter table samplepagescreated2007 add column fleschd int(10) unsigned; #for flesch readability score
alter table samplepagescreated2007 add column categoriescalc int(10) unsigned;# for nr. of categories
alter table samplepagescreated2007 add column userscalc int(10) unsigned; #for community-size and interactivity
alter table samplepagescreated2007 add column heterogeneitycalc double unsigned; #for heterogeneity
alter table samplepagescreated2007 add column avgactivity double unsigned; #for activity
alter table samplepagescreated2007 add column avgfocus double unsigned; #for focus
alter table samplepagescreated2007 add column interactionscalc int(10) unsigned; #for calculating interactivity
alter table samplepagescreated2007 add column editscalc int(10) unsigned; #for calculating interactivity
alter table samplepagescreated2007 add column mediantimebetweenedits double unsigned; #for calculating dynamics
alter table samplepagescreated2007 add column analysed boolean; #set to 1 if article was already analyzed
The following sections explain the process of calculating all variables from the following
three sources:
The following script queries the title of each article from the Wikipedia API and downloads
the article’s XML file from Wikipedia with the help of an adapted version of the getwiki
script by Gude [9] (all links to the English Wikipedia were replaced by the respective links to
the German Wikipedia). The file is stored in a folder and parsed. In a first step vandals are
flagged. After that the XML file is parsed again to calculate the number of users (excluding
bots and vandals), categories, interactions and edits. In case of any problems, the article is
flagged with a problem code (1: article is a redirect, 2: rev_page id of downloaded article file
does not match rev_page in database, 3: redirect & ids do not match, 4: article not found via
API) and excluded from further analysis.
import MySQLdb
conn = MySQLdb.connect (host = "127.0.0.1",
                        user = "username",
                        passwd = "password",
                        db = "dbname")
cursor = conn.cursor (MySQLdb.cursors.DictCursor)
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/5.0 (Windows; U; Windows NT 5.1; it; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11"

def get_botlist():
    if os.path.exists("botlist.txt"):
        #print "botfile found"
        botfile = file('botlist.txt', 'r')
        botlist=pickle.load(botfile) #read from file
    else:
        botlist=[] #pickled and unpickled
        conn = MySQLdb.connect (host = "127.0.0.1",
                                user = "username",
                                passwd = "password",
                                db = "dbname")
        cursor = conn.cursor (MySQLdb.cursors.DictCursor)
        cursor.execute ("select * from bots;") #get bots from database
        bots = cursor.fetchall()
        for bot in bots:
            botlist.append(bot['rev_user_text'])
        #save to file
        botfile = file('botlist.txt', 'w')
        pickle.dump(botlist,botfile) #store botfile
    return botlist
def analysearticle(doc,rev_page):
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    redirect=False #is article a redirect?
    problem=0 #is there a problem?
    filecelement=open(doc, "r") #open xml file
    revcounter=0 #revision counter
    newerthan2007=False
    articleidfound=False #there are several id fields (article, revision, user)
    vandalism=set()
    vandalismuser=set()
    upperlimit=datetime.datetime(2008, 1, 1) #revisions from this date on are ignored
    wrongarticle=False
    #get botlist
    bots=get_botlist()
    #print bots
    # … first pass over the XML elements: flag vandals and redirects
    if (elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or
            elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip"):
        currentusernameorip=elem.text
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
        currentzeit=elem.text
        if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
            newerthan2007=True
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}title":
        articletitle=elem.text
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}comment":
        if not newerthan2007:
            #potential vandalism detected, see http://de.wikipedia.org/wiki/Hilfe:Zusammenfassung_und_Quelle
            if (elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Der Seiteninhalt wurde durch einen anderen Text ersetzt." or
                    elem.text=="[[Hilfe:Zusammenfassung und Quelle#Auto-Zusammenfassung|AZ]]: Die Seite wurde geleert."):
                vandalism.add(revcounter)
                vandalismuser.add("'"+currentusernameorip+"'")
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
        if not newerthan2007:
            if revcounter not in vandalism:
                if elem.text:
                    if ("#redirect[[" in elem.text.lower() or "#redirect [[" in elem.text.lower() or
                            "#weiterleitung[[" in elem.text.lower() or "#weiterleitung [[" in elem.text.lower()): #redirect?
                        redirect=True #last revision includes redirect
                    else:
                        redirect=False
                else: #if everything is deleted there's no text in the text element
                    vandalism.add(revcounter)
                    vandalismuser.add("'"+currentusernameorip+"'")
        revcounter+=1
    # … second pass over the XML elements: count users, edits, categories and interactions
    if (elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}username" or
            elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}ip"):
        currentusernameorip=elem.text #store username or ip
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
        currentzeit=elem.text
        if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit:
            newerthan2007=True #flag revisions after 2007
    if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}text":
        if not newerthan2007:
            if revcounter not in vandalism and currentusernameorip not in bots:
                edits+=1
                userset.add(currentusernameorip)
                categories=len(re.findall(r"\[\[Kategorie:(.*)]]",elem.text))
                if not currentusernameorip == olduser:
                    interactions+=1
def main(article,offset,rev_page):
wikiheader="""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd"
version="0.3" xml:lang="en">
<siteinfo>
<sitename>Wikipedia</sitename>
<base>http://en.wikipedia.org/wiki/Main_Page</base>
<generator>MediaWiki 1.14alpha</generator>
<case>first-letter</case>
<namespaces>
<namespace key="-2">Media</namespace>
<namespace key="-1">Special</namespace>
<namespace key="0" />
<namespace key="1">Talk</namespace>
<namespace key="2">User</namespace>
<namespace key="3">User talk</namespace>
<namespace key="4">Wikipedia</namespace>
<namespace key="5">Wikipedia talk</namespace>
<namespace key="6">Image</namespace>
<namespace key="7">Image talk</namespace>
<namespace key="8">MediaWiki</namespace>
<namespace key="9">MediaWiki talk</namespace>
<namespace key="10">Template</namespace>
<namespace key="11">Template talk</namespace>
<namespace key="12">Help</namespace>
<namespace key="13">Help talk</namespace>
<namespace key="14">Category</namespace>
<namespace key="15">Category talk</namespace>
<namespace key="100">Portal</namespace>
<namespace key="101">Portal talk</namespace>
</namespaces>
</siteinfo>
<page>
"""
#download article
if os.path.exists(rev_page+".xml"):
print "article xml file found, using this version. Delete old version to trigger download"
else:
if offset !=1:
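The counting rules of analysearticle (revisions from 2008 on are ignored, flagged vandalism revisions and bots are skipped, and an interaction is counted whenever the editor changes) can be illustrated on in-memory data. This Python 3 sketch is an assumption-laden simplification: the function name and sample data are hypothetical, revisions are assumed to be in chronological order, and counting the very first edit as an interaction is an assumption, since the initial value of olduser is not shown above.

```python
from datetime import datetime

UPPER_LIMIT = datetime(2008, 1, 1)  # revisions from this date on are ignored


def count_activity(revisions, bots, vandal_revisions):
    """Return (number of users, edits, interactions) for one article.

    revisions        -- list of (timestamp, user) in chronological order
    bots             -- set of bot user names
    vandal_revisions -- set of revision indices flagged as vandalism
    """
    users = set()
    edits = 0
    interactions = 0
    previous_user = None
    for index, (timestamp, user) in enumerate(revisions):
        if timestamp >= UPPER_LIMIT:
            break  # everything that follows is newer than 2007
        if index in vandal_revisions or user in bots:
            continue
        edits += 1
        users.add(user)
        if user != previous_user:
            interactions += 1  # the editor changed since the last counted edit
        previous_user = user
    return len(users), edits, interactions


# Hypothetical revision history of one article.
history = [
    (datetime(2007, 1, 1), "A"),
    (datetime(2007, 1, 2), "A"),
    (datetime(2007, 1, 3), "SomeBot"),
    (datetime(2007, 1, 4), "B"),
    (datetime(2007, 1, 5), "V"),  # flagged as vandalism below
    (datetime(2007, 1, 6), "A"),
    (datetime(2008, 1, 2), "C"),  # outside the observation period
]
result = count_activity(history, bots={"SomeBot"}, vandal_revisions={4})
print(result)
```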
Due to a number of missing pages and redirects in the sample a second draw was necessary:
To calculate the number of words and the Flesch readability score, another Python script was
developed. The function getlastid examines the XML representation of an article and returns
the revision number of the last revision in 2007. Subsequently, the HTML version of this
revision is obtained from Wikipedia using specific parameters explained in [8]. The article
content is extracted, some clean-ups are conducted, HTML markup is stripped and the plain
text is sent to [10] for analysis. The Flesch readability scores are extracted from the results
and inserted into the MySQL database.
import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time
def strip_tags(value):
    "Return the given HTML with all tags stripped."
    return re.sub(r'<[^>]*?>', '', value)
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
    inrevision=True
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
    currentzeit=elem.text
    if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
        newerthan2007=True
    else:
        lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
    if inrevision==True:
        revisionid=elem.text
        inrevision=False
return lastrevisionid
##
url ="http://de.wikipedia.org/w/index.php"
data = urllib.urlencode(values)
print data
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read() #get last revision with specific id
#connect to stilversprechend.de
url2 = 'http://www.stilversprechend.de/stil/bericht.html'#index.html'
values2 = {'text' : stripped_content.encode("cp1252","ignore")}
data2 = urllib.urlencode(values2)
req2 = urllib2.Request(url2, data2, headers)
response2 = urllib2.urlopen(req2)
the_page2 = response2.read()
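The purpose of getlastid, finding the id of the last revision made before 2008, can be sketched in Python 3 with the standard ElementTree module. This is an illustrative sketch rather than the function used in the study: the function name and the sample XML are hypothetical, and revisions are assumed to appear in chronological order in the export file.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = "{http://www.mediawiki.org/xml/export-0.3/}"  # namespace of the export format
UPPER_LIMIT = datetime(2008, 1, 1)


def last_revision_id_before(xml_text, upper_limit=UPPER_LIMIT):
    """Return the id of the last revision older than `upper_limit`, or None."""
    last_id = None
    for revision in ET.fromstring(xml_text).iter(NS + "revision"):
        # findtext only looks at direct children, so the contributor's own
        # <id> element (nested one level deeper) is not picked up here.
        revision_id = revision.findtext(NS + "id")
        stamp = revision.findtext(NS + "timestamp")
        if datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ") >= upper_limit:
            break
        last_id = revision_id
    return last_id


# Hypothetical export file with three revisions, the last one from 2008.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
<page><title>Beispiel</title><id>7</id>
<revision><id>100</id><timestamp>2007-03-01T10:00:00Z</timestamp>
<contributor><username>A</username><id>5</id></contributor></revision>
<revision><id>101</id><timestamp>2007-12-31T23:59:59Z</timestamp>
<contributor><ip>127.0.0.1</ip></contributor></revision>
<revision><id>102</id><timestamp>2008-01-01T00:00:00Z</timestamp>
<contributor><username>B</username><id>6</id></contributor></revision>
</page></mediawiki>"""

print(last_revision_id_before(SAMPLE))
```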
The following script was used to calculate the number of words and the number of unique
words (vocabulary) in each article. The plain text version of each article stored before is
loaded and split into words. The number of words and unique words is counted and stored
in the database:
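A minimal sketch of this counting step in Python 3. The tokenization rule (runs of word characters, compared case-insensitively) is an assumption, since the exact splitting rule of the original script is not reproduced here; the function name is hypothetical.

```python
import re


def count_words(text):
    """Return (number of words, number of unique words) of a plain text."""
    words = re.findall(r"\w+", text.lower())  # assumed tokenization rule
    return len(words), len(set(words))


print(count_words("Der Hund und der Mond"))
```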
import os.path
import re
import MySQLdb
import urllib
import urllib2
import re
import xml.etree.cElementTree as ElementTree
import MySQLdb
import datetime
import time
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}revision":
    inrevision=True
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}timestamp":
    currentzeit=elem.text
    if datetime.datetime.strptime(elem.text, "%Y-%m-%dT%H:%M:%SZ")>=upperlimit: #if newer than 2007
        newerthan2007=True
    else:
        lastrevisionid=revisionid #as the timestamp tag comes after the revision tag, assign here
if elem.tag=="{http://www.mediawiki.org/xml/export-0.3/}id":
    if inrevision==True:
        revisionid=elem.text
        inrevision=False
print lastrevisionid
return lastrevisionid
print page['rev_page']
rev_page=str(page['rev_page'])
lastrevisionid=getlastid(doc=rev_page+".xml",rev_page=rev_page)
url ="http://de.wikipedia.org/w/index.php"
data = urllib.urlencode(values)
print "http://de.wikipedia.org/wiki/index.php?"+data+'""'
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
#extract content
content=the_page[the_page.index("<!-- start content -->"):the_page.index("<!-- end content -->")]
alllinks=re.findall("<a href=\"([^\"]*)\"",content)
extandintlinks=re.findall("<a href=\"([^#][^\"]*)\"",content)
pageintlinks=re.findall("<a href=\"(#[^\"]*)\"",content)
extlinks=re.findall("<a href=\"(http://[^\"]*)\"",content)
#add newsgrouplinks
extlinks.extend(re.findall("<a href=\"(news://[^\"]*)\"",content))
#add maillinks
extlinks.extend(re.findall("<a href=\"(mailto:[^\"]*)\"",content))
relintlinks=re.findall("<a href=\"(/[^\"]*)\"",content)
print "extandintlinks",len(set(extandintlinks)) #all unique links
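The link extraction above can be condensed into a single pass that sorts each link target into one category. The following Python 3 sketch uses the same prefixes as the regular expressions above (http://, news:// and mailto: for external links, a leading slash for internal links, a leading # for links within the page); the function name and the sample HTML are hypothetical.

```python
import re

# Like the expressions above, this only matches anchors whose first
# attribute is href.
HREF = re.compile(r'<a href="([^"]*)"')


def classify_links(html):
    """Return the unique link targets of an HTML fragment, by category."""
    external, internal, in_page = set(), set(), set()
    for href in HREF.findall(html):
        if href.startswith("#"):
            in_page.add(href)
        elif href.startswith(("http://", "news://", "mailto:")):
            external.add(href)
        elif href.startswith("/"):
            internal.add(href)
    return {"external": external, "internal": internal, "in_page": in_page}


SAMPLE_HTML = (
    '<a href="/wiki/A">a</a> <a href="/wiki/A">again</a> '
    '<a href="http://example.org">x</a> <a href="#top">t</a> '
    '<a href="mailto:me@example.org">m</a>'
)
links = classify_links(SAMPLE_HTML)
print({name: len(targets) for name, targets in links.items()})
```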
For calculating activity and focus in the analyzed communities, another table was created
containing all user-page combinations in the sample and the number of edits by each user.
Because vandalism was detected in fewer than 1% of all analyzed communities, potential
vandals were not removed from these analyses, as doing so would have complicated database
queries significantly. Again, bots were excluded:
create table characuser as select up1.rev_page,rev_user_text,`count(*)` from userpages2007 as up1 inner join
samplepagescreated2007 as up2 on up1.rev_page=up2.rev_page where up2.problem=0 and up2.userscalc>1 and
up1.bot=0;
alter table characuser add column edits2007 int unsigned; #edits by each user in 2007 in Wikipedia
alter table characuser add column percentofedits double unsigned; #used to calculate ratio (edits in article/edits2007)
In a first step, the number of edits by each user in the sample in 2007 (activity) was
calculated with the following function:
In a next step, the ratio of edits in the analyzed article and in other Wikipedia articles in 2007
was calculated for each user-page combination:
The average activity and focus of each community was then stored in a temporary table and
updated in the samplepagescreated2007 table:
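The two measures can be illustrated on in-memory data. In this Python 3 sketch, activity is a user's total number of edits in 2007 and focus is the ratio of the user's edits in the article to that total, following the column comments above. The function name, the input dictionaries and the use of an unweighted mean per community are assumptions.

```python
def community_activity_and_focus(article_edits, total_edits):
    """Return (average activity, average focus) of one community.

    article_edits -- {user: edits to this article in 2007}
    total_edits   -- {user: edits across Wikipedia in 2007}
    """
    users = list(article_edits)
    activity = [total_edits[user] for user in users]
    focus = [article_edits[user] / total_edits[user] for user in users]
    return sum(activity) / len(users), sum(focus) / len(users)


# Hypothetical community of two users.
avg_activity, avg_focus = community_activity_and_focus(
    article_edits={"A": 2, "B": 1},
    total_edits={"A": 4, "B": 10},
)
print(avg_activity, avg_focus)
```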
To calculate the heterogeneity of an online community’s users, the number of articles each
user edited in 2007 was determined by adding a column to the characuser table in a first step.
alter table characuser add column articles2007 int unsigned; # nr. of articles edited by each user in 2007
These fields were populated with the help of the following Python script …
def calc_articlesuser2007():
    conn = MySQLdb.connect (host = "127.0.0.1",
                            user = "username",
                            passwd = "password",
                            db = "dbname")
    cursor = conn.cursor (MySQLdb.cursors.DictCursor)
    #get all users
    cursor.execute ("select distinct rev_user_text from characuser where articles2007 is NULL")
    resultuser=cursor.fetchall()
…and the average number of articles users in the sample contributed to was calculated.
The resulting figure (265) was used to compare the users of each community on the most
important articles edited by that community, and to calculate the average heterogeneity of
each community with the help of the cosine similarity function:
def calc_heterogeneity(rev_page,vandalismuser):
    pagevektor=[]
    uservektordict={}
    if len(vandalismuser)==0:
        string="''"
    else:
        string=",".join(vandalismuser) #join vandalismuser string (vandals identified)
    # …
    sortedpagevektor=sorted(pagevektor)
    #create a vector of the most important articles
    stringmostcommonsites=",".join([str(el) for el in sortedpagevektor])
    cursor.execute ("select distinct rev_user_text from userpages2007 where bot=0 and page_namespace=0 and rev_page=%s and rev_user_text not in ("+string+");",(rev_page,))
    articleusers = cursor.fetchall() #users of the article without vandals
    usernumber=0
    if len(articleusers)>1:
        for user in articleusers:
            uservektor=numpy.zeros(len(sortedpagevektor),dtype=int) #create vector for each user
            i=0
            usersites=cursor.execute("select * from userpages2007 where rev_user_text=%s and page_namespace=0 and rev_page in ("+stringmostcommonsites+");",(user['rev_user_text'],))
            resultnumber=cursor.fetchall()
            dictforuser={}
            for item in resultnumber:
                dictforuser[item['rev_page']]=item['count(*)'] #populate vector with numbers of edits per article
            for page in sortedpagevektor:
                if dictforuser.has_key(page):
                    uservektor[i]=dictforuser[page]
                i+=1
            uservektordict[usernumber]=uservektor
            usernumber+=1
        # …
        #compare each user pair (cosim)
        mat[usernumber,usernumber2]=mat[usernumber2,usernumber]=float(numpy.dot(uservektordict[usernumber],uservektordict[usernumber2])) / (numpy.linalg.norm(uservektordict[usernumber]) * numpy.linalg.norm(uservektordict[usernumber2]))
        usernumber2+=1
        usernumber+=1
        #calculate average heterogeneity (1-average similarity)
        heterogeneity=1-(mat.sum()-len(articleusers))/(len(articleusers)*(len(articleusers)-1))
    else:
        heterogeneity=0 #if there is only one user heterogeneity =0
    return heterogeneity
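The measure computed by calc_heterogeneity, one minus the average pairwise cosine similarity of the users' edit vectors, can be sketched without a database or numpy. This Python 3 sketch is illustrative only: the function names and sample vectors are hypothetical, and treating the similarity with an all-zero vector as 0 is an assumption, since the listing above does not show how that case is handled.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long edit-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # assumption: zero vector -> similarity 0


def heterogeneity(user_vectors):
    """Return 1 - average pairwise cosine similarity (0 for a single user)."""
    if len(user_vectors) < 2:
        return 0.0
    similarities = [cosine_similarity(a, b)
                    for a, b in combinations(user_vectors, 2)]
    return 1.0 - sum(similarities) / len(similarities)


# Each vector holds one user's edit counts on the same list of articles.
print(heterogeneity([[1, 0], [2, 0]]))          # identical interests
print(heterogeneity([[1, 0], [0, 1]]))          # disjoint interests
print(heterogeneity([[1, 0], [0, 1], [1, 0]]))  # mixed community
```

Averaging over all unordered pairs gives the same value as summing the symmetric similarity matrix, subtracting its diagonal of ones and dividing by n(n-1), which is the form used in the listing above.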
To assess the dynamics of collaboration within each article the time difference between each
revision and the following revision was determined and the median calculated.
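A minimal Python 3 sketch of this calculation, assuming the revision timestamps of one article are available as datetime objects. The function name and the sample timestamps are hypothetical, and returning None for articles with fewer than two revisions is an assumption.

```python
from datetime import datetime
from statistics import median


def median_seconds_between_edits(timestamps):
    """Median gap in seconds between consecutive revisions, or None."""
    ordered = sorted(timestamps)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return median(gaps) if gaps else None


# Hypothetical revisions: gaps of 10, 60 and 30 seconds.
stamps = [
    datetime(2007, 5, 1, 10, 0, 0),
    datetime(2007, 5, 1, 10, 0, 10),
    datetime(2007, 5, 1, 10, 1, 10),
    datetime(2007, 5, 1, 10, 1, 40),
]
print(median_seconds_between_edits(stamps))
```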
Finally, the control variable article age in seconds was calculated with a standard SQL
statement subtracting the creation date of an article from 01.01.2008 00:00:00.
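The same subtraction can be expressed in Python 3 instead of SQL; this sketch only illustrates the arithmetic, and the function name and example date are hypothetical:

```python
from datetime import datetime

REFERENCE = datetime(2008, 1, 1, 0, 0, 0)  # end of the observation period


def article_age_seconds(creation_date):
    """Seconds between an article's creation and 01.01.2008 00:00:00."""
    return (REFERENCE - creation_date).total_seconds()


# An article created exactly one day before the reference date.
print(article_age_seconds(datetime(2007, 12, 31, 0, 0, 0)))
```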