Long Session V: Time, Space, Archives WebSci '17, June 25-28, 2017, Troy, NY, USA

Observing Web Archives


The Case for an Ethnographic Study of Web Archiving

Jessica Ogden, University of Southampton, Southampton, UK, jessica.ogden@soton.ac.uk
Susan Halford, University of Southampton, Southampton, UK, susan.halford@soton.ac.uk
Leslie Carr, University of Southampton, Southampton, UK, lac@ecs.soton.ac.uk

ABSTRACT
This paper makes the case for studying the work of web archivists, in an effort to explore the ways in which practitioners shape the preservation and maintenance of the archived Web in its various forms. An ethnographic approach is taken through the use of observation, interviews and documentary sources over the course of several weeks in collaboration with web archivists, engineers and managers at the Internet Archive - a private, non-profit digital library that has been archiving the Web since 1996. The concept of web archival labour is proposed to encompass and highlight the ways in which web archivists (as both networked human and non-human agents) shape and maintain the preserved Web through work that is often embedded in and obscured by the complex technical arrangements of collection and access. As a result, this engagement positions web archives as places of knowledge and cultural production in their own right, revealing new insights into the performative nature of web archiving that have implications for how these data are used and understood.1

1 This paper is based on data collected and fieldwork undertaken by the first author as part of their PhD research.

KEYWORDS
web archiving, knowledge production, STS, materiality, information labour

ACM Reference format:
Jessica Ogden, Susan Halford, and Leslie Carr. 2017. Observing Web Archives. In Proceedings of WebSci ’17, Troy, NY, USA., June 25–28, 2017, 10 pages. https://doi.org/10.1145/3091478.3091506

This work is licensed under a Creative Commons Attribution International 4.0 License.
WebSci ’17, June 25–28, 2017, Troy, NY, USA.
© 2017 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-4896-6/17/06.
https://doi.org/10.1145/3091478.3091506

1 INTRODUCTION
The World Wide Web has emerged as the preeminent mechanism for global communication, political, economic and cultural exchange and more. Yet, at the same time, the Web is ephemeral. For a medium that has become pre-eminent, its dynamism and transience have become increasingly worrisome. These concerns have been illustrated in various longitudinal studies of link rot [55] and investigations which found that during a period between 2009 and 2012, on average 11% of online resources shared on social media failed to resolve one year later [60]. In this context, it is increasingly claimed that the ephemerality of the Web demands intervention to preserve web content - in web archives - that reconstruct sites and the ‘web experience’ for posterity [2]. However, there has been rather less attention to the nature of this intervention: how it is done and why this matters. This paper explores the critical decisions being made now that will shape future generations’ ability to understand the history of the Web.

1.1 Web Archiving
Web archiving has roots in a wider digital preservation movement which emerged in the 1980s-1990s, led by memory institutions to develop strategies for addressing the rise of personal computing and the impact of digital artefacts on their abilities to capture and preserve ‘records of social phenomena’ [61]. This was particularly fuelled by fears over the so-called ‘digital dark ages,’ a term first used by Kuny [41] to describe a scenario where the development pace of technologies (used to produce digital objects) outweighs that of the investment in technologies, infrastructures and policies to preserve them long-term. As the world’s information and communication platforms are increasingly born-digital and online, a diverse community of practitioners have positioned web archives as key to capturing and preserving digital cultural heritage, ensuring stability and access to pre-existing web resources and facilitating new knowledge via scholarly research. Web archives in their various forms - including social media archives - have thus become a sort of ‘prosthesis’ for the Web and a necessary pre-condition for any research into the Web(s) of the past and near-present.2

2 Inspiration for this analogy is taken from Derrida’s [20] treatment of ‘technological devices for archiving’ as prostheses for memory formation and storage.

The history of web archiving has been documented to varying degrees in existing overviews [9, 13, 77] which chart the emergence of a field of practice around web archiving. Each has used a series of factors to characterise the domain over time, including: the tools and technologies used, the frequency and scale of selection/collection methods (e.g. broad versus targeted) and the various motivations behind the creation of web archives. These motivations may reinforce and represent, at least in part, a continuation of classical interpretations and analogue conceptions of the value and role of libraries and archives as institutions that provide access to cultural heritage, information and knowledge resources; facilitate evidence-based accountability and promote community memory and identity, amongst others [18, 26].

Web archiving projects have spanned from the large-scale collection of web resources by organisations such as the Internet Archive
and national libraries and archives, to networked ‘rogue’ collectives such as Archive Team3 and DataRefuge,4 and the individual efforts of activists and scholars creating web archives for their own purposes. Surveys of the field reveal a growing number of organisations that are web archiving, albeit with limited staff-time resources and with the majority of respondents using some form of external service to collect or manage their web archives [3, 52].

3 http://archiveteam.org
4 https://www.datarefuge.org

Much of the focus of the web archiving community has been on the continued development of technologies and practices for web collection development [35], with an increased attention in recent years on facilitating the scholarly use of web archives [21]. Web archiving has thus become representative of an interdisciplinary space where practitioners and scholars from a range of backgrounds (e.g. libraries, archives, information science, engineering and computer science) come together with scholars from a variety of disciplines (humanities, social sciences and computer science) to study the Web, its application and the users who derive value from its use.

This research examines this space from the perspective of web archival practitioners with the aim of documenting the practices that determine and create the archived Web within the context of the Internet Archive. From the point of view of Web Science, web archiving should not be seen as merely a series of technical solutions to the short-comings of the Web’s infrastructure, but rather as an assemblage of contingent sociotechnical practices that shape what is known about the Web. These practices - and the entanglement of both human and non-human actors - are important for understanding the affordances of web archives and their implications for interpreting an increasingly dynamic Web. This type of engagement raises wide-ranging epistemological questions concerning the role of the web archive (and the archivist) in shaping knowledge about the Web. From this perspective, the mechanisms and circumstances surrounding the production of web archives are therefore fundamental to understanding them as ‘new forms of social data’ [45, p.31].

1.2 Problematising Web Archival Practices
The field of web archiving, the practices and technologies used to facilitate the creation, maintenance and use of web archives are all continually evolving, but not without their issues. The different legal, economic, technical and ethical challenges to preservation and access presented by both ‘web documents’ and social media have led to a mix of overlapping and divergent collection and access strategies. Existing overviews of the field [9, 48], literature reviews [54] and best practice documents [8, 56] provide models for mapping the general components of web archiving. However, some of this documentation has failed to keep up to date with the pace of web/archiving technology development, and is often limited to specific software, tools and professional contexts (e.g. library or bibliographic approaches to web archiving). Surveys of the field have yielded insights into the types and number of organisations participating in web archiving, the tools and services used, the types of content being archived, and the availability of institutional policy documents and staff-time resources [3, 29, 52]. These surveys provide useful information for understanding the broader landscape of web archiving. However, by virtue of the methods used, they do not provide in-depth information about the day-to-day decisions, activities and processes that facilitate web archiving in practice.

Web technologies are evolving at a faster pace than that of web preservation technologies and practice [21], presenting challenges for preserving an ‘infinite stream with finite resources’ [42]. The limitations of web harvesters in the face of new markup languages (e.g. HTML5), executable content (e.g. JavaScript and Flash) and other dynamic content (e.g. streamed multimedia, database-driven or password-protected) all often lead to missing elements in the representation of web resources in archives [56]. Problems for use are often presented as issues with the ‘quality’ of web archives, as measured by the relative ‘likeness’ between archived web objects and the ‘live web.’ As Brügger [14, p.108] has described, a combination of both collection decisions and ‘technical problems’ leads to archived web objects that are not copies of the live web, but rather contingent constructions where ‘the process of archiving itself may change what is archived, thus creating something that is not necessarily identical to what was once online.’

There is an abundance of research within the field pertaining to improving the efficiency and quality of web crawlers, detecting change in web resources and automating frequency decisions associated with captures [64], as well as technical overviews of crawling technologies at the time of production [51]. Yet, little research exists around the interactive nature and structuring effects of algorithmic and automated agents in decisions around what and when to archive. Recent calls for the study of ‘archival algorithmic systems’ [70, p.11] point towards the need to further consider the performative nature of crawlers and other web archiving technologies, as well as the ways in which different environmental contexts shape the technological development [40] driving collection and access tools. Here web crawlers (non-human, automated agents, bots, algorithms, code) are conceived as not merely passive or objective participants in the collection of web resources [47], but are intricately implicated in the active shaping of the ‘doing’ of web archiving.

At the heart of conflicts over the collection and use of web archives are often problems with defining the boundaries of the web object to be captured and studied. Whilst recognising that web archives are by nature, inherently and necessarily incomplete [28], they are also highly ‘subjective reconstructions’ [11, 12] of what exists on the live Web at any given time. As argued by Dougherty and Meyer [21], often the specific ontological and epistemological assumptions made during the collection and curation of web archives are either not made explicit to potential users or are seen as an impediment to their use and/or re-use [21]. Collection decisions, whether thematic or broad, domain-based harvesting [57], the temporal dimensions of when to capture and for how long [46] and issues over geographic [72] and language [1] coverage and representation of the global Web(s), have all highlighted methodological concerns over provenance, the subjectivity of records, the lack of transparency and metadata for harvester algorithms and problems with the generalisability of potential research findings based on web archives.

Issues of provenance in web archives - the why, when, and how web archives are collected - have inspired calls for greater documentation around intent, particularly around what to preserve and why [76]. Others have focused on calling for greater transparency in how web archives are built, some with a particular focus on the Internet Archive [43]. However, practices that surround the capture and maintenance of web archives remain relatively understudied, and for initiatives that sit outside of mainstream memory institutions, continue to exist almost wholly unexamined. Investigations into how curation strategies and collection tools structure the nature of collections - for example, the timing, frequency and length of collection - have the potential to yield insights into the contingencies that lead to the archived Web(s). Recent qualitative research on web archival appraisal practices by Summers and Punzalan [70] underlines the value of such an approach for both situating web archiving within wider institutional/archival paradigms, as well as exposing undocumented practices largely missing from the archival record. Taking a different approach, Milligan et al. [50] contribute to a discussion of curatorial practices by reverse engineering selection through a comparison of algorithmic, manual and social media-generated web archives associated with the 2015 Canadian Elections. These studies, plus the previously mentioned work of Dougherty and Meyer [21], can be seen as complementary to this research both in methodology and in their aims to address how collection practices structure the nature of web archival engagement. The following sections outline the ways in which this paper aims to extend current knowledge of web archival practice.

1.3 Aims and Outline
This paper aims to examine the ways in which web archives and associated practices - as sociotechnical phenomena - are structured and organised by an array of actors and environmental factors that actively shape the practice of archiving. This research contends that these (often) undocumented activities and processes are critical for interpreting the affordances of web archives as contingent reconstructions of the previously live Web.

The argument for a practice-based approach to web archiving is further examined through a theoretical framework. This framework draws on aspects of archival theory and Science and Technology Studies (STS) investigations into the materiality of information mediation practices. Here, examining web archiving from the perspective of the materiality of practice enables two paths of inquiry. The first acknowledges the ways in which new knowledge is embedded and produced as part of the creation of web archives; the second addresses how practice transforms knowledge through the maintenance of web archival systems.5

5 For further inspiration into the types of inquiries enabled by a focus on the materiality of media technologies, see Gillespie et al. [25].

A case for the chosen methodology is then made along with a description of the data collection, in an effort to highlight the ways in which this research may have implications for a Web Science approach to examining sociotechnical practices. The theoretical approach is then worked through empirical examples and discussion drawn from vignettes of ethnographic fieldwork at the Internet Archive. A brief summary discussion is provided that highlights some of the broad findings at this stage of the empirical research.

2 FRAMING PRACTICE ENGAGEMENT
Postmodernism and its application in archival and social theory lays the groundwork for critically engaging with web archiving as knowledge production. From this perspective, there has been a longstanding interest in knowledge production activities where the record is often positioned as ‘evidence of process, of activity, [and] of transaction’ [34, p.12].

First coined by Stoler [68], ‘the archival turn’ denotes a shift from ‘archive as source’ to ‘archive as subject’, signalling wide-ranging epistemological questions concerning the role of the archive (and the archivist) in shaping and legitimising knowledge and particular ways of knowing. Cook [16, p.4-5] argues that postmodern archival theory represents a fundamental paradigm shift within a community of practice that hitherto had been largely grounded in scientific rationalism, ‘archival science,’ the merits of record stability and the objective role of archivists; towards one which recognises the incompleteness of records and attends to context and the interpretive role of archivists in the construction of social memory. This signified a theoretical move away from the framing of archives as ‘sites of knowledge retrieval’ towards a recognition that archives are deeply reflective of and implicated in the production of knowledge [68, p.90]. The conditions of historical narrative-making are intrinsically tied to the processes of archival construction, where certain narratives are privileged and others marginalised through the active reshaping by the archivist [62, 71, 74]. These usually invisible practices involved in the maintenance of archives have ramifications for the ways in which archival holdings are then (re)presented as being a view from nowhere or of ‘all possible statements’ rather than ‘the law of what can be said’ [6].

In contrast, Brown and Davis-Brown [10] focus on the ‘technical-rational work’ of archivists - or the everyday decisions and practices related to the collection and maintenance of archives. They characterise a profession where the ‘explicitly political who is often reduced to the technically instrumented how’ - a sentiment also echoed by critical information studies [7], as well as practitioners from within the web archiving field [76]. In light of this, an engagement with the political nature of web archival practice would then include an examination of what comes to count as the ‘professional decision-making’ [10] involved in a host of activities that mark the everyday tasks of (organisational) archivists.

The turn also ascribes importance to unpicking digital information technologies and not falling into the trappings of either equating them to analogue technologies, or essentialising their capabilities or potentialities for capturing the cultural record. Cook [17] describes the impact of the transition to electronic archival records and the role of postmodern critique in strengthening the role of the archivist in the digital age:

    We will move from databases to knowledge bases. We will move, in the language of the post-modernists, to re-contextualize our activities: we will reorient ourselves from the content to the context, and from the end result to the original empowering intent, that is, from the artifact (the actual record) to the creating processes behind it, and thus to the actions, programmes, and functions behind those processes [17, p.410-411].

As such, archives can be further considered in light of the digital technologies that are intimately tied to both the production and preservation of web resources. Waterton [75, p.653-654] and others make the case for examining digital archives and the ‘generative capabilities’ of technologically-enabled data, information, and knowledge which are in an ‘eternal process of becoming’ [33]. Questions are therefore raised regarding how prior conceptualisations of significance and potential use are manifested in the ways in which digital ‘objects’ are collected and preserved. Pinch and Henry [58]’s notion of the ‘materiality of knowledge’ is useful here for considering the ways in which knowledge is also ‘embedded in physical artefacts, technologies, and ways of doing things’. This is manifested in how practice produces ‘material bodies’ [4, p.808-809] but also how materiality is bound to and embedded in practice. In this instance Barad [4] is referring to how the body (e.g. the anatomy and physiology) actively contributes to the processes of ‘materialisation,’ but in the case of web archives it warrants an examination of how the materiality of technologies (platforms, tools, interfaces, code, algorithms) is both implicated in the production of archives but also potentially produced through practice.

In this theoretical context, we extend the focus on materiality and practice to include the role that labour plays in the production and maintenance of web archives. Whereas practice embodies the action, artefacts and tacit knowledge that gives meaning to both [15]; labour is conceptualised here to encompass the work it takes to produce and transform web archives into information sources. To frame our analysis, we draw on Downey’s [23] concept of ‘information labor’, or the human and algorithmic labour that ‘[enables] and [constrains] the constant circulation of information.’ Downey argues that a focus on information labour, particularly in the context of media literacy, reveals the contingent ‘social relations’ that exist between information producers and consumers, through the process of acknowledging the work of agents that are often obscured by the technical arrangements of access. Here, by acknowledging and exploring the interconnected collection and maintenance work of both human agents (web archivists, engineers, users) and non-human agents (algorithms, bots, code) in web archiving, we open the doors for new enquiry into the value and role of the work of web archivists in the production of knowledge.

3 METHODOLOGY
Ethnographic methods were chosen to document the routine activities of archival practices and the ‘typical patterns of work’ [32, p.169] through the use of observation, interviews and documentary sources for understanding the day-to-day activities of web archivists.

3.1 The Case for an Ethnographic Approach
An ‘upward trend’ has been observed in the use of ethnographic methods in library and information science research [38]. Although there are comparatively fewer examples of ethnographic methods being used within the context of archives, ethnography has been identified as a means for conducting in-depth, comparative and cross-cultural studies of archival practice [49]. Gracy’s [31] ethnographic work of archives provides the stimulus and methodological framework for an ‘archival ethnography’, or rather, the study of archival practice in situ:

    Archival ethnography is a form of inquiry which positions the researcher within an archival environment to gain the cultural perspective of those responsible for the creation, collection, care and use of records [30].

Specifically focused on record creation, other studies have used ethnographic observation to understand archives as a form of knowledge production in scientific laboratories [63]; to investigate communication and organisational accountability in record-keeping [79]; and to understand both the technical and the social apparatuses that facilitate record creation and maintenance within organisational contexts [73]. The motivations for these studies, as well as their findings, have all emphasised the importance and effect of context on the production and management of the archival record and reinforce the value of ethnographic methods for documenting situated practices and the wider interactions between archivists, archival institutions and users of archives.

Ethnographic methods have been argued to enable a more ‘complex’ appreciation of the development, use and role of technologies in society - beyond a view of technologies as simply ‘functional instruments’ [59, p.110]. One key assumption of this research design is that the direct observation of technologies and their use in web archiving is central to understanding and documenting practices. Suchman [69] provides relevant assistance here, defining technology as ‘the assemblage of skilled practices and associated logics characteristic of modern industrial societies’ and artefacts as the ‘material production of skilled practice’. This approach allows for an exploration of the materiality of web archiving through a discussion of the relationship between practice and the production of digital artefacts, as well as the role of the environment (the policies, activities, infrastructure, and communities) that actively inform them.

Our aim, clearly, is not to fetishise web archives as either technological objects or sociomaterial culture, a point mirrored in STS debates which attend to the pitfalls of determinism and advocate for the avoidance of reductionist conceptualisations of either social or technical agents as essentially defined by the other [69, p.165]. Rather, we understand web archives as produced by an assemblage that requires attention to be paid to both the structure and agency of technical actors, as well as the sociocultural elements of technical practices - as evidence for the ways in which ‘cultural values are enacted, produced, shared, reified, represented and reaffirmed’ [22].

3.2 Data Collection and Analysis
A combination of interviews, observation records and documentary sources were collected over a four week period in collaboration with web archivists, engineers and managers at the Internet Archive. Both organisational (for observations) and individual consent (for interviews) were sought and received prior to data collection. In accordance with this consent all individual participant names reported here are pseudonyms.

Ethnographic interviews are a form of research interview that are distinguished by not only the types of questions that are asked, but also the ways in which the interviews are broken down into
‘ethnographic elements’ or ‘speech acts’ designed to elicit specific types of cultural responses from informants [65]. They were used in combination with observation as a mechanism for developing rapport with informants, as well as to clarify existing observation records and focus subsequent observation activities. In an effort not to predetermine what practices were discussed, the interviews took a largely unstructured approach using a combination of descriptive, structural and contrasting questions in direct response to the answers provided by informants within the context of the interview. A number of participant-led ‘walkthroughs’ were undertaken as part of the interviews in order to allow practitioners to narrate their daily activities in real-time. In total, 16 interviews were conducted with 11 staff members at the Archive. The length of each interview varied, ranging from 20 minutes to over 2 hours, all of which were audio recorded. Most interviews took place in person, barring two which took place on Skype as the informants worked remotely. All interview data were fully transcribed.

The purpose of ethnographic observations is to provide access to ‘practices and actions as they unfold’ [5, p.55], in the form of ‘non-elicited data’ that can allow insights into the implicit and embodied activities that form everyday life. This is predicated on the notion that some actions cannot be articulated by participants or ‘insiders’ through other research methods such as interviews. Boellstorff [5] and others [53] have long argued that ‘elicitation methods’ (such as interviews) cannot be a substitute for observation as there are inherent differences and disconnects between what people do and what they say they do. Furthermore, people do not always have the perspective or ability to report on all aspects of the processes (particularly cognitive processes) that underlie the decisions and activities that make up practice [53]. Observation, therefore, offers another window into understanding the relationship between meaning and action in everyday practice.

Ethnographic records were made documenting participation activities - including all-staff and tool development meetings, staff and application training seminars, troubleshooting and quality assurance tasks with participants - all with the aim of providing the basis for ‘thick descriptions’ [24] of practices. Observation pro-forma were not used, however ethnographic records were created for observations describing: what was done - action, activities, what was made and used - ‘cultural artefacts’ and what was said - speech acts, discursive activities [66, p.10-12]. Varying degrees of researcher participation were undertaken as part of the observations (as deemed appropriate by the host), where ‘passive’ participant observation was used during staff meetings, while more ‘active’ participant observation took place during workshop and community events held at the Internet Archive.6

6 For a further breakdown of the types of participation levels common to participant observation, see Spradley [66, p.58].

Documentary sources in various forms were collected to supplement the observations and interviews. ‘Documents’ are used here to refer to various types of materials (not just textual) produced by organisations, communities and individuals to describe procedures, policies and preferred ways of practicing web archiving. The use of documentary sources here offers insights into aspects of (otherwise) implicit knowledge that underlies practices [15, p.401].

Strategies for synthesising ethnographic data advocated by Spradley [65] were used to develop the analytical themes. This involved the identification of ‘things informants know’ in an effort to elicit everyday practices through the various kinds of participant knowledge (e.g. knowledge about crawl behaviours, scoping rules, reporting tools) and their connections to ways of doing things (e.g. maintenance and quality assurance tasks, de-duplication techniques). All data were analysed together in order to identify common themes present across the dataset. Lists and groupings were compiled by repeatedly listening to the audio recordings, reading the transcripts and observation records, and comparing the subject matter across the data. The methods used provided the opportunity for mapping heterogeneous data and highlighted particular groupings of practitioner knowledge and activities for further analysis. The themes presented here are not exhaustive but rather subjective and reflective of both the aims and current state of this research.

4 FINDINGS AND DISCUSSION
In what follows we explore the theoretical framework described above through an initial analysis of data collected from fieldwork carried out at the Internet Archive (‘The Archive’). This is the result of four weeks spent at the Archive in October-November 2016 and February-March 2017. Before addressing the findings, some background is provided to set the scene. Through the presentation of findings that follows, the concept of web archival labour is examined and explored through the knowledge work, breakdown and repair activities that shape web archiving at the Archive.

4.1 Site Background and Setting
In 1996, Brewster Kahle and Bruce Gilliat established the Internet Archive as a nonprofit organisation alongside Alexa Internet, a commercial web indexing service [39]. The Internet Archive headquarters are based in San Francisco, California, a state which officially designated them a library in 2007 [37]. Since 1999, the Internet Archive has expanded their holdings beyond archived web content to provide web-based access to both digitised and born-digital resources, including but not limited to: books, audio, film/video, images, documents, software and video games [78]. According to web traffic statistics provided by Alexa Internet (now a subsidiary of Amazon), the Internet Archive regularly ranks between the 250-350th most visited website globally, with roughly 30% of search engine results driven by users looking for the Wayback Machine.7

7 http://www.alexa.com/siteinfo/archive.org [Accessed: 3 February 2017]

The Archive forms a central component of the web archiving landscape through their provision of both large-scale web indexing and preservation services, as well as steering the direction of standards and practice through the development of tools and technologies that support web archiving at scale. The Archive is widely considered the largest web archive in the world, containing in excess of 15 petabytes of 270+ billion captures of web content [27]. They are one of the few institutions to employ ‘exhaustive’ global web crawling, which harvests web resources both within and beyond national domain boundaries or thematically-driven captures for the primary purpose of preservation (rather than indexing for other search or commercially-driven purposes). The collection of web archives at the Archive was originally (solely) facilitated
by the Alexa Internet ‘toolbar’, a browser-based plugin developed for the purposes of improving early-web navigation and analytics [44, p.274]. Based on user navigation, the toolbar captures and preserves each web page as it is visited, subsequently donating it to the Archive with a six month access embargo (a practice that continues to this day).

The Archive released the Heritrix web crawler as open source in 2002 alongside leading the development of the Wayback Machine software, which currently forms the predominant mechanism by which users collect and access web archives. In 2006, the Archive began supporting Archive-It, the widely used subscription-based web archiving service. The Archive-It team work as part of the ‘web group’ at the Archive, providing technical and storage support for service subscribers and partner organisations. As such, this case study offered the opportunity to observe some of the ways the Archive-It team supports the development of tools and practice for web archiving.

4.2 Mapping Practice Roles and Activities

The web archival activities of the Archive can be broken down into three broad areas: crawling, access and tool development. The Archive is engaged in crawling at many different levels and thus this area includes the crawl activities that are self-directed, those that are fully undertaken at the direction of other organisations and those crawls that are initiated and directed by other organisations using their subscription service Archive-It. The Archive facilitates access and hosting for crawls undertaken by themselves and others through the Wayback Machine, including the provision of tools that allow others to deposit and donate their own WARCs to the collection. And lastly they facilitate web archiving through their ongoing tool and technology development for crawling and replay. In practice, these three areas do not sit in isolation from one another and represent a working environment of overlapping roles, tasks and activities at the Archive. The work that makes up web archival labour permeates each of these activity areas, explored further in the following sections.

4.3 Knowledge Work: Crawling and Curating

Elsewhere, others have acknowledged a certain pre-occupation with abundance and ‘plenitude’ at ‘universal archives’ such as the Internet Archive, an observation which De Kosnik [19, p.95] argues implicitly denies, or at least distracts from, attention to any selectivity in archiving. Although it is clear that the Archive is overtly concerned with abundance and scale in their endeavours to capture more, our observations point strongly towards efforts on their part to increasingly shape and prioritise the web resources that they capture. This point was made clear at the Archive’s 20th Anniversary Party (2016) when Kahle announced that the Archive had (at that point) archived ‘273 billion webpages from over 361 million websites’ with the help of robots and ‘1000 librarians.’ Here and in an interview with Kahle, the Archive signals the creation of the Archive-It subscription service as a significant step towards archiving more selectively, by providing librarians with the tools to save web resources. However, the selection narrative is more complex than this, or as an informant indicated when I enquired about the Archive’s appraisal practices: ‘the process is strategically mish-mashed.’ Collectively, these strategies (some of which are discussed below) can be seen as one component of what Downey [23] calls the ‘knowledge work’, or the ‘high value labour’ that goes into the production of information that is obscured, or marketed as either automated or infinite. The notion of knowledge work is explored below as it relates to the curation of seed lists and what one informant called the ‘hybrid’ crawling activities of the Archive.

One informant relayed that a common misconception about the Archive’s crawling activities is that they employ ‘one giant crawler’ to archive the global Web.8 This is repeated in research literature and popular media articles around the perceived automation of the Archive’s crawling activities. At any given moment, in fact, the Archive has an (unknown) number of crawlers engaged in selective archival activities. A key priority was to begin to map these crawling events during fieldwork, in an effort to understand each as a contingent, sociotechnical assemblage. These crawl events or ‘crawl modalities’ [70] define what is collected in the Archive. Each is shaped by different motivations, priorities and approaches to web archiving.9

8 personal communication
9 The term ‘crawl modality’ is borrowed from Summers and Punzalan [70] who, in a study of web archival appraisal practices, describe the different ways in which crawling activities were broadly conceptualised and implemented. Although different (but overlapping) crawling modalities were identified at the Archive, the focus is similarly on determining how these modalities come to pass and how they shape what is collected.

Looking further back in history to 2010, a moment of significance can be observed when the Wayback team started engaging in and directing their own global crawls. Several informants described the motivations that led to this shift in direction, which was driven (at least in part) by the perceived inadequacies of the crawl data that were seeded, crawled and donated by Alexa Internet to the Archive. The Alexa crawling approach10 is raised here as it is representative of a historical focus on the popularity of sites as a factor in the selection of seeds, one which (until recently) continued to influence the ways in which crawls were prioritised - even in those crawls directed by the Archive itself. The use of Alexa’s ‘top million’ sites as the starting seeds for the ‘wide crawls’ (see below) was discussed as a commonplace practice but often resulted in the over-prioritising of popularity as an indicator of the value in capturing certain web resources - with Gregory claiming that: ‘over 50% of our wide crawl was from 2,000 websites.’ Further complaints were relayed about the quality of Alexa crawls as they do not capture images or embedded dynamic resources, often leading to web archives with extensive missing elements.

10 The mechanisms behind Alexa’s crawlers were presented by informants as not fully understood, and based on a historical understanding of ‘how they used to work.’ The proprietary nature of their crawlers and the resulting crawl data were flagged as an impediment to ever fully understanding the provenance of how resources are prioritised over others within this data.

Various crawl events have subsequently become associated with the Archive’s global crawling efforts, including for example, the ‘survey crawls’ and ‘wide crawls.’ Survey crawls are being used to supplement wide crawls by taking a snapshot of the home page of every domain/host ever identified by the Archive. Wide crawls are run twice a year over 4-6 months, though as Arthur described they had originally envisaged the crawl cycle to run 4-6 times a year. Wide crawls start from a seed list (initially the Alexa top million, as described above) and are then allowed to run autonomously, the bot
following each outlink until ‘it doesn’t produce any interesting data any more.’ When asked how wide crawls were stopped, Arthur said they have to regularly check on each Heritrix instance by manually going into the machine in question and looking at the logs to see what is actually being captured, a process they described as ‘daunting.’ Alex described what they were looking for when they examine the logs, which largely involves a visual inspection of the domain names contained within the capture URLs, watching out for strings that resemble ‘calendar traps,’ pornography and endless Facebook sites. As a direct result of some of the manual labour required to shape and monitor the large-scale crawls at the Archive, engineers began developing various ad-hoc tools to mitigate the need for interacting with the harvester logs and other shell scripts. One such tool is something Arthur calls the ‘Domain Browser tool’, used in conjunction with Hericrawler, a crawl queue management system the Archive developed for orchestrating large-scale crawls:

    The domain browser manual tool is for identifying undesirable domains. It’s used to establish and prioritise ‘shades of gray,’ for example only crawl this site if there are no other sites to crawl. It’s used as a ranking mechanism for prioritising domains based on time, resources and place in the queue, as certain important URLs can get blocked by many instances of unimportant URLs. For example people linking to Facebook pictures can create an infinite loop of queued Facebook links because of the nature of the graph. These types of sites are really slow to crawl as they are hosted on a single site, which must be crawled in succession because of the nature of the Heritrix crawler. Each domain/host is assigned a budget and the crawler is paused if it reaches its budget.

The Domain Browser tool is thus used to (manually) curate undesired domains based on a visual inspection of a gallery of home page thumbnails of each domain/host. The tool is set up to facilitate users tagging the site as pornography, a domain squatter or ‘link farm’ in order to remove it from subsequent crawls. Alex describes the process:

    What we did was hired half a dozen people - they would just go through it and get the top 30,000 hosts [...] and they go through 4-5,000, that’s what one person can do in a month or so. And then we get actual human interaction to say yes, this is a good website. And then we would delete or modify or prioritise based on that input. So having humans actually spend a little bit of time at the top really helped. We’d love to do it further of course.

In addition, Arthur described another similar manual tool called ‘Live Update’ that they use to curate new domains that are discovered through their Wordpress crawls (crawls that are triggered by edits to sites hosted on Wordpress.com). Different to the Domain Browser tool, the Live Update tool dynamically displays the domain/host thumbnail of new domains in realtime, allowing users to choose between overlay buttons tagged ‘P’ for pornography and ‘F’ for link farms, or visit the site for further investigation. Arthur said it was developed in an effort to ‘gamify’ the process of curation and described using the tool whenever he had downtime. When asked how to spot a link farm, Arthur responded that ‘it’s obvious, there’s usually a giant box [iframe] with keywords and a list of domains on the home page, easy to spot.’ The use of manual curation tools reveals both the role of human intervention in the process of curating millions of links and the tensions and trade-offs that exist between the use of bots and a desire to capture ‘high quality’ sites.

In response to the restrictive number of URLs collected by the wide crawls seeded with the Alexa ‘top million’ sites, several informants described some of the Archive’s more recent efforts to study the wide crawls through a grant they received to improve the Wayback Machine in various respects. One such study of wide crawl 12 is captured in a grant milestone report (that was made available) which describes the various techniques used and makes recommendations for improving future crawls. Here Gregory describes the process of studying the hyperlink structure of existing archives to seed crawls:

    I do a lot of link analysis where I study the hyperlink structure of our crawls and try to figure out in certain pockets, use some rank methodologies to figure out ‘oh these are important resources,’ for instance they have a lot of links to them or traffic is really high - let’s seed the crawl with those. The most recent wide crawl I took the most linked to pages from every single website, so 230 million websites [...] and instead of crawling the Alexa top million, let’s crawl this bit. Sort of like a hybrid survey and wide model [...]. And we found resources that we had never crawled before. I’m not saying one is better than the other, I’m saying that hybridising this process might be one way of balancing the scales a little bit.

This type of link analysis thus assists in finding the edge nodes of websites that the crawlers have identified - sometimes upwards of 60 million sites - but do not get around to crawling before the crawl is stopped. These are then used as seeds for the survey crawl (to crawl the home pages of the sites) and to iteratively expand the net and number of websites captured by the Archive. Gregory indicates that through these types of studies, they estimate that at any given time they are only crawling around 20% of the Web. Gregory structures the issues surrounding balance in selection priorities as a problem of resources (a theme which repeatedly arises), but outlines three considerations for determining ‘better crawls’ and ensuring they are crawling the right 20%:

    The way I think of it [...] there’s three branches, there’s popularity, there’s novelty and there’s the risk of going away. How do you achieve that balance? You want to get stuff that people are using - not just junk that is on the Web that you’re just filling up the servers with that won’t ever be found useful, like calendar pages, things like that, crap [...] - there’s no novelty. It’s new? We want to make sure it’s preserved because it just came out, it’s a new article, it’s a new website. And then there’s the risk of going away [...] - if you’re going to shut down this service - Vine is going away - we jump
    in and crawl. So as we’re crawling the Web can we do a good job of sort of achieving that balance? We don’t quite know what the solution is to achieving that balance.

If we expand the picture to look at some of the other crawl modalities of the Archive, the multi-faceted approach to selection becomes even more apparent. A number of techniques for selecting domain/hosts were described by participants associated with the Archive’s contract crawling, or the custom and domain-level crawling undertaken on behalf of partner institutions. A manager, Elaine, described the use of zone files, partner-submitted seed lists (via Google spreadsheets or forms), links embedded in particular social media streams, and using geographical look-ups of existing content held by the Archive to extract relevant domains for crawling. Other sources for selection include ‘listening in’ on Twitter to determine which YouTube videos get linked to, as well as what outlinks get added to Wikipedia - both of which trigger crawl events. Increasingly, the Archive has also been developing a variety of tools that use their longstanding ‘Save Page Now’ feature to promote the saving of web resources to the Wayback Machine by anyone with access to the archive.org home page, or a Firefox plugin.

These methods highlight some of the ways the Archive is leveraging the power and labour of ‘the crowds’ - through the users of Twitter, Wikipedia and Save Page Now - to not only diversify and ‘balance’ the domains/types of resources that are archived, but also (implicitly) co-opt and transform these users into potential stakeholders. Furthermore, the Archive is, in multiple ways, leveraging the web archives amassed in the Wayback Machine project over the last 20 years to continuously increase their net resources.

4.4 Breakdown and Repair: Maintaining Archives

Many informants described scenarios where human mediation was required in otherwise (seemingly) automated processes. Examples that require manual intervention include: running patch crawls when bots failed to crawl designated seeds, or in the event of a crawler trap where bots are trapped in an infinite loop of seed requests (discussed above), or when missing or altered elements are observed in the playback of archived web pages. Fundamentally, each of these boils down to issues surrounding either the capture or replay of web resources. Borrowing from Star and Ruhleder [67], these moments could be considered a form of breakdown which in turn reveals the contingent assemblage of processes that, for example, enable the capture of intended web resources or provide ‘high fidelity’ access to the archived collections.

The issues around the quality assurance of collection capture and playback are not an unknown dimension of web archival practice; however, the processes that are undergone to mitigate these issues are not well documented. Here we draw on the work of Jackson [36] who advocates for an examination of the moments of breakdown, repair and maintenance of technologies, in that they redirect attention to the act and ‘ethics of care’ and embody the creation of value by their maintainers. In other words, the practice and processes involved in fixing and maintaining technologies can be used as evidence for their worth by those who sustain these practices over time. These moments of repair in web archiving are present throughout the study, including the repair and maintenance of crawl data and crawling technologies, Wayback and access tools, as well as the repair of broken links on the live Web by the Archive. Some of these practices are discussed further below, as they highlight some of the factors that influence the decision-making and technical processes that enable the repair of web archives (and the technologies that enable access).

A few training sessions for a web archivist on the Archive-It team, Karen, were observed. Through listening to Karen’s Q&A with other team members, certain junctures were highlighted where support staff are regularly required to prioritise activities, particularly in response to the ‘quality’ or ‘completeness’ of web archives (and as raised by partner organisations). To set the scene a bit, we draw on interpretive notes following an observation session:

    In the second session we all sat on the couch in the pit. Lydia asked Karen if she wanted to continue the training they began in the morning, which was aimed at addressing a recent support request that came in from a partner. When the conversation got a bit technical - as a result of Karen asking increasingly detailed questions about how certain seeds and test crawls are rendered in playback - Mike (a support engineer) was called over by Lydia. Mike walked over and leaned on the couch and began to explain some common differences between capture and playback issues. The consensus from Lydia and Mike seems to be that the first task in the support role is to determine whether the issue is related to capture or playback. Karen is concerned about waiting to solve issues based on partner requests, advocating for the team to be more actively QA’ing collections in case they are at risk of disappearing. Mike responds that there are different issues at play (including time and resources) and that it’s key to understand that capture issues will always take precedence over playback issues - for exactly that reason.

From this vignette several points can be interpreted. First, we can see that Karen is getting to grips with a key aspect of the role of the web archivist at Archive-It, which (in conjunction with support engineers) is to determine the difference between playback and collection issues and respond accordingly. Second, the comment that capture problems will always trump replay problems is insightful. It emphasises the goal of capture and reflects the underlying motivation driving activities - the fear of disappearance. This observation also emphasises the active role of the web archivist and support engineer in the processes that shape the ‘fidelity’ of web archives. They are implicitly driven by time and resource constraints but they are also active participants in the practice of choosing which support issues are prioritised. Mike indicated that specific repair tasks are prioritised in each daily ‘stand-up’ where team members report on upcoming development goals. Web archivists and programme managers will mark tasks as high priority either because the content is at risk of changing or going away or because the request comes from a high priority partner (as both were the case for the whitehouse.gov and End of Term archives that were being captured at the time).
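
As an illustration only, the prioritisation described in this section can be written down as a simple ordering rule. The sketch below is not the Archive-It team's tooling: the `SupportIssue` type, its field names and the `triage` function are hypothetical, and the relative ordering of ‘at risk’ content ahead of ‘high priority partner’ requests is an assumption made for the example (the informants name both factors without ranking them). The only rule taken directly from the fieldwork is that capture issues come before playback issues.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SupportIssue:
    """Hypothetical record for a support request; not an Archive-It data model."""
    title: str
    kind: str               # "capture" or "playback"
    at_risk: bool           # content may change or go away
    priority_partner: bool  # request comes from a high priority partner


def triage_key(issue: SupportIssue) -> Tuple[bool, bool, bool]:
    # False sorts before True, so each element is phrased as "is NOT urgent".
    # 1. Capture issues precede playback issues (stated by informants).
    # 2. At-risk content precedes other content (ordering assumed here).
    # 3. Priority-partner requests precede others (ordering assumed here).
    return (issue.kind != "capture", not issue.at_risk, not issue.priority_partner)


def triage(issues: List[SupportIssue]) -> List[SupportIssue]:
    """Return issues in the order they would be worked on under these assumptions."""
    return sorted(issues, key=triage_key)


if __name__ == "__main__":
    queue = [
        SupportIssue("Missing images on replay", "playback", True, True),
        SupportIssue("Seed not crawled", "capture", False, False),
        SupportIssue("Site shutting down, patch crawl needed", "capture", True, False),
    ]
    for issue in triage(queue):
        print(issue.kind, "-", issue.title)
```

Because the sort is stable, issues that tie on all three criteria keep their arrival order, which loosely mirrors a queue reviewed at a daily stand-up.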

5 SUMMARY AND FUTURE WORK

The aims of the paper were to examine how web archives are structured by the practices associated with collecting and maintaining archives. The use of an ethnographic approach, with a focus on observing practice as web archival labour at the Internet Archive, has revealed a number of insights. The data points towards a complex system of knowledge and maintenance work for prioritising which web assets to collect and repair. The Archive is leveraging their extensive existing archives for understanding networked linking behaviour in an effort to balance the breadth and depth of crawling activities, while discovering new sources for identifying websites to crawl based on measures of popularity, ‘novelty’ and sites that are in danger of going offline. The team has devised multiple mechanisms for identifying different types of ‘undesirable domains,’ including rule-based link pattern-matching and the development of ‘gamified’ tools for the manual curation of sites.

Collectively, the efforts of the Archive can be seen as knowledge work, and these activities, seen in combination with other practices around the prioritisation, repair and maintenance of tools and archives, all have ramifications for how web resources are transformed for use. It is the labour of non/human agents that enables the preservation and ingestion of information from the Web into the Archive, and then once again back to the Web where archives are reassembled via the Wayback Machine. Although imperfect, this labour is increasingly recognised as an essential element of the web architecture. The information labour and knowledge work of potential web archival users is therefore intimately tied to the web archival labour of the Internet Archive. As the global Wayback Machine currently provides access to billions of webpages - often inaccessible elsewhere - editorial decisions have implications for not only the fidelity of archived captures, but indeed whether or not certain parts of the Web are preserved at all. Future work will aim to address any shortcomings in the ethnographic methods presented by further examining the labour of algorithmic and other non-human actors (e.g. hyperlinks, software) implicated in these processes. By continuing to make these practices visible, both the contingencies and value of this labour are revealed and open a new window on this increasingly vital activity.

ACKNOWLEDGMENTS

This work was supported by the UK Engineering and Physical Sciences Research Council and the Web Science Centre for Doctoral Training, Grant No. EP/G036926/1. The authors would also like to thank the Internet Archive and staff for opening their doors and being so generous with their time and feedback.

REFERENCES

[1] Ahmed AlSum, Michele C. Weigle, Michael L. Nelson, and Herbert Sompel. 2014. Profiling web archive coverage for top-level domain and content language. International Journal on Digital Libraries 14, 3 (2014), 149–166. DOI:https://doi.org/10.1007/s00799-014-0118-y
[2] William Y. Arms, Roger Adkins, Casey Ammen, and Allene Hayes. 2001. Collecting and Preserving the Web: the MINERVA Prototype. RLG DigiNews 5, 2 (April 2001). http://worldcat.org/arcviewer/2/OCC/2009/08/11/H1250005040496/viewer/file2.html
[3] Jefferson Bailey, Abigail Grotke, Kristine Hanna, Cathy Hartman, Edward McCain, Christie Moffatt, and Nicholas Taylor. 2014. Web Archiving in the United States: A 2013 Survey. Technical Report. The National Digital Stewardship Alliance. 1–24 pages. http://www.digitalpreservation.gov/documents/NDSA_USWebArchivingSurvey_2013.pdf
[4] Karen Barad. 2003. Posthumanist Performativity: Toward an Understanding of How Matter Comes to Matter. Signs 28, 3 (2003), 801–831. http://www.jstor.org/stable/10.1086/345321
[5] Tom Boellstorff. 2012. Rethinking Digital Anthropology. In Digital Anthropology, Heather A. Horst and Daniel Miller (Eds.). Bloomsbury Publishing, London, 39–60.
[6] Geoffrey C. Bowker. 2005. Memory Practices in the Sciences. MIT Press, Cambridge, MA.
[7] Geoffrey C. Bowker and Susan Leigh Star. 1999. Sorting Things Out: Classification and Its Consequences (paperback ed.). MIT Press, Boston, MA.
[8] Molly Bragg and Kristine Hanna. 2013. The Web Archiving Life Cycle Model. Technical Report. The Archive-It Team and Internet Archive. http://ait.blog.archive.org/files/2014/04/archiveit_life_cycle_model.pdf
[9] Adrian Brown. 2006. Archiving Websites: A Practical Guide for Information Management Professionals. Facet, London.
[10] Richard Harvey Brown and Beth Davis-Brown. 1998. The making of memory: the politics of archives, libraries and museums in the construction of national consciousness. History of the Human Sciences 11, 4 (1998), 17–32.
[11] Niels Brügger. 2008. The Archived Website and Website Philology. Nordicom Review 29, 2 (2008), 155–175.
[12] Niels Brügger. 2009. Website history and the website as an object of study. New Media & Society 11, 1-2 (2009), 115–132. DOI:https://doi.org/10.1177/1461444808099574
[13] Niels Brügger. 2011. Web Archiving - Between Past, Present, and Future. In The Handbook of Internet Studies, Mia Consalvo and Charles Ess (Eds.). Wiley-Blackwell, Oxford, 24–42.
[14] Niels Brügger. 2012. When the Present Web is Later the Past: Web Historiography, Digital History, and Internet Studies. Historical Social Research 37, 4 (2012), 102–117.
[15] Christian Bueger. 2014. Pathways to practice: praxiography and international politics. European Political Science Review 6, 3 (Aug. 2014), 383–406. DOI:https://doi.org/10.1017/S1755773913000167
[16] Terry Cook. 2001. Archival science and postmodernism: new formulations for old concepts. Archival Science 1, 1 (2001), 3–24. DOI:https://doi.org/10.1007/BF02435636
[17] Terry Cook. 2007. Electronic Records, Paper Minds: The Revolution in Information Management and Archives in the Post-Custodial and Post-Modernist Era. Archives & Social Studies: A Journal of Interdisciplinary Research 1, 0 (March 2007), 399–443.
[18] Terry Cook. 2013. Evidence, memory, identity, and community: four shifting archival paradigms. Archival Science 13, 2 (2013), 95–120. DOI:https://doi.org/10.1007/s10502-012-9180-7
[19] Abigail De Kosnik. 2016. Rogue Archives: Digital Cultural Memory and Media Fandom. MIT Press, Cambridge, Massachusetts; London, England.
[20] Jacques Derrida. 1995. Archive Fever: A Freudian Impression. University of Chicago Press, Chicago and London.
[21] Meghan Dougherty and Eric T. Meyer. 2014. Community, Tools, and Practices in Web Archiving: The State-of-the-Art in Relation to Social Science and Humanities Research Needs. Journal of the Association for Information Science and Technology 65, 11 (2014), 2195–2209.
[22] Paul Dourish and Genevieve Bell. 2011. Divining a Digital Future: Mess and Mythology in Ubiquitous Computing. MIT Press, Cambridge, MA.
[23] Gregory J. Downey. 2014. Making Media Work: Time, Space, Identity, and Labor in the Analysis of Information and Communication Infrastructures. In Media Technologies: Essays on Communication, Materiality, and Society, Tarleton Gillespie, Pablo J. Boczkowski, and Kirsten A. Foot (Eds.). MIT Press, Cambridge, Massachusetts; London, England, 141–165.
[24] Clifford Geertz. 1973. The Interpretation of Cultures. Basic Books, New York.
[25] Tarleton Gillespie, Pablo J. Boczkowski, and Kirsten A. Foot. 2014. Introduction. In Media Technologies: Essays on Communication, Materiality, and Society, Tarleton Gillespie, Pablo J. Boczkowski, and Kirsten A. Foot (Eds.). MIT Press, Cambridge, Massachusetts; London, England.
[26] Anne J. Gilliland-Swetland. 2000. Enduring Paradigm, New Opportunities: The Value of the Archival Perspective in the Digital Environment. Technical Report 89. Council on Library and Information Resources, Washington, D.C. https://www.clir.org/pubs/reports/pub89/pub89.pdf
[27] Vinay Goel. 2016. Defining Web pages, Web sites and Web captures. (Oct. 2016). http://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/
[28] Daniel Gomes, Sérgio Freitas, and Mário J. Silva. 2006. Design and Selection Criteria for a National Web Archive. In Research and Advanced Technology for Digital Libraries, Julio Gonzalo, Costantino Thanos, M. Felisa Verdejo, and Rafael C. Carrasco (Eds.). Lecture Notes in Computer Science, Vol. 4172. Springer Berlin Heidelberg, 196–207. http://dx.doi.org/10.1007/11863878_17
[29] Daniel Gomes, João Miranda, and Miguel Costa. 2011. A Survey on Web Archiving Initiatives. In Research and Advanced Technology for Digital Libraries, Stefan
Gradmann, Francesca Borri, Carlo Meghini, and Heiko Schuldt (Eds.). Lecture Notes in Computer Science, Vol. 6966. Springer Berlin Heidelberg, 408–420. http://dx.doi.org/10.1007/978-3-642-24469-8_41
[30] Karen F. Gracy. 2001. The Imperative to Preserve: Competing Definitions of Value in the World of Film Preservation. PhD. University of California, Los Angeles.
[31] Karen F. Gracy. 2004. Documenting Communities of Practice: Making the Case for Archival Ethnography. Archival Science 4, 3 (2004), 335–365. DOI:https://doi.org/10.1007/s10502-005-2599-3
[32] Martyn Hammersley and Paul Atkinson. 2007. Ethnography: Principles and Practice (third ed.). Routledge, London and New York.
[33] Donna Haraway. 1988. Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective. Feminist Studies 14, 3 (1988), 575–599.
[34] Verne Harris. 2000. Law, Evidence and Electronic Records: Strategic Perspective from the Global Periphery. S. A. Archives Journal 41 (2000), 3–19.
[35] Helen Hockx-Yu. 2014. Access and Scholarly Use of Web Archives. Alexandria: The Journal of National and International Library and Information Issues 25, 1-2 (2014), 113–127. DOI:https://doi.org/10.7227/ALX.0023
[36] Steven J. Jackson. 2014. Rethinking Repair. In Media Technologies: Essays on Communication, Materiality, and Society, Tarleton Gillespie, Pablo J. Boczkowski, and Kirsten A. Foot (Eds.). MIT Press, Cambridge, Massachusetts; London, England, 221–239.
[37] Brewster Kahle. 2007. Internet Archive officially a library. (June 2007). https://archive.org/post/121377/internet-archive-officially-a-library
[38] Michael Khoo, Lily Rozaklis, and Catherine Hall. 2012. A survey of the use of ethnographic methods in the study of libraries and library users. Library & Information Science Research 34, 2 (2012), 82–91. DOI:https://doi.org/10.1016/j.lisr.2011.07.010
[39] Michele Kimpton and Jeff Ubois. 2006. Year-by-Year: From an Archive of the Internet to an Archive on the Internet. In Web Archiving (first ed.), Julien Masanès (Ed.). Springer, Berlin, Heidelberg, 201–212.
[40] Rob Kitchin. 2016. Thinking critically about and researching algorithms. Information, Communication & Society 20, 1 (2016), 14–29. DOI:https://doi.org/10.1080/1369118X.2016.1154087
[41] Terry Kuny. 1997. A Digital Dark Ages? Challenges in the Preservation of Electronic Information. In Proceedings of the 63rd International Federation of Library Associations and Institutions. Copenhagen, Denmark.
[42] Kalev Leetaru. 2015. Why It’s So Important To Understand What’s In Our Web Archives. (Nov. 2015). http://onforb.es/1VDPHPH
[43] Kalev Leetaru. 2016. The Internet Archive Turns 20: A Behind The Scenes Look At Archiving The Web. (Jan. 2016). http://www.forbes.com/sites/kalevleetaru/2016/01/18/the-internet-archive-turns-20-a-behind-the-scenes-look-at-archiving-the-web/#747db6257800
[44] Jessica Livingston. 2007. Founders at Work: Stories of Startups’ Early Days. Apress, United States of America.
[45] Deborah Lupton. 2015. Digital Sociology. Routledge, London.
[46] Peter Lyman. 2002. Archiving the World Wide Web. In Building a National Strategy for Preservation: Issues in Digital Media Archiving. Council on Library and Information Resources and the Library of Congress, 38–51. http://www.clir.org/pubs/reports/pub106/web.html
[47] Noortje Marres and Esther Weltevrede. 2013. Scraping the Social? Journal of Cultural Economy 6, 3 (2013), 313–335. DOI:https://doi.org/10.1080/17530350.2013.772070
[48] Julien Masanès (Ed.). 2006. Web Archiving (first ed.). Springer-Verlag Berlin Heidelberg.
[49] Sue Mckemmish and Anne Gilliland. 2013. Archival and recordkeeping research: Past, present and future. In Research Methods: Information, Systems and Contexts, K. Williamson and G. Johanson (Eds.). Tilde Publishing, Prahran, Victoria, 79–112.
[50] Ian Milligan, Nick Ruest, and Jimmy Lin. 2016. Content Selection and Curation for Web Archiving: The Gatekeepers vs. the Masses. In JCDL ’16, June 19-23, 2016, Newark, NJ, USA. ACM, Newark, NJ. DOI:https://doi.org/10.1145/2910896.2910913
[51] Gordon Mohr, Michael Stack, Igor Ranitovic, Dan Avery, and Michele Kimpton. 2004. An Introduction to Heritrix: An open source archival quality web crawler. In Proceedings of the 4th International Web Archiving Workshop. Bath, UK. https://webarchive.jira.com/wiki/download/attachments/5441/Mohr-et-al-2004.pdf
[52] National Digital Stewardship Alliance. 2012. Web Archiving Survey Report. Technical Report. NDSA Content Working Group. http://www.digitalpreservation.gov/documents/ndsa_web_archiving_survey_report_2012.pdf
[53] Richard E. Nisbett and Timothy DeCamp Wilson. 1977. Telling More Than We Can Know: Verbal Reports on Mental Processes. Psychological Review 84, 3 (May 1977), 231–259.
[54] Jinfang Niu. 2012. An Overview of Web Archiving. D-Lib Magazine 18, 3/4 (April 2012). http://www.dlib.org/dlib/march12/niu/03niu1.html
[55] Fatih Oguz and Wallace Koehler. 2015. URL decay at year 20: A research note. Journal of the Association for Information Science and Technology 67, 2 (2015), 477–479. DOI:https://doi.org/10.1002/asi.23561
[56] Maureen Pennock. 2013. Web-Archiving. Technical Report Technology Watch Report 13:01. Digital Preservation Coalition, Great Britain. 1–50 pages. http://dx.doi.org/10.7207/twr13-01
[57] Margaret E. Phillips. 2005. What Should We Preserve? The Question for Heritage Libraries in a Digital World. Library Trends 54, 1 (2005), 57–71. http://muse.jhu.edu/journals/library_trends/v054/54.1phillips.html
[58] Steven Pinch and N Henry. 1999. Discursive Aspects of Technological Innovation: The Case of the British Motor-Sport Industry. Environment and Planning A 31, 4 (1999), 665–682. DOI:https://doi.org/10.1068/a310665
[59] P. Prasad. 1997. Systems of Meaning: Ethnography as a Methodology for the Study of Information Technologies. In Information Systems and Qualitative Research: Proceedings of the IFIP TC8 WG 8.2 International Conference on Information Systems and Qualitative Research, 31st May - 3rd June 1997, Philadelphia, Pennsylvania, USA, Allen S. Lee, Jonathan Liebenau, and Janice I. DeGross (Eds.). Springer US, Boston, MA, 101–118. http://dx.doi.org/10.1007/978-0-387-35309-8_7
[60] Hany M. SalahEldeen and Michael L. Nelson. 2012. Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost? In Theory and Practice of Digital Libraries: Second International Conference, TPDL 2012, Paphos, Cyprus, September 23-27, 2012. Proceedings, Panayiotis Zaphiris, George Buchanan, Edie Rasmussen, and Fernando Loizides (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 125–137. http://dx.doi.org/10.1007/978-3-642-33290-6_14
[61] Steve Schneider and Kirsten Foot. 2008. Archiving of Internet Content. In The International Encyclopedia of Communication, Wolfgang Donsbach (Ed.). Wiley Publishing.
[62] Joan M. Schwartz and Terry Cook. 2002. Archives, Records, and Power: The Making of Modern Memory. Archival Science 2 (2002), 1–19.
[63] Kalpana Shankar. 2004. Recordkeeping in the Production of Scientific Knowledge: An Ethnographic Study. Archival Science 4 (2004), 367–382.
[64] Marc Spaniol, Dimitar Denev, Arturas Mazeika, Gerhard Weikum, and Pierre Senellart. 2009. Data Quality in Web Archiving. In Proceedings of the 3rd Workshop on Information Credibility on the Web. ACM, New York, NY, USA, 19–26. DOI:https://doi.org/10.1145/1526993.1526999
[65] James P. Spradley. 1979. The Ethnographic Interview. Holt, Rinehart and Winston, United States.
[66] James P. Spradley. 1980. Participant Observation. Wadsworth/Thomson Learning, United States.
[67] Susan Leigh Star and Karen Ruhleder. 1996. Steps Toward an Ecology of Infrastructure: Design and Access for Large Information Spaces. Information Systems Research 7, 1 (March 1996).
[68] Ann Laura Stoler. 2002. Colonial archives and the arts of governance. Archival Science 2, 1-2 (2002), 87–109. http://rd.springer.com/article/10.1007%2FBF02435632
[69] Lucy A. Suchman. 2001. Building Bridges: Practice-based Ethnographies of Contemporary Technology. In Anthropological Perspectives on Technology, Michael Schiffer (Ed.). University of New Mexico Press, Albuquerque, 163–177.
[70] Ed Summers and Ricardo Punzalan. 2016. Bots, Seeds and People: Web Archives as Infrastructure. The Computing Research Repository abs/1611.02493 (2016). http://arxiv.org/abs/1611.02493
[71] Diana Taylor. 2003. The Archive and the Repertoire: Performing Cultural Memory in the Americas. Duke University Press, London.
[72] Mike Thelwall and Liwen Vaughan. 2004. A fair history of the Web? Examining country balance in the Internet Archive. Library & Information Science Research 26 (2004), 162–176.
[73] Ciaran B. Trace. 2002. What is recorded is never simply ‘what happened’: Record keeping in modern organizational culture. Archival Science 2, 1 (2002), 137–159. DOI:https://doi.org/10.1007/BF02435634
[74] Michel-Rolph Trouillot. 1995. Silencing the Past: Power and the Production of History. Beacon Press, Boston.
[75] Claire Waterton. 2010. Experimenting with the Archive: STS-ers As Analysts and Co-constructors of Databases and Other Archival Forms. Science, Technology, & Human Values 35, 5 (2010), 645–676.
[76] Collin Webb, David Pearson, and Paul Koerbin. 2013. ‘Oh, you wanted us to preserve that?!’ Statements of Preservation Intent for the National Library of Australia’s Digital Collections. D-Lib Magazine 19, 1/2 (Feb. 2013). http://www.dlib.org/dlib/january13/webb/01webb.print.html
[77] Peter Webster. 2017 (forthcoming). Users, technologies, organisations: towards a cultural history of world web archiving. In Web 25: histories from the first 25 years of the World Wide Web, Niels Brügger (Ed.). Peter Lang.
[78] Aaron Ximm. 2014. Active Personal Archiving and the Internet Archive. In Personal Archiving: Preserving Our Digital Heritage. Information Today, Medford, US, 187–213.
[79] Elizabeth Yakel. 2001. The Social Construction of Accountability: Radiologists and Their Record-Keeping Practices. The Information Society 17, 4 (2001), 233–245. DOI:https://doi.org/10.1080/019722401753330832
