Big Data: An Introduction for Librarians

Article  in  Medical Reference Services Quarterly · July 2014


DOI: 10.1080/02763869.2014.925709 · Source: PubMed

1 author:

Matthew B Hoy
Mayo Foundation for Medical Education and Research

All content following this page was uploaded by Matthew B Hoy on 13 February 2015.


EMERGING TECHNOLOGIES

Matthew B. Hoy and Tara Brigham, Column Editors

Big Data: An Introduction for Librarians

Matthew B. Hoy

ABSTRACT. Modern life produces data at an astounding rate, and shows no signs of

slowing. This has led to new advances in data storage and analysis and the concept of

“big data” – massive data sets that can yield surprising insights when analyzed. This

column will briefly describe what big data is and why it is important. It will also briefly

explore the possibilities and problems of big data and the implications it has for

librarians. A list of big data projects and resources is also included.

KEYWORDS. Big data, Internet, research data management

AUTHOR. Matthew Hoy, MLIS (hoy.matt@mayo.edu) is Supervisor of Library and

Media Support Services, Mayo Clinic Health System – Eau Claire, 1221 Whipple Street,

Eau Claire, WI 54701.

Comments and suggestions should be sent to the Column Editors: Matthew B. Hoy

(hoy.matt@mayo.edu) and Tara Brigham (Brigham.Tara@mayo.edu).

INTRODUCTION

Everything is data. Everything people do, say, or observe creates more data. For large

parts of human history, data was ephemeral and not recorded in any way. With the

invention of writing, data could be recorded for future reference, but the process was

slow and laborious. Only in recent decades has the ability to capture and analyze data

really taken off; as more and more aspects of daily life are connected to computers and

the Internet, this formerly ephemeral data is now being captured, stored, and analyzed,

often with surprising results. As Anderson noted in 2008, we are living in “the petabyte

age”; this is “the era of big data, where more isn’t just more. More is different.”1 This

column will attempt to briefly define what big data is and why it is important. It will also

briefly explore the possibilities and problems of big data and the implications it has for

librarians. A list of big data projects and resources is also included.

WHAT IS BIG DATA?

There are no clear-cut definitions of what constitutes big data. Ward and Barker found

that there are likely “as many definitions as the number of people you ask.”2 They

surveyed major companies in the IT industry and used the results to develop this

definition:

Big data is a term describing the storage and analysis of large and or complex data

sets using a series of techniques including, but not limited to: NoSQL,

MapReduce and machine learning.2
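
Of the techniques this definition names, MapReduce is perhaps the easiest to illustrate on a small scale. The following Python sketch is a single-machine toy, not a description of how any production framework is implemented; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions. It counts words by mapping each document to (word, 1) pairs, grouping the pairs by word, and reducing each group to a total.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values under their shared key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's list of values into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


counts = word_count(["big data is big", "data about data"])
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines, which is what allows the same pattern to scale to very large data sets.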

Edd Dumbill, the editor-in-chief of a journal devoted to the topic of big data, offered this

broader, more conceptual definition:

Big data is data that exceeds the processing capacity of conventional database

systems. The data is too big, moves too fast, or doesn't fit the strictures of your

database architectures. To gain value from this data, you must choose an

alternative way to process it.3

Although he did not use the term “big data,” as far back as 2001 Laney defined the now

widely-accepted three dimensions or “three V’s” of big data: volume, velocity, and

variety.4

• Volume refers to the sheer amount of data being created. As McAfee and

Brynjolfsson noted in 2012, “about 2.5 exabytes of data are created each

day, and that number is doubling every 40 months or so.”5 One exabyte is

roughly equivalent to 4,000 times the amount of data in the Library of

Congress.

• Velocity refers to the speed with which data is being created.

• Variety refers both to the types of data being gathered and to the lack of

uniform structure in the data.
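
To give a sense of what the volume figures imply, the sketch below extrapolates from McAfee and Brynjolfsson's numbers. It assumes, purely for illustration, that the 40-month doubling period holds steady, which the source states only approximately; the function name and its parameters are hypothetical.

```python
def daily_volume_exabytes(months_elapsed, base_exabytes=2.5, doubling_months=40.0):
    """Projected exabytes created per day after `months_elapsed` months.

    Assumes the 2012 baseline of 2.5 exabytes per day and a constant
    40-month doubling period (an illustrative assumption, not a forecast).
    """
    return base_exabytes * 2 ** (months_elapsed / doubling_months)


for months in (0, 40, 80, 120):
    print(f"{months:3d} months: {daily_volume_exabytes(months):5.1f} exabytes per day")
```

Under that assumption, 2.5 exabytes per day becomes 20 exabytes per day after 120 months, an eightfold increase in ten years.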

In their book Big Data: A Revolution That Will Transform How We Live, Work, and

Think, Mayer-Schönberger and Cukier provide a more abstract definition:

big data refers to things one can do at a large scale that cannot be done at a

smaller one, to extract new insights or create new forms of value, in ways that

change markets, organizations, the relationships between citizens and

governments, and more.6

These definitions are all useful, but they fail to give a layman’s sense of what big data is

and how it is created or used. In the simplest terms, big data is the idea that computers

can gather trillions of pieces of information about billions of different things and find

useful patterns in that information.

BIG DATA POSSIBILITIES

Big data is in its infancy, and it has already changed life in a number of ways. Google’s

web search algorithm was one of the first tools to show the possibilities offered by big

data. Netflix has changed the way people choose and consume movies and television

through its big data recommendation engine. As big data analytics are applied to other

industries like energy, transportation, and medicine, there are sure to be similarly

disruptive breakthroughs.

As medical systems move toward a big data model, it is likely that in the near

future patients will receive a diagnosis or have their therapy based on findings from big

data systems. One such system is already in development; CancerLinQ

<http://www.asco.org/quality-guidelines/cancerlinq> is a “learning health system” based

on a large set of de-identified cancer patient data that will allow physicians to more

quickly and accurately diagnose and treat other patients.7

In addition to aggregating data from many sources at once, big data systems can

also be used to track highly granular data about one thing. For example, sensors on a

single piece of industrial equipment can monitor temperature and vibration and use that

data to predict equipment failures before they happen. Another example of this is the idea

of the “quantified self.” The quantified self is the act of measuring and tracking data

about an individual person. Wearable devices can track a person’s heart rate, distance

traveled, and sleep patterns. Internet-enabled scales can upload weight data to health-

tracking websites, where users can also track their food intake. This may sound far-

fetched, but Swan states that “60% of U.S. adults are currently tracking their weight, diet,

or exercise routine, and 33% are monitoring other factors such as blood sugar, blood

pressure, headaches, or sleep patterns.”8

BIG DATA PROBLEMS

Despite all of the promise inherent in the big data revolution, there are also problems that

need to be addressed. Chief among these is the issue of privacy and data ownership;

although the big data era has barely begun, there are already many stories of unintended

privacy breaches. Target recently revealed a teen’s pregnancy to her family by mailing

her coupons featuring baby clothing and supplies. This happened because Target’s

computer system can predict, with startling accuracy, when a woman is pregnant. As

Duhigg reported, a Target statistician identified “about 25 products that, when analyzed together, allowed him to

assign each shopper a ‘pregnancy prediction’ score.”9 The problem arose in this case

because computers are unable to make good decisions about what customers consider

sensitive information. Big data systems are also not tactful; several bloggers report being

deluged by marketing for baby-related products after they have miscarried.10
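
The general idea of scoring shoppers from a small set of purchases can be sketched as follows. This is not Target's model: the product names, point values, and threshold are invented for illustration, and the function names are hypothetical. The sketch also shows why such a system is not tactful, since it acts on the score alone and has no notion of what a customer considers sensitive.

```python
# Hypothetical signal products and point values, invented for illustration only.
SIGNAL_POINTS = {
    "product_a": 30,
    "product_b": 25,
    "product_c": 15,
    "product_d": 10,
}

# Hypothetical cutoff above which marketing material is sent.
MAILING_THRESHOLD = 50


def prediction_score(basket):
    """Sum the points of every signal product found in the shopper's basket."""
    return sum(points for product, points in SIGNAL_POINTS.items() if product in basket)


def should_send_coupons(basket):
    """Decide on the score alone; nothing here models sensitivity or context."""
    return prediction_score(basket) >= MAILING_THRESHOLD


shopper = {"product_a", "product_b", "bread"}
print(prediction_score(shopper), should_send_coupons(shopper))
```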

When retailers can track so many data points about consumer behavior, there are

many more opportunities for damaging disclosures. These disclosures aren’t limited to

intentional acts by the data holders either; news stories about data breaches at major

retailers are now an everyday occurrence. What many consumers don’t realize is that

more than their credit card numbers are at risk. How long will it be before hackers are

selling complete customer data sets along with the usual credit card and Social Security

numbers? Rival retailers and suppliers are likely very interested in knowing what is in

Target’s or Walmart’s customer data.

Retail systems are not the only big data privacy concern. Health care, social

networking, and government systems also contain large amounts of sensitive information.

As Ohm put it, “[a]lmost every person in the developed world can be linked to at least

one fact in a computer database that an adversary could use for blackmail, discrimination,

harassment, or financial or identity theft.”11 This is especially troubling because most

users can no longer realistically opt out of having their information stored in these

systems. Hill noted that a Princeton professor had to resort to using anonymizing proxy

software and paying cash for baby-related items in order to hide her pregnancy from

retailers and data brokers.10

Another problem with big data is that it can be misleading; a very strong

correlation can appear that has absolutely no meaning. Boyd and Crawford described this

phenomenon as “apophenia: seeing patterns where none actually exist, simply because

enormous quantities of data can offer connections that radiate in all directions.”12 As an

example, they cite data showing a link between “the S&P 500 stock index and butter

production in Bangladesh.”12 Seeing patterns where none exist is not the only pitfall. Big

data can also amplify and distort legitimate trends, making them appear more important

than they are. When Google’s Flu Trends project grossly overestimated flu outbreaks,

there was talk of “big data hubris,” the mistaken idea that “big data are a substitute for,

rather than a supplement to, traditional data collection and analysis.”13
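
The way sheer quantity produces meaningless correlations can be demonstrated with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis of any real data set: it generates one random "target" series and many unrelated random series, then reports the strongest correlation found among 10 candidates and among 1,000. The series length, candidate counts, and random seed are arbitrary choices, and the function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """A series of running totals of random steps; it measures nothing real."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0.0, 1.0)
        walk.append(value)
    return walk


def strongest_match(target, candidates):
    """Largest absolute correlation between the target and any candidate."""
    return max(abs(pearson(target, candidate)) for candidate in candidates)


rng = random.Random(2014)  # arbitrary seed so the run is repeatable
target = random_walk(rng, 20)
candidates = [random_walk(rng, 20) for _ in range(1000)]

few = strongest_match(target, candidates[:10])
many = strongest_match(target, candidates)
print(f"strongest |r| among   10 unrelated series: {few:.2f}")
print(f"strongest |r| among 1000 unrelated series: {many:.2f}")
```

Because the larger pool contains the smaller one, the strongest match can only stay the same or grow as more unrelated series are added, even though none of them has any real connection to the target.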

There are also infrastructure problems to overcome before big data can reach

its full potential. Big data means big space, big power, and big networking.

Information technology systems now account for around 10% of global electricity usage;

that figure is bound to grow rapidly as more disks and CPUs are brought online to store

and crunch the massive amounts of data being created.14 All of those systems need to live

somewhere, and as Blum points out in “Tubes: A Journey to the Center of the Internet,”

there are relatively few locations that have the correct combination of cheap land,

abundant electricity, and easy access to fiber-optic networks.15 Even if CPUs and storage

continue to get less expensive and network capacity increases, the exponential growth in

data volume far outstrips gains being made in those areas.

BIG DATA AND LIBRARIES

Libraries and librarians are uniquely suited to working with big data. Libraries have a

long tradition of being early technology adopters, and big data should be no exception.16

There are several key ways librarians can get involved. One of these is through collection

development and preservation of data sets. As more users become interested in working

with big data, they will need guidance and material to work with. Thanks to a recent

presidential order spurring agencies to open their data vaults, there are now many

government data sets available for researchers to use. Unfortunately, as Schwartz noted,

the Executive Order that mandated public access to these data sets “does not provide for

preservation or create any user-centered services for the information.”17 Librarians are

well-positioned to help users understand how and where to find these data sets and to

preserve them for future users.

Another way librarians can get involved with big data is by working within their

institutions to help with research data management. Researchers need assistance with data

management, especially because many funding agencies now have very specific data

retention guidelines that must be followed. Librarians can “help researchers, even in the

planning phases of their projects, to appraise and think about the archival and

preservation options for their data, as well as the potential for sharing their data.”18

Librarians also need to act as a voice for balance and reason within their

organizations. While big data analytics has many interesting possibilities that should be

explored, there is no substitute for more traditional methods of research. As Bell put it,

“investing heavily in data may look appealing, particularly when a win or two leave us

feeling satisfied, but the peril is overconfident thinking.”19 As with any other resource or

research method, librarians will need to help their patrons understand what big data can

and cannot do, and how it can best be used to achieve their research goals.

CONCLUSION

The appeal of big data is undeniable. The idea of extracting new and exciting insights

from previously unmanageable data is a bit like finding the proverbial needle in a

haystack. Just as they have with previous technological advances, librarians should

become familiar with the possibilities and problems inherent to big data and use that

knowledge to help their patrons choose the right tools.

BIG DATA PROJECTS AND RESOURCES

• Google Flu Trends <http://www.google.org/flutrends/> analyzes Google users’ search

queries to predict flu outbreaks around the world. A similar project tracks outbreaks

of dengue fever.

• Transit Time NYC <http://project.wnyc.org/transit-time/> is an interactive map that

allows users to estimate the time it would take to travel between any two addresses in

New York City using public transportation. The project is driven by a set of over 4.2

million calculations that were run on a large Amazon computing cluster.

• Amazon Web Services Public Data Sets <http://aws.amazon.com/datasets/> is a

collection of large data sets from a number of fields. These data sets can easily be

analyzed and integrated into applications running on Amazon’s Web Services

platform. Notable examples include Google Books Ngrams, Ensembl Annotated

Human Genome Data, and the Common Crawl Corpus.

• Data.gov <http://catalog.data.gov/dataset> is a large collection of data sets from

United States federal government agencies. These data sets are being made available

in a central clearinghouse as part of a new Open Data Policy.

REFERENCES

1. Anderson, Chris. “The End of Theory: The Data Deluge Makes the Scientific Method

Obsolete.” WIRED. June 27, 2008.

http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.

2. “The Big Data Conundrum: How to Define It?” MIT Technology Review. October 3,

2013. http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-

define-it/.

3. Dumbill, Edd. “Making Sense of Big Data.” Big Data 1 no. 1 (February, 2013): 1–2.

doi:10.1089/big.2012.1503.

4. Laney, Doug. “3D Data Management: Controlling Data Volume, Velocity and Variety.”

February 2001. META Group Research Note 6.

5. McAfee, Andrew, and Erik Brynjolfsson. “Big Data: The Management Revolution.”

Harvard Business Review 90 no. 10 (October 2012): 60–66, 68, 128.

6. Mayer-Schönberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will

Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt, 2013.

7. Winslow, Ron. “‘Big Data’ for Cancer Care.” Wall Street Journal, March 27, 2013.

http://online.wsj.com/news/articles/SB10001424127887323466204578384732911187000

8. Swan, Melanie. “The Quantified Self: Fundamental Disruption in Big Data Science and

Biological Discovery.” Big Data 1 no. 2 (June, 2013): 85–99. doi:10.1089/big.2012.0002.

9. Duhigg, Charles. “How Companies Learn Your Secrets.” The New York Times, February

16, 2012. http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

10. Hill, Kashmir. “You Can Hide Your Pregnancy Online, But You’ll Feel Like A

Criminal.” Forbes. April 29, 2014.

http://www.forbes.com/sites/kashmirhill/2014/04/29/you-can-hide-your-pregnancy-

online-but-youll-feel-like-a-criminal/.

11. Ohm, Paul. “Broken Promises of Privacy: Responding to the Surprising Failure of

Anonymization.” August 13, 2009. SSRN Scholarly Paper ID 1450006. Rochester, NY:

Social Science Research Network. http://papers.ssrn.com/abstract=1450006.

12. boyd, danah, and Kate Crawford. “Critical Questions for Big Data.” Information,

Communication & Society 15 no. 5 (May 10, 2012): 662–79.

doi:10.1080/1369118X.2012.678878.

13. Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. “Big Data. The

Parable of Google Flu: Traps in Big Data Analysis.” Science (New York, N.Y.) 343 no.

6176 (March 14, 2014): 1203–1205. doi:10.1126/science.1248506.

14. Clark, Jack. “IT Now 10 Percent of World’s Electricity Consumption, Report Finds.” The

Register. August 16, 2013.

http://www.theregister.co.uk/2013/08/16/it_electricity_use_worse_than_you_thought/.

15. Blum, Andrew. Tubes: A Journey to the Center of the Internet. New York:

HarperCollins, 2012.

16. Huwe, T.K. “Big Data and the Library: A Natural Fit.” Computers in Libraries 34 no. 2

(March 2014): 17–18.

17. Schwartz, Meredith. “What Governmental Big Data May Mean For Libraries.” Library

Journal. May 30, 2013. http://lj.libraryjournal.com/2013/05/oa/what-governmental-big-

data-may-mean-for-libraries/.

18. Creamer, Andrew T., Elaine R. Martin, and Donna Kafel. “Research Data Management

and the Health Sciences Librarian.” In Health Sciences Librarianship, edited by M.

Sandra Wood, 252-274. Lanham, MD: Rowman & Littlefield, 2014.

19. Bell, Steven. “Promise and Problems of Big Data | From the Bell Tower.” Library

Journal. March 13, 2013. http://lj.libraryjournal.com/2013/03/opinion/steven-

bell/promise-and-problems-of-big-data-from-the-bell-tower/.
