net/publication/263935602
1 author:
Matthew B Hoy
Mayo Foundation for Medical Education and Research
All content following this page was uploaded by Matthew B Hoy on 13 February 2015.
Matthew B. Hoy
ABSTRACT. Modern life produces data at an astounding rate, and shows no signs of
slowing. This has led to new advances in data storage and analysis and the concept of
“big data” – massive data sets that can yield surprising insights when analyzed. This
column will briefly describe what big data is and why it is important. It will also briefly
explore the possibilities and problems of big data and the implications it has for
Media Support Services, Mayo Clinic Health System – Eau Claire, 1221 Whipple Street,
Comments and suggestions should be sent to the Column Editors: Matthew B. Hoy
INTRODUCTION
Everything is data. Everything people do, say, or observe creates more data. For large
parts of human history, data was ephemeral and not recorded in any way. With the
invention of writing, data could be recorded for future reference, but the process was
slow and laborious. Only in recent decades has the ability to capture and analyze data
really taken off; as more and more aspects of daily life are connected to computers and
the Internet, this formerly ephemeral data is now being captured, stored, and analyzed,
often with surprising results. As Anderson noted in 2008, we are living in “the petabyte
age”; this is “the era of big data, where more isn’t just more. More is different.”1 This
column will attempt to briefly define what big data is and why it is important. It will also
briefly explore the possibilities and problems of big data and the implications it has for
There are no clear-cut definitions of what constitutes big data. Ward and Barker found
that there are likely “as many definitions as the number of people you ask.”2 They
surveyed major companies in the IT industry and used the results to develop this
definition:
Big data is a term describing the storage and analysis of large and or complex data
sets using a series of techniques including, but not limited to: NoSQL,
Edd Dumbill, the editor-in-chief of a journal devoted to the topic of big data, offered this
Big data is data that exceeds the processing capacity of conventional database
systems. The data is too big, moves too fast, or doesn't fit the strictures of your
database architectures. To gain value from this data, you must choose an
Although he did not use the term “big data,” as far back as 2001 Laney defined the now
widely accepted three dimensions or “three V’s” of big data: volume, velocity, and
variety.4
Volume refers to the sheer amount of data being created. As McAfee and
Brynjolfsson noted in 2012, “about 2.5 exabytes of data are created each
day, and that number is doubling every 40 months or so.”5 One exabyte is
Congress.
Variety refers both to the types of data being gathered and to the lack of
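The growth figure quoted under Volume (about 2.5 exabytes per day, doubling roughly every 40 months) can be turned into a quick back-of-the-envelope projection. The sketch below is illustrative only: the function names are hypothetical, and it assumes smooth exponential growth from the two quoted numbers, which McAfee and Brynjolfsson do not claim.

```python
def daily_volume_eb(months_elapsed, base_eb_per_day=2.5, doubling_months=40.0):
    """Projected exabytes created per day after `months_elapsed` months.

    Assumes the quoted starting rate (2.5 EB/day) and doubling period
    (40 months) hold exactly; both are illustrative inputs, not measurements.
    """
    return base_eb_per_day * 2 ** (months_elapsed / doubling_months)


def implied_annual_growth(doubling_months=40.0):
    """Yearly growth rate implied by a given doubling period (0.25 means 25%)."""
    return 2 ** (12 / doubling_months) - 1


if __name__ == "__main__":
    # Print the projected daily volume at a few horizons.
    for years in (0, 5, 10):
        print(years, "years:", round(daily_volume_eb(12 * years), 1), "EB/day")
    print("implied annual growth:", round(implied_annual_growth(), 3))
```

Under these assumptions, a 40-month doubling period works out to growth of roughly 23% per year.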
In their book Big Data: A Revolution That Will Transform How We Live, Work and
big data refers to things one can do at a large scale that cannot be done at a
smaller one, to extract new insights or create new forms of value, in ways that
These definitions are all useful, but they fail to give a layman’s sense of what big data is
and how it is created or used. In the simplest terms, big data is the idea that computers
can gather trillions of pieces of information about billions of different things and find
Big data is in its infancy, and it has already changed life in a number of ways. Google’s
web search algorithm was one of the first tools to show the possibilities offered by big
data. Netflix has changed the way people choose and consume movies and television
through its big data recommendation engine. As big data analytics are applied to other
industries like energy, transportation, and medicine, there are sure to be similarly
disruptive breakthroughs.
As medical systems move toward a big data model, patients may soon receive a
diagnosis or have their therapy based on findings from big
on a large set of de-identified cancer patient data that will allow physicians to more
In addition to aggregating data from many sources at once, big data systems can
also be used to track highly granular data about one thing. For example, sensors on a
single piece of industrial equipment can monitor temperature and vibration and use that
data to predict equipment failures before they happen. Another example of this is the idea
of the “quantified self.” The quantified self is the act of measuring and tracking data
about an individual person. Wearable devices can track a person’s heart rate, distance
traveled, and sleep patterns. Internet-enabled scales can upload weight data to health-
tracking websites, where users can also track their food intake. This may sound far-
fetched, but Swan states that “60% of U.S. adults are currently tracking their weight, diet,
or exercise routine, and 33% are monitoring other factors such as blood sugar, blood
Despite all of the promise inherent in the big data revolution, there are also problems that
need to be addressed. Chief among these is the issue of privacy and data ownership;
although the big data era has barely begun, there are already many stories of unintended
privacy breaches. Target recently revealed a teen’s pregnancy to her family by mailing
her coupons featuring baby clothing and supplies. This happened because Target’s
computer system can predict, with startling accuracy, when a woman is pregnant. As
Duhigg found, there are “about 25 products that, when analyzed together, allowed him to
assign each shopper a ‘pregnancy prediction’ score.”9 The problem arose in this case
because computers are unable to make good decisions about what customers consider
sensitive information. Big data systems are also not tactful; several bloggers report being
When retailers can track so many data points about consumer behavior, there are
many more opportunities for damaging disclosures. These disclosures aren’t limited to
intentional acts by the data holders either; news stories about data breaches at major
retailers are now an everyday occurrence. What many consumers don’t realize is that
more than their credit card numbers are at risk. How long will it be before hackers are
selling complete customer data sets along with the usual credit card and Social Security
numbers? Rival retailers and suppliers are likely very interested in knowing what is in
Retail systems are not the only big data privacy concern. Health care, social
networking, and government systems also contain large amounts of sensitive information.
As Ohm put it, “[a]lmost every person in the developed world can be linked to at least
one fact in a computer database that an adversary could use for blackmail, discrimination,
users can no longer realistically opt-out of having their information stored in these
systems. Hill noted that a Princeton professor had to resort to using anonymizing proxy
software and paying cash for baby-related items in order to hide her pregnancy from
Another problem with big data is that it can be misleading; a very strong
correlation can appear that has absolutely no meaning. Boyd and Crawford described this
phenomenon as “apophenia: seeing patterns where none actually exist, simply because
enormous quantities of data can offer connections that radiate in all directions.”12 As an
example, they cite data showing a link between “the S&P 500 stock index and butter
production in Bangladesh.”12 Seeing patterns where none exist is not the only pitfall. Big
data can also amplify and distort legitimate trends, making them appear more important
than they are. When Google’s Flu Trends project grossly overestimated flu outbreaks,
there was talk of “big data hubris,” the mistaken idea that “big data are a substitute for,
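The spurious-correlation pitfall described above is easy to reproduce. The following minimal sketch generates one random "target" series, compares it against many unrelated random series, and reports the strongest correlation found. All names and parameters (1,000 candidate series, 20 data points each, a fixed seed) are assumptions chosen for illustration and are not taken from boyd and Crawford.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_spurious_correlation(n_series=1000, n_points=20, seed=0):
    """Largest |correlation| between a random target and unrelated random series."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_series):
        # Each candidate is independent noise with no real link to the target.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    print("strongest correlation found by chance:",
          round(strongest_spurious_correlation(), 2))
```

Because every series is pure noise, any strong correlation the search turns up is meaningless by construction. The more series that are compared, the stronger the best chance match tends to be, which is the sense in which enormous quantities of data "offer connections that radiate in all directions."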
There are also infrastructure problems to overcome before big data can reach its
full potential. Big data means big space, big power, and big networking.
Information technology systems now account for around 10% of global electricity usage;
that figure is bound to grow rapidly as more disks and CPUs are brought online to store
and crunch the massive amounts of data being created.14 All of those systems need to live
somewhere, and as Blum points out in “Tubes: A Journey to the Center of the Internet,”
there are relatively few locations that have the correct combination of cheap land,
abundant electricity, and easy access to fiber-optic networks.15 Even if CPUs and storage
continue to get less expensive and network capacity increases, the exponential growth in
Libraries and librarians are uniquely suited to working with big data. Libraries have a
long tradition of being early technology adopters, and big data should be no exception.16
There are several key ways librarians can get involved. One of these is through collection
development and preservation of data sets. As more users become interested in working
with big data, they will need guidance and material to work with. Thanks to a recent
presidential order spurring agencies to open their data vaults, there are now many
government data sets available for researchers to use. Unfortunately, as Schwartz noted,
the Executive Order that mandated public access to these data sets “does not provide for
preservation or create any user-centered services for the information.”17 Librarians are
well-positioned to help users understand how and where to find these data sets and to
Another way librarians can get involved with big data is by working within their
institutions to help with research data management. Researchers need assistance with data
management, especially because many funding agencies now have very specific data
retention guidelines that must be followed. Librarians can “help researchers, even in the
planning phases of their projects, to appraise and think about the archival and
preservation options for their data, as well as the potential for sharing their data.”18
Librarians also need to act as a voice for balance and reason within their
organizations. While big data analytics has many interesting possibilities that should be
explored, there is no substitute for more traditional methods of research. As Bell put it,
“investing heavily in data may look appealing, particularly when a win or two leave us
feeling satisfied, but the peril is overconfident thinking.”19 As with any other resource or
research method, librarians will need to help their patrons understand what big data can
and cannot do and how it can best be used to achieve their research goals.
CONCLUSION
The appeal of big data is undeniable. The idea of extracting new and exciting insights
from previously unmanageable data is a bit like finding the proverbial needle in a
haystack. Just as they have with previous technological advances, librarians should
become familiar with the possibilities and problems inherent to big data and use that
queries to predict flu outbreaks around the world. A similar project tracks outbreaks
of dengue fever.
allows users to estimate the time it would take to travel between any two addresses in
New York City using public transportation. The project is driven by a set of over 4.2
collection of large data sets from a number of fields. These data sets can easily be
United States Federal government agencies. These data sets are being made available
REFERENCES
1. Anderson, Chris. “The End of Theory: The Data Deluge Makes the Scientific Method
http://archive.wired.com/science/discoveries/magazine/16-07/pb_theory.
2. “The Big Data Conundrum: How to Define It?” MIT Technology Review. October 3,
2013. http://www.technologyreview.com/view/519851/the-big-data-conundrum-how-to-
define-it/.
3. Dumbill, Edd. “Making Sense of Big Data.” Big Data 1 no. 1 (February 2013): 1–2.
doi:10.1089/big.2012.1503.
4. Laney, Doug. “3D Data Management: Controlling Data Volume, Velocity and Variety.”
5. McAfee, Andrew, and Erik Brynjolfsson. “Big Data: The Management Revolution.”
6. Mayer-Schönberger, Viktor, and Kenneth Cukier. Big Data: A Revolution That Will
Transform How We Live, Work, and Think. Boston: Houghton Mifflin Harcourt, 2013.
7. Winslow, Ron. “‘Big Data’ for Cancer Care.” Wall Street Journal, March 27, 2013.
http://online.wsj.com/news/articles/SB10001424127887323466204578384732911187000
8. Swan, Melanie. “The Quantified Self: Fundamental Disruption in Big Data Science and
9. Duhigg, Charles. “How Companies Learn Your Secrets.” The New York Times, February
10. Hill, Kashmir. “You Can Hide Your Pregnancy Online, But You’ll Feel Like A
http://www.forbes.com/sites/kashmirhill/2014/04/29/you-can-hide-your-pregnancy-
online-but-youll-feel-like-a-criminal/.
11. Ohm, Paul. “Broken Promises of Privacy: Responding to the Surprising Failure of
Anonymization.” August 13, 2009. SSRN Scholarly Paper ID 1450006. Rochester, NY:
12. boyd, danah, and Kate Crawford. “Critical Questions for Big Data.” Information,
doi:10.1080/1369118X.2012.678878.
13. Lazer, David, Ryan Kennedy, Gary King, and Alessandro Vespignani. “Big Data. The
Parable of Google Flu: Traps in Big Data Analysis.” Science (New York, N.Y.) 343 no.
14. Clark, Jack. “IT Now 10 Percent of World’s Electricity Consumption, Report Finds.” The
http://www.theregister.co.uk/2013/08/16/it_electricity_use_worse_than_you_thought/.
15. Blum, Andrew. Tubes: A Journey to the Center of the Internet. New York:
HarperCollins, 2012.
16. Huwe, T.K. “Big Data and the Library: A Natural Fit.” Computers in Libraries 34 no. 2
17. Schwartz, Meredith. “What Governmental Big Data May Mean For Libraries.” Library
data-may-mean-for-libraries/.
18. Creamer, Andrew T., Elaine R. Martin, and Donna Kafel. “Research Data Management
19. Bell, Steven. “Promise and Problems of Big Data | From the Bell Tower.” Library
bell/promise-and-problems-of-big-data-from-the-bell-tower/.