Introduction
This paper examines the phenomenon of Big Data and the contribution
that will be made to that data by the Internet of Things. It is suggested
that an understanding of Big Data will provide businesses and
organisations with significant opportunities to use informational datasets
in many aspects of their activities. A significant contributor to already
existing Big Data datasets will be the Internet of Things (IoT) allowing for
information gathering on a hitherto unprecedented scale. The proper
management of this data to make it meaningful and useful for an
organisation is the purpose of Information Governance. It is argued that
Information Governance is an essential business strategy that not only
enables a business to use data effectively but also lays a significant
preparatory foundation for compliance with e-discovery obligations in the
event of litigation. It is suggested that rather than being viewed as a
discrete process, e-discovery should be seen as a part of an overall
Information Governance strategy.
What is Big Data?
The term Big Data was first coined in the 1990s but it was not until the enabling technologies became available that Big Data became a reality. In the past the phenomenon was restricted to the research field, but now Big Data analysis is required in many commercial fields and services which reflect our modern society.
Big data1 is a broad term for data sets so large or complex that traditional data processing applications are inadequate.2 Big data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale. Challenges posed by these datasets include analysis, capture, curation, searching, sharing, storage, transfer, visualization, and information privacy. Analysis of such data sets can find new correlations to spot business trends, prevent diseases, combat crime and so on.3
The term Big Data," which spans computer science and statistics/econometrics, probably
originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid 1990s,in which John
Mashey _gured prominently. The _rst signi_cant academic references are arguably Weiss and
Indurkhya (1998) in computer science and Diebold (2000) in statistics/econometrics. An
unpublished 2001 research note by Douglas Laney at Gartner enriched the concept significantly.
Hence the term \Big Data" appears reasonably attributed to Massey, Indurkhya and Weiss, Diebold,
and Laney. See Diebold, Francis X., On the Origin(s) and Development of the Term 'Big Data' (September 21, 2012).
PIER Working Paper No. 12-037. Available at SSRN: http://ssrn.com/abstract=2152421 or
http://dx.doi.org/10.2139/ssrn.2152421
2 The term often refers more to the use of predictive analytics or certain other advanced methods to extract value from data, rather than to a particular size of data set.
Data sets grow in size in part because they are increasingly being supplemented by cheap, numerous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5×10¹⁸ bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organisation.
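To make these growth figures concrete, the following short Python sketch applies the 40-month doubling time quoted above; the ten-year projection horizon is an arbitrary assumption chosen for illustration.

DOUBLING_MONTHS = 40
EXABYTES_PER_DAY_2012 = 2.5   # 2.5 x 10^18 bytes, per the figure above

def growth_factor(months: float) -> float:
    # Multiplicative growth after `months`, given a 40-month doubling time.
    return 2 ** (months / DOUBLING_MONTHS)

# Example: projected daily data volume 120 months (ten years) after 2012.
# 120 months is exactly three doubling periods, so 2.5 * 2^3 = 20 exabytes.
projected = EXABYTES_PER_DAY_2012 * growth_factor(120)
print(f"Projected daily volume after 10 years: {projected:.1f} exabytes")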
The "big" of Big Data is a relative term. Some organisations may have gigabytes or terabytes of data, in comparison to big global organisations that may have several petabytes or exabytes of data to handle. However, it is a reality of life in the digital paradigm that data is going to increase day by day, and its volume will depend upon the size of the organisation.
Characteristics of Big Data
In a 2001 research report and related lectures, META Group (now Gartner) analyst Doug Laney defined data growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out), and variety (range of data types and sources). Gartner, and now much of the industry, continue to use this "3Vs" model for describing big data. In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, a new V, "Veracity", is added by some organisations, together with complexity, which describes an aspect of data management. These characteristics may be detailed as follows:
Volume - The quantity of data that is generated is very important in this context. It is the size or quantity of the data which determines its value and potential, and whether it can actually be considered Big Data or not. The name Big Data itself contains a term related to size, hence this characteristic.
Variety - The next aspect of Big Data is its variety, meaning the category to which the data belongs. This is an essential fact that needs to be known by the data analysts, as it helps those who closely analyse the data to use it effectively to their advantage, thus emphasising the importance of the Big Data. Examples can include images, video and many other multimedia files.
To properly understand Big Data it is necessary to consider its movement as a process that goes through five major stages, which can be described as a pipeline. The first stage is data accession and recording. Big Data does not emerge out of a vacuum; it is recorded from a source. In some respects our sensors are data gatherers, detecting smells in the air, the heart rate of a person, the number of steps that a person takes, weight at a particular time of day and so on. Modern technologies enable this data to be recorded, in the same way that a telescope may generate a large amount of raw data.
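As a concrete illustration of this accession-and-recording stage, the following minimal Python sketch appends timestamped raw readings to an append-only log. The sensor names, values and log file are illustrative assumptions, not a real device interface.

import json
import time

def record_reading(sensor: str, value: float, log_path: str = "readings.log") -> None:
    # Append one timestamped raw reading to an append-only log file.
    entry = {"ts": time.time(), "sensor": sensor, "value": value}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")

# Example: the kinds of personal readings mentioned above.
record_reading("heart_rate_bpm", 72.0)
record_reading("steps", 8425)
record_reading("weight_kg", 70.3)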
Information pulling and filtering is the second stage of the process and recognises that the raw data we collect from accession and recording is not in a format ready for analysis. It cannot be analysed while left in its raw format. What is required is an information extraction process that can generate the required information from the source and express it in a structured form.
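A minimal sketch of such an extraction process follows: raw lines are filtered against an expected pattern and emitted as structured records. The "<sensor>=<value>" line format is a hypothetical example, not a real device format.

import re

RAW_LINES = [
    "heart_rate_bpm=72",
    "steps=8425",
    "garbage line that does not match",
    "weight_kg=70.3",
]

PATTERN = re.compile(r"^(?P<sensor>\w+)=(?P<value>[\d.]+)$")

def extract(lines):
    # Keep only lines matching the expected pattern; emit structured dicts.
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:  # noisy, non-matching lines are filtered out
            yield {"sensor": match["sensor"], "value": float(match["value"])}

print(list(extract(RAW_LINES)))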
The next stage in the pipeline is that of data integration, amalgamation and representation. Because of its very nature, its non-uniformity and its volume, it is not enough to simply record data and warehouse it. If we have a bundle of data sets in a warehouse it may be impossible to find, reuse and utilise such data unless it is associated with appropriate metadata. Data analysis is a lot more challenging than merely locating, understanding, recognising and referencing data. For very large scale analysis these processes have to be approached from an automated perspective. This requires a set of syntax and semantics in forms that are machine readable and machine resolvable. The final outcome will be an integration of data sets.
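One way to make a warehoused data set locatable and machine-resolvable is to pair it with machine-readable metadata, as in the sketch below. The metadata fields shown are illustrative assumptions rather than any formal standard.

import json

# Metadata describing a (hypothetical) warehoused data set: its source,
# schema and units, so that automated tools can locate and combine it.
dataset_metadata = {
    "name": "wearable_readings_2015_03",
    "source": "wrist-worn fitness sensor (hypothetical)",
    "schema": {"ts": "unix seconds", "sensor": "string", "value": "float"},
    "units": {"heart_rate_bpm": "beats/min", "weight_kg": "kg"},
    "collected": "2015-03",
}

# Storing the metadata alongside the data keeps the set discoverable later.
with open("wearable_readings_2015_03.meta.json", "w") as f:
    json.dump(dataset_metadata, f, indent=2)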
The fourth phase in the pipeline is that of query processing, data modelling and analysis. The techniques for mining and retrieval of Big Data are fundamentally different from conventional statistical analysis of small samples. By its very nature Big Data is somewhat chaotic, occasionally inter-related and noisy. Nevertheless it can be more valuable than small samples because, once it has been properly analysed, the statistical sample is larger and therefore more meaningful. Thus mining Big Data requires filtered, integrated, trustworthy, effectively accessible data, scalable algorithms and environments suitable for performing the necessary Big Data computations. The effective mining of Big Data can be used to improve the standards and trustworthiness of the data achieved as well as the conclusions drawn from it.
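As a toy illustration of the scalability point, the sketch below splits a synthetic data set into chunks, computes partial statistics for each chunk independently (here, in parallel processes) and merges the partial results, the pattern that underlies most environments for large-scale Big Data computation.

from concurrent.futures import ProcessPoolExecutor
from functools import reduce

def partial_stats(chunk):
    # Per-chunk count and sum: cheap to compute and cheap to merge.
    return (len(chunk), sum(chunk))

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

if __name__ == "__main__":
    data = list(range(1_000_000))  # synthetic stand-in for a large data set
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(partial_stats, chunks))
    count, total = reduce(merge, parts)
    print(f"mean over {count} values: {total / count}")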
The final stage in the pipeline is that of data elucidation and interpretation. How do we present Big Data analysis in a form that can be easily understood by the user? If effective interpretations of the data cannot be made then it is of limited value.
The Internet of Things
Wearable devices can record a growing range of personal data, including heart rate, blood pressure and breathing rates, all of them important in monitoring improvements in health.
Cars may not yet be autonomous but new models may have many
internet addressable capabilities including remote start, remote climate
control, location tracking as well as the currently latent ability to track
many driving habits. Every time there is a warning beep another item of
data is recorded.
The benefits of the Internet of Things, based solely on products that exist
today, let alone unimagined combinations of emerging capabilities, are
considerable. The smart phone has become the remote control for many people's existence. Data is available at a finger swipe on everything
imaginable. Yet at the same time there are challenges and disruptions
ahead including technical issues, business issues, requirements for new
and evolving skill sets, legal and legislative difficulties and social
complexities.
There are a number of levels of business opportunities:
1) Basic components and devices that connect to the network via Wi-Fi or Bluetooth (a minimal sketch of such a device follows this list).
2) Entirely new aggregated products and systems that combine these devices in new ways, like home management systems.
3) All of the services providing customised solutions to business and consumers, including data analysis services to help make sense of the vast amount of Big Data generated by the IoT.
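As a minimal sketch of the first level, the following Python fragment shows a connected device reporting a reading to a collection service over the network. The endpoint URL, device id and payload fields are hypothetical assumptions for illustration only.

import json
import time
import urllib.request

ENDPOINT = "http://example.com/iot/readings"  # hypothetical collector

def publish(device_id: str, reading: dict) -> int:
    # POST one JSON-encoded reading; return the HTTP status code.
    body = json.dumps({"device": device_id, "ts": time.time(), **reading})
    req = urllib.request.Request(
        ENDPOINT,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Example: a thermostat reporting one temperature reading.
# publish("thermostat-42", {"temperature_c": 21.5})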
The IoT gives business new ways to instantly connect with customers, providing the opportunity to monitor and respond to customers in near real time. Individual products will no longer exist in a vacuum, and interaction among devices from multiple sources and vendors should be understood and taken into account. Products and bundles will be remotely reconfigured and repaired quickly, and customers may be provided with tools to do their own reconfiguration.
All of the new streams of data becoming available on the internet will
raise difficult privacy and moral issues that are only starting to be
addressed. Who would own the video streaming in from, say, Google Glass, and the health care related data streaming from other wearables? What
happens if autonomous devices malfunction or make data completely
public? These are challenges that will have to be faced as the IoT
develops.
Information Governance, Big Data and E-Discovery
As businesses begin to find value in the utilisation of IoT data the
consequences for e-discovery and Information Governance become even
more significant.
Steven O'Toole, in an article entitled "Demystifying Analytics in eDiscovery",4 makes reference to the importance of pre-discovery within the context of early case assessment. He suggests that the application of analytics at the earliest possible opportunity provides the most value to corporations and provides a competitive advantage.
I suggest that pre-discovery in fact occurs not at the Early Case
Assessment level but rather at the Information Governance level as
suggested by the EDRM model. Information Governance enables an
organisation to effectively use information for a number of purposes of
which e-discovery is but a part.
Therefore within this context of Information Governance different sets of
analytics capabilities are available:
a) Clustering is useful if little is known about the content because it puts content into natural groupings of conceptually related materials. A benefit of clustering is that it provides a fast map of the document landscape in an objective, consistent and concept-aware fashion. The reviewer can jump straight to a cluster that is of most interest and avoid spending time in clusters of no conceptual relevance. This plays a part in Information Governance as well as in e-discovery, in that one can weed out redundant, outdated and transient content quickly and reduce costs. (A minimal sketch of clustering and conceptual searching follows this list.)
b) Term expansion identifies conceptually related terms, customised to content and ranked in order of relevance. One term may provide a list of associated terms, meaning that conceptually related content can be found more quickly, again saving time and money. Within the context of Information Governance it identifies content related to matters such as corporate records, intellectual property and compliance.
c) Conceptual searching follows once a key document or paragraph has been identified and it becomes necessary to find similar ones. A key word search will return documents containing specific key words, depending upon the Boolean search string, and only as long as those key words are included in the resulting documents. Constructing Boolean search strings can be time consuming and may miss key documents containing unknown terms that are not included. It is at this point that a conceptual search comes into play. A conceptual search looks for matching patterns in the map of data called a conceptual space. The benefit is that a conceptual search can find similar results even if the matching key words are not present.
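As promised above, the following sketch shows one way the clustering and conceptual searching described in this list might be approximated, here using scikit-learn as an assumed tool (O'Toole's article does not prescribe an implementation): documents are embedded in a latent "conceptual space" via TF-IDF and truncated SVD, clustered for a fast map of the document landscape, and ranked by cosine similarity to a known key document. The four-document corpus is a toy example.

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "quarterly financial report and revenue figures",
    "revenue forecast for the next financial quarter",
    "employee handbook on leave policy",
    "holiday and sick leave policy for staff",
]

tfidf = TfidfVectorizer().fit_transform(docs)
space = TruncatedSVD(n_components=2).fit_transform(tfidf)  # conceptual space

labels = KMeans(n_clusters=2, n_init=10).fit_predict(space)  # a) clustering
print("cluster labels:", labels)

# c) conceptual search: rank all documents by similarity to document 0,
# even where they share few or no literal key words with it.
sims = cosine_similarity(space[0:1], space)[0]
print("similarity to doc 0:", sims.round(2))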
4 http://www.contentanalyst.com/html/whoweare/whitepapers/whitepaperdemystifying-analytics-in-ediscovery-2014.html (last accessed 27 March 2015)