
Big Data and Implications for Information Governance and E-Discovery

Introduction
This paper examines the phenomenon of Big Data and the contribution
that will be made to that data by the Internet of Things. It is suggested
that an understanding of Big Data will provide businesses and
organisations with significant opportunities to use informational datasets
in many aspects of their activities. A significant contributor to already
existing Big Data datasets will be the Internet of Things (IoT) allowing for
information gathering on a hitherto unprecedented scale. The proper
management of this data to make it meaningful and useful for an
organisation is the purpose of Information Governance. It is argued that
Information Governance is an essential business strategy that not only
enables a business to use data effectively but also lays a significant
preparatory foundation for compliance with e-discovery obligations in the
event of litigation. It is suggested that rather than being viewed as a
discrete process, e-discovery should be seen as a part of an overall
Information Governance strategy.
What is Big Data
The term Big Data was coined in the 1990s, but it was not until the enabling technologies became available that Big Data became a reality. In the past the phenomenon was confined largely to the research field, but Big Data analysis is now required in many of the commercial fields and services that reflect our modern society.
Big Data1 is a broad term for data sets so large or complex that traditional data processing applications are inadequate.2 Big Data is a set of techniques and technologies that require new forms of integration to uncover large hidden values from large datasets that are diverse, complex, and of a massive scale. Challenges posed by these datasets include analysis, capture, curation, searching, sharing, storage, transfer, visualization, and information privacy. Analysis of such data sets can find new correlations, spot business trends, prevent diseases, combat crime and so on.3

1 The term "Big Data", which spans computer science and statistics/econometrics, probably originated in lunch-table conversations at Silicon Graphics Inc. (SGI) in the mid-1990s, in which John Mashey figured prominently. The first significant academic references are arguably Weiss and Indurkhya (1998) in computer science and Diebold (2000) in statistics/econometrics. An unpublished 2001 research note by Douglas Laney at Gartner enriched the concept significantly. Hence the term "Big Data" appears reasonably attributed to Mashey, Indurkhya and Weiss, Diebold, and Laney. See Diebold, Francis X., On the Origin(s) and Development of the Term 'Big Data' (September 21, 2012). PIER Working Paper No. 12-037. Available at SSRN: http://ssrn.com/abstract=2152421 or http://dx.doi.org/10.2139/ssrn.2152421

2 The term often refers more to the use of predictive analytics or certain other advanced methods to extract value from data, rather than to a particular size of data set.

Data sets grow in size in part because they are increasingly being supplemented by cheap, numerous information-sensing mobile devices, aerial sensory technologies (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, 2.5 exabytes (2.5 x 10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own Big Data initiatives that straddle the entire organisation.

The "big" of Big Data is a relative term. Some organisations may have Gigabytes or Terabytes of data, in comparison to big global organisations that may have several Petabytes or Exabytes of data to handle. However, it is a reality of life in the digital paradigm that data is going to increase day by day, and its volume will depend upon the size of the organisation.
Characteristics of Big Data
In a 2001 research report and related lectures, META Group (now Gartner)
analyst Doug Laney defined data growth challenges and opportunities as
being three-dimensional, i.e. increasing volume (amount of data), velocity
(speed of data in and out), and variety (range of data types and sources).
Gartner, and now much of the industry, continue to use this "3Vs" model
for describing big data. In 2012, Gartner updated its definition as follows:
"Big data is high volume, high velocity, and/or high variety information
assets that require new forms of processing to enable enhanced decision
making, insight discovery and process optimization." Additionally, a new V - "Veracity" - is added by some organisations, together with Complexity, which describes an aspect of data management. These characteristics may be detailed as follows:
Volume - The quantity of data that is generated is very important in this context. It is the size or quantity of the data that determines its value and potential, and whether it can actually be considered Big Data at all. The name Big Data itself contains a term related to size, hence this characteristic.
Variety - The next aspect of Big Data is its variety: the category to which the data belongs is an essential fact that needs to be known by the data analysts. This helps those who closely analyse the data, and are associated with it, to use the data effectively to their advantage, emphasising the importance of variety as a characteristic of Big Data.

3 Scientists, for example, encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, and biological and environmental research.

Velocity - The term velocity in this context refers to the speed of generation of data, or how quickly the data is generated, gathered and processed to meet the demands and the challenges which lie ahead in the path of growth and development.
Variability - This is a factor which can be a problem for those who
analyse the data. This refers to the inconsistency which can be shown by
the data at times, thus hampering the process of being able to handle and
manage the data effectively.
Veracity - The quality of the data being captured can vary greatly.
Accuracy of analysis depends on the veracity or provenance of the source
data.
Complexity - Data management can become a very complex process, especially when large volumes of data come from multiple sources. The data need to be linked, connected and correlated in order to grasp the information they are supposed to convey.
The Internet of Things and Data Provenance
The aspects of Big Data detailed above must be taken into account in assessing the contribution of the Internet of Things (IoT) to the dataset and the implications that has for Big Data and Information Governance. Data can be obtained from a variety of sources or provenances which fall into three separate categories: (a) structured; (b) semi-structured; and (c) unstructured.
The rise in technology, the escalation of sensors, social media and
networking and smart devices means that data has become more
complex and has moved from primarily structured data to semi-structured
and unstructured data.
Structured data is data that is generally organised into a relational schema, for example a conventional database with rows and columns. The consistency of its configuration allows usable information to be retrieved in response to basic queries based on the parameters and operational needs of the enquirer.
Semi-structured data takes the form of structured data sets that do not have the fixed schema of a relational model. It has a variable schema that depends upon the generating source. Such data sets may include data obtained or inherited from hierarchies of records and fields, such as social media posts and web logs.
Unstructured data is the most scattered data and provides the greatest challenge for processing. For this type of data one of the general problems is assigning a proper index to the data for analysis and querying; examples include images, video and many other multimedia files.
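To make these three categories concrete, the following minimal Python sketch shows how the same kind of device reading might appear in structured, semi-structured and unstructured form. The table, JSON fields and free-text note are invented for illustration and are not drawn from any particular system.

```python
# Illustrative only: three small records showing how a reading from a device
# might arrive as structured, semi-structured or unstructured data.
import json
import sqlite3

# Structured: fixed rows and columns in a relational table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (device_id TEXT, metric TEXT, value REAL)")
db.execute("INSERT INTO readings VALUES ('thermostat-7', 'temp_c', 21.5)")
print(db.execute("SELECT * FROM readings WHERE metric = 'temp_c'").fetchall())

# Semi-structured: a schema that varies with the generating source (e.g. a web log).
event = json.loads('{"device": "thermostat-7", "payload": {"temp_c": 21.5, "battery": "ok"}}')
print(event["payload"].get("temp_c"))

# Unstructured: free text (or an image, audio clip, etc.) with no inherent schema;
# it must be parsed or indexed before it can be queried at all.
note = "Resident reports the lounge felt cold overnight despite the thermostat reading 21.5."
print("thermostat" in note.lower())
```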
To properly understand Big Data it is necessary to consider its movement
as a process that goes through five major stages and can be described as
pipelining. The first stage is data accession and recording. Big Data does
not emerge out of a vacuum; it is recorded from a source. In some respects our sensors are data gatherers, detecting such things as air quality and smell, the heart rate of a person, the number of steps that a person takes, weight at a particular time of day and so on. Modern technologies enable this data to be recorded in the same way that a telescope may generate a large amount of raw data.
Information pulling and filtering is the second stage of the process, and recognises that the raw data collected through accession and recording is not in a format ready for analysis; it cannot be analysed while left in its raw format. What is required is an information extraction process that can pull the required information from the source and express it in a structured form.
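As a rough illustration of this extraction and filtering stage, the Python sketch below parses hypothetical raw sensor log lines into structured records and silently drops lines that do not fit the expected pattern. The log format, field names and values are assumptions made for the example.

```python
# Minimal sketch of the extraction stage: pull structured fields out of raw,
# hypothetical sensor log lines so that downstream stages can query them.
import re
from datetime import datetime

raw_lines = [
    "2015-03-26T07:14:02 heart_rate user=42 bpm=71",
    "2015-03-26T07:15:02 steps user=42 count=134",
    "garbled line that the filter should discard",
]

PATTERN = re.compile(
    r"(?P<ts>\S+)\s+(?P<metric>\w+)\s+user=(?P<user>\d+)\s+\w+=(?P<value>\d+)"
)

def extract(lines):
    """Yield structured records, silently dropping lines that do not parse."""
    for line in lines:
        m = PATTERN.match(line)
        if not m:
            continue  # filtering: raw data is noisy and not all of it is usable
        yield {
            "timestamp": datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S"),
            "metric": m.group("metric"),
            "user_id": int(m.group("user")),
            "value": int(m.group("value")),
        }

for record in extract(raw_lines):
    print(record)
```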
The next stage in the pipeline is that of data integration, amalgamation and representation. Because of its very nature, its non-uniformity and its volume, it is not enough simply to record data and warehouse it. If we have a bundle of data sets in a warehouse it may be impossible to find, reuse and utilise that data unless it is associated with appropriate metadata. Data analysis is a great deal more challenging than merely locating, understanding, recognising and referencing data. For very large scale analysis these processes have to be approached from an automated perspective, which requires syntax and semantics expressed in forms that are machine readable and machine resolvable. The final outcome will be an integration of data sets.
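The following toy catalogue, with invented dataset names and fields, hints at why machine-readable metadata matters: once each warehoused data set carries a small schema description, an automated process can discover which sets share a common field and are therefore candidates for integration.

```python
# A toy catalogue illustrating why warehoused data sets need machine-readable
# metadata: without it, later integration and reuse is effectively impossible.
# Dataset names and fields here are hypothetical.
catalog = [
    {"name": "pos_transactions_2014", "source": "point-of-sale", "format": "csv",
     "schema": ["store_id", "sku", "timestamp", "amount"], "update_frequency": "daily"},
    {"name": "fitness_wearables_feed", "source": "iot-gateway", "format": "json",
     "schema": ["device_id", "metric", "timestamp", "value"], "update_frequency": "streaming"},
]

def datasets_with_field(catalog, field):
    """Machine-resolvable lookup: find every data set whose schema carries a field."""
    return [d["name"] for d in catalog if field in d["schema"]]

# An automated integration step can now discover which sets share a join key.
print(datasets_with_field(catalog, "timestamp"))
```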
The fourth phase in the pipeline is that of query processing, data modelling and analysis. The techniques for mining and retrieval of Big Data are fundamentally different from conventional statistical analysis of small samples. By its very nature Big Data is somewhat chaotic, noisy and only occasionally inter-related. Nevertheless it can be more valuable than small samples because, once it has been properly analysed, the statistical sample is larger and therefore more meaningful. Thus mining Big Data requires filtered, integrated, trustworthy and effectively accessible data, scalable algorithms, and environments suitable for performing the necessary Big Data computations. The effective mining of Big Data can improve the standards and trustworthiness of the data obtained as well as the conclusions drawn from it.
The final stage in the pipeline is that of data elucidation and interpretation. How do we present the results of Big Data analysis in a form that the user can easily understand? If effective interpretation of the data cannot be made then it is of limited value. One of the important aspects is the verification of results, which is difficult given the size of the data set.
This provides a significant challenge to the analysis of Big Data. In small data sets the nuance and richness of natural language provide in-depth information, which is difficult to achieve with machine algorithms because they cannot understand nuance. Thus, for data to be effectively processed by algorithms, it must be carefully structured. Indeed, a basic requirement of traditional data analysis systems is data with a well-defined structure.
Scalability also poses a problem. We are used to traditional input-output subsystems such as hard disk drives being used to store data. These have disadvantages such as slower random input-output performance, but they are now being replaced by solid state drives or other technologies such as phase change memory (PCM). All these newer technologies require new thinking about how to design storage subsystems for data processing, because they do not show the same large spread between random and sequential input-output performance as older hard disk drive technologies. A further difficulty is the new scenario posed by Cloud computing, in which the whole system takes the form of a distributed cluster.
Timeliness of analysis is another problem. Larger datasets are often by their nature more complex, and this complexity increases the time required for processing and analysis. In many situations the result of an analysis is required immediately. For example, a fraudulent credit card transaction should be flagged before the transaction is completed, to prevent the transfer of funds from taking place. A full analysis of a user's purchase history will not be feasible in real time, so the system needs to maintain a partial result about the user and the card from which an effective conclusion can quickly be drawn. The system should therefore be developed in a way that possesses this flexibility of computation.
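A minimal sketch of that idea of partial, incremental computation follows. It keeps a small running summary per card (a count and a running mean) so that a new transaction can be scored immediately, rather than re-reading the full purchase history; the threshold, amounts and card identifier are invented for illustration.

```python
# Sketch of "partial" real-time analysis: keep a small running summary per card
# so a transaction can be scored before it completes, instead of rescanning the
# user's full purchase history. Thresholds and fields are illustrative only.
from collections import defaultdict

class CardProfile:
    def __init__(self):
        self.count = 0
        self.mean_amount = 0.0

    def update(self, amount):
        # Incremental mean: no need to store or rescan historical transactions.
        self.count += 1
        self.mean_amount += (amount - self.mean_amount) / self.count

profiles = defaultdict(CardProfile)

def flag_transaction(card_id, amount, threshold=5.0):
    """Return True if the amount is far outside the card's running average."""
    profile = profiles[card_id]
    suspicious = profile.count >= 3 and amount > threshold * profile.mean_amount
    profile.update(amount)  # fold the new observation into the summary
    return suspicious

for amount in (20.0, 35.0, 25.0, 900.0):
    print(amount, flag_transaction("card-123", amount))
```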
Another issue surrounding Big Data is that of data privacy; indeed this could be a special subject all on its own. Big Data and privacy management raise questions in a number of disciplines, the sociological and the legal to give two examples. The level of privacy is a matter of concern. Location data may only be device specific, and possibly even user specific, without necessarily identifying the user, although it is important to note that concealing the location of a user, and the associated privacy implications, is more challenging than concealing identity. Finally, human collaboration is an important aspect of Big Data analytics, because the analysis system must support shared exploration of results and input from multiple human experts.

A number of observations can be made about Big Data, particularly within the context of the commercial environment. Data has swept into every industry and business function and is now an important factor of production alongside labour and capital. All economic sectors will have stored data which can be measured in hundreds of terabytes.
There are five broad ways in which using Big Data can create value. First, information may become transparent and usable at a higher frequency. Secondly, as organisations create and store more transactional data in digital form they can collect more accurate and detailed performance information on everything from product inventory to sick days, and expose variations in ability and performance. Thirdly, Big Data allows ever narrower segmentation of customers and therefore more precisely tailored products or services. Fourthly, sophisticated analytics can substantially improve decision making. Fifthly, Big Data can be used to improve the development of the next generation of products and services.
A further issue is that the use of Big Data will become a key basis for competition and growth for individual firms. This has significant implications in terms of e-discovery in litigation involving competition, access to confidential information and damages assessments.
The use of Big Data will underpin new waves of productivity growth and consumer surplus, and some sectors will in fact be set for greater gains. The computer and electronic products and information sectors, as well as finance and insurance and Government, are poised to gain substantially from the use of Big Data.
One of the problems of the exponential growth of data is that there may well be a shortage of the talent necessary for organisations to take advantage of Big Data. It is estimated that by 2018 the United States alone could face a shortage of 140,000 to 190,000 people with the deep analytical skills necessary, as well as of the 1.5 million managers and analysts with the know-how to use the analysis of Big Data to make effective decisions.
Finally, a number of issues will have to be addressed to capture the full potential of Big Data, not the least of which will be privacy, security, intellectual property and liability.
The Internet of Things
Once there is an understanding of the implications of Big Data, the impact
of the Internet of Things (IoT) as a data source may be considered.
The amount of data gathered will increase when every durable object is part of the Internet of Things. The introduction of IPv6 means that the allocation of IP addresses to devices in everyday use will become the norm rather than the exception. These devices will gather information about us and our interactions with the environment, and the volume of data discoverable in litigation as a result of the Internet of Things will significantly increase. Thus, the amount of data that may need to be considered in an e-discovery exercise will be significantly larger.
The range of devices connected to the internet of everything will expand significantly over the five-year period 2014-2019. As things stand at the moment, personal computers, smart phones, tablets and connected TVs comprise a reasonable source of data collection. Wearables and connected motor vehicles will increase in number, but the significant growth will be in other connected devices and items, which will dwarf the current number of connected devices. At the moment something in the vicinity of 11 billion devices are connected to the internet, of which a significant proportion (approximately 6 billion) are computers, smart phones, tablets, connected devices and wearables, as well as connected cars. The Internet of Things accounts for approximately another 5-6 billion devices. By 2019 it is estimated that there will be 35 billion devices connected to the internet of everything, of which 25 billion will be Internet of Things connections.
One of the important drivers behind the Internet of Things is how easy it has now become to wirelessly connect mobile items to the internet via WiFi, Bluetooth or proprietary wireless communications protocols. Smart Internet of Things devices include everything from structural health monitors for buildings to smart egg trays that know how many eggs one has and how old they are. Home automation devices include Google's Nest and two competing families of home and health IoT systems, Zigbee and Z-Wave. The Vessyl smart drinking cup monitors exactly what you are drinking, the HAPIfork tracks your eating habits and the Beam toothbrush reports on your tooth brushing history. Wearables range from the popular Fitbit athletic tracker to smart watches, smart clothes and biological embeddables, including pacemakers and glucose monitors.
Other devices that may be connected to a smart phone by means of an app include a gas tank app which lets the user know when it is time to refuel; a glucose monitor which sends test results to a secure server, providing instant feedback and coaching to patients as well as real-time clinical data for doctors, nurses and diabetes educators; and a smart washing machine designed to integrate smart eco-systems, water use, temperature control and the like. A smart piggy bank can track savings and help set financial goals from afar, a blood pressure monitor can send blood pressure readings to health specialists, and a smart weather station can enable proper energy use within the home. The Internet of Things will also enable remote access to home devices such as a slow cooker, a refrigerator or a security system, and to sporting equipment such as devices attached to tennis rackets, bicycles and golfing equipment which monitor data relating to sporting performance. Devices such as Bitponics give data on plants and the conditions surrounding them for better gardening, and clothing may monitor how the body behaves over time, including heart rate, blood pressure and breathing rate, all of them important in monitoring improvements in health.
Cars may not yet be autonomous but new models may have many
internet addressable capabilities including remote start, remote climate
control, location tracking as well as the currently latent ability to track
many driving habits. Every time there is a warning beep another item of
data is recorded.
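To give a feel for the kind of record such devices generate, the short Python sketch below builds a hypothetical glucose reading as a JSON document of the sort a connected monitor might upload to its vendor's server. The field names, device identifier and reading are invented for illustration; no real device format is implied.

```python
# A minimal, hypothetical sketch of the kind of record a connected device might
# report: a glucose reading serialised as JSON for upload to a secure server.
# The field names, units and device identifier are invented for illustration.
import json
from datetime import datetime, timezone

def build_reading(device_id: str, mmol_per_l: float) -> str:
    """Package a single sensor observation as a JSON document."""
    return json.dumps({
        "device_id": device_id,
        "metric": "blood_glucose",
        "unit": "mmol/L",
        "value": mmol_per_l,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    })

payload = build_reading("glucometer-0042", 5.6)
print(payload)
# In a real deployment this payload would be sent over an encrypted connection
# to the vendor's server, where it joins the organisation's data set and,
# potentially, the pool of material discoverable in litigation.
```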
The benefits of the Internet of Things, based solely on products that exist
today, let alone unimagined combinations of emerging capabilities, are
considerable. The smart phone has become the remote control for many
people's existence. Data is available at a finger swipe on everything
imaginable. Yet at the same time there are challenges and disruptions
ahead including technical issues, business issues, requirements for new
and evolving skill sets, legal and legislative difficulties and social
complexities.
There are a number of levels of business opportunities:
1) Basic components and devices that connect to the network via WiFi or Bluetooth.
2) Entirely new aggregated products and systems that combine these devices in new ways, such as home management systems.
3) Services providing customised solutions to business and consumers, including data analysis services to help make sense of the vast amount of Big Data generated by the IoT.
The IoT gives business new ways to connect instantly with customers, providing the opportunity to monitor and respond to them in near real time. Individual products will no longer exist in a vacuum, and interaction among devices from multiple sources and vendors should be understood and taken into account. Products and bundles will be remotely reconfigured and repaired quickly, and customers may be provided with tools to do their own reconfiguration.
All of the new streams of data becoming available on the internet will
raise difficult privacy and moral issues that are only starting to be
addressed. Who would own the video streaming in from say Google Glass
and the health care related data streaming from other wearables? What
happens if autonomous devices malfunction or make data completely
public? These are challenges that will have to be faced as the IoT
develops.
Information Governance, Big Data and E-Discovery
As businesses begin to find value in the utilisation of IoT data the
consequences for e-discovery and Information Governance become even
more significant.

If one looks at the broader aspects of Information Governance it becomes clear that, within the litigation context, e-discovery is itself a subset of a larger Information Governance process. Although clients may consider that data analytics is only relevant in the litigation environment, a proper Information Governance strategy is essential for any organisation faced with the flood of data that is coming in, particularly as a result of the IoT.
Although high performing data analytics can reduce the time and cost of preparing for a case, at the root of any e-discovery project is the ability initially to identify, collect, index and analyse Big Data. A proper Information Governance programme which recognises that e-discovery is an aspect of Information Governance will make the e-discovery process a lot easier and cheaper in the event of litigation. Thus it is important to develop a strategy for Information Governance, putting into place a comprehensive data management programme for compliance with regulations, statutes and best practices before litigation commences. It should involve the development of customised guidelines and procedures for the creation, storage and disposition of any and all types of data, along with email policies, litigation hold procedures and disaster recovery plans. One step may be the development of a data classification process and a data retention policy. From there it is possible to develop organisational management policies and procedures for ESI, including email policies, and to develop workflows to deal with the potential for large amounts of non-searchable data, including hard copy documents.
Security issues must be part of an Information Governance strategy which
will require current awareness of regulatory and legal data security
obligations so that a data security approach can be developed based on
repeatable and defensible best practices.
Although it may seem pessimistic to plan for a Court case, proper
Information Governance will not only assist the lawyers in effectively
carrying out an economical e-discovery process but will also assist overall
document management and information management within the
organisation. Early organisation of information into easily reviewable data sets, by way of keyword searching, concept searching, analytics or TAR, will provide benefits further down the track; furthermore, these processes should not be considered limited only to e-discovery but applicable to other forms of data analytics.
An organisation that is utilising Big Data sets, or data sets derived from the IoT, is necessarily going to have to engage in an analytics exercise in any event, so a proper Information Governance strategy is essential.

Steven O'Toole, in an article entitled "Demystifying Analytics in Ediscovery",4 makes reference to the importance of pre-discovery within the context of early case assessment. He suggests that the application of analytics at the earliest possible opportunity provides the most value to corporations and provides a competitive advantage.
I suggest that pre-discovery in fact occurs not at the Early Case
Assessment level but rather at the Information Governance level as
suggested by the EDRM model. Information Governance enables an
organisation to effectively use information for a number of purposes of
which e-discovery is but a part.
Therefore within this context of Information Governance different sets of
analytics capabilities are available:
a) Clustering is useful if little is known about the content because it puts content into natural groupings of conceptually related materials. A benefit of clustering is that it provides a fast map of the document landscape in an objective, consistent and concept-aware fashion. The reviewer can jump straight to a cluster that is of most interest and avoid spending time in clusters of no conceptual relevance. This plays a part in Information Governance as well as in e-discovery, in that one can weed out redundant, outdated and transient content quickly and reduce costs (a rough illustration appears in the sketch following this list).
b) Term expansion identifies conceptually related terms, customised to the content and ranked in order of relevance. One term may provide a list of associated terms, meaning that conceptually related content can be found more quickly, again saving time and money. Within the context of Information Governance it identifies content related to matters such as corporate records, intellectual property and compliance.
c) Conceptual searching follows once a key document or paragraph has been identified and it becomes necessary to find similar ones. A keyword search will return documents containing specific keywords, depending upon the Boolean search string, and only so long as those keywords appear in the resulting documents. Constructing Boolean search strings can be time consuming and may miss key documents containing unknown terms that were not included. It is at this point that a conceptual search comes into play. A conceptual search looks for matching patterns in a map of the data called a conceptual space. The benefit is that a conceptual search can find similar results even if the matching documents do not contain any of the same terms as the example text.

4 http://www.contentanalyst.com/html/whoweare/whitepapers/whitepaperdemystifying-analytics-in-ediscovery-2014.html (last accessed 27 March 2015)
d) Auto categorisation is an important aspect of TAR or predictive coding; in fact it is auto categorisation that makes predictive coding possible. Predictive coding is the application of machine learning to a body of documents that enables the programme to categorise them in any number of ways, such as privileged, responsive and non-responsive. Users can categorise documents into sub-categories, and auto categorisation uses the same conceptual space and sample document exemplars to find conceptually similar documents and label them as appropriate. The technology brings the most relevant documents to the forefront.
e) Email threading is a way of eliminating repetitive emails whilst keeping an email conversation intact. Email threading finds the subset of emails that include all of the previous replies, and the resulting thread reveals exactly who knew what and when, which may be critical in piecing together the course of events surrounding a matter.
f) Near duplicate identification is similar to email threading. Conceptual searching, clustering or categorisation may identify documents relevant to a case, but many could be various versions of the same document. Knowing that they are near duplicates of each other can save the time of reviewing each one. If it is important to know what changed from one version to the next, when, and by whom, difference highlighting shows these changes, once again saving time and cost. By the same token, de-duplication eliminates identical documents using MD5 hashing (both de-duplication and the clustering described above are illustrated in the sketch below).
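As a rough, non-authoritative sketch of two of the capabilities above, the Python example below first removes exact duplicates by MD5-hashing each document's text, then groups the survivors with a simple TF-IDF and k-means clustering, a crude stand-in for the proprietary conceptual-space tools the commercial products use. It assumes scikit-learn is available; the sample documents are invented.

```python
# A rough stand-in for two of the capabilities above, not the commercial tools
# the paper has in mind: (f) exact de-duplication by hashing document text with
# MD5, and (a) grouping documents by textual similarity with TF-IDF and k-means
# as a crude analogue of conceptual clustering. Sample documents are invented.
import hashlib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Invoice for server hardware purchase, Q3 budget approval attached.",
    "Please approve the Q3 budget for the new server hardware invoice.",
    "Minutes of the board meeting discussing the pending litigation hold.",
    "Litigation hold notice: preserve all email relating to the dispute.",
    "Minutes of the board meeting discussing the pending litigation hold.",  # exact duplicate
]

# (f) De-duplication: identical text yields an identical MD5 digest.
seen, unique_docs = set(), []
for doc in documents:
    digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique_docs.append(doc)

# (a) Clustering the remaining documents into broad topical groups.
vectors = TfidfVectorizer(stop_words="english").fit_transform(unique_docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
for label, doc in zip(labels, unique_docs):
    print(label, doc[:60])
```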
The importance of these tools is not limited to e-discovery but, as I have said, extends to overall Information Governance. Analytics is an aspect of what could be called pre-discovery, which organises a company's data long before it is needed in e-discovery. Applying text analytics to an organisation's electronic records proactively, through pre-discovery, means that data is already organised, reduced and ready to be presented if and when a matter arises.
The return on investment is obvious. Litigation costs are kept as low as
possible and the time that it takes to investigate or decide whether to
settle a case or proceed to court is reduced.
The cost benefits of organising and analysing content proactively are
significant and help to drive decision making and Information Governance
practices for compliance, risk mitigation, cost avoidance and just as
importantly for future planning and for the development of corporate
strategy based upon data returns.

Concerns about Big Data


Richards and King in "Three Paradoxes of Big Data"5 acknowledge that the proponents of Big Data tout the use of sophisticated analytics to mine large data sets for insights as a solution to many of society's problems. Big Data analytics may help us better conserve precious resources, track and cure lethal diseases, and make our lives safer and more efficient. At a personal level, smart phones and wearable sensors enable believers in the Quantified Self to measure their lives in order to improve sleep, lose weight and get fitter.
However, Richards and King sound a cautionary note and consider the potential of Big Data somewhat more critically. They highlight what they call three paradoxes in the current rhetoric about Big Data in order to move towards a better understanding of the Big Data picture. Big Data pervasively collects all manner of private information, yet the operations of Big Data itself are almost entirely shrouded in legal and commercial secrecy. This is referred to as the transparency paradox. Secondly, although supporters of Big Data talk in terms of significant outcomes, the rhetoric ignores the fact that Big Data seeks to identify at the expense of individual and collective identity. This is referred to as the identity paradox. Thirdly, the rhetoric of Big Data is characterised by its power to transform society, but Big Data has power effects of its own, which privilege large government and corporate entities at the expense of ordinary individuals. This is referred to as the power paradox.
Once the paradoxes of Big Data are properly recognised, showing some of its disadvantages alongside its potential, a better understanding of the Big Data revolution may eventuate, allowing us to create solutions that produce satisfactory outcomes.
Conclusion
In a world of dynamic data I suggest that trying to compartmentalise data
management policies is inefficient in terms of time and investment.
Information Governance should be seen as an overarching data strategy
addressing all aspects of data use within an organisation. Because data
can be used for so many differing purposes, structuring its organisation
can present benefits throughout the business. No one likes to think about litigation, but a proper Information Governance strategy and programme ensures that when proceedings arrive, the performance of e-discovery obligations becomes part of the overall Information Governance programme, resulting in savings in time, costs and lawyers' fees, while at the same time ensuring compliance with Court rules and the reasonableness and proportionality requirements that underpin common law e-discovery regimes.
5 (2013) Stanford Law Review Online 41
http://www.stanfordlawreview.org/online/privacy-and-big-data/threeparadoxes-big-data (last accessed 26 March 2015)