
A Seminar Presentation on

BIG DATA

Presented by:
Divyanshu Bhardwaj
Department of Computer Science
VIII Semester

What is Big Data?


In information technology, big data is a collection
of data sets so large and complex that it becomes
difficult to process using on-hand database
management tools or traditional data processing
applications.
The trend to larger data sets is due to the additional
information derivable from analysis of a single large
set of related data, allowing correlations to be found
to "spot business trends, determine quality of
research, prevent diseases, link legal citations,
combat crime, and determine real-time roadway
traffic conditions."

How big is Big Data?

[Infographic omitted. Image courtesy: Asigra Infotech]

Why is Big Data difficult to handle?


Big data is difficult to work with using most relational
database management systems, requiring instead "massively
parallel software running on tens, hundreds, or even
thousands of servers".
What is considered "big data" varies depending on the
capabilities of the organization managing the set, and on the
capabilities of the applications that are traditionally used to
process and analyze the data set in its domain.
For some organizations, facing hundreds of gigabytes of data
for the first time may trigger a need to reconsider data
management options. For others, it may take tens or
hundreds of terabytes before data size becomes a significant
consideration.

Why are they collecting all this data?


Target Marketing
To send you catalogs for
exactly the merchandise
you typically purchase.
To suggest medications
that precisely match your
medical history.
To send advertisements
on those channels just
for you!

Targeted Information
To know what you need
before you even know you
need it based on past
purchasing habits!
To notify you of your expiring
driver's license or credit cards,
or your last refill on a
prescription, etc.
To give you turn-by-turn
directions to a shelter in
case of emergency.

Examples of Big Data


Examples include Big Science, web logs, sensor
networks, social networks, social data (due to
the social data revolution), Internet text and
documents, Internet search indexing, call detail
records, astronomy, atmospheric science,
genomics, biogeochemical, biological, and other
complex and often interdisciplinary scientific
research, military surveillance, medical records,
photography archives, video archives, and large-scale e-commerce.

Big Science
The Large Hadron Collider experiments represent about 150
million sensors delivering data 40 million times per second.
There are nearly 600 million collisions per second.

As a result, even though the experiments work with less than 0.001% of the
sensor stream data, the data flow from all four LHC experiments represents a
25-petabyte annual rate.
If all sensor data were recorded at the LHC, the data flow
would be extremely hard to work with: it would exceed an annual
rate of 150 million petabytes, or nearly 500 exabytes per day,
before replication. To put the number in perspective, this is
equivalent to 500 quintillion (5×10²⁰) bytes per day, almost
200 times more than all the other data sources in the world combined.
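
As a rough sanity check of these figures, here is a back-of-the-envelope sketch (assuming decimal units, i.e. 1 exabyte = 1,000 petabytes = 10^18 bytes):

    # Back-of-the-envelope check of the quoted LHC data-flow figures
    # (decimal units assumed: 1 EB = 1,000 PB = 10**18 bytes).
    exabytes_per_day = 500
    petabytes_per_year = exabytes_per_day * 1_000 * 365
    bytes_per_day = exabytes_per_day * 10**18

    print(f"{petabytes_per_year:,} PB/year")  # ~182,500,000 PB/year, i.e. over 150 million
    print(f"{bytes_per_day:.0e} bytes/day")   # 5e+20, i.e. 5 x 10^20 bytes per day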

Government
In 2012, the Obama administration announced the Big
Data Research and Development Initiative, which
explored how big data could be used to address
important problems facing the government. The
initiative was composed of 84 different big data
programs spread across six departments.
Big data analysis played a large role in Barack Obama's
successful 2012 re-election campaign.
The NASA Center for Climate Simulation (NCCS) stores
32 petabytes of climate observations and simulations on
the Discover supercomputing cluster.

Private Sector
Amazon.com handles millions of back-end operations every day, as
well as queries from more than half a million third-party sellers.
The core technology that keeps Amazon running is Linux-based, and
as of 2005 Amazon had the world's three largest Linux databases,
with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.
Wal-Mart handles more than 1 million customer transactions every
hour, which are imported into databases estimated to contain more
than 2.5 petabytes (2,560 terabytes) of data, the equivalent of
167 times the information contained in all the books in the US
Library of Congress.
Facebook handles 50 billion photos from its user base.

Technologies for handling Big Data


CROWDSOURCING
Crowdsourcing is the practice of obtaining needed services,
ideas, or content by soliciting contributions from a large
group of people, and especially from an online community,
rather than from traditional employees or suppliers.
Crowdsourcing can involve dividing a tedious task into small
pieces of work that are distributed to the crowd.

The general concept is to combine the efforts of crowds of
volunteers or part-time workers, where each one contributes a
small portion that adds up to a relatively large or significant result.
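
A toy sketch (in Python, with invented labels) of that aggregation step, combining many small contributions into one result by majority vote:

    from collections import Counter

    # Hypothetical micro-task results: each worker labels only a few items.
    contributions = {
        "img_001": ["cat", "cat", "dog"],
        "img_002": ["dog", "dog"],
        "img_003": ["cat", "bird", "cat", "cat"],
    }

    # Combine the crowd's answers into one consensus label per item (majority vote).
    consensus = {item: Counter(labels).most_common(1)[0][0]
                 for item, labels in contributions.items()}
    print(consensus)  # {'img_001': 'cat', 'img_002': 'dog', 'img_003': 'cat'}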

Technologies for handling Big Data (contd.)
A/B TESTING
A/B testing or split testing is an experimental approach
that aims to identify changes which increase or maximize
an outcome of interest.
As the name implies, two versions (A and B) are
compared, which are identical except for one
variation that might impact a user's behavior.
Version A might be the currently used version, while
Version B is modified in some respect.
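
A minimal sketch of how such a test might be evaluated, here using a standard two-proportion z-test in Python; the visitor and conversion counts are made-up, purely illustrative numbers:

    from math import sqrt
    from statistics import NormalDist

    # Hypothetical results: visitors and conversions for each version (made-up numbers).
    visitors_a, conversions_a = 5000, 260   # version A: the currently used version
    visitors_b, conversions_b = 5000, 310   # version B: the modified version

    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b

    # Pooled two-proportion z-test: is the difference in conversion rate significant?
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    print(f"A: {p_a:.1%}  B: {p_b:.1%}  z = {z:.2f}  p-value = {p_value:.4f}")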

Technologies for handling Big Data (contd.)
DATA FUSION
Data fusion is the process of integrating multiple data and knowledge
sources representing the same real-world object into a consistent,
accurate, and useful representation.
Fusion of the data from two sources (dimensions #1 and #2) can yield
a classifier superior to any classifier based on dimension #1 or
dimension #2 alone.
Data fusion processes are often categorized as low, intermediate, or
high, depending on the processing stage at which fusion takes place.
Low-level data fusion combines several sources of raw data to produce
new raw data, the expectation being that the fused data is more
informative than the original inputs.
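
A toy illustration of the "fused classifier beats single-dimension classifiers" point, assuming scikit-learn and NumPy are available and using synthetic data in which the label depends on both dimensions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 1000
    dim1 = rng.normal(size=n)                # dimension #1 (e.g., one sensor)
    dim2 = rng.normal(size=n)                # dimension #2 (e.g., another sensor)
    label = ((dim1 + dim2) > 0).astype(int)  # the truth depends on BOTH dimensions

    for name, features in [("dimension #1 only", dim1.reshape(-1, 1)),
                           ("dimension #2 only", dim2.reshape(-1, 1)),
                           ("fused #1 + #2", np.column_stack([dim1, dim2]))]:
        accuracy = cross_val_score(LogisticRegression(), features, label, cv=5).mean()
        print(f"{name}: cross-validated accuracy = {accuracy:.2f}")

On this synthetic data, either dimension alone gives only modest accuracy, while the fused pair recovers the label almost perfectly.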

Apache Hadoop
Apache Hadoop is an open-source software
framework that supports data-intensive distributed
applications.
Hadoop implements a computational paradigm
where the application is divided into many small
fragments of work, each of which may be executed
or re-executed on any node in the cluster.
It enables applications to work with thousands of
computationally independent computers and petabytes
of data.
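
A minimal single-machine sketch (in plain Python, not Hadoop itself) of the map/reduce pattern that Hadoop distributes across a cluster: each fragment is processed independently, so it can be executed or re-executed on any node, and the partial results are then aggregated:

    # Single-machine sketch of the map/reduce word-count pattern.
    from collections import defaultdict
    from itertools import chain

    def map_fragment(fragment):
        # Emit (word, 1) pairs for one small fragment of the input.
        return [(word.lower(), 1) for word in fragment.split()]

    def reduce_pairs(pairs):
        # Sum the counts for each word across all fragments.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    documents = ["big data needs big tools", "hadoop processes big data"]
    mapped = chain.from_iterable(map_fragment(doc) for doc in documents)
    print(reduce_pairs(mapped))   # {'big': 3, 'data': 2, 'needs': 1, ...}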

Importance of Hadoop
Organizations are discovering that important
predictions can be made by sorting through and
analyzing Big Data.
However, since 80% of this data is "unstructured", it
must be formatted (or structured) in a way that makes
it suitable for data mining and subsequent analysis.
Hadoop is the core platform for structuring Big
Data, and solves the problem of making it useful for
analytics purposes.
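
A hypothetical illustration of that "structuring" step: turning free-form log lines (the log format here is invented) into tabular records that could then be loaded into Hadoop or a database for analysis:

    import re

    # Unstructured input: free-form log lines (invented format).
    raw_logs = [
        "2013-04-02 10:15:01 user=alice action=purchase amount=19.99",
        "2013-04-02 10:15:07 user=bob action=view",
    ]

    # Structured output: one dictionary (i.e. one table row) per line.
    pattern = re.compile(
        r"(?P<date>\S+) (?P<time>\S+) user=(?P<user>\S+) "
        r"action=(?P<action>\S+)(?: amount=(?P<amount>\S+))?"
    )
    records = [m.groupdict() for line in raw_logs if (m := pattern.match(line))]
    print(records)  # [{'date': '2013-04-02', 'user': 'alice', 'action': 'purchase', ...}, ...]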

Critiques of Big Data Paradigm


Even as companies invest eight- and nine-figure sums to
derive insight from information streaming in from
suppliers and customers, less than 40% of employees
have sufficiently mature processes and skills to do so.
It has been pointed out that decisions based on the
analysis of Big Data are inevitably "informed by the
world as it was in the past, or, at best, as it currently is."
Consumer privacy advocates are concerned about the
threat to privacy represented by increasing storage and
integration of personally identifiable information.

We swim in a sea of data, and the sea level is rising rapidly.
It is imperative that we either learn to swim or arrange a life jacket.

References
[1] http://en.wikipedia.org/wiki/Big_data
[2] http://www.zettaset.com/info-center/what-is-big-data-and-hadoop.php
[3] http://www.fastcodesign.com/1669551/how-companies-like-amazon-use-big-data-to-make-you-love-them
[4] http://en.wikipedia.org/wiki/Data-intensive_computing
[5] http://www.youtube.com/Big_Data_Analytics
