
UNIT 1

What is Data?
Data can be defined as a systematic record of a particular quantity. It is the different values of that
quantity represented together in a set. It is a collection of facts and figures to be used for a specific
purpose such as a survey or analysis. When arranged in an organized form, data can be called
information.
Types of Data
Data may be qualitative or quantitative.
 Qualitative Data: They represent some characteristics or attributes. They depict descriptions
that may be observed but cannot be computed or calculated. For example, data on attributes
such as intelligence, honesty, wisdom, cleanliness, and creativity, collected using the students
of your class as a sample, would be classified as qualitative. They are more exploratory than
conclusive in nature.
 Quantitative Data: These can be measured and not simply observed. They can be
numerically represented and calculations can be performed on them. For example, data on
the number of students playing different sports from your class gives an estimate of how
many of the total students play which sport. This information is numerical and can be
classified as quantitative.
Data Collection
Depending on the source, data can be classified as primary data or secondary data. Let us take a look at
them both.
Primary Data
These are the data that are collected for the first time by an investigator for a specific purpose.
Primary data are ‘pure’ in the sense that no statistical operations have been performed on them and
they are original. An example of primary data is the Census of India.
Secondary Data
These are data that are sourced from someplace that originally collected them. This means that
this kind of data has already been collected by some researchers or investigators in the past and is
available either in published or unpublished form. This information is 'impure' in the sense that statistical
operations may already have been performed on it. An example is the information available on
the Government of India's Department of Finance website or in other repositories,
books, journals, etc.
RDBMS (relational database management system)
A relational database management system (RDBMS) is a collection of programs and
capabilities that enable IT teams and others to create, update, administer and otherwise interact
with a relational database. Most commercial RDBMSes use Structured Query Language (SQL)
to access the database, although SQL was invented after the initial development of the relational
model and is not necessary for its use.
An RDBMS is a type of DBMS with a row-based table structure that connects related data
elements and includes functions that maintain the security, accuracy, integrity and consistency
of the data.
Functions of relational database management systems
Elements of the relational database management system that overarch the basic relational
database are so intrinsic to operations that it is hard to dissociate the two in practice.
The most basic RDBMS functions are related to create, read, update and delete operations,
collectively known as CRUD. They form the foundation of a well-organized system that
promotes consistent treatment of data.
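As a minimal illustration of CRUD, the sketch below uses Python's built-in sqlite3 module against a hypothetical employee table; the table name and columns are assumptions made for this example, not part of any particular RDBMS.

import sqlite3

# Connect to an in-memory relational database (hypothetical example schema)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")

# Create
con.execute("INSERT INTO employee (name, salary) VALUES (?, ?)", ("Asha", 52000.0))

# Read
for row in con.execute("SELECT id, name, salary FROM employee"):
    print(row)

# Update
con.execute("UPDATE employee SET salary = ? WHERE name = ?", (55000.0, "Asha"))

# Delete
con.execute("DELETE FROM employee WHERE name = ?", ("Asha",))

con.commit()
con.close()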
RDBMSes use complex algorithms that support multiple concurrent user access to the database,
while maintaining data integrity. Security management, which enforces policy-based access, is
yet another overlay service that the RDBMS provides for the basic database as it is used in
enterprise settings.
RDBMSes support the work of database administrators (DBAs) who must manage and monitor
database activity. Utilities help automate data loading and database backup. RDBMSes manage
log files that track system performance based on selected operational parameters. This enables
measurement of database usage, capacity and performance, particularly query performance.
RDBMSes provide graphical interfaces that help DBAs visualize database activity.
While not limited solely to the RDBMS, ACID compliance is an attribute of relational
technology that has proved important in enterprise computing. Standing
for atomicity, consistency, isolation and durability, these capabilities have particularly suited
RDBMSes for handling business transactions.
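To illustrate the atomicity part of ACID, the hedged sketch below (again using Python's sqlite3, with an invented account table) wraps two balance updates in a single transaction so that either both apply or neither does.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
con.execute("INSERT INTO account (id, balance) VALUES (1, 100.0), (2, 50.0)")
con.commit()

try:
    # Both updates form one transaction: a transfer of 30.0 from account 1 to account 2
    with con:
        con.execute("UPDATE account SET balance = balance - 30.0 WHERE id = 1")
        con.execute("UPDATE account SET balance = balance + 30.0 WHERE id = 2")
except sqlite3.Error:
    # If either statement fails, the 'with' block rolls back both updates, leaving balances unchanged
    print("transfer rolled back")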
Relational database management systems are central to key applications, such as banking
ledgers, travel reservation systems and online retailing. As RDBMSes have matured, they have
achieved increasingly higher levels of query optimization, and they have become key parts of
reporting, analytics and data warehousing applications for businesses as well. RDBMSes are
intrinsic to operations of a variety of enterprise applications and are at the center of most master
data management (MDM) systems.
What is Big Data?
Big Data is also data, but of enormous size. Big Data is a term used to describe a collection of data
that is huge in size and yet growing exponentially with time. In short, such data is so large and
complex that none of the traditional data management tools are able to store or process it
efficiently.
Types of Big Data
Structured: By structured data, we mean data that can be processed, stored, and retrieved in a
fixed format. It refers to highly organized information that can be readily and seamlessly stored
and accessed from a database by simple search engine algorithms. For instance, the employee
table in a company database is structured: the employee details, job positions, salaries, etc.
are all present in an organized manner.
Unstructured: Unstructured data refers to the data that lacks any specific form or structure
whatsoever. This makes it very difficult and time-consuming to process and analyze unstructured
data. Email is an example of unstructured data.
Semi-structured: Semi-structured data pertains to data containing both of the formats
mentioned above, that is, structured and unstructured data. To be precise, it refers to data that
has not been classified under a particular repository (database), yet contains vital
information or tags that segregate individual elements within the data.
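The contrast can be made concrete with a small, hypothetical Python sketch: a structured record with fixed fields, a free-text (unstructured) email body, and a semi-structured JSON document whose tags segregate individual elements. The values are invented for illustration.

import json

# Structured: fixed fields, fixed types, fits a relational table
employee_row = (101, "R. Sharma", "Analyst", 48000.0)

# Unstructured: free text with no predefined schema
email_body = "Hi team, attached is the draft report. Let me know your comments by Friday."

# Semi-structured: no fixed table layout, but tags/keys mark each element
record = json.loads('{"id": 101, "name": "R. Sharma", "skills": ["SQL", "Python"]}')
print(record["skills"])     # the tags make individual elements addressable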
Characteristics of Big Data
1) Variety
Variety of Big Data refers to structured, unstructured, and semi-structured data that is gathered
from multiple sources. While in the past data could only be collected from spreadsheets and
databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio,
social media posts, and much more.
Structured data example:
A Product table in a database is an example of structured data.
Product_id   Product_name   Product_price
1            Pen            $5.95
2            Paper          $8.95
Unstructured data example: the output returned by a Google search.
2) Velocity
Velocity essentially refers to the speed at which data is being created in real time. In a broader
perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and
activity bursts.
Example:
72 hours of video are uploaded to YouTube every minute; this is velocity.
Extremely high velocity of data is another major Big Data characteristic.
3) Volume
We already know that Big Data indicates huge ‘volumes’ of data that are being generated on a
daily basis from various sources such as social media platforms, business processes, machines,
networks, human interactions, etc. Such large amounts of data are stored in data warehouses.
Example:
Amazon handles 15 million customer clickstream records per day to recommend products.
An extremely large volume of data is a major characteristic of Big Data.
4) Variability
This refers to the inconsistency which can be shown by the data at times, thus hampering the
process of handling and managing the data effectively.
You can see that a few values are missing in the table below.
Department   Year   Minimum sales   Maximum sales
1            2010   ?               1500
2            2011   10000           ?
Data available can sometimes get messy and may be difficult to trust. With the wide variety of Big
Data types generated, quality and accuracy are difficult to control.
Example: A Twitter post has hashtags, typos and abbreviations.
Content Management System (CMS)
A content management system (CMS) is a software application or set of related programs that
are used to create and manage digital content. CMSes are typically used for enterprise content
management (ECM) and web content management (WCM). An ECM facilitates collaboration in
the workplace by integrating document management, digital asset management and records
retention functionalities, and providing end users with role-based access to the organization's
digital assets. A WCM facilitates collaborative authoring for websites. ECM software often
includes a WCM publishing functionality, but ECM webpages typically remain behind the
organization's firewall.
Both enterprise content management and web content management systems have two
components: a content management application (CMA) and a content delivery application
(CDA). The CMA is a graphical user interface (GUI) that allows the user to control the design,
creation, modification and removal of content from a website without needing to know anything
about HTML. The CDA component provides the back-end services that support management
and delivery of the content once it has been created in the CMA.
Features of CMSes
Features can vary amongst the various CMS offerings, but the core functions are often
considered to be indexing, search and retrieval, format management, revision control and
publishing.
 Intuitive indexing, search and retrieval features index all data for easy access through search
functions and allow users to search by attributes such as publication dates, keywords or
author.
 Format management facilitates turning scanned paper documents and legacy electronic
documents into HTML or PDF documents.
 Revision features allow content to be updated and edited after initial publication. Revision
control also tracks any changes made to files by individuals.
 Publishing functionality allows individuals to use a template or a set of templates approved
by the organization, as well as wizards and other tools to create or modify content.
A CMS may also provide tools for one-to-one marketing. One-to-one marketing is the ability of
a website to tailor its content and advertising to a user's specific characteristics using information
provided by the user or gathered by the site -- for instance, a particular user's page sequence
pattern. For example, if the user visited a search engine and searched for "digital camera," the
advertising banners would feature businesses that sell digital cameras instead of businesses that
sell garden products.
Big Data Analytics Lifecycle
Big Data analysis differs from traditional data analysis primarily due to the volume, velocity and
variety characteristics of the data being processed. To address the distinct requirements for
performing analysis on Big Data, a step-by-step methodology is needed to organize the activities
and tasks involved with acquiring, processing, analyzing and repurposing data. The upcoming
sections explore a specific data analytics lifecycle that organizes and manages the tasks and
activities associated with the analysis of Big Data. From a Big Data adoption and planning
perspective, it is important that in addition to the lifecycle, consideration be made for issues of
training, education, tooling and staffing of a data analytics team.
The Big Data analytics lifecycle can be divided into the following nine stages, as shown
in Figure 3.6:
1. Business Case Evaluation
2. Data Identification
3. Data Acquisition & Filtering
4. Data Extraction
5. Data Validation & Cleansing
6. Data Aggregation & Representation
7. Data Analysis
8. Data Visualization
9. Utilization of Analysis Results
Business Case Evaluation
Each Big Data analytics lifecycle must begin with a well-defined business case that presents a
clear understanding of the justification, motivation and goals of carrying out the analysis. The
Business Case Evaluation stage shown in Figure 3.7 requires that a business case be created,
assessed and approved prior to proceeding with the actual hands-on analysis tasks.
An evaluation of a Big Data analytics business case helps decision-makers understand the
business resources that will need to be utilized and which business challenges the analysis will
tackle. The further identification of Key performance indicators (KPIs) during this stage can help
determine assessment criteria and guidance for the evaluation of the analytic results. If KPIs are
not readily available, efforts should be made to make the goals of the analysis project SMART,
which stands for specific, measurable, attainable, relevant and timely.
Based on business requirements that are documented in the business case, it can be determined
whether the business problems being addressed are really Big Data problems. In order to qualify
as a Big Data problem, a business problem needs to be directly related to one or more of the Big
Data characteristics of volume, velocity, or variety.
Data Identification
The Data Identification stage shown in Figure 3.8 is dedicated to identifying the datasets
required for the analysis project and their sources.
Identifying a wider variety of data sources may increase the probability of finding hidden
patterns and correlations. For example, to provide insight, it can be beneficial to identify as many
types of related data sources as possible, especially when it is unclear exactly what to look for.
Depending on the business scope of the analysis project and nature of the business problems
being addressed, the required datasets and their sources can be internal and/or external to the
enterprise.
In the case of internal datasets, a list of available datasets from internal sources, such as data
marts and operational systems, is typically compiled and matched against a pre-defined dataset
specification.
In the case of external datasets, a list of possible third-party data providers, such as data markets
and publicly available datasets, is compiled. Some forms of external data may be embedded
within blogs or other types of content-based web sites, in which case they may need to be
harvested via automated tools.
Data Acquisition and Filtering
During the Data Acquisition and Filtering stage, shown in Figure 3.9, the data is gathered from
all of the data sources that were identified during the previous stage. The acquired data is then
subjected to automated filtering for the removal of corrupt data or data that has been deemed to
have no value to the analysis objectives.
Depending on the type of data source, data may come as a collection of files, such as data
purchased from a third-party data provider, or may require API integration, such as with Twitter.
In many cases, especially where external, unstructured data is concerned, some or most of the
acquired data may be irrelevant (noise) and can be discarded as part of the filtering process.
Data classified as “corrupt” can include records with missing or nonsensical values or invalid
data types. Data that is filtered out for one analysis may possibly be valuable for a different type
of analysis. Therefore, it is advisable to store a verbatim copy of the original dataset before
proceeding with the filtering. To minimize the required storage space, the verbatim copy can be
compressed.
As evidenced in Figure 3.10, metadata can be added via automation to data from both internal
and external data sources to improve the classification and querying. Examples of appended
metadata include dataset size and structure, source information, date and time of creation or
collection and language-specific information. It is vital that metadata be machine-readable and
passed forward along subsequent analysis stages. This helps maintain data provenance
throughout the Big Data analytics lifecycle, which helps to establish and preserve data accuracy
and quality.
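A hedged sketch of this stage is shown below: hypothetical acquired records are filtered for corrupt entries (missing or nonsensical values), a compressed verbatim copy of the original data is kept, and machine-readable metadata is attached. The field names, thresholds and source name are assumptions for illustration only.

import gzip, json
from datetime import datetime, timezone

acquired = [
    {"user_id": "u1", "age": 34, "country": "IN"},
    {"user_id": "u2", "age": -5, "country": "IN"},   # nonsensical value
    {"user_id": None, "age": 41, "country": "US"},   # missing value
]

# Keep a compressed verbatim copy of the original dataset before filtering
with gzip.open("acquired_verbatim.json.gz", "wt") as f:
    json.dump(acquired, f)

# Automated filtering: drop records with missing or nonsensical values
filtered = [r for r in acquired if r["user_id"] is not None and 0 <= r["age"] <= 120]

# Machine-readable metadata appended for provenance in later stages
metadata = {
    "source": "example third-party provider (hypothetical)",
    "acquired_at": datetime.now(timezone.utc).isoformat(),
    "records_acquired": len(acquired),
    "records_after_filtering": len(filtered),
}
print(metadata)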
Data Extraction
Some of the data identified as input for the analysis may arrive in a format incompatible with the
Big Data solution. The need to address disparate types of data is more likely with data from
external sources. The Data Extraction lifecycle stage, shown in Figure 3.11, is dedicated to
extracting disparate data and transforming it into a format that the underlying Big Data solution
can use for the purpose of the data analysis.
The extent of extraction and transformation required depends on the types of analytics and
capabilities of the Big Data solution. For example, extracting the required fields from delimited
textual data, such as with webserver log files, may not be necessary if the underlying Big Data
solution can already directly process those files.
Similarly, extracting text for text analytics, which requires scans of whole documents, is
simplified if the underlying Big Data solution can directly read the document in its native format.
Figure 3.12 illustrates the extraction of comments and a user ID embedded within an XML
document without the need for further transformation.
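As a rough illustration of the kind of extraction described above, the sketch below pulls a user ID and comment text out of a small, invented XML document using Python's standard library; the element names are assumptions, not the actual structure shown in the figure.

import xml.etree.ElementTree as ET

xml_doc = """
<post>
  <user id="u42"/>
  <comments>
    <comment>Great product, fast delivery.</comment>
    <comment>Battery life could be better.</comment>
  </comments>
</post>
"""

root = ET.fromstring(xml_doc)
user_id = root.find("user").get("id")                          # extract the embedded user ID
comments = [c.text for c in root.findall("comments/comment")]  # extract the comment text
print(user_id, comments)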

Data Validation and Cleansing


The Data Validation and Cleansing stage shown in Figure 3.14 is dedicated to establishing often
complex validation rules and removing any known invalid data.
Big Data solutions often receive redundant data across different datasets. This redundancy can be
exploited to explore interconnected datasets in order to assemble validation parameters and fill in
missing valid data.
For example, as illustrated in Figure 3.15 (see also the sketch following this list):
 The first value in Dataset B is validated against its corresponding value in Dataset A.
 The second value in Dataset B is not validated against its corresponding value in Dataset A.
 If a value is missing, it is inserted from Dataset A.
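A minimal pandas sketch of this idea, with invented values, is shown below: the first value in Dataset B matches Dataset A, the second fails validation against it, and the missing value is filled in from Dataset A.

import pandas as pd

# Hypothetical redundant datasets sharing the same IDs
dataset_a = pd.Series({"id1": 5.0, "id2": 7.0, "id3": 9.0})
dataset_b = pd.Series({"id1": 5.0, "id2": 7.5, "id3": None})

# Validate B against A (flag mismatches) and fill missing values from A
mismatches = dataset_b.notna() & (dataset_b != dataset_a)
dataset_b = dataset_b.fillna(dataset_a)

print(mismatches)   # id2 does not validate against Dataset A
print(dataset_b)    # id3 has been filled in from Dataset A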
Data Aggregation and Representation
Data may be spread across multiple datasets, requiring that datasets be joined together via
common fields, for example date or ID. In other cases, the same data fields may appear in
multiple datasets, such as date of birth. Either way, a method of data reconciliation is required or
the dataset representing the correct value needs to be determined.
The Data Aggregation and Representation stage, shown in Figure 3.17, is dedicated to
integrating multiple datasets together to arrive at a unified view.
Performing this stage can become complicated because of differences in:
 Data Structure – Although the data format may be the same, the data model may be
different.
 Semantics – A value that is labeled differently in two different datasets may mean the same
thing, for example “surname” and “last name.”
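A hedged pandas sketch of such integration is shown below: two invented datasets are joined on a common ID field after reconciling columns that are labeled differently but mean the same thing ("last name" vs. "surname").

import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "surname": ["Rao", "Iyer"]})
orders = pd.DataFrame({"id": [1, 2], "last name": ["Rao", "Iyer"], "amount": [250, 400]})

# Reconcile semantics: "last name" and "surname" mean the same thing
orders = orders.rename(columns={"last name": "surname"})

# Join the datasets on their common fields to arrive at a unified view
unified = pd.merge(customers, orders, on=["id", "surname"])
print(unified)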
Data Analysis
The Data Analysis stage shown in Figure 3.20 is dedicated to carrying out the actual analysis
task, which typically involves one or more types of analytics. This stage can be iterative in
nature, especially if the data analysis is exploratory, in which case analysis is repeated until the
appropriate pattern or correlation is uncovered.
Depending on the type of analytic result required, this stage can be as simple as querying a
dataset to compute an aggregation for comparison. On the other hand, it can be as challenging
as combining data mining and complex statistical analysis techniques to discover patterns and
anomalies or to generate a statistical or mathematical model to depict relationships between
variables.
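At the simple end of that spectrum, the analysis may be no more than an aggregation query; a tiny pandas sketch with made-up sales data:

import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "revenue": [120, 80, 200, 150],
})

# Query the dataset to compute an aggregation for comparison across regions
print(sales.groupby("region")["revenue"].sum())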
Data Visualization
The Data Visualization stage, shown in Figure 3.22, is dedicated to using data visualization
techniques and tools to graphically communicate the analysis results for effective interpretation
by business users.
Business users need to be able to understand the results in order to obtain value from the analysis
and subsequently have the ability to provide feedback, as indicated by the dashed line leading
from stage 8 back to stage 7.
The results of completing the Data Visualization stage provide users with the ability to perform
visual analysis, allowing for the discovery of answers to questions that users have not yet even
formulated.
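As a small, non-authoritative example of this stage, the sketch below turns an aggregated result into a bar chart with matplotlib; the figures plotted are invented.

import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
revenue = [200, 350, 150, 275]          # hypothetical aggregated analysis results

plt.bar(regions, revenue)
plt.title("Revenue by region")          # communicate the result graphically
plt.ylabel("Revenue (in thousands)")
plt.savefig("revenue_by_region.png")    # share the chart with business users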
Utilization of Analysis Results
The Utilization of Analysis Results stage, shown in Figure 3.23, is dedicated to determining how
and where processed analysis data can be further leveraged. Depending on the nature of the
analysis problems being addressed, it is possible for the analysis results to produce “models”
that encapsulate new insights and understandings about the nature of the patterns and
relationships that exist within the data that was analyzed. A model may look like a mathematical
equation or a set of rules. Models can be used to improve business process logic and application
system logic, and they can form the basis of a new system or software program.

UNIT 3

Phases of Data Analytics Lifecycle:

Phase 1—Discovery: In Phase 1, the team learns the business domain, including relevant history such as
whether the organization or business unit has attempted similar projects in the past from which it can learn.
The team assesses the resources available to support the project in terms of people, technology, time, and
data. Important activities in this phase include framing the business problem as an analytics challenge that
can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning
the data.

Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can
work with data and perform analytics for the duration of the project. The team needs to execute extract,
load, and transform (ELT) or extract, transform and load (ETL) to get data into the sandbox; ELT and ETL
are sometimes abbreviated together as ETLT. Data should be transformed in the ETLT process so the team
can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data
thoroughly and take steps to condition the data.

Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods, techniques,
and workflow it intends to follow for the subsequent model building phase. The team explores the data to
learn about the relationships between variables and subsequently selects key variables and the most
suitable models.

Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production
purposes. In addition, in this phase the team builds and executes models based on the work done in the
model planning phase. The team also considers whether its existing tools will suffice for running the
models, or if it will need a more robust environment for executing models and workflows (for example,
fast hardware and parallel processing, if applicable).

Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines
if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team
should identify key findings, quantify the business value, and develop a narrative to summarize and convey
findings to stakeholders.

Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a production
environment.
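The ETL step described in Phase 2 can be made concrete with a minimal sketch. The example below is an illustration under stated assumptions only: a hypothetical raw_transactions.csv source file and an SQLite database standing in for the analytic sandbox.

import sqlite3
import pandas as pd

# Extract: read raw data from a (hypothetical) source file
raw = pd.read_csv("raw_transactions.csv")

# Transform: condition the data so the team can work with it
raw = raw.dropna(subset=["amount"])
raw["amount"] = raw["amount"].astype(float)

# Load: place the conditioned data into the analytic sandbox
sandbox = sqlite3.connect("analytic_sandbox.db")
raw.to_sql("transactions", sandbox, if_exists="replace", index=False)
sandbox.close()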
Hadoop

Hadoop is an Apache open source framework written in Java that allows distributed processing of large
datasets across clusters of computers using simple programming models. A Hadoop framework
application works in an environment that provides distributed storage and computation across clusters
of computers. Hadoop is designed to scale up from a single server to thousands of machines, each offering
local computation and storage.

Core Hadoop Components

The Hadoop Ecosystem comprises 4 core components –

1) Hadoop Common-

The Apache Foundation provides a pre-defined set of utilities and libraries that can be used by other modules
within the Hadoop ecosystem. For example, if HBase and Hive want to access HDFS they need to make
use of Java archives (JAR files) that are stored in Hadoop Common.

2) Hadoop Distributed File System (HDFS) -

The default big data storage layer for Apache Hadoop is HDFS. HDFS is the “secret sauce” of the Apache
Hadoop components, as users can dump huge datasets into HDFS and the data will sit there nicely until
the user wants to leverage it for analysis. The HDFS component creates several replicas of each data block,
distributed across different nodes, for reliable and quick data access. HDFS comprises 3 important
components: NameNode, DataNode and Secondary NameNode. HDFS operates on a Master-Slave
architecture model where the NameNode acts as the master node, keeping track of the storage
cluster, and the DataNodes act as slave nodes running on the various machines within a Hadoop
cluster.

HDFS Use Case-

Nokia deals with more than 500 terabytes of unstructured data and close to 100 terabytes of structured
data. Nokia uses HDFS for storing all the structured and unstructured data sets as it allows processing of
the stored data at a petabyte scale.

3) MapReduce- Distributed Data Processing Framework of Apache Hadoop

MapReduce is a Java-based processing system, based on a programming model introduced by Google,
in which the actual data from the HDFS store gets processed efficiently. MapReduce breaks a big data
processing job down into smaller tasks. MapReduce is responsible for analysing large datasets in parallel
before reducing them to find the results. In the Hadoop ecosystem, Hadoop MapReduce is a framework
based on the YARN architecture. The YARN-based Hadoop architecture supports parallel processing of
huge data sets, and MapReduce provides the framework for easily writing applications that run on
thousands of nodes, taking fault and failure management into account.

The basic principle of operation behind MapReduce is that the “Map” job sends a query for processing
to various nodes in a Hadoop cluster and the “Reduce” job collects all the results to output into a single
value. The Map task in the Hadoop ecosystem takes input data and splits it into independent chunks, and
the output of this task becomes the input for the Reduce task. In the same Hadoop ecosystem, the Reduce
task combines the mapped data tuples into a smaller set of tuples. Meanwhile, both the input and output of
tasks are stored in the file system. MapReduce takes care of scheduling jobs, monitoring jobs and re-executing
failed tasks.
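The sketch below imitates this Map/Reduce principle in plain Python on a single machine. It is a toy model of the programming pattern, not Hadoop's actual Java API: the map step emits (word, 1) pairs, the pairs are shuffled into groups by key, and the reduce step combines each group into a single count.

from collections import defaultdict

documents = ["big data needs big storage", "map and reduce split big jobs"]

# Map: each input chunk is turned into independent (key, value) pairs
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group intermediate pairs by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each group of values into a single result
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # e.g. {'big': 3, 'data': 1, ...}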
The MapReduce framework forms the compute node while the HDFS file system forms the data node.
Typically, in the Hadoop ecosystem architecture, the data node and the compute node are considered to be
the same.

4) YARN

YARN forms an integral part of Hadoop 2.0. YARN is a great enabler for dynamic resource utilization on the
Hadoop framework, as users can run various Hadoop applications without having to worry about
increasing workloads.

Key Benefits of Hadoop 2.0 YARN Component-

 It offers improved cluster utilization

 Highly scalable

 Beyond Java

 Novel programming models and services

 Agility

YARN Use Case:

Yahoo has close to 40,000 nodes running Apache Hadoop with 500,000 MapReduce jobs per day, taking
230 compute-years extra for processing every day. YARN at Yahoo helped them increase the load on their
most heavily used Hadoop cluster to 125,000 jobs a day, compared to 80,000 jobs a day, which is
close to a 50% increase.

A Step by Step Guide for Predictive Modeling Using R

Stages of Predictive Modeling


Predictive modeling is the process of building a model to predict future
outcomes using statistical techniques. In order to generate the model,
historical data of prior occurrences needs to be analyzed, classified and
validated. Listed below are the stages of predictive modeling.

1. Data Gathering and Cleansing


Read the data from various sources and perform data cleansing
operations, such as identification of noisy data and removal of outliers to
make the prediction more accurate. Apply R packages to handle missing
data and impure values.
2. Data Analysis/Transformation
Before building a model, data needs to be transformed for efficient
processing by normalizing the data without losing its significance.
Normalization can be done by scaling the values to a particular
range. In addition, irrelevant attributes, which play the least significant
role in determining the outcomes, can be removed by performing a
correlation analysis.

3. Building a Predictive Model


Generate a decision tree or apply linear/logistic regression techniques to
build a predictive model. This involves choosing a classification
algorithm, identifying test data and generating classification rules.
Identify the confidence of the classification model by applying it against
test data.

4. Inferences
Perform a cluster analysis to segregate data groups. Use
these meaningful subsets of populations to make inferences.
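The stages above assume R packages; purely as an illustration, the sketch below walks the same flow in Python with scikit-learn on a synthetic dataset: gather and cleanse, normalize, build a logistic regression model, and check its confidence against held-out test data.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# 1. Data gathering (a synthetic stand-in for cleansed historical data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# 2. Transformation: normalize the values by scaling them to a common range
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# 3. Build the predictive model (logistic regression) on the training data
model = LogisticRegression().fit(X_train, y_train)

# 4. Check the confidence of the model by applying it against the test data
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))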

Type 1 and Type 2 Errors

Even though hypothesis tests are meant to be reliable, there are two types of errors
that can occur.

These errors are known as type 1 and type 2 errors.


Understanding Type 1 errors
Type 1 errors – often referred to as false positives – happen in hypothesis testing
when the null hypothesis is true but rejected. The null hypothesis is a general statement
or default position that there is no relationship between two measured phenomena.

Simply put, type 1 errors are “false positives” – they happen when the tester validates a
statistically significant difference even though there isn’t one.

Type 1 errors occur with a probability “α”, which corresponds to the level of confidence that you set.
A test with a 95% confidence level means that there is a 5% chance of getting a type 1
error.

Consequences of a type 1 Error


Type 1 errors can happen due to bad luck (the 5% chance has played against you) or
because you didn’t respect the test duration and sample size initially set for your
experiment.

Consequently, a type 1 error will bring in a false positive. This means that you will
wrongfully assume that your hypothesis testing has worked even though it hasn’t.

In real life situations, this could potentially mean losing possible sales due to a faulty
assumption caused by the test.



A real-life example of a type 1 error
Let’s say that you want to increase conversions on a banner displayed on your website.
For that to work out, you’ve planned on adding an image to see if it increases
conversions or not.

You start your A/B test running a control version (A) against your variation (B) that
contains the image. After 5 days, the variation (B) outperforms the control version by a
staggering 25% increase in conversions with an 85% level of confidence.

You stop the test and implement the image in your banner. However, after a month, you
notice that your month-to-month conversions have actually decreased.

That’s because you’ve encountered a type 1 error: your variation didn’t actually beat
your control version in the long run.

Understanding type 2 errors


If type 1 errors are commonly referred to as “false positives”, type 2 errors are referred
to as “false negatives”.

Type 2 errors happen when you inaccurately assume that no winner has been declared
between a control version and a variation although there actually is a winner.

In more statistically accurate terms, type 2 errors happen when the null hypothesis
is false and you subsequently fail to reject it.

If the probability of making a type 1 error is determined by “α”, the probability of a type 2
error is “β”. Beta depends on the power of the test (i.e., the probability of not committing a
type 2 error, which is equal to 1 − β).

There are 3 parameters that can affect the power of a test:

 Your sample size (n)


 The significance level of your test (α)
 The “true” value of your tested parameter
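These relationships can be explored with statsmodels; the hedged sketch below uses invented conversion rates. Given a baseline rate, an expected uplift, α and a target power of 1 − β, it solves for the sample size needed per variation.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, variation = 0.10, 0.125         # hypothetical conversion rates (10% vs 12.5%)
effect = proportion_effectsize(variation, baseline)

# Sample size per group for alpha = 0.05 (95% confidence) and power = 0.8 (beta = 0.2)
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8, ratio=1.0)
print(round(n), "visitors needed per variation")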

Consequences of a type 2 error


Similarly to type 1 errors, type 2 errors can lead to false assumptions and poor decision
making that can result in lost sales or decreased profits.

Moreover, getting a false negative (without realizing it) can discredit your conversion
optimization efforts even though you could have proven your hypothesis. This can be a
discouraging turn of events that could happen to all CRO experts and digital marketers.
A real-life example of a type 2 error
Let’s say that you run an e-commerce store that sells high-end, complicated hardware
for tech-savvy customers. In an attempt to increase conversions, you have the idea to
implement an FAQ below your product page.
