
Topic/Header :- What is Analysis & Analytics

Keypoints :- Analysis, Analytics, Qualitative Analytics & Quantitative Analytics

Notes: - Analysis is always done on data from the past to explain how and why: for
example, why the sales went down last year, or why a story ended the way it did in
the first place.

Analytics, on the other hand, is the study of the future and aims to predict. Instead
of explaining past events, it explores potential future ones. Here we look at data
from the past to recognise patterns. Analytics is essentially the application of
logical and computational reasoning to the component parts obtained in an
analysis; in doing this you are looking for patterns and exploring what you can do
with them.

Analytics branches off into two parts:-


Qualitative & Quantitative

Qualitative Analytics is intuition + analysis, while Quantitative Analytics is
formulas + algorithms.
Example :-
Say you are the owner of an online clothing store. You are ahead of the competition
and have a great understanding of your customers’ needs and wants.
You've performed a very detailed analysis of women's clothing articles and feel
sure about which fashion trends to follow. You may use this intuition to decide
which styles of clothing to start selling. This would be qualitative analytics. But you
might not know when to introduce the new collection. In that case, relying on past
sales data and user experience data, you could predict which month would be the
best to do that. This is an example of quantitative analytics.
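The quantitative side of the example above can be sketched in a few lines: given past monthly sales, pick the month with the strongest history as the launch candidate. The monthly figures below are hypothetical, invented purely for illustration.

```python
# A minimal sketch of quantitative analytics: choosing the best month to
# launch a new collection from past sales data (hypothetical figures).
past_sales = {
    "Jan": 120, "Feb": 135, "Mar": 180, "Apr": 210,
    "May": 160, "Jun": 140,
}

# Pick the month with the highest historical sales as the launch candidate.
best_month = max(past_sales, key=past_sales.get)
print(best_month)  # -> Apr
```

A real store would weigh more signals (user experience data, seasonality across years), but the principle is the same: the decision comes from past data, not intuition.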

We can also apply the qualitative & quantitative methodology to Analysis:
qualitative analysis to explain how and why, and quantitative analysis by working
on past data, for example to explain how sales numbers decreased last quarter.

Now the terms data analysis, data analytics, business analysis and business
analytics make better sense and can easily be differentiated.

Summary: - Analysis is the study of the past and analytics is the study of the future. Qualitative Analytics =
Intuition + Analysis & Quantitative Analytics = Formulas + Algorithms.

Qualitative & Quantitative methodology can be used with Analysis too to explain how and why.
Topic/Header :- Most Popular Disciplines in the Data Science Field and how they intertwine with each
other

Keypoints :- Data Science, Data Analytics, Business Intelligence, AI, ML

Notes: - To begin with, the Business Sphere is the first sphere and includes the
following terms:-
Business Case Study, Qualitative Analysis, Preliminary Data Report, Reporting
with Visuals, Creating Dashboards, Sales Forecasting.

Not all of these terms are data driven. Some business activities are DATA DRIVEN
while others are SUBJECTIVE or EXPERIENCE DRIVEN. Business Case Study &
Qualitative Analysis are not data driven; all the rest are.

Business Case Studies are real-world accounts of how companies succeed in their
businesses, and Qualitative Analytics is using your intuition, experience and
knowledge of the market to plan future business processes.

Data Science is a discipline reliant on data availability, while Business Analytics
does not completely rely on data. However, Data Science incorporates part of Data
Analytics, mostly the part that uses complex mathematical, statistical and
programming tools.

Business Intelligence is the process of analysing and reporting historical business
data and aims to explain past events using business data. BI is the preliminary step
of Predictive Analytics.

The Preliminary Data Report is the first step of any data analysis. Reporting with
Visuals and Creating Dashboards are part of Business Intelligence.

Machine Learning is the ability of machines to predict outcomes without being
explicitly programmed to do so. ML is essentially about creating and implementing
algorithms that let machines receive data and use this data to:
1. Make Predictions
2. Analyse Patterns
3. Give Recommendations

Artificial Intelligence is the process of simulating human knowledge and decision
making with computers.

Summary: - Understanding the difference between Business Analytics, Data Analytics, Data Science &
ML.
Topic/Header :- Connecting Data Science Disciplines

Keypoints :- Data, Big Data, 3Vs of Big Data, Steps involved in Data Science, Traditional Methods &
Machine Learning

Notes: - Data is information stored in digital format that can be used as a base for
performing analyses and decision making.

Data can be traditional data or big data. Traditional data is data in tabular format that
can be stored and managed from one computer. Big data, however, is extremely
large data that is not only stored on multiple computers or the cloud but can also be
structured, semi-structured & unstructured.

There are 3Vs of Big Data that should be known by all. These are:-
 Volume: - Big data needs lots of space; it is usually in terabytes and is
typically distributed between many computers.
 Variety: - Big data is not limited to text and numbers; it can also be
images, audio files, mobile data and others.
 Velocity: - It refers to the speed of big data. Since the volume & variety
are so large in the case of big data, extracting patterns from it as quickly as
possible is one of the major goals.

The first step of any data science process is gathering data. Once data is collected,
BI comes into the picture. BI includes all technology-driven tools involved in the
process of analysing, understanding and reporting on available past data.

All these are part of analysing the past. To predict the future, we use Data
Analytics. This can be done using Traditional Methods & Machine Learning.

Traditional methods are well suited for forecasting future performance with great
accuracy, using techniques like regression, clustering and factor analysis.

In Machine Learning, in contrast to traditional methods, the responsibility is left to
the machine. Through mathematics, a significant amount of computational power
& AI, the machine is given the ability to predict outcomes from data without being
explicitly programmed to do so.

Summary: - Data collection is the first step of the Data Science process. BI comes into the picture after this
step to make reports and analyse the data of the past. To make future predictions, Traditional Methods and
Machine Learning are used, depending on what outcome we are looking for.
Topic/Header :- Techniques for Working with Traditional Data

Keypoints :- Data Collection, Data Pre-processing, Data Labelling, Numerical Data vs Categorical Data,
Data Cleaning, Balancing & Shuffling, ER Diagrams, Relational Schemas

Notes: - The first step is data collection. Once that is done, we move to data pre-processing,
as the raw data that was collected or stored cannot be used directly for any analysis. In
pre-processing we do data labelling, where we separate numerical data from categorical
data.

Numerical data is data that is stored in numeric form and can be manipulated
mathematically, such as average goods sold per day, max goods sold in a month and so on.
Categorical data, on the other hand, cannot be manipulated mathematically, such as sex,
location, or job title.
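The numerical/categorical split above can be sketched in code. The column names and values below are hypothetical, and the type check is a deliberately simple stand-in for real data labelling.

```python
# A minimal sketch of data labelling: separating numerical columns from
# categorical ones so each can be treated appropriately (hypothetical data).
columns = {
    "goods_sold_per_day": [12, 15, 9],
    "sex": ["M", "F", "F"],
    "job_title": ["Analyst", "Manager", "Clerk"],
    "max_goods_in_month": [450, 380, 410],
}

# A column is treated as numerical if every value is an int or float.
numerical = {k: v for k, v in columns.items()
             if all(isinstance(x, (int, float)) for x in v)}
categorical = {k: v for k, v in columns.items() if k not in numerical}

print(sorted(numerical))    # -> ['goods_sold_per_day', 'max_goods_in_month']
print(sorted(categorical))  # -> ['job_title', 'sex']
```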
Once this is done we move to the next step, which is data cleaning. The goal of data
cleaning is to deal with inconsistent data, such as an age of 324 or a city name entered as a
last name & so on.

Dealing with missing values is also a part of data cleaning, and there are techniques to
handle missing data.

There are certain approaches that are case specific when working with traditional data, like
balancing & shuffling. Balancing is the process of making sure we have data in sufficient
balance to make relevant predictions or draw conclusions. For example, to predict the
shopping patterns of men & women we cannot use data where 80% of reviews are from
women & 20% are from men.

Shuffling is the process of making sure that we have a random sample of data, so that
predictions are not biased in any fashion. This is also important because it makes the
sample representative of the master set, or the universe of data.
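Balancing and shuffling can be sketched together: downsample the majority group to the size of the minority group, then randomise the order. The review records below are hypothetical placeholders.

```python
import random

# A minimal sketch of balancing and shuffling on hypothetical review records
# tagged by sex. A fixed seed keeps the sketch reproducible.
random.seed(42)

women = [("W", i) for i in range(80)]  # 80% of reviews
men = [("M", i) for i in range(20)]    # 20% of reviews

# Balance: downsample the majority group to the size of the minority group.
n = min(len(women), len(men))
balanced = random.sample(women, n) + random.sample(men, n)

# Shuffle: randomise the working order so neither group clusters together.
random.shuffle(balanced)

print(len(balanced))  # -> 40
```

Downsampling is only one balancing strategy; upsampling the minority group is another common choice when data is scarce.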

ER Diagrams & Relational Schemas are the traditional ways to represent this kind of data
pictorially. An ER diagram, or Entity-Relationship diagram, uses three different shapes to
represent entities, relationships & attributes:
 Boxes represent entities.
 Diamonds represent relationships.
 Circles (ovals) represent attributes.

Relational Schemas are tables and show the relationships between them, i.e. how one table
is related to the others.

Summary: - In order to work with traditional data, data collection, data cleaning and data pre-processing
are the initial and important stages. Labelling data as numerical & categorical, dealing with missing data,
and balancing & shuffling are important steps. Balancing & shuffling both ensure that the data is not biased
in any way and does not give biased insights. ER diagrams & relational schemas represent the data pictorially.
Topic/Header :- Techniques for Working with Big Data

Keypoints :- Big Data, Data Masking

Notes: - Big data techniques definitely have their differences from traditional ones,
although some of the approaches used on traditional data can also be implemented
on big data: collecting and pre-processing big data is essential to help organize the
data before doing analyses or making predictions, as is grouping the data into
classes or categories. While working with big data, things can get a little more
complex. Big data has much more variety than simple numerical & categorical data.
Examples of big data are text data, digital image data, digital video data, digital
audio data and more.
Consequently, with a larger number of data types comes a wider range of data
cleansing methods.
Dealing with missing data is important here, as the volume is huge and dropping
the missing data is not a preferred way to handle it.
Data masking is also an important aspect here, as maintaining the confidentiality of
data is of the utmost importance. The motive is to make sure confidentiality is
maintained while analysis can still be performed.
Data masking is a complex process: it conceals the original data with random and
false data in such a way that analysis can still be performed, while keeping all the
confidential information in a secure place.
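One simple masking idea can be sketched as follows: replace each confidential value with a stable pseudonym, so grouping and per-customer analysis still work while the real names never appear. The records and the `mask` helper are hypothetical; real masking schemes (tokenisation, format-preserving encryption) are more involved.

```python
import hashlib

# A minimal sketch of data masking: conceal a confidential field with a
# consistent fake token so analysis (counting, grouping) still works.
def mask(value: str) -> str:
    # The same input always maps to the same token, so joins and
    # group-bys on the masked column remain possible.
    return "user_" + hashlib.sha256(value.encode()).hexdigest()[:8]

records = [{"name": "Alice", "spend": 120},
           {"name": "Bob", "spend": 80},
           {"name": "Alice", "spend": 40}]
masked = [{"name": mask(r["name"]), "spend": r["spend"]} for r in records]

# Both Alice rows receive the same token, so per-customer totals still work.
print(masked[0]["name"] == masked[2]["name"])  # -> True
```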
Examples of Big Data: - Facebook Data, Stock Trading Data.

Summary: - Brief update on Big Data and techniques used to process Big Data.
Topic/Header :- Business Intelligence Techniques

Keypoints :- Metric, KPIs, Measure, Reports, Dashboards, What BI includes

Notes: - Once the data is gathered, cleaned & pre-processed, Business Intelligence
comes in. Here the analyst uses data skills, business knowledge & intuition to
explain past data. In order to perform analysis, we start by collecting observations.
However, no mathematical manipulations can be done on observations; in order to
do that, we quantify them. Quantification is the process of representing
observations as numbers.

A measure is the accumulation of observations to show some information. It is
related to simple descriptive statistics of past performance.

BI is the process of extracting information and presenting it in the form of:-

 Metrics
 KPIs
 Reports
 Dashboards

A Metric is Measure + Business Meaning. Example: - Let’s say that $350 was the
revenue for the 1st quarter, with 50 customers; these two are measures. Average
revenue per customer for the 1st quarter would be $350/50, that is $7. This is a
metric. From measures, thousands of metrics can be derived, and it would be
difficult to keep track of them all, or easy to end up working on a metric that is not
useful or business relevant.

KPIs (Key Performance Indicators) are Metrics + Business Objectives. KEY because
they are related to the main business goals, PERFORMANCE because they show how
successfully you have performed within a specific timeframe & INDICATORS
because they are metrics that indicate something related to your business
performance. KPIs are metrics that are tightly aligned with the business
objectives.

Some real-world examples of BI are:-

 Price Optimisation - Hotels do this to improve their revenue by increasing
prices when demand is high and reducing prices when demand is low, to
increase profit.
 Inventory Management - Supply enough stock to meet demand with minimal
waste. Analysing inventory data and sales data over a period of time can
solve this question easily and maximize revenue.

Summary: - BI answers questions that are relevant to making business decisions by creating KPIs, Metrics,
Reports and Dashboards. The main objective here is to explain what happened, using past data and Key
Performance Indicators.
Topic/Header :- Techniques For Working with Traditional Methods

Keypoints :- Linear Regression, Logistic Regression, Factor Analysis, Clustering

Notes: - Linear Regression, Logistic Regression, Clustering and Factor Analysis are
some of the most common, popular and frequently used techniques when it comes
to traditional data.

Linear regression is a linear approach to modelling the relationship between a
scalar response & one or more explanatory variables.
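A simple linear regression with one explanatory variable can be sketched in pure Python using the least-squares closed form; the data points below are hypothetical.

```python
# A minimal sketch of simple linear regression: fit y = a + b*x by the
# least-squares closed form (hypothetical data points).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
a = mean_y - b * mean_x  # intercept

print(round(b, 2), round(a, 2))  # -> 1.99 0.09
```

The fitted line can then be used to predict the response for new values of the explanatory variable.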

Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-
level independent variables.

Factor Analysis is a process in which the values of observed data are expressed as
functions of a number of possible causes in order to find which are the most
important. It aims at reducing dimensionality.

Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).

Clustering is grouping observations together, while factor analysis is grouping
explanatory variables together.
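Clustering can be sketched with a tiny k-means on one-dimensional data: repeatedly assign each point to its nearest centre, then move each centre to the mean of its group. The points and starting centres are hypothetical.

```python
# A minimal sketch of clustering: two-cluster k-means on 1-D data in
# pure Python (hypothetical points and starting centres).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centres = [0.0, 10.0]  # initial guesses

for _ in range(10):  # a few refinement passes are enough here
    # Assignment step: put each point in the group of its nearest centre.
    groups = [[], []]
    for p in points:
        nearest = 0 if abs(p - centres[0]) <= abs(p - centres[1]) else 1
        groups[nearest].append(p)
    # Update step: move each centre to the mean of its assigned points.
    centres = [sum(g) / len(g) for g in groups]

print([round(c, 2) for c in centres])  # -> [1.0, 8.07]
```

The two centres settle on the two natural groups in the data; real k-means implementations add multiple restarts and convergence checks.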

Time Series Analysis is the analysis of data over time. Time is the independent
variable and will always be on the X-axis. It is mainly used for economic & financial data.

Summary: - Linear Regression, Logistic Regression, Clustering & Time Series Analysis are some of
the common techniques used when dealing with traditional data. For labelled data, regression
analysis is the best technique to go with, while for unlabelled data, clustering is the best method.
Topic/Header :- Types of Machine Learning

Keypoints :- ML, Supervised Learning, Unsupervised Learning, Reinforcement Learning

Notes: -

ML or Machine Learning is the scientific study of algorithms and statistical models
that computer systems use to effectively perform a specific task without using
explicit instructions, relying on patterns and inference instead.

There are mainly 3 types of Machine Learning:-

1) Supervised Learning - dealing with labelled data; there is supervision.
2) Unsupervised Learning - dealing with unlabelled data; no supervision.
3) Reinforcement Learning - a reward system: when the work is done better
than before, a reward is given, and no reward if there is no improvement.
Here the machine tries to figure out what it should do better to get the
reward.
In reinforcement learning, a reward system is used to improve the machine
learning model at hand. The idea of using this reward system is to maximize the
objective function.

Deep Learning can likewise be divided into supervised, unsupervised &
reinforcement learning, and can be applied to problems in all three areas.

Summary: - Explanation of ML techniques: Supervised, Unsupervised & Reinforcement Learning, and
Factor Analysis. Examples of regression, factor analysis and clustering, and an explanation of the process.
Topic/Header :- Population and Sample & Classification of Data, Levels of Measurement

Keypoints :- Notes: - A population is the collection of all items of interest and is generally
represented by “N”. The numbers we obtain when using a population are called
parameters. A sample, on the other hand, is a subset of the population and is denoted by
“n”. The numbers we obtain when using a sample are called statistics.

Populations are hard to define and hard to observe and work with, while a sample is
much easier to gather and work with, and costs less time and money; hence working
with samples is a common practice and is widely used throughout.

Sample has two defining characteristics :- Randomness & Representativeness

A Random Sample is collected when each member of the sample is chosen from the
population strictly by chance.

A Representative Sample is a subset of the population that accurately reflects the
members of the entire population.

Data can be classified in two main ways:-

1) Based on its type &
2) Based on its measurement level

Based on type, data can be divided into categorical & numerical.
Categorical data describes categories or groups, like {Yes, No}, {Male, Female}
etc. Numerical data describes numerical values and can be further classified into
two types: discrete & continuous. Discrete data represents numerical data that takes
countable values, such as integers like 0, 1, 2999, 3478 etc., while continuous data
represents numerical data that can take infinitely many values, such as 62.09834940,
12.08090, 1.00000029213 etc.

Measurement levels can be classified into two categories:-

Quantitative & Qualitative

Qualitative data can be NOMINAL or ORDINAL. Nominal data consists of
categories with no order, like categories of cars such as BMW, AUDI and FORD.
Ordinal data, on the other hand, has a definite order, for example a rating of food
from Bad to Very Good.

Quantitative data can be classified into INTERVALS & RATIOS. A ratio has a
TRUE 0, while an interval does not have a true 0. Consider the temperature example:
0 Deg Celsius and 0 Deg Fahrenheit are not actual zeros, so these scales are intervals.
0 Deg Kelvin, on the other hand, is a true 0 (it equals -273.15 Deg Celsius) and hence
is a ratio.

Summary: - The population is the whole universe, every possible entry in that universe, while a sample is a
subset of the population that is representative of the population and is unbiased. Data classification &
measurement-level classification.
Topic/Header :- Difference between Key Terms/Types of Variables

Keypoints :- Notes: - A categorical variable, also called a nominal variable, is for mutually
exclusive, but not ordered, categories. For example, your study might compare five
different genotypes. You can code the five genotypes with numbers if you want, but
the order is arbitrary and any calculations (for example, computing an average)
would be meaningless.

An ordinal variable is one where the order matters but not the difference between
values. For example, you might ask patients to express the amount of pain they are
feeling on a scale of 1 to 10. A score of 7 means more pain than a score of 5, and that
is more than a score of 3. But the difference between the 7 and the 5 may not be the
same as that between 5 and 3. The values simply express an order.

An interval variable is a measurement where the difference between two values is
meaningful. The difference between a temperature of 100 degrees and 90 degrees is
the same difference as between 90 degrees and 80 degrees.

A ratio variable, has all the properties of an interval variable, and also has a clear
definition of 0.0. When the variable equals 0.0, there is none of that variable.
Variables like height, weight, enzyme activity are ratio variables. Temperature,
expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those
scales does not mean 'no heat'. However, temperature in Kelvin is a ratio variable,
as 0.0 Kelvin really does mean 'no heat'.

When dealing with numerical variables, in order to create a frequency distribution
table, the following formula is used for the interval (bin) width:

(Largest Number - Smallest Number) / Number of Desired Intervals

For example, if the number of desired intervals is 5, the smallest number is 1 & the
largest number is 100, then by the formula the interval bins should be of size
(100-1)/5, which is 19.8 and rounds up to 20.
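The bin-width formula above, with the numbers from the example, reads as:

```python
import math

# Interval width for a frequency distribution table:
# (largest - smallest) / number of desired intervals, rounded up.
smallest, largest, n_intervals = 1, 100, 5

width = (largest - smallest) / n_intervals  # (100 - 1) / 5 = 19.8
width = math.ceil(width)                    # round up to a convenient width

print(width)  # -> 20
```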

Summary: -
Topic/Header :- Measures of Central Tendency, Measures of Asymmetry & Measures of Variability

Keypoints :- Notes: - Mean, Median & Mode are the three most common and popular measures
of central tendency.

The mean is calculated by totalling up all the values and then dividing by the
number of values:

(x1 + x2 + x3 + ... + xn) / n

Median is the value or number at (n+1)/2 position in an ordered list.

The mode is the number that occurs most frequently in the dataset.
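All three measures of central tendency are available in Python's standard `statistics` module; the small dataset below is hypothetical.

```python
import statistics

# Mean, median and mode of a small hypothetical dataset.
data = [2, 3, 3, 5, 7, 10]

print(statistics.mean(data))    # -> 5    (sum 30 / 6 values)
print(statistics.median(data))  # -> 4.0  (average of middle values 3 and 5)
print(statistics.mode(data))    # -> 3    (appears twice)
```

Note that for an even number of values the median is the average of the two middle values, as the (n+1)/2 position falls between them.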

The most common methods used to measure asymmetry are skewness &
kurtosis. Skewness indicates whether the data is concentrated on one side or not.
The dataset is positively skewed if Mean > Median; here the outliers are to the
right, and such a dataset is also called right skewed.

The dataset has zero skew if Mean = Median = Mode. Here outliers can be on
both sides, left as well as right. This is also called no skew, and the distribution
of the dataset is symmetrical.

The last case is negative skew, where Mean < Median. The outliers are to the left
side, and such a dataset is also called left skewed.
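The mean-versus-median rule above gives a quick way to check the direction of skew; the dataset below, with one large outlier, is hypothetical.

```python
import statistics

# A minimal skew-direction check using the mean-vs-median rule.
data = [1, 2, 2, 3, 3, 4, 50]  # the outlier 50 drags the mean to the right

mean, median = statistics.mean(data), statistics.median(data)
if mean > median:
    skew = "right (positive)"
elif mean < median:
    skew = "left (negative)"
else:
    skew = "none (symmetric)"

print(skew)  # -> right (positive)
```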

Variance, Standard Deviation & Coefficient of Variation are the commonly used
techniques for Measures of Variability

Variance (S²) = average squared deviation of values from the mean

Standard deviation (S) = square root of the variance

The coefficient of variation is the relative standard deviation and is calculated by
dividing the standard deviation by the mean.

Comparing the SDs of two different datasets is meaningless, but comparing their
coefficients of variation is not; it is very handy and useful for comparing two
different datasets.
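The three measures of variability can be computed with the standard `statistics` module (which uses the sample formulas); the dataset is hypothetical.

```python
import statistics

# Variance, standard deviation and coefficient of variation
# on a small hypothetical dataset (sample formulas).
data = [4, 8, 6, 5, 3, 7]

variance = statistics.variance(data)  # average squared deviation from mean
sd = statistics.stdev(data)           # square root of the variance
cv = sd / statistics.mean(data)       # relative standard deviation

print(round(variance, 2), round(cv, 2))  # -> 3.5 0.34
```

Because the coefficient of variation is unit-free, it allows a fair spread comparison between datasets measured on different scales.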

Summary: -
Topic/Header :- Covariance and Correlation

Keypoints :- Notes: -
Covariance and Correlation are two mathematical concepts which are
commonly used in the field of probability and statistics. Both concepts
describe the relationship between two variables.
Covariance –
1. It is the relationship between a pair of random variables where a change
in one variable goes together with a change in the other variable.
2. It can take any value between -infinity and +infinity, where a negative
value represents a negative relationship and a positive value
represents a positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of the relationship between variables.
Formula –
For a population:

cov(x, y) = Σ (xi - μx)(yi - μy) / n

For a sample:

cov(x, y) = Σ (xi - x')(yi - y') / (n - 1)

Here,
x' and y' = means of the given sample sets
n = total number of samples
xi and yi = individual samples of the sets
Example –
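As an example, the sample covariance formula above can be evaluated directly on a small hypothetical pair of datasets:

```python
# Sample covariance: sum of products of deviations from each mean,
# divided by n - 1 (hypothetical paired values).
xs = [2, 4, 6, 8]
ys = [1, 3, 5, 7]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
print(cov)  # positive: x and y move together (20/3 ≈ 6.67)
```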

Summary: -
Topic/Header :- Covariance and Correlation

Keypoints :- Notes: -
Correlation –
1. It shows whether and how strongly pairs of variables are related to each
other.
2. Correlation takes values between -1 and +1, where values close to +1
represent a strong positive correlation and values close to -1 represent
a strong negative correlation.
3. It is a standardised, dimensionless measure (a scaled version of covariance).
4. It gives the direction and strength of the relationship between variables.
Formula –

r(x, y) = Σ (xi - x')(yi - y') / sqrt( Σ (xi - x')² · Σ (yi - y')² )

Here,
x' and y' = means of the given sample sets
n = total number of samples
xi and yi = individual samples of the sets
Example –
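As an example, correlation can be computed as the covariance scaled by the product of the standard deviations, using the same hypothetical paired values as before:

```python
import statistics

# Correlation = covariance / (stdev of x * stdev of y), giving a
# value between -1 and +1 (hypothetical paired values; here y = x - 1,
# a perfect positive linear relationship).
xs = [2, 4, 6, 8]
ys = [1, 3, 5, 7]

n = len(xs)
mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

r = cov / (statistics.stdev(xs) * statistics.stdev(ys))
print(round(r, 6))  # -> 1.0
```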

Summary: -
Topic/Header :- Covariance and Correlation Difference

Keypoints :- Notes: -
Covariance versus Correlation –

 Covariance is a measure of how much two random variables vary together;
correlation is a statistical measure that indicates how strongly two variables
are related.
 Covariance involves the relationship between two variables or datasets;
correlation can involve the relationship between multiple variables as well.
 Covariance lies between -infinity and +infinity; correlation lies between -1
and +1.
 Covariance is a measure of correlation; correlation is a scaled version of
covariance.
 Covariance provides the direction of the relationship; correlation provides
both the direction and the strength of the relationship.
 Covariance is dependent on the scale of the variables; correlation is
independent of the scale of the variables.
 Covariance has dimensions; correlation is dimensionless.

Summary: -