Keypoints :- Analysis, Analytics, Qualitative Analytics & Quantitative Analytics
Notes: - Analysis is always done on data from the past to explain how and why something happened. For example, why sales went down last year, or why a story ended the way it did in the first place.
Analytics, on the other hand, is the study of the future: instead of explaining past events, it explores potential future ones. We look at past data to recognise patterns. Analytics is essentially the application of logical and computational reasoning to the component parts obtained in an analysis, and in doing this we look for patterns and explore what can be done with them.
With this distinction, the terms data analysis, data analytics, business analysis and business analytics make better sense and can easily be differentiated.
Summary: - Analysis is the study of the past; analytics is the study of the future. Qualitative analytics = intuition + analysis; quantitative analytics = formulas + algorithms.
Qualitative & quantitative methodologies can be used with analysis too, to explain how and why.
Topic/Header :- Most Popular Disciplines in the Data Science Field and How They Intertwine with Each Other
Keypoints :- Data Science, Data Analytics, Business Intelligence, AI, ML
Notes: - To begin with, the Business Sphere is the first sphere and includes the following terms:
Business Case Study, Qualitative Analysis, Preliminary Data Report, Reporting with Visuals, Creating Dashboards, Sales Forecasting.
Not all of these activities are data driven. Some business activities are DATA DRIVEN while others are SUBJECTIVE or EXPERIENCE DRIVEN. Business Case Studies & Qualitative Analysis are not data driven; the rest are.
Business case studies are real-world accounts of how companies succeed in their businesses, and qualitative analysis is using your intuition, experience and knowledge of the market to plan future business processes.
A Preliminary Data Report is the first step of any data analysis. Reporting with Visuals and Creating Dashboards are part of Business Intelligence.
Summary: - Understanding the difference between Business Analytics, Data Analytics, Data Science &
ML.
Topic/Header :- Connecting Data Science Disciplines
Keypoints :- Data, Big Data, 3Vs of Big Data, Steps involved in Data Science, Traditional Methods & Machine Learning
Notes: - Data is information stored in digital format that can be used as a base for performing analyses and making decisions.
Data can be traditional data or big data. Traditional data is data in tabular format that can be stored and managed from one computer. Big data, however, is extremely large data that is not only stored on multiple computers or in the cloud, but can also be structured, semi-structured or unstructured.
There are 3Vs of Big Data that everyone should know:
Volume: - Big data needs lots of space; it is usually measured in terabytes and is typically distributed across many computers.
Variety: - Big data is not limited to text and numbers; it can also be images, audio files, mobile data and more.
Velocity: - This refers to the speed of big data. Since the volume and variety are so large, extracting patterns from big data as quickly as possible is one of the major goals.
The first step of any data science process is gathering data. Once data is collected, BI comes into the picture. BI includes all technology-driven tools involved in the process of analyzing, understanding and reporting available past data.
All of these are part of analysing the past. To predict the future, we use data analytics. This can be done using traditional methods or machine learning.
Traditional methods are well suited to forecasting future performance with good accuracy, using techniques like regression, clustering and factor analysis.
Summary: - Data collection is the first step of the data science process. BI comes into the picture after this step to make reports and analyse past data. To make future predictions, traditional methods or machine learning are used, depending on what outcome we are looking for.
Topic/Header :- Techniques for Working with Traditional Data
Keypoints :- Data Collection, Data Pre-processing, Data Labelling, Numerical Data vs Categorical Data, Data Cleaning, Balancing & Shuffling, ER Diagrams, Relational Schemas
Notes: - The first step is data collection. Once that is done, we move to data pre-processing, because the raw data that was collected cannot be used directly for any analysis. In pre-processing we do data labelling, where we separate numerical data from categorical data.
Numerical data is data stored in numeric form that can be manipulated mathematically, such as average goods sold per day or maximum goods sold in a month. Categorical data, on the other hand, cannot be manipulated mathematically, for example sex, location or job title.
Once this is done, we move to the next step, which is data cleaning. The goal of data cleaning is to deal with inconsistent data, such as an age recorded as 324 or a last name recorded as a city name.
Handling missing values is also part of data cleaning, and there are techniques to deal with missing data.
There are certain approaches that are case specific when working with traditional data, like balancing and shuffling. Balancing is the process of making sure we have data in sufficient balance to make relevant predictions or draw conclusions. For example, to predict the shopping patterns of men and women, we cannot use data where 80% of reviews are from women and only 20% are from men.
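A minimal sketch of the cleaning steps described above (flagging impossible values, filling missing ones), assuming a toy list-of-dicts dataset; the field names and values are hypothetical:

```python
# Toy dataset with an impossible age (324) and a missing value (None).
records = [
    {"name": "Ana", "age": 34},
    {"name": "Raj", "age": 324},   # inconsistent: an age cannot be 324
    {"name": "Mia", "age": None},  # missing value
    {"name": "Leo", "age": 28},
]

# Step 1: treat impossible values as missing instead of trusting them.
for r in records:
    if r["age"] is not None and not (0 <= r["age"] <= 120):
        r["age"] = None

# Step 2: impute missing ages with the mean of the remaining valid ones.
valid_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(valid_ages) / len(valid_ages)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age
```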
Shuffling is the process of making sure we have a random sample of data, so that predictions are not biased in any fashion. This is also important because it makes the sample representative of the master set, or the universe of data.
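Balancing and shuffling can be sketched as follows, assuming hypothetical review records labelled by gender; the majority class is downsampled to the size of the minority class, then the order is randomized:

```python
import random

random.seed(0)  # seeded only to make the sketch reproducible

# Imbalanced toy reviews: 8 from women, 2 from men.
reviews = [("F", i) for i in range(8)] + [("M", i) for i in range(2)]

# Balancing: downsample the majority class to the minority class size.
women = [r for r in reviews if r[0] == "F"]
men = [r for r in reviews if r[0] == "M"]
k = min(len(women), len(men))
balanced = random.sample(women, k) + random.sample(men, k)

# Shuffling: randomize the order so the data is not grouped by class.
random.shuffle(balanced)
```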
ER diagrams & relational schemas are the traditional ways to represent this kind of data pictorially. An ER (Entity-Relationship) diagram uses three kinds of shapes to represent entities, relationships and attributes:
Boxes represent entities.
Diamonds represent relationships.
Circles (ovals) represent attributes.
Relational schemas are tables together with the relationships between them, showing how one table is related to the others.
Summary: - In order to work with traditional data, data collection, data pre-processing and data cleaning are the initial and important stages. Labelling data as numerical or categorical, dealing with missing data, and balancing & shuffling are important steps. Balancing and shuffling both ensure that the data is not biased in any way and does not give biased insights. ER diagrams & relational schemas are the traditional ways to represent such data pictorially.
Topic/Header :- Techniques for Working with Big Data
Keypoints :- Big Data, Data Masking
Notes: - Big data techniques definitely have their differences, although some of the approaches used on traditional data can also be applied to big data. Collecting and pre-processing big data is essential to organize it before doing analyses or making predictions, as is grouping the data into classes or categories. While working with big data, things can get a little more complex: big data has much more variety than simple numerical & categorical data.
Examples of big data are text data, digital image data, digital video data, digital audio data and more.
Consequently, with a larger number of data types comes a wider range of data cleansing methods.
Dealing with missing data is especially important here, because the volume is huge and simply dropping the missing records is not a preferred way to handle it.
Data masking is also an important aspect here, as maintaining the confidentiality of the data is of utmost importance. The motive is to make sure confidentiality is maintained while keeping in mind that analysis must still be possible.
Data masking is a complex process: it conceals the original data with random, false data so that analysis can still be performed, while keeping all the confidential information in a secure place.
Examples of Big Data: - Facebook Data, Stock Trading Data.
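One possible sketch of data masking, using hypothetical customer records: confidential names are replaced with random strings, the real values are kept in a separate secure lookup, and analysis still runs on the masked records:

```python
import random
import string

random.seed(42)  # seeded only so the sketch is reproducible

def mask_value(value: str) -> str:
    """Replace a confidential string with random letters of the same length."""
    return "".join(random.choice(string.ascii_uppercase) for _ in value)

customers = [
    {"name": "Alice Smith", "purchases": 5},
    {"name": "Bob Jones", "purchases": 3},
]

# The real-to-masked mapping lives in a separate, secured store.
secure_lookup = {}
for c in customers:
    masked = mask_value(c["name"])
    secure_lookup[masked] = c["name"]
    c["name"] = masked

# Analysis can still be performed on the masked records.
total_purchases = sum(c["purchases"] for c in customers)
```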
Summary: - Brief update on Big Data and techniques used to process Big Data.
Topic/Header :- Business Intelligence Techniques
Keypoints :- Metric, KPIs, Measure, Reports, Dashboards, What BI includes
Notes: - Once the data is gathered, cleaned & pre-processed, Business Intelligence comes in. Here the analyst uses data skills, business knowledge & intuition to explain past data. In order to perform an analysis, we start by collecting observations.
However, no mathematical manipulation can be done on raw observations, so we quantify them. Quantification is the process of representing observations as numbers.
A Metric is a Measure + Business Meaning. Example: say $350 was the revenue for the 1st quarter, with 50 customers; these two numbers are measures. Average revenue per customer for the 1st quarter would be $350/50, that is $7. This is a metric. From a handful of measures, thousands of metrics can be derived, so it would be difficult to keep track of them all, and wasteful to work on a metric that is not useful or business relevant.
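The measure-to-metric arithmetic above can be checked in a couple of lines (values taken from the example):

```python
# Measures straight from the data.
revenue_q1 = 350    # dollars, 1st quarter
customers_q1 = 50

# Metric = measure + business meaning: average revenue per customer.
avg_revenue_per_customer = revenue_q1 / customers_q1
```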
KPIs (Key Performance Indicators) are Metrics + Business Objectives. KEY because they relate to the main business goals, PERFORMANCE because they show how successfully you have performed within a specific timeframe, & INDICATORS because they are values or metrics that indicate something related to your business performance. KPIs are metrics that are tightly aligned with the business objectives.
Summary: - BI answers questions that are relevant to business decisions by creating KPIs, metrics, reports and dashboards. The main objective here is to explain what happened using past data and key performance indicators.
Topic/Header :- Techniques For Working with Traditional Methods
Keypoints :- Linear Regression, Logistic Regression, Factor Analysis, Clustering
Notes: - Linear regression, logistic regression, clustering and factor analysis are some of the most popular and frequently used techniques when it comes to traditional data.
Linear Regression is a linear approach to modelling the relationship between a scalar response (the dependent variable) & one or more explanatory variables.
Logistic regression is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-
level independent variables.
Factor Analysis is a process in which the values of observed data are expressed as
functions of a number of possible causes in order to find which are the most
important. It aims at reducing dimensionality.
Cluster analysis or clustering is the task of grouping a set of objects in such a way
that objects in the same group (called a cluster) are more similar (in some sense) to
each other than to those in other groups (clusters).
Time Series Analysis is the analysis of data over time. Time is an independent variable and is always on the X-axis. It is mainly used with economic & financial data.
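As an illustrative sketch of the regression idea (toy numbers, not from the notes), a simple linear regression can be fitted with the closed-form least-squares solution, no external libraries needed:

```python
# Toy data roughly following y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares: slope = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

def predict(x: float) -> float:
    """Predict the response for a new explanatory value."""
    return intercept + slope * x
```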
Summary: - Linear regression, logistic regression, clustering & time series forecasting are some of the common techniques used when dealing with traditional data. For labelled data, regression analysis is the usual technique to go with, while for unlabelled data, clustering is the usual method.
Topic/Header :- Population & Sample, Types of Data
Keypoints :- Population, Sample, Parameters, Statistics
Notes: - A population is the collection of all items of interest and is generally represented by “N”. The numbers we obtain from a population are called parameters. A sample, on the other hand, is a subset of the population and is denoted by “n”. The numbers we obtain from a sample are called statistics.
Populations are hard to define, observe and work with, while a sample is much easier to gather and work with and consumes less time and money; hence working with samples is a common practice and is widely used.
A random sample is collected when each member of the sample is chosen from the population strictly by chance.
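Drawing a random sample, where each member is chosen strictly by chance, is a single standard-library call; the population here is a hypothetical list of member IDs:

```python
import random

random.seed(7)  # seeded only to make the sketch reproducible

# A toy population of N = 1000 member IDs.
population = list(range(1000))

# A random sample of n = 50 distinct members, each chosen by chance.
sample = random.sample(population, 50)
```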
Based on type, data can be divided into categorical & numerical. Categorical data describes categories or groups, like {Yes, No} or {Male, Female}. Numerical data describes numerical values and can be further classified into two types, discrete & continuous. Discrete data takes a finite set of values (integers like 0, 1, 2999, 3478), while continuous data can take infinitely many possible values, such as 62.09834940, 12.08090 or 1.00000029213.
Quantitative data can be classified into INTERVAL & RATIO scales. A ratio scale has a TRUE 0, while an interval scale does not. Consider temperature: 0 deg Celsius and 0 deg Fahrenheit are not actual zeros, so these are interval scales. 0 Kelvin, on the other hand, is a true 0 (it equals -273.15 deg Celsius), and hence Kelvin is a ratio scale.
Summary: - A population is the whole universe, every possible entry in that universe, while a sample is a subset of the population that is representative of it and unbiased. Also covered: classification of data types and scales of measurement.
Topic/Header :- Difference between Key Terms/Types of Variables
Keypoints :- Nominal, Ordinal, Interval & Ratio Variables
Notes: - A categorical variable, also called a nominal variable, is for mutually exclusive, but not ordered, categories. For example, your study might compare five different genotypes. You can code the five genotypes with numbers if you want, but the order is arbitrary and any calculations (for example, computing an average) would be meaningless.
An ordinal variable is one where the order matters but not the difference between values. For example, you might ask patients to express the amount of pain they are feeling on a scale of 1 to 10. A score of 7 means more pain than a score of 5, and that is more than a score of 3. But the difference between 7 and 5 may not be the same as that between 5 and 3. The values simply express an order.
A ratio variable has all the properties of an interval variable, and also has a clear
definition of 0.0. When the variable equals 0.0, there is none of that variable.
Variables like height, weight, enzyme activity are ratio variables. Temperature,
expressed in F or C, is not a ratio variable. A temperature of 0.0 on either of those
scales does not mean 'no heat'. However, temperature in Kelvin is a ratio variable,
as 0.0 Kelvin really does mean 'no heat'.
For example, if the desired number of intervals is 5, the smallest number is 1 and the largest number is 100, then the interval bins should be of size (100 - 1)/5 = 19.8, which rounds up to 20.
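The bin-width arithmetic above can be verified directly:

```python
import math

smallest, largest = 1, 100
desired_intervals = 5

# Bin width = (max - min) / number of intervals, rounded up.
raw_width = (largest - smallest) / desired_intervals   # 19.8
bin_width = math.ceil(raw_width)                       # 20
```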
Summary: -
Topic/Header :- Measures of Central Tendency, Measures of Asymmetry & Measures of Variability
Keypoints :- Mean, Median, Mode, Skewness, Kurtosis, Variance, Standard Deviation, Coefficient of Variation
Notes: - Mean, Median & Mode are the three most common and popular measures of central tendency.
Mean is calculated by totalling all the components and dividing by the number of components,
or (x1 + x2 + x3 + … + xn)/N.
Median is the middle value of the dataset once it is sorted (the average of the two middle values when the count is even).
Mode is the value that occurs with the highest frequency in the dataset.
The most common methods used to measure the asymmetry are Skewness &
Kurtosis. Skewness indicates whether the data is concentrated on one side or not.
The dataset is positively skewed if Mean > Median; here the outliers are to the right, and the dataset is also called right skewed.
The dataset has zero skew if Mean = Median = Mode. Here the outliers can be on both sides, left as well as right. This is also called no skew, and the distribution of the dataset is symmetrical.
The last case is negative skew, where Mean < Median. The outliers are to the left, and the dataset is also called left skewed.
Variance, Standard Deviation & Coefficient of Variation are the commonly used
techniques for Measures of Variability
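These measures can all be computed with Python's standard statistics module on a toy dataset; the skew check mirrors the mean-versus-median rule from the notes above:

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency.
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of variability (population formulas).
variance = statistics.pvariance(data)
std_dev = statistics.pstdev(data)
coeff_of_variation = std_dev / mean

# Quick asymmetry check: mean greater than median suggests right skew.
right_skewed = mean > median
```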
Summary: -
Topic/Header :- Covariance and Correlation
Keypoints :- Notes: -
Covariance and Correlation are two mathematical concepts which are
commonly used in the field of probability and statistics. Both concepts
describe the relationship between two variables.
Covariance –
1. It measures how a pair of random variables vary together: a change in one variable is associated with a change in the other.
2. It can take any value between -infinity and +infinity, where a negative value represents a negative relationship and a positive value represents a positive relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of the relationship between variables.
Formula –
For a population: cov(x, y) = Σ (xi − x′)(yi − y′) / n
For a sample: cov(x, y) = Σ (xi − x′)(yi − y′) / (n − 1)
Here,
x′ and y′ = means of the given sets
n = total number of observations
xi and yi = individual observations of the sets
Example –
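The population and sample formulas can be checked with a small pure-Python example (illustrative numbers, not from the notes):

```python
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Sum of cross-deviations: sum((xi - x')(yi - y')).
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

cov_population = cross / n        # divide by n for a population
cov_sample = cross / (n - 1)      # divide by n - 1 for a sample
```

Both values are positive here, which matches the notes: the variables move in the same direction.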
Summary: -
Topic/Header :- Covariance and Correlation
Keypoints :- Notes: -
Correlation –
1. It shows whether and how strongly pairs of variables are related to each other.
2. Correlation takes values between -1 and +1, where values close to +1 represent strong positive correlation and values close to -1 represent strong negative correlation.
3. It is a scaled, dimensionless version of covariance, so it is not affected by a change in the scale or units of the variables.
4. It gives both the direction and the strength of the relationship between variables.
Formula –
corr(x, y) = cov(x, y) / (σx · σy)
Here,
σx and σy = standard deviations of x and y
x′ and y′ = means of the given sets
n = total number of observations
xi and yi = individual observations of the sets
Example –
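A toy example of the correlation formula; the data below are perfectly linear, so the correlation comes out at the +1 end of the range:

```python
import math

xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Population covariance and standard deviations.
cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)

# Correlation = covariance scaled by both standard deviations.
correlation = cov / (std_x * std_y)
```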
Summary: -
Topic/Header :- Covariance and Correlation Difference
Keypoints :- Notes: -
Covariance versus Correlation –
COVARIANCE: gives only the direction of the linear relationship; can take any value from -infinity to +infinity; is affected by a change of scale.
CORRELATION: gives both the direction and the strength of the relationship; takes values between -1 and +1; is not affected by a change of scale.
Summary: -