The sample space is the set of all possible outcomes. Any subset of the sample space
is an event. A probability can be assigned to each event. When all outcomes are
equiprobable, the axioms of probability can be used to compute the probability of a
given event.
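As a minimal sketch of the equiprobable case, the snippet below takes one roll of a fair six-sided die as the sample space (an assumed example, not from the notes) and computes the probability of an event as the number of outcomes in the event divided by the number of outcomes in the sample space. The function name `probability` is hypothetical.

```python
from fractions import Fraction

# Assumed example: sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event, space):
    """P(event) = |event| / |space|, valid only when all outcomes are equally likely."""
    if not event <= space:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(space))

even = {2, 4, 6}          # event: the roll is even
at_least_five = {5, 6}    # event: the roll is 5 or 6

p_even = probability(even, sample_space)
p_at_least_five = probability(at_least_five, sample_space)
p_union = probability(even | at_least_five, sample_space)

print(p_even, p_at_least_five, p_union)
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations.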
Partition:
A set of mutually exclusive and collectively exhaustive events is called a
partition (see Figure 1).
A1 ∪ A2 ∪ A3 ∪ A4 = S
Ai ∩ Aj = φ, i, j ∈ {1,2,3,4} & i ≠ j
Where S = sample space.
A1, A2, A3, A4 together form a partition of S here.
An event A and its complement Ᾱ also constitute a partition.
Figure 1: Partitions (the sample space S divided into A1, A2, A3, A4)
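The two conditions in the definition (pairwise disjoint, union equal to S) can be checked mechanically. The sketch below does so for finite sets; the sets and the helper name `is_partition` are assumptions chosen for illustration.

```python
from itertools import combinations

def is_partition(events, space):
    """True if the events are pairwise disjoint and their union is the whole space."""
    mutually_exclusive = all(a.isdisjoint(b) for a, b in combinations(events, 2))
    collectively_exhaustive = set().union(*events) == space
    return mutually_exclusive and collectively_exhaustive

# Assumed example: one roll of a die split into four events.
S = {1, 2, 3, 4, 5, 6}
A1, A2, A3, A4 = {1}, {2, 3}, {4}, {5, 6}

# An event and its complement.
A = {2, 4, 6}
A_complement = S - A

ok_four = is_partition([A1, A2, A3, A4], S)
ok_complement = is_partition([A, A_complement], S)
overlapping = is_partition([{1, 2, 3}, {3, 4, 5, 6}], S)  # 3 is in both events
incomplete = is_partition([{1, 2}, {3, 4}], S)            # 5 and 6 are missing

print(ok_four, ok_complement, overlapping, incomplete)
```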
Conditional Probability:
P(A|B) = P(A∩B)/P(B), for P(B) > 0.
Figure 2: Conditional Probability (Venn diagram of A, B and A∩B)
Independent Events:
If the outcome of one event doesn't affect the outcome of the other, the events
are called independent events.
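A small sketch of both ideas, again on an assumed fair-die sample space. It computes P(A|B) from the definition and tests independence with the standard product criterion P(A∩B) = P(A)·P(B), which is equivalent to P(A|B) = P(A) whenever P(B) > 0. The helper names `prob`, `conditional` and `independent` are hypothetical.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
S = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event & S), len(S))

def conditional(a, b):
    """P(a | b) = P(a ∩ b) / P(b), defined only for P(b) > 0."""
    p_b = prob(b)
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return prob(a & b) / p_b

def independent(a, b):
    """Product criterion: P(a ∩ b) == P(a) * P(b)."""
    return prob(a & b) == prob(a) * prob(b)

A = {2, 4, 6}        # the roll is even
B = {1, 2, 3, 4}     # the roll is at most 4
C = {4, 5, 6}        # the roll is at least 4

p_a_given_b = conditional(A, B)   # knowing B does not change P(A)
p_a_given_c = conditional(A, C)   # knowing C does change P(A)

print(p_a_given_b, independent(A, B))
print(p_a_given_c, independent(A, C))
```

Here A and B are independent because P(A|B) equals P(A), while A and C are not.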
While collecting data, we need to make sure that the data is collected from an
unbiased sample, that all the fields necessary for the required analysis have been
covered, and that a sufficiently large number of data points is collected.
Example: The height of waves in the Arabian Sea and the population of Portugal
both show similar patterns. However, the two are completely unrelated. Understanding
the data helps avoid errors in which a relationship is drawn between two independent
data sets.
Example: Data collected to predict the number of cars in Bangalore at the end of
the year needs to contain not only the cars added each year over the past ten years,
but also the change in infrastructure and the change in the demographics of the city
over the same period. Understanding the problem at hand and collecting comprehensive
data for it will help increase the accuracy of the inferences drawn from the data.
In the scatter plot of Figure 7, all data points (except one outlier) are scattered
in a single cluster, which shows that the height and weight of the monkeys lie in a
certain interval. From this we can compute the average weight and height of the
monkeys in the population surveyed. However, when the height of a King Kong is added
to the same data set, even though the King Kong is a monkey, the point lies far
outside the formed cluster and appears as an outlier that can distort the averages
of the data.
Figure 7: King-kong Effect (scatter plot of Weight of Monkey against Height of Monkey)
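To make the effect concrete, the sketch below compares the averages with and without a single outlier. All numbers are invented for illustration (they are not the values plotted in Figure 7), and the units are arbitrary.

```python
from statistics import mean

# Hypothetical monkey measurements in arbitrary units, invented for illustration.
heights = [4.0, 5.0, 5.0, 6.0, 6.0, 7.0]
weights = [3.0, 4.0, 5.0, 5.0, 6.0, 7.0]

# Hypothetical "King Kong" point lying far outside the cluster.
king_kong_height, king_kong_weight = 12.0, 12.0

mean_h_before = mean(heights)
mean_w_before = mean(weights)

mean_h_after = mean(heights + [king_kong_height])
mean_w_after = mean(weights + [king_kong_weight])

print(f"mean height: {mean_h_before:.2f} -> {mean_h_after:.2f}")
print(f"mean weight: {mean_w_before:.2f} -> {mean_w_after:.2f}")
```

A single point out of seven is enough to shift both averages noticeably, which is why the outlier should be examined before it is included.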
The help of a domain expert should be taken in order to validate the data, and the
existence of outliers should be studied and understood. An informed decision can
then be taken either to ignore or to accept the data. Often, an outlier that is a
valid data point is analysed separately and may be ignored when estimating certain
population parameters.
In a nutshell, we should understand the data collected, make a scatter plot,
identify the pattern (trend or seasonality) in the data, carefully analyse the
outliers, and carry out the required analysis without deviating from the goal.
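Correlation Coefficient:
Common Formulae:
Say x ∈ X and y ∈ Y are two sets of data, each with n observations.
Mean: $\mu = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Variance: $\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \mu)^2$
Covariance of X and Y is defined as:
$\mathrm{cov}(x, y) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$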
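The formulae above translate directly into code. This is a minimal sketch using the n − 1 divisor from the notes; the function names and the sample data are assumptions made for illustration.

```python
import statistics

def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / (len(values) - 1)

def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("x and y must have the same number of observations")
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical paired observations, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(mean(x), sample_variance(x), sample_covariance(x, y))

# Cross-check against the standard library's sample variance.
print(statistics.variance(x))
```

Note that the covariance of a data set with itself is its variance, which is a quick sanity check on any implementation.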
Variance tells us how much the data is scattered about the mean. Squaring removes
the sign, so that positive and negative deviations don't cancel each other out to
give an unrealistic value. Squaring also gives higher weight to larger deviations,
thus highlighting them. Mean absolute deviation also removes the sign; however, it
gives equal weight to each deviation.
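The difference between squaring and taking absolute values can be seen on two small data sets that share the same mean and the same mean absolute deviation but differ in variance. The data and function names below are assumptions chosen to make the arithmetic easy to follow.

```python
def mean(values):
    return sum(values) / len(values)

def sum_of_deviations(values):
    """Raw deviations from the mean cancel out, so this is always about zero."""
    mu = mean(values)
    return sum(v - mu for v in values)

def mean_absolute_deviation(values):
    """Average of |deviation|: every unit of deviation counts equally."""
    mu = mean(values)
    return sum(abs(v - mu) for v in values) / len(values)

def sample_variance(values):
    """Average of squared deviations (n - 1 divisor): large deviations dominate."""
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values) / (len(values) - 1)

# Hypothetical data sets, both with mean 10.
many_small = [8.0, 9.0, 10.0, 11.0, 12.0]    # deviations: -2, -1, 0, 1, 2
few_large = [7.0, 10.0, 10.0, 10.0, 13.0]    # deviations: -3, 0, 0, 0, 3

for data in (many_small, few_large):
    print(sum_of_deviations(data),
          mean_absolute_deviation(data),
          sample_variance(data))
```

Both data sets have the same mean absolute deviation, but the one with the two large deviations has the larger variance.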
Covariance measures how two data sets X and Y vary together, and its value depends
on the units in which X and Y are measured. To obtain a dimensionless quantity, we
define the correlation coefficient as:
correlation coefficient: $r = \frac{\mathrm{cov}(x, y)}{\sigma_x \cdot \sigma_y}$, where $-1 \le r \le 1$
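The sketch below computes r from the definition and illustrates why it is preferred over covariance for comparison: rescaling one variable (here metres to centimetres) changes the covariance but leaves r unchanged. The data and function names are hypothetical, and the code assumes neither data set is constant (otherwise a standard deviation is zero and r is undefined).

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of paired deviations, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """r = cov(x, y) / (sigma_x * sigma_y); assumes both standard deviations are non-zero."""
    sigma_x = math.sqrt(sample_covariance(xs, xs))  # cov(x, x) is the variance of x
    sigma_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (sigma_x * sigma_y)

# Hypothetical paired observations, invented for illustration.
heights_m = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [2.0, 4.0, 5.0, 4.0, 5.0]
heights_cm = [100.0 * h for h in heights_m]   # same data in different units

cov_m = sample_covariance(heights_m, weights)
cov_cm = sample_covariance(heights_cm, weights)
r_m = correlation(heights_m, weights)
r_cm = correlation(heights_cm, weights)

print(cov_m, cov_cm)   # covariance scales with the unit
print(r_m, r_cm)       # correlation does not
```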
*Typically, if $r > \frac{1.96}{\sqrt{n}}$ or $r < \frac{-1.96}{\sqrt{n}}$, we can say
with higher confidence that a linear correlation exists. Because the threshold
shrinks as n grows, the more data is collected, the higher the confidence in the
observed trend.
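As a sketch of this rule of thumb, the function below checks whether |r| exceeds 1.96/√n. The value 1.96 is the familiar two-sided 95% point of the standard normal distribution; the function name and the example values of r and n are assumptions for illustration.

```python
import math

def exceeds_threshold(r, n, z=1.96):
    """True if r > z / sqrt(n) or r < -z / sqrt(n), i.e. |r| > z / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be a positive number of observations")
    return abs(r) > z / math.sqrt(n)

# The same observed r = 0.3 judged against different sample sizes.
small_sample = exceeds_threshold(0.3, 25)    # threshold = 1.96 / 5  = 0.392
large_sample = exceeds_threshold(0.3, 100)   # threshold = 1.96 / 10 = 0.196
negative_r = exceeds_threshold(-0.3, 100)    # the rule is symmetric in the sign of r

print(small_sample, large_sample, negative_r)
```

The same correlation of 0.3 fails the check with 25 observations but passes with 100, which is the sense in which more data raises confidence in the observed trend.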