Histograms
Histograms are plots used for visually inspecting the distribution of a data set. Histograms
are quite useful for depicting large differences in shape or symmetry, such as whether a
data set appears symmetric or skewed. To construct a histogram, the first step is to "bin"
the range of values (that is, divide the entire range of values into a series of intervals) and
then count how many values fall into each interval. The bins are usually specified as
consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent,
and are often (but are not required to be) of equal size. Histograms can be generated using
Excel with a built-in tool (Tutorial with example:
https://www.ablebits.com/office-addins-blog/2016/05/11/make-histogram-excel/).
Example:
Excel has a data analysis package that allows the user to easily generate a histogram. Fig 1
shows the histogram generated for annual rainfall data. Table 2 shows the frequency table
based on which the histogram is generated.
[Fig 1. Histogram of annual rainfall data. Horizontal axis: Bin (5, 10, 15, 20, 25, 30, 35, 40, More); vertical axis: Frequency.]
To generate a histogram manually, one should first fix the number of bins (or classes) and
the bin size (or class interval). As a rule of thumb, one may use the Sturges formula:
$$ m = 3.3 \log_{10}(n) + 1 $$
where
m = number of class intervals, rounded up to the next highest integer, and
n = number of data points.
You can create the intervals by starting with the minimum value or some number that
is close to it.
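The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function name `sturges_bins` and the sample size of 100 are assumptions made for the example, not part of the notes.

```python
import math


def sturges_bins(n: int) -> int:
    """Number of class intervals from m = 3.3*log10(n) + 1, rounded up."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(3.3 * math.log10(n) + 1)


# Hypothetical sample size, chosen only for illustration.
print(sturges_bins(100))  # 8
```

The result is a starting point only; the bin count can still be adjusted to suit the data.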
To find the number of occurrences in each interval (frequency), use either the FREQUENCY
or COUNTIF functions. The formats are: FREQUENCY(Data range, Upper limit range) and
COUNTIF(Data range, "<"&Upper limit). COUNTIF with an upper-limit criterion counts every
value below that limit, so it gives the cumulative frequency; the actual frequency for each
class interval then needs to be calculated by differencing, as shown in Table 4. (FREQUENCY,
by contrast, returns the count for each bin directly.)
Table 4. Bins for annual rainfall data using EXCEL functions
Lower limit   Upper limit   Cumulative frequency   Frequency
3             10            14                     14
10            17            26                     12
17            24            31                     5
24            31            33                     2
31            38            36                     3
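The step from cumulative counts to class frequencies in Table 4 can be sketched as follows. The function names are illustrative assumptions, and `cumulative_counts` only mimics a COUNTIF-style "less than upper limit" count; it is not Excel's implementation.

```python
def cumulative_counts(data, upper_limits):
    """COUNTIF-style count of values strictly below each upper limit."""
    return [sum(1 for x in data if x < u) for u in upper_limits]


def class_frequencies(cumulative):
    """Difference consecutive cumulative counts to get per-class frequencies."""
    previous = [0] + list(cumulative[:-1])
    return [c - p for c, p in zip(cumulative, previous)]


# Cumulative frequencies taken from Table 4.
cumulative = [14, 26, 31, 33, 36]
print(class_frequencies(cumulative))  # [14, 12, 5, 2, 3]
```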
Fig 2. Histogram for rainfall data created using FREQUENCY and/or COUNTIF functions
(horizontal axis: bins with upper limits 10, 17, 24, 31, 38; vertical axis: Frequency)
Notice that Figs 1 and 2 are slightly different because we chose a different bin size. Now,
what other information can you gather about the data by looking at the histogram?
Shape: The histogram shows that the data have a right-skewed distribution. This
means the data tend to cluster near the lower bound of the horizontal axis. Think of this
as similar to the scores on a difficult exam: most students perform poorly,
making the leftmost bars in the histogram taller.
Now, let us use some of the numerical measures we learned in the last chapter to
summarize the data.
Measures of location: We already looked at the definition and formula used to compute
mean and median. Based on that, for the annual rainfall data,
Mean = 14.06 inches
Median = 11.98 inches
The mean and the median are different since this is an asymmetrical, skewed distribution.
One can also use the histogram's frequency table to compute the mean and median. The median is given by
$$ \text{Median} = L + \frac{w}{f_m}\,(0.5n - cf_b) $$
where
L = the lower limit of the interval that contains the median,
w = interval width,
f_m = frequency of the interval that contains the median,
n = number of data points, and
cf_b = cumulative frequency of the intervals before the median interval.
In order to use the above equation, you need to determine the interval that contains the
median. This is the first interval in which the cumulative relative frequency reaches or
exceeds 50%. You can find it by dividing the cumulative frequency by the total
number of observations (see table below).
Accordingly, the second interval contains the median (cumulative relative frequency is 26/36 = 0.72).
We can then calculate the median using the above equation with L = 10, w = 7, f_m = 12,
n = 36, and cf_b = 14. Accordingly, the median is 12.3 inches.
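As a check, the arithmetic behind this value can be written out step by step, using only the numbers quoted above:

$$ \text{Median} = 10 + \frac{7}{12}\,(0.5 \times 36 - 14) = 10 + \frac{7}{12} \times 4 \approx 12.33 \text{ inches} $$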
Similarly, mean can be calculated from frequency table using the formula:
$$ \bar{X} = \frac{1}{n} \sum_{i=1}^{k} f_i x_i $$
where
x_i = class mark (mid-point of class i),
k = number of classes,
f_i = frequency of class i, and
n = number of measurements (total sum of frequencies).
In this example, the mean comes out to be 14.28 inches.
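A minimal sketch of the grouped-mean calculation, using the class limits and frequencies from Table 4. The function name `grouped_mean` is an assumption for illustration, and the class mark is taken as the mid-point of each interval.

```python
def grouped_mean(lower, upper, freq):
    """Mean from a frequency table: sum(f_i * x_i) / n, with x_i the class mark."""
    marks = [(lo + hi) / 2 for lo, hi in zip(lower, upper)]
    n = sum(freq)
    return sum(f * x for f, x in zip(freq, marks)) / n


# Class limits and frequencies from Table 4.
lower = [3, 10, 17, 24, 31]
upper = [10, 17, 24, 31, 38]
freq = [14, 12, 5, 2, 3]
print(round(grouped_mean(lower, upper, freq), 2))  # 14.28
```

The weighted sum is 514 over 36 observations, which agrees with the value quoted above up to rounding.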
In addition to mean and median, there is another commonly used measure of location
used while describing a histogram. It is known as the mode. The mode of a set of
measurements (observations) is defined as the measurement with the highest frequency
of occurrence. From the frequency table (Table 4), it can be seen that the first interval
(3 to 10 inches) has the highest frequency (14). The mid-point (class mark) of that interval
is taken as the mode. That is, the mode for the data is 6.5 inches.
Measures of spread: Measures of spread are numerical summaries that indicate the
degree of variability or dispersion, that is, how spread out the data are. The range,
which is the difference between the highest and the lowest data value, is a
very crude measure of spread. The range can also be computed from the frequency table
as the difference between the highest and lowest class mark (34.5 - 6.5 = 28 inches). Of course,
the wider the range, the larger the degree of dispersion.
$$ \text{Population variance} = \frac{1}{n} \sum_{i=1}^{k} (x_i - \bar{X})^2 f_i $$
where
x_i = class mark,
k = number of classes,
f_i = frequency of class i,
n = number of measurements (total sum of frequencies), and
X̄ = average (mean).
The above-mentioned formula is for the population variance. When dealing with the sample
variance, n should be replaced with n - 1. The formula then becomes
$$ \text{Sample variance} = \frac{1}{n-1} \sum_{i=1}^{k} (x_i - \bar{X})^2 f_i $$
Note that in the case of the sample variance, we divide by n - 1 instead of n to get an unbiased
estimate of the population variance (σ²). There is a mathematical justification for doing so,
which will be discussed in subsequent chapters. However, you can think of this as a
penalty (cost): the smaller the sample size, the more drastic the effect on s². As the
sample gets larger, our estimate of the variance improves and the effect of subtracting 1
from n becomes negligible. Using the formula, the sample variance for this data is 75.0 square
inches. The square root of the variance gives the standard deviation, s. For this dataset, s is 8.7
inches.
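The variance formulas can be checked with a short sketch. The function name `grouped_variance` and its `sample` flag are assumptions for illustration; the class marks and frequencies come from Table 4.

```python
import math


def grouped_variance(marks, freq, sample=True):
    """Variance from class marks and frequencies; divides by n-1 if sample."""
    n = sum(freq)
    mean = sum(f * x for f, x in zip(freq, marks)) / n
    squared_dev = sum(f * (x - mean) ** 2 for f, x in zip(freq, marks))
    return squared_dev / (n - 1 if sample else n)


# Class marks (interval mid-points) and frequencies from Table 4.
marks = [6.5, 13.5, 20.5, 27.5, 34.5]
freq = [14, 12, 5, 2, 3]
s2 = grouped_variance(marks, freq)
s = math.sqrt(s2)
print(round(s2, 1), round(s, 1))  # 75.0 8.7
```

Passing `sample=False` gives the population variance instead (about 72.9 square inches here), which shows how small the n versus n - 1 difference already is for 36 observations.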
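Another measure of spread is the interquartile range (IQR). As discussed in the previous chapter,
IQR = Q3 - Q1. Q3 and Q1 can be computed using equations similar to that for the median:
$$ Q_1 = L + \frac{w}{f_m}\,(0.25n - cf_b) $$
$$ Q_3 = L + \frac{w}{f_m}\,(0.75n - cf_b) $$
Here L, f_m, and cf_b refer to the interval that contains the quartile in question. Based on
these formulas, for this dataset, Q1 is 7.5 inches and Q3 is 18.4 inches.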
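The median and quartile formulas share one pattern, which the following sketch generalizes to any fraction p. The function name `grouped_quantile` is an assumption for illustration. The interval containing the quantile is taken as the first one whose cumulative frequency reaches p times n.

```python
def grouped_quantile(lower, upper, freq, p):
    """Quantile from a frequency table: L + (w / f_m) * (p*n - cf_b)."""
    n = sum(freq)
    target = p * n
    cf_before = 0  # cumulative frequency of the intervals already passed
    for lo, hi, f in zip(lower, upper, freq):
        if f > 0 and cf_before + f >= target:
            return lo + (hi - lo) / f * (target - cf_before)
        cf_before += f
    return upper[-1]


# Class limits and frequencies from Table 4.
lower = [3, 10, 17, 24, 31]
upper = [10, 17, 24, 31, 38]
freq = [14, 12, 5, 2, 3]
q1 = grouped_quantile(lower, upper, freq, 0.25)
q3 = grouped_quantile(lower, upper, freq, 0.75)
print(round(q1, 1), round(q3, 1), round(q3 - q1, 1))  # 7.5 18.4 10.9
```

With p = 0.5 the same function returns the grouped median of about 12.3 inches.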
The coefficient of variation (COV) is the ratio of the standard deviation to the mean:
$$ \mathrm{COV} = \frac{s}{\bar{X}} $$
It is unitless. This is useful when we need to compare two variables with
different units (which makes it difficult to compare them in other meaningful ways). For
this data, COV is 0.61. Distributions with COV < 1 are considered low-variance, and
distributions with COV > 1 are considered high-variance.
Dot Diagrams
Dot diagrams are useful tools for depicting the shape of the frequency distribution of a
continuous random variable, when only small samples, with typical sizes of 25-30, are
available. This is a common situation in many hydrologic analyses due to usually limited
periods of records and data samples of short lengths.
To create a dot diagram, the data are first ranked in ascending order of their values and
then plotted along a single horizontal axis. This is particularly useful when you want
to compare different data sets. The figure below shows the dot diagram for monthly
rainfalls at Sacramento and Houston:
Fig 3. Dot diagram for monthly rainfall data at Sacramento and Houston
Immediately, you can observe that the dispersion (spread) in the first data set is larger
than in the second one. The diagram also helps reveal whether there are outliers among the data points.
Stem and Leaf Diagrams
Stem and leaf diagrams are like histograms turned on their side, with data magnitudes to
two significant digits presented rather than only bar heights. Individual values are easily
found. The S-L profile is identical to the histogram and can similarly be used to judge
shape and symmetry, but the numerical information adds greater detail. One S-L could
Scatter Diagrams
The two-dimensional scatterplot is one of the most familiar graphical methods for data
analysis. It illustrates the relationship between two variables. Of usual interest is whether
that relationship appears to be linear or non-linear, whether different groups of data lie
in separate regions of the scatterplot, whether there are outliers, and whether the
relationship (or association) between the two variables is weak or strong. We will study
this in detail in coming chapters.
Fig 4a. Scatter plots as tools to identify the relationship between two variables. First
figure shows a positive linear correlation, second shows a negative linear correlation, and
third shows a non-linear relationship (Source: https://statistics.laerd.com/)
A quantile is defined as a number (from the sample or population) that corresponds to the
fraction of the sample that is less than or equal to the value of the quantile. Quantiles of importance
are Q(0.25), Q(0.5), and Q(0.75). You are already familiar with these quantiles as the lower
quartile (Q1), median or mid-quartile (Q2), and upper quartile (Q3). In terms of percentiles,
they are the 25th, 50th, and 75th percentiles. Percentiles are quantiles that divide a distribution
into 100 equal parts. Hence, we can generalize and say that the k-th percentile of a set of
values divides them so that k% of the values lie below and (100-k)% of the values lie above.
Given a set of values x_1, x_2, ..., x_n, we can define the quantiles for any fraction p as follows:
1. Sort the values in ascending order.
2. The ordered values are called the order statistics of the original sample.
3. Take the order statistics to be the quantiles which correspond to the fractions (or
sample fractions):
$$ p_i = \frac{i-1}{n-1}, \quad i = 1, \ldots, n $$
4. In general, to define the quantile which corresponds to the fraction p, use linear
interpolation between the two nearest p_i. If p lies a fraction f of the way from p_i
to p_{i+1}, define the pth quantile to be:
$$ Q(p) = (1 - f)\,Q(p_i) + f\,Q(p_{i+1}) $$
First, sort the values into order: 1.3, 2.2, 2.7, 3.1, 3.3, 3.7
The sample fractions p_i for these values are computed using the above-mentioned
formula:
Sample fraction (p)   0     0.2   0.4   0.6   0.8   1
Quantile              1.3   2.2   2.7   3.1   3.3   3.7
Let's say we want to find Q(0.25), the lower quartile. This lies between the sample
fractions 0.2 and 0.4, a fraction f = (0.25 - 0.2)/(0.4 - 0.2) = 0.25 of the way from 0.2 to 0.4, so
Q(0.25) = (1-0.25)*Q(0.2)+0.25*Q(0.4) = 0.75*2.2+0.25*2.7 = 2.325
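The interpolation procedure can be sketched in Python as follows. The function name `quantile` is an assumption for illustration; it uses the sample fractions (i - 1)/(n - 1) and linear interpolation between neighbouring order statistics.

```python
def quantile(values, p):
    """Quantile by linear interpolation with sample fractions (i-1)/(n-1)."""
    if not 0 <= p <= 1:
        raise ValueError("p must lie between 0 and 1")
    xs = sorted(values)          # order statistics
    n = len(xs)
    pos = p * (n - 1)            # zero-based position among the order statistics
    i = int(pos)                 # index of the order statistic at or below p
    f = pos - i                  # fraction of the way to the next one
    if i >= n - 1:
        return xs[-1]
    return (1 - f) * xs[i] + f * xs[i + 1]


data = [3.1, 1.3, 3.7, 2.2, 3.3, 2.7]
print(round(quantile(data, 0.25), 3))  # 2.325
```

For the six values in the example, p = 0.25 falls a quarter of the way between the second and third order statistics, which reproduces the hand calculation.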
Excel can be used to compute the lower, middle, and upper quartiles. The function is
QUARTILE(array, quart) where quart is a value from 0 to 4 depending on which quartile
the user wants to compute.
Quantile Plot
Quantile plots visually portray the quantiles, or percentiles (which equal the quantiles
times 100) of the distribution of sample data. Quantiles of importance such as the median
are easily discerned (quantile, or cumulative frequency, = 0.5). They have several advantages
over other plots:
1. Arbitrary categories are not required, as they are with histograms or stem and leaf diagrams.
2. All of the data are displayed, unlike in a boxplot (which will be discussed in the next
section).
Quantile plots are sample approximations of the cumulative distribution function (cdf) of a
continuous random variable. We will study more about distribution functions in the
coming chapters.
Construction of a quantile plot
1. To construct a quantile plot, the data are ranked from smallest to largest. The smallest
data value is assigned a rank i=1, while the largest receives a rank i=n, where n is the
sample size of the data set. The data values themselves are plotted along one axis,
usually the horizontal axis.
2. On the other axis is the "plotting position", which is a function of the rank i and sample
size n (similar to the sample fraction discussed before). A number of plotting position
formulae are available (Table 6). The general formula is given as:
$$ p_i = \frac{i - a}{n + 1 - 2a} $$
In the previous example, we used a = 1 (which gives the same results as Excel's
QUARTILE function). Some of the other commonly used formulas are given below.
Reference a Formula
[Figure: quantile plot of the rainfall data. Horizontal axis: Annual rainfall (inches); vertical axis: plotting position.]
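The construction steps can be sketched as follows, reading the general plotting position formula as p_i = (i - a)/(n + 1 - 2a), which reduces to the sample fraction (i - 1)/(n - 1) when a = 1. The function name `plotting_positions` and the short data list are assumptions for illustration.

```python
def plotting_positions(values, a=0.0):
    """Pair each ranked value with p_i = (i - a) / (n + 1 - 2a)."""
    xs = sorted(values)  # rank i = 1 for the smallest, i = n for the largest
    n = len(xs)
    denom = n + 1 - 2 * a
    if denom <= 0:
        raise ValueError("n + 1 - 2a must be positive")
    return [(x, (i - a) / denom) for i, x in enumerate(xs, start=1)]


# Hypothetical small sample, used only to show the output format.
sample = [3.1, 1.3, 3.7, 2.2, 3.3, 2.7]
for value, p in plotting_positions(sample, a=1.0):
    print(value, round(p, 2))
```

Plotting the first element of each pair on the horizontal axis against the second on the vertical axis gives the quantile plot.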
When comparing two datasets, we use Quantile-Quantile (Q-Q) plots. This is useful for
determining if two data sets come from populations with a common distribution. We will
discuss this in detail in coming chapters.
Box Plot
A very useful and concise graphical display for summarizing the distribution of a data set is
the boxplot. Boxplots provide visual summaries of:
1. The center of the data (the median: the center line of the box)
2. The variation or spread (interquartile range: the box height)
3. The skewness (quartile skew: the relative size of box halves)
4. Presence or absence of unusual values (outside or far outside values, also known
as outliers)
The boxplot is a graphical representation of five values: the three quartiles and the
minimum and maximum measurements. The process of displaying the graph is illustrated
with the rainfall data. From the data, we can calculate the quartiles, the minimum and the
maximum.
Upper whisker end = min of (Q3 + 1.5 IQR or maximum value in the time series)
Lower whisker end = max of (Q1 - 1.5 IQR or minimum value in the time series)
Any data point that lies between 1.5 IQR and 3 IQR beyond the box (above Q3 or below Q1)
is considered to be an outlier. Any data point that lies more than 3 IQR from the box is
considered to be an extreme.
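The whisker and outlier rules can be sketched as below. This is only an illustration under a common convention: each whisker stops at the 1.5 IQR fence or at the data extreme, whichever is closer to the box. The function names and the short data list are hypothetical; Q1 = 7.5 and Q3 = 18.4 are the quartiles computed earlier from the frequency table.

```python
def whisker_ends(values, q1, q3):
    """Whisker ends: the 1.5*IQR fences, capped at the data minimum and maximum."""
    iqr = q3 - q1
    lower = max(q1 - 1.5 * iqr, min(values))
    upper = min(q3 + 1.5 * iqr, max(values))
    return lower, upper


def classify(value, q1, q3):
    """Label a point by its distance beyond the box, measured in IQR units."""
    iqr = q3 - q1
    distance = max(q1 - value, value - q3, 0.0)
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "outlier"
    return "inside"


# Hypothetical values for illustration only (not the rainfall record).
values = [3.2, 9.8, 12.0, 21.5, 40.0]
print(whisker_ends(values, 7.5, 18.4))
print([classify(v, 7.5, 18.4) for v in values])
```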
Excel 2016 has a box plot option among its different types of charts. However, to plot it in
earlier versions of Excel, use the instructions given below:
http://www.dummies.com/education/math/statistics/box-and-whisker-charts-for-excel/
Acknowledgement: Dr. Ramzi for his help with preparation of the notes.