2.1. Preliminaries
Measurement is the process of determining the value or label of a variable based on what has
been observed.
Quantitative variable is a variable that takes on numerical values representing an amount or quantity.
Examples of quantitative variables are height, age, and weight. There are two kinds of quantitative
variables, discrete and continuous variables.
Discrete variable is a variable which can assume a finite, or at most countably infinite, number
of values. Examples are the number of trees in the Sunken Garden and the number of car accidents
on University Avenue in one year.
Continuous variable is a variable which can assume any of the infinitely many values in an
interval. Examples are height and weight.
Remarks
Exercises 2.1.1. Determine whether the variables given below are qualitative or quantitative
variables. If quantitative, determine whether the variable is a discrete or continuous variable.
1. Nominal level of measurement is the level of measurement which only satisfies the property
that the numbers in the system are used to classify a person/object into distinct,
nonoverlapping, and exhaustive categories.
• There is no ordering of the possible values of a variable measured in the nominal scale.
• A variable measured in the nominal level is categorical/qualitative in nature, which means
that arithmetic operations on its values are meaningless.
• These variables are mainly used for identification.
• Examples of variables measured in the nominal level: Sex, Plate number, Student number
Remarks
• Variables measured in the ordinal level of measurement are still categorical in nature.
• Mathematical operations are still meaningless for values of a variable measured in the
ordinal level.
• Examples: Likert scale, Rankings, IMDB ratings, Yelp Ratings
Remarks
• The third property states that equal differences between values represent equal distances
everywhere on the scale, from the low end to the high end.
• In the interval level, there is no absolute zero. Absolute zero means that a score of zero really
means that “none exists.”
• Values of a variable measured in the interval level can now be added or subtracted. However,
multiplication and division are still meaningless mathematical operations for the interval level.
• Examples: Celsius
Remarks
• Absolute zero means that a score of zero really means that “none exists.”
• Examples: Scores in exam, Kelvin scale, Income
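The difference between the interval and ratio levels can be checked with a small computation. The sketch below compares two temperatures in Celsius (interval level) and in Kelvin (ratio level); the two temperatures, 10 and 20 degrees Celsius, are assumed values chosen only for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius reading (interval level) to Kelvin (ratio level)."""
    return celsius + 273.15

# Illustrative values only; they do not come from the notes.
cold_c, warm_c = 10.0, 20.0
cold_k, warm_k = celsius_to_kelvin(cold_c), celsius_to_kelvin(warm_c)

# Differences are meaningful on both scales and agree with each other.
diff_c = warm_c - cold_c
diff_k = warm_k - cold_k

# Ratios are only meaningful in Kelvin, because 0 K is an absolute zero.
ratio_c = warm_c / cold_c   # 2.0, yet 20 degrees C is not "twice as hot"
ratio_k = warm_k / cold_k   # roughly 1.035

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```

The Celsius ratio changes if the zero point is moved, which is why division is not meaningful at the interval level.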
3. Survey Method – uses questionnaires to ask questions about the topic at hand. When
conducting a survey, respondents are the people who answer the questions, and enumerators
are the people who ask the questions.
Main Types of surveys:
a. Self-administered questionnaires
b. personal interviews
c. online surveys
4. Focus Group Discussion – a method of collection of data wherein a moderator follows a focus
group discussion guide to direct a freewheeling discussion among a small group of people.
5. Experimental Method – a method of collecting data where there is direct human intervention
on the conditions that may affect the values of the variable of interest.
Example 2.2.1. You want to study the effect of different daily water intakes (5 ml, 10 ml, 15 ml, 20
ml) of a mongo seedling on its height in centimeters.
Response Variable: Height of the mongo seedling
Explanatory Variable: Amount of water
Factor levels: 5 ml, 10 ml, 15 ml, and 20 ml
Extraneous variables: Fertility of soil, amount of sunlight, parent of the mongo seedlings
6. Use of Documented Data – data that came from documents and published media such as
journals, books, and newspapers.
• Primary Data: data documented by the primary source
• Secondary Data: data documented by a secondary source. An individual/agency, other
than the data collectors, documented this data.
Remarks
• The experimental method is the most appropriate method of collection of data if we want to
study a cause-and-effect relationship. This is because, in experiments, there is more control over
other variables that have an effect on the value of the response variable.
• One disadvantage of the experimental method is that the results of the study might be
more unnatural if there are too many extraneous variables being controlled.
• The observation method is superior to surveys when we are dealing with non-verbal behaviors.
• One disadvantage of the observation method arises when the environment under study is difficult to
enter.
• Surveys are more appropriate for cases when we want to extract verbal information (opinions
and ideas).
• A focus group discussion is oftentimes conducted before the actual survey to construct the
actual questionnaire.
2.3. Sampling
Census/Complete Enumeration is the process of gathering information from every unit in the
population.
Sampling is the process of obtaining information from the units in the selected sample.
Terminologies in Sampling
Sampled Population is the population from where we actually select the sample.
Sampling frame or frame is a list or map showing all the sampling units in the population
Remarks
• It is ideal that the target population and sampled population are equivalent. However, they
are oftentimes not identical.
• If the sampled population is not equivalent to the target population, it is more appropriate
to apply the conclusion to the sampled population.
• There are cases wherein the sampling unit is not equivalent to the elementary unit.
Example 2.3.1. A news program wants to know whether all Filipinos are in favor of a
proposed tax increase, so they posted on their Twitter account a poll about it.
In this scenario, the target population is the set of all Filipinos. However, the sampled
population consists only of those Filipinos who have Twitter accounts and saw the poll. The
sampling unit and elementary unit are the Filipino Twitter users who answered the online
poll.
Example 2.3.2. A researcher wants to conduct a survey about social media usage of public high
school students in Manila City. He randomly selected 10 public high schools in Manila City, and all
the students in the selected high schools will be included in the study.
In this example, the target population and sampled population are the same: all public high school
students in Manila City (assuming that the sampling frame is updated). The elementary units are the
public high school students in Manila City, while the sampling units are the public high schools in
Manila City.
Sampling error is the error attributed to the variation present among the computed values of the
statistic from different possible samples consisting of n elements.
▪ Error in the implementation of the design is the error when the sampling design is not
implemented properly.
▪ Measurement error is the error when the actual value of the variable is not captured.
{𝑌1 , 𝑌2 , … , 𝑌𝑁 } : Population
{𝑦1 , 𝑦2 , … , 𝑦𝑛 } : Sample
You want to get a sample of 5 students from a class of 15 students. You want to estimate the mean
grade of the whole class in mathematics. The grades are 87, 88, 96, 75, 76, 89, 87, 89, 78, 75, 90, 92,
85, 84, and 78.
Sample 1: {y1= Y3, y2=Y12, y3=Y7, y4=Y9, y5=Y1} = {96, 92, 87, 78, 87}.
Estimated Mean = 88
Sample 2: {y1= Y2, y2=Y7, y3=Y8, y4=Y1, y5=Y15} = {88, 87, 89, 87, 78}.
Estimated Mean = 85.8
Notice that the value of the mean differs based on the sample selected. This variation from
sample to sample is the sampling error.
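The two sample means above can be checked, and compared with the mean of the whole class, using a short script. This is only a sketch: the helper name `sample_mean` is introduced here for illustration, and the positions follow the Y1 to Y15 ordering in which the grades are listed.

```python
from statistics import mean

# Grades Y1..Y15 of the class, in the order listed in the example.
grades = [87, 88, 96, 75, 76, 89, 87, 89, 78, 75, 90, 92, 85, 84, 78]

def sample_mean(population, positions):
    """Mean of the units found at the given 1-based positions."""
    return mean(population[i - 1] for i in positions)

population_mean = mean(grades)                     # mean of all 15 grades
sample_1 = sample_mean(grades, [3, 12, 7, 9, 1])   # Sample 1
sample_2 = sample_mean(grades, [2, 7, 8, 1, 15])   # Sample 2

print(population_mean, sample_1, sample_2)
```

Both estimates differ from the class mean and from each other, even though the same estimator was applied to samples of the same size.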
Types of Sampling
Probability Sampling refers to methods or sampling plans that give every element a known and
nonzero probability of being included in the sample (known as the inclusion probability).
Nonprobability Sampling refers to methods or sampling plans wherein at least one of the two
conditions of probability sampling is not satisfied.
I. Simple Random Sampling is a probability sampling method wherein all possible subsets
consisting of n elements selected from the N elements of the population have the same chances
of selection.
a. Simple Random Sampling Without Replacement (SRSWOR) – all the n elements in
the sample must be distinct from each other.
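As a rough illustration, SRSWOR can be imitated with a pseudo-random number generator instead of a table of random numbers. The function name `srswor`, the seed, and the population size of 900 numbered units are assumptions made for this sketch.

```python
import random

def srswor(population_size, sample_size, seed=None):
    """Draw distinct unit numbers from 1..population_size without replacement."""
    rng = random.Random(seed)   # seeded so that the draw can be repeated
    return rng.sample(range(1, population_size + 1), sample_size)

# Hypothetical use: 5 units out of a population numbered 1 to 900.
sample = srswor(900, 5, seed=2024)
print(sorted(sample))
```

`random.sample` never returns the same unit twice, which matches the requirement that all n elements be distinct.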
Advantages
▪ The theory involved is much easier to understand than the theory behind other sampling methods.
▪ Inferential methods are simple and easy.
Disadvantages
▪ The sample chosen may be widely spread, thus entailing higher transportation cost.
▪ A population frame, or list, is needed.
▪ Less precise estimates result if the population is heterogeneous with respect to the
characteristic under study.
Exercises 2.3.1.
1. From a population of 900 employees in a factory, we need to find a sample of size 5 using
simple random sampling without replacement. Use the table of random numbers to get the
II. Systematic Sampling is a probability sampling method wherein the selection of the first
element is at random and the selection of the other elements in the sample is systematic by
subsequently taking every kth element from the random start, where k is the sampling interval.
Steps in picking a Systematic Sample
i. Number the units of the population consecutively from 1 to N.
ii. Let k be equal to ⟦𝑁⁄𝑛⟧
iii. Select the random start 𝑟, where 1 ≤ 𝑟 ≤ 𝑘. The unit corresponding to 𝑟 is the
first unit of the sample.
iv. The other units of the sample correspond to 𝑟 + 𝑘, 𝑟 + 2𝑘, … 𝑟 + (𝑛 − 1)𝑘.
v. Variation (Circular list): Treat the list as circular, so that unit 1 follows unit N. The
random start 𝑟 is selected from 1 to N. Starting from 𝑟, count every kth element to get
the other elements of the sample.
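The steps above can be sketched in code for both the linear and the circular variation. The sketch assumes that the brackets around N/n denote the integer part of the quotient; the function name and the seed are introduced here for illustration only.

```python
import random

def systematic_sample(N, n, circular=False, seed=None):
    """Return the unit numbers (1..N) chosen by systematic sampling."""
    rng = random.Random(seed)
    k = N // n   # sampling interval, assumed to be the integer part of N/n
    if circular:
        r = rng.randint(1, N)   # random start anywhere in the list
        # Wrap around the end of the list: unit 1 follows unit N.
        return [((r - 1 + j * k) % N) + 1 for j in range(n)]
    r = rng.randint(1, k)       # random start within the first k units
    return [r + j * k for j in range(n)]

linear = systematic_sample(30, 10, seed=1)
circle = systematic_sample(30, 10, circular=True, seed=1)
print(linear)
print(circle)
```

With N = 30 and n = 10 the interval is k = 3, so the linear sample is fully determined once the random start (1, 2, or 3) is known.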
Advantages
▪ It is easier to draw the sample and often easier to execute without mistakes than SRS.
▪ It is possible to select a sample in the field without a sampling frame.
▪ The systematic sample is spread more evenly over the population.
Disadvantages
▪ If periodic regularities are found in the list, a systematic sample may consist of only
similar types.
▪ Knowledge of the structure of the population is necessary for its most effective use.
III. Stratified Random Sampling is a probability sampling method where we divide the
population into nonoverlapping subpopulations or strata, and then select one sample from
each stratum. The sample consists of all the samples in the different strata.
Steps in picking a Stratified Random Sample
i. Divide population into strata. Ideally, each stratum must consist of homogeneous
units.
ii. After the population has been stratified, a random sample is selected from each
stratum.
Advantages
▪ Stratification may produce a gain in precision in the estimates of characteristics of the
population.
▪ It allows for more comprehensive data analysis since information is provided for each stratum.
Disadvantages
▪ The stratification of the population may require additional prior information about the
population and its strata.
▪ A listing of the population for each stratum is needed.
1. Equal Allocation – you will allocate your sample size equally on each stratum.
2. Proportional Allocation – you will allocate your sample size proportional to the size of
each stratum.
Example 2.3.2. Suppose you have a population of 100 students that consists of 40 males
and 60 females. You want to select a sample of size 10 using stratified sampling.
If your allocation method is equal allocation, you will get 5 students from the 40 male
students and another 5 from the female students.
If you choose to do proportional allocation, you have to select 4 students from the 40 male
students, and 6 students from the 60 female students.
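The two allocation rules in this example reduce to simple arithmetic, sketched below. The function names are introduced here for illustration, and the sketch returns unrounded shares; how to round them when they are not whole numbers is left to the analyst.

```python
def equal_allocation(strata_sizes, n):
    """Give every stratum the same share of the sample size n."""
    share = n / len(strata_sizes)
    return {name: share for name in strata_sizes}

def proportional_allocation(strata_sizes, n):
    """Give every stratum a share of n proportional to the stratum size."""
    total = sum(strata_sizes.values())
    return {name: n * size / total for name, size in strata_sizes.items()}

strata = {"male": 40, "female": 60}   # the population of 100 students
equal = equal_allocation(strata, 10)
proportional = proportional_allocation(strata, 10)
print(equal)
print(proportional)
```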
IV. Cluster Sampling is a probability sampling method wherein we divide the population into
nonoverlapping groups or clusters consisting of one or more elements, and then select a
sample of clusters. The sample will consist of all the elements in the selected clusters.
Steps in picking a Cluster Sample
i. Decide on how to divide the population into non-overlapping clusters. Ideally,
each cluster must consist of heterogeneous units. Number the clusters from 1 to
N.
ii. Select n numbers from 1 to N using some random process. The clusters
corresponding to the selected numbers form the sample of clusters.
iii. Observe all elements in the sample of clusters.
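A minimal sketch of these steps follows, with a pseudo-random generator standing in for the random process. The cluster names and their elements are hypothetical placeholders, not data from the notes.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Select whole clusters at random and return every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # step ii
    elements = [unit for name in chosen for unit in clusters[name]]  # step iii
    return chosen, elements

# Hypothetical population already divided into nonoverlapping clusters.
clusters = {
    "A": ["a1", "a2", "a3"],
    "B": ["b1", "b2"],
    "C": ["c1", "c2", "c3", "c4"],
}
chosen, elements = cluster_sample(clusters, 1, seed=7)
print(chosen, elements)
```

Only a list of clusters is needed for the draw; the elements are listed after the clusters have been selected.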
Advantages
▪ A population list of elements is not needed; only a population list of clusters is required.
Listing cost is reduced.
▪ Transportation cost is reduced if clusters are geographic units.
Disadvantages
▪ The costs and problems of statistical analysis are greater.
▪ Estimation procedures are more difficult.
Advantages
▪ Transportation and listing costs are reduced.
Disadvantages
▪ Estimation procedure is difficult, especially when the primary stage units are not of the
same size.
▪ Estimation procedure gets more complicated as the number of sampling stages increases.
▪ The sampling procedure entails much planning before selection is done.
Suppose we wish to conduct a sample survey. The population consists of N=30 members of
Dumbledore’s Army and we wish to select a sample of size n=10 members.
04 Creevey, Dennis
06 Granger, Hermione
16 Weasley, Ginny
08 Jordan, Lee
16 Weasley, Ginny
11 Potter, Harry
15 Weasley, George
18 Abbott, Hannah
27 Goldstein, Anthony
02 Brown, Lavender
2. Use simple random sampling without replacement to select n=10. Use row 23 cols 16
to 16+m-1
25 Corner, Michael
18 Abbott, Hannah
29 Patil, Padma
14 Weasley, Fred
01 Bell, Katie
12 Spinnet, Alicia
11 Potter, Harry
13 Thomas, Dean
17 Weasley, Ron
07 Johnson, Angelina
3. Use stratified random sampling with equal allocation (SRSWOR per stratum). Use row
4 cols 4 to 4+m-1 for Gryffindor, row 7 cols 8 to 8+m-1 for Hufflepuff, row 10 cols 5 to
5+m-1 for Ravenclaw
Since we have equal allocation the number of elements per stratum is 10/3 ≈ 3.3333 ≈ 3.
For the extra single element, I will allocate it to Gryffindor, since this house has the greatest
number of Dumbledore’s Army members.
Gryffindor
13 Thomas, Dean
02 Brown, Lavender
12 Spinnet, Alicia
11 Potter, Harry
Hufflepuff
2 Bones, Susan
Ravenclaw
6 Lovegood, Luna
2 Chang, Cho
5 Goldstein, Anthony
4. Use stratified random sampling with proportional allocation (SRSWOR per stratum).
Use row 01 cols 16 to 16+m-1 for Gryffindor, row 20 cols 17 to 17+m-1 for Hufflepuff,
and row 16 cols 18 to 18+m-1 for Ravenclaw
To balance the sample size to the number of elements per stratum, I will deduct an element
from Gryffindor.
Gryffindor
01 Bell, Katie
12 Spinnet, Alicia
14 Weasley, Fred
11 Potter, Harry
13 Thomas, Dean
Hufflepuff
02 Bones, Susan
05 Smith, Zacharias
Ravenclaw
06 Lovegood, Luna
04 Edgecomb, Marietta
08 Reynolds, Maisy
5. Use Systematic Sampling with linear listing. To get the random start, use row 3 cols 16
to 16+m-1
k=30/10 = 3
Since we are considering a linear listing, r can be 1, 2, or 3.
6. Use systematic Sampling with circular listing. To get the random start, use row 2 cols
14 to 14+m-1
7. Use Cluster Sampling where the houses are the clusters (1 = Gryffindor, 2 = Hufflepuff,
3 = Ravenclaw) where only one cluster will be selected. Use row 14 cols 14 to 14+m-1
Based on the table of random numbers, the cluster selected is 1 = Gryffindor. So all the
students from Gryffindor that are members of Dumbledore’s Army are included in the
sample.
I. Convenience Sampling – the sample consists of elements that are most accessible or
easiest to contact.
Example 2.3.2.2. A researcher may use a particular district, province, or city to be the
sample cluster in representing their population of interest. For instance, the researcher
can identify a specific district of Quezon City whose households have the same profile in
terms of the socioeconomic characteristics as the households in the whole Quezon City.
III. Quota Sampling - The nonprobability sampling version of stratified sampling, wherein
you divide the population first into different groups and you select a sample from each
group.
Example 2.3.2.3. A researcher wishes to study the people’s views on birth control and
religion. Census results showed that 70% of the population are Christians, 20% are
Muslims, and 10% are non-believers. The researcher then selects a sample reflecting the
proportions to represent the three religious groups.
She added that from 2012 to 2015, the Philippine National Police (PNP) recorded a total of
27,823 cases of child in conflict with the law.
Furthermore, the PNP data showed that from 2002 to 2015, the percentage of offenses
committed by children in the total number of crimes recorded is very much negligible, Ancheta-
Templa said.
Specifically, the PNP data revealed that the percent distribution of crime committed by adults
is higher at 98 percent, while those involving children is only 2 percent.
Source: Vera-Ruiz E. (October 3, 2018), “Lowering criminal age never resulted in lower crime rates
says DSWD usec,” Manila Bulletin.
Advantages
▪ Simplest and most appropriate approach when there are only a few numbers to be presented
▪ Gives emphasis to significant figures and comparisons
Disadvantages
▪ When a large mass of quantitative data is included in a text or paragraph, the presentation
becomes almost incomprehensible.
▪ Written paragraphs can be tiresome to read especially if the same words are repeated so
many times.
Example 2.4.2. Curse Words used by P. Duterte in his speeches from June 2016 – June 2017
PI 248 741
Ga*o 72 143
Ya*a 95 32
Sh*t 41 66
Source: Bueza M. (June 27, 2017). “What were Duterte’s favorite words in his speeches?” Rappler.
Graphical Presentation - A graph or chart is a device for showing numerical values or relationships
in pictorial form.
2. Clarity - An effective chart can be easily read and understood. There should be an
unambiguous representation of the facts.
3. Appearance - An effective chart is designed and constructed to attract and hold attention by
maintaining a neat, dignified, professional appearance.
4. Simplicity - The basic design of a statistical chart should be simple, straight-forward, and not
loaded with trivial symbols or ornamentation.
1. Line Chart
When to use: Line charts are most commonly used in presenting historical data. Use it when
you want to focus on the movement of the series over time. You can also use it if you want to
compare the trends of two or more time series.
Rules: The ratio of height to width should be 2:3 or 3:4.
Types:
a. Single Line Chart
b. Multiple Line Chart
Example 2.4.3. CO2 Emissions in the World from 1960 to 2014 (metric tons per capita)
Source: Rappler
4. Pie Chart
When to use: Use it to present categorical data when the focus is to show the component parts
with respect to the total in terms of the percentage distribution.
Rules: Use pie charts when there are fewer than 6 categories. The categories should be arranged
according to magnitude. Plot the biggest slice at 12 o’clock. The “Others” section should be
plotted last.
Source: AsiaSociety.Org
The frequency distribution table is a way of summarizing data by showing the number of
observations that belong in the different categories or classes. We also refer to this as grouped data.
1. Single value grouping is a frequency distribution where the classes are distinct values of the
variable.
2. Grouping by class intervals is a frequency distribution where the classes are the intervals.
Terminologies
• Class interval is the range of values that belong in the class or category
• Class frequency is the number of observations that belong in a class interval.
• Class limits are the end numbers used to define the class interval (Upper Class Limit, Lower
Class Limit)
• Open class interval is a class interval with no lower class limit or no upper class limit.
• Closed class interval is a class interval with both lower and upper class limits.
• Class boundaries are the true class limits. Each is the midpoint between the upper class limit
of one class and the lower class limit of the next class (Upper class boundary, Lower class
boundary).
• Class size is the size of the class interval (the difference between two upper class
boundaries/limits or two lower class boundaries/limits).
• Class mark is the midpoint of a class interval.
• Relative frequency is the class frequency divided by the total number of observations.
• Relative frequency percentage is the relative frequency multiplied by 100.
• Less than cumulative frequency distribution (< CFD) shows the number of observations with
values smaller than or equal to the upper class boundary. To compute it, start from the
first class interval and add the class frequencies cumulatively up to the last class interval.
• Greater than cumulative frequency distribution (> CFD) shows the number of observations
with values larger than or equal to the lower class boundary. To compute it, start from
the last class interval and add the class frequencies cumulatively up to the first class
interval.
Example 4.5.1. Given below are final grades of Stat 101 students arranged in an array. Create a
frequency distribution table (FDT), histogram, and ogives.
50 57 63 69 72 74 77 80 82 84 87
50 59 65 69 72 75 77 80 82 84 87
50 59 66 69 72 75 77 80 82 85 88
50 60 66 69 72 75 77 81 83 85 89
50 60 68 70 73 75 78 81 83 86 89
50 60 68 71 73 75 79 81 84 86 91
51 62 68 71 73 76 79 81 84 87 92
52 62 68 71 73 76 79 82 84 87 92
53 62 68 71 74 76 79 82 84 87 94
53 62 69 72 74 76 79 82 84 87 96
𝐾 = 1 + 3.322 log(𝑛)
𝐾 = 1 + 3.322 log(110) = 7.7815
𝑅𝑎𝑛𝑔𝑒 = ℎ𝑖𝑔ℎ𝑒𝑠𝑡 𝑜𝑏𝑠 − 𝑙𝑜𝑤𝑒𝑠𝑡 𝑜𝑏𝑠 = 96 − 50 = 46
𝐶′ = 𝑅𝑎𝑛𝑔𝑒 ⁄ 𝐾 = 46 ⁄ 7.7815 = 5.9115
Round off C′ to the same number of decimal places as the original dataset, call the result C, and
use C as the class size. Here, 𝐶 = 6.
In this example, we will choose the lowest class limit to be the lowest observation. Therefore, the
lowest class limit is 50.
Class Limit | Class Frequency | Class Boundary | Class Mark | Relative Frequency | Relative Frequency Percentage | <CFD | >CFD
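The computations in this example can be reproduced with a short script. It is a sketch under stated assumptions: the logarithm is taken to base 10, the data are whole numbers (so each upper class limit is the lower limit plus C minus 1 and the boundaries lie 0.5 beyond the limits), and the function name `frequency_table` is introduced here only for illustration.

```python
import math

# Stat 101 final grades from the example, read row by row.
grades = [
    50, 57, 63, 69, 72, 74, 77, 80, 82, 84, 87,
    50, 59, 65, 69, 72, 75, 77, 80, 82, 84, 87,
    50, 59, 66, 69, 72, 75, 77, 80, 82, 85, 88,
    50, 60, 66, 69, 72, 75, 77, 81, 83, 85, 89,
    50, 60, 68, 70, 73, 75, 78, 81, 83, 86, 89,
    50, 60, 68, 71, 73, 75, 79, 81, 84, 86, 91,
    51, 62, 68, 71, 73, 76, 79, 81, 84, 87, 92,
    52, 62, 68, 71, 73, 76, 79, 82, 84, 87, 92,
    53, 62, 68, 71, 74, 76, 79, 82, 84, 87, 94,
    53, 62, 69, 72, 74, 76, 79, 82, 84, 87, 96,
]

def frequency_table(data, class_size):
    """Build the FDT rows, starting at the lowest observation."""
    n = len(data)
    rows, lower, running = [], min(data), 0
    while lower <= max(data):
        upper = lower + class_size - 1          # assumes whole-number data
        freq = sum(lower <= x <= upper for x in data)
        rows.append({
            "limits": (lower, upper),
            "freq": freq,
            "boundaries": (lower - 0.5, upper + 0.5),
            "mark": (lower + upper) / 2,
            "rel_freq": round(freq / n, 4),
            "less_cfd": running + freq,         # observations up to this class
            "greater_cfd": n - running,         # observations from this class on
        })
        running += freq
        lower += class_size
    return rows

k = 1 + 3.322 * math.log10(len(grades))         # Sturges' guide
data_range = max(grades) - min(grades)
class_size = round(data_range / k)
table = frequency_table(grades, class_size)
for row in table:
    print(row)
```

Starting at 50 with a class size of 6, the script produces eight class intervals, from 50 to 55 up to 92 to 97.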
1. Frequency Histogram
It shows the distribution of the observed values. It plots the class boundaries on the horizontal
axis and the class frequencies on the vertical axis. We represent each class frequency by a vertical bar
whose height is equal to the frequency of the class interval. The area under the histogram is
proportional to the total number of observations. If the histogram can be divided vertically into
two halves such that one half is a mirror image of the other half, then the distribution of the
dataset is symmetric; otherwise, it is skewed.
3. Ogives
It is the plot of the cumulative frequency distribution. The less-than ogive is the plot of the <CFD
against the upper class boundaries. The greater-than ogive is the plot of the >CFD against the
lower class boundaries. When the two are superimposed, the horizontal coordinate of their
intersection gives the median of the data. If the less-than ogive looks like an S-shape, and the
greater-than ogive looks like an inverted S-shape, then the distribution of the dataset looks like a
symmetric bell-shaped curve.
Example 4.5.2. Histogram and ogives of the Stat 101 grades dataset
The difference between an ordinary histogram and the stem-and-leaf display (SALD) is that with the
SALD we do not lose any information on the actual data values, for we are using the data values
themselves to create a pictorial form of the distribution of the data.
In constructing the SALD, we need to split each observation into two parts and refer to them as stem
and leaf.
Example: We have 356 as one of the observations. We can separate the digits this way:
Stem Leaf
3 | 56
1. Choose the common division point of the observations where we will split each data value into its
stem and leaf components.
2. In a vertical column, list the smallest stem value up to the largest stem value.
3. Draw a vertical line to the right of the stem value.
4. Record the leaf portion of the first observation in the row corresponding to its stem value.
Do the same for all of the observations
5. Sort the leaves within each stem row from lowest to highest. Maintain spacing in between
the leaves.
6. Indicate the unit of the leaves to allow the recreation of the actual data values from the
display. For example, suppose we have 35 | 6:
Unit = 0.1 represents 356 x 0.1 = 35.6
Unit = 1 represents 356 x 1 = 356
Unit = 10 represents 356 x 10 = 3560
Example 4.5.3. Listed are the typing speed (net words per minute) for 20 secretarial
applicants:
68, 52, 65, 58, 46, 72, 75, 35, 61, 55, 91, 63, 84, 69, 66, 47, 55, 45, 32, 71. Create a Stem and
Leaf Display.
In this example, we will choose the common division point to be between the tens and ones
place. Example: 3 | 2
3 | 25
4 | 567
5 | 2558
6 | 135689
7 | 125
8 | 4
9 | 1
Unit = 1
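The display above can be reproduced programmatically. The sketch below assumes two-digit whole numbers split between the tens and ones places, as in this example; the function name `stem_and_leaf` is introduced here for illustration.

```python
from collections import defaultdict

# Typing speeds (net words per minute) of the 20 applicants.
speeds = [68, 52, 65, 58, 46, 72, 75, 35, 61, 55,
          91, 63, 84, 69, 66, 47, 55, 45, 32, 71]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted leaves (ones digits)."""
    stems = defaultdict(list)
    for value in sorted(data):              # sorting first keeps leaves ordered
        stems[value // 10].append(value % 10)
    # List every stem from the smallest to the largest, even if it has no leaves.
    return {stem: stems.get(stem, [])
            for stem in range(min(stems), max(stems) + 1)}

display = stem_and_leaf(speeds)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
print("Unit = 1")
```

Because every leaf is an actual digit of an observation, the original values can be read back from the display, which is not possible with a histogram.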
The following data shows the price of 30 different types of notebooks (Elem. Stat pg. 183 #4)
38.12 64.76 34.29 44.37 56.29
63.10 45.45 66.25 58.96 38.95
39.42 55.50 46.35 38.78 66.91
53.50 43.15 67.29 45.26 44.32
34.07 45.05 44.13 67.69 45.65
66.55 62.13 36.87 45.77 43.89
I. Given the frequency distribution of weights in pounds of 50 male college students, fill in the
blanks. (Elementary Statistics page 184 Exercise #6)
1. Construct a frequency distribution table with the following parts: Class limits, Class
frequency, Class boundaries, Class marks, Relative frequency, <CFD, and >CFD. Take note of
the following instructions:
a. Strictly follow Sturges’ guide in making a frequency distribution table. Show all solutions
for the number of classes, the range and the class size.
b. Use the minimum value of your raw data as the lower class limit of the first class interval.
c. Follow the natural rule in rounding off numbers.
d. The relative frequency should be rounded off up to 4 decimal places.
2. Construct a frequency histogram. Interpret.
3. Construct a less-than ogive and a greater-than ogive plotted on the same graph. Interpret.
4. Construct an ordinary stem and leaf display. Interpret.