In this module, you will learn how to construct statistical tables and graphs to present
collected data in a more meaningful and visual manner. Most of these can be done using
Microsoft Excel. However, we focus on the use of the R software in producing these graphs
or charts.
After the sampling and data collection process, what results is data in its raw format, which
is often difficult to understand as is. The next step would now be to summarize and organize
these using textual, tabular or graphical forms in order for the researcher or author to be
able to impart useful information to the readers. In preparing texts, tables or graphs, we
must always be mindful of what information the data are conveying, and what must be
done to include more useful information. Planning how the data will be presented is
essential before appropriately processing raw data.
Data Visualization is a term to describe the use of graphical displays to summarize and
present information about a data set. Data become more comprehensible and more
useful when they are organized and presented using graphs, frequency distribution tables,
charts, diagrams and the like to derive logical solutions and conclusions.
Data obtained from a single variable can be summarized and presented in many ways. A
frequency distribution table, a bar chart and a pie chart can be used to present
qualitative data. Quantitative data, on the other hand, can be summarized using a dot
plot, a stem-and-leaf display, a frequency distribution table, and a histogram. Let us look at
each of these methods more closely.
A frequency distribution is a table that shows how often each value (or set of values) of the
variable in question occurs in a data set. It is used to summarize categorical (qualitative) or
numerical (quantitative) data. Simply put, it is a tabular summary of data showing the
number or frequency of observations in each of several non-overlapping categories or
classes.
Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 53
The relative frequency of a class equals the fraction or proportion of the observations
belonging to that class or category. Thus, the relative frequency can be computed using
Relative frequency of a class = (frequency of the class) / n
where n is the total number of observations.
A relative frequency distribution gives a tabular summary of data showing the relative
frequency for each class. If the relative frequency is multiplied by 100, we get the percent
frequency of a class. A percent frequency distribution summarizes the percent frequency of
the data for each class.
Example 1:
The raw data in the table below show fifty soft drink purchases. Little information can be
drawn from the data in its current form, so it is best to consider other ways to present
the data. Let us construct a frequency distribution table for the sample.
The frequency distribution table for this data set can be constructed manually or by using
the PivotTable feature of Microsoft Excel. With some editing, the following are the
frequency, relative frequency and percent frequency tables generated:
Soft Drink Type     Frequency
Coke Classic        19
Diet Coke           8
Dr. Pepper          5
Pepsi               13
Sprite              5
Total               50
Table 1. Frequency Distribution Table for Soft Drink Purchases

Soft Drink Type     Relative Frequency
Coke Classic        0.38
Diet Coke           0.16
Dr. Pepper          0.10
Pepsi               0.26
Sprite              0.10
Total               1.00
Table 2. Relative Frequency Distribution Table for Soft Drink Purchases
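The relative frequencies can be checked against the definition given earlier. The following is a minimal sketch in base R that starts from the five frequencies of the soft drink example instead of the raw data; the object names (freq, rel.freq, pct.freq) are illustrative and are not part of the module's scripts.

```r
# Frequencies of the soft drink purchases (Example 1)
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

n <- sum(freq)              # total number of observations
rel.freq <- freq / n        # relative frequency = class frequency / n
pct.freq <- 100 * rel.freq  # percent frequency = relative frequency x 100

# Combine the three summaries by columns and print them
print(cbind(Frequency = freq,
            "Relative Frequency" = rel.freq,
            "Percent Frequency" = pct.freq))
```

The relative frequencies should add up to 1 and the percent frequencies to 100, which is a quick way to catch a tabulation error.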
Using RStudio, on the other hand, the task can be completed by running the following R
code in the Console window. We will use the “purchase.csv” file in our working directory.
R Script
# Construct a frequency distribution table for qualitative data
# Load the packages and import the data from the working directory
library(readr)
library(pander)
purchase <- read_csv("purchase.csv")
# Get frequencies: table() tabulates each category of the variable
# together with its frequency
data.freq <- table(purchase)
# To produce the output as a column: cbind() combines vectors,
# matrices, or data frames by columns
freq.dist <- cbind(Frequency = data.freq)
# RStudio output
pander(freq.dist)
Frequency
Coke Classic 19
Diet Coke 8
Dr. Pepper 5
Pepsi 13
Sprite 5
The same R code or script can also be written in the Source window or pane if you want to
keep a copy of the scripts you write in RStudio. First, we create a new R script file by
clicking on the File menu, then click on New File and select R Script. The same result can be
obtained by using the hot keys Ctrl+Shift+N.
Write the R code on the Source window. You should be able to have something similar to
Figure 11.
Save the R script file. R script files are named with an .R extension. Click on the save icon on
the Source window and browse to your set working directory. Name the file as purchase.R.
After saving the file, execute the script by highlighting all the lines in the Source window
and then clicking on the 'Run' icon on the upper right part of the Source window. As an
alternative to the 'Run' icon, you can press Ctrl+Enter to run the script. Take note of
this shortcut.
Figure 11. R script for the frequency distribution table for the soft drink purchase data.
For the relative frequency table, we can run the following R script.
R Script
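# R script for the relative frequency distribution table
# (reconstructed step: uses data.freq from the previous script)
relfreq <- data.freq / sum(data.freq)
relfreq.dist <- cbind("Relative Frequency" = relfreq)
# RStudio output
pander(relfreq.dist)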
Relative Frequency
Coke Classic 0.38
Diet Coke 0.16
Dr. Pepper 0.1
Pepsi 0.26
Sprite 0.1
Note that since the dataset was already imported in RStudio from the previous R script,
there is no need to import the data again. Also, since the packages were already installed
and loaded from the previous R script, there is no need to repeat these commands.
Example 2:
A survey was taken in Aurora Avenue. In each of 20 homes, people were asked how many
cars were registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Table 4 shows the frequency, relative frequency and percent frequency for the data in just
one table. Note that in practice, it is customary to only include one such type of
frequency.
In this example, the frequency table constructed is for ungrouped data, which means that
the individual values do not lose their identity in the table.
Doing this in RStudio, let us consider a different approach by instead constructing a vector
representing the data values. Open a new R script file, then enter and run the following script.
R Script
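# Create a vector for the given data.
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
# Tabulate, then combine the three kinds of frequency by columns
# (reconstructed step; pander must already be loaded)
freq <- table(cars)
freq.dist <- cbind(Frequency = freq,
                   "Relative Frequency" = freq / length(cars),
                   "Percent Frequency" = 100 * freq / length(cars))
# RStudio output
pander(freq.dist)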
Frequency Relative Frequency Percent Frequency
0 4 0.2 20
1 6 0.3 30
2 5 0.25 25
3 3 0.15 15
4 2 0.1 10
Example 3:
Consider the following data set on the monthly rent ($) for a sample of 70 one-bedroom
apartments:
425 430 430 435 435 435 435 435 440 440 440 440 440 445 445
445 445 445 450 450 450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480 480 485 490 490 490
500 500 500 500 510 510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
A frequency table with 8 class intervals for this sample is shown below. In this case, the
values are grouped together in each class, and the individual values are no longer visible.
To create the Grouped Frequency Distribution Table using R, we consider the following R
script and we make use of the rent.csv file in our data repository or working directory.
R Script
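# Creating a Grouped Frequency Distribution Table
# Load the readr and pander packages in RStudio
library(readr)
library(pander)
# Import the data and group the values into 8 classes of width 25
# (reconstructed step; assumes the rents are in the first column of rent.csv)
rent <- read_csv("rent.csv")
classes <- cut(rent[[1]], breaks = seq(425, 625, by = 25), right = FALSE)
freq.dist <- cbind(Frequency = table(classes))
# RStudio output
pander(freq.dist)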
Frequency
[425,450) 18
[450,475) 16
[475,500) 11
[500,525) 7
[525,550) 5
[550,575) 3
[575,600) 4
[600,625) 6
In the output, a bracket on the left endpoint means that the value is included in the class
interval, while a parenthesis on the right endpoint means the value is not included in the
interval. For example, [525, 550) contains rents from 525 up to but not including 550, which
for whole-dollar rents is the class interval 525-549.
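To see how values on a class boundary are assigned, the short sketch below classifies three rents near the boundary 550 using the base R function cut() with right = FALSE. The object names are illustrative, and the three values are picked from the data set only to show the boundary behaviour.

```r
# Class boundaries 425, 450, ..., 625 give the 8 class intervals
breaks <- seq(425, 625, by = 25)

# Three rents from the data set, chosen near a class boundary
rents <- c(525, 549, 550)

# right = FALSE closes each interval on the left: [a, b)
classes <- cut(rents, breaks = breaks, right = FALSE)
print(data.frame(rents, classes))
```

Both 525 and 549 fall in [525,550), while 550 starts the next class, [550,575).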
BAR GRAPH
A bar graph is a chart used to display qualitative data summarized in a frequency, relative
frequency, or percent frequency distribution.
For a vertical bar chart, the horizontal (x) axis represents the categories; the vertical (y) axis
represents a value (frequency, relative frequency, or percent frequency) for those
categories. In the graph below, the values are frequencies.
The figure below shows the bar chart of the data on soft drink purchases of Example 1.
R Script
To construct the bar chart using RStudio, we use the ggplot function. Using the
“purchase.csv” data, open a new R script file, enter and run the following script:
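# To order the bars by decreasing frequency
# (needs ggplot2, plus dplyr for mutate() and forcats for fct_infreq())
# The + must end the first line so that R continues the expression
bar2 <- ggplot(mutate(purchase, Purchase = fct_infreq(Purchase))) +
  geom_bar(aes(x = Purchase))
bar2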
Note that you need not assign the bar graphs to the objects bar1 and bar2. Removing
these assignments from the script generates the bar charts right away. Also, the bars will
be shown in the Plots window of RStudio, where you have the options to “Save as Image”,
“Save as PDF”, or “Copy to Clipboard” once you click on the “Export” icon of the Plots
window.
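For comparison, a bar chart of the same data can also be drawn without any add-on packages. The sketch below is a base R alternative and not the module's ggplot script; it starts from the frequencies of Table 1 instead of reading purchase.csv, and the object names are illustrative.

```r
# Frequencies from Table 1 (soft drink purchases)
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Order the categories by decreasing frequency before plotting
freq.sorted <- sort(freq, decreasing = TRUE)

# barplot() draws one bar per category; bar heights are the frequencies
barplot(freq.sorted,
        xlab = "Soft Drink Type", ylab = "Frequency",
        main = "Soft Drink Purchases")
```

Passing freq instead of freq.sorted keeps the categories in their original order.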
PIE CHART
A pie chart (also called a pie graph or circle graph) provides another graphical device for
presenting relative frequency and percent frequency distributions for qualitative data. The
numerical values shown for each sector can be frequencies, relative frequencies, or
percent frequencies.
A pie chart makes use of sectors (slices) in a circle. The angle of a sector is proportional to
the frequency of each of the categories of the variable that defines the data. The formula
to determine the angle of a sector in a circle graph is:
Angle of sector = (frequency of category / total frequency) × 360°
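As a worked illustration of the formula, the sketch below computes the sector angles for the soft drink frequencies of Table 1 and then draws a simple pie chart with the base R function pie(). It is only a sketch: it starts from the tabulated frequencies, and the object names are illustrative.

```r
# Frequencies from Table 1 (soft drink purchases)
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Angle of sector = (frequency of category / total frequency) x 360 degrees
angle <- freq / sum(freq) * 360
print(angle)

# pie() sizes each sector in proportion to its frequency
pie(freq, labels = paste0(names(freq), " (", freq, ")"),
    main = "Soft Drink Purchases")
```

For instance, Coke Classic takes up (19 / 50) × 360° = 136.8°, and the five angles add up to 360°.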
The figure below shows the pie chart of the data on soft drink purchases of Example 1
generated using Microsoft Excel.
R Script
Starting from the raw data, the following script creates a simple pie chart in RStudio.
We use the “purchase.csv” file for the same example.
# Label the categories
lbls <- c("Coke Classic", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite")
DOT PLOT
A dot plot is a graphical display of data using dots. It is similar to a bar graph because the
height of each “bar” of dots is equal to the number of items in a particular category. To
draw a dot plot, count the number of data points falling in each category and draw a
stack of that many dots for the category. A dot plot can be used as a graphical
display of the frequency of qualitative and quantitative (ungrouped) data.
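The counting-and-stacking procedure can be sketched in base R with the car registration data of Example 2. This uses stripchart() as a package-free alternative to the ggplot approach of this module, and the object names are illustrative.

```r
# Number of cars registered to each of the 20 households (Example 2)
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Count the data points at each value; these counts are the stack heights
heights <- table(cars)
print(heights)

# method = "stack" piles repeated values on top of one another
stripchart(cars, method = "stack", pch = 16, offset = 0.5,
           xlab = "Number of cars",
           main = "Dot Plot of Cars per Household")
```

The tallest stack sits above the value 1, which occurs six times in the data.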
The figure that follows shows the dot plot for the data of Example 2 on the number of cars
registered to each household:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
R Script
Here we present two ways by which a dot plot is constructed. First is by importing a .csv
data file from MS Excel, which is very useful especially if we have a large data set, and the
other way is by constructing the data vector in the RStudio environment. This is applicable if
we would be dealing with a small set of data. The following are the scripts. For the first
method, we use the “cars.csv” data from our directory.
# Create the vector given the data (for a small data set)
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
ID <- 1:20  # Generates a sequence of integers from 1 to 20.
data <- data.frame(ID, cars)
str(data)  # Shows the data frame with 2 variables
# RStudio output
'data.frame': 20 obs. of 2 variables:
 $ ID  : int 1 2 3 4 5 6 7 8 9 10 ...
 $ cars: num 1 2 1 0 3 4 0 1 1 1 ...
# Draw the dot plot (ggplot2 must already be loaded)
ggplot(data, aes(cars)) + geom_dotplot(binwidth = 0.3)
Notice the difference in dot sizes with different binwidths. You can further explore RStudio
functionality by varying the values of “arguments” in the syntax.
STEM-AND-LEAF PLOT
A stem-and-leaf plot is a graphical display for quantitative data that shows both the rank
order and shape of a data set. It is particularly useful when data are not too numerous.
Stem-and-leaf plots are a method for showing the frequency with which certain classes of
values occur.
Example 1:
The following illustration and steps are taken from the website:
https://study.com/academy/lesson/how-to-make-a-stem-and-leaf-plot.html
The process will be easiest to follow with sample data, so let's pretend that a sports
statistician wants to make a stem-and-leaf plot for a recent game played by the Blues
basketball team. The total minutes played by each team member has been recorded and
shown below:
Blues Member Name Minutes Played
Gifford 22
Slavky 29
Harrison 22
Samon 31
Mantry 20
Lewing 12
Wilson 14
Larriby 24
Paston 13
Lebling 4
Waster 2
Canno 1
Step 1: Determine the smallest and largest number in the data.
Looking at the stats, we see the number of minutes played ranges from a low of 1
minute to a high of 31 minutes.
Step 2: Identify the stems.
Since the values range from 1 to 31, the stems are the tens digits 0, 1, 2, and 3.
Step 3: Draw a vertical line and list the stem numbers to the left of the line.
0|
1|
2|
3|
Step 4: Fill in the leaves.
Write the ones digit of each value to the right of its stem, then sort the leaves in
each row from smallest to largest.
0|1 2 4
1|2 3 4
2|0 2 2 4 9
3|1
And that's the stem-and-leaf plot for minutes played.
The place value of the leaf is called the leaf unit. In the example above, the leaf unit is 1.
Other leaf units may be 100, 10, 0.1, and so on. If the leaf unit is not 1, it should be displayed
in the stem-and-leaf plot.
R Script
For the same example, the stem and leaf plot can be generated in RStudio by using the
stem() function. The script is very short. Try this out in RStudio.
0 | 124
1 | 234
2 | 02249
3 | 1
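A display like the one above can be obtained with a script along the following lines. This is a sketch, not the module's script: the vector name minutes is illustrative, and the manual split into stems and leaves is included only to mirror the hand-drawn steps.

```r
# Minutes played by each Blues team member (from the table above)
minutes <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)

# Stems are the tens digits and leaves are the ones digits (leaf unit = 1)
stems  <- minutes %/% 10
leaves <- minutes %% 10

# Group the sorted leaves by stem to mirror the hand-drawn plot
by.stem <- tapply(leaves, stems, function(l) paste(sort(l), collapse = " "))
print(by.stem)

# Base R's stem() prints the stem-and-leaf display directly
stem(minutes)
```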
Example 2:
The stem-and-leaf plot for the data set
8.6 11.7 9.4 9.1 10.2 11.0 8.8
with leaf unit 0.1 is given by
This means that in reading the data from the stem-and-leaf plot, the stems are digits in the
units place while the leaves are the digits in tenths place (first decimal place).
R Script
8 | 68
9 | 14
10 | 2
11 | 07
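When the leaf unit is 0.1, the same splitting works once each value is expressed in tenths. The following sketch illustrates this for the data set of this example; the object names are illustrative, and round() is used only to guard against floating-point error after multiplying by 10.

```r
# Data set from Example 2
x <- c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)

# With leaf unit 0.1, express each value in tenths before splitting it
tenths <- round(x * 10)
stems  <- tenths %/% 10   # whole-number part of each value
leaves <- tenths %% 10    # digit in the tenths place

# Sorted leaves for each stem
by.stem <- tapply(leaves, stems, function(l) paste(sort(l), collapse = ""))
print(by.stem)

# stem() chooses the leaf unit automatically and reports it in its header
stem(x)
```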
Example 3:
Let us now consider a data frame for this example. In MS Excel, open the data file
“inflation.csv”. The data shows the Inflation rate (in %) of countries in Asia and the Pacific.
Upon inspection of the variables, you would notice that there is only one quantitative
variable which is the inflation rate, labeled “Inflation”. We now create a stem-and-leaf
display for this variable.
R Script
-0 | 552
0 | 4588349
2 | 0112444668990778
4 | 1239047889
6 | 446688934
8 | 894499
10 | 7
HISTOGRAM
The histogram corresponding to the frequency distribution table for the data on monthly
rent ($) for a sample of 70 one-bedroom apartments in Example 3 is shown below:
R Script
To plot the histogram for the same example, again we use the “rent.csv” file.
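As a self-contained sketch, the histogram can also be drawn with the base R function hist() by typing in the 70 rents listed in Example 3 instead of reading rent.csv. The object names are illustrative; the class boundaries and right = FALSE are chosen to match the grouped frequency table, so each bar covers an interval of the form [a, b).

```r
# Monthly rents ($) of the 70 one-bedroom apartments (Example 3)
rent <- c(425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445,
          445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
          465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490,
          500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
          575, 575, 580, 590, 600, 600, 600, 600, 615, 615)

# Same 8 classes of width 25 as in the grouped frequency table
breaks <- seq(425, 625, by = 25)

# hist() draws the chart and returns the class counts it used
h <- hist(rent, breaks = breaks, right = FALSE,
          xlab = "Monthly Rent ($)", ylab = "Frequency",
          main = "Monthly Rent of 70 One-Bedroom Apartments")
print(h$counts)
```

The counts returned in h$counts should agree with the frequencies in the grouped frequency distribution table.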
Summarizing Qualitative and Quantitative Data for Two Variables
Tabular and graphical displays for data obtained from two variables are helpful in
understanding the relationship between them, if any. In this section we will discuss
the crosstabulation or contingency table and the scatter diagram.
CROSSTABULATION
A crosstabulation or contingency table is a tabular summary of data for two variables. The
variables can both be qualitative or both quantitative, or can be a combination of one
qualitative and one quantitative variable. If either variable is quantitative, classes must be
created for the values of the quantitative variable. The labels shown in the margins of the
table define the categories (classes) for the two variables.
Example:
For an example, we consider the “salaries.csv” file which contains data on professors of a
university, including rank, discipline being taught, years since PhD was obtained, years of
service in the university, sex, and annual salary ($). We construct a crosstabulation of the
rank and sex of the teachers. Using RStudio, we can generate the crosstabulation shown in
Table 6.
R Script
• cross_table:
Female Male Total
AssocProf 10 54 64
AsstProf 11 56 67
Prof 18 248 266
Total 39 358 397
Table 6. Crosstabulation of rank and sex of teachers.
• proportions:
Female Male Total
AssocProf 0.1562 0.8438 1
AsstProf 0.1642 0.8358 1
Prof 0.06767 0.9323 1
Total 0.09824 0.9018 1
Table 7. Proportions of crosstabulation of rank and sex of teachers
From the crosstabulation, we can see that the majority of the teachers hold the rank of
'Professor'. Males outnumber females in every rank, and male professors make up the
largest group. This could not have been easily observed by just looking at the raw data.
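The two tables can be reproduced with base R functions. The sketch below does not read salaries.csv; as a stand-in it rebuilds one row per teacher from the counts in Table 6, and the column names rank and sex are assumptions made for illustration. It uses table() for the counts, addmargins() for the totals, and prop.table() for the row proportions.

```r
# Rebuild one row per teacher from the counts in Table 6
# (stand-in for salaries.csv; the column names rank and sex are assumed)
salaries <- data.frame(
  rank = rep(c("AssocProf", "AsstProf", "Prof"), times = c(64, 67, 266)),
  sex  = c(rep(c("Female", "Male"), times = c(10, 54)),
           rep(c("Female", "Male"), times = c(11, 56)),
           rep(c("Female", "Male"), times = c(18, 248)))
)

# Two-way frequency table, then row and column totals
counts <- table(salaries$rank, salaries$sex)
cross_table <- addmargins(counts)

# Row proportions: add the total row, divide each row by its row sum,
# then append the row totals (all equal to 1)
proportions <- addmargins(prop.table(addmargins(counts, 1), 1), 2)

print(cross_table)
print(round(proportions, 4))
```

Note that addmargins() labels the totals "Sum" where the tables above show "Total".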
SCATTER DIAGRAM/PLOT
A scatter diagram or scatter plot is a graphical display of the relationship between two
quantitative variables. One variable (independent variable) is shown on the horizontal axis
and the other variable (dependent variable) is shown on the vertical axis. The general
pattern of the plotted points suggests the overall relationship between the variables. This
relationship will be discussed more in Module 11 (Correlation and Regression).
Example:
Consider the advertising/sales relationship for a stereo and sound equipment store. On 10
occasions during the past three months, the store used weekend television commercials to
promote sales at its stores. The managers want to investigate whether a relationship exists
between the number of commercials shown and the sales at the store during the following
week. Sample data for the 10 weeks with sales in hundreds of dollars are shown in the
table. The figure that follows is a scatter diagram for the data.
R Script
Here we present two scripts for generating the scatter plot for the same problem. The
example data is contained in the “advertising.csv” data file.
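# Plot the chart
# (comm and sales must already hold the number of commercials and the
# sales for the 10 weeks)
plot(x = comm, y = sales,
     xlab = "Number of Commercials", ylab = "Sales (in hundreds of dollars)",
     main = "Number of Commercials vs Sales",
     xlim = c(0, 6), ylim = c(0, 70))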
Use RStudio to construct the tabular and graphical displays required for each problem.
Submit a single .docx file containing the output of R for each problem and submit also the
saved RStudio script.
Please use the following convention for the filename: LRA5-1<LASTNAME>.docx [Example:
LRA5-1MIRANDA.docx] and for the R script, LRA5-1<LASTNAME>.R.
1. According to Kantar Media (March 13, 2020), the top four primetime television
shows in the Philippines were Ang Probinsyano (Prob), Make It With You (MIWY),
Prima Donnas (PD), and Descendants of the Sun Philippine Adaptation (DS). Data
indicating the preferred shows for a sample of 50 viewers follow. (15 points)
2. The data below shows the time in days required to complete year-end audits for a
sample of 20 clients of Sanderson and Clifford, a small public accounting firm.
Construct a dot plot for the sample. (5 points)
3. Use the file salaries.csv to construct a crosstabulation of the following pairs of
variables: (15 points)
a. Rank (row variable) vs. Discipline (column variable)
b. Rank (row variable) vs. Years of Service (column variable, grouped by 10s)
c. Rank (row variable) vs. Salary (column variable, grouped by $25000s)