
MODULE 5: DATA PRESENTATION

(For SEPTEMBER 18-21)

Learning Outcome: Create appropriate tabular and graphical displays using R
to present and summarize data in a meaningful manner.

In this module, you will learn how to construct statistical tables and graphs to present
collected data in a more meaningful and visual manner. Most of these can be done using
Microsoft Excel. However, we focus on the use of the R software in producing these graphs
or charts.

Data Presentation and Visualization

After the sampling and data collection process, what results is data in its raw format, which
is often difficult to understand as is. The next step would now be to summarize and organize
these using textual, tabular or graphical forms in order for the researcher or author to be
able to impart useful information to the readers. In preparing texts, tables or graphs, we
must always be mindful of what information the data are conveying, and what must be
done to include more useful information. Planning how the data will be presented is
essential before appropriately processing raw data.

Data Visualization is a term to describe the use of graphical displays to summarize and
present information about a data set. Data become more comprehensible and more
useful when they are organized and presented using graphs, frequency distribution tables,
charts, diagrams and the like to derive logical solutions and conclusions.

Summarizing Qualitative and Quantitative Data for a Single Variable

Data obtained from a single variable can be summarized and presented in many ways. A
frequency distribution table, a bar chart and a pie chart can be used to present
qualitative data. Quantitative data, on the other hand, can be summarized using a dot
plot, a stem-and-leaf display, a frequency distribution table, and a histogram. Let us look at
each of these methods more closely.

FREQUENCY DISTRIBUTION TABLE

A frequency distribution is a table that shows how often each value (or set of values) of the
variable in question occurs in a data set. It is used to summarize categorical (qualitative) or
numerical (quantitative) data. Simply put, it is a tabular summary of data showing the
number or frequency of observations in each of several non-overlapping categories or
classes.

Property of and for the exclusive use of SLU. Reproduction, storing in a retrieval system, distributing, uploading or posting online, or transmitting in any form or by any
means, electronic, mechanical, photocopying, recording, or otherwise of any part of this document, without the prior written permission of SLU, is strictly prohibited. 53
The relative frequency of a class equals the fraction or proportion of the observations
belonging to a class or category. Thus, the relative frequency can be computed using

\[
\text{relative frequency of a class} = \frac{\text{frequency of the class}}{\text{total number of observations}}
\]

A relative frequency distribution gives a tabular summary of data showing the relative
frequency for each class. If the relative frequency is multiplied by 100, we get the percent
frequency of a class. A percent frequency distribution summarizes the percent frequency of
the data for each class.
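
As a quick illustration of these definitions, the sketch below tabulates a small set of made-up survey responses and converts the counts into relative and percent frequencies. The responses vector and its values are hypothetical and are not taken from this module.

```r
# Hypothetical responses, made up only to illustrate the definitions
responses <- c("Yes", "No", "Yes", "Yes", "No",
               "Undecided", "Yes", "No", "Yes", "Yes")

# Frequency of each category
freq <- table(responses)

# Relative frequency = frequency of the class / total number of observations
rel.freq <- freq / sum(freq)

# Percent frequency = relative frequency multiplied by 100
pct.freq <- rel.freq * 100

# Show the three summaries side by side
print(cbind(Frequency = freq,
            "Relative Frequency" = rel.freq,
            "Percent Frequency" = pct.freq))
```

The relative frequencies always add up to 1 and the percent frequencies to 100, which is a useful check on any table built this way.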

Example 1:
The raw data in the table below shows fifty soft drink purchases. Notice that there is not so
much information that we can get from the data in its current form so it is best to consider
other ways to present the data. Let us construct a frequency distribution table for the
sample.

The frequency distribution table for this data set can be constructed manually or by using
the PivotTable feature of Microsoft Excel. With some editing, the following are the
frequency, relative frequency and percent frequency tables generated:

Soft Drink Type    Frequency
Coke Classic       19
Diet Coke          8
Dr. Pepper         5
Pepsi              13
Sprite             5
Total              50
Table 1. Frequency Distribution Table for Soft Drink Purchases

Soft Drink Type    Relative Frequency
Coke Classic       0.38
Diet Coke          0.16
Dr. Pepper         0.10
Pepsi              0.26
Sprite             0.10
Total              1.00
Table 2. Relative Frequency Distribution Table for Soft Drink Purchases

Soft Drink Type    Percent Frequency
Coke Classic       38%
Diet Coke          16%
Dr. Pepper         10%
Pepsi              26%
Sprite             10%
Total              100%
Table 3. Percent Frequency Distribution Table for Soft Drink Purchases

Using RStudio, on the other hand, the task can be completed by running the following R
code in the Console window. We will use the “purchase.csv” file in our working directory.

R Script
# This script constructs a frequency distribution table for qualitative data

# Install necessary packages (needed only once per computer)
# readr is a package used to read rectangular data like 'csv' or 'tsv'
install.packages("readr")
# pander provides a minimal and easy tool for rendering R objects
install.packages("pander")

# Load the installed packages in RStudio
library(readr)
library(pander)

# Import the file to RStudio
# The csv data is assigned to the object 'purchase'
purchase <- read.csv("purchase.csv")

# Get frequencies
# The table function performs categorical tabulation of data,
# giving each value of the variable and its frequency
data.freq <- table(purchase)

# To produce the output as a column
# cbind is used to combine vectors, matrices, or data frames by columns
freq.dist <- cbind(data.freq)

# To name the table column
# colnames sets the column names or labels
colnames(freq.dist) <- c("Frequency")

# RStudio output
pander(freq.dist)

Frequency
Coke Classic 19
Diet Coke 8
Dr. Pepper 5
Pepsi 13
Sprite 5

The same R code or script can also be written in the Source window or pane if you want to
keep a copy of the scripts you write in RStudio. First, we create a new R script file by
clicking on the File menu, then click on New File and select R Script. The same result can be
obtained by using the hot keys Ctrl+Shift+N.

Write the R code on the Source window. You should be able to have something similar to
Figure 11.

Save the R script file. R script files are named with an .R extension. Click on the save icon on
the Source window and browse to your set working directory. Name the file as purchase.R.

After saving the file, execute the script by highlighting all the lines in the Source window
and then clicking on the 'Run' icon on the upper right part of the Source window. As an
alternative to the 'Run' icon, you can press Ctrl+Enter to run the script. Take
note of this.

Figure 11. R script for the frequency distribution table for the soft drink purchase data.

For the relative frequency table, we can run the following R script.

R Script
# R script for the relative frequency distribution table

# nrow counts the total number of rows of the purchase data
data.relfreq <- data.freq / nrow(purchase)
relfreq.dist <- cbind(data.relfreq)
colnames(relfreq.dist) <- c("Relative Frequency")

# RStudio output
pander(relfreq.dist)

Relative Frequency
Coke Classic 0.38
Diet Coke 0.16
Dr. Pepper 0.1
Pepsi 0.26
Sprite 0.1

Note that since the dataset was already imported in RStudio from the previous R script,
there is no need to import the data again. Also, since the packages were already installed
and loaded from the previous R script, there is no need to repeat these commands.

Example 2:
A survey was taken in Aurora Avenue. In each of 20 homes, people were asked how many
cars were registered to their households. The results were recorded as follows:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Table 4 shows the frequency, relative frequency and percent frequency for the data in just
one table. Note that in practice, it is customary to only include one such type of
frequency.

Number of cars Frequency Relative Frequency Percent Frequency


0 4 0.20 20 %
1 6 0.30 30 %
2 5 0.25 25 %
3 3 0.15 15 %
4 2 0.10 10 %
Table 4. Frequency distribution table for the number of cars registered in each household

In this example, the frequency table constructed is for ungrouped data, which means that
the individual values do not lose their identity in the table.

Doing this in RStudio, let us consider a different approach by instead constructing a vector
representing the data values. Open a new R script file, then enter and run the following script.

R Script
# Create a vector for the given data.
cars<-c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Get the frequencies


data.freq<-table(cars)
data.relfreq<-data.freq/sum(data.freq)
data.pctfreq<-data.relfreq*100

# To combine necessary columns


freq.dist<-cbind(data.freq, data.relfreq, data.pctfreq)

# Naming the table columns


colnames(freq.dist) <-c("Frequency", "Relative Frequency", "Percent Frequency")

# RStudio output
pander(freq.dist)

Frequency Relative Frequency Percent Frequency
0 4 0.2 20
1 6 0.3 30
2 5 0.25 25
3 3 0.15 15
4 2 0.1 10

Example 3:
Consider the following data set on the monthly rent ($) for a sample of 70 one-bedroom
apartments:
425 430 430 435 435 435 435 435 440 440 440 440 440 445 445
445 445 445 450 450 450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480 480 485 490 490 490
500 500 500 500 510 510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615

A frequency table with 8 class intervals for this sample is shown below. In this case, the
values are grouped together in each class, and the individual values are no longer visible.

Rent (in $) Frequency


425-449 18
450-474 16
475-499 11
500-524 7
525-549 5
550-574 3
575-599 4
600-624 6
Total 70
Table 5. Frequency table for monthly rents of 70 one-bedroom apartments

To create the Grouped Frequency Distribution Table using R, we consider the following R
script and we make use of the rent.csv file in our data repository or working directory.

R Script
# Creating a Grouped Frequency Distribution Table
# Load the readr and pander package in RStudio
library(readr)
library(pander)

# Import the data into RStudio and assign it to "data"


data <-read.csv("rent.csv")

# To view the data in RStudio


View(data)

# To define the class intervals
# Creates class intervals, each with class width equal to 25
breaks <- seq(425, 625, by = 25)

# To assign each observation to its class interval
# data$Rent calls the variable "Rent" of the data frame into the script;
# right = FALSE closes each interval on the left and opens it on the right
classint <- cut(data$Rent, breaks, right = FALSE)

# To obtain the frequency of data in each class interval


freq<-table(classint)

# To combine necessary columns


freq.dist<-cbind(freq)

# To name or label the columns


colnames(freq.dist) <-c("Frequency")

# RStudio output
pander(freq.dist)

Frequency
[425,450) 18
[450,475) 16
[475,500) 11
[500,525) 7
[525,550) 5
[550,575) 3
[575,600) 4
[600,625) 6

In the output, a bracket on the left endpoint means that the value is included in the class
interval, while a parenthesis on the right endpoint means that the value is not included in the
interval. For example, [525, 550) contains rents from 525 up to, but not including, 550, which
for whole-dollar rents is the class interval 525-549.
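
If the class labels should read 425-449, 450-474 and so on, as in Table 5, one option is to pass custom labels to cut(). The following is a minimal sketch that assumes the rents are whole dollars and that the 70 values of Example 3 are typed into a vector named rent (a name chosen here for illustration) instead of being read from rent.csv.

```r
# Monthly rents ($) of the 70 one-bedroom apartments from Example 3
rent <- c(425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445,
          445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
          465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490,
          500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
          575, 575, 580, 590, 600, 600, 600, 600, 615, 615)

# Same class boundaries as before: 425, 450, ..., 625
breaks <- seq(425, 625, by = 25)

# Build labels such as "425-449" from the lower boundaries
# (valid only because the rents are whole numbers)
lower <- head(breaks, -1)
class.labels <- paste(lower, lower + 24, sep = "-")

# Assign each rent to its class and count the rents in each class
classint <- cut(rent, breaks, right = FALSE, labels = class.labels)
freq <- table(classint)

# Display the frequencies as a single column
print(cbind(Frequency = freq))
```

The counts are the same as in the bracket notation; only the row labels change.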

BAR GRAPH

A bar graph is a chart used to display qualitative data summarized in a frequency, relative
frequency, or percent frequency distribution.

For a vertical bar chart, the horizontal (x) axis represents the categories; the vertical (y) axis
represents a value (frequency, relative frequency, or percent frequency) for those
categories. In the graph below, the values are frequencies.

The figure below shows the bar chart of the data on soft drink purchases of Example 1.

R Script
To construct the bar chart using RStudio, we use the ggplot function. Using the
“purchase.csv” data, open a new R script file, enter and run the following script:

# Install the tidyverse and forcats packages in RStudio
install.packages("tidyverse")
install.packages("forcats")

# Load the packages into RStudio
library(readr)
library(tidyverse)
library(forcats)

# Import the "purchase.csv" file and assign it to "purchase"
purchase <- read.csv("purchase.csv")
View(purchase)   # Presents the data on a different tab

# To generate the bar chart
bar1 <- ggplot(purchase, aes(x = Purchase)) +
  geom_bar(width = .5) +
  ggtitle("Soft Drink Purchases")
bar1

# To order the bars by decreasing frequency
# The "+" must end the first line so that R continues reading the expression
bar2 <- ggplot(mutate(purchase, Purchase = fct_infreq(Purchase))) +
  geom_bar(aes(x = Purchase))
bar2

Note that you do not have to assign the bar graphs to the objects bar1 and bar2. Removing
these assignments from the script would generate the bar charts right away. Also, the bars will
be shown in the Plots window of RStudio, where you have the options to “Save as Image”,
“Save as PDF”, or “Copy to Clipboard” once you click on the “Export” icon in the Plots
window.
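
If only the frequency counts are at hand, a bar chart can also be drawn without ggplot2. The following is a minimal sketch using the base R function barplot() with the counts from Table 1 typed in directly; the object names freq and freq.sorted are chosen here for illustration only.

```r
# Frequencies from Table 1, entered as a named vector
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Order the categories by decreasing frequency
freq.sorted <- sort(freq, decreasing = TRUE)

# Draw the vertical bar chart: categories on the x-axis, frequencies on the y-axis
barplot(freq.sorted,
        main = "Soft Drink Purchases",
        xlab = "Soft Drink Type", ylab = "Frequency",
        ylim = c(0, 20))
```

Replacing freq.sorted with freq.sorted / sum(freq.sorted) in the call, and adjusting ylim and ylab accordingly, would plot relative frequencies instead.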

PIE CHART

A pie chart (also called a pie graph or circle graph) provides another graphical device for
presenting relative frequency and percent frequency distributions for qualitative data. The
circle is subdivided into sectors, and the numerical value shown for each sector can be a
frequency, a relative frequency, or a percent frequency.

A pie chart makes use of sectors (slices) in a circle. The angle of a sector is proportional to
the frequency of each of the categories of the variable that defines the data. The formula
to determine the angle of a sector in a circle graph is:

\[
\text{angle of sector} = \frac{\text{frequency of category}}{\text{total frequency}} \times 360^{\circ}
\]
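
To see the formula at work, the short sketch below applies it to the soft drink frequencies of Example 1. The object names freq and angle are chosen here for illustration.

```r
# Frequencies of the soft drink purchases from Table 1
freq <- c("Coke Classic" = 19, "Diet Coke" = 8, "Dr. Pepper" = 5,
          "Pepsi" = 13, "Sprite" = 5)

# Angle of sector = frequency of category / total frequency x 360 degrees
angle <- freq / sum(freq) * 360

# Sector angles in degrees for each soft drink type
print(angle)

# The sector angles of a complete pie chart add up to 360 degrees
print(sum(angle))
```

For instance, Coke Classic takes up 19/50 of the circle, or 136.8 degrees.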

The figure below shows the pie chart of the data on soft drink purchases of Example 1
generated using Microsoft Excel.

R Script

Starting from the raw data, the following script creates a simple pie chart
in RStudio. We use the “purchase.csv” file for the same example.

# Load Packages into RStudio


library(readr)

# Import “purchase.csv” file into RStudio and assign it to “purchase”


purchase <-read.csv("purchase.csv")

# Determine the frequencies
data.freq <- table(purchase)   # Determines the frequencies
data.freq                      # Presents the frequencies
# Output:
# purchase
# Coke Classic    Diet Coke   Dr. Pepper        Pepsi       Sprite
#           19            8            5           13            5

# Construct the vector of frequencies
freq <- c(19, 8, 5, 13, 5)

# Label the categories
lbls<-c("Coke Classic", "Diet Coke", "Dr. Pepper", "Pepsi", "Sprite")

# Calculate the Percentages


percents<-round(freq/sum(freq)*100, 1) # round-off values to 1 decimal place.
lbls<-paste(lbls, percents) # add percents to labels
lbls<-paste(lbls, "%", sep =" ") # adds % sign to labels

# Construct the pie chart with percentages
# pie() draws the chart directly, so its result does not need to be stored
pie(freq, labels = lbls, col = rainbow(length(lbls)),
    main = "Pie Chart of Soft Drink Purchases")

DOT PLOT

A dot plot is a graphical display of data using dots. It is similar to a bar graph because the
height of each “bar” of dots is equal to the number of items in a particular category. To
draw a dot plot, count the number of data points falling in each category and draw a
stack of dots that number high for each category. A dot plot can be used as a graphical
display of the frequency of qualitative and quantitative (ungrouped) data.

The figure that follows shows the dot plot for the data of Example 2 on the number of cars
registered to each household:

1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0

Number of cars Frequency


0 4
1 6
2 5
3 3
4 2
Table 4. Frequency distribution table for the
number of cars registered in each household

R Script

Here we present two ways by which a dot plot is constructed. First is by importing a .csv
data file from MS Excel, which is very useful especially if we have a large data set, and the
other way is by constructing the data vector in the RStudio environment. This is applicable if
we would be dealing with a small set of data. The following are the scripts. For the first
method, we use the “cars.csv” data from our directory.

# Install the ggplot2 package in RStudio
install.packages("ggplot2")

# Load necessary packages in RStudio


library(readr)
library(ggplot2)

# Import the “cars.csv” data and assign it to “data”


data <-read.csv("cars.csv")

# Generate the dotplot


ggplot(data, aes(cars))+geom_dotplot(binwidth=0.5)

# Second method: create the vector given the data (for a small data set)
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)
ID <- 1:20   # Generates a sequence of integers from 1 to 20
data <- data.frame(ID, cars)
str(data)    # Shows the structure of the data frame with 2 variables
# Output:
# 'data.frame': 20 obs. of 2 variables:
#  $ ID  : int 1 2 3 4 5 6 7 8 9 10 ...
#  $ cars: num 1 2 1 0 3 4 0 1 1 1 ...
ggplot(data, aes(cars)) + geom_dotplot(binwidth = 0.3)

Notice the difference in dot sizes with different binwidths. You can further explore RStudio
functionality by varying the values of “arguments” in the syntax.
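
A dot plot can also be sketched without ggplot2. The example below is one possible alternative, using the base R function stripchart() with method = "stack" so that repeated values are piled on top of each other; the values chosen for pch and offset are only a starting point and may need adjusting to the size of the plotting window.

```r
# Number of cars registered in each of the 20 households (Example 2)
cars <- c(1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0)

# Stack one dot per observation above its value on the horizontal axis
stripchart(cars, method = "stack",
           pch = 16,        # solid circles
           offset = 0.5,    # vertical spacing between stacked dots
           xlab = "Number of cars",
           main = "Cars Registered per Household")

# The height of each stack corresponds to the frequency of that value
print(table(cars))
```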

STEM-AND-LEAF PLOT

A stem-and-leaf plot is a graphical display for quantitative data that shows both the rank
order and shape of a data set. It is particularly useful when data are not too numerous.
Stem-and-leaf plots are a method for showing the frequency with which certain classes of
values occur.

Example 1:
The following illustration and steps are taken from the website:
https://study.com/academy/lesson/how-to-make-a-stem-and-leaf-plot.html
The process will be easiest to follow with sample data, so let's pretend that a sports
statistician wants to make a stem-and-leaf plot for a recent game played by the Blues
basketball team. The total minutes played by each team member has been recorded and
shown below:
Blues Member Name Minutes Played
Gifford 22
Slavky 29
Harrison 22
Samon 31
Mantry 20
Lewing 12
Wilson 14
Larriby 24
Paston 13
Lebling 4

Waster 2
Canno 1
Step 1: Determine the smallest and largest number in the data.
Looking at the stats, we see the number of minutes played ranges from a low of 1
minute to a high of 31 minutes.

Step 2: Identify the stems.
For any number, the digit or digits to the left of the right-most digit form the stem. For
example, the number 31 has a stem of 3, while the number 29 has a stem of 2. A one-digit
number like 4 has a stem of 0. Think ''04'' for 4. Based on the range of 1 to 31, we
need stems of 0, 1, 2 and 3.

Step 3: Draw a vertical line and list the stem numbers to the left of the line.
0|
1|
2|
3|

Step 4: Fill in the leaves.


The first data value is for Gifford who played 22 minutes. The stem is on the left. The
leaf is on the right.
0|
1|
2|2
3|
Let's enter Lebling's 4 minutes. The stem is 0 and the leaf is 4.
0|4
1|
2|2
3|
Entering the rest of the data:
0|4 2 1
1|2 4 3
2|2 9 2 0 4
3|1

Step 5: Sort the leaf data.


The stem-and-leaf plot is easier to interpret when each row's leaves are sorted from
low to high.
0|1 2 4
1|2 3 4

2|0 2 2 4 9
3|1
And that's the stem-and-leaf plot for minutes played.

The place value of the leaf is called the leaf unit. In the example above, the leaf unit is 1.
Other leaf units may be 100, 10, 0.1, and so on. If the leaf unit is not 1, it should be displayed
in the stem-and-leaf plot.
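
The manual steps above can be mirrored in a few lines of R. The sketch below assumes a leaf unit of 1, so the stem is obtained by integer division by 10 and the leaf is the remainder; the object names are chosen here for illustration. It lists only stems that actually occur in the data, so an empty stem would not be shown.

```r
# Minutes played by each Blues team member
minutes <- c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)

# Sort the data so that the leaves come out in increasing order (Step 5)
sorted <- sort(minutes)

# Stem = digits to the left of the right-most digit (Step 2)
stems <- sorted %/% 10

# Leaf = the right-most digit (Step 4)
leaves <- sorted %% 10

# Collect the leaves belonging to each stem into one row
rows <- tapply(leaves, stems, paste, collapse = " ")

# Print each row as stem|leaves
cat(paste0(names(rows), "|", rows), sep = "\n")
```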

R Script

For the same example, the stem and leaf plot can be generated in RStudio by using the
stem() function. The script is very short. Try this out in RStudio.

# Create data vector and stem and leaf plot


data <-c(22, 29, 22, 31, 20, 12, 14, 24, 13, 4, 2, 1)
stem(data)
The decimal point is 1 digit(s) to the right of the |

0 | 124
1 | 234
2 | 02249
3 | 1

Example 2:
The stem-and-leaf plot for the data set
8.6 11.7 9.4 9.1 10.2 11.0 8.8
with leaf unit 0.1 is given by

 8|6 8
 9|1 4
10|2
11|0 7

This means that in reading the data from the stem-and-leaf plot, the stems are digits in the
units place while the leaves are the digits in tenths place (first decimal place).

R Script

# Create data vector and stem and leaf plot


data<-c(8.6, 11.7, 9.4, 9.1, 10.2, 11.0, 8.8)
stem(data)
The decimal point is at the |

8 | 68
9 | 14
10 | 2
11 | 07

Example 3:
Let us now consider a data frame for this example. In MS Excel, open the data file
“inflation.csv”. The data shows the Inflation rate (in %) of countries in Asia and the Pacific.
Upon inspection of the variables, you would notice that there is only one quantitative
variable which is the inflation rate, labeled “Inflation”. We now create a stem-and-leaf
display for this variable.

R Script

# Load the readr package in RStudio


library(readr)

# Import "inflation.csv" data and assign it to "data"


data <-read.csv("inflation.csv")

# Inspect the data frame
head(data)
# Output:
#     Regional.Member Year Inflation      Subregion Country.Code
# 1       Afghanistan 2013       7.4     South Asia          AFG
# 2           Armenia 2013       5.8   Central Asia          ARM
# 3        Azerbaijan 2013       2.4   Central Asia          AZE
# 4        Bangladesh 2013       6.8     South Asia          BGD
# 5            Bhutan 2013       8.8     South Asia          BTN
# 6 Brunei Darussalam 2013       0.4 Southeast Asia          BRN

# Select only the "Inflation" variable for the plot
# Assign the isolated "Inflation" variable to "inf"
# The "$" sign selects only the "Inflation" variable from the data
inf <- data$Inflation

# Create the stem and leaf plot


stem(inf)

The decimal point is at the |

-0 | 552
0 | 4588349
2 | 0112444668990778
4 | 1239047889
6 | 446688934
8 | 894499
10 | 7

HISTOGRAM

A histogram is a graphical portrayal of the frequency distribution of grouped data. It
divides the data set into class intervals and gives the frequency for each class. Histograms
are particularly useful for summarizing large sets of data.

The histogram corresponding to the frequency distribution table for the data on monthly
rent ($) for a sample of 70 one-bedroom apartments in Example 3 is shown below:

Rent (in $) Frequency


425-449 18
450-474 16
475-499 11
500-524 7
525-549 5
550-574 3
575-599 4
600-624 6
Total 70
Table 5. Frequency distribution table for monthly rents of 70 one-bedroom apartments

R Script

To plot the histogram for the same example, again we use the “rent.csv” file.

# Load the readr package in RStudio


library(readr)

# Import the "rent.csv" file and assign it to "data"


data <-read.csv("rent.csv")

# Create the histogram


hist(data$Rent, breaks=seq(425, 625, by=25), main="Histogram of Rents",
xlab="Monthly Rent", ylab="Frequency", col="gray", border="yellow",right =FALSE,
ylim =c(0,20) )
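
One way to check that the histogram bars agree with Table 5 is to ask hist() for the class counts without drawing anything. The sketch below assumes the 70 rents of Example 3 are typed into a vector named rent (a name chosen here for illustration) instead of being read from rent.csv.

```r
# Monthly rents ($) of the 70 one-bedroom apartments from Example 3
rent <- c(425, 430, 430, 435, 435, 435, 435, 435, 440, 440, 440, 440, 440, 445, 445,
          445, 445, 445, 450, 450, 450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
          465, 470, 470, 472, 475, 475, 475, 480, 480, 480, 480, 485, 490, 490, 490,
          500, 500, 500, 500, 510, 510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
          575, 575, 580, 590, 600, 600, 600, 600, 615, 615)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(rent, breaks = seq(425, 625, by = 25), right = FALSE, plot = FALSE)

# Frequencies of the class intervals, in order from 425-449 to 600-624
print(h$counts)

# Midpoints of the class intervals
print(h$mids)
```

The counts should match the Frequency column of Table 5, since the same breaks and the same right = FALSE setting are used.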

Summarizing Qualitative and Quantitative Data for Two Variables

Tabular and graphical displays for data obtained from two variables are helpful in
understanding the relationship between them, if any. In this section we will discuss
the crosstabulation or contingency table and the scatter diagram.

CROSSTABULATION

A crosstabulation or contingency table is a tabular summary of data for two variables. The
variables can both be qualitative or both quantitative, or can be a combination of one
qualitative and one quantitative variable. If either variable is quantitative, classes must be
created for the values of the quantitative variable. The labels shown in the margins of the
table define the categories (classes) for the two variables.

Example:
For an example, we consider the “salaries.csv” file which contains data on professors of a
university, including rank, discipline being taught, years since PhD was obtained, years of
service in the university, sex, and annual salary ($). We construct a crosstabulation of the
rank and sex of the teachers. Using RStudio, we can generate the crosstabulation shown in
Table 6.

The following is the RStudio Script.

R Script

# Install the summarytools package in RStudio
install.packages("summarytools")

# Load the readr, summarytools, and pander package in RStudio


library(readr)
library(summarytools)
library(pander)

# Import "salaries.csv" and assign it to "data"


data <-read.csv("salaries.csv")
head(data)
X rank discipline yrs.since.phd yrs.service sex salary
1 1 Prof B 19 18 Male 139750
2 2 Prof B 20 16 Male 173200
3 3 AsstProf B 4 3 Male 79750
4 4 Prof B 45 39 Male 115000
5 5 Prof B 40 41 Male 141500
6 6 AssocProf B 6 6 Male 97000

# Crosstabulation of Rank and Sex


crosstab <-ctable(x=data$rank, y=data$sex, prop ="r")
pander(crosstab)

• cross_table:
Female Male Total
AssocProf 10 54 64
AsstProf 11 56 67
Prof 18 248 266
Total 39 358 397
Table 6. Crosstabulation of rank and sex of teachers.

• proportions:
Female Male Total
AssocProf 0.1562 0.8438 1
AsstProf 0.1642 0.8358 1
Prof 0.06767 0.9323 1
Total 0.09824 0.9018 1
Table 7. Proportions of crosstabulation of rank and sex of teachers

From the crosstabulation, we can see that the majority of the teachers have a rank of
'Professor'. There are more males than females in every rank, and male professors
make up the largest group. This could not have been easily
observed by just looking at the raw data.
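
The row proportions in Table 7 can be checked directly from the counts in Table 6. The sketch below is a base R alternative that does not use summarytools: it types the counts into a matrix (the object names counts, with.totals and row.prop are chosen here for illustration) and divides each count by its row total.

```r
# Counts from Table 6: rows are ranks, columns are sexes
counts <- matrix(c(10,  54,
                   11,  56,
                   18, 248),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(rank = c("AssocProf", "AsstProf", "Prof"),
                                 sex  = c("Female", "Male")))

# Add row and column totals (labelled "Sum" by addmargins)
with.totals <- addmargins(counts)
print(with.totals)

# Row proportions: each count divided by the total of its row
row.prop <- prop.table(counts, margin = 1)
print(round(row.prop, 4))
```

Using margin = 2 instead would give column proportions, that is, the share of each rank within each sex.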

SCATTER DIAGRAM/PLOT

A scatter diagram or scatter plot is a graphical display of the relationship between two
quantitative variables. One variable (independent variable) is shown on the horizontal axis
and the other variable (dependent variable) is shown on the vertical axis. The general
pattern of the plotted points suggests the overall relationship between the variables. This
relationship will be discussed further in Module 11 (Correlation and Regression).

Example:
Consider the advertising/sales relationship for a stereo and sound equipment store. On 10
occasions during the past three months, the store used weekend television commercials to
promote sales at its stores. The managers want to investigate whether a relationship exists
between the number of commercials shown and the sales at the store during the following
week. Sample data for the 10 weeks with sales in hundreds of dollars are shown in the
table. The figure that follows is a scatter diagram for the data.

Week Number of Commercials Sales ($100s)


1 2 50
2 5 57
3 1 41
4 3 54
5 4 54
6 1 38
7 5 63
8 3 48
9 4 59
10 2 46

R Script

Here we present two scripts for generating the scatter plot for the same problem. The
example data are contained in the “advertising.csv” data file.
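The “advertising.csv” file itself is not reproduced in this module. The following is a minimal sketch of how such a file could be created from the table above and read back into R. The header row is an assumption (it simply mirrors the table headings), and a temporary file is used so that nothing on disk is overwritten. It also shows why the scripts refer to columns named Number.of.Commercials and Sales...100s.: by default, read.csv() converts header text into syntactic R names.

```r
# Assumed layout of "advertising.csv": a header row that mirrors the table
# headings, followed by the ten weekly observations.
csv_lines <- c(
  "Week,Number of Commercials,Sales ($100s)",
  "1,2,50", "2,5,57", "3,1,41", "4,3,54", "5,4,54",
  "6,1,38", "7,5,63", "8,3,48", "9,4,59", "10,2,46"
)

# Write to a temporary file so that no existing file is overwritten.
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# read.csv() is part of base R. With its default check.names = TRUE,
# spaces and symbols in the headers are replaced by dots.
data <- read.csv(csv_path)
print(names(data))
print(dim(data))
```

If the actual file uses different headings, names(data) will show the column names that must be used after the $ operator.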

# Script 1: refer to the columns directly through data$...

# Import the "advertising.csv" file and assign it to "data".
# read.csv() is part of base R, so no additional package needs to be loaded.
data <- read.csv("advertising.csv")

# Plot the chart
plot(x = data$Number.of.Commercials, y = data$Sales...100s.,
     xlab = "Number of Commercials", ylab = "Sales in Hundred Dollars",
     main = "Number of Commercials vs Sales",
     xlim = c(0, 6), ylim = c(0, 70))

# To include a trend line in the plot
abline(lm(data$Sales...100s. ~ data$Number.of.Commercials))

# Script 2: the same plot, using shorter object names

# Import the "advertising.csv" file and assign it to "data".
# read.csv() is part of base R, so no additional package needs to be loaded.
data <- read.csv("advertising.csv")

# Assign variables to simple object names
comm <- data$Number.of.Commercials   # Isolates "Number.of.Commercials" and assigns it to "comm"
sales <- data$Sales...100s.          # Isolates "Sales...100s." and assigns it to "sales"

# Plot the chart
plot(x = comm, y = sales,
     xlab = "Number of Commercials", ylab = "Sales in Hundred Dollars",
     main = "Number of Commercials vs Sales",
     xlim = c(0, 6), ylim = c(0, 70))

# To include a trend line in the plot
abline(lm(sales ~ comm))
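The trend line drawn by abline(lm(...)) is the least-squares line fitted to the plotted points. As a minimal sketch, the data from the table can be typed in directly (so that no CSV file is needed) and the fitted line inspected. The object name fit is introduced here only for illustration.

```r
# Data from the advertising/sales table, entered by hand
comm  <- c(2, 5, 1, 3, 4, 1, 5, 3, 4, 2)            # number of commercials
sales <- c(50, 57, 41, 54, 54, 38, 63, 48, 59, 46)  # sales in $100s

# Fit the least-squares line: sales = intercept + slope * comm
fit <- lm(sales ~ comm)

# Show the intercept and slope of the trend line
print(coef(fit))

# After a call to plot(), abline(fit) draws this same line on the chart.
```

For these ten weeks the intercept is 36.15 and the slope is 4.95, so the line rises by about 4.95 (hundreds of dollars) for each additional commercial.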

Learning Reinforcement Activity No. 5-1: DATA PRESENTATION


Accomplish by September 21, 2020

Use RStudio to construct the tabular and graphical displays required for each problem.
Submit a single .docx file containing the R output for each problem, together with the
saved RStudio script.

Please use the following convention for the filename: LRA5-1<LASTNAME>.docx [Example:
LRA5-1MIRANDA.docx] and for the R script, LRA5-1<LASTNAME>.R.

1. According to Kantar Media (March 13, 2020), the top four primetime television
shows in the Philippines were Ang Probinsyano (Prob), Make It With You (MIWY),
Prima Donnas (PD), and Descendants of the Sun Philippine Adaptation (DS). Data
indicating the preferred shows for a sample of 50 viewers follow. (15 points)

Prob  PD    MIWY  MIWY  Prob  MIWY  DS    PD    Prob  Prob
MIWY  Prob  Prob  MIWY  PD    Prob  PD    Prob  MIWY  PD
PD    PD    DS    PD    Prob  DS    Prob  Prob  PD    Prob
Prob  MIWY  Prob  DS    PD    Prob  Prob  PD    Prob  DS
DS    Prob  DS    Prob  DS    MIWY  MIWY  Prob  MIWY  PD

a. Construct a frequency, relative frequency, and percent frequency distribution
for the sample.
b. Construct a bar chart and a pie chart for the sample.
c. On the basis of the sample, which television show has the largest viewing
audience? Which is second?

2. The data below show the time in days required to complete year-end audits for a
sample of 20 clients of Sanderson and Clifford, a small public accounting firm.
Construct a dot plot for the sample. (5 points)

Year-end Audit Time (in days)

12  20  14  15  21  18  22  18  17  13
15  22  14  27  18  19  33  16  23  28

3. Use the file salaries.csv to construct a crosstabulation of the following pairs of
variables: (15 points)
a. Rank (row variable) vs. Discipline (column variable)
b. Rank (row variable) vs. Years of Service (column variable, grouped by 10s)
c. Rank (row variable) vs. Salary (column variable, grouped by $25000s)
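Grouping a quantitative variable into equal-width classes before crosstabulating can be done with cut() and table(). The sketch below is only an illustration of the technique: the six observations and the object names rank_vec, years, and years_group are hypothetical and are not taken from salaries.csv.

```r
# Hypothetical values for illustration only (not from salaries.csv)
rank_vec <- c("AsstProf", "AssocProf", "Prof", "AsstProf", "AssocProf", "Prof")
years    <- c(3, 12, 25, 8, 19, 31)

# Group years into classes of width 10: [0,10), [10,20), [20,30), [30,40).
# right = FALSE makes each class include its lower limit.
years_group <- cut(years, breaks = seq(0, 40, by = 10), right = FALSE)

# Crosstabulation: rank as the row variable, grouped years as the column variable
tab <- table(Rank = rank_vec, YearsOfService = years_group)
print(tab)
```

The same pattern applies to any class width: change the by argument of seq() and make sure the breaks cover the full range of the variable.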

Congratulations! You have just completed Module 5.


You are getting acquainted with the R software.
In the next module, we will start computing descriptive measures.
