数据分析习题

Question 1
Consider the table below describing a data set of individuals who have registered to volunteer at a
public school. Which of the choices below listscategorical variables?
Your Answer Score E
number of siblings and year born Inorrect 0.00 T
name and number of siblings
annual income and phone number
phone number and name
Total 0.00 / 1.00
Question ExplanationThis question refers to the following learning objective(s):
Identify variables as numerical and categorical.

• If variable is numerical, further classify as continuous or discrete based on whether or not the
variable can take on an infinite number of values or only non-negative whole numbers,
respectively.
• If variable is categorical, determine if it is ordinal based on whether or not the levels have a
natural ordering.
Question 2
The General Social Survey conducted annually in the United States asks how many friends people
have and how they would rate their happiness level (very happy, pretty happy, not too happy). In
order to evaluate the relationship between these two variables a researcher calculates the average
number of friends for people who categorize themselves as very happy, pretty happy, and not too
happy. Which of the following correctlyidentifies the variables used in the study as explanatory
and response?
Your Answer Score Explanation
explanatory:number of friends Inorrect 0.00 Having more friends might cause people to be happier or b
response: happiness level more friends. So we can’t easily determine which variable
(categorical with 3 levels) based on which we might expect to affect which. However
is the explanatory variable since we first divide the data int
analyze summary statistics of number of friends of people
Therefore, number of friends is the response variable.
explanatory:number of friends
response: very happy, pretty
happy, not too happy
explanatory:very happy, pretty

happy, not too happy
response: number of friends
explanatory:happiness level
(categorical with 3 levels)
response: number of friends
Total 0.00 /
1.00
Identify the explanatory variable in a pair of variables as the variable suspected of affecting the
other, however note that labeling variables as explanatory and response does not guarantee that the
relationship between the two is actually causal, even if there is an association identified between
the two variables.
Question 3
Past research suggests that students who study with fewer distractions (internet, cell phone, etc.)
tend to get higher grades. Which of the following is the best scenario for being able to generalize
this finding to the population of all students?
None of the students in the sample has any misdemeanors; their answers
can’t be trusted.
A student list for the college is obtained and students are randomly Correct 1.00 Random sampling a
selected from the list, and all selected students participate in the study. population at large,
sampling.
A survey is emailed to all registered students, and the results are based
on the sample of returned surveys.
Sample only includes students who are in classes that the researcher
teaches.
Total 1.00 /
1.00
Classify a study as observational or experimental, and determine whether the study’s results can
be generalized to the population and whether they suggest correlation or causation.
• If random sampling has been employed in data collection, the results should be generalizable to
the target population.
• If random assignment has been employed in study design, the results suggest causality.
Question 4
A school district is considering whether it will no longer allow students to park at school after two
recent accidents where students were severely injured. As a first step, they survey parents of high
school students by mail, asking them whether or not the parents would object to this policy
change. Of 5,799 surveys that go out, 1,209 are returned. Of these 1,209 surveys that were
completed, 926 agreed with the policy change and 283 disagreed. Which of the following
statements is the most plausible?
Your Answer Score
It is possible that 80% of the parents of high school students disagree with the policy
change.
The survey is unlikely to have any bias because all parents were mailed a survey. Inorrect 0.00
The school district has strong support from parents to move forward with the policy
approval.
Total 0.00 / 1.00

Question confounding variables and sources of bias in a given study.
Question 5
As part of a statistics project, Andrea would like to collect data on household size in her city. To
do so, she asks each person in her statistics class for the size of their household, and reports that
her sample is a simple random sample. However, this is not a simple random sample. Which of the
following is the best reasoning for why this is not a random sample that is appropriate for this
research question?
Your Answer
Andrea did not block for any variables that might influence the response.
Andrea asked everybody in her class instead of asking her classmates to volunteer.
In this investigation of household size, each household represents a case. Andrea incorrectly sampled individuals instea
of households.
Andrea did not use a random number table to randomize the order in which she collected the students’ responses, so th
sample cannot be random.
Total
Distinguish between simple random, stratified, and cluster sampling, and recognize the benefits
and drawbacks of choosing one sampling scheme over another.
Question 6
True or False: Stratified sampling allows for controlling for possible confounders in the sampling
stage, while blocking allows for controlling for such variables during random assignment.
Your Score Explanation

Answer
False Inorrect 0.00 Stratifying and blocking both allow for controlling for potential confounders, but
stratify when we sample (divide population into strata and sample from within ea
random assignment (divide sample into blocks and randomly assign from within
True
Total 0.00 /
1.00
Identify the four principles of experimental design and recognize their purposes:
 control any possible confounders,

 randomize into treatment and control groups,
 replicate by using a sufficiently large sample or repeating the experiment,
 block any variables that might influence the response.
Question 7
Which of the below data sets has the lowest standard deviation? You do not need to calculate the
exact standard deviations to answer this question.
100, 100, 100, 100, 100, 100, Correct 1.00 The dataset with the most repeated observations has the
101 deviation.
0,1,2,3,4,5,6
0, 25, 50, 100, 125, 150, 1000
0,1,3,3,3,5,6
Total 1.00 / 1.00
Note that there are three commonly used measures of center and spread:
 center: mean (the arithmetic average), median (the midpoint), mode (the most frequent
observation)
 spread: standard deviation (variability around the mean), range (max-min), interquartile
range (middle 50% of the distribution)
Question 8
True or False: The statistic mean/median (mean divided by median) can be used as a measure of
skewness (either right or left). If this statistic is less than 1, the distribution is most likely left
skewed.
False
True Correct 1.00 In a left skewed distribution the median is greater than the mean, therefore
than 1.
Total 1.00 / 1.00
Identify the shape of a distribution as symmetric, right skewed, or left skewed, and unimodal,
bimodoal, multimodal, or uniform.
Question 9
Based on the relative frequency histogram below, which of the following statements is supported
by the plot?
The IQR of the

distribution is
roughly 10.
The mean of the

distribution is
smaller than its
median.
The distribution is
multimodal.
It is not possible
to estimate the
median without
knowing the
sample size.
There are no Inorrect 0.00

outliers in the
distribution.
Using the relative frequency histogram, we can tell that 10% of observation
between 5 and 10, 20% are between 10 and 15, and 15% between 15 and 20
5 and 10) and Q3 is in the fourth bin (between 15 and 20). This confirms th
approach we would place the median in the second bin, therefore we don’t
median. There are no observations more than 1.5×IQR below the first quart
than 1.5×IQR above the third quartile, therefore there are indeed outliers in
Total 0.00 /
1.00
Use histograms and box plots to visualize the shape, center, and spread of numerical distributions,
and intensity maps for visualizing the spatial distribution of the data.
Question 10
A recent housing survey was conducted to determine the price of a typical home in a city that is
mostly middle-class, with one very expensive suburb. The mean price of a house in this city is
roughly $650,000. Which of the following statements is most likely to be true?
There are about as many houses in

this city that cost more than $650,000
than less than this amount.
Majority of houses in this city cost Correct 1.00 Since the city is mostly middle-class, with one very ex
less than $650,000. distribution to be right skewed, and therefore the mean
observations fall below the median, more than 50% of
$650,000, the mean.
We need to know the standard

deviation to answer this question
Majority of houses in this city cost

more than $650,000.
Total 1.00 /
1.00
Question ExplanationThis question refers to the following learning objective:
Define a robust statistic (e.g. median, IQR) as a statistics that is not heavily affected by skewness
and extreme outliers, and determine when such statistics are more appropriate measures of center
and spread compared to other similar statistics.
Question 11
Phi Delta Kappa (PDK) is an international professional organization for educators that, in
collaboration with Gallup, has been conducting polls on the public’s attitudes toward the public
schools since 1969. The following was one of the questions on the 2011 poll:
”Most teachers in the nation now belong to unions or associations that bargain over salaries,
working conditions, and the like. Has unionization, in your opinion, helped, hurt, or made no
difference in the quality of public school education in the United States?”
The respondents’ answers broken down by party affiliation are shown below. Which of the
following statements is most justified by these data?
14% of Republicans and 58% of Democrats

think that teachers belonging to unions or
bargaining associations helped the quality of
public school education in the United States.
A histogram or a box plot would be useful

for investigating if distribution of opinion on
teachers belonging to unions or bargaining
associations varies by political party
affiliation.
The results of the survey suggest a Correct 1.00 35/290 ≈ 12% of Republicans, 146/341 ≈ 43% o
relationship between opinion on teachers Independents think that teachers belonging to un
belonging to unions or bargaining quality of public school education in the United
associations and political party affiliation. differences between these proportions, the resul
between opinion on teachers belonging to union
party affiliation.
The results of the survey suggest that opinion

on teachers belonging to unions or
bargaining associations and political party
affiliation appear to be independent.
Total 1.00 /
1.00
Use contingency tables and segmented bar plots or mosaic plots to assess the relationship between
two categorical variables.
Question 12
In 1948, Austin Bradford Hill, designed a study to test a new treatment for tuberculosis that at the
beginning of the study there was no evidence whether it would be any better or worse than bed
rest. He randomly assigned some patients who volunteered to be a part of this study to receive the
treatment Streptomycin, an antibiotic. The other patients received only bed rest as the control
group. Hill then observed the patients’ outcomes: which patients died and which recovered. The
results of the study are shown below.
We use the following simulation test if there is a difference between the recovery rates under the
two treatments: We write “died” on 18 index cards and “survived” on 89 index cards to indicate
whether or not a patient died. Next, we shuffle the cards and deal them into two groups of 52 and
55, for control and treatment, respectively. We then calculate the simulated difference between the
recovery rates in Streptomycin and control groups (pp Streptomycin − ppControl), and record this
value. We repeat this simulation 100 times. The histogram below shows the distribution simulated
difference between the recovery rates in these 100 simulations.
Which of the following is correct? Choose all that apply (there are multiple correct answers).
The conclusion of this study is Correct 0.11 Since the sample is comprised of volunteers, we can’t gen
generalizable to all tuberculosis
patients.
The alternative hypothesis should Correct 0.11 The evidence could go either way so we should consider a
be that there is a difference
between the recovery rates under
the two treatments.
Streptomycin treatment appears to Inorrect 0.00

be effective in treating tuberculosis The observed difference betwee
since the observed difference in isp^Streptomycin−p^control=5155−
recovery rates would be considered
There is 1 simulation where the simulated difference is ≥0
unusual based on the simulation
hypothesis test, the p-value is 0.01×2=0.02. This is consid
results.
Based on this study we can Correct 0.11 Also, since this is an experiment we can deduce causation
conclude a causal relationship
between Streptomycin and better
tuberculosis recovery rate.
Streptomycin treatment does not Inorrect 0.00
appear to be effective in treating The observed difference betwee
tuberculosis since the observed isp^Streptomycin−p^control=5155−
number of deaths in the treatment
group would not be considered
unusual based on the simulation
reject the null hypothesis and hence conclude that the data
results.
treatments.
The alternative hypothesis is that Inorrect 0.00 The evidence could go either way so we should consider a
the Streptomycin treatment is more
effective than bed rest.
If Streptomycin and bed rest are Inorrect 0.00

equally effective in curing The observed difference betwee
tuberculosis, the probability of isp^Streptomycin−p^control=5155−
observing a difference in the
recovery rates at least as high as
hypothesis test, the p-value is 0.01×2=0.02.
the one observed is 2%.
The difference between the Correct 0.11

survival rates in the control and The observed difference betwee
treatment groups appear to be isp^Streptomycin−p^control=5155−
simply due to chance.
chance.
Hill’s study is observational. Correct 0.11 No, this is an experiment.
Total 0.56 /
1.00
Question ExplanationThis question refers to the following learning objective:
Note that an observed difference in sample statistics suggesting dependence between variables
may be due to random chance, and that we need to use hypothesis testing to determine if this
difference is too large to be attributed to random chance. Set up null and alternative hypotheses for
testing for independence between variables, and evaluate the data support for these hypotheses
using a simulation technique.

数据分析习题

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

数据分析习题

Caricato da

Copyright:

Formati disponibili

Question 1

Your Answer Score E

number of siblings and year born Inorrect 0.00 T

name and number of siblings

annual income and phone number

phone number and name

Total 0.00 / 1.00

Question ExplanationThis question refers to the following learning objective(s):

Identify variables as numerical and categorical.

Your Answer Score Explanation

explanatory:very happy, pretty

Question ExplanationThis question refers to the following learning objective(s):

Your Answer Score Explanation

Question ExplanationThis question refers to the following learning objective(s):

Your Answer Score

Total 0.00 / 1.00

Question ExplanationThis question refers to the following learning objective(s):

Question ExplanationThis question refers to the following learning objective(s):

Your Score Explanation

Question ExplanationThis question refers to the following learning objective(s):

 control any possible confounders,

Your Answer Score Explanation

0, 25, 50, 100, 125, 150, 1000

Total 1.00 / 1.00

Question ExplanationThis question refers to the following learning objective(s):

Your Answer Score Explanation

Total 1.00 / 1.00

Question ExplanationThis question refers to the following learning objective(s):

The IQR of the

The mean of the

There are no Inorrect 0.00

Question ExplanationThis question refers to the following learning objective(s):

Your Answer Score Explanation

There are about as many houses in

We need to know the standard

Majority of houses in this city cost

Your Answer Score Explanation

14% of Republicans and 58% of Democrats

A histogram or a box plot would be useful

The results of the survey suggest that opinion

Question ExplanationThis question refers to the following learning objective(s):

Your Answer Score Explanation

Streptomycin treatment appears to Inorrect 0.00

If Streptomycin and bed rest are Inorrect 0.00

The difference between the Correct 0.11

Hill’s study is observational. Correct 0.11 No, this is an experiment.

Question ExplanationThis question refers to the following learning objective:

Potrebbero piacerti anche