
Eberhard Karls Universität Tübingen

Seminar für Sprachwissenschaften

Mathematical Methods: Statistics SoSe 2019

Exercise Sheets 5 – Statistics

Students: Šejla Hadžihasanović (4221623), Jasmin Rittig (4239987),
Mateo Tošić (3892053), Matej Vibovec (3891892)

Professor: Elnaz Shafaei-Bajestan
Philosophische Fakultät, Seminar für Sprachwissenschaften, Quantitative Linguistics

15th July 2019



1 Introduction to Data
Ex. 1 — One of the data files uploaded on Moodle is called county. It contains data for all counties in
the United States. Open a Python shell and address the following questions. Report the answer and
the code you used for each item. (Items g and h need no code.)
a) import the pandas library, and load the data stored in the county.csv file into a pandas dataframe.
(2 points)
>>> import pandas as pd
>>> data = pd.read_csv("county.csv")

b) how many observations and how many variables are there in this data matrix? (3 points)
3142 observations and 15 variables:
>>> data.shape
(3142, 15)

c) what are the names of all of the variables available for each county? (3 points)
>>> print(data.columns)
Index(['name', 'state', 'pop2000', 'pop2010', 'pop2017', 'pop_change',
       'poverty', 'homeownership', 'multi_unit', 'unemployment_rate', 'metro',
       'median_edu', 'per_capita_income', 'median_hh_income', 'smoking_ban'],
      dtype='object')

d) four of the variables are pop2017 (population in 2017), unemployment_rate (unemployment rate
in 2017), metro (whether the county contains a metropolitan area), and median_edu (median
education level in the range from 2013 to 2017). Look at the first 3 rows of the dataframe
and specify the type of the variable (either continuous numerical, discrete numerical, ordinal, or
nominal) for these four variables. (6 points)
>>> data.loc[0:3, ['name', 'state', 'pop2017', 'unemployment_rate', 'metro', 'median_edu']]
             name    state   pop2017  unemployment_rate metro    median_edu
1  Autauga County  Alabama   55504.0               3.86   yes  some_college
2  Baldwin County  Alabama  212628.0               3.99   yes  some_college
3  Barbour County  Alabama   25270.0               5.90    no    hs_diploma

The variable pop2017 is discrete numerical, unemployment_rate is continuous numerical, metro is dichotomous nominal, and median_edu is ordinal.
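
As a side note (not required by the exercise), pandas only distinguishes numeric from object storage types, so the statistical classification above is a judgement about the variables themselves rather than something the dtypes can decide; a quick way to inspect the stored types is:

>>> data[['pop2017', 'unemployment_rate', 'metro', 'median_edu']].dtypes   # numeric vs. object columns only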

e) what are the mean, median, mode, variance, and standard deviation statistics for the variable
multi_unit (percent of living units that are in multi-unit structures, e.g. apartments, condos)?
(10 points)
>>> data['multi_unit'].mean()
12.321896880967534
>>> data['multi_unit'].median()
9.7
>>> data['multi_unit'].mode()
0    5.5
dtype: float64
>>> data['multi_unit'].var()
86.30790623832837
>>> data['multi_unit'].std()
9.290204854486706
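
As a quick consistency check (an aside, not part of the submitted answer), the standard deviation is the square root of the variance, and describe() bundles several of these summaries in one call:

>>> import numpy as np
>>> np.sqrt(data['multi_unit'].var())   # equals the standard deviation reported above
9.290204854486706
>>> data['multi_unit'].describe()       # count, mean, std, min, quartiles, max (output omitted)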


f) make a scatterplot to study the relationship between two numerical variables homeownership
(percent of the population that lives in their own home or lives with the owner, e.g. children
living with parents who own the house) and multi_unit in the data. Describe the relationship
between the variables. (7 points)
>>> import matplotlib.pyplot as plt
>>> data = pd.read_csv("county.csv")
>>> data.plot.scatter(x='homeownership', y='multi_unit')
<matplotlib.axes._subplots.AxesSubplot object at 0x11a542518>
>>> plt.savefig('homeownership-multi_unit.png')

These two variables are negatively associated: the higher the homeownership percentage, the lower the
percentage of living units that are in multi-unit structures. This suggests that apartments and
condos are more likely to be rented than single-unit houses.
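
To put a number on this negative association (not asked for in the exercise), the Pearson correlation between the two columns can be computed directly; the value itself is omitted here:

>>> data['homeownership'].corr(data['multi_unit'])   # expected to be clearly negative, matching the downward trend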
g) in the question “If there is an increase in the median household income in a county, does this
drive an increase in its population?”, which variable is suspected to be the explanatory variable
and which one is suspected to be the response variable in this hypothesized relationship? (2
points)
The variable median_hh_income is suspected to be the explanatory variable, and pop_change is
suspected to be the response variable.
h) what type of study (observational versus experiment) is this? (2 points)
This is an observational study because the data was collected in a way that did not directly
interfere with how the data arose.
i) Histograms provide a view of the data density. Provide a histogram for the variable multi_unit.
Which bins have higher bars? Histograms are also convenient for understanding the shape of the
distribution. Discuss the shape of the data distribution. (8 points)


>>> plt.close('all')
>>> data['multi_unit'].hist()
<matplotlib.axes._subplots.AxesSubplot object at 0x119ae8320>
>>> plt.savefig('multi_unit-hist.png')

Most of the values lie below 20%, far fewer fall between 20% and 40%, and very few are above 40%. The
distribution is strongly skewed to the right.

j) how can you solve the skew problem observed in the previous part? Provide a second histogram
for the transformed data and discuss the shape. (10 points)
A bit of help: if you use the transformation technique suggested in class, you need to shift all
the datapoints by 1 (add a +1 somewhere in the process) because the function is not defined for
zero but there are zero values in the data.
It is useful to transform the data with the natural logarithm. Since some values are zero, we shifted
all values by 1 before taking the logarithm (log(0) is undefined). The transformed data is much more
symmetric, and structure that was squeezed into the lowest bins is now visible.
>>> plt.close('all')
>>> import numpy as np
>>> data['multi_unit_log'] = np.log(data['multi_unit'] + 1)
>>> data['multi_unit_log'].hist()
<matplotlib.axes._subplots.AxesSubplot object at 0x119ae8828>
>>> plt.savefig('multi_unit_log-hist.png')
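
Equivalently (an aside, not part of the submitted answer), numpy provides log1p, which computes log(1 + x) in one step:

>>> data['multi_unit_log'] = np.log1p(data['multi_unit'])   # same result as np.log(data['multi_unit'] + 1)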

Ex. 2 — A university wants to determine what fraction of its undergraduate student body supports a
new 20€ annual fee to improve the student union. For each proposed method below, indicate whether
the method is reasonable or not. (6 points)
a) Survey a simple random sample of 500 students.
Reasonable.

b) Stratify students by their field of study, then sample 10% of students from each stratum.
Reasonable.

c) Cluster students by their ages (e.g. 18 years old in one cluster, 19 years old in one cluster, etc.),
then randomly sample three clusters and survey all students in those clusters.
Not reasonable. Cluster sampling only works well when each cluster resembles the others, so that
surveying just a few clusters is enough. Clusters defined by age are not similar to one another,
since opinions may differ by age, so sampling only three age groups could bias the result.
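
Although the exercise asks for no code, designs a) and b) map directly onto pandas sampling helpers; a minimal sketch with a hypothetical student dataframe (groupby(...).sample requires pandas 1.1 or newer):

>>> import numpy as np, pandas as pd
>>> rng = np.random.default_rng(0)
>>> # hypothetical undergraduate student body of 10,000, for illustration only
>>> students = pd.DataFrame({'field_of_study': rng.choice(['linguistics', 'biology', 'law', 'physics'], size=10000)})
>>> srs = students.sample(n=500, random_state=1)                                    # a) simple random sample of 500
>>> strata = students.groupby('field_of_study').sample(frac=0.10, random_state=1)   # b) 10% from each stratum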


Ex. 3 — Describe the distribution in the histograms in figure 1 and match them to the box plots. (3
points)

Figure 1: Mix-and-match. 3 histograms on the left and their corresponding box plots on the right.

a) The distribution is symmetric and most of the data is in the middle (similar to the normal
distribution). It is represented by the box plot (2).

b) The distribution is roughly uniform, spread fairly evenly across its range. It is represented by the box plot (3).

c) The distribution is skewed to the right. It is represented by the box plot (1).
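
As an aside, the mapping can be reproduced with simulated data (not the data behind Figure 1): a right-skewed sample shows a histogram with a long right tail and a box plot with a long upper whisker and high outliers.

>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> rng = np.random.default_rng(0)
>>> sample = rng.exponential(scale=2.0, size=500)   # right-skewed sample
>>> fig, (ax1, ax2) = plt.subplots(1, 2)
>>> ax1.hist(sample)
>>> ax2.boxplot(sample)
>>> plt.savefig('hist-boxplot-demo.png')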

2 Statistical Modeling
Ex. 4 — In Figure 2, six different models for six different artificial data sets are shown. State for each
model whether or not it is a good one. If the model is not good, explain. (12 points)

a) This model is not good. It contains only the intercept but it should also have a slope.

b) This model is good.

c) This model is not good. It should not be linear because the data relation is exponential.

d) This model is not good. There should be two lines.

e) This model is good.

f) This model is not good. There should be two lines and the slopes should be 0.
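
For models like (d)–(f), "two lines" means that a grouping factor enters the model. A minimal sketch with statsmodels formulas and synthetic data (an assumed toolchain, not the data shown in Figure 2): an additive factor gives separate intercepts with a shared slope, while an interaction gives separate slopes as well.

>>> import numpy as np, pandas as pd
>>> import statsmodels.formula.api as smf
>>> rng = np.random.default_rng(1)
>>> df = pd.DataFrame({'x': np.tile(np.arange(10.0), 2), 'group': ['A'] * 10 + ['B'] * 10})
>>> df['y'] = np.where(df['group'] == 'A', 1.0, 2.0) * df['x'] + rng.normal(0, 0.5, size=20)
>>> smf.ols('y ~ x + group', data=df).fit().params   # two intercepts, one shared slope
>>> smf.ols('y ~ x * group', data=df).fit().params   # two intercepts and two slopes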


Ex. 5 — In Figure 3 six different fake data sets are shown. Draw straight lines on them and state
how many different parameters you need in order to describe the straight lines, i.e. how many different
slopes and intercepts. Judge for every data set if a linear model is a good approximation. (12 points)
a) Draw linear lines into the data points. Draw as few as possible to capture all the structure in
the data points.
b) State how many parameters you need to specify the lines, i.e. how many different intercepts
and slopes.
c) Judge if the structure in the data points can be modeled with straight lines.
d) Give the prediction(s) of your model at the value x = 6.

a) 2 parameters: 1 intercept and 1 slope. Linear. f(6) = 22

b) 2 parameters: 1 intercept and 1 slope. Not linear. f(6) = 0

c) 2 parameters: 1 intercept and 1 slope. Linear. f(6) = −5

d) 3 parameters: 2 intercepts and 1 slope. Linear. f1(6) = −2.5, f2(6) = 2.5

e) 4 parameters: 2 intercepts and 2 slopes. Linear. f1(6) = −2.5, f2(6) = 17.5

f) 4 parameters: 3 intercepts and 1 slope. Linear. f1(6) = 2.5, f2(6) = 7.5, f3(6) = 12.5
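
All predictions above come from evaluating y = intercept + slope · x at x = 6; the actual intercepts and slopes are read off the lines drawn in Figure 3 and are not listed in the text, so the sketch below uses purely hypothetical values:

>>> def predict(intercept, slope, x):
...     """Prediction of a straight line y = intercept + slope * x."""
...     return intercept + slope * x
...
>>> predict(1.0, 2.0, 6)   # hypothetical line with intercept 1 and slope 2
13.0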

Ex. 6 — What are the general steps in order to find the best model? Which two models do you want to
come up with first? (6 points)
1. Create the null model.
2. Create a saturated model.
3. Find the model that is significantly better than the null model and at the same time not significantly worse than the saturated model.
4. Check the residuals, whether the model makes sense, and whether the predictions of the model lie within the range of the data points.
5. If the model check succeeded, continue interpreting the model; otherwise come up with a different model.
6. (Optionally refit your model without bad data points in order to remove their stress on the fit; inspect the points you are removing!)
7. Interpret the parameter estimates of the refitted model.
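
A minimal sketch of steps 1–3 with statsmodels and synthetic data (an assumed toolchain, not prescribed by the exercise): fit the null model (intercept only) and a richer model, then compare them with an F test; in practice the richer end of the scale is the saturated model with all available predictors.

>>> import numpy as np, pandas as pd
>>> import statsmodels.formula.api as smf
>>> from statsmodels.stats.anova import anova_lm
>>> rng = np.random.default_rng(0)
>>> df = pd.DataFrame({'x1': rng.normal(size=50), 'x2': rng.normal(size=50)})
>>> df['y'] = 2.0 * df['x1'] + rng.normal(size=50)
>>> null_model = smf.ols('y ~ 1', data=df).fit()        # step 1: null model (intercept only)
>>> rich_model = smf.ols('y ~ x1 + x2', data=df).fit()  # step 2: richer model with all predictors
>>> anova_lm(null_model, rich_model)                    # step 3: F test; a small p-value favours the richer model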

