Acknowledgements:
Prof Deborah Black, Med program convenors, students and QMP tutors
IBM SPSS Statistics Version 24
gain confidence and experience in using SPSS for data handling and manipulation.
learn about data collection and data variables.
learn how to examine data effectively using basic statistical techniques.
gain skills in basic statistical analysis and interpretation in SPSS.
experience research teamwork.
The dataset
This is an SPSS dataset that we have simulated to reflect the demographics and risk factors for disease found in a
national survey published in 2012 in Indonesia (see footnote 1). As this is simulated, there are no issues with multiple students
handling data across multiple devices and in multiple locations. In addition, we have been able to narrow down the
many variables and outcomes to make this a manageable and profitable learning experience.
Group work
From the first Research Skills practical in Society & Health you will need to work in your SH project groups. These will
also be your project groups in BGDB.
There will be 3 groups per scenario group, with each taking one of the 3 focus topics, so that all 3 topics are
represented in each scenario group:
Assessment
Each group will be expected to submit a formative assessment via your SH course Moodle.
This is due directly after your group's third and final Research Skills practical in week 6 of SH.
This submission is based on the work done during these three practical sessions.
Your submission will form the basis of your BGDB child health group project with formative feedback being
provided by QMP tutors by week 1 of the BGDB course.
Key References
For other references, see QMP Moodle, Resources for Research area.
1
Statistics Indonesia (Badan Pusat Statistik, BPS), National Population and Family Planning Board (BKKBN), Kementerian
Kesehatan (Kemenkes, MOH), and ICF International. 2013. Indonesia Demographic and Health Survey 2012. Jakarta, Indonesia:
BPS, BKKBN, Kemenkes, and ICF International. Retrieved from:
http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm
Research Skills Instructions, March 2017 Page 3
Research Scenario
Your Research Group
Your project group is made up of public health researchers who work for a local government unit on a semi-rural
island of Indonesia. Its population is only 200,000. This is a very small local government unit compared to the largest
local government unit size of ~4 million people, and it is below the median size of these units across the nation
(median: ~280,000 people). The smallest unit has only ~12,000 people.
The Island
The island is quite poor but has reasonable school attendance and a good main hospital. However, there is some
very remote and rugged terrain on the island, with many small villages in the hills away from the sea.
You received this data from the group who published the Indonesia Demographic and Health Survey 2012 (see footnote 2).
The dataset consists of all 1032 local households who took part in the survey, who had at least one child living at
home with them. (Note, limits on parent ages were imposed as per the original survey which was looking specifically
into reproductive and family topics).
You hope to write this up as a journal article and have set your sights on the respected, international Taylor &
Francis publication: Paediatrics and International Child Health (formerly the Annals of Tropical Paediatrics), see:
(http://www.tandfonline.com/loi/ypch20).
Also, the survey team were able to collect extra information for you about the eldest child in each family (up to age
18 years). So you have data on the eldest child's 1-year immunisation status and also their status regarding
diarrhoeal disease and acute respiratory infection (ARI) (i.e. not just for 1-year-olds and under-5-year-olds
respectively, as in the original survey).
In addition, the status of the rest of the children in the family for these conditions is included. For diarrhoea and ARI
status this is provided as YES/NO variables, with YES indicating that at least one other child in the family meets the
conditions for the disease. Another variable indicates the highest immunisation status among any other children in
the family.
>>> See the variable list on the next page.
2
Statistics Indonesia (Badan Pusat Statistik, BPS), National Population and Family Planning Board (BKKBN), Kementerian
Kesehatan (Kemenkes, MOH), and ICF International. 2013. Indonesia Demographic and Health Survey 2012. Jakarta, Indonesia:
BPS, BKKBN, Kemenkes, and ICF International. Retrieved from:
http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm
List of Dataset Variables
Note: This simulated dataset is based on data published in this report: Statistics Indonesia (Badan Pusat Statistik, BPS),
National Population and Family Planning Board (BKKBN), Kementerian Kesehatan (Kemenkes, MOH), and ICF
International. 2013. Indonesia Demographic and Health Survey 2012. Jakarta, Indonesia: BPS, BKKBN, Kemenkes, and ICF
International. Retrieved from: http://dhsprogram.com/publications/publication-fr275-dhs-final-reports.cfm
Spouse variables
Spouse.Sex: Sex of spouse. 1 = Male; 2 = Female
Spouse.Age: Age of spouse. Range >16 years
Spouse.Education: Highest level of education of spouse. 1 = No education; 2 = Some primary; 3 = Complete primary; 4 = Some secondary; 5 = Complete secondary; 6 = More than secondary
Spouse.Occupation: Current occupation of spouse. 1 = N/A
Child Variables
Child1.Age: Child 1's age (years). Range 0 - 18 years
On the lab PCs: To open IBM SPSS, go to the START button, select Programs, then select IBM SPSS Statistics => IBM SPSS
Statistics 24. SPSS should open, but note that the launch can be slow at times. Just be patient.
With your laptops, unless you have SPSS installed (unlikely), you will need to use myAccess to access a virtual version
of SPSS. You can do this anywhere so long as you have an internet connection. Instructions for setting up myAccess
on your devices are here: https://www.myaccess.unsw.edu.au/. The app that you will need to launch looks like this:
Open the 2017 Phase 1_SimData.sav dataset. UNSW students please access this via the Research Skills area in
QMP Moodle module/ The Research Process.
This data is mostly cleaned in terms of the data variables that you will use in this class, but you will need to check,
clean and prepare the data a little as you go (the instructions will help you to do this).
Note:
You are going to learn the basics of data input, management and manipulation. When you are doing your ILP/Hons
project, you will be analysing real data and may have to create your own dataset structure and manage this during
your research. To do this well, you will need to check carefully that the data is clean by checking for missing and
invalid data. There are some great YouTube videos from good sources showing how to do this systematically and
more thoroughly than we can manage today.
Once SPSS has launched the dataset in a Data Editor viewer window, take a few minutes to just LOOK at the data in
both the Data and Variable views:
These views are accessible via 2 tabs situated along the bottom of the Data Editor window.
To switch quickly between the two views, you can double-click on the variable name at the top of the
column in Data view or the number to the left of the variable name in the Variable view.
Or use: Shortcuts - Control T (PC) or Command T (Mac)
The long way to change views is to go via the View menu at the top of the viewer.
You might not have noticed yet, but another viewer window opened up when you opened up the dataset. This is an
Output document.
An Output document opens every time that you open a dataset. Note: It gets complicated once you
have more than one dataset open. You may need to check which Output document you want your outputs to go to;
essentially, SPSS will record to whichever one was most recently active at the time.
The Output is very useful as it will record almost everything you do to the dataset, so it becomes a record of your
cleaning, editing, and analysis. It also displays all of your analyses and tests, as well as the tables, charts and graphs
that you ask SPSS to produce from the dataset.
On the left-hand side of the Output is a Log record of all your activities on the open dataset(s). You can scroll up and
down to find a particular analysis or process you did and click on it to navigate there in the right-hand window area.
If you right-click on the charts and graphs in the right-hand window, you open an editor window that allows you to
edit these figures, change the axes, labels, colours and so on.
You can also copy the main outputs in this document (using standard shortcuts or using right click and choosing
Copy) and paste these into other documents, such as Microsoft Word. This doesn't seem to be working as well in
v24 of SPSS, so you may have to screenshot tables or use copy/paste special.
You must save this Output document or you will lose all this useful information! See below for best practice saving.
The Output
Save your dataset and output files carefully with sensible names in an appropriate, accessible but secure place.
SPSS can be a bit fussy about filenames, so don't use any punctuation or special characters except _ and keep
names short.
We suggest that you rename your output files with a date and time or version, but be consistent so you can use
these to track what you have done and find your analysis when you need to write it up.
If you make major changes to a dataset you should save it with a different name (just in case of any IT calamities
or mistakes!)
N.B. If you get stuck, ask a tutor if you are in a class, or use the SPSS Help, which has a very useful online help
resource. Please post a question in the QMP Moodle discussion board if you can't work it out at home.
>> Take a quick look at the PowerPoint on Data and Variables, then answer the following questions:
>> In Data View, scroll down to the bottom of the dataset and see how many cases there are.
QUESTION: How many variables are there in total? What do these show? (ANSWER in ACTIVITY 1 APPENDIX)
>> In Variable View, scroll down if necessary to see how many variables there are; the far left-hand column shows
you this.
The main variables that we are using today are below. Have a look at them:
Residence
HH.Age
HH.Sex
HH.Marital.Status
HH.Weight
HH.Height
>> Find (in QMP Moodle) and open the document: 2017 SimData Variables list.xlsx. This lists and gives explanations
for each variable and their categories / values. You will need to refer to this and make notes in this as we go through
the cleaning and analysis.
>> If you need to revise what you know about data types and variables, please see the Data and Variables
PowerPoint show (in the Child Health Project resources area in your QMP Moodle module, under The Research
Process).
>> Click on the Measure cell for the variable Residence. You should see this:
There are 3 types of Measure as shown in this image: Scale, Ordinal and Nominal.
QUESTION: What other properties of the variables can you see in the columns here? What do you think these are
for? (Answer in ACTIVITY 1 APPENDIX)
>> Click on Data => Define variable properties. A window opens that allows you to check all of the
properties of each variable separately or all in one go.
>> To change the labels in this window, follow the steps below. There is a quicker way to do this in the Variable
view (useful for small alterations) by just going into the relevant cells. We will show you how to do this later on.
Let's use the Define Variable Properties window to check on the variable Residence:
QUESTION: Is Residence a nominal, ordinal or scale variable? (ANSWER in ACTIVITY 1 APPENDIX)
1. The Data => Define variable properties window can help you check whether the right measure
is provided for a variable and whether its other properties are correct. You can also label coded data
and view quantitative data.
2. You can look at more than one variable at a time. Hold control/ command or shift (as usual) to
select which variables and take them across from the left hand side and into the right hand
side. Now you can view them one by one!
3. However, if you do this you need to know that once you are finished and happy with your
changes, you must hit the OK button or SPSS will NOT keep your changes! If you hit OK these
changes are kept and can only be undone by re-editing.
>> Try out this process, for all of the variables that you are using today: Residence, HH.Age, HH.Sex,
HH.Marital.Status, HH.Weight, HH.Height. We have left some Labels and Values blank or in need of correcting, so
check them carefully.
Notes on HH.Marital.Status:
For HH.Marital.Status you will find that there are some invalid data values: 11, 22 and 5. These values
are not present in the code options for this survey variable, so they must be typos from the data input
that have not been spotted yet.
You can tidy these up in the Define Variable Properties window by checking the Missing box for these
values.
Another way to deal with this erroneous data is to find the values in the Data View using the Sort function and
set them to missing one by one, by right-clicking on the erroneous value and then selecting Clear from
the menu that comes up. This is not so easy if there are lots of erroneous values!
There are a few other inconsistencies in the variables' data that you can seek out later if you have time and
are interested. Tip: look at the Spouse.Age data compared with HH.Age, and discuss.
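Outside SPSS, the same idea (treat any code that is not in the codebook as missing) can be sketched in a few lines of Python. This is only an illustration: the set of valid codes and the example values below are assumptions made up for the sketch, so check the real codes in the variables list spreadsheet.

```python
# Hypothetical codebook: the real valid codes for HH.Marital.Status are listed
# in the variables list spreadsheet, not here.
VALID_MARITAL_CODES = {0, 1, 2, 3, 4}


def mark_invalid_as_missing(values, valid_codes):
    """Return a copy of values where any code outside valid_codes becomes None."""
    return [v if v in valid_codes else None for v in values]


# Made-up example column containing the three kinds of typo described above.
marital = [1, 2, 11, 0, 22, 5, 3, 1]
cleaned = mark_invalid_as_missing(marital, VALID_MARITAL_CODES)
invalid_count = sum(v is None for v in cleaned)

print(cleaned)
print("Values set to missing:", invalid_count)
```

Note that the invalid values are never replaced with a guess; they are only excluded from later analysis.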
>> Once you've finished checking the data and properties, and added in labels as appropriate for all the variables we
are using today, remember to click OK as you close the window. If you don't, your changes will not be saved!
>> Go back to the Variable view if you are in Data view. Take another look at your variables - can you see that the
value labels you have added are visible and accessible now in the Values column? Can you see any other changes
you have made?
>> The quick and easy way to add or change labels is (in Variable view) to highlight the relevant cell for the variable
in the Values column and click on the little square button (with 3 dots on) that appears in the right-hand side of
the cell. Click on this to open a window where you can add, remove or edit your variable value labels.
>> Try out this method to label the Spouse.Smokes.Cigs variable values. You can type in a label for it by activating
the cell and just typing it directly in. Remember also to check that the measurement label for this variable is correct
first (you can click on that cell to change it directly if you think it is incorrect).
Moving variables order in the dataset. You can also move variables up or down the list to display in a different order
in the Data View.
>> To do this, just click on the far-left numbered button (5) next to Water.Treatment so that the whole row is
highlighted, then hold down the click and move the button up to just below button 3.
In the Data View, take a moment to look at the data. Can you see any immediate possible errors just by looking at it?
>> Using the sort mode, in "Data=>Sort Cases", arrange the HH.Age variable in ascending order. Look at the data in
this variable column. Are there any odd data entries? The range expected is 18-55 years.
>> Do the same with the other variables we are using today.
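The sort-and-look check can also be expressed as a simple range test. The Python sketch below is illustrative only and uses made-up ages; the expected range of 18-55 years is the one given above for HH.Age.

```python
def out_of_range(values, low, high):
    """Return (row_index, value) pairs for non-missing values outside [low, high]."""
    return [
        (i, v)
        for i, v in enumerate(values)
        if v is not None and not (low <= v <= high)
    ]


# Made-up example ages; 5 and 99 fall outside the expected 18-55 range.
hh_age = [34, 18, 55, 5, 41, 99, 27]

print(sorted(hh_age))  # sorting pushes odd values to the ends of the list
flagged = out_of_range(hh_age, 18, 55)
print(flagged)
```

Sorting is handy for eyeballing a column, while the range test scales better when there are many rows.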
1. Make a note of any changes that you make in the dataset as you need to be consistent. Use
the List of Variables Excel spreadsheet that we have provided you with to record new
variables and make notes of any manipulations/adjustments that you make.
2. Remember to keep one version that is totally original and save any changes you make
thereafter as renamed/dated versions.
3. Keeping a team research journal is a good idea: log all major changes and analyses in it.
4. You should also be prepared to mention how you cleaned the data in the
Methods section of your group project report (this will be prepared as a journal
publication).
5. Careful decisions need to be made if you REMOVE or CHANGE any data point(s). You cannot
guess what the value should have been. If you aren't sure what it should be, but it's definitely wrong, then it must be
taken out of analyses by telling SPSS to treat it as a missing value.
You obviously can't give labels to every value in a continuous variable. For example, age in the age variables in this
set is continuous data. Labels are not useful here, but we can create groups for continuous data. For instance, we could
create two age-groups for head of household (e.g. 18-34 = Under 35 yrs, and 35-55 = 35 yrs and over).
QUESTION: Have a think - what other variables could you create new groups for?
Also, for any categorical variable, you can amalgamate codes or create new variables from old ones.
To do this, use: Transform => Recode into same variable or Transform => Recode into different variable.
1. Change a code to a new code number (e.g. if 0 is male and 1 is female but you would rather they were 1 and
2 respectively).
2. Convert a scaled (continuous) variable into a grouped variable (e.g. as shown below in the DEMO for a
different dataset that also has an AGE variable). In the DEMO below, there are only 2 age-groups (OLD and
YOUNG) formed but you could make many and use this for further analysis.
3. Change a variable into a new variable that only includes some of the data. For instance you could make a
new variable that only contains an older age group.
>> NOW VIEW THE DEMO BELOW FOR INSTRUCTIONS ON HOW TO DO THIS (note that this demo is using similar
but different variable data to the data you are using, but the principle is the same).
>> Then, carry out this transformation for our HH.Age variable: create a variable called HH.Age.Groups with two
categories: 18-34 = Under 35 yrs, and 35-55 = 35 yrs and over) and label these accordingly.
https://moodle.telt.unsw.edu.au/pluginfile.php/1361716/mod_resource/content/2/SPSSDtaHandling/Demo1.htm
View Demo
2:19 mins
To recode into the same variable, choose the Recode into Same variable from the
Transform menu.
HOWEVER take care as this will overwrite the original variable!
Therefore, it is good practice to ALWAYS Recode into Different variable as shown in
the demo.
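For comparison, here is a minimal Python sketch of what "Recode into Different variable" does for the two age groups described above. The group codes 1 and 2 and the example ages are assumptions for illustration; the original list is left untouched, which mirrors the advice not to overwrite the original variable.

```python
def recode_age_group(age):
    """Map an age in years to a group code; missing or out-of-range ages become None."""
    if age is None:
        return None
    if 18 <= age <= 34:
        return 1
    if 35 <= age <= 55:
        return 2
    return None


# Assumed group codes; only the label wording comes from the instructions.
AGE_GROUP_LABELS = {1: "Under 35 yrs", 2: "35 yrs and over"}

hh_age = [18, 34, 35, 55, 17, None]  # made-up ages, including an invalid and a missing one
hh_age_groups = [recode_age_group(a) for a in hh_age]  # new variable; hh_age is unchanged
labels = [AGE_GROUP_LABELS.get(code, "Missing") for code in hh_age_groups]

print(hh_age_groups)
print(labels)
```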
>> Check out your Variable and Data Views to find your new HH.Age.Groups variable
<< YOU ARE NOW READY TO MOVE ONTO THE NEXT ACTIVITY >>
QUESTION: How many variables are there in total? What do they show?
ANSWER: There are 33 variables. They are varied: some seem to be categorical and some scale data.
QUESTION: What other properties of the variables can you see in the columns here? What do you think these are
for?
ANSWER:
The columns in the variable view are: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align,
Measure, Role
The properties italicised above are more about the way the data is presented in the Data View.
The properties bolded above are the ones we will concern ourselves with today. They are important in how the
data can be analysed and displayed in the Output.
ANSWER:
Residence is NOMINAL in this survey, as the respondents were categorised into living either in an urban or
rural setting.
There are only 2 categories, so this data is also dichotomous, but it is not represented here as a yes / no answer
(although it could be: think about it!).
QUESTION: Have a think - what other variables could you create new groups for?
ANSWER:
Could do this for all the age/ height / weight variables in this dataset.
Could also do it for categorical variables e.g. for HH.Marital.Status, could collapse down the 5 categories
into 3 e.g. Never married / Married or Living together / Divorced or Widowed.
Now you are ready to examine the data in more detail. Running frequency distributions is a good way to see your
data and also to clean it.
>> Choose Analyze from the top menu then Descriptive statistics => Frequencies
>> Choose the SCALE variables, HH.Age, HH.Height, HH.Weight and use the arrow to move them from the left-hand
column to the right-hand column.
>> Click on the Statistics option at the right-hand side of the screen.
>> In the pop-up box that appears, tick: standard deviation, range, minimum, maximum, mean, median, mode and
skewness
>> Click on Continue for this window, and then OK in the Frequencies box. Your Output Viewer will now pop-up,
showing the long frequency tables for these 3 variables. Above this it displays a Statistics table with the following
information:
>> Look at the Statistics table. Which of the following: mean, median and mode is most useful here? Why?
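If you want to see what sits behind the numbers in the Statistics table, the Python sketch below computes the same summary measures for a small made-up list of heights. The skewness formula used here is the sample-adjusted (Fisher-Pearson) form; that it matches the value SPSS prints is an assumption of this sketch, so treat the SPSS Output as the reference.

```python
import statistics


def sample_skewness(values):
    """Sample-adjusted Fisher-Pearson skewness: n/((n-1)(n-2)) * sum(((x - mean)/sd)^3)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)


def describe(values):
    """Return the summary statistics ticked in the Frequencies: Statistics box."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "skewness": sample_skewness(values),
    }


heights = [150, 155, 155, 160, 165, 170, 180]  # made-up heights in cm
summary = describe(heights)
for name, value in summary.items():
    print(name, round(value, 3))
```

In this made-up example the mean is pulled above the median by the single tall value, which is the kind of pattern to look for when deciding which measure is most useful.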
>> What would you do to analyse these variables for head of household male and female separately?
>> To limit an analysis to a specific group of a nominal variable or a particular value of a scale (continuous) variable,
use: Data => Select cases.
>> In the window that appears choose the condition that you wish to apply. For example, click on the If condition is
satisfied button and then click on the If button.
>> Another window appears called Select cases: if and you can then click on the variable(s) you are interested in
selecting cases from (e.g. HH.Sex). Take this variable across to the clear window.
>> If you run the basic frequencies again now, you will find a big difference in the output! Only male heads of
household are analysed.
>> Now remove the selection (Data => Select cases => All cases) so that you have all the data ready for use. Check
that you have removed the select cases by looking in your Output file and the Data View.
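Select cases is essentially a filter: rows that do not meet the condition are left out of the analysis, but they are not deleted. A small Python sketch of that idea follows. The rows are made up, and the coding 1 = Male, 2 = Female for HH.Sex is an assumption borrowed from the Spouse.Sex coding in the variable list.

```python
# Made-up rows; HH.Sex coding (1 = Male, 2 = Female) is assumed, not confirmed.
households = [
    {"HH.Sex": 1, "HH.Age": 42},
    {"HH.Sex": 2, "HH.Age": 35},
    {"HH.Sex": 1, "HH.Age": 50},
    {"HH.Sex": 2, "HH.Age": 29},
]


def select_cases(rows, condition):
    """Return only the rows for which condition(row) is true; the input is unchanged."""
    return [row for row in rows if condition(row)]


males = select_cases(households, lambda row: row["HH.Sex"] == 1)

print(len(males), "of", len(households), "cases selected")
```

Because the full list is untouched, "removing the selection" simply means going back to analysing `households` instead of `males`.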
The other way to select particular groups to run your analyses by, is to split the file using: Data => Split File
>> In the Split File window, choose how you want your analysis to take place: by Compare groups (which will
compare them directly in the same tables, etc.) or by Organizing output by groups (which will display them
separately in the Output file). Maybe try both of these ways and see what the difference is.
Click on one of the buttons now and then take the HH.Sex variable across to the Groups Based on: window. Leave
the rest as it is (SPSS will have to sort your file by the grouping variable you have chosen) and click on OK.
>> If you run the basic frequencies again now you will find a big difference in the output! Try the whole process
again for the other way of organising the output.
>> Then REMOVE the Split File by either taking the variable back to the LH side or by clicking on Reset and OK.
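Split File can be thought of as grouping the rows by a variable and then running the same analysis once per group. The Python sketch below shows that idea with made-up rows; the HH.Sex codes are again an assumption (1 = Male, 2 = Female).

```python
import statistics
from collections import defaultdict


def split_by(rows, key):
    """Group rows into a dict keyed by the value of the grouping variable."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Made-up rows; the HH.Sex coding is assumed for illustration.
households = [
    {"HH.Sex": 1, "HH.Age": 42},
    {"HH.Sex": 2, "HH.Age": 35},
    {"HH.Sex": 1, "HH.Age": 50},
    {"HH.Sex": 2, "HH.Age": 29},
]

# Run the same summary (count and mean age) separately for each group.
group_summaries = {}
for sex_code, rows in sorted(split_by(households, "HH.Sex").items()):
    ages = [row["HH.Age"] for row in rows]
    group_summaries[sex_code] = (len(ages), statistics.fmean(ages))

print(group_summaries)
```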
<< YOU ARE NOW READY TO GO ONTO ACTIVITY 3 AND DO SOME MORE ANALYSIS USING HISTOGRAMS! >>
>> Let's split the file again by HH.Sex: Data => Split file => Organize output by groups. Place HH.Sex in the "Groups
based on:" window and click OK.
>> Choose Analyze from the top menu then Descriptive statistics and Frequencies.
>> Choose the original HH.Age variable and use the arrow to move this from the left-hand column to the right-hand
column.
>> Press the Statistics option and choose: standard deviation, range, minimum, maximum, mean, median, mode
and skewness.
>> Then press Continue and choose Charts from the menu at the right hand side.
>> Select the Histograms tick box and tick Show normal curve on histogram then Continue, then OK on the
original window.
>> In your Output viewer, you will find the frequencies and histograms for HH.Age presented separately for men and
women who are head of household.
QUESTION: What do the 2 charts (one male data, one female data) show you? Is there anything notable here? Why
are we looking at this variable separately?
>> Check the skewness values for HH.Age for male and female. Are these values between -1 and +1? What do the
histograms and normal curves look like?
QUESTION: Do you think this data is normally distributed, close to it or far from it?
>> If you have time, why don't you try this out for HH.Height and HH.Weight, split by HH.Sex (male / female) and by
HH.Age.Groups? TIP: it is easier to view and compare the groups if you use the Compare groups option in the Split File
window.
<< CHECK THE ANSWERS TO THIS ACTIVITY IN THE APPENDIX BEFORE YOU MOVE ONTO LOOKING AT
CATEGORICAL DATA IN ACTIVITY 4 >>
HOWEVER, BEFORE YOU MOVE ON: Remove the Split File conditions (for HH.Sex and HH.Age.Groups) by going to
Data => Split File and clicking on the reset button. Or you can click on the variables in the small window and send
them back using the arrow on the left to the main list. Either of these methods will remove the split.
>> Check your data is without any conditions and move on to Activity 4.
QUESTION: What do the 2 charts (one male data, one female data) show you? Is there anything notable here? Why
are we looking at this variable separately?
ANSWER: The data looks similar in terms of the shape of the distribution (a roughly normal, bell-shaped
curve), but there is more male data than female data, and the female data has a lower mean (36.2
years compared to 38.5 years for the men).
QUESTION: Do you think this data is normally distributed, close to it or far from it?
ANSWER: The skewness for both the male and female head of household age data is very low (<0.15) and
positive. The data distribution isn't a normal curve for either of these, but it does mostly follow the curve and
looks somewhat symmetrical and bell-shaped. You would be hard-pressed to say this was normally distributed
though.
ANSWER:
For height
The histograms are much closer to normal distribution curves than the age distributions were.
The mean height of men and women is very different (women are shorter).
There is a tiny difference in the mean between the younger and older groups of both sexes, with the younger
groups being slightly taller. How might you test to see if this difference is statistically significant?
(Need a hint? Ask a tutor!)
>> Choose Analyze from the top menu then Descriptive statistics and Frequencies.
>> Remove any variables remaining from previously in the right-hand window. Also remove any Statistics choices in
that window by un-ticking them (we are not interested in these as our data is NOT continuous!).
>> Choose the variable HH.Sex and use the arrow to move it from the left-hand column to the right-hand window.
>> Choose Charts from the menu and then from Chart Type, select the button for Bar charts and for Chart Values,
choose either Frequencies or Percentages. Click on Continue and then OK in the original Frequencies window.
>> Your Output will appear again with more tables and charts, this time for HH.Sex. See screenshot on next page:
>> Repeat this process for all the categorical variables in the dataset (be brave!) by taking all of them across to the
right-hand of the Frequencies window. SPSS will do them all at once. Take a look at your Output.
>> What does this show? Do you notice anything interesting? Discuss your findings in your group, then check with a
tutor that you are on the right track.
If you are uncertain about anything make sure you check it out with a tutor.
<< YOU ARE NOW READY TO MOVE ONTO THE INFERENTIAL TESTS IN ACTIVITIES 5-7 >>
>> Open up a crosstab window with the commands: Analyze => Descriptive Statistics => Crosstabs.
>> Select the variable for the Rows (always the variable you are interested in seeing if there is an effect on, i.e.
HH.Age.Groups) and move it across. Move HH.Sex into the Columns area (this variable is the grouping variable, i.e.
the variable you think might be causing an effect on the other one(s)).
>> Click on the Statistics option and then choose Chi-Square. Click Continue.
>> Choose the Cells... option in the original window and in the Cell Display window, select Observed and Expected
Counts and Row, Column and Total Percentages, then click OK.
Having all these results in the Output tables will help you to interpret the chi-square results:
SPSS will now show the numbers expected (Expected count) if there is no difference between the sexes for
each HH.Age.Groups category. In this way we can compare these values to the Observed (the actual values)
in the data.
In addition, you will be able to see the % across and down in the chi-square cross-tabulation that SPSS
creates for this analysis. This helps us to understand what the difference is, if there is one.
OK, so I am sure that you are asking: what does this all mean?
This cross-tabulation displays the number of cases in each category defined by two or more grouping variables. So
here we have the number of men and women that are in each age-group, young or old.
We also have been given numbers in the cells that show what would be expected if there was no difference
between the two groups (this is the basis of the chi-square test).
Essentially, a chi-square test is used to test the hypothesis that the row and column variables in a cross-tabulation
are independent. A low p-value (conventionally, a significance value below 0.05, or 5%) indicates that
there may be some relationship between the two variables, i.e. they may be dependent.
However, while the chi-square test may indicate that there is a relationship between our two variables
here, it does not indicate the strength or direction of the relationship. We have to look at the values in the table to
work that out.
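To make the link between Observed counts, Expected counts and the chi-square value concrete, here is a minimal Python sketch for a 2x2 table. The counts are made up for illustration (they are not the results from this dataset), and the calculation is Pearson's chi-square without a continuity correction. The p-value shortcut using `math.erfc` is only valid for 1 degree of freedom, i.e. a 2x2 table.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square for a 2x2 table of observed counts.

    Returns (expected_counts, chi_square, p_value). No continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    # Upper-tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p_value


observed = [[30, 20], [20, 30]]  # made-up counts: rows = age group, columns = sex
expected, chi2, p_value = chi_square_2x2(observed)

print("Expected:", expected)
print("Chi-square:", round(chi2, 3), "p:", round(p_value, 4))
```

The further the Observed counts sit from the Expected counts, the larger the chi-square value and the smaller the p-value; the direction of the difference still has to be read from the table itself.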
1. Usually we quote the Pearson chi-square result unless there are very small expected numbers
in the cells.
2. The chi-square test is a large-sample test, so when you have smaller sample sizes, an exact
distribution is used instead of the chi-square distribution, with a method called Fisher's
exact test instead of Pearson's.
3. So, you should take the p-value result for Fisher's exact test when more than 20% of the
expected values in the crosstab table cells are < 5.
4. See below the Chi-Square Tests table for an indicator of this (footnote a in the final table).
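The 20% rule in the notes above can be checked by hand from the Expected counts. The Python sketch below is one way to express it; the function names and the two example tables are made up for illustration, and in practice footnote a of the SPSS table gives you the same information.

```python
def expected_counts(table):
    """Expected counts under independence, flattened into a single list."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [r * c / n for r in row_totals for c in col_totals]


def prefer_fisher(table, threshold=5, max_share=0.20):
    """True when more than max_share of cells have an expected count below threshold."""
    expected = expected_counts(table)
    small_cells = sum(e < threshold for e in expected)
    return small_cells / len(expected) > max_share


sparse_table = [[3, 7], [2, 88]]        # made-up counts with small expected values
balanced_table = [[30, 20], [20, 30]]   # made-up counts with large expected values

use_fisher_sparse = prefer_fisher(sparse_table)
use_fisher_balanced = prefer_fisher(balanced_table)
print(use_fisher_sparse, use_fisher_balanced)
```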
So for these results above, the chi-square value is significant at 5% level of significance as indicated by the p
(Asymp. Sig. (2-sided)) = 0.001 in the Chi-Square Tests table.
>> Note: SPSS will report a value of p < 0.0005 as 0.000. When reporting a very statistically significant result like
this, you should usually present it as an inequality (p < 0.001). You must not report it as 0.000 because, in theory,
a p-value of zero would mean that you are certain, and we can never be that certain when using a
process of sampling!
So, this result shows that we have demonstrated with this test a significant statistical association between the age
group (young/ old) and sex of the head of household in this dataset.
>> Considering this result, how do we explain these findings? What is the association? We can view the data in two
ways: as the data are presented in a table, we can read across the rows or down the columns.
>> To interpret whether there is a difference in the composition of the age groups, we can look across the rows.
To find out if there is a difference between men and women in the young group or the old group, we should examine
the data within the age group variable (% Young / Old groups) for male and female. This shows us that only 27.1%
of the older group of heads of household are women and most are men (72.9%). Among younger heads of
household, a larger share are women (36.6%), but men (63.4%) still outnumber women.
>> To interpret this another way, first look at the % within Head of household sex for each type of age group, this
gives you the % of the participants for this variable grouping who are men and women.
What you can see is that a higher percentage of women than of men fall in the young group: 43.5% of all women are
in this young group, whereas only 33.1% of all men are in the young group. The opposite (not surprisingly; see
footnote 3) is seen in the older group, where 66.9% of men are in the older group, whereas only
56.5% of women are older.
3
Remember: you are looking at the data in a 2x2 table, so these proportions are all linked and can be used to explain
carefully what is present in the data!
QUESTION: So, what does this actually suggest for our sample data?
Well, it shows that for the participants in our sample there was a statistically significant
association between age group (young/old) and sex of the head of household, with a suggestion (from examining the
figures in the cross-tabulation) that a higher proportion of the women who are heads of households are in the
young group, compared with the men.
We can speculate why it may be that women are more likely to be young heads of households, or we can flip it
around and speculate why men might be likely to be older heads of households.
Take another look at your outputs for marital status frequencies and bar charts by split file of HH.Sex and
HH.Age.groups. This might give you some other ideas for possible reasons for this difference.
The degrees of freedom (df) for a 2x2 Chi-square test are df = (rows - 1) x (columns - 1) = (2 - 1) x (2 - 1) = 1
(this is given in the Chi-square results table above).
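As a rough sketch of what sits behind these numbers, the Python below works out the degrees of freedom and the Pearson Chi-square statistic by hand. The counts are HYPOTHETICAL (not the real dataset), the function names are made up for this example, and no continuity correction is applied, so do not expect the statistic to match your SPSS output.

```python
def degrees_of_freedom(table):
    """df = (rows - 1) * (columns - 1) for a table given as a list of rows."""
    rows, cols = len(table), len(table[0])
    return (rows - 1) * (cols - 1)


def chi_square(table):
    """Pearson chi-square: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if age group and sex were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical 2x2 counts: rows = Young, Old; columns = Male, Female.
table = [[237, 137], [480, 178]]

print("df =", degrees_of_freedom(table))
print("chi-square =", round(chi_square(table), 2))
```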
The QMP Online Tutorial 8 on Chi-square tests gives this explanation in full detail.
Also, there is a detailed explanation on how to work out Odds ratio and confidence intervals for these results, in a
PowerPoint show in the group project area in the QMP Moodle.
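For orientation only, here is a minimal Python sketch of one common way to get an odds ratio and an approximate 95% confidence interval from a 2x2 table (the log, or Woolf, method). This is an assumption on our part: it may not be exactly the method shown in the PowerPoint, and the counts are HYPOTHETICAL, not the real dataset.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with an approximate 95% CI (log method).

    a, b = group 1 with / without the characteristic
    c, d = group 2 with / without the characteristic
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio).
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: women young / old, then men young / old.
or_value, ci_lower, ci_upper = odds_ratio_ci(137, 178, 237, 480)
print(f"OR = {or_value:.2f}, 95% CI {ci_lower:.2f} to {ci_upper:.2f}")
```

If the confidence interval does not include 1, the odds of being in the young group differ between the two sexes at the 5% level.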
<<IF YOU UNDERSTAND ALL OF THIS, YOU ARE NOW READY TO MOVE ONTO DOING A T-TEST IN ACTIVITY 6>>
The Student's independent t-test can be used to test whether there is a statistically significant difference in the
population means for two independent groups.
>> By independent t-test we mean that we run an unpaired t-test with a grouping variable (here HH.Sex), whose
values are separate and independent, and we examine the groups to see if their mean heights are different.
By independent and unpaired we mean that we have not recruited one sample and measured each individual twice
for height (e.g. before and after an intervention, or once when young and again later when old). If we had, this
would be paired data and we would need to use a paired t-test to analyse it. In looking at
men's and women's height we are examining totally different and independent data!
>> In health and medicine, statistical significance is conventionally assumed when a 'p-value' is less than 5%. A
'p-value' (in the case of a t-test) is the probability of seeing a difference between the two group means at least
as large as the one observed if, in fact, there is no difference in the population means, i.e. if the difference
observed occurred by chance.
>> Let us use a t-test to see if there is a statistically significant difference in the population means of the variable
HH.Height for the Male and Female heads of household (the variable HH.Sex) in the dataset. This is a sensible
research question, and we have good reason to believe that height is considerably affected by gender.
>> We should first check to see if height is normally distributed by checking the histogram with a normal curve and
the skewness and central tendency statistics. If you haven't done this yet, do it for HH.Height for each
HH.Sex value (Male and Female) separately.
Our null hypothesis (H0) is that there is no difference in the mean height of male and female heads of household in
the population from which our dataset came.
Our alternative hypothesis (H1) is that there is a difference in the mean height of male and female heads of
household in the population from which our dataset came.
>> We have 2 separate, unpaired groups of males and females so will be using the independent t-test, which is in
effect a two-sample t-test.
Note: There is also a one-sample t-test that can compare a sample mean to a standard (e.g. a known population
value) or a hypothesised value.
>> We might guess that the difference is that males are taller in fact. This makes a lot of sense, as there is evidence
to show that in most populations, the males are taller than females on the whole (do you know why?).
>> On occasions in science and medicine, when we are certain that the difference can only act in one direction, we
could use a one-tailed test. This is most commonly used in biological experiments where only one direction is
physically/ physiologically possible.
So, we only use a one-tailed test in controlled experimental circumstances where we are certain of the direction of
the difference under examination. An example of a one-tailed, one-sample t-test would be where we are testing a
group sample mean to see if it is higher than a standard value, e.g. testing exam results for a Phase 1 cohort (say a
mean of 68.2%) in relation to the pass mark of 50%.
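The exam example can be sketched as a one-sample, one-tailed t-test in Python. The ten marks below are HYPOTHETICAL (picked only so that their mean is 68.2), and the critical value of 1.833 is the standard one-tailed 5% value for 9 degrees of freedom from t tables.

```python
import math
from statistics import mean, stdev

# Hypothetical exam marks (%), chosen so that the mean is 68.2.
marks = [62, 71, 55, 80, 68, 74, 59, 66, 77, 70]
pass_mark = 50  # the standard value we compare against

n = len(marks)
sample_mean = mean(marks)
standard_error = stdev(marks) / math.sqrt(n)

# One-sample t statistic: how many standard errors above the pass mark?
t_stat = (sample_mean - pass_mark) / standard_error
df = n - 1

# One-tailed 5% critical value for df = 9, taken from standard t tables.
critical_t = 1.833

print(f"mean = {sample_mean:.1f}, t = {t_stat:.2f}, df = {df}")
print("mean is significantly above the pass mark:", t_stat > critical_t)
```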
>> It is pretty simple to carry this out in SPSS but you do need to take care to have cleaned your data and to choose
the correct variables for your question.
>> Our question is the alternative hypothesis above. However, we know that height is affected by age as well - so we
could split the file by our young-old age groups as well. Maybe try this afterwards and see if it makes a difference.
>> You will find the Independent-Samples T Test in the Analyze => Compare Means option.
>> In the t-test, you define the test variable as HH.Height and the grouping variable as HH.Sex. It is necessary to
state the values of the grouping variable by defining the two possible values to be used in the t-test. In this case it is
1 = Male, 2 = Female.
>> Leave the rest of the options as the default (take a look and see what these are, but don't touch anything).
>> The results for this analysis are on the next page (remember your final data results may look slightly different if
you have been playing with the dataset).
You will note in the top table in the SPSS output that the men have a higher mean height than women in this dataset.
Can you see by how much taller they are on average? Now we need to see what the test results are:
Levene's test is applied automatically by SPSS to check whether the variances (spread of the data distribution) of
our two groups (Male and Female) are equal, so that the t-test can be corrected if they are not.
This is just like other statistical tests of difference that you have come across so far. The F test value gives a p value
(Sig) for our data. If the significance value for Levene's test is high (conventionally greater than 0.05) we use the
results that assume equal variances for both groups (i.e. there is no difference in the variances between the groups).
If you remember, variance is a measure of the spread of the variable and if p >0.05, then that would mean that the
data variance for height in both male and female in this dataset is similar enough and we can use the top answers in
the table for the t-test result (i.e. the crude t-test results).
In the analysis of our data, Levene's test is NOT significant at the 5% level (F = 1.837, p = 0.176) i.e. p>0.05, so, we
read the results from the top (crude results) row, as we assume that the variances are equal (i.e. there is no
statistical difference in the variances of the groups of Male and Female Heads of households, for their height).
ANSWER:
The t-test is a small-sample test, most useful if the total sample size (both groups) is <60 and especially if <30. With
a sample as large as ours, the t distribution is almost identical to the normal distribution, so the test that SPSS has
carried out is essentially a z-test.
QUESTION 2:
So what is the t-test result for our question?
Take a look at the tables for yourself and work out the result and what this actually means for our male and female
groups before you take a look at the answer given below.
The top table gives us the means of the heights for the 2 sexes: male = 170.65 cm, and female =158.10 cm.
The Independent Samples Test results in the second table show that t = 14.854 (df = 1030*) with a statistical
significance of p<0.001**. This is a statistically significant result.
NOTE, though that the t statistic would be a lower value if we had needed to use the correction for unequal
variances. Sometimes this will cause our p value to increase enough to become not statistically significant at the 5%
level.
However, for our results here, we can conclude that the difference between the mean heights of Male and Female
Heads of households that we see (of 12.6 cm) is very likely to be a real difference.
This p value is <0.05 (remember that 5% is our conventional level for accepting or rejecting the null hypothesis) so
we can reject our null hypothesis and accept our alternative hypothesis.
Therefore, in conclusion: there is a statistically significant difference in the population mean height of Male and
Female Heads of household with the men being on average 12.6 cm*** taller than the women (p<0.001).
FOOTNOTE EXPLANATIONS:
* Remember df = degrees of freedom. For an unpaired t-test, df = (n1+n2)-2. Note that if we had used the corrected t
value for unequal variances, the df would be lower, reflecting the correction that has been applied (for this test, it
would have been df = 643.8, if we had needed the correction).
** We can't quote p as 0.000 for this result (not a feasible probability!). This is in fact a value of <0.0005 that SPSS
has rounded down to 3 decimal places as .000. So, the convention is to express this very low number by rounding it
UP, i.e. to report it as p<0.001.
*** It is sensible to round figures (especially effect sizes, e.g. mean difference for t-tests) up or down to a
reasonable number of decimal places. Here the mean difference between the men's and women's heights (12.555cm)
is rounded up to 1 decimal place (12.6cm), firstly as the participants would have given their height in whole cm, and
secondly as these calculated means can sensibly be taken to one decimal place more than this (i.e. to the
nearest 0.1cm).
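To make the first footnote concrete, here is a minimal Python sketch of the two rows of the SPSS t-test table: the equal-variances (pooled) t-test with df = (n1+n2)-2, and the corrected version for unequal variances (often called Welch's t-test), where the df is usually not a whole number. The heights are HYPOTHETICAL and the function names are made up for this example.

```python
import math
from statistics import mean, variance


def pooled_t(x, y):
    """Equal-variances t-test: returns (t, df) with df = (n1 + n2) - 2."""
    n1, n2 = len(x), len(y)
    df = (n1 + n2) - 2
    pooled_var = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(x) - mean(y)) / se, df


def welch_t(x, y):
    """Unequal-variances t-test: returns (t, df) with the corrected df."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; never larger than (n1 + n2) - 2.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical heights in cm (NOT the real dataset).
males = [172, 168, 175, 169, 171, 174]
females = [158, 160, 155, 162, 157, 159, 156]

t_equal, df_equal = pooled_t(males, females)
t_unequal, df_unequal = welch_t(males, females)
print(f"equal variances assumed:     t = {t_equal:.2f}, df = {df_equal}")
print(f"equal variances not assumed: t = {t_unequal:.2f}, df = {df_unequal:.1f}")
```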
<<IF YOU ARE OK WITH THIS TEST, MOVE ONTO THE LAST ACTIVITY: 7. CORRELATION & REGRESSION>>
Correlation and regression are useful methods for analysing relationships between two continuous variables.
Let us examine two continuous variables that we suspect might have a relationship: height and weight.
Firstly, does height predict weight, or the other way around? This is important, as we want to use regression to
predict one variable with the other.
In fact, if you think about it, height does predict weight. We will use our dataset variables of HH.Height and
HH.Weight. However, before we go any further we need to consider that these variables contain data for both men
and women heads of households. An interesting feature of height and weight is that they are considerably
affected by the sex (and age) of a person.
>> So, before we start, we should first Split the File by HH.Sex => Organise output by groups.
>> We should examine the data first by graphing, drawing a scatterplot to see whether there appears to be a linear
relationship between x and y.
By convention, we must remember to put HH.Height as the independent variable (x) and HH.Weight as the
dependent variable (y). Do you know why this is? Discuss as a group and check with a tutor.
>> Check you have the right limits on the dataset (i.e. for HH.Sex Split File).
>> Choose Graphs from the top menu in SPSS, choose Legacy Dialogs then Scatter/Dot...
>> Choose the Simple scatterplot as shown below. Then place HH.Weight on the y-axis and HH.Height on the
x-axis and click OK.
The scatterplot for women looks less dense than the men's one. It could just be due to the smaller sample size for
women, or it could be that there is more variation in the relationship between height and weight for women.
1. On looking at this graph you can see lots of OUTLIERS: data that are unusual with
respect to the other data as they are higher or lower in value. How many of these are
mistakes in measurement or errors in data entry, do you think?
2. See how you can use this graph to identify possible problems with your data by showing
up these outliers. You should plot all the continuous data variables against each other
to take a look at this before you start the proper analysis. Then you can remove or edit
the mistakes.
3. N.B. remember to make a note if you change anything. In writing up, researchers have
to show how they cleaned up the data as this could obviously have a major effect on
results.
Looking at the graph above, there appears to be some positive correlation between height and weight for the male
group.
Can you estimate where a line of best fit might run? Use your finger to show where it might be.
>> In the Output viewer, double-click on the chart and this will open up a Chart Editor window.
>> Along the top are some chart-like icons. Find the one that will Add a fit line at total and click on it. You should
get a line like this one with a box showing the equation for the line (don't worry about confidence intervals at the
moment):
QUESTIONS:
Does this line have a positive slope or negative slope?
Is it a good fit?
It is hard to tell just by looking at this scatter plot. However, we were right about the POSITIVE correlation:
the slope of this line is positive.
Is it a good fit?
To find out this, we need to test this correlation and carry out a linear regression analysis.
>> Choose Analyze => Regression => Linear from the top menu in SPSS.
>> Choose HH.Weight as the dependent variable and HH.Height as the independent variable. Leave the rest of the
properties at the default settings. You can have a look in Statistics etc. but don't change anything.
The important output tables for Head of household = Male are displayed below.
r = the correlation coefficient, an expression of the correlation between the observed and predicted values of
the dependent variable.
The values of r for the model produced by the regression procedure (here a simple straight line) range from 0 to
1. A larger value of r would indicate a stronger relationship between the two variables that you are looking at
here.
r squared = the proportion of variation in the dependent variable explained by the regression model.
r squared values range from 0 to 1. A small value for r squared suggests that the model (in this case the model is
linear) does not fit well.
This means that 6% of the variation in weight is explained by height for male heads of household in our dataset.
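A minimal Python sketch may help to show how r and r squared are related. The five height/weight pairs are HYPOTHETICAL and the function name pearson_r is made up for this example. Note that for a simple straight-line regression the r in the SPSS Model Summary is the size (absolute value) of Pearson's correlation, which is why it runs from 0 to 1 there.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data (NOT the real dataset): heights in cm, weights in kg.
heights = [160, 165, 170, 175, 180]
weights = [58, 66, 60, 72, 65]

r = pearson_r(heights, weights)
r_squared = r ** 2
print(f"r = {r:.3f}, r squared = {r_squared:.3f}")

# Working backwards from the result quoted in the text:
# an r squared of 0.06 corresponds to an r of about 0.245.
print(f"r for r squared of 0.06 = {math.sqrt(0.06):.3f}")
```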
But there is other information here also. What does the information in the other tables mean?
Also, as the coefficients show us how strong the relationship is, and we know that the usual equation for a
straight line is: y = a + bx, then, from the coefficients table above we can fill in our equation constants.
a is the equation constant (where the line would cross the y-axis, i.e. the value of y when x = 0) and b is the
coefficient of the independent variable (the slope of the line, which describes the relationship between the
independent variable (HH.Height, x) and the dependent variable (HH.Weight, y)).
Coefficient b gives us an estimate of how much weight (y) changes for each unit change in x, from the equation for a
straight line: y = a + bx.
QUESTION A: Work out what these results mean for the output that you have obtained for HH.HEIGHT and
HH.WEIGHT for the dataset.
QUESTION B: So, do you think that height and weight are related for the heads of the household (for Male and
Female separately)? SEE ANSWERS IN APPENDIX BELOW
1. You can add titles, subtitles etc. using the scatterplot Chart Editor window.
2. Interactive charts are possible and very exciting but not so useful for your purposes here.
3. Remember to stick to black and white graphs and tables. Colour is very pretty but journal
articles (and medical faculty submission instructions) stipulate black and white in general as
this is how they are printed.
4. Last, but not least: correlation does not necessarily mean causation.
QUESTION 1: Work out what these results mean for the output that you have obtained for HH.HEIGHT and
HH.WEIGHT for the dataset.
ANSWER:
Remember that the equation for a straight line is: y = a + bx.
From the coefficients table above we read the values of the Unstandardized B column to find the coefficients for
our line equation.
a = 27.039 (the (Constant) row): this is the equation constant, where the line would cross the y-axis (when
x=0).
b = 0.213 (the Height (head of household) row): this is the gradient of the line, or how much y (weight)
changes per unit of change in x (height).
As the usual equation for a line is: y = a + bx, this gives us the line equation of y = 27.039 + 0.213x.
In other words, as height (x) increases by 1cm, weight (y) increases by 0.213kg.
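The fitted line can be turned into a small Python sketch that predicts weight from height. The coefficients are the ones read from the coefficients table for male heads of household; the function name predicted_weight and the example height of 170cm are just for illustration.

```python
# Coefficients from the Unstandardized B column (male heads of household).
A = 27.039  # constant: where the line crosses the y-axis (x = 0)
B = 0.213   # slope: change in weight (kg) per 1cm change in height


def predicted_weight(height_cm):
    """Predicted weight (kg) from the fitted line y = a + bx."""
    return A + B * height_cm


print(f"Predicted weight at 170cm: {predicted_weight(170):.1f}kg")
print(f"Extra weight per extra cm: {predicted_weight(171) - predicted_weight(170):.3f}kg")
```

Remember that with only 6% of the variation in weight explained by height, predictions for an individual from this line will be very imprecise.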
Going back to the correlation coefficients, in general if r squared is larger then there is more likely to be a
significant relationship between the two variables. The significance is actually given in the Sig. column in the
ANOVA table. Here it is statistically significant at p<0.001 (given as .000, but we round up for this minute
probability, if you remember).
In words, we can say that if the significance value of the F statistic (the regression analysis test statistic) is small
(smaller than a probability of 0.05, the p value convention) then the independent variable is said to do a good job
explaining the variation in the dependent variable.
ANSWER:
Well, it seems that height and weight are linked in this dataset for the men who are heads of
households, and that this relationship is highly statistically significant (p<0.001).
However, only 6% of the variation in weight is explained by height (this was the r squared result), so the
relationship is weak. There must be other factors affecting the weight that we have not looked at (and may not even
be able to examine with this dataset).
Remember also that although we might have shown some statistical significance, this does not PROVE that these 2
variables have a causal relationship. We can say that there appears to be a relationship but we have not proved that
one changing actually causes the other; there may be an indirect link that we do not know about here.
>> Take a look at the output tables for the Female group. What does this show: different findings, or similar? Check
your answers with a tutor.
WELL DONE!
<< IF YOU HAVE COMPLETED ALL 7 ACTIVITIES, THEN YOU HAVE REACHED THE END OF THE BASIC
INSTRUCTIONS FOR LEARNING SPSS>>
Aim
Your group task in SH is to perform a formative data analysis that you will submit to SH Moodle during/directly
following your practical in Week 6.
Task description:
Submit your group's final formative analysis as an MS Word document, as shown below.
This formative submission should contain the basic descriptive statistics for the subjects in the dataset.
It should also contain a few key basic analyses of the demographic variables that you are interested in analyzing
for your BGDB group project.
You must demonstrate the following basic inferential analyses: at least one t-test, one chi-square test and one
correlation with simple linear regression.
It is recommended that your group uses the Research Skills Formative Assessment Proforma provided in QMP
Moodle.
Formative Assessment:
This submission will be marked formatively and returned to you electronically in week 1 of BGDB, with feedback
added directly to the Word document.
An unsatisfactory grade will be given if this assessment is not handed in on time.
This work will be useful and will build towards your BGDB group project on Child Health.
Submission details will be provided in the Week 6 practical session.
Submission Format:
Your submission should be uploaded as a Microsoft Word document and include the following:
Group Details:
NAMES & STUDENT No. (State your names and student z numbers)
SG GROUP No. (State your SG group, e.g. B8, A6 etc.)
CHILDHOOD RESEARCH TOPIC (State the topic that you are analyzing: Immunisation, Acute Respiratory
Infection, Diarrhoea)
1. Demographic Analysis:
Include some basic descriptive statistics re the households and individuals in the dataset. Run basic
frequency tables and cross-tabs to find out numbers and % for sex and age-groups for all survey
participants. Include other descriptive statistics for other variables (categorical, ordinal and
continuous) that are interesting, e.g. marital status of head of household, head of household and
spouse: weight, height and BMI.
Comment on the averages, the skewness and normality of the continuous data.
You should produce a summary table to display this descriptive data as Table 1 and include some
bar charts and histograms (as appropriate for the data).
1. The research question or hypotheses (null and alternative). What are you looking for/testing for?
2. Which variables are being used and for which data? (i.e. are you using a split file and looking at
gender separately? etc.)
3. Which test you are using and why is this appropriate for this hypothesis and the data?
4. Provide the relevant test results as OUTPUT tables (and graphs if appropriate). You can copy and
paste or screenshot these into the word document from your SPSS Output file.
5. Provide a summary of the key findings (extract the information from the output tables). Include the
figures for the following:
main outcomes
the test statistic
the degrees of freedom value for the test
the p value and confidence intervals
odds ratio if relevant/ available (2x2 chi square test). Ask a tutor how to do this.
6. Write a brief interpretation of the results in light of your research question or null/alternate
hypothesis. Do you reject the null hypothesis or accept it? Why?
7. Finally, write a full formal sentence that encapsulates all of the information necessary to transmit
each of your test findings, as you might find it written in a journal article.
Chi-square analysis
Make sure that your variables are in the rows/ columns correctly: it is best to put the variable whose groups you
are interested in comparing in the columns.
State the full results and try to interpret the odds ratio (not the risk ratio: as this is a cross-sectional survey,
we use odds). To do this: tick the Risk box when choosing the chi-square test in the
statistics window.
T-test
Is this a paired or independent t-test? Why?
Don't forget to quote the actual outcome, which here is a difference between the mean values!
For Levene's test, state what the results are and what this means. Then provide the actual t-test results.
In journal reports, Levene's test is usually not mentioned as this is just an accepted part of the
analysis, so you will not be reporting it in your final report in BGDB. However, here we ask you to include
it, as it will help us to see that you understand what this is showing!
You can provide a p value for the t-test but also a confidence interval. Quote this carefully.
It is a good idea if the whole group takes part in the analysis as you will learn more about the data this
way AND be preparing for your ILP!
The biggest problem detected in poor projects (P-) each year is lack of understanding of the data.
Separating the work up so individual members miss out on the practical work creates misunderstandings re
the data and your research questions.
The best groups work together through ALL the practical sessions together.