
http://vault.hanover.edu/~altermattw/methods/R/data-entry/loadingr.htm

Installing R and the JGR launcher

R is the name of a powerful statistical package that was developed as "open source," meaning it was built by a community of programmers and is free to use. R extends its functionality through add-on "packages" that you can also download for free. This webpage describes how to install what you need on your own computer, starting from scratch.

1. Download R

You can download the latest version of R by going to http://cran.r-project.org/. Downloads are available for Windows, Linux, and Mac. A setup file for the latest version of R for Windows is available by going here and clicking on the "Download" link at the top.

Windows XP. Go ahead and run the downloaded setup program.

Windows 7 or Vista. You will need to right-click on the downloaded setup program and select "Run as administrator."

2. On Windows 7 Machines, Check Folder Privileges

The next step will be to run a small program called the JGR (Java GUI for R) launcher. When JGR runs, it will want to store a few files that contain "Preferences" information, such as which packages to load by default. On Windows machines, it will look for the default file storage folder, often C:/Users/user, although I have sometimes seen it put the preferences in C:/Users/[active username].

Check to see if there is a C:/Users/user folder on your computer. If there is one there, click on it and you should get a "You don't currently have access to this folder. Click continue to permanently get access to it." message. Click continue. If there isn't a user folder, put one there. You will need administrator access to do this, so it might not be possible in public labs.

Next, right-click on the 'user' folder and select "Properties", then click on the "Security" tab (if there is no "Security" tab, it means you don't have administrator access and you'll need someone else to install the program). Check the boxes next to "Full control" and "Modify." You will need to have administrator access for that as well.
If the permissions of the C:/Users/user folder aren't set correctly, the following problems can occur:

1. JGR is unable to close correctly, returning error messages when you click the "Exit" button, select File -> Quit, or type "q()" (the command for Quit). This is because JGR is trying to access a file in the /user folder but it doesn't have permission.

2. JGR will restore the workspace from the previous session, which causes problems because that workspace can contain dialog objects that conflict with the ones called using the Analysis menu. To see them, type ls(all.names=TRUE) and look for objects that begin with a period and have the word 'dialog' in them. To get rid of them, type "rm(name of object)", as in "rm(.ezDialog)".

3. Download the JGR launcher

Next, download the JGR launcher, a small program that launches R within a Java environment. Look on your desktop or on Start -> Programs and see if there is an icon of a jaguar eating a letter "R". If there isn't, download it by going to http://www.rforge.net/JGR/index.html and scrolling down to the "Download" section. Look up the launcher for your operating system and download it to the desktop. If you have Windows XP, you can just double-click that program, but if you have Windows 7 or Vista, you will need to right-click the file and select "Run as administrator." So long as the setup of R went correctly, the JGR launcher should open a console window, download the packages that JGR needs to run, and then open R.

4. Download R Packages

Now that JGR (pronounced "jaguar") has been installed, you should be able to double-click it when you want to open R. R comes with many base functions, but you will need to install some add-on packages to extend its functionality.

If you have Windows 7 or Vista, you will need to run JGR "as administrator" if you want to install packages (right-click on JGR and select "Run as Administrator"). If you are an administrator and you're installing this in a public lab and you want users to be able to install and update packages on their own later, you'll need to change the ownership of the "library" folder in R and then set its permissions to "Full." Here is a webpage that steps you through that process (tip: in step 6, "Everyone" works as a group name).

To install packages, you will be typing into the R console that opens when you start JGR. Below are the commands to enter. I have numbered them, but don't enter those numbers themselves when you type.
1. install.packages("Deducer", dependencies=TRUE)

This downloads the Deducer package and all the other packages that it uses (its 'dependencies'). There are a lot of them. The first time you execute "install.packages", JGR will ask you which CRAN repository you want to use. This is where you'll be downloading the files from, so I'd recommend choosing someplace close to you (at least in the same country). If you try one location and there is a problem or the connection is very slow, you can select a new repository with the command chooseCRANmirror()

2. install.packages("Deducer", repos="http://rforge.net")

This downloads the latest 'development' version of Deducer. It contains code necessary for the 'DeducerRichOutput' package to run.

3. install.packages(c("DeducerRichOutput", "DeducerAlpha", "DeducerReshape", "DeducerANOVA", "DeducerPSY220"), repos = "http://R-Forge.R-Project.org")

This downloads 5 packages that I wrote. The first converts R output to HTML (which is prettier), the second allows you to compute Cronbach's alpha, the third is a data manipulation tool tailored for within-subjects designs, the fourth adds analysis of variance capability, and the fifth contains several useful data sets and a paired t-test analysis tool.

5. Load Packages

"Installing" a package just downloads it and puts it on your hard drive. You also need to "load" the package to actually use it. Once you've got the packages installed, you can click on the "Packages and Data" menu in JGR and select "Package Manager". This opens a window listing all of the installed (downloaded) packages, including the ones you've just installed. There are 2 columns with checkboxes, and the columns are labeled "loaded" and "default". Any package that you check under "default" will be automatically loaded the next time you start JGR. That will save you some steps, so I'd recommend clicking the boxes for both "loaded" and "default" for all the packages that begin with "Deducer". You will also need to click the two boxes for the package called 'stringr'.

I've started a TroubleShooting page for problems and solutions not addressed above. If you have problems, please feel free to email me: altermattw@hanover.edu. Be sure to include any error messages generated by R in your email.

Statistical Assignment 1

Answer the following questions. Justify your answers.

1. Define: Sample distribution
2. Define: Sampling distribution
3. Describe the sampling distribution of the mean
4. What statistic is relevant to the Central Limit Theorem?
5. What is the Standard Error and how is it related to the standard deviation?
6. Define the Central Limit Theorem. There are multiple elements so be careful here. Consider the following:

The population from which the sample is drawn.
The shape of the sampling distribution of the relevant statistic.
The size of the sample.

Stats 2a: Entering Data

First, open the SPSS program. You'll probably get a message screen asking what you would like to do in SPSS:

Select "Type in data" and press "OK".

For this exercise, imagine that you're doing a study on sexist beliefs. You're using an 11-item questionnaire called the Neo-Sexism Scale (NSS). You've gotten 189 people to fill out this survey, and you want to enter their data. For this exercise, you'll just need to enter data for two people. Data for the first person is available in this pdf document.

Step 1: Number your data

After you collect data, your first step should be to put a number at the top of each packet of data: 1, 2, 3... These will be the subject numbers for your data - each number will correspond to one subject (participant). You'll use these numbers when you enter the data, and they will allow you to directly connect the data you've entered into the computer with a physical packet of paper, in case you need to go back to the original data later. On the pdf document you just opened, the number is in the upper right corner: 1.

Before you start typing numbers into SPSS, it's important to stop and think about how you want the data to be organized. In SPSS data view, each column represents a different variable. A variable is a quantity that can vary, or take on more than one value. If the people in your study are of different ages and you collected information on age, then age is one of your variables. Each row represents a different participant. When you first open SPSS and select "Type in data," you are faced with a blank spreadsheet:

Your first step should be to put labels in the column headings. Double-click on the gray "var" rectangle in the first column.

After you double-click, the screen will change to "Variable View" and will look like this:

To name or rename a variable, simply type the name into the cell under the column headed "Name." In the image above, I've named the first column "id#". Variable names in older versions of SPSS can have a maximum length of 8 characters (newer versions allow longer names), and in any version they cannot begin with a number.

Step 2: Naming the variables

For questionnaire data that is in only one order (i.e., you don't have 2 forms of the questionnaire with the items in different orders), it is usually easiest to use variable names that correspond to the questionnaire item numbers: Q01, Q02, Q03, etc.:

You'll need as many q variables as you have questions on the questionnaire. Once you've got all the variables labeled, you can start entering data. To do that, you need to switch from "Variable View" back to "Data View" by clicking the tab at the bottom of the screen:

The data you enter should look something like this:

To switch off the two decimal places for each data point, switch back to variable view...

...and change the number of decimals for each variable to zero by either pressing the small down-arrow button or just typing the number 0 into the appropriate cell:

When you switch back to Data View, you'll notice that the two decimal places are gone:

After you've entered the data for participant #1, click on this link to get a pdf document for participant #2 and enter their data as well.

Step 3: Cleaning the Data

After you've entered your data, you need to "clean" it. This involves checking it for errors, and is easiest if you have two people. One person reads the data off the questionnaires while the other person looks at the computer screen to make sure that what is entered is correct. You don't have enough data in this exercise to make this worthwhile, but you'll want to make sure you do this for your projects.

Once you've entered the data for participants 1 and 2, save the file using a filename with your last name in it.

I'd recommend saving it somewhere like the Desktop that is easy to find.
Stats Assignment 2A: Upload your SPSS data file (saved with a filename that contains your last name) to Google Documents and share it with me at krantzj@hanover.edu.

Once you feel like you've got the hang of entering data, please click the button below to move on to the next topic.

Stats 2b: Checking Data

Getting the Data

Rather than have you enter all the data, I've got an SPSS data file you can download and use. To get it, click the link above and THEN SELECT SAVE (not "open"). Save it to your Desktop. Then go to SPSS and use the File -> Open -> Data option to find the file (called nss.sav):

This should bring up the file.

Looking for Problems

First, you'll want to see if there are any weird values in your data. Select Analyze -> Descriptive Statistics -> Frequencies...

Highlight all the variables that you want to check (q01-q11) and use the little arrow button to move the variables into the "Variables" box:

Then click OK.

Looking at the Output

SPSS generates its output in a separate program window from the data, so you may need to look in the Start Bar at the bottom of your screen for a little box labeled "Output." What are the minimum and maximum possible values for participants' responses? Consult the sample data sheets to find out (the scale is at the top). Now look at the output tables:

The name of the variable is above the table: Q01, or question 1. The values in the first column are all the different responses for that variable. This table tells you how many people gave each of the possible responses. For example, 147 people indicated a "1" for question 1. This was 78% of the total sample. Nobody responded with a "5". Look through the first columns for the other tables and see if you catch any unusual values. There are 2 errors in the data set.
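The frequency check described above can also be sketched outside SPSS. The following is a minimal Python illustration, not the course's method: the four records, the variable names, and the 1-7 bounds passed to `out_of_range` are hypothetical placeholders (the real bounds come from the scale printed on the data sheets).

```python
from collections import Counter

def frequency_table(values):
    """Return (value, count, percent) rows, sorted by value."""
    counts = Counter(values)
    n = len(values)
    return [(v, counts[v], 100.0 * counts[v] / n) for v in sorted(counts)]

def out_of_range(rows, variables, low, high):
    """List (subject id, variable, value) for every value outside low..high."""
    problems = []
    for row in rows:
        for var in variables:
            if not low <= row[var] <= high:
                problems.append((row["id"], var, row[var]))
    return problems

# Hypothetical records: one dictionary per participant.
sample = [
    {"id": 1, "q01": 1, "q02": 3},
    {"id": 2, "q01": 2, "q02": 33},  # 33 looks like a double keystroke
    {"id": 3, "q01": 1, "q02": 4},
    {"id": 4, "q01": 1, "q02": 2},
]

table = frequency_table([row["q01"] for row in sample])
flagged = out_of_range(sample, ["q01", "q02"], low=1, high=7)  # bounds are assumed

for value, count, percent in table:
    print(f"{value}: {count} ({percent:.0f}%)")
print(flagged)
```

Reading the frequency table by eye and listing the out-of-range cells are two views of the same check; the second also reports the subject number, which is what you need to go back to the paper packet.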

Stats Assignment 1B: Open a new Google Document and name it "PSY 220 stats 1B [your last name]". Find the 2 errors in the SPSS data file and provide the following information for each error: 1) subject number, 2) variable name (e.g., "Q01"), 3) why it's an error. Share the document with krantzj@hanover.edu when you're done. Log out of Google Docs.

Thus endeth today's lesson.

Reliability Analysis

It is very common in psychological research to collect multiple measures of the same construct. For example, in a questionnaire designed to measure optimism, there are typically many items that collectively measure the construct of optimism. To have confidence in a measure such as this, we need to test its reliability: the degree to which it is error-free. The type of reliability we'll be examining here is called internal consistency reliability: the degree to which multiple measures of the same thing agree with one another.

Benevolent Sexism Scale

Peter Glick and Susan Fiske (1996) developed an interesting measure called the Benevolent Sexism Scale (BSS). Its 11 items are given below:

1. No matter how accomplished he is, a man is not truly complete as a person unless he has the love of a woman.
2. In a disaster, women ought not necessarily to be rescued before men.
3. People are often truly happy in life without being romantically involved with a member of the other sex.
4. Many women have a quality of purity that few men possess.
5. Women should be cherished and protected by men.
6. Every man ought to have a woman whom he adores.
7. Men are complete without women.
8. A good woman should be set on a pedestal by her man.
9. Women, compared to men, tend to have a superior moral sensibility.
10. Men should be willing to sacrifice their own well being in order to provide financially for the women in their lives.
11. Women, as compared to men, tend to have a more refined sense of culture and good taste.

Responses to these items from 74 male college students are in this SPSS data file, which you should download and open.

Reverse Scoring

Most of the items are phrased so that strong agreement indicates a belief that men should protect women, that men need women, and that women have positive qualities that men lack. However, three of the items are phrased in the reverse: #2, #3, and #7. In order to make those items comparable to the other items, we will need to reverse score them. In this questionnaire, participants responded to the items using a 7-point Likert scale ranging from 1 ("Strongly Disagree") to 7 ("Strongly Agree"). When we reverse-score an item, we want 1's to turn into 7's, 7's to turn into 1's, and all the scores in between to become their appropriate opposite (6's into 2's, 5's into 3's, etc.). Fortunately, there is a simple mathematical rule for reverse-scoring:

reverse score(x) = max(x) + 1 - x

where max(x) is the maximum possible value for x (the rule assumes the lowest possible value is 1). In our case, max(x) is 7 because the Likert scale only went up to 7. To reverse score, we take 7 + 1 = 8, and subtract our scores from that. 8 - 7 = 1, 8 - 1 = 7. Voila.

To get SPSS to reverse-score, select Transform -> Compute:

You will be creating a new variable for each of the variables you need to reverse-score: #2, #3, and #7. The original variables are called bss02, bss03, and bss07. Let's call the new reverse-scored variables bss02r, bss03r, and bss07r. Name the first variable (the "Target Variable") bss02r, and set it equal to 8-bss02:

Instead of pressing ok, press PASTE. The following syntax appears:

It's easy to modify this syntax to compute all your reverse-scored items at once. Highlight the second line: COMPUTE bss02r = 8-bss02. (don't forget to include the period at the end) and press CTRL+C (or select Edit -> Copy). Move the cursor down one line and press CTRL+V (or select Edit -> Paste). Press RETURN to move the 'EXECUTE .' line down. Your new syntax should look like this:

Repeat that one more time so that you've got three identical lines in a row:

Now, modify the second and third row so that they are appropriate to bss03 and bss07:

Be sure to change both the left and the right side of each COMPUTE statement (this is the most common mistake people make on this assignment). You can now run this syntax (Run -> All) and it will create three new variables that are reverse-scored versions of bss02, bss03, and bss07.
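For readers who want to see the reverse-scoring rule as code, here is a minimal Python sketch of the same arithmetic. The two participant records are made up for illustration, and the `scale_min` parameter is an added generalization; with the default of 1 it reduces to the max + 1 - x rule used above.

```python
def reverse_score(x, scale_max=7, scale_min=1):
    """Reverse-score one response on a scale_min..scale_max Likert scale."""
    if not scale_min <= x <= scale_max:
        raise ValueError(f"{x} is outside the {scale_min}-{scale_max} scale")
    # With scale_min = 1 this is max + 1 - x, i.e. 8 - x on a 7-point scale.
    return scale_max + scale_min - x

# Hypothetical responses from two participants, keyed by variable name.
participants = [
    {"bss02": 7, "bss03": 1, "bss07": 4},
    {"bss02": 2, "bss03": 6, "bss07": 5},
]

# Mirror the three COMPUTE statements: bss02r, bss03r, bss07r.
for person in participants:
    for name in ("bss02", "bss03", "bss07"):
        person[name + "r"] = reverse_score(person[name])

print(participants)
```

Looping over the variable names avoids the most common mistake mentioned above, because the source variable and the new variable name are always built from the same string.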

Note: SPSS syntax is very particular about spelling and punctuation. Make sure you spell all the variables correctly.

Reliability

Now you're ready to compute the reliability of this scale. Select Analyze -> Scale -> Reliability Analysis. Move the new reverse-scored items (bss02r, bss03r, bss07r) into the 'Items' box, as well as all the other items that didn't need to be reverse-scored (1, 4, 5, 6, 8, 9, 10, and 11).

Then click on the box labeled Statistics and select Scale if item deleted (you'll see why later):

Press 'Continue' and then 'OK.' You should get the following output:

Look at the top of the output and you will see ".741" under "Cronbach's Alpha." This is the most common statistic used to describe the internal consistency reliability of a set of items. If you are using a questionnaire in your research, your results should include a report of the Cronbach's alpha for your questionnaire.

The first two columns (Scale Mean if Item Deleted and Scale Variance if Item Deleted) of the next table generally aren't all that useful. The third column is the correlation between a particular item and the sum of the rest of the items. This tells you how well a particular item "goes with" the rest of the items. In the output above, the best item appears to be BSS01, with an item-total correlation of r = .598. The item with the lowest item-total correlation is BSS05 (r = .255). If this number is close to zero, then you should consider removing the item from your scale because it is not measuring the same thing as the rest of the items.

Alpha if Item Deleted

Now look in the last column: "Alpha if item deleted." This is a very important column. It estimates what the Cronbach's alpha would be if you got rid of a particular item. For example, at the very top of this column, the number is .690. That means that the Cronbach's alpha of this scale would drop from .741 to .690 if you got rid of that item. Because a higher alpha indicates more reliability, it would be a bad idea to get rid of the first item. In fact, if you look down the "Alpha if item deleted" column, you will see that none of the values is greater than the current alpha of the whole scale: .741. This means that you don't need to drop any items.

Improving Reliability

If you are using an accepted scale obtained from a published source, you do not need to worry about improving reliability. You should use the whole scale, even if it has problems, because if you start changing the scale you will be unable to compare your results to the results of others who have used the scale.

You only want to improve the reliability of a scale if it is a scale you are developing. If one of the "Alpha if item deleted" values is greater than the overall alpha, you should re-run Analyze -> Scale -> Reliability Analysis after moving the offending item from the "Items" box back over to the unused items box. Repeat this process until there are no values in the "Alpha if item deleted" column that are greater than the alpha for the overall scale.

Computing a mean score for a questionnaire

The goal of this whole procedure is to produce a single score for your questionnaire. Once you've used reliability analysis to identify the items that will produce the most reliable measure, you can use those items to create an average score for your questionnaire, as described below.

Note: Combining items on different scales. If you will be combining items that are on different scales (e.g., one is weight and goes from 100 to 250 [pounds] and another is height and goes from 60 to 81 [inches]) you cannot simply average them together because weight will have a much bigger impact on the final average. Instead, you must first standardize them and then you can average them together.

To compute a mean score, select Transform -> Compute. In the Target Variable box, type in the name of your scale: BSS:

In the Numeric Expression box, type the word MEAN, followed by ( and then a list of the variables you want to average together, separated by commas. Make sure you only put in the variables that you decided were the best for the scale. Note that I used bss02r, bss03r, and bss07r and not their original variables bss02, bss03, and bss07. At the end, close the expression with a ). Press OK to compute the new variable.

Select Graphs -> Legacy Dialogs -> Histogram and put your new BSS variable into the variable box. Press OK. You should get output like this:

A histogram is a plot of how often possible values occurred. It's one way to see if there is anything really strange in your data - any extreme values, or all the scores piled up on one side. If you've done everything correctly, you should find that the values on the right side of the image above correspond to the values in your output: standard deviation of .851, mean of 4.30, and N of 74.
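The statistics used in this section can be written out in a few lines of Python, which may help make the SPSS output less of a black box. This is a sketch under assumptions: it uses the standard textbook formula for Cronbach's alpha, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), and a tiny invented data set of four participants and three items, not the BSS data.

```python
from statistics import mean, variance

def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per participant)."""
    k = len(rows[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*rows)]  # variance of each item
    total_var = variance([sum(row) for row in rows])   # variance of the sum score
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def alpha_if_item_deleted(rows):
    """Alpha recomputed with each item left out in turn (needs 3+ items)."""
    k = len(rows[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != drop] for row in rows])
        for drop in range(k)
    ]

def scale_means(rows):
    """Mean score per participant, like MEAN(...) in Transform -> Compute."""
    return [mean(row) for row in rows]

# Invented scores: 4 participants x 3 items (already reverse-scored if needed).
scores = [
    [1, 1, 3],
    [2, 3, 1],
    [3, 2, 2],
    [4, 4, 2],
]

alpha = cronbach_alpha(scores)
deleted = alpha_if_item_deleted(scores)
means = scale_means(scores)
print(round(alpha, 3), [round(a, 3) for a in deleted], means)
```

In this invented example the third item disagrees with the other two, so its "alpha if item deleted" value is higher than the overall alpha, which is exactly the pattern the last column of the SPSS table is meant to reveal.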

---------------------

Standardizing Variables
When you standardize a variable, you transform it so that it becomes a z-score: its mean becomes 0 and its standard deviation becomes 1. This is useful if you will be combining variables that are on very different scales (e.g., score on the SAT, which has an average of 1000 and a maximum score of 1600, and GPA, which ranges from 0 to 4.0). If you took the simple arithmetic mean of SAT and GPA, your resulting score would be dominated by the SAT score because it is so much larger. To make each variable contribute equally to the mean, you can standardize them.

Creating standard scores in SPSS

In SPSS, you can standardize variables by selecting Analyze -> Descriptive Statistics -> Descriptives, moving the variables you want to standardize into the "Variables" window, and checking the box at the bottom that says "Save standardized values as variables."

When you press "OK," SPSS creates a new variable for each of the variables you selected. These new variables will generally have the same name as the old variables except they will begin with the letter "z" (because they have been transformed into z-scores). To create an average out of these standardized variables, make sure you select the new z-score versions of the variables when you do Transform -> Compute.
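As an illustration of what standardizing does, here is a short Python sketch. The SAT and GPA values are invented, and the sketch assumes the sample standard deviation (dividing by n - 1) is the one wanted.

```python
from statistics import mean, stdev

def standardize(values):
    """Convert raw scores to z-scores: subtract the mean, divide by the SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(v - m) / s for v in values]

# Invented scores for three students.
sat = [900, 1000, 1100]
gpa = [2.0, 3.0, 4.0]

z_sat = standardize(sat)
z_gpa = standardize(gpa)

# Average the z-scores, not the raw scores, so both variables count equally.
composite = [mean(pair) for pair in zip(z_sat, z_gpa)]
print(z_sat, z_gpa, composite)
```

After standardizing, a student who is one standard deviation above the mean on SAT and one above on GPA gets the same z-score on both, even though the raw numbers differ by a factor of several hundred.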

----------------------

Reliability Assignment

For this assignment, you'll be looking at a 15-item measure called the Attitude toward Women Scale (Spence, Helmreich, & Stapp, 1974). The dataset contains responses to this questionnaire from 201 college students, who responded using a 1-7 Likert scale. The following items need to be reverse-scored before reliability can be checked: atw01, atw02, atw04, atw06, atw08, atw11, and atw14
1. Open a new Google Document and name it "PSY 220 Stats 2 [your last name]".

2. Compute the reliability of the scale with all the items included and write "The reliability of the original 15-item scale was alpha = ____" in the document, with your obtained alpha in the blank.

Note: If you were really going to use the Attitude toward Women Scale in a research project, you would not drop any of the items; you would use the whole thing and report the alpha you computed above. We're only dropping items and improving reliability as a statistical exercise.
3. If any items have "Alpha if item deleted" values greater than the alpha of the whole scale, write down (in the document) the item with the highest "Alpha if item deleted" value and then re-run the reliability analysis after you take that item out. Repeat this process until none of the items has an "Alpha if item deleted" value greater than the alpha of the whole scale. Do not take out all the offending items at once, but rather remove the one that will help you the most, then re-run the reliability analysis, and repeat. As you remove each variable, write down that variable.

4. Write down the final list of items that will be in your scale. You wouldn't report these in an article, but I'd like to see them for this assignment to make sure you took out the right ones.

5. Write "The 15-item scale was reduced to a __-item scale to improve its reliability. The final reliability of this scale is alpha = ___."

6. Create a new variable called ATW that is the mean of all the remaining items. Create a histogram plot of this variable. Write down the standard deviation, mean, and N of this variable.

7. Share the document with krantzj@hanover.edu and log out of Google Documents.
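The one-item-at-a-time logic in step 3 can be sketched as a loop. The Python below is only an illustration of that procedure on three invented columns named "a", "b", and "c"; it is not the assignment data, and it assumes the standard textbook formula for Cronbach's alpha.

```python
from statistics import variance

def cronbach_alpha(columns):
    """Cronbach's alpha for a list of item columns (one list per item)."""
    k = len(columns)
    item_vars = [variance(col) for col in columns]
    total_var = variance([sum(vals) for vals in zip(*columns)])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def prune_items(columns):
    """Drop the single most harmful item, recompute, and repeat.

    columns maps item name -> list of scores. Stops when no deletion
    would raise alpha, or when only two items remain.
    """
    items = dict(columns)
    removed = []
    while len(items) > 2:
        best_name, best_alpha = None, cronbach_alpha(list(items.values()))
        for name in items:
            rest = [col for other, col in items.items() if other != name]
            candidate = cronbach_alpha(rest)  # "alpha if item deleted"
            if candidate > best_alpha:
                best_name, best_alpha = name, candidate
        if best_name is None:
            break  # no item's removal improves alpha
        removed.append(best_name)
        del items[best_name]
    return list(items), removed, cronbach_alpha(list(items.values()))

# Invented data: three items answered by four participants.
data = {"a": [1, 2, 3, 4], "b": [1, 3, 2, 4], "c": [3, 1, 2, 2]}
kept, removed, final_alpha = prune_items(data)
print(kept, removed, round(final_alpha, 3))
```

Recomputing alpha after every single removal matters because the "alpha if item deleted" values of the remaining items change once an item is gone, which is why the assignment asks you not to remove all offending items at once.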

Statistical Assignment 3

Open the data set indicated on the syllabus or website ('Stats3mental rotation.xlsx'). This data set is taken from a mental rotation experiment. We will collect data from the class during lab. You are to add the class's data to this data set. You will need to add the class data to the Excel file on the 'data' tab at the bottom. When you import the data into SPSS, you need to select the SPSS tab to import. Run Pearson's correlation on both the angle vs. same variables and the angle vs. different variables. Make a scatter plot showing the trend lines, plotting them on one graph. You may use either Excel, which we will cover in class, or SPSS. Then write up a short results section explaining these relationships and indicate if the same type of relationship is seen in both cases. If there is a difference in the relationship, use the graph to explain why.
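For reference, the Pearson correlation that SPSS or Excel reports can be computed by hand in a few lines. The Python sketch below uses invented angle and response-time values, not the class data, and the variable names are placeholders.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # co-deviation sum
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Invented values: rotation angle and a made-up response time per angle.
angle = [0, 60, 120, 180]
rt_same = [1.0, 1.4, 1.8, 2.2]

r_same = pearson_r(angle, rt_same)
print(round(r_same, 3))
```

A value near +1 or -1 means the points fall close to a straight trend line; a value near 0 means the trend line explains little of the scatter.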

Statistical Assignment 4

This assignment is to help you understand the statistics behind comparing two groups when different people are in the two groups. We are going to collect data from an experiment called the Stroop Effect. We will discuss this experiment in lab but you can find more about this experiment here: http://www.juliantrubin.com/encyclopedia/psychology/stroop_effect.html

We will collect the data. Because of the way the program runs, everyone will be in both conditions. Ooops. We need different people in different conditions. Half of you will use the congruent condition data and half will use the incongruent condition data. Enter the data into SPSS and run the appropriate statistic. For the write-up, write a procedure section for the experiment and a small results section presenting and interpreting your data.

Statistical Assignment 6

Open the data set indicated on the syllabus or website ('Stats6twogroupwithin.xls'). There are two data sets in this spreadsheet: Change Blindness (Change tab) and the Simon Effect (Simon tab). We will collect data from both experiments during lab time and you will add the class's data to both data sets. Both are within-subject designs with two conditions. Run the appropriate test. For these experiments, write up a participants subsection for a laboratory report and then do a small results section presenting these data and their results as if they were part of one experiment.

Statistical Assignment 7

Open the data set indicated on the syllabus or website ('Stats7OneWayANOVASPAN.xlsx'). There are two data sets in this spreadsheet: choose SpanBet. We run the experiment in class but you will not actually add to this data set. To be honest, this is a within-subject experiment that we are pretending is between-subject. Bad me. Run the appropriate test. Hand in a small results section presenting your data and then do a small discussion section trying to make sense of the data. There will be a paper available to read to help you write this part of the section (http://psychclassics.yorku.ca/Miller/). Add it to a results section

Statistical Assignment 8

Open the data set indicated on the syllabus or website ('Stats8FactorialTypical.xls'). In lab, we will run this experiment about the way we do decision making. Choose the data tab when you import your data. Run the appropriate test. Hand in a small results section presenting your data; remember to interpret all of the main effects and the interaction.

Statistical Assignment 9

Open the data set indicated on the syllabus or website ('Stats9MixedANOVAFalse.xls'). In lab, we will run this experiment about how memories are formed. Choose the data tab when you import your data. When you enter your data, you will need to add information about participants' gender, which will be our other variable. Run the appropriate test. Hand in a small results section presenting your data; remember to interpret all of the main effects and the interaction. Also write up a small participants section to add to your results section.
----------------------

http://www.ibyd.com/lifestyle/backup1/05_methods/codingdataeandanalysis.html

When designing a questionnaire, it is important to remember that the information collected will need to be processed and analysed when it is completed and returned. The following need to be taken into consideration:

In most cases, the information contained in the questionnaire will need to be entered into a computer package which allows it to be analysed. Commonly used packages include Excel, Access, SPSS, SAS, and SNAP. Some of these packages are simple to use, but have limitations in terms of statistical analysis unless you can use more complicated programming. Many statisticians use SPSS or SAS.

Before the questionnaire is entered into the package, a data template will need to be designed so that each questionnaire is entered in the same way. Also, a coding frame will need to be developed which gives the rules for data entry.

During the design of the questionnaire, the processing and analysis need to be considered, to help them run more smoothly. There are different ways of assisting this process, depending on the data collection method (telephone/face-to-face/web-based/postal).

How to Avoid Common Problems


There are many issues in the design and administration of questionnaires which can be avoided. Some of the common problems and their solutions are given below.

Use ID Numbers
Each questionnaire is given an ID number so that it can be easily identified and filed. It is helpful to leave a place on the questionnaire so that this can be added in the same place for each questionnaire.

Format of the Data and Data Validation


Some packages require the format to be pre-set (e.g. Access, SPSS), although others allow us to enter data without setting the format (e.g. Excel). For example the data may be:

Date
Numeric
Alphabetic
Alphanumeric
Large amounts of text

It is also possible to set validation on some packages. This means that the package only allows data that is within the correct range (for example, that an answer to a question is either 1, 2, or 3, and not allowing any other answer to be input). Setting the format can therefore assist us in reducing data entry error and keeping data consistent. However, if validation is set on certain variables, it must allow for missing data or incorrect responses. Care should also be taken with some packages - for example if numeric data is entered as text and sorted in SPSS, it will sort as 1, 100, 2, 200 - giving us real problems with analysis.
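Both points above, range validation that still accepts a missing-data code and the text-versus-number sorting trap, can be shown in a few lines of Python. This is a sketch only: the allowed codes 1 to 3 follow the example in the text, while the missing-data code 9 and the entered values are hypothetical.

```python
MISSING = 9  # hypothetical code reserved for a missing answer

def is_valid(value, allowed=(1, 2, 3), missing=MISSING):
    """Accept only the allowed answer codes or the missing-data code."""
    return value in allowed or value == missing

entered = [1, 3, 9, 4]
rejected = [v for v in entered if not is_valid(v)]  # values to query

# The same codes stored as text sort character by character,
# while numbers sort by size.
codes_as_text = ["2", "100", "200", "1"]
sorted_as_text = sorted(codes_as_text)
sorted_as_numbers = sorted(int(code) for code in codes_as_text)

print(rejected)
print(sorted_as_text)
print(sorted_as_numbers)
```

Reserving an explicit code for missing answers keeps the validation rule strict without forcing data-entry staff to invent a response when a question was left blank.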

Show Codes on the Questionnaire


Each question on the questionnaire will need a code for each possible response, and where possible this should be shown on the questionnaire, particularly for telephone/face-to-face/internet questionnaires. However, for postal questionnaires it can be distracting for the respondent and so codes should be shown in a small font if used. An example is given below.

Example of telephone or face-to-face question with coding

Q1: Which of these age bands do you fall into?
Interviewer - Please circle:

18 - 24    1
25 - 34    2
35 - 44    3
45 - 54    4
55 - 64    5
65 +       6

Example of postal questionnaire with coding

Q1: Which of these age bands do you fall into?
Interviewer - Please circle:

18 - 24    25 - 34    35 - 44    45 - 54    55 - 64    65 +
   1          2          3          4          5         6

Be consistent with the use of codes


Codes should be consistent in all questions, for example Yes=1, No=2, Don't know=3. This is particularly important when using scales: for example, if 1 = very satisfied and 5 = very dissatisfied for one question, it should be the same for all questions about satisfaction where this is appropriate.

Leave space for coding open questions on the questionnaire


Space for coding should be left on the questionnaire for open questions in case these questions need to be post-coded (see survey dictionary).

Take care with administration


When administering a questionnaire, for example when posting out questionnaires, care must be taken so that:

In large questionnaires, all pages are included and are in the right order. Some printing processes make mistakes here possible, so it is useful to check a percentage of questionnaires to ensure that they are correct.

All documents are included in any mailouts, for example ensuring that the letter, any accompanying documentation, the questionnaire and a reply-paid envelope are all included in the pack.

Check returned questionnaires


Common problems with returned questionnaires, particularly self-completion questionnaires, include questions being completed incorrectly, being missed (either accidentally or deliberately), or routing not being followed properly. Where possible, either the interviewer can check the missing or incorrect information, or the respondent can be asked directly. If it is not possible to correct the missing information, the coding frame should include instructions on how to handle these types of problems.
http://www.ibyd.com/lifestyle/backup1/05_methods/ ----------------------

http://wps.pearsoned.co.uk/ema_uk_he_howitt_statpsych_5/175/44882/11489917.cw/content/index.html

Task A: Spreadsheet preparation and data entry


Entering data into SPSS is quite a simple procedure providing you have set up the spreadsheet so that it is ready to input the information you receive from your study. In this case the data you will be using comes from a survey conducted using Rosenberg's (1965) Self-Esteem questionnaire completed by a number of people. The questionnaire is considered a robust measure of an individual's self-esteem and is available in the public domain for anyone who wishes to use it.

The questions used in the survey are:


1. On the whole, I am satisfied with myself.
2. At times, I think I am no good at all. *
3. I feel that I have a number of good qualities.
4. I am able to do things as well as most other people.
5. I feel I do not have much to be proud of. *
6. I certainly feel useless at times. *
7. I feel that I'm a person of worth, at least on an equal plane with others.
8. I wish I could have more respect for myself. *
9. All in all, I am inclined to feel that I am a failure. *
10. I take a positive attitude toward myself.

* signifies a reversed scored item.

In the case of the Level 1 data set, the survey was conducted using a 5-point Likert Scale, ranging from strongly disagree (1), disagree (2), neither agree or disagree (3), agree (4), and strongly agree (5). The reversed scored items need to have their scores reversed in order to give a true self-esteem score; for example if a person responds 'strongly disagree' to item 6, it means they do not feel useless, thus their score needs to be changed from 1 to a 5. This can be done in SPSS and will be part of a future task.

Examining a questionnaire by simply looking through the questions asked is important so that you understand what the responses signify; they represent an answer to a question, not simply a number. It is also important to take into account any reversed scored items, as it is easy to overlook them when analysing the data. Once you have examined the raw structure of the questionnaire and have a 'feel' for where the data comes from, you are now ready to start preparing the SPSS spreadsheet for entering the data.

Getting the SPSS spreadsheet ready
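Before setting up the spreadsheet, it may help to see the reverse scoring described above as a short sketch. The text leaves the SPSS procedure to a future task, so this is only an illustration in Python; the function names are hypothetical, and the set of reversed items is taken from the asterisks in the question list.

```python
SCALE_MIN, SCALE_MAX = 1, 5          # the 5-point scale used in the Level 1 data set
REVERSED_ITEMS = {2, 5, 6, 8, 9}     # items marked * in the question list


def reverse_score(score):
    """Mirror a score around the middle of the scale: 1<->5, 2<->4, 3 stays 3."""
    return SCALE_MIN + SCALE_MAX - score


def score_questionnaire(responses):
    """responses maps item number (1-10) to the raw score (1-5) for one person."""
    return {
        item: reverse_score(score) if item in REVERSED_ITEMS else score
        for item, score in responses.items()
    }


# Invented example: a person who answers 'strongly disagree' (1) to every item.
raw = {item: 1 for item in range(1, 11)}
scored = score_questionnaire(raw)
print(scored[6])  # item 6 is reversed, so the raw 1 is scored as 5
```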

Open SPSS on your computer

Select the Type in Data option from the dialogue box that opens. At the bottom left of the screen are two tabs - Data View and Variable View. Click on Variable View.

The Variable View page allows you to name the columns of the spreadsheet in the Data View option. This is useful for entering your data so that you know which piece of data goes into which column, and will generally allow you to manipulate the spreadsheet to suit your purposes. The various titles and their uses are:

Name: This is the name you will see at the top of the column on the Data View page. It is best to keep the name of this cell short and simple. Note: you can only use text with no spaces in the Name of the column, otherwise it will say 'Variable name contains an illegal character'.

Type: This allows you to define what sort of information you will put in this data column, e.g. numbers or letters. The default is numeric, but if for example you wish to put letters in the cell, select the cell and click the grey box in the right of the cell, then click on the String option in the dialogue box titled Variable Type (see figure 1 below).

Width: This defines the width of the column. This is not an important function here as you can manually alter the width in the Data View page.

Decimals: This lets you change how many decimals you use in your data. The default setting is 2dp, but you can change this by selecting the cell and clicking up or down on the buttons that appear, or by simply typing in the number of decimals you want.

Label: This is a text box that allows you to write what the information in the column refers to, for example by writing, 'Name of participant'. This label will also appear in the output table when you run analyses on the data.

Values: This is an important cell as it allows you to define which group an individual belongs to using only a number. To input the number, select the cell and click on the grey box on the right of the cell. This will open up the dialogue box titled Value Labels. You can then input the number which you wish to call the group, then input the value label, or what that number will refer to (see figure 1 for an example). You will use these values mostly for your independent variables, not for the dependent variables.

Missing: This allows you to identify any missing data in the spreadsheet using a predetermined value. You do not need to use this just yet.

Columns: This cell allows you to change how many characters you see in the data column. You generally do not need to use this cell.

Align: This lets you adjust the alignment of the information in this column, e.g. left, centre or right alignment. For string information e.g. names the default alignment is left, for numeric information the default alignment is right.

Measure: This allows you to determine what type of data this column contains. If you select String in the Type option, this will change to Nominal, as text is nominal data. If you leave the Type as numerical this will remain as Scale information. You can also use the Ordinal option if you wish to use ranked data.

Figure 1: SPSS Variable View dialogue box options

You can name your cells anything you see fit, but it is best to use something you will remember as you may forget what the item means. In the case of this exercise it may be useful to use the Microsoft Excel file called 'Level 1 data set' and the titles in column A as your data titles.

Once you have titled all the cells you think you will need, click on the Data View tab to return to the Data View page. You will now see your column titles in the grey bar at the top of the page. If you run your cursor over the column titles your label will appear, giving you more detail about that column.

You are now ready to input your data into your prepared spreadsheet.

Open the Microsoft Excel file called 'Level 1 data set'. In this you will see 98 people's responses to Rosenberg's Self-Esteem questionnaire along with their name, age, gender and group. Input this data into SPSS. A useful function for this may be the Transpose function in the Paste Special option in Excel. This will allow you to input the data into SPSS by changing it from a vertical column per participant to a horizontal row of data per participant. You can then simply copy and paste it, although you will need to get round the obstacles of giving values to the 'Gender' and 'Group' items in the data set. This can be done in Excel by highlighting the row you wish to alter and using 'Edit / Replace' from the drop down menu.
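For readers who prefer to see the idea outside Excel, the transpose and replace steps can be sketched in Python. This is not the procedure the exercise asks for, only an illustration of what it does to the data; the participants, the variable names and the codes (1 for male, 2 for female) are invented for the example.

```python
# One list per sheet row: the first cell is the title, then one cell per participant,
# so each participant occupies a vertical column (invented data).
sheet_rows = [
    ["name", "Ann", "Bob"],
    ["age", 23, 31],
    ["gender", "F", "M"],
]

# Transposing turns each participant's column into a row.
transposed = [list(row) for row in zip(*sheet_rows)]
header, participants = transposed[0], transposed[1:]

# Replace the gender text with numeric codes (hypothetical coding).
GENDER_CODES = {"M": 1, "F": 2}
gender_index = header.index("gender")
for person in participants:
    person[gender_index] = GENDER_CODES[person[gender_index]]

print(header)
print(participants)
```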

Once you have entered your data, you should have a spreadsheet that looks something like figure 2.

Figure 2: SPSS Data View with data entered

----------------------------
http://pages.bangor.ac.uk/~pes004/resmeth/dataman/spss12/dataman2.htm

This page will show you how to manage data in SPSS: how to create a data file, name and label variables, deal with missing values, compute new variables based on existing variable scores and recode scores. The instructions on this page are for SPSS Version 10.0 and higher (the screenshots are from Version 12.0). Older versions (9.0 and lower) are different for some procedures. To see instructions for using these older versions click here.

The data used to illustrate these procedures are questionnaire-based and taken from a study conducted by students taking a Health Behaviours module. The fact that the data are from a questionnaire makes no difference to the procedures. You would do exactly the same whatever the nature of your data; they are only numbers after all!

For the study, students generated a set of questionnaire items to measure the various constructs of the theory of planned behaviour aimed at predicting avoidance of over-exposure to the sun. To keep things simple for this example, we will only use two of the subscales of the questionnaire: behavioural beliefs, which are people's perceptions of the consequences of engaging in the behaviour, and perceived behavioural control, which is people's perception of the degree of control they have over the behaviour. In addition I have included the variables age and sex. 315 participants completed the questionnaire. Each of the subscales comprises four questionnaire items which are scored on a five-point scale. The items follow in the order in which they appear in the questionnaire (although they are intermingled with the other items not included in this example):

Scale: 1 = Strongly disagree, 5 = Strongly agree

1  Too much sunshine can lead to skin cancer                                            1  2  3  4  5
2  I could easily avoid overexposure to the sun if I needed to                          1  2  3  4  5
3  Overexposure to the sun causes premature aging of the skin                           1  2  3  4  5
4  I find it difficult to avoid getting too much sun when the weather is nice           1  2  3  4  5
5  Too much sun can damage your eyes                                                    1  2  3  4  5
6  I don't find it easy to avoid overexposure to the sun                                1  2  3  4  5
7  Overexposure to the sun is not that bad for your health                              1  2  3  4  5
8  Following the experts' advice on avoidance of too much sun is easier said than done  1  2  3  4  5

Items 1, 3, 5, and 7 are behavioural belief items whilst 2, 4, 6 and 8 are perceived behavioural control items. Note that some of these items are keyed in a different direction to others. That is, for the items highlighted in yellow in the original questionnaire (items 1, 2, 3 and 5, as explained below) a high score (strongly agree) means more of the property being measured, while for the others a high score means less. This is often the case with questionnaire items and is designed to prevent extreme response biases by the participants (that is, circling scores at one end of the scale or the other for all the items). When we have items keyed in different directions we have to decide which way we want them scored. Although it does not really matter which way round we do it, in this case it makes sense for high scores to indicate stronger behavioural beliefs about the harmful effects of the sun and greater perceptions of control. For behavioural beliefs, items 1, 3 and 5 are already keyed in the right direction. For perceived behavioural control only item 2 is keyed in the right direction. For the other items we will have to recode the scores so that high scores are changed to low scores and vice versa. We will look at how to do this later on.

Creating the data file


Having collected the data, the first step is to create the data file in SPSS. To do this, open SPSS and you will see a blank spreadsheet ready for data input:

We want to enter the data in the order it appears in the questionnaire: age, sex, then the items 1 to 8. Each variable will be in a column and each respondent's scores (called cases in SPSS) will be in a row. So the file will have 10 columns and 315 rows. Each data point is entered into a cell in the file.

Defining the data


You could start entering data now, but it makes more sense to first define the data so that it is easier to keep track of where you are as you enter it. Defining the data involves (at least) giving each variable a name and specifying missing values. SPSS offers two views of the data file. You switch from one to the other by clicking the tabs at the bottom lefthand side of the screen. Data View shows the data itself and is automatically shown when you start a new file (as above). You use this view to enter data. Variable View is where you define the variable names and specify any other information you want to about the variables:

Naming variables
Data are defined variable by variable in Variable View (i.e. one row at a time). To do it, click on a cell where it says Name. Type a meaningful name in the cell. In this case, I've named the first variable age. The name can be up to eight characters long and must begin with a letter.

Missing values
Missing values are what they sound like: data points for which you have no score. For example, some participants might not have turned up for a data collection session but you still have other data from them that you want to use. So you need to enter the data you do have whilst taking account of missing data points. With questionnaires, people often fail to complete one or more items. This may be purely accidental, in which case the missing data points are referred to as missing at random because there is no systematic reason for their omission. Sometimes they may deliberately miss an item because they do not want to complete it for some reason. If lots of participants fail to complete a particular item it suggests there is something wrong with it; it may be ambiguous or perceived as too sensitive or whatever. Such systematic missing data points are more problematic than data missing at random because they mean that you have to do something about the offending item. Data missing at random, however, can essentially be ignored (actually it's not that simple but this issue goes beyond what I want to cover in this lesson). You can simply leave cells with missing data points blank in the SPSS data file. In this case, SPSS inserts a full stop in the cell to indicate missing values and they are referred to as system missing values. Alternatively, you can specify a code number that represents missing values. This is called a user-defined missing value. This gives you more control over the data and subsequent computations and analyses and allows you to maximise the use of all the data you have worked so hard to collect (more on this below). So, where there are missing data, we assign a code number that will designate a missing value. This number must be a value that cannot appear in the data for that variable. If it is, then any other cases that gave that value would be treated as having missing values! 
For example, the questionnaire data in our example can only take values from 1 to 5, so any other number would do for specifying missing values. Age, though, could take on lots more values so we would need to assign a number that cannot be the age of any of our participants. It makes things easier if you assign the same number to missing values for all the variables. I nearly always use either 99 or 999, since those values do not appear in the sort of data I normally collect. In this case, we'll use 99; there are no 99 year olds in the sample that provided these data! To define missing values for a variable click on the corresponding cell in the Missing column in Variable View and then click the little grey box that appears in the cell and the following dialogue box will appear:

By default, the No missing values button will be checked. Check the Discrete missing values button and the three shaded boxes will become available:

SPSS allows you to specify up to three missing values because you might want to differentiate between values that are missing for different reasons. In this case we don't want to do that, so we just type our missing value code (99 in this example) into the first box, then click on OK to return to the Variable View. You will see that it now says 99 for age in the Missing column cell. Default values will have appeared for Type (the type of data), width (width of the column in Data View) and Decimals (the number of decimal places). You can change these if you want or need to. If your data have more than two decimal places then you should change the values in the Decimals column because otherwise SPSS will automatically round off to the nearest two places when you enter the data. Now we do the same thing for each variable in turn, giving each a meaningful name and specifying the missing value. In this case, I put sex into the second row and then the behavioural beliefs and perceived behavioural control items alternately in rows three to ten, naming them bb1 to bb4 and pbc1 to pbc4 respectively. The variables are now defined and the file is ready for data input.
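The effect of declaring a user-defined missing value can be illustrated outside SPSS. The sketch below, in Python, is an assumption-laden illustration rather than a description of SPSS internals: the ages are invented, and the helper name `to_missing` is hypothetical. It shows why the code 99 must be declared as missing before any statistic is computed.

```python
MISSING = 99  # the user-defined missing value chosen in the text


def to_missing(value, missing_code=MISSING):
    """Turn the missing-value code into None so it cannot be mistaken for data."""
    return None if value == missing_code else value


ages = [21, 99, 34, 19]                      # invented ages; the second is missing
cleaned = [to_missing(age) for age in ages]
valid = [age for age in cleaned if age is not None]

naive_mean = sum(ages) / len(ages)           # wrongly treats 99 as a real age
mean_age = sum(valid) / len(valid)           # uses only the three real ages
print(naive_mean, mean_age)
```

The same reasoning explains why the code must be a value that cannot occur in the data: a genuine 99-year-old would be dropped along with the missing cases.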

Saving the file


If you haven't done so already, you should now save the file so that all your work doesn't go down the pan if your machine crashes! Click on File then Save from the dropdown menus to get the Save As dialogue box. SPSS data files should be given the .sav file name extension, which is the default when saving a spreadsheet. That way SPSS will always recognise the file as a data file. Give the file a meaningful name in the File name box, choose where you want to save it to from the Save in box (probably to a floppy disc so select the A: drive), then click on OK to save it. Remember to save your work periodically as you enter the data.

Entering the data


Now you can enter the data. Switch to Data View by clicking on the tab at the bottom lefthand of the screen. Now simply enter each case's scores into the appropriate cells, with each case taking one row. You can use the arrow keys on the keyboard or the mouse to move from cell to cell. The value you type won't appear in the cell until you move to another cell but it does appear in the box above the spreadsheet. If you have missing scores, type in the missing value that you assigned to that variable (99 in this example).

Coding categorical variables


The variable sex is categorical and needs to be coded to differentiate between males and females. Any numbers will do; in this case I entered 1 for males and 2 for females. Any such categorical variable needs to be coded in this way. For example, if you have data from an experimental study, a code will need to be given to each group; say, 1 for a treatment group and 2 for a control group.

You should number your raw data collection sheets or questionnaires according to the row number for each case in the SPSS file. That way you will always know which participant's data corresponds to which row in the data file. Entering data is a really tedious job, especially if you have a lot of it, and it's easy to make mistakes. It is a good idea, therefore, to recruit a friend to help you with it. One of you can just call out the scores while the other types them in.

Labelling variables
In addition to naming variables, you can label them. This means assigning a longer and more descriptive name to the variable which can make reading the output of any analyses easier. Labels can be up to 256 characters long and can include spaces. To do this, in Variable View enter a meaningful label in the appropriate cell in the Label column. Here I have given bb1 a label comprising part of that item's wording.

You'll find that the variable label will appear when you mouse over a variable name in the column heading in Data View. You can get SPSS to print either variable names, or labels, or both in any outputs by clicking on Edit then Options from the dropdown menus and then choosing what format you want from the Output Labels tab.

Labelling variable values


You can even assign labels to variable values. For example, you might want to label the values for sex so that any output that uses this variable will say male and female instead of just giving the numbers 1 and 2. To do this, in Variable View click on the appropriate cell in the Values column, then on the little grey box that appears there to get the Value Labels dialogue box. Now type 1 in the Value box, then male in the Value Label box, then click on the Add button. This will add this value label to the bottom box. Then you'd do the same for 2 and female:
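Conceptually, value labels are a lookup from stored codes to display text. A minimal sketch in Python, assuming the coding used above (1 for male, 2 for female); the dictionary layout and the function name `label_for` are hypothetical and are not how SPSS stores labels.

```python
# Hypothetical store of value labels: variable name -> {code: label}.
VALUE_LABELS = {
    "sex": {1: "male", 2: "female"},
}


def label_for(variable, value):
    """Return the label for a coded value, or the value itself as text if unlabelled."""
    return VALUE_LABELS.get(variable, {}).get(value, str(value))


print(label_for("sex", 1))   # a labelled code
print(label_for("age", 25))  # a variable with no value labels
```

The data file still holds the numbers, so analyses are unaffected; only the output becomes easier to read.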

Screening for errors


Because it's so easy to make mistakes in entering data, it's a good idea to check for them once the data have been entered. A quick and useful check is to use the Frequencies analysis function. This gives you the frequency of the occurrence of each value for a variable. Click on Analyze then Descriptive Statistics and then Frequencies from the dropdown menus to get the following box:

Select the variables you want to check by transferring them to the right hand box (you can do them all at once) and then click OK. Here is part of the output for our variable bb1:

                Frequency   Percent   Valid Percent   Cumulative Percent
Valid   2.00        8         2.5         2.5               2.5
        3.00       32        10.2        10.2              12.7
        4.00       96        30.5        30.5              43.2
        5.00      177        56.2        56.2              99.4
        7.00        1          .3          .3              99.7
       99.00        1          .3          .3             100.0
        Total     315       100.0       100.0

You can see that 8 participants scored 2, 32 participants scored 3 and so on. One participant failed to provide a score, so there is one missing value (99). In addition, though, one participant has a score of 7. As this is not a valid value for this variable (participants can only score from 1 to 5), it must be a mistake. To easily find the offending value, and which case it belongs to, you can use the search facility. Highlight the variable column which contains the wrong value by clicking once on the variable name in the column heading in the data file. Then click on Edit and Find from the dropdown menus. Type the value you are looking for in the box and click on Find Next. SPSS will take you straight to the cell containing the wrong value. Now you know which case has the wrong value by seeing which row number it is and you can go back to your raw data sheets or questionnaire to find the correct value. Obviously, frequency analysis will not help you find errors where you have mistakenly entered a valid value (e.g. a 4 instead of a 3 for bb1).
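The same screening logic (count each value, then locate the rows holding impossible ones) can be sketched in Python. The scores below are invented, and the function name `screen` is hypothetical; the valid range 1 to 5 and the missing code 99 follow the example in the text.

```python
from collections import Counter

VALID = {1, 2, 3, 4, 5}   # legitimate scores for bb1
MISSING = 99              # user-defined missing value


def screen(scores):
    """Return value frequencies and the 1-based row numbers of impossible values."""
    counts = Counter(scores)
    suspect_rows = [
        row for row, score in enumerate(scores, start=1)
        if score not in VALID and score != MISSING
    ]
    return counts, suspect_rows


bb1 = [5, 4, 7, 3, 99, 5]                 # invented data with one mistyped score
counts, suspect_rows = screen(bb1)
print(sorted(counts.items()))
print(suspect_rows)                        # rows to check against the paper questionnaire
```

As with the Frequencies output, this catches only impossible values; a 4 typed instead of a 3 passes unnoticed.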

Recoding variables
Now we have to change those scores on the variables that are keyed in the wrong direction, as discussed above. In this case, items 4, 6, 7 and 8 need to be keyed in the opposite direction so that high scores become low scores and low scores become high. Specifically, we want a 5 to become a 1, a 4 to become a 2, 3 stays the same because it is in the middle, 2 needs to become a 4 and 1 needs to become a 5. To do this, click on Transform and then Recode from the dropdown menus. You will see two options, Into same variables and Into different variables. If you choose the former, the scores will be recoded in the same column as the original. The latter will keep the original column as it was and put the recoded scores into a new column on the end of the file. It's important to note that if you add new data to a column after recoding it, the new data will not be recoded! If this is likely to happen, it makes sense to recode into a different column so you always remember that the original column is not recoded. In this case, we'll recode into the same column, so click on Into same variables to get the following box:

Here we choose which variables we want to recode. Obviously, if your variable scores are recorded on different scales you will need to do this separately for all the different ones. In this example, all the variables we need to recode are scored on the same 1 to 5 scale so we can do them all at once. So highlight the variables for recoding in the left hand box and transfer them to the right hand box by clicking on the little arrowhead. Now click the Old and new values button to get this box:

Now type 1 in the Old Value box on the left, 5 in the New Value box on the right, then click the Add button to add them to the big box below. Do the same for the other values (2 becomes 4, 4 becomes 2, and 5 becomes 1). You don't have to do anything to 3 because it will stay the same anyway. The box will now look like this:

Click on Continue and then OK and the selected values will be recoded in the data file.
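The old-to-new value pairs entered in the dialogue amount to a lookup table. A minimal Python sketch of the same recode follows, written in the style of "Into different variables" so the original scores are kept. The scores are invented, and the name pbc2 assumes that questionnaire item 4 is the second perceived behavioural control item, as the naming scheme above implies.

```python
# Old value -> new value, exactly as entered in the Old and New Values dialogue.
# 3 is not listed because it stays the same.
RECODE = {1: 5, 2: 4, 4: 2, 5: 1}


def recode(scores):
    """Return a new list with listed values swapped and everything else untouched."""
    return [RECODE.get(score, score) for score in scores]


pbc2 = [1, 2, 3, 4, 5, 99]     # invented scores, including the missing code 99
pbc2_recoded = recode(pbc2)    # original list is left unchanged
print(pbc2_recoded)
```

Values not named in the table, including the missing code 99, pass through unchanged, which matches leaving 3 out of the dialogue.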

Computing new variables


You can use the Compute command to compute new variables on the basis of scores on existing variables. These new variables are then added to the end of the data file. For example, you might want to compute the mean or sum of two or more variable scores for each case. All sorts of such computations can be made. This is a useful facility for scoring questionnaires. In this example, we want to compute a mean score for each participant for behavioural beliefs and for perceived behavioural control to use in further analyses. To do this, click on Transform and then Compute. In the Target Variable box type a name for the new variable. Here I'm computing the mean of the behavioural belief items so I've called the new variable behbel. Then in the Numeric Expression box type: mean(bb1,bb2,bb3,bb4) as shown below:

The general rule for these numeric expressions is to give the computation wanted (in this example, the mean) followed by the list of variables being operated on in parentheses, with each variable separated by a comma. You can see in the big box a menu of possible computations that SPSS will do. Clicking on any one of these will set the numeric expression box up for you; all you have to do is type in the variable names in the appropriate places. Having specified the transformation, you click on OK and the new variable, with its name, will be added to a column at the end of the file. Note that if you add new cases to the file after using the compute command you will need to compute again. SPSS won't automatically do it for you.

Computations and missing values


Where there are cases with missing values for one or more of the scores used in the computation, SPSS will make the computation on the basis of the scores that are there. Think carefully about this because, depending on the computation you make, it could cause serious problems. For example, here I computed the mean of the behavioural belief item scores. If a case failed to give a score for any one item, SPSS would calculate behbel as the mean of the other three. The resulting overall score would still be on a scale of 1 to 5 and the case's score would still be comparable with other cases' scores. If instead I had computed behbel as the sum of the items, the resulting overall score would have a maximum of 15 for a case with a missing value but a maximum of 20 for cases with complete scores! This is one good reason for always computing questionnaire subscale scores as the mean of the items and not the sum. It is worth checking the results of a computation for one or two cases by hand to ensure that you have correctly specified the numeric expression and that missing values have not caused problems, especially with more complex transformations.
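The arithmetic behind this warning is easy to check by hand or in a few lines of code. The Python sketch below is an illustration only, with invented scores and hypothetical function names; it treats 99 as the missing code and computes both the mean and the sum over the scores that are present.

```python
MISSING = 99


def valid_scores(scores):
    """Drop the missing-value code, keeping only real scores."""
    return [score for score in scores if score != MISSING]


def subscale_mean(scores):
    valid = valid_scores(scores)
    return sum(valid) / len(valid) if valid else None


def subscale_sum(scores):
    return sum(valid_scores(scores))


complete = [5, 5, 5, 5]        # highest possible answers on all four items
one_missing = [5, 5, 5, 99]    # the same respondent with one item left blank

print(subscale_mean(complete), subscale_mean(one_missing))  # both stay on the 1-5 scale
print(subscale_sum(complete), subscale_sum(one_missing))    # maximum drops from 20 to 15
```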

Now the data file is complete. The tedious bit is over and all that remains is to conduct whatever analyses you need to do and make sense of the results!

May all your P values be small ones (if that's what you are looking for)!

--------------------
http://www.htm.uoguelph.ca/MJResearch/ResearchProcess/PreparingForDataEntry.htm

Preparing for data entry

By Pamela Narins, Planning Manager, J. Walter Thompson

Now that your surveys are returning, you may ask yourself, "Now what?" Well, there are two steps to complete before you can analyze data: data entry and data cleaning. This column focuses on data entry.

Shifting perspective

It is time to change your thinking. Until now, you have seen your survey as questions and answers. Now, begin to think of these elements as variables and values. Each variable is assigned a name. Each possible response, or value, is assigned a number. You can keep open-ended or verbatim responses as text.

Assigning a unique identifier variable

The first thing to do when you receive a completed questionnaire is to assign a unique "case" number to it. Because each questionnaire is called a case, this number is often assigned the variable "caseid," for case identification. The importance of the caseid becomes clear the instant you question the accuracy of entered data. Without this unique number, you have no way to tie your data back to the original documents, and therefore, no way to identify or correct data entry errors. Also, some data handling and analytical tasks are possible only if you have a way of identifying individual cases.

Choosing variable names

For this and the following sections, let's use the example of a two-question survey. Respondents were asked their gender and height. Sometimes, when naming variables, you use numbers such as "Q1." This can be useful for identifying a variable's position on the questionnaire. Whenever possible, however, try to choose a name that gives you clues as to the content of the variable. Let's call the first variable "gender," and the second "height," in inches. Also, add variable caseid.

Creating values

Gender gives you two response options: male and female. For our example, assign males a value of "1" and females "2." It is important to enter a number, rather than a letter, because many analytical procedures cannot be done with letters of the alphabet. For a response that is already a number, simply enter that number as the value. For example, a height response of 72 inches, enter "72."

Determine variable widths

The next step is determining a maximum character width, or amount of column space, for each variable. For example, the gender variable is one character wide, because you can only respond with one-digit numbers. The maximum width for the height variable should be two characters. The width for caseid is dependent upon the number of questionnaires you expect to be returned. If you sent out 900 questionnaires, you should assign a width of three columns to caseid. If you sent out 2,000, the caseid should be four columns, just in case more than 999 are returned. The variable width is important because the computer reads your data file as a string of numbers. You need to tell SPSS which numbers mean caseid, height and gender. The only way to do so is to tell the computer how long each number is.

Handling missing values

Respondents leave questions blank for many reasons. Regardless of the reason, you should key in a number for "refused" or "no answer" rather than leaving a blank. For example, if a respondent indicated they didn't know how tall they were, you may key in a "98." The code "98" only works if you know no respondent is legitimately 98 inches tall. If you are not sure, then you may want to assign a value "998" to the "don't know" response. If you do this, increase the variable width to three columns. Whatever values you assign to missing values, be consistent. A consistent scheme makes analysis much easier to perform and interpret.

Entering data

Data entry is highly tedious, prone to error and critically important. There are several ways to get your data into the computer:

Have SPSS interpret strings of numbers from a word processor
Enter data using SPSS Data Entry II to create SPSS data files
Use a spreadsheet
Enter data directly into SPSS through the Data Editor
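The first option depends on the variable widths discussed earlier: a record is just a string of digits, and only the widths say where one variable ends and the next begins. A minimal Python sketch of that idea follows. The column order (caseid, then gender, then height) and the sample record are assumptions made for illustration, and this is not SPSS syntax.

```python
# (variable name, width in characters), in the order the digits were typed.
# Widths follow the column's example: caseid 3, gender 1, height 2.
LAYOUT = [("caseid", 3), ("gender", 1), ("height", 2)]


def parse_record(line):
    """Split one fixed-width record into named integer values."""
    record = {}
    start = 0
    for name, width in LAYOUT:
        record[name] = int(line[start:start + width])
        start += width
    return record


parsed = parse_record("001172")   # invented record: case 1, male, 72 inches
print(parsed)
```

If the "don't know" code for height were changed to 998, the height width in the layout would have to grow to 3, exactly as the column advises.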

Other data entry methods include using a Computer Assisted Telephone Interview (CATI) system that enters data directly into a computer during the interview. You can also automate your data collection using a product like Teleform 5.0 which employs scannable forms (see our World Wide Web page at www.spss.com/software/spss/Tele/ or see the Teleform article on the Keywords 61 web page). Regardless of the method, keep in mind, errors are inevitable. If your data are carelessly entered, you can lose all the hard work that went into ensuring a sound sampling frame, usable questionnaire design and proper administration. One safeguard you can use during data entry is called double entry. Oftentimes entering cases and checking them twice is impractical. It is, however, reasonable to check a random sample of cases by double entry. If you have a large sample that requires multiple data entry personnel, create a variable for that person's initials. That way, you can track the clerk's accuracy rates.
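Double entry comes down to keying the same questionnaires twice and comparing the two versions case by case. The sketch below shows one way that comparison might look in Python; the data, the dictionary layout and the function name `compare_entries` are all invented for illustration.

```python
def compare_entries(first, second):
    """Compare two keyings of the same cases.

    Both arguments map caseid -> {variable: value}. Returns a list of
    (caseid, variable, first value, second value) for every disagreement.
    """
    mismatches = []
    for caseid in sorted(first.keys() & second.keys()):
        for variable, value in first[caseid].items():
            other = second[caseid].get(variable)
            if value != other:
                mismatches.append((caseid, variable, value, other))
    return mismatches


# Invented data: case 2's height was keyed as 65 the first time and 56 the second.
first_pass = {1: {"gender": 1, "height": 72}, 2: {"gender": 2, "height": 65}}
second_pass = {1: {"gender": 1, "height": 72}, 2: {"gender": 2, "height": 56}}

mismatches = compare_entries(first_pass, second_pass)
print(mismatches)
```

Each mismatch points back to a caseid, which is why the unique identifier matters: it is the only way to pull the right paper questionnaire and settle which keying was correct.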
