
STATA TUTORIAL

Stata is easy to learn
Powerful software used widely in research

STATA tutorial: Part One


Introduction, Creating and using log and do files, Memory, Input and load data, Examine data, Generate and organize variables, Modify data, Label data, Data management, Import data, Descriptive statistics and statistical tests.

Introduction
Open Stata from the Start menu under Programs.
Use the help files! Syntax: help [command name]
. help regress

To search all available sources for material related to regression in general, type
. findit regress

Set up your working directory


You have to tell Stata where to find your files by setting your working directory. Put the path in quotes if it contains spaces.
. cd "c:/name of your folder"

Creating and using log and do files


The log command starts a log file that keeps a record of the commands and output from your Stata session. Example: let's create a log file named mylog.
If the filename is specified without an extension, the default extension is .smcl. You may prefer the extension .log, which produces a plain-text file that you can open in Notepad or MS Word:
. log using mylog.log

Creating and using log and do files


To close your log file when you are ending a session:
. log close

To temporarily suspend sending output to your log file use:


. log off

And to resume...
. log on

Creating and using log and do files


When you start your next session in Stata, you can decide whether to start a new log file, add to an existing one, or replace an old one.
. log using mylog.log, append
. log using mylog.log, replace

Creating and using log and do files


The log file saves everything from your session, both commands and output. The cmdlog command, however, saves only the commands you type.
. cmdlog using mycmdlog.txt

.txt is the default extension. The append and replace options apply here too. This is useful if we want to create a do-file.

Creating and using log and do files


You can run all the commands in your cmdlog file, so you don't have to type them all over again.
. do mycmdlog.txt

If you want to suppress the output


. run mycmdlog.txt

You can add the extension .do to your do-files.
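
A do-file is simply a text file of commands, so it can also open and close its own log. The sketch below shows one possible layout. The file name analysis.do and the steps inside it are illustrative assumptions, and it assumes a Stata dataset named mydata.dta (created in the next section) already exists in the working directory.

```stata
* analysis.do: hypothetical do-file, shown only as a sketch
* Close any log left open by an earlier run; capture ignores the error if none is open
capture log close
log using mylog.log, replace
* Assumes mydata.dta exists in the working directory
use mydata, clear
describe
summarize
log close
```

Run it with . do analysis.do. Starting with capture log close means the file can be rerun after an interrupted session without stopping on an "log file already open" error.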

Memory
Sometimes your dataset is too big for the memory Stata has allocated. You can then allocate more memory for your data. Example: increase memory to 300 MB
. set memory 300m

Stata will warn you if your allocated memory is too small. (From Stata 12 onward memory is managed automatically and set memory is no longer needed.)

Input and load data


To create a dataset manually:
. input wage school public female
  1. 94 8 1 0
  2. 75 7 0 1
  3. 70 16 0 0
  4. 75 8 1 1
  5. 78 11 1 1
  6. end

. save mydata

Input and load data


That last command saves your newly created data as mydata.dta. The save command always writes Stata's own format, whatever extension you give; to write other formats, use an export command such as outsheet. If you want to work with this data, type
. use mydata

This command loads a Stata file, that is, a file with the extension .dta
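
The same steps can be collected in a do-file so they can be rerun. This is a minimal sketch using the five observations entered above. The clear and replace options are additions not shown in the slides: without them, save stops if mydata.dta already exists and use stops if there is unsaved data in memory.

```stata
* Drop any data currently in memory
clear
input wage school public female
94  8 1 0
75  7 0 1
70 16 0 0
75  8 1 1
78 11 1 1
end
* replace overwrites mydata.dta if it already exists
save mydata, replace
* clear lets use discard whatever is in memory
use mydata, clear
describe
```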

Examine dataset
Now you have loaded your data file (mydata.dta). For an overview of your data:
. des

des is short for describe, which describes your dataset. If you want to see the actual values of all variables in your dataset, type:
. list

Examine dataset
To examine certain variables, type:
. list wage school

To list a particular range of observations for one or all variables, use the in qualifier:
. list in 1/3

This shows the first 3 observations of all variables.


. list wage in 1/3

This shows the first 3 observations of the variable wage.

Examine dataset
To give a quick overview of the data, type
. inspect

The if qualifier restricts a command to the observations that satisfy a stated condition. Example: list the observations of females with at least 8 years of schooling. The condition is then:


female==1 and school>=8

Examine dataset
So combined with the list command, type
. list if female==1 & school>=8

== is the equality test used in most programming languages. It means "is equal to", whereas a single = assigns a value.

Examine dataset
The count command shows the number of observations in the dataset:
. count

You can also count the number of observations that fulfill a condition
. count if female==1 & school>=8
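
The number that count reports is also left behind in r(N), so it can be reused in later commands. A small sketch, assuming the five-observation dataset entered earlier is in memory; with those values the count is 2 (the women with 8 and 11 years of schooling).

```stata
* Assumes the five-observation dataset entered earlier is in memory
count if female==1 & school>=8
display "Matching observations: " r(N)
* The same condition with list, restricted to two variables
list wage school if female==1 & school>=8
```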

Generate and organize variables


Creating new variables. You may need to calculate new values based on existing variables. Use the generate command, or its shorthand gen. Example: let's create a squared wage variable
. gen wagesqr=wage^2

Or to take the natural log of the wage variable


. gen lnwage=log(wage)

Type list and see the result

Generate and organize variables


To organize your variables, it may be prudent to generate ID numbers for each observation:
. gen ID=_n

To give separate ID numbers within the male and female groups, use the by prefix. The data must first be sorted by these groups.
. sort female
. by female: gen Idg=_n

Type list and see the result
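
The sort and by steps can be combined with bysort, and _N gives the number of observations in each group where _n gives the position within it. A short sketch, assuming the five-observation dataset is in memory; id_in_group and groupsize are hypothetical variable names chosen for illustration.

```stata
* Assumes the five-observation dataset entered earlier is in memory
* bysort sorts by female and then runs the command within each group
bysort female: gen id_in_group = _n
* _N is the number of observations in the current by-group
bysort female: gen groupsize = _N
list female id_in_group groupsize
```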

Generate and organize variables


Some more data manipulation. To change the values of an existing variable
. replace wagesqr=wagesqr/1000

To create a variable that holds a summary statistic (here the mean) of another variable for all observations


. egen avgwage=mean(wage)

Type list to see result

Generate and organize variables


To categorize variables, you can use the recode command. Example: generate a new variable based on the school variable and recode its values as 0 and 1 (a dummy variable). The conditions are:
School is less than or equal to 8 --> 0
School is between 9 and 21 --> 1
Type
. gen schlevel=school
. recode schlevel 1/8=0 9/21=1
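
A 0/1 dummy can also be generated directly from a logical expression, which Stata evaluates to 1 when true and 0 when false. This is an alternative sketch rather than the method used above; highschool is a hypothetical variable name, and the dataset with the school variable is assumed to be in memory.

```stata
* Assumes a dataset containing the variable school is in memory
* Stata treats missing values as larger than any number, so without the
* if condition a missing school value would be coded 1
gen highschool = (school > 8) if !missing(school)
tab school highschool
```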

Modify variables
To rename a variable
. rename wage timlon

So the structure is rename [oldname] [newname]. Don't forget to save the data so the new name is kept!
. save, replace

To delete a variable
. drop schlevel

Label variables
Variable labels allow you to include a description of a variable. Example: let's label the variable female
. label variable female "Indicates gender of the individual"
. des female

To include an interpretation of the values, we use label define


. label define mf 0 "men" 1 "women"

Label variables
mf is the name of the value label we just defined for the 0/1 values. To attach mf to the values in the variable female
. label values female mf

Type list to see the result!

Data management
We used the drop command before to delete a variable. If we instead want to specify which variables to keep and discard the rest, type:
. keep timlon school public female ID

We can also drop or keep observations by condition. Example: let's create a dataset of only females
. drop if female==0

Dataset
We will use a different dataset now.
Download Lnu.xls
Open Lnu.xls and save it as Lnu.csv instead

Import data
First, before we load new data, let's get rid of the old one
. clear

To import a .csv file


. insheet using Lnu.csv
If the file uses a semicolon (or something else) as the delimiter, use instead

. insheet using Lnu.csv, delimiter(";")

Save it as a Stata file


. save mylnu.dta

Descriptive statistics and statistical tests


Your data is hopefully loaded now, so let's look at it
. des

Frequency table with tabulate command


. tab skolar

If you want a frequency table for several variables type:


. tab1 skolar kvinna

Descriptive statistics and statistical tests


We can create a cross-tabulation
. tab skolar kvinna

To get the column and/or row percentages


. tab skolar kvinna, column row

To instead hide the frequencies and only see the percentages, use nofreq
. tab skolar kvinna, column nofreq

Descriptive statistics and statistical tests


Here we can also use a condition:
. tab skolar kvinna if timlon>100, column nofreq

Summary statistics using summarize for a number of variables


. summ timlon skolar

Here we can again get these summary statistics by group, such as men and women
. sort kvinna
. by kvinna: summ timlon skolar

Descriptive statistics and statistical tests


We can sort by more than one group. Example: summary statistics for men and women working in the private and public sectors
. sort kvinna offentlg
. by kvinna offentlg: summ timlon skolar

Summary statistics such as the mean, variance, standard deviation and so on can be listed in one table with the tabstat command.
. tabstat timlon skolar, stat(mean var sd min max N)

Descriptive statistics and statistical tests


Here we can use the by option to get summary statistics for men and women in a single table
. tabstat timlon skolar, stat(mean sd) by(kvinna)

Correlations
We can get the correlation between two or more variables. Example: look at the correlation between wage, years of schooling and years of work experience
. corr timlon skolar erfarnht

Descriptive statistics and statistical tests


T-test (mean comparison test)
To test the equality of means, we use a t-test.
One-sample mean-comparison test
Test whether the mean of a specified variable is equal to a certain hypothesized value. Example: test whether the average wage in Sweden is equal to 100
. ttest timlon==100

Descriptive statistics and statistical tests


The confidence level is 95% by default. This can be changed with the level option
. ttest timlon==100, level(99)

Two-group mean-comparison test


To test whether the mean of a variable is the same for 2 groups. Example: test whether men and women on average earn the same wage
. ttest timlon, by(kvinna)
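
After ttest the test statistic and p-value remain available as stored results, which is convenient when they are needed in later commands. A brief sketch, assuming mylnu.dta (with timlon and kvinna) is in memory.

```stata
* Assumes mylnu.dta is in memory
ttest timlon, by(kvinna)
* r(t) is the t statistic, r(p) the two-sided p-value
display "t statistic:       " r(t)
display "two-sided p-value: " r(p)
```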

End session
Don't forget to
. log close
. cmdlog close

STATA tutorial: Part Two


Regression analysis, Regression: Extract results, Regression: Predictions, Graphing data

Start Session
Don't forget to
. cmdlog using mycmdlog.txt, append
. log using mylog.log, append

Regression analysis
We use regression analysis to study the effect of one variable X (independent) on another variable Y (dependent). In this section we will see how to run linear regressions, extract regression results, generate predicted values and run a joint hypothesis test. We will continue to use mylnu.dta
. use mylnu

Explore data
Use the Data Editor in the Stata menu to browse through your data in a spreadsheet.
Data --> Data Editor

Regression analysis
Example: let's see the effect of years of schooling on hourly wage rates, using an Ordinary Least Squares (OLS) regression of the dependent variable timlon on the independent variable skolar
. reg timlon skolar

What does this result tell us about the effect of years of schooling on hourly wage rate?

Regression analysis
To see whether the coefficients in the regression are statistically significant, check the
p-value
t-value

Also check the R-squared value to see how well the model explains the variation in the dependent variable.

Regression analysis
Keeping school years constant, we can also see the relationship between gender and wage rate.
. reg timlon skolar kvinna

What does the coefficient for kvinna tell us?

Regression analysis
Heteroskedasticity?
Use the robust option to get heteroskedasticity-robust standard errors

. reg timlon skolar kvinna, robust
Shorthand...


. reg timlon skolar kvinna, r

Regression analysis
We can again use the by prefix to run separate analyses for groups of observations. Example: run a regression for private and public sector employees separately but in a single command. Again we need to sort the data first
. sort offentlg
. by offentlg: reg timlon skolar kvinna, r

Regression analysis
We can also use the if qualifier to run a regression on a subset. Example: run a regression just for public sector employees.
The condition is then offentlg==1
. reg timlon skolar kvinna if offentlg==1, robust

Regression: Extract results


You can see your stored results from a regression run
. ereturn list

You can store the results of your last regression under a name


. est store model1

To examine the coefficients from our stored regression, model1


. est table

Regression: Extract results


If you want to see more statistics, add the desired options after table,
. est table, b se t stats(N r2 F)

Example: let's run another regression and store it as model2
. reg timlon skolar kvinna erfarnht, r
. est store model2
. est table model1 model2, b se stats(N r2 F)
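
Individual results can also be pulled out of the most recent regression by name: _b[varname] and _se[varname] hold a coefficient and its standard error, and the e() results shown by ereturn list hold model-level statistics. A short sketch, assuming mylnu.dta is in memory.

```stata
* Assumes mylnu.dta is in memory
reg timlon skolar kvinna
display "Wage change per extra school year: " _b[skolar]
display "Standard error:                    " _se[skolar]
display "R-squared:                         " e(r2)
display "Observations:                      " e(N)
```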

Regression: Predictions
The predict command computes the predicted (fitted) value and the residual for each observation. To calculate predicted values for timlon from our regression, we will name the new variable yhat
. predict yhat

Regression: Predictions
Calculate the residuals and store them as uhat
. predict uhat, resid

Check your new variables


. des yhat uhat
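
A quick way to confirm what predict has produced is to check that each observed wage equals its fitted value plus its residual. The sketch below assumes mylnu.dta is in memory; yhat2 and uhat2 are hypothetical names chosen so they do not clash with yhat and uhat created above, and storing them as double is a choice made here to keep rounding error small.

```stata
* Assumes mylnu.dta is in memory
reg timlon skolar kvinna erfarnht, robust
* Fitted values (the default for predict) and residuals, stored in double precision
predict double yhat2
predict double uhat2, residuals
* Observed wage = fitted value + residual, up to rounding error;
* e(sample) restricts the check to observations used in the regression
assert abs(timlon - (yhat2 + uhat2)) < 1e-6 if e(sample)
* OLS residuals average to zero when the model includes a constant
summarize uhat2
```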

Regression: Joint Hypothesis test (F-test)


To test whether one or more independent variables are jointly statistically significant in explaining variation in the dependent variable. That is, do the listed variables as a group explain variation in the dependent variable?
. reg timlon skolar kvinna erfarnht, r
. test kvinna erfarnht
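
The F statistic and p-value from test are likewise kept as stored results. A minimal sketch, assuming mylnu.dta is in memory.

```stata
* Assumes mylnu.dta is in memory
reg timlon skolar kvinna erfarnht, robust
test kvinna erfarnht
* r(F) is the F statistic of the joint test, r(p) its p-value
display "F statistic: " r(F)
display "p-value:     " r(p)
```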

Graphing Data
Histograms
. hist timlon
. hist uhat

Superimpose a normal curve


. hist uhat, normal

Graphing Data
Scatter plot: to show the relationship between 2 variables.
. graph twoway scatter timlon skolar

Shorthand...
. twoway scatter timlon skolar

Write a title for your graph


. twoway scatter timlon skolar, ti(Hourly wage vs Years of schooling)

Graphing Data
We can fit a regression line onto our scatter plot to see any relationship more clearly.
. twoway (scatter timlon skolar) (lfit timlon skolar)

Export as a PostScript file


. graph export mygraph.ps

Copy the graph directly to MS Word by right-clicking it and choosing Copy.
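
The pieces above can be combined into one graph command followed by an export. This is a sketch assuming mylnu.dta is in memory; the axis titles are illustrative wording, and the replace option (not shown in the slides) lets the export overwrite an existing mygraph.ps.

```stata
* Assumes mylnu.dta is in memory
* Scatter plot with fitted line, a title and axis titles
twoway (scatter timlon skolar) (lfit timlon skolar), title("Hourly wage vs years of schooling") ytitle("Hourly wage") xtitle("Years of schooling")
* replace overwrites mygraph.ps if it already exists
graph export mygraph.ps, replace
```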

End Session
Don't forget to
. log close
. cmdlog close
