LEARN SAS Within 7 Weeks: Part3 (Introduction To SAS - SET, MERGE, and Multiple Operations)

Week 9 SET, MERGE and Multiple Operations
Unit 4
SAS for Data Management
Week 9: Introduction to SAS – SET, MERGE, and Multiple Operations
Welcome.
SAS is an excellent software program for manipulating data! In this regard, it is a data
manager’s dream. This reading is a detailed introduction to a variety of SAS techniques for data
management, ranging from the simple definition of missing values to the alteration of the data
structure itself (e.g., creating multiple records from one or creating one record from many).
Goals of Week 9: Introduction to SAS – SET, MERGE, and Multiple Operations
1. To understand the rationale for and be competent in the creation of subsets (defined by variable
selection or by subject selection or both) of SAS data sets;
2. To understand the distinction between concatenating versus merging data sets and to be
competent in these techniques;
3. To appreciate, especially, the use of the MERGE statement in the manipulation of relational
databases;
4. To be competent in the use of conditional expressions such as IF, THEN, and ELSE IF;
5. To appreciate the efficiency of ARRAYS and be competent in their use; and
6. To appreciate the necessity (at times) of planning and outlining in advance the writing of SAS
code.
week 09 9.1
Week 9 Outline – Introduction to SAS: SET, MERGE and Multiple Operations

Section Topic
1. Using the SET Statement
a. How to Save Only a Subset of Variables for Current Use (KEEP, DROP)
b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)
c. How to Create New Variables – Part 1
d. How to Concatenate Datasets
e. How to Use the IN Instruction for Identification and Selection
f. How to Use BY, FIRST., and LAST. for Identification and Selection
g. How to Use the OUTPUT Instruction to Create Several Data Sets at Once
h. How to Create MULTIPLE Records from One Record
i. How to Use RETAIN to Create ONE Record from Multiple Records
2. Using the MERGE Statement
a. How to Combine Data Sets Without Duplication
3. How to Create New Variables – Part 2
a. Addition, Subtraction, Multiplication, Division, Exponentiation
b. How to Code Conditional Expressions (IF, THEN, ELSE IF)
c. How to Code Multiple Instructions Using DO and END
d. How to Code Repetition of Instructions Using an ARRAY
4. Illustration
Now you have a SAS data set. It is in SAS format. A variety of techniques are available for the
manipulation of SAS data sets.
Some manipulations of SAS data sets that you might want:
a. Create new variables
b. Make subsets of the original data
c. Restructure the data
d. Combine several data sets
There are three statements for managing SAS data sets within a DATA step: (1) SET, (2) MERGE,
and (3) UPDATE.
Statements for Managing SAS Data Files in a DATA Step
SET file1 file2;            Concatenates a list of files into a single file
MERGE file1 file2;          Combines, record by record, multiple files into a single file
UPDATE master1 newfile1;    Updates the MASTER1 file with replacement variable values taken from the NEWFILE1 file
In this reading, the SET and MERGE statements are introduced in detail. Not presented is the
UPDATE statement.
1. Using the SET Statement
The SET statement can be used in a SAS DATA step to accomplish a variety of tasks, including (but
not limited to):
1. Create a copy of a SAS data set (for example, a permanent copy)
2. Create a subset of observations or records in a data set
3. Add new variables to a data set
4. Concatenate several data sets
5. Create a subset of variables
6. Identify data sources
7. Rearrange the structure of a data set
8. Combinations of the above.
To illustrate the syntax for a SAS SET statement and some options for application, a series of
examples follow.
TIP on the naming of a new data set:
If the same name is used for the new data set as the old data set, the new data set will
overwrite the old version. The old data set is then gone!
1a. How to Save Only a Subset of Variables for Current Use (KEEP and DROP)
Rationale
• It is very often the case that specific analyses involve only a subset of the study variables. Less
storage and greater efficiency of the SAS program will be achieved if only the necessary
subset of study variables is used.
Example
• This example creates a working copy from a permanent data set, saving only a subset of the
variables for current use.
• If you want to keep only a subset of the variables, you must use either the KEEP or DROP
option.
• Otherwise, the default is in place: all of the variables are kept.
• Choose KEEP or DROP – whichever list is shorter
• OR, choose KEEP or DROP – whichever is the listing that you want to be explicitly coded.
*** Example of KEEP dataset option ***;

*** to keep a subset of variables ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER(KEEP=SID AGE SEX HT WT);
SET SDAT.EXER1;
RUN;
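When the list of unwanted variables is the shorter one, the DROP= option is the more compact way to write the same kind of step. The following is a minimal sketch, assuming purely for illustration that SDAT.EXER1 contains two extra variables called NOTES and ENTRYDT; those two names are hypothetical and do not come from the course data.

```sas
*** Sketch: DROP= dataset option (NOTES and ENTRYDT are hypothetical names) ***;
LIBNAME SDAT 'C:\TEMP';
DATA EXER(DROP=NOTES ENTRYDT);   * every variable except these two is kept;
  SET SDAT.EXER1;
RUN;
```

The KEEP= version states explicitly what the working copy contains; the DROP= version states explicitly what was left behind. Which is clearer depends on which list you expect to read later.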
KEEP (or DROP) can be used in 2 different places in a data step; the syntax differs slightly.
1. As part of the data statement.
The option appears in parentheses ( ) after the name of the new data set, and KEEP= is
followed by a list of variables to be kept in the new dataset.
2. As a separate statement.
• When used as a separate statement in a data step, the word KEEP (or DROP) is followed by a
list of variables to be kept in (or dropped from) the new data set.
• Because KEEP and DROP are declarative (not executable) statements, the statement can appear
anywhere in the DATA step before the RUN; statement that ends the step.
• There is no equal sign when used as a separate statement.
• TIP: The position of a KEEP statement does not change which variables are kept, but placing it
after the statements that create new variables makes it easier to check that every new variable
you want is on the list; a new variable that is not listed is not saved. If you use a RENAME
statement to rename a variable, use the OLD name on a KEEP statement, but the NEW name on a
KEEP= dataset option.
*** Example of KEEP statement ***;

*** to keep a subset of variables ***;
DATA EXER;
SET SDAT.EXER1;
KEEP SID AGE SEX HT WT;
RUN;
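The rule about old and new names can be sketched with two equivalent steps. This assumes the same SDAT.EXER1 data set as above; the new name WT1 and the output names EXER_A and EXER_B are hypothetical, chosen only for illustration.

```sas
*** Sketch: RENAME statement combined with KEEP (WT1 is a hypothetical new name) ***;

* Version 1 - KEEP statement: list the OLD name (WT);
DATA EXER_A;
  SET SDAT.EXER1;
  RENAME WT = WT1;
  KEEP SID AGE SEX HT WT;
RUN;

* Version 2 - KEEP= dataset option: list the NEW name (WT1);
DATA EXER_B(KEEP=SID AGE SEX HT WT1);
  SET SDAT.EXER1;
  RENAME WT = WT1;
RUN;
```

Both steps should produce a data set containing SID, AGE, SEX, HT and WT1. The difference arises because SAS applies the KEEP statement before the RENAME statement, and applies the dataset options on the DATA statement afterwards.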
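TIP: In general, in SAS: KEEP and DROP act on variables (columns), while DELETE and the
subsetting IF act on observations (rows). The RETAIN statement, despite its name, does not select
observations; it holds a variable's value from one observation to the next (see section 1i).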
1b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)
Rationale
• The rationale is similar to the one given above. Specifically, it is very often the case that specific
analyses involve only a subset of the study subjects. Less storage and greater efficiency of
the SAS program will be achieved if only the necessary subset of subjects is used.
Example
This example creates a subset of the data using an IF statement. The new data set contains data on
females only; they are selected for inclusion for the reason of having the value ‘F’ for the variable
SEX.
Note: To refer to values of character variables the values must be enclosed in quotes (single or double).
*** Create a data set called F_EXER. Include only females (SEX='F') ***;
DATA F_EXER;
SET SDAT.EXER1;
IF SEX='F'; /* Include in new data set only IF SEX='F' */
RUN;
This could also be done using the DELETE statement:
IF SEX='M' THEN DELETE;
CAUTION:
• This instruction tells SAS to delete all observations with the value of ‘M’ for SEX.
• Therefore, observations with a missing value for SEX are still retained.
• Thus, IF SEX=’F’ is a better choice when only females are wanted.
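If the DELETE form is preferred, the missing values have to be handled explicitly. A minimal sketch, assuming SEX is a character variable coded 'F' or 'M' as in the example above (the output names F_EXER_A and F_EXER_B are hypothetical):

```sas
*** Sketch: DELETE that also removes records with missing SEX ***;
DATA F_EXER_A;
  SET SDAT.EXER1;
  IF SEX='M' OR SEX=' ' THEN DELETE;   * a blank is the missing value for character data;
RUN;

*** Alternative: delete everything that is not F ***;
DATA F_EXER_B;
  SET SDAT.EXER1;
  IF SEX NE 'F' THEN DELETE;
RUN;
```

The second form gives the same result as the subsetting IF SEX='F'; the first differs only if SEX takes values other than 'F', 'M' and missing.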
1c. How to Create New Variables – Part 1
Rationale
• Self explanatory, yes?
Tip: A suggested tip for data set archival and SAS programming is (1) retaining as a
permanent (“raw”) SAS data set, the data set that contains only the source variables and (2)
writing a stand alone program that has as its only tasks labeling, formatting, and the creation
of new variables. I find that this system makes it easier for me to retrace, debug, modify, and
amend my work - cb.
Tip: Be careful about overwriting. While overwriting can save workspace in SAS since fewer data
sets will be kept during a work session – BE CAREFUL about overwriting. In most instances where
you create new variables or modify old ones, you may want to have the old data set around and you
may want to store a separate data set containing the new version of the data, to help keep track of
the sequence of changes you have made.
Example
This example adds a new variable, height in cm, to the data set.
*** Notice that a separate new variable is created and the old still
exists ***;
DATA SDAT.EXER1;
SET SDAT.EXER1;
* CREATE HEIGHT IN CM from HTIN in inches *;
HTCM = HTIN * 2.54;
RUN;
• If you did not want to keep the variable HTIN in the new version of the data set, then the data
statement could read:
DATA SDAT.EXER1(DROP=HTIN);
• Actually, any variables to be dropped could be named in the DROP option, as described in an
earlier section.
• Further details of computing with SAS variables are covered in section 3 (“How to Create New
Variables – Part 2”).
1d How to Concatenate Datasets
• To concatenate data sets, list the names of the data sets to be concatenated, one after another
on the SET statement.
• There is no restriction on the number of data sets you can name. For practical purposes, don’t
let it get too large – it can be confusing.
• If the variables are not in the same order in the 2 data sets, the order will be determined by the
first-named data set, with any new variables in subsequent data sets listed after all those from
the first data set.
Rationale
• It is sometimes of interest to pool “like” data sets into a single large data set.
• An example is pooling site specific data sets from a multi-center clinical trial into a single
analytic data set.
Example
• This example concatenates two working data sets into a single permanent data set.
• When concatenating files, the number of observations in the new dataset is the sum of the
number in the original datasets.
• All of the variables in the two data sets will be contained in the new one (unless a KEEP
or DROP option or statement is used).
• If the two original data sets do not contain the same variables, then observations from the first
data set will have missing values for the variables contributed by the second, and vice versa.
• The default missing values of a period ‘.’ for numeric data, and a blank for character data will be
used when missing values are created during data processing.
*** concatenating 2 datasets ***;

DATA SDAT.EXER1_2;
SET EXER1 EXER2;
RUN;
• If the data set EXER1 contained the variables SID, AGE, HT and WT, while the data set EXER2
contained SID, HT, and WT then all records from EXER2 in the new data set EXER1_2 would
be assigned a missing value (.) for AGE.
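The behavior just described can be checked with a small self-contained sketch. The data values below are made up purely for illustration; only the variable lists (SID, AGE, HT, WT versus SID, HT, WT) follow the text.

```sas
*** Sketch: concatenating data sets with different variable lists ***;
*** (all data values are invented for illustration)               ***;
DATA EXER1;
  INPUT SID AGE HT WT;
  DATALINES;
1 34 70 180
2 41 64 135
;
RUN;

DATA EXER2;
  INPUT SID HT WT;
  DATALINES;
3 68 160
4 72 201
;
RUN;

DATA EXER1_2;
  SET EXER1 EXER2;   * 2 + 2 = 4 observations in the result;
RUN;

PROC PRINT DATA=EXER1_2;
  TITLE1 'CONCATENATED DATA';
RUN;
```

The printed data set should contain four observations, with AGE shown as a period (.) for SID 3 and SID 4.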
1e How to Use the IN Instruction for Identification and Selection
Rationale
When concatenating several data sets into one, you may want to keep track of an observation’s
origins. For example, in pooling the site specific data from a multi-center clinical trial, the investigator
might well want to retain information on study site.
Guidelines for Using the IN Instruction.
• SAS “IN” variables are variables that SAS creates automatically within a data step and are
saved ONLY for the duration of the data step.
• The IN variables are indicator variables, taking the value of 1 if the data comes from the
specified data set, and 0 otherwise.
• “IN” variables can be used to identify data sources. For example when concatenating data from
two hospitals, a new variable HOSPID can be created to identify the source hospital, by use of
the IN variable, as in the next example.
• How to use IN variables: For each input data set named on a SET statement, an IN variable
can be named. The IN variable for a source data set is named by the phrase
(IN=invariablename) after the data set name on the SET statement.
Example
In this example two data sets, one from each of two hospitals, are pooled. The IN= option keeps
track of the source hospital data set for each observation. This, in turn, is used to create a HOSPID
variable that retains permanently this source hospital identification.
*** using IN variable to identify ***;

*** data source when concatenating files ***;
DATA SDAT.HOSP1_2;
SET HOSP1(IN=INH1)
HOSP2(IN=INH2);
IF INH1=1 THEN HOSPID=1;
ELSE IF INH2=1 THEN HOSPID=2;
RUN;
• Data is read from the input data sets (HOSP1 and HOSP2) in the order in which they are listed.
As an observation is read from HOSP1, the variable INH1 is assigned the value 1, and the
variable INH2 is assigned the value 0. These variables are then used to define the values of
the new variable, HOSPID.
• The IN variables are not saved with the data set, and can only be used for the duration of
the data step.
• NOTE on the meaning of indicator (“logical”) variables: The phrase IF INH2=1 is
equivalent to saying IF INH2. The abbreviated form is permitted because indicator variables
defined as 0/1 are logical variables. A logical variable is one that has value 1 when the
accompanying condition is true and has the value 0 otherwise.
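Written with the abbreviated logical form, the hospital example would look like the sketch below. It uses the same data set and variable names as the example above and is intended to behave identically.

```sas
*** Sketch: same step using the abbreviated logical form ***;
DATA SDAT.HOSP1_2;
  SET HOSP1(IN=INH1)
      HOSP2(IN=INH2);
  IF INH1 THEN HOSPID=1;        * same as IF INH1=1;
  ELSE IF INH2 THEN HOSPID=2;   * same as IF INH2=1;
RUN;
```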
1f How to Use BY, FIRST., and LAST. for Identification and Selection
• Often, a data set will contain multiple records for each subject. In this situation, it may be of
interest to select one or more particular records for a given individual (e.g., the baseline record
from among a longitudinal series of records for that individual).
• Using the BY statement and the FIRST. and LAST. variables, it is possible to select a subset of
observations that satisfy a particular condition.
• Consider a study where everyone should have 2 records – a record from a baseline visit and a
follow-up visit. We might want to find which subjects are missing one of the 2 visits. Such
subjects would contribute only one record in the data set. When the data have been sorted
by a variable, in this case the subject ID, then a SET statement followed by a BY statement
can be used. The BY statement names the variable used to group observations – in this
case, grouping by subject.
Guidelines for Using SET, BY, and FIRST. and LAST.
• When SET and BY statements are used together, SAS creates for you (automatically) two
temporary indicator variables: FIRST.variablename and LAST.variablename, where
variablename is the name of the variable named in the BY statement.
• For each set of multiple records for a given individual, the variable FIRST.variablename has
the value=1 for the first occurrence of each value of variablename. It has the value 0 for all
other records for that individual.
• Similarly, for each set of multiple records for a given individual, the variable
LAST.variablename has the value=1 for the last occurrence of each value of variablename. It
has the value 0 for all other records for that individual.
• Thus, FIRST.variablename and LAST.variablename are simultaneously equal to 1 ONLY if
there is only ONE record for that subject. That is, the observation is both the first and last
occurrence of that value.
Example - Consider again the study where everyone should have 2 records – a record from a
baseline visit and a follow-up visit. As part of our data quality assessment, we want to find which
subjects contribute only one record instead of the intended two records. In this example, the data
have been sorted by the subject identification variable (SID). Once sorted, a SET statement followed
by a BY statement is used. The BY statement names the variable used to form the groups of
observations. In this example, we want to group the records for each individual subject (SID).
** example using FIRST. and LAST. Variables **;

** step 1: create data set with some repeat subjects (SID) **;
DATA TEMP1;
INPUT SID VISIT SCORE;
CARDS;
01 1 87 /* SID=1 contributes ONE record */
02 1 77 /* note that SID=2 contributes TWO records */
03 1 62 /* ditto SID=3 */
02 2 54
03 2 77
04 2 21 /* SID=4 contributes ONE record */
;
RUN;
** Sort by subject ID variable **;

PROC SORT DATA=TEMP1;
BY SID;
RUN;
** Create subset with only 1 visit **;
DATA ONEVISIT;
SET TEMP1;
BY SID;
IF FIRST.SID=1 AND LAST.SID=1 then output;/* select only those with one visit */
RUN;
PROC PRINT DATA=ONEVISIT;

ID SID;
TITLE1 'SUBJECTS WITH ONLY ONE VISIT';
RUN;
** create subset with 2 visits **;
DATA TWOVISIT;
SET TEMP1;
BY SID;
IF FIRST.SID=0 or LAST.SID=0; * selects subjects with more than one record;
RUN;
PROC PRINT DATA=TWOVISIT;
ID SID;
TITLE1 'BOTH VISITS FOR SUBJECTS WITH 2 VISITS';
RUN;
Note:
IF FIRST.SID=1 AND LAST.SID=1 then output;
is the SAME as the instruction
IF FIRST.SID AND LAST.SID;
• Which would you rather have? Readability or brevity? It depends on how comfortable you are!
The resulting output would be:

SUBJECTS WITH ONLY ONE VISIT
SID  VISIT  SCORE
1    1      87
4    2      21

BOTH VISITS FOR SUBJECTS WITH 2 VISITS
SID  VISIT  SCORE
2    1      77
2    2      54
3    1      62
3    2      77
Let’s analyze how this works. Once the data were in sorted order by SID they would look like this:
01 1 87
02 1 77
02 2 54
03 1 62
03 2 77
04 2 21
• Once the SET and BY statements are given to SAS, SAS creates values for the
FIRST. and LAST. variables.
• The values of these variables are given below. FIRST.SID has the value 1 at the first
occurrence of each value of SID, and 0 otherwise. LAST.SID has the value 1 at the last
occurrence of each value of SID, and 0 otherwise.
• When both FIRST.SID and LAST.SID are 1 this is the only occurrence of that value of SID. If
either FIRST.SID or LAST.SID is 0, there must be more than one occurrence of that value of
SID in the data set.
SID  VISIT  SCORE  FIRST.SID  LAST.SID
01   1      87     1          1
02   1      77     1          0
02   2      54     0          1
03   1      62     1          0
03   2      77     0          1
04   2      21     1          1
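The same indicator variables can be used to pick out one particular record per subject, such as the earliest visit on file. The following is a minimal sketch using the TEMP1 data set created above; the output name FIRSTREC is hypothetical. Sorting by VISIT within SID is added here so that the first record in each group is the one with the lowest visit number.

```sas
*** Sketch: keep the earliest available visit for each subject ***;
PROC SORT DATA=TEMP1;
  BY SID VISIT;       * order visits within each subject;
RUN;

DATA FIRSTREC;        * FIRSTREC is a hypothetical name;
  SET TEMP1;
  BY SID;
  IF FIRST.SID;       * keep only the first record in each SID group;
RUN;

PROC PRINT DATA=FIRSTREC;
  ID SID;
  TITLE1 'EARLIEST VISIT ON FILE FOR EACH SUBJECT';
RUN;
```

With the six records shown above, this should keep visit 1 for SID 1, 2 and 3, and visit 2 for SID 4, whose baseline record is absent. Using IF LAST.SID; instead would keep the latest visit on file.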
TIP – How to Check for Duplicate Records
First. and Last. variables are useful to check for duplicate records that may have occurred
inadvertently in data entry. The following lines could be used to look for duplicates:
** check for duplicate records for each patient identified by PATID **;
PROC SORT DATA=D1;
BY PATID;
RUN;
DATA DUPS;
SET D1;
BY PATID;
IF FIRST.PATID=0 or LAST.PATID=0; *selects obs with repeat of PATID;
RUN;
PROC PRINT DATA=DUPS;

TITLE1 'Duplicate PATient ID numbers';
RUN;
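Note the use of OR in the selection statement: a record is selected if it is not the first record for
its PATID or not the last, that is, whenever the PATID occurs more than once.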
1g How to Use the OUTPUT Statement to Create Several Data Sets at Once
Idea
• Provide the names of the multiple data sets in a single data step.
• Then, use one or more OUTPUT statements, each paired with the condition that must be
satisfied in order for the current record to be output or “written” to the named data set.
Example
A data set contains 3 records for each subject such that each record corresponds to a different day.
Wanted instead are 3 separate data sets, where each data set corresponds to a given day.
*** Example creating multiple data sets ***;

*** in one data step ***;
DATA EXER1 EXER2 EXER3; * name 3 datasets in 1 step;
SET SDAT.EXER_ALL;
IF DAY=1 THEN OUTPUT EXER1; * EXER1 will contain day 1 records only;
ELSE IF DAY=2 THEN OUTPUT EXER2; * EXER2 will contain day 2 records only;
ELSE IF DAY=3 THEN OUTPUT EXER3; * EXER3 will contain day 3 records only;
RUN;
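In the step above, a record whose DAY value is missing or is not 1, 2 or 3 is written to none of the three data sets. One way to keep track of such records is a catch-all data set, sketched below; the name OTHER is hypothetical.

```sas
*** Sketch: add a catch-all data set for unexpected DAY values ***;
DATA EXER1 EXER2 EXER3 OTHER;   * OTHER is a hypothetical name;
  SET SDAT.EXER_ALL;
  IF DAY=1 THEN OUTPUT EXER1;
  ELSE IF DAY=2 THEN OUTPUT EXER2;
  ELSE IF DAY=3 THEN OUTPUT EXER3;
  ELSE OUTPUT OTHER;            * missing or unexpected DAY values land here;
RUN;
```

If OTHER turns out to have zero observations, every record was accounted for.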
1h How to Create MULTIPLE Records from One Record
Rationale
• This manipulation is of interest when repeated measurements are collected in one record and
you want to use repeated measurements analysis of variance software (SAS is one of
them) that requires that the repeated measures be in separate records.
Example
• In this example, each subject has measurements (SCORE1 - SCORE3) from three occasions
and they are all listed in the same record.
• To make separate observations, 3 for each subject, the following statements could be used:
*** Example creating multiple observations from a single observation ***;

*** Original data has repeat values for each subject ***;
DATA TEST;
INPUT SID $ SCORE1 SCORE2 SCORE3;
CARDS;
A 99 98 97
B 88 87 85
;
RUN;
PROC PRINT DATA=TEST;
TITLE 'LIST OF SCORES BY SUBJECT: ORIGINAL DATA';
RUN;
*** define 2 new vars: ***;

*** score = score regardless of timing ***;
*** status defines timing of score ***;
DATA MULT(KEEP=SID SCORE STATUS);
SET TEST;
SCORE=SCORE1; STATUS=1; OUTPUT; *set score=1st score-write obs to file;
SCORE=SCORE2; STATUS=2; OUTPUT;
SCORE=SCORE3; STATUS=3; OUTPUT; * repeat for each score ;

RUN;
PROC PRINT DATA=MULT;
TITLE 'LIST OF SCORES BY STATUS: NEW VERSION';
RUN;
The resulting output would be:

LIST OF SCORES BY SUBJECT: ORIGINAL DATA
OBS  SID  SCORE1  SCORE2  SCORE3
1    A    99      98      97
2    B    88      87      85

LIST OF SCORES BY STATUS: NEW VERSION
OBS  SID  SCORE  STATUS
1    A    99     1
2    A    98     2
3    A    97     3
4    B    88     1
5    B    87     2
6    B    85     3
Note on the Use of the OUTPUT statement –
• The OUTPUT statement causes a new observation to be written at the point that it appears.
Thus, in this example, the OUTPUT statement appears three times and in such a manner as to
accomplish the creation of three observations for each one observation that is read in.
• If the OUTPUT statement had NOT been used, one observation would have been written to
the new data set for each one observation read in from the source data set.
1i How to Use RETAIN to Create ONE Observation from Multiple Records
Rationale -
• Sometimes, it is the reverse that is desired. You start with one data set with several records
per subject, and you wish to have one record per subject.
• An example is when you wish to produce tables of descriptive statistics for the repeated
measures.
Example –
Beginning with the data set MULT that contains multiple records for each subject, you wish to
produce the data set TEST2 that, like the original TEST, contains just one record for each subject.
This can be accomplished by the following:
** use SORT to get all data from each subject together **;
PROC SORT DATA=MULT;
BY SID;
RUN;
** set BY SID, so LAST.SID indicator can be used;

DATA TEST2(KEEP=SID SCORE1 SCORE2 SCORE3);
SET MULT;
BY SID;
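** Retain 3 new score variables to hold values **;

RETAIN SCORE1 SCORE2 SCORE3 .;
** Reset at the first record of each subject so that a value **;
** from the previous subject is never carried over **;
IF FIRST.SID THEN CALL MISSING(SCORE1, SCORE2, SCORE3);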
** Assign 1st score when status is 1, etc. **;

IF STATUS=1 THEN SCORE1=SCORE;
ELSE IF STATUS=2 THEN SCORE2=SCORE;
ELSE IF STATUS=3 THEN SCORE3=SCORE;
** Now write data to output data set after last obs **;
** for the subject is read **;
IF LAST.SID THEN OUTPUT;
RUN;
PROC PRINT DATA=TEST2;

TITLE 'TEST2: ONE OBS PER SUBJECT, RECREATED';
RUN;
The Idea of the RETAIN Statement
• The RETAIN statement is a placeholder; it is used to retain or hold the value of the variable(s)
from the previous record.
• Since three records from the input data set must be read to create one observation in the output
data, a RETAIN statement is required.
• Without the RETAIN statement, the value of a variable created in the DATA step (here SCORE1 -
SCORE3) is automatically reset to missing before each new observation is read. (Try this
example without the RETAIN statement to see the difference.)
• The BY statement must be used so that the LAST. variable is available.
• An OUTPUT instruction is given ONLY AFTER all the observations from a subject have been
read.
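The suggestion to try the example without RETAIN can be sketched as follows. The step is the same as the one above with the RETAIN statement removed; the output name NORETAIN is hypothetical, and MULT is assumed to have been sorted by SID as shown earlier.

```sas
*** Sketch: the same step WITHOUT the RETAIN statement ***;
DATA NORETAIN(KEEP=SID SCORE1 SCORE2 SCORE3);   * NORETAIN is a hypothetical name;
  SET MULT;
  BY SID;
  IF STATUS=1 THEN SCORE1=SCORE;
  ELSE IF STATUS=2 THEN SCORE2=SCORE;
  ELSE IF STATUS=3 THEN SCORE3=SCORE;
  IF LAST.SID THEN OUTPUT;
RUN;

PROC PRINT DATA=NORETAIN;
  TITLE 'WITHOUT RETAIN: EARLIER SCORES ARE LOST';
RUN;
```

Because SCORE1, SCORE2 and SCORE3 are reset to missing each time a new record is read, only the assignment made on the record that is current at output time survives. With MULT as created above, each output record should therefore show a value for SCORE3 and missing values for SCORE1 and SCORE2.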
2 Using the MERGE Statement
2a How to Combine Data Sets Without Duplication
Rationale
• We have just seen how to use the SET statement to concatenate files. It was mentioned that in
concatenating files, subjects that appear in multiple files will contribute multiple records. Thus,
concatenation can be viewed as just “piling together” the separate sets of data. We may wish
to be more selective than just “piling together”.
• For example, in the use of relational databases, it is often of interest to join together data from
two (or more) files where the two (or more) are linked by one or more common variables.
• Consider a relational database that includes one data file containing baseline information, and
another data file with follow-up information on the same subjects. Alternatively, one file may
contain data from a questionnaire, and another file clinical data on the same patients.
• Thus, it is often of interest to merge files to produce a single output file that contains a single
observation or record for each subject.
• If a SET statement were used to concatenate the files, subjects that were included in each file
would appear twice – have 2 records – in the combined file, one contributed from each file. To
avoid this duplication, files are combined using a MERGE statement.
Guidelines for the Use of the MERGE Statement
• To combine the two (or more) files we first identify a common (“linking”) variable in the files.
• The data must then be sorted (PROC SORT) within each file according to the common
(“linking”) variable.
• The data sets are then MERGED, where the linking variable is specified with a BY statement.
In the example that follows, the common variable is the study ID, SID.
• Inclusion of the BY statement naming the linking variable is crucial. Without this, the first
record from the first data set is matched with the first record from the second data set, and so
on, without regard to the subject. Warning - You will not see an error message if you
forget a BY statement – in fact the program will run just fine. Thus, if you find mismatches in
your data – check that the proper SORT and BY statements were used.
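The effect of a forgotten BY statement is easy to see on a small example. The sketch below uses made-up data; the data set names BASE, FOLLOW, WRONG and RIGHT, the variables WT1 and WT2, and all values are hypothetical. Both input data sets are entered already in SID order, so no PROC SORT is needed here.

```sas
*** Sketch: MERGE with and without a BY statement (invented data) ***;
DATA BASE;
  INPUT SID WT1;
  DATALINES;
101 150
102 180
103 165
;
RUN;

DATA FOLLOW;            * subject 102 has no follow-up record;
  INPUT SID WT2;
  DATALINES;
101 148
103 160
;
RUN;

DATA WRONG;             * no BY statement: records are paired by position;
  MERGE BASE FOLLOW;
RUN;

DATA RIGHT;             * BY statement: records are matched on SID;
  MERGE BASE FOLLOW;
  BY SID;
RUN;

PROC PRINT DATA=WRONG;
  TITLE1 'MERGE WITHOUT BY';
RUN;
PROC PRINT DATA=RIGHT;
  TITLE1 'MERGE WITH BY SID';
RUN;
```

In WRONG, the second record pairs the baseline weight of subject 102 (180) with the follow-up record of subject 103, and the SID shown is 103 because the last-named data set supplies the value of a shared variable. In RIGHT, subject 102 keeps its own baseline weight and has a missing WT2. Neither step produces an error message.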
Example -
*************************************;
*** Example using MERGE statement ***;
*************************************;
LIBNAME SDAT 'C:\temp';
* sort (by linking variable SID) and print data from baseline visit *;
PROC SORT DATA=SDAT.EXER1;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER1;
TITLE1 'BASELINE DATA';
RUN;
* sort (by linking variable SID) and print data from follow-up visit;
PROC SORT DATA=SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER2;
TITLE1 'FOLLOW-UP DATA';
RUN;
** merge files matching on SID **;
DATA SDAT.EXALL1;
MERGE SDAT.EXER1
SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXALL1;
TITLE1 'MERGED DATA';
RUN;
The resulting files are listed below:
BASELINE DATA
OBS  SID  VDATE1    DOB
1    101  06/12/90  01/27/84
2    102  06/12/90  10/11/87
3    103  06/13/90  05/09/87

FOLLOW-UP DATA
OBS  SID  VDATE2    DOB
1    101  08/15/90  01/27/84
2    102  08/15/90  11/10/87
3    104  08/17/90  04/22/85

MERGED DATA
OBS  SID  VDATE1    DOB       VDATE2
1    101  06/12/90  01/27/84  08/15/90
2    102  06/12/90  11/10/87  08/15/90
3    103  06/13/90  05/09/87  .
4    104  .         04/22/85  08/17/90
• Notice that the SIDs represented in the baseline data are 101, 102 and 103, whereas the SIDs
represented in the follow-up data are 101, 102 and 104. The created (merged) data set has all
four SIDs represented: 101, 102, 103 and 104. However, there are some missing values.
• The combined file contains one record for each of the records from the individual files, where
records appearing in both files have been merged, and are recorded as a single record with a
complete set of variables from both of the original files. When a record does not have a match
in the other file, the values for variables from the other file are set to missing values.
• Yuck – DOB (date of birth) for SID=102 has different values in the baseline and follow-up data
sets.
• IMPORTANT NOTE - When merging files that contain the same variables –with the same
variable names – the value from the LAST-NAMED data set will be the value retained in
the merged data set. Thus, be careful of the order in which data sets are named on the
MERGE statement. In the above example, birthdate (DOB) is recorded at both visits with the
same variable name, and the value from the follow-up data file is retained (see SID 102, where
the DOB values don’t match).
• TIP - When merging 2 files with the same variable names, careful planning is required to
retain important information in the merged data set.
a. If you have values for the same variable, for example weight measured at two
occasions, make sure the data does not have the same variable name in both data sets
if you wish to retain both values in the new data set.
b. If the variables have the same name, such as WT, and you wish to retain both weight
variables in the merged data (for example to compute weight loss or gain), you must
rename (at least) one of them in a DATA step prior to the merge. This can be done with
a RENAME statement: RENAME oldname = newname; for example:
RENAME WT = WT1;
The advantage of using rename, rather than simply creating a new variable and dropping
the old one (WT1 = WT; DROP WT; ), is in retaining all the variable attributes (such
as length, formatting, labeling) that were previously set, when rename is used.
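As an alternative to a separate DATA step, the renaming can also be requested at the moment the files are read, using the RENAME= dataset option on the MERGE statement. The sketch below assumes, for illustration only, that both SDAT.EXER1 and SDAT.EXER2 contain a variable named WT and are already sorted by SID; the names EXWT, WT1, WT2 and WTCHANGE are hypothetical.

```sas
*** Sketch: renaming a shared variable while merging ***;
DATA EXWT;                               * EXWT is a hypothetical name;
  MERGE SDAT.EXER1(RENAME=(WT=WT1))      /* baseline weight becomes WT1  */
        SDAT.EXER2(RENAME=(WT=WT2));     /* follow-up weight becomes WT2 */
  BY SID;
  WTCHANGE = WT2 - WT1;                  * positive value = weight gain;
RUN;
```

Because each weight arrives under its own name, neither value overwrites the other. WTCHANGE is missing for any subject who lacks one of the two records.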
As with the SET statement, several variables can be automatically created with the MERGE
statement.
"IN" variables can be defined that indicate whether the current record was available in a particular
data set. FIRST. and LAST. variables can also be created that indicate the first and last occurrence
of a BY variable value. These variables can be used to specially tailor the resulting data set. The IN
variables and FIRST. or LAST. variables are available during a DATA step, but are not added to
the output data set. Their values can be saved in the output data set only by defining new
variables from them in the data step. The automatic variables are all indicator
variables. They have a value of "1" when the condition is true, and "0" otherwise. Note that FIRST.
and LAST. variables are created for each variable named in the BY statement that is used in the
DATA step.
Example – Using the MERGE statement and IN Variables
• In the study of warm/cold cardiopulmonary bypass and neurologic function, baseline test data of
various sources was entered in several files for all individuals tested at baseline.
• However several subjects were later found to be ineligible for the study, due to complications
during surgery, or mistakes made in reviewing patient history.
• Rather than try and delete each such case individually from each of the study files, it is simpler
to keep a master "inclusion" file. Only those subjects whose names and ID numbers are kept
in the latest version of the inclusion file, i.e., those deemed eligible, are used in data analyses.
*** program to merge inclusion file ***;

*** with psych data file, keeping ***;
*** only subjects in inclusion file ***;
* sort data from include file- This file contains the deemed eligible;
PROC SORT DATA=SDAT.INCLUDE1;
BY PATID;
RUN;
* sort data from psych testing;

PROC SORT DATA=SDAT.PSYCH1;
BY PATID;
RUN;
** merge by PATID, and keep only those patients deemed eligible**;

** these are the subjects listed in INCLUDE file **;
DATA PSYCH1;
MERGE SDAT.INCLUDE1(IN=INCL1)
SDAT.PSYCH1;
BY PATID;
IF INCL1 = 1;
RUN;
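Naming an IN variable for each input file also makes it possible to list the records that fail to match, which is a useful check before analysis. A minimal sketch, assuming both files have already been sorted by PATID as above; the IN variable INPSY and the output names ELIG_NOPSYCH and PSYCH_NOTELIG are hypothetical.

```sas
*** Sketch: use two IN variables to find unmatched records ***;
DATA ELIG_NOPSYCH PSYCH_NOTELIG;          * hypothetical data set names;
  MERGE SDAT.INCLUDE1(IN=INCL1)
        SDAT.PSYCH1(IN=INPSY);
  BY PATID;
  * eligible patient with no psych testing record;
  IF INCL1 AND NOT INPSY THEN OUTPUT ELIG_NOPSYCH;
  * psych testing record for a patient not in the inclusion file;
  ELSE IF INPSY AND NOT INCL1 THEN OUTPUT PSYCH_NOTELIG;
RUN;
```

Records present in both files are written to neither data set, so the two outputs contain only the exceptions.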
Example - Sometimes, combinations of subsetting, concatenating and merging files are used to
create analysis data sets.
• In the cardiopulmonary bypass study, in order to summarize the pre-surgical test results, a
subset of variables (totals, rather than subscores) was selected from each data file.
• In addition, a subset of records, the baseline or pre-surgery records were selected.
• These were then merged with the "inclusion" file to keep only data for eligible patients.
• Summary baseline reports were based on final files created in this manner, for neurologic,
neuropsych, and delirium testing.

************************************************************************;
** SELECT TOTALS (Nscore,Mscore), ID, AND DATE VARIABLES **;
** KEEP PRE-TEST DATA ONLY (PSTATUS=1) **;
************************************************************************;
DATA N1(KEEP=PATID NSCORE MSCORE NDATE);
SET SDAT.NEURO1;
IF PSTATUS=1; * keep pre-test data only;
RUN;
PROC SORT DATA=SDAT.INCLUDE1; * must sort data from include file;

BY PATID;
RUN;
PROC SORT DATA=N1; * must sort data from neuro testing;
BY PATID;
RUN;
**************************************************;
** SAVE NEURO TOTALS FOR ELIGIBLE PATIENTS **;
**************************************************;
DATA SDAT.NBASE1(LABEL='BASELINE NEURO TOTALS');
MERGE SDAT.INCLUDE1(IN=INCL1)
N1;
BY PATID;
IF INCL1;
RUN;
(further steps to produce summary statistics)
3 How to Create New Variables – Part 2
Very often your data as brought into SAS from data entry is “raw”. It is not in the final form necessary
for summarization and analysis (“analytic”). Typically, you will need to make subsets of the data and
merge files, as described above. You will also want to compute or create new variables. New
variables can be created in any data step, after the INPUT or SET or MERGE statement. However,
these instructions MUST be placed before the RUN statement (or before the CARDS statement for
instream data).
3a Addition, Subtraction, Multiplication, Division, Exponentiation
New numeric variables can be created by
• adding (+)
• subtracting (-)
• multiplying (*)
• dividing (/)
• exponentiating (**)
existing variables by constants or by other variables. New variables can also be created using IF -
THEN conditional statements, and by combinations of these.
• When defining a new variable the name of the new variable is always given on the left
side of the equal sign, and the expression or value on the right side of the equal sign.
• The statement is evaluated for each observation, so that a new variable is created with a value
for each record, according to the given expression.
• Complex combinations of operations can be used following standard conventions for order of
operations. Use parentheses – ( ) – to be explicit about order of operations. Make sure that
every open parenthesis – ( – is paired with a close parenthesis – ). Some examples follow.
Examples -
HTCM = HT * 2.54; * create ht in cm from ht in inches;
WTKG = WT / 2.2; * create wt in kg from wt in pounds;
STD = VAR ** .5; * compute std dev as sq root of variance;
MSCORE=(SCORE1 + SCORE2 + SCORE3)/3; * compute average of 3 scores;
AGE = (VDATE1 - DOB)/365.25 ; * compute age in years at visit;
INTCPT = 1; * define a constant for all observations;
• In the above examples, if any of the original variables (on the right side of the equation) has a
missing value, then the value of the new variable will be missing also, except for the last
example, where the value of INTCPT is 1 for all observations.
NOTE - Other Variable Creation Techniques
Computation of new variables for each observation can also be done using SAS functions. The
functions handle missing data in a different way than standard computation, giving more options to
the user. Functions will be discussed in a later section.
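As a minimal sketch of that difference, the step below computes the same average two ways. The data set MISSDEMO, its values, and the variable names MSCORE_A and MSCORE_F are hypothetical and used only for illustration.

```sas
** hypothetical data: subject 2 is missing SCORE2 **;
DATA MISSDEMO;
  INPUT ID SCORE1 SCORE2 SCORE3;
  MSCORE_A = (SCORE1 + SCORE2 + SCORE3)/3;   * arithmetic: missing if any score is missing;
  MSCORE_F = MEAN(SCORE1, SCORE2, SCORE3);   * function: averages the non-missing scores;
  DATALINES;
1 6 9 3
2 8 . 4
;
RUN;
PROC PRINT DATA=MISSDEMO;
RUN;
```

For subject 1 both versions give 6. For subject 2, MSCORE_A is missing, while MSCORE_F is 6, the mean of the two non-missing scores 8 and 4. Which behavior is appropriate depends on the analysis.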
3b How to Code Conditional Expressions (IF, THEN, ELSE IF)
Rationale –
• Suppose you want to perform an operation or group of operations for some observations but
not for all. IF - THEN statements are used in this situation.
• A few examples have already been given in the sections on SET and MERGE statements.
Examples of valid conditional statements follow.
• Note once again: to refer to specific values of character variables, the values must be
enclosed in single quotes.
IF SEX = 'F'; * retains cases with value F for SEX;
IF SEX = 'M' THEN DELETE; * deletes cases with value M for SEX;
IF AGE < 0 THEN DELETE; * deletes cases with missing or invalid AGE;
IF SEX = 'F' AND AGE > 0; * retains cases with valid AGE and SEX value F;
IF SEX = 'F' OR SEX = 'M'; * retains cases with valid SEX ;
New variables can also be defined with conditional statements.
For example to create a variable AGEGR to represent age groups, the following set of statements
can be used:
IF 0<AGE<20 THEN AGEGR=1;
ELSE IF 20<=AGE<40 THEN AGEGR=2;
ELSE IF 40<=AGE<60 THEN AGEGR=3;
ELSE IF 60<=AGE THEN AGEGR=4;
Use of the ELSE IF, while not required, means that the data will be evaluated more efficiently, and
can save processing time in large data sets. Once a condition has been met for an observation, the
subsequent ELSE IF statements are not evaluated for that observation. For example, for an 18 year
old, the first statement is evaluated, the value ‘1’ assigned for AGEGR, and the subsequent
statements bypassed since the condition was met. For a 45 year old the first 3 statements are
evaluated. If you know that most of your sample falls into one group, putting that conditional
statement first will be the most efficient way to create a new variable. Be careful using ELSE IF
when your condition is based upon more than one variable – it can be tricky.
New character variables can also be defined using conditional statements.
Tip – Be careful: the length of a character variable is determined by the first value defined in
the IF-THEN sequence. If longer values are subsequently given, they will be truncated to the length
of the first value named. In the next example, LIGHT, MODERATE, and HEAVY smokers are defined
from the number of cigarettes smoked per day (CIGSDAY).
** creating ordinal groups for smoking **;

IF 0<CIGSDAY<=10 THEN HSMOKE='LIGHT';
ELSE IF 10<CIGSDAY<=24 THEN HSMOKE='MODERATE';
ELSE IF 24<CIGSDAY THEN HSMOKE='HEAVY';
In this example, the value MODERATE would be truncated to 5 characters (MODER) to match the
length of LIGHT. To avoid this, the first named value can be padded with blanks to the length of the
longest value (8 characters), i.e.,
HSMOKE='LIGHT   ';
How to Ensure that Created Character Variables Have the Desired Length
Use a LENGTH statement before the variable is first defined to set its length. Placing the statement
LENGTH HSMOKE $8;
before the first statement that defines HSMOKE stores all values without truncation.
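The following sketch puts the two behaviors side by side. The data set LENDEMO, its three CIGSDAY values, and the variable names HSMOKE1 and HSMOKE2 are hypothetical, chosen only to show the effect.

```sas
** hypothetical data: one light, one moderate, one heavy smoker **;
DATA LENDEMO;
  LENGTH HSMOKE2 $ 8;                       * length set before first use;
  INPUT CIGSDAY;
  ** HSMOKE1 has no LENGTH statement, so LIGHT fixes its length at 5 **;
  IF 0<CIGSDAY<=10 THEN HSMOKE1='LIGHT';
  ELSE IF 10<CIGSDAY<=24 THEN HSMOKE1='MODERATE';
  ELSE IF 24<CIGSDAY THEN HSMOKE1='HEAVY';
  ** HSMOKE2 uses the same logic but has room for 8 characters **;
  IF 0<CIGSDAY<=10 THEN HSMOKE2='LIGHT';
  ELSE IF 10<CIGSDAY<=24 THEN HSMOKE2='MODERATE';
  ELSE IF 24<CIGSDAY THEN HSMOKE2='HEAVY';
  DATALINES;
5
20
40
;
RUN;
PROC PRINT DATA=LENDEMO;
RUN;
```

For the subject smoking 20 cigarettes per day, HSMOKE1 is stored as MODER while HSMOKE2 is stored as MODERATE.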
Be aware that many SAS procedures automatically order the values of a character variable
alphabetically when printing summary tables. This ordering is often not appropriate for ordinal
variables such as HSMOKE, whose values would appear in the order HEAVY, LIGHT, MODERATE in
frequency tables and graphic displays. To avoid this, HSMOKE can be defined as a numeric variable
with codes 1, 2, and 3, with formats then assigned to the codes.
** create grouped, ordinal variable for cigs/day **;

DATA SMOKE2(KEEP=CASEID HSMOKE);
SET SMOKE1;
IF 0<CIGSDAY<=10 THEN HSMOKE=1;
ELSE IF 10<CIGSDAY<=24 THEN HSMOKE=2;
ELSE IF 24<CIGSDAY THEN HSMOKE=3;
RUN;
** create formats **;
PROC FORMAT;
VALUE SMKFMT 1='LIGHT'
2='MODERATE'
3='HEAVY';
RUN;
** get frequencies of hsmoke **;
PROC FREQ DATA=SMOKE2;
FORMAT HSMOKE SMKFMT.;
TABLES HSMOKE;
TITLE1 'FREQUENCY OF LIGHT, MODERATE, AND HEAVY SMOKING';
RUN;
In this program, the frequency table of HSMOKE would be ordered by its numeric value, but the
words LIGHT, MODERATE, and HEAVY would appear in the table.
3c How to Code Multiple Instructions Using DO and END statements
Rationale –
• It might be that, when a condition is satisfied, you want more than one operation to occur.
• In this case an IF-THEN-DO set of statements is used.
Example (This also illustrates the creation of design variables) –
We want to use the heavy/moderate/light smoking status at baseline as a predictor of QUIT status at
a 6-month follow-up survey. However, it is not appropriate to use the variable HSMOKE in an
analysis as defined above, since the 1-2-3 spacing implies an equal distance between light-moderate
smoking, and moderate-heavy smoking, which is unlikely to be reasonable. Instead, two indicator
variables are needed, which can be defined from the following table:
HSMOKE  MOD  HVY
     1    0    0
     2    1    0
     3    0    1
Light smokers have value 0 for both new variables MOD and HVY, MOD is an indicator of moderate
smoking, and HVY is an indicator of heavy smoking. The following statements would create these
variables, which could then be used in PROC LOGISTIC, a logistic regression procedure, or other
regression procedure.
** example using IF-THEN-DO to create design variables (two indicators) *;

** for each value of HSMOKE **;
DATA SMOKE3;
SET SMOKE2;
IF HSMOKE=1 THEN DO; * When HSMOKE=1, MOD=0 HVY=0 ;
MOD=0; HVY=0;
END;
ELSE IF HSMOKE=2 THEN DO;
MOD=1; HVY=0; * When HSMOKE=2, MOD=1 HVY=0 ;
END;
ELSE IF HSMOKE=3 THEN DO;
MOD=0; HVY=1; * When HSMOKE=3, MOD=0 HVY=1 ;
END;
RUN;
• A condition is defined beginning with IF …, followed by THEN DO; A series of statements
taking action, such as defining new variables, follow the DO;.
• An END; statement is required to end the set of operations to be done when the condition is
met.
• Warning: it is easy to lose track of DO and END statements. An ‘END;’ is required for every
‘DO;’
There are other ways to create these variables, too.
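One such alternative, offered only as a sketch, relies on the fact that a SAS comparison evaluates to 1 when true and 0 when false. The data sets SMOKEDEMO and SMOKE3B and their values are hypothetical stand-ins for SMOKE2 and SMOKE3.

```sas
DATA SMOKEDEMO;                * hypothetical stand-in for SMOKE2;
  INPUT CASEID HSMOKE;
  DATALINES;
1 1
2 2
3 3
4 .
;
RUN;
DATA SMOKE3B;
  SET SMOKEDEMO;
  IF HSMOKE NE . THEN DO;      * leave indicators missing when HSMOKE is missing;
    MOD = (HSMOKE=2);          * 1 for moderate smokers, otherwise 0;
    HVY = (HSMOKE=3);          * 1 for heavy smokers, otherwise 0;
  END;
RUN;
PROC PRINT DATA=SMOKE3B;
RUN;
```

The IF HSMOKE NE . guard matters: without it, a subject with missing HSMOKE would get MOD=0 and HVY=0 and would be treated as a light smoker in a regression. With the guard, case 4 has missing values for both indicators, as it does in the IF-THEN-DO version.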
3d How to Code Repetition of Instructions Using an Array
Rationale
• You want to perform the same operation (or same set of multiple operations) on each of many
variables. Why write this instruction over and over again for each variable? To avoid having to
repeat the same instructions for each variable, an ARRAY can be exploited.
Definition Array
• An array (or ordered listing of variables) is defined on an ARRAY statement by giving the array a
name, followed by the number of elements (variables) in curly brackets { }, followed by the list
of variables. The operation(s) to be carried out on the array elements are then defined in a DO
loop.
Example – Using an Array Statement to Define Missing Value Codes.
Suppose an input data set contains five test scores for each subject, with valid scores of 0 to 10,
while 99 represents a missing value. To convert these 99’s to SAS missing values for all test scores
at once, use the following:
** example using array statement to **;

** change missing code to SAS missing **;
** values for 5 variables at once **;
DATA SCORE2(DROP=I);
SET SCORE1;
** define array named TESTS with 5 elements (variables) *;
ARRAY TESTS{5} TEST1 TEST2 TEST3 TEST4 TEST5;
** use DO loop to change missing code 99 to . **;

DO I = 1 TO 5;
IF TESTS{I}=99 THEN TESTS{I}=.;
END;
RUN;
• The above program defines an ARRAY called TESTS with the five test scores as elements.
• The DO statement says to do the subsequent operation on the Ith element, as I goes from 1
to 5.
• When I=1, the statement acts as IF TEST1=99 THEN TEST1=.;
because TEST1 is the first-named variable on the array statement.
• When I=3 the statement acts as IF TEST3=99 THEN TEST3=.;
• Note that “I” will be included as a variable in the data set unless you specifically DROP it.
SAS permits you to use a shorthand listing for the variable name elements in the ARRAY statement.
In the example below, TEST1-TEST5 is the SAS shorthand notation for TEST1 TEST2 TEST3
TEST4 TEST5. In general, when variable names have the same prefix (in this example the prefix is
TEST) with sequential numbering, it is possible to name only the first and last variable, with a single
hyphen (no spaces) separating the names on any SAS statement, as in the example below.
*** program to recode missing, refused codes to SAS missing values***;

*** and compute an average of non-missing scores ***;
DATA SCORE2(DROP=I);
SET SCORE1;
ARRAY TESTS{5} TEST1-TEST5; ** define array TESTS with 5 elements;
DO I = 1 TO 5; ** start DO loop **;

IF TESTS{I}=99 THEN TESTS{I}=.; ** recode missing ;
ELSE IF TESTS{I}=88 THEN TESTS{I}=.R; ** and recode refusals;
END;
MTEST = MEAN(OF TEST1-TEST5); ** compute mean non-missing test scores;

RUN;
New variables can be defined by naming them in an ARRAY statement and then values may be
assigned in DO loop processing:
Consider a study of infant weight gain, where weights have been recorded, in ounces, at one-month
intervals from birth to 1 year. There are 13 measurements for each infant, WT0 being the birth
weight, WT1 the weight at 1 month, and so on to WT12 at 12 months. These weights are in the
dataset OLDWT.sas7bdat. However we are interested in doing the analysis on weight in grams.
Birth weight in grams can be computed as:
WTGM0 = WT0 * 28.4;
To convert each weight individually would require 13 separate statements like the one above. Or this
can be accomplished with the following, using ARRAY statements and DO loops.
** example defining new variables in array **;
LIBNAME SDAT 'C:\TEMP';
DATA SDAT.NEWWT(DROP=I); *drop I, variable for DO loop;
SET SDAT.OLDWT;
ARRAY OZ{13} WT0-WT12; *array weights in ounces;
ARRAY GM{13} WTGM0-WTGM12; *array weights in grams-no values yet;
DO I = 1 TO 13;
GM{I} = OZ{I} * 28.4; *compute wt in gm from wt in oz;
END;
RUN;
• The first array statement names an array called OZ with thirteen variables. These are found in
the old data set.
• The second array statement names the 13 new variables to be computed: WTGM0, WTGM1,
... WTGM12.
• Because WTGM0-WTGM12 do not exist in the old data set, SAS creates them as new numeric
variables; their values are missing (.) until they are assigned in the DO loop.
• The statement GM{I}=OZ{I}*28.4; says to compute the ith variable in the array GM as
28.4 times the ith variable in the array OZ.
• The dimensions (number of elements) of the arrays must match, and the variables in
the two arrays must be named in the same order. If the orders don’t match, you may be
computing (incorrectly) WTGM1 = WT2 * 28.4;
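One way to reduce the risk of a dimension mismatch is to let SAS count the array elements, writing {*} in place of the number and using the DIM function as the upper bound of the DO loop. The sketch below assumes a hypothetical data set WTDEMO with only three weights per infant; WTDEMO, WTDEMO2 and the values are made up for illustration.

```sas
DATA WTDEMO;                       * hypothetical stand-in with 3 weights in ounces;
  INPUT ID WT0 WT1 WT2;
  DATALINES;
1 112 144 176
2 120 . 190
;
RUN;
DATA WTDEMO2(DROP=I);
  SET WTDEMO;
  ARRAY OZ{*} WT0-WT2;             * SAS counts the elements in the list;
  ARRAY GM{*} WTGM0-WTGM2;         * new variables, missing until assigned;
  DO I = 1 TO DIM(OZ);             * DIM returns the number of elements in OZ;
    GM{I} = OZ{I} * 28.4;          * missing ounces give missing grams;
  END;
RUN;
PROC PRINT DATA=WTDEMO2;
RUN;
```

If more weights are added later, only the two variable lists need to change. The lists must still name the variables in matching order; DIM guards against a wrong loop bound, not against a wrong pairing.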
4. Illustration
This section illustrates the steps in developing a more complicated program that makes use of
ARRAY statements for repetitive processing of variables. The example also makes use of RETAIN
statements and FIRST. and LAST. variables.
Consider again the warm versus cold cardiopulmonary bypass study. In the pilot study, patients are
given a battery of psychological tests pre-operatively, post-operatively, and at a follow-up visit. At
each test period the same 16 tests designed to evaluate different aspects of cognitive processing
were given. The series of tests were time-consuming and tiring. Many patients refused to participate
at the follow-up or post-operative testing periods, or refused to complete the full battery of tests at one
or more of the testing periods. This meant that there were large amounts of missing data. It is
therefore particularly important that missing values are appropriately handled.
For ease of data entry, data were input with one record for each test period, so that there are up to
three records per patient. The variable PSTATUS identifies the patient status as 1 (pre), 2 (post) or 3
(follow-up). The structure of the data set is given below. The variable PATID is used as the patient
ID, and there are 16 variables to identify the test scores.
PSYCH TESTING DATA FROM WARM/COLD CPB STUDY

DATA WITH MULTIPLE OBSERVATIONS PER SUBJECT
----Variables Ordered by Position----
# Variable Type Len Pos Format Label

1 PSTATUS Num 8 4 1. PATIENT STATUS
2 PATID Num 8 12 6. PATIENT ID NUMBER
3 INF Num 8 20 2. INFORMATION-SCALED SCORE
4 PIC Num 8 28 2. PICTURES-SCALED SCORE
5 VOC Num 8 36 2. VOCABULARY-SCALED SCORE
6 BLK Num 8 44 2. BLOCK DESIGN-SCALED SCORE
7 OBJ Num 8 52 2. OBJECT ASSEMBLY-SCALED SCORE
8 SIM Num 8 60 2. SIMILARITIES-SCALED SCORE
9 IOQ1 Num 8 68 2. INFORM. & ORIENT. QUES.
10 LOGMEM1 Num 8 76 2. LOGICAL MEMORY 1
11 LOGMEM2 Num 8 84 2. LOGICAL MEMORY 2
12 VIS1 Num 8 92 2. VISUOSPATIAL 1
13 VIS2 Num 8 100 2. VISUOSPATIAL 2
14 STRPW Num 8 108 2. STROOP-WORDS
15 STRPC Num 8 116 2. STROOP-COLORS
16 STRPCW Num 8 124 2. STROOP-WORDS/COLORS
17 SDMT Num 8 132 2. SDMT
18 FAS Num 8 140 2. FAS
A listing of a couple of the psych testing variables for the first eight patients is given below.
MULTIPLE RECORDS PER SUBJECT
OBS  PATID  PSTATUS  INF  FAS
  1     28        1    7    .
  2     28        2    7    .
  3     65        1   10   39
  4     65        2    .    .
  5     65        3   12   48
  6     74        1    8   32
  7     74        2    8   33
  8     74        3    9    .
  9    144        1    6   28
 10    144        2    6    .
 11    192        1    6    6
 12    192        2    7    .
 13    192        3    .    .
 14    196        1    9    .
 15    196        2    8   17
 16    210        1    8    .
 17    245        1   11    .
 18    245        2   13   52
 19    245        3   12   55
Investigators were interested in which, if any, of the cognitive processes showed significant decreases
post-operatively compared to pre-operatively, and whether there was full recovery or
improvement by the follow-up period. To examine this, difference scores (post-op minus pre-op, and
follow-up minus pre-op) are needed for each of the 16 scores.
We want a final data set that will have one observation for each subject, with the pre, post, and
follow-up scores, along with the difference scores. The steps required in the program are:
Create one observation per subject with pre-op, post-op, follow-up and difference scores
1. Sort the data so that the observations for each subject are grouped (sort by PATID)
2. Read the data so that the start and end for each subject can be identified (use BY PATID; so
FIRST. and LAST. can be used)
3. Make sure that values read for pre, post and follow-up will be kept on a single new observation
for each subject (use RETAIN statement)
4. Identify the start of a new subject (use FIRST.PATID), and reset the retained values to missing
so that a value from the previous subject is not carried over when the new subject has no record
for that test period
5. Assign values to pre, post and follow-up scores, depending upon patient status (IF
PSTATUS=...)
6. After all data read for subject, compute difference scores, and output an observation (use
LAST.PATID)
First a program is developed that will do this for one variable, in this case INF, the information score.
We will examine in some detail the processing that goes on in this case. Then the program that will
process all 16 variables at once will be given.
*************************************************************;
** Program to create one observation per subject **;
** with pre, post, follow-up scores, pre-post, pre-foll **;
** difference scores for the variable INF **;
*************************************************************;
** sort input data, so BY can be used **;
PROC SORT DATA=PSYCH3;
BY PATID;
RUN;
** create new dataset, keep new variables and subject ID **;

DATA PSYCH4(KEEP=PATID INF1 INF2 INF3 INFPP INFFP);
SET PSYCH3;
BY PATID; * for identification of start and end of;
* subject with first. and last. ;
** identify values that will be retained **;

** as next record for subject is read **;
RETAIN INF1 INF2 INF3 .; * 1=pre, 2=post, 3=foll scores;
** identify start of new subject and set scores to missing **;

** before reading values for new subject **;
IF FIRST.PATID THEN DO;
INF1=.; INF2=.; INF3=.;
END;
** assign values to new vars based upon the patient status **;
IF PSTATUS=1 THEN INF1=INF; * find pre-op score;
ELSE IF PSTATUS=2 THEN INF2=INF; * find post-op score;
ELSE IF PSTATUS=3 THEN INF3=INF; * find follow-up score;
** identify end of records for subject, **;
** compute difference scores and output a record **;
IF LAST.PATID THEN DO;
INFPP = INF2 - INF1; * post - pre;
INFFP = INF3 - INF1; * foll - pre;
OUTPUT; * write obs to new data set;
END;
RUN;
Now, to understand the processing, let’s examine in some detail what happens when this program is
used. Suppose we begin with the data set below, with 2 subjects, x and y. x has 3 records, and y
has only 2. The values of the FIRST. and LAST. variables, available as the data is read in with the
BY PATID statement are also shown.
PATID PSTATUS INF FIRST.PATID LAST.PATID

x 1 6 1 0
x 2 5 0 0
x 3 8 0 1
y 1 7 1 0
y 2 4 0 1
Values of variables in the new data set are initially all missing, until an observation is read:
PATID INF1 INF2 INF3 INFPP INFFP
. . . . . .
As the first observation is read, missing values are replaced. In this case PSTATUS is 1 so the value
of INF1 is replaced:

x 6 . . . .
The value of INF1 is retained as the next observation is read from the input data set, and this time
INF2 is replaced:

x 6 5 . . .
INF2 and INF1 are retained as the next observation is read in, and INF3 is replaced:

x 6 5 8 . .
We have now reached an observation where LAST.PATID = 1, identifying the last observation for a
subject, so the next few statements are executed. The differences are computed, and the
observation is written to the new dataset:

x 6 5 8 -1 2
As the next observation is read from the input data set, the values of INF1, INF2, and INF3 are reset
to missing, since the FIRST.PATID=1 identifies the start of a new subject. We don't need to reset the
other variables – they are automatically reset to missing before a new observation is read from the
input data, since they were not named in the RETAIN statement:

x 6 5 8 -1 2
. . . . . .
will become:

x 6 5 8 -1 2
y 7 4 . -3 .
If we did not reset the values to missing at the start of the second subject, the value of INF3 would be
retained at 8 from the first subject, since there is no follow-up visit for the second subject to replace
this value.
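To see this carry-over directly, the sketch below runs the same logic on the two-subject example with the reset deliberately left out. The data set names RETDEMO and NORESET are hypothetical; the values are those of subjects x and y above.

```sas
DATA RETDEMO;                         * the x and y records from the trace above;
  INPUT PATID $ PSTATUS INF;
  DATALINES;
x 1 6
x 2 5
x 3 8
y 1 7
y 2 4
;
RUN;
DATA NORESET(KEEP=PATID INF1 INF2 INF3);
  SET RETDEMO;                        * already in PATID order;
  BY PATID;
  RETAIN INF1 INF2 INF3 .;
  * the reset at FIRST.PATID is deliberately omitted here;
  IF PSTATUS=1 THEN INF1=INF;
  ELSE IF PSTATUS=2 THEN INF2=INF;
  ELSE IF PSTATUS=3 THEN INF3=INF;
  IF LAST.PATID THEN OUTPUT;
RUN;
PROC PRINT DATA=NORESET;
RUN;
```

The observation for y shows INF3=8, a follow-up score that belongs to x. Restoring the IF FIRST.PATID THEN DO block makes INF3 missing for y, as in the table above.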
This program worked well for processing a single variable, but there are actually 16 cognitive test
variables that we want to process. For each of the original 16 scores, there will be 5 variables on the
output data set, or 80 new variables. The following program follows the same logic as the program
above, but makes use of arrays to process all 16 variables at one time.
* example program for all 16 variables, using arrays **;

LIBNAME C 'C:\temp';
** sort by subject id **;

PROC SORT DATA=C.PSYCH3;
BY PATID;
RUN;
** create dataset with one obs per subject **;

DATA C.PSYCH4(LABEL='PSYCH DIFFERENCE SCORES'
DROP=I PSTATUS INF--FAS);
SET C.PSYCH3;
BY PATID; ** allows use of LAST.PATID **;
LENGTH DEFAULT=3; ** set length for new vars **;
** name arrays, creating separate **;

** pre, post, and follow-up arrays **;
** and differences, pp=post-pre, fp=foll-pre **;
** new arrays list the tests in the same order as INF--FAS **;
ARRAY INIT{16} INF--FAS;
ARRAY PRE{16} INF1 PIC1 VOC1 BLK1 OBJ1 SIM1 IOQ11 LM1_1 LM2_1
VIS1_1 VIS2_1 STRPW1 STRPC1 STRPCW1 SDMT1 FAS1;
ARRAY POST{16} INF2 PIC2 VOC2 BLK2 OBJ2 SIM2 IOQ2 LM1_2 LM2_2
VIS1_2 VIS2_2 STRPW2 STRPC2 STRPCW2 SDMT2 FAS2;
ARRAY FOLL{16} INF3 PIC3 VOC3 BLK3 OBJ3 SIM3 IOQ3 LM1_3 LM2_3
VIS1_3 VIS2_3 STRPW3 STRPC3 STRPCW3 SDMT3 FAS3;
ARRAY PP{16} INFPP PICPP VOCPP BLKPP OBJPP SIMPP IOQPP LM1PP
LM2PP VIS1PP VIS2PP STRPWPP STRPCPP STRPCWPP
SDMTPP FASPP;
ARRAY FP{16} INFFP PICFP VOCFP BLKFP OBJFP SIMFP IOQFP LM1FP
LM2FP VIS1FP VIS2FP STRPWFP STRPCFP STRPCWFP
SDMTFP FASFP;
** use RETAIN statement so value of pre, post will be kept **;

RETAIN INF1--FAS3 .;
** reset pre,post,foll values to missing before each subject **;

IF FIRST.PATID THEN DO I=1 TO 16;
PRE{I}=.;
POST{I}=.;
FOLL{I}=.;
END;
** assign values to new variables **;
IF PSTATUS=1 THEN DO I=1 TO 16;
PRE{I} = INIT{I};
END;
ELSE IF PSTATUS=2 THEN DO I=1 TO 16;
POST{I} = INIT{I};
END;
ELSE IF PSTATUS=3 THEN DO I=1 TO 16;
FOLL{I} = INIT{I};
END;
** when all data for a subject has been read **;

** compute difference scores, and then **;
** output an observation for each subject **;
IF LAST.PATID THEN DO;

DO I=1 TO 16;
PP{I} = POST{I} - PRE{I};
FP{I} = FOLL{I} - PRE{I};
END;
OUTPUT;
END;
RUN;
PROC PRINT DATA=C.PSYCH4;
VAR PATID INF1 INF2 INF3 INFPP INFFP FAS1 FAS2 FAS3 FASPP
FASFP;
TITLE2 'DATA WITH ONE OBSERVATION PER SUBJECT';
RUN;
This produces the following output, shown for the same eight subjects and for the first and last test
variables (INF and FAS):
DATA WITH ONE OBSERVATION PER SUBJECT
OBS  PATID  INF1  INF2  INF3  INFPP  INFFP  FAS1  FAS2  FAS3  FASPP  FASFP
  1     28     7     7     .      0      .     .     .     .      .      .
  2     65    10     .    12      .      2    39     .    48      .      9
  3     74     8     8     9      0      1    32    33     .      1      .
  4    144     6     6     .      0      .    28     .     .      .      .
  5    192     6     7     .      1      .     6     .     .      .      .
  6    196     9     8     .     -1      .     .    17     .      .      .
  7    210     8     .     .      .      .     .     .     .      .      .
  8    245    11    13    12      2      1     .    52    55      .      .
• In this program, the initial 16 score variables are listed in an array called INIT.
• The new variables to be created are named in five arrays: PRE (pre-operative), POST
(post-operative), and FOLL (follow-up) scores, and PP (post - pre) and FP (follow-up - pre) differences.
• The logic of the program is the same as the earlier one, but at each step a DO loop is used to
process all 16 scores, rather than just one.
• Note also that the RETAIN statement uses a shorthand listing for the 48 pre, post, and follow-up
variables:
RETAIN INF1--FAS3 .;
A list of variables can be given by naming the first variable (here INF1), followed by 2 hyphens (--)
with no spaces, followed by the last variable (here FAS3). The list includes every variable positioned
between the two, so the variables must be in order by position. For new variables, position is the
order in which they are first named in the DATA step, which is why the ARRAY statements come
before the RETAIN statement. For variables in an input data set, this is where the POSITION option
of PROC CONTENTS comes in handy. All of the variables on the list should be of the same type
(here, numeric), since a single initial value is applied to all of them.
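The difference between a positional list (two hyphens) and a numbered list (one hyphen) can be checked with a sketch like the following. The data set LISTDEMO and its variables A1, B, A2 and A3 are hypothetical; VARNUM is the PROC CONTENTS option that lists variables in position order.

```sas
DATA LISTDEMO;                        * hypothetical: B sits between A1 and A2;
  INPUT A1 B A2 A3;
  DATALINES;
1 2 3 4
;
RUN;
PROC CONTENTS DATA=LISTDEMO VARNUM;   * list variables in position order;
RUN;
PROC PRINT DATA=LISTDEMO;
  VAR A1--A3;                         * positional list: A1 B A2 A3;
RUN;
PROC PRINT DATA=LISTDEMO;
  VAR A1-A3;                          * numbered list: A1 A2 A3;
RUN;
```

The first PROC PRINT includes B because B lies between A1 and A3 by position; the second does not, because a single hyphen selects only the sequentially numbered names.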