Unit 4
SAS for Data Management
Welcome.
SAS is an excellent software program for manipulating data! In this regard, it is a data
manager’s dream. This reading is a detailed introduction to a variety of SAS techniques for data
management, ranging from the simple definition of missing values to the alteration of the data
structure itself (e.g., creating multiple records from one, or creating one record from many).
1. To understand the rationale for and be competent in the creation of subsets (defined by variable
2. To understand the distinction between concatenating versus merging data sets and to be
3. To appreciate, especially, the use of the MERGE statement in the manipulation of relational
databases;
6. To appreciate the necessity (at times) of planning and outlining in advance the writing of SAS
code.
week 09 9.1
Week 9 SET, MERGE and Multiple Operations
a. How to Save Only a Subset of Variables for Current Use (KEEP, DROP)
b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)
c. How to Create New Variables – part 1
d. How to Concatenate Datasets
e. How to Use the IN instruction for identification and selection
f. How to Use BY, FIRST., and LAST. for identification and selection
g. How to Use the OUTPUT instruction to create several data sets at once
h. How to Create MULTIPLE records from one record
i. How to Use RETAIN to create ONE record from multiple records
4. Illustration
Now you have a SAS data set. It is in SAS format. A variety of techniques are available for the
There are three statements for managing SAS data sets within a DATA step: (1) SET, (2) MERGE,
In this reading, the SET and MERGE statements are introduced in detail. Not presented is the
UPDATE statement.
The SET statement can be used in a SAS DATA step to accomplish a variety of tasks, including (but
To illustrate the syntax for a SAS SET statement and some options for application, a series of
examples follows.
If the same name is used for the new data set as the old data set, the new data set will
overwrite the old version. The old data set is then gone!
1a. How to Save Only a Subset of Variables for Current Use (KEEP and DROP)
Rationale
• It is very often the case that specific analyses involve only a subset of the study variables. Less
storage and greater efficiency of the SAS program will be achieved if only the necessary
Example
• This example creates a working copy from a permanent data set, saving only a subset of the
• If you want to keep only a subset of the variables, you must use either the KEEP or DROP
option.
• Otherwise, the default is in place and all of the variables are kept.
• Choose KEEP or DROP according to which list of variables you would rather code explicitly.
KEEP (or DROP) can be used in 2 different places in a data step; the syntax differs slightly.
The option appears in parentheses ( ) after the name of the new data set, and KEEP= is
2. As a separate statement.
• When used as a separate statement in a data step, the word KEEP (or DROP) is followed by a
list of variables to be kept in (or dropped from) the new data set.
• This statement can appear anywhere after the SET statement as long as it appears before the
• TIP: Be careful of the placement of a KEEP statement. It must appear after any new variables
you create in the DATA step. If you use a RENAME statement to rename a variable, use the
OLD name on a KEEP statement, but the NEW name on a KEEP= dataset option.
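The two placements can be sketched as follows. This is a minimal illustration, assuming that SDAT.EXER1 contains the variables SID, SEX and HTIN; the output names SUB1 and SUB2 are hypothetical.

```sas
LIBNAME SDAT 'C:\TEMP';

* 1. KEEP= as a data set option on the new data set *;
DATA SUB1 (KEEP=SID SEX HTIN);
   SET SDAT.EXER1;
RUN;

* 2. KEEP as a separate statement inside the DATA step *;
DATA SUB2;
   SET SDAT.EXER1;
   KEEP SID SEX HTIN;
RUN;
```

Both steps should produce the same three-variable data set. DROP works the same way in either placement, except that the list names the variables to exclude.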
TIP: In general, in SAS: KEEP and DROP are terms used to describe actions on variables
1b. How to Save Only a Subset of Subjects for Current Use (IF, DELETE)
Rationale
• The rationale is similar to the one given above. Specifically, it is very often the case that specific
analyses involve only a subset of the study subjects. Less storage and greater efficiency of
the SAS program will be achieved if only the necessary subset subjects are used.
Example
This example creates a subset of the data using an IF statement. The new data set contains data on
females only; they are selected because they have the value 'F' for the variable
SEX.
Note: To refer to values of character variables, the values must be enclosed in quotes (single quotes are used here).
*** Create a data set called F_EXER. Include only females (SEX='F') ***;
LIBNAME SDAT 'C:\TEMP';
DATA F_EXER;
   SET SDAT.EXER1;
   IF SEX='F';   /* Include in new data set only IF SEX='F' */
RUN;
CAUTION:
• This instruction tells SAS to delete all observations with the value of ‘M’ for SEX.
• Therefore, observations of SEX with the value of missing are still retained.
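The difference between a subsetting IF and a DELETE can be seen in a small sketch. The data set DEMO and its values are made up for illustration; subject 3 has a missing value for SEX.

```sas
* Hypothetical data: SID 3 has a missing value for SEX *;
DATA DEMO;
   INPUT SID SEX $;
   DATALINES;
1 F
2 M
3 .
4 F
;
RUN;

* Subsetting IF keeps only the records with SEX equal to F (SID 1 and 4) *;
DATA F_ONLY;
   SET DEMO;
   IF SEX='F';
RUN;

* DELETE removes only the records with SEX equal to M, so SID 3 is kept *;
DATA NOT_M;
   SET DEMO;
   IF SEX='M' THEN DELETE;
RUN;
```

F_ONLY contains subjects 1 and 4, while NOT_M contains subjects 1, 3 and 4. If records with missing SEX are not wanted, the subsetting IF is the safer choice.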
Rationale
Tip: A suggested tip for data set archival and SAS programming is (1) retaining as a
permanent (“raw”) SAS data set, the data set that contains only the source variables and (2)
writing a stand alone program that has as its only tasks labeling, formatting, and the creation
of new variables. I find that this system makes it easier for me to retrace, debug, modify, and
Tip: Be careful about overwriting. While overwriting can save workspace in SAS since fewer data
sets will be kept during a work session – BE CAREFUL about overwriting. In most instances where
you create new variables or modify old ones, you may want to have the old data set around and you
may want to store a separate data set containing the new version of the data, to help keep track of
Example
This example adds a new variable, height in cm, to the data set.
*** Notice that a separate new variable is created and the old still
exists ***;
LIBNAME SDAT 'C:\TEMP';
DATA SDAT.EXER1;
SET SDAT.EXER1;
* CREATE HEIGHT IN CM from HTIN in inches *;
HTCM = HTIN * 2.54;
RUN;
• If you did not want to keep the variable HTIN in the new version of the data set, then the data
DATA SDAT.EXER1(DROP=HTIN);
• Actually, any variables to be dropped could be named in the DROP option, as described in an
earlier section.
• Further details of computing with SAS variables are covered in section 3 (“How to Create New
• To concatenate data sets, list the names of the data sets to be concatenated, one after another
• There is no restriction on the number of data sets you can name. For practical purposes, don’t
• If the variables are not in the same order in the 2 data sets, the order will be determined by the
first-named data set, with any new variables in subsequent data sets listed after all those from
Rationale
• It is sometimes of interest to pool “like” data sets into a single large data set.
• An example is pooling site specific data sets from a multi-center clinical trial into a single
Example
• This example concatenates two working data sets into a single permanent data set.
• When concatenating files, the number of observations in the new dataset is the sum of the
• All of the variables in the two data sets will be contained in the new one (unless a KEEP
• If the two original data sets do not contain the same variables, then observations from the first
data set will have missing values for the variables contributed by the second, and vice versa.
• The default missing values of a period '.' for numeric data, and a blank for character data will be
assigned.
• If the data set EXER1 contained the variables SID, AGE, HT and WT, while the data set EXER2
contained SID, HT, and WT then all records from EXER2 in the new data set EXER1_2 would
have a missing value for AGE.
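A minimal sketch of this situation follows. The values are made up, and the combined data set is written to the WORK library rather than to a permanent library to keep the sketch self-contained.

```sas
* Hypothetical working data sets, values made up for illustration *;
DATA EXER1;
   INPUT SID AGE HT WT;
   DATALINES;
101 34 65 150
102 41 70 180
;
RUN;

DATA EXER2;
   INPUT SID HT WT;
   DATALINES;
103 62 120
104 68 165
;
RUN;

* Concatenate by naming both data sets on one SET statement *;
DATA EXER1_2;
   SET EXER1 EXER2;
RUN;

PROC PRINT DATA=EXER1_2;
RUN;
```

EXER1_2 has 2 + 2 = 4 observations, with variables in the order SID, AGE, HT, WT. AGE is missing (.) for subjects 103 and 104 because EXER2 does not contain it.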
Rationale
When concatenating several data sets into one, you may want to keep track of an observation’s
origins. For example, in pooling the site specific data from a multi-center clinical trial, the investigator
• SAS “IN” variables are variables that SAS creates automatically within a data step and are
• The IN variables are indicator variables, taking the value of 1 if the data comes from the
• “IN” variables can be used to identify data sources. For example when concatenating data from
two hospitals, a new variable HOSPID can be created to identify the source hospital, by use of
• How to use IN variables: For each input data set named on a SET statement, an IN variable
can be named. The IN variable for a source data set is named by the phrase
Example
In this example two data sets, one from each of two hospitals, are pooled. The IN= data set option keeps
track of the source hospital data set for each observation. This, in turn, is used to create a HOSPID
• Data is read from the input data sets (HOSP1 and HOSP2) in the order in which they are listed.
As an observation is read from HOSP1, the variable INH1 is assigned the value 1, and the
variable INH2 is assigned the value 0. These variables are then used to define the values of
• The IN variables are not saved with the data set, and can only be used for the duration of
equivalent to saying IF INH2. The abbreviated form is permitted because indicator variables
defined as 0/1 are logical variables. A logical variable is one that has value 1 when the
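The use of IN variables to build HOSPID can be sketched as below. The names HOSP1, HOSP2, INH1, INH2 and HOSPID come from the text; the input values, the variables PATID and AGE, the output name BOTH and the codes 1 and 2 for HOSPID are assumptions made for illustration.

```sas
* Hypothetical hospital files, values made up for illustration *;
DATA HOSP1;
   INPUT PATID AGE;
   DATALINES;
1 54
2 61
;
RUN;

DATA HOSP2;
   INPUT PATID AGE;
   DATALINES;
1 47
2 70
;
RUN;

* Pool the two files and record the source of each observation *;
DATA BOTH;
   SET HOSP1 (IN=INH1)
       HOSP2 (IN=INH2);
   IF INH1 THEN HOSPID = 1;   /* same as IF INH1=1 */
   IF INH2 THEN HOSPID = 2;   /* same as IF INH2=1 */
RUN;
```

INH1 and INH2 are not written to BOTH; only HOSPID, which is defined from them, is saved.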
1f How to Use BY, FIRST., and LAST. for Identification and Selection
• Often, a data set will contain multiple records for each subject. In this situation, it may be of
interest to select one or more particular records for a given individual (e.g., the baseline record
• Using the BY, FIRST. and LAST. instructions, it is possible to select a subset of observations
• Consider a study where everyone should have 2 records – a record from a baseline visit and a
follow-up visit. We might want to find which subjects are missing one of the 2 visits. Such
subjects would contribute only one record in the data set. When the data have been sorted
by a variable, in this case the subject ID, then a SET statement followed by a BY statement
can be used. The BY statement names the variable used to group observations – in this
• When SET and BY statements are used together, SAS creates for you (automatically) two
• For each set of multiple records for a given individual, the variable FIRST.variablename has
the value=1 for the first occurrence of each value of variablename. It has the value 0 for all
• Similarly, for each set of multiple records for a given individual, the variable
LAST.variablename has the value=1 for the last occurrence of each value of variablename. It
has the value 0 for all other records for that individual.
there is only ONE record for that subject. That is, the observation is both the first and last
Example - Consider again the study where everyone should have 2 records – a record from a
baseline visit and a follow-up visit. As part of our data quality assessment, we want to find which
subjects contribute only one record instead of the intended two records. In this example, the data
have been sorted by the subject identification variable (SID). Once sorted, a SET statement followed
by a BY statement is used. The BY statement names the variable used to form the groups of
observations. In this example, we want to group the records for each individual subject (SID).
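A minimal sketch of this check follows. The data set names VISITS and ONEVISIT, the variable names VISIT and SCORE, and the values are hypothetical; only SID is taken from the text.

```sas
* Hypothetical data with one record per visit, in unsorted order *;
DATA VISITS;
   INPUT SID VISIT SCORE;
   DATALINES;
02 1 77
03 2 77
01 1 87
02 2 54
04 2 21
03 1 62
;
RUN;

PROC SORT DATA=VISITS;
   BY SID;
RUN;

* Keep the subjects who contribute only one record *;
DATA ONEVISIT;
   SET VISITS;
   BY SID;
   IF FIRST.SID=1 AND LAST.SID=1;
RUN;
```

ONEVISIT contains subjects 1 and 4, the two subjects with a single record. Because FIRST.SID and LAST.SID are 0/1 indicators, the condition can also be written more briefly as IF FIRST.SID AND LAST.SID.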
Note:
• Which would you rather have? Readability or brevity? It depends on how comfortable you are!
Let’s analyze how this works. Once the data were in sorted order by SID they would look like this:
01 1 87
02 1 77
02 2 54
03 1 62
03 2 77
04 2 21
• Once SET and BY statements are given to SAS, SAS then creates values for the
• The values of these variables are given below. FIRST.SID has the value 1 at the first
occurrence of each value of SID, and 0 otherwise. LAST.SID has the value 1 at the last
• When both FIRST.SID and LAST.SID are 1 this is the only occurrence of that value of SID. If
either FIRST.SID or LAST.SID is 0, there must be more than one occurrence of that value of
01 1 87 1 1
02 1 77 1 0
02 2 54 0 1
03 1 62 1 0
03 2 77 0 1
04 2 21 1 1
First. and Last. variables are useful to check for duplicate records that may have occurred
inadvertently in data entry. The following lines could be used to look for duplicates:
** check for duplicate records for each patient identified by PATID **;
PROC SORT DATA=D1;
BY PATID;
RUN;
DATA DUPS;
SET D1;
BY PATID;
IF FIRST.PATID=0 or LAST.PATID=0; *selects obs with repeat of PATID;
RUN;
1g How to Use the OUTPUT Statement to Create Several Data Sets at Once
Idea
• Provide the names of the multiple data sets in a single data step.
• Then, use one or more OUTPUT statements, each paired with a condition that must be
satisfied for the current record to be output ("written") to the data set named on that
OUTPUT statement.
Example
A data set contains 3 records for each subject such that each record corresponds to a different day.
Wanted instead are 3 separate data sets, where each data set corresponds to a given day.
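One way to do this is sketched below. The input data set ALLDAYS, the variables DAY and SCORE, and the output names DAY1 to DAY3 are all hypothetical.

```sas
* Hypothetical input: 3 records per subject, one for each day *;
DATA ALLDAYS;
   INPUT SID DAY SCORE;
   DATALINES;
1 1 10
1 2 12
1 3 15
2 1 8
2 2 9
2 3 11
;
RUN;

* Name all three output data sets on the DATA statement *;
DATA DAY1 DAY2 DAY3;
   SET ALLDAYS;
   IF DAY=1 THEN OUTPUT DAY1;
   ELSE IF DAY=2 THEN OUTPUT DAY2;
   ELSE IF DAY=3 THEN OUTPUT DAY3;
RUN;
```

Each of DAY1, DAY2 and DAY3 receives two observations, one per subject. A record whose DAY is not 1, 2 or 3 would not be written to any of the three data sets.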
Rationale
• This manipulation is of interest when repeated measurements are collected in one record and it
Example
• In this example, each subject has measurements (SCORE1 - SCORE3) from three occasions
• To make separate observations, 3 for each subject, the following statements could be used:
• The OUTPUT statement causes a new observation to be written at the point that it appears,
Thus, in this example, the OUTPUT statement appears three times and in such a manner as to
accomplish the creation of three observations for each one observation that is read in.
• If the OUTPUT statement had NOT been used, one observation would have been written to
the new data set for each one observation read in from the source data set.
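The three-OUTPUT pattern described here can be sketched as follows. SCORE1 to SCORE3 come from the text; the data set names WIDE and LONG, the new variables OCCASION and SCORE, and the values are assumptions.

```sas
* Hypothetical input: one record per subject with three scores *;
DATA WIDE;
   INPUT SID SCORE1 SCORE2 SCORE3;
   DATALINES;
1 10 12 15
2 8 9 11
;
RUN;

* Write three observations for each observation that is read *;
DATA LONG (KEEP=SID OCCASION SCORE);
   SET WIDE;
   OCCASION = 1;
   SCORE = SCORE1;
   OUTPUT;
   OCCASION = 2;
   SCORE = SCORE2;
   OUTPUT;
   OCCASION = 3;
   SCORE = SCORE3;
   OUTPUT;
RUN;
```

LONG has 6 observations, three for each of the two subjects, each holding the values of OCCASION and SCORE at the moment its OUTPUT statement ran.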
Rationale -
• Sometimes, it is the reverse that is desired. You start with one data set with several records
per subject, and you wish to have one record per subject.
• An example is when you wish to produce tables of descriptive statistics for the repeated
measures.
Example –
Beginning with the data set MULT that contains multiple records for each subject, you wish to
produce the data set TEST that contains just one record for each subject. This can be accomplished
by the following:
** use SORT to get all data from each subject together **;
PROC SORT DATA=MULT;
BY SID;
RUN;
** Now write data to output data set after last obs **;
** for the subject is read **;
IF LAST.SID THEN OUTPUT;
RUN;
• The RETAIN statement is a placeholder; it is used to retain or hold the value of the variable(s)
• Since three records from the input data set must be read to create one observation in the output
• Without the RETAIN statement the value of a variable is automatically reset to missing before a
new observation is read. (Try this example without the RETAIN statement to see the
difference.)
• An OUTPUT instruction is given ONLY AFTER all the observations from a subject have been
read.
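A complete sketch of the RETAIN pattern is given below, under stated assumptions: MULT holds three records per subject, identified by a hypothetical variable OCCASION, with a single measurement SCORE; the new variables SCORE1 to SCORE3 are also hypothetical names.

```sas
* Hypothetical MULT: one record per subject per occasion *;
DATA MULT;
   INPUT SID OCCASION SCORE;
   DATALINES;
1 1 10
1 2 12
1 3 15
2 1 8
2 2 9
2 3 11
;
RUN;

PROC SORT DATA=MULT;
   BY SID;
RUN;

DATA TEST (KEEP=SID SCORE1 SCORE2 SCORE3);
   SET MULT;
   BY SID;
   RETAIN SCORE1 SCORE2 SCORE3;
   * Start each subject with missing values so nothing carries over *;
   IF FIRST.SID THEN DO;
      SCORE1 = .;
      SCORE2 = .;
      SCORE3 = .;
   END;
   IF OCCASION = 1 THEN SCORE1 = SCORE;
   ELSE IF OCCASION = 2 THEN SCORE2 = SCORE;
   ELSE IF OCCASION = 3 THEN SCORE3 = SCORE;
   * Write one record after the last record for the subject is read *;
   IF LAST.SID THEN OUTPUT;
RUN;
```

TEST has one observation per subject. Resetting the retained variables at FIRST.SID is a precaution added in this sketch: without it, a subject with a missing occasion would inherit the previous subject's value.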
Rationale
• We have just seen how to use the SET statement to concatenate files. It was mentioned that in
concatenating files, subjects that appear in multiple files will contribute multiple records. Thus,
concatenation can be viewed as just “piling together” the separate sets of data. We may wish
• For example, in the use of relational databases, it is often of interest to join together data from
two (or more) files where the two (or more) are linked by one or more common variables.
• Consider a relational database that includes one data file containing baseline information, and
another data file with follow-up information on the same subjects. Alternatively, one file may
contain data from a questionnaire, and another file clinical data on the same patients.
• Thus, it is often of interest to merge files to produce a single output file that contains a single
• If a SET statement were used to concatenate the files, subjects that were included in each file
would appear twice – have 2 records – in the combined file, one contributed from each file. To
• To combine the two (or more) files we first identify a common (“linking”) variable in the files.
• The data must then be sorted (PROC SORT) within each file according to the common
(“linking”) variable.
• The data sets are then MERGED, where the linking variable is specified with a BY statement.
In the example that follows, the common variable is the study ID, SID.
• Inclusion of the BY statement naming the linking variable is crucial. Without this, the first
record from the first data set is matched with the first record from the second data set, and so
on, without regard to the subject. Warning - You will not see an error message if you
forget a BY statement – in fact the program will run just fine. Thus, if you find mismatches in
your data – check that the proper SORT and BY statements were used.
Example -
*************************************;
*** Example using MERGE statement ***;
*************************************;
LIBNAME SDAT 'C:\temp';
* sort (by linking variable SID) and print data from baseline visit *;
PROC SORT DATA=SDAT.EXER1;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER1;
TITLE1 'BASELINE DATA';
RUN;
* sort (by linking variable SID) and print data from follow-up visit;
PROC SORT DATA=SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXER2;
TITLE1 'FOLLOW-UP DATA';
RUN;
** merge files matching on SID **;
DATA SDAT.EXALL1;
MERGE SDAT.EXER1
SDAT.EXER2;
BY SID;
RUN;
PROC PRINT DATA=SDAT.EXALL1;
TITLE1 'MERGED DATA';
RUN;
BASELINE DATA
OBS SID VDATE1 DOB
1 101 06/12/90 01/27/84
2 102 06/12/90 10/11/87
3 103 06/13/90 05/09/87
FOLLOW-UP DATA
OBS SID VDATE2 DOB
1 101 08/15/90 01/27/84
2 102 08/15/90 11/10/87
3 104 08/17/90 04/22/85
MERGED DATA
OBS SID VDATE1 DOB VDATE2
1 101 06/12/90 01/27/84 08/15/90
2 102 06/12/90 11/10/87 08/15/90
3 103 06/13/90 05/09/87 .
4 104 . 04/22/85 08/17/90
• Notice that the SIDs represented in the baseline data are 101, 102 and 103, whereas the SIDs
represented in the follow-up data are 101, 102 and 104. The created (merged) data set has all
four SIDs represented: 101, 102, 103 and 104. However, there are some missing values.
• The combined file contains one record for each of the records from the individual files, where
records appearing in both files have been merged, and are recorded as a single record with a
complete set of variables from both of the original files. When a record does not have a match
in the other file, the values for variables from the other file are set to missing values.
• Yuck! DOB (date of birth) for SID=102 has different values in the baseline and follow-up data
sets.
• IMPORTANT NOTE - When merging files that contain the same variables (with the same
variable names), the value from the LAST-NAMED data set will be the value retained in
the merged data set. Thus, be careful of the order in which data sets are named on the
MERGE statement. In the above example, birthdate (DOB) is recorded at both visits with the
same variable name, and the value from the follow-up data file is retained (see SID 102, where
• TIP - When merging 2 files with the same variable names, careful planning is required to
a. If you have values for the same variable, for example weight measured at two
occasions, make sure the data does not have the same variable name in both data sets
b. If the variables have the same name, such as WT, and you wish to retain both weight
variables in the merged data (for example to compute weight loss or gain), you must
rename (at least) one of them in a DATA step prior to the merge. This can be done with
RENAME WT = WT1;
The advantage of using rename, rather than simply creating a new variable and dropping
the old one (WT1 = WT; DROP WT; ), is in retaining all the variable attributes (such
as length, formatting, labeling) that were previously set, when rename is used.
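A minimal sketch of renaming before a merge follows. The data sets BASE and FOLLOW, the merged data set BOTHWT, the names WT2 and WTCHANGE, and the values are hypothetical; WT and WT1 follow the text.

```sas
* Hypothetical baseline and follow-up files, both in SID order *;
DATA BASE;
   INPUT SID WT;
   DATALINES;
101 150
102 180
;
RUN;

DATA FOLLOW;
   INPUT SID WT;
   DATALINES;
101 145
102 184
;
RUN;

* Rename WT in each file in a DATA step prior to the merge *;
DATA BASE;
   SET BASE;
   RENAME WT = WT1;
RUN;

DATA FOLLOW;
   SET FOLLOW;
   RENAME WT = WT2;
RUN;

* Merge on SID and compute the change in weight *;
DATA BOTHWT;
   MERGE BASE FOLLOW;
   BY SID;
   WTCHANGE = WT2 - WT1;
RUN;
```

WTCHANGE is -5 for subject 101 and 4 for subject 102. The RENAME= data set option, as in MERGE BASE (RENAME=(WT=WT1)) FOLLOW (RENAME=(WT=WT2)), would do the same renaming without the two extra DATA steps.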
As with the SET statement, several variables can be automatically created with the MERGE
statement.
"IN" variables can be defined that indicate whether the current record was available in a particular
data set. FIRST. and LAST. variables can also be created that indicate the first and last occurrence
of each BY variable value. These variables can be used to specially tailor the resulting data set. The IN
variables and FIRST. or LAST. variables are available during a DATA step, but are not added to
the output data set. Their values can be saved to the output data set only by defining new
variables from them in the DATA step. The automatic variables are all indicator
variables. They have a value of "1" when the condition is true, and "0" otherwise. Note that FIRST.
and LAST. variables are created for each variable named in the BY statement that is used in the
DATA step.
• In the study of warm/cold cardiopulmonary bypass and neurologic function, baseline test data of
various sources was entered in several files for all individuals tested at baseline.
• However several subjects were later found to be ineligible for the study, due to complications
• Rather than try and delete each such case individually from each of the study files, it is simpler
to keep a master "inclusion" file. Only those subjects whose names and ID numbers are kept
in the latest version of the inclusion file, i.e., those deemed eligible, are used in data analyses.
* sort data from include file- This file contains the deemed eligible;
PROC SORT DATA=SDAT.INCLUDE1;
BY PATID;
RUN;
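The inclusion-file idea can be sketched with small WORK data sets. This is an illustration only: the test file NEURO, its variable TOTAL, the IN variable names ININC and INTEST, the output name NEURO_OK and all values are hypothetical, and WORK copies are used in place of the permanent SDAT files so that the sketch is self-contained.

```sas
* Hypothetical inclusion file listing the eligible patients *;
DATA INCLUDE1;
   INPUT PATID;
   DATALINES;
28
65
;
RUN;

* Hypothetical baseline test file, includes an ineligible patient (74) *;
DATA NEURO;
   INPUT PATID TOTAL;
   DATALINES;
28 7
65 10
74 8
;
RUN;

PROC SORT DATA=INCLUDE1;
   BY PATID;
RUN;

PROC SORT DATA=NEURO;
   BY PATID;
RUN;

* Keep only the records that appear in both files *;
DATA NEURO_OK;
   MERGE INCLUDE1 (IN=ININC)
         NEURO    (IN=INTEST);
   BY PATID;
   IF ININC AND INTEST;
RUN;
```

NEURO_OK contains patients 28 and 65. Patient 74 is dropped because ININC is 0 for that record. Updating the inclusion file and rerunning the merge is then enough to change who is analyzed.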
Example - Sometimes, combinations of subsetting, concatenating and merging files are used to
• In the cardiopulmonary bypass study, in order to summarize the pre-surgical test results, a
subset of variables (totals, rather than subscores) was selected from each data file.
• These were then merged with the "inclusion" file to keep only data for eligible patients.
• Summary baseline reports were based on final files created in this manner, for neurologic,
Very often your data as brought into SAS from data entry is “raw”. It is not in the final form necessary
for summarization and analysis (“analytic”). Typically, you will need to make subsets of the data and
merge files, as described above. You will also want to compute or create new variables. New
variables can be created in any data step, after the INPUT or SET or MERGE statement. However,
these instructions MUST be placed before the RUN statement (or before the CARDS statement for
instream data).
• adding (+)
• subtracting (-)
• multiplying (*)
• dividing (/)
• exponentiating (**)
existing variables by constants or by other variables. New variables can also be created using IF -
• When defining a new variable the name of the new variable is always given on the left
side of the equal sign, and the expression or value on the right side of the equal sign.
• The statement is evaluated for each observation, so that a new variable is created with a value
• Complex combinations of operations can be used following standard conventions for order of
operations. Use parentheses – ( ) – to be explicit about order of operations. Make sure that
every open parenthesis – ( – is paired with a close parenthesis – ). Some examples follow.
Examples -
• In the above examples, if any of the original variables (on the right side of the equation) has a
missing value, then the value of the new variable will be missing also, except for the last
Computation of new variables for each observation can also be done using SAS functions. The
functions handle missing data in a different way than standard computation, giving more options to
Rationale –
• Suppose you want to perform an operation or group of operations for some observations but
• A few examples have already been given in the sections on SET and MERGE statements.
• Note once again: to refer to specific values of character variables, the values must be
IF SEX = 'M' THEN DELETE; * deletes cases with value M for SEX;
IF AGE < 0 THEN DELETE; * deletes cases with missing or invalid AGE;
IF SEX = 'F' AND AGE > 0; * retains cases with valid AGE and SEX value F;
For example to create a variable AGEGR to represent age groups, the following set of statements
can be used:
Use of the ELSE IF, while not required, means that the data will be evaluated more efficiently, and
can save processing time in large data sets. Once a condition has been met for an observation, the
subsequent ELSE IF statements are not evaluated for that observation. For example, for an 18 year
old, the first statement is evaluated, the value ‘1’ assigned for AGEGR, and the subsequent
statements bypassed since the condition was met. For a 45 year old the first 3 statements are
evaluated. If you know that most of your sample falls into one group, putting that conditional
statement first will be the most efficient way to create a new variable. Be careful using ELSE IF
when your condition is based upon more than one variable – it can be tricky.
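An IF / ELSE IF sequence of this kind can be sketched as below. The cut points (20, 40, 60), the numeric coding of AGEGR and the data are assumptions chosen so that an 18 year old is handled by the first statement and a 45 year old by the third, as in the description above.

```sas
* Hypothetical data, SID 4 has a missing AGE *;
DATA AGES;
   INPUT SID AGE;
   DATALINES;
1 18
2 45
3 67
4 .
;
RUN;

DATA AGES2;
   SET AGES;
   IF 0 <= AGE < 20 THEN AGEGR = 1;
   ELSE IF 20 <= AGE < 40 THEN AGEGR = 2;
   ELSE IF 40 <= AGE < 60 THEN AGEGR = 3;
   ELSE IF AGE >= 60 THEN AGEGR = 4;
RUN;
```

AGEGR is 1, 3 and 4 for subjects 1 to 3. The lower bound in the first condition matters: SAS treats a missing numeric value as smaller than any number, so a plain AGE < 20 would place subject 4 in group 1. With the bounds shown, no condition is met for a missing AGE and AGEGR stays missing.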
Tip – Be careful: the length of a character variable is determined by the first value defined in
the IF-THEN sequence. If longer values are subsequently given, they will be truncated to the length
of the first value named. In the next example, LIGHT, MODERATE, and HEAVY smokers are defined
In this example, the value MODERATE would be truncated to 5 characters (MODER) to match the
length of LIGHT. To avoid this, the first named value should be padded with blanks to the length
of the longest value (8 characters for MODERATE), i.e.,
HSMOKE='LIGHT   ';
How to Ensure that Created Character Variables Have the Desired Length
Use a LENGTH statement before the variable is first defined to set its length. A LENGTH statement
that gives HSMOKE a length of at least 8, placed before the variable HSMOKE is defined, would
store all values without truncation.
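A minimal sketch of the LENGTH approach follows. The variable CIGS (cigarettes per day), the cut points and the data are hypothetical; HSMOKE and its three values come from the text.

```sas
* Hypothetical data: cigarettes smoked per day *;
DATA SMOKE;
   INPUT SID CIGS;
   DATALINES;
1 5
2 15
3 30
;
RUN;

DATA SMOKE2;
   SET SMOKE;
   * Set the length before HSMOKE is first assigned a value *;
   LENGTH HSMOKE $ 8;
   IF 0 < CIGS < 10 THEN HSMOKE = 'LIGHT';
   ELSE IF 10 <= CIGS < 20 THEN HSMOKE = 'MODERATE';
   ELSE IF CIGS >= 20 THEN HSMOKE = 'HEAVY';
RUN;
```

With the LENGTH statement in place, subject 2 is stored as MODERATE in full. Without it, the first assignment ('LIGHT') would fix the length at 5.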
A feature of SAS to be aware of when using character values is that many SAS procedures
will automatically re-order the data alphabetically when printing summary tables. This reordering is
often not appropriate for ordinal variables, such as HSMOKE, which would appear in the order
HEAVY, LIGHT, MODERATE in frequency tables and graphic displays. To avoid this, HSMOKE can
be defined as a numeric variable with codes 1,2, and 3, and then formats assigned to the codes.
In this program, the frequency table of HSMOKE would be ordered by its numeric value, but the
words LIGHT, MODERATE, and HEAVY would appear in the table.
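A program of this kind might look like the sketch below. The format name SMKFMT, the data set name SMOKE3 and the values are hypothetical; the coding 1, 2, 3 for LIGHT, MODERATE, HEAVY follows the text.

```sas
* Define the labels for the numeric codes *;
PROC FORMAT;
   VALUE SMKFMT 1 = 'LIGHT'
                2 = 'MODERATE'
                3 = 'HEAVY';
RUN;

* Hypothetical data with HSMOKE stored as a numeric code *;
DATA SMOKE3;
   INPUT SID HSMOKE;
   FORMAT HSMOKE SMKFMT.;
   DATALINES;
1 3
2 1
3 2
4 1
;
RUN;

PROC FREQ DATA=SMOKE3;
   TABLES HSMOKE;
RUN;
```

By default PROC FREQ orders the rows by the stored (unformatted) value, so the table lists LIGHT, MODERATE, HEAVY rather than the alphabetical HEAVY, LIGHT, MODERATE.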
Rationale –
• It might be that, when a condition is satisfied, you want more than one operation to occur.
We want to use the heavy/moderate/light smoking status at baseline as a predictor of QUIT status at
a 6-month follow-up survey. However, it is not appropriate to use the variable HSMOKE in an
analysis as defined above, since the 1-2-3 spacing implies an equal distance between light-moderate
smoking, and moderate-heavy smoking, which is unlikely to be reasonable. Instead, two indicator
variables are needed, which can be defined from the following table:
Light smokers have value 0 for both new variables MOD and HVY, MOD is an indicator of moderate
smoking, and HVY is an indicator of heavy smoking. The following statements would create these
variables, which could then be used in PROC LOGISTIC, a logistic regression procedure, or other
regression procedure.
• An END; statement is required to end the set of operations to be done when the condition is
met.
• Warning: it is easy to lose track of DO and END statements. An ‘END;’ is required for every
‘DO;’
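The indicator variables MOD and HVY can be created with DO groups as sketched below, assuming HSMOKE is coded 1 (light), 2 (moderate) and 3 (heavy). The data set names SMOKE4 and SMOKE5 and the values are hypothetical.

```sas
* Hypothetical data: one subject in each smoking category *;
DATA SMOKE4;
   INPUT SID HSMOKE;
   DATALINES;
1 1
2 2
3 3
;
RUN;

DATA SMOKE5;
   SET SMOKE4;
   IF HSMOKE = 1 THEN DO;
      MOD = 0;
      HVY = 0;
   END;
   ELSE IF HSMOKE = 2 THEN DO;
      MOD = 1;
      HVY = 0;
   END;
   ELSE IF HSMOKE = 3 THEN DO;
      MOD = 0;
      HVY = 1;
   END;
RUN;
```

Each DO is closed by its own END. If HSMOKE is missing, none of the three conditions is met and both MOD and HVY remain missing, which is usually what is wanted for a regression analysis.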
Rationale
• You want to perform the same operation (or same set of multiple operations) on each of many
variables. Why write this instruction over and over again for each variable? To avoid having to
repeat the same instructions for each variable, an ARRAY can be exploited.
Definition Array
• An array (or ordered listing of variables) is defined on an ARRAY statement by giving the array a
name, followed by the number of elements (variables) in curly brackets { }, followed by the list
of variables. The operation(s) to be carried out on the array elements are then defined in a DO
loop.
Suppose an input data set contains five test scores for each subject, with valid scores of 0 to 10,
while 99 represents a missing value. To convert these 99’s to SAS missing values for all test scores
• The above program defines an ARRAY called TESTS with the five test scores as elements.
• The DO statement says to do the subsequent operation on the Ith element, as I goes from 1
to 5.
• Note that “I” will be included as a variable in the data set unless you specifically DROP it.
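The recoding of 99 to missing can be sketched as follows. The array name TESTS and the index I come from the text; the variable names TEST1 to TEST5, the data set names and the values are assumptions.

```sas
* Hypothetical data: 99 is the code for a missing test score *;
DATA SCORES;
   INPUT SID TEST1 TEST2 TEST3 TEST4 TEST5;
   DATALINES;
1 7 99 10 3 8
2 99 5 6 99 0
;
RUN;

* DROP=I keeps the loop index out of the new data set *;
DATA SCORES2 (DROP=I);
   SET SCORES;
   ARRAY TESTS{5} TEST1 TEST2 TEST3 TEST4 TEST5;
   DO I = 1 TO 5;
      IF TESTS{I} = 99 THEN TESTS{I} = .;
   END;
RUN;
```

In SCORES2, TEST2 is missing for subject 1, and TEST1 and TEST4 are missing for subject 2. All other scores are unchanged.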
SAS permits you to use a shorthand listing for the variable name elements in the ARRAY statement.
In the example below, TEST1-TEST5 is the SAS shorthand notation for TEST1 TEST2 TEST3
TEST4 TEST5. In general, when variable names have the same prefix (in this example the prefix is
TEST) with sequential numbering, it is possible to name only the first and last variable, with a single
hyphen (no spaces) separating the names on any SAS statement, as in the example below.
New variables can be defined by naming them in an ARRAY statement and then values may be
Consider a study of infant weight gain, where weights have been recorded, in ounces, at one-month
intervals from birth to 1 year. There are 13 measurements for each infant, WT0 being the birth
weight, WT1 the weight at 1 month, and so on to WT12 at 12 months. These weights are in the
dataset OLDWT.sas7bdat. However, we are interested in doing the analysis on weight in grams.
To convert each weight individually would require 13 separate statements like the one above. Or this
can be accomplished with the following, using ARRAY statements and DO loops.
• The first array statement names an array called OZ with thirteen variables. These are found in
• The second array statement names the 13 new variables to be computed: WTGM0, WTGM1,
... WTGM12.
• The ‘.’ at the end of the array statement sets the initial value to missing (.) for all 13 new
variables.
• The statement GM{I}=OZ{I}*28.4; says to compute the ith variable in the array GM as
• The dimensions (number of elements) of the arrays must match, and the variables in
the two arrays must be named in the same order. If the orders don’t match, you may be
4. Illustration
This example shows the steps in developing a more complicated program that uses ARRAY statements
for repetitive processing of variables. This example also makes use of RETAIN statements and
Consider again the warm versus cold cardiopulmonary bypass study. In the pilot study, patients are
each test period the same 16 tests designed to evaluate different aspects of cognitive processing
were given. The series of tests were time-consuming and tiring. Many patients refused to participate
at the follow-up or post-operative testing periods, or refused to complete the full battery of tests at one
or more of the testing periods. This meant that there were large amounts of missing data. It is
For ease of data entry, data were input with one record for each test period, so that there are up to
three records per patient. The variable PSTATUS identifies the patient status as 1 (pre), 2 (post) or 3
(follow-up). The structure of the data set is given below. The variable PATID is used as the patient
A listing of a couple of the psych testing variables for the first eight patients is given below.
OBS  PATID  PSTATUS  INF  FAS
  1     28        1    7    .
  2     28        2    7    .
  3     65        1   10   39
  4     65        2    .    .
  5     65        3   12   48
  6     74        1    8   32
  7     74        2    8   33
  8     74        3    9    .
  9    144        1    6   28
 10    144        2    6    .
 11    192        1    6    6
 12    192        2    7    .
 13    192        3    .    .
 14    196        1    9    .
 15    196        2    8   17
 16    210        1    8    .
 17    245        1   11    .
 18    245        2   13   52
 19    245        3   12   55
Investigators were interested in which, if any, of the cognitive processes show significant decreases
improvement by the follow-up period. To do this, difference scores, post-op minus pre-op, and follow-
We want a final data set that will have one observation for each subject, with the pre, post, and
follow-up scores, along with the difference scores. The steps required in the program are:
Create one observation per subject with pre-op, post-op, follow-up and difference scores
1. Sort the data so that the observations for each subject are grouped (sort by PATID)
2. Read the data so that the start and end for each subject can be identified (use BY PATID; so
FIRST. and LAST. can be used)
3. Make sure that values read for pre, post and follow-up will be kept on single new observation
for each subject (use RETAIN statement)
4. Identify the start of a new subject (use FIRST.PATID), and reset the retained values to missing
so that a value from the previous subject is not carried over when the new subject has no record for that test period
5. Assign values to pre, post and follow-up scores, depending upon patient status (IF
PSTATUS=...)
6. After all data read for subject, compute difference scores, and output an observation (use
LAST.PATID)
First a program is developed that will do this for one variable, in this case INF, the information score.
We will examine in some detail the processing that goes on in this case. Then the program that will
*************************************************************;
** Program to create one observation per subject **;
** with pre, post, follow-up scores, pre-post, pre-foll **;
** difference scores for the variable INF **;
*************************************************************;
** sort input data, so BY can be used **;
PROC SORT DATA=PSYCH3;
BY PATID;
RUN;
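A minimal sketch of a DATA step that carries out steps 2 through 6 for INF is given here. The output data set name INFDIFF is an assumption; the variable names INF1, INF2, INF3, INFPP and INFFP are those that appear in the output listing later in this section:

```sas
DATA INFDIFF (KEEP=PATID INF1 INF2 INF3 INFPP INFFP);   /* hypothetical name */
  SET PSYCH3;
  BY PATID;                              /* makes FIRST.PATID and LAST.PATID available */
  RETAIN INF1 INF2 INF3 .;               /* keep values across records of a subject */

  IF FIRST.PATID THEN DO;                /* new subject: clear the retained values */
    INF1 = .; INF2 = .; INF3 = .;
  END;

  IF PSTATUS = 1 THEN INF1 = INF;        /* pre-op score */
  ELSE IF PSTATUS = 2 THEN INF2 = INF;   /* post-op score */
  ELSE IF PSTATUS = 3 THEN INF3 = INF;   /* follow-up score */

  IF LAST.PATID THEN DO;                 /* all records for this subject have been read */
    INFPP = INF2 - INF1;                 /* post-op minus pre-op */
    INFFP = INF3 - INF1;                 /* follow-up minus pre-op */
    OUTPUT;                              /* one observation per subject */
  END;
RUN;
```

Because the only OUTPUT statement sits inside the LAST.PATID block, nothing is written until the last record for a subject has been processed. A difference is missing whenever either of its two scores is missing.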
Now, to understand the processing, let’s examine in some detail what happens when this program is
used. Suppose we begin with the data set below, with 2 subjects, x and y. x has 3 records, and y
has only 2. The values of the FIRST. and LAST. variables, available as the data is read in with the
Values of variables in the new data set are initially all missing, until an observation is read:
As the first observation is read, missing values are replaced. In this case PSTATUS is 1 so the value
of INF1 is replaced:
The value of INF1 is retained as the next observation is read from the input data set, and this time
INF2 is replaced:
INF2 and INF1 are retained as the next observation is read in, and INF3 is replaced:
We have now reached an observation where LAST.PATID = 1, identifying the last observation for a
subject, so the next few statements are executed. The differences are computed, and the
As the next observation is read from the input data set, the values of INF1, INF2, and INF3 are reset
to missing, since FIRST.PATID=1 identifies the start of a new subject. We don't need to reset the
other variables. They are automatically reset to missing before a new observation is read from the
input data, since they were not named in the RETAIN statement:
will become:
If we did not reset the values to missing at the start of the second subject, the value of INF3 would be
retained at 8 from the first subject, since there is no follow-up visit for the second subject to replace
this value.
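One way to follow this kind of walkthrough on real data is to write the FIRST. and LAST. values to the SAS log as each record is read. The sketch below assumes the sorted data set PSYCH3 from the program above and creates no output data set:

```sas
DATA _NULL_;                             /* no data set is created */
  SET PSYCH3;
  BY PATID;
  /* one log line per input record, showing where each subject starts and ends */
  PUT _N_= PATID= PSTATUS= INF= FIRST.PATID= LAST.PATID=;
RUN;
```

FIRST.PATID and LAST.PATID are temporary variables. They are never saved to a data set, so printing them with PUT (or copying them into ordinary variables) is the only way to see their values.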
This program worked well for processing a single variable, but there are actually 16 cognitive test
variables that we want to process. For each of the original 16 scores, there will be 5 variables on the
output data set, or 80 new variables. The following program follows the same logic as the program
above, but makes use of arrays to process all 16 variables at one time.
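A cut-down sketch of the array version is given here, restricted to the two tests whose names appear in this section (INF and FAS). The array names INIT, PRE, POST, FOLL, PP and FP follow the description of the full program; the output data set name PSYCHDIF is an assumption. The full program would list all 16 test variables in each array and loop from 1 to 16:

```sas
DATA PSYCHDIF;                           /* hypothetical output data set */
  SET PSYCH3;
  BY PATID;
  ARRAY INIT{2} INF   FAS;               /* original scores */
  ARRAY PRE{2}  INF1  FAS1;              /* pre-operative */
  ARRAY POST{2} INF2  FAS2;              /* post-operative */
  ARRAY FOLL{2} INF3  FAS3;              /* follow-up */
  ARRAY PP{2}   INFPP FASPP;             /* post minus pre */
  ARRAY FP{2}   INFFP FASFP;             /* follow-up minus pre */
  RETAIN INF1 INF2 INF3 FAS1 FAS2 FAS3 .;

  IF FIRST.PATID THEN DO;                /* new subject: clear retained scores */
    DO I = 1 TO 2;
      PRE{I} = .; POST{I} = .; FOLL{I} = .;
    END;
  END;

  DO I = 1 TO 2;                         /* store scores by test period */
    IF PSTATUS = 1 THEN PRE{I} = INIT{I};
    ELSE IF PSTATUS = 2 THEN POST{I} = INIT{I};
    ELSE IF PSTATUS = 3 THEN FOLL{I} = INIT{I};
  END;

  IF LAST.PATID THEN DO;                 /* compute differences, write one record */
    DO I = 1 TO 2;
      PP{I} = POST{I} - PRE{I};
      FP{I} = FOLL{I} - PRE{I};
    END;
    OUTPUT;
  END;
  DROP I PSTATUS INF FAS;
RUN;
```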
The program creates the following output, shown here for the same eight subjects and the first two test variables:
OBS  PATID  INF1  INF2  INF3  INFPP  INFFP  FAS1  FAS2  FAS3  FASPP  FASFP
  1     28     7     7     .      0      .     .     .     .      .      .
  2     65    10     .    12      .      2    39     .    48      .      9
  3     74     8     8     9      0      1    32    33     .      1      .
  4    144     6     6     .      0      .    28     .     .      .      .
  5    192     6     7     .      1      .     6     .     .      .      .
  6    196     9     8     .     -1      .     .    17     .      .      .
  7    210     8     .     .      .      .     .     .     .      .      .
  8    245    11    13    12      2      1     .    52    55      .      .
• In this program, the initial 16 score variables are listed in an array called INIT.
• The new variables to be created are named in five arrays, for PRE operative, POST operative,
and FOLLow-up scores, and PP (post - pre) and FP (follow-up - pre) differences.
• The logic of the program is the same as the earlier one, but at each step a DO loop is used to
• Note also in the RETAIN statement a shorthand listing of the 16 psych test variables was used:
RETAIN INF1--FAS3 .;
A list of variables can be given by naming the first variable (here INF1) followed by two hyphens (--), no
spaces, followed by the last variable. The variables must be in order by position in the input data
set. This is where the POSITION option of PROC CONTENTS comes in handy. All of the variables