Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Erik Jørgensen
November 7, 2007
1 Introduction
The import of data into R is still a bit complicated if the original data are stored in the SAS-data format. Though
solutions have existed for a long time, see e.g. the R-documentation concerning Importing from other statistical
systems, it is complicated to use these facilities. They are based on functions which need to call the SAS-program
from R. The settings for doing this do not always correspond to the settings on the computers we use.
Instead we will focus on the export problem from the SAS point of view. We will present three different transfer
formats, and illustrate some of their benefits and short-comings.
As a start we use a short SAS program to generate a dataset that will illustrate our main points. As shown in Table
1 the dataset has a very long name, some of variable are long and they even share the initial character sequence.
In addition SAS formats (date and numeric) are specified for some of the variables. Both short and long labels are
specified.
data sasdata.RImportTest ;
do i=1 to 10 ;
x=i ;
y=i*i ;
if i>5 then y=. ;
Weightincrease=y-x ;
Weightincreasedate=mdy(11,i,2007) ;
Dayasnumber =Weightincreasedate;
output ;
end;
1
2 Reading SAS files
To read SAS-files in R it is necessary to store them in an intermediate file format. Unfortunately, this cannot be
done without some loss of information. The solutions presented here use three different intermediate formats.
Each solution has its own benefits.
1. Comma separated file (csv). This format can be read into R via the read.csv function
2. Excel file format (xls). This format can be read into R via the read.xls function in the xlsReadWrite
package.
3. SAS xport file (xpt). This format can be read into R via the read.xport function in the SASxport
package.
The most complicated part is to translate the SASs data format into the intermediate format. This should be
done within SAS, and thus requires a SAS license. The second major obstacle is to transfer the specialised SAS
information from the files, such as formatting (e.g. dates) and labels.
In the following the export to the three different formats is shown. It seems that three different methods have to be
used within SAS. We will present each of these in the following section.
Actually, we have defined some R-functions that specify the SAS-program and executes SAS as a batch command
automatically directly from R. The functions are presented in the appendix in section A.
This uses the PROC EXPORT in SAS. Actually, this is called by the Export data wizard of SAS. To use the wizard,
choose File|Export data from the main menu in SAS, and it guides you through the necessary specification.
You can choose to save the SAS-program during the process. Then you have a template that can be used later on.
The resulting lines are shown in Table 2.
The resulting rimporttest.csv file, may be read from the read.csv function in R.
In principle and according to the documentation it should be possible to us the method described in section 2.1, if
we specify DBMS=EXCEL in the PROC EXPORT. However, it does not seem to work, at least on my machine.
Therefore another method is employed. This method uses the new facility in SAS 9. Here you may specify the
database management system directly in the libname statement. The necessary program statements are shown
in Table 3.
2
Table 3: SAS-code for export to xls file
There are two irritating infelicities in this approach. Firstly, the output file may not exist before. If this is the case
the program ends with an error message. Therefore use a new name. Actually the filename used in Table 3 is
generated in the R-function readsas.xls (see appendix A) using a random number generator. Secondly, the
xls-file is locked by SAS until you release it and cannot be used by other programs. The easiest way of doing this,
is to exit SAS.
Finally, the xls dataset is read into R using the read.xls function in the package xlsReadWrite.
This is the most comprehensive transfer option. Originally it was used for copying SAS-files between SAS installed
on different operating systems. Unfortunately, the facility has not been upgraded since SAS version 6. Therefore
you need to downgrade your data set in two steps. First step is to reduce the name of the dataset to only 8 charac-
ters. You can do this using Windows explorer. E.g if your data is in the file longdatasetname.sas7bdat,
copy the file and rename the new file to shortdat.sas7bdat. Then you can address the dataset as e.g.
data a; set sasdata.shortdat; in SAS. You may also use this facility to copy the dataset, using the
following program lines:
data sasdata.shortdat; set sasdata.longdatasetname;.
Second step is to use proc copy to reduce the length of variable names and labels, as shown in Table 4.
You need initially to specify that you should restrict the valid names to those that were allowed in version 6
(options VALIDVARNAME=V6;). If the data statement is used for copying to a short dataset name, it is to
specify this option after the copying. If it is specified before the dataset with the short name is generated, the
copying is not allowed, because the long dataset name is not a valid name in version 6.
options VALIDVARNAME=V6;
libname source "C:/SASImport";
libname xportout xport "C:/SASImport/le26ca58.xpt" ;
proc copy in=source out=xportout memtype=data;
select le26ca58; run;
Again the strange naming of the dataset is because it is generated by the random generator in R-
Finally, the xpt dataset is read into R using the read.xport function in the package SASxport.
3
3 Reading the dataset
To illustrate the result of the data import, the result is presented of importing the data set via the three different
intermediate formats. The function read.sas in appendix A is used. However, the same result would have been
obtained by running the three different SAS-programs to generate to files in the three formats, and subsequently,
using either read.csv, read.xls or read.xport.
> library(xtable)
> library(SASxport)
> library(xlsReadWrite)
> datcsv <- readsas("rimporttest", via = "csv")
> datxls <- readsas("rimporttest", via = "xls")
> datxport <- readsas("rimporttest", via = "xport")
> (datcsv)
> str(datcsv)
The variable names are correct. The date variable is transferred as a character string and need to be translated into
the R date format, in a subsequent step. The labels of the variables are not transferred.
> (datxls)
4
i x y Weightincrease Weightincreasedate Dayasnumber
1 1 1 1 0 39387 17471
2 2 2 4 2 39388 17472
3 3 3 9 6 39389 17473
4 4 4 16 12 39390 17474
5 5 5 25 20 39391 17475
6 6 6 NA NA 39392 17476
7 7 7 NA NA 39393 17477
8 8 8 NA NA 39394 17478
9 9 9 NA NA 39395 17479
10 10 10 NA NA 39396 17480
> str(datxls)
In the xls format the variable names are also correct. The date variable is transferred as a integer value. The date
value corresponds to the representation in R’s Date representation, but you need to specify that in a subsequent
step. Note that the numeric representation of date values in SAS (variable Dayasnumber) differs from those in
R and Excel (variable Weightincreasedate). The labels of the variables are not transferred.
A note of caution: There are some limitations to the size of an excel dataset, both concerning number of variables
and number of observations. We have not yet tested whether this problem is circumvented because we do not call
excel explicitly.
In the xport format the variable names are no longer correct, but translated into upper case names with a length
of 8-characters. If the starting sequence is identical SAS uses digits to distinguish between the names. The date
variable is transferred as a integer value but with the Class Date. Thus they are printed in a readable format,
and you can se from data structure that it is a date value. Note that the numeric representation of date values in
SAS differs from those in R and Excel. Finally, the labels of the variables are transferred, even though the long
label is shortened to a length of 40 characters.
> (datxport)
5
7 7 7 NA NA 2007-11-07 17477
8 8 8 NA NA 2007-11-08 17478
9 9 9 NA NA 2007-11-09 17479
10 10 10 NA NA 2007-11-10 17480
> str(datxport)
> label(datxport$X)
> label(datxport$Y)
4 Summary
In most cases the use of the xls-format is probably the most convenient. If the dataset is documented with labels
the xport format may be the best, even though the variable names is mangled. Finally, the csv-format may be the
best for very large datasets that cannot be handled within the limitations of R.
A R-functions
In this section the R functions are presented. They may also be downloaded here.
6
A.1 readsas.r
"readsas" <-
function(dataname,libname='sasdata',dir=getwd(),via='csv'){
if (via=='csv') {
expdata<-readsas.csv(dataname,libname=libname,dir=dir)
} else {
if (via=='xls') {
expdata<-readsas.xls(dataname,libname=libname,dir=dir)
} else {
# via = xport assumed
expdata<-readsas.xport(dataname,libname=libname,dir=dir)
}}
expdata
}
A.2 readsas.csv.r
`readsas.csv` <-
function(dataname,libname='sasdata',dir=getwd()){
libnamestr<-paste('libname ',libname,' "',dir,'";',sep='')
exportstr<- paste('proc export data=',libname,'.',dataname,sep='')
outfilestr<-paste(' outfile= "',dataname,'.csv"',sep='')
dbmsstr<- ' DBMS=CSV REPLACE;'
sasprogname<-paste(dataname,"exportcsv.sas",sep="")
sasprog<-cat(libnamestr,exportstr,outfilestr,dbmsstr,
sep="\n",file=sasprogname)
ProgramPath<-Sys.getenv('ProgramFiles')
sascmd=paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="")
#sascmd='"C:\\Program Files\\SAS\\SAS 9.1\\sas.exe"'
system(paste(sascmd, sasprogname))
read.csv(paste(dataname,'.csv',sep=""))
}
A.3 readsas.xls.r
"readsas.xls" <-
function(dataname,libname='sasdata',dir=getwd()){
7
sasprogname <- paste(dataname,"exportxls.sas",sep="")
sasprog <-t’cat(libnamestr,outlibnamestr,datastepstr,
sep="\n",file=sasprogname)
ProgramPath <- Sys.getenv('ProgramFiles')
sascmd = paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="")
system(paste(sascmd, sasprogname))
read.xls(paste(newdataname,'.xls',sep=""))
}
A.4 readsas.xport.r
`readsas.xport` <-
function(dataname,libname='sasdata',dir=getwd(),filetype='sas7bdat'){
if (!file.exists(paste(dataname,'.',filetype,sep=''))) {
print('!warning filetype does not exist')}
#newdataname<-dataname
newdataname<-substr(tempfile(tmpdir=''),4,11)
if (nchar(newdataname)>8) {newdataname<-substr(newdataname,1,8)}
file.copy(paste(dataname,'.',filetype,sep=''),
paste(newdataname,'.',filetype,sep=''))
optionstr<-paste('options VALIDVARNAME=V6;')
libnamestr<-paste('libname source ',' "',dir,'";',sep='')
xportlibnamestr<-paste('libname xportout xport ','"',dir,"/",
newdataname,'.xpt" ;',sep='')
copystr<- paste('proc copy in=source out=xportout memtype=data;',sep='')
selectstr<-paste('select ',newdataname,'; run;',sep='')
sasprogname<-paste(dataname,"exportxport.sas",sep="")
sasprog<-cat(optionstr,libnamestr,xportlibnamestr,copystr,
selectstr,sep="\n",file=sasprogname)
ProgramPath<-Sys.getenv('ProgramFiles')
sascmd=paste('"',ProgramPath,'\\SAS\\SAS 9.1\\sas.exe"',sep="")
system(paste(sascmd, sasprogname))
file.copy(paste(newdataname,'.xpt',sep=''),paste(dataname,'.xpt',sep=''))
read.xport(paste(newdataname,'.xpt',sep=""))
}