
NESUG 2008 Programming Beyond the Basics

Effective Use of SQL in SAS Programming


Yi Zhao
Merck & Co. Inc., Upper Gwynedd, Pennsylvania

INTRODUCTION

Structured Query Language (SQL) is a data manipulation tool that many SAS®
programmers are either unaware of or not comfortable with. SQL can often accomplish the
same goal as a series of SAS data steps while using fewer lines of code and, in many
cases, running faster. This paper gives a brief introduction to relational databases and
SQL syntax, followed by a variety of tips on how to use SQL effectively in SAS programming.

RELATIONAL DATABASE

SQL is primarily designed as a programming language to work with relational databases.
Many of the features of SQL are directly related to database activities such as retrieving
data, updating or deleting data, and so on. In relational databases, relations (tables) are
associated with each other by primary keys and foreign keys. Primary keys identify each
row in a table uniquely, and foreign keys maintain the referential integrity of the
database. Primary keys and foreign keys play a role similar to that of the BY variables
in a SAS by-merge data step. A database schema describes the structure of the tables and
the relationships among them. SQL can query the schema to find the variables that tables
have in common and variable attributes such as data type and format. To make data
retrieval or updating more efficient, SQL can create and use a database index. Similarly,
SAS Proc SQL can create and store an index within a dataset, which helps when working
with large datasets. Views, or virtual tables, can be created in a database in the same
way as they are created in SAS Proc SQL. In summary, knowing the basics of SQL in
relational databases can help SAS programmers develop better SAS code.
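
As a minimal sketch of how the index, view, and schema ideas above appear in Proc SQL,
the following creates a small illustrative dataset, indexes it, defines a view on it, and
then looks up the column attributes. The names demo, subjid, and v_older are made up for
this example and do not come from any study database.

```sas
/* Hypothetical demo data; dataset and variable names are illustrative only */
data work.demo;
   input subjid age gender $;
   datalines;
101 34 F
102 51 M
103 47 F
;
run;

proc sql;
   /* A simple index must carry the name of the column it indexes */
   create index subjid on work.demo(subjid);

   /* A view stores the query, not the data */
   create view work.v_older as
      select subjid, age
         from work.demo
         where age > 40;

   /* Writes the table definition, including its indexes, to the SAS log */
   describe table work.demo;

   /* Column attributes taken from the DICTIONARY tables */
   select name, type, length, format
      from dictionary.columns
      where libname = 'WORK' and memname = 'DEMO';
quit;
```

The library and member names are written in uppercase in the last query because that is
how SAS stores them in DICTIONARY.COLUMNS.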

SQL BASICS

SQL (Structured Query Language), developed by IBM in the early 1970s, is a standard
interactive and programming language for querying and modifying data and for managing
databases. The basic syntax is shown in the following example:

Select d.treat_cd,
       count(distinct d.subjid) as n_subj,
       avg(a.exam_val) as mean_val   /* GROUP BY expects a summary function */
   From demos d,
        assy a
   Where d.subjid = a.subjid
   Group by d.treat_cd
   Order by d.treat_cd;

Although SQL is both an ANSI and an ISO standard, many database products, such as
Oracle, SQL Server, and MySQL, support SQL with proprietary extensions to the standard
language. Proc SQL is the SAS implementation of SQL. It adopts most of the standard SQL
features and adds SAS ingredients such as dataset options and SAS functions. As a
result, Proc SQL has the power of regular SQL plus many SAS-specific add-on features.

TERMINOLOGY

To help less-experienced SAS programmers better understand the different terms used by
database SQL programmers and SAS programmers, a comparison of these terms is
displayed in Table 1 below:

Table 1: Comparison of SQL Terminology

SAS Term          Database Term     SQL Term

Dataset           Relation          Table
Observation       Tuple             Row
Variable          Attribute         Column
Merge             Join              Join
Missing value     NULL              NULL

USE OF SQL IN SAS

SAS uses SQL in two different ways: the Where statement and Proc SQL. The Where
statement is one of the most commonly used SAS statements. Its concept and syntax,
however, were originally adopted from SQL, which is one example of how SAS imports and
mixes syntax from other languages.
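
To illustrate the shared syntax, here is a small sketch in which the same condition is
written once as a Where statement in a procedure and once as a Where clause in Proc SQL.
It uses SASHELP.CLASS, a sample dataset shipped with SAS, rather than any data from this
paper.

```sas
/* WHERE statement inside a procedure */
proc print data=sashelp.class;
   where age between 13 and 15 and name like 'J%';
run;

/* The same condition as a WHERE clause in Proc SQL */
proc sql;
   select name, sex, age
      from sashelp.class
      where age between 13 and 15 and name like 'J%';
quit;
```

SQL-style operators such as BETWEEN and LIKE are accepted in both places.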

Proc SQL is the main tool within SAS to use SQL. While Proc SQL is a SAS procedure,
it performs many functions similar to those found within SAS data steps. Often, for data
manipulation, data step or Proc SQL can be used either individually or interchangeably.
Four major areas which describe the effective use of SQL in SAS Proc SQL are outlined
in the following sections.

I. Access Relational Database

In SAS, there are two approaches to accessing relational databases. One is the LIBNAME
statement and the other is the pass-through facility. Below is an example of the
pass-through facility. The code reads a demographic table from an Oracle database through
an ODBC connection and outputs all allocated subjects.

Proc sql;
   connect to odbc (dsn=&dsn uid=&uid pwd=&pwd);
   create table demo as
      select *
         from connection to odbc
            (select distinct allocation_number subjid,
                    visit_number vt_num,
                    age
                from std_demos
                where allocation_number is not null
            );
   disconnect from odbc;
quit;

Programming Tips:

• For security reasons, get login credentials from interactive window input rather
than hard-coding them in the program.
• Do not use multiple joins to retrieve data in a single query; it is often more
efficient to use multiple CREATE TABLE statements.
• If possible, avoid ORDER BY to speed up execution.
• Use an index if one is available.
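
For comparison, the LIBNAME approach mentioned above might look like the following
sketch. It assumes that SAS/ACCESS Interface to ODBC is licensed and reuses the &dsn,
&uid, and &pwd macro variables from the pass-through example; the libref ora is a
hypothetical name.

```sas
/* Assign a libref to the database through ODBC (libref name is illustrative) */
libname ora odbc dsn=&dsn uid=&uid pwd=&pwd;

proc sql;
   create table demo as
      select distinct allocation_number as subjid,
             visit_number as vt_num,
             age
         from ora.std_demos
         where allocation_number is not null;
quit;

/* Release the connection when done */
libname ora clear;
```

With the LIBNAME approach the query is written in Proc SQL syntax and SAS decides how
much of it to pass to the database, whereas pass-through sends the inner query to the
database unchanged.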

II. Create Macro Variables Using the Into Clause

SAS programmers often use %LET or the CALL SYMPUT routine to create macro variables.
The following is an example:

Data _null_;
   if 0 then set dup nobs=obs;   /* NOBS= is set at compile time; no rows are read */
   call symput('totdup', compress(put(obs, best.)));
   stop;
run;

An alternative approach achieves the same result with the following SQL procedure:

Proc sql noprint;
   select count(*) into :totdup
      from dup;
quit;

The Into clause stores the value of one or more columns in macro variable(s) for use later
in another Proc SQL query or SAS statement. Below is an example:

Proc sql noprint;
   /* SEPARATED BY trims the leading blanks that INTO otherwise keeps */
   select count(distinct treat_cd) into :tot_trt separated by ' '
      from sero_all;
   select distinct treat_cd into :_trt1 - :_trt&tot_trt
      from sero_all;
quit;

The above code creates a macro variable &TOT_TRT to store the total number of
treatment groups, then creates macro variables &_TRT1, &_TRT2, … and stores the names
of the treatment groups in them. The total number of macro variables is determined by
the value in &TOT_TRT.

Following is another example using the automatic macro variable &SQLOBS:

Proc sql noprint;
   create table count_by as
      select distinct (&byvar) from datadir.&inds;
   select &byvar into :byv1 - :byv&sqlobs from count_by;
quit;

Programming Tips:

• &Sqlobs is an automatic macro variable in which SAS stores the number of rows
processed by the most recent Proc SQL statement. It is similar to the NOBS= option
of the SET statement in a data step.
• Use the Noprint option to prevent printing to the SAS listing output.
• There is no need to repeat Proc SQL for each SQL statement.
• Separate variables with a comma, not a space.
• Use Distinct to select unique observations.
• Use Quit, not Run, at the end.
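
The Into clause can also place a whole column into a single macro variable. The sketch
below is an illustration rather than code from the original programs: the dataset
trt_demo and the macro variable trt_list are made-up names, and SEPARATED BY is used to
build a quoted, comma-delimited list that a later step reuses in a Where statement.

```sas
/* Hypothetical data for illustration */
data trt_demo;
   input subjid treat_cd $;
   datalines;
1 A
2 B
3 A
4 C
;
run;

proc sql noprint;
   /* QUOTE() wraps each value in double quotes; SEPARATED BY joins the rows */
   select distinct quote(strip(treat_cd))
      into :trt_list separated by ', '
      from trt_demo
      where treat_cd ne 'C';
quit;

%put NOTE: treatment list is &trt_list;

/* Reuse the list in a later step */
proc print data=trt_demo;
   where treat_cd in (&trt_list);
run;
```

Building the list once and reusing it avoids hard-coding values that may change when the
data are refreshed.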

III. Merge (Join) Tables

The biggest advantage of a SQL join is that no sorting or renaming is needed, which is
especially useful when dealing with large datasets. The following is corresponding code
for a by-merge data step and a SQL join:

Merge (data step):

   Proc sort data = one;
      By subjid;

   Proc sort data = two (rename = (an_num = subjid));
      By subjid;

   Data three;
      Merge one two;
      By subjid;
   Run;

Join (Proc SQL):

   Proc sql;
      Create table three as
         Select *
            From one, two
            Where one.subjid = two.an_num;
   Quit;

There are two kinds of joins in SQL: inner join and outer join. An inner join returns a
result table for all the rows in a table that have one or more matching rows in the other
table(s). The example above is an implied inner join and can be re-written with specific
inner join key words as shown below:

Inner Join

Data step:

   Proc sort data = one;
      By subjid;

   Proc sort data = two (rename = (an_num = subjid));
      By subjid;

   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If a and b;
   Run;

Proc SQL:

   Proc sql;
      Create table three as
         Select *
            From one INNER JOIN two
               ON one.subjid = two.an_num;
   Quit;

Outer joins are inner joins augmented with the rows that did not match any row from the
other table in the join. The three types of outer joins are left, right, and full joins.
Below are examples of outer joins:

Left Join

Data step:

   Proc sort data = one;
      By subjid;

   Proc sort data = two (rename = (an_num = subjid));
      By subjid;

   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If a;
   Run;

Proc SQL:

   Proc sql;
      Create table three as
         Select *
            From one LEFT JOIN two
               ON one.subjid = two.an_num;
   Quit;

Right Join

Data step:

   Proc sort data = one;
      By subjid;

   Proc sort data = two (rename = (an_num = subjid));
      By subjid;

   Data three;
      Merge one(in=a) two(in=b);
      By subjid;
      If b;
   Run;

Proc SQL:

   Proc sql;
      Create table three as
         Select *
            From one RIGHT JOIN two
               ON one.subjid = two.an_num;
   Quit;

A full outer join, specified with the keywords FULL JOIN and ON, returns all the rows
from all the tables regardless of whether they match. The full outer join is rarely used in
the real world.
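
For completeness, a full outer join might be written as in the sketch below. The small
datasets are invented for illustration; they reuse the names one, two, subjid, and an_num
from the examples above, while the variables trt and score are hypothetical. COALESCE
takes the key from whichever table supplies it, so unmatched rows from either side keep
their identifier.

```sas
/* Hypothetical inputs: subject 1 is only in ONE, subject 4 is only in TWO */
data one;
   input subjid trt $;
   datalines;
1 A
2 B
3 A
;
run;

data two;
   input an_num score;
   datalines;
2 10
3 20
4 30
;
run;

proc sql;
   create table three as
      select coalesce(one.subjid, two.an_num) as subjid,
             one.trt,
             two.score
         from one FULL JOIN two
            on one.subjid = two.an_num
         order by subjid;
quit;
```

With these inputs the result has four rows, one for each of subjects 1 through 4: subject
1 has a missing score and subject 4 has a blank trt.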

IV. Transform Data

SQL is frequently used to create and rename variables and to order the columns of the
output. Suppose we have the following task at hand:
• Create a new variable new_v1 by concatenating v1 and v2
• Create a new variable new_v2 as the sum of v3
• Rename v4 and v5 as out4 and out5
• Output only new_v1, new_v2, out4, out5, and v3, in that order, in the output
dataset

Here is the code:

Proc sql;
   create table new as
      select v1 || v2 as new_v1, sum(v3) as new_v2, v4 as out4, v5 as out5, v3
         from old;
quit;

Programming Tips:

• SAS dataset options such as keep, drop, rename and SAS functions can be used
within Proc SQL. Here is an example:

%let label=This is the label;

Proc sql;
   create table one (label="&label" drop=subject_no center) as
      select *
         from tx t1,
              scores t2
         where t1.subject_no=input(substr(t2.subject_id,5),8.) and
               t1.center=input(substr(t2.subject_id,1,3),8.);
quit;

• Use a subquery or an in-line view.

A query-expression is called a subquery when it is used in a WHERE or HAVING clause,
where it is nested as part of another query-expression. An in-line view is a special
subquery used in the FROM clause. Subqueries can be used in situations such as
identifying those patients who are older than the average age of all patients and who
experienced an Adverse Event.

Select subjid, birth_dt, age, gender
   from std_demos
   where age >
         (select avg(age)
             from std_demos)
     and subjid in
         (select distinct subjid
             from std_ae)
   order by subjid;
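
The example above uses subqueries only. As a hedged sketch of the in-line view form, the
query below computes one row of summary statistics per treatment group inside the FROM
clause and joins it back to the detail rows. The dataset demo_iv and its variables are
invented for this illustration.

```sas
/* Hypothetical data: group A has mean age 40, group B has mean age 40 */
data demo_iv;
   input subjid treat_cd $ age;
   datalines;
1 A 30
2 A 50
3 B 40
4 B 60
5 B 20
;
run;

proc sql;
   create table above_avg as
      select d.subjid, d.treat_cd, d.age, s.mean_age
         from demo_iv as d,
              /* In-line view: a query used as a table in the FROM clause */
              (select treat_cd, avg(age) as mean_age
                  from demo_iv
                  group by treat_cd) as s
         where d.treat_cd = s.treat_cd
           and d.age > s.mean_age
         order by d.subjid;
quit;
```

With these rows, above_avg contains subjects 2 and 4, the only subjects older than the
mean age of their own treatment group.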

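• Use set operators such as UNION, INTERSECT, and EXCEPT to combine the results of
queries.
• Avoid unintended Cartesian products, which arise when tables are joined without a
WHERE or ON condition. This is the SQL counterpart of a SAS merge without by
variable(s), although the result pairs every row of one table with every row of the
other.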
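
A brief sketch of the three set operators follows. The datasets visit1 and visit2 are
hypothetical and hold the subjects seen at two visits; the output table names are also
made up for this example.

```sas
data visit1;
   input subjid;
   datalines;
1
2
3
;
run;

data visit2;
   input subjid;
   datalines;
2
3
4
;
run;

proc sql;
   /* Subjects seen at either visit; UNION removes duplicate rows */
   create table seen_either as
      select subjid from visit1
      union
      select subjid from visit2;

   /* Subjects seen at both visits */
   create table seen_both as
      select subjid from visit1
      intersect
      select subjid from visit2;

   /* Subjects seen at visit 1 but not at visit 2 */
   create table seen_v1_only as
      select subjid from visit1
      except
      select subjid from visit2;
quit;
```

Here seen_either holds subjects 1 to 4, seen_both holds subjects 2 and 3, and
seen_v1_only holds subject 1.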

CONCLUSION

• In certain cases, Proc SQL is more powerful and efficient than SAS data steps and
requires fewer lines of code.
• SQL is a basic tool for many job functions that involve working with databases.
Mastering SQL could lead to project (or job) opportunities and enhance career
growth.
• Proc SQL must be used wisely or it can become complicated and inefficient.
• In summary, Proc SQL is an excellent alternative to non-SQL Base SAS, making
it worth the programmer's time to explore its use.

REFERENCES

Feng, Ying. "Tips for Using SQL: When to Use and How?"
Proceedings of the 18th Annual NorthEast SAS Users Group Conference,
POS12, 2005.

SAS and all other SAS Institute Inc. product or service names are registered trademarks
or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA
registration. Other brand and product names are trademarks of their respective
companies.

AUTHOR CONTACT INFORMATION

Yi Zhao
Senior Scientific Programming Analyst
Merck Research Laboratories
UG1CD-38
PO Box 1000
North Wales, PA 19454
Phone: 267-305-7672
Email: yi_zhao@merck.com
