Module - I
1 Introduction
2 Libname statement
3 Clear Windows
4 Input types
5 Pointer controllers
6 Double Trailing @@
7 Trailing @
8 Length statement
9 Named Input
10 Numeric value
11 Delimiter
12 Embedded space
13 Firstobs, obs
14 SET statement
15 VAR statement
16 Options
17 Advanced list input method
18 Title Statement
19 Footnote Statement
20 Informat & Format
Module - II
20 Transformations
21 Data Cleaning
22 Procedure Options
23 Procedure Statements
24 Import Procedure
25 Export Procedure
26 Append data
27 Reports
28 Filter Transformation
29 Where Statement & Options
30 Where Options
31 Expression Transformation
32 IF Statement
33 Do block
34 Output Statement
35 Multiple Datasets
36 Loops
37 Data Conversions
38 Customized Reports
39 Backend Process/PDV
40 Duplicate Observations
41 Functions
42 Data step Functions
43 Arithmetic Functions
44 Aggregate Functions
45 String Functions
46 Date & Time Functions
47 Calendar Functions
48 Time Functions
49 Interval Functions
50 Errors
51 Data Management Process
52 Append Process
52 Concatenation Process
53 Interleaving Process
54 SCD Process
55 Modify transformation
56 Merge
57 Lookup process
58 Goto or link statement
The main streams of SAS are data warehousing, analytics, and data visualization.
Data warehousing (DW): Maintains the data in a meaningful, understandable format with
relationships. The DW process can be implemented using data warehousing concepts and ETL concepts.
SAS Functionality:
Raw data > Data step > SAS dataset > Proc step > Report/Information
Rules:
SAS statements: As with any language, there are a few rules to follow when writing SAS
programs. Fortunately for us, the rules for writing SAS programs are much fewer and simpler
than those for English.
Table Definition:
The table definition is a set of instructions that describes how to format the data. This
description includes but is not limited to:
• the order of the columns
• text and order of column headings
• formats for data
• font sizes and font faces
ODS destinations
An ODS destination specifies a specific type of output. ODS supports a number of
destinations, which include the following:
RTF
produces output that is formatted for use with Microsoft Word.
Output
produces a SAS data set.
Listing
produces traditional SAS output (monospace format).
Printer
produces output that is formatted for a high-resolution printer. An example of this type of output
is a PostScript file.
ODS output
ODS output consists of formatted output from any of the ODS destinations
Output object
ODS combines formatting instructions with the data to produce an output object. The output
object, therefore, contains both the results of the procedure or DATA step and information about
how to format the results. An output object has a name, a label, and a path.
Note: Although many output objects include formatting instructions, not all do. In
some cases the output object consists of only the data.
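As a minimal sketch of how an ODS destination is used, the step below opens the RTF destination, runs a procedure, and closes the destination again. The folder F:\SASFiles and the file name class_report.rtf are assumptions for illustration only.

```sas
/* Assumption: F:\SASFiles exists and is writable; the file name is hypothetical. */
ods rtf file="F:\SASFiles\class_report.rtf";

proc print data=sashelp.class;
run;

/* Closing the destination writes the file to disk */
ods rtf close;
```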
The values of numeric variables can contain only numbers. To store values that contain
alphabetic or special characters, you must create a character variable. You create a character
variable by following its name in an INPUT statement with a dollar sign ($). With list input, the
default length of both numeric and character variables is eight bytes. The following statement
creates one character variable and four numeric variables, all with a default length of eight
bytes.
input IdNumber Name $ Test_1 Test_2 Test_3;
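The INPUT statement above could be placed in a complete dataset block like the one sketched below. The dataset name test_scores and the data values are made up for illustration; the period in the second line stands for a missing numeric value.

```sas
data test_scores;
    /* Name is character ($); the other four variables are numeric */
    input IdNumber Name $ Test_1 Test_2 Test_3;
    datalines;
1001 Ann 78 85 90
1002 Bob 66 . 71
;
run;

/* PROC CONTENTS lists each variable with its type and length */
proc contents data=test_scores;
run;
```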
Libname statement
Working with user-defined libraries: These libraries are created and managed by the user.
They are of two types.
1. Independent libraries.
2. Dependent libraries.
✓ Independent libraries: SAS files (datasets) are not shared with other libraries.
✓ Dependent libraries: SAS files are shared with other libraries.
Both types can be created in temporary mode or permanent mode.
✓ Temporary mode library: Available only for one session or until the end of the SAS job.
✓ Permanent mode library: Available in any session until deleted by the user.
'libname' statement: Creates user-defined independent or dependent libraries in temporary
mode.
Syntax: libname libref 'path';
Ex: libname SASFiles 'F:\SASFiles';
libref: The name used to reference the library.
Naming rules:
✓ A libref can be up to 8 characters long. A longer name produces an error.
Engine: Determines the format in which the data in a dataset is stored. Engines are of two types.
1. Internal engines. 2. External (access or interface) engines.
1.Internal engines: V4, V5, V6, V7, V8, V9 - By default, SAS uses the internal engine that
matches its version.
Note: To create SAS files in an older format from a newer version of SAS, change the internal
engine.
2.External/access/interface engines: Used to create database libraries or external libraries
and to manage external processes from the SAS environment using SAS knowledge.
PC or server location (PC path or server path): Allocates the storage location for SAS files.
PC location: for independent mode.
Server location: for dependent mode.
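A short sketch of the libname statement in use is given below. It assumes that the folder F:\SASFiles from the example above already exists; the dataset name class_copy is hypothetical.

```sas
/* Assumption: the folder F:\SASFiles already exists */
libname sasfiles 'F:\SASFiles';

/* A two-level name (libref.dataset) stores the dataset in that folder */
data sasfiles.class_copy;
    set sashelp.class;
run;

/* Release the libref; the stored file stays on disk */
libname sasfiles clear;
```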
Input types
• List Input
• Column Input
• Named Input
• Formatted Input
List Input:
➢ List input: input Pid Name $ Team $ Stwt Endwt;
➢ Multiple List input:
➢ Ex: input Pid Name $;
input Team $;
input Stwt Endwt;
Column Input:
➢ Column input: input Pid 1-4 Name $ 6-23 Team $ 24-29 stwt 31-33 Endwt 35-37;
Pointer controllers:
pointers: @n - moves the pointer to the nth column in the input buffer.
+n - moves the pointer forward n columns in the input buffer.
/ - moves the pointer to the next line in the input buffer.
#n - moves the pointer to the nth line in the input buffer.
➢ Column pointer @n: input Pid @6 Name $ @13 Team $ @20 Stwt @24 Endwt;
➢ Line pointer /: input IdNumber 1-4 Name $ 6-19 / Team $ / Stwt Endwt;
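The two pointer styles above can be tried with the sketch below. The data values follow the weight club examples used elsewhere in these notes; the column positions of the first dataset are arranged so that each value starts at the column named in the INPUT statement.

```sas
/* Column pointer @n: each value starts at the given column */
data weight_club;
    input Pid @6 Name $ @13 Team $ @20 Stwt @24 Endwt;
    datalines;
1023 David  red    189 165
1049 Amelia yellow 145 124
;
run;

/* Line pointer /: one observation is spread over three data lines */
data weight_club2;
    input Pid Name $ / Team $ / Stwt Endwt;
    datalines;
1023 David
red
189 165
1049 Amelia
yellow
145 124
;
run;
```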
Double Trailing @@: A special symbol @@ that is used to hold a line of data in the input
buffer during multiple iterations of a DATA step.
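A minimal sketch of the double trailing @@, with made-up names and scores: each data line holds several observations, and @@ keeps the line in the input buffer until all of its values have been read.

```sas
data scores;
    /* @@ holds the current line across iterations of the dataset block */
    input name $ score @@;
    datalines;
Ann 90 Bob 85 Cid 77
Dee 88
;
run;

/* The result has four observations, one per name/score pair */
proc print data=scores;
run;
```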
Length: Defines a variable with a specific length. A character variable can be up to 32,767 bytes long.
The default character length is 8.
Syntax: length <variable name> <$> <length>;
Ex: length Name $ 12;
Named Input:
For named input, follow each variable name with an equal sign (=).
If the variable is character, follow the equal sign with a dollar sign ($). Each value in the raw
data is written as variable=value; a variable that is absent from a data line is set to missing.
Syntax: input <variable1>= <variable2>= … <variable n>=;
Ex: input name=$ english= math= science=;
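The named input statement above could be used as in the following sketch. The dataset name marks and the values are illustrative assumptions; the second data line supplies only some of the variables, and in a different order.

```sas
data marks;
    input name=$ english= math= science=;
    datalines;
name=Ann english=80 math=90 science=70
name=Bob science=65 math=75
;
run;

/* english is missing for Bob because his data line does not supply it */
proc print data=marks;
run;
```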
Format option:
Syntax: format <variable name> <format name><width>.;
Ex: format cardno 20.;
Delimiter:
In raw data, data values are separated by a special character; these special characters are called
delimiters. The default delimiter is a space.
DSD – sets the default delimiter to a comma
DLM – sets the delimiter to any specified character
➢ DSD Option: Treats the comma ( , ) as the delimiter in raw data, removes the quotes from data
values, and treats two consecutive delimiters as a missing value.
Syntax: infile <file specification> dsd;
Note: When the data lines themselves contain semicolons (;), use the DATALINES4 statement and
end the data with four semicolons (;;;;).
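The sketch below shows both options with in-stream data. The dataset names and values are invented for illustration. In the first step DSD strips the quotes and keeps the comma inside the quoted value; in the second step DLM names a pipe as the delimiter.

```sas
/* DSD: comma-delimited data, quoted values, consecutive commas = missing */
data fruit_dsd;
    infile datalines dsd;
    input name :$12. color $ qty;
    datalines;
"Apple, red",red,10
Banana,,5
;
run;

/* DLM: any other delimiter, here a pipe */
data fruit_dlm;
    infile datalines dlm='|';
    input name $ color $ qty;
    datalines;
Cherry|red|7
Lemon|yellow|3
;
run;
```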
SET statement:
Reads the data from one dataset into another, observation by observation. Used on its own, it
creates a copy of the dataset. Because observations are read one after the other, large datasets take more processing time.
Data emp2;
Set emp1;
Run;
VAR statement:
Specifies the variables whose values are reported, and the order in which they appear.
Runs in a procedure step.
Syntax: var <variable name> <variable name>;
Flowover: The default behavior of SAS. If the INPUT statement reaches the end of a line before
all variables have values, SAS continues reading on the next line.
Stopover: Stops the dataset block with an error when a line does not contain values for all variables.
Infile datalines: By default, the DATALINES statement behaves like infile datalines flowover dlm = ' ';
To read missing values:
Missover: Sets the remaining variables to missing when a line ends before all variables have
values, instead of moving to the next line. When to use the missover option: when values at the
end of a line are missing (data entry staff often do not enter a period for missing values at the
end of a line).
Scanover: Using the scanover option you can read part of the raw data in logical order using a
key data value. Implement the scanover technique with the column hold pointer @'key value',
which takes a key data value instead of a column number.
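The two options can be compared with the sketch below. All dataset names and values are hypothetical. In the first step, short lines leave the trailing variables missing; in the second step, SAS scans each line for the key value phone= and skips lines that do not contain it.

```sas
/* MISSOVER: values missing at the end of a line become missing values */
data visits;
    infile datalines missover;
    input pid visit1 visit2 visit3;
    datalines;
101 5 7 9
102 6
103 4 8
;
run;

/* SCANOVER: read only the value that follows the key text 'phone=' */
data phones;
    infile datalines scanover;
    input @'phone=' phone :$12.;
    datalines;
name=Ann phone=555-0101
no number on this line
name=Bob phone=555-0102
;
run;
```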
Note: Logical flow is of 2 types:
1. scanover (old technology)
2. conditions (if and where – new technology)
Note: For reporting purpose use label option, but for permanently renaming use rename option
ALTER option: only when you are creating a SAS data file.
✓ In order to copy an encrypted SAS data file, the output engine must support encryption.
Otherwise, the data file is not copied.
✓ Encrypted files work only in Release 6.11 or in later releases of SAS.
✓ You cannot encrypt SAS data views or stored programs because they contain no data.
✓ If the data file is encrypted, all associated indexes are also encrypted.
✓ Encryption requires roughly the same amount of CPU resources as compression.
✓ You cannot use PROC CPORT on encrypted SAS data files.
Global options
The global options are invoked by default whenever you open the SAS environment. They control
or instruct the SAS windows (except the editor window). The global options code is available in
the SAS configuration file. This file runs by default, in the back end, whenever you open the
SAS environment.
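Formatted Input:
Formatted input method: It works with two symbols:
+n – column pointer – skips n columns of data that are not required
n. – column width – reads n columns of required data (n = n2 – n1 + 1, where n1 and n2 are the first and last columns of the value)
Syntax: input +n <variablename> <datatype> n. ;
Ex: input +0 pid 3. +1 name $ 11. +2 age 2. +2 gender $ 6.;
1. & modifier: Allows a character value to contain single embedded blanks; the next value must be separated from it by at least two blanks.
Syntax: input name & $ 11. ;
Ex: input pid name & $ 11. age gender $;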
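A small sketch of the & modifier, using invented data: the name contains one embedded blank, and two blanks mark the end of the name. The dataset name people is an assumption.

```sas
data people;
    /* & lets name contain single blanks; two blanks end the value */
    input pid name & $11. age gender $;
    datalines;
101 Mary Ann  34 Female
102 Jo Lee  28 Male
;
run;

proc print data=people;
run;
```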
TITLE statement
Title: A title appears at the top of the report.
The TITLE statement produces a descriptive title, for example in a PROC PRINT step. A title
stays in effect until it is changed or cancelled.
The content of the report is very similar to the contents of the original data set.
Syntax: title "Write your title here";
Ex: title1 "write your title here";
title2 "write second title here";
data weight_club;
input Pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
/* A blank title replaces the current title with an empty line */
title ' ';
proc print data=weight_club;
run;
/* A TITLE statement without text cancels all titles */
title;
proc print data=weight_club;
run;
Title options add color and style to the title.
Syntax: title <options> "Write your title here";
1. Bold/Italic: Makes the font bold or italic
Bold/Italic
2. Color: Text color
Color=<color name>
3. BColor: Background color of the text
BColor=<color name>
4. Font: Typeface, e.g. Arial, Calibri, Algerian, Cambria
Font=<font name>
5. Justify: Justification of the text
Justify=C/L/R (center, left, right)
data weight_club;
input pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
Title bold color='yellow' bcolor='green' font='algeriaN' Underlin=1 'List Input Reported';
Title2 italic color='white' bcolor='orange' Underlin=3 font='Cambria' 'SAS System reads forward direction at input buffer';
Title3 italic color='Purple' bcolor='Pink' font='bauhaus 93' Underlin=2 'Data contains character and Numeric';
proc print data=weight_club;
run;
data weight_club;
input pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
Title bold color='yellow' bcolor='green' font='algeriaN' Underlin=1 Justify=L 'List Input Reported';
Title2 italic color='white' bcolor='orange' Underlin=3 font='Cambria' Justify=C 'SAS System reads forward direction at input buffer';
Title3 italic color='Purple' bcolor='Pink' font='bauhaus 93' Underlin=2 Justify=R 'Data contains character and Numeric';
proc print data=weight_club;
run;
FOOTNOTE statement
Footnote: A footnote appears at the bottom of the procedure output.
The FOOTNOTE statement produces a descriptive footnote, for example in a PROC PRINT step.
The content of the report is very similar to the contents of the original data set.
Syntax: footnote "Write your footnote here";
Ex: footnote1 "write your footnote here";
footnote2 "write second footnote here";
data weight_club;
input Pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
Footnote 'List Input Reported';
proc print data=weight_club;
run;
data weight_club;
input Pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
Footnote ' ';
proc print data=weight_club;
run;
Footnote options add color and style to the footnote.
Syntax: footnote <options> "Write your footnote here";
1. Bold/Italic: Makes the font bold or italic
Bold/Italic
2. Color: Text color
Color=<color name>
3. BColor: Background color of the text
BColor=<color name>
4. Font: Typeface, e.g. Arial, Calibri, Algerian, Cambria
Font=<font name>
5. Justify: Justification of the text
Justify=C/L/R
C = Center
L = Left
R = Right
6. Link: Specifies a hyperlink.
Link='<Url>'
7. Underlin: Specifies whether the subsequent text is underlined. 0 indicates no underlining;
1, 2, and 3 indicate underlining.
Underlin=0/1/2/3 (short form: U=0/1/2/3)
data weight_club;
input pid Name $ Team $ Stwt Endwt;
datalines;
1023 David red 189 165
1049 Amelia yellow 145 124
;
run;
Footnote bold color='yellow' bcolor='green' font='algeriaN' Underlin=1 Justify=L 'List Input Reported';
Footnote2 italic color='white' bcolor='orange' Underlin=3 font='Cambria' Justify=C
SAS date value: The number of days between the SAS reference date (01/01/1960, i.e.
01 Jan 1960 00:00:00) and a given date.
e.g. 01/01/1961 (non-standard) ---> informat technique ---> number (standard) ---> 366 (1960 is a leap year)
Syntax: informat <variable name> <informat technique>;
e.g. 01/01/1961 (non-standard) ---> informat technique ---> number (standard) ---> 366 --->
format technique ---> 01/01/1961 (non-standard)
Syntax: format <variable name> <format technique>;
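The round trip described above can be checked with the sketch below. The dataset and variable names are illustrative. The mmddyy10. informat converts the text into a SAS date value, and the mmddyy10. format displays that number as a date again.

```sas
data dates;
    /* Informat: text 01/01/1961 is stored as a number of days */
    input visit mmddyy10.;
    /* Copy without a format, to show the stored number */
    raw = visit;
    /* Format: display the number as a date again */
    format visit mmddyy10.;
    datalines;
01/01/1961
;
run;

/* visit prints as 01/01/1961, raw prints as 366 */
proc print data=dates;
run;
```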
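Percent value
Value   Informat                 Format
50%     percent3. (percentw.)    percent5. (percentw.)
In the format width, allow 3 positions in addition to the digits (the percent sign plus two
positions reserved for the parentheses that enclose negative values), so 20% needs a width of 5.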
Necessity of finding duplicates: To eliminate duplicate data and to ensure data accuracy.
DW → database → Table → RDBMS → Dimension table and Fact table.
In a data warehouse or database environment, the data is loaded into tables. These tables are
maintained by an RDBMS (relational database management system).
In this process, tables are of two types: dimension tables and fact tables.
Duplicate observation: The same observation repeated more than once in the data. Here the key
is usually made up of more than one variable, e.g. eid with month, pid with visit, etc.
→ Nodupkey option: Eliminates observations with duplicate values of the key variables or
sorting variables.
→ Nodup or noduprecs option: Eliminates observations that are exact duplicates across all
variables, for example from the transaction file.
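The difference between the two options can be seen in the sketch below. The dataset visits and its values are made up. Sorting by all variables in the second step keeps identical observations next to each other, which is what noduprecs compares.

```sas
data visits;
    input pid visit dose;
    datalines;
101 1 5
101 1 5
101 2 5
102 1 10
102 1 15
;
run;

/* nodupkey: one observation per pid/visit combination (3 observations) */
proc sort data=visits out=nodupkey_out nodupkey;
    by pid visit;
run;

/* noduprecs: removes only fully identical observations (4 observations) */
proc sort data=visits out=noduprecs_out noduprecs;
    by pid visit dose;
run;
```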
Print procedure: Generates output in the output window. These outputs are called listing
outputs.
SAS generates four types of reports.
1. Detail report (listing output).
2. Summarized report (table output).
3. Customized report (understandable format, i.e. titles, footnotes, etc.).
4. Statistical tables and graphs.
2.Double: Inserts a blank line between the observations in the report.
3.Heading: Prints the variable names in horizontal or vertical direction. Default value is
“horizontal”.
4.Width: Controls the width of the columns. The width value can be either minimum or full.
Default value is “minimum”.
2.Var statement: Specifies the variables whose values are reported, and the order in which they
appear. Runs in a procedure step.
Syntax: var <variable name> <variable name>;
3.Id statement: Prints the specified variable at the beginning of each row of the output instead
of the obs column.
Syntax: id <variable name> <variable name>;
4.Null Id statement: Works like the noobs option. The noobs option removes the obs column
whereas the null id statement replaces the obs column with a null column or blank spaces.
Syntax: id;
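A brief sketch that combines one option and two statements from the list above, using the sashelp.class sample dataset: double adds a blank line between observations in listing output, id replaces the obs column with name, and var selects the reported variables.

```sas
proc print data=sashelp.class double;
    id name;         /* name is printed first, instead of the obs column */
    var age height;  /* only these variables are reported, in this order */
run;
```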
To report the total no. of patients who received treatment - page wise
TASK:
Data values related to customer credit limit information.
1 2 3 4
Visa Jan 80 8000000
Master Jan 70 5000000
Visa Feb 85 9000000
Master Feb 72 6000000
Visa Mar 83 8500000
Master Mar 70 6000000
Visa Apr 82 8000000
Master Apr 84 7000000
Assignment:
a. To report the total credit limit.
b. To report the no. of customers who have taken Visa credit card and Master credit card.
c. To report the no. of customers who have taken different credit cards and their total credit
limit monthly wise.
d. To report the total credit limit credit card wise.
a.
proc print data=crdcust;
sum amount;
run;
b.
proc sort data=crdcust out=crdcust1;
by descending ctype;
run;
proc print data=crdcust1;
sum cust;
by descending ctype;
run;
c.
proc sort data=crdcust out=crdcust2;
by month;
run;
proc print data=crdcust2;
sum cust amount;
by month;
run;
d.
proc sort data=crdcust out=crdcust3;
by ctype;
run;
proc print data=crdcust3;
sum amount;
by ctype;
run;
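The solutions above read a dataset named crdcust. A possible way to create it from the task data is sketched below; run it before the solution steps. Mapping the four data columns to the names ctype, month, cust and amount is an assumption based on the variable names used in the solutions.

```sas
/* Assumption: columns 1-4 are card type, month, customers, credit limit */
data crdcust;
    input ctype $ month $ cust amount;
    datalines;
Visa Jan 80 8000000
Master Jan 70 5000000
Visa Feb 85 9000000
Master Feb 72 6000000
Visa Mar 83 8500000
Master Mar 70 6000000
Visa Apr 82 8000000
Master Apr 84 7000000
;
run;
```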
Import Procedure
Import: Using the infile statement you can read raw data from internal or external files into
SAS.
Filename statement: Assigns a file reference (fileref) to the path of an external file so that
SAS can access the file.
Syntax: filename <fileref> "<file path>";
Ex: filename Num "F:\SASFiles\Numbers1.txt";
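The fileref from the example can then be used in an infile statement, as in this sketch. It assumes that Numbers1.txt exists and holds one numeric value per line; the dataset and variable names are hypothetical.

```sas
/* Assumption: the file exists and contains one number per line */
filename Num "F:\SASFiles\Numbers1.txt";

data numbers;
    infile Num;     /* the fileref stands for the full path */
    input value;
run;
```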
Import procedure: To access the data from PC files to SAS via import procedure.
Syntax: proc import datafile= “<file path>”
out=<libref.dataset name> <dataset options>
dbms=<identifier> replace;
<statements>;
run;
Datafile: It indicates file location
Out: It indicates the output dataset name
Dbms: It indicates the database type
Replace: overwrite an existing SAS dataset.
Excel
Statements:
1.sheet: Using the sheet statement you can indicate the required sheet for importing.
Syntax: sheet = “<sheet-name>$”;
2.Getnames: It is a default statement for the import procedure and works with default value as
“yes”. Sometimes the raw data is available without variable names. In such cases you should call
the getnames statement with value as “no” otherwise SAS recognizes the starting row of the raw
data as variable names.
syntax: getnames=no;
3.Range Statement:
Using the range statement, you can access part of the data from excel sheet to SAS. To access
part of the data from excel to SAS using cell ranges.
Syntax: range = “<Sheet name>$<cell range>”;
Note:
1. In part of the data, if the data is available without variables you will use getnames statement
with “no” value.
2. If you use range statement no need to use sheet statement.
3. All the dataset options you can use in import procedure except firstobs obs.
To import part of the variables and change variable names in loading time
Ex: proc import datafile="F:\SASFiles\class.xls"
out=class1 (keep=patid age color rename=(patid=subid))
dbms=xls replace;
sheet="clsdata$";
run;
Note:
Using the import procedure, you can read an Excel file only one sheet at a time.
By default, the import procedure does not read mixed (character and numeric) data in a column,
because the mixed statement works with the default value “no”.
To import part of the variables from access table to SAS (*.mdb, *.accdb.. etc.,)
proc import table= <table name>
out=<dataset name> <options>
dbms=<identifier> replace;
database = ”<file path>”;
<statements>;
run;
Ex: proc import table=demo
out=clinical (keep=Sno pid age rename=(Sno=SerialNo))
dbms=access replace;
database="H:\Studies\SAS_Books\SASpath\source\Accs\CDM.mdb";
run;
Export Procedure
Export procedure: To export the data from SAS environment to external environment (PC).
Syntax: proc export outfile= “<file path>”
data=<libref.dataset name> <dataset options>
dbms=<identifier> replace;
<statements>;
run;
Statements:
putnames: A default statement of the export procedure, with the default value “yes”. It writes
the variable names as column names in the first row of the exported file. If the putnames
statement has the value “no”, the SAS variable names are skipped and the columns are left
unlabeled.
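A minimal sketch of the export procedure with the putnames statement is shown below. The output folder and file name are assumptions; sashelp.class is used only because it ships with SAS.

```sas
/* Assumption: F:\SASFiles exists; class.csv is a hypothetical file name */
proc export data=sashelp.class
    outfile="F:\SASFiles\class.csv"
    dbms=csv replace;
    putnames=yes;   /* first row of the file holds the variable names */
run;
```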
TXT File or delimiter file: Export data into txt files or delimiter files. This process can be done
using dataset block and export procedure.
Using Dataset block: This process can be called loading process or reporting process.
Loading Process: Upload the data into delimiter files without variable names or with variable
names.
Reporting process: Upload the data with variable names and adding titles, footnotes and some
other reporting specifications.
To load the data in delimiter (text) file using dataset block:
_null_: To run a group of statements in a dataset block without creating a dataset, use the
dataset name _null_.
File statement: Specifies the output file location.
Put statement: Writes data values or text to an external file (delimiter file) or to the log
window.
Dlm option: Specifies the delimiter for the external file.
Note: Data _null_ is one of the most important techniques for generating reports in the pharma
and financial domains.
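The statements above could be combined as in the following sketch, which writes a comma-delimited text file without creating a dataset. The file name class_dlm.txt is hypothetical.

```sas
/* Assumption: F:\SASFiles exists; the file name is hypothetical */
data _null_;
    set sashelp.class;
    file "F:\SASFiles\class_dlm.txt" dlm=',';
    put name sex age height;   /* values are separated by the dlm character */
run;
```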
To load the data in specific columns: @n option: It is a column hold pointer, n specifies
column number
syntax: data _null_;
set <lib ref>.<dataset name>;
file "<outfile path>" <file options>;
put @n1 <var 1> @n2 <var 2> …… @nn <var n>;
run;
Ex: data _null_;
set sashelp.class;
file "F:\SASFiles\class2.txt";
put @5 name @15 sex @20 age @25 height;
run;
syntax: data;
set <lib ref>.<dataset name>;
file “<outfile path>” <dataset options> <infile options>;
Excel
To export the data from SAS environment to external environment (PC).
MS Access
                        Export                         Import
S.No  File Type   Data-block  Procedure-block   Data-block  Procedure-block
1.    Text file   Yes         Yes               Yes         Yes
2.    Tab file    --          Yes               Yes         Yes
3.    Csv file    --          Yes               Yes         Yes
4.    Excel       --          Yes               --          Yes
5.    Access      --          Yes               --          Yes
TASK:
Raw data (patinfor.txt and researchinfor.xls)
Q1a.TXT (source system) (extraction) ---> SAS file (transformation) ---> SAS file (loading) --->
access file (target).
Q1b. XLS (source system) (extraction) ---> SAS file(transformation) ---> SAS file (loading) --->
access file (target).
/* Extraction */
data one;
infile "D:\SAS\text\patinfor.txt";
input pid age gender $ color $;
run;
Append
Appending process: Loading data into an existing file, or adding raw data to an existing file,
is called appending. Most databases support the appending process; Excel and Access do not
support appending from other databases.
Mod option: The mod option in the file statement appends data from SAS files to an existing
delimiter (text) file instead of overwriting it.
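The effect of the mod option can be sketched as follows. The first step creates the text file, and the second step adds more lines to the end of it. The file name is hypothetical and sashelp.class is used only as convenient sample data.

```sas
/* Step 1: create the file with the first three observations */
data _null_;
    set sashelp.class (obs=3);
    file "F:\SASFiles\class_append.txt";
    put name sex age;
run;

/* Step 2: mod appends observations 4 to 6 instead of overwriting */
data _null_;
    set sashelp.class (firstobs=4 obs=6);
    file "F:\SASFiles\class_append.txt" mod;
    put name sex age;
run;
```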
/* dataset - 1 */
data medi1;
input gid $ week $ drug $;
cards;
G100 week3 Col5mg
G200 week3 Col10mg
G300 week3 Col15mg
;
/* dataset - 2 */
data medi2;
input gid $ week $ drug $;
cards;
G100 week6 Col5mg
G200 week6 Col10mg
G300 week6 Col15mg
;
/* dataset - 1 export to txt file (without variable) */
data _null_;
Reports
Loading Environments (ETL): TXT, Excel, Access, Oracle, DB2 and 52 databases.
Reporting Environments SAS (OLAP): TXT, RTF, HTML, XML and PDF.
/* Reporting */
/* Upload title and variable names */
data _null_;
file "F:\SASFiles\report.rtf";
put @25 "LAB DATA";
put @22 "----------------------------";
put @20 "Name" @30 "sex" @40 "Age";
put @21 "-----" @30 "-----" @40 "-----";
run;
/* export variables values into text file */
data _null_
Customized reports: Reports generated in a TXT file or RTF file using data _null_.
To generate the reports in the output window using a dataset block:
Print option: Using the print option in the file statement (file print) you can generate the
reports in the output window.
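A short sketch of the print option: the report is written to the output window rather than to an external file. The column positions are arbitrary choices for illustration.

```sas
data _null_;
    set sashelp.class;
    file print;   /* send PUT output to the output window */
    /* Write the heading once, before the first observation */
    if _n_ = 1 then put @5 'Name' @20 'Sex' @30 'Age';
    put @5 name @20 sex @30 age;
run;
```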
Arithmetic operators:
Operator Meaning
+ Addition
- Subtraction
* Multiplication
/ Division
Comparison operators:
Logical operators:
Operator Meaning
& and
| or
Where Statement
Where statement: Using the where statement you can create subset of data for reporting
(temporary) or loading (permanent).
If you write where statement in procedure block it creates a subset of data for reporting. If you
write where statement in dataset block it creates a subset of data for loading.
/* For reporting */
proc print data=sashelp.class;
where sex='M';
run;
/* For loading */
data cls_M;
set sashelp.class;
where sex='M';
run;
proc print data=cls_M;
run;
/* Between operator: for loading */
data cls;
set sashelp.class;
where age between 12 and 13;
run;
/* Comparison operators: for reporting */
proc print data=cls_M;
where height >= 59 and height <= 69;
run;
like operator: Using the like operator you can run pattern matching as part of the checking process.
% - It indicates multiple characters
_ (underscore) – It indicates only one character
/* For report */
proc print data=cls_M;
where name like 'J%';
run;
/* In operator: for report */
proc print data=cls_M;
where age in (11 15);
run;
/* Not in operator: for report */
proc print data=cls_M;
where age not in (11 15);
run;
proc print data=sashelp.class;
where sex not in ('M');
run;
Note: The 'in' and 'not in' operators can be used with character or numeric data; the 'between'
operator is mostly used with numeric data.
Contains operator: Can be used only with character data. It checks whether the specified letter
or word occurs anywhere in the given value.
Logic error: If, at execution time, an observation does not satisfy the condition of a
conditional statement, the result is a logic error. Any condition-based application can
produce logic errors.
Errors are of 2 types:
1. Compilation errors
2. Execution errors
a. Data errors (raw data mistake)
b. Logic errors (conditional statement mistake)
Where Option
Where option: The where option is a dataset option. It creates a subset of data, temporarily
for reporting or permanently for loading.
Syntax: where=(<expression>);
/* For reporting */
proc print data=sashelp.class (where=(sex='M'));
run;
/* For loading */
data cls_M;
set sashelp.class (where=(sex='M'));
run;
proc print data=cls_M;
run;
Where statement & option: Major difference between where statement and where option:
Where statement you can use in dataset block and procedure block except SAS|Access
procedures. Where option you can use in SAS|Access procedures also. So, where option is more
Where statement: Split the data into multiple datasets depending on missing and nonmissing
values: eq represents '='.
data Missing;
set market;
where cname eq ' ' or area eq ' ' or invest eq . or sprice eq .;
run;
/* or can be written as: */
data Missing1;
Expression Transformation
Data processing flow:
Extraction → Validation → Cleaning → Manipulation → Managing → Maintenance
Data Manipulation: You can run data manipulation process using operators and functions
IF Statement
If then else block: To run a statement only when a condition is met, use the if then else
block.
syntax: if <condition> then <statement>;
else <statement>;
or
if <condition> then <statement>;
Data emp_salaries;
input eid $ salary sale;
cards;
E-234 2300 678
E-245 4500 456
E-256 8900 567
E-456 3000 400
E-890 4500 300
E-235 7800 580
E-267 5000 380
;
Do block: Runs multiple statements depending on a condition. A do block is typically used with
the 'if' statement.
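The if then else block and the do block can be combined as in the sketch below. The employee values come from the emp_salaries example; the threshold of 500, the 10% bonus, and the variable names bonus and grade are assumptions made for illustration.

```sas
data emp_bonus;
    input eid $ salary sale;
    /* A do block runs several statements under one condition */
    if sale >= 500 then do;
        bonus = salary * 0.10;
        grade = 'High';
    end;
    else do;
        bonus = 0;
        grade = 'Low';
    end;
    datalines;
E-234 2300 678
E-245 4500 456
E-256 8900 567
;
run;
```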
Create multiple variables depending on condition: with new variable: New salary
data Medi;
input Pctno $ Sno $ Gno $ Drug $ @@;
datalines;
pct_250 S001 G100 Asp_05mg pct_250 S001 G200 Asp_10mg
pct_250 S001 G300 Asp_15mg pct_250 S002 G100 Asp_05mg
pct_250 S002 G200 Asp_10mg pct_250 S002 G300 Asp_15mg
; run;
Data medi1;
set medi;
/* The drug values in the data use an underscore (Asp_05mg) */
if drug='Asp_05mg' then
date='15Jan2005'd;
else if drug='Asp_10mg' then
date='15Jan2005'd+6;
else date='15Jan2005'd+20;
format date date9.;
Run;
data demo;
Pid = 101; output demo;
Pid = 102; output demo;
Pid = 103; output;
run;
Split Transformation: Create multiple datasets by the dataset block.
Data Mh1 Mh2 Mh3;
set medi;
/* The drug values in the data use an underscore (Asp_05mg) */
if drug='Asp_05mg' then output Mh1;
else if drug='Asp_10mg' then output Mh2;
else output Mh3;
Run;
If statement: Using if statement you can create subset of data.
data red_team;
input Team $ 13-18 @;
if Team='red';
input IdNumber 1-4 StartWeight 20-22 EndWeight 24-26;
datalines;
Multiple Datasets
/* Set Statement */
data dta1 (keep=Name Age Weight)
dta2 (drop=Name Age Weight);
set sashelp.class;
run;
data dta3 (where=(age between 12 and 13))
dta4 (where=(Name like 'J%'));
set sashelp.class;
run;
SAS variables:
_all_: This specifies all types of data character and numeric.
_character_: Enables to specify all character variables.
_numeric_: Enables to specify all numeric variables.
/* Set Statement */
data dta3 (where =(age between 12 and 13))
dta4 (Keep = _character_);
set sashelp.class;
run;
data dta3 (keep = _all_)
/* Set Statement */
data dta5 (keep=Name Age Weight)
dta6 (drop=Name Age Weight)
dta7 (where=(age >12 and age <15))
dta8 (where=(Name like 'J%'));
set sashelp.class;
Run;
data dta9 (where=(Name contains 'e'))
dta10 (where=(Name like 'J%'));
set sashelp.class;
if Age=15 then newwight=weight+2;
else newwight=weight;
run;
Loops
Loop processing: Used to run a statement or a group of statements multiple times.
Do while: Runs the loop while the condition is true. The condition is checked at the top of the
loop, so the statements may never run.
Do until: Runs the loop until the condition becomes true. The condition is checked at the
bottom of the loop, so the statements run at least once.
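The difference between the two loops can be seen in the sketch below. The dataset names are hypothetical. In the second and third steps the start value already fails the while condition and already meets the until condition.

```sas
/* A counting loop: three observations with i = 1, 2, 3 */
data count_demo;
    i = 1;
    do while (i <= 3);
        output;
        i = i + 1;
    end;
run;

/* do while checks first: the body never runs, so no observations */
data while_demo;
    i = 5;
    do while (i < 3);
        output;
        i = i + 1;
    end;
run;

/* do until checks last: the body runs once, so one observation (i = 5) */
data until_demo;
    i = 5;
    do until (i >= 3);
        output;
        i = i + 1;
    end;
run;
```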
Data Conversions: char -> char, num -> num, char -> num, num -> char
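As a hedged sketch of the two cross-type conversions, the INPUT function is commonly used for char -> num and the PUT function for num -> char. The variable names and values below are illustrative only.

```sas
data convert;
    char_num = '123';
    num_val  = input(char_num, 8.);    /* char -> num, using an informat */
    num      = 45;
    char_val = put(num, 8.);           /* num -> char, right-aligned in 8 columns */
    char_trim = strip(put(num, 8.));   /* same, with the leading blanks removed */
run;

proc print data=convert;
run;
```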
Data cvr;
Input id gender age race $ color $;
Datalines;
Create a dataset with the contents information of class from the sashelp library.
Proc contents data=sashelp.class out=cls;
Run;
Customized reports
Reports can be generated in a TXT file or RTF file using data _null_.
To generate the reports in the output window using a dataset block:
Print option: Using the print option in the file statement (file print) you can generate the
reports in the output window.
Input stack: A logical memory unit and the default storage place of submitted applications.
Word scanner: An input layer between the input stack and the compilers; it controls the
tokenization process.
PDV (Program Data Vector): Holds observations one at a time while the dataset block runs, and
supports data error checking through two automatic variables, _N_ and _ERROR_. _N_ counts the
iterations of the dataset block. _ERROR_ takes these values:
1 = error in observation
0 = no error in observation.
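The two automatic variables can be observed in the log with a sketch like the one below. The data values are made up; the second line contains an invalid numeric value on purpose.

```sas
data check;
    input pid age;
    /* Write the automatic variables and the data values to the log */
    put _n_= _error_= pid= age=;
    datalines;
101 34
102 xx
;
run;
```

For the second data line SAS cannot read xx as a number, so age is set to missing and _ERROR_ is 1 for that observation.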
Dataset: To submit reports for below data contains Pid week drug
/* template */
data _null_;
file 'F:\SASFiles\dcf.txt';
put @10 'Dataset' @20 'Variable' @30 'Value' @40 'Obs';
run;
data dup1;
set dup;
by pid;
if first.pid=1;
run;
/* template */
data _null_;
set dup_obs;
by pid visit;
file 'F:\SASFiles\dcf.txt' mod;
if first.pid=0 and first.visit=0 then put @10 'dup_obs' @20 'pid visit' @30 _n_ @40 pid visit dose;
run;
Functions
6.Fact & gamma function: To compute the factorial of a number, use the DATA step function
FACT. For example, fact(6) computes 6!
Syntax: <new or existing var name> = fact(argument);
Alternatively, you can use the GAMMA function to obtain the factorial of a number. For positive
integers, GAMMA(X) is (X-1)!. For example, gamma(7) computes 6!
Syntax: <new or existing var name> = gamma(argument);
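Both ways of computing 6! can be checked with this small sketch; the dataset and variable names are arbitrary.

```sas
data factorial;
    f = fact(6);    /* 6! = 720 */
    g = gamma(7);   /* gamma(7) = (7-1)! = 720 */
run;

proc print data=factorial;
run;
```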
Apart from logarithms to base 10 which we saw in the last section, we can also have logarithms
to base e. These are called natural logarithms.
10.Mod function: Returns the remainder when the first argument is divided by the second
argument.
Prime numbers: Numbers that have exactly two factors (1 and the number itself) are called prime numbers.
Composite numbers: Numbers that have more than two factors are called composite numbers.
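The factor-counting definitions translate directly into a MOD test. A minimal sketch that classifies the numbers 2 to 20 (the range and the names are arbitrary choices):

```sas
data primes;
   length status $ 9;
   do n = 2 to 20;
      factors = 0;
      do d = 1 to n;
         /* d is a factor of n when the remainder is zero */
         if mod(n, d) = 0 then factors = factors + 1;
      end;
      if factors = 2 then status = 'prime';
      else status = 'composite';
      output;
   end;
   drop d;
run;

proc print data=primes;
run;
```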
Trimorphic numbers:
Trimorphic number is a number whose cube (expressed in a given base) ends in the number
itself. For example, 4^3 = 64, 24^3 = 13824.
data trimorphic;
length status $ 10;
i=10;
do while (i<=100);
j=i*i*i; /* cube of i */
k=mod(j, 100); /* last two digits of the cube */
if k=i then status='trimorphic';
else status='Not';
output;
i=i+1;
end;
run;
proc print data=trimorphic;
run;
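12.Lag function: Returns values from a queue. LAG (or LAG1) returns the value the argument had on
the previous call, and LAG2, LAG3, ..., LAGn return the value from n calls earlier. The first n calls
return a missing value because the queue is still empty.
Syntax: <new or existed var name> = lag (variable);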
The RETAIN statement prevents the DATA step from reinitializing CPILAG to a missing value
at the start of each iteration and thus allows CPILAG to retain the value of CPI assigned to it in
the last statement. The OUTPUT statement causes the output observation to contain values of the
variables before CPILAG is reassigned the current value of CPI in the last statement.
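The RETAIN approach described above can be compared with the LAG function in one step. This is a sketch under assumptions: the CPI values and the name cpilag_fn are illustrative, and the statement order follows the description (OUTPUT before CPILAG is refreshed).

```sas
data uscpi;
   input cpi;
datalines;
129.9
130.4
131.6
;
run;

data lagged;
   set uscpi;
   retain cpilag;          /* keep CPILAG across iterations */
   cpilag_fn = lag(cpi);   /* LAG function: previous value of CPI */
   output;                 /* write the row before CPILAG is refreshed */
   cpilag = cpi;           /* becomes the previous value for the next row */
run;

proc print data=lagged;
run;
```

Both cpilag and cpilag_fn are missing in the first row and carry the prior CPI value afterwards.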
Aggregate Functions
These functions perform row-wise analysis in a DATA step: each call summarizes the argument values within one observation.
Standard Deviation: To quantify the amount of variation or dispersion of a set of data values. A
low standard deviation indicates that the data points tend to be close to the mean (also called the
expected value) of the set, while a high standard deviation indicates that the data points are
spread out over a wider range of values.
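A minimal sketch of row-wise aggregate functions, using made-up marks for two students (all names and values are illustrative):

```sas
data marks;
   input id m1 m2 m3;
   total   = sum(m1, m2, m3);
   average = mean(of m1-m3);
   low     = min(of m1-m3);
   high    = max(of m1-m3);
   sd      = std(of m1-m3);    /* standard deviation across the three marks */
datalines;
1 60 70 80
2 55 55 55
;
run;

proc print data=marks;
run;
```

The second row has identical marks, so its standard deviation is 0; the first row is more spread out and gives a larger value.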
String Functions
1.Length function: It returns the length of the string (number of characters, including embedded
spaces but not trailing spaces).
Syntax: <New Var> = length ('value' /<var>);
2.Index function: It returns the position of a character or word in the string. If the character or word
is repeated multiple times, only the first occurrence is returned. If it is not available, 0 is returned.
Syntax: <New Var> = index(<var>, 'char/word');
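A minimal sketch of LENGTH and INDEX on a sample string (the string is an arbitrary choice):

```sas
data _null_;
   text = 'banana split';
   len  = length(text);           /* 12 */
   pos1 = index(text, 'an');      /* 2, only the first occurrence is reported */
   pos2 = index(text, 'split');   /* 8 */
   pos3 = index(text, 'x');       /* 0, not found */
   put len= pos1= pos2= pos3=;
run;
```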
Compress function: It removes the specified characters from a string.
Syntax: <New Var> = compress(<var> <,'chars'> <,'modifiers'>);
The following characters can be used as modifiers.
a - Adds alphabetic characters (upper and lower case) to the list of characters to remove.
d - Adds digits to the list of characters to remove.
i - Ignores case, so the specified characters are removed in both upper and lower case.
k - Keeps the listed characters in the string instead of removing them.
ka - Keeps only alphabetic characters and removes everything else.
kd - Keeps only digits and removes everything else.
l - Adds lowercase letters to the list of characters to remove.
p - Adds punctuation characters to the list of characters to remove.
s - Adds space characters (blank, tab and similar) to the list of characters to remove. With no second or third argument, COMPRESS removes blanks by default.
u - Adds uppercase letters to the list of characters to remove.
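A minimal sketch of a few COMPRESS calls on one sample string (the string and variable names are illustrative):

```sas
data _null_;
   s = 'Abc 123-xyz';
   no_blank   = compress(s);            /* Abc123-xyz */
   no_hyphen  = compress(s, '-');       /* Abc 123xyz */
   only_digit = compress(s, , 'kd');    /* 123 */
   only_alpha = compress(s, , 'ka');    /* Abcxyz */
   no_lower   = compress(s, , 'l');     /* A 123- */
   put no_blank= / no_hyphen= / only_digit= / only_alpha= / no_lower=;
run;
```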
data star;
input id name : $20.;
first=substr(name, 1, 5);
middle=substr(name, 6, 3);
last=substr(name, 9);
datalines;
1 abcdefghijklmnop
2 qrstuvwxyzabcdef
3 ghijklmnopqrstuv
4 wxyzabcdefmnopqu
run;
10.Trim function: Removes trailing blanks from a character value and returns the result.
Syntax: <New Var> = trim (<var> / 'character');
<New Var> = trim (Number); (a numeric argument is converted to character first)
2. Removing Blanks from the Search String: Remove spaces from end of the data values.
The LENGTH statement pads target with blanks to the length of 10, which causes the
TRANWRD function to search for the character string 'FISH ' in SALELIST. Because the search
fails, this line is written to the SAS log: CATFISH
You can use the TRIM function to exclude trailing blanks from a target or replacement variable.
Use the TRIM function with target:
salelist=tranwrd(salelist,trim(target), replacement);
put salelist;
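The effect described above can be reproduced in one step. This is a sketch under assumptions: the variable lengths and the replacement value 'NIP' are illustrative choices, not taken from the original notes.

```sas
data _null_;
   length target $ 10 replacement $ 3;
   salelist    = 'CATFISH';
   target      = 'FISH';        /* stored as 'FISH' plus six trailing blanks */
   replacement = 'NIP';

   /* the padded target is not found, so the string is unchanged */
   untrimmed = tranwrd(salelist, target, replacement);
   put untrimmed=;               /* CATFISH */

   /* TRIM drops the trailing blanks, so the search succeeds */
   trimmed = tranwrd(salelist, trim(target), replacement);
   put trimmed=;                 /* CATNIP */
run;
```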
3. Zero Length in the Third Argument of the TRANWRD Function: The results of the
TRANWRD function when the third argument, replacement, has a length of zero. In this case,
TRANWRD uses a single blank. In the DATA step, a character constant that consists of two
consecutive quotation marks represents a single blank, and not a zero-length string.
4. Removing Repeated commas: The TRANWRD function to remove repeated commas in text
and replace the repeated commas with a single comma. In the following example, the
TRANWRD function is used twice: to replace three commas with one comma, and to replace the
ending two commas with a period:
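A minimal sketch of the two TRANWRD calls; the sample sentence is an assumption chosen to contain one run of three commas and a trailing pair:

```sas
data _null_;
   mytxt    = 'If you exercise your power to vote,,,then your opinion will be heard,,';
   newtext  = tranwrd(mytxt, ',,,', ',');     /* three commas -> one comma */
   newtext2 = tranwrd(newtext, ',,', '.');    /* ending two commas -> period */
   put mytxt= / newtext= / newtext2=;
run;
```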
COMPBL Function
It compresses two or more consecutive blanks into a single blank.
COMPBL
Data char;
Input Name $ 1-50;
char1 = compbl(Name);
Cards;
Sandy David
Annie Watson
Hello ladies and gentlemen
Hi, I am good
;
Run;
STRIP Function
It removes leading and trailing spaces.
STRIP
Data char1;
Set char;
char1 = strip(Name);
run;
LEFT
Data char1;
Set char;
char1 = left(Name);
run;
Dataset
data columns;
input col1 & : $ 18. col2 & : $ 22. col3 & : $ 25.;
datalines;
The cat function concatenates character variables.
The cat function concatenates trimmed character variables.
;
run;
CAT
data columns1 (drop=col1 col2 col3);
set columns;
cat_all=cat(col1, col2, col3);
run;
To remove the leading and trailing spaces, we can make use of the CATT and CATS functions.
CATT:
The CATT function is like the CAT function. However, it removes the trailing spaces before
concatenating the variables.
CATT
data columns2 (drop=col1 col2 col3);
set columns;
cat_T=catt(col1, col2, col3);
run;
CATS
data columns3 (drop=col1 col2 col3);
set columns;
cat_S=cats(col1, col2, col3);
run;
CATX:
In addition to removing the leading and trailing spaces, the CATX function inserts a delimiter
between the character values when concatenating the variables.
The first parameter in the CATX function is the delimiter. In our example, the space is specified
as the delimiter.
CATX
data columns4 (drop=col1 col2 col3);
set columns;
cat_X=catx(' ', col1, col2, col3);
run;
PROPCASE
Converts each word so that its first letter is uppercase and the remaining letters are lowercase
(proper case).
PROPCASE
Data char;
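The original PROPCASE listing is cut off here. A minimal sketch of how the function is typically used, with made-up names as input:

```sas
data names;
   input name $ 1-30;
   proper = propcase(name);   /* 'sandy DAVID' becomes 'Sandy David' */
datalines;
sandy DAVID
ANNIE watson
;
run;

proc print data=names;
run;
```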
FIND
To locate a substring within a string.
Syntax: find(character-value, find-string <,'modifiers'> <,start>)
FIND
data _null_;
file print;
STRING1 = "Hello hello goodbye";
x=FIND(STRING1, "hello");
y=FIND("abcxyzabc","abc",4);
z=FIND("abcxyzabcrtsabc","ABC",4);
put x=;
put y=;
put z=;
run;
Validation-Process
Database Validation: Validating the table name, variable names, data type, labels, informats
and formats.
Data validation: To control incomplete records (missing values) and to handle duplicates and
invalid records.
Structure of the data validation: Validate the structure of data value (character data).
S001 P001 34 f As W 1
S001 p002 15 m Af b 2
s001 p003 26 F AS W 3
S001 p04 23 M AF B 4
/* Template prepare */
Data _null_;
File 'F:\SASFiles\invalid.txt';
put @5 'Dataset' @15 'Variable' @25 'Value' @35 'crfno' @45 'datacheck';
Run;
/* Upload Invalid data */
Data _null_;
Set valid (keep=pid crfno);
File 'F:\SASFiles\invalid.txt' mod;
If length (pid) ne 4 then put @5 'valid' @15 'pid' @25 pid @35 crfno @45
'pid must be in 4 digits';
Run;
Data _null_;
Set valid (keep=pid crfno);
File 'F:\SASFiles\invalid.txt' mod;
If substr (pid,1,1) ne 'P' then put @5 'valid' @15 'pid' @25 pid @35 crfno
@45 'pid starts with caps P';
Run;
Data _null_;
Time Functions:
1.Hour function: It returns the hour (0 to 23) from a SAS time or datetime value.
Syntax: <New Var> = hour(<var>);
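A minimal sketch of HOUR together with the related MINUTE and SECOND functions (the literals are arbitrary examples):

```sas
data _null_;
   t  = '14:35:20't;               /* SAS time literal */
   dt = '14NOV2017:09:05:00'dt;    /* SAS datetime literal */
   h1 = hour(t);      /* 14 */
   m1 = minute(t);    /* 35 */
   s1 = second(t);    /* 20 */
   h2 = hour(dt);     /* 9 */
   put h1= m1= s1= h2=;
run;
```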
Interval Functions:
1.Intck function: It returns the number of interval boundaries (for example days, months or
years) between two date values.
Syntax: <New Var> = intck('interval', start-date, end-date <, 'method'>);
The interval argument can take these forms:
1.Interval: Specify the name of the basic interval, ex: year, day, month
2.Multiple: Specifies an optional multiplier that sets the interval equal to a multiple of the period
of the basic interval type. For example, the interval YEAR2 consists of two-year, or biennial,
periods.
3.Custom: Specifies a user-defined interval that is defined by a SAS data set. Each observation
contains two variables, begin and end.
'Method':
1. CONTINUOUS: Specifies that continuous time is measured. The interval is shifted based on
the starting date.
'method' = 'C' or 'CONT'
2. DISCRETE: Specifies that discrete time is measured. The discrete method counts interval
boundaries (for example, end of month). This is the default.
'method' = 'D' or 'DISC'
data b;
startDay='14Nov2017'd;
Today=today();
Yearscmpt=INTCK('YEAR',startDay,today(),'C');
cmpthalf=INTCK('YEAR2',startDay,today(),'C');
cmpt=INTCK('month',startDay,today(),'d');
half=INTCK('month2',startDay,today(),'d');
format startDay Today date9.;
run;
proc print data=b;
run;
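data a;
interval='month';
start='14FEB2000'd;
end='13MAR2000'd; /* crosses 01MAR but is one day short of a full month */
months_default=intck(interval, start, end);
months_discrete=intck(interval, start, end,'d');
months_continuous=intck(interval, start, end,'c');
output;
end='14MAR2000'd; /* exactly one month after the start date */
months_default=intck(interval, start, end);
months_discrete=intck(interval, start, end,'d');
months_continuous=intck(interval, start, end,'c');
output;
start='31JAN2000'd;
end='01FEB2000'd; /* one day apart, but a month boundary is crossed */
months_default=intck(interval, start, end);
months_discrete=intck(interval, start, end,'d');
months_continuous=intck(interval, start, end,'c');
output;
format start end date9.;
run;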
TASK:
QS1: Capture operating system date and time
QS2: Make a rtf report of above information.
Example: proc print data=sashelp.air label;
where year(date)=1959;
run;
QS3: Report 3rd month information from Air dataset, library sashelp.
QS4: Report 1959 year, with 4th to 6th month information from Air dataset, library sashelp.
QS5: Report weekdays information from Air dataset, library sashelp.
proc print data=sashelp.air label;
where weekday(date)>=6 and weekday(date)<=7; *where weekday(date) in (6,7);
run;
QS6: Reformat the data by date and time functions. Library: sashelp, dataset: Air.
length weekdays $ 10;
if weekday(date)=1 then weekdays='Sunday';
2.INTNX function: The SAS interval functions INTNX and INTCK perform calculations with
date values, datetime values, and time intervals. They can be used for calendar calculations with
SAS date values to increment date values or datetime values by intervals and to count time
intervals between dates.
The INTNX function increments dates by intervals. INTNX computes the date or datetime of
the start of the interval a specified number of intervals from the interval that contains a given
date or datetime value.
Syntax: <New Var> = intnx('interval', start-value, n <, 'alignment'>);
For example, six weeks from the week of 17 October 1991. The function
INTNX('WEEK','17OCT91'D,6) returns the SAS date value '24NOV1991'D.
*Given that you know the first observation is for June 1990, use the INTNX function to compute
the ID variable DATE for each observation;
data uscpi;
input cpi;
date = intnx( 'month', '1jun1990'd, _n_-1);
* Thus _N_-1 is the increment needed from the first observation date;
format date monyy7.;
datalines;
129.9
130.4
25.6
35.7
86.7
run;
data uscpi;
input date : date9. cpi;
format date monthbeg midmonth monthend date9.;
monthbeg = intnx( 'month', date, 0, 'beg' ); *Using alignment 'beg';
midmonth = intnx( 'month', monthbeg, 0, 'mid' ); *Using alignment 'mid';
monthend = intnx( 'month', date, 0, 'end' ); *Using alignment 'end';
datalines;
15jun1990 129.9
15jul1990 130.4
run;
Errors
Errors occur at two stages:
1. Compile time
2. Execution time
Syntax Errors:
Forgetting the semicolon at the end of a statement.
Semantic Errors:
Passing the wrong number of arguments to a function.
Data Error: Passing a mismatched value or data type for a variable. For example, the variable
requires a numeric value but a character value is supplied.
Logic Error: Occurs in conditional statements. The syntax is right, but the condition itself
is wrong.
Adding process:
1. Appending
2. Concatenation
3. Interleaving
Appending Process: Appends the data from one dataset to another existing dataset.
Concatenation Process: Loads the data from multiple datasets into one new dataset, one after
another in sequential order.
Interleaving Process: Loads the data from multiple datasets into one new dataset in sorted
order.
Appending
data dm1;
input sno $ gno $ pid age gender $;
datalines;
S001 G100 101 23 F
S001 G100 102 34 M
S001 G100 103 25 M
S001 G100 104 21 F
run;
data dm2;
input sno $ gno $ pid age gender $;
datalines;
S002 G200 201 21 F
S002 G200 202 22 M
S002 G200 203 21 F
S002 G200 204 26 M
run;
proc append base=dm1 data=dm2;
run;
Appending
data dm1;
input sno $ gno $ pid age gender $;
datalines;
S001 G100 101 23 F
S001 G100 102 34 M
S001 G100 103 25 M
S001 G100 104 21 F
data dm3;
input sno $ gno $ pid age;
datalines;
S002 G200 201 21
S002 G200 202 22
S002 G200 203 21
S002 G200 204 26
run;
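Here dm3 has no gender variable. A minimal sketch of the append, assuming dm1 and dm3 exist as created above; PROC APPEND still adds the rows, writes a warning to the log, and leaves gender missing for the appended observations:

```sas
/* DATA= dataset has fewer variables than BASE= */
proc append base=dm1 data=dm3;
run;

proc print data=dm1;
run;
```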
Appending
data dm1;
input sno $ gno $ pid age gender $;
datalines;
S001 G100 101 23 F
S001 G100 102 34 M
S001 G100 103 25 M
S001 G100 104 21 F
run;
data dm4;
input sno $ gno $ pid age gender $ race $;
datalines;
S004 G400 401 21 F As
S004 G400 402 22 M Af
S004 G400 403 21 F As
run;
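Here dm4 carries an extra variable (race) that dm1 does not have. Without the FORCE option PROC APPEND refuses to append in this situation; with FORCE the rows are appended and the extra variable is dropped. A minimal sketch, assuming dm1 and dm4 exist as created above:

```sas
/* DATA= dataset has a variable that BASE= lacks: FORCE is required */
proc append base=dm1 data=dm4 force;
run;

proc print data=dm1;
run;
```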
Appending
data dm5;
input sno $ gno $ subid age gender;
datalines;
S005 G500 501 26 1
S005 G500 502 21 2
S005 G500 503 25 1
S005 G500 504 26 2
run;
/*reformat*/
data dm51;
set dm5;
if gender=1 then g='M';
else g='F';
drop gender;
rename g=gender;
run;
LOG Window
NOTE: Appending WORK.DM51 to WORK.DM5.
WARNING: Variable pid was not found on BASE file. The variable will not be added to
Appending
data emp1;
input bcode $ dptno $ eid salary;
datalines;
B100 D100 101 2300
B100 D200 102 3400
;
data emp2;
input bcode $ dptno $ eid salary;
datalines;
data employ;
set emp1 emp2 emp3;
run;
Interleaving Process: To load the data in descending order of salary.
Similar to the concatenation process, but each dataset must first be sorted (ascending or
descending) by the BY variable before it is combined.
data employ1;
set emp1 emp2 emp3;
by descending salary;
run;
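The step above only works when every input dataset is already sorted the same way as the BY statement. A self-contained sketch with two small datasets (the emp2 values are made up, and the dataset names are reused only for illustration):

```sas
data emp1;
   input bcode $ dptno $ eid salary;
datalines;
B100 D100 101 2300
B100 D200 102 3400
;
run;

data emp2;
   input bcode $ dptno $ eid salary;
datalines;
B200 D100 201 2800
B200 D200 202 4100
;
run;

/* each dataset must be sorted by the interleaving key first */
proc sort data=emp1; by descending salary; run;
proc sort data=emp2; by descending salary; run;

data employ1;
   set emp1 emp2;
   by descending salary;     /* rows come out as 4100, 3400, 2800, 2300 */
run;
```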
Update process and update transformation: To replace master file values with values from the
transaction file or dataset, using a matching variable.
This process is also called slowly changing dimension (SCD).
SCD1 SCD2 SCD3
SCD Process:
Syntax: data <master-dataset>/<new-dataset>;
update <master-dataset> <transaction-dataset>;
By <variable>;
Run;
Appending
/* Master */
data emp_salaries;
input eid salary;
datalines;
101 5400
102 4500
103 3000
104 2300
105 2700
;
/* Transaction */
data emp_newsal;
input eid salary;
datalines;
101 6000
103 5000
105 4000
;
data emp_salaries;
update emp_salaries emp_newsal;
by eid;
run;
Problems:
Missing values in the transaction file.
By default, a missing value in the transaction file does not replace the corresponding value in
the master file. This behaviour is controlled by the UPDATEMODE option.
updatemode option: missingcheck (default) / nomissingcheck
Syntax: update <master-dataset> <transaction-dataset> updatemode=nomissingcheck;
Appending
/* Master */
data emp_salaries;
input eid salary;
datalines;
101 5400
102 4500
103 3000
104 2300
105 2700
;
/* Transaction */
data emp_new;
input eid salary;
datalines;
101 6000
103 5000
105 .
;
proc sort data=emp_salaries;
by eid;
run;
proc sort data=emp_new;
by eid;
run;
data emp_salaries;
update emp_salaries emp_new updatemode=nomissingcheck;
by eid;
run;
Appending
data _null_;
File print;
Update emp_salaries emp_new;
By eid;
Put _all_;
Run;
Modifications:
1. Replace the data.
2. Manipulate the data.
3. Delete the data.
Modifications can be done by update statement, set statement and modify statement.
Run the modification using set and modify statements:
Set statement: Reads the data observation by observation and writes a new copy of the dataset.
Modify statement: Changes the dataset in place without creating a copy, so it typically needs less
processing time and disk space than the set statement.
Appending
data emp1;
input eid salary;
datalines;
101 2000
102 4000
;
data emp2;
input eid salary;
datalines;
210 2000
202 4000
;
/* set statement */
data emp1;
set emp1;
salary=salary+1000;
run;
/* set statement */
data emp3;
set emp1;
salary=salary+1000;
run;
/* modify statement - efficient */
data emp4;
modify emp2; /* fails: emp2 is not named in the DATA statement */
salary=salary+1000;
run;
Log Window
227 data emp4;
228 modify emp2;
ERROR 416-185: The MODIFY statement requires the MASTER data set to be present
on the DATA statement.
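The error goes away when the dataset named on the MODIFY statement also appears on the DATA statement. A minimal sketch of the corrected step, assuming emp2 exists as created above:

```sas
data emp2;
   modify emp2;              /* same name on DATA and MODIFY: update in place */
   salary = salary + 1000;
run;

proc print data=emp2;
run;
```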
Appending
data emp_salaries;
input eid salary;
datalines;
101 4000
102 5600
103 2300
104 3400
105 3000
;
data emp_hic;
input eid hic;
datalines;
101 0.5
103 0.3
105 0.2
;
data emp_salaries;
modify emp_salaries emp_hic;
by eid;
salary=salary+(salary*hic);
run;
Note: If the transaction file contains observations with no match in the master file, the modify process fails.
Merge
Merge process: It combines tables side by side. Types of merge:
1. One to one merge without relation.
2. One to one merge with relation.
3. One to many merge with relation.
4. Many to one merge with relation.
5. Many to many merge with relation.
Note: a many to many merge is controlled by more than one variable.
Lookup process: Using the lookup process, matching and nonmatching data can be reported.
Tracking or temporary variable: It controls the lookup process and returns two values, 1 or 0. It is created with the IN= dataset option.
Base dataset: The dataset that contains the matching data is called the base dataset.
/* matching data: report effected by adverse event and adverse drug reaction */
data ae_adr_match;
merge ae(in=a) adr(in=b);
by pid;
if a=1 and b=1;
run;
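The same IN= tracking variables can route matching and nonmatching records to separate datasets in one pass. A self-contained sketch with made-up contents for ae and adr (the variable names aeterm and drug and the output dataset names are assumptions):

```sas
data ae;
   input pid aeterm $;
datalines;
101 Nausea
102 Rash
104 Fever
;
run;

data adr;
   input pid drug $;
datalines;
101 DrugA
103 DrugB
104 DrugC
;
run;

/* both inputs are already in pid order, as MERGE with BY requires */
data both ae_only adr_only;
   merge ae(in=a) adr(in=b);
   by pid;
   if a and b then output both;       /* matching: pid 101 and 104 */
   else if a then output ae_only;     /* only in ae: pid 102 */
   else output adr_only;              /* only in adr: pid 103 */
run;
```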
/* Sorting */
proc sort data=plans;
by pcode;
proc sort data=plans_2009;
by pcode;
run;
/* Cleaning */
data plans_2009;
modify plans_2009 (in=var1) plans (in=var2);
by pcode;
if var1=1 and var2=1 then remove;
run;
/* Check log window */
/* Loading & Report*/
proc append base=plans data=plans_2009;
run;
proc print data=plans;
run;
TASK: To control conditional blocks using do block, goto statement, and link statement.
data emp_salaries;
input eid $ salary sale;
datalines;
E234 2300 678
E245 4500 456
E456 3000 400
E890 4500 300
E235 7800 580
E267 5000 280
;
/* do block processing */
data emp1;
set emp_salaries;
if sale>=500 then do;
newsalary=salary+2000;
rating ='A+++';
end;
else if sale>=400 and sale<500 then do;
newsalary=salary+1500;
rating='A++';
end;
else do;
newsalary=salary+1000;
rating ='A+';
end;
run;
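Goto or link statement: Jumps to a label when a condition is met. A label names a group of statements to run. LINK returns to the statement after the LINK when RETURN is reached; GOTO does not come back.
* Not very efficient.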
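A minimal sketch of the LINK statement on a two-row sample (the dataset names emp_demo and emp_link and the label bonus are illustrative):

```sas
data emp_demo;
   input eid $ salary sale;
datalines;
E234 2300 678
E267 5000 280
;
run;

data emp_link;
   set emp_demo;
   newsalary = salary + 1000;          /* default increment */
   if sale >= 500 then link bonus;     /* run the labelled group, then come back */
   output;
   return;                             /* end of the main flow for this row */
bonus:
   newsalary = salary + 2000;
   return;                             /* goes back to the statement after LINK */
run;

proc print data=emp_link;
run;
```

Replacing LINK with GOTO would jump to the label without coming back, so the labelled group would then need its own OUTPUT statement.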
/* extraction */
data medi;
infile 'H:\Studies\SAS_Books\Mohan\source\DLM\logicalvar.txt';
input pid $ date: date9. Drug $;
format date date9.;
run;
/* sequential number */
data medi1;
set medi;
sq_no+1;
run;
/* generates visit ids pid wise: (output statement controls the overwriting) */
proc sort data=medi out=medi2;
by pid;
run;
data medi3;
set medi2;
by pid;
Reverse Number
Program prints reverse of a number, i.e., if input is 951 then the output will be 159. REVERSE works on character values, so a number is converted to character first.
Syntax: <new var name> = reverse (argument);
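A minimal sketch of reversing a number by converting it to character, reversing it, and converting it back (the variable names and the BEST12. format are illustrative choices):

```sas
data _null_;
   n = 951;
   n_char  = strip(put(n, best12.));                    /* num -> char: '951' */
   rev_num = input(reverse(strip(n_char)), best12.);    /* reverse, then char -> num */
   put n= rev_num=;                                     /* n=951 rev_num=159 */
run;
```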