Objectives
Concatenating: Given two (or more) text strings, concatenate the values into one string.
Casing: Control whether a text string is represented as all capital letters or in mixed case.
Parsing Data
Parsing is a simple but intelligent tool for separating a multi-part field value into multiple, single-part fields (tokens). Each token is identified based on its individual contribution to the overall field. For example, given the name "Mr. Linwood Leroy Bubar, III, M.D.", parsing produces these tokens:
Name Prefix: Mr.
Given Name: Linwood
Middle Name: Leroy
Family Name: Bubar
Name Suffix: III
Name Appendage: M.D.
Concatenating Data
Concatenating is essentially the opposite of the parse step. Rather than separating a single field into multiple fields, concatenating combines one or more fields into a single field. For example, given these fields:
Given Name: Igor
Middle Name: Bela
Family Name: Bonski
concatenating produces the single value "Igor Bela Bonski".
Casing
Changing case enables you to make all alphabetical values in a field UPPERCASE, lowercase, or Proper Case. Proper case treats a field value as a proper name; that is, the first letter of each word is capitalized, with the remaining characters in lowercase. As with standardization, changing case can make field values more consistent.
Applying Techniques
These data quality/cleansing techniques can be applied using the following:
- dfPower Studio's dfPower Architect
- the SAS Data Quality Server functions as column-level transformations with SAS Data Integration Studio
- the SAS Data Quality Server functions within a SAS programming environment
Because the SAS Data Quality Server functions are the same whether surfaced in SAS Data Integration Studio or in a SAS session, you examine these functions only in a SAS session.
Objectives
Describe the functionality of dfPower Architect.
Explore various job flow steps that are available to use.
Discuss the sequence of steps for building a job.
dfPower Architect
With dfPower Architect, you can perform the following tasks:
- identify and connect to multiple data sources, whether those sources are local, over a network on a different platform, or at a remote location
- choose and configure job flow nodes for processing your data
- reconfigure existing job flow nodes as needed
- view sample processed data at each job flow node
- specify a variety of output options, including reports and new data sources
- run a job flow with a single click
Menus and Tools
Details Area
- Information tab
- Preview tab
- Log tab
Status Line
Nodes List
Job Flow Steps
dfPower Architect's available job flow steps are grouped into nine categories:
Data Source: identifies existing data sets to process.
SQL Query: identifies existing data sets to process using SQL.
Text File Input: accesses data in a plain-text file.
Fixed Width File Input: accesses data in a text file where the input is separated into fixed-width columns.
External Data Provider: enables services for applications or processes that want to pass data into dfPower Architect one record at a time; can also be used to call other Architect job flows within a job when used in conjunction with the Embedded Job node.
SAS Data Set: identifies existing SAS data sets to process on the Microsoft Windows platform.
Additional input nodes extract meta information from a specific table within a database, and identify existing data sets to process as with the SAS Data Set node but using SQL to select data.
Data Target (Update): updates existing data rather than creating a new data source or replacing an existing source.
Data Target (Insert): outputs data in a variety of data formats to a new data source, leaving your existing data as-is or overwriting your existing data.
Delete Record: eliminates records from a data source using the unique key of those records.
HTML Report: creates an HTML-formatted report from the results of your job flow.
Text File Output: creates a plain-text file with the results of your job flow.
Fixed Width File Output: outputs your data to well-defined fixed-width columns in your output file.
Frequency Distribution Chart: creates a chart that shows how selected values are distributed throughout your data.
Match Report: generates a match report that can then be displayed with the Match Report Viewer.
dfPower Merge File Output: writes clustered data to a dfPower Merge file for use in dfPower Merge.
COM Plugin: adds COM (Component Object Model) capabilities to your job flows.
Data Sorting: re-orders your data set at any point in a job flow.
Expression: runs a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio.
Data Joining: combines two data sets in an intelligent way so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
Data Joining (Non-Key): is used when you have two tables, each with the same number of records, and you want to join them by location in the file rather than by a unique key.
Data Union: combines the records of two data sets into a single data set.
Concatenate: performs essentially the opposite of the Parse node; rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Embedded Job: embeds another dfPower Architect job in your current job flow.
Sequencer (Autonumber): creates a sequence of numbers given a starting number and a specified interval.
SQL Lookup: finds rows in a database table that have one or more fields matching those in the job flow.
SQL Execute: enables you to construct and execute any valid SQL statement (or series of statements); generally used to perform database-specific tasks before, after, or in between Architect job flows; a stand-alone node (no parents or children).
Additional utility nodes enable you to rename and reorder field names as they pass out of the node, and to write an SQL query that contains variable inputs, also known as parameters.
Data Validation: analyzes the content of data by setting validation conditions.
Pattern Analysis: performs pattern analysis.
Basic Statistics: calculates basic statistics.
Frequency Distribution: creates a frequency distribution.
Basic Pattern Analysis: provides the ability to run Pattern Analysis in a manner very similar to dfPower Profile. (In contrast to advanced Pattern Analysis, the simplified version does not employ Blue Fusion pattern identification definitions.)
Gender Analysis: performs gender analysis.
Gender Analysis (Parsed): performs gender analysis on parsed information.
Identification Analysis: performs identification analysis.
Parsing: parses a field.
Standardization: performs standardization of fields of data.
Standardization (Parsed): performs standardization of fields of parsed information.
Change Case: enables the case of field values to be set.
Locale Guessing: attempts to guess the appropriate locale based on field information.
Right Fielding: identifies the contents of fields and copies the data to fields with more descriptive names.
Match Code: generates match codes.
Match Codes (Parsed): generates match codes on parsed information.
Clustering: generates clusters.
Cluster Update: enables new records to be integrated with existing clusters.
Surviving Record Identification: examines clustered data and determines a surviving record for each cluster.
Cluster Diff: compares sets of clustered records.
Exclusive Real Time Clustering (ERTC): facilitates the near real-time addition of new rows to previously clustered data.
Concurrent Real Time Clustering (CRTC): is similar to the ERTC node in its outcomes; the difference between the nodes is that the ERTC node interacts directly with the cluster state file, while the CRTC node interacts with a server that interacts with the cluster state file.
Address Verification (US/Canada): verifies, corrects, and enhances U.S. and Canadian addresses in your existing data.
Address Verification (QAS): performs address verification on addresses from outside of the U.S. and Canada.
Address Verification (World): performs address verification on addresses from outside of the U.S. and Canada. (This step is similar to Address Verification (QAS) but supports verification and correction for addresses from more countries.)
Geocoding: matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, FIPS (Federal Information Processing Standard), and block information.
County: matches information from the phone and geocode reference databases with FIPS codes in your data to calculate several values.
Phone: matches information from the phone reference database with telephone numbers in your data.
Area Code: matches information from the phone reference database with ZIP codes in your data to calculate several values, primarily area code, but also Overlay1, Overlay2, Overlay3, and Result.
Distributed Geocoding: offloads geocode processing to a machine other than the one running the current dfPower Architect job.
Distributed Address Verification: offloads address verification processing to a machine other than the one running the current dfPower Architect job.
Distributed Phone: offloads phone data processing to a machine other than the one running the current dfPower Architect job.
Distributed Area Code: offloads area code data processing to a machine other than the one running the current dfPower Architect job.
Distributed County: offloads county data processing to a machine other than the one running the current dfPower Architect job.
The monitoring node enables you to analyze data according to business rules that you create using the Business Rule Manager. The business rules that you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an e-mail alert, when a condition is detected.
1. Plan the job flow.
2. Select the input data.
3. Build the job flow.
4. Specify the output.
5. Process the job flow.
Case Study Tasks
Analyze and Profile the Data
Access and view the data. Create and execute profiling job(s).
This demonstration illustrates the use of dfPower Architect to perform identification analysis, gender analysis, parsing, concatenation, and casing. In addition, other nodes are investigated (frequency distribution, frequency distribution chart, and HTML report).
In this demonstration, first establish a data source to work with. Then run an identification analysis on a name field from this data source, with the results used to generate frequency counts of the identified types of data. After you decide that the majority of data in the name field are individual names, run a gender analysis, with the results of this also used to generate frequency counts. As a last step, use the results from the identification and gender analysis to generate a pie chart.
1. If necessary, invoke dfPower Studio by selecting Start  All Programs  DataFlux dfPower Studio 7.1  dfPower Studio.
2. Select Base from the toolbar, and then select Architect.
The Data Source node is added to the job flow, and the Data Source Properties window opens.
To add a node to the job flow diagram, you can do the following:
- double-click
- drag and drop
- right-click and select Insert on Page
c. Specify properties for the Data Source node. 1) Enter Contacts as the name. 2) Select next to Input table.
3) Expand the DataFlux Sample database and select the Contacts table.
4) Select to close the Select Table window.
The Data Source Properties window shows available fields from the Contacts table.
5) Select
(double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.
The job flow diagram is updated to a display that resembles what is shown below:
2. With the data source node selected, select the Preview tab from the Details area (at the bottom of dfPower Architect interface). The data from this node is displayed:
3. Perform an Identification Analysis using the Contact field. a. Expand the Quality grouping of nodes. b. Double-click the Identification Analysis node.
c. Move the CONTACT field from the Available area to the Selected area by double-clicking. d. Double-click on the Definition column for the selected CONTACT field. e. From the menu, select Individual/Organization.
f. Scroll in the Selected area to reveal that the results of the identification analysis are placed in the field CONTACT_Identity.
g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Identification Analysis Properties window.
4. Preview the results of the Identification Analysis. a. Verify that the Identification Analysis node is selected. b. Select the Preview tab at the bottom of dfPower Architect interface. c. Scroll to the right to view the information populated for the CONTACT_Identity:
Although this preview is a good indication of the overall data values, it would be desirable to verify that there are no odd data values.
5. Add a Frequency Distribution task to the job flow. a. Expand the Profiling grouping of nodes. b. Double-click the Frequency Distribution node.
d. Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.
If you are satisfied that the majority (99%) of the observations represent individuals, you can proceed with a gender analysis.
6. Perform a gender analysis using the Contact field. a. Verify that the Frequency Distribution 1 node is selected in the job flow diagram. b. Expand the Quality grouping of nodes. c. Right-click on the Gender Analysis node and select Insert Before Selected.
d. Move the CONTACT field from the Available area to the Selected area by double-clicking.
e. Double-click on the Definition column for the selected CONTACT field.
f. Select Gender.
g. Scroll in the Selected area to reveal that the results of the gender analysis are placed in the field CONTACT_Gender.
h. Select below the Available area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Gender Analysis Properties window.
7. Update the properties of the Frequency Distribution to include the CONTACT_Gender field. a. Right-click Frequency Distribution 1 in the job flow and select Properties. b. Move the CONTACT_Gender field from the Available area to the Selected area. c. Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.
A more visual approach for viewing the results uses a graphic representation of the information.
8. Add a Frequency Distribution Chart task to the job flow. a. Expand the Data Outputs grouping of nodes. b. Double-click the Frequency Distribution Chart node.
c. Select
1) Navigate to S:\Workshop\winsas\didq. 2) Enter Contacts Gender Identity Chart as the value for File name.
3) Select
d. Enter Gender & Identity Distribution from Contacts as the title for the chart. e. Move both CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.
f. Select to close the Frequency Distribution Chart Properties window. The Preview tab is populated with the frequency report.
9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator:
b. Select
c. Select
10. Save the job. a. From the dfPower Architect menu, select File Save As. b. Enter DIDQ Contact Gender/Identity Analysis as the name. c. Enter Gender & Identity Analysis for Contacts table as the description.
d. Select
3) Expand the DataFlux Sample database and then select the Contacts table. 4) Select to close the Select Table window.
The Data Source Properties window shows available fields from the Contacts table.
5) Select (double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.
3. Parse the Contact field. a. Expand the Quality grouping of nodes. b. Double-click the Parsing node.
c. Select CONTACT as the field to parse. d. Select Name as the definition. e. Select to move all tokens from the Available area to the Selected area.
f. Select below the Available area.
g. Select (double-arrow) to move all fields from the Available area to the Selected area.
h. Select to close the Additional Outputs window.
i. Select to close the Parse Properties window.
4. Concatenate the parsed fields. a. Expand the Utilities grouping of nodes. b. Double-click the Concatenate node.
c. Specify LastFirst as the output field.
d. Enter ", " (a comma and a space) as the value for Literal text.
e. Select Family Name, and then select to move it to the Concatenation list area.
f. Select next to Literal text to move the literal text to the Concatenation list area after Family Name.
g. Select Given Name, and then select to move it to the Concatenation list area.
h. Select below the Available fields area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Concatenation Properties window.
The Preview tab is populated. Scroll to find the new LastFirst column.
A more complete picture of the concatenation might be gained by viewing an HTML report.
5. Add an HTML Report task to the job flow. a. Expand the Data Outputs grouping of nodes. b. Double-click the HTML Report node.
c. Enter Concatenation Results as the value for Report title. d. Enter NewName as the value for Report name. e. Select the check box for Display report in browser after job runs. f. Deselect all columns from Selected. (Select .)
g. Move CONTACT, Given Name, Family Name, and LastFirst from the Available area to the Selected area.
h. Select to close the HTML Report Properties window.
6. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
b. Select to close the Run Job window.
c. Select File Close to close the browser when you are finished viewing the report.
7. Change the case of the LastFirst field. a. Select the HTML Report 1 node in the job flow. b. Expand the Quality grouping of nodes. c. Right-click Change Case and select Insert Before Selected. The Case Properties window opens.
d. Move LastFirst from the Available area to the Selected area. e. Select Proper as the type of casing to use. f. Select Proper (Name) as the definition to use.
g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Case Properties window.
8. Update the HTML Report 1 node. a. Double-click on the HTML Report 1 node in the job flow to open the HTML Report Properties window. b. Verify that the check box for Display report in browser after job runs is selected. c. Deselect all columns from the Selected area. (Select .)
d. Move CONTACT, Given Name, Family Name, LastFirst, and LastFirst_Cased from the Available area to the Selected area. e. Select to close the HTML Report Properties window.
9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
b. Select to close the Run Job window.
c. Select File Close to close the browser when you are finished viewing the report.
10. Save the job. a. From the dfPower Architect menu, select File Save As. b. Enter DIDQ Contact Parse/Concatenation Job as the name. c. Enter Parse then concatenation of Contact field as the description.
d. Select
11. Select File Exit to close dfPower Architect. 12. Select Studio Exit to close dfPower Studio.
Objectives
Describe some SAS Data Quality Server functions. List some basic examples using these functions.
%DQPUTLOC Macro
Each of the SAS Data Quality Server functions requires the specification of a definition as part of the syntax. The %DQPUTLOC AUTOCALL macro provides a quick means of displaying current information in the SAS log for the specified locale that is loaded into memory at that time. The available locale information includes a list of all definitions, parse tokens, related functions, and the names of the parse definitions that are related to each match definition.

%DQPUTLOC(locale, <SHORT=0|1>, <PARSEDEFN=0|1>);
%DQPUTLOC(locale, <SHORT=0|1>, <PARSEDEFN=0|1>);
where
locale: specifies the locale of interest.
SHORT=0|1: optionally shortens the length of the entry in the SAS log. SHORT=1 removes the descriptions of how the definitions are used. The default value is SHORT=0, which displays the descriptions of how the definitions are used.
PARSEDEFN=0|1: optionally lists the related parse definition, if such a parse definition exists, with each gender analysis definition and each match definition. The default value PARSEDEFN=1 lists the related parse definition. PARSEDEFN=0 does not list the related parse definition.
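As a quick illustration, the following sketch (using the setup-file path shown in the demonstration programs later in this chapter) loads the ENUSA locale and writes an abbreviated list of its definitions to the SAS log:

```sas
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt');

/* SHORT=1 suppresses the descriptions of how each definition is used */
%DQPUTLOC(ENUSA, SHORT=1);
```

Reviewing this log output is a convenient way to confirm the exact definition names (for example, Name, Gender, Individual/Organization) before calling the functions below.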
DQIDENTIFY(char, 'identification-definition'<, 'locale'>)
where
char: is the value that is transformed, according to the specified identification definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
identification-definition: specifies the name of the identification definition, which must exist in the specified locale.
locale: optionally specifies the name of the locale that contains the specified identification definition. The value can be a name in quotation marks, the name of a variable whose value is a locale name, or an expression that evaluates to a variable name or to a quoted locale name. The specified locale must be loaded into memory as part of the locale list. If no value is specified, the default locale is used. The default locale is the first locale in the locale list.
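A minimal sketch of DQIDENTIFY in a DATA step, assuming the ENUSA locale has already been loaded with %DQLOAD. The first literal is the name from the parsing example earlier in the chapter; the second is a hypothetical organization name:

```sas
data _null_;
   /* Classify each literal as an individual or an organization */
   id1 = dqidentify('Mr. Linwood Leroy Bubar, III, M.D.',
                    'Individual/Organization');
   id2 = dqidentify('DataFlux Corporation',
                    'Individual/Organization');
   put id1= id2=;
run;
```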
DQGENDER(char, 'gender-analysis-definition'<, 'locale'>)
where
char: is the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
gender-analysis-definition: specifies the name of the gender analysis definition, which must exist in the specified locale.
locale: optionally specifies the name of the locale that contains the specified gender analysis definition.
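A sketch of a DQGENDER call, assuming the ENUSA locale is loaded; the name literal is reused from the parsing example:

```sas
data _null_;
   length gender $1;
   /* Returns a one-character gender code: M, F, or U (unknown) */
   gender = dqgender('Mr. Linwood Leroy Bubar, III, M.D.', 'Gender');
   put gender=;
run;
```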
Parsing in SAS
The DQPARSE function returns a parsed character value. The return value contains delimiters that identify the elements in the value that correspond to the tokens that are enabled by the parse definition.

DQPARSE(char, 'parse-definition'<, 'locale'>)
DQPARSE(char, 'parse-definition'<, 'locale'>)
where
char: is the value that is parsed according to the parse definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.
parse-definition: specifies the name of the parse definition, which must exist in the specified locale.
locale: optionally specifies the name of the locale that contains the specified parse definition.
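A sketch of a DQPARSE call, assuming the ENUSA locale is loaded; the input is the name from the parsing slide:

```sas
data _null_;
   length parsed $100;
   /* The return value is the input with token delimiters inserted */
   parsed = dqparse('Mr. Linwood Leroy Bubar, III, M.D.', 'Name');
   put parsed=;
run;
```

The delimited value is not intended for display on its own; it is normally passed to DQPARSETOKENGET to extract individual tokens.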
Parsing in SAS
The DQPARSEINFOGET function returns the token names in a parse definition.

DQPARSEINFOGET('parse-definition'<, 'locale'>)

The DQPARSETOKENGET function returns a token from a parsed character value.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition'<, 'locale'>)
DQPARSEINFOGET('parse-definition'<, 'locale'>)
where
parse-definition: specifies the name of the parse definition, which must exist in the specified locale.
locale: optionally specifies the name of the locale that contains the specified parse definition.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition'<, 'locale'>)
where
parsed-char: is the parsed character value from which the value of the specified token is returned.
token: specifies the name of the token that is returned from the parsed value.
parse-definition: specifies the name of the parse definition, which must exist in the specified locale.
locale: optionally specifies the name of the locale that contains the specified parse definition.
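The functions are typically used together: parse once, then pull individual tokens from the parsed value. A sketch, assuming the ENUSA locale is loaded:

```sas
data _null_;
   length parsed $100 given family $30;
   parsed = dqparse('Mr. Linwood Leroy Bubar, III, M.D.', 'Name');
   /* List the token names that the Name parse definition produces */
   tokens = dqparseinfoget('Name');
   /* Extract individual tokens from the parsed value */
   given  = dqparsetokenget(parsed, 'Given Name', 'Name');
   family = dqparsetokenget(parsed, 'Family Name', 'Name');
   put tokens= given= family=;
run;
```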
DQCASE(char, 'case-definition'<, 'locale'>)
where
char: is the value that is transformed, according to the specified case definition.
case-definition: specifies the name of the case definition that is referenced during the transformation.
locale: optionally specifies the name of the locale that contains the specified case definition.
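A sketch of a DQCASE call, assuming the ENUSA locale is loaded; the input literal is hypothetical, and the PROPER definition name is taken from the demonstration program later in this chapter:

```sas
data _null_;
   length fixed $40;
   /* PROPER capitalizes the first letter of each word, as a proper name */
   fixed = dqcase('BUBAR, LINWOOD', 'PROPER');
   put fixed=;
run;
```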
Case Study Tasks
Analyze and Profile the Data
Access and view the data. Create and execute profiling job(s).
This demonstration illustrates the use of the SAS Data Quality Server functions to perform identification analysis, gender analysis, parsing, concatenation, and casing.
In this demonstration, investigate four separate SAS programs. These programs investigate the use and results of the DQIDENTIFY, DQGENDER, DQPARSE, and DQCASE functions. To investigate the results from the programs, short FREQ or PRINT procedure steps are added.
1. Start a SAS session by selecting Start  All Programs  SAS BIArchitecture  Start SAS.
2. If the Getting Started with SAS window opens, do the following:
a. Select Don't show this dialog box again.
b. Select
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   length Identity $1;
   label Identity='Customer Identity Type';
   Identity = dqidentify(contact, 'Individual/Organization');
run;

In this program, the following occurs:
- The %DQLOAD macro loads the ENUSA locale into memory.
- The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
- The DATA step uses the DQIDENTIFY function to identify whether the value for the CONTACT field is an individual, an organization, or not known.

5. Select Run  Submit to execute the SAS program.
7. To view the scheme data set, do the following: a. Select SAS Explorer. b. Double-click on the Libraries icon. c. Double-click on the Work library icon. d. Double-click on the Std_prospects table to open it into a VIEWTABLE window. e. Scroll to view the Customer Identity Type column.
f.
9. Run a frequency report on the new identity column.
a. At the bottom of the program, after the RUN statement for the DATA step, uncomment the PROC FREQ step (that is, remove the /* before the step and the */ after the step). The PROC FREQ step is as shown:

proc freq;
   tables identity / nocum;
run;

b. Highlight only these three new lines and then select Run  Submit. The following report surfaces:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set Prospects;
   /* use the GENDER function to determine gender based on name */
   length custgender $1;
   label custgender='Customer Gender';
   custgender = dqgender(contact, 'gender');
run;

proc freq;
   tables custgender / nocum;
run;

In this program, the following occurs:
- The %DQLOAD macro loads the ENUSA locale into memory.
- The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
- The DATA step uses the DQGENDER function to identify whether the value for the CONTACT field is M(ale), F(emale), or U(nknown).
- The PROC FREQ step generates a report of frequency counts on the custgender column.
6. Select View Log to activate the Log window. A portion of the DATA step and PROC FREQ step is shown below:
7. Select View Output to activate the Output window. The report shows the following:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\biarchitecture\Lev1\sasmain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="newcustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\dqdata\newcustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName = dqparse(contact, 'NAME');
   Prefix     = dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   First_name = dqparsetokenget(parsedname, 'Given Name', 'NAME');
   Last_name  = dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

proc print;
   var prefix first_name last_name;
run;

In this program, the following occurs:
- The %DQLOAD macro loads the ENUSA locale into memory.
- The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
- The DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the CONTACT field.
- The PROC PRINT step produces a listing report of the results of the DQPARSETOKENGET function usage.
6. Select View Log to activate the Log window. The portion for the DATA step and PROC PRINT step is shown below:
7. Select View Output to activate the Output window. The partial output is as follows:
/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT=WORK.Prospects
            DATATABLE="NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName = dqparse(contact, 'NAME');
   Prefix     = dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   First_name = dqparsetokenget(parsedname, 'Given Name', 'NAME');
   Last_name  = dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

data std_prospects;
   set std_prospects;
   length Contact2 $50;
   label Contact2='Re-formatted Prospect Name';
   Contact2 = trim(Last_Name) || ', ' || First_Name;
   length Contact3 $50;
   label Contact3='Proper Cased Re-formatted Prospect Name';
   Contact3 = dqcase(contact2, 'PROPER');
run;

proc print;
   var Contact Contact2 Contact3;
run;
In this program, the following occurs:
- The %DQLOAD macro loads the ENUSA locale into memory.
- The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
- The first DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the value for the CONTACT field.
- The second DATA step uses the concatenation operator (||) to rebuild a name field (Contact2). The DQCASE function is then applied to resolve the Contact2 field to proper casing.
- The PROC PRINT step produces a listing report of some parsed and concatenated information.
5. Select Run Submit to execute the SAS program. 6. Select View Log to activate the Log window. The portion for the DATA steps and PROC PRINT step is shown below:
7. Select View Output to activate the Output window. Partial output is shown below:
12.4 Exercises
1. Analyzing the NewCustomers Table
Use the NewCustomers table from the New Customers database to do the following:
- Verify the type of information found for each record. (Identify records as individual or organization.)
- Calculate gender information for each record.
- Create a frequency report and a frequency report chart on both the identity and gender information.
- Parse the Contact field.
- Add a field that contains a name string of the form Name_Prefix Given_Name Family_Name.
Save the job as DIDQ Ch5Ex1 NewCustomers Analysis.
3) Expand the New Customers database and select the NewCustomers table.
4) Select OK to close the Select Table window.
5) Select the double-arrow button to move all fields from the Available area to the Selected area.
6) Select OK to close the Data Source Properties window.
e. With the data source node selected, select the Preview tab from the Details area (at the bottom of the dfPower Architect interface). The data from this node is displayed.
f. Expand the Quality grouping of nodes.
g. Double-click the Identification Analysis node. The Identification Analysis Properties window opens.
1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
2) Double-click the Definition column for the selected CONTACT field.
3) From the menu, select Individual/Organization.
4) Scroll in the Selected area to verify that the results of the identification analysis will be placed in the field CONTACT_Identity.
5) Select Additional Outputs below the Available area.
6) Select the double-arrow button to move all fields from the Available area to the Selected area.
7) Select OK to close the Additional Outputs window.
8) Select OK to close the Identification Analysis Properties window.
h. Preview the results of the identification analysis.
1) Verify that the Identification Analysis node is selected.
2) Select the Preview tab at the bottom of the dfPower Architect interface.
3) Scroll to the right to view the information populated for the CONTACT_Identity field.
i. Expand the Quality grouping of nodes.
j. Double-click the Gender Analysis node. The Gender Analysis Properties window opens.
1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
2) Double-click the Definition column for the selected CONTACT field.
3) Select Gender.
4) Scroll in the Selected area to verify that the results of the gender analysis will be placed in the field CONTACT_Gender.
5) Select Additional Outputs below the Available area.
6) Select the double-arrow button to move all fields from the Available area to the Selected area.
7) Select OK to close the Additional Outputs window.
8) Select OK to close the Gender Analysis Properties window.
k. Expand the Profiling grouping of nodes.
l. Double-click the Frequency Distribution node. The Frequency Distribution Properties window opens.
1) Move CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.
2) Select OK to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.
m. Expand the Data Outputs grouping of nodes.
n. Double-click the Frequency Distribution Chart node. The Frequency Distribution Chart Properties window opens.
1) Select the browse button next to Chart name to choose a location for the output.
2) Navigate to S:\Workshop\winsas\didq.
3) Enter New Customers Chart as the value for File name.
4) Select Save to close the Save As window.
5) Enter Gender & Identity Distribution from New Customers as the title for the chart.
6) Move both CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.
7) Select OK to close the Frequency Distribution Chart Properties window. The Preview tab is populated with the frequency report.
o. Select the Run Job button from the toolbar. The job processes, and the Run Job window opens with a status indicator.
1) Select Close to close the Run Job window. The Chart Viewer window opens.
2) Select the next-chart arrow to scroll to the next chart, for CONTACT_Gender.
3) Select File → Exit to close the Chart Viewer window.
p. Save the job.
1) From the dfPower Architect menu, select File → Save As.
2) Enter DIDQ Ch5Ex1 NewCustomers Analysis as the name.
3) Enter New Customer Analysis as the description.
4) Select Save to close the Save As window.
q. Select the Frequency Distribution 1 node in the job flow.
r. Expand the Quality grouping of nodes.
s. Right-click the Parsing node and select Insert Before Selected.
1) Select CONTACT as the field to parse.
2) Select Name as the definition.
3) Select the double-arrow button to move all tokens from the Available area to the Selected area.
4) Select Additional Outputs below the Available area.
5) Select the double-arrow button to move all fields from the Available area to the Selected area.
6) Select OK to close the Additional Outputs window.
7) Select OK to close the Parse Properties window.
t. Expand the Utilities grouping of nodes.
u. Double-click the Concatenate node. The Concatenation Properties window opens.
1) Specify PreFirstLast as the output field.
2) Enter (a space) as the value for Literal text.
3) Select Name Prefix, and then select the arrow to move it to the Concatenation list area.
4) Select the arrow next to Literal text to move the literal text to the Concatenation list area after Name Prefix.
5) Select Given Name, and then select the arrow to move it to the Concatenation list area.
6) Select the arrow next to Literal text to move the literal text to the Concatenation list area after Given Name.
7) Select Family Name, and then select the arrow to move it to the Concatenation list area.
8) Select Additional Outputs below the Available fields area.
9) Select the double-arrow button to move all fields from the Available area to the Selected area.
10) Select OK to close the Additional Outputs window.
11) Select OK to close the Concatenation Properties window.
v. View the result of this node on the Preview tab.
w. From the dfPower Architect menu, select File → Save.
x. Select File → Exit to close dfPower Architect.
y. Select Studio → Exit to close dfPower Studio.
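The output the Concatenate node produces for each record, Name Prefix, a literal space, Given Name, another literal space, then Family Name, can be sketched in one line of Python (illustrative only; the token dictionary stands in for the Parsing node's output fields):

```python
# Minimal sketch of the Concatenate node's PreFirstLast output for one
# record: tokens joined with the literal single-space text configured above.
tokens = {'Name Prefix': 'Mr.', 'Given Name': 'Linwood', 'Family Name': 'Bubar'}
pre_first_last = ' '.join(tokens[t] for t in ('Name Prefix', 'Given Name', 'Family Name'))
print(pre_first_last)   # Mr. Linwood Bubar
```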