
Chapter 12 Data Cleansing Additional Functionality


12.5 Solutions to Exercises

12.1 Additional Data Quality/Cleansing Techniques

Objectives

Discuss some additional data quality/cleansing techniques.

Data Quality/Cleansing

The following are additional techniques that can be used to further enhance data quality:
- Identification analysis
- Gender analysis
- Parsing
- Concatenating
- Casing


Data Quality/Cleansing

Identification Analysis: Based on a given name string, determine whether the name represents an individual or an organization.
Gender Analysis: Based on a person's name, determine the gender.
Parsing: Given a text string, parse the string into its individual elements.
Concatenating: Given two (or more) text strings, concatenate the values into one string.
Casing: Control whether a text string is represented as all capital letters or in mixed case.

Identification Analysis

Identification analysis enables you to compare information from the QKB with undetermined fields in your data to determine whether each field contains the following:

For name information:
- an individual's name
- an organization's name
- empty

For address information:
- a street address
- city/state/ZIP information
- empty


Identification Analysis

For data fields containing name data, identification analysis returns INDIVIDUAL, ORGANIZATION, or UNKNOWN. For data fields containing address data, identification analysis returns one of the following:
- ACCT (account number type information)
- ADDR (address line 1)
- ADDR2 (address line 2)
- ATTN (attention line)
- BLANK (blank or null values)
- CSZ (city/state/ZIP)
- IND (an individual's name)
- ORG (organization type information)
- UNK (unknown)
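The category-per-field result can be sketched outside the product. The following Python sketch is illustrative only: the `identify` function and its marker list are hypothetical stand-ins, not the QKB's locale-specific vocabularies or matching rules.

```python
# Illustrative sketch only: a toy stand-in for QKB-driven identification
# analysis. Real identification definitions use locale-specific vocabularies
# and rules; the marker set below is a hypothetical placeholder.

ORG_MARKERS = {"inc", "llc", "ltd", "corp", "co", "company", "corporation"}

def identify(value):
    """Classify a name field as INDIVIDUAL, ORGANIZATION, or UNKNOWN."""
    if not value or not value.strip():
        return "UNKNOWN"                      # blank or null input
    tokens = [w.strip(".,") for w in value.split()]
    if any(t.lower() in ORG_MARKERS for t in tokens):
        return "ORGANIZATION"                 # legal-form marker found
    if all(t.replace("-", "").replace("'", "").isalpha() for t in tokens):
        return "INDIVIDUAL"                   # plausible personal name
    return "UNKNOWN"                          # digits, codes, addresses, ...

print(identify("Acme Tools Inc."))   # ORGANIZATION
print(identify("Mary O'Leary"))      # INDIVIDUAL
print(identify("919-555-0100"))      # UNKNOWN
```

The point of the sketch is the shape of the operation: one input field, one category code out, with the vocabulary doing the real work.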

Gender Analysis

Gender analysis determines whether a particular name is most likely feminine, masculine, or unknown. The results are placed in a new field and have three possible values:
- "M" for male
- "F" for female
- "U" for unknown
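The M/F/U behavior can be mimicked with a lookup. This is a minimal sketch, not the product's gender analysis definition: real definitions draw on large locale-specific given-name vocabularies, and the tiny tables below are hypothetical placeholders.

```python
# Illustrative sketch only: real gender analysis definitions use large
# locale-specific name vocabularies; these lookup tables are placeholders.

TITLE_GENDER = {"mr": "M", "mrs": "F", "ms": "F"}
GIVEN_NAME_GENDER = {"linwood": "M", "igor": "M", "mary": "F"}

def guess_gender(name):
    """Return "M", "F", or "U" for a full-name string."""
    words = [w.strip(".,").lower() for w in name.split()]
    if not words:
        return "U"
    if words[0] in TITLE_GENDER:          # a courtesy title is decisive
        return TITLE_GENDER[words[0]]
    return GIVEN_NAME_GENDER.get(words[0], "U")

print(guess_gender("Mr. Linwood Leroy Bubar"))  # M
print(guess_gender("Mary Jones"))               # F
print(guess_gender("Pat Smith"))                # U
```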


Parsing Data

Parsing is a simple but intelligent tool for separating a multi-part field value into multiple, single-part fields (tokens). Each token is identified based on its individual contribution to the overall field.

Name: Mr. Linwood Leroy Bubar, III, M.D.

Name Prefix: Mr.
Given Name: Linwood
Middle Name: Leroy
Family Name: Bubar
Name Suffix: III
Name Appendage: M.D.
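The token assignment above can be sketched in a few lines of Python. This is a simplified illustration, not the QKB Name parse definition; the prefix and suffix vocabularies are tiny hypothetical samples.

```python
# Simplified name parser: a sketch of token assignment, not the QKB's
# Name parse definition. Prefix/suffix vocabularies are tiny samples.

PREFIXES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr.", "sr.", "ii", "iii", "iv"}

def parse_name(name):
    parts = [p.strip() for p in name.split(",")]
    words = parts[0].split()
    tokens = dict.fromkeys(
        ["Name Prefix", "Given Name", "Middle Name",
         "Family Name", "Name Suffix", "Name Appendage"], "")
    if words and words[0].lower() in PREFIXES:
        tokens["Name Prefix"] = words.pop(0)
    if words:
        tokens["Family Name"] = words.pop()      # last word of the core name
    if words:
        tokens["Given Name"] = words.pop(0)
    tokens["Middle Name"] = " ".join(words)      # whatever remains
    for extra in parts[1:]:                      # comma-separated extras
        key = "Name Suffix" if extra.lower() in SUFFIXES else "Name Appendage"
        tokens[key] = extra
    return tokens

tokens = parse_name("Mr. Linwood Leroy Bubar, III, M.D.")
# {'Name Prefix': 'Mr.', 'Given Name': 'Linwood', 'Middle Name': 'Leroy',
#  'Family Name': 'Bubar', 'Name Suffix': 'III', 'Name Appendage': 'M.D.'}
```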


Concatenating Data

Concatenating is essentially the opposite of the parse step. Rather than separating a single field into multiple fields, concatenating combines one or more fields into a single field.

Given Name: Igor
Middle Name: Bela
Family Name: Bonski

Concatenated Name: Igor Bela Bonski
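Concatenation is essentially a join over non-empty fields. A minimal Python sketch, where the separator plays the role of the literal text inserted between fields:

```python
# Concatenation as a join over non-empty fields; the separator stands in
# for any literal text placed between the fields.

def concatenate(fields, sep=" "):
    return sep.join(f for f in fields if f)   # skip empty fields

print(concatenate(["Igor", "Bela", "Bonski"]))        # Igor Bela Bonski
print(concatenate(["Bonski", "Igor"], sep=", "))      # Bonski, Igor
print(concatenate(["Igor", "", "Bonski"]))            # Igor Bonski
```

Skipping empty fields matters in practice: a missing middle name should not leave a double space in the combined value.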


Casing
Changing case enables you to make all alphabetical values in a field UPPERCASE, lowercase, or Proper Case. Proper case treats a field value as a proper name; that is, the first letter of each word is capitalized, with the remaining characters in lowercase. As with standardization, changing case can make field values more consistent.
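The three casing modes map directly onto string methods in most languages; in Python, roughly:

```python
# str.upper()/str.lower()/str.title() approximate the three casing modes.
# Note: title() is naive proper casing; a name-aware Proper (Name)
# definition would also handle forms such as "McDonald" or "van der Berg".

value = "bubar, linwood leroy"
print(value.upper())   # BUBAR, LINWOOD LEROY
print(value.lower())   # bubar, linwood leroy
print(value.title())   # Bubar, Linwood Leroy
```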


Applying Techniques

These data quality/cleansing techniques can be applied using the following:
- dfPower Studio's dfPower Architect
- the SAS Data Quality Server functions as column-level transformations with SAS Data Integration Studio
- the SAS Data Quality Server functions within a SAS programming environment

Because the SAS Data Quality Server functions are the same whether surfaced in SAS Data Integration Studio or in a SAS session, you look at these functions only in a SAS session.


12.2 Data Quality/Cleansing Using dfPower Architect

Objectives

Describe the functionality of dfPower Architect.
Explore various job flow steps that are available to use.
Discuss the sequence of steps for building a job.


dfPower Architect: Introduction

dfPower Architect brings much of the functionality of the other dfPower Studio applications, as well as some unique functionality, into a single, intuitive user interface. To use dfPower Architect, you specify operations by selecting job flow steps and then configuring those steps to meet your specific data needs. The steps you choose are displayed as job flow icons, which together form a visual job flow.

dfPower Architect

With dfPower Architect, you can perform the following tasks:
- identify and connect to multiple data sources, whether those sources are local, over a network on a different platform, or at a remote location
- choose and configure job flow nodes for processing your data
- reconfigure existing job flow nodes as needed
- view sample processed data at each job flow node
- specify a variety of output options, including reports and new data sources
- run a job flow with a single click

Accessing dfPower Architect

dfPower Architect is invoked from the toolbar of dfPower Studio by selecting Base, and then Architect.


dfPower Architect Interface

dfPower Architect's main screen contains five components:
- Menus and tools
- Details area (Information tab, Preview tab, Log tab)
- Status line
- Nodes list
- Job Flow area


dfPower Architect Interface

(Screenshot callouts identify the interface components: the menus and tools, the Details area with its Information, Preview, and Log tabs, the status line, the Nodes list, and the Job Flow area.)


Job Flow Steps

dfPower Architect's available job flow steps are grouped into nine categories: Data Inputs, Data Outputs, Utilities, Profiling, Quality, Integration, Enrichment, Enrichment (Distributed), and Monitoring.


Job Flow Steps: Data Inputs

Job flow steps in the Data Inputs category:

Data Source: identifies existing data sets to process.
SQL Query: identifies existing data sets to process using SQL.
Text File Input: accesses data in a plain-text file.
Fixed Width File Input: accesses data in a text file where the input is separated into fixed-width columns.
External Data Provider: enables services for applications or processes that want to pass data into dfPower Architect one record at a time; can also be used to call other Architect job flows within a job when used in conjunction with the Embedded Job node.
Table Metadata: extracts meta information from a specific table within a database.
SAS Data Set: identifies existing SAS data sets to process on the Microsoft Windows platform.
SAS SQL Query: identifies existing data sets to process as with the SAS Data Set node; this step, however, enables you to use SQL to select data.


Job Flow Steps: Data Outputs

Job flow steps in the Data Outputs category:

Data Target (Update): updates existing data rather than creating a new data source or replacing an existing source.
Data Target (Insert): outputs data in a variety of data formats to a new data source, leaving your existing data as-is or overwriting your existing data.
Delete Record: eliminates records from a data source using the unique key of those records.
HTML Report: creates an HTML-formatted report from the results of your job flow.
Text File Output: creates a plain-text file with the results of your job flow.
Fixed Width File Output: outputs your data to well-defined fixed-width columns in your output file.
Frequency Distribution Chart: creates a chart that shows how selected values are distributed throughout your data.
Match Report: generates a match report that can then be displayed with the Match Report Viewer.
dfPower Merge File Output: writes clustered data to a dfPower Merge file for use in dfPower Merge.


Job Flow Steps: Utilities

Job flow steps for Utilities:

COM Plugin: adds COM (Component Object Model) to your job flows.
Data Sorting: re-orders your data set at any point in a job flow.
Expression: runs a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio.
Data Joining: is used when you have two tables that you want to join using one or more key fields.
Data Joining (Non-Key): is used when you have two tables, each with the same number of records, and you want to join them by location in the file rather than by a unique key.
Data Union: combines two data sets in an intelligent way so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
Concatenate: performs essentially the opposite of the Parse node; rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.
Embedded Job: embeds another dfPower Architect job in your current job flow.
Sequencer (Autonumber): creates a sequence of numbers given a starting number and a specified interval.
SQL Lookup: finds rows in a database table that have one or more fields matching those in the job flow.
SQL Execute: enables you to construct and execute any valid SQL statement (or series of statements); generally used to perform some database-specific task(s) before, after, or in between Architect job flows; a stand-alone node (no parents or children).
Field Layout: enables you to rename and reorder field names as they pass out of this node.
Parameterized SQL Query: provides a way to write an SQL query that contains variable inputs, also known as parameters.


Job Flow Steps: Profiling

Job flow steps for the Profiling category:

Data Validation: analyzes the content of data by setting validation conditions.
Pattern Analysis: performs pattern analysis.
Basic Statistics: calculates basic statistics.
Frequency Distribution: creates a frequency distribution.
Basic Pattern Analysis: provides the ability to run pattern analysis in a manner very similar to how it is run in dfPower Profile. (In contrast to advanced pattern analysis, the simplified version does not employ Blue Fusion pattern identification definitions.)


Job Flow Steps: Quality

Job flow steps in the Quality category:

Gender Analysis: performs gender analysis.
Gender Analysis (Parsed): performs gender analysis on parsed information.
Identification Analysis: performs identification analysis.
Parsing: parses a field.
Standardization: performs standardization of fields of data.
Standardization (Parsed): performs standardization of fields of parsed information.
Change Case: enables the case of field values to be set.
Locale Guessing: attempts to guess the appropriate locale based on field information.
Right Fielding: identifies the contents of fields and copies the data to fields with more descriptive names.


Job Flow Steps: Integration

Job flow steps in the Integration category:

Match Code: generates match codes.
Match Codes (Parsed): generates match codes on parsed information.
Clustering: generates clusters.
Cluster Update: enables new records to be integrated with existing clusters.
Surviving Record Identification: examines clustered data and determines a surviving record for each cluster.
Cluster Diff: compares sets of clustered records.
Exclusive Real Time Clustering (ERTC): facilitates the near real-time addition of new rows to previously clustered data.
Concurrent Real Time Clustering (CRTC): is similar to the ERTC node in its outcomes; the difference is that the ERTC node interacts directly with the cluster state file, while the CRTC node interacts with a server that interacts with the cluster state file.


Job Flow Steps: Enrichment

Job flow steps in the Enrichment category:

Address Verification (US/Canada): verifies, corrects, and enhances U.S. and Canadian addresses in your existing data.
Address Verification (QAS): performs address verification on addresses from outside of the U.S. and Canada.
Address Verification (World): performs address verification on addresses from outside of the U.S. and Canada. (This step is similar to Address Verification (QAS) but supports verification and correction for addresses from more countries.)
Geocoding: matches geographic information from the geocode reference database with ZIP codes in your data to determine latitude, longitude, census tract, FIPS (Federal Information Processing Standard), and block information.
County: matches information from the phone and geocode reference databases with FIPS codes in your data to calculate several values.
Phone: matches information from the phone reference database with telephone numbers in your data.
Area Code: matches information from the phone reference database with ZIP codes in your data to calculate several values, primarily area code, but also Overlay1, Overlay2, Overlay3, and Result.


Job Flow Steps: Enrichment (Distributed)

Job flow steps in the Enrichment (Distributed) category:

Distributed Geocoding: offloads geocode processing to a machine other than the one running the current dfPower Architect job.
Distributed Address Verification: offloads address verification processing to a machine other than the one running the current dfPower Architect job.
Distributed Phone: offloads phone data processing to a machine other than the one running the current dfPower Architect job.
Distributed Area Code: offloads area code data processing to a machine other than the one running the current dfPower Architect job.
Distributed County: offloads county data processing to a machine other than the one running the current dfPower Architect job.


Job Flow Steps: Monitoring

Job flow steps in the Monitoring category:

Data Monitoring: enables you to analyze data according to business rules that you create using the Business Rule Manager. The business rules that you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an e-mail alert, when a condition is detected.


Getting Started with dfPower Architect

A typical dfPower Architect session consists of the following:

1. Plan the job flow.
2. Select the input data.
3. Build the job flow.
4. Specify the output.
5. Process the job flow.


In more detail:

1. Plan the job flow. (Identify how the data is to be processed.)
2. Select the input data. (Select input data source(s) and/or manipulate with SQL.)
3. Build the job flow. (Select and configure job flow nodes.)
4. Specify the output. (Identify the type of output and where the output is to be saved.)
5. Process the job flow. (Select to begin processing.)


Case Study Tasks

Analyze and Profile the Data
- Access and view the data.
- Create and execute profiling job(s).

Improve the Data
- Standardize data.
- Augment and validate data.
- Create match codes.

This demonstration illustrates the use of dfPower Architect to perform identification analysis, gender analysis, parsing, concatenation, and casing. In addition, other nodes are investigated (frequency distribution, frequency distribution chart, and HTML report).


These tasks are performed using dfPower Studio 7.1 from DataFlux.


Augmenting and Validating Data Using dfPower Architect

In this demonstration, first establish a data source to work with. Then run an identification analysis on a name field from this data source, with the results used to generate frequency counts of the identified types of data. After you decide that the majority of data in the name field are individual names, run a gender analysis, with the results of this also used to generate frequency counts. As a last step, use the results from the identification and gender analysis to generate a pie chart.

1. If necessary, invoke dfPower Studio by selecting Start, All Programs, DataFlux, dfPower Studio 7.1, dfPower Studio.
2. Select Base from the toolbar, and then select Architect.


Identification and Gender Analysis


1. Add a data source to the job flow.
a. Expand the Data Inputs grouping of nodes.
b. Double-click the Data Source node.


The Data Source node is added to the job flow, and the Data Source Properties window opens.

To add a node to the job flow diagram, you can do the following:
- double-click
- drag and drop
- right-click and select Insert on Page


c. Specify properties for the Data Source node.
1) Enter Contacts as the name.
2) Select next to Input table.

3) Expand the DataFlux Sample database and select the Contacts table.

4) Select to close the Select Table window.


The Data Source Properties window shows available fields from the Contacts table.

5) Select (double-arrow) to move all fields from the Available area to the Selected area.


6) Select to close the Data Source Properties window.

The job flow diagram is updated to a display that resembles what is shown below:

2. With the data source node selected, select the Preview tab from the Details area (at the bottom of dfPower Architect interface). The data from this node is displayed:


3. Perform an Identification Analysis using the Contact field.
a. Expand the Quality grouping of nodes.
b. Double-click the Identification Analysis node.


The Identification Analysis Properties window opens.

c. Move the CONTACT field from the Available area to the Selected area by double-clicking.
d. Double-click on the Definition column for the selected CONTACT field.
e. From the menu, select Individual/Organization.
f. Scroll in the Selected area to reveal that the results of the identification analysis are placed in the field CONTACT_Identity.
g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Identification Analysis Properties window.

4. Preview the results of the Identification Analysis.
a. Verify that the Identification Analysis node is selected.
b. Select the Preview tab at the bottom of the dfPower Architect interface.
c. Scroll to the right to view the information populated for CONTACT_Identity:

Although this preview is a good indication of the overall data values, it would be desirable to verify that there are no odd data values.


5. Add a Frequency Distribution task to the job flow.
a. Expand the Profiling grouping of nodes.
b. Double-click the Frequency Distribution node.

The Frequency Distribution Properties window opens.


c. Move CONTACT_Identity from the Available area to the Selected area.

d. Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.

If you are satisfied that the majority (99%) of the observations represents individuals, you can proceed with a gender analysis.


6. Perform a gender analysis using the Contact field.
a. Verify that the Frequency Distribution 1 node is selected in the job flow diagram.
b. Expand the Quality grouping of nodes.
c. Right-click on the Gender Analysis node and select Insert Before Selected.


The Gender Analysis Properties window opens.

d. Move the CONTACT field from the Available area to the Selected area by double-clicking.
e. Double-click on the Definition column for the selected CONTACT field.
f. Select Gender.

g. Scroll in the Selected area to reveal that the results of the gender analysis are placed in the field CONTACT_Gender.
h. Select below the Available area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Gender Analysis Properties window.


7. Update the properties of the Frequency Distribution to include the CONTACT_Gender field.
a. Right-click Frequency Distribution 1 in the job flow and select Properties.
b. Move the CONTACT_Gender field from the Available area to the Selected area.
c. Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.

A more visual approach for viewing the results uses a graphic representation of the information.


8. Add a Frequency Distribution Chart task to the job flow.
a. Expand the Data Outputs grouping of nodes.
b. Double-click the Frequency Distribution Chart node.


The Frequency Distribution Chart Properties window opens.

c. Select next to Chart name to choose a location for the output.

1) Navigate to S:\Workshop\winsas\didq.
2) Enter Contacts Gender Identity Chart as the value for File name.
3) Select to close the Save As window.


d. Enter Gender & Identity Distribution from Contacts as the title for the chart.
e. Move both CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.

f. Select to close the Frequency Distribution Chart Properties window. The Preview tab is populated with the frequency report.


9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator:

b. Select to close the Run Job window.


The Chart Viewer window opens.


c. Select to scroll to the next chart for CONTACT_Gender.

d. Select File Exit to close the Chart Viewer window.


10. Save the job.
a. From the dfPower Architect menu, select File Save As.
b. Enter DIDQ Contact Gender/Identity Analysis as the name.
c. Enter Gender & Identity Analysis for Contacts table as the description.

d. Select to close the Save As window.


Parsing, Concatenation, and Casing


Name fields are often populated in a variety of ways: sometimes as FIRST MIDDLE LAST, and other times as LAST, FIRST. Parsing enables you to break a name field into portions. Concatenation can rejoin the name field in a consistent fashion. After the field values are available in a consistent pattern, it is useful to put the data in the correct case.

1. Start a new job by selecting File New.
2. Add a data source to the job flow:
a. Expand the Data Inputs grouping of nodes.
b. Double-click the Data Source node. The Data Source Properties window opens.
c. Specify properties for the Data Source node.
1) Enter Contacts as the name.
2) Select next to Input table.

3) Expand the DataFlux Sample database and then select the Contacts table.
4) Select to close the Select Table window.

The Data Source Properties window shows available fields from the Contacts table.
5) Select (double-arrow) to move all fields from the Available area to the Selected area.
6) Select to close the Data Source Properties window.


3. Parse the Contact field.
a. Expand the Quality grouping of nodes.
b. Double-click the Parsing node.


The Parse Properties window opens.


c. Select CONTACT as the field to parse.
d. Select Name as the definition.
e. Select to move all tokens from the Available area to the Selected area.
f. Select below the Available area.
g. Select (double-arrow) to move all fields from the Available area to the Selected area.
h. Select to close the Additional Outputs window.
i. Select to close the Parse Properties window.


j. Select the Preview tab to view the results of the parse.


4. Concatenate the parsed fields.
a. Expand the Utilities grouping of nodes.
b. Double-click the Concatenate node.


The Concatenation Properties window opens.


c. Specify LastFirst as the output field.
d. Enter , (a comma and a space) as the value for Literal text.
e. Select Family Name, and then select to move it to the Concatenation list area.
f. Select next to Literal text to move the text to the Concatenation list area after Family Name.
g. Select Given Name, and then select to move it to the Concatenation list area.
h. Select below the Available fields area.
i. Select (double-arrow) to move all fields from the Available area to the Selected area.
j. Select to close the Additional Outputs window.
k. Select to close the Concatenation Properties window.


The Preview tab is populated. Scroll to find the new LastFirst column.

A more complete picture of the concatenation might be gained by viewing an HTML report.


5. Add an HTML Report task to the job flow.
a. Expand the Data Outputs grouping of nodes.
b. Double-click the HTML Report node.


The HTML Report Properties window opens.


c. Enter Concatenation Results as the value for Report title.
d. Enter NewName as the value for Report name.
e. Select the check box for Display report in browser after job runs.
f. Deselect all columns from Selected. (Select .)

g. Move CONTACT, Given Name, Family Name, and LastFirst from the Available area to the Selected area.

h. Select to close the HTML Report Properties window.


6. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
b. Select to close the Run Job window.

The appropriate browser opens and displays the HTML report.

c. Select File Close to close the browser when you are finished viewing the report.


7. Change the case of the LastFirst field.
a. Select the HTML Report 1 node in the job flow.
b. Expand the Quality grouping of nodes.
c. Right-click Change Case and select Insert Before Selected. The Case Properties window opens.


d. Move LastFirst from the Available area to the Selected area.
e. Select Proper as the type of casing to use.
f. Select Proper (Name) as the definition to use.

g. Select below the Available area.
h. Select (double-arrow) to move all fields from the Available area to the Selected area.
i. Select to close the Additional Outputs window.
j. Select to close the Case Properties window.


k. Select the Preview tab to view the results of the case change.

8. Update the HTML Report 1 node.
a. Double-click on the HTML Report 1 node in the job flow to open the HTML Report Properties window.
b. Verify that the check box for Display report in browser after job runs is selected.
c. Deselect all columns from the Selected area. (Select .)

d. Move CONTACT, Given Name, Family Name, LastFirst, and LastFirst_Cased from the Available area to the Selected area.
e. Select to close the HTML Report Properties window.


9. Run the entire job.
a. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
b. Select to close the Run Job window.

The appropriate browser opens and displays the HTML report.

c. Select File Close to close the browser when you are finished viewing the report.


10. Save the job.
a. From the dfPower Architect menu, select File Save As.
b. Enter DIDQ Contact Parse/Concatenation Job as the name.
c. Enter Parse then concatenation of Contact field as the description.

d. Select to close the Save As window.

11. Select File Exit to close dfPower Architect.
12. Select Studio Exit to close dfPower Studio.


12.3 Data Quality/Cleansing Using SAS

Objectives

Describe some SAS Data Quality Server functions.
List some basic examples using these functions.


SAS Data Quality Server Functions

The SAS Data Quality Server provides a set of functions that can be used to ensure quality data. Of these, several can be used to enhance the data:
- DQIDENTIFY
- DQGENDER
- DQPARSE
- DQPARSEINFOGET
- DQPARSETOKENGET
- DQCASE


63

%DQPUTLOC Macro

Each of these functions requires the specification of a definition as part of the syntax. The %DQPUTLOC AUTOCALL macro provides a quick means of displaying current information in the SAS log for the specified locale that is loaded into memory at that time. The available locale information includes a list of all definitions, parse tokens, related functions, and the names of the parse definitions that are related to each match definition.

%DQPUTLOC(locale, <SHORT=0|1>, <PARSEDEFN=0|1>);


%DQPUTLOC(locale, <SHORT=0|1>, <PARSEDEFN=0|1>);

where

locale: specifies the locale of interest.

SHORT=0|1: optionally shortens the length of the entry in the SAS log. SHORT=1 removes the descriptions of how the definitions are used. The default value is SHORT=0, which displays the descriptions of how the definitions are used.

PARSEDEFN=0|1: optionally lists the related parse definition, if such a parse definition exists, with each gender analysis definition and each match definition. The default value PARSEDEFN=1 lists the related parse definition. PARSEDEFN=0 does not list the related parse definition.


%DQPUTLOC Macro Example

If the ENUSA locale is loaded, the %DQPUTLOC macro returns information for the ENUSA definitions, such as the following:

/*---------------------------------------------------------*/
/* GENDER DEFINITION(S)                                     */
/*                                                          */
/* Gender definitions are used by the following:            */
/*    dqGender function                                     */
/*    dqGenderParsed function                                */
/*---------------------------------------------------------*/
Gender

/*---------------------------------------------------------*/
/* IDENTIFICATION DEFINITION(S)                             */
/*                                                          */
/* Identification definitions are used by the following:    */
/*    dqIdentify function                                   */
/*---------------------------------------------------------*/
Contact Info
Individual/Organization
Offensive


Identification Analysis in SAS

The DQIDENTIFY function returns a value that indicates the category of the content in an input character value. The available categories and return values depend on your choice of identification definition and locale.

DQIDENTIFY(char, 'identification-definition'<, 'locale'>)


DQIDENTIFY(char, 'identification-definition' <, 'locale'>)

where

char
   is the value that is transformed, according to the specified identification definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.

identification-definition
   specifies the name of the identification definition, which must exist in the specified locale.

locale
   optionally specifies the name of the locale that contains the specified identification definition. The value can be a name in quotation marks, the name of a variable whose value is a locale name, or an expression that evaluates to a variable name or to a quoted locale name. The specified locale must be loaded into memory as part of the locale list. If no value is specified, the default locale is used. The default locale is the first locale in the locale list.


Example of DQIDENTIFY Function


The following example determines whether a character value represents an individual or an organization:

data _null_;
   id=dqidentify('LL Bean', 'Individual/Organization', 'ENUSA');
   put id=;
run;

The value returned for ID in the SAS log would be ORGANIZATION.



Gender Analysis in SAS


The DQGENDER function evaluates the name of an individual to determine the gender of that individual. If the evaluation finds substantial clues that indicate gender, the function returns a value that indicates that the gender is female or male. If the evaluation is inconclusive, the function returns a value that indicates that the gender is unknown. The exact return value is determined by the specified gender analysis definition and locale.

DQGENDER(char, 'gender-analysis-definition' <, 'locale'>)


DQGENDER(char, 'gender-analysis-definition' <, 'locale'>)

where

char
   is the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.

gender-analysis-definition
   specifies the name of the gender analysis definition, which must exist in the specified locale.

locale
   optionally specifies the name of the locale that contains the specified gender analysis definition.


Example of DQGENDER Function


The following example determines the gender associated with an individual's name:

data _null_;
   Gender=DQGENDER('Mr. Malcolm A. Lackey', 'gender', 'ENUSA');
   put Gender=;
run;

The value returned for Gender in the SAS log would be M.
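The %DQPUTLOC output shown earlier also lists a dqGenderParsed function. As a hedged sketch, and assuming the NAME parse definition and the Gender gender analysis definition both exist in the ENUSA locale, DQGENDERPARSED performs gender analysis on a value that has already been parsed:

```sas
data _null_;
   /* Parse the name first, then analyze the parsed value. */
   parsed=dqparse('Mr. Malcolm A. Lackey', 'NAME', 'ENUSA');
   Gender=dqgenderparsed(parsed, 'Gender', 'ENUSA');
   put Gender=;
run;
```

This form can avoid redundant work when a value has already been parsed for other purposes, because the gender analysis does not need to parse it again.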



Parsing in SAS

The DQPARSE function returns a parsed character value. The return value contains delimiters that identify the elements in the value that correspond to the tokens that are enabled by the parse definition.

DQPARSE(char, 'parse-definition' <, 'locale'>)


DQPARSE(char, 'parse-definition' <, 'locale'>)

where

char
   is the value that is parsed according to the parse definition. The value can be the name of a character variable, a character value in quotation marks, or an expression that evaluates to a variable name or a quoted value.

parse-definition
   specifies the name of the parse definition, which must exist in the specified locale.

locale
   optionally specifies the name of the locale that contains the specified parse definition.


Parsing in SAS

The DQPARSEINFOGET function returns the token names in a parse definition.

DQPARSEINFOGET('parse-definition' <, 'locale'>)

The DQPARSETOKENGET function returns a token from a parsed character value.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition' <, 'locale'>)


DQPARSEINFOGET('parse-definition' <, 'locale'>)

where

parse-definition
   specifies the name of the parse definition, which must exist in the specified locale.

locale
   optionally specifies the name of the locale that contains the specified parse definition.

DQPARSETOKENGET(parsed-char, 'token', 'parse-definition' <, 'locale'>)

where

parsed-char
   is the parsed character value from which the value of the specified token is returned.

token
   specifies the name of the token that is returned from the parsed value.

parse-definition
   specifies the name of the parse definition, which must exist in the specified locale.

locale
   optionally specifies the name of the locale that contains the specified parse definition.
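No example of DQPARSEINFOGET appears in this chapter, so the following sketch (assuming the ENUSA locale is loaded and contains the NAME parse definition) shows how the token names of a parse definition can be listed:

```sas
data _null_;
   /* Return the token names of the NAME parse definition */
   /* as a single delimited character string.             */
   tokens=dqparseinfoget('NAME', 'ENUSA');
   put tokens=;
run;
```

The SAS log would show a delimited list of token names such as Name Prefix, Given Name, and Family Name, which are the same token names that are passed to DQPARSETOKENGET.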


Example of Parsing Functions


The following example parses a name into tokens and then retrieves two of the tokens:

data _null_;
   parsedValue=DQPARSE('Mrs. Sallie Mae Pravlik', 'NAME', 'ENUSA');
   prefix=DQPARSETOKENGET(parsedValue, 'Name Prefix', 'NAME', 'ENUSA');
   given=DQPARSETOKENGET(parsedValue, 'Given Name', 'NAME', 'ENUSA');
   put parsedValue= prefix= given=;
run;

The returned values in the SAS log would be as follows:

parsedValue=Mrs./=/Sallie/=/Mae/=/Pravlik/=//=/ prefix=Mrs. given=Sallie


Changing Case in SAS


The DQCASE function returns a character value with standardized capitalization. The DQCASE function operates on any character content, such as names, organizations, and addresses. All instances of adjacent blank spaces are replaced with single blank spaces.

DQCASE(char, 'case-definition' <, 'locale'>)


DQCASE(char, 'case-definition' <, 'locale'>)

where

char
   is the value that is transformed, according to the specified case definition.

case-definition
   specifies the name of the case definition that will be referenced during the transformation.

locale
   optionally specifies the name of the locale that contains the specified case definition.

Example of DQCASE Function


The following example applies proper casing to a character value:

data _null_;
   orgname=DQCASE("BILL's PLUMBING & HEATING", 'Proper', 'ENUSA');
   put orgname=;
run;

The value returned for orgname in the SAS log would be Bill's Plumbing & Heating.


Case Study Tasks
Analyze and Profile the Data

Access and view the data. Create and execute profiling job(s).

Improve the Data


Standardize data. Augment and validate data. Create match codes.

This demonstration illustrates the use of the SAS Data Quality Server functions to perform identification analysis, gender analysis, parsing, concatenation, and casing.



Augmenting and Validating Data Using SAS

In this demonstration, you investigate four separate SAS programs. These programs demonstrate the use and results of the DQIDENTIFY, DQGENDER, DQPARSE, and DQCASE functions. To examine the results from the programs, short FREQ or PRINT procedure steps will be added.

1. Start a SAS session by selecting Start → All Programs → SAS BIArchitecture → Start SAS.

2. If the Getting Started with SAS window opens, do the following:

   a. Select Don't show this dialog box again.

b. Select

The SAS Display Manager session opens.


Using the DQIDENTIFY Function


1. Verify that the Enhanced Editor window is active.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQIdentityFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   length Identity $1;
   label Identity='Customer Identity Type';
   Identity = dqidentify(contact, 'Individual/Organization');
run;

In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The DATA step uses the DQIDENTIFY function to identify whether the value for the CONTACT field is an individual, an organization, or not known.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window.


7. To view the resulting data set, do the following:
   a. Select SAS Explorer.
   b. Double-click the Libraries icon.
   c. Double-click the Work library icon.
   d. Double-click the Std_prospects table to open it in a VIEWTABLE window.
   e. Scroll to view the Customer Identity Type column.
   f. Select File → Close to close the VIEWTABLE window.

8. Select Window → DQIdentityFunctions.sas.


9. Run a frequency report on the new identity column.

   a. At the bottom of the program, after the RUN statement for the DATA step, uncomment the PROC FREQ step (that is, remove the /* before the step and the */ after the step). The PROC FREQ step is as shown:

   proc freq;
      tables identity/nocum;
   run;

   b. Highlight only these three new lines and then select Run → Submit. The following report surfaces:


Using the DQGENDER Function


1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQGenderFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set Prospects;
   /* use the GENDER function to determine gender based on name */
   length custgender $1;
   label custgender='Customer Gender';
   custgender = dqgender(contact, 'gender');
run;

PROC FREQ;
   tables custgender/nocum;
RUN;

In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The DATA step uses the DQGENDER function to identify whether the value for the CONTACT field is M(ale), F(emale), or U(nknown).
The PROC FREQ step generates a report of frequency counts on the custgender column.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. A portion of the DATA step and PROC FREQ step is shown below:

7. Select View → Output to activate the Output window. The report shows the following:


Using the DQPARSE Function


1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQParseFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\biarchitecture\Lev1\sasmain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "newcustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\dqdata\newcustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   parsedname=dqparse(contact, 'NAME');
   prefix=dqparsetokenget(parsedname, 'Name Prefix', 'NAME');
   first_name=dqparsetokenget(parsedname, 'Given Name', 'NAME');
   last_name=dqparsetokenget(parsedname, 'Family Name', 'NAME');
run;

proc print;
   var prefix first_name last_name;
run;

In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the CONTACT field.
The PROC PRINT step produces a listing report of the results of the DQPARSETOKENGET function usage.

5. Select Run → Submit to execute the SAS program.


6. Select View → Log to activate the Log window. The portion for the DATA step and PROC PRINT step is shown below:

7. Select View → Output to activate the Output window. The partial output is as follows:


Using the DQCASE Function


1. Verify that the Enhanced Editor window is active. If not, select View → Enhanced Editor.
2. Select File → Open Program.
3. Navigate to S:\Workshop\winsas\didq\SASPgms and select DQPropercaseFunctions.sas.
4. Select . The following program opens in the Enhanced Editor:

/* xxxx COMMENTS xxxx */
%DQLOAD(DQLOCALE=(ENUSA),
        DQSETUPLOC='C:\SAS\BIarchitecture\Lev1\SASMain\dqsetup.txt',
        DQINFO=1);

PROC IMPORT OUT= WORK.Prospects
            DATATABLE= "NewCustomers"
            DBMS=ACCESS REPLACE;
   DATABASE="S:\Workshop\winsas\didq\DQData\NewCustomers.mdb";
   SCANMEMO=YES;
   USEDATE=NO;
   SCANTIME=YES;
RUN;

data std_prospects;
   set prospects;
   ParsedName=dqParse(contact, 'NAME');
   Prefix=dqParseTokenGet(parsedName, 'Name Prefix', 'NAME');
   First_name=dqParseTokenGet(parsedName, 'Given Name', 'NAME');
   Last_name=dqParseTokenGet(parsedName, 'Family Name', 'NAME');
run;

data std_prospects;
   set std_prospects;
   length Contact2 $50;
   label Contact2='Re-formatted Prospect Name';
   Contact2 = trim(Last_Name) || ', ' || First_Name;
   length Contact3 $50;
   label Contact3='Proper Cased Re-formatted Prospect Name';
   Contact3 = dqcase(contact2,'PROPER');
run;

proc print;
   var Contact Contact2 Contact3;
run;


In this program, the following occurs:
The %DQLOAD macro loads the ENUSA locale into memory.
The PROC IMPORT step accesses the NewCustomers table in the NewCustomers Microsoft Access database.
The first DATA step uses the DQPARSE and DQPARSETOKENGET functions to parse the value for the CONTACT field.
The second DATA step uses the concatenation operator (||) to rebuild a Name field (Contact2). The DQCASE function is then applied to resolve the Contact2 field to proper casing.
The PROC PRINT step produces a listing report of some parsed and concatenated information.
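As a side note, the TRIM/|| concatenation pattern used in the second DATA step can also be written with the CATX function, which strips leading and trailing blanks from each argument and inserts the delimiter between them. This is an equivalent sketch, not part of the original program:

```sas
/* Equivalent to trim(Last_Name) || ', ' || First_Name,  */
/* but CATX also handles blank stripping automatically.  */
Contact2 = catx(', ', Last_Name, First_Name);
```

Either form produces a value of the form family-name, given-name that DQCASE can then normalize.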

5. Select Run → Submit to execute the SAS program.
6. Select View → Log to activate the Log window. The portion for the DATA steps and PROC PRINT step is shown below:


7. Select View → Output to activate the Output window. Partial output is shown below:

8. Close the SAS session and do not save any changes.


12.4 Exercises
1. Analyzing the NewCustomers Table

   Use the NewCustomers table from the New Customers database to do the following:
   Verify the type of information found for each record. (Identify records as individual or organization.)
   Calculate gender information for each record.
   Create a frequency report and a frequency report chart on both the identity and gender information.
   Parse the Contact field.
   Add a field that contains a name string of the form Name_Prefix Given_Name Family_Name.
   Save the job as DIDQ Ch5Ex1 NewCustomers Analysis.


12.5 Solutions to Exercises


1. Analyzing the NewCustomers Table
   a. If necessary, invoke dfPower Studio by selecting Start → All Programs → DataFlux → dfPower Studio 7.1 → dfPower Studio.
   b. Select Base from the toolbar, and then select Architect.
   c. Expand the Data Inputs grouping of nodes.
   d. Double-click the Data Source node.
      1) Enter New Customers as the name.
      2) Select next to Input table.
      3) Expand the New Customers database and select the NewCustomers table.
      4) Select to close the Select Table window.
      5) Select (double-arrow) to move all fields from the Available area to the Selected area.
      6) Select to close the Data Source Properties window.

   e. With the data source node selected, select the Preview tab from the Details area (at the bottom of the dfPower Architect interface). The data from this node is displayed.
   f. Expand the Quality grouping of nodes.
   g. Double-click the Identification Analysis node. The Identification Analysis Properties window opens.
      1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
      2) Double-click on the Definition column for the selected CONTACT field.
      3) From the menu, select Individual/Organization.
      4) Scroll in the Selected area to reveal that the results of the identification analysis will be placed in the field CONTACT_Identity.
      5) Select below the Available area.
      6) Select (double-arrow) to move all fields from the Available area to the Selected area.
      7) Select to close the Additional Outputs window.
      8) Select to close the Identification Analysis Properties window.


   h. Preview the results of the Identification Analysis.
      1) Verify that Identification Analysis is selected.
      2) Select the Preview tab at the bottom of the dfPower Architect interface.
      3) Scroll to the right to view the information populated for the CONTACT_Identity field.
   i. Expand the Quality grouping of nodes.
   j. Double-click on the Gender Analysis node. The Gender Analysis Properties window opens.
      1) Move the CONTACT field from the Available area to the Selected area by double-clicking.
      2) Double-click on the Definition column for the selected CONTACT field.
      3) Select Gender.
      4) Scroll in the Selected area to reveal that the results of the gender analysis will be placed in the field CONTACT_Gender.
      5) Select below the Available area.
      6) Select (double-arrow) to move all fields from the Available area to the Selected area.
      7) Select to close the Additional Outputs window.
      8) Select to close the Gender Analysis Properties window.

   k. Expand the Profiling grouping of nodes.
   l. Double-click the Frequency Distribution node.
      1) The Frequency Distribution Properties window opens.
      2) Move CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.
      3) Select to close the Frequency Distribution Properties window. The Preview tab is populated with the frequency report.
   m. Expand the Data Outputs grouping of nodes.
   n. Double-click the Frequency Distribution Chart node.
      1) Select next to Chart name to choose a location for the output.

2) Navigate to S:\Workshop\winsas\didq. 3) Enter New Customers Chart as the value for File name. 4) Select to close the Save As window.

5) Enter Gender & Identity Distribution from New Customers as the title for the chart.


      6) Move both CONTACT_Identity and CONTACT_Gender from the Available area to the Selected area.
      7) Select to close the Frequency Distribution Chart Properties window. The Preview tab is populated with the frequency report.
   o. Select from the toolbar. The job processes, and the Run Job window opens with a status indicator.
      1) Select to close the Run Job window. The Chart Viewer window opens.
      2) Select to scroll to the next chart for CONTACT_Gender.

      3) Select File → Exit to close the Chart Viewer window.
   p. Save the job.
      1) From the dfPower Architect menu, select File → Save As.
      2) Enter DIDQ Ch5Ex1 NewCustomers Analysis as the name.
      3) Enter New Customer Analysis as the description.
      4) Select to close the Save As window.

   q. Select the Frequency Distribution 1 node in the job flow.
   r. Expand the Quality grouping of nodes.
   s. Right-click the Parsing node and select Insert Before Selected.
      1) Select CONTACT as the field to parse.
      2) Select Name as the definition.
      3) Select to move all tokens from the Available area to the Selected area.
      4) Select below the Available area.
      5) Select to move all fields from the Available area to the Selected area.
      6) Select to close the Additional Outputs window.
      7) Select to close the Parse Properties window.

   t. Expand the Utilities grouping of nodes.
   u. Double-click the Concatenate node. The Concatenation Properties window opens.
      1) Specify PreFirstLast as the output field.
      2) Enter (a space) as the value for Literal text.
      3) Select Name Prefix, and then select to move it to the Concatenation list area.


      4) Select next to Literal text to move the text to the Concatenation list area after Name Prefix.
      5) Select Given Name, and then select to move it to the Concatenation list area.
      6) Select next to Literal text to move the text to the Concatenation list area after Given Name.
      7) Select Family Name, and then select to move it to the Concatenation list area.

      8) Select below the Available fields area.
      9) Select to move all fields from the Available area to the Selected area.
      10) Select to close the Additional Outputs window.
      11) Select to close the Concatenation Properties window.

   v. View the result of this node on the Preview tab.
   w. From the dfPower Architect menu, select File → Save.
   x. Select File → Exit to close dfPower Architect.
   y. Select Studio → Exit to close dfPower Studio.
