
IBM InfoSphere Warehouse

Data Mining with Easy Mining procedures


Version 9.5.1

SH12-6837-02

Note: Before using this information and the product it supports, read the information in Notices on page 157.

This edition applies to Version 9.5.1 of the IBM InfoSphere Warehouse products and to all subsequent releases and modifications until otherwise indicated in new editions.

Copyright International Business Machines Corporation 2001, 2008. All rights reserved.
US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Overview of the Easy Mining procedures
  Easy Mining procedures for typical mining tasks
  Easy Mining procedures for basic mining steps
  Easy Mining procedures for preprocessing and for utilities

Chapter 2. Data mining with the Easy Mining procedures
  Quick start sample
    Scenario
    Finding deviations
  Using Easy Mining procedures for typical mining tasks
    Finding deviations (FindDeviations procedure)
    Finding groups with similar characteristics (ClusterTable procedure)
    Finding relationships (FindRules procedure)
    Finding sequential relationships in your data (FindSeqRules procedure)
    Prediction of future behavior (PredictColumn procedure)
    Prediction of an outcome (PredictColValue procedure)
    Finding explanations for specific events (ExplainColValue procedure)
    Most important fields (FindMostImpFields procedure)
  Using Easy Mining procedures for basic mining steps
    Easy Mining procedures for classification mining steps
    Easy Mining procedures for regression mining steps
    Easy Mining procedures for clustering mining steps
    Easy Mining procedures for associations mining steps
    Easy Mining procedures for sequences mining steps
    Exporting models and test results
  Using Easy Mining procedures for preprocessing and for utilities
    Preprocessing procedures
    Utility procedures
  Putting it all together
    Scenario
    Identifying characteristics
    Building a prediction model
  Data mining at a glance
    Data mining goals
    The data mining process
    Data mining functions

Chapter 3. Easy Mining reference
  Syntax diagrams and parameters
    Using the Easy Mining procedures
    Optional parameter strings
    Easy Mining procedures for typical mining tasks
    Easy Mining procedures for basic mining steps
    Easy Mining procedures for preprocessing and utilities
  Easy Mining conventions and mining field types
    Conventions
    Mining field types

Notices
  Trademarks

Contacting IBM
  Product Information
  Accessible documentation
  Comments on the documentation

Index

Copyright IBM Corp. 2001, 2008


IBM InfoSphere Warehouse: Data Mining with Easy Mining procedures

Chapter 1. Overview of the Easy Mining procedures


There are Easy Mining procedures for typical mining tasks and for basic mining steps. There are also Easy Mining procedures for preprocessing and for utilities.

Easy Mining procedures for typical mining tasks


The Easy Mining procedures for typical mining tasks correspond to the main steps of the data mining process. They are easy to use because the mining functions make any parameter modifications that might be required, so you do not need in-depth knowledge of data mining. With these procedures, you can solve most of the typical business problems in various application areas, for example, banking or manufacturing.

The following table provides an overview of the available Easy Mining procedures for typical mining tasks.
Table 1. Overview of the Easy Mining procedures for typical mining tasks

Mining task                                  Easy Mining procedure
-------------------------------------------  ---------------------
Finding deviations                           FindDeviations
Finding groups with similar characteristics  ClusterTable
Finding relationships                        FindRules
Finding sequential rules                     FindSeqRules
Predicting future behavior                   PredictColumn
Predicting an outcome                        PredictColValue
Finding explanations for specific events     ExplainColValue
Finding most important fields                FindMostImpFields

Easy Mining procedures for basic mining steps


The Easy Mining procedures for basic mining steps correspond to the SQL API of IM Modeling and IM Scoring. They are easy to use because their syntax is simple and because they concentrate on the more frequently used concepts of SQL/MM. They might even provide better results than the Easy Mining procedures for typical mining tasks because you can modify the parameters yourself. However, modifying the parameters yourself requires knowledge of the data mining process. For example, you must know how to modify the maximum number of clusters to get better clustering results.

With the Easy Mining procedures for basic mining steps, you can create, test, and modify data models. You can later apply these data models to new data to help you make successful business decisions.

The following table provides an overview of the available Easy Mining procedures for basic mining steps.

Table 2. Overview of the Easy Mining procedures for basic mining steps

Each column lists the procedures for one mining function. A hyphen means that no procedure is listed for that task.

Tasks                  Classification        Regression           Clustering       Associations     Sequences
---------------------  --------------------  -------------------  ---------------  ---------------  ------------------
Building models        BuildClasModel        BuildRegModel        BuildClusModel   BuildRuleModel   BuildSeqRuleModel
Testing models         TestClasModel         TestRegModel         -                -                -
Applying models        ApplyClasModel        ApplyRegModel        ApplyClusModel   ApplyRuleModel   ApplySeqRuleModel
Exporting models       ExportClasModel       ExportRegModel       ExportClusModel  ExportRuleModel  ExportSeqRuleModel
Exporting test result  ExportClasTestResult  ExportRegTestResult  -                -                -
Building mining views  BuildClasView         BuildRegView         BuildClusView    BuildRuleView    BuildSeqRuleView

Easy Mining procedures for preprocessing and for utilities


Easy Mining procedures for preprocessing and for utilities help you to prepare your data for an Easy Mining procedure, or to accomplish administrative tasks, for example, handling error messages, canceling Easy Mining procedures, or working with the trace file. The following Easy Mining procedures are available:
- SplitData
- GetLastError
- SetTraceFile
- GetTraceFile
- GetCancelTask
- GetCleanUpTask

Chapter 2. Data mining with the Easy Mining procedures


With the Easy Mining procedures, you can perform data mining efficiently and successfully in a business context without needing in-depth data mining skills. You only need to be familiar with the business area in which you want to apply data mining.

Quick start sample


The quick start sample describes a banking scenario. It explains the everyday situation of a bank and the goals that the bank might derive from this situation. It also describes the advantage of finding deviations and how you can do this by using the Easy Mining procedure FindDeviations.

Scenario
A bank has a lot of information about its customers. For example, it knows the gender, the age, and the profession of its customers. Besides this demographic information, the bank also collects data about the status of its customers. For example, the bank might know how long a person has been a customer of the bank. Of course, a bank has much more information about its customers. It might also know, for example, the banking products its customers hold, the average balance of the accounts, or the number of transactions. To keep it simple, only a small amount of information is used in this example.

The information that the bank collected is stored in a database. You can create tables or views from this information. One row in a table or view contains the complete information about a particular customer. The table or view might look like the table that is shown in Figure 1.

Figure 1. The input table BANK.CUSTOMERS_MASTERDATA

The table that is shown in Figure 1 on page 3 contains one record for each customer. Each customer is identified by a value in the column CLIENT_ID, for example, 00861101. The other columns in the table contain values for age, gender, marital status, profession, and the number of years as a client of the bank.

Creating the database Demobank


Create the database DEMOBANK by following these steps:
1. Create the database DEMOBANK by entering the following command:
   db2 "create database DEMOBANK"
2. Enable the database for Intelligent Miner Modeling and Intelligent Miner Scoring with the recommended configuration parameters for this database by entering the following command:
   idmenabledb DEMOBANK dbcfg
   If the executable file idmenabledb is not in the search path, you can find it in the bin directory of your Intelligent Miner Modeling or Intelligent Miner Scoring installation path.
3. Create the sample tables for the Easy Mining procedures, including the BANK.CUSTOMERS_MASTERDATA table, by going to the EasyMining subdirectory in the samples directory of your Intelligent Miner Modeling or Intelligent Miner Scoring installation path and entering the following command:
   db2 -tvf SampleTables.db2

Business goals of banks


The bank might want to find the customers who are in some respect different from the other customers. These customers might have been neglected in the past because they fall outside the standard target groups. However, they might be as profitable as other groups of customers.

Another goal of the bank might be to find implausible combinations of values that should be corrected in this table. For example, the value employee in the PROFESSION column combined with the value 80 in the AGE column indicates that the record needs to be updated because someone at this age is no longer actively working as an employee. This means that the bank is looking for deviations.

Finding deviations
You can find deviations by using the Easy Mining procedure FindDeviations. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.FindDeviations(BANK.DEV_CUSTOMERS, BANK.CUSTOMERS_MASTERDATA)"

Where:
BANK.DEV_CUSTOMERS
   is the name of the view that you want to create
BANK.CUSTOMERS_MASTERDATA
   is the name of the input table within which you want to find deviations

The output view BANK.DEV_CUSTOMERS


The result of this call is the output view BANK.DEV_CUSTOMERS that is shown in Figure 2.

Figure 2. The output view BANK.DEV_CUSTOMERS

The BANK.DEV_CUSTOMERS view contains the columns of the input table BANK.CUSTOMERS_MASTERDATA and the following additional columns:
CLUSTER_ID
   This column contains the identification of the generated clusters.
DEV_DEGREE
   This column contains the measure of how much this record deviates from the average of all records in the BANK.DEV_CUSTOMERS view. The higher this value is, the greater the deviation.

The records in the BANK.DEV_CUSTOMERS view that is shown in Figure 2 are sorted in descending order by their deviation degree value. The first two rows in this table have a deviation degree of 7506. This means that these rows represent the most remarkable deviations. These rows include the following values:
- The values worker and intermediate professions in the PROFESSION column
- The value F in the GENDER column
- The values 66 and 64 in the AGE column
Typically, women at the age of 64 or 66 are already retired. Therefore this information might not be up to date.

The records with the deviation degree of 3753 represent a group of 4 women at the age of approximately 30 years who are separated or divorced. They are new customers. They are inactive or they have an intermediate profession. Therefore they might not represent an interesting target group for sales of securities. The same is likely to be true for the next group with a deviation degree of 3002.4. This group represents old retired men.
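The ranking by deviation degree described above can be reproduced on any copy of the view's rows. The following Python sketch is only an illustration and is not part of the Easy Mining procedures: the rows are hypothetical stand-ins for records fetched from BANK.DEV_CUSTOMERS, the CLIENT_ID and CLUSTER_ID values are invented, and the helper name top_deviations is an assumption. Only the DEV_DEGREE values are taken from the text.

```python
# Hypothetical rows standing in for the output view BANK.DEV_CUSTOMERS.
# CLIENT_ID and CLUSTER_ID values are invented for illustration; the
# DEV_DEGREE values are the ones quoted in the text.
rows = [
    {"CLIENT_ID": "c1", "CLUSTER_ID": 5, "DEV_DEGREE": 3002.4},
    {"CLIENT_ID": "c2", "CLUSTER_ID": 9, "DEV_DEGREE": 7506.0},
    {"CLIENT_ID": "c3", "CLUSTER_ID": 7, "DEV_DEGREE": 3753.0},
    {"CLIENT_ID": "c4", "CLUSTER_ID": 8, "DEV_DEGREE": 7506.0},
]


def top_deviations(records, limit):
    """Return the `limit` records with the highest DEV_DEGREE."""
    # sorted() is stable, so records with equal degrees keep their order.
    ranked = sorted(records, key=lambda r: r["DEV_DEGREE"], reverse=True)
    return ranked[:limit]


for record in top_deviations(rows, 2):
    print(record["CLIENT_ID"], record["CLUSTER_ID"], record["DEV_DEGREE"])
```

In practice, the same ordering would normally be done in the database when the view is queried. The sketch only makes the interpretation rule explicit: the records to inspect first are those with the highest deviation degree.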

For more information about the FindDeviations procedure, see Finding deviations (FindDeviations procedure) on page 7.

How to continue
The section "Using Easy Mining procedures for typical mining tasks" describes the available Easy Mining procedures for typical mining tasks. It starts with the complete description of the FindDeviations procedure, which is introduced in this chapter as the quick start example. If the information about the FindDeviations procedure in this chapter is sufficient for you, you can skip Finding deviations (FindDeviations procedure) on page 7 and continue with Finding groups with similar characteristics (ClusterTable procedure) on page 9.

Using Easy Mining procedures for typical mining tasks


This section describes a set of typical data mining tasks, for example, finding unusual deviations in your data. You can try out these tasks on your own data by using the Easy Mining procedures. This might reveal interesting information that you were not aware of before because it was hidden in your data.

For your first attempts, you can use the Easy Mining procedures with their default parameter settings. However, the default parameter settings might not reveal all of the information that was previously hidden. You can get even better results by using additional option strings.

The following sections are divided into the following subsections:
When to do it
   This section describes a couple of scenarios where you can apply the appropriate Easy Mining procedure.
How to do it
   This section explains the Easy Mining procedure and its results.
Example
   This section illustrates the Easy Mining procedure along the lines of an example.
How to go further
   This section provides and explains additional parameters. It also provides hints and tips on how you can refine the results. This section helps you to improve the results that you obtained by implementing the basic information that is provided in the previous sections.

It is assumed that your database contains tables or views that include many data records. Each data record describes different characteristics of a distinct entity. For example:
- In the retail business, the distinct entity might represent a customer. The characteristics of the customer might include information about the age and the gender of the customer, the preferred shopping day, the products the customer has bought in the past, or the revenue the retail store has made with the customer in the last years.
- In the manufacturing business, the distinct entity might represent a particular type of car. The characteristics of this type of car might include information about the production line in which the car was manufactured, the engine type, or the lacquer that was used for this car.

Finding deviations (FindDeviations procedure)


You can find deviations in your data by using the FindDeviations procedure.

When to do it
If your database contains customer data, you might want to know whether there are customers who are different from the majority of customers because they have combinations of characteristics that exist only for very few customers.

Knowing such deviations in your data can be very useful. For example, some of the data records might represent customers who have an unusual buying behavior. There might be customers who deserve special attention because they buy expensive products of high quality.

What a deviation represents depends on the kind of input data that you are using. For example, deviations can indicate fraudulent behavior because there are customers who have unusually high discount rates. Other unusual combinations can represent inconsistencies in your data. For example, if you identify a customer who is 40 years old and still goes to school, there is something wrong with your data. For databases that describe characteristics of entities other than customers, the types of deviations that you can find likewise depend on the kind of input data that you are using.

How to do it
To find deviations, use the FindDeviations procedure.

Syntax:

IDMMX.FindDeviations(<deviationsView>, <inputTable>)

Input parameters:
With the FindDeviations procedure, you must specify the following parameters:
<deviationsView>
   The name of the view that you want to build. The FindDeviations procedure creates a view and a model. The model is stored in the table IDMMX.CLUSTERMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.
<inputTable>
   The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.

Output:
The FindDeviations procedure creates a view that contains the columns of the input table and the following additional columns:
DEV_DEGREE
   This column indicates the degree of deviation for each record. The degree of deviation is a number greater than 1. High numbers represent a high degree of deviation.
CLUSTER_ID
   This column indicates the identification of the cluster that the record belongs to. The small clusters are interesting for you because you are looking for unusual behavior. The cluster ID helps you to interpret the deviation because the typical characteristics of a cluster characterize the deviation.

To explore the clusters in detail, you can use the table function DM_getClusters of IM Modeling, IBM InfoSphere Intelligent Miner Visualization (in this book referred to as IM Visualization), or any other visualization tool.

Figure 3 shows the data flow of the FindDeviations procedure.

Figure 3. Data flow of the FindDeviations procedure

Example
For an example of the IDMMX.FindDeviations procedure, see Quick start sample on page 3.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in an Easy Mining procedure. For more information, see Optional parameter strings on page 99.

Sometimes you might be interested in deviations based only on a subset of the columns in the input table. In this case, you can remove one or more fields from the input table.

Removing fields from the input table:

You might want to remove one or more fields from the input table because you do not want to use all fields to compute the model. To remove one field, you can use the DM_remDataSpecFld option. For example, to remove the column NBR_YEARS_CLI from the input table, you can use the following optional parameter string:
DM_remDataSpecFld(NBR_YEARS_CLI)

To remove more fields from the input table, you can use several DM_remDataSpecFld options in an Easy Mining procedure, or you can create a view that contains only the columns that you want to use. Use this view as input for the Easy Mining procedure.

There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

Complete procedure call:
The complete procedure call including the optional parameter string looks like this:
db2 "call IDMMX.FindDeviations(BANK.DEV_CUSTOMERS, BANK.CUSTOMERS_MASTERDATA, DM_remDataSpecFld(NBR_YEARS_CLI))"

Finding groups with similar characteristics (ClusterTable procedure)


You can find groups with similar characteristics by using the ClusterTable procedure.

When to do it
Your database might contain customer data including demographic data, for example:
- Gender
- Age
- Profession
- Family status
The information might also include the income or the socio-demographic group of the customer.

Furthermore, you might have collected other customer information. This information depends on the business that you are in. For example, a retail store might collect the sales transactions of its customers. From this information, you can compute the following results:
- How often a customer visited your store in a certain time frame
- How much money a customer spent in total
- How much money a customer spent for particular product categories, for example, beverages or delicatessen
- The preferred shopping days of a week
Other examples are an insurance company that knows the contracts that its customers have signed, or a bank that knows the accounts of its customers and the number of transactions per account.

These are only a few examples of the kind of information that data tables can contain. The data tables can also contain data other than customer data. A manufacturing company might collect information about the production of its products, or a retail chain might collect information about its stores.

If you have data tables that contain this kind of information, you might want to know whether this data set has an inherent structure, or whether it contains groups of objects that are very similar. Knowing such groups can improve your business operations because you can treat your customers according to the group that they belong to. For example, you can define specific product offerings or marketing campaigns targeted at each important group instead of treating all customers equally.

The Easy Mining procedure ClusterTable might find a group of customers that prefers healthy food of high quality. This is the appropriate customer group for a promotion of ecologically produced French cheese. It does not make sense to send this group an advertisement about frozen pizza.

How to do it
To find groups with similar characteristics, use the ClusterTable procedure.

Syntax:

IDMMX.ClusterTable(<clusterView>, <inputTable>, <minSize>, <maxSize>)

Input parameters:
With the ClusterTable procedure, you must specify the following parameters:
<clusterView>
   The name of the view that you want to build. The ClusterTable procedure creates a view and a model. The model represents the characteristics of the clusters. It is stored in the table IDMMX.CLUSTERMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.
<inputTable>
   The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.
<minSize>
   The value to define the minimum percentage of records in a cluster. This parameter is of type REAL.

<maxSize>
   The value to define the maximum percentage of records in a cluster. This parameter is of type REAL.

The values for the minimum and the maximum size of a cluster are indicated in percent. For example, a minimum size of 10.0 means that the smallest cluster contains at least 10% of the records that are contained in the input table.

Output:
The view created by the ClusterTable procedure contains the columns of the input table and the following columns:
CLUSTER_ID
   This column contains the identification of the cluster that the record belongs to.
QUALITY
   This column contains a value that indicates how well the record fits into the cluster. The value can range from 0 to 1.
   - 0 means that this record does not fit at all into this cluster.
   - 1 means that this record fits perfectly into this cluster.
   The quality value refers to a single record within the model.
CONFIDENCE
   This column contains a value that indicates the confidence that the cluster is the best cluster for this record. The value can range from 0 to 1.
   - A value close to 0.5 indicates that the record fits another cluster equally well.
   - A value close to 1 indicates that the record does not fit into a different cluster.
   The confidence value refers to a single record within the model.

To explore the characteristics of each cluster in more detail, you can open the clustering model with IM Visualization or with any other visualization tool that supports PMML. Visualizing the clustering model helps you to assess whether the clustering model is useful for you.

Figure 4 on page 12 shows the data flow of the ClusterTable procedure.

Figure 4. Data flow of the ClusterTable procedure

Interpreting the results:
You can use the quality value to select the best records that belong to a cluster. For example, you might have a limited budget for a mailing campaign. This budget does not allow you to address all customers that belong to a cluster. With the quality value, you can select the most promising customers to address.

Typically, you do not want clusters that are too large. For example, a cluster that represents 75% of the whole population is not interesting because its characteristics do not differ much from those of the whole population.

Depending on your business requirements, clusters that are too small also might not be interesting. For example, if you apply clustering for target marketing, there might be a lower limit for the size of a target group for a promotion campaign. If this target group is too small, the costs of the campaign might exceed the expected revenue.

In other cases, however, you might be interested in very small clusters. These clusters represent the niches. For example, you might be looking for patterns of unusual behavior in your data to detect fraud or other kinds of irregularities. If you are looking for patterns of this kind, use the FindDeviations procedure. See Finding deviations (FindDeviations procedure) on page 7 for more information.
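The budget-limited selection by quality value could be sketched as follows. This Python fragment is an illustration under stated assumptions, not a feature of the ClusterTable procedure: the rows are hypothetical stand-ins for records read from a view that ClusterTable created, the CLIENT_ID column and its values are invented, and the helper name select_for_mailing is an assumption.

```python
def select_for_mailing(records, cluster_id, budget):
    """Pick up to `budget` members of one cluster, highest QUALITY first."""
    members = [r for r in records if r["CLUSTER_ID"] == cluster_id]
    members.sort(key=lambda r: r["QUALITY"], reverse=True)
    return members[:budget]


# Hypothetical rows standing in for a view created by ClusterTable.
rows = [
    {"CLIENT_ID": "c1", "CLUSTER_ID": 1, "QUALITY": 0.91},
    {"CLIENT_ID": "c2", "CLUSTER_ID": 2, "QUALITY": 0.99},
    {"CLIENT_ID": "c3", "CLUSTER_ID": 1, "QUALITY": 0.42},
    {"CLIENT_ID": "c4", "CLUSTER_ID": 1, "QUALITY": 0.77},
]

# A budget of 2 mailings for cluster 1 keeps the two best-fitting members.
selected = select_for_mailing(rows, cluster_id=1, budget=2)
print([r["CLIENT_ID"] for r in selected])
```

The record of cluster 2 is ignored even though its quality value is the highest, because the quality value only describes how well a record fits into its own cluster.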

Example
Figure 5 on page 13 shows customer data of a bank. The table contains the following information:
Demographic information
   Demographic information includes age, gender, marital status, and profession.
Banking products
   Banking products include savings account and international credit card.
Customer activities
   Customer activities include number of debit and credit transactions and average balance.

Figure 5. The input table BANK.BANKCUSTOMERS

In a real banking scenario, a table containing customer data probably includes more columns than the example shown in Figure 5.

Procedure call:
To discover interesting clusters in the table BANK.CUSTOMERS, you can use the Easy Mining procedure ClusterTable. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.ClusterTable(BANK.CLUSVIEW, BANK.CUSTOMERS, 10, 35)"

Where:
IDMMX.ClusterTable
   is the name of the Easy Mining procedure
BANK.CLUSVIEW
   is the name of the view that you want to create
BANK.CUSTOMERS
   is the name of the input table
10
   is the value that you specified for the minimum percentage of records in a cluster
35
   is the value that you specified for the maximum percentage of records in a cluster

Output: Figure 6 on page 14 shows the data flow of the ClusterTable procedure based on the example used in this section.

Figure 6. Data flow of the ClusterTable procedure based on the example used in this section

The output view BANK.CLUSVIEW shown in Figure 7 contains the columns of the input table BANK.CUSTOMERS and additionally the columns CLUSTER_ID, QUALITY, and CONFIDENCE. The CLUSTER_ID column contains values from 1 to 4. This means that the Clustering mining function has computed 4 clusters.

Figure 7. The output view BANK.CLUSVIEW

The result view BANK.CLUSVIEW shown in Figure 7 on page 14 only shows which record belongs to which cluster. If you want to analyze the characteristics of the clusters, you can open the clustering model BANK.CLUSVIEW that is stored in the IDMMX.CLUSTERMODELS table with IM Visualization or with any other visualization tool that supports PMML.

Analyzing the characteristics:
Figure 8 shows the clustering model BANK.CLUSVIEW with IM Visualization. The Graphic View of the Clustering visualizer shows that there are 4 clusters. The largest cluster contains 33.71% of the total population. The smallest cluster contains 13.56% of the total population.

Figure 8. The output view BANK.CLUSVIEW displayed by the Clustering visualizer
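The cluster sizes reported for this example can be related to the minimum and maximum sizes that were passed to ClusterTable (10 and 35). The following Python sketch is an illustration only: the helper names cluster_shares and within_bounds are assumptions, and the CLUSTER_ID column used below is hypothetical. It computes the percentage of records per cluster and checks each share against the two bounds.

```python
from collections import Counter


def cluster_shares(cluster_ids):
    """Map each cluster ID to its share of all records, in percent."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    return {cid: 100.0 * n / total for cid, n in counts.items()}


def within_bounds(shares, min_size, max_size):
    """True if every share lies between min_size and max_size percent."""
    return all(min_size <= share <= max_size for share in shares.values())


# Hypothetical CLUSTER_ID column: 10 records distributed over 4 clusters.
ids = [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
shares = cluster_shares(ids)
print(shares)
print(within_bounds(shares, 10, 35))

# The largest and smallest cluster sizes quoted in the text.
reported = {"largest": 33.71, "smallest": 13.56}
print(within_bounds(reported, 10, 35))
```

Both the hypothetical column and the sizes quoted in the text stay within the range of 10% to 35% that was specified in the procedure call.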

The pie charts and the bar charts show the distribution of the values of the columns in the clusters compared to the total population.
- In the pie charts, the inner circle represents the population of a cluster. The outer circle represents the total population. For example, the pie chart INT_CREDITCARD in Figure 9 on page 16 shows that only a few customers in cluster 1 own an international credit card compared to the total number of customers.

Figure 9. The pie chart INT_CREDITCARD

- In the bar charts, the outlined histograms represent the distribution of the population of a cluster. The compact histograms represent the total population. For example, the bar chart NO_DEBIT_TRANS in Figure 10 shows that the customers in this cluster are less active compared to the total number of customers.

Figure 10. The bar chart NO_DEBIT_TRANS

Only in the two leftmost histograms of the bar chart NO_DEBIT_TRANS is the relative frequency of debit transactions in cluster 1 higher than in the total population. These histograms indicate the fraction of customers with the lowest number of debit transactions. These customers are the least active with regard to debit transactions.

The columns INT_CREDITCARD and NO_DEBIT_TRANS show you that the customers included in this cluster are not the most interesting customers.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in an Easy Mining procedure. For more information, see Optional parameter strings on page 99.

There are many ways to partition the contents of a table into homogeneous clusters. The Clustering mining function uses one of these ways. Often the first clustering result does not fully satisfy your business requirements. Therefore you need to refine the results.

The clustering results depend mainly on the input data. Therefore the main issue is to select the appropriate columns as input for the Clustering mining function. The ClusterTable procedure uses all columns of an input table. To specify particular columns as input data, you can choose between the following options:
- Removing one or more fields from the input table
- Defining a column as supplementary for the Clustering mining function

Removing fields from the input table:
You can remove one or more fields from the computation of the model. For more information, see Removing fields from the input table on page 8.

Defining supplementary columns:
Values of supplementary columns are not used to compute the similarity of records. However, statistics of these columns are computed and included in the generated model for reference purposes. To define a column as supplementary, you must set the field usage type of this column to 2. You can set the field usage type with the DM_setFldUsageType option. For example, to define the column PROFESSION as supplementary, you can use the following option string:
DM_setFldUsageType(PROFESSION,2)

Where:
DM_setFldUsageType
    is the option string parameter
PROFESSION
    is the column that you want to define as a supplementary column
2
    is the value that denotes a column as a supplementary column

There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

Complete procedure call: The complete procedure call including the optional parameter string looks like this:
db2 "call IDMMX.ClusterTable(BANK.CLUSVIEW, BANK.CUSTOMERS, 10, 35, DM_setFldUsageType(PROFESSION,2))"

Finding relationships (FindRules procedure)


You can find relationships in your data by using the FindRules procedure.

When to do it
Your database might include a data table that contains customer data. You might want to find out whether there are relationships between the values in the columns of this table. Such a relationship might indicate, for example, that in a certain number of cases a column has a specific value if other columns have a specific combination of values. For example, the FindRules procedure might find out that 70% of the male customers who have online access to their account also have a credit card. This is interesting cross-selling information that you can exploit in the next marketing campaign.

You can also apply the FindRules procedure to retail transaction data. A sales transaction consists of the items that a customer has bought during a visit to the retail store. The information is stored in a table with at least the following columns:
- The identification of the sales transaction
- The purchased items

The items with the same associated transaction ID are bought together. If you apply the FindRules procedure to a transaction table, you might find relationships between the purchased items. They indicate, for example, that in 45% of the cases in which customers buy cereals, they also buy fruit. With this cross-selling information, you can decide where to place the products in your store or which products to discount in the next marketing campaign.

How to do it
To find relationships in your data, use the FindRules procedure.

Syntax:
IDMMX.FindRules(<rulesView>, <inputTable>, <groupColumn>, <nbRules>, <minConfidence>)

Input parameters: To find relationships in your data, you must specify the following parameters for the FindRules procedure:

<rulesView>
    The name of the view that you want to build. The FindRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.
<groupColumn>
    The name of the column that contains the group or transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table are used as item columns. If you specify NULL for the GROUP column, each record represents a transaction and all columns of the input table are used as item columns. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<nbRules>
    The minimum number of rules to be generated. The generated rules are stored in a view. This parameter is of type INTEGER.
<minConfidence>
    A value to measure the validity of the rule. The confidence of each rule is greater than or equal to the value that you specified for minimum confidence. This parameter is of type REAL.

Output: The Associations mining function computes association rules. The set of the computed association rules is called a rule model. The first part of an association rule is called the rule body. The second part of an association rule is called the rule head.

If you use the Associations mining function, for example, on customer data of a bank, the rule body and the rule head might represent column value pairs, for example, online_access=YES. The association rule is interpreted like this:
If online_access=YES then bankcard=YES

If you use the Associations mining function, for example, on retail transaction data, the rule body and the rule head might represent articles that occur in retail transactions, for example, chocolate, or candy. The association rule is interpreted like this:
If customers buy chocolate, they also buy candies.

Association rules include the following attributes:

Confidence
    The confidence value represents the validity of the rule. A confidence value of 50% means that in 50% of the cases where a particular rule body is present in a group, a particular rule head is also present. For example, rule ID 123 in Figure 13 on page 22 indicates that in 57.576% of the cases where customers have a savings account with a building society, they also have a home insurance contract and a popular savings plan.
Support
    The support value states how many records or how many transactions are covered by a rule. The value for support is expressed as a percentage of the total number of records or transactions.
Lift
    The lift value indicates how much higher the confidence value is than expected. It is defined as the quotient of the confidence value and the support value of the rule head.

    The support value of the rule head can be considered as the expected value for the confidence. It indicates the relative frequency of the rule head in the whole transaction set. The confidence value of the association rule indicates the relative frequency of the rule head in the transactions that contain the items of the rule body.

    For example, assume that the confidence of the following rule is 30%, and that the frequency of the customers who buy candies (which is the support of the rule head) is 10%. If buying chocolate had no influence on buying candies, you would expect 10% of the customers who buy chocolate to also buy candies. Therefore 10% is the expected confidence, and 30% / 10% = 3.0 is the lift of the rule:
If customers buy chocolate, they also buy candies
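
The definitions of support, confidence, and lift given above can be illustrated with a small Python sketch. It is not the algorithm of the Associations mining function. The function names group_items and rule_measures and all transaction data are invented for this illustration, and the measures are returned as fractions rather than percentages.

```python
from collections import defaultdict

# Hypothetical (transaction ID, item) rows; the values are invented.
rows = [
    (1, "chocolate"), (1, "candies"),
    (2, "chocolate"), (2, "candies"),
    (3, "chocolate"), (3, "candies"),
    (4, "chocolate"),
    (5, "candies"),
    (6, "candies"),
    (7, "fruit"),
    (8, "fruit"),
    (9, "cereals"),
    (10, "cereals"), (10, "fruit"),
]


def group_items(rows):
    """Collect the items that share a group (transaction) ID into one set."""
    groups = defaultdict(set)
    for group_id, item in rows:
        groups[group_id].add(item)
    return list(groups.values())


def rule_measures(transactions, body, head):
    """Return (support, confidence, lift) of the rule 'if body then head'."""
    body, head = set(body), set(head)
    total = len(transactions)
    n_body = sum(1 for t in transactions if body <= t)
    n_head = sum(1 for t in transactions if head <= t)
    n_rule = sum(1 for t in transactions if (body | head) <= t)
    if n_body == 0 or n_head == 0:
        raise ValueError("rule body and rule head must each occur at least once")
    support = n_rule / total              # share of transactions covered by the rule
    confidence = n_rule / n_body          # share of body transactions that contain the head
    lift = confidence / (n_head / total)  # confidence divided by support of the head
    return support, confidence, lift


transactions = group_items(rows)
support, confidence, lift = rule_measures(transactions, {"chocolate"}, {"candies"})
print(f"support={support:.1%} confidence={confidence:.1%} lift={lift:.2f}")
```

In this invented data set, chocolate occurs in four of ten transactions and candies occur in three of those four, so the confidence is higher than the overall frequency of candies and the lift is greater than 1.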

Figure 11 shows the data flow of the FindRules procedure.

Figure 11. Data flow of the FindRules procedure

Example
The FindRules procedure is very useful in the banking business. A bank knows the banking products that its customers own. This information might be stored in the table BANK.CUSTOMER_PRODUCTS. This table contains the following columns:
- CLIENT_ID
- PRODUCT

The table might look like this:


Figure 12. The input table BANK.CUSTOMER_PRODUCTS

Because customers can possess more than one banking product, the table can contain more than one record for a given client ID. The table in Figure 12 shows that the customer with the client ID 395821 owns one product while the customer with the client ID 856 owns five products. The bank might want to know whether there are relationships between the products that a customer owns. You can use the FindRules procedure to find such relationships.

Procedure call: Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.FindRules(BANK.PRODUCT_RULES, BANK.CUSTOMER_PRODUCTS, CLIENT_ID, 100, 30)"

Where:
BANK.PRODUCT_RULES
    is the name of the view that you want to create and that contains the rules
BANK.CUSTOMER_PRODUCTS
    is the name of the input table
CLIENT_ID
    is the identification of a particular customer
100
    is the maximum number of rules to be generated
30
    is the minimum value for confidence

Results: The result of calling the IDMMX.FindRules procedure might look like this:

Figure 13. The output view BANK.PRODUCT_RULES

In Figure 13, rule 122 includes the product home insurance contract in the BODYTEXT column because the home insurance contract represents the body of the rule. The product savings account with a building society represents the head of the rule. Therefore it is included in the HEADNAME column.
- The support value of 6.187 means that 6.187% of all customers have a home insurance contract and a savings account with a building society.
- The confidence value of 41.429 means that in 41.429% of the cases customers who have a home insurance contract also have a savings account with a building society.
- The lift value of 1.21 means that it is 1.21 times more likely that the item in the rule head is also bought if the item in the rule body is bought.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

When you use the FindRules procedure, you can do the following actions:
- Setting the name for the item ID column
- Using name mappings
- Removing one or more fields from the input table. For more information, see Removing fields from the input table on page 8.

Setting the name of the item ID column: The input table BANK.CUSTOMER_PRODUCTS that is shown in Figure 12 on page 21 contains the following columns:

CLIENT_ID
    This column represents the group_ID column.
PRODUCT
    This column represents the item_ID column.

The PRODUCT column is implicitly specified as the item_ID column because the CLIENT_ID column is specified as the group_ID column. If you have a table with more than two columns, you can specify a particular column as the item ID column by using the DM_setItemFld option. If you specify a column as the item ID column, only this column is considered by the Easy Mining procedure. If you do not specify a column as the item ID column, all columns are searched for relationships.

Example: If you want to specify the PRODUCT column as the item ID column, the options string looks like this:
DM_setItemFld(PRODUCT)

Name mappings: Transaction tables, for example, the BANK.CUSTOMER_PRODUCTS table that is shown in Figure 12 on page 21, might contain numbers that represent the product identification instead of product names. Running the FindRules procedure on such tables generates rules that include numbers instead of product names. Such a rule might look like this:
If [2937], then [5879]

Where:
2937
    is the identification for the product safe
5879
    is the identification for a savings account with a building society

This makes the rules very difficult to understand. To display the appropriate product names, you can define a name mapping that maps the product ID to the appropriate name in the association rules. The mapping from the product ID to the product name must be defined in a table. If such a table exists, you can use the DM_addNmp option to define a name mapping, and the DM_setFldNmp option to apply the name mapping to the PRODUCT field.

It is assumed that the mapping is defined by the table BANK.PRODUCTS. This table contains the following columns:

ID
    This column contains numerical values that correspond to the product IDs of the transaction table BANK.CUSTOMER_PRODUCTS.
DESCRIPTION
    This column maps the numerical values to a meaningful description, for example safe or savings account with a building society.

There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

Defining a name mapping: The following option string defines a name mapping that is called PRODUCT_NAMES:

DM_addNmp("PRODUCT_NAMES", "BANK.PRODUCTS", "ID", "DESCRIPTION")

Applying the name mapping: The following option string applies the name mapping PRODUCT_NAMES to the PRODUCT column:
DM_setFldNmp("PRODUCT", "PRODUCT_NAMES")
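
The effect of a name mapping on the rule text can be pictured with the following Python sketch. It only mimics the lookup from ID to DESCRIPTION on a rule string. It does not call the product, and the dictionary product_names and the function apply_name_mapping are hypothetical names introduced for this illustration.

```python
import re

# Hypothetical contents of the mapping table (columns ID and DESCRIPTION).
product_names = {
    2937: "safe",
    5879: "savings account with a building society",
}


def apply_name_mapping(rule_text, mapping):
    """Replace each [ID] in a rule with its description; unknown IDs stay as they are."""
    def substitute(match):
        item_id = int(match.group(1))
        return "[" + str(mapping.get(item_id, item_id)) + "]"
    return re.sub(r"\[(\d+)\]", substitute, rule_text)


readable_rule = apply_name_mapping("If [2937], then [5879]", product_names)
print(readable_rule)
```

IDs that have no entry in the mapping are left unchanged in this sketch. How the product treats unmapped values is not described in this section.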

Complete procedure call


The complete procedure call including the optional parameter strings looks like this:
db2 "call IDMMX.FindRules(BANK.PRODUCT_RULES, BANK.CUSTOMER_PRODUCTS, CLIENT_ID, 100, 30, DM_setItemFld(PRODUCT), DM_addNmp(PRODUCT_NAMES, BANK.PRODUCTS, ID, DESCRIPTION), DM_setFldNmp(PRODUCT, PRODUCT_NAMES))"

Finding sequential relationships in your data (FindSeqRules procedure)


You can find sequential relationships in your data by using the FindSeqRules procedure.

When to do it
You might have data tables that include records that can be grouped according to a particular key. These groups might represent, for example, sales transactions or orders of customers. They can also represent defects of a product. If groups represent sales transactions or orders of customers, the key might be the customer ID. If the groups represent defects of a product, the key might be the serial number of the product.

You can sort the members of these groups according to another order key. For sales transactions or customer orders, this order key can be the purchase date or the order date. For product defects, this order key can be the date when the defect occurred.

With the FindRules procedure that is described in Finding relationships (FindRules procedure) on page 18, you can identify the products that are purchased together or the defects that typically occur for one product. With the FindSeqRules procedure, you can additionally find out the sequential order in which the articles are bought or the sequential order in which the defects have occurred.

With regard to retail data, you can use this information for targeted mailing campaigns. For example, you might want to offer your customers the articles that they are likely to buy soon. With regard to defects, you can efficiently manage repair and warranty cases by proactively fixing defects that are likely to happen in the near future. You can also use the FindSeqRules procedure in other business areas.


How to do it
You can find sequential relationships in your data by using the FindSeqRules procedure.

Syntax
IDMMX.FindSeqRules(<seqRulesView>, <inputTable>, <sequenceColumn>, <groupColumn>, <nbRules>, <minConfidence> [, <optionalParameters>])

Input parameters
With the IDMMX.FindSeqRules procedure, you must specify the following parameters:

<seqRulesView>
    The name of the view that you want to build. The FindSeqRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. The view contains the following columns:
    - ID
    - HEADSETID
    - HEADSETTEXT
    - BODYSEQID
    - BODYSEQTEXT
    - SUPPORT
    - CONFIDENCE
    - LIFT

    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.
<sequenceColumn>
    The name of the column that contains the sequence ID. A sequence contains the item sets that have the same sequence ID.
<groupColumn>
    The name of the column that contains the group or the transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table with the exception of the sequence column are used as item columns. An item set contains items that have the same sequence ID and the same value in the group column. The item sets of a sequence are sorted according to the value in the group column. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<nbRules>
    The minimum number of sequence rules to be generated. The generated rules are stored in a view. This parameter is of type INTEGER.
<minConfidence>
    A value to measure the validity of the sequence rule. The confidence of each sequence rule is greater than or equal to the value that you specified for minimum confidence. This parameter is of type REAL.
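
The way in which the sequence column and the group column turn table rows into sequences of item sets can be sketched in Python as follows. This is only an illustration of the description above, not the product's implementation. The rows, the function name build_sequences, and the use of ISO date strings as group values are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical rows: (sequence ID, group value, item). All values are invented.
rows = [
    (856, "2007-03-01", "digital camera"),
    (856, "2007-03-01", "rechargeable batteries"),
    (856, "2007-06-30", "photo printer"),
    (856, "2007-04-12", "memory card"),
    (901, "2007-05-02", "memory card"),
    (901, "2007-01-15", "digital camera"),
]


def build_sequences(rows):
    """Group rows into item sets and order the item sets within each sequence."""
    item_sets = defaultdict(lambda: defaultdict(set))
    for sequence_id, group_value, item in rows:
        # Items with the same sequence ID and the same group value form one item set.
        item_sets[sequence_id][group_value].add(item)
    # The item sets of a sequence are sorted by the value in the group column.
    return {
        sequence_id: [groups[value] for value in sorted(groups)]
        for sequence_id, groups in item_sets.items()
    }


sequences = build_sequences(rows)
for sequence_id, sequence in sequences.items():
    print(sequence_id, sequence)
```

ISO date strings are used here because they sort chronologically as plain text. A numeric or DATE column would be ordered by its own values.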

Output
Sequential relationships are represented as sequence rules. Sequence rules describe patterns in sequences. Depending on the business area, sequences might be, for example, purchases of customers or defects of cars over time. For example, customers might buy a digital camera and rechargeable batteries. A couple of weeks later, they buy a memory card and, again a couple of weeks later, they buy a photo printer. The sequence rule of this pattern looks like this:
<digital camera and rechargeable batteries> >>> <memory card> ==> <photo printer>

where:
<digital camera and rechargeable batteries>
    represents an individual item set that is part of the rule body
>>>
    represents a temporal ordering of item sets in ascending order
<memory card>
    represents an individual item set that is part of the rule body
==>
    splits the sequence rule into a sequence rule body and a sequence rule head
<photo printer>
    represents an item set that is included in the sequence rule head


You can interpret the sequence rule above like this: If customers buy a digital camera with rechargeable batteries during their first purchase and a memory card during their second purchase, they will buy a photo printer during a subsequent purchase.

Sequence rules include the following attributes:

Confidence
    The confidence value represents the validity of the rule. A confidence value of 50% means that in 50% of the cases where a particular rule body is present in a sequence, a particular rule head is also present after the item sets of the rule body. For example, in the sequence rule above, a confidence value of 50% means that 50% of the customers who bought a digital camera with rechargeable batteries during their first visit and a memory card during any of their subsequent visits, bought a photo printer during another subsequent visit.
Support
    The support value indicates how many sequences are covered by a sequence rule. The support value is expressed as the percentage of the total number of sequences. For example, a support value of 2% in the following sequence means that 2% of all sequences contain this particular sequence.
<digital camera and rechargeable batteries> >>> <memory card> ==> <photo printer>

Lift
    The lift value indicates how much the confidence value differs from the expected confidence value. The lift value is computed by dividing the confidence value by the support value of the sequence rule head. If the support value of the rule head in the above example is 10% and the confidence value of the sequence rule is 50%, the value for lift is 50% divided by 10% = 5. A lift value of 5 means that customers who buy a digital camera and rechargeable batteries during their first visit and a memory card during their second visit are 5 times more likely than average customers to buy a photo printer during a subsequent visit.
Mean time difference
    This value indicates the mean time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence. If the type of the group column is numeric, this value is the mean value of the group values for the sequences.
Standard deviation of time difference
    This value indicates the standard deviation of the time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence. If the type of the group column is numeric, this value is the standard deviation of the group values for the sequences.
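
The following Python sketch shows one simplified reading of how support, confidence, and lift of a sequence rule could be computed. It assumes that a pattern matches a sequence when its item sets are contained, in the same order, in item sets of the sequence. The exact matching rules of the product are not given in this section, and the function names and the data are invented for this illustration.

```python
def contains_in_order(sequence, pattern):
    """True if the item sets of the pattern occur in the sequence in the same order."""
    position = 0
    for item_set in sequence:
        if position < len(pattern) and pattern[position] <= item_set:
            position += 1
    return position == len(pattern)


def sequence_rule_measures(sequences, body, head):
    """Return (support, confidence, lift) of the rule 'body ==> head' as fractions."""
    total = len(sequences)
    n_body = sum(1 for s in sequences if contains_in_order(s, body))
    n_head = sum(1 for s in sequences if contains_in_order(s, [head]))
    n_rule = sum(1 for s in sequences if contains_in_order(s, body + [head]))
    if n_body == 0 or n_head == 0:
        raise ValueError("rule body and rule head must each occur at least once")
    support = n_rule / total
    confidence = n_rule / n_body
    lift = confidence / (n_head / total)
    return support, confidence, lift


camera_kit = {"digital camera", "rechargeable batteries"}

# Hypothetical sequences, each a list of item sets in temporal order.
sequences = [
    [camera_kit, {"memory card"}, {"photo printer"}],  # body followed by head
    [camera_kit, {"memory card"}],                     # body only
    [{"memory card"}, camera_kit],                     # body items in the wrong order
    [{"tripod"}],                                      # unrelated purchase
]

body = [camera_kit, {"memory card"}]
head = {"photo printer"}
support, confidence, lift = sequence_rule_measures(sequences, body, head)
print(f"support={support:.0%} confidence={confidence:.0%} lift={lift:.1f}")
```

The third sequence does not count for the rule body because the memory card was bought before the camera, which illustrates why the temporal ordering matters for sequence rules.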

Example
Besides the client ID and the banking product that are known to the bank in the example of the FindRules procedure, the bank also knows the dates on which its customers bought the banking products. The table might look like this:


Figure 14. The input table BANK.CUSTOMER_PRODUCTS

The figure above shows that one customer can have various banking products that are bought at different dates. Because customers can own more than one banking product, the table can contain more than one record for a given client ID. For example, the customer with the client ID 856 owns five products. The bank might want to know whether there are sequential relationships between the products that a customer owns. You can use the FindSeqRules procedure to find sequential relationships.

Use the following command to run the FindSeqRules procedure:
call IDMMX.FindSeqRules(BANK.PRODUCT_SEQRULES, BANK.CUSTOMER_PRODUCTS2, CLIENT_ID, DATE, 100, 10);

Where:
BANK.PRODUCT_SEQRULES
    is the name of the generated table that contains the sequence rules
BANK.CUSTOMER_PRODUCTS2
    is the name of the input table
CLIENT_ID
    is the sequence column that contains the customer ID
DATE
    is the group column that contains the date of the purchase
100
    is the maximum number of rules to be generated
10
    is the minimum value for confidence. This means that you want sequence rules with a confidence value of 10% or higher.


The generated table BANK.PRODUCT_SEQRULES might look like this:

Figure 15. The generated table BANK.PRODUCT_SEQRULES

In the figure above, the sequence rules are sorted by the lift value in descending order. Rule 79 looks interesting. It states that 26% of the customers who first have a CODEVI savings account and after that a popular savings plan subsequently sign a savings plan with a building society. The lift value indicates that this is 3.2 times more probable for these customers than for customers in general. Therefore, offering a savings plan with a building society to customers with a CODEVI account and a popular savings plan is 3.2 times more likely than average to lead to a signed contract.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

When you use the FindSeqRules procedure, you can do the following tasks:
- Setting the name for the item ID column. For more information, see Setting the name of the item ID column on page 22.
- Using name mappings. For more information, see Name mappings on page 23.

Prediction of future behavior (PredictColumn procedure)


You can predict future behavior by using the PredictColumn procedure.

When to do it
Your database might contain customer data. In the tables or views of your database, there might be one column that you are particularly interested in. For example, if you want to know the aggregated revenue of each customer who visited your shop last year, you might be interested in the AGGREGATED REVENUE column. This column is called the target column.

You might want to know whether the values in the target column AGGREGATED REVENUE are related to the values of the other columns in such a way that you can predict the target values from the values of the other columns. If you have new customer data that does not yet contain values in the AGGREGATED REVENUE column, you can predict the estimated revenue of new customers.

If you have new customer data that does not yet have values in the target column CATEGORY, you can predict the category that the new customers best fit into. This information helps you to plan a marketing campaign for this particular customer category.

In customer relationship management, you might want to predict which customers are likely to buy certain products. This information helps you to promote cross-selling.

In the health care industry, you can find relations between symptoms and diseases. With this information, you can predict the potential diseases of new patients.

How to do it
To predict future behavior, you can use the IDMMX.PredictColumn procedure.

Syntax:
IDMMX.PredictColumn(<predView>, <inputTable>, <targetColumn>)

Input parameters: To predict future behavior, you must specify the following parameters for the PredictColumn procedure:

<predView>
    The name of the view that you want to build. The PredictColumn procedure creates a view and a model. Depending on the mining function that is used to build the model, the model is stored in one of the following tables under the same name as the generated view:
    - IDMMX.ClassifModels if the target column is categorical
    - IDMMX.RegressionModels if the target column is numeric

    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.
<targetColumn>
    The name of the target column. The PredictColumn procedure derives the values in this column from the values of the other columns in the input table. If the values in the target column are categorical, the Classification mining function is used. If the values in the target column are numeric, the Regression mining function is used. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

Data flow: Figure 16 shows the data flow of the PredictColumn procedure. By applying the PredictColumn procedure to the input table with the specified target column, a model and a view are generated. The view includes the columns of the input table and the columns PREDICTION and CONFIDENCE.

Figure 16. Data flow of the PredictColumn procedure

Output: Based on the input parameters, the PredictColumn procedure creates a view. This view contains the columns of the input table and the following additional columns:

PREDICTION
    This column contains the predicted values of the target column. These values are derived from the values of the input table.
CONFIDENCE
    This column contains the confidence value of the prediction. If the target column is categorical, the confidence value can range from 0 to 1.
    - A value close to 0 indicates a low probability that the prediction is correct.
    - A value close to 1 indicates a high probability that the prediction is correct.
    If the target column is numeric, this column contains only null values.

With the prediction confidence, you can select the most reliable predictions.


To analyze the prediction model in detail, you can use IM Visualization or any other visualization tool that supports PMML.

Data flow of the PredictColumn procedure: The PredictColumn procedure splits the input data into the following disjoint data sets:

Training data set
    The training data set is used to compute the prediction model.
Validation data set
    The quality of the prediction model is measured on the records of the validation data set. The model quality indicates how well the model might perform on unknown data.

Typically, the model quality is better on the training data than on the validation data because the model might be tuned towards the records of the training data set. In the extreme case, it is as if the model had learned all records of the training data set by heart. Such a model would have an optimal quality on the training data set because its predictions would be correct for all records of the training data set. However, it would not know what to predict for a record of the validation data set unless that record had the same values as a record in the training data set. Therefore, for computing the quality of the model, it is better to use data records that were not used in the training phase.
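
The argument about learning records by heart can be made concrete with a small Python sketch. It splits invented records into two disjoint sets and evaluates a model that only memorizes its training records. The split ratio, the seed, the class MemorizingModel, and the data are assumptions for this illustration. The section does not state how the PredictColumn procedure chooses the two data sets.

```python
import random


def split_records(records, validation_fraction=0.25, seed=42):
    """Shuffle the records and split them into disjoint training and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


class MemorizingModel:
    """A model that learns the training records by heart and nothing else."""

    def __init__(self, training):
        self.lookup = {features: label for features, label in training}

    def predict(self, features):
        # Returns None for records that were not seen during training.
        return self.lookup.get(features)


def accuracy(model, records):
    """Fraction of records for which the prediction equals the actual label."""
    hits = sum(1 for features, label in records if model.predict(features) == label)
    return hits / len(records)


# Hypothetical records: ((AGE, ONLINE_ACCESS), BANKCARD). Every record is unique.
records = [
    ((20 + i, "YES" if i % 2 == 0 else "NO"), "YES" if i % 2 == 0 else "NO")
    for i in range(20)
]

training, validation = split_records(records)
model = MemorizingModel(training)
print("training accuracy:  ", accuracy(model, training))
print("validation accuracy:", accuracy(model, validation))
```

Because no validation record repeats a training record in this invented data, the memorizing model is perfect on the training data and useless on the validation data, which is the extreme case described above.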

Example
The input data in Figure 17 on page 33 represents a banking scenario. It shows the following information about banking customers:

Demographic information
    Demographic information includes age, gender, marital status, or profession.
Business specific information
    Business specific information includes the average balance, the number of years a customer is a client of this bank, savings account, or online access to accounts.


Figure 17. Data flow of the PredictColumn procedure based on the example in this section

The BANKCARD column in the BANK.BANKCUSTOMERS table indicates whether customers have a bank card. Based on this information, the management of the bank might plan a promotion campaign for bank cards. The bank management needs to know who might be a good candidate for a bank card offer. Therefore the bank management wants to find out the characteristics of customers who have a bank card, and the characteristics of customers who do not have a bank card. Good candidates for the promotion are customers who do not yet have a bank card, although they possess the characteristics of a bank card holder.

Input parameters: To determine the characteristics of bank card holders and non-bank card holders, you can use the PredictColumn procedure. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.PredictColumn(BANK.BANKCARD_PRED, BANK.BANKCUSTOMERS, BANKCARD)"

Where:
BANK.BANKCARD_PRED
    is the name of the result view to be generated
BANK.BANKCUSTOMERS
    is the name of the input table
BANKCARD
    is the name of the target column


Output: Figure 17 on page 33 shows the data flow of the PredictColumn procedure: applying the PredictColumn procedure to the input table BANKCUSTOMERS produces a model and an output view. The model and the output view have the same name, for example, BANK.BANKCARD_PRED. The model is stored in the IDMMX.ClassifModels table. The output view BANK.BANKCARD_PRED that is shown in Figure 17 on page 33 contains the columns of the input table and the following additional columns:

DATA_SET
    The values in this column show the data set that this record belongs to, for example, the training data set or the validation data set.
PREDICTION
    This column contains the predicted values. These are the values that the model expects for the column BANKCARD. The type of this column is the same as the type of the target column BANKCARD.
CONFIDENCE
    This column contains the confidence value for the prediction.

Selection of candidates for bank card promotions: You can use the BANK.BANKCARD_PRED output view to select candidates for a bank card promotion. Potential candidates for the bank card promotion are the customers who are identified as not having a bank card yet although they have the characteristics of bank card holders. In the BANK.BANKCARD_PRED output view that is shown in Figure 17 on page 33, these candidates are represented by the records that have a NO value in the BANKCARD column and a YES value in the PREDICTION column.

Identification of characteristics: With the information in the output view BANK.BANKCARD_PRED, you can select the candidates for the bank card promotion. However, if you want to know the characteristics that distinguish bank card holders from non-bank card holders, you can analyze the model BANK.BANKCARD_PRED that is stored in the table IDMMX.ClassifModels by opening it with IM Visualization or any other visualization tool that supports PMML.
You can determine the characteristics of bank card holders by analyzing the Tree View of the Classification Visualizer in Figure 18 on page 35.


Figure 18. Tree View of the Classification Visualizer

The first test ONLINE_ACCESS=YES is performed on the value of the tree node with the node ID 1.1. This tree node is not expanded because it is pruned. The SCORE value of this tree node is YES. This means that there is a good chance that this customer also has a bank card. The PURITY value shows that 83.5% of the customers with online access to their account have a bank card. The value in the RECORD COUNT column shows that 22% of the customers have online access to their accounts.

The tree node with the node ID 1.2 is expanded because it is not pruned. The SCORE value of this tree node is NO. Some descendants of this node are labeled with YES in the SCORE column, for example, the node 1.2.1.1.2.1. By analyzing these nodes and their paths from the root node, you can determine the other characteristics that identify a bank card holder.

The quality of the model: The Confusion Matrix View of the Classification Visualizer in Figure 19 on page 36 shows the number of correct and incorrect predictions for the validation data.


Figure 19. Confusion Matrix View of the Classification Visualizer

The target column of the model BANK.BANKCARD_PRED has the following predicted values:
- YES
- NO

The confusion matrix in Figure 19 shows that 264 of the non-bank card holders are classified as bank card holders. This is the target group that you are interested in.
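The way a confusion matrix is counted, and how the target group falls out of it, can be sketched in a few lines of Python. This is only an illustration: the records below are hypothetical, the helper names are invented for this sketch, and it does not reproduce how the Classification Visualizer computes its figures.

```python
from collections import Counter

# Hypothetical validation records: (customer id, actual BANKCARD, PREDICTION).
records = [
    (1, "YES", "YES"),
    (2, "NO", "YES"),
    (3, "NO", "NO"),
    (4, "YES", "NO"),
    (5, "NO", "YES"),
    (6, "NO", "NO"),
]


def confusion_matrix(rows):
    """Count the (actual, predicted) pairs over all rows."""
    return Counter((actual, predicted) for _, actual, predicted in rows)


def promotion_candidates(rows):
    """Return the ids of customers without a bank card who are predicted to have one."""
    return [cid for cid, actual, predicted in rows
            if actual == "NO" and predicted == "YES"]


matrix = confusion_matrix(records)
candidates = promotion_candidates(records)

for actual in ("YES", "NO"):
    for predicted in ("YES", "NO"):
        print(f"actual={actual} predicted={predicted}: {matrix[(actual, predicted)]}")
print("candidates:", candidates)
```

The cell for actual NO and predicted YES corresponds to the target group of the bank card promotion.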

How to go further
Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the call of the Easy Mining procedure. For more information, see Optional parameter strings on page 99.

The required quality of a prediction model depends on the application for which you want to use it. Being significantly better than random guessing might be a good result. This applies, for example, if you want to predict cross-selling opportunities. In other application areas, a good prediction quality is mandatory, for example, if you want to use a reliable model to predict diseases from symptoms.

Removing fields from the input table: You might think that the prediction quality always should be as good as possible. In general, this is true. However, sometimes the result might be too good. The reason for a result that is too good can be that there are causal dependencies between the target field and some input fields. For example, possessing a bank card leads to an increasing number of debit transactions. Therefore, a high number of debit transactions seems to be a good indicator for possessing a bank card. However, the causal relationship runs in the opposite direction: the possession of a bank card implies a higher number of debit transactions, but a high number of debit transactions does not imply having a bank card, because there are other reasons for a high number of debit transactions.

You should always examine the tests near the root node of the classification tree and check whether there are columns that have causal relationships with the target field. Whenever you identify columns with causal relationships, you must remove them from the input fields. For more information about removing fields from the input table, see Removing fields from the input table on page 8.

The depth of a classification tree: You can set more parameters of the Tree Classification mining function by using optional parameter strings. For example, if the target field is categorical, you might want to set the maximum tree depth. For more information, see Setting the depth of a classification tree on page 54. There are more optional parameter strings. For more information, see Optional parameter strings on page 99.

Prediction of an outcome (PredictColValue procedure)


You can predict values by using the PredictColValue procedure.

When to do it
You might have a database with customer data. In the tables or views of your database, there might be one column that you are particularly interested in. For example, if you want to know whether your customers have responded to a mailing campaign, you might be interested in the RESPONDED column. If you want to know whether customers have canceled their contracts, you might be interested in the CONTRACT_STATUS column. The column that you are particularly interested in is called the target column.

The target column CONTRACT_STATUS can have several values, for example, SIGNED, PROLONGED, or CANCELED. Typically, you are not interested in all values of the target column. You might be interested only in one value. For example, to prevent customers from canceling their contracts, you want to know the customers who are likely to cancel a contract. Therefore, you are only interested in the target value CANCELED.

You might want to know whether there are relationships between the occurrence of the CANCELED value in the target column CONTRACT_STATUS and the values of the other columns, so that you can predict from the values of the other columns the likelihood that the target column CONTRACT_STATUS contains the value CANCELED.


How to do it
To predict an outcome, use the PredictColValue procedure. Syntax:

IDMMX.PredictColValue(<predView>, <inputTable>, <targetColumn>, <targetValue>)

Input parameters: With the PredictColValue procedure, you must specify the following parameters:

<predView>
The name of the view that you want to build. The PredictColValue procedure creates a view and a model. The model is stored in the table IDMMX.PredictColValue under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.

<targetColumn>
The name of the column whose values are to be predicted. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

<targetValue>
The name of a value in the target column, for example, YES or NO. This parameter is of type VARCHAR. Its size is 1024.

Output: The PredictColValue procedure creates a view. This view contains the columns of the input table and the following additional columns:

DATA_SET
This column indicates whether the data is used for training or for validation.

TARGET_VALUE
This column contains the target value of the target column, for example, SIGNED, PROLONGED, or CANCELED.

CONFIDENCE
This column contains the estimated confidence, calculated by the Classification mining function, that the target column contains the target value. It indicates how reliable the prediction is.


The values in this column can range from 0 to 1.
- A value close to 0 indicates a low certainty that the prediction is correct.
- A value close to 1 indicates that the prediction is reliable.

With the confidence value, you can determine whether an individual prediction is sufficiently reliable for your application.

PRED_VALUE
This column contains the estimated confidence, calculated by the Regression mining function, that the target column contains the target value. It has the same value range and meaning as the column CONFIDENCE.

It might be useful to compare the results of the different prediction techniques. For example, for a mailing campaign it might be more appropriate to use a target audience selected by the Classification mining function instead of a target audience selected by the Regression mining function. Which selection is more appropriate depends on the size of the target audience: for a different audience size, it might be more appropriate to select the target audience by the Regression mining function. The values in the CONFIDENCE column and the PRED_VALUE column help you to make this decision. For more information, see Determining the best suited model on page 90.

Data flow of the PredictColValue procedure: The PredictColValue procedure does not use the original target column for the mining runs. To build a classification model, it creates a new categorical target column that includes the following values:
- <targetValue>
- !=<targetValue>

To build a regression model, the PredictColValue procedure creates a numeric column. This column can contain the value 1 or 0.
- It contains 1 if the value of the original target column is equal to the target value.
- It contains 0 if the value of the original target column is different from the target value.

Prediction models are computed on a subset of the input data. The PredictColValue procedure splits the input data into the following disjoint data sets:

Training data set
The training data set is used to compute the prediction model.

Validation data set
The quality of the prediction model is determined on the records of the validation data set. The model quality indicates how well the model might perform on unknown data. You can use the validation results to select the model that is best suited according to the requirements of your application.

For more information, see Splitting tables into training data sets and test data sets on page 83. Figure 20 on page 40 shows the data flow of the PredictColValue procedure.
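The two derived target columns described above can be illustrated with a short Python sketch. It is a simplified illustration only: the function name derive_targets and the sample CONTRACT_STATUS values are hypothetical, and the sketch does not show how the procedure derives the columns internally.

```python
def derive_targets(values, target_value):
    """Return the derived categorical and numeric target columns.

    categorical: target_value itself, or "!=" + target_value for any other value
    numeric:     1 if the original value equals target_value, otherwise 0
    """
    categorical = [v if v == target_value else f"!={target_value}" for v in values]
    numeric = [1 if v == target_value else 0 for v in values]
    return categorical, numeric


# Hypothetical values of the original target column CONTRACT_STATUS.
contract_status = ["SIGNED", "CANCELED", "PROLONGED", "CANCELED"]

categorical, numeric = derive_targets(contract_status, "CANCELED")
print(categorical)
print(numeric)
```

Both derived columns carry the same information; the categorical form suits a classification model and the numeric form suits a regression model.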


Figure 20. Data flow of the PredictColValue procedure

Example
To make the difference between the results of the PredictColValue procedure and the results of the PredictColumn procedure clear, the same data is used for both procedures. You might want to find a prediction model for the BANKCARD column. You are interested in the target value YES. Use the following command to run the PredictColValue procedure:
db2 "call IDMMX.PredictColValue(BANK.BANKCARD_PRED_YES, BANK.BANKCUSTOMERS, BANKCARD, YES)"

Figure 21 on page 41 shows the data flow of the PredictColValue procedure based on the example in this section.


Figure 21. Data flow of the PredictColValue procedure based on the example in this section

The output view BANK.BANKCARD_PRED_YES in Figure 21 shows that the TARGET_VALUE column contains only the target value YES. The classification confidence is computed with respect to the target value YES even when the value of the target column is different. For example, if the value of a target column for a certain record is NO, the confidence is computed with regard to the target value YES. In this example, the target column is BANKCARD.
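Because the confidence always refers to the target value YES, the output view can be used to rank customers who do not have a bank card yet. The following Python sketch shows this idea on hypothetical rows; the customer IDs, the confidence values, and the helper top_prospects are assumptions made for illustration, and in practice you would query the BANK.BANKCARD_PRED_YES view instead.

```python
# Hypothetical rows of the output view: (customer id, BANKCARD, CONFIDENCE).
# CONFIDENCE always refers to the target value YES, whatever BANKCARD contains.
rows = [
    (101, "NO", 0.91),
    (102, "YES", 0.88),
    (103, "NO", 0.12),
    (104, "NO", 0.67),
    (105, "YES", 0.35),
]


def top_prospects(view_rows, size):
    """Return the ids of the `size` customers without a bank card that have
    the highest confidence for the target value YES."""
    without_card = [row for row in view_rows if row[1] == "NO"]
    ranked = sorted(without_card, key=lambda row: row[2], reverse=True)
    return [cid for cid, _, _ in ranked[:size]]


prospects = top_prospects(rows, 2)
print(prospects)
```

The size argument stands for the size of the target audience, which is a business decision and not something that the procedure determines.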

How to go further
Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the call of the Easy Mining procedure. For more information, see Optional parameter strings on page 99. For the PredictColValue procedure, you can also apply the hints and tips of the PredictColumn procedure. They are described in How to go further on page 36.

Finding explanations for specific events (ExplainColValue procedure)


With the ExplainColValue procedure, you can find explanations in your data that indicate that the target column contains the target value.

When to do it
Your database might contain data tables that describe the production of a particular product, for example, cars. Each record in the data table represents information about a particular car and details about its production process. One column in these tables represents a specific event, for example, whether the car was delivered on time, or whether it met the expected quality criteria.


If the cars are not delivered on time, you might want to find out why the delivery is delayed. If the cars do not meet the quality criteria, you might want to find out why they do not meet the quality criteria. When you know the reasons, you can take the appropriate actions to improve the production process of the cars. When your database contains customer data, there might be a column that indicates that customers have canceled their contracts. When you know why customers have canceled their contracts, you can improve customer retention in the future.

How to do it
To find explanations, use the ExplainColValue procedure. Syntax:
IDMMX.ExplainColValue(<explanationView>, <inputTable>, <targetColumn>, <targetValue>, <maxExplLength>)

Input parameters: With the ExplainColValue procedure, you must specify the following parameters:

<explanationView>
The name of the view that you want to build. It contains the explanations in textual form. The ExplainColValue procedure creates a view, a classification model, and a rules model. The models are stored in the tables IDMMX.ClassifModels and IDMMX.RuleModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.

<targetColumn>
The name of the column that you are particularly interested in. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

<targetValue>
The value of the target column that you are particularly interested in. This parameter is of type VARCHAR. Its size is 1024.

<maxExplLength>
The value that determines the maximum length of the explanation. This parameter is of type SMALLINT.

Output: The ExplainColValue procedure creates a view that includes the following columns:

ORIGIN
This column shows the mining function that has computed the explanation. This column can contain the values A or C.
- A means that the Associations mining function has computed the explanation.
- C means that the Classification mining function has computed the explanation.

TARGET
This column contains the value <targetColumn>=<targetValue>, for example, BANKCARD=YES, as the target of the explanation. It corresponds to the rule head of association rules.

EXPLANATION
This column contains the explanation in textual form. It corresponds to the rule body of association rules.

CONFIDENCE
This column contains the confidence value for the explanation. The confidence value represents the reliability of the explanation.

SUPPORT
This column contains the support value for the explanation. The support value indicates for how many records of the input table this explanation is valid.

Figure 22. Data flow of the ExplainColValue procedure

Data flow of the ExplainColValue procedure: The ExplainColValue procedure generates explanations such that the target column contains a target value. These explanations represent rules that contain a rule head of <targetColumn> = <targetValue>. The rules are stored in the generated view. These rules are derived from a classification model and a rule model. These models are stored in the IDMMX.ClassifModels table and in the IDMMX.RuleModels table under the same name as the generated view.

For the classification model, the explanations are the paths to the leaf nodes labeled with the target value. The classification model is computed on the entire input table. The associations model is computed on all columns of the input table. Then only the rules that contain a rule head of <targetColumn> = <targetValue> are selected.

Example
This example is based on the input table BANK.BANKCUSTOMERS that is also used for the ClusterTable procedure in Figure 5 on page 13. You might want to find explanations why the BANKCARD column in the BANK.BANKCUSTOMERS table contains the value YES. You might want to specify 3 as the maximum value for the length of the explanations. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.ExplainColValue(BANK.BANKCARD_EXPLANATIONS, BANK.BANKCUSTOMERS, BANKCARD, YES, 3)"

The ExplainColValue procedure creates the output view BANK.BANKCARD_EXPLANATIONS that is shown in Figure 23. This output view contains the explanations.

Figure 23. The output view BANK.BANKCARD_EXPLANATIONS

Looking at the explanations that have an explanation confidence of 84.243% in Figure 23, you can see that the EXPLANATION column contains the explanation ONLINE_ACCESS=YES.
- The explanation confidence of 84.243% means that 84.243% of all customers with online access to their account also have a bank card.
- The support value of 18.019% means that 18.019% of all customers have online access and a bank card.
- The values A and C in the ORIGIN column mean that one rule is computed by the Associations mining function and the other rule is computed by the Classification mining function.


These explanations show that having an online account has an important influence on having a banking card. A business action that is derived from this information is, for example, that you offer customers a banking card when they ask for online access to their accounts.
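The meaning of the confidence value and the support value in this example can be made concrete with a small Python sketch. It follows the interpretation given for Figure 23 (confidence relative to the customers with online access, support relative to all customers). The customer records and the function name are hypothetical, and the sketch is not the algorithm of the Associations mining function.

```python
# Hypothetical customers: (ONLINE_ACCESS, BANKCARD).
customers = [
    ("YES", "YES"),
    ("YES", "YES"),
    ("YES", "NO"),
    ("NO", "YES"),
    ("NO", "NO"),
    ("NO", "NO"),
    ("NO", "NO"),
    ("NO", "NO"),
]


def rule_confidence_and_support(rows, body, head):
    """Confidence and support of the rule ONLINE_ACCESS=body -> BANKCARD=head.

    confidence = records matching body and head / records matching body
    support    = records matching body and head / all records
    """
    body_count = sum(1 for access, _ in rows if access == body)
    both_count = sum(1 for access, card in rows if access == body and card == head)
    confidence = both_count / body_count if body_count else 0.0
    support = both_count / len(rows) if rows else 0.0
    return confidence, support


confidence, support = rule_confidence_and_support(customers, "YES", "YES")
print(f"confidence={confidence:.3f} support={support:.3f}")
```

A rule can have a high confidence and still a low support when only a few customers match the rule body, so both values should be read together.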

How to go further
Sometimes you might want to set an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the call of the Easy Mining procedure. For more information, see Optional parameter strings on page 99.

Always check whether there is a causal dependency between the explanation target and an input field that occurs in the explanations. For more information about causal dependencies, see How to go further on page 36. If there is a dependency, remove this field from the input fields by using the DM_remDataSpecFld option. For more information, see Removing fields from the input table on page 8.

Most important fields (FindMostImpFields procedure)


You can find the most important fields by using the FindMostImpFields procedure.

When to do it
You might collect data about customers or a manufacturing process. The data might be difficult or expensive to collect. Therefore you want to know whether you need to collect complete data, or if there are parts of the data that you do not need to collect. For example, if you collect only the required information, you can enhance the satisfaction of your customers because you need not ask them superfluous questions. Or you can optimize your database design by keeping only the required information in the tables. This way you can reduce the costs of your business intelligence infrastructure.

How to do it
You are looking for the most important fields in a table or a view of your database. You can determine the most important fields only with respect to a certain purpose. The purpose is represented by the target column. The FindMostImpFields procedure finds the most important fields in a table. It determines the importance of the fields of this table with respect to the prediction of the values of the target column. Syntax:
IDMMX.FindMostImpFields(<mostImpFieldView>, <inputTable>, <targetColumn>)

Input parameters: With the FindMostImpFields procedure, you must specify the following parameters:

<mostImpFieldView>
The name of the view that you want to build. The FindMostImpFields procedure creates a view and a model. Depending on the type of the target column, a classification model or a regression model is generated. It is stored in one of the following tables under the same name as the generated view:
- IDMMX.ClassifModels
- IDMMX.RegressionModels

If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. Depending on the type of the target column, the Tree Classification mining function or the Regression mining function computes the most important fields of this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 257.

<targetColumn>
The name of the target column. The values of the target column can be categorical or numeric. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

Output: The generated view contains the following columns:

FIELDNAME
This column contains the name of the column.

IMPORTANCE
This column contains the value that indicates the importance of that field. The importance value ranges from 0 to 1. The higher this value is, the greater is the importance of the field.


Figure 24. Data flow of the FindMostImpFields procedure

Example
You might want to determine the most important field in the BANK.BANKCUSTOMERS input table that is shown in Figure 5 on page 13. The target column in this table is BANKCARD. Use the following command to run the FindMostImpFields procedure:
db2 "call IDMMX.FindMostImpFields(BANK.BANKCARD_MOSTIMPFIELDS, BANK.BANKCUSTOMERS, BANKCARD)"

BANK.BANKCARD_MOSTIMPFIELDS is the name of the view that contains the most important fields together with their importance value. The BANK.BANKCARD_MOSTIMPFIELDS view looks like this:

Figure 25. The output view BANK.BANKCARD_MOSTIMPFIELDS

Figure 25 shows that the ONLINE_ACCESS column has the highest importance value. Other columns with a high importance value are AGE and AVERAGE_BALANCE. These are the columns with the highest influence on having a bank card. The columns PROFESSION, MARITAL_STATUS, and SAVINGS_ACCOUNT have the lowest importance values. Therefore their values are less relevant in the context of having a bank card.

Using Easy Mining procedures for basic mining steps


Mining tasks include several mining steps. Mining steps are specific to the different mining functions. Therefore you need to have some experience in data mining if you want to use the Easy Mining procedures for basic mining steps.

The following mining functions of IM Modeling provide Easy Mining procedures for basic mining steps:
- Classification mining function
- Regression mining function
- Clustering mining function
- Associations mining function
- Sequences mining function

For more information about these mining functions, see IBM InfoSphere Warehouse: Creating mining models with Intelligent Miner Modeling. The following table provides an overview of the available Easy Mining procedures for basic mining steps.
Table 3. Overview of the Easy Mining procedures for basic mining steps

Tasks | Classification mining procedure | Regression mining procedure | Clustering mining procedure | Associations mining procedure | Sequences mining procedure
Building models | BuildClasModel | BuildRegModel | BuildClusModel | BuildRuleModel | BuildSeqRuleModel
Testing models | TestClasModel | TestRegModel | - | - | -
Applying models | ApplyClasModel | ApplyRegModel | ApplyClusModel | ApplyRuleModel | ApplySeqRuleModel
Exporting models | ExportClasModel | ExportRegModel | ExportClusModel | ExportRuleModel | ExportSeqRuleModel
Exporting test result | ExportClasTestResult | ExportRegTestResult | - | - | -
Building mining views | BuildClasView | BuildRegView | BuildClusView | BuildRuleView | BuildSeqRuleView

Easy Mining procedures for classification mining steps


There are Easy Mining procedures for the following mining steps:
- Building or training models
- Testing models
- Applying or scoring models
- Building mining views
- Exporting models or test results

When to use them


Classification and regression are prediction techniques. This means that the Classification mining function and the Regression mining function learn from known cases to predict an outcome in similar situations. Often, the known cases come from past experience, and what is learned from them is applied to new cases when it can be assumed that the situation has not greatly changed. The outcome of classification is a categorical value; the outcome of regression is a numeric value. Apart from this difference, classification and regression are similar.


If you want to use the Classification mining function, the outcome must be categorical. For example, it might be the car type that a customer is interested in. It might indicate whether a customer has responded to a mailing campaign, or whether a car was delivered on time. Other application areas for classification are churn prevention and cross-selling. You might want to predict the customers who are likely to leave so that you can take the appropriate countermeasures. Or you might want to predict whether customers are likely to buy certain products.

How to use them


With the Easy Mining procedures for classification mining steps, you can build, test, and apply classification models. You can also build classification views and export classification models and classification test results.

Building classification models: Before you can use the Classification mining function, you must collect historical data. This data must be in a table or a view. Each record in the table or the view describes the properties of an entity for which you know the outcome. For example, an entity might be a customer who has responded to a mailing campaign or a car that was produced on time.

Based on the historical data, the Classification mining function determines which properties distinguish the entities with a certain outcome, for example, having responded to a mailing campaign, from entities that have another outcome, for example, not having responded. For example, the Classification mining function might find out that young and single customers are more likely to respond to a mailing campaign than other customers. The Classification mining function stores this information in a classification model. This step is called training of a model.

You can build a classification model by using the BuildClasModel procedure. Syntax:
IDMMX.BuildClasModel(<modelName>, <inputTable>, <targetColumn>)

Input parameters: With the BuildClasModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to build. The generated model is stored in the IDMMX.ClassifModels table. If a model with the same name already exists, the previous model is replaced with the new model. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. The values in the columns of the input table are used to determine the distinguishing properties for each value of the target column. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 240.

<targetColumn>
The name of the column whose values are to be predicted. The target column must contain only categorical values. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154. This parameter is of type VARCHAR. Its size is 128.

Example: You might want to build the model BANK.BANKCARD_CLASMODEL to predict the BANKCARD column, based on the training data in the BANK.BANKCUSTOMERS_TRAIN table. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.BuildClasModel(BANK.BANKCARD_CLASMODEL, BANK.BANKCUSTOMERS_TRAIN, BANKCARD)"

Testing classification models: A classification model assigns to each entity the outcome it considers to be the most appropriate. The outcome is the predicted value that is determined by the classification model. The predicted value for the outcome can differ from the actual value of the outcome because a classification model is almost never perfect. The main reason for this is that reality is much more complex than its representation in a mining model.

To receive a better estimate for the model quality, you can apply the model on data that already includes the outcome but is different from the training data. The easiest way to get such a data set is to split the original data table that includes the historical data into two parts. One part is used for the training of the model, and the other part is used for determining the model quality only. This mining step is called the testing of the model. You can test models by using the TestClasModel procedure.

Model quality: The quality of a classification model is the percentage of correct predictions for a given data set. The goal is to get a model with a high quality. The fraction of correct predictions should be as high as possible when you use the model to predict the outcome for entities for which the outcome is not yet known. So when you use a classification model to select customers for a mailing campaign, you want the model to select only those customers who are the most likely to respond.

Overtrained models: The model quality based on the data that you use for the training of the model might not always be a good indicator for the number of correct predictions on unseen data. The model might be tuned too much towards the records in the training data set. This is called overtraining of a model. In the extreme case, an overtrained model contains all records of the training data set. This means that the model learns the training data by heart. It is capable of giving correct predictions for all records of the training data set. However, when applied to unseen data, this model might perform poorly because the data might contain records that were not included in the training data, and the model has no basis for predicting their outcome.

Syntax:

IDMMX.TestClasModel(<modelName>, <inputTable>, <testResultName>)

Input parameters: With the TestClasModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to test. The model is stored in the IDMMX.ClassifModels table. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. This parameter is of type VARCHAR. Its size is 240.

<testResultName>
The name of the result of testing the model. The classification test result is stored in the IDMMX.ClasTestResults table. This parameter is of type VARCHAR. Its size is 240.

Example: To test the classification model, you might want to apply the model BANK.BANKCARD_CLASMODEL to the data in the input table BANK.BANKCUSTOMERS_TEST and save the test result BANK.BANKCARD_TESTRESULT in the IDMMX.ClasTestResults table. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.TestClasModel(BANK.BANKCARD_CLASMODEL, BANK.BANKCUSTOMERS_TEST, BANK.BANKCARD_TESTRESULT)"
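The ideas in this section (splitting the historical data, model quality as the fraction of correct predictions, and overtraining) can be illustrated outside the database. The following Python sketch is not part of the Easy Mining procedures. The records, the quality helper, and the memorizing model are hypothetical and only show why the quality measured on the training data can overstate the quality on unseen data.

```python
from collections import Counter

# Hypothetical historical records: ((age_group, online_access), bankcard).
# One part is used for training, the other part only for testing.
training = [
    (("young", "YES"), "YES"),
    (("young", "NO"), "NO"),
    (("middle", "YES"), "YES"),
    (("middle", "NO"), "YES"),
]
test = [
    (("senior", "YES"), "NO"),
    (("senior", "NO"), "NO"),
    (("young", "YES"), "YES"),
    (("middle", "NO"), "YES"),
]


def quality(model, data):
    """Return the fraction of records whose predicted outcome equals the actual outcome."""
    correct = sum(1 for features, actual in data if model(features) == actual)
    return correct / len(data)


# An extreme overtrained model: it stores every training record "by heart".
memorized = dict(training)
# Assumed fallback for unseen records: the most frequent training outcome.
most_frequent = Counter(outcome for _, outcome in training).most_common(1)[0][0]


def overtrained_model(features):
    return memorized.get(features, most_frequent)


training_quality = quality(overtrained_model, training)
test_quality = quality(overtrained_model, test)
print(training_quality, test_quality)
```

With these made-up records, the memorizing model predicts every training record correctly but only half of the test records, which is the gap that a separate test table is meant to expose.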

Applying classification models: The goal of finding a good prediction model is to apply it to data that does not yet include values for the outcome. This mining step is called scoring or applying. You can apply classification models by using the ApplyClasModel procedure. Syntax:


IDMMX.ApplyClasModel(<modelName>, <inputTable>, <outputView>, <predClasColumn>, <confidenceColumn>)

Input parameters:

With the ApplyClasModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.ClassifModels table.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view.
    This parameter is of type VARCHAR. Its size is 240.

<outputView>
    The name of the generated output view. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.

Output view:

The output view includes the columns of the input table and the following additional columns:

<predClasColumn>
    The name of the column that contains the predicted value for the target column (BANKCARD in the example that follows).
    This parameter is of type VARCHAR. Its size is 128.

<confidenceColumn>
    The name of the column that contains the value for the classification confidence. If you do not want to include the confidence column in the output view, leave the string for this parameter empty or specify NULL for this parameter.
    The value for the classification confidence can range from 0 to 1. It indicates the confidence that the predicted class value for the record is correct.
    - A value close to 1 means a high confidence.
    - A value close to 0 means a low confidence.
    This parameter is of type VARCHAR. Its size is 128.

Example:

You might want to complete the following steps:
- Apply the model BANK.BANKCARD_CLASMODEL to the BANK.BANKCUSTOMERS table
- Store the result in the BANK.BANKCARD_APPLY view
- Additionally include the following columns in the BANK.BANKCARD_APPLY view:
    PRED_CLASS
    CONFIDENCE


Use the following command to run the Easy Mining procedure:


DB2 "call IDMMX.ApplyClasModel(BANK.BANKCARD_CLASMODEL, BANK.BANKCUSTOMERS, BANK.BANKCARD_APPLY, PRED_CLASS, CONFIDENCE)"
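After the output view exists, the predicted class and the confidence can be used together to select records, for example customers for a mailing campaign. The following Python sketch works on hypothetical rows as they might be read from such a view. The row values, the CLIENT_ID column, and the threshold of 0.8 are assumptions for illustration, not part of the product.

```python
# Hypothetical rows read from an output view such as BANK.BANKCARD_APPLY.
rows = [
    {"CLIENT_ID": 1, "PRED_CLASS": "YES", "CONFIDENCE": 0.92},
    {"CLIENT_ID": 2, "PRED_CLASS": "NO", "CONFIDENCE": 0.85},
    {"CLIENT_ID": 3, "PRED_CLASS": "YES", "CONFIDENCE": 0.55},
    {"CLIENT_ID": 4, "PRED_CLASS": "YES", "CONFIDENCE": 0.81},
]


def select_targets(rows, wanted_class="YES", min_confidence=0.8):
    """Return the IDs of records predicted as wanted_class with at least min_confidence."""
    return [
        row["CLIENT_ID"]
        for row in rows
        if row["PRED_CLASS"] == wanted_class and row["CONFIDENCE"] >= min_confidence
    ]


targets = select_targets(rows)
print(targets)
```

Raising the threshold selects fewer records with a higher expected hit rate. The appropriate value depends on the data and on the cost of a wrong selection.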

Building classification views: The BuildClasView procedure combines the training of a classification model on the input table with the application of the model to the same input table. Combining the training and the application of a model is useful, for example, if you are interested in cross-selling opportunities. If you want to know the customers who are likely to buy a certain product but have not yet done so, you might want to apply the trained model directly to the input data to compare the predicted value with the actual value. This mining step is called building classification views. You can build classification views by using the BuildClasView procedure. Syntax

IDMMX.BuildClasView(<viewName>, <inputTable>, <targetColumn>)

Input parameters

With the BuildClasView procedure, you must specify the following parameters:

<viewName>
    The name of the view that you want to build. The BuildClasView procedure creates a view and a model. The model is stored in the table IDMMX.ClassifModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view.
    This parameter is of type VARCHAR. Its size is 240.

<targetColumn>
    The name of the column that contains the predicted values. The target column must contain only categorical values. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
    This parameter is of type VARCHAR. Its size is 128.

Example

You might want to create the classification view BANK.BANKCARD_CLASVIEW:
- The BANK.BANKCARD_CLASVIEW view contains the column BANKCARD.
- The BANKCARD column contains values that are predicted from the columns of the input table BANK.BANKCUSTOMERS.

Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.BuildClasView(BANK.BANKCARD_CLASVIEW, BANK.BANKCUSTOMERS, BANKCARD)"

Output view

When the classification model is applied to the data in the input table, the view is created. This view contains the columns of the input table and the following additional columns:

PRED_CLASS
    This column contains the predicted values for the target column.

CONFIDENCE
    This column contains the value for the classification confidence of the record.

Exporting classification models and test results:

See Exporting models and test results on page 82 for information about exporting models and test results.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

With the BuildClasView procedure, you can remove a field from the input data or set the depth of a classification tree by using the appropriate optional parameter string.

Removing fields from the input table:

It is important to select the appropriate input fields to prevent the model from becoming overtrained. For more information about overtrained models, see Overtrained models on page 50. For more information about removing input fields, see Removing fields from the input table on page 8.

Setting the depth of a classification tree:

IM Modeling uses the Tree Classification mining function for classification. Classification trees can get very complex. You can limit the tree depth of classification trees by using the DM_setTreeClasPar option. For example, to set the tree depth to 5, the DM_setTreeClasPar option looks like this:
DM_setTreeClasPar(MaxDth, 5)

There are more optional parameter strings for the Easy Mining procedures. For more information, see Optional parameter strings on page 99. Complete procedure call: The complete procedure call including the optional parameter string looks like this:
db2 "call IDMMX.BuildClasView(BANK.BANKCARD_CLASVIEW, BANK.BANKCUSTOMERS, BANKCARD, DM_setTreeClasPar(MaxDth, 5))"


Easy Mining procedures for regression mining steps


The Easy Mining procedures for regression mining steps support the following mining steps:
- Building or training models
- Testing models
- Applying or scoring models
- Building mining views
- Exporting models or test results

When to use them


Regression and classification are prediction techniques. This means that the Regression mining function and the Classification mining function learn from the past to predict an outcome in the future. The outcome of regression is a numeric value, whereas the outcome of classification is a categorical value. Apart from this difference, regression and classification are similar.

The Regression mining function predicts continuous numeric values, for example, the estimated annual revenue for each customer. It can also predict discrete numeric values, for example, 1 and 0.
- 1 might mean that a customer has responded to a mailing campaign or has left the bank.
- 0 might mean that the customer did not respond or did not leave the bank.

Therefore you can use the Regression mining function to solve classification problems, provided that the outcome has exactly two categories, for example, YES and NO. If YES is mapped to 1 and NO is mapped to 0, the resulting regression model predicts values close to 1 if YES is likely and values close to 0 if NO is likely.
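The mapping of a two-category outcome to 1 and 0 can be sketched in a few lines of Python. This is an illustration only: the predicted values are made up, and the cut-off of 0.5 for turning a predicted number back into YES or NO is an assumption, not a documented product default.

```python
# Mapping of the two categories to numeric target values, as described above.
ENCODING = {"YES": 1, "NO": 0}


def decode(predicted_value, threshold=0.5):
    """Turn a numeric regression prediction back into a category (assumed cut-off)."""
    return "YES" if predicted_value >= threshold else "NO"


# Historical outcomes become the numeric target column for regression training.
outcomes = ["YES", "NO", "NO", "YES"]
numeric_target = [ENCODING[outcome] for outcome in outcomes]

# Hypothetical values predicted by a regression model for four new records.
predicted_values = [0.91, 0.12, 0.47, 0.66]
predicted_categories = [decode(value) for value in predicted_values]
print(numeric_target, predicted_categories)
```

Values near the cut-off, such as 0.47 in this example, are the uncertain cases. Depending on the application, you might treat them separately instead of forcing them into one category.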

How to use them


With the Easy Mining procedures for regression mining steps you can build, test, and apply regression models. You can also build regression views and export regression models or regression test results. Because the regression mining steps and the classification mining steps are identical, the following sections contain only the syntax of the Easy Mining procedure and a corresponding example. For more information, see the corresponding chapters in Easy Mining procedures for classification mining steps on page 48. Building regression models: You can build regression models by using the BuildRegModel procedure. Syntax

IDMMX.BuildRegModel(<modelName>, <inputTable>, <targetColumn>)

Input parameters

You must specify the following parameters for the BuildRegModel procedure:

<modelName>
    The name of the model that you want to build. The BuildRegModel procedure creates a model and saves it in the table IDMMX.RegressionModels. If a model with the same name already exists, the previous model is replaced with the new model.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The BuildRegModel procedure starts a regression training run on this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 240.

<targetColumn>
    The name of the column that includes the predicted values. The target column must contain numeric values. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
    This parameter is of type VARCHAR. Its size is 128.

Example

You might want to predict the AVERAGE_BALANCE column of the BANK.BANKCUSTOMERS table by building the model BANK.AVERAGE_BALANCE_REGMODEL. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.BuildRegModel(BANK.AVERAGE_BALANCE_REGMODEL, BANK.BANKCUSTOMERS, AVERAGE_BALANCE)"

Testing regression models: You can test regression models by using the TestRegModel procedure. Syntax

IDMMX.TestRegModel(<modelName>, <inputTable>, <testResultName>)

Input parameters

You must specify the following parameters for the TestRegModel procedure:

<modelName>
    The name of the model that you want to test. The model is stored in the IDMMX.RegressionModels table.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table.
    This parameter is of type VARCHAR. Its size is 240.

<testResultName>
    The name of the result of testing the model. The test result is stored in the IDMMX.RegTestResults table.
    This parameter is of type VARCHAR. Its size is 240.

Example

You might want to test the regression model BANK.AVERAGE_BALANCE_REGMODEL by applying it to the data in the BANK.BANKCUSTOMERS table and to save the test result under the name BANK.AVERAGE_BALANCE_TESTRESULT in the IDMMX.RegTestResults table. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.TestRegModel(BANK.AVERAGE_BALANCE_REGMODEL, BANK.BANKCUSTOMERS, BANK.AVERAGE_BALANCE_TESTRESULT)"

Applying regression models: You can apply regression models to a table or to a view by using the ApplyRegModel procedure. Syntax

IDMMX.ApplyRegModel(<modelName>, <inputTable>, <outputView>, <predValueColumn>, <predStdDevColumn>)

Input parameters

With the ApplyRegModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RegressionModels table.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The ApplyRegModel procedure applies the regression model to the data in this table.
    This parameter is of type VARCHAR. Its size is 240.

<outputView>
    The name of the output view that you want to build. If a view with the same name already exists, the previous view is replaced with the new view. The output view contains the columns of the input table and additional columns that include values for the predicted value and the estimated standard deviation for the predicted value.
    This parameter is of type VARCHAR. Its size is 240.

<predValueColumn>
    The name of the column in the output view that contains the predicted numeric values.
    This parameter is of type VARCHAR. Its size is 128.

<predStdDevColumn>
    The name of the column in the output view that contains the value for the estimated standard deviation of the predicted value. The predicted standard deviation indicates the expected range of the actual value after you have applied the model. If you do not want to include the column for the estimated standard deviation in the output view, leave this string empty or specify NULL for this parameter.
    This parameter is of type VARCHAR. Its size is 128.

Example

You might want to complete the following steps:
- Apply the regression model BANK.AVERAGE_BALANCE_REGMODEL to the input table BANK.BANKCUSTOMERS
- Build the output view BANK.AVERAGE_BALANCE_APPLY
- Additionally include the following columns in the output view BANK.AVERAGE_BALANCE_APPLY:
    PRED_AVG_BALANCE
    PRED_STD_DEV

Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.ApplyRegModel(BANK.AVERAGE_BALANCE_REGMODEL, BANK.BANKCUSTOMERS, BANK.AVERAGE_BALANCE_APPLY, PRED_AVG_BALANCE, PRED_STD_DEV)"

Building regression views: You can build regression views by using the Easy Mining procedure BuildRegView. This procedure combines the training of regression models on the input table with the application of these models to the same table. Syntax

IDMMX.BuildRegView(<viewName>, <inputTable>, <targetColumn>)

Input parameters

With the BuildRegView procedure, you must specify the following parameters:

<viewName>
    The name of the view that you want to build. The BuildRegView procedure creates a view and a model. The model is stored in the table IDMMX.RegressionModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view.
    This parameter is of type VARCHAR. Its size is 240.

<targetColumn>
    The name of the column that contains the predicted value. The target column must contain numeric values. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
    This parameter is of type VARCHAR. Its size is 128.

Output view

When the regression model is applied to the data in the input table, the view is created. This view contains the columns of the input table and the following columns:

PRED_VALUE
    This column contains the predicted values.

PRED_STD_DEV
    This column contains the estimated standard deviation for the predicted values.

Example

You might want to build the regression view BANK.BANKCARD_REGVIEW:
- The BANK.BANKCARD_REGVIEW additionally includes the AVERAGE_BALANCE column.
- The AVERAGE_BALANCE column contains the values that are predicted from the columns of the input table BANK.BANKCUSTOMERS.

Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.BuildRegView(BANK.BANKCARD_REGVIEW, BANK.BANKCUSTOMERS, AVERAGE_BALANCE)"

Exporting regression models or test results: See Exporting models and test results on page 82 for information about exporting models and test results.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

Selecting regression algorithms:

Intelligent Miner Modeling provides the following regression algorithms:
- Transform Regression
- Linear Regression
- Polynomial Regression

By default, the Easy Mining procedures use Transform Regression.

You can change the default algorithm for the BuildRegModel procedure and the BuildRegView procedure by using the following option strings with the DM_setAlgorithm option:

Linear Regression
    DM_setAlgorithm(linear)

Polynomial Regression
    DM_setAlgorithm(Polynomial)

For example, you might want to use the Linear Regression algorithm to build a regression model. The procedure call looks like this:
db2 "call IDMMX.BuildRegModel(BANK.AVERAGE_BALANCE_LINREGMODEL, BANK.BANKCUSTOMERS, AVERAGE_BALANCE, DM_setAlgorithm(linear))"

Selecting input fields:

It is important to select the appropriate input fields to prevent the model from becoming overtrained. There might be logical dependencies between the outcome and some of the input fields. This can occur, for example, if you want to predict customer churn. There might be an input field that indicates the development of the revenue with a customer. Typically, if a customer churns, the revenue declines. Therefore there is a close correlation between a negative revenue development and churn. However, if the decline occurs after the churn, this column is a consequence of the churn and not an appropriate indicator for predicting it. Therefore this field must be removed from the set of input fields. For more information about removing a field from the input data, see Removing fields from the input table on page 8.

Easy Mining procedures for clustering mining steps


The Easy Mining procedures for clustering support the following mining steps:
- Building or training clustering models
- Applying or scoring clustering models
- Building mining views
- Exporting clustering models

When to use them


The Clustering mining function discovers homogeneous groups of records in your data. These groups are called clusters or segments. These clusters can represent, for example, customers who share a common buying behavior or stores with common sales characteristics.

Typically, you use the Clustering mining function to find groups whose members you want to treat similarly. For example, if you have a cluster of customers with a common buying behavior, a cluster might contain a specific target group for a marketing campaign. If you have a cluster of stores, you can optimize their offers depending on the cluster they belong to.

You can also use the Clustering mining function to detect outliers. Because the members of a cluster share a degree of similarity, the members of the smallest clusters, the niches, have little in common with the members of the other, bigger clusters. The members of the niches are the outliers or the deviations in a data set. They can represent untypical behavior or inconsistencies in your data.
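The use of small clusters as an outlier indicator can be sketched as follows. This Python example is an illustration only: the record IDs and cluster assignments are made up, and the rule that a cluster holding at most 20% of the records counts as a niche is an assumption chosen for this example.

```python
from collections import Counter

# Hypothetical cluster assignments: record ID -> cluster ID.
assignments = {
    "C001": 1, "C002": 1, "C003": 2, "C004": 1,
    "C005": 2, "C006": 3, "C007": 1, "C008": 2,
}


def niche_members(assignments, max_share=0.2):
    """Return the records that belong to clusters holding at most max_share of all records."""
    sizes = Counter(assignments.values())
    total = len(assignments)
    niches = {cluster for cluster, size in sizes.items() if size / total <= max_share}
    return sorted(record for record, cluster in assignments.items() if cluster in niches)


outliers = niche_members(assignments)
print(outliers)
```

In this data, cluster 3 contains only one of eight records, so its single member is reported as a candidate outlier. Whether such a record is a genuine deviation or a data inconsistency still needs to be checked.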


How to use them


You can create, apply, and export clustering models. You can also create clustering views.

Building clustering models:

The input for the Clustering mining function must be a table or a view. In this table or view, each record represents the properties of an entity to be clustered. For customer segmentation, such a table might contain demographic information about each customer, for example, age, marital status, or profession. The table might also contain figures about each customer's buying behavior, for example, the total revenue per month per product category. If you run the Clustering mining function on such a table, you get a clustering model that assigns to each record of the input table the cluster it belongs to. This is called the training of a clustering model.

Syntax

IDMMX.BuildClusModel(<modelName>, <inputTable>)

Input parameters

With the BuildClusModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to build. The model is stored in the IDMMX.ClusterModels table. If a model with the same name already exists in the IDMMX.ClusterModels table, the previous model is replaced with the new model.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table. The BuildClusModel procedure starts a clustering run on the data of this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 240.

Example

You might want to create the clustering model BANK.CUSTOMERS_CLUSMODEL of the customers of a bank. The information about these customers is stored in the BANK.BANKCUSTOMERS table. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.BuildClusModel(BANK.CUSTOMERS_CLUSMODEL, BANK.BANKCUSTOMERS)"

Applying clustering models:

After you have built the model, you might want to determine the records that are assigned to the individual clusters. This mining step is called scoring of a clustering model. To apply a clustering model, you can use the ApplyClusModel procedure.

Syntax

IDMMX.ApplyClusModel(<modelName>, <inputTable>, <outputView>, <clusterIDColumn>, <qualityColumn>, <confidenceColumn>)

Input parameters

With the ApplyClusModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to apply. The model is stored in the IDMMX.ClusterModels table.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The ApplyClusModel procedure applies the clustering model to the data in this table.
    This parameter is of type VARCHAR. Its size is 240.

<outputView>
    The name of the output view that you want to create. If a view with the same name already exists, the previous view is replaced with the new view. The output view contains the columns of the input table and additional columns that contain the cluster identification, the clustering quality, and the clustering confidence.
    This parameter is of type VARCHAR. Its size is 240.

<clusterIDColumn>
    The name of the column that contains the identification of the cluster the record is assigned to.
    This parameter is of type VARCHAR. Its size is 128.

<qualityColumn>
    The name of the column that contains the value for the clustering quality. The value for clustering quality indicates how well the record fits into its cluster. The value can range from 0 to 1.
    - A value close to 1 means a good fit.
    - A value close to 0 means a bad fit.
    If you do not want to include the quality column in the output view, leave the string for this parameter empty, or specify NULL for this parameter.
    This parameter is of type VARCHAR. Its size is 128.

<confidenceColumn>
    The name of the column that contains the value for the computed cluster confidence. The clustering confidence indicates the confidence that the assigned cluster ID is the best cluster for this record. The value can range from 0 to 1.
    - A value close to 1 means that the assigned cluster is the only cluster where this record fits in well.
    - A value close to 0.5 means that there are other clusters this record might fit in well.
    If you do not want to include the confidence column in the output view, leave the string for this parameter empty, or specify NULL for this parameter.
    This parameter is of type VARCHAR. Its size is 128.

Example

You might want to create the output view BANK.CUSTOMERS_CLUS_APPLY by applying the model BANK.CUSTOMERS_CLUSMODEL to the input table BANK.BANKCUSTOMERS. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.ApplyClusModel(BANK.CUSTOMERS_CLUSMODEL, BANK.BANKCUSTOMERS, BANK.CUSTOMERS_CLUS_APPLY, CLUSTERID, CLUS_QUALITY, CLUS_CONFIDENCE)"

Building a clustering view:

You can combine the following clustering mining steps by creating a clustering view:
- Building a clustering model
- Applying a clustering model

If you build a clustering view, the clustering model is built and applied to the records of the input table or view. The clustering view indicates the clusters that each record belongs to. You can build a clustering view by using the BuildClusView procedure.

Syntax
IDMMX.BuildClusView(<viewName>, <inputTable>)

Input parameters

With the BuildClusView procedure, you must specify the following parameters:

<viewName>
    The name of the view that you want to build. The BuildClusView procedure creates a view and a model. The model is stored in the table IDMMX.ClusterModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The BuildClusView procedure starts a clustering run on the data in this table.
    This parameter is of type VARCHAR. Its size is 240.

Output view

When the clustering model is applied to the data in the input table, the view is created. This view contains the columns of the input table and additionally the following columns:

CLUSTER_ID
    The identification of the cluster the record is assigned to

QUALITY
    The computed value for clustering score

CONFIDENCE
    The computed value for the clustering confidence

Example

You might want to create the clustering view BANK.CUSTOMERS_CLUSVIEW from customers of a bank. The information about the customers is stored in the table BANK.BANKCUSTOMERS. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.BuildClusView(BANK.CUSTOMERS_CLUSVIEW, BANK.BANKCUSTOMERS)"

Exporting clustering models: See Exporting models and test results on page 82 for information about exporting models and test results.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the procedure call. For more information, see Optional parameter strings on page 99.

Selecting clustering algorithms:

Intelligent Miner Modeling provides the following clustering algorithms:
- Distribution Based Clustering
- Center Based Clustering

By default, the Easy Mining procedures use the Distribution Based Clustering algorithm. Center Based Clustering is based on Kohonen feature maps. If you want to use the Center Based Clustering algorithm to build a clustering model or a clustering view, you must use the option DM_setAlgorithm(Kohonen). For example, to build a clustering model, the procedure call looks like this:

DB2 "call IDMMX.BuildClusModel(BANK.CUSTOMERS_KOHONEN_CLUSVIEW, BANK.BANKCUSTOMERS, DM_setAlgorithm(Kohonen))"

Reducing the size of clusters: After you have used the Clustering mining function on a table, you might want to improve the model. For example, there might be a very big cluster and several smaller clusters. However, you want to have a more even partitioning of the data in your input table or view. You can reduce the size of the biggest cluster by increasing the similarity threshold. The similarity threshold represents the minimum degree of similarity for a record to be assigned to a particular cluster. If you increase the similarity threshold, fewer records are assigned to the same cluster. You can set the similarity threshold with the DM_setDClusPar option. The value for the similarity threshold ranges from 0 to 1. The default value for the similarity threshold is 0.5. This means that increasing the default value to a value of, for example, 0.7, reduces the size of the biggest cluster. The optional parameter string looks like this:
DM_setDClusPar(SimThr,0.7)

Setting the number of clusters: Increasing the similarity threshold has the side effect that the number of clusters increases. If you want to limit the number of clusters, you can use the DM_setMaxNumClus option. For example, you might want to set the maximum number of clusters to 10. The optional parameter string looks like this:
DM_setMaxNumClus(10)

Removing fields from the input table: You can also remove one or more fields from the input table. For more information, see Removing fields from the input table on page 8.

Complete procedure call


The complete procedure call including the optional parameter strings looks like this:
db2 "call IDMMX.BuildClusModel(BANK.CUSTOMERS_CLUSMODEL, BANK.BANKCUSTOMERS, DM_setMaxNumClus(10), DM_setDClusPar(SimThr,0.7))"

Easy Mining procedures for associations mining steps


The Easy Mining procedures for association rules support the following mining steps:
- Building association rule models
- Applying association rule models
- Building association rule views
- Exporting association rule models

When to use them


Association rules indicate the items that occur together in your data.

Transaction tables:

If you have a table that contains retail transaction data, association rules indicate the articles that are bought together. Retail transaction data is usually stored in tables that contain the following columns:
- A column that contains the transaction ID
- A column that contains the item that is included in a transaction

The items that have the same transaction ID belong to the same transaction. An association rule of retail transaction data might look like this:
If customers buy chocolate, they also buy candy.

With association rules, you can determine the placement of the products in a store. Placing products that are linked by an association rule close to each other can increase the sales of these products. With association rules, you can also determine the products that you might put on sale for a marketing campaign. Placing the articles that are linked by an association rule together, while putting only some of them on sale, might increase the sales of all of these products, those on sale and those not on sale. In an online store, you can use association rules to generate personalized product recommendations.

Relational tables:

Finding association rules is not limited to tables with retail transaction data. You can apply the Associations mining function to any relational table. In relational tables, the association rules indicate the values that occur together in the records of the table. For example, an association rule derived from a table of bank customers might look like this:
If the value of the online-access column is YES, the value of the bankcard column is also YES.

Association rules like this might reveal important relationships that are hidden in your data. These relationships can explain why certain columns have specific values. For example, in the manufacturing industry, association rules like this can explain why quality problems occur or why certain goods are not produced on time.

How to use them


You can build, apply, and export association rule models. You can also combine building and applying association rule models by building association rule views.

Building association rule models:

The Associations mining function computes association rules. The set of the computed association rules is called a rule model. The first part of an association rule is called the rule body. The second part of an association rule is called the rule head. You can build an association rule model by using the BuildRuleModel procedure.

Transaction tables:

If you use the Associations mining function on transaction tables, for example, retail transaction data, the rule body and the rule head might represent articles that occur in retail transactions, for example, chocolate or candy. The association rule might look like this:
If customers buy chocolate, they also buy candy.

Relational tables: If you use the Associations mining function on relational tables, for example, customer data of a bank, the rule body and the rule head might represent column value pairs, for example, online-access=YES. The association rule might look like this:
If online-access=YES then bankcard=YES

Confidence: An association rule might not always be valid. For example, if the following association rule has a confidence value of 60%, customers who buy chocolate also buy candy in only 60% of the cases:
If chocolate then candy

The degree of validity is indicated by the confidence value of an association rule. The confidence value is shown as a percentage. It states how often the rule head occurs in a transaction given that the rule body occurs in the transaction.

Support: Another attribute of an association rule is the support value. The support value indicates the relative frequency with which the conditions in the rule body and in the rule head hold together. For example:

- If the following association rule has a support value of 2%, it means that 2% of all sales transactions contain both chocolate and candy:
If chocolate then candy

- If the following association rule has a support value of 4%, it means that in 4% of all records the value of the ONLINE_ACCESS column is YES and the value of the BANKCARD column is also YES:
If online-access=YES then bankcard=YES
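
The definitions of confidence and support can be checked by hand on a small data set. The following Python sketch is illustrative only: the transactions are made up, the function names support and confidence are assumptions introduced here, and the code does not show how the Associations mining function is implemented.

```python
# Hypothetical market-basket data: each set is one transaction (one group ID).
transactions = [
    {"chocolate", "candy"},
    {"chocolate", "candy", "milk"},
    {"chocolate", "candy", "bread"},
    {"chocolate", "milk"},
    {"chocolate"},
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"candy", "bread"},
    {"milk", "bread"},
]


def support(body, head, transactions):
    """Percentage of all transactions that contain both the rule body and the rule head."""
    items = set(body) | set(head)
    hits = sum(1 for t in transactions if items <= t)
    return 100.0 * hits / len(transactions)


def confidence(body, head, transactions):
    """Percentage of the transactions that contain the body and also contain the head."""
    body, head = set(body), set(head)
    with_body = [t for t in transactions if body <= t]
    if not with_body:
        return 0.0
    hits = sum(1 for t in with_body if head <= t)
    return 100.0 * hits / len(with_body)


# Rule: If chocolate then candy
rule_support = support({"chocolate"}, {"candy"}, transactions)
rule_confidence = confidence({"chocolate"}, {"candy"}, transactions)
print(f"support={rule_support:.0f}%, confidence={rule_confidence:.0f}%")
```

In this made-up data set, 3 of 10 transactions contain both items, and 3 of the 5 transactions that contain chocolate also contain candy, so the rule would only be found if the minimum support is at most 30% and the minimum confidence is at most 60%.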

Syntax:
IDMMX.BuildRuleModel(<modelName>, <inputTable>, <groupColumn>, <minSupport>, <minConfidence>, <maxRuleLength>)

Input parameters: For the BuildRuleModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to build. The model is stored in the IDMMX.RuleModels table. If a model with the same name already exists, the previous model is replaced with the new model. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The BuildRuleModel procedure starts an associations run on this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 240.

<groupColumn>
    The name of the column that contains the group ID or the transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table are used as item columns. If you specify NULL or an empty string for the GROUP column, the BuildRuleModel procedure looks for association rules in the entire relational table. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

<minSupport>
    The minimum support for all association rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the value for minimum support is automatically determined to produce a result that contains at least some association rules. This parameter is of type REAL.

<minConfidence>
    The minimum confidence for all association rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the default value of 25% is automatically used as the lower limit for the confidence of an association rule. This parameter is of type REAL.

<maxRuleLength>
    The value for the maximum rule length. The rule length determines the maximum number of items that occur in an association rule. You must specify a value greater than or equal to 2. For example, if you specify 3 as the maximum rule length, an association rule contains at most two items in the rule body and one item in the rule head. If you specify a value less than or equal to 0, the maximum rule length is not limited. This parameter is of type INTEGER.

Example: You might want to build the model BANK.PRODUCT_RULES based on the BANK.CUSTOMER_PRODUCTS table. The BANK.CUSTOMER_PRODUCTS table includes the columns CLIENT_ID and PRODUCT. You might want to specify the CLIENT_ID column as the GROUP column because a transaction includes all products that are owned by an individual customer. Use the following command to run the Easy Mining procedure:
db2 "call IDMMX.BuildRuleModel('BANK.PRODUCT_RULES', 'BANK.CUSTOMER_PRODUCTS', 'CLIENT_ID', 5, 30, 3)"

If you want to run the BuildRuleModel procedure on the entire BANK.CUSTOMER_PRODUCTS table, you can use the following command:
db2 "call IDMMX.BuildRuleModel('BANK.PRODUCT_RULES', 'BANK.CUSTOMER_PRODUCTS', NULL, 5, 30, 3)"

Applying association rule models: You can apply association rule models that are based on transaction tables. After you have built an association rule model, you might want to apply it to new data records to determine the items in the rule head that can be inferred. If the new data represents the current contents of market baskets, you can consider the inferred items as the items that your customers are interested in. This means that you can recommend these items to your customers to increase your sales volume.

You can also apply association rule models to the input tables that were used to build the models. By applying association rule models to input tables, you can determine the transactions that support an association rule. Transactions that support an association rule contain the complete set of items in an association rule.

To apply an association rule model, you can use the ApplyRuleModel procedure.

Syntax
IDMMX.ApplyRuleModel(<modelName>, <inputTable>, <outputView>, <ruleIdColumn>, <headColumn>, <headNameColumn>, <supportColumn>, <confidenceColumn>, <liftColumn>, <matchHeadColumn>, <matchBodyColumn>)

Input parameters

With the ApplyRuleModel procedure, you must specify the following input parameters:

<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RuleModels table. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The ApplyRuleModel procedure applies the association rule model to the data in this table. The input table must contain the group column and the item columns that were required to build the model. This parameter is of type VARCHAR. Its size is 240.

<outputView>
    The name of the output view. The output view contains the group column and the following additional columns. These columns are only included in the output view if their names are not NULL or an empty string.
    - Rule ID column
    - Head column
    - Head name column
    - Support column
    - Confidence column
    - Lift column
    - Match head column
    - Match body column
    This parameter is of type VARCHAR. Its size is 240.

<ruleIdColumn>
    The name of the column that contains the ID of the matching association rule. This parameter is of type VARCHAR. Its size is 128.

<headColumn>
    The name of the column that contains the rule head item of the matching association rule. For each rule head item of an association rule, one record is returned. This parameter is of type VARCHAR. Its size is 128.

<headNameColumn>
    The name of the column that contains the name of the rule head item of the matching association rule. This parameter is of type VARCHAR. Its size is 128.

<supportColumn>
    The name of the column that contains the support value of the matching association rule. This parameter is of type VARCHAR. Its size is 128.

<confidenceColumn>
    The name of the column that contains the confidence value of the matching association rule. This parameter is of type VARCHAR. Its size is 128.

<liftColumn>
    The name of the column that contains the lift value of the matching association rule. This parameter is of type VARCHAR. Its size is 128.

<matchHeadColumn>
    The name of the column that contains the match head value. If the head item is included in the transaction, this column contains the value 1. If the head item is not included in the transaction, this column contains the value 0. This parameter is of type VARCHAR. Its size is 128.

<matchBodyColumn>
    The name of the column that contains the match body value. This value is equal to the number of items that are included in the rule body because only those association rules are returned whose body items are all included in the transaction. This parameter is of type VARCHAR. Its size is 128.

Example

To apply the association rule model BANK.PRODUCT_RULES to the table BANK.CUSTOMER_PRODUCTS, use the following command:
db2 "call IDMMX.ApplyRuleModel('BANK.PRODUCT_RULES', 'BANK.CUSTOMER_PRODUCTS', 'BANK.APPLY_RULES_VIEW', 'RULEID', 'HEAD', 'HEADNAME', 'SUPPORT', 'CONFIDENCE', 'LIFT', 'MATCHHEAD', 'MATCHBODY')"

You can determine the products that you can recommend to the customer with the customer ID 12634 by using the following command:
db2 "select HEAD, HEADNAME from BANK.APPLY_RULES_VIEW where CLIENT_ID = 12634 and MATCHHEAD = 0"

The condition MATCHHEAD=0 ensures that products that the customer has already bought are not recommended. You can retrieve transactions that support the association rule with the ID 25 by using the following command:
db2 "select CLIENT_ID from BANK.APPLY_RULES_VIEW where RULEID = 25 and MATCHHEAD = 1"

In the above command, the condition MATCHHEAD=1 is used because you want to retrieve the transactions that contain all items of the association rule with the ID 25.

Building association rule views: You can build an association rule model and apply it to the input table in a single step by using the BuildRuleView procedure.

Syntax
IDMMX.BuildRuleView(<viewName>, <inputTable>, <groupColumn>, <minSupport>, <minConfidence>, <maxRuleLength> [, <optionsString>])

Input parameters

<viewName>
    The name of the view that you want to build. The BuildRuleView procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model and a view with the same name already exist, the previous model and the previous view are replaced with the new model and the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the view. This parameter is of type VARCHAR. Its size is 240.

<groupColumn>
    The name of the column that contains the group ID or the transaction ID. The value in the group column must be different from NULL. This parameter is of type VARCHAR. Its size is 128.

<minSupport>
    The value for the minimum support of the association rules.

<minConfidence>
    The value for the minimum confidence of the association rules.

<maxRuleLength>
    The value for the maximum rule length.

<optionsString>
    The optional parameter string. This parameter is of type VARCHAR. Its size is 32672.

Output

When the association rule model is applied to the data in the input table, the output view is created. The output view contains the group column and the following additional columns:

ID
    This column contains the ID of the matching association rule.

HEAD
    This column contains the rule head of the matching association rule.

HEADNAME
    This column contains the head name of the matching association rule.

SUPPORT
    This column contains the support value of the matching association rule.

CONFIDENCE
    This column contains the confidence value of the matching association rule.

LIFT
    This column contains the lift value of the matching association rule.

MATCHHEAD
    This column contains the match head value.

MATCHBODY
    This column contains the match body value.

Example

To build the association rule view BANK.PRODUCT_RULES_VIEW from the input table BANK.CUSTOMER_PRODUCTS, use the following procedure call:

db2 "call IDMMX.BuildRuleView('BANK.PRODUCT_RULES_VIEW', 'BANK.CUSTOMER_PRODUCTS', 'CLIENT_ID', 5, 30, 3)"
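
The contents of an applied rule view, and the two queries on MATCHHEAD shown earlier for the ApplyRuleModel procedure, can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the rules, the product names, and the function apply_rules are hypothetical, and the code only mimics the documented meaning of the MATCHHEAD and MATCHBODY columns, not the product's implementation.

```python
from collections import namedtuple

# Hypothetical rules; in the product they are part of a model in IDMMX.RuleModels.
Rule = namedtuple("Rule", "rule_id body head")
rules = [
    Rule(25, frozenset({"checking account"}), "bankcard"),
    Rule(26, frozenset({"checking account", "bankcard"}), "online access"),
]

# Hypothetical input data: the products owned by each client (grouped by CLIENT_ID).
products_by_client = {
    12634: {"checking account"},
    12635: {"checking account", "bankcard"},
    12636: {"savings account"},
}


def apply_rules(rules, products_by_client):
    """Return one row per group and matching rule, similar in spirit to the output view."""
    rows = []
    for client_id, owned in sorted(products_by_client.items()):
        for rule in rules:
            if rule.body <= owned:  # a rule matches only if its whole body is present
                rows.append({
                    "CLIENT_ID": client_id,
                    "RULEID": rule.rule_id,
                    "HEAD": rule.head,
                    "MATCHHEAD": 1 if rule.head in owned else 0,
                    "MATCHBODY": len(rule.body),
                })
    return rows


view = apply_rules(rules, products_by_client)

# Recommendations for client 12634: rule heads that the client does not own yet.
recommended = [r["HEAD"] for r in view if r["CLIENT_ID"] == 12634 and r["MATCHHEAD"] == 0]

# Clients that support rule 25: the body and the head are both present.
supporting = [r["CLIENT_ID"] for r in view if r["RULEID"] == 25 and r["MATCHHEAD"] == 1]

print(recommended, supporting)
```

The two list comprehensions at the end correspond to the SELECT statements with MATCHHEAD = 0 and MATCHHEAD = 1.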

Exporting rule models: See Exporting models and test results on page 82 for information about exporting models and test results.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the call of the Easy Mining procedure. You can do the following tasks by using the appropriate optional parameter string:
- Reducing the number of association rules
- Specifying the item ID field
- Mapping numbers to meaningful names

There are more optional parameter strings for the Easy Mining procedures. For more information, see Optional parameter strings on page 99.

Reducing the number of rules: The Associations mining function discovers all association rules for a given minimum support and minimum confidence. This means that the number of discovered association rules can be huge. To reduce the number of discovered association rules, increase the value for minimum support, the value for minimum confidence, or both. You can also reduce the number of discovered association rules by reducing the maximum rule length.

Specifying the item ID field: You might have a table with sales transaction data that includes additional columns besides the columns for the transaction ID and the item. In this case, you must specify the field that you want to use as the item ID field. To specify a field as the item ID field, you can use the DM_setItemFld option. For example, if you want to specify the PRODUCT column as the item ID field, the optional parameter string looks like this:
DM_setItemFld(PRODUCT)

Mapping numbers to meaningful names: In the transaction tables of the retail industry, the column PRODUCT might contain numeric values to identify the articles. The description of the articles is stored in another table. If you want the article descriptions to appear in the association rules instead of the article IDs, you must specify a name mapping. For more information, see Name mappings on page 23.

Complete procedure call: The complete procedure call including the optional parameter strings looks like this:

db2 "call IDMMX.BuildRuleModel('BANK.PRODUCT_RULES', 'BANK.CUSTOMER_PRODUCTS', NULL, 5, 30, 3, 'DM_setItemFld(\"PRODUCT\"), DM_addNmp(\"PRODUCT_NAMES\", \"BANK.PRODUCTS\", \"ID\", \"DESCRIPTION\"), DM_setFldNmp(\"PRODUCT\", \"PRODUCT_NAMES\")')"

Easy Mining procedures for sequences mining steps


There are Easy Mining procedures for the following sequences mining steps:
- Building sequence rule models
- Applying sequence rule models
- Building sequence rule mining views
- Exporting sequence rule models

When to use them


With sequence rules, you can predict what comes next.

In the retail business area, you can make targeted product offerings based on the purchase histories of your customers. For example, if a customer has ordered a digital camera, you can send a leaflet about the latest memory card offerings together with the invoice to the customer.

In the car manufacturing business area, you can take proactive actions to prevent defects that are likely to occur. For example, for the installation of an air conditioning system, you might want to use upgraded batteries that can deal with high consumption of electricity.

Sequence mining steps are not limited to events in chronological order. You can also use any other ordering. For example, you might want to analyze genome sequences with this mining function.

How to use them


With the Easy Mining procedures for sequence mining steps, you can build, apply, and export sequence rule models. You can also build sequence rule views.

Building sequence rule models: The Sequence Rule mining function computes sequence rules. A set of sequence rules is called a sequence rule model. In previous versions of the Intelligent Miner for Data, the Sequence Rule mining function was called the Sequential Patterns mining function. You can build a sequence rule model by using the BuildSeqRuleModel procedure.

Syntax
IDMMX.BuildSeqRuleModel(<modelName>, <inputTable>, <sequenceColumn>, <groupColumn>, <minSupport>, <minConfidence>, <maxRuleLength> [, <optionsString>])

Input parameters

With the BuildSeqRuleModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to build. The model is stored in the IDMMX.RuleModels table. If a model with the same name already exists, the previous model is replaced with the new model. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The BuildSeqRuleModel procedure starts a mining run on this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns. This parameter is of type VARCHAR. Its size is 240.

<sequenceColumn>
    The name of the sequence column. A sequence contains the item sets that have the same sequence ID.

<groupColumn>
    The name of the column that contains the group ID or the transaction ID. The remaining columns of the input table, with the exception of the sequence column, are used as item columns. An item set contains items that have the same sequence ID and the same group ID. The item sets in a sequence are sorted according to the value in the group column. This parameter is of type VARCHAR. Its size is 128.

<minSupport>
    The minimum support value for all sequence rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the value for minimum support is automatically determined to produce a result that contains at least a few sequence rules. This parameter is of type REAL.

<minConfidence>
    The minimum confidence value for all sequence rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the default value of 25% is automatically used as the lower limit for the confidence of a sequence rule. This parameter is of type REAL.

<maxRuleLength>
    The value for the maximum rule length. The rule length determines the maximum number of item sets that occur in a sequence rule. You must specify a value greater than or equal to 2. For example, if you specify 3 as the maximum rule length, a sequence rule contains at most two item sets in the rule body and one item set in the rule head. If you specify a value less than or equal to 0, the maximum rule length is not limited. This parameter is of type INTEGER.

<optionsString>
    The optional parameter string that you want to use. This parameter is of type VARCHAR. Its size is 32672.

Example

You might want to build the model BANK.PRODUCT_SEQ_RULES to compute the sequence rules for the products that are bought by bank customers. Use the following command to run the Easy Mining procedure:
call IDMMX.BuildSeqRuleModel('BANK.PRODUCT_SEQ_RULES', 'BANK.CUSTOMER_PRODUCTS2', 'CLIENT_ID', 'DATE', 5, 30, 3);

Output

Sequential relationships are represented as sequence rules. Sequence rules describe patterns in sequences. Depending on the business area, sequences might be, for example, purchases of customers or defects of cars over time.

For example, customers might buy a digital camera and rechargeable batteries. A couple of weeks later, they buy a memory card and, a couple of weeks after that, they buy a photo printer. The sequence rule of this pattern looks like this:
<digital camera and rechargeable batteries> >>> <memory card> ==> <photo printer>

where:

<digital camera and rechargeable batteries>
    represents an individual item set that is part of the rule body

>>>
    represents a temporal ordering of item sets in ascending order

<memory card>
    represents an individual item set that is part of the rule body

==>
    splits the sequence rule into a sequence rule body and a sequence rule head

<photo printer>
    represents an item set that is included in the sequence rule head

You can interpret the sequence rule above like this: If customers buy a digital camera with rechargeable batteries during their first purchase and a memory card during their second purchase, they will buy a photo printer during a subsequent purchase.

Sequence rules include the following attributes:

Confidence
    The confidence value represents the validity of the rule. A confidence value of 50% means that in 50% of the cases where a particular rule body is present in a sequence, a particular rule head is also present after the item sets of the rule body. For example, in the sequence rule above, a confidence value of 50% means that 50% of the customers who bought a digital camera with rechargeable batteries during their first visit and a memory card during any of their subsequent visits bought a photo printer during another subsequent visit.

Support
    The support value indicates how many sequences are covered by a sequence rule. The support value is expressed as a percentage of the total number of sequences. For example, a support value of 2% for the following sequence rule means that 2% of all sequences contain this particular sequence:
<digital camera and rechargeable batteries> >>> <memory card> ==> <photo printer>

Lift
    The lift value indicates how much the confidence value differs from the expected confidence value. The lift value is computed by dividing the confidence value by the support value of the sequence rule head. If the support value of the rule head in the above example is 10% and the confidence value of the sequence rule is 50%, the value for lift is 50% divided by 10% = 5. A lift value of 5 means that customers who buy a digital camera and rechargeable batteries during their first visit and a memory card during their second visit are 5 times more likely than average customers to buy a photo printer during a subsequent visit.

Mean time difference
    This value indicates the mean time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence. If the type of the group column is numeric, this value is the mean value of the group values for the sequences.

Standard deviation of time difference
    This value indicates the standard deviation of the time difference between the time stamp of the first item set and the time stamp of the last item set in a sequence. If the type of the group column is numeric, this value is the standard deviation of the group values for the sequences.

Applying sequence rule models: You can apply sequence rule models that are based on transaction tables. After you have built a sequence rule model, you might want to apply it to new data records to determine the item sets in the rule head that can be inferred. If the new data represents the current contents of market baskets, you can consider the inferred items as the items that your customers might be interested in. This means that you can recommend these items to your customers.

You can also apply sequence rule models to the input table that was used to build the model. By applying sequence rule models to input tables, you can determine the sequences that support a sequence rule. Sequences that support a sequence rule contain the item sets of a sequence rule in the right order.

You can apply sequence rule models to input tables by using the ApplySeqRuleModel procedure.

Syntax
IDMMX.ApplySeqRuleModel(<modelName>, <inputTable>, <outputView>, <ruleIdColumn>, <headColumn>, <headNameColumn>, <supportColumn>, <confidenceColumn>, <liftColumn>, <matchHeadColumn>, <matchBodyColumn>)

Input parameters

With the ApplySeqRuleModel procedure, you must specify the following parameters:

<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RuleModels table. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or the input view. The ApplySeqRuleModel procedure applies the sequence rule model to the data in this table. The input table must contain the following columns:
    - Sequence column
    - Group column
    - Item columns
    These are the columns that were required to build the model. This parameter is of type VARCHAR. Its size is 240.

<outputView>
    The name of the output view. The output view contains the sequence column and the following additional columns. These columns are only included in the output view if their names are not NULL or an empty string.
    - Rule ID column
    - Head column
    - Head name column
    - Support column
    - Confidence column
    - Lift column
    - Match head column
    - Match body column
    This parameter is of type VARCHAR. Its size is 240.

<ruleIdColumn>
    The name of the column that contains the ID of the matching sequence rule. This parameter is of type VARCHAR. Its size is 128.

<headColumn>
    The name of the column that contains the rule head item set of the matching sequence rule. For each rule head item set of a sequence rule, one record is returned. This parameter is of type VARCHAR. Its size is 128.

<headNameColumn>
    The name of the column that contains the names of the items in the rule head item set of the matching sequence rule. This parameter is of type VARCHAR. Its size is 128.

<supportColumn>
    The name of the column that contains the support value of the matching sequence rule. This parameter is of type VARCHAR. Its size is 128.

<confidenceColumn>
    The name of the column that contains the confidence value of the matching sequence rule. This parameter is of type VARCHAR. Its size is 128.

<liftColumn>
    The name of the column that contains the lift value of the matching sequence rule. This parameter is of type VARCHAR. Its size is 128.

<matchHeadColumn>
    The name of the column that contains the match head value. If the head item set is included in the sequence, this column contains the value 1. If the head item set is not included in the sequence, this column contains the value 0. This parameter is of type VARCHAR. Its size is 128.

<matchBodyColumn>
    The name of the column that contains the match body value. This value is equal to the number of item sets that are included in the rule body because only those sequence rules are returned whose body item sets are all included in the sequence in the right order. This parameter is of type VARCHAR. Its size is 128.

Example

To apply the model BANK.PRODUCT_SEQ_RULES to the table BANK.CUSTOMER_PRODUCTS2, use the following command:
db2 "call IDMMX.ApplySeqRuleModel('BANK.PRODUCT_SEQ_RULES', 'BANK.CUSTOMER_PRODUCTS2', 'BANK.APPLY_RULES_VIEW', 'RULEID', 'HEAD', 'HEADNAME', 'SUPPORT', 'CONFIDENCE', 'LIFT', 'MATCHHEAD', 'MATCHBODY')"

Building sequence rule views: You can perform the following tasks in one step by using the BuildSeqRuleView procedure:
- Building a sequence rule model
- Applying this sequence rule model to the input table that is used to build the model

Syntax
IDMMX.BuildSeqRuleView(<viewName>, <inputTable>, <sequenceColumn>, <groupColumn>, <minSupport>, <minConfidence>, <maxRuleLength> [, <optionsString>])

Input parameters

With the BuildSeqRuleView procedure, you must specify the following parameters:

<viewName>
    The name of the view that you want to build. The BuildSeqRuleView procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model and a view with the same name already exist, the previous model and the previous view are replaced with the new model and the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
    The name of the input table or view. This parameter is of type VARCHAR. Its size is 240.

<sequenceColumn>
    The name of the sequence column. A sequence contains the item sets that have the same sequence ID.

<groupColumn>
    The name of the column that contains the group ID or the transaction ID. In the input table, the columns other than the sequence column and the group column are the item columns. An item set contains the items that have the same sequence ID and the same group ID. The item sets that are included in a sequence are ordered according to the group ID.

<minSupport>
    The value for the minimum support of the sequence rules.

<minConfidence>
    The value for the minimum confidence of the sequence rules.

<maxRuleLength>
    The value for the maximum length of the sequence rule. The maximum length of a sequence rule is determined by the maximum number of item sets in a sequence.

<optionsString>
    The string for optional parameters. This parameter is of type VARCHAR. Its size is 32672.

Output

When the sequence rule model is applied to the data in the input table, the output view is created. The output view contains the sequence column and the following additional columns:

ID
    This column contains the ID of the matching sequence rule.

HEAD
    This column contains the rule head of the matching sequence rule.

HEADNAME
    This column contains the head name of the matching sequence rule.

SUPPORT
    This column contains the support value of the matching sequence rule.

CONFIDENCE
    This column contains the confidence value of the matching sequence rule.

LIFT
    This column contains the lift value of the matching sequence rule.

MATCHHEAD
    This column contains the match head value.

MATCHBODY
    This column contains the match body value.

Example

To build the sequence rules view BANK.PRODUCT_SEQ_RULES_VIEW based on the table BANK.CUSTOMER_PRODUCTS, use the following procedure call:
db2 "call IDMMX.BuildSeqRuleView('BANK.PRODUCT_SEQ_RULES_VIEW', 'BANK.CUSTOMER_PRODUCTS', 'CLIENT_ID', 'DATE', 5, 30, 3)"
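
The support, confidence, and lift attributes of a sequence rule, as described earlier in this section, can be reproduced on a toy data set. The following Python sketch is only an illustration under stated assumptions: the ten sequences are invented, the helper names contains_in_order and percentage are hypothetical, and the code is not the algorithm used by the Sequence Rule mining function.

```python
def contains_in_order(sequence, pattern):
    """True if the item sets of pattern occur in sequence in the same order.

    Each pattern item set must be a subset of some item set of the sequence,
    and the matching item sets must appear at strictly increasing positions.
    """
    position = 0
    for wanted in pattern:
        while position < len(sequence) and not wanted <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern item set must occur later
    return True


def percentage(pattern, sequences):
    """Percentage of all sequences that contain the pattern in order."""
    hits = sum(1 for s in sequences if contains_in_order(s, pattern))
    return 100.0 * hits / len(sequences)


# Rule: <digital camera and rechargeable batteries> >>> <memory card> ==> <photo printer>
body = [{"digital camera", "rechargeable batteries"}, {"memory card"}]
head = [{"photo printer"}]

# Hypothetical data: one list of item sets per customer, ordered by purchase date.
sequences = [
    [{"digital camera", "rechargeable batteries"}, {"memory card"}, {"photo printer"}],
    [{"digital camera", "rechargeable batteries"}, {"memory card", "tripod"}],
] + [[{"tripod"}] for _ in range(8)]

rule_support = percentage(body + head, sequences)      # whole rule, in order
body_support = percentage(body, sequences)             # rule body only
head_support = percentage(head, sequences)             # rule head only
rule_confidence = 100.0 * rule_support / body_support  # head given body
lift = rule_confidence / head_support

print(rule_support, rule_confidence, lift)
```

With these made-up sequences, the rule has a confidence of 50% and its head has a support of 10%, so the lift is 5, which mirrors the numbers used in the description of the lift attribute.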

Exporting sequence rule models: See Exporting models and test results on page 82 for information about exporting sequence rule models.

How to go further
Sometimes you might want to specify an additional parameter for an Easy Mining procedure. You can do this by including an optional parameter string in the call of the Easy Mining procedure. For more information, see Optional parameter strings on page 99.

When you use the sequence mining steps, you can explicitly set the name for the item ID column. For more information, see Setting the name of the item ID column on page 22.

Exporting models and test results


You might want to apply a model to data that is stored in a different database. For example, you might want to apply the model in an operational application that uses the model to select the customers for a specific marketing campaign. You can do this by following these steps:

1. Depending on the type of model that you want to export, use one of the following Easy Mining procedures:
   - ExportClasModel
   - ExportClasTestResult
   - ExportClusModel
   - ExportClusTestResult
   - ExportRegModel
   - ExportRegTestResult
   - ExportRuleModel
   - ExportSeqRuleModel
2. Transfer the model to the server where the operational application is located.
3. Import the model into the database of the operational application by using a user-defined function (UDF) of IM Scoring, for example, DM_impClusFile.

The file on the database server is in Predictive Model Markup Language (PMML) format. PMML is a standard developed by the Data Mining Group. For more information, visit the Data Mining Group.

Syntax
Exporting models
<EasyMiningProcedure>(<modelName>, <exportFileName>)

Exporting test results


<EasyMiningProcedure>(<testResultName>, <exportFileName>)

Input parameters
To export a model or a test result, you must specify the following parameters:
<EasyMiningProcedure>
Depending on the model or the test result that you want to export, you can use one of the available Easy Mining procedures:
- ExportClasModel
- ExportClasTestResult
- ExportClusModel
- ExportClusTestResult
- ExportRegModel
- ExportRegTestResult
- ExportRuleModel
- ExportSeqRuleModel

<modelName> or <testResultName>
The name of the model or the test result that you want to export. Depending on the model or the test result that you want to export, it is stored in one of the following tables:
- IDMMX.ClassifModels
- IDMMX.ClasTestResults
- IDMMX.ClusterModels
- IDMMX.ClusTestResults
- IDMMX.RegressionModels
- IDMMX.RegTestResults
- IDMMX.RuleModels

<exportFileName> The name of the file on the server where the model is exported to. You must specify the complete path of the file.

Example
You might want to export the model BANK.BANKCARD_CLASMODEL to the file c:\temp\bankcard_clasmodel.xml on a Microsoft Windows database server. Use the following command to run the Easy Mining procedure:
DB2 "call IDMMX.ExportClasModel('BANK.BANKCARD_CLASMODEL', 'c:\temp\bankcard_clasmodel.xml')"

Using Easy Mining procedures for preprocessing and for utilities


With the Easy Mining procedure for preprocessing, you can prepare your data for an Easy Mining procedure for typical mining tasks or basic mining steps. With the Easy Mining procedures for utilities, you can perform administrative tasks such as retrieving error messages or cleaning up the database.

Preprocessing procedures
With the SplitData procedure, you can split tables into training data sets and test data sets.

Splitting tables into training data sets and test data sets
If you want to create a prediction model, you typically use a table or a view that contains historical data. If you decide to use the Easy Mining procedures for classification or regression, you might want to split this table into the following disjoint data sets:
- One data set to train the prediction model
- One data set to test the prediction model
You can use the SplitData procedure to split a table into two disjoint data sets.

Stratified samples: If you specify a value that is different from NULL or an empty string for the parameter <stratSampleColumn>, a stratified sample is created for the training data set. This means that a sample is drawn from the complement of the test data set. In this sample, the values of the Stratified Sample column occur with approximately the same frequency.

Creating a stratified sample can be useful if you use classification and the distribution of the values of the target column is unbalanced. For example, a specific value might occur with a frequency of 99%. A model that always predicts this value is wrong in only 1% of the cases, so it appears to be a good classification model. However, such a model never predicts the rare values, which is usually not what you want. Performing the training run on a stratified sample helps to avoid this.

Applying models that are computed on a stratified sample should be done with care because the confidence values that are computed by the classification model are not correct on a data set that does not have the same characteristics as the stratified sample. Correcting these confidence values requires expert knowledge in data mining. Therefore it is recommended to use the Easy Mining procedures on a non-stratified sample. They automatically take into account the differences in the frequencies of the target values.

If you do not specify the Stratified Sample column, the training data set contains the records of the complement of the test data set.

Syntax:
IDMMX.SplitData(<inputTable>, <trainViewName>, <testViewName>, <testSampleSize> [, <stratSampleColumn>])

Input parameters: With the SplitData procedure, you can specify the following parameters:
<inputTable>
The name of the input table or view. This parameter is of type VARCHAR. Its size is 240.
<trainViewName>
The name of the view that contains the training data set. This view contains the same columns as the input table. This parameter is of type VARCHAR. Its size is 240.
<testViewName>
The name of the view that contains the test data set. This view contains the same columns as the input table. It contains a random sample of the records in the input table that is disjoint from the training view. This parameter is of type VARCHAR. Its size is 240.
<testSampleSize>
A percentage that indicates the size of the test data set. This parameter is of type REAL.
<stratSampleColumn>
This parameter is optional. If you specify a value that is different from NULL or an empty string for this parameter, a stratified sample is created for the training data set. This parameter is of type VARCHAR. Its size is 128.

Example: You might want to split the table BANK.BANKCUSTOMERS into equal portions.
- For the portion that includes the training data set, you want to specify the name BANK.BANKCUSTOMERS_TRAIN.
- For the portion that includes the test data set, you want to specify the name BANK.BANKCUSTOMERS_TEST.
- Because you want to split the table into equal portions, you specify 50.0 as the size of the test data set.
Use the following command to run the SplitData procedure:
DB2 "call IDMMX.SplitData('BANK.BANKCUSTOMERS', 'BANK.BANKCUSTOMERS_TRAIN', 'BANK.BANKCUSTOMERS_TEST', 50.0)"
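The idea behind a disjoint split with an optional stratified training sample can be illustrated outside the database. The following Python sketch is not the implementation of the SplitData procedure. The function name split_data, the field names, and the balancing rule (down-sampling every value to the size of the rarest one) are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def split_data(records, test_pct, strat_key=None, seed=0):
    """Split records into disjoint training and test lists.

    test_pct is a percentage (0-100), analogous to <testSampleSize>.
    If strat_key is given, the training list is down-sampled so that
    every value of that field occurs equally often (assumed rule).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_pct / 100.0)
    test, train = shuffled[:n_test], shuffled[n_test:]
    if strat_key is not None:
        # Group the complement of the test set by the stratification value.
        groups = defaultdict(list)
        for rec in train:
            groups[rec[strat_key]].append(rec)
        smallest = min(len(group) for group in groups.values())
        train = [rec for group in groups.values() for rec in group[:smallest]]
    return train, test

# Hypothetical data: 10% of the customers accepted the bank card.
customers = [
    {"CLIENT_ID": i, "BANKCARD": "YES" if i % 10 == 0 else "NO"}
    for i in range(1000)
]
train, test = split_data(customers, 50.0, strat_key="BANKCARD")
```

In this sketch, the test list is a plain random sample, and only the training list is balanced. This mirrors the description above, where the stratified sample is drawn from the complement of the test data set.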

Utility procedures
Utility procedures help you to accomplish administrative tasks, for example, handling error messages, canceling Easy Mining procedures, or working with the trace file.

Error messages
When the call of an Easy Mining procedure fails, DB2 returns the following error message:
SQL4302N (Java stored procedure or user-defined function <user-defined function name>, specific name <specific name> aborted with an exception <exception name>).
In most cases, the actual message string is truncated. If you want to get the original error message from the Easy Mining procedures, you can call the GetLastError procedure.

With the GetLastError procedure, you can specify several parameters. A question mark (?) in front of a parameter name indicates that the parameter is an output parameter. The GetLastError procedure retrieves the last error message that was issued by an Easy Mining procedure that ran on the same database as the database from which the GetLastError procedure is called.

Retrieving the last issued error message: To retrieve the last issued error message, you can specify the following parameters:
- Error message
- Optional: Error code
- Optional: SQL state

Syntax:
IDMMX.GetLastError(?<errorMessage>[,?<errorCode>,?<SQL state>])

Examples: If you want to retrieve the last error message that was issued, you can use the following command:
DB2 "call IDMMX.GetLastError(?)"


If you want to retrieve the last error message that was issued including the error code and the SQL state, you can use the following command:
DB2 "call IDMMX.GetLastError(?, ?, ?)"

Retrieving the last issued error message of a particular model: You can also specify a model name or a view name. For example, you might want to retrieve the last error message that was issued when you used a particular model or a view.

Syntax:
IDMMX.GetLastError(<modelViewName>,?<errorMessage>[,?<errorCode>,?<SQL state>])

Examples: If you want to retrieve the last error message of the procedure call that used the model name BANK.CUSTOMERS_CLUSMODEL, use the following command:
DB2 "call IDMMX.GetLastError('BANK.CUSTOMERS_CLUSMODEL', ?)"

If you want to retrieve the last error message including the error code and the SQL state of the procedure call that used the view name BANK.CUSTOMERS_CLUSVIEW, use the following command:
DB2 "call IDMMX.GetLastError('BANK.CUSTOMERS_CLUSVIEW', ?, ?, ?)"

The procedure calls retrieve the last error message that was issued by an Easy Mining procedure that ran on the same database as the database from which the GetLastError procedure is called.

The trace file


If you have specified a name for the trace file, SQL statements and other trace information that are generated during the run of an Easy Mining procedure are written to this file. This information helps the IBM Support team to locate an error.

Setting the trace file: With the SetTraceFile procedure, you can set the name of the trace file for the current database connection. You must specify the full path of the file on the server to which you want the trace information written. If you specify an empty string or NULL as the name of the trace file, trace information is no longer written.

Syntax:
IDMMX.SetTraceFile(<traceFile>)

Example: To set the name of the trace file to c:\temp\easymining.trc, use the following command:
DB2 "call IDMMX.SetTraceFile('c:\temp\easymining.trc')"

Retrieving the name of the trace file: With the GetTraceFile procedure, you can retrieve the name of the trace file.


Syntax:

IDMMX.GetTraceFile(?<traceFile>)

Example: To retrieve the name of the trace file, use the following command:
DB2 "call IDMMX.GetTraceFile(?)"

Canceling an Easy Mining procedure


Mining runs can take a very long time. You can cancel the Easy Mining procedures for building and testing models by using the CancelTask procedure. The CancelTask procedure cancels the Easy Mining procedure that was called with the specified model name or view name as its first parameter. After the Easy Mining procedure has received a cancel request, it might take a while until it stops because it can stop only at suitable points during its run.

Syntax:
IDMMX.CancelTask(<modelViewName>)

Example: You might want to run the PredictColumn procedure by using the following command:
DB2 "call IDMMX.PredictColumn('BANK.BANKCARD_PRED', 'BANK.BANKCUSTOMERS', 'BANKCARD')"

With the CancelTask procedure, you can cancel the PredictColumn procedure by using the following command:
DB2 "call IDMMX.CancelTask('BANK.BANKCARD_PRED')"

Declaring a task as stopped


The first parameter of an Easy Mining procedure is typically the name of a model or a view. You cannot start an Easy Mining procedure if another Easy Mining procedure with the same first parameter is currently running. If you try to do so, an error message is displayed. It indicates that a mining run with the same model name or the same view name is already running. This message might also be displayed if an Easy Mining procedure ends abnormally and you try to start it again. Before you can restart an Easy Mining procedure that has ended abnormally, you must declare it as stopped by calling the SetTaskStopped procedure.

Syntax:
IDMMX.SetTaskStopped(<modelViewName>)


Where <modelViewName> is the first parameter of the call of the Easy Mining procedure that you want to declare as stopped.

Example: You might want to declare the following call of the BuildClusModel procedure as stopped:
DB2 "call IDMMX.BuildClusModel('BANK.CUSTOMERS_CLUSMODEL', 'BANK.BANKCUSTOMERS')"

To declare the above call as stopped, you must use the following call of the SetTaskStopped procedure:
DB2 "call IDMMX.SetTaskStopped('BANK.CUSTOMERS_CLUSMODEL')"

Cleaning up
Some of the Easy Mining procedures create tables and views. Almost all of them create a model or a test result and store it in one of the standard tables for mining models or test results of IM Modeling. You can clean up these objects in your database by calling the IDMMX.CleanUpTask procedure. It removes all tables, views, models, and test results that were created by procedure calls with a particular name as their first parameter.

Syntax:
IDMMX.CleanUpTask(<modelViewName>)

Example: You might have built the model and the output view BANK.BANKCARD_PRED by using the following procedure call:
call IDMMX.PredictColumn('BANK.BANKCARD_PRED', 'BANK.BANKCUSTOMERS', 'BANKCARD')

To clean up the objects that were created with the model and the output view BANK.BANKCARD_PRED, you can use the following procedure call:
call IDMMX.CleanUpTask('BANK.BANKCARD_PRED')

With this procedure call, all models, views, and test results that are called BANK.BANKCARD_PRED are deleted.

Putting it all together


This section describes a complete scenario from the banking business. It introduces the business goal and describes the steps that lead to the successful completion of this goal. You can integrate this set of procedure calls and SQL queries into an application program that performs these steps automatically, so that users do not have to deal with the individual operations. The only output that such an application needs to produce is a list of customers who are supposed to take part in the marketing campaign.

Scenario
It is assumed that you are working for a bank. You want to promote a new bank card offering. Therefore you are planning a direct marketing campaign. However, you have only a limited budget for this campaign. Therefore you decide to address only the customers who will most likely accept the new offering.

Identifying characteristics
To identify the characteristics of customers who are most likely interested in the new bank card, you can start a test campaign on a representative sample of your customers. You can save the results of the test campaign in a table, for example, the BANK.TEST_CAMPAIGN table. Among other columns, the BANK.TEST_CAMPAIGN table contains the following columns:
CLIENT_ID
This column identifies the customers who took part in the test campaign.
BANKCARD
This column contains the value YES if the customer accepted the new offering.

With the information in the BANK.TEST_CAMPAIGN table and other information about the customers that might be available, you can build a prediction model. Based on the prediction model, you can select the customers who will most likely accept the new bank card offer.

Building a prediction model


In the BANK.TEST_CAMPAIGN table, customers who accepted the new offer are identified by the value YES in the BANKCARD column. Therefore you want to build a model that predicts whether the BANKCARD column contains the value YES. To build the prediction model BANK.BANKCARD_PRED_YES based on the BANK.TEST_CAMPAIGN table, you can call the PredictColValue procedure by using the following command:
DB2 "call IDMMX.PredictColValue('BANK.BANKCARD_PRED_YES', 'BANK.TEST_CAMPAIGN', 'BANKCARD', 'YES')"

The PredictColValue procedure uses the Classification mining function and the Regression mining function to build a prediction model for each mining function. It also creates the result view BANK.BANKCARD_PRED_YES. Besides the columns of the input table, the result view contains the following additional columns:
TARGET_VALUE
This column contains the predicted value YES. If a classification model is built, this column contains the target value.
DATA_SET
This column indicates whether the record belongs to the training data set or to the validation data set.
CONFIDENCE
This column contains the classification confidence for the target value.
PREDICTION
This column contains the predicted value of the regression mining function.

Determining the best suited model


The PredictColValue procedure builds a classification model and a regression model. The model that makes the best selection of the customers for the marketing campaign is the one to use. Because you do not know the result of the campaign in advance, you must base the comparison on the validation data set of the BANK.TEST_CAMPAIGN table. Assuming that the budget of the campaign is adequate to address 30% of your customers, you need to determine which model performs best when you select 30% of the validation data set. Follow these steps:
1. To compute the size of the validation data set, you can use the following query:
select count(*) from BANK.BANKCARD_PRED_YES where DATA_SET = 'VALIDATION'

2. To determine the number of customers that represents 30% of the validation data set, you must multiply the result of this query by the factor 0.3.

Now you must determine the customers who are selected by the classification model. You also need to know how many of these customers accepted the offer. You can get this information by using the following query:
WITH CLASS_RANK(CLIENT_ID, CONFIDENCE, BANKCARD, RANK) as (
  select CLIENT_ID, CONFIDENCE, BANKCARD,
         rank() over (order by CONFIDENCE desc, CLIENT_ID asc) as RANK
  from BANK.BANKCARD_PRED_YES
  where DATA_SET = 'VALIDATION'
)
select count(*), BANKCARD
from CLASS_RANK
where RANK <= 3000
group by BANKCARD

The common table expression defines the temporary table CLASS_RANK. In this temporary table, every client ID is associated with a rank number. The rank is determined by ordering the records of the BANK.BANKCARD_PRED_YES view according to the values in the CONFIDENCE column and the CLIENT_ID column.
- The values in the CONFIDENCE column are sorted in descending order.
- The values in the CLIENT_ID column are sorted in ascending order.
This means that the records with a higher classification confidence value have a lower rank. If the classification confidence is equal for two or more client IDs, the rank is determined by the client ID. This ensures a continuous numbering of the rank numbers. The rank represents the order in which the customers are selected for the marketing campaign.

It is assumed that the validation data set includes 10 000 records. 30% of 10 000 records corresponds to 3 000 customers. The selected customers are the customers that have a rank value that is less than or equal to 3 000. With this information it is easy to determine how many of these customers have accepted the new bank card. You can group the records of the CLASS_RANK table that have a rank value less than or equal to 3 000 according to the BANKCARD column and count the occurrences of each of its values.

To determine the corresponding figure for the regression model, you only need to replace the CONFIDENCE column with the PREDICTION column. The query looks like this:
WITH CLASS_RANK(CLIENT_ID, PREDICTION, BANKCARD, RANK) as (
  select CLIENT_ID, PREDICTION, BANKCARD,
         rank() over (order by PREDICTION desc, CLIENT_ID asc) as RANK
  from BANK.BANKCARD_PRED_YES
  where DATA_SET = 'VALIDATION'
)
select count(*), BANKCARD
from CLASS_RANK
where RANK <= 3000
group by BANKCARD

The model with the higher count for the YES value in the BANKCARD column is suited best for the marketing campaign.
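The comparison that the two queries perform can be sketched in a few lines of Python. The sketch below is only an illustration of the ranking logic, not part of the product. The function hits_in_top and the ten sample records are hypothetical, while the column names are taken from the result view described above.

```python
def hits_in_top(records, score_field, percent):
    """Count the customers who accepted the offer among the top-ranked records.

    Records are ordered by score (descending) and CLIENT_ID (ascending),
    like the rank() expression in the queries above.
    """
    ranked = sorted(records, key=lambda r: (-r[score_field], r["CLIENT_ID"]))
    cutoff = len(ranked) * percent // 100
    return sum(1 for r in ranked[:cutoff] if r["BANKCARD"] == "YES")

# Hypothetical validation records with scores from both models.
validation = [
    {"CLIENT_ID": 1, "CONFIDENCE": 0.95, "PREDICTION": 0.60, "BANKCARD": "YES"},
    {"CLIENT_ID": 2, "CONFIDENCE": 0.90, "PREDICTION": 0.20, "BANKCARD": "NO"},
    {"CLIENT_ID": 3, "CONFIDENCE": 0.85, "PREDICTION": 0.90, "BANKCARD": "YES"},
    {"CLIENT_ID": 4, "CONFIDENCE": 0.80, "PREDICTION": 0.85, "BANKCARD": "YES"},
    {"CLIENT_ID": 5, "CONFIDENCE": 0.40, "PREDICTION": 0.30, "BANKCARD": "NO"},
    {"CLIENT_ID": 6, "CONFIDENCE": 0.35, "PREDICTION": 0.10, "BANKCARD": "NO"},
    {"CLIENT_ID": 7, "CONFIDENCE": 0.30, "PREDICTION": 0.70, "BANKCARD": "YES"},
    {"CLIENT_ID": 8, "CONFIDENCE": 0.20, "PREDICTION": 0.15, "BANKCARD": "NO"},
    {"CLIENT_ID": 9, "CONFIDENCE": 0.10, "PREDICTION": 0.05, "BANKCARD": "NO"},
    {"CLIENT_ID": 10, "CONFIDENCE": 0.05, "PREDICTION": 0.25, "BANKCARD": "NO"},
]

clas_hits = hits_in_top(validation, "CONFIDENCE", 30)
reg_hits = hits_in_top(validation, "PREDICTION", 30)
best_model = "classification" if clas_hits >= reg_hits else "regression"
```

With this sample data, the top 30% selected by the classification confidence contain two customers with the value YES, and the top 30% selected by the regression prediction contain three, so the regression model would be preferred here.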

Determining the customers


To select the customers for the marketing campaign, you might need to transfer the model to the server where the database of the campaign management application resides. You can export the model by using the ExportClasModel procedure or the ExportRegModel procedure. To import the model into the database of the campaign management application, you can use one of the following UDFs of IM Scoring:
- DM_impClasFile
- DM_impRegFile

It is assumed that the BANK.BANKCARD_PRED_YES model is stored in the database in one of the following tables:
- IDMMX.ClassifModels
- IDMMX.RegressionModels

The table that contains the customer data for the marketing campaign is called BANK.CAMPAIGN_CUSTOMERS. The BANK.CAMPAIGN_CUSTOMERS table has the same schema as the input table for the PredictColValue procedure. To select 30% of the customers for the marketing campaign, follow these steps:
1. Apply the model BANK.BANKCARD_PRED_YES to the BANK.CAMPAIGN_CUSTOMERS table by using the ApplyClasModel procedure or the ApplyRegModel procedure. If the selected model is a classification model, the command might look like this:
DB2 "call IDMMX.ApplyClasModel('BANK.BANKCARD_PRED_YES', 'BANK.CAMPAIGN_CUSTOMERS', 'BANK.CAMPAIGN_APPLY', 'CLASS_VAL', 'PREDICTION')"

If the selected model is a regression model, the command might look like this:
DB2 "call IDMMX.ApplyRegModel('BANK.BANKCARD_PRED_YES', 'BANK.CAMPAIGN_CUSTOMERS', 'BANK.CAMPAIGN_APPLY', 'PREDICTION', 'PRED_STD_DEV')"

With these procedure calls, the BANK.CAMPAIGN_APPLY view is created. This view contains the PREDICTION column. In the classification model, it represents the classification confidence. In the regression model, it represents the predicted value.


2. You can use the PREDICTION column to select the customers for the marketing campaign. Assuming that 10 000 customers are selected for the campaign, the query to determine these customers looks like this:
WITH PRED_RANK(CLIENT_ID, PREDICTION, RANK) as (
  select CLIENT_ID, PREDICTION,
         rank() over (order by PREDICTION desc, CLIENT_ID asc) as RANK
  from BANK.CAMPAIGN_APPLY
)
select * from PRED_RANK where RANK <= 10000

This query looks similar to the above query that determines the selected customers of the validation data set. You also select the customers based on the computed classification confidence of the underlying classification model, or the predicted value of the underlying regression model.

Data mining at a glance


This section provides a quick overview of data mining. It outlines the data mining process and gives a general introduction to the mining functions that are supported by IM Scoring and IM Modeling. It also maps several business questions to the appropriate data mining solution in different business areas.

Data mining goals


Data mining is an innovative way of gaining new and valuable business insights by analyzing the information held in your company database. These insights can enable you to identify market niches, and they support and facilitate the making of well-informed business decisions. Essentially, data mining is a way to use the information that your company already has in order to plan a business strategy for the future.

Data mining uncovers this in-depth business intelligence by using advanced analytical and modeling techniques. With data mining, you can ask far more sophisticated questions of your data than you can with conventional querying methods. The information that data mining provides can considerably improve the quality and dependability of business decision making.

Conventional methods can tell a bank, for example, which of the bank account types that it provides is the most profitable. However, only data mining enables the bank to create profiles of the customers who already have this type of account. The bank can then use data mining to find other customers who match that profile, so that it can accurately target a marketing campaign to them.

Data mining can identify patterns in company data, for example, in records of supermarket purchases. If, for example, customers buy product A and product B, which product C are they most likely to buy as well? Accurate answers to questions like these are invaluable aids to marketing strategies.

Data mining can identify the characteristics of a known group of customers, for example, those who have a proven record as poor credit risks. The company can then use these characteristics to screen new customers and to predict if they also will be poor credit risks.

Data mining tools ease and automate the process of discovering this kind of information from large stores of data.


The data mining process


Data mining is an iterative process that typically involves the following phases:

Problem definition
A data mining project starts with the understanding of the business problem. Data mining experts, business experts, and domain experts work closely together to define the project objectives and the requirements from a business perspective. The project objective is then translated into a data mining problem definition. In the problem definition phase, data mining tools are not yet required.

Data exploration
Domain experts understand the meaning of the metadata. They collect, describe, and explore the data. They also identify quality problems of the data. A frequent exchange with the data mining experts and the business experts from the problem definition phase is vital. In the data exploration phase, traditional data analysis tools, for example, statistics, are used to explore the data.

Data preparation
Domain experts build the data model for the modeling process. They collect, cleanse, and format the data because some of the mining functions accept data only in a certain format. They also create new derived attributes, for example, an average value. In the data preparation phase, data is tweaked multiple times in no prescribed order. Preparing the data for the modeling tool by selecting tables, records, and attributes is a typical task in this phase. The meaning of the data is not changed.

Modeling
Data mining experts select and apply various mining functions because you can use different mining functions for the same type of data mining problem. Some of the mining functions require specific data types. The data mining experts must assess each model. In the modeling phase, a frequent exchange with the domain experts from the data preparation phase is required. The modeling phase and the evaluation phase are coupled. They can be repeated several times to change parameters until optimal values are achieved. When the final modeling phase is completed, a model of high quality has been built.

Evaluation
Data mining experts evaluate the model. If the model does not satisfy their expectations, they go back to the modeling phase and rebuild the model by changing its parameters until optimal values are achieved. When they are finally satisfied with the model, they can extract business explanations and evaluate the following questions:
- Does the model achieve the business objective?
- Have all business issues been considered?
At the end of the evaluation phase, the data mining experts decide how to use the data mining results.

Deployment
Data mining experts use the mining results by exporting the results into database tables or into other applications, for example, spreadsheets.

The Intelligent Miner products help you to follow this process. You can apply the functions of the Intelligent Miner products independently, iteratively, or in combination. The following figure shows the phases of the Cross-Industry Standard Process for Data Mining (CRISP-DM) process model.

Figure 26. The CRISP-DM process model

IM Modeling helps you to select the input data, explore the data, transform the data, and mine the data. With IM Visualization you can display the data mining results to analyze and interpret them. With IM Scoring, you can apply the model that you have created with IM Modeling.

Data mining functions


To gain insight into specific business problems, Intelligent Miner provides the following mining functions:
- Associations mining function
- Sequence Rule mining function
- Classification mining function
- Clustering mining function
- Regression mining function

The Associations mining function


The goal of the Associations mining function is to find items that are associated with each other in a meaningful way. For example, you can analyze purchase transactions to discover combinations of goods that are often purchased together. The Associations mining function answers the following question: If certain items are present in a transaction, what other item or items are likely to be present in the same transaction?

The relationships discovered by the Associations mining function are expressed as association rules. In a typical commercial application, the mining function finds associations and also assigns probabilities. For example, it can find that, if customers buy paint, there is a 20% chance that they will also buy a paintbrush. It also finds multiple associations, for example, if a customer buys paint and paintbrushes, there is a 40% chance they will also buy paint thinner.
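The probability that is attached to such a rule is commonly called the confidence of the rule: the share of the transactions containing the rule body that also contain the rule head. The following Python sketch computes this value for a small set of hypothetical transactions that were constructed to reproduce the 20% and 40% figures above. The function rule_confidence is an illustrative name and is not part of the Intelligent Miner products.

```python
def rule_confidence(transactions, body, head):
    """Return the share of transactions with the body that also contain the head."""
    body, head = set(body), set(head)
    with_body = [t for t in transactions if body <= set(t)]
    if not with_body:
        return 0.0
    with_both = [t for t in with_body if head <= set(t)]
    return len(with_both) / len(with_body)

# Hypothetical purchase transactions: 25 contain paint,
# 5 of those also contain a paintbrush, 2 of those also contain paint thinner.
baskets = (
    [{"paint", "paintbrush", "paint thinner"}] * 2
    + [{"paint", "paintbrush"}] * 3
    + [{"paint"}] * 20
)

paint_to_brush = rule_confidence(baskets, ["paint"], ["paintbrush"])
both_to_thinner = rule_confidence(
    baskets, ["paint", "paintbrush"], ["paint thinner"]
)
```

Here paint_to_brush evaluates to 0.2 (5 of 25 transactions) and both_to_thinner to 0.4 (2 of 5 transactions).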

The Sequence Rule mining function


In previous versions of Intelligent Miner for Data, the Sequence Rule mining function was called Sequential Patterns mining function. The Sequence Rule mining function finds typical sequences of events in your data. Sequence rule models contain various sequence rules. A sequence consists of a previous item or item set in the rule body that leads to a consecutive item set in the rule head. The consecutive item set occurs after a particular period of time. With the Sequence Rule mining function, you can, for example, identify the parts that break at a particular process step at a particular time. Or, if you have already broken parts, you can find out when the parts will break again.

The Classification mining function


Classification is the process of automatically creating a model of classes from a set of records that contain class labels. The Classification mining function analyzes records that are already known to belong to a certain class, and creates a profile for a member of that class from the common characteristics of the records. You can then use a data mining application tool to apply this model to new records, that is, records that have not yet been classified. This enables you to predict if the new records belong to that particular class. When a model is applied, IM Scoring assigns a class label and a confidence value to each individual record being scored.

The Clustering mining function


The Clustering mining function consists of a range of algorithms that group data records on the basis of how similar they are. For example, each data record might describe a customer. In this case, clustering would group similar customers together, and at the same time it would maximize the differences between the different customer groups formed in this way. The groups that are found are known as clusters. Each cluster tells a specific story about customer identity or behavior, for example, about their demographic background, or about their preferred products or product combinations. In this way, customers are grouped into homogeneous groups whose members are very similar to each other. When a model is applied, IM Scoring assigns a cluster ID, a quality value, and a confidence value to each individual record being scored. The quality value and the confidence value are different measures that indicate how well the record fits into the assigned cluster.

The Regression mining function


The Regression mining function discovers how the value of one field depends on, and varies with, the values of the other fields within the same record. A model is generated that can predict a value for that particular field in a new record of the same form, based on other field values. For example, a retailer wants to use historical data to estimate the sales revenue for a new customer. A mining run on this historical data creates a model. This model can be used to predict the expected sales revenue for a new customer, based on the new customer's data. The model might also show that, for some customers, incentive campaigns improve sales. In addition, it might reveal that frequent visits by sales representatives lead to a lower revenue if the customer is young. When a model is applied, IM Scoring assigns a predicted value.
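To make the idea of predicting one field from another concrete, the following Python sketch fits a straight line to hypothetical historical data and uses it to predict the revenue for a new customer. Ordinary least squares with a single input field is used here purely for illustration. It is an assumption of this sketch and not necessarily the algorithm that the Regression mining function applies. The function fit_line and the sample values are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: number of purchases and sales revenue.
purchases = [1, 2, 3, 4, 5]
revenue = [120.0, 140.0, 160.0, 180.0, 200.0]

slope, intercept = fit_line(purchases, revenue)

# Predicted revenue for a new customer with 6 purchases.
predicted = slope * 6 + intercept
```

For this sample data, the fitted line has a slope of 20 and an intercept of 100, so the predicted revenue for the new customer is 220.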


Chapter 3. Easy Mining reference


There are Easy Mining procedures for typical mining tasks, for basic mining steps, and for preprocessing and utilities. With these Easy Mining procedures, you can use optional parameter strings to modify the default parameters. The Easy Mining procedures follow particular conventions.

Syntax diagrams and parameters


This chapter contains the syntax diagrams and related information of the complete set of the Easy Mining procedures.

Using the Easy Mining procedures


If you are using IM Modeling and IM Scoring, you can use the complete set of Easy Mining procedures. If you are using IM Scoring, you can only use a subset of the Easy Mining procedures. If you are using IM Modeling, the Easy Mining procedures are not available. The following tables show the product that is required to run an Easy Mining procedure. The X denotes that the Easy Mining procedure is supported in the corresponding product.
Table 4. Availability of Easy Mining procedures for typical mining tasks in IM products

Procedure name              IM Scoring    IM Modeling and IM Scoring
IDMMX.ClusterTable                        X
IDMMX.ExplainColValue                     X
IDMMX.FindDeviations                      X
IDMMX.FindMostImpFields                   X
IDMMX.FindRules                           X
IDMMX.FindSeqRules                        X
IDMMX.PredictColumn                       X
IDMMX.PredictColValue                     X

Table 5. Availability of Easy Mining procedures for basic mining steps in IM products

Procedure name                 IM Scoring    IM Modeling and IM Scoring
IDMMX.BuildClasModel                         X
IDMMX.TestClasModel                          X
IDMMX.ApplyClasModel           X             X
IDMMX.ExportClasModel          X             X
IDMMX.ExportClasTestResult                   X
IDMMX.BuildClasView                          X
IDMMX.BuildRegModel                          X
IDMMX.TestRegModel                           X

Copyright IBM Corp. 2001, 2008


Table 5. Availability of Easy Mining procedures for basic mining steps in IM products (continued)

Procedure name                 IM Scoring    IM Modeling and IM Scoring
IDMMX.ApplyRegModel            X             X
IDMMX.ExportRegModel           X             X
IDMMX.ExportRegTestResult                    X
IDMMX.BuildRegView                           X
IDMMX.BuildClusModel                         X
IDMMX.ApplyClusModel           X             X
IDMMX.ExportClusModel          X             X
IDMMX.BuildClusView                          X
IDMMX.BuildRuleModel                         X
IDMMX.ApplyRuleModel           X             X
IDMMX.ExportRuleModel          X             X
IDMMX.BuildRuleView                          X
IDMMX.BuildSeqRuleModel                      X
IDMMX.ApplySeqRuleModel        X             X
IDMMX.ExportSeqRuleModel       X             X
IDMMX.BuildSeqRuleView                       X

Table 6. Availability of Easy Mining procedures for preprocessing and utilities in IM products

Procedure name: IDMMX.SplitData, IDMMX.GetLastError, IDMMX.CancelTask, IDMMX.SetTraceFile, IDMMX.GetTraceFile
IM Scoring: X X X
IM Modeling: X X X X X

Typically, you must specify the following parameters for an Easy Mining procedure:

Model name
    The name of the model that you want to build.
Input table
    The name of the input table or the input view.
Optional parameter strings
    The name of an optional parameter string. By specifying optional parameter strings, you can modify default parameters of the Easy Mining procedures. For example, when you are building a classification tree, you might want to set the depth of the classification tree to 5 by using the DM_setTreeClasPar option. An Easy Mining procedure that can include an optional parameter string is, for example, the BuildClasModel procedure.
Settings parameters
    The name of a settings object. By specifying the name of a settings object, you can modify any parameter in this settings object. An Easy Mining procedure that includes the settings name parameter is labeled with the extension SN, for example, the ClusterTableSN procedure. The extension SN denotes settings name.

The syntax and both types of parameters of the Easy Mining procedures are described in the following sections.

Optional parameter strings


Syntax
With optional parameter strings, you can modify default parameters of the Easy Mining procedures.

optionsString:
    optionName ( argument [ , argument ]... ) [ , optionName ( argument [ , argument ]... ) ]...

Input parameters
<optionName>
    For this parameter, you can use the following items:
    • The names of the IM Modeling methods that modify a mining data or a logical data specification value
    • Depending on the underlying mining function of the Easy Mining procedure, the methods of the corresponding settings types that modify a settings value
    • The DM_setProgTab mining task method
    The Easy Mining procedures automatically determine particular parameters. Therefore, a value that you set in an optional parameter string might be overwritten by a value that is determined by the Easy Mining procedure.
<argument>
    For this parameter, you can use any value that is a reasonable parameter for the corresponding option. You must enclose string constants in two pairs of single quotes because the optional parameter string is itself a string constant that must be enclosed in a pair of single quotes.

The following tables give you an overview of the available optional parameter strings.
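The quoting rule above can be illustrated with a small helper that assembles an optional parameter string. This is only a sketch in Python, not part of the product: the helper names are hypothetical, and the argument lists shown for DM_setMaxNumClus and DM_setWhereClause are illustrative placeholders (the option names appear in Table 7, but their exact signatures are described in the IM Modeling documentation, not here).

```python
def render_argument(value):
    """Render one argument for use inside an optional parameter string."""
    if isinstance(value, str):
        if "'" in value:
            # Embedded quotes are deliberately not handled in this sketch.
            raise ValueError("embedded single quotes are not handled here")
        # String constants get two pairs of single quotes, because the
        # whole options string is later enclosed in one pair itself.
        return "''" + value + "''"
    return str(value)


def build_options_string(options):
    """Join (option_name, [arguments]) pairs into optionName(arg, ...), ..."""
    parts = []
    for name, args in options:
        rendered = ", ".join(render_argument(a) for a in args)
        parts.append(name + "(" + rendered + ")")
    return ", ".join(parts)


def as_sql_string_constant(options_string):
    """Enclose the finished options string in one pair of single quotes."""
    return "'" + options_string + "'"


# Placeholder arguments, chosen only to show a number and a string constant.
options = build_options_string([
    ("DM_setMaxNumClus", [10]),
    ("DM_setWhereClause", ["AGE > 30"]),
])
print(as_sql_string_constant(options))
```

The printed value is the form in which the string would be passed as the <optionsString> argument of a procedure call. Numbers are written as they are; only string constants receive the doubled quotes.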


Table 7. Overview of the available optional parameter strings

Option type: Mining data methods
Option name: DM_defMiningData, DM_setColumns, DM_setFldAlias, DM_impMiningData, DM_setWhereClause
Supported Easy Mining procedure: The Easy Mining procedures for typical mining tasks. The following Easy Mining procedures for basic mining steps: BuildClasModel, BuildClasView, BuildClusModel, BuildClusView, BuildRegModel, BuildRegView, BuildRuleModel, BuildRuleView, BuildSeqRuleModel, BuildSeqRuleView

Option type: Logical data specs methods
Option name: DM_addDataSpecFld, DM_addNmp, DM_addTax, DM_addTaxMap, DM_impDataSpec, DM_remDataSpecFld, DM_remNmp, DM_remTax, DM_remTaxMap, DM_setFldNmp, DM_setFldOutLim, DM_setFldTax, DM_setFldType
Supported Easy Mining procedure: The Easy Mining procedures for typical mining tasks. The following Easy Mining procedures for basic mining steps: BuildClasModel, BuildClasView, BuildClusModel, BuildClusView, BuildRegModel, BuildRegView, BuildRuleModel, BuildRuleView, BuildSeqRuleModel, BuildSeqRuleView

Option type: Classification settings methods
Option name: DM_impClasSettings, DM_setClasCost, DM_setClasCostRate, DM_setClasTarget, DM_setCostMat, DM_setFldUsageType, DM_setPowerOptions, DM_setTreeClasPar, DM_useClasDataSpec
Supported Easy Mining procedure: PredictColumn, PredictColValue, ExplainColValue, FindMostImpFields, BuildClasModel, BuildClasView

Option type: Regression settings methods
Option name: DM_impRegSettings, DM_setExecTime, DM_setFldUsageType, DM_setPowerOptions, DM_setRegRSquared, DM_setRegTarget, DM_useRegDataSpec
Supported Easy Mining procedure: PredictColumn, PredictColValue, FindMostImpFields, BuildRegModel, BuildRegView

Option type: Clustering settings methods
Option name: DM_addSimMat, DM_impClusSettings, DM_remSimMat, DM_setDClusFldpar, DM_setDClusPar, DM_setExecTime, DM_setFldOutlTreat, DM_setFldSimMat, DM_setFldSimScale, DM_setFldUsageType, DM_setFldWeight, DM_setMaxNumClus, DM_setMinData, DM_setPowerOptions, DM_useClusDataSpec
Supported Easy Mining procedure: FindDeviations, ClusterTable, BuildClusModel, BuildClusView

Option type: Rule settings methods
Option name: DM_addConItem, DM_impRuleSettings, DM_remConItem, DM_setConType, DM_setFldUsageType, DM_setGroup, DM_setItemFld, DM_setItemFormat, DM_setMaxLen, DM_setMinConf, DM_setMinSupport, DM_setPowerOptions, DM_useRuleDataSpec
Supported Easy Mining procedure: FindRules, ExplainColValue, BuildRuleModel, BuildRuleView, BuildSeqRuleModel, BuildSeqRuleView

Option type: Mining task methods
Option name: DM_setProgTab
Supported Easy Mining procedure: The Easy Mining procedures for typical mining tasks. The following Easy Mining procedures for basic mining steps: BuildClasModel, BuildClasView, BuildClusModel, BuildClusView, BuildRegModel, BuildRegView, BuildRuleModel, BuildRuleView, BuildSeqRuleModel, BuildSeqRuleView

Easy Mining procedures for typical mining tasks


The following Easy Mining procedures for typical mining tasks are available. They are presented in alphabetical order:
• IDMMX.ClusterTable and IDMMX.ClusterTableSN
• IDMMX.ExplainColValue and IDMMX.ExplainColValueSN
• IDMMX.FindDeviations and IDMMX.FindDeviationsSN
• IDMMX.FindMostImpFields and IDMMX.FindMostImpFieldsSN
• IDMMX.FindRules and IDMMX.FindRulesSN
• IDMMX.PredictColumn and IDMMX.PredictColumnSN
• IDMMX.PredictColValue and IDMMX.PredictColValueSN

IDMMX.ClusterTable
With the IDMMX.ClusterTable procedure or the IDMMX.ClusterTableSN procedure, you can partition the input table into homogeneous segments or clusters. You can specify the minimum size and the maximum size of these clusters.

Syntax
IDMMX.ClusterTable
IDMMX.ClusterTable ( clusterView , inputTable , minSize , maxSize [ , optionsString ] )

IDMMX.ClusterTableSN
IDMMX.ClusterTableSN ( clusterView , inputTable , minSize , maxSize , settingsName )

Input parameters
<clusterView>
    The name of the view that you want to build. The ClusterTable procedure creates a view and a model. The model represents the characteristics of the clusters. It is stored in the table IDMMX.ClusterModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<minSize>
    The value to define the minimum percentage of records in a cluster.
    This parameter is of type REAL.
<maxSize>
    The value to define the maximum percentage of records in a cluster.
    This parameter is of type REAL.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the cluster settings object. The cluster settings object is stored in the table IDMMX.ClusSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the parameter <clusterView> or <inputTable> is NULL or an empty string.
• A table or a view with the name <inputTable> does not exist.
• The value of the parameter <minSize> or <maxSize> is not a positive percentage.
• The value for <minSize> is greater than 50%. Only a trivial clustering model that consists of only one cluster meets this condition.
• The value of <minSize> is not smaller than <maxSize>.
• The value of <maxSize> is not greater than 1. This means that too many clusters (more than 100) would be generated.
• The input table contains at least one of the following columns:
    CLUSTERID
    QUALITY
    CONFIDENCE
• A model that satisfies the constraints for minimum size and maximum size cannot be found.
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the parameter <clusterView> and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the clustering run for computing the view <clusterView>.
• A clustering settings object with the name <settingsName> does not exist in the table IDMMX.ClusSettings.
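The size constraints listed above can be checked before the procedure is called. The following Python sketch mirrors only the documented conditions on <minSize> and <maxSize>. The function name is hypothetical, and the upper bound of 100 for a percentage is an assumption, because the text states only that the values must be positive percentages.

```python
def check_cluster_sizes(min_size, max_size):
    """Return a list of violated size constraints (empty if none)."""
    problems = []
    for name, value in (("minSize", min_size), ("maxSize", max_size)):
        # Assumption: a percentage lies in the range 0 < value <= 100.
        if value is None or not (0 < value <= 100):
            problems.append(name + " is not a positive percentage")
    if problems:
        return problems
    if min_size > 50:
        # Only a trivial model with a single cluster could satisfy this.
        problems.append("minSize is greater than 50%")
    if not min_size < max_size:
        problems.append("minSize is not smaller than maxSize")
    if not max_size > 1:
        # More than 100 clusters would be generated.
        problems.append("maxSize is not greater than 1")
    return problems


print(check_cluster_sizes(5, 30))
print(check_cluster_sizes(60, 80))
```

An empty list means that none of the documented size conditions is violated. The procedure can still fail for the other reasons in the list, for example when no model that satisfies both sizes can be found.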

IDMMX.FindDeviations
The IDMMX.FindDeviations procedure or the IDMMX.FindDeviationsSN procedure computes the atypical records, or deviations, of the input table.

Syntax
IDMMX.FindDeviations

IDMMX.FindDeviations ( deviationsView , inputTable [ , optionsString ] )

IDMMX.FindDeviationsSN
IDMMX.FindDeviationsSN ( deviationsView , inputTable , settingsName )

Input parameters
<deviationsView>
    The name of the view that you want to build. The FindDeviations procedure creates a view and a model. The model is stored in the table IDMMX.CLUSTERMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the cluster settings object. The cluster settings object is stored in the table IDMMX.ClusSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the parameter <deviationsView> or <inputTable> is NULL or an empty string.
• A table or a view with the name <inputTable> does not exist.
• The following columns are already in the input table:
    DEV_DEGREE
    CLUSTERID
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the parameter <deviationsView> and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the clustering run for computing the view <deviationsView>.
• A clustering settings object with the name <settingsName> does not exist in the table IDMMX.ClusSettings.

IDMMX.ExplainColValue
With the ExplainColValue procedure or the ExplainColValueSN procedure, you can find explanations in your data that indicate under which condition the target column contains the target value.

Syntax
IDMMX.ExplainColValue
IDMMX.ExplainColValue ( viewName , inputTable , targetColumn , targetValue , maxExplLength [ , optionsString ] )

IDMMX.ExplainColValueSN
IDMMX.ExplainColValueSN ( viewName , inputTable , targetColumn , targetValue , maxExplLength , clasSettingsName , ruleSettingsName )

Input parameters
With the ExplainColValue procedure, you must specify the following parameters:

<explanationView>
    The name of the view that you want to build. It contains the explanations in textual form. The ExplainColValue procedure creates a view, a classification model, and a rules model. The models are stored in the tables IDMMX.CLASSIFMODELS and IDMMX.RULEMODELS under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<targetColumn>
    The name of the column that you are particularly interested in.
    This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<targetValue>
    The value of the target column that you are particularly interested in.
    This parameter is of type VARCHAR. Its size is 1024.
<maxExplLength>
    The value that determines the maximum length of the explanation.
    This parameter is of type SMALLINT.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<clasSettingsName>
    The name of the classification settings object. The classification settings object is stored in the table IDMMX.ClasSettings.
    This parameter is of type VARCHAR. Its size is 240.
<ruleSettingsName>
    The name of the rule settings object. The rule settings object is stored in the table IDMMX.RuleSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the following parameters is NULL or an empty string:
    <explanationView>
    <inputTable>
    <targetColumn>
    <targetValue>
• A table or a view with the name <inputTable> does not exist.
• The value of the parameter <targetColumn> is not the name of a column of the input table or view.
• The target value is not a value of the target column.
• The target column does not have a value, or it has only one value.
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the parameter <explanationView> and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the classification run or the association run for computing the view <explanationView>.
• The classification settings object <clasSettingsName> does not exist in the table IDMMX.ClasSettings, or the rule settings object <ruleSettingsName> does not exist in the table IDMMX.RuleSettings.


IDMMX.FindMostImpFields
The IDMMX.FindMostImpFields procedure or the IDMMX.FindMostImpFieldsSN procedure computes the most important fields of the input table. Depending on the type of the target column, it uses the Tree Classification mining function or the Regression mining function.

Syntax
IDMMX.FindMostImpFields
IDMMX.FindMostImpFields ( mostImpFieldView , inputTable , targetColumn , optionsString )

IDMMX.FindMostImpFieldsSN
IDMMX.FindMostImpFieldsSN ( mostImpFieldView , inputTable , targetColumn , settingsName )

Input parameters
With the IDMMX.FindMostImpFields procedure, you must specify the following parameters:

<mostImpFieldView>
    The name of the view that you want to build. The FindMostImpFields procedure creates a view and a model. Depending on the type of the target column, a classification model or a regression model is generated. It is stored in one of the following tables under the same name as the generated view:
    • IDMMX.ClassifModels
    • IDMMX.RegressionModels
    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. Depending on the type of the target column, the Tree Classification mining function or the Regression mining function computes the most important fields of this table. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<targetColumn>
    The name of the target column. The values of the target column can be categorical or numeric.
    This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the settings object. Depending on the mining function that is used by the IDMMX.FindMostImpFields procedure, the settings object is stored in the table IDMMX.ClasSettings or IDMMX.RegSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the following parameters is NULL or an empty string:
    <mostImpFieldView>
    <inputTable>
    <targetColumn>
• The table or the view <inputTable> does not exist.
• The value of the parameter <targetColumn> is not the name of a column of the input table or view.
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the parameter <mostImpFieldView> and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the classification run or the regression run for computing the view <mostImpFieldView>.
• The classification settings object <settingsName> does not exist in the IDMMX.ClasSettings table, or the regression settings object <settingsName> does not exist in the IDMMX.RegSettings table.

IDMMX.FindRules
The IDMMX.FindRules procedure or the IDMMX.FindRulesSN procedure computes a rule model based on the input table. It uses all columns of the input table as item fields except the column that is specified as the group column. If the value of the group column is not NULL, the standard transaction input format for associations is assumed.

Syntax
IDMMX.FindRules
IDMMX.FindRules ( rulesView , inputTable , groupColumn , nbRules , minConfidence , optionsString )

IDMMX.FindRulesSN
IDMMX.FindRulesSN ( rulesView , inputTable , groupColumn , nbRules , minConfidence , settingsName )

Input parameters
With the IDMMX.FindRules procedure, you must specify the following parameters:

<rulesView>
    The name of the view that you want to build. The FindRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<groupColumn>
    The name of the column that contains the group or transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table are used as item columns. If you specify NULL for the GROUP column, each record represents a transaction and all columns of the input table are used as item columns.
    This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<nbRules>
    The minimum number of rules to be generated. The generated rules are stored in a view.
    This parameter is of type INTEGER.
<minConfidence>
    A value to measure the validity of the rule. The confidence of each rule is greater than or equal to the value that you specified for minimum confidence.
    This parameter is of type REAL.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the rule settings object. The rule settings object is stored in the table IDMMX.RuleSettings.
    This parameter is of type VARCHAR. Its size is 240.
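The effect of the <groupColumn> parameter described above can be pictured with a short Python sketch. It is not the product's implementation: the function name is hypothetical, and representing an item as a (column, value) pair is an assumption made for illustration only.

```python
def to_transactions(rows, group_column=None):
    """Group records into transactions as described for <groupColumn>.

    rows is a list of dicts that map column names to values.
    """
    transactions = {}
    for index, row in enumerate(rows):
        # With a group column, records that share its value form one
        # transaction; with NULL (None), each record is its own transaction.
        key = index if group_column is None else row[group_column]
        items = transactions.setdefault(key, set())
        for column, value in row.items():
            if column != group_column:
                # Every column except the group column is an item column.
                items.add((column, value))
    return transactions


rows = [
    {"TID": 1, "ITEM": "bread"},
    {"TID": 1, "ITEM": "milk"},
    {"TID": 2, "ITEM": "bread"},
]
print(to_transactions(rows, "TID"))
print(to_transactions(rows))
```

With "TID" as the group column, the three records collapse into two transactions. Without a group column, each record is a transaction of its own and TID is treated as an ordinary item column.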

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the following parameters is NULL or an empty string:
    <rulesView>
    <inputTable>
• The table or the view <inputTable> does not exist.
• The value of the <groupColumn> parameter, if it is different from NULL or the empty string, is not the name of a column of the input table or view.
• The value of the <nbRules> parameter is not a positive number.
• The value of the <minConfidence> parameter is not a positive percentage.
• A rule model with the required minimum number of rules and a confidence value greater than or equal to the value of the minimum confidence parameter cannot be found.
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the <rulesView> parameter and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the associations run for computing the view <rulesView>.
• The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.FindSeqRules
With the IDMMX.FindSeqRules procedure, you can find sequential relationships in your data.

Syntax
IDMMX.FindSeqRules
IDMMX.FindSeqRules ( seqRulesView , inputTable , sequenceColumn , groupColumn , nbRules , minConfidence , optionsString )


Input parameters
With the IDMMX.FindSeqRules procedure, you must specify the following parameters:

<seqRulesView>
    The name of the view that you want to build. The FindSeqRules procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. The view contains the following columns:
    • ID
    • HEADSETID
    • HEADSETTEXT
    • BODYSEQID
    • BODYSEQTEXT
    • SUPPORT
    • CONFIDENCE
    • LIFT
    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<sequenceColumn>
    The name of the column that contains the sequence ID. A sequence contains the item sets that have the same sequence ID.
<groupColumn>
    The name of the column that contains the group or the transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table with the exception of the sequence column are used as item columns. An item set contains items that have the same sequence ID and the same value in the group column. The item sets of a sequence are sorted according to the value in the group column.
    This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<nbRules>
    The minimum number of sequence rules to be generated. The generated rules are stored in a view.
    This parameter is of type INTEGER.
<minConfidence>
    A value to measure the validity of the sequence rule. The confidence of each sequence rule is greater than or equal to the value that you specified for minimum confidence.
    This parameter is of type REAL.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
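The way <sequenceColumn> and <groupColumn> together define sequences of item sets can be sketched in Python as follows. This only illustrates the description above and is not the product's algorithm. The function name, the sample column names, and the (column, value) item representation are assumptions.

```python
from collections import defaultdict


def to_sequences(rows, sequence_column, group_column):
    """Build, per sequence ID, the list of item sets ordered by group value."""
    grouped = defaultdict(lambda: defaultdict(set))
    for row in rows:
        # Items with the same sequence ID and group value share an item set.
        items = grouped[row[sequence_column]][row[group_column]]
        for column, value in row.items():
            if column not in (sequence_column, group_column):
                items.add((column, value))
    # The item sets of a sequence are sorted by the group column value.
    return {
        seq_id: [item_sets[group] for group in sorted(item_sets)]
        for seq_id, item_sets in grouped.items()
    }


rows = [
    {"CUST": "A", "VISIT": 2, "ITEM": "printer"},
    {"CUST": "A", "VISIT": 1, "ITEM": "pc"},
    {"CUST": "A", "VISIT": 1, "ITEM": "mouse"},
    {"CUST": "B", "VISIT": 1, "ITEM": "pc"},
]
print(to_sequences(rows, "CUST", "VISIT"))
```

In this sample, customer A has a sequence of two item sets: the items of visit 1, followed by the item of visit 2, regardless of the order in which the records appear in the input.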

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the following parameters is NULL or an empty string:
    <seqRulesView>
    <inputTable>
• The table or the view <inputTable> does not exist.
• The value of the <sequenceColumn> parameter or the <groupColumn> parameter is equal to NULL or the empty string.
• The value of the <sequenceColumn> parameter or the <groupColumn> parameter is not the name of a column of the input table or view.
• The value of the <nbRules> parameter is not a positive number.
• The value of the <minConfidence> parameter is not a positive percentage.
• A sequence rule model with the required minimum number of rules and a confidence value greater than or equal to the value of the minimum confidence parameter cannot be found.
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the <seqRulesView> parameter and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the mining run for computing the view <seqRulesView>.
• The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.PredictColumn
The IDMMX.PredictColumn procedure or the IDMMX.PredictColumnSN procedure creates a prediction model for the values in the target column and applies the model to the input table.

Syntax
IDMMX.PredictColumn
IDMMX.PredictColumn ( viewName , inputTable , targetColumn [ , optionsString ] )

IDMMX.PredictColumnSN
IDMMX.PredictColumnSN ( viewName , inputTable , targetColumn , settingsName )

Input parameters
<viewName>
    The name of the view that you want to build. The PredictColumn procedure creates a view and a model. Depending on the mining function that is used to build the model, the model is stored in one of the following tables under the same name as the generated view:
    • IDMMX.ClassifModels if the target column is categorical
    • IDMMX.RegressionModels if the target column is numeric
    If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<targetColumn>
    The name of the target column. The PredictColumn procedure derives the values in this column from the values of the other columns in the input table. If the values in the target column are categorical, the Classification mining function is used. If the values in the target column are numeric, the Regression mining function is used.
    This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the settings object. Depending on the mining function that is used by the IDMMX.PredictColumnSN procedure, the settings object is stored in the table IDMMX.ClasSettings or IDMMX.RegSettings.
    This parameter is of type VARCHAR. Its size is 240.


Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
• The value of the following parameters is NULL or an empty string:
    <viewName>
    <inputTable>
    <targetColumn>
• The table or the view <inputTable> does not exist.
• The value of the <targetColumn> parameter is not the name of a column of the input table or the input view.
• The target column does not have a value, or it has only one value.
• The target column has more than 10 categorical values.
• One or more of the following columns already exist in the input table:
    PREDICTION
    CONFIDENCE
    PRED_STD_DEV
    DATA_SET
• The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
• The value of the <viewName> parameter and the first argument of an Easy Mining procedure that is still running are the same.
• An error occurred during the classification run or the regression run for computing the view <viewName>.
• The classification settings object or the regression settings object <settingsName> does not exist in the IDMMX.ClasSettings table or the IDMMX.RegSettings table.
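The choice between the two mining functions, together with the documented conditions on the target column, can be summarized in a small Python sketch. The function name is hypothetical, and two points are assumptions rather than documented behavior: that NULL entries do not count as values, and that the caller already knows whether the column is numeric (the actual mapping of SQL types is given in Mining field types).

```python
def choose_mining_function(target_values, is_numeric):
    """Return (mining function, model table) for a target column."""
    # Assumption: NULL (None) entries do not count as values.
    distinct = {value for value in target_values if value is not None}
    if len(distinct) < 2:
        raise ValueError("the target column has no value or only one value")
    if is_numeric:
        return ("Regression", "IDMMX.RegressionModels")
    if len(distinct) > 10:
        raise ValueError("the target column has more than 10 categorical values")
    return ("Classification", "IDMMX.ClassifModels")


print(choose_mining_function(["YES", "NO", "YES"], is_numeric=False))
print(choose_mining_function([120.5, 80.0, 99.9], is_numeric=True))
```

A categorical target leads to a classification model and a numeric target to a regression model, each stored in the corresponding model table under the name of the generated view.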

IDMMX.PredictColValue
The IDMMX.PredictColValue procedure and the IDMMX.PredictColValueSN procedure compute a prediction model for a specific target value of the target column. They use the Tree Classification mining function and the Regression mining function to compute the model.

Syntax
IDMMX.PredictColValue
IDMMX.PredictColValue ( viewName , inputTable , targetColumn , targetValue [ , optionsString ] )

IDMMX.PredictColValueSN
IDMMX.PredictColValueSN ( viewName , inputTable , targetColumn , targetValue , clasSettingsName , regSettingsName )

Input parameters
With the PredictColValue procedure, you must specify the following parameters:
<viewName>
    The name of the view that you want to build.
    The PredictColValue procedure creates a view and a model. The model is stored in the table IDMMX.PredictColValue under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view.
    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 257.
<targetColumn>
    The name of the column whose values are to be predicted.
    This parameter is of type VARCHAR. Its size is 128.
    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<targetValue>
    The name of a value in the target column, for example, YES or NO.
    This parameter is of type VARCHAR. Its size is 1024.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<clasSettingsName>
    The name of the classification settings object. The classification settings object is stored in the table IDMMX.ClasSettings.
    This parameter is of type VARCHAR. Its size is 240.
<regSettingsName>
    The name of the regression settings object. The regression settings object is stored in the table IDMMX.RegSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <viewName>
    <inputTable>
    <targetColumn>
    <targetValue>
- The table or the view <inputTable> does not exist.
- The value of the <targetColumn> parameter is not the name of a column of the input table or view.
- The target value is not a value of the target column.
- A value is missing in the target column, or the target column has only one value.
- One or more of the following columns already exist in the input table:
    TARGET_VALUE
    CONFIDENCE
    PRED_STD_DEV
    DATA_SET
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred during the classification run or the regression run for computing the view <viewName>.
- The classification settings object <clasSettingsName> does not exist in the IDMMX.ClasSettings table, or the regression settings object <regSettingsName> does not exist in the IDMMX.RegSettings table.

Easy Mining procedures for basic mining steps


The following Easy Mining procedures for basic mining steps are available:
- IDMMX.ApplyClasModel
- IDMMX.ApplyClusModel
- IDMMX.ApplyRegModel
- IDMMX.BuildClasModel and IDMMX.BuildClasModelSN
- IDMMX.BuildClasView and IDMMX.BuildClasViewSN
- IDMMX.BuildClusModel and IDMMX.BuildClusModelSN
- IDMMX.BuildClusView and IDMMX.BuildClusViewSN
- IDMMX.BuildRegModel and IDMMX.BuildRegModelSN
- IDMMX.BuildRegView and IDMMX.BuildRegViewSN
- IDMMX.BuildRuleModel and IDMMX.BuildRuleModelSN
- IDMMX.ExportClasModel
- IDMMX.ExportClasTestResult
- IDMMX.ExportClusModel
- IDMMX.ExportRegModel
- IDMMX.ExportRegTestResult
- IDMMX.ExportRuleModel
- IDMMX.TestClasModel
- IDMMX.TestRegModel


IDMMX.ApplyClasModel
With the IDMMX.ApplyClasModel procedure, you can apply a model to data that does not yet include values for the outcome.

Syntax
IDMMX.ApplyClasModel
IDMMX.ApplyClasModel ( modelName , inputTable , outputView , predClasColumn , confidenceColumn )

Input parameters
<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.ClassifModels table.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view.
    This parameter is of type VARCHAR. Its size is 240.
<outputView>
    The name of the generated output view. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <outputView>
    <predClasColumn>
- The table or the view <inputTable> does not exist.
- The classification model <modelName> does not exist in the IDMMX.ClassifModels table.
- The column <predClasColumn> or <confidenceColumn> already exists in the input table.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred when the model was applied.


IDMMX.ApplyClusModel
With the IDMMX.ApplyClusModel procedure, you can determine the records that are assigned to the individual clusters by applying the model to the input table.

Syntax
IDMMX.ApplyClusModel
IDMMX.ApplyClusModel ( modelName , inputTable , outputView , clusterIDColumn , qualityColumn , confidenceColumn )

Input parameters
<modelName>
    The name of the model that you want to apply. The model is stored in the IDMMX.ClusterModels table.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The ApplyClusModel procedure applies the clustering model to the data in this table.
    This parameter is of type VARCHAR. Its size is 240.
<outputView>
    The name of the output view that you want to create. If a view with the same name already exists, the previous view is replaced with the new view.
    The output view contains the columns of the input table and additional columns that contain the cluster identification, the clustering quality, and the clustering confidence.
    This parameter is of type VARCHAR. Its size is 240.
<clusterIDColumn>
    The name of the column that contains the identification of the cluster that the record is assigned to.
    This parameter is of type VARCHAR. Its size is 128.
<qualityColumn>
    The name of the column that contains the value for the clustering quality. The clustering quality indicates how well the record fits into the cluster. The value can range from 0 to 1.
    - A value close to 1 means a good fit.
    - A value close to 0 means a bad fit.
    If you do not want to include the Quality column in the output view, leave the string for this parameter empty, or specify NULL.
    This parameter is of type VARCHAR. Its size is 128.


<confidenceColumn>
    The name of the column that contains the value for the computed clustering confidence. The clustering confidence indicates the confidence that the assigned cluster ID is the best cluster for this record. The value can range from 0 to 1.
    - A value close to 1 means that the assigned cluster is the only cluster where this record fits in well.
    - A value close to 0.5 means that there are other clusters this record might fit in well.
    If you do not want to include the Confidence column in the output view, leave the string for this parameter empty, or specify NULL.
    This parameter is of type VARCHAR. Its size is 128.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <outputView>
    <clusterIDColumn>
- The table or the view <inputTable> does not exist.
- The clustering model <modelName> does not exist in the IDMMX.ClusterModels table.
- One or more of the following columns already exist in the input table:
    <clusterIDColumn>
    <qualityColumn>
    <confidenceColumn>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred when the model was applied.

IDMMX.ApplyRegModel
With the ApplyRegModel procedure, you can apply regression models to data that does not yet include values for the outcome.

Syntax
IDMMX.ApplyRegModel
IDMMX.ApplyRegModel ( modelName , inputTable , outputView , predValueColumn , predStdDevColumn )


Input parameters
<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RegressionModels table.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The ApplyRegModel procedure applies the regression model to the data in this table.
    This parameter is of type VARCHAR. Its size is 240.
<outputView>
    The name of the output view that you want to build. If a view with the same name already exists, the previous view is replaced with the new view.
    The output view contains the columns of the input table and additional columns that contain the predicted value and the estimated standard deviation of the predicted value.
    This parameter is of type VARCHAR. Its size is 240.
<predValueColumn>
    The name of the column in the output view that contains the predicted numeric values.
    This parameter is of type VARCHAR. Its size is 128.
<predStdDevColumn>
    The name of the column in the output view that contains the value for the estimated standard deviation of the predicted value. The estimated standard deviation indicates the expected range of the actual value after you have applied the model.
    If you do not want to include the column for the estimated standard deviation in the output view, leave this string empty or specify NULL for this parameter.
    This parameter is of type VARCHAR. Its size is 128.
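The parameter descriptions above can be turned into a small client-side helper that checks the documented sizes and assembles a CALL statement. The following Python sketch is only an illustration. The helper names, the model name SALES_MODEL, and the table, view, and column names are hypothetical. The sketch assumes that the arguments are passed in the order in which the parameters are described and that an empty string is used to leave out the standard deviation column.

```python
def sql_literal(value):
    """Render a string as an SQL string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


# (parameter name, documented maximum size, must not be empty)
APPLY_REG_MODEL_PARAMS = [
    ("modelName", 240, True),
    ("inputTable", 240, True),
    ("outputView", 240, True),
    ("predValueColumn", 128, True),
    ("predStdDevColumn", 128, False),  # empty string: column is left out
]


def build_apply_reg_model_call(*args):
    """Validate the five arguments and return the text of a CALL statement."""
    if len(args) != len(APPLY_REG_MODEL_PARAMS):
        raise ValueError("expected 5 arguments")
    for (name, max_len, required), arg in zip(APPLY_REG_MODEL_PARAMS, args):
        if required and not arg:
            raise ValueError(f"{name} must not be empty")
        if len(arg) > max_len:
            raise ValueError(f"{name} is longer than {max_len} characters")
    rendered = ", ".join(sql_literal(arg) for arg in args)
    return f"CALL IDMMX.ApplyRegModel({rendered})"


# Hypothetical names, for illustration only.
print(build_apply_reg_model_call(
    "SALES_MODEL", "NEW_CUSTOMERS", "PREDICTED_SALES", "PRED_REVENUE", ""))
```

The helper only produces the statement text. Running the statement, and all checks that need the database (for example, whether the model exists), are left to the caller.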

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <outputView>
    <predValueColumn>
- The table or the view <inputTable> does not exist.
- The regression model <modelName> does not exist in the IDMMX.RegressionModels table.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The column <predValueColumn> or <predStdDevColumn> already exists in the input table.
- An error occurred when the model was applied.

IDMMX.ApplyRuleModel
With the IDMMX.ApplyRuleModel procedure, you can apply a model to data that does not yet include values for the outcome.

Syntax
IDMMX.ApplyRuleModel
IDMMX.ApplyRuleModel ( modelName , inputTable , rulesView , ruleIdColumn , headColumn , headNameColumn , supportColumn , confidenceColumn , liftColumn , matchHeadColumn , matchBodyColumn )

Input parameters
<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RuleModels table.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The ApplyRuleModel procedure applies the association rule model to the data in this table.
    The input table must contain the group column and the item columns that are required to build the model.
    This parameter is of type VARCHAR. Its size is 240.
<outputView>
    The name of the output view. The output view contains the group column and the following additional columns. These columns are only included in the output view if their names are not NULL or an empty string.
    - Rule ID column
    - Head column
    - Head name column
    - Support column
    - Confidence column
    - Lift column
    - Match head column
    - Match body column
    This parameter is of type VARCHAR. Its size is 240.
<ruleIdColumn>
    The name of the column that contains the ID of the matching association rule.
    This parameter is of type VARCHAR. Its size is 128.
<headColumn>
    The name of the column that contains the rule head item of the matching association rule. For each rule head item of an association rule, one record is returned.
    This parameter is of type VARCHAR. Its size is 128.
<headNameColumn>
    The name of the column that contains the name of the rule head item of the matching association rule.
    This parameter is of type VARCHAR. Its size is 128.
<supportColumn>
    The name of the column that contains the support value of the matching association rule.
    This parameter is of type VARCHAR. Its size is 128.
<confidenceColumn>
    The name of the column that contains the confidence value of the matching association rule.
    This parameter is of type VARCHAR. Its size is 128.
<liftColumn>
    The name of the column that contains the lift value of the matching association rule.
    This parameter is of type VARCHAR. Its size is 128.
<matchHeadColumn>
    The name of the column that contains the match head value. If the head item is included in the transaction, this column contains the value 1. If the head item is not included in the transaction, this column contains the value 0.
    This parameter is of type VARCHAR. Its size is 128.
<matchBodyColumn>
    The name of the column that contains the match body value. This value is equal to the number of items that are included in the rule body because only the association rules are returned whose body items are included in the transaction.
    This parameter is of type VARCHAR. Its size is 128.
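The meaning of the match head value and the match body value can be illustrated outside the database. The following Python sketch mimics the described behavior for one rule and one transaction. It is not the product's implementation, and the function name match_rule and the sample items are hypothetical.

```python
def match_rule(transaction, body, head):
    """Return one (head_item, match_head, match_body) record per head item.

    A rule is only returned if all of its body items are included in the
    transaction. Otherwise the result is an empty list.
    """
    items = set(transaction)
    body_items = set(body)
    if not body_items <= items:
        return []
    # Every body item matched, so the match body value is the body size.
    match_body = len(body_items)
    return [(h, 1 if h in items else 0, match_body) for h in head]


# Hypothetical rule: [beer, chips] ==> [salsa]
print(match_rule(["beer", "chips", "salsa"], ["beer", "chips"], ["salsa"]))
print(match_rule(["beer", "chips"], ["beer", "chips"], ["salsa"]))
```

In the first call the head item is part of the transaction, so the match head value is 1. In the second call it is 0, which marks a transaction to which the rule applies but whose head item is not yet present.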

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <rulesView>
- The table or the view <inputTable> does not exist.
- The rule model <modelName> does not exist in the IDMMX.RuleModels table.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred when the model was applied.

IDMMX.ApplySeqRuleModel
With the IDMMX.ApplySeqRuleModel procedure, you can apply a model to data that does not yet include values for the outcome.

Syntax
IDMMX.ApplySeqRuleModel
IDMMX.ApplySeqRuleModel ( modelName , inputTable , outputView , ruleIdColumn , headColumn , headNameColumn , supportColumn , confidenceColumn , liftColumn , matchHeadColumn , matchBodyColumn )

Input parameters
<modelName>
    The name of the model that you want to apply to a table or a view. The model is stored in the IDMMX.RuleModels table.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The ApplySeqRuleModel procedure applies the sequence rule model to the data in this table.
    The input table must contain the following columns, which are required to build the model:
    - Sequence column
    - Group column
    - Item columns
    This parameter is of type VARCHAR. Its size is 240.
<outputView>
    The name of the output view. The output view contains the sequence column and the following additional columns. These columns are only included in the output view if their names are not NULL or an empty string.
    - Rule ID column
    - Head column
    - Head name column
    - Support column
    - Confidence column
    - Lift column
    - Match head column
    - Match body column
    This parameter is of type VARCHAR. Its size is 240.
<ruleIdColumn>
    The name of the column that contains the ID of the matching sequence rule.
    This parameter is of type VARCHAR. Its size is 128.
<headColumn>
    The name of the column that contains the rule head item set of the matching sequence rule. For each rule head item set of a sequence rule, one record is returned.
    This parameter is of type VARCHAR. Its size is 128.
<headNameColumn>
    The name of the column that contains the names of the items in the rule head item set of the matching sequence rule.
    This parameter is of type VARCHAR. Its size is 128.
<supportColumn>
    The name of the column that contains the support value of the matching sequence rule.
    This parameter is of type VARCHAR. Its size is 128.
<confidenceColumn>
    The name of the column that contains the confidence value of the matching sequence rule.
    This parameter is of type VARCHAR. Its size is 128.
<liftColumn>
    The name of the column that contains the lift value of the matching sequence rule.
    This parameter is of type VARCHAR. Its size is 128.
<matchHeadColumn>
    The name of the column that contains the match head value. If the head item set is included in the sequence, this column contains the value 1. If the head item set is not included in the sequence, this column contains the value 0.
    This parameter is of type VARCHAR. Its size is 128.
<matchBodyColumn>
    The name of the column that contains the match body value. This value is equal to the number of item sets that are included in the rule body because only the sequence rules are returned whose body item sets are included in the sequence in the right order.
    This parameter is of type VARCHAR. Its size is 128.
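The phrase "included in the sequence in the right order" can be made concrete with a small sketch. The following Python code is an illustration under stated assumptions, not the product's algorithm. It assumes that a sequence is an ordered list of item sets, that each body item set must be contained in a different, later item set of the sequence than the previous one, and that the head counts as included if any item set of the sequence contains it. The function names and sample items are hypothetical.

```python
def body_matches_in_order(sequence, body):
    """Check whether the body item sets occur in the sequence in the given order."""
    position = 0
    for body_set in body:
        # Skip sequence entries that do not contain the current body item set.
        while position < len(sequence) and not set(body_set) <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next body item set must match a later entry
    return True


def match_sequence_rule(sequence, body, head):
    """Return (match_head, match_body), or None if the rule does not match."""
    if not body_matches_in_order(sequence, body):
        return None
    # Assumption: the position of the head within the sequence is not checked.
    match_head = 1 if any(set(head) <= set(entry) for entry in sequence) else 0
    return match_head, len(body)


# Hypothetical purchase sequence of one customer.
purchases = [{"printer"}, {"paper", "ink"}, {"toner"}]
print(match_sequence_rule(purchases, [["printer"], ["ink"]], ["toner"]))
```

Taking the earliest possible match for each body item set is sufficient here, because an earlier match never rules out a later one.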

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
- The table or the view <inputTable> does not exist.
- The rule model <modelName> does not exist in the IDMMX.RuleModels table.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred when the model was applied.

IDMMX.BuildClasModel
With the IDMMX.BuildClasModel procedure or the IDMMX.BuildClasModelSN procedure, you can perform a classification training run on the specified input table.

Syntax
IDMMX.BuildClasModel
IDMMX.BuildClasModel ( modelName , inputTable , targetColumn [ , optionsString ] )

IDMMX.BuildClasModelSN
IDMMX.BuildClasModelSN ( modelName , inputTable , targetColumn , settingsName )

Input parameters
With the BuildClasModel procedure, you must specify the following parameters:
<modelName>
    The name of the model that you want to build.
    The generated model is stored in the IDMMX.ClassifModels table. If a model with the same name already exists, the previous model is replaced with the new model.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The values in the columns of the input table are used to determine the distinguishing properties for each value of the target column.
    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 240.
<targetColumn>
    The name of the column whose values are to be predicted. The target column must contain only categorical values.
    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
    This parameter is of type VARCHAR. Its size is 128.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the classification settings object. The classification settings object is stored in the table IDMMX.ClasSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <targetColumn>
- The table or the view <inputTable> does not exist.
- The value of the <targetColumn> parameter is not the name of a column of the input table or view.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- An error occurred during the classification run for computing the model <modelName>.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The classification settings object <settingsName> does not exist in the IDMMX.ClasSettings table.

IDMMX.BuildClasView
With the BuildClasView procedure or the BuildClasViewSN procedure, you can combine the training of a classification model on the input table with the application of the model to the same input table.


Syntax
IDMMX.BuildClasView
IDMMX.BuildClasView ( viewName , inputTable , targetColumn [ , optionsString ] )

IDMMX.BuildClasViewSN
IDMMX.BuildClasViewSN ( viewName , inputTable , targetColumn , settingsName )

Input parameters
With the BuildClasView procedure, you must specify the following parameters:
<viewName>
    The name of the view that you want to build.
    The BuildClasView procedure creates a view and a model. The model is stored in the table IDMMX.ClassifModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view.
    This parameter is of type VARCHAR. Its size is 240.
<targetColumn>
    The name of the column whose values are to be predicted. The target column must contain only categorical values.
    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
    This parameter is of type VARCHAR. Its size is 128.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the classification settings object. The classification settings object is stored in the table IDMMX.ClasSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <viewName>
    <inputTable>
    <targetColumn>
- The table or the view <inputTable> does not exist.
- The value of the <targetColumn> parameter is not the name of a column of the input table or view.
- One or more of the following columns already exist in the input table:
    PRED_CLASS
    CONFIDENCE
- An error occurred during the classification run for computing the view <viewName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The classification settings object <settingsName> does not exist in the IDMMX.ClasSettings table.

IDMMX.BuildClusModel
With the IDMMX.BuildClusModel procedure or the IDMMX.BuildClusModelSN procedure, you can create a clustering model that assigns to each record of the input table the cluster it belongs to.

Syntax
IDMMX.BuildClusModel
IDMMX.BuildClusModel ( modelName , inputTable , optionsString )

IDMMX.BuildClusModelSN
IDMMX.BuildClusModelSN ( modelName , inputTable , settingsName )

Input parameters
<modelName>
    The name of the model that you want to build. The model is stored in the IDMMX.ClusterModels table. If a model with the same name already exists in the IDMMX.ClusterModels table, the previous model is replaced with the new model.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table. The BuildClusModel procedure starts a clustering run on the data of this table.
    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 240.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the cluster settings object. The cluster settings object is stored in the table IDMMX.ClusSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
- The table or the view <inputTable> does not exist.
- An error occurred during the clustering run for computing the model <modelName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The clustering settings object <settingsName> does not exist in the IDMMX.ClusSettings table.

IDMMX.BuildClusView
With the IDMMX.BuildClusView procedure or the IDMMX.BuildClusViewSN procedure, you can combine the following clustering mining steps by creating a clustering view:
- Building a clustering model
- Applying a clustering model
If you build a clustering view, the clustering model is built and applied to the records of the input table or view. The clustering view indicates the cluster that each record belongs to.


Syntax
IDMMX.BuildClusView
IDMMX.BuildClusView ( viewName , inputTable , optionsString )

IDMMX.BuildClusViewSN
IDMMX.BuildClusViewSN ( viewName , inputTable , settingsName )

Input parameters
<viewName>
    The name of the view that you want to build.
    The BuildClusView procedure creates a view and a model. The model is stored in the table IDMMX.ClusterModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The BuildClusView procedure starts a clustering run on the data in this table.
    This parameter is of type VARCHAR. Its size is 240.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the cluster settings object. The cluster settings object is stored in the table IDMMX.ClusSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <viewName>
    <inputTable>
- The table or the view <inputTable> does not exist.
- One or more of the following columns already exist in the input table:
    CLUSTER_ID
    CONFIDENCE
- An error occurred during the clustering run for computing the view <viewName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The clustering settings object <settingsName> does not exist in the IDMMX.ClusSettings table.
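Several view-building procedures fail if the input table already contains a column with the name of a generated output column. The following Python sketch collects the column names that are listed in the Return values sections of this chapter and reports collisions before a procedure is called. The dictionary and function names are hypothetical, and the case-insensitive comparison is an assumption about how column names are matched.

```python
# Generated column names as listed in the Return values sections.
RESERVED_OUTPUT_COLUMNS = {
    "PredictColumn": {"PREDICTION", "CONFIDENCE", "PRED_STD_DEV", "DATA_SET"},
    "PredictColValue": {"TARGET_VALUE", "CONFIDENCE", "PRED_STD_DEV", "DATA_SET"},
    "BuildClasView": {"PRED_CLASS", "CONFIDENCE"},
    "BuildClusView": {"CLUSTER_ID", "CONFIDENCE"},
}


def colliding_columns(procedure, input_columns):
    """Return the sorted input column names that clash with generated columns."""
    reserved = RESERVED_OUTPUT_COLUMNS[procedure]
    # Assumption: column names are compared without regard to case.
    return sorted({name.upper() for name in input_columns} & reserved)


# Hypothetical input table columns.
print(colliding_columns("BuildClusView", ["CUST_ID", "cluster_id", "AGE"]))
```

A non-empty result means that the listed columns have to be renamed or excluded, for example through a view on the input table, before the procedure is called.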

IDMMX.BuildRegModel
With the BuildRegModel procedure or the BuildRegModelSN procedure, you can perform a regression training run on the input table.

Syntax
IDMMX.BuildRegModel
IDMMX.BuildRegModel ( modelName , inputTable , targetColumn [ , optionsString ] )

IDMMX.BuildRegModelSN
IDMMX.BuildRegModelSN ( modelName , inputTable , targetColumn , settingsName )

Input parameters
<modelName>
    The name of the model that you want to build. The BuildRegModel procedure creates a model and saves it in the table IDMMX.RegressionModels. If a model with the same name already exists, the previous model is replaced with the new model.
    This parameter is of type VARCHAR. Its size is 240.
<inputTable>
    The name of the input table or the input view. The BuildRegModel procedure starts a regression training run on this table.
    The columns of the input table that might not be useful to create a model are ignored by the Easy Mining procedure. These are, for example, key columns.
    This parameter is of type VARCHAR. Its size is 240.
<targetColumn>
    The name of the column whose values are to be predicted. The target column must contain numeric values.
    This parameter is of type VARCHAR. Its size is 128.
    For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.
<optionsString>
    The optional parameter string that you want to use.
    This parameter is of type VARCHAR. Its size is 32672.
<settingsName>
    The name of the regression settings object. The regression settings object is stored in the table IDMMX.RegSettings.
    This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one or more of the following parameters is NULL or an empty string:
    <modelName>
    <inputTable>
    <targetColumn>
- The table or the view <inputTable> does not exist.
- The value of the <targetColumn> parameter is not the name of a column of the input table or view.
- An error occurred during the regression run for computing the model <modelName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The regression settings object <settingsName> does not exist in the IDMMX.RegSettings table.

IDMMX.BuildRegView
With the IDMMX.BuildRegView procedure or the IDMMX.BuildRegViewSN procedure, you can combine the training of regression models on the input table with the application of these models to the same table.

Syntax
IDMMX.BuildRegView
IDMMX.BuildRegView ( viewName , inputTable , targetColumn , optionsString )

IDMMX.BuildRegViewSN
IDMMX.BuildRegViewSN ( viewName , inputTable , targetColumn , settingsName )

Input parameters
<viewName>
The name of the view that you want to build. The BuildRegView procedure creates a view and a model. The model is stored in the table IDMMX.RegressionModels under the same name as the generated view. If a model with the same name already exists, the previous model is replaced with the new model. If a view with the same name already exists, the previous view is replaced with the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. This parameter is of type VARCHAR. Its size is 240.

<targetColumn>
The name of the column that contains the predicted value. The target column must contain numeric values. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154. This parameter is of type VARCHAR. Its size is 128.

<optionsString>
The optional parameter string that you want to use. This parameter is of type VARCHAR. Its size is 32672.

<settingsName>
The name of the regression settings object. The regression settings object is stored in the table IDMMX.RegSettings. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <viewName>, <inputTable>, <targetColumn>
- The table or the view <inputTable> does not exist.
- The value of the <targetColumn> parameter is not the name of a column of the input table or view.
- In the input data, the following columns already exist: PREDICTION, PRED_STD_DEV
- An error occurred during the regression run for computing the view <viewName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The regression settings object <settingsName> does not exist in the IDMMX.RegSettings table.

IDMMX.BuildRuleModel
With the IDMMX.BuildRuleModel procedure or the IDMMX.BuildRuleModelSN procedure, you can compute association rules based on the input table. The set of computed association rules is called a rule model.

Syntax
IDMMX.BuildRuleModel
IDMMX.BuildRuleModel ( modelName , inputTable , groupColumn , minSupport , minConfidence , maxRuleLength , optionsString )

IDMMX.BuildRuleModelSN
IDMMX.BuildRuleModelSN ( modelName , inputTable , groupColumn , minSupport , minConfidence , maxRuleLength , settingsName )

Input parameters
<modelName>
The name of the model that you want to build. The model is stored in the IDMMX.RuleModels table. If a model with the same name already exists, the previous model is replaced with the new model. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. The BuildRuleModel procedure starts an associations run on this table. The Easy Mining procedure ignores columns of the input table that might not be useful for creating a model, for example, key columns. This parameter is of type VARCHAR. Its size is 240.

<groupColumn>
The name of the column that contains the group ID or the transaction ID. If you specify a column of the input table as the GROUP column, the remaining columns of the input table are used as item columns. If you specify NULL or an empty string for the GROUP column, the BuildRuleModel procedure looks for association rules in the entire relational table. This parameter is of type VARCHAR. Its size is 128. For information about the valid SQL types of categorical and numerical fields, see Mining field types on page 154.

<minSupport>
The minimum support for all association rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the value for minimum support is determined automatically so that the result contains at least some association rules. This parameter is of type REAL.

<minConfidence>
The minimum confidence for all association rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the default value of 25% is used as the lower limit for the confidence of an association rule. This parameter is of type REAL.

<maxRuleLength>
The value for the maximum rule length. The rule length determines the maximum number of items that occur in an association rule. You must specify a value greater than or equal to 2. For example, if you specify 3 as the maximum rule length, an association rule contains at most two items in the rule body and one item in the rule head. If you specify a value less than or equal to 0, the maximum rule length is not limited. This parameter is of type INTEGER.

<optionsString>
The optional parameter string that you want to use. This parameter is of type VARCHAR. Its size is 32672.

<settingsName>
The name of the rule settings object. The rule settings object is stored in the table IDMMX.RuleSettings. This parameter is of type VARCHAR. Its size is 240.
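The defaulting rules for <minSupport>, <minConfidence>, and <maxRuleLength> can be summarized in a few lines of code. The following Python sketch is a reading aid under stated assumptions, not the product's implementation: the function name is hypothetical, `None` stands for "determined automatically" or "not limited", and rejecting percentages above 100 is an assumption, because the text only states that values between 0 and 100 can be specified.

```python
def resolve_rule_settings(min_support, min_confidence, max_rule_length):
    """Return the effective settings that the parameter descriptions imply."""
    # Assumption: values above 100 are rejected; the text only gives the range.
    if min_support > 100 or min_confidence > 100:
        raise ValueError("percentages must not exceed 100")
    # A rule needs at least one body item and one head item, so 1 is invalid.
    if max_rule_length == 1:
        raise ValueError("maxRuleLength must be >= 2, or <= 0 for no limit")
    return {
        # <= 0: minimum support is determined automatically (modeled as None)
        "min_support": None if min_support <= 0 else float(min_support),
        # <= 0: the documented default of 25% is used
        "min_confidence": 25.0 if min_confidence <= 0 else float(min_confidence),
        # <= 0: the rule length is not limited (modeled as None)
        "max_rule_length": None if max_rule_length <= 0 else int(max_rule_length),
    }


if __name__ == "__main__":
    print(resolve_rule_settings(0, -1, 0))
    # {'min_support': None, 'min_confidence': 25.0, 'max_rule_length': None}
    print(resolve_rule_settings(5, 40, 3))
    # {'min_support': 5.0, 'min_confidence': 40.0, 'max_rule_length': 3}
```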

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or the empty string: <modelName>, <inputTable>
- The table or the view with the name <inputTable> does not exist.
- The value of the <groupColumn> parameter is NULL or equal to the empty string.
- The value of the <groupColumn> parameter is not the name of a column of the input table or view.
- An error occurred during the associations mining run for computing the model <modelName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.BuildRuleView
With the IDMMX.BuildRuleView procedure, you can perform the following tasks in one step:
- Building an association rule model
- Applying this association rule model to the input table or the input view that was used to build the model

Syntax
IDMMX.BuildRuleView
IDMMX.BuildRuleView ( modelViewName , inputTable , groupColumn , minSupport , minConfidence , maxRuleLength , optionsString )

IDMMX.BuildRuleViewSN
IDMMX.BuildRuleViewSN ( viewName , inputTable , settingsName )

Input parameters
<viewName>
The name of the view that you want to build. The BuildRuleView procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model and a view with the same name already exist, the previous model and the previous view are replaced with the new model and the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the view. This parameter is of type VARCHAR. Its size is 240.

<groupColumn>
The name of the column that contains the group ID or the transaction ID. The value in the group column must be different from NULL. This parameter is of type VARCHAR. Its size is 128.

<minSupport>
The value for minimum support of the association rule.

<minConfidence>
The value for minimum confidence of the association rule.

<maxRuleLength>
The value for the maximum rule length.

<optionsString>
The string of the optional parameter. This parameter is of type VARCHAR. Its size is 32672.

<settingsName>
The name of the association settings object. The association settings object is stored in the table IDMMX.RuleSettings. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or the empty string: <viewName>, <inputTable>
- The table or the view with the name <inputTable> does not exist.
- The value of the <groupColumn> parameter is NULL or equal to the empty string.
- The value of the <groupColumn> parameter is not the name of a column of the input table or view.
- An error occurred during the associations mining run for computing the model <viewName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.BuildSeqRuleModel
With the BuildSeqRuleModel procedure, you can compute sequence rules based on the input table. The set of the computed sequence rules is called a sequence rule model.

Syntax
IDMMX.BuildSeqRuleModel
IDMMX.BuildSeqRuleModel ( modelName , inputTable , sequenceColumn , groupColumn , minSupport , minConfidence , maxRuleLength , optionsString )

IDMMX.BuildSeqRuleModelSN
IDMMX.BuildSeqRuleModelSN ( modelName , inputTable , settingsName )

Input parameters
<modelName>
The name of the model that you want to build. The model is stored in the IDMMX.RuleModels table. If a model with the same name already exists, the previous model is replaced with the new model. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. The BuildSeqRuleModel procedure starts a mining run on this table. The Easy Mining procedure ignores columns of the input table that might not be useful for creating a model, for example, key columns. This parameter is of type VARCHAR. Its size is 240.

<sequenceColumn>
The name of the sequence column. A sequence contains the item sets that have the same sequence ID.

<groupColumn>
The name of the column that contains the group ID or the transaction ID. The remaining columns of the input table, with the exception of the sequence column, are used as item columns. An item set contains the items that have the same sequence ID and the same group ID. The item sets in a sequence are sorted according to the value in the group column. This parameter is of type VARCHAR. Its size is 128.

<minSupport>
The minimum support value for all sequence rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the value for minimum support is determined automatically so that the result contains at least a few sequence rules. This parameter is of type REAL.

<minConfidence>
The minimum confidence value for all sequence rules, expressed as a percentage. You can specify a value between 0 and 100. If you specify a value that is less than or equal to 0, the default value of 25% is used as the lower limit for the confidence of a sequence rule. This parameter is of type REAL.

<maxRuleLength>
The value for the maximum rule length. The rule length determines the maximum number of item sets that occur in a sequence rule. You must specify a value greater than or equal to 2. For example, if you specify 3 as the maximum rule length, a sequence rule contains at most two item sets in the rule body and one item set in the rule head. If you specify a value less than or equal to 0, the maximum rule length is not limited. This parameter is of type INTEGER.

<optionsString>
The optional parameter string that you want to use. This parameter is of type VARCHAR. Its size is 32672.

<settingsName>
The name of the sequence rule settings object. The sequence rule settings object is stored in the table IDMMX.RuleSettings. This parameter is of type VARCHAR. Its size is 240.
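The way in which the sequence column and the group column turn rows into sequences of item sets can be illustrated with a short Python sketch. It is only a model of the description above: the function name and the sample rows are made up, a single item column is assumed for brevity, and the product performs this grouping inside the database.

```python
from collections import defaultdict


def build_sequences(rows):
    """Group (sequence_id, group_id, item) rows into ordered item sets.

    Returns a dict that maps each sequence ID to a list of item sets.
    Each item set holds the items that share the same sequence ID and
    group ID, and the item sets are ordered by the group ID.
    """
    item_sets = defaultdict(set)
    for sequence_id, group_id, item in rows:
        item_sets[(sequence_id, group_id)].add(item)

    sequences = defaultdict(list)
    # Sorting the (sequence_id, group_id) keys orders the item sets
    # of each sequence by the value in the group column.
    for sequence_id, group_id in sorted(item_sets):
        sequences[sequence_id].append(sorted(item_sets[(sequence_id, group_id)]))
    return dict(sequences)


if __name__ == "__main__":
    # Hypothetical purchases: customer ID as sequence ID, visit number as group ID.
    rows = [
        ("C1", 2, "printer"),
        ("C1", 1, "laptop"),
        ("C1", 1, "mouse"),
        ("C2", 1, "laptop"),
    ]
    print(build_sequences(rows))
    # {'C1': [['laptop', 'mouse'], ['printer']], 'C2': [['laptop']]}
```

In this example, customer C1 has a sequence of two item sets, so a sequence rule such as "laptop and mouse, followed by printer" has a rule length of 2.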

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or the empty string: <modelName>, <inputTable>
- The table or the view with the name <inputTable> does not exist.
- The value of the <sequenceColumn> parameter or the <groupColumn> parameter is NULL or equal to the empty string.
- The value of the <sequenceColumn> parameter or the <groupColumn> parameter is not the name of a column of the input table or view.
- An error occurred during the mining run for computing the model <modelName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.BuildSeqRuleView
With the IDMMX.BuildSeqRuleView procedure, you can perform the following tasks in one step:
- Building a sequence rule model
- Applying this sequence rule model to the input table or the input view that is used to build the model

Syntax
IDMMX.BuildSeqRuleView
IDMMX.BuildSeqRuleView ( modelViewName , inputTable , sequenceColumn , groupColumn , minSupport , minConfidence , maxRuleLength , optionsString )

IDMMX.BuildSeqRuleViewSN
IDMMX.BuildSeqRuleViewSN ( viewName , inputTable , settingsName )

Input parameters
<viewName>
The name of the view that you want to build. The BuildSeqRuleView procedure creates a view and a model. The model is stored in the table IDMMX.RuleModels under the same name as the generated view. If a model and a view with the same name already exist, the previous model and the previous view are replaced with the new model and the new view. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or view. This parameter is of type VARCHAR. Its size is 240.

<sequenceColumn>
The name of the sequence column. A sequence contains the item sets that have the same sequence ID.

<groupColumn>
The name of the column that contains the group ID or the transaction ID. In the input table, the columns other than the sequence column and the group column are the item columns. An item set contains the items that have the same sequence ID and the same group ID. The item sets that are included in a sequence are ordered according to the group ID.

<minSupport>
The value for the minimum support of the sequence.

<minConfidence>
The value for the minimum confidence of the sequence.

<maxRuleLength>
The value for the maximum length of the sequence rule. The maximum length of a sequence rule is determined by the maximum number of item sets in a sequence.

<optionsString>
The string for optional parameters. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or the empty string: <viewName>, <inputTable>
- The table or the view with the name <inputTable> does not exist.
- The value of the <sequenceColumn> parameter or the <groupColumn> parameter is NULL or equal to the empty string.
- The value of the <sequenceColumn> parameter or the <groupColumn> parameter is not the name of a column of the input table or view.
- An error occurred during the mining run for computing the model <viewName>.
- The value of the optional parameter string is syntactically not correct, or the optional parameter string contains invalid options.
- The value of the <viewName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- The rule settings object <settingsName> does not exist in the IDMMX.RuleSettings table.

IDMMX.ExportClasModel
With the IDMMX.ExportClasModel procedure, you can export a classification model to a file on a different server.

Syntax
IDMMX.ExportClasModel
IDMMX.ExportClasModel ( modelName , exportFileName )

Input parameters
With the IDMMX.ExportClasModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to export to a different server. The model is stored in the table IDMMX.ClassifModels. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the model exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <modelName>, <exportFileName>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the model is exported.

IDMMX.ExportClasTestResult
With the IDMMX.ExportClasTestResult procedure, you can export a classification test result to a file on a different server.

Syntax
IDMMX.ExportClasTestResult
IDMMX.ExportClasTestResult ( testResultName , exportFileName )

Input parameters
With the IDMMX.ExportClasTestResult procedure, you must specify the following parameters:

<testResultName>
The name of the test result that you want to export. It is stored in the table IDMMX.ClasTestResults. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the test result exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <testResultName>, <exportFileName>
- The value of the <testResultName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the test result is exported.

IDMMX.ExportClusModel
With the IDMMX.ExportClusModel procedure, you can export a clustering model to a file on a different server.

Syntax
IDMMX.ExportClusModel
IDMMX.ExportClusModel ( modelName , exportFileName )

Input parameters
With the IDMMX.ExportClusModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to export to a different server. It is stored in the table IDMMX.ClusterModels. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the model exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <modelName>, <exportFileName>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the model is exported.

IDMMX.ExportRegModel
With the IDMMX.ExportRegModel procedure, you can export a regression model to a file on a different server.

Syntax
IDMMX.ExportRegModel
IDMMX.ExportRegModel ( modelName , exportFileName )

Input parameters
With the IDMMX.ExportRegModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to export to a different server. It is stored in the table IDMMX.RegressionModels. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the model exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <modelName>, <exportFileName>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the model is exported.

IDMMX.ExportRegTestResult
With the IDMMX.ExportRegTestResult procedure, you can export a regression test result to a file on a different server.

Syntax
IDMMX.ExportRegTestResult
IDMMX.ExportRegTestResult ( testResultName , exportFileName )

Input parameters
With the IDMMX.ExportRegTestResult procedure, you must specify the following parameters:

<testResultName>
The name of the test result that you want to export. It is stored in the table IDMMX.RegTestResults. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the test result exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <testResultName>, <exportFileName>
- The value of the <testResultName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the test result is exported.

IDMMX.ExportRuleModel
With the IDMMX.ExportRuleModel procedure, you can export a rule model to a file on a different server.

Syntax
IDMMX.ExportRuleModel
IDMMX.ExportRuleModel ( modelName , exportFileName )

Input parameters
With the IDMMX.ExportRuleModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to export to a different server. It is stored in the table IDMMX.RuleModels. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the model exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <modelName>, <exportFileName>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the model is exported.

IDMMX.ExportSeqRuleModel
With the IDMMX.ExportSeqRuleModel procedure, you can export a sequence rule model to a file on a different server.

Syntax
IDMMX.ExportSeqRuleModel
IDMMX.ExportSeqRuleModel ( modelName , exportFileName )

Input parameters
With the IDMMX.ExportSeqRuleModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to export to a different server. It is stored in the table IDMMX.SeqRuleModels. This parameter is of type VARCHAR. Its size is 240.

<exportFileName>
The name of the file on the server where you want the model exported to. You must specify the complete path of the file. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <modelName>, <exportFileName>
- The value of the <modelName> parameter is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurs when the model is exported.

IDMMX.TestClasModel
The IDMMX.TestClasModel procedure tests a classification model by applying it to the data in the input table. It compares the class value that the classification model predicts with the actual class value in the input table.

Syntax
IDMMX.TestClasModel
IDMMX.TestClasModel ( modelName , inputTable , testResultName )

Input parameters
With the TestClasModel procedure, you must specify the following parameters:

<modelName>
The name of the model that you want to test. The model is stored in the IDMMX.ClassifModels table. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table or the input view. This parameter is of type VARCHAR. Its size is 240.

<testResultName>
The name of the result of testing the model. The classification test result is stored in the IDMMX.ClasTestResults table. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of the parameter <modelName>, <inputTable>, or <testResultName> is NULL or the empty string.
- The table or the view <inputTable> does not exist.
- The classification model <modelName> does not exist in the table IDMMX.ClassifModels.
- The value of the parameter <modelName> is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred during the classification test run for computing the test result <testResultName>.
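The idea behind a classification test, comparing predicted class values with actual class values, can be sketched in a few lines of Python. This is an illustration of the comparison only: the function name and the sample data are hypothetical, and the sketch makes no claim about the measures that the stored test result actually contains.

```python
from collections import Counter


def compare_predictions(actual, predicted):
    """Compare actual and predicted class values.

    Returns a Counter keyed by (actual, predicted) pairs and the
    fraction of records for which the prediction was wrong.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one record is required")
    confusion = Counter(zip(actual, predicted))
    errors = sum(count for (a, p), count in confusion.items() if a != p)
    return confusion, errors / len(actual)


if __name__ == "__main__":
    # Made-up class values for four records of an input table.
    actual = ["yes", "no", "yes", "no"]
    predicted = ["yes", "yes", "yes", "no"]
    confusion, error_rate = compare_predictions(actual, predicted)
    print(error_rate)  # 0.25
```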

IDMMX.TestRegModel
The IDMMX.TestRegModel procedure tests a regression model by applying it to the data in the input table. It compares the value that the regression model predicts with the actual value in the input table.

Syntax
IDMMX.TestRegModel
IDMMX.TestRegModel ( modelName , inputTable , testResultName )

Input parameters
<modelName>
The name of the model that you want to test. The model is stored in the IDMMX.RegressionModels table. This parameter is of type VARCHAR. Its size is 240.

<inputTable>
The name of the input table. This parameter is of type VARCHAR. Its size is 240.

<testResultName>
The name of the result of testing the model. The test result is stored in the IDMMX.RegTestResults table. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of the parameter <modelName>, <inputTable>, or <testResultName> is NULL or the empty string.
- The table or the view <inputTable> does not exist.
- The regression model <modelName> does not exist in the table IDMMX.RegressionModels.
- The value of the parameter <modelName> is the same as the first argument of an Easy Mining procedure that is still running.
- An error occurred during the regression test run for computing the test result <testResultName>.

Easy Mining procedures for preprocessing and utilities


The following procedures are available:
- IDMMX.CancelTask
- IDMMX.CleanUpTask
- IDMMX.GetLastError
- IDMMX.GetTraceFile
- IDMMX.SetTaskStopped
- IDMMX.SetTraceFile
- IDMMX.SplitData

IDMMX.CancelTask
Mining runs can take a very long time. With the IDMMX.CancelTask procedure, you can cancel the Easy Mining procedures for building and testing models.

Syntax
IDMMX.CancelTask
IDMMX.CancelTask ( modelViewName )

Input parameters
<modelViewName>
The name that you specified for the model that you want to build. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of the <modelViewName> parameter is NULL or an empty string.
- An Easy Mining procedure that contains <modelViewName> as the value of the first parameter was not started.

IDMMX.CleanUpTask
With the IDMMX.CleanUpTask procedure, you can remove all tables, views, models, or test results that are created by using a particular name as their first parameter.

Syntax
IDMMX.CleanUpTask
IDMMX.CleanUpTask ( modelViewName )

Input parameters
<modelViewName>
The name that you specified for the model that you want to build. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of the <modelViewName> parameter is NULL or an empty string.
- An Easy Mining procedure that contains <modelViewName> as the value of the first parameter was not started.
- An error occurs when the procedure is running.

IDMMX.GetLastError
With the IDMMX.GetLastError procedure, you can retrieve the previous error message.

Syntax
IDMMX.GetLastError
IDMMX.GetLastError ( modelViewName , errorCode , SQL code , errorMessage )

Input parameters
<modelViewName>
The name of the model or the view for which you want to retrieve the last error message. This parameter is of type VARCHAR. Its size is 240.

Output parameters
For the following parameters, you must specify a question mark (?) instead of the variable name. For an example of the command, see Examples on page 85.

<errorMessage>
The textual description of the error situation.

<errorCode>
The identification of the error message.

<SQL state>
The SQL state of the error.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- An Easy Mining procedure that contains <modelViewName> as the value of the first parameter was not started.
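The use of question marks for the output parameters can be illustrated with a small Python helper that assembles the text of the CALL statement. The helper name and the model name are hypothetical; the sketch only builds a string with one question mark for each of the three output parameters and does not connect to a database.

```python
def get_last_error_call(model_view_name):
    """Return the text of a CALL statement for IDMMX.GetLastError.

    The input parameter is passed as an SQL string literal. Each of the
    three output parameters is written as a question mark.
    """
    # Double any single quotes so that the name is a valid SQL literal.
    quoted = "'" + model_view_name.replace("'", "''") + "'"
    return f"CALL IDMMX.GetLastError({quoted}, ?, ?, ?)"


if __name__ == "__main__":
    # BANK.CHURN_MODEL is a made-up model name.
    print(get_last_error_call("BANK.CHURN_MODEL"))
    # CALL IDMMX.GetLastError('BANK.CHURN_MODEL', ?, ?, ?)
```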

IDMMX.GetTraceFile
With the IDMMX.GetTraceFile procedure, you can retrieve the name of the file that contains SQL statements that are submitted during the run of an Easy Mining procedure.

Syntax
IDMMX.GetTraceFile
IDMMX.GetTraceFile ( fileName )

Output parameters
<fileName>
The complete path of the file on the server that you want to retrieve. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. There are no exceptions for this procedure.

IDMMX.SetTaskStopped
With the IDMMX.SetTaskStopped procedure, you can declare the call of an Easy Mining procedure as stopped. You can use this procedure if the call of an Easy Mining procedure ended abnormally and you want to start it again.

Syntax
IDMMX.SetTaskStopped ( modelViewName )

Input parameters
<modelViewName> The name of the model or the view that is also used in an Easy Mining procedure that is currently running. This parameter is of type VARCHAR. Its size is 240.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of the <modelViewName> parameter is NULL or an empty string.
- An Easy Mining procedure that contains <modelViewName> as the value of the first parameter was not started.
- An error occurs when the procedure is running.

IDMMX.SetTraceFile
With the IDMMX.SetTraceFile procedure, you can set the name of the trace file for the current database connection.

Syntax
IDMMX.SetTraceFile ( fileName )

Input parameters
<fileName> The complete path of the file on the server to which the trace information is written. This parameter is of type VARCHAR. Its size is 32672.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exception might occur:
- The trace file cannot be created.

IDMMX.SplitData
With the IDMMX.SplitData procedure, you can split a table into a training data set and a test data set.

Syntax
IDMMX.SplitData ( inputTable , trainViewName , testViewName , testSampleSize , stratSampleColumn )

Input parameters
<inputTable> The name of the input table or the input view. This parameter is of type VARCHAR. Its size is 240.
<trainViewName> The name of the view that contains the training data set. This view contains the same columns as the input table. This parameter is of type VARCHAR. Its size is 240.
<testViewName> The name of the view that contains the test data set. This view contains the same columns as the input table. It contains a random sample of the records in the input table that is disjoint from the training view. This parameter is of type VARCHAR. Its size is 240.
<testSampleSize> A percentage that indicates the size of the test data set. This parameter is of type REAL.
<stratSampleColumn> This parameter is optional. If you specify a value that is different from NULL or an empty string for this parameter, a stratified sample is created for the training data set. This parameter is of type VARCHAR. Its size is 128.

Return values
When the Easy Mining procedure finishes successfully, the return code 0 is displayed. If the Easy Mining procedure does not finish successfully, the following exceptions might occur:
- The value of one of the following parameters is NULL or an empty string: <inputTable>, <trainViewName>, <testViewName>.
- The table or the view <inputTable> does not exist.
- The value of the <testSampleSize> parameter is not a positive percentage.
- A value is specified for the <stratSampleColumn> parameter, but it is not the name of a column of the input table or the input view.
- An error occurs when the views <trainViewName> and <testViewName> are computed.
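The effect of a random, disjoint split can be illustrated outside the database. The following Python sketch is not the IDMMX implementation; it only mimics the documented behavior under stated assumptions. The function name split_data, the seed argument, and the upper bound of 100 for the percentage are hypothetical, and stratified sampling is deliberately left out because its exact behavior is not described here.

```python
import random


def split_data(rows, test_sample_size, seed=0):
    """Split rows into disjoint training and test lists.

    test_sample_size is a percentage, for example 30.0 for 30 percent.
    The upper bound of 100 is an assumption of this sketch.
    """
    if not 0 < test_sample_size <= 100:
        raise ValueError("test_sample_size must be a positive percentage")
    rng = random.Random(seed)              # fixed seed keeps the sketch repeatable
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = round(len(rows) * test_sample_size / 100.0)
    test_idx = set(indices[:n_test])       # randomly chosen test records
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Ten made-up records stand in for the input table.
rows = [{"ID": i, "AGE": 20 + i} for i in range(10)]
train, test = split_data(rows, 30.0)
print(len(train), len(test))
```

Every input record ends up in exactly one of the two lists, which corresponds to the statement that the test view is disjoint from the training view.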

Easy Mining conventions and mining field types

Conventions


The following conventions are used for the Easy Mining procedures:

Qualified names
You can use qualified names for tables, models, and test results. Qualified names consist of the schema name and the table name, for example, BANKING.CUSTOMER_DATA where:
- BANKING is the schema name
- CUSTOMER_DATA is the table name
If you do not specify a schema name, the default schema is used. The default schema is the user ID. For example, if user ID Miller creates the table CUSTOMER_DATA without specifying a schema name, the qualified name of this table is MILLER.CUSTOMER_DATA.

Well-defined scripts
In DB2, all identifiers, for example, table names or column names, that are specified in DB2 SQL statements in lowercase without quotes are converted into uppercase. To avoid syntax errors and to easily differentiate between identifiers and SQL keywords, it is recommended to always specify identifiers in uppercase and include them in quotes. For example, this specification is well-defined:
CREATE TABLE "BANKING_MODELING" ( "TYPE" CHAR(7), "GENDER" CHAR(6), "AGE" DOUBLE, "PRODUCT" CHAR(1), "SIBLINGS" DOUBLE, "INCOME" DOUBLE)
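The two conventions above, the default schema and the conversion of unquoted identifiers into uppercase, can be modeled in a few lines. The following Python sketch is an illustration only, not DB2 code. The function names fold_identifier and qualify are hypothetical, and the test for an already qualified name (a period in the name) is a simplification.

```python
def fold_identifier(identifier):
    """Return the identifier as it is resolved under the rule above."""
    if len(identifier) >= 2 and identifier[0] == '"' and identifier[-1] == '"':
        return identifier[1:-1]            # quoted: spelling is kept as written
    return identifier.upper()              # unquoted: converted into uppercase


def qualify(table_name, user_id):
    """Prefix an unqualified table name with the default schema (the user ID)."""
    if "." in table_name:                  # simplification: treat as already qualified
        return table_name
    return fold_identifier(user_id) + "." + table_name


print(fold_identifier("banking_modeling"))     # BANKING_MODELING
print(fold_identifier('"Banking_Modeling"'))   # Banking_Modeling
print(qualify("CUSTOMER_DATA", "Miller"))      # MILLER.CUSTOMER_DATA
```

The second call shows why uppercase identifiers in quotes are unambiguous: the quoted name is taken exactly as written.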

SQL strings
SQL strings look like this:

DM_setTreeClasPar(MaxPur, 90)

The quotes included in SQL strings are single quotes. Different parameters must be separated with a comma.

Saving models, task objects, and test results
The generated mining models and test results are saved in the tables of IM Modeling and IM Scoring. If a task object is created, it is saved under the model name or the test result name in the corresponding IDMMX.*Task table.

Authorization for model schema
The schema for the following database objects defines a name space for these database objects and the corresponding tasks:
- Tables
- Views
- Models
In contrast to the standard behavior of a schema in a database, you cannot define authorizations for a model schema.

Deleting objects
You can delete the objects that are generated by the Easy Mining procedures by using a Utility procedure. For more information, see Cleaning up on page 88.

Mining field types


The Intelligent Miner mining functions distinguish only between categorical fields and numerical fields.

Columns of the following SQL types are treated as categorical fields by Intelligent Miner:
- CHAR
- VARCHAR
- LOB VARCHAR

Columns of the following SQL types are treated as numerical fields by Intelligent Miner:
- SMALLINT
- INTEGER
- BIGINT
- DECIMAL
- FLOAT
- DOUBLE
- REAL
- TIME
- DATE
- TIMESTAMP

You can change the mining field type of a column by using the DM_setFldType option.

The group column of the following procedures can be categorical or numerical:
- BuildRuleModel
- FindRules

The target column of the following procedures can be categorical:
- BuildClasModel
- BuildClasView

The target column of the following procedures can be numerical:
- BuildRegModel
- BuildRegView

The target column of the following procedures can be categorical or numerical:
- ExplainColValue
- FindMostImpFields
- PredictColumn
- PredictColValue

For more information about the column types and the sizes of the input and output parameters of the methods and functions, see IBM InfoSphere Warehouse: Creating mining models with Intelligent Miner Modeling.
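The default assignment of SQL types to mining field types can be written as a simple lookup. The following Python sketch only restates the two lists above. The function name default_mining_field_type is hypothetical, the type names are copied as they are listed in the text, and a change made with the DM_setFldType option is not modeled.

```python
# SQL type names exactly as they are listed in the text above.
CATEGORICAL_TYPES = {"CHAR", "VARCHAR", "LOB VARCHAR"}
NUMERICAL_TYPES = {
    "SMALLINT", "INTEGER", "BIGINT", "DECIMAL", "FLOAT",
    "DOUBLE", "REAL", "TIME", "DATE", "TIMESTAMP",
}


def default_mining_field_type(sql_type):
    """Map a column type such as CHAR(7) to its default mining field type."""
    base = sql_type.split("(")[0].strip().upper()   # "CHAR(7)" becomes "CHAR"
    if base in CATEGORICAL_TYPES:
        return "categorical"
    if base in NUMERICAL_TYPES:
        return "numerical"
    raise ValueError("SQL type is not in either list: " + sql_type)


# Some columns of the BANKING_MODELING example table.
columns = {"TYPE": "CHAR(7)", "GENDER": "CHAR(6)", "AGE": "DOUBLE", "INCOME": "DOUBLE"}
for name, sql_type in columns.items():
    print(name, default_mining_field_type(sql_type))
```

For the example table, TYPE and GENDER are reported as categorical fields, and AGE and INCOME are reported as numerical fields.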


Notices
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

For license inquiries regarding double-byte (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

IBM World Trade Asia Corporation
Licensing
2-31 Roppongi 3-chome, Minato-ku
Tokyo 106, Japan

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication.
IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information which has been exchanged, should contact:

IBM Deutschland GmbH
Department 0790
Pascalstrasse 100
70569 Stuttgart
Germany

Such information may be available, subject to appropriate terms and conditions, including in some cases, payment of a fee.

The licensed program described in this information and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement or any equivalent agreement between us.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms.
You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Each copy or any portion of these sample programs or any derivative work, must include a copyright notice as follows:

(your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.


Trademarks
The following terms are trademarks of International Business Machines Corporation in the United States, other countries, or both:
IBM
IBM logo
ibm.com
InfoSphere
DB2
Intelligent Miner

Java and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Linux is a trademark of Linus Torvalds in the United States, other countries, or both. Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. UNIX is a registered trademark of The Open Group in the United States and other countries. Other company, product, or service names may be trademarks or service marks of others.


Contacting IBM
If you have a technical problem, please review and carry out the actions suggested by the product documentation before contacting InfoSphere Warehouse Customer Support. This guide suggests information that you can gather to help InfoSphere Warehouse Customer Support to serve you better.

For information or to order any of the InfoSphere Warehouse products, contact an IBM representative at a local branch office or contact any authorized IBM software remarketer.

If you live in the U.S.A., you can call one of the following numbers:
- 1-800-IBM-SERV (1-800-426-7378) for customer service
- 1-888-426-4343 to learn about available service options
- 1-800-IBM-4YOU (426-4968) for DB2 marketing and sales

Note: In some countries, IBM-authorized dealers should contact their dealer support structure instead of the IBM Support Center.

Product Information
Information regarding IBM InfoSphere Warehouse products is available by telephone or by the World Wide Web at http://www.ibm.com/software/data/db2/dwe. This site contains the latest information on the technical library, ordering books, product downloads, newsgroups, FixPaks, news, and links to Web resources.

If you live in the U.S.A., then you can call one of the following numbers:
- 1-800-IBM-CALL (1-800-426-2255) to order products or to obtain general information.
- 1-800-879-2755 to order publications.

http://www.ibm.com/software/data/db2/udb/dwe/
Provides links to information about IBM InfoSphere Warehouse products.

http://www.ibm.com/software/data/db2/9
The DB2 Web pages provide current information about news, product descriptions, education schedules, and more.

http://www.elink.ibmlink.ibm.com/
Click Publications to open the International Publications ordering Web site that provides information about how to order books.

http://www.ibm.com/education/certify/
The Professional Certification Program from the IBM Web site provides certification test information for a variety of IBM products.

Accessible documentation
Documentation is provided in XHTML format, which is viewable in most Web browsers. XHTML allows you to view documentation according to the display preferences that you set in your browser. It also allows you to use screen readers and other assistive technologies.

Syntax diagrams are provided in dotted decimal format. This format is available only if you are accessing the online documentation using a screen reader.

Comments on the documentation


Your feedback helps IBM to provide quality information. Please send any comments that you have about this book or other IBM InfoSphere Warehouse documentation. You can use any of the following methods to provide comments:
- Send your comments using the online readers' comment form at www.ibm.com/software/data/rcf.
- Send your comments by electronic mail (e-mail) to comments@us.ibm.com. Be sure to include the name of the product, the version number of the product, and the name and part number of the book (if applicable). If you are commenting on specific text, please include the location of the text (for example, a title, a table number, or a page number).
- Send your comments by mail to:
  International Business Machines Corporation
  Reader Comments
  DTX/E269
  555 Bailey Avenue
  San Jose, CA
  U.S.A.
  95141-9989

For information on how to contact IBM outside of the United States, go to the IBM Worldwide page at www.ibm.com/planetwide.


Index

A
analyzing characteristics ClusterTable procedure 15 finding groups with similar characteristics 15 ApplyClasModel procedure example 52 input parameters 51, 52, 117 output 52 reference 117 return values 117 syntax 51 syntax diagram 117 ApplyClusModel procedure example 63 how to use it 61 input parameters 62, 118 reference 118 return values 119 syntax 62 syntax diagram 118 applying association rule models 69 classification models 51 name mapping 24 sequence rule models 77 applying classification models example 52 input parameters 51, 52 output 52 syntax 51 applying clustering models example 63 how to do it 61 input parameters 62 syntax 62 applying regression models example 58 how to do it 57 input parameters 57 syntax 57 ApplyRegModel procedure example 58 how to use it 57 input parameters 57 reference 119 syntax 57, 119, 120 ApplyRuleModel procedure command 71 concepts 69 example 71 input parameters 69 reference 121 return values 122 syntax 69, 121 syntax diagram 121 ApplySeqRuleModel procedure 77 example 79 input parameters 78, 123 reference 123 return values 124 syntax 78, 123 association rule models applying 69 association rule views building 71 Associations mining function 94 associations mining steps overview 65 authorization for model schema, conventions 154

B
banks, business goals 4 basic mining steps 48 overview 1 BuildClasModel procedure 125 complete procedure call 54 concepts 49 example 50 input parameters 49 model quality 50 overtrained models 50 reference 125 syntax 49 BuildClasView procedure example 53 how to go further 54 input parameters 53, 127 output view 54 reference 126 return values 127 syntax 53 syntax diagram 127 BuildClusModel procedure complete procedure call 65 example 61 how to use it 61 input parameters 61, 128 reference 128 return values 129 syntax 61, 128 BuildClusView procedure example 64 how to use it 63 input parameters 63, 64, 130 reference 129 return values 130 syntax 63, 130 building association rule views 71 classification models 49 classification views 53 regression models 55 sequence rule models 74 sequence rule views 80 building a clustering view example 64 how to do it 63 input parameters 63, 64 syntax 63 building classification models example 50 input parameters 49 syntax 49 building classification views example 53 how to go further 54 input parameters 53 output view 54 syntax 53 building clustering models example 61 how to do it 61 input parameters 61 syntax 61 building regression models example 56 input parameters 55 syntax 55 building regression views example 59 how to do it 58 input parameters 58 output view 59 syntax 58 BuildRegModel procedure example 56 how to use it 55 input parameters 55, 131 reference 131 return values 132 syntax 55, 131 BuildRegView procedure examples 59 how to use it 58 input parameters 58 output view 59 reference 132 syntax 58, 132, 133 BuildRuleModel procedure command reference 134 reference 134 BuildRuleView procedure 71 example 72 input parameters 71, 136 output 72 reference 136 return values 137 syntax 71 syntax diagram 136 BuildSeqRuleModel procedure 74 example 76 input parameters 75, 138 reference 137 return values 139 syntax 74, 138 BuildSeqRuleView procedure 80 input parameters 140 output 81 reference 140 return values 141 syntax diagram 140 business goals of banks 4

C
CancelTask procedure input parameters 149 reference 149 return values 149 syntax 149 causal relationships 37 Classification mining function 95 classification mining steps 48 applying classification models 51 building classification model 49 building classification views 53 exporting classification models and test results 54 testing classification model 50 when to use them 48 classification models applying 51 building 49 exporting 54 testing 50 classification views building 53 CleanUpTask procedure input parameters 149 reference 149 return values 149 syntax 149 clustering algorithms, selecting 64 Clustering mining function 95 clustering mining steps how to go further 64 how to use them 61 overview 60 when to use them 60 clusters setting the number of 65 clusters, reducing size 65 ClusterTable procedure analyzing characteristics 15 complete procedure call 17 data flow 11, 13 defining supplementary columns 17 example 12 how to go further 17 how to use it 10 input parameters 10, 102 interpreting the results 12 output 11, 13 procedure call 13 reference 102 return values 103 syntax 10, 102 when to use it 9 columns, defining supplementary 17 command ApplySeqRuleModel procedure 79 command reference ApplyClasModel procedure 117

command reference (continued) ApplyClusModel procedure 118 ApplyRegModel procedure 119 ApplyRuleModel procedure 121 ApplySeqRuleModel procedure 123 BuildClasModel procedure 125 BuildClasView procedure 126 BuildClusModel procedure 128 BuildClusView procedure 129 BuildRegModel procedure 131 BuildRegView procedure 132 BuildRuleModel procedure 134 BuildRuleView procedure 136 BuildSeqRuleModel procedure 137 BuildSeqRuleView procedure 140 CancelTask procedure 149 CleanUpTask procedure 149 ClusterTable procedure 102 ExplainColValue procedure 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 103 FindMostImpFields procedure 107 FindRules procedure 108 FindSeqRules procedure 110 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148 commands ApplyRuleModel procedure 71 ApplySeqRuleModel procedure 79 FindSeqRules procedure 28 complete procedure call FindDeviations procedure 9 finding relationships 24 complete procedure calls BuildClasModel procedure 54 BuildClusModel procedure 65 ClusterTable procedure 17 finding groups with similar characteristics 17 FindRules procedure 24 concepts applying association rule models 69 BuildClasModel procedure 49 TestClasModel procedure 50 confidence 11 of prediction 31 confusion matrix 35 conventions authorization for model schema 154 deleting objects 154 qualified names 153 saving models, task objects, test results 154 SQL strings 154

conventions (continued) well-defined scripts 154 creating the database Demobank

D
data exploration 93 data flow ClusterTable procedure 11, 13 ExplainColValue procedure 43 FindDeviations procedure 8 finding explanations for specific events 43 finding groups with similar characteristics 11, 13 finding most important fields 46 finding relationships 20 FindMostImpFields procedure 46 FindRules procedure 20 PredictColumn procedure 31, 32 PredictColValue procedure 39 predicting an outcome 39 predicting future behavior 31, 32 data flow PredictColumn procedure 31 data mining at a glance 92 data mining functions 94 Associations mining function 94 Classification mining function 95 Clustering mining function 95 Regression mining function 95 Sequence Rule mining function 95 data mining goals 92 data mining process 93 data exploration 93 data preparation 93 deployment 93 evaluation 93 modeling 93 problem definition 93 data mining with the Easy Mining procedures 3 data preparation 93 database Demobank 4 defining name mapping 23 defining supplementary columns 17 deleting objects, conventions 154 deployment 93 depth of classification tree PredictColumn procedure 37 predicting future behavior 37 depth of tree, setting 54 DM_addNmp 23 DM_remDataSpecFld 9 DM_setDClusPar 65 DM_setFldNmp 24 DM_setFldUsageType 17 DM_setItemFld option 23 DM_setMaxNumClus 65 DM_setTreeClasPar 54

E
easy mining procedures BuildingSeqRuleModel procedure 74, 77


easy mining procedures (continued) for preprocessing and utilities 2 Easy Mining procedures for associations mining steps 65 for basic mining steps 1 for classification mining steps 48 for clustering mining steps 60 for preprocessing and utilities 83 for regression mining steps 55 for sequences mining steps 74 for typical mining tasks 1 for utilities and preprocessing 83 overview 1 putting it all together 88 quick start sample 3 evaluation 93 example BuildClusModel procedure 61 building clustering models 61 BuildRegView procedure 59 BuildSeqRuleView procedure 81 predicting an outcome 40 examples ApplyClasModel procedure 52 ApplyClusModel procedure 63 applying classification models 52 applying clustering models 63 applying regression models 58 ApplyRegModel procedure 58 ApplyRuleModel procedure 71 ApplySeqRuleModel procedure 79 BuildClasModel procedure 50 BuildClasView procedure 53 BuildClusView procedure 64 building a clustering view 64 building classification models 50 building classification views 53 building regression models 56 building regression views 59 BuildRegModel procedure 56 BuildRuleView procedure 72 BuildSeqRuleModel procedure 76 BuildSeqRuleView procedure 81 ClusterTable procedure 12 DM_remDataSpecFld 9 ExplainColValue procedure 44 finding explanations for specific events 44 finding groups with similar characteristics 12 finding most important fields 47 finding relationships 20 finding sequential relationships 27 FindMostImpFields procedure 47 FindRules procedure 20 FindSeqRules procedure 27 PredictColumn procedure 32 PredictColValue procedure 40 predicting future behavior 32 TestClasModel procedure 51 testing classification models 51 testing regression models 57 TestRegModel procedure 57 ExplainColValue procedure data flow 43 example 44 how to go further 45

ExplainColValue procedure (continued) how to use it 42 input parameters 42, 105 output 43 reference 105 return values 106 syntax 42, 105 when to use it 41 ExportClasModel procedure input parameters 141 reference 141 return values 142 syntax 141 ExportClasTestResult procedure command reference 142 input parameters 142 return values 143 syntax 142 ExportClusModel procedure input parameters 143 reference 143 return values 143 syntax 143 exporting classification models and test results 54 ExportRegModel procedure input parameters 144 reference 144 return values 144 syntax 144 ExportRegTestResult procedure input parameters 145 reference 144 return values 145 syntax 144 ExportRuleModel procedure command reference 145 input parameters 145 return values 146 syntax 145 ExportSeqRuleModel procedure command reference 146 input parameters 146 return values 146 syntax diagram 146

F
field usage type, setting 17 Find Rules procedure name mappings 23 procedure call 21 result 21 when to use it 18 FindDeviations procedure business goals of banks 4 command 4 complete procedure call 9 creating the database Demobank data flow 8 how to continue 6 how to go further 8 how to use it 7 input parameters 7, 104 output 8 output view 5

FindDeviations procedure (continued) quick start sample 3 reference 103 return values 104 syntax 7, 103 when to use it 7 finding deviations how to go further 8 when to do it 7 finding explanations for specific events data flow 43 example 44 how to do it 42 how to go further 45 input parameters 42 output 43 syntax 42 when to do it 41 finding groups with similar characteristics analyzing characteristics 15 complete procedure call 17 data flow 11, 13 defining supplementary columns 17 example 12 how to do it 10 how to go further 17 input parameters 10 interpreting the results 12 output 11, 13 procedure call 13 syntax 10 when to do it 9 finding most important fields data flow 46 example 47 how to do it 45 input parameters 45 output 46 syntax 45 when to do it 45 finding relationships complete procedure call 24 data flow 20 example 20 how to do it 18 how to go further 22 input parameters 18 name mappings 23 output 19 procedure call 21 result 21 setting name of item ID column 22 syntax 18 when to do it 18 finding sequential relationships example 27 how to go further 29 input 25 output 26, 76 syntax 25 when to do it 24, 25 FindMostImpFields procedure data flow 46 example 47

FindMostImpFields procedure (continued) how to use it 45 input parameters 45, 107 output 46 reference 107 return values 108 syntax 45 syntax diagram 107 when to use it 45 FindRules procedure complete procedure call 24 data flow 20 example 20 how to go further 22 how to use it 18 input parameters 18, 109 output 19 reference 108 return values 110 setting name of item ID column 22 syntax 18, 108 FindSeqRules procedure example 27 how to go further 29 how to use it 25 input parameters 25, 111 output 26, 76 reference 110 return values 112 syntax 25, 110 when to use it 24

G
GetLastError procedure input parameters 150 output parameters 150 reference 150 return values 150 syntax 150 GetTraceFile procedure output parameters 151 reference 151 return values 151 syntax diagram 151 goals, data mining 92

H
how to continue 6 FindDeviations procedure 6 how to do it applying clustering models 61 applying regression models 57 building a clustering view 63 building clustering models 61 building regression views 58 finding deviations 7 finding explanations for specific events 42 finding groups with similar characteristics 10 finding most important fields 45 finding relationships 18 FindSeqRules procedure 25 predicting an outcome 38

how to do it (continued) predicting future behavior 30 testing regression models 56 how to go further BuildClasView procedure 54 building classification views 54 clustering mining steps 64 ClusterTable procedure 17 ExplainColValue procedure 45 FindDeviations procedure 8 finding deviations 8 finding explanations for specific events 45 finding groups with similar characteristics 17 finding relationships 22 finding sequential relationships 29 FindRules procedure 22 FindSeqRules procedure 29 PredictColumn procedure 36 PredictColValue procedure 41 predicting an outcome 41 predicting future behavior 36 regression mining steps 59 removing fields from the input table 8 sequence mining steps 81 how to use it ApplyClusModel procedure 61 ApplyRegModel procedure 57 BuildClusModel procedure 61 BuildClusView procedure 63 BuildRegModel procedure 55 BuildRegView procedure 58 ClusterTable procedure 10 ExplainColValue procedure 42 finding sequential relationships 25 FindMostImpFields procedure 45 FindRules procedure 18 PredictColumn procedure 30 PredictColValue procedure 38 TestRegModel procedure 56 how to use them clustering mining steps 61 procedures for classification mining steps 49 procedures for rules mining steps 66 regression mining steps 55 sequences mining steps 74

I
identification of characteristics 34 input fields, selecting 59 input parameters ApplyClasModel procedure 51, 52, 117 ApplyClusModel procedure 62, 118 applying classification models 51, 52 applying clustering models 62 applying regression models 57 ApplyRegModel procedure 57 ApplyRuleModel procedure 69 ApplySeqRuleModel procedure 78, 123 BuildClasModel procedure 49

input parameters (continued) BuildClasView procedure 53, 127 BuildClusModel procedure 61, 128 BuildClusView procedure 63, 64, 130 building a clustering view 63, 64 building classification views 53 building clustering models 61 building regression models 55 building regression views 58 BuildRegModel procedure 55, 131 BuildRegView procedure 58 BuildRuleView procedure 71, 136 BuildSeqRuleModel procedure 75, 138 BuildSeqRuleView procedure 80, 140 CancelTask procedure 149 CleanUpTask procedure 149 ClusterTable procedure 10, 102 ExplainColValue procedure 42, 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 145 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 7, 104 finding explanations for specific events 42 finding groups with similar characteristics 10 finding most important fields 45 finding relationships 18 finding sequential relationships 25 FindMostImpFields procedure 45, 107 FindRules procedure 18, 109 FindSeqRules procedure 25, 111 GetLastError procedure 150 PredictColumn procedure 30, 33, 113 PredictColValue procedure 38, 115 predicting an outcome 38 predicting future behavior 30, 33 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 testing regression models 56 TestRegModel procedure 56, 148 input table, removing fields 8, 54 interpreting the results ClusterTable procedure 12 item ID column, setting name 22

L
limiting the tree depth 54

M
mining field types 154 mining procedures BuildingSeqRuleModel procedure 74, 77 for associations mining steps 65


mining procedures (continued) procedures for associations mining steps 65 mining steps basic 48 mining steps for associations 65 mining steps, basic overview 1 mining tasks typical 6 mining tasks, typical overview 1 model confidence 11 quality 11 model quality BuildClasModel procedure 50 modeling 93 models overtrained 50, 54

N
name mappings finding relationships 23 FindRules procedure 23 names, qualified 153 Notices 157

output (continued) PredictColValue procedure 38 predicting an outcome 38 predicting future behavior 31, 34 output parameters GetLastError procedure 150 GetTraceFile procedure 151 output view BuildClasView procedure 54 building classification views 54 building regression views 59 BuildRegView procedure 59 FindDeviations procedure 5 overtrained models 50, 54, 60 overview associations mining steps 65 Easy Mining procedures for basic mining steps 1 Easy Mining procedures for typical mining tasks 1 regression mining steps 55 overview of the Easy Mining procedures 1

P
parameter strings 99 parameters FindDeviations procedure 7 parameters and syntax diagrams ApplyRuleModel procedure 121 ApplySeqRuleModel procedure 123 BuildSeqRuleModel procedure 137 ClusterTable procedure 102 ExplainColValue procedure 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 103 FindMostImpFields procedure 107 FindRules procedure 108 FindSeqRules procedure 110 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148 PredictColumn procedure data flow 31, 32 depth of classification tree 37 example 32 how to go further 36 how to use it 30 identification of characteristics 34 input parameters 30, 33, 113 output 31, 34 quality of the model 35 reference 112

O
option string parameters DM_setDClusPar 65 optional parameter strings 99 DM_addNmp 23 DM_remDataSpecFld 9 DM_setFldNmp 24 DM_setFldUsageType 17 DM_setItemFld option 23 DM_setTreeClasPar 54 DM-SetMaxNumClus 65 optional parameters applying name mapping 24 defining name mapping 23 defining supplementary columns 17 setting name of item ID column 23 output ApplyClasModel procedure 52 applying classification models 52 BuildRuleView procedure 72 BuildSeqRuleView procedure 81 ClusterTable procedure 11, 13 ExplainColValue procedure 43 FindDeviations procedure 8 finding explanations for specific events 43 finding groups with similar characteristics 11, 13 finding most important fields 46 finding relationships 19 finding sequential relationships 26, 76 FindMostImpFields procedure 46 FindRules procedure 19 FindSeqRules procedure 26, 76 PredictColumn procedure 34

PredictColumn procedure (continued) removing fields from the input table 36 return values 114 selecting candidates for bank card promotions 34 syntax 30 syntax diagram 112 when to use it 29 PredictColValue procedure data flow 39 example 40 how to go further 41 how to use it 38 input parameters 38, 115 output 38 reference 114 return values 115 syntax 38 syntax diagram 114 when to use it 37 predicting an outcome data flow 39 example 40 how to do it 38 how to go further 41 input parameters 38 output 38 syntax 38 when to do it 37 predicting future behavior data flow 31, 32 depth of classification tree 37 examples 32 how to do it 30 how to go further 36 identification of characteristics 34 input parameters 30, 33 output 31, 34 quality of the model 35 removing fields from the input table 36 selecting candidates for bank card promotions 34 syntax 30 when to do it 29 prediction confidence 31 problem definition 93 procedure call finding relationships 21 FindRules procedure 21 procedure calls ClusterTable procedure 13 finding groups with similar characteristics 13 procedure calls, complete FindRules procedure 24 procedures 125 ApplyClasModel procedure 117 ApplyClusModel procedure 118 ApplyRegModel procedure 119 ApplyRuleModel procedure 121 ApplySeqRuleModel procedure 123 BuildClasView procedure 126 BuildClusModel procedure 128 BuildClusView procedure 129


procedures (continued) BuildingSeqRuleModel procedure 74, 77 BuildRegModel procedure 131 BuildRegView procedure 132 BuildRuleModel procedure 134 BuildRuleView procedure 136 BuildSeqRuleModel procedure 137 BuildSeqRuleView procedure 80, 140 CancelTask procedure 149 CleanUpTask procedure 149 ClusterTable procedure 102 ExplainColValue procedure 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 103 FindMostImpFields procedure 107 FindRules procedure 108 FindSeqRules procedure 110 for basic mining steps 1 for preprocessing and utilities 2 for typical mining tasks 1 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148 procedures for classification mining steps how to use them 49 putting it all together 88

Q
qualified names, conventions 153 quality 11 BuildClasModel procedure 50 quality of the model PredictColumn procedure 35 predicting future behavior 35 quick start sample business goals of banks 4 creating the database Demobank 6 FindDeviations procedure 3, 4 finding deviations 3 output view 5 scenario 3

R
reducing the size of clusters 65 reference ApplyClasModel procedure 117 ApplyClusModel procedure 118 ApplyRegModel procedure 119 ApplyRuleModel procedure 121 ApplySeqRuleModel procedure 123 BuildClasModel procedure 125

reference (continued) BuildClasView procedure 126 BuildClusModel procedure 128 BuildClusView procedure 129 BuildRegModel procedure 131 BuildRegView procedure 132 BuildRuleModel procedure 134 BuildRuleView procedure 136 BuildSeqRuleModel procedure 137 BuildSeqRuleView procedure 140 CancelTask procedure 149 CleanUpTask procedure 149 ClusterTable procedure 102 ExplainColValue procedure 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 103 FindMostImpFields procedure 107 FindRules procedure 108 FindSeqRules procedure 110 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148 Regression mining function 95 regression mining steps how to go further 59 how to use them 55 overview 55 when to use them 55 regression models building 55 relational tables 66 relationships, causal 37 removing fields from the input table 8, 54 PredictColumn procedure 36 predicting future behavior 36 results ClusterTable procedure 12 finding groups with similar characteristics 12 finding relationships 21 FindRules procedure 21 return values ApplyClasModel procedure 117 ApplyClusModel procedure 119 ApplyRuleModel procedure 122 ApplySeqRuleModel procedure 124 BuildClasView procedure 127 BuildClusModel procedure 129 BuildClusView procedure 130 BuildRegModel procedure 132 BuildRuleView procedure 137 BuildSeqRuleModel procedure 139 BuildSeqRuleView procedure 141 CancelTask procedure 149

return values (continued) CleanUpTask procedure 149 ClusterTable procedure 103 ExplainColValue procedure 106 ExportClasModel procedure 142 ExportClasTestResult procedure 143 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 145 ExportRuleModel procedure 146 ExportSeqRuleModel procedure 146 FindDeviations procedure 104 FindMostImpFields procedure 108 FindRules procedure 110 FindSeqRules procedure 112 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 114 PredictColValue procedure 115 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 153 TestClasModel procedure 147 TestRegModel procedure 148 rules mining steps how to use them 66 when to use them 65

S
saving models, task objects, test results, conventions 154 scenario quick start sample 3 scripts, well-defined 154 selecting candidates for bank card promotions 34 clustering algorithms 64 input fields 59 sequence mining steps how to go further 81 Sequence Rule mining function 95 sequence rule models applying 77 building 74 sequence rule views building 80 sequences mining steps 74 how to use them 74 when to use them 74 SetTaskStopped procedure input parameters 151 reference 151 return values 151 syntax diagram 151 setting field usage type 17 name of item ID column 22 number of clusters 65 the tree depth 54 tree depth 54 SetTraceFile procedure input parameters 152 reference 152 return values 152 syntax diagram 152


specifying name mapping 23 specifying optional parameter strings 99 specifying supplementary columns 17 SplitData procedure command reference 152 input parameters 152 return values 153 syntax diagram 152 SQL strings, conventions 154 strings, optional parameter 99 supplementary columns, defining 17 syntax ApplyClasModel procedure 51 ApplyClusModel procedure 62 applying classification models 51 applying clustering models 62 applying regression models 57 ApplyRegModel procedure 57, 119, 120 ApplyRuleModel procedure 69, 121 ApplySeqRuleModel procedure 78, 123 BuildClasModel procedure 49 BuildClasView procedure 53 BuildClusModel procedure 61, 128 BuildClusView procedure 63, 130 building a clustering view 63 building classification models 49 building classification views 53 building clustering models 61 building regression models 55 building regression views 58 BuildRegModel procedure 55, 131 BuildRegView procedure 58, 132, 133 BuildRuleView procedure 71 BuildSeqRuleModel procedure 74, 138 BuildSeqRuleView procedure 80 CancelTask procedure 149 CleanUpTask procedure 149 ClusterTable procedure 10, 102 ExplainColValue procedure 42, 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 FindDeviations procedure 7, 103 finding explanations for specific events 42 finding groups with similar characteristics 10 finding most important fields 45 finding relationships 18 finding sequential relationships 25 FindMostImpFields procedure 45 FindRules procedure 18, 108 FindSeqRules procedure 25, 110 GetLastError procedure 150 PredictColumn procedure 30 PredictColValue procedure 38 predicting an outcome 38 predicting future behavior 30 TestClasModel procedure 51 testing classification models 51

syntax (continued) testing regression models 56 TestRegModel procedure 56 syntax diagram ApplyClasModel procedure 117 ApplyClusModel procedure 118 ApplyRuleModel procedure 121 BuildClasView procedure 127 BuildRuleView procedure 136 BuildSeqRuleView procedure 140 ExportSeqRuleModel procedure 146 FindMostImpFields procedure 107 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148 syntax diagrams and parameters ApplyRuleModel procedure 121 ApplySeqRuleModel procedure 123 BuildSeqRuleModel procedure 137 ClusterTable procedure 102 ExplainColValue procedure 105 ExportClasModel procedure 141 ExportClasTestResult procedure 142 ExportClusModel procedure 143 ExportRegModel procedure 144 ExportRegTestResult procedure 144 ExportRuleModel procedure 145 ExportSeqRuleModel procedure 146 FindDeviations procedure 103 FindMostImpFields procedure 107 FindRules procedure 108 FindSeqRules procedure 110 GetLastError procedure 150 GetTraceFile procedure 151 PredictColumn procedure 112 PredictColValue procedure 114 SetTaskStopped procedure 151 SetTraceFile procedure 152 SplitData procedure 152 TestClasModel procedure 147 TestRegModel procedure 148

testing regression models (continued) how to do it 56 input parameters 56 syntax 56 TestRegModel procedure example 57 how to use it 56 input parameters 56, 148 reference 148 return values 148 syntax 56 syntax diagram 148 trademarks 159 transaction tables 66 tree depth, setting 54 typical mining tasks 6 overview 1

U
using Easy Mining procedures for preprocessing and for utilities 83 for typical mining tasks 6 Using Easy Mining procedures for basic mining steps 48 utilities, easy mining procedures 2

W
well-defined scripts, conventions 154 when to do it finding deviations 7 finding explanations for specific events 41 finding groups with similar characteristics 9 finding most important fields 45 finding relationships 18 finding sequential relationships 24 predicting an outcome 37 predicting future behavior 29 when to use it ClusterTable procedure 9 ExplainColValue procedure 41 FindMostImpFields procedure 45 FindRules procedure 18 FindSeqRules procedure 24 PredictColumn procedure 29 PredictColValue procedure 37 when to use them clustering mining steps 60 procedures for classification mining steps 48 regression mining steps 55 rules mining steps 65 sequences mining steps 74

T
tables relational tables 66 transaction tables 66 test results exporting 54 TestClasModel procedure concepts 50 example 51 input parameters 147 reference 147 return values 147 syntax 51 syntax diagram 147 testing classification models example 51 syntax 51 testing regression models example 57


Program Number: 5724-E34

Printed in USA

SH12-6837-02
