
Build Data Warehouse/Data Mart

Data Warehouse: A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data from multiple heterogeneous data sources that supports management's decision-making process.

Pentaho Data Integration (PDI, also called Kettle) is the component of Pentaho responsible for the Extract,
Transform and Load (ETL) processes. PDI can also be used for other purposes:

Migrating data between applications or databases

Exporting data from databases to flat files

Loading data massively into databases

Data preprocessing

Integrating applications

Prerequisites: PDI requires the Oracle Java Runtime Environment (JRE) version 7.


Installation: PDI does not require installation. Simply unpack the zip file into a folder of your choice.
Running: PDI comes with a graphical user interface called Spoon, command-line scripts (Kitchen, Pan) to
execute transformations and jobs, and other utilities.
Spoon Introduction: Spoon is the graphical tool with which you design and test every PDI process. The other
PDI components execute the processes designed with Spoon, and are executed from a terminal window.
Repository and files:
In Spoon, you build Jobs and Transformations. PDI offers two methods to save them:

Files

Database repository

PDI (Enterprise) Repository (Enterprise Edition)

If you choose the database repository method, the repository has to be created the first time you execute
Spoon. If you choose the files method, the Jobs are saved in files with a kjb extension, and the
Transformations are in files with a ktr extension. In this tutorial you'll work with the Files method.

Starting Spoon
Start Spoon by executing Spoon.bat from the command prompt.

Creating a Transformation or Job


Create a new Transformation in one of three ways:
1. By clicking the New Transformation button on the main toolbar
2. By clicking New, then Transformation
3. By using the CTRL-N hot key

Any one of these actions opens a new Transformation tab for you to begin designing your transformation.
You create a new Job in one of three ways:
1. By clicking the New Job button on the main toolbar
2. By clicking New, then Job
3. By using the CTRL-ALT-N hot key

Any one of these actions opens a new Job tab for you to begin designing your job.

Hops
A hop connects one transformation step or job entry with another. The direction of the data flow is indicated
with an arrow on the graphical view pane. A hop can be enabled or disabled.

Identify source table and populate the sample data.


1. Create data files for student profile, attendance record and examination result record. Using PDI, find
the list of students who have not been detained due to shortage of attendance and have passed with first
class or above.

Data from sales.csv

Data from city_zip_code.csv

2. Using PDI, find the list of sales records with an ordered quantity of more than 75. Use the CSV data
files sales and city_zip_code.

3. Using PDI, find the list of sales records. If the zip code is NULL, look it up in the city_zip_code file
and add the appropriate city. Use the CSV data files sales and city_zip_code.

4. Create data files for sailor information, boat reservations and boat information. Using PDI, find
the list of senior-citizen sailors who have reserved at least one boat.

5. Create data files for sailor information, boat reservations and boat information. Using PDI, find
the list of sailors who are eligible to vote and have reserved at least one red boat.

1. Design multi-dimensional data models namely star, snowflake and fact-constellation schemas.
Multidimensional Data Model:
The multidimensional data model views data in the form of a data cube. A data cube contains a set of dimensions,
such as product, customer and time, and facts, such as sales (units sold) and profit.

Star schema architecture


Star schema architecture is the simplest data warehouse design. The main feature of a star schema is a table at
the center, called the fact table, and the associated dimension tables, which allow browsing specific categories,
summarizing, drilling down and specifying criteria.
Typically, the fact tables in a star schema are in database third normal form, while dimension tables are
de-normalized (second normal form). Despite being the simplest data warehouse architecture, the star schema is
the most commonly used in data warehouse implementations across the world today (about 90-95% of cases).
Fact table
The fact table is not a typical relational database table, as it is de-normalized on purpose to enhance query
response times. The fact table contains facts and reference keys to the respective dimension tables. It typically
contains records that are ready to explore, usually with ad hoc queries. Records in the fact table are often
referred to as events, due to the time-variant nature of a data warehouse environment.
Dimension table
Nearly all of the information in a typical fact table is also present in one or more dimension tables. The main
purpose of maintaining dimension tables is to allow browsing the categories quickly and easily.
The primary keys of the dimension tables are linked together to form the composite primary key of the fact
table. In a star schema design, there is only one de-normalized table for a given dimension.
Star schema example

Snowflake schema architecture:


Snowflake schema architecture is a more complex variation of the star schema design. The main difference is that
the dimension tables in a snowflake schema are normalized, so they have a typical relational database design.
Snowflake schemas are generally used when a dimension table becomes very big and when a star schema can't
represent the complexity of the data structure. For example, if a PRODUCT dimension table contains millions of
rows, the use of a snowflake schema should significantly improve performance by moving some data out to other
tables.
The drawback is that the more normalized the dimension tables are, the more complex the SQL joins that must be
issued to query them: to answer a query, many tables need to be joined and aggregates generated.

Snowflake schema example

Design a multidimensional cube using Schema Workbench in Pentaho CE BI Suite 3.0


Steps to design a new cube, publish it to the Pentaho server and view the cube via the Pentaho User Console:
Step 1: Make sure the Pentaho server is up and running.
Step 2: Once the Pentaho server is started, go to the folder where the Schema Workbench tool is installed on
your system.
Step 3: In the schema-workbench folder, double-click the batch file workbench to start up the Schema
Workbench tool (or right-click the batch file workbench to do the same). This will open the Schema Workbench
window along with a command prompt.
Step 4: Click on the Tools menu and select Preferences. This will open the Workbench Preferences window.
We need to provide the JDBC details based on the datasource we use.
Step 5: In the Workbench Preferences window provide the following details.
Driver Class Name: org.hsqldb.jdbcDriver
Connection URL: jdbc:hsqldb:hsql://localhost/sampledata
User Name: pentaho_user
Password: password
Schema (Optional): <leave it blank>
Require Schema Attributes: Check this option.
Click on the Accept button.
Step 6: To create a new schema file, select File -> New -> Schema from the menu bar.

This will open the New Schema 1 window with the schema file name as Schema1.xml. Please refer to the
below screenshot.

Step 7: Click on Schema as shown above and set the required properties for it, for example the name of
the schema. For now, enter the name as SchemaTest.

Step 8: Right-click on the Schema element and select the Add Cube option. This will add a new cube
into the schema.

Step 9: Set the name of the cube as CubeTest. Once this is done, the schema design will look like the
image below.

Step 10: Set the fact table for the cube CubeTest. To do so, click on the icon before the cube image,
as indicated by #2 in the screenshot above. This will expand the cube node as in the image below.

Step 11: Now click on the Table element; this will list the attributes specific to the Table element.
Clicking on the name attribute will display all tables available under the current datasource (the database we
set in Step 5). Select the table CUSTOMERS.

Once you choose the table PUBLIC -> CUSTOMERS, the schema attribute value will be filled in
automatically as PUBLIC.
Step 12: Now add a new dimension called CustomerGeography to the cube by right-clicking the cube
element CubeTest.

Step 13: For the new dimension, set the required attribute values such as name and foreign key.

Set the name of the dimension as CustomerGeography, and the foreign key as CUSTOMERNUMBER. Double-click
on the dimension name CustomerGeography. This will expand the node and display the Hierarchy.

Click on the hierarchy in the left side pane; you can find the attribute properties for the hierarchy.
Set name -> CustomerGeo; allMemberName -> All Countries

Step 14: Double-clicking on the Hierarchy element in the left side pane will expand the node further
and show the Table element. Click on the Table element to set the dimension table for the dimension
CustomerGeography. This will list the related attributes on the right side pane. Clicking on the name
attribute's value field will list the tables available in the current schema.
Select CUSTOMERS. This will automatically fill the schema field as PUBLIC.
Step 15: Right-click on the Hierarchy element in the left side pane and select Add Level.

This will add a new level named New Level 0. Refer to the screenshot below.

To rename it and set its other attributes, set the following attribute values for the newly created level in the
right side pane.
Name -> CustomerCountry
Column -> COUNTRY
Type -> String
uniqueMembers -> true
Now we have added a level called CustomerCountry.
Step 16: To add another level, right-click on Hierarchy in the left side pane (as in Step 15) and
select Add Level. This will add a new level named New Level 1. To rename it and set its other attribute
values, set the following in the right side pane:
Name -> CustomerCity
Column -> CITY
Type -> String
Step 17: To add a new dimension to the cube, right-click on the cube item (CubeTest) in the left side
pane and select Add Dimension.

This will add a new dimension to the cube with a default name. To rename it and set other attribute values,
click on the newly created dimension in the left side pane. This will list the attributes for the dimension.
Set name -> CustomerContact; foreign key -> CUSTOMERNUMBER.
Step 18: To add a hierarchy and levels for this dimension, double-click on the dimension name, which
will expand the dimension node CustomerContact. Click on the hierarchy element in the left side pane, then
set the following attribute values on the right side pane.
Set name -> (leave it blank); allMemberName -> All Contacts.
Step 19: Double-clicking on the hierarchy element will expand the hierarchy node, where you can set
the dimension table for the dimension CustomerContact.
Click on the Table element and select the table CUSTOMERS.
Step 20: To add a new level for this dimension/hierarchy, right-click on the hierarchy element and
select Add Level. This will add a new level named New Level 0 to the hierarchy. We can rename it by changing
the attribute values as follows:
Name -> CustomerNames
Column -> CONTACTFIRSTNAME
Type -> String
Step 21: To add a new measure to the cube, right-click on the cube CubeTest and select Add Measure.
This will add a new measure named New Measure 0. You can rename it by changing the attribute values.

Name -> CustomerCount


Aggregator -> count
Column -> CUSTOMERNUMBER
Format string -> ####

Datatype -> Integer


After setting up the measure, the cube (CubeTest) schema structure will look like the image below.

Step 22: Now click the File -> Save menu to save the cube schema. You can save it in any path you like,
for example as TestCube.mondrian.xml.
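For reference, a hand-written sketch of roughly what the saved TestCube.mondrian.xml could contain is shown
below. This is an assumption based on the steps above, not a verbatim copy of Schema Workbench's output; in
particular, the primaryKey attributes on the hierarchies are an assumption that the steps above do not set
explicitly.

<Schema name="SchemaTest">
  <Cube name="CubeTest">
    <Table schema="PUBLIC" name="CUSTOMERS"/>
    <Dimension name="CustomerGeography" foreignKey="CUSTOMERNUMBER">
      <Hierarchy name="CustomerGeo" hasAll="true" allMemberName="All Countries"
                 primaryKey="CUSTOMERNUMBER">
        <Table schema="PUBLIC" name="CUSTOMERS"/>
        <Level name="CustomerCountry" column="COUNTRY" type="String" uniqueMembers="true"/>
        <Level name="CustomerCity" column="CITY" type="String"/>
      </Hierarchy>
    </Dimension>
    <Dimension name="CustomerContact" foreignKey="CUSTOMERNUMBER">
      <Hierarchy hasAll="true" allMemberName="All Contacts" primaryKey="CUSTOMERNUMBER">
        <Table schema="PUBLIC" name="CUSTOMERS"/>
        <Level name="CustomerNames" column="CONTACTFIRSTNAME" type="String"/>
      </Hierarchy>
    </Dimension>
    <Measure name="CustomerCount" column="CUSTOMERNUMBER" aggregator="count"
             formatString="####" datatype="Integer"/>
  </Cube>
</Schema>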
Result:
A new cube has been designed, configured and published into the Pentaho Server. Also, we viewed the cube via
Pentaho User Console.

2. Write ETL scripts and implement using data warehouse tools.


Configuring the Modified JavaScript Value Step
1. Double-click on the Modified JavaScript Value Step.
2. The Step configuration window will appear. It differs from the previous Step configuration window in that
it allows you to write JavaScript code. You will use it to build the message "Hello, " concatenated with each
of the names.
3. Name this Step Greetings.
4. The main area of the configuration window is for coding. To the left, there is a tree with a set of
available functions that you can use in the code. In particular, the last two branches hold the input and
output fields, ready to use in the code. In this example there are two fields: last_name and name.
5. Write the following code:

var msg = 'Hello, ' + name + "!";

6. At the bottom you can type any variable created in the code. In this case, you have created a variable
named msg. Since you need to send this message to the output file, you have to write the variable name in the
grid. This should be the result:

1. Click OK to finish configuring the Modified JavaScript Value step.
2. Select the Step you just configured. In order to check that the new field will leave this Step, you will
now look at the Input and Output Fields. Input Fields are the data columns that reach a Step. Output Fields are
the data columns that leave a Step. Some Steps simply transform the input data; in this case, the input and
output fields are usually the same. Other Steps add fields to the Output (Calculator, for example). Still other
Steps filter or combine data, causing the Output to have fewer fields than the Input (Group by, for example).
3. Right-click the Step to bring up a context menu.
4. Select Show Input Fields. You'll see that the Input Fields are last_name and name, which come from the
CSV file input Step.
5. Select Show Output Fields. You'll see that not only do you have the existing fields, but also the new
msg field.
Configuring the XML Output Step

1. Double-click the XML Output Step. The configuration window for this kind of Step will appear. Here
you're going to set the name and location of the output file, and establish which of the fields you want to
include. You may include all or some of the fields that reach the Step.
2. Name the Step File with Greetings.
3. In the File box write:

${Internal.Transformation.Filename.Directory}/Hello.xml

4. Click Get Fields to fill the grid with the three input fields. In the output file you only want to include
the message, so delete name and last_name.
5. Save the Transformation again.

3. Perform OLAP operations


OLAP operations:
MOLAP uses array-based multidimensional storage engines for multidimensional views of data. With
multidimensional data stores, the storage utilization may be low if the data set is sparse. Many MOLAP servers
use two levels of data storage representation to handle dense and sparse data sets. OLAP servers are based on a
multidimensional view of data.
List of OLAP operations:

Roll-up

Drill-down

Slice and dice

Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways:

By climbing up a concept hierarchy for a dimension

By dimension reduction

Roll-up is performed by climbing up a concept hierarchy for the dimension location.

Initially the concept hierarchy was "street < city < province < country".
On rolling up, the data is aggregated by ascending the location hierarchy from the level of city to the level of
country; the data is then grouped into countries rather than cities.
When roll-up is performed, one or more dimensions are removed from the data cube.
Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following ways:

By stepping down a concept hierarchy for a dimension

By introducing a new dimension

Drill-down is performed by stepping down a concept hierarchy for the dimension time.
Initially the concept hierarchy was "day < month < quarter < year".
On drilling down, the time dimension is descended from the level of quarter to the level of month.
When drill-down is performed, one or more dimensions are added to the data cube.
It navigates from less detailed data to more highly detailed data.
Slice
The slice operation selects one particular dimension from a given cube and provides a new sub-cube. Consider
the following diagram that shows how slice works.

The figure shows a slice performed for the dimension "time" using the criterion time = "Q1".

It forms a new sub-cube by selecting a single dimension.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider the following
diagram that shows the dice operation.
The dice operation on the cube based on the following selection criteria involves three dimensions:

(location = "Toronto" or "Vancouver")

(time = "Q1" or "Q2")

(item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to provide an alternative
presentation of data. Consider the following diagram that shows the pivot operation.

Explore WEKA Data Mining/Machine Learning Toolkit


1. Downloading and installation of WEKA Data Mining Toolkit
Introduction
WEKA, the Waikato Environment for Knowledge Analysis, is a computer program that was developed at the
University of Waikato in New Zealand for the purpose of identifying information from raw data gathered from
agricultural domains. WEKA supports many standard data mining tasks such as data preprocessing, classification,
clustering, regression, visualization and feature selection. The basic premise of the application is a computer
program that can be trained to perform machine learning tasks and derive useful information in the form of
trends and patterns. WEKA is an open source application that is freely available under the GNU General Public
License. Originally written in C, the WEKA application has been completely rewritten in Java and is compatible
with almost every computing platform. It is user friendly, with a graphical interface that allows quick set up
and operation. WEKA operates on the premise that the user data is available as a flat file or relation; this
means that each data object is described by a fixed number of attributes that usually are of a specific type,
normally alphanumeric or numeric values. The WEKA application gives novice users a tool to identify hidden
information in databases and file systems with simple-to-use options and visual interfaces.
Installation
Information about the program can be found by searching the Web for "WEKA Data Mining", along with past
experiments that help new users identify the potential uses of most interest to them. When you are prepared to
download the software, it is best to select the latest version offered on the site. The application is offered
as a self-installing package, and a simple procedure installs the complete program on the end user's machine,
ready to use.
2. Understand the features of the WEKA Toolkit, such as the Explorer, Knowledge Flow interface, Experimenter
and command-line interface.
Once the program has been loaded on the user's machine, it is opened by navigating to the program's start
option, which depends on the user's operating system. The figure is an example of the initial opening screen on
a computer running Windows.

There are four options available on this initial screen.


1. Simple CLI - provides users without a graphical interface option the ability to execute commands from
a terminal window.
2. Explorer - the graphical interface used to conduct experimentation on raw data.
3. Experimenter - this option allows users to conduct different experimental variations on data sets and
perform statistical manipulation.
4. Knowledge Flow - basically the same functionality as the Explorer, with drag-and-drop functionality. The
advantage of this option is that it supports incremental learning from previous results.

3. Navigate the options available in WEKA (ex. Select attribute panel, Preprocess panel, Classify
panel, Cluster panel, Associate panel and Visualize panel)

Figure shows the opening screen with the available options. At first there is only the option to select the
Preprocess tab in the top left corner. This is due to the necessity to present the data set to the application so it
can be manipulated. After the data has been preprocessed the other tabs become active for use. There are six
tabs:
1. Preprocess- used to choose the data file to be used by the application
2. Classify- used to test and train different learning schemes on the preprocessed data file
under experimentation
3. Cluster- used to apply different tools that identify clusters within the data file

4. Association- used to apply different rules to the data file that identify association
within the data
5. Select attributes - used to apply different rules to reveal changes based on the inclusion or
exclusion of selected attributes from the experiment
6. Visualize - used to see what the various manipulations produced on the data set in a 2D
format, as scatter plot and bar graph output
4. Study the arff file format.
Attribute-Relation File Format
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a
set of attributes.
ARFF files have two distinct sections. The first section is the Header information, which is followed by
the Data information.
The ARFF Header Section
The ARFF Header section of the file contains the relation declaration and attribute declarations.
The @relation Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data
set has its own @attribute statement, which uniquely defines the name of that attribute and its data type.
The format for the @attribute statement is:
@attribute <attribute-name> <datatype>
where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name
then the entire name must be quoted.
The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:
numeric
<nominal-specification>
string
date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, string and date are case insensitive.
ARFF Data Section
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data Declaration
The @data declaration is a single line denoting the start of the data segment in the file. The format is:
@data

Example:
The Header of the ARFF file contains the name of the relation, a list of the attributes (the columns in
the data), and their types. An example header on the standard IRIS dataset looks like this:
% 1. Title: Iris Plants Database
%
% 2. Sources:
%    (a) Creator: R.A. Fisher
%    (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
%    (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines that begin with a % are comments. The @RELATION, @ATTRIBUTE and @DATA declarations are
case insensitive.

5. Explore the available data sets in WEKA. Load the data sets (e.g. the Weather data set, the Iris data set,
etc.). Load each data set and observe the following:

1. IRIS Dataset
i. List the attribute names and their types.
   (Sno / Name of attribute / Type of attribute)
ii. Number of records in the data set: 150
iii. Identify the class attribute (if any): class
iv. Plot the histogram and visualize the data in various dimensions.
v. Determine the number of records for each class.
   (Sno / Name of class / No. of records)
2. Weather Dataset
i. List the attribute names and their types.
   (Sno / Name of attribute / Type of attribute)
ii. Number of records in the data set: 14
iii. Identify the class attribute (if any):
iv. Plot the histogram and visualize the data in various dimensions.
v. Determine the number of records for each class.
   (Sno / Name of class / No. of records)

3. German Credit Card Dataset
i. List the attribute names and their types.
   (Sno 1-21 / Name of attribute / Type of attribute)
ii. Number of records in the data set: 1000
iii. Identify the class attribute (if any):
iv. Determine the number of records for each class.
   (Sno / Name of class / No. of records)
v. Plot the histogram and visualize the data in various dimensions.
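The same observations can be reproduced outside the Explorer with the WEKA Java API. The following is a
minimal sketch (the file name iris.arff is an assumption; point it at any of the data sets above) that lists
the attribute names and types, the number of records, the class attribute, and the number of records per class:

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExploreArff {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();   // load the ARFF file
        data.setClassIndex(data.numAttributes() - 1);                // last attribute = class

        // i. attribute names and their types
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute a = data.attribute(i);
            System.out.println((i + 1) + ". " + a.name() + " : " + Attribute.typeToString(a));
        }
        // ii. number of records, iii. class attribute
        System.out.println("Number of records : " + data.numInstances());
        System.out.println("Class attribute   : " + data.classAttribute().name());

        // v. number of records for each class
        int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
        for (int i = 0; i < counts.length; i++) {
            System.out.println(data.classAttribute().value(i) + " : " + counts[i]);
        }
    }
}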

UNIT II
Perform data preprocessing tasks and demonstrate performing association rule mining.
A. Explore the weather dataset for preprocessing and apply the Discretize and Resample filters on the dataset.
1. Start Weka. It will open the Weka GUI window.
2. Click on the Explorer button and you get the Weka Knowledge Explorer window.
3. Click on the Open file... button and open the weather.arff file.

4. Click on Choose and select filters/unsupervised/attribute/Discretize. Then click on the area to the right of
the Choose button.

5. Click on the Apply button to do the discretization. Then select one of the original numeric attributes (e.g.
temperature) and see how it has been discretized in the Selected attribute window.

6. Click on Choose and select filters/unsupervised/instance/Resample. Then click on the area to the right of the
Choose button.

7. Click on the Apply button to do the resampling. Then check the instance count and the Selected attribute
window to see how the data set has been resampled.
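For reference, the two filters applied above can also be run programmatically. The following is a minimal
sketch using the WEKA Java API, assuming weather.arff is in the working directory; the 50% sample size is an
illustrative choice, not a value prescribed by the exercise:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.instance.Resample;

public class PreprocessWeather {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();

        // Discretize: bin the numeric attributes (temperature, humidity) into nominal ranges
        Discretize discretize = new Discretize();
        discretize.setInputFormat(data);
        Instances discretized = Filter.useFilter(data, discretize);

        // Resample: draw a random subsample (here 50% of the instances)
        Resample resample = new Resample();
        resample.setSampleSizePercent(50.0);
        resample.setInputFormat(discretized);
        Instances resampled = Filter.useFilter(discretized, resample);

        System.out.println(discretized.attribute("temperature"));  // shows the generated bins
        System.out.println("Instances after resampling: " + resampled.numInstances());
    }
}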

B. Load the IRIS dataset into WEKA, apply the Discretize filter on the numeric fields and run the Apriori
algorithm. Study the rules generated.
1. Start Weka. It will open the Weka GUI window.
2. Click on the Explorer button and you get the Weka Knowledge Explorer window.
3. Click on the Open file... button and open the IRIS.arff file.

4. Click on Choose and select filters/unsupervised/attribute/Discretize.

5. Click on the Apply button to do the discretization.

6. Select the Associate tab from the main menu. Then click on the Choose button and select
weka/associations/Apriori.

7. Click on the Start button to generate the frequent itemsets and the association rules.
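A minimal sketch of part B with the WEKA Java API (the file name iris.arff is an assumption): every numeric
attribute is discretized so that all attributes are nominal, and Apriori is then run to print the association
rules it finds.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class IrisApriori {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        Discretize discretize = new Discretize();      // Apriori needs nominal attributes only
        discretize.setInputFormat(data);
        Instances nominalData = Filter.useFilter(data, discretize);

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);                       // report the 10 best rules
        apriori.buildAssociations(nominalData);
        System.out.println(apriori);                   // prints the rules found
    }
}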

UNIT III
Demonstrate performing classification on datasets
CLASSIFICATION: It is necessary to provide a clear classification of data mining systems, which may help users
distinguish between such systems and identify them. Data mining systems can be classified in different ways:
1) According to the kinds of databases mined.
2) According to the kinds of knowledge mined, which is based on mining functionalities such as characterization
and discrimination.
3) According to the kinds of techniques utilized and the applications adapted.
A. Load the IRIS and German Credit Card datasets into Weka and run the ID3 and J48 algorithms. Study the
classifier output, entropy values and kappa statistics.
B. Extract the if-then rules generated by the classifiers. Observe the confusion matrix and derive the accuracy,
F-measure, TP rate, FP rate, precision and recall values. Apply the cross-validation strategy with various fold
levels and compute the accuracy results.
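The measures asked for in parts A and B - accuracy, kappa statistic, confusion matrix, TP rate, FP rate,
precision, recall and F-measure - can also be computed with the WEKA Java API. The following is a minimal
sketch (iris.arff assumed; the same code works for the German credit ARFF), not the recorded lab output:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48CrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));   // 10-fold cross-validation

        System.out.println(eval.toSummaryString());               // accuracy, kappa, error measures
        System.out.println(eval.toMatrixString());                // confusion matrix
        System.out.println(eval.toClassDetailsString());          // TP/FP rate, precision, recall, F-measure
    }
}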

Classifier output for IRIS Dataset

Classifier output for German credit card Dataset

C. Load the IRIS dataset into Weka and perform Naive Bayes classification and k-nearest neighbour
classification. Interpret the results obtained.
NaiveBayes: Classification of data can be done based on Bayes' theorem. The NaiveBayes classifier is a simple
probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. In simple terms,
the NaiveBayes classifier assumes that the presence or absence of a particular feature of a class is unrelated
to the presence or absence of any other feature. An advantage of the NaiveBayes classifier is that it requires
only a small amount of training data to estimate the parameters (such as the means and variances of the
variables) necessary for classification.
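A minimal sketch of part C with the WEKA Java API (iris.arff assumed): Naive Bayes and a k-nearest-neighbour
classifier (WEKA's IBk, here with k = 3, an illustrative choice) are evaluated side by side so that their
results can be compared and interpreted.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesVsKnn {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation nbEval = new Evaluation(data);
        nbEval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));
        System.out.println("Naive Bayes accuracy : " + nbEval.pctCorrect() + " %");

        IBk knn = new IBk(3);                                    // 3-nearest-neighbour classifier
        Evaluation knnEval = new Evaluation(data);
        knnEval.crossValidateModel(knn, data, 10, new Random(1));
        System.out.println("3-NN accuracy        : " + knnEval.pctCorrect() + " %");
    }
}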

D. Plot ROC curves.

E. Compare the classification results of J48 and Naive Bayes classification for the IRIS dataset, deduce which
classifier performs best and which performs worst for the IRIS dataset, and justify.

Results of the J48 and Naive Bayes classification algorithms:

Measure                              J48        Naive Bayes
Instances
Test mode
Number of leaves
Size of tree
Correctly classified instances
Incorrectly classified instances
Kappa statistic
Confusion matrix
Time taken to build model

Performance of the J48 and Naive Bayes classification algorithms from the above results:

UNIT IV
Demonstrate performing clustering on datasets
CLUSTERING
Clustering is the task of assigning a set of objects into groups called clusters. Clustering is also referred to
as cluster analysis, where the objects in the same cluster are more similar to each other than to objects in
other clusters. Clustering is a main task of exploratory data mining and a common technique for statistical data
analysis used in many fields such as machine learning, pattern recognition, image analysis and bioinformatics.
Cluster analysis is not a single algorithm but a general task to be solved. Clustering comes in different forms,
such as hierarchical clustering, which creates a hierarchy of clusters, partitional clustering, and spectral
clustering.
SimpleKMeans: This is a method of cluster analysis called partitional clustering. K-means clustering partitions
(divides) n observations into K clusters, where each observation belongs to the cluster with the nearest mean.
K-means clustering is an algorithm to group objects based on attributes/features into K groups, where K is a
positive integer.
A. Load the IRIS dataset into Weka and run the simple k-means clustering algorithm with different values of k.
Study the clusters formed. Observe the sum of squared errors and the centroids, and derive insights.
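A minimal sketch of exercise A with the WEKA Java API (iris.arff assumed): SimpleKMeans is run for several
values of k, and the within-cluster sum of squared errors and the centroids are printed. The class attribute is
removed first so that clustering uses only the numeric measurements.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class IrisKMeans {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        Remove remove = new Remove();          // drop the last attribute (the class)
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        for (int k = 2; k <= 5; k++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(k);
            km.setSeed(10);
            km.buildClusterer(noClass);
            System.out.println("k = " + k + "  within-cluster SSE = " + km.getSquaredError());
            System.out.println(km.getClusterCentroids());
        }
    }
}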

B. Explore the other clustering techniques available in Weka:
1. weka.clusterers.Cobweb
2. weka.clusterers.FarthestFirst
3. weka.clusterers.EM
4. weka.clusterers.FilteredClusterer
5. weka.clusterers.HierarchicalClusterer
6. weka.clusterers.MakeDensityBasedClusterer

C. Explore the visualization features of Weka to visualize the clusters. Derive interesting insights and explain
them.

3. GERMAN CREDIT DATA

Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such
dataset, consisting of 1000 actual cases collected in Germany. An Excel spreadsheet version of the original
German credit data can be downloaded from the web.
In spite of the fact that the data is German, you should probably make use of it for this assignment. (Unless
you can really consult a real loan officer!)
A few notes on the German dataset:
DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like a
quarter).
owns_telephone: German phone rates are much higher than in Canada, so fewer people own telephones.
foreign_worker: There are millions of these in Germany (many from Turkey). It is very hard to get German
citizenship if you were not born of German parents.
There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of two
categories, good or bad.
Description of the German credit dataset in ARFF (Attribute Relation File Format) Format:
Structure of ARFF Format:
%comment lines

@relation relation name


@attribute attribute name
@Data
Set of data items separated by commas.
% 1. Title: German Credit data
%
% 2. Source Information
%
% Professor Dr. Hans Hofmann
% Institut f"ur Statistik und "Okonometrie
% Universit"at Hamburg
% FB Wirtschaftswissenschaften
% Von-Melle-Park 5
% 2000 Hamburg 13
%
% 3. Number of Instances: 1000
%
% Two datasets are provided. the original dataset, in the form provided
% by Prof. Hofmann, contains categorical/symbolic attributes and
% is in the file "german.data".
%
% For algorithms that need numerical attributes, Strathclyde University
% produced the file "german.data-numeric". This file has been edited
% and several indicator variables added to make it suitable for
% algorithms which cannot cope with categorical variables. Several
% attributes that are ordered categorical (such as attribute 17) have
% been coded as integer. This was the form used by StatLog.
%
%
% 6. Number of Attributes german: 20 (7 numerical, 13 categorical)
% Number of Attributes german.numer: 24 (24 numerical)
%
%
% 7. Attribute description for german
%
% Attribute 1:  (qualitative)   Status of existing checking account
%               A11 : ... < 0 DM
%               A12 : 0 <= ... < 200 DM
%               A13 : ... >= 200 DM / salary assignments for at least 1 year
%               A14 : no checking account
%
% Attribute 2:  (numerical)     Duration in month
%
% Attribute 3:  (qualitative)   Credit history
%               A30 : no credits taken / all credits paid back duly
%               A31 : all credits at this bank paid back duly
%               A32 : existing credits paid back duly till now
%               A33 : delay in paying off in the past
%               A34 : critical account / other credits existing (not at this bank)
%
% Attribute 4:  (qualitative)   Purpose
%               A40 : car (new)
%               A41 : car (used)
%               A42 : furniture/equipment
%               A43 : radio/television
%               A44 : domestic appliances
%               A45 : repairs
%               A46 : education
%               A47 : (vacation - does not exist?)
%               A48 : retraining
%               A49 : business
%               A410 : others
%
% Attribute 5:  (numerical)     Credit amount
%
% Attribute 6:  (qualitative)   Savings account/bonds
%               A61 : ... < 100 DM
%               A62 : 100 <= ... < 500 DM
%               A63 : 500 <= ... < 1000 DM
%               A64 : ... >= 1000 DM
%               A65 : unknown / no savings account
%
% Attribute 7:  (qualitative)   Present employment since
%               A71 : unemployed
%               A72 : ... < 1 year
%               A73 : 1 <= ... < 4 years
%               A74 : 4 <= ... < 7 years
%               A75 : ... >= 7 years
%
% Attribute 8:  (numerical)     Installment rate in percentage of disposable income
%
% Attribute 9:  (qualitative)   Personal status and sex
%               A91 : male : divorced/separated
%               A92 : female : divorced/separated/married
%               A93 : male : single
%               A94 : male : married/widowed
%               A95 : female : single
%
% Attribute 10: (qualitative)   Other debtors / guarantors
%               A101 : none
%               A102 : co-applicant
%               A103 : guarantor
%
% Attribute 11: (numerical)     Present residence since
%
% Attribute 12: (qualitative)   Property
%               A121 : real estate
%               A122 : if not A121 : building society savings agreement / life insurance
%               A123 : if not A121/A122 : car or other, not in attribute 6
%               A124 : unknown / no property
%
% Attribute 13: (numerical)     Age in years
%
% Attribute 14: (qualitative)   Other installment plans
%               A141 : bank
%               A142 : stores
%               A143 : none
%
% Attribute 15: (qualitative)   Housing
%               A151 : rent
%               A152 : own
%               A153 : for free
%
% Attribute 16: (numerical)     Number of existing credits at this bank
%
% Attribute 17: (qualitative)   Job
%               A171 : unemployed / unskilled - non-resident
%               A172 : unskilled - resident
%               A173 : skilled employee / official
%               A174 : management / self-employed / highly qualified employee / officer
%
% Attribute 18: (numerical)     Number of people being liable to provide maintenance for
%
% Attribute 19: (qualitative)   Telephone
%               A191 : none
%               A192 : yes, registered under the customer's name
%
% Attribute 20: (qualitative)   Foreign worker
%               A201 : yes
%               A202 : no
%
%
% 8. Cost Matrix
%
% This dataset requires use of a cost matrix (see below)
%
%
%            1        2
%   ----------------------------
%   1        0        1
%   ----------------------------
%   2        5        0
%
% (1 = Good, 2 = Bad)
%
% the rows represent the actual classification and the columns
% the predicted classification.
%
% It is worse to class a customer as good when they are bad (5),
% than it is to class a customer as bad when they are good (1).
%
%
%
%
%
% Relabeled values in attribute checking_status
%   From: A11   To: '<0'
%   From: A12   To: '0<=X<200'
%   From: A13   To: '>=200'
%   From: A14   To: 'no checking'
%
% Relabeled values in attribute credit_history
%   From: A30   To: 'no credits/all paid'
%   From: A31   To: 'all paid'
%   From: A32   To: 'existing paid'
%   From: A33   To: 'delayed previously'
%   From: A34   To: 'critical/other existing credit'
%
% Relabeled values in attribute purpose
%   From: A40   To: 'new car'
%   From: A41   To: 'used car'
%   From: A42   To: furniture/equipment
%   From: A43   To: radio/tv
%   From: A44   To: 'domestic appliance'
%   From: A45   To: repairs
%   From: A46   To: education
%   From: A47   To: vacation
%   From: A48   To: retraining
%   From: A49   To: business
%   From: A410  To: other
%
% Relabeled values in attribute savings_status
%   From: A61   To: '<100'
%   From: A62   To: '100<=X<500'
%   From: A63   To: '500<=X<1000'
%   From: A64   To: '>=1000'
%   From: A65   To: 'no known savings'
%
% Relabeled values in attribute employment
%   From: A71   To: unemployed
%   From: A72   To: '<1'
%   From: A73   To: '1<=X<4'
%   From: A74   To: '4<=X<7'
%   From: A75   To: '>=7'
%
% Relabeled values in attribute personal_status
%   From: A91   To: 'male div/sep'
%   From: A92   To: 'female div/dep/mar'
%   From: A93   To: 'male single'
%   From: A94   To: 'male mar/wid'
%   From: A95   To: 'female single'
%
% Relabeled values in attribute other_parties
%   From: A101  To: none
%   From: A102  To: 'co applicant'
%   From: A103  To: guarantor
%
% Relabeled values in attribute property_magnitude
%   From: A121  To: 'real estate'
%   From: A122  To: 'life insurance'
%   From: A123  To: car
%   From: A124  To: 'no known property'
%
% Relabeled values in attribute other_payment_plans
%   From: A141  To: bank
%   From: A142  To: stores
%   From: A143  To: none
%
% Relabeled values in attribute housing
%   From: A151  To: rent
%   From: A152  To: own
%   From: A153  To: 'for free'
%
% Relabeled values in attribute job
%   From: A171  To: 'unemp/unskilled non res'
%   From: A172  To: 'unskilled resident'
%   From: A173  To: skilled
%   From: A174  To: 'high qualif/self emp/mgmt'
%
% Relabeled values in attribute own_telephone
%   From: A191  To: none
%   From: A192  To: yes
%
% Relabeled values in attribute foreign_worker
%   From: A201  To: yes
%   From: A202  To: no
%
% Relabeled values in attribute class
%   From: 1     To: good
%   From: 2     To: bad
%
@relation german_credit
@attribute checking_status { '<0', '0<=X<200', '>=200', 'no checking'}
@attribute duration real
@attribute credit_history { 'no credits/all paid', 'all paid', 'existing paid', 'delayed previously', 'critical/other existing credit'}
@attribute purpose { 'new car', 'used car', furniture/equipment, radio/tv, 'domestic appliance', repairs, education, vacation, retraining, business, other}
@attribute credit_amount real
@attribute savings_status { '<100', '100<=X<500', '500<=X<1000', '>=1000', 'no known savings'}
@attribute employment { unemployed, '<1', '1<=X<4', '4<=X<7', '>=7'}
@attribute installment_commitment real
@attribute personal_status { 'male div/sep', 'female div/dep/mar', 'male single', 'male mar/wid', 'female single'}
@attribute other_parties { none, 'co applicant', guarantor}
@attribute residence_since real
@attribute property_magnitude { 'real estate', 'life insurance', car, 'no known property'}
@attribute age real
@attribute other_payment_plans { bank, stores, none}
@attribute housing { rent, own, 'for free'}
@attribute existing_credits real
@attribute job { 'unemp/unskilled non res', 'unskilled resident', skilled, 'high qualif/self emp/mgmt'}
@attribute num_dependents real
@attribute own_telephone { none, yes}
@attribute foreign_worker { yes, no}
@attribute class { good, bad}

@data
'<0',6,'critical/other existing credit',radio/tv,1169,'no known savings','>=7',4,'male single',none,4,'real estate',67,none,own,2,skilled,1,yes,yes,good
'0<=X<200',48,'existing paid',radio/tv,5951,'<100','1<=X<4',2,'female div/dep/mar',none,2,'real estate',22,none,own,1,skilled,1,none,yes,bad
'no checking',12,'critical/other existing credit',education,2096,'<100','4<=X<7',2,'male single',none,3,'real estate',49,none,own,1,'unskilled resident',2,none,yes,good
'<0',42,'existing paid',furniture/equipment,7882,'<100','4<=X<7',2,'male single',guarantor,4,'life insurance',45,none,'for free',1,skilled,2,none,yes,good
'<0',24,'delayed previously','new car',4870,'<100','1<=X<4',3,'male single',none,4,'no known property',53,none,'for free',2,skilled,2,none,yes,bad
'no checking',36,'existing paid',education,9055,'no known savings','1<=X<4',2,'male single',none,4,'no known property',35,none,'for free',1,'unskilled resident',2,yes,yes,good
3.1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
From the German Credit Assessment case study given to us, the following attributes are found to be applicable
for credit-risk assessment:
Total valid attributes:
1. checking_status
2. duration
3. credit_history
4. purpose
5. credit_amount
6. savings_status
7. employment duration
8. installment rate
9. personal_status
10. debtors
11. residence_since
12. property
13. age
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
Categorical or nominal attributes (which take values such as yes/no, true/false, etc.):
1. checking_status
2. credit_history
3. purpose
4. savings_status
5. employment
6. personal_status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign_worker
Real-valued attributes:
1. duration
2. credit_amount
3. installment rate (installment_commitment)
4. residence_since
5. age
6. existing_credits
7. num_dependents
3.2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple
rules in plain English using your selected attributes.
The following attributes may be crucial in making the credit risk assessment:
1. credit_history
2. employment
3. property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing_credits
Based on the above attributes, we can make a decision whether to give credit or not.
3.3. One type of model that you can create is a decision tree - train a decision tree using the complete dataset
as the training data. Report the model obtained after training.
A decision tree is a flowchart-like tree structure where each internal (non-leaf) node denotes a test on an
attribute, each branch represents an outcome of the test, and each leaf (terminal) node holds a class label.
Decision trees can be easily converted into classification rules, e.g. ID3, C4.5 and CART.
J48 pruned tree
1. Using the WEKA tool, we can generate a decision tree by selecting the Classify tab.
2. In the Classify tab select the Choose option, where a list of different decision trees is available. From
that list select J48.
3. Now under Test options, select the Use training set option.
4. The resulting window in WEKA is as follows:

5. To generate the decision tree, right-click on the result list and select the Visualize tree option, by which
the decision tree will be generated.

6. The decision tree obtained for credit risk assessment is too large to fit on the screen.

7. The decision tree above is unclear due to the large number of attributes.

3.4. Suppose you use your above model, trained on the complete dataset, and classify credit as good/bad for each
of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on
the training set.) Why do you think you cannot get 100% training accuracy?
In the above model we trained on the complete dataset and classified credit good/bad for each of the examples in
the dataset.
For example:
if purpose = vacation then credit = bad;
else if purpose = business then credit = good
In this way we classified each of the examples in the dataset. We classified 85.5% of the examples correctly and
the remaining 14.5% incorrectly. We can't get 100% training accuracy because, out of the 20 attributes, some
unnecessary attributes are also analyzed and trained on. Because of this the accuracy is affected, and hence we
can't get 100% training accuracy.
3.5. Is testing on the training set as you did above a good idea? Why or why not?
As a rule of thumb, for maximum accuracy we should take 2/3 of the dataset as the training set and the remaining
1/3 as the test set. But in the above model we took the complete dataset as the training set, which results in
only 85.5% accuracy.
This happens because unnecessary attributes, which do not play a crucial role in credit risk assessment, are
also analyzed and trained on; the complexity increases and the accuracy suffers. If part of the dataset is used
as a training set and the remainder as a test set, it leads to more reliable accuracy estimates and the
computation time is smaller. This is why we prefer not to take the complete dataset as the training set.
Use Training Set Result for the table GermanCreditData:

Correctly Classified Instances        855        85.5   %
Incorrectly Classified Instances      145        14.5   %
Kappa statistic                         0.6251
Mean absolute error                     0.2312
Root mean squared error                 0.34
Relative absolute error                55.0377 %
Root relative squared error            74.2015 %
Total Number of Instances            1000
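The gap between testing on the training set and a proper estimate can be reproduced with the WEKA Java API.
The following is a minimal sketch (the file name german_credit.arff is an assumption) that evaluates J48 on its
own training data and then with 10-fold cross-validation, the approach discussed in 3.6:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainVsCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("german_credit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(data);                       // train on the complete data set

        Evaluation trainEval = new Evaluation(data);
        trainEval.evaluateModel(tree, data);              // test on the same training data (optimistic)
        System.out.println("Training-set accuracy : " + trainEval.pctCorrect() + " %");

        Evaluation cvEval = new Evaluation(data);
        cvEval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.println("10-fold CV accuracy   : " + cvEval.pctCorrect() + " %");
    }
}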

3.6. One approach for solving the problem encountered in the previous question is using cross-validation.
Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your
results. Does your accuracy increase/decrease? Why?
Cross-validation:

In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or folds
D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are performed k times. In iteration
i, partition Di is reserved as the test set and the remaining partitions are collectively used to train the
model. That is, in the first iteration subsets D2, D3, ..., Dk collectively serve as the training set to obtain
the first model, which is tested on D1; the second iteration is trained on subsets D1, D3, ..., Dk and tested on
D2; and so on.

1. Select the Classify tab and the J48 decision tree, and in the Test options select the Cross-validation radio
button with the number of folds set to 10.
2. The number of folds indicates the number of partitions of the data set.
3. A kappa statistic nearing 1 indicates close to 100% accuracy, and hence all the errors would be zeroed out;
but in reality there is no such training set that gives 100% accuracy.
Cross Validation Result at folds: 10 for the table GermanCreditData:

Correctly Classified Instances        705        70.5   %
Incorrectly Classified Instances      295        29.5   %
Kappa statistic                         0.2467
Mean absolute error                     0.3467
Root mean squared error                 0.4796
Relative absolute error                82.5233 %
Root relative squared error           104.6565 %
Total Number of Instances            1000

Here there are 1000 instances with 100 instances per partition.

Cross Validation Result at folds: 20 for the table GermanCreditData:

Correctly Classified Instances        698        69.8   %
Incorrectly Classified Instances      302        30.2   %
Kappa statistic                         0.2264
Mean absolute error                     0.3571
Root mean squared error                 0.4883
Relative absolute error                85.0006 %
Root relative squared error           106.5538 %
Total Number of Instances            1000

Cross Validation Result at folds: 50 for the table GermanCreditData:

Correctly Classified Instances        709        70.9   %
Incorrectly Classified Instances      291        29.1   %
Kappa statistic                         0.2538
Mean absolute error                     0.3484
Root mean squared error                 0.4825
Relative absolute error                82.9304 %
Root relative squared error           105.2826 %
Total Number of Instances            1000

Cross Validation Result at folds: 100 for the table GermanCreditData:

Correctly Classified Instances        710        71     %
Incorrectly Classified Instances      290        29     %
Kappa statistic                         0.2587
Mean absolute error                     0.3444
Root mean squared error                 0.4771
Relative absolute error                81.959  %
Root relative squared error           104.1164 %
Total Number of Instances            1000

Percentage split does not allow 100%; it allows only up to 99.9%.

Percentage Split Result at 50%:

Correctly Classified Instances        362        72.4   %
Incorrectly Classified Instances      138        27.6   %
Kappa statistic                         0.2725
Mean absolute error                     0.3225
Root mean squared error                 0.4764
Relative absolute error                76.3523 %
Root relative squared error           106.4373 %
Total Number of Instances             500

Percentage Split Result at 99.9%:

Correctly Classified Instances
Incorrectly Classified Instances
Kappa statistic
Mean absolute error                     0.6667
Root mean squared error                 0.6667
Relative absolute error               221.7054 %
Root relative squared error           221.7054 %
Total Number of Instances             100

3.7. Check to see if the data shows a bias against "foreign workers" (attribute 20) or "personal status"
(attribute 9). One way to do this (perhaps rather simple-minded) is to remove these attributes from the dataset
and see if the decision tree created in those cases is significantly different from the full-dataset case which
you have already done. To remove an attribute you can use the Preprocess tab in Weka's GUI Explorer. Did
removing these attributes have any significant effect? Discuss.

This increases the accuracy, because the two attributes foreign_worker and personal_status are not very
important for training and analysis.
By removing them, the training time is reduced to some extent and the accuracy increases.
The decision tree created on the full dataset is very large compared to the decision tree trained now. This is
the main difference between the two decision trees.

After foreign_worker is removed, the accuracy increases to 85.9%.

If we also remove the 9th attribute, the accuracy increases further to 86.6%, which shows that these two
attributes are not significant for training.

Cross-validation after removing the 9th attribute.

Percentage split after removing the 9th attribute.

After removing the 20th attribute, the cross-validation result is as above.

After removing the 20th attribute, the percentage split result is as above.
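A minimal sketch of the experiment in 3.7 with the WEKA Java API, using the Remove filter instead of the
Preprocess tab (the file name german_credit.arff is an assumption):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("german_credit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Remove remove = new Remove();
        remove.setAttributeIndices("9,20");          // personal_status and foreign_worker
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);
        reduced.setClassIndex(reduced.numAttributes() - 1);

        Evaluation eval = new Evaluation(reduced);
        eval.crossValidateModel(new J48(), reduced, 10, new Random(1));
        System.out.println("Accuracy without attributes 9 and 20: " + eval.pctCorrect() + " %");
    }
}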

3.8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a
few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute,
naturally). Try out some combinations. (You removed two attributes in problem 3.7; remember to reload the ARFF
data file to get all the attributes back before you start selecting the ones you want.)
Select attributes 2, 3, 5, 7, 10, 17 and 21 and click on Invert to remove the remaining attributes.

Here the accuracy is decreased.

Select random attributes and then check the accuracy.

After removing attributes 1, 4, 6, 8, 9, 11, 12, 13, 14, 15, 16, 18, 19 and 20, we select the remaining
attributes and visualize them.

After we remove 14 attributes, the accuracy drops to 76.4%; hence we can try further random combinations of
attributes to increase the accuracy.

Cross-validation

Percentage split

3.9. Sometimes the cost of rejecting an applicant who actually has good credit (case 1) might be higher than
accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications equally in both
cases, give a higher cost to the first case (say cost 5) and a lower cost to the second case. You can do this by
using a cost matrix in Weka. Train your decision tree again and report the decision tree and cross-validation
results. Are they significantly different from the results obtained in problem 3.6 (using equal cost)?
In problem 3.6 we used equal cost and trained the decision tree. Here we consider two cases with different
costs: cost 5 in case 1 and cost 2 in case 2.
When we give such costs in both cases and train the decision tree again, we observe that the result is almost
equal to that of the decision tree obtained in problem 3.6.

                  Case 1 (cost 5)    Case 2 (cost 2)
Total cost             3820               1705
Average cost              3.82               1.705

We don't find this cost factor in problem 3.6, as there we used equal cost. This is the major difference between
the results of problem 3.6 and problem 3.9.
The cost matrices we used here:
Case 1:   5  1
          1  5
Case 2:   2  1
          1  2

1. Select the Classify tab.
2. Select More options... under Test options.
3. Tick Cost-sensitive evaluation and click Set.
4. Set the number of classes as 2.
5. Click on Resize and then we'll get the cost matrix.
6. Then change the 2nd entry in the 1st row and the 2nd entry in the 1st column to 5.0.
7. Then the confusion matrix will be generated and you can find out the difference between the good and bad
classes.
8. Check whether the accuracy changes or not.

3.10. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees?
How does the complexity of a decision tree relate to the bias of the model?
When we consider long, complex decision trees, we will have many unnecessary attributes in the tree, which
increases the bias of the model. Because of this, the accuracy of the model can also be affected.
This problem can be reduced by considering a simple decision tree: the attributes will be fewer and the bias of
the model decreases, so the result will be more accurate.
So it is a good idea to prefer simple decision trees instead of long, complex trees.
1. Open any existing ARFF file, e.g. labour.arff.
2. In the Preprocess tab, select All to select all the attributes.
3. Go to the Classify tab and then use the training set with the J48 algorithm.

4. To generate the decision tree, right-click on the result list and select the Visualize tree option, by which
the decision tree will be generated.

5. Right-click on the J48 algorithm to get the Generic Object Editor window.

6. In this, set the unpruned option to true.
7. Then press OK and then Start. We find the tree becomes more complex if not pruned.

Visualize tree

8. The tree has become more complex.

3.11. You can make your decision trees simpler by pruning the nodes. One approach is to use reduced-error
pruning - explain this idea briefly. Try reduced-error pruning for training your decision trees using
cross-validation (you can do this in Weka) and report the decision tree you obtain. Also report your accuracy
using the pruned model. Does your accuracy increase?
Reduced-error pruning:
The idea of using a separate pruning set for pruning, which is applicable to decision trees as well as rule
sets, is called reduced-error pruning. The variant described previously prunes a rule immediately after it has
been grown and is called incremental reduced-error pruning.
Another possibility is to build a full, unpruned rule set first and prune it afterwards by discarding individual
tests. However, this method is much slower. Of course, there are many different ways to assess the worth of a
rule based on the pruning set. A simple measure is to consider how well the rule would do at discriminating the
predicted class from other classes if it were the only rule in the theory, operating under the closed-world
assumption.
If it gets p instances right out of the t instances that it covers, and there are P instances of this class out
of a total of T instances altogether, then it gets p positive instances right. The instances that it does not
cover include N - n negative ones, where n = t - p is the number of negative instances that the rule covers and
N = T - P is the total number of negative instances.

Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated on the test set,
has been used to evaluate the success of a rule when using reduced-error pruning.
1. Right-click on the J48 algorithm to get the Generic Object Editor window.
2. In this, set the reducedErrorPruning option to true (and leave the unpruned option as false).
3. Then press OK and then Start.

4. We find that the accuracy has increased by selecting the reduced-error pruning option.
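A minimal sketch of 3.11 with the WEKA Java API (german_credit.arff assumed): the default pruned tree, the
unpruned tree and the reduced-error-pruned tree are compared under 10-fold cross-validation.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ReducedErrorPruning {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("german_credit.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 defaultTree = new J48();                      // default pruning

        J48 unpruned = new J48();
        unpruned.setUnpruned(true);                       // no pruning at all

        J48 repTree = new J48();
        repTree.setReducedErrorPruning(true);             // prune on a held-out subset

        J48[] trees = { defaultTree, unpruned, repTree };
        String[] labels = { "default pruning", "unpruned", "reduced-error pruning" };
        for (int i = 0; i < trees.length; i++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(trees[i], data, 10, new Random(1));
            System.out.println(labels[i] + " : " + eval.pctCorrect() + " %");
        }
    }
}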
3.12. (Extra Credit): How can you convert a decision tree into "if-then-else" rules? Make up your own small
decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist classifiers that
output the model directly in the form of rules - one such classifier in Weka is rules.PART; train this model and
report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision, yes,
just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single
attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by
training a OneR classifier. Rank the performance of J48, PART and OneR.
In Weka, rules.PART is one of the classifiers that converts decision trees into IF-THEN-ELSE rules.
Converting decision trees into IF-THEN-ELSE rules using the rules.PART classifier:

PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4

Yes, sometimes just one attribute can be good enough to make the decision.
In this dataset (Weather), the single attribute for making the decision is outlook:
outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)
With respect to time, the OneR classifier has the highest ranking, J48 is in 2nd place and PART gets 3rd place.

              J48     PART    OneR
Time (sec)    0.12    0.14    0.04
Rank          II      III     I

But if you consider the accuracy, the J48 classifier has the highest ranking, PART gets second place and OneR
gets last place.

              J48     PART    OneR
Accuracy (%)  70.5    70.2    66.8
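A minimal sketch of 3.12 with the WEKA Java API (weather.nominal.arff assumed): J48, rules.PART and rules.OneR
are trained on the same data and their trees/rules are printed, so the three models can be ranked as in the
tables above.

import weka.classifiers.Classifier;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.nominal.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new J48(), new PART(), new OneR() };
        for (Classifier model : models) {
            model.buildClassifier(data);
            System.out.println("=== " + model.getClass().getSimpleName() + " ===");
            System.out.println(model);   // J48 prints its tree, PART and OneR print their rules
        }
    }
}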
1. Open an existing file such as weather.nominal.arff.
2. Select All.
3. Go to Classify.
4. Start.

Here the accuracy is 100%.

The tree corresponds to the following if-then-else rules:

If outlook = overcast then play = yes
If outlook = sunny and humidity = high then play = no, else play = yes
If outlook = rainy and windy = true then play = no, else play = yes

To check the rules:
