
IT Business Intelligence

Data Mining using WEKA

Submitted by

Rahul Aggarwal
MBA 2nd year, Vinod Gupta School of Management, IIT Kharagpur (10BM60065)

Regression and Attribute selection using Weka

Introduction to WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is a collection of machine learning algorithms for data mining tasks; the algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. It works only with flat files. Weka is free software available under the GNU General Public License.

PART-1: Linear Regression using WEKA


In statistics, regression analysis covers many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. In linear regression, the function is a linear (straight-line) equation. For example, if we assume that the value of an automobile decreases by a constant amount for each year after its purchase and for each mile it is driven, the following linear function predicts its value (the dependent variable, on the left side of the equals sign) from the two independent variables, age and miles:

value = price + depage*age + depmiles*miles

where
value, the dependent variable, is the value of the car,
price is the value of the car when new (age = 0 and miles = 0),
age is the age of the car in years,
miles is the number of miles that the car has been driven,
depage is the depreciation that takes place each year, and
depmiles is the depreciation for each mile driven.

Because the value falls with age and mileage, depage and depmiles are negative numbers.
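As a minimal sketch of how this linear function is evaluated, the snippet below plugs illustrative numbers into the equation. The function name car_value and all figures (a purchase price of 20000, 1500 lost per year, 0.05 lost per mile) are assumptions made up for this example; they do not come from any dataset in this report.

```python
def car_value(price, depage, depmiles, age, miles):
    """Evaluate value = price + depage*age + depmiles*miles."""
    return price + depage * age + depmiles * miles


# Hypothetical inputs, chosen only to illustrate the arithmetic.
value = car_value(
    price=20000.0,    # value of the car when new
    depage=-1500.0,   # change in value per year of age (negative: depreciation)
    depmiles=-0.05,   # change in value per mile driven (negative: depreciation)
    age=3,
    miles=40000,
)
print(f"Predicted value: {value:.2f}")
```

In a regression setting the three numbers price, depage and depmiles are not chosen by hand; they are the coefficients that the regression estimates from observed data.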

Problem Statement:
ABCD is a car manufacturing company that is considering discontinuing production of its 6-year-old household car, XERINA. Before doing so, it wants to find out which factors the resale value of the car depends on. XERINA comes in 3 models:
Model-1: Premium segment
Model-2: Middle segment
Model-3: Economical segment
The company collects data from 30 randomly chosen households, asking for the current price of their car (insured value), its model and its age. It then uses WEKA to find the correlation between {Age & Model} and {Price} of the car.

Rahul Aggarwal, VGSoM, IIT Kharagpur


Solution: Steps to be followed:


Step-1: Collection of Data

Used Car price   Age   Model
125              6     1
115              6     2
130              6     1
160              4     2
219              2     2
150              5     2
190              4     1
163              5     1
280              1     1
260              2     1
200              3     2
100              6     3
105              6     3
190              3     3
219              2     2
150              5     2
190              4     1
140              5     3
270              1     2
260              2     1
118              6     2
210              3     1
122              6     1
160              4     2
219              2     2
150              5     2
185              3     3
129              5     3
260              1     3
214              3     1

In the above table:
Age = age of the car in years, i.e. the car was manufactured that many years ago
Model = model of the car (1, 2 or 3, as listed in the problem statement)
Save the table as a file in .csv format.
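The expected file layout can be sketched as follows, assuming the first row of the .csv file holds the column names (WEKA's CSV loader reads attribute names from the header row). Only the first three records of the table are written here; the remaining rows follow the same pattern. Writing to an in-memory buffer instead of a file on disk is a choice made for this illustration only.

```python
import csv
import io

header = ["Used Car price", "Age", "Model"]
# First three records of the table; the other 27 rows are added the same way.
rows = [
    (125, 6, 1),
    (115, 6, 2),
    (130, 6, 1),
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)   # header row with the attribute names
writer.writerows(rows)    # one line per car

csv_text = buffer.getvalue()
print(csv_text)
```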


Step-2: Open the Explorer in WEKA.

The WEKA Explorer window will appear.


Step-3: In the Preprocess tab, click Open file and select your .csv file.

The file is loaded and its attributes are listed in the Preprocess tab.


Step-4: Go to the Classify tab and choose LinearRegression as the classifier.

Then select Use training set under Test options. Also select the dependent variable, i.e. Used Car price, from the dropdown below the test options.


Step-5: After clicking Start, the following regression model output is generated:

Blue circle: the regression equation is

Used Car price = 319.5 + Age * (-29.4) + Model * (-14)

Red circle: the correlation coefficient is 0.9882, which is very close to 1. This means the prices predicted by the model are highly correlated with the actual prices, i.e. Age and Model together explain most of the variation in the used car price.
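The reported figures can be cross-checked outside WEKA with an ordinary least-squares fit on the 30 records from Step-1. The sketch below is not WEKA's implementation: it solves the two-predictor normal equations directly, whereas WEKA's LinearRegression also applies its own attribute selection and a small ridge term by default, so small differences in the last digits are to be expected. All variable and function names are chosen for this example.

```python
from math import sqrt

# The 30 records from Step-1, one list per column.
price = [125, 115, 130, 160, 219, 150, 190, 163, 280, 260,
         200, 100, 105, 190, 219, 150, 190, 140, 270, 260,
         118, 210, 122, 160, 219, 150, 185, 129, 260, 214]
age = [6, 6, 6, 4, 2, 5, 4, 5, 1, 2,
       3, 6, 6, 3, 2, 5, 4, 5, 1, 2,
       6, 3, 6, 4, 2, 5, 3, 5, 1, 3]
model = [1, 2, 1, 2, 2, 2, 1, 1, 1, 1,
         2, 3, 3, 3, 2, 2, 1, 3, 2, 1,
         2, 1, 1, 2, 2, 2, 3, 3, 3, 1]


def mean(values):
    return sum(values) / len(values)


def centered_sum(u, v):
    """Sum of (u_i - mean(u)) * (v_i - mean(v))."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


# Normal equations for two predictors, written in centered form.
s_aa = centered_sum(age, age)
s_mm = centered_sum(model, model)
s_am = centered_sum(age, model)
s_ap = centered_sum(age, price)
s_mp = centered_sum(model, price)

det = s_aa * s_mm - s_am ** 2
b_age = (s_ap * s_mm - s_am * s_mp) / det
b_model = (s_aa * s_mp - s_am * s_ap) / det
intercept = mean(price) - b_age * mean(age) - b_model * mean(model)

# Correlation between predicted and actual prices on the training data.
predicted = [intercept + b_age * a + b_model * m for a, m in zip(age, model)]
r = centered_sum(predicted, price) / sqrt(
    centered_sum(predicted, predicted) * centered_sum(price, price)
)

print(f"price = {intercept:.1f} + ({b_age:.2f}) * Age + ({b_model:.2f}) * Model")
print(f"correlation coefficient = {r:.4f}")
```

Because the model is evaluated on the same records it was fitted to (the Use training set option), the correlation coefficient measures goodness of fit rather than performance on unseen cars.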


PART-2: Attribute Selection using WEKA


Attribute selection is used to investigate which attributes (or subsets of attributes) are the most predictive ones. An attribute selection method has two parts:
A search method: best-first, forward selection, random, exhaustive, genetic algorithm, ranking
An evaluation method: correlation-based, wrapper, information gain, chi-squared, etc.
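To make the split between the two parts concrete, here is a simplified sketch that pairs a forward-selection search with a correlation-based evaluation. The merit function follows the general idea of correlation-based feature selection (reward attributes that correlate with the target, penalize attributes that correlate with each other), but it uses plain Pearson correlation and is not WEKA's implementation. The function names and the tiny dataset are hypothetical, invented for this illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def merit(subset, columns, target):
    """Evaluation method: k * r_cf / sqrt(k + k*(k-1) * r_ff)."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Average attribute-to-target correlation.
    r_cf = sum(abs(pearson(columns[a], target)) for a in subset) / k
    # Average attribute-to-attribute correlation (redundancy).
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def forward_select(columns, target):
    """Search method: greedily add the attribute that raises the merit most."""
    selected, best = [], 0.0
    remaining = list(columns)
    while remaining:
        score, name = max((merit(selected + [n], columns, target), n) for n in remaining)
        if score <= best:
            break  # no remaining attribute improves the subset
        selected.append(name)
        remaining.remove(name)
        best = score
    return selected


# Hypothetical toy data: one informative attribute and one noisy attribute.
toy_columns = {
    "useful": [1, 2, 3, 4, 5, 6],
    "noise": [1, -1, 1, -1, 1, -1],
}
toy_target = [2, 4, 6, 8, 10, 12]
print(forward_select(toy_columns, toy_target))
```

Swapping either part changes the method: the same merit function could be driven by an exhaustive search, or the same forward search could use a wrapper or information-gain evaluation instead.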

Problem Statement:
Our objective is to find the important factors that affect the probability of a person defaulting on a bank loan. For this, data is collected by circulating a questionnaire of 9 questions to be filled in by the respondents. The questions are:
1. Age of the customer in years
2. Level of education
3. Number of years worked with the current employer
4. Number of years residing at the current address
5. Household income of the individual, in thousands
6. Debt to income ratio (x100)
7. Credit card debt in thousands
8. Other debt in thousands
9. Previously defaulted on a loan

Solution: Steps to be followed


Step-1: A sample of 850 respondents has been collected. The first few records are:

Age   Education   Employ   address   Income   Debt-to-Income ratio   creddebt    othdebt     default
41    3           17       12        176      9.3                    11.35939    5.008608    1
27    1           10       6         31       17.3                   1.362202    4.000798    0
40    1           15       14        55       5.5                    0.856075    2.168925    0
41    1           15       14        120      2.9                    2.65872     0.82128     0
24    2           2        0         28       17.3                   1.787436    3.056564    1
41    2           5        5         25       10.2                   0.3927      2.1573      0
39    1           20       9         67       30.6                   3.833874    16.66813    0
43    1           12       11        38       3.6                    0.128592    1.239408    0
. . .

Here, for the default column: 1 = Yes, 0 = No. Again, save the Excel sheet in .csv format.
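The sample records also suggest how the Debt-to-Income ratio relates to the other columns: in each of the eight rows shown, it equals 100 * (creddebt + othdebt) / Income. This is an observation about the displayed rows, not something stated in the questionnaire, and it is assumed rather than verified for the remaining respondents. A short check, with names chosen for this example:

```python
# (Income, Debt-to-Income ratio, creddebt, othdebt) for the eight records shown.
records = [
    (176, 9.3, 11.35939, 5.008608),
    (31, 17.3, 1.362202, 4.000798),
    (55, 5.5, 0.856075, 2.168925),
    (120, 2.9, 2.65872, 0.82128),
    (28, 17.3, 1.787436, 3.056564),
    (25, 10.2, 0.3927, 2.1573),
    (67, 30.6, 3.833874, 16.66813),
    (38, 3.6, 0.128592, 1.239408),
]


def debt_to_income(income, creddebt, othdebt):
    """Total debt as a percentage of income (all amounts in thousands)."""
    return 100.0 * (creddebt + othdebt) / income


recomputed = [debt_to_income(inc, cd, od) for inc, _, cd, od in records]
# Largest difference between the recomputed and the recorded ratio.
max_gap = max(abs(calc - rec[1]) for calc, rec in zip(recomputed, records))
print(f"largest gap between recomputed and recorded ratio: {max_gap:.6f}")
```

If the relationship holds for the whole sample, the ratio carries information that overlaps with Income, creddebt and othdebt, which is worth keeping in mind when reading the attribute selection result.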


Step-2: Open the WEKA Explorer and open that .csv file in the Preprocess tab.

Step-3: Go to the Select attributes tab.


Step-4: After clicking Start, the attribute selection output is generated.

The output shows that the probability of defaulting depends on the following 2 variables:
1. Employ (number of years worked with the current employer)
2. Debt to Income ratio
Hence the above 2 attributes are selected.

-------------------------------------------------------- THE END -------------------------------------------------------
