Submitted by
Rahul Aggarwal
MBA 2nd year Vinod Gupta School Of Management IIT Kharagpur (10BM60065)
Introduction to WEKA
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. It is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes. Weka expects the data to be available as a single flat file or relation. Weka is free software available under the GNU General Public License.
Problem Statement:
ABCD is a car manufacturing company that is considering discontinuing its 6-year-old household car, XERINA. Before doing so, it wants to find out which factors the resale value of the car depends on. XERINA comes in 3 models:
Model-1: Premium segment
Model-2: Middle segment
Model-3: Economical segment
The company collects data from a random sample of 30 households, recording the current price of each car (its insured value), its model and its age. It then uses WEKA to examine the relationship between {Age & Model} and {Price} of the car.
Age 6 6 6 4 2 5 4 5 1 2 3 6 6 3 2 5 4 5 1 2 6 3 6 4 2 5 3 5 1 3
Model 1 2 1 2 2 2 1 1 1 1 2 3 3 3 2 2 1 3 2 1 2 1 1 2 2 2 3 3 3 1
In the above table:
Age = age of the car in years, i.e. the car was manufactured n years ago
Model = model of the car (1, 2 or 3)
Save the above table as a .csv file.
Step-3: In the Preprocess tab, click Open file and select your .csv file.
Then select Use training set under Test options. Also select the dependent variable, i.e. Used car Price, from the dropdown indicated below:
Step-5: After clicking Start, the following regression model output is generated:
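To see what a regression of this form involves outside the GUI, here is a minimal sketch of an ordinary least-squares fit with two predictors. It is not Weka's own implementation, which has further options such as attribute selection and a ridge parameter. The class name TwoPredictorOls and the six data rows are hypothetical; the rows were generated from price = 300 - 30*age + 10*model purely for illustration and are not the survey data.

```java
/** Minimal ordinary least squares for price = b0 + b1*age + b2*model. */
final class TwoPredictorOls {

    /** Returns {b0, b1, b2} that minimise the sum of squared errors. */
    static double[] fit(double[] age, double[] model, double[] price) {
        int n = price.length;
        // Normal equations (X^T X) b = X^T y for the columns [1, age, model],
        // stored as a 3x4 augmented matrix.
        double[][] a = new double[3][4];
        for (int i = 0; i < n; i++) {
            double[] row = {1.0, age[i], model[i]};
            for (int r = 0; r < 3; r++) {
                for (int c = 0; c < 3; c++) {
                    a[r][c] += row[r] * row[c];
                }
                a[r][3] += row[r] * price[i];
            }
        }
        return solve(a);
    }

    /** Gauss-Jordan elimination with partial pivoting. */
    private static double[] solve(double[][] a) {
        int m = a.length;
        for (int col = 0; col < m; col++) {
            int pivot = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException("predictors are collinear");
            }
            for (int r = 0; r < m; r++) {
                if (r == col) {
                    continue;
                }
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= m; c++) {
                    a[r][c] -= f * a[col][c];
                }
            }
        }
        double[] b = new double[m];
        for (int r = 0; r < m; r++) {
            b[r] = a[r][m] / a[r][r];
        }
        return b;
    }

    public static void main(String[] args) {
        // Hypothetical rows generated from price = 300 - 30*age + 10*model.
        double[] age = {1, 2, 3, 4, 5, 6};
        double[] model = {1, 3, 2, 1, 3, 2};
        double[] price = {280, 270, 230, 190, 180, 140};
        double[] b = fit(age, model, price);
        System.out.printf("intercept=%.2f age=%.2f model=%.2f%n", b[0], b[1], b[2]);
    }
}
```

Because the hypothetical prices lie exactly on a plane, the fit should recover the generating coefficients up to rounding error; real survey data would leave residuals.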
Blue circle: As you can see, the regression equation is:
Used Car Price = 319.5 + Age * (-29.4) + Model * 14
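The equation can be used to estimate a price for a given car. The sketch below assumes the three numbers are an intercept (319.5), an Age coefficient (-29.4) and a Model coefficient (14) that are added together; the class and method names are hypothetical, and the price is in whatever unit the survey used.

```java
/** Evaluates the fitted equation for a given age and model. */
final class UsedCarPriceEquation {
    // Coefficients as read from the regression output in the text.
    // Treating them as summed terms is an assumption.
    static final double INTERCEPT = 319.5;
    static final double AGE_COEFF = -29.4;
    static final double MODEL_COEFF = 14.0;

    /** Estimated price for a car of the given age (years) and model (1, 2 or 3). */
    static double predict(int ageYears, int model) {
        return INTERCEPT + AGE_COEFF * ageYears + MODEL_COEFF * model;
    }

    public static void main(String[] args) {
        // A 6-year-old Model-1 car.
        System.out.println(predict(6, 1));
    }
}
```

Under this reading, each additional year of age lowers the estimated price by 29.4, with the model held fixed.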
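Red circle: The correlation coefficient is 0.9882, which is very close to 1. This means the prices predicted by the model track the actual prices very closely, i.e. Age and Model together explain the price well. Because the model was evaluated on the training set itself, this figure is likely to be optimistic.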
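The correlation coefficient in the regression summary is understood here to be the Pearson correlation between the actual and the predicted prices. A minimal sketch of that calculation follows; the class name and the sample values are hypothetical and are not taken from the survey.

```java
/** Pearson correlation between actual and predicted values. */
final class CorrelationSketch {

    static double pearson(double[] actual, double[] predicted) {
        if (actual.length != predicted.length || actual.length < 2) {
            throw new IllegalArgumentException("need two arrays of equal length >= 2");
        }
        int n = actual.length;
        double meanA = 0.0;
        double meanP = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += actual[i];
            meanP += predicted[i];
        }
        meanA /= n;
        meanP /= n;

        // Accumulate the covariance and the two variances (unnormalised).
        double cov = 0.0;
        double varA = 0.0;
        double varP = 0.0;
        for (int i = 0; i < n; i++) {
            double da = actual[i] - meanA;
            double dp = predicted[i] - meanP;
            cov += da * dp;
            varA += da * da;
            varP += dp * dp;
        }
        return cov / Math.sqrt(varA * varP);
    }

    public static void main(String[] args) {
        // Hypothetical actual and predicted prices.
        double[] actual = {280, 270, 230, 190, 180, 140};
        double[] predicted = {285, 262, 233, 195, 172, 143};
        System.out.println(pearson(actual, predicted));
    }
}
```

A value near 1 means the predictions rise and fall with the actual values; it does not by itself show that the predictions are close in absolute terms.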
Problem Statement:
Our objective is to find out the important factors that affect the probability that a customer defaults on a bank loan. For this, data is collected by circulating a questionnaire of 9 questions to be filled in by the respondents. The questions are:
1. Age of the customer in years
2. Level of education
3. Number of years worked with the current employer
4. Number of years residing at the current address
5. Household income of the individual, in thousands
6. Debt to income ratio (x100)
7. Credit card debt, in thousands
8. Other debt, in thousands
9. Previously defaulted on a loan
Age 41 27 40 41 24 41 39 43 . . .
Education 3 1 1 1 2 2 1 1
Employ 17 10 15 15 2 5 20 12
address 12 6 14 14 0 5 9 11
default 1 0 0 0 1 0 0 0
Here, for the Default column:
1 = Yes
0 = No
Again, save the Excel sheet in .csv format.
Step-2: Open Weka Explorer and open that .csv file. The following window will open:
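As a quick check of the exported file before loading it into Weka, the sketch below reads CSV text and reports the share of respondents who defaulted. It assumes the default column is the last one and is coded 1 for yes and 0 for no, as the sample rows suggest. The class name and the lower-case header names are hypothetical, and only the five columns and eight rows visible in the table above are used.

```java
/** Reads CSV text and computes the fraction of rows flagged as default. */
final class DefaultCsvSketch {

    /** Fraction of data rows whose last column equals 1. */
    static double defaultRate(String csv) {
        String[] lines = csv.trim().split("\\R");
        int total = 0;
        int defaults = 0;
        // Line 0 is the header row, so start at 1.
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) {
                continue;
            }
            String[] cells = lines[i].split(",");
            int flag = Integer.parseInt(cells[cells.length - 1].trim());
            if (flag != 0 && flag != 1) {
                throw new IllegalArgumentException("default must be 0 or 1, line " + (i + 1));
            }
            total++;
            defaults += flag;
        }
        if (total == 0) {
            throw new IllegalArgumentException("no data rows");
        }
        return (double) defaults / total;
    }

    public static void main(String[] args) {
        // The eight sample rows shown in the table; remaining columns omitted.
        String csv = "age,education,employ,address,default\n"
                + "41,3,17,12,1\n"
                + "27,1,10,6,0\n"
                + "40,1,15,14,0\n"
                + "41,1,15,14,0\n"
                + "24,2,2,0,1\n"
                + "41,2,5,5,0\n"
                + "39,1,20,9,0\n"
                + "43,1,12,11,0\n";
        System.out.println(defaultRate(csv));
    }
}
```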
The output shows that the probability of defaulting depends on the following 2 variables:
1. Employed
2. Debt to Income ratio
Hence the above 2 attributes are selected.
-------------------------------------------------------- THE END -------------------------------------------------------