
Study on WEKA Tool for Classification

A Report Submitted in Partial Fulfillment of the Requirements for the


Degree of Master of Computer Application

Submitted To:

Submitted By:

Dr. Neeraj Bhargava

Soniya Chandwani

Associate Professor & Head of Department


Department of Computer Science
School of Engineering and System Sciences.

Maharishi Dayanand Saraswati University, Ajmer.


(Rajasthan),India.
June 2015

MCA-LE II Sem

CERTIFICATE

This is to certify that the report entitled "Study on WEKA Tool for Classification", submitted by
Soniya Chandwani in partial fulfillment of the requirements for the award of the degree of Master of
Computer Application from the Department of Computer Science, School of Engineering
and System Sciences, M. D. S. University, Ajmer, is an authentic work carried out by her.

To the best of my knowledge, the matter embodied in the thesis has not been submitted to any
other University / Institute for the award of any Degree or Diploma.

Date:

Dr. Neeraj Bhargava


Associate Professor
Department of Computer Science,
School of Engineering and System Sciences,
M.D.S. University, Ajmer.

ABSTRACT

The development of data-mining applications such as classification and clustering has shown the
need for machine learning algorithms to be applied to large-scale data. In this report we present a
comparison of different classification techniques using the Waikato Environment for Knowledge
Analysis, or in short, WEKA. WEKA is open-source software consisting of a collection of
machine learning algorithms for data mining tasks. The aim of this report is to investigate the
performance of different classification and clustering methods on a large data set.

Classification is an important data mining technique with broad applications in every field of our
life: it assigns each item in a data set to one of a predefined set of classes or groups. In this report
we study various classification algorithms. The main aim of the thesis is to compare different
classification algorithms using WEKA and to find out which algorithm is most suitable for users
working on hematological data. With the proposed model, a new doctor or a patient can obtain
predicted comments on hematological data; a mobile app was also developed that can easily
produce diagnostic comments on hematological data.

ACKNOWLEDGEMENT
We express our sincere gratitude and indebtedness to Dr. Neeraj Bhargava, Associate Professor
and Head, Department of Computer Science, School of Engineering and System Sciences,
Maharishi Dayanand Saraswati University, Ajmer, for giving us the opportunity to work under
him and for extending every possible support at each stage of this project work. The level of
flexibility offered by him in implementing the project work is highly commendable.
We would also like to convey our deep regards to all the other faculty members of the
Department, who offered great effort and guidance at appropriate times, without which it would
have been very difficult on our part to finish the project.

Date : June, 2015

Soniya Chandwani

INDEX

CHAPTER      DESCRIPTION        PAGE NO
             CERTIFICATE        i
             ABSTRACT           ii
             ACKNOWLEDGEMENT
CHAPTER 1    Introduction
CHAPTER 2    Objective
CHAPTER 3    Methodology
CHAPTER 4    Result
             CONCLUSION         16
             REFERENCES         17

CHAPTER 1
INTRODUCTION
Data mining is the process of discovering patterns in data. The patterns discovered must
be meaningful in that they lead to some advantage. The overall goal of the data mining process is
to extract information from a data set and transform it into an understandable form in order to aid
user decision making.
Data mining is used in several domains such as banking, insurance, hospitals, and health
informatics. In health informatics, data mining plays a vital role in helping physicians
identify effective treatments and helping patients receive better and more affordable health services.
In the hematology laboratory, it has become a powerful tool for managing vast amounts of
laboratory information and for seeking the knowledge underlying that information.

Definition
Data mining is a technique or process used to extract meaningful information from a
database or data warehouse. In other words, data mining is mining knowledge from data:
"the non-trivial extraction of implicit, previously unknown, and potentially useful information
from data".
Data mining relies on a class of database applications that look for hidden patterns in a group of
data; those patterns can then be used to predict future behavior. For example, data mining
software can help retail companies find customers with common interests.
The phrase data mining is commonly misused to describe software that presents data in new
ways. True data mining software doesn't just change the presentation, but actually discovers
previously unknown relationships among the data.
Data mining is popular in the science and mathematical fields but also is utilized increasingly by
marketers trying to distill useful consumer data from Web sites.

WEKA (Waikato Environment for Knowledge Analysis)


WEKA is a data mining / machine learning application developed by the Department of Computer
Science, University of Waikato, New Zealand. It is open-source software written in Java and issued
under the GNU General Public License. WEKA is a collection of tools for data pre-processing,
classification, regression, clustering, association rules, and visualization, as well as a collection of
machine learning algorithms for data mining tasks, and it is well suited for developing new
machine learning schemes.
(The weka is also a flightless bird found only in New Zealand, hence the project's name.)
The key features responsible for WEKA's success are:

it provides many different algorithms for data mining and machine learning
it is open source and freely available
it is platform-independent
it is easily usable by people who are not data mining specialists
it provides flexible facilities for scripting experiments
it has kept up to date, with new algorithms being added as they appear in the research
literature.

Advantages of WEKA
Free availability: released under the GNU General Public License.
Portability: fully implemented in the Java programming language, so it runs on almost any
modern computing platform (Windows, Mac OS X, and Linux).
Comprehensive collection of data preprocessing and modeling techniques: supports the
standard data mining tasks of data preprocessing, clustering, classification, regression,
visualization, and feature selection.
Easy-to-use GUI.
Access to SQL databases: uses Java Database Connectivity (JDBC) and can process the
result returned by a database query.
Disadvantages of WEKA
Sequence modeling is not covered by the algorithms included in the WEKA distribution.
Not capable of multi-relational data mining.
Memory-bound.

The report demonstrates possibilities offered by the WEKA software to build classification models
for SAR (Structure-Activity Relationships) analysis. Two types of classification tasks will be
considered: two-class and multi-class classification. In all cases protein-ligand binding data will be
analyzed, with ligands exhibiting strong binding affinity towards a certain protein being considered
active with respect to it. If the binding affinity of a ligand towards the protein is not known,
the ligand is conventionally considered inactive. The goal of the classification models is then to
predict whether a new ligand will exhibit strong binding activity toward certain protein biotargets.
Ligands predicted to bind strongly might possess the corresponding type of biological activity and
could therefore be used as hits for drug design. All ligands in this tutorial are described by means
of an extended set of MACCS fingerprints, each comprising 1024 bits, where a bit is on if the
corresponding structural feature is present in the ligand and off otherwise.
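The fingerprint representation described above can be sketched in code. Below is a minimal Python illustration; the bit positions, feature assignments, and the Tanimoto similarity function are illustrative examples, not actual MACCS keys:

```python
# Sketch: a hypothetical structural fingerprint as a fixed-length bit vector.
# The bit positions here are invented for illustration, not real MACCS keys.

N_BITS = 1024

def make_fingerprint(on_bits):
    """Return a length-N_BITS list of 0/1 with the given positions set 'on'."""
    fp = [0] * N_BITS
    for b in on_bits:
        fp[b] = 1
    return fp

def tanimoto(fp1, fp2):
    """Tanimoto similarity: shared on-bits / total distinct on-bits."""
    both = sum(1 for a, b in zip(fp1, fp2) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(fp1, fp2) if a == 1 or b == 1)
    return both / either if either else 0.0

ligand_a = make_fingerprint([3, 17, 255, 900])
ligand_b = make_fingerprint([3, 17, 512])

print(tanimoto(ligand_a, ligand_b))  # 2 shared / 5 distinct = 0.4
```

Such bit vectors are what the classifiers below receive as input attributes, one attribute per bit.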
Building Classifiers
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning
schemes available in WEKA include decision trees and lists, instance-based classifiers, support
vector machines, multilayer perceptrons, logistic regression, and Bayes nets. Meta-classifiers
include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
Classification Methods

Three candidate classifiers are considered in this study: Decision Tree (J48), Naïve Bayes, and
Neural Network (Multilayer Perceptron).
1. J48 Algorithm
The J48 algorithm is an optimized, open-source implementation of the C4.5 algorithm. The
output of J48 is a decision tree: a tree structure with a root node, intermediate nodes, and leaf
nodes, where each internal node contains a decision that leads towards the result (hence the name
decision tree). A decision tree divides the input space of a data set into mutually exclusive
regions, each of which is assigned a label, a value, or an action describing its data points. A
splitting criterion is used to decide which attribute is best to split on for the portion of the
training data that reaches a particular node.
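The splitting criterion mentioned above can be illustrated with a small information-gain computation (C4.5/J48 actually uses the related gain-ratio measure); the toy dataset below is invented for illustration:

```python
# Sketch of an entropy-based splitting criterion, the idea behind C4.5/J48.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction from splitting on the attribute at attr_index."""
    base = entropy(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# attribute 0 perfectly predicts the class; attribute 1 is pure noise
rows   = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (no information)
```

The tree builder picks the attribute with the highest gain at each node, then recurses on the partitions.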
2. Multilayer Perceptron
A single-layer perceptron can only classify linearly separable problems; for non-separable
problems it is necessary to use more layers. A multilayer (feed-forward) network has one or more
hidden layers, whose neurons are called hidden neurons. Fig. 1 illustrates a multilayer
network with one input layer, one hidden layer, and one output layer.
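The separability point above can be sketched with a standard perceptron learning rule: a single-layer perceptron learns the linearly separable AND function, but no single-layer perceptron can represent XOR, which is why hidden layers are needed (the training loop below is a minimal textbook version, not WEKA's implementation):

```python
# Sketch: a single-layer perceptron learns AND (linearly separable)
# but cannot learn XOR. Integer learning rate keeps arithmetic exact.

def train_perceptron(samples, epochs=20, lr=1):
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = target - out
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def predict(w, b, x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w, b = train_perceptron(AND)
print([predict(w, b, x1, x2) for (x1, x2), _ in AND])  # [0, 0, 0, 1]

w, b = train_perceptron(XOR)
print("XOR learned:", all(predict(w, b, x1, x2) == t for (x1, x2), t in XOR))
```

However many epochs are run, the XOR check prints False: no line in the plane separates its classes, so the hidden layer of a multilayer perceptron is essential.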

3. Naive Bayes
NaiveBayes implements the probabilistic Naïve Bayes classifier. NaiveBayesSimple uses the
normal distribution to model numeric attributes. NaiveBayes can also use kernel density
estimators, which improve performance when the normality assumption is grossly incorrect; it
can also handle numeric attributes using supervised discretization. NaiveBayesUpdateable is an
incremental version that processes one instance at a time. It can use a kernel estimator but not
discretization.
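The normal-distribution modeling described above can be sketched as a minimal Gaussian naive Bayes classifier; the data is a toy example and WEKA's own implementation differs in details:

```python
# Minimal Gaussian naive Bayes sketch: each numeric attribute is modeled
# with a normal distribution per class, and classes are scored by
# prior * product of per-attribute densities.
from math import pi, exp, sqrt

def gaussian_pdf(x, mean, std):
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * sqrt(2 * pi))

def fit(samples):
    """samples: list of (feature_vector, class_label) -> per-class stats."""
    by_class = {}
    for x, y in samples:
        by_class.setdefault(y, []).append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        stats = []
        for col in zip(*rows):
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n
            stats.append((mean, sqrt(var) or 1e-9))  # avoid zero std
        model[y] = (n / len(samples), stats)
    return model

def predict(model, x):
    best, best_score = None, float("-inf")
    for y, (prior, stats) in model.items():
        score = prior
        for v, (mean, std) in zip(x, stats):
            score *= gaussian_pdf(v, mean, std)
        if score > best_score:
            best, best_score = y, score
    return best

samples = [([1.0, 2.0], "low"), ([1.2, 1.8], "low"),
           ([8.0, 9.0], "high"), ([8.2, 9.4], "high")]
model = fit(samples)
print(predict(model, [1.1, 2.1]))  # low
print(predict(model, [8.1, 9.2]))  # high
```

The "naive" part is the product over attributes: each attribute is assumed conditionally independent given the class.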

CHAPTER 2

LITERATURE SURVEY
Classifiers in WEKA are the models for predicting nominal or numeric quantities. The learning
schemes available in WEKA include decision trees and lists, instance-based classifiers, support
vector machines, multilayer perceptrons, logistic regression, and Bayes nets. Meta-classifiers
include bagging, boosting, stacking, error-correcting output codes, and locally weighted learning.
The methods used are:
1. J48 Algorithm
The J48 algorithm is an optimized, open-source implementation of the C4.5 algorithm. The
output of J48 is a decision tree: a tree structure with a root node, intermediate nodes, and leaf nodes.
2. Multilayer Perceptron
A single-layer perceptron can only classify linearly separable problems; for non-separable
problems it is necessary to use more layers.
3. Naive Bayes
NaiveBayes implements the probabilistic Naïve Bayes classifier. NaiveBayesSimple uses the
normal distribution to model numeric attributes.

In data mining, the goal of classification is to accurately predict the target class for each case in
the data. The decision tree algorithm is one of the most commonly used classification algorithms
for inductive learning from examples. In this report we present a comparison of different
classification techniques using WEKA; the aim is to investigate the performance of different
classification methods on clinical data. The algorithms tested are Bayes Network, Naïve Bayes,
and J48.

CHAPTER 2
OBJECTIVE
The aim of our work is to investigate the performance of different classification methods using
WEKA. Machine learning covers such a broad range of processes that it is difficult to define
precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding
of, or skill in, by study, instruction, or experience" and "modification of a behavioral tendency by
experience"; zoologists and psychologists study such learning in animals and humans. The
extraction of important information, and its correlations, from a large pile of data is often the
advantage of using machine learning. New knowledge about tasks is constantly being discovered
by humans, and vocabulary changes. There is a constant stream of new events in the world, and
continually redesigning Artificial Intelligence systems to conform to new knowledge is
impractical, but machine learning methods might be able to track much of it.
There is a substantial amount of research on machine learning algorithms such as Bayes
Network, Decision Tree, and Multilayer Perceptron.

CHAPTER 3
METHODOLOGY
Steps to apply different classification techniques to a data set and obtain results in WEKA:
Step 1: Accept the input dataset and preprocess it.
Step 2: Apply the classifier algorithm to the whole data set in the Classify tab.
Step 3: Note the accuracy it reports and the time required for execution; also check the confusion
matrix.
Step 4: To compare different classification algorithms on different datasets, repeat steps 2
and 3, recording accuracy and execution time.
Step 5: Compare the accuracy results obtained on the dataset with the different
classification algorithms and identify the most suitable classification algorithm for the particular
dataset.
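The comparison loop in Steps 2-5 can be sketched as follows, with two simple stand-in classifiers (a majority-class baseline and 1-nearest-neighbour) in place of WEKA's algorithms; the dataset is illustrative only:

```python
# Sketch of the methodology: run each classifier on the same data,
# record accuracy and execution time, then compare.
import time
from collections import Counter

def majority_classifier(train, test_x):
    """Predict the most frequent training label for every test instance."""
    most_common = Counter(y for _, y in train).most_common(1)[0][0]
    return [most_common for _ in test_x]

def one_nn_classifier(train, test_x):
    """Predict the label of the nearest training instance (squared distance)."""
    def nearest(x):
        return min(train,
                   key=lambda ty: sum((a - b) ** 2 for a, b in zip(ty[0], x)))[1]
    return [nearest(x) for x in test_x]

def evaluate(name, classifier, train, test):
    test_x = [x for x, _ in test]
    start = time.perf_counter()
    preds = classifier(train, test_x)
    elapsed = time.perf_counter() - start
    acc = sum(p == y for p, (_, y) in zip(preds, test)) / len(test)
    print(f"{name}: accuracy={acc:.2f} time={elapsed:.4f}s")
    return acc

train = [([0.0], "a"), ([0.2], "a"), ([1.0], "b"), ([1.2], "b"), ([0.1], "a")]
test  = [([0.15], "a"), ([1.1], "b")]

evaluate("majority", majority_classifier, train, test)  # accuracy=0.50
evaluate("1-NN", one_nn_classifier, train, test)        # accuracy=1.00
```

WEKA reports the same quantities (accuracy, build time, confusion matrix) in the Classifier output pane.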
Once you have your data set loaded, all the tabs become available. Click on the Classify tab.

The Classify window comes up on the screen.

1. Selecting J48 classifier


Click on the Choose button in the Classifier box just below the tabs and select the C4.5 classifier:
WEKA -> Classifiers -> Trees -> J48.

Before you run the classification algorithm, you need to set test options in the Test options box.
The test options available to you are:
1. Use training set: Evaluates the classifier on how well it predicts the class of the instances it
was trained on.
2. Supplied test set: Evaluates the classifier on how well it predicts the class of a set of instances
loaded from a file. Clicking on the Set button brings up a dialog allowing you to choose the
file to test on.
3. Cross-validation: Evaluates the classifier by cross-validation, using the number of folds that
are entered in the Folds text field.
4. Percentage split: Evaluates the classifier on how well it predicts a certain percentage of the
data, which is held out for testing. The amount of data held out depends on the value entered in
the % field.
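Test option 3, cross-validation, can be sketched as follows: the data is divided into k folds, and each instance is held out exactly once across the k runs (the data and the trivial majority-class scorer below are illustrative only):

```python
# Sketch of k-fold cross-validation: train on k-1 folds, test on the
# held-out fold, and average the k scores.
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    folds = [[] for _ in range(k)]
    for i in range(n):
        folds[i % k].append(i)
    return folds

def cross_validate(data, k, train_and_score):
    scores = []
    for fold in k_fold_indices(len(data), k):
        test = [data[i] for i in fold]
        train = [d for i, d in enumerate(data) if i not in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

def majority_score(train, test):
    """Fraction of test labels equal to the most frequent training label
    (ties broken by first-encountered label)."""
    maj = Counter(y for _, y in train).most_common(1)[0][0]
    return sum(y == maj for _, y in test) / len(test)

data = [([i], "a" if i < 6 else "b") for i in range(10)]
print(cross_validate(data, 5, majority_score))
```

WEKA additionally stratifies the folds so that each preserves the overall class distribution, which this sketch omits.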
In this exercise you will evaluate the classifier on how well it predicts the held-out 34% of the
data. Check the Percentage split radio button and keep the default of 66% (the portion used for
training). Click on the More options button.
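The percentage-split evaluation can be sketched as follows: shuffle the data with a fixed random seed, train on 66%, and hold out the remaining 34% for testing (a simplified illustration of the idea, not WEKA's exact routine):

```python
# Sketch of the "Percentage split" test option with a fixed random seed,
# so the same split is reproduced on every run.
import random

def percentage_split(data, train_pct=66, seed=1):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed -> reproducible split
    cut = round(len(shuffled) * train_pct / 100)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = percentage_split(data)
print(len(train), len(test))  # 66 34
assert set(train) | set(test) == set(data)  # every instance used exactly once
```

Changing the seed (option 5 below) changes which instances land in the held-out portion, which is why the seed is reported alongside the results.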

Identify what is included in the output. In the Classifier evaluation options, make sure that the
following options are checked:
1. Output model. The output is the classification model built on the full training set, so that it can
be viewed, visualized, etc.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output.
3. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the
output.
4. Store predictions for visualization. The classifier's predictions are remembered so that they
can be visualized.
5. Set Random seed for Xval / % Split to 1. This specifies the random seed used when
randomizing the data before it is divided up for evaluation purposes.
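The per-class statistics and confusion matrix named in options 2 and 3 can be computed as follows from a hypothetical list of predictions (the actual/predicted labels below are invented for illustration):

```python
# Sketch: confusion matrix and per-class precision/recall, the quantities
# WEKA prints in its evaluation output.
from collections import defaultdict

def confusion_matrix(actual, predicted):
    """m[(a, p)] counts instances of true class a predicted as class p."""
    m = defaultdict(int)
    for a, p in zip(actual, predicted):
        m[(a, p)] += 1
    return m

def precision_recall(m, cls, classes):
    tp = m[(cls, cls)]
    predicted_cls = sum(m[(a, cls)] for a in classes)  # column total
    actual_cls = sum(m[(cls, p)] for p in classes)     # row total
    precision = tp / predicted_cls if predicted_cls else 0.0
    recall = tp / actual_cls if actual_cls else 0.0
    return precision, recall

actual    = ["yes", "yes", "yes", "no", "no", "no"]
predicted = ["yes", "yes", "no",  "no", "no", "yes"]
classes = ["yes", "no"]
m = confusion_matrix(actual, predicted)
for c in classes:
    p, r = precision_recall(m, c, classes)
    print(f"{c}: precision={p:.2f} recall={r:.2f}")
```

Reading the matrix row by row shows where the classifier confuses one class for another, which raw accuracy alone hides.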