978-1-4799-8047-5/15/$31.00 © 2015 IEEE
buzzword, and it is defined in terms of the 3 V's: Volume, Velocity and Variety [1][19].

In 2004, the MapReduce programming framework was first proposed by Google. It is a platform designed for processing tremendous amounts of data in a highly parallel manner, and it provides an environment in which scalable and fault-tolerant applications can easily be developed. Apache Hadoop is an open-source project that provides the MapReduce framework for processing big data [1].

Apache Mahout is an open-source project that runs on top of the Hadoop platform. It provides a set of machine learning algorithms for clustering, recommendation and classification problems. Mahout contains various implementations of classification models such as Logistic Regression, Bayesian models, Support Vector Machines, and Random Forest, among others [2].

Like Hadoop, Apache Spark is an open-source cluster computing framework, developed in the AMPLab at UC Berkeley. It is an in-memory MapReduce paradigm and performs up to 100 times faster than Hadoop for certain applications [3].

In this work, we present an analysis of techniques used to deal with the imbalanced data problem, and we present an enhanced SMOTE algorithm for handling the multi-class imbalanced big data problem. These approaches are evaluated on the basis of their potency in correctly classifying each instance of each class and the time required to build the classification model. To perform classification, the Random Forest (RF) classifier, a popular and well-known decision tree ensemble method, is used; RF has been shown to be scalable, robust, and to give good performance.

For the experimental study, we focus on a MapReduce-based implementation of SMOTE + RF. The experiments performed expose the limitations of the original multi-classification algorithm and of the enhanced SMOTE algorithm. Finally, we evaluate the proposed system based on accuracy, the Geometric Mean of true rates, and the β-F-Measure, popular measures in the imbalanced domain.

II. RELATED WORK

A. Imbalanced Data Problem

The classification of imbalanced datasets poses a problem in which the number of examples in one class is greatly outnumbered by the number of examples in the other class [5][6]. A class having an abundant number of examples is called the majority or negative class, and a class having a small number of examples is called the minority or positive class. In recent years, the imbalanced data problem has become a burning point in industry, academia and government agencies. The problem is present in many real-world applications such as medical diagnosis [8], fraud detection, finance, risk management, network intrusion, e-mail foldering [12], software defect detection [18], and so on. Additionally, the positive (minority) class is the class of interest from the learning point of view, and misclassifying it has a great impact.

B. Addressing the Imbalanced Problem

Several techniques have been proposed to address the classification of imbalanced data [5][6][19]. These techniques are categorized into various groups:

1. Data Level Approach: the original dataset is modified to obtain a balanced dataset, so that it can be used by standard machine learning algorithms.

2. Algorithm Level Approach: an existing algorithm is modified with procedures that can deal with imbalanced data.

3. Cost-sensitive Approach: data level and algorithm level approaches are combined to obtain accuracy while reducing misclassification costs.

Furthermore, data level approaches are divided into oversampling, undersampling, and hybrid techniques. In oversampling, new data for the minority classes are added to the original dataset in order to obtain a balanced dataset. In undersampling, data from the majority classes are removed in order to balance the dataset. In hybrid techniques, the previous two techniques are combined: usually, an oversampling technique is first used to create new samples for the minority class, and then an undersampling technique is applied to delete samples from the majority class [5][6][11].

The oversampling and undersampling techniques have some drawbacks. To address them, the Synthetic Minority Oversampling Technique (SMOTE) is used. SMOTE is a powerful solution to the imbalanced data problem that has shown success in various application domains. It is an oversampling technique: it adds synthetic minority class samples to the original dataset to achieve a balanced dataset [5][6].

In the SMOTE algorithm, the minority class is oversampled by generating new samples from existing minority class samples. Depending on the amount of oversampling required, a number of nearest neighbors are randomly chosen. The synthetic data are generated based on the feature-space similarity between existing samples of the minority class. For a subset S of the minority class, consider the K nearest neighbors of each sample x ∈ S: the K elements whose Euclidean distance from x is smallest in the n-dimensional feature space of X. A sample is generated simply as follows: randomly select one of the K nearest neighbors, multiply the corresponding difference vector by a random number between [0, 1], and add this value to the original instance [5].

Mathematically, it is written as

    x_new = x + δ · (x̂ − x)

where x is the sample from the minority class used to generate synthetic data, x̂ is a nearest neighbor of x, and δ is a random number between [0, 1]. The generated synthetic data point lies on the line segment between the sample x under consideration and its selected nearest neighbor x̂.

Though SMOTE is a popular technique in the imbalanced domain, it has some drawbacks, including over-generalization, applicability only to binary class problems, and the choice of the over-sampling rate.
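The interpolation step described above can be sketched in a few lines of Python (a minimal single-machine illustration, not the paper's MapReduce implementation; the function and variable names are our own):

```python
import numpy as np

def smote_sample(X_min, k=5, rng=None):
    """Generate one synthetic sample from the minority-class matrix X_min
    by interpolating between a random sample and one of its k nearest
    neighbors: x_new = x + delta * (x_hat - x), with delta in [0, 1)."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(X_min))
    x = X_min[i]
    # Euclidean distance from x to every other minority sample
    d = np.linalg.norm(X_min - x, axis=1)
    d[i] = np.inf                        # exclude the sample itself
    neighbors = np.argsort(d)[:k]        # indices of the k nearest neighbors
    x_hat = X_min[rng.choice(neighbors)] # pick one neighbor at random
    delta = rng.random()                 # random number in [0, 1)
    return x + delta * (x_hat - x)       # point on the segment x -> x_hat
```

Calling such a routine repeatedly until the minority class reaches the desired size yields the oversampled dataset.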
1: while values.hasNext() do
2:     instance ← INSTANCE_REPRESENTATION(values.getValue())
3:     smote_instances.add(instance)
4: end while
5: final_instances ← RANDOMIZE(smote_instances)
6: for i = 0 to final_instances.length − 1 do
7:     EMIT(null, final_instances.get(i))
8: end for
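In plain Python, the reduce phase listed above amounts to collecting the instances for a key, randomizing their order, and emitting them one by one (a sketch only; the actual implementation uses Hadoop's reducer API, and the function name here is our own):

```python
import random

def smote_reduce(values, rng=None):
    """Sketch of the reduce phase: gather all instances (lines 1-4),
    randomize their order (line 5), and emit (null, instance) pairs
    (lines 6-8)."""
    rng = rng or random.Random()
    smote_instances = list(values)   # while values.hasNext(): ... add
    rng.shuffle(smote_instances)     # RANDOMIZE(smote_instances)
    for inst in smote_instances:     # for i = 0 to length - 1
        yield (None, inst)           # EMIT(null, final_instances.get(i))
```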
active and inactive instances by considering sensitivity and specificity in a single metric. F-measure gives balanced accuracy and is defined as follows [5][19]:

Sensitivity (Recall): It is defined as the percentage of positive samples that are correctly classified, and it is given as:

    Sensitivity = TP / (TP + FN)

Specificity: It is defined as the percentage of negative samples that are correctly classified, and it is given as:

    Specificity = TN / (TN + FP)

Precision: It is the proportion of predicted positive cases that were correctly classified:

    Precision = TP / (TP + FP)

Let P be the precision of a class and R be the recall of that class. So, G-mean is calculated as:

    G-Mean = √(P × R)

F-measure: Another metric used to assess the quality of a classifier in the imbalanced domain is the β-F-measure, and it is given as:

    Fβ = (1 + β²) · P · R / (β² · P + R)

COMPARING ACCURACY FOR TRAINING (ACCtr) AND TESTING (ACCts) DATASETS WITH BASE METHOD AND PROPOSED METHOD

    Dataset          Base (OVA)          SMOTE+OVA
                     ACCtr     ACCts     ACCtr     ACCts
    Landsat          1         0.6534    0.9999    0.7190
    Image segment    0.934     0.9339    1         0.9753
    Lymphography     1         0.9987    1         1
    Iris             0.98      0.9943    1         0.9533
    Zoo              1         1         1         1
    Car              0.9008    0.9343    0.9867    0.9305
    Vehicle          0.8156    0.8158    0.9976    0.9976
    Waveform         0.8736    0.8697    0.9996    0.9995
    Mean             0.9381    0.90      0.9978    0.9469

TABLE III. COMPARING F-MEASURE VALUE FOR TESTING DATASET WITH BASE METHOD AND PROPOSED METHOD

    Dataset          BASE (OVA)    SMOTE+OVA
    Iris             0.992         0.955
    Zoo              1             1
    Car              0.882         0.819
    Vehicle          0.752         0.998
    Waveform         0.805         0.999
    Mean             0.8758        0.9314
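The metrics above can be computed directly from the confusion-matrix counts of one class (a sketch with our own function name; the G-mean here follows the per-class precision/recall form used in the text, and per-class values would be averaged in the multi-class OVA setting):

```python
import math

def imbalance_metrics(tp, fn, tn, fp, beta=1.0):
    """Sensitivity, specificity, precision, G-mean and F_beta
    from the confusion-matrix counts of a single class."""
    sensitivity = tp / (tp + fn)   # recall: correctly classified positives
    specificity = tn / (tn + fp)   # correctly classified negatives
    precision = tp / (tp + fp)     # correct among predicted positives
    g_mean = math.sqrt(precision * sensitivity)
    f_beta = ((1 + beta**2) * precision * sensitivity /
              (beta**2 * precision + sensitivity))
    return sensitivity, specificity, precision, g_mean, f_beta
```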