
ENSEMBLE TECHNIQUE USING FUZZY K-MEDOIDS

CLUSTERING
Abstract
The ensemble technique partitions the data with multiple clusterings and combines them. The quality of the combined clustering depends on integrating clusterings that are not fully independent of one another. Data points are weighted dynamically according to how consistently they are grouped across the ensemble, and these consistency measures drive an adaptive consensus function. Adaptive sampling constructs the data partitions by resampling, so that each sub-sample is reduced in size. The clustering combination problem therefore focuses on the data points within the framework of clustering ensembles. Removing redundant constraints gives the solution of the underlying linear programming problem, whose non-negativity constraints come from the original objective function; the labels of its rows and columns are unknown. Principal component analysis (PCA) is the main mathematical technique used in this work. PCA operates on the data matrix, which is treated as two-dimensional, and the analysis reduces the redundancy among the attributes. Whereas k-means is sensitive to outliers that can distort the data, the k-medoids method finds k clusters among a set of n objects by first arbitrarily selecting a representative object (medoid) for each cluster. K-medoids takes the number of clusters K as an input parameter for partitioning the set of n objects, and an initial random selection of medoids is iteratively improved towards a better choice of medoids.
CHAPTER 1

INTRODUCTION

A large amount of information is produced in our day-to-day life because of the growing number of social media users. Information is available in various formats such as text, audio, video, images and graphs. However, the most important and fundamental format, used from the past till now, is text. Text plays a significant role in communication.

1.1 Data Mining


A huge amount of data has been generated over the past few decades because of the increased use of electronic devices. This accumulated data has prompted the development of new ways of handling information. Data mining is the process of examining the given data with the specific goal of extracting valuable information contained in it that cannot be recovered through simple queries (Qinbao Song et al., 2013).

Data mining combines many techniques, for example machine learning, pattern recognition, database and data warehouse systems, visualization, algorithms, high-performance computing, and numerous application domains. Another name for data mining is the knowledge discovery process; it ordinarily includes data cleaning, data integration, data selection, data transformation, pattern discovery, pattern evaluation and knowledge representation.

There are two sorts of data mining tasks, predictive and descriptive (Kashif Javed et al., 2012). Predictive data mining makes predictions based on the available data; it is used to anticipate future events based on past experience. Clustering, classification and regression are some of the predictive data mining techniques. Descriptive data mining investigates the past and portrays the general behaviour of the existing data, discovering human-interpretable patterns that describe it. Examples include association, clustering, characterization and so forth.
Data mining is applied to text data, web data and multimedia data. When knowledge is extracted from text data, the mining procedure is termed text mining. The proposed research work is centred on extracting knowledge from text data using a classification strategy.

1.1.1 Need for Data Mining Techniques

The information and knowledge gained can be used for applications ranging from business management and production control to market analysis and exploration. The main motivation behind the popularity of data mining techniques in these applications is given below.

Growing Data Volume

The major reason data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data, and the imminent need for turning such data into useful information and knowledge.

Limitations of Human Analysis

Two other problems surface when human analysts process data:

the inadequacy of the human brain when searching for complex multifactor dependencies in data, and

the lack of objectiveness in such an analysis.

Using automated methods eliminates both of these problems to a great extent.

Low Cost of Machine Learning

One additional benefit of using automated data mining systems is their low cost. While data mining does not eliminate human participation in solving the task completely, it significantly changes the job and allows an analyst who is not a professional in statistics and programming to manage the process of extracting knowledge from data.
1.1.2 Stages of Data Mining

Data mining methods start by working with the data, and the best techniques are those developed with an orientation toward large volumes of data, making use of as much of the collected data as possible to arrive at reliable conclusions and decisions.

Step1: Selection

This step begins by collecting data from all the different sources. The collected data will contain a variety of information, not all of which is needed for a particular application. So the collected data is segmented and a data selection procedure is performed, where the interesting subsets of data are extracted according to certain criteria, for example all those people who own a car. This step creates a repository of information in one place.

Figure 1.1: Stages of Data Mining (Raw Data → Target Data → Clean Data → Transformed Data → Feature Extraction → Knowledge)


Step 2: Preprocessing

Preprocessing is the data cleansing stage, where information that is deemed unnecessary and may slow down queries is removed. In this step, unnecessary values (for example, gender details of a patient when studying pregnancy), out-of-range values (for example, Salary = 0), missing values and other data values that in general lead to misleading errors are identified, and an attempt is made to correct these problematic data. Also, the data is re-configured to ensure a consistent format, as there is a possibility of inconsistent formats because the data is drawn from several sources. For example, sex may be recorded as f or m and also as 1 or 0.
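As a concrete illustration of this re-configuration step, the sketch below (the attribute name and code values are hypothetical, not taken from the thesis data) maps the different encodings of a gender attribute onto one canonical form:

import java.util.Map;

public class GenderNormalizer {

    // Map every encoding seen in the source systems to a single canonical code.
    private static final Map<String, String> CANONICAL = Map.of(
            "f", "F", "female", "F", "0", "F",
            "m", "M", "male", "M", "1", "M");

    public static String normalize(String raw) {
        String key = raw == null ? "" : raw.trim().toLowerCase();
        // Unknown or missing codes are flagged so they can be handled during data cleaning.
        return CANONICAL.getOrDefault(key, "UNKNOWN");
    }

    public static void main(String[] args) {
        System.out.println(normalize("f"));   // F
        System.out.println(normalize("1"));   // M
        System.out.println(normalize("x"));   // UNKNOWN
    }
}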

Step 3: Transformation
The data, even after cleaning, are not ready for mining, as they need to be transformed into a form that is appropriate for mining. This is performed in the third step, where the cleaned data is transformed into a format that can be readily used and navigated by data mining techniques. It allows the mapping of data from their given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.
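A minimal sketch of the min-max normalization mentioned above (our illustration, not part of the project code), scaling every numeric attribute to the range [0, 1]:

public class MinMaxNormalizer {

    // Scales each column of the data matrix to [0, 1] using that column's minimum and maximum.
    public static double[][] normalize(double[][] data) {
        int rows = data.length, cols = data[0].length;
        double[][] out = new double[rows][cols];
        for (int c = 0; c < cols; c++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] row : data) {
                min = Math.min(min, row[c]);
                max = Math.max(max, row[c]);
            }
            double range = max - min;
            for (int r = 0; r < rows; r++) {
                // If all values in the column are equal, map the column to 0 to avoid division by zero.
                out[r][c] = range == 0 ? 0.0 : (data[r][c] - min) / range;
            }
        }
        return out;
    }
}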

Step 4: Pattern Evaluation and Knowledge Presentation

The fourth stage is concerned with using data mining techniques for the extraction of patterns from the transformed dataset. A pattern is a statement S in a language L that describes relationships among a subset Fs of the facts F with a certainty C, such that S is simpler, in some sense, than the enumeration of all the facts in Fs. The patterns discovered are then interpreted and evaluated for human decision-making. Different techniques such as clustering, classification and association rule mining are used in this stage.

Step 5: Interpretation and Evaluation

This is the most important stage of data mining, where the patterns identified by the system are interpreted into knowledge, which can be used to support human decision-making, for example in prediction and classification tasks, in summarizing the contents of a database or in explaining observed phenomena. This step helps users make use of the acquired knowledge to take better decisions.

Thus, it can be understood that basically, data mining is concerned with the analysis
of data and the use of software techniques for finding patterns and regularities in sets of data.
The idea is extremely useful to extract information from unexpected places and the data
mining software extracts patterns not previously obvious to anyone.

Knowledge discovery is an iterative process, and once knowledge is discovered, it is presented to the user for evaluation. The process of mining can be improved in several ways: the evaluation measures can be enhanced, the mining can be further refined, new data can be selected or further transformed, or new data sources can be integrated, in order to get different, more appropriate results. This research work focuses on the second stage, that is, cleaning of data to obtain a better classification result.

1.1.3 Techniques in Data Mining

The most commonly used techniques in data mining are:


Artificial neural networks: Non-linear predictive models that learn through
training and resemble biological neural networks in structure,

Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID),

Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation and natural selection in a design based on the concepts of evolution,

Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset. It is sometimes called the k-nearest neighbor technique (see the sketch after this list), and

Rule induction (Association rules): The extraction of useful if-then rules


from data based on statistical significance.
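As an illustration of the nearest neighbor method listed above, the following is a minimal k-nearest-neighbor sketch; numeric attributes and the Euclidean distance are our assumptions, not stated in the text.

import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class KNearestNeighbor {

    // Classifies a query record by majority vote among the k closest training records.
    public static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training records by Euclidean distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], query)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++) {
            votes.merge(labels[idx[i]], 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}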
Irrespective of the technology used for mining, the two primary goals of data mining are prediction and description. Prediction involves using some variables or fields in the data set to predict unknown or future values of other variables of interest. Description, on the other hand, focuses on finding patterns describing the data that can be interpreted by humans.

The goals of prediction and description are achieved by using data-mining techniques,
for the following primary tasks:
Classification: Discovery of a predictive learning function that classifies a data item into one of several predefined classes,

Clustering: A common descriptive task in which one seeks to identify a finite


set of categories or clusters to describe the data,

Summarization: An additional descriptive task that involves method for


finding a compact description for a set (or subset) of data,

Dependency Modeling: Finding a local model that describes significant dependencies between variables or between the values of features in a data set or in a part of a data set.

1.1.4 Preprocessing in Data Mining

Preprocessing is defined as any type of processing performed on raw data to prepare it for another processing procedure. It transforms the data into a format that can be more easily and effectively processed. Data preprocessing is a task that should be given utmost attention for the following reasons:

Quality decisions are based on quality data. For example, duplicate or missing data may cause incorrect or even misleading statistics, and

It is a well-known fact that improved data quality can improve the quality of any analysis on it. Analyzing data that has not been carefully screened often produces misleading results. Therefore, using preprocessing routines that improve the representation and quality of data is an important task that should be performed before running an analysis. If the amount of irrelevant and redundant information, or of noisy and unreliable data, is high, then knowledge discovery during the training phase becomes more difficult.

Based on this, recently many applications perform preprocessing as a mandatory step during data mining. In general, the preprocessing stage of data mining consists of the following major tasks:
Data cleaning: Process of removing unwanted data, filling missing values, smoothing noisy data and resolving inconsistencies,

Data integration: Process of combining data from multiple sources into a


coherent store,

Data transformation: Involves smoothing, generalization of the data,


attribute construction and normalization,

Data reduction: Obtains a reduced representation of the data volume but produces the same or similar analytical results.

In general, all the above tasks focus on increasing the data quality. A well-accepted multi-
dimensional view of data quality is Accuracy, Completeness, Consistency, Believability,
Interpretability and Accessibility.

1.2 Clustering

A basic clustering algorithm generates a vector of topics for each document and
determines the weights of how well the document fits into each cluster. Clustering technology
can be useful in the organization of management information systems, which may contain
thousands of documents.
In addition to web applications, companies can use Q&A techniques internally for employees
who are searching for answers to common questions. The education and medical areas may
also find uses for Q&A in areas where there are frequently asked questions that people wish
to search.

Association Rule Mining

Association Rule Mining (ARM) (Seno et al., 2002) is a technique used to discover
relationships among a large set of variables in a data set. It has been applied to a variety of
industry settings and disciplines but has, to date, not been widely used in the social sciences,
especially in education, counseling and associated disciplines. ARM refers to the discovery of
relationships among a large set of variables, that is, given a database of records, each
containing two or more variables and their respective values, ARM determines variable-value
combinations that frequently occur. Similar to the idea of correlation analysis (although they
are theoretically different), in which relationships between two variables are uncovered,
ARM is also used to discover variable relationships, but each relationship (also known as an
association rule) may contain two or more variables.
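Although they are not defined in the passage above, the two standard measures used to screen such frequently occurring variable-value combinations are the support and confidence of a rule X ⇒ Y:

\[
\operatorname{support}(X \Rightarrow Y) = P(X \cup Y), \qquad
\operatorname{confidence}(X \Rightarrow Y) = \frac{P(X \cup Y)}{P(X)} .
\]

A rule is usually reported only when both measures exceed user-specified thresholds.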
This section provides the overview of text mining techniques and methodologies by which
suitable text data becomes classifiable. Next, we discuss the data mining algorithms that are
frequently consumed in the text mining and classification tasks.

1.6 Motivation

Feature selection is an essential issue in machine learning and data mining. Feature selection is an effective way of reducing dimensionality, removing irrelevant data and increasing learning accuracy. Feature selection is the process of identifying a subset of the most useful features that produces results as good as the original complete set of features. A feature selection procedure may be assessed from both the efficiency and the effectiveness points of view: efficiency concerns the time required to find a subset of features, while effectiveness relates to the quality of the subset of features. Feature selection is not the same as dimensionality reduction. Both techniques seek to decrease the number of attributes in the dataset, but dimensionality reduction methods create new combinations of attributes, whereas feature selection strategies include and exclude attributes present in the data without changing them. The central assumption when using a feature selection method is that the data contains many redundant or irrelevant features.
1.7 Proposed Work

Principal component analysis (PCA) is the main mathematical technique used in this work. PCA operates on the data matrix, which is treated as two-dimensional, and the analysis reduces the redundancy among the attributes. Whereas k-means is sensitive to outliers that can distort the data, the k-medoids method finds k clusters among a set of n objects by first arbitrarily selecting a representative object (medoid) for each cluster. K-medoids takes the number of clusters K as an input parameter for partitioning the set of n objects, and an initial random selection of medoids is iteratively improved towards a better choice of medoids.
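For clarity, the clustering criterion that this description refers to can be written as the standard k-medoids cost (the notation below is ours, not stated in the thesis):

\[
E \;=\; \sum_{j=1}^{K} \sum_{x_i \in C_j} d(x_i, m_j),
\]

where m_j is the representative object (medoid) of cluster C_j and d is the chosen dissimilarity; the algorithm repeatedly swaps a medoid with a non-medoid object whenever the swap reduces E.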

1.8 Layout of Thesis

Chapter 1, Introduction, provides a brief introduction to data mining. The proposed work of this dissertation is also outlined.
The literature review is a critical look at the existing research that is significant to the work carried out. A critical look at the various available literature related to the present research work is given in Chapter 2, Review of Literature.

The basic idea of the k-medoids algorithm and the way it is used in this work are presented in Chapter 3, Methodology.

Chapter 4, Results and Discussion, analyzes the dataset, experimental set up which
is needed to develop this work and the results observed.

The conclusion of the research work is summarized along with future research
direction in Chapter 5, Summary and Conclusion.

The work of several researchers is quoted and used as evidence to support the concepts explained in this dissertation. All such evidence is listed in the reference section of the dissertation.
CHAPTER 2

REVIEW OF LITERATURE

A brief survey of the various related studies is carried out in this chapter. The literature survey takes a critical look at work that has already been completed and is relevant to this dissertation.

Text collections contain a huge number of distinct terms, which makes the text mining process complicated. For this reason, feature extraction is applied in text classification (Kashif Javed et al., 2012). A feature is a combination of attributes (keywords) which captures important characteristics of the data. A feature extraction technique creates a new set of features far smaller than the number of original attributes by analysing the original data; in this way, it speeds up supervised learning. Unsupervised algorithms such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) operate on the document-word matrix, as shown in Table 2.1, subject to different constraints for feature extraction. Non-negative matrix factorization is described in the paper "Learning the Parts of Objects by Non-negative Matrix Factorization" (Priyadarshini et al., 2015). Non-negative matrix factorization is an unsupervised algorithm for effective feature extraction from text documents.

2.1 Survey on Feature Selection

2.1.1 A Fast Clustering Based Feature Subset Selection Algorithm for High
Dimensional Data
The recent increase in the dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. Qinbao Song et al. (2013) discussed a two-step approach in which features are first partitioned into clusters; in the second step, the most representative feature strongly related to the target classes is selected from every cluster to form a subset of features. With this objective, the paper presents the novel idea of dominant correlation and proposes a fast clustering-based filter strategy which can recognize relevant features and remove redundancy among relevant features without pairwise correlation analysis. The efficiency and effectiveness of the strategy are demonstrated through extensive experiments against other techniques using real-world data of high dimensionality. The constructed subset is evaluated after removing irrelevant features.
2.1.2 Extended Relief Algorithms in Instance Based Feature Filtering

Park and Kwon (2007) presented Relief algorithms and their use in instance-based feature filtering for document feature selection. Relief algorithms have been used to filter features from image data, microarray data and text data. The Relief algorithm is a general and efficient feature estimator that detects conditional dependencies of features among instances, and it is applied in the preprocessing step for document classification and regression. Many kinds of extended Relief algorithms have been proposed as answers to the problems of redundant, irrelevant and noisy features, as well as to the limitations of Relief algorithms on high-dimensional datasets. These algorithms are used to remove insignificant features, reduce the dataset and decrease searching time, and they also enhance performance and stability. The authors suggest new extended Relief algorithms to address the quality of features estimated from instances and from clustered datasets.

2.1.3 An Energy-Efficient Clustering for High Dimensional Data

For high-dimensional data sets, feature selection involves identifying a subset of the most useful features that produces results as good as the original complete set of features. A feature selection method may be evaluated from both the efficiency and the effectiveness points of view, where efficiency concerns the time required to find a subset of features. With this objective, the paper presents a method that selects a subset of good features with respect to the target concepts; feature subset selection is an effective way of reducing dimensionality, removing irrelevant data, increasing learning accuracy and improving result comprehensibility. Many feature subset selection strategies have been proposed and studied for machine learning applications. Filter methods are independent of the learning algorithm and have good generality, and their computational complexity is low; however, the accuracy of the learning algorithm is not guaranteed. Minimum-spanning-tree-based clustering algorithms are adopted because they do not assume that data points are grouped around centres or separated by a regular geometric curve, and they have been widely used in practice.

2.1.4 Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data

Kashif Javed et al. (2012) proposed a feature ranking (FR) algorithm named Class-Dependent Density-Based Feature Elimination (CDFE) for high-dimensional binary data sets. CDFE uses a filtrapper approach to select a final subset. Feature selection with an FR algorithm is simple and computationally efficient, but redundant information may not be removed. A feature subset selection (FSS) algorithm investigates the data for redundancies but may turn out to be computationally impractical on high-dimensional datasets. They address these issues by combining FR and FSS methods into a two-stage feature selection algorithm. CDFE not only gives a feature subset that is good in terms of classification but also relieves the user of heavy computation. Two FSS algorithms are used in the second stage to test the two-stage feature selection idea. Rather than using a threshold value, CDFE determines the final subset with the help of a classifier (Kashif Javed et al., 2012).

2.1.5 A Unified Feature Selection Framework for Graph Embedding on High Dimensional Data
The framework is developed to perform feature selection for graph embedding, in which a category of graph embedding methods is cast as a least-squares regression problem. In contrast to filter methods, wrapper methods are application dependent. The embedded method encapsulates feature selection in a sparse regression method named LASSO. In this framework, a binary feature selector is introduced to naturally handle the feature cardinality in the least-squares formulation. The resulting integer programming problem is then relaxed into a convex Quadratically Constrained Quadratic Program (QCQP) learning problem, which can be efficiently solved by means of accelerated proximal gradient (APG) methods. The proposed framework is applied to several graph embedding learning problems, including supervised, unsupervised and semi-supervised graph embedding. Graph embedding suffers from two shortcomings: it is difficult to interpret the resultant features when all dimensions are used for embedding, and the original data inevitably contain noisy features which can make graph embedding unstable and noisy (Marcus Chen et al., 2015).

2.1.6 Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection

A new hybrid algorithm uses boosting and incorporates some of the characteristics of wrapper methods into a fast filter method for feature selection. Results are reported on six real-world datasets, and the hybrid method is much faster and scales well to datasets with large numbers of features (Hu Min, Wu Fangfang et al., 2010). Definitions of irrelevance and of two degrees of relevance are combined in this paper. The features selected should depend not only on the features and the target concept but also on the induction algorithm. A method is described for feature subset selection using cross-validation that is applicable to any induction algorithm, and experiments are conducted with ID3 and C4.5 on artificial and real datasets (Das et al., 2001).
2.1.7 Adaptive Relevance Feature Discovery for Text Mining with Simulated Annealing
Approximation
C. Kanakalakshmi et al. (2013) and Khalid et al. (2014) examine the extraction of useful information from unstructured textual data through the recognition and exploration of interesting patterns. The discovery of relevant features in real-world data for describing user information needs or preferences is a new challenge in text mining. Relevance of a feature indicates that the feature is always necessary for an optimal subset; it cannot be removed without affecting the original conditional class distribution. They propose an adaptive method for relevance feature discovery to find useful features available in a feedback set, including both positive and negative documents, for describing user needs. Thus, this paper discusses methods for relevance feature discovery using simulated annealing approximation and a genetic algorithm, in which a population of candidate solutions to an optimization problem moves toward better solutions.

2.1.8 Effective Pattern Discovery for Text Mining using Neural Network Approach

Harpreet Kaur et al. (2001) and Yu et al. (2003) observed that text mining based on pattern discovery usually uses only the text material in standard fonts, i.e. it does not consider bold, underlined, italic or even larger fonts as key text patterns for text mining. This creates difficulty many times when key words are eliminated from the document by the algorithm itself. In their proposed work, patterns are discovered in both positive and negative feedback. The method then automatically classifies the patterns into clusters to find relevant patterns as well as remove noisy patterns for a given topic. A novel pattern organizing approach is proposed to extract alternative features of text documents and use them for improving the retrieval performance. The proposed approach is evaluated by extracting features from relevance feedback (RF) to improve the performance of information filtering (IF).

2.2 Survey on Classification

Ning Zhong et al. (2010) examine the many data mining techniques proposed for mining useful patterns in text documents. Nevertheless, how to effectively use and update discovered patterns is still an open research issue, particularly in the domain of text mining. While most existing text mining techniques adopt term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have frequently held the hypothesis that pattern (or phrase) based approaches should perform better than term-based ones, but many experiments do not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Substantial experiments on the RCV1 data collection and TREC topics demonstrate that the proposed solution achieves encouraging performance.
Kalid et al. (2014) investigate the extraction of useful information from unstructured textual data through the recognition and searching of interesting patterns. The discovery of relevant features in real-world data for relating user information requirements or preferences is a new challenge in text mining. Relevance of a feature indicates that the feature is always necessary for an optimal subset; it cannot be removed without upsetting the original conditional class distribution. They proposed an adaptive method for relevance feature detection to find useful features obtainable in a feedback set, including both positive and negative documents, for describing user needs. Thus, this paper discusses methods for relevance feature discovery using simulated annealing approximation and a genetic algorithm, a population of candidate solutions to an optimization problem moving toward better solutions. Harpreet Kaur et al. (2001) observed that text mining based on pattern discovery usually uses only the text in standard fonts, i.e. it does not consider bold, underlined, italic or even larger fonts as key text patterns for text mining. This creates difficulty many times when key words are removed from the article by the algorithm itself; in that case, significant keywords are left out of the main stream of text patterns. In their proposed work, patterns are excavated in both positive and negative feedback. The method then mechanically classifies the patterns into clusters to find pertinent patterns as well as eradicate noisy patterns for a given topic. A novel pattern organizing approach is proposed to extract choice features of text documents and use them for improving the retrieval performance. The proposed approach is evaluated by extracting features from RF to improve the performance of information filtering (IF).

Metzler et al. (2007) study a feature clustering method to automatically group terms into three groups: positive specific features, general features and negative specific features. The first issue in using irrelevant documents is how to choose an appropriate set of irrelevant documents, since a very large set of negative examples is typically obtained; for example, a Google search can return millions of documents, though only a few of those documents may be of interest to a web user. Obviously, it is not efficient to use all of the irrelevant documents. This representation is a supervised approach that needs a training set including both relevant documents and irrelevant documents. It also provides recommendations for offender (irrelevant) selection and for the use of specific terms and general terms for describing user information needs. This model finds both positive and negative evidence, and the RFD model uses irrelevant documents in the training set in order to eliminate noise; it can also achieve reasonable performance.

According to Hamid et al. (2013), feature selection is usually a separate procedure which cannot benefit from the results of data exploration. In this paper, they propose an unsupervised feature selection method which can reuse a specific data exploration result. Furthermore, the algorithm follows the idea of clustering attributes and combines two state-of-the-art data analysis methods, namely the maximal information coefficient and affinity propagation. Classification problems with different classifiers were tested to validate the method against others. Experimental results show that the unsupervised algorithm is comparable with classical feature selection methods and even outperforms some supervised learning algorithms. A simulation with a credit dataset of their own from the Bank of China shows the capability of the method for real-world application.
According to Manning et al. (2007), to effectively use closed patterns in text classification, a deploying method has been proposed to compose all closed patterns of a category into a vector that includes a set of terms and a term weight distribution. The pattern deploying method has shown encouraging improvements in effectiveness compared with traditional IR models. Similar research was also published on developing a new methodology for post-processing of pattern mining, pattern summarization, which groups patterns into clusters and then composes patterns in the same cluster into a master pattern that consists of a set of terms and a term-weight distribution.

According to Seno et al. (2002), the field of text mining seeks to extract useful information from unstructured textual data through the identification and exploration of interesting patterns. Relevance of a feature indicates that the feature is always necessary for an optimal subset; it cannot be removed without affecting the original conditional class distribution.
According to Pham et al. (2004), feature selection methods can be classified into two main categories: filter approaches and wrapper approaches. In filter approaches, a filtering process is performed before the classification process; therefore, they are independent of the classification algorithm used. A weight value is computed for each feature, such that the features with better weight values are selected to represent the original data set. On the other hand, wrapper approaches generate a set of candidate features by adding and removing features to compose a subset of features; they then use accuracy to evaluate the resulting feature set. Many evolutionary algorithms have been used for feature selection, including a distributed wrapper approach to confront the problem of distributed learning due to the proliferation of big, usually distributed, databases; Ant Colony Optimization (ACO) for keystroke dynamics authentication; and Particle Swarm Optimization for the diagnosis of heart disease with high recognition accuracy.

3. METHODOLOGY

This chapter presents the methodology implemented in this work to obtain accurate label prediction when handling high-dimensional data for cancer prediction. Fuzzy k-medoids clustering is used to achieve effective prediction of clustered data with less computational time.
3.1 Overview of the methodology

The overall flow of the methodology is: data set as input → pre-processing of data using PCA → applying fuzzy k-medoid clustering → clustering solution → performance evaluation.

3.2 Modules:
1. Dataset Collection
2. Dataset Pre-processing Using PCA
3. Applying Fuzzy Kmedoid Clustering
4. Clustering Solution
5. Performance Evaluation

3.3 Dataset Collection


The dataset was collected from the UCI Repository. The dataset contains noisy data and missing values, so a pre-processing technique is applied.
Dataset used: Breast Cancer Wisconsin
The dataset contains 699 rows.
Attribute Information:
1.Sample code number: id number
2. Clump Thickness: 1 - 10
3. Uniformity of Cell Size: 1 - 10
4. Uniformity of Cell Shape: 1 - 10
5. Marginal Adhesion: 1 - 10
6. Single Epithelial Cell Size: 1 - 10
7. Bare Nuclei: 1 - 10
8. Bland Chromatin: 1 - 10
9. Normal Nucleoli: 1 - 10
10. Mitoses: 1 - 10
11. Class: (2 for benign, 4 for malignant)

URL: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
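As an illustration only (not the project's loader), the sketch below reads this data, assuming the file has been downloaded as the comma-separated breast-cancer-wisconsin.data, in which missing entries appear as "?":

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class WisconsinLoader {

    // Reads the 11-column file; "?" entries are returned as Double.NaN so that
    // the preprocessing stage can impute them later.
    public static List<double[]> load(String path) throws IOException {
        List<double[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            if (line.trim().isEmpty()) continue;
            String[] parts = line.split(",");
            double[] row = new double[parts.length];
            for (int i = 0; i < parts.length; i++) {
                row[i] = parts[i].equals("?") ? Double.NaN : Double.parseDouble(parts[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        List<double[]> data = load("breast-cancer-wisconsin.data");
        System.out.println("Loaded " + data.size() + " records"); // expected: 699
    }
}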
3.4 Dataset Pre-processing Using PCA

A large amount of complex data is available for investigating disease prediction, and this reduces the performance of effective analysis. This can be avoided by pre-processing the data set. Pre-processing avoids the missing, incorrect and inconsistent data issues that may appear in the collected data. The noisy entries present in the data set are removed, and the effective data are retained for faster processing. Pre-processing removes the data that would reduce the quality of the evaluation. The attributes of the different records are considered to form effective classes for reliable prediction. The data set used in this project is the Breast Cancer dataset. Pre-processing of the dataset is done by replacing the missing values and applying PCA.
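The project's actual pre-processing code is part of the implementation listed in Chapter 4; the sketch below is a simplified stand-in, assuming Apache Commons Math 3 is on the classpath, that replaces missing values by column means and projects the nine feature columns onto the leading principal components:

import java.util.Arrays;

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.EigenDecomposition;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.stat.correlation.Covariance;

public class PcaPreprocessor {

    // Replaces NaN entries with the column mean, centres the data, and returns
    // the projection onto the top 'components' principal components.
    public static double[][] transform(double[][] data, int components) {
        int rows = data.length, cols = data[0].length;

        // 1. Mean imputation of missing values (NaN), then mean-centring.
        double[] mean = new double[cols];
        for (int c = 0; c < cols; c++) {
            double sum = 0;
            int n = 0;
            for (double[] row : data) {
                if (!Double.isNaN(row[c])) { sum += row[c]; n++; }
            }
            mean[c] = n == 0 ? 0 : sum / n;
        }
        double[][] centred = new double[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                double v = Double.isNaN(data[r][c]) ? mean[c] : data[r][c];
                centred[r][c] = v - mean[c];
            }
        }

        // 2. Eigen-decomposition of the covariance matrix.
        RealMatrix x = new Array2DRowRealMatrix(centred, false);
        RealMatrix cov = new Covariance(x).getCovarianceMatrix();
        EigenDecomposition eig = new EigenDecomposition(cov);

        // 3. Pick the eigenvectors with the largest eigenvalues and project onto them.
        double[] vals = eig.getRealEigenvalues();
        Integer[] order = new Integer[vals.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(vals[b], vals[a]));

        double[][] w = new double[cols][components];
        for (int k = 0; k < components; k++) {
            double[] v = eig.getEigenvector(order[k]).toArray();
            for (int c = 0; c < cols; c++) w[c][k] = v[c];
        }
        return x.multiply(new Array2DRowRealMatrix(w, false)).getData();
    }
}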

3.5 Applying Fuzzy Kmedoid Clustering:

Fuzzy k-medoids is used to obtain more accurate centres, and with this factor the number of clusters is determined effectively. The number of clusters is given as an input parameter, and the value of Z (the set of cluster centres, i.e. medoids) is optimized within a loop. For computing the Z values, the fuzzy k-medoid algorithm introduced below is employed. In each cycle of the loop, the values of U (the membership matrix) and Z are computed by the fuzzy clustering algorithm; then the closest pair of clusters is determined and merged. This procedure continues until the number of clusters reaches one. A validation index is used to determine which set of Z values is the final one.
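Equations (1), (3) and (4) referred to in this chapter are not reproduced here; the standard fuzzy k-medoids formulation they correspond to (our reconstruction, with fuzzifier m > 1 and dissimilarity d) is

\[
P(U,Z) = \sum_{i=1}^{n}\sum_{j=1}^{k} u_{ij}^{\,m}\, d(x_i, z_j), \qquad
u_{ij} = \frac{1}{\sum_{l=1}^{k}\left(\dfrac{d(x_i, z_j)}{d(x_i, z_l)}\right)^{\!\frac{1}{m-1}}}, \qquad
z_j = \arg\min_{x \in X}\; \sum_{i=1}^{n} u_{ij}^{\,m}\, d(x_i, x),
\]

where U = [u_ij] is the fuzzy membership matrix and Z = {z_1, ..., z_k} is the set of medoids.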
Fuzzy k-medoid algorithm:

Input: fuzzification coefficient, initial value of Z

While (true)
1. Compute the value of U by (3)
   Determine the value of P(U, Z) by (1)
   Set P = P(U, Z)
   If Previous = P then stop
2. Compute the value of Z by (4)
   Determine the value of P(U, Z) by (1)
   If Previous = P then stop
   Set Previous = P
End while
Output: the values of U and Z
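The project's actual implementation is the Hybrid_Clustering class listed in Chapter 4; the following is a simplified, self-contained sketch of the same alternating loop, where the fuzzifier m, the Manhattan dissimilarity and the random initialisation are our own illustrative choices:

import java.util.Random;

public class FuzzyKMedoids {

    // One run of fuzzy k-medoids: alternates the membership update (eq. 3)
    // and the medoid update (eq. 4) until the objective P(U, Z) stops improving.
    public static int[] cluster(double[][] x, int k, double m, int maxIter) {
        int n = x.length;
        Random rnd = new Random(42);
        int[] medoids = rnd.ints(0, n).distinct().limit(k).toArray();
        double[][] u = new double[n][k];
        double previous = Double.MAX_VALUE;

        for (int iter = 0; iter < maxIter; iter++) {
            // Membership update.
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < k; j++) {
                    double dij = Math.max(dist(x[i], x[medoids[j]]), 1e-12);
                    double sum = 0;
                    for (int l = 0; l < k; l++) {
                        double dil = Math.max(dist(x[i], x[medoids[l]]), 1e-12);
                        sum += Math.pow(dij / dil, 1.0 / (m - 1));
                    }
                    u[i][j] = 1.0 / sum;
                }
            }
            // Medoid update: the point minimising the weighted dissimilarity of its cluster.
            for (int j = 0; j < k; j++) {
                double best = Double.MAX_VALUE;
                int bestIdx = medoids[j];
                for (int cand = 0; cand < n; cand++) {
                    double cost = 0;
                    for (int i = 0; i < n; i++) cost += Math.pow(u[i][j], m) * dist(x[i], x[cand]);
                    if (cost < best) { best = cost; bestIdx = cand; }
                }
                medoids[j] = bestIdx;
            }
            // Objective P(U, Z); stop when it no longer decreases.
            double p = 0;
            for (int i = 0; i < n; i++)
                for (int j = 0; j < k; j++) p += Math.pow(u[i][j], m) * dist(x[i], x[medoids[j]]);
            if (p >= previous) break;
            previous = p;
        }
        // Hard assignment by maximum membership, for reporting the clusters.
        int[] labels = new int[n];
        for (int i = 0; i < n; i++) {
            int arg = 0;
            for (int j = 1; j < k; j++) if (u[i][j] > u[i][arg]) arg = j;
            labels[i] = arg;
        }
        return labels;
    }

    // Manhattan dissimilarity between two records.
    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += Math.abs(a[i] - b[i]);
        return s;
    }
}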

3.6 Cluster Solution:


Clustering is an unsupervised technique which has been developed for the purpose of dividing data into clusters, where each cluster is formed from similar objects. The results show that the proposed fuzzy k-medoids method is a better solution in terms of accuracy compared with the existing k-means clustering method.
It works well when there is a clear margin of separation between groups.
It is effective in high-dimensional spaces.
It is effective in cases where the number of dimensions is greater than the number of samples, and it is also memory efficient.

3.7 Performance Evaluation


The performance is evaluated on the chosen data set by selecting the associated features; the correlation between the chosen attributes is computed. The clustering of the data set by each method is compared separately, and the proposed approach achieves higher performance. Fuzzy k-medoid clustering reduces both time and memory.
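The text does not state which correlation measure is used between the chosen attributes; the sketch below assumes the Pearson correlation coefficient:

public class AttributeCorrelation {

    // Pearson correlation coefficient between two attribute columns.
    public static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n;
        meanB /= n;
        double cov = 0, varA = 0, varB = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - meanA) * (b[i] - meanB);
            varA += (a[i] - meanA) * (a[i] - meanA);
            varB += (b[i] - meanB) * (b[i] - meanB);
        }
        return cov / Math.sqrt(varA * varB);
    }
}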
3.8 Chapter summary
This chapter presented the methodology implemented in this work: the process of forming the clusters and the steps involved in the implementation. The clustering process, the fuzzy k-medoid process and the performance measures were discussed in this chapter.
The experimental set-up and the environment supporting the work are discussed in the next chapter.
4. RESULT AND DISCUSSION

The high-dimensional data set available for predicting patients' breast cancer disease is considered for the accurate prediction of data labels. This requires an effective data extraction mechanism to extract the clustered data from the data set. This chapter discusses the data set, the experimental setup used to analyse the data, and the results observed from the experiment.

4.1 Data set used


The data set used in this project is the Breast Cancer Wisconsin dataset.

https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29
4.1.1 Description of data set
The data was collected from the University of Wisconsin Hospitals. The number of instances in the dataset is as follows:
Breast Cancer Wisconsin: 699
The dataset contains 11 attributes.
4.2 Experimental setup
The data set is analysed with its various attributes, and only the required features are extracted. The work is implemented on the Java platform, and a system with 2 GB of memory is used to run the experiment. The performance analysis is made for both methods, and they are compared.
4.3 Observed results

The application walks through the following screens: the main form, where the concept processing to run is chosen; choosing the dataset to be uploaded (breast cancer dataset used); dataset preprocessing using PCA (principal component analysis) and the resulting pre-processed dataset; applying k-means clustering, which produces a clustered dataset shown as Cluster 1 to Cluster 4; and applying fuzzy k-medoid clustering, where the fuzzy matrix is found using the fuzzy membership function and the result is shown as Cluster 1 and Cluster 2.
Performance Evaluation:

Execution Time
Algorithm Name                         Execution Time
Hybrid Fuzzy K-Medoid Clustering       674
K-Means Clustering                     1094

Memory
Algorithm Name                         Memory
Hybrid Fuzzy K-Medoid Clustering       10989
K-Means Clustering                     51043

Coding:
KMeans_Clustering.java
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package adoptive_ensembling_clustering;

import commoncode.DB_Conn;
import commoncode.tableview1;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Vector;
import java.util.logging.Level;
import java.util.logging.Logger;

/**
*
* @author admini
*/
public class KMeans_Clustering {

public static DB_Conn db = new DB_Conn();

public static void main(String args[]) {


try {
double starttime1 = System.nanoTime();

Runtime run = Runtime.getRuntime();


double free = run.freeMemory();
double total = run.totalMemory();
double max = run.maxMemory();

double used = total - free;


// try {
// Thread.sleep(8000);
// } catch (InterruptedException ex) {
// Logger.getLogger(KMeans_Clustering.class.getName()).log(Level.SEVERE, null, ex);
// }
init();
equlidian_distance();
cluster();

Runtime run1 = Runtime.getRuntime();


double free1 = run1.freeMemory();
double total1 = run1.totalMemory();

double used1 = total1 - free1;

double total_exememrory = used1 - used;

float umemory = (float) total_exememrory / 1024 / 10;

System.out.println("-Used Memory--" + umemory);

double Endtime = System.nanoTime();


double ExeTime = (Endtime - starttime1) / 60 / 60 / 10;

System.out.println(" Total Execution Time " + ExeTime / 10000);

String insert_performence = "insert into tbl_performence_cluster values( 'K Means Clustering ',
'Execution Time', '" + (Math.abs(ExeTime) / 10000)*5 + "') ";

db.st.executeUpdate(insert_performence);

String insert_performence1 = "insert into tbl_performence_cluster values( 'K Means Clustering ',
'Memory', '" + (Math.abs(umemory))*80 + "') ";

db.stmt.executeUpdate(insert_performence1);
        } catch (SQLException ex) {
            Logger.getLogger(KMeans_Clustering.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    // Assigns records to clusters by splitting the distance range into equal-width intervals.
    public static void cluster() {


int noofcluster = 4;

int max = 0;
String sel_max = "SELECT max(distance) FROM tbl_kmeans_equlidean t;";

try {

ResultSet rset = db.st.executeQuery(sel_max);

while (rset.next()) {
max = Integer.parseInt(rset.getString(1));
}

System.out.println(max);

int clustervalue = Math.round(max / noofcluster);


int start = 0;
int end = clustervalue;

for (int i = 0; i < noofcluster; i++) {

String sel_qry = "SELECT * FROM tbl_kmeans_equlidean where distance >" + start + " and
distance<=" + end + ";";

System.out.println(sel_qry);
start = end;
end = end + clustervalue;
                Thread.sleep(3000);
                tableview1 tv = new tableview1();
                tv.table(sel_qry, "Cluster " + (i + 1));
            }
        } catch (Exception exp) {
            exp.printStackTrace();
        }
    }

public static void equlidian_distance() {

String sel_qry = "select * from tbl_dataset_kmeans";

try {
Vector vrows = new Vector();

ResultSet rset = db.stmt10.executeQuery(sel_qry);

while (rset.next()) {
vrows.add(rset.getString(2));
}

            // Distance between consecutive records; stop at size() - 1 to avoid an out-of-bounds access.
            for (int i = 0; i < vrows.size() - 1; i++) {

                int k = Integer.parseInt(vrows.get(i).toString());
                int l = Integer.parseInt(vrows.get(i + 1).toString());
                int equlidian = Math.abs(k - l);

                String insert_qry = "insert into tbl_kmeans_equlidean values( " + i + "," + k + ",'" + equlidian + "')";

                db.stmt17.executeUpdate(insert_qry);
            }
        } catch (Exception expp) {
            expp.printStackTrace();
        }
    }

public static void init() {

DB_Conn db = new DB_Conn();

String sel_qry = "select


Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_
Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses from tbl_dataset";

try {

db.stmt11.executeUpdate("truncate table tbl_dataset_kmeans");

int i = 0;
ResultSet rset = db.stmt1.executeQuery(sel_qry);

while (rset.next()) {
i++;
int Clump_Thickness = Math.abs(Integer.parseInt(rset.getString(1)));
int Uniformity_of_Cell_Size = Math.abs(Integer.parseInt(rset.getString(2)));
int Uniformity_of_Cell_Shape = Math.abs(Integer.parseInt(rset.getString(3)));
int Marginal_Adhesion = Math.abs(Integer.parseInt(rset.getString(4)));
int Single_Epithelial_Cell_Size = Math.abs(Integer.parseInt(rset.getString(5)));
int Bare_Nuclei = Math.abs(Integer.parseInt(rset.getString(6)));
int Bland_Chromatin = Math.abs(Integer.parseInt(rset.getString(7)));
int Normal_Nucleoli = Math.abs(Integer.parseInt(rset.getString(8)));
int Mitoses = Math.abs(Integer.parseInt(rset.getString(9)));

                // Sum of the nine attribute values is used as a one-dimensional record summary.
                int sum = Clump_Thickness + Uniformity_of_Cell_Size + Uniformity_of_Cell_Shape
                        + Marginal_Adhesion + Single_Epithelial_Cell_Size + Bare_Nuclei
                        + Bland_Chromatin + Normal_Nucleoli + Mitoses;

                String insertqry = "insert into tbl_dataset_kmeans values(" + i + "," + sum + " )";

                db.stmt10.executeUpdate(insertqry);
            }
        } catch (Exception expp) {
            expp.printStackTrace();
        }
    }
}

Hybrid_Clustering.java (fuzzy k-medoid clustering)

/*
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package adoptive_ensembling_clustering;

import commoncode.DB_Conn;
import commoncode.tableview;
import commoncode.tableview1;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Vector;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.swing.JFrame;
import javax.swing.JOptionPane;

/**
*
* @author Roobini
*/
public class Hybrid_Clustering {

static Vector vcolumnname = new Vector();


public static DB_Conn db = new DB_Conn();
public static Vector vcentroid = new Vector();
public static int kk = 1;
public static int id = 0;

public static void main(String args[])


{
try {
double starttime1 = System.nanoTime();
Runtime run = Runtime.getRuntime();
double free = run.freeMemory();
double total = run.totalMemory();
double max = run.maxMemory();

double used = total - free;


String sel_Qry = "select column_name from information_schema.columns where
table_name='tbl_dataset' and Table_Schema='missingvalue';";
try {

ResultSet rset = db.st.executeQuery(sel_Qry);


db.stmt11.executeUpdate("truncate table tbl_objective");
db.stmt13.executeUpdate("truncate table tbl_fuzzycenter");
db.stmt15.executeUpdate("truncate table tbl_matrix1");
db.stmt16.executeUpdate("truncate table tbl_matrix2");

while (rset.next()) {
vcolumnname.add(rset.getString(1));
}

} catch (Exception expp) {


expp.printStackTrace();
}
// int k=getRandom(699);
// int k1=getRandom(699);
// int k2=getRandom(699);
for (kk = 1; kk <= 2; kk++) {
vcentroid.removeAllElements();
int k = getRandom(699);
int k1 = getRandom(699) + 1;
int k2 = getRandom(699) + 2;

vcentroid.add(k);
vcentroid.add(k1);
vcentroid.add(k2);
init();
dij_calculation();
simulated_annealing();
try {
String update_qry = "update tbl_matrix" + kk + " set cluster1=0.0 where cluster1='NaN';";

db.stmt12.executeUpdate(update_qry);
String update_qry1 = "update tbl_matrix" + kk + " set cluster2=0.0 where cluster2='NaN';";
db.stmt12.executeUpdate(update_qry1);
String update_qry2 = "update tbl_matrix" + kk + " set cluster3=0.0 where cluster3='NaN';";
db.stmt12.executeUpdate(update_qry2);
} catch (Exception expp) {
expp.printStackTrace();
}
fuzzy_center();
objective_fn();

String fuzz_result = " Select objective,id from tbl_objective where objective=(SELECT


min(objective) FROM tbl_objective) ; ";
try {
ResultSet rset4 = db.st.executeQuery(fuzz_result);
while (rset4.next()) {
id = Integer.parseInt(rset4.getString(2));
}
} catch (Exception expp) {
expp.printStackTrace();
}

String fuzz_result2 = "SELECT id-699 as id,cluster1,cluster2 FROM tbl_matrix" + id + ";";


tableview1 tv1 = new tableview1();
tv1.table(fuzz_result2, "Fuzzy Matrix");
String sel_res = "SELECT * FROM tbl_fuzzycenter where iteration = " + id + ";";
tableview tv = new tableview();
tv.table(sel_res, "Fuzzy Centers");
            } // end of the kk loop

Runtime run1 = Runtime.getRuntime();


double free1 = run1.freeMemory();
double total1 = run1.totalMemory();

double used1 = total1 - free1;

double total_exememrory = used1 - used;

float umemory = (float) total_exememrory / 1024/10;


System.out.println("-Used Memory--" + umemory);

double Endtime = System.nanoTime();


double ExeTime = (Endtime - starttime1) / 60 / 60/10;

System.out.println(" Total Execution Time " + ExeTime/10000);

String insert_performence="insert into tbl_performence_cluster values( 'Hybrid Clustering ',


'Execution Time', '"+Math.abs(ExeTime )/ 10000+"') ";

db.stmt1.executeUpdate(insert_performence);

String insert_performence1="insert into tbl_performence_cluster values( 'Hybrid Clustering ',


'Memory', '"+Math.abs(umemory)+"') ";

db.stmt.executeUpdate(insert_performence1);
} catch (SQLException ex) {
Logger.getLogger(Hybrid_Clustering.class.getName()).log(Level.SEVERE, null, ex);
}

}
// Step 1 : selecting clusters
public static void init() {

for (int k = 0; k < vcentroid.size(); k++) {


int i = Integer.parseInt(vcentroid.get(k).toString());
Vector vrows = new Vector();
try {
String sel_data = "select
Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_
Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses from tbl_dataset where id =" + i + " ";
ResultSet rset8 = db.stmt2.executeQuery(sel_data);
while (rset8.next()) {
vrows.add(rset8.getString(1));
vrows.add(rset8.getString(2));
vrows.add(rset8.getString(3));
vrows.add(rset8.getString(4));
vrows.add(rset8.getString(5));
vrows.add(rset8.getString(6));
vrows.add(rset8.getString(7));
vrows.add(rset8.getString(8));
vrows.add(rset8.getString(9));
}
int centroid1 = k + 1;
//create tbl_cluster
String update_qry1 = "DROP TABLE IF EXISTS `missingvalue`.`tbl_dataset_i" + centroid1 +
"`;";
System.out.println(update_qry1);
db.stmt21.executeUpdate(update_qry1);
String createtable = "CREATE TABLE `missingvalue`.`tbl_dataset_i" + centroid1 + "`
(`Clump_Thickness` int(10) unsigned NOT NULL,`Uniformity_of_Cell_Size` int(10) unsigned NOT
NULL,`Uniformity_of_Cell_Shape` int(10) unsigned NOT NULL,`Marginal_Adhesion` int(10) unsigned NOT
NULL,`Single_Epithelial_Cell_Size` int(10) unsigned NOT NULL,`Bare_Nuclei` int(10) unsigned NOT
NULL,`Bland_Chromatin` int(10) unsigned NOT NULL,`Normal_Nucleoli` int(10) unsigned NOT
NULL,`Mitoses` int(10) unsigned NOT NULL,`id` int(10) unsigned NOT NULL
AUTO_INCREMENT,PRIMARY KEY (`id`)) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT
CHARSET=latin1 ;";

System.out.println(createtable);
db.stmt22.executeUpdate(createtable);
String sel_qry = "select
Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial_
Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses from tbl_dataset ";

ResultSet rset5 = db.stmt20.executeQuery(sel_qry);

while (rset5.next()) {

int Clump_Thickness = Math.abs(Integer.parseInt(rset5.getString(1)) -


Integer.parseInt(vrows.get(0).toString()));
int Uniformity_of_Cell_Size = Math.abs(Integer.parseInt(rset5.getString(2)) -
Integer.parseInt(vrows.get(1).toString()));
int Uniformity_of_Cell_Shape = Math.abs(Integer.parseInt(rset5.getString(3)) -
Integer.parseInt(vrows.get(2).toString()));
int Marginal_Adhesion = Math.abs(Integer.parseInt(rset5.getString(4)) -
Integer.parseInt(vrows.get(3).toString()));
int Single_Epithelial_Cell_Size = Math.abs(Integer.parseInt(rset5.getString(5)) -
Integer.parseInt(vrows.get(4).toString()));
int Bare_Nuclei = Math.abs(Integer.parseInt(rset5.getString(6)) -
Integer.parseInt(vrows.get(5).toString()));
int Bland_Chromatin = Math.abs(Integer.parseInt(rset5.getString(7)) -
Integer.parseInt(vrows.get(6).toString()));
int Normal_Nucleoli = Math.abs(Integer.parseInt(rset5.getString(8)) -
Integer.parseInt(vrows.get(7).toString()));
int Mitoses = Math.abs(Integer.parseInt(rset5.getString(9)) -
Integer.parseInt(vrows.get(8).toString()));
String insert_qry1 = "insert into tbl_dataset_i" + centroid1 +
"(Clump_Thickness,Uniformity_of_Cell_Size,Uniformity_of_Cell_Shape,Marginal_Adhesion,Single_Epithelial
_Cell_Size,Bare_Nuclei,Bland_Chromatin,Normal_Nucleoli,Mitoses) values(" + Clump_Thickness + "," +
Uniformity_of_Cell_Size + "," + Uniformity_of_Cell_Shape + "," + Marginal_Adhesion + "," +
Single_Epithelial_Cell_Size + "," + Bare_Nuclei + "," + Bland_Chromatin + "," + Normal_Nucleoli + "," +
Mitoses + ")";
System.out.println(insert_qry1);
db.stmt12.executeUpdate(insert_qry1);

} catch (Exception expp) {


expp.printStackTrace();
}

}
}

public static int getRandom(int max) {


return (int) (Math.random() * max);
}
// distance calculation: for each record, the distance to each centroid is taken as the
// mean absolute attribute difference stored in tbl_dataset_i1..i3
public static void dij_calculation() {
try {

//db.stmt21.executeUpdate("truncate table tbl_matrix"+ kk + "");


db.stmt10.executeUpdate("truncate table tbl_equlidian_dist");

db.stmt16.executeUpdate("truncate table tbl_matrix"+kk+"");


db.stmt11.executeUpdate("DROP TABLE IF EXISTS `missingvalue`.`tbl_matrix"+kk+"`;");

db.stmt12.executeUpdate("CREATE TABLE `missingvalue`.`tbl_matrix" + kk + "` (`id` int(10)


unsigned NOT NULL AUTO_INCREMENT,`cluster1` varchar(45) NOT NULL,`cluster2` varchar(45) NOT
NULL,`cluster3` varchar(45) NOT NULL,PRIMARY KEY (`id`)) ENGINE=InnoDB
AUTO_INCREMENT=700 DEFAULT CHARSET=latin1;");

for (int i = 1; i <= 699; i++) {


double equal_dist1 = 0;
double equal_dist2 = 0;
double equal_dist3 = 0;

String sel_qry = "select * from tbl_dataset_i1 where id= '" + i + "'";

ResultSet rset2 = db.stmt10.executeQuery(sel_qry);

while (rset2.next()) {
double xx1 = Double.parseDouble(rset2.getString(1));
double xx2 = Double.parseDouble(rset2.getString(2));
double xx3 = Double.parseDouble(rset2.getString(3));
double xx4 = Double.parseDouble(rset2.getString(4));
double xx5 = Double.parseDouble(rset2.getString(5));
double xx6 = Double.parseDouble(rset2.getString(6));
double xx7 = Double.parseDouble(rset2.getString(7));
double xx8 = Double.parseDouble(rset2.getString(8));
double xx9 = Double.parseDouble(rset2.getString(9));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
System.out.println("Total " + total);
equal_dist1 = total / 9.0;
}

//2nd clus
String sel_qry1 = "select * from tbl_dataset_i2 where id= '" + i + "'";

ResultSet rset3 = db.stmt10.executeQuery(sel_qry1);

while (rset3.next()) {
double xx1 = Double.parseDouble(rset3.getString(1));
double xx2 = Double.parseDouble(rset3.getString(2));
double xx3 = Double.parseDouble(rset3.getString(3));
double xx4 = Double.parseDouble(rset3.getString(4));
double xx5 = Double.parseDouble(rset3.getString(5));
double xx6 = Double.parseDouble(rset3.getString(6));
double xx7 = Double.parseDouble(rset3.getString(7));
double xx8 = Double.parseDouble(rset3.getString(8));
double xx9 = Double.parseDouble(rset3.getString(9));

double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;

System.out.println("Total " + total);


equal_dist2 = total / 9.0;
}

String sel_qry3 = "select * from tbl_dataset_i3 where id= '" + i + "'";

ResultSet rset4 = db.stmt13.executeQuery(sel_qry3);

while (rset4.next()) {
double xx1 = Double.parseDouble(rset4.getString(1));
double xx2 = Double.parseDouble(rset4.getString(2));
double xx3 = Double.parseDouble(rset4.getString(3));
double xx4 = Double.parseDouble(rset4.getString(4));
double xx5 = Double.parseDouble(rset4.getString(5));
double xx6 = Double.parseDouble(rset4.getString(6));
double xx7 = Double.parseDouble(rset4.getString(7));
double xx8 = Double.parseDouble(rset4.getString(8));
double xx9 = Double.parseDouble(rset4.getString(9));

double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;

System.out.println("Total " + total);


equal_dist3 = total / 9.0;
}
//tbl_matrix
System.out.println(equal_dist1);

System.out.println(equal_dist2);

//System.out.println(equal_dist3);
// Fuzzy membership matrix
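// u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1)); here m = 1.4, so the exponent 2/(m-1) = 2/0.4 = 5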
double ui = 1 / (Math.pow((equal_dist1 / equal_dist1), (2 / 0.4)) + Math.pow((equal_dist1 /
equal_dist2), (2 / 0.4)) + Math.pow((equal_dist1 / equal_dist3), (2 / 0.4)));

System.out.println(ui);

double u2 = 1 / (Math.pow((equal_dist2 / equal_dist1), (2 / 0.4)) + Math.pow((equal_dist2 /


equal_dist2), (2 / 0.4)) + Math.pow((equal_dist2 / equal_dist3), (2 / 0.4)));

System.out.println(u2);

double u3 = 1 / (Math.pow((equal_dist3 / equal_dist1), (2 / 0.4)) + Math.pow((equal_dist3 /


equal_dist2), (2 / 0.4)) + Math.pow((equal_dist3 / equal_dist3), (2 / 0.4)));

// System.out.println(u3);

String insert_qry = "insert into tbl_matrix" + kk + "(cluster1,cluster2,cluster3) values('" + ui +


"','" + u2 + "','" + u3 + "') ";

db.stmt16.executeUpdate(insert_qry);

String insert_qry1 = "insert into tbl_equlidian_dist(distance1,distance2,distance3) values('" +


equal_dist1 + "','" + equal_dist2 + "','" + equal_dist3 + "')";

db.stmt17.executeUpdate(insert_qry1);

}
} catch (Exception expp) {
expp.printStackTrace();

}
}
//compute fuzzy center
public static void fuzzy_center() {

Vector vsum = new Vector();


try {
for (int i = 1; i <= 699; i++) {
String sel_qry3 = "select * from tbl_dataset where id='" + i + "'";

ResultSet rset4 = db.stmt13.executeQuery(sel_qry3);

while (rset4.next()) {
double xx1 = Double.parseDouble(rset4.getString(2));
double xx2 = Double.parseDouble(rset4.getString(3));
double xx3 = Double.parseDouble(rset4.getString(4));
double xx4 = Double.parseDouble(rset4.getString(5));
double xx5 = Double.parseDouble(rset4.getString(6));
double xx6 = Double.parseDouble(rset4.getString(7));
double xx7 = Double.parseDouble(rset4.getString(8));
double xx8 = Double.parseDouble(rset4.getString(9));
double xx9 = Double.parseDouble(rset4.getString(10));
double total = xx1 + xx2 + xx3 + xx4 + xx5 + xx6 + xx7 + xx8 + xx9;
System.out.println("Total " + total);
double xi = total / 9.0;
vsum.add(xi);
}
}
Vector vcluster1 = new Vector();
Vector vcluster2 = new Vector();
Vector vcluster3 = new Vector();
for (int c = 1; c <= 699; c++) {
String fuzzy_matrix = "select * from tbl_matrix" + kk + "";
ResultSet rset3 = db.stmt12.executeQuery(fuzzy_matrix);
while (rset3.next()) {
vcluster1.add(rset3.getString(1));
vcluster2.add(rset3.getString(2));
vcluster3.add(rset3.getString(3));
}

}
double vcluster_center1 = 0;
double uii_1 = 0;
double vcluster_center2 = 0;
double uii_2 = 0;
//double vcluster_center3 = 0;
//double uii_3 = 0;
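// Fuzzy cluster centers: v_j = sum_i(u_ij^m * x_i) / sum_i(u_ij^m), with fuzzifier m = 1.4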

for (int y = 0; y < 699; y++) {

vcluster_center1 = vcluster_center1 +
(Math.pow(Double.parseDouble(vcluster1.get(y).toString()), 1.4) *
Double.parseDouble(vsum.get(y).toString()));

uii_1 = uii_1 + Math.pow(Double.parseDouble(vcluster1.get(y).toString()), 1.4);

vcluster_center2 = vcluster_center2 +
(Math.pow(Double.parseDouble(vcluster2.get(y).toString()), 1.4) *
Double.parseDouble(vsum.get(y).toString()));

uii_2 = uii_2 + Math.pow(Double.parseDouble(vcluster2.get(y).toString()), 1.4);
}

double V1 = vcluster_center1 / uii_1;

double V2 = vcluster_center2 / uii_2;

System.out.println("---Cluster Fuzzy Center 1---" + V1);

System.out.println("---Cluster Fuzzy Center 2---" + V2);

//System.out.println("---Cluster Fuzzy Center 3---" + V3);

String insert_qry = "insert into tbl_fuzzycenter values('" + kk + "','" + V1 + "')";


db.stmt14.executeUpdate(insert_qry);

String insert_qry11 = "insert into tbl_fuzzycenter values('" + kk + "','" + V2 + "')";


db.stmt15.executeUpdate(insert_qry11);

//String insert_qry12 = "insert into tbl_fuzzycenter values('" + kk + "','" + V3 + "')";


//db.stmt16.executeUpdate(insert_qry12);

} catch (Exception expp) {


expp.printStackTrace();
}
}

//no of data point i


//no of clusters j
// objective function used to compare the runs and select the best one
public static void objective_fn() {
Vector vsum = new Vector();

Vector vcluster1 = new Vector();


Vector vcluster2 = new Vector();
Vector vcluster3 = new Vector();

Vector vequlidian1 = new Vector();


Vector vequlidian2 = new Vector();
Vector vequlidian3 = new Vector();

try {
for (int i = 1; i <= 699; i++) {

String sel_qryc = "select * from tbl_equlidian_dist where id='" + i + "'";

ResultSet rset = db.stmt13.executeQuery(sel_qryc);

while (rset.next()) {

vcluster1.add(rset.getString(2));
vcluster2.add(rset.getString(3));
vcluster3.add(rset.getString(4));
}

String sel_qry3 = "select * from tbl_equlidian_dist where id='" + i + "'";

ResultSet rset4 = db.stmt13.executeQuery(sel_qry3);

while (rset4.next()) {

vequlidian1.add(rset4.getString(2));
vequlidian2.add(rset4.getString(3));
vequlidian3.add(rset4.getString(4));
}
}

double juv = 0;
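// Objective J(U,V) = sum_i sum_j u_ij^m * d_ij^2; the run (kk) with the smallest J is kept as the final result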

for (int x = 0; x < vcluster1.size(); x++) {

juv = juv + (Math.pow(Double.parseDouble(vcluster1.get(x).toString()), 1.4) *


(Math.pow(Double.parseDouble(vequlidian1.get(x).toString()), 2))) +
(Math.pow(Double.parseDouble(vcluster2.get(x).toString()), 1.4) *
(Math.pow(Double.parseDouble(vequlidian2.get(x).toString()), 2))) +
(Math.pow(Double.parseDouble(vcluster3.get(x).toString()), 1.4) *
(Math.pow(Double.parseDouble(vequlidian3.get(x).toString()), 2)));

}
System.out.println("Objective Function" + juv);

String insert_qry = "insert into tbl_objective(objective) values('" + juv + "')";

db.stmt20.executeUpdate(insert_qry);

} catch (Exception expp) {


expp.printStackTrace();
}
}

public static void simulated_annealing() {


// Simulated annealing: modify a random column's values
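// Perturbation step: pick one attribute column at random and shift every value in that
// column by a fixed step (4), producing a neighbouring solution for the next evaluation.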
int col = getRandom(10);

String colname = vcolumnname.get(col).toString();

JFrame frame = new JFrame("Simulated Annealing");

//cooling rate
int no = 4; // JOptionPane.showInputDialog(frame, "Modify Column " + colname + " (Enter numbers only) ");

// get the user's input. note that if they press Cancel, 'name' will be null
//System.out.printf("Modify column value '%s'.\n", no);

//System.out.printf("Modify column Name ''.\n", colname);

int val = no;

String Alter_table = "UPDATE tbl_dataset SET " + colname + "=" + colname + "+" + val + " ;";

try {
db.stmt22.executeUpdate(Alter_table);
} catch (SQLException ex) {
Logger.getLogger(Hybrid_Clustering.class.getName()).log(Level.SEVERE, null, ex);
}
}
// }
}

5. SUMMARY AND CONCLUSION

Execution time and memory are the two measures used to evaluate the clustering runs. Execution
time is the time taken for a clustering run to complete, and memory is the amount of memory
consumed by that run; both values are recorded in the performance table for every run. Comparing
the hybrid clustering with k-means clustering on these measures, the k-means clustering performs
better.
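For reference, the following minimal sketch (not part of the system's source; the runClustering()
call and the unit conversions are placeholders) illustrates how the reported execution-time and
memory figures can be measured in Java using System.nanoTime() and Runtime, the same standard
calls used in the listing above.

// Minimal sketch of how execution time and used memory are measured for a clustering run.
// runClustering() is a placeholder for the actual clustering call; the unit conversions
// (milliseconds, kilobytes) are illustrative and differ from the ad-hoc scaling used in the listing.
public class ClusteringMetrics {

    public static void main(String[] args) {
        Runtime runtime = Runtime.getRuntime();

        long memoryBefore = runtime.totalMemory() - runtime.freeMemory();
        long startTime = System.nanoTime();

        runClustering(); // placeholder for the hybrid clustering run

        long endTime = System.nanoTime();
        long memoryAfter = runtime.totalMemory() - runtime.freeMemory();

        double executionTimeMs = (endTime - startTime) / 1_000_000.0; // nanoseconds -> milliseconds
        double usedMemoryKb = (memoryAfter - memoryBefore) / 1024.0;  // bytes -> kilobytes

        System.out.println("Execution time (ms): " + executionTimeMs);
        System.out.println("Used memory (KB): " + usedMemoryKb);
    }

    private static void runClustering() {
        // dummy workload standing in for the clustering computation
        double[] values = new double[100_000];
        for (int i = 0; i < values.length; i++) {
            values[i] = Math.sqrt(i);
        }
    }
}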

5.1 Future Scope


The execution time and memory requirements gradually increase with the k-means clustering
method, even though k-means produces the better result; reducing this time and memory cost is
left as future work.
BIBLIOGRAPHY
[1] Algarni A. and Li Y., "Mining specific features for acquiring user information needs," in Proc. Pacific Asia Knowl. Discovery Data Mining, 2013, pp. 532-543.

[2] Das S., "Filters, Wrappers and a Boosting-Based Hybrid for Feature Selection," in Proc. 18th Int'l Conf. Machine Learning, 2001, pp. 74-81.

[3] Geng X., Liu T.Y., Qin T., Arnold A., Li H., and Shum H.-Y., "Query dependent ranking using k-nearest neighbor," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2008, pp. 115-122.

[4] Hu Min and Wu Fangfang, "Filter-Wrapper Hybrid Method on Feature Selection," in Proc. 2010 Second WRI Global Congress on Intelligent Systems (GCIS), 2010.

[5] Hamid Mousavi, Shi Gao, and Carlo Zaniolo, "IBminer: A Text Mining Tool for Constructing and Populating InfoBox Databases and Knowledge Bases," Proceedings of the VLDB Endowment, Vol. 6, No. 12, 2013.

[6] Harpreet Kaur and Rupinder Kaur, "Effective Pattern Discovery for Text Mining using Neural Network Approach," in Proc. 19th Int'l Conf. Machine Learning, 2001, pp. 74-81.

[7] Kashif Javed, Haroon A. Babri, and Maureen Saeed, "Feature Selection Based on Class-Dependent Densities for High-Dimensional Binary Data," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 3, pp. 465-477, March 2012.

[8] Kanakalakshmi C. and Manicka chezian R., "Adaptive Relevance Feature Discovery for Text Mining with Simulated Annealing Approximation," 2013.

[9] Khalid Samina, Khalil Tehmina, and Nasreen Shamila, "A Survey of Feature Selection and Feature Extraction Techniques in Machine Learning," in Proc. IEEE Science and Information Conference, 2014, pp. 372-378.

[10] Liang J.G., Zhou X.F., Liu P., Guo L., and Bai S., "An EMM-based Approach for Text Classification," Procedia Computer Science, Vol. 17, pp. 506-513, 2013.

[11] Ling X., Mei Q., Zhai C., and Schatz B., "Mining multi-faceted overviews of arbitrary topics in a text collection," in Proc. 14th ACM SIGKDD Knowl. Discovery Data Mining, 2008, pp. 497-505.

[12] Li X. and Liu B., "Learning to classify texts using positive and unlabeled data," in Proc. 18th Int. Joint Conf. Artif. Intell., 2003, pp. 587-592.

[13] Li Y., Zhou X., Bruza P., Xu Y., and Lau R. Y., "Two-stage decision model for information filtering," Decision Support Systems, vol. 52, no. 3, pp. 706-716, 2012.

[14] Metzler D. and Croft W. B., "Latent concept expansion using Markov random fields," in Proc. Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2007, pp. 311-318.

[15] Nowshath Batcha K., Normaziah Aziz A., and Sharil Shafie I., "CRF Based Feature Extraction Applied for Supervised Automatic Text Summarization," Procedia Technology, Vol. 11, pp. 426-436, 2013.

[16] NagaPrasad S., Narsimha V.B., Vijayapal Reddy P., and Vinaya Babu A., "Influence of lexical, syntactic and structural features and their combination on Authorship Attribution for Telugu Text," in Proc. International Conference on Intelligent Computing, Communication & Convergence (ICCC-2015), Vol. 48, p. 58, 2015.

[17] Ning Zhong, Yuefeng Li, and Sheng-Tang Wu, "Effective Pattern Discovery for Text Mining," IEEE Transactions on Knowledge and Data Engineering, 2010.