
IPASJ International Journal of Computer Science (IIJCS)

Web Site: http://www.ipasj.org/IIJCS/IIJCS.htm


Email: editoriijcs@ipasj.org
ISSN 2321-5992

A Publisher for Research Motivation ........

Volume 3, Issue 5, May 2015

Predicting Protein-Protein Interactions through Associative Classification Technique

1 Lakshmi Priya, 2 Dr. Shomona Gracia Jacob

1 M.E (Software Engineering), SSN College of Engineering, Old Mahabalipuram Road, Kalavakkam 603 110, Tamil Nadu, India.
2 Associate Professor, SSN College of Engineering, Old Mahabalipuram Road, Kalavakkam 603 110, Tamil Nadu, India.

ABSTRACT
Discovering Protein-Protein Interactions (PPI) is a new and interesting challenge in computational biology. Identifying
interactions between HIV-1 proteins and human proteins is a particular PPI problem whose study might lead to the discovery of
drugs and of the important interactions leading to AIDS. The HIV-1-human protein interaction network is analysed using the
available datasets. Since biclustering approaches lead to loss of data, they need to be enhanced to prevent that loss. With this
motivation, this paper aims to predict new interactions with the Associative Classification (AC) technique.

Keywords - Protein-Protein Interactions (PPI), Acquired Immune Deficiency Syndrome (AIDS), Association Rule Mining
(ARM), HIV-Human Protein-Protein Interaction (HHPPI), Associative Classification technique (AC).

1. INTRODUCTION
Acquired Immune Deficiency Syndrome (AIDS) is the last stage of HIV infection. At this stage, the human immune system
fails to protect the body from infection, and this eventually leads to death. HIV is a member of the retrovirus family
(lentivirus) which infects important cells in the human immune system. HIV-1 is a species of HIV that relies on
human host cell proteins in virtually every phase of its life cycle. Infection proceeds through interactions between
proteins of the virus and proteins of the human host inside human cells. One of the main goals of Protein-Protein
Interaction (PPI) research is to predict possible viral-host interactions. This is specifically aimed at assisting drug developers
who target protein interactions in order to design small molecules that inhibit potential HIV-1-human
PPIs. Targeting protein-protein interactions has recently been established as a promising alternative to the conventional
approach to drug design. There are several computational approaches for predicting PPIs. Most of these approaches are
used for determining PPIs within a single organism, such as yeast or human. Therefore, in most works in this
area, negative samples are prepared by taking random protein pairs which are not found in the interaction database.
1.1 Related Work
Human immunodeficiency virus (HIV) is a lentivirus (a member of the retrovirus family with a long incubation period) that
can lead to Acquired Immunodeficiency Syndrome (AIDS), a condition in humans in which the immune system begins to
fail, leading to life-threatening infection. Various approaches for predicting interactions have been studied in the literature.
These approaches are based on Bayesian networks [1], random forest classifiers [2], and mixtures of feature expert classifiers [3].
Recently, two approaches have been proposed to predict the set of interactions between HIV-1 and human host cellular
proteins [4].
This paper proposes a methodology that identifies the best association rules and classifies the data into
interacting and non-interacting proteins with better accuracy.

2. MATERIALS AND METHODS

2.1 Materials
The interaction information reported between HIV-1 and human proteins was collected from a recently
published PPI data set. There are a total of 19 HIV-1 proteins and 1432 human proteins. A binary matrix
of size 19 × 1432 was constructed. An entry of 1 in the matrix denotes the presence of an interaction between the
corresponding pair of HIV-1 and human proteins, and an entry of 0 represents the absence of any information regarding
the interaction of the corresponding viral and human proteins. The resulting binary matrix is treated as the input to the
ARM algorithm.
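The construction of the binary matrix and its conversion into transactions can be sketched as follows. This is a minimal Python illustration, not the paper's implementation: the function names `build_binary_matrix` and `to_transactions` are hypothetical, the protein names and pairs are placeholders, and the sketch assumes the row-as-item, column-as-transaction layout that the prototype description uses.

```python
def build_binary_matrix(row_names, col_names, interactions):
    """Return a len(row_names) x len(col_names) 0/1 matrix as a list of lists."""
    row_index = {name: i for i, name in enumerate(row_names)}
    col_index = {name: j for j, name in enumerate(col_names)}
    matrix = [[0] * len(col_names) for _ in row_names]
    for row_protein, col_protein in interactions:
        # 1 = interaction reported; 0 = no information (not proof of absence).
        matrix[row_index[row_protein]][col_index[col_protein]] = 1
    return matrix


def to_transactions(matrix, row_names):
    """Treat each column as one transaction holding the row items marked 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [
        frozenset(row_names[i] for i, row in enumerate(matrix) if row[j] == 1)
        for j in range(n_cols)
    ]


# Placeholder names and pairs, invented for illustration only.
viral = ["V1", "V2", "V3"]
human = ["H1", "H2", "H3", "H4"]
reported = [("V1", "H1"), ("V2", "H1"), ("V1", "H2"), ("V3", "H3")]

matrix = build_binary_matrix(viral, human, reported)
transactions = to_transactions(matrix, viral)
```

With the real data the same two calls would produce a 19 × 1432 matrix and 1432 transactions over the 19 viral proteins; swapping the row and column arguments gives the transposed human-viral view.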


2.2 Methods
In this paper, a system is proposed based on the Associative Classification (AC) technique. It aims to
reduce misclassification and to predict previously unknown interactions with the help of the known data sets
available. The interestingness of the rules is measured using support and confidence on the known data sets. AC
integrates two known data mining tasks, association rule discovery and classification, to build a model for the purpose of
prediction. Classification and association rule discovery are similar tasks in data mining, with the exception that the main
aim of classification is the prediction of class labels, while association rule discovery describes correlations between items in
a transactional database. An AC model is represented as simple if-then rules, which makes it easy for the end user to
understand and interpret. In the case of categorical attributes, all possible values are mapped to a set of positive integers. For
continuous attributes, a discretization method is used. The main task of AC is to construct a set of
rules that is able to predict the classes of previously unseen data, known as the test data set, as accurately as possible. The
goal is to find a classifier that maximizes the probability of interaction for each test object.
Association rule mining is an efficient data mining technique for finding hidden or desired patterns in
voluminous data. It aims at detecting correlations among data attributes in a large set of items in a database.
Associations across the item sets are determined by association rule mining. Association analysis is the detection of
hidden patterns or conditions that frequently occur together in the given data.
Association rule mining techniques find interesting associations and correlations in a data set. An association rule
expresses an association relationship among objects or items, for example whether
data items occur together with other data items, and how often. These rules are computed from the data and are
evaluated with the help of probability. Support and confidence are measures of interestingness. Association rules are regarded as
interesting if both a minimum support threshold and a minimum confidence threshold are satisfied. Boolean association rule mining is more
extensively used than other kinds of association rule mining.
Apriori [5] uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item
sets of length k from item sets of length k − 1. Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the
transaction database to determine frequent item sets among the candidates. Candidate generation generates large numbers
of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset
exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all $2^{|S|} - 1$ of
its proper subsets.
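The candidate generation and pruning steps described above can be illustrated with a short Python sketch. It is a simplified level-wise version written for readability, not the tree-based counting the text mentions, and the function names `generate_candidates` and `apriori` are chosen here for illustration.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Join frequent (k-1)-item sets into k-item sets, then prune by downward closure."""
    frequent_prev = set(frequent_prev)
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) != k:
                continue
            # Keep the candidate only if every (k-1)-subset is itself frequent.
            if all(frozenset(s) in frequent_prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set to its support."""
    n = len(transactions)

    def supp(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = supp(itemset)
        k += 1
        # One database pass per level to test the surviving candidates.
        current = {c for c in generate_candidates(current, k) if supp(c) >= min_support}
    return frequent


# Toy transactions over abstract items, for illustration only.
toy = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("abc")]
toy_frequent = apriori(toy, min_support=0.5)
```

In the toy data every single item and every pair is frequent at a minimum support of 0.5, while the triple {a, b, c} appears in only two of five transactions and is rejected after the database scan.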
The main bottleneck of the Apriori algorithm is candidate set generation and testing. This problem was dealt with by
introducing a novel, compact data structure called the frequent pattern tree, or FP-tree. Based on this structure, an FP-tree-based pattern fragment growth method called FP-growth was developed.
FP-growth does not require candidate generation; instead it stores the transaction database in an efficient structure, the FP-tree (Frequent Pattern
tree). It scans the database once to find frequent items. Frequent items are then sorted in
descending order of support count and kept in a list. Another scan of the database is then performed, and for each transaction,
infrequent items are suppressed and the remaining items are sorted in that order and inserted in the FP-tree.
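The two-scan construction of the FP-tree can be sketched in Python as below. This covers only tree building, not the recursive pattern-growth mining step, and the names `FPNode` and `build_fp_tree` as well as the alphabetical tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count items and keep only the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    # Descending support count; ties broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda item: (-frequent[item], item))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None)
    # Second scan: drop infrequent items, sort the rest, insert as a path.
    for t in transactions:
        path = sorted((item for item in set(t) if item in rank), key=rank.get)
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, order


# Toy transactions over abstract items, for illustration only.
toy_db = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b"]]
tree, item_order = build_fp_tree(toy_db, min_count=2)
```

Because transactions that share a prefix of frequent items share a path, the tree is usually much smaller than the database, which is what lets FP-growth avoid repeated candidate scans.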
Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the classifier
processes a training set containing a set of attributes and the respective outcome, usually called the prediction attribute. The
classifier tries to discover relationships between the attributes that would make it possible to predict the outcome. The rules
obtained from association rule mining are the input to the rule-based classifier. The training data consists of the interacting HIV and
human proteins obtained from association rule mining. The test data consists of sample HIV
and human proteins, which are to be classified as interacting or non-interacting. The accuracy is determined on the test
data.
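The paper does not spell out the matching procedure of the rule-based classifier, so the following Python sketch shows one plausible reading: a target protein is predicted as interacting when some mined rule with that consequent has its whole antecedent present among the known interactors. The function names, the matching rule, and the test samples are assumptions for illustration; only the two rules are taken from Table 1.

```python
def predict_interaction(rules, known_interactors, target):
    """True if some rule with consequent `target` has its antecedent satisfied."""
    known = frozenset(known_interactors)
    return any(
        consequent == target and frozenset(antecedent) <= known
        for antecedent, consequent, _confidence in rules
    )


def accuracy(rules, samples):
    """Fraction of (known_interactors, target, label) samples predicted correctly."""
    correct = sum(
        predict_interaction(rules, known, target) == label
        for known, target, label in samples
    )
    return correct / len(samples)


# Two rules taken from Table 1: (antecedent, consequent, confidence).
rules = [
    ({"Vpr", "nucleocapsid"}, "Tat", 0.95),
    ({"env_gp41", "retropepsin"}, "Tat", 0.95),
]

# Test samples and labels are invented for illustration only.
samples = [
    ({"Vpr", "nucleocapsid", "matrix"}, "Tat", True),
    ({"Vpr"}, "Tat", False),
    ({"env_gp41", "retropepsin"}, "Tat", True),
    ({"Nef"}, "Tat", True),
]
score = accuracy(rules, samples)
```

Here three of the four invented samples are classified correctly; the last one is missed because no rule covers it, which suggests why considering more rules could raise accuracy.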

Figure 1 Steps of Associative classification


3. PROPOSED METHODOLOGY
The proposed methodology comprises data pre-processing, ARM execution and classification. The proposed methodology
is depicted in Figure 2.

Figure 2 Associative Classification Framework


The data was collected and pre-processed. Association rule mining was then applied, and the best rules
were generated using the support and confidence parameters. These rules are the input for
classification. A rule-based classifier is proposed and is used to classify the proteins into interacting and
non-interacting protein pairs.

Figure 3 Algorithms of Association rule and classification


3.1 PPI Data Description
The HIV and human proteins are considered as the input training data. There are 19 HIV-1 proteins and 1432 human
proteins. The output consists of two groups, the interactive predictions and the non-interactive predictions,
based on the best association rules generated.

4. RESULTS AND DISCUSSION

Performance metrics of the association rules mined with the FP-growth algorithm were obtained based on the following parameters:
4.1 Support
The support of a rule with antecedent X and consequent T, written Support(X, T), is the number of tuples containing both X and T
divided by the total number of tuples.


$$\mathrm{Support}(X, T) = P(X \cup T)$$
4.2 Confidence
The confidence of an association rule A → B is the ratio of the number of transactions that include all items in the consequent as
well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The
confidence of a rule is defined as:

$$\mathrm{Confidence}(A \Rightarrow B) = \frac{P(A \cup B)}{P(A)}$$
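As a worked illustration of the two measures, the Python sketch below computes support and confidence over a small set of transactions. The function names and the toy transactions are assumptions for this example and do not come from the paper's data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    joint = support(frozenset(antecedent) | frozenset(consequent), transactions)
    base = support(antecedent, transactions)
    return joint / base if base > 0 else 0.0


# Toy transactions over abstract items, for illustration only.
toy_transactions = [
    frozenset({"A", "B"}),
    frozenset({"A", "B", "C"}),
    frozenset({"A"}),
    frozenset({"B", "C"}),
    frozenset({"A", "B"}),
]
supp_ab = support({"A", "B"}, toy_transactions)
conf_a_to_b = confidence({"A"}, {"B"}, toy_transactions)
```

In this toy set, A and B occur together in three of five transactions and A occurs in four, so the rule A → B has support 3/5 and confidence 3/4.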

Results obtained from the association rules are tabulated together with their confidence values. This research aims at exploring
the FP-growth algorithm to mine more novel interaction patterns that can be extended to viral interactions of diverse kinds.
Based on the results tabulated in Tables 1 and 2, the best rules of the FP-growth algorithm were extracted with a minimum confidence of 0.8.
Table 1: Performance metrics of HIV and human proteins dataset (FP-growth algorithm)

| BEST RULES | CONFIDENCE |
|---|---|
| env_gp120, Nef, env_gp41 → Tat | 0.98 |
| env_gp120, Nef, env_gp160, env_gp41 → Tat | 0.96 |
| env_gp41, retropepsin → Tat | 0.95 |
| Vpr, nucleocapsid → Tat | 0.95 |
| Tat, Vpr, env_gp160 → env_gp120 | 0.94 |
| env_gp120, Vpr, env_gp160 → Tat | 0.94 |
| env_gp120, Vpr, matrix → Tat | 0.94 |
| Tat, Nef, env_gp160 → env_gp120 | 0.92 |
| Tat, Nef, env_gp41 → env_gp120 | 0.92 |
| env_gp120, Nef, Vpr → Tat | 0.9 |

Table 2: Performance metrics of human and HIV proteins dataset (FP-growth algorithm)

| BEST RULES | CONFIDENCE |
|---|---|
| ACTG1 → ACTB | 0.88 |
| ACTB → ACTG1 | 0.88 |
| CASP3 → CD4 | 0.86 |
| BCL2 → CD4 | 0.83 |
| PARP1 → CASP3 | 0.83 |
| CD28 → CASP3 | 0.83 |
| CD28 → CASP3 | 0.83 |
| PARP1 → BCL2 | 0.83 |
| BCL2 → PARP1 | 0.83 |
| CD28 → BCL2 | 0.83 |
From the above tables, it is evident that every confidence value is less than 1. A confidence value below
1 means that, for some transactions, the antecedent holds while the consequent interaction is not recorded, so there is a
possibility of finding new, unknown interactions between the HIV and human proteins.

5. PROTOTYPE AND IMPLEMENTATION OF THE PROPOSED WORK


The prototype consists of four modules, namely preprocessed data upload, executing FP-growth algorithm, best rules and
identifying predictions.


5.1 Prototype of the proposed work


5.1.1 Preprocessed data upload
The input data, consisting of HIV-human protein interactions, was in the form of a binary matrix. The rows and
columns of the matrix correspond to human and viral proteins respectively, or vice versa. Each row represents an item and each column
represents a transaction. There are two matrices: the Viral-Human input data and the Human-Viral input data. The
input data file, in .csv format, was uploaded.

Figure 4 Upload Input file


5.1.2 Data preprocessing
The input data is organised into a binary matrix of human and viral proteins, of size 1432 × 19, in which an entry of 1
denotes the presence of a regulating interaction between the corresponding pair of human and HIV-1 proteins. Two matrices
are used: the HIV-Human input data and the Human-HIV input data.

Figure 5 HIV-human input dataset

Figure 6 Human-HIV dataset


5.1.3 Executing FP-growth algorithm
Executing the FP-growth algorithm generates rules together with their support and confidence. A rule is retained as a
best rule only if its support is greater than the minimum
support and its confidence is greater than the minimum confidence. The minimum support was 0.01 and the minimum
confidence was 0.8.
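The threshold test can be sketched in Python as below. The function name `best_rules`, the restriction to single-item consequents (as in the rules of Tables 1 and 2), and the ranking by confidence are assumptions of this illustration; the input is assumed to be a complete dictionary of frequent item sets and their supports, such as a frequent item set miner would return.

```python
MIN_SUPPORT = 0.01     # thresholds quoted in the text
MIN_CONFIDENCE = 0.8


def best_rules(frequent, min_support=MIN_SUPPORT,
               min_confidence=MIN_CONFIDENCE, top_n=10):
    """Derive single-consequent rules from {frozenset: support} and keep the best."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2 or supp <= min_support:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            # By downward closure the antecedent is frequent, so it is in the dict.
            conf = supp / frequent[antecedent]
            if conf > min_confidence:
                rules.append((antecedent, consequent, supp, conf))
    # Highest confidence first, then highest support, then a stable name order.
    rules.sort(key=lambda r: (-r[3], -r[2], sorted(r[0]), r[1]))
    return rules[:top_n]


# Toy supports over abstract items, for illustration only.
toy_supports = {
    frozenset({"a"}): 0.5,
    frozenset({"b"}): 0.4,
    frozenset({"a", "b"}): 0.36,
}
selected = best_rules(toy_supports)
```

Of the two candidate rules in the toy example, only b → a passes: its confidence is 0.36 / 0.4, roughly 0.9, whereas a → b reaches only 0.36 / 0.5 = 0.72.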


Figure 7 Obtain Best rules


5.1.4 Identifying predictions
A rule-based classifier was used to predict new interactions between HIV and human proteins. The
classifier assigns the proteins to the interacting and non-interacting classes, and its accuracy is then determined.
Figure 8 Predict new interactions


5.2 Implementation of the proposed work
The above prototype was implemented on the Java platform.
5.2.1 Upload input data

Figure 9 Upload HIV-human matrix



5.2.2 Executing FP-growth algorithm

Figure 10 Extract best rules


5.2.3 Rule based classifier

Figure 11 Classification of single HIV and human protein


5.2.4 Determine accuracy (test data for HIV proteins)

Figure 12 Obtained accuracy for a set of HIV proteins


5.2.5 Determine accuracy (test data for human proteins)

Figure 13 Obtained accuracy for a set of human proteins

6. RULES DESCRIPTION
Generally, a rule consists of two parts: the antecedent and the consequent. Consider the rule
Vpr, nucleocapsid → Tat
The antecedent of the rule states that if Vpr is interacting (yes) and nucleocapsid is interacting (yes), then the
predicted interaction (consequent) is that Tat is interacting (yes). The same reading applies to all the rules. Antecedents are
interacting proteins, whereas consequents are the proteins predicted from the antecedents.

7. CONCLUSION
This paper addresses the problem of predicting new HIV-1 and human protein interactions based on the existing PPI
database with an associative classification technique. The prototype and the implementation of the proposed work focus on
identifying the best association rules and classifying the proteins accurately. The proposed technique,
Associative Classification (AC), integrates association rule mining and classification. In association rule
mining, the FP-growth algorithm is used to mine the best association rules. In classification, a rule-based
classifier is used to classify the proteins as interacting or non-interacting. This paper
targeted the best 10 rules and found the accuracy to be 94.7% for HIV proteins and 85.75% for human proteins. In future work,
considering more rules may yield more accurate predictions.

REFERENCES
[1]. Tastan, O., Carbonell, J., Klein-Seetharaman, J., Prediction of interactions between HIV-1 and human proteins by
information integration, 2009. In Proc. PSB, pages 516–527.
[2]. Bandyopadhyay, S., Maulik, U., Holder, L.B., Cook, D.J., Advanced Methods for Knowledge Discovery from Complex
Data (Advanced Information and Knowledge Processing), 2005. Springer-Verlag, London.
[3]. Hipp, J., Güntzer, U., Nakhaeizadeh, G., Algorithms for association rule mining: a general survey and comparison, 2000.
SIGKDD Explorations 2: 58–64. doi: 10.1145/360402.360421.
[4]. Goethals, B., Efficient Frequent Pattern Mining, 2002. Ph.D. thesis, University of Limburg, Belgium.
[5]. Anirban Mukhopadhyay, Ujjwal Maulik, Sanghamitra Bandyopadhyay and Roland Eils, Mining Association Rules
from HIV-Human Protein Interactions, Proceedings of 2010 International Conference on Systems in Medicine and
Biology, 16-18 December 2010, IIT Kharagpur, India.
[6]. Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., and Lakhal, L., Generating a condensed representation for
association rules, 2005. J. Intell. Inf. Syst., 24(1): 29–60.
[7]. Jansen, R., Yu, H., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., Gerstein, M., A Bayesian networks
approach for predicting protein-protein interactions from genomic data, 2003.
[8]. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., and Verkamo, A.I., Fast discovery of association rules. In Advances in
Knowledge Discovery and Data Mining, 1996, pages 307–328. AAAI/MIT Press.
