ABSTRACT
Discovering Protein-Protein Interactions (PPI) is a new and interesting challenge in computational biology. Identifying the
interactions between HIV-1 proteins and human proteins is a particular PPI problem whose study might lead to the discovery of
drugs and of the important interactions leading to AIDS. The protein-protein interaction network is analysed using the available
datasets. Since biclustering approaches lead to loss of data, they need to be enhanced to prevent this loss. With this motivation
in mind, this paper aims to predict new interactions with the Associative Classification (AC) technique.
Keywords - Protein-Protein Interactions (PPI), Acquired Immune Deficiency Syndrome (AIDS), Association Rule Mining
(ARM), HIV-Human Protein-Protein Interaction (HHPPI), Associative Classification technique (AC).
1. INTRODUCTION
Acquired Immune Deficiency Syndrome (AIDS) is the last stage of HIV infection. At this stage, the human immune system
fails to protect the body from infection, and this eventually leads to death. HIV is a member of the retrovirus family
(lentivirus) which infects important cells in the human immune system. HIV-1 is a type of HIV that relies on
human host cell proteins in virtually every phase of its life cycle. Infection results from the interaction between
proteins of the virus and of the human host in human cells. One of the main goals of Protein-Protein
Interaction (PPI) research is to predict possible viral-host interactions. This is specifically aimed at assisting drug developers
who target protein interactions in developing specially designed small molecules to inhibit potential HIV-1-human
PPIs. Targeting protein-protein interactions has recently been established as a promising alternative to the conventional
approach to drug design. There are several computational approaches for predicting PPIs. Most of these approaches are
mainly used for determining PPIs within a single organism, such as yeast or human. In most of the work in this
area, negative samples are therefore prepared by taking random protein pairs which are not found in the interaction database.
1.1 Related Work
Human immunodeficiency virus (HIV) is a lentivirus (a member of the retrovirus family with long incubation period) that
can lead to Acquired Immunodeficiency Syndrome (AIDS), a condition in humans in which the immune system begins to
fail, leading to life-threatening infection. Various approaches for predicting interactions have been studied in the literature.
These approaches are based on Bayesian networks [1], random forest classifiers [2], and mixtures of feature expert classifiers [3].
Recently, two approaches have been proposed to predict the set of interactions between HIV-1 and human host cellular
proteins [4].
This paper proposes a methodology that identifies the best association rules and classifies the data into
interacting and non-interacting proteins with better accuracy.
2.2 Methods
In this paper, a system is proposed based on the Associative Classification (AC) technique. It can
reduce misclassification and predict previously unknown interactions from the known data sets
available. Several interestingness measures are validated using support and confidence for the known data sets. AC
integrates two known data mining tasks, association rule discovery and classification, to build a model for the purpose of
prediction. Classification and association rule discovery are similar tasks in data mining, with the exception that the main
aim of classification is the prediction of class labels, while association rule discovery describes correlations between items in
a transactional database. An AC model is represented as simple if-then rules, which makes it easy for the end user to
understand and interpret. For categorical attributes, all possible values are mapped to a set of positive integers. For
continuous attributes, a discretization method is used. The main task of AC is to construct a set of
rules that is able to predict the classes of previously unseen data, known as the test data set, as accurately as possible. The
goal is to find a classifier that maximizes the probability of interaction for each test object.
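The attribute preparation described above can be sketched as follows. This is only an illustration under stated assumptions: the attribute values, the cut points 0.3 and 0.7, and the helper names `encode_categorical` and `discretize` are hypothetical and do not come from the paper's data set.

```python
from bisect import bisect_right

def encode_categorical(values):
    """Map each distinct categorical value to a positive integer (1, 2, ...)."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping

def discretize(value, cut_points):
    """Return the 1-based index of the interval that contains value.

    cut_points must be sorted; a value equal to a cut point falls in the upper interval.
    """
    return bisect_right(cut_points, value) + 1

# Hypothetical categorical attribute (not taken from the paper).
interaction_types = ["binds", "activates", "binds", "inhibits"]
codes = encode_categorical(interaction_types)
encoded = [codes[v] for v in interaction_types]

# Hypothetical continuous attribute split into three intervals.
bins = [discretize(x, [0.3, 0.7]) for x in (0.1, 0.3, 0.5, 0.9)]
```

After this step every attribute takes values from a small set of positive integers, which is the form that rule mining over items expects.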
Association rule mining is an efficient data mining technique for searching hidden or desired patterns in
voluminous data. It aims at detecting correlations among various data attributes in a large set of items in a database.
Associations across item sets are determined by association rule mining. Association analysis is the detection of
hidden patterns or conditions that frequently occur together in given data.
Association rule mining techniques find interesting associations and correlations in a data set. An association rule
expresses an association relationship between objects or items, for example whether data items
occur simultaneously with other data items, and how often. These rules are computed from the data and are evaluated
with the help of probability. Support and confidence are measures of interestingness. Association rules are regarded as
interesting if both a minimum support and a minimum confidence threshold are satisfied. Boolean association rule mining is more
extensively used than other kinds of association rule mining.
Apriori [5] uses breadth-first search and a tree structure to count candidate item sets efficiently. It generates candidate item
sets of length k from item sets of length k - 1. Then it prunes the candidates which have an infrequent sub-pattern.
According to the downward closure lemma, the candidate set contains all frequent k-length item sets. After that, it scans the
transaction database to determine frequent item sets among the candidates. Candidate generation generates large numbers
of subsets (the algorithm attempts to load up the candidate set with as many as possible before each scan). Bottom-up subset
exploration (essentially a breadth-first traversal of the subset lattice) finds any maximal subset S only after all 2^|S| - 1 of
its proper subsets.
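The generate, prune and scan cycle described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work: the function name `apriori`, the absolute `min_support` count and the five toy transactions are assumptions made for the example.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frozenset: support_count} for all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Scan step: count the surviving candidates in the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result

# Toy transactions (hypothetical): each set lists HIV proteins that interact
# with one human protein.
transactions = [
    {"Tat", "Nef", "Vpr"},
    {"Tat", "Nef"},
    {"Tat", "Vpr"},
    {"Nef", "Vpr"},
    {"Tat", "Nef", "Vpr"},
]
frequent_sets = apriori(transactions, min_support=3)
```

With a minimum support count of 3, the three single items and the three pairs are frequent, while the triple {Tat, Nef, Vpr} is generated as a candidate but rejected after the scan because it occurs in only two transactions.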
The main bottleneck of the Apriori algorithm is candidate set generation and testing. This problem was dealt with by
introducing a novel, compact data structure called the frequent pattern tree, or FP-tree. Based on this structure, an FP-tree-based pattern fragment growth method called FP-growth was developed.
FP-growth does not require candidate generation; instead it stores the transaction database in an efficient structure, the FP-tree
(Frequent Pattern tree). It scans the database once to find frequent items. Frequent items are then sorted in
descending order of support count and kept in a list. Another scan of the database is then performed, and for each transaction,
infrequent items are suppressed and the remaining items are sorted in that order and inserted in the FP-tree.
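The two database scans described above can be sketched as follows. The sketch covers only the construction of the tree, not the subsequent pattern-growth mining; the class name `FPNode`, the function `build_fp_tree`, the alphabetical tie-break and the toy transactions are assumptions made for illustration.

```python
from collections import Counter

class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # First scan: count the support of every item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}

    # Frequent items in descending support order (ties broken by name).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(order)}

    # Second scan: drop infrequent items, sort the rest, insert into the tree.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, order

# Toy transactions (hypothetical).
transactions = [
    ["Tat", "Nef", "Vpr"],
    ["Tat", "Nef"],
    ["Tat", "Vif"],
    ["Nef", "Vpr"],
]
root, order = build_fp_tree(transactions, min_support=2)
```

Because transactions that share a prefix of frequent items share a path in the tree, the database is compressed: here three of the four transactions pass through the single `Nef` node under the root.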
Classification consists of predicting a certain outcome based on a given input. In order to predict the outcome, the classifier
processes a training set containing a set of attributes and the respective outcome, usually called the prediction attribute. The
classifier tries to discover relationships between the attributes that would make it possible to predict the outcome. The rules
obtained from association rule mining are the input to the rule-based classifier. The training data consists of the interacting HIV and
human proteins obtained from association rule mining. The test data consists of sample HIV
and human proteins, which are to be classified as interacting or non-interacting. The accuracy is determined on the test
data.
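A rule-based classification step of this kind can be sketched as below. This is an assumption-laden illustration rather than the classifier used in this work: the representation of a test sample as a set of proteins already known to interact, the helper names `classify` and `accuracy`, and the reading of the Table 2 entries as antecedent -> consequent are all assumptions.

```python
def classify(known_interactions, rules, target):
    """Label target as interacting if some rule with a satisfied antecedent predicts it.

    Returns (label, confidence of the strongest firing rule, or 0.0 if none fires).
    """
    best = 0.0
    for antecedent, consequent, confidence in rules:
        if consequent == target and set(antecedent) <= known_interactions:
            best = max(best, confidence)
    label = "interacting" if best > 0 else "non-interacting"
    return label, best

def accuracy(predicted, actual):
    """Fraction of test samples whose predicted label matches the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Three rules with the confidence values listed in Table 2,
# read here as antecedent -> consequent (an assumption).
rules = [
    (("CASP3",), "CD4", 0.86),
    (("BCL2",), "CD4", 0.83),
    (("PARP1",), "CASP3", 0.83),
]

fired = classify({"CASP3", "BCL2"}, rules, "CD4")
not_fired = classify({"ACTB"}, rules, "CD4")
```

In the first call two rules fire and the stronger one (0.86) is reported; in the second call no rule applies, so the sample is labelled non-interacting.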
3. PROPOSED METHODOLOGY
The proposed methodology comprises data pre-processing, ARM execution and classification. The proposed methodology
is depicted in Figure 2.
Support = P(X ∪ T)
4.2 Confidence
Confidence of an association rule A → B is the ratio of the number of transactions that include all items in the consequent as
well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. The
confidence of a rule is defined as:
Confidence(A → B) = Support(A ∪ B) / Support(A)
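As a worked illustration of the two measures, the sketch below computes support and confidence on five toy transactions. The transactions and the function names `support` and `confidence` are hypothetical; only the definitions follow the text above.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0

# Toy transactions (hypothetical): four contain Vpr, nucleocapsid and Tat,
# one contains only Vpr and nucleocapsid.
transactions = [
    {"Vpr", "nucleocapsid", "Tat"},
    {"Vpr", "nucleocapsid", "Tat"},
    {"Vpr", "nucleocapsid", "Tat"},
    {"Vpr", "nucleocapsid", "Tat"},
    {"Vpr", "nucleocapsid"},
]
sup = support(transactions, {"Vpr", "nucleocapsid", "Tat"})
conf = confidence(transactions, {"Vpr", "nucleocapsid"}, {"Tat"})
```

Here the rule body and head occur together in 4 of 5 transactions (support 0.8) and the antecedent occurs in all 5, so the confidence is 0.8. The one transaction that has the antecedent without the consequent is what keeps the confidence below 1.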
The results obtained from the association rules are tabulated together with their confidence values. This research aims at exploring
the FP-growth algorithm to mine novel interaction patterns that can be extended to viral interactions of diverse kinds.
As tabulated in Tables 1 and 2, the best rules of the FP-growth algorithm were extracted with a confidence of 0.8.
Table 1: Performance metrics of HIV and human proteins dataset (FP-growth algorithm)
BEST RULES
CONFIDENCE
0.98
env_gp120, Nef, env_gp160, env_gp41 → Tat
0.96
0.95
0.95
0.94
0.94
0.92
0.92
0.9
Table 2: Performance metrics of human and HIV proteins dataset (FP-growth algorithm)
BEST RULES        CONFIDENCE
ACTG1 → ACTB      0.88
ACTB → ACTG1      0.88
CASP3 → CD4       0.86
BCL2 → CD4        0.83
PARP1 → CASP3     0.83
CD28 → CASP3      0.83
CD28 → CASP3      0.83
PARP1 → BCL2      0.83
BCL2 → PARP1      0.83
CD28 → BCL2       0.83
From the above tables, it is evident that every confidence value is less than 1. A confidence value less than
1 denotes that there is a possibility of finding new, unknown interactions between the HIV and human proteins.
6. RULES DESCRIPTION
Generally a rule consists of two parts, the antecedent and the consequent. Consider the rule
Vpr, nucleocapsid → Tat
The antecedent of the rule states that if Vpr is interacting (yes) and nucleocapsid is interacting (yes), then the
predicted interaction (consequent) is that Tat is interacting (yes). The same reading applies to all the rules. Antecedents are
interacting proteins, whereas consequents are the proteins predicted from the antecedents.
7. CONCLUSION
This paper addresses the problem of predicting new HIV-1 and human protein interactions from the existing PPI
database with an associative classification technique. A prototype implementation of the proposed work focuses on
identifying the best association rules and classifying the proteins with better accuracy. The proposed technique,
Associative Classification (AC), integrates association rule mining and classification. In association rule
mining, the FP-growth algorithm is used to mine the best association rules. In classification, a rule-based
classifier is used to classify the proteins into interacting and non-interacting. This paper
targeted the best 10 rules and found the accuracy to be 94.7% for HIV proteins and 85.75% for human proteins. In future, if
more rules are considered, more accurate predictions may be obtained.
REFERENCES
[1]. Tastan, O., Carbonell, J., Klein-Seetharaman, J., Prediction of interactions between HIV-1 and human proteins by
information integration, 2009. In Proc. PSB, pages 516-527.
[2]. Bandyopadhyay, S., Maulik, U., Holder, L.B., Cook, D.J., Advanced Methods for Knowledge Discovery from Complex
Data (Advanced Information and Knowledge Processing), 2005. Springer-Verlag, London.
[3]. Hipp, J., Güntzer, U., Nakhaeizadeh, G., Algorithms for association rule mining: a general survey and comparison, 2000.
SIGKDD Explorations 2: 58-64. doi: 10.1145/360402.360421.
[4]. Goethals, B., Efficient Frequent Pattern Mining, 2002. Ph.D. thesis, University of Limburg, Belgium.
[5]. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., Eils, R., Mining Association Rules
from HIV-Human Protein Interactions, Proceedings of 2010 International Conference on Systems in Medicine and
Biology, 16-18 December 2010, IIT Kharagpur, India.
[6]. Pasquier, N., Taouil, R., Bastide, Y., Stumme, G., Lakhal, L., Generating a condensed representation for
association rules, 2005. J. Intell. Inf. Syst., 24(1): 29-60.
[7]. Jansen, R., Yu, H., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., Gerstein, M., A Bayesian networks
approach for predicting protein-protein interactions from genomic data, 2003.
[8]. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkamo, A.I., Fast discovery of association rules. In Advances in
Knowledge Discovery and Data Mining, 1996, pages 307-328. AAAI/MIT Press.