Data Mining in Bioinformatics
SUBMITTED BY
Bioinformatics and data mining provide exciting and challenging research and
application areas for computational science. Bioinformatics is the science of managing,
mining, and interpreting information from biological sequences and structures. Data
mining is the process of sorting through large amounts of data and picking out relevant
information. Database mining is an important research area because there is an urgent
need to analyze data from different sources. Advances such as genome-sequencing
initiatives, microarrays, proteomics, and functional and structural genomics have pushed
the frontiers of human knowledge. In addition, data mining and machine learning have
advanced rapidly in recent years, with high-impact applications from marketing
to science. Starting with possible definitions of data mining and bioinformatics, this
paper gives a generalized view of both, emphasizing statistical data mining and its
techniques, illustrates possible synergies, and discusses how statistical data miners
may collaborate on bioinformatics challenges in order to unlock the secrets of the cell.
CONTENTS
1. INTRODUCTION
MINING
MINERS
8. CONCLUSION
9. REFERENCES
INTRODUCTION
“Bioinformatics” is the new technology buzzword used to describe a rapidly developing
discipline lying at the intersection of computer technology and the life sciences. These
days, it is virtually impossible to pick up a popular science magazine and not find
something about bioinformatics. Although there is currently no widely agreed upon
definition for the term “bioinformatics”, it may be defined as “the application of
computing power to biological data to reveal new patterns and information below the
surface of those data.”1 Bioinformatics is currently being applied to a number of
scientific areas including chemistry, genomics, brain mapping, pharmacology,
proteomics and structural biology. Data mining is the process of sorting through large
amounts of data and picking out relevant information. This special issue aims to bridge
the gap between bioinformatics and data mining by presenting research integrating the
two. We believe that data mining will provide the necessary tools for better understanding
of gene expression, drug design, and other emerging problems in genomics and
proteomics. For this special issue, we encouraged papers that
propose novel data mining techniques for tasks such as
• gene expression analysis,
• searching and understanding of protein mass spectroscopy data,
• 3D structural and functional analysis and mining of DNA and protein sequences for
structural and functional motifs, drug design, and understanding of the origins of life, and
• text mining for biological knowledge discovery.
Mining of sequence data is still in its infancy because the methodologies are much more
involved, and because a large number of tools have to be integrated before progress can
be made. However, this area has the potential to yield rich dividends in the years to
come.
DATA MINING TECHNIQUES
Data mining has three major components: clustering or classification, association rules,
and sequence analysis. Neural networks are also one of the key ingredients of modern
data mining.
Data mining techniques draw from
Optimization techniques
Visualization
Classification
Clustering techniques analyze a set of data and generate a set of grouping
rules that can be used to classify future data. The mining tool automatically identifies the
clusters by studying the patterns in the training data. Once the clusters are generated,
classification can be used to identify the cluster to which a given input belongs.
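The two steps described above, finding clusters in training data and then assigning a new input to one of them, can be sketched as follows. This is a minimal illustration only: it assumes k-means as the clustering method and nearest-centroid assignment as the classifier, neither of which the text prescribes, and the function names and sample points are hypothetical.

```python
import math

def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (an arbitrary choice)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: group every point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its group.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

# Hypothetical training data with two visible groups.
training = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
centroids = kmeans(training, k=2)

# Classification of a future input: which cluster does it belong to?
label = nearest((8.5, 9.5), centroids)
```

Here `centroids` plays the role of the "grouping rules" learned from the training data, and `nearest` is the classification step applied to an unseen point.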
Association
Sequential Analysis
In sequential analysis, we seek to discover patterns that occur in sequence. This
deals with data that appear in separate transactions (as opposed to data that appear in the
same transaction, as in the case of association), e.g., a shopper buys item A in the first
week of the month and then buys item B in the second week.
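The shopper example can be made concrete with a small support count. The sketch below assumes each customer's history is an ordered list of transactions (sets of items) and that a pattern such as A then B is supported only when B appears in a strictly later transaction than A. The data and function names are hypothetical.

```python
def supports(sequence, pattern):
    """True if the items of pattern occur, in order, in strictly later
    transactions of sequence (a list of item sets)."""
    pos = 0
    for item in pattern:
        # Skip forward to the first remaining transaction containing the item.
        while pos < len(sequence) and item not in sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next item must come from a later transaction
    return True

def support(customers, pattern):
    """Fraction of customers whose transaction sequence supports the pattern."""
    return sum(supports(seq, pattern) for seq in customers.values()) / len(customers)

# Hypothetical purchase histories, one list of weekly transactions per customer.
customers = {
    "c1": [{"A"}, {"B"}],              # A in week 1, B in week 2
    "c2": [{"A", "B"}],                # same transaction: association, not sequence
    "c3": [{"A"}, {"C"}, {"B", "D"}],  # B follows A with a gap
    "c4": [{"B"}, {"A"}],              # wrong order
}
ab_support = support(customers, ["A", "B"])
```

Customer c2 buys A and B together, which would count toward an association rule but does not count toward the sequential pattern.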
MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING
The demand for data mining applications on the web is increasing. As data sets
grow, there is also a demand for scalable, generic solutions.
Scalability and generic data mining models can be provided through distributed
computing. A multi-layered architecture can take advantage of the latest technological
advances in hardware to provide efficient solutions and also allows new data mining and
data capture components to be added easily to the basic system. There have been several
attempts at large-scale distributed data mining. The Kensington project mines
enterprise data distributed across the internet. The Papyrus project is a distributed data
mining system developed for clusters and super clusters of workstations. It is composed
of four software layers: data management, data mining, predictive modeling, and agents.
Papyrus is based on mobile agents implemented in Java. Another distributed data
mining suite based on Java is PaDDMAS, a component-based tool set that integrates
predeveloped or custom packages (which can be sequential or parallel) using a dataflow
approach. JAM is an agent-based distributed data mining system developed
to mine data stored at different sites; it builds so-called meta-models as a combination
of several models learned at the sites where the data are stored. JAM uses Java
applets to move data mining agents to remote sites. BODHI is a project [8] for
collective data mining with an emphasis on learning from vertically partitioned data.
Discovery Net provides an architecture for building and managing KDD processes on a
Grid. Most of these projects are implemented as prototypes. In this paper we attempt to
address some of the major issues related to distributed data mining. The solutions
suggested were implemented as part of DataMIME. From a user's point of view, the basic
requirement for providing data mining models as services over the internet is the ability
to use generic, customizable data mining tools efficiently on a wide array of data sets.
From a developer's point of view, the architecture should facilitate an iterative
development process. This enables the integration of new components into the system
and allows developers to take advantage of the latest developments in
hardware. We identified a uniform, efficient vertical data structure at the
lowest layer that can take advantage of the latest hardware, and a
data management layer that facilitates data distribution. We also identified a data
mining and data capture layer that is defined in the form of an API. The generic data
mining models are developed on top of the data mining interface. The client is built on
top of the communication layer to capture the user requirements for a particular job.
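The layering described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DataMIME design: all class names, the round-robin placement of columns, and the toy model are hypothetical, and the "vertical data structure" is reduced here to plain column-wise storage.

```python
from abc import ABC, abstractmethod

class VerticalStore:
    """Lowest layer: each attribute is stored as its own column (vertical layout)."""
    def __init__(self, rows, attributes):
        self.columns = {a: [row[i] for row in rows] for i, a in enumerate(attributes)}

    def column(self, attribute):
        return self.columns[attribute]

class DataManager:
    """Data management layer: assigns columns to (hypothetical) storage nodes."""
    def __init__(self, store, nodes):
        self.store = store
        names = sorted(store.columns)
        # Round-robin placement is an assumption made for this sketch.
        self.placement = {a: nodes[i % len(nodes)] for i, a in enumerate(names)}

    def fetch(self, attribute):
        # A real system would contact self.placement[attribute]; here data is local.
        return self.store.column(attribute)

class MiningModel(ABC):
    """Data mining interface: every generic model is written against this API."""
    @abstractmethod
    def run(self, data, **params):
        ...

class ValueCounts(MiningModel):
    """Toy model: frequency of each value in one attribute."""
    def run(self, data, attribute):
        counts = {}
        for value in data.fetch(attribute):
            counts[value] = counts.get(value, 0) + 1
        return counts

def client_submit(model, data, **job):
    """Client layer: captures the job parameters and hands them to the model."""
    return model.run(data, **job)

store = VerticalStore([("A", 1), ("C", 0), ("A", 1)], ["base", "flag"])
manager = DataManager(store, nodes=["node0", "node1"])
result = client_submit(ValueCounts(), manager, attribute="base")
```

Because models depend only on the `MiningModel` interface and on `DataManager.fetch`, a new model or a different storage layout can be added without touching the other layers, which is the kind of iterative development the text asks for.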
ALGORITHM AND ITS APPLICATION
Frequent Structure Mining (FSM) refers to an important
class of exploratory mining tasks, namely, those dealing with extracting patterns in
massive databases representing complex interactions between entities. Consider the problem
of mining structural patterns in a data set of ribonucleic acid (RNA) molecules, which
can be represented as trees. In this paper, we introduce TREEMINER, an efficient
algorithm for the problem of mining frequent subtrees in a forest (the database). The key
contributions of our work are as follows:
1. We introduce the problem of mining embedded subtrees in a collection of
rooted, ordered, and labeled trees.
2. We use the notion of a scope for a node in a tree. We show how any tree can
be represented as a list of its node scopes in a novel vertical format called scope-
list.
3. We develop a framework for nonredundant candidate subtree generation, i.e.,
we propose a systematic search of the possibly frequent subtrees, such that no
pattern is generated more than once.
4. We show how one can efficiently compute the frequency of a candidate tree by
joining the scope-lists of its subtrees.
5. Our formulation allows one to discover all subtrees in a forest, as well as all
subtrees in a single large tree.
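To make the notions of scope and scope-list in contributions 2 and 4 more concrete, the following sketch computes them for small trees. The text does not define a scope precisely. The sketch assumes the common definition in which a node's scope is the interval [l, r], where l is the node's preorder number and r is the preorder number of the last node in its subtree. The scope-list entries are simplified to (tree id, scope) pairs, and all names are hypothetical.

```python
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def node_scopes(root):
    """Return [(label, (l, r))] in preorder, where l is the node's preorder
    number and r is the preorder number of the last node in its subtree."""
    out = []

    def visit(node):
        index = len(out)
        out.append(None)  # reserve this node's slot before visiting children
        for child in node.children:
            visit(child)
        out[index] = (node.label, (index, len(out) - 1))

    visit(root)
    return out

def scope_lists(forest):
    """Vertical format: label -> list of (tree id, scope) over all trees."""
    lists = {}
    for tid, root in enumerate(forest):
        for label, scope in node_scopes(root):
            lists.setdefault(label, []).append((tid, scope))
    return lists

def is_descendant(scope_x, scope_y):
    """True if the node with scope_y lies strictly inside the subtree of scope_x."""
    return scope_x[0] < scope_y[0] and scope_y[1] <= scope_x[1]

# Hypothetical forest: tree 0 is A(B(C), D), tree 1 is A(C).
forest = [
    Node("A", [Node("B", [Node("C")]), Node("D")]),
    Node("A", [Node("C")]),
]
vertical = scope_lists(forest)
```

With scopes in hand, an ancestor-descendant test reduces to an interval containment check, which is what makes joining scope-lists an inexpensive way to count occurrences of a candidate subtree.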
Furthermore, simple modifications also allow us to mine unlabeled subtrees, unordered
subtrees, and frequent subforests (i.e., disconnected subtrees). We also present
TREEMINERD, a method that, instead of counting all embeddings, counts only distinct
occurrences of a pattern and might be more suitable than TREEMINER for certain data
sets. We also contrast TREEMINER with another tree mining algorithm based on pattern
matching, PATTERNMATCHER. Applications in bioinformatics include mining
frequent RNA structures and common phylogenetic tree patterns.
TREEMINER ALGORITHM
PATTERNMATCHER ALGORITHM
RNA Structure
Although RNA has a three-dimensional (3D) shape, it can be viewed in terms of its
secondary structure, which is composed mainly of double-stranded regions formed by
folding the single-stranded RNA molecule back on itself. To produce these
double-stranded regions, a subsequence of bases (drawn from four letters: A, C, G, U) must be
complementary to another subsequence so that base pairing can occur (G-C and A-U). It
is these pairings that contribute to the energetic stability of the RNA molecule.
TreeBASE is a relational database designed to manage and explore information on
phylogenetic relationships.4 It stores phylogenetic trees and the data matrices used to
generate them from published research papers.
RNA structure and its tree representation.
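The base-pairing rule described above (G-C and A-U) can be sketched as a simple complementarity check. The sketch assumes antiparallel pairing, so the first base of one subsequence pairs with the last base of the other, and it considers only the two pairings named in the text. The function names and the example sequence are hypothetical.

```python
PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}

def can_pair(first, second):
    """True if the two subsequences can form a double-stranded region,
    assuming antiparallel pairing (first[0] pairs with second[-1])."""
    if len(first) != len(second):
        return False
    return all(PAIRS.get(a) == b for a, b in zip(first, reversed(second)))

def find_stems(rna, length):
    """All (i, j) such that rna[i:i+length] can pair with the later,
    non-overlapping subsequence rna[j:j+length]."""
    n = len(rna)
    stems = []
    for i in range(n - 2 * length + 1):
        for j in range(i + length, n - length + 1):
            if can_pair(rna[i:i + length], rna[j:j + length]):
                stems.append((i, j))
    return stems

# Hypothetical hairpin-like sequence: GGG at the start can fold back onto CCC at the end.
stems = find_stems("GGGAAAUCCC", 3)
```

Each candidate stem corresponds to one double-stranded region. Nesting such regions is what gives the secondary structure its tree-like shape.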
CONCLUSION
The blending of biology and computers makes bioinformatics a unique discipline
that is clearly here to stay. Bioinformatics is one of the fastest growing areas of all
life-science-based markets. In this paper, we introduced the notion of mining embedded
subtrees in a (forest) database of trees. Among our novel contributions is the procedure
for systematic candidate subtree generation, in which no subtree is generated more than
once. We used a space-efficient string encoding of the tree to store the horizontal data
set, and we used the notion of a node’s scope to develop a novel vertical representation of
a tree called scope-lists. Our formalization of the problem is flexible enough to handle
several variations. It has also been shown that the problem of multi-database mining is
challenging and pressing. In particular, due to essential differences between mono- and
multi-databases, we have defined a new process of multi-database mining for our
system.
References
1. H. Liu, J. Li, and L. Wong, “Use of Extreme Patient Samples for Outcome Prediction
from Gene Expression Data,” Bioinformatics, vol.21, no. 16, 2005, pp. 3377–3384.
2. R. Agrawal and J. Shafer, “Parallel Mining of Association Rules,” IEEE Transactions on
Knowledge and Data Engineering, vol. 8, no. 6, 1996, pp. 962–969.
3. S. Džeroski and N. Lavrač, eds., Relational Data Mining, Springer, Berlin, 2001.
4. Yan Fu et al., “A Block-Based Support Vector Machine Approach to the Protein
Homology Prediction Task in KDD Cup 2004,” ACM SIGKDD Explorations, vol. 6, no.
2, 2004, pp. 120–124.
5. S. Ray and M. Craven, “Learning Statistical Models for Annotating Proteins with
Function Information Using Biomedical Text,” BMC Bioinformatics, vol. 6, suppl. 1,
2005, p. S18.