
GOVERNMENT COLLEGE OF TECHNOLOGY

(An Autonomous Institution)


DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
Coimbatore - 641013.

Data Mining in Bioinformatics
SUBMITTED BY

B.SUJITHA DEVI
III yr, IT
Email id: Sujitha.rebeka@gmail.com
Contact no: 9789634723

S VAISHNAVI
III yr, IT
Email id: vaishi_roy@yahoo.co.in
Contact no: 9994128742

ABSTRACT

Bioinformatics and data mining provide exciting and challenging research and
application areas for computational science. Bioinformatics is the science of managing,
mining, and interpreting information from biological sequences and structures. Data
mining is the process of sorting through large amounts of data and picking out relevant
information. Database mining is an important research area because there is an urgent
need for analyzing data in different sources. Advances such as genome-sequencing
initiatives, microarrays, proteomics, and functional and structural genomics have pushed
the frontiers of human knowledge. In addition, data mining and machine learning have
been advancing in strides in recent years, with high-impact applications from marketing
to science. Starting with possible definitions of data mining and bioinformatics, this
paper gives a generalized view of both, emphasising statistical data mining and its
techniques, illustrates possible synergies, and discusses how statistical data miners
may collaborate on the challenges of bioinformatics in order to unlock the secrets of the cell.
CONTENTS

1. INTRODUCTION

2. A DATABASE PLATFORM FOR BIOINFORMATICS

3. MINING SEQUENCE DATA

4. DATA MINING TECHNIQUES

5. MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING

6. ALGORITHM AND ITS APPLICATION

7. CHALLENGES IN BIOINFORMATICS FOR STATISTICAL DATA MINERS

8. CONCLUSION

9. REFERENCES
INTRODUCTION
“Bioinformatics” is the new technology buzzword used to describe a rapidly developing
discipline lying at the intersection of computer technology and the life sciences. These
days, it is virtually impossible to pick up a popular science magazine and not find
something about bioinformatics. Although there is currently no widely agreed-upon
definition of the term “bioinformatics”, it may be defined as “the application of
computing power to biological data to reveal new patterns and information below the
surface of those data.”1 Bioinformatics is currently being applied to a number of
scientific areas including chemistry, genomics, brain mapping, pharmacology,
proteomics and structural biology. Data mining is the process of sorting through large
amounts of data and picking out relevant information. This special issue aims to bridge
the gap between bioinformatics and data mining by presenting research integrating the
two. We believe that data mining will provide the necessary tools for a better understanding
of gene expression, drug design, and other emerging problems in genomics and
proteomics. For this special issue, we encouraged papers that propose novel data mining
techniques for tasks such as
• gene expression analysis,
• searching and understanding of protein mass spectroscopy data,
• 3D structural and functional analysis and mining of DNA and protein sequences for
structural and functional motifs, drug design, and understanding of the origins of life, and
• text mining for biological knowledge discovery.

A DATABASE PLATFORM FOR BIOINFORMATICS


As the mapping of the human genome draws to a close, there is increasing
realization that the ‘life’ sciences are dependent, as never before, on computing. The atlas
of the human genome promises to revolutionize medical practice and biological research
for the next millennium: all human genes will eventually be found, accurate diagnostics
will be developed for all heritable diseases, animal models for human disease research
will be more easily developed, and cures developed for many diseases. Many of these
developments will occur, not inside test tubes in biologists’ laboratories, but on
high-performance computing platforms, with massive storage systems to store genomic data,
databases to search through the data, identifying similarities and patterns, as well as
integration software to unify the slices of knowledge developed at globally distributed
institutions.
Extending Databases
We approached the complex-data problem from the standpoint of creating an extensible
architecture. Databases must be made inherently extensible to be able to efficiently
handle various rich, application-domain-specific complex data types. Extensibility is the
ability to provide support for any user-defined datatype (structured or unstructured)
efficiently without having to re-architect the DBMS. Such types, which can be plugged
into the database to extend its capabilities for specific domains, are also called data
cartridges. An extensible database system needs support for:
• User-defined types: the ability to define new datatypes corresponding to domain
entities such as sequences,
• User-defined operators: domain-specific operators such as Resembles() or Distance()
that can be called from SQL,
• Domain-specific indexing: support for indexes specific to genomic data, spatial
data etc., which can be used to speed up queries, and
• Optimizer extensibility: intelligent ordering of query predicates involving user-
defined types, especially for multi-domain queries.
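
To make the idea of user-defined operators callable from SQL more concrete, the sketch below registers two Python functions with an in-memory SQLite database through the standard `sqlite3` module. SQLite only stands in for an extensible DBMS here; this is not the data cartridge mechanism itself. The table layout, the sample sequences and the mismatch-counting definition of `Distance` are all assumptions made for illustration, and the sketch covers user-defined operators only, not user-defined types, indexing or optimizer extensibility.

```python
import sqlite3


def distance(a: str, b: str) -> int:
    """Illustrative distance: mismatched positions plus the length difference."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def resembles(a: str, b: str, max_distance: int) -> int:
    """SQL-friendly predicate: 1 if the sequences are within max_distance, else 0."""
    return 1 if distance(a, b) <= max_distance else 0


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the two operators registered."""
    conn = sqlite3.connect(":memory:")
    # Register the Python functions under SQL names with 2 and 3 arguments.
    conn.create_function("Distance", 2, distance)
    conn.create_function("Resembles", 3, resembles)
    conn.execute("CREATE TABLE sequences (name TEXT, bases TEXT)")
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?)",
        [("seq1", "ACGTACGT"), ("seq2", "ACGTTCGT"), ("seq3", "TTTTTTTT")],
    )
    return conn


def find_similar(conn: sqlite3.Connection, query: str, max_distance: int):
    """Return (name, distance) rows for stored sequences that resemble the query."""
    rows = conn.execute(
        "SELECT name, Distance(bases, ?) AS d FROM sequences "
        "WHERE Resembles(bases, ?, ?) ORDER BY d, name",
        (query, query, max_distance),
    )
    return rows.fetchall()
```

Calling `find_similar(build_demo_db(), "ACGTACGT", 1)` runs the domain-specific predicate inside the SQL query itself, which is the behaviour the list above asks of an extensible database.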

MINING SEQUENCE DATA


The current approach for finding genes has a large experimental component. Any
small increase in the accuracy of computer classification of genes can result in substantial
time and cost savings. Oracle has developed a suite of software tools that analyse large
collections of data to discover new patterns and forecast relationships. This process of
sifting through enormous databases to extract hidden information is called ‘data mining’.
Mining sequence data can help discover relationships between genes, discover gene
expression, discover drugs based on functional information and so on. A simple case of
mining genetic data could be to classify cancers based solely on gene expression.
Classifiers are first trained on the genes in a training set, and then applied to the
remaining genes to assign them to specific clusters. Another use of mining relates to
predicting which sections of a piece of DNA are ‘active’ and which are not.
Chromosomes have coding sequences (exons), interspersed with non-coding sequences
(introns). It has recently been discovered through mining that a non-linear correlation
statistic for DNA sequences, called the Average Mutual Information (AMI) [8], is very
effective at distinguishing exons from introns. The AMI is a nonlinear function based on
a vector of 12 frequencies each dependent on the positions of the bases A, C, T & G. The
inductive process of mining helps us arrive at such complex insights, which deductive
analyses have little hope of unearthing. When terabytes of data are involved, traditional
data mining relies on analysts trying to guess which small subset of the information in a
database is relevant. Because of their limited capacity for data, traditional methods often
operate on only 1-2% of the data available in every record. Yet discarded variables often
contain key information: correlations that aren't obvious, patterns one wouldn't expect, or
significant fluctuations that are normally overshadowed by larger trends.
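
As a rough illustration of the kind of statistic mentioned above, the sketch below computes the mutual information between bases that lie `k` positions apart in a sequence. This is the general textbook mutual information function and is an assumption on our part: the 12-frequency formulation of the AMI referred to in the text is not spelled out here, so the code should not be read as that exact statistic. The function name and the example sequences are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(seq: str, k: int) -> float:
    """Mutual information (in bits) between bases k positions apart in seq."""
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (base at i, base at i+k)
    left = Counter(a for a, _ in pairs)    # marginal counts of the first base
    right = Counter(b for _, b in pairs)   # marginal counts of the second base
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return mi
```

A strictly periodic sequence such as `"ACGT" * 50` gives 2 bits at `k = 4`, because each base fully determines the base four positions later, while a constant sequence gives 0. Real coding and non-coding regions fall between these extremes, and it is the difference in such profiles that a classifier would exploit.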

Mining of sequence data is still in its infancy because the methodologies are much more
involved, and because a large number of tools have to be integrated before progress can
be made. However, this area has the potential to yield rich dividends in the years to
come.
DATA MINING TECHNIQUES
Data mining has three major components: Clustering or Classification, Association Rules,
and Sequence Analysis. Neural networks are also one of the key ingredients of modern data
mining. Data mining techniques draw from
• Statistics
• Machine learning
• Database techniques
• Pattern recognition
• Optimization techniques
• Visualization
Classification

Clustering techniques analyze a set of data and generate a set of grouping
rules that can be used to classify future data. The mining tool automatically identifies the
clusters by studying the patterns in the training data. Once the clusters are generated,
classification can be used to identify the particular cluster to which an input belongs.
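
The two steps described above (find clusters in training data, then assign a new input to one of them) can be sketched with a small k-means routine. The choice of k-means, the Euclidean distance and the initialization from the first `k` points are assumptions made for illustration; the text does not name a particular clustering algorithm.

```python
from math import dist


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def k_means(points, k, iterations=20):
    """Cluster points into k groups and return the final centroids."""
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        new_centroids = []
        for i, group in enumerate(groups):
            if group:
                # Mean of each coordinate over the members of the group.
                new_centroids.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids
```

With training points clustered around (1, 1) and (9, 9), `k_means(points, 2)` returns one centroid per group, and `nearest(new_point, centroids)` then plays the role of the classification step for future data.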

Association

An association rule is a rule that implies certain association relationships among a
set of objects in a database. In this process we discover a set of association rules at
multiple levels of abstraction from the relevant set(s) of data in a database.
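
A minimal sketch of how the strength of a single association rule might be measured is given below. Support and confidence are the measures conventionally used for association rules; the text does not name them, so their use is an assumption, as is the toy data set in which each transaction is the set of genes found over-expressed in one sample.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical data: genes over-expressed in each of four samples.
samples = [
    {"geneA", "geneB"},
    {"geneA", "geneB", "geneC"},
    {"geneA"},
    {"geneC"},
]
```

For the rule geneA → geneB on this data, `support(samples, {"geneA", "geneB"})` is 0.5 and `confidence(samples, {"geneA"}, {"geneB"})` is 2/3. A rule miner would enumerate candidate rules and keep those above chosen thresholds.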

Sequential Analysis
In sequential analysis, we seek to discover patterns that occur in sequence. This
deals with data that appear in separate transactions (as opposed to data that appear in the
same transaction, as in the case of association), e.g. a shopper buys item A in the first week
of the month and then buys item B in the second week.
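
The shopper example can be sketched as a count over ordered transaction histories. The data layout (a dictionary mapping each shopper to a time-ordered list of baskets) and the function name are assumptions made for illustration; items bought together in one basket deliberately do not count, since that case belongs to association rather than sequence.

```python
def sequence_support(histories, first, second):
    """Fraction of histories where `first` is bought in a strictly earlier
    transaction than `second`."""
    if not histories:
        return 0.0
    hits = 0
    for baskets in histories.values():  # baskets are ordered by time
        seen_first = False
        for basket in baskets:
            if seen_first and second in basket:
                hits += 1
                break
            if first in basket:
                seen_first = True
    return hits / len(histories)


# Hypothetical purchase histories, one list of weekly baskets per shopper.
histories = {
    "s1": [{"A"}, {"B"}],
    "s2": [{"A", "B"}],
    "s3": [{"B"}, {"A"}],
    "s4": [{"A"}, {"C"}, {"B", "D"}],
}
```

Here only shoppers `s1` and `s4` buy A and later B, so `sequence_support(histories, "A", "B")` is 0.5.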
MULTI-LAYERED FRAMEWORK FOR DISTRIBUTED DATA MINING
There is an increasing demand for data mining applications on the web. As data
sets grow, there is also a demand for scalable, generic solutions. Scalability and generic
data mining models can be provided through distributed computing. A multi-layered
architecture can take advantage of the latest advances in hardware to provide efficient
solutions, and also allows new data mining and data capture components to be added
easily to the basic system.

There have been several attempts at large-scale distributed data mining. The Kensington
project mines enterprise data distributed across the internet. The Papyrus project is a
distributed data mining system developed for clusters and super-clusters of workstations.
It is composed of four software layers: data management, data mining, predictive modeling,
and agents. Papyrus is based on mobile agents implemented using Java. Another Java-based
distributed data mining suite is PaDDMAS, a component-based tool set that integrates
predeveloped or custom packages (which can be sequential or parallel) using a dataflow
approach. JAM is an agent-based distributed data mining system developed to mine data
stored at different sites, building so-called meta-models as a combination of several
models learned at the sites where the data are stored. JAM uses Java applets to move data
mining agents to remote sites. BODHI is a project [8] for collective data mining, with an
emphasis on learning from vertically partitioned data. Discovery Net provides an
architecture for building and managing KDD processes on a Grid. Most of these projects
are implemented as prototypes.

In this paper we attempt to address some of the major issues related to distributed data
mining. The solutions suggested were implemented as part of DataMIME. From a user's
point of view, the basic requirement for providing data mining models as services over the
internet is the ability to efficiently use generic, customizable data mining tools on a wide
array of data sets. From a developer's point of view, the architecture should facilitate an
iterative development process. This enables new components to be integrated into the
system and allows developers to take advantage of the latest developments in hardware.

We were able to identify a uniform, efficient vertical data structure at the lowest layer
that can take advantage of the latest hardware. We were able to identify a data management
layer that facilitates data distribution. We also identify a data mining and data capture
layer that is defined in the form of an API. The generic data mining models are developed
on top of the data mining interface. The client is built on top of the communication layer
to capture the user requirements for a particular job.
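
The text does not describe the vertical data structure used at the lowest layer, so the following is only a generic sketch of what a vertical (column-wise) layout can look like, not the structure used in DataMIME. Each (attribute, value) pair is stored as one integer bitmask over row numbers, and a count over several conditions reduces to bitwise AND operations. All names and the sample rows are hypothetical.

```python
def to_vertical(rows):
    """Map each (attribute, value) pair to a bitmask; bit i is set when row i
    holds that value for that attribute."""
    vertical = {}
    for i, row in enumerate(rows):
        for attribute, value in row.items():
            key = (attribute, value)
            vertical[key] = vertical.get(key, 0) | (1 << i)
    return vertical


def count_matching(vertical, conditions):
    """Count rows satisfying every (attribute, value) condition via bitwise AND."""
    conditions = list(conditions)
    if not conditions:
        raise ValueError("at least one condition is required")
    mask = vertical.get(conditions[0], 0)
    for condition in conditions[1:]:
        mask &= vertical.get(condition, 0)
    return bin(mask).count("1")


# Hypothetical horizontal records, one dictionary per row.
rows = [
    {"tissue": "liver", "expr": "high"},
    {"tissue": "liver", "expr": "low"},
    {"tissue": "brain", "expr": "high"},
    {"tissue": "liver", "expr": "high"},
]
```

With this layout, `count_matching(to_vertical(rows), [("tissue", "liver"), ("expr", "high")])` touches two bitmasks rather than scanning every record. This is one reason vertical layouts map well onto hardware that is fast at word-level logical operations.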
ALGORITHM AND ITS APPLICATION
Frequent Structure Mining (FSM) refers to an important class of exploratory mining
tasks, namely, those dealing with extracting patterns in massive databases representing
complex interactions between entities. Consider the problem of mining structural patterns
in a data set of ribonucleic acid (RNA) molecules, which can be represented as trees. In
this paper, we introduce TREEMINER, an efficient algorithm for the problem of mining
frequent subtrees in a forest (the database). The key contributions of our work are as
follows:
1. We introduce the problem of mining embedded subtrees in a collection of
rooted, ordered, and labeled trees.
2. We use the notion of a scope for a node in a tree. We show how any tree can
be represented as a list of its node scopes in a novel vertical format called scope-
list.
3. We develop a framework for nonredundant candidate subtree generation, i.e.,
we propose a systematic search of the possibly frequent subtrees, such that no
pattern is generated more than once.
4. We show how one can efficiently compute the frequency of a candidate tree by
joining the scope-lists of its subtrees.
5. Our formulation allows one to discover all subtrees in a forest, as well as all
subtrees in a single large tree.
Furthermore, simple modifications also allow us to mine unlabeled subtrees, unordered
subtrees, and also frequent subforests (i.e., disconnected subtrees). We also present
TREEMINERD, a method that, instead of counting all embeddings, only counts distinct
occurrences of a pattern and might be more suitable than TREEMINER for certain data
sets. We also contrast TREEMINER with another tree mining algorithm based on pattern
matching, PATTERNMATCHER. Applications in bioinformatics include mining
frequent RNA structures and common phylogenetic tree patterns.
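
The notions of a node scope and a scope-list can be illustrated with a short sketch. The text does not define either, so the sketch makes two assumptions: a tree is written as a preorder list of node labels in which -1 means "go back up to the parent", and the scope of a node is the pair (l, r), where l is its preorder position and r is the position of its last descendant. The simplified scope-list below records only (tree id, scope) per label and omits the extra bookkeeping a full miner needs for larger patterns.

```python
from collections import defaultdict


def node_scopes(encoding):
    """Return [(label, (l, r)), ...] in preorder for one encoded tree."""
    nodes = []   # each entry is [label, l, r]
    stack = []   # indices of nodes on the path from the root to the current node
    for token in encoding:
        if token == -1:
            # Backtrack: the node being closed ends at the last node seen so far.
            closed = stack.pop()
            nodes[closed][2] = len(nodes) - 1
        else:
            index = len(nodes)
            nodes.append([token, index, index])
            stack.append(index)
    while stack:  # close nodes left open, for example the root
        nodes[stack.pop()][2] = len(nodes) - 1
    return [(label, (l, r)) for label, l, r in nodes]


def scope_lists(forest):
    """Vertical view of a forest: label -> list of (tree id, scope)."""
    lists = defaultdict(list)
    for tree_id, encoding in enumerate(forest):
        for label, scope in node_scopes(encoding):
            lists[label].append((tree_id, scope))
    return dict(lists)


def is_descendant(inner, outer):
    """True if the node with scope `inner` lies strictly below the node with scope `outer`."""
    return outer[0] < inner[0] and inner[1] <= outer[1]
```

Because ancestor-descendant relationships reduce to interval containment, checking whether one labeled node is embedded under another needs only the two scopes and never a walk over the tree. This is what makes joining scope-lists an attractive way to count pattern occurrences.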
[Figure: TREEMINER algorithm]

[Figure: PATTERNMATCHER algorithm]
RNA Structure
Although RNA has a three-dimensional (3D) shape, it can be viewed in terms of its
secondary structure, which is composed mainly of double-stranded regions formed by
folding the single-stranded RNA molecule back on itself. To produce these double-stranded
regions, a subsequence of bases (made up of four letters: A, C, G, U) must be
complementary to another subsequence so that base pairing can occur (G-C and A-U). It
is these pairings that contribute to the energetic stability of the RNA molecule.

TreeBASE is a relational database designed to manage and explore information on
phylogenetic relationships.4 It stores phylogenetic trees and the data matrices used to
generate them from published research papers.
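
The base-pairing rule for RNA secondary structure (G-C and A-U) can be checked with a few lines of code. The sketch below assumes that the two subsequences pair in antiparallel orientation, so the first base of one is matched against the last base of the other; it follows the text in allowing only G-C and A-U pairs. The function name is hypothetical.

```python
# Complementary base pairs as stated in the text (G-C and A-U).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def can_pair(strand: str, partner: str) -> bool:
    """True if the two subsequences could form a double-stranded stem.

    Assumption: strands pair antiparallel, so `strand` is read forwards
    while `partner` is read backwards.
    """
    if not strand or len(strand) != len(partner):
        return False
    return all((a, b) in PAIRS for a, b in zip(strand, reversed(partner)))
```

For example, `can_pair("GGAC", "GUCC")` is true, because reading the second strand backwards gives C, C, U, G, which pairs base by base with G, G, A, C.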
RNA structure and its tree representation.

CHALLENGES IN BIOINFORMATICS FOR STATISTICAL DATA MINERS
One of the most basic operations in bioinformatics involves searching for
similarities or homologies between newly sequenced DNA or RNA and previously
sequenced DNA or RNA segments from various organisms. Finding near matches allows
researchers to predict the type of protein the sequence encodes. This not only yields
drug targets early in drug development but also weeds out many targets that
would have turned out to be dead ends. The possible financial value of some biological
data, and the ethical considerations connected with it, mean that data mining in biological
databases is not always as easy to perform as in other areas. Previous applications
of data mining and machine learning to bioinformatics include gene finding, protein
function domain detection, function motif detection, protein function inference, disease
diagnosis, disease prognosis, disease treatment optimization, protein and gene interaction
network reconstruction, data cleansing, and protein subcellular location prediction.
However, data mining in bioinformatics is hampered by many facets of biological
databases, including their size, their number, their diversity and the lack of a standard
ontology to aid querying them, as well as the heterogeneous quality of the data and of
the provenance information they contain. Another problem is the range of levels and
domains of expertise present among the potential users, so it can be difficult for
database curators to provide access mechanisms appropriate to all. The integration of
biological databases is also lacking, so it can be very difficult to query more than one
database at once.
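
The similarity search described at the start of this section can be illustrated with a small local alignment scorer in the style of the Smith-Waterman algorithm. The text does not name an alignment method, so the choice of algorithm and the scoring values (match +2, mismatch -1, gap -2) are assumptions made for illustration. Production tools use heuristics and tuned scoring matrices rather than this quadratic-time dynamic program.

```python
def local_alignment_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best local alignment score between sequences a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            score = max(
                0,                     # start a fresh local alignment
                diagonal,              # align x with y
                previous[j] + gap,     # gap in b
                current[j - 1] + gap,  # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

A newly sequenced fragment would be scored against each stored sequence and the highest-scoring entries reported as candidate homologues. For instance, `local_alignment_score("TTACGTTT", "GGACGGG")` picks out the shared `ACG` region.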
From the statistical data miner's perspective, most bioinformaticians tend to ignore
statistical data mining, are too impatient for solutions (because of the pressure to
publish), and expect data miners to identify information that was published long ago.
From the bioinformatician's perspective, on the other hand, data miners are too impatient
and do not understand the biological queries, because the two groups use different
scientific and biological terms and have to work with dreadful software.
Hence each side continues to criticize the other sarcastically. It is important to note that
bioinformatics can learn a great deal from statistical data mining, since statistical
data mining underlies much of what bioinformatics is trying to achieve. There is an
opportunity for an immensely rewarding synergy between bioinformaticians and data
miners.
Data mining and bioinformatics are fast-developing frontiers. It is important to
identify the research areas of bioinformatics and to develop new data mining
techniques for scalable and effective analysis. Bioinformaticians and statistical data
miners will inevitably grow towards each other, because bioinformatics will not become
knowledge discovery without statistical data mining and thinking. A challenge of maturity
for both is to widen their focus until true collaboration, and with it the unlocking of the
secrets of the cell, becomes a reality.

CONCLUSION
The blending of biology and computers makes bioinformatics a unique discipline
that is clearly here to stay. Bioinformatics is one of the fastest-growing areas of all life
science-based markets. In this paper, we introduced the notion of mining embedded
subtrees in a (forest) database of trees. Among our novel contributions is the procedure
for systematic candidate subtree generation, i.e., no subtree is generated more than once.
We utilized a space-efficient string encoding of the tree to store the horizontal data
set, and we used the notion of a node’s scope to develop a novel vertical representation of
a tree called scope-lists. Our formalization of the problem is flexible enough to handle
several variations. It has also been shown that the problem of multi-database mining is
challenging and pressing. In particular, due to essential differences between mono- and
multi-databases, we have defined a new process of multi-database mining for our
system.
References
1. H. Liu, J. Li, and L. Wong, “Use of Extreme Patient Samples for Outcome Prediction
from Gene Expression Data,” Bioinformatics, vol.21, no. 16, 2005, pp. 3377–3384.
2. R. Agrawal and J. Shafer, “Parallel Mining of Association Rules,” IEEE Transactions on
Knowledge and Data Engineering, vol. 8, no. 6, 1996, pp. 962–969.
3. S. Džeroski and N. Lavrač, eds., Relational Data Mining, Springer, Berlin, 2001.
4. Yan Fu et al., “A Block-Based Support Vector Machine Approach to the Protein
Homology Prediction Task in KDD Cup 2004,” ACM SIGKDD Explorations, vol. 6, no.
2, 2004, pp. 120–124.
5. S. Ray and M. Craven, “Learning Statistical Models for Annotating Proteins with
Function Information Using Biomedical Text,” BMC Bioinformatics, vol. 6, suppl. 1,
2005, p. S18.
