International Journal of Trend in Scientific Research and Development (IJTSRD)
International Open Access Journal
ISSN No: 2456-6470 | www.ijtsrd.com | Volume - 2 | Issue – 1

Elementary Approach towards Biological Data Mining

Faiz Hashmi
Department of Biotechnology, IILM Academy of
Higher Learning, Greater Noida, Uttar Pradesh, India

ABSTRACT

In this paper we provide an overview of interactive and integrative knowledge discovery and data mining. The most important challenges include the need to develop and apply novel methods, algorithms and tools for the integration, fusion, pre-processing, mapping, analysis and interpretation of complex biomedical data, with the aim of identifying testable hypotheses and building realistic models. The HCI-KDD approach, which is a synergistic combination of methodologies and approaches from two areas, Human–Computer Interaction (HCI) and Knowledge Discovery & Data Mining (KDD), offers ideal conditions for solving these challenges, with the goal of supporting human intelligence with machine intelligence. There is an urgent need for integrative and interactive machine learning solutions, because no medical doctor or biomedical researcher can keep pace today with the increasingly large and complex data sets, often called “Big Data”. The application of data mining in the domain of bioinformatics is explained. The paper also highlights some of the current challenges and opportunities of data mining in bioinformatics.

Keywords: Data Mining, HCI and KDD (Human-Computer Interaction, Knowledge Discovery & Data Mining), Big Data, Interactive Knowledge Discovery.

Introduction

Data mining is defined as the process of automatically extracting meaningful patterns from usually very large quantities of seemingly unrelated data. Data mining emerged as a new discipline at the end of the 1980s. The introduction of new technologies such as computers, satellites, new mass storage media and many others has led to an exponential growth of collected data. Traditional data analysis techniques often failed to process large amounts of often noisy data efficiently, in an exploratory fashion. The scope of data mining is the extraction of knowledge from large amounts of data with the help of computers. It is an interdisciplinary area of research that has its roots in databases, machine learning and statistics, with contributions from many other areas such as information retrieval, pattern recognition, visualization, and parallel and distributed computing. It is an iterative process in which preceding processes are modified to support new hypotheses suggested by the data. The main aim of data mining is to explore databases through automated means and discover meaningful, useful patterns and relationships in data. Data mining can be defined as one particular step of the KDD (knowledge discovery from data) process: the identification of interesting structures in data. It uses different algorithms for classification, regression, clustering or association rules.

The steps for data mining follow this pattern:

• Data Extraction
• Data Cleansing
• Data Transformation/Reduction
• Data Mining Methods
• Applying Data Mining Algorithm
• Modeling Data
• Pattern Discovery
• Data Visualization

Data Extraction

Data selection and sampling from the data held in data warehouses, databases, data marts and repositories is the first challenging step in data mining. Data mining requires a controlled vocabulary, usually implemented as part of a data dictionary, so that a single word can be

@ IJTSRD | Available Online @ www.ijtsrd.com | Volume – 2 | Issue – 1 | Nov-Dec 2017 | Page: 1109

used to express a given concept. As thousands or millions of records and variables are gathered in data warehouses and databases, the initial mining of meaningful data is quite a complicated process. Mining is therefore typically restricted to a computationally tenable sample of the holdings of an entire data warehouse. The evaluation of the relationships that are revealed in these samples can be used to determine which relationships in the data should be mined further using the complete data warehouse. With large, complex databases, even with sampling, the computational resource requirements associated with non-directed data mining may be excessive. In this situation, researchers generally rely on their knowledge of biology to identify potentially valuable relationships, and they limit sampling based on these heuristics.

Data Cleansing

The data collected are not clean and may contain errors, missing values, and noisy or inconsistent entries, so different techniques must be applied to remove such anomalies.

Once the data have been extracted, they have to be preprocessed and cleaned. This is done in the following steps:

Data Characterization: This deals with documenting the data in an appropriate and meaningful manner, so that anyone can understand and interpret the data comfortably. The task is usually done by programmers and other staff involved in the data mining project; it involves creating a high-level description of the nature and content of the data to be mined.

Consistency Analysis: This analyzes the variability of the data independent of the domain. It is primarily a statistical analysis based on the data values. Outliers and values determined to be significantly different from other data may be automatically excluded from the knowledge-discovery process, based on predefined statistical constraints.

For example, data associated with a given parameter that are more than three standard deviations from the mean might be excluded from the mining operation.

Domain Analysis: This validates the data values in the larger context of biology. It goes beyond simply verifying that a data value is a text string or an integer, or that it is statistically consistent with other data on the same parameter, to ensure that it makes sense in the context of the biology. Domain analysis requires that someone familiar with the biology create the heuristics that can be applied to the data.

Data Enrichment: This involves combining data from multiple sources to minimize the limitations of a single data source. For example, two databases on inherited diseases might each be sparsely populated in terms of proteins that are associated with particular diseases. This deficit could be addressed by incorporating data from both databases, assuming only a moderate degree of overlap in the content of the two databases.

Frequency and Distribution Analysis: This places weights on values as a function of their frequency of occurrence. It is done to maximize the contribution of common findings while minimizing the effect of rare occurrences on the conclusions made from the data-mining output.

Data Transformation

Even after cleaning, the data are not ready for mining, as they must first be transformed into forms appropriate for mining. The techniques used to accomplish this include smoothing, aggregation and normalization.

Normalization: This represents the data in different forms depending on the analysis and on the further processes to be implemented. It involves transforming data values from one representation to another, using a predefined range of final values. Various scales are used in the normalization process, such as absolute scales, nominal scales, ordinal scales and rank scales. For example, qualitative values, such as "high" and "low," and quantitative values from multiple sources regarding a particular parameter might be normalized to a numerical score from 1 to 10.

Data Mining

Now we are ready to apply data mining techniques to the data to discover interesting patterns. The process of data mining is concerned with extracting patterns from the data. Clustering and association analysis are among the many different techniques used for data mining.

Applying Data Mining Algorithm

Data mining is not a single method or approach; rather, it brings together various technologies and techniques to achieve proper mining of a wide range of biological data of interest. Machine learning methods have wide applicability in data mining algorithms. Machine learning draws on statistics, biological modeling, adaptive control theory, psychology, and artificial
intelligence (AI). Genetic algorithms and neural networks in particular play a major part as techniques applied to biological data. Similarly, adaptive control theory, where the parameters of a system change dynamically to meet the current conditions, and psychological theories, especially those regarding positive and negative reinforcement learning, heavily influence machine learning methods. Artificial intelligence techniques, such as pattern matching through inductive logic programming, are designed to derive general rules from specific examples.

Data Modeling

Data modeling is the process of structuring and organizing data; the structured data are then implemented in a database management system. Today’s biological research demands heavy exploitation of data. These data come in various forms, which have to be encapsulated in a meaningful manner. The data are in disparate formats, remotely dispersed, and based on the different vocabularies of various disciplines. Furthermore, data are often stored or distributed using formats that leave implicit many important features relating to the structure and semantics of the data. Conceptual data modeling involves the development of implementation-independent models that capture and make explicit the principal structural properties of data. Entities such as a biopolymer or a reaction, and their relations (e.g. "catalyzed by"), can be formalized using a conceptual data model. Conceptual models are implementation-independent and can be transformed in systematic ways for implementation on different platforms, e.g. traditional database management systems.

Pattern Discovery

Biology has been transformed from a data-poor to a data-rich field, with massive accumulation of disparate types of data, for example huge databases of sequences (DNA, RNA, or protein). These data allow important biological insights to be made, partly by finding patterns and motifs that are conserved across many individuals or species; there is now a huge biological literature reporting on such conserved patterns and motifs that have been found in biological datasets. In contrast to the area of pattern matching, the patterns and motifs are generally not known ahead of time, but must be identified or discovered from the data. This task is often very subtle and difficult because the patterns and motifs may be short, may be highly degenerate (containing wildcards and variable-length elements), may be ordered differently in different genomes, and are generally hidden in that they make up a small fraction of the data. For particular biological applications, even the definition of a relevant pattern may be difficult to state clearly, or may be unresolved. In bioinformatics, pattern recognition is most often concerned with the automatic classification of character sequences representative of nucleotide bases or molecular structures, and of 3D protein structures.

Data Visualization

Visualizing biological data is one of the most challenging parts of the data mining process. In this modern, digital society, how the data are visualized becomes the prime factor when it comes to communicating or understanding complex concepts: the better the data are visualized, the clearer the concepts will be. Visualization technologies can provide an intuitive representation of the relationships among large groups of objects or data points that could otherwise be incomprehensible, while providing context and indications of relative importance. "Sequence Visualization" and "Structure Visualization" are two types of data visualization techniques.

Houle et al. (2000) refer to a classification of three successive levels for the analysis of biological data, identified on the basis of the central dogma of molecular biology:

1. Genomics is the study of an organism's genome and deals with the systematic use of genome information to provide new biological knowledge.

2. Gene expression analysis is the use of quantitative mRNA-level measurements of gene expression (the process by which a gene’s coded information is converted into the structural and functional units of a cell) in order to characterize biological processes and elucidate the mechanisms of gene transcription (Houle et al., 2000).

3. Proteomics is the large-scale study of proteins, particularly their structures and functions.

These application domains are examined in the following paragraphs. As many genome projects (the endeavors to sequence and map genomes) like the Human Genome Project have been completed, there is a paradigm shift from static structural genomics to dynamic functional genomics (Houle et al., 2000). The term structural genomics refers to the DNA sequence determination and mapping activities, while functional genomics refers to the assignment of functional information to known sequences. There are particular DNA sequences
that have a specific biological role. The identification of such sequences is a problem that concerns bioinformatics scientists. One such sequence is the transcription start site, which is the region of DNA where transcription (the process of mRNA production from DNA) starts. Another biologically meaningful sequence is the translation initiation site, which is the site where translation (protein production from mRNA) initiates. Although every cell in an organism, with only a few exceptions, has the same set of chromosomes, two cells may have very different properties and functions. This is due to differences in the abundance of proteins. The abundance of a protein is partly determined by the levels of mRNA, which in turn are determined by the expression or non-expression of the corresponding gene. One tool for analyzing gene expression is the microarray. A microarray experiment measures the relative mRNA levels of typically thousands of genes, providing the ability to compare the expression levels of different biological samples. These samples may correlate with different time points taken during a biological process or with different tissue types such as normal cells and cancer cells (Aas, 2001).

Serial Analysis of Gene Expression (SAGE) is a method that allows the quantitative profiling of a large number of transcripts (Velculescu et al., 1995). A transcript is a sequence of mRNA produced by transcription. However, this method is very expensive in contrast to microarrays, so there is a limited amount of publicly available SAGE data. One of the concerns of proteomics is the prediction of protein properties such as active sites, modification sites, localization, stability, globularity, shape, protein domains, secondary structure and interactions (Whishart, 2002). Secondary structure prediction is one of the most important problems in proteomics. The interaction of proteins with other biomolecules is another important issue.

Mining Biological Data

Data mining is the discovery of useful knowledge from databases. It is the main step in the process known as Knowledge Discovery in Databases (KDD) (Fayyad et al., 1996), although the two terms are often used interchangeably. Other steps of the KDD process are the collection, selection, and transformation of the data and the visualization and evaluation of the extracted knowledge. Data mining employs algorithms and techniques from statistics, machine learning, artificial intelligence, databases and data warehousing, among others. Some of the most popular tasks are classification, clustering, association and sequence analysis, and regression. Depending on the nature of the data as well as the desired knowledge, there is a large number of algorithms for each task. All of these algorithms try to fit a model to the data (Dunham, 2002). Such a model can be either predictive or descriptive. A predictive model makes a prediction about data using known examples, while a descriptive model identifies patterns or relationships in data. Table 3 presents the most common data mining tasks (Dunham, 2002).

Many general data mining systems such as SAS Enterprise Miner, SPSS, S-Plus, IBM Intelligent Miner, Microsoft SQL Server 2000, SGI MineSet, and Inxight VizServer can be used for biological data mining. In addition, dedicated biological data mining tools such as GeneSpring, Spotfire, Vector NTI, COMPASS, Statistics for Microarray Analysis, and Affymetrix Data Mining Tool have been developed (Han, 2002). A large number of biological data mining tools is also provided by the National Center for Biotechnology Information and by the European Bioinformatics Institute.

Data Mining in Genomics

Many data mining techniques have been proposed to deal with the identification of specific DNA sequences. The most common include neural networks, Bayesian classifiers, decision trees, and Support Vector Machines (SVMs) (Ma & Wang, 1999; Hirsh & Noordewier, 1994; Zien et al., 2000). Sequence recognition algorithms exhibit performance tradeoffs between increasing sensitivity (ability to detect true positives) and decreasing selectivity (ability to exclude false positives) (Houle et al., 2000). However, as Li et al. (2003) state, traditional data mining techniques cannot be directly applied to this type of recognition problem. Thus, there is a need to adapt the existing techniques to this kind of problem. Attempts to overcome this difficulty have been made using feature generation and feature selection (Zeng & Yap, 2002; Li et al., 2003). Another data mining application at the genomic level is the use of clustering algorithms to group structurally related DNA sequences.

Gene Expression Data Mining

The main types of microarray data analysis include (Piatetsky-Shapiro & Tamayo, 2003): gene selection, clustering, and classification. Piatetsky-Shapiro and Tamayo (2003) present one great challenge that data mining practitioners have to deal with. Microarray datasets, in contrast with other application domains, contain a small number of records (less than a
hundred), while the number of fields (genes) is typically in the thousands. The same is true of SAGE data. This increases the likelihood of finding “false positives”. An important issue in data analysis is feature selection. In gene expression analysis the features are the genes. Gene selection is the process of finding the genes most strongly related to a particular class. One benefit of this process is the reduction of the aforementioned dimensionality of the dataset. Moreover, a large number of genes are irrelevant when classification is applied, and the danger of overshadowing the contribution of relevant genes is reduced when gene selection is applied. Clustering is by far the most used method in gene expression analysis. Tibshirani et al. (1999) and Aas (2001) classify clustering methods into two categories: one-way clustering and two-way clustering. Methods of the first category are used to group either genes with similar behavior or samples with similar gene expressions. Two-way clustering methods are used to cluster genes and samples simultaneously. Hierarchical clustering is currently the most frequently applied method in gene expression analysis. An important issue concerning the application of clustering methods to microarray data is the assessment of cluster quality. Many techniques such as bootstrap, repeated measurements, mixture model-based approaches, sub-sampling and others have been proposed to deal with the assessment of cluster reliability (Kerr & Churchill, 2001; Yeung et al., 2003; Ghosh & Chinnaiyan, 2002; Smolkin & Ghosh, 2003). In microarray analysis, classification is applied to discriminate diseases or to predict outcomes based on gene expression patterns, and perhaps even to identify the best treatment for a given genetic signature (Piatetsky-Shapiro & Tamayo, 2003). Table 4 lists the most commonly used methods in microarray data analysis. Detailed descriptions of these methods can be found in the literature (Aas, 2001; Tibshirani et al., 1999; Hastie et al., 2000; Lazzeroni & Owen, 2002; Dudoit et al., 2002; Golub et al., 1999).

Most of the methods used for microarray data analysis can also be used for SAGE data analysis. Finally, machine learning and data mining can be applied to design microarray experiments, in addition to analyzing them (Molla et al., 2004).

Data Mining in Proteomics

Many modification sites can be detected by simply scanning a database that contains known modification sites. However, in some cases, a simple database scan is not effective. The use of neural networks provides better results in these cases. Similar approaches are used for the prediction of active sites. Neural network approaches and nearest neighbor classifiers have been used to deal with protein localization prediction (Whishart, 2002). Neural networks have also been used to predict protein properties such as stability, globularity and shape. Whishart refers to the use of hierarchical clustering algorithms for predicting protein domains. Data mining has also been applied to protein secondary structure prediction. This problem has been studied for over 30 years and many techniques have been developed (Whishart, 2002). Initially, statistical approaches were adopted to deal with this problem. Later, more accurate techniques based on information theory, Bayes theory, nearest neighbors, and neural networks were developed. Combined methods, such as multiple sequence alignments integrated with neural network or nearest neighbor approaches, improve prediction accuracy. A density-based clustering algorithm (GDBSCAN), presented by Sander et al. (1998), can be used to deal with protein interactions. This algorithm is able to cluster point objects and spatially extended objects according to both their spatial and non-spatial attributes.

Databases of Bioinformatics

There are many rapidly growing databases in the field of bioinformatics.

Protein Data Bank

The PDB archive is a repository of atomic coordinates and other information describing proteins and other important biological macromolecules. Structural biologists use methods such as X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy to determine the location of each atom relative to the others in the molecule. They then deposit this information, which is annotated and publicly released into the archive by the PDB.

SWISS-PROT

SWISS-PROT is an annotated protein sequence database, which was created at the Department of Medical Biochemistry of the University of Geneva and has been a collaborative effort of the Department and the European Molecular Biology Laboratory (EMBL) since 1987. The SWISS-PROT protein sequence database consists of sequence entries. Sequence entries are composed of different line types, each with its own format. For standardization purposes the format of SWISS-PROT follows as closely as possible that of the
EMBL Nucleotide Sequence Database. The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria:

(i) Annotations,
(ii) Minimal redundancy and
(iii) Integration with other databases.

Medline

MEDLINE (Medical Literature Analysis and Retrieval System Online, or MEDLARS Online) is a bibliographic database of life sciences and biomedical information. It includes bibliographic information for articles from academic journals covering medicine, nursing, pharmacy, dentistry, veterinary medicine, and health care. MEDLINE also covers much of the literature in biology and biochemistry, as well as fields such as molecular evolution.

The EMBL Nucleotide Sequence Database

The EMBL Database collects, organizes and distributes a database of nucleotide sequence data and related biological information. Since 1982 this work has been done in collaboration with GenBank (NCBI, Bethesda, USA) and the DNA Database of Japan (Mishima). Each of the three international collaborating databases, DDBJ/EMBL/GenBank, collects a portion of the total sequence data reported world-wide. All new and updated database entries are exchanged among the members of the International Nucleotide Sequence Collaboration on a daily basis. EMBL Database releases are produced quarterly and are distributed on CD-ROM. The most up-to-date data collection is available via the Internet and a World Wide Web interface.

Biological Data Analysis

Biological data mining is a very important part of bioinformatics. The following are the aspects in which data mining contributes to biological data analysis:

• Semantic integration of heterogeneous, distributed genomic and proteomic databases.
• Alignment, indexing, similarity search and comparative analysis of multiple nucleotide sequences.
• Discovery of structural patterns and analysis of genetic networks and protein pathways.
• Association and path analysis.
• Visualization tools in genetic data analysis.

Applications of data mining in bioinformatics

• Gene finding,
• Protein function domain detection,
• Function motif detection,
• Protein function inference,
• Disease diagnosis,
• Disease prognosis,
• Disease treatment optimization,
• Protein and gene interaction network reconstruction,
• Data cleansing,
• Protein sub-cellular location prediction,
• Analysis of protein and nucleotide sequences.
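
As an illustration of the pre-processing steps described under Data Cleansing and Data Transformation, the following Python sketch excludes values lying more than three standard deviations from the mean (the consistency-analysis example) and then rescales the remaining values to a score from 1 to 10 (the normalization example). The function names, the sample readings and the label-to-score mapping are hypothetical and chosen only for illustration; the paper itself does not prescribe an implementation, and min-max rescaling is just one possible way to realize the normalization it describes.

```python
from statistics import mean, stdev


def exclude_outliers(values, max_sd=3.0):
    """Drop values more than `max_sd` sample standard deviations from the mean."""
    if len(values) < 2:
        return list(values)
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        return list(values)
    return [v for v in values if abs(v - mu) <= max_sd * sd]


def normalize_to_score(values, low=1.0, high=10.0):
    """Linearly rescale values onto the range [low, high] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [(low + high) / 2 for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]


# Hypothetical mapping of qualitative labels onto the same 1-10 scale.
LABEL_SCORES = {"low": 1.0, "high": 10.0}

# Hypothetical measurements for one parameter; 250.0 is an implausible reading.
readings = [10.0 + 0.1 * i for i in range(20)] + [250.0]

kept = exclude_outliers(readings)      # consistency analysis
scores = normalize_to_score(kept)      # normalization of quantitative values
label_scores = [LABEL_SCORES[x] for x in ["high", "low"]]  # qualitative values
```

Note that with very small samples a single extreme value inflates the standard deviation so much that it can never exceed the three-standard-deviation limit, so a rule of this kind is only meaningful when enough observations are available.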

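
Gene selection, described under Gene Expression Data Mining as finding the genes most strongly related to a particular class, can be sketched as a simple ranking. The example below scores each gene by the difference of its class means divided by the sum of its class standard deviations, a signal-to-noise style statistic. The choice of this score is an assumption made for illustration, since the paper does not name a specific ranking criterion, and the gene names, class labels and expression values are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(class_a, class_b):
    """Difference of class means divided by the sum of class standard deviations."""
    spread = stdev(class_a) + stdev(class_b)
    return (mean(class_a) - mean(class_b)) / spread if spread else 0.0


def select_genes(expression, labels, positive, top_k):
    """Rank genes by absolute signal-to-noise score and keep the `top_k` best."""
    scores = {}
    for gene, levels in expression.items():
        in_class = [x for x, lab in zip(levels, labels) if lab == positive]
        out_class = [x for x, lab in zip(levels, labels) if lab != positive]
        scores[gene] = abs(signal_to_noise(in_class, out_class))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical data: expression level of each gene in four samples.
labels = ["tumor", "tumor", "normal", "normal"]
expression = {
    "geneA": [9.0, 9.5, 2.0, 2.5],  # clearly higher in tumor samples
    "geneB": [5.0, 5.5, 5.2, 4.9],  # roughly the same in both classes
    "geneC": [1.0, 1.4, 6.0, 6.6],  # clearly lower in tumor samples
}

selected = select_genes(expression, labels, positive="tumor", top_k=2)
```

Keeping only the top-ranked genes reduces the dimensionality of the dataset before classification, which is the benefit of gene selection noted in the text.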

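
Hierarchical clustering, which the paper identifies as the most frequently applied method in gene expression analysis, can be sketched in its agglomerative form as follows. This is a minimal one-way clustering of genes, assuming Euclidean distance between expression profiles and average linkage between clusters; both choices, as well as the function names and the small expression matrix, are assumptions made for illustration rather than details taken from the paper.

```python
from itertools import combinations
from math import dist  # available from Python 3.8


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean Euclidean distance between all cross-cluster pairs of profiles."""
    total = sum(dist(profiles[i], profiles[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def hierarchical_clustering(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(clusters[pair[0]], clusters[pair[1]], profiles),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical expression levels: rows are genes, columns are four samples.
expression = [
    [1.0, 1.1, 0.9, 1.0],  # gene 0
    [1.1, 1.0, 1.0, 0.9],  # gene 1
    [5.0, 5.2, 4.9, 5.1],  # gene 2
    [5.1, 5.0, 5.0, 5.2],  # gene 3
]

groups = sorted(hierarchical_clustering(expression, n_clusters=2))
```

Transposing the matrix would group samples instead of genes, which is the other use of one-way clustering mentioned in the text. This naive version recomputes all pairwise cluster distances at every merge, so it is suitable only for small illustrative inputs.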