The SIJ Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 1, No. 2, May-June 2013

ISSN: 2321-2381 © 2013 | Published by The Standard International Journals (The SIJ) 33



Abstract: The biggest challenge for text and data mining in biomedical informatics is to influence the discovery
process by enabling scientists to generate novel hypotheses that address the most crucial questions about the
knowledge contained in biomedical articles and documents. However, a flexible and general approach for
integrating heterogeneous data and knowledge sources for discovery remains elusive and depends heavily on
the specific underlying scientific question. Our work takes this as its target: interpreting knowledge discovery
from biomedical articles. It builds on keyword search, Information Retrieval and Information Extraction to
capture the knowledge in articles held in databases such as PubMed and Oxford Journals. The true impact of
text and data mining is realized only when it goes beyond methods for extraction and indexing to enable
understanding of the entities and relationships presented in articles and documents, such as Protein-Protein
Interactions and the relationships between genes and human diseases. The underpinning process for forming
such a network is modelled here.
Keywords: Gene Disease Relationship, Gene Network, Information Retrieval, PPI, Text Mining
Abbreviations: Information Extraction (IE), Information Indexing (II), Information Retrieval (IR), Natural
Language Processing (NLP), Protein-Protein Interaction (PPI), Text Classification (TC), Text Clustering (TCl)

I. INTRODUCTION
A computer, like a human, needs certain specialized
knowledge in order to understand text. The scientific
field dedicated to equipping computers with the right
knowledge for this task is called Natural Language
Processing (NLP). Biomedical text mining is the subfield that
deals with text from biology, medicine, and
chemistry [Salton, 1989; Aronson, 1996]. A key challenge for
text mining over biomedical terminology is the dynamic nature
of the domain, in which new terms (genes, proteins, chemical
compounds, drugs) are constantly being
created [Kuramochi & Karypis, 2004]. Existing
biomedical resources and ontologies also need constant updating
[Tiffin et al., 2005]. These reasons motivate the development
of a new application that normalizes and finds structural
information in articles from open access databases. Our work
makes it possible to obtain well-summarized information and
newly discovered evidence.
II. A REVIEW OF RELATED WORK
There are a number of web-based text mining applications
that can be used for discovering knowledge from articles.
Their pitfalls stem from the size of the widely used
databases, which reduces the relevance of users' query
results, and from simple free-text queries, which return
many false positives. Additionally, when reading a
document of interest, users can query for related documents.
Query expansion or reformulation is used to improve the
retrieval of documents relevant to a free-text query or
related to a document of interest. Although such applications
are useful for exploring information in the literature, few of
them provide real-time responses: users often have to wait
several minutes before they receive results. Some
systems respond reasonably quickly by limiting the
number of documents analyzed to a very small number,
but this limitation significantly reduces
coverage [Lin et al., 2004].
To complement existing applications, we develop a
search mechanism that lays the groundwork for recovering
knowledge and hidden structured data from abstracts,
articles, discussions, etc.
III. METHODS OF KNOWLEDGE DISCOVERY AND PATHWAYS
Curators struggle to process the scientific literature to
discover the facts and events that are crucial for gaining
insights in the biosciences; this has motivated the use of
text mining to structure the huge number of articles
[Kors et al., 2005; Maier et al., 2005].
Text Mining - Interpreting Knowledge Discovery from Biomed Articles
K. Prabavathy* & Dr. P. Sumathi**
*Research Scholar in Computer Science, Manonmanium Sundaranar University, Tirunelveli Town, Tamil Nadu, INDIA.
E-Mail: praba_bud@yahoo.co.in
**Doctoral Research Supervisor & Assistant Professor, Post Graduate & Research Department of Computer Science, Government Arts
College, Coimbatore, Tamil Nadu, INDIA. E-Mail: sumathirajes@hotmail.com
The rapid growth of literature data poses challenges for
efficient extraction of information and for effective ways of
querying that information. The targets of our work are some
of the crucial applications in the biological field: named
entity recognition of biological entities, gene normalization,
Protein-Protein Interaction, functional analysis of genes, and
extraction of gene-disease associations [Manning &
Schutze, 1999; Mao & Chu, 2002].
Five main tasks and their supporting tasks are arranged, and
their results show advances in the state of the art in the
fine-grained biomedical domain. The key technologies and tasks
that ground the structured information used in our work are
as follows:
Information Retrieval (IR)
Information Extraction (IE)
Information Indexing (II)
Text Classification (TC)
Text Clustering (TCl)
3.1. Information Retrieval
Information Retrieval is the process of recovering, from a
collection of documents such as an open access database,
those documents that satisfy a given information need. The
information need is posed in the form of a flexible user query.
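As a rough illustration of retrieval driven by a keyword query, the sketch below ranks documents by how often they contain the query terms. It is not the rule-based method used in this work; the toy collection, the function names and the term-frequency scoring are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, top_k=3):
    """Rank documents by the total frequency of the query terms they contain."""
    query_terms = set(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical toy collection standing in for abstracts from an open access database.
documents = {
    "d1": "BRCA1 mutations are associated with breast cancer.",
    "d2": "Protein folding is studied with molecular dynamics.",
    "d3": "BRCA1 and BRCA2 interact in DNA repair; BRCA1 loss raises cancer risk.",
}

print(retrieve("BRCA1 cancer", documents))
```

Documents that contain none of the query terms are dropped, which is the simplest possible reading of "satisfying the information need"; a production system would add term weighting and an index rather than scanning every document.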
3.2. Information Extraction (IE)
IE refers to the automatic extraction of structured information,
such as entities, relationships between entities, and attributes
describing entities, from unstructured sources. It focuses on
the collection, organization and application of information to
answer questions. The challenges it faces are accuracy,
running time, dynamically changing sources, data integration
and extraction errors. Work on information extraction has shown
that extraction methods can generalize successfully in several
respects.
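One simple way to turn unstructured sentences into structured entity relationships is sentence-level co-occurrence: if a gene and a disease are mentioned in the same sentence, a candidate relation is recorded. The sketch below illustrates that idea only; the mini-dictionaries `GENES` and `DISEASES` and the function names are hypothetical, and co-occurrence is a stand-in rather than the extraction method of this work.

```python
import re
from itertools import product

# Hypothetical mini-dictionaries; a real system would load curated lexicons.
GENES = {"BRCA1", "TP53", "CFTR"}
DISEASES = {"breast cancer", "cystic fibrosis"}


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_gene_disease_pairs(text):
    """Return (gene, disease, sentence) triples for same-sentence co-occurrences."""
    triples = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        # Gene symbols are matched as whole words, case-sensitively.
        genes = sorted(g for g in GENES
                       if re.search(rf"\b{re.escape(g)}\b", sentence))
        # Disease names are matched as case-insensitive substrings.
        diseases = sorted(d for d in DISEASES if d in lowered)
        for gene, disease in product(genes, diseases):
            triples.append((gene, disease, sentence))
    return triples


text = ("Mutations in BRCA1 increase the risk of breast cancer. "
        "CFTR encodes a chloride channel. "
        "Loss of CFTR function causes cystic fibrosis.")

for gene, disease, _ in extract_gene_disease_pairs(text):
    print(gene, "->", disease)
```

Co-occurrence favours recall over precision: it cannot tell "causes" from "does not cause", which is one concrete form of the extraction errors mentioned above.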
3.3. Information Indexing
Efficient indexing is required to reduce the vocabulary of
terms and to simplify query formulation. Building the indexed
document collection involves tokenization, stemming, and stop
word removal.
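The three indexing steps can be sketched as a small pipeline that feeds an inverted index. The stop word list and the suffix-stripping `stem` function below are deliberately crude assumptions standing in for a full stop list and a real stemmer such as Porter's.

```python
import re
from collections import defaultdict

# Small illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def tokenize(text):
    """Tokenization: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, remove stop words, then stem what remains."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_inverted_index(documents):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


documents = {
    "d1": "The proteins are interacting with the receptor.",
    "d2": "Interaction of proteins is regulated.",
}
index = build_inverted_index(documents)

print(index_terms(documents["d1"]))
print(sorted(index["protein"]))
```

Stop word removal and stemming both shrink the vocabulary: "proteins" in either document collapses onto the single index term "protein", so one query term reaches both documents.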

Figure 1 Technologies of BioText Mining
3.4. Text Classification
A common problem in information science is the assignment of
an electronic document to one or more categories based on its
contents. In supervised document classification, correctly
labelled examples are provided and the classification model is
learnt from them using techniques such as the naive Bayes
classifier, latent semantic indexing, support vector
machines, artificial neural networks, kNN, decision trees,
and concept mining.
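Of the techniques listed, the naive Bayes classifier is the easiest to show in a few lines. The following is a minimal sketch of a multinomial naive Bayes model with add-one smoothing; the class name, the two category labels and the four training sentences are invented for illustration and are not data from this work.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # log P(class) + sum over tokens of log P(token | class)
            score = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled examples for two topics of interest.
texts = [
    "gene mutation linked to disease",
    "mutation in gene causes cancer",
    "protein binds protein in complex",
    "protein interaction network of binding partners",
]
labels = ["gene-disease", "gene-disease", "ppi", "ppi"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("protein binding complex"))
print(model.predict("gene mutation and cancer"))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the smoothing term keeps a single unseen word from forcing a class probability to zero.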
3.5. Text Clustering
Text clustering finds which documents have many words in
common and places the documents with the most words in common
into the same groups [Strehl et al., 2000]. It relies on the
similarity of documents rather than the similarity of
sequences, expression profiles or structures, and it clusters
documents into topics according to the keywords of the user
query. A clustering program tries to find the groups in the
data. Text clustering programs often first choose the
documents that seem representative of the middle of each
cluster, and then compare all the documents with these
initial representatives. Similarity is based on how many
words the documents have in common and how strongly those
words are weighted. The topical terms of the clusters are chosen
from words that represent the centre of the cluster. The best
clustering is the one in which the average distance of the
documents from their cluster centres is smallest [Varelas et al.,
2005].
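The representative-based procedure described above can be sketched as follows. This is a simplified k-means-style loop under stated assumptions: raw term counts serve as word weights, cosine similarity measures shared weighted words, and the initial representatives (`seeds`) are picked by hand. None of these choices is confirmed as the configuration used in this work.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a document as a bag of words with raw counts as weights."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Similarity from shared words, scaled by how strongly they are weighted."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average the member vectors to obtain the centre of a cluster."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def cluster(texts, seeds, iterations=5):
    """Assign each document to its most similar centre, then update the centres."""
    vectors = [vectorize(t) for t in texts]
    centres = [vectors[i] for i in seeds]  # initial representative documents
    assignment = []
    for _ in range(iterations):
        assignment = [
            max(range(len(centres)), key=lambda k: cosine(v, centres[k]))
            for v in vectors
        ]
        for k in range(len(centres)):
            members = [v for v, a in zip(vectors, assignment) if a == k]
            if members:
                centres[k] = centroid(members)
    return assignment, centres


texts = [
    "gene mutation causes disease",
    "protein binds protein partner",
    "disease gene mutation found",
    "protein partner interaction",
]
assignment, centres = cluster(texts, seeds=[0, 1])
print(assignment)
# Topical terms: the most heavily weighted words at the centre of a cluster.
print([term for term, _ in centres[1].most_common(2)])
```

The quality criterion in the text corresponds to comparing runs with different seeds and keeping the one whose documents lie closest, on average, to their assigned centres.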
IV. FUNCTIONALITY OF SYSTEM
Each of the approaches has its own strengths and weaknesses,
especially with regard to the sensitivity and specificity of the
method. A simple, fine-grained approach is used here to extract
information such as Protein-Protein Interaction (PPI) and
gene-disease relationships, and to structure the gene network,
using the technologies of biotext mining [Hoffmann &
Valencia, 2004; Sehgal & Srinivasan, 2006; Liu et al., 2006].
We use non-overlapping training and background sets,
and test sets are processed using a leave-one-out validation
procedure. The system acts as an integrated, tool-based
environment for mining information from a given body of
biomedical literature, together with a database for storing
the mined information. In the first phase, biomedical
articles and documents are retrieved from open access
databases such as Oxford Journals and PubMed. Since
there are 22 million documents, effective Information
Retrieval methods are needed; here, rule-based induction
methods are used to retrieve articles from the biomedical
literature according to the user requirements or the given
keyword. The second phase, Information Extraction, draws on
the query expansion and reformulation strategies that have
been proposed in the biomedical field: a user's free-text
query defining an information need can be enriched with
common synonyms or morphological variants, taken from
existing resources or generated automatically by keyword
analysis. In the third phase, Information Indexing, the
analysed documents are indexed. A well-known classification
algorithm is then trained on the indexed articles and
documents, using sets of documents that represent a topic of
interest from the generated thesauruses. In the final phase,
Text Clustering, every pair of documents is compared, and
the pairs of documents that are most similar to each other
are clustered together. Biological entities, including longer
entities, identified in the articles of each cluster are
linked and marked to entries in a biological database called
the temp warehouse.
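The query enrichment step of the second phase can be illustrated with a short sketch that adds synonyms and simple morphological variants to a free-text query. The synonym table and the singular/plural rule are hypothetical placeholders; the paper does not specify which thesaurus or variant rules are used.

```python
# Hypothetical synonym table; no specific thesaurus is named in the text.
SYNONYMS = {
    "tumor": ["tumour", "neoplasm"],
    "tp53": ["p53"],
}


def morphological_variants(term):
    """Very rough singular/plural variants for purely alphabetic terms."""
    if not term.isalpha():
        return []  # leave symbols such as gene identifiers untouched
    if term.endswith("s") and len(term) > 3:
        return [term[:-1]]
    return [term + "s"]


def expand_query(query):
    """Return the original terms plus synonyms and variants, without duplicates."""
    expanded = []
    for term in query.lower().split():
        candidates = [term] + SYNONYMS.get(term, []) + morphological_variants(term)
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query("TP53 tumor"))
```

The expanded term list would then be passed to the retrieval step in place of the raw query, trading some precision for higher recall.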
The complexity that arises from synonyms and acronyms,
ambiguity, typographical variants, and symbols or identifiers
of entities is overcome by various dictionary-based techniques
[Schuemie et al., 2007]. Ambiguity occurs between protein
names, their protein family names, and genes. The diversity of
word features, similarity to existing entries in the database,
and the presence of trigger words are taken into account in
the accuracy of information extraction.
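A dictionary-based technique for absorbing synonyms and typographical variants can be sketched as a lookup over normalised names. The dictionary contents and the normalisation rule (fold case; drop whitespace, hyphens and slashes) are assumptions chosen for illustration and are not the rules evaluated in the cited work.

```python
import re

# Hypothetical dictionary mapping a canonical identifier to its known names.
GENE_DICTIONARY = {
    "TP53": ["TP53", "p53", "tumor protein 53"],
    "ERBB2": ["ERBB2", "HER2", "HER-2", "neu"],
}


def normalise(name):
    """Fold case and drop whitespace, hyphens and slashes."""
    return re.sub(r"[\s\-/]+", "", name.lower())


# Precompute normalised alias -> identifier for constant-time lookup.
LOOKUP = {
    normalise(alias): gene_id
    for gene_id, aliases in GENE_DICTIONARY.items()
    for alias in aliases
}


def map_to_identifier(mention):
    """Return the canonical identifier for a mention, or None if unknown."""
    return LOOKUP.get(normalise(mention))


for mention in ["Her 2", "P-53", "Tumor Protein 53", "BRCA1"]:
    print(mention, "->", map_to_identifier(mention))
```

A lookup of this kind handles spelling variation but not true ambiguity: when the same name denotes both a protein and its family, context such as trigger words is still needed to choose between them.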


Figure 2 Functionality of a System
V. DISCUSSION AND CONCLUSION
The goal of our work is to provide an adapted system that
gathers relevant and subtle information from text, abstracts
and articles. The retrieval of documents
related to a single document can be significantly improved by
using the extracted output. The extracted output preserves the
general design and goals of earlier work, but adds a new focus
on variability to address a limitation of existing systems. It
is intended for biologists interested in adding text mining
tools to their bioinformatics toolbox. It serves as a unique
forum for discussing novel approaches to text and data mining
that respond to specific scientific questions, enabling
predictions that integrate a variety of data sources and can
potentially impact scientific discovery.
REFERENCES
[1] G. Salton (1989), Automatic Text Processing: The
Transformation, Analysis, and Retrieval of Information by
Computer, Addison-Wesley, Reading, MA.
[2] A.R. Aronson (1996), The Effect of Textual Variation on
Concept Based Information Retrieval, Proceedings of AMIA
Annu Fall Symp, Pp. 373.
[3] C. Manning & H. Schutze (1999), Foundations of Statistical
Natural Language Processing, The MIT Press, Cambridge,
MA.
[4] A. Strehl, J. Ghosh & R.J. Mooney (2000), Impact of
Similarity Measures on Webpage Clustering, AAAI Workshop
on AI for Web Search, Pp. 58-64.
[5] W. Mao & W.W. Chu (2002), Free Text Medical Document
Retrieval via Phrase-based Vector Space Model, Proceedings
of AMIA02, San Antonio, TX.
[6] M. Kuramochi & G. Karypis (2004), An Efficient Algorithm
for Discovering Frequent Subgraphs, IEEE Transactions on
Knowledge and Data Engineering, Vol. 16, No. 9.
[7] S.M. Lin, P. McConnell, K.F. Johnson &
J. Shoemaker (2004), MedlineR: An Open Source
Library in R for Medline Literature Data Mining,
Bioinformatics, Vol. 20, Pp. 3659-3661.
[8] R. Hoffmann & A. Valencia (2004), A Gene Network for
Navigating the Literature, Nat Genet, Vol. 36, Pp. 664.
[9] N. Tiffin, J.F. Kelso, A.R. Powell, H. Pan, V.B. Bajic & W.A. Hide
(2005), Integration of Text- and Data-Mining using Ontologies
Successfully Selects Disease Gene Candidates, Nucleic Acids
Res., Vol. 33, No. 5, Pp. 1544-1552.
[10] J. Kors, M. Schuemie, B. Schijvenaars, M. Weeber & B. Mons
(2005), Combination of Genetic Databases for Improving
Identification of Genes and Proteins in Text, Biolink
Conference.
[11] H. Maier, S. Döhr, K. Grote, S. O'Keeffe, T. Werner, M. Hrabé
de Angelis & R. Schneider (2005), LitMiner and WikiGene:
Identifying Problem-Related Key Players of Gene Regulation
using Publication Abstracts, Nucleic Acids Res., Vol. 33, Pp.
W779-W782.
[12] G. Varelas, E. Voutsakis, E.G.M. Petrakis, E.E. Milios &
P. Raftopoulou (2005), Semantic Similarity
Methods in WordNet and their Application to Information
Retrieval on the Web, WIDM '05, New York: ACM Press, Pp.
10-16.
[13] A.K. Sehgal & P. Srinivasan (2006), Retrieval with Gene
Queries, BMC Bioinformatics, 7:220.
[14] H. Liu, Z.Z. Hu, J. Zhang & C. Wu (2006), BioThesaurus: A
Web-based Thesaurus of Protein and Gene Names,
Bioinformatics, Vol. 22, Pp. 103-105.
[15] M.J. Schuemie, B. Mons, M. Weeber & J.A. Kors (2007),
Evaluation of Techniques for Increasing Recall in a
Dictionary Approach to Gene and Protein Name
Identification, Journal of Biomedical Informatics, Vol. 40,
No. 3, Pp. 316-324.
[Figure 2 labels: Keyword/Query; Information Retrieval; Articles from Db; Information Extraction; Information Indexing; Text Classification; Text Clustering; Temp Warehouse; Identification; Filtering; Analysis; Grouping; Extracted Output]
K. Prabavathy, M.Sc., M.Phil., is a doctoral
research scholar in the Department of Computer
Science, Manonmanium Sundaranar
University, Tirunelveli, Tamil Nadu, India.
She completed her M.Phil. in the area of Data
Mining, and received her MCA degree from
Bharathiar University, Coimbatore, and her M.Sc.
degree from Madurai Kamaraj University,
Madurai. She has published a number of
papers in reputed journals and conferences. She has about five years
of teaching and research experience. Her areas of interest
include Data Mining, Bioinformatics and Computer Networks.







Dr. P. Sumathi is working as an Assistant
Professor in the PG & Research Department of
Computer Science, Government Arts
College, Coimbatore, Tamil Nadu, India. She
received her Ph.D. in the area of Grid
Computing from Bharathiar University. She
completed her M.Phil. in the area of Software
Engineering at Mother Teresa Women's
University and received her MCA degree from
Kongu Engineering College, Perundurai. She has published a
number of papers in reputed journals and conferences. She has about
sixteen years of teaching and research experience. Her research
interests include Data Mining, Grid Computing and Software
Engineering.
