Automatic Text Classification Using Supervised Learning

IDL - International Digital Library Of Technology & Research

Volume 1, Issue 2, Mar 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Ms. NAYANA N MURTHY 1, Mrs. SHASHIREKHA H 2
Dept. of Computer Science
1 MTech Student, VTU PG Center, Mysuru, India
2 Guide, Assistant Professor, VTU PG Center, Mysuru, India
SURVEY PAPER

1. ABSTRACT - As time goes on, the digitization of text has been increasing remarkably, and the need to organize, categorize and classify text has become indispensable. Disorganized and poorly categorized text may gradually slow down text and information retrieval. It is therefore important to organize, categorize and classify texts and digitized documents according to the descriptions proposed by text mining experts and computer scientists. Automated text classification is considered an imperative method for managing and processing the large and continuously increasing number of documents in digital form. In general, text classification plays a substantial role in information extraction, text retrieval and question answering. This paper emphasizes the text classification process using machine learning techniques.

2. INTRODUCTION
Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles even for documents within one genre. In other words, the data are heterogeneous.

Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$. As in every supervised machine learning task, an initial dataset is needed. A document may be assigned to more than one category (Ranking Classification), but in this paper only research on Hard Categorization (assigning a single category to each document) is taken into consideration. Moreover, approaches that take into consideration other information besides the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that exploit the text of each document to the fullest and perform best under this condition.

3. PLAN OF WORK FLOW
B2B market places are an intermediate layer for business communications that provide one serious advantage to their clients: they can communicate with a large number of customers through one communication channel to the market place. A successful market place has to deal with various aspects. It has to integrate with various hardware and software platforms and has to provide a common protocol for information exchange. However, the real problem is the heterogeneity and openness of the exchanged content. Therefore, content management is one of the real challenges in successful B2B electronic commerce. One serious problem is that document descriptions must be classified. Each document will have its own taxonomy, which organizes the document
into its respective categories. Each supplier uses different structures and vocabularies to describe its documents. This may not cause a problem for a 1-1 relationship, where the buyer may get used to the private terminology of his supplier. B2B market places that enable n-m commerce cannot rely on such an assumption. They must classify all documents according to a standard classification schema that helps buyers and suppliers communicate their document information. A widely used classification schema is UNSPSC. Again, classifying the documents according to a classification schema like UNSPSC is a difficult and mainly manual task. It requires domain expertise and knowledge about the document domain. Finding the right place for a document description in a standard classification system such as UNSPSC is not at all a trivial task. Each document must be mapped to the corresponding document category in UNSPSC to create the document catalog. Document classification schemes contain a huge number of categories with far from sufficient definitions (e.g. over 12,000 classes for UNSPSC), and millions of documents must be classified according to them. Document classification is expensive, complicated, time-consuming and error-prone. Content management therefore needs support in automating the document classification process. Text mining and machine learning work together for the automatic classification of documents. The figure below shows the flow of the text classification process.

[Figure: flow of the text classification process]

The motivating perspective of text mining is Information Extraction (IE), which extracts specific information from a document description. Natural Language Processing (NLP) aims to achieve a better understanding of natural language by computers and to represent the description semantically in order to improve the classification process. Text representation is an important aspect of the classification process; it denotes the mapping of a document description into a compact form of its contents. A description is typically represented as a vector of term weights (word features) from a set of terms (dictionary), where each term occurs at least once in some document description. A major characteristic of the classification problem is the extremely high dimensionality of text data. The number of potential features often exceeds the number of training documents.

4 SURVEY
4.1 Text Classification Using Machine Learning Techniques
This part of the survey covers machine learning techniques. Automatic text classification has been an important application and research topic since the inception of digital documents. Today, text classification is a necessity because of the very large number of text documents that we have to deal with daily. In general, text classification includes topic-based text classification and genre-based text classification. Topic-based text categorization classifies documents according to their topics. Texts can also be written in many genres, for instance scientific articles, news reports, movie reviews and advertisements. Genre is defined by the way a text was created, the way it was edited, the register of language it uses, and the kind of audience to whom it is addressed. Previous work on genre classification recognized that this task differs from topic-based categorization. Typically, most data for genre classification are collected from the web, through newsgroups, bulletin boards, and broadcast or printed news. They are multi-source, and consequently have different formats, different preferred vocabularies and often significantly different writing styles even for documents within one genre. In other words, the data are heterogeneous.

Intuitively, text classification is the task of classifying a document under a predefined category. More formally, if $d_i$ is a document of the entire set of documents $D$ and $\{c_1, c_2, \ldots, c_n\}$ is the set of all the categories, then text classification assigns one category $c_j$ to a document $d_i$. As in every supervised machine learning task, an initial dataset is needed. A document may be assigned to more than one category (Ranking Classification), but in this paper only research on Hard Categorization (assigning a single category to each document) is taken into consideration. Moreover, approaches that take into consideration other information besides the pure text, such as the hierarchical structure of the texts or the date of publication, are not presented. This is because the main aim of this paper is to present techniques that exploit the text of each document to the fullest and perform best under this condition.

4.2 A Review of Machine Learning Algorithms for
Text-Documents Classification
Text mining studies are gaining importance because of the increasing number of electronic documents available from a variety of sources. The sources of unstructured and semi-structured information include the World Wide Web, governmental electronic repositories, news articles, biological databases, chat rooms, digital libraries, online forums, electronic mail and blog repositories. Therefore, proper classification and knowledge discovery from these resources is an important area for research. Natural Language Processing (NLP), data mining and machine learning techniques work together to automatically classify and discover patterns from electronic documents. The main goal of text mining is to enable users to extract information from textual resources; it deals with operations such as retrieval, classification (supervised, unsupervised and semi-supervised) and summarization. The question, however, is how these documents can be properly annotated, presented and classified. This involves several challenges, such as proper annotation of the documents, appropriate document representation, dimensionality reduction to handle algorithmic issues, and an appropriate classifier function to obtain good generalization and avoid over-fitting. Extraction, integration and classification of electronic documents from different sources, and knowledge discovery from these documents, are important for the research communities. Today the web is the main source of text documents, the amount of textual data available to us is consistently increasing, and approximately 80% of the information of an organization is stored in unstructured textual format, in the form of reports, email, views, news, etc. Approximately 90% of the world's data is reported to be held in unstructured formats, so information-intensive business processes demand that we move beyond simple document retrieval to knowledge discovery. The need to automatically retrieve useful knowledge from the huge amount of textual data in order to assist human analysis is fully apparent. Market trend analysis based on the content of online news articles, sentiments and events is an emerging topic for research in the data mining and text mining community. For this purpose, state-of-the-art approaches to text classification have been presented in the literature, in which three problems were discussed: document representation, classifier construction and classifier evaluation. So constructing a data structure that can represent the documents, and constructing a classifier that can be used to predict the class label of a document with high accuracy, are the key points in text classification.

4.3 A Concept of Text Classification Using Machine Learning
The modern information age produces a vast amount of textual data, which can be termed unstructured data. The Internet and corporations spread across the globe produce textual data at an exponential rate, and this data needs to be shared by individuals on a need basis. If the generated data is properly organized and classified, the needed data can be retrieved easily with the least effort. Hence the need for automatic methods to organize and classify documents becomes inevitable given such exponential growth in documents, especially after the increase in Internet usage by individuals. Automatic classification refers to assigning documents to a set of pre-defined classes based on the textual content of the document. The classification can be flat or hierarchical. When the class categories grow significantly large in number, say into the thousands, searching with such a large number of categories becomes very difficult. This difficulty leads to hierarchical classification, in which the thematic relationship between the classes is also used in searching for documents. Text Categorization (TC), also known as Text Classification, is the task of automatically classifying a set of text documents into different categories from a predefined set. Consider the case of sorting and organizing emails and files into folder hierarchies so that topics can be identified to support topic-specific operations. One such attempt is the Yahoo web directory. If such classification is done manually, it has several disadvantages:
i. It needs domain experts in the areas of the predefined categories.
ii. It is time-consuming and leads to frustration.
iii. It is error-prone and could be employee biased (subject biased).
iv. The decisions of two human experts may disagree.
v. The process needs to be repeated for new documents (possibly of another domain).
So machine learning needs to be employed to automate the classification. In machine learning, two types of learning algorithms are generally found in the literature: supervised learning algorithms and unsupervised learning algorithms. This paper is restricted to supervised learning.

4.4 A Study on Document Classification using Machine Learning Techniques
Due to the fast
growth of digital information available electronically, text mining plays a key role in managing information and knowledge, and has therefore become an active research area. Text mining, also known as intelligent text analysis, is the process of extracting interesting and non-trivial information and knowledge from unstructured text. Text mining is a young interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics and computational linguistics. Typical text mining tasks include information extraction, topic tracking, document summarization, classification, clustering and question answering. Automated text classification is the act of dividing a set of input documents into two or more classes, where each document can be said to belong to one or multiple classes. Text classification aims at assigning pre-defined classes to text documents. An example would be to automatically label each incoming news story with a topic like sports, politics or art. The classification task starts with a training set $D = (d_1, \ldots, d_n)$ of documents that are already labeled with a class $c \in C$ (e.g. sport, politics). The task is then to determine a classification model $f: D \to C$, $f(d) = c$, which is able to assign the correct class to a new document $d$ of the domain. Text classification is a challenging task, as it is difficult to capture the meaning and abstract concepts of natural language just from a few keywords. Also, the high dimensionality of the feature space makes the classification problem very difficult. Text classification is commonly used to handle spam emails, classify large text collections into topical categories, manage knowledge and help Internet search engines.

4.5 Various Machine Learning Techniques for Text Classification
In this survey, we examine and compare the effectiveness of applying machine learning techniques to the sentiment classification problem. A challenging aspect of this problem that seems to distinguish it from traditional topic-based classification is that while topics are often identifiable by keywords alone, sentiment can be expressed in a more subtle manner.

Sentiment Analysis
Definition: Sentiment Analysis is a Natural Language Processing and Information Extraction task that aims to obtain the writer's feelings expressed in positive or negative comments, questions and requests by analyzing a large number of documents. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic, or the overall tonality of a document.

What are the challenges?
Sentiment Analysis approaches aim to extract positive and negative sentiment-bearing words from a text and classify the text as positive, negative, or else objective if no sentiment-bearing words can be found. In this respect, it can be thought of as a text categorization task. In text classification there are many classes corresponding to different topics, whereas in Sentiment Analysis we have only 3 broad classes, i.e. positive, negative and neutral. Thus it seems that Sentiment Analysis is easier than text classification, which is not quite the case.
The general challenges can be summarized as:
1. Implicit Sentiment and Sarcasm
2. Domain Dependency
3. Thwarted Expectations
4. Pragmatics
5. World Knowledge
6. Subjectivity Detection
7. Entity Identification
8. Negation
Hence, it is not easy to do text categorization and understand what the user intends to say (sentiments) because of the above-mentioned problems. The complexity of the problems varies from high to low: some problems, like World Knowledge, are easily solvable, and some, like Negation, are difficult. For this purpose, various algorithms such as Naive Bayes, SVM and Decision Tree are at our disposal.
Steps for analyzing the sentiments in a sentence:
1. Decide on the classifier algorithm and obtain appropriate data for training.
2. Preprocess and label the data.
3. Prepare the data for training.
4. Train the classifier with the help of libraries such as NLTK, libsvm, etc.
5. Make predictions by giving new test data to the trained classifier.
Text categorization is the task of assigning a Boolean value to each pair $(d_j, c_i) \in D \times C$, where $D$ is a domain of documents and $C = \{c_1, \ldots, c_{|C|}\}$ is a set of predefined categories. A value of T assigned to $(d_j, c_i)$ indicates a decision to file $d_j$ under $c_i$, while a value of F indicates a decision not to file $d_j$ under $c_i$.

4.6 Types of Machine Learning Algorithms
Machine learning algorithms are organized into a taxonomy based on the desired outcome of the algorithm. Common algorithm types include:
Supervised learning: where the algorithm generates a function that maps inputs to desired outputs. One
standard formulation of the supervised learning task is the classification problem: the learner is required to learn a function which maps a vector into one of several classes by looking at several input-output examples of the function.
Unsupervised learning: which models a set of inputs; labeled examples are not available.
Semi-supervised learning: which combines both labeled and unlabeled examples to generate an appropriate function or classifier.
Reinforcement learning: where the algorithm learns a policy of how to act given an observation of the world. Every action has some impact on the environment, and the environment provides feedback that guides the learning algorithm.
Transduction: similar to supervised learning, but does not explicitly construct a function; instead, it tries to predict new outputs based on training inputs, training outputs and new inputs.
Learning to learn: where the algorithm learns its own inductive bias based on previous experience.
The performance and computational analysis of machine learning algorithms is a branch of theoretical computer science known as computational learning theory. Machine learning is about designing algorithms that allow a computer to learn. Learning does not necessarily involve consciousness; rather, learning is a matter of finding statistical regularities or other patterns in the data. Thus, many machine learning algorithms will barely resemble how a human might approach a learning task. However, learning algorithms can give insight into the relative difficulty of learning in different environments.

5. CONCLUSION
This survey concludes that the text classification problem is an Artificial Intelligence research topic, especially given the vast number of documents available in the form of web pages and other electronic texts such as emails, discussion forum postings and other electronic documents. It has been observed that even for a specified classification method, the classification performance of classifiers based on different training text corpora differs, and in some cases such differences are quite substantial. This observation implies that a) classifier performance depends to some degree on its training corpus, and b) good, high-quality training corpora may yield classifiers of good performance. Unfortunately, little research in the literature so far has addressed how to exploit training text corpora to improve classifier performance. Some important conclusions have not been reached yet, including:
Which feature selection methods are both computationally scalable and high-performing across classifiers and collections? Given the high variability of text collections, do such methods even exist?
Would combining uncorrelated but well-performing methods yield a performance increase?
Change the thinking from a word-frequency-based vector space to a concept-based vector space. Study the methodology of feature selection under concepts, to see if this will help in text categorization.
Make dimensionality reduction more efficient over large corpora.
Moreover, there are two other open problems in text mining: polysemy and synonymy. Polysemy refers to the fact that a word can have multiple meanings. Distinguishing between different meanings of a word (called word sense disambiguation) is not easy, often requiring the context in which the word appears. Synonymy means that different words can have the same or similar meaning.
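
The conclusion raises feature selection and dimensionality reduction as open questions. As a purely illustrative sketch, and not a method proposed in this survey, the following Python code scores terms with the chi-square statistic, one commonly used feature selection metric, and keeps the top-scoring terms as a reduced dictionary. The function names `chi_square_scores` and `select_top_k`, and the tiny four-document corpus, are hypothetical and chosen only for the example.

```python
from collections import Counter


def chi_square_scores(docs, labels, target):
    """Score each term by its chi-square association with one target class.

    docs   : list of token lists (hypothetical, already tokenized documents)
    labels : list of class labels, one per document
    target : the class against which each term is scored
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):
            (df_pos if y == target else df_neg)[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # target-class documents containing the term
        b = df_neg[term]   # other documents containing the term
        c = n_pos - a      # target-class documents without the term
        d = n_neg - b      # other documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # The statistic is symmetric: a term that strongly indicates the
        # *other* class scores as high as one that indicates the target.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Assumed toy corpus, for illustration only.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["election", "vote", "party"],
    ["party", "vote", "team"],
]
labels = ["sport", "sport", "politics", "politics"]

scores = chi_square_scores(docs, labels, target="sport")
top_terms = select_top_k(scores, k=2)
print(top_terms)
```

On this toy corpus the term "team" appears in both classes and therefore scores lower than "goal", which appears only in sport documents. Whether such a metric scales and performs well across collections is exactly the question the conclusion leaves open.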

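
The five steps listed in Section 4.5 (choose a classifier, preprocess and label the data, prepare it for training, train, and predict) can be sketched in a few lines. The survey names NLTK and libsvm as possible libraries; the sketch below instead uses a small self-contained multinomial Naive Bayes written with the Python standard library, so that it runs without dependencies. The class name `NaiveBayesTextClassifier`, the whitespace tokenizer and the four training sentences are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Step 2 (preprocess): lowercase and split on whitespace.

    A deliberately simple choice; real systems usually also remove
    stop words and punctuation.
    """
    return text.lower().split()


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        # Step 3 (prepare): count terms per class to build the model.
        self.class_doc_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)  # class -> term -> count
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.term_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Terms never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_doc_counts):
            # Log prior of the class.
            score = math.log(self.class_doc_counts[label] / self.total_docs)
            total_terms = sum(self.term_counts[label].values())
            for t in tokens:
                # Smoothed log likelihood of the term given the class.
                score += math.log(
                    (self.term_counts[label][t] + 1)
                    / (total_terms + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Steps 1-2: an assumed, hand-labeled toy training set.
train_texts = [
    "great film and a wonderful cast",
    "wonderful story great acting",
    "boring plot and terrible acting",
    "terrible film boring cast",
]
train_labels = ["positive", "positive", "negative", "negative"]

# Step 4: train the classifier.
clf = NaiveBayesTextClassifier().fit(train_texts, train_labels)

# Step 5: predict the class of new text.
prediction = clf.predict("a great and wonderful story")
print(prediction)
```

This performs Hard Categorization in the sense used in this paper: each document receives exactly one class. A toy model of this kind does not address the challenges listed in Section 4.5, such as negation or sarcasm.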