
Automatic Genre-Specific Text Classification

its performance was similar to that of the methods deemed best, such as Information Gain and Chi Square, and it is simple and efficient. Therefore, we chose DF as our general feature selection method. In our previous work [Yu et al., 2008], we concluded that a DF threshold of 30 is a good setting to balance computation complexity against classification accuracy. With this feature selection setting, we obtained 1754 features from the 63963 unique words in the training corpus. (A brief sketch of this selection step is given after this feature list.)
2. Genre Features - Each defined class has its own characteristics beyond the general features. Many keywords, such as ‘grading policy’, occur in a true syllabus, probably along with a link to the content page. On the other hand, a false syllabus might contain syllabus keywords without enough keywords related to the syllabus components. In addition, the position of a keyword within a page matters. For example, a keyword within the anchor text of a link, or near the link, would suggest a syllabus component outside the current page, while a capitalized keyword at the beginning of a page would suggest a syllabus component with a heading in the page itself. Motivated by these observations, we manually selected 84 features to classify our data set into the four classes. We used both content and structure features for syllabus classification, as they have been found useful in the detection of other genres [Kennedy & Shepherd, 2005]. These features mainly concern the occurrence of keywords, the positions of keywords, and the co-occurrence of keywords and links. Details of these features are in [Yu et al., 2008]. (A brief sketch of a few such features is also given after this list.)
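As an illustration of the general feature selection step above, the sketch below filters terms by document frequency with a threshold of 30. It assumes the threshold is a minimum document frequency and that the training corpus is already tokenized; the function name and interface are ours, not the paper's.

    from collections import Counter

    def select_general_features(tokenized_docs, df_threshold=30):
        # tokenized_docs: iterable of documents, each given as a list of word tokens.
        # Keep terms whose document frequency (the number of training documents
        # containing the term) reaches the threshold; return them as the vocabulary.
        df = Counter()
        for doc in tokenized_docs:
            for term in set(doc):        # count each term at most once per document
                df[term] += 1
        return sorted(term for term, count in df.items() if count >= df_threshold)

The genre features could be computed along similar lines. The sketch below shows three illustrative features of the kinds described above (keyword occurrence, keyword position, and co-occurrence of keywords and links); the keyword list and feature names are placeholders, not the 84 features actually used, which are detailed in [Yu et al., 2008].

    import re

    # Placeholder keyword list; the real keywords belong to the 84 features in [Yu et al., 2008].
    SYLLABUS_KEYWORDS = ["grading policy", "course description", "prerequisites"]

    def genre_features(html_source, plain_text):
        # Return a few example genre features, each with a value in [0.0, 1.0].
        text = plain_text.lower()
        features = {}
        # Occurrence of syllabus keywords anywhere in the page.
        features["kw_present"] = sum(kw in text for kw in SYLLABUS_KEYWORDS) / len(SYLLABUS_KEYWORDS)
        # Co-occurrence of keywords and links: a keyword inside anchor text suggests
        # a syllabus component outside the current page.
        anchor_text = " ".join(re.findall(r"<a[^>]*>(.*?)</a>", html_source, flags=re.I | re.S)).lower()
        features["kw_in_anchor"] = float(any(kw in anchor_text for kw in SYLLABUS_KEYWORDS))
        # Keyword position: a capitalized keyword near the top of the page suggests
        # a syllabus component with a heading in the page itself.
        features["kw_heading_at_top"] = float(any(kw.title() in plain_text[:200] for kw in SYLLABUS_KEYWORDS))
        return features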
After extracting free text from these documents, our training corpus consisted of 63963 unique terms. We represented it by the feature attributes described above: 1754 unique general features and 84 unique genre features, i.e. 1838 unique features in total. Each of these feature attributes has a numeric value between 0.0 and 1.0.
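To make the representation concrete, one way of assembling such a vector is sketched below; normalizing the general features by document length is an assumption on our part, since the text only states that every attribute value lies between 0.0 and 1.0.

    def document_vector(doc_tokens, vocabulary, genre_feature_values):
        # Concatenate general and genre feature values into one numeric vector.
        # General feature values are term frequencies normalized by document length,
        # which is one (assumed) way to keep every value within [0.0, 1.0].
        length = max(len(doc_tokens), 1)
        general = [doc_tokens.count(term) / length for term in vocabulary]
        return general + list(genre_feature_values)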
Classifiers

NB and SVM are two well-known, best-performing supervised learning models in text classification applications [Kim, Han, Rim, & Myaeng, 2006; Joachims, 1998]. NB, a simple and efficient approach, succeeds in various data mining tasks, while SVM, a highly complex one, outperforms NB especially in text mining tasks [Kim, Han, Rim, & Myaeng, 2006]. We describe them below.

1. Naïve Bayes - A Naïve Bayes classifier can be viewed as a Bayesian network in which the feature attributes X1, X2, …, Xn are conditionally independent given the class attribute C [John & Langley, 1995]. Let C be a random variable and X be a vector of random variables X1, X2, …, Xn. The probability of a document x being in class c is calculated using Bayes’ rule as below, and the document is classified into the most probable class.

p(C = c | X = x) = p(X = x | C = c) p(C = c) / p(X = x)

Since the feature attribute values (x1, x2, …, xn) represent the document x and are assumed to be conditionally independent given the class, we obtain the equation below.

p(X = x | C = c) = ∏_i p(Xi = xi | C = c)

One assumption for estimating these probabilities for numeric attributes is that the values of such an attribute follow a normal distribution within each class. We can then estimate p(Xi = xi | C = c) using the mean and the standard deviation of that normal distribution computed from the training data. This assumption about the distribution may not hold for some domains, so we also applied the kernel method from [John & Langley, 1995] to estimate the distribution of each numeric attribute in our syllabus classification application. (A brief sketch of this classifier is given after this list.)
2. Support Vector Machines - SVM is a two-class classifier (Figure 1) that finds the hyperplane maximizing the minimum distance between the hyperplane and the training data points [Boser, Guyon, & Vapnik, 1992]. Specifically, the hyperplane ωᵀx = γ is found by minimizing the objective function

(1/2) ||ω||²  such that  D(Aω − eγ) ≥ e,

where A is the matrix whose rows are the training points, D is the diagonal matrix of their ±1 class labels, and e is a vector of ones. The margin is 2/||ω||, the distance between the two bounding planes ωᵀx = γ + 1 and ωᵀx = γ − 1.
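To make the Naïve Bayes description concrete, a minimal sketch of the normal-distribution variant is given below; class and method names are ours, not the paper's. The kernel method from [John & Langley, 1995] would replace the single per-class normal density with an average of Gaussian kernels centred on the training values of each attribute.

    import math
    from collections import defaultdict

    class GaussianNaiveBayes:
        # Naive Bayes with a per-class normal density for every numeric attribute.

        def fit(self, vectors, labels):
            # Estimate the class priors p(C = c) and, for each attribute, the mean
            # and standard deviation of its values within each class.
            grouped = defaultdict(list)
            for x, c in zip(vectors, labels):
                grouped[c].append(x)
            self.priors = {c: len(rows) / len(vectors) for c, rows in grouped.items()}
            self.stats = {c: [(self._mean(col), self._std(col)) for col in zip(*rows)]
                          for c, rows in grouped.items()}
            return self

        def predict(self, x):
            # Pick the class maximizing log p(C = c) + sum_i log p(Xi = xi | C = c).
            def log_posterior(c):
                total = math.log(self.priors[c])
                for xi, (mu, sigma) in zip(x, self.stats[c]):
                    total += (-0.5 * math.log(2 * math.pi * sigma ** 2)
                              - (xi - mu) ** 2 / (2 * sigma ** 2))
                return total
            return max(self.priors, key=log_posterior)

        @staticmethod
        def _mean(values):
            values = list(values)
            return sum(values) / len(values)

        @staticmethod
        def _std(values, floor=1e-6):
            values = list(values)
            mu = sum(values) / len(values)
            variance = sum((v - mu) ** 2 for v in values) / len(values)
            return max(math.sqrt(variance), floor)   # floor avoids zero variance

For the SVM, a usage-level sketch with a standard library implementation could look as follows; the data and class names are placeholders, and the exact solver and kernel used in the study are not restated here.

    from sklearn.svm import LinearSVC   # linear maximum-margin classifier;
                                        # handles more than two classes one-vs-rest

    # Toy stand-ins for the real data: rows are feature vectors with values in
    # [0, 1]; the label strings are placeholders for the paper's four classes.
    X_train = [[0.9, 0.1, 0.8], [0.8, 0.2, 0.7],
               [0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]
    y_train = ["syllabus", "syllabus", "non-syllabus", "non-syllabus"]

    clf = LinearSVC(C=1.0)              # C trades margin width against violations
    clf.fit(X_train, y_train)
    print(clf.predict([[0.85, 0.15, 0.75]]))    # expected: ['syllabus']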


