
Automatic Genre-Specific Text Classification

each class, the definitions of the measures are as follows. A higher F1 value indicates better classification performance.

• Precision is the percentage of correctly classified positive examples among all the examples classified as positive.
• Recall is the percentage of correctly classified positive examples among all the positive examples.
• $F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

(2) Findings and Discussions – The following are the four main findings from our experiments.

First, SVM outperforms NB in syllabus classification in the average case (Figure 2). On average, SMO performed best, with an F1 score of 0.87, 15% better than NB in terms of the true syllabus class and 1% better in terms of the false syllabus class. The best setting for our task is SMO with the genre feature selection method, which achieved an F1 score of 0.88 in recognizing true syllabi and 0.71 in recognizing false syllabi.

Second, the kernel settings we tried in the experiments were not helpful in the syllabus classification task. Figure 2 indicates that SMO with kernel settings performs worse than SMO without kernels.

Third, settings with genre features outperform those with general features and those with hybrid features. Figure 3 shows this performance pattern in the SMO classifier setting; other classifiers show the same pattern. We also found that the performance of the hybrid feature settings is dominated by the general features among them. This is probably because the number of genre features is very small compared to the number of general features. Therefore, it might be useful to test new ways of mixing genre features and general features to take advantage of both more effectively.

Finally, at all settings, better performance is achieved in recognizing true syllabi than in recognizing false syllabi. We analyzed the classification results with the best setting and found that 94 of 313 false syllabi were mistakenly classified as true ones. It is likely that the skewed distribution of the two classes makes classifiers favor the true syllabus class when given an error-prone data point. Since we probably provide no appropriate information if we misclassify a true syllabus as a false one, our better performance in the true syllabus class is satisfactory.

FUTURE WORK

Although general search engines such as Google meet people’s basic information needs, there are still possibilities for improvement, especially with genre-specific search. Our work on the syllabus genre successfully indicates that machine learning techniques can contribute to genre-specific search and classification. In the future, we plan to improve the classification accuracy from multiple perspectives such as defining

Figure 2. Classification performance of different classifiers on different classes measured in terms of F1
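
The precision, recall, and F1 definitions given above can be made concrete with a small sketch. The function names and the convention of returning 0.0 for an empty denominator are assumptions made here for illustration and do not come from the experiments. The only counts taken from the text are the 94 of 313 false syllabi classified as true; the remaining numbers are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """Correctly classified positives among all examples classified as positive."""
    total = tp + fp
    return tp / total if total else 0.0  # 0.0 on an empty denominator is an assumed convention


def recall(tp: int, fn: int) -> float:
    """Correctly classified positives among all truly positive examples."""
    total = tp + fn
    return tp / total if total else 0.0


def f1_score(p: float, r: float) -> float:
    """F1 = 2 * Precision * Recall / (Precision + Recall)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# From the text: 94 of 313 false syllabi were classified as true ones.
# Treating "false syllabus" as the positive class gives fn = 94, tp = 313 - 94.
false_class_recall = recall(tp=313 - 94, fn=94)  # roughly 0.70

# Hypothetical counts, for illustration only (not reported in the text).
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
f1 = f1_score(p, r)

if __name__ == "__main__":
    print(f"recall on false syllabi: {false_class_recall:.2f}")
    print(f"hypothetical precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it sits closer to the lower of the two values, so a classifier cannot score well by maximizing only one of them.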


