[Figure 2. Average precision of LLSF vs. unique word count (x-axis: number of features (unique words); y-axis: 11-pt average precision; curves for IG, DF, TS, CHI.max and MI.max).]

[Figure 4. Correlation between DF and CHI values of words in Reuters (log-log DF-CHI plots).]
instead of the complete solution, to save some computation resources in the experiments. That is, we computed only the 200 largest singular values in solving LLSF, although the best results (which are similar to the performance of kNN) appeared when using 1000 singular values [24]. Nevertheless, this simplification of LLSF should not invalidate the examination of feature selection, which is the focus of the experiments.

An observation emerges from the categorization results of kNN and LLSF on Reuters: IG, DF and CHI thresholding have similar effects on the performance of the classifiers. All of them can eliminate up to 90% or more of the unique terms with either an improvement or no loss in categorization accuracy (as measured by average precision). Using IG thresholding, for example, the vocabulary is reduced from 16,039 terms to 321 (a 98% reduction), and the AVGP of kNN is improved from 87.9% to 89.2%. CHI has even better categorization results, except that at extremely aggressive thresholds IG is better. TS has comparable performance with up to 50% term removal in kNN, and about 60% term removal in LLSF. With more aggressive thresholds, its performance declines much faster than IG, CHI and DF. MI does not have performance comparable to any of the other methods.

4.4 Correlations between DF, IG and CHI

The similar performance of IG, DF and CHI in term selection is rather surprising, because no such observation has previously been reported. A natural question therefore is whether these three corpus statistics are correlated.

Figure 3 plots the values of DF and IG given a term in the Reuters collection. Figure 4 plots the values of DF and CHI.avg correspondingly. Clearly there are indeed very strong correlations between the DF, IG and CHI values of a term. Figures 5 and 6 show the results of a cross-collection examination. A strong correlation between DF and IG is also observed in the OHSUMED collection. The performance curves of kNN with DF versus IG are identical on this collection. The observations on Reuters and on OHSUMED are highly consistent. Given the very different application domains
[Figure 5. DF-IG correlation in OHSUMED'90 (log-log DF-IG plots).]

[Figure. Feature selection and the performance of kNN on two collections.]

From Table 1, the methods with an "excellent" performance share the same bias, i.e., scoring in favor of common terms over rare terms. This bias is obvious in the DF criterion. It is not necessarily true in IG or CHI by definition: in theory, a common term can have a zero-valued information gain or χ² score. However, it is statistically true based on the strong correlations between the DF, IG and CHI values of a term.

If information were lost at high levels (e.g. 98%) of vocabulary reduction, it would not be possible for kNN or LLSF to have improved categorization performance.
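For concreteness, the measures discussed above can be sketched on a toy corpus. This is only an illustrative sketch, not the implementation used in the experiments; the documents, labels and function names are all invented, and the definitions follow the standard formulas for document frequency, information gain, pointwise mutual information and the χ² statistic.

```python
import math

# Toy labeled corpus; documents and labels are invented for illustration.
docs = [
    ("wheat prices rise", "grain"),
    ("wheat harvest starts", "grain"),
    ("rates rise again", "money"),
    ("bank rates steady", "money"),
]
N = len(docs)
cats = sorted({c for _, c in docs})

def has(text, term):
    return term in text.split()

def counts(term, cat):
    """2x2 contingency counts for (term, category)."""
    A = sum(1 for d, c in docs if c == cat and has(d, term))      # term present, in cat
    B = sum(1 for d, c in docs if c != cat and has(d, term))      # term present, not in cat
    C = sum(1 for d, c in docs if c == cat and not has(d, term))  # term absent, in cat
    D = N - A - B - C                                             # term absent, not in cat
    return A, B, C, D

def df(term):
    """Document frequency: how many documents contain the term."""
    return sum(1 for d, _ in docs if has(d, term))

def chi_max(term):
    """Maximum chi-square score of the term over all categories."""
    best = 0.0
    for cat in cats:
        A, B, C, D = counts(term, cat)
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        if denom:
            best = max(best, N * (A * D - C * B) ** 2 / denom)
    return best

def mi_max(term):
    """Maximum pointwise mutual information over all categories.
    Uses only term presence, with no joint-probability weighting,
    which is what biases MI toward rare terms."""
    best = float("-inf")
    for cat in cats:
        A, B, C, _ = counts(term, cat)
        if A:
            best = max(best, math.log2(A * N / ((A + B) * (A + C))))
    return best

def ig(term):
    """Information gain: entropy of the category distribution minus the
    conditional entropy given term presence AND term absence."""
    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)
    total = entropy([sum(1 for _, c in docs if c == cat) / N for cat in cats])
    for present in (True, False):
        side = [(d, c) for d, c in docs if has(d, term) == present]
        if not side:
            continue
        total -= (len(side) / N) * entropy(
            [sum(1 for _, c in side if c == cat) / len(side) for cat in cats])
    return total

# A category-predictive common term scores high on all measures; a term
# spread evenly across categories gets zero IG and chi-square.
print(df("wheat"), ig("wheat"), chi_max("wheat"))  # 2 1.0 4.0
print(df("rise"), ig("rise"), chi_max("rise"))     # 2 0.0 0.0
```

On this toy corpus the common, category-predictive term "wheat" scores high on every measure, while "rise", equally common but spread evenly across categories, gets zero IG and χ²; that is the in-theory distinction noted above, even though on real collections the scores turn out to be strongly correlated.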
To be more precise, in theory, IG measures the number of bits of information obtained by knowing the presence or absence of a term in a document:

IG(t) = \sum_{i=1}^{n} P_r(t, c_i) I(t, c_i) + \sum_{i=1}^{n} P_r(\bar{t}, c_i) I(\bar{t}, c_i)

This formula shows that information gain is the weighted average of the mutual information I(t; c) and I(\bar{t}; c), where the weights are the joint probabilities P_r(t; c) and P_r(\bar{t}; c), respectively. So information gain is also called average mutual information [7]. There are two fundamental differences between IG and MI: 1) IG makes use of information about term absence, in the form of I(\bar{t}; c), while MI ignores such information; and 2) IG normalizes the mutual information scores using the joint probabilities, while MI uses the non-normalized scores.

6 Conclusion

This is an evaluation of feature selection methods in dimensionality reduction for text categorization at all levels of reduction aggressiveness, from using the full vocabulary (except stop words) as the feature space to removing 98% of the unique terms. We found IG and CHI most effective in aggressive term removal without losing categorization accuracy in our experiments with kNN and LLSF. DF thresholding is found comparable to the performance of IG and CHI with up to 90% term removal, while TS is comparable with up to 50-60% term removal. Mutual information has inferior performance compared to the other methods, due to a bias favoring rare terms and a strong sensitivity to probability estimation errors.

We discovered that the DF, IG and CHI scores of a term are strongly correlated, revealing a previously unknown fact about the importance of common terms in text categorization. This suggests that DF thresholding is not just an ad hoc approach to improve efficiency (as it has been assumed in the literature of text categorization and retrieval), but a reliable measure for selecting informative features. It can be used instead of IG or CHI when the (quadratic) computation of these measures is too expensive. The availability of a simple but effective means for aggressive feature space reduction may significantly ease the application of more powerful and computationally intensive learning methods, such as neural networks, to very large text categorization problems which are otherwise intractable.

Acknowledgement

We would like to thank Jaime Carbonell and Ralf Brown at CMU for pointing out an implementation error in the computation of information gain, and for providing the correct code. We would also like to thank Tom Mitchell and his machine learning group at CMU for the fruitful discussions that clarified the definitions of information gain and mutual information in the literature.

References

[1] C. Apte, F. Damerau, and S. Weiss. Towards language independent automated learning of text categorization models. In Proceedings of the 17th Annual ACM/SIGIR Conference, 1994.

[2] Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information and lexicography. In Proceedings of ACL 27, pages 76-83, Vancouver, Canada, 1989.

[3] William W. Cohen and Yoram Singer. Context-sensitive learning methods for text categorization. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307-315, 1996.

[4] R.H. Creecy, B.M. Masand, S.J. Smith, and D.L. Waltz. Trading MIPS and memory for knowledge engineering: classifying census returns on the Connection Machine. Comm. ACM, 35:48-63, 1992.

[5] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. Indexing by latent semantic analysis. J Amer Soc Inf Sci, 1(6), pages 391-407, 1990.

[6] T.E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.

[7] R. Fano. Transmission of Information. MIT Press, Cambridge, MA, 1961.

[8] N. Fuhr, S. Hartmanna, G. Lustig, M. Schwantner, and K. Tzeras. AIR/X - a rule-based multistage indexing system for large subject fields. In Proceedings of RIAO'91, pages 606-623, 1991.
[9] W. Hersh, C. Buckley, T.J. Leone, and D. Hickman. OHSUMED: an interactive retrieval evaluation and new large text collection for research. In 17th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), pages 192-201, 1994.

[10] D. Koller and M. Sahami. Toward optimal feature selection. In Proceedings of the Thirteenth International Conference on Machine Learning, 1996.

[11] K. Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, 1995.

[12] David D. Lewis, Robert E. Schapire, James P. Callan, and Ron Papka. Training algorithms for linear text classifiers. In SIGIR '96: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 298-306, 1996.

[13] D.D. Lewis and M. Ringuette. Comparison of two learning algorithms for text categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (SDAIR'94), 1994.

[14] Tom Mitchell. Machine Learning. McGraw Hill, 1996.

[15] I. Moulinier. Is learning bias an issue on the text categorization problem? Technical report, LAFORIA-LIP6, Universite Paris VI, 1997 (to appear).

[16] I. Moulinier, G. Raskinis, and J. Ganascia. Text categorization: a symbolic approach. In Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, 1996.

[17] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81-106, 1986.

[18] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, Pennsylvania, 1989.

[19] H. Schutze, D.A. Hull, and J.O. Pedersen. A comparison of classifiers and document representations for the routing problem. In 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), pages 229-237, 1995.

[20] K. Tzeras and S. Hartman. Automatic indexing based on Bayesian inference networks. In Proc 16th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'93), pages 22-34, 1993.

[21] E. Wiener, J.O. Pedersen, and A.S. Weigend. A neural network approach to topic spotting. In Proceedings of the Fourth Annual Symposium on Document Analysis and Information Retrieval (SDAIR'95), 1995.

[22] J.W. Wilbur and K. Sirotkin. The automatic identification of stop words. J. Inf. Sci., 18:45-55, 1992.

[23] Y. Yang. Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In 17th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'94), pages 13-22, 1994.

[24] Y. Yang. Noise reduction in a statistical approach to text categorization. In Proceedings of the 18th Ann Int ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'95), pages 256-263, 1995.

[25] Y. Yang. Sampling strategies and learning efficiency in text categorization. In AAAI Spring Symposium on Machine Learning in Information Access, pages 88-95, 1996.

[26] Y. Yang. An evaluation of statistical approaches to text categorization. Technical Report CMU-CS-97-127, Computer Science Department, Carnegie Mellon University, 1997.

[27] Y. Yang and C.G. Chute. An example-based mapping method for text categorization and retrieval. ACM Transactions on Information Systems (TOIS), pages 253-277, 1994.

[28] Y. Yang and W.J. Wilbur. Using corpus statistics to remove redundant words in text categorization. J Amer Soc Inf Sci, 1996.