Text Categorization: An extensive comparison of classifiers, feature selection metrics and document representation

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

In this paper on automatic text categorization, we extensively compare several aspects: document representation, feature selection, three classifiers, and their application to text collections in two languages. Regarding the computational representation of documents, we compare the traditional bag-of-words representation with four alternatives: bag of multiwords and bag of word prefixes with N characters (for N = 4, 5 and 6). Concerning feature selection, we compare the well-known metrics Information Gain and Chi-Square with a new one, based on third-moment statistics, that enhances rare terms. As to the classifiers, we compare the well-known Support Vector Machine and K-Nearest Neighbor classifiers with a classifier based on Mahalanobis distance. Finally, the study is language independent and was applied to two document collections, one written in English (Reuters-21578) and the other in Portuguese (Folha de São Paulo).
Original language: Unknown
Title of host publication: Proceedings of the 15th Portuguese Conference in Artificial Intelligence, EPIA 2011
Pages: 660-674
Publication status: Published - 1 Jan 2011
Event: EPIA 2011, Portuguese Conference on Artificial Intelligence
Duration: 1 Jan 2011 → …

Conference

Conference: EPIA 2011, Portuguese Conference on Artificial Intelligence
Period: 1/01/11 → …
