Automatic Document Classification Temporally Robust

Thiago Salles, Leonardo Rocha, Fernando Mourão, Gisele L. Pappa, Lucas Cunha, Marcos André Gonçalves, Wagner Meira Jr.

Abstract


The widespread use of the Internet has increased the amount of information stored on and accessed through the Web. This information is frequently organized as textual documents and is the main target of search engines and other retrieval tools, which must, among other tasks, classify documents. Automatic Document Classification (ADC) associates documents with semantically meaningful categories and usually employs a supervised learning strategy: a classification model is first built from pre-classified documents and then used to classify new documents. One major challenge in building classifiers is dealing with the temporal evolution of the characteristics of the documents and of the classes to which they belong. However, most current ADC techniques do not consider this evolution when building and using their models. Previous studies show that the performance of classifiers may be affected by three different temporal effects (class distribution, term distribution and class similarity). In this paper, we propose a new approach that aims to minimize the impact of these temporal effects through a Temporal Adjustment Factor, in order to devise temporally robust classifiers based on traditional ones (Rocchio and KNN). Experimental results obtained using two large, real textual collections show significant gains of up to 11% for the temporally aware versions of the classifiers over their traditional counterparts, and of up to 4% over SVM (with a significantly lower runtime).
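The idea of a temporally aware KNN can be illustrated with a minimal sketch. The abstract does not define the Temporal Adjustment Factor, so the exponential decay used below (`temporal_weight`, with its `decay` parameter) is a hypothetical stand-in rather than the authors' formulation, and the tiny `TRAINING` collection is invented purely for illustration. Each neighbour's vote is its cosine similarity to the query, scaled down the further its creation year lies from the year of the document being classified.

```python
import math
from collections import Counter, defaultdict

# Invented toy collection: (tokens, class label, creation year).
TRAINING = [
    (["cloud", "rain"], "weather", 1990),
    (["cloud", "storm"], "weather", 1991),
    (["cloud", "server"], "computing", 2010),
    (["cloud", "storage"], "computing", 2011),
]


def temporal_weight(doc_year, ref_year, decay=0.5):
    """Hypothetical adjustment (an assumption, not the paper's factor):
    exponential decay in the distance between the two years."""
    return math.exp(-decay * abs(doc_year - ref_year))


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(tokens, year, training, k=3, decay=0.5):
    """KNN in which each neighbour's vote is scaled by temporal_weight."""
    query = Counter(tokens)
    scored = [
        (cosine(query, Counter(doc)), label, doc_year)
        for doc, label, doc_year in training
    ]
    # Keep the k most similar training documents.
    neighbours = sorted(scored, key=lambda s: s[0], reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label, doc_year in neighbours:
        votes[label] += sim * temporal_weight(doc_year, year, decay)
    return max(votes, key=votes.get)
```

In this toy collection the term "cloud" is equally similar to every training document, so only the temporal weighting separates the classes: calling `classify(["cloud"], 2010, TRAINING, k=4)` favours the documents written around 2010, while the same query dated 1990 favours the older ones.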



An official publication of the Brazilian Computer Society Special Interest Group on Databases.