Tackling Temporal Effects in Automatic Document Classification

Thiago Salles, Thiago Cardoso, Vitor Oliveira, Leonardo Rocha, Marcos André Gonçalves

Abstract


Automatic Document Classification (ADC) has become an important research topic due to the rapid growth in the volume and complexity of the data produced nowadays. ADC usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use it to classify unseen documents. One major challenge in building classifiers has to do with the temporal evolution of the characteristics of the dataset (a.k.a. temporal effects). However, most current ADC techniques do not consider this evolution while building and using the classification models. Recently, we proposed temporally-aware algorithms for ADC in order to properly handle these temporal effects. Despite their effectiveness, the temporally-aware classifiers have the major side effect of being naturally lazy, since they need to know the creation time of the test document to build the model. This lazy property incurs a potentially high test-phase runtime and raises a critical scalability issue that may make these classifiers infeasible for large volumes of data, such as the Web and very large digital libraries. This work addresses the following challenge: can we deal with temporal effects in an entirely off-line setting, reducing the test-phase runtime without compromising effectiveness under varying data distributions? We propose to address this question by tackling the temporal effects from a data engineering perspective. We devise a classifier-independent pre-processing step, called Cascaded Temporal Smoothing (CTS), that moves all the overhead of considering the temporal effects to an off-line setting. CTS consists of a controlled data oversampling strategy that aims at smoothing the observed temporal effects in the data distribution. The resulting training set can be used by any traditional classifier, yielding effectiveness similar to that of the lazy temporally-aware classifiers without incurring any overhead at the test phase. As our experimental evaluation shows, applying CTS before learning a traditional Naive Bayes classifier improved classification effectiveness in two real datasets (gains of up to 5.00% in terms of MacroF1) while exhibiting scalability properties not present in the lazy temporally-aware classifiers, which supports its practical applicability to large classification problems.
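To make the general idea of an off-line oversampling step concrete, the following is a minimal sketch only. It is not the CTS algorithm, whose details are not given in this abstract. It assumes each training document carries a creation year and a class label, and it pads every (class, year) group by resampling with replacement until it matches that class's largest per-year count. The function name `oversample_by_time` and the tuple layout are hypothetical.

```python
import random
from collections import defaultdict


def oversample_by_time(docs, seed=0):
    """Illustrative temporal oversampling (an assumption, not the paper's CTS).

    docs: list of (year, label, text) tuples.
    Returns a new list in which, for each label, every year that has at
    least one document of that label is padded, by resampling with
    replacement, up to that label's largest per-year count.
    """
    rng = random.Random(seed)

    # Group documents by (label, year).
    groups = defaultdict(list)
    for doc in docs:
        year, label, _text = doc
        groups[(label, year)].append(doc)

    # Largest per-year count for each label: the level we smooth towards.
    target = defaultdict(int)
    for (label, _year), members in groups.items():
        target[label] = max(target[label], len(members))

    smoothed = []
    for (label, _year), members in groups.items():
        smoothed.extend(members)  # keep every original document
        deficit = target[label] - len(members)
        smoothed.extend(rng.choice(members) for _ in range(deficit))
    return smoothed
```

Because the step runs entirely before training, its output could be passed to any standard classifier, which is the property the abstract emphasizes. Years in which a class has no documents at all are left untouched here, since there is nothing to resample from.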

Keywords


automatic document classification; temporal effects; oversampling

An official publication of the Brazilian Computer Society Special Interest Group on Databases.