Estimating the Credibility of Examples in Automatic Document Classification

João Palotti, Thiago Salles, Gisele L. Pappa, Filipe Arcanjo, Marcos A. Gonçalves, Wagner Meira Jr.

Abstract


Classification algorithms usually assume that every example in the training set contributes equally to the classification model being generated. However, this is not always the case. This paper shows that the contribution of an example to the classification model varies according to many application-dependent factors, and that it can be estimated using what we call a credibility function. The credibility of an entity reflects how much value it adds to the task being performed. Here we investigate it in Automatic Document Classification, where the credibility of a document relates to its terms, authors, citations, venues, and time of publication, among others. After introducing the concept of credibility in classification, we investigate how to estimate a credibility function from information about document content, citations, and authorship, mainly using metrics previously defined in the literature. Because the credibility of the content of a document can be easily mapped to any other classification problem, in a second phase we focus on content-based credibility functions. We propose a genetic programming algorithm to estimate this function based on a large set of metrics generally used to measure the strength of the term-class relationship. The proposed and evolved credibility functions are then incorporated into the Naive Bayes classifier and applied to four text collections, namely ACM-DL, Reuters, Ohsumed, and 20 Newsgroups. The results show significant improvements in both micro-F1 and macro-F1, with gains of up to 21% in Ohsumed compared to the traditional Naive Bayes.
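
The abstract does not say how a credibility function enters the Naive Bayes model, so the following is only an illustrative sketch under one assumption: each term occurrence in a training document is scaled by a term-class credibility weight before the multinomial likelihoods are estimated. The names train_weighted_nb, predict, and toy_credibility, the toy corpus, and the weight values are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_weighted_nb(docs, labels, credibility):
    """Multinomial Naive Bayes in which every term occurrence counts
    credibility(term, label) instead of 1 (assumed positive weight)."""
    classes = sorted(set(labels))
    vocab = sorted({term for doc in docs for term in doc})
    class_counts = Counter(labels)

    # Accumulate credibility-weighted term counts per class.
    weighted = {c: defaultdict(float) for c in classes}
    for doc, label in zip(docs, labels):
        for term in doc:
            weighted[label][term] += credibility(term, label)

    model = {"classes": classes, "vocab": set(vocab),
             "log_prior": {}, "log_like": {}}
    for c in classes:
        model["log_prior"][c] = math.log(class_counts[c] / len(labels))
        # Laplace (add-one) smoothing over the vocabulary.
        total = sum(weighted[c].values()) + len(vocab)
        model["log_like"][c] = {
            t: math.log((weighted[c][t] + 1.0) / total) for t in vocab
        }
    return model


def predict(model, doc):
    """Return the class with the highest log posterior; unseen terms are skipped."""
    scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for term in doc:
            if term in model["vocab"]:
                score += model["log_like"][c][term]
        scores[c] = score
    return max(scores, key=scores.get)


# Toy corpus (hypothetical): "model" occurs once in each class.
docs = [
    ["query", "index", "model"],
    ["index", "storage", "database"],
    ["neuron", "gradient", "model"],
    ["gradient", "network", "training"],
]
labels = ["db", "db", "ml", "ml"]


def toy_credibility(term, label):
    # Assumed weights: "model" is treated as more credible evidence for "ml".
    return 3.0 if (term, label) == ("model", "ml") else 1.0


nb = train_weighted_nb(docs, labels, toy_credibility)
print(predict(nb, ["model"]))
```

With a constant credibility of 1.0 the sketch reduces to the traditional add-one-smoothed multinomial Naive Bayes, which makes it easy to compare a candidate credibility function against the baseline on the same data.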

Keywords


credibility, automatic document classification, genetic programming



An official publication of the Brazilian Computer Society Special Interest Group on Databases.