Evaluating the Diversification of Similarity Query Results

Lucio F. D. Santos, Willian D. Oliveira, Mônica R. P. Ferreira, Robson L. F. Cordeiro, Agma J. M. Traina, Caetano Traina Jr.

Abstract


The data currently generated and collected are increasing not only in volume but also in complexity, requiring new query operators to search them. Similarity queries are acknowledged as one of the most useful resources for retrieving complex data, but the basic similarity operators do not meet application requirements, largely because their result sets tend to include many elements that are too similar both to the query center and to each other. To tackle this problem, variations and extensions of the basic operators have been studied in pursuit of result diversification, i.e., searching for elements that are sufficiently similar to the query center but also diverse from each other. Result diversification has been studied considering either extra information related to the data or the distances among the result set elements. The problem with the former approach is that such "extra information" rarely exists and, even when it does, the corresponding processing cost is commonly too high. Moreover, distance-based algorithms are often good alternatives even for data domains that can rely on information other than the elements and their distances. The main drawback of distance-based algorithms is the lack of evaluation methods for understanding how diverse the retrieved answer is. This article reports on the development of several statistical measurements able to evaluate the diversity of a result set. We also introduce the concept of the "answer space", which highlights the distribution of the several result sets that can answer a given similarity-diversified query and enables the evaluation of query quality with respect to several different criteria. Finally, we describe an extensive set of experiments that validate our proposals and highlight the analyses a system analyst could perform, using four real datasets that span up to 72k elements and 761 dimensions.
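To make the idea of distance-based result diversification concrete, the following minimal Python sketch contrasts a plain k-nearest-neighbor query with a generic greedy diversified selection, and compares the two result sets using their mean pairwise distance. This is an illustration only, not the algorithms or the statistical measurements proposed in the article: the function names, the trade-off parameter `lam`, the scoring rule, and the choice of mean pairwise distance as a diversity indicator are all assumptions introduced here.

```python
import math
from itertools import combinations


def knn(query, data, k):
    """Plain k-NN: the k elements closest to the query center."""
    return sorted(data, key=lambda p: math.dist(query, p))[:k]


def diversified_knn(query, data, k, lam=0.5):
    """Greedy diversified selection (hypothetical heuristic, not the paper's).

    At each step, pick the candidate that maximizes
        lam * (distance to nearest already-selected element)
        - (1 - lam) * (distance to the query center).
    With an empty result the diversity term is 0, so the first pick
    is simply the nearest neighbor of the query.
    """
    candidates = list(data)
    result = []
    while candidates and len(result) < k:
        def score(p):
            diversity = min((math.dist(p, r) for r in result), default=0.0)
            return lam * diversity - (1 - lam) * math.dist(query, p)

        best = max(candidates, key=score)
        result.append(best)
        candidates.remove(best)
    return result


def mean_pairwise_distance(points):
    """Illustrative diversity indicator: average distance between result elements."""
    pairs = list(combinations(points, 2))
    if not pairs:
        return 0.0
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    query = (0.0, 0.0)
    # A tight cluster near the query plus a few elements farther away.
    data = [(0.1, 0.0), (0.11, 0.0), (0.12, 0.0), (0.0, 1.0), (1.0, 0.0), (-1.0, 0.0)]
    plain = knn(query, data, 3)
    diverse = diversified_knn(query, data, 3)
    print("k-NN:", plain, mean_pairwise_distance(plain))
    print("diversified:", diverse, mean_pairwise_distance(diverse))
```

In this toy setting the plain k-NN result consists of three near-duplicates, while the greedy variant keeps the nearest neighbor and then trades some closeness to the query for spread among the selected elements. Larger values of `lam` weight diversity more heavily; `lam = 0` reduces the selection to plain k-NN.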



Keywords


Evaluation methods; Result diversification; Similarity search; Space mapping


An official publication of the Brazilian Computer Society Special Interest Group on Databases.