A Wider Concept for Similarity Joins

Luiz Olmes Carvalho, Willian Dener de Oliveira, Ives R. V. Pola, Agma J. M. Traina, Caetano Traina Jr.

Abstract


Join is one of the most studied and employed retrieval operators made available by the modern relational database management systems (RDBMSs). This binary operator is algebraically defined as a Cartesian product followed by the selection operator that specifies the join condition. In modern RDBMS, the join condition employs comparison operators based both on equality and on the Total Ordering Relationship. The resulting selection and join operators based on them are commutative, so they can be executed in any order, regardless of the expression ordering in the query command, allowing the RDBMS to look for optimized query evaluation strategies. However, more complex data do not follow equality nor TOR properties. In fact, they are better compared by similarity. Join operators based on similarity comparison operators are called 'similarity joins'. There exist similarity joins that can be executed directly as a similarity selection following a Cartesian product, but others demand extra and/or intermediate operations before they can compute the final result. In this paper, we analyze the existing types of similarity join in an algebraic way and we claim that in fact some of them generate operators that are not a join, but other binary operators. We identify a novel binary operator that embraces them and provide three variations of the new proposed operator. As an important follow up, the new operator allows relationally expressing and answering queries that previously could be processed only in sequential ways. The experiments performed on real data sets support our claim that our new operator is fast and flexible enough to be represented in the relational algebra and included into the RDBMSs, and it meets the demands for similarity queries from modern applications.

Keywords


database operators; join processing; relational algebra; similarity search

Full Text:

PDF


An official publication of the Brazilian Computer Society Special Interest Group on Databases.