FPCluster: An Efficient Out-of-core Clustering Strategy without a Similarity Metric

Douglas E.V. Pires, Luam C. Totti, Rubens E.A. Moreira, Elverton C. Fazzion, Osvaldo L.H.M. Fonseca, Wagner Meira Jr, Raquel C. de Melo-Minardi, Dorgival Guedes Neto


Clustering is one of the most popular and relevant data mining tasks. Two challenges in determining clusters are the volume of data to be grouped and the difficulty of defining a similarity metric applicable to the entire data set. In this work we present FPCluster, a new clustering algorithm that addresses both problems. The algorithm is based on building out-of-core frequent pattern trees, a data structure originally proposed for mining patterns. Additionally, the algorithm transparently handles missing features, a common constraint in real-world scenarios. We applied FPCluster to two real scenarios: characterization of spam campaigns and clustering of protein families. We evaluated both the quality of the obtained groups and the computational efficiency of the proposed strategy. In particular, we achieved precision above 90% while the storage demand increased sub-linearly.
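
For readers unfamiliar with frequent pattern trees, the following is a minimal in-memory sketch of the generic data structure, not the FPCluster implementation, which the abstract describes as out-of-core. The names `FPNode` and `build_fp_tree`, the `feature=value` item encoding, and the sample records are all assumptions made for illustration. It shows how records that share frequent items fall under a common prefix without any pairwise similarity computation, and how a record with a missing feature simply contributes a shorter path.

```python
from collections import Counter


class FPNode:
    """One node of a frequent pattern tree (hypothetical helper class)."""

    def __init__(self, item, parent=None):
        self.item = item          # e.g. "lang=pt"; None for the root
        self.parent = parent
        self.count = 0            # number of records passing through this node
        self.children = {}        # item -> FPNode


def build_fp_tree(records, min_support=1):
    """Build an FP-tree from records given as lists of 'feature=value' items."""
    # First pass: count in how many records each item occurs.
    freq = Counter(item for rec in records for item in set(rec))
    root = FPNode(None)
    # Second pass: insert each record with its items ordered by
    # descending frequency (ties broken alphabetically).
    for rec in records:
        items = [i for i in set(rec) if freq[i] >= min_support]
        items.sort(key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


# Illustrative records; the second one has no "layout" feature at all.
records = [
    ["lang=pt", "url=a.example", "layout=x"],
    ["lang=pt", "url=a.example"],
    ["lang=pt", "url=b.example", "layout=x"],
    ["lang=en", "url=c.example"],
]
tree = build_fp_tree(records)
```

In this sketch the three `lang=pt` records end up under one subtree of the root and the `lang=en` record under another, so candidate groups can be read off the tree's branches. How FPCluster actually derives clusters from the tree, and how it keeps the tree on disk, is described in the full paper rather than here.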


Clustering; out-of-core; protein families; spam detection

An official publication of the Brazilian Computer Society Special Interest Group on Databases.