A Framework for Set Similarity Join on Multi-Attribute Data
Set similarity join, which finds all pairs of similar sets in a collection, plays an important role in data cleaning and integration. Many algorithms have been proposed to efficiently compute set similarity joins on single-attribute data. However, real-world data often contain multiple attributes. In this paper, we propose a framework that enhances existing algorithms with additional filters for multi-attribute data. We then present a simple yet effective filter based on lightweight indexes, and we evaluate exact and probabilistic implementations of it. Finally, we devise a cost model to identify the attribute ordering that best reduces processing time. Our experimental results show that our approach is effective and significantly outperforms previous work.
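To make the problem statement concrete, the following is a minimal, naive sketch of a similarity join over multi-attribute records. It is not the framework or the index-based filter proposed here. It assumes, purely for illustration, that every record is a tuple of sets (one per attribute), that similarity is measured with Jaccard, and that a pair qualifies only if each attribute meets its own threshold. The names `jaccard`, `multi_attribute_join`, `records`, and `thresholds` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def multi_attribute_join(records, thresholds):
    """Return index pairs (i, j), i < j, whose attributes all meet their thresholds.

    records    -- list of tuples of sets, one set per attribute (assumed layout)
    thresholds -- one Jaccard threshold per attribute, in the order they are checked
    """
    result = []
    for (i, r), (j, s) in combinations(enumerate(records), 2):
        # all() short-circuits, so an attribute that fails early acts as a
        # filter and spares the comparisons on the remaining attributes.
        if all(jaccard(r[k], s[k]) >= t for k, t in enumerate(thresholds)):
            result.append((i, j))
    return result


# Toy data (made up for this sketch): attribute 0 is a title, attribute 1 an author list.
records = [
    ({"set", "similarity", "join"}, {"smith", "jones"}),
    ({"set", "similarity", "joins"}, {"smith", "jones"}),
    ({"set", "similarity", "join"}, {"lee"}),
]
pairs = multi_attribute_join(records, thresholds=(0.5, 0.5))
print(pairs)
```

Because the check stops at the first failing attribute, the order in which attributes are tested changes how much work is done per pair, which gives some intuition for why attribute ordering affects processing time. This sketch compares all pairs and so runs in quadratic time; it only illustrates the join semantics under the stated assumptions.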