A Practical Approach for Scalable Record Linkage on Hadoop

2013, Vol 753-755, pp. 3018-3024
Author(s): Fen Gyu Yang, Ying Chen, Ye Zhang

As ever more data is collected across applications, record linkage must now handle millions of records. Traditional methods face a major performance challenge at this scale. Parallel computing frameworks such as MapReduce have become an efficient and practical way to address this problem. In this paper, we propose a practical 3-phase MapReduce approach that performs blocking, filtering, and linking in three consecutive processes on a Hadoop cluster. Experiments show that our approach runs efficiently and effectively while maintaining high recall compared with the traditional method.
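
The abstract names the three phases but does not give their details, so the following is only an illustrative sketch of how blocking, filtering, and linking could be chained as three map/reduce passes. Everything specific here is an assumption and not taken from the paper: the in-memory `run_mapreduce` helper stands in for a Hadoop job, the blocking key is the first two characters of a name, the filter is a simple length-difference test, and linking uses Jaccard similarity over character bigrams with hypothetical `MAX_LEN_DIFF` and `THRESHOLD` values.

```python
from collections import defaultdict
from itertools import combinations

MAX_LEN_DIFF = 3   # assumed filter bound, not from the paper
THRESHOLD = 0.7    # assumed similarity cutoff, not from the paper


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(reducer(key, groups[key]))
    return out


def bigrams(s):
    """Set of lowercase character bigrams of s."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Phase 1 (blocking): group records by a cheap key, emit pairs within a block.
def block_map(rec):
    rec_id, name = rec
    yield name[:2].lower(), (rec_id, name)


def block_reduce(key, values):
    for a, b in combinations(sorted(values), 2):
        yield (a, b)


# Phase 2 (filtering): drop candidate pairs that fail an inexpensive test.
def filter_map(pair):
    (id_a, name_a), (id_b, name_b) = pair
    if abs(len(name_a) - len(name_b)) <= MAX_LEN_DIFF:
        yield (id_a, id_b), (name_a, name_b)


def filter_reduce(key, values):
    yield key, values[0]  # keep one copy of each candidate pair


# Phase 3 (linking): score the surviving pairs and keep those above threshold.
def link_map(item):
    (id_a, id_b), (name_a, name_b) = item
    score = jaccard(bigrams(name_a), bigrams(name_b))
    if score >= THRESHOLD:
        yield (id_a, id_b), score


def link_reduce(key, values):
    yield key, max(values)


def link_records(records):
    """Chain the three phases; each phase consumes the previous phase's output."""
    candidates = run_mapreduce(records, block_map, block_reduce)
    survivors = run_mapreduce(candidates, filter_map, filter_reduce)
    return run_mapreduce(survivors, link_map, link_reduce)
```

Calling `link_records` on a list of `(id, name)` tuples returns `((id_a, id_b), score)` entries for the pairs judged to match. The point of the staging is that pairwise comparison cost is paid only inside blocks, and the expensive similarity score is computed only for pairs that pass the cheap filter.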
