Entity Resolution in Big Data Era: Challenges and Applications

Overlapped Hashing: A Novel Scalable Blocking Technique for Entity Resolution in Big-Data Era

Advances in Intelligent Systems and Computing - Intelligent Computing ◽

10.1007/978-3-030-01174-1_32 ◽

2018 ◽

pp. 427-441

Author(s):

Rana Khalil ◽

Ahmed Shawish ◽

Doaa Elzanfaly

Keyword(s):

Big Data ◽

Entity Resolution

Download Full-text

An Effective Entity Resolution Approach for Big Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.k9503.09101121 ◽

2021 ◽

Vol 10 (11) ◽

pp. 100-112

Author(s):

Randa Mohamed Abd El-ghafar ◽

◽

Ali H. El-Bastawissy ◽

Eman S. Nasr ◽

Mervat H. Gheith ◽

...

Keyword(s):

Big Data ◽

Phase 1 ◽

Entity Resolution ◽

Locality Sensitive Hashing ◽

Performance Time ◽

Efficiency And Effectiveness ◽

Similar Cluster ◽

Computational Resources ◽

Five Phases ◽

F Measure

Entity Resolution (ER) is defined as the process 0f identifying records/ objects that correspond to real-world objects/ entities. To define a good ER approach, the schema of the data should be well-known. In addition, schema alignment of multiple datasets is not an easy task and may require either domain expert or ML algorithm to select which attributes to match. Schema agnostic meta-blocking tries to solve such a problem by considering each token as a blocking key regardless of the attributes it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires the exact match of tokens which is very hard to occur in the actual datasets and it results in very low precision. To overcome such issues, we propose a novel and efficient ER approach for big data implemented in Apache Spark. The proposed approach is employed to avoid schema alignment as it treats the attributes as a bag of words and generates a set of n-grams which is transformed to vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is a generic one as it can accept all types of datasets. It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected such as the used blocking key, the significant attributes, NLP techniques, ER threshold, and the used scenario of ER, 4) ER pipeline construction, and 5) Clustering where the similar records are grouped into the similar cluster. The ER pipeline could accept two types of attributes; the Weighted Attributes (WA) or the Compound Attributes (CA). In addition, it accepts all the settings selected in the fourth module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. Phase 3) Applying the hashing Text Frequency (TF) to convert each n-grams to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classification of the objects to duplicates or not according to the calculated similarity between them. We introduced seven different scenarios as an input to the ER pipeline. To minimize the number of comparisons, we proposed the length filter which greatly contributes to improving the effectiveness of the proposed approach as it achieves the highest F-measure between the existing computational resources and scales well with the available working nodes. Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenario 3 and 4 Achieve the best performance time because using Soundex and Stemming contribute to reducing the performance time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because by utilizing the length filter, we only compare records that are nearly within a pre-determined percentage of increase or decrease of string length. LSH is used to map the same inputs items to the buckets with a higher probability than dis-similar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates which in turn increases the accuracy of the approach.

Download Full-text

Big data entity resolution: From highly to somehow similar entity descriptions in the Web

2015 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2015.7363781 ◽

2015 ◽

Cited By ~ 11

Author(s):

Vasilis Efthymiou ◽

Kostas Stefanidis ◽

Vassilis Christophides

Keyword(s):

Big Data ◽

Entity Resolution ◽

The Web

Download Full-text

Private entity resolution for big data on Apache Spark using multiple phonetic codes

Big Data Recommender Systems - Volume 1: Algorithms, Architectures, Big Data, Security and Trust ◽

10.1049/pbpc035f_ch13 ◽

2019 ◽

pp. 283-301

Author(s):

Alexandros Karakasidis ◽

Georgia Koloniari

Keyword(s):

Big Data ◽

Entity Resolution ◽

Apache Spark ◽

Private Entity

Download Full-text

Entity resolution for big data

Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '13 ◽

10.1145/2487575.2506179 ◽

2013 ◽

Cited By ~ 41

Author(s):

Lise Getoor ◽

Ashwin Machanavajjhala

Keyword(s):

Big Data ◽

Entity Resolution

Download Full-text

An Overview of End-to-End Entity Resolution for Big Data

ACM Computing Surveys ◽

10.1145/3418896 ◽

2020 ◽

Vol 53 (6) ◽

pp. 1-42

Author(s):

Vassilis Christophides ◽

Vasilis Efthymiou ◽

Themis Palpanas ◽

George Papadakis ◽

Kostas Stefanidis

Keyword(s):

Big Data ◽

Entity Resolution ◽

End To End

Download Full-text

An Efficient Multi-Phase Blocking Strategy for Entity Resolution in Big Data

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.i7070.079920 ◽

2020 ◽

Vol 9 (9) ◽

pp. 254-263

Keyword(s):

Big Data ◽

Data Integration ◽

Data Warehouse ◽

Real World ◽

Business Intelligence ◽

Time Complexity ◽

Entity Resolution ◽

Second Phase ◽

Multi Phase ◽

State Of Art

Entity Resolution (ER) is the process of identifying records that refer to the same real-world entity. It plays a key role in many applications as data warehouse, data integration, and business intelligence. Comparing every record with all corresponding records is infeasible especially for a big dataset. To overcome such a problem, blocking techniques have been implemented. In this paper, we propose a novel Efficient Multi-Phase Blocking Strategy (EMPBS) for resolving duplicates in big data. As per our knowledge, some state of art blocking techniques may result in overlapping blocks (i.e. Q-grams) which cause redundant comparisons and hence increase the time complexity. Our proposed blocking strategy has disjoint blocks and less time complexity compared to Q-grams and slandered blocking techniques. In addition, EMPBS is general and requires no restrictions on the type of blocking keys. EMPBS consists of three phases. The first one generates three single efficient blocking keys. The second phase takes the output of the first phase as an input to construct a compound key. The compound key is composed of concatenation of two single blocking keys. Three compound blocking keys are the output of this phase that will be used as an input for the last phase, which is generating the Efficient Multi-Phase Blocking Key (EMPBK). EMPBK is constructed using the union of two compound blocking keys. The implementation of EMPBS presents promising results in terms of Reduction Ratio (RR) as it achieves a higher value of RR than adopting only a single blocking key, while at the same time maintains nearly the same precision and recall. EMPBS reduced about 84% of the average number of comparisons accomplished in a single blocking key. To evaluate EMPBS, we developed a Duplicate Generation tool (DupGen) that accepts a clean semi-structured file as an input and generates labeled duplicate records according to certain criteria.

Download Full-text

Entity Resolution for Big Data using Combination of Supervised Meta-Blocking and pay-as-you-go Configuration

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f1031.0486s419 ◽

2019 ◽

Vol 8 (6S4) ◽

pp. 166-167

Keyword(s):

Big Data ◽

Real World ◽

Data Cleaning ◽

Entity Resolution ◽

Detailed Comparison ◽

Data Sets ◽

Important Process ◽

Improve Performance ◽

Multiple Data ◽

Multiple Data Sets

Entity resolution refers to the method of identifying the same real world object from multiple data sets. In Data cleaning and data integration application, entity resolution is an important process. When data is large the task of entity resolution becomes complex and time consuming. End-to-end entity resolution proposal involves stages like blocking (efficiently identifies duplicates), detailed comparison (refines blocking output) and clustering (identifies the set of records which may refer to the same entity). In this paper, an approach for feedback based optimization of complete entity resolution is proposed in which supervised meta-blocking is used for blocking stage. This paper proposes a technique for entity resolution which does optimization of each phase of entity resolution with benefits of supervised Meta-blocking to improve performance of entity resolution for big data

Download Full-text