Advances in Data Mining and Database Management - Innovative Techniques and Applications of Entity Resolution
Latest Publications


TOTAL DOCUMENTS

17
(FIVE YEARS 0)

H-INDEX

1
(FIVE YEARS 0)

Published By IGI Global

9781466651982, 9781466651999

This chapter focuses on the basic data operators for entity resolution: similarity search, similarity join, and clustering on sets or strings. These three problems are of increasing complexity, and the solutions to the simpler problems serve as building blocks for the harder ones. The authors first introduce solutions to similarity search, covering gram-based and sketch-based algorithms. The chapter then turns to similarity join, covering both exact and approximate algorithms. Finally, the authors address the problem of clustering similar strings in a set, which can be applied to duplicate detection in databases.
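As a rough illustration of gram-based similarity search, the sketch below scans a list of strings and keeps those whose q-gram sets are close to the query under Jaccard similarity. The function names, the padding characters, and the default threshold are assumptions made for this example; the abstract does not specify the chapter's algorithms, which would typically use an index rather than a linear scan.

```python
def qgrams(s, q=2):
    """Return the set of q-grams of s, padded so string ends contribute grams."""
    # '#' and '$' are arbitrary padding symbols chosen for this sketch.
    padded = "#" * (q - 1) + s.lower() + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_search(query, strings, threshold=0.5, q=2):
    """Linear-scan search: return (string, score) pairs at or above threshold."""
    query_grams = qgrams(query, q)
    hits = []
    for s in strings:
        score = jaccard(query_grams, qgrams(s, q))
        if score >= threshold:
            hits.append((s, score))
    # Best matches first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    names = ["john smith", "jane doe", "jon smith"]
    for name, score in similarity_search("jon smith", names):
        print(f"{name}: {score:.2f}")
```

A similarity join can be viewed as running this search for every string in one set against another set, which is why efficient search is treated as a building block for the join.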


In this chapter, the authors study entity resolution on graph data sets. Conducting entity resolution on graph data requires a definition of the distance between graphs. The authors compute these distances exactly, or approximate them for time efficiency, and then use the distances to obtain the final entity resolution result. Approximate graph matching algorithms may be index-based, like the NH-Index method, or kernel-function-based, like the G-hash method. Other methods concentrate on providing new definitions of graph similarity that are easier to compute than traditional ones, like the Web-collection method and the Grafil method. To increase the resolution ability of traditional methods, researchers have proposed methods for recognizing similar graphs, such as graph-bounded simulation and p-homomorphism. Section 8.1 introduces existing methods for defining the distance between graphs, which has a direct impact on the computation of graph similarity. Section 8.1 introduces pair-wise entity resolution on graph data sets, including index techniques, graph-bounded simulation, and graph p-homomorphism.


With the rapid development of e-commerce, a huge amount of commodity data is now available on the Internet, and users often spend a lot of time looking for the exact product they want. Finding product listings that represent the same entity is therefore an effective way to make purchasing more efficient. Because of frequently missing or wrong values and subjective differences in description, traditional entity resolution methods may not perform well on e-commerce data. The chapter therefore proposes a set of algorithms for data cleaning, attribute and value tagging, and entity resolution that are specialized for e-commerce data. In addition, users’ actions are collected to improve the classification result. The chapter evaluates the effectiveness of the proposed algorithms on real-life datasets from e-commerce sites.


In information integration systems, duplicate records cause problems in data processing and analysis. To represent the similarity between two records from different data sources with different schemas, an optimal bipartite graph matching is computed over their attributes, and the similarity is measured as the weight of that matching. The basic idea in this chapter is to estimate the range of the record similarity and to decide whether two records are duplicates according to that estimate. When data integration is performed on XML data, many problems arise because of the flexibility of XML. One current implementation uses Data Exchange to carry out the above operations. This chapter proposes the concept of quality assurance mechanisms in addition to data integrity and reliability.
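The matching-based record similarity described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the chapter's method: attribute values are compared with `difflib.SequenceMatcher` (an assumed choice of value similarity), and the optimal matching is found by brute force over permutations, which is only practical for records with a handful of attributes. The chapter's range estimation step is not shown.

```python
from difflib import SequenceMatcher
from itertools import permutations


def value_similarity(a, b):
    """Similarity in [0, 1] between two attribute values (assumed measure)."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(rec_a, rec_b):
    """Weight of the best one-to-one matching between the two records' values.

    Attribute names are ignored because the schemas are assumed to differ.
    Brute force is factorial in the number of attributes; a real system
    would use an assignment algorithm such as the Hungarian method.
    """
    small, large = sorted((list(rec_a.values()), list(rec_b.values())), key=len)
    if not small:
        return 0.0
    weights = [[value_similarity(x, y) for y in large] for x in small]
    best = 0.0
    # Try every way of assigning each value in `small` to a distinct value in `large`.
    for cols in permutations(range(len(large)), len(small)):
        total = sum(weights[i][j] for i, j in enumerate(cols))
        best = max(best, total)
    return best


if __name__ == "__main__":
    r1 = {"name": "Alice Wang", "city": "Harbin"}
    r2 = {"town": "Harbin", "full_name": "Alice Wang", "phone": "555"}
    print(record_similarity(r1, r2))
```

The returned weight is unnormalized; dividing by the number of matched attributes is one possible way to compare it against a fixed duplicate threshold.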


A basic task of entity resolution is to detect duplicate records in a single relation, and many approaches have been proposed for this problem in different areas. The basic step of entity resolution is attribute similarity computation, and many techniques build on attribute similarity measures to carry out the full entity resolution process. The rule-based approach is one of the main techniques. To speed up duplicate record detection, the authors use techniques such as canopy clustering and blocking. In this chapter, the authors focus on record similarity computation, the rule-based approach, similarity threshold computation, and blocking.
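To show why blocking speeds up duplicate detection, the following sketch groups records by a cheap blocking key and generates candidate pairs only within each block. The key used here (the first three letters of a `name` field) and the record layout are hypothetical choices for the example, not taken from the chapter.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Cheap key: first three characters of the lowercased name (assumed field)."""
    return record["name"].strip().lower()[:3]


def candidate_pairs(records):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)
    pairs = []
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        pairs.extend(combinations(ids, 2))
    return pairs


if __name__ == "__main__":
    records = [
        {"name": "Smith, John"},
        {"name": "smith j."},
        {"name": "Jones, Amy"},
        {"name": "Smithe, Jon"},
    ]
    print(candidate_pairs(records))
```

With four records, an exhaustive comparison would examine six pairs, while this blocking scheme yields three. The expensive record similarity and rule checks then run only on those candidates, at the cost of missing duplicates whose keys differ.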


Prior work on entity resolution involves expensive similarity comparison and clustering approaches. Additionally, the quality of entity resolution may be low because of insufficient information. To address these problems, the authors present a novel entity resolution framework, Context-Based Entity Description (CED), which exploits the context information of data objects. In this framework, each entity is described by a set of CEDs. During entity resolution, objects are compared only with CEDs to determine their corresponding entities. The authors also propose efficient algorithms for CED discovery, CED maintenance, and CED-based entity resolution. They experimentally evaluate the CED-based ER algorithm on real DBLP datasets, and the results show that the algorithm achieves both high precision and high recall and outperforms existing methods.


Errors with names occur frequently. “California” and “CA” refer to the same state of the USA; however, both may appear as separate records in a database at the same time. Techniques are needed to solve such problems. In this chapter, the authors introduce methods for entity resolution on names and present three of them. Similarity measures between names are a fundamental technique and capture textual similarity. String transformations can handle some situations beyond textual similarity. More recently, learning algorithms on string transformations have been proposed to make matching robust to such variations. Examples illustrate the benefits of each approach.
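The difference between textual similarity and string transformations can be seen in a small sketch. Here Levenshtein edit distance stands in for the similarity measure, and a hand-written abbreviation table stands in for transformation rules; both are assumptions for illustration, and the table entries are hypothetical rather than learned as the chapter describes.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical transformation rules: abbreviation -> canonical form.
ABBREVIATIONS = {"ca": "california", "ny": "new york"}


def normalize(name):
    """Apply a transformation rule if one matches, else just lowercase."""
    key = name.strip().lower()
    return ABBREVIATIONS.get(key, key)


def same_name(a, b, max_edits=1):
    """Match if the normalized names are within max_edits of each other."""
    return edit_distance(normalize(a), normalize(b)) <= max_edits


if __name__ == "__main__":
    print(edit_distance("ca", "california"))  # large: textual similarity alone fails
    print(same_name("California", "CA"))      # the transformation rule bridges the gap
    print(same_name("Californa", "CA"))       # rule plus one tolerated typo
```

Edit distance alone treats "CA" and "California" as far apart, while the transformation step maps them to the same canonical string before the textual comparison is made.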


Data quality is one of the most prevalent problems in data management. A traditional data management application typically concerns the creation, maintenance, and use of a large amount of data, and assumes that the datasets are clean. However, real-life data are often dirty: inconsistent, duplicated, inaccurate, incomplete, or out of date. As a result, the problem of determining which facts are true among the large amount of conflicting information provided by various Web sites or other data sources to be integrated is receiving increasing attention. False data can produce misleading or biased analytical results and decisions, and can lead to loss of revenue, credibility, and customers. Building on the results of entity resolution, truth discovery plays an important role in modern data management applications. In this chapter, the authors review approaches to truth discovery as they relate to central aspects of data quality (i.e., data consistency, data deduplication, data accuracy, data currency, and information completeness).


Large quantities of records need to be read and analyzed in cloud computing, and the many records that refer to the same entity create challenges for data processing and analysis. Entity resolution has therefore become one of the hot issues in database research. Clustering based on record similarity is one of the most commonly used methods, but existing methods for computing record similarity are often time-consuming and are not suitable for cloud computing. This chapter argues that it is necessary to use the wave of strings to compute record similarity in cloud computing, and it provides an entity resolution method based on the wave of strings. Theoretical analysis and experimental results show that the proposed method is correct and effective.


Entity resolution is one of many important operations for data quality management, information retrieval, and data management. It has wide applications in Web search, e-commerce search, data cleaning, and information integration. Because of its importance, entity resolution has been studied by researchers in multiple fields, including databases, machine learning, information retrieval, and high-performance computing. This book contains a number of chapters that were carefully chosen to discuss the broad research issues in entity resolution. A number of important applications of entity resolution are also covered. The purpose of this chapter is to provide an overview of the concepts, applications, and research topics of entity resolution, as well as the coverage of these topics in this book.

