Optimizing the Accuracy of Entity-Based Data Integration of Multiple Data Sources Using Genetic Programming Methods
Entity-based data integration (EBDI) is a form of data integration in which information related to the same real-world entity is collected and merged from different sources. It often happens that not all of the sources will agree on one value for a common attribute. These cases are typically resolved by invoking a rule that will select one of the non-null values presented by the sources. One of the most commonly used selection rules is called the naïve selection operator that chooses the non-null value provided by the source with the highest overall accuracy for the attribute in question. However, the naïve selection operator will not always produce the most accurate result. This paper describes a method for automatically generating a selection operator using methods from genetic programming. It also presents the results from a series of experiments using synthetic data that indicate that this method will yield a more accurate selection operator than either the naïve or naïve-voting selection operators.