Optimizing the Accuracy of Entity-Based Data Integration of Multiple Data Sources Using Genetic Programming Methods

2012, Vol 3 (1), pp. 72-82
Author(s): Yinle Zhou, Ali Kooshesh, John Talburt

Entity-based data integration (EBDI) is a form of data integration in which information related to the same real-world entity is collected and merged from different sources. It often happens that not all of the sources agree on a single value for a common attribute. These cases are typically resolved by invoking a rule that selects one of the non-null values presented by the sources. One of the most commonly used selection rules, the naïve selection operator, chooses the non-null value provided by the source with the highest overall accuracy for the attribute in question. However, the naïve selection operator does not always produce the most accurate result. This paper describes a method for automatically generating a selection operator using methods from genetic programming. It also presents results from a series of experiments on synthetic data indicating that this method yields a more accurate selection operator than either the naïve or the naïve-voting selection operator.
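As a rough illustration of the selection problem (not the authors' implementation), the sketch below contrasts a naïve selection operator with an accuracy-weighted vote of the kind a genetic programming search might discover; source names and accuracy figures are invented.

```python
# Minimal sketch: a naive selection operator that trusts the single most
# accurate source, versus an accuracy-weighted vote over agreeing sources.

def naive_select(values, source_accuracy):
    """values: dict source -> reported value (None means null).
    source_accuracy: dict source -> estimated accuracy for this attribute."""
    candidates = {s: v for s, v in values.items() if v is not None}
    if not candidates:
        return None
    best_source = max(candidates, key=lambda s: source_accuracy.get(s, 0.0))
    return candidates[best_source]

def weighted_vote_select(values, source_accuracy):
    """A hand-written stand-in for one rule a GP search might evolve:
    score each value by the summed accuracy of the sources reporting it."""
    candidates = {s: v for s, v in values.items() if v is not None}
    if not candidates:
        return None
    scores = {}
    for s, v in candidates.items():
        scores[v] = scores.get(v, 0.0) + source_accuracy.get(s, 0.0)
    return max(scores, key=scores.get)

if __name__ == "__main__":
    vals = {"A": "Smith", "B": "Smyth", "C": "Smith"}
    acc = {"A": 0.80, "B": 0.85, "C": 0.70}
    print(naive_select(vals, acc))          # Smyth (single most accurate source)
    print(weighted_vote_select(vals, acc))  # Smith (accuracy-weighted agreement)
```

A genetically programmed operator would evolve an expression tree over such features (per-source accuracy, agreement counts, value frequency) rather than using one fixed rule.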

1984, Vol 7 (1), pp. 129-150
Author(s): Joachim Biskup

We study operations on generalized database relations which may contain maybe tuples and two types of null values. The existential null value has the meaning “value at present unknown”, whereas the universal null value has the meaning “value arbitrary”. For extending a usual relational operation to generalized relations we develop three requirements: adequacy, restrictedness, and feasibility. As demonstrated for the natural join as an example, we can essentially meet these requirements, although we are faced with a minor tradeoff between restrictedness and feasibility.
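The following sketch is only an informal illustration, not Biskup's formal operator: it shows how a natural join over generalized relations might distinguish the two null types and produce maybe tuples. The markers and matching rules are simplified assumptions.

```python
# Illustrative natural join over tuples that may contain an existential null
# ("value unknown") or a universal null ("value arbitrary").

UNKNOWN = "@exists"    # value exists but is presently unknown
ARBITRARY = "@any"     # value may be anything

def join_values(a, b):
    """Return (match, merged) for one shared attribute value pair."""
    if a == ARBITRARY:
        return True, b          # arbitrary matches anything
    if b == ARBITRARY:
        return True, a
    if a == UNKNOWN and b == UNKNOWN:
        return None, UNKNOWN    # may match: result is a maybe tuple
    if a == UNKNOWN:
        return None, b          # may match only if the unknown value equals b
    if b == UNKNOWN:
        return None, a
    return (a == b), a          # both known: ordinary equality

def natural_join(r, s, shared):
    """r, s: lists of dicts; shared: common attribute names.
    Returns (sure_tuples, maybe_tuples)."""
    sure, maybe = [], []
    for t1 in r:
        for t2 in s:
            merged = dict(t1)
            definite, ok = True, True
            for attr in shared:
                match, value = join_values(t1[attr], t2[attr])
                if match is False:
                    ok = False
                    break
                if match is None:
                    definite = False
                merged[attr] = value
            if not ok:
                continue
            merged.update({k: v for k, v in t2.items() if k not in merged})
            (sure if definite else maybe).append(merged)
    return sure, maybe

r = [{"emp": "alice", "dept": UNKNOWN}]
s = [{"dept": "sales", "city": "Graz"}]
print(natural_join(r, s, ["dept"]))
# -> ([], [{'emp': 'alice', 'dept': 'sales', 'city': 'Graz'}])  # a maybe tuple
```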


2018, Vol 1 (2), pp. 270-280
Author(s): John K. Kruschke

This article explains a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE) around the null value. The article also discusses considerations for setting the limits of a ROPE and emphasizes that analogous considerations apply to setting the decision thresholds for p values and Bayes factors.
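A minimal sketch of the decision rule, applied to posterior samples, might look as follows; the ROPE limits are illustrative placeholders, and the article stresses that they must be justified for the parameter and scale at hand.

```python
# HDI + ROPE decision rule on posterior samples (toy data, illustrative ROPE).
import numpy as np

def hdi(samples, mass=0.95):
    """Highest density interval: the shortest interval holding `mass` of the samples."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    k = int(np.floor(mass * n))
    widths = x[k:] - x[:n - k]
    i = np.argmin(widths)
    return x[i], x[i + k]

def rope_decision(samples, rope=(-0.1, 0.1), mass=0.95):
    """ROPE is a region of practical equivalence around the null value (here 0)."""
    lo, hi = hdi(samples, mass)
    if hi < rope[0] or lo > rope[1]:
        return "reject null value"                           # HDI entirely outside ROPE
    if lo >= rope[0] and hi <= rope[1]:
        return "accept null value (practical equivalence)"   # HDI entirely inside ROPE
    return "withhold decision"                               # HDI and ROPE overlap

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    posterior = rng.normal(loc=0.5, scale=0.1, size=10_000)  # toy posterior
    print(rope_decision(posterior))  # expected: "reject null value"
```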


Author(s): Yan Qi, Huiping Cao, K. Selçuk Candan, Maria Luisa Sapino

In XML data integration, data/metadata merging and query processing are indispensable. Specifically, merging integrates multiple disparate (heterogeneous and autonomous) input data sources for further usage, while query processing is one of the main reasons why the data need to be integrated in the first place. Moreover, when supported with appropriate user feedback techniques, queries can also provide contexts in which conflicts among the input sources can be interpreted and resolved. The flexibility of XML structure provides opportunities for alleviating some of the difficulties that other less flexible data types face in the presence of uncertainty; yet, this flexibility also introduces new challenges in merging multiple sources and query processing over integrated data. In this chapter, the authors discuss two alternative ways XML data/schema can be integrated: conflict-eliminating (where the result is cleaned from any conflicts that the different sources might have with each other) and conflict-preserving (where the resulting XML data or XML schema captures the alternative interpretations of the data). They also present techniques for query processing over integrated, possibly imprecise, XML data, and cover strategies that can be used for resolving underlying conflicts.
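As a loose illustration of the two merge styles (not the techniques presented in the chapter), the sketch below merges two XML fragments either by resolving a conflict with a preference rule or by preserving both alternatives with their provenance; element names and data are invented.

```python
# Conflict-eliminating vs. conflict-preserving merge of two toy XML fragments.
import xml.etree.ElementTree as ET

src1 = ET.fromstring('<book><title>Data Integration</title><year>2009</year></book>')
src2 = ET.fromstring('<book><title>Data Integration</title><year>2010</year></book>')

def merge(a, b, conflict_preserving=True):
    out = ET.Element(a.tag)
    for tag in sorted({c.tag for c in a} | {c.tag for c in b}):
        va, vb = a.findtext(tag), b.findtext(tag)
        if va == vb or va is None or vb is None:
            ET.SubElement(out, tag).text = va if va is not None else vb
        elif conflict_preserving:
            # Keep both interpretations, annotated with their provenance.
            alt = ET.SubElement(out, tag)
            ET.SubElement(alt, "alternative", source="src1").text = va
            ET.SubElement(alt, "alternative", source="src2").text = vb
        else:
            # Conflict-eliminating: apply a resolution rule (here: prefer src1).
            ET.SubElement(out, tag).text = va
    return out

print(ET.tostring(merge(src1, src2), encoding="unicode"))
print(ET.tostring(merge(src1, src2, conflict_preserving=False), encoding="unicode"))
```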


Author(s): William T. Sabados, Harry S. Delugach

The pragmatic context of information is a fundamental characteristic that is not often formally addressed in data integration. This paper discusses the challenges of modeling the multiple contexts at play in data integration. A simple data integration context modeling framework is introduced that we believe addresses important issues of representing a pragmatic context. It allows for multiple data sources from similar domains to be brought together without having to designate one as the “true” semantics. An example is provided showing how this approach supports integration efforts.


Author(s): Héctor Oscar Nigro, Sandra Elizabeth González Císaro

Today’s technology allows us to store vast quantities of information from sources of differing nature. This information may contain missing values, nulls, internal variation, taxonomies, and rules. We need a new type of data analysis that represents the complexity of reality while maintaining its internal variation and structure (Diday, 2003). In the data analysis process, or in data mining, it is necessary to know the nature of null values - whether a case reflects an absent value, a null value, or a default value - and it is also possible and valid to have some imprecision, due to differing semantics of a concept, diverse sources, linguistic imprecision, elements summarized in the database, human error, etc. (Chavent, 1997). We therefore need conceptual support for handling these kinds of situations. As we will see below, Symbolic Data Analysis (SDA) is a new approach based on a strong conceptual model called the Symbolic Object (SO). An SO is defined by its “intent”, which contains a way to find its “extent”. For instance, the description of the inhabitants of a region, together with the way of allocating an individual to that region, is called the “intent”; the set of individuals that satisfies this intent is called the “extent” (Diday, 2003). For this type of analysis, different experts are needed, each contributing their own concepts.
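A loose illustration of the intent/extent idea (not an SDA implementation) is sketched below: the intent is a description over attribute intervals, and the extent is the set of individuals satisfying it; attribute names and data are invented.

```python
# Toy Symbolic Object: intent = interval description, extent = matching individuals.

individuals = [
    {"name": "i1", "age": 34, "income": 28_000},
    {"name": "i2", "age": 61, "income": 52_000},
    {"name": "i3", "age": 45, "income": 31_000},
]

class SymbolicObject:
    def __init__(self, intent):
        # intent: dict attribute -> (low, high) interval describing the concept
        self.intent = intent

    def holds(self, individual):
        return all(lo <= individual[attr] <= hi
                   for attr, (lo, hi) in self.intent.items())

    def extent(self, population):
        return [ind for ind in population if self.holds(ind)]

# Concept "working-age, modest-income inhabitant of the region":
region = SymbolicObject({"age": (18, 65), "income": (0, 40_000)})
print([ind["name"] for ind in region.extent(individuals)])  # ['i1', 'i3']
```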


2020, Vol 21 (S1)
Author(s): Daniel Ruiz-Perez, Haibin Guan, Purnima Madhivanan, Kalai Mathee, Giri Narasimhan

Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to that of its close relative, Principal Component Analysis (PCA), from which it was originally derived. Results: We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper, as well as the supplementary ones, can be viewed interactively at http://biorg.cs.fiu.edu/plsda. Conclusions: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.
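As a hedged sketch of the comparison (not the authors' exact pipeline), the example below ranks features on synthetic two-class data with PCA and with PLS-DA via scikit-learn's PLSRegression fitted against the class label; the data-generation parameters are arbitrary.

```python
# Ranking features with PCA (unsupervised) vs. PLS-DA (supervised) on toy data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(42)
n, p, informative = 200, 20, 3
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, p))
X[:, :informative] += y[:, None] * 1.5   # first 3 features carry the class signal

# PCA: rank features by |loading| on the first principal component.
pca = PCA(n_components=2).fit(X)
pca_rank = np.argsort(-np.abs(pca.components_[0]))

# PLS-DA: PLS regression against the 0/1 label; rank by |weight| on component 1.
pls = PLSRegression(n_components=2).fit(X, y.astype(float))
pls_rank = np.argsort(-np.abs(pls.x_weights_[:, 0]))

print("PCA top features:   ", pca_rank[:informative])
print("PLS-DA top features:", pls_rank[:informative])
```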


2020, Vol 2020, pp. 1-13
Author(s): Jaspreet Chawla, Anil Kr Ahlawat, Jyoti Gautam

Web services and agent technology play a significant role in resolving issues related to platform interoperability. The Web Services Interoperability Organization (WS-I) provides guidelines for removing interoperability issues through its Basic Profile 1.1/1.2. However, issues still arise when transferring precision values and arrays containing null values between platforms such as Java and .NET. In the precision issue, Java supports data precision up to the sixth digit after the decimal point and .NET up to the fifth; beyond these limits, the number gets rounded off. In the null-value array issue, Java treats null as a value, whereas .NET treats null as an empty string. To remove these issues, we use the WSIG-JADE framework, which helps build and demonstrate a multiagent system that performs the mapping and conversions between agents and web services. It limits the number of digits to five places after the decimal point, making precision consistent across data sets, and it treats null as an empty string so that the string length remains the same on both platforms, helping ensure a correct count of data elements.
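A simplified illustration of the two normalizations described above is sketched below; the actual WSIG-JADE mapping is Java-based, so this Python sketch only mirrors the stated rules.

```python
# Mirror of the two normalization rules: cap decimals at 5 places, and
# represent nulls as empty strings so element counts stay consistent.

def normalize_decimal(value, places=5):
    """Round a float to 5 decimal places so both platforms agree on the value."""
    return round(value, places)

def normalize_array(values):
    """Replace nulls with empty strings so the element count is preserved."""
    return ["" if v is None else v for v in values]

print(normalize_decimal(3.14159265))      # 3.14159
print(normalize_array(["a", None, "c"]))  # ['a', '', 'c']
```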


F1000Research, 2019, Vol 8, pp. 1822
Author(s): Ana Claudia Sima, Christophe Dessimoz, Kurt Stockinger, Monique Zahn-Zabal, Tarcisio Mendes de Farias

The increasing use of Semantic Web technologies in the life sciences, in particular the use of the Resource Description Framework (RDF) and the RDF query language SPARQL, opens the path for novel integrative analyses, combining information from multiple sources. However, analyzing evolutionary data in RDF is not trivial, due to the steep learning curve required to understand both the data models adopted by different RDF data sources, as well as the SPARQL query language. In this article, we provide a hands-on introduction to querying evolutionary data across multiple sources that publish orthology information in RDF, namely: The Orthologous MAtrix (OMA), the European Bioinformatics Institute (EBI) RDF platform, the Database of Orthologous Groups (OrthoDB) and the Microbial Genome Database (MBGD). We present four protocols in increasing order of complexity. In these protocols, we demonstrate through SPARQL queries how to retrieve pairwise orthologs, homologous groups, and hierarchical orthologous groups. Finally, we show how orthology information in different sources can be compared, through the use of federated SPARQL queries.
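As a rough sketch of how such a query might be issued from a script, the example below uses SPARQLWrapper; the endpoint URL and the orth: graph pattern are assumptions for illustration and should be checked against the protocols in the article.

```python
# Running a SPARQL query for pairwise members of an orthologous cluster.
# Endpoint and property names are assumed/illustrative, not verified here.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = "https://sparql.omabrowser.org/sparql"  # assumed OMA SPARQL endpoint
query = """
PREFIX orth: <http://purl.org/net/orth#>
SELECT ?member1 ?member2 WHERE {
  ?cluster a orth:OrthologsCluster ;
           orth:hasHomologousMember ?member1 ,
                                    ?member2 .
  FILTER (?member1 != ?member2)
}
LIMIT 10
"""

sparql = SPARQLWrapper(endpoint)
sparql.setQuery(query)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["member1"]["value"], row["member2"]["value"])
```

Federated comparisons across sources follow the same pattern, with SERVICE clauses directing sub-patterns to the other endpoints.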


2019
Author(s): Manuel Bohn, Michael Henry Tessler, Michael C. Frank

Pragmatic inferences are an integral part of language learning and comprehension. To recover the intended meaning of an utterance, listeners need to balance and integrate different sources of contextual information. In a series of experiments, we studied how listeners integrate general expectations about speakers with expectations specific to their interactional history with a particular speaker. We used a Bayesian pragmatics model to formalize the integration process. In Experiments 1 and 2, we replicated previous findings showing that listeners make inferences based on speaker-general and speaker-specific expectations. We then used the empirical measurements from these experiments to generate model predictions about how the two kinds of expectations should be integrated, which we tested in Experiment 3. Experiment 4 replicated and extended Experiment 3 to a broader set of conditions. In both experiments, listeners based their inferences on both types of expectations. Model performance was consistent with this finding, with a better fit for a model incorporating both general and specific information than for baselines incorporating only one type. Listeners flexibly integrate different forms of social expectations across a range of contexts, a process that can be described using Bayesian models of pragmatic reasoning.
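A toy sketch of the integration idea (not the authors' model) is given below: a speaker-general prior is combined with speaker-specific evidence from interactional history via Bayes' rule; the numbers are invented.

```python
# Combining speaker-general and speaker-specific expectations with Bayes' rule.
import numpy as np

interpretations = ["object A", "object B"]

# Speaker-general expectation over interpretations (illustrative values).
general_prior = np.array([0.7, 0.3])

# Speaker-specific evidence: how strongly this speaker's past usage in the
# listener's interactional history supports each interpretation (illustrative).
specific_likelihood = np.array([0.2, 0.8])

posterior = general_prior * specific_likelihood
posterior /= posterior.sum()

for interp, p in zip(interpretations, posterior):
    print(f"P({interp} | general + specific) = {p:.2f}")
```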

