Optimizing the Accuracy of Entity-Based Data Integration of Multiple Data Sources Using Genetic Programming Methods

Yinle Zhou; Ali Kooshesh; John Talburt

doi:10.4018/jbir.2012010105

International Journal of Business Intelligence Research ◽

10.4018/jbir.2012010105 ◽

2012 ◽

Vol 3 (1) ◽

pp. 72-82 ◽

Cited By ~ 2

Author(s):

Yinle Zhou ◽

Ali Kooshesh ◽

John Talburt

Keyword(s):

Genetic Programming ◽

Data Integration ◽

Synthetic Data ◽

Multiple Data ◽

Null Value ◽

Selection Operators ◽

Series Of Experiments ◽

Null Values ◽

Selection Operator ◽

Different Sources

Entity-based data integration (EBDI) is a form of data integration in which information related to the same real-world entity is collected and merged from different sources. It often happens that not all of the sources agree on one value for a common attribute. These cases are typically resolved by invoking a rule that selects one of the non-null values presented by the sources. One of the most commonly used selection rules is the naïve selection operator, which chooses the non-null value provided by the source with the highest overall accuracy for the attribute in question. However, the naïve selection operator will not always produce the most accurate result. This paper describes a method for automatically generating a selection operator using methods from genetic programming. It also presents the results from a series of experiments using synthetic data that indicate that this method will yield a more accurate selection operator than either the naïve or naïve-voting selection operators.
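
As a rough illustration of the baseline rules named in this abstract, the sketch below implements a naïve selection operator in Python. The function names, the per-source accuracy list, and the majority-vote reading of "naïve-voting" are assumptions made for illustration only; the abstract does not define the voting rule, and this is not the genetic-programming method the paper proposes.

```python
from collections import Counter
from typing import Optional, Sequence


def naive_select(values: Sequence[Optional[str]],
                 accuracies: Sequence[float]) -> Optional[str]:
    """Pick the non-null value offered by the most accurate source.

    values[i] is the value reported by source i (None means null);
    accuracies[i] is that source's overall accuracy for the attribute.
    """
    candidates = [(acc, val) for val, acc in zip(values, accuracies)
                  if val is not None]
    if not candidates:
        return None  # every source was null
    return max(candidates, key=lambda pair: pair[0])[1]


def naive_voting_select(values: Sequence[Optional[str]],
                        accuracies: Sequence[float]) -> Optional[str]:
    """Assumed voting variant: the most frequent non-null value wins.

    Ties are broken by applying the naive rule to the tied values only.
    """
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    top = max(counts.values())
    winners = {v for v, c in counts.items() if c == top}
    if len(winners) == 1:
        return next(iter(winners))
    tied_only = [v if v in winners else None for v in values]
    return naive_select(tied_only, accuracies)


if __name__ == "__main__":
    reported = ["12 Oak St", None, "21 Oak St", "21 Oak St"]
    source_accuracy = [0.90, 0.95, 0.60, 0.70]
    print(naive_select(reported, source_accuracy))
    print(naive_voting_select(reported, source_accuracy))
```

In the usage example the two rules disagree: the most accurate source that reports a value gives one address, while two less accurate sources agree on another. Cases like this are where a fixed rule may pick the wrong value.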

Extending the Relational Algebra for Relations with Maybe Tuples and Existential and Universal Null Values

Fundamenta Informaticae ◽

10.3233/fi-1984-7109 ◽

1984 ◽

Vol 7 (1) ◽

pp. 129-150

Author(s):

Joachim Biskup

Keyword(s):

Relational Algebra ◽

Null Value ◽

Null Values ◽

A Minor

We study operations on generalized database relations which possibly contain maybe tuples and two types of null values. The existential null value has the meaning “value at present unknown” whereas the universal null value has the meaning “value arbitrary”. For extending a usual relational operation to generalized relations we develop three requirements: adequacy, restrictedness, and feasibility. As demonstrated for the natural join as an example, we can essentially meet these requirements although we are faced with a minor tradeoff between restrictedness and feasibility.

Rejecting or Accepting Parameter Values in Bayesian Estimation

Advances in Methods and Practices in Psychological Science ◽

10.1177/2515245918771304 ◽

2018 ◽

Vol 1 (2) ◽

pp. 270-280 ◽

Cited By ~ 64

Author(s):

John K. Kruschke

Keyword(s):

Decision Rule ◽

Bayes Factors ◽

Posterior Distributions ◽

P Values ◽

Plausible Values ◽

Null Value ◽

Null Values ◽

Highest Density Interval ◽

Region Of Practical Equivalence ◽

Parameter Values

This article explains a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE) around the null value. The article also discusses considerations for setting the limits of a ROPE and emphasizes that analogous considerations apply to setting the decision thresholds for p values and Bayes factors.
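
The decision rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function names, the 95% interval mass, and the sample-based estimate of the highest density interval (narrowest window over sorted posterior draws) are all assumptions. The null value is rejected when the interval lies entirely outside the ROPE, accepted when it lies entirely inside, and left undecided otherwise.

```python
import math
from typing import Sequence, Tuple


def hdi(samples: Sequence[float], mass: float = 0.95) -> Tuple[float, float]:
    """Estimate the highest density interval from posterior draws.

    Returns the narrowest interval that contains at least `mass`
    of the sorted samples (suitable for unimodal posteriors).
    """
    if not samples:
        raise ValueError("need at least one posterior sample")
    s = sorted(samples)
    n = len(s)
    k = max(1, math.ceil(mass * n))  # number of draws inside the interval
    best_start = 0
    best_width = s[k - 1] - s[0]
    for start in range(1, n - k + 1):
        width = s[start + k - 1] - s[start]
        if width < best_width:
            best_start, best_width = start, width
    return s[best_start], s[best_start + k - 1]


def rope_decision(samples: Sequence[float],
                  rope: Tuple[float, float],
                  mass: float = 0.95) -> str:
    """Compare the HDI with a region of practical equivalence (ROPE)."""
    lo, hi = hdi(samples, mass)
    rope_lo, rope_hi = rope
    if hi < rope_lo or lo > rope_hi:
        return "reject"      # HDI entirely outside the ROPE
    if rope_lo <= lo and hi <= rope_hi:
        return "accept"      # HDI entirely inside the ROPE
    return "undecided"       # HDI and ROPE partially overlap


if __name__ == "__main__":
    draws = [1.0 + i * 0.001 for i in range(200)]  # stand-in posterior draws
    print(rope_decision(draws, rope=(-0.2, 0.2)))
```

The ROPE limits passed in are placeholders; as the abstract stresses, choosing them is a substantive judgment in its own right.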

XML Data Integration

Advanced Applications and Structures in XML Processing ◽

10.4018/978-1-61520-727-5.ch015 ◽

2010 ◽

pp. 333-360 ◽

Cited By ~ 1

Author(s):

Yan Qi ◽

Huiping Cao ◽

K. Selçuk Candan ◽

Maria Luisa Sapino

Keyword(s):

Data Integration ◽

Query Processing ◽

Multiple Sources ◽

Data Types ◽

Xml Data ◽

Data Schema ◽

Integration Data ◽

Feedback Techniques ◽

Different Sources ◽

Xml Data Integration

In XML Data Integration, data/metadata merging and query processing are indispensable. Specifically, merging integrates multiple disparate (heterogeneous and autonomous) input data sources together for further usage, while query processing is one main reason why the data need to be integrated in the first place. Besides, when supported with appropriate user feedback techniques, queries can also provide contexts in which conflicts among the input sources can be interpreted and resolved. The flexibility of XML structure provides opportunities for alleviating some of the difficulties that other less flexible data types face in the presence of uncertainty; yet, this flexibility also introduces new challenges in merging multiple sources and query processing over integrated data. In this chapter, the authors discuss two alternative ways XML data/schema can be integrated: conflict-eliminating (where the result is cleaned from any conflicts that the different sources might have with each other) and conflict-preserving (where the resulting XML data or XML schema captures the alternative interpretations of the data). They also present techniques for query processing over integrated, possibly imprecise, XML data, and cover strategies that can be used for resolving underlying conflicts.
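
To make the contrast between the two integration styles concrete, here is a small Python sketch. It is an assumption-laden simplification: records are flat dictionaries rather than XML trees, the function names are hypothetical, and the "earlier source wins" policy used for conflict elimination is only one possible rule, not the chapter's technique.

```python
from typing import Dict, List, Sequence


def merge_conflict_preserving(
        records: Sequence[Dict[str, str]]) -> Dict[str, List[str]]:
    """Keep every distinct value each source offers for a field."""
    merged: Dict[str, List[str]] = {}
    for record in records:
        for field, value in record.items():
            alternatives = merged.setdefault(field, [])
            if value not in alternatives:
                alternatives.append(value)
    return merged


def merge_conflict_eliminating(
        records: Sequence[Dict[str, str]]) -> Dict[str, str]:
    """Resolve conflicts up front; assumed policy: earlier sources win."""
    merged: Dict[str, str] = {}
    for record in records:
        for field, value in record.items():
            merged.setdefault(field, value)
    return merged


if __name__ == "__main__":
    source_a = {"title": "XML Data Integration", "year": "2009"}
    source_b = {"title": "XML Data Integration", "year": "2010",
                "pages": "333-360"}
    print(merge_conflict_preserving([source_a, source_b]))
    print(merge_conflict_eliminating([source_a, source_b]))
```

The preserving merge defers the choice between alternatives to query time, which is where user feedback could guide resolution; the eliminating merge commits to one value during integration.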

Understanding and Modeling Context in Data Integration

International Journal of Conceptual Structures and Smart Applications ◽

10.4018/ijcssa.2014010101 ◽

2014 ◽

Vol 2 (1) ◽

pp. 1-17

Author(s):

William T. Sabados ◽

Harry S. Delugach

Keyword(s):

Data Integration ◽

Data Sources ◽

Fundamental Characteristic ◽

Context Modeling ◽

Modeling Framework ◽

Multiple Data Sources ◽

Multiple Contexts ◽

Multiple Data

The pragmatic context of information is a fundamental characteristic that is not often formally addressed in data integration. This paper discusses the challenges of modeling the multiple contexts at play in data integration. A simple data integration context modeling framework is introduced that we believe addresses important issues of representing a pragmatic context. It allows for multiple data sources from similar domains to be brought together without having to designate one as the “true” semantics. An example is provided showing how this approach supports integration efforts.

Principles on Symbolic Data Analysis

Handbook of Research on Innovations in Database Technologies and Applications ◽

10.4018/978-1-60566-242-8.ch009 ◽

2009 ◽

pp. 74-81

Author(s):

Héctor Oscar Nigro ◽

Sandra Elizabeth González Císaro

Keyword(s):

Data Analysis ◽

Missing Values ◽

Symbolic Data Analysis ◽

Human Errors ◽

Symbolic Data ◽

Analysis Process ◽

New Type ◽

Null Value ◽

Internal Variation ◽

Different Sources

Today’s technology allows storing vast quantities of information from sources of different natures. This information has missing values, nulls, internal variation, taxonomies, and rules. We need a new type of data analysis that allows us to represent the complexity of reality while maintaining the internal variation and structure (Diday, 2003). In the data analysis process or in data mining, it is necessary to know the nature of null values (whether a case is an absent value, a null value, or a default value); some imprecision is also possible and valid, owing to differing semantics of a concept, diverse sources, linguistic imprecision, elements summarized in the database, human errors, etc. (Chavent, 1997). So, we need conceptual support to handle these types of situations. As we are going to see below, Symbolic Data Analysis (SDA) is a new field based on a strong conceptual model called the Symbolic Object (SO). An SO is defined by its “intent”, which contains a way to find its “extent”. For instance, the description of the inhabitants of a region and the way of allocating an individual to this region is called the “intent”; the set of individuals that satisfies this intent is called the “extent” (Diday, 2003). For this type of analysis, different experts are needed, each one contributing their concepts.

So you think you can PLS-DA?

BMC Bioinformatics ◽

10.1186/s12859-019-3310-7 ◽

2020 ◽

Vol 21 (S1) ◽

Author(s):

Daniel Ruiz-Perez ◽

Haibin Guan ◽

Purnima Madhivanan ◽

Kalai Mathee ◽

Giri Narasimhan

Keyword(s):

Feature Selection ◽

Signal To Noise Ratio ◽

Synthetic Data ◽

Principal Component ◽

Ground Truth ◽

Close Relative ◽

Data Set ◽

Series Of Experiments ◽

Feature Selector ◽

Class Labels

Background: Partial Least-Squares Discriminant Analysis (PLS-DA) is a popular machine learning tool that is gaining increasing attention as a useful feature selector and classifier. In an effort to understand its strengths and weaknesses, we performed a series of experiments with synthetic data and compared its performance to its close relative from which it was initially invented, namely Principal Component Analysis (PCA). Results: We demonstrate that even though PCA ignores the information regarding the class labels of the samples, this unsupervised tool can be remarkably effective as a feature selector. In some cases, it outperforms PLS-DA, which is made aware of the class labels in its input. Our experiments range from looking at the signal-to-noise ratio in the feature selection task, to considering many practical distributions and models encountered when analyzing bioinformatics and clinical data. Other methods were also evaluated. Finally, we analyzed an interesting data set from 396 vaginal microbiome samples where the ground truth for the feature selection was available. All the 3D figures shown in this paper as well as the supplementary ones can be viewed interactively at http://biorg.cs.fiu.edu/plsda. Conclusions: Our results highlighted the strengths and weaknesses of PLS-DA in comparison with PCA for different underlying data models.

Resolving Interoperability Issues of Precision and Array with Null Value of Web Services Using WSIG-JADE Framework

Modelling and Simulation in Engineering ◽

10.1155/2020/8862249 ◽

2020 ◽

Vol 2020 ◽

pp. 1-13

Author(s):

Jaspreet Chawla ◽

Anil Kr Ahlawat ◽

Jyoti Gautam

Keyword(s):

Web Services ◽

Multiagent System ◽

Data Sets ◽

String Length ◽

A Value ◽

Empty String ◽

Whole Number ◽

Null Value ◽

Null Values ◽

Data Elements

Web services and agent technology play a significant role in resolving issues related to platform interoperability. The Web Services Interoperability Organization (WS-I) provided guidelines for removing interoperability issues through Basic Profile 1.1/1.2. However, issues still arise when transferring precision values and arrays with null values between platforms such as Java and .NET. In the precision issue, Java supports data precision up to the 6th digit and .NET up to the 5th digit after the decimal point; beyond these limits, the number gets rounded off. In the array-with-null-value issue, Java treats null as a value, but .NET treats null as an empty string. To remove these issues, we use the WSIG-JADE framework, which helps to build and demonstrate a multiagent system that performs the mapping and conversions between agents and web services. It limits the number of digits to the 5th place after the decimal point, thereby increasing the precision in data sets, and it treats null as an empty string so that the string length remains the same on both platforms, thereby helping to count data elements correctly.
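
The two conversions described in this abstract (limiting values to five decimal places and mapping null to an empty string) can be mimicked with a short Python sketch. This is only an illustration of the mapping idea under stated assumptions: the function names are hypothetical, the code does not use or represent the WSIG-JADE framework, and Python's `None` stands in for a platform null.

```python
from typing import List, Optional, Union

Value = Union[str, int, float, None]


def normalize_value(value: Value, places: int = 5) -> Union[str, int, float]:
    """Apply the two mappings to a single data element.

    - null (None here) becomes an empty string
    - floating-point numbers are rounded to `places` decimal digits
    """
    if value is None:
        return ""
    if isinstance(value, float):
        return round(value, places)
    return value


def normalize_array(values: List[Optional[Value]],
                    places: int = 5) -> List[Union[str, int, float]]:
    """Normalize each element; the element count is left unchanged."""
    return [normalize_value(v, places) for v in values]


if __name__ == "__main__":
    payload = ["alpha", None, 3.1415926535, 42]
    print(normalize_array(payload))
```

Because every null is replaced rather than dropped, the array keeps its original length, which is the property the abstract ties to a correct count of data elements.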

Null Values Revisited in Prospect of Data Integration

Semantics of a Networked World. Semantics for Grid Databases - Lecture Notes in Computer Science ◽

10.1007/978-3-540-30145-5_5 ◽

2004 ◽

pp. 79-90 ◽

Cited By ~ 2

Author(s):

Guy de Tré ◽

Rita de Caluwe ◽

Henri Prade

Keyword(s):

Data Integration ◽

Null Values

A hands-on introduction to querying evolutionary relationships across multiple data sources using SPARQL

F1000Research ◽

10.12688/f1000research.21027.1 ◽

2019 ◽

Vol 8 ◽

pp. 1822 ◽

Cited By ~ 1

Author(s):

Ana Claudia Sima ◽

Christophe Dessimoz ◽

Kurt Stockinger ◽

Monique Zahn-Zabal ◽

Tarcisio Mendes de Farias

Keyword(s):

Query Language ◽

Data Sources ◽

Multiple Sources ◽

Semantic Web Technologies ◽

Genome Database ◽

Web Technologies ◽

Multiple Data ◽

Hands On ◽

Description Framework ◽

Different Sources

The increasing use of Semantic Web technologies in the life sciences, in particular the use of the Resource Description Framework (RDF) and the RDF query language SPARQL, opens the path for novel integrative analyses, combining information from multiple sources. However, analyzing evolutionary data in RDF is not trivial, due to the steep learning curve required to understand both the data models adopted by different RDF data sources, as well as the SPARQL query language. In this article, we provide a hands-on introduction to querying evolutionary data across multiple sources that publish orthology information in RDF, namely: The Orthologous MAtrix (OMA), the European Bioinformatics Institute (EBI) RDF platform, the Database of Orthologous Groups (OrthoDB) and the Microbial Genome Database (MBGD). We present four protocols in increasing order of complexity. In these protocols, we demonstrate through SPARQL queries how to retrieve pairwise orthologs, homologous groups, and hierarchical orthologous groups. Finally, we show how orthology information in different sources can be compared, through the use of federated SPARQL queries.

Integrating Common Ground and Informativeness in Pragmatic Word Learning

10.31234/osf.io/cbx46 ◽

2019 ◽

Author(s):

Manuel Bohn ◽

Michael Henry Tessler ◽

Michael C. Frank

Keyword(s):

Common Ground ◽

Contextual Information ◽

Model Performance ◽

Specific Information ◽

Intended Meaning ◽

Pragmatic Reasoning ◽

Extended Experiment ◽

Series Of Experiments ◽

Generate Model ◽

Different Sources

Pragmatic inferences are an integral part of language learning and comprehension. To recover the intended meaning of an utterance, listeners need to balance and integrate different sources of contextual information. In a series of experiments, we studied how listeners integrate general expectations about speakers with expectations specific to their interactional history with a particular speaker. We used a Bayesian pragmatics model to formalize the integration process. In Experiments 1 and 2, we replicated previous findings showing that listeners make inferences based on speaker-general and speaker-specific expectations. We then used the empirical measurements from these experiments to generate model predictions about how the two kinds of expectations should be integrated, which we tested in Experiment 3. Experiment 4 replicated and extended Experiment 3 to a broader set of conditions. In both experiments, listeners based their inferences on both types of expectations. We found that model performance was also consistent with this finding, with a better fit for a model that incorporated both general and specific information than for baselines incorporating only one type. Listeners flexibly integrate different forms of social expectations across a range of contexts, a process which can be described using Bayesian models of pragmatic reasoning.