Data integration with high dimensionality

Biometrika ◽  
2017 ◽  
Vol 104 (2) ◽  
pp. 251-272 ◽  
Author(s):  
Xin Gao ◽  
Raymond J. Carroll

Summary We consider situations where the data consist of a number of responses for each individual, which may include a mix of discrete and continuous variables. The data also include a class of predictors, where the same predictor may have different physical measurements across different experiments depending on how the predictor is measured. The goal is to select which predictors affect any of the responses, where the number of such informative predictors tends to infinity as the sample size increases. There are marginal likelihoods for each experiment; we specify a pseudolikelihood combining the marginal likelihoods, and propose a pseudolikelihood information criterion. Under regularity conditions, we establish selection consistency for this criterion with unbounded true model size. The proposed method includes a Bayesian information criterion with appropriate penalty term as a special case. Simulations indicate that data integration can dramatically improve upon using only one data source.

2019 ◽  
pp. 254-277 ◽  
Author(s):  
Ying Zhang ◽  
Chaopeng Li ◽  
Na Chen ◽  
Shaowen Liu ◽  
Liming Du ◽  
...  

Large amounts of geospatial data are produced by various sources, and integrating them is difficult because of the lack of semantics. Although standardised data formats and data access protocols, such as Web Feature Service (WFS), give end-users access to heterogeneous data stored in different formats from various sources, integration remains time-consuming and ineffective due to the lack of semantics. To solve this problem, we propose a prototype for geospatial data integration that addresses four problems: geospatial data retrieval, modeling, linking and integration. We use four kinds of geospatial data sources to evaluate the performance of the proposed approach. The experimental results show that the proposed linking method performs well in generating matched candidate record pairs, as measured by Reduction Ratio (RR), Pairs Completeness (PC), Pairs Quality (PQ) and F-score. The integration results show that each data source gains substantial Complementary Completeness (CC) and Increased Completeness (IC).


2019 ◽  
pp. 230-253
Author(s):  
Ying Zhang ◽  
Chaopeng Li ◽  
Na Chen ◽  
Shaowen Liu ◽  
Liming Du ◽  
...  

Large amounts of geospatial data are produced by various sources and stored in incompatible formats, and integrating them is difficult because of the lack of semantics. Although standardised data formats and data access protocols, such as Web Feature Service (WFS), give end-users access to heterogeneous data stored in different formats from various sources, integration remains time-consuming and ineffective due to the lack of semantics. To solve this problem, we propose a prototype for geospatial data integration that addresses four problems: geospatial data retrieval, modeling, linking and integration. First, we provide a uniform integration paradigm for users to retrieve geospatial data. Then, we align the retrieved geospatial data in the modeling process to eliminate heterogeneity with the help of Karma. Our main contribution addresses the third problem, linking. Previous work performs linking by defining a set of semantic rules. However, geospatial data have specific spatial relationships that are significant for linking but cannot be handled directly by Semantic Web techniques. We take advantage of these unique features of geospatial data to implement the linking process. In addition, previous approaches run into difficulty when the geospatial data sources are in different languages. In contrast, our proposed linking algorithms include a translation function, which saves the cost of translating among geospatial sources in different languages. Finally, the geospatial data are integrated by eliminating data redundancy and combining the complementary properties of the linked records. We use four kinds of geospatial data sources, namely OpenStreetMap (OSM), Wikimapia, USGS and EPA, to evaluate the performance of the proposed approach.
The experimental results show that the proposed linking method performs well in generating matched candidate record pairs, as measured by Reduction Ratio (RR), Pairs Completeness (PC), Pairs Quality (PQ) and F-score. The integration results show that each data source gains substantial Complementary Completeness (CC) and Increased Completeness (IC).
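
The abstract names RR, PC, PQ and F-score without defining them. The sketch below uses the definitions that are common in the record-linkage literature (F-score taken as the harmonic mean of PC and PQ); the paper's own definitions may differ, and the function name, arguments and toy data are hypothetical.

```python
def blocking_metrics(n_a, n_b, candidates, true_matches):
    """Score a set of candidate record pairs between two sources.

    n_a, n_b      -- number of records in source A and source B
    candidates    -- pairs (i, j) kept by the linking/blocking step
    true_matches  -- pairs (i, j) known to refer to the same entity
    """
    candidates = set(candidates)
    true_matches = set(true_matches)
    total_pairs = n_a * n_b                  # size of the full comparison space
    found = candidates & true_matches        # true matches that survived

    # Reduction Ratio: share of the comparison space that was pruned away.
    rr = 1.0 - len(candidates) / total_pairs
    # Pairs Completeness: share of true matches still among the candidates.
    pc = len(found) / len(true_matches) if true_matches else 0.0
    # Pairs Quality: share of candidates that are true matches.
    pq = len(found) / len(candidates) if candidates else 0.0
    # F-score: harmonic mean of PC and PQ (assumed definition).
    f = 2 * pc * pq / (pc + pq) if (pc + pq) else 0.0
    return {"RR": rr, "PC": pc, "PQ": pq, "F": f}


# Toy example: 4 x 5 records, 4 candidate pairs, 3 true matches.
metrics = blocking_metrics(
    n_a=4,
    n_b=5,
    candidates=[(0, 0), (1, 1), (2, 3), (3, 4)],
    true_matches=[(0, 0), (1, 1), (2, 2)],
)
```

A high RR with a low PC indicates that the candidate set was pruned too aggressively, which is why the metrics are usually reported together.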


2014 ◽  
Vol 543-547 ◽  
pp. 2937-2940
Author(s):  
Xiao Xiao Liang ◽  
Shun Min Wang ◽  
Chong Gang Wei ◽  
Chuang Shen

Given the distributed, autonomous and heterogeneous nature of university databases, we designed the structure, main algorithm, query distributor, result processor and wrapper of a heterogeneous data integration middleware for universities, using Java, XML and middleware technology. We focus on the design of the query distributor, result processor and wrapper.


Author(s):  
Lihua Lu ◽  
Hengzhen Zhang ◽  
Xiao-Zhi Gao

Purpose – Data integration combines data residing at different sources and provides users with a unified interface to these data. An important issue in data integration is the existence of conflicts among the different data sources. Data sources may conflict with each other at the data level, which is defined as data inconsistency. This paper addresses this problem and proposes a solution for data inconsistency in data integration. Design/methodology/approach – A relational data model extended with data source quality criteria is first defined. Then, based on the proposed data model, a strategy for resolving data inconsistency is provided. To carry out the strategy, a fuzzy multi-attribute decision-making (MADM) approach based on data source quality criteria is applied to obtain the results. Finally, user feedback strategies are proposed to optimize the result of the fuzzy MADM approach and produce the final resolution of the inconsistency. Findings – To evaluate the proposed method, data obtained from sensors are extracted. Experiments are designed and performed to demonstrate the effectiveness of the proposed strategy. The results indicate that the solution performs better than the other methods on correctness, time cost and stability indicators. Practical implications – Since inconsistent data collected from sensors are pervasive, the proposed method can solve this problem and correct wrong choices to some extent. Originality/value – In this paper, the authors study for the first time the effect of user feedback on integration results for inconsistent data.


1995 ◽  
Vol 7 (1) ◽  
pp. 86-107 ◽  
Author(s):  
G. Deco ◽  
W. Finnoff ◽  
H. G. Zimmermann

Controlling the network complexity in order to prevent overfitting is one of the major problems encountered when using neural network models to extract the structure from small data sets. In this paper we present a network architecture designed for use with a cost function that includes a novel complexity penalty term. In this architecture the outputs of the hidden units are strictly positive and sum to one, and each output is defined as the probability that the actual input belongs to a certain class formed during learning. The penalty term expresses the mutual information between the inputs and the extracted classes. This measure effectively describes the network complexity with respect to the given data in an unsupervised fashion. The efficiency of this architecture and penalty term, when combined with backpropagation training, is demonstrated on a real-world economic time series forecasting problem. The model was also applied to the benchmark sunspot data and to a synthetic data set from the statistics community.
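
The abstract describes hidden-unit outputs that are positive and sum to one, and a penalty equal to the mutual information between inputs and extracted classes. The following is a minimal sketch of one way to compute such a quantity, assuming a softmax produces the class probabilities and that each training input is weighted equally; these choices and all names are assumptions, not the authors' formulation.

```python
import math


def softmax(z):
    """Map raw hidden activations to strictly positive values that sum to one."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def entropy(p):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(v * math.log(v) for v in p if v > 0.0)


def mutual_information(class_probs):
    """Estimate I(X; C) = H(C) - H(C | X) from per-input class probabilities.

    class_probs[i][c] is p(class c | input i); inputs are weighted equally.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    # Marginal class distribution p(c), averaged over the inputs.
    marginal = [sum(row[c] for row in class_probs) / n for c in range(k)]
    # Average entropy of the per-input class distributions, H(C | X).
    conditional = sum(entropy(row) for row in class_probs) / n
    return entropy(marginal) - conditional


# Identical class probabilities for every input carry no information.
mi_uninformative = mutual_information([softmax([0.0, 0.0]), softmax([0.0, 0.0])])
# Perfectly separated inputs reach the maximum, log(2) for two classes.
mi_separated = mutual_information([[1.0, 0.0], [0.0, 1.0]])
```

Under these assumptions, a penalized cost could take the form `error + lam * mutual_information(class_probs)`, where `lam` is a hypothetical weighting coefficient.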


2018 ◽  
Vol 3 (2) ◽  
pp. 162
Author(s):  
Slamet Sudaryanto Nurhendratno ◽  
Sudaryanto Sudaryanto

Data integration is an important step in combining information from multiple sources. The problem is how to find and optimally combine data from scattered, heterogeneous data sources that are semantically interconnected. The heterogeneity of data sources results from a number of factors, including storing databases in different formats, using different software and hardware for database storage systems, and designing with different semantic data models (Katsis & Papakonstantinou, 2009; Ziegler & Dittrich, 2004). There are currently two approaches to data integration, Global as View (GAV) and Local as View (LAV); each has its own advantages and limitations, so proper analysis is needed before applying either. One major factor in making the integration of heterogeneous data sources efficient and effective is an understanding of the type and structure of the source data (source schema). Another factor is the type of view in which the integration result is presented (target schema). The results of the integration can be displayed as a single global view or as a variety of other views. Integrating data from structured sources therefore calls for a different approach than integrating data from unstructured or semi-structured sources. A schema mapping is a specific declaration that describes the relationship between the source schema and the target schema. A schema mapping is expressed as logical formulas that can help applications with data interoperability, data exchange and data integration. This paper considers the establishment of a data center for a patient referral center, which requires integrating data from a number of different health facilities, so a schema mapping system must be designed (to support optimization).
The data center, as the target schema, draws on various referral service units as source schemas, whose data are structured and independent. The structured source data can therefore be integrated into a unified view (the data center) with equivalent query rewriting. The data center, as a global schema serving as the target schema, requires a "mediator" that acts as a guide to maintain the global schema and the mapping between global and local schemas. As a Global as View (GAV) design, the data center tends toward a single, unified view, so effective integration with the various source schemas requires an integration facility ("Pemadu"). The "Pemadu" facility is a declarative mapping language that specifically links each of the source schemas to the data center. Equivalent query rewriting is therefore suitable for query optimization and for maintaining physical data independence.
Keywords: Global as View (GAV), Local as View (LAV), source schema, mapping schema
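
The Global as View idea in the abstract above can be illustrated with a toy mediator in which each global relation is defined as a set of views over the local sources, and a query on the global schema is answered by unfolding those views. The source names, field names and records below are hypothetical and are not taken from the paper.

```python
# Hypothetical local sources, each with its own schema.
CLINIC_A = [
    {"pid": 1, "full_name": "Ana", "referred": True},
    {"pid": 2, "full_name": "Budi", "referred": False},
]
HOSPITAL_B = [
    {"patient_no": 7, "name": "Citra", "status": "referred"},
]


def clinic_a_view():
    """Express clinic A's records in the global Patient schema."""
    for r in CLINIC_A:
        yield {
            "patient_id": f"A-{r['pid']}",
            "name": r["full_name"],
            "referred": r["referred"],
            "facility": "clinic_a",
        }


def hospital_b_view():
    """Express hospital B's records in the global Patient schema."""
    for r in HOSPITAL_B:
        yield {
            "patient_id": f"B-{r['patient_no']}",
            "name": r["name"],
            "referred": r["status"] == "referred",
            "facility": "hospital_b",
        }


# GAV mapping: each global relation is the union of views over the sources.
GAV_MAPPING = {"Patient": [clinic_a_view, hospital_b_view]}


def query_global(relation, predicate):
    """Answer a selection on the global schema by unfolding the source views."""
    rows = []
    for view in GAV_MAPPING[relation]:
        rows.extend(row for row in view() if predicate(row))
    return rows


referred = query_global("Patient", lambda row: row["referred"])
```

In this sketch, adding a new health facility means writing one more view function and registering it in `GAV_MAPPING`; queries against the global schema stay unchanged.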


2020 ◽  
Vol 10 (20) ◽  
pp. 7092
Author(s):  
Ameera Almasoud ◽  
Hend Al-Khalifa ◽  
AbdulMalik Al-salman ◽  
Miltiadis Lytras

Massive heterogeneous big data residing at different sites in various types and formats need to be integrated into a single unified view before data mining can begin. Furthermore, in most applications and research, a single big data source is not enough to complete the analysis and achieve the goals. Unfortunately, there is no general or standardized integration process; the nature of an integration process depends on the data type, domain, and integration purpose. Based on these parameters, we proposed, implemented, and tested a big data integration framework that integrates big data in the biology domain, based on the domain ontology and using distributed processing. The integration produced the same result as that obtained from local integration. The results are equivalent in terms of the ontology size before the integration; the number of added, skipped, and overlapped items; the ontology size after the integration; and the number of edges, vertices, and roots. The results also do not violate any logical consistency rules, passing all the logical consistency tests, such as those of the Jena Ontology API and the HermiT and Pellet reasoners. The integration result is a new big data source that combines big data from several critical sources in the biology domain and transforms it into one unified format to help researchers and specialists use it for further research and analysis.

