A Framework for Enhancing Big Data Integration in Biological Domain Using Distributed Processing

2020 ◽  
Vol 10 (20) ◽  
pp. 7092
Author(s):  
Ameera Almasoud ◽  
Hend Al-Khalifa ◽  
AbdulMalik Al-salman ◽  
Miltiadis Lytras

Massive heterogeneous big data residing at different sites, in various types and formats, need to be integrated into a single unified view before data mining can start. Furthermore, in most applications and research, a single big data source is not enough to complete the analysis and achieve the goals. Unfortunately, there is no general or standardized integration process; the nature of an integration process depends on the data type, domain, and integration purpose. Based on these parameters, we proposed, implemented, and tested a big data integration framework that integrates big data in the biology domain, based on the domain ontology and using distributed processing. The integration produced the same result as that obtained from local integration. The results are equivalent in terms of the ontology size before the integration; the number of added, skipped, and overlapped items; the ontology size after the integration; and the number of edges, vertices, and roots. The results also do not violate any logical consistency rules, passing all the logical consistency tests, such as those run with the Jena Ontology API and the HermiT and Pellet reasoners. The integration result is a new big data source that combines big data from several critical sources in the biology domain and transforms it into one unified format to help researchers and specialists use it for further research and analysis.
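
The comparison metrics listed in this abstract (added, skipped, and overlapped items; edges, vertices, and roots) can be illustrated with a small sketch. This is not the authors' framework: the ontology is modeled here as a plain mapping from a term identifier to its parent identifiers, and the meanings given to "added", "skipped", and "overlapped" are assumptions made only for this example, as are the term identifiers.

```python
from dataclasses import dataclass

# Assumed representation: term id -> set of parent term ids.
Ontology = dict[str, set[str]]


@dataclass
class MergeStats:
    added: int = 0       # term was not in the target ontology
    skipped: int = 0     # term already present with the same or more parents
    overlapped: int = 0  # term present, but the source contributes new parents


def merge(target: Ontology, source: Ontology) -> MergeStats:
    """Merge source into target in place and count what happened."""
    stats = MergeStats()
    for term, parents in source.items():
        if term not in target:
            target[term] = set(parents)
            stats.added += 1
        elif parents <= target[term]:
            stats.skipped += 1
        else:
            target[term] |= parents
            stats.overlapped += 1
    return stats


def summarize(onto: Ontology) -> dict[str, int]:
    """Count vertices, edges, and roots (vertices without any parent)."""
    vertices = set(onto) | {p for ps in onto.values() for p in ps}
    edges = sum(len(ps) for ps in onto.values())
    roots = sum(1 for v in vertices if not onto.get(v))
    return {"vertices": len(vertices), "edges": edges, "roots": roots}


# Hypothetical term ids, used only to exercise the functions.
local = {"T:1": set(), "T:2": {"T:1"}}
incoming = {"T:2": {"T:1"}, "T:3": {"T:1"}, "T:1": {"T:0"}}
stats = merge(local, incoming)
summary = summarize(local)
```

Running the same merge locally and on a distributed engine and comparing `stats` and `summary` is one simple way to check that the two paths agree, which is the kind of equivalence the abstract reports.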

Author(s):  
Richard Kumaradjaja

This chapter describes data integration issues in big data analytics and proposes an integrated data integration framework for big data analytics. The main focus of this chapter is to address the issues of data integration from the architectural point of view. Doing so will lead to a better understanding of the current situation and better construction of proposed solutions to those issues, since an architectural approach can give us a holistic and comprehensive view of the problems. The chapter also discusses future research directions for the proposed integrated data architecture framework.


Author(s):  
Richard Kumaradjaja

This paper describes data integration issues in big data analytics and proposes an integrated data integration framework for big data analytics. The main focus of this article is to address the issues of data integration from the architectural point of view. Doing so will lead to a better understanding of the current situation and better construction of proposed solutions to those issues, since an architectural approach can give us a holistic and comprehensive view of the problems. The paper also discusses future research directions for the proposed integrated data architecture framework.


Author(s):  
Hansi Zhang ◽  
Yi Guo ◽  
Mattia Prosperi ◽  
Jiang Bian

Abstract. Background: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on how to systematically identify RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and the data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion: Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized documentation of the data selection and integration processes powered by an ontology, thus enabling the sharing of mIDA study reports among researchers.
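
To make the idea of turning checklist annotations into semantic triples concrete, here is a minimal sketch. It does not use the actual OD-ATTEST vocabulary: every class and property name under the `ex:` prefix, the annotation layout, and the study and variable names are hypothetical placeholders chosen for illustration.

```python
# A triple is (subject, predicate, object), each given as a string.
Triple = tuple[str, str, str]


def annotation_to_triples(study_id: str, annotation: dict) -> list[Triple]:
    """Flatten one study annotation into subject-predicate-object triples.

    All "ex:" terms are hypothetical placeholders, not OD-ATTEST terms.
    """
    triples: list[Triple] = [(study_id, "rdf:type", "ex:Study")]
    for var in annotation["variables"]:
        var_id = f"ex:var/{var['name']}"
        source_id = f"ex:source/{var['source']}"
        triples.append((study_id, "ex:hasVariable", var_id))
        triples.append((var_id, "ex:hasLevel", var["level"]))
        triples.append((var_id, "ex:derivedFrom", source_id))
    return triples


# Hypothetical annotation for one study with two risk-factor variables.
annotation = {
    "variables": [
        {"name": "smoking_status", "level": "individual", "source": "registry"},
        {"name": "area_poverty_rate", "level": "community", "source": "census"},
    ]
}
triples = annotation_to_triples("ex:study/1", annotation)
```

Once the annotations are in this shape, the link between a variable, its level, and its data source becomes an explicit, queryable statement rather than free text in a methods section.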


2020 ◽  
Vol 12 (6) ◽  
pp. 972 ◽  
Author(s):  
Yinyi Cheng ◽  
Kefa Zhou ◽  
Jinlin Wang ◽  
Jining Yan

The arrival of the era of big data for Earth observation (EO) indicates that traditional data management models have been unable to meet the needs of remote sensing data in big data environments. Since the launch of the first remote sensing satellite, the volume of remote sensing data has kept increasing, and traditional data storage methods have been unable to ensure the efficient management of large amounts of remote sensing data. Therefore, a professional remote sensing big data integration method is sorely needed. In recent years, the emergence of some new technical methods has provided effective solutions for multi-source remote sensing data integration. This paper proposes a multi-source remote sensing data integration framework based on a distributed management model. In this framework, the multi-source remote sensing data are partitioned by the proposed spatial segmentation indexing (SSI) model through spatial grid segmentation. The designed complete information description system, based on International Organization for Standardization (ISO) 19115, can describe multi-source remote sensing data in detail. Then, a distributed storage method based on MongoDB is used to store the multi-source remote sensing data. The distributed storage method is physically based on the sharding mechanism of the MongoDB database, and it can provide advantages for the security and performance of the preservation of remote sensing data. Finally, several experiments have been designed to test the performance of this framework in integrating multi-source remote sensing data. The results show that the storage and retrieval performance of the distributed remote sensing data integration framework proposed in this paper is superior. At the same time, the grid level of the SSI model proposed in this paper also has an important impact on the storage efficiency of remote sensing data. Therefore, the remote sensing data integration framework, based on distributed storage, can provide new technical support and development prospects for big EO data.
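
The idea of partitioning remote sensing data by spatial grid segmentation can be sketched as follows. This is a generic equal-angle grid, not the paper's SSI model: the assumption that level `n` splits the globe into `2**n` by `2**n` cells, the key format, and the function names are all illustrative choices.

```python
def grid_cell(lon: float, lat: float, level: int) -> tuple[int, int]:
    """Return the (row, col) of the grid cell containing a point.

    Assumes level n divides the globe into 2**n x 2**n equal-angle cells.
    """
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinates out of range")
    n = 2 ** level
    # Clamp so that lon == 180 or lat == 90 falls into the last cell.
    col = min(int((lon + 180.0) / 360.0 * n), n - 1)
    row = min(int((lat + 90.0) / 180.0 * n), n - 1)
    return row, col


def grid_key(lon: float, lat: float, level: int) -> str:
    """Build a string key that could serve as a partitioning key."""
    row, col = grid_cell(lon, lat, level)
    return f"L{level}-R{row}-C{col}"


def cells_for_bbox(west: float, south: float, east: float, north: float,
                   level: int) -> list[tuple[int, int]]:
    """List every cell a bounding box touches (no antimeridian crossing)."""
    r0, c0 = grid_cell(west, south, level)
    r1, c1 = grid_cell(east, north, level)
    return [(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


key = grid_key(87.6, 43.8, 4)
cells = cells_for_bbox(-10.0, -10.0, 10.0, 10.0, 2)
```

A coarser level yields fewer, larger cells and a finer level yields more, smaller ones, which is one plausible reason the choice of grid level affects storage efficiency in a sharded store.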


Author(s):  
Hansi Zhang ◽  
Yi Guo ◽  
Jiang Bian

Abstract. Background: To reduce cancer mortality and improve cancer outcomes, it is critical to understand the various cancer risk factors (RFs) across different domains (e.g., genetic, environmental, and behavioral risk factors) and levels (e.g., individual, interpersonal, and community levels). However, prior research on RFs of cancer outcomes has primarily focused on individual-level RFs due to the lack of integrated datasets that contain multi-level, multi-domain RFs. Further, the lack of consensus and proper guidance on how to systematically identify RFs also increases the difficulty of RF selection from heterogeneous data sources in a multi-level integrative data analysis (mIDA) study. More importantly, as mIDA studies require integrating heterogeneous data sources, the data integration processes in the limited number of existing mIDA studies are inconsistently performed and poorly documented, thus threatening transparency and reproducibility. Methods: Informed by the National Institute on Minority Health and Health Disparities (NIMHD) research framework, we (1) reviewed existing reporting guidelines from the Enhancing the QUAlity and Transparency Of health Research (EQUATOR) network and (2) developed a theory-driven reporting guideline to guide the RF variable selection, data source selection, and data integration process. Then, we developed an ontology to standardize the documentation of the RF selection and data integration process in mIDA studies. Results: We summarized the review results and created a reporting guideline, ATTEST, for reporting the variable selection and the data source selection and integration process. We provided an ATTEST checklist to help researchers annotate and clearly document each step of their mIDA studies to ensure transparency and reproducibility. We used ATTEST to report two mIDA case studies and further transformed the annotation results into semantic triples, so that the relationships among variables, data sources, and integration processes are explicitly standardized and modeled using the classes and properties from OD-ATTEST. Conclusion: Our ontology-based reporting guideline solves some key challenges in current mIDA studies for cancer outcomes research by providing (1) theory-driven guidance for multi-level and multi-domain RF variable and data source selection; and (2) standardized documentation of the data selection and integration processes powered by an ontology, thus enabling the sharing of mIDA study reports among researchers.


Author(s):  
Indrabudhi Lokaadinugroho ◽  
Abba Suganda Girsang ◽  
Burhanudin Burhanudin

This paper discusses how to build a data warehouse (DW) for business intelligence (BI) in a typical university marketing division. The study uses a descriptive method that describes the object or subject under study as it is, with the aim of systematically and precisely describing its facts and characteristics. The methodology has four phases: the identification and source data collection phase, the analysis phase, the design phase, and the results phase, each detailed in accordance with the nine steps of Kimball’s data warehouse methodology and Pentaho Data Integration (PDI). The result shows that Tableau, as a BI tool, does not have complete ETL tools. Combining PDI with the DW as a data source therefore makes Tableau more useful as a BI tool for presenting data, reducing the time needed to obtain strategic data from 2-3 weeks to 77 minutes.


2021 ◽  
Vol 11 (13) ◽  
pp. 6047
Author(s):  
Soheil Rezaee ◽  
Abolghasem Sadeghi-Niaraki ◽  
Maryam Shakeri ◽  
Soo-Mi Choi

A lack of required data resources is one of the challenges in the adoption of Augmented Reality (AR) for providing the right services to users, even though the amount of spatial information produced by people is increasing daily. This research aims to design a personalized AR-based tourist system that retrieves big data according to the users’ demographic contexts in order to enrich the AR data source in tourism. The research is conducted in two main steps. First, the type of tourist attraction in which the user is interested is predicted from the user’s demographic contexts, which include age, gender, and education level, by using a machine learning method. Second, the correct data for the user are extracted from the big data by considering time, distance, popularity, and the neighborhood of the tourist places, by using the VIKOR and SWAR decision-making methods. The results show that the decision tree outperforms the SVM method by about 6% in predicting the type of tourist attraction. In addition, the results of the user study of the system show the overall satisfaction of the participants, which is about 55% in terms of ease of use and about 56% in terms of the system’s usefulness.
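
The second step, ranking candidate places on several criteria, can be illustrated with a compact version of the standard VIKOR procedure. This is a sketch, not the authors' implementation: the criterion weights are hard-coded hypothetical values rather than derived by a weighting method, the three places and their scores are invented for the example, and only three of the criteria named in the abstract are used.

```python
def vikor(scores: dict[str, list[float]], weights: list[float],
          benefit: list[bool], v: float = 0.5) -> dict[str, float]:
    """Return the VIKOR index Q per alternative (lower is better).

    benefit[j] is True when a larger value of criterion j is better.
    """
    names = list(scores)
    cols = list(zip(*(scores[n] for n in names)))
    best = [max(c) if b else min(c) for c, b in zip(cols, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(cols, benefit)]

    def weighted_gaps(row: list[float]) -> list[float]:
        # Weighted, normalized distance from the best value per criterion.
        return [
            w * (fb - x) / (fb - fw) if fb != fw else 0.0
            for x, w, fb, fw in zip(row, weights, best, worst)
        ]

    s = {n: sum(weighted_gaps(scores[n])) for n in names}  # group utility
    r = {n: max(weighted_gaps(scores[n])) for n in names}  # individual regret

    def scale(x: float, lo: float, hi: float) -> float:
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    s_lo, s_hi = min(s.values()), max(s.values())
    r_lo, r_hi = min(r.values()), max(r.values())
    return {
        n: v * scale(s[n], s_lo, s_hi) + (1 - v) * scale(r[n], r_lo, r_hi)
        for n in names
    }


# Hypothetical places scored on [time (min), distance (km), popularity (0-10)].
places = {"A": [30, 2, 9], "B": [60, 5, 5], "C": [90, 8, 3]}
q = vikor(places, weights=[0.3, 0.3, 0.4], benefit=[False, False, True])
ranking = sorted(q, key=q.get)
```

The parameter `v` balances the overall utility against the worst single-criterion regret; 0.5 is a common neutral setting.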

