Data Integration to Create Large-Scale Spatially Detailed Synthetic Populations

Author(s):  
Yi Zhu ◽  
Joseph Ferreira

Author(s):  
Balaje T. Thumati ◽  
Halasya Siva Subramania ◽  
Rajeev Shastri ◽  
Karthik Kalyana Kumar ◽  
Nicole Hessner ◽  
...  




Genes ◽  
2019 ◽  
Vol 10 (3) ◽  
pp. 238 ◽  
Author(s):  
Evangelina López de Maturana ◽  
Lola Alonso ◽  
Pablo Alarcón ◽  
Isabel Adoración Martín-Antoniano ◽  
Silvia Pineda ◽  
...  

Omics data integration is already a reality. However, few omics-based algorithms show enough predictive ability to be implemented in clinical or public health settings. Clinical/epidemiological data tend to explain most of the variation in health-related traits, and their joint modeling with omics data is crucial for increasing an algorithm's predictive ability. Only a small number of published studies performed a "real" integration of omics and non-omics (OnO) data, mainly to predict cancer outcomes. Challenges in OnO data integration concern the nature and heterogeneity of non-omics data, the possibility of integrating large-scale non-omics data with high-throughput omics data, the relationship between OnO data (i.e., ascertainment bias), the presence of interactions, the fairness of the models, and the presence of subphenotypes. These challenges demand the development and application of new analysis strategies to integrate OnO data. In this contribution, we discuss different attempts at OnO data integration in clinical and epidemiological studies. Most of the reviewed papers considered only one type of omics data set, mainly RNA expression data. All selected papers incorporated non-omics data in a low-dimensionality fashion. The integrative strategies used in the identified papers adopted three modeling methods: independent, conditional, and joint modeling. This review presents, discusses, and proposes integrative analytical strategies towards OnO data integration.
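
As a rough illustration of the three integrative strategies named above, the sketch below contrasts independent, conditional, and joint modeling of clinical and omics features using scikit-learn. The data, variable names, and model choice (logistic regression) are hypothetical stand-ins, not the methods used in the reviewed papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
X_clin = rng.normal(size=(n, 5))     # clinical/epidemiological covariates
X_omics = rng.normal(size=(n, 50))   # e.g., RNA expression features
y = (X_clin[:, 0] + X_omics[:, 0] + rng.normal(size=n) > 0).astype(int)

# 1) Independent modeling: fit each data layer separately, then combine predictions.
p_clin = LogisticRegression().fit(X_clin, y).predict_proba(X_clin)[:, 1]
p_omics = LogisticRegression(max_iter=1000).fit(X_omics, y).predict_proba(X_omics)[:, 1]
p_independent = (p_clin + p_omics) / 2

# 2) Conditional modeling: model the outcome with omics data *given* the
#    clinical prediction (here, simply appended as an extra covariate).
X_cond = np.column_stack([X_omics, p_clin])
p_conditional = LogisticRegression(max_iter=1000).fit(X_cond, y).predict_proba(X_cond)[:, 1]

# 3) Joint modeling: a single model over the concatenated OnO feature matrix.
X_joint = np.column_stack([X_clin, X_omics])
p_joint = LogisticRegression(max_iter=1000).fit(X_joint, y).predict_proba(X_joint)[:, 1]
```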



2018 ◽  
Author(s):  
Jong-Eun Park ◽  
Krzysztof Polański ◽  
Kerstin Meyer ◽  
Sarah A. Teichmann

Increasing numbers of large-scale single-cell RNA-seq projects are leading to a data explosion, which can only be fully exploited through data integration. Therefore, efficient computational tools for combining diverse datasets are crucial for biology in the single-cell genomics era. A number of methods have been developed to assist data integration by removing technical batch effects, but most are computationally intensive. To overcome the challenge of enormous datasets, we have developed BBKNN, an extremely fast graph-based data integration method. We illustrate the power of BBKNN for dimensionality-reduced visualisation and clustering in multiple biological scenarios, including a massive integrative study over several murine atlases. BBKNN successfully connects cell populations across experimentally heterogeneous mouse scRNA-seq datasets, revealing global markers of cell type and organ specificity and providing the foundation for inferring the underlying transcription factor network. BBKNN is available at https://github.com/Teichlab/bbknn.
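
For orientation, here is a minimal sketch of how BBKNN is typically slotted into a Scanpy workflow, where it replaces the standard neighbour-graph step. The input file name and the "batch" column are hypothetical, and parameter values are illustrative rather than the authors' benchmark settings.

```python
import scanpy as sc
import bbknn

adata = sc.read_h5ad("combined_atlases.h5ad")      # hypothetical merged dataset
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.pp.pca(adata)

# BBKNN replaces the usual sc.pp.neighbors() step: neighbours are found within
# each batch separately and then merged into a single balanced graph.
bbknn.bbknn(adata, batch_key="batch")

sc.tl.umap(adata)    # dimensionality-reduced visualisation on the batch-balanced graph
sc.tl.leiden(adata)  # clustering on the same graph
```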



2017 ◽  
Author(s):  
Ryan M. Layer ◽  
Brent S. Pedersen ◽  
Tonya DiSera ◽  
Gabor T. Marth ◽  
Jason Gertz ◽  
...  

GIGGLE is a genomics search engine that identifies and ranks the significance of shared genomic loci between query features and thousands of genome interval files. GIGGLE scales to billions of intervals, is faster (>1,000X) than existing methods, and its speed extends the accessibility and utility of resources such as ENCODE, Roadmap Epigenomics, and GTEx by facilitating data integration and hypothesis generation. GIGGLE is available at https://github.com/ryanlayer/giggle.
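
GIGGLE ships as a command-line tool; below is a hedged sketch of driving it from Python via subprocess. The index/search subcommands and flags follow my reading of the project README and may differ by version, so treat them as assumptions and confirm with `giggle --help`; all paths are hypothetical.

```python
import subprocess

# Build an index over a collection of bgzipped, sorted BED files. The glob is
# passed unexpanded because GIGGLE expects the pattern itself (quote it in a shell).
subprocess.run(
    ["giggle", "index", "-i", "tracks/*.bed.gz", "-o", "tracks_index", "-s"],
    check=True,
)

# Query the index with an interval file; -s requests significance-ranked output.
result = subprocess.run(
    ["giggle", "search", "-i", "tracks_index", "-q", "query.bed.gz", "-s"],
    check=True, capture_output=True, text=True,
)
print(result.stdout)
```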



Author(s):  
James Boyd ◽  
Anna Ferrante ◽  
Adrian Brown ◽  
Sean Randall ◽  
James Semmens

Objectives: While record linkage has become a strategic research priority within Australia and internationally, legal and administrative issues prevent data linkage in some situations due to privacy concerns. Even current best practices in record linkage carry some privacy risk, as they require the release of personally identifying information to trusted third parties. Applying record linkage systems that do not require the release of personal information can overcome the legal and privacy issues surrounding data integration. Current conceptual and experimental privacy-preserving record linkage (PPRL) models show promise in addressing data integration challenges but do not yet address all of the requirements for real-world operations. This paper aims to identify and address some of the challenges of operationalising PPRL frameworks.

Approach: Traditional linkage processes involve comparing personally identifying information (name, address, date of birth) on pairs of records to determine whether the records belong to the same person. Designing appropriate linkage strategies is an important part of the process. These are typically based on the analysis of data attributes (metadata) such as data completeness, consistency, constancy, and field discriminating power. Under a PPRL model, however, these factors cannot be discerned from the encrypted data, so an alternative approach is required. This paper explores methods for data profiling, blocking, weight/threshold estimation, and error detection within a PPRL framework.

Results: Probabilistic record linkage typically involves the estimation of weights and thresholds to optimise the linkage and ensure highly accurate results. The paper outlines the metadata requirements and automated methods necessary to collect data without compromising privacy. We present work undertaken to develop parameter estimation methods that can help optimise a linkage strategy without the release of personally identifiable information. These are required in all parts of the privacy-preserving record linkage process (pre-processing, standardising activities, linkage, grouping, and extracting).

Conclusions: PPRL techniques that operate on encrypted data have the potential for large-scale record linkage, performing both accurately and efficiently under experimental conditions. Our research has advanced the current state of PPRL with a framework for secure record linkage that can be implemented to improve and expand linkage service delivery while protecting an individual's privacy. However, more research is required to supplement this technique with additional elements to ensure the end-to-end method is practical and can be incorporated into real-world models.
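
The paper does not spell out its encoding scheme here, so the sketch below shows one widely used PPRL building block rather than the authors' method: identifiers are mapped to Bloom-filter bit positions via salted hashes of character bigrams, and the encodings are compared with the Dice coefficient, so approximate matching can proceed without exchanging cleartext identifiers. Function names, parameters, and the salt are illustrative.

```python
import hashlib

BLOOM_BITS = 256
NUM_HASHES = 4

def bigrams(value: str) -> set[str]:
    """Split a (padded, lower-cased) identifier into character bigrams."""
    padded = f"_{value.lower().strip()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def bloom_encode(value: str, salt: str = "shared-secret") -> set[int]:
    """Map each bigram to NUM_HASHES bit positions using salted SHA-256."""
    bits = set()
    for gram in bigrams(value):
        for k in range(NUM_HASHES):
            digest = hashlib.sha256(f"{salt}|{k}|{gram}".encode()).hexdigest()
            bits.add(int(digest, 16) % BLOOM_BITS)
    return bits

def dice_similarity(a: set[int], b: set[int]) -> float:
    """Dice coefficient on the set bit positions of two Bloom filters."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

# Example: the same person recorded with a small spelling variation.
enc1 = bloom_encode("catherine smith")
enc2 = bloom_encode("katherine smith")
print(f"Dice similarity: {dice_similarity(enc1, enc2):.2f}")  # high score -> likely match
```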



2021 ◽  
Author(s):  
Benjamin R Babcock ◽  
Astrid Kosters ◽  
Junkai Yang ◽  
Mackenzie L White ◽  
Eliver Ghosn

Single-cell RNA sequencing (scRNA-seq) can reveal accurate and sensitive RNA abundance in a single sample, but robust integration of multiple samples remains challenging. Large-scale scRNA-seq data generated by different workflows or laboratories can contain batch-specific systemic variation. Such variation challenges data integration by confounding sample-specific biology with undesirable batch-specific systemic effects. Therefore, there is a need for guidance in selecting computational and experimental approaches that minimize batch-specific impacts on data interpretation, and a need to empirically evaluate the sources of systemic variation in a given dataset. To uncover the contributions of experimental variables to systemic variation, we intentionally perturb four potential sources of batch effect in five human peripheral blood samples: sequencing replicate, sequencing depth, sample replicate, and the pooling of libraries for concurrent sequencing. To quantify the downstream effects of these variables on data interpretation, we introduce a new scoring metric, the Cell Misclassification Statistic (CMS), which identifies losses of cell-type fidelity that occur when merging datasets of different batches. CMS reveals an undesirable overcorrection by popular batch-effect correction and data integration methods. We show that optimizing gene expression matrix normalization and merging can reduce the need for batch-effect correction and minimize the risk of overcorrecting true biological differences between samples.
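
The abstract names the Cell Misclassification Statistic (CMS) but does not reproduce its definition, so the sketch below is only one plausible reading of the idea: score a merged, re-clustered dataset by the fraction of cells whose pre-merge cell-type annotation disagrees with the majority label of their post-merge cluster. The function name, labels, and toy data are hypothetical, not the authors' implementation.

```python
from collections import Counter

def cell_misclassification_rate(labels_before: list[str],
                                clusters_after: list[int]) -> float:
    """Assign each post-merge cluster its majority pre-merge label, then count
    the cells whose pre-merge label disagrees with that majority."""
    assert len(labels_before) == len(clusters_after)
    majority = {}
    for cluster in set(clusters_after):
        members = [lab for lab, c in zip(labels_before, clusters_after) if c == cluster]
        majority[cluster] = Counter(members).most_common(1)[0][0]
    mismatches = sum(lab != majority[c] for lab, c in zip(labels_before, clusters_after))
    return mismatches / len(labels_before)

# Toy example: over-correction collapses two cell types into a single cluster.
labels = ["T cell"] * 50 + ["B cell"] * 50
overcorrected = [0] * 100
print(cell_misclassification_rate(labels, overcorrected))  # 0.5 -> half the cells misclassified
```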



10.2196/34493 ◽  
2021 ◽  
Vol 23 (11) ◽  
pp. e34493
Author(s):  
Ieuan Clay ◽  
Christian Angelopoulos ◽  
Anne Lord Bailey ◽  
Aaron Blocker ◽  
Simona Carini ◽  
...  

Data integration, the process by which data are aggregated, combined, and made available for use, has been key to the development and growth of many technological solutions. In health care, we are experiencing a revolution in the use of sensors to collect data on patient behaviors and experiences. Yet the potential of these data to transform health outcomes is being held back. Deficits in standards, lexicons, data rights, permissioning, and security have been well documented; far less attention has been paid to the cultural adoption of sensor data integration as a priority for large-scale deployment and impact on patient lives. The use and reuse of trustworthy data to make better and faster decisions across drug development and care delivery will require an understanding of all stakeholder needs and of the best practices that ensure these needs are met. The Digital Medicine Society is launching a new multistakeholder Sensor Data Integration Tour of Duty to address these challenges and more, providing a clear direction on how sensor data can fulfill its potential to enhance patient lives.


