CLARITY: comparing heterogeneous data using dissimilarity

2021, Vol 8 (12)
Author(s): Daniel J. Lawson, Vinesh Solanki, Igor Yanovich, Johannes Dellert, Damian Ruck, ...

Integrating datasets from different disciplines is hard because the data are often qualitatively different in meaning, scale and reliability. When two datasets describe the same entities, many scientific questions can be phrased around whether the (dis)similarities between entities are conserved across such different data. Our method, CLARITY, quantifies consistency across datasets, identifies where inconsistencies arise and aids in their interpretation. We illustrate this using three diverse comparisons: gene methylation versus expression, evolution of language sounds versus word use, and country-level economic metrics versus cultural beliefs. The non-parametric approach is robust to noise and differences in scaling, and makes only weak assumptions about how the data were generated. It operates by decomposing similarities into two components: a ‘structural’ component analogous to a clustering, and an underlying ‘relationship’ between those structures. This allows a ‘structural comparison’ between two similarity matrices using their predictability from ‘structure’. Significance is assessed using re-sampling appropriate to each dataset. The software, CLARITY, is available as an R package from github.com/danjlawson/CLARITY.
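The core idea, predicting one similarity matrix from the ‘structure’ of another and inspecting the residuals, can be sketched numerically. The following Python toy illustrates only that idea; it is not the CLARITY package's API (CLARITY itself is an R package), and every name below is invented:

```python
import numpy as np

def structural_residuals(A, B, k):
    """Predict similarity matrix B from the top-k eigenstructure of A.

    Large residuals flag entity pairs whose (dis)similarity is not
    conserved between the two datasets."""
    # 'Structure' of A: its leading k eigenvectors (analogous to a clustering).
    vals, vecs = np.linalg.eigh(A)
    U = vecs[:, np.argsort(-np.abs(vals))[:k]]
    # Projector onto that structure; the fit of B within it is the 'relationship'.
    P = U @ U.T
    B_hat = P @ B @ P
    return B - B_hat                       # residuals = structural inconsistency

rng = np.random.default_rng(0)
# Two noisy similarity matrices sharing the same 2-block structure.
blocks = np.kron(np.eye(2), np.ones((5, 5)))
A = blocks + 0.05 * rng.standard_normal((10, 10))
B = 2 * blocks + 0.05 * rng.standard_normal((10, 10))
A, B = (A + A.T) / 2, (B + B.T) / 2        # symmetrise for eigh

R = structural_residuals(A, B, k=2)        # small residuals: structure conserved
```

Because B here is just a rescaled copy of A's block structure, the residual matrix stays near zero; replacing B with a differently clustered matrix would light up the inconsistent pairs.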

2019, Vol 20 (1)
Author(s): Benjamin Ulfenborg

Abstract
Background: Studies on multiple modalities of omics data such as transcriptomics, genomics and proteomics are growing in popularity, since they allow us to investigate complex mechanisms across molecular layers. It is widely recognized that integrative omics analysis holds the promise to unlock novel and actionable biological insights into health and disease. Integration of multi-omics data remains challenging, however, and requires the combination of several software tools and extensive technical expertise to account for the properties of heterogeneous data.
Results: This paper presents the miodin R package, which provides a streamlined workflow-based syntax for multi-omics data analysis. The package allows users to perform analysis of omics data either across experiments on the same samples (vertical integration), or across studies on the same variables (horizontal integration). Workflows have been designed to promote transparent data analysis and reduce the technical expertise required to perform low-level data import and processing.
Conclusions: The miodin package is implemented in R and is freely available for use and extension under the GPL-3 license. Package source, reference documentation and user manual are available at https://gitlab.com/algoromics/miodin.
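The two integration directions can be pictured with toy tables. This is a hypothetical Python/pandas sketch of the concepts only, with invented table and column names; miodin itself is an R package with its own workflow syntax:

```python
import pandas as pd

# Vertical integration: different assays measured on the SAME samples.
rna  = pd.DataFrame({"sample": ["s1", "s2"], "TP53_expr": [5.1, 7.3]})
prot = pd.DataFrame({"sample": ["s1", "s2"], "TP53_abund": [0.8, 1.9]})
vertical = rna.merge(prot, on="sample")       # one row per sample, all layers

# Horizontal integration: the SAME variables measured in different studies.
study1 = pd.DataFrame({"sample": ["s1", "s2"], "TP53_expr": [5.1, 7.3]})
study2 = pd.DataFrame({"sample": ["s3", "s4"], "TP53_expr": [6.0, 6.5]})
horizontal = pd.concat([study1, study2], ignore_index=True)  # samples stack up
```

Vertical integration widens the table (more molecular layers per sample); horizontal integration lengthens it (more samples per variable).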


2020
Author(s): Yuping Lu, Charles A. Phillips, Michael A. Langston

Abstract
Objective: Bipartite graphs are widely used to model relationships between pairs of heterogeneous data types. Maximal bicliques are foundational structures in such graphs, and their enumeration is an important task in systems biology, epidemiology and many other problem domains. Thus, there is a need for an efficient, general-purpose, publicly available tool to enumerate maximal bicliques in bipartite graphs. The statistical programming language R is a logical choice for such a tool, but until now no R package has existed for this purpose. Our objective is to provide such a package, so that the research community can more easily perform this computationally demanding task.
Results: Biclique is an R package that takes a bipartite graph as input and produces a listing of all maximal bicliques in this graph. Input and output formats are straightforward, with examples provided both in this paper and in the package documentation. Biclique employs a state-of-the-art algorithm previously developed for basic research in functional genomics. This package, along with its source code and reference manual, is freely available from the CRAN public repository at https://cran.r-project.org/web/packages/biclique/index.html.
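For intuition, the definition can be checked by brute force on tiny graphs. The Python sketch below enumerates maximal bicliques by consensus over left-vertex subsets; it is exponential-time and purely illustrative, whereas the Biclique package implements a far more efficient algorithm:

```python
from itertools import combinations

def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a bipartite graph given as
    {left_vertex: set_of_right_neighbours}. Brute-force sketch."""
    left = list(adj)
    found = set()
    for r in range(1, len(left) + 1):
        for subset in combinations(left, r):
            # Right side: vertices adjacent to every member of the subset.
            right = set.intersection(*(adj[u] for u in subset))
            if not right:
                continue
            # Close on the left: every vertex adjacent to all of `right`.
            closed = frozenset(u for u in left if right <= adj[u])
            found.add((closed, frozenset(right)))
    return found

g = {"a": {1, 2}, "b": {2, 3}}
bicliques = maximal_bicliques(g)
# Three maximal bicliques: ({a},{1,2}), ({b},{2,3}), ({a,b},{2})
```

Taking the closure on the left guarantees each recorded pair cannot be extended on either side, which is exactly the maximality condition.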


2021, Vol 15 (12), pp. e0009954
Author(s): Andrés F. Miranda-Arboleda, Ezequiel José Zaidel, Rachel Marcus, María Jesús Pinazo, Luis Eduardo Echeverría, ...

Background: Chagas disease (CD) is endemic in Latin America; however, its spread to nontropical areas has raised global interest in this condition. Barriers in access to early diagnosis and treatment of both acute and chronic infection and their complications have led to an increasing disease burden outside of Latin America. Our goal was to identify those barriers and to analyze them further, based on the Inter-American Society of Cardiology (SIAC) and the World Heart Federation (WHF) Chagas Roadmap, at a country level in Argentina, Colombia, Spain, and the United States, which serve as representatives of endemic and nonendemic countries.
Methodology and principal findings: This is a nonsystematic review of articles published in indexed journals from 1955 to 2021 and of gray literature (local health organization guidelines, local policies, blogs, and media). We classified barriers to access to care as (i) existing difficulties limiting healthcare access; (ii) lack of awareness about CD and its complications; (iii) poor transmission control (vectorial and nonvectorial); (iv) scarce availability of antitrypanosomal drugs; and (v) cultural beliefs and stigma. Region-specific barriers may limit the implementation of roadmaps and require the application of tailored strategies to improve access to appropriate care.
Conclusions: Multiple barriers negatively impact the prognosis of CD. Identification of these roadblocks both nationally and globally is important to guide development of appropriate policies and public health programs to reduce the global burden of this disease.


F1000Research, 2021, Vol 10, pp. 538
Author(s): Tyrone Chen, Al J Abadi, Kim-Anh Lê Cao, Sonika Tyagi

Data from the multiple omics layers of a biological system are growing in quantity, heterogeneity and dimensionality. Simultaneous multi-omics data integration is a growing field of research, as it has strong potential to unlock information on previously hidden biological relationships, leading to early diagnosis, prognosis and expedited treatments. Many tools for multi-omics data integration are being developed. However, these tools are often restricted to highly specific experimental designs and types of omics data. While some general methods do exist, they require specific data formats and experimental conditions. A major limitation in the field is the lack of a single- or multi-omics pipeline that can accept data in an unrefined, information-rich form pre-integration and subsequently generate output for further investigation. There is an increasing demand for a generic multi-omics pipeline to facilitate general-purpose data exploration and analysis of heterogeneous data. Therefore, we present our R multi-omics pipeline, an easy-to-use and flexible pipeline that takes unrefined multi-omics data, sample information and user-specified parameters as input and generates a list of output plots and data tables for quality control and downstream analysis. We have demonstrated application of the pipeline on two separate COVID-19 case studies. We enabled limited checkpointing, in which intermediate output is staged to allow continuation after errors or interruptions in the pipeline, and the pipeline generates a script for reproducing the analysis to improve reproducibility. Seamless integration with the mixOmics R package is achieved, as the R data object can be loaded and manipulated with mixOmics functions. Our pipeline can be installed as an R package or from the git repository, and is accompanied by detailed documentation with walkthroughs of two case studies. The pipeline is also available as Docker and Singularity containers.
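The checkpointing idea, staging intermediate output so a pipeline can resume rather than recompute, is generic. A minimal Python sketch of that idea under invented names, not the pipeline's actual mechanism:

```python
import os
import pickle
import tempfile

def checkpointed(step, path):
    """Run step() unless a staged result already exists at path."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)          # resume from the staged output
    result = step()
    with open(path, "wb") as f:
        pickle.dump(result, f)             # stage output for interrupted re-runs
    return result

calls = []
def normalise():                           # stands in for an expensive step
    calls.append(1)
    return [0.1, 0.9]

path = os.path.join(tempfile.mkdtemp(), "normalise.pkl")
out1 = checkpointed(normalise, path)       # computes and stages
out2 = checkpointed(normalise, path)       # reuses the staged file; no recompute
```

On the second call the expensive step is skipped entirely, which is what makes continuation after an interruption cheap.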


2019, Vol 35 (17), pp. 3143-3145
Author(s): Kevin Matlock, Raziur Rahman, Souparno Ghosh, Ranadip Pal

Abstract
Summary: Biological processes are characterized by a variety of different genomic feature sets. However, when building models, portions of these features are often missing for a subset of the dataset. We provide a modeling framework to effectively integrate this type of heterogeneous data to improve prediction accuracy. To test our methodology, we have stacked data from the Cancer Cell Line Encyclopedia to increase the accuracy of drug sensitivity prediction. The package addresses the dynamic regime of information integration involving sequential addition of features and samples.
Availability and implementation: The framework has been implemented as an R package, Sstack, which can be downloaded from https://cran.r-project.org/web/packages/Sstack/index.html, where further explanation of the package is available.
Supplementary information: Supplementary data are available at Bioinformatics online.
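The "sequential addition of features" setting can be illustrated with plain least squares: fit a base model on the always-present feature block, then fit a correction on only those samples where the extra block was measured. This numpy sketch conveys the idea only; it is not Sstack's stacking algorithm, and all names are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
X1 = rng.standard_normal((n, 3))           # feature block 1: all samples
X2 = rng.standard_normal((n, 2))           # feature block 2: partly missing
y = X1 @ np.array([1.0, -2.0, 0.5]) + X2 @ np.array([3.0, 1.0])
observed2 = np.arange(n) < 30              # block 2 missing for 10 samples

# Base model on the complete block, fit to every sample.
w1, *_ = np.linalg.lstsq(X1, y, rcond=None)
# Correction model on block 2, fit only where that block is observed.
w2, *_ = np.linalg.lstsq(X2[observed2], (y - X1 @ w1)[observed2], rcond=None)

# Combined prediction: add the block-2 correction only where available.
pred = X1 @ w1 + np.where(observed2, X2 @ w2, 0.0)
```

Samples lacking block 2 still get a usable base-model prediction, while samples with the extra block benefit from the fitted correction.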


Author(s): Florian Schneider, David Fichtmüller, Martin Gossner, Anton Güntsch, Malte Jochum, ...

Trait-based research spans from evolutionary studies of individual-level properties to global patterns of biodiversity and ecosystem functioning. An increasing amount of trait data is available for many different organism groups, published as open-access data on a variety of file hosting services. However, standardization between datasets is generally lacking due to heterogeneous data formats and types, and the compilation of these published data into centralised databases remains a difficult and time-consuming task. We reviewed existing trait databases and online services, as well as initiatives for trait data standardization. Together with data providers and users participating in a large long-term observation project on multiple taxa and research questions (the Biodiversity Exploratories, www.biodiversity-exploratories.de), we identified a need for a minimal trait-data terminology that is flexible enough to include traits from all types of organisms but simple enough to be adopted by different research communities. In order to facilitate reproducibility of analyses, the reuse of data and the combination of datasets from multiple sources, we propose a standardized vocabulary for trait data, the Ecological Trait-data Standard Vocabulary (ETS, hosted on the GFBio Terminology Service, https://terminologies.gfbio.org/terms/ets/pages), which builds upon and is compatible with existing ontologies. By relying on unambiguous identifiers, the proposed minimal vocabulary for trait data captures the different degrees of resolution and measurement detail for multiple use cases of trait-based research. It further encourages the use of global Uniform Resource Identifiers (URIs) for taxa and trait definitions, methods and units, thereby readying the data publication for the semantic web. An accompanying R package (traitdataform) facilitates the upload of data to hosting services and also simplifies access to published trait data.
While originating from a current need in ecological research, the described products are, as a next step, being developed for a seamless fit with broader initiatives on biodiversity data standardisation, to foster better linkage of ecological trait data with global e-infrastructures for biological data. The ETS is maintained, and discussions of terms are managed, via GitHub (https://github.com/EcologicalTraitData/ETS).
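The reshaping that such a standardized vocabulary enables can be sketched in a few lines: a "wide" provider table becomes one measurement per row with explicit trait-name and unit columns. The column names below are merely in the spirit of ETS, not its exact terms, and the sketch uses Python/pandas although the accompanying package (traitdataform) is in R:

```python
import pandas as pd

# A 'wide' trait table, as a data provider might publish it.
wide = pd.DataFrame({
    "species": ["Carabus auratus", "Apis mellifera"],
    "body_length_mm": [22.0, 12.0],
    "wing_development": ["reduced", "full"],
})

# Long format: one trait measurement per row, trait name made explicit.
long = wide.melt(id_vars="species",
                 var_name="traitName", value_name="traitValue")
# Units become data rather than being buried in column names.
long["traitUnit"] = long["traitName"].map({"body_length_mm": "mm"})
```

In the long form, datasets from different providers can be concatenated directly, because every measurement carries its own trait identifier and unit.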


PeerJ, 2019, Vol 7, pp. e6918
Author(s): Granville J. Matheson

Neuroimaging, in addition to many other fields of clinical research, is both time-consuming and expensive, and recruitable patients can be scarce. These constraints limit the possibility of large-sample experimental designs, and often lead to statistically underpowered studies. This problem is exacerbated by the use of outcome measures whose accuracy is sometimes insufficient to answer the scientific questions posed. Reliability is usually assessed in validation studies using healthy participants; however, these results are often not easily applicable to clinical studies examining different populations. I present a new method and tools for using summary statistics from previously published test-retest studies to approximate the reliability of outcomes in new samples. In this way, the feasibility of a new study can be assessed during the planning stages, before collecting any new data. An R package called relfeas also accompanies this article for performing these calculations. In summary, these methods and tools will allow researchers to avoid performing costly studies which are, by virtue of their design, unlikely to yield informative conclusions.
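One standard way to perform such an extrapolation is to hold the published study's measurement-error variance fixed and recompute reliability under the new sample's expected spread. The Python sketch below illustrates that logic; it is an assumption-laden illustration (relfeas is an R package, and this is not necessarily its exact calculation):

```python
def extrapolated_icc(icc_published, sd_published, sd_new):
    """Approximate the ICC expected in a new sample, assuming the
    error variance carries over unchanged from the published study."""
    error_var = (sd_published ** 2) * (1 - icc_published)
    total_var_new = sd_new ** 2
    return (total_var_new - error_var) / total_var_new

# Published test-retest study: ICC = 0.80 with between-subject SD = 10.
# A clinical sample with twice the spread should show higher reliability...
high = extrapolated_icc(0.80, 10.0, 20.0)
# ...while a more homogeneous sample shows lower reliability.
low = extrapolated_icc(0.80, 10.0, 6.0)
```

This captures the key planning insight of the abstract: the same measurement can be adequately reliable in one population and inadequate in another, purely because of differences in between-subject variance.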


Author(s): Fabian Schmich, Jack Kuipers, Gunter Merdes, Niko Beerenwinkel

Abstract
In the post-genomic era of big data in biology, computational approaches to integrating multiple heterogeneous data sets are becoming increasingly important. Despite the availability of large amounts of omics data, the prioritisation of genes relevant for a specific functional pathway based on genetic screening experiments remains a challenging task. Here, we introduce netprioR, a probabilistic generative model for semi-supervised integrative prioritisation of hit genes. The model integrates multiple network data sets representing gene–gene similarities and prior knowledge about gene functions from the literature with gene-based covariates, such as phenotypes measured in genetic perturbation screens, for example, by RNA interference or CRISPR/Cas9. We evaluate netprioR on simulated data and show that the model outperforms current state-of-the-art methods in many scenarios and is on par otherwise. In an application to real biological data, we integrate 22 network data sets, 1784 prior knowledge class labels and 3840 RNA interference phenotypes in order to prioritise novel regulators of Notch signalling in Drosophila melanogaster. The biological relevance of our predictions is evaluated using in silico and in vivo experiments. An efficient implementation of netprioR is available as an R package at http://bioconductor.org/packages/netprioR.
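The flavour of network-based prioritisation can be conveyed with simple label propagation over a gene-gene similarity graph: known hits diffuse their labels to nearby unlabelled genes. This Python sketch is a stand-in illustration only; netprioR's actual model is a probabilistic generative one that additionally integrates covariates:

```python
import numpy as np

def propagate_labels(W, prior, n_iter=50, alpha=0.8):
    """Score genes by diffusing prior hit labels over a similarity
    network (simple label propagation with restart)."""
    d = W.sum(axis=1)
    P = W / np.where(d[:, None] > 0, d[:, None], 1.0)   # row-normalised walk
    score = prior.astype(float).copy()
    for _ in range(n_iter):
        # Mix neighbourhood average with the original prior labels.
        score = alpha * (P @ score) + (1 - alpha) * prior
    return score

# 5 genes on a path-like network; gene 0 is the only known hit.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
prior = np.array([1, 0, 0, 0, 0], dtype=float)
scores = propagate_labels(W, prior)
# Unlabelled genes close to the hit (1, 2) outrank the distant genes (3, 4).
```

Ranking unlabelled genes by these scores is the prioritisation step; the semi-supervised aspect is that only a few genes carry prior labels while the network informs all of them.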



2018
Author(s): Granville J. Matheson

Abstract
Positron emission tomography (PET), along with many other fields of clinical research, is both time-consuming and expensive, and recruitable patients can be scarce. These constraints limit the possibility of large-sample experimental designs, and often lead to statistically underpowered studies. This problem is exacerbated by the use of outcome measures whose accuracy is sometimes insufficient to answer the scientific questions posed. Reliability is usually assessed in validation studies using healthy participants; however, these results are often not easily applicable to clinical studies examining different populations. I present a new method and tools for using summary statistics from previously published test-retest studies to approximate the reliability of outcomes in new samples. In this way, the feasibility of a new study can be assessed during the planning stages, before collecting any new data. An R package called relfeas also accompanies this article for performing these calculations. In summary, these methods and tools will allow researchers to avoid performing costly studies which are, by virtue of their design, unlikely to yield informative conclusions.

