BMC Bioinformatics | ScienceGate

Aristotle: stratified causal discovery for omics data

BMC Bioinformatics ◽ 10.1186/s12859-021-04521-w ◽ 2022 ◽ Vol 23 (1)

Author(s): Mehrdad Mansouri ◽ Sahand Khakabimamaghani ◽ Leonid Chindelevitch ◽ Martin Ester

Keyword(s): Synthetic Data ◽ Causal Analysis ◽ Causal Discovery ◽ Simultaneous Increase ◽ Biological Knowledge ◽ Omics Data ◽ Metabolomics Data ◽ Widespread Application ◽ Anthracycline Cardiotoxicity ◽ Stratification Method

Abstract

Background: There has been a simultaneous increase in demand and accessibility across genomics, transcriptomics, proteomics and metabolomics data, collectively known as omics data. This has encouraged widespread application of omics data in the life sciences, from personalized medicine to the discovery of the underlying pathophysiology of diseases. Causal analysis of omics data may provide important insight into the underlying biological mechanisms. Existing causal analysis methods yield promising results when identifying potential general causes of an observed outcome based on omics data. However, they may fail to discover causes that are specific to one stratum of individuals and absent from others.

Methods: To fill this gap, we introduce the problem of stratified causal discovery and propose a method, Aristotle, for solving it. Aristotle addresses two challenges intrinsic to omics data: high dimensionality and hidden stratification. It employs existing biological knowledge and a state-of-the-art patient stratification method to tackle these challenges, and applies a quasi-experimental design method to each stratum to find stratum-specific potential causes.

Results: Evaluation on synthetic data shows that Aristotle discovers true causes better than existing causal discovery methods under different conditions. Experiments on a real dataset on anthracycline cardiotoxicity indicate that Aristotle’s predictions are consistent with the existing literature. Moreover, Aristotle makes additional predictions that suggest further investigations.

Assigning protein function from domain-function associations using DomFun

BMC Bioinformatics ◽ 10.1186/s12859-022-04565-6 ◽ 2022 ◽ Vol 23 (1)

Author(s): Elena Rojano ◽ Fernando M. Jabato ◽ James R. Perkins ◽ José Córdoba-Caballero ◽ Federico García-Criado ◽ ...

Keyword(s): Protein Function ◽ Protein Function Prediction ◽ Function Prediction ◽ Evaluation Procedure ◽ Simpson Index ◽ Go Annotation ◽ Link Type ◽ Pathway Prediction ◽ Almost All ◽ Stouffer's Method

Abstract

Background: Protein function prediction remains a key challenge. Domain composition affects protein function. Here we present DomFun, a Ruby gem that uses associations between protein domains and functions, calculated using multiple indices based on tripartite network analysis. These domain-function associations are combined at the protein level to generate protein-function predictions.

Results: We analysed 16 tripartite networks connecting homologous superfamily and FunFam domains from CATH-Gene3D with functional annotations from the three Gene Ontology (GO) sub-ontologies, KEGG, and Reactome. We validated the results using the CAFA 3 benchmark platform for GO annotation, finding that out of the multiple association metrics and domain datasets tested, the Simpson index for FunFam domain-function associations combined with Stouffer’s method leads to the best performance in almost all scenarios. We also found that using FunFams led to better performance than superfamilies, and that better results were obtained for GO molecular function than for GO biological process terms. DomFun performed as well as the highest-performing method in certain CAFA 3 evaluation procedures in terms of $$F_{max}$$ and $$S_{min}$$. We also implemented our own benchmark procedure, Pathway Prediction Performance (PPP), which can be used to validate function prediction for additional annotation sources, such as KEGG and Reactome. Using PPP, we found results for GO similar to those found with CAFA 3, and good performance for the other annotation sources. As with CAFA 3, the Simpson index with Stouffer’s method led to the top performance in almost all scenarios.

Conclusions: DomFun shows competitive performance with other methods evaluated in CAFA 3 when predicting protein function with GO, although results vary depending on the evaluation procedure. Through our own benchmark procedure, PPP, we have shown it can also make accurate predictions for KEGG and Reactome. It performs best when using FunFams, combining Simpson index derived domain-function associations using Stouffer’s method. The tool has been implemented so that it can be easily adapted to incorporate other protein features, such as domain data from other sources, amino acid k-mers and motifs. The DomFun Ruby gem is available from https://rubygems.org/gems/DomFun. Code maintained at https://github.com/ElenaRojano/DomFun. Validation procedure scripts can be found at https://github.com/ElenaRojano/DomFun_project.
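The two ingredients named in the DomFun abstract, the Simpson index and Stouffer’s method, can be illustrated with a minimal Python sketch. This is not DomFun’s Ruby implementation: the function names, the use of sets of protein identifiers, and the assumption that each domain contributes a one-sided p-value are illustrative choices, since the abstract does not say how association scores are turned into the values that are combined.

```python
from math import sqrt
from statistics import NormalDist


def simpson_index(proteins_with_domain: set, proteins_with_function: set) -> float:
    """Simpson (overlap) index: shared items divided by the smaller set size."""
    if not proteins_with_domain or not proteins_with_function:
        return 0.0
    shared = len(proteins_with_domain & proteins_with_function)
    return shared / min(len(proteins_with_domain), len(proteins_with_function))


def stouffer_combine(p_values: list) -> float:
    """Unweighted Stouffer combination of one-sided p-values.

    Each p-value must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises StatisticsError.
    """
    norm = NormalDist()
    # Convert each p-value to a z-score, then sum and rescale by sqrt(k).
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    return 1.0 - norm.cdf(combined_z)


# Hypothetical example: one domain and one function annotated on a few proteins.
domain_proteins = {"P1", "P2", "P3"}
function_proteins = {"P2", "P3", "P4", "P5"}
association = simpson_index(domain_proteins, function_proteins)

# Hypothetical p-values from two domains of the same protein for one function.
protein_level_p = stouffer_combine([0.05, 0.05])
```

In this sketch two moderately significant domain-level p-values reinforce each other, which is the intuition behind combining domain-function evidence at the protein level.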

ECCsplorer: a pipeline to detect extrachromosomal circular DNA (eccDNA) from next-generation sequencing data

BMC Bioinformatics ◽ 10.1186/s12859-021-04545-2 ◽ 2022 ◽ Vol 23 (1)

Author(s): Ludwig Mann ◽ Kathrin M. Seibt ◽ Beatrice Weber ◽ Tony Heitkam

Keyword(s): Next Generation Sequencing ◽ Transposable Elements ◽ Data Availability ◽ Next Generation Sequencing Data ◽ Next Generation ◽ Sequencing Data ◽ Bioinformatics Pipeline ◽ Circular Dna ◽ Wide Range ◽ Generation Sequencing

Abstract

Background: Extrachromosomal circular DNAs (eccDNAs) are ring-like DNA structures that are physically separated from the chromosomes and range in size from 100 bp to several megabase pairs. Apart from carrying tandemly repeated DNA, eccDNAs may also harbor extra copies of genes or recently activated transposable elements. As eccDNAs occur in all eukaryotes investigated so far and likely play roles in stress, cancer, and aging, they have been prime targets in recent research, although their investigation has been limited by the scarcity of computational tools.

Results: Here, we present the ECCsplorer, a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing techniques. Following Illumina sequencing of amplified circular DNA (circSeq), the ECCsplorer enables an easy and automated discovery of eccDNA candidates. The data analysis encompasses two major procedures. First, read mapping to the reference genome allows the detection of informative read distributions including high coverage, discordant mapping, and split reads. Second, reference-free comparison of read clusters from amplified eccDNA against control sample data reveals specifically enriched DNA circles. Both software parts can be run separately or jointly, depending on the individual aim or data availability. To illustrate the wide applicability of our approach, we analyzed semi-artificial and published circSeq data from the model organisms Homo sapiens and Arabidopsis thaliana, and generated circSeq reads from the non-model crop plant Beta vulgaris. We clearly identified eccDNA candidates from all datasets, with and without reference genomes. The ECCsplorer pipeline specifically detected mitochondrial mini-circles and retrotransposon activation, showcasing the ECCsplorer’s sensitivity and specificity.

Conclusion: The ECCsplorer (available online at https://github.com/crimBubble/ECCsplorer) is a bioinformatics pipeline to detect eccDNAs in any kind of organism or tissue using next-generation sequencing data. The derived eccDNA targets are valuable for a wide range of downstream investigations, from the analysis of cancer-related eccDNAs and organelle genomics to the identification of active transposable elements.

MGcount: a total RNA-seq quantification tool to address multi-mapping and multi-overlapping alignments ambiguity in non-coding transcripts

BMC Bioinformatics ◽ 10.1186/s12859-021-04544-3 ◽ 2022 ◽ Vol 23 (1)

Author(s): Andrea Hita ◽ Gilles Brocart ◽ Ana Fernandez ◽ Marc Rehmsmeier ◽ Anna Alemany ◽ ...

Keyword(s): Rna Sequencing ◽ Genomic Region ◽ Simultaneous Estimation ◽ Rna Seq ◽ Protein Coding ◽ Total Rna ◽ Simultaneous Study ◽ Downstream Analysis ◽ And Function ◽ Genomic Locations

Abstract

Background: Total-RNA sequencing (total-RNA-seq) allows the simultaneous study of both the coding and the non-coding transcriptome. Yet, computational pipelines have traditionally focused on particular biotypes, making assumptions that are not fulfilled by total-RNA-seq datasets. Transcripts from distinct RNA biotypes vary in length, biogenesis, and function, can overlap in a genomic region, and may be present in the genome with a high copy number. Consequently, reads from total-RNA-seq libraries may cause ambiguous genomic alignments, which demand flexible quantification approaches.

Results: Here we present Multi-Graph count (MGcount), a total-RNA-seq quantification tool combining two strategies for handling ambiguous alignments. First, MGcount assigns reads hierarchically to small-RNA and long-RNA features to account for length disparity when transcripts overlap in the same genomic position. Next, MGcount uses a graph-based approach to aggregate RNA products with similar sequences to which reads systematically multi-map. MGcount outputs a transcriptomic count matrix compatible with RNA-sequencing downstream analysis pipelines, at both bulk and single-cell resolution, together with the graphs that model repeated transcript structures for different biotypes. The software can be used as a Python module or as a single-file executable program.

Conclusions: MGcount is a flexible total-RNA-seq quantification tool that successfully integrates reads that align to multiple genomic locations or that overlap with multiple gene features. Its approach is suitable for the simultaneous estimation of protein-coding, long non-coding and small non-coding transcript concentration, in both precursor and processed forms. Both source code and compiled software are available at https://github.com/hitaandrea/MGcount.
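The graph-based aggregation step described above can be pictured with a small Python sketch. It is an illustration only and does not use the MGcount API: features become nodes, two features are linked whenever a read aligns to both, and connected components (found here with a union-find structure) stand in for whatever grouping algorithm MGcount actually applies, which the abstract does not specify. The function name `group_and_count` and the input layout are hypothetical.

```python
from collections import defaultdict


def group_and_count(read_hits: dict) -> dict:
    """Group features linked by multi-mapping reads and count reads per group.

    read_hits maps a read identifier to the list of features it aligns to.
    Returns a dict from a frozenset of feature names to a read count.
    """
    parent = {}

    def find(node):
        # Find the representative of a node, compressing the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every pair of features that share at least one read.
    for features in read_hits.values():
        for feature in features:
            find(feature)
        for other in features[1:]:
            union(features[0], other)

    # Collect the members of each connected component.
    members = defaultdict(set)
    for feature in parent:
        members[find(feature)].add(feature)

    # Assign each read once to the group that contains its features.
    counts = defaultdict(int)
    for features in read_hits.values():
        if features:
            counts[frozenset(members[find(features[0])])] += 1
    return dict(counts)


# Hypothetical alignments: r1 multi-maps to two similar snoRNA copies.
hits = {
    "r1": ["snoRNA_A", "snoRNA_B"],
    "r2": ["snoRNA_B"],
    "r3": ["miR_1"],
    "r4": ["snoRNA_A"],
}
group_counts = group_and_count(hits)
```

Counting each read once per group avoids both discarding multi-mapping reads and counting them several times, at the cost of reporting a group of similar features rather than a single locus.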

LuxRep: a technical replicate-aware method for bisulfite sequencing data analysis

BMC Bioinformatics ◽ 10.1186/s12859-021-04546-1 ◽ 2022 ◽ Vol 23 (1)

Author(s): Maia H. Malonzo ◽ Viivi Halla-aho ◽ Mikko Konki ◽ Riikka J. Lund ◽ Harri Lähdesmäki

Keyword(s): Dna Methylation ◽ Bisulfite Sequencing ◽ Probabilistic Method ◽ Computation Time ◽ Bisulfite Conversion ◽ Sequencing Data ◽ Whole Genome Analysis ◽ Dna Libraries ◽ Conversion Rates ◽ Bisulfite Sequencing Data

Abstract

Background: DNA methylation is commonly measured using bisulfite sequencing (BS-seq). The quality of a BS-seq library is measured by its bisulfite conversion efficiency. Libraries with low conversion rates are typically excluded from analysis, resulting in reduced coverage and increased costs.

Results: We have developed a probabilistic method and software, LuxRep, that implements a general linear model and simultaneously accounts for technical replicates (libraries from the same biological sample) from different bisulfite-converted DNA libraries. Using simulations and actual DNA methylation data, we show that including technical replicates with low bisulfite conversion rates generates more accurate estimates of methylation levels and differentially methylated sites. Moreover, using variational inference reduces the computation time necessary for whole genome analysis.

Conclusions: In this work we show that taking into account technical replicates (i.e. libraries) of BS-seq data of varying bisulfite conversion rates, with their corresponding experimental parameters, improves methylation level estimation and differential methylation detection.
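Why a low conversion rate biases methylation estimates, and why modelling it can rescue a library, can be shown with a deliberately simplified point estimate in Python. This is not LuxRep’s probabilistic model. It assumes that methylated cytosines are never converted, that unmethylated cytosines are converted with probability `conversion_rate`, and that sequencing errors are negligible, so the probability of reading a C at a site with methylation level θ is θ·c + (1 − c). The function name is hypothetical.

```python
def corrected_methylation(c_reads: int, total_reads: int, conversion_rate: float) -> float:
    """Point estimate of the methylation level corrected for incomplete conversion.

    Simplifying assumptions (not LuxRep's full model): methylated cytosines are
    never converted and sequencing errors are ignored.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0.0 < conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be in (0, 1]")
    observed = c_reads / total_reads
    # Invert observed = theta * c + (1 - c) for theta.
    estimate = (observed - (1.0 - conversion_rate)) / conversion_rate
    # Sampling noise can push the estimate outside [0, 1]; clip it back.
    return min(1.0, max(0.0, estimate))


# Hypothetical site: 55 of 100 reads show C in a library with 90% conversion.
naive = 55 / 100
corrected = corrected_methylation(55, 100, 0.90)
```

The naive proportion overstates methylation because unconverted, unmethylated cytosines are also read as C; the corrected value removes that excess under the stated assumptions.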

Topology preserving stratification of tissue neoplasticity using Deep Neural Maps and microRNA signatures

BMC Bioinformatics ◽ 10.1186/s12859-022-04559-4 ◽ 2022 ◽ Vol 23 (1)

Author(s): Emily Kaczmarek ◽ Jina Nanayakkara ◽ Alireza Sedghi ◽ Mehran Pesteie ◽ Thomas Tuschl ◽ ...

Keyword(s): Deep Learning ◽ Expression Profiles ◽ Treatment Selection ◽ Cancer Classification ◽ Common Disease ◽ Cancer Tissue ◽ Topological Map ◽ Tissue Samples ◽ Learning Framework ◽ Tissue Of Origin

Abstract

Background: Accurate cancer classification is essential for correct treatment selection and better prognostication. microRNAs (miRNAs) are small RNA molecules that negatively regulate gene expression, and their dysregulation is a common disease mechanism in many cancers. Through a clearer understanding of miRNA dysregulation in cancer, improved mechanistic knowledge and better treatments can be sought.

Results: We present a topology-preserving deep learning framework to study miRNA dysregulation in cancer. Our study comprises miRNA expression profiles from 3685 cancer and non-cancer tissue samples and hierarchical annotations on organ and neoplasticity status. Using unsupervised learning, a two-dimensional topological map is trained to cluster similar tissue samples. Labelled samples are used after training to assess clustering accuracy in terms of tissue-of-origin and neoplasticity status. In addition, an approach using activation gradients is developed to determine the attention of the networks to miRNAs that drive the clustering. Using this deep learning framework, we classify the neoplasticity status of held-out test samples with an accuracy of 91.07%, the tissue-of-origin with 86.36%, and combined neoplasticity status and tissue-of-origin with an accuracy of 84.28%. The topological maps display the ability of miRNAs to recognize tissue types and neoplasticity status. Importantly, when our approach identifies samples that do not cluster well with their respective classes, activation gradients provide further insight into cancer subtypes or grades.

Conclusions: An unsupervised deep learning approach is developed for cancer classification and interpretation. This work provides an intuitive approach for understanding molecular properties of cancer and has significant potential for cancer classification and treatment selection.

Aird: a computation-oriented mass spectrometry data format enables a higher compression ratio and less decoding time

BMC Bioinformatics ◽ 10.1186/s12859-021-04490-0 ◽ 2022 ◽ Vol 23 (1)

Author(s): Miaoshan Lu ◽ Shaowei An ◽ Ruimin Wang ◽ Jinyin Wang ◽ Changbin Yu

Keyword(s): Mass Spectrometry ◽ Data Storage ◽ High Speed ◽ Lossless Compression ◽ Mass Spectrometry Data ◽ Compression Rate ◽ Search Performance ◽ Data Format ◽ Link Type ◽ Decoding Speed

Abstract

Background: As the precision of mass spectrometry (MS) increases, MS file sizes grow rapidly. Beyond the widely used open format mzML, near-lossless and lossless compression algorithms and formats have emerged for scenarios with different precision requirements. The data precision is often related to the instrument and the subsequent processing algorithms. Unlike storage-oriented formats, which focus more on the lossless compression rate, computation-oriented formats concentrate as much on decoding speed as on the compression rate.

Results: Here we introduce “Aird”, an open-source, computation-oriented format with controllable precision, flexible indexing strategies, and a high compression rate. Aird provides a novel compressor called Zlib-Diff-PforDelta (ZDPD) for m/z data. Compared with Zlib alone, the m/z data size is on average about 55% lower in Aird. With the high-speed decoding and encoding performance of the single instruction multiple data technology used in ZDPD, Aird takes only 33% of the decoding time required by Zlib. We downloaded seven datasets from ProteomeXchange and Metabolights, acquired on different SCIEX, Thermo, and Agilent instruments. We then converted the raw data into the mzML, mgf, and mz5 file formats with MSConvert and compared them with the Aird format. Aird uses JavaScript Object Notation for metadata storage. Aird-SDK is written in Java, and AirdPro is a GUI client for vendor file conversion written in C#. They are freely available at https://github.com/CSi-Studio/Aird-SDK and https://github.com/CSi-Studio/AirdPro.

Conclusions: With the innovation of MS acquisition modes, MS data characteristics are also constantly changing. New data features can enable more effective compression methods and new index modes to achieve high search performance. MS data storage will also become more specialized and customized. ZDPD uses multiple MS digital features, and researchers can also use it in other formats such as mzML. Aird is designed to become a computing-oriented data format with high scalability, a high compression rate, and fast decoding speed.
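The general idea behind combining fixed precision, differencing and Zlib for m/z arrays can be sketched in Python with the standard library. This is not ZDPD and does not reproduce the Aird file layout: the PforDelta and SIMD stages are omitted, and the function names, the 64-bit little-endian packing and the default precision of 1e-4 are assumptions made for illustration.

```python
import struct
import zlib


def encode_mz(mz_values: list, precision: float = 1e-4) -> bytes:
    """Quantize m/z values, delta-encode the integers, then compress with zlib."""
    ints = [round(mz / precision) for mz in mz_values]
    deltas = []
    previous = 0
    for value in ints:
        # Within a spectrum m/z values are sorted, so deltas stay small.
        deltas.append(value - previous)
        previous = value
    raw = struct.pack(f"<{len(deltas)}q", *deltas)
    return zlib.compress(raw)


def decode_mz(blob: bytes, precision: float = 1e-4) -> list:
    """Reverse encode_mz: decompress, undo the deltas, rescale to m/z."""
    raw = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(raw) // 8}q", raw)
    values = []
    running = 0
    for delta in deltas:
        running += delta
        values.append(running * precision)
    return values


# Hypothetical sorted m/z values from one spectrum.
spectrum = [100.0012, 100.5034, 101.0001, 250.75]
restored = decode_mz(encode_mz(spectrum))
```

The round trip is exact up to the chosen precision, which mirrors the "controllable precision" idea: the user trades a bounded rounding error for integers that difference and compress well.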

baredSC: Bayesian approach to retrieve expression distribution of single-cell data

BMC Bioinformatics ◽ 10.1186/s12859-021-04507-8 ◽ 2022 ◽ Vol 23 (1)

Author(s): Lucille Lopez-Delisle ◽ Jean-Baptiste Delisle

Keyword(s): Single Cell ◽ Bayesian Approach ◽ Genetic Interaction ◽ Gaussian Mixture ◽ Two Dimensions ◽ Biological Data ◽ Specific Gene ◽ Trimodal Distribution ◽ Embryonic Limb ◽ Cell Data

Abstract

Background: The number of studies using single-cell RNA sequencing (scRNA-seq) is constantly growing. This powerful technique provides a sampling of the whole transcriptome of a cell. However, sparsity of the data can be a major hurdle when studying the distribution of the expression of a specific gene or the correlation between the expression of two genes.

Results: We show that the main technical noise associated with these scRNA-seq experiments is due to the sampling, i.e., Poisson noise. We present a new tool named baredSC, for Bayesian Approach to Retrieve Expression Distribution of Single-Cell data, which infers the intrinsic expression distribution in scRNA-seq data using a Gaussian mixture model. baredSC can be used to obtain the distribution in one dimension for individual genes and in two dimensions for pairs of genes, in particular to estimate the correlation between the expression of the two genes. We apply baredSC to simulated scRNA-seq data and show that the algorithm is able to uncover the expression distribution used to simulate the data, even in multi-modal cases with very sparse data. We also apply baredSC to two real biological data sets. First, we use it to measure the anti-correlation between Hoxd13 and Hoxa11, two genes with a known genetic interaction in the embryonic limb. Then, we study the expression of Pitx1 in the embryonic hindlimb, for which a trimodal distribution has been identified through flow cytometry. While other methods to analyze scRNA-seq data are too sensitive to sampling noise, baredSC reveals this trimodal distribution.

Conclusion: baredSC is a powerful tool which aims at retrieving the expression distribution of a few genes of interest from scRNA-seq data.
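The sampling (Poisson) noise discussed above can be made concrete with a small forward simulation in Python. It illustrates the data-generating assumption only and is not baredSC’s inference procedure: the two-component Gaussian mixture on the expressed fraction, the mixture parameters, the fixed number of detected molecules per cell and all function names are hypothetical choices for this sketch.

```python
import math
import random


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson variate with Knuth's multiplication method (small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_cells(n_cells: int, molecules_per_cell: int, seed: int = 0) -> list:
    """Return (true_fraction, observed_count) pairs for one gene.

    The true fraction of a cell's transcripts coming from the gene follows a
    two-component Gaussian mixture; the observed count is a Poisson draw whose
    mean is that fraction times the number of detected molecules.
    """
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        if rng.random() < 0.5:
            fraction = max(0.0, rng.gauss(1e-4, 2e-5))   # low-expression mode
        else:
            fraction = max(0.0, rng.gauss(5e-4, 1e-4))   # high-expression mode
        observed = poisson_sample(fraction * molecules_per_cell, rng)
        cells.append((fraction, observed))
    return cells


cells = simulate_cells(n_cells=2000, molecules_per_cell=2000, seed=1)
zero_fraction = sum(1 for _, count in cells if count == 0) / len(cells)
```

Even though almost every simulated cell expresses the gene, a large share of observed counts is zero, so the bimodal structure is hard to see in the raw counts. That gap between intrinsic expression and sampled counts is what a model-based approach has to bridge.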

Knowledge graph analytics platform with LINCS and IDG for Parkinson's disease target illumination

BMC Bioinformatics ◽ 10.1186/s12859-021-04530-9 ◽ 2022 ◽ Vol 23 (1)

Author(s): Jeremy J. Yang ◽ Christopher R. Gessner ◽ Joel L. Duerksen ◽ Daniel Biber ◽ Jessica L. Binder ◽ ...

Keyword(s): Parkinson’S Disease ◽ Human Health ◽ Molecular Basis ◽ Drug Target ◽ Drug Targets ◽ Complex Diseases ◽ Knowledge Graph ◽ Graph Analytics ◽ Novel Drug ◽ Novel Drug Targets

Abstract

Background: LINCS, "Library of Integrated Network-based Cellular Signatures", and IDG, "Illuminating the Druggable Genome", are both NIH projects and consortia that have generated rich datasets for the study of the molecular basis of human health and disease. LINCS L1000 expression signatures provide unbiased systems/omics experimental evidence. IDG provides compiled and curated knowledge for illumination and prioritization of novel drug target hypotheses. Together, these resources can support a powerful new approach to identifying novel drug targets for complex diseases, such as Parkinson's disease (PD), which continues to inflict severe harm on human health and to resist traditional research approaches.

Results: Integrating LINCS and IDG, we built the Knowledge Graph Analytics Platform (KGAP) to support an important use case: identification and prioritization of drug target hypotheses for associated diseases. The KGAP approach includes strong semantics interpretable by domain scientists and a robust, high-performance implementation of a graph database and related analytical methods. Illustrating the value of our approach, we investigated results from queries relevant to PD. Approved PD drug indications from IDG’s resource DrugCentral were used as starting points for evidence paths exploring chemogenomic space via LINCS expression signatures for associated genes, evaluated as target hypotheses by integration with IDG. The KG-analytic scoring function was validated against a gold standard dataset of genes associated with PD as elucidated, published mechanism-of-action drug targets, also from DrugCentral. IDG's resource TIN-X was used to rank and filter KGAP results for novel PD targets, and one, SYNGR3 (Synaptogyrin-3), was manually investigated further as a case study and plausible new drug target for PD.

Conclusions: The synergy of LINCS and IDG, via KG methods, empowers graph analytics methods for the investigation of the molecular basis of complex diseases, and specifically for identification and prioritization of novel drug targets. The KGAP approach enables downstream applications via integration with resources similarly aligned with modern KG methodology. The generality of the approach indicates that KGAP is applicable to many disease areas in addition to PD, the focus of this paper.

ChromoMap: an R package for interactive visualization of multi-omics data and annotation of chromosomes

BMC Bioinformatics ◽ 10.1186/s12859-021-04556-z ◽ 2022 ◽ Vol 23 (1)

Author(s): Lakshay Anand ◽ Carlos M. Rodriguez Lopez

Keyword(s): High Throughput Sequencing ◽ R Package ◽ Genomic Feature ◽ Omics Data ◽ Protein Coding ◽ Homologous Chromosomes ◽ Single Function ◽ A Genome ◽ Interactive Visualizations ◽ Living Organisms

Abstract

Background: Recent advancements in high-throughput sequencing have resulted in the availability of annotated genomes, as well as of multi-omics data, for many living organisms. This has increased the need for graphic tools that allow the concurrent visualization of genomes and feature-associated multi-omics data on single publication-ready plots.

Results: We present chromoMap, an R package developed for the construction of interactive visualizations of chromosomes/chromosomal regions, the mapping of any chromosomal feature with known coordinates (e.g., protein-coding genes, transposable elements, non-coding RNAs, microsatellites), and the display of chromosomal regional characteristics (e.g., genomic feature density, gene expression, DNA methylation, chromatin modifications) of organisms with a genome assembly. ChromoMap can also integrate multi-omics data (genomics, transcriptomics and epigenomics) in relation to their occurrence across chromosomes. ChromoMap takes tab-delimited files (BED-like) or, alternatively, R objects to specify the genomic coordinates of the chromosomes and elements to annotate. Rendered chromosomes are composed of continuous windows of a given range, which, on hover, display detailed information about the elements annotated within that range. By adjusting the parameters of a single function, users can generate a variety of plots that can be saved either as static images or as HTML documents.

Conclusions: ChromoMap’s flexibility allows for concurrent visualization of genomic data in each strand of a given chromosome, or of more than one homologous chromosome, allowing the comparison of multi-omics data between genotypes (e.g., species, varieties) or between homologous chromosomes of phased diploid/polyploid genomes. chromoMap is an extensive tool that can potentially be used in various bioinformatics analysis pipelines for genomic visualization of multi-omics data.

BMC Bioinformatics
Latest Publications

Published By Springer (Biomed Central Ltd.)
