Priors for Genotyping Polyploids

Bioinformatics ◽

10.1093/bioinformatics/btz852 ◽

2019 ◽

Author(s):

David Gerard ◽

Luís Felipe Ventorim Ferrão

Keyword(s):

Empirical Bayes ◽

A Priori ◽

Real Data ◽

R Package ◽

Complete Characterization ◽

Supplementary Information ◽

Genotype Distribution ◽

Systematic Biases ◽

Technical Artifacts

Abstract Motivation Empirical Bayes techniques to genotype polyploid organisms usually either (i) assume technical artifacts are known a priori or (ii) estimate technical artifacts simultaneously with the prior genotype distribution. Case (i) is unappealing as it places the onus on the researcher to estimate these artifacts, or to ensure that there are no systematic biases in the data. However, as we demonstrate with a few empirical examples, case (ii) makes choosing the class of prior genotype distributions extremely important. Choosing a class that is either too flexible or too restrictive results in poor genotyping performance. Results We propose two classes of prior genotype distributions that are of intermediate levels of flexibility: the class of proportional normal distributions and the class of unimodal distributions. We provide a complete characterization of and optimization details for the class of unimodal distributions. We demonstrate, using both simulated and real data, that using these classes results in superior genotyping performance. Availability and implementation Genotyping methods that use these priors are implemented in the updog R package available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. All code needed to reproduce the results of this paper is available on GitHub: https://github.com/dcgerard/reproduce_prior_sims. Supplementary information Supplementary data are available at Bioinformatics online.
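As a rough illustration of the two prior classes named in the abstract, the Python sketch below builds a prior over genotypes 0..K whose mass is proportional to a normal density and checks whether a probability vector is unimodal. The function names, the parameterization by `mu` and `sigma`, and the tolerance are assumptions made for this example; this is not the updog implementation and may differ from the paper's exact definitions.

```python
import math


def proportional_normal_prior(ploidy, mu, sigma):
    """Prior over genotypes 0..ploidy with mass proportional to a normal density.

    mu and sigma are illustrative parameters (assumed names, not from the paper).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Evaluate the unnormalized normal density at each integer genotype.
    weights = [math.exp(-0.5 * ((k - mu) / sigma) ** 2) for k in range(ploidy + 1)]
    total = sum(weights)
    # Normalize so the probabilities sum to one.
    return [w / total for w in weights]


def is_unimodal(probs, tol=1e-12):
    """Return True if probs weakly rise to a single peak and then weakly fall."""
    falling = False
    for prev, cur in zip(probs, probs[1:]):
        if cur < prev - tol:
            falling = True          # we have passed the peak
        elif cur > prev + tol and falling:
            return False            # a second rise after falling: not unimodal
    return True


# Example: a tetraploid (ploidy 4) has five possible genotypes, 0 through 4.
prior = proportional_normal_prior(ploidy=4, mu=1.5, sigma=1.0)
```

Under these assumptions, every proportional normal prior passes the `is_unimodal` check, while the unimodal class also admits shapes that no choice of `mu` and `sigma` can produce.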

Priors for Genotyping Polyploids

10.1101/751784 ◽

2019 ◽

Author(s):

David Gerard ◽

Luís Felipe Ventorim Ferrão

Keyword(s):

Empirical Bayes ◽

A Priori ◽

Real Data ◽

R Package ◽

Complete Characterization ◽

Genotype Distribution ◽

Link Type ◽

Systematic Biases ◽

Technical Artifacts

Abstract Motivation Empirical Bayes techniques to genotype polyploid organisms usually either (i) assume technical artifacts are known a priori or (ii) estimate technical artifacts simultaneously with the prior genotype distribution. Case (i) is unappealing as it places the onus on the researcher to estimate these artifacts, or to ensure that there are no systematic biases in the data. However, as we demonstrate with a few empirical examples, case (ii) makes choosing the class of prior genotype distributions extremely important. Choosing a class that is either too flexible or too restrictive results in poor genotyping performance. Results We propose two classes of prior genotype distributions that are of intermediate levels of flexibility: the class of proportional normal distributions and the class of unimodal distributions. We provide a complete characterization of and optimization details for the class of unimodal distributions. We demonstrate, using both simulated and real data, that using these classes results in superior genotyping performance. Availability and Implementation Genotyping methods that use these priors are implemented in the updog R package available on the Comprehensive R Archive Network: https://cran.r-project.org/package=updog. All code needed to reproduce the results of this paper is available on GitHub: https://github.com/dcgerard/[email protected]

Detection of differentially methylated CpG sites between tumor samples with uneven tumor purities

Bioinformatics ◽

10.1093/bioinformatics/btz885 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2017-2024

Author(s):

Weiwei Zhang ◽

Ziyi Li ◽

Nana Wei ◽

Hua-Jun Wu ◽

Xiaoqi Zheng

Keyword(s):

Real Data ◽

R Package ◽

Differential Methylation ◽

Least Square ◽

Epigenetic Mechanism ◽

Supplementary Information ◽

CpG Sites ◽

Tumor Purity ◽

Different Sources ◽

Normal Controls

Abstract Motivation Inference of differentially methylated (DM) CpG sites between two groups of tumor samples with different genotypes or phenotypes is a critical step to uncover the epigenetic mechanism of tumorigenesis and to identify biomarkers for cancer subtyping. However, as a major confounding factor, uneven distributions of tumor purity between the two groups of tumor samples will lead to biased discovery of DM sites if not properly accounted for. Results We here propose InfiniumDM, a generalized least square model to adjust for the tumor purity effect in differential methylation analysis. Our method is applicable to a variety of experimental designs, including those with or without normal controls and those with different sources of normal tissue contamination. We compared our method with conventional methods including minfi, limma and limma corrected by tumor purity using simulated datasets. Our method shows significantly better performance at different differential methylation thresholds, sample sizes, mean purity deviations and so on. We also applied the proposed method to breast cancer samples from the TCGA database to further evaluate its performance. Overall, both simulation and real data analyses demonstrate favorable performance over existing methods serving a similar purpose. Availability and implementation InfiniumDM is a part of the R package InfiniumPurify, which is freely available from GitHub (https://github.com/Xiaoqizheng/InfiniumPurify). Supplementary information Supplementary data are available at Bioinformatics online.

multiHiCcompare: joint normalization and comparative analysis of complex Hi-C experiments

Bioinformatics ◽

10.1093/bioinformatics/btz048 ◽

2019 ◽

Vol 35 (17) ◽

pp. 2916-2923 ◽

Cited By ~ 15

Author(s):

John C Stansfield ◽

Kellen G Cresswell ◽

Mikhail G Dozmorov

Keyword(s):

Comparative Analysis ◽

A Priori ◽

Three Dimensional ◽

R Package ◽

Supplementary Information ◽

Chromatin Interaction ◽

Model Framework ◽

Chromatin Interactions ◽

Loess Regression ◽

Sequencing Studies

Abstract Motivation With the development of chromatin conformation capture technology and its high-throughput derivative Hi-C sequencing, studies of the three-dimensional interactome of the genome that involve multiple Hi-C datasets are becoming available. To account for the technology-driven biases unique to each dataset, there is a distinct need for methods to jointly normalize multiple Hi-C datasets. Previous attempts at removing biases from Hi-C data have made use of techniques which normalize individual Hi-C datasets, or, at best, jointly normalize two datasets. Results Here, we present multiHiCcompare, a cyclic loess regression-based joint normalization technique for removing biases across multiple Hi-C datasets. In contrast to other normalization techniques, it properly handles the Hi-C-specific decay of chromatin interaction frequencies with the increasing distance between interacting regions. multiHiCcompare uses the general linear model framework for comparative analysis of multiple Hi-C datasets, adapted for the Hi-C-specific decay of chromatin interaction frequencies. multiHiCcompare outperforms other methods when detecting a priori known chromatin interaction differences from jointly normalized datasets. Applied to the analysis of auxin-treated versus untreated experiments, and CTCF depletion experiments, multiHiCcompare was able to recover the expected epigenetic and gene expression signatures of loss of chromatin interactions and reveal novel insights. Availability and implementation multiHiCcompare is freely available on GitHub and as a Bioconductor R package https://bioconductor.org/packages/multiHiCcompare. Supplementary information Supplementary data are available at Bioinformatics online.

Bayesian wavelet de-noising with the caravan prior

ESAIM Probability and Statistics ◽

10.1051/ps/2019019 ◽

2019 ◽

Vol 23 ◽

pp. 947-978

Author(s):

Shota Gugushvili ◽

Frank van der Meulen ◽

Moritz Schauer ◽

Peter Spreij

Keyword(s):

Empirical Bayes ◽

Expert Knowledge ◽

Estimation Error ◽

A Priori ◽

Real Data ◽

Detailed Comparison ◽

Wavelet Coefficients ◽

Visual Appearance ◽

Bayesian Wavelet ◽

Clustering Patterns

According to both domain expert knowledge and empirical evidence, wavelet coefficients of real signals tend to exhibit clustering patterns, in that they contain connected regions of coefficients of similar magnitude (large or small). A wavelet de-noising approach that takes into account such a feature of the signal may in practice outperform other, more vanilla methods, both in terms of the estimation error and visual appearance of the estimates. Motivated by this observation, we present a Bayesian approach to wavelet de-noising, where dependencies between neighbouring wavelet coefficients are a priori modelled via a Markov chain-based prior, that we term the caravan prior. Posterior computations in our method are performed via the Gibbs sampler. Using representative synthetic and real data examples, we conduct a detailed comparison of our approach with a benchmark empirical Bayes de-noising method (due to Johnstone and Silverman). We show that the caravan prior fares well and is therefore a useful addition to the wavelet de-noising toolbox.

powmic: an R package for power assessment in microbiome case–control studies

Bioinformatics ◽

10.1093/bioinformatics/btaa197 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3563-3565

Author(s):

Li Chen

Keyword(s):

Power Analysis ◽

Real Data ◽

Analytical Form ◽

R Package ◽

Case Control ◽

Supplementary Information ◽

Metagenomic Sequencing ◽

Case Control Studies ◽

Simulation Based ◽

Over Dispersion

Abstract Summary Power analysis is essential to decide the sample size of metagenomic sequencing experiments in a case–control study for identifying differentially abundant (DA) microbes. However, the complexity of microbial data characteristics, such as excessive zeros, over-dispersion, compositionality, intrinsically microbial correlations and variable sequencing depths, makes the power analysis particularly challenging because the analytical form is usually unavailable. Here, we develop a simulation-based power assessment strategy and R package powmic, which considers the complexity of microbial data characteristics. A real data example demonstrates the usage of powmic. Availability and implementation powmic R package and online tutorial are available at https://github.com/lichen-lab/powmic. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Analytical modelling of period spacings across the HR diagram

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stz2582 ◽

2019 ◽

Vol 490 (1) ◽

pp. 909-926 ◽

Cited By ~ 5

Author(s):

M S Cunha ◽

P P Avelino ◽

J Christensen-Dalsgaard ◽

D Stello ◽

M Vrard ◽

...

Keyword(s):

Mixed Mode ◽

Chemical Elements ◽

A Priori ◽

Real Data ◽

Relative Difference ◽

Model Data ◽

Analytical Expression ◽

Chemical Gradients ◽

Red Giant

ABSTRACT The characterization of stellar cores may be accomplished through the modelling of asteroseismic data from stars exhibiting either gravity-mode or mixed-mode pulsations, potentially shedding light on the physical processes responsible for the production, mixing, and segregation of chemical elements. In this work, we validate against model data an analytical expression for the period spacing that will facilitate the inference of the properties of stellar cores, including the detection and characterization of buoyancy glitches (strong chemical gradients). This asymptotically based analytical expression is tested both in models with and without buoyancy glitches. It does not assume that glitches are small and, consequently, predicts non-sinusoidal glitch-induced period-spacing variations, as often seen in model and real data. We show that the glitch position and width inferred from the fitting of the analytical expression to model data consisting of pure gravity modes are in close agreement (typically better than 7 per cent relative difference) with the properties measured directly from the stellar models. In the case of fitting mixed-mode model data, the same expression is shown to reproduce well the numerical results, when the glitch properties are known a priori. In addition, the fits performed to mixed-mode model data reveal a frequency dependence of the coupling coefficient, q, for a moderate-luminosity red-giant-branch model star. Finally, we find that fitting the analytical expression to the mixed-mode period spacings may provide a way to infer the frequencies of the pure acoustic dipole modes that would exist if no coupling took place between acoustic and gravity waves.

A latent unknown clustering integrating multi-omics data (LUCID) with phenotypic traits

Bioinformatics ◽

10.1093/bioinformatics/btz667 ◽

2019 ◽

Vol 36 (3) ◽

pp. 842-850 ◽

Cited By ~ 4

Author(s):

Cheng Peng ◽

Jun Wang ◽

Isaac Asante ◽

Stan Louie ◽

Ran Jin ◽

...

Keyword(s):

Real Data ◽

R Package ◽

Integrative Model ◽

Supplementary Information ◽

Phenotypic Traits ◽

Omics Data ◽

Data Types ◽

Specific Effects ◽

Metabolomic Data ◽

Future Prediction

Abstract Motivation Epidemiologic, clinical and translational studies are increasingly generating multiplatform omics data. Methods that can integrate across multiple high-dimensional data types while accounting for differential patterns are critical for uncovering novel associations and underlying relevant subgroups. Results We propose an integrative model to estimate latent unknown clusters (LUCID) aiming to both distinguish unique genomic, exposure and informative biomarkers/omic effects while jointly estimating subgroups relevant to the outcome of interest. Simulation studies indicate that we can obtain consistent estimates reflective of the true simulated values, accurately estimate subgroups and recapitulate subgroup-specific effects. We also demonstrate the use of the integrated model for future prediction of risk subgroups and phenotypes. We apply this approach to two real data applications to highlight the integration of genomic, exposure and metabolomic data. Availability and Implementation The LUCID method is implemented through the LUCIDus R package available on CRAN (https://CRAN.R-project.org/package=LUCIDus). Supplementary information Supplementary materials are available at Bioinformatics online.

scRNABatchQC: multi-samples quality control for single cell RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz601 ◽

2019 ◽

Vol 35 (24) ◽

pp. 5306-5308

Author(s):

Qi Liu ◽

Quanhu Sheng ◽

Jie Ping ◽

Marisol Adelina Ramirez ◽

Ken S Lau ◽

...

Keyword(s):

Single Cell ◽

R Package ◽

Supplementary Information ◽

RNA Seq ◽

Technical Artifact ◽

Multiple Sample ◽

Systematic Biases ◽

Cell Transcriptome ◽

Single Cell Transcriptome ◽

Spurious Results

Abstract Summary Single cell RNA sequencing is a revolutionary technique to characterize inter-cellular transcriptomic heterogeneity. However, the data are noise-prone because gene expression is often driven by both technical artifacts and genuine biological variations. Proper disentanglement of these two effects is critical to prevent spurious results. While several tools exist to detect and remove low-quality cells in a single cell RNA-seq dataset, there is a lack of approaches for examining consistency between sample sets and detecting systematic biases, batch effects and outliers. We present scRNABatchQC, an R package to compare multiple sample sets simultaneously over numerous technical and biological features, which gives valuable hints for distinguishing technical artifacts from biological variations. scRNABatchQC helps identify and systematically characterize sources of variability in single cell transcriptome data. The examination of consistency across datasets allows visual detection of biases and outliers. Availability and implementation scRNABatchQC is freely available at https://github.com/liuqivandy/scRNABatchQC as an R package. Supplementary information Supplementary data are available at Bioinformatics online.

scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

RNA Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract Motivation Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information Supplementary data are available at Bioinformatics online.

SPARSim single cell: a count data simulator for scRNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz752 ◽

2019 ◽

Cited By ~ 2

Author(s):

Giacomo Baruzzo ◽

Ilaria Patuzzi ◽

Barbara Di Camillo

Keyword(s):

Single Cell ◽

Count Data ◽

Simulated Data ◽

Real Data ◽

R Package ◽

Supplementary Information ◽

RNA Seq ◽

Distribution Of Zeros ◽

New Methods ◽

Research Fields

Abstract Motivation Single cell RNA-seq (scRNA-seq) count data show many differences compared with bulk RNA-seq count data, making the application of many RNA-seq pre-processing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for handling scRNA-seq count data is currently one of the most active research fields in bioinformatics. To help the development of such new methods, the availability of simulated data could play a pivotal role. However, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity with real data. Results In this article we present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim can generate count data that resemble real data in terms of count intensity, variability and sparsity, performing comparably to or better than one of the most used scRNA-seq simulators, Splat. In particular, SPARSim simulated count matrices closely resemble the distribution of zeros across different expression intensities observed in real count data. Availability and implementation SPARSim R package is freely available at http://sysbiobig.dei.unipd.it/?q=SPARSim and at https://gitlab.com/sysbiobig/sparsim. Supplementary information Supplementary data are available at Bioinformatics online.
