geck: trio-based comparative benchmarking of variant calls

Mapping Intimacies ◽

10.1101/208116 ◽

2017 ◽

Cited By ~ 3

Author(s):

Péter Kómár ◽

Deniz Kural

Keyword(s):

Statistical Analysis ◽

Mixture Model ◽

Variant Calling ◽

Supplementary Information ◽

Genotype Data ◽

Diverse Populations ◽

High Confidence ◽

Supplementary Material ◽

Related Individuals ◽

Statistical Mixture Model

Motivation Classical methods of comparing the accuracies of variant calling pipelines are based on truth sets of variants whose genotypes are previously determined with high confidence. An alternative way of performing benchmarking is based on Mendelian constraints between related individuals. Statistical analysis of Mendelian violations can provide truth set-independent benchmarking information, and enable benchmarking less-studied variants and diverse populations. Results We introduce a statistical mixture model for comparing two variant calling pipelines from genotype data they produce after running on individual members of a trio. We determine the accuracy of our model by comparing the precision and recall of GATK Unified Genotyper and Haplotype Caller on the high-confidence SNPs of the NIST Ashkenazim trio and the two independent Platinum Genome trios. We show that our method is able to estimate differential precision and recall between the two pipelines with 10^-3 uncertainty. Availability The Python library geck, and usage examples are available at the following URL: https://github.com/sbg/[email protected] Supplementary information Supplementary materials are available at bioRxiv.
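The Mendelian constraint that this abstract builds on can be illustrated with a short sketch. It is not the geck API and it does not implement the mixture model; it only counts Mendelian violations, assuming diploid, biallelic, autosomal sites with genotypes encoded as alternate-allele counts (0, 1 or 2). All function names are hypothetical.

```python
from itertools import product


def transmissible(genotype):
    """Alt-allele counts (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]


def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can arise from one allele of each parent."""
    return any(
        a + b == child
        for a, b in product(transmissible(father), transmissible(mother))
    )


def violation_rate(trios):
    """Fraction of (father, mother, child) genotype triples that violate
    Mendelian inheritance. Returns 0.0 for an empty input."""
    trios = list(trios)
    if not trios:
        return 0.0
    violations = sum(not is_mendelian_consistent(*t) for t in trios)
    return violations / len(trios)


if __name__ == "__main__":
    calls = [(0, 0, 0), (0, 0, 1), (1, 1, 2), (2, 2, 0)]
    print(violation_rate(calls))
```

A raw violation rate like this cannot tell which family member was miscalled, which is presumably why the abstract turns to a statistical model rather than simple counting.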

16GT: a fast and sensitive variant caller using a 16-genotype probabilistic model

10.1101/111393 ◽

2017 ◽

Cited By ~ 3

Author(s):

Ruibang Luo ◽

Michael C. Schatz ◽

Steven L. Salzberg

Keyword(s):

Probabilistic Model ◽

Variant Calling ◽

Supplementary Information ◽

Link Type ◽

Indel Calling ◽

Supplementary Material ◽

Calling Algorithm

Abstract Summary 16GT is a variant caller for Illumina WGS and WES germline data. It uses a new 16-genotype probabilistic model to unify SNP and indel calling in a single variant calling algorithm. In benchmark comparisons with five other widely used variant callers on a modern 36-core server, 16GT ran faster and demonstrated improved sensitivity in calling SNPs, and it provided comparable sensitivity and accuracy in calling indels as compared to the GATK HaplotypeCaller. Availability and implementation https://github.com/aquaskyline/[email protected] Supplementary information Supplementary tables and notes are available at Bioinformatics online.

MyelinJ: an ImageJ macro for high throughput analysis of myelinating cultures

Bioinformatics ◽

10.1093/bioinformatics/btz403 ◽

2019 ◽

Vol 35 (21) ◽

pp. 4528-4530 ◽

Cited By ~ 1

Author(s):

Michael J Whitehead ◽

George A McCanney ◽

Hugh J Willison ◽

Susan C Barnett

Keyword(s):

Statistical Analysis ◽

High Throughput ◽

Supplementary Information ◽

Local Contrast ◽

High Throughput Analysis ◽

Throughput Analysis ◽

Neurite Density ◽

Supplementary Material ◽

User Friendly

Abstract Summary MyelinJ is a free user friendly ImageJ macro for high throughput analysis of fluorescent micrographs such as 2D-myelinating cultures and statistical analysis using R. MyelinJ can analyse single images or complex experiments with multiple conditions, where the ggpubr package in R is automatically used for statistical analysis and the production of publication quality graphs. The main outputs are percentage (%) neurite density and % myelination. % neurite density is calculated using the normalize local contrast algorithm, followed by thresholding, to adjust for differences in intensity. For % myelination the myelin sheaths are selected using the Frangi vesselness algorithm, in conjunction with a grey scale morphology filter and the removal of cell bodies using a high intensity mask. MyelinJ uses a simple graphical user interface and user name system for reproducibility and sharing that will be useful to the wider scientific community that study 2D-myelination in vitro. Availability and implementation MyelinJ is freely available at https://github.com/BarnettLab/MyelinJ. For statistical analysis the freely available R and the ggpubr package are also required. MyelinJ has a user guide (Supplementary Material) and has been tested on both Windows (Windows 10) and Mac (High Sierra) operating systems. Supplementary information Supplementary data are available at Bioinformatics online.

tHapMix: simulating tumour samples through haplotype mixtures

10.1101/057414 ◽

2016 ◽

Author(s):

Sergii Ivakhno ◽

Camilla Colombo ◽

Stephen Tanner ◽

Philip Tedder ◽

Stefano Berri ◽

...

Keyword(s):

Copy Number ◽

Large Scale ◽

Variant Calling ◽

Copy Number Variant ◽

Supplementary Information ◽

Genome Diversity ◽

Simulation Framework ◽

Somatic Genome ◽

Copy Number Changes ◽

Sequencing Platforms

Abstract Motivation Large-scale rearrangements and copy number changes combined with different modes of clonal evolution create extensive somatic genome diversity, making it difficult to develop versatile and scalable variant calling tools and create well-calibrated benchmarks. Results We developed a new simulation framework tHapMix that enables the creation of tumour samples with different ploidy, purity and polyclonality features. It easily scales to simulation of hundreds of somatic genomes, while re-use of real read data preserves noise and biases present in sequencing platforms. We further demonstrate tHapMix utility by creating a simulated set of 140 somatic genomes and showing how it can be used in training and testing of somatic copy number variant calling tools. Availability and implementation tHapMix is distributed under an open source license and can be downloaded from https://github.com/Illumina/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.
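To make the roles of purity and ploidy concrete, the following sketch computes the share of sequencing reads expected to come from tumour cells in a mixed sample. This is the standard DNA-content argument, not tHapMix's implementation; the function names are hypothetical, and a single tumour clone with one average ploidy is assumed.

```python
def tumour_read_fraction(purity, tumour_ploidy, normal_ploidy=2.0):
    """Expected fraction of reads originating from tumour cells.

    purity: fraction of cells in the sample that are tumour cells (0..1).
    tumour_ploidy: average copy number of the tumour genome.
    normal_ploidy: average copy number of the normal genome (2 for diploid).
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    if tumour_ploidy <= 0 or normal_ploidy <= 0:
        raise ValueError("ploidy must be positive")
    tumour_dna = purity * tumour_ploidy
    normal_dna = (1.0 - purity) * normal_ploidy
    return tumour_dna / (tumour_dna + normal_dna)


def split_read_budget(total_reads, purity, tumour_ploidy):
    """Split a read budget into (tumour, normal) read counts for mixing."""
    n_tumour = round(total_reads * tumour_read_fraction(purity, tumour_ploidy))
    return n_tumour, total_reads - n_tumour


if __name__ == "__main__":
    # A 50% pure tetraploid tumour contributes two thirds of the DNA.
    print(split_read_budget(900, purity=0.5, tumour_ploidy=4.0))
```

Because a high-ploidy tumour contributes more DNA per cell, the read fraction exceeds the cellular purity, which is one reason simulated mixtures need both parameters.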

PathScore: a web tool for identifying altered pathways in cancer data

10.1101/067090 ◽

2016 ◽

Cited By ~ 2

Author(s):

Stephen G. Gaffney ◽

Jeffrey P. Townsend

Keyword(s):

Web Application ◽

Somatic Mutations ◽

Supplementary Information ◽

Web Tool ◽

Cancer Data ◽

Link Type ◽

Novel Approach ◽

Supplementary Material ◽

User Friendly ◽

Pathway Effect

Abstract Summary PathScore quantifies the level of enrichment of somatic mutations within curated pathways, applying a novel approach that identifies pathways enriched across patients. The application provides several user-friendly, interactive graphic interfaces for data exploration, including tools for comparing pathway effect sizes, significance, gene-set overlap and enrichment differences between projects. Availability and Implementation Web application available at pathscore.publichealth.yale.edu. Site implemented in Python and MySQL, with all major browsers supported. Source code available at github.com/sggaffney/pathscore with a GPLv3 [email protected] Supplementary Information Additional documentation can be found at http://pathscore.publichealth.yale.edu/faq.

Palaeolatitudinal distribution of the Ediacaran macrobiota

Journal of the Geological Society ◽

10.1144/jgs2021-030 ◽

2021 ◽

pp. jgs2021-030

Author(s):

Catherine E. Boddy ◽

Emily G. Mitchell ◽

Andrew Merdith ◽

Alexander G. Liu

Keyword(s):

Taxonomic Composition ◽

Supplementary Information ◽

Cambrian Explosion ◽

Content Type ◽

Link Type ◽

Environmental Perturbations ◽

Significant Difference ◽

Evolutionary Trajectories ◽

Cambrian Radiation ◽

Supplementary Material

Macrofossils of the late Ediacaran Period (c. 579–539 Ma) document diverse, complex multicellular eukaryotes, including early animals, prior to the Cambrian radiation of metazoan phyla. To investigate the relationships between environmental perturbations, biotic responses and early metazoan evolutionary trajectories, it is vital to distinguish between evolutionary and ecological controls on the global distribution of Ediacaran macrofossils. The contributions of temporal, palaeoenvironmental and lithological factors in shaping the observed variations in assemblage taxonomic composition between Ediacaran macrofossil sites are widely discussed, but the role of palaeogeography remains ambiguous. Here we investigate the influence of palaeolatitude on the spatial distribution of Ediacaran macrobiota through the late Ediacaran Period using two leading palaeogeographical reconstructions. We find that overall generic diversity was distributed across all palaeolatitudes. Among specific groups, the distributions of candidate ‘Bilateral’ and Frondomorph taxa exhibit weakly statistically significant and statistically significant differences between low and high palaeolatitudes within our favoured palaeogeographical reconstruction, respectively, whereas Algal, Tubular, Soft-bodied and Biomineralizing taxa show no significant difference. The recognition of statistically significant palaeolatitudinal differences in the distribution of certain morphogroups highlights the importance of considering palaeolatitudinal influences when interrogating trends in Ediacaran taxon distributions. Supplementary material: Supplementary information, data and code are available at https://doi.org/10.6084/m9.figshare.c.5488945 Thematic collection: This article is part of the Advances in the Cambrian Explosion collection available at: https://www.lyellcollection.org/cc/advances-cambrian-explosion

Numt identification and removal with RtN!

Bioinformatics ◽

10.1093/bioinformatics/btaa642 ◽

2020 ◽

Vol 36 (20) ◽

pp. 5115-5116 ◽

Cited By ~ 2

Author(s):

August E Woerner ◽

Jennifer Churchill Cihlar ◽

Utpal Smart ◽

Bruce Budowle

Keyword(s):

Mitochondrial Genome ◽

Massively Parallel Sequencing ◽

Sequence Similarity ◽

Variant Calling ◽

Supplementary Information ◽

Mitochondrial Genomes ◽

Sequencing Data ◽

Read Mapping ◽

Genome Data ◽

Mitochondrial Sequences

Abstract Motivation Assays in mitochondrial genomics rely on accurate read mapping and variant calling. However, there are known and unknown nuclear paralogs whose genetic properties differ fundamentally from those of the mitochondrial genome. Such paralogs complicate the interpretation of mitochondrial genome data and confound variant calling. Results Remove the Numts! (RtN!) was developed to categorize reads from massively parallel sequencing data not based on the expected properties and sequence identities of paralogous nuclear encoded mitochondrial sequences, but instead using sequence similarity to a large database of publicly available mitochondrial genomes. RtN! removes low-level sequencing noise and mitochondrial paralogs without impacting variant calling, whereas competing methods were shown to remove true variants from mitochondrial mixtures. Availability and implementation https://github.com/Ahhgust/RtN Supplementary information Supplementary data are available at Bioinformatics online.

Generalized Born radii computation using linear models and neural networks

Bioinformatics ◽

10.1093/bioinformatics/btz818 ◽

2019 ◽

Vol 36 (6) ◽

pp. 1757-1764

Author(s):

Saida Saad Mohamed Mahmoud ◽

Gennaro Esposito ◽

Giuseppe Serra ◽

Federico Fogolari

Keyword(s):

Neural Network ◽

Neural Networks ◽

Linear Model ◽

Correlation Coefficient ◽

Linear Models ◽

Reference Method ◽

Supplementary Information ◽

Model Parameters ◽

Generalized Born ◽

Supplementary Material

Abstract Motivation Implicit solvent models play an important role in describing the thermodynamics and the dynamics of biomolecular systems. Key to an efficient use of these models is the computation of generalized Born (GB) radii, which is accomplished by algorithms based on the electrostatics of inhomogeneous dielectric media. The speed and accuracy of such computations are still an issue especially for their intensive use in classical molecular dynamics. Here, we propose an alternative approach that encodes the physics of the phenomena and the chemical structure of the molecules in model parameters which are learned from examples. Results GB radii have been computed using (i) a linear model and (ii) a neural network. The input is the element, the histogram of counts of neighbouring atoms, divided by atom element, within 16 Å. Linear models are ca. 8 times faster than the most widely used reference method and the accuracy is higher with correlation coefficient with the inverse of ‘perfect’ GB radii of 0.94 versus 0.80 of the reference method. Neural networks further improve the accuracy of the predictions with correlation coefficient with ‘perfect’ GB radii of 0.97 and ca. 20% smaller root mean square error. Availability and implementation We provide a C program implementing the computation using the linear model, including the coefficients appropriate for the set of Bondi radii, as Supplementary Material. We also provide a Python implementation of the neural network model with parameter and example files in the Supplementary Material as well. Supplementary information Supplementary data are available at Bioinformatics online.
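The input described in this abstract, a per-element histogram of neighbouring atom counts within 16 Å, can be sketched as follows. This is an illustrative reading of the abstract, not the authors' C or Python code: the bin width, the element list, the function names and any coefficients passed to the linear step are hypothetical assumptions.

```python
import math

CUTOFF = 16.0     # Å, the neighbourhood radius stated in the abstract
BIN_WIDTH = 2.0   # Å, hypothetical bin width chosen for illustration


def neighbour_histogram(atoms, index, elements=("C", "N", "O", "S", "H")):
    """Count neighbours of atoms[index] per element and distance bin.

    atoms: list of (element, (x, y, z)) tuples, coordinates in Å.
    Returns a dict mapping element -> list of counts, one per distance bin.
    """
    n_bins = int(CUTOFF / BIN_WIDTH)
    hist = {el: [0] * n_bins for el in elements}
    _, centre = atoms[index]
    for j, (el, pos) in enumerate(atoms):
        if j == index or el not in hist:
            continue
        d = math.dist(centre, pos)
        if d < CUTOFF:
            hist[el][int(d // BIN_WIDTH)] += 1
    return hist


def flatten(hist, elements=("C", "N", "O", "S", "H")):
    """Concatenate the per-element histograms into one feature vector."""
    return [count for el in elements for count in hist[el]]


def linear_predict(features, weights, bias):
    """Generic linear model: bias plus the dot product of weights and features.

    The weights and bias would have to be fitted to reference radii;
    no published coefficients are reproduced here.
    """
    return bias + sum(w * x for w, x in zip(weights, features))


if __name__ == "__main__":
    toy = [("C", (0, 0, 0)), ("N", (1, 0, 0)), ("O", (5, 0, 0)), ("C", (20, 0, 0))]
    print(neighbour_histogram(toy, 0))
```

A feature vector of plain counts keeps the prediction step to a single dot product, which is consistent with the speed advantage the abstract reports for the linear model.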

G2P: a Genome-Wide-Association-Study simulation tool for genotype simulation, phenotype simulation and power evaluation

Bioinformatics ◽

10.1093/bioinformatics/btz126 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3852-3854 ◽

Cited By ~ 3

Author(s):

You Tang ◽

Xiaolei Liu

Keyword(s):

Association Study ◽

Genome Wide Association Study ◽

Genome Wide Association ◽

Supplementary Information ◽

Maximum Efficiency ◽

Genotype Data ◽

Simulation Tool ◽

Genome Wide ◽

Phenotype Data ◽

Power Evaluation

Abstract Motivation Many Genome-Wide-Association-Study (GWAS) methods have been developed for mapping genetic markers that are associated with human diseases and agricultural economic traits. Computer simulation is a useful tool for testing the performance of various GWAS methods under given scenarios. Existing tools are either inefficient in terms of computation and memory or inconvenient to use for simulating big, realistic genotype and phenotype data to evaluate available GWAS methods. Results Here, we present a GWAS simulation tool named G2P that can be used to simulate genotype data and phenotype data and to perform power evaluation of GWAS methods. G2P is a user-friendly tool with all functions provided in both graphical user interface and pipeline manners, and it is available for Windows, Mac and Linux environments. Furthermore, G2P achieves maximum efficiency in terms of both memory usage and simulation speed; with G2P, the simulation of genotype data that includes 1 000 000 samples and 2 000 000 markers can be accomplished in 5 h. Availability and implementation The G2P software, user manual, and example datasets are freely available at GitHub: https://github.com/XiaoleiLiuBio/G2P. Supplementary information Supplementary data are available at Bioinformatics online.

AlpsNMR: an R package for signal processing of fully untargeted NMR-based metabolomics

Bioinformatics ◽

10.1093/bioinformatics/btaa022 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2943-2945 ◽

Cited By ~ 3

Author(s):

Francisco Madrid-Gambin ◽

Sergio Oller-Moreno ◽

Luis Fernandez ◽

Simona Bartova ◽

Maria Pilar Giner ◽

...

Keyword(s):

Signal Processing ◽

Statistical Analysis ◽

Processing System ◽

R Package ◽

Supplementary Information ◽

Test Case ◽

Previous Knowledge ◽

Spectral Processing ◽

Computational Tools ◽

User Friendly

Abstract Summary Nuclear magnetic resonance (NMR)-based metabolomics is widely used to obtain metabolic fingerprints of biological systems. While targeted workflows require previous knowledge of metabolites, prior to statistical analysis, untargeted approaches remain a challenge. Computational tools dealing with fully untargeted NMR-based metabolomics are still scarce or not user-friendly. Therefore, we developed AlpsNMR (Automated spectraL Processing System for NMR), an R package that provides automated and efficient signal processing for untargeted NMR metabolomics. AlpsNMR includes spectra loading, metadata handling, automated outlier detection, spectra alignment and peak-picking, integration and normalization. The resulting output can be used for further statistical analysis. AlpsNMR proved effective in detecting metabolite changes in a test case. The tool allows less experienced users to easily implement this workflow from spectra to a ready-to-use dataset in their routines. Availability and implementation The AlpsNMR R package and tutorial are freely available to download from http://github.com/sipss/AlpsNMR under the MIT license. Supplementary information Supplementary data are available at Bioinformatics online.

Boost-HiC: computational enhancement of long-range contacts in chromosomal contact maps

Bioinformatics ◽

10.1093/bioinformatics/bty1059 ◽

2019 ◽

Vol 35 (16) ◽

pp. 2724-2729 ◽

Cited By ~ 3

Author(s):

L Carron ◽

J B Morlot ◽

V Matthys ◽

A Lesne ◽

J Mozziconacci

Keyword(s):

Long Range ◽

Short Range ◽

Supplementary Information ◽

Supplementary Data ◽

Missing Information ◽

High Confidence ◽

Contact Maps ◽

Genome Wide ◽

Algorithmic Procedure

Abstract Motivation Genome-wide chromosomal contact maps are widely used to uncover the 3D organization of genomes. They rely on collecting millions of contacting pairs of genomic loci. Contacts at short range are usually well measured in experiments, while much information about long-range contacts is missing. Results We propose to use the sparse information contained in raw contact maps to infer high-confidence contact counts between all pairs of loci. Our algorithmic procedure, Boost-HiC, enables the detection of Hi-C patterns such as chromosomal compartments at a resolution that would otherwise be attainable only by sequencing the experimental Hi-C library a hundred times deeper. Boost-HiC can also be used to compare contact maps at an improved resolution. Availability and implementation Boost-HiC is available at https://github.com/LeopoldC/Boost-HiC. Supplementary information Supplementary data are available at Bioinformatics online.