Information theoretic alignment free variant calling

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment and assembly free variant calling method based on information theoretic principles designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffix of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks with performance similar to standard reference based calls and another reference free method (DiscoSNP++). Comparisons against reference based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
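The context model described in the abstract can be illustrated with a minimal sketch. It counts the nucleotide observed between each prefix and suffix of length k, then compares two samples with a Kullback–Leibler divergence per context. The function names, the pseudocount smoothing, and the `min_count` filter are illustrative assumptions; the abstract does not specify the authors' actual statistic, so this should not be read as their implementation.

```python
from collections import Counter, defaultdict
import math

ALPHABET = "ACGT"

def context_counts(reads, k):
    """Count the central nucleotide for every (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            base = read[i]
            if base in ALPHABET:
                context = (read[i - k:i], read[i + 1:i + 1 + k])
                counts[context][base] += 1
    return counts

def smoothed_distribution(counter, pseudocount=0.5):
    """Multinomial estimate over A, C, G, T (pseudocount is an assumption)."""
    total = sum(counter[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return [(counter[b] + pseudocount) / total for b in ALPHABET]

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def divergent_contexts(reads_a, reads_b, k, min_count=2):
    """Score each context seen in both samples by the KL divergence between them."""
    a = context_counts(reads_a, k)
    b = context_counts(reads_b, k)
    scores = {}
    for context in a.keys() & b.keys():
        if sum(a[context].values()) >= min_count and sum(b[context].values()) >= min_count:
            p = smoothed_distribution(a[context])
            q = smoothed_distribution(b[context])
            scores[context] = kl_divergence(p, q)
    return scores
```

Contexts with a high score are candidates for variants that separate the two samples; a full method would also need to check that a context is stable within each sample, which this sketch omits.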

Information theoretic alignment free variant calling

PeerJ Computer Science ◽

10.7717/peerj-cs.71 ◽

2016 ◽

Vol 2 ◽

pp. e71

Author(s):

Justin Bedo ◽

Benjamin Goudey ◽

Jeremy Wazny ◽

Zeyu Zhou

Keyword(s):

Sequence Data ◽

Multinomial Distribution ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Information Theoretic ◽

Learning Tasks ◽

Leibler Divergence ◽

Suitable Reference ◽

Mouse Dataset

Information theoretic alignment free variant calling

10.7287/peerj.preprints.2015v1 ◽

2016 ◽

Author(s):

Justin Bedo ◽

Benjamin Goudey ◽

Jeremy Wazny ◽

Zeyu Zhou

Keyword(s):

Sequence Data ◽

Multinomial Distribution ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Information Theoretic ◽

Learning Tasks ◽

Leibler Divergence ◽

Suitable Reference ◽

Mouse Dataset

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology

10.1101/281048 ◽

2018 ◽

Cited By ~ 2

Author(s):

Joe Parker ◽

Andrew Helmstetter ◽

James Crowe ◽

John Iacona ◽

Dion Devey ◽

...

Keyword(s):

Dna Sequencing ◽

Species Identification ◽

Sequence Data ◽

Vascular Plant ◽

Reference Sequence ◽

Read Length ◽

Reference Database ◽

Sequencing Technology ◽

Long Read ◽

Suitable Reference

The versatility of the current DNA sequencing platforms and the development of portable, nanopore sequencers means that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta-barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of useable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re-sequencing effort (1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.

Variant calling for cpn60 barcode sequence-based microbiome profiling

10.1101/749267 ◽

2019 ◽

Author(s):

Sarah J. Vancuren ◽

Scott J. Dos Santos ◽

Janet E. Hill ◽

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Taxonomic Composition ◽

Species Level ◽

Reference Sequence ◽

Sequence Length ◽

Sequence Variant ◽

Operational Taxonomic Units ◽

Microbiome Profiling

Amplification and sequencing of conserved genetic barcodes such as the cpn60 gene is a common approach to determining the taxonomic composition of microbiomes. Exact sequence variant calling has been proposed as an alternative to previously established methods for aggregation of sequence reads into operational taxonomic units (OTU). We investigated the utility of variant calling for cpn60 barcode sequences and determined the minimum sequence length required to provide species-level resolution. Sequence data from the 5’ region of the cpn60 barcode amplified from the human vaginal microbiome (n=45), and a mock community were used to compare variant calling to de novo assembly of reads, and mapping to a reference sequence database in terms of number of OTU formed, and overall community composition. Variant calling resulted in microbiome profiles that were consistent in apparent composition to those generated with the other methods but with significant logistical advantages. Variant calling is rapid, achieves high resolution of taxa, and does not require reference sequence data. Our results further demonstrate that 150 bp from the 5’ end of the cpn60 barcode sequence is sufficient to provide species-level resolution of microbiota.

Influence of neighboring small sequence variants on functional impact prediction

10.1101/596718 ◽

2019 ◽

Cited By ~ 10

Author(s):

Jan-Simon Baasner ◽

Dakota Howard ◽

Boas Pucker

Keyword(s):

Variant Calling ◽

Reference Sequence ◽

Proof Of Concept ◽

Protein Coding ◽

Functional Impact ◽

Impact Prediction ◽

Functional Implications ◽

Functional Consequences ◽

Suitable Reference ◽

Impact Predictions

Once a suitable reference sequence is generated, genomic differences within a species are often assessed by re-sequencing. Variant calling processes can reveal all differences between two strains, accessions, genotypes, or individuals. These variants can be enriched with predictions about their functional implications based on available structural annotations, i.e., gene models. Although these functional impact predictions on a per variant basis are often accurate, some challenging cases require the simultaneous incorporation of multiple adjacent variants into this prediction process. Examples are neighboring variants which modify each other's functional impact. Neighborhood-Aware Variant Impact Predictor (NAVIP) considers all variants within a given protein coding sequence when predicting the functional consequences. As a proof of concept, variants between the Arabidopsis thaliana accessions Columbia-0 and Niederzenz-1 were annotated. NAVIP is freely available on GitHub: https://github.com/bpucker/NAVIP.

Simulation of African and non-African low and high coverage whole genome sequence data to assess variant calling approaches

Briefings in Bioinformatics ◽

10.1093/bib/bbaa366 ◽

2020 ◽

Author(s):

Shatha Alosaimi ◽

Noëlle van Biljon ◽

Denis Awany ◽

Prisca K Thami ◽

Joel Defo ◽

...

Keyword(s):

Genetic Diversity ◽

False Positive ◽

Sequence Data ◽

False Negative ◽

Variant Calling ◽

High Rate ◽

Whole Genome Sequence ◽

Whole Genome ◽

Sequence Coverage ◽

High Coverage

Current variant calling (VC) approaches have been designed to leverage populations of long-range haplotypes and were benchmarked using populations of European descent, whereas most genetic diversity is found in non-European populations, such as African populations. Working with these genetically diverse populations, VC tools may produce false positive and false negative results, which may produce misleading conclusions in prioritization of mutations, clinical relevancy and actionability of genes. The most prominent question is which tool or pipeline has a high rate of sensitivity and precision when analysing African data with either low or high sequence coverage, given the high genetic diversity and heterogeneity of this data. Here, a total of 100 synthetic Whole Genome Sequencing (WGS) samples, mimicking the genetic profile of African and European subjects for different specific coverage levels (high/low), have been generated to assess the performance of nine different VC tools on these contrasting datasets. The performances of these tools were assessed in false positive and false negative call rates by comparing the simulated golden variants to the variants identified by each VC tool. Combining our results on sensitivity and positive predictive value (PPV), VarDict [PPV = 0.999 and Matthews correlation coefficient (MCC) = 0.832] and BCFtools (PPV = 0.999 and MCC = 0.813) perform best when using African population data on high and low coverage data. Overall, current VC tools produce high false positive and false negative rates when analysing African compared with European data. This highlights the need for development of VC approaches with high sensitivity and precision tailored for populations characterized by high genetic variations and low linkage disequilibrium.
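For readers unfamiliar with the metrics quoted above, the following sketch shows how sensitivity, PPV, and MCC are conventionally derived from confusion counts when called variants are compared against a simulated truth set. The function name and the example counts are hypothetical and are not taken from the study.

```python
import math

def call_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a variant caller against a truth set."""
    ppv = tp / (tp + fp) if tp + fp else 0.0          # positive predictive value (precision)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # recall of true variants
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return {"ppv": ppv, "sensitivity": sensitivity, "mcc": mcc}

# Hypothetical counts, for illustration only.
metrics = call_metrics(tp=90, fp=10, fn=10, tn=890)
```

MCC uses all four cells of the confusion matrix, which is why a caller can reach a PPV near 1 while its MCC stays lower: missed variants (false negatives) reduce MCC but do not affect PPV.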

Plasmid Profiler: Comparative Analysis of Plasmid Content in WGS Data

10.1101/121350 ◽

2017 ◽

Cited By ~ 2

Author(s):

Adrian Zetner ◽

Jennifer Cabral ◽

Laura Mataseje ◽

Natalie C Knox ◽

Philip Mabon ◽

...

Keyword(s):

Comparative Analysis ◽

De Novo ◽

Sequence Data ◽

Health Agency ◽

R Package ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Supplementary Information ◽

Plasmid Content ◽

Link Type

Summary: Comparative analysis of bacterial plasmids from whole genome sequence (WGS) data generated from short read sequencing is challenging. This is due to the difficulty in identifying contigs harbouring plasmid sequence data, and further difficulty in assembling such contigs into a full plasmid. As such, few software programs and bioinformatics pipelines exist to perform comprehensive comparative analyses of plasmids within and amongst sequenced isolates. To address this gap, we have developed Plasmid Profiler, a pipeline to perform comparative plasmid content analysis without the need for de novo assembly. The pipeline is designed to rapidly identify plasmid sequences by mapping reads to a plasmid reference sequence database. Predicted plasmid sequences are then annotated with their incompatibility group, if known. The pipeline allows users to query plasmids for genes or regions of interest and visualize results as an interactive heat map.

Availability and Implementation: Plasmid Profiler is freely available software released under the Apache 2.0 open source software license. A stand-alone version of the entire Plasmid Profiler pipeline is available as a Docker container at https://hub.docker.com/r/phacnml/plasmidprofiler_0_1_6/. The conda recipe for the Plasmid R package is available at https://anaconda.org/bioconda/r-plasmidprofiler. The custom Plasmid Profiler R package is also available as a CRAN package at https://cran.r-project.org/web/packages/Plasmidprofiler/index.html. Galaxy tools associated with the pipeline are available as a Galaxy tool suite at https://toolshed.g2.bx.psu.edu/repository?repository_id=55e082200d16a504. The source code is available at https://github.com/phac-nml/plasmidprofiler. The Galaxy implementation is available at https://github.com/phac-nml/plasmidprofiler-galaxy.

Contact: Email: [email protected]. National Microbiology Laboratory, Public Health Agency of Canada, 1015 Arlington Street, Winnipeg, Manitoba, Canada.

Supplementary information: Documentation: http://plasmid-profiler.readthedocs.io/en/latest/

Assessing Bos taurus introgression in the UOA Bos indicus assembly

Genetics Selection Evolution ◽

10.1186/s12711-021-00688-1 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Maulana M. Naji ◽

Yuri T. Utsunomiya ◽

Johann Sölkner ◽

Benjamin D. Rosen ◽

Gábor Mészáros

Keyword(s):

Bos Taurus ◽

Sequence Data ◽

Variant Calling ◽

Principal Component ◽

Reference Sequence ◽

Sequencing Analysis ◽

Single Nucleotide Variants ◽

Reference Allele ◽

Brahman Cattle ◽

Reference Genomes

Background: Reference genomes are essential in the analysis of genomic data. As the cost of sequencing decreases, multiple reference genomes are being produced within species to alleviate problems such as low mapping accuracy and reference allele bias in variant calling that can be associated with the alignment of divergent samples to a single reference individual. The latest reference sequence adopted by the scientific community for the analysis of cattle data is ARS_UCD1.2, built from the DNA of a Hereford cow (Bos taurus taurus; B. taurus). A complementary genome assembly, UOA_Brahman_1, was recently built to represent the other cattle subspecies (Bos taurus indicus; B. indicus) from a Brahman cow haplotype to further support analysis of B. indicus data. In this study, we aligned the sequence data of 15 B. taurus and B. indicus breeds to each of these references.

Results: The alignment of B. taurus individuals against UOA_Brahman_1 detected up to five million more single-nucleotide variants (SNVs) compared to that against ARS_UCD1.2. Similarly, the alignment of B. indicus individuals against ARS_UCD1.2 resulted in one and a half million more SNVs than that against UOA_Brahman_1. The number of SNVs with nearly fixed alternative alleles also increased in the alignments with cross-subspecies. Interestingly, the alignment of B. taurus cattle against UOA_Brahman_1 revealed regions with a smaller than expected number of counts of SNVs with nearly fixed alternative alleles. Since B. taurus introgression represents on average 10% of the genome of Brahman cattle, we suggest that these regions comprise taurine DNA as opposed to indicine DNA in the UOA_Brahman_1 reference genome. Principal component and admixture analyses using genotypes inferred from this region support these taurine-introgressed loci. Overall, the flagged taurine segments represent 13.7% of the UOA_Brahman_1 assembly. The genes located within these segments were previously reported to be under positive selection in Brahman cattle, and include functional candidate genes implicated in feed efficiency, development and immunity.

Conclusions: We report a list of taurine segments that are in the UOA_Brahman_1 assembly, which will be useful for the interpretation of interesting genomic features (e.g., signatures of selection, runs of homozygosity, increased mutation rate, etc.) that could appear in future re-sequencing analysis of indicine cattle.

A reference dataset of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree

10.1101/055541 ◽

2016 ◽

Cited By ~ 18

Author(s):

Michael A. Eberle ◽

Epameinondas Fritzilas ◽

Peter Krusche ◽

Morten Källberg ◽

Benjamin L. Moore ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Objective Assessment ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Dataset ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genome Wide ◽

Transmission Information

Improvement of variant calling in next-generation sequence data requires a comprehensive, genome-wide catalogue of high-confidence variants called in a set of genomes for use as a benchmark. We generated deep, whole-genome sequence data of seventeen individuals in a three-generation pedigree and called variants in each genome using a range of currently available algorithms. We used haplotype transmission information to create a phased “platinum” variant catalogue of 4.7 million single nucleotide variants (SNVs) plus 0.7 million small (1-50bp) insertions and deletions (indels) that are consistent with the pattern of inheritance in the parents and eleven children of this pedigree. Platinum genotypes are highly concordant with the current catalogue of the National Institute of Standards and Technology for both SNVs (>99.99%) and indels (99.92%), and add a validated truth catalogue that has 26% more SNVs and 45% more indels. Analysis of 334,652 SNVs that were consistent between informatics pipelines yet inconsistent with haplotype transmission (“non-platinum”) revealed that the majority of these variants are de novo and cell-line mutations or reside within previously unidentified duplications and deletions. The reference materials from this study are a resource for objective assessment of the accuracy of variant calls throughout genomes.

Large-sample confidence intervals of information-theoretic measures in linguistics

Journal of Research Design and Statistics in Linguistics and Communication Science ◽

10.1558/jrds.40134 ◽

2020 ◽

Vol 6 (1) ◽

pp. 19-54

Author(s):

Ryan Ka Yau Lai ◽

Youngah Do

Keyword(s):

Maximum Likelihood ◽

Corpus Linguistics ◽

Delta Method ◽

Confidence Bounds ◽

Likelihood Estimator ◽

Information Theoretic ◽

Leibler Divergence ◽

Information Theoretic Measures ◽

Data Points ◽

Measure Of Uncertainty

This article explores a method of creating confidence bounds for information-theoretic measures in linguistics, such as entropy, Kullback-Leibler Divergence (KLD), and mutual information. We show that a useful measure of uncertainty can be derived from simple statistical principles, namely the asymptotic distribution of the maximum likelihood estimator (MLE) and the delta method. Three case studies from phonology and corpus linguistics are used to demonstrate how to apply it and examine its robustness against common violations of its assumptions in linguistics, such as insufficient sample size and non-independence of data points.
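As a rough illustration of the approach named in the abstract, the sketch below computes a plug-in (maximum likelihood) entropy estimate from category counts together with a delta-method confidence interval. It uses the standard large-sample result Var(H) ≈ (Σ p_i (ln p_i)² − H²) / n for a multinomial sample; the function name and the default z = 1.96 are assumptions, and the article's own formulas may differ in detail.

```python
import math

def entropy_with_ci(counts, z=1.96):
    """Plug-in entropy (nats) with a large-sample delta-method confidence interval."""
    n = sum(counts)
    probs = [c / n for c in counts if c > 0]   # MLE of the category probabilities
    h = -sum(p * math.log(p) for p in probs)   # plug-in entropy estimate
    second_moment = sum(p * math.log(p) ** 2 for p in probs)
    variance = max(second_moment - h ** 2, 0.0) / n  # delta-method variance
    half_width = z * math.sqrt(variance)
    return h, h - half_width, h + half_width

# Hypothetical counts for two categories observed 70 and 30 times.
estimate, lower, upper = entropy_with_ci([70, 30])
```

The interval collapses to zero width when the estimated distribution is exactly uniform, because the first-order delta-method variance vanishes there. That is one case in which the large-sample approximation should not be trusted.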
