Assessing Bos taurus introgression in the UOA Bos indicus assembly

Background: Reference genomes are essential in the analysis of genomic data. As the cost of sequencing decreases, multiple reference genomes are being produced within species to alleviate problems such as low mapping accuracy and reference allele bias in variant calling, which can arise when divergent samples are aligned to a single reference individual. The latest reference sequence adopted by the scientific community for the analysis of cattle data is ARS_UCD1.2, built from the DNA of a Hereford cow (Bos taurus taurus, hereafter B. taurus). A complementary genome assembly, UOA_Brahman_1, was recently built from a Brahman cow haplotype to represent the other cattle subspecies (Bos taurus indicus, hereafter B. indicus) and to further support analysis of B. indicus data. In this study, we aligned the sequence data of 15 B. taurus and B. indicus breeds to each of these references.
Results: The alignment of B. taurus individuals against UOA_Brahman_1 detected up to five million more single-nucleotide variants (SNVs) than the alignment against ARS_UCD1.2. Similarly, the alignment of B. indicus individuals against ARS_UCD1.2 yielded one and a half million more SNVs than the alignment against UOA_Brahman_1. The number of SNVs with nearly fixed alternative alleles also increased in the cross-subspecies alignments. Interestingly, the alignment of B. taurus cattle against UOA_Brahman_1 revealed regions with fewer SNVs with nearly fixed alternative alleles than expected. Since B. taurus introgression represents on average 10% of the genome of Brahman cattle, we suggest that these regions comprise taurine rather than indicine DNA in the UOA_Brahman_1 reference genome. Principal component and admixture analyses using genotypes inferred from these regions support these taurine-introgressed loci. Overall, the flagged taurine segments represent 13.7% of the UOA_Brahman_1 assembly. The genes located within these segments were previously reported to be under positive selection in Brahman cattle and include functional candidate genes implicated in feed efficiency, development and immunity.
Conclusions: We report a list of taurine segments in the UOA_Brahman_1 assembly, which will be useful for interpreting genomic features of interest (e.g., signatures of selection, runs of homozygosity, or increased mutation rate) that could appear in future re-sequencing analyses of indicine cattle.
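
One way to picture the screen described in the results is to count SNVs with nearly fixed alternative alleles in fixed windows along a chromosome and flag windows whose count falls well below the chromosome-wide median. The Python sketch below is an illustration only: the window size, the 0.95 allele-frequency cutoff, the median-based rule, and the function name `flag_low_count_windows` are all assumptions, not the authors' procedure.

```python
from collections import Counter
from statistics import median


def flag_low_count_windows(snvs, chrom_length, window=1_000_000,
                           min_alt_freq=0.95, fraction_of_median=0.2):
    """Flag windows with unexpectedly few nearly fixed alternative alleles.

    snvs: iterable of (1-based position, alternative allele frequency)
    on a single chromosome. All thresholds are hypothetical defaults.
    Returns a list of (start, end, count) for flagged windows.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = Counter()
    for pos, freq in snvs:
        if freq >= min_alt_freq:  # keep only nearly fixed alternative alleles
            counts[(pos - 1) // window] += 1
    per_window = [counts[i] for i in range(n_windows)]
    cutoff = fraction_of_median * median(per_window)
    return [
        (i * window + 1, min((i + 1) * window, chrom_length), c)
        for i, c in enumerate(per_window)
        if c < cutoff
    ]


# Toy chromosome of 50 bp with 10 bp windows: the third window has no
# nearly fixed alternative alleles, only one intermediate-frequency SNV.
toy_snvs = [(p, 0.99) for p in (1, 3, 5, 7, 9, 11, 13, 15, 17, 19,
                                31, 33, 35, 37, 39, 41, 43, 45, 47, 49)]
toy_snvs.append((25, 0.40))
flagged = flag_low_count_windows(toy_snvs, chrom_length=50, window=10)
print(flagged)
```

In this toy input only the window covering positions 21 to 30 is flagged. On real data, flagged windows would be candidates for follow-up (for example with the principal component and admixture analyses the abstract mentions) rather than evidence of introgression in themselves.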

Haplotype-aware variant calling enables high accuracy in nanopore long-reads using deep neural networks

10.1101/2021.03.04.433952 ◽

2021 ◽

Author(s):

Kishwar Shafin ◽

Trevor Pesout ◽

Pi-Chuan Chang ◽

Maria Nattestad ◽

Alexey Kolesnikov ◽

...

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

High Accuracy ◽

Superior Performance ◽

Read Length ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Short Read ◽

Long Read

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data has demonstrated a long read length, but current interpretation methods for its novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single nucleotide variant identification method at the whole-genome scale and produces high-quality single nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution that outperforms the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio-HiFi-polished).
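
The Q35+ and Q40+ figures are Phred-scaled quality values. Under the standard Phred definition, Q = -10 log10(p), they correspond to per-base error probabilities as in the minimal Python sketch below. The function names are illustrative and are not part of PEPPER-Margin-DeepVariant.

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred-scaled quality value Q to an error probability p."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert an error probability p to a Phred-scaled quality value Q."""
    return -10 * math.log10(p)


for q in (35, 40):
    p = phred_to_error(q)
    print(f"Q{q}: error probability {p:.2e}, "
          f"roughly one error per {round(1 / p):,} bases")
```

On this scale, every increase of 10 in Q is a tenfold drop in error probability, so Q40 means about one error per 10,000 bases.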

Information theoretic alignment free variant calling

PeerJ Computer Science ◽

10.7717/peerj-cs.71 ◽

2016 ◽

Vol 2 ◽

pp. e71

Author(s):

Justin Bedo ◽

Benjamin Goudey ◽

Jeremy Wazny ◽

Zeyu Zhou

Keyword(s):

Sequence Data ◽

Multinomial Distribution ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Information Theoretic ◽

Learning Tasks ◽

Leibler Divergence ◽

Suitable Reference ◽

Mouse Dataset

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffixes of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.
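
The core idea can be sketched in a few lines of Python: count the nucleotide observed between each length-k prefix and suffix, turn the counts into a multinomial estimate, and compare two samples with the Kullback–Leibler divergence. This is a simplified illustration, not the authors' statistic: it compares only one pair of samples, ignores intra-sample stability, and the pseudocount smoothing and all function names are assumptions.

```python
from collections import Counter, defaultdict
import math

ALPHABET = "ACGT"


def context_counts(reads, k):
    """Count the centre nucleotide for each (prefix, suffix) context of length k."""
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(k, len(read) - k):
            context = (read[i - k:i], read[i + 1:i + 1 + k])
            counts[context][read[i]] += 1
    return counts


def multinomial(counter, pseudocount=1.0):
    """Smoothed multinomial estimate over A, C, G, T (smoothing is an assumption)."""
    total = sum(counter[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return [(counter[b] + pseudocount) / total for b in ALPHABET]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def divergent_contexts(reads_a, reads_b, k=2):
    """Rank contexts seen in both samples by divergence, largest first."""
    ca, cb = context_counts(reads_a, k), context_counts(reads_b, k)
    scores = {
        c: kl_divergence(multinomial(ca[c]), multinomial(cb[c]))
        for c in ca.keys() & cb.keys()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Two toy samples that differ at a single base (G versus C at position 3).
sample_a = ["ACGTACGT"] * 10
sample_b = ["ACCTACGT"] * 10
ranked = divergent_contexts(sample_a, sample_b, k=2)
print(ranked[0])
```

In the toy data the context with prefix `AC` and suffix `TA` ranks first because the two samples disagree on the nucleotide between them, while contexts with identical counts score zero.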

Novel functional sequences uncovered through a bovine multiassembly graph

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2101056118 ◽

2021 ◽

Vol 118 (20) ◽

pp. e2101056118

Author(s):

Danang Crysnanto ◽

Alexander S. Leonard ◽

Zih-Hua Fang ◽

Hubert Pausch

Keyword(s):

Genetic Diversity ◽

Reference Genome ◽

Bos Taurus ◽

Whole Genome Sequencing Data ◽

Whole Genome ◽

Sequencing Data ◽

Reference Allele ◽

Genomic Analyses ◽

Reference Quality ◽

Reference Genomes

Many genomic analyses start by aligning sequencing reads to a linear reference genome. However, linear reference genomes are imperfect: they lack millions of bases of unknown relevance and cannot reflect the genetic diversity of populations. This makes reference-guided methods susceptible to reference-allele bias. To overcome such limitations, we build a pangenome from six reference-quality assemblies from taurine and indicine cattle as well as yak. The pangenome contains an additional 70,329,827 bases compared to the Bos taurus reference genome. Our multiassembly approach reveals 30 and 10.1 million bases private to yak and indicine cattle, respectively, and between 3.3 and 4.4 million bases unique to each taurine assembly. Utilizing transcriptomes from 56 cattle, we show that these nonreference sequences encode transcripts that hitherto remained undetected from the B. taurus reference genome. We uncover genes, primarily encoding proteins contributing to immune response and pathogen-mediated immunomodulation, differentially expressed between Mycobacterium bovis–infected and noninfected cattle that are also undetectable in the B. taurus reference genome. Using whole-genome sequencing data of cattle from five breeds, we show that reads which were previously misaligned against the Bos taurus reference genome now align accurately to the pangenome sequences. This enables us to discover 83,250 polymorphic sites that segregate within and between breeds of cattle and capture genetic differentiation across breeds. Our work makes a so-far unused source of variation amenable to genetic investigations and provides methods and a framework for establishing and exploiting a more diverse reference genome.

Improving read alignment through the generation of alternative reference via iterative strategy

Scientific Reports ◽

10.1038/s41598-020-74526-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Lina Bu ◽

Qi Wang ◽

Wenjin Gu ◽

Ruifei Yang ◽

Di Zhu ◽

...

Keyword(s):

Variant Calling ◽

Optimal Number ◽

Reference Sequence ◽

Hardware Platform ◽

Variable Regions ◽

Read Alignment ◽

Reference Sequences ◽

Downstream Analysis ◽

Reference Genomes ◽

Number Of Iterations

There is generally one standard reference sequence for each species. Extensive variation in other breeds of the species can lead to ambiguous alignment and inaccurate variant calling and, in turn, compromise the accuracy of downstream analysis. Here, with the help of the FPGA hardware platform, we present a method that generates an alternative reference via an iterative strategy to improve the read alignment for breeds that are genetically distant from the reference breed. Compared with the published reference genomes, the alternative reference sequences we built improved the mapping rates of Chinese indigenous pigs and chickens by 0.61–1.68% and 0.09–0.45%, respectively. These sequences also enable researchers to recover highly variable regions that could be missed using public reference sequences. We also determined that the optimal numbers of iterations needed to generate alternative reference sequences were seven and five for pigs and chickens, respectively. Our results show that, for genetically distant breeds, generating an alternative reference sequence can facilitate read alignment and variant calling and improve the accuracy of downstream analyses.

Variant calling for cpn60 barcode sequence-based microbiome profiling

10.1101/749267 ◽

2019 ◽

Author(s):

Sarah J. Vancuren ◽

Scott J. Dos Santos ◽

Janet E. Hill

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Taxonomic Composition ◽

Species Level ◽

Reference Sequence ◽

Sequence Length ◽

Sequence Variant ◽

Operational Taxonomic Units ◽

Microbiome Profiling

Amplification and sequencing of conserved genetic barcodes such as the cpn60 gene is a common approach to determining the taxonomic composition of microbiomes. Exact sequence variant calling has been proposed as an alternative to previously established methods for aggregation of sequence reads into operational taxonomic units (OTU). We investigated the utility of variant calling for cpn60 barcode sequences and determined the minimum sequence length required to provide species-level resolution. Sequence data from the 5’ region of the cpn60 barcode amplified from the human vaginal microbiome (n=45) and a mock community were used to compare variant calling with de novo assembly of reads and with mapping to a reference sequence database, in terms of the number of OTU formed and overall community composition. Variant calling resulted in microbiome profiles that were consistent in apparent composition with those generated with the other methods, but with significant logistical advantages. Variant calling is rapid, achieves high resolution of taxa, and does not require reference sequence data. Our results further demonstrate that 150 bp from the 5’ end of the cpn60 barcode sequence is sufficient to provide species-level resolution of microbiota.

Sequence variation aware genome references and read mapping with the variation graph toolkit

10.1101/234856 ◽

2017 ◽

Cited By ~ 12

Author(s):

Erik Garrison ◽

Jouni Sirén ◽

Adam M. Novak ◽

Glenn Hickey ◽

Jordan M. Eizenga ◽

...

Keyword(s):

Dna Sequence ◽

Large Scale ◽

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Read Mapping ◽

Dna Sequence Data ◽

Suffix Arrays ◽

Improved Accuracy ◽

Reference Genomes

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.

Measurements of Intrahost Viral Diversity Are Extremely Sensitive to Systematic Errors in Variant Calling

Journal of Virology ◽

10.1128/jvi.00667-16 ◽

2016 ◽

Vol 90 (15) ◽

pp. 6884-6895 ◽

Cited By ~ 62

Author(s):

John T. McCrone ◽

Adam S. Lauring

Keyword(s):

Quality Control ◽

Influenza Virus ◽

Cohort Study ◽

Sequence Data ◽

Viral Diversity ◽

Sequencing Analysis ◽

Sample Collection ◽

Single Nucleotide Variants ◽

Data Set ◽

Sequencing Platform

ABSTRACT: With next-generation sequencing technologies, it is now feasible to efficiently sequence patient-derived virus populations at a depth of coverage sufficient to detect rare variants. However, each sequencing platform has characteristic error profiles, and sample collection, target amplification, and library preparation are additional processes whereby errors are introduced and propagated. Many studies account for these errors by using ad hoc quality thresholds and/or previously published statistical algorithms. Despite common usage, the majority of these approaches have not been validated under conditions that characterize many studies of intrahost diversity. Here, we use defined populations of influenza virus to mimic the diversity and titer typically found in patient-derived samples. We identified single-nucleotide variants using two commonly employed variant callers, DeepSNV and LoFreq. We found that the accuracy of these variant callers was lower than expected and exquisitely sensitive to the input titer. Small reductions in specificity had a significant impact on the number of minority variants identified and subsequent measures of diversity. We were able to increase the specificity of DeepSNV to >99.95% by applying an empirically validated set of quality thresholds. When applied to a set of influenza virus samples from a household-based cohort study, these changes resulted in a 10-fold reduction in measurements of viral diversity. We have made our sequence data and analysis code available so that others may improve on our work and use our data set to benchmark their own bioinformatics pipelines. Our work demonstrates that inadequate quality control and validation can lead to significant overestimation of intrahost diversity.
IMPORTANCE: Advances in sequencing technology have made it feasible to sequence patient-derived viral samples at a level sufficient for detection of rare mutations. These high-throughput, cost-effective methods are revolutionizing the study of within-host viral diversity. However, the techniques are error prone, and the methods commonly used to control for these errors have not been validated under the conditions that characterize patient-derived samples. Here, we show that these conditions affect measurements of viral diversity. We found that the accuracy of previously benchmarked analysis pipelines was greatly reduced under patient-derived conditions. By carefully validating our sequencing analysis using known control samples, we were able to identify biases in our method and to improve our accuracy to acceptable levels. Application of our modified pipeline to a set of influenza virus samples from a cohort study provided a realistic picture of intrahost diversity and suggested the need for rigorous quality control in such studies.

Information theoretic alignment free variant calling

10.7287/peerj.preprints.2015 ◽

2016 ◽

Author(s):

Justin Bedo ◽

Benjamin Goudey ◽

Jeremy Wazny ◽

Zeyu Zhou

Keyword(s):

Sequence Data ◽

Multinomial Distribution ◽

Variant Calling ◽

Whole Genome Sequence ◽

Reference Sequence ◽

Information Theoretic ◽

Learning Tasks ◽

Leibler Divergence ◽

Suitable Reference ◽

Mouse Dataset

While traditional methods for calling variants across whole genome sequence data rely on alignment to an appropriate reference sequence, alternative techniques are needed when a suitable reference does not exist. We present a novel alignment- and assembly-free variant calling method based on information theoretic principles, designed to detect variants that have strong statistical evidence for their ability to segregate samples in a given dataset. Our method uses the context surrounding a particular nucleotide to define variants. Given a set of reads, we model the probability of observing a given nucleotide conditioned on the surrounding prefix and suffixes of length k as a multinomial distribution. We then estimate which of these contexts are stable intra-sample and varying inter-sample using a statistic based on the Kullback–Leibler divergence. The utility of the variant calling method was evaluated through analysis of a pair of bacterial datasets and a mouse dataset. We found that our variants are highly informative for supervised learning tasks, with performance similar to standard reference-based calls and another reference-free method (DiscoSNP++). Comparisons against reference-based calls showed our method was able to capture very similar population structure on the bacterial dataset. The algorithm’s focus on discriminatory variants makes it suitable for many common analysis tasks for organisms that are too diverse to be mapped back to a single reference sequence.

Accurate Reference-Free Somatic Variant-Calling by Integrating Genomic, Sequencing and Population Data

10.1101/383703 ◽

2018 ◽

Author(s):

Ren X. Sun ◽

Christopher M. Lalansingh ◽

Shadrielle Melijah G. Espiritu ◽

Cindy Q. Yao ◽

Takafumi N. Yamaguchi ◽

...

Keyword(s):

Sequence Data ◽

Human Cancer ◽

Variant Calling ◽

Population Data ◽

Single Nucleotide Variants ◽

Reference Tissue ◽

Sequencing Technologies ◽

Distant Tissue ◽

Tumor Types ◽

Reference Samples

The detection of somatic single nucleotide variants (SNVs) is critical in both research and clinical applications. Studies of human cancer typically use matched normal (reference) samples from a distant tissue to increase SNV prediction accuracy. This process both doubles sequencing costs and poses challenges when reference samples are not readily available, such as for many cell lines. To address these challenges, we created S22S: an approach for the prediction of somatic mutations without the need for matched reference tissue. S22S takes underlying sequence data, augments them with genomic background context and population frequency information, and classifies SNVs as somatic or non-somatic. We validated S22S using primary tumor/normal pairs from four tumor types, spanning two different sequencing technologies. S22S robustly identifies somatic SNVs, with the area under the precision-recall curve reaching 0.97 in kidney clear cell carcinoma, comparable to the best tumor/normal analysis pipelines. S22S is freely available at http://labs.oicr.on.ca/Boutros-lab/software/s22s.

Accurate sequence variant genotyping in cattle using variation-aware genome graphs

10.1101/460345 ◽

2018 ◽

Cited By ~ 1

Author(s):

Danang Crysnanto ◽

Christine Wurmser ◽

Hubert Pausch

Keyword(s):

Dna Sequences ◽

Reference Genome ◽

Sequence Data ◽

Sequence Variant ◽

Sequence Variants ◽

Sequencing Data ◽

Reference Allele ◽

Reference Genomes ◽

Genome Graph ◽

Genotype Concordance

Background: The genotyping of sequence variants typically involves, as a first step, the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of sequence variation within a species, reference allele bias may occur at highly polymorphic or diverged regions of the genome. Graph-based methods facilitate the comparison of sequencing reads to a variation-aware genome graph that incorporates non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely used methods that rely on linear reference genomes, i.e., GATK and SAMtools, using whole-genome sequencing data of 49 Original Braunvieh cattle. Results: We discovered 21,140,196, 20,262,913 and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the lowest number of Mendelian inconsistencies for both SNPs and indels in nine sire-son pairs with sequence data. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all tools evaluated, particularly for animals that had been sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved the genotype concordance slightly, but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools, but less than GATK.
Conclusions: Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementations of state-of-the-art methods that rely on linear reference genomes.
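
The three comparison metrics named in the results can be computed from paired genotypes coded as alternative-allele counts (0, 1, 2). The Python sketch below uses commonly cited definitions of genotype concordance, non-reference sensitivity and non-reference discrepancy. The abstract does not give the exact formulas used in the study, so these definitions, the genotype coding, and the function name are assumptions.

```python
def concordance_metrics(truth, calls):
    """Compare called genotypes against truth (e.g., microarray) genotypes.

    Genotypes are alternative-allele counts (0, 1 or 2); None marks a
    missing genotype, and such pairs are skipped. Returns a dict with:
      concordance: fraction of sites where call == truth
      nrs: non-reference sensitivity, the fraction of truth non-reference
           sites that are also called non-reference
      nrd: non-reference discrepancy, the fraction of discordant sites
           among sites that are not homozygous reference in both sets
    """
    pairs = [(t, c) for t, c in zip(truth, calls)
             if t is not None and c is not None]

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    matches = sum(t == c for t, c in pairs)
    truth_nonref = [(t, c) for t, c in pairs if t > 0]
    not_both_ref = [(t, c) for t, c in pairs if not (t == 0 and c == 0)]
    return {
        "concordance": ratio(matches, len(pairs)),
        "nrs": ratio(sum(c > 0 for _, c in truth_nonref), len(truth_nonref)),
        "nrd": ratio(sum(t != c for t, c in not_both_ref), len(not_both_ref)),
    }


# Toy genotypes for eight sites in one animal.
array_genotypes = [0, 0, 1, 1, 2, 2, 0, 1]
sequence_genotypes = [0, 0, 1, 2, 2, 2, 1, 0]
metrics = concordance_metrics(array_genotypes, sequence_genotypes)
print(metrics)
```

Non-reference discrepancy excludes sites that are homozygous reference in both sets, so it is not inflated by the large number of trivially concordant reference calls.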
