Accurate sequence variant genotyping in cattle using variation-aware genome graphs

2018 ◽  
Author(s):  
Danang Crysnanto ◽  
Christine Wurmser ◽  
Hubert Pausch

Background: The genotyping of sequence variants typically begins with the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of the sequence variation within a species, reference allele bias may occur at highly polymorphic or diverged regions of the genome. Graph-based methods make it possible to compare sequencing reads to a variation-aware genome graph that incorporates non-redundant DNA sequences that segregate within a species. Using whole-genome sequencing data of 49 Original Braunvieh cattle, we compared the accuracy and sensitivity of graph-based sequence variant genotyping with the Graphtyper software to two widely used methods that rely on linear reference genomes, GATK and SAMtools. Results: We discovered 21,140,196, 20,262,913 and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes obtained using Graphtyper had the lowest number of Mendelian inconsistencies for both SNPs and indels in nine sire-son pairs with sequence data. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all tools evaluated, particularly for animals that had been sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved the genotype concordance slightly, but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but fewer than GATK.
Conclusions: Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementations of state-of-the-art methods that rely on linear reference genomes.
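
The three accuracy metrics named in the results can be made concrete with a small sketch. It assumes biallelic sites with genotypes coded as alternate-allele counts (0, 1, 2) and uses commonly used definitions of non-reference sensitivity and non-reference discrepancy. The abstract does not spell out its formulas, so these definitions, the function names and the toy genotypes are assumptions for illustration only. The second function counts opposing homozygous genotypes, which at a biallelic site is the only Mendelian inconsistency detectable in a sire-son pair without the dam.

```python
from collections import Counter


def concordance_metrics(truth, calls):
    """Compare array-derived (truth) and sequence-derived (calls) genotypes.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    Returns (genotype concordance, non-reference sensitivity,
    non-reference discrepancy). Definitions are assumed, not from the paper.
    """
    # Keep only sites genotyped in both data sets.
    table = Counter(
        (t, c) for t, c in zip(truth, calls) if t is not None and c is not None
    )
    total = sum(table.values())
    matches = sum(n for (t, c), n in table.items() if t == c)
    both_hom_ref = table[(0, 0)]
    truth_variant = sum(n for (t, _), n in table.items() if t > 0)
    truth_variant_called = sum(n for (t, c), n in table.items() if t > 0 and c > 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    concordance = ratio(matches, total)
    # Share of truly variant genotypes that were also called as variant.
    nrs = ratio(truth_variant_called, truth_variant)
    # Mismatch rate after discarding sites that are hom-ref in both sets.
    nrd = ratio(total - matches, total - both_hom_ref)
    return concordance, nrs, nrd


def opposing_homozygotes(sire, son):
    """Count sites where sire and son carry opposite homozygous genotypes."""
    return sum(1 for a, b in zip(sire, son) if {a, b} == {0, 2})


# Toy genotypes, invented for illustration.
array_genotypes = [0, 0, 1, 1, 2, 2, 1, 0]
sequence_genotypes = [0, 0, 1, 0, 2, 1, 1, 1]
print(concordance_metrics(array_genotypes, sequence_genotypes))
print(opposing_homozygotes([0, 2, 1, 0], [2, 2, 0, 1]))
```

Excluding sites that are homozygous for the reference allele in both data sets keeps the discrepancy measure from being dominated by the easy majority class.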

2021 ◽  
Vol 118 (20) ◽  
pp. e2101056118
Author(s):  
Danang Crysnanto ◽  
Alexander S. Leonard ◽  
Zih-Hua Fang ◽  
Hubert Pausch

Many genomic analyses start by aligning sequencing reads to a linear reference genome. However, linear reference genomes are imperfect: they lack millions of bases of unknown relevance and cannot reflect the genetic diversity of populations. This makes reference-guided methods susceptible to reference-allele bias. To overcome such limitations, we build a pangenome from six reference-quality assemblies from taurine and indicine cattle as well as yak. The pangenome contains an additional 70,329,827 bases compared to the Bos taurus reference genome. Our multiassembly approach reveals 30 and 10.1 million bases private to yak and indicine cattle, respectively, and between 3.3 and 4.4 million bases unique to each taurine assembly. Utilizing transcriptomes from 56 cattle, we show that these nonreference sequences encode transcripts that hitherto remained undetected from the B. taurus reference genome. We uncover genes, primarily encoding proteins contributing to immune response and pathogen-mediated immunomodulation, that are differentially expressed between Mycobacterium bovis–infected and noninfected cattle and are also undetectable in the B. taurus reference genome. Using whole-genome sequencing data of cattle from five breeds, we show that reads that were previously misaligned against the Bos taurus reference genome now align accurately to the pangenome sequences. This enables us to discover 83,250 polymorphic sites that segregate within and between breeds of cattle and capture genetic differentiation across breeds. Our work makes a so-far unused source of variation amenable to genetic investigations and provides methods and a framework for establishing and exploiting a more diverse reference genome.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome, but failure to account for genetic variation leads to reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the reference flow alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance but with 14% of the memory footprint and 5.5 times the speed.


2018 ◽  
Vol 35 (15) ◽  
pp. 2654-2656 ◽  
Author(s):  
Guoli Ji ◽  
Wenbin Ye ◽  
Yaru Su ◽  
Moliang Chen ◽  
Guangzao Huang ◽  
...  

Summary: Alternative splicing (AS) is a well-established mechanism for increasing transcriptome and proteome diversity. However, detecting AS events and distinguishing among AS types in organisms without available reference genomes remains challenging. We developed a de novo approach called AStrap for AS analysis without using a reference genome. AStrap identifies AS events by extensive pair-wise alignments of transcript sequences and predicts AS types by a machine-learning model integrating more than 500 assembled features. We evaluated AStrap using collected AS events from reference genomes of rice and human as well as single-molecule real-time sequencing data from Amborella trichopoda. Results show that AStrap can identify many more AS events with comparable or higher accuracy than the competing method. AStrap also possesses a unique feature of predicting AS types, which achieves an overall accuracy of ∼0.87 for different species. Extensive evaluation of AStrap using different parameters, sample sizes and machine-learning models on different species also demonstrates the robustness and flexibility of AStrap. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources. Availability and implementation: AStrap is available for download at https://github.com/BMILAB/AStrap. Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s):  
Nae-Chyun Chen ◽  
Brad Solomon ◽  
Taher Mun ◽  
Sheila Iyer ◽  
Ben Langmead

Most sequencing data analyses start by aligning sequencing reads to a linear reference genome. But failure to account for genetic variation causes reference bias and confounding of results downstream. Other approaches replace the linear reference with structures like graphs that can include genetic variation, incurring major computational overhead. We propose the “reference flow” alignment method that uses multiple population reference genomes to improve alignment accuracy and reduce reference bias. Compared to the graph aligner vg, reference flow achieves a similar level of accuracy and bias avoidance, but with 14% of the memory footprint and 5.5 times the speed.


Author(s):  
Alaina Shumate ◽  
Aleksey V. Zimin ◽  
Rachel M. Sherman ◽  
Daniela Puiu ◽  
Justin M. Wagner ◽  
...  

Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences. 40 of the protein-coding genes in GRCh38 are missing from Ash1; however, all of these genes are members of multi-gene families for which Ash1 contains other copies. 11 genes appear on different chromosomes from their homologs in GRCh38. Alignment of DNA sequences from an unrelated Ashkenazi individual to Ash1 identified ~1 million fewer homozygous SNPs than alignment of those same sequences to the more-distant GRCh38 genome, illustrating one of the benefits of population-specific reference genomes.
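
The size and gene-content comparison in this abstract reduces to a few lines of arithmetic, worked through below as a quick check. All four input numbers are taken from the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract.
ash1_bases = 2_973_118_650
grch38_bases = 2_937_639_212
genes_annotated = 20_157
genes_near_identical = 19_563  # >99% identical to their GRCh38 counterparts

# Additional nucleotides in Ash1 relative to GRCh38.
extra_bases = ash1_bases - grch38_bases
print(extra_bases)  # 35479438

# The same difference as a percentage of the GRCh38 count.
print(round(100 * extra_bases / grch38_bases, 2))  # 1.21

# Share of annotated protein-coding genes that are >99% identical.
print(round(100 * genes_near_identical / genes_annotated, 2))  # 97.05

# Genes outside the >99% identity group.
print(genes_annotated - genes_near_identical)  # 594
```

In other words, Ash1 carries about 35.5 million more nucleotides than the GRCh38 figure quoted, and roughly 97% of its protein-coding genes are near-identical to their GRCh38 counterparts.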


2019 ◽  
Vol 20 (10) ◽  
pp. 2483 ◽  
Author(s):  
Veronika Kapustová ◽  
Zuzana Tulpová ◽  
Helena Toegelová ◽  
Petr Novák ◽  
Jiří Macas ◽  
...  

Reference genomes of important cereals, including barley, emmer wheat and bread wheat, were released recently. Their comparison with genome size estimates obtained by flow cytometry indicated that the assemblies represent no more than 88–98% of the complete genome. This work is aimed at identifying the missing parts in two cereal genomes and proposing techniques to make the assemblies more complete. We focused on tandemly organised repetitive sequences, known to be underrepresented in genome assemblies generated from short-read sequence data. Our study found arrays of three tandem repeats with unit sizes of 1242 to 2726 bp present in the bread wheat reference genome generated from short reads. However, this and another wheat genome assembly employing long PacBio reads failed to correctly integrate the 2726-bp repeat in the pseudomolecule context. This suggests that tandem repeats of this size, frequently incorporated in unassigned scaffolds, may contribute to the shrinking of pseudomolecules without reducing the size of the entire assembly. We demonstrate how this missing information may be added to the pseudomolecules with the aid of nanopore sequencing of individual BAC clones and optical mapping. Using the latter technique, we identified and localised a 470-kb long array of 45S ribosomal DNA absent from the reference genome of barley.


PLoS ONE ◽  
2020 ◽  
Vol 15 (11) ◽  
pp. e0242553
Author(s):  
Fernanda Sayuri do Nascimento ◽  
Milena Oliveira Suzuki ◽  
João Victor Taba ◽  
Vitoria Carneiro de Mattos ◽  
Leonardo Zumerkorn Pipek ◽  
...  

Background: The microbiota has been shown to play a role in several digestive tract diseases. Characterizing the biliary microbiota may therefore suggest avenues for studies of biomarkers, diagnoses, tests and therapies in hepatobiliopancreatic diseases. Methods: Bile samples will be collected from endoscopic retrograde cholangiopancreatography patients (case group) and living liver transplantation donors (control group). We will characterize the microbiome based on two types of sequence data: the V3/V4 regions of the 16S ribosomal RNA (rRNA) gene and total shotgun DNA. For 16S sequencing data, a standard 16S processing pipeline based on the Amplicon Sequence Variant concept and the qiime2 software package will be employed; for shotgun data, we will assemble the reads of each sample and obtain and analyze metagenome-assembled genomes. Results: The primary expected result of the study is the characterization of the specific composition of the biliary microbiota in disease and in health. In addition, the study seeks to demonstrate changes associated with illness and to identify possible disease biomarkers, diagnoses, interventions and therapies in hepatobiliopancreatic diseases. Trial registration: NCT04391426. Registered 18 May 2020, https://clinicaltrials.gov/ct2/show/NCT04391426.


2018 ◽  
Author(s):  
Roger Ros-Freixedes ◽  
Battagin Mara ◽  
Martin Johnsson ◽  
Gregor Gorjanc ◽  
Alan J Mileham ◽  
...  

Background: Inherent sources of error and bias that affect the quality of the sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data because low-coverage data has scant information and standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing there is a need to understand the impact of these errors and bias on resulting genotype calls. Results: We used a dataset of 26 pigs sequenced both at 2x with multiplexing and at 30x without multiplexing to show that index hopping and bias towards the reference allele due to alignment had little impact on genotype calls. However, pruning of alternative haplotypes supported by a number of reads below a predefined threshold, a default and desired step for removing potential sequencing errors in high-coverage data, introduced an unexpected bias towards the reference allele when applied to low-coverage data. This bias reduced best-guess genotype concordance of low-coverage sequence data by 19.0 absolute percentage points. Conclusions: We propose a simple pipeline to correct this bias and we recommend that users of low-coverage sequencing be wary of unexpected biases produced by tools designed for high-coverage sequencing.
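
Why read-count pruning hurts at 2x but not at 30x can be illustrated with a short calculation. This is a simplified sketch, not the authors' pipeline: it assumes a heterozygous site covered by exactly `depth` reads, each read sampling the alternative allele with probability 0.5, no sequencing errors, and a hypothetical pruning threshold of two supporting reads (the abstract does not state the threshold used).

```python
from math import comb


def prob_alt_pruned(depth, min_reads, alt_fraction=0.5):
    """Probability that the alternative allele at a heterozygous site is
    supported by fewer than `min_reads` reads and would thus be pruned.

    Assumes a fixed read depth and binomial sampling of alleles.
    """
    # Sum the binomial probabilities of seeing 0 .. min_reads - 1 alt reads.
    upper = min(min_reads, depth + 1)
    return sum(
        comb(depth, k) * alt_fraction**k * (1 - alt_fraction) ** (depth - k)
        for k in range(upper)
    )


# Hypothetical threshold: haplotypes seen in fewer than 2 reads are dropped.
for depth in (2, 30):
    print(depth, prob_alt_pruned(depth, min_reads=2))
```

Under these assumptions the alternative allele is lost at three out of four heterozygous sites at a depth of 2, while at a depth of 30 the loss is negligible, which is consistent with a filter that is harmless for high-coverage data but biases low-coverage calls towards the reference allele.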


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Yunfeng Wang ◽  
Haoliang Xue ◽  
Christine Pourcel ◽  
Yang Du ◽  
Daniel Gautheret

Background: The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. Results: We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. Conclusions: We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
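
The general idea behind a k-mer based, mapping-free comparison can be sketched as follows. This is not the 2-kupl implementation; it is a minimal illustration that assumes error-free reads on a single strand, ignores reverse complements, and uses invented function names, toy reads and a hypothetical minimum count of two.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def sample_specific_kmers(case_reads, control_reads, k, min_count=2):
    """Return k-mers seen at least `min_count` times in the case sample
    and never in the control sample."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    return {kmer for kmer, n in case.items() if n >= min_count and kmer not in control}


# Toy reads: the case sample carries an A>T substitution at position 7.
control_reads = ["TTACGTACGGATT"] * 2
case_reads = ["TTACGTTCGGATT"] * 2

specific = sample_specific_kmers(case_reads, control_reads, k=5)
print(sorted(specific))
```

A single substitution produces up to k case-specific k-mers, one for each window that overlaps the changed base, so such k-mers cluster around the variant and can be examined without aligning any read to a reference.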


Author(s):  
Zheyang Zhang ◽  
Sainan Zhang ◽  
Xin Li ◽  
Zhangxiang Zhao ◽  
Changjing Chen ◽  
...  

RNA-sequencing enables accurate and low-cost transcriptome-wide detection. However, expression estimates vary as reference genomes and gene annotations are updated, confounding existing expression-based prognostic signatures. Herein, a prognostic 9-gene pair signature (GPS) was applied to 197 patients with stage I lung adenocarcinoma, using the previous and latest data from The Cancer Genome Atlas (TCGA), which were processed with different reference genomes and annotations. For 9-GPS, 6.6% of patients exhibited discordant risk classifications between the two TCGA versions. Similar results were observed for other prognostic signatures, including IRGPI, 15-gene and ORACLE. We found that conflicting annotations for gene length and overlap were the major cause of their discordant risk classification. Therefore, we constructed a prognostic 40-GPS based on stable genes across GENCODE v20–v30 and validated it using public data of 471 stage I samples (log-rank P < 0.0010). Risk classification remained stable in RNA-sequencing data processed with the newest GENCODE v32 versus GENCODE v20–v30. Specifically, 40-GPS could predict survival for 30 stage I samples with formalin-fixed paraffin-embedded tissues (log-rank P = 0.0177). In conclusion, this method overcomes the vulnerability of existing prognostic signatures to reference genome and annotation updates. 40-GPS may offer individualized clinical applications due to its prognostic accuracy and classification stability.
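
How a change in expression estimates can flip a risk call is easy to see in a toy example. The abstract does not describe the classification rule, so the sketch below assumes a simple within-sample voting scheme: each gene pair votes for high risk when the first gene is expressed above the second, and a patient is called high risk once a minimum number of votes is reached. Gene names, expression values, the pairs and the threshold are all hypothetical.

```python
def classify_by_gene_pairs(expression, gene_pairs, min_votes):
    """Call a patient 'high' or 'low' risk from within-sample gene-pair ordering.

    Assumed rule: a pair (a, b) votes for high risk when expression[a] > expression[b].
    """
    votes = sum(1 for a, b in gene_pairs if expression[a] > expression[b])
    return "high" if votes >= min_votes else "low"


def discordance_rate(labels_v1, labels_v2):
    """Fraction of shared patients whose risk label differs between two versions."""
    shared = labels_v1.keys() & labels_v2.keys()
    if not shared:
        return float("nan")
    return sum(1 for p in shared if labels_v1[p] != labels_v2[p]) / len(shared)


# Hypothetical signature of three gene pairs; two votes mean high risk.
pairs = [("G1", "G2"), ("G3", "G4"), ("G5", "G6")]

# Same patient quantified under two annotation versions; only G3 changes.
patient_v1 = {"G1": 5.0, "G2": 3.0, "G3": 4.0, "G4": 1.0, "G5": 2.0, "G6": 6.0}
patient_v2 = dict(patient_v1, G3=0.5)

label_v1 = classify_by_gene_pairs(patient_v1, pairs, min_votes=2)
label_v2 = classify_by_gene_pairs(patient_v2, pairs, min_votes=2)
print(label_v1, label_v2)
print(discordance_rate({"P1": label_v1, "P2": "low"}, {"P1": label_v2, "P2": "low"}))
```

In this toy case a single gene whose estimate shifts between annotation versions reverses one pair's ordering and moves the patient across the threshold, which is the kind of instability that restricting a signature to genes stable across annotation releases is meant to avoid.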

