Mapping whole genome shotgun sequence and variant calling in mammalian species without their reference genomes

F1000Research ◽  
2013 ◽  
Vol 2 ◽  
pp. 244 ◽  
Author(s):  
Ted Kalbfleisch ◽  
Michael P. Heaton

Genomics research in mammals has produced reference genome sequences that are essential for identifying variation associated with disease. High-quality reference genome sequences are now available for humans, model species, and economically important agricultural animals. Comparisons between these species have provided unique insights into mammalian gene function. However, the number of species with reference genomes is small compared to the number needed for studying molecular evolutionary relationships in the tree of life. For example, among the even-toed ungulates there are approximately 300 species whose phylogenetic relationships have been calculated in the 10k trees project. Only six of these have reference genomes: cattle, swine, sheep, goat, water buffalo, and bison. Although reference sequences will eventually be developed for additional hoof stock, the resources in terms of time, money, infrastructure and expertise required to develop a quality reference genome may be unattainable for most species for at least another decade. In this work we mapped 35 Gb of next-generation sequence data from a Katahdin sheep to its own species’ reference genome (Ovis aries Oar3.1) and to that of a species that diverged 15 to 30 million years ago (Bos taurus UMD3.1). In total, 56% of reads covered 76% of UMD3.1 to an average depth of 6.8 reads per site. We identified 83 million variants, of which 78 million were homozygous and likely represent interspecies nucleotide differences. Excluding genome repeat regions and sex chromosomes, approximately 3.7 million heterozygous sites were identified in this animal vs. bovine UMD3.1, representing polymorphisms occurring in sheep. Of these, 41% could be readily mapped to orthologous positions in ovine Oar3.1, with 80% corroborated as heterozygous. These variant sites, identified via interspecies mapping, could be used for comparative genomics, disease association studies, and ultimately to understand mammalian gene function.
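The percentages in the abstract above can be turned into rough site counts. The sketch below is only a back-of-the-envelope calculation: it assumes that the 80% figure applies to the subset of sites that mapped to Oar3.1 (the abstract does not state this explicitly), and it treats "approximately 3.7 million" as an exact number, so the results are approximations.

```python
# Rough site counts implied by the abstract's percentages.
# Assumption: "80% corroborated" is taken relative to the 41% of sites
# that mapped to orthologous positions in Oar3.1.
heterozygous_sites = 3_700_000  # "approximately 3.7 million" in the abstract

# 41% of heterozygous sites mapped to orthologous positions in Oar3.1.
mapped_to_oar = heterozygous_sites * 41 // 100

# 80% of those mapped sites were corroborated as heterozygous.
corroborated = mapped_to_oar * 80 // 100

print(f"mapped to Oar3.1: about {mapped_to_oar:,} sites")
print(f"corroborated as heterozygous: about {corroborated:,} sites")
```

Under that reading, roughly 1.5 million of the heterozygous sites have an orthologous position in the sheep reference, and roughly 1.2 million of them are confirmed there.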

F1000Research ◽  
2014 ◽  
Vol 2 ◽  
pp. 244 ◽  
Author(s):  
Ted Kalbfleisch ◽  
Michael P. Heaton

Genomics research in mammals has produced reference genome sequences that are essential for identifying variation associated with disease. High-quality reference genome sequences are now available for humans, model species, and economically important agricultural animals. Comparisons between these species have provided unique insights into mammalian gene function. However, the number of species with reference genomes is small compared to the number needed for studying molecular evolutionary relationships in the tree of life. For example, among the even-toed ungulates there are approximately 300 species whose phylogenetic relationships have been calculated in the 10k trees project. Only six of these have reference genomes: cattle, swine, sheep, goat, water buffalo, and bison. Although reference sequences will eventually be developed for additional hoof stock, the resources in terms of time, money, infrastructure and expertise required to develop a quality reference genome may be unattainable for most species for at least another decade. In this work we mapped 35 Gb of next-generation sequence data from a Katahdin sheep to its own species’ reference genome (Ovis aries Oar3.1) and to that of a species that diverged 15 to 30 million years ago (Bos taurus UMD3.1). In total, 56% of reads covered 76% of UMD3.1 to an average depth of 6.8 reads per site. We identified 83 million variants, of which 78 million were homozygous and likely represent interspecies nucleotide differences. Excluding repeat regions and sex chromosomes, nearly 3.7 million heterozygous sites were identified in this animal vs. bovine UMD3.1, representing polymorphisms occurring in sheep. Of these, 41% could be readily mapped to orthologous positions in ovine Oar3.1, with 80% corroborated as heterozygous. These variant sites, identified via interspecies mapping, could be used for comparative genomics, disease association studies, and ultimately to understand mammalian gene function.


2015 ◽  
Author(s):  
Farzana Rahman ◽  
Mehedi Hassan ◽  
Alona Kryshchenko ◽  
Inna Dubchak ◽  
Tatiana V Tatarinova ◽  
...  

In the last decade a number of algorithms and associated software were developed to align next generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the pre-computed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.


2020 ◽  
Author(s):  
Inácio Gomes Medeiros ◽  
André Salim Khayat ◽  
Beatriz Stransky ◽  
Sidney Emanuel Batista dos Santos ◽  
Paulo Pimentel de Assumpção ◽  
...  

This protocol describes the building of a database of SARS-CoV-2 targets for siRNA approaches. Starting from the virus reference genome, we will derive sequences 18 to 21 nt long and verify their similarity against the human genome and coding and non-coding transcriptome, as well as genomes from related viruses. We will also calculate a set of thermodynamic features for those sequences and will infer their efficiencies using three different predictors. The protocol has two main phases: in the first, we align sequences against reference genomes; in the second, we extract the features. The duration of the first phase varies with the computational power of the running machine and the number of reference genomes. The second phase takes about thirty minutes, also depending on the number of cores of the running machine. The constructed database aims to speed up the design process by providing a broad set of possible SARS-CoV-2 target sequences and siRNA sequences.
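The first step of the protocol above, deriving every 18 to 21 nt subsequence from a reference genome, can be illustrated with a sliding window. This is a minimal sketch and not the authors' code: the function name `candidate_targets` and the toy sequence are hypothetical, and the alignment and feature-extraction phases are not shown.

```python
from typing import Iterator, Tuple


def candidate_targets(
    genome: str, min_len: int = 18, max_len: int = 21
) -> Iterator[Tuple[int, int, str]]:
    """Yield (start, length, subsequence) for every window of min_len..max_len nt.

    `start` is a 0-based offset into `genome`.
    """
    genome = genome.upper()
    for length in range(min_len, max_len + 1):
        # A sequence of n nt contains n - length + 1 windows of a given length.
        for start in range(len(genome) - length + 1):
            yield start, length, genome[start:start + length]


# Hypothetical 25-nt toy sequence, used only to exercise the function;
# it is not taken from the SARS-CoV-2 reference genome.
toy = "ATTAAAGGTTTATACCTTCCCAGGT"
windows = list(candidate_targets(toy))
print(len(windows), "candidate windows")
```

Each window would then be aligned against the human genome, transcriptome, and related viral genomes, as the abstract describes.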


2019 ◽  
Vol 20 (10) ◽  
pp. 2483 ◽  
Author(s):  
Veronika Kapustová ◽  
Zuzana Tulpová ◽  
Helena Toegelová ◽  
Petr Novák ◽  
Jiří Macas ◽  
...  

Reference genomes of important cereals, including barley, emmer wheat and bread wheat, were released recently. Their comparison with genome size estimates obtained by flow cytometry indicated that the assemblies represent not more than 88–98% of the complete genome. This work is aimed at identifying the missing parts in two cereal genomes and proposing techniques to make the assemblies more complete. We focused on tandemly organised repetitive sequences, known to be underrepresented in genome assemblies generated from short-read sequence data. Our study found arrays of three tandem repeats with unit sizes of 1242 to 2726 bp present in the bread wheat reference genome generated from short reads. However, this and another wheat genome assembly employing long PacBio reads failed to integrate the 2726-bp repeat correctly in the pseudomolecule context. This suggests that tandem repeats of this size, frequently incorporated in unassigned scaffolds, may contribute to shrinking of pseudomolecules without reducing the size of the entire assembly. We demonstrate how this missing information may be added to the pseudomolecules with the aid of nanopore sequencing of individual BAC clones and optical mapping. Using the latter technique, we identified and localised a 470-kb long array of 45S ribosomal DNA absent from the reference genome of barley.


2017 ◽  
Author(s):  
Erik Garrison ◽  
Jouni Sirén ◽  
Adam M. Novak ◽  
Glenn Hickey ◽  
Jordan M. Eizenga ◽  
...  

Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.
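To make the idea of a sequence graph concrete, the toy example below encodes a single SNP as a small graph and enumerates the sequences its paths spell. It is a deliberately simplified sketch: the graph is a plain directed acyclic graph, node orientation (the "bidirected" part) is ignored, and none of the names correspond to vg's actual data model or API.

```python
from typing import Dict, List

# Hypothetical toy graph: node id -> sequence label.
# Nodes 2 and 3 are the two alleles (T or C) of one SNP.
nodes: Dict[int, str] = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}

# node id -> list of successor node ids (forward orientation only).
edges: Dict[int, List[int]] = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spelled_sequences(node: int) -> List[str]:
    """Return every sequence spelled by a path from `node` to a sink node."""
    successors = edges[node]
    if not successors:
        return [nodes[node]]
    return [
        nodes[node] + tail
        for nxt in successors
        for tail in spelled_sequences(nxt)
    ]


haplotypes = spelled_sequences(1)
print(haplotypes)
```

A linear reference would store only one of these two sequences; the graph stores the shared flanks once and both alleles side by side, which is the property the abstract credits with reducing mapping bias.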


2019 ◽  
Vol 36 (8) ◽  
pp. 2569-2571 ◽  
Author(s):  
Cinta Pegueroles ◽  
Verónica Mixão ◽  
Laia Carreté ◽  
Manu Molina ◽  
Toni Gabaldón

Summary: An increasing number of phased (i.e. with resolved haplotypes) reference genomes are available. However, most genetic variant calling tools do not explicitly account for haplotype structure. Here, we present HaploTypo, a pipeline tailored to resolve haplotypes in genetic variation analyses. HaploTypo infers the haplotype correspondence for each heterozygous variant called on a phased reference genome. Availability and implementation: HaploTypo is implemented in Python 2.7 and Python 3.5, and is freely available at https://github.com/gabaldonlab/haplotypo, and as a Docker image. Supplementary information: Supplementary data are available at Bioinformatics online.


2015 ◽  
Author(s):  
Farzana Rahman ◽  
Mehedi Hassan ◽  
Alona Martin Kryshchenko ◽  
Inna Dubchak ◽  
Nikolai Nickolai Alexandrov ◽  
...  

In the last decade a number of algorithms and associated software were developed to align next generation sequencing (NGS) reads to relevant reference genomes. The results of these programs may vary significantly, especially when the NGS reads contain mutations not found in the reference genome. Yet there is no standard way to compare these programs and assess their biological relevance. We propose a benchmark to assess the accuracy of short-read mapping based on the precomputed global alignment of closely related genome sequences. In this paper we outline the method and also present a short report of an experiment performed on five popular alignment tools.


2018 ◽  
Author(s):  
Danang Crysnanto ◽  
Christine Wurmser ◽  
Hubert Pausch

Background: The genotyping of sequence variants typically involves as a first step the alignment of sequencing reads to a linear reference genome. Because a linear reference genome represents only a small fraction of sequence variation within a species, reference allele bias may occur at highly polymorphic or diverged regions of the genome. Graph-based methods make it possible to compare sequencing reads to a variation-aware genome graph that incorporates non-redundant DNA sequences that segregate within a species. We compared the accuracy and sensitivity of graph-based sequence variant genotyping using the Graphtyper software to two widely used methods, i.e., GATK and SAMtools, that rely on linear reference genomes, using whole-genome sequencing data of 49 Original Braunvieh cattle. Results: We discovered 21,140,196, 20,262,913 and 20,668,459 polymorphic sites using GATK, Graphtyper, and SAMtools, respectively. Comparisons between sequence variant and microarray-derived genotypes showed that Graphtyper outperformed both GATK and SAMtools in terms of genotype concordance, non-reference sensitivity, and non-reference discrepancy. The sequence variant genotypes that were obtained using Graphtyper had the lowest number of Mendelian inconsistencies for both SNPs and indels in nine sire-son pairs with sequence data. Genotype phasing and imputation using the Beagle software improved the quality of the sequence variant genotypes for all tools evaluated, particularly for animals that had been sequenced at low coverage. Following imputation, the concordance between sequence- and microarray-derived genotypes was almost identical for the three methods evaluated, i.e., 99.32, 99.46, and 99.24% for GATK, Graphtyper, and SAMtools, respectively. Variant filtration based on commonly used criteria improved the genotype concordance slightly but it also decreased sensitivity. Graphtyper required considerably more computing resources than SAMtools but fewer than GATK.
Conclusions: Sequence variant genotyping using Graphtyper is accurate, sensitive and computationally feasible in cattle. Graph-based methods enable sequence variant genotyping from variation-aware reference genomes that may incorporate cohort-specific sequence variants, which is not possible with the current implementations of state-of-the-art methods that rely on linear reference genomes.
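The comparison above rests on genotype concordance between sequence-derived and microarray-derived calls. The sketch below shows one common way to compute such a figure, as the fraction of shared, non-missing sites where both genotypes agree. It is an assumption that this matches the definition used in the study; the abstract does not give formulas, and the other two metrics (non-reference sensitivity and non-reference discrepancy) are not reproduced here. Genotypes are coded as alternate-allele counts (0, 1, 2), with `None` for a missing call, and all site names are hypothetical.

```python
from typing import Dict, Optional

Genotype = Optional[int]  # 0, 1 or 2 alternate alleles; None if missing


def genotype_concordance(
    seq_calls: Dict[str, Genotype], array_calls: Dict[str, Genotype]
) -> float:
    """Fraction of sites called in both sets whose genotypes are identical."""
    shared = [
        site
        for site in seq_calls
        if seq_calls[site] is not None and array_calls.get(site) is not None
    ]
    if not shared:
        raise ValueError("no sites with calls in both sets")
    matches = sum(seq_calls[site] == array_calls[site] for site in shared)
    return matches / len(shared)


# Hypothetical toy data: four sites, one of them missing on the array.
seq = {"snp1": 0, "snp2": 1, "snp3": 2, "snp4": 1}
array = {"snp1": 0, "snp2": 1, "snp3": 1, "snp4": None}
concordance = genotype_concordance(seq, array)
print(f"concordance: {concordance:.2%}")
```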


2021 ◽  
Author(s):  
Jeremie S. Kim ◽  
Can Firtina ◽  
Meryem Banu Cavlak ◽  
Damla Senol Cali ◽  
Nastaran Hajinazar ◽  
...  

As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. Code availability: AirLift source code and readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
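The two-step remapping scheme that the abstract attributes to existing tools (find regions shared by both references, then shift read positions that fall inside them) can be sketched as a coordinate lookup. This is a generic liftover-style illustration, not AirLift's algorithm: the block table, its coordinates, and the function name `lift` are all hypothetical. Returning `None` models the drawback the abstract describes, where a read has no similar region in the new reference and would need full mapping.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical table of regions shared by both references:
# (old_start, old_end_exclusive, new_start), sorted by old_start, non-overlapping.
blocks: List[Tuple[int, int, int]] = [
    (0, 100, 0),
    (150, 300, 120),
    (400, 500, 310),
]
block_starts = [b[0] for b in blocks]


def lift(pos: int, read_len: int) -> Optional[int]:
    """Translate a read's start position from the old to the new reference.

    Returns None when the read is not fully contained in a shared block,
    i.e. when it would have to be mapped to the new reference from scratch.
    """
    # Find the last block whose start is <= pos.
    i = bisect_right(block_starts, pos) - 1
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos + read_len > old_end:
        return None
    return new_start + (pos - old_start)


print(lift(160, 50))  # read inside the second shared block
print(lift(90, 50))   # read runs past the end of the first block
print(lift(320, 10))  # read starts in a gap between blocks
```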

