Fast and Accurate Genomic Analyses using Genome Graphs

Abstract The human reference genome serves as the foundation for genomics by providing a scaffold for alignment of sequencing reads, but currently only reflects a single consensus haplotype, which impairs read alignment and downstream analysis accuracy. Reference genome structures incorporating known genetic variation have been shown to improve the accuracy of genomic analyses, but have so far remained computationally prohibitive for routine large-scale use. Here we present a graph genome implementation that enables read alignment across 2,800 diploid genomes encompassing 12.6 million SNPs and 4.0 million indels. Our Graph Genome Pipeline requires 6.5 hours to process a 30x coverage WGS sample on a system with 36 CPU cores compared with 11 hours required by the GATK Best Practices pipeline. Using complementary benchmarking experiments based on real and simulated data, we show that using a graph genome reference improves read mapping sensitivity and produces a 0.5% increase in variant calling recall, or about 20,000 additional variants being detected per sample, while variant calling specificity is unaffected. Structural variations (SVs) incorporated into a graph genome can be genotyped accurately under a unified framework. Finally, we show that iterative augmentation of graph genomes yields incremental gains in variant calling accuracy. Our implementation is a significant advance towards fulfilling the promise of graph genomes to radically enhance the scalability and accuracy of genomic analyses.
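
As a rough sanity check on the figures quoted in this abstract, the following Python sketch works through the arithmetic. It assumes that the 0.5% recall gain is an absolute gain measured against the full set of true variants in a sample; the abstract does not state this, and the variable names are illustrative only.

```python
# Figures quoted in the abstract.
graph_hours = 6.5        # Graph Genome Pipeline, 30x WGS sample, 36 CPU cores
gatk_hours = 11.0        # GATK Best Practices pipeline, same sample
recall_gain = 0.005      # 0.5% increase in recall (ASSUMED absolute, see lead-in)
extra_variants = 20_000  # additional variants detected per sample

# Relative runtime of the two pipelines.
speedup = gatk_hours / graph_hours

# If the recall gain is absolute, the number of true variants per sample
# follows from extra_variants = recall_gain * true_variants.
implied_true_variants = round(extra_variants / recall_gain)

print(f"Speedup over GATK Best Practices: {speedup:.2f}x")
print(f"Implied true variants per sample: {implied_true_variants:,}")
```

Under that assumption the two quoted numbers are mutually consistent with roughly 4 million true variants per sample.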

Improving read alignment through the generation of alternative reference via iterative strategy

Scientific Reports ◽

10.1038/s41598-020-74526-7 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Lina Bu ◽

Qi Wang ◽

Wenjin Gu ◽

Ruifei Yang ◽

Di Zhu ◽

...

Keyword(s):

Variant Calling ◽

Optimal Number ◽

Reference Sequence ◽

Hardware Platform ◽

Variable Regions ◽

Read Alignment ◽

Reference Sequences ◽

Downstream Analysis ◽

Reference Genomes ◽

Number Of Iterations

Abstract There is generally one standard reference sequence for each species. When extensive variations exist in other breeds of the species, it can lead to ambiguous alignment and inaccurate variant calling and, in turn, compromise the accuracy of downstream analysis. Here, with the help of the FPGA hardware platform, we present a method that generates an alternative reference via an iterative strategy to improve the read alignment for breeds that are genetically distant from the reference breed. Compared to the published reference genomes, by using the alternative reference sequences we built, the mapping rates of Chinese indigenous pigs and chickens were improved by 0.61–1.68% and 0.09–0.45%, respectively. These sequences also enable researchers to recover highly variable regions that could be missed using public reference sequences. We also determined that the optimal numbers of iterations needed to generate alternative reference sequences were seven and five for pigs and chickens, respectively. Our results show that, for genetically distant breeds, generating an alternative reference sequence can facilitate read alignment and variant calling and improve the accuracy of downstream analyses.

Spacemake: processing and analysis of large-scale spatial transcriptomics data

10.1101/2021.11.07.467598 ◽

2021 ◽

Author(s):

Tamas Ryszard Sztanka-Toth ◽

Marvin Jens ◽

Nikos Karaiskos ◽

Nikolaus Rajewsky

Keyword(s):

Large Scale ◽

Modular Design ◽

State Of The Art ◽

Sequencing Data ◽

Unified Framework ◽

Tissue Sections ◽

Long Reads ◽

Rna Biology ◽

Transcriptomics Data ◽

Downstream Analysis

Spatial sequencing methods are gaining popularity within RNA biology studies. State-of-the-art techniques can read mRNA expression levels from tissue sections and at the same time register information about the original locations of the molecules in the tissue. The resulting datasets are processed and analyzed by accompanying software which, however, is incompatible across inputs from different technologies. Here, we present spacemake, a modular, robust and scalable spatial transcriptomics pipeline built in Snakemake and Python. Spacemake is designed to handle all major spatial transcriptomics datasets and can be readily configured to run on other technologies. It can process and analyze several samples in parallel, even if they stem from different experimental methods. Spacemake's unified framework enables reproducible data processing from raw sequencing data to automatically generated downstream analysis reports. Moreover, spacemake is built with a modular design and offers additional functionality such as sample merging, saturation analysis and analysis of long reads as separate modules. In addition, spacemake employs novoSpaRc to integrate spatial and single-cell transcriptomics data, resulting in increased gene counts for the spatial dataset. Spacemake is open-source, extendable and can be readily integrated with existing computational workflows.

AirLift: A Fast and Comprehensive Technique for Remapping Alignments between Reference Genomes

10.1101/2021.02.16.431517 ◽

2021 ◽

Author(s):

Jeremie S. Kim ◽

Can Firtina ◽

Meryem Banu Cavlak ◽

Damla Senol Cali ◽

Nastaran Hajinazar ◽

...

Keyword(s):

Reference Genome ◽

State Of The Art ◽

Variant Calling ◽

Ground Truth ◽

Data Set ◽

C Elegans ◽

A Genome ◽

Downstream Analysis ◽

Similar Accuracy ◽

Reference Genomes

Abstract As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable more sensitive read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping) by 1) identifying regions that appear similarly between two references and 2) updating the mapping location of reads that map to any of the identified regions in the old reference to the corresponding similar region in the new reference. The main drawback of existing approaches is that if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations (i.e., coding regions in a genome) are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads (out of the entire read set) that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genome versions by 6.7×, 6.6×, and 2.8× for large (human), medium (C. elegans), and small (yeast) reference genomes, respectively. We validate our remapping results with GATK and find that AirLift provides similar accuracy in identifying ground truth SNP and INDEL variants as the baseline of fully mapping a read set. Code Availability: AirLift source code and a readme describing how to reproduce our results are available at https://github.com/CMU-SAFARI/AirLift.
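
The two-step remapping idea described in this abstract (find regions shared by both references, then shift read positions that fall inside them) can be illustrated with a minimal Python sketch. This is not AirLift's implementation: the `Block` type, the function names, and the rule that a read must fit entirely inside one shared block are assumptions made for illustration. Reads that cannot be shifted are collected for full mapping, which is the case the abstract identifies as the weak point of existing tools.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional


class Block(NamedTuple):
    """A region present in both references (hypothetical representation)."""
    old_start: int  # 0-based, inclusive, on the old reference
    old_end: int    # exclusive, on the old reference
    new_start: int  # start of the matching region on the new reference


def lift_position(blocks: list[Block], pos: int, read_len: int) -> Optional[int]:
    """Return the read's start on the new reference, or None if it cannot be lifted.

    `blocks` must be sorted by old_start and non-overlapping.
    """
    starts = [b.old_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    # Assumption: only lift reads that lie entirely within one shared block.
    if pos + read_len <= block.old_end:
        return block.new_start + (pos - block.old_start)
    return None


def remap(blocks: list[Block], reads: dict[str, tuple[int, int]]):
    """Split reads into those lifted directly and those needing full mapping."""
    lifted: dict[str, int] = {}
    needs_full_mapping: list[str] = []
    for name, (pos, length) in reads.items():
        new_pos = lift_position(blocks, pos, length)
        if new_pos is None:
            needs_full_mapping.append(name)
        else:
            lifted[name] = new_pos
    return lifted, needs_full_mapping
```

A real tool would also handle strand, partial overlaps, and sequence differences inside a block; the sketch only shows the coordinate bookkeeping.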

Performance Assessment of Variant Calling Pipelines using Human Whole Exome Sequencing and Simulated data

10.1101/359109 ◽

2018 ◽

Author(s):

Manojkumar Kumaran ◽

Umadevi Subramanian ◽

Bharanidharan Devarajan

Keyword(s):

Exome Sequencing ◽

Whole Exome Sequencing ◽

Reference Genome ◽

Variant Calling ◽

Simulated Data ◽

Variant Call ◽

Human Reference Genome ◽

Indel Detection ◽

Whole Exome ◽

Clinical Variants

Abstract Whole exome sequencing (WES) is a time-consuming technology for identifying clinical variants, and it demands accurate variant calling tools. Currently available tools compromise accuracy when predicting specific types of variants. Thus, it is important to find the best combination of aligner and variant caller for detecting SNVs and InDels separately. Moreover, many important aspects of InDel detection are overlooked when comparing the performance of tools. One such aspect is the detection of InDels with respect to base pair length. To assess the performance of variant callers (especially for InDels) in combination with different aligners, 20 automated pipelines were developed and evaluated using the gold reference variant dataset (NA12878) from the Genome in a Bottle (GiaB) consortium for human whole exome sequencing. Additionally, simulated exome data from two human reference genome sequences (GRCh37 and GRCh38) were used to compare the performance of the pipelines. By analyzing various performance metrics, we observed that the BWA and Novoalign aligners performed better with the DeepVariant and SAMtools callers for detecting SNVs, and with DeepVariant and GATK for InDels. Altogether, DeepVariant with BWA and Novoalign performed best. Further, we showed that merging the top performing pipelines improved the accuracy of the variant call set. Collectively, this study should help investigators to improve the sensitivity and accuracy of detecting specific variants.
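
Comparing a pipeline's calls against a gold reference set, separately for SNVs and InDels, can be sketched in a few lines of Python. The abstract does not say which metrics or matching rules the authors used, so this is only an illustration: it assumes exact matching on (chromosome, position, REF, ALT), treats any variant whose REF and ALT are both one base long as an SNV, and reports precision, recall and F1. All function names are hypothetical.

```python
Variant = tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def is_snv(variant: Variant) -> bool:
    """Single-base REF and ALT; everything else is treated as an InDel here."""
    _, _, ref, alt = variant
    return len(ref) == 1 and len(alt) == 1


def evaluate_calls(truth: set[Variant], calls: set[Variant]) -> dict[str, float]:
    """Precision, recall and F1 using exact-match comparison (an assumption)."""
    tp = len(truth & calls)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


def evaluate_by_type(truth: set[Variant], calls: set[Variant]) -> dict[str, dict[str, float]]:
    """Evaluate SNVs and InDels separately."""
    return {
        "snv": evaluate_calls({v for v in truth if is_snv(v)},
                              {v for v in calls if is_snv(v)}),
        "indel": evaluate_calls({v for v in truth if not is_snv(v)},
                                {v for v in calls if not is_snv(v)}),
    }
```

The same split could be extended to bin InDels by length (the absolute difference between the REF and ALT lengths), which is the aspect the abstract singles out.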

GW-CALL: Accurate Genome-Wide Variant Caller

10.1101/079905 ◽

2016 ◽

Cited By ~ 1

Author(s):

M. Ghareghani ◽

S. A. Motahari ◽

S. Khazaei ◽

M. Tavassolipour

Keyword(s):

Reference Genome ◽

Homo Sapiens ◽

Association Studies ◽

Variant Calling ◽

Poor Performance ◽

Genome Wide Association Studies ◽

Main Challenge ◽

Genome Wide ◽

A Genome ◽

Downstream Analysis

Abstract The main challenge in reliable variant calling using DNA reads is to extract information from reads mappable to multiple locations on the reference genome. Conventional approaches ignore these reads and rely on reads mappable uniquely to the reference genome. These approaches fail to perform satisfactorily in variant calling within repeat regions, which are abundant in many species including Homo sapiens. This, in turn, lowers the reliability of any downstream analysis, including poor performance in genome-wide association studies. GW-CALL, a fast and accurate variant caller, is proposed. GW-CALL exploits information of all reads in a genome-wide decision making process. In particular, it partitions the genome into several independent regions called clusters and incorporates an efficient algorithm to use all reads belonging to a cluster in calling variants within that cluster. Availability: GW-CALL is implemented in C++ and is freely available at URL: brl.ce.sharif.edu/gwcall.

Large-Scale Uniform Analysis of Cancer Whole Genomes in Multiple Computing Environments

10.1101/161638 ◽

2017 ◽

Cited By ~ 12

Author(s):

Christina K. Yung ◽

Brian D. O’Connor ◽

Sergei Yakneen ◽

Junjun Zhang ◽

Kyle Ellrott ◽

...

Keyword(s):

Large Scale ◽

Variant Calling ◽

Sequencing Data ◽

Coding Regions ◽

Whole Genomes ◽

Portable Software ◽

Working Groups ◽

Genome Consortium ◽

Data Portal ◽

Downstream Analysis

Abstract The International Cancer Genome Consortium (ICGC)’s Pan-Cancer Analysis of Whole Genomes (PCAWG) project aimed to categorize somatic and germline variations in both coding and non-coding regions in over 2,800 cancer patients. To provide this dataset to the research working groups for downstream analysis, the PCAWG Technical Working Group marshalled ~800TB of sequencing data from distributed geographical locations; developed portable software for uniform alignment, variant calling, artifact filtering and variant merging; performed the analysis in a geographically and technologically disparate collection of compute environments; and disseminated high-quality validated consensus variants to the working groups. The PCAWG dataset has been mirrored to multiple repositories and can be located using the ICGC Data Portal. The PCAWG workflows are also available as Docker images through Dockstore enabling researchers to replicate our analysis on their own data.

An Invariants-based Method for Efficient Identification of Hybrid Species From Large-scale Genomic Data

10.1101/034348 ◽

2015 ◽

Cited By ~ 8

Author(s):

Laura Kubatko ◽

Julia Chifman

Keyword(s):

Large Scale ◽

Sequence Data ◽

Simulated Data ◽

Species Tree ◽

Computational Time ◽

Putative Hybrid ◽

Hybrid Speciation ◽

Unified Framework ◽

Hybrid Species ◽

Genome Scale

Coalescent-based species tree inference has become widely used in the analysis of genome-scale multilocus and SNP datasets when the goal is inference of a species-level phylogeny. However, numerous evolutionary processes are known to violate the assumptions of a coalescence-only model and complicate inference of the species tree. One such process is hybrid speciation, in which a species shares its ancestry with two distinct species. Although many methods have been proposed to detect hybrid speciation, only a few have considered both hybridization and coalescence in a unified framework, and these are generally limited to the setting in which putative hybrid species must be identified in advance. Here we propose a method that can examine genome-scale data for a large number of taxa and detect those taxa that may have arisen via hybridization, as well as their potential "parental" taxa. The method is based on a model that considers both coalescence and hybridization together, and uses phylogenetic invariants to construct a test that scales well in terms of computational time for both the number of taxa and the amount of sequence data. We test the method using simulated data for up to 20 taxa and 100,000 bp, and find that the method accurately identifies both recent and ancient hybrid species in less than 30 seconds. We apply the method to two empirical datasets, one composed of Sistrurus rattlesnakes for which hybrid speciation is not supported by previous work, and one consisting of several species of Heliconius butterflies for which some evidence of hybrid speciation has been previously found.

FUSTr: a tool to find gene families under selection in transcriptomes

PeerJ ◽

10.7717/peerj.4234 ◽

2018 ◽

Vol 6 ◽

pp. e4234 ◽

Cited By ~ 6

Author(s):

T. Jeffrey Cole ◽

Michael S. Brewer

Keyword(s):

Molecular Evolution ◽

Positive Selection ◽

High Performance ◽

Large Scale ◽

Simulated Data ◽

Gene Families ◽

Strong Positive Selection ◽

Transcriptomic Data ◽

Downstream Analysis ◽

User Friendly

Background The recent proliferation of large amounts of biodiversity transcriptomic data has resulted in an ever-expanding need for scalable and user-friendly tools capable of answering large scale molecular evolution questions. FUSTr identifies gene families involved in the process of adaptation. The tool finds genes under strong positive selection in transcriptomic datasets and automatically detects isoform designation patterns in transcriptome assemblies to maximize phylogenetic independence in downstream analysis. Results When applied to previously studied spider transcriptomic data as well as simulated data, FUSTr successfully grouped coding sequences into proper gene families and correctly identified those under strong positive selection in relatively little time. Conclusions FUSTr provides a useful tool for novice bioinformaticians to characterize the molecular evolution of organisms throughout the tree of life using large transcriptomic biodiversity datasets and can utilize multi-processor high-performance computational facilities.

LARGE-SCALE GENOMIC ANALYSES REVEAL EXTENSIVE DIVERSITY AMONGST PNEUMOCOCCAL CAPSULAR LOCUS SEQUENCES AND PUTATIVELY NOVEL SEROTYPES

10.26226/morressier.5731f0d5d462b8029237fa18 ◽

2016 ◽

Author(s):

Andries van Tonder

Keyword(s):

Large Scale ◽

Genomic Analyses

3DIV update for 2021: a comprehensive resource of 3D genome and 3D cancer genome

Nucleic Acids Research ◽

10.1093/nar/gkaa1078 ◽

2020 ◽

Vol 49 (D1) ◽

pp. D38-D46

Author(s):

Kyukwang Kim ◽

Insu Jang ◽

Mooyoung Kim ◽

Jinhyuk Choi ◽

Min-Seo Kim ◽

...

Keyword(s):

Cell Line ◽

Large Scale ◽

Three Dimensional ◽

Cancer Cell Line ◽

Cancer Genome ◽

Structural Variations ◽

3D Genome ◽

Tightly Coupled ◽

Regulatory Effects ◽

The Impact

Abstract Three-dimensional (3D) genome organization is tightly coupled with gene regulation in various biological processes and diseases. In cancer, various types of large-scale genomic rearrangements can disrupt the 3D genome, leading to oncogenic gene expression. However, unraveling the pathogenicity of the 3D cancer genome remains a challenge since closer examinations have been greatly limited due to the lack of appropriate tools specialized for disorganized higher-order chromatin structure. Here, we updated a 3D-genome Interaction Viewer and database named 3DIV by uniformly processing ∼230 billion raw Hi-C reads to expand our contents to the 3D cancer genome. The updates of 3DIV are listed as follows: (i) the collection of 401 samples including 220 cancer cell line/tumor Hi-C data, 153 normal cell line/tissue Hi-C data, and 28 promoter capture Hi-C data, (ii) the live interactive manipulation of the 3D cancer genome to simulate the impact of structural variations and (iii) the reconstruction of Hi-C contact maps by user-defined chromosome order to investigate the 3D genome of the complex genomic rearrangement. In summary, the updated 3DIV will be the most comprehensive resource to explore the gene regulatory effects of both the normal and cancer 3D genome. ‘3DIV’ is freely available at http://3div.kr.
