Making the most of RNA-seq: Pre-processing sequencing data with Opossum for reliable SNP variant detection

2017 ◽  
Vol 2 ◽  
pp. 6 ◽  
Author(s):  
Laura Oikkonen ◽  
Stefano Lise

Identifying variants from RNA-seq (transcriptome sequencing) data is a cost-effective and versatile alternative to whole-genome sequencing. However, current variant callers do not generally behave well with RNA-seq data due to reads encompassing intronic regions. We have developed a software programme called Opossum to address this problem. Opossum pre-processes RNA-seq reads prior to variant calling, and although it has been designed to work specifically with Platypus, it can be used equally well with other variant callers such as GATK HaplotypeCaller. In this work, we show that using Opossum in conjunction with either Platypus or GATK HaplotypeCaller maintains precision and improves sensitivity for SNP detection compared to the GATK Best Practices pipeline. In addition, using it in combination with Platypus offers a substantial reduction in run times compared to the GATK pipeline, making it ideal when time or computational resources are limited.
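Opossum's core pre-processing step is to split reads that span introns into their exonic parts before they reach the variant caller. A minimal sketch of that idea, driven by the read's CIGAR string; the function name and interface are illustrative, not Opossum's actual API:

```python
import re

def split_spliced_read(start, cigar, seq):
    """Split a spliced RNA-seq read into exonic segments at N (skipped
    intron) CIGAR operations. Returns a list of (ref_start, sequence)
    pairs, one per exonic segment."""
    segments = []
    ref_pos, seq_pos = start, 0
    cur_start, cur_seq = start, []
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":          # aligned bases: consume query and reference
            cur_seq.append(seq[seq_pos:seq_pos + length])
            seq_pos += length
            ref_pos += length
        elif op in "IS":         # insertion / soft clip: consume query only
            seq_pos += length
        elif op == "D":          # deletion: consume reference only
            ref_pos += length
        elif op == "N":          # intron: close the segment, start a new one
            if cur_seq:
                segments.append((cur_start, "".join(cur_seq)))
            ref_pos += length
            cur_start, cur_seq = ref_pos, []
    if cur_seq:
        segments.append((cur_start, "".join(cur_seq)))
    return segments
```

For example, a read at position 100 with CIGAR `4M6N4M` yields two exonic segments, one at 100 and one at 110, which downstream callers can then treat as ordinary genomic alignments.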


2019 ◽  
Vol 20 (S22) ◽  
Author(s):  
Hang Zhang ◽  
Ke Wang ◽  
Juan Zhou ◽  
Jianhua Chen ◽  
Yizhou Xu ◽  
...  

Abstract
Background: Variant calling and refinement from whole-genome/exome sequencing data is a fundamental task in genomics studies. Due to the limited accuracy of NGS sequencing and variant callers, IGV-based manual review is required for further false-positive variant filtering, which incurs substantial labor and time and results in high inter- and intra-lab variability.
Results: To overcome the limitations of manual review, we developed a novel approach for Variant Filter by Automated Scoring based on Tagged-signature (VariFAST), as well as a pipeline integrating the GATK Best Practices with VariFAST, which can be easily used for high-quality variant detection from raw data. Using the BAM and VCF files, VariFAST calculates a v-score for each variant as a weighted sum of metrics associated with false-positive variation, and marks tags in a manner highly consistent with manual review. We validated the performance of VariFAST for germline variant filtering using benchmark sequencing data from GIAB, and for somatic variant filtering using sequencing data from both malignant carcinomas and benign adenomas. VariFAST also includes a predictive model trained with the XGBoost algorithm for germline variant refinement, which achieves better MCC and AUC than the state-of-the-art VQSR, particularly for INDEL filtering.
Conclusion: VariFAST helps researchers efficiently and conveniently filter false-positive variants, both germline and somatic, in NGS data analysis. The VariFAST source code and the pipeline integrating with GATK Best Practices are available at https://github.com/bioxsjtu/VariFAST.
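The v-score idea can be sketched as a weighted sum over tags attached to a variant. The weights, tag names, and threshold below are hypothetical stand-ins, not VariFAST's actual metrics:

```python
# Hypothetical tag weights; VariFAST derives its own metrics from BAM/VCF data.
WEIGHTS = {
    "low_depth": 0.3,
    "strand_bias": 0.25,
    "low_base_quality": 0.2,
    "repeat_region": 0.25,
}

def v_score(tags):
    """Sum the weights of the false-positive-associated tags on a variant."""
    return sum(WEIGHTS[t] for t in tags if t in WEIGHTS)

def review_call(tags, threshold=0.5):
    """Flag a variant as a likely false positive when its v-score is high,
    mimicking the pass/fail outcome of a manual review."""
    return "fail" if v_score(tags) >= threshold else "pass"
```

A variant tagged with both low depth and strand bias would score 0.55 and fail review under this toy threshold, while a single mild tag would pass.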


Genes ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 53
Author(s):  
Zaid Al-Ars ◽  
Saiyi Wang ◽  
Hamid Mushtaq

The rapid proliferation of low-cost RNA-seq data has resulted in growing interest in RNA analysis techniques for various applications, ranging from identifying genotype–phenotype relationships to validating the discoveries of other analyses. However, many practical applications in this field are limited by the available computational resources and the associated long computing times needed to perform the analysis. GATK has a popular Best Practices pipeline specifically designed for variant calling in RNA-seq analysis. Some tools in this pipeline are not optimized to scale the analysis efficiently across multiple processors or compute nodes, limiting their ability to process large datasets. In this paper, we present SparkRA, an Apache Spark-based pipeline that efficiently scales up the GATK RNA-seq variant calling pipeline on multiple cores in one node or in a large cluster. On a single node with 20 hyper-threaded cores, the original pipeline runs for more than 5 h to process a 32 GB dataset. In contrast, SparkRA reduces the overall computation time of the pipeline on the same single node by about 4×, down to 1.3 h. On a cluster with 16 nodes (each with eight single-threaded cores), SparkRA further reduces this computation time by 7.7× compared to a single node. Compared to other scalable state-of-the-art solutions, SparkRA is 1.2× faster while achieving the same accuracy of results.
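The reported figures can be checked with a little arithmetic: going from over 5 h to 1.3 h on one node is roughly the stated 4×, and the additional 7.7× on 16 nodes implies an end-to-end speedup of about 30× over the original single-node run:

```python
def speedup(t_before_h, t_after_h):
    """Ratio of run times before and after optimization."""
    return t_before_h / t_after_h

# Single node: >5 h down to 1.3 h, i.e. roughly the reported "about 4x".
single_node = speedup(5.0, 1.3)

# The cluster adds a further 7.7x over the single-node SparkRA run,
# so the combined speedup over the original pipeline is about 30x.
cluster = single_node * 7.7
```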


Author(s):  
Umair Ahsan ◽  
Qian Liu ◽  
Li Fang ◽  
Kai Wang

Abstract
Variant (SNP/indel) detection from high-throughput sequencing data remains an important yet unresolved problem. Long-read sequencing enables variant detection in difficult-to-map genomic regions that short-read sequencing cannot reliably examine (for example, only ~80% of genomic regions are marked as "high-confidence regions" for SNP/indel calls in the Genome in a Bottle project); however, the high per-base error rate poses unique challenges for variant detection. Existing methods for long-read data typically rely on analyzing pileup information from neighboring bases surrounding a candidate variant, similar to short-read variant callers, so the benefits of much longer read lengths are not fully exploited. Here we present a deep neural network called NanoCaller, which detects SNPs by examining pileup information solely from other, nonadjacent candidate SNPs that share the same long reads, using long-range haplotype information. With the SNPs called by NanoCaller, NanoCaller phases long reads and performs local realignment on two sets of phased reads to call indels with another deep neural network. Extensive evaluation on 5 human genomes (sequenced with Nanopore and PacBio long-read techniques) demonstrated that NanoCaller greatly improves performance in difficult-to-map regions compared to other long-read variant callers. We experimentally validated 41 novel variants in difficult-to-map regions of a widely used benchmarking genome, which could not be reliably detected previously. We extensively evaluated the run-time characteristics of NanoCaller and the sensitivity of its parameter settings to different characteristics of sequencing data. Finally, we achieved the best performance in Nanopore-based variant calling from MHC regions in the PrecisionFDA Variant Calling Challenge on Difficult-to-Map Regions by ensemble calling.
In summary, by incorporating haplotype information in deep neural networks, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing data.
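The central idea, pileup features drawn from nonadjacent candidate sites that share long reads, can be sketched as follows; the data layout and function name are illustrative, not NanoCaller's actual implementation:

```python
def haplotype_features(reads, candidate_sites, target):
    """For a target candidate SNP, collect the bases each long read carries
    at the *other* candidate sites it spans -- the long-range haplotype
    signal that a NanoCaller-style model feeds to its neural network.

    `reads` maps read_id -> {site_position: observed_base}."""
    features = {}
    for read_id, bases in reads.items():
        if target in bases:  # only reads that actually cover the target site
            features[read_id] = {s: bases[s] for s in candidate_sites
                                 if s != target and s in bases}
    return features
```

Because a single long read can span many candidate sites, each row of this feature map encodes co-occurring alleles kilobases apart, information a short-read pileup cannot provide.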


2021 ◽  
Author(s):  
Amy Wing-Sze Leung ◽  
Henry Chi-Ming Leung ◽  
Chak-Lim Wong ◽  
Zhen-Xian Zheng ◽  
Wui-Wang Lui ◽  
...  

Background: The application of long-read sequencing using the Oxford Nanopore Technologies (ONT) MinION sequencer is becoming more diverse in the medical field. However, the high sequencing error rate of ONT and the limited throughput of a single MinION flowcell limit its applicability for accurate variant detection. Medical exome sequencing (MES) targets clinically significant exon regions, allowing rapid and comprehensive screening of pathogenic variants. By applying MES with MinION sequencing, the technology can achieve more uniform capture of the target regions, shorter turnaround time, and lower sequencing cost per sample.
Method: We introduce a cost-effective optimized workflow, ECNano, comprising a wet-lab protocol and bioinformatics analysis, for accurate variant detection across 4,800 clinically important genes and regions using a single MinION flowcell. The ECNano wet-lab protocol was optimized to perform long-read target enrichment and ONT library preparation to stably generate high-quality MES data with adequate coverage. The subsequent variant-calling workflow, Clair-ensemble, adopts a fast RNN-based variant caller, Clair, and was optimized for target enrichment data. To evaluate its performance and practicality, ECNano was tested on both reference DNA samples and patient samples.
Results: ECNano achieved deep on-target depth of coverage (DoC), averaging >100× with >98% uniformity, using one MinION flowcell. For accurate ONT variant calling, the generated reads sufficiently covered 98.9% of pathogenic positions listed in ClinVar, with 98.96% having at least 30× DoC. ECNano obtained an average read length of 1,000 bp. The long reads also covered the adjacent splice sites well, with 98.5% of positions having ≥30× DoC. Clair-ensemble achieved >99% recall and accuracy for SNV calling. The whole workflow, from wet-lab protocol to variant detection, was completed within three days.
Conclusion: We present ECNano, an out-of-the-box workflow comprising (1) a wet-lab protocol for ONT target enrichment sequencing and (2) a downstream variant detection workflow, Clair-ensemble. The workflow is cost-effective, with a short turnaround time, for high-accuracy variant calling in 4,800 clinically significant genes and regions using a single MinION flowcell. The long-read exon capture data has potential for further development, promoting the application of long-read sequencing in personalized disease treatment and risk prediction.
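The coverage figures above are simple per-position summaries. A sketch of how mean DoC and the fraction of positions at ≥30× could be computed from a per-position depth vector (interface illustrative):

```python
def coverage_summary(depths, min_doc=30):
    """Given per-position read depths over the target regions, return the
    mean depth of coverage and the fraction of positions at or above the
    minimum DoC used for confident variant calling."""
    mean = sum(depths) / len(depths)
    frac = sum(d >= min_doc for d in depths) / len(depths)
    return mean, frac
```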


2018 ◽  
Author(s):  
Erica K. Barnell ◽  
Peter Ronning ◽  
Katie M. Campbell ◽  
Kilannin Krysiak ◽  
Benjamin J. Ainscough ◽  
...  

Abstract
Purpose: Manual review of aligned sequencing reads is required to develop a high-quality list of somatic variants from massively parallel sequencing (MPS) data. Despite widespread use in analyzing MPS data, there has been little attempt to describe methods for manual review, resulting in high inter- and intra-lab variability in somatic variant detection and characterization of tumors.
Methods: Open-source software was used to develop an optimal method for manual review setup. We also developed a systematic approach to visually inspect each variant during manual review.
Results: We present a standard operating procedure for somatic variant refinement for use by manual reviewers. The approach is enhanced through representative examples of 4 manual review categories that indicate a reviewer's confidence in the somatic variant call and 19 annotation tags that contextualize commonly observed sequencing patterns during manual review. Representative examples provide detailed instructions on how to classify variants during manual review to rectify lack of confidence in automated somatic variant detection.
Conclusion: Standardization of somatic variant refinement through systematization of manual review will improve the consistency and reproducibility of identifying true somatic variants after automated variant calling.
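A tag-to-category mapping of the kind the SOP systematizes could look like the sketch below. The tag names and rules here are invented for illustration; the real procedure defines 19 tags and relies on visual inspection in IGV:

```python
def review_category(tags):
    """Map a set of annotation tags on a variant to one of four review
    categories expressing the reviewer's confidence in the somatic call.
    Tags and rules are hypothetical stand-ins for the published SOP."""
    fail_tags = {"low_count_tumor", "multiple_mismatches", "low_mapping_quality"}
    germline_tags = {"normal_support"}
    if tags & fail_tags:            # sequencing artifacts dominate
        return "fail"
    if tags & germline_tags:        # variant also present in the normal sample
        return "germline"
    if "adjacent_indel" in tags:    # plausible but confounded evidence
        return "ambiguous"
    return "somatic"                # no disqualifying pattern observed
```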


Author(s):  
Lucile Broseus ◽  
Aubin Thomas ◽  
Andrew J. Oldfield ◽  
Dany Severac ◽  
Emeric Dubois ◽  
...  

Abstract
Motivation: Long-read sequencing technologies are invaluable for determining complex RNA transcript architectures but are error-prone. Numerous "hybrid correction" algorithms have been developed for genomic data that correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not suited to correcting the more complex transcriptome sequencing data.
Results: We have created a novel reference-free algorithm called TALC (Transcription Aware Long Read Correction), which models changes in RNA expression and isoform representation in a weighted de Bruijn graph to correct long reads from transcriptome studies. We show that transcription-aware correction by TALC improves the accuracy of the whole spectrum of downstream RNA-seq applications and is thus necessary for transcriptome analyses that use long-read technology.
Availability and Implementation: TALC is implemented in C++ and available at https://gitlab.igh.cnrs.fr/lbroseus/
Contact: [email protected]
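The weighted de Bruijn graph at the heart of TALC can be sketched as k-mer nodes whose edge weights are short-read (k+1)-mer counts, so that edge weight tracks expression depth. This is a toy construction, not TALC's actual data structure:

```python
from collections import Counter

def weighted_de_bruijn(short_reads, k=4):
    """Build a weighted de Bruijn graph from accurate short reads: nodes
    are k-mers, and each edge (kmer1 -> kmer2) is weighted by how many
    times the corresponding (k+1)-mer occurs, reflecting expression depth.
    A correction step would then prefer high-weight paths when repairing
    an error-prone long read."""
    edges = Counter()
    for read in short_reads:
        for i in range(len(read) - k):
            kmer1 = read[i:i + k]
            kmer2 = read[i + 1:i + 1 + k]
            edges[(kmer1, kmer2)] += 1
    return edges
```

In a transcriptome, a lowly expressed isoform yields a genuinely low-weight path rather than a sequencing error, which is why TALC models expression changes instead of simply discarding low-coverage edges as genomic correctors do.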


2015 ◽  
Author(s):  
Stephane E Castel ◽  
Ami Levy-Moonshine ◽  
Pejman Mohammadi ◽  
Eric Banks ◽  
Tuuli Lappalainen

Allelic expression (AE) analysis has become an important tool for integrating genome and transcriptome data to characterize various biological phenomena, such as cis-regulatory variation and nonsense-mediated decay. In this paper, we systematically analyze the properties of AE read count data and technical sources of error, such as low-quality or double-counted RNA-seq reads, genotyping errors, allelic mapping bias, technical covariates arising from sample preparation and sequencing, and variation in total read depth. We provide guidelines for correcting and filtering such errors, and show that the resulting AE data has extremely low technical noise. Finally, we introduce novel software for high-throughput production of AE data from RNA-sequencing data, implemented in the GATK framework. These improved tools and best practices for AE analysis yield higher-quality AE data by reducing technical bias, providing a practical framework for wider adoption of AE analysis by the genomics community.
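At its core, AE analysis reduces to an allelic ratio at each heterozygous site, computed after the filtering steps described above. A minimal sketch; the coverage threshold is illustrative, and the paper's tools additionally remove duplicate and low-quality reads and correct for mapping bias:

```python
def allelic_ratio(ref_count, alt_count, min_total=8):
    """Reference-allele ratio at a heterozygous site from filtered RNA-seq
    read counts, with a minimum-coverage filter. Returns None when total
    depth is too low for a reliable estimate (threshold is illustrative)."""
    total = ref_count + alt_count
    if total < min_total:
        return None
    return ref_count / total
```

A ratio far from 0.5 at a well-covered site then suggests cis-regulatory variation or nonsense-mediated decay, provided technical biases have been removed first.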


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Gwenna Breton ◽  
Anna C. V. Johansson ◽  
Per Sjödin ◽  
Carina M. Schlebusch ◽  
Mattias Jakobsson

Abstract
Background: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. There is an abundance of sequencing technologies and bioinformatic tools, and the available genomes are increasing in number. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges arising from human diversity and stratification.
Results: We surveyed 29 studies using high-throughput sequencing data and compared their strategies for data pre-processing and variant calling. We found that processing of data is highly variable across studies and that the GATK "Best Practices" are seldom followed strictly. We then compared three versions of a GATK pipeline, differing in the inclusion of an indel realignment step and in a modification of the base quality score recalibration step. We applied the pipelines to a diverse set of 28 individuals and compared them in terms of the number of called variants and the overlap of the callsets. We found that the pipelines resulted in similar callsets, in particular after callset filtering. We also ran one of the pipelines on a larger dataset of 179 individuals and noted that including more individuals at the joint genotyping step resulted in different counts of variants. At the individual level, we observed that average genome coverage was correlated with the number of variants called.
Conclusions: We conclude that applying the GATK "Best Practices" pipeline, including their recommended reference datasets, to underrepresented populations does not lead to a decrease in the number of called variants compared to alternative pipelines. We recommend aiming for a coverage of >30× if identifying most variants is important, and working with large sample sizes at the variant calling stage, also for underrepresented individuals and populations.
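Comparing pipelines by the overlap of their callsets, as done above, reduces to set operations over variant keys. A sketch, with variants keyed as (chrom, pos, ref, alt) tuples:

```python
def callset_overlap(calls_a, calls_b):
    """Compare two pipelines' callsets: return the number of shared
    variants and the Jaccard index of the two sets, where each variant
    is identified by a (chrom, pos, ref, alt) tuple."""
    a, b = set(calls_a), set(calls_b)
    shared = a & b
    return len(shared), len(shared) / len(a | b)
```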


2019 ◽  
Author(s):  
Modupeore O. Adetunji ◽  
Carl J. Schmidt ◽  
Susan J. Lamont ◽  
Behnam Abasht

Abstract
The wealth of information obtainable from transcriptome sequencing (RNA-seq) is significant; however, its application to variant detection remains a challenge due to the complexity of the transcriptome. Given the ability of RNA-seq to reveal active regions of the genome, SNP detection from RNA-seq can prove valuable in understanding the phenotypic diversity between populations. We therefore present a novel computational workflow named VAP (Variant Analysis Pipeline) that takes advantage of multiple splice-aware RNA-seq aligners to call SNPs in non-human models using RNA-seq data alone. We applied VAP to RNA-seq from a highly inbred chicken line and achieved >97% precision and >99% sensitivity compared with the matching whole-genome sequencing (WGS) data. Over 65% of WGS coding variants were identified from RNA-seq. Furthermore, our results uncovered SNPs resulting from post-transcriptional modifications, such as RNA editing, which may reveal potentially functional variation that would otherwise be missed in genomic data. Even with the limitation of detecting variants in expressed regions only, our method proves to be a reliable alternative for SNP identification using RNA-seq data.
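The precision and sensitivity figures come from comparing the RNA-seq callset against the matched WGS variants as truth. A sketch of that computation, with site keys illustrative:

```python
def precision_recall(called, truth):
    """Precision and sensitivity (recall) of an RNA-seq callset against a
    matched WGS truth set, with variants identified by hashable keys such
    as (chrom, pos, alt) tuples."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```

Note that for RNA-seq benchmarking the truth set is usually restricted to expressed regions first; otherwise sensitivity is unfairly penalized for variants the transcriptome cannot show.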

