Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper

Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach. FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.

Download Full-text

Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper

10.1101/2020.03.02.973750 ◽

2020 ◽

Author(s):

Phillip A. Richmond ◽

Alice M. Kaye ◽

Godfrain Jacques Kounkou ◽

Tamar V. Av-Shalom ◽

Wyeth W. Wasserman

Keyword(s):

Next Generation Sequencing ◽

Reference Genome ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Next Generation ◽

Sequencing Data ◽

Read Mapping ◽

Mapping Approach ◽

Reverse Mapping ◽

Generation Sequencing

AbstractAcross the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach. FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper.Author SummaryIn the past 15 years, next generation sequencing technology has revolutionized our capacity to process and analyze DNA sequencing data. From agriculture to medicine, this technology is enabling a deeper understanding of the blueprint of life. Next generation sequencing data is composed of short sequences of DNA, referred to as “reads”, which are often shorter than 200 base pairs making them many orders of magnitude smaller than the entirety of a human genome. Gaining insights from this data has typically leveraged a reference-guided mapping approach, where the reads are aligned to a reference genome and then post-processed to gain actionable information such as presence or absence of genomic sequence, or variation between the reference genome and the sequenced sample. Many experts in the field of genomics have concluded that selecting a single, linear reference genome for mapping reads against is limiting, and several current research endeavors are focused on exploring options for improved analysis methods to unlock the full utility of sequencing data. Among these improvements are the usage of sex-matched genomes, population-specific reference genomes, and emergent graph-based reference pan-genomes. However, advanced methods that use raw DNA sequencing data to inform the choice of reference genome and guide the alignment of reads to enriched reference genomes are needed. Here we develop a method termed FlexTyper, which creates a searchable index of the short read data and enables flexible, user-guided queries to provide valuable insights without the need for reference-guided mapping. We demonstrate the utility of our method by identifying sample ancestry and sex in human whole genome sequencing data, detecting viral pathogen reads in RNA-seq data, African-enriched genome regions absent from the global reference, and HLA alleles that are complex to discern using standard read mapping. We anticipate early adoption of FlexTyper within analysis pipelines as a pre-mapping component, and further envision the bioinformatics and genomics community will leverage the tool for creative uses of sequence queries from unmapped data.

Download Full-text

PhredEM: A Phred-Score-Informed Genotype-Calling Approach for Next-Generation Sequencing Studies

10.1101/046136 ◽

2016 ◽

Author(s):

Peizhou Liao ◽

Glen A. Satten ◽

Yi-juan Hu

Keyword(s):

Logistic Regression ◽

Next Generation Sequencing ◽

Em Algorithm ◽

Error Rates ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Sequencing Studies ◽

Generation Sequencing

ABSTRACTA fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too-high threshold may lose data while a too-low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.

Download Full-text

MerCat: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from metagenomic and/or metatranscriptomic sequencing data

10.7287/peerj.preprints.2825 ◽

2017 ◽

Author(s):

Richard Allen White III ◽

Ajay Panyala ◽

Kevin Glass ◽

Sean Colby ◽

Kurt R Glaesemann ◽

...

Keyword(s):

Software Package ◽

Compositional Analysis ◽

Direct Analysis ◽

Reference Sequence ◽

Next Generation Sequencing Data ◽

Sequence Database ◽

Sequencing Data ◽

Independent Property ◽

Generation Sequencing ◽

Modular Property

MerCat (“ Mer - Cat enate”) is a parallel, highly scalable and modular property software package for robust analysis of features in next-generation sequencing data. Using assembled contigs and raw sequence reads from any platform as input, MerCat performs k-mer counting of any length k, resulting in feature abundance counts tables. MerCat allows for direct analysis of data properties without reference sequence database dependency commonly used by search tools such as BLAST for compositional analysis of whole community shotgun sequencing (e.g., metagenomes and metatranscriptomes).

Download Full-text

MerCat: a versatile k-mer counter and diversity estimator for database-independent property analysis obtained from metagenomic and/or metatranscriptomic sequencing data

10.7287/peerj.preprints.2825v1 ◽

2017 ◽

Cited By ~ 1

Author(s):

Richard Allen White III ◽

Ajay Panyala ◽

Kevin Glass ◽

Sean Colby ◽

Kurt R Glaesemann ◽

...

Keyword(s):

Software Package ◽

Compositional Analysis ◽

Direct Analysis ◽

Reference Sequence ◽

Next Generation Sequencing Data ◽

Sequence Database ◽

Sequencing Data ◽

Independent Property ◽

Generation Sequencing ◽

Modular Property

Download Full-text

dv-trio: a family-based variant calling pipeline using DeepVariant

Bioinformatics ◽

10.1093/bioinformatics/btaa116 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3549-3551 ◽

Cited By ~ 1

Author(s):

Eddie K K Ip ◽

Clinton Hadinata ◽

Joshua W K Ho ◽

Eleni Giannoulatou

Keyword(s):

Genetic Model ◽

Variant Calling ◽

Supplementary Information ◽

Next Generation Sequencing Data ◽

Sequencing Data ◽

Single Nucleotide Variants ◽

Mutation Discovery ◽

Family Trio ◽

Sequencing Studies ◽

Family Based

Abstract Motivation In 2018, Google published an innovative variant caller, DeepVariant, which converts pileups of sequence reads into images and uses a deep neural network to identify single-nucleotide variants and small insertion/deletions from next-generation sequencing data. This approach outperforms existing state-of-the-art tools. However, DeepVariant was designed to call variants within a single sample. In disease sequencing studies, the ability to examine a family trio (father-mother-affected child) provides greater power for disease mutation discovery. Results To further improve DeepVariant’s variant calling accuracy in family-based sequencing studies, we have developed a family-based variant calling pipeline, dv-trio, which incorporates the trio information from the Mendelian genetic model into variant calling based on DeepVariant. Availability and implementation dv-trio is available via an open source BSD3 license at GitHub (https://github.com/VCCRI/dv-trio/). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Download Full-text

Faculty Opinions recommendation of VarWalker: personalized mutation network analysis of putative cancer genes from next-generation sequencing data.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.718272765.793499663 ◽

2014 ◽

Author(s):

Gary Bader ◽

Mohamed Helmy

Keyword(s):

Next Generation Sequencing ◽

Network Analysis ◽

Next Generation Sequencing Data ◽

Cancer Genes ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Download Full-text

Faculty Opinions recommendation of Bioinformatory-assisted analysis of next-generation sequencing data for precision medicine in pancreatic cancer.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727775566.793536095 ◽

2017 ◽

Author(s):

Steve Pereira

Keyword(s):

Pancreatic Cancer ◽

Next Generation Sequencing ◽

Precision Medicine ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Assisted Analysis ◽

Generation Sequencing

Download Full-text

Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract Background Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNF/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.

Download Full-text

NGSremix: A software tool for estimating pairwise relatedness between admixed individuals from next-generation sequencing data

G3 Genes|Genome|Genetics ◽

10.1093/g3journal/jkab174 ◽

2021 ◽

Author(s):

Anne Krogh Nøhr ◽

Kristian Hanghøj ◽

Genis Garcia Erill ◽

Zilong Li ◽

Ida Moltke ◽

...

Keyword(s):

Next Generation Sequencing ◽

Genetic Research ◽

Likelihood Estimation ◽

Software Tool ◽

Estimation Methods ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Ngs Data ◽

Generation Sequencing

Abstract Estimation of relatedness between pairs of individuals is important in many genetic research areas. When estimating relatedness, it is important to account for admixture if this is present. However, the methods that can account for admixture are all based on genotype data as input, which is a problem for low-depth next-generation sequencing (NGS) data from which genotypes are called with high uncertainty. Here we present a software tool, NGSremix, for maximum likelihood estimation of relatedness between pairs of admixed individuals from low-depth NGS data, which takes the uncertainty of the genotypes into account via genotype likelihoods. Using both simulated and real NGS data for admixed individuals with an average depth of 4x or below we show that our method works well and clearly outperforms all the commonly used state-of-the-art relatedness estimation methods PLINK, KING, relateAdmix, and ngsRelate that all perform quite poorly. Hence, NGSremix is a useful new tool for estimating relatedness in admixed populations from low-depth NGS data. NGSremix is implemented in C/C ++ in a multi-threaded software and is freely available on Github https://github.com/KHanghoj/NGSremix.

Download Full-text

METAMVGL: a multi-view graph-based metagenomic contig binning algorithm by integrating assembly and paired-end graphs

BMC Bioinformatics ◽

10.1186/s12859-021-04284-4 ◽

2021 ◽

Vol 22 (S10) ◽

Author(s):

Zhenmiao Zhang ◽

Lu Zhang

Keyword(s):

De Novo ◽

Label Propagation ◽

Next Generation Sequencing Data ◽

Metagenomic Sequencing ◽

Sequencing Data ◽

Fecal Samples ◽

Microbial Genomes ◽

Metagenome Assembly ◽

High Chance ◽

Mock Communities

Abstract Background Due to the complexity of microbial communities, de novo assembly on next generation sequencing data is commonly unable to produce complete microbial genomes. Metagenome assembly binning becomes an essential step that could group the fragmented contigs into clusters to represent microbial genomes based on contigs’ nucleotide compositions and read depths. These features work well on the long contigs, but are not stable for the short ones. Contigs can be linked by sequence overlap (assembly graph) or by the paired-end reads aligned to them (PE graph), where the linked contigs have high chance to be derived from the same clusters. Results We developed METAMVGL, a multi-view graph-based metagenomic contig binning algorithm by integrating both assembly and PE graphs. It could strikingly rescue the short contigs and correct the binning errors from dead ends. METAMVGL learns the two graphs’ weights automatically and predicts the contig labels in a uniform multi-view label propagation framework. In experiments, we observed METAMVGL made use of significantly more high-confidence edges from the combined graph and linked dead ends to the main graph. It also outperformed many state-of-the-art contig binning algorithms, including MaxBin2, MetaBAT2, MyCC, CONCOCT, SolidBin and GraphBin on the metagenomic sequencing data from simulation, two mock communities and Sharon infant fecal samples. Conclusions Our findings demonstrate METAMVGL outstandingly improves the short contig binning and outperforms the other existing contig binning tools on the metagenomic sequencing data from simulation, mock communities and infant fecal samples.

Download Full-text