Full-Length Envelope Analyzer (FLEA): A tool for longitudinal analysis of viral amplicons

2017 ◽  
Author(s):  
Kemal Eren ◽  
Steven Weaver ◽  
Robert Ketteringham ◽  
Morné Valentyn ◽  
Melissa Laird Smith ◽  
...  

Abstract Next generation sequencing of viral populations has advanced our understanding of viral population dynamics, the development of drug resistance, and escape from host immune responses. Many applications require complete gene sequences, which can be impossible to reconstruct from short reads. HIV-1 env, the protein of interest for HIV vaccine studies, is exceptionally challenging for long-read sequencing and analysis due to its length, high substitution rate, and extensive indel variation. While long-read sequencing is attractive in this setting, the analysis of such data is not well handled by existing methods. To address this, we introduce FLEA (Full-Length Envelope Analyzer), which performs end-to-end analysis and visualization of long-read sequencing data. FLEA consists of both a pipeline (optionally run on a high-performance cluster) and a client-side web application that provides interactive results. The pipeline transforms FASTQ reads into high-quality consensus sequences (HQCSs) and uses them to build a codon-aware multiple sequence alignment. The resulting alignment is then used to infer phylogenies, selection pressure, and evolutionary dynamics. The web application provides publication-quality plots and interactive visualizations, including an annotated viral alignment browser, time series plots of evolutionary dynamics, visualizations of gene-wide selective pressures (such as dN/dS) across time and across protein structure, and a phylogenetic tree browser. We demonstrate how FLEA may be used to process Pacific Biosciences HIV-1 env data and describe recent examples of its use. Simulations show how FLEA dramatically reduces the error rate of this sequencing platform, providing an accurate portrait of complex and variable HIV-1 env populations. A public instance of FLEA is hosted at http://flea.datamonkey.org. The Python source code for the FLEA pipeline can be found at https://github.com/veg/flea-pipeline. The client-side application is available at https://github.com/veg/flea-web-app. A live demo of the P018 results can be found at http://flea.murrell.group/view/P018.
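
The idea that a consensus over many noisy reads has a lower error rate than any single read can be illustrated with a small sketch. This is not FLEA's procedure, which the abstract does not detail; it is a column-wise majority vote that assumes the reads are already aligned to equal length with `-` as the gap character. The function name `majority_consensus` and the example reads are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_reads, gap="-"):
    """Return the column-wise majority sequence of equal-length aligned reads.

    Columns whose majority symbol is the gap character are dropped, so an
    insertion seen in only a minority of reads does not reach the consensus.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    consensus = []
    for column in zip(*aligned_reads):
        # most_common(1) returns the single most frequent symbol in the column
        symbol, _count = Counter(column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)


# Four hypothetical aligned reads: one carries a substitution (G->C),
# another carries a spurious single-base insertion.
reads = [
    "ATGGCA-TT",
    "ATGGCA-TT",
    "ATGCCA-TT",
    "ATGGCAGTT",
]
consensus = majority_consensus(reads)
```

Both isolated errors are outvoted by the other reads, which is the intuition behind building consensus sequences before alignment and downstream inference.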


2017 ◽  
Author(s):  
Christopher Wilks ◽  
Phani Gaddipati ◽  
Abhinav Nellore ◽  
Ben Langmead

Abstract As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70,000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can also rank and score junctions according to tissue specificity or other criteria. Further, Snaptron can rank and score samples according to the relative frequency of different splicing patterns. We outline biological questions that can be explored with Snaptron queries, including a study of novel exons in annotated genes, of exonization of repetitive element loci, and of a recently discovered alternative transcription start site for the ALK gene. Web app and documentation are at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron under the MIT license.



PeerJ ◽  
2021 ◽  
Vol 9 ◽  
pp. e11333
Author(s):  
Daniyar Karabayev ◽  
Askhat Molkenov ◽  
Kaiyrgali Yerulanuly ◽  
Ilyas Kabimoldayev ◽  
Asset Daniyarov ◽  
...  

Background High-throughput sequencing platforms generate massive, high-dimensional genomic datasets that are available for analysis. Modern, user-friendly bioinformatics tools for the analysis and interpretation of genomics data have become essential for working with sequencing data. Different standard data types and file formats have been developed to store and analyze sequence and genomics data. Variant Call Format (VCF) is the most widespread genomics file type and the standard format for storing genomic information and variants of sequenced samples. Results Existing tools for processing VCF files usually lack an intuitive graphical interface and offer only a command-line interface, which may be challenging to use for the broader biomedical community interested in genomics data analysis. re-Searcher solves this problem by pre-processing VCF files in chunks, so that a whole file does not have to be loaded into the computer's RAM. The tool can be used as a standalone, user-friendly, multiplatform GUI application as well as a web application (https://nla-lbsb.nu.edu.kz). The software, including source code, tested VCF files, and additional information, is publicly available in the GitHub repository (https://github.com/LabBandSB/re-Searcher).
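
Chunked processing of a VCF file can be sketched as follows. This is an illustration of the general technique only, not re-Searcher's code: the helper names `parse_record` and `iter_vcf_chunks`, the chunk size, and the in-memory example file are all assumptions. The sketch relies only on the fixed leading columns of the VCF format (CHROM, POS, ID, REF, ALT).

```python
import io
from itertools import islice


def parse_record(line):
    """Parse the first five tab-separated columns of one VCF data line."""
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
    }


def iter_vcf_chunks(handle, chunk_size=1000):
    """Yield lists of at most chunk_size parsed records from an open VCF.

    Header lines (starting with '#') and blank lines are skipped. Only one
    chunk is held in memory at a time.
    """
    records = (
        parse_record(line)
        for line in handle
        if line.strip() and not line.startswith("#")
    )
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


# A tiny hypothetical VCF held in memory for demonstration.
example_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\trs1\tA\tG\t50\tPASS\t.\n"
    "1\t200\t.\tC\tT,G\t60\tPASS\t.\n"
    "2\t300\trs3\tG\tA\t70\tPASS\t.\n"
)
chunk_sizes = [len(chunk) for chunk in iter_vcf_chunks(example_vcf, chunk_size=2)]
```

With a real file, the same generator would be called on `open(path)` and each chunk filtered or summarized before the next one is read, which keeps peak memory proportional to the chunk size instead of the file size.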



2017 ◽  
Author(s):  
Philipp N. Spahn ◽  
Tyler Bath ◽  
Ryan J. Weiss ◽  
Jihoon Kim ◽  
Jeffrey D. Esko ◽  
...  

Abstract Background Large-scale genetic screens using CRISPR/Cas9 technology have emerged as a major tool for functional genomics. With its increased popularity, experimental biologists frequently acquire large sequencing datasets for which they often do not have an easy analysis option. While a few bioinformatic tools have been developed for this purpose, their utility is still hindered either due to limited functionality or the requirement of bioinformatic expertise. Results To make sequencing data analysis of CRISPR/Cas9 screens more accessible to a wide range of scientists, we developed a Platform-independent Analysis of Pooled Screens using Python (PinAPL-Py), which is operated as an intuitive web-service. PinAPL-Py implements state-of-the-art tools and statistical models, assembled in a comprehensive workflow covering sequence quality control, automated sgRNA sequence extraction, alignment, sgRNA enrichment/depletion analysis and gene ranking. The workflow is set up to use a variety of popular sgRNA libraries as well as custom libraries that can be easily uploaded. Various analysis options are offered, suitable to analyze a large variety of CRISPR/Cas9 screening experiments. Analysis output includes ranked lists of sgRNAs and genes, and publication-ready plots. Conclusions PinAPL-Py helps to advance genome-wide screening efforts by combining comprehensive functionality with user-friendly implementation. PinAPL-Py is freely accessible at http://pinapl-py.ucsd.edu with instructions, documentation and test datasets. The source code is available at https://github.com/LewisLabUCSD/PinAPL-Py.
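
The core of an sgRNA enrichment/depletion analysis, counting reads per sgRNA and comparing conditions, can be sketched in a few lines. This is a deliberately simplified illustration, not PinAPL-Py's workflow: it assumes reads have already been trimmed to the sgRNA sequence, uses exact matching where the real pipeline uses alignment, and replaces the statistical models with a plain log2 fold change using a pseudocount. All names and sequences below are hypothetical.

```python
import math
from collections import Counter


def count_sgrnas(reads, library):
    """Count reads that exactly match an sgRNA sequence in the library.

    library maps sgRNA name -> sequence; unmatched reads are ignored.
    """
    seq_to_name = {seq: name for name, seq in library.items()}
    counts = Counter({name: 0 for name in library})
    for read in reads:
        name = seq_to_name.get(read)
        if name is not None:
            counts[name] += 1
    return counts


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-sgRNA log2 ratio of depth-normalized counts (treated / control)."""
    treated_total = sum(treated.values()) or 1
    control_total = sum(control.values()) or 1
    result = {}
    for name in control:
        t = (treated[name] + pseudocount) / treated_total
        c = (control[name] + pseudocount) / control_total
        result[name] = math.log2(t / c)
    return result


# Hypothetical two-guide library and already-trimmed reads.
library = {"sgA": "ACGTACGTACGTACGTACGT", "sgB": "TTGCATTGCATTGCATTGCA"}
control_reads = [library["sgA"]] * 3 + [library["sgB"]]
treated_reads = [library["sgA"]] + [library["sgB"]] * 3 + ["NNNNNNNNNNNNNNNNNNNN"]

control_counts = count_sgrnas(control_reads, library)
treated_counts = count_sgrnas(treated_reads, library)
lfc = log2_fold_change(treated_counts, control_counts)
```

Here `sgA` comes out depleted (negative log2 fold change) and `sgB` enriched (positive). A real analysis would add replicate-aware statistics and aggregate sgRNA-level scores into a gene ranking.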



2017 ◽  
Author(s):  
Timothy G. Vaughan

Abstract Summary IcyTree is an easy-to-use application which can be used to visualize a wide variety of phylogenetic trees and networks. While numerous phylogenetic tree viewers exist already, IcyTree distinguishes itself by being a purely online tool, having a responsive user interface, supporting phylogenetic networks (ancestral recombination graphs in particular), and efficiently drawing trees that include information such as ancestral locations or trait values. IcyTree also provides intuitive panning and zooming utilities that make exploring large phylogenetic trees of many thousands of taxa feasible. Availability and Implementation IcyTree is a web application and can be accessed directly at http://tgvaughan.github.com/icytree. Currently supported web browsers include Mozilla Firefox and Google Chrome. IcyTree is written entirely in client-side JavaScript (no plugin required) and, once loaded, does not require network access to run. IcyTree is free software, and the source code is made available at http://github.com/tgvaughan/icytree under version 3 of the GNU General Public License.



Author(s):  
Shifu Chen ◽  
Changshou He ◽  
Yingqiang Li ◽  
Zhicheng Li ◽  
Charles E Melançon

ABSTRACT In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms, and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input, and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction, and other pre-processing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid SARS-CoV-2 identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data, and that it can distinguish SARS-CoV-2 from SARS, MERS, and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.
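
The notion of genome-specific unique k-mers can be made concrete with a small sketch. It is not the UniqueKMER or fastv implementation: it ignores reverse complements, the extension step, and all read pre-processing, and the function names and toy genomes are hypothetical. A k-mer counts as unique to a genome here if it occurs in no other genome of the set.

```python
from collections import defaultdict


def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(genomes, k):
    """Map each genome name to the k-mers found in that genome only."""
    owners = defaultdict(set)  # k-mer -> names of genomes containing it
    for name, sequence in genomes.items():
        for kmer in kmers(sequence.upper(), k):
            owners[kmer].add(name)

    unique = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:
            unique[next(iter(names))].add(kmer)
    return unique


def count_hits(read, unique, k):
    """Count how many of a read's k-mers are unique to each genome."""
    read_kmers = kmers(read.upper(), k)
    return {name: len(read_kmers & kmer_set) for name, kmer_set in unique.items()}


# Two toy "genomes" that share the k-mer ACGT but differ elsewhere.
genomes = {"virusA": "ACGTACGGA", "virusB": "ACGTTTGCA"}
unique = unique_kmers(genomes, k=4)
hits = count_hits("GTACGG", unique, k=4)
```

A read that shares several unique k-mers with one genome and none with the others is evidence for that genome. Real genome sets need much longer k-mers for the unique sets to be specific.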



2017 ◽  
Author(s):  
Wouter De Coster ◽  
Svenn D’Hert ◽  
Darrin T. Schultz ◽  
Marc Cruts ◽  
Christine Van Broeckhoven

Abstract Summary: Here we describe NanoPack, a set of tools developed for visualization and processing of long read sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. Availability and Implementation: The NanoPack tools are written in Python3 and released under the GNU GPL3.0 Licence. The source code can be found at https://github.com/wdecoster/nanopack, together with links to separate scripts and their documentation. The scripts are compatible with Linux, Mac OS and the MS Windows 10 subsystem for Linux and are available as a graphical user interface, a web service at http://nanoplot.bioinf.be and command line tools. Supplementary information: Supplementary tables and figures are available at Bioinformatics online.



2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Kie Kyon Huang ◽  
Jiawen Huang ◽  
Jeanie Kar Leng Wu ◽  
Minghui Lee ◽  
Su Ting Tay ◽  
...  

Abstract Background Deregulated gene expression is a hallmark of cancer; however, most studies to date have analyzed short-read RNA sequencing data with inherent limitations. Here, we combine PacBio long-read isoform sequencing (Iso-Seq) and Illumina paired-end short-read RNA sequencing to comprehensively survey the transcriptome of gastric cancer (GC), a leading cause of global cancer mortality. Results We performed full-length transcriptome analysis across 10 GC cell lines covering four major GC molecular subtypes (chromosomal unstable, Epstein-Barr positive, genome stable and microsatellite unstable). We identify 60,239 non-redundant full-length transcripts, of which >66% are novel compared to current transcriptome databases. Novel isoforms are more likely to be cell line and subtype specific, and are expressed at lower levels, with a larger number of exons and longer isoform/coding sequence lengths. Most novel isoforms utilize an alternate first exon, and compared to other alternative splicing categories, are expressed at higher levels and exhibit higher variability. Collectively, we observe alternate promoter usage in 25% of detected genes, with the majority (84.2%) of known/novel promoter pairs exhibiting potential changes in their coding sequences. Mapping these alternate promoters to TCGA GC samples, we identify several cancer-associated isoforms, including novel variants of oncogenes. Tumor-specific transcript isoforms tend to alter protein coding sequences to a larger extent than other isoforms. Analysis of outcome data suggests that novel isoforms may impart additional prognostic information. Conclusions Our results provide a rich resource of full-length transcriptome data for deeper studies of GC and other gastrointestinal malignancies.



2019 ◽  
Author(s):  
Alejandro R. Gener

ABSTRACT Objective(s) To evaluate nanopore DNA sequencing for sequencing full-length HIV-1 provirus. Design I used nanopore sequencing to sequence full-length HIV-1 from a plasmid (pHXB2). Methods pHXB2 plasmid was processed with the Rapid PCR-Barcoding library kit and sequenced on the MinION sequencer (Oxford Nanopore Technologies, Oxford, UK). Raw fast5 reads were converted into fastq (base called) with Albacore, Guppy, and FlipFlop base callers. Reads were first aligned to the reference with BWA-MEM to evaluate sample coverage manually. Reads were then assembled with Canu into contigs, and contigs manually finished in SnapGene. Results I sequenced full-length HXB2 HIV-1 from 5’ to 3’ LTR (100%), with median per-base coverage of over 9000x in one 12-barcoded experiment on a single MinION flow cell. The longest HIV-spanning read to date was generated, at a length of 11,487 bases, which included full-length HIV-1 and plasmid backbone on either side. At least 20 variants were discovered in pHXB2 compared to reference. Conclusions The MinION sequencer performed as expected, covering full-length HIV. The discovery of variants in a dogmatic reference plasmid demonstrates the need for single-molecule sequence verification moving forward. These results illustrate the utility of long read sequencing to advance the study of HIV at single integration site resolution.



2021 ◽  
Author(s):  
Alejandro R. Gener ◽  
Wei Zou ◽  
Brian T. Foley ◽  
Deborah P. Hyink ◽  
Paul E. Klotman

Abstract Objective: To compare long-read nanopore DNA sequencing (DNA-seq) with short-read sequencing-by-synthesis for sequencing a full-length (i.e., neither deletion nor reporter) HIV-1 model provirus in plasmid pHXB2_D. Design: We sequenced pHXB2_D and a control plasmid pNL4-3_gag-pol(Δ1443-4553)_EGFP with long- and short-read DNA-seq, evaluating sample variability with resequencing (sequencing and mapping to reference HXB2) and de novo viral genome assembly. Methods: We prepared pHXB2_D and pNL4-3_gag-pol(Δ1443-4553)_EGFP for long-read nanopore DNA-seq, varying DNA polymerases Taq (Sigma-Aldrich) and Long Amplicon (LA) Taq (Takara). Nanopore basecallers were compared. After aligning reads to the reference HXB2 to evaluate sample coverage, we looked for variants. We next assembled reads into contigs, followed by finishing and polishing. We hired an external core to sequence-verify pHXB2_D and pNL4-3_gag-pol(Δ1443-4553)_EGFP with single-end 150 base-long Illumina reads, after masking sample identity. Results: We achieved full coverage (100%) of HXB2 HIV-1 from 5' to 3' long terminal repeats (LTRs), with median per-base coverage of over 9000x in one experiment on a single MinION flow cell. The longest HIV-spanning read to date was generated, at a length of 11,487 bases, which included full-length HIV-1 and plasmid backbone with flanking host sequences supporting a single HXB2 integration event. We discovered 20 single nucleotide variants in pHXB2_D compared to reference, verified by short-read DNA sequencing. There were no variants detected in the HIV-1 segments of pNL4-3_gag-pol(Δ1443-4553)_EGFP. Conclusions: Nanopore sequencing performed as expected, phasing LTRs and even covering full-length HIV. The discovery of variants in a reference plasmid demonstrates the need for sequence verification moving forward, in line with calls from funding agencies for reagent verification. These results illustrate the utility of long-read DNA-seq to advance the study of HIV at single integration site resolution.



2019 ◽  
Author(s):  
Scott V. Nguyen ◽  
David R. Greig ◽  
Daniel Hurley ◽  
Yu Cao ◽  
Evonne McCabe ◽  
...  

ABSTRACT A Gram-negative rod from the Yersinia genus was isolated from a clinical case of yersiniosis in the United Kingdom. Long read sequencing data from an Oxford Nanopore Technologies (ONT) MinION in conjunction with Illumina HiSeq reads were used to generate a finished quality genome of this strain. Overall Genome Related Index (OGRI) of the strain was used to determine that it was a novel species within Yersinia, despite biochemical similarities to Yersinia enterocolitica. The 16S ribosomal RNA gene accessions are MN434982-MN434987 and the accession number for the complete and closed chromosome is CP043727. The type strain is CFS3336T (=NCTC 14382T/ =LMG Accession under process).


