read distribution Latest Research Papers

isoformant: A visual toolkit for reference-free long-read isoform analysis at single-read resolution

10.1101/2021.12.17.457386 ◽

2021 ◽

Author(s):

Daniel D Le ◽

Faye T Orcales ◽

William Stephenson

Keyword(s):

Region Of Interest ◽

Sequencing Data ◽

Consensus Sequences ◽

Interactive Analysis ◽

Isoform Diversity ◽

Oxford Nanopore ◽

Long Read ◽

Multiple Samples ◽

Read Distribution

isoformant is an analytical toolkit for isoform characterization of Oxford Nanopore Technologies (ONT) long-transcript sequencing data (i.e. direct RNA and cDNA). Deployment of these tools using Jupyter Notebook enables interactive analysis of a user-defined region of interest (ROI), typically a gene. The core module of isoformant clusters sequencing reads by k-mer density to generate isoform consensus sequences without the requirement for a reference genome or prior annotations. The inclusion of differential isoform usage hypothesis testing based on read distribution among clusters enables comparison across multiple samples. Here, as proof-of-principle, we demonstrate the utility of isoformant for analyzing isoform diversity of commercially available isoform standard mixtures. isoformant is available here: https://github.com/danledinh/isoformant.
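
The abstract does not say which hypothesis test isoformant applies to the read distribution among clusters. As a minimal sketch, assuming a Pearson chi-square test of independence on a samples-by-clusters table of read counts, the statistic could be computed as follows. The function name and the example counts are hypothetical and are not taken from the isoformant code base.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a samples x clusters count table.

    `table` is a list of rows, one row per sample, one column per read
    cluster (putative isoform). Returns (statistic, degrees_of_freedom).
    This is an illustrative assumption, not isoformant's own test.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if grand_total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every sample and every cluster needs at least one read")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if cluster usage were identical across samples.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical read counts: two samples, two isoform clusters.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
```

A p-value would follow from the chi-square distribution with `dof` degrees of freedom, for example via `scipy.stats.chi2.sf(stat, dof)` if SciPy is available.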

Figbird: A probabilistic method for filling gaps in genome assemblies

10.1101/2021.11.24.469861 ◽

2021 ◽

Author(s):

Sumit Tarafder ◽

Mazharul Islam ◽

Swakkhar Shatabda ◽

Atif Rahman

Keyword(s):

Probabilistic Method ◽

Draft Genome ◽

Gap Filling ◽

Sequencing Errors ◽

Sequencing Coverage ◽

Sequencing Technologies ◽

Novel Approach ◽

Account Information ◽

Genome Assemblies ◽

Read Distribution

Motivation: Advances in sequencing technologies have led to the sequencing of genomes of a multitude of organisms. However, draft genomes of many of these organisms contain a large number of gaps due to repeats in genomes, low sequencing coverage and limitations in sequencing technologies. Although several tools exist for filling gaps, many of these do not utilize all information relevant to gap filling. Results: Here, we present a probabilistic method for filling gaps in draft genome assemblies using second-generation reads, based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Our method is based on the expectation-maximization (EM) algorithm, unlike the graph-based methods adopted in the literature. Experiments on real biological datasets show that this novel approach can fill large portions of gaps with a small number of errors and misassemblies compared to other state-of-the-art gap filling tools. Availability and Implementation: The method is implemented in C++ in a software package named "Filling Gaps by Iterative Read Distribution (Figbird)", which is available at: https://github.com/SumitTarafder/Figbird.
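
To illustrate how insert-size information can constrain a gap, here is a small sketch. It is not Figbird's EM procedure. It assumes a normal insert-size model with known mean and standard deviation, and picks the gap length that maximizes the likelihood of read pairs spanning the gap. All names and numbers are hypothetical.

```python
import math


def insert_size_log_likelihood(gap_len, spans, mean, sd):
    """Log-likelihood of a candidate gap length under a normal insert-size model.

    `spans` holds (left, right) pairs: the number of bases a read pair covers
    on the contig left of the gap and on the contig right of it. The implied
    insert size is left + gap_len + right. The normal model is an assumption.
    """
    log_norm = math.log(sd * math.sqrt(2.0 * math.pi))
    total = 0.0
    for left, right in spans:
        z = (left + gap_len + right - mean) / sd
        total += -0.5 * z * z - log_norm
    return total


def best_gap_length(spans, mean, sd, max_gap):
    """Return the integer gap length in [0, max_gap] with the highest likelihood."""
    return max(
        range(max_gap + 1),
        key=lambda g: insert_size_log_likelihood(g, spans, mean, sd),
    )


# Hypothetical library: mean insert size 300 bp, standard deviation 20 bp.
pairs = [(100, 150), (120, 140), (90, 150)]
estimate = best_gap_length(pairs, mean=300.0, sd=20.0, max_gap=200)
```

A full method along the lines the abstract describes would also model sequencing errors and the filled sequence itself; this sketch covers only the gap length.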

Simulation study and comparative evaluation of viral contiguous sequence identification tools

BMC Bioinformatics ◽

10.1186/s12859-021-04242-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Cody Glickman ◽

Jo Hendrix ◽

Michael Strong

Keyword(s):

Machine Learning ◽

State Of The Art ◽

Gene Content ◽

Metagenomic Data ◽

Bioinformatic Tools ◽

Sequence Identification ◽

Tool Performance ◽

Bacterial Genes ◽

Viral Sequences ◽

Read Distribution

Abstract Background: Viruses, including bacteriophages, are important components of environmental and human-associated microbial communities. Viruses can act as extracellular reservoirs of bacterial genes, can mediate microbiome dynamics, and can influence the virulence of clinical pathogens. Various targeted metagenomic analysis techniques detect viral sequences, but these methods often exclude large and genome-integrated viruses. In this study, we evaluate and compare the ability of nine state-of-the-art bioinformatic tools, including Vibrant, VirSorter, VirSorter2, VirFinder, DeepVirFinder, MetaPhinder, Kraken 2, Phybrid, and a BLAST search using identified proteins from the Earth Virome Pipeline, to identify viral contiguous sequences (contigs) across simulated metagenomes with different read distributions, taxonomic compositions, and complexities. Results: Of the tools tested in this study, VirSorter achieved the best F1 score while Vibrant had the highest average F1 score at predicting integrated prophages. Though less balanced in its precision and recall, Kraken 2 had the highest average precision by a substantial margin. We introduced the machine learning tool, Phybrid, which demonstrated an improvement in average F1 score over tools such as MetaPhinder. The tool utilizes machine learning with both gene content and nucleotide features. The addition of nucleotide features improves the precision and recall compared to the gene content features alone. Viral identification by all tools was not impacted by underlying read distribution but did improve with contig length. Tool performance was inversely related to taxonomic complexity and varied by the phage host. For instance, Rhizobium and Enterococcus phages were identified consistently by the tools, whereas Neisseria prophage sequences were commonly missed in this study.
Conclusion: This study benchmarked the performance of nine state-of-the-art bioinformatic tools to identify viral contigs across different simulation conditions. This study explored the ability of the tools to identify integrated prophage elements traditionally excluded from targeted sequencing approaches. Our comprehensive analysis of viral identification tools to assess their performance in a variety of situations provides valuable insights to viral researchers looking to mine viral elements from publicly available metagenomic data.
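
The benchmark compares tools by precision, recall and F1 score. As a reminder of how these relate, a short sketch follows. The contig counts are made up for illustration and do not come from the study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall and F1 for a viral-contig classifier.

    true_positives: viral contigs correctly called viral
    false_positives: non-viral contigs called viral
    false_negatives: viral contigs that were missed
    Ratios with an empty denominator are reported as 0.0.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical tool output: 8 correct viral calls, 2 false calls, 2 misses.
precision, recall, f1 = precision_recall_f1(8, 2, 2)
```

Because F1 is a harmonic mean, a tool with very high precision but low recall still receives a modest F1, which matches the abstract's description of a tool as "less balanced" despite leading on precision.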

Simulation Study and Comparative Evaluation of Viral Contiguous Sequence Identification Tools

10.21203/rs.3.rs-287089/v1 ◽

2021 ◽

Author(s):

Cody Glickman ◽

Jo Hendrix ◽

Michael Strong

Keyword(s):

Machine Learning ◽

Microbial Communities ◽

State Of The Art ◽

Metagenomic Data ◽

Bioinformatic Tools ◽

Sequence Identification ◽

Tool Performance ◽

Bacterial Genes ◽

Viral Sequences ◽

Read Distribution

Abstract Background: Viruses, including bacteriophage, are important components of environmental and human-associated microbial communities. Viruses can act as extracellular reservoirs of bacterial genes, can mediate microbiome dynamics, and can influence the virulence of clinical pathogens. It is essential, therefore, to have robust sequence analysis methods in place to detect and annotate viral elements within microbial communities. Various targeted metagenomic analysis techniques detect viral sequences, but these methods often exclude large and genome-integrated viruses. In this study, we evaluate and compare the ability of nine state-of-the-art bioinformatic tools, including Vibrant, VirSorter, VirSorter2, VirFinder, DeepVirFinder, MetaPhinder, JGI Earth Virome Pipeline, Kraken 2, and VirBrant, to identify viral contiguous sequences (contigs) across simulated metagenomes with different read distributions, taxonomic compositions, and complexities. Results: Of the tools tested in this study, VirSorter achieved the best F1 score while Vibrant had the highest average F1 score at predicting integrated prophages. Though less balanced in its precision and recall, Kraken 2 had the highest average precision by a substantial margin. We introduced the machine learning tool, VirBrant, which demonstrated an improvement in average F1 score over tools such as MetaPhinder. The tool utilizes machine learning with both protein compositional and nucleotide features. The addition of nucleotide features improves the precision and recall compared to the protein compositional features alone. Viral identification by all tools was not impacted by underlying read distribution but did improve with contig length. Tool performance was inversely related to taxonomic complexity and varied by the phage host. Rhizobium and Enterococcus phage were identified consistently by the tools, whereas Neisseria phage were commonly missed in this study.
Conclusion: This study benchmarked the performance of nine state-of-the-art bioinformatic tools to identify viral contigs across different simulation conditions. This study explored the ability of the tools to identify integrated prophage elements traditionally excluded from targeted sequencing approaches. Our comprehensive analysis of viral identification tools to assess their performance in a variety of situations provides valuable insights to viral researchers looking to mine viral elements from publicly available metagenomic data.

Whole‐genome sequencing of cell‐free DNA yields genome‐wide read distribution patterns to track tissue of origin in cancer patients

Clinical and Translational Medicine ◽

10.1002/ctm2.177 ◽

2020 ◽

Vol 10 (6) ◽

Author(s):

Han Liang ◽

Fuqiang Li ◽

Sitan Qiao ◽

Xinlan Zhou ◽

Guoyun Xie ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Cancer Patients ◽

Genome Sequencing ◽

Distribution Patterns ◽

Whole Genome ◽

Cell Free Dna ◽

Free Dna ◽

Genome Wide ◽

Tissue Of Origin ◽

Read Distribution

LIQA: Long-read Isoform Quantification and Analysis

10.1101/2020.09.09.289793 ◽

2020 ◽

Author(s):

Yu Hu ◽

Li Fang ◽

Xuelian Chen ◽

Jiang F. Zhong ◽

Mingyao Li ◽

...

Keyword(s):

Simulated Data ◽

Read Length ◽

Specific Gene ◽

Rna Seq ◽

Short Read ◽

Oxford Nanopore ◽

Long Read ◽

Low Coverage ◽

Read Distribution ◽

Isoform Quantification

Abstract Long-read RNA sequencing (RNA-seq) technologies have made it possible to sequence full-length transcripts, facilitating the exploration of isoform-specific gene expression over conventional short-read RNA-seq. However, long-read RNA-seq suffers from high per-base error rate, presence of chimeric reads and alternative alignments, and other biases, which require different analysis methods than short-read RNA-seq. Here we present LIQA (Long-read Isoform Quantification and Analysis), an Expectation-Maximization based statistical method to quantify isoform expression and detect differential alternative splicing (DAS) events using long-read RNA-seq data. Rather than summarizing isoform-specific read counts directly as done in short-read methods, LIQA incorporates base-pair quality score and isoform-specific read length information to assign different weights across reads, which reflects alignment confidence. Moreover, given isoform usage estimates, LIQA can detect DAS events between conditions. We evaluated LIQA’s performance on simulated data and demonstrated that it outperforms other approaches in rare isoform characterization and in detecting DAS events between two groups. We also generated one direct mRNA sequencing dataset and one cDNA sequencing dataset using the Oxford Nanopore long-read platform, both with paired short-read RNA-seq data and qPCR data on selected genes, and we demonstrated that LIQA performs well in isoform discovery and quantification. Finally, we evaluated LIQA on a PacBio dataset on esophageal squamous epithelial cells, and demonstrated that LIQA recovered DAS events on FGFR3 that failed to be detected in short-read data. In summary, LIQA leverages the power of long-read RNA-seq and achieves higher accuracy in estimating isoform abundance than existing approaches, especially for isoforms with low coverage and biased read distribution. LIQA is freely available for download at https://github.com/WGLab/LIQA.
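
The general idea of EM-based isoform quantification with per-read weights can be sketched as below. This is a textbook-style simplification, not LIQA's model: the weights here are plain hypothetical compatibility scores, whereas the abstract says LIQA derives its weights from base-pair quality scores and isoform-specific read length information.

```python
def em_isoform_abundance(read_weights, n_iter=200):
    """Estimate relative isoform abundances with a basic EM loop.

    `read_weights[r][k]` is a non-negative weight expressing how compatible
    read r is with isoform k (0 means incompatible). How such weights are
    computed is outside this sketch; the values used below are hypothetical.
    """
    n_isoforms = len(read_weights[0])
    theta = [1.0 / n_isoforms] * n_isoforms  # start from uniform abundances

    for _ in range(n_iter):
        expected_counts = [0.0] * n_isoforms
        # E-step: split each read across isoforms in proportion to
        # abundance times compatibility weight.
        for weights in read_weights:
            denom = sum(t * w for t, w in zip(theta, weights))
            if denom == 0.0:
                continue  # read is compatible with no isoform
            for k in range(n_isoforms):
                expected_counts[k] += theta[k] * weights[k] / denom
        total = sum(expected_counts)
        if total == 0.0:
            break
        # M-step: abundances are the normalized expected counts.
        theta = [count / total for count in expected_counts]
    return theta


# Hypothetical gene with two isoforms: 3 reads unique to the first isoform,
# 1 read unique to the second, and 4 reads compatible with both.
reads = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] + [[1.0, 1.0]] * 4
abundances = em_isoform_abundance(reads)
```

In this toy input the ambiguous reads carry no information about which isoform they came from, so the estimate is driven by the uniquely assigned reads.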

Performance assessment of total RNA sequencing of human biofluids and extracellular vesicles

Scientific Reports ◽

10.1038/s41598-019-53892-x ◽

2019 ◽

Vol 9 (1) ◽

Cited By ~ 11

Author(s):

Celine Everaert ◽

Hetty Helsmoortel ◽

Anneleen Decock ◽

Eva Hulstaert ◽

Ruben Van Paemel ◽

...

Keyword(s):

Rna Sequencing ◽

Extracellular Vesicles ◽

Platelet Rich Plasma ◽

Rna Seq ◽

Total Rna ◽

Rna Molecules ◽

Rna Profiling ◽

Wide Range ◽

Read Distribution ◽

Free Plasma

Abstract RNA profiling has emerged as a powerful tool to investigate the biomarker potential of human biofluids. However, despite enormous interest in extracellular nucleic acids, RNA sequencing methods to quantify the total RNA content outside cells are rare. Here, we evaluate the performance of the SMARTer Stranded Total RNA-Seq method in human platelet-rich plasma, platelet-free plasma, urine, conditioned medium, and extracellular vesicles (EVs) from these biofluids. We found the method to be accurate, precise, compatible with low-input volumes and able to quantify a few thousand genes. We picked up distinct classes of RNA molecules, including mRNA, lncRNA, circRNA, miscRNA and pseudogenes. Notably, the read distribution and gene content drastically differ among biofluids. In conclusion, we are the first to show that the SMARTer method can be used for unbiased unraveling of the complete transcriptome of a wide range of biofluids and their extracellular vesicles.

Whole genome sequencing of cell-free DNA yields genome-wide read distribution patterns to track tissue of origin in cancer patients

10.1101/772657 ◽

2019 ◽

Author(s):

Han Liang ◽

Fuqiang Li ◽

Sitan Qiao ◽

Xinlan Zhou ◽

Guoyun Xie ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Cancer Patients ◽

Genome Sequencing ◽

Somatic Mosaicism ◽

Distribution Patterns ◽

Whole Genome ◽

Cell Free Dna ◽

Free Dna ◽

Tissue Of Origin ◽

Read Distribution

Abstract Somatic mosaicism is widespread among tissues and could indicate distinct tissue origins of circulating cell-free DNA (cfDNA), DNA fragments released by lytic cells into the blood. By investigating the alignment patterns of whole genome sequencing reads with the genomic DNA of different tissues, we found that the read distributions formed type-specific patterns in some regions as a result of somatic mosaicism. We then utilized this information to construct a tissue-of-origin mapping model and evaluated its predictive performance on whole genome sequencing data from tissue and cfDNA samples. In total, 1,545 tissue samples associated with 13 cancer types were included, and identification of the tissue of origin achieved a specificity of 82% and a sensitivity of 80%. Furthermore, a total of 30 cfDNA samples from lung cancer and liver cancer patients and healthy controls were analyzed to predict their tissues of origin with a specificity of 87% and a sensitivity of 87%. Our results show that read distribution patterns from whole genome sequencing could be used to identify cfDNA tissues of origin with high accuracy, suggesting the potential application of our model to early cancer detection and diagnosis.
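
The abstract reports specificity and sensitivity for a multi-class tissue-of-origin prediction without saying how the classes were aggregated. One common convention, assumed here, is a one-vs-rest calculation per tissue. The labels below are hypothetical and are not data from the study.

```python
def one_vs_rest_metrics(true_labels, predicted_labels, positive):
    """Sensitivity and specificity for one tissue class against all others.

    Returns (sensitivity, specificity). A ratio is NaN when its denominator
    is empty, i.e. when no sample belongs to the relevant group.
    """
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical predictions for five cfDNA samples.
truth = ["lung", "lung", "liver", "liver", "healthy"]
calls = ["lung", "liver", "liver", "liver", "lung"]
sensitivity, specificity = one_vs_rest_metrics(truth, calls, positive="lung")
```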

Performance assessment of total RNA sequencing of human biofluids and extracellular vesicles

10.1101/701524 ◽

2019 ◽

Author(s):

Celine Everaert ◽

Hetty Helsmoortel ◽

Anneleen Decock ◽

Eva Hulstaert ◽

Ruben Van Paemel ◽

...

Keyword(s):

Rna Sequencing ◽

Extracellular Vesicles ◽

Platelet Rich Plasma ◽

Rna Seq ◽

Total Rna ◽

Rna Molecules ◽

Rna Profiling ◽

Wide Range ◽

Read Distribution ◽

Free Plasma

Abstract RNA profiling has emerged as a powerful tool to investigate the biomarker potential of human biofluids. However, despite enormous interest in extracellular nucleic acids, RNA sequencing methods to quantify the total RNA content outside cells are rare. Here, we evaluate the performance of the SMARTer Stranded Total RNA-Seq method in human platelet-rich plasma, platelet-free plasma, urine, conditioned medium, and extracellular vesicles (EVs) from these biofluids. We found the method to be accurate, precise, compatible with low-input volumes and able to quantify a few thousand genes. We picked up distinct classes of RNA molecules, including mRNA, lncRNA, circRNA, miscRNA and pseudogenes. Notably, the read distribution and gene content drastically differ among biofluids. In conclusion, we are the first to show that the SMARTer method can be used for unbiased unraveling of the complete transcriptome of a wide range of biofluids and their extracellular vesicles.

Effect of de novo transcriptome assembly on transcript quantification

10.1101/380998 ◽

2018 ◽

Author(s):

Ping-Han Hsieh ◽

Yen-Jen Oyang ◽

Chien-Yu Chen

Keyword(s):

Developmental Stages ◽

De Novo ◽

Sequence Similarity ◽

Transcriptome Assembly ◽

Connected Components ◽

Strong Impact ◽

Rna Seq ◽

Transcript Quantification ◽

Downstream Analysis ◽

Read Distribution

Abstract Background: Correct quantification of transcript expression is essential to understand the functional products of the genome in different physiological conditions and developmental stages. Recently, the development of high-throughput RNA sequencing (RNA-Seq) has allowed researchers to perform transcriptome analysis for organisms without a reference genome or transcriptome. For such projects, de novo transcriptome assembly must be carried out prior to quantification. However, a large number of erroneous contigs produced by the assemblers might result in unreliable estimation of the abundance of transcripts. In this regard, this study comprehensively investigates how assembly quality affects the performance of quantification for RNA-Seq analysis based on de novo transcriptome assembly. Results: Several important factors that might seriously affect the accuracy of the RNA-Seq analysis were thoroughly discussed. First, we found that the assemblers perform comparatively well for the transcriptomes with lower biological complexity. Second, we examined the over-extended and incomplete contigs, and then demonstrated that assembly completeness has a strong impact on the estimation of contig abundance. Lastly, we investigated the behavior of the quantifiers with respect to sequence ambiguity which might be originally present in the transcriptome or accidentally produced by assemblers. The results suggest that the quantifiers often over-estimate the expression of family-collapse contigs and under-estimate the expression of duplicated contigs. For organisms without a reference transcriptome, it remains challenging to detect inaccurate abundance estimation on family-collapse contigs. In contrast, we observed that under-estimation on duplicated contigs can be flagged by analyzing the read distribution of the duplicated contigs. Conclusions: In summary, we explicated the behavior of quantifiers when erroneous contigs are present and we outlined the potential problems that the assemblers might cause for the downstream analysis of RNA-Seq. We anticipate that the analytic results of this study provide valuable insights for future development of transcriptome assembly and quantification. Availability: We propose an open-source Python-based package, QuantEval, that builds connected components for the assembled contigs based on sequence similarity and evaluates the quantification results for each connected component. The package can be downloaded from https://github.com/dn070017/QuantEval.
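
The grouping step described under Availability, building connected components of contigs from pairwise sequence similarity, can be sketched with a union-find structure. This is an illustration of the general technique, not QuantEval's implementation. The contig names are hypothetical, and the similar pairs are assumed to come from an external aligner.

```python
def connected_components(contigs, similar_pairs):
    """Group contigs into connected components given pairs deemed similar.

    `similar_pairs` is an iterable of (contig_a, contig_b) tuples, e.g. pairs
    that passed a similarity threshold in an all-vs-all alignment (assumed to
    be computed elsewhere). Returns a sorted list of sorted components.
    """
    parent = {contig: contig for contig in contigs}

    def find(node):
        # Walk to the root, halving the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for contig in contigs:
        groups.setdefault(find(contig), []).append(contig)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical contigs and similarity hits.
names = ["c1", "c2", "c3", "c4", "c5", "c6"]
hits = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
components = connected_components(names, hits)
```

Quantification results could then be summed or compared within each component, so that duplicated or family-collapse contigs are evaluated together rather than one by one.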
