FastViromeExplorer: a pipeline for virus and phage identification and abundance profiling in metagenomics data

PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e4227 ◽  
Author(s):  
Saima Sultana Tithi ◽  
Frank O. Aylward ◽  
Roderick V. Jensen ◽  
Liqing Zhang

With the increase in the availability of metagenomic data generated by next generation sequencing, there is an urgent need for fast and accurate tools for identifying viruses in host-associated and environmental samples. In this paper, we developed a stand-alone pipeline called FastViromeExplorer for the detection and abundance quantification of viruses and phages in large metagenomic datasets by performing rapid searches of virus and phage sequence databases. Both simulated and real data from human microbiome and ocean environmental samples are used to validate FastViromeExplorer as a reliable tool to quickly and accurately identify viruses and their abundances in large datasets.

2017 ◽  
Author(s):  
Saima Sultana Tithi ◽  
Roderick V. Jensen ◽  
Liqing Zhang

Identifying viruses and phages in a metagenomics sample has important implications for improving human health, preventing viral outbreaks, and developing personalized medicine. With the rapid increase in data generated by next generation sequencing, existing tools for identifying and annotating viruses and phages in metagenomics samples suffer from long running times. In this paper, we developed a stand-alone pipeline, FastViromeExplorer, for rapid identification and abundance quantification of viruses and phages in big metagenomic data. Both real and simulated data validated FastViromeExplorer as a reliable tool that identifies viruses and their abundances in large datasets accurately and in a time-efficient manner.


2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Felix M. Kibegwa ◽  
Rawlynce C. Bett ◽  
Charles K. Gachuiri ◽  
Francesca Stomeo ◽  
Fidalis D. Mujibi

Analysis of shotgun metagenomic data generated from next generation sequencing platforms can be done through a variety of bioinformatic pipelines. These pipelines employ different sets of sophisticated bioinformatics algorithms, which may affect the results of the analysis. In this study, we compared two commonly used pipelines for shotgun metagenomic analysis, MG-RAST and Kraken 2, in terms of taxonomic classification, diversity analysis, and usability, using primarily their default parameters. Overall, the two pipelines detected similar abundance distributions in the three most abundant taxa: Proteobacteria, Firmicutes, and Bacteroidetes. Within the bacterial domain, 497 genera were identified by both pipelines, while an additional 694 and 98 genera were identified solely by Kraken 2 and MG-RAST, respectively. A total of 933 species were detected by both pipelines; Kraken 2 alone detected a further 3550 species, while MG-RAST uniquely identified 557 species. For archaea, Kraken 2 detected 105 genera and 236 species, while MG-RAST detected 60 genera and 88 species. Of these, 54 genera and 72 species were detected by both methods. Kraken 2 had a quicker analysis time (~4 hours), while MG-RAST took approximately 2 days per sample. This study revealed that Kraken 2 and MG-RAST generate comparable results and that a reliable high-level overview of a sample is generated irrespective of the pipeline selected. However, Kraken 2 generated a more accurate taxonomic identification, given the higher number of “Unclassified” reads in MG-RAST. The observed variations at the genus level show that a main limitation is the use of different databases for classification of the metagenomic data. The results of this research indicate that a more inclusive and representative classification of microbiomes may be achieved through the creation of combined pipelines.
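The shared and unique bacterial counts reported above can be turned into per-pipeline totals and an overlap measure. The following is a minimal sketch of that arithmetic only; the helper name `overlap_summary` and the use of the Jaccard index are illustrative assumptions and are not taken from the study.

```python
# Illustrative arithmetic on the counts quoted in the abstract.
# `overlap_summary` is a hypothetical helper, not part of Kraken 2 or MG-RAST.

def overlap_summary(shared, only_kraken2, only_mgrast):
    """Derive totals and a Jaccard index from shared/unique taxon counts."""
    union = shared + only_kraken2 + only_mgrast
    return {
        "kraken2_total": shared + only_kraken2,
        "mgrast_total": shared + only_mgrast,
        "union": union,
        # Jaccard index: taxa found by both pipelines / taxa found by either.
        "jaccard": shared / union,
    }

# Bacterial genera: 497 shared, 694 Kraken 2 only, 98 MG-RAST only.
bacterial_genera = overlap_summary(497, 694, 98)
# Bacterial species: 933 shared, 3550 Kraken 2 only, 557 MG-RAST only.
bacterial_species = overlap_summary(933, 3550, 557)

for label, summary in [("genera", bacterial_genera), ("species", bacterial_species)]:
    print(label, summary)
```

Under these counts, Kraken 2 reports 1191 bacterial genera in total and MG-RAST 595, so fewer than half of the genera seen by either pipeline are seen by both.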


2018 ◽  
Author(s):  
Viachaslau Tsyvina ◽  
David S. Campo ◽  
Seth Sims ◽  
Alex Zelikovsky ◽  
Yury Khudyakov ◽  
...  

Many biological analysis tasks require extraction of families of genetically similar sequences from large datasets produced by next-generation sequencing (NGS). Such tasks include detecting viral transmissions by analyzing all genetically close pairs of sequences from viral datasets sampled from infected individuals, and studying the evolution of viruses or immune repertoires by analyzing networks of intra-host viral variants or antibody clonotypes formed by genetically close sequences. The most obvious naive algorithms for extracting such sequence families are impractical given the massive size of modern NGS datasets. In this paper, we present a fast and scalable k-mer-based framework that performs such sequence similarity queries efficiently and specifically targets data produced by deep sequencing of heterogeneous populations such as viruses. The tool is freely available for download at https://github.com/vyacheslav-tsivina/signature-sj
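To illustrate why k-mers help with this kind of query, here is a minimal filter-and-verify sketch. It is a generic textbook-style approach and is not the algorithm implemented in the linked tool; the function names, the toy sequences, and the assumption that all sequences have equal length (so Hamming distance applies) are illustrative choices.

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def hamming(a, b):
    """Number of mismatching positions; assumes equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def close_pairs(seqs, k, max_dist):
    """Find index pairs (i, j) with Hamming distance <= max_dist.

    Filter step: only pairs that share at least one k-mer are compared,
    which avoids the all-against-all comparison of the naive approach.
    For sequences of length L, a shared k-mer is guaranteed for every
    true pair when k <= (L - max_dist) // (max_dist + 1).
    """
    index = defaultdict(set)
    for sid, seq in enumerate(seqs):
        for km in kmers(seq, k):
            index[km].add(sid)

    candidates = set()
    for ids in index.values():
        candidates.update(combinations(sorted(ids), 2))

    # Verify step: keep only candidates that are really within max_dist.
    return sorted(
        (i, j) for i, j in candidates if hamming(seqs[i], seqs[j]) <= max_dist
    )

# Toy data (made up): the first two sequences differ at one position.
toy = ["ACGTACGTAC", "ACGTACGAAC", "TTTTGGGGCC"]
pairs = close_pairs(toy, k=3, max_dist=1)
print(pairs)
```

In this toy run only the first two sequences share any 3-mer, so the third sequence is never compared against the others.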


2017 ◽  
Author(s):  
Xiao-Tao Jiang ◽  
Ke Yu ◽  
Li-Guan Li ◽  
Xiao-Le Yin ◽  
An-Dong Li ◽  
...  

Metatranscriptomics has become increasingly important with the application of next generation sequencing to studies of microbial functional gene activity in environmental samples. However, quantification of target active genes is hindered by current relative quantification methods, especially when tracking sharp environmental changes. There is a great need for an easy-to-perform method for absolute quantification. By borrowing information from the parallel metagenome, an absolute quantification method, AQMM, was developed for both metagenomic and metatranscriptomic data at the per gene/cell/volume/gram level. The effectiveness of AQMM was validated by simulated experiments and was demonstrated with a real experimental design comparing activated sludge with and without foaming. Our method provides a novel bioinformatic approach for fast and accurate absolute quantification of metagenomes and metatranscriptomes in environmental samples. AQMM can be accessed from https://github.com/biofuture/aqmm.


2015 ◽  
pp. 539-544
Author(s):  
Henry C. M. Leung ◽  
Yi Wang ◽  
S. M. Yiu ◽  
Francis Y. L. Chin

Data in Brief ◽  
2019 ◽  
Vol 22 ◽  
pp. 195-198 ◽  
Author(s):  
Farha Arakkaveettil Kabeer ◽  
T. Jabir ◽  
K.P. Krishnan ◽  
Mohamed Hatha Abdulla

2014 ◽  
Vol 12 (04) ◽  
pp. 1450021
Author(s):  
Junbo Duan ◽  
Ji-Gang Zhang ◽  
Mingxi Wan ◽  
Hong-Wen Deng ◽  
Yu-Ping Wang

Copy number variations (CNVs) can be used as significant biomarkers, and next generation sequencing (NGS) provides high-resolution detection of these CNVs. However, how to extract features from CNVs and apply them to genomic studies such as population clustering has become a major challenge. In this paper, we propose a novel method for population clustering based on CNVs from NGS. First, CNVs are extracted from each sample to form a feature matrix. Then, this feature matrix is decomposed into a source matrix and a weight matrix with non-negative matrix factorization (NMF). The source matrix consists of common CNVs that are shared by all the samples from the same group, and the weight matrix indicates the corresponding level of CNVs from each sample. Therefore, using NMF of CNVs, one can differentiate samples from different ethnic groups, i.e. perform population clustering. To validate the approach, we applied it to the analysis of both simulation data and two real data sets from the 1000 Genomes Project. The results on simulation data demonstrate that the proposed method can recover the true common CNVs with high quality. The results of the first real data analysis show that the proposed method can cluster two family trios with different ancestries into two ethnic groups, and the results of the second real data analysis show that the proposed method can be applied to the whole genome with a large sample size consisting of multiple groups. Both results demonstrate the potential of the proposed method for population clustering.
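The decomposition step can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' implementation: it uses the standard multiplicative-update rules for NMF, assumes the feature matrix is oriented as CNV regions by samples, uses a made-up toy matrix, and assigns each sample to the group with the largest weight. All function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, rank, iters=500, seed=0, eps=1e-9):
    """Factor v (regions x samples) as source (regions x rank) times
    weight (rank x samples) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    source = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    weight = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # weight <- weight * (S^T V) / (S^T S weight)
        st = transpose(source)
        num = matmul(st, v)
        den = matmul(matmul(st, source), weight)
        weight = [[weight[r][j] * num[r][j] / (den[r][j] + eps)
                   for j in range(m)] for r in range(rank)]
        # source <- source * (V W^T) / (source W W^T)
        wt = transpose(weight)
        num = matmul(v, wt)
        den = matmul(source, matmul(weight, wt))
        source = [[source[i][r] * num[i][r] / (den[i][r] + eps)
                   for r in range(rank)] for i in range(n)]
    return source, weight

def cluster_labels(weight):
    """Assign each sample (column) to the component with the largest weight."""
    return [max(range(len(weight)), key=lambda r: weight[r][j])
            for j in range(len(weight[0]))]

# Toy feature matrix (made up): 6 CNV regions x 4 samples.
# Samples 0-1 carry CNVs in regions 0-2; samples 2-3 in regions 3-5.
features = [
    [2, 2, 0, 0],
    [1, 1, 0, 0],
    [3, 3, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
]
source, weight = nmf(features, rank=2)
labels = cluster_labels(weight)
print(labels)
```

Which group receives label 0 and which receives label 1 depends on the random initialization; what matters for clustering is that samples with the same CNV pattern end up with the same label.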

