PPIT: an R package for inferring microbial taxonomy from nifH sequences

Author(s):  
Bennett J Kapili ◽  
Anne E Dekas

Abstract. Motivation: Linking microbial community members to their ecological functions is a central goal of environmental microbiology. When assigned taxonomy, amplicon sequences of metabolic marker genes can suggest such links, thereby offering an overview of the phylogenetic structure underpinning particular ecosystem functions. However, inferring microbial taxonomy from metabolic marker gene sequences remains a challenge, particularly for the frequently sequenced nitrogen fixation marker gene, nitrogenase reductase (nifH). Horizontal gene transfer in recent nifH evolutionary history can confound taxonomic inferences drawn from the pairwise identity methods used in existing software. Other methods for inferring taxonomy are not standardized and require manual inspection that is difficult to scale. Results: We present Phylogenetic Placement for Inferring Taxonomy (PPIT), an R package that infers microbial taxonomy from nifH amplicons using both phylogenetic and sequence identity approaches. After users place query sequences on a reference nifH gene tree provided by PPIT (n = 6317 full-length nifH sequences), PPIT searches the phylogenetic neighborhood of each query sequence and attempts to infer microbial taxonomy. An inference is drawn only if references in the phylogenetic neighborhood (1) are taxonomically consistent and (2) share sufficient pairwise identity with the query, thereby avoiding erroneous inferences due to known horizontal gene transfer events. We find that PPIT returns a higher proportion of correct taxonomic inferences than BLAST-based approaches at the cost of fewer total inferences. We demonstrate PPIT on deep-sea sediment and find that Deltaproteobacteria are the most abundant potential diazotrophs. Using this dataset, we show that emending PPIT inferences based on visual inspection of query sequence placement can achieve taxonomic inferences for nearly all sequences in a query set. We additionally discuss how users can apply PPIT to the analysis of other marker genes. Availability: PPIT is freely available to non-commercial users at https://github.com/bkapili/ppit. Installation includes a vignette that demonstrates package use and reproduces the nifH amplicon analysis discussed here. The raw nifH amplicon sequence data have been deposited in the GenBank, EMBL, and DDBJ databases under BioProject number PRJEB37167. Supplementary information: Supplementary data are available at Bioinformatics online.
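
The two-part decision rule described in this abstract can be illustrated with a minimal Python sketch. It is not PPIT's implementation (PPIT is an R package): the function name `infer_taxon`, the input layout, and the 0.9 identity cutoff are hypothetical and chosen only for illustration.

```python
def infer_taxon(neighbors, min_identity=0.9):
    """Return a taxon for a query, or None if no inference should be drawn.

    neighbors: list of (taxon, pairwise_identity) pairs for the reference
    sequences in the query's phylogenetic neighborhood (hypothetical layout).
    min_identity: hypothetical cutoff; not a value taken from PPIT.
    """
    if not neighbors:
        return None
    # Criterion 1: the neighboring references must agree on a single taxon.
    taxa = {taxon for taxon, _ in neighbors}
    if len(taxa) != 1:
        return None
    # Criterion 2: the closest reference must be similar enough to the query.
    best_identity = max(identity for _, identity in neighbors)
    if best_identity < min_identity:
        return None
    return next(iter(taxa))
```

Returning `None` whenever either criterion fails mirrors the trade-off stated in the abstract: fewer total inferences in exchange for a higher proportion of correct ones.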

Author(s):  
Yixuan Qiu ◽  
Jiebiao Wang ◽  
Jing Lei ◽  
Kathryn Roeder

Abstract. Motivation: Marker genes, defined as genes that are expressed primarily in a single cell type, can be identified from the single-cell transcriptome; however, such data are not always available for the many uses of marker genes, such as deconvolution of bulk tissue. Marker genes for a cell type are nonetheless highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in the samples. Therefore, when many tissue samples are analyzed, it is possible to identify these marker genes from the correlation pattern. Results: To capitalize on this pattern, we develop a new semi-supervised algorithm that detects marker genes by combining published information about likely marker genes with bulk transcriptome data. The algorithm then exploits the correlation structure of the bulk data to refine the published marker genes by adding or removing genes from the list. Availability and implementation: We implement this method as an R package, markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen). Supplementary information: Supplementary data are available at Bioinformatics online.
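
The idea that marker genes of one cell type co-vary across bulk samples can be illustrated with a small Python sketch. This is a plain correlation-threshold stand-in, not the markerpen algorithm: the function names, the 0.8 cutoff, and the toy expression values are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sd_x * sd_y)


def refine_markers(expr, seed_markers, threshold=0.8):
    """Keep or add genes whose bulk profile tracks the seed-marker average.

    expr: dict mapping gene name -> expression values across bulk samples.
    seed_markers: published candidate markers (may contain wrong entries).
    threshold: hypothetical correlation cutoff.
    """
    n_samples = len(next(iter(expr.values())))
    # Average profile of the seed list approximates the cell-type proportion.
    centroid = [
        sum(expr[g][i] for g in seed_markers) / len(seed_markers)
        for i in range(n_samples)
    ]
    return sorted(g for g in expr if pearson(expr[g], centroid) >= threshold)


# Toy bulk data: A, B and D follow the same cell-type proportion; C does not.
toy_expr = {
    "A": [1.0, 3.1, 5.0, 6.9, 9.0],
    "B": [2.1, 6.0, 10.2, 13.9, 18.0],
    "C": [5.0, 4.0, 6.0, 5.0, 4.5],
    "D": [0.5, 1.6, 2.4, 3.6, 4.4],
}
refined = refine_markers(toy_expr, seed_markers=["A", "B", "C"])
```

In this toy case the refined list drops the seed gene C, whose profile does not follow the others, and adds the unlisted gene D, which does.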


2020 ◽  
Vol 36 (9) ◽  
pp. 2740-2749
Author(s):  
Henry Xing ◽  
Steven W Kembel ◽  
Vladimir Makarenkov

Abstract. Motivation: Phylogenetic trees and the methods for their analysis have played a key role in many evolutionary, ecological and bioinformatics studies. Alternatively, phylogenetic networks have been widely used to analyze and represent complex reticulate evolutionary processes which cannot be adequately studied using traditional phylogenetic methods. These processes include, among others, hybridization, horizontal gene transfer, and genetic recombination. Nowadays, sequence similarity and genome similarity networks have become an efficient tool for community analysis of large molecular datasets in comparative studies. These networks can be used for tackling a variety of complex evolutionary problems such as the identification of horizontal gene transfer events, the recovery of mosaic genes and genomes, and the study of holobionts. Results: The shortest path in a phylogenetic tree is used to estimate evolutionary distances between species. We show how the shortest path concept can be extended to sequence similarity networks by defining five new distances, NetUniFrac, Spp, Spep, Spelp and Spinp, and the Transfer index, between species communities present in the network. These new distances can be seen as network analogs of the traditional UniFrac distance used to assess dissimilarity between species communities in a phylogenetic tree, whereas the Transfer index is intended for estimating the rate and direction of gene transfers, or species dispersal, between different phylogenetic, or ecological, species communities. Moreover, NetUniFrac and the Transfer index can be computed in linear time with respect to the number of edges in the network. We show how these new measures can be used to analyze microbiota and antibiotic resistance gene similarity networks. Availability and implementation: Our NetFrac program, implemented in R and C, along with its source code, is freely available on GitHub at https://github.com/XPHenry/Netfrac. Supplementary information: Supplementary data are available at Bioinformatics online.
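
To make the shortest-path idea concrete, the Python sketch below computes breadth-first shortest paths in an unweighted similarity network and averages them between two communities. This is not one of the five distances named in the abstract, whose definitions are not given here; the averaging rule, function names, and toy network are assumptions.

```python
from collections import deque


def shortest_path_lengths(graph, source):
    """Breadth-first search: edge counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_between_community_distance(graph, community_a, community_b):
    """Average shortest-path length over all reachable (a, b) pairs.

    A hypothetical summary statistic used only to illustrate the concept.
    """
    lengths = []
    for a in community_a:
        dist = shortest_path_lengths(graph, a)
        lengths.extend(dist[b] for b in community_b if b in dist)
    return sum(lengths) / len(lengths) if lengths else float("inf")


# Toy similarity network: a simple chain s1 - s2 - s3 - s4.
toy_network = {
    "s1": ["s2"],
    "s2": ["s1", "s3"],
    "s3": ["s2", "s4"],
    "s4": ["s3"],
}
toy_distance = mean_between_community_distance(
    toy_network, ["s1", "s2"], ["s3", "s4"]
)
```

Each breadth-first search visits every edge at most once, which is the kind of per-source cost that makes edge-linear computation plausible for path-based measures on sparse networks.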


2016 ◽  
Author(s):  
Kieran R Campbell ◽  
Christopher Yau

Abstract. Pseudotime estimation from single-cell gene expression allows the recovery of temporal information from otherwise static profiles of individual cells. This pseudotemporal information can be used to characterise transient events in temporally evolving biological systems. Conventional algorithms typically emphasise an unsupervised transcriptome-wide approach and use retrospective analysis to evaluate the behaviour of individual genes. Here we introduce an orthogonal approach termed “Ouija” that learns pseudotimes from a small set of marker genes that might ordinarily be used to retrospectively confirm the accuracy of unsupervised pseudotime algorithms. Crucially, we model these genes in terms of switch-like or transient behaviour along the trajectory, allowing us to understand why the pseudotimes have been inferred and learn informative parameters about the behaviour of each gene. Since each gene is associated with a switch or peak time, the genes are effectively ordered along with the cells, allowing each part of the trajectory to be understood in terms of the behaviour of certain genes. In the following, we introduce our model and demonstrate that in many instances a small panel of marker genes can recover pseudotimes that are consistent with those obtained using the entire transcriptome. Furthermore, we show that our method can detect differences in the regulation timings between two genes and identify “metastable” states - discrete cell types along the continuous trajectories - that recapitulate known cell types. Ouija therefore provides a powerful complementary approach to existing whole-transcriptome-based pseudotime estimation methods. An open source implementation is available at http://www.github.com/kieranrcampbell/ouija as an R package and at http://www.github.com/kieranrcampbell/ouijaflow as a Python/TensorFlow package.
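
The distinction between switch-like and transient gene behaviour can be sketched with two simple mean-expression curves over pseudotime. The functional forms below (a logistic curve and a Gaussian-shaped bump) and all parameter names are generic assumptions for illustration, not necessarily the parameterisation used by Ouija.

```python
import math


def switch_mean(t, peak, k, t0):
    """Switch-like gene: expression rises (k > 0) or falls (k < 0) around t0.

    peak: plateau expression level; k: switch steepness; t0: switch time.
    """
    return peak / (1.0 + math.exp(-k * (t - t0)))


def transient_mean(t, peak, width, p):
    """Transient gene: expression peaks at pseudotime p and decays either side.

    peak: maximum expression level; width: spread of the bump; p: peak time.
    """
    return peak * math.exp(-((t - p) ** 2) / (2.0 * width ** 2))
```

Under this sketch each gene carries an interpretable time parameter (`t0` or `p`), which is what allows genes to be ordered along the trajectory together with the cells.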


2016 ◽  
Vol 2 ◽  
pp. e56 ◽  
Author(s):  
Orlando Schwery ◽  
Brian C. O’Meara

Background. The monophyly of taxa is an important attribute of a phylogenetic tree. A lack of it may hint at shortcomings of either the tree or the current taxonomy, or can indicate cases of incomplete lineage sorting or horizontal gene transfer. Whatever the reason, a lack of monophyly can misguide subsequent analyses. While monophyly is conceptually simple, it is tedious and time-consuming to assess manually on modern phylogenies of hundreds to thousands of species. Results. The R package MonoPhy allows assessment and exploration of monophyly of taxa in a phylogeny. It can assess the monophyly of genera using the phylogeny only, and with an additional input file any other desired higher order taxa or unranked groups can be checked as well. Conclusion. Summary tables, easily subsettable results and several visualization options allow quick and convenient exploration of monophyly issues, thus making MonoPhy a valuable tool for any researcher working with phylogenies.
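
A monophyly check of the kind described here can be sketched in a few lines of Python. This is not MonoPhy's code (MonoPhy is an R package): the nested-tuple tree encoding, the function names, and the convention of reading the genus from the tip label before the first underscore are assumptions for the example.

```python
def tips_of(node):
    """Return the set of tip labels below a node (a tip is a plain string)."""
    if isinstance(node, str):
        return {node}
    tips = set()
    for child in node:
        tips |= tips_of(child)
    return tips


def is_monophyletic(tree, group):
    """True if the smallest clade containing the group holds no other tips."""
    group = set(group)
    node = tree
    # Walk down while a single child still contains the whole group.
    while not isinstance(node, str):
        for child in node:
            if group <= tips_of(child):
                node = child
                break
        else:
            break  # the group is split across children: node is the MRCA
    return tips_of(node) == group


def genus_monophyly(tree):
    """Check every genus, taking the genus as the text before the first '_'."""
    genera = {}
    for tip in tips_of(tree):
        genera.setdefault(tip.split("_")[0], set()).add(tip)
    return {genus: is_monophyletic(tree, tips) for genus, tips in genera.items()}


# Toy tree in which Pan is deliberately split across two clades.
toy_tree = (
    (("Homo_sapiens", "Homo_erectus"), "Pan_troglodytes"),
    ("Mus_musculus", "Pan_paniscus"),
)
toy_result = genus_monophyly(toy_tree)
```

Here `toy_result` flags Homo and Mus as monophyletic and Pan as non-monophyletic, because the smallest clade containing both Pan tips is the whole tree.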


2019 ◽  
Vol 35 (19) ◽  
pp. 3870-3872 ◽  
Author(s):  
Nathan D Olson ◽  
Nidhi Shah ◽  
Jayaram Kancherla ◽  
Justin Wagner ◽  
Joseph N Paulson ◽  
...  

Abstract. Summary: We developed the metagenomeFeatures R Bioconductor package along with annotation packages for three 16S rRNA databases (Greengenes, RDP and SILVA) to facilitate working with 16S rRNA databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different databases, facilitating database comparison and exploration. The mgFeatures class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability and implementation: https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html. Supplementary information: Supplementary material is available at Bioinformatics online.


2021 ◽  
Author(s):  
Kaiyang Zheng ◽  
Yantao Liang ◽  
David Paez-Espino ◽  
Sijun Huang ◽  
Xiao Zou ◽  
...  

Abstract. Background: N4-like viruses, with specific genomic features and propagation signatures, comprise a unique viral clade within the Podoviridae family. N4-like viruses are commonly characterized by the N4-like major capsid protein (MCP) and a giant virion-encapsulated RNA polymerase (N4-like RNAP) with a size of approximately 3,500 aa, which is the largest viral protein so far described. To date, our understanding of N4-like viruses is largely derived from 80 viral isolates that infect bacteria. Thus, it is necessary to expand the known diversity of N4-like viruses using culture-independent methods. Methods: A hidden Markov model-based method was designed based on two characterized N4-specific marker genes, major capsid protein and N4-like virion-encapsulated RNA polymerase. Viral sub-clades were classified based on the monophyly presented in the phylogenetic tree and the results of pangenome analysis. Further analysis assessed different distribution patterns, genomic properties, hosts’ metabolism reprogramming potentialities, significance of viral tRNA and horizontal gene transfer landscape. Results: We identified 1,000 N4-like virus sequences from genomes and metagenomes representing diverse habitats from around the world. N4-like viruses have been classified into 27 sub-clades and detected in almost all habitats from pole to pole, including some novel habitats, such as oral mucosa and Antarctica. Virulence factors might be crucial for some human-associated N4-like viruses to reprogram the metabolism of host cells and mediate their pathogenic ability through horizontal gene transfer. From the pangenome analysis, the protein diversity was expanded over 7-fold and 17 conserved house-keeping genes were identified. Transcriptional compensation of tRNA indicates that producing progeny virions might be the main significance of viral tRNAs. From the horizontal gene transfer network, some N4-like viral sub-clades were observed that potentially infect some important human pathogens, such as Campylobacteria and Veillonella, which have not been considered as potential hosts of N4-like virus or even any virus. Conclusion: This study expands the knowledge of N4-like viruses via global metagenomic datasets, reveals the novel ecological and genomic signatures of these viruses and will provide the backbone for further N4-like virus studies.


2019 ◽  
Vol 36 (7) ◽  
pp. 2311-2313 ◽  
Author(s):  
Roman Hillje ◽  
Pier Giuseppe Pelicci ◽  
Lucilla Luzi

Abstract. Despite the growing availability of sophisticated bioinformatic methods for the analysis of single-cell RNA-seq data, few tools exist that allow biologists without extensive bioinformatic expertise to directly visualize and interact with their own data and results. Here, we present Cerebro (cell report browser), a Shiny- and Electron-based standalone desktop application for macOS and Windows which allows investigation and inspection of pre-processed single-cell transcriptomics data without requiring bioinformatic experience of the user. Through an interactive and intuitive graphical interface, users can (i) explore similarities and heterogeneity between samples and cell clusters in two-dimensional or three-dimensional projections such as t-SNE or UMAP, (ii) display the expression level of single genes or gene sets of interest, (iii) browse tables of most expressed genes and marker genes for each sample and cluster and (iv) display trajectories calculated with Monocle 2. We provide three examples prepared from publicly available datasets to show how Cerebro can be used and what its capabilities are. Through a focus on flexibility and direct access to data and results, we think Cerebro offers a collaborative framework for bioinformaticians and experimental biologists that facilitates effective interaction to narrow the gap between analysis and interpretation of the data. Availability and implementation: The Cerebro application, additional documentation, and example datasets are available at https://github.com/romanhaa/Cerebro. Similarly, the cerebroApp R package is available at https://github.com/romanhaa/cerebroApp. All components are released under the MIT License. Supplementary information: Supplementary data are available at Bioinformatics online.


2014 ◽  
Vol 80 (11) ◽  
pp. 3508-3517 ◽  
Author(s):  
J. L. Baugher ◽  
E. Durmaz ◽  
T. R. Klaenhammer

ABSTRACT. Lactobacillus gasseri is an endogenous species of the human gastrointestinal tract and vagina. With recent advances in microbial taxonomy, phylogenetics, and genomics, L. gasseri is recognized as an important commensal and is increasingly being used in probiotic formulations. L. gasseri strain ADH is lysogenic and harbors two inducible prophages. In this study, prophage ϕadh was found to spontaneously induce in broth cultures to populations of ∼10^7 PFU/ml by stationary phase. The ϕadh prophage-cured ADH derivative NCK102 was found to harbor a new, second inducible phage, vB_Lga_jlb1 (jlb1). Phage jlb1 was sequenced and found to be highly similar to the closely related phage LgaI, which resides as two tandem prophages in the neotype strain L. gasseri ATCC 33323. The common occurrence of multiple prophages in L. gasseri genomes, their propensity for spontaneous induction, and the high degree of homology among phages within multiple species of Lactobacillus suggest that temperate bacteriophages likely contribute to horizontal gene transfer (HGT) in commensal lactobacilli. In this study, the host ranges of phages ϕadh and jlb1 were determined against 16 L. gasseri strains. The transduction range and the rate of spontaneous transduction were investigated in coculture experiments to ascertain the degree to which prophages can promote HGT among a variety of commensal and probiotic lactobacilli. Both ϕadh and jlb1 particles were confirmed to mediate plasmid transfer. As many as ∼10^3 spontaneous transductants/ml were obtained. HGT by transducing phages of commensal lactobacilli may have a significant impact on the evolution of bacteria within the human microbiota.

