Comparative genomics identifies thousands of candidate structured RNAs in human microbiomes

Abstract Background: Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches that search for motif structures in genomic sequence data. The human microbiome contains thousands of species and strains of microbes. Yet, much of the metagenomic data from the human microbiome remains unmined for structured RNA motifs, primarily due to computational limitations. Results: We sought to apply a large-scale, comparative genomics approach to these organisms to identify candidate structured RNAs. With a carefully constructed, though computationally intensive, automated analysis, we identify 3161 conserved candidate structured RNAs in intergenic regions, as well as 2022 additional candidate structured RNAs that may overlap coding regions. We validate the RNA expression of 177 of these candidate structures by analyzing small fragment RNA-seq data from four human fecal samples. Conclusions: This approach identifies a wide variety of candidate structured RNAs, including tmRNAs, antitoxins, and likely ribosome protein leaders, from a wide variety of taxa. Overall, our pipeline enables conservative predictions of thousands of novel candidate structured RNAs from human microbiomes.


Searching more genomic sequence with less memory for fast and accurate metagenomic profiling

10.1101/036681 ◽

2016 ◽

Author(s):

Shea N Gardner ◽

Sasha K Ames ◽

Maya B Gokhale ◽

Tom R Slezak ◽

Jonathan Allen

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Low Cost ◽

False Negative ◽

Human Microbiome ◽

Human Microbiome Project ◽

Metagenomic Data ◽

Reference Database ◽

Metagenomic Sequence

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low cost commodity flash drive (NVRAM), a cost effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis for 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution of reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


A combined RNA-Seq and comparative genomics approach identifies 1,085 candidate structured RNAs expressed in human microbiomes

10.1101/2020.03.31.018887 ◽

2020 ◽

Cited By ~ 2

Author(s):

Brayon J. Fremin ◽

Ami S. Bhatt

Keyword(s):

Comparative Genomics ◽

Experimental Approach ◽

Genomic Sequence ◽

Sequence Data ◽

Human Microbiome ◽

Human Microbiome Project ◽

Computational Approach ◽

Rna Seq ◽

Rna Structures ◽

Experimental Approaches

Abstract Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches by searching for motif structures in genomic sequence data. However, only a subset of these candidate structured RNAs, those from culturable, well-studied microbes, have been shown to be transcribed. As the human microbiome contains thousands of species and strains of microbes, we sought to apply both informatic and experimental approaches to these organisms to identify novel transcribed structured RNAs. We combine an experimental approach, RNA-Seq, with an informatic approach, comparative genomics across the human microbiome project, to discover 1,085 candidate, conserved structured RNAs that are actively transcribed in human fecal microbiomes. These predictions include novel tracrRNAs that associate with Cas9 and RNA structures encoded in overlapping regions of the genome that are in opposing orientations. In summary, this combined experimental and computational approach enables the discovery of thousands of novel candidate structured RNAs.


Comparative genomics suggests a taxonomic revision of the Staphylococcus cohnii species complex

Genome Biology and Evolution ◽

10.1093/gbe/evab020 ◽

2021 ◽

Author(s):

Anna Lavecchia ◽

Matteo Chiara ◽

Caterina De Virgilio ◽

Caterina Manzari ◽

Carlo Pazzani ◽

...

Keyword(s):

Comparative Genomics ◽

Species Complex ◽

Large Scale ◽

Genomic Sequence ◽

Bacterial Species ◽

Taxonomic Revision ◽

Distinct Species ◽

The Novel ◽

Staphylococcus Cohnii ◽

Taxonomic Assignments

Abstract Staphylococcus cohnii (SC), a coagulase-negative bacterium, was first isolated in 1975 from human skin. Early phenotypic analyses led to the delineation of two subspecies (subsp.), Staphylococcus cohnii subsp. cohnii (SCC) and Staphylococcus cohnii subsp. urealyticus (SCU). SCC was considered to be specific to humans whereas SCU apparently demonstrated a wider host range, from lower primates to humans. The type strains ATCC 29974 and ATCC 49330 have been designated for SCC and SCU, respectively. Comparative analysis of 66 complete genome sequences—including a novel SC isolate—revealed unexpected patterns within the SC complex, both in terms of genomic sequence identity and gene content, highlighting the presence of 3 phylogenetically distinct groups. Based on our observations, and on the current guidelines for taxonomic classification for bacterial species, we propose a revision of the SC species complex. We suggest that SCC and SCU should be regarded as two distinct species: SC and SU (Staphylococcus urealyticus), and that two distinct subspecies, SCC and SCB (SC subsp. barensis, represented by the novel strain isolated in Bari) should be recognized within SC. Furthermore, since large scale comparative genomics studies recurrently suggest inconsistencies or conflicts in taxonomic assignments of bacterial species, we believe that the approach proposed here might be considered for more general application.


META-pipe cloud setup and execution

F1000Research ◽

10.12688/f1000research.13204.1 ◽

2017 ◽

Vol 6 ◽

pp. 2060

Author(s):

Aleksandr Agafonov ◽

Kimmo Mattila ◽

Cuong Duong Tuan ◽

Lars Tiede ◽

Inge Alexander Raknes ◽

...

Keyword(s):

Functional Annotation ◽

High Performance ◽

Sequence Data ◽

Metagenomic Data ◽

Taxonomic Profiling ◽

Geographically Distributed ◽

Computationally Intensive ◽

High Performance Computing Cluster ◽

And Storage ◽

Performance Computing

META-pipe is a complete service for the analysis of marine metagenomic data. It provides assembly of high-throughput sequence data, functional annotation of predicted genes, and taxonomic profiling. The functional annotation is computationally demanding and is therefore currently run on a high-performance computing cluster in Norway. However, additional compute resources are necessary to open the service to all ELIXIR users. We describe our approach for setting up and executing the functional analysis of META-pipe on additional academic and commercial clouds. Our goal is to provide a powerful analysis service that is easy to use and to maintain. Our design therefore uses a distributed architecture where we combine central servers with multiple distributed backends that execute the computationally intensive jobs. We believe our experiences developing and operating META-pipe provides a useful model for others that plan to provide a portal based data analysis service in ELIXIR and other organizations with geographically distributed compute and storage resources.


Large‐scale genomic sequence data resolve the deepest divergences in the legume phylogeny and support a near‐simultaneous evolutionary origin of all six subfamilies

New Phytologist ◽

10.1111/nph.16290 ◽

2019 ◽

Vol 225 (3) ◽

pp. 1355-1369 ◽

Cited By ~ 12

Author(s):

Erik J. M. Koenen ◽

Dario I. Ojeda ◽

Royce Steeves ◽

Jérémy Migliore ◽

Freek T. Bakker ◽

...

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Evolutionary Origin


The Origin and Early Evolution of the Legumes are a Complex Paleopolyploid Phylogenomic Tangle closely associated with the Cretaceous-Paleogene (K-Pg) Boundary

10.1101/577957 ◽

2019 ◽

Cited By ~ 3

Author(s):

Erik J.M. Koenen ◽

Dario I. Ojeda ◽

Royce Steeves ◽

Jérémy Migliore ◽

Freek T. Bakker ◽

...

Keyword(s):

Mass Extinction ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Incomplete Lineage Sorting ◽

Nuclear Gene ◽

Early Evolution ◽

Large Set ◽

Gene Trees ◽

The Family

Abstract The consequences of the Cretaceous-Paleogene (K-Pg) boundary (KPB) mass extinction for the evolution of plant diversity are poorly understood, even though evolutionary turnover of plant lineages at the KPB is central to understanding the assembly of the Cenozoic biota. One aspect that has received considerable attention is the apparent concentration of whole genome duplication (WGD) events around the KPB, which may have played a role in survival and subsequent diversification of plant lineages. In order to gain new insights into the origins of Cenozoic biodiversity, we examine the origin and early evolution of the legume family, one of the most important angiosperm clades that rose to prominence after the KPB and for which multiple WGD events are found to have occurred early in its evolution. The legume family (Leguminosae or Fabaceae), with c. 20,000 species, is the third largest family of Angiospermae, and is globally widespread and second only to the grasses (Poaceae) in economic importance. Accordingly, it has been intensively studied in botanical, systematic and agronomic research, but a robust phylogenetic framework and timescale for legume evolution based on large-scale genomic sequence data is lacking, and key questions about the origin and early evolution of the family remain unresolved. We extend previous phylogenetic knowledge to gain insights into the early evolution of the family, analysing an alignment of 72 protein-coding chloroplast genes and a large set of nuclear genomic sequence data, sampling thousands of genes. We use a concatenation approach with heterogeneous models of sequence evolution to minimize inference artefacts, and evaluate support and conflict among individual nuclear gene trees with internode certainty calculations, a multi-species coalescent method, and phylogenetic supernetwork reconstruction. Using a set of 20 fossil calibrations we estimate a revised timeline of legume evolution based on a selection of genes that are both informative and evolving in an approximately clock-like fashion. We find that the root of the family is particularly difficult to resolve, with strong conflict among gene trees suggesting incomplete lineage sorting and/or reticulation. Mapping of duplications in gene family trees suggests that a WGD event occurred along the stem of the family and is shared by all legumes, with additional nested WGDs subtending subfamilies Papilionoideae and Detarioideae. We propose that the difficulty of resolving the root of the family is caused by a combination of ancient polyploidy and an alternation of long and very short internodes, shaped respectively by extinction and rapid divergence. Our results show that the crown age of the legumes dates back to the Maastrichtian or Paleocene and suggest that it is most likely close to the KPB. We conclude that the origin and early evolution of the legumes followed a complex history, in which multiple nested polyploidy events coupled with rapid diversification are associated with the mass extinction event at the KPB, ultimately underpinning the evolutionary success of the Leguminosae in the Cenozoic.


Development of Self-Compressing BLSOM for Comprehensive Analysis of Big Sequence Data

BioMed Research International ◽

10.1155/2015/506052 ◽

2015 ◽

Vol 2015 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Akihito Kikuchi ◽

Toshimichi Ikemura ◽

Takashi Abe

Keyword(s):

High Performance ◽

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Bacterial Genome ◽

Computation Time ◽

Comprehensive Analysis ◽

Self Organizing Map ◽

Genome Sequences ◽

Oligonucleotide Composition

With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of available big sequence data. We previously developed a Batch-Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype based solely on oligonucleotide composition, and applied it to genome and metagenomic studies. BLSOM is suitable for high-performance parallel computing and can analyze big data simultaneously, but a large-scale BLSOM needs a large computational resource. We have developed Self-Compressing BLSOM (SC-BLSOM) to reduce computation time, which allows us to carry out comprehensive analysis of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to hierarchically construct BLSOMs according to data class, such as phylotype. The first-layer BLSOM was constructed with each of the divided input data pieces that represents the data subclass, such as phylotype division, resulting in compression of the number of data pieces. The second BLSOM was constructed with a total of weight vectors obtained in the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and cluster the sequences according to phylotype with high accuracy, showing the method’s suitability for efficient knowledge discovery from big sequence data.
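The abstract above says that BLSOM clusters genomic fragments using only their oligonucleotide composition. As a rough illustration of what such an input vector can look like, the following minimal sketch turns a DNA fragment into a normalized k-mer frequency vector. The function name `kmer_composition`, the default `k=4`, and the handling of ambiguous bases are assumptions made for illustration; they are not taken from the BLSOM or SC-BLSOM software.

```python
from itertools import product


def kmer_composition(sequence, k=4):
    """Return the relative frequency of every DNA k-mer in `sequence`.

    Hypothetical helper: the vector is ordered lexicographically over the
    alphabet ACGT (AAAA, AAAC, ..., TTTT for k=4), so it has 4**k entries.
    """
    # One counter per possible k-mer; product() yields them in sorted order.
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Windows containing N or other ambiguity codes are skipped
        # (an assumption of this sketch, not a documented BLSOM rule).
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        return [0.0] * len(counts)
    return [counts[kmer] / total for kmer in counts]


if __name__ == "__main__":
    vector = kmer_composition("ACGTACGTNACGT", k=2)
    print(len(vector), round(sum(vector), 6))
```

Vectors of this kind, one per fragment, are the sort of input a self-organizing map can be trained on; in the hierarchical scheme described above, the second-layer map would be trained on first-layer weight vectors rather than on raw fragments.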


The single-species metagenome: subtyping Staphylococcus aureus core genome sequences from shotgun metagenomic data

10.1101/030692 ◽

2015 ◽

Cited By ~ 1

Author(s):

Sandeep J. Joseph ◽

Ben Li ◽

Robert A. Petit ◽

Zhaohui S. Qin ◽

Lyndsey A. Darrow ◽

...

Keyword(s):

Staphylococcus Aureus ◽

Large Scale ◽

Pathogen Detection ◽

Geographic Distance ◽

Human Microbiome ◽

Genomic Analysis ◽

Single Species ◽

Metagenomic Data ◽

Comparative Genomic ◽

Genome Coverage

Abstract Metagenome shotgun sequence projects offer the potential for large scale biogeographic analysis of microbial species. In this project we developed a method for detecting 33 common subtypes of the pathogenic bacterium Staphylococcus aureus. We used a binomial mixture model implemented in the binstrain software and the coverage counts at > 100,000 known S. aureus SNP (single nucleotide polymorphism) sites derived from prior comparative genomic analysis to estimate the proportion of each subtype in metagenome samples. Using this pipeline we were able to obtain > 87% sensitivity and > 94% specificity when testing on low genome coverage samples of diverse S. aureus strains (0.025X). We found that 321 and 149 metagenome samples from the Human Microbiome Project and metaSUB analysis of the New York City subway, respectively, contained S. aureus at genome coverage > 0.025. In both projects, CC8 and CC30 were the most common S. aureus subtypes encountered. We found evidence that the subtype composition at different body sites of the same individual was more similar than expected from random sampling, and more limited evidence that certain body sites were enriched for particular subtypes. One surprising finding was the apparent high frequency of CC398, a lineage associated with livestock, in samples from the tongue dorsum. Epidemiologic analysis of the HMP subject population suggested that high BMI (body mass index) and health insurance are risk factors for S. aureus, but there was limited power to find factors linked to carriage of even the most common subtype. In the NYC subway data, we found a small signal of geographic distance affecting subtype clustering, but other unknown factors influence taxonomic distribution of the species around the city. We argue that pathogen detection in metagenome samples requires the use of subtypes based on whole species population genomic analysis rather than using ad hoc collections of reference strains.
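The abstract above estimates subtype proportions from read counts at known SNP sites with a binomial mixture model. The sketch below shows one way such an estimate could be computed with a plain expectation-maximization loop. It is not the binstrain implementation: the function name `estimate_subtype_proportions`, the 0/1 genotype matrix, and the fixed per-read `error` rate are all assumptions made for illustration.

```python
def estimate_subtype_proportions(genotypes, alt_counts, depths,
                                 n_iter=200, error=0.01):
    """Estimate mixture proportions of subtypes from SNP read counts.

    genotypes[j][s] is 1 if hypothetical subtype j carries the alternate
    allele at site s, else 0. alt_counts[s] of depths[s] reads at site s
    show the alternate allele. `error` is an assumed sequencing error rate.
    """
    k = len(genotypes)
    n_sites = len(depths)
    # Probability that a read drawn from subtype j shows the alternate allele.
    g = [[(1 - error) if genotypes[j][s] else error for s in range(n_sites)]
         for j in range(k)]
    p = [1.0 / k] * k              # start from equal proportions
    total_reads = sum(depths)
    for _ in range(n_iter):
        expected = [0.0] * k       # expected number of reads per subtype
        for s in range(n_sites):
            # Mixture probability of seeing the alternate allele at site s.
            f = sum(p[j] * g[j][s] for j in range(k))
            alt = alt_counts[s]
            ref = depths[s] - alt
            for j in range(k):
                # E-step: share of alt and ref reads attributed to subtype j.
                expected[j] += alt * p[j] * g[j][s] / f
                expected[j] += ref * p[j] * (1 - g[j][s]) / (1 - f)
        # M-step: proportions are the normalized expected read counts.
        p = [e / total_reads for e in expected]
    return p


if __name__ == "__main__":
    # Two hypothetical subtypes and four SNP sites; sites 0 and 1 tell them apart.
    genotypes = [[1, 0, 1, 0],
                 [0, 1, 1, 0]]
    print(estimate_subtype_proportions(genotypes, [745, 255, 990, 10], [1000] * 4))
```

Sites where all subtypes share the same allele contribute no information about the proportions, which is one reason a method like this needs many discriminating SNP sites when genome coverage is low.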


Fast functional annotation of metagenomic shotgun data by DNA alignment to a microbial gene catalog

10.1101/120402 ◽

2017 ◽

Author(s):

Stuart M. Brown ◽

Yuhan Hao ◽

Hao Chen ◽

Bobby P. Laungani ◽

Thahmina A. Ali ◽

...

Keyword(s):

Functional Annotation ◽

Sequence Data ◽

Human Microbiome ◽

Metagenomic Data ◽

Metagenomic Sequencing ◽

Alternative Analysis ◽

Metagenomic Sequence ◽

Shotgun Metagenomics ◽

Gene Functions ◽

Dna Alignment

Abstract Background: Metagenomic shotgun sequencing is becoming increasingly popular to study microbes associated with the human body and in environmental samples. A key goal of shotgun metagenomic sequencing is to identify gene functions and metabolic pathways that differ between samples or conditions. However, current methods to identify function in the large number of reads in a high-throughput sequence data file rely on the computationally intensive and low stringency approach of mapping each read to a generic database of proteins or reference microbial genomes. Results: We have developed an alternative analysis approach for shotgun metagenomic sequence data utilizing Bowtie2 DNA-DNA alignment of the reads to a database of well annotated genes compiled from human microbiome data. This method is rapid, and provides high stringency matches (>90% DNA sequence identity) of shotgun metagenomics reads to genes with annotated functions. We demonstrate the use of this method with synthetic data, Human Microbiome Project shotgun metagenomic data sets, and data from a study of liver disease. Differentially abundant KEGG gene functions can be detected in these experiments. Conclusions: Functional annotation of metagenomic shotgun sequence reads can be accomplished by rapid DNA-DNA matching to a custom database of microbial sequences using the Bowtie2 sequence alignment tool. This method can be used for a variety of microbiome studies and allows functional analysis which is otherwise computationally demanding. This rapid annotation method is freely available as a Galaxy workflow within a Docker image.
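The abstract above keeps only high stringency matches (>90% DNA sequence identity) between reads and annotated genes. The sketch below illustrates one way to apply such a threshold to SAM-format alignment records, using the CIGAR string and the `NM:i` edit-distance tag that Bowtie2 writes for aligned reads. The identity definition used here (alignment columns minus edit distance, divided by alignment columns) is one of several in common use and is an assumption, as are the function names; the authors' workflow may compute or enforce identity differently.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_identity(sam_line):
    """Return the fractional identity of one SAM record, or None if unusable."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, cigar = int(fields[1]), fields[5]
    if flag & 4 or cigar == "*":       # flag bit 0x4 marks an unmapped read
        return None
    edit_distance = None
    for tag in fields[11:]:            # optional tags follow the 11 fixed fields
        if tag.startswith("NM:i:"):
            edit_distance = int(tag[5:])
    if edit_distance is None:
        return None
    # Alignment columns: matches/mismatches (M, =, X) plus indels (I, D).
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MID=X")
    if columns == 0:
        return None
    return (columns - edit_distance) / columns


def passes_identity(sam_line, min_identity=0.90):
    """True if the record aligns with identity above the assumed threshold."""
    identity = alignment_identity(sam_line)
    return identity is not None and identity > min_identity


if __name__ == "__main__":
    record = "\t".join(["read1", "0", "geneA", "1", "42", "100M", "*", "0", "0",
                        "A" * 100, "I" * 100, "NM:i:5"])
    print(alignment_identity(record), passes_identity(record))
```

Counting the reads that pass this filter per gene would give the kind of table on which differential abundance of gene functions could then be tested.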


Genomic variants among threatened Acropora corals

10.1101/349910 ◽

2018 ◽

Cited By ~ 4

Author(s):

S. A. Kitchen ◽

A. Ratan ◽

O. C. Bedoya-Reina ◽

R. Burhans ◽

N. D. Fogarty ◽

...

Keyword(s):

Large Scale ◽

Genomic Sequence ◽

Sequence Data ◽

Cellular Stress Response ◽

Model Organisms ◽

Additional Species ◽

Genomic Resources ◽

Coral Genus ◽

The Galaxy ◽

Galaxy Server

Abstract Genomic sequence data for non-model organisms are increasingly available, requiring the development of efficient and reproducible workflows. Here, we develop the first genomic resources and reproducible workflows for two threatened members of the reef-building coral genus Acropora. We generated genomic sequence data from multiple samples of the Caribbean A. cervicornis (staghorn coral) and A. palmata (elkhorn coral), and predicted millions of nucleotide variants among these two species and the Pacific A. digitifera. A subset of predicted nucleotide variants were verified using restriction length polymorphism assays and proved useful in distinguishing the two Caribbean Acroporids and the hybrid they form (“A. prolifera”). Nucleotide variants are freely available from the Galaxy server (usegalaxy.org), and can be analyzed there with computational tools and stored workflows that require only an internet browser. We describe these data and some of the analysis tools, concentrating on fixed differences between A. cervicornis and A. palmata. In particular, we found that fixed amino acid differences between these two species were enriched in proteins associated with development, cellular stress response and the host’s interactions with associated microbes, for instance in the Wnt pathway, ABC transporters and superoxide dismutase. Identified candidate genes may underlie functional differences in the way these threatened species respond to changing environments. Users can expand the presented analyses easily by adding genomic data from additional species as they become available. Article Summary: We provide the first comprehensive genomic resources for two threatened Caribbean reef-building corals in the genus Acropora. We identified genetic differences in key pathways and genes known to be important in the animals’ response to environmental disturbances and larval development. We further provide a list of candidate loci for large scale genotyping of these species to gather intra- and interspecies differences between A. cervicornis and A. palmata across their geographic range. All analyses and workflows are made available and can be used as a resource to analyze not only these corals but also other non-model organisms.
