A combined RNA-Seq and comparative genomics approach identifies 1,085 candidate structured RNAs expressed in human microbiomes

Author(s):  
Brayon J. Fremin ◽  
Ami S. Bhatt

Abstract. Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches by searching for motif structures in genomic sequence data. However, only a subset of these candidate structured RNAs, those from culturable, well-studied microbes, have been shown to be transcribed. As the human microbiome contains thousands of species and strains of microbes, we sought to apply both informatic and experimental approaches to these organisms to identify novel transcribed structured RNAs. We combine an experimental approach, RNA-Seq, with an informatic approach, comparative genomics across the Human Microbiome Project, to discover 1,085 candidate, conserved structured RNAs that are actively transcribed in human fecal microbiomes. These predictions include novel tracrRNAs that associate with Cas9 and RNA structures encoded in overlapping regions of the genome that are in opposing orientations. In summary, this combined experimental and computational approach enables the discovery of thousands of novel candidate structured RNAs.

2016 ◽  
Author(s):  
Shea N Gardner ◽  
Sasha K Ames ◽  
Maya B Gokhale ◽  
Tom R Slezak ◽  
Jonathan Allen

Software for rapid, accurate, and comprehensive microbial profiling of metagenomic sequence data on a desktop will play an important role in large-scale clinical use of metagenomic data. Here we describe LMAT-ML (Livermore Metagenomics Analysis Toolkit-Marker Library), which can be run with 24 GB of DRAM, an amount available on many clusters, or with 16 GB of DRAM plus a 24 GB low-cost commodity flash drive (NVRAM), a cost-effective alternative for desktop or laptop users. We compared results from LMAT with five other rapid, low-memory tools for metagenome analysis on 131 Human Microbiome Project samples, and assessed discordant calls with BLAST. All the tools except LMAT-ML reported overly specific or incorrect species and strain resolution for reads that were in fact much more widely conserved across species, genera, and even families. Several of the tools misclassified reads from synthetic or vector sequence as microbial, or human reads as viral. We attribute the high numbers of false positive and false negative calls to a limited reference database with inadequate representation of known diversity. Our comparisons with real-world samples show that LMAT-ML is the only tool tested that classifies the majority of reads, and does so with high accuracy.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Brayon J. Fremin ◽  
Ami S. Bhatt

Abstract. Background: Structured RNAs play varied bioregulatory roles within microbes. To date, hundreds of candidate structured RNAs have been predicted using informatic approaches that search for motif structures in genomic sequence data. The human microbiome contains thousands of species and strains of microbes. Yet, much of the metagenomic data from the human microbiome remains unmined for structured RNA motifs primarily due to computational limitations. Results: We sought to apply a large-scale, comparative genomics approach to these organisms to identify candidate structured RNAs. With a carefully constructed, though computationally intensive automated analysis, we identify 3161 conserved candidate structured RNAs in intergenic regions, as well as 2022 additional candidate structured RNAs that may overlap coding regions. We validate the RNA expression of 177 of these candidate structures by analyzing small fragment RNA-seq data from four human fecal samples. Conclusions: This approach identifies a wide variety of candidate structured RNAs, including tmRNAs, antitoxins, and likely ribosome protein leaders, from a wide variety of taxa. Overall, our pipeline enables conservative predictions of thousands of novel candidate structured RNAs from human microbiomes.


2018 ◽  
Author(s):  
Brian Tsui ◽  
Michelle Dow ◽  
Dylan Skola ◽  
Hannah Carter

The Sequence Read Archive (SRA) contains over one million publicly available sequencing runs from various studies using a variety of sequencing library strategies. These data inherently contain information about underlying genomic sequence variants, which we exploit to extract allelic read counts on an unprecedented scale. We reprocessed over 250,000 human sequencing runs (>1000 TB of raw sequence data) into a single unified dataset of allelic read counts for nearly 300,000 variants of biomedical relevance curated by NCBI dbSNP. Germline variants were detected in a median of 912 sequencing runs, and somatic variants were detected in a median of 4,876 sequencing runs, suggesting that this dataset facilitates identification of sequencing runs that harbor variants of interest. Allelic read counts obtained using a targeted alignment were very similar to read counts obtained from whole genome alignment. Analyzing allelic read count data for matched DNA and RNA samples from tumors, we find that RNA-seq can also recover variants identified by WXS, suggesting that reprocessed allelic read counts can support variant detection across different library strategies in SRA. This study provides a rich database of known human variants across SRA samples that can support future meta-analyses of human sequence variation.
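
As a rough illustration of what an allelic read count is, the sketch below tallies the bases observed at a single variant site and reports how many reads support each allele. It is a minimal example under stated assumptions, not the pipeline used in the study: the function name `allelic_read_counts`, the toy pileup, and the output keys are all hypothetical.

```python
from collections import Counter

def allelic_read_counts(bases, ref, alt):
    """Count reads supporting the reference and alternate allele at one site.

    `bases` is an iterable of single-character base calls covering the site
    (a hypothetical, already-extracted pileup column).
    """
    counts = Counter(b.upper() for b in bases)
    ref_n = counts[ref.upper()]
    alt_n = counts[alt.upper()]
    informative = ref_n + alt_n
    return {
        "ref": ref_n,
        "alt": alt_n,
        # Calls matching neither allele (e.g. N or a third base).
        "other": sum(counts.values()) - informative,
        # Undefined when no read supports either allele.
        "alt_fraction": alt_n / informative if informative else None,
    }

# Toy pileup column: ten reads covering an A>G variant site.
pileup = list("AAAGAGGAAN")
result = allelic_read_counts(pileup, ref="A", alt="G")
print(result)
```

Returning the raw counts alongside the fraction keeps low-coverage sites distinguishable from well-covered ones, which matters when the same variant is compared across many runs.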


PeerJ ◽  
2016 ◽  
Vol 4 ◽  
pp. e2571 ◽  
Author(s):  
Sandeep J. Joseph ◽  
Ben Li ◽  
Robert A. Petit III ◽  
Zhaohui S. Qin ◽  
Lyndsey Darrow ◽  
...  

In this study we developed a genome-based method for detecting Staphylococcus aureus subtypes from metagenome shotgun sequence data. We used a binomial mixture model and the coverage counts at >100,000 known S. aureus SNP (single nucleotide polymorphism) sites derived from prior comparative genomic analysis to estimate the proportion of 40 subtypes in metagenome samples. We were able to obtain >87% sensitivity and >94% specificity at 0.025X coverage for S. aureus. We found that 321 and 149 metagenome samples from the Human Microbiome Project and metaSUB analysis of the New York City subway, respectively, contained S. aureus at genome coverage >0.025. In both projects, CC8 and CC30 were the most common S. aureus clonal complexes encountered. We found evidence that the subtype composition at different body sites of the same individual were more similar than random sampling and more limited evidence that certain body sites were enriched for particular subtypes. One surprising finding was the apparent high frequency of CC398, a lineage often associated with livestock, in samples from the tongue dorsum. Epidemiologic analysis of the HMP subject population suggested that high BMI (body mass index) and health insurance are possibly associated with S. aureus carriage but there was limited power to identify factors linked to carriage of even the most common subtype. In the NYC subway data, we found a small signal of geographic distance affecting subtype clustering but other unknown factors influence taxonomic distribution of the species around the city.
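
The idea of estimating subtype proportions from read counts at known SNP sites can be sketched with a small expectation-maximization loop. This is a simplified illustration and not the authors' implementation: it assumes each read comes from subtype k with probability pi_k, that each subtype's allele at every site is known (1 for the alternate allele, 0 for the reference), and that base calls are wrong at a fixed rate. The function name, the `error` parameter, and the toy data are hypothetical.

```python
def estimate_subtype_proportions(genotypes, alt_counts, ref_counts,
                                 error=0.01, iterations=200):
    """Estimate mixture proportions of subtypes from per-site allele counts.

    genotypes[k][s] is 1 if subtype k carries the alternate allele at site s.
    alt_counts[s] and ref_counts[s] are observed read counts at site s.
    Under this model the alternate count at each site is binomial with
    success probability sum_k pi_k * P(alt | subtype k).
    """
    k = len(genotypes)
    pi = [1.0 / k] * k  # start from equal proportions
    total = sum(alt_counts) + sum(ref_counts)
    for _ in range(iterations):
        expected = [0.0] * k
        for s, (alt, ref) in enumerate(zip(alt_counts, ref_counts)):
            # Joint weight of "read from subtype j" and "read shows alt/ref".
            w_alt = [pi[j] * (1 - error if genotypes[j][s] else error)
                     for j in range(k)]
            w_ref = [pi[j] * (error if genotypes[j][s] else 1 - error)
                     for j in range(k)]
            z_alt, z_ref = sum(w_alt), sum(w_ref)
            for j in range(k):
                # E-step: share each read count among subtypes.
                expected[j] += alt * w_alt[j] / z_alt
                expected[j] += ref * w_ref[j] / z_ref
        # M-step: proportions are the normalized expected read counts.
        pi = [e / total for e in expected]
    return pi

# Toy example: two subtypes, four SNP sites, 100 reads per site,
# simulated from a 75% / 25% mixture.
genotypes = [[1, 0, 1, 0],
             [0, 1, 1, 0]]
alt_counts = [75, 25, 100, 0]
ref_counts = [25, 75, 0, 100]
proportions = estimate_subtype_proportions(genotypes, alt_counts, ref_counts)
print([round(p, 3) for p in proportions])
```

Only sites where the subtypes differ carry information about the proportions; the last two sites in the toy data are shared by both subtypes and leave the estimate unchanged.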


2008 ◽  
Vol 413 (3) ◽  
pp. 545-552 ◽  
Author(s):  
Karine Rousseau ◽  
Sara Kirkham ◽  
Lindsay Johnson ◽  
Brian Fitzpatrick ◽  
Marj Howard ◽  
...  

MUC5B is the predominant polymeric mucin in human saliva [Thornton, Khan, Mehrotra, Howard, Veerman, Packer and Sheehan (1999) Glycobiology 9, 293–302], where it contributes to oral cavity hydration and protection. More recently, the gene for another putative polymeric mucin, MUC19, has been shown to be expressed in human salivary glands [Chen, Zhao, Kalaslavadi, Hamati, Nehrke, Le, Ann and Wu (2004) Am. J. Respir. Cell Mol. Biol. 30, 155–165]. However, to date, the MUC19 mucin has not been isolated from human saliva. Our aim was therefore to purify and characterize the MUC19 glycoprotein from human saliva. Saliva was solubilized in 4 M guanidinium chloride and the high-density mucins were purified by density-gradient centrifugation. The presence of MUC19 was investigated using tandem MS of tryptic peptides derived from this mucin preparation. Using this approach, we found multiple MUC5B-derived tryptic peptides, but were unable to detect any putative MUC19 peptides. These results suggest that MUC19 is not a major component in human saliva. In contrast, using the same experimental approach, we identified Muc19 and Muc5b glycoproteins in horse saliva. Moreover, we also identified Muc19 from pig, cow and rat saliva; the saliva of cow and rat also contained Muc5b; however, due to the lack of pig Muc5b genomic sequence data, we were unable to identify Muc5b in pig saliva. Our results suggest that unlike human saliva, which contains MUC5B, cow, horse and rat saliva are a heterogeneous mixture of Muc5b and Muc19. The functional consequence of these species differences remains to be elucidated.


2020 ◽  
Vol 15 ◽  
Author(s):  
Affan Alim ◽  
Abdul Rafay ◽  
Imran Naseem

Background: Proteins contribute significantly to every task of cellular life. Their functions encompass the building and repair of tissues in humans and other organisms; hence they are the building blocks of bones, muscles, cartilage, skin, and blood. Antifreeze proteins (AFPs) are likewise of prime significance for organisms that live in very cold areas. With the help of these proteins, cold-water organisms can survive at sub-zero temperatures and resist the water crystallization process that may rupture internal cells and tissues. AFPs have attracted attention and interest in the food industry and in cryopreservation. Objective: With the increasing availability of genomic sequence data for proteins, an automated and sophisticated tool for AFP recognition and identification is badly needed. The sequences and structures of AFPs are highly distinct; therefore, most of the proposed methods fail to show promising results across different structures. A consolidated method is proposed to deliver competitive performance on highly distinct AFP structures. Methods: In this study, we propose to use machine learning algorithms, namely Principal Component Analysis (PCA) followed by Gradient Boosting (GB), for antifreeze protein identification. To analyze and validate the performance of the proposed model, various combinations of two-segment amino acid and dipeptide compositions are used. PCA is used for dimensionality reduction while retaining the high-variance components of the data, and is followed by an ensemble method, gradient boosting, for modelling and classification. Results: The proposed method obtained superior performance on the PDB, Pfam and UniProt datasets compared with the RAFP-Pred method. In experiment-3, using only 150 PCA components, a high accuracy of 89.63 was achieved, which is superior to the 87.41 reported for the RAFP-Pred method using 300 significant features. Experiment-2 was conducted using two different datasets, with non-AFPs from the PISCES server and AFPs from the Protein Data Bank. In experiment-2, our proposed method attained a high sensitivity of 79.16, which is 12.50 better than the state-of-the-art RAFP-Pred method. Conclusion: AFPs have a common function but distinct structures. Therefore, a single model developed for different sequences often fails for AFPs. Our proposed model showed robust results across diverse training and testing datasets, and it outperformed the previous AFP prediction method RAFP-Pred. Our model consists of PCA for dimension reduction followed by gradient boosting for classification. Owing to its simplicity, scalability and high performance, our model can be easily extended to analyze proteomic and genomic datasets.
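
The amino acid and dipeptide composition features mentioned in the Methods can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "two segments" means splitting the sequence into two halves and computing the composition of each, and the function names and the toy sequence are hypothetical. The resulting vectors are the kind of input that PCA and a gradient boosting classifier would then consume.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def composition(segment):
    """Return 20 amino acid fractions followed by 400 dipeptide fractions."""
    if len(segment) < 2:
        raise ValueError("segment must contain at least two residues")
    aa_counts = Counter(segment)
    # Overlapping adjacent pairs: a segment of length n has n - 1 dipeptides.
    dp_counts = Counter(segment[i:i + 2] for i in range(len(segment) - 1))
    aa = [aa_counts[a] / len(segment) for a in AMINO_ACIDS]
    dp = [dp_counts[d] / (len(segment) - 1) for d in DIPEPTIDES]
    return aa + dp

def two_segment_features(sequence):
    """Concatenate composition features of the two halves of a sequence.

    Splitting at the midpoint is an assumption made for this sketch.
    """
    seq = sequence.upper()
    mid = len(seq) // 2
    return composition(seq[:mid]) + composition(seq[mid:])

# Toy 12-residue sequence; real AFP sequences are much longer.
features = two_segment_features("MKAAVLTCGAAW")
print(len(features))
```

Each half contributes 420 values, so every sequence maps to a fixed-length vector of 840 features regardless of its length, which is what allows a dimensionality reduction step to be applied across the whole dataset.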


Pathogens ◽  
2021 ◽  
Vol 10 (2) ◽  
pp. 86
Author(s):  
Erin M. Garcia ◽  
Myrna G. Serrano ◽  
Laahirie Edupuganti ◽  
David J. Edwards ◽  
Gregory A. Buck ◽  
...  

Gardnerella vaginalis has recently been split into 13 distinct species. In this study, we tested the hypotheses that species-specific variations in the vaginolysin (VLY) amino acid sequence could influence the interaction between the toxin and vaginal epithelial cells and that VLY variation may be one factor that distinguishes less virulent or commensal strains from more virulent strains. This was assessed by bioinformatic analyses of publicly available Gardnerella spp. sequences and quantification of cytotoxicity and cytokine production from purified, recombinantly produced versions of VLY. After identifying conserved differences that could distinguish distinct VLY types, we analyzed metagenomic data from a cohort of female subjects from the Vaginal Human Microbiome Project to investigate whether these different VLY types exhibited any significant associations with symptoms or Gardnerella spp.-relative abundance in vaginal swab samples. While Type 1 VLY was most prevalent among the subjects and may be associated with increased reports of symptoms, subjects with Type 2 VLY dominant profiles exhibited increased relative Gardnerella spp. abundance. Our findings suggest that amino acid differences alter the interaction of VLY with vaginal keratinocytes, which may potentiate differences in bacterial vaginosis (BV) immunopathology in vivo.

